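The Core Engine of an AI Super Agent: Architecture, Workflow, and Code for Building a RAG Knowledge Base

Introduction

In an AI super-agent project, giving the agent "proprietary knowledge" and "up-to-date information" is the key to getting past the knowledge limits of the underlying large model. Retrieval-Augmented Generation (RAG) is the core architecture for this. It combines an external knowledge base with the LLM's reasoning ability, so the agent's answers keep the model's general capability while being backed by specialized, accurate, and current information. This article walks through how the RAG knowledge base in our AI super-agent project is built, from raw documents to retrieval, covering architecture design, core algorithms, and runnable code examples.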

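1. The Role and Value of RAG in the Super Agent

In our super-agent architecture, the RAG module acts as an "external brain" or "long-term memory system". When the agent has to handle a question outside its training data (internal company documents, the latest product information, private code repositories, and so on), it automatically triggers the RAG flow.

Core value

  1. Reducing hallucination: the model is given factual grounding, which sharply cuts down on made-up answers.
  2. Up-to-date knowledge: no retraining is needed; updating the knowledge base is enough for the agent to pick up new information.
  3. Stronger domain expertise: domain knowledge is injected into the agent so it can act as an expert in a specific field.
  4. Traceable answers: every answer can be traced back to its source documents, which improves trust and explainability.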

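2. Overall Architecture of the RAG Knowledge Base

An industrial-grade RAG system is far more than "vector search + LLM". The diagram below shows the RAG build flow implemented in our project:

flowchart TD
    A["Raw knowledge sources<br>PDF/Word/TXT/HTML"] --> B(Document loader and parser)
    
    B --> C["Raw text<br>(may contain noise)"]
    
    C --> D{Text preprocessing and cleaning pipeline}
    D --> E[Text cleaning]
    D --> F[Format normalization]
    D --> G[Irrelevant content removal]
    
    G --> H(Text splitter)
    
    H --> I["Text chunks<br>Chunk n"]
    I --> J["Chunk 1"]
    I --> K["Chunk 2"]
    I --> L[...]
    
    J & K & L --> M(Embedding model)
    
    M --> N["Vector 1<br>dim: 1536"]
    N --> O[(Vector database<br>Vector DB)]
    M --> P["Vector 2<br>dim: 1536"]
    P --> O
    M --> Q[...]
    Q --> O
    
    R[Metadata extraction] --> O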
    
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style O fill:#ccf,stroke:#333,stroke-width:2px

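Build stage (the flow drawn above): turning raw data into knowledge fragments that are easy to retrieve.
Retrieval stage (not drawn above; covered in Section 4): finding the most relevant information efficiently at query time.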

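3. Knowledge Base Build Stage: From Raw Documents to Vector Storage

3.1 Document loading and parsing

Different file formats need different parsing strategies. In the project we handle this variety with Unstructured plus format-specific loading libraries; the example below uses only the format-specific ones (PyPDF2, python-docx, html2text).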

# document_loader.py
import os
from typing import List, Dict, Any
import PyPDF2
from docx import Document
import html2text
import logging

logger = logging.getLogger(__name__)

class DocumentLoader:
    """Unified document loader supporting multiple formats"""
    
    @staticmethod
    def load_pdf(file_path: str) -> str:
        """Load a PDF file and extract its text"""
        try:
            with open(file_path, 'rb') as file:
                pdf_reader = PyPDF2.PdfReader(file)
                text = ""
                for page in pdf_reader.pages:
                    # extract_text() may yield None or "" for image-only pages
                    text += (page.extract_text() or "") + "\n"
                return text
        except Exception as e:
            logger.error(f"Error loading PDF {file_path}: {e}")
            return ""
    
    @staticmethod
    def load_docx(file_path: str) -> str:
        """Load a Word (.docx) document"""
        try:
            doc = Document(file_path)
            text = ""
            for paragraph in doc.paragraphs:
                text += paragraph.text + "\n"
            return text
        except Exception as e:
            logger.error(f"Error loading DOCX {file_path}: {e}")
            return ""
    
    @staticmethod
    def load_txt(file_path: str) -> str:
        """Load a plain-text file"""
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                return file.read()
        except Exception as e:
            logger.error(f"Error loading TXT {file_path}: {e}")
            return ""
    
    @staticmethod
    def load_html(file_path: str) -> str:
        """Load an HTML file and convert it to plain text"""
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                html_content = file.read()
                # html2text converts the HTML to Markdown; further cleanup happens later
                h = html2text.HTML2Text()
                h.ignore_links = False
                return h.handle(html_content)
        except Exception as e:
            logger.error(f"Error loading HTML {file_path}: {e}")
            return ""
    
    @staticmethod
    def load_directory(directory_path: str) -> List[Dict[str, Any]]:
        """Load every supported document under a directory"""
        supported_extensions = {
            '.pdf': DocumentLoader.load_pdf,
            '.docx': DocumentLoader.load_docx,
            '.txt': DocumentLoader.load_txt,
            '.html': DocumentLoader.load_html,
            '.htm': DocumentLoader.load_html,
        }
        
        documents = []
        for root, _, files in os.walk(directory_path):
            for file in files:
                ext = os.path.splitext(file)[1].lower()
                if ext in supported_extensions:
                    file_path = os.path.join(root, file)
                    logger.info(f"Loading document: {file_path}")
                    
                    content = supported_extensions[ext](file_path)
                    if content.strip():
                        documents.append({
                            'file_path': file_path,
                            'content': content,
                            'file_name': file,
                            'file_type': ext,
                            'file_size': os.path.getsize(file_path)
                        })
        
        logger.info(f"Successfully loaded {len(documents)} documents")
        return documents

# Usage example
if __name__ == "__main__":
    loader = DocumentLoader()
    docs = loader.load_directory("./knowledge_docs")
    for doc in docs[:2]:  # print info for the first two documents
        print(f"File: {doc['file_name']}, Size: {len(doc['content'])} chars")

3.2 Text preprocessing and cleaning

Raw text usually contains a lot of noise and needs cleaning and normalization.

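# text_processor.py
import re
from datetime import datetime
from typing import List, Dict, Any

class TextProcessor:
    """Text preprocessing and cleaning"""
    
    @staticmethod
    def clean_text(text: str, drop_isolated_chars: bool = False) -> str:
        """Clean text and remove noise, keeping line breaks and non-ASCII text"""
        if not text:
            return ""
        
        # 1. Collapse redundant whitespace but keep newlines, because
        #    remove_header_footer and the splitter rely on line structure
        text = re.sub(r'[^\S\n]+', ' ', text)
        text = re.sub(r' ?\n ?', '\n', text)
        text = re.sub(r'\n{3,}', '\n\n', text)
        
        # 2. Remove non-printable characters (keep newlines).
        #    str.isprintable() is Unicode-aware, so Chinese text survives;
        #    filtering on string.printable would delete every non-ASCII character.
        text = ''.join(char for char in text if char.isprintable() or char == '\n')
        
        # 3. Normalize quotes and dashes
        text = re.sub(r'[`´‘’]', "'", text)
        text = re.sub(r'[“”]', '"', text)
        text = re.sub(r'–', '-', text)
        
        # 4. Remove isolated single characters and digits (optional; off by default
        #    because it also deletes legitimate tokens such as "a" or "5")
        if drop_isolated_chars:
            text = re.sub(r' [0-9a-zA-Z] ', ' ', text)
        
        return text.strip()
    
    @staticmethod
    def remove_header_footer(text: str) -> str:
        """Try to remove headers and footers (simple heuristics)"""
        lines = text.split('\n')
        cleaned_lines = []
        
        for line in lines:
            # Skip lines that look like page numbers
            if re.match(r'^\s*\d+\s*$', line):
                continue
            # Detecting repeated headers (by position and content) is not
            # implemented in this simple version; every other line is kept
            cleaned_lines.append(line)
        
        return '\n'.join(cleaned_lines)
    
    @staticmethod
    def extract_metadata(text: str, file_info: Dict) -> Dict[str, Any]:
        """Extract metadata from the text"""
        # Simple metadata extraction
        lines = text.split('\n')
        first_few_lines = [line.strip() for line in lines[:10] if line.strip()]
        
        metadata = {
            'file_name': file_info['file_name'],
            'file_path': file_info['file_path'],
            'file_type': file_info['file_type'],
            'content_length': len(text),
            'estimated_word_count': len(text.split()),
            'first_line': first_few_lines[0] if first_few_lines else '',
            'processed_time': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        }
        
        # Try to detect the document language (simple version):
        # treat the text as Chinese if more than 30% of its characters are CJK
        chinese_chars = len(re.findall(r'[\u4e00-\u9fff]', text))
        if chinese_chars > len(text) * 0.3:
            metadata['language'] = 'zh'
        else:
            metadata['language'] = 'en'
            
        return metadata

# Use after document loading
def preprocess_documents(raw_documents: List[Dict]) -> List[Dict]:
    """Full document preprocessing pipeline"""
    processed_docs = []
    
    for doc in raw_documents:
        # 1. Clean the text
        cleaned_content = TextProcessor.clean_text(doc['content'])
        
        # 2. Remove headers and footers
        cleaned_content = TextProcessor.remove_header_footer(cleaned_content)
        
        # 3. Extract metadata
        metadata = TextProcessor.extract_metadata(cleaned_content, doc)
        
        processed_docs.append({
            'original_content': doc['content'],
            'cleaned_content': cleaned_content,
            'metadata': metadata
        })
    
    return processed_docs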
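
3.3 Text splitting strategy

This is one of the most critical steps in a RAG system. Poor splitting badly damages semantic integrity: a chunk boundary in the middle of a sentence, for example, separates a statement from the condition that qualifies it.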
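To make the roles of chunk size and overlap concrete, here is a minimal sketch of a fixed-size sliding window over an already tokenized sequence. It is not the splitter used in the project: `sliding_window_chunks` is a hypothetical helper and plain integers stand in for tokens. Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so neighbouring chunks share exactly `chunk_overlap` tokens.

```python
from typing import List, Sequence


def sliding_window_chunks(tokens: Sequence[int], chunk_size: int,
                          chunk_overlap: int) -> List[List[int]]:
    """Split a token sequence into fixed-size windows that overlap by chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    stride = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += stride
    return chunks


if __name__ == "__main__":
    toy_tokens = list(range(10))  # stand-in for real token ids
    for chunk in sliding_window_chunks(toy_tokens, chunk_size=4, chunk_overlap=1):
        print(chunk)
    # [0, 1, 2, 3]
    # [3, 4, 5, 6]
    # [6, 7, 8, 9]
```

A window like this ignores sentence boundaries entirely, which is exactly the weakness a separator-aware splitter tries to avoid.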

# text_splitter.py
from typing import List, Dict, Any, Optional
import tiktoken  # used for exact token counting

class SemanticTextSplitter:
    """Separator-aware text splitter with a token budget"""
    
    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200,
                 separators: Optional[List[str]] = None):
        if not 0 <= chunk_overlap < chunk_size:
            raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # Tried in order, from coarse (paragraphs) to fine (clauses)
        self.separators = separators or ["\n\n", "\n", "。", "!", "?", ".", "!", "?", ";", ";", ",", ","]
        self.encoder = tiktoken.get_encoding("cl100k_base")  # encoding used by text-embedding-ada-002
    
    def _count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))
    
    def _make_chunk(self, text: str) -> Dict[str, Any]:
        content = text.strip()
        return {'content': content, 'token_count': self._count_tokens(content)}
    
    def split_text(self, text: str) -> List[Dict[str, Any]]:
        """Split text into chunks of roughly chunk_size tokens, breaking at separators where possible"""
        if self._count_tokens(text) <= self.chunk_size:
            return [self._make_chunk(text)]
        
        # Reserve room for the overlap so that overlap + piece still fits in chunk_size
        piece_limit = self.chunk_size - self.chunk_overlap
        pieces = self._split_recursive(text, self.separators, piece_limit)
        
        chunks = []
        current_chunk = ""
        for piece in pieces:
            if current_chunk and self._count_tokens(current_chunk + piece) > self.chunk_size:
                # Adding this piece would exceed the budget:
                # save the current chunk and start a new one with the overlap
                chunks.append(self._make_chunk(current_chunk))
                current_chunk = self._get_trailing_overlap(current_chunk) + piece
            else:
                current_chunk += piece
        
        # Add the last chunk
        if current_chunk.strip():
            chunks.append(self._make_chunk(current_chunk))
        
        return [chunk for chunk in chunks if chunk['content']]
    
    def _split_recursive(self, text: str, separators: List[str], limit: int) -> List[str]:
        """Split on the coarsest separator first; recurse with finer ones for oversized parts"""
        if self._count_tokens(text) <= limit:
            return [text]
        if not separators:
            return self._force_split_large_part(text, limit)
        
        separator, finer = separators[0], separators[1:]
        if separator not in text:
            return self._split_recursive(text, finer, limit)
        
        parts = text.split(separator)
        pieces = []
        for i, part in enumerate(parts):
            # Keep the separator attached to the part that precedes it
            piece = part + (separator if i < len(parts) - 1 else "")
            if piece:
                pieces.extend(self._split_recursive(piece, finer, limit))
        return pieces
    
    def _get_trailing_overlap(self, text: str) -> str:
        """Return the tail of text that fits in chunk_overlap tokens (character-based estimate)"""
        if self.chunk_overlap <= 0:
            return ""
        tail = text[-self.chunk_overlap * 4:]  # rough guess: about 4 characters per token
        while tail and self._count_tokens(tail) > self.chunk_overlap:
            tail = tail[len(tail) // 4 + 1:]  # drop the leading quarter and re-check
        return tail
    
    def _force_split_large_part(self, text: str, limit: int) -> List[str]:
        """Force-split text that contains no usable separator"""
        pieces = []
        start = 0
        while start < len(text):
            end = min(start + limit * 3, len(text))  # character-level rough estimate
            # Shrink the window until it fits the token limit (always keeps >= 1 character)
            while end - start > 1 and self._count_tokens(text[start:end]) > limit:
                end = start + (end - start) * 3 // 4
            pieces.append(text[start:end])
            start = end
        return pieces

def create_document_chunks(processed_docs: List[Dict]) -> List[Dict[str, Any]]:
    """Full pipeline for creating document chunks"""
    splitter = SemanticTextSplitter(chunk_size=800, chunk_overlap=100)
    all_chunks = []
    
    for doc in processed_docs:
        chunks = splitter.split_text(doc['cleaned_content'])
        
        for i, chunk in enumerate(chunks):
            chunk_metadata = doc['metadata'].copy()
            chunk_metadata.update({
                # Use the full path: file names can repeat across subdirectories,
                # and the vector database needs unique ids
                'chunk_id': f"{chunk_metadata['file_path']}_chunk_{i}",
                'chunk_index': i,
                'total_chunks': len(chunks),
                'char_count': len(chunk['content']),
                'word_count': len(chunk['content'].split())
            })
            
            all_chunks.append({
                'content': chunk['content'],
                'metadata': chunk_metadata,
                'token_count': chunk['token_count']
            })
    
    print(f"Created {len(all_chunks)} chunks from {len(processed_docs)} documents")
    return all_chunks
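
3.4 Embedding and storage

Text chunks are converted to vectors and stored in a vector database. Note that the example below uses the local all-MiniLM-L6-v2 model, which produces 384-dimensional vectors; the 1536 shown in the architecture diagram corresponds to OpenAI's text-embedding-ada-002.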

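# vector_store.py
from typing import List, Dict, Any, Tuple
import chromadb
from chromadb.config import Settings
from sentence_transformers import SentenceTransformer
import logging

# Pipeline stages from the earlier sections, used by build_knowledge_base
from document_loader import DocumentLoader
from text_processor import preprocess_documents
from text_splitter import create_document_chunks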

logger = logging.getLogger(__name__)

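class VectorStoreManager:
    """Vector store manager"""
    
    def __init__(self, persist_directory: str = "./vector_db", model_name: str = 'all-MiniLM-L6-v2'):
        self.persist_directory = persist_directory
        # Note: all-MiniLM-L6-v2 is English-centric and truncates long inputs
        # (256 word pieces by default); a multilingual model with a longer
        # context is likely a better fit for Chinese documents and 800-token chunks
        self.embedding_model = SentenceTransformer(model_name)
        
        # Initialize the Chroma client
        self.client = chromadb.PersistentClient(
            path=persist_directory,
            settings=Settings(anonymized_telemetry=False)
        )
        
        # Get or create the collection. Cosine space makes `1 - distance`
        # in search_similar an actual cosine similarity.
        self.collection = self.client.get_or_create_collection(
            name="knowledge_base",
            metadata={"description": "AI super-agent knowledge base", "hnsw:space": "cosine"}
        )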
    
    def generate_embeddings(self, texts: List[str]) -> List[List[float]]:
        """Generate embedding vectors for a list of texts"""
        logger.info(f"Generating embeddings for {len(texts)} texts...")
        embeddings = self.embedding_model.encode(texts).tolist()
        return embeddings
    
    def add_documents(self, chunks: List[Dict[str, Any]], batch_size: int = 100):
        """Add document chunks to the vector database"""
        total_chunks = len(chunks)
        logger.info(f"Starting to add {total_chunks} chunks to vector database...")
        
        for i in range(0, total_chunks, batch_size):
            batch = chunks[i:i + batch_size]
            batch_ids = []
            batch_documents = []
            batch_metadatas = []
            
            for chunk in batch:
                chunk_id = chunk['metadata']['chunk_id']
                batch_ids.append(chunk_id)
                batch_documents.append(chunk['content'])
                batch_metadatas.append(chunk['metadata'])
            
            # Embed the whole batch at once (more efficient)
            batch_embeddings = self.generate_embeddings(batch_documents)
            
            # upsert instead of add, so that rebuilding the knowledge base
            # overwrites existing ids instead of failing on duplicates
            self.collection.upsert(
                embeddings=batch_embeddings,
                documents=batch_documents,
                metadatas=batch_metadatas,
                ids=batch_ids
            )
            
            logger.info(f"Processed batch {i//batch_size + 1}/{(total_chunks-1)//batch_size + 1}")
        
        logger.info(f"Successfully added {total_chunks} chunks to vector database")
    
    def search_similar(self, query: str, n_results: int = 5, 
                      filter_metadata: Dict[str, Any] = None) -> List[Dict[str, Any]]:
        """Search for similar document chunks"""
        # Embed the query
        query_embedding = self.embedding_model.encode([query]).tolist()[0]
        
        # Run the search
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=n_results,
            where=filter_metadata,
            include=['documents', 'metadatas', 'distances']
        )
        
        # Format the results
        formatted_results = []
        if results['documents']:
            for i, (doc, metadata, distance) in enumerate(zip(
                results['documents'][0],
                results['metadatas'][0],
                results['distances'][0]
            )):
                formatted_results.append({
                    'content': doc,
                    'metadata': metadata,
                    # Equals cosine similarity only if the collection uses cosine distance;
                    # with Chroma's default squared-L2 space this value can be negative
                    'similarity_score': 1 - distance,
                    'rank': i + 1
                })
        
        return formatted_results

# Full build pipeline
def build_knowledge_base(docs_directory: str = "./knowledge_docs"):
    """Full pipeline for building the knowledge base"""
    
    # 1. Load documents
    raw_docs = DocumentLoader.load_directory(docs_directory)
    print(f"Loaded {len(raw_docs)} raw documents")
    
    # 2. Preprocess documents
    processed_docs = preprocess_documents(raw_docs)
    print(f"Processed {len(processed_docs)} documents")
    
    # 3. Split documents
    chunks = create_document_chunks(processed_docs)
    print(f"Created {len(chunks)} chunks")
    
    # 4. Initialize the vector store
    vector_store = VectorStoreManager()
    
    # 5. Add the chunks to the vector database
    vector_store.add_documents(chunks)
    
    print("Knowledge base build completed successfully!")
    return vector_store

if __name__ == "__main__":
    # Run the knowledge base build
    build_knowledge_base("./knowledge_docs")
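The `1 - distance` conversion in `search_similar` is a cosine similarity only when the database returns cosine distances. The sketch below illustrates the relationship in plain Python; the helper names are hypothetical and no vector database is involved. As far as we know, Chroma collections use a squared-L2 space unless configured otherwise, in which case `1 - distance` is unbounded below and should not be read as a similarity.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


def squared_l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between a and b."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


if __name__ == "__main__":
    query = [1.0, 0.0]
    doc = [0.0, 2.0]  # orthogonal to the query
    print(1 - cosine_distance(query, doc))      # 0.0  (a valid similarity)
    print(1 - squared_l2_distance(query, doc))  # -4.0 (not a similarity)
```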

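4. Retrieval Stage: Retrieval and Augmented Generation

4.1 Advanced retrieval strategies

Basic vector search alone is often not enough; we need smarter retrieval strategies.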

# advanced_retriever.py
from typing import List, Dict, Any, Optional
import re
from vector_store import VectorStoreManager

class AdvancedRetriever:
    """Advanced retriever combining several retrieval strategies"""
    
    def __init__(self, vector_store: VectorStoreManager):
        self.vector_store = vector_store
    
    def hybrid_search(self, query: str, n_results: int = 5, 
                     keyword_boost: bool = True) -> List[Dict[str, Any]]:
        """Hybrid retrieval: vector search + keyword boost"""
        
        # 1. Base vector search (over-fetch so reranking has candidates to choose from)
        vector_results = self.vector_store.search_similar(query, n_results * 2)
        
        if keyword_boost:
            # 2. Keyword extraction (simple version)
            keywords = self._extract_keywords(query)
            
            # 3. Rerank by keywords
            reranked_results = self._rerank_by_keywords(vector_results, keywords)
            final_results = reranked_results[:n_results]
        else:
            final_results = vector_results[:n_results]
        
        return final_results
    
    def _extract_keywords(self, text: str) -> List[str]:
        """Extract keywords from text (simple implementation)"""
        # Remove stop words (production code should use a more complete list).
        # Caveat: the regex below returns an unbroken run of Chinese characters
        # as one "word", so Chinese queries need a real word segmenter to
        # benefit from stop-word removal and keyword matching.
        stop_words = {'的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都', '一', '一个', '上', '也', '很', '到', '说', '要', '去', '你', '会', '着', '没有', '看', '好', '自己', '这个', '那个'}
        words = re.findall(r'[\u4e00-\u9fff\w]+', text.lower())
        keywords = [word for word in words if word not in stop_words and len(word) > 1]
        return keywords
    
    def _rerank_by_keywords(self, results: List[Dict], keywords: List[str]) -> List[Dict]:
        """Rerank results based on keyword matches"""
        if not keywords:
            return results
        
        scored_results = []
        for result in results:
            score = result['similarity_score']
            content = result['content'].lower()
            
            # Bonus of 0.1 for each keyword found in the chunk
            keyword_bonus = sum(1 for keyword in keywords if keyword in content)
            final_score = score + keyword_bonus * 0.1
            
            scored_results.append((final_score, result))
        
        # Sort by the new score
        scored_results.sort(key=lambda x: x[0], reverse=True)
        return [result for score, result in scored_results]
    
    def contextual_search(self, query: str, conversation_history: Optional[List[Dict]] = None, 
                         n_results: int = 5) -> List[Dict[str, Any]]:
        """Context-aware retrieval"""
        
        # If there is conversation history, use it to enhance the query
        enhanced_query = self._enhance_query_with_context(query, conversation_history)
        
        # Search with the enhanced query
        results = self.hybrid_search(enhanced_query, n_results)
        return results
    
    def _enhance_query_with_context(self, current_query: str, 
                                   conversation_history: Optional[List[Dict]]) -> str:
        """Enhance the current query with conversation history"""
        if not conversation_history:
            return current_query
        
        # Collect context from the most recent history entries
        recent_context = []
        for turn in conversation_history[-3:]:  # last 3 history entries
            if 'user' in turn:
                recent_context.append(turn['user'])
            if 'assistant' in turn:
                recent_context.append(turn['assistant'])
        
        context_text = " ".join(recent_context[-2:])  # keep the latest 2 messages
        enhanced_query = f"{current_query} [Context: {context_text}]"
        
        return enhanced_query
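The regex in `_extract_keywords` returns an unbroken run of Chinese characters as a single token, so keyword matching rarely fires for Chinese queries. One dependency-free workaround, sketched below as an assumption rather than as what the project does, is to index each CJK run by overlapping character bigrams. `extract_keywords_with_bigrams` is a hypothetical helper; a proper word segmenter would give cleaner keywords.

```python
import re
from typing import List

CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')  # consecutive CJK ideographs
LATIN_WORD = re.compile(r'[a-z0-9_]+')     # ASCII words, digits, underscores


def extract_keywords_with_bigrams(text: str) -> List[str]:
    """Return ASCII words plus overlapping 2-character windows of each CJK run."""
    text = text.lower()
    keywords = [word for word in LATIN_WORD.findall(text) if len(word) > 1]
    for run in CJK_RUN.findall(text):
        # A run of n characters yields n - 1 bigrams; single characters are skipped
        keywords.extend(run[i:i + 2] for i in range(len(run) - 1))
    # Remove duplicates while keeping first-seen order
    return list(dict.fromkeys(keywords))


if __name__ == "__main__":
    print(extract_keywords_with_bigrams("RAG年假政策"))
    # ['rag', '年假', '假政', '政策']
```

Bigrams produce some meaningless pairs (such as '假政' above), but those rarely match document text, so they add little noise to the keyword bonus.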

4.2 RAG-augmented generation

The retrieved information is combined with the LLM to produce the final answer.

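# rag_generator.py
from typing import List, Dict, Any
import logging
from advanced_retriever import AdvancedRetriever
from vector_store import VectorStoreManager
from unified_model_client import UnifiedModelClient  # assumes the unified client from the previous article

logger = logging.getLogger(__name__)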

class RAGGenerator:
    """RAG generator"""
    
    def __init__(self, vector_store: VectorStoreManager, model_client: UnifiedModelClient):
        self.retriever = AdvancedRetriever(vector_store)
        self.model_client = model_client
    
    async def generate_answer(self, question: str, conversation_history: List[Dict] = None, 
                            max_retrieved_docs: int = 5) -> Dict[str, Any]:
        """Generate an answer grounded in retrieved information"""
        
        # 1. Retrieve relevant documents
        retrieved_docs = self.retriever.contextual_search(
            question, conversation_history, max_retrieved_docs
        )
        
        if not retrieved_docs:
            return {
                'answer': "Sorry, I could not find relevant information in the knowledge base.",
                'retrieved_documents': [],
                'source_documents': [],  # keep the result shape consistent for callers
                'confidence': 0.0
            }
        
        # 2. Build the augmented prompt
        prompt = self._build_enhanced_prompt(question, retrieved_docs)
        
        # 3. Call the LLM to generate the answer
        try:
            response = await self.model_client.chat_completion(
                messages=[{"role": "user", "content": prompt}],
                model_config=self.model_client.get_model_config("gpt-3.5-turbo"),  # pick a suitable model
                temperature=0.3  # low temperature for more deterministic answers
            )
            
            answer = response['content']
            
            return {
                'answer': answer,
                'retrieved_documents': retrieved_docs,
                'confidence': self._calculate_confidence(answer, retrieved_docs),
                'source_documents': [doc['metadata'] for doc in retrieved_docs]
            }
            
        except Exception as e:
            logger.error(f"Error generating RAG answer: {e}")
            return {
                'answer': "An error occurred while generating the answer.",
                'retrieved_documents': retrieved_docs,
                'source_documents': [doc['metadata'] for doc in retrieved_docs],
                'confidence': 0.0,
                'error': str(e)
            }
    
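    def _build_enhanced_prompt(self, question: str, documents: List[Dict]) -> str:
        """Build the augmented prompt"""
        
        context_parts = []
        for i, doc in enumerate(documents, 1):
            context_parts.append(f"[Document {i} - similarity: {doc['similarity_score']:.3f}]\n{doc['content']}\n")
        
        context = "\n".join(context_parts)
        
        prompt = f"""Answer the question based on the reference documents provided below. If the documents do not contain enough information to answer, say so explicitly.

Reference documents:
{context}

Question: {question}

Give an accurate, useful answer based on the information above, citing the document sources where appropriate. If information is missing, state which aspects are not covered.
"""
        return prompt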
    
    def _calculate_confidence(self, answer: str, documents: List[Dict]) -> float:
        """Compute the answer's confidence (simplified)"""
        if not documents:
            return 0.0
        
        # Based on the highest similarity score among the retrieved documents
        max_similarity = max(doc['similarity_score'] for doc in documents)
        
        # Length heuristic (penalizes very short, uninformative answers)
        answer_length = len(answer.strip())
        length_confidence = min(answer_length / 50, 1.0)  # assumes 50 characters is a reasonable length
        
        final_confidence = max_similarity * 0.7 + length_confidence * 0.3
        return min(final_confidence, 1.0)
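A quick worked example of the weighting in `_calculate_confidence`, rewritten as a standalone function (`confidence` is a hypothetical name used only for this illustration). It shows one property worth keeping in mind: a long answer earns 0.3 from the length term alone, even when retrieval similarity is zero.

```python
def confidence(max_similarity: float, answer: str) -> float:
    """Same formula as _calculate_confidence: 70% similarity, 30% answer length."""
    length_confidence = min(len(answer.strip()) / 50, 1.0)
    return min(max_similarity * 0.7 + length_confidence * 0.3, 1.0)


# Similarity 0.8 with a 25-character answer: 0.8 * 0.7 + 0.5 * 0.3
print(round(confidence(0.8, "x" * 25), 2))  # 0.71

# Similarity 0.0 with an 80-character answer: only the length term remains
print(round(confidence(0.0, "x" * 80), 2))  # 0.3
```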

5. Full Workflow and Integration

5.1 End-to-end RAG workflow

# rag_orchestrator.py
import asyncio
from typing import List, Dict, Any
from vector_store import VectorStoreManager
from rag_generator import RAGGenerator
from unified_model_client import UnifiedModelClient

class RAGOrchestrator:
    """RAG workflow orchestrator"""
    
    def __init__(self, knowledge_base_path: str = "./vector_db"):
        self.vector_store = VectorStoreManager(knowledge_base_path)
        self.model_client = UnifiedModelClient()
        self.rag_generator = RAGGenerator(self.vector_store, self.model_client)
        self.conversation_history = []
    
    async def query_knowledge_base(self, question: str) -> Dict[str, Any]:
        """Full flow for querying the knowledge base"""
        
        # Generate the answer with the history *before* this turn, so the
        # current question is not repeated inside its own retrieval context
        result = await self.rag_generator.generate_answer(
            question, self.conversation_history
        )
        
        # Record the user question and the assistant answer
        self.conversation_history.append({'user': question})
        self.conversation_history.append({'assistant': result['answer']})
        
        return result
    
    def get_conversation_history(self) -> List[Dict]:
        """Return a copy of the conversation history"""
        return self.conversation_history.copy()
    
    def clear_conversation_history(self):
        """Clear the conversation history"""
        self.conversation_history.clear()

# Usage example
async def main():
    # Initialize the RAG orchestrator
    orchestrator = RAGOrchestrator()
    
    questions = [
        "What is our company's annual leave policy?",
        "What are the main features of Product A?",
        "What steps does the expense reimbursement process involve?"
    ]
    
    for question in questions:
        print(f"\nUser: {question}")
        result = await orchestrator.query_knowledge_base(question)
        
        print(f"Assistant: {result['answer']}")
        print(f"Confidence: {result['confidence']:.2f}")
        print(f"Reference documents: {len(result['retrieved_documents'])}")
        
        for doc in result['retrieved_documents'][:2]:  # show the first 2 sources
            print(f"  - {doc['metadata']['file_name']} (similarity: {doc['similarity_score']:.3f})")

if __name__ == "__main__":
    asyncio.run(main())

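6. Performance Optimization and Best Practices

6.1 Retrieval quality
  1. Multi-vector strategy: generate several embeddings per document chunk (for example title, summary, and full text) to improve recall.
  2. Query expansion: use the LLM to rewrite and expand the original query.
  3. Reranking model: use a cross-encoder to rerank the initial retrieval results.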
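The reranking step can be sketched as a second pass that rescores each (query, passage) pair and reorders the candidates. The code below is a minimal sketch under stated assumptions: `rerank` and `token_overlap_score` are hypothetical names, and the toy overlap scorer merely stands in for a cross-encoder. In practice `score_fn` would wrap a real model, for example the `predict` method of a sentence-transformers `CrossEncoder`.

```python
from typing import Any, Callable, Dict, List


def rerank(query: str, candidates: List[Dict[str, Any]],
           score_fn: Callable[[str, str], float], top_k: int = 5) -> List[Dict[str, Any]]:
    """Rescore each candidate's 'content' against the query and keep the top_k."""
    scored = [(score_fn(query, candidate['content']), candidate) for candidate in candidates]
    # Stable sort: candidates with equal scores keep their original retrieval order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [dict(candidate, rerank_score=score) for score, candidate in scored[:top_k]]


def token_overlap_score(query: str, passage: str) -> float:
    """Toy scorer: fraction of query tokens that appear in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


if __name__ == "__main__":
    candidates = [
        {'content': 'cats sleep most of the day'},
        {'content': 'annual leave policy details for employees'},
    ]
    for item in rerank('annual leave policy', candidates, token_overlap_score, top_k=2):
        print(item['rerank_score'], item['content'])
```

Passing the scorer as a parameter keeps the reranking logic testable without downloading a model.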
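6.2 Engineering
  1. Batch processing: use batch operations for document processing and embedding.
  2. Caching: cache the results of common queries.
  3. Incremental updates: support incremental updates of the knowledge base instead of full rebuilds.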
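Caching and incremental updates can both be sketched with the standard library alone. The code below is an illustration, not the project's implementation: `QueryCache` and `plan_incremental_update` are hypothetical names. The cache is a small LRU keyed on the normalized query, and the update planner compares content hashes against a stored manifest to decide which files need re-embedding or deletion.

```python
import hashlib
from collections import OrderedDict
from typing import Any, Dict, List, Optional, Tuple


class QueryCache:
    """Small LRU cache for retrieval results, keyed on the normalized query."""

    def __init__(self, max_size: int = 128):
        self.max_size = max_size
        self._store: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, query: str) -> Optional[Any]:
        key = query.strip().lower()
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, query: str, results: Any) -> None:
        key = query.strip().lower()
        self._store[key] = results
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict the least recently used entry

    def clear(self) -> None:
        self._store.clear()


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode('utf-8')).hexdigest()


def plan_incremental_update(
    previous_manifest: Dict[str, str], current_docs: Dict[str, str]
) -> Tuple[List[str], List[str], Dict[str, str]]:
    """Compare path->hash from the last build with path->content now.

    Returns (paths to re-embed, paths to delete, new manifest)."""
    new_manifest = {path: content_hash(text) for path, text in current_docs.items()}
    to_upsert = sorted(p for p, h in new_manifest.items() if previous_manifest.get(p) != h)
    to_delete = sorted(p for p in previous_manifest if p not in new_manifest)
    return to_upsert, to_delete, new_manifest


if __name__ == "__main__":
    _, _, manifest = plan_incremental_update({}, {'a.txt': 'v1', 'b.txt': 'v1'})
    changed, removed, _ = plan_incremental_update(manifest, {'a.txt': 'v2', 'c.txt': 'v1'})
    print(changed, removed)  # ['a.txt', 'c.txt'] ['b.txt']
```

Cached results go stale whenever the knowledge base changes, so the cache should be cleared after each incremental update.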

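7. Summary

With the RAG architecture described in this article, we built a knowledge system for the AI super agent with the following characteristics:

  1. End-to-end automation: the whole flow from raw documents to retrieval is automated.
  2. Semantic understanding: retrieval is based on vector embeddings rather than exact keyword matches.
  3. Context awareness: queries are enhanced with conversation history.
  4. Explainability: every answer can be traced back to its source documents.
  5. Extensibility: multiple file formats and retrieval strategies are supported.

In our super-agent project, this RAG system serves as the agent's "professional knowledge base", letting it answer domain-specific questions accurately and improving its practicality and reliability. We are continuing to tune the retrieval algorithms and to explore directions such as multimodal RAG, so that the agent can act as a well-informed "domain expert".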
