Implementing Branched RAG

Last Updated : 10 Feb, 2026

Branched Retrieval-Augmented Generation (Branched RAG) is a type of RAG system in which a complex query is split into several retrieval paths (branches) that can run in parallel. Each branch retrieves and processes information independently, and combining their outputs can improve answer accuracy and reasoning depth.

  • Enables parallel retrieval from multiple sources or contexts
  • Improves response quality for complex or multi‑part queries
  • Enhances flexibility and scalability in RAG‑based systems
Figure: Branched RAG
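
Before moving to the LangGraph version, the overall idea can be sketched in plain Python. The example below is only an illustration under simplifying assumptions: `CORPUS`, `split_query`, `retrieve` and `branched_retrieve` are hypothetical names, the query is split on the word "and" instead of by an LLM, and retrieval uses word overlap instead of embeddings.

```python
from typing import List

# Hypothetical toy corpus; a real system would query a vector store instead.
CORPUS = [
    "RAG combines retrieval with LLMs.",
    "Branched RAG uses multiple queries.",
    "Fine-tuning adapts models using training data.",
]


def split_query(query: str) -> List[str]:
    """Stand-in for the LLM branching step: split a multi-part question on ' and '."""
    return [part.strip() for part in query.split(" and ") if part.strip()]


def retrieve(sub_query: str, k: int = 1) -> List[str]:
    """Rank documents by the number of words they share with the sub-query."""
    words = set(sub_query.lower().split())

    def overlap(doc: str) -> int:
        return len(words & set(doc.lower().rstrip(".").split()))

    return sorted(CORPUS, key=overlap, reverse=True)[:k]


def branched_retrieve(query: str) -> List[str]:
    """Run retrieval once per branch and merge the results without duplicates."""
    merged: List[str] = []
    for sub_query in split_query(query):
        for doc in retrieve(sub_query):
            if doc not in merged:
                merged.append(doc)
    return merged


context = branched_retrieve("what is fine-tuning and what is branched rag")
print(context)
```

Each sub-query pulls in a different document, which a single combined query might not rank highly on its own.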

Implementation

Step 1: Install Required Libraries

Install the following libraries to set up the environment for implementing Branched RAG using LangGraph:

  • langchain: Core framework for building applications with large language models.
  • langgraph: Manages multi-step and branched RAG workflows using graph-based execution.
  • langchain-google-genai: Enables integration of Google’s Generative AI models within LangChain.
  • langchain-huggingface, langchain-community and langchain-text-splitters: Provide the HuggingFaceEmbeddings, FAISS and RecursiveCharacterTextSplitter classes imported in Step 2.
  • faiss-cpu: High-performance similarity search library for vector embeddings.
  • sentence-transformers: Generates dense vector embeddings for semantic search and retrieval tasks.

Run the command below to install or upgrade all required packages (prefix it with ! when running inside a notebook cell):

Python
pip install --upgrade langchain langgraph langchain-google-genai langchain-huggingface langchain-community langchain-text-splitters faiss-cpu sentence-transformers

Step 2: Import Required Libraries

We start by importing all the building blocks required for documents, embeddings, vector search, LLMs and graph orchestration.

  • Document: Standard format for storing text
  • RecursiveCharacterTextSplitter: Breaks text into manageable chunks
  • HuggingFaceEmbeddings: Converts text into numerical vectors
  • FAISS: Fast vector similarity search
  • ChatGoogleGenerativeAI: LLM for reasoning and answer generation
  • StateGraph, START, END: LangGraph building blocks that control the multi-step RAG flow using nodes and edges
  • TypedDict, List: Standard typing helpers used to describe the graph state
Python
from typing import TypedDict, List

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_google_genai import ChatGoogleGenerativeAI

from langgraph.graph import START, StateGraph, END

Step 3: Load Dummy Documents

In this article we use small dummy documents to simulate a real knowledge base.

  • Each text snippet is wrapped inside a Document object
  • Document provides a standardized interface, i.e., text content (page_content) plus optional metadata such as source, tags or timestamps.
  • Using dummy documents enables faster iteration and easier debugging.
Python
documents = [
    Document(page_content="Retrieval Augmented Generation combines retrieval with LLMs."),
    Document(page_content="Fine-tuning adapts models using training data."),
    Document(page_content="RAG reduces hallucinations."),
    Document(page_content="Branched RAG uses multiple queries.")
]

Step 4: Split Documents into Chunks

Large text blocks dilute retrieval accuracy, so we split documents into focused, overlapping chunks. This helps the vector search return precise, relevant context instead of broad, noisy passages.

  • RecursiveCharacterTextSplitter: Splits text on a hierarchy of separators (paragraphs, then lines, then words), so chunks tend to break at natural boundaries
  • chunk_size=100: Limits each chunk to 100 characters for fine-grained retrieval
  • chunk_overlap=20: Repeats up to 20 characters between adjacent chunks, preserving continuity across chunk boundaries
Python
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)

chunks = splitter.split_documents(documents)
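
To make the effect of chunk_size and chunk_overlap concrete, here is a simplified sliding-window chunker in plain Python. It is a sketch for illustration only: `chunk_text` is a hypothetical helper and it does not reproduce the separator-aware logic of RecursiveCharacterTextSplitter. Note also that the dummy documents above are each shorter than 100 characters, so they should pass through the real splitter unchanged.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 250 characters of sample text: digits 0-9 repeated
text = "".join(str(i % 10) for i in range(250))
pieces = chunk_text(text)
print([len(p) for p in pieces])  # [100, 100, 90]
```

The last 20 characters of each chunk are repeated at the start of the next one, which is what keeps a sentence that straddles a boundary retrievable from either side.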

Step 5: Generate Embeddings and Build Vector Store

To enable semantic search, we convert text chunks into numerical vector representations. These vectors allow the system to compare meaning, not just keywords, which makes retrieval more accurate and context-aware.

  • HuggingFaceEmbeddings: Uses a Sentence Transformers model to encode text into dense vectors.
  • FAISS Vector Store: Stores embeddings in memory for rapid similarity search
Python
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

vectorstore = FAISS.from_documents(chunks, embeddings)

Output:

Figure: Creating Embeddings (screenshot)

Step 6: Create a Retriever

This step wraps the vector store in a retriever that returns the most relevant chunks for a given query.

  • Retriever acts as a clean query layer over the vector store
  • k=2: Retrieves the top 2 most relevant chunks for each query
  • The same retriever can now be reused across multiple query branches
Python
retriever = vectorstore.as_retriever(search_kwargs={"k":2})
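
Conceptually, a top-k retriever scores every stored vector against the query vector and keeps the k best matches. The sketch below shows this with cosine similarity on hand-written three-dimensional vectors. The vectors, `cosine_similarity` and `top_k` are illustrative assumptions; the metric actually used depends on how the FAISS index is configured, and real embeddings have hundreds of dimensions.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the names of the k documents most similar to the query vector."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up embeddings, for illustration only
toy_vectors = {
    "rag": [0.9, 0.1, 0.0],
    "fine_tuning": [0.0, 0.2, 0.9],
    "branched_rag": [0.8, 0.3, 0.1],
}
print(top_k([1.0, 0.2, 0.0], toy_vectors, k=2))  # ['rag', 'branched_rag']
```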

Step 7: Initialize the LLM

The Large Language Model (LLM) drives both query branching and answer generation in the Branched RAG system. Here we use Google Gemini as the LLM.

  • model: Selects the gemini-2.5-flash model.
  • google_api_key: Passes the API key directly to the LLM client.
  • temperature=0: Makes decoding close to greedy, so the model favours the highest-probability next token and its output is more repeatable.

To learn how to get a Gemini API key, refer to: Google Gemini API Key

Python
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    google_api_key="YOUR_GOOGLE_API_KEY_HERE",
    temperature=0
)
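
Hardcoding the key is convenient in a demo, but reading it from the environment keeps it out of the notebook. The helper below is a minimal sketch: `get_api_key` is a hypothetical function, and the variable name GOOGLE_API_KEY is an assumption you can change to match your setup.

```python
import os


def get_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Read the API key from an environment variable and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the LLM client."
        )
    return key
```

With this in place, the client could be created with google_api_key=get_api_key() instead of a literal string.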

Step 8: Define the Graph State

LangGraph passes a shared state through all nodes in the graph. Each node reads the current state and returns only the fields it wants to change, and LangGraph merges that partial update into the state before the next node runs. The state therefore acts as a single source of truth, and each field represents a key stage in the RAG lifecycle.

  • query: The original user question and the entry point of the graph.
  • branches: Sub queries generated from the original query, enabling parallel retrieval paths.
  • retrieved_docs: Raw content returned by the retriever across all branches, acting as the evidence pool.
  • context: Merged text built from the retrieved documents and passed to the LLM.
  • answer: Final grounded response generated by the LLM and the output of the graph.
Python
class RAGState(TypedDict):
    query: str
    branches: List[str]
    retrieved_docs: List[str]
    context: str
    answer: str
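
The way a node's partial update is folded into the state can be emulated in a few lines of plain Python. This is a simplified emulation, not LangGraph's actual implementation: `apply_update` is a hypothetical helper. By default a returned key overwrites the old value; LangGraph also allows a field to declare a reducer (for example by annotating it with operator.add), which the optional `reducers` argument imitates here.

```python
import operator
from typing import Any, Callable, Dict, Optional

Reducer = Callable[[Any, Any], Any]


def apply_update(
    state: Dict[str, Any],
    update: Dict[str, Any],
    reducers: Optional[Dict[str, Reducer]] = None,
) -> Dict[str, Any]:
    """Return a new state with the node's partial update merged in."""
    reducers = reducers or {}
    new_state = dict(state)  # copy, so the previous state is left untouched
    for key, value in update.items():
        if key in reducers and key in new_state:
            # A reducer combines the old and new values instead of overwriting
            new_state[key] = reducers[key](new_state[key], value)
        else:
            new_state[key] = value
    return new_state


state = {"query": "Explain Branched RAG"}
state = apply_update(state, {"branches": ["what is RAG", "what is branching"]})
state = apply_update(state, {"retrieved_docs": ["doc A"]}, {"retrieved_docs": operator.add})
state = apply_update(state, {"retrieved_docs": ["doc B"]}, {"retrieved_docs": operator.add})
print(state["retrieved_docs"])  # ['doc A', 'doc B']
```

The RAGState above declares no reducers, so every field in this article follows the plain overwrite behaviour.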

Step 9: Implement Branched RAG Execution Nodes

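1. Query Branching Node: Uses the LLM to decompose the user’s question into multiple focused sub queries. Each branch targets a different aspect of the question, improving coverage compared to a single broad query.

Python
def branch_node(state: RAGState):

    # Ask for one query per line so the reply is easy to split
    prompt = (
        "Break the question below into 3 search queries. "
        "Return one query per line with no numbering or extra text.\n"
        f"{state['query']}"
    )

    response = llm.invoke(prompt)

    # Drop blank lines, which would otherwise trigger pointless searches
    branches = [line.strip() for line in response.content.split("\n") if line.strip()]

    return {"branches": branches}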
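
LLM replies often include numbering, bullets or quotation marks even when asked not to. A slightly more defensive parser can be sketched as follows. `parse_sub_queries` is a hypothetical helper, and it makes no attempt to detect introductory sentences such as "Here are the queries:", which would still pass through as a query.

```python
import re
from typing import List

# Matches a leading bullet ("-", "*", "•") or a number such as "1." or "2)"
_PREFIX = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s*")


def parse_sub_queries(text: str, max_queries: int = 3) -> List[str]:
    """Turn a multi-line LLM reply into a clean, de-duplicated list of queries."""
    queries: List[str] = []
    for line in text.splitlines():
        cleaned = _PREFIX.sub("", line).strip().strip('"')
        if cleaned and cleaned not in queries:
            queries.append(cleaned)
    return queries[:max_queries]


sample = (
    "1. What is Branched RAG?\n"
    "\n"
    "2) How does it differ from standard RAG?\n"
    "- \"Benefits of Branched RAG\"\n"
)
print(parse_sub_queries(sample))
```

Capping the list with max_queries also bounds the number of vector searches the next node performs.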

2. Multi Branch Retrieval Node: Runs a vector search for each query branch independently, retrieves the top k relevant chunks per branch and aggregates all results into a unified evidence pool. The branches are conceptually parallel, although this simple implementation loops over them one at a time. Retrieving per branch is the core differentiator of Branched RAG.

Python
def retrieve_node(state: RAGState):

    results = []

    for branch in state["branches"]:
        docs = retriever.invoke(branch)
        results.extend([d.page_content for d in docs])

    return {"retrieved_docs": results}
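
If the searches are slow (for example, against a remote vector database), the branches can be run concurrently with a thread pool. The sketch below uses a fake search function so it runs on its own. `retrieve_parallel` and `fake_search` are hypothetical names, and whether a given retriever is safe to call from several threads is an assumption you would need to check.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def retrieve_parallel(
    branches: List[str],
    search: Callable[[str], List[str]],
    max_workers: int = 4,
) -> List[str]:
    """Run one search per branch concurrently and flatten the results in branch order."""
    if not branches:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input branches
        per_branch = list(pool.map(search, branches))
    return [text for docs in per_branch for text in docs]


def fake_search(query: str) -> List[str]:
    """Stand-in for a real retriever call; returns two fake hits per query."""
    return [f"{query} :: hit 1", f"{query} :: hit 2"]


print(retrieve_parallel(["q1", "q2"], fake_search))
```

In this article's setup, `search` could be a small function that calls retriever.invoke(query) and returns the page_content of each result.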

3. Context Merge Node: Combines all retrieved content into a single context string and drops exact duplicates, so the LLM receives each passage only once.

Python
def merge_node(state: RAGState):

    # Different branches often return the same chunk; keep the first copy of each
    unique_docs = list(dict.fromkeys(state["retrieved_docs"]))

    context = "\n".join(unique_docs)

    return {"context": context}
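
With more branches or larger chunks, the merged context can grow quickly and contain near-duplicates. One possible refinement, sketched below, treats passages that differ only in whitespace or letter case as duplicates and stops adding passages once a character budget is reached. `merge_context` and the `max_chars` default are illustrative assumptions, not part of the original pipeline.

```python
from typing import List


def merge_context(passages: List[str], max_chars: int = 2000) -> str:
    """Join unique passages, in order, without exceeding max_chars characters."""
    seen = set()
    kept: List[str] = []
    used = 0
    for passage in passages:
        text = " ".join(passage.split())  # collapse runs of whitespace
        key = text.lower()  # compare case-insensitively
        if not text or key in seen:
            continue
        cost = len(text) + (1 if kept else 0)  # +1 for the joining newline
        if used + cost > max_chars:
            break  # budget exhausted; later passages are dropped
        seen.add(key)
        kept.append(text)
        used += cost
    return "\n".join(kept)


print(merge_context([
    "RAG reduces  hallucinations.",
    "rag reduces hallucinations.",
    "Branched RAG uses multiple queries.",
]))
```

Because passages are kept in their original order, results from earlier branches take priority when the budget runs out.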

4. Answer Generation Node: Generates the final response by passing the merged context and the original question to the LLM. Grounding the answer in retrieved evidence helps keep it coherent and reduces the risk of hallucinations.

Python
def answer_node(state: RAGState):

    prompt = f"""
    Context:
    {state['context']}

    Question:
    {state['query']}
    """

    response = llm.invoke(prompt)

    return {"answer": response.content}
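
The prompt above supplies the context but does not tell the model to stay within it. A slightly stricter prompt template could look like the sketch below. `build_prompt` and its wording are assumptions for illustration; the exact instructions that work best depend on the model.

```python
def build_prompt(context: str, question: str) -> str:
    """Assemble a grounded-answer prompt from the merged context and the question."""
    context = context.strip() or "(no context retrieved)"
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question.strip()}\n\n"
        "Answer:"
    )


print(build_prompt("RAG reduces hallucinations.", "What does RAG reduce?"))
```

Building the string with explicit newlines also avoids the stray indentation that a triple-quoted f-string inside a function carries into the prompt.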

Step 10: Build and Execute the LangGraph Workflow

This step brings all the defined nodes together into a single executable workflow using LangGraph’s graph-based model. Nodes specify what operations are performed, and edges control the order in which they run, so the user query flows step by step until the final answer is generated.

Python
workflow = StateGraph(RAGState)

workflow.add_node("branch", branch_node)
workflow.add_node("retrieve", retrieve_node)
workflow.add_node("merge", merge_node)
workflow.add_node("answer", answer_node)

workflow.add_edge(START,"branch")
workflow.add_edge("branch", "retrieve")
workflow.add_edge("retrieve", "merge")
workflow.add_edge("merge", "answer")
workflow.add_edge("answer", END)

graph = workflow.compile()

graph

Output:

Figure: Graph formed (flow of state)
  • __start__: The entry point of the graph. The initial state, containing only the user query, enters here.
  • branch: The LLM splits the query into multiple sub queries, each targeting a different aspect of the question. These form the logical branches.
  • retrieve: Each branch runs its own vector search. Relevant chunks are fetched per branch and collected into a shared result set.
  • merge: All retrieved chunks are combined into a single context string for the LLM.
  • answer: The LLM uses the merged context along with the original query to generate a grounded final response, which helps reduce hallucinations.
  • __end__: The final answer is returned and the graph execution completes.
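
The division of labour between nodes and edges can be emulated without LangGraph. The sketch below walks a dictionary of edges from a start marker to an end marker and applies each node's update along the way. It is a simplified illustration with stub nodes: `run_linear_graph` is hypothetical, START and END are local string constants standing in for LangGraph's markers, and only linear graphs such as the one above are supported.

```python
from typing import Any, Callable, Dict

# Local stand-ins for the start and end markers (not imported from LangGraph)
START, END = "__start__", "__end__"

Node = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_linear_graph(
    nodes: Dict[str, Node],
    edges: Dict[str, str],
    state: Dict[str, Any],
) -> Dict[str, Any]:
    """Follow the edges from START to END, merging each node's update into the state."""
    state = dict(state)
    current = edges[START]
    while current != END:
        update = nodes[current](state)  # the node decides *what* happens
        state.update(update)
        current = edges[current]  # the edge decides *what runs next*
    return state


# Stub nodes that mimic the shape of the real ones without calling an LLM
nodes = {
    "branch": lambda s: {"branches": [s["query"], s["query"] + " examples"]},
    "retrieve": lambda s: {"retrieved_docs": [f"doc for: {b}" for b in s["branches"]]},
    "merge": lambda s: {"context": "\n".join(s["retrieved_docs"])},
    "answer": lambda s: {"answer": f"{len(s['retrieved_docs'])} passages used"},
}
edges = {START: "branch", "branch": "retrieve", "retrieve": "merge", "merge": "answer", "answer": END}

result = run_linear_graph(nodes, edges, {"query": "Explain Branched RAG"})
print(result["answer"])  # 2 passages used
```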

Step 11: Running the Graph

Python
result = graph.invoke({
    "query": "Explain Branched RAG"
})

print(result["answer"])

Output:

Figure: Output (screenshot)
  • Branched RAG first splits the query into multiple focused sub queries.
  • Each sub query retrieves relevant information independently, covering different aspects of the topic.
  • These results are then merged into a single context, which the LLM uses to generate the final answer.
  • Because the answer is built from multiple retrieval paths, it tends to be more complete and accurate, and less prone to hallucinations.

You can download the code notebook from here
