Metadata Filtering in LangChain

Metadata filtering is a technique used in vector databases to refine and control search results based on structured attributes attached to documents. Instead of relying only on text similarity, it lets you apply conditions on attributes such as date, category, author or department, so that only documents meeting those conditions are returned. This is especially useful with large, heterogeneous datasets where documents share similar content but differ in key attributes such as source or time period.

Metadata: Structured key-value data attached to documents (e.g., source, department, year, tags, sensitivity).
Vector similarity: Semantic matching using embeddings (cosine, dot-product, L2) to find nearest vectors.
Filter predicates: Boolean or comparison expressions applied to metadata (e.g., equality, greater-than, in-list, logical operators like and, or).
Hybrid retrieval: Combining metadata filters with vector similarity to narrow the candidate set before/after ranking.
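
To make these terms concrete, here is a minimal pure-Python sketch of hybrid retrieval that filters on metadata first and then ranks the remaining vectors by cosine similarity. The toy index, the hybrid_search helper and the two-dimensional vectors are assumptions made for illustration; a real vector store does this internally and may filter before or after ranking.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical toy index: (vector, metadata) pairs.
index = [
    ([1.0, 0.0], {"department": "finance", "year": 2024}),
    ([0.9, 0.1], {"department": "marketing", "year": 2025}),
    ([0.0, 1.0], {"department": "finance", "year": 2025}),
]

def hybrid_search(query_vec, predicate, k=2):
    """Pre-filter by a metadata predicate, then rank the survivors by similarity."""
    candidates = [(vec, meta) for vec, meta in index if predicate(meta)]
    ranked = sorted(candidates, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [meta for _, meta in ranked[:k]]

# Only finance documents are candidates, even though a marketing vector is close to the query.
finance_hits = hybrid_search([1.0, 0.0], lambda m: m["department"] == "finance")
print(finance_hits)
```

The marketing vector is the second-closest to the query, but the predicate removes it before ranking, which is the behaviour the filters in the steps below rely on.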

Implementation

Let's see the implementation to understand how metadata filtering is done:

Step 1: Install Dependencies and Import Libraries

We first install the required packages and then import the classes we need.

Python

!pip install -qU langchain-community langchain-chroma

# Chroma is imported from the langchain-chroma package installed above
from langchain_chroma import Chroma
from langchain_core.documents import Document

Step 2: Prepare Documents with Metadata

Here we will:

Create four Document objects with text and metadata.
Give each metadata entry a source, department and year.
Use the metadata later for filtering, for example to retrieve only marketing or finance documents.

Python

documents = [
    Document(
        page_content="The new product launch is scheduled for Q3 2025 and focuses on AI-driven analytics.",
        metadata={"source": "report_2025.pdf",
                  "department": "marketing", "year": 2025},
    ),
    Document(
        page_content="Our Q1 2024 earnings were significantly boosted by the European market expansion.",
        metadata={"source": "financials_2024.pdf",
                  "department": "finance", "year": 2024},
    ),
    Document(
        page_content="Internal memo detailing the revised company social media policy for all departments.",
        metadata={"source": "policy_memo.txt",
                  "department": "hr", "year": 2024},
    ),
    Document(
        page_content="Quarterly report predicting strong growth in the Asian AI sector for 2025.",
        metadata={"source": "market_analysis.pdf",
                  "department": "marketing", "year": 2025},
    ),
]

Step 3: Define Embedding Function

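We will define a placeholder embedding class that returns fixed all-zero vectors, so the example runs without an embedding model. Because every text gets the same vector, similarity ranking is arbitrary in this tutorial; only the metadata filters decide which documents come back.

embed_documents embeds a list of texts.
embed_query embeds a single query.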

Python

class DummyEmbeddings:
    def embed_documents(self, texts):
        return [[0.0] * 10 for _ in texts]

    def embed_query(self, text):
        return [0.0] * 10
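
If you want the ranking to carry some meaning without calling a real model, one option is a deterministic bag-of-words hashing embedder. The sketch below is only an illustration: HashingEmbeddings, its dim parameter and the sample texts are hypothetical, and it is no substitute for a trained embedding model. It exposes the same two methods as DummyEmbeddings, so it should be usable wherever that class is used.

```python
import hashlib
import math
import re

class HashingEmbeddings:
    """Hypothetical toy embedder: hashes each word into one of `dim` buckets."""

    def __init__(self, dim=64):
        self.dim = dim

    def _embed(self, text):
        vec = [0.0] * self.dim
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            # md5 is used only for a stable hash; Python's built-in hash() is salted per run.
            digest = hashlib.md5(word.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dim
            vec[bucket] += 1.0
        # Normalise to unit length so a dot product equals cosine similarity.
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm else vec

    def embed_documents(self, texts):
        return [self._embed(t) for t in texts]

    def embed_query(self, text):
        return self._embed(text)

emb = HashingEmbeddings()
doc_vecs = emb.embed_documents(["quarterly earnings report", "social media policy"])
query_vec = emb.embed_query("earnings report")
scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
print(scores)
```

Texts that share words with the query score higher, so the first sample text will typically outrank the second for this query.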

Step 4: Create Vector Store and Index Documents

Here:

Chroma.from_documents() indexes all documents by storing their embeddings and metadata.
collection_name labels the dataset.
The print statement reports how many documents were passed in for indexing.

Python

vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=DummyEmbeddings(),
    collection_name="company_docs"
)
print(f"Total documents indexed: {len(documents)}")

Output:

Total documents indexed: 4

Step 5: Perform Standard Similarity Search

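This embeds the query, retrieves the 2 nearest documents, and prints the text and metadata of each match. With the fixed-zero embeddings from Step 3 every document is equally distant from the query, so which two documents are returned is arbitrary.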

Python

query = "What about our financials?"
results = vectorstore.similarity_search(query, k=2)

print("\n--- Standard Search (Top 2) ---")
for doc in results:
    print(f"Content: {doc.page_content[:50]}...")
    print(f"Metadata: {doc.metadata}")
    print("-" * 15)

Output: two documents from any department, since no metadata filter is applied.

Step 6: Apply Simple Metadata Filter

Here:

Chroma supports metadata-based filtering using a filter parameter.
The filter {"department": "finance"} restricts results to only those documents where metadata key department equals "finance".
This is useful when you only want results from a specific source, year or category.

Python

finance_filter = {"department": "finance"}

query = "What about our financials?"
finance_results = vectorstore.similarity_search(
    query,
    k=2,
    filter=finance_filter
)

print("\n--- Filtered Search (Department = 'finance') ---")
for doc in finance_results:
    print(f"Content: {doc.page_content[:50]}...")
    print(f"Metadata: {doc.metadata}")
    print("-" * 15)

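Output: only the finance document (source financials_2024.pdf) matches the filter, so a single result is printed even though k=2.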

Step 7: Complex Filtered Search Using Logical Operators

This step demonstrates advanced metadata filtering during a similarity search.

$and -> all conditions must be true
$eq -> exact match on the value (strings, numbers or booleans)
$gt -> greater than (numeric comparison)
The filter below selects marketing documents whose year is greater than 2024, that is, from 2025 onward.

Python

complex_filter = {
    "$and": [
        {"department": {"$eq": "marketing"}},
        {"year": {"$gt": 2024}}
    ]
}

query = "What's the forecast for next year?"
complex_results = vectorstore.similarity_search(
    query,
    k=2,
    filter=complex_filter
)

print("\n--- Complex Filtered Search (Department = 'marketing' AND Year > 2024) ---")
for doc in complex_results:
    print(f"Content: {doc.page_content[:50]}...")
    print(f"Metadata: {doc.metadata}")
    print("-" * 15)

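Output: both marketing documents from 2025 (report_2025.pdf and market_analysis.pdf) are returned; the finance and HR documents are excluded.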
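
To make the operator semantics explicit, the following pure-Python sketch evaluates a filter dictionary of this shape against a single metadata dictionary. The matches function is a hypothetical helper written for illustration; it is not how Chroma implements filtering, and it assumes the operator set $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and and $or.

```python
import operator

# Comparison operators used in the filter syntax, mapped to Python functions.
COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}

def matches(metadata, flt):
    """Return True if `metadata` satisfies the filter dict `flt` (illustrative only)."""
    for key, condition in flt.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # In this sketch, a missing field never satisfies an operator expression.
            if key not in metadata:
                return False
            for op, operand in condition.items():
                if not COMPARATORS[op](metadata[key], operand):
                    return False
        else:
            # Shorthand {"field": value} means equality.
            if metadata.get(key) != condition:
                return False
    return True

complex_filter = {
    "$and": [
        {"department": {"$eq": "marketing"}},
        {"year": {"$gt": 2024}},
    ]
}

print(matches({"department": "marketing", "year": 2025}, complex_filter))
print(matches({"department": "finance", "year": 2024}, complex_filter))
```

The first call satisfies both conditions and the second fails the department check, which mirrors why only the 2025 marketing documents pass the filter in this step.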

Step 8: Create a Retriever with Pre-set Filter

This step converts the vector store into a retriever with a pre-set filter that is applied to every query.

Ensures consistent filtering
Avoids repeating filters in every query
Ideal for policy-based or role-based retrieval

Python

marketing_2025_filter = {
    "$and": [
        {"department": {"$eq": "marketing"}},
        {"year": {"$gt": 2024}}
    ]
}

filtered_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 1,
        "filter": marketing_2025_filter
    }
)

query = "Tell me about the upcoming AI focus."
retrieved_docs = filtered_retriever.invoke(query)

print("\n--- Retriever Invocation with Pre-set Filter ---")
for doc in retrieved_docs:
    print(f"Content: {doc.page_content[:50]}...")
    print(f"Metadata: {doc.metadata}")
    print("-" * 15)

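Output: a single marketing document from 2025, since k is 1 and the pre-set filter is applied automatically.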

Explanation:

as_retriever() wraps Chroma into a reusable retrieval interface
search_kwargs defines default behavior
Every query automatically enforces the filter
Reduces errors and improves maintainability
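
For role-based retrieval, the filter itself can be generated from a policy table instead of being written by hand. The sketch below is one possible approach: the ROLE_DEPARTMENTS mapping, the role names and the filter_for_role helper are all hypothetical, and it assumes the store accepts the $in, $gte and $and operators.

```python
# Hypothetical role-to-department mapping; adapt it to your own access policy.
ROLE_DEPARTMENTS = {
    "analyst": ["finance", "marketing"],
    "hr_manager": ["hr"],
}

def filter_for_role(role, min_year=None):
    """Build a filter dict restricting results to the departments a role may see."""
    if role not in ROLE_DEPARTMENTS:
        raise ValueError(f"Unknown role: {role!r}")
    clauses = [{"department": {"$in": ROLE_DEPARTMENTS[role]}}]
    if min_year is not None:
        clauses.append({"year": {"$gte": min_year}})
    # A single clause is returned as-is; several clauses are joined with $and.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

analyst_filter = filter_for_role("analyst", min_year=2025)
print(analyst_filter)
```

The resulting dictionary could then be passed as the "filter" entry of search_kwargs, in the same way as marketing_2025_filter above, giving one retriever per role.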


Advantages

Precision: Filters out irrelevant documents, improving relevance.
Efficiency: Can reduce the search space, and so speed up retrieval, when the store filters before ranking.
Compliance: Helps enforce organizational access or data policies.
Customizability: Allows complex logical and numeric filters.
RAG readiness: Fits naturally into retrieval-augmented generation systems, where the retriever supplies context to the model.

Limitations

Index overhead: Adding metadata increases indexing complexity.
Syntax differences: Filter syntax varies across vector stores.
Strict typing: Requires consistent metadata field types.
Over-filtering risk: Too many filters can lead to no results.
Performance variance: Stores that apply filters after the vector search (post-filtering) can be slower or return fewer than k results.
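
The strict typing limitation is easy to hit in practice: a year stored as the string "2024" is a different value from the integer 2024, so a numeric comparison such as $gt is unlikely to behave as intended. One way to catch this before indexing is a small schema check like the sketch below; EXPECTED_TYPES and find_type_errors are hypothetical names, and the schema simply mirrors the fields used in this tutorial.

```python
# Assumed schema for this tutorial's metadata fields (hypothetical helper).
EXPECTED_TYPES = {"source": str, "department": str, "year": int}

def find_type_errors(metadatas, schema=EXPECTED_TYPES):
    """Return (index, field, problem) for every missing or wrongly typed field."""
    errors = []
    for i, meta in enumerate(metadatas):
        for field, expected in schema.items():
            if field not in meta:
                errors.append((i, field, "missing"))
            elif type(meta[field]) is not expected:
                errors.append((i, field, type(meta[field]).__name__))
    return errors

sample = [
    {"source": "report_2025.pdf", "department": "marketing", "year": 2025},
    # The year below is a string, which the check should flag.
    {"source": "financials_2024.pdf", "department": "finance", "year": "2024"},
]
type_errors = find_type_errors(sample)
print(type_errors)
```

Running such a check over [doc.metadata for doc in documents] before indexing would surface inconsistent fields early, rather than as silently empty filtered results.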
