Metadata filtering is a technique used in vector databases to refine and control search results based on structured attributes attached to documents. Instead of relying only on embedding similarity, you apply conditions on fields such as date, category, author, or department so that only documents matching those conditions are returned. This is especially useful with large, heterogeneous datasets where documents have similar content but differ in key attributes such as source or time period.
- Metadata: Structured key-value data attached to documents (e.g., source, department, year, tags, sensitivity).
- Vector similarity: Semantic matching using embeddings (cosine, dot-product, L2) to find nearest vectors.
- Filter predicates: Boolean or comparison expressions applied to metadata (e.g., equality, greater-than, in-list, logical operators like and, or).
- Hybrid retrieval: Combining metadata filters with vector similarity to narrow the candidate set before/after ranking.
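To make the four terms above concrete, here is a minimal pure-Python sketch of hybrid retrieval with pre-filtering. It is only an illustration under stated assumptions: the records, their 3-dimensional vectors, and the helper names cosine and filtered_search are all made up for this example, and a real vector database would use an index rather than a linear scan.

```python
import math

# Toy corpus: each record has a hand-written 3-d "embedding" plus metadata.
# The vectors are invented for illustration; real ones come from an embedding model.
RECORDS = [
    {"id": "a", "vector": [0.9, 0.1, 0.0], "metadata": {"department": "finance", "year": 2024}},
    {"id": "b", "vector": [0.8, 0.2, 0.1], "metadata": {"department": "marketing", "year": 2025}},
    {"id": "c", "vector": [0.1, 0.9, 0.2], "metadata": {"department": "finance", "year": 2023}},
]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def filtered_search(query_vector, predicate, k=2):
    """Pre-filter on metadata, then rank the survivors by cosine similarity."""
    candidates = [r for r in RECORDS if predicate(r["metadata"])]
    ranked = sorted(candidates, key=lambda r: cosine(query_vector, r["vector"]), reverse=True)
    return [r["id"] for r in ranked[:k]]


query = [1.0, 0.0, 0.0]
# No filter: the two nearest records win regardless of department.
print(filtered_search(query, lambda m: True))
# Finance only: "b" is dropped even though it is close to the query.
print(filtered_search(query, lambda m: m["department"] == "finance"))
```

The predicate runs before ranking here (pre-filtering). Applying it after taking the top k instead (post-filtering) would be a one-line change, but it could return fewer than k results.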
Implementation
Let's see the implementation to understand how metadata filtering is done:
Step 1: Install Dependencies and Import Libraries
We install the required packages and then import the classes used in the rest of the example. Chroma is imported from langchain_chroma, the package installed below.
!pip install -qU langchain-community langchain-chroma
from langchain_chroma import Chroma
from langchain_core.documents import Document
Step 2: Prepare Documents with Metadata
Here we:
- Create four Document objects, each with text and metadata.
- Give every metadata entry a source, department and year.
- Use this metadata for filtering later, for example to retrieve only marketing or finance documents.
documents = [
    Document(
        page_content="The new product launch is scheduled for Q3 2025 and focuses on AI-driven analytics.",
        metadata={"source": "report_2025.pdf",
                  "department": "marketing", "year": 2025},
    ),
    Document(
        page_content="Our Q1 2024 earnings were significantly boosted by the European market expansion.",
        metadata={"source": "financials_2024.pdf",
                  "department": "finance", "year": 2024},
    ),
    Document(
        page_content="Internal memo detailing the revised company social media policy for all departments.",
        metadata={"source": "policy_memo.txt",
                  "department": "hr", "year": 2024},
    ),
    Document(
        page_content="Quarterly report predicting strong growth in the Asian AI sector for 2025.",
        metadata={"source": "market_analysis.pdf",
                  "department": "marketing", "year": 2025},
    ),
]
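Because filters compare metadata values directly, it can help to check field types before indexing. The sketch below is one possible pre-flight check, not part of LangChain or Chroma: check_metadata and ALLOWED_TYPES are hypothetical names, and the assumption that only scalar values (string, number, boolean) are acceptable should be verified against the vector store you use.

```python
# Assumption: the target store accepts only scalar metadata values.
ALLOWED_TYPES = (str, int, float, bool)


def check_metadata(metadatas):
    """Return a list of problems: non-scalar values or fields whose type varies."""
    problems = []
    seen_types = {}  # field name -> type first seen for that field
    for index, metadata in enumerate(metadatas):
        for key, value in metadata.items():
            if not isinstance(value, ALLOWED_TYPES):
                problems.append(
                    f"record {index}: '{key}' has unsupported type {type(value).__name__}"
                )
                continue
            expected = seen_types.setdefault(key, type(value))
            if type(value) is not expected:
                problems.append(
                    f"record {index}: '{key}' is {type(value).__name__}, "
                    f"expected {expected.__name__}"
                )
    return problems


sample = [
    {"department": "marketing", "year": 2025},
    {"department": "finance", "year": "2024"},  # year stored as a string by mistake
    {"department": "hr", "tags": ["policy"]},   # list value, not a scalar
]
for problem in check_metadata(sample):
    print(problem)
```

With the documents list above, the same check could be run as check_metadata([doc.metadata for doc in documents]). A year stored as the string "2024" is the kind of inconsistency that a numeric comparison filter would likely miss.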
Step 3: Define Embedding Function
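We define a placeholder embedding class that returns the same zero vector for every text. This keeps the example free of model downloads and API keys, but it also means the similarity ranking carries no meaning here: only the metadata filters decide which documents come back. In a real application, replace it with an actual embedding model.
- embed_documents embeds a list of texts.
- embed_query embeds a single query.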
class DummyEmbeddings:
    def embed_documents(self, texts):
        return [[0.0] * 10 for _ in texts]

    def embed_query(self, text):
        return [0.0] * 10
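If you want rankings that at least vary between documents while still avoiding a real model, a deterministic hashing embedding is one option. The class below is a hypothetical stand-in written for this article (HashEmbeddings is not a LangChain class). It captures word overlap only, not meaning, and exposes the same two methods as DummyEmbeddings.

```python
import hashlib
import math


class HashEmbeddings:
    """Toy embedding: hashes each lowercase word into one of `dim` buckets."""

    def __init__(self, dim=32):
        self.dim = dim

    def _embed(self, text):
        vector = [0.0] * self.dim
        for word in text.lower().split():
            # hashlib gives the same bucket on every run, unlike built-in hash().
            digest = hashlib.md5(word.encode("utf-8")).hexdigest()
            vector[int(digest, 16) % self.dim] += 1.0
        # Normalize to unit length; an empty text stays an all-zero vector.
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector] if norm else vector

    def embed_documents(self, texts):
        return [self._embed(text) for text in texts]

    def embed_query(self, text):
        return self._embed(text)


emb = HashEmbeddings()
doc_vectors = emb.embed_documents(["quarterly earnings report", "social media policy"])
print(len(doc_vectors), len(doc_vectors[0]))
```

Texts that share words land in the same buckets and therefore score as closer, which is enough to see ranking and filtering interact in a demo.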
Step 4: Create Vector Store and Index Documents
Here:
- Chroma.from_documents() indexes all documents by storing their embeddings and metadata.
- collection_name labels the dataset.
- The print statement reports how many documents were passed in for indexing.
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=DummyEmbeddings(),
    collection_name="company_docs"
)
print(f"Total documents indexed: {len(documents)}")
Output:
Total documents indexed: 4
Step 5: Perform Standard Similarity Search
This embeds the query, retrieves the 2 nearest documents, and prints the text and metadata of each match. No metadata filter is applied, so documents from any department can appear.
query = "What about our financials?"
results = vectorstore.similarity_search(query, k=2)
print("\n--- Standard Search (Top 2) ---")
for doc in results:
    print(f"Content: {doc.page_content[:50]}...")
    print(f"Metadata: {doc.metadata}")
    print("-" * 15)
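Output:
Two documents are printed. Because every embedding is the same zero vector, all four documents are equally distant from the query, so which two appear is arbitrary rather than a reflection of relevance.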
Step 6: Apply Simple Metadata Filter
Here:
- Chroma supports metadata-based filtering using a filter parameter.
- The filter {"department": "finance"} restricts results to only those documents where metadata key department equals "finance".
- This is useful when you only want results from a specific source, year, or category.
finance_filter = {"department": "finance"}
query = "What about our financials?"
finance_results = vectorstore.similarity_search(
    query,
    k=2,
    filter=finance_filter
)
print("\n--- Filtered Search (Department = 'finance') ---")
for doc in finance_results:
    print(f"Content: {doc.page_content[:50]}...")
    print(f"Metadata: {doc.metadata}")
    print("-" * 15)
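Output:
On a freshly created collection, only the finance document (source financials_2024.pdf) should be printed. It is the only document whose department is "finance", so fewer than k=2 results come back.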
Step 7: Complex Filtered Search Using Logical Operators
This step combines several conditions in one filter during a similarity search.
- $and -> all listed conditions must be true
- $eq -> the field must equal the given value
- $gt -> the field must be greater than the given value (numeric comparison)
- Together, the filter below keeps marketing documents whose year is greater than 2024, i.e. 2025 onward.
complex_filter = {
    "$and": [
        {"department": {"$eq": "marketing"}},
        {"year": {"$gt": 2024}}
    ]
}
query = "What's the forecast for next year?"
complex_results = vectorstore.similarity_search(
    query,
    k=2,
    filter=complex_filter
)
print("\n--- Complex Filtered Search (Department = 'marketing' AND Year > 2024) ---")
for doc in complex_results:
    print(f"Content: {doc.page_content[:50]}...")
    print(f"Metadata: {doc.metadata}")
    print("-" * 15)
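Output:
On a freshly created collection, the two marketing documents from 2025 (report_2025.pdf and market_analysis.pdf) should be printed, since they are the only ones that satisfy both conditions. Their order is arbitrary with zero-vector embeddings.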
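To see what operators such as $and, $eq and $gt mean, it can help to evaluate a filter by hand. The function below is a simplified pure-Python sketch of such an evaluator, not Chroma's actual implementation. The names matches and OPERATORS are hypothetical, and the handling of missing fields is an assumption that a real store may treat differently.

```python
# Comparison operators; a missing field (None) never satisfies an ordering test.
OPERATORS = {
    "$eq": lambda actual, expected: actual == expected,
    "$ne": lambda actual, expected: actual != expected,
    "$gt": lambda actual, expected: actual is not None and actual > expected,
    "$gte": lambda actual, expected: actual is not None and actual >= expected,
    "$lt": lambda actual, expected: actual is not None and actual < expected,
    "$lte": lambda actual, expected: actual is not None and actual <= expected,
    "$in": lambda actual, expected: actual in expected,
}


def matches(metadata, where):
    """Return True if `metadata` satisfies the filter dict `where`."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, clause) for clause in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, clause) for clause in condition):
                return False
        elif isinstance(condition, dict):
            actual = metadata.get(key)
            for op, expected in condition.items():
                if not OPERATORS[op](actual, expected):
                    return False
        elif metadata.get(key) != condition:
            # A bare value is shorthand for equality, e.g. {"department": "finance"}.
            return False
    return True


marketing_after_2024 = {
    "$and": [
        {"department": {"$eq": "marketing"}},
        {"year": {"$gt": 2024}},
    ]
}
metadatas = [
    {"department": "marketing", "year": 2025},
    {"department": "finance", "year": 2024},
    {"department": "hr", "year": 2024},
    {"department": "marketing", "year": 2025},
]
print([matches(m, marketing_after_2024) for m in metadatas])
```

Run against the metadata of the four tutorial documents, the evaluator accepts the first and last entries, the two marketing documents from 2025.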
Step 8: Create a Retriever with Pre-set Filter
This step converts the vector store into a retriever with a pre-set filter that is applied to every query.
- Ensures consistent filtering
- Avoids repeating filters in every query
- Ideal for policy-based or role-based retrieval
marketing_2025_filter = {
    "$and": [
        {"department": {"$eq": "marketing"}},
        {"year": {"$gt": 2024}}
    ]
}
filtered_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 1,
        "filter": marketing_2025_filter
    }
)
query = "Tell me about the upcoming AI focus."
retrieved_docs = filtered_retriever.invoke(query)
print("\n--- Retriever Invocation with Pre-set Filter ---")
for doc in retrieved_docs:
    print(f"Content: {doc.page_content[:50]}...")
    print(f"Metadata: {doc.metadata}")
    print("-" * 15)
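Output:
A single marketing document from 2025 should be printed, since k is 1. Which of the two matching documents appears is arbitrary with zero-vector embeddings.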
Explanation:
- as_retriever() wraps Chroma into a reusable retrieval interface
- search_kwargs defines default behavior
- Every query automatically enforces the filter
- Reduces errors and improves maintainability
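For role-based retrieval, the pre-set filter can be generated from a policy table instead of being written by hand. The sketch below is only an illustration: ROLE_DEPARTMENTS and build_filter are hypothetical names, the roles are invented, and the $in and $gte operators are assumed to be supported by the vector store in the same way as $eq and $gt above.

```python
# Hypothetical policy table: which departments each role may read.
ROLE_DEPARTMENTS = {
    "analyst": ["marketing"],
    "controller": ["finance", "marketing"],
}


def build_filter(role, min_year=None):
    """Build a filter dict in the $and / $in / $gte style for a given role."""
    departments = ROLE_DEPARTMENTS.get(role)
    if not departments:
        raise PermissionError(f"No departments configured for role '{role}'")
    clauses = [{"department": {"$in": departments}}]
    if min_year is not None:
        clauses.append({"year": {"$gte": min_year}})
    # Return a single clause bare; some stores reject $and with one element.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


print(build_filter("analyst"))
print(build_filter("controller", min_year=2025))
```

The returned dict could then be passed as the "filter" entry of search_kwargs when calling as_retriever(), giving each role its own retriever.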
Advantages
- Precision: Filters out irrelevant documents, improving relevance.
- Efficiency: When the store applies the filter before or during the vector search, the search space shrinks and retrieval gets faster.
- Compliance: Enforces organizational access or data policies.
- Customizability: Allows complex logical and numeric filters.
- RAG readiness: Limits the context passed to a retrieval-augmented generation (RAG) pipeline to documents that meet the filter.
Limitations
- Index overhead: Storing and indexing metadata adds storage cost and indexing complexity.
- Syntax differences: Filter syntax varies across vector stores.
- Strict typing: Requires consistent metadata field types.
- Over-filtering risk: Too many or too strict filters can leave few or no results.
- Performance variance: Some databases apply filters only after the vector search (post-filtering), which is slower and can return fewer than k results.
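One way to soften the over-filtering risk is to try filters from strictest to loosest and stop at the first one that returns anything. The sketch below assumes a search callable that accepts query, k and filter keyword arguments, as similarity_search does above. The names search_with_fallback and fake_search, and the in-memory DOCS list, are hypothetical and exist only so the example runs on its own.

```python
def search_with_fallback(search, query, filters, k=2):
    """Try each filter in order (strictest first); return the first non-empty result."""
    for where in filters:
        results = search(query, k=k, filter=where)
        if results:
            return results, where
    return [], None


# Stand-in data and search function so the sketch is self-contained.
DOCS = [
    {"text": "Q1 earnings", "source": "financials_2024.pdf", "department": "finance"},
    {"text": "Launch plan", "source": "report_2025.pdf", "department": "marketing"},
]


def fake_search(query, k, filter):
    """Exact-match filtering only, no ranking; `filter=None` means no filter."""
    conditions = (filter or {}).items()
    hits = [d for d in DOCS if all(d.get(key) == value for key, value in conditions)]
    return hits[:k]


results, used = search_with_fallback(
    fake_search,
    "earnings",
    filters=[
        {"source": "financials_2025.pdf"},  # strictest: a specific file (not present)
        {"department": "finance"},          # looser: the whole department
        None,                               # last resort: no filter
    ],
)
print(used)
print([d["text"] for d in results])
```

Relaxing filters is only appropriate for relevance-oriented filters. A filter that enforces access control or compliance should not be loosened this way.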