This project is a fully local, privacy-first document Q&A system, designed to help you search, explore, and interact with your own documents - securely and efficiently. It supports real-time ingestion of PDF, DOCX, and TXT files, applies semantic chunking and vector embedding, and uses a locally hosted LLM for natural-language answers in both single-turn and multi-turn formats.
The system prioritizes modularity, observability, and full offline support, making it suitable for personal knowledge bases, secure enterprise settings, or research workflows - all without sending data to the cloud.
This system aims to become a powerful and private Retrieval-Augmented Generation (RAG) engine, capable of:
- Ingesting large collections of documents across folders
- Answering questions with real-time citations
- Summarizing or comparing multiple documents
- Operating fully offline, powered by local vector DBs and LLMs
- Providing traceability and observability via Phoenix & OpenTelemetry
The system is built from modular, testable components:
- Runs a multilingual model (e.g., `intfloat/multilingual-e5-base`)
- Accepts batch inputs via a local FastAPI server (see the embedder sketch after this list)
- Returns dense embeddings for semantic indexing
- Stores document chunk embeddings + metadata (filename, page, position)
- Supports efficient top-k retrieval based on similarity
- Used for both retrieval and metadata tracking (checksums, ingestion status)
- Runs your local LLM (e.g., Mistral, GPTQ, GGUF)
- Accessible via an OpenAI-compatible API (`/v1/chat/completions` or `/v1/completions`)
- Works in both chat and completion modes
- Upload files and folders
- Ask questions and receive cited answers
- Adjust LLM model, temperature, mode
- Switch between chat and completion
- Observability layer based on OpenTelemetry + Arize Phoenix
- Captures span metadata for ingestion, embedding, retrieval, and LLM steps
- Uses OpenInference schema for standardized analytics
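As a rough illustration, the embedder service can be exercised with a plain HTTP batch request. The endpoint path, payload shape, and response key below are assumptions for this sketch, not the service's confirmed contract:

```python
import requests

# Hypothetical local embedder endpoint; adjust host, port, and path to your deployment.
EMBEDDER_URL = "http://localhost:8000/embed"

def embed_batch(texts: list[str]) -> list[list[float]]:
    """Send a batch of document chunks to the local FastAPI embedder and return dense vectors."""
    # e5-family models expect a "passage: " prefix on document-side inputs ("query: " for queries).
    payload = {"texts": [f"passage: {t}" for t in texts]}
    resp = requests.post(EMBEDDER_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["embeddings"]  # assumed response key

if __name__ == "__main__":
    vectors = embed_batch(["First document chunk.", "Zweiter Abschnitt auf Deutsch."])
    print(len(vectors), "embeddings of dimension", len(vectors[0]))
```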
- 🔍 Semantic search over local documents
- 📎 Supports multiple formats: PDF, DOCX, TXT
- 💬 Chat Mode (multi-turn)
- 🧠 Completion Mode (single Q&A)
- 📁 Multi-file + folder ingestion, with parallel processing
- 🧾 Source attribution (filename + page or position)
- 🗃️ File deduplication by checksum + path tracking
- 🧱 Modular architecture (easy to swap models or vector DB)
- 📊 Tracing and observability with Phoenix
- 🔒 Fully local: no cloud APIs, no internet needed
- 🧼 Robust text preprocessing (PDF-first): header/footer stripping, page-number cleanup, hyphenation repair, conservative soft-wrap joining, table tagging, and removal of symbol-only / empty-bullet lines to prevent junk chunks.
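A minimal sketch of two of these cleanup steps, hyphenation repair and conservative soft-wrap joining; the regular expressions are illustrative assumptions rather than the pipeline's exact rules:

```python
import re

def repair_hyphenation(text: str) -> str:
    # Join words split across a line break with a trailing hyphen: "infor-\nmation" -> "information".
    return re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)

def join_soft_wraps(text: str) -> str:
    # Conservatively join single line breaks inside sentences (lowercase text continues on the next line),
    # while leaving blank lines (real paragraph breaks) untouched.
    return re.sub(r"(?<=[a-z,;])\n(?=[a-z])", " ", text)

raw = "The agreement covers infor-\nmation security\nand retention policies.\n\nNext paragraph."
print(join_soft_wraps(repair_hyphenation(raw)))
```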
- Upload one or more files and/or folders
- Files are recursively scanned, chunked, embedded, and indexed
- Ingestion is logged and deduplicated via checksum and path tracking
- Duplicate files (same checksum in different locations) are indexed and viewable on the duplicates page
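For illustration, checksum-based deduplication can be pictured as hashing each file's bytes and grouping paths that share a digest; this is a hypothetical sketch, not the project's actual ingestion code:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_checksum(path: Path, algo: str = "sha256") -> str:
    """Hash a file's bytes in chunks so large documents are not loaded fully into memory."""
    h = hashlib.new(algo)
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def scan_for_duplicates(root: Path) -> dict[str, list[Path]]:
    """Map each checksum to every path where that content appears; keep only duplicated entries."""
    seen: dict[str, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in {".pdf", ".docx", ".txt"}:
            seen[file_checksum(path)].append(path)
    return {c: paths for c, paths in seen.items() if len(paths) > 1}
```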
- Choose between chat or completion mode
- Type natural-language questions (e.g., "What is this contract about?")
- System retrieves the most relevant document chunks and builds a prompt (sketched after this list)
- LLM answers using local knowledge + sources
- Model, temperature, and mode are adjustable in the sidebar
- Supports any LLM with OpenAI-compatible endpoints
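The answer step can be pictured as: take the retrieved chunks, assemble a context-grounded prompt, and call the local OpenAI-compatible endpoint. The host, port, model name, chunk fields, and prompt template below are placeholder assumptions:

```python
import requests

LLM_URL = "http://localhost:5000/v1/chat/completions"  # e.g., Text-Generation-WebUI's OpenAI-compatible API

def answer(question: str, chunks: list[dict]) -> str:
    """Build a context-grounded prompt from retrieved chunks and query the local LLM."""
    context = "\n\n".join(f"[{c['filename']} p.{c['page']}] {c['text']}" for c in chunks)
    messages = [
        {"role": "system", "content": "Answer only from the provided context and cite your sources."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    resp = requests.post(
        LLM_URL,
        json={"model": "local", "messages": messages, "temperature": 0.2},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```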
- Python 3.10+
- Qdrant running (Docker or native)
- OpenSearch
- Redis server for Celery broker
- Celery worker for async embedding
- Text-Generation-WebUI with a loaded model
- Dockerized embedder API
- **Install dependencies**

  ```bash
  pip install -r requirements/app.txt
  ```

- **Start the services**

  Ensure Qdrant, OpenSearch, Redis, the embedder API, and your Text-Generation-WebUI are running. With Docker:

  ```bash
  docker-compose up qdrant opensearch redis embedder-api celery
  ```

- **Launch the Streamlit app**

  ```bash
  streamlit run main.py
  ```
Install development requirements and run the test suite:

```bash
pip install -r requirements/shared.txt -r requirements/dev.txt
pytest
```
- ✅ Ingestion supports mixed file/folder input, with deduplication
- ✅ File Index Viewer & manager UI for re-sync, stats, and delete
- ✅ Modular pipeline orchestrated by `ingestion.py`
- ✅ Batched embedding via API and Celery
- ✅ Phoenix tracing across ingestion and QA flows
- ✅ Vector store: Qdrant only (no SQLite)
- ✅ Hybrid search (BM25 + dense vectors; see the fusion sketch after this list)
- ✅ Source filenames and pages displayed with each answer
- ✅ Works with both chat and completion LLMs (e.g. Mistral, GPTQ)
- ✅ Query rewriting layer supports clarification and intent extraction
- ✅ Progress bar and estimated time remaining during ingestion
- Token-by-token streaming of answers is currently disabled
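As an illustration of the hybrid search mentioned above, one common way to merge BM25 and dense-vector result lists is reciprocal rank fusion (RRF). This sketch shows the generic technique and is not necessarily the fusion rule this project uses:

```python
from collections import defaultdict

def reciprocal_rank_fusion(bm25_ids: list[str], dense_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk IDs; items ranked high in either list score higher overall."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in (bm25_ids, dense_ids):
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: chunk "c2" appears near the top of both lists, so it is ranked first.
print(reciprocal_rank_fusion(["c1", "c2", "c3"], ["c2", "c4", "c1"]))
```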
The system includes a dedicated LLM-based query rewriter that improves search accuracy by:
- ✅ Detecting vague or ambiguous questions (e.g., “What about that contract?”)
- ✅ Asking for clarification when context is missing (e.g., “Who is ‘he’?”)
- ✅ Rewriting clean questions into compressed, keyword-rich search phrases
- All user queries are passed through a chat-tuned query rewriter
- The rewriter returns one of:
  - a clarification request, e.g. `{ "clarify": "Who are you referring to with 'he'?" }`
  - a rewritten query, e.g. `{ "rewritten": "Ali assistant professor work years" }`
- If clarification is needed, the main pipeline halts and returns the clarification message to the user
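A minimal sketch of how the pipeline might branch on the rewriter's JSON output; the function name and calling convention are assumptions for illustration:

```python
import json

def handle_rewriter_output(raw_llm_output: str, original_query: str) -> tuple[str, bool]:
    """Return (text, needs_clarification). Fall back to the original query if the JSON is malformed."""
    try:
        result = json.loads(raw_llm_output)
    except json.JSONDecodeError:
        return original_query, False  # rewriter misbehaved; search with the original query
    if "clarify" in result:
        return result["clarify"], True  # halt retrieval and surface the question to the user
    return result.get("rewritten", original_query), False

text, needs_clarification = handle_rewriter_output(
    '{ "rewritten": "Ali assistant professor work years" }',
    "how long has he worked?",
)
```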
- Reduces retrieval noise from vague or malformed queries
- Enhances accuracy when using local LLMs + vector search
- Handles grammar errors, typos, lack of punctuation, and missing context
- The `qa_chain` trace includes a "Rewrite Query" span. It records:
- Original user query
- Rewritten form
- Clarification flag (if applicable)
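A hedged sketch of recording such a span with the standard OpenTelemetry API; the span and attribute names the project actually emits (via Phoenix/OpenInference) may differ:

```python
from opentelemetry import trace

tracer = trace.get_tracer("qa_chain")

def rewrite_query_traced(original_query: str) -> dict:
    """Wrap the rewrite step in a span that records the original query, rewritten form, and clarify flag."""
    with tracer.start_as_current_span("Rewrite Query") as span:
        span.set_attribute("query.original", original_query)
        result = {"rewritten": "Ali assistant professor work years"}  # placeholder for the real LLM call
        span.set_attribute("query.rewritten", result.get("rewritten", ""))
        span.set_attribute("query.needs_clarification", "clarify" in result)
        return result
```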
These milestones are fully implemented and working in the system:
- Query rewriting (clarify + keywords)
- Hybrid search (BM25 + dense vectors)
- Index viewer & manager UI (status, re-sync, stats, delete)
- Embedder API + Celery pipeline
- Multi-file + folder ingestion
- Phoenix tracing (QA + ingestion)
- Deduplication + full path display
- Progress bar + ETA during ingestion
Next steps actively being planned or started:
- Reranker (cross-encoder or LLM-based)
- Per-document QA mode
- Session save/load for chat + files
Mid-term roadmap items queued for future sprints:
- Batch summarization (map-reduce) – summarize many documents at once
- Advanced chunking (semantic, LLM-aided) – segment text into retrieval-friendly pieces
- Offline Docker bundle (TGW + Embedder + Qdrant + OpenSearch) – one-command local deployment
- Agent workflows (document reasoning) – multi-step agents for deeper analysis