This project is a fully local, privacy-first document Q&A system, designed to help you search, explore, and interact with your own documents - securely and efficiently. It supports real-time ingestion of PDF, DOCX, and TXT files, applies semantic chunking and vector embedding, and uses a locally hosted LLM for natural-language answers in both single-turn and multi-turn formats.
The system prioritizes modularity, observability, and full offline support, making it suitable for personal knowledge bases, secure enterprise settings, or research workflows - all without sending data to the cloud.
This system aims to become a powerful and private Retrieval-Augmented Generation (RAG) engine, capable of:
- Ingesting large collections of documents across folders
- Answering questions with real-time citations
- Summarizing or comparing multiple documents
- Operating fully offline, powered by local vector DBs and LLMs
- Providing traceability and observability via Phoenix & OpenTelemetry
The system is built from modular, testable components:
- Runs a multilingual model (e.g., `intfloat/multilingual-e5-base`)
- Accepts batch inputs via a local FastAPI server (see the embedder sketch after this list)
- Returns dense embeddings for semantic indexing
- Stores document chunk embeddings + metadata (filename, page, position)
- Supports efficient top-k retrieval based on similarity
- Used for both retrieval and metadata tracking (checksums, ingestion status)
- Runs your local LLM (e.g., Mistral, GPTQ, GGUF)
- Accessible via an OpenAI-compatible API (`/v1/chat/completions` or `/v1/completions`)
- Works in both chat and completion modes
- Upload files and folders
- Ask questions and receive cited answers
- Adjust LLM model, temperature, mode
- Switch between chat and completion
- Observability layer based on OpenTelemetry + Arize Phoenix
- Captures span metadata for ingestion, embedding, retrieval, and LLM steps
- Uses OpenInference schema for standardized analytics
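As a rough illustration, the embedder service can be exercised with a plain HTTP batch request. The endpoint path, payload shape, and response key below are assumptions for this sketch, not the service's confirmed contract:

```python
import requests

# Hypothetical local embedder endpoint; adjust host, port, and path to your deployment.
EMBEDDER_URL = "http://localhost:8000/embed"

def embed_batch(texts: list[str]) -> list[list[float]]:
    """Send a batch of document chunks to the local FastAPI embedder and return dense vectors."""
    # e5-family models expect a "passage: " prefix on document-side inputs ("query: " for queries).
    payload = {"texts": [f"passage: {t}" for t in texts]}
    resp = requests.post(EMBEDDER_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["embeddings"]  # assumed response key

if __name__ == "__main__":
    vectors = embed_batch(["First document chunk.", "Zweiter Abschnitt auf Deutsch."])
    print(len(vectors), "embeddings of dimension", len(vectors[0]))
```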
- 🔍 Semantic search over local documents
- 📎 Supports multiple formats: PDF, DOCX, TXT
- 💬 Chat Mode (multi-turn)
- 🧠 Completion Mode (single Q&A)
- 📁 Multi-file + folder ingestion, with parallel processing
- 🧾 Source attribution (filename + page or position)
- 🗃️ File deduplication by checksum + path tracking
- 🧱 Modular architecture (easy to swap models or vector DB)
- 📊 Tracing and observability with Phoenix
- 🔒 Fully local: no cloud APIs, no internet needed
- 🧼 Robust text preprocessing (PDF-first): header/footer stripping, page-number cleanup, hyphenation repair, conservative soft-wrap joining, table tagging, and removal of symbol-only / empty-bullet lines to prevent junk chunks.
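A minimal sketch of two of these cleanup steps, hyphenation repair and conservative soft-wrap joining; the regular expressions are illustrative assumptions rather than the pipeline's exact rules:

```python
import re

def repair_hyphenation(text: str) -> str:
    # Join words split across a line break with a trailing hyphen: "infor-\nmation" -> "information".
    return re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)

def join_soft_wraps(text: str) -> str:
    # Conservatively join single line breaks inside sentences (lowercase text continues on the next line),
    # while leaving blank lines (real paragraph breaks) untouched.
    return re.sub(r"(?<=[a-z,;])\n(?=[a-z])", " ", text)

raw = "The agreement covers infor-\nmation security\nand retention policies.\n\nNext paragraph."
print(join_soft_wraps(repair_hyphenation(raw)))
```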
- Upload one or more files and/or folders
- Files are recursively scanned, chunked, embedded, and indexed
- Ingestion is logged and deduplicated via checksum and path tracking
- Duplicate files (same checksum in different locations) are indexed and viewable on the duplicates page
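For illustration, checksum-based deduplication can be pictured as hashing each file's bytes and grouping paths that share a digest; this is a hypothetical sketch, not the project's actual ingestion code:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_checksum(path: Path, algo: str = "sha256") -> str:
    """Hash a file's bytes in chunks so large documents are not loaded fully into memory."""
    h = hashlib.new(algo)
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def scan_for_duplicates(root: Path) -> dict[str, list[Path]]:
    """Map each checksum to every path where that content appears; keep only duplicated entries."""
    seen: dict[str, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in {".pdf", ".docx", ".txt"}:
            seen[file_checksum(path)].append(path)
    return {c: paths for c, paths in seen.items() if len(paths) > 1}
```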
- Choose between chat or completion mode
- Type natural-language questions (e.g., "What is this contract about?")
- System retrieves the most relevant document chunks and builds a prompt (sketched after this list)
- LLM answers using local knowledge + sources
- Model, temperature, and mode are adjustable in the sidebar
- Supports any LLM with OpenAI-compatible endpoints
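The answer step can be pictured as: take the retrieved chunks, assemble a context-grounded prompt, and call the local OpenAI-compatible endpoint. The host, port, model name, chunk fields, and prompt template below are placeholder assumptions:

```python
import requests

LLM_URL = "http://localhost:5000/v1/chat/completions"  # e.g., Text-Generation-WebUI's OpenAI-compatible API

def answer(question: str, chunks: list[dict]) -> str:
    """Build a context-grounded prompt from retrieved chunks and query the local LLM."""
    context = "\n\n".join(f"[{c['filename']} p.{c['page']}] {c['text']}" for c in chunks)
    messages = [
        {"role": "system", "content": "Answer only from the provided context and cite your sources."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    resp = requests.post(
        LLM_URL,
        json={"model": "local", "messages": messages, "temperature": 0.2},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```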
- Python 3.10+
- Qdrant running (Docker or native)
- OpenSearch
- Redis server for Celery broker
- Celery worker for async embedding
- Text-Generation-WebUI with a loaded model
- Dockerized embedder API
- **Install dependencies**

  ```bash
  pip install -r requirements/app.txt
  ```

- **Start the services**

  Ensure Qdrant, OpenSearch, Redis, the embedder API, and your Text-Generation-WebUI are running. With Docker:

  ```bash
  docker-compose up qdrant opensearch redis embedder-api celery
  ```

- **Launch the Streamlit app**

  ```bash
  streamlit run main.py
  ```
Install development requirements and run the test suite:

```bash
pip install -r requirements/shared.txt -r requirements/dev.txt
pytest
```
- ✅ Ingestion supports mixed file/folder input, with deduplication
- ✅ File Index Viewer & manager UI for re-sync, stats, and delete
- ✅ Modular pipeline orchestrated by `ingestion.py`
- ✅ Batched embedding via API and Celery
- ✅ Phoenix tracing across ingestion and QA flows
- ✅ Vector store: Qdrant only (no SQLite)
- ✅ Hybrid search (BM25 + dense vectors; see the fusion sketch after this list)
- ✅ Source filenames and pages displayed with each answer
- ✅ Works with both chat and completion LLMs (e.g. Mistral, GPTQ)
- ✅ Query rewriting layer supports clarification and intent extraction
- ✅ Progress bar and estimated time remaining during ingestion
- Token-by-token streaming of answers is currently disabled
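As an illustration of the hybrid search mentioned above, one common way to merge BM25 and dense-vector result lists is reciprocal rank fusion (RRF). This sketch shows the generic technique and is not necessarily the fusion rule this project uses:

```python
from collections import defaultdict

def reciprocal_rank_fusion(bm25_ids: list[str], dense_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk IDs; items ranked high in either list score higher overall."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in (bm25_ids, dense_ids):
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: chunk "c2" appears near the top of both lists, so it is ranked first.
print(reciprocal_rank_fusion(["c1", "c2", "c3"], ["c2", "c4", "c1"]))
```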
The system includes a dedicated LLM-based query rewriter that improves search accuracy by:
- ✅ Detecting vague or ambiguous questions (e.g., “What about that contract?”)
- ✅ Asking for clarification when context is missing (e.g., “Who is ‘he’?”)
- ✅ Rewriting clean questions into compressed, keyword-rich search phrases
- All user queries are passed through a chat-tuned query rewriter
- The rewriter returns one of:
  - a clarification request, e.g. `{ "clarify": "Who are you referring to with 'he'?" }`
  - a rewritten query, e.g. `{ "rewritten": "Ali assistant professor work years" }`
- If clarification is needed, the main pipeline halts and returns the clarification message to the user
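A minimal sketch of how the pipeline might branch on the rewriter's JSON output; the function name and calling convention are assumptions for illustration:

```python
import json

def handle_rewriter_output(raw_llm_output: str, original_query: str) -> tuple[str, bool]:
    """Return (text, needs_clarification). Fall back to the original query if the JSON is malformed."""
    try:
        result = json.loads(raw_llm_output)
    except json.JSONDecodeError:
        return original_query, False  # rewriter misbehaved; search with the original query
    if "clarify" in result:
        return result["clarify"], True  # halt retrieval and surface the question to the user
    return result.get("rewritten", original_query), False

text, needs_clarification = handle_rewriter_output(
    '{ "rewritten": "Ali assistant professor work years" }',
    "how long has he worked?",
)
```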
- Reduces retrieval noise from vague or malformed queries
- Enhances accuracy when using local LLMs + vector search
- Handles grammar errors, typos, lack of punctuation, and missing context
- The `qa_chain` trace includes a "Rewrite Query" span. It records:
- Original user query
- Rewritten form
- Clarification flag (if applicable)
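A hedged sketch of recording such a span with the standard OpenTelemetry API; the span and attribute names the project actually emits (via Phoenix/OpenInference) may differ:

```python
from opentelemetry import trace

tracer = trace.get_tracer("qa_chain")

def rewrite_query_traced(original_query: str) -> dict:
    """Wrap the rewrite step in a span that records the original query, rewritten form, and clarify flag."""
    with tracer.start_as_current_span("Rewrite Query") as span:
        span.set_attribute("query.original", original_query)
        result = {"rewritten": "Ali assistant professor work years"}  # placeholder for the real LLM call
        span.set_attribute("query.rewritten", result.get("rewritten", ""))
        span.set_attribute("query.needs_clarification", "clarify" in result)
        return result
```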
These milestones are fully implemented and working in the system:
- Query rewriting (clarify + keywords)
- Hybrid search (BM25 + dense vectors)
- Index viewer & manager UI (status, re-sync, stats, delete)
- Embedder API + Celery pipeline
- Multi-file + folder ingestion
- Phoenix tracing (QA + ingestion)
- Deduplication + full path display
- Progress bar + ETA during ingestion
Next steps actively being planned or started:
- Reranker (cross-encoder or LLM-based)
- Per-document QA mode
- Session save/load for chat + files
Mid-term roadmap items queued for future sprints:
- Batch summarization (map-reduce) – summarize many documents at once
- Advanced chunking (semantic, LLM-aided) – segment text into retrieval-friendly pieces
- Offline Docker bundle (TGW + Embedder + Qdrant + OpenSearch) – one-command local deployment
- Agent workflows (document reasoning) – multi-step agents for deeper analysis