A demo program that compares two embedding models, Embedding Gemma and Nomic Embed Text, on a generated 1000-line, multi-domain "random field" text dataset, with an interactive web UI for visualizing the results.
- Automated Data Generation: Creates 1000 lines of diverse, multi-domain random text content
- Dual Model Comparison:
  - Embedding Gemma (google/gemma-2b with custom embedding extraction)
  - Nomic Embed Text (nomic-ai/nomic-embed-text-v1.5)
- Interactive Web Dashboard: Real-time visualization with multiple analysis tabs
- Comprehensive Metrics:
  - Similarity analysis (cosine similarity distributions)
  - Clustering quality assessment (silhouette scores)
  - Performance benchmarking (encoding time)
  - Cross-model agreement analysis
pip install -r requirements.txt
python run_demo.py
Choose from three options:
- Interactive Web UI (recommended) - Full dashboard experience
- Terminal Only - Command-line results
- Web UI Only - If comparison was already run
The interactive web UI will be available at http://localhost:3000 and includes:
- 📊 Overview: Summary statistics and model comparison
- 🔍 Similarity Analysis: Distribution plots and cross-model agreement
- 🎯 Clustering: Silhouette score analysis
- ⚡ Performance: Encoding time comparisons
embedding-compare/
├── app.py # Flask web application
├── embedding_comparison.py # Main comparison orchestrator
├── embedding_models.py # Model implementations
├── data_generator.py # Random content generator
├── run_demo.py # Demo launcher script
├── requirements.txt # Python dependencies
├── templates/
│ └── index.html # Interactive web dashboard
└── README.md # This file
The system generates 1000 lines of diverse content spanning multiple domains:
- Technology (AI, ML, quantum computing, etc.)
- Science (physics, chemistry, biology, etc.)
- Medicine (cardiology, neurology, pharmacology, etc.)
- Business (marketing, finance, strategy, etc.)
- Environment (climate change, sustainability, etc.)
- Mean/std/min/max cosine similarity within each model
- Cross-model agreement (diagonal similarity between models)
- Similarity distribution histograms
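The within-model similarity statistics listed above can be sketched in plain Python. This is a minimal illustration, not the project's code: the function names `cosine_similarity` and `similarity_summary` are hypothetical, the toy vectors stand in for real embeddings, and the use of the population standard deviation over all distinct pairs is an assumption.

```python
import math
import statistics
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_summary(embeddings):
    """Mean/std/min/max of cosine similarity over all distinct embedding pairs."""
    sims = [cosine_similarity(a, b) for a, b in combinations(embeddings, 2)]
    return {
        "mean": statistics.fmean(sims),
        "std": statistics.pstdev(sims),  # assumption: population std
        "min": min(sims),
        "max": max(sims),
    }


# Toy vectors standing in for one model's embeddings.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
summary = similarity_summary(toy_embeddings)
print(summary)
```

The list of pairwise similarities built inside `similarity_summary` is also the data a distribution histogram would be drawn from.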
- Silhouette scores for 2-10 clusters
- K-means clustering quality assessment
- Comparative clustering performance
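To make the clustering metrics above concrete, here is a dependency-free sketch of K-means followed by a mean silhouette score, swept over several cluster counts. It is an illustration under assumptions: the project may well use a library implementation instead, the function names are hypothetical, the centers are initialized from evenly spaced points (a simplification chosen for determinism), and the toy data only supports a sweep of 2 to 4 clusters rather than 2 to 10.

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with evenly spaced initial centers; returns labels."""
    centers = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient; requires at least two non-empty clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Two well-separated toy blobs standing in for embeddings.
points = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
scores = {k: silhouette_score(points, kmeans(points, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

A higher silhouette score indicates tighter, better-separated clusters, so comparing the two models' scores at the same cluster count gives a rough view of which embedding space groups the content more cleanly.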
- Encoding time per model
- Speed ratio comparison
- Embedding dimensionality analysis
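The performance metrics above amount to timing each model's encode call and comparing the results. The sketch below shows one way to do that; `time_encoding`, `speed_ratio`, and the two `fake_encode_*` functions are hypothetical stand-ins, not the project's model classes.

```python
import time


def time_encoding(encode, texts):
    """Run encode(texts) once and return (embeddings, elapsed seconds)."""
    start = time.perf_counter()
    embeddings = encode(texts)
    elapsed = time.perf_counter() - start
    return embeddings, elapsed


def speed_ratio(seconds_a, seconds_b):
    """How many times longer model A took than model B."""
    return seconds_a / seconds_b if seconds_b > 0 else float("inf")


# Hypothetical encoders that return vectors of different dimensionality.
def fake_encode_a(texts):
    return [[float(len(t))] * 4 for t in texts]


def fake_encode_b(texts):
    return [[float(len(t))] * 2 for t in texts]


texts = ["quantum computing", "climate change", "cardiology"]
emb_a, secs_a = time_encoding(fake_encode_a, texts)
emb_b, secs_b = time_encoding(fake_encode_b, texts)

print(f"A: {secs_a:.6f}s, dim={len(emb_a[0])}")
print(f"B: {secs_b:.6f}s, dim={len(emb_b[0])}")
print(f"speed ratio A/B: {speed_ratio(secs_a, secs_b):.2f}")
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock intended for measuring short durations. A single run is noisy, so repeated runs would give a steadier figure.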
The system includes intelligent fallbacks:
- Gemma: Falls back to DistilBERT if Gemma models are unavailable
- Nomic: Falls back to all-MiniLM-L6-v2 if Nomic models are unavailable
This ensures the demo runs even with limited model availability.
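A fallback of this kind is commonly written as a try/except around the model loader. The following is a minimal sketch of that pattern, assuming a generic loader callable; `load_with_fallback` and `fake_loader` are hypothetical names, and the real loading code in `embedding_models.py` may differ.

```python
def load_with_fallback(loader, primary, fallback):
    """Try the primary model name first; on any failure, load the fallback.

    Returns (model, name_actually_loaded).
    """
    try:
        return loader(primary), primary
    except Exception as exc:  # broad on purpose: download, auth, or memory errors
        print(f"Could not load {primary!r} ({exc}); falling back to {fallback!r}")
        return loader(fallback), fallback


# Hypothetical loader simulating an environment where only the fallback exists.
AVAILABLE = {"all-MiniLM-L6-v2"}


def fake_loader(name):
    if name not in AVAILABLE:
        raise OSError(f"model {name!r} is not available")
    return f"<model {name}>"


model, used = load_with_fallback(
    fake_loader, "nomic-ai/nomic-embed-text-v1.5", "all-MiniLM-L6-v2"
)
print(used)
```

Returning the name that was actually loaded lets the results report which model produced the embeddings, which matters when interpreting a comparison that silently ran on a fallback.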
- Sampling: Uses every 10th line (~100 samples) for faster demo execution
- Batch Processing: Efficient batch encoding with configurable batch sizes
- Memory Management: Optimized for both CPU and GPU execution
- Error Handling: Comprehensive error handling with user-friendly messages
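The sampling and batching steps above can be illustrated with list slicing and a small generator. This is a sketch only: the helper names are hypothetical and the batch size of 32 is an assumed value, not one taken from the project.

```python
def sample_every_nth(lines, step=10):
    """Keep every `step`-th line, starting with the first."""
    return lines[::step]


def batched(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for the 1000 generated lines.
lines = [f"line {i}" for i in range(1000)]

sample = sample_every_nth(lines)          # 100 lines: 0, 10, 20, ...
batches = list(batched(sample, 32))       # assumed batch size of 32

print(len(sample), [len(b) for b in batches])
```

With 1000 input lines and a step of 10, the sample holds exactly 100 lines, and a batch size of 32 splits it into three full batches plus one batch of 4.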
After running the comparison, the following files are generated:
- comparison_results.json: Complete numerical results
- comparison_summary.md: Human-readable summary report
- random_field_content.txt: Generated content for reference
The web UI is compatible with all modern browsers and includes:
- Responsive design for mobile/tablet viewing
- Interactive Plotly.js charts
- Real-time progress updates
- Tabbed interface for organized results
- Initial model loading may take 1-2 minutes depending on internet connection
- Actual embedding comparison typically completes in under 5 minutes
- Web UI provides real-time progress feedback
- Results are cached for quick re-viewing