BM25 (Best Matching 25) is a ranking algorithm used in information retrieval systems to determine how relevant a document is to a given search query. It’s an improved version of the traditional TF-IDF (Term Frequency–Inverse Document Frequency) approach and is widely used in modern search engines and databases.
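- It weights term frequency with a saturation function, so repeated occurrences of a term give diminishing gains instead of a linearly growing score.
- It normalizes for document length, so long documents are not favored merely for containing more words.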
- It is widely used in tools like Elasticsearch, Whoosh and Lucene.
- It helps deliver more relevant results for keyword-based queries by weighting rare, informative terms more heavily than common ones.
In simple terms, BM25 ranks documents or web pages based on how well they match a user’s search terms, making it a cornerstone of effective search and retrieval systems.

Working of BM25
BM25 computes a relevance score between a query q and a document d by combining the following components:
1. Term Frequency (TF)
Term frequency measures how often a query term appears in a document. Intuitively, a document containing a query term multiple times is more likely to be relevant. However, BM25 introduces a saturation effect, i.e. beyond a certain point, additional occurrences of a term contribute less and less to the score. This prevents documents that repeat a term many times (often simply because they are long) from being unfairly favored.
Mathematically, the term frequency component is normalized using the formula:
TF(t,d)=\frac{\text{freq}(t,d)}{\text{freq}(t,d) + k_1 \cdot \left(1-b+b \cdot \frac{|d|}{\text{avgdl}}\right)}
where:
- t : query term
- d : document
- freq(t,d) : number of times term t appears in document d
- |d| : length of document d
- avgdl : average document length in the corpus
- k_1 : controls term frequency scaling
- b : controls document length normalization
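The saturation effect is easier to see numerically. The following is a minimal sketch of the TF component exactly as written above; the function name `bm25_tf` is hypothetical, and the defaults k_1 = 1.2 and b = 0.75 are commonly used values assumed here, not values given in this article.

```python
def bm25_tf(freq, doc_len, avgdl, k1=1.2, b=0.75):
    """TF component: freq / (freq + k1 * (1 - b + b * doc_len / avgdl)).

    k1 and b defaults are assumed, commonly used values.
    """
    norm = k1 * (1.0 - b + b * doc_len / avgdl)
    return freq / (freq + norm)


# For an average-length document, each extra occurrence adds less than the last,
# and the value approaches (but never reaches) 1.
for freq in (1, 2, 5, 10, 50):
    print(freq, round(bm25_tf(freq, doc_len=100, avgdl=100), 3))
```

Because the count appears in both the numerator and the denominator, the component is bounded above by 1 no matter how often the term is repeated.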
2. Inverse Document Frequency (IDF)
Inverse document frequency measures the importance of a term across the entire corpus. Rare terms are considered more informative than common ones. For example, the word "the" appears in almost every document and thus carries little value, whereas a rare term like "quantum" is more indicative of relevance.
The IDF component is calculated as:
IDF(t)=\log\left(\frac{N-n_t+0.5}{n_t+0.5}\right)
where:
- N : total number of documents in the corpus
- n_t : number of documents containing term t
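As a rough illustration, the IDF formula above can be written in a few lines of Python. The function name `bm25_idf` and the corpus sizes in the example are hypothetical. Note that this form goes negative when a term appears in more than half of the documents; some implementations add 1 inside the logarithm to avoid that, which is a variation not covered by the formula here.

```python
import math


def bm25_idf(n_docs, doc_freq):
    """IDF(t) = log((N - n_t + 0.5) / (n_t + 0.5)), as given above."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical corpus of 1000 documents.
rare = bm25_idf(1000, 10)     # term found in 10 documents: large positive weight
common = bm25_idf(1000, 900)  # term found in 900 documents: negative weight
print(rare, common)
```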
3. Document Length Normalization
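BM25 accounts for document length by normalizing scores to prevent longer documents from dominating the rankings. This is controlled by the parameter b: with b = 0 no length normalization is applied, and with b = 1 the term frequency is fully scaled by the ratio of the document's length to the average document length.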
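To make the role of b concrete, the sketch below evaluates the TF component from step 1 for a short and a long document with the same raw term count. The function name, the document lengths, and the value k_1 = 1.2 are illustrative assumptions.

```python
def bm25_tf(freq, doc_len, avgdl, k1, b):
    """TF component from step 1, with the length factor made explicit."""
    length_factor = 1.0 - b + b * doc_len / avgdl  # equals 1 for an average-length document
    return freq / (freq + k1 * length_factor)


avgdl = 100
for b in (0.0, 0.75, 1.0):
    short_doc = bm25_tf(3, doc_len=50, avgdl=avgdl, k1=1.2, b=b)
    long_doc = bm25_tf(3, doc_len=500, avgdl=avgdl, k1=1.2, b=b)
    print(b, round(short_doc, 3), round(long_doc, 3))
```

With b = 0 both documents get the same value, since length is ignored; as b grows toward 1, the same three occurrences count for less in the long document and for more in the short one.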
4. Final Score Calculation
The final BM25 score for a document d with respect to a query q is computed as:
Score(q,d) = \sum_{t \in q} IDF(t) \cdot TF(t,d)
This sums the contributions of all query terms t in q.
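Putting the pieces together, here is a minimal end-to-end sketch that scores a toy corpus using the TF and IDF formulas from this section. It is an illustration under stated assumptions, not a production implementation: the function name `bm25_scores`, the whitespace tokenization, the five-sentence corpus, and the defaults k_1 = 1.2 and b = 0.75 are all assumed for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per tokenized document in `docs` for the tokenized `query`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # n_t: number of documents that contain each term at least once
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        counts = Counter(d)
        norm = k1 * (1.0 - b + b * len(d) / avgdl)
        score = 0.0
        for t in query:
            freq = counts[t]
            if freq == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
            score += idf * freq / (freq + norm)
        scores.append(score)
    return scores


# Toy corpus, tokenized by whitespace (an assumption made for brevity).
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "quantum computing uses qubits".split(),
    "the weather is sunny today".split(),
    "dogs and birds are animals".split(),
]
query = "quantum cat".split()

scores = bm25_scores(query, corpus)
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The document about quantum computing ranks first because "quantum" occurs in only one document and so carries a higher IDF than "cat", which occurs in two. Of the two documents containing "cat", the shorter one ranks higher because of length normalization.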
BM25 vs. Modern Dense Retrieval
Let's see the comparison between BM25 and Modern Dense Retrieval.
| Aspect | BM25 (Sparse/Term-based) | Dense/Embedding-based Retrieval |
|---|---|---|
| Representation | Term / lexical features (inverted index) | Dense vector embeddings (semantic features) |
| Semantic matching | Exact term or near‐term matches | Captures synonyms, paraphrases, conceptual similarity |
| Computation cost | Low (inverted index lookups) | Higher (embedding generation, similarity search, GPU usage) |
| Interpretability | High — scoring formula transparent | Often lower — model internal weights less interpretable |
| Storage / indexing | Sparse index structure, efficient | Requires storing high-dimensional vectors, approximate nearest-neighbour (ANN) structures |
| Hybrid usage | Often used for first‐stage retrieval | Often used for re‐ranking or full retrieval in semantic tasks |
Applications
- Web search engines and open-source search infrastructure such as Apache Lucene and Elasticsearch use it for initial document ranking.
- Enterprise search systems use it to retrieve documents across internal corpora (intranets, knowledge bases).
- E-commerce search and recommendation systems use it to match search queries to product descriptions.
- It often serves as first-stage candidate retrieval before more expensive processing is applied, a pattern widely used in question-answering systems and information-retrieval pipelines.
Advantages
- Robust and reliable: Works well across many datasets and retrieval tasks.
- Efficient and scalable: Computationally simpler than many neural retrieval methods, making it practical for large-scale search.
- Tunable: The k_1 and b parameters allow adaptation to domain or document-type characteristics.
- Interpretable: Because it is based on well‐understood statistical components, it is easier to debug and understand compared to many “black-box” models.
Limitations
- Lexical only: It matches terms, not concepts, so synonyms, paraphrases and semantic relatedness are not captured.
- No user personalization or context awareness: The model does not incorporate user signals, query history or implicit context by default.
- Corpus characteristics matter: Document length, term distribution and corpus size can significantly influence performance.
- Does not use dense embeddings: Cannot capture more abstract semantic relationships the way embedding‐based/dense retrieval methods can.