From TF-IDF to BM25
In our last post, we built an inverted index and used it to calculate TF-IDF scores. TF-IDF is a fantastic starting point, but it has limitations. It doesn't account for document length (a long document can pile up a high score simply by repeating terms), and it assumes that term frequency scales linearly with relevance, which isn't really true: the difference between 1 and 10 mentions matters far more than the difference between 100 and 110.
Enter Okapi BM25 (Best Match 25), a ranking function that improves upon these ideas and has been the state-of-the-art for keyword search for decades. It's a probabilistic model that scores documents based on the query terms they contain, but with some clever adjustments.
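To make the term-frequency point concrete, here is a tiny sketch (the spam and focused documents and the raw_tf helper are hypothetical, purely for illustration): raw term frequency rewards repetition without limit, which is exactly the behaviour BM25 is designed to dampen.
# Hypothetical documents: one spammy, one short and focused.
spam = "dog " * 100
focused = "the lazy dog slept"

def raw_tf(text, term):
    # Raw term frequency grows linearly with every repetition.
    return text.split().count(term)

print(raw_tf(spam, "dog"), raw_tf(focused, "dog"))  # 100 1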
import numpy as np
import re
from collections import defaultdict
Setup
We'll start with the same corpus and pre-processing from the last post to see the differences in scoring.
documents = {
    0: "the quick brown fox jumped over the lazy dog",
    1: "the lazy dog slept in the sun",
    2: "the sun is a star and the fox is an animal"
}
def tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())
tf_index = defaultdict(lambda: defaultdict(int))
doc_lengths = defaultdict(int)
for doc_id, text in documents.items():
    tokens = tokenize(text)
    doc_lengths[doc_id] = len(tokens)
    for token in tokens:
        tf_index[token][doc_id] += 1
N = len(documents)
avg_doc_length = sum(doc_lengths.values()) / N
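Before moving on, a quick sanity check on the quantities BM25 will rely on (just printing the variables defined above):
print(dict(doc_lengths))   # {0: 9, 1: 7, 2: 11}
print(avg_doc_length)      # 9.0

# Document frequency (how many documents contain a term) is what drives IDF.
for term in ["the", "lazy", "quick"]:
    print(term, len(tf_index[term]))   # the 3, lazy 2, quick 1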
The BM25 Formula
The score for a document D given a query Q with terms q_1, ..., q_n is:

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{\text{TF}(q_i, D) \cdot (k_1 + 1)}{\text{TF}(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avg\_dl}}\right)}$$
It looks intimidating, but let's break it down:
- IDF(q_i): This is the Inverse Document Frequency, which we already know. It penalizes common terms. BM25 often uses a slightly different variant, log(1 + (N - df + 0.5) / (df + 0.5)), which handles edge cases well: unlike the classic BM25 IDF, it never goes negative for terms that appear in more than half of the documents.
- TF(q_i, D): The Term Frequency of the query term in the document.
- |D| and avg_dl: The length of document D and the average document length. This part normalizes for document length.
- k1 and b: These are hyperparameters.
- k1 (typically ~1.2-2.0): Controls term frequency saturation. A lower k1 means the score saturates faster. It dictates how much a higher TF can boost the score.
- b (typically ~0.75): Controls the penalty for document length. b=1 applies the full length penalty, while b=0 applies none. (Both knobs are illustrated in the sketch below.)
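Before wiring this into a class, it helps to see how the two halves of the formula behave in isolation. The sketch below (the bm25_idf and tf_component helpers are just illustrative names, reusing np and N from the setup above) prints the IDF variant for rare versus ubiquitous terms and shows how the TF component saturates as a term is repeated.
def bm25_idf(df, N):
    # The BM25 IDF variant: large for rare terms, small (but never
    # negative) for terms that appear in every document.
    return np.log(1 + (N - df + 0.5) / (df + 0.5))

for df in [1, 2, 3]:
    print(f"df={df}: idf={bm25_idf(df, N):.4f}")
# df=1: idf=0.9808   df=2: idf=0.4700   df=3: idf=0.1335

def tf_component(tf, k1=1.5, b=0.75, doc_len=9, avg_dl=9.0):
    # The saturating TF part of the formula (here doc_len == avg_dl,
    # so the length factor is neutral).
    return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_dl))

for tf in [1, 10, 100]:
    print(f"tf={tf}: {tf_component(tf):.3f}")
# Going from 1 to 10 occurrences helps a lot; 10 to 100 barely moves it,
# and the component can never exceed k1 + 1.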
Let's implement it.
class BM25Scorer:
    def __init__(self, documents, k1=1.5, b=0.75):
        self.documents = documents
        self.k1 = k1
        self.b = b
        self.N = len(documents)
        self.avg_doc_length = 0
        self.doc_lengths = defaultdict(int)
        self.tf_index = defaultdict(lambda: defaultdict(int))
        self.idf = defaultdict(float)
        self._build_index()

    def _build_index(self):
        """Pre-calculates TF, IDF, and document lengths."""
        all_tokens = []
        for doc_id, text in self.documents.items():
            tokens = tokenize(text)
            self.doc_lengths[doc_id] = len(tokens)
            all_tokens.extend(tokens)
            for token in tokens:
                self.tf_index[token][doc_id] += 1
        self.avg_doc_length = sum(self.doc_lengths.values()) / self.N

        # Calculate IDF for all unique terms
        unique_terms = set(all_tokens)
        for term in unique_terms:
            df = len(self.tf_index[term])
            # Using the BM25 IDF variant
            self.idf[term] = np.log(1 + (self.N - df + 0.5) / (df + 0.5))

    def score_document(self, query_tokens, doc_id):
        """Calculates the BM25 score for a single document."""
        score = 0.0
        doc_len = self.doc_lengths[doc_id]
        for term in query_tokens:
            if term in self.tf_index:
                tf = self.tf_index[term].get(doc_id, 0)
                idf = self.idf[term]
                # Numerator of the TF component
                tf_component_num = tf * (self.k1 + 1)
                # Denominator of the TF component
                tf_component_den = tf + self.k1 * (1 - self.b + self.b * doc_len / self.avg_doc_length)
                score += idf * (tf_component_num / tf_component_den)
        return score

    def search(self, query):
        """Scores all documents for a given query and returns a ranked list."""
        query_tokens = tokenize(query)
        scores = {}
        for doc_id in self.documents.keys():
            scores[doc_id] = self.score_document(query_tokens, doc_id)
        # Return sorted documents by score, descending
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)
Running a Search
scorer = BM25Scorer(documents)
query = "lazy dog"
results = scorer.search(query)
for doc_id, score in results:
    print(f"Doc {doc_id}: Score = {score:.4f} | Text: '{documents[doc_id]}' \n")
Doc 1: Score = 1.0445 | Text: 'the lazy dog slept in the sun'
Doc 0: Score = 0.9400 | Text: 'the quick brown fox jumped over the lazy dog'
Doc 2: Score = 0.0000 | Text: 'the sun is a star and the fox is an animal'
Analysis of the Results
For the query "lazy dog":
- Doc 1 ("the lazy dog slept in the sun") gets the highest score. Both query terms are present, and the document is relatively short (6 words).
- Doc 0 ("the quick brown fox jumped over the lazy dog") gets the second-highest score. It also contains both terms, but it's a longer document (9 words), so the
b
parameter penalizes it slightly, suggesting the terms are less central to the document's topic. - Doc 2 gets a score of 0.0 because neither "lazy" nor "dog" appear in it. This demonstrates the power of BM25: it intelligently combines term frequency, inverse document frequency, and document length normalization to produce a highly effective relevance score.
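Since both Doc 0 and Doc 1 contain "lazy" and "dog" exactly once each, setting b=0 should make them tie (a quick sketch using the scorer above):
no_length_penalty = BM25Scorer(documents, k1=1.5, b=0.0)
for doc_id, score in no_length_penalty.search("lazy dog"):
    print(f"Doc {doc_id}: Score = {score:.4f}")
# With b=0 the length factor vanishes, so Doc 0 and Doc 1 receive
# identical scores and Doc 2 still scores 0.0.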
BM25 is a powerful and robust baseline for any search application. While newer models based on neural networks and embeddings are emerging, BM25 remains incredibly hard to beat for pure keyword search. In our next post, we will begin to explore that new frontier: vector search and semantic understanding.