
Building an Inverted Index from Scratch

In our introduction, we established that Information Retrieval (IR) systems are built on a foundation of computer science, math, and linguistics. Now, we'll build the single most important data structure in all of IR: the Inverted Index.

An inverted index, as its name suggests, inverts a page-centric data structure (page -> words) to a word-centric one (word -> pages). It's what allows a search engine to find all documents containing "python" in milliseconds, without having to scan every single document in the collection.
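To make the inversion concrete, here is a minimal sketch (using a tiny hypothetical two-document example, not the corpus we index below) of the same data seen both ways: a forward index maps each document to its words, while an inverted index maps each word back to the documents containing it.

# Forward index: document -> the words it contains
forward_index = {
    0: ["search", "engine"],
    1: ["search", "query"],
}

# Inverted index: word -> the documents that contain it
inverted_example = {
    "search": [0, 1],
    "engine": [0],
    "query": [1],
}

# "Which documents contain 'search'?" becomes a single dictionary lookup
inverted_example["search"]

[0, 1]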

import numpy as np
import re
from collections import defaultdict

Our Collection of Documents

Let's start with a simple collection of documents, which we'll call a corpus. In a real-world scenario, this could be millions of web pages, products, or articles. For our example, we'll use a few simple sentences.

documents = {
    0: "the quick brown fox jumped over the lazy dog",
    1: "the lazy dog slept in the sun",
    2: "the sun is a star and the fox is an animal"
}

Text Preprocessing and Tokenization

Before we can index, we need to clean and break our text into individual words, or "terms". This process is called tokenization. A simple approach is to convert to lowercase and split by non-alphanumeric characters.

def tokenize(text):
    """A simple tokenizer: lowercase and split by non-alphanumeric chars."""
    return re.findall(r'\b\w+\b', text.lower())
 
tokenize(documents[0])

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

Building a Simple Inverted Index

The core of an inverted index is a dictionary (or hash map) where:

  • The keys are the terms (words) in our vocabulary.
  • The values are a list of document IDs where the term appears. This list is called a "postings list".
inverted_index = defaultdict(set)
 
for doc_id, text in documents.items():
    tokens = tokenize(text)
    for token in tokens:
        inverted_index[token].add(doc_id)
 
sorted(list(inverted_index['fox']))

[0, 2]

With this index, if a user searches for "fox", we can instantly retrieve doc IDs 0 and 2 without scanning the documents. This is the fundamental power of the inverted index.
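The same lookup extends naturally to multi-word queries. Below is a minimal sketch (a small helper of our own, not part of the index construction above) of a boolean AND query over the inverted_index we just built: we intersect the postings sets of the query terms, so only documents containing every term remain.

def boolean_and_search(query, index):
    """Return the set of doc IDs that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Start from the first term's postings and intersect the rest
    result = set(index[terms[0]])
    for term in terms[1:]:
        result &= index[term]
    return result
 
sorted(boolean_and_search("the fox", index=inverted_index))

[0, 2]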

Adding Term Frequency (TF)

Knowing that a word is in a document is good. Knowing how many times it appears is better. This is called Term Frequency (TF). A document that mentions "python" 10 times is likely more relevant than one that mentions it once.

We'll modify our index to store not just the doc ID, but also the term's frequency within that document. The structure will be: term -> {doc_id_1: tf_1, doc_id_2: tf_2, ...}

tf_index = defaultdict(lambda: defaultdict(int))
 
for doc_id, text in documents.items():
    tokens = tokenize(text)
    for token in tokens:
        tf_index[token][doc_id] += 1
 
# Term frequency of 'fox' in document 0
tf_index['fox'][0]

1
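
To see the full nested structure described above, we can inspect the complete entry for a common term (a quick check of our own, not part of the original walkthrough). 'the' occurs twice in each of our three documents, so its postings map every doc ID to a count of 2:

dict(tf_index['the'])

{0: 2, 1: 2, 2: 2}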

Calculating Inverse Document Frequency (IDF)

Some words are more important than others. The word "the" appears in all our documents, so it doesn't help distinguish them. The word "fox" is rarer and thus more discriminative. We can capture this with Inverse Document Frequency (IDF).

The formula for IDF is:

IDF(t) = log( N / (df(t) + 1) )

Where:

  • N = Total number of documents in the corpus.
  • df(t) = Document Frequency of term t (the number of documents containing t).
  • We add 1 to the denominator as a smoothing term, so a term that appears in no document (for example, an unseen query word) does not cause a division by zero.
N = len(documents)
idf = {}
 
# df(t) is simply the length of the postings list for term t
for term, postings in tf_index.items():
    df = len(postings)
    idf[term] = np.log(N / (df + 1))
 
 
print(f"IDF for 'fox': {idf['fox']:.4f} (rare, high score)")

IDF for 'fox': 0.0000 (rare, high score)

print(f"IDF for 'the': {idf['the']:.4f} (common, low score)")

IDF for 'the': -0.2877 (common, low score)

In a corpus this small, the +1 smoothing pulls every score toward (or below) zero: 'fox' appears in 2 of our 3 documents, so its IDF is log(3/3) = 0, while 'the' appears in all 3 and goes negative. The absolute values matter less than the ordering: rarer terms always score higher than common ones, and in a realistically large corpus rare terms receive clearly positive scores.

Putting It All Together: TF-IDF

TF-IDF is a score that combines both metrics to determine a term's overall importance to a specific document within the context of the entire corpus.

The formula is simple:

TF-IDF(t, d) = TF(t, d) * IDF(t)

Where:

  • t = term
  • d = document

Let's calculate the TF-IDF score for the term 'fox' in document 0.

term_to_check = 'fox'
doc_to_check = 0
 
# 1. Get TF
tf = tf_index[term_to_check][doc_to_check]
 
# 2. Get IDF
idf_score = idf[term_to_check]
 
# 3. Calculate TF-IDF
tf_idf_score = tf * idf_score
 
print(f"TF = {tf}")
print(f"IDF = {idf_score:.4f}")
print(f"TF-IDF Score = {tf_idf_score:.4f}")

TF = 1
IDF = 0.0000
TF-IDF Score = 0.0000

Now let's calculate the TF-IDF for 'the' in the same document:

term_to_check_2 = 'the'
tf_2 = tf_index[term_to_check_2][doc_to_check]
idf_score_2 = idf[term_to_check_2]
tf_idf_score_2 = tf_2 * idf_score_2
 
print(f"TF = {tf_2}")
print(f"IDF = {idf_score_2:.4f}")
print(f"TF-IDF Score = {tf_idf_score_2:.4f}")

TF = 2
IDF = -0.2877
TF-IDF Score = -0.5754

As you can see, even though 'the' appears more frequently in document 0 (TF = 2), its TF-IDF score is lower than that of 'fox' (TF = 1) because 'the' is a very common word. This weighting scheme is the foundation of classic ranking models such as BM25.
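
As a small preview of ranking, here is a minimal sketch (an illustrative helper, not a full retrieval model) that scores every document against a query by summing the TF-IDF contributions of the query terms, then sorts the documents by score:

def score_documents(query, tf_index, idf):
    """Rank documents by the sum of TF-IDF scores of the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        if term not in idf:
            continue  # skip terms we never indexed
        for doc_id, tf in tf_index[term].items():
            scores[doc_id] += tf * idf[term]
    # Highest-scoring documents first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
 
score_documents("quick fox", tf_index, idf)

On our toy corpus a query like "quick fox" ranks document 0 first, since 'quick' occurs only there and therefore carries a positive IDF, while 'fox' (IDF of 0) contributes nothing to the ordering.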

We have successfully built an inverted index from scratch and used it to calculate TF-IDF scores. This data structure is the engine that powers keyword-based search. In our next post, we will explore how to use these scores to rank documents against a user query.

Anton Nesterov © 2025 | vski·science