Full-Text Search in Databases

So far, we've built retrieval systems from scratch to understand the core components like inverted indexes and ranking functions like BM25. But what if you don't want to manage a separate search service? Many modern databases, like PostgreSQL, have powerful, built-in Full-Text Search (FTS) capabilities that leverage the very same principles.

In this post, we'll explore how to use PostgreSQL's FTS engine. It handles the complex parts - tokenization, stemming, and creating an inverted index behind the scenes, allowing you to query natural language text directly with SQL.

The Core Concepts: tsvector and tsquery

PostgreSQL's FTS is primarily based on two data types:

tsvector:
This is the indexed document representation. PostgreSQL processes a string of text, tokenizes it, reduces words to their root form (stemming), removes stop words, and stores the result as a tsvector. It's essentially a sorted list of unique lexemes (the normalized words) with their positions in the document. This is PostgreSQL's version of the document side of an inverted index.

tsquery:
This represents the user's search query. It takes a query string and converts it into a set of lexemes, along with boolean operators (& for AND, | for OR, ! for NOT) to combine them.

Let's see them in action:

-- Create a tsvector from a document
SELECT to_tsvector('english', 'the quick brown fox jumped over the lazy dog');

SELECT to_tsquery('english', 'foxes & jumping');

Notice: 'foxes' becomes 'fox' and 'jumping' becomes 'jump'.

Setting up a Searchable Table

Let's create a table of documents and prepare it for search. We'll add a dedicated tsvector column that will store the indexed representation of our text.

CREATE TABLE articles (
    id SERIAL PRIMARY KEY,
    title TEXT,
    body TEXT,
    tsv tsvector
);

Insert our familiar documents:

INSERT INTO articles (title, body) VALUES
    ('The Fox', 'the quick brown fox jumped over the lazy dog'),
    ('Sleeping Dog', 'the lazy dog slept in the sun'),
    ('Astronomy', 'the sun is a star and the fox is an animal');

Now, we'll populate the tsv column. It's often best to use a trigger to do this automatically whenever a row is inserted or updated. This keeps the indexed data consistent with the source text.

CREATE OR REPLACE FUNCTION update_tsv_column()
RETURNS TRIGGER AS $
BEGIN
    NEW.tsv :=
        setweight(to_tsvector('english', coalesce(NEW.title,'')), 'A') ||
        setweight(to_tsvector('english', coalesce(NEW.body,'')), 'B');
    RETURN NEW;
END;
$ LANGUAGE plpgsql;
 
CREATE TRIGGER tsv_update_trigger
BEFORE INSERT OR UPDATE ON articles
FOR EACH ROW EXECUTE FUNCTION update_tsv_column();

Let's update our existing rows to populate the tsv column:

UPDATE articles SET title = title; -- A simple trick to fire the update trigger.

Creating the Index

To make searches fast, we need to create an index on our tsvector column. A standard B-tree index won't work. We need a GIN (Generalized Inverted Index), which is PostgreSQL's implementation of the inverted index structure we learned about.

CREATE INDEX articles_tsv_idx ON articles USING GIN(tsv);

Running Full-Text Search Queries

The primary FTS matching operator is @@. It returns true if the tsvector matches the tsquery:

SELECT id, title FROM articles WHERE tsv @@ to_tsquery('english', 'fox');

Search for documents containing both 'lazy' AND 'dog':

SELECT id, title FROM articles WHERE tsv @@ to_tsquery('english', 'lazy & dog');

Search for documents containing 'sun' OR 'animal':

SELECT id, title FROM articles WHERE tsv @@ to_tsquery('english', 'sun | animal');

Ranking the Results

Just getting matches isn't enough; we need to rank them. PostgreSQL provides functions like ts_rank and ts_rank_cd which use algorithms similar in principle to BM25. They consider term frequency, document length, and term proximity.

Let's search for 'dog' and rank the results:

SELECT
    id,
    title,
    ts_rank_cd(tsv, to_tsquery('english', 'dog')) AS rank
FROM articles
WHERE tsv @@ to_tsquery('english', 'dog')
ORDER BY rank DESC;

Notice how 'The Fox' ranks higher for the query 'lazy & dog' than 'Sleeping Dog' because the terms are closer together in the original text, a factor ts_rank_cd considers.

SELECT
    id,
    title,
    ts_rank_cd(tsv, to_tsquery('english', 'lazy & dog')) AS rank
FROM articles
WHERE tsv @@ to_tsquery('english', 'lazy & dog')
ORDER BY rank DESC;

Strengths and Limitations

Strengths:

Convenience: Search functionality lives right next to your data. No need for a separate service or data synchronization pipeline.
Transactional Integrity: FTS indexes are updated within the same transaction as your data, ensuring the search index is always up-to-date.
Powerful SQL Integration: You can easily combine FTS queries with other SQL clauses (JOINs, WHERE filters, etc.).

Limitations:

Less Control: While configurable, you have less fine-grained control over the ranking algorithm and scoring compared to a dedicated library or service like Lucene or Elasticsearch.
Scalability: For extremely large collections (billions of documents) or very high query loads, a dedicated, distributed search cluster will outperform a database's FTS.
Operational Complexity: Advanced configuration for different languages, synonyms, and custom dictionaries can be complex to manage.

Conclusion

For many applications, PostgreSQL's Full-Text Search is a powerful, robust, and convenient solution. It's a fantastic tool to have in your engineering handbook, especially when you need solid search capabilities without the overhead of a dedicated search engine.