Vector Search and RAG

Previous posts focused on keyword-based search, where relevance is determined by matching words. We'll now talk about semantic search, where we find documents based on their meaning. This is powered by vector embeddings and has given rise to one of the most impactful architectural patterns in modern AI: Retrieval-Augmented Generation (RAG).

We covered the techniques behind vectors and embeddings earlier, in the "Data Science Introduction" chapter. This post walks through the end-to-end process of implementing vector search and using it to build a RAG system that can answer questions using your private data.

From Text to Searchable Vectors

The fundamental shift is from indexing words to indexing meaning. This is a two-stage process: an offline indexing stage and a real-time query stage.

The Indexing Pipeline

This is how you prepare your knowledge base for semantic search:

Chunk Your Documents
Instead of indexing whole documents, you first split them into smaller, semantically coherent chunks, such as paragraphs, sections, or even individual sentences. This step is crucial because it keeps the retrieved context dense and relevant.
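
To make this concrete, here is a minimal chunking sketch in Python. It splits on blank lines and falls back to a sliding window for very long paragraphs; the 1,000-character limit and 200-character overlap are illustrative assumptions, not tuned values.

def chunk_document(text: str, max_chars: int = 1000, overlap: int = 200) -> list[str]:
    # Naive chunker: treat blank-line-separated paragraphs as chunk candidates.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for para in paragraphs:
        if len(para) <= max_chars:
            chunks.append(para)
        else:
            # Fall back to a sliding window with overlap so context is not cut abruptly.
            start = 0
            while start < len(para):
                chunks.append(para[start:start + max_chars])
                start += max_chars - overlap
    return chunks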

Choose an Embedding Model
Select a model to convert your text chunks into numerical vectors. This is a critical choice, as you must use the same model for indexing and querying. Popular choices include open-source models like Sentence-BERT or API-based models like OpenAI's text-embedding-3-small.
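
As a sketch of what this looks like in practice (assuming the open-source sentence-transformers library; the specific model name is just one common choice):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # maps text to 384-dimensional vectors
vector = model.encode("What are our most popular products in Europe?")
print(vector.shape)  # (384,)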

Generate and Store Embeddings
You iterate through every text chunk, pass it to your chosen embedding model to get a vector, and then store that vector alongside its original text content in a Vector Database.
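
Continuing the sketches above (chunk_document and model), indexing is just encoding every chunk and keeping each vector next to its source text. Here a plain Python list stands in for a real vector database, and the report filename is purely illustrative.

import numpy as np

# Encode every chunk; the filename is a hypothetical plain-text export.
chunks = chunk_document(open("Q3_Feedback_Report.txt", encoding="utf-8").read())
embeddings = model.encode(chunks)  # shape: (num_chunks, 384)

# A toy "index": each entry pairs a vector with the original text it encodes.
index = [{"embedding": np.asarray(e), "text": t} for e, t in zip(embeddings, chunks)]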

A vector database (such as Pinecone, Weaviate, Milvus, or the PostgreSQL extension pgvector) is a specialized system designed for one task: finding the "nearest" vectors in a high-dimensional space very quickly, typically with the help of approximate nearest-neighbor indexes.
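
For concreteness, here is roughly what storage and search look like with pgvector. This is a sketch under assumptions: it presumes the pgvector Python package alongside psycopg2, and the connection string, table name, and vector size are invented for the example.

import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=knowledge_base")  # illustrative connection string
register_vector(conn)  # lets psycopg2 pass numpy arrays as pgvector values
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(384)
    )
""")
for item in index:
    cur.execute(
        "INSERT INTO chunks (content, embedding) VALUES (%s, %s)",
        (item["text"], item["embedding"]),
    )
conn.commit()

# At query time, <=> is pgvector's cosine-distance operator.
cur.execute(
    "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT 5",
    (model.encode("most requested features"),),
)
top_chunks = [row[0] for row in cur.fetchall()]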

The Query Pipeline

This is what happens when a user performs a search:

Embed the User's Query
The user's natural language query (e.g., "what are our most popular products in Europe?") is passed through the exact same embedding model used during indexing.

Perform a Similarity Search
This query vector is sent to the vector database. The database performs a similarity search (often using cosine similarity) to find the k most similar document chunk vectors in its index.

Retrieve Original Content
The database returns the original text chunks associated with those top k vectors. The result is a list of text passages that are semantically related to the user's query, even if they don't share any keywords.
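
Here is the whole query path as a sketch against the toy in-memory index built earlier; a real vector database performs the similarity search and top-k selection server-side, usually with an approximate index.

import numpy as np

def search(query: str, index: list[dict], k: int = 3) -> list[str]:
    # 1. Embed the query with the SAME model used at indexing time.
    q = model.encode(query)

    # 2. Cosine similarity between the query vector and every stored chunk vector.
    matrix = np.stack([item["embedding"] for item in index])
    sims = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))

    # 3. Return the original text of the k most similar chunks.
    top_k = np.argsort(sims)[::-1][:k]
    return [index[i]["text"] for i in top_k]

passages = search("what are our most popular products in Europe?", index)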

RAG: Giving LLMs Your Data

Vector search is powerful on its own, but its true potential is realized when it is combined with a Large Language Model (LLM) in a RAG architecture. RAG addresses two of the biggest limitations of LLMs: their knowledge is frozen at training time, and they have no access to your private, proprietary information.

The RAG workflow elegantly combines retrieval and generation:

The User Asks a Question
A user poses a question in natural language, for example: "What was the key takeaway from our Q3 customer feedback report?"

Retrieve Relevant Context
The system takes this question and uses the vector search pipeline described above to retrieve the most relevant chunks of text from your internal knowledge base (e.g., your internal wiki, PDFs, or Slack history).

Augment the Prompt
This is the "magic" step. The system constructs a new, detailed prompt for an LLM. This prompt includes the original user question and the retrieved text chunks as context.

Generate a Grounded Answer
This augmented prompt is sent to a capable LLM (such as GPT-4 or Claude 3). The LLM uses the provided context to formulate a precise, factual answer. Because the answer is based on the retrieved information, it is "grounded" in your data.

Example of an Augmented Prompt:

You are a helpful assistant who answers questions based on the provided context.

Context:
---
[Chunk 7, from 'Q3_Feedback_Report.pdf': "Overall customer satisfaction in Q3 was 8.2/10. A recurring theme in feedback was a desire for improved integration with third-party calendars. Many users cited this as a key feature that would increase their daily usage."]
[Chunk 12, from 'Q3_Feedback_Report.pdf': "While feature requests were diverse, the most frequently mentioned item, by a margin of 3-to-1, was calendar integration. This suggests a significant opportunity for product development in Q4."]
---

User Question: What was the key takeaway from our Q3 customer feedback report?

The LLM, now armed with this specific context, can confidently answer: "The key takeaway from the Q3 customer feedback report was the strong demand for improved integration with third-party calendars, which was the most frequently requested feature by a significant margin."
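
Putting it all together, the whole RAG loop is only a few lines. This sketch assumes the OpenAI Python client, reuses the search() helper and index from the query-pipeline sketch above, and the model name is chosen purely for illustration.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str) -> str:
    # 1. Retrieve: vector search over the private knowledge base.
    context_chunks = search(question, index, k=3)

    # 2. Augment: combine the question with the retrieved context.
    context = "\n---\n".join(context_chunks)
    prompt = (
        "You are a helpful assistant who answers questions based on the provided context.\n\n"
        f"Context:\n---\n{context}\n---\n\n"
        f"User Question: {question}"
    )

    # 3. Generate: the LLM answers, grounded in the retrieved chunks.
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer("What was the key takeaway from our Q3 customer feedback report?"))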
