What Is RAG?
Language models have a fundamental limitation: they only know what was in their training data. Ask about your company's internal documentation, recent events, or private codebases, and the model either admits ignorance or — worse — hallucinates a plausible-sounding but incorrect answer.
Retrieval-Augmented Generation (RAG) solves this by combining search with generation.
How RAG Works
Instead of relying solely on the model's training, RAG retrieves relevant information and includes it in the prompt:
1. User asks a question
2. System searches relevant documents (retrieval)
3. Retrieved content added to prompt (augmentation)
4. Model generates answer using that content (generation)
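The four steps above can be sketched in a few lines. Everything here is a placeholder: `retrieve` ranks documents by simple word overlap instead of real vector search, and `generate` stands in for an actual LLM call.

```python
import re

def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    # Placeholder retrieval: rank documents by word overlap with the question.
    # A real system would use embeddings and a vector database instead.
    def tokens(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))
    q = tokens(question)
    ranked = sorted(documents, key=lambda d: -len(q & tokens(d)))
    return ranked[:k]

def augment(question: str, chunks: list[str]) -> str:
    # Build a prompt that includes the retrieved content.
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def generate(prompt: str) -> str:
    # Placeholder for a real model call (e.g. a chat-completion API).
    return f"[model answer based on {prompt.count('- ')} retrieved chunk(s)]"

docs = [
    "The refund deadline is 14 days after purchase.",
    "Office hours are 9am to 5pm on weekdays.",
]
question = "What is the refund deadline?"
prompt = augment(question, retrieve(question, docs, k=1))
answer = generate(prompt)
```

Swapping the placeholder `retrieve` for real embedding-based search is the only structural change a production pipeline needs; the retrieve-augment-generate shape stays the same.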
Without RAG, asking "What's our refund policy?" might produce a generic or hallucinated answer. With RAG, the system finds your actual refund policy document and the model answers based on that specific content.
Key Components
Embeddings convert text into numerical vectors. Similar text produces similar vectors, enabling semantic search — finding documents by meaning rather than exact keyword matches. When you ask about "returning products," embeddings help find documents about "refunds" even if they don't use the word "returning."
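The mechanics of "similar text produces similar vectors" comes down to a distance measure, most commonly cosine similarity. The vectors below are made-up 4-dimensional examples; real embeddings come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: 1.0 = same direction,
    # 0.0 = unrelated, -1.0 = opposite.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical embeddings (illustration only; a real model produces these).
refund_doc = [0.9, 0.1, 0.0, 0.2]   # "Our refund policy..."
hours_doc  = [0.0, 0.9, 0.8, 0.1]   # "Office hours are..."
query      = [0.8, 0.2, 0.1, 0.3]   # "returning products"

# The refund document is closer to the query than the office-hours document,
# even though neither shares the word "returning".
best = max([refund_doc, hours_doc], key=lambda v: cosine_similarity(v, query))
```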
Vector databases store these embeddings for fast similarity search. When a question arrives, the system converts it to a vector and finds the most similar document vectors. Popular options include Pinecone, Chroma, pgvector (PostgreSQL extension), and Weaviate.
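Under the hood, a vector database does what this brute-force sketch does, plus indexing for scale. `TinyVectorStore` is an illustration, not how any of the products named above are implemented; they use approximate nearest-neighbor indexes to avoid scanning every vector.

```python
import math

class TinyVectorStore:
    """Brute-force in-memory stand-in for a vector database (illustration only)."""

    def __init__(self):
        self.items = []  # list of (vector, text) pairs

    def add(self, vector: list[float], text: str) -> None:
        self.items.append((vector, text))

    def search(self, query_vector: list[float], k: int = 1) -> list[str]:
        # Rank every stored vector by cosine similarity to the query.
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a))
                          * math.sqrt(sum(y * y for y in b)))
        ranked = sorted(self.items, key=lambda it: cos(it[0], query_vector),
                        reverse=True)
        return [text for _, text in ranked[:k]]

store = TinyVectorStore()
store.add([0.9, 0.1, 0.1], "Refund policy: full refund within 14 days.")
store.add([0.1, 0.9, 0.2], "Shipping takes 3-5 business days.")
top = store.search([0.8, 0.2, 0.1], k=1)
```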
Chunking splits documents into smaller pieces before embedding. A 50-page manual becomes hundreds of chunks. When answering a question, you retrieve only the relevant chunks — not the entire document. This keeps the context focused and fits within token limits.
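A minimal chunker splits on character count with an overlap, so a sentence cut at one boundary still appears whole in the neighboring chunk. The sizes here are arbitrary; production systems often chunk by tokens, sentences, or document structure instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap between
    consecutive chunks (simple sketch; real chunkers often respect
    sentence or section boundaries)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping an overlap
    return chunks

document = "A" * 500  # stand-in for a long manual
chunks = chunk_text(document, chunk_size=200, overlap=50)
```

Each chunk is then embedded and stored individually, so retrieval returns just the relevant slices of a document rather than the whole thing.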
Common RAG Applications
Documentation chatbots — Users ask questions in natural language, and the system finds and synthesizes answers from your docs.
Code assistants with codebase context — The assistant retrieves relevant files from your repository when answering questions about your specific code.
Customer support — Agents get instant access to product information, policies, and troubleshooting guides.
Knowledge management — Search across company wikis, Slack history, and documents using natural questions.
RAG vs Fine-Tuning
RAG adds knowledge at query time, so it is easy to update: just change the documents. Fine-tuning bakes knowledge into the model's weights through additional training; it is harder to update but can change model behavior more fundamentally.
For most "the model needs to know about X" problems, RAG is the right first choice.