Retrieval-Augmented Generation

How RAG actually works.

Why we give language models an open book — and how they learn to look things up before they answer.

R: Retrieve · A: Augment · G: Generate

A 12-slide walkthrough; no prior AI knowledge needed.

The problem · 1

A language model answers from memory. That breaks in three ways.

An LLM is trained once on a frozen snapshot of text, then answers everything from what it absorbed. Brilliant for general knowledge — but it has no idea what it doesn't know.

FAILURE 01

It makes things up

When it doesn't know, it doesn't say so — it generates a fluent, confident, plausible-but-false answer. We call this hallucination.

FAILURE 02

Its knowledge is frozen

It only knows the world up to its training cutoff. Yesterday's release, this morning's price, last week's policy — invisible to it.

FAILURE 03

It can't see your data

Your company wiki, your PDFs, your support tickets were never in its training set. It simply can't read what it was never shown.

RAG is the fix for all three — without retraining the model.

The mental model · 2

Turn a closed-book exam into an open-book one.

Picture a sharp student sitting an exam. Closed-book, they answer from memory — and when memory fails, they bluff. Hand them the textbook and the same student looks up the relevant page, then answers from what's in front of them.

RAG is that textbook for an AI. The model doesn't get smarter — it gets to look things up first. Keep this picture; every step ahead is just a detail of how the book gets opened to the right page.

Throughline: the open book

Closed book · from memory

"What's our refund window?"

"Most companies use 30 days, so probably 30."
↑ a confident guess — it never read your policy

Open book · looks it up

"What's our refund window?"

reads policy.pdf → "Your policy allows returns within 14 days of delivery."
↑ grounded in a real source

The shape of it · 3

Three steps, and the acronym spells them.

Every RAG system, however fancy, is these three moves in order. The book gets opened (retrieve), the page is slipped into the question (augment), and the model writes the answer from it (generate).

1 Retrieve

Find the right pages

Take the user's question, search a library of your documents, and pull back the handful of passages most likely to hold the answer.

2 Augment

Staple them to the question

Paste those passages into the prompt alongside the question — "here's the question, and here are the relevant facts. Answer using these."

3 Generate

Write the answer from them

The LLM does what it's great at — writing fluent prose — but now anchored to the supplied passages, and able to cite them.

The model is never retrained. RAG changes what the model reads at question time, not what it learned.
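As a rough sketch of those three moves in order (the function names and the stand-in retriever and model below are hypothetical, chosen only so the example runs without a real index or LLM):

```python
from typing import Callable, List


def answer_with_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Run retrieve -> augment -> generate for one question."""
    passages = retrieve(question)                    # 1. Retrieve
    context = "\n".join(f"- {p}" for p in passages)
    prompt = (                                       # 2. Augment
        "Answer using only these facts:\n"
        f"{context}\n\n"
        f"Question: {question}"
    )
    return generate(prompt)                          # 3. Generate


# Stand-ins (assumptions for illustration, not a real search index or model).
def fake_retrieve(question: str) -> List[str]:
    return ["Your policy allows returns within 14 days of delivery."]


def fake_generate(prompt: str) -> str:
    # A real LLM call would go here; this stub only reports what it was given.
    return f"(model saw {len(prompt.splitlines())} prompt lines)"


print(answer_with_rag("What's our refund window?", fake_retrieve, fake_generate))
```

Passing the retriever and the model in as plain functions mirrors the point above: swapping the library or the model changes what gets read at question time, not the shape of the pipeline.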

Before any question · 4

First you build the library — once, ahead of time.

Searching whole documents is slow and clumsy, so we prepare them first: cut each document into bite-sized chunks, then turn every chunk into a list of numbers (a vector; see the next slide) and file it in a searchable index.

Documents

PDFs, wikis, tickets

Chunks

small passages

Embeddings

chunk → numbers

Vector store

the searchable index

Demo · Chunking (live)

Chunk size 14 words

Overlap 2 words

Drag the sliders. Too big and a chunk mixes many topics, so search gets blurry. Too small and it loses the context that made it meaningful. A little overlap stops ideas from being sliced in half. Finding the balance is a real tuning knob in production RAG.
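The two sliders map onto a few lines of code. Here is a minimal sketch that assumes chunks are measured in whitespace-separated words, as in the demo; `chunk_words` is a hypothetical helper, and production systems often split on tokens or sentences instead.

```python
from typing import List


def chunk_words(text: str, size: int = 14, overlap: int = 2) -> List[str]:
    """Split text into chunks of `size` words; neighbours share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(30))
for chunk in chunk_words(sample):
    print(len(chunk.split()), "words:", chunk)
```

With 30 words, size 14 and overlap 2, the window advances 12 words at a time, so the last two words of each chunk reappear at the start of the next one.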

The key trick · 5

Embeddings: turning meaning into coordinates.

Here's the idea the whole thing rests on. An embedding model reads a piece of text and places it as a point in space, so that things that mean similar things land near each other — even when they share no words. "Refund" sits beside "money back"; "puppy" sits beside "dog."

Demo · Meaning as geometry (live)

Click a word. Its nearest neighbours light up — that nearness is similarity of meaning.

Real embeddings use hundreds or thousands of dimensions, not two — impossible to draw, but the principle is exactly this: distance ≈ difference in meaning. This illustrative map is hand-placed; production embeddings are learned from billions of sentences.
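To make "near means similar" concrete, here is a toy version of the map. The coordinates are hand-placed for illustration, just like the demo's, and are not output from any real embedding model; `WORD_MAP` and `nearest` are hypothetical names.

```python
import math
from typing import Dict, List, Tuple

# Hand-placed, purely illustrative 2-D coordinates.
WORD_MAP: Dict[str, Tuple[float, float]] = {
    "refund": (1.0, 1.0),
    "money back": (1.2, 0.9),
    "invoice": (2.0, 1.5),
    "puppy": (8.0, 7.0),
    "dog": (8.3, 7.2),
    "kitten": (7.0, 8.0),
}


def nearest(word: str, n: int = 2) -> List[str]:
    """Return the n words closest to `word` by straight-line distance."""
    origin = WORD_MAP[word]
    others = [w for w in WORD_MAP if w != word]
    return sorted(others, key=lambda w: math.dist(origin, WORD_MAP[w]))[:n]


print(nearest("refund"))  # ['money back', 'invoice']
print(nearest("puppy"))   # ['dog', 'kitten']
```

"Refund" and "money back" share no words, yet they come out as neighbours because of where they sit on the map.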

Step 1 · Retrieve · 6

Now the search makes sense: match by meaning, not keywords.

A question comes in. We embed it with the same model, drop it into the same space, and grab the chunks sitting closest to it — the top-k. No keyword has to match; "how long to get my money back" can find a chunk that only says "refunds are processed within 14 days."

Demo · Top-k retrieval (live)

Ask the support knowledge base something:

Keep the top-k most similar chunks.

Scores here are illustrative (a lightweight meaning-overlap model), but the behaviour is real: the retriever ranks every chunk by similarity and hands the winners on. Only the highlighted top-k travel to the next step — everything else is left behind.
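The ranking step can be sketched as follows. The three-number vectors and two of the chunk texts are invented for illustration, standing in for what an embedding model would produce; cosine similarity is a common scoring choice, assumed here rather than taken from the demo.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Score every chunk against the query and keep the k best."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    return sorted(scored, reverse=True)[:k]


# Made-up vectors: pretend an embedding model produced them.
INDEX = [
    ("Refunds are processed within 14 days.", [0.9, 0.1, 0.0]),
    ("Shipping takes 3 to 5 business days.", [0.1, 0.9, 0.1]),
    ("Our office is closed on public holidays.", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]  # pretend embedding of "how long to get my money back"

for score, text in top_k(query, INDEX, k=2):
    print(f"{score:.2f}  {text}")
```

The refund chunk ranks first even though the question never uses the word "refund"; only the vectors are compared.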

Steps 2 & 3 · Augment + Generate · 7

Staple the pages to the question, then let it write.

The retrieved chunks get slotted into a prompt template — instructions, the facts, then the question. That fuller prompt is what the model actually sees, so its answer is built from your sources. Toggle the answer to feel the difference.

# System

Answer using ONLY the context below. If it isn't there, say you don't know. Cite sources.

# Context (retrieved)

# Question

How long do I have to return an item?

Same model, same question. The only thing that changed is what it was allowed to read; that's the whole game. The [policy.pdf]-style markers are citations the model can produce because each retrieved chunk arrives in the prompt labelled with its source.
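Filling the template is plain string assembly. This sketch assumes each retrieved chunk arrives as a (source, text) pair and tags it with a [source] marker so the model has something to cite; `build_prompt` is a hypothetical helper, not any library's API.

```python
from typing import List, Tuple


def build_prompt(question: str, chunks: List[Tuple[str, str]]) -> str:
    """Fill the three-part template: instructions, retrieved facts, question."""
    context = "\n".join(f"[{source}] {text}" for source, text in chunks)
    return (
        "# System\n"
        "Answer using ONLY the context below. "
        "If it isn't there, say you don't know. Cite sources.\n\n"
        "# Context (retrieved)\n"
        f"{context}\n\n"
        "# Question\n"
        f"{question}"
    )


prompt = build_prompt(
    "How long do I have to return an item?",
    [("policy.pdf", "Your policy allows returns within 14 days of delivery.")],
)
print(prompt)
```

The resulting string is the "fuller prompt" the model actually sees.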

Why this way · 8

"Why not just retrain the model, or paste everything in?"

Three ways to give a model new knowledge. RAG wins when facts change, sources matter, and the knowledge base is bigger than a single prompt.

	Fine-tuning	Stuff it all in the prompt	RAG
Add fresh / changing facts	Slow — retrain to update	Yes, but limited room	Update the index instantly
Handle a huge knowledge base	Bakes in, hard to audit	Won't fit the context window	Searches millions of chunks
Show its sources	No — knowledge is blended in	Sometimes	Yes — cites each chunk
Cost & effort to set up	High (GPUs, data, training)	Trivial	Moderate (build the index)
Best at	Teaching style & skills	Small, one-off context	Knowledge-heavy Q&A

These aren't rivals — real systems often combine them: fine-tune for tone, RAG for facts. RAG's edge is that knowledge lives outside the model, where you can see it, fix it, and update it.

The fine print · 9

RAG isn't magic. It fails in specific, fixable ways.

The chain is only as strong as its weakest link: retrieve → augment → generate. Break any one and the answer suffers — usually upstream of where it shows.

Garbage retrieval in, garbage out

If the search pulls the wrong chunks, the model writes a confident answer from the wrong facts. Many so-called "RAG hallucinations" are really retrieval misses.

Fix: Better chunking, better embeddings, re-ranking the results.

The answer isn't in the library

If no document covers the question, retrieval returns the "least bad" chunks — and a weak system answers anyway instead of admitting the gap.

Fix: Instruct "say you don't know"; set a similarity threshold.

Lost in too much context

Cram in 20 chunks and the real answer can get buried; models attend unevenly to a long context and may miss the middle.

Fix: Retrieve fewer, better chunks; re-rank; keep top-k small.

Stale or messy index

The library is only as current as your last sync. Outdated, duplicated, or contradictory chunks quietly poison the answers.

Fix: Re-index on a schedule; dedupe; track source freshness.

Notice the pattern: almost every failure is about the retrieval half, not the model. Good RAG is mostly good search.
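Several of those fixes sit in one small filter between retrieval and the prompt. The sketch below assumes the retriever returns (score, text) pairs; the 0.5 threshold and k of 3 are placeholder values to tune, and `select_context` is a hypothetical name.

```python
from typing import List, Tuple

NO_ANSWER = "I don't know: nothing in the library covers that."


def select_context(
    scored: List[Tuple[float, str]],
    threshold: float = 0.5,
    k: int = 3,
) -> List[str]:
    """Keep at most k distinct chunks that score at or above the threshold."""
    kept: List[str] = []
    seen = set()
    for score, text in sorted(scored, reverse=True):
        if score < threshold:
            break  # sorted best-first, so everything after this is weaker
        if text in seen:
            continue  # skip exact duplicates left behind by a messy index
        seen.add(text)
        kept.append(text)
        if len(kept) == k:
            break  # keep top-k small so the answer isn't buried
    return kept


hits = [(0.91, "Returns within 14 days."), (0.91, "Returns within 14 days."), (0.12, "Office hours.")]
print(select_context(hits) or NO_ANSWER)
print(select_context([(0.12, "Office hours.")]) or NO_ANSWER)
```

When nothing clears the threshold the list comes back empty, and the caller can admit the gap without asking the model to answer from the "least bad" chunks.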

Check yourself · 10

Five questions. Instant feedback, and the why.

The whole thing · 11

You now know how the book gets opened.

1. A model alone answers from frozen memory, and bluffs when it's unsure.

2. Index your documents: chunk them, embed each chunk into coordinates, store them.

3. Retrieve by embedding the question and grabbing the nearest chunks.

4. Augment the prompt with those chunks, then generate a grounded, citable answer.

5. It's only as good as its search: fix retrieval before blaming the model.

The open book, start to finish.

Go to the sources

The original RAG paper (2020)

Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" — where the term was coined.

arxiv.org/abs/2005.11401

Pinecone — RAG learning guide

Practical walkthrough of embeddings, vector search and retrieval.

pinecone.io/learn

LangChain — build a RAG app

Hands-on tutorial: load, chunk, embed, retrieve, generate.

python.langchain.com

"Lost in the Middle" (2023)

Liu et al. — evidence that models attend unevenly to long contexts.

arxiv.org/abs/2307.03172

The micro-demos use small, hand-built illustrative data so they run offline — real systems learn embeddings from vast text and search millions of chunks. The pipeline, the concepts, and the trade-offs shown here match how production RAG works.
