Why we give language models an open book — and how they learn to look things up before they answer.
The problem · 1
An LLM is trained once on a frozen snapshot of text, then answers everything from what it absorbed. Brilliant for general knowledge — but it has no idea what it doesn't know.
When it doesn't know, it doesn't say so — it generates a fluent, confident, plausible-but-false answer. We call this hallucination.
It only knows the world up to its training cutoff. Yesterday's release, this morning's price, last week's policy — invisible to it.
Your company wiki, your PDFs, your support tickets were never in its training set. It simply can't read what it was never shown.
Retrieval-augmented generation (RAG) addresses all three, without retraining the model.
The mental model · 2
Picture a sharp student sitting an exam. Closed-book, they answer from memory — and when memory fails, they bluff. Hand them the textbook and the same student looks up the relevant page, then answers from what's in front of them.
RAG is that textbook for an AI. The model doesn't get smarter — it gets to look things up first. Keep this picture; every step ahead is just a detail of how the book gets opened to the right page.
The shape of it · 3
Every RAG system, however fancy, is these three moves in order. The book gets opened (retrieve), the page is slipped into the question (augment), and the model writes the answer from it (generate).
Take the user's question, search a library of your documents, and pull back the handful of passages most likely to hold the answer.
Paste those passages into the prompt alongside the question — "here's the question, and here are the relevant facts. Answer using these."
The LLM does what it's great at — writing fluent prose — but now anchored to the supplied passages, and able to cite them.
The model is never retrained. RAG changes what the model reads at question time, not what it learned.
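The three moves can be sketched in a few lines of Python. This is a toy under stated assumptions, not a production design: `retrieve` ranks passages by shared words as a stand-in for a real retriever, and `llm` is a hypothetical callable that takes a prompt string and returns text, so you can plug in whichever model client you use.

```python
import re
from typing import Callable


def words(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, library: list[str], k: int = 2) -> list[str]:
    """Move 1: return the k passages sharing the most words with the question.

    Word overlap is a stand-in here; real retrievers compare embeddings.
    """
    q = words(question)
    ranked = sorted(library, key=lambda p: len(q & words(p)), reverse=True)
    return ranked[:k]


def augment(question: str, passages: list[str]) -> str:
    """Move 2: paste the passages into the prompt alongside the question."""
    facts = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only these facts:\n{facts}\n\nQuestion: {question}"


def rag_answer(question: str, library: list[str],
               llm: Callable[[str], str], k: int = 2) -> str:
    """Move 3: the (unchanged) model writes the answer from the fuller prompt."""
    passages = retrieve(question, library, k)
    prompt = augment(question, passages)
    return llm(prompt)
```

Note that nothing in `rag_answer` touches the model's weights: only the prompt changes from one question to the next.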
Before any question · 4
Searching whole documents is slow and clumsy, so we prepare them first: cut each document into bite-sized chunks, turn every chunk into a list of numbers called an embedding (next slide), and file it in a searchable index.
Chunk size is a trade-off. Too big and a chunk mixes many topics, so search gets blurry. Too small and it loses the context that made it meaningful. A little overlap between neighbouring chunks stops ideas from being sliced in half. Finding the balance is a real tuning knob in production RAG.
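A minimal sketch of chunking with overlap, assuming size is counted in whitespace-separated words (production systems often count tokens or split on sentences instead). The function name and default values are illustrative, not recommendations.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks; neighbours share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

With `chunk_size=4` and `overlap=1`, each chunk starts on the last word of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk more often than with no overlap.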
The key trick · 5
Here's the idea the whole thing rests on. An embedding model reads a piece of text and places it as a point in space, so that things that mean similar things land near each other — even when they share no words. "Refund" sits beside "money back"; "puppy" sits beside "dog."
A word's nearest neighbours on the map are the words closest to it in meaning.
Real embeddings use hundreds or thousands of dimensions, not two — impossible to draw, but the principle is exactly this: distance ≈ difference in meaning. This illustrative map is hand-placed; production embeddings are learned from billions of sentences.
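To make "near means similar" concrete, here is a small sketch using hand-placed 2D points, just like the illustrative map. The coordinates in `toy_map` are invented for this example and come from no real embedding model; the only library call is `math.dist`, which computes straight-line (Euclidean) distance.

```python
import math

# Hand-placed, hypothetical coordinates. A real embedding model would
# produce vectors with hundreds or thousands of dimensions.
toy_map = {
    "refund": (0.90, 0.10),
    "money back": (0.85, 0.20),
    "puppy": (0.10, 0.90),
    "dog": (0.20, 0.85),
}


def nearest(word: str, k: int = 1) -> list[str]:
    """Return the k entries closest to `word` by Euclidean distance."""
    here = toy_map[word]
    others = [w for w in toy_map if w != word]
    others.sort(key=lambda w: math.dist(here, toy_map[w]))
    return others[:k]
```

Here `nearest("refund")` picks "money back" even though the two share no words, because the points were placed close together. A learned embedding model does that placing for you.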
Step 1 · Retrieve · 6
A question comes in. We embed it with the same model, drop it into the same space, and grab the chunks sitting closest to it — the top-k. No keyword has to match; "how long to get my money back" can find a chunk that only says "refunds are processed within 14 days."
Scores here are illustrative (a lightweight meaning-overlap model), but the behaviour is real: the retriever ranks every chunk by similarity and hands the winners on. Only the highlighted top-k travel to the next step — everything else is left behind.
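Top-k retrieval can be sketched as "score every chunk, keep the best few". The sketch assumes the chunks and the question have already been embedded; the three-number vectors below are made up for illustration. Cosine similarity is used as the score, which is a common choice but not the only one.

```python
import heapq
import math


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Rank every (text, vector) pair in the index; return the k best
    as (score, text) tuples, highest score first."""
    scored = ((cosine(query_vec, vec), text) for text, vec in index)
    return heapq.nlargest(k, scored)


# Hypothetical pre-computed embeddings for three support chunks.
index = [
    ("Refunds are processed within 14 days.", (0.9, 0.1, 0.0)),
    ("Our office is closed on public holidays.", (0.0, 0.2, 0.9)),
    ("Return the item in its original packaging.", (0.7, 0.6, 0.1)),
]
# Hypothetical embedding of "how long to get my money back".
query_vec = (0.8, 0.2, 0.0)
winners = top_k(query_vec, index, k=2)
```

This version scans every chunk, which is fine for a demo. Systems with millions of chunks typically swap the scan for an approximate nearest-neighbour index, but the contract stays the same: vectors in, ranked chunks out.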
Steps 2 & 3 · Augment + Generate · 7
The retrieved chunks get slotted into a prompt template: instructions, the facts, then the question. That fuller prompt is what the model actually sees, so its answer is built from your sources. Compare the answer with and without the retrieved chunks to feel the difference.
Same model, same question. The only thing that changed is what it was allowed to read, and that's the whole game. The [policy.pdf]-style markers are citations the model can produce because each chunk in the prompt is labelled with the source it came from.
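One way such a template might look, as a sketch: each chunk is prefixed with its source in square brackets so the model has something to cite. The wording of the instructions and the `build_prompt` name are assumptions for illustration; real templates vary a lot by model and use case.

```python
def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Assemble instructions, labelled facts, then the question.

    `chunks` is a list of (source, text) pairs, e.g. ("policy.pdf", "...").
    """
    facts = "\n".join(f"[{source}] {text}" for source, text in chunks)
    return (
        "Answer the question using only the facts below. "
        "After each fact you use, cite its source tag in square brackets. "
        "If the facts do not contain the answer, say so.\n\n"
        f"Facts:\n{facts}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "How long do refunds take?",
    [("policy.pdf", "Refunds are processed within 14 days.")],
)
```

The source tags are what make citations possible: the model can only point back at `[policy.pdf]` because that label travelled into the prompt with the fact.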
Why this way · 8
Three ways to give a model new knowledge. RAG wins when facts change, sources matter, and the knowledge base is bigger than a single prompt.
| | Fine-tuning | Stuff it all in the prompt | RAG |
|---|---|---|---|
| Add fresh / changing facts | Slow — retrain to update | Yes, but limited room | Update the index instantly |
| Handle a huge knowledge base | Bakes in, hard to audit | Won't fit the context window | Searches millions of chunks |
| Show its sources | No — knowledge is blended in | Sometimes | Yes — cites each chunk |
| Cost & effort to set up | High (GPUs, data, training) | Trivial | Moderate (build the index) |
| Best at | Teaching style & skills | Small, one-off context | Knowledge-heavy Q&A |
These aren't rivals — real systems often combine them: fine-tune for tone, RAG for facts. RAG's edge is that knowledge lives outside the model, where you can see it, fix it, and update it.
The fine print · 9
The chain is only as strong as its weakest link: retrieve → augment → generate. Break any one and the answer suffers, and the cause usually sits upstream of where the symptom shows.
If the search pulls the wrong chunks, the model writes a confident answer from the wrong facts. Most "RAG hallucinations" are really retrieval misses.
If no document covers the question, retrieval returns the "least bad" chunks — and a weak system answers anyway instead of admitting the gap.
Cram in 20 chunks and the real answer can get buried; models attend unevenly to a long context and may miss the middle.
The library is only as current as your last sync. Outdated, duplicated, or contradictory chunks quietly poison the answers.
Notice the pattern: almost every failure is about the retrieval half, not the model. Good RAG is mostly good search.
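Two of these failures, answering from "least bad" chunks and burying the answer under too many, can be softened with a simple guard between retrieval and prompting. The sketch below assumes the retriever returns (score, text) pairs with higher scores meaning closer matches; the threshold and cap values are hypothetical and would need tuning against your own data.

```python
def select_context(scored_chunks, min_score=0.75, max_chunks=5):
    """Keep only chunks scoring at least `min_score`, best first,
    capped at `max_chunks`. An empty result means: admit the gap."""
    kept = [(score, text) for score, text in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:max_chunks]


def answer_or_abstain(question, scored_chunks, llm):
    """Call the model only when some chunk clears the bar.

    `llm` is a hypothetical callable: prompt string in, answer string out.
    """
    context = select_context(scored_chunks)
    if not context:
        return "I couldn't find this in the knowledge base."
    facts = "\n".join(text for _, text in context)
    return llm(f"Facts:\n{facts}\n\nQuestion: {question}")
```

A guard like this does nothing for a stale or contradictory library; that failure still has to be fixed at the source, by keeping the index in sync.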
The whole thing · 11
The micro-demos use small, hand-built illustrative data so they run offline — real systems learn embeddings from vast text and search millions of chunks. The pipeline, the concepts, and the trade-offs shown here match how production RAG works.