Demo · Module 20 · Interactive

One query, four stages.

Embed · Retrieve · Augment · Generate
8-document mini vector store
Step through or auto-play the pipeline

Question

§1

Embed the query

Project the question into a 768-d embedding, using the same model that embedded the document store.
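As a rough illustration of this stage, the sketch below swaps the real embedding model for a hashed bag-of-words function. The name `toy_embed` and the 64-dimension size are assumptions made for this example; the function captures only word overlap, not meaning, so it shows the shape of the step rather than the quality of a trained model.

```python
import hashlib
import math
import re

DIM = 64  # toy size for illustration; the demo describes a 768-d model


def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Hash each word into one of `dim` buckets, then L2-normalise.

    Stand-in for a real embedding model: it reflects word overlap only.
    """
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


# The query and the documents go through the same function,
# otherwise their vectors would not be comparable.
query_vec = toy_embed("How does retrieval-augmented generation work?")
doc_vec = toy_embed("Retrieval-augmented generation retrieves documents first.")
```

The design point carried over from the demo is the last comment: whatever model produces the document vectors must also produce the query vector.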

§2

Retrieve top-3

Compute cosine similarity between the query embedding and each of the 8 documents in the vector store, then return the 3 highest-scoring documents.
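A minimal sketch of this step in plain Python, assuming the store is small enough to scan in full (as the 8-document store here is). The `store` contents and document ids are hypothetical 3-d vectors chosen so the ranking is easy to check by hand.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Score every document against the query and keep the k best."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical vectors, not taken from the demo's store.
store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.5, 0.5, 0.0],
    "doc_d": [0.0, 0.0, 1.0],
}
hits = top_k([1.0, 0.0, 0.0], store, k=3)
```

A full scan like this is fine for a handful of documents; large corpora generally rely on an approximate nearest-neighbour index instead.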

§3

Augment the prompt

Stuff the top-3 docs into the prompt as context. This is what the LLM actually sees.
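One way the augmented prompt might be assembled is sketched below. The template wording, the `[n]` numbering, and the function name `build_prompt` are assumptions for illustration; the demo does not specify its exact prompt format.

```python
def build_prompt(question: str, docs: list[str]) -> str:
    """Number each retrieved chunk and place it ahead of the question."""
    context = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks standing in for the top-3 results.
prompt = build_prompt(
    "What does the retrieve stage return?",
    [
        "The retrieve stage returns the top-3 documents by cosine similarity.",
        "The embed stage maps the query into the document vector space.",
        "The generate stage streams an answer from the LLM.",
    ],
)
print(prompt)
```

Numbering the chunks gives the model something to cite, which is what makes a citation surface possible later.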

§4

Generate the answer

The LLM reads the augmented prompt and streams an answer grounded in the retrieved docs.
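The streaming interface can be sketched without a real model. In the example below, `fake_llm_stream` is a hypothetical stand-in that yields a canned answer word by word so the code runs offline; a real system would pass in a client for an actual LLM with the same call shape.

```python
from typing import Callable, Iterator


def fake_llm_stream(prompt: str) -> Iterator[str]:
    """Stand-in for a model client: yields a canned answer word by word."""
    for word in "The retrieve stage returns the top-3 documents [1].".split():
        yield word + " "


def generate(prompt: str, llm_stream: Callable[[str], Iterator[str]]) -> str:
    """Consume the stream chunk by chunk and return the full answer."""
    pieces = []
    for chunk in llm_stream(prompt):
        pieces.append(chunk)  # a UI would render each chunk as it arrives
    return "".join(pieces).strip()


answer = generate("Context: ...\n\nQuestion: ...\nAnswer:", fake_llm_stream)
```

Passing the model in as a callable keeps the pipeline code independent of any particular provider.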

What RAG actually does. Retrieval-augmented generation is the workhorse pattern for making an LLM answer questions about content it wasn't trained on. It has four stages: (1) Embed the user's query into the same vector space as the document store. (2) Retrieve the top-K most semantically similar documents via cosine similarity (or hybrid keyword + vector search). (3) Augment the prompt: paste the retrieved chunks into a structured prompt the LLM will read. (4) Generate an answer grounded in those chunks.

What real systems add: a reranker between stages 2 and 3 (a cross-encoder model that re-scores a wider candidate set, for example the top 20, and keeps the top 3 with better precision), hybrid retrieval (BM25 + dense vectors), chunking strategies (sliding window, hierarchical, propositional), GraphRAG for relationship-heavy domains, and citation surfaces so the user can verify each claim.

What this demo simplifies: real embedding models typically output vectors with 768 to 3072 dimensions; we show 64 cells. Real corpora have millions of docs; we show 8. The point is the pipeline shape, not the production engineering.
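Hybrid retrieval needs some way to merge a keyword ranking with a vector ranking. The sketch below uses reciprocal rank fusion (RRF), which is one common choice and not something this demo implements; the document ids and both ranked lists are hypothetical, and `k = 60` is a conventional default rather than a tuned value.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 3) -> list[str]:
    """Reciprocal rank fusion: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


# Hypothetical ranked ids from a keyword (BM25) search and a dense-vector search.
bm25_hits = ["doc_c", "doc_a", "doc_f", "doc_b"]
dense_hits = ["doc_a", "doc_b", "doc_c", "doc_h"]
fused = rrf_fuse([bm25_hits, dense_hits])
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.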