Demo · Module 20 · Interactive
One query, four stages.
§1
Embed the query
Project the question into a 768-d embedding, using the same model that embedded the document store.
§2
Retrieve top-3
Cosine-similarity against the 8 documents in the vector store. Return top-3.
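A minimal sketch of this retrieval step, assuming the embeddings are already available as plain lists of floats. The names `cosine` and `top_k` and the toy 3-d vectors are illustrative assumptions, not part of the demo; real embeddings would come from the embedding model in §1.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_index, score) pairs for the k highest-scoring documents."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy stand-ins for real embeddings (hypothetical values).
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
query_vec = [1.0, 0.0, 0.0]

hits = top_k(query_vec, doc_vecs, k=3)
for idx, score in hits:
    print(f"doc {idx}: {score:.3f}")
```

A brute-force scan like this is enough for 8 documents; larger stores typically replace it with an approximate nearest-neighbor index.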
§3
Augment the prompt
Stuff the top-3 docs into the prompt as context. This is what the LLM actually sees.
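One way the augmentation step might look, as a sketch. The template wording, the bracketed numbering, and the `build_prompt` name are assumptions made for illustration; the demo's actual prompt format is not shown here.

```python
def build_prompt(question, docs):
    """Assemble a prompt that places numbered retrieved docs ahead of the question."""
    context = "\n\n".join(
        f"[{n}] {text}" for n, text in enumerate(docs, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved documents, purely for illustration.
retrieved = [
    "RAG retrieves documents and adds them to the prompt.",
    "Cosine similarity compares the direction of two vectors.",
    "A reranker re-scores retrieved candidates.",
]
prompt = build_prompt("What does RAG add to the prompt?", retrieved)
print(prompt)
```

Numbering the documents gives the model something concrete to cite, which also makes it easier to check each claim against its source.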
§4
Generate the answer
LLM reads the augmented prompt and streams an answer grounded in the retrieved docs.
What RAG actually does. Retrieval-augmented generation is the workhorse pattern for making an LLM answer questions about content it wasn't trained on. The four stages: (1) Embed the user's query into the same vector space as the document store. (2) Retrieve the top-K most semantically similar documents via cosine similarity (or hybrid keyword + vector). (3) Augment the prompt: paste the retrieved chunks into a structured prompt the LLM will read. (4) Generate a grounded answer.
What real systems add: a reranker between stages 2 and 3 (a cross-encoder model that re-scores the top-20 down to top-3 with better precision), hybrid retrieval (BM25 + dense vectors), chunking strategies (sliding window, hierarchical, propositional), GraphRAG for relationship-heavy domains, and citation surfaces so the user can verify each claim.
What this demo simplifies: real embedding models output vectors of 768 to 3072 dimensions; we show 64 cells. Real corpora have millions of docs; we show 8. The point is the pipeline shape, not the production engineering.
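Of the chunking strategies mentioned, the sliding window is the simplest to sketch. The version below assumes whitespace tokens and made-up window sizes; `sliding_window_chunks`, `size`, and `overlap` are hypothetical names, and production systems usually count model tokens rather than words.

```python
def sliding_window_chunks(tokens, size=4, overlap=2):
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


tokens = "a b c d e f g".split()
chunks = sliding_window_chunks(tokens, size=4, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of storing some text twice.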