Retrieval-Augmented Generation from embedding model selection through to production evaluation. Cert-focused synthesis of the ingest, retrieval, and generation stages, with honest framing of the trade-offs the exams probe.
A cert-focused tour of the full RAG pipeline — parametric memory, embeddings, indexes, hybrid search, chunking, agentic patterns, evaluation, and when not to use RAG.
Experimentation
22%
Second-largest domain. Covers pipeline design, component selection (embeddings, indexes, chunking), evaluation with RAGAS, and when to choose RAG vs fine-tuning vs context stuffing. Scenario-based questions.
Data Preparation
9%
Probes ingestion specifics in more detail: chunking trade-offs, metadata enrichment, index construction (HNSW vs IVF), hybrid search fusion formulas, and evaluation metric definitions.
RAG system design decisions bleed into the Evaluation domain (RAGAS metrics), the Prompt Engineering domain (retrieval-augmented prompt construction), and the NVIDIA stack domain (NeMo Retriever, NIM). The architecture understanding from Deck 01 is the prerequisite — embedding models are encoder-only transformers.
A pretrained LLM stores knowledge in its weights — parametric memory. It is fixed at inference time, cannot be updated without retraining, and cannot cite sources. The model may also confabulate.
Lewis et al. (NeurIPS 2020) combined a seq2seq model with a dense retrieval index over Wikipedia, achieving state-of-the-art on open-domain QA while producing "more specific, diverse and factual" language than parametric-only baselines. The retrieval index is non-parametric memory: updatable by modifying the index without retraining.
Lewis et al. proposed two variants. RAG-Sequence uses the same retrieved passages for the entire generated output. RAG-Token allows different passages to inform each generated token. RAG-Sequence is the standard in production; RAG-Token is rarely used outside research.
The ingest pipeline runs offline (or continuously). The query pipeline runs online for every user request. The bottleneck is almost always the retrieval + rerank stage, not the generation stage.
| Type | Mechanism | Strengths | Weaknesses |
|---|---|---|---|
| Dense bi-encoder | Query and doc encoded independently into single vectors; cosine/dot similarity | Semantic similarity, paraphrase generalisation, fast ANN retrieval | Misses exact matches, abbreviations, rare proper nouns |
| Sparse (BM25) | TF-IDF with saturation and length normalisation; exact lexical match | Exact matches, rare terms, abbreviations | No semantic generalisation |
| ColBERT late interaction | Per-token vectors for query and doc; MaxSim sum over query tokens | More accurate than single-vector bi-encoder; token-level granularity | Larger index; slower than bi-encoder |
| Cross-encoder | Query+doc concatenated and processed jointly through a transformer | Highest accuracy — full query-doc interaction visible to model | O(N) forward passes per query — too slow for first-stage retrieval |
Bi-encoders are used for first-stage retrieval at scale. Cross-encoders are used for reranking a small candidate set (typically 50–200 documents). A question asking "which is appropriate for ranking 10 million documents in real time?" points to bi-encoder. Distractors will suggest cross-encoders for their accuracy advantage without acknowledging the latency cost.
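A minimal sketch of the two scoring modes using sentence-transformers (the model names are illustrative choices, not exam-mandated): the bi-encoder compares independently computed vectors, while the cross-encoder scores each pair jointly.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "How does reciprocal rank fusion combine rankings?"
docs = ["RRF merges ranked lists using only rank positions.",
        "HNSW is a graph-based ANN index held in RAM."]

# Bi-encoder: encode query and documents independently; docs can be pre-encoded offline
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
bi_scores = util.cos_sim(bi_encoder.encode(query, convert_to_tensor=True),
                         bi_encoder.encode(docs, convert_to_tensor=True))

# Cross-encoder: one joint forward pass per (query, doc) pair; accurate but O(N) per query
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = cross_encoder.predict([(query, d) for d in docs])
```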
Full technical depth on embedding models, training objectives, and benchmarks: RAG_01_Embedding_Models.
| Index type | Structure | Query time | Memory | Best for |
|---|---|---|---|---|
| HNSW | Multi-level proximity graph; each node has edges to near neighbours at multiple levels | O(log N) | High — graph stored in RAM | Latency-sensitive; default for most applications |
| IVF | K-means clusters at build time; query searches nprobe clusters only | O(nprobe × cluster_size) | Lower than HNSW | Large-scale where memory matters; recall degrades with low nprobe |
| PQ | Splits vectors into sub-vectors; quantises each sub-vector independently | Fast when combined with IVF or HNSW | 8–32× reduction | Billions of vectors; combined as IVFPQ |
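A minimal FAISS sketch of the two index families, assuming 768-dimensional embeddings and random placeholder vectors; the parameter values are illustrative, not tuned recommendations.

```python
import faiss
import numpy as np

d = 768
xb = np.random.rand(100_000, d).astype("float32")   # placeholder corpus embeddings

# HNSW: multi-level proximity graph kept in RAM; no training pass needed
hnsw = faiss.IndexHNSWFlat(d, 32)                    # 32 = neighbours per node (M)
hnsw.add(xb)

# IVF-PQ: k-means coarse quantiser plus product quantisation of the sub-vectors
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, 1024, 64, 8)  # 1024 clusters, 64 sub-vectors, 8 bits each
ivfpq.train(xb)                                      # IVF centroids and PQ codebooks need training
ivfpq.add(xb)
ivfpq.nprobe = 16                                    # clusters searched per query: recall/latency knob

xq = np.random.rand(1, d).astype("float32")
distances, ids = hnsw.search(xq, 10)                 # top-10 approximate nearest neighbours
```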
Vector database comparison (pgvector, Qdrant, Weaviate, Pinecone, Chroma): RAG_02_Vector_Databases.
Neither dense nor sparse retrieval alone is optimal. Dense misses exact lexical matches; sparse misses semantic similarity. Hybrid search combines both.
-- Score document d across all rankers:
RRF(d) = Σᵢ 1 / (k + rᵢ(d))
k = 60 -- smoothing constant; standard default
rᵢ(d) = rank of d in ranker i (1-indexed)
-- Uses rank positions only, not raw scores
-- Robust to scale mismatch between BM25 scores and dense scores
-- No trained weights required
RRF uses only rank positions, not raw scores. Distractors may describe score normalisation (min-max scaling) as the standard fusion approach. RRF is the production default for hybrid search fusion.
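The formula translates into a few lines of Python; this is a generic sketch (the doc IDs and rankings are made up), not a specific vector database's API.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of doc IDs using rank positions only (no raw scores)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):   # ranks are 1-indexed
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# doc1 and doc3 appear in both lists, so they accumulate contributions from both rankers
fused = reciprocal_rank_fusion([["doc3", "doc1", "doc7"],    # BM25 ranking
                                ["doc1", "doc5", "doc3"]])   # dense ranking
```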
RAG_03_Hybrid_Search_and_Reranking — full implementation and ablation.
The retrieval unit must be sized correctly: too large and context is swamped; too small and each chunk lacks sufficient context. Chunking is the most underrated design decision in RAG.
Split by token count with sliding overlap (e.g., 512 tokens, 64-token overlap). Simple and predictable. Breaks mid-sentence; loses discourse structure.
Suitable for: homogeneous prose; simplicity requirements.
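A minimal sketch of the sliding-window split, operating on an already-tokenised document (the tokeniser itself is out of scope here):

```python
def chunk_fixed(tokens: list[str], chunk_size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token sequence into fixed-size chunks with a sliding overlap."""
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break            # last window already reaches the end of the document
    return chunks
```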
Segment at sentence or paragraph boundaries. Variable-size chunks. LangChain SemanticChunker embeds adjacent sentences and splits where cosine similarity drops.
Suitable for: well-structured prose; when chunk coherence matters.
Preserves document structure: headers, tables, code blocks, captions. Tools: Unstructured, LlamaParse. Critical for PDFs, HTML, mixed-modality documents.
Suitable for: any document where a naive text split would mix table cells with prose.
Attaching metadata — source URL, document date, section heading, page number — enables filtered retrieval and improves citation generation. Metadata filtering runs before or inside ANN search depending on the vector database.
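A minimal sketch of metadata-filtered retrieval using Chroma (field names, URL, and values are illustrative); other vector databases expose equivalent filter syntax.

```python
import chromadb

collection = chromadb.Client().create_collection("docs")

# Each chunk is ingested with its metadata alongside the text
collection.add(
    ids=["chunk-001"],
    documents=["Q3 revenue grew 12% year over year."],
    metadatas=[{"source_url": "https://example.com/q3-report",
                "doc_date": "2024-10-01",
                "section": "Financial Highlights",
                "page": 4}],
)

# Filtered retrieval: the ANN search only considers chunks matching the predicate
results = collection.query(
    query_texts=["revenue growth last quarter"],
    n_results=5,
    where={"section": "Financial Highlights"},
)
```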
RAG_04_Chunking_and_Ingestion — implementation, overlap strategies, metadata schema design.
After first-stage retrieval (typically top 50–200 documents), a cross-encoder reranker scores each candidate by processing the (query, document) pair jointly. The top-k reranked documents are passed to the generator.
First-stage retrieval optimises for recall at scale. Reranking optimises for precision on a small set: the cross-encoder sees the full query-document interaction, which is impossible for a bi-encoder that encodes each side independently.
ColBERT retains per-token vectors for query and doc; relevance is the sum of MaxSim inner products across query tokens. PLAID compresses these per-token vectors using product quantisation, making ColBERT deployable at scale as a first-stage retriever (not a reranker).
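The MaxSim scoring itself is a few lines of NumPy; a sketch over placeholder per-token vectors:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late interaction: max inner product per query token, summed over query tokens."""
    sim = query_vecs @ doc_vecs.T        # (query_tokens, doc_tokens) inner products
    return float(sim.max(axis=1).sum())  # MaxSim per query token, then sum

score = maxsim_score(np.random.rand(8, 128), np.random.rand(300, 128))   # 128-dim token vectors
```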
| Method | Latency | Precision | Corpus scale |
|---|---|---|---|
| Bi-encoder ANN | Very low (<10 ms) | Good recall, moderate precision | Billions of docs |
| ColBERT PLAID | Low (<50 ms) | Better precision than bi-encoder | Millions of docs |
| Cross-encoder reranker | High (100–500 ms on 100 docs) | Best precision | Tens to hundreds of docs only |
Add a reranker when first-stage precision is the bottleneck. If first-stage recall@k is already low, fix the retriever first — a reranker cannot surface documents that were not in the candidate set.
Standard single-pass RAG (retrieve → generate) is brittle for multi-step questions. Agentic RAG patterns introduce reasoning loops over retrieval.
The LLM generates a hypothetical answer to the query. The embedding of that hypothetical answer is used as the retrieval query — not the original question. The hypothetical answer is typically closer in embedding space to relevant documents than the short original question is.
Exam trap: HyDE uses the embedding of a hypothetical answer, not the original question. This is counterintuitive and a common distractor target.
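A minimal HyDE sketch; `llm_generate` is a placeholder for any LLM call, and the embedding model name is illustrative:

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def hyde_query_embedding(question: str, llm_generate):
    """Embed an LLM-written hypothetical answer instead of the raw question."""
    prompt = f"Write a short passage that plausibly answers this question:\n{question}"
    hypothetical_answer = llm_generate(prompt)    # placeholder LLM call
    return embedder.encode(hypothetical_answer)   # this vector drives the ANN search
```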
The model emits reflection tokens — [Retrieve], [Relevant], [Supported], [IsUseful] — interspersed in its output. It decides when to retrieve and evaluates retrieved passages for relevance. Requires a specially trained model; not a prompting technique applicable to any LLM.
An evaluator scores retrieved documents. If none score above a threshold, the system triggers a web search fallback or rewrites the query. Addresses the failure mode where first-stage retrieval returns nothing useful.
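A sketch of the corrective gate; `retrieve`, `evaluate_relevance`, and `web_search` are placeholder callables standing in for whichever components the system uses:

```python
def corrective_retrieve(query, retrieve, evaluate_relevance, web_search, threshold=0.5):
    """Keep retrieved docs scoring above the threshold; otherwise fall back to web search."""
    scored = [(doc, evaluate_relevance(query, doc)) for doc in retrieve(query)]
    good = [doc for doc, score in scored if score >= threshold]
    return good if good else web_search(query)   # could also rewrite the query and retry
```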
A router classifies the incoming query and directs it to the appropriate source (vector store, SQL, web search, API). Query decomposition breaks a complex question into sub-queries, executes them, and merges results.
RAG_05_Agentic_RAG_Patterns — full implementation of HyDE, Self-RAG, CRAG, routing.
Standard vector RAG retrieves chunks based on local semantic similarity and cannot capture explicit entity relationships or multi-hop reasoning paths across documents.
GraphRAG is better for global synthesis; vector RAG for local fact lookup. GraphRAG has higher construction cost. They are not mutually exclusive — production systems combine both.
RAG_06_GraphRAG_and_KGs — graph construction, community detection, hybrid graph+vector.
Production RAG systems typically serve multiple users or organisations from shared infrastructure. Two design patterns handle data isolation:
Each tenant has a separate vector index or collection. Retrieval is physically isolated. Simple but expensive: N× index storage and memory overhead. Suitable for high-security multi-tenancy with modest tenant counts.
All documents share a single index. Each document is tagged with tenant or ACL metadata. Retrieval queries include a mandatory metadata filter (e.g., tenant_id == X). More memory-efficient but requires fast filter support without recall degradation.
RAG evaluation must assess both the retrieval stage and the generation stage independently, because failures have different remediation paths.
| Metric | What it measures | Failure mode it catches |
|---|---|---|
| Faithfulness | Do generated claims appear in the retrieved context? | Generator hallucinating beyond the retrieved context |
| Answer relevancy | Is the generated answer responsive to the original question? | Answer accurately reports context but doesn't address the question |
| Context precision | What fraction of retrieved chunks are actually relevant? | Retriever returning noisy or irrelevant documents |
| Context recall | Does retrieved context contain the information needed to answer? | Retriever missing relevant documents |
| Retrieval recall@k | Fraction of questions where at least one relevant doc is in top-k | Retriever failing to surface any relevant document |
Faithfulness measures grounding in retrieved context, not global factual correctness. Answer relevancy measures whether the answer addresses the question. A faithful-but-irrelevant answer is possible: the model accurately reports something from context but does not answer the question asked.
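A minimal RAGAS sketch (the example row is made up, and column names vary slightly between RAGAS versions); an LLM judge is invoked under the hood, so API credentials are assumed to be configured:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

eval_data = Dataset.from_dict({
    "question": ["What does HNSW trade for low query latency?"],
    "answer": ["HNSW keeps a multi-level proximity graph in RAM, trading memory for speed."],
    "contexts": [["HNSW stores a multi-level graph in memory and answers queries in O(log N)."]],
    "ground_truth": ["HNSW trades memory for low query latency: the graph is held in RAM."],
})

result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)   # one score per metric, averaged over the evaluation rows
```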
RAG_07_Production_RAG — RAGAS implementation, drift detection, production monitoring.
RAG adds latency, retrieval failure modes, infrastructure complexity, and cost. The exam tests this directly. Several common scenarios are better served by simpler alternatives.
| Scenario | Better approach | Reason |
|---|---|---|
| Small, static corpus (<100 docs, no updates) | Context stuffing — put all docs in prompt | RAG adds retrieval failure modes; just include everything in context |
| Structured data (tables, databases) | SQL / structured query generation | Vector similarity is the wrong query primitive for structured data |
| Very low update frequency, domain adaptation needed | Fine-tuning | RAG adds inference latency; fine-tuning bakes knowledge into weights |
| LLM parametric knowledge is sufficient | Prompt engineering only | Retrieval adds latency and cost with no quality benefit |
| Real-time, sub-50 ms latency required | Cached retrieval or no retrieval | ANN + rerank + generation chain is typically 200–800 ms |
Do not default to recommending RAG for every knowledge problem. The key signals for alternatives: small or static corpus, structured data, no update cadence, latency constraints under 50 ms.
Total wall-clock time for a typical hybrid RAG response (embed + ANN + rerank 100 docs + generate 500 tokens) is roughly 600–1000 ms on a single GPU with CPU retrieval. Streaming generation output reduces perceived latency at no throughput cost.
NVIDIA stack details (NIM, NeMo, TensorRT-LLM) covered in NVIDIA_GPU_20_NeMo_NIM_AI_Enterprise and cert-prep Presentation 05 (NVIDIA Stack Overview).
The highest-probability exam questions based on notes/05_rag_systems.md and the NCA/NCP domain weightings: