NVIDIA GenAI Cert Prep — Presentation 02

RAG Deep Dive

Retrieval-Augmented Generation from embedding model selection through to production evaluation. A cert-focused synthesis of the ingest, retrieval, and generation stages, with the honest trade-off framing that the exams probe.

NCA Experimentation 22% · NCP Data Prep 9% · Embeddings · HNSW · Hybrid Search · Reranking · RAGAS · Agentic RAG
Ingest → Chunk → Embed → Store → Retrieve → Rerank → Generate → Eval
00

Topics in This Deck

A cert-focused tour of the full RAG pipeline — parametric memory, embeddings, indexes, hybrid search, chunking, agentic patterns, evaluation, and when not to use RAG.

01

Cert Framing — Exam Domains and Weightings

NCA-GENL Associate

Experimentation

22%

Second-largest domain. Covers pipeline design, component selection (embeddings, indexes, chunking), evaluation with RAGAS, and when to choose RAG vs fine-tuning vs context stuffing. Scenario-based questions.

NCP-GENL Professional

Data Preparation

9%

Goes deeper into ingestion specifics: chunking trade-offs, metadata enrichment, index construction (HNSW vs IVF), hybrid search fusion formulas, and evaluation metric definitions.

Cross-domain bleed

RAG system design decisions bleed into the Evaluation domain (RAGAS metrics), the Prompt Engineering domain (retrieval-augmented prompt construction), and the NVIDIA stack domain (NeMo Retriever, NIM). The architecture understanding from Deck 01 is the prerequisite — embedding models are encoder-only transformers.

02

The RAG Argument — Parametric vs Non-Parametric Memory

A pretrained LLM stores knowledge in its weights — parametric memory. It is fixed at inference time, cannot be updated without retraining, and cannot cite sources. The model may also confabulate.

Lewis et al. (NeurIPS 2020) combined a seq2seq model with a dense retrieval index over Wikipedia, achieving state-of-the-art on open-domain QA while producing "more specific, diverse and factual" language than parametric-only baselines. The retrieval index is non-parametric memory: updatable by modifying the index without retraining.

Parametric memory (LLM weights)

  • Fixed at training time
  • No source attribution
  • May hallucinate outdated facts
  • Update requires full retraining

Non-parametric memory (retrieval index)

  • Updatable at any time (re-index)
  • Supports source attribution
  • Knowledge bounded by index quality
  • Adds latency and retrieval failure modes

RAG-Sequence vs RAG-Token

Lewis et al. proposed two variants. RAG-Sequence uses the same retrieved passages for the entire generated output. RAG-Token allows different passages to inform each generated token. RAG-Sequence is the standard in production; RAG-Token is rarely used outside research.

03

The Full Pipeline — Ingest to Evaluate

Pipeline (diagram). Ingest path (offline): Raw Docs → Chunk → Embed → Vector Store. Query path (online): Query Embed → Retrieve (BM25 + ANN) → RRF Merge → Rerank → Generate (LLM) → Eval.

The ingest pipeline runs offline (or continuously). The query pipeline runs online for every user request. The bottleneck is almost always the retrieval + rerank stage, not the generation stage.

04

Embedding Models — Dense, Sparse, ColBERT

Type | Mechanism | Strengths | Weaknesses
Dense bi-encoder | Query and doc encoded independently to a single vector; cosine/dot similarity | Semantic similarity, paraphrase generalisation, fast ANN retrieval | Misses exact matches, abbreviations, rare proper nouns
Sparse (BM25) | TF-IDF with saturation and length normalisation; exact lexical match | Exact matches, rare terms, abbreviations | No semantic generalisation
ColBERT late interaction | Per-token vectors for query and doc; MaxSim sum over query tokens | More accurate than single-vector bi-encoder; token-level granularity | Larger index; slower than bi-encoder
Cross-encoder | Query + doc concatenated and processed jointly through a transformer | Highest accuracy — full query-doc interaction visible to model | O(N) forward passes per query — too slow for first-stage retrieval
Critical exam distinction

Bi-encoders are used for first-stage retrieval at scale. Cross-encoders are used for reranking a small candidate set (typically 50–200 documents). A question asking "which is appropriate for ranking 10 million documents in real time?" points to bi-encoder. Distractors will suggest cross-encoders for their accuracy advantage without acknowledging the latency cost.
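A minimal sketch of the two scoring patterns using the sentence-transformers library; the model names are illustrative, and any bi-encoder / cross-encoder pair follows the same shape.

```python
# Sketch: bi-encoder retrieval vs cross-encoder reranking (sentence-transformers).
# Model names and documents are illustrative.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "How does HNSW trade memory for latency?"
docs = [
    "HNSW stores a multi-level proximity graph in RAM ...",
    "IVF partitions vectors into k-means clusters ...",
    "Product quantisation compresses sub-vectors ...",
]

# Bi-encoder: query and documents embedded independently, compared by cosine similarity.
bi = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = bi.encode(docs, convert_to_tensor=True)   # done once, offline
q_emb = bi.encode(query, convert_to_tensor=True)    # done per query
candidates = util.semantic_search(q_emb, doc_emb, top_k=3)[0]

# Cross-encoder: each (query, doc) pair passes jointly through the transformer,
# so it only scores the small candidate set, never the full corpus.
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = ce.predict([(query, docs[c["corpus_id"]]) for c in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
```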

Cross-reference

Full technical depth on embedding models, training objectives, and benchmarks: RAG_01_Embedding_Models.

05

Vector Indexes — HNSW, IVF, PQ

Index type | Structure | Query time | Memory | Best for
HNSW | Multi-level proximity graph; each node has edges to near neighbours at multiple levels | O(log N) | High — graph stored in RAM | Latency-sensitive; default for most applications
IVF | K-means clusters at build time; query searches nprobe clusters only | O(nprobe × cluster_size) | Lower than HNSW | Large-scale where memory matters; recall degrades with low nprobe
PQ | Splits vectors into sub-vectors; quantises each sub-vector independently | Fast when combined with IVF or HNSW | 8–32× reduction | Billions of vectors; combined as IVFPQ
[Chart: Recall@10 vs latency by index type — HNSW, IVF, IVFPQ]
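A minimal faiss sketch of the three index types above; the dimension, M, nlist, m, and nprobe values are illustrative rather than tuned recommendations.

```python
# Sketch: building HNSW, IVF, and IVFPQ indexes with faiss.
import faiss
import numpy as np

d = 768                                              # embedding dimension
xb = np.random.rand(100_000, d).astype("float32")    # corpus embeddings (stand-in)
xq = np.random.rand(1, d).astype("float32")          # query embedding (stand-in)

# HNSW: multi-level proximity graph held in RAM; 32 = edges per node. No training step.
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.add(xb)

# IVF: k-means partitioning at build time; queries probe only `nprobe` of the 1024 clusters.
quantizer_flat = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer_flat, d, 1024)
ivf.train(xb)                                        # learn cluster centroids
ivf.add(xb)
ivf.nprobe = 16                                      # recall vs latency knob

# IVFPQ: IVF partitioning plus product quantisation (64 sub-vectors, 8 bits each).
quantizer_pq = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer_pq, d, 1024, 64, 8)
ivfpq.train(xb)
ivfpq.add(xb)
ivfpq.nprobe = 16

for index in (hnsw, ivf, ivfpq):
    distances, ids = index.search(xq, 10)            # top-10 nearest neighbours
```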
Cross-reference

Vector database comparison (pgvector, Qdrant, Weaviate, Pinecone, Chroma): RAG_02_Vector_Databases.

06

Hybrid Search — BM25 + Dense + RRF + Reranker

Neither dense nor sparse retrieval alone is optimal. Dense misses exact lexical matches; sparse misses semantic similarity. Hybrid search combines both.

Query → BM25 ranked list (lexical match) + Dense ANN list (semantic match) → RRF merge (k = 60) → Cross-encoder rerank → Top-k → LLM
Reciprocal Rank Fusion (RRF)
-- Score document d across all rankers:
RRF(d) = Σᵢ 1 / (k + rᵢ(d))

k = 60          -- smoothing constant; standard default
rᵢ(d) = rank of d in ranker i (1-indexed)

-- Uses rank positions only, not raw scores
-- Robust to scale mismatch between BM25 scores and dense scores
-- No trained weights required
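A minimal Python sketch of the same fusion, assuming only ranked lists of document IDs as input (the IDs shown are illustrative).

```python
# Sketch: Reciprocal Rank Fusion over a BM25 ranking and a dense ANN ranking.
from collections import defaultdict

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists using only rank positions, never raw scores."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):   # ranks are 1-indexed
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranked = ["doc7", "doc2", "doc9"]    # lexical ranking (illustrative)
dense_ranked = ["doc2", "doc4", "doc7"]   # semantic ranking (illustrative)
print(rrf_merge([bm25_ranked, dense_ranked]))   # doc2 and doc7 rise to the top
```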
Exam angle: RRF vs score normalisation

RRF uses only rank positions, not raw scores. Distractors may describe score normalisation (min-max scaling) as the standard fusion approach. RRF is the production default for hybrid search fusion.

Cross-reference

RAG_03_Hybrid_Search_and_Reranking — full implementation and ablation.

07

Chunking Strategies — Fixed, Semantic, Layout-Aware

The retrieval unit must be sized correctly: too large and relevant content is diluted while the context window fills with noise; too small and each chunk lacks enough surrounding context to stand on its own. Chunking is the most underrated design decision in RAG.

Fixed-Size

Split by token count with sliding overlap (e.g., 512 tokens, 64-token overlap). Simple and predictable. Breaks mid-sentence; loses discourse structure.

Suitable for: homogeneous prose; simplicity requirements.
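A minimal sketch of fixed-size chunking with sliding overlap; it splits on whitespace tokens for illustration, whereas production code would use the embedding model's own tokenizer so chunk sizes match its limits.

```python
# Sketch: fixed-size chunking by token count with a sliding overlap window.
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    tokens = text.split()                  # whitespace "tokens" for illustration only
    step = chunk_size - overlap            # stride between chunk starts
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

document_text = "HNSW stores a multi-level proximity graph ..."   # illustrative input
chunks = fixed_size_chunks(document_text)
```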

Semantic / Sentence

Segment at sentence or paragraph boundaries. Variable-size chunks. LangChain SemanticChunker embeds adjacent sentences and splits where cosine similarity drops.

Suitable for: well-structured prose; when chunk coherence matters.

Layout-Aware

Preserves document structure: headers, tables, code blocks, captions. Tools: Unstructured, LlamaParse. Critical for PDFs, HTML, mixed-modality documents.

Suitable for: any document where a naive text split would mix table cells with prose.

Metadata enrichment

Attaching metadata — source URL, document date, section heading, page number — enables filtered retrieval and improves citation generation. Metadata filtering runs before or inside ANN search depending on the vector database.
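A sketch of what an enriched chunk record might look like at ingest time; every field name and value is illustrative, not a required schema.

```python
# Sketch: a chunk record carrying the metadata that filtered retrieval and
# citation generation rely on. Values are illustrative.
chunk_record = {
    "id": "doc42-chunk-007",
    "text": "HNSW keeps a multi-level proximity graph in RAM ...",
    "embedding": [0.0] * 768,   # placeholder; the real vector comes from the embedding model
    "metadata": {
        "source_url": "https://example.com/vector-indexes",
        "document_date": "2024-11-02",
        "section_heading": "Index construction",
        "page_number": 12,
    },
}
# At query time, retrieval can then apply a mandatory filter such as
# {"section_heading": "Index construction"} before or inside the ANN search.
```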

Cross-reference

RAG_04_Chunking_and_Ingestion — implementation, overlap strategies, metadata schema design.

08

Reranking — When It Pays Off

After first-stage retrieval (typically top 50–200 documents), a cross-encoder reranker scores each candidate by processing the (query, document) pair jointly. The top-k reranked documents are passed to the generator.

Why reranking improves precision

First-stage retrieval optimises for recall at scale. Reranking optimises for precision on a small set: the cross-encoder sees the full query-document interaction, which is impossible for a bi-encoder that encodes each side independently.

ColBERT PLAID

ColBERT retains per-token vectors for query and doc; relevance is the sum of MaxSim inner products across query tokens. PLAID compresses these per-token vectors using product quantisation, making ColBERT deployable at scale as a first-stage retriever (not a reranker).
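A minimal numpy sketch of MaxSim scoring; the random arrays stand in for the learned per-token vectors a real ColBERT model would produce.

```python
# Sketch: ColBERT-style late-interaction (MaxSim) relevance scoring.
import numpy as np

q_tokens = np.random.rand(8, 128)     # 8 query tokens, 128-dim per-token vectors (stand-in)
d_tokens = np.random.rand(200, 128)   # 200 document tokens (stand-in)

sim = q_tokens @ d_tokens.T           # all query-token x doc-token inner products
score = sim.max(axis=1).sum()         # MaxSim: best doc token per query token, summed
```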

Method | Latency | Precision | Corpus scale
Bi-encoder ANN | Very low (<10 ms) | Good recall, moderate precision | Billions of docs
ColBERT PLAID | Low (<50 ms) | Better precision than bi-encoder | Millions of docs
Cross-encoder reranker | High (100–500 ms on 100 docs) | Best precision | Tens to hundreds of docs only
Practical guidance

Add a reranker when first-stage precision is the bottleneck. If first-stage recall@k is already low, fix the retriever first — a reranker cannot surface documents that were not in the candidate set.

09

Agentic RAG — HyDE, Self-RAG, CRAG, Query Routing

Standard single-pass RAG (retrieve → generate) is brittle for multi-step questions. Agentic RAG patterns introduce reasoning loops over retrieval.

HyDE (Hypothetical Document Embeddings)

The LLM generates a hypothetical answer to the query. The embedding of that hypothetical answer is used as the retrieval query — not the original question. The hypothetical answer is typically closer in embedding space to relevant documents than the original question, which is often terse and under-specified.

Exam trap: HyDE uses the embedding of a hypothetical answer, not the original question. This is counterintuitive and a common distractor target.
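A minimal sketch of the HyDE flow, assuming hypothetical generate() and embed() helpers and a vector_store object with a search() method; none of these names belong to a specific library API.

```python
# Sketch of HyDE. generate(), embed() and vector_store.search() are hypothetical helpers.
def hyde_retrieve(question: str, vector_store, k: int = 5):
    # 1. Ask the LLM for a plausible (possibly wrong) answer to the question.
    hypothetical_answer = generate(
        f"Write a short passage that answers the question:\n{question}"
    )
    # 2. Embed the hypothetical answer, NOT the original question.
    query_vector = embed(hypothetical_answer)
    # 3. Retrieve real documents nearest to the hypothetical answer.
    return vector_store.search(query_vector, top_k=k)
```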

Self-RAG

The model emits reflection tokens — [Retrieve], [Relevant], [Supported], [IsUseful] — interspersed in its output. It decides when to retrieve and evaluates retrieved passages for relevance. Requires a specially trained model; not a prompting technique applicable to any LLM.

Corrective RAG (CRAG)

An evaluator scores retrieved documents. If none score above a threshold, the system triggers a web search fallback or rewrites the query. Addresses the failure mode where first-stage retrieval returns nothing useful.
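A minimal sketch of the CRAG control flow, assuming hypothetical retrieve(), grade(), rewrite_query(), web_search(), and generate() helpers.

```python
# Sketch of Corrective RAG. All helper functions are hypothetical, not a library API.
def corrective_rag(question: str, threshold: float = 0.5) -> str:
    docs = retrieve(question)
    graded = [(doc, grade(question, doc)) for doc in docs]   # relevance score per doc
    good = [doc for doc, score in graded if score >= threshold]
    if not good:
        # Nothing useful retrieved: rewrite the query and fall back to web search.
        good = web_search(rewrite_query(question))
    return generate(question, context=good)
```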

Query Routing and Decomposition

A router classifies the incoming query and directs it to the appropriate source (vector store, SQL, web search, API). Query decomposition breaks a complex question into sub-queries, executes them, and merges results.
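A minimal routing sketch, assuming a hypothetical classify() LLM call and hypothetical handlers for each backend.

```python
# Sketch of a query router. classify(), search_vector_store(), run_text_to_sql()
# and run_web_search() are hypothetical helpers, not a specific library API.
ROUTES = {
    "vector_store": lambda q: search_vector_store(q),   # unstructured knowledge
    "sql": lambda q: run_text_to_sql(q),                 # structured / tabular asks
    "web": lambda q: run_web_search(q),                  # fresh, out-of-corpus facts
}

def route_query(question: str):
    # classify() is a hypothetical LLM call returning one of the route labels above.
    label = classify(question, options=list(ROUTES))
    return ROUTES[label](question)
```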

Cross-reference

RAG_05_Agentic_RAG_Patterns — full implementation of HyDE, Self-RAG, CRAG, routing.

10

GraphRAG — Entities, Communities, Hybrid

Standard vector RAG retrieves chunks based on local semantic similarity and cannot capture explicit entity relationships or multi-hop reasoning paths across documents.

Microsoft GraphRAG (Edge et al., 2024)

  1. Extract entities and relationships from the corpus using an LLM.
  2. Build a knowledge graph; cluster entities into communities at multiple granularities.
  3. Generate community summaries via a further LLM pass.
  4. At query time, retrieve community summaries alongside raw chunks, providing global context that dense vector search cannot surface.

GraphRAG strengths

  • Global synthesis across many documents ("what are the main themes?")
  • Multi-hop entity relationship reasoning
  • Better for corpus-level questions than document-level fact lookup

GraphRAG limitations

  • High construction cost: many LLM calls during ingestion
  • Not a replacement for vector RAG on local fact lookup
  • Community summary quality depends on extraction LLM
Exam angle

GraphRAG is better for global synthesis; vector RAG for local fact lookup. GraphRAG has higher construction cost. They are not mutually exclusive — production systems combine both.

Cross-reference

RAG_06_GraphRAG_and_KGs — graph construction, community detection, hybrid graph+vector.

11

Multi-Tenancy and ACLs

Production RAG systems typically serve multiple users or organisations from shared infrastructure. Two design patterns handle data isolation:

Namespace / Collection isolation

Each tenant has a separate vector index or collection. Retrieval is physically isolated. Simple but expensive: N× index storage and memory overhead. Suitable for high-security multi-tenancy with modest tenant counts.

Metadata-filter ACL

All documents share a single index. Each document is tagged with tenant or ACL metadata. Retrieval queries include a mandatory metadata filter (e.g., tenant_id == X). More memory-efficient but requires fast filter support without recall degradation.
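A minimal sketch of the metadata-filter pattern using chromadb as one example; the collection name, document text, and tenant_id values are illustrative.

```python
# Sketch: metadata-filter ACL on a shared index (chromadb, default embedding function).
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("shared_docs")

# Documents are tagged with their tenant at ingest time.
collection.add(
    ids=["doc42-chunk-007"],
    documents=["Revenue is recognised when the performance obligation is met ..."],
    metadatas=[{"tenant_id": "acme-corp"}],
)

# Every query carries a mandatory tenant filter, enforced inside the ANN search.
results = collection.query(
    query_texts=["quarterly revenue recognition policy"],
    n_results=5,
    where={"tenant_id": "acme-corp"},
)
```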

Practical considerations

12

Evaluation — RAGAS Metrics

RAG evaluation must assess both the retrieval stage and the generation stage independently, because failures have different remediation paths.

Metric | What it measures | Failure mode it catches
Faithfulness | Do generated claims appear in the retrieved context? | Generator hallucinating beyond the retrieved context
Answer relevancy | Is the generated answer responsive to the original question? | Answer accurately reports context but doesn't address the question
Context precision | What fraction of retrieved chunks are actually relevant? | Retriever returning noisy or irrelevant documents
Context recall | Does retrieved context contain the information needed to answer? | Retriever missing relevant documents
Retrieval recall@k | Fraction of questions where at least one relevant doc is in top-k | Retriever failing to surface any relevant document
Critical exam distinction

Faithfulness measures grounding in retrieved context, not global factual correctness. Answer relevancy measures whether the answer addresses the question. A faithful-but-irrelevant answer is possible: the model accurately reports something from context but does not answer the question asked.
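A minimal sketch of a RAGAS evaluation call; it assumes the 0.1-era ragas API and column names, a judge LLM configured via the environment (e.g. OPENAI_API_KEY), and an illustrative example row.

```python
# Sketch: scoring a RAG run with RAGAS (0.1-era API; column names may differ by version).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy, context_precision, context_recall,
)

rows = {
    "question": ["Which index type suits latency-sensitive applications?"],
    "answer": ["HNSW is the default for latency-sensitive applications."],
    "contexts": [["HNSW: multi-level proximity graph ... default for most applications."]],
    "ground_truth": ["HNSW"],
}
scores = evaluate(
    Dataset.from_dict(rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)   # per-metric averages between 0 and 1
```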

Cross-reference

RAG_07_Production_RAG — RAGAS implementation, drift detection, production monitoring.

13

When NOT to Use RAG

RAG adds latency, retrieval failure modes, infrastructure complexity, and cost. The exam tests this directly. Several common scenarios are better served by simpler alternatives.

Scenario | Better approach | Reason
Small, static corpus (<100 docs, no updates) | Context stuffing — put all docs in prompt | RAG adds retrieval failure modes; just include everything in context
Structured data (tables, databases) | SQL / structured query generation | Vector similarity is the wrong query primitive for structured data
Very low update frequency, domain adaptation needed | Fine-tuning | RAG adds inference latency; fine-tuning bakes knowledge into weights
LLM parametric knowledge is sufficient | Prompt engineering only | Retrieval adds latency and cost with no quality benefit
Real-time, sub-50 ms latency required | Cached retrieval or no retrieval | ANN + rerank + generation chain is typically 200–800 ms
Exam angle

Do not default to recommending RAG for every knowledge problem. The key signals for alternatives: small or static corpus, structured data, no update cadence, latency constraints under 50 ms.

14

Latency Budget Breakdown

Typical RAG latency components:

  • Embed query — ~5 ms
  • ANN search — ~10 ms
  • Rerank (100 docs) — ~200 ms ← bottleneck
  • Generate (~500 tok) — ~400–800 ms

Total wall-clock time for a typical hybrid RAG response (embed + ANN + rerank 100 docs + generate 500 tokens) is roughly 600–1000 ms on a single GPU with CPU retrieval. Streaming generation output reduces perceived latency at no throughput cost.

Optimisation levers

15

NVIDIA-Specific — NeMo Retriever and NIM Embedding Microservices

NeMo Retriever

  • NVIDIA's production RAG microservice framework.
  • Supports dense + sparse hybrid retrieval with GPU-accelerated embedding.
  • Integrates with NeMo Guardrails for output filtering.
  • Available as a container for DGX and cloud instances.
  • Provides the retrieval layer in NVIDIA AI Enterprise RAG blueprints.

NIM Embedding Microservices

  • Pre-built, optimised containers for serving embedding models on NVIDIA GPUs.
  • Includes NV-Embed-v2: a dense bi-encoder optimised for retrieval.
  • OpenAI-compatible API endpoint — drop-in replacement for existing RAG pipelines (sketched below).
  • GPU-accelerated batch inference significantly reduces embedding latency versus CPU.
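A minimal sketch of calling a NIM embedding endpoint through its OpenAI-compatible API; the base_url and model ID are placeholders for whatever is actually deployed.

```python
# Sketch: querying a NIM embedding microservice via the OpenAI Python client.
# base_url and model are placeholders for a local deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used-locally")
response = client.embeddings.create(
    model="nvidia/nv-embed-v2",   # placeholder model ID; use the deployed model's name
    input=["How does HNSW trade memory for latency?"],
)
vector = response.data[0].embedding
```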
Cross-reference

NVIDIA stack details (NIM, NeMo, TensorRT-LLM) covered in NVIDIA_GPU_20_NeMo_NIM_AI_Enterprise and cert-prep Presentation 05 (NVIDIA Stack Overview).

16

Likely Exam Angles

The highest-probability exam questions based on notes/05_rag_systems.md and the NCA/NCP domain weightings:

17

Cross-References and Further Reading

Portfolio repos (depth treatment)

Cert-prep repo resources

Primary literature