Retrieval-Augmented Generation from embedding model selection through to production evaluation. Cert-focused synthesis of the ingest, retrieval, and generation stages, with honest framing of the trade-offs the exams probe.
A cert-focused tour of the full RAG pipeline — parametric memory, embeddings, indexes, hybrid search, chunking, agentic patterns, evaluation, and when not to use RAG.
Experimentation
22%
Second-largest domain. Covers pipeline design, component selection (embeddings, indexes, chunking), evaluation with RAGAS, and when to choose RAG vs fine-tuning vs context stuffing. Scenario-based questions.
Data Preparation
9%
Probes ingestion specifics in more detail: chunking trade-offs, metadata enrichment, index construction (HNSW vs IVF), hybrid search fusion formulas, and evaluation metric definitions.
RAG system design decisions bleed into the Evaluation domain (RAGAS metrics), the Prompt Engineering domain (retrieval-augmented prompt construction), and the NVIDIA stack domain (NeMo Retriever, NIM). The architecture understanding from Deck 01 is the prerequisite — embedding models are encoder-only transformers.
A pretrained LLM stores knowledge in its weights — parametric memory. It is fixed at inference time, cannot be updated without retraining, and cannot cite sources. The model may also confabulate.
Lewis et al. (NeurIPS 2020) combined a seq2seq model with a dense retrieval index over Wikipedia, achieving state-of-the-art on open-domain QA while producing "more specific, diverse and factual" language than parametric-only baselines. The retrieval index is non-parametric memory: updatable by modifying the index without retraining.
Lewis et al. proposed two variants. RAG-Sequence uses the same retrieved passages for the entire generated output. RAG-Token allows different passages to inform each generated token. RAG-Sequence is the standard in production; RAG-Token is rarely used outside research.
The ingest pipeline runs offline (or continuously). The query pipeline runs online for every user request. The bottleneck is almost always the retrieval + rerank stage, not the generation stage.
| Type | Mechanism | Strengths | Weaknesses |
|---|---|---|---|
| Dense bi-encoder | Query and doc encoded independently into single vectors; cosine/dot similarity | Semantic similarity, paraphrase generalisation, fast ANN retrieval | Misses exact matches, abbreviations, rare proper nouns |
| Sparse (BM25) | TF-IDF with saturation and length normalisation; exact lexical match | Exact matches, rare terms, abbreviations | No semantic generalisation |
| ColBERT late interaction | Per-token vectors for query and doc; MaxSim sum over query tokens | More accurate than single-vector bi-encoder; token-level granularity | Larger index; slower than bi-encoder |
| Cross-encoder | Query+doc concatenated and processed jointly through a transformer | Highest accuracy — full query-doc interaction visible to model | O(N) forward passes per query — too slow for first-stage retrieval |
Bi-encoders are used for first-stage retrieval at scale. Cross-encoders are used for reranking a small candidate set (typically 50–200 documents). A question asking "which is appropriate for ranking 10 million documents in real time?" points to bi-encoder. Distractors will suggest cross-encoders for their accuracy advantage without acknowledging the latency cost.
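A minimal sketch of the two scoring modes using sentence-transformers (the model names are illustrative choices, not exam-mandated): the bi-encoder compares independently computed vectors, while the cross-encoder scores each pair jointly.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "How does reciprocal rank fusion combine rankings?"
docs = ["RRF merges ranked lists using only rank positions.",
        "HNSW is a graph-based ANN index held in RAM."]

# Bi-encoder: encode query and documents independently; docs can be pre-encoded offline
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
bi_scores = util.cos_sim(bi_encoder.encode(query, convert_to_tensor=True),
                         bi_encoder.encode(docs, convert_to_tensor=True))

# Cross-encoder: one joint forward pass per (query, doc) pair; accurate but O(N) per query
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = cross_encoder.predict([(query, d) for d in docs])
```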
Full technical depth on embedding models, training objectives, and benchmarks: RAG_01_Embedding_Models.
| Index type | Structure | Query time | Memory | Best for |
|---|---|---|---|---|
| HNSW | Multi-level proximity graph; each node has edges to near neighbours at multiple levels | O(log N) | High — graph stored in RAM | Latency-sensitive; default for most applications |
| IVF | K-means clusters at build time; query searches nprobe clusters only | O(nprobe × cluster_size) | Lower than HNSW | Large-scale where memory matters; recall degrades with low nprobe |
| PQ | Splits vectors into sub-vectors; quantises each sub-vector independently | Fast when combined with IVF or HNSW | 8–32× reduction | Billions of vectors; combined as IVFPQ |
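A minimal FAISS sketch of the two index families, assuming 768-dimensional embeddings and random placeholder vectors; the parameter values are illustrative, not tuned recommendations.

```python
import faiss
import numpy as np

d = 768
xb = np.random.rand(100_000, d).astype("float32")   # placeholder corpus embeddings

# HNSW: multi-level proximity graph kept in RAM; no training pass needed
hnsw = faiss.IndexHNSWFlat(d, 32)                    # 32 = neighbours per node (M)
hnsw.add(xb)

# IVF-PQ: k-means coarse quantiser plus product quantisation of the sub-vectors
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, 1024, 64, 8)  # 1024 clusters, 64 sub-vectors, 8 bits each
ivfpq.train(xb)                                      # IVF centroids and PQ codebooks need training
ivfpq.add(xb)
ivfpq.nprobe = 16                                    # clusters searched per query: recall/latency knob

xq = np.random.rand(1, d).astype("float32")
distances, ids = hnsw.search(xq, 10)                 # top-10 approximate nearest neighbours
```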
Vector database comparison (pgvector, Qdrant, Weaviate, Pinecone, Chroma): RAG_02_Vector_Databases.
Neither dense nor sparse retrieval alone is optimal. Dense misses exact lexical matches; sparse misses semantic similarity. Hybrid search combines both.
-- Score document d across all rankers:
RRF(d) = Σᵢ 1 / (k + rᵢ(d))
k = 60 -- smoothing constant; standard default
rᵢ(d) = rank of d in ranker i (1-indexed)
-- Uses rank positions only, not raw scores
-- Robust to scale mismatch between BM25 scores and dense scores
-- No trained weights required
RRF uses only rank positions, not raw scores. Distractors may describe score normalisation (min-max scaling) as the standard fusion approach. RRF is the production default for hybrid search fusion.
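The formula translates into a few lines of Python; this is a generic sketch (the doc IDs and rankings are made up), not a specific vector database's API.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of doc IDs using rank positions only (no raw scores)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):   # ranks are 1-indexed
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# doc1 and doc3 appear in both lists, so they accumulate contributions from both rankers
fused = reciprocal_rank_fusion([["doc3", "doc1", "doc7"],    # BM25 ranking
                                ["doc1", "doc5", "doc3"]])   # dense ranking
```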
RAG_03_Hybrid_Search_and_Reranking — full implementation and ablation.
The retrieval unit must be sized correctly: too large and context is swamped; too small and each chunk lacks sufficient context. Chunking is the most underrated design decision in RAG.
Split by token count with sliding overlap (e.g., 512 tokens, 64-token overlap). Simple and predictable. Breaks mid-sentence; loses discourse structure.
Suitable for: homogeneous prose; simplicity requirements.
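A minimal sketch of the sliding-window split, operating on an already-tokenised document (the tokeniser itself is out of scope here):

```python
def chunk_fixed(tokens: list[str], chunk_size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token sequence into fixed-size chunks with a sliding overlap."""
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break            # last window already reaches the end of the document
    return chunks
```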
Segment at sentence or paragraph boundaries. Variable-size chunks. LangChain SemanticChunker embeds adjacent sentences and splits where cosine similarity drops.
Suitable for: well-structured prose; when chunk coherence matters.
Preserves document structure: headers, tables, code blocks, captions. Tools: Unstructured, LlamaParse. Critical for PDFs, HTML, mixed-modality documents.
Suitable for: any document where a naive text split would mix table cells with prose.
Attaching metadata — source URL, document date, section heading, page number — enables filtered retrieval and improves citation generation. Metadata filtering runs before or inside ANN search depending on the vector database.
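A minimal sketch of metadata-filtered retrieval using Chroma (field names, URL, and values are illustrative); other vector databases expose equivalent filter syntax.

```python
import chromadb

collection = chromadb.Client().create_collection("docs")

# Each chunk is ingested with its metadata alongside the text
collection.add(
    ids=["chunk-001"],
    documents=["Q3 revenue grew 12% year over year."],
    metadatas=[{"source_url": "https://example.com/q3-report",
                "doc_date": "2024-10-01",
                "section": "Financial Highlights",
                "page": 4}],
)

# Filtered retrieval: the ANN search only considers chunks matching the predicate
results = collection.query(
    query_texts=["revenue growth last quarter"],
    n_results=5,
    where={"section": "Financial Highlights"},
)
```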
RAG_04_Chunking_and_Ingestion — implementation, overlap strategies, metadata schema design.
After first-stage retrieval (typically top 50–200 documents), a cross-encoder reranker scores each candidate by processing the (query, document) pair jointly. The top-k reranked documents are passed to the generator.
First-stage retrieval optimises for recall at scale. Reranking optimises for precision on a small set: the cross-encoder sees the full query-document interaction, which is impossible for a bi-encoder that encodes each side independently.
ColBERT retains per-token vectors for query and doc; relevance is the sum of MaxSim inner products across query tokens. PLAID compresses these per-token vectors using product quantisation, making ColBERT deployable at scale as a first-stage retriever (not a reranker).
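The MaxSim scoring itself is a few lines of NumPy; a sketch over placeholder per-token vectors:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late interaction: max inner product per query token, summed over query tokens."""
    sim = query_vecs @ doc_vecs.T        # (query_tokens, doc_tokens) inner products
    return float(sim.max(axis=1).sum())  # MaxSim per query token, then sum

score = maxsim_score(np.random.rand(8, 128), np.random.rand(300, 128))   # 128-dim token vectors
```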
| Method | Latency | Precision | Corpus scale |
|---|---|---|---|
| Bi-encoder ANN | Very low (<10 ms) | Good recall, moderate precision | Billions of docs |
| ColBERT PLAID | Low (<50 ms) | Better precision than bi-encoder | Millions of docs |
| Cross-encoder reranker | High (100–500 ms on 100 docs) | Best precision | Tens to hundreds of docs only |
Add a reranker when first-stage precision is the bottleneck. If first-stage recall@k is already low, fix the retriever first — a reranker cannot surface documents that were not in the candidate set.
Standard single-pass RAG (retrieve → generate) is brittle for multi-step questions. Agentic RAG patterns introduce reasoning loops over retrieval.
The LLM generates a hypothetical answer to the query. The embedding of that hypothetical answer is used as the retrieval query — not the original question. The hypothetical answer is typically closer in embedding space to relevant documents than the short original question is.
Exam trap: HyDE uses the embedding of a hypothetical answer, not the original question. This is counterintuitive and a common distractor target.
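A minimal HyDE sketch; `llm_generate` is a placeholder for any LLM call, and the embedding model name is illustrative:

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def hyde_query_embedding(question: str, llm_generate):
    """Embed an LLM-written hypothetical answer instead of the raw question."""
    prompt = f"Write a short passage that plausibly answers this question:\n{question}"
    hypothetical_answer = llm_generate(prompt)    # placeholder LLM call
    return embedder.encode(hypothetical_answer)   # this vector drives the ANN search
```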
The model emits reflection tokens — [Retrieve], [Relevant], [Supported], [IsUseful] — interspersed in its output. It decides when to retrieve and evaluates retrieved passages for relevance. Requires a specially trained model; not a prompting technique applicable to any LLM.
An evaluator scores retrieved documents. If none score above a threshold, the system triggers a web search fallback or rewrites the query. Addresses the failure mode where first-stage retrieval returns nothing useful.
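A sketch of the corrective gate; `retrieve`, `evaluate_relevance`, and `web_search` are placeholder callables standing in for whichever components the system uses:

```python
def corrective_retrieve(query, retrieve, evaluate_relevance, web_search, threshold=0.5):
    """Keep retrieved docs scoring above the threshold; otherwise fall back to web search."""
    scored = [(doc, evaluate_relevance(query, doc)) for doc in retrieve(query)]
    good = [doc for doc, score in scored if score >= threshold]
    return good if good else web_search(query)   # could also rewrite the query and retry
```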
A router classifies the incoming query and directs it to the appropriate source (vector store, SQL, web search, API). Query decomposition breaks a complex question into sub-queries, executes them, and merges results.
RAG_05_Agentic_RAG_Patterns — full implementation of HyDE, Self-RAG, CRAG, routing.
Standard vector RAG retrieves chunks based on local semantic similarity and cannot capture explicit entity relationships or multi-hop reasoning paths across documents.
GraphRAG is better for global synthesis; vector RAG for local fact lookup. GraphRAG has higher construction cost. They are not mutually exclusive — production systems combine both.
RAG_06_GraphRAG_and_KGs — graph construction, community detection, hybrid graph+vector.
Production RAG systems typically serve multiple users or organisations from shared infrastructure. Two design patterns handle data isolation:
Each tenant has a separate vector index or collection. Retrieval is physically isolated. Simple but expensive: N× index storage and memory overhead. Suitable for high-security multi-tenancy with modest tenant counts.
All documents share a single index. Each document is tagged with tenant or ACL metadata. Retrieval queries include a mandatory metadata filter (e.g., tenant_id == X). More memory-efficient but requires fast filter support without recall degradation.
RAG evaluation must assess both the retrieval stage and the generation stage independently, because failures have different remediation paths.
| Metric | What it measures | Failure mode it catches |
|---|---|---|
| Faithfulness | Do generated claims appear in the retrieved context? | Generator hallucinating beyond the retrieved context |
| Answer relevancy | Is the generated answer responsive to the original question? | Answer accurately reports context but doesn't address the question |
| Context precision | What fraction of retrieved chunks are actually relevant? | Retriever returning noisy or irrelevant documents |
| Context recall | Does retrieved context contain the information needed to answer? | Retriever missing relevant documents |
| Retrieval recall@k | Fraction of questions where at least one relevant doc is in top-k | Retriever failing to surface any relevant document |
Faithfulness measures grounding in retrieved context, not global factual correctness. Answer relevancy measures whether the answer addresses the question. A faithful-but-irrelevant answer is possible: the model accurately reports something from context but does not answer the question asked.
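A minimal RAGAS sketch (the example row is made up, and column names vary slightly between RAGAS versions); an LLM judge is invoked under the hood, so API credentials are assumed to be configured:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

eval_data = Dataset.from_dict({
    "question": ["What does HNSW trade for low query latency?"],
    "answer": ["HNSW keeps a multi-level proximity graph in RAM, trading memory for speed."],
    "contexts": [["HNSW stores a multi-level graph in memory and answers queries in O(log N)."]],
    "ground_truth": ["HNSW trades memory for low query latency: the graph is held in RAM."],
})

result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)   # one score per metric, averaged over the evaluation rows
```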
RAG_07_Production_RAG — RAGAS implementation, drift detection, production monitoring.
RAG adds latency, retrieval failure modes, infrastructure complexity, and cost. The exam tests this directly. Several common scenarios are better served by simpler alternatives.
| Scenario | Better approach | Reason |
|---|---|---|
| Small, static corpus (<100 docs, no updates) | Context stuffing — put all docs in prompt | RAG adds retrieval failure modes; just include everything in context |
| Structured data (tables, databases) | SQL / structured query generation | Vector similarity is the wrong query primitive for structured data |
| Very low update frequency, domain adaptation needed | Fine-tuning | RAG adds inference latency; fine-tuning bakes knowledge into weights |
| LLM parametric knowledge is sufficient | Prompt engineering only | Retrieval adds latency and cost with no quality benefit |
| Real-time, sub-50 ms latency required | Cached retrieval or no retrieval | ANN + rerank + generation chain is typically 200–800 ms |
Do not default to recommending RAG for every knowledge problem. The key signals for alternatives: small or static corpus, structured data, no update cadence, latency constraints under 50 ms.
Total wall-clock time for a typical hybrid RAG response (embed + ANN + rerank 100 docs + generate 500 tokens) is roughly 600–1000 ms on a single GPU with CPU retrieval. Streaming generation output reduces perceived latency at no throughput cost.
NVIDIA stack details (NIM, NeMo, TensorRT-LLM) covered in NVIDIA_GPU_20_NeMo_NIM_AI_Enterprise and cert-prep Presentation 05 (NVIDIA Stack Overview).
The highest-probability exam questions based on notes/05_rag_systems.md and the NCA/NCP domain weightings: