GenAI Engineering · 7 Modules
The Lab
67 technical reports across the full GenAI stack — from LLM fundamentals to production-grade RAG, agents, and evaluation. Built systematically, module by module.
LLM Foundations
Prompts · Context · Memory · Safety
- F-01
Prompt Engineering Benchmark
Study the anatomy of a reliable prompt — instruction clarity, role framing, few-shot examples, delimiters. Build a personal benchmark to measure prompt quality objectively across a test set, so 'it felt better' becomes a measurable improvement.
To do
- F-02
Benchmark Design for LLM Systems
Design evaluation suites that are resistant to overfitting and data leakage. Cover metric selection (BLEU, ROUGE, BERTScore, custom), test set construction, contamination risks, and the difference between automated and human evaluation benchmarks.
In progress
- F-03
Evaluation Systems for LLMs
Build a full evaluation pipeline: define tasks, collect ground truth, run batch inference, aggregate scores, and surface regressions. Study existing frameworks (EleutherAI's LM Evaluation Harness, OpenAI Evals, HELM) and understand when to use each.
To do
- F-04
Failure Modes of LLM Systems
Create a taxonomy of LLM failures: hallucination, sycophancy, instruction-following drift, prompt injection, context overflow, and stochastic inconsistency. For each failure type, study the root cause and a concrete mitigation strategy.
To do
- F-05
Context Engineering
Go beyond 'stuff everything in'. Study context selection (which documents to include), ordering effects (recency bias, primacy bias), compression strategies (summarisation, selective extraction), and how context length affects model behaviour and cost.
To do
- F-06
Prompt + Context Interaction
Investigate how retrieved context changes model behaviour: context over-reliance, parametric vs contextual knowledge tension, and how to prompt the model to correctly weight external information over its training knowledge.
To do
- F-07
Structured Output & Schema
Master reliable structured generation: JSON mode, function calling schemas, Pydantic validation, retry logic on parse failure, and constrained decoding. Study failure modes (truncated JSON, schema drift) and how to build robust output parsers.
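As a concrete starting point, here is a minimal sketch of the validate-and-retry loop, assuming a placeholder `call_llm()` that returns raw text and an illustrative `Invoice` schema; feeding the validation errors back to the model is one possible repair strategy, not the only one.

```python
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):          # illustrative target schema
    vendor: str
    total: float
    currency: str

def call_llm(prompt: str) -> str:
    """Placeholder for your model call; should return text that parses as JSON."""
    raise NotImplementedError

def generate_structured(prompt: str, max_retries: int = 3) -> Invoice:
    attempt_prompt = prompt
    for _ in range(max_retries):
        raw = call_llm(attempt_prompt)
        try:
            return Invoice.model_validate_json(raw)   # parse + schema validation in one step
        except ValidationError as err:
            # Feed the validation errors back so the model can repair its own output.
            attempt_prompt = (
                f"{prompt}\n\nYour previous output failed validation:\n{err}\n"
                "Return only corrected JSON matching the schema."
            )
    raise RuntimeError("structured output failed after retries")
```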
To do
- F-08
Guardrails & Safety
Build a layered safety system: input classifiers, output moderation, topical restriction, and prompt injection detection. Compare Guardrails AI, NVIDIA NeMo Guardrails, and custom classifier approaches — and understand where each layer sits in the pipeline.
To do
- F-09
Memory Systems
Design memory architectures for conversational agents: in-context (full history), summary memory, entity memory, and vector store long-term memory. Benchmark each on latency, cost, fidelity, and retrieval accuracy across different session lengths.
To do
- F-10
Memory Compression & Summarization
When history exceeds the context window, compression is unavoidable. Study progressive summarisation, rolling windows, entity extraction, and importance-weighted pruning — benchmarking each against a full-history baseline on recall and coherence.
To do
- F-11
State Management in Agents
Define what 'state' means in an agent: conversation history, task progress, intermediate results, tool outputs. Study serialisation strategies, checkpoint-and-resume patterns, and how to persist state across sessions and restarts in production.
To do
- F-12
Fine-tuning vs Prompting Trade-offs
Map the decision space: when does fine-tuning beat prompting, and at what cost? Cover LoRA, QLoRA, instruction tuning, and RLHF. Study the data flywheel problem — how much data you actually need, and when prompt engineering is the right answer.
To do
- F-13
Embeddings Deep Dive
Understand how dense representations work: contrastive training objectives, embedding spaces, dimensionality trade-offs, and matryoshka embeddings. Compare models (OpenAI, Cohere, E5, BGE, sentence-transformers) and study when to fine-tune embeddings on domain data.
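A tiny numpy sketch of the geometry in play: cosine similarity between two vectors, and what a matryoshka-style truncation to the leading dimensions looks like. The random vectors below are stand-ins for real embedding model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
query_vec = rng.normal(size=1024)       # stand-in for an embedding model output
doc_vec = rng.normal(size=1024)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Full-dimension similarity vs a matryoshka-style truncation to the first 256 dims.
# Models trained with a matryoshka objective concentrate signal in the leading
# dimensions; random vectors (as here) do not, which is the point of the comparison.
print(cosine(query_vec, doc_vec))
print(cosine(query_vec[:256], doc_vec[:256]))
```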
To do
- F-14
Chain-of-Thought & Reasoning Patterns
Master the core reasoning techniques: zero-shot CoT, few-shot CoT, self-consistency (majority vote over sampled chains), Tree-of-Thought, and ReAct (Reasoning + Acting). Benchmark each against direct prompting on multi-step tasks and understand their failure modes.
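Self-consistency in particular reduces to very little code once the sampling call exists; the sketch below assumes a placeholder `sample_chain_of_thought()` and a prompt that ends each chain with an "Answer:" line.

```python
from collections import Counter

def sample_chain_of_thought(question: str, temperature: float = 0.8) -> str:
    """Placeholder: one sampled reasoning chain ending in a line like 'Answer: 42'."""
    raise NotImplementedError

def extract_answer(chain: str) -> str:
    # Assumes the prompt instructed the model to finish with 'Answer: <value>'.
    return chain.rsplit("Answer:", 1)[-1].strip()

def self_consistent_answer(question: str, n_samples: int = 10) -> str:
    answers = [extract_answer(sample_chain_of_thought(question)) for _ in range(n_samples)]
    # Majority vote over the final answers of the sampled chains.
    return Counter(answers).most_common(1)[0][0]
```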
To do
- F-15
Long Context Strategies
With 128k+ context windows, the rules change. Study the 'lost in the middle' phenomenon, needle-in-a-haystack evaluation, and when long context replaces RAG entirely — and when it doesn't. Cover chunked attention, positional interpolation, and the cost-quality trade-off.
To do
Knowledge Ingestion
Parsing · Chunking · Indexing · Graphs
- I-01
Document Loading & Parsing
Survey the parsing landscape for PDFs (PyMuPDF, pdfplumber, Unstructured), HTML (Trafilatura), DOCX, and Markdown. Study layout-aware parsing, table extraction, OCR fallback for scanned docs, and how parsing errors propagate and degrade downstream retrieval quality.
To do
- I-02
Text Preprocessing for RAG
After parsing, raw text is messy. Study deduplication (exact and near-duplicate detection with MinHash/LSH), normalisation (unicode, headers/footers removal), language detection, and PII scrubbing — measuring how each step affects downstream retrieval recall.
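For intuition, near-duplicate detection can be sketched with character shingles and exact Jaccard similarity; MinHash and LSH exist to approximate this same comparison without the quadratic pairwise cost. The shingle size and threshold below are assumptions to tune per corpus.

```python
def shingles(text: str, k: int = 5) -> set[str]:
    text = " ".join(text.lower().split())          # collapse whitespace, lowercase
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(docs: list[str], threshold: float = 0.85) -> list[tuple[int, int]]:
    sets = [shingles(d) for d in docs]
    # Exact pairwise comparison: fine for small corpora, replaced by MinHash + LSH at scale.
    return [
        (i, j)
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if jaccard(sets[i], sets[j]) >= threshold
    ]
```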
To do
- I-03
Metadata Extraction & Enrichment
Build a metadata pipeline: extract title, author, date, section, source URL, and document type automatically. Study LLM-assisted metadata generation, taxonomy classification, and how rich metadata enables filtered retrieval, explainability, and freshness-aware ranking.
To do
- I-04
Chunking Strategies
Benchmark fixed-size, sentence-based, semantic (embedding-similarity split), and recursive character chunking. Study chunk overlap, parent-child chunking (small chunks for retrieval, large for context), and how chunk size interacts with embedding model token limits and retrieval recall@k.
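The simplest baseline in that benchmark, fixed-size chunking with overlap, fits in a few lines; the sketch counts words rather than tokens for brevity, whereas a real pipeline would count tokens with the embedding model's tokenizer.

```python
def chunk_fixed(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Fixed-size chunking with overlap, measured in words for simplicity."""
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, max(len(words) - overlap, 1), step)
    ]
```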
To do
- I-05
Entities & Relations Extraction
Extract a knowledge graph from unstructured text using spaCy, GLiNER, or LLM-based extraction. Study entity types, coreference resolution, and relationship triplet extraction — then merge entities across documents into a consistent graph ready for Graph RAG pipelines.
To do
- I-06
Structured Data Ingestion
Ingest relational data (SQL tables, CSVs) into a RAG system. Study schema serialisation strategies (markdown tables, natural language descriptions), row-level chunking, and hybrid querying that combines SQL retrieval with vector search in a unified pipeline.
To do
- I-07
Temporal Indexing & Freshness
Build a freshness-aware knowledge base: timestamp metadata, staleness scoring, incremental indexing (only re-index changed documents), and crawl strategies. Study how freshness interacts with retrieval ranking and when stale data actively harms answer quality.
To do
- I-08
Vector Databases
Understand the infrastructure powering semantic search: HNSW and IVF indexing algorithms, ANN vs exact search trade-offs, pre- vs post-filtering on metadata, and payload storage. Compare Qdrant, Weaviate, Pinecone, Milvus, and pgvector on performance, cost, and developer experience.
To do
Retrieval Systems
BM25 · Semantic · Graph · SQL · Multi-Hop
- R-01
Query Understanding
Before retrieval, understand what the user actually wants. Study intent classification, named entity extraction from queries, ambiguity detection, and query decomposition (splitting a complex question into answerable sub-questions). Build a query analysis layer that feeds the right signal to the retriever.
To do
- R-02
Routing Systems
Design the decision layer that routes queries to the right pipeline: RAG, direct LLM, SQL, tool use, or a combination. Study classifier-based routing, LLM-based routing with structured output, semantic similarity-based routing, and confidence-based fallback chains.
To do
- R-03
Model Routing (multi-model)
Reduce cost without sacrificing quality by routing easy queries to cheap models and complex ones to powerful ones. Study difficulty classifiers, cascading (try small model first, escalate on low confidence), and how to measure quality-cost trade-offs in a production setting.
To do
- R-04
Retrieval Strategies
Deep dive into BM25 (term frequency, IDF, BM25+) and dense retrieval (bi-encoder, dot product vs cosine). Benchmark on domain-specific corpora and understand when keyword search still beats embeddings — particularly for rare terms, code, and exact product names.
To do
- R-05
Hybrid Search
Combine BM25 and dense retrieval using Reciprocal Rank Fusion (RRF), linear interpolation, and learned fusion weights. Study how to tune the balance for different query types and how hybrid search handles the failure modes of each individual approach.
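RRF itself is a one-line formula, score(d) = sum over result lists of 1 / (k + rank(d)); a minimal sketch over ranked lists of document ids, with k = 60 as in the original RRF paper:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids; rank is 1-based within each list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse a BM25 ranking with a dense-retrieval ranking
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])
```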
To do
- R-06
Query Rewriting for Retrieval
Improve recall by rewriting the query before retrieval: step-back prompting (abstract the question to a broader category), query expansion (synonyms and related terms), sub-question decomposition, and LLM-based reformulation. Measure recall@k before and after each technique.
To do
- R-07
HyDE — Hypothetical Document Embeddings
Instead of embedding the raw query (which lives in a different region of the embedding space from documents), generate a hypothetical document that would answer the query and embed that. Study the query-document gap problem, when HyDE helps most, and its failure modes on factual queries.
To do
- R-08
RAG-Fusion & Multi-Query Retrieval
Generate multiple paraphrases of the query, retrieve for each independently, then fuse the ranked lists using Reciprocal Rank Fusion. Study how query diversity increases recall, how many variants to generate before diminishing returns, and the latency cost of parallel retrieval.
To do
- R-09
Re-ranking Strategies
After retrieving top-k candidates, reorder them for higher precision. Study Cross-Encoders (compute query-document relevance jointly), ColBERT (late interaction for efficiency), and LLM-based reranking (RankGPT). Benchmark precision@k improvement against latency and cost overhead.
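A short sketch using the CrossEncoder class from sentence-transformers with a commonly used MS MARCO checkpoint; any cross-encoder checkpoint with the same interface would work in its place.

```python
from sentence_transformers import CrossEncoder

# A lightweight, commonly used MS MARCO cross-encoder checkpoint.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # The cross-encoder scores each (query, document) pair jointly, unlike a bi-encoder.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```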
To do
- R-10
Graph RAG
Build a retrieval pipeline over a knowledge graph: entity linking (match query terms to graph nodes), graph traversal (BFS, personalised PageRank), and combining graph-retrieved facts with vector-retrieved passages. Study Microsoft's GraphRAG architecture and Neo4j integration patterns.
To do
- R-11
SQL RAG
Generate SQL from natural language (Text-to-SQL) to retrieve structured data. Study schema linking, few-shot SQL prompting, query validation before execution, error-correction loops, and how to combine SQL results with vector retrieval in a unified answer generation step.
To do
- R-12
Multi-Hop / Iterative Retrieval
Answer complex questions that require chaining retrieval steps: retrieve → read → identify missing information → retrieve again. Study IRCoT (Interleaved Retrieval with CoT), ITER-RETGEN, and how to prevent retrieval drift and compounding errors across hops.
To do
- R-13
Retrieval Gating (Adaptive RAG)
Not every query needs retrieval. Build a gating mechanism that classifies queries as 'retrieve' or 'answer directly' based on confidence, query type, and cost budget. Study self-ask patterns, confidence calibration, and how gating interacts with latency SLAs.
To do
- R-14
Self-RAG / CRAG / FLARE
Adaptive retrieval patterns where the model controls the process: Self-RAG (model decides when to retrieve and critiques its own output with special tokens), CRAG (evaluates retrieval quality and triggers web search as fallback), FLARE (retrieves mid-generation when confidence drops). Study when the complexity is worth it.
To do
Generation & Validation
Grounding · Confidence · Hallucination
- G-01
Evidence Fusion
Aggregate multiple retrieved passages into a coherent context: deduplication, contradiction detection, relevance scoring, and ordered presentation. Study map-reduce prompting (process each document then combine) vs full stuffing, and when each approach wins on quality and cost.
To do
- G-02
Grounded Answer Generation
Condition generation strictly on retrieved context: citation-aware prompting, inline source attribution (which claim came from which document), and post-hoc citation verification with NLI. Study the faithfulness vs completeness tension and how to balance them in practice.
To do
- G-03
Confidence Estimation
Estimate how confident the model is in its answer: log-probability scoring, self-evaluation prompting, semantic consistency across multiple samples, and P(True) as a calibration probe. Build a calibration curve and study when model confidence actually correlates with factual accuracy.
To do
- G-04
Hallucination Detection & Mitigation
Detect when the model fabricates facts not present in context: NLI-based entailment checks (answer vs source documents), SelfCheckGPT (compare consistency across multiple sampled outputs), and factual recall probing. Study mitigation strategies beyond detection — how to reduce hallucination at generation time.
To do
- G-05
Confidence-Aware Delivery / Abstention
Design the policy for when the system should say 'I don't know': threshold-based abstention, risk-aware routing (abstain on medical/legal, answer on low-stakes), and how to communicate uncertainty to users without destroying trust. Study the precision-abstention trade-off curve.
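One possible shape for the policy, assuming a confidence score produced by the estimation layer (G-03) and an illustrative per-domain threshold table; the numbers are placeholders to calibrate, not recommendations.

```python
# Illustrative per-domain thresholds: higher-stakes topics demand more confidence to answer.
ABSTENTION_THRESHOLDS = {"medical": 0.95, "legal": 0.95, "general": 0.6}

def deliver(answer: str, confidence: float, domain: str = "general") -> str:
    threshold = ABSTENTION_THRESHOLDS.get(domain, 0.8)
    if confidence < threshold:
        # Abstain rather than deliver a low-confidence answer in a high-stakes domain.
        return "I don't have enough confidence in the available evidence to answer this reliably."
    return answer
```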
To do
Agents & Orchestration
Tools · Planning · Multi-Agent · Agentic RAG
- A-01
Tool Use / Function Calling
Define tools (APIs, databases, code execution, calculators) and teach the model to invoke them correctly: schema design, error handling, tool output parsing, and multi-turn tool call chains. Study parallel tool calling, tool selection failures, and how to handle tool errors gracefully.
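Provider APIs differ, but the dispatch loop has a common shape; the sketch below assumes a placeholder `call_llm()` that returns either a tool call or a final answer, and feeds tool errors back to the model instead of raising them.

```python
import json

def get_weather(city: str) -> dict:
    """Example tool; in practice this would call a real API."""
    return {"city": city, "forecast": "sunny"}

TOOLS = {"get_weather": get_weather}

def call_llm(messages: list[dict]) -> dict:
    """Placeholder: returns either {'tool': name, 'arguments': {...}} or {'answer': text}."""
    raise NotImplementedError

def agent_turn(messages: list[dict], max_steps: int = 5) -> str:
    for _ in range(max_steps):
        decision = call_llm(messages)
        if "answer" in decision:
            return decision["answer"]
        tool = TOOLS.get(decision["tool"])
        try:
            result = tool(**decision["arguments"]) if tool else {"error": "unknown tool"}
        except Exception as exc:                       # surface tool failures back to the model
            result = {"error": str(exc)}
        messages.append({"role": "tool", "name": decision["tool"],
                         "content": json.dumps(result)})
    return "Stopped: too many tool calls without a final answer."
```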
To do
- A-02
Planning vs Reactive Agents
Compare two agent paradigms: Plan-and-Execute (generate a full plan upfront, then execute step by step) vs ReAct (interleave reasoning and action in a loop). Study when planning helps (long multi-step tasks) vs when reactivity is better (dynamic environments where early plans become invalid).
To do
- A-03
Agent Orchestration
Coordinate complex agent workflows: task decomposition, step sequencing, error recovery, and human checkpoint insertion. Study LangGraph, CrewAI, and custom loop approaches — and when you actually need an orchestration framework vs a simple while loop.
To do
- A-04
Workflow vs Agent Architectures
Make the fundamental architecture decision: use a deterministic workflow (DAG of steps with defined transitions) or an autonomous agent (LLM decides next action). Study the reliability-flexibility trade-off and hybrid approaches (workflows with LLM decision nodes at branch points).
To do
- A-05
Feedback Loops / Self-Improvement
Build agents that critique and refine their own outputs: Reflexion (reflect on failures and retry with updated context), self-evaluation prompts, constitutional critique-revision cycles, and beam search over the action space. Study convergence guarantees and stopping criteria.
To do
- A-06
Multi-Agent Collaboration
Design systems where multiple specialised agents collaborate: task decomposition and role assignment, communication protocols (shared memory vs message passing), conflict resolution, and quality-checking between agents. Study the coordination overhead vs specialisation quality trade-off.
To do
- A-07
Human-in-the-loop Systems
Integrate human oversight into agent pipelines: confidence-gated checkpoints (pause and ask when uncertain), structured approval workflows, and how to present partial results for human review. Study how to minimise interruptions while maintaining meaningful control.
To do
- A-08
Agentic RAG
Combine RAG with full agent capabilities: the agent decides which retrieval strategy to use, can issue multiple retrieval rounds, validates its own sources, and escalates to broader search (web, SQL, graph) when the knowledge base falls short. Study STORM and Self-Ask as reference architectures.
To do
Production Engineering
Cost · Latency · Caching · Observability
- P-01
Cost Optimization
Map all cost drivers (input tokens, output tokens, API calls, embedding generation, vector DB queries) and build a cost model. Study caching, context compression, prompt shortening, and model downgrading on easy queries — measuring the quality impact of each optimisation.
To do
- P-02
Latency Optimization
Measure and reduce end-to-end latency: profile the pipeline (where is the time going?), speculative decoding, streaming responses, parallel retrieval, and batching. Build a latency budget and study the latency-quality Pareto frontier for your specific workload.
To do
- P-03
Parallelism & Concurrency
Scale agent systems horizontally: async/await for concurrent tool calls, parallel retrieval from multiple sources, fan-out/fan-in architectures, and rate limit management. Study Python asyncio patterns specifically designed for LLM workloads with mixed I/O and compute.
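A minimal fan-out/fan-in sketch with asyncio, using a semaphore as a crude concurrency cap; `fetch_one()` is a stand-in for any async retrieval call.

```python
import asyncio

async def fetch_one(source: str, query: str) -> list[str]:
    """Placeholder for an async retrieval call (vector DB, web search, SQL...)."""
    await asyncio.sleep(0.1)          # simulate I/O latency
    return [f"{source}: result for {query!r}"]

async def fan_out(query: str, sources: list[str], max_concurrency: int = 5) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrency)    # crude concurrency / rate cap

    async def bounded(source: str) -> list[str]:
        async with semaphore:
            return await fetch_one(source, query)

    results = await asyncio.gather(*(bounded(s) for s in sources))
    return [item for chunk in results for item in chunk]   # fan-in: flatten

# asyncio.run(fan_out("chunking strategies", ["qdrant", "bm25", "web"]))
```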
To do
- P-04
Caching Strategies
Build a multi-level cache: exact match (hash of prompt → cached response), semantic cache (embed query → lookup similar past queries), and provider-level prompt caching (Anthropic, OpenAI). Study cache invalidation, TTL strategies, and how caching interacts with streaming responses.
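A sketch of the first two levels (exact and semantic), assuming a placeholder `embed()` that returns unit-norm vectors; the similarity threshold is an assumption to tune against false-hit rates.

```python
import hashlib
import numpy as np

exact_cache: dict[str, str] = {}
semantic_cache: list[tuple[np.ndarray, str]] = []   # (query embedding, cached response)

def embed(text: str) -> np.ndarray:
    """Placeholder embedding call; assumed to return a unit-norm vector."""
    raise NotImplementedError

def cached_answer(prompt: str, threshold: float = 0.92) -> str | None:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in exact_cache:                            # level 1: exact match
        return exact_cache[key]
    query_vec = embed(prompt)
    for cached_vec, response in semantic_cache:       # level 2: semantic match
        if float(query_vec @ cached_vec) >= threshold:
            return response
    return None                                       # miss: call the model, then store()

def store(prompt: str, response: str) -> None:
    exact_cache[hashlib.sha256(prompt.encode()).hexdigest()] = response
    semantic_cache.append((embed(prompt), response))
```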
To do
- P-05
Observability & Tracing
Instrument an LLM pipeline end-to-end: trace every prompt/response pair (LangSmith, Weave, Helicone), log latency and cost per step, set up alerts for quality degradation, and build dashboards. Study what to log, how much to retain, and the GDPR implications of tracing LLM calls.
To do
- P-06
Security in LLM Systems
Defend against LLM-specific attacks: direct prompt injection (user manipulates model), indirect injection (malicious content in retrieved documents), jailbreaking, and data exfiltration via prompt. Study defence-in-depth — what each layer catches and what it misses.
To do
- P-07
Robustness & Reliability
Make LLM systems production-stable: retry with exponential backoff, fallback chains (primary model → cheaper model → static response), circuit breakers, and input validation. Study SLA definition and how to measure p95/p99 reliability for systems with non-deterministic components.
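The retry and fallback layers are small enough to sketch directly; the delays, attempt counts, and static last-resort message below are illustrative defaults, not recommendations.

```python
import random
import time

def with_retries(call, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry a flaky call with exponential backoff and a little jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

def answer_with_fallbacks(prompt: str, providers: list) -> str:
    """providers is an ordered list of callables: primary model, cheaper model, ..."""
    for provider in providers:
        try:
            return with_retries(lambda: provider(prompt))
        except Exception:
            continue                                    # escalate to the next fallback
    return "The service is temporarily unavailable."    # static last-resort response
```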
To do
- P-08
Prompt Versioning & Management
Manage prompts in production like code: version control, environment-based deployment (dev/staging/prod), A/B testing infrastructure, rollback on quality regression, and team collaboration. Study PromptLayer, LangSmith prompt hubs, and custom prompt registries.
To do
- P-09
Streaming & Progressive Generation
Implement streaming from first token to client: server-sent events, chunked HTTP responses, streaming with tool calls (tricky — tool calls require full output before execution), and partial JSON streaming. Study the UX impact of streaming and how to handle cancellation and backpressure.
To do
Evaluation & Testing
RAGAS · Red Teaming · A/B Tests · Multimodal
- E-01
RAG Systems (complete)
Synthesise the full RAG architecture end-to-end: ingestion pipeline → knowledge repository → retrieval → reranking → generation → validation. Build a reference implementation integrating techniques from all previous modules and benchmark it against a naive baseline across multiple task types.
To do
- E-02
End-to-End AI System Benchmark
Design and run a comprehensive benchmark for a complete LLM system: define a task suite, evaluation dimensions (quality, cost, latency, robustness), run ablation studies (what happens when you remove each component?), and write a systematic findings report.
To do
- E-03
RAG Evaluation Frameworks
Evaluate RAG pipelines with dedicated frameworks: RAGAS (faithfulness, answer relevance, context recall, context precision), TruLens (feedback functions), and DeepEval (custom metrics). Build a test set, run automated evaluation, and interpret the metric trade-offs between retrieval and generation quality.
To do
- E-04
LLM as a Judge
Use an LLM to evaluate another LLM's outputs at scale: pointwise scoring, pairwise comparison (A vs B), and reference-free evaluation. Study position and verbosity biases that corrupt judgement, how to write reliable judge prompts, and how LLM-as-a-Judge correlates with human evaluation.
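A minimal pairwise-judging sketch that controls for position bias by running both orderings and only accepting a verdict that survives the swap; `judge()` is a placeholder for the judge-model call and the prompt template is illustrative.

```python
def judge(prompt: str) -> str:
    """Placeholder judge-model call; expected to reply with exactly 'A' or 'B'."""
    raise NotImplementedError

JUDGE_TEMPLATE = (
    "Question:\n{question}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n\n"
    "Which response answers the question better? Reply with exactly 'A' or 'B'."
)

def pairwise_judgement(question: str, answer_1: str, answer_2: str) -> str:
    # Run both orderings; only accept a verdict that is consistent under the position swap.
    first = judge(JUDGE_TEMPLATE.format(question=question, a=answer_1, b=answer_2)).strip()
    second = judge(JUDGE_TEMPLATE.format(question=question, a=answer_2, b=answer_1)).strip()
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"          # disagreement across orderings usually signals position bias
```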
To do
- E-05
Red Teaming LLM Systems
Systematically find vulnerabilities before attackers do: adversarial prompt generation, jailbreak benchmarks (HarmBench, JailbreakBench), automated red teaming with LLMs, and coverage-guided fuzzing. Build a structured red team report with severity classification and remediation steps.
To do
- E-06
Synthetic Data Generation for Testing
Generate realistic test datasets automatically: question generation from documents (LLM-based), ground truth answer synthesis, context perturbation (add noise to test robustness), and coverage analysis to ensure all topics are represented. Study quality filtering strategies for generated test data.
To do
- E-07
A/B Testing for LLM Applications
Run controlled experiments on LLM components: prompt variants, model versions, RAG configurations. Study traffic splitting, metric selection (what to measure and why), minimum sample sizes for statistical significance, and how to handle the temporal confounders unique to online LLM experiments.
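For the significance piece, a two-proportion z-test is a common baseline; a self-contained sketch with illustrative numbers (not real results):

```python
from math import erf, sqrt

def two_proportion_z_test(wins_a: int, n_a: int, wins_b: int, n_b: int) -> float:
    """Two-sided p-value for 'variant B's success rate differs from variant A's'."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_b - p_a) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided normal tail

# e.g. 212 thumbs-up out of 400 sessions on prompt A vs 241/400 on prompt B (made-up numbers)
p_value = two_proportion_z_test(212, 400, 241, 400)
```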
To do
- E-08
Multimodal RAG
Extend RAG to non-text modalities: parse PDFs with figures using vision models, embed images alongside text (CLIP, ColPali), handle tables and charts, and retrieve across modalities. Study ColPali's page-level late interaction approach as the current state of the art for document-image retrieval.
To do