
GenAI Engineering · 7 Modules

The Lab

67 technical reports across the full GenAI stack — from LLM fundamentals to production-grade RAG, agents, and evaluation. Built systematically, module by module.

67 Reports · 7 Modules · 0 Done · 1 In progress

LLM Foundations

Prompts · Context · Memory · Safety

15 reports
  • F-01

    Prompt Engineering Benchmark

    Study the anatomy of a reliable prompt — instruction clarity, role framing, few-shot examples, delimiters. Build a personal benchmark to measure prompt quality objectively across a test set, so 'it felt better' becomes a measurable improvement.

    To do
  • F-02

    Benchmark Design for LLM Systems

    Design evaluation suites that are resistant to overfitting and data leakage. Cover metric selection (BLEU, ROUGE, BERTScore, custom), test set construction, contamination risks, and the difference between automated and human evaluation benchmarks.

    In progress
  • F-03

    Evaluation Systems for LLMs

    Build a full evaluation pipeline: define tasks, collect ground truth, run batch inference, aggregate scores, and surface regressions. Study existing frameworks (EleutherAI Harness, OpenAI Evals, HELM) and understand when to use each.

    To do
  • F-04

    Failure Modes of LLM Systems

    Create a taxonomy of LLM failures: hallucination, sycophancy, instruction-following drift, prompt injection, context overflow, and stochastic inconsistency. For each failure type, study the root cause and a concrete mitigation strategy.

    To do
  • F-05

    Context Engineering

    Go beyond 'stuff everything in'. Study context selection (which documents to include), ordering effects (recency bias, primacy bias), compression strategies (summarisation, selective extraction), and how context length affects model behaviour and cost.

    To do
  • F-06

    Prompt + Context Interaction

    Investigate how retrieved context changes model behaviour: context over-reliance, parametric vs contextual knowledge tension, and how to prompt the model to correctly weight external information over its training knowledge.

    To do
  • F-07

    Structured Output & Schema

    Master reliable structured generation: JSON mode, function calling schemas, Pydantic validation, retry logic on parse failure, and constrained decoding. Study failure modes (truncated JSON, schema drift) and how to build robust output parsers.

    To do
  • F-08

    Guardrails & Safety

    Build a layered safety system: input classifiers, output moderation, topical restriction, and prompt injection detection. Compare Guardrails AI, NVIDIA NeMo Guardrails, and custom classifier approaches — and understand where each layer sits in the pipeline.

    To do
  • F-09

    Memory Systems

    Design memory architectures for conversational agents: in-context (full history), summary memory, entity memory, and vector store long-term memory. Benchmark each on latency, cost, fidelity, and retrieval accuracy across different session lengths.

    To do
  • F-10

    Memory Compression & Summarization

    When history exceeds the context window, compression is unavoidable. Study progressive summarisation, rolling windows, entity extraction, and importance-weighted pruning — benchmarking each against a full-history baseline on recall and coherence.

    To do
  • F-11

    State Management in Agents

    Define what 'state' means in an agent: conversation history, task progress, intermediate results, tool outputs. Study serialisation strategies, checkpoint-and-resume patterns, and how to persist state across sessions and restarts in production.

    To do
  • F-12

    Fine-tuning vs Prompting Trade-offs

    Map the decision space: when does fine-tuning beat prompting, and at what cost? Cover LoRA, QLoRA, instruction tuning, and RLHF. Study the data flywheel problem — how much data you actually need, and when prompt engineering is the right answer.

    To do
  • F-13

    Embeddings Deep Dive

    Understand how dense representations work: contrastive training objectives, embedding spaces, dimensionality trade-offs, and matryoshka embeddings. Compare models (OpenAI, Cohere, E5, BGE, sentence-transformers) and study when to fine-tune embeddings on domain data.

    To do
  • F-14

    Chain-of-Thought & Reasoning Patterns

    Master the core reasoning techniques: zero-shot CoT, few-shot CoT, self-consistency (majority vote over sampled chains), Tree-of-Thought, and ReAct (Reasoning + Acting). Benchmark each against direct prompting on multi-step tasks and understand their failure modes.

    To do
  • F-15

    Long Context Strategies

    With 128k+ context windows, the rules change. Study the 'lost in the middle' phenomenon, needle-in-a-haystack evaluation, and when long context replaces RAG entirely — and when it doesn't. Cover chunked attention, positional interpolation, and the cost-quality trade-off.

    To do
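Several of the techniques in this module reduce to small, testable kernels. As one illustration, the self-consistency pattern from F-14 is just a majority vote over independently sampled reasoning chains. A minimal sketch, assuming a `sample_chain` callable (hypothetical here) that calls the model at temperature > 0 and returns only the extracted final answer:

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_chain: Callable[[], str], n_samples: int = 5) -> str:
    """Sample n reasoning chains and return the majority-vote answer.

    `sample_chain` is assumed to invoke the model with temperature > 0
    and extract only the final answer from the chain-of-thought.
    """
    answers = [sample_chain() for _ in range(n_samples)]
    # Majority vote; ties break by first-seen order (Counter preserves insertion order).
    return Counter(answers).most_common(1)[0][0]

# Demo with a stubbed sampler cycling through noisy answers.
_fake = iter(["42", "41", "42", "42", "40"])
print(self_consistency(lambda: next(_fake), n_samples=5))  # 42
```

The benchmark in F-14 then compares this voted answer against direct single-sample prompting on the same multi-step tasks.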

Knowledge Ingestion

Parsing · Chunking · Indexing · Graphs

8 reports
  • I-01

    Document Loading & Parsing

    Survey the parsing landscape for PDFs (PyMuPDF, pdfplumber, Unstructured), HTML (Trafilatura), DOCX, and Markdown. Study layout-aware parsing, table extraction, OCR fallback for scanned docs, and how parsing errors propagate and degrade downstream retrieval quality.

    To do
  • I-02

    Text Preprocessing for RAG

    After parsing, raw text is messy. Study deduplication (exact and near-duplicate detection with MinHash/LSH), normalisation (unicode, headers/footers removal), language detection, and PII scrubbing — measuring how each step affects downstream retrieval recall.

    To do
  • I-03

    Metadata Extraction & Enrichment

    Build a metadata pipeline: extract title, author, date, section, source URL, and document type automatically. Study LLM-assisted metadata generation, taxonomy classification, and how rich metadata enables filtered retrieval, explainability, and freshness-aware ranking.

    To do
  • I-04

    Chunking Strategies

    Benchmark fixed-size, sentence-based, semantic (embedding-similarity split), and recursive character chunking. Study chunk overlap, parent-child chunking (small chunks for retrieval, large for context), and how chunk size interacts with embedding model token limits and retrieval recall@k.

    To do
  • I-05

    Entities & Relations Extraction

    Extract a knowledge graph from unstructured text using spaCy, GLiNER, or LLM-based extraction. Study entity types, coreference resolution, and relationship triplet extraction — then merge entities across documents into a consistent graph ready for Graph RAG pipelines.

    To do
  • I-06

    Structured Data Ingestion

    Ingest relational data (SQL tables, CSVs) into a RAG system. Study schema serialisation strategies (markdown tables, natural language descriptions), row-level chunking, and hybrid querying that combines SQL retrieval with vector search in a unified pipeline.

    To do
  • I-07

    Temporal Indexing & Freshness

    Build a freshness-aware knowledge base: timestamp metadata, staleness scoring, incremental indexing (only re-index changed documents), and crawl strategies. Study how freshness interacts with retrieval ranking and when stale data actively harms answer quality.

    To do
  • I-08

    Vector Databases

Understand the infrastructure powering semantic search: HNSW and IVF indexing algorithms, ANN vs exact search trade-offs, pre- vs post-filtering on metadata, and payload storage. Compare Qdrant, Weaviate, Pinecone, Milvus, and pgvector on performance, cost, and developer experience.

    To do
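The baseline that every chunking experiment in I-04 is measured against is fixed-size chunking with overlap. A minimal sketch of that baseline (character-based here for simplicity; a real pipeline would usually count tokens instead):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character chunking with overlap: the naive baseline
    that sentence-based, semantic, and recursive strategies are compared to."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # each chunk repeats `overlap` chars of the previous one
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 1200
parts = chunk_text(doc, chunk_size=500, overlap=50)
print(len(parts), [len(p) for p in parts])  # 3 [500, 500, 300]
```

The overlap exists so that a sentence split across a chunk boundary still appears whole in at least one chunk; I-04 studies how this interacts with embedding token limits and recall@k.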

Retrieval Systems

BM25 · Semantic · Graph · SQL · Multi-Hop

14 reports
  • R-01

    Query Understanding

    Before retrieval, understand what the user actually wants. Study intent classification, named entity extraction from queries, ambiguity detection, and query decomposition (splitting a complex question into answerable sub-questions). Build a query analysis layer that feeds the right signal to the retriever.

    To do
  • R-02

    Routing Systems

    Design the decision layer that routes queries to the right pipeline: RAG, direct LLM, SQL, tool use, or a combination. Study classifier-based routing, LLM-based routing with structured output, semantic similarity-based routing, and confidence-based fallback chains.

    To do
  • R-03

    Model Routing (multi-model)

    Reduce cost without sacrificing quality by routing easy queries to cheap models and complex ones to powerful ones. Study difficulty classifiers, cascading (try small model first, escalate on low confidence), and how to measure quality-cost trade-offs in a production setting.

    To do
  • R-04

    Retrieval Strategies

    Deep dive into BM25 (term frequency, IDF, BM25+) and dense retrieval (bi-encoder, dot product vs cosine). Benchmark on domain-specific corpora and understand when keyword search still beats embeddings — particularly for rare terms, code, and exact product names.

    To do
  • R-05

    Hybrid Search

    Combine BM25 and dense retrieval using Reciprocal Rank Fusion (RRF), linear interpolation, and learned fusion weights. Study how to tune the balance for different query types and how hybrid search handles the failure modes of each individual approach.

    To do
  • R-06

    Query Rewriting for Retrieval

    Improve recall by rewriting the query before retrieval: step-back prompting (abstract the question to a broader category), query expansion (synonyms and related terms), sub-question decomposition, and LLM-based reformulation. Measure recall@k before and after each technique.

    To do
  • R-07

    HyDE — Hypothetical Document Embeddings

    Instead of embedding the raw query (which lives in a different region of the embedding space from documents), generate a hypothetical document that would answer the query and embed that. Study the query-document gap problem, when HyDE helps most, and its failure modes on factual queries.

    To do
  • R-08

    RAG-Fusion & Multi-Query Retrieval

    Generate multiple paraphrases of the query, retrieve for each independently, then fuse the ranked lists using Reciprocal Rank Fusion. Study how query diversity increases recall, how many variants to generate before diminishing returns, and the latency cost of parallel retrieval.

    To do
  • R-09

    Re-ranking Strategies

    After retrieving top-k candidates, reorder them for higher precision. Study Cross-Encoders (compute query-document relevance jointly), ColBERT (late interaction for efficiency), and LLM-based reranking (RankGPT). Benchmark precision@k improvement against latency and cost overhead.

    To do
  • R-10

    Graph RAG

    Build a retrieval pipeline over a knowledge graph: entity linking (match query terms to graph nodes), graph traversal (BFS, personalised PageRank), and combining graph-retrieved facts with vector-retrieved passages. Study Microsoft's GraphRAG architecture and Neo4j integration patterns.

    To do
  • R-11

    SQL RAG

    Generate SQL from natural language (Text-to-SQL) to retrieve structured data. Study schema linking, few-shot SQL prompting, query validation before execution, error-correction loops, and how to combine SQL results with vector retrieval in a unified answer generation step.

    To do
  • R-12

    Multi-Hop / Iterative Retrieval

    Answer complex questions that require chaining retrieval steps: retrieve → read → identify missing information → retrieve again. Study IRCoT (Interleaved Retrieval with CoT), ITER-RETGEN, and how to prevent retrieval drift and compounding errors across hops.

    To do
  • R-13

    Retrieval Gating (Adaptive RAG)

    Not every query needs retrieval. Build a gating mechanism that classifies queries as 'retrieve' or 'answer directly' based on confidence, query type, and cost budget. Study self-ask patterns, confidence calibration, and how gating interacts with latency SLAs.

    To do
  • R-14

    Self-RAG / CRAG / FLARE

    Adaptive retrieval patterns where the model controls the process: Self-RAG (model decides when to retrieve and critiques its own output with special tokens), CRAG (evaluates retrieval quality and triggers web search as fallback), FLARE (retrieves mid-generation when confidence drops). Study when the complexity is worth it.

    To do
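Reciprocal Rank Fusion, which both R-05 (hybrid search) and R-08 (RAG-Fusion) rely on, is compact enough to show in full. A sketch of the standard formulation, where each document scores the sum of 1/(k + rank) across the ranked lists it appears in (k = 60 is the constant from the original RRF paper):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)),
    with ranks starting at 1. Documents missing from a list simply
    contribute nothing for that list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d2"]   # lexical ranking
dense_hits = ["d1", "d4", "d3"]  # embedding ranking
print(rrf_fuse([bm25_hits, dense_hits]))  # ['d1', 'd3', 'd4', 'd2']
```

Note that RRF only uses ranks, never raw scores, which is exactly why it fuses BM25 and cosine-similarity lists without any score normalisation step.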

Generation & Validation

Grounding · Confidence · Hallucination

5 reports
  • G-01

    Evidence Fusion

    Aggregate multiple retrieved passages into a coherent context: deduplication, contradiction detection, relevance scoring, and ordered presentation. Study map-reduce prompting (process each document then combine) vs full stuffing, and when each approach wins on quality and cost.

    To do
  • G-02

    Grounded Answer Generation

    Condition generation strictly on retrieved context: citation-aware prompting, inline source attribution (which claim came from which document), and post-hoc citation verification with NLI. Study the faithfulness vs completeness tension and how to balance them in practice.

    To do
  • G-03

    Confidence Estimation

    Estimate how confident the model is in its answer: log-probability scoring, self-evaluation prompting, semantic consistency across multiple samples, and P(True) as a calibration probe. Build a calibration curve and study when model confidence actually correlates with factual accuracy.

    To do
  • G-04

    Hallucination Detection & Mitigation

    Detect when the model fabricates facts not present in context: NLI-based entailment checks (answer vs source documents), SelfCheckGPT (compare consistency across multiple sampled outputs), and factual recall probing. Study mitigation strategies beyond detection — how to reduce hallucination at generation time.

    To do
  • G-05

    Confidence-Aware Delivery / Abstention

    Design the policy for when the system should say 'I don't know': threshold-based abstention, risk-aware routing (abstain on medical/legal, answer on low-stakes), and how to communicate uncertainty to users without destroying trust. Study the precision-abstention trade-off curve.

    To do
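Two of the ideas above compose directly: the log-probability scoring from G-03 feeds the threshold-based abstention policy from G-05. A minimal sketch, assuming the serving layer exposes per-token log-probabilities (as the major APIs do via a `logprobs` option); the 0.5 threshold is an illustrative placeholder that G-05's precision-abstention curve would actually calibrate:

```python
import math

def sequence_confidence(token_logprobs: list[float]) -> float:
    """Length-normalised confidence: exp of the mean token log-probability,
    i.e. the geometric mean of the per-token probabilities."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def answer_or_abstain(answer: str, token_logprobs: list[float],
                      threshold: float = 0.5) -> str:
    """Threshold-based abstention: return the answer only if confidence clears the bar."""
    if sequence_confidence(token_logprobs) >= threshold:
        return answer
    return "I don't know."

print(answer_or_abstain("Paris", [-0.05, -0.10, -0.02]))   # confident -> Paris
print(answer_or_abstain("Atlantis", [-2.0, -1.5, -3.0]))   # low confidence -> abstain
```

G-03's central finding to verify is whether this raw score actually correlates with factual accuracy; the calibration curve tells you where to place the threshold.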

Agents & Orchestration

Tools · Planning · Multi-Agent · Agentic RAG

8 reports
  • A-01

    Tool Use / Function Calling

    Define tools (APIs, databases, code execution, calculators) and teach the model to invoke them correctly: schema design, error handling, tool output parsing, and multi-turn tool call chains. Study parallel tool calling, tool selection failures, and how to handle tool errors gracefully.

    To do
  • A-02

    Planning vs Reactive Agents

    Compare two agent paradigms: Plan-and-Execute (generate a full plan upfront, then execute step by step) vs ReAct (interleave reasoning and action in a loop). Study when planning helps (long multi-step tasks) vs when reactivity is better (dynamic environments where early plans become invalid).

    To do
  • A-03

    Agent Orchestration

    Coordinate complex agent workflows: task decomposition, step sequencing, error recovery, and human checkpoint insertion. Study LangGraph, CrewAI, and custom loop approaches — and when you actually need an orchestration framework vs a simple while loop.

    To do
  • A-04

    Workflow vs Agent Architectures

    Make the fundamental architecture decision: use a deterministic workflow (DAG of steps with defined transitions) or an autonomous agent (LLM decides next action). Study the reliability-flexibility trade-off and hybrid approaches (workflows with LLM decision nodes at branch points).

    To do
  • A-05

    Feedback Loops / Self-Improvement

    Build agents that critique and refine their own outputs: Reflexion (reflect on failures and retry with updated context), self-evaluation prompts, constitutional critique-revision cycles, and beam search over the action space. Study convergence guarantees and stopping criteria.

    To do
  • A-06

    Multi-Agent Collaboration

    Design systems where multiple specialised agents collaborate: task decomposition and role assignment, communication protocols (shared memory vs message passing), conflict resolution, and quality-checking between agents. Study the coordination overhead vs specialisation quality trade-off.

    To do
  • A-07

    Human-in-the-loop Systems

    Integrate human oversight into agent pipelines: confidence-gated checkpoints (pause and ask when uncertain), structured approval workflows, and how to present partial results for human review. Study how to minimise interruptions while maintaining meaningful control.

    To do
  • A-08

    Agentic RAG

    Combine RAG with full agent capabilities: the agent decides which retrieval strategy to use, can issue multiple retrieval rounds, validates its own sources, and escalates to broader search (web, SQL, graph) when the knowledge base falls short. Study STORM and Self-Ask as reference architectures.

    To do
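The tool-use loop at the heart of A-01 and A-02 fits in a few lines once the model is abstracted away. A deliberately minimal sketch with a stubbed model: the `llm` callable and the JSON tool-call convention (`{"tool": ..., "args": ...}`) are assumptions for illustration, not any provider's actual function-calling API:

```python
import json
from typing import Callable

def run_agent(llm: Callable[[list[dict]], str], tools: dict[str, Callable],
              user_msg: str, max_steps: int = 5) -> str:
    """Minimal tool-use loop: the model either emits a JSON tool call
    {"tool": name, "args": {...}} or a plain-text final answer."""
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = llm(messages)
        messages.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)
        except json.JSONDecodeError:
            return reply  # not JSON, so treat it as the final answer
        result = tools[call["tool"]](**call["args"])
        messages.append({"role": "tool", "content": str(result)})
    return "Step budget exhausted."

# Stubbed model: first requests the calculator, then answers from the tool output.
def fake_llm(messages):
    if messages[-1]["role"] == "user":
        return json.dumps({"tool": "add", "args": {"a": 2, "b": 3}})
    return f"The sum is {messages[-1]['content']}."

print(run_agent(fake_llm, {"add": lambda a, b: a + b}, "What is 2 + 3?"))
```

Everything A-03 and A-04 discuss is what production systems layer onto this loop: error recovery, checkpoints, and the decision of when a deterministic workflow should replace the free-running while loop.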

Production Engineering

Cost · Latency · Caching · Observability

9 reports
  • P-01

    Cost Optimization

    Map all cost drivers (input tokens, output tokens, API calls, embedding generation, vector DB queries) and build a cost model. Study caching, context compression, prompt shortening, and model downgrading on easy queries — measuring the quality impact of each optimisation.

    To do
  • P-02

    Latency Optimization

    Measure and reduce end-to-end latency: profile the pipeline (where is the time going?), speculative decoding, streaming responses, parallel retrieval, and batching. Build a latency budget and study the latency-quality Pareto frontier for your specific workload.

    To do
  • P-03

    Parallelism & Concurrency

    Scale agent systems horizontally: async/await for concurrent tool calls, parallel retrieval from multiple sources, fan-out/fan-in architectures, and rate limit management. Study Python asyncio patterns specifically designed for LLM workloads with mixed I/O and compute.

    To do
  • P-04

    Caching Strategies

    Build a multi-level cache: exact match (hash of prompt → cached response), semantic cache (embed query → lookup similar past queries), and provider-level prompt caching (Anthropic, OpenAI). Study cache invalidation, TTL strategies, and how caching interacts with streaming responses.

    To do
  • P-05

    Observability & Tracing

    Instrument an LLM pipeline end-to-end: trace every prompt/response pair (LangSmith, Weave, Helicone), log latency and cost per step, set up alerts for quality degradation, and build dashboards. Study what to log, how much to retain, and the GDPR implications of tracing LLM calls.

    To do
  • P-06

    Security in LLM Systems

    Defend against LLM-specific attacks: direct prompt injection (user manipulates model), indirect injection (malicious content in retrieved documents), jailbreaking, and data exfiltration via prompt. Study defence-in-depth — what each layer catches and what it misses.

    To do
  • P-07

    Robustness & Reliability

    Make LLM systems production-stable: retry with exponential backoff, fallback chains (primary model → cheaper model → static response), circuit breakers, and input validation. Study SLA definition and how to measure p95/p99 reliability for systems with non-deterministic components.

    To do
  • P-08

    Prompt Versioning & Management

    Manage prompts in production like code: version control, environment-based deployment (dev/staging/prod), A/B testing infrastructure, rollback on quality regression, and team collaboration. Study PromptLayer, LangSmith prompt hubs, and custom prompt registries.

    To do
  • P-09

    Streaming & Progressive Generation

Implement streaming from first token to client: server-sent events, chunked HTTP responses, streaming with tool calls (tricky — a tool call's arguments must arrive in full before it can execute), and partial JSON streaming. Study the UX impact of streaming and how to handle cancellation and backpressure.

    To do
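The retry-with-backoff and fallback-chain pattern from P-07 is a good example of how small these reliability primitives are in code. A sketch with full jitter on the backoff delay; the retry counts and delays are illustrative defaults, not recommendations:

```python
import random
import time

def call_with_retry(fn, max_retries: int = 3, base_delay: float = 0.5,
                    fallback=None):
    """Retry with exponential backoff and full jitter; once the retry
    budget is exhausted, invoke the fallback (e.g. a cheaper model or a
    static response) instead of surfacing the failure."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                break
            # Full jitter: sleep a random amount in [0, base * 2^attempt].
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    if fallback is not None:
        return fallback()
    raise RuntimeError("primary call failed and no fallback configured")

# Demo: a stub that fails twice with a transient error, then succeeds.
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(call_with_retry(flaky, base_delay=0.01))  # ok
```

A real implementation would catch only transient error classes (timeouts, 429s, 5xx) rather than bare `Exception`, which is part of what P-07 studies.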

Evaluation & Testing

RAGAS · Red Teaming · A/B Tests · Multimodal

8 reports
  • E-01

    RAG Systems (complete)

    Synthesise the full RAG architecture end-to-end: ingestion pipeline → knowledge repository → retrieval → reranking → generation → validation. Build a reference implementation integrating techniques from all previous modules and benchmark it against a naive baseline across multiple task types.

    To do
  • E-02

    End-to-End AI System Benchmark

    Design and run a comprehensive benchmark for a complete LLM system: define a task suite, evaluation dimensions (quality, cost, latency, robustness), run ablation studies (what happens when you remove each component?), and write a systematic findings report.

    To do
  • E-03

    RAG Evaluation Frameworks

    Evaluate RAG pipelines with dedicated frameworks: RAGAS (faithfulness, answer relevance, context recall, context precision), TruLens (feedback functions), and DeepEval (custom metrics). Build a test set, run automated evaluation, and interpret the metric trade-offs between retrieval and generation quality.

    To do
  • E-04

    LLM as a Judge

    Use an LLM to evaluate another LLM's outputs at scale: pointwise scoring, pairwise comparison (A vs B), and reference-free evaluation. Study position and verbosity biases that corrupt judgement, how to write reliable judge prompts, and how LLM-as-a-Judge correlates with human evaluation.

    To do
  • E-05

    Red Teaming LLM Systems

    Systematically find vulnerabilities before attackers do: adversarial prompt generation, jailbreak benchmarks (HarmBench, JailbreakBench), automated red teaming with LLMs, and coverage-guided fuzzing. Build a structured red team report with severity classification and remediation steps.

    To do
  • E-06

    Synthetic Data Generation for Testing

    Generate realistic test datasets automatically: question generation from documents (LLM-based), ground truth answer synthesis, context perturbation (add noise to test robustness), and coverage analysis to ensure all topics are represented. Study quality filtering strategies for generated test data.

    To do
  • E-07

    A/B Testing for LLM Applications

    Run controlled experiments on LLM components: prompt variants, model versions, RAG configurations. Study traffic splitting, metric selection (what to measure and why), minimum sample sizes for statistical significance, and how to handle the temporal confounders unique to online LLM experiments.

    To do
  • E-08

    Multimodal RAG

    Extend RAG to non-text modalities: parse PDFs with figures using vision models, embed images alongside text (CLIP, ColPali), handle tables and charts, and retrieve across modalities. Study ColPali's page-level late interaction approach as the current state of the art for document-image retrieval.

    To do
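The position-bias problem from E-04 has a standard mitigation worth sketching: run the pairwise comparison in both orders and keep the verdict only when it is consistent. The `judge` callable below is a stand-in for an actual LLM judge prompt returning "first" or "second"; the stub used in the demo is hypothetical:

```python
from typing import Callable

def pairwise_judge(judge: Callable[[str, str, str], str],
                   question: str, answer_a: str, answer_b: str) -> str:
    """Pairwise LLM-as-a-Judge with position-bias control: evaluate in
    both presentation orders and accept the verdict only if the two
    runs agree; otherwise declare a tie."""
    forward = judge(question, answer_a, answer_b)   # A shown first
    reverse = judge(question, answer_b, answer_a)   # B shown first
    if forward == "first" and reverse == "second":
        return "A"
    if forward == "second" and reverse == "first":
        return "B"
    return "tie"  # inconsistent verdicts suggest position bias

# Stubbed, order-invariant judge that always prefers the longer answer.
fake_judge = lambda q, x, y: "first" if len(x) > len(y) else "second"
print(pairwise_judge(fake_judge, "Q?", "short", "a much longer answer"))  # B
```

A judge that flips its verdict when the order flips is exhibiting exactly the position bias E-04 measures; the tie bucket quantifies how often that happens.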
