Technical Report · 2026

What Actually Matters in Prompt Engineering: A Controlled Benchmark

Fabio Mesquita

Key result

A controlled benchmark of 30 prompting strategies shows that structure and constraints dominate performance, while role prompting and complex reasoning patterns have limited or inconsistent impact in structured tasks.

Why it matters

  • Prompt engineering is widely used in production systems, yet its real impact is poorly quantified under controlled conditions
  • Most advice focuses on prompt creativity, while this work shows that structure and constraints are the main drivers of performance
  • Understanding which prompting techniques actually matter helps reduce cost, latency, and system complexity in real-world applications

Approach

  • Design of a controlled structured extraction + decision task with fixed schema and ground truth
  • Evaluation of 30 prompting strategies grouped into categories: baseline, structure, constraints, examples, reasoning, self-improvement, and advanced patterns
  • Execution across multiple models (gpt-4o-mini and gpt-4o) and temperatures (0.2 and 0.7) to measure stability and sensitivity
  • Multi-metric evaluation including accuracy, format adherence, hallucination rate, consistency, cost, latency, and robustness across edge cases (a minimal harness sketch follows this list)
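
To make the setup concrete, here is a minimal sketch of what such a sweep could look like. None of this code comes from the report: the strategy definitions, the record fields (text, truth), and the scoring heuristics are hypothetical placeholders, and it assumes the official OpenAI Python client with an API key configured in the environment.

```python
import json
from openai import OpenAI  # assumes the official OpenAI Python client (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical stand-ins: each strategy maps a record to a prompt string,
# and each record carries the input text plus a ground-truth dict.
STRATEGIES = {
    "baseline": lambda r: f"Extract the fields from:\n{r['text']}",
    "structured_constrained": lambda r: (
        "Extract the fields below and return ONLY valid JSON with the keys "
        "name, date, amount. Use null for missing values. No extra text.\n\n"
        f"Text:\n{r['text']}"
    ),
}


def score(output: str, truth: dict) -> dict:
    """Score one output for format adherence, accuracy, and hallucinated keys."""
    try:
        parsed = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        parsed = None
    if not isinstance(parsed, dict):
        return {"format_ok": 0, "accuracy": 0.0, "hallucinated_keys": 0}
    correct = sum(parsed.get(k) == v for k, v in truth.items())
    extra = sum(k not in truth for k in parsed)
    return {
        "format_ok": 1,
        "accuracy": correct / len(truth),
        "hallucinated_keys": extra,
    }


def run_sweep(records, models=("gpt-4o-mini", "gpt-4o"), temps=(0.2, 0.7)):
    """Run every strategy against every model/temperature pair and score each record."""
    results = []
    for name, build_prompt in STRATEGIES.items():
        for model in models:
            for temp in temps:
                for rec in records:
                    resp = client.chat.completions.create(
                        model=model,
                        temperature=temp,
                        messages=[{"role": "user", "content": build_prompt(rec)}],
                    )
                    out = resp.choices[0].message.content
                    results.append({
                        "strategy": name,
                        "model": model,
                        "temperature": temp,
                        **score(out, rec["truth"]),
                    })
    return results
```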

Results

  • Structured prompts with explicit constraints consistently achieve the highest performance across all metrics (an illustrative prompt pair follows this list)
  • Few-shot prompting improves consistency but increases token usage and latency
  • Role prompting has minimal impact in structured extraction tasks compared to format-driven approaches
  • Reasoning-heavy prompts (e.g. chain-of-thought) often degrade performance in deterministic extraction tasks
  • Prompt sensitivity is higher in smaller models, while stronger models are more robust to prompt variations
  • Higher temperature increases variance, making structure and constraints even more critical
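
For concreteness, the contrast behind the first bullet might look like the following pair of prompts. These are illustrative only and are not drawn from the benchmark; the field names and rules are assumptions made up for this example.

```python
# Illustrative only: neither prompt is taken from the benchmark itself.

ROLE_PROMPT = (
    "You are an expert financial analyst. "
    "Read the document and tell me the key details."
)

STRUCTURED_CONSTRAINED_PROMPT = (
    "Extract the following fields from the document and return ONLY a JSON object "
    "with exactly these keys:\n"
    '  "invoice_id": string\n'
    '  "total_amount": number\n'
    '  "due_date": ISO 8601 date string, or null if absent\n'
    "Rules:\n"
    "- Do not add keys that are not listed.\n"
    "- Use null for any value not present in the document.\n"
    "- Output the JSON object only, with no explanation and no markdown."
)
```

The constrained version makes format adherence mechanically checkable and leaves little room for extra, hallucinated fields, which is consistent with the pattern reported above.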

Abstract

Prompt engineering is often presented as the primary lever for improving LLM outputs, yet most guidance lacks controlled evaluation. This work benchmarks 30 prompting strategies — including structured prompts, constraints, few-shot examples, reasoning patterns, and self-improvement loops — on a controlled structured extraction and decision task. By holding the dataset and evaluation pipeline fixed and comparing strategies within each model and temperature setting, the study isolates the impact of prompt design. Results show that structured prompts and explicit constraints consistently outperform other strategies, while more complex techniques such as chain-of-thought or role prompting provide limited or context-dependent benefits. The findings suggest that clarity, format enforcement, and task alignment are the dominant factors in prompt effectiveness for well-defined tasks.
