Technical Report · 2026

What Actually Matters in Prompt Engineering: A Controlled Benchmark

Fabio Mesquita

Key result

A controlled benchmark of 30 prompting strategies shows that structure and constraints dominate performance, while role prompting and complex reasoning patterns have limited or inconsistent impact in structured tasks.

Why it matters

  • Prompt engineering is widely used in production systems, yet its real impact is poorly quantified under controlled conditions
  • Most advice focuses on prompt creativity, while this work shows that structure and constraints are the main drivers of performance
  • Understanding which prompting techniques actually matter helps reduce cost, latency, and system complexity in real-world applications

Approach

  • Design of a controlled structured extraction + decision task with fixed schema and ground truth
  • Evaluation of 30 prompting strategies grouped into categories: baseline, structure, constraints, examples, reasoning, self-improvement, and advanced patterns
  • Execution across multiple models (gpt-4o-mini and gpt-4o) and temperatures (0.2 and 0.7) to measure stability and sensitivity
  • Multi-metric evaluation including accuracy, format adherence, hallucination rate, consistency, cost, latency, and robustness across edge cases (a minimal harness sketch follows this list)
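
To make the setup concrete, here is a minimal sketch of what such a sweep could look like. None of this code comes from the report: the strategy definitions, the record fields (text, truth), and the scoring heuristics are hypothetical placeholders, and it assumes the official OpenAI Python client with an API key configured in the environment.

```python
import json
from openai import OpenAI  # assumes the official OpenAI Python client (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical stand-ins: each strategy maps a record to a prompt string,
# and each record carries the input text plus a ground-truth dict.
STRATEGIES = {
    "baseline": lambda r: f"Extract the fields from:\n{r['text']}",
    "structured_constrained": lambda r: (
        "Extract the fields below and return ONLY valid JSON with the keys "
        "name, date, amount. Use null for missing values. No extra text.\n\n"
        f"Text:\n{r['text']}"
    ),
}


def score(output: str, truth: dict) -> dict:
    """Score one output for format adherence, accuracy, and hallucinated keys."""
    try:
        parsed = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        parsed = None
    if not isinstance(parsed, dict):
        return {"format_ok": 0, "accuracy": 0.0, "hallucinated_keys": 0}
    correct = sum(parsed.get(k) == v for k, v in truth.items())
    extra = sum(k not in truth for k in parsed)
    return {
        "format_ok": 1,
        "accuracy": correct / len(truth),
        "hallucinated_keys": extra,
    }


def run_sweep(records, models=("gpt-4o-mini", "gpt-4o"), temps=(0.2, 0.7)):
    """Run every strategy against every model/temperature pair and score each record."""
    results = []
    for name, build_prompt in STRATEGIES.items():
        for model in models:
            for temp in temps:
                for rec in records:
                    resp = client.chat.completions.create(
                        model=model,
                        temperature=temp,
                        messages=[{"role": "user", "content": build_prompt(rec)}],
                    )
                    out = resp.choices[0].message.content
                    results.append({
                        "strategy": name,
                        "model": model,
                        "temperature": temp,
                        **score(out, rec["truth"]),
                    })
    return results
```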

Results

  • Structured prompts with explicit constraints consistently achieve the highest performance across all metrics (an illustrative prompt pair follows this list)
  • Few-shot prompting improves consistency but increases token usage and latency
  • Role prompting has minimal impact in structured extraction tasks compared to format-driven approaches
  • Reasoning-heavy prompts (e.g. chain-of-thought) often degrade performance in deterministic extraction tasks
  • Prompt sensitivity is higher in smaller models, while stronger models are more robust to prompt variations
  • Higher temperature increases variance, making structure and constraints even more critical
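
For concreteness, the contrast behind the first bullet might look like the following pair of prompts. These are illustrative only and are not drawn from the benchmark; the field names and rules are assumptions made up for this example.

```python
# Illustrative only: neither prompt is taken from the benchmark itself.

ROLE_PROMPT = (
    "You are an expert financial analyst. "
    "Read the document and tell me the key details."
)

STRUCTURED_CONSTRAINED_PROMPT = (
    "Extract the following fields from the document and return ONLY a JSON object "
    "with exactly these keys:\n"
    '  "invoice_id": string\n'
    '  "total_amount": number\n'
    '  "due_date": ISO 8601 date string, or null if absent\n'
    "Rules:\n"
    "- Do not add keys that are not listed.\n"
    "- Use null for any value not present in the document.\n"
    "- Output the JSON object only, with no explanation and no markdown."
)
```

The constrained version makes format adherence mechanically checkable and leaves little room for extra, hallucinated fields, which is consistent with the pattern reported above.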

Abstract

Prompt engineering is often presented as the primary lever for improving LLM outputs, yet most guidance lacks controlled evaluation. This work benchmarks 30 prompting strategies — including structured prompts, constraints, few-shot examples, reasoning patterns, and self-improvement loops — on a controlled structured extraction and decision task. By holding the dataset and evaluation pipeline fixed and comparing strategies within each model and temperature setting, the study isolates the impact of prompt design. Results show that structured prompts and explicit constraints consistently outperform other strategies, while more complex techniques such as chain-of-thought or role prompting provide limited or context-dependent benefits. The findings suggest that clarity, format enforcement, and task alignment are the dominant factors in prompt effectiveness for well-defined tasks.
