Agent evaluation is hard because success is not just “the final answer is correct”. A run must also be judged on cost, latency, correct tool use, adherence to safety constraints, and failures that only emerge across multiple steps.
This section covers benchmarks, evaluation frameworks, and metrics that measure consistency and process quality, not just outcome accuracy.
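
To make these dimensions concrete, here is a minimal sketch of a per-run record and a summary that reports process and consistency metrics alongside outcome accuracy. The names `EpisodeResult`, `summarize`, and all field names are hypothetical and chosen for illustration; they do not come from any particular benchmark or framework, and the "all runs pass" rate is just one simple way to express consistency.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """One agent run on one task (hypothetical schema, for illustration)."""
    task_id: str
    success: bool           # outcome: was the final answer correct?
    cost_usd: float         # total spend for the run
    latency_s: float        # wall-clock time for the run
    tool_calls: int         # number of tool invocations attempted
    tool_errors: int        # invocations that failed or were malformed
    safety_violations: int  # count of constraint violations observed


def _rate(flags):
    """Fraction of truthy values in a non-empty list."""
    return sum(1 for f in flags if f) / len(flags)


def summarize(results):
    """Aggregate outcome, process, and consistency metrics over runs."""
    if not results:
        raise ValueError("results must not be empty")

    # Group outcomes by task so repeated runs of a task can be compared.
    by_task = {}
    for r in results:
        by_task.setdefault(r.task_id, []).append(r.success)

    total_calls = sum(r.tool_calls for r in results)
    return {
        # Outcome accuracy: fraction of individual runs that succeeded.
        "success_rate": _rate([r.success for r in results]),
        # Consistency: fraction of tasks solved in every repeated run.
        "all_runs_pass_rate": _rate([all(runs) for runs in by_task.values()]),
        "mean_cost_usd": sum(r.cost_usd for r in results) / len(results),
        "mean_latency_s": sum(r.latency_s for r in results) / len(results),
        # Process quality: how often tool invocations went wrong.
        "tool_error_rate": (
            sum(r.tool_errors for r in results) / total_calls
            if total_calls else 0.0
        ),
        # Safety: fraction of runs with at least one violation.
        "safety_violation_rate": _rate(
            [r.safety_violations > 0 for r in results]
        ),
    }
```

Under this sketch, an agent that solves one task in both of two runs and a second task in only one of two runs gets a `success_rate` of 0.75 but an `all_runs_pass_rate` of 0.5, which is the kind of gap that outcome accuracy alone hides.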