Agent benchmarks attempt to measure end-to-end capability: tool calling, web interaction, long-horizon planning, and reliability.
Why It Matters
Headline scores can be misleading: verified subsets and βproβ variants often reveal large gaps in robustness.