Overview
Agent output grows non-stop. Reviewer time stays approximately constant. This is the fundamental scaling problem of agentic engineering: pull requests as currently designed — line-by-line diff review by a human — will not survive agentic scale.
The question is not “is the model good enough to skip review?” (it is not, yet). The question is: what is the optimal use of human review time?
Pattern 1: Surprise-Driven Triage
Not all code changes deserve equal human scrutiny. The idea: compute a surprise factor for each change and route human attention to the unexpected.
How to compute surprise:
- Core module classification. Designate critical areas of the codebase (auth, payments, data models) as high-scrutiny. Mark auto-generated code, tests, and configuration as low-scrutiny. Changes to core modules always get human review.
- Task-diff mismatch. Compare the task description to the actual diff. A task that says “fix the login button” should not be modifying database schemas. High divergence = high surprise = human review.
- Function signature and type changes. These indicate behavioral changes. Focus review here rather than on implementation details.
- Old/new input-output examples. For behavioral changes, require the agent to provide concrete before/after examples so the reviewer sees the behavioral delta, not the code delta.
def compute_surprise(task_description, diff, codebase_config):
    """Score how surprising a diff is given the task."""
    signals = []

    # Files outside expected scope
    expected_files = predict_affected_files(task_description, codebase_config)
    unexpected = [f for f in diff.files if f not in expected_files]
    signals.append(len(unexpected) / max(len(diff.files), 1))

    # Core module touches (boolean signal, cast to 0.0/1.0)
    core_touches = [f for f in diff.files if f in codebase_config.core_modules]
    signals.append(1.0 if core_touches else 0.0)

    # Signature changes as a fraction of total changes
    sig_changes = count_signature_changes(diff)
    signals.append(sig_changes / max(diff.total_changes, 1))

    return weighted_score(signals)
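The helpers `weighted_score` and the routing decision are left open above. A minimal sketch of both, with equal weights and an illustrative threshold (both are assumptions to tune per codebase, not part of any fixed API):

```python
def weighted_score(signals, weights=None):
    """Combine normalized signals (each in 0.0-1.0) into one score.

    Equal weights by default; tune weights per codebase.
    """
    weights = weights or [1.0] * len(signals)
    total = sum(w * float(s) for w, s in zip(weights, signals))
    return total / sum(weights)


def route_for_review(surprise, human_threshold=0.5):
    """Route a change by surprise score (threshold is illustrative)."""
    return "human_review" if surprise >= human_threshold else "automated_only"


# Example: scope drift and a core-module touch fire; signature signal is low.
score = weighted_score([0.8, 1.0, 0.1])
decision = route_for_review(score)
```

With these example signals the score lands around 0.63, above the threshold, so the change is routed to a human.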
Pattern 2: Artifact-Based Review
The diff is not the only — or even the best — artifact for review. At agentic scale, purpose-built review artifacts are often more efficient than reading code.
Schemas and flowcharts. Generate structural diagrams from the changed code showing data flow, API contracts, and component relationships. Review the architecture, not the syntax.
E2E test videos. For frontend changes, a 30-second screen recording is worth thousands of lines of DOM manipulation code. “At 0:42 that widget fails to update” is faster and more precise than reading the diff.
Data visualizations. Performance before/after charts, dependency graph changes, test coverage maps. Visual comparisons for visual questions.
Agent timelines. A chronological log of the agent’s session:
- What the agent changed at each step
- Summaries of its reasoning at each decision point
- Pivots: where the agent changed approach and why
- Code smells detected during the session
- Hawk Agent outputs and flags
The timeline is often more reviewable than the final diff because it shows the reasoning process that produced the code, not just the code itself.
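One way to represent such a timeline is a list of structured entries a reviewer can skim chronologically. A minimal sketch; the field names and rendering format are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass, field


@dataclass
class TimelineEntry:
    step: int                   # position in the agent session
    action: str                 # e.g. "edit", "pivot", "flag"
    files_changed: list         # files touched at this step
    reasoning_summary: str      # short summary of the agent's rationale
    flags: list = field(default_factory=list)  # code smells, Hawk Agent flags


def render_timeline(entries):
    """Render entries as a plain-text log, marking pivots and flags."""
    lines = []
    for e in entries:
        marker = " [PIVOT]" if e.action == "pivot" else ""
        lines.append(f"{e.step:>3}. {e.action}{marker}: {e.reasoning_summary}")
        for f in e.flags:
            lines.append(f"     ! {f}")
    return "\n".join(lines)
```

A reviewer scanning this log sees pivots and flags called out inline, which is often enough to decide whether the final diff needs a closer look.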
Pattern 3: Layered Review Pipeline
Combine automated checks, agent-based review, and human review in a pipeline where each layer filters what reaches the next:
- Automated layer. Linting, type checking, test execution, coverage checks. Fails hard on regressions. This runs on everything.
- Agent review layer. A review agent (or Hawk Agent) examines the reasoning traces and flags potential reward hacking, architectural concerns, or drift from task intent. This catches what tests cannot.
- Human review layer. Receives only the changes that survived the first two layers, along with the surprise score, review artifacts, and any flags from the Hawk. The human focuses on alignment, architecture, and product intent — not syntax.
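The three layers can be sketched as a single filtering function. The callables passed in (`run_checks`, `agent_review`, `surprise_score`) are assumed interfaces for illustration, not a real API:

```python
def layered_review(change, run_checks, agent_review, surprise_score):
    """Route a change through the three-layer pipeline.

    Each layer filters what reaches the next; only survivors of the
    first two layers reach a human, with context attached.
    """
    # Layer 1: automated checks fail hard on regressions
    ok, failures = run_checks(change)
    if not ok:
        return {"verdict": "rejected", "layer": "automated", "detail": failures}

    # Layer 2: a review agent inspects reasoning traces and flags issues
    flags = agent_review(change)
    if any(f["severity"] == "block" for f in flags):
        return {"verdict": "rejected", "layer": "agent", "detail": flags}

    # Layer 3: the human receives the change plus surprise score and flags
    return {
        "verdict": "human_review",
        "surprise": surprise_score(change),
        "flags": flags,  # non-blocking flags travel with the change
    }
```

The key property is that a human never sees a change that failed an earlier layer, and every change a human does see arrives with its surprise score and agent flags attached.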
Adapting to Agentic Scale
Some practical shifts:
- Review frequency changes. Instead of reviewing one large PR, review many small atomic changes. Agents can produce fine-grained commits that are individually reviewable.
- Review granularity changes. Spend less time on implementation details (the agent will rewrite it cheaply if the approach is wrong) and more time on architectural decisions, API contracts, and behavioral changes.
- Review artifacts change. Invest in generating the artifacts above. The upfront cost of building artifact generation pays for itself quickly at agentic volume.
Related Patterns
- Hawk Agent — Process-level oversight that feeds into the review pipeline.
- Human-in-the-Loop — Structured pause points for review during, not just after, agent execution.