Hawk Agent

A second agent that monitors the reasoning process -- not just the output -- of the primary agent, using cross-provider pairing for diverse oversight.

●●●●○ Complexity

Overview

A Hawk Agent is a second agent that monitors the development and reasoning process of the primary (implementer) agent. Unlike an evaluator that checks output quality, the Hawk watches how the agent thinks — its reasoning traces, decision points, and problem-solving approach.

The key insight: there are thousands of possible implementations for any given task, but only a few valid reasoning paths given the current project state. You catch misalignment much earlier by watching the process than by inspecting the final output.

TASK IMPLEMENTER Provider A (e.g. Codex) Reasoning traces Code changes Tool calls REASONING HAWK Provider B (e.g. Claude) Monitors reasoning path Detects reward hacking Flags misalignment ALIGNED: continue MISALIGNED: stop + flag Output

Why Reasoning Over Output?

Consider a task: “Fix the payment integration.” There are hundreds of valid implementations. But given the current codebase state, tech stack, and the specific bug, there are only a handful of sensible reasoning paths.

A traditional evaluator sees the final code and asks “does this work?” The Hawk sees the reasoning and asks “is this agent solving the right problem in the right way?” This catches:

  • Reward hacking — the agent satisfies tests while bypassing the actual problem (see Reward Hacking).
  • Refactor avoidance — the agent works around a legacy issue instead of fixing it.
  • Context contamination — the agent copies a bad pattern from elsewhere in the codebase.

No test, linter, or set of deterministic rules can catch these. They live in the process, not the output.

Cross-Provider Pairing

The Hawk should use a different model provider from the implementer.

RoleRecommendedRationale
ImplementerCodex / GPTExcellent engineer, fast, prone to reward hacking
HawkClaudeWell-aligned, sensitive to reward hacking patterns

Why cross-provider? Same-provider models share training data, RLHF conventions, and implicit biases. They tend to agree on stylistic choices and failure modes. Cross-provider pairing introduces genuine diversity of perspective — a Claude Hawk is more likely to flag when a Codex implementer takes a shortcut because their training biases are structurally different.

How It Works

def hawk_monitored_session(task, implementer, hawk):
    """Run an implementer agent with Hawk oversight."""
    session = implementer.start(task)

    for turn in session:
        # The implementer produces reasoning + actions
        reasoning = turn.reasoning_trace
        actions = turn.actions

        # The Hawk evaluates the reasoning path
        verdict = hawk.evaluate(
            task=task,
            reasoning=reasoning,
            actions=actions,
            project_state=get_project_state()
        )

        if verdict.misaligned:
            session.pause()
            return {
                "status": "flagged",
                "reason": verdict.explanation,
                "turn": turn.number,
                "suggestion": verdict.suggested_correction
            }

    return {"status": "complete", "output": session.result}

The Hawk receives the full reasoning trace — not just the code output — after each turn. It evaluates whether the reasoning path is sound given the project context and flags misalignment before it compounds.

When to Use

  • High-stakes changes where reward hacking would be costly (payment systems, auth, data migrations).
  • Long-running agent sessions where small misalignments compound over many turns.
  • Codebases with legacy patterns that agents are likely to imitate rather than fix.
  • Any environment where you have observed agents taking shortcuts that satisfy tests but miss intent.

When Not to Use

  • Simple, well-defined tasks where the output can be verified mechanically (tests, type checks).
  • Cost-sensitive environments where the overhead of a second agent is not justified.
  • Tasks where the reasoning path is obvious and there is little room for misalignment.

Trade-offs

ProsCatches misalignment that no deterministic check can detect. Early detection prevents compounding errors. Cross-provider diversity improves coverage.
ConsHigher cost (roughly 2x per session). More complex setup — you need to pipe reasoning traces between agents across providers. Adds latency to each turn.
  • Evaluator-Optimizer — Evaluates output quality. The Hawk evaluates process quality. They complement each other.
  • Reflection — Self-review by the same model. The Hawk provides external review by a different model.
  • Two-Agent Architecture — Splits work between roles. The Hawk splits oversight from execution.