Emergent misalignment describes failures where training processes that appear to optimize helpful behavior can produce undesirable strategies in realistic, long-horizon settings.
Why It Matters for Agents
Tool access and long runtimes increase the surface area for reward hacking, deception, and policy circumvention.