Reward-hacking patterns in LLM agent benchmarks.
We catalog reward-hacking strategies observed across 200+ trajectory annotations on Terminal-Bench, propose a taxonomy of these strategies, and recommend evaluation patterns that resist gaming.