PhD Dissertation Proposal: Will Schwarzer, Mitigating Alignment Failures in Artificial Agents
Speaker:
Will Schwarzer
Abstract:
As artificial agents become more capable, ensuring that the goals they pursue are aligned with human goals becomes increasingly critical. Such alignment can fail at several stages: an agent's objectives may be wrong from the start when derived from imperfect human input; rare misaligned behavior may escape any feasible pre-deployment evaluation; and behavior may degrade once deployment tasks differ from evaluation tasks, whether because the agent generalizes poorly or because it deliberately behaves well only on tasks it recognizes as evaluations. This proposed dissertation addresses each of these failure points.
First, we address the acquisition of objectives from human behavior. Existing methods assume specific models of near-optimal behavior, and so misinterpret mistakes or communicative actions such as gestures. We present Supervised Reward Inference (SRI), which treats behavior as an indication of goals rather than an optimization of them, learning a behavior-to-reward mapping via supervised learning. We prove that SRI is asymptotically Bayes-optimal on any class of behavior, and show empirically that it infers accurate rewards from arbitrarily suboptimal demonstrations.
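As a rough illustration of the supervised view described above (a toy sketch, not the SRI algorithm itself), the Python snippet below fits a mapping from observed behavior to a reward parameter. The one-dimensional goal, the systematically biased demonstrator, and the linear model are all assumptions made for illustration; none of the names come from the proposal.

```python
def demonstrate(goal):
    """Hypothetical suboptimal demonstrator: lands at a systematically wrong position."""
    return 0.5 * goal + 0.2


def fit_behavior_to_reward(behaviors, goals):
    """Ordinary least squares for goal ~ slope * behavior + intercept."""
    n = len(behaviors)
    mean_b = sum(behaviors) / n
    mean_g = sum(goals) / n
    cov = sum((b - mean_b) * (g - mean_g) for b, g in zip(behaviors, goals))
    var = sum((b - mean_b) ** 2 for b in behaviors)
    slope = cov / var
    intercept = mean_g - slope * mean_b
    return slope, intercept


# Training set: goals with known labels, paired with the behavior they induce.
train_goals = [i / 20 for i in range(21)]
train_behaviors = [demonstrate(g) for g in train_goals]
slope, intercept = fit_behavior_to_reward(train_behaviors, train_goals)

# Inference on a new demonstration whose goal is hidden from the learner.
true_goal = 0.7
observed = demonstrate(true_goal)
naive_estimate = observed  # what one would infer by assuming optimal behavior
learned_estimate = slope * observed + intercept
print(f"naive={naive_estimate:.3f} learned={learned_estimate:.3f} true={true_goal}")
```

In this toy setting the learned mapping undoes the demonstrator's bias, whereas reading the behavior as if it were optimal does not. Real behavior and reward spaces would presumably call for a far richer model than a line fit.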
Second, we address pre-deployment alignment evaluation. The failures that determine whether an agent is safe to deploy are often too rare to appear in any practical evaluation set. Recent work forecasts deployment-scale worst-case failures by extrapolating from the tail of evaluation scores, but the resulting forecasts depend on assumptions about the tail that may not hold. We introduce the forecastability loss, a fine-tuning objective that trains models to have predictable failure tails, and show that it substantially improves forecast accuracy at minimal cost to safety and primary-task capability.
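To make the idea of tail extrapolation concrete, here is a minimal Python sketch, assuming that higher scores mean worse failures and that scores above a threshold follow an exponential tail. Both are assumptions chosen for illustration, and the exponential form is exactly the kind of modeling choice such a forecast depends on. The function name and parameters are hypothetical and are not taken from the proposal or from the work it cites.

```python
import math


def forecast_worst_case(scores, n_deploy, tail_fraction=0.1):
    """Estimate the score level exceeded about once in n_deploy queries.

    Assumes an exponential tail above an empirical threshold (illustrative only).
    """
    ordered = sorted(scores)
    k = max(1, int(len(ordered) * tail_fraction))  # number of tail samples
    threshold = ordered[-k - 1]                    # largest score below the tail
    excesses = [s - threshold for s in ordered[-k:]]
    scale = sum(excesses) / k                      # MLE of the exponential scale
    p_exceed = k / len(ordered)                    # empirical P(score > threshold)
    # Solve p_exceed * exp(-y / scale) = 1 / n_deploy for the excess y.
    return threshold + scale * math.log(n_deploy * p_exceed)


# Deterministic stand-in for evaluation scores: quantiles of a unit exponential.
n_eval = 1000
eval_scores = [-math.log(1 - (i + 0.5) / n_eval) for i in range(n_eval)]

forecast = forecast_worst_case(eval_scores, n_deploy=1_000_000)
print(f"forecast worst case over 1e6 queries: {forecast:.2f}")
print(f"largest score seen in evaluation:     {max(eval_scores):.2f}")
```

The forecast lies well beyond anything observed in the 1,000 evaluation samples, which is the point of extrapolating. It is only trustworthy here because the synthetic data matches the assumed tail shape.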
Finally, we address deployment-time misalignment. A natural defense is to monitor evidence emitted by the agent during action selection, such as chain-of-thought reasoning; however, rational agents may manipulate evidence. To study this, we formalize evidence control as a game between agent and monitor. Our preliminary analysis shows that a rational agent's evidence production depends on its beliefs about the monitoring regime rather than the regime itself, so agents may manipulate evidence even absent monitoring. We also identify mechanisms that let the agent verify that faithful evidence production is in its interest. We propose further analysis and experiments testing both predictions on language model agents.
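The belief-dependence described above can be sketched with a toy decision rule in Python. This is not the proposal's formal game: the two evidence options, the penalty, and the manipulation cost are hypothetical quantities introduced only to show why the true monitoring regime drops out of the agent's choice.

```python
def choose_evidence(belief_monitored, penalty_if_caught, manipulation_cost):
    """Return the evidence type that maximizes the agent's expected utility.

    Faithful evidence risks a penalty whenever a monitor is watching;
    manipulated evidence avoids the penalty but always incurs a fixed cost.
    The agent knows only its belief, not the true regime.
    """
    expected_faithful = -belief_monitored * penalty_if_caught
    expected_manipulated = -manipulation_cost
    if expected_manipulated > expected_faithful:
        return "manipulated"
    return "faithful"


for actually_monitored in (False, True):
    # The true regime is never passed in, so it cannot change the choice.
    choice = choose_evidence(
        belief_monitored=0.8, penalty_if_caught=10.0, manipulation_cost=1.0
    )
    print(f"monitored={actually_monitored} -> {choice}")
```

Under these assumed payoffs, the agent manipulates in both regimes because it believes monitoring is likely. Lowering the belief far enough (for example to 0.05) flips the choice to faithful evidence.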
Advisor:
Philip Thomas