PhD Thesis Defense: Abhinav Bhatia, Learning to Think about Thinking: Metareasoning with Deep Reinforcement Learning for Efficient and Safe Decision-Making
Speaker: Abhinav Bhatia
Abstract:
Intelligent agents—such as robots and autonomous vehicles—often face practical limits on critical factors like computation time, available data, or acceptable safety risk. For example, a robot cannot deliberate indefinitely before each movement, a learning algorithm cannot gather data without bound, and an autonomous vehicle must balance passenger comfort against safety requirements.
Historically, these challenges have been addressed through metareasoning, broadly defined as "reasoning about reasoning." This framework emerged as a key approach to achieving bounded optimality—the principle that rational agents should make the best possible decisions given real-world resource constraints. It treats dynamic allocation of computational effort to maximize overall performance of an algorithm as a sequential "meta-level" decision problem. Early metareasoning work relied heavily on explicit predictive models and hand-crafted analyses of algorithm behavior. More recently, researchers have begun applying model-free reinforcement learning (RL) to learn meta-level control policies directly from experience, primarily for specific algorithms such as anytime planners and with computation time as the main limiting factor.
This thesis significantly generalizes the scope of metareasoning by treating any factor that restricts an agent's ability to optimize its core objective—such as computation time, data availability, or even permissible risk exposure—as a bounded resource to be managed at the meta-level. To address this broader setting, we adopt a model-free deep RL approach in which a meta-level controller observes internal state variables of the underlying algorithm and learns adaptive control patterns directly from experience, enabling agents to "learn how to think about their own thinking."
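As a toy illustration of the meta-level control loop described above (the function names, the improvement model, and all numbers here are illustrative assumptions, not details from the thesis), a meta-controller can observe an anytime algorithm's internal state after each unit of computation and decide whether further deliberation is worth its cost:

```python
import random

def anytime_step(state):
    """Hypothetical anytime solver: each unit of computation improves
    solution quality, with diminishing returns."""
    quality, steps = state
    quality += (1.0 - quality) * random.uniform(0.0, 0.3)
    return (quality, steps + 1)

def meta_policy(state, cost_per_step=0.02):
    """Stand-in for a learned meta-level policy: continue deliberating
    only while the expected marginal gain exceeds the computation cost."""
    quality, _ = state
    expected_gain = (1.0 - quality) * 0.15  # crude model of the next improvement
    return "continue" if expected_gain > cost_per_step else "stop"

def run_episode():
    """Meta-level control loop: observe the solver's internal state,
    decide to continue or stop, and return (final quality, steps used)."""
    state = (0.0, 0)
    while meta_policy(state) == "continue":
        state = anytime_step(state)
    return state

quality, steps = run_episode()
print(f"stopped at quality {quality:.2f} after {steps} steps")
```

In the thesis setting, the hand-written `meta_policy` above is replaced by a deep RL policy trained on the algorithm's internal state variables, which is what allows the learned stopping behavior to adapt to the observed progress of each run.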
We present four contributions that illustrate the breadth and impact of this idea. First, we demonstrate deep-RL metareasoning for the joint optimization of stopping points and hyperparameters in anytime planning algorithms, significantly outperforming heuristic strategies. Second, we apply metareasoning to model-based RL, enabling dynamic selection of prediction horizons to optimize final policy quality under a limited environment interaction budget and highlighting the advantage of closed-loop strategies over fixed heuristic schedules. Third, we apply metareasoning to model-free RL itself, introducing a novel architecture called RL³, in which a meta-controller manages action selection for a general-purpose RL learner. RL³ thereby defines a new meta-reinforcement learning algorithm that substantially improves exploration-exploitation trade-offs, generalization to broader task distributions, and meta-training efficiency. Finally, we formulate and address the problem of safe runtime policy personalization (SRPP), which enables users to customize the personality of customer-facing autonomous agents, for example the driving style of autonomous vehicles, at runtime through multiple preference variables—yielding millions of possible configurations rather than fixed presets—while still guaranteeing safety even for untested settings. We provide a formal mathematical framework for SRPP and propose a portfolio-based metareasoning approach in which a small set of certified-feasible policies is computed offline and a meta-level controller selects among them while tracking safety and failure risk margins as resources. Experiments demonstrate that our approach achieves high levels of personalization within acceptable risk margins across all test users, whereas baselines that lack explicit risk management frequently violate safety requirements.
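The portfolio-based approach to SRPP can be sketched in miniature (the portfolio entries, risk figures, and function names below are illustrative assumptions, not details from the thesis): a small set of certified policies is fixed offline, and a meta-level selector picks the policy closest to the user's preference whose projected risk still fits within the remaining safety budget.

```python
# Hypothetical portfolio of offline-certified policies. "comfort" encodes the
# policy's personality (higher = more cautious); "risk_per_step" is its
# certified per-step failure risk. All values are illustrative.
PORTFOLIO = {
    "cautious":  {"comfort": 0.9, "risk_per_step": 0.001},
    "balanced":  {"comfort": 0.6, "risk_per_step": 0.005},
    "assertive": {"comfort": 0.3, "risk_per_step": 0.010},
}

def select_policy(preference, risk_budget_remaining, horizon_remaining):
    """Meta-level selector: among policies whose projected risk over the
    remaining horizon fits the budget, pick the one best matching the
    user's preference (0 = maximum comfort, 1 = maximum assertiveness)."""
    feasible = {
        name: spec for name, spec in PORTFOLIO.items()
        if spec["risk_per_step"] * horizon_remaining <= risk_budget_remaining
    }
    if not feasible:  # fall back to the safest certified policy
        return "cautious"
    return min(feasible,
               key=lambda n: abs((1 - feasible[n]["comfort"]) - preference))

# A user who wants assertive driving is overridden when little risk
# budget remains, but accommodated when the budget allows it.
print(select_policy(preference=0.8, risk_budget_remaining=0.4,
                    horizon_remaining=100))  # prints "cautious"
print(select_policy(preference=0.8, risk_budget_remaining=2.0,
                    horizon_remaining=100))  # prints "assertive"
```

Treating the risk budget as a depletable meta-level resource is what distinguishes this selector from a pure preference matcher: the same user request yields different policies depending on how much safety margin remains.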
Together, these contributions establish a scalable, general-purpose framework for learned metareasoning, significantly advancing the goal of adaptive, resource-aware artificial intelligence.
Advisor: Shlomo Zilberstein