Speaker

Abhinav Bhatia

Abstract

Intelligent agents—such as robots and autonomous vehicles—often face practical limits on critical factors like computation time, available data, or acceptable levels of safety risk. For example, a robot cannot deliberate indefinitely before each movement, a learning algorithm cannot endlessly gather data, and an autonomous vehicle must balance passenger comfort with acceptable risk. Historically, these challenges have been addressed through metareasoning, which broadly refers to "reasoning about reasoning". Metareasoning emerged as a key framework for achieving bounded optimality—the principle that rational agents should make the best possible decisions given real-world resource constraints. It frames the allocation of computational effort and resources toward maximizing overall performance as a sequential "meta-level" decision problem. Early metareasoning work relied heavily on explicit predictive models and handcrafted analyses of algorithm behavior. More recently, researchers have begun applying model-free reinforcement learning (RL) to learn meta-level control policies directly from experience, primarily focusing on specific algorithms like anytime planners and considering computation time as the main limiting factor.

This thesis significantly generalizes the scope of metareasoning by treating any factor that restricts an agent's ability to optimize its core objective—such as computation time, data availability, or even user tolerance for safety risks—as a bounded resource to be managed at the meta-level. To address this broader setting, we adopt a model-free deep RL approach in which a meta-level controller observes internal state variables of the underlying algorithm it monitors and learns adaptive control patterns directly from experience, effectively enabling agents to "learn how to think about their own thinking".
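To make the meta-level decision loop concrete, the sketch below shows its simplest form: an object-level algorithm exposes a few internal variables, and a meta-level policy decides, step by step, whether to spend another unit of the bounded resource. This is an illustrative sketch only; the class and function names (ObjectLevelAlgorithm, meta_policy) and all numbers are hypothetical placeholders, not the systems developed in the thesis, and the hand-written rule stands in for what would be a learned deep RL policy.

```python
# Minimal sketch of a meta-level control loop (illustrative names and numbers).
import random

class ObjectLevelAlgorithm:
    """Stand-in for any bounded computation, e.g. an anytime planner."""
    def __init__(self):
        self.quality, self.steps = 0.0, 0

    def internal_state(self):
        # Internal variables exposed to the meta-level controller.
        return (self.quality, self.steps)

    def step(self):
        # One unit of computation with diminishing returns on solution quality.
        self.quality += (1.0 - self.quality) * random.uniform(0.05, 0.15)
        self.steps += 1

def meta_policy(state):
    # Placeholder for a learned meta-level policy: decide whether to spend
    # another unit of the bounded resource (here, computation steps).
    quality, steps = state
    return "continue" if quality < 0.9 and steps < 50 else "stop"

algo = ObjectLevelAlgorithm()
while meta_policy(algo.internal_state()) == "continue":
    algo.step()

# The meta-level reward would trade off final quality against resource use,
# e.g. reward = algo.quality - 0.01 * algo.steps
print(algo.quality, algo.steps)
```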

We present four contributions illustrating the breadth and impact of this idea. First, we demonstrate deep-RL-based metareasoning for the joint optimization of stopping points and hyperparameters in anytime planning algorithms, significantly outperforming static or heuristic strategies. Second, we apply metareasoning to model-based RL, enabling dynamic selection of prediction horizons to optimize final policy quality under a limited environment-interaction budget, highlighting the advantage of closed-loop strategies over fixed heuristic schedules. Third, we apply metareasoning to model-free RL itself, introducing a novel architecture called RL³, in which a meta-controller directly supervises and overrides a general-purpose RL algorithm. RL³ effectively defines a new meta-reinforcement learning algorithm that substantially improves exploration-exploitation trade-offs, generalization to broader task distributions, and meta-training efficiency.
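As an illustration of the first contribution's setting, the sketch below shows what a joint meta-action space over stopping and hyperparameter choice might look like for an anytime planner, together with a time-dependent utility. Everything here is a hypothetical stand-in: the hyperparameter values, the quality model, and the myopic continue/stop rule are placeholders for the learned meta-controller and the actual planners studied in the thesis.

```python
# Hypothetical sketch: jointly choosing when to stop an anytime planner and
# which hyperparameter to run it with next (illustrative numbers only).
import random

HYPERPARAMS = [0.2, 0.5, 0.9]   # e.g. a planner's exploration weight
META_ACTIONS = [("continue", h) for h in HYPERPARAMS] + [("stop", None)]

def plan_one_slice(quality, hyperparam):
    # Stochastic quality improvement that depends on the chosen hyperparameter.
    return quality + (1.0 - quality) * random.uniform(0.0, hyperparam * 0.2)

def utility(quality, elapsed):
    # Time-dependent utility: solution quality minus a cost per time slice.
    return quality - 0.02 * elapsed

quality, elapsed = 0.0, 0
while True:
    # A learned meta-controller would map (quality, elapsed, ...) to one of
    # META_ACTIONS; here a simple myopic rule stands in for that policy:
    # continue (with the most aggressive hyperparameter) only while the
    # expected quality gain exceeds the per-slice time cost.
    expected_gain = (1.0 - quality) * 0.9 * 0.1
    action = ("continue", 0.9) if expected_gain > 0.02 else ("stop", None)
    if action[0] == "stop":
        break
    quality = plan_one_slice(quality, action[1])
    elapsed += 1

print("final utility:", utility(quality, elapsed))
```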

Finally, we propose a metareasoning approach to safe policy personalization in autonomous driving, where the meta-controller dynamically switches among pretrained safe policies at runtime to optimize for individual user preferences while explicitly managing the user's tolerance for safety risks as a bounded resource.
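The sketch below conveys the shape of this last idea: a meta-controller repeatedly selects among pretrained safe policies while treating the user's risk tolerance as a budget that must not be exhausted. The policy names, per-step comfort and risk values, and the greedy selection rule are all invented for illustration; the thesis's meta-controller would instead be learned and condition on the driving context as well as the remaining budget.

```python
# Hedged sketch of policy switching under a bounded risk budget (made-up values).
# (comfort reward per step, risk consumed per step) for each pretrained policy:
PRETRAINED_POLICIES = {"cautious": (0.2, 0.0), "moderate": (0.5, 0.05), "sporty": (0.9, 0.15)}

risk_budget = 1.0       # user's stated tolerance, managed as a bounded resource
total_comfort = 0.0
for step in range(20):
    # Greedily pick the most preferred policy that still fits within the
    # remaining risk budget; a learned meta-controller replaces this rule.
    affordable = {n: (c, r) for n, (c, r) in PRETRAINED_POLICIES.items() if r <= risk_budget}
    name, (comfort, risk) = max(affordable.items(), key=lambda kv: kv[1][0])
    total_comfort += comfort
    risk_budget -= risk

print(total_comfort, risk_budget)
```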

Together, these contributions establish a scalable, general-purpose framework for learned metareasoning, significantly advancing the goal of adaptive, resource-aware artificial intelligence.