PhD Dissertation Proposal: Shreyas Chaudhari, Compact Reinforcement Learning: Resource-Efficient Formulations for Large-Scale Decision Making
Speaker: Shreyas Chaudhari
Abstract:
Reinforcement learning (RL) offers a general framework for addressing problems involving sequential decision-making. However, modern practical applications of RL involve intractably large state and action sets, such as those arising in slate recommendation, image-based control, and large language modeling. Standard formulations of these problems demand prohibitive amounts of data and computation. Achieving high performance without incurring high resource costs requires compact problem formulations that capture task-relevant information while abstracting away the rest. This thesis develops and analyzes such compact formulations for decision-making problems characterized by large action sets, large state sets, and the resulting challenge of obtaining reward feedback over them.
We first address the challenge of large action sets in distributional off-policy evaluation (OPE) for slate recommendations. In this setting, the need for risk-sensitive metrics makes it essential to estimate the full performance distribution from logged data, but standard estimators of this distribution are highly sample-inefficient and suffer from high variance. We propose a low-variance variant of the standard estimator that relaxes the fully joint and interdependent treatment of the combinatorial action space, and we establish conditions under which the resulting estimator remains unbiased. Empirical evaluation on real-world datasets demonstrates that even when these conditions are not strictly satisfied, the proposed method outperforms more general baselines.
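To illustrate the general idea (not the exact estimator developed in the thesis), the minimal sketch below contrasts a fully joint importance weight over a slate with a relaxed, per-slot weight; the slot-wise factorization is the kind of relaxation that reduces variance when the logging and target policies (approximately) decompose over slots. All function and variable names here are hypothetical.

```python
import numpy as np

def joint_weight(slate, pi_e, pi_b):
    """Importance weight treating the whole slate as a single joint action.
    pi_e(slate) and pi_b(slate) are slate probabilities under the target and
    logging policies; variance scales with the combinatorial slate space."""
    return pi_e(slate) / pi_b(slate)

def per_slot_weight(slate, pi_e_slot, pi_b_slot):
    """Relaxed weight: product of per-slot probability ratios.
    Assumes the policies (approximately) factorize across slots -- the sort of
    structural condition under which a relaxed estimator can stay unbiased
    while having much lower variance."""
    return np.prod([pi_e_slot(k, item) / pi_b_slot(k, item)
                    for k, item in enumerate(slate)])

def weighted_return_samples(logged_data, weight_fn):
    """Reweight logged returns so their weighted empirical distribution
    estimates the return distribution under the target policy.
    logged_data: list of (slate, return) pairs;
    weight_fn: e.g. lambda s: per_slot_weight(s, pi_e_slot, pi_b_slot)."""
    returns = np.array([ret for (_, ret) in logged_data])
    weights = np.array([weight_fn(slate) for (slate, _) in logged_data])
    return returns, weights / weights.sum()  # self-normalized weights
```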
We next focus on large state spaces and long-horizon problems, where classical OPE methods become intractable or error-prone. We develop a framework that combines state abstraction with model learning to distill complex, potentially continuous problems into compact, discrete models called abstract reward processes (ARPs), which preserve sufficient information about policy performance. Within the proposed OPE framework, ARPs are estimated from off-policy data, and we prove that their predictions are consistent. Varying the granularity of the state-abstraction function yields a broad range of ARP-based OPE estimators, encompassing classical methods as limiting cases. Empirical evaluations demonstrate that estimators within this framework outperform existing OPE baselines across a variety of domains.
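The sketch below conveys the flavor of the construction: trajectories are mapped through a state-abstraction function into a small discrete state set, transition and reward statistics are tallied over the abstract states, and the resulting compact reward process is solved for its value. This is only a schematic under simplifying assumptions (it omits the off-policy correction and terminal-state handling that the thesis addresses); the function and variable names are hypothetical.

```python
import numpy as np

def estimate_arp_value(trajectories, phi, n_abstract, gamma=0.99):
    """Estimate an abstract reward process (ARP) and its value.

    trajectories: list of [(state, reward), ...] sequences, assumed here to
        already reflect the evaluation policy (off-policy correction omitted).
    phi: state-abstraction function mapping raw states to {0, ..., n_abstract-1}.
    gamma: discount factor (an assumption of this sketch).
    """
    counts = np.zeros((n_abstract, n_abstract))   # abstract transition counts
    reward_sum = np.zeros(n_abstract)             # summed rewards per abstract state
    visits = np.zeros(n_abstract)                 # visit counts per abstract state
    start = np.zeros(n_abstract)                  # initial-state counts

    for traj in trajectories:
        start[phi(traj[0][0])] += 1
        for (s, r), (s_next, _) in zip(traj[:-1], traj[1:]):
            z, z_next = phi(s), phi(s_next)
            counts[z, z_next] += 1
            reward_sum[z] += r
            visits[z] += 1

    # Maximum-likelihood estimates of the compact, discrete model.
    P = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    R = reward_sum / np.maximum(visits, 1)
    d0 = start / start.sum()

    # Value of the abstract reward process: v = (I - gamma * P)^{-1} R.
    v = np.linalg.solve(np.eye(n_abstract) - gamma * P, R)
    return float(d0 @ v)
```

Coarsening or refining phi trades off model bias against estimation variance, which is what yields the spectrum of ARP-based estimators described above.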
Finally, we propose studying the problem of reward selection in reinforcement learning under limited-feedback constraints. While large amounts of data can often be collected or generated cheaply in modern applications, obtaining reward feedback over large state and action sets is expensive in practice, particularly when it relies on human feedback. When RL is constrained to operate with limited feedback, a central question arises: which subset of the data should be reward-labeled to maximize policy performance? We formalize this reward selection problem and empirically investigate several sampling-based selection strategies, aiming to identify selection patterns that generalize across domains and enable more efficient use of reward feedback.
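As a minimal illustration of what a sampling-based selection strategy looks like under a labeling budget, the sketch below chooses which transitions receive reward labels either uniformly at random or greedily by a priority score (for example, a model-uncertainty or novelty heuristic). The interface and names are hypothetical, not the thesis's formulation.

```python
import numpy as np

def select_for_labeling(transitions, budget, scores=None, rng=None):
    """Choose which transitions receive (expensive) reward labels.

    transitions: pool of unlabeled transitions.
    budget: number of reward labels that can be afforded.
    scores: optional per-transition priority (e.g. uncertainty or novelty
        under some scoring heuristic); if None, sample uniformly at random.
    Returns the indices of the transitions to send for reward labeling.
    """
    rng = rng or np.random.default_rng(0)
    n = len(transitions)
    if scores is None:
        # Uniform random selection: the simplest baseline strategy.
        return rng.choice(n, size=min(budget, n), replace=False)
    # Greedy variant: label the highest-priority transitions first.
    return np.argsort(np.asarray(scores))[::-1][:budget]
```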
Advisors:
Bruno Castro da Silva and Philip Thomas