Speaker:

Shreyas Chaudhari

Abstract:

Reinforcement learning (RL) offers a general framework for addressing problems involving sequential decision-making. However, modern practical applications of RL involve prohibitively large state and action sets, such as those arising in slate recommendation, image-based control, and large language modeling tasks. Standard formulations of these problems demand intractable amounts of data and computational resources. Achieving high performance without incurring high resource costs requires compact problem formulations that capture task-relevant information while abstracting away the rest. This thesis develops and analyzes such compact formulations for decision-making problems characterized by large action and state sets, as well as the corresponding challenge of obtaining reward feedback over them.

We first address the challenge of large action sets encountered in off-policy evaluation (OPE) for slate recommendations. Risk-sensitive metrics (such as quantiles and CVaR), which are common for evaluating performance in this setting, require estimating the full distribution of performance from logged data rather than just its expectation. This requires distributional OPE, which in this high-dimensional setting is severely sample-inefficient and suffers from high variance. We introduce a low-variance variant of the standard estimator that approximates the joint influence of all action dimensions through a simpler combination of per-dimension (marginal) effects, and we establish conditions under which the resulting estimator remains unbiased. Empirical evaluations on real-world datasets demonstrate that, even when these conditions are not strictly satisfied, the proposed method outperforms more general baselines.
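To make the idea concrete, here is a minimal sketch, not the thesis's actual estimator, of how a per-slot (marginal) importance weight could replace the joint importance weight over a whole slate, and how the resulting weights yield a quantile of the importance-weighted return distribution. The function names, the factored-policy form, and the uniform logging example are all illustrative assumptions.

```python
def marginal_weight(slate, target_probs, logging_probs):
    """Product of per-slot importance ratios pi_e(a_k) / pi_b(a_k).

    This replaces the joint slate-level ratio; one sufficient (but
    illustrative) condition for unbiasedness is that both policies
    factor independently across slots."""
    w = 1.0
    for k, a in enumerate(slate):
        w *= target_probs[k][a] / logging_probs[k][a]
    return w


def weighted_quantile(returns, weights, q):
    """q-quantile of the importance-weighted empirical distribution."""
    order = sorted(range(len(returns)), key=lambda i: returns[i])
    total = sum(weights)
    cum = 0.0
    for i in order:
        cum += weights[i]
        if cum / total >= q:
            return returns[i]
    return returns[order[-1]]


# Two slots, two actions each; logging is uniform, target prefers action 0.
target = [{0: 0.8, 1: 0.2}, {0: 0.8, 1: 0.2}]
logging = [{0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}]
w = marginal_weight((0, 1), target, logging)  # (0.8/0.5) * (0.2/0.5) = 0.64
```

The key saving is that only per-slot propensities are needed, so the weight is a product of a few moderate ratios rather than one ratio over an exponentially large joint slate space.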

We next focus on large state spaces and long-horizon problems, where classical OPE methods become intractable or error-prone. We develop a framework that combines state abstraction with model learning to distill complex, potentially continuous problems into compact, discrete models called abstract reward processes (ARPs) that preserve sufficient information about policy performance. Within the proposed OPE framework, ARPs are estimated from off-policy data, and we prove that their performance predictions are consistent. Varying the granularity of the state abstraction function reveals a broad range of previously unexplored hybrid ARP-based OPE estimators, with classical methods recovered as limiting cases. Empirical evaluations demonstrate that estimators within this framework outperform existing OPE baselines across a variety of domains.
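The construction above can be sketched in a few lines, assuming a simple tabular setting rather than the thesis's general framework: an abstraction function `phi` maps raw states to a small set of abstract states, empirical transition and reward statistics are tallied over those abstract states, and policy performance is read off the resulting small Markov reward process. The names `build_arp` and `arp_value`, and the uniform trajectory weights, are illustrative; an off-policy variant would carry importance weights per trajectory.

```python
from collections import defaultdict


def build_arp(trajectories, phi, weights=None):
    """Estimate an abstract reward process over z = phi(s).

    Each trajectory is a list of (state, reward, next_state) triples;
    `weights` would hold per-trajectory importance weights for
    off-policy correction (uniform here for simplicity)."""
    trans = defaultdict(lambda: defaultdict(float))
    rew = defaultdict(float)
    visits = defaultdict(float)
    weights = weights or [1.0] * len(trajectories)
    for w, traj in zip(weights, trajectories):
        for s, r, s_next in traj:
            z, z_next = phi(s), phi(s_next)
            trans[z][z_next] += w
            rew[z] += w * r
            visits[z] += w
    P = {z: {zn: c / visits[z] for zn, c in row.items()}
         for z, row in trans.items()}
    R = {z: rew[z] / visits[z] for z in visits}
    return P, R


def arp_value(P, R, z0, gamma=0.9, iters=200):
    """Discounted value of abstract state z0 via value iteration."""
    V = {z: 0.0 for z in R}
    for _ in range(iters):
        V = {z: R[z] + gamma * sum(p * V.get(zn, 0.0)
                                   for zn, p in P.get(z, {}).items())
             for z in R}
    return V[z0]
```

Because the ARP lives over abstract states only, its size is controlled by the granularity of `phi` rather than by the raw state space, which is what makes the hybrid family of estimators tractable.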

Finally, we study reward selection in reinforcement learning under limited-feedback constraints. While large amounts of state-transition data can often be collected or generated cheaply in modern applications, obtaining reward feedback over large state and action sets is expensive in practice, particularly when it relies on human feedback. When RL is constrained to operate with limited or partial reward feedback, a central question arises: which subset of the data should be reward-labeled to maximize policy performance? We formalize this reward selection problem and empirically investigate several sampling-based selection strategies, aiming to identify selection patterns that generalize across domains and enable more efficient use of reward feedback. Our results show that effective selection strategies can match the performance of learning from a fully labeled dataset while reward-labeling only a small fraction of the data, enabling feedback-efficient reinforcement learning in large-scale, resource-constrained problems.
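The selection problem can be sketched as choosing, under a labeling budget, which transitions from an unlabeled pool get reward labels. The two strategies below, uniform sampling and a greedy state-coverage heuristic, are illustrative stand-ins and are not claimed to be the strategies studied in the thesis.

```python
import random


def select_uniform(pool, budget, rng):
    """Baseline: label a uniform random subset of the pool."""
    return rng.sample(range(len(pool)), budget)


def select_coverage(pool, budget, phi):
    """Greedy coverage heuristic: prefer transitions whose abstract
    state phi(s) is not yet represented in the labeled set."""
    chosen, seen = [], set()
    for i, (s, a) in enumerate(pool):
        z = phi(s)
        if z not in seen:
            chosen.append(i)
            seen.add(z)
        if len(chosen) == budget:
            return chosen
    # Budget not exhausted by novel states: fill with remaining indices.
    for i in range(len(pool)):
        if i not in chosen:
            chosen.append(i)
            if len(chosen) == budget:
                break
    return chosen


# Pool of (state, action) pairs; budget of 3 labels.
pool = [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (3, 0)]
labeled = select_coverage(pool, 3, phi=lambda s: s)  # picks indices 0, 1, 3
```

Only the transitions at the selected indices would then be sent for (e.g., human) reward labeling, with the rest of the pool used unlabeled for transition data.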

Advisor:

Bruno Castro da Silva