PhD Thesis Defense: Dhawal Gupta, Improving Temporal Credit Assignment in Reinforcement Learning
Content
Speaker:
Abstract:
Reinforcement learning (RL) studies sequential decision-making problems in which an agent interacts with an environment and improves its behavior from scalar reward feedback. Unlike supervised learning, this feedback does not usually specify which individual decision is correct; it only indicates the relative quality of outcomes produced by a sequence of decisions. This makes \emph{credit assignment} a central challenge in RL. This dissertation focuses on the \emph{temporal credit assignment problem} (TCAP), which relates to determining the extent to which certain observed outcomes may be attributed to particular earlier states and actions taken by the agent. TCAP manifests in several forms. In policy evaluation settings, for instance, delayed or stochastic reward feedback may lead to noisy estimates of long-term return. Similarly, in sparse-reward settings, the agent may receive limited immediate, short-term feedback on which decisions may have contributed to success or failure in solving a given task. This dissertation examines three manifestations of the TCAP and develops methods for making value function estimation, update rules based on temporal differences, and reward-function design more reliable, robust, and more likely to facilitate credit assignment in challenging RL problems.
The first contribution of this thesis examines TCAP in policy evaluation (or, more generally, prediction of long-term signals) under limited interaction data. In this setting, the agent's goal is to predict long-term outcomes (e.g., expected sums of future rewards) resulting from following a fixed policy. Replay buffers provide a way to increase sample efficiency in such settings by reusing past interactions, represented as previously observed transitions, to perform additional value-function updates that are more likely to accelerate learning. The use of such replay mechanisms, however, raises the question of which transitions should be replayed at each moment, and with what priority, given the agent's current knowledge. Commonly used replay strategies include sampling prior transitions uniformly at random or sampling according to task-specific priorities, such as rewards or prediction errors that would result from replaying a particular prior experience. In this chapter, by contrast, we investigate whether (and under what assumptions) replay strategies can alternatively be guided by \emph{reward-agnostic} structure in prior experiences—namely, structure and patterns that underlie the domain itself in which predictions are made, independently of the agent's particular goal. In the linear temporal-difference (TD) setting, we formalize this idea by decomposing the mathematical components of temporal-difference (TD) update systems into those that depend solely on transition dynamics and state features (and that may be shared by many underlying tasks the agent may face) and those that depend on the reward function. This leads to a novel norm-based replay rule for more accurately estimating such a shared component of the TD update system, thus enabling sample-efficient, low-variance value function estimation in multi-task scenarios. Our analyses show that the new reward-agnostic replay rule we introduce is most useful in settings where many value predictions need to be performed simultaneously, in domains that share the same underlying dynamics and features, whereas existing reward-aware replay techniques may be more appropriate when tackling single-reward value learning problems.
The second contribution of this thesis re-examines eligibility traces, a standard mechanism for temporal credit assignment in TD learning. Eligibility traces are intended to allow recently observed prediction errors to update predictions associated with states visited earlier in a given trajectory. We show that under nonlinear function approximation, the parameters of the value function may change as learning proceeds along the trajectory, which implies that the information accumulated in an eligibility trace may no longer be consistent with the agent's current value function. As a result, we formally show that such a mismatch may invalidate the usual and widely accepted interpretation of the mechanisms underlying techniques such as TD($\lambda$), commonly used to facilitate assigning credit to earlier states. Correcting this mismatch directly yields a novel, though computationally expensive algorithm. Through further analyses, we introduce a reformulation of the problem in terms of value functions that combine standard future-return prediction with a backward-looking prediction of expected past rewards.
The third contribution of the thesis studies TCAP in sparse-reward environments, where the agent receives limited immediate feedback about which decisions may have contributed to long-term performance. In such cases, practitioners often design heuristic auxiliary reward functions that provide denser feedback to the agent in order to facilitate credit assignment. However, such heuristic feedback signals may inadvertently induce behaviors that are misaligned with the agent's original, primary objective. We introduce Behavior Alignment via Reward Function Optimization (BARFI), a novel bilevel reward-design framework that autonomously learns how to leverage and combine user-defined auxiliary rewards with any given, arbitrary (and often sparse) reward function, while ensuring that the behaviors induced by optimizing this combined reward function remain aligned with those originally intended by the designer The proposed framework leverages implicit differentiation techniques, allowing it to dynamically adapt learned reward functions efficiently. We show that our novel reward-design framework allows heuristically designed (and potentially misaligned) auxiliary feedback signals to be used when appropriate (i.e., when they contribute to accelerate learning), while reducing their influence when they are likely to lead to undesired behavior.
Together, these contributions show that TCAP is not a single algorithmic failure mode but a recurring challenge that often manifests as high-variance value-function estimation, misaligned value-function update rules or update rules that perform credit assignment in ways different from those intended by the user, and sparse or imperfectly designed reward feedback signals. By carefully investigating such scenarios, this dissertation develops novel methods that improve credit assignment through new, principled techniques that determine which past experiences should be replayed with higher priority, how temporal-difference-based learning rules should propagate credit in challenging non-linear scenarios, and how (and whether) user-designed auxiliary rewards should be leveraged during policy learning.
Advisors:
Bruno Castra da Silva and Philip Thomas