Machine Learning and Friends Lunch: Dana Arad, Sparse Autoencoders for Content Control
Speaker
Dana Arad (Technion)
Abstract
Sparse autoencoders (SAEs) have recently been proposed as a method for decomposing a model’s latent space into monosemantic, interpretable features. In this talk, I will present two of our recent papers on understanding and leveraging SAEs for content control.
First, I will introduce our taxonomy of SAE features, distinguishing between those responsible for processing the input and those that affect the output. We show that this distinction plays a critical role when selecting features for inference-time interventions.
Next, I will present our new method for persistent unlearning using SAE features, which enables fine-grained control and provides insights into the suppressed features. These studies demonstrate the promise of SAEs as a tool for interpretable control, while highlighting the need for deeper understanding to unlock their full potential.
Speaker Bio
Dana Arad is a CS PhD candidate at the Technion, advised by Yonatan Belinkov. Her research aims to improve our understanding of the internal mechanisms of language and vision-language models, with a focus on information flow and factuality. Dana has interned at Amazon and eBay, and is a fellow of the Ariane de Rothschild Women Doctoral Program.