Speaker:

Tanya Choudhury

Abstract:

Large language models (LLMs) learn powerful internal representations from evidence-backed data, yet the mechanisms that produce their behavior remain opaque. In this defense, I outline a program to move from post-hoc explanations toward a principled science of reverse-engineering LLM internals—so learned structure can be trusted, audited, and leveraged for discovery. I organize the talk around three steps: Axioms, Probes, and Coalitions.

First, I introduce RankSHAP, an axiomatic attribution framework for ranking systems that extends Shapley values to ordering functions by grounding explanations in ranking metrics such as NDCG. I then show how similar axiomatic thinking fixes interpretability failures in biological domain-knowledge networks, via a PNET case study for prostate cancer in which pathway-structured sparsity and literature-derived connectivity induce node-degree and annotation bias. Second, I turn inward with probing: using lightweight probes over MLP activations in LLaMA/Mistral/Pythia rerankers, I test whether classical statistical priors (tf–idf/IDF, lexical overlap) emerge as stable representations and how they shift across depth. Third, I present Hedonic Neurons, a weight-based mechanistic framework that models neurons as agents whose utilities encode synergy; with a PAC Top-Cover procedure, we identify stable neuron coalitions that act as computational subroutines and produce larger out-of-distribution ablation impacts than clustering baselines.
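
To make the first step concrete, the sketch below estimates Shapley-style feature attributions for a ranker by Monte-Carlo permutation sampling, using NDCG as the coalition value. It is an illustration under assumed interfaces (score_fn, docs, and relevance are hypothetical placeholders), not the defended RankSHAP implementation:

```python
# Minimal Monte-Carlo sketch of Shapley-style ranking attribution with
# NDCG as the coalition value. score_fn, docs, and relevance are assumed
# placeholders, not the actual RankSHAP interface.
import math
import random


def ndcg(ranking, relevance, k=10):
    """NDCG@k of a ranking (list of doc ids) against graded relevance."""
    dcg = sum(relevance[d] / math.log2(i + 2) for i, d in enumerate(ranking[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def ranking_shapley(score_fn, docs, relevance, features, n_samples=200):
    """Estimate per-feature Shapley values with NDCG as the value function."""
    phi = {f: 0.0 for f in features}
    for _ in range(n_samples):
        order = random.sample(features, len(features))  # random permutation
        active = set()
        # Value of the empty coalition: ranking with all features masked.
        baseline = sorted(docs, key=lambda d: score_fn(d, active), reverse=True)
        prev = ndcg(baseline, relevance)
        for f in order:
            active.add(f)
            # Rank documents using only the currently active features.
            ranking = sorted(docs, key=lambda d: score_fn(d, active), reverse=True)
            cur = ndcg(ranking, relevance)
            phi[f] += cur - prev  # marginal NDCG contribution of f
            prev = cur
    return {f: v / n_samples for f, v in phi.items()}
```

Grounding the value function in a ranking metric, rather than a single model output, is what lets such attributions respect the ordering structure the axioms are stated over.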
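
The probing step can be pictured with a simple layer-wise linear probe. The sketch below assumes activations have already been cached per layer; the ridge probe and synthetic stand-in data are illustrative of the method, not the talk's actual experimental setup:

```python
# Sketch of a layer-wise linear probe for a statistical prior such as IDF
# (assumed setup). Activation extraction is model-specific and elided.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split


def probe_depth_profile(activations_by_layer, targets, alpha=1.0):
    """Fit one ridge probe per layer; return held-out R^2 at each depth.

    activations_by_layer: list of (n_tokens, d_model) arrays, one per layer.
    targets: (n_tokens,) array of the statistic being probed for.
    """
    scores = []
    for acts in activations_by_layer:
        X_tr, X_te, y_tr, y_te = train_test_split(
            acts, targets, test_size=0.2, random_state=0
        )
        probe = Ridge(alpha=alpha).fit(X_tr, y_tr)
        scores.append(probe.score(X_te, y_te))  # R^2 on held-out tokens
    return scores


# Usage with synthetic stand-ins for cached LLaMA/Mistral/Pythia activations
# (scores will hover near zero here because the data is random):
rng = np.random.default_rng(0)
layers = [rng.normal(size=(500, 64)) for _ in range(8)]
idf = rng.uniform(0, 10, size=500)
print(probe_depth_profile(layers, idf))
```

Tracking the held-out R^2 across layers is what reveals whether a prior like tf–idf is decodable at all, and at which depths it stabilizes or shifts.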
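
For the third step, the sketch below shows only the ablation-impact measurement that coalition claims rest on, under an assumed PyTorch interface (model, batch, and loss_fn are placeholders); the PAC Top-Cover selection procedure itself is not reproduced here:

```python
# Sketch of a coalition-ablation check (assumed interface, not the actual
# PAC Top-Cover procedure). A candidate coalition of neurons in one MLP
# layer is zeroed via a forward hook, and the impact is the loss shift on
# an evaluation batch relative to the intact model.
import torch


def ablation_impact(model, layer_module, neuron_ids, batch, loss_fn):
    """Loss change when the coalition neuron_ids in layer_module is zeroed."""
    def zero_coalition(module, inputs, output):
        output[..., neuron_ids] = 0.0  # silence the coalition's activations
        return output

    with torch.no_grad():
        base = loss_fn(model(batch))
    handle = layer_module.register_forward_hook(zero_coalition)
    try:
        with torch.no_grad():
            ablated = loss_fn(model(batch))
    finally:
        handle.remove()
    return (ablated - base).item()
```

Comparing this impact for selected coalitions against clustering baselines of the same size, on out-of-distribution batches, is the kind of test behind the claim that the coalitions act as genuine computational subroutines.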

I close with an agenda for adapting coalition- and weight-based reverse-engineering to biomedical prediction models, to decode emergent priors and generate testable hypotheses that connect model internals to downstream validation.

Advisor:

James Allan