Speaker

Chenyan Xiong, Carnegie Mellon University

Abstract

LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended requests across various environments. We introduce General AgentBench, a benchmark designed to evaluate agents across search, coding, reasoning, and tool use within a single framework. We systematically analyze test-time scaling behaviors under two paradigms: sequential scaling and parallel scaling. We demonstrate a counterintuitive phenomenon: neither scaling approach effectively enhances performance in practice, due to two behavioral constraints: the context ceiling and the verification gap. To enable reproducible behavioral analysis, we present DeepResearchGym, an open-source search sandbox built on large-scale public corpora, supporting over 14.4M real-world queries. Our in-the-wild study shows that agents reuse evidence effectively but tend to over-refine local queries rather than broaden their search strategies when progress stalls. Finally, we propose a post-training approach that emphasizes cultivating core reasoning behaviors prior to reinforcement learning. Together, these behavioral analyses and methods offer a systematic approach for developing more robust and adaptable general agents.

Bio

Chenyan Xiong is an Associate Professor at the Language Technologies Institute (LTI), affiliated with the Machine Learning Department (MLD) in the School of Computer Science at Carnegie Mellon University. He is also a member of the CMU Foundation and Language Model Center (FLAME). From 2018 to 2023, he worked at Microsoft Research Redmond on conversational search, dense retrieval, and large-scale pretraining, contributing both scientific advances and real-world impact across production systems serving billions of users and trillions of web pages. He received his Ph.D. from LTI, CMU, in 2018. His recent work focuses on foundation and large language models, with particular emphasis on improving the speed–quality trade-offs in pretraining, exploring new scaling frontiers, and enabling new capabilities for next-generation GenAI applications.

About

The CIIR Talk Series is an initiative for researchers and practitioners working on information retrieval and related disciplines to present their work.

Subscribe to the Zoom link/passcode notification mailing list by sending an email to ciir-talks-request [at] cs [dot] umass [dot] edu with "subscribe" as the email subject (without the quotation marks), or reach out to Hamed Zamani at zamani [at] cs [dot] umass [dot] edu for the passcode.

Hybrid event posted in CIIR Talk Series for Faculty, Staff, and Current students