PhD Thesis Defense: Yixiao Song, Advancing AI Factuality via Comprehensive Evaluation
Content
The rapid advancement of AI models in natural language generation often outpaces the development of reliable evaluation metrics, making it difficult to capture nuances in performance across tasks such as long-form question answering (LFQA), machine translation, and instruction following. This thesis develops scalable, accurate tools and benchmarks for factuality assessment that set higher standards for AI evaluation.
We begin by identifying key limitations in current evaluation practices through a focused analysis of long-form question answering. We first collect expert annotations on LFQA answers across seven domains (e.g., history, economics, biology). We carefully design the annotation setup and instructions, and conduct a rigorous screening process to recruit qualified domain experts. We then compare these expert annotations to crowd-sourced judgments and automatic metrics. We find that both crowd workers and existing metrics often fail to detect factual errors that experts reliably identify. On the human evaluation side, experts offer more accurate assessments of factuality and completeness, while crowd workers tend to favor superficial qualities like conciseness. On the automatic metric side, while no automatic metric consistently aligns with human judgments of overall quality, some show strengths in evaluating specific dimensions such as coherence.
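To make this comparison concrete, the sketch below shows one standard way to quantify how well crowd judgments and an automatic metric track expert ratings, using Spearman rank correlation; the scores and variable names are hypothetical placeholders rather than the thesis's actual annotation data.

```python
# Minimal sketch of the metric-vs-human comparison described above.
# The data below are hypothetical placeholders; in the study, judgments come from
# expert and crowd annotators and from automatic metrics on the same LFQA answers.
from scipy.stats import spearmanr

expert_ratings = [4, 2, 5, 3, 1, 4, 5, 2]   # expert overall-quality judgments
crowd_ratings  = [5, 4, 5, 4, 3, 5, 4, 4]   # crowd judgments of the same answers
metric_scores  = [0.81, 0.74, 0.90, 0.77, 0.69, 0.83, 0.88, 0.72]  # an automatic metric

# Spearman correlation measures how well each signal preserves the experts' ranking.
rho_crowd, p_crowd = spearmanr(expert_ratings, crowd_ratings)
rho_metric, p_metric = spearmanr(expert_ratings, metric_scores)

print(f"crowd vs. expert:  rho={rho_crowd:.2f} (p={p_crowd:.3f})")
print(f"metric vs. expert: rho={rho_metric:.2f} (p={p_metric:.3f})")
```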
Building on these insights, we introduce VeriScore, a general-purpose factuality metric designed for long-form model generations. VeriScore combines enhanced claim extraction, web-based evidence retrieval, and verification judgments to provide accurate and efficient assessments of factual consistency. Our results show that VeriScore effectively distinguishes between verifiable and unverifiable claims and is preferred by human annotators in 93% of cases over the popular alternatives FActScore (Min et al., 2023) and SAFE (Wei et al., 2024). Additionally, it produces verification judgments that closely align with GPT-4 outputs, as confirmed through human evaluation.
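To illustrate the extract-retrieve-verify recipe at a high level, the sketch below outlines a VeriScore-style scoring loop; the `llm` and `search` callables, the prompts, and the final scoring rule are simplifying assumptions, not the released VeriScore implementation.

```python
# Minimal sketch of a VeriScore-style factuality pipeline (extract -> retrieve -> verify).
# All helper names, prompts, and the llm/search clients are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Verdict:
    claim: str
    supported: bool

def extract_claims(response: str, llm) -> list[str]:
    """Ask an LLM to list only the verifiable factual claims in `response`."""
    prompt = f"List each verifiable factual claim in the text, one per line:\n\n{response}"
    return [c.strip() for c in llm(prompt).splitlines() if c.strip()]

def verify_claim(claim: str, llm, search) -> Verdict:
    """Retrieve web evidence for a claim and ask an LLM whether it is supported."""
    evidence = "\n".join(search(claim, top_k=5))          # web snippets as evidence
    prompt = (f"Claim: {claim}\nEvidence:\n{evidence}\n"
              "Answer 'supported' or 'unsupported'.")
    return Verdict(claim, llm(prompt).strip().lower().startswith("supported"))

def veriscore(response: str, llm, search) -> float:
    """Fraction of extracted claims supported by the evidence (0 if none extracted)."""
    verdicts = [verify_claim(c, llm, search) for c in extract_claims(response, llm)]
    return sum(v.supported for v in verdicts) / len(verdicts) if verdicts else 0.0
```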
While Veriscore focuses on evaluating static model outputs, a natural next step is to assess factuality in more interactive settings. With the advent of versatile web agents, an important question arises: can these agents reliably retrieve factual information? To investigate this, we introduce BearCubs, a new benchmark designed to evaluate AI agents' ability to identify factual information in open-ended, real-world, and multimodal settings, such as interacting with live web content and navigating complex visual tasks.
BearCubs uncovers a significant performance gap between humans, who achieve 85% accuracy, and earlier state-of-the-art agents, which achieve only 24%. The ChatGPT Agent released in July 2025 marks substantial progress, reaching 66% accuracy, yet still falling short of human performance. These results highlight the need for new directions in agent evaluation, including improving the interpretability of agent trajectories and enhancing assessments of source credibility.
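As a rough illustration of how accuracy on such a benchmark can be computed, the sketch below scores agents' short answers against gold answers with lenient string normalization; the normalization rule and example data are assumptions for illustration, not BearCubs' official grading procedure.

```python
# Minimal sketch of benchmark-style accuracy scoring over short factual answers.
# The matching rule and example questions are assumptions, not the official protocol.
import string

def normalize(answer: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for lenient matching."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())

def accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of questions whose predicted answer matches the gold answer."""
    correct = sum(normalize(predictions.get(qid, "")) == normalize(ans)
                  for qid, ans in gold.items())
    return correct / len(gold)

# Hypothetical example
gold = {"q1": "Mount Kilimanjaro", "q2": "1969"}
preds = {"q1": "mount kilimanjaro.", "q2": "1968"}
print(f"accuracy = {accuracy(preds, gold):.0%}")   # 50%
```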
As AI models grow more capable, VeriScore and BearCubs are becoming less discriminative: the former is most effective on Wikipedia-style claims, and the latter requires only linear sequences of web operations. To keep pace with AI progress, more challenging benchmarks are essential. As part of this thesis, we therefore develop a fact verification dataset that moves beyond the Wikipedia-centric scope of prior efforts, using history and politics as a case study. Its claims are derived from historical and political non-fiction books and require multi-step retrieval and synthesis of dispersed evidence. State-of-the-art systems, including OpenAI Deep Research (63.89% accuracy on three-way classification), fall short of reliable fact-checking under these conditions, highlighting substantial gaps in current factuality capabilities. The dataset is scalable and serves both as a benchmark for complex, non-Wikipedia factuality and as a resource for training more capable fact-checkers.
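To show what evaluation on such a dataset looks like, the sketch below scores a fact-checking system on three-way claim verification; the label names, example claims, and baseline predictor are illustrative assumptions rather than the actual dataset or systems evaluated in the thesis.

```python
# Minimal sketch of scoring a fact-checking system on three-way claim verification
# (here assumed to be supported / refuted / not enough info). The label set, example
# records, and the trivial baseline are illustrative, not the thesis dataset or models.

def three_way_accuracy(examples: list[dict], predict) -> float:
    """`predict(claim)` returns a label string; compare it against the gold label."""
    hits = sum(predict(ex["claim"]) == ex["label"] for ex in examples)
    return hits / len(examples)

# Hypothetical claims of the kind drawn from non-fiction books on history and politics
examples = [
    {"claim": "The treaty was signed before the 1848 revolutions.", "label": "refuted"},
    {"claim": "The senator served three consecutive terms.", "label": "supported"},
    {"claim": "The memoir sold a million copies in its first year.", "label": "not enough info"},
]

def always_supported(claim: str) -> str:
    """Trivial baseline standing in for a retrieval-augmented fact-checking system."""
    return "supported"

print(f"three-way accuracy = {three_way_accuracy(examples, always_supported):.2%}")
```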
Finally, this thesis concludes by proposing directions for advancing factuality evaluation, including the development of scalable benchmarks, improved alignment between human and automatic judgments, and training strategies for models that can reliably operate in high-stakes, open-domain settings.