PhD Dissertation Proposal: Nigel Fernandez, Natural Language Processing for Scalable Educational Assessment and AI Systems
Speaker: Nigel Fernandez
Abstract:
Recent advances in natural language processing (NLP), particularly large language models (LLMs), have created new opportunities to develop scalable educational technologies that support assessment, feedback, and personalized learning. Realizing this vision requires AI methods capable of performing several core educational tasks, including generating high-quality assessments, evaluating open-ended student responses, answering course-related questions reliably, and operating efficiently at scale. Developing methods that enable these capabilities is critical for building educational AI systems that can support instructors and learners in large, diverse learning environments. This dissertation investigates how NLP methods can enable scalable educational assessment and AI systems across several key components of the educational pipeline.
First, we study scalable assessment generation. Multiple-choice questions (MCQs) are widely used in educational settings, yet constructing effective distractors—plausible but incorrect answers—is challenging and time-consuming. We introduce DiVERT (Distractor Generation with Variational Errors Represented as Text), a generative framework that models student misconceptions to produce realistic distractors for math MCQs.
Second, we address scalable automated evaluation of student responses, focusing on open-ended reading comprehension questions. We propose an approach based on in-context language model fine-tuning that uses a single shared scoring model across multiple items, enabling scalable grading while leveraging relationships among questions grounded in the same reading passage to improve performance.
Third, we investigate course logistics question answering, a practical application of educational AI systems that can help students access important course information. We introduce SyllabusQA, a dataset of 5,078 open-ended question–answer pairs derived from 63 real course syllabi across 36 majors. We benchmark several strong baselines on this task, ranging from LLM prompting to retrieval-augmented generation. To ensure reliability in educational contexts, we also introduce Fact-QA, an LLM-based evaluation metric designed to assess the factual accuracy of generated answers.
Fourth, we address the efficient deployment of large language models in educational AI systems. We propose RADAR (Reasoning-Ability and Difficulty-Aware Routing), a psychometrics-inspired routing framework that allocates queries to different LLM configurations based on estimated query difficulty and model ability, enabling improved performance-cost tradeoffs. Together, these contributions demonstrate how NLP methods can support scalable educational assessment and AI systems by enabling automated assessment generation, response evaluation, educational question answering, and efficient LLM deployment.
Advisor: Andrew Lan