NLP Seminar
Content
Tokenization is the hidden interface between human text and language models: we type characters, but the model internally operates on tokens, and that mismatch can introduce surprising quirks, boundary sensitivities, and brittle behavior. This talk demystifies what tokenizers do and argues that the field still lacks a clear formal account of what a tokenizer is—and therefore how to reason about when tokenization helps or hurts.
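The boundary sensitivity mentioned above can be made concrete with a toy sketch (this is an illustrative assumption, not any real model's tokenizer): under a greedy longest-match scheme, tokenizing a concatenation of two strings need not equal concatenating their tokenizations, so text that straddles a token boundary can look entirely different to the model.

```python
# Toy greedy longest-match tokenizer (illustrative vocabulary, not a real one).
VOCAB = ["ab", "a", "b"]

def greedy_tokenize(s):
    """Repeatedly consume the longest vocabulary token matching a prefix of s."""
    tokens = []
    while s:
        for tok in sorted(VOCAB, key=len, reverse=True):
            if s.startswith(tok):
                tokens.append(tok)
                s = s[len(tok):]
                break
        else:
            raise ValueError(f"cannot tokenize remainder: {s!r}")
    return tokens

# Tokenization is not compositional across a boundary:
# greedy_tokenize("a") + greedy_tokenize("b") -> ["a", "b"]
# greedy_tokenize("a" + "b")                  -> ["ab"]
```

The same character string thus receives different token representations depending on how it was assembled, which is one source of the brittle prompt-boundary behavior the talk discusses.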
I’ll present a simple formal framework for tokenization as mappings between character strings and token sequences, then show how viewing tokenization as a latent variable leads to a principled probabilistic treatment that cleanly relates the two. The result is a practical toolkit for scoring, constraining, and generating text directly in character space while retaining the efficiency of token-based models—so users never have to think about tokens, and decoding becomes cleaner and more robust.
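The latent-variable view above can be sketched in miniature (a hedged illustration with a made-up unigram token model, not the speaker's actual framework): the probability of a character string is obtained by summing the token model's probability over every token sequence that decodes to that string.

```python
# Toy vocabulary with made-up unigram token probabilities (illustrative only).
VOCAB = {"a": 0.3, "b": 0.2, "ab": 0.4, "ba": 0.1}

def tokenizations(s):
    """Enumerate every segmentation of s into vocabulary tokens."""
    if not s:
        yield []
        return
    for tok in VOCAB:
        if s.startswith(tok):
            for rest in tokenizations(s[len(tok):]):
                yield [tok] + rest

def seq_prob(tokens):
    """Probability of one token sequence under the toy unigram model."""
    p = 1.0
    for t in tokens:
        p *= VOCAB[t]
    return p

def char_prob(s):
    """Marginalize the latent tokenization: sum P(t) over all t decoding to s."""
    return sum(seq_prob(t) for t in tokenizations(s))

# "ab" has two tokenizations, ["ab"] and ["a", "b"]:
# char_prob("ab") = 0.4 + 0.3 * 0.2 = 0.46
```

Scoring in character space this way never asks the user to pick a tokenization; the brute-force enumeration here is exponential, and making such marginalization efficient for real token-based models is exactly the kind of machinery the abstract alludes to.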