NLP Seminar
Content
Tokenization is the hidden interface between human text and language models: we type characters, but the model internally operates on tokens, and that mismatch can introduce surprising quirks, boundary sensitivities, and brittle behavior. This talk demystifies what tokenizers do and argues that the field still lacks a clear formal account of what a tokenizer is—and therefore how to reason about when tokenization helps or hurts.
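The boundary sensitivity mentioned above can be made concrete with a toy sketch (this is an illustrative assumption, not any real model's tokenizer): under a greedy longest-match scheme, tokenizing a concatenation of two strings need not equal concatenating their tokenizations, so text that straddles a token boundary can look entirely different to the model.

```python
# Toy greedy longest-match tokenizer (illustrative vocabulary, not a real one).
VOCAB = ["ab", "a", "b"]

def greedy_tokenize(s):
    """Repeatedly consume the longest vocabulary token matching a prefix of s."""
    tokens = []
    while s:
        for tok in sorted(VOCAB, key=len, reverse=True):
            if s.startswith(tok):
                tokens.append(tok)
                s = s[len(tok):]
                break
        else:
            raise ValueError(f"cannot tokenize remainder: {s!r}")
    return tokens

# Tokenization is not compositional across a boundary:
# greedy_tokenize("a") + greedy_tokenize("b") -> ["a", "b"]
# greedy_tokenize("a" + "b")                  -> ["ab"]
```

The same character string thus receives different token representations depending on how it was assembled, which is one source of the brittle prompt-boundary behavior the talk discusses.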
I’ll present a simple formal framework for tokenization as mappings between character strings and token sequences, then show how viewing tokenization as a latent variable leads to a principled probabilistic treatment that cleanly relates the two. The result is a practical toolkit for scoring, constraining, and generating text directly in character space while retaining the efficiency of token-based models—so users never have to think about tokens, and decoding becomes cleaner and more robust.
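The latent-variable view above can be sketched in miniature (a hedged illustration with a made-up unigram token model, not the speaker's actual framework): the probability of a character string is obtained by summing the token model's probability over every token sequence that decodes to that string.

```python
# Toy vocabulary with made-up unigram token probabilities (illustrative only).
VOCAB = {"a": 0.3, "b": 0.2, "ab": 0.4, "ba": 0.1}

def tokenizations(s):
    """Enumerate every segmentation of s into vocabulary tokens."""
    if not s:
        yield []
        return
    for tok in VOCAB:
        if s.startswith(tok):
            for rest in tokenizations(s[len(tok):]):
                yield [tok] + rest

def seq_prob(tokens):
    """Probability of one token sequence under the toy unigram model."""
    p = 1.0
    for t in tokens:
        p *= VOCAB[t]
    return p

def char_prob(s):
    """Marginalize the latent tokenization: sum P(t) over all t decoding to s."""
    return sum(seq_prob(t) for t in tokenizations(s))

# "ab" has two tokenizations, ["ab"] and ["a", "b"]:
# char_prob("ab") = 0.4 + 0.3 * 0.2 = 0.46
```

Scoring in character space this way never asks the user to pick a tokenization; the brute-force enumeration here is exponential, and making such marginalization efficient for real token-based models is exactly the kind of machinery the abstract alludes to.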