PhD Thesis Defense: Shib Dasgupta, Box Embeddings as Set-theoretic Representations for Information Retrieval and Recommender Systems
Speaker
Shib Dasgupta
Abstract
Sets are fundamental to human knowledge representation and reasoning. Many queries in information retrieval and recommender systems are inherently set-theoretic, involving conjunctions, disjunctions, and negations. However, learning differentiable representations of concepts as sets is challenging, as it requires maintaining set-theoretic consistencies—such as closure under intersection. Region-based embeddings, particularly Box Embeddings, offer a natural solution by acting as trainable Venn diagrams whose volumes capture joint probabilities between concepts.
Despite their intuitive appeal, Box Embeddings present challenges in training, particularly due to flat loss surfaces. To address this, I develop a Gumbel random process-based approach that improves the optimization landscape, resulting in a more stable and expressive variant of Box Embeddings that has since seen broad adoption. Building on this foundation, this thesis demonstrates the effectiveness of box embeddings in capturing complex set-theoretic semantic operations across a range of NLP, information retrieval, and recommender system tasks. In all of these set-based contexts, I show both theoretically and empirically that set-based similarity from box embeddings offers a superior alternative to commonly used vector-based similarity/distance measures like dot product or $l_p$ distance.
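To make the flat-loss-surface problem concrete, here is a minimal pure-Python sketch (not the thesis implementation): when two boxes are disjoint, the hard intersection volume is exactly zero everywhere, so its gradient carries no training signal, whereas a softplus-smoothed volume, of the kind that arises from taking expectations over Gumbel-perturbed box corners, stays strictly positive. The function names and the scalar `beta` temperature are illustrative choices, not the model's actual API.

```python
import math

def softplus(x, beta=1.0):
    # numerically stable softplus, a smooth approximation to max(x, 0)
    return math.log1p(math.exp(-abs(beta * x))) / beta + max(x, 0.0)

def intersection(a_min, a_max, b_min, b_max):
    # the intersection of two axis-aligned boxes is itself a box
    lo = [max(x, y) for x, y in zip(a_min, b_min)]
    hi = [min(x, y) for x, y in zip(a_max, b_max)]
    return lo, hi

def hard_volume(box_min, box_max):
    # product of clamped side lengths: identically zero once boxes are
    # disjoint, which is the flat region of the loss surface
    v = 1.0
    for lo, hi in zip(box_min, box_max):
        v *= max(hi - lo, 0.0)
    return v

def gumbel_volume(box_min, box_max, beta=1.0):
    # softplus-smoothed side lengths keep the volume strictly positive,
    # so gradients never vanish even for disjoint boxes
    v = 1.0
    for lo, hi in zip(box_min, box_max):
        v *= softplus(hi - lo, beta)
    return v

# Two disjoint 1-D boxes: hard volume is exactly 0 (no gradient signal),
# while the smoothed volume is small but positive.
lo, hi = intersection([0.0], [1.0], [2.0], [3.0])
assert hard_volume(lo, hi) == 0.0
assert gumbel_volume(lo, hi) > 0.0
```

In the actual Gumbel-box model the min and max corner coordinates are themselves treated as Gumbel random variables; the softplus above is only the simplest one-dimensional shadow of that construction.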
As an initial proof of concept for this thesis, I train box embeddings for word representation using the core distributional principle behind Word2Vec: that similar words appear in similar contexts. The resulting model, Word2Box, replaces the dot-product similarity used in Word2Vec with the volume of intersection between box embeddings, naturally encoding set-theoretic relationships. Despite being trained on the same co-occurrence signal, Word2Box demonstrates a deeper semantic understanding: for example, it infers that "tongue" ∩ "body" is similar to "mouth", while "tongue" ∩ "language" is similar to "dialect". Word2Box also supports set-theoretic queries involving union and complement, enabling richer and more interpretable reasoning over word meanings.
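The key structural property behind queries like "tongue" ∩ "body" is that the intersection of two boxes is itself a box, so a composed query can be scored against the vocabulary exactly like a single word. The toy sketch below uses hand-placed 2-D coordinates chosen only to illustrate this closure property; real Word2Box boxes are learned from co-occurrence data and live in much higher dimensions.

```python
# Toy, hand-placed 2-D boxes standing in for trained Word2Box embeddings.
# Each box is a (min corner, max corner) pair.

def box_volume(lo, hi):
    v = 1.0
    for a, b in zip(lo, hi):
        v *= max(b - a, 0.0)
    return v

def box_intersection(a, b):
    lo = [max(x, y) for x, y in zip(a[0], b[0])]
    hi = [min(x, y) for x, y in zip(a[1], b[1])]
    return lo, hi

def similarity(a, b):
    # Word2Box scores a pair of boxes by their intersection volume,
    # in place of Word2Vec's dot product
    return box_volume(*box_intersection(a, b))

# hypothetical coordinates chosen so the illustrated set relations hold
tongue = ([0.0, 0.0], [4.0, 4.0])
body   = ([0.0, 2.0], [6.0, 6.0])
mouth  = ([1.0, 2.5], [3.0, 3.5])

# "tongue ∩ body" is itself a box, so it can be queried against the
# vocabulary like any single word
tongue_and_body = box_intersection(tongue, body)
assert similarity(tongue_and_body, mouth) > 0.0
```

Closure under intersection is exactly the set-theoretic consistency that vector embeddings lack: the dot product of two word vectors is a scalar, not another queryable object.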
Subsequently, I transition to real-world information retrieval and recommendation systems, where user queries often encode set-theoretic operations, either explicitly or implicitly. Accurately modeling such queries
requires representations that can handle compositions like intersection, union, and negation—capabilities that traditional vector-based methods typically lack.
For example, in platforms such as Netflix or Spotify, users frequently express preferences with attribute/tag filters, issuing explicitly set-theoretic queries such as "Jazz but not Smooth Jazz" or "Animation ∩ Monsters". Traditional dense representations, such as low-rank matrix factorization, often struggle with these types of compositional queries, especially in the presence of data sparsity. To address this, I frame
personalized item search as matrix completion with set-theoretic constraints, where both users and attributes are represented as hyper-rectangles (boxes). This enables the system to reason over complex, structured preferences in a principled and interpretable way.
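As a hedged illustration of how box-shaped users and attributes support such queries, the sketch below scores items for "Jazz but not Smooth Jazz" with one plausible rule: intersect the user box with the item and every required attribute box, then discount any overlap with excluded attributes. The scoring rule, box coordinates, and names here are hypothetical, not the thesis's exact objective.

```python
def box_volume(lo, hi):
    v = 1.0
    for a, b in zip(lo, hi):
        v *= max(b - a, 0.0)
    return v

def box_intersection(a, b):
    lo = [max(x, y) for x, y in zip(a[0], b[0])]
    hi = [min(x, y) for x, y in zip(a[1], b[1])]
    return lo, hi

def score(user, item, must=(), must_not=()):
    # hypothetical scoring rule: start from the user-item overlap,
    # intersect with each required attribute box, then subtract the
    # volume overlapping any excluded attribute
    region = box_intersection(user, item)
    for tag in must:
        region = box_intersection(region, tag)
    s = box_volume(*region)
    for tag in must_not:
        s -= box_volume(*box_intersection(region, tag))
    return s

# toy 1-D boxes for "Jazz but not Smooth Jazz"
user       = ([0.0], [10.0])
jazz       = ([2.0], [8.0])
smooth     = ([6.0], [8.0])   # Smooth Jazz box contained in Jazz box
pure_jazz  = ([2.0], [5.0])   # item inside Jazz, outside Smooth Jazz
smooth_hit = ([6.0], [7.5])   # item inside Smooth Jazz

assert score(user, pure_jazz, must=[jazz], must_not=[smooth]) > \
       score(user, smooth_hit, must=[jazz], must_not=[smooth])
```

Because containment of one box in another is geometrically explicit (Smooth Jazz sits inside Jazz), negations compose with conjunctions without any special-case machinery, which is what makes the matrix-completion framing interpretable.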
Similarly, in web and product search, queries often encode implicit set-theoretic structure, such as "multivitamin with both fish oil and biotin but not iron" or "Reptiles in India that are also found in East Africa." To better handle such queries, I augment pre-trained transformer-based LLM encoders with a box embedding final layer. This hybrid model combines the semantic richness of transformer representations with the explicit set-theoretic reasoning of box embeddings. Empirical results show that this layered architecture significantly improves retrieval accuracy and better captures compositional user intent.
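The "box embedding final layer" idea can be sketched in a few lines: the encoder's output vector is split into a center and an offset, and a softplus keeps every side length positive so the head always emits a valid box. The split-in-half parameterization and function names below are illustrative assumptions, not the thesis's exact architecture, and any real transformer encoder could sit upstream of `vector_to_box`.

```python
import math

def softplus(x):
    # numerically stable softplus, used to keep side lengths positive
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def vector_to_box(v):
    # hypothetical head on top of a text encoder: the first half of the
    # output vector is the box center, the second half parameterizes
    # the (strictly positive) half-side lengths
    d = len(v) // 2
    center, raw = v[:d], v[d:]
    lo = [c - softplus(r) for c, r in zip(center, raw)]
    hi = [c + softplus(r) for c, r in zip(center, raw)]
    return lo, hi

# a 4-d "encoder output" becomes a 2-d box with positive volume,
# even when the raw offsets are negative
lo, hi = vector_to_box([0.5, -1.0, 0.2, 0.2])
assert all(h > l for l, h in zip(lo, hi))
```

Once queries and documents are mapped to boxes this way, the intersection-volume machinery from the earlier paragraphs applies unchanged, which is how the hybrid model gets explicit set-theoretic composition on top of transformer semantics.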
Advisor
Andrew McCallum