Reasoning about User Feedback under Identity Uncertainty in Knowledge Base Construction

Thursday, 03/14/2019 9:00am to 11:00am
Computer Science Building, Room 150/151
PhD Dissertation Proposal Defense
Speaker: Ari Kobren

Intelligent, automated systems that are intertwined with everyday life, such as Google Search and virtual assistants like Amazon's Alexa or Apple's Siri, are often powered in part by knowledge bases (KBs), i.e., structured data repositories of entities, their attributes, and the relationships among them. Despite a wealth of research on automated KB construction methods, KBs are inevitably imperfect, with errors introduced at various points in the construction pipeline. Compounding the problem, new data is created daily and must often be integrated with existing KBs so that they remain up to date.

As the primary consumers of KBs and of the applications built upon them, human users have tremendous potential to aid in KB construction by contributing feedback that identifies spurious and missing entity attributes and relations. However, integrating user feedback with existing KBs is complicated by the need to resolve identity uncertainty, i.e., to determine which real-world entity each piece of raw data refers to. This determination is typically performed via entity resolution, which clusters raw data by real-world entity. Because the clustering procedure may produce errors, resolution decisions must be reconsidered throughout KB construction. This makes it difficult to allocate user feedback appropriately when previously edited KB entities are merged or split.

In this thesis, we present a continuous reasoning framework capable of integrating user feedback and new data with existing KBs in the presence of identity uncertainty. Our approach is based on the notion that user feedback should participate alongside raw evidence throughout the KB construction process. To begin, we introduce the Grafting and Rotation-based INCremental Hierarchical Clustering algorithm (GRINCH). GRINCH builds a hierarchical clustering over the raw KB data, one point at a time, and includes both local and global rearrangement subroutines. We prove that GRINCH is guaranteed to construct pure clusterings when our new notion of model-based separation is satisfied, and demonstrate that GRINCH outperforms other state-of-the-art algorithms in entity resolution as well as in more general clustering problems.

Next, we show how to integrate user feedback with KBs by (1) representing the feedback as raw KB data, and (2) clustering it together with the other raw data using GRINCH. Our experiments reveal that clustering user feedback and raw data jointly improves integration accuracy over strategies that use the feedback to directly modify KB entities. We also propose an extended representation that explicitly distinguishes between contextual aspects of user feedback and aspects that correct mistaken attributes and relationships or introduce missing ones. Using this extended representation enables KBs to recover from errors in the raw data as well as from mistakes made during entity resolution. Empirically, we find that the extended representation leads to further improvements in overall integration accuracy.
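
The idea of treating feedback as one more piece of raw data can be illustrated with a small sketch. This is not GRINCH: it substitutes a simple threshold-based single-link clustering over Jaccard similarity, and every name in it (the records, the attribute strings, the `cluster` and `entity_attributes` helpers, the 0.4 threshold) is hypothetical and chosen only for illustration. It shows a feedback record being clustered jointly with mentions, so that the attribute it contributes attaches to whichever entity the clustering assigns it to, rather than being written directly onto a fixed KB entity.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two attribute sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(records, threshold=0.4):
    """Single-link clustering (illustrative stand-in, not GRINCH).

    Any two records whose similarity meets the threshold are linked,
    and connected records form one cluster (tracked with union-find).
    """
    parent = {name: name for name in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if jaccard(records[a], records[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in records:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(g) for g in groups.values())


def entity_attributes(members, records):
    """An entity's attributes are the union over its clustered records."""
    attrs = set()
    for name in members:
        attrs |= records[name]
    return attrs


# Hypothetical raw mentions (m*) and one piece of user feedback (f1).
# The feedback carries context (name, coauthor) plus a new attribute.
records = {
    "m1": {"name:j_doe", "coauthor:r_roe", "venue:kdd"},
    "m2": {"name:j_doe", "coauthor:r_roe", "venue:icml"},
    "m3": {"name:j_doe", "coauthor:p_poe", "venue:chi"},
    "f1": {"name:j_doe", "coauthor:r_roe", "affiliation:example_univ"},
}

clusters = cluster(records)
first_entity = entity_attributes(clusters[0], records)
print(clusters)
print(sorted(first_entity))
```

Because the feedback is an ordinary record, re-running the clustering after new data arrives (or after a merge or split) reassigns it along with the mentions; nothing about the feedback is tied to a particular entity identifier.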

Advisor: Andrew McCallum