Exploiting Social Media Sources for Search, Fusion and Evaluation

Tuesday, 08/18/2015 3:00pm to 5:00pm
Computer Science Building, Room 151
Ph.D. Seminar
Speaker: Chia-Jung Lee

The web contains heterogeneous information that is generated with different characteristics and presented via different media. Social media, one of the largest content carriers, gathers information from millions of users worldwide, who rapidly create material in many forms, such as comments, images, tags, videos, and ratings. In social applications, the formation of online communities contributes conversations covering substantially broader aspects, as well as unfiltered opinions about subjects that are rarely covered in public media. Information accrued on social platforms therefore presents a unique opportunity to augment web sources such as Wikipedia or news pages, which are usually characterized as more formal. The goal of this dissertation is to investigate in depth how social data can be exploited and applied in the context of three fundamental information retrieval (IR) tasks: search, fusion, and evaluation.

Improving search performance has consistently been a major focus in the IR community. Given the in-depth discussions and active interactions contained in social media, we present approaches that incorporate this type of data to improve search on general web corpora. In particular, we propose two graph-based frameworks, social anchor and information network, that associate related web and social content so that information sources of diverse characteristics can complement each other in a unified manner. The experimental results show that augmenting document representations with social signals can significantly outperform a wide range of baselines, including pseudo-relevance feedback on social collections.

Presenting social media content to users is particularly valuable for queries about time-sensitive events or community opinions. Current major search engines commonly blend results from different search services (or verticals) into core web results. Motivated by this real-world need, we explore ways to merge results from different web and social services into a single ranked list. We present an optimization framework for fusion, where the impact of documents, ranked lists, and verticals can be modeled simultaneously to maximize performance. Our results demonstrate that modeling all types of impact together achieves the best effectiveness.

In IR, evaluating search system performance has largely relied on creating reusable test collections. Because traditional evaluation can require substantial manual effort, we explore an approach to automating the collection of pairs of queries and relevance judgments, using a high-quality form of social media, Community Question Answering (CQA). Our approach is based on the idea that CQA services provide platforms for users to raise questions and share answers, thereby encoding the associations between real user information needs and real user assessments. We conduct experiments to study and verify the reliability of these new CQA-based evaluation test sets.

Advisor: W. Bruce Croft