Computer Science Building, Room 151
The design of an Information Retrieval (IR) system involves selecting between alternatives in the ways queries are represented (query representations) and the algorithms used to score documents (ranking algorithms). For example, a user specified query can be represented without any modification, or expanded automatically to include more words, or reduced by removing unnecessary terms. Because no single alternative performs the best for all queries, selecting the best representation or ranking algorithm for each query can yield better effectiveness over a fixed choice for all queries.
In this thesis, we develop an approach for query-dependent selection of retrieval alternatives. We treat query-dependent selection as a general problem of selecting between the result sets that are retrieved using each alternative. The main research issue lies in estimating the quality (or effectiveness) of these result sets. To this end, we aggregate measures of the relevance of individual documents as estimates for the effectiveness of the result set as a whole. Using these aggregate features, we then target the prediction of the relative effectiveness of each alternative with respect to a baseline.
We apply this relative effectiveness estimation technique to select between different reduced versions of long queries and to select between different ways of combining multiple ranking algorithms. For both applications, we show that predicting the relative effectiveness with respect to a baseline alternative is more effective than independent estimation of absolute effectiveness. We also explore query-dependent selection with a specific efficiency constraint in a black-box meta-search scenario, where always using all search engines can be expensive. We develop easy-to-compute features based on the results page alone to predict when querying an alternate search engine can be useful. While a simple classification using these features fails, we show that learning thresholds on the class probabilities, and using automatically generated surrogate data can improve selection performance.
Finally, we conclude with an analysis of the results to discuss when query-dependent selection can be useful.
Advisor: James Allan