Extending weighting models with a term quality measure
Weighting models use lexical statistics, such as term frequencies, to derive term weights, which are used to estimate the relevance of a document to a query. Apart from the removal of stopwords, there is no other consideration of the quality of the words that are being ‘weighted’. It is often assumed that term frequency is a good indicator of how relevant a document is to a query. Our intuition is that raw term frequency could be enhanced to better discriminate between terms. To do so, we propose using non-lexical features to predict the ‘quality’ of words, before they are weighted for retrieval. Specifically, we show how parts of speech (e.g. nouns, verbs) can help estimate how informative a word generally is, regardless of its relevance to a query/document. Experimental results with two standard TREC collections show that integrating the proposed term quality into two established weighting models enhances retrieval performance, over a baseline that uses the original weighting models, at all times
Multinomial randomness models for retrieval with document fields
Document fields, such as the title or the headings of a document, offer a way to consider the structure of documents for retrieval. Most of the proposed approaches in the literature employ either a linear combination of scores assigned to different fields, or a linear combination of frequencies in the term frequency normalisation component. In the context of the Divergence From Randomness framework, we have a sound opportunity to integrate document fields in the probabilistic randomness model. This paper introduces novel probabilistic models for incorporating fields in the retrieval process using a multinomial randomness model and its information theoretic approximation. The evaluation results from experiments conducted with a standard TREC Web test collection show that the proposed models perform as well as a state-of-the-art field-based weighting model, while at the same time, they are theoretically founded and more extensible than current field-based models
Performance comparison of clustered and replicated information retrieval systems
The amount of information available over the Internet is increasing daily, as are the importance and scale of Web search engines. Systems based on a single centralised index present several problems (such as lack of scalability), which lead to the use of distributed information retrieval systems to effectively search for and locate the required information. A distributed retrieval system can be clustered and/or replicated. In this paper, using simulations, we present a detailed performance analysis, both in terms of throughput and response time, of a clustered system compared to a replicated system. In addition, we consider the effect of changes in the query topics over time. We show that the performance obtained for a clustered system does not improve on the performance obtained by the best replicated system. Indeed, the main advantage of a clustered system is the reduction of network traffic. However, the use of a switched network eliminates the bottleneck in the network, markedly improving the performance of the replicated systems. Moreover, we illustrate the negative performance effect of the changes over time in the query topics when a distributed clustered system is used. In contrast, the performance of a distributed replicated system is query independent
Efficient dynamic pruning with proximity support
Modern retrieval approaches do not apply only single-term weighting models when ranking documents; proximity weighting models, which assign high scores to the co-occurrence of pairs of query terms in close proximity to each other in documents, are also in common use. The adoption of these proximity weighting models can cause a computational overhead when documents are scored, negatively impacting the efficiency of the retrieval process. In this paper, we discuss the integration of proximity weighting models into efficient dynamic pruning strategies. In particular, we propose to modify document-at-a-time strategies to include proximity scoring without any modifications to pre-existing index structures. Our resulting two-stage dynamic pruning strategies only consider single query terms during first stage pruning, but can early terminate the proximity scoring of a document if it can be shown that it will never be retrieved. We empirically examine the efficiency benefits of our approach using a large Web test collection of 50 million documents and 10,000 queries from a real query log. Our results show that our proposed two-stage dynamic pruning strategies are considerably more efficient than the original strategies, particularly for queries of 3 or more terms. Copyright © 2010 for the individual papers by the papers' authors
Effect of dynamic pruning safety on learning to rank effectiveness
A dynamic pruning strategy, such as WAND, enhances retrieval efficiency without degrading effectiveness to a given rank K, known as safe-to-rank-K. However, it is also possible for WAND to obtain more efficient but unsafe retrieval without actually significantly degrading effectiveness. On the other hand, in a modern search engine setting, dynamic pruning strategies can be used to efficiently obtain the set of documents to be re-ranked by the application of a learned model in a learning to rank setting. No work has examined the impact of safeness on the effectiveness of the learned model. In this work, we investigate the impact of WAND safeness through experiments using 150 TREC Web track topics. We find that unsafe WAND is biased towards documents with lower docids, thereby impacting effectiveness
Query efficiency prediction for dynamic pruning
Dynamic pruning strategies are effective yet permit efficient retrieval by pruning, i.e. not fully scoring all postings of all documents matching a given query. However, the amount of pruning possible for a query can vary, resulting in queries with similar properties (query length, total numbers of postings) taking different amounts of time to retrieve search results. In this work, we investigate the causes of inefficient queries, identifying reasons such as the balance between informativeness of query terms, and the distribution of retrieval scores within the posting lists. Moreover, we note the advantages in being able to predict the efficiency of a query, and propose various query efficiency predictors. Using 10,000 queries and the TREC ClueWeb09 category B corpus for evaluation, we find that combining predictors using regression can accurately predict query response time
Upper-bound approximations for dynamic pruning
Dynamic pruning strategies for information retrieval systems can increase querying efficiency without decreasing effectiveness by using upper bounds to safely omit scoring documents that are unlikely to make the final retrieved set. Often, such upper bounds are pre-calculated at indexing time for a given weighting model. However, this precludes changing, adapting or training the weighting model without recalculating the upper bounds. Instead, upper bounds should be approximated at querying time from various statistics of each term to allow on-the-fly adaptation of the applied retrieval strategy. This article, by using uniform notation, formulates the problem of determining a term upper-bound given a weighting model and discusses the limitations of existing approximations. Moreover, we propose an upper-bound approximation using a constrained nonlinear maximization problem. We prove that our proposed upper-bound approximation does not impact the retrieval effectiveness of several modern weighting models from various different families. We also show the applicability of the approximation for the Markov Random Field proximity model. Finally, we empirically examine how the accuracy of the upper-bound approximation impacts the number of postings scored and the resulting efficiency in the context of several large Web test collections.
Learning to predict response times for online query scheduling
Dynamic pruning strategies permit efficient retrieval by not fully scoring all postings of the documents matching a query, without degrading the retrieval effectiveness of the top-ranked results. However, the amount of pruning achievable for a query can vary, resulting in queries taking different amounts of time to execute. Knowing in advance the execution time of queries would permit the exploitation of online algorithms to schedule queries across replicated servers in order to minimise the average query waiting and completion times. In this work, we investigate the impact of dynamic pruning strategies on query response times, and propose a framework for predicting the efficiency of a query. Within this framework, we analyse the accuracy of several query efficiency predictors across 10,000 queries submitted to in-memory inverted indices of a 50-million-document Web crawl. Our results show that combining multiple efficiency predictors with regression can accurately predict the response times of a query before it is executed. Moreover, using the efficiency predictors to facilitate online scheduling algorithms can result in a 25% reduction in the average waiting time experienced by queries before processing
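The scheduling idea in this abstract can be illustrated with a minimal sketch. The abstract does not say which online algorithm is used, so the policy below (send each arriving query to the replica expected to become free soonest) is an assumption chosen for illustration. The function name `schedule_queries` and the tuple layout are hypothetical, and predicted times stand in for actual execution times.

```python
def schedule_queries(queries, n_servers):
    """Assign queries to replicated servers using predicted execution times.

    queries: list of (arrival_time, predicted_time) tuples in arrival order.
    Returns (assignment, average_wait), where assignment[i] is the server
    chosen for query i. Illustrative policy only, not the paper's algorithm.
    """
    free_at = [0.0] * n_servers  # time at which each replica becomes idle
    assignment = []
    waits = []
    for arrival, predicted in queries:
        # Pick the replica that is expected to be free first.
        server = min(range(n_servers), key=lambda i: free_at[i])
        start = max(arrival, free_at[server])
        waits.append(start - arrival)  # time spent queued before processing
        # Assumption: the predicted time is treated as the actual run time.
        free_at[server] = start + predicted
        assignment.append(server)
    return assignment, sum(waits) / len(waits)


example = [(0.0, 4.0), (0.0, 1.0), (1.0, 1.0), (1.0, 2.0)]
assignment, average_wait = schedule_queries(example, n_servers=2)
print(assignment, average_wait)
```

Without a prediction of each query's execution time, the scheduler could not estimate when a replica becomes free, which is why accurate predictors matter for this kind of policy.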
On upper bounds for dynamic pruning
Dynamic pruning strategies enhance the efficiency of search engines, by making use of term upper bounds to decide when a document will not make the final set of k retrieved documents. After discussing different approaches for obtaining term upper bounds, we propose the use of multiple least upper bounds. Experiments are conducted on the TREC ClueWeb09 corpus, to measure the accuracy of different upper bounds. © 2011 Springer-Verlag
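As a rough illustration of how term upper bounds let a system decide that a document cannot enter the final set of k documents, the sketch below abandons a document as soon as its partial score plus the upper bounds of the terms not yet examined cannot exceed the current k-th best score. It is a simplified example, not WAND, MaxScore, or the method of this paper. The name `top_k_with_upper_bounds` and the dictionary-based postings are hypothetical, and the sketch assumes k is at least 1, every posting list is non-empty, and all term scores are non-negative.

```python
import heapq


def top_k_with_upper_bounds(postings, k):
    """Return the top-k (score, docid) pairs and the number of postings scored.

    postings: dict mapping term -> {docid: score contribution}.
    Assumes non-negative contributions, so the maximum contribution of a
    term is a valid upper bound on what that term can add to any document.
    """
    # Process terms in decreasing order of their upper bound.
    terms = sorted(postings, key=lambda t: max(postings[t].values()), reverse=True)
    upper = [max(postings[t].values()) for t in terms]
    # remaining[i] = sum of the upper bounds of terms i, i+1, ...
    remaining = [0.0] * (len(terms) + 1)
    for i in range(len(terms) - 1, -1, -1):
        remaining[i] = remaining[i + 1] + upper[i]

    heap = []  # min-heap holding the current top-k (score, docid) pairs
    scored = 0
    docids = sorted(set().union(*(p.keys() for p in postings.values())))
    for doc in docids:
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        score = 0.0
        pruned = False
        for i, term in enumerate(terms):
            # Even with every remaining term at its upper bound, the
            # document cannot beat the current k-th score: stop scoring it.
            if score + remaining[i] <= threshold:
                pruned = True
                break
            if doc in postings[term]:
                score += postings[term][doc]
                scored += 1
        if pruned:
            continue
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > threshold:
            heapq.heapreplace(heap, (score, doc))
    return sorted(heap, reverse=True), scored


index = {
    "a": {1: 1.0, 2: 0.5, 3: 2.0},
    "b": {2: 0.2, 3: 1.0, 4: 0.3},
}
results, scored = top_k_with_upper_bounds(index, k=2)
print(results, scored)
```

Tighter upper bounds make the pruning test succeed earlier, so fewer postings are scored. Real strategies also skip within compressed posting lists, which this sketch does not model.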
