
    Recommendations for the long tail by term-query graph

    We define a new approach to the query recommendation problem. In particular, our main goal is to design a model that can generate query suggestions even for rare and previously unseen queries. In other words, we are targeting queries in the long tail. The model is based on a graph having two sets of nodes: Term nodes and Query nodes. The graph induces a Markov chain on which a generic random walker starts from a subset of Term nodes, moves along Query nodes, and restarts (with a given probability) only from the same initial subset of Term nodes. Computing the stationary distribution of such a Markov chain is equivalent to extracting the so-called Center-piece Subgraph from the graph associated with the Markov chain itself. Given a query, we extract its terms and we set the restart subset to this term set. Therefore, we do not require a query to have been previously observed for the recommending model to be able to generate suggestions.
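
The restart mechanism described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the graph here is a toy one in which each term is linked to the logged queries that contain it (the abstract does not specify the edges), so the walker alternates between term and query nodes, and the stationary distribution is approximated by plain power iteration. All function names, the `restart_prob` value, and the example log are assumptions made for illustration.

```python
from collections import defaultdict


def build_term_query_graph(queries):
    """Toy graph (assumption): link each term node to the query nodes containing it."""
    graph = defaultdict(set)
    for query in queries:
        for term in query.split():
            graph[("term", term)].add(("query", query))
            graph[("query", query)].add(("term", term))
    return graph


def random_walk_with_restart(graph, restart_nodes, restart_prob=0.15, iterations=50):
    """Approximate the stationary distribution of a walk that restarts
    only from `restart_nodes`, using power iteration."""
    restart_nodes = [n for n in restart_nodes if n in graph]
    if not restart_nodes:
        return {}  # none of the terms are known to the graph
    restart_mass = 1.0 / len(restart_nodes)
    scores = {node: 0.0 for node in graph}
    for node in restart_nodes:
        scores[node] = restart_mass
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            if mass == 0.0:
                continue
            # Spread the non-restarting share uniformly over the neighbours.
            share = (1.0 - restart_prob) * mass / len(graph[node])
            for neighbour in graph[node]:
                nxt[neighbour] += share
        # The restarting share always returns to the initial term subset.
        for node in restart_nodes:
            nxt[node] += restart_prob * restart_mass
        scores = nxt
    return scores


def suggest(graph, query, k=3):
    """Rank logged queries by their score for the terms of `query`."""
    restart = [("term", term) for term in query.split()]
    scores = random_walk_with_restart(graph, restart)
    ranked = sorted(
        ((score, node[1]) for node, score in scores.items()
         if node[0] == "query" and node[1] != query),
        reverse=True,
    )
    return [text for _, text in ranked[:k]]


log = ["cheap flights rome", "flights rome paris", "hotel paris"]
graph = build_term_query_graph(log)
print(suggest(graph, "cheap hotel rome"))  # the input query never appears in the log
```

Because the restart set is built from the terms of the input query rather than from the query itself, the sketch can score logged queries for an input it has never seen, provided at least one of its terms occurs in the log.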

    SE-PQA: Personalized Community Question Answering

    Personalization in Information Retrieval has been studied for a long time. Nevertheless, there is still a lack of high-quality, real-world datasets to conduct large-scale experiments and evaluate models for personalized search. This paper contributes to filling this gap by introducing SE-PQA (StackExchange - Personalized Question Answering), a new curated resource to design and evaluate personalized models related to the task of community Question Answering (cQA). The contributed dataset includes more than 1 million queries and 2 million answers, annotated with a rich set of features modeling the social interactions among the users of a popular cQA platform. We describe the characteristics of SE-PQA and detail the features associated with questions and answers. We also provide reproducible baseline methods for the cQA task based on the resource, including deep learning models and personalization approaches. The results of the preliminary experiments show that SE-PQA is appropriate for training effective cQA models; they also show that personalization remarkably improves the effectiveness of all the methods tested. Furthermore, we show the benefits in terms of robustness and generalization of combining data from multiple communities for personalization purposes.

    Electoral Predictions with Twitter: A Machine-Learning approach

    Several studies have shown how to approximately predict public opinion, such as in political elections, by analyzing user activities in blogging platforms and on-line social networks. The task is challenging for several reasons. Sample bias and automatic understanding of textual content are two of several nontrivial issues. In this work we study how Twitter can provide some interesting insights concerning the primary elections of an Italian political party. State-of-the-art approaches rely on indicators based on tweet and user volumes, often including sentiment analysis. We investigate how to exploit and improve those indicators in order to reduce the bias of the sample of Twitter users. We propose novel indicators and a novel content-based method. Furthermore, we study how a machine learning approach can learn correction factors for those indicators. Experimental results on Twitter data support the validity of the proposed methods and their improvement over the state of the art.

    Query Performance Prediction Using Dimension Importance Estimators

    Query Performance Prediction (QPP) tends to fall short when predicting the performance of dense Information Retrieval (IR) systems. Therefore, the research community is investigating QPP approaches designed to synergize with this class of state-of-the-art IR models. At the same time, recent advances concerning dense IR have shown that we can improve the retrieval performance by projecting embeddings in a (query-wise) optimal linear subspace of the dense representation space. The Dimension IMportance Estimation (DIME) framework was proposed to identify such optimal subspaces on a query-by-query basis. In this paper, we illustrate how to design QPP models that rely on measuring the alignment between the query and document representations and the optimal DIME dimensions, based on the hypothesis that good alignment indicates better retrieval performance. We experimentally evaluate the proposed QPPs, showing that our approach outperforms the state of the art when predicting the performance of two commonly used dense encoders, Contriever and TAS-B, on two popular TREC collections, Deep Learning 2019 and 2020.

    Evaluating Top-K Approximate Patterns via Text Clustering

    This work investigates how approximate binary patterns can be objectively evaluated by using as a proxy measure the quality achieved by a text clustering algorithm, where the document features are derived from such patterns. Specifically, we exploit approximate patterns within the well-known FIHC (Frequent Itemset-based Hierarchical Clustering) algorithm, which was originally designed to employ exact frequent itemsets to achieve a concise and informative representation of text data. We analyze different state-of-the-art algorithms for approximate pattern mining. In particular, we measure their ability to extract patterns that characterize the document topics well, in terms of the quality of the clustering obtained by FIHC. Extensive and reproducible experiments, conducted on publicly available text corpora, show that approximate itemsets provide a better representation than exact ones.

    Supervised Evaluation of Top-k Itemset Mining Algorithms

    A major mining task for binary matrices is the extraction of approximate top-k patterns that are able to concisely describe the input data. The top-k pattern discovery problem is commonly stated as an optimization one, where the goal is to minimize a given cost function, e.g., the accuracy of the data description. In this work, we review several greedy state-of-the-art algorithms, namely Asso, Hyper+, and PaNDa+, and propose a methodology to compare the patterns extracted. In evaluating the set of mined patterns, we aim at overcoming the usual assessment methodology, which only measures the given cost function to minimize. Thus, we evaluate how good the extracted models/patterns are at unveiling supervised knowledge on the data. To this end, we test algorithms and diverse cost functions on several datasets from the UCI repository. As a contribution, we show that PaNDa+ performs best in the majority of the cases, since the classifiers built over the mined patterns used as dataset features are the most accurate.
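
To make the idea of a cost function over a set of patterns concrete, here is a hedged sketch of one plausible choice: the size of the pattern set plus the number of cells where the patterns and the data disagree. The abstract does not state which cost functions were tested, so this definition, the cell-set data representation, and the function names are all assumptions for illustration.

```python
def covered_cells(patterns):
    """Cells (row, item) that a set of patterns claims to be 1.
    Each pattern is a pair (set of rows, set of items)."""
    cells = set()
    for rows, items in patterns:
        cells.update((row, item) for row in rows for item in items)
    return cells


def description_cost(data_cells, patterns):
    """Assumed cost: pattern description size plus mismatched cells (noise)."""
    noise = data_cells ^ covered_cells(patterns)  # symmetric difference
    size = sum(len(rows) + len(items) for rows, items in patterns)
    return size + len(noise)


# Binary matrix stored as the set of its 1-cells (hypothetical toy data).
data = {(0, "a"), (0, "b"), (1, "a"), (1, "b"), (2, "c")}
tight = [({0, 1}, {"a", "b"})]     # covers the dense block exactly
loose = [({0, 1, 2}, {"a", "b"})]  # also claims row 2, adding noise
print(description_cost(data, tight), description_cost(data, loose))
```

Under this assumed cost, a pattern that covers the dense block exactly is cheaper than one that over-covers it, which is the kind of trade-off a greedy top-k miner would optimize.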

    Gossip Communities: Collaborative Filtering Through Peer-to-Peer Overlays

    Gossip-based Peer-to-Peer protocols have proved to be very efficient for supporting dynamic and complex information exchange among distributed peers. They are useful for building and maintaining the network topology itself as well as for supporting a pervasive diffusion of the information injected into the network. In this paper, we propose the general architecture of a system that exploits the collaborative exchange of information between peers in order to gather similar users and spread useful suggestions among them.