
    What Happened in CLEF... For Another While?

    2024 marks the 25th birthday for CLEF, an evaluation campaign activity which has applied the Cranfield evaluation paradigm to the testing of multilingual and multimodal information access systems in Europe. This paper provides a summary of the motivations which led to the establishment of CLEF, a description of how it has evolved over the years, and its major achievements

    A docker-based replicability study of a neural information retrieval model

    In this work, we propose a Docker image architecture for the replicability of Neural IR (NeuIR) models. We also share two self-contained Docker images to run the Neural Vector Space Model (NVSM) [22], an unsupervised NeuIR model. The first image we share (nvsm-cpu) can run on most machines and relies only on the CPU to perform the required computations. The second image we share (nvsm-GPU) relies instead on the Graphics Processing Unit (GPU) of the host machine, when available, to perform computationally intensive tasks, such as training the NVSM model. Furthermore, we discuss some insights into the engineering challenges we encountered in obtaining deterministic and consistent results from NeuIR models that rely on TensorFlow within Docker. We also provide an in-depth evaluation of the differences between the runs obtained with the shared images. The differences are due to the use within Docker of the TensorFlow and CUDA libraries, whose inherent randomness alters, under certain circumstances, the relative order of documents in rankings

    Comparing ANOVA Approaches to Detect Significantly Different IR Systems

    The ultimate goal of the evaluation is to understand when two IR systems are (significantly) different. To this end, many comparison procedures have been developed over time. However, to date, most reproducibility efforts focused just on reproducing systems and algorithms, almost fully neglecting to investigate the reproducibility of the methods we use to compare our systems. In this paper, we focus on methods based on ANalysis Of VAriance (ANOVA), which explicitly model the data in terms of different contributing effects, allowing us to obtain a more accurate estimate of significant differences. In this context, we compare statistical analysis methods based on “traditional” ANOVA (tANOVA) to those based on a bootstrapped version of ANOVA (bANOVA) and those performing multiple comparisons relying on a more conservative Family-wise Error Rate (FWER) controlling approach to those relying on a more lenient False Discovery Rate (FDR) controlling approach. Our findings highlight that, compared to the tANOVA approaches, bANOVA presents greater statistical power, at the cost of lower stability
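    The contrast between FWER-controlling and FDR-controlling multiple comparisons can be illustrated with a minimal sketch. It assumes Bonferroni as the FWER procedure and Benjamini-Hochberg as the FDR procedure; the abstract does not name the specific corrections used, and the p-values below are made up for illustration only.

```python
def bonferroni(p_values, alpha=0.05):
    """FWER control (assumed procedure): reject only if p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """FDR control (assumed procedure): step-up rule on the sorted p-values."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from pairwise system comparisons (illustrative only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]

fwer_rejections = sum(bonferroni(p_values))
fdr_rejections = sum(benjamini_hochberg(p_values))
print("Bonferroni rejections:", fwer_rejections)
print("Benjamini-Hochberg rejections:", fdr_rejections)
```

    On these made-up values the FDR procedure declares more pairs significantly different than the FWER procedure, which is the sense in which FDR control is the more lenient of the two.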

    Evaluating Differential Privacy Approaches for Query Obfuscation in Information Retrieval

    Protecting the privacy of a user while they interact with an Information Retrieval (IR) system is crucial. This becomes more challenging when the IR system is not cooperative in satisfying the user’s privacy needs. Recent advancements in Natural Language Processing (NLP) have demonstrated Differential Privacy’s (DP) effectiveness in safeguarding text privacy for tasks like spam detection and sentiment analysis, even under the assumption of a non-cooperative system. Our investigation explores if DP methods, originally designed for specific NLP tasks, can effectively obscure queries in IR. Our analyses show that using the Vickrey DP mechanism, employing the Mahalanobis norm with a privacy budget ranging from ε = 10 to 12.5, provides cutting-edge privacy protection and enhances effectiveness. Unlike previous methods, DP allows users to fine-tune their desired level of privacy by adjusting the privacy budget ε. This flexibility offers a balance between how effective the system is and how much privacy is maintained, unlike the more rigid nature of previous approaches

    Uncontextualized significance considered dangerous

    We examine the context of significance tests in offline retrieval experiments. Our Information Retrieval (IR) community is notable for its experimental rigour: the use of statistical significance testing is growing across our publications. However, we show that ignoring the context of a test risks Type I errors, leading to potential publication bias. We examine two contexts: multiple testing and the types of retrieval systems being compared. Our results show that multiple testing corrections are critical for experimental work. In addition, we find that past research on the reliability of test collections may be flawed owing to the type of systems examined. The latter result has not been shown before. Together, our results suggest substantial numbers of Type I errors in offline IR experiments. We detail a methodology to alleviate the errors