1,721,048 research outputs found

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
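    The two counting rules compared in this abstract can be illustrated with a minimal sketch. The function name, the toy data, and the choice to treat co-authors of the same cited work as co-cited are assumptions made for illustration, not details taken from the study.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations under a chosen counting rule.

    citing_papers: one reference list per citing paper; each reference
    is an ordered list of author names.
    max_authors=1 is first-author counting; max_authors=5 credits the
    first five authors of every cited work.
    """
    counts = Counter()
    for references in citing_papers:
        # Authors credited by this citing paper under the chosen rule.
        cited = set()
        for authors in references:
            cited.update(authors[:max_authors])
        # Each unordered pair of credited authors is co-cited once.
        # Assumption: co-authors of one cited work also count as a pair.
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy data: two citing papers with two references each.
papers = [
    [["A", "B"], ["C"]],
    [["A", "D"], ["C", "E"]],
]
first_only = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

    On this toy input, first-author counting sees only the pair (A, C), while the five-author rule also registers pairs involving B, D, and E.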

    Variations on the Author

    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourse; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship

    Appropriate Similarity Measures for Author Cocitation Analysis

    We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
    Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
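    The two measures named above can be compared on toy cocitation profiles with a minimal sketch. The profiles are hypothetical, and the property shown (padding both profiles with zeros changes the Pearson correlation but not the cosine) is a general mathematical fact used here for illustration; the abstract does not say whether this is the argument the authors make.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two profile vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation, computed as the cosine of mean-centred vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical cocitation profiles of two authors over four other authors.
x = [3, 1, 0, 2]
y = [1, 3, 2, 0]

# The same profiles after adding four authors cocited with neither of them.
x_padded = x + [0, 0, 0, 0]
y_padded = y + [0, 0, 0, 0]

cos_before, cos_after = cosine(x, y), cosine(x_padded, y_padded)
r_before, r_after = pearson(x, y), pearson(x_padded, y_padded)
```

    Here the cosine is identical before and after padding, whereas the Pearson correlation moves from negative to positive.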

    Dispelling the Myths Behind First-author Citation Counts

    We conducted a full-scale evaluative citation analysis study of scholars in the XML research field to explore just how different from each other author rankings resulting from different citation counting methods actually are, and to demonstrate the capability of emerging data and tools on the Web in supporting more realistic citation counting methods. Our results contest some common arguments for the continued use of first-author citation counts in the evaluation of scholars, such as high correlations between author rankings by first-author citation counts and other citation counting methods, and high costs of using more realistic citation counting methods that are not well-supported by the ISI databases. It is argued that increasingly available digital full text research papers make it possible for citation analysis studies to go beyond what the ISI databases have directly supported and to employ more sophisticated methods

    Author Index

    Not provided

    Compilation and code generation for efficient data science

    In today's data-driven world, data science plays an important role in benefiting from big data, enabling smart decision-making, and driving innovation across industries. The fact that data science creates tangible value for businesses and organizations has resulted in an increasing demand for efficient data science tools. Python has become the main tool and language of choice for data scientists, mostly because of its simplicity and rich ecosystem of libraries such as Pandas, NumPy, and TensorFlow. However, Python's user-friendliness comes at the cost of inefficiency and a lack of scalability. This limitation stems from the interpreted nature of the language and from the way its libraries are developed. While previous efforts towards more scalable data science in Python have explored avenues such as fine-tuned low-level kernels, auto-parallelization, and compilation to other languages, their approaches missed some enhancement opportunities or showed only limited coverage of the diverse spectrum of data science workloads. In this thesis, we adopt a compilation-centric approach to address the challenges of scalable Python data science and introduce a framework comprising different compilation pipelines for the same goal. This framework aims to enhance the efficiency of data science workloads by translating them into SQL/C++. After these transformations, the workloads can be executed by a conventional query engine (RDBMS), with proven optimization and computation power, or a tailored query engine that is crafted for the given workloads. In the case of specialized query engines, we additionally propose a design for optimizing these engines that exploits batch-processing techniques to accelerate execution and improve overall performance. We showcase the efficacy of the proposed framework through comprehensive micro and end-to-end benchmarks. By making data science processing more efficient, our framework not only accelerates data analytics and decision-making but also contributes to sustainable computing practices by reducing the computing resource requirements

    Efficient structured tensor algebra by compilation and compression

    Tensor algebra is fundamental to data-intensive computational workloads across domains such as machine learning, scientific computing, and signal processing. As data complexity increases, researchers face a trade-off between the highly specialized optimizations of dense tensor algebra frameworks and the efficiency of structure-aware algorithms in sparse tensor algebra. On the one hand, extensive research has been conducted on dense tensor algebra, where computations involve tensors without explicit structure. Known memory access patterns for the computation at compile time allow compilers and high-performance engineers to heavily tune the kernels by leveraging optimizations such as parallelization, vectorization, and tiling. On the other hand, many real-world applications require computations over tensors with sparsity patterns and inherent structure. Many lines of research have been dedicated to sparse tensor algebra, to efficiently exploit the memory structure, enhance algorithmic complexity, and improve computational performance. Many real-world applications involve tensors with well-defined structures (e.g., diagonal, upper-triangular, Toeplitz-like), often known at compile time. Exploiting these structures can drastically reduce computational costs. While prior efforts have leveraged tensor structure to develop specialized and optimized kernels, they suffer from three major limitations: 1) they are restricted to a small set of predefined structures, 2) they cannot be composed or propagated through computation, and 3) they do not necessarily provide the best memory layout for a given computation. This dissertation aims to leverage the benefits of both dense and sparse worlds and create an infrastructure to bridge the gap by leveraging the structure. 
    This overcomes the three aforementioned limitations and resolves the dilemma by introducing an end-to-end pipeline that transforms structured tensor algebra expressions into specialised low-level code through 1) the Structured Tensor Unified Representation (STUR) language, which treats structure as a first-class citizen, 2) structure inference and compilation, and 3) automatic data layout compression. This dissertation presents the three major components of this pipeline. The first component, which is the backbone of the infrastructure, is STUR, a domain-specific language that represents tensor algebra computation using generalised Einstein summation (einsum). STUR treats tensor structure as a first-class citizen using a unique set and redundancy map, making structured tensor algebra computation both expressible and extensible beyond a fixed set of predefined structures. The second component, StructTensor, is a framework that automatically infers and propagates the structure of input tensors throughout the entire computation by applying a set of program reasoning rules. StructTensor uses STUR as the intermediate language to reason about sparsity and redundancy patterns. This reasoning limits the iteration space of the tensor computation to non-zero and non-repetitive values, so the computation is done more efficiently. The densely assembled tensor algebra compiler (DASTAC) is the final component of this pipeline. DASTAC is built on top of StructTensor and likewise uses STUR as an intermediate language. DASTAC automatically reorders the elements and compresses the underlying data layout of the structured tensors based on the iteration space and the structure throughout the computation, leading to better memory efficiency. Polyhedral optimizations and parallelisation are also enabled in DASTAC through ISL and MLIR.
    Extensive benchmarks and evaluation demonstrate that this pipeline achieves orders-of-magnitude speed-ups compared to sparse (e.g., TACO) and dense (e.g., TensorFlow and PyTorch) tensor algebra frameworks when compile-time structure is available. It is also shown that in many real-world applications, the system outperforms hand-tuned specialised expert code (e.g., Intel MKL) by up to 2 orders of magnitude in both single- and multi-threaded scenarios while taking up to 5x less memory. This work establishes an infrastructure that bridges the gap between dense and sparse tensor algebra by bringing the best aspects together. It significantly reduces the computational cost and memory requirements for tensor algebra computation while achieving state-of-the-art performance. This pipeline also mitigates the development overhead for implementing new kernels or composing them by providing a compilation pipeline and a code generator that produces highly optimized code
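    The idea of restricting the iteration space to the non-zero region and compressing the data layout accordingly can be illustrated by hand for one structure, an upper-triangular matrix. This is a minimal sketch only: the function names are hypothetical, and it is neither the dissertation's intermediate language nor the code its compiler generates.

```python
def pack_upper(dense):
    """Compressed layout: keep only entries with j >= i, row by row."""
    n = len(dense)
    return [dense[i][j] for i in range(n) for j in range(i, n)]


def matvec_upper_packed(packed, x):
    """Matrix-vector product that visits only the stored (j >= i) entries."""
    n = len(x)
    y = [0.0] * n
    k = 0  # position in the packed layout, advanced in storage order
    for i in range(n):
        for j in range(i, n):
            y[i] += packed[k] * x[j]
            k += 1
    return y


def matvec_dense(a, x):
    """Baseline that iterates over all n * n entries, zeros included."""
    n = len(x)
    return [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]


# Hypothetical 3 x 3 upper-triangular input.
a = [
    [1.0, 2.0, 3.0],
    [0.0, 4.0, 5.0],
    [0.0, 0.0, 6.0],
]
x = [1.0, 1.0, 1.0]

packed = pack_upper(a)            # 6 stored values instead of 9
y_packed = matvec_upper_packed(packed, x)
y_dense = matvec_dense(a, x)
```

    Both versions return the same vector, but the packed one stores and touches n(n+1)/2 values rather than n^2.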

    Towards analytics over dirty databases

    In today's data-driven world, organizations increasingly rely on large and complex datasets to drive decision-making, build predictive models, and optimize operations. From e-commerce companies leveraging customer behavior data to improve marketing strategies, to financial institutions analyzing transactions for fraud detection, the demand for efficient, scalable, and seamless data processing is critical. However, real-world data is often incomplete, inconsistent, or erroneous, which undermines the accuracy and reliability of the insights drawn from it. Traditional workflows require moving data out of database systems into external analytical tools for machine learning, data cleaning, and query answering tasks, introducing significant inefficiencies and creating bottlenecks. This thesis presents integrated solutions that enable these operations to be performed directly within the database, addressing three key problems in modern database systems: (1) inefficiency in executing machine learning tasks, (2) inability to handle missing data effectively, and (3) limitations in querying inconsistent databases. First, we address the challenge of efficiently training machine learning models over relational data. Most data resides in databases, yet current machine learning systems typically require exporting data into external tools, leading to excessive data transfer and redundant computations. To overcome these inefficiencies, we propose an in-database machine learning library implemented on PostgreSQL and DuckDB. Our approach rewrites popular machine learning algorithms to run directly within the database, leveraging the relational data structure. By training models over aggregate values computed from normalized tables, we eliminate the need for expensive joins and preprocessing, achieving 10 to 100-fold faster model training compared to state-of-the-art solutions like MADLib. 
    Additionally, our library allows multiple models to be constructed using the same set of aggregate computations, further optimizing the learning process. Second, missing data is a pervasive issue in real-world datasets, often necessitating the use of imputation techniques to fill in gaps before analysis or model training can proceed. External tools for imputation, such as those implementing the Multiple Imputation by Chained Equations (MICE) method, typically require data export and preprocessing, adding complexity to the workflow. We introduce an in-database imputation framework that integrates MICE directly into database systems, allowing it to operate over normalized data. By re-engineering the MICE algorithm to share computations across iterations and optimize for fast access to frequently used data, we significantly reduce runtime. Our solution, implemented in both PostgreSQL and DuckDB, outperforms traditional methods, providing a more efficient and scalable way to handle missing data without leaving the database environment. Third, we tackle the problem of query answering over inconsistent databases, a critical challenge for organizations that rely on accurate query results despite data inconsistencies. Conventional methods often involve data cleaning to restore consistency, but this can be impractical in real-time environments or where altering the original data is not feasible. To address this, we develop a method based on Consistent Query Answering (CQA), which allows queries to be evaluated directly over inconsistent data. We model the concept of "minimal repairs" (the smallest changes that restore consistency) as a logical formula and use model counting techniques to determine the number of possible repairs. Furthermore, by optimizing the size of the logical formula, we achieve up to a 1000-fold reduction in computational complexity.
    To efficiently compute the number of repairs supporting each query answer, we introduce two Monte Carlo approximation algorithms that leverage the compiled logical formula. These algorithms provide theoretical guarantees for approximation accuracy while maintaining practical efficiency, enabling the execution of CQA over large datasets with multiple functional dependency violations. In conclusion, this thesis presents a comprehensive set of in-database solutions designed to overcome inefficiencies in machine learning, data imputation, and query evaluation processes, particularly in the presence of incomplete or inconsistent data. By integrating these tasks directly into modern relational databases, our approach not only streamlines workflows but also significantly improves performance. Our contributions include a high-performance machine learning library, a scalable imputation technique for handling missing data, and a robust framework for consistent query answering over erroneous databases. Collectively, these innovations represent a significant advancement toward more efficient, reliable, and scalable data management solutions
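    The first contribution, training over aggregates computed from normalized tables instead of over a materialized join, can be illustrated with a minimal sketch. The tables R(key, x) and S(key, y), the function names, and the choice of SUM(x * y) as the aggregate are assumptions for illustration (sums of products of this kind are what a least-squares fit needs); this is not the thesis's library or its API.

```python
from collections import defaultdict


def sum_xy_via_join(r_rows, s_rows):
    """Baseline: enumerate the join of R and S on key, then aggregate."""
    return sum(x * y for k1, x in r_rows for k2, y in s_rows if k1 == k2)


def sum_xy_pushed_down(r_rows, s_rows):
    """Same aggregate without the join: sum each table per key, then combine."""
    sum_x = defaultdict(float)
    sum_y = defaultdict(float)
    for key, x in r_rows:
        sum_x[key] += x
    for key, y in s_rows:
        sum_y[key] += y
    # Only keys present in both tables contribute to the join result.
    return sum(sum_x[key] * sum_y[key] for key in sum_x if key in sum_y)


# Hypothetical normalized tables: R(key, x) and S(key, y).
r_rows = [(1, 2.0), (1, 3.0), (2, 4.0)]
s_rows = [(1, 10.0), (1, 20.0), (2, 5.0), (3, 7.0)]

joined_total = sum_xy_via_join(r_rows, s_rows)
pushed_total = sum_xy_pushed_down(r_rows, s_rows)
```

    Both functions return the same value, but the second one makes a single pass over each table and never enumerates the joined rows.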

    koamabayili/VECTRON-author-checklist: VECTRON author checklist

    We have done our best to complete the author checklist relating to the use of animals in the hut study. Note that the objective for the hut study was to evaluate the IRS treatment applications for residual efficacy against Anopheles mosquitoes, including the local An. coluzzii mosquito population. Cows were only used to attract mosquitoes into the huts and no tests were carried out directly on the cows. The author checklist is intended for use with studies where experiments are carried out on animals, which is why we have had such difficulty in completing this for the hut study, as many of the questions do not relate to how the cows were used