
    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first author of a cited work, and a non-traditional type, which takes into account the first five authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
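The two counting schemes compared in this abstract can be sketched as follows. This is a minimal illustration, not the study's procedure: the function name `cocitation_counts`, the `max_authors` parameter, and the sample data are hypothetical, and the sketch assumes that every pair of distinct authors credited within one citing paper counts as co-cited once, including co-authors of the same cited work.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations (hypothetical helper).

    citing_papers: one reference list per citing paper; each reference
    is an ordered list of author names.
    max_authors=1 gives first-author counting; max_authors=5 credits the
    first five authors of each cited work.
    """
    counts = Counter()
    for references in citing_papers:
        # Authors credited by this citing paper under the chosen scheme.
        cited = set()
        for authors in references:
            cited.update(authors[:max_authors])
        # Assumption: each unordered pair is co-cited once per citing paper.
        for a, b in combinations(sorted(cited), 2):
            counts[(a, b)] += 1
    return counts


# Made-up reference lists for two citing papers.
papers = [
    [["White", "Griffith"], ["Small"]],
    [["White", "McCain"], ["Small", "Griffith"]],
]
first_only = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

With first-author counting only the pair (Small, White) appears; crediting up to five authors brings Griffith and McCain into the co-citation matrix as well.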

    Variations on the Author

    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourses; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.

    Appropriate Similarity Measures for Author Cocitation Analysis

    We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
    Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
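The two named measures, the Pearson correlation and the cosine, can be compared on a pair of cocitation profiles. The sketch below is only an illustration with made-up numbers. It shows one mathematical property that separates the measures: appending authors who are never cocited with either profile (extra zeros) leaves the cosine unchanged but shifts the Pearson correlation. The abstract does not say whether this is the argument the authors make, so treat that link as an assumption.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation, computed as the cosine of the mean-centred profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical cocitation profiles of two authors over four other authors.
x = [4, 0, 2, 1]
y = [1, 3, 0, 2]

# The same profiles after six more authors are added to the study,
# none of whom is cocited with either author.
x_padded = x + [0] * 6
y_padded = y + [0] * 6

print(cosine(x, y), cosine(x_padded, y_padded))
print(pearson(x, y), pearson(x_padded, y_padded))
```

In this example the Pearson correlation even changes sign once the zeros are added, while the cosine stays the same.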

    Parallelizing graph computation with automated vectorization

    Graph computations have found widespread use in social network analysis, bioinformatics, and web search. Applications often need to evaluate the same graph query multiple times over the same data graph, starting from different source vertices, a task referred to as multi-instance processing (MIP). There are mainly two approaches to MIP. The first approach is to use highly optimized multi-instance graph algorithms that interleave the evaluation of multiple query instances to exploit computation sharing across instances. These multi-instance algorithms are efficient but challenging to implement. The other approach is to use general-purpose graph processing frameworks and obtain answers to multiple query instances through serial or batch evaluation. These frameworks are easy to program but have been shown to be significantly less efficient than multi-instance algorithms. With these two existing approaches, users have to choose between efficiency and ease of programming. In response to the challenge, this thesis presents a systematic approach to get the best of both worlds. In the first part of this thesis, we present MITra, a framework for composing Multi-Instance graph Traversal algorithms that traverse from multiple source vertices simultaneously over a single thread. Underlying MITra is a frontier-ranking model, which provides an abstraction for graph algorithms, separating traversal logic from computation logic. Based on this model, MITra offers an easy-to-use programming interface. MITra enables users to compose multi-instance algorithms by programming computation logic in a dedicated edge function following textbook algorithms, and choosing traversal logic via frontier-ranking configuration. On the backend, MITra synthesizes and executes the multi-instance algorithm by automatically organizing vertices into frontiers based on their numeric rank values, automatically sharing computation across instances and benefiting from SIMD vectorization.
We further showcase the ease of use, expressiveness, and efficiency of MITra by developing a plug-and-play web demo. In addition, we extend MITra to take advantage of multi-core parallelization, and evaluate the performance of MITra through extensive experiments. The second part of this thesis presents AutoMI, a framework for automatically converting vertex-centric graph algorithms into their vectorized multi-instance versions. A well-developed multi-instance algorithm runs significantly faster than traditional serial and batch evaluation; however, its design and implementation are notoriously challenging. AutoMI relieves developers of the burden of writing delicate multi-instance algorithms and achieves superior performance through vectorization. In addition, we propose TrackFree optimization in AutoMI, yielding simpler and more efficient multi-instance algorithm implementation. To aid the decision of whether to use TrackFree in AutoMI, we develop an algebraic characterization. AutoMI targets vertex-centric algorithms written in the GAS (Gather-Apply-Scatter) programming model, as promoted by major distributed graph processing frameworks. We implement AutoMI and demonstrate its performance advantage through extensive experiments on real-life and synthetic data graphs. Put together, MITra and AutoMI provide a systematic approach to easily program multi-instance graph algorithms and achieve high performance through automated and effective vectorization.
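The idea of sharing computation across query instances can be illustrated with a multi-source breadth-first search that packs one bit per instance into an integer mask, so a single edge visit advances every instance at once. This is a generic sketch of the technique, not MITra's or AutoMI's interface: the function name `multi_source_bfs` and the dict-based graph are hypothetical, and the sketch assumes every vertex appears as a key in `adj`.

```python
def multi_source_bfs(adj, sources):
    """Hop distances from several sources in one shared traversal.

    adj: dict mapping each vertex to a list of out-neighbours
         (assumption: every vertex appears as a key).
    Returns a list of dicts; dist[i][v] is the distance from sources[i] to v.
    """
    seen = {v: 0 for v in adj}  # bit i set: instance i has reached v
    frontier = {}               # vertex -> mask of instances currently at it
    dist = [dict() for _ in sources]
    for i, s in enumerate(sources):
        bit = 1 << i
        seen[s] |= bit
        frontier[s] = frontier.get(s, 0) | bit
        dist[i][s] = 0

    level = 0
    while frontier:
        level += 1
        next_frontier = {}
        for u, mask in frontier.items():
            for v in adj[u]:
                # One bitwise operation serves all instances sitting at u.
                new = mask & ~seen[v]
                if new:
                    seen[v] |= new
                    next_frontier[v] = next_frontier.get(v, 0) | new
        for v, mask in next_frontier.items():
            for i in range(len(sources)):
                if (mask >> i) & 1:
                    dist[i][v] = level
        frontier = next_frontier
    return dist


graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
distances = multi_source_bfs(graph, sources=[0, 1])
```

A vectorized implementation would replace the Python integer masks with fixed-width SIMD lanes; the bitmask only stands in for that idea here.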

    Towards analytics over dirty databases

    In today's data-driven world, organizations increasingly rely on large and complex datasets to drive decision-making, build predictive models, and optimize operations. From e-commerce companies leveraging customer behavior data to improve marketing strategies, to financial institutions analyzing transactions for fraud detection, the demand for efficient, scalable, and seamless data processing is critical. However, real-world data is often incomplete, inconsistent, or erroneous, which undermines the accuracy and reliability of the insights drawn from it. Traditional workflows require moving data out of database systems into external analytical tools for machine learning, data cleaning, and query answering tasks, introducing significant inefficiencies and creating bottlenecks. This thesis presents integrated solutions that enable these operations to be performed directly within the database, addressing three key problems in modern database systems: (1) inefficiency in executing machine learning tasks, (2) inability to handle missing data effectively, and (3) limitations in querying inconsistent databases. First, we address the challenge of efficiently training machine learning models over relational data. Most data resides in databases, yet current machine learning systems typically require exporting data into external tools, leading to excessive data transfer and redundant computations. To overcome these inefficiencies, we propose an in-database machine learning library implemented on PostgreSQL and DuckDB. Our approach rewrites popular machine learning algorithms to run directly within the database, leveraging the relational data structure. By training models over aggregate values computed from normalized tables, we eliminate the need for expensive joins and preprocessing, achieving 10 to 100-fold faster model training compared to state-of-the-art solutions like MADLib. 
Additionally, our library allows multiple models to be constructed using the same set of aggregate computations, further optimizing the learning process. Second, missing data is a pervasive issue in real-world datasets, often necessitating the use of imputation techniques to fill in gaps before analysis or model training can proceed. External tools for imputation, such as those implementing the Multiple Imputation by Chained Equations (MICE) method, typically require data export and preprocessing, adding complexity to the workflow. We introduce an in-database imputation framework that integrates MICE directly into database systems, allowing it to operate over normalized data. By re-engineering the MICE algorithm to share computations across iterations and optimize for fast access to frequently used data, we significantly reduce runtime. Our solution, implemented in both PostgreSQL and DuckDB, outperforms traditional methods, providing a more efficient and scalable way to handle missing data without leaving the database environment. Third, we tackle the problem of query answering over inconsistent databases, a critical challenge for organizations that rely on accurate query results despite data inconsistencies. Conventional methods often involve data cleaning to restore consistency, but this can be impractical in real-time environments or where altering the original data is not feasible. To address this, we develop a method based on Consistent Query Answering (CQA), which allows queries to be evaluated directly over inconsistent data. We model the concept of "minimal repairs" (the smallest changes that restore consistency) as a logical formula and use model counting techniques to determine the number of possible repairs. Furthermore, by optimizing the size of the logical formula, we achieve up to a 1000-fold reduction in computational complexity.
To efficiently compute the number of repairs supporting each query answer, we introduce two Monte Carlo approximation algorithms that leverage the compiled logical formula. These algorithms provide theoretical guarantees for approximation accuracy while maintaining practical efficiency, enabling the execution of CQA over large datasets with multiple functional dependency violations. In conclusion, this thesis presents a comprehensive set of in-database solutions designed to overcome inefficiencies in machine learning, data imputation, and query evaluation processes, particularly in the presence of incomplete or inconsistent data. By integrating these tasks directly into modern relational databases, our approach not only streamlines workflows but also significantly improves performance. Our contributions include a high-performance machine learning library, a scalable imputation technique for handling missing data, and a robust framework for consistent query answering over erroneous databases. Collectively, these innovations represent a significant advancement toward more efficient, reliable, and scalable data management solutions
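Counting the repairs that support a query answer can be illustrated on a toy table. The sketch below is not the thesis's method, which compiles repairs into a logical formula and applies model counting. It assumes the simplest setting instead: one key constraint, with a repair keeping exactly one tuple per key value. The helper names (`key_groups`, `count_repairs`, `estimate_support`) and the data are hypothetical, and the sampler is plain Monte Carlo without the approximation guarantees the abstract mentions.

```python
import random
from collections import defaultdict
from math import prod


def key_groups(rows, key):
    """Group tuples that share a key value; a group larger than one is a violation."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return list(groups.values())


def count_repairs(rows, key):
    """Number of repairs when each repair keeps exactly one tuple per key value."""
    return prod(len(group) for group in key_groups(rows, key))


def estimate_support(rows, key, predicate, samples=10_000, seed=0):
    """Estimate the fraction of repairs containing a tuple that satisfies predicate."""
    rng = random.Random(seed)
    groups = key_groups(rows, key)
    hits = 0
    for _ in range(samples):
        # Picking one tuple per group independently samples repairs uniformly.
        repair = [rng.choice(group) for group in groups]
        if any(predicate(row) for row in repair):
            hits += 1
    return hits / samples


# Made-up table in which "id" should be a key but ids 1 and 2 are duplicated.
rows = [
    {"id": 1, "city": "Oslo"},
    {"id": 1, "city": "Rome"},
    {"id": 2, "city": "Oslo"},
    {"id": 2, "city": "Paris"},
    {"id": 3, "city": "Rome"},
]
total = count_repairs(rows, "id")
oslo = estimate_support(rows, "id", lambda r: r["city"] == "Oslo")
rome = estimate_support(rows, "id", lambda r: r["city"] == "Rome")
```

Here "Rome" is supported by every repair, because the tuple with id 3 is never in conflict, whereas "Oslo" is supported by three of the four repairs.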

    Certain Answers of Extensions of Conjunctive Queries by Datalog and First-Order Rewriting

    To answer database queries over incomplete data, the gold standard is finding certain answers: those that are true regardless of how incomplete data is interpreted. Such answers can be found efficiently for conjunctive queries and their unions, even in the presence of constraints such as keys or functional dependencies. With negation added, however, finding certain answers becomes intractable. In this paper we exhibit a well-behaved class of queries that extends unions of conjunctive queries with a limited form of negation and that permits efficient computation of certain answers even in the presence of constraints by means of rewriting into Datalog with negation. The class consists of queries that are the closure of conjunctive queries under Boolean operations of union, intersection and difference. We show that for these queries, certain answers can be expressed in Datalog with negation, even in the presence of functional dependencies, thus making them tractable in data complexity. We show that in general Datalog cannot be replaced by first-order logic, but without constraints such a rewriting can be done in first-order logic.
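The definition of certain answers can be made concrete with a brute-force sketch: enumerate every way of filling in the unknown values, run the query in each resulting complete database, and keep only the tuples returned every time. This is the exponential baseline that rewriting techniques are meant to avoid, not the paper's construction. The `Null` class, the `certain_answers` function, and the restriction of nulls to a small finite domain are all assumptions made for illustration.

```python
from itertools import product


class Null:
    """Marker for an unknown value (hypothetical representation)."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"Null({self.name!r})"


def certain_answers(db, query, domain):
    """Intersect the query results over all valuations of the nulls.

    db: dict mapping relation name -> set of tuples, possibly containing Null objects.
    query: function from a complete database (same shape) to a set of tuples.
    domain: finite list of values a null may take (a simplifying assumption).
    """
    nulls = sorted(
        {v for rel in db.values() for row in rel for v in row if isinstance(v, Null)},
        key=lambda n: n.name,
    )
    answers = None
    for values in product(domain, repeat=len(nulls)):
        valuation = dict(zip(nulls, values))
        # Replace each null by its chosen value; constants map to themselves.
        world = {
            name: {tuple(valuation.get(v, v) for v in row) for row in rel}
            for name, rel in db.items()
        }
        result = query(world)
        answers = result if answers is None else answers & result
    return answers


x = Null("x")
db = {"R": {(1, x)}, "S": {(1, 2)}}
union = certain_answers(db, lambda w: w["R"] | w["S"], domain=[1, 2, 3])
difference = certain_answers(db, lambda w: w["R"] - w["S"], domain=[1, 2, 3])
```

The union has the certain answer (1, 2), while the difference R − S has none: if the null is interpreted as 2 the result is empty, and otherwise the returned tuple depends on the interpretation.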

    Dispelling the Myths Behind First-author Citation Counts

    We conducted a full-scale evaluative citation analysis study of scholars in the XML research field to explore how much author rankings produced by different citation counting methods actually differ, and to demonstrate the capability of emerging data and tools on the Web in supporting more realistic citation counting methods. Our results contest some common arguments for the continued use of first-author citation counts in the evaluation of scholars, such as high correlations between author rankings by first-author citation counts and other citation counting methods, and high costs of using more realistic citation counting methods that are not well-supported by the ISI databases. It is argued that increasingly available digital full text research papers make it possible for citation analysis studies to go beyond what the ISI databases have directly supported and to employ more sophisticated methods.
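How author rankings can shift with the counting method is easy to see on a small example. The abstract names only first-author counting; the "all" and "fractional" schemes below are common alternatives chosen here as assumptions, and the function names and data are hypothetical.

```python
from collections import Counter


def citation_counts(cited_works, method="first"):
    """Credit citations to authors under one of three counting schemes.

    cited_works: list of (ordered author list, times cited) pairs.
    method: "first" credits only the first author, "all" gives every author
    full credit, "fractional" splits each citation equally among the authors.
    """
    counts = Counter()
    for authors, times_cited in cited_works:
        if method == "first":
            counts[authors[0]] += times_cited
        elif method == "all":
            for author in authors:
                counts[author] += times_cited
        elif method == "fractional":
            for author in authors:
                counts[author] += times_cited / len(authors)
        else:
            raise ValueError(f"unknown method: {method}")
    return counts


def ranking(counts):
    """Authors ordered by descending count, ties broken alphabetically."""
    return [a for a, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


# Made-up cited works: (authors in byline order, citations received).
works = [(["A", "B"], 10), (["B"], 4), (["C", "A", "B"], 9)]
by_first = ranking(citation_counts(works, "first"))
by_all = ranking(citation_counts(works, "all"))
by_fraction = ranking(citation_counts(works, "fractional"))
```

Author B ranks last under first-author counting but first under the other two schemes, because most of B's citations come from papers on which B is not the first author.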

    Author Index

    Not provided

    Scaling and explaining machine learning powered database applications

    For decades, database systems have been the backbone of applications in a wide range of domains, e.g., finance, web services, business intelligence, social analysis, healthcare. Meanwhile, the resurgence of machine learning, in particular large-scale deep learning services provided by big companies in recent years, has been expeditiously reshaping these applications, enabling them to easily take advantage of model prediction services and be significantly more adaptive, intelligent and capable. However, this movement causes database applications to be less transparent, rendering database vendors incapable of offering reliable insights and explanations to their application customers. In addition, the increasing popularity of machine learning features rapidly boosts the scale of applications while the underlying legacy database systems are struggling to scale out and keep pace, causing tension between the application load and database system. This thesis aims to address these two challenges. The first part of the thesis presents a new concept, referred to as on-database contextual explanation, and a suite of associated techniques to empower database applications to explain their learning-powered decisions to end customers, even if they are generated by third-party cloud-based prediction services that opt not to offer explainability. The key intuition is that the data exchange between the databases and remote machine learning models already gives the applications a dynamic context that contains vital information to deduce reliable explanations, independent of the explainability of the remote models. To elaborate on and exploit this, we develop algorithms and systems to efficiently compute contextual feature explanation and counterfactual explanation by examining and monitoring databases at runtime, faithfully conforming to the “right-to-explanation” policy requested by GDPR.
We also evaluate its effectiveness via extensive experiments and real-world case studies. The second part of the thesis develops a means to scale out legacy databases without migrating them to the cloud or re-deploying with added hardware. Our method is to augment legacy database systems at runtime with external caches, allowing us to offload database load to a look-aside cache on-the-fly. However, a caveat of extending database systems with lightweight caches is that the augmented system as a whole loses the correctness guarantees that a typical database system offers, especially for transactional workloads, requiring the developers to re-design the applications. To this end, we present transactional caching, a scheme that maintains application invariants over the augmented system. It works with any in-memory key-value cache, e.g., Redis and Memcached, and empowers them to assure that applications always see a monotonically increasing snapshot of the databases. Critical to the performance of such cache-augmented databases is the design of transactional cache replacement policies, which we prove is intractable, as opposed to linear-time decidable for conventional caching. Nonetheless, we develop efficient learning-augmented transactional cache policies with provable guarantees. Over real-life traces and benchmarks, they have been shown to be effective in improving transaction throughput while guaranteeing application invariants. Together, these give us a suite of concepts and techniques for database applications to benefit from powerful machine learning models, without compromising the transparency, reliability and correctness that database systems have offered for decades.
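The look-aside caching pattern mentioned in the second part can be sketched in a few lines. This is only the conventional pattern, with plain dictionaries standing in for the legacy database and for a key-value cache such as Redis or Memcached. It does not implement the thesis's transactional caching scheme and gives none of its snapshot guarantees. The class name `LookAsideCache` and its methods are hypothetical.

```python
class LookAsideCache:
    """Conventional look-aside cache in front of a database (illustration only)."""

    def __init__(self, database):
        self.database = database  # dict standing in for the legacy database
        self.cache = {}           # dict standing in for an external key-value cache
        self.db_reads = 0         # number of reads that reached the database

    def read(self, key):
        # Serve from the cache when possible; otherwise load and remember.
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.database[key]
        self.cache[key] = value
        return value

    def write(self, key, value):
        # Update the database, then invalidate the cached copy so the next
        # read reloads it. Concurrent transactions are not handled here.
        self.database[key] = value
        self.cache.pop(key, None)


store = LookAsideCache({"balance": 100})
first = store.read("balance")   # miss: goes to the database
second = store.read("balance")  # hit: served from the cache
store.write("balance", 80)
third = store.read("balance")   # miss again after invalidation
```

Without further coordination, a pattern like this can expose stale or mutually inconsistent values under concurrent transactions, which is the gap the abstract's transactional caching is described as closing.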