
    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first author of a cited work, and a non-traditional type, which takes into account the first five authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, when the same number of top-ranked authors is selected and analyzed, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting. Reasons for these effects are discussed.
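
The two counting rules described in this abstract can be illustrated with a minimal sketch. It assumes each citing paper is given as a list of cited works and each cited work as a list of author names in byline order; the function name, the `max_authors` parameter, and the sample names are hypothetical, and counting each author pair at most once per citing paper is an assumption rather than the study's documented procedure.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations across a set of citing papers.

    citing_papers: list of reference lists; each reference is a list of
                   author names in byline order.
    max_authors:   how many leading authors of each cited work are credited
                   (1 = traditional first-author counting).
    """
    counts = Counter()
    for references in citing_papers:
        # Authors this paper cites under the chosen counting rule.
        cited = set()
        for authors in references:
            cited.update(authors[:max_authors])
        # Assumption: each pair is counted once per citing paper.
        for a, b in combinations(sorted(cited), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical data: two citing papers, each citing two works.
papers = [
    [["White", "Griffith"], ["Small"]],
    [["McCain", "White"], ["Small", "Sweeney"]],
]
first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

In this toy data, White and Small are co-cited once under first-author counting but twice when up to five authors are credited, because White is a second author in the second paper's reference list.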

    Knowledge-Driven Text Generation

    Natural Language Generation (NLG) is an automated process that produces human-like text. It can create this text either from scratch or using inputs like natural language or structured data such as database records, computer-generated reports or keywords. The main objective of NLG is to generate coherent, fluent and relevant natural language text, usually based on input data. NLG has various practical applications, such as creating content, summarizing text and generating dialogues. For instance, NLG can automatically produce news articles, product descriptions, or weather reports. Advancements in machine learning and NLP have led to the development of Large Language Models (LLMs) that are trained on massive amounts of data and can generate human-like text. Pretrained Language Models (PLMs), such as T5, BART, and ChatGPT, are examples of modern state-of-the-art models which have evolved beyond traditional grammar-based and statistical methods. These models can be improved by providing more data and increasing the number of neural network layers. In modern NLP methods, such models are often first pretrained on large datasets and then fine-tuned for specific tasks. This thesis studies the data-to-text generation task, which aims to generate textual descriptions of structured data with the help of PLMs. First, we investigate the capability of PLMs to generate grammatically correct and consistent text from different types of structured data, such as keywords, tables, and abstract meaning representation. Second, we explore how PLMs can retrieve useful information from incomplete datasets and generate text from multiple provided data sources. The study also investigates the possibility of fine-tuning only the first few layers of PLMs to save time and resources. Third, we examine hybrid PLMs that work on both natural language generation and natural language understanding and compare them with pretrained seq2seq models.
Finally, we investigate effective control mechanisms for the language model in epic-level text generation. The study is divided into four parts, each with a specific scope of investigation. The first part is limited to implementing a method for keyword-to-text generation and evaluating the generated text from a syntactic and semantic perspective. The data sources used in this study are RACE and Wikimedia, and the English frequency word list is used to identify keywords. The second part focuses on developing a system for table-to-text and RDF-to-text generation using PLMs. The sources of data for this study are E2E, WebNLG, and DART. The proposed method involves fine-tuning different PLMs for data-to-text generation tasks and developing the dynamic prompt tuning method for data augmentation. The third part studies text generation from tables and knowledge graphs. The data sources for this study are WikiBio and Wikidata. The study proposes a hybrid model combining PLMs and assesses its performance in comparison to a pre-trained seq2seq model. A new dataset called TaKG is also created to address the incompleteness of the WikiBio dataset. In the fourth part, a framework is proposed to address the limitations of large-scale language models in generating epic-scale text. Our contributions include designing effective control mechanisms for the language model, optimizing GPT-3.5 for open-domain text, and evaluating the generated text against long text generation requirements. In summary, this thesis contributes to the investigation of structured data’s impact on PLMs and offers valuable insights into the factors that influence the effectiveness of controlling PLMs. This includes the advancement of effective control mechanisms for PLMs.

    Variations on the Author

    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourse; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.

    Appropriate Similarity Measures for Author Cocitation Analysis

    We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance. Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
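
As a rough illustration of how the two measures can disagree, the sketch below computes the cosine and the Pearson correlation for two cocitation profiles, then pads both profiles with zeros (authors with whom neither is cocited). This sensitivity to added zeros is one property commonly raised in this debate; it is shown here as an assumption about what may matter, not as the abstract's own argument. The profile values are made up.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation, computed as the cosine of mean-centred vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical cocitation profiles of two authors over four other authors.
profile_a = [3, 0, 2, 5]
profile_b = [1, 4, 0, 2]

# The same profiles after adding six authors cocited with neither.
padded_a = profile_a + [0] * 6
padded_b = profile_b + [0] * 6

cos_before, cos_after = cosine(profile_a, profile_b), cosine(padded_a, padded_b)
r_before, r_after = pearson(profile_a, profile_b), pearson(padded_a, padded_b)
```

With these values the cosine is unchanged by the padding, while the Pearson correlation moves from negative to positive, even though no information about the two authors' relationship was added.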

    Exploring Structure-Informed Information Retrieval for Legal Statutes

    Improved Information Retrieval (IR) techniques have the potential to improve access to justice and save time for legal professionals and laypeople alike. However, ensuring that all relevant legal provisions are retrieved from a vast legal database remains a challenging task. Vector similarity search with contextual embeddings has enabled IR systems to retrieve more semantically relevant text, even when it does not share lexical similarity with a query. Nevertheless, achieving high recall of relevant legal provisions from legal statutes remains difficult, because vector search cannot interpret the deep hierarchical structures of legislative provisions, take into account extensive explicit cross-referencing between legal provisions, or recognise implicit relationships between provisions built on shared context. We propose a structure-aware retrieval method that uses similarity search results from a Hierarchical Navigable Small World (HNSW) embeddings graph to query a multi-layer graph representing structural, citation, and implicit relationships between legal provisions. We incorporate this into a "Retrieve then Re-Rank" IR pipeline featuring cross-encoder re-ranking. We evaluate our method using the annual Competition on Legal Information and Entailment (COLIEE) dataset, based on the Japanese Civil Code. Performance evaluations demonstrate how incorporating structural information from a multi-layered graph, as we propose, can be a simple yet effective way to improve recall in legal article retrieval tasks, compared to using vector similarity search alone.
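
The general shape of such a pipeline can be sketched as follows. This is not the authors' implementation: the seed hits stand in for HNSW search results, the `layers` dictionary stands in for the multi-layer graph, and `score` stands in for a cross-encoder. All function names, parameters, and article identifiers are hypothetical.

```python
def expand_with_graph(seed_ids, layers, hops=1):
    """Add provisions linked to the seed hits in any relation layer.

    seed_ids: provision ids returned by vector search (assumed given).
    layers:   {layer_name: {provision_id: [linked provision ids]}}.
    hops:     how many link steps to follow from the seeds.
    """
    candidates = list(seed_ids)
    seen = set(seed_ids)
    frontier = list(seed_ids)
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for adjacency in layers.values():
                for neighbour in adjacency.get(node, ()):
                    if neighbour not in seen:
                        seen.add(neighbour)
                        candidates.append(neighbour)
                        next_frontier.append(neighbour)
        frontier = next_frontier
    return candidates


def retrieve_then_rerank(query, seed_ids, layers, score, top_k=3):
    """Expand the seed hits through the graph, then re-rank all candidates."""
    candidates = expand_with_graph(seed_ids, layers)
    ranked = sorted(candidates, key=lambda pid: score(query, pid), reverse=True)
    return ranked[:top_k]


# Hypothetical two-layer graph over made-up article ids.
layers = {
    "structure": {"art_1": ["art_2"]},
    "citation": {"art_1": ["art_9"], "art_9": ["art_10"]},
}
# Placeholder relevance scores standing in for a cross-encoder.
toy_scores = {"art_1": 0.5, "art_2": 0.1, "art_9": 0.9}
top = retrieve_then_rerank(
    "query text", ["art_1"], layers, lambda q, pid: toy_scores[pid], top_k=2
)
```

Here `art_9` is never returned by the vector search itself, yet it reaches the top of the final ranking because a citation link from `art_1` brought it into the candidate set. That is the recall benefit the abstract attributes to structural information.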

    Dispelling the Myths Behind First-author Citation Counts

    We conducted a full-scale evaluative citation analysis study of scholars in the XML research field to explore how much author rankings produced by different citation counting methods actually differ, and to demonstrate the capability of emerging data and tools on the Web in supporting more realistic citation counting methods. Our results contest some common arguments for the continued use of first-author citation counts in the evaluation of scholars, such as high correlations between author rankings by first-author citation counts and other citation counting methods, and high costs of using more realistic citation counting methods that are not well-supported by the ISI databases. It is argued that increasingly available digital full text research papers make it possible for citation analysis studies to go beyond what the ISI databases have directly supported and to employ more sophisticated methods.
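
To make the contrast between counting methods concrete, here is a small sketch comparing first-author counting with two commonly discussed alternatives: full credit to every author, and fractional credit split equally among authors. The abstract does not name the specific alternative methods used in the study, so these two are assumptions chosen for illustration. The function name, method labels, and data are hypothetical.

```python
from collections import defaultdict


def citation_counts(cited_works, method="first"):
    """Total citation credit per author under a given counting rule.

    cited_works: list of (authors_in_byline_order, times_cited) pairs.
    method:      "first"      -> all credit to the first author
                 "all"        -> full credit to every author
                 "fractional" -> credit split equally among authors
    """
    totals = defaultdict(float)
    for authors, times_cited in cited_works:
        if method == "first":
            totals[authors[0]] += times_cited
        elif method == "all":
            for author in authors:
                totals[author] += times_cited
        elif method == "fractional":
            for author in authors:
                totals[author] += times_cited / len(authors)
        else:
            raise ValueError(f"unknown counting method: {method}")
    return dict(totals)


def ranking(totals):
    """Authors ordered from most to least credited."""
    return sorted(totals, key=totals.get, reverse=True)


# Made-up works: (byline, number of citations received).
works = [(["A", "B"], 10), (["B"], 4), (["C", "A", "B"], 6)]
by_first = citation_counts(works, "first")
by_all = citation_counts(works, "all")
by_fraction = citation_counts(works, "fractional")
```

In this toy data author B ranks last under first-author counting but first under both alternatives, which is the kind of ranking divergence the study sets out to measure.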

    Author Index

    Not provided

    Node Classification on Graph Data with Global Learning

    Node classification is a core task in graph-based machine learning, where the goal is to predict labels or categories for nodes in a graph, leveraging the relational structure and attributes of the data. To enhance the expressiveness of node representations, various Graph Neural Networks (GNNs) have been proposed to aggregate information from neighbouring nodes, conducting message-passing within local receptive fields, a process also referred to as local learning. However, the suboptimal nature of the available graph structure, often characterized by noisy edges or missing edges, generally negatively affects the performance of node classification. To address this issue, this dissertation explores node classification by considering relations among all nodes, referred to as global learning, and further investigates opportunities and challenges for its application across different scenarios.
    - We introduce Chain of Propagation Prompting (CPP) to enhance the expressiveness of node representations while reducing the dependency on label information. CPP involves designing a simple message-passing pattern, which we incorporate into node representations using graph contrastive learning. This simple pattern prompts multi-head self-attention-based layers to globally capture more complex patterns while minimizing their reliance on label information. Additionally, we implement majority voting to enhance the predictive confidence of multiple heads.
    - We introduce Robust Node Classification under Graph and Label Noise (RNCGLN) to improve the robustness of node classification when both graph and label noise are present. By integrating local graph learning and global graph learning, RNCGLN can provide comprehensive information to enhance node classification performance. Additionally, we develop graph and label self-improvement modules to improve and supplement the quality of supervisory information. Consequently, RNCGLN leverages self-training and pseudo-label techniques to facilitate two self-improvement processes in an end-to-end learning framework.
    - We introduce a Flexible-pass Filter-based Graph Transformer (FFGT) to resist adversarial attacks on graph data. Leveraging self-attention's ability to capture arbitrary graph filters, our self-attention layers with three heads capture multi-frequency representations across low-frequency, hybrid-frequency, and high-frequency ranges. Additionally, we designed graph learning and fusion modules to improve self-attention effectiveness in capturing designed information, yielding a flexible-frequency representation. Consequently, FFGT shows consistent resistance to adversarial perturbation in multiple datasets and against diverse adversarial attacks.
    We conducted theoretical analyses and numerical evaluations of our proposed methods using diverse graph data. The experimental results show that our methods, leveraging global learning strategies, consistently outperform traditional Graph Neural Networks (GNNs) based on local learning. This superior performance is demonstrated across various scenarios, including diverse graph datasets, graph noise, label noise, and multiple adversarial attacks. Additionally, our theoretical analysis validates the effectiveness of each proposed method in addressing these challenges. Together, these findings confirm that our methods enhance expressiveness, improve robustness to noise, and strengthen resilience against adversarial attacks.

    Explainable and Automated Scientific Fact-Checking with Neural Networks

    Fact-checking plays a crucial role in combating misinformation, especially in scientific domains where the stakes are high, and the consequences of false claims can be severe. As experienced during the COVID-19 pandemic, unfaithful claim verifications underscored the need for robust fact-checking systems. This thesis addresses the challenges of verifying claims in the scientific literature using deep learning-based computational linguistics techniques, focusing on faithfully incorporating knowledge from a vast existing literature and addressing the scarcity of appropriate training datasets for robust fact-checking systems. Language models in Natural Language Processing (NLP) are computational models designed to understand language. These models are trained on massive amounts of data to learn statistical patterns and relationships within language. They work by predicting words, sub-words, or characters in a sequence, taking into account the context provided by preceding sequence elements. In recent years, the dominant architecture for language models has been transformer-based models, which can be seen as a milestone in NLP research due to their significant improvements for downstream tasks. However, these models also have limitations, including modelling challenges and dataset challenges. Modelling challenges refer to the limitations and complexities faced by computational models, particularly those based on deep learning and natural language processing techniques. These challenges include difficulties in accurately capturing nuanced arguments, potential generation of false information (‘hallucinations’), constraints on handling lengthy input texts, and limitations in reasoning capabilities. On the other hand, dataset challenges stem from the scarcity of appropriate training data essential for building robust fact-checking systems. The specialized nature of scientific content, coupled with the need for accurate annotations, creates an expertise bottleneck.
In this context, developing large-scale, domain-specific datasets becomes crucial to train models effectively. These challenges collectively necessitate innovative methodologies to enhance the capabilities of computational models for accurate scientific fact-checking, addressing both their inherent modelling intricacies and the scarcity of specialised training data. This thesis proposes methods to advance the field of scientific fact-checking by developing approaches that enhance the capabilities of transformer-based language models. In addressing modelling challenges, we present a novel methodology that leverages multiple viewpoints from scientific literature, allowing the assessment of contradictory arguments and implicit assumptions. Our proposed inference method enhances reasoning by distilling information from diverse, relevant scientific abstracts. This approach yields a verdict label that can be weighted based on the article’s reputation and an explanation that can be traced back to sources to avoid hallucinations. Our findings demonstrate that human evaluators perceive our explanations to be significantly superior to those of off-the-shelf models, enabling faithful tracing of evidence back to its original sources. For the problem of handling lengthy input texts, we introduce a method that utilises the layer-based attention scores of transformers to filter input length. This approach proves efficient for scientific paper topic classification and verdict label prediction tasks, which is critical for effective fact-checking. Regarding dataset challenges, we address the expertise bottleneck limiting the availability of appropriate training data for scientific fact-checking. We propose a pipeline, Multi2Claim, for automatically converting multiple-choice questions into fact-checking data. Using this pipeline, we create two large-scale datasets: Med-Fact for the medical domain and Gsci-Fact for general science.
These datasets represent significant contributions as they are among the first large-scale scientific fact-checking datasets. Baseline models developed using each dataset show promising results, with performance improvements of up to 26% on existing fact-checking datasets such as SciFact, HEALTHVER, COVID-Fact, and CLIMATE-FEVER. In conclusion, the methodologies proposed in this thesis contribute to the advancement of scientific fact-checking by addressing modelling intricacies and dataset challenges, offering a promising step towards more accurate and effective systems to combat misinformation in scientific domains.