Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation
counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings
are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that
only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into
account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
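To make the two counting schemes concrete, the sketch below counts author co-citations from toy reference lists. It is only an illustration under stated assumptions: the function name `cocitation_counts`, the `max_authors` parameter, and the sample data are hypothetical, and the choice to treat co-authors of the same cited work as co-cited is an assumption of this sketch, not something the abstract specifies.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count how often each pair of authors is cited together.

    citing_papers: one reference list per citing paper; each reference
        is the ordered author list of a cited work.
    max_authors: number of leading authors credited per cited work
        (1 reproduces traditional first-author counting).
    """
    counts = Counter()
    for references in citing_papers:
        # Collect every author credited by this citing paper.
        cited = set()
        for authors in references:
            cited.update(authors[:max_authors])
        # Each unordered pair of credited authors is co-cited once per paper.
        # Assumption: co-authors of one cited work also count as a pair.
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


# Hypothetical data: two citing papers, each citing two works.
papers = [
    [["Smith", "Lee"], ["Jones", "Kim"]],
    [["Smith", "Kim"], ["Jones"]],
]
first_only = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
print(first_only)
print(first_five)
```

With first-author counting only the Smith/Jones pair appears, whereas crediting up to five authors brings Lee and Kim into the co-citation matrix as well.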
Knowledge-Driven Text Generation
Natural Language Generation (NLG) is an automated process that produces human-like text.
It can create this text either from scratch or using inputs like natural language or structured
data such as database records, computer-generated reports or keywords. The main objective
of NLG is to generate coherent, fluent and relevant natural language text, usually based on
input data. NLG has various practical applications, such as creating content, summarizing text
and generating dialogues. For instance, NLG can automatically produce news articles, product
descriptions, or weather reports.
Advancements in machine learning and NLP have led to the development of Large Language
Models (LLMs) that are trained on massive amounts of data and can generate human-like
text. Pretrained Language Models (PLMs), such as T5, BART, and ChatGPT, are examples of
modern state-of-the-art models which have evolved beyond traditional grammar and statistical-based
methods. These models can be improved by providing more data and increasing the
number of neural network layers. In modern NLP methods, such models are often first pretrained
on large datasets and then fine-tuned for specific tasks.
This thesis studies the data-to-text generation task that aims to generate textual descriptions
of structured data with the help of Pretrained Language Models (PLMs). First, we investigate
the capability of PLMs to generate grammatically correct and consistent text with different
types of structured data, such as keywords, tables, and abstract meaning representation. Second,
we explore how PLMs can retrieve useful information from incomplete datasets and generate
text with the provided multiple data sources. The study also investigates the possibility of
fine-tuning only the first few layers of PLMs to save time and resources. Third, we examine the
hybrid PLMs that work on natural language generation and natural language understanding
and compare them with pretrained seq2seq models. Finally, we investigate effective control
mechanisms for the language model in epic-level text generation.
The study is divided into four parts, each with a specific scope of investigation. The first
part is limited to implementing a method for keyword-to-text generation and evaluating the
generated text from a syntactic and semantic perspective. The data sources used in this study
are RACE and Wikimedia, and the English frequency word list is used to identify keywords.
The second part focuses on developing a system for table-to-text and RDF-to-text generation
using PLMs. The sources of data for this study are E2E, WebNLG, and DART. The proposed
method involves fine-tuning different PLMs for data-to-text generation tasks and developing
the dynamic prompt tuning method for data augmentation. The third part studies text
generation from tables and knowledge graphs. The data sources for this study are WikiBio and
Wikidata. The study proposes a hybrid model combining PLMs and assesses its performance
in comparison to a pre-trained seq2seq model. A new dataset called TaKG is also created to
address the incompleteness of the WikiBio dataset. In the fourth part, a framework is
proposed to address the limitations of large-scale language models in generating epic-scale
text. Our contributions include designing effective control mechanisms for the language model,
optimizing GPT-3.5 for open-domain text, and evaluating the generated text against long text
generation requirements.
In summary, this thesis makes a contribution to the investigation of structured data’s
impact on PLMs and offers valuable insights into the factors that influence the effectiveness of
controlling PLMs. This includes the advancement of effective control mechanisms for the PLMs.
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourse; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.
Appropriate Similarity Measures for Author Cocitation Analysis
We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
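As a small illustration of how the two best-known measures can disagree, the sketch below computes the Pearson correlation and the cosine for two toy cocitation profiles, then appends extra authors with whom neither profile is ever cocited. This shows one general mathematical property of the measures and is not necessarily the argument the paper makes; the profiles and function names are hypothetical.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two cocitation profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation, i.e. the cosine of the mean-centred profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical cocitation profiles of two authors over four other authors.
x = [3, 0, 2, 5]
y = [1, 4, 0, 2]

# The same profiles after adding three authors never cocited with either.
x_padded = x + [0, 0, 0]
y_padded = y + [0, 0, 0]

print(cosine(x, y), cosine(x_padded, y_padded))
print(pearson(x, y), pearson(x_padded, y_padded))
```

In this toy case the cosine is unaffected by the added zeros, while the Pearson correlation changes sign, because the zeros shift the profile means.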
Exploring Structure-Informed Information Retrieval for Legal Statutes
Improved Information Retrieval (IR) techniques have the potential to improve access
to justice and save time for both legal professionals and laypeople alike. However,
accurately ensuring that all relevant legal provisions are retrieved from a vast legal
database remains a challenging task. Vector similarity search with contextual embeddings
has enabled IR systems to retrieve more semantically relevant text, even
when it does not share lexical similarity with a query. However, achieving high
recall of relevant legal provisions from legal statutes remains challenging, due to
the inability of vector search to interpret the deep hierarchical structures of legislative
provisions, to take into account extensive explicit cross-referencing between legal
provisions, and to recognise implicit relationships between provisions built on
shared context. We propose a structure-aware retrieval method that uses similarity
search results from a Hierarchical Navigable Small World (HNSW) embeddings
graph to query a multi-layer graph representing structural, citation, and implicit
relationships between legal provisions. We incorporate this into a "Retrieve then
Re-Rank" IR pipeline featuring cross-encoder re-ranking. We evaluate our method using
the annual Competition on Legal Information and Entailment (COLIEE) dataset,
based on the Japanese Civil Code. Performance evaluations demonstrate how incorporating
structural information from a multi-layered graph, as we propose, can be
a simple yet effective way to improve recall in legal article retrieval tasks, compared
to using vector similarity search alone.
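The retrieval idea described above, using vector-search hits as entry points into a multi-layer graph, might be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the function `expand_with_graph`, the layer names, and the article identifiers are all hypothetical, the seed list stands in for HNSW search results, and the cross-encoder re-ranking step is omitted.

```python
def expand_with_graph(seed_ids, layers, max_hops=1):
    """Add graph neighbours of the seed provisions to the candidate list.

    seed_ids: provision ids returned by a vector similarity search.
    layers: maps a layer name to an adjacency dict {id: [neighbour ids]}.
    max_hops: how many edges to follow away from the seeds.
    """
    candidates = list(seed_ids)
    seen = set(seed_ids)
    frontier = list(seed_ids)
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            # Follow edges of every relationship type from this node.
            for edges in layers.values():
                for neighbour in edges.get(node, []):
                    if neighbour not in seen:
                        seen.add(neighbour)
                        candidates.append(neighbour)
                        next_frontier.append(neighbour)
        frontier = next_frontier
    return candidates


# Hypothetical three-layer graph over a handful of articles.
layers = {
    "structure": {"art_1": ["art_2"]},
    "citation": {"art_1": ["art_7"], "art_2": ["art_9"]},
    "implicit": {"art_7": ["art_8"]},
}
seeds = ["art_1"]  # stand-in for the top hit of a vector search
one_hop = expand_with_graph(seeds, layers)
two_hop = expand_with_graph(seeds, layers, max_hops=2)
print(one_hop)
print(two_hop)
```

The expanded candidate list would then be passed to a re-ranker, so provisions that are structurally or referentially linked to a hit can be recovered even when their embeddings are not close to the query.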
Dispelling the Myths Behind First-author Citation Counts
We conducted a full-scale evaluative citation analysis study of scholars in the XML research field to explore just how different from each other author rankings resulting from different citation counting methods actually are, and to demonstrate the capability of emerging data and tools on the Web in supporting more realistic citation counting methods. Our results contest some common arguments for the continued
use of first-author citation counts in the evaluation of scholars, such as high correlations between author rankings by first-author citation counts and other citation
counting methods, and high costs of using more realistic citation counting methods that are not well-supported by the ISI databases. It is argued that increasingly available digital full text research papers make it possible for citation analysis studies to go beyond what the ISI databases have directly supported and to employ more
sophisticated methods.
Node Classification on Graph Data with Global Learning
Node classification is a core task in graph-based machine learning, where the goal is to predict labels or categories for nodes in a graph, leveraging the relational structure and attributes of the data. To enhance the expressiveness of node representations, various Graph Neural Networks (GNNs) have been proposed to aggregate information from neighbouring nodes, conducting message-passing within local receptive fields, a process also referred to as local learning. However, the suboptimal nature of the available graph structure, often characterized by noisy edges or missing edges, generally negatively affects the performance of node classification. To address this issue, this dissertation explores node classification by considering relations among all nodes, referred to as global learning, and further investigates opportunities and challenges for its application across different scenarios.
- We introduce Chain of Propagation Prompting (CPP) to enhance the expressiveness of node representations while reducing the dependency on label information. CPP involves designing a simple message-passing pattern, which we incorporate into node representations using graph contrastive learning. This simple pattern prompts multi-head self-attention-based layers to globally capture more complex patterns while minimizing their reliance on label information. Additionally, we implement majority voting to enhance the predictive confidence of multiple heads.
- We introduce Robust Node Classification under Graph and Label Noise (RNCGLN) to improve the robustness of node classification when both graph and label noise are present. By integrating local graph learning and global graph learning, RNCGLN can provide comprehensive information to enhance node classification performance. Additionally, we develop graph and label self-improvement modules to improve and supplement the quality of supervisory information. Consequently, RNCGLN leverages self-training and pseudo-label techniques to facilitate two self-improvement processes in an end-to-end learning framework.
- We introduce a Flexible-pass Filter-based Graph Transformer (FFGT) to resist adversarial attacks on graph data. Leveraging self-attention's ability to capture arbitrary graph filters, our self-attention layers with three heads capture multi-frequency representations across low-frequency, hybrid-frequency, and high-frequency ranges. Additionally, we designed graph learning and fusion modules to improve self-attention effectiveness in capturing designed information, yielding a flexible-frequency representation. Consequently, FFGT shows consistent resistance to adversarial perturbation in multiple datasets and against diverse adversarial attacks.
We conducted theoretical analyses and numerical evaluations of our proposed methods using diverse graph data. The experimental results show that our methods, leveraging global learning strategies, consistently outperform traditional Graph Neural Networks (GNNs) based on local learning. This superior performance is demonstrated across various scenarios, including diverse graph datasets, graph noise, label noise, and multiple adversarial attacks. Additionally, our theoretical analysis validates the effectiveness of each proposed method in addressing these challenges. Together, these findings confirm that our methods enhance expressiveness, improve robustness to noise, and strengthen resilience against adversarial attacks.
Explainable and Automated Scientific Fact-Checking with Neural Networks
Fact-checking plays a crucial role in combating misinformation, especially in scientific domains
where the stakes are high, and the consequences of false claims can be severe. As experienced
during the COVID-19 pandemic, unfaithful claim verifications underscored the need for robust
fact-checking systems. This thesis addresses the challenges of verifying claims in the scientific
literature using deep learning-based computational linguistics techniques, focusing on faithfully
incorporating knowledge from a vast existing literature and addressing the scarcity of appropriate
training datasets for robust fact-checking systems.
Language models in Natural Language Processing (NLP) are computational models designed
to understand language. These models are trained on massive amounts of data to learn statistical
patterns and relationships within language. They work by predicting words, sub-words, or
characters in a sequence, taking into account the context provided by preceding sequence
elements. In recent years, the dominant architecture for language models has been transformer-based
models, which can be seen as a milestone in NLP research due to their significant
improvements for downstream tasks. However, these models also have limitations, including
modelling challenges and dataset challenges. Modelling challenges refer to the limitations
and complexities faced by computational models, particularly those based on deep learning
and natural language processing techniques. These challenges include difficulties in accurately
capturing nuanced arguments, potential generation of false information (‘hallucinations’),
constraints on handling lengthy input texts, and limitations in reasoning capabilities. On the
other hand, dataset challenges stem from the scarcity of appropriate training data essential for
building robust fact-checking systems. The specialized nature of scientific content, coupled with
the need for accurate annotations, creates an expertise bottleneck. In this context, developing
large-scale, domain-specific datasets becomes crucial to train models effectively. These challenges
collectively necessitate innovative methodologies to enhance the capabilities of computational
models for accurate scientific fact-checking, addressing both their inherent modelling intricacies
and the scarcity of specialised training data.
This thesis proposes methods to advance the field of scientific fact-checking by developing
approaches that enhance the capabilities of transformer-based language models. In addressing
modelling challenges, we present a novel methodology that leverages multiple viewpoints
from scientific literature, allowing the assessment of contradictory arguments and implicit
assumptions. Our proposed inference method enhances reasoning by distilling information from
diverse, relevant scientific abstracts. This approach yields a verdict label that can be weighted
based on the article’s reputation and an explanation that can be traced back to sources to avoid
hallucinations. Our findings demonstrate that human evaluators perceive our explanations to be
significantly superior to those of off-the-shelf models, enabling faithful tracing of evidence back to its
original sources. For the problem of handling lengthy input texts, we introduce a method that
utilises the layer-based attention scores of transformers to filter input length. This approach
proves efficient for scientific paper topic classification and verdict label prediction tasks, which
is critical for effective fact-checking.
Regarding dataset challenges, we address the expertise bottleneck limiting the availability of
appropriate training data for scientific fact-checking. We propose a pipeline, Multi2Claim, for
automatically converting multiple-choice questions into fact-checking data. Using this pipeline,
we create two large-scale datasets: Med-Fact for the medical domain and Gsci-Fact for general
science. These datasets represent significant contributions as they are among the first large-scale
scientific fact-checking datasets. Baseline models developed using each dataset show promising
results, with performance improvements of up to 26% on existing fact-checking datasets such
as SciFact, HEALTHVER, COVID-Fact, and CLIMATE-FEVER.
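The abstract does not describe how Multi2Claim works internally. Purely to illustrate the general idea of turning a multiple-choice question into labelled fact-checking examples, the naive sketch below fills each answer option into a cloze-style question stem. The function name, the `____` blank marker, the `SUPPORTED`/`REFUTED` label names, and the sample question are all assumptions made for this example.

```python
def question_to_claims(stem, options, answer_index):
    """Turn one cloze-style multiple-choice item into labelled claims.

    stem: question text containing the blank marker "____" (assumed format).
    options: candidate answers.
    answer_index: position of the correct option in `options`.
    """
    claims = []
    for i, option in enumerate(options):
        # Filling the blank with an option yields a declarative claim.
        claim = stem.replace("____", option)
        # Hypothetical labels: the correct option yields a true claim.
        label = "SUPPORTED" if i == answer_index else "REFUTED"
        claims.append({"claim": claim, "label": label})
    return claims


# Hypothetical general-science question.
claims = question_to_claims(
    "The chemical symbol for sodium is ____.",
    ["Na", "So", "Sd"],
    answer_index=0,
)
for item in claims:
    print(item["label"], "-", item["claim"])
```

Each question produces one true claim and several false ones without extra expert annotation, which is the kind of saving the expertise bottleneck discussed above calls for. A real pipeline would also need to handle questions that are not phrased as fill-in-the-blank items.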
In conclusion, the proposed methodologies in this thesis contribute to the advancement
of scientific fact-checking by addressing modelling intricacies and dataset challenges, offering
a promising step towards more accurate and effective systems to combat misinformation in
scientific domains.
