Tabular context-aware optical character recognition and tabular data reconstruction for historical records
Digitizing historical tabular records is essential for preserving and analyzing valuable data across various fields, but it presents challenges due to complex layouts, mixed text types, and degraded document quality. This paper introduces a comprehensive framework to address these issues through three key contributions. First, it presents UoS Data Rescue, a novel dataset of 1,113 historical logbooks with over 594,000 annotated text cells, designed to handle the complexities of handwritten entries, aging artifacts, and intricate layouts. Second, it proposes a novel context-aware text extraction approach (TrOCR-ctx) to reduce cascading errors during table digitization. Third, it proposes an enhanced end-to-end OCR pipeline that integrates TrOCR-ctx with ByT5 for real-time post-OCR correction, providing improved multilingual support. This pipeline reduces errors encountered in table digitization tasks by correcting OCR outputs in real time during training. The model achieves superior performance with a 0.049 word error rate and 0.035 character error rate, outperforming existing methods by up to 41% in OCR tasks and 10.74% in table reconstruction tasks. This framework offers a robust solution for large-scale digitization of tabular documents, extending its applications beyond climate records to other domains requiring structured document preservation. The dataset and implementation are available as open-source resources.
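As a point of reference for the two metrics quoted above, the sketch below computes word error rate and character error rate in the usual way, as Levenshtein edit distance divided by the reference length. It is a minimal illustration, not the paper's evaluation code; the function names and the sample strings are hypothetical, and the paper may apply normalisation steps that are not shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical logbook cell contents: the OCR output drops one character.
reference = "max temp 21.5 min temp 12.0"
hypothesis = "max temp 21.5 min tmp 12.0"
wer = word_error_rate(reference, hypothesis)  # 1 wrong word out of 6
cer = char_error_rate(reference, hypothesis)  # 1 edit over 27 characters
```

A single dropped character costs a whole word at the word level but only one edit at the character level, which is why the two rates are usually reported together.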
Data rescue of historical tables through semi-supervised table structure recognition
This study uses a novel semi-supervised learning framework to explore Tabular Structure Recognition (TSR) for digitizing historical documents, specifically employing the CascadeTabNet model. TSR is crucial for transforming archival tabular data into digital formats, enhancing accessibility and analysis across various research fields. Challenges like physical degradation, inconsistent lighting, and non-standard handwriting hinder the generation of high-quality annotations of historical documents needed for effective model training. To address these issues, this research explores two research questions: (i) Can a semi-supervised training approach reduce the need for expensive data annotations? and (ii) Does semi-supervised training improve model robustness? We applied our methodology across three datasets: the GloSAT and ICDAR-2019 datasets, which are based on historical documents, and the PubTabNet dataset of predominantly modern documents. Our results indicate that semi-supervised learning substantially increases TSR accuracy and decreases dependency on extensive labelled datasets, providing a robust solution for large-scale digitization initiatives and contributing to the preservation and improved accessibility of historical data. All code from this paper is freely available on GitHub.
Geoparsing and geosemantics for social media: spatio-temporal grounding of content propagating rumours to support trust and veracity analysis during breaking news
In recent years there has been a growing trend to use publicly available social media sources within the field of journalism. Breaking news has tight reporting deadlines, measured in minutes not days, but content must still be checked and rumours verified. As such, journalists are looking at automated content analysis to pre-filter large volumes of social media content prior to manual verification. This paper describes a real-time social media analytics framework for journalists. We extend our previously published geoparsing approach to improve its scalability and efficiency. We develop and evaluate a novel approach to geosemantic feature extraction, classifying evidence in terms of situatedness, timeliness, confirmation and validity. Our approach works for new unseen news topics. We report results from 4 experiments using 5 Twitter datasets crawled during different English-language news events. One of our datasets is the standard TREC 2012 microblog corpus. Our classification results are promising, with F1 scores varying by class from 0.64 to 0.92 for unseen event types. Lastly, we report results from two case studies during real-world news stories, showcasing different ways our system can help journalists filter and cross-check content as they examine the trust and veracity of content and source.
Do prompt positions really matter?
Prompt-based models have gathered a lot of attention from researchers due to their remarkable advancements in the fields of zero-shot and few-shot learning. Developing an effective prompt template plays a critical role. However, prior studies have mainly focused on prompt vocabulary searching or embedding initialization within a predefined template with the prompt position fixed. In this empirical study, we conduct the most comprehensive analysis to date of prompt position for diverse Natural Language Processing (NLP) tasks. Our findings quantify the substantial impact prompt position has on model performance. We observe that the prompt positions used in prior studies are often sub-optimal, and this observation is consistent even in widely used instruction-tuned models. These findings suggest prompt position optimisation as a valuable research direction to augment prompt engineering methodologies and prompt position-aware instruction tuning as a potential way to build more robust models in the future.
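To make the idea of varying prompt position concrete, the sketch below builds the same input with a prompt placed before or after the text. It is only an illustration under stated assumptions: the function `place_prompt`, the two position names, and the sample strings are hypothetical, and the abstract does not say which positions or templates the study actually compares.

```python
def place_prompt(text, prompt, position):
    """Combine an input text with a prompt at an illustrative position."""
    if position == "prepend":
        return f"{prompt} {text}"
    if position == "append":
        return f"{text} {prompt}"
    raise ValueError(f"unknown position: {position}")


# Hypothetical sentiment example: same text and prompt, two positions.
text = "The film was a delight."
prompt = "Sentiment:"
variants = {pos: place_prompt(text, prompt, pos) for pos in ("prepend", "append")}
```

Each variant would then be scored with the same model, so that any difference in performance can be attributed to position alone.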
Extraction and summarization of suicidal ideation evidence in social media content using large language models
This paper explores the use of Large Language Models (LLMs) in analyzing social media content for mental health monitoring, specifically focusing on detecting and summarizing evidence of suicidal ideation. We utilized LLMs Mixtral7bx8 and Tulu-2-DPO-70B, applying diverse prompting strategies for effective content extraction and summarization. Our methodology included detailed analysis through Few-shot and Zero-shot learning, evaluating the ability of Chain-of-Thought and Direct prompting strategies. The study achieved notable success in the CLPsych 2024 shared task (ranked top for the evidence extraction task and second for the summarization task), demonstrating the potential of LLMs in mental health interventions and setting a precedent for future research in digital mental health monitoring.
Adversarial defence without adversarial defence: instance-level principal component removal for robust language models
Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose a simple yet effective add-on module that enhances the adversarial robustness of PLMs by removing instance-level principal components, without relying on conventional adversarial defences or perturbing the original training data. Our approach transforms the embedding space to approximate Gaussian properties, thereby reducing its susceptibility to adversarial perturbations while preserving semantic relationships. This transformation aligns embedding distributions in a way that minimises the impact of adversarial noise on decision boundaries, enhancing robustness without requiring adversarial examples or costly training-time augmentation. Evaluations on eight benchmark datasets show that our approach improves adversarial robustness while maintaining before-attack accuracy comparable to baselines, achieving a balanced trade-off between robustness and generalisation.
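The general operation of removing a principal component from one instance's embeddings can be sketched as follows. This is not the paper's module: the abstract gives no implementation details, so the sketch assumes that an instance is a small matrix of token vectors, that only the leading component is removed, and that power iteration is an acceptable way to estimate it. The function name and the toy vectors are hypothetical.

```python
import math
import random


def remove_top_component(vectors, iterations=100, seed=0):
    """Centre one instance's vectors and project out the leading principal direction."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centred = [[v[j] - mean[j] for j in range(d)] for v in vectors]

    # Random unit start vector for power iteration (assumption: not orthogonal
    # to the leading direction, which holds with probability one).
    rng = random.Random(seed)
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in direction))
    direction = [x / norm for x in direction]

    # Power iteration on X^T X approximates the leading principal direction.
    for _ in range(iterations):
        scores = [sum(row[j] * direction[j] for j in range(d)) for row in centred]
        updated = [sum(scores[i] * centred[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(u * u for u in updated))
        if norm == 0:
            break  # the vectors have no variance along the current direction
        direction = [u / norm for u in updated]

    # Subtract each centred vector's projection onto that direction.
    cleaned = []
    for row in centred:
        score = sum(row[j] * direction[j] for j in range(d))
        cleaned.append([row[j] - score * direction[j] for j in range(d)])
    return cleaned, direction


# Toy two-dimensional "token embeddings" that vary mostly along the diagonal.
tokens = [[1.0, 1.0], [2.0, 2.1], [3.0, 2.9], [4.0, 4.0]]
cleaned, direction = remove_top_component(tokens)
```

After the call, every row of `cleaned` is orthogonal to `direction`, so the dominant axis of variation within the instance has been removed while the remaining structure is kept.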
Benchmark evaluation for tasks with highly subjective crowdsourced annotations: Case study in argument mining of political debates
This paper assesses the feasibility of using crowdsourcing techniques for subjective tasks, like the identification of argumentative relations in political debates, and analyses their inter-annotator metrics, common sources of error and disagreements. We aim to address how best to evaluate subjective crowdsourced annotations, which often exhibit significant annotator disagreements and contribute to a "quality crisis" in crowdsourcing. To do this, we compare two datasets of crowd annotations for argumentation mining performed by an open crowd with quality control settings and a small group of master annotators without these settings but with several rounds of feedback. Our results show high levels of disagreement between annotators with a rather low Krippendorff's alpha, a commonly used inter-annotator metric. This metric also fluctuates greatly and is highly sensitive to the amount of overlap between annotators, whereas other common metrics like Cohen's and Fleiss' kappa are not suitable for this task due to their underlying assumptions. We evaluate the appropriateness of Krippendorff's alpha for this type of annotation and find that it may not be suitable for cases with many annotators coding only small subsets of the data. This highlights the need for more robust evaluation metrics for subjective crowdsourcing tasks. Our datasets provide a benchmark for future research in this area and can be used to increase data quality, inform the design of further work, and mitigate common errors in subjective coding, particularly in argumentation mining.
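For readers unfamiliar with the metric discussed above, the sketch below computes Krippendorff's alpha for nominal labels using the standard coincidence-matrix formulation. It is a minimal illustration rather than the authors' evaluation code; the function name, the input layout (one list of labels per annotated item) and the example labels are assumptions made here.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` holds one list per item, containing the labels that item received.
    Items may have different numbers of annotators.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # an item labelled once carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1 - observed / expected


# Hypothetical relation labels from two annotators on four items.
units = [["support", "support"], ["support", "attack"],
         ["attack", "attack"], ["attack", "attack"]]
alpha = krippendorff_alpha_nominal(units)  # 1 - (2 * 7) / 30 = 8/15
```

Items with a single label are skipped entirely, so when many annotators each code only a small subset of the data, the estimate rests on the few items that happen to overlap.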
ConversationMoC: encoding conversational dynamics using multiplex network for identifying moment of change in mood and mental health classification
Understanding mental health conversation dynamics is crucial, yet prior studies often overlooked the intricate interplay of social interactions. This paper introduces a unique conversation-level dataset and investigates the impact of conversational context in detecting Moments of Change (MoC) in individual emotions and classifying Mental Health (MH) topics in discourse. In this study, we differentiate between analyzing individual posts and studying entire conversations, using sequential and graph-based models to encode the complex conversation dynamics. Further, we incorporate emotion and sentiment dynamics with social interactions using a graph multiplex model driven by Graph Convolution Networks (GCN). Comparative evaluations consistently highlight the enhanced performance of the multiplex network, especially when combining reply, emotion, and sentiment network layers. This underscores the importance of understanding the intricate interplay between social interactions, emotional expressions, and sentiment patterns in conversations, especially within online mental health discussions. We are sharing our new dataset (ConversationMoC) and models with the broader research community to facilitate further research.
Harmful sharenting in the UK: Protecting children from digital harm, long version
Sharenting - the sharing of children’s personal information by parents on social media - has become a widespread practice. While often well-intentioned, it exposes children to digital harm. Examples include identity-related crimes, harassment, cyberbullying, contact from strangers, and privacy breaches. Funded by the Economic and Social Research Council (ESRC), and led by researchers at the University of Southampton, the ProTechThem interdisciplinary research project brings together social and computer science expertise to investigate whether and how sharenting leads to serious (cyber) crimes and harms against affected children. The project reveals that current regulations, platforms’ safety provisions, and parental cybersecurity measures are insufficient to protect affected children from harm. This brief outlines victimisations experienced by children due to sharenting and proposes actionable policy recommendations for a safer digital future.
AI large language models inquiry: TASHub response
Policy submission to the Consultation by Communications and Digital Committee, House of Lords, AI Large Language Models Inquiry.
