International Journal of Digital Curation
Transparent Disclosure, Curation & Preservation of Dynamic Digital Resources
This paper explores an enhanced curation lifecycle being developed at the UK Data Service (UKDS), with our Data Product Builder. Through a Graphical User Interface, we aim to provide the researcher with a tailored digital resource. We detail the threefold motivation behind this initiative: data dissemination scalability, researcher satisfaction and the reduction of nationwide duplication of research effort.
Subsequent sections detail the technical components and challenges involved. In addition to more standard data subsetting, filtering and linking components, this data dissemination platform offers dynamic disclosure assessments – identifying combinations of variables that present a potential disclosure risk. All components are underpinned by the Data Documentation Initiative’s new Cross-Domain Integration standard (DDI-CDI), designed to handle the many structures in which data may be organised.
Ever conscious of the scale of the task we are embarking on, we remain motivated by the need for such advances in data dissemination and optimistic about the feasibility of such a system to meet the needs of the researcher while balancing the data disclosivity concerns of the data depositor.
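To make the notion of a dynamic disclosure assessment concrete, the following is a minimal Python sketch of one common way to flag risky variable combinations: counting how many records share each combination of values and reporting combinations where the smallest group falls below a threshold. It is not the UKDS Data Product Builder's method, which the abstract does not specify. The function name `risky_combinations`, the threshold `k`, the `max_size` limit and the sample variables are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def risky_combinations(records, variables, k=3, max_size=2):
    """Return (combination, smallest_group_size) pairs whose smallest group
    of identical value tuples has fewer than k records.

    records   : list of dicts, one per respondent (hypothetical layout)
    variables : names of the variables to test
    k         : assumed minimum acceptable group size
    max_size  : largest number of variables combined in one check
    """
    if not records:
        return []
    risky = []
    for size in range(1, max_size + 1):
        for combo in combinations(variables, size):
            # Count how many records share each tuple of values.
            counts = Counter(tuple(rec[v] for v in combo) for rec in records)
            smallest = min(counts.values())
            if smallest < k:
                risky.append((combo, smallest))
    return risky


# Hypothetical sample: each variable alone is safe at k=2,
# but region and sex together single out every respondent.
sample = [
    {"age": "30-39", "region": "North", "sex": "F"},
    {"age": "30-39", "region": "North", "sex": "M"},
    {"age": "30-39", "region": "South", "sex": "F"},
    {"age": "30-39", "region": "South", "sex": "M"},
]
flagged = risky_combinations(sample, ["age", "region", "sex"], k=2)
```

A check of this kind grows combinatorially with `max_size`, which is one reason a production system would likely need something more selective than exhaustive enumeration.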
Closing Gaps: A Model of Cumulative Curation and Preservation Levels for Trustworthy Digital Repositories
Curation and preservation measures carried out by digital repository staff are an important building block in maintaining the accessibility and usability of digital resources over time. The measures adequate to achieve long-term usability for a given audience strongly depend on scenarios of (re)use, the (intended) users’ needs and skills, the organisational setting (e.g., mission, resources, policies), as well as the characteristics of the digital objects to be preserved. The assessment of curation and preservation measures also forms an important part of existing certification procedures for trustworthy digital repositories (TDRs) as offered, for example, by the CoreTrustSeal foundation, the nestor network, or ISO.
The digital curation community is presented with the challenge of finding community-, organisation-, and object-specific approaches to curation and preservation at the same time as defining the minimum level of curation and preservation measures expected from a TDR in sufficiently generic terms to ensure applicability to a wide array of repositories. Against this backdrop, this paper discusses the need for and benefits of community-agreed levels of curation and preservation to address this challenge, and considers the tiered model proposed by the CoreTrustSeal Board as an example.
The proposed model is then applied in an analysis of successful CoreTrustSeal applications from 2018–2022 in an effort to better understand the capacity of the curation and preservation levels to capture the respective practices of repositories and to identify potential gaps
Adapting FAIR Evaluation to Photon and Neutron Facilities
The FAIR principles have become essential in establishing transparent and trustworthy research practices. However, the FAIR principles are guidelines indicating the features expected for data to be FAIR, and do not stipulate evaluation criteria. Consequently, there has been a proliferation of approaches to FAIR evaluation to substantiate claims for FAIR-ness, establish baselines, and measure improvement. Some approaches are focussed on FAIR-ness of individual datasets, others of repositories; some require extensive human evaluation, others use automation. However, within some scientific domains, data generation and management follow well-defined processes that result in datasets annotated with metadata and archived in repositories. Existing FAIR evaluation methods consider in less detail the contribution of the processes used in collecting and analysing data and how these enable FAIR-ness.
We describe the evaluation approach adopted for FAIR self-evaluation for Photon and Neutron Research Infrastructures (PaN RIs). We review selected examples of existing FAIR evaluation frameworks designed to enable assessment at different levels, and outline four dimensions that characterise them. As no existing framework met our specific need to focus on FAIR workflows and processes in PaN RIs, it was necessary to select, combine, and adapt existing frameworks, and we developed an approach drawing heavily on the original FAIR principles, the RDA FAIR Data Maturity Model, and FAIRsFAIR’s CoreTrustSeal+FAIRenabling framework. Post-evaluation feedback from ExPaNDS partners indicated that they found the FAIR self-evaluation a useful and valuable exercise for understanding current levels of FAIR-ness at their facilities and for articulating what implementations they have in progress or planned to support FAIR in future.
Resolving Conflicts in Data Through Curation-informed Weight Distribution Networks
Missing and conflicting data values create problems when integrating datasets from multiple collections. Moreover, when the collections to be integrated are large and continuously updated, it is not feasible to manually resolve these problems. Instead, disagreements and gaps should be resolved in an automated fashion. To achieve good quality integrated datasets automatically, we introduce the Curation-informed Weight Distribution Network (CiWDN), a method that suggests which collection is more reliable in providing a data value in question. CiWDN adapts the PageRank algorithm (PR) to assign and distribute weights across data fields present in the different collections. Weight assignment is rooted in data curation best practices as metrics of a collection's reliability. The metrics include: a) data completeness, b) data coincidence, and c) data consistency over time. Final weights used as collection ranks provide the basis to resolve conflicts between different collections contributing a data value for a given field. CiWDN relies on a data dictionary that normalizes fields across collections, and is implemented on a graph database. We demonstrate CiWDN’s capability using the case of ASTRIAGraph, a knowledge system built to increase transparency of activities in Earth’s orbital environment. CiWDN can assess the reliability of data collections that conflict on space object characteristic data fields, which can be used to resolve the differences. This method for computing collections' reliability can be ported to curate other types of large integrated datasets for use in machine learning and other data-driven applications.
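The general idea of ranking collections with a PageRank-style weight distribution and then using the ranks to pick a value can be sketched in a few lines of Python. This is plain weighted PageRank by power iteration, not the CiWDN algorithm itself, whose weighting scheme and graph layout are not given in the abstract. The edge weights, the collection names, the damping factor of 0.85 and the helper `resolve_conflict` are assumptions for illustration only.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration.

    edges maps a source node to a dict of {target node: non-negative weight}.
    Returns a dict of node -> rank; the ranks sum to 1.
    """
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for source in nodes:
            targets = edges.get(source, {})
            total = sum(targets.values())
            if total == 0:
                # Dangling node: spread its rank evenly over all nodes.
                for node in nodes:
                    new_rank[node] += damping * rank[source] / n
            else:
                for target, weight in targets.items():
                    new_rank[target] += damping * rank[source] * weight / total
        rank = new_rank
    return rank


def resolve_conflict(candidates, rank):
    """Pick the value offered by the highest-ranked collection.

    candidates maps a collection name to the value it proposes for a field.
    """
    best = max(candidates, key=lambda name: rank.get(name, 0.0))
    return candidates[best]


# Hypothetical graph: an edge's weight stands in for a curation metric
# (for example, how often the target agrees with the source).
edges = {
    "collection_a": {"collection_b": 1.0},
    "collection_b": {"collection_a": 1.0, "collection_c": 1.0},
    "collection_c": {"collection_a": 2.0},
}
ranks = pagerank(edges)
chosen = resolve_conflict({"collection_a": 7200.0, "collection_c": 7150.0}, ranks)
```

In this sketch the conflict is settled by a simple winner-takes-all rule; a weighted vote across collections would be an equally plausible reading of how collection ranks could be used.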
Trusted Research Environments: Analysis of Characteristics and Data Availability
Trusted Research Environments (TREs) enable the analysis of sensitive data under strict security assertions that protect the data with technical, organizational, and legal measures from (accidentally) being leaked outside the facility. While many TREs exist in Europe, little information is available publicly on the architecture and descriptions of their building blocks and their slight technical variations. To highlight these problems, an overview of the existing, publicly described TREs and a bibliography linking to the system description are provided. Their technical characteristics, especially in commonalities and variations, are analysed, and insight is provided into their data type characteristics and availability. The literature study shows that 47 TREs worldwide provide access to sensitive data, of which two-thirds provide data predominantly via secure remote access. Statistical offices (SOs) make the majority of sensitive data records included in this study available.
Community-based Curate-a-Thons to Enhance Preservation of Global Genetic Biodiversity Data: A Practical Case Study
Science, Technology, Engineering, and Mathematics (STEM) and Research Data Librarians collaborated with an international research team of conservation geneticists to create an instructional and practical guide combining genetic biodiversity initiatives and data curation. Over the course of two months, the academic librarians held multiple community-based Curate-a-Thons where an international group of students, researchers, librarians, and faculty researchers participated in tracking down publications and metadata for genomic sequence data, thus crowd-sourcing this effort of metadata enhancement. This article details the successful Curate-a-Thon design and implementation process; the openly available instructional materials created and used to host the Curate-a-Thons; and the challenges and successes of these community-based events.
Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation
We propose a new approach for modeling and reconciling conflicting data cleaning actions. Such conflicts arise naturally in collaborative data curation settings where multiple experts work independently and then aim to put their efforts together to improve and accelerate data cleaning. The key idea of our approach is to model conflicting updates as a formal argumentation framework (AF). Such argumentation frameworks can be automatically analyzed and solved by translating them to a logic program PAF whose declarative semantics yield a transparent solution with many desirable properties, e.g., uncontroversial updates are accepted, unjustified ones are rejected, and the remaining ambiguities are exposed and presented to users for further analysis. After motivating the problem, we introduce our approach and illustrate it with a detailed running example introducing both well-founded and stable semantics to help understand the AF solutions. We have begun to develop open source tools and Jupyter notebooks that demonstrate the practicality of our approach. In future work we plan to develop a toolkit for conflict resolution that can be used in conjunction with OpenRefine, a popular interactive data cleaning tool
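The behaviour described above (uncontroversial updates accepted, unjustified ones rejected, remaining ambiguities exposed) matches the grounded labelling of an argumentation framework, which corresponds to the well-founded semantics. The following Python sketch computes that labelling with a direct fixpoint loop rather than through a logic program, so it is a stand-in for, not a reproduction of, the authors' PAF encoding. The function name `grounded_labelling` and the example update names are hypothetical.

```python
def grounded_labelling(arguments, attacks):
    """Compute the grounded labelling of an argumentation framework.

    arguments : iterable of argument names (must include every name
                that appears in attacks)
    attacks   : set of (attacker, target) pairs
    Returns three sets: (accepted, rejected, undecided).
    """
    arguments = list(arguments)
    attackers = {arg: set() for arg in arguments}
    for source, target in attacks:
        attackers[target].add(source)

    accepted, rejected = set(), set()
    changed = True
    while changed:
        changed = False
        for arg in arguments:
            if arg in accepted or arg in rejected:
                continue
            if attackers[arg] <= rejected:
                # Every attacker is already rejected (or there are none).
                accepted.add(arg)
                changed = True
            elif attackers[arg] & accepted:
                # At least one attacker is accepted.
                rejected.add(arg)
                changed = True
    undecided = set(arguments) - accepted - rejected
    return accepted, rejected, undecided


# Hypothetical cleaning actions: u1 overrides u2, u2 overrides u3,
# while u4 and u5 contradict each other.
updates = ["u1", "u2", "u3", "u4", "u5"]
conflicts = {("u1", "u2"), ("u2", "u3"), ("u4", "u5"), ("u5", "u4")}
accepted, rejected, undecided = grounded_labelling(updates, conflicts)
```

Here `u4` and `u5` stay undecided because each attacks the other; under stable semantics the same pair would instead yield two alternative solutions, one accepting each update.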
Artificial Intelligence Assisted Curation of Population Groups in Biomedical Literature
Curation of the growing body of published biomedical research is of great importance to both the synthesis of contemporary science and the archiving of historical biomedical literature. Each of these tasks has become increasingly challenging given the expansion of journal titles, preprint repositories and electronic databases. Added to this challenge is the need for curation of biomedical literature across population groups to better capture study populations for improved understanding of the generalizability of findings. To address this, our study aims to explore the use of generative artificial intelligence (AI) in the form of large language models (LLMs) such as GPT-4 as an AI curation assistant for the task of curating biomedical literature for population groups. We conducted a series of experiments which qualitatively and quantitatively evaluate the performance of OpenAI’s GPT-4 in curating population information from biomedical literature. Using OpenAI’s GPT-4 and curation instructions, executed through prompts, we evaluate the ability of GPT-4 to classify study ‘populations’, ‘continents’ and ‘countries’ from a previously curated dataset of public health COVID-19 studies.
Using three different experimental approaches, we examined performance by: A) evaluation of accuracy (concordance with human curation) using both exact and approximate string matches within a single experimental approach; B) evaluation of accuracy across experimental approaches; and C) conducting a qualitative phenomenology analysis to describe and classify the nature of difference between human curation and GPT curation. Our study shows that GPT-4 has the potential to provide assistance in the curation of population groups in biomedical literature. Additionally, phenomenology provided key information for prompt design that further improved the LLM’s performance in these tasks. Future research should aim to improve prompt design, as well as explore other generative AI models to improve curation performance. An increased understanding of the populations included in research studies is critical for the interpretation of findings, and we believe this study provides keen insight on the potential to increase the scalability of population curation in biomedical studies
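Concordance measured with both exact and approximate string matches, as in approach A above, can be sketched as follows. The abstract does not say which approximate matching method the study used, so this example assumes the similarity ratio from Python's standard `difflib.SequenceMatcher` with a cut-off of 0.8. The function names, the threshold and the sample labels are illustrative assumptions, not the study's data or code.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def concordance(human, model, threshold=0.8):
    """Return (exact_rate, approximate_rate) between two label lists.

    An exact match also counts as an approximate match. The threshold
    is an assumed cut-off on SequenceMatcher's similarity ratio.
    """
    if len(human) != len(model) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    exact = approximate = 0
    for human_label, model_label in zip(human, model):
        h, m = normalise(human_label), normalise(model_label)
        if h == m:
            exact += 1
            approximate += 1
        elif SequenceMatcher(None, h, m).ratio() >= threshold:
            approximate += 1
    total = len(human)
    return exact / total, approximate / total


# Hypothetical country labels from a human curator and from a model.
human_labels = ["United States", "Brazil", "Kenya"]
model_labels = ["united states", "Brasil", "India"]
exact_rate, approx_rate = concordance(human_labels, model_labels)
```

In this sample the spelling variant "Brasil" counts only as an approximate match, while "India" against "Kenya" counts as a disagreement under both measures.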
DMPs as Management Tool for Intellectual Assets by SMART-metrics
Data Management Plans (DMPs) are vital components of effective research data management (RDM). They serve not only as organisational tools but also as a structured framework dictating the collection, processing, sharing/publishing, and management of data throughout the research data life cycle. This can include existing data curation standards, the establishment of data handling protocols, and the creation, when necessary, of community curation policies. Therefore, DMPs present a unique opportunity to harmonise project management efforts for optimising the formulation and execution of project objectives.
To harness the full potential of DMPs as project management tools, the SMART approach (i.e., Specific, Measurable, Achievable, Relevant, and Time-bound) emerges as a compelling methodology. During the initial stage of the project proposal, drafted SMART metrics can offer a systematic approach to map work packages (WPs) and deliverables to the overarching project objectives. Then, the Principal Investigators (PIs) can assure the consortium that all of the project's potential intellectual assets (i.e., expected research results) were considered properly, as well as their necessary timelines, resources, and execution. It becomes imperative for data stewards (DSs) and governance policymakers to educate and provide guidelines to researchers on the advantages of developing well-curated DMPs that align results with SMART metrics. This alignment ensures that every intellectual asset intended as a research result (e.g., intellectual properties, publications, datasets, and software) within the project is subject to rigorous planning, execution, and accountability.
Consequently, the risk of unforeseen setbacks and/or deviations from the original objectives is minimised, increasing the traceability and transparency of the research data life cycle. In addition, the integration of Technology Readiness Levels (TRLs) into this proposed enhanced DMP provides a systematic method to evaluate the maturity and readiness of technologies across scientific disciplines. Regular TRL assessments will allow PIs: (1) to monitor the WP progress, (2) to adapt research strategies if required, and (3) to ensure the projects remain in line with the SMART metrics drafted in the enhanced DMP before the project started. The TRLs can also help PIs maintain their focus on project milestones and specific tasks aligned with the original objectives, contributing to the overall success of their endeavours, while improving transparency in the reporting and dissemination of the research results.
The paper presents the overall framework for enhancing DMPs as project management tools for any intellectual assets using SMART metrics and TRLs, as well as introducing suggested support services for data stewardship teams to assist PIs when implementing this novel framework effectively
Curation is Communal: Transparency, Trust, and (In)visible Labour
Research about trust and transparency within the realm of research data management and sharing typically centres on accreditation and compliance. Missing from many of these conversations are the social systems and enabling structures that are built on interpersonal connections. As members of the Data Curation Network (DCN), a consortium of United States-based institutional and non-profit data repositories, we have experienced first-hand the effort required to develop and sustain interpersonal trust and the benefits it provides to curation. In this paper, we reflect on the well-documented realities of curator and labour invisibility; the importance of fostering active communities (such as the DCN); and how trust, vulnerability and connectivity among colleagues lead to better curation practices. Through an investigation into data curators in the DCN, we found that, while curation can be isolating and invisible work, having a network of trusted peers helps alleviate these burdens and makes us better curators. We conclude with practical suggestions for implementing trust and transparency in relationships with colleagues and researchers.