International Journal of Digital Curation
605 research outputs found
The Transparency of an Honest Data Broker in Providing Electronic Health Record Data Sufficient for Reuse
Electronic Health Records (EHRs) offer a rich data source for clinical researchers to assess a wide variety of treatments and outcomes. Researchers can use Honest Data Brokers (HDBs) to gain access to EHR data for research. Unfortunately, HDBs’ data analysts can inadvertently overlook nuances in clinical workflows when generating request code for data capture and cause data extraction errors that negatively impact EHR data quality. This paper presents findings from interviews with clinical researchers who have experience requesting EHR data from the Regenstrief Institute Data Core (RDC), an HDB with data analysts who provide authorised access to EHR data of nearly 25 million patients in the state of Indiana. Our participants wanted greater transparency when they had questions about the quality of the datasets they received. Participants wanted their data analysts to double-check the data in their system, explain how they extracted the data, let them visit the data in their system, and/or view/edit the code used for data extraction. We offer a set of recommendations for HDBs on how to provide greater transparency to clinical researchers about the processes used to generate the EHR data they receive and discuss future directions for research.
Infra Finder: a New Tool to Enhance Transparency, Discoverability and Trust in Open Infrastructure
This paper describes Infra Finder, a new tool built by Invest in Open Infrastructure (IOI) to help institutional budget holders and libraries make more informed decisions around adoption of and investment in open infrastructure. Through increased transparency and discoverability, we aim for this tool to foster trust in the decision-making process and to help build connections between services, users, and funders. The design of Infra Finder is intended to contribute to ongoing discussions and developments regarding trust and transparency in open scholarly infrastructure, as well as help level the playing field between organizations with limited resources to conduct extensive due diligence processes and those with their own analyst teams. In this work, we describe the landscape analysis that led to the creation of Infra Finder, the use cases for the tool, and the approach IOI is taking to create and foster use of Infra Finder in the open infrastructure environment. We also address some of the principles of trust in open source and open infrastructure that have informed and impacted the Infra Finder project and our work in creating this tool.
Preserving Secondary Knowledge: Using Language Models for Software Preservation
Emulation and migration are still our main tools for digital curation and preservation practice. Both strategies have been discussed extensively and have been demonstrated to be effective and applicable in various scenarios. Discussions have primarily centered on technical feasibility, workflow integration, and usability. However, one important aspect of these two techniques remains: managing and preserving operational knowledge. Both approaches require specialized knowledge, but emulation in particular requires future users to have wide-ranging knowledge of past software and computer systems to operate it successfully. We investigate how this knowledge can be stored and utilized, and to what extent it can be rendered machine-actionable, using modern large language models. We demonstrate a proof-of-concept implementation that operates an emulated software environment through natural language.
Long-Term Preservation and Reusability of Open Access Scholar-Led Press Monographs
This brief report outlines some initial findings and challenges identified by the Community-Led Open Publication Infrastructures for Monographs (COPIM) project when looking to archive and preserve open access books produced by small, scholar-led presses. This paper is based on the research conducted by Work Package 7 in COPIM, which has a focus on the preservation and archiving of open access monographs in all their complexity, along with any accompanying materials. 
Cluster Analysis of Open Research Data: A Case for Replication Metadata
Research data are often released upon journal publication to enable result verification and reproducibility. For that reason, research dissemination infrastructures typically support diverse datasets coming from numerous disciplines, from tabular data and program code to audio-visual files. Metadata, or data about data, is critical to making research outputs adequately documented and FAIR. Aiming to contribute to the discussions on the development of metadata for research outputs, I conducted an exploratory analysis to determine how research datasets cluster based on what researchers organically deposit together. I use the content of over 40,000 datasets from the Harvard Dataverse research data repository as my sample for the cluster analysis. I find that the majority of the clusters are formed by single-type datasets, while in the rest of the sample, no meaningful clusters can be identified. For the result interpretation, I use the metadata standard employed by DataCite, a leading organization for documenting the scholarly record, and map existing resource types to my results. About 65% of the sample can be described with a single metadata type (such as Dataset, Software or Report), while the rest would require aggregate metadata types. Though DataCite supports an aggregate type such as a Collection, I argue that a significant number of datasets, in particular those containing both data and code files (about 20% of the sample), would be more accurately described by a Replication resource metadata type. Such a resource type would be particularly useful in facilitating research reproducibility.
E-Preservation of Old and Rare Books: A Structured Approach for Creating a Digital Collection
Antique books and old and rare documents are fragile and vulnerable to various hazards. Preserving them for an extended period is a real challenge. Since ancient times, people have expressed their knowledge by writing and keeping records, and later generations collected and stored these as antique materials. These can be seen in museums, libraries, archives, individual households, and other places all over the world. Preserving and conserving these antique, old and rare books and documents in good condition is a challenge for librarians, conservators, preservation administrators and others responsible for storing them. In this paper, details of the digital preservation of such a collection available in the Directorate of Historical and Antiquarian Studies (DHAS), Guwahati, Assam, India, are discussed. DHAS is a wing of the Government of Assam and is mainly mandated to collect, preserve and research historical and antiquarian resources. The collection of DHAS is one of the oldest collections and has been serving as a study and research centre in Assam since 1928. A special drive has been undertaken for the digital preservation of an identified part of the collection, with grant support from the National Archives of India. This paper discusses the entire project process, from the formulation of the project proposal to the structuring of the digital collection. It discusses in sequence the steps involved in digitizing a collection of 241 old and rare books from the main collection of DHAS.
Analysis of U.S. Federal Funding Agency Data Sharing Policies: 2020 Highlights and Key Observations
Federal funding agencies in the United States (U.S.) continue to work towards implementing their plans to increase public access to funded research and comply with the 2013 Office of Science and Technology memo Increasing Access to the Results of Federally Funded Scientific Research. In this article we report on an analysis of research data sharing policy documents from 17 U.S. federal funding agencies as of February 2021. Our analysis is guided by two questions: 1.) What do the findings suggest about the current state of and trends in U.S. federal funding agency data sharing requirements? 2.) In what ways are universities, institutions, associations, and researchers affected by and responding to these policies? Over the past five years, policy updates were common among these agencies and several themes have been thoroughly developed in that time; however, uncertainty remains around how funded researchers are expected to satisfy these policy requirements.
If Data is Used in the Forest and No-one is Around to Hear it, Did it Happen? A Citation Count Investigation
In this article I describe the process and results of tracking a citation from a data repository through the article publication process and trying to add a citation event to one of our DOIs. I also discuss some other confusing aspects related to citation counts as indicated in various systems, including reference managers, the publisher’s perspective, aggregators, and DOI minters. I discovered numerous problems with citations. Addressing these problems is important as citations can be key to determining both the original use and reuse of a dataset, especially for repositories that do not track usage by requiring people to login or provide an email to download a dataset. The lack of transparency in some data citation systems and processes obscures how and where data is being used. 
Proposal for a Maturity Continuum Model for Open Research Data
As a contribution to the general effort in research to generalize and improve the practices of Open Research Data (ORD), we developed a model conceptualizing the degrees of maturity of a research community in terms of ORD. This model may be used to assess the ORD capacity or maturity level of a specific research community, to strengthen the use of standards with respect to ORD within this community, and to increase its ORD maturity level.
We present the background and our motivations for developing such an instrument as well as the reasoning leading to its design. We present its elements in detail and discuss possible applications. 
Data Curation in Interdisciplinary and Highly Collaborative Research
This paper provides a systematic analysis of publications that discuss data curation in interdisciplinary and highly collaborative research (IHCR). Using content analysis methodology, it examined 159 publications and identified patterns in definitions of interdisciplinarity, projects’ participants and methodologies, and approaches to data curation. The findings suggest that data is a prominent component in interdisciplinarity. In addition to crossing disciplinary and other boundaries, IHCR is defined as curating and integrating heterogeneous data and creating new forms of knowledge from it. Using personal experiences and descriptive approaches, the publications discussed challenges that data curation in IHCR faces, including an increased overhead in coordination and management, lack of consistent metadata practices, and custom infrastructure that makes interoperability across projects, domains, and repositories difficult. The paper concludes with suggestions for future research.