International Journal of Digital Curation
605 research outputs found
Long-Term Data Preservation Data Lifecycle, Standardisation Process, Implementation and Lessons Learned
Science and Earth Observation data today represent a unique and valuable asset for humankind that should be preserved without time constraints and kept accessible and exploitable by current and future generations. In Earth Science, knowledge of the past and tracking of its evolution underpin our capability to respond effectively to the global changes that are putting increasing pressure on the environment and on human society. This can only be achieved if long time series of data are properly preserved and made accessible to support international initiatives. Within ESA Member States and beyond, Earth Science data holders are increasingly coordinating data preservation efforts to ensure that the valuable data are safeguarded against loss and kept accessible and useable for current and future generations. This task becomes increasingly challenging in view of the existing 40 years’ worth of Earth Science data stored in archives around the world and the massive increase of data volumes expected over the next years from, e.g., the European Copernicus Sentinel missions. Long Term Data Preservation (LTDP) aims to keep information discoverable and accessible in an independent and understandable way, with supporting information that helps ensure authenticity, over the long term. A focal aspect of LTDP is Data Curation. Data Curation refers to the management of data throughout its life cycle. Data Curation activities enable data discovery and retrieval, maintain data quality, add value, and allow data re-use over time. Data Curation includes all the processes that involve data management, such as pre-ingest initiatives, ingest functions, archival storage and preservation, dissemination, and provision of access for a designated community.
The paper presents specific aspects of importance during the entire Earth observation data lifecycle, with respect to evolving data volumes and application scenarios. These particular issues are introduced in the section on 'Big Data' and LTDP. The Data Stewardship Reference lifecycle section describes how the data stewardship activities can be efficiently organised, while the following section addresses the overall preservation workflow and shows the technical steps to be taken during Data Curation. Earth Science Data Curation and preservation should be addressed during all mission stages - from the initial mission planning, throughout the entire mission lifetime, and during the post-mission phase. The Data Stewardship Reference Lifecycle gives a high-level overview of the steps useful for implementing Curation and preservation rules on mission data sets from initial conceptualisation or receipt through the iterative Curation cycle
An Exploratory Analysis of Social Science Graduate Education in Data Management and Data Sharing
Effective data management and data sharing are crucial components of the research lifecycle, yet evidence suggests that many social science graduate programs are not providing training in these areas. The current exploratory study assesses how U.S. masters and doctoral programs in the social sciences include formal, non-formal, and informal training in data management and sharing. We conducted a survey of 150 graduate programs across six social science disciplines, and used a mix of closed and open-ended questions focused on the extent to which programs provide such training and exposure. Results from our survey suggested a deficit of formal training in both data management and data sharing, limited non-formal training, and cursory informal exposure to these topics. Utilizing the results of our survey, we conducted a syllabus analysis to further explore the formal and non-formal content of graduate programs beyond self-report. Our syllabus analysis drew from an expanded seven social science disciplines for a total of 140 programs. The syllabus analysis supported our prior findings that formal and non-formal inclusion of data management and data sharing training is not common practice. Overall, in both the survey and syllabi study we found a lack of both formal and non-formal training on data management and data sharing. Our findings have implications for data repository staff and data service professionals as they consider their methods for encouraging data sharing and prepare for the needs of data depositors. These results can also inform the development and structuring of graduate education in the social sciences, so that researchers are trained early in data management and sharing skills and are able to benefit from making their data available as early in their careers as possible
Sustaining Software Preservation Efforts Through Use and Communities of Practice
The brief history of software preservation efforts illustrates one phenomenon repeatedly: not unlike spinning a plate on a broomstick, it is easy to get things going, but difficult to keep them stable and moving. Within the context of video games and other forms of cultural heritage (where most software preservation efforts have lately been focused), this challenge has several characteristic expressions, some technical (e.g., the difficulty of capturing and emulating protected binary files and proprietary hardware), and some legal (e.g., providing archive users with access to preserved games in the face of variously threatening end user licence agreements). In other contexts, such as the preservation of research-oriented software, there can be additional challenges, including insufficient awareness and training on unusual (or even unique) software and hardware systems, as well as a general lack of incentive for preserving “old data.” We believe that in both contexts, there is a relatively accessible solution: the fostering of communities of practice. Such groups are designed to bring together like-minded individuals to discuss, share, teach, implement, and sustain special interest groups—in this case, groups engaged in software preservation.
In this paper, we present two approaches to sustaining software preservation efforts via community. The first is emphasizing within the community of practice the importance of “preservation through use,” that is, preserving software heritage by staying familiar with how it feels, looks, and works. The second approach for sustaining software preservation efforts is to convene direct and adjacent expertise to facilitate knowledge exchange across domain barriers to help address local needs; a sufficiently diverse community will be able (and eager) to provide these types of expertise on an as-needed basis. We outline here these sustainability mechanisms, then show how the networking of various domain-specific preservation efforts can be converted into a cohesive, transdisciplinary, and highly collaborative software preservation team.
[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]
Three Approaches to Documenting Database Migrations
Database migration is a crucial aspect of digital collections management, yet there are few best practices to guide practitioners in this work. There is also limited research on the patterns of use and processes motivating database migrations. In the “Migrating Research Data Collections” project, we are developing these best practices through a multi-case study of database and digital collections migration. We find that a first and fundamental problem faced by collection staff is a sheer lack of documentation about past database migrations. We contribute a discussion of ways information professionals can reconstruct missing documentation, and three approaches that others might take for documenting migrations going forward.
[This paper is a conference pre-print presented at IDCC 2020 after lightweight peer review.]
A Class Focused Approach to Research Outputs and Policy Literature Metadata
Successful research object sharing requires that systems and users understand the structure, semantics and rules that govern a given research object collection.
A number of metadata standards define ontologies and vocabularies for consistent expression of research object semantics. Supporting, clarifying and sometimes extending these standards are metadata application profiles (MAPs). MAPs play a key role in defining metadata element cardinality and data types. MAPs may also mandate or recommend controlled vocabularies, where metadata standards have not already mentioned these in formal range declarations, encoding schemes and semantics that are to be consumed by external systems. MAPs also guide design options for in-house systems and workflows. In this paper, development of a draft MAP for grey-literature policy and research collections is discussed. The discussion focuses on considerations around the selection and adoption of metadata standards given the research data and literature communities in the APO stakeholder map.
This paper presents a work-in-progress version of a Dublin Core Application Profile (DCAP) candidate. The Analysis & Policy Observatory Metadata Application Profile (APO-MAP) takes research object class structure as a starting point and considers class model options, especially given the availability of registry services and Persistent Identifier (PID) systems. The discussion finds that MAP development progresses towards a best fit that balances the need to adopt widely supported standards, local business drivers, and community acceptance
Practices, Challenges, and Prospects of Big Data Curation: a Case Study in Geoscience
Open and persistent access to past, present, and future scientific data is fundamental for transparent and reproducible data-driven research. The scientific community now faces both challenges and opportunities arising from increasingly complex disciplinary data systems. Concerted efforts from domain experts, information professionals, and Internet technology experts are essential to ensure the accessibility and interoperability of big data. Here we review current practices in building and managing big data within the context of large data infrastructure, using geoscience cyberinfrastructure such as the Interdisciplinary Earth Data Alliance (IEDA) and EarthCube as a case study. Geoscience is a data-rich discipline with a rapid expansion of sophisticated and diverse digital data sets. Having started to embrace the digital age, the community has applied big data and data mining tools to this new type of research. We also identify current challenges, key elements, and prospects for constructing a more robust and future-proof big data infrastructure for research and publication, as well as the roles, qualifications, and opportunities for librarians/information professionals in the data era
Curated Archiving of Research Software Artifacts: Lessons Learned from the French Open Archive (HAL)
Software has become an indissociable support of technical and scientific knowledge. The preservation of this universal body of knowledge is as essential as preserving research articles and data sets. In the quest to make scientific results reproducible, and pass knowledge to future generations, we must preserve these three main pillars: research articles that describe the results, the data sets used or produced, and the software that embodies the logic of the data transformation.
The collaboration between Software Heritage (SWH), the Center for Direct Scientific Communication (CCSD) and the scientific and technical information services (IES) of the French Institute for Research in Computer Science and Automation (Inria) has resulted in a specified moderation and curation workflow for research software artifacts deposited in HAL, the French global open access repository. The curation workflow was developed to help digital librarians and archivists handle this new and peculiar artifact - software source code. While implementing the workflow, a set of guidelines has emerged from the challenges and the solutions put in place to help all actors involved in the process
Extending Support for Publishing Sensitive Research Data at the University of Bristol
The University of Bristol Research Data Service was set up in 2014 to provide support and training for academic staff and postgraduate researchers in all aspects of research data management. As part of this, the data.bris Research Data Repository was developed to provide a publication platform for research data generated at the University of Bristol. Initially launched in 2015 to provide open access to data, since 2017 it has also been possible to publish access-controlled datasets containing sensitive data via this platform.
The vast majority (90%) of datasets published are openly accessible, but there has been steady demand for access-controlled release of datasets containing information that is ethically or commercially sensitive. These cases require careful management of additional risk: for example, where datasets contain information on human participants, balancing the risk of re-identification with the need to provide robust data that maximises research value through re-use. Many groups within the University of Bristol (for example, the Avon Longitudinal Study of Parents and Children) have extensive experience and expertise in this area, but it became apparent that there was a need to provide additional support for researchers who were not able to draw on the experience of these established groups. This practice paper describes the process of setting up a dedicated service to provide training and basic disclosure risk assessments in order to address these skills gaps, and outlines lessons learnt and future directions for the service
Co-Creating Autonomy: Group Data Protection and Individual Self-determination within a Data Commons
Recent privacy scandals such as Cambridge Analytica and the Nightingale Project show that data sharing must be carefully managed and regulated to prevent data misuse. Data protection law, legal frameworks, and technological solutions tend to focus on controller responsibilities as opposed to protecting data subjects from the beginning of the data collection process. Using a case study of how data subjects can be better protected during data curation, we propose that a co-created data commons can protect individual autonomy over personal data through collective curation and rebalance power between data subjects and controllers.
 
Selecting Efficient and Reliable Preservation Strategies
This article addresses the problem of formulating efficient and reliable operational preservation policies that ensure bit-level information integrity over long periods, and in the presence of a diverse range of real-world technical, legal, organizational, and economic threats. We develop a systematic, quantitative prediction framework that combines formal modeling, discrete-event-based simulation, and hierarchical modeling, and then uses empirically calibrated sensitivity analysis to identify effective strategies.
Specifically, the framework formally defines an objective function for preservation that maps a set of preservation policies and a risk profile to a set of preservation costs and an expected collection loss distribution. In this framework, a curator’s objective is to select optimal policies that minimize expected loss subject to budget constraints. To estimate preservation loss under different policy conditions, we develop a statistical hierarchical risk model that includes four sources of risk: the storage hardware; the physical environment; the curating institution; and the global environment. We then employ a general discrete event-based simulation framework to evaluate the expected loss and the cost of employing varying preservation strategies under specific parameterization of risks.
The framework offers flexibility for the modeling of a wide range of preservation policies and threats. Since this framework is open source and easily deployed in a cloud computing environment, it can be used to produce analysis based on independent estimates of scenario-specific costs, reliability, and risks.
We present results summarizing hundreds of thousands of simulations using this framework. This exploratory analysis points to a number of robust and broadly applicable preservation strategies, provides novel insights into specific preservation tactics, and provides evidence that challenges received wisdom
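The selection problem described in this abstract (choose the policy that minimizes expected loss subject to a budget) can be illustrated with a minimal sketch. It is not the authors' framework: it replaces their hierarchical, four-source risk model and discrete-event simulator with a simplified year-by-year Monte Carlo simulation driven by a single failure probability. The names `Policy`, `cost_per_copy`, `annual_failure_prob`, and the audit-and-repair rule are all hypothetical, chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """A hypothetical preservation policy: keep `copies` replicas."""
    copies: int            # number of independent replicas kept
    cost_per_copy: float   # assumed annual cost of one replica


def simulate_loss(policy, annual_failure_prob, years, rng):
    """Return True if the item is lost within `years`.

    Assumption: each year every copy fails independently with the same
    probability, and a year-end audit restores failed copies from any
    survivor. The item is lost only if all copies fail in one year.
    """
    for _ in range(years):
        survivors = sum(
            rng.random() >= annual_failure_prob for _ in range(policy.copies)
        )
        if survivors == 0:
            return True
    return False


def expected_loss(policy, annual_failure_prob, years, trials, seed=0):
    """Estimate the probability of loss as a fraction of simulated trials."""
    rng = random.Random(seed)
    lost = sum(
        simulate_loss(policy, annual_failure_prob, years, rng)
        for _ in range(trials)
    )
    return lost / trials


def total_cost(policy, years):
    """Total cost of running the policy for the whole period."""
    return policy.copies * policy.cost_per_copy * years


def best_policy(policies, budget, annual_failure_prob, years, trials):
    """Pick the affordable policy with the lowest estimated loss."""
    affordable = [p for p in policies if total_cost(p, years) <= budget]
    if not affordable:
        return None
    return min(
        affordable,
        key=lambda p: expected_loss(p, annual_failure_prob, years, trials),
    )
```

Calling `best_policy` with a list of candidate `Policy` objects returns the cheapest-to-lose option that fits the budget, or `None` when nothing is affordable. A fuller model would draw failure rates from separate hardware, environmental, institutional, and global risk sources, as the abstract describes.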