International Journal of Digital Curation
    605 research outputs found

    Towards Environmentally Sustainable Long-term Digital Preservation

    ARCHIVER and Pre-Commercial Procurement funding has enabled small to medium enterprises (SMEs) to innovate and deliver new services for EOSC. Within the framework of the ARCHIVER pre-commercial procurement tender, between December 2020 and August 2021, three commercial consortia competed to deliver innovative prototype solutions for long-term data preservation. Two of them were selected to continue with the pilot phase and deliver research-ready solutions for long-term preservation of research data, thereby filling a gap in the current European Open Science panorama. Digital preservation relies on technological infrastructure (information and communication technology, ICT) that can have environmental impacts. While altering technology usage can reduce the impact of digital preservation practices, this alone is not a strategy for sustainable practice. Moving toward environmentally sustainable digital preservation requires critically examining the motivations and assumptions that shape current practice (Pendergrass et al., 2019).* The use of scalable cloud infrastructures can reduce the environmental impacts of long-term data preservation solutions. * Editor Note: this previously omitted reference has now been included.

    Factors Influencing Perceptions of Trust in Data Infrastructures

    Trust is an essential pre-condition for the acceptance of digital infrastructures and services. Transparency has been identified as one mechanism for increasing trustworthiness. Yet it is difficult to assess to what extent and how exactly different aspects of transparency contribute to trust, or potentially impede it when the information provided is overwhelmingly complex. To address these issues, we performed two initial studies to help determine the factors that influence trust, focusing on transparency across a range of elements associated with data, data infrastructures and virtual research environments. On the one hand, we performed a survey among IT experts in the field of data science focusing on quality aspects in the context of re-using and sharing open source software, assessing issues such as the need for documentation, test cases, and accountability. On the other hand, we complemented this with a set of semi-structured interviews with senior researchers to address specific issues of the degree of transparency achievable with different approaches. These include, for example, the amount of transparency we can achieve with approaches from explainable AI, or the usefulness and limitations of data provenance in determining the suitability of data for reuse, among others. Specifically, we consider mechanisms on three levels: technical, process-oriented, and social. Starting from attributes of trust in the “analogue world”, we aim to understand which of these can be applied in the digital world, how they differ, and what additional mechanisms need to be established in order to support trust in complex socio-technological processes and their emergent results when the traditional approaches can no longer be applied.

    Capturing the Cloud: Towards SharePoint Transfer at UK Parliament

    Since 2020, UK Parliament has moved towards cloud-based ways of working and collaboration, with colleagues across both Houses increasingly storing and sharing most of their information in Microsoft SharePoint. In response to this shift, the Parliamentary Archives sought to establish an end-to-end process to transfer information of archival value out of SharePoint and into the Digital Repository. Three challenges, unique to the cloud-based and collaborative nature of this environment, arose: defining the authoritative version, extracting files from the cloud with properties and metadata intact, and validating and authenticating files in the cloud. This brief report outlines the Parliamentary Archives' efforts to explore and test the transfer and authentication of archival data from the cloud into their digital repository, with a focus on building trust and transparency.

    The Generation of Revision Identifier (rsid) Numbers in MS Word: Implications for Document Analysis

    The 2007 implementation of the Office Open XML standard for Microsoft Word introduced the assignment of individual revision save identifiers (Rsid) to document editing sessions that end in a save action. The relevant standards ECMA (2016) and ISO/IEC 29500-1:2016 (2016) stipulate that these Rsid should be allocated randomly but with increasing numerical value, thereby documenting the progress of the editing. As MS Word is the most ubiquitous word processing software, Rsid appear to be a useful tool to examine and provide evidence for a wide range of common document generation, editing and modification processes and file management operations, with implications for document analysis including, but not limited to, academic integrity issues in student assignment submissions (e.g. contract cheating). This paper presents the results of a series of experiments conducted to assess whether and how well MS Word implements the ECMA and ISO/IEC standards. The results show that the number of allocated Rsid indeed increases with each edit and save action, with the previous Rsid carried over and retained. The newly allocated Rsid, however, do not conform to the standard, as the numerical value of an Rsid associated with a save action may be larger or smaller than any or all of those allocated during previous save actions. The allocation of a new Rsid is not necessarily caused by an edit event; a new Rsid can also be generated if a file is saved as rtf or if it is sent as an e-mail from within MS Word, even though the file was not edited in any way. Rsid numbers are not generated if a person opens an MS Word document, reads it and closes the file without saving, making this action impossible to detect. MS Word template files on a given machine contain document (root) Rsid numbers that are generated when a newly installed application is launched for the first time. As these will be embedded as legacy Rsid into every new file generated from that template file, they act as signatures for all MS Word documents that are created. The experiments have shown that user behaviour has a direct influence on the number of Rsid represented in a given file. Although the implementation of Office Open XML chosen by Microsoft is not compliant with the relevant standards, and thus Rsid cannot be used to determine the exact chronological order of all editing sequences within a given document, the Rsid retain their value for document forensics as they are associated with specific edit events and illuminate the document writing and editing process.
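As a rough illustration of where these identifiers live, the sketch below collects the distinct Rsid values from a .docx package. It assumes the usual Office Open XML layout (`word/document.xml` carrying `w:rsid*` attributes, `word/settings.xml` carrying `w:rsidRoot` and `w:rsid` elements) and uses regular expressions rather than a full XML parse. The function names are hypothetical and are not taken from the paper.

```python
import re
import zipfile

# Attribute form, e.g. w:rsidR="00A1B2C3" or w:rsidRDefault="00A1B2C3"
RSID_ATTR = re.compile(r'\bw:rsid\w*="([0-9A-Fa-f]{8})"')
# Element form used in settings.xml, e.g. <w:rsidRoot w:val="..."/> or <w:rsid w:val="..."/>
RSID_ELEM = re.compile(r'<w:rsid(?:Root)?\s+w:val="([0-9A-Fa-f]{8})"')


def rsids_in_xml(xml_text):
    """Return the set of distinct Rsid values (upper-case hex) found in one XML part."""
    values = RSID_ATTR.findall(xml_text) + RSID_ELEM.findall(xml_text)
    return {value.upper() for value in values}


def rsids_in_docx(path_or_file):
    """Return the distinct Rsid values found in the main document and settings parts."""
    found = set()
    with zipfile.ZipFile(path_or_file) as package:
        names = set(package.namelist())
        for part in ("word/document.xml", "word/settings.xml"):
            if part in names:  # a part may be absent in unusual packages
                found |= rsids_in_xml(package.read(part).decode("utf-8"))
    return found
```

Comparing the sets returned for two files shows which identifiers they share. In line with the findings above, the numeric values themselves should not be read as a chronological order.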

    Reproducible and Attributable Materials Science Curation Practices: A Case Study

    While small labs produce much of the fundamental experimental research in Materials Science and Engineering (MSE), little is known about their data management and sharing practices and the extent to which they promote trust in, and transparency of, the published research. In this research, we conduct a case study of a leading MSE research lab to characterize the limits of current data management and sharing practices concerning reproducibility and attribution. We systematically reconstruct the workflows underpinning four research projects by combining interviews, document review, and digital forensics. We then apply information graph analysis and computer-assisted retrospective auditing to identify where critical research information is unavailable or at risk. We find that while data management and sharing practices in this leading lab protect against computer and disk failure, they are insufficient to ensure reproducibility or correct attribution of work, especially when a group member withdraws before project completion. We conclude with recommendations for adjustments to MSE data management and sharing practices to promote trustworthiness and transparency by adding lightweight automated file-level auditing and automated data transfer processes.

    Assessing Quality Variations in Early Career Researchers’ Data Management Plans

    This paper aims to better understand early career researchers’ (ECRs’) research data management (RDM) competencies by assessing the contents and quality of data management plans (DMPs) developed during a multi-stakeholder RDM course. We also aim to identify differences between DMPs in relation to several background variables (e.g., discipline, course track). The Basics of Research Data Management (BRDM) course has been held in two multi-faculty, research-intensive universities in Finland since 2020. In this study, 223 ECRs’ DMPs created in the BRDM courses of 2020-2022 were assessed, using the recommendations and criteria of the Finnish DMP Evaluation Guide + General Finnish DMP Guidance (FDEG). The median quality of DMPs appeared to be satisfactory. The differences in rating according to FDEG’s three-point performance criteria were not statistically significant between DMPs developed in separate years, course tracks or disciplines. However, using content analysis, differences were found between disciplines or course tracks regarding DMPs’ key characteristics such as sharing, storing, and preserving data. DMPs that contained a data table (DtDMPs) also differed highly significantly from prose DMPs. DtDMPs better acknowledged the data handling needs of different data types and improved the overall quality of a DMP. The results illustrated that the ECRs had learned the basic RDM competencies and grasped their significance to the integrity, reliability, and reusability of data. However, more focused further training is needed to reach advanced competency, especially in the areas of handling and sharing personal data, legal issues, long-term preservation, and funders’ data policies. Equally important to the cultural change in which RDM becomes an organic part of research practices is merging research support services, processes, and infrastructure into the research projects’ processes. Additionally, incentives are needed for sharing and reusing data.

    Metrics to Increase Data Usage Understanding and Transparency

    Data metrics are essential to assess the impact of data repositories' holdings and to understand the research practices of the community that they serve. These metrics are useful for reporting to funders, to inform community engagement strategies, and to direct and sustain repository services. In turn, communicating these metrics to the user community conveys transparency and elicits their trust in data sharing. However, because data metrics are time-sensitive and context-dependent, tracking, interpreting, and communicating them is challenging. In this work we introduce data usage analyses including benchmarking and grouping, developed to better assess the impact of the DesignSafe Data Depot, a natural hazards data repository. Make Data Count compliant metrics are analysed in relation to research methods, sub-disciplines, natural hazard types, and time, to learn what data is being used, what influences data usage, and to establish realistic usage expectations. Results are interpreted in relation to the research and publication practices of the community and to natural hazard events. In addition, we introduce strategies to clearly communicate dataset metrics to users.

    Selecting Efficient and Reliable Preservation Strategies: Modeling Long-term Information Integrity Using Large-scale Hierarchical Discrete Event Simulation

    This article addresses the problem of formulating efficient and reliable operational preservation policies that ensure bit-level information integrity over long periods, and in the presence of a diverse range of real-world technical, legal, organizational, and economic threats. We develop a systematic, quantitative prediction framework that combines formal modelling, discrete-event-based simulation, and hierarchical modelling, and then uses empirically calibrated sensitivity analysis to identify effective strategies. Specifically, the framework formally defines an objective function for preservation that maps a set of preservation policies and a risk profile to a set of preservation costs and an expected collection loss distribution. In this framework, a curator’s objective is to select optimal policies that minimize expected loss subject to budget constraints. To estimate preservation loss under different policy conditions, we develop a statistical hierarchical risk model that includes four sources of risk: the storage hardware; the physical environment; the curating institution; and the global environment. We then employ a general discrete event-based simulation framework to evaluate the expected loss and the cost of employing varying preservation strategies under specific parameterization of risks. Source code is available at: https://github.com/MIT-Informatics/PreservationSimulation. The framework offers flexibility for the modeling of a wide range of preservation policies and threats. Since this framework is open source and easily deployed in a cloud computing environment, it can be used to produce analysis based on independent estimates of scenario-specific costs, reliability, and risk. We present results summarizing hundreds of thousands of simulations using this framework. This exploratory analysis points to a number of robust and broadly applicable preservation strategies, provides novel insights into specific preservation tactics, and provides evidence that challenges received wisdom. An earlier version of this paper was published previously in IJDC 15(1) 202
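One way to read the objective described in this abstract, offered only as a sketch: the symbols below are notation assumed here for illustration and are not taken from the paper. Let $\mathcal{P}$ be the set of candidate preservation policies, $r$ a risk profile, $C$ the preservation cost, $L$ the (random) collection loss, and $B$ the available budget.

```latex
% Requires amsmath (\operatorname*) and amssymb (\mathbb).
\[
  f : (p, r) \longmapsto \bigl( C(p, r),\; L(p, r) \bigr),
  \qquad
  p^{*} = \operatorname*{arg\,min}_{p \in \mathcal{P}} \; \mathbb{E}\bigl[ L(p, r) \bigr]
  \quad \text{subject to} \quad C(p, r) \le B .
\]
```

Under this reading, the simulation supplies estimates of the loss distribution $L(p, r)$ and the cost $C(p, r)$ for each policy and risk parameterization, and the curator compares policies on expected loss within the budget.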

    Have Your Cake and Eat It Too: a Case Study in Updated Modular Workflows for a Longitudinal Research Project

    As datasets have become a more significant aspect of Open Science, attention has turned to the data transformations that drive their creation. Li and Ludäscher have pointed out the importance of identifying data cleaning workflows as a series of modular transformations that can be extracted for reuse. This modular approach aids reproducibility and allows for transparency in data provenance. However, the constantly evolving nature of data science technology means that even once these modules have been identified and implemented, their functionality must be ported to new platforms as old ones become less applicable or less common in a field of study. When such porting takes place, it is important to consider not only practicality and functionality, but also transparency within a data processing team. Clarity of communication within a team is the first step towards providing clear and transparent documentation to the end user. This case study of an updated workflow process for a long-running longitudinal health and well-being study provides practical examples of these principles.

    T-KAER: Transparency-enhanced Knowledge-Augmented Entity Resolution Framework

    Entity resolution (ER) is the process of determining whether two representations refer to the same real-world entity and plays a crucial role in data curation and data cleaning. Recent studies have introduced the KAER framework, aiming to improve pre-trained language models by augmenting external knowledge. However, identifying and documenting the external knowledge that is being augmented and understanding its contribution to the model's predictions have received little to no attention in the research community. This paper addresses this gap by introducing T-KAER, the Transparency-enhanced Knowledge-Augmented Entity Resolution framework. To enhance transparency, three Transparency-related Questions (T-Qs) have been proposed: T-Q(1): What is the experimental process for matching results based on data inputs? T-Q(2): Which semantic information does KAER augment in the raw data inputs? T-Q(3): Which semantic information of the augmented data inputs influences the predictions? To address the T-Qs, T-KAER is designed to improve transparency by documenting the entity resolution processes in log files. In experiments, a citation dataset is used to demonstrate the transparency components of T-KAER. This demonstration showcases how T-KAER facilitates error analysis from both quantitative and qualitative perspectives, providing evidence on "what" semantic information is augmented and "why" the augmented knowledge influences predictions differently.

    522 full texts, 605 metadata records. Updated in last 30 days.