Open tools for quantitative anonymization of tabular phenotype data: literature review
Abstract
Precision medicine relies on molecular and systems biology methods as well as bidirectional association studies of phenotypes and (high-throughput) genomic data. However, the integrated use of such data often faces obstacles, especially with regard to data protection. An important prerequisite for research data processing is usually informed consent. But collecting consent is not always feasible, in particular when data are to be analyzed retrospectively. For phenotype data, anonymization, i.e., the altering of data in such a way that individuals cannot be identified, can provide an alternative. Several re-identification attacks have shown that this is a complex task and that simply removing directly identifying attributes such as names is usually not enough. More formal approaches are needed that use mathematical models to quantify risks and guide their reduction. Due to the complexity of these techniques, it is challenging and not advisable to implement them from scratch. Open software libraries and tools can provide a robust alternative. However, the range of available anonymization tools is also heterogeneous, and obtaining an overview of their strengths and weaknesses is difficult due to the complexity of the problem space. We therefore performed a systematic review of open anonymization tools for structured phenotype data described in the literature between 1990 and 2021. Through a two-step eligibility assessment process, we selected 13 tools for an in-depth analysis. By comparing the supported anonymization techniques and further aspects, such as maturity, we derive recommendations for tools to use for anonymizing phenotype datasets with different properties.
Concept acquisition and improved in-database similarity analysis for medical data
Efficient identification of cohorts of similar patients is a major precondition for personalized medicine. In order to train prediction models on a given medical data set, similarities have to be calculated for every pair of patients, which results in a roughly quadratic data blowup. In this paper we discuss the topic of in-database patient similarity analysis, ranging from data extraction to implementing and optimizing the similarity calculations in SQL. In particular, we introduce the notion of chunking that uniformly distributes the workload among the individual similarity calculations. Our benchmark comprises the application of one similarity measure (cosine similarity) and one distance metric (Euclidean distance) on two real-world data sets; it compares the performance of a column store (MonetDB) and a row store (PostgreSQL) with two external data mining tools (ELKI and Apache Mahout).
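To make the pairwise workload concrete, the following is a minimal Python sketch, not the paper's SQL implementation. It computes cosine similarity and Euclidean distance for every patient pair and processes the pairs in fixed-size batches. The batching in `pair_chunks` is only an assumed, simplified stand-in for the chunking idea described above; the function names and the toy feature vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pair_chunks(n, chunk_size):
    """Yield lists of index pairs (i, j) with i < j, at most chunk_size pairs each."""
    chunk = []
    for pair in combinations(range(n), 2):
        chunk.append(pair)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def pairwise(patients, measure, chunk_size=2):
    """Apply `measure` to every patient pair, one chunk of pairs at a time."""
    results = {}
    for chunk in pair_chunks(len(patients), chunk_size):
        for i, j in chunk:
            results[(i, j)] = measure(patients[i], patients[j])
    return results


# Hypothetical toy data: three patients, two numeric features each.
patients = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
similarities = pairwise(patients, cosine_similarity)
distances = pairwise(patients, euclidean_distance)
```

With n patients there are n(n-1)/2 pairs, which is the roughly quadratic growth the abstract refers to. Batching the pairs keeps each unit of work about the same size.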
Data infrastructures for health research: Ethical framework and legal implementation (Dateninfrastrukturen für die Gesundheitsforschung: Ethische Rahmenbedingungen und rechtliche Umsetzung)
The role of data infrastructures for health research is not limited to acting as a service or interface for data exchange between data producers and data users. Rather, the infrastructure itself is an actor in the process of data sharing and therefore also bears responsibility for this process. This applies first of all to the lawfulness of personal data processing. If data processing is based on the consent of the data subject, the infrastructure must also ensure that all data processing is covered by this consent. If the data processing is based on a statutory basis, the infrastructure must ensure the highest possible level of data protection, in particular through technical and organizational measures. In addition, the infrastructure is also responsible for implementing the rights of data subjects, such as the right to information, rectification or erasure of data, and dealing with incidental or additional findings. The question of how researchers regard their involvement in infrastructure projects and how private companies should be involved in such projects must be based on the principle of public welfare. This is accompanied by the obligation of infrastructures to take into account the principles of participation, transparency, and scientific communication as far as possible. Observing all these ethical and legal aspects is especially important because only by doing so can the trust of all stakeholders be established and thus the central basis for the successful construction and operation of data infrastructures be provided.
tSPM+; a high-performance algorithm for mining transitive sequential patterns from clinical data
The increasing availability of large clinical datasets collected from patients can enable new avenues for computational characterization of complex diseases using different analytic algorithms. One of the promising new methods for extracting knowledge from large clinical datasets involves temporal pattern mining integrated with machine learning workflows. However, mining these temporal patterns is a computationally intensive task that also places heavy demands on memory. Current algorithms, such as the temporal sequence pattern mining (tSPM) algorithm, already provide promising outcomes but still leave room for optimization. In this paper, we present the tSPM+ algorithm, a high-performance implementation of the tSPM algorithm, which adds a new dimension by adding the duration to the temporal patterns. We show that the tSPM+ algorithm provides a speedup of up to a factor of 980 and an up to 48-fold improvement in memory consumption. Moreover, we present a Docker container with an R package. We also provide vignettes for easy integration into existing machine learning workflows and use the mined temporal sequences to identify Post COVID-19 patients and their symptoms according to the WHO definition.
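The idea of transitive sequential patterns with durations can be illustrated with a small Python sketch. This is not the tSPM+ implementation, which the abstract describes as a high-performance algorithm shipped as an R package. It rests on several assumptions: each patient record is a list of (code, day) events, only the first occurrence of each code is kept, and every ordered pair of codes forms one sequence with its duration in days. All names and the example records are hypothetical.

```python
from collections import Counter


def transitive_sequences(events):
    """Return (earlier, later, duration_days) for every ordered pair of codes.

    `events` is a list of (code, day) tuples; only the first occurrence of
    each code is used (an assumption of this sketch).
    """
    first_seen = {}
    for code, day in events:
        if code not in first_seen or day < first_seen[code]:
            first_seen[code] = day
    # Sort by day, then by code so that same-day events get a stable order.
    ordered = sorted(first_seen.items(), key=lambda item: (item[1], item[0]))
    sequences = []
    for i, (earlier, day_earlier) in enumerate(ordered):
        for later, day_later in ordered[i + 1:]:
            sequences.append((earlier, later, day_later - day_earlier))
    return sequences


def sequence_support(records):
    """Count the number of patients in which each (earlier, later) pair occurs."""
    support = Counter()
    for events in records.values():
        support.update({(a, b) for a, b, _ in transitive_sequences(events)})
    return support


# Hypothetical records: patient id -> list of (code, day) events.
records = {
    "p1": [("covid", 0), ("fatigue", 40), ("cough", 10)],
    "p2": [("covid", 5), ("fatigue", 100)],
}
support = sequence_support(records)
```

Because every earlier code is paired with every later one, a patient with k distinct codes yields k(k-1)/2 sequences, which suggests why runtime and memory become the bottleneck on large datasets.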
FAIRification of real-world data for health research: A plea for modern record linkage (FAIRifizierung von Real World Data für die Gesundheitsforschung: Ein Petitum für modernes Record Linkage)
BACKGROUND: The provision of real-world data according to the FAIR principles is a prerequisite for efficiently exploiting the potential of health data for prevention and care. OBJECTIVES: To discuss the opportunities and limitations of reuse and record linkage of health data in Germany. MATERIALS AND METHODS: Initiatives to establish an improved research data infrastructure are presented and the limitations that hinder record linkage of personal health data are illustrated using an example. RESULTS: In general, health data in Germany do not meet the requirements of the FAIR principles. Findability already fails because either no metadata are available or they are not posted in searchable repositories in a standardized way. Record linkage of personal health data is extremely limited by restrictive data protection regulations and the lack of a unique identifier. Privacy-compliant solutions for linking health data, which are successfully practiced in neighboring European countries, could serve as a model here. CONCLUSIONS: The establishment of a National Research Data Infrastructure (NFDI), especially for personal health data (NFDI4Health), can only be realized with considerable effort and legislative changes. Existing structures and standards, such as those developed by the Medical Informatics Initiative and the Netzwerk Universitätsmedizin (English: University Medicine Network), and international initiatives such as the European Open Science Cloud should be taken into consideration.
Applying FAIRness: Redesigning a Biomedical Informatics Research Data Management Pipeline
- …
