1,721,085 research outputs found

    Pyysalo, Sampo

    No full text

    DrugProt Silver Standard Knowledge Graph

    No full text
    Please cite if you use any DrugProt resource: Miranda, Antonio, et al. "Overview of DrugProt BioCreative VII track: quality evaluation and large scale text mining of drug-gene/protein relations." Proceedings of the seventh BioCreative challenge evaluation workshop. 2021.
    @inproceedings{miranda2021overview,
      title={Overview of DrugProt BioCreative VII track: quality evaluation and large scale text mining of drug-gene/protein relations},
      author={Miranda, Antonio and Mehryary, Farrokh and Luoma, Jouni and Pyysalo, Sampo and Valencia, Alfonso and Krallinger, Martin},
      booktitle={Proceedings of the seventh BioCreative challenge evaluation workshop},
      year={2021}
    }
    Description
    Files:
      drugprot-silver-standard-kg.zip: JSON files with the relations predicted by the DrugProt systems and their precision
      large_scale_network_abstracts.tsv: PubMed abstracts
      large_scale_network_entities.tsv: CHEMICAL/drug and GENE/protein entities predicted by DrugProt NER Taggers
      large_scale_network_pmids.txt: list of PMIDs
    Related resources:
      Web
      DrugProt corpus
      Evaluation library
      Online evaluation (CodaLab)
      Relation annotation guidelines
      Gene and protein annotation guidelines
      Chemicals and drugs annotation guidelines
      FAQ
      DrugProt Large Scale Additional SubTrack
      DrugProt Large Scale document collection protoco

    A Dependency Parsing Approach to Biomedical Text Mining

    Full text link
    Biomedical research is currently facing a new type of challenge: an excess of information, both in terms of raw data from experiments and in the number of scientific publications describing their results. Mirroring the focus on data mining techniques to address the issues of structured data, there has recently been great interest in the development and application of text mining techniques to make more effective use of the knowledge contained in biomedical scientific publications, accessible only in the form of natural human language. This thesis describes research done in the broader scope of projects aiming to develop methods, tools and techniques for text mining tasks in general and for the biomedical domain in particular. The work described here involves more specifically the goal of extracting information from statements concerning relations of biomedical entities, such as protein-protein interactions. The approach taken is one using full parsing (syntactic analysis of the entire structure of sentences) and machine learning, aiming to develop reliable methods that can further be generalized to apply also to other domains. The five papers at the core of this thesis describe research on a number of distinct but related topics in text mining. In the first of these studies, we assessed the applicability of two popular general English parsers to biomedical text mining and, finding their performance limited, identified several specific challenges to accurate parsing of domain text. In a follow-up study focusing on parsing issues related to specialized domain terminology, we evaluated three lexical adaptation methods. We found that the accurate resolution of unknown words can considerably improve parsing performance and introduced a domain-adapted parser that reduced the error rate of the original by 10% while also roughly halving parsing time.
To establish the relative merits of parsers that differ in the applied formalisms and the representation given to their syntactic analyses, we have also developed evaluation methodology, considering different approaches to establishing comparable dependency-based evaluation results. We introduced a methodology for creating highly accurate conversions between different parse representations, demonstrating the feasibility of unification of diverse syntactic schemes under a shared, application-oriented representation. In addition to allowing formalism-neutral evaluation, we argue that such unification can also increase the value of parsers for domain text mining. As a further step in this direction, we analysed the characteristics of publicly available biomedical corpora annotated for protein-protein interactions and created tools for converting them into a shared form, thus contributing also to the unification of text mining resources. The introduced unified corpora allowed us to perform a task-oriented comparative evaluation of biomedical text mining corpora. This evaluation established clear limits on the comparability of results for text mining methods evaluated on different resources, prompting further efforts toward standardization. To support this and other research, we have also designed and annotated BioInfer, the first domain corpus of its size combining annotation of syntax and biomedical entities with a detailed annotation of their relationships. The corpus represents a major design and development effort of the research group, with manual annotation that identifies over 6000 entities, 2500 relationships and 28,000 syntactic dependencies in 1100 sentences. In addition to combining these key annotations for a single set of sentences, BioInfer was also the first domain resource to introduce a representation of entity relations that is supported by ontologies and able to capture complex, structured relationships.
Part I of this thesis presents a summary of this research in the broader context of a text mining system, and Part II contains reprints of the five included publications. (Transferred from Doria; no information on accessibility.)

    Data for "RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature"

    No full text
    RegulaTome corpus: this file (../api/records/10808330/files/RegulaTome-corpus.tar.gz/content) contains the RegulaTome corpus in BRAT (https://brat.nlplab.org/) format. The directory "splits" has the corpus split based on the train/dev/test used for the training of the relation extraction system.
    RegulaTome annodoc: the annotation guidelines along with the annotation configuration files for BRAT are provided in annodoc+config.tar.gz (../api/records/10808330/files/annodoc+config.tar.gz/content). The online version of the annotation documentation can be found here: https://katnastou.github.io/regulatome-annodoc/
    The tagger software can be found here: https://github.com/larsjuhljensen/tagger . The command used to run tagger before large-scale execution of the RE system is:
        gzip -cd `ls -1 pmc/*.en.merged.filtered.tsv.gz` `ls -1r pubmed/*.tsv.gz` | cat dictionary/excluded_documents.txt - | tagger/tagcorpus --threads=16 --autodetect --types=dictionary/curated_types.tsv --entities=dictionary/all_entities.tsv --names=dictionary/all_names_textmining.tsv --groups=dictionary/all_groups.tsv --stopwords=dictionary/all_global.tsv --local-stopwords=dictionary/all_local.tsv --type-pairs=dictionary/all_type_pairs.tsv --out-matches=all_matches.tsv
    Input documents: the large-scale execution is done on the entire PubMed (https://a3s.fi/March-2024-PubMed/PubMed_20230314.tar.gz, as of March 2024) and PMC Open Access (https://a3s.fi/Jan-2024-documents/PMC_Nov_23.tar.gz, as of November 2023) articles in BioC format. The files are converted to a tab-delimited format (https://a3s.fi/March-2024-PubMed/all_documents.tsv) to be compatible with the RE system input (see below).
    Input dictionary files: all the files necessary to execute the command above are available in tagger_dictionary_files.tar.gz (../api/records/10808330/files/tagger_dictionary_files.tar.gz/content).
    Tagger output: we filter the results of the tagger run down to gene/protein hits and to documents with more than 1 hit (since we are doing relation extraction) before feeding them to our RE system. The filtered output is available in tagger_matches_ggp_only_gt_1_hit.tsv.gz (../api/records/10808330/files/tagger_matches_ggp_only_gt_1_hit.tsv.gz/content).
    Relation extraction system input: combined_input_for_re.tar.gz (../api/records/10808330/files/combined_input_for_re.tar.gz/content) contains the directories with all the .ann and .txt files used as input for the large-scale execution of the relation extraction pipeline. The files are generated from the tagger tsv output (see above, tagger_matches_ggp_only_gt_1_hit.tsv.gz) using the tagger2standoff.py script (https://github.com/spyysalo/string-db-tools/blob/main/tagger2standoff.py) from the string-db-tools repository (https://github.com/spyysalo/string-db-tools/).
    Relation extraction models: the Transformer-based model used for large-scale relation extraction and prediction on the test set is at relation_extraction_multi-label-best_model.tar.gz (../api/records/10808330/files/relation_extraction_multi-label-best_model.tar.gz/content). The RoBERTa model pre-trained on PubMed, PMC and MIMIC-III with a BPE vocabulary learned from PubMed (RoBERTa-large-PM-M3-Voc), which is used by our system, is available here: https://github.com/facebookresearch/bio-lm/blob/main/README.md
    Relation extraction system output: the tab-delimited outputs of the relation extraction system are found at large_scale_relation_extraction_results.tar.gz (https://a3s.fi/regulatome-ls/large_scale_relation_extraction_results.tar.gz). ATTENTION: this file is approximately 1TB in size, so make sure you have enough space to download it on your machine.
    The relation extraction system output files have 86 columns: PMID, Entity BRAT ID1, Entity BRAT ID2, and scores per class produced by the relation extraction model. Each file has a header to denote which score is in which column.
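    As a rough illustration of how output files of the kind described above might be consumed, the following Python sketch reads a tab-delimited file with a header row and reports the highest-scoring class for each entity pair. The header names (`pmid`, `id1`, `id2`) and class labels (`ClassA`, `ClassB`) in the sample are hypothetical; the sketch assumes only that the first three columns are identifiers and the remaining columns are numeric scores named by the header.

```python
import csv
import io

def top_classes(handle, n_id_cols=3):
    """Yield (ids, best_label, best_score) for each row of a headed TSV.

    Assumes the first n_id_cols columns identify the entity pair and
    every remaining column holds a numeric score for one class.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    labels = header[n_id_cols:]
    for row in reader:
        ids = tuple(row[:n_id_cols])
        scores = [float(value) for value in row[n_id_cols:]]
        # Index of the largest score; the first one wins on ties.
        best = max(range(len(scores)), key=scores.__getitem__)
        yield ids, labels[best], scores[best]

# Hypothetical two-class sample; the real files have 86 columns.
sample = io.StringIO(
    "pmid\tid1\tid2\tClassA\tClassB\n"
    "12345\tT1\tT2\t0.10\t0.85\n"
    "12345\tT1\tT3\t0.70\t0.20\n"
)
results = list(top_classes(sample))
for ids, label, score in results:
    print(ids, label, score)
```

    The model file name suggests a multi-label setup, so applying a threshold to each class score may be more appropriate than keeping only the maximum. Given the stated size of the archive, streaming the rows as done here is preferable to loading a whole file into memory.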

    On the use of topic models for word completion

    No full text
    We investigate the use of topic models, such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA), for word completion tasks. The advantage of using these models for such an application is twofold. On the one hand, they allow us to exploit semantic or contextual information when predicting candidate words for completion. On the other hand, these probabilistic models have been found to outperform classical latent semantic analysis (LSA) for modeling text documents. We describe a word completion algorithm that takes into account the semantic context of the word being typed. We also present evaluation metrics to compare the different models used in our study. Our experiments validate our hypothesis of using probabilistic models for semantic analysis of text documents and their application in word completion tasks.
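    One way context-aware completion with a topic model could work is sketched below in Python. This is not the algorithm from the paper, whose details the abstract does not give. It assumes that a topic model has already produced P(word | topic) and P(topic | context), and it ranks the words matching the typed prefix by the mixture P(w | context) = sum over z of P(w | z) P(z | context). All distributions and names here are hand-written toy values.

```python
def complete(prefix, topic_weights, word_given_topic, k=3):
    """Rank vocabulary words starting with prefix by P(w | context).

    P(w | context) = sum over topics z of P(w | z) * P(z | context).
    """
    scores = {}
    for topic, weight in topic_weights.items():
        for word, prob in word_given_topic[topic].items():
            if word.startswith(prefix):
                scores[word] = scores.get(word, 0.0) + weight * prob
    # Highest probability first; alphabetical order breaks ties.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

# Toy, hand-written distributions (not learned from any corpus).
word_given_topic = {
    "biology": {"protein": 0.5, "promoter": 0.3, "program": 0.2},
    "computing": {"program": 0.6, "processor": 0.3, "protein": 0.1},
}
bio_context = {"biology": 0.9, "computing": 0.1}
cs_context = {"biology": 0.1, "computing": 0.9}

print(complete("pro", bio_context, word_given_topic))
print(complete("pro", cs_context, word_given_topic))
```

    The same prefix yields different rankings in the two contexts, which is the kind of semantic sensitivity the abstract attributes to topic models.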

    Proceedings of The 5th Workshop on BioNLP Open Shared Tasks

    No full text
    We present the approach of the Turku NLP group to the PharmaCoNER task on Spanish biomedical named entity recognition. We apply a CRF-based baseline approach and multilingual BERT to the task, achieving an F-score of 88% on the development data and 87% on the test set with BERT. Our approach reflects a straightforward application of a state-of-the-art multilingual model that is not specifically tailored to either the language or the application domain. The source code is available at: https://github.com/chaanim/pharmaconer

    Proceedings of the 28th International Conference on Computational Linguistics

    Full text link
    Named entity recognition (NER) is frequently addressed as a sequence classification task with each input consisting of one sentence of text. It is nevertheless clear that useful information for NER is often found also elsewhere in text. Recent self-attention models like BERT can both capture long-distance relationships in input and represent inputs consisting of several sentences. This creates opportunities for adding cross-sentence information in natural language processing tasks. This paper presents a systematic study exploring the use of cross-sentence information for NER using BERT models in five languages. We find that adding context as additional sentences to BERT input systematically increases NER performance. Including multiple sentences in input samples allows us to study the predictions of the sentences in different contexts. We propose a straightforward method, Contextual Majority Voting (CMV), to combine these different predictions and demonstrate this to further increase NER performance. Evaluation on established datasets, including the CoNLL’02 and CoNLL’03 NER benchmarks, demonstrates that our proposed approach can improve on the state-of-the-art NER results on English, Dutch, and Finnish, achieves the best reported BERT-based results on German, and is on par with other BERT-based approaches in Spanish. We release all methods implemented in this work under open licenses.
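    The idea of combining predictions made for the same sentence in different contexts can be illustrated with plain per-token majority voting, sketched below in Python. This is a simplified reading of the abstract, not the released implementation; the function name, the tie-breaking rule and the example labels are assumptions made for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-token label sequences predicted for one sentence.

    predictions: list of label sequences, one per context in which the
    sentence appeared; all sequences must have the same length.
    On ties, the label encountered first wins.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(seq) != length for seq in predictions):
        raise ValueError("all label sequences must have the same length")
    combined = []
    for token_labels in zip(*predictions):
        label, _ = Counter(token_labels).most_common(1)[0]
        combined.append(label)
    return combined

# Hypothetical predictions for a four-token sentence seen in three contexts.
votes = [
    ["B-LOC", "O", "O", "B-LOC"],
    ["B-ORG", "O", "O", "B-LOC"],
    ["B-LOC", "O", "O", "O"],
]
print(majority_vote(votes))
```

    Voting token by token can in principle produce label sequences that are not valid under a BIO scheme, so a practical system may need a repair step after voting.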

    Going Beyond Counting First Authors in Author Co-citation Analysis

    Full text link
    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first of a cited work's authors, and a non-traditional type that takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
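    The difference between the two counting schemes can be made concrete with a small Python sketch. It assumes a simple definition in which two authors are co-cited once for every citing paper whose reference list includes works by both; the study's exact counting rules may differ, and the author names and data are invented.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations across citing papers.

    citing_papers: list of reference lists; each reference is an ordered
    list of author names. Only the first max_authors authors of each
    cited work are considered (1 = traditional first-author counting).
    An author pair is counted at most once per citing paper.
    """
    counts = Counter()
    for references in citing_papers:
        cited = set()
        for authors in references:
            cited.update(authors[:max_authors])
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts

# Two hypothetical citing papers, each citing two works.
papers = [
    [["Adams", "Baker"], ["Chen"]],
    [["Adams", "Baker"], ["Diaz", "Chen"]],
]
first_only = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
print(first_only)
print(first_five)
```

    With first-author counting, Baker never appears and the Adams/Chen pair is counted once; with the first five authors, Adams/Chen is counted twice and Baker enters the picture. Note that this sketch also counts co-authors of the same cited work as co-cited with each other, which is a design choice rather than something the abstract specifies.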