
    Arabic and English Spatial Entity Dataset for Animal Disease Surveillance Extracted with PADI-web

    As part of the “Arabic Corpus and Entities Dealing with Animal Disease Surveillance Extracted with PADI-web” dataset (https://doi.org/10.18167/DVN1/2B4WLR), we built a new dataset containing 284 spatial entities in Arabic, their manually validated English translations, and their automatic translations by three tools (DeepL, Microsoft Azure, and Reverso). On September 3, 2025, the dataset was updated with two new columns, GeoNames ID and GeoNames Feature Class, which enable spatial entities to be matched to the GeoNames gazetteer. The dataset is organised as a table with fourteen columns (the twelve original columns plus the two GeoNames columns):

    - ID: The unique identifier of each article (from the PADI-web database).
    - Arabic Location: The spatial entities in Arabic, manually extracted from 53 articles collected via PADI-web.
    - English Location: The manual translation of spatial entities into English, based on existing field sources such as Google Maps and the GeoNames database.
    - GeoNames ID: The unique ID from the GeoNames database (2022 version of GeoNames: https://www.geonames.org/) corresponding to each spatial entity (empty if no match in GeoNames).
    - GeoNames Feature Class: The feature class corresponding to the GeoNames ID (empty if no match in GeoNames).
    - Type: A manually assigned type of spatial entity (country, city, region, village, etc.).
    - Category: The classification of spatial entities into two categories: absolute spatial entities (ASE) and relative spatial entities (RSE).
    - Arabic Phrases: The sentence, in Arabic, from which the spatial entity was extracted.
    - Translation DeepL: The translation of the location by DeepL.
    - Translation Microsoft Azure: The translation of the location by Microsoft Azure.
    - Translation Reverso: The translation of the location by Reverso.
    - English Sentences Translated by DeepL: The translation of the sentence by DeepL.
    - English Sentences Translated by Microsoft Azure: The translation of the sentence by Microsoft Azure.
    - English Sentences Translated by Reverso: The translation of the sentence by Reverso.

    Absolute spatial entities are direct references to precise, locatable geographic spaces, i.e. entities that can be located on a map or in a geographic database (e.g. cities such as Safi, or countries such as Morocco and Egypt). Relative spatial entities are defined in relation to at least one other spatial entity, using spatial indicators of a topological nature (for example, “الطود شرق” (El-Tod East) or “ناحية تلات” (Talat district)).
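
    One way such a table might be used is to compare each tool's location translation with the manually validated one. The sketch below is only an illustration under stated assumptions: the column names come from the description above, but the rows are made-up values (not taken from the dataset), and case-insensitive exact match is just one possible agreement criterion, not the authors' evaluation method.

```python
def exact_match_rate(rows, reference_col, candidate_col):
    """Share of rows whose candidate translation equals the reference,
    ignoring case and surrounding whitespace. Rows missing either value are skipped."""
    compared = [r for r in rows if r.get(reference_col) and r.get(candidate_col)]
    if not compared:
        return 0.0
    hits = sum(
        r[reference_col].strip().casefold() == r[candidate_col].strip().casefold()
        for r in compared
    )
    return hits / len(compared)


# Made-up rows for illustration only; they do not reproduce real tool output.
rows = [
    {"English Location": "Morocco", "Translation DeepL": "Morocco", "Translation Reverso": "morocco"},
    {"English Location": "Safi", "Translation DeepL": "Safi", "Translation Reverso": "Safy"},
]

for tool_col in ("Translation DeepL", "Translation Reverso"):
    print(tool_col, exact_match_rate(rows, "English Location", tool_col))
```

    In practice the rows could be read with `csv.DictReader` from an exported copy of the table; the export file name and format would depend on how the dataset is downloaded.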

    Labeled corpora for post-training Language Models on thematic and misinformation classification in a One Health context

    This repository contains five corpora of labeled texts used for fine-tuning language models based on selective masking to adapt them to targeted domains within the One Health context. The corpora comprise collections of texts generally sourced from PubMed and PADI-web, representing two main areas of application: (i) thematic content related to the One Health domain, covering the biomedical, phytosanitary, and syndromic surveillance fields, and (ii) epidemic misinformation. The repository contains 5 files:

    - Medical Text - Cancer_snippets: 996 scientific articles and abstracts on human cancers, extracted from the Medical Text Dataset - Cancer Doc Classification Dataset. This corpus is divided into three classes (Thyroid Cancer: 283, Colon Cancer: 261, Lung Cancer: 453).
    - PubMed Plant Diseases_snippets: 1,200 abstracts of PubMed scientific papers written in English that focus on the plant health domain. This corpus is divided equally among three major plant diseases that affect crops (Downy Mildew, Powdery Mildew, and Bacterial Wilt). We collected the abstracts by web scraping, selecting those whose titles and content contained the disease names.
    - PADI-web Plant Health_snippets: 748 news articles on Xylella fastidiosa (a plant disease) collected with PADI-web (https://padi-web.cirad.fr/en) and manually classified by experts into two classes: relevant (317 articles, i.e., documents related to a new, suspected or unknown outbreak) or not relevant (431 articles).
    - PADI-web Syndromic_snippets: 769 online news articles, divided into two classes: positive, with 311 news articles dealing with unknown diseases, and negative, with 458 news articles where a pathogenic cause is identified.
    - CoAID_snippets: 252 news articles and Facebook posts on the COVID-19 epidemic, extracted from the larger CoAID dataset. This corpus is divided into two classes: fake, with 126 fake news items, and true, with 126 real news items.

    The complete corpora are available under restricted access, while the open-access versions contain only snippets from each corpus.

    Unlabeled corpora for post-training Language Models on thematic and misinformation classification in a One Health context

    This repository contains four corpora of unlabeled texts used to post-train language models based on selective masking to adapt them to targeted domains within the One Health context. The corpora comprise collections of unannotated texts generally sourced from PubMed and PADI-web, representing two main areas of application: (i) thematic content related to the One Health domain, covering the biomedical, phytosanitary, and syndromic surveillance fields, and (ii) epidemic misinformation. The repository contains 4 files:

    - PubMed Biomedical_snippets: 10,000 English abstracts of biomedical articles, extracted from the PubMed Article Summarization Dataset (https://www.kaggle.com/datasets/thedevastator/pubmed-article-summarization-dataset).
    - PubMed Plant Health_snippet: 9,388 English abstracts of PubMed articles on plant health, which we collected by web scraping, selecting abstracts with titles and content containing keywords related to plant health (e.g., plant diseases and plant names).
    - PADI-web Unspecified Diseases_snippet: 8,000 English news articles dedicated to syndromic surveillance (i.e., articles describing unknown diseases and symptoms), collected from the PADI-web tool (https://padi-web.cirad.fr/en).
    - PADI-web Public Health_snippet: 10,000 English news articles on human epidemics (e.g., Influenza and Ebola), used for the epidemic misinformation domain.

    The complete corpora are available under restricted access, while the open-access versions contain only snippets from each corpus.

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
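
    The two counting schemes compared in this abstract can be sketched in a few lines. This is a minimal illustration, not the study's procedure: the data layout (each citing paper as a list of cited works, each cited work as an ordered author list), the function name `cocitation_counts`, the binary counting per citing paper, and the choice to let co-authors of the same cited work count as co-cited with each other are all assumptions made here.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author pairs co-cited by the same citing paper.

    citing_papers: list of reference lists; each reference is an ordered
    list of author names. max_authors=1 gives first-author counting;
    max_authors=5 considers the first five authors of each cited work.
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        # Each unordered pair is counted once per citing paper.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


# Toy citing papers with placeholder author names.
papers = [
    [["A", "B"], ["C"]],
    [["A"], ["C", "D"]],
]

print(dict(cocitation_counts(papers, max_authors=1)))
print(dict(cocitation_counts(papers, max_authors=5)))
```

    With first-author counting only the pair (A, C) appears, whereas the five-author variant also brings in pairs involving B and D, which is the kind of difference in input data the study examines.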

    Variations on the Author

    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourses; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.

    Appropriate Similarity Measures for Author Cocitation Analysis

    We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
    Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
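
    To make the comparison between the Pearson correlation and the cosine concrete, the sketch below computes both for two toy cocitation profiles and then again after appending zero entries (authors with whom neither profile is cocited). It illustrates one property often raised in this methodological discussion, not necessarily the argument made in the paper; the profile values are invented, and the two other measures mentioned in the abstract are not reproduced because the abstract does not name them.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two cocitation profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def pearson(x, y):
    """Pearson correlation between two cocitation profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Invented cocitation counts of two authors with three other authors.
x, y = [3, 1, 2], [1, 3, 2]
# Same profiles after adding three authors with whom neither is cocited.
x_padded, y_padded = x + [0, 0, 0], y + [0, 0, 0]

print(pearson(x, y), pearson(x_padded, y_padded))
print(cosine(x, y), cosine(x_padded, y_padded))
```

    On these toy profiles the Pearson correlation changes sign once the zeros are added, while the cosine is unaffected, since zero entries contribute nothing to the dot product or to the norms.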

    Dispelling the Myths Behind First-author Citation Counts

    We conducted a full-scale evaluative citation analysis study of scholars in the XML research field to explore just how different from each other author rankings resulting from different citation counting methods actually are, and to demonstrate the capability of emerging data and tools on the Web in supporting more realistic citation counting methods. Our results contest some common arguments for the continued use of first-author citation counts in the evaluation of scholars, such as high correlations between author rankings by first-author citation counts and other citation counting methods, and high costs of using more realistic citation counting methods that are not well-supported by the ISI databases. It is argued that increasingly available digital full text research papers make it possible for citation analysis studies to go beyond what the ISI databases have directly supported and to employ more sophisticated methods

    Author Index

    Not provided

    Innovation lexicon

    The DeSIRA (Development Smart Innovation through Research in Agriculture) Initiative from the European Commission funds Research and Development projects seeking to bridge the gap between the research community and the formulation of policies to build resilient, sustainable and equitable agri-food systems in the Global South. These projects are accompanied by DeSIRA-LIFT (Leveraging the DeSIRA Initiative for the Transformation of Agri-Food Systems), which provides services to prove and improve their impact. In the context of the DeSIRA-LIFT initiative, we aim to mobilize text-mining methods to characterise the nature of the innovations developed by the DeSIRA projects via the documents produced at different stages of the projects. To enable innovation to be characterised from textual documents, we built a lexicon dealing with (1) the actors involved in these processes; (2) triggering factors and issues to be resolved; (3) the nature of the process through which actors are engaged; (4) its components or outputs (new technology, new process, new service, new organization, new policy, etc.); (5) the themes or domains tackled; and (6) the innovation maturity level.

    Lexicon - 1st version (desira-lift-v1.ods)

    Three researchers on agricultural innovation (CA, AT, SM) designed the lexicon's overall architecture with the support of the “Keops Team”. Keywords were then identified by combining 3 types of resources:

    - The researchers proposed keywords based on previous scientific work on agricultural innovation.
    - Relevant keywords were then selected from the LEAP4FNSSA lexicon, dedicated to food security (https://dataverse.cirad.fr/dataset.xhtml?persistentId=doi:10.18167/DVN1/D1C53L).
    - The researchers selected additional keywords from a list of 20 keywords generated by ChatGPT for each concept.

    The lexicon is divided into 3 levels: vocabularies (level 1, corresponding to sheet names), concepts (level 2, corresponding to the first column of each sheet) and keywords (level 3, corresponding to the second column of each sheet), distributed as follows:

    - The vocabulary Actors contains 3 concepts and 429 keywords
    - The vocabulary Innovation triggers contains 5 concepts and 82 keywords
    - The vocabulary Innovation process contains 4 concepts and 59 keywords
    - The vocabulary Innovation products contains 9 concepts and 140 keywords
    - The vocabulary Innovation theme or domain contains 19 concepts and 327 keywords
    - The vocabulary Innovation phase contains 4 concepts and 24 keywords

    Lexicon - enhanced version (desira-lift-v2.ods)

    We built upon the first version of the DeSIRA-LIFT lexicon and proposed an enhanced version. Three major changes have been made: (1) new keywords were added after a new phase of consultation with the experts, (2) we added a new column containing, when relevant, the plural form and synonyms of each keyword, and (3) two axes were renamed (Innovation theme or domain -> Innovation purpose, Innovation products -> Innovation outputs). The enhanced DeSIRA-LIFT lexicon is divided into 3 levels: axes (level 1, corresponding to sheet names), concepts (level 2, corresponding to the first column of each sheet) and keywords (level 3, corresponding to the second column of each sheet), distributed as follows:

    - The axis Actors contains 2 concepts and 26 keywords
    - The axis Innovation triggers contains 6 concepts and 94 keywords
    - The axis Innovation process contains 4 concepts and 90 keywords
    - The axis Innovation outputs contains 9 concepts and 160 keywords
    - The axis Innovation purpose contains 19 concepts and 336 keywords
    - The axis Innovation phase contains 4 concepts and 24 keywords
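
    The three-level structure described above (axis, concept, keyword) lends itself to a simple keyword-matching pass over project documents. The sketch below is a hypothetical illustration of that idea, not the DeSIRA-LIFT text-mining pipeline: the function name `count_concepts`, the nested-dictionary layout, and the concept names and keywords in the toy lexicon are all assumptions; only the axis names come from the description.

```python
import re
from collections import Counter


def count_concepts(text, lexicon):
    """Count keyword occurrences per (axis, concept) in a text.

    lexicon: {axis: {concept: [keywords, including plural forms or synonyms]}}.
    Matching is case-insensitive and restricted to whole words.
    """
    counts = Counter()
    for axis, concepts in lexicon.items():
        for concept, keywords in concepts.items():
            for keyword in keywords:
                pattern = r"\b" + re.escape(keyword) + r"\b"
                matches = re.findall(pattern, text, flags=re.IGNORECASE)
                if matches:
                    counts[(axis, concept)] += len(matches)
    return counts


# Hypothetical lexicon excerpt: concept names and keywords are invented.
toy_lexicon = {
    "Actors": {"producers": ["farmer", "farmers"]},
    "Innovation outputs": {"new technology": ["sensor"]},
}

text = "Farmers tested a new sensor; one farmer adopted it."
print(dict(count_concepts(text, toy_lexicon)))
```

    Listing plural forms explicitly, as the enhanced version of the lexicon does, means whole-word matching can be used without stemming; a real pipeline would likely also need to handle multi-word keywords and overlapping matches.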