1,721,021 research outputs found

    #BlackLivesMatter and #AllLivesMatter Tweet IDs

    No full text
    Tweet IDs for tweets containing #BlackLivesMatter or #AllLivesMatter (case-insensitive), collected from August 8th, 2014 to August 31st, 2015. Tweets were collected from the Twitter Gardenhose, which represents a 10% sample of all tweets. These tweet IDs can be hydrated using a tool like Documenting the Now's hydrator.

    Humanitarian Assistance and Disaster Relief (HA/DR) Articles and Lexicon

    No full text
    ReliefWeb HA/DR Article Corpus

    This corpus consists of ~504K newswire articles harvested from ReliefWeb.int, an aggregator of HA/DR news articles and analysis sponsored by the United Nations Office for the Coordination of Humanitarian Affairs (OCHA). The corpus is over 300M total words, with documents primarily in English (85%), with some French (9%) and Spanish (6%). The documents are natively annotated for disaster type and 'theme'; see the ReliefWeb Taxonomy for descriptions of each. Approximately 28% of articles are marked for one or more disaster types and a disaster name (e.g., "Myanmar: Tropical Cyclone Nargis - May 2008"), and just under half (45%) are annotated for a theme.

    Data Citation

    The corpus and lexicon were constructed by Leidos Corp. under funding from the Defense Advanced Research Projects Agency (DARPA) Information Innovation Office (I2O), program: Low Resource Languages for Emergent Incidents (LORELEI), issued by DARPA/I2O under Contract No. HR0011-15-C-0114. The data was originally privately distributed to performers within that program. Any usage of the dataset should cite the following paper describing its construction: Littell, P., Tian, T., Xu, R. et al. (2018) The ARIEL-CMU situation frame detection pipeline for LoReHLT16: a model translation approach. Machine Translation 32: 105. https://doi.org/10.1007/s10590-017-9205-3. The data was originally released upon publication of the following paper: Gallagher, Ryan J., et al. "Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge." Transactions of the Association for Computational Linguistics (2017).

    Corpus Format

    The data are in JSON format; each article consists of the following fields.

    * `id`: A unique id.
    * `title`: The original article title.
    * `text`: The body text of the article.
    * `date_created`: Date created on ReliefWeb (ISO 8601).
    * `country_name`: The primary country of the disaster event (i.e., the country in which the event occurred).
    * `country_location`: The geographical coordinates of the affected country.
    * `disaster_name`: A list of short descriptions of the disaster events described in the article.
    * `disaster_type`: A list of types, according to the ReliefWeb taxonomy.
    * `glide`: A list of disaster event GLIDE numbers.
    * `theme`: A list of relief topics, according to the ReliefWeb taxonomy.
    * `source_name`: The name of the original publishing organization.
    * `source_type`: The organization type of the publisher (e.g., media, government, NGO).
    * `href`: ReliefWeb API URL of the article.

    HA/DR Topic Lexicon

    This lexicon contains ~34K English language terms (words and multi-word expressions) semantically relevant to the HA/DR topic taxonomy devised by DARPA and the LORELEI evaluation team. The lexicon is intended to support lexical transfer from high-resource (e.g., English) to low-resource languages, particularly for topic modeling and elicitation of domain-specific translations.

    Format

    The lexicon is formatted as a single JSON file. The entries are sorted by topic, relevance, frequency, and distance. Each entry contains the following fields.

    * `topic`: The HA/DR topic to which the term belongs, e.g., Violent Civil Unrest, Water, etc.
    * `term`: A word or multi-word expression related to the topic.
    * `seed`: Boolean; whether or not the term was originally identified by an HA/DR expert as highly relevant to the topic.
    * `len`: The length of the term, in words.
    * `dist`: The cosine distance of the term to the topic, averaged over five vector space models.
    * `relevance`: The three-auditor average of the term's relevance to the topic on a 5 point Likert scale.
    * `freq`: The frequency of the term in the ReliefWeb corpus.
    * `example`: A sentence from the ReliefWeb corpus containing the term, if available. NB: While the sentence is likely to relate to the topic, it is not guaranteed to; it may only be generically HA/DR relevant.

    Construction

    The lexicon was developed with a semi-supervised extraction process:

    1) A set of seed terms for each defined topic area was constructed manually with the input of an HA/DR domain expert. Additional terms from CrisisLex's CrisisLexRec and EMTerms lexicons were included in these sets. The seed lists typically contained 40 to 60 terms per topic.
    2) For each set of seed terms, a set of candidate terms was generated with a set of word2vec models:
       * A word2vec model trained on all HA/DR documents collected by ADRIEL for LORELEI.
       * A word2vec model trained on over one billion English language tweets available on the Internet Archive.
       * The pre-trained Google News word2vec vectors.
    3) Candidates were filtered to remove commonly occurring given names, surnames, and place names (taken from dbpedia), expanded with WordNet synonyms and hyponyms, and finally filtered according to their semantic distance from seed terms using an ensemble of the word2vec models above and the more traditional vector space models below:
       * A singular-value decomposition of dependency path features constructed from the HA/DR documents with Stanford's CoreNLP dependency parser.
       * A latent semantic indexing model of an English language thesaurus.
    4) From a set of ~15K candidates per topic, 3K "semantically near" terms were selected in this manner for each topic.
    5) Finally, a variety of low level text filters were applied to remove, e.g., non-ASCII terms, terms of 3 or fewer characters, and terms with non-word punctuation.

    Auditing

    All extracted terms were audited with CrowdFlower. Contributors were asked to rate each term's relevance to the topic on a five point Likert scale, with extreme points on the scale described as indicating a-contextual relevance (e.g., "sewage" is necessarily relevant to Sanitation without any additional context) or irrelevance (e.g., it is difficult to imagine how "bubblegum" would be relevant to Extreme Violence/Terrorism), and the mid-range indicating contextual dependence (e.g., "water" can be relevant to a discussion of Energy in the context of hydroelectricity plants). Terms receiving an average relevance of 3.5 or lower were dropped from the final lexicon. Overall agreement among participants on the rating scale was 75%. Contributors were required to correctly label a set of 50 researcher-defined sample questions before participating in the auditing; contributors scoring less than 70% were not allowed to participate.
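
    As a rough illustration of how the lexicon fields described above might be used, the sketch below groups terms by `topic` after filtering on `relevance` and `seed`. It assumes the JSON file holds a flat list of entry objects, which the description does not confirm; the sample entries and the helper name `terms_by_topic` are made up for the example and are not taken from the dataset.

```python
import json
from collections import defaultdict

# Made-up entries that only mimic the documented field names;
# they are NOT real lexicon data. A flat top-level list is an assumption.
SAMPLE_JSON = """
[
  {"topic": "Water", "term": "drinking water", "seed": true, "len": 2,
   "dist": 0.21, "relevance": 4.7, "freq": 1200, "example": null},
  {"topic": "Water", "term": "borehole", "seed": false, "len": 1,
   "dist": 0.35, "relevance": 4.0, "freq": 300, "example": null},
  {"topic": "Sanitation", "term": "sewage", "seed": true, "len": 1,
   "dist": 0.18, "relevance": 5.0, "freq": 800, "example": null}
]
"""


def terms_by_topic(entries, min_relevance=4.5, seeds_only=False):
    """Group terms by topic, keeping entries at or above min_relevance.

    Hypothetical helper: the threshold and the seeds_only switch are
    illustrative choices, not part of the dataset's tooling.
    """
    selected = defaultdict(list)
    for entry in entries:
        if entry["relevance"] < min_relevance:
            continue
        if seeds_only and not entry["seed"]:
            continue
        selected[entry["topic"]].append(entry["term"])
    return dict(selected)


entries = json.loads(SAMPLE_JSON)
high_confidence = terms_by_topic(entries, min_relevance=4.5)
print(high_confidence)
```

    With a real copy of the lexicon, the `json.loads` call would be replaced by reading the downloaded file; the file name and top-level layout should be checked against the actual release.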

    Going Beyond Counting First Authors in Author Co-citation Analysis

    Full text link
    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first author of a cited work, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
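
    The difference between the two counting schemes in this abstract can be sketched in a few lines. This is a minimal illustration, not the study's procedure: it assumes that two authors are co-cited once for every citing paper whose reference list credits both of them, and that co-authors of the same cited work also count as co-cited. The function name `cocitation_counts`, the `max_authors` parameter, and the toy data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations over a set of citing papers.

    citing_papers: list of reference lists; each reference is the ordered
    author list of one cited work.
    max_authors: how many leading authors of each cited work are credited
    (1 = first-author counting, 5 = first-five-authors counting).
    """
    counts = Counter()
    for references in citing_papers:
        credited = set()
        for author_list in references:
            credited.update(author_list[:max_authors])
        # Each unordered pair is counted once per citing paper.
        for pair in combinations(sorted(credited), 2):
            counts[pair] += 1
    return counts


# Toy data: two citing papers, each citing two works by authors A-D.
citing_papers = [
    [["A", "B"], ["C"]],
    [["A", "D"], ["C", "B"]],
]

first_author = cocitation_counts(citing_papers, max_authors=1)
first_five = cocitation_counts(citing_papers, max_authors=5)
print(first_author)  # only the pair (A, C) appears
print(first_five)    # pairs involving B and D appear as well
```

    In this toy case, first-author counting sees only the pair (A, C), while crediting up to five authors also brings B and D into the co-citation matrix.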

    Variations on the Author

    Full text link
    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who propels discourse rather than expressing opinions; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.

    Appropriate Similarity Measures for Author Cocitation Analysis

    Full text link
    We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance. Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
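
    To make the two measures named in this abstract concrete, the sketch below computes the cosine and the Pearson correlation for a pair of toy cocitation profiles. It also shows one commonly noted difference: appending authors with whom neither profile is ever cocited (extra zeros) leaves the cosine unchanged but shifts the Pearson correlation. This is not necessarily the argument made in the paper, and the profiles are invented for illustration.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equal-length profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation, computed as the cosine of mean-centered profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine(centered_x, centered_y)


# Invented cocitation profiles of two authors over three other authors.
x = [3, 1, 2]
y = [1, 3, 2]

# The same profiles after adding three authors cocited with neither.
x_padded = x + [0, 0, 0]
y_padded = y + [0, 0, 0]

print(cosine(x, y), cosine(x_padded, y_padded))    # same value both times
print(pearson(x, y), pearson(x_padded, y_padded))  # about -1 versus 0.5
```

    Here the cosine stays at 10/14 in both cases, whereas the Pearson correlation moves from about -1 to 0.5 once the zero entries are added.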

    Dispelling the Myths Behind First-author Citation Counts

    Full text link
    We conducted a full-scale evaluative citation analysis study of scholars in the XML research field to explore how much author rankings produced by different citation counting methods actually differ, and to demonstrate the capability of emerging data and tools on the Web in supporting more realistic citation counting methods. Our results contest some common arguments for the continued use of first-author citation counts in the evaluation of scholars, such as high correlations between author rankings by first-author citation counts and other citation counting methods, and high costs of using more realistic citation counting methods that are not well-supported by the ISI databases. It is argued that increasingly available digital full text research papers make it possible for citation analysis studies to go beyond what the ISI databases have directly supported and to employ more sophisticated methods.

    Author Index

    No full text
    Not provided