    Named Entity Processing for Digital Humanities

    Abstract of paper 1081 presented at the Digital Humanities Conference 2019 (DH2019), Utrecht, the Netherlands, 9-12 July 2019.

    Building a Multilingual Named Entity-Annotated Corpus Using Annotation Projection

    As developers of a highly multilingual named entity recognition (NER) system, we face an evaluation resource bottleneck: we need evaluation data in many languages, the annotation should not be too time-consuming, and the evaluation results should be comparable across languages. We solve the problem by automatically annotating the English version of a multi-parallel corpus and projecting the annotations into all the other language versions. To translate the English entities, we use a phrase-based statistical machine translation system as well as a lookup of known names in a multilingual name database. For the projection, we apply different methods incrementally: perfect string matching, perfect consonant signature matching and edit distance similarity. The resulting annotated parallel corpus will be made available for reuse.
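
    The incremental projection described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the ASCII vowel set used for the consonant signature, the normalised similarity formula and the 0.8 threshold are all assumptions made for illustration.

```python
from __future__ import annotations

VOWELS = set("aeiouy")  # assumption: only ASCII vowels are dropped


def consonant_signature(text: str) -> str:
    """Lowercase the text and keep only letters that are not ASCII vowels."""
    return "".join(ch for ch in text.lower() if ch.isalpha() and ch not in VOWELS)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def project_entity(translated: str, target_tokens: list[str],
                   min_similarity: float = 0.8) -> tuple[int, int, str] | None:
    """Locate a translated entity in a target sentence.

    Tries exact match, then consonant signature match, then the most similar
    window by edit distance. Returns (start, end, method) or None.
    """
    n = len(translated.split())
    windows = [(i, i + n, " ".join(target_tokens[i:i + n]))
               for i in range(len(target_tokens) - n + 1)]

    # Step 1: perfect string match.
    for start, end, text in windows:
        if text == translated:
            return (start, end, "exact")

    # Step 2: perfect consonant signature match.
    signature = consonant_signature(translated)
    for start, end, text in windows:
        if signature and consonant_signature(text) == signature:
            return (start, end, "consonant")

    # Step 3: best window whose similarity reaches the threshold.
    best, best_score = None, min_similarity
    for start, end, text in windows:
        score = similarity(translated.lower(), text.lower())
        if score >= best_score:
            best, best_score = (start, end, "edit_distance"), score
    return best
```

    Applying the cheapest and strictest test first means the looser methods only run on entities that the stricter ones fail to locate, which is one plausible reading of "incrementally" in the abstract.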

    Digitised Historical Newspapers: A Changing Research Landscape (Introduction)

    The application of digital technologies to newspaper archives is transforming the way historians engage with these sources. The digital evolution affects not only how scholars access historical newspapers but also, increasingly, how they search, explore and study them. Two developments have been driving this transformation: massive digitisation, which facilitates access to remote holdings, and, more recently, improved search capabilities, which alleviate the tedious exploration of vast collections, open up new prospects and transform research practices. The volume "Digitised newspapers - A New Eldorado for Historians?" brings together the contributions of a workshop held in 2020 on tools, methods and epistemological reflections on the use of digitised newspapers and offers three perspectives: how digitisation is transforming access to and exploration of historical newspaper collections; how automatic content processing allows for the creation of new layers of information; and, finally, what analyses this enhanced material opens up. This introductory chapter reviews recent developments that have influenced the research landscape of digitised newspapers and introduces the eighteen articles that comprise this volume.

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    parallel data and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a phrase-based statistical machine translation (PBSMT) system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each corresponds to a morphological variation of the source phrase, the target phrase or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translation, and the results showed improved automatic scores (BLEU and Meteor) and fewer out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model. Index Terms: machine translation, knowledge, morphological resources.

    Multi-label Eurovoc classification for Eastern and Southern EU languages

    Multi-label document classification is the task of automatically assigning multiple categories to the same document (e.g. a book may be about both cooking and Austrian food). At least for Machine Learning approaches, this task is harder than standard (single-label) classification because it is not clear to the learning software whether the presence of a feature (typically a word) indicates one class or another (e.g. whether the word ‘salt’ indicates the category cooking or the category Austrian food). Multi-label classification is a real challenge if the number of classes is very high and if the training documents are unevenly distributed across categories. We present experiments with the JRC EuroVoc Indexer software JEX (Steinberger et al. 2012), which has been trained for all official EU languages on tens of thousands of documents per language to assign the thousands of class labels of the EuroVoc thesaurus. JEX is a multi-label classification system using a bag-of-words document representation. When a tool that uses word forms as classifier features is applied to languages as different as Germanic (e.g. English), Romance (e.g. French), Slavic (e.g. Czech or Polish) and Finno-Ugric languages (e.g. Estonian or Hungarian), the question arises of how much the classifier performance differs. It can be expected that the significantly higher ratio of word forms to lemmas in Slavic and Finno-Ugric languages has a negative impact on classifier performance, or that more training material would be needed for these more highly inflected languages to achieve the same performance. Similarly, one might wonder whether part-of-speech (POS) information is useful. JEX will soon be made available to parliamentary and other users. The experiments described in this chapter thus have practical relevance, as they can indicate to the users and their technical partners whether they should invest in improving the software through linguistic pre-processing.

    Towards Chapterisation of Podcasts: Detection of Host and Structuring Questions in Radio Transcripts

    This Master's thesis investigates the application of Bidirectional Encoder Representations from Transformers (BERT) to podcasts in order to identify the host and detect structuring questions within each episode. The research is conducted on an annotated dataset of automatic transcriptions of 38 French podcasts from Radio France and 37 English TV shows from France 24. A variety of BERT models, with different language orientations, are tested and compared on two classification tasks: the detection of host sentences and the classification of structuring questions. The latter is first performed as a three-label classification task. A reduction to a binary classifier is then proposed, with two new configurations. Initially, BERT models are fine-tuned separately on the French and English datasets, as well as on the joint dataset. Subsequently, a multilingual approach is implemented with an automatic translation of the original dataset into a total of twenty languages. The translated datasets are used for multilingual fine-tuning, and German is included as an evaluation language. In host detection, BERT models perform adequately for pinpointing the actual host of the show within the list of speakers, as does a rule-based method proposed for comparison. For structuring question detection, the three-label classifier appears too subtle, at least given the size of the fine-tuning data. One binary classification configuration yields promising results. The multilingual experiment shows that automatic translation has potential as a source of fine-tuning data and highlights the need for original testing data in these languages.

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two different types of co-citation counting: the traditional type, which counts only the first author of a cited work, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, when the same number of top-ranked authors is selected and analyzed, this picture represents fewer specialties in the research field being studied than the one produced through traditional first-author co-citation counting. Reasons for these effects are discussed.
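
    The difference between the two counting schemes can be shown with a small sketch. It is an illustration only, not the procedure used in the study: the data layout, the function name `cocitation_counts` and the author names are hypothetical, and the sketch assumes that co-authors of the same cited work also count as co-cited when more than one author per work is considered.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count how many citing papers cite each pair of authors together.

    citing_papers: one reference list per citing paper; each reference is the
    ordered list of authors of a cited work.
    max_authors: 1 gives first-author counting, 5 counts the first five authors.
    """
    counts = Counter()
    for reference_list in citing_papers:
        cited_authors = set()
        for authors in reference_list:
            cited_authors.update(authors[:max_authors])
        # Each unordered pair is counted once per citing paper.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


# Hypothetical data: two citing papers with two references each.
papers = [
    [["Adams", "Baker"], ["Clark"]],
    [["Adams"], ["Clark", "Diaz"]],
]

first_author_only = cocitation_counts(papers, max_authors=1)
first_five_authors = cocitation_counts(papers, max_authors=5)
```

    With first-author counting, only the pair (Adams, Clark) appears. Counting up to five authors also brings in Baker and Diaz, which shows how the second scheme links more authors for the same set of citing papers.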

    Variations on the Author

    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourse; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.