    [Données] Comparaison d'approches de reconnaissance d'entités nommées imbriquées dans des documents historiques structurés

    No full text
    This repository references the models that were trained and compared in the following article: Tual, Solenn and Abadie, Nathalie and Carlinet, Edwin and Chazalon, Joseph and Duménieu, Bertrand. A Benchmark of Nested Named Entity Recognition Approaches in Historical Structured Documents. Proceedings of the 17th International Conference on Document Analysis and Recognition (ICDAR'23). Aug. 2023. San José, California, USA. https://doi.org/10.1007/978-3-031-41682-8_8
    Named Entity Recognition (NER) is a key step in the creation of structured data from digitised historical documents. Traditional NER approaches deal with flat named entities, whereas entities are often nested. For example, a postal address might contain a street name and a number. This work compares three nested NER approaches, including two state-of-the-art approaches using Transformer-based architectures. We introduce a new Transformer-based approach based on joint labelling and semantic weighting of errors, evaluated on a collection of 19th-century Paris trade directories. We evaluate approaches regarding the impact of supervised fine-tuning, unsupervised pre-training with noisy texts, and variation of IOB tagging formats. Our results show that while nested NER approaches enable extracting structured data directly, they do not benefit from the extra knowledge provided during training and reach a performance similar to the base approach on flat entities. Even though all three approaches perform well in terms of F1 scores, joint labelling is most suitable for hierarchically structured data. Finally, our experiments reveal the superiority of the IO tagging format on such data.
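
    As a rough illustration of the tagging formats and of joint labelling mentioned in this abstract, the sketch below tags a made-up directory entry at two nesting levels and fuses the two tags of each token into a single label. The tokens, the label names (PER, ADDR, STREET, NUM), the `+` separator and both helper functions are assumptions made for this example; the abstract does not specify the article's label set or its exact joint-labelling scheme.

```python
def tag_tokens(n_tokens, spans, scheme="IOB2"):
    """Turn (start, end_exclusive, label) spans into one tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        for i in range(start, end):
            if scheme == "IO":
                # IO: every token inside an entity gets the same I- tag.
                tags[i] = "I-" + label
            else:
                # IOB2: the first token of an entity is marked with B-.
                tags[i] = ("B-" if i == start else "I-") + label
    return tags


def joint_labels(outer_tags, inner_tags):
    """Fuse the tags of two nesting levels into one label per token (hypothetical format)."""
    return [outer + "+" + inner for outer, inner in zip(outer_tags, inner_tags)]


# Hypothetical entry: a name followed by an address that nests a street and a number.
tokens = ["Dupont", ",", "rue", "Neuve", "12"]
outer_spans = [(0, 1, "PER"), (2, 5, "ADDR")]
inner_spans = [(2, 4, "STREET"), (4, 5, "NUM")]

n = len(tokens)
iob2_joint = joint_labels(tag_tokens(n, outer_spans), tag_tokens(n, inner_spans))
io_joint = joint_labels(tag_tokens(n, outer_spans, "IO"), tag_tokens(n, inner_spans, "IO"))

for token, iob2, io in zip(tokens, iob2_joint, io_joint):
    print(f"{token:8} {iob2:18} {io}")
```

    With a fused label per token, a single sequence tagger can predict both levels at once. The IO format uses fewer distinct labels than IOB2, at the cost of not separating two adjacent entities of the same type.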

    A Dataset of French Trade Directories from the 19th Century (FTD)

    No full text
    This dataset is composed of pages and entries extracted from French directories published between 1798 and 1861. Its purpose is to evaluate the performance of Optical Character Recognition (OCR) and Named Entity Recognition (NER) on 19th-century French documents.

    The dataset is divided into two parts:
    - A labeled dataset, which contains 8765 manually corrected entries from 78 pages (18 different directories) and is designed for supervised training.
    - An unlabeled dataset, which contains 1058196 raw entries from 6887 pages (13 different directories) and is designed for self-supervised pre-training.

    For the labeled dataset, we provide:
    - Original pages and cropped images
    - Human-corrected positions, transcriptions and entity tagging for each entry
    - OCR predictions from 3 systems (Tesseract v4, PERO OCR v2020 and Kraken)
    - Projected NER reference from clean text to OCR predictions, making it suitable to evaluate the performance of NER systems on real, noisy OCR predictions

    For the unlabeled dataset, we provide:
    - Automatically detected positions for each entry (very noisy)
    - OCR predictions for each entry (PERO OCR engine)

    How to cite this dataset
    Please cite this dataset as: N. Abadie, S. Baciocchi, E. Carlinet, J. Chazalon, P. Cristofoli, B. Duménieu and J. Perret, A Dataset of French Trade Directories from the 19th Century (FTD), version 1.0.0, May 2022, online at https://doi.org/10.5281/zenodo.6394464.

    @dataset{abadie_dataset_22,
      author = {Abadie, Nathalie and Baciocchi, St{\'e}phane and Carlinet, Edwin and Chazalon, Joseph and Cristofoli, Pascal and Dum{\'e}nieu, Bertrand and Perret, Julien},
      title = {{A} {D}ataset of {F}rench {T}rade {D}irectories from the 19th {C}entury ({FTD})},
      month = mar,
      year = 2022,
      publisher = {Zenodo},
      version = {v1.0.0},
      doi = {10.5281/zenodo.6394464},
      url = {https://doi.org/10.5281/zenodo.6394464}
    }

    You may also be interested in our paper presented at DAS 2022 (15th IAPR International Workshop on Document Analysis Systems), which compares the performance of OCR and NER systems on this dataset: N. Abadie, E. Carlinet, J. Chazalon and B. Duménieu, A Benchmark of Named Entity Recognition Approaches in Historical Documents — Application to 19th Century French Directories, May 2022, La Rochelle, France, Springer.

    @inproceedings{abadie_das_22,
      author = {Abadie, Nathalie and Carlinet, Edwin and Chazalon, Joseph and Dum{\'e}nieu, Bertrand},
      title = {{A} {B}enchmark of {N}amed {E}ntity {R}ecognition {A}pproaches in {H}istorical {D}ocuments — {A}pplication to 19th {C}entury {F}rench {D}irectories},
      month = may,
      year = 2022,
      publisher = {Springer},
      place = {La Rochelle, France}
    }

    Copyright and License
    The images were extracted from the original source https://gallica.bnf.fr, owned by the Bibliothèque nationale de France (French national library). Original contents from the Bibliothèque nationale de France can be reused non-commercially, provided the mention "Source gallica.bnf.fr / Bibliothèque nationale de France" is kept. Researchers do not have to pay any fee for reusing the original contents in research publications or academic works. Original copyright mentions extracted from https://gallica.bnf.fr/edit/und/conditions-dutilisation-des-contenus-de-gallica on March 29, 2022. The original contents were significantly transformed before being included in this dataset. All derived content is licensed under the permissive Creative Commons Attribution 4.0 International license.

    Going Beyond Counting First Authors in Author Co-citation Analysis

    Full text link
    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two types of co-citation counting: the traditional type, which counts only the first author of a cited work, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through the non-traditional counting contains more coherent author groups and is therefore considerably clearer. However, when the same number of top-ranked authors is selected and analyzed, this picture represents fewer specialties in the research field being studied than the one produced through traditional first-author counting. Reasons for these effects are discussed.
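
    The difference between the two counting types can be made concrete with a minimal sketch. The toy reference lists, the author names A to D and the function `cocitation_counts` are hypothetical; the sketch also assumes that each author is credited at most once per citing paper and that co-authors of the same cited work count as co-cited, which the abstract does not specify.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count how often each pair of authors is cited together.

    citing_papers: one reference list per citing paper; each reference is
                   the list of its authors in byline order.
    max_authors:   number of leading authors credited per cited work
                   (1 reproduces first-author counting).
    """
    counts = Counter()
    for references in citing_papers:
        # Collect every credited author once per citing paper.
        cited = set()
        for authors in references:
            cited.update(authors[:max_authors])
        # Each unordered pair of credited authors is co-cited once.
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


# Two hypothetical citing papers, each citing two works.
papers = [
    [["A", "B"], ["C"]],
    [["A", "D"], ["C", "B"]],
]

first_only = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)

print(dict(first_only))
print(dict(first_five))
```

    In this toy data, first-author counting only ever sees the pair (A, C), while crediting up to five authors also brings B and D into the co-citation matrix.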

    Variations on the Author

    Full text link
    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who propels discourse rather than expressing opinions, an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.

    Un arbre des formes pour les images multivariées

    Full text link
    Nowadays, the demand for multi-scale and region-based analysis in many computer vision and pattern recognition applications is obvious. No one would consider a pixel-based approach a good candidate to solve such problems. To meet this need, the Mathematical Morphology (MM) framework has supplied region-based hierarchical representations of images such as the Tree of Shapes (ToS). The ToS represents the image as a tree encoding the inclusion of its level lines. The ToS is thus self-dual and contrast-change invariant, which makes it well adapted to high-level image processing. Yet it is only defined on grayscale images, and most attempts to extend it to multivariate images (e.g. by imposing an “arbitrary” total ordering) are not satisfactory. In this dissertation, we present the Multivariate Tree of Shapes (MToS) as a novel approach to extend the grayscale ToS to multivariate images. This representation is a merge of the ToSs computed marginally on each channel of the image; it aims at merging the marginal shapes in a “sensible” way by preserving the maximum number of inclusions. The proposed method has theoretical foundations: it expresses the ToS as a topographic map of the curvilinear total variation computed from the image border, which is what allowed its extension to multivariate data. In addition, the MToS has properties similar to those of the grayscale ToS, the most important being its invariance to any marginal change of contrast and any marginal inversion of contrast (a kind of “self-duality” in the multidimensional case). Because the growing amount of data to process calls for efficient image processing techniques, we propose an efficient algorithm that builds the MToS in quasi-linear time w.r.t. the number of pixels and quadratic time w.r.t. the number of channels. We also propose tree-based processing algorithms to demonstrate in practice that the MToS is a versatile, easy-to-use, and efficient structure.
    Finally, to validate the soundness of our approach, we propose some experiments testing the robustness of the structure to non-relevant components (e.g. with noise or with low dynamics), and we show that such defects do not affect the overall structure of the MToS. In addition, we propose many real-case applications using the MToS. Many of them are just slight modifications of methods employing the “regular” ToS, adapted to our new structure. For example, we successfully use the MToS for image filtering, image simplification, image segmentation, image classification and object detection. From these applications, we show that the MToS generally outperforms its ToS-based counterpart, demonstrating the potential of our approach.

    Appropriate Similarity Measures for Author Cocitation Analysis

    Full text link
    We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
    Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
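
    For readers unfamiliar with the two measures named here, the sketch below computes the cosine and the Pearson correlation for a pair of made-up cocitation profiles. The profile values and function names are assumptions for illustration. The example shows one well-known difference between the measures (appending a third author with whom neither is cocited changes the Pearson value but not the cosine); the abstract does not say whether this is the authors' own argument.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two profile vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation: the cosine of the mean-centred vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical cocitation profiles of two authors over three other authors.
x = [4, 2, 0]
y = [2, 4, 0]

# Same profiles after adding one more author cocited with neither.
x_padded = x + [0]
y_padded = y + [0]

print(cosine(x, y), cosine(x_padded, y_padded))
print(pearson(x, y), pearson(x_padded, y_padded))
```

    The cosine depends only on the nonzero coordinates the two profiles share, whereas centring on the mean makes the Pearson value depend on how many zero entries the profiles contain.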

    A Tree of shapes for multivariate images

    No full text
    Nowadays, the demand for multi-scale and region-based analysis in many computer vision and pattern recognition applications is obvious. No one would consider a pixel-based approach a good candidate to solve such problems. To meet this need, the Mathematical Morphology (MM) framework has supplied region-based hierarchical representations of images such as the Tree of Shapes (ToS). The ToS represents the image as a tree encoding the inclusion of its level lines. The ToS is thus self-dual and contrast-change invariant, which makes it well adapted to high-level image processing. Yet it is only defined on grayscale images, and most attempts to extend it to multivariate images (e.g. by imposing an “arbitrary” total ordering) are not satisfactory. In this dissertation, we present the Multivariate Tree of Shapes (MToS) as a novel approach to extend the grayscale ToS to multivariate images. This representation is a merge of the ToSs computed marginally on each channel of the image; it aims at merging the marginal shapes in a “sensible” way by preserving the maximum number of inclusions. The proposed method has theoretical foundations: it expresses the ToS as a topographic map of the curvilinear total variation computed from the image border, which is what allowed its extension to multivariate data. In addition, the MToS has properties similar to those of the grayscale ToS, the most important being its invariance to any marginal change of contrast and any marginal inversion of contrast (a kind of “self-duality” in the multidimensional case). Because the growing amount of data to process calls for efficient image processing techniques, we propose an efficient algorithm that builds the MToS in quasi-linear time w.r.t. the number of pixels and quadratic time w.r.t. the number of channels. We also propose tree-based processing algorithms to demonstrate in practice that the MToS is a versatile, easy-to-use, and efficient structure.
    Finally, to validate the soundness of our approach, we propose some experiments testing the robustness of the structure to non-relevant components (e.g. with noise or with low dynamics), and we show that such defects do not affect the overall structure of the MToS. In addition, we propose many real-case applications using the MToS. Many of them are just slight modifications of methods employing the “regular” ToS, adapted to our new structure. For example, we successfully use the MToS for image filtering, image simplification, image segmentation, image classification and object detection. From these applications, we show that the MToS generally outperforms its ToS-based counterpart, demonstrating the potential of our approach.

    Dispelling the Myths Behind First-author Citation Counts

    Full text link
    We conducted a full-scale evaluative citation analysis study of scholars in the XML research field to explore just how different author rankings resulting from different citation counting methods actually are, and to demonstrate the capability of emerging data and tools on the Web to support more realistic citation counting methods. Our results contest some common arguments for the continued use of first-author citation counts in the evaluation of scholars, such as high correlations between author rankings by first-author citation counts and by other citation counting methods, and the high cost of using more realistic citation counting methods that are not well supported by the ISI databases. It is argued that increasingly available digital full-text research papers make it possible for citation analysis studies to go beyond what the ISI databases have directly supported and to employ more sophisticated methods.