Transformer Models: From Model Inspection to Applications in Patents

Author: PUCCETTI Giovanni
Publication date: 07/11/2023

Natural Language Processing is used to address many tasks, both linguistic ones (e.g. part-of-speech tagging and dependency parsing) and downstream ones (e.g. machine translation and sentiment analysis). Dedicated approaches have been developed over time to tackle these tasks. A methodology that improves performance on all of them in a unified manner is language modeling: a model is pre-trained to replace masked tokens in large amounts of text, either randomly within chunks of text or sequentially one after the other, to develop general-purpose representations that can improve performance on many downstream tasks at once. The neural network architecture that currently performs this task best is the transformer; moreover, model size and data scale are essential to developing information-rich representations. The availability of large-scale datasets and the use of models with billions of parameters are currently the most effective path towards better representations of text. However, large models make it harder to interpret the output they provide. Several studies have therefore investigated the representations provided by transformer models trained on large-scale datasets. In this thesis I investigate these models from several perspectives. I study the linguistic properties of the representations provided by BERT, a language model mostly trained on the English Wikipedia, to understand whether the information it encodes is localized within specific entries of the vector representation. In doing so, I identify special weights that show high relevance to several distinct linguistic probing tasks. Subsequently, I investigate the cause of these special weights and link them to token distribution and special tokens. To complement this general-purpose analysis and extend it to more specific use cases, given the wide range of applications for language models, I study their effectiveness on technical documentation, specifically patents. I use both general-purpose and dedicated models to identify domain-specific entities, such as users of the inventions and technologies, or to segment patent text. I always complement performance analysis with careful measurements of data and model properties to understand whether the conclusions drawn for general-purpose models hold in this context as well.
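
The two pre-training setups mentioned in the abstract (replacing tokens masked at random, or predicting tokens sequentially one after the other) can be illustrated with a minimal sketch. It only builds the input/target pairs a model would be trained on; the function names, the `[MASK]` placeholder string, and the masking probability are illustrative assumptions, not details taken from the thesis.

```python
import random

MASK = "[MASK]"  # placeholder string, assumed for illustration


def random_masking(tokens, mask_prob=0.15, seed=0):
    """Mask tokens at random positions.

    Returns (inputs, targets): targets hold the original token at masked
    positions and None everywhere else.
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets


def sequential_targets(tokens):
    """Build (context, next_token) pairs for left-to-right prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "patents describe new technologies".split()
masked_inputs, masked_targets = random_masking(tokens, mask_prob=0.5, seed=1)
next_token_pairs = sequential_targets(tokens)
```

In the first setup the model sees the whole sequence with gaps; in the second it only sees the tokens to the left of the one being predicted.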

Archivio istituzionale della Ricerca - Scuola Normale Superiore

A simple and fast method for Named Entity context extraction from patents

Author: Puccetti Giovanni
Chiarello Filippo
Fantoni Gualtiero
Publication date: 01/01/2021

Extracting relevant technical information from patents or technical literature is as valuable as it is challenging. It requires extracting highly relevant information from a corpus of documents with a particular structure and a mix of technical and legal jargon. Patents are the widest free source of technical information in which homogeneous entities can be found. From a technical perspective, the approaches rely on Named Entity Recognition (NER) and use Machine Learning techniques for Natural Language Processing (NLP). However, because of the large amount of data, the complexity of the lexicon, the peculiarity of the structure, and the scarcity of examples available to train the machine learning system, new approaches should be studied. NER methods are improving in many contexts, but a gap still exists when dealing with technical documentation. The aim of this work is to automatically create training sets for NER systems by exploiting the nature and structure of patents, an open and massive source of technical documentation. In particular, we focus on collecting the contexts in which users of the invention appear within patents. We then measure to what extent we achieve our goal and discuss how well our method generalizes to other entities and documents.

Archivio istituzionale della Ricerca - Scuola Normale Superiore

Technology identification from patent texts : a novel named entity recognition method

Author: Spada Irene
Puccetti Giovanni
Chiarello Filippo
Giordano Vito
Fantoni Gualtiero
Publication date: 01/01/2023

Identifying technologies is a key element for mapping a domain and its evolution. It allows managers and decision makers to anticipate trends for an accurate forecast and effective foresight. Researchers and practitioners are taking advantage of the rapid growth of publicly accessible sources to map technological domains. Among these sources, patents are the widest technical open access database used in the literature and in practice. Nowadays, Natural Language Processing (NLP) techniques enable new methods for the analysis of patent texts. Among these techniques, in this paper we explore the use of Named Entity Recognition (NER) to identify the technologies mentioned in patent text. We compare three different NER methods, gazetteer-based, rule-based and deep learning-based (e.g. BERT), measuring their performance in terms of precision, recall and computational time. We test the approaches on 1600 patents from four assorted IPC classes as case studies. Our NER systems collected over 4500 fine-grained technologies, achieving the best results thanks to the combination of the three methodologies. The proposed method improves on the literature thanks to its ability to filter out generic technological terms. Our study delineates a valid technology identification tool that can be integrated in any text analysis pipeline to support academics and companies in investigating a technological domain.
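
As a rough illustration of the simplest of the three approaches named in the abstract, the sketch below matches a gazetteer (a fixed list of known terms) against a text and scores the result with precision and recall. The sample sentence, the gazetteer entries, the gold annotations, and the function names are all hypothetical; the abstract does not describe the authors' implementation.

```python
import re


def gazetteer_ner(text, gazetteer):
    """Return the gazetteer terms found in text (case-insensitive, whole-word)."""
    lowered = text.lower()
    found = set()
    for term in gazetteer:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, lowered):
            found.add(term.lower())
    return found


def precision_recall(predicted, gold):
    """Precision and recall of a predicted entity set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example data, not taken from the paper.
text = ("The device combines a lithium-ion battery with a neural network "
        "controller and a laser sensor.")
gazetteer = {"lithium-ion battery", "neural network", "3D printing"}
gold = {"lithium-ion battery", "neural network", "laser sensor"}

predicted = gazetteer_ner(text, gazetteer)
precision, recall = precision_recall(predicted, gold)
```

Here every term the gazetteer finds is correct, but "laser sensor" is missed because it is not in the list, which is the kind of coverage gap that motivates combining a gazetteer with rule-based and learned methods.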

Archivio istituzionale della Ricerca - Scuola Normale Superiore

Archivio della Ricerca - Università di Pisa

B4DS @ PRELEARN: Ensemble method for prerequisite learning

Author: Puccetti Giovanni
Chiarello Filippo
Fantoni G.
Bolanos Luis
Publication date: 01/01/2020

In this paper we describe the methodologies we proposed to tackle the EVALITA 2020 shared task PRELEARN. We propose a methodology based on gated recurrent units as well as one that uses more classical word embeddings together with ensemble methods. Our goal in choosing these approaches is twofold. On the one hand, we wish to see how much of the prerequisite information is present within the pages themselves. On the other, we would like to measure how much information from the rest of Wikipedia helps in identifying this type of relation. This second approach is particularly useful for extending to new entities that are close to those in the corpus provided for the task but not actually present in it. With these methodologies we reached second position in the challenge.

Archivio istituzionale della Ricerca - Scuola Normale Superiore

Outlier Dimensions that Disrupt Transformers are Driven by Frequency

Author: Drozd Aleksandr
Dell'Orletta Felice
Puccetti Giovanni
Rogers Anna
Publication date: 01/01/2022

While Transformer-based language models are generally very robust to pruning, there is the recently discovered outlier phenomenon: disabling only 48 out of 110M parameters in BERT-base drops its performance by nearly 30% on MNLI. We replicate the original evidence for the outlier phenomenon and we link it to the geometry of the embedding space. We find that in both BERT and RoBERTa the magnitude of hidden state coefficients corresponding to outlier dimensions correlates with the frequencies of encoded tokens in pre-training data, and that these dimensions also contribute to the “vertical” self-attention pattern that enables the model to focus on the special tokens. This explains the drop in performance from disabling the outliers, and it suggests that to decrease anisotropicity in future models we need pre-training schemas that better take into account the skewed token distributions.

Archivio istituzionale della Ricerca - Scuola Normale Superiore

Studying mixability with supermodular aggregating functions

Author: Puccetti Giovanni
Bignozzi Valeria
Publication date: 01/01/2015

We introduce the concepts of φ-complete mixability and φ-joint mixability and we investigate some necessary and sufficient conditions for the φ-mixability of a set of distribution functions for some supermodular functions φ. We give examples and numerical verifications that confirm our findings.

Crossref

AIR Universita degli studi di Milano

Archivio della ricerca- Università di Roma La Sapienza

The Vine Philosopher

Author: Puccetti Giovanni
Durante Fabrizio
Vanduffel Steven
Scherer Matthias
Publication date: 01/01/2017

Roger Cooke received his PhD (1974) from Yale University in Mathematics and Philosophy. From 1975 to 2005 he worked in the Netherlands, first as assistant professor in Logic and Philosophy of Science at the University of Amsterdam, and later as professor of Applied Decision Theory in the Department of Mathematics at the Delft University of Technology. In 2005 he moved back to the USA as senior fellow at Resources for the Future. In 2006-2008 he supervised the development of non-parametric continuous-discrete Bayesian Belief Nets for the Dutch Ministry of Transport. Subsequent development was under contract with Shell, AIRBUS, and the National Institute for Aerospace. In 2008 he was elected fellow of the Society for Risk Analysis. In 2010 he was named lead author in the fifth assessment of the Intergovernmental Panel on Climate Change for the chapter on Risk and Uncertainty. In 2011 he received the Lifetime Distinguished Achievement Award from the Society for Risk Analysis. He currently works on uncertainty quantification in conceptual design for AIRBUS and on value of information of Earth Observation Missions for NASA Langley.

Crossref

AIR Universita degli studi di Milano

Directory of Open Access Journals

Archivio Istituzionale della Ricerca- Università del Salento

Author Instructions

Author: Instructions Author
Publication date: 04/11/2013

Crossref

Cartographic Perspectives (E-Journal - North American Cartographic Information Society, NACIS)

Going Beyond Counting First Authors in Author Co-citation Analysis

Author: Zhao Dangzhi
Publication date: 01/01/2005

The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed
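
The difference between the two counting schemes compared in this abstract can be made concrete with a small sketch. It assumes each citing paper is given as a list of cited works and each cited work as an ordered author list; the function name, the `max_authors` parameter, and the sample data are hypothetical. It also makes one design choice the abstract does not specify: co-authors of the same cited work are counted as co-cited with each other.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author pairs that are cited together in the same citing paper.

    max_authors=1 keeps only the first author of each cited work;
    max_authors=5 keeps its first five authors.
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        # Each unordered pair is counted once per citing paper.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


# Hypothetical data: two citing papers, each citing two works.
citing_papers = [
    [["Adams", "Baker"], ["Clark"]],
    [["Adams"], ["Clark", "Davis"]],
]

first_author_counts = cocitation_counts(citing_papers, max_authors=1)
first_five_counts = cocitation_counts(citing_papers, max_authors=5)
```

With first-author counting only the pair (Adams, Clark) appears; including up to five authors adds pairs involving Baker and Davis, so the same citation data yields a denser author network.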

E-LIS

Variations on the Author

Author: Sayad Cecilia
Publication date: 01/01/2016

“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who propels discourse rather than expressing opinions, an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.

Crossref

Kent Academic Repository