NeMig - A Bilingual News Collection and Knowledge Graph about Migration
NeMig comprises two knowledge graphs, one German and one English, constructed from news articles on the topic of migration collected from online media outlets in Germany and the US, respectively. NeMig contains rich textual and metadata information, sub-topic and sentiment annotations, as well as named entities extracted from the articles' content and metadata and linked to Wikidata. The graphs are expanded with up to two-hop Wikidata neighbors of the initial set of linked entities.
NeMig comes in four flavors for both the German and the English corpora:
Base NeMig: contains literals and entities from the corresponding annotated news corpus;
Entities NeMig: derived from the Base NeMig by removing all literal nodes, it contains only resource nodes;
Enriched Entities NeMig: derived from the Entities NeMig by enriching it with up to two-hop neighbors from Wikidata, it contains only resource nodes and Wikidata triples;
Complete NeMig: the combination of the Base and Enriched Entities NeMig, it contains both literals and resources.
Information about uploaded files:
(All files are bzip2-compressed and in the N-Triples format.)
File: Description
nemig_{language}_{graph_type}-metadata.nt.bz2: Metadata about the dataset, described using the VoID vocabulary.
nemig_{language}_{graph_type}-instances_types.nt.bz2: Class definitions of news and event instances.
nemig_{language}_{graph_type}-instances_labels.nt.bz2: Labels of instances.
nemig_{language}_{graph_type}-instances_related.nt.bz2: Relations between news instances based on one another.
nemig_{language}_{graph_type}-instances_metadata_literals.nt.bz2: Relations between news instances and metadata literals (e.g. URL, publishing date, modification date, sentiment label, political orientation of news outlets).
nemig_{language}_{graph_type}-instances_content_mapping.nt.bz2: Mapping of news instances to content instances (e.g. title, abstract, body).
nemig_{language}_{graph_type}-instances_topic_mapping.nt.bz2: Mapping of news instances to sub-topic instances.
nemig_{language}_{graph_type}-instances_content_literals.nt.bz2: Relations between content instances and corresponding literals (e.g. text of title, abstract, body).
nemig_{language}_{graph_type}-instances_metadata_resources.nt.bz2: Relations between news or sub-topic instances and entities extracted from metadata (i.e. publishers, authors, keywords).
nemig_{language}_{graph_type}-instances_event_mapping.nt.bz2: Mapping of news instances to event instances.
nemig_{language}_{graph_type}-event_resources.nt.bz2: Relations between event instances and entities extracted from the text of the news (i.e. actors, places, mentions).
nemig_{language}_{graph_type}-resources_provenance.nt.bz2: Provenance information about the entities extracted from the text of the news (e.g. title, abstract, body).
nemig_{language}_{graph_type}-wiki_resources.nt.bz2: Relations between Wikidata entities from news and their k-hop entity neighbors from Wikidata.
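Because the files are bzip2-compressed N-Triples, they can be streamed line by line without unpacking them first. The following is a minimal sketch using only the Python standard library. The helper names `iter_triples` and `predicate_counts` are hypothetical, and the parsing is deliberately naive (one triple per line, terms returned as raw strings); a full RDF library would be the more robust choice for real use.

```python
import bz2
from collections import Counter
from typing import Iterator, Tuple


def iter_triples(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (subject, predicate, object) strings from a .nt.bz2 file.

    Naive parsing (an assumption of this sketch): one triple per line,
    with subject and predicate containing no whitespace, as is the case
    for IRIs and blank-node labels in N-Triples.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comment lines
            subject, predicate, rest = line.split(None, 2)
            # Drop the terminating "." that closes every statement.
            obj = rest.rstrip()
            if obj.endswith("."):
                obj = obj[:-1].rstrip()
            yield subject, predicate, obj


def predicate_counts(path: str) -> Counter:
    """Count how often each predicate occurs in the file."""
    return Counter(predicate for _, predicate, _ in iter_triples(path))
```

Calling `predicate_counts` on one of the files listed above (with `{language}` and `{graph_type}` filled in) would give a quick overview of which relations that file contains.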
On the Effect of Incorporating Expressed Emotions in News Articles on Diversity within Recommendation Models
Although news articles are heavily edited and trimmed to maintain a neutral and objective tone, stylistic residues of their authors remain, such as expressed emotions, and these influence users' decisions about whether to consume the recommended articles. In this study, we examine the effects of incorporating emotional signals into the model on both emotional and topical diversity in news recommendations. Our findings show a nuanced alignment with users’ preferences, leading to less diversity and the potential creation of an “emotion chamber.” However, it is crucial to model these emotional dimensions explicitly rather than implicitly, as contemporary deep-learning models do. This approach offers the opportunity to communicate and raise awareness about the reduction in diversity, allowing for interventions if necessary. We further explore the complex distinction between intra-list and user-centric diversity, sparking a critical debate on guiding user choices. Overall, our work emphasizes the importance of a balanced, ethically-grounded approach, paving the way for more informed and diverse news consumption.
Going Beyond Counting First Authors in Author Co-citation Analysis
The present study examines one of the fundamental aspects of author co-citation analysis (ACA) - the way co-citation
counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings
are based, and we compare ACA results based on two different types of co-citation counting - the traditional type that
only counts the first one among a cited work's authors on the one hand and a non-traditional type that takes into
account the first 5 authors of a cited work on the other hand. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
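The difference between the two counting schemes can be illustrated with a short sketch. The data layout (a citing paper as a list of cited works, each an ordered list of author names) and the function name `cocitation_counts` are assumptions made for illustration. The sketch also assumes that each author pair is counted at most once per citing paper and that co-authors of the same cited work count as co-cited; the study itself may use different conventions.

```python
from collections import Counter
from itertools import combinations
from typing import Sequence

# Assumed layout: a citing paper is a sequence of cited works, and each
# cited work is an ordered sequence of author names.
CitingPaper = Sequence[Sequence[str]]


def cocitation_counts(papers: Sequence[CitingPaper], max_authors: int = 1) -> Counter:
    """Count author pairs that are cited together by the same paper.

    max_authors=1 reproduces first-author counting; max_authors=5 takes
    the first five authors of every cited work into account.
    """
    counts: Counter = Counter()
    for cited_works in papers:
        authors = set()
        for work in cited_works:
            authors.update(work[:max_authors])
        # Each unordered pair is counted once per citing paper.
        for pair in combinations(sorted(authors), 2):
            counts[frozenset(pair)] += 1
    return counts
```

With `max_authors=1`, an author who only ever appears in second position never enters the counts, which is the limitation that counting the first five authors is meant to reduce.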
Variations on the Author
“Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourse; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.
Appropriate Similarity Measures for Author Cocitation Analysis
We provide a number of new insights into the methodological discussion about author cocitation analysis. We first argue that the use of the Pearson correlation for measuring the similarity between authors’ cocitation profiles is not very satisfactory. We then discuss what kind of similarity measures may be used as an alternative to the Pearson correlation. We consider three similarity measures in particular. One is the well-known cosine. The other two similarity measures have not been used before in the bibliometric literature. Finally, we show by means of an example that our findings have a high practical relevance.
Keywords: information science; Pearson correlation; cosine; similarity measure; author cocitation analysis
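For orientation, the two named measures can be computed on cocitation profiles as in the sketch below. It uses the standard textbook definitions; the profile vectors in the usage note are made up, and the two further measures mentioned in the abstract are not shown because the abstract does not specify them. The Pearson correlation is written here as the cosine of the mean-centered vectors, which is an equivalent formulation.

```python
import math
from typing import Sequence


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine similarity; undefined (division by zero) for an all-zero vector."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])
```

For two hypothetical profiles with no overlap, `[1, 0]` and `[0, 1]`, the cosine is 0 while the Pearson correlation is approximately -1, which shows that the two measures can rank the same pair of authors quite differently.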
Dynamic Fusion of Information Retrieval Systems on a Per-Query Basis
Hybrid search is a technique in information retrieval that aims to improve search
results by merging the results of different retrievers. The purpose of this master's
thesis is to explore the merging of lexical (keyword-based) and semantic
(meaning-based) search by means of convex combination, a fusion function that
weights the results from each retriever according to a coefficient, α. In contrast
to previous research, where the same α is used for all queries in the dataset, this
project explores methods for determining α dynamically per query, using information
drawn from the query itself.
Two types of models were developed: algorithmic methods that predict α from simple
properties of the query, with no need for training, and machine-learned models
(based on LightGBM and neural networks) that are trained on the given datasets to
predict α from more advanced features. In addition, the relationship between the
features and the optimal values of α is analyzed.
The algorithmic methods worked on certain datasets, but the machine-learned methods
were generally better, as they regularly outperformed the baseline methods. The
performance of the machine-learned methods relative to their potential was
nevertheless limited, presumably because of a lack of sufficient training data.
Even so, the results are promising, and the project highlights the considerable
performance improvements that could be achieved if better fusion methods can be
devised.
Hybrid search is a technique in information retrieval which aims to improve
search results by merging the results of multiple retrievers. This thesis explores
combining lexical (keyword-based) and semantic (meaning-based) search using
convex combination, a fusion function which merges results based on a blending
coefficient, α. Unlike previous research, which commonly uses a fixed α for all
queries, this project studies methods for determining α on a per-query basis, using
information extracted from the query itself.
Two main types of models are developed: algorithmic models that infer α from
query properties without any training, and machine learning models—based on
LightGBM and feed-forward neural networks—which are trained on the target
datasets in order to predict α from more advanced features. In addition, the
relationship between various query properties and the optimal α values is analyzed.
The algorithmic models saw some success on certain datasets, but were generally
less performant than the supervised models, which consistently outperformed
the baselines. The improvements of the supervised models were, however, modest,
which is primarily attributed to a lack of adequate training data. Despite these
challenges, the findings are promising, and highlight the potential for considerable
improvement in retrieval results.
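Convex combination can be sketched in a few lines. In this illustration the fused score of a document is α times its semantic score plus (1 − α) times its lexical score; which retriever α weights, the min-max normalization step, and the treatment of documents missing from one list (score 0) are all assumptions of the sketch, not details taken from the thesis. The function names are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score list maps to all zeros."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def convex_fusion(
    lexical: Dict[str, float],
    semantic: Dict[str, float],
    alpha: float,
) -> List[Tuple[str, float]]:
    """Fuse two result lists as alpha * semantic + (1 - alpha) * lexical."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    lex = min_max_normalize(lexical)
    sem = min_max_normalize(semantic)
    fused = {
        # A document missing from one list contributes 0 for that retriever.
        doc: alpha * sem.get(doc, 0.0) + (1.0 - alpha) * lex.get(doc, 0.0)
        for doc in set(lex) | set(sem)
    }
    # Highest fused score first; ties are broken by document id.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

A fixed-α baseline would call `convex_fusion` with the same `alpha` for every query, whereas a per-query approach would first predict `alpha` from the query and then pass that value in.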
Dispelling the Myths Behind First-author Citation Counts
We conducted a full-scale evaluative citation analysis study of scholars in the XML research field to explore just how different from each other author rankings resulting from different citation counting methods actually are, and to demonstrate the capability of emerging data and tools on the Web in supporting more realistic citation counting methods. Our results contest some common arguments for the continued
use of first-author citation counts in the evaluation of scholars, such as high correlations between author rankings by first-author citation counts and other citation
counting methods, and high costs of using more realistic citation counting methods that are not well-supported by the ISI databases. It is argued that increasingly available digital full text research papers make it possible for citation analysis studies to go beyond what the ISI databases have directly supported and to employ more
sophisticated methods.
A Norwegian Whisper Model for Automatic Speech Recognition
Speech recognition has made considerable progress in recent years and has become substantially better at transcribing audio to text in different languages. Today the technology is indispensable and is used in various smart devices, including social robots. Social robots are designed to communicate with humans in a natural and intuitive way and are used, among other things, for language learning, teaching, and the treatment of children with autism. One example of a social robot is the Furhat robot from Furhat Robotics, which is used by the Norwegian Research Center for AI Innovation (NorwAI) at NTNU to test and demonstrate language models developed at the center. Although the robot is equipped with modern and advanced technology, its speech recognition model is not ideal. Among other things, it struggles with a number of Norwegian dialects, has great difficulty with names and abbreviations, and is very unreliable when there is a lot of background noise. Beyond that, the model supports only Bokmål and is unable to transcribe to Nynorsk. The aim of this thesis is therefore to investigate whether the current model can be replaced with Whisper. Whisper is an advanced speech recognition model that was trained on more than 680,000 hours of data and supports 96 different languages for speech recognition. The medium-sized Whisper model was fine-tuned on Bokmål and Nynorsk using the Storting corpus (Stortingskorpuset), and its performance was analyzed with respect to noise robustness, the transcription of names, and speaker-related characteristics such as dialect, age, and gender. In addition, the model was compared with the small Whisper model and Wav2Vec 2.0, both of which were trained by the National Library of Norway (Nasjonalbiblioteket). The models were compared and evaluated using the word error rate (WER), which measures the number of words that must be inserted, deleted, and substituted for the prediction to match the reference sentence.
The word error rate was reduced considerably for both Bokmål and Nynorsk, and the results show that performance is affected by neither gender nor age. In addition, the word error rate is relatively stable when the noise level is low, and only begins to rise once the signal-to-noise ratio is 10 dB or less. The results do show, however, that performance is affected by the speaker's dialect, so that the word error rate is somewhat higher for some dialects and lower for others. Furthermore, the word error rate is somewhat higher for sentences that contain names or abbreviations, which suggests that Whisper is not exempt from the problem.
In the past few decades, automatic speech recognition (ASR) systems have made significant progress, achieving high transcription accuracy across a wide range of languages. Today, ASR systems are indispensable components of various smart devices, particularly social robots. Social robots are designed to interact with humans in a natural and intuitive manner and are used in various ways, including language learning, tutoring, and therapy for children with autism. An example of a modern social robot is the Furhat robot by Furhat Robotics. It is used at the Norwegian Research Center for AI Innovation (NorwAI) to test and demonstrate language models developed at the center. Still, despite its modern technology, the speech recognition system of the Furhat robot is not ideal as it struggles with a range of Norwegian dialects, is very susceptible to background noise, and has difficulties understanding names. Moreover, while it is capable of transcribing spoken Norwegian to Bokmål, which is one of the two official written languages in Norway, it has no built-in support for the second official written language, that is, Nynorsk.
In an effort to combat the issues with the current speech recognition system, this thesis investigates the adaptation of Whisper to the Furhat robot. Whisper is a state-of-the-art speech recognition model trained on 680,000 hours of training data and supporting 96 different languages for multilingual speech recognition.
The medium-sized Whisper model was fine-tuned on Bokmål and Nynorsk using the Norwegian Parliament Speech Corpus (NPSC) dataset and evaluated on both languages with regard to the overall performance, noise robustness, the transcription of names, as well as speaker-related characteristics, such as dialect, age, and gender. The performance of the fine-tuned model was further compared to other state-of-the-art architectures, including a fine-tuned version of the small Whisper model and Wav2Vec 2.0. The model was compared and evaluated using the word error rate (WER), which is the number of insertions, deletions, and substitutions required for the prediction to match the ground-truth sentence.
Fine-tuning the model improved the overall WER considerably in both written languages and model performance was generally not influenced by the age or gender of the speaker. Moreover, even though the WER starts to increase at high levels of noise with a signal-to-noise ratio of 10 dB or less, model performance remains stable at low levels of noise. However, while the overall dialect performance was significantly improved by fine-tuning, some dialects still caused the WER to spike. What is more, the WER increased in many cases if a name or abbreviation was present in the sentence, indicating that the transcription of names remains an issue.
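The word error rate described above is conventionally reported as the number of word-level insertions, deletions, and substitutions divided by the number of words in the reference. A minimal sketch of that computation follows. The function name is hypothetical, tokenization is plain whitespace splitting, and no text normalization (casing, punctuation) is applied, which may differ from the evaluation setup used in the thesis.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing in hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Because insertions are counted against the reference length, the value can exceed 1 when the hypothesis is much longer than the reference.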
