Corpus of transcribed parliamentary speeches for authorship attribution and author profiling tasks.
This paper presents a corpus of transcribed Lithuanian parliamentary speeches, prepared in a format suited to different authorship identification tasks. The corpus consists of approximately 111 thousand texts (24 million words). Each text corresponds to one parliamentary speech delivered during an ordinary session within the 7 parliamentary terms from March 10, 1990 to December 23, 2013. The texts are grouped into 147 categories corresponding to individual authors, so they can be used for authorship attribution; they are also grouped by age, gender and political views, which makes them suitable for author profiling. Because short texts make an author's speaking style harder to recognize and easier to confuse with the style of other authors, only texts of at least 100 words were included. To make each category as comprehensive and representative as possible, only authors who delivered at least 200 speeches were included. All texts are lemmatized, morphologically and syntactically annotated, and tokenized into character n-grams. Statistical information about the corpus is also available. We demonstrate that the corpus can be used effectively for authorship attribution and author profiling with supervised machine learning methods. Its structure also allows use with unsupervised machine learning methods, for building rule-based methods, and in various linguistic analyses.
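The two inclusion thresholds and the character n-gram tokenization described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `(author, text)` pair layout, whitespace word counting, the function names, and the choice to count an author's speeches after the length filter are all assumptions made for illustration.

```python
from collections import Counter

MIN_WORDS = 100      # minimum words per speech, as stated in the abstract
MIN_SPEECHES = 200   # minimum speeches per author, as stated in the abstract


def filter_corpus(speeches, min_words=MIN_WORDS, min_speeches=MIN_SPEECHES):
    """Keep sufficiently long speeches by sufficiently prolific authors.

    `speeches` is a list of (author, text) pairs (hypothetical layout).
    Words are counted by whitespace splitting, which is an assumption.
    """
    # Step 1: drop speeches shorter than the word threshold.
    long_enough = [(a, t) for a, t in speeches if len(t.split()) >= min_words]
    # Step 2: count the remaining speeches per author (assumed order of steps).
    per_author = Counter(a for a, _ in long_enough)
    # Step 3: keep only authors who reach the speech threshold.
    return [(a, t) for a, t in long_enough if per_author[a] >= min_speeches]


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

Calling `filter_corpus(pairs)` with the default thresholds applies the 100-word and 200-speech limits from the abstract; smaller values can be passed for quick experiments on toy data.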
The Effect of author set size in authorship attribution for Lithuanian
This paper reports the first results on the effect of author set size in authorship attribution using automatic computational methods for the Lithuanian language. The aim is to determine how quickly authorship attribution results deteriorate as the number of candidate authors gradually increases, starting from 3 and going up to 5, 10, 20, 50, and 100. Using supervised machine learning techniques, we also investigated the influence of different feature types (lexical, character, morphological, etc.) and language types (normative parliamentary speeches and non-normative forum posts). The experiments revealed that the effectiveness of the method and of the feature types depends more on the language type than on the number of candidate authors. Content features based on word lemmas are the most useful type for the normative texts, because Lithuanian is a highly inflective language that is rich in morphology and vocabulary. Character features are the most accurate type for forum posts, where the texts are too complicated to be processed effectively with external morphological tools.
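One way to set up such an experiment is to draw candidate author sets of increasing size from a common pool. The sketch below is only an illustration under stated assumptions: the abstract does not say whether the author sets were nested or how they were sampled, and the function name, the seeded shuffle, and the nesting are all hypothetical choices.

```python
import random

AUTHOR_SET_SIZES = [3, 5, 10, 20, 50, 100]  # sizes named in the abstract


def nested_author_sets(authors, sizes=AUTHOR_SET_SIZES, seed=0):
    """Return {size: list of authors} with every smaller set contained in the larger ones.

    Nesting is an assumption; it keeps results comparable across sizes
    because each step only adds authors and never swaps them.
    """
    pool = sorted(authors)              # fixed order so the shuffle is reproducible
    random.Random(seed).shuffle(pool)   # seeded shuffle, in place
    # Sizes larger than the pool are skipped rather than padded.
    return {k: pool[:k] for k in sizes if k <= len(pool)}
```

Each returned set would then be used to restrict the training and test data before fitting a classifier, so that accuracy can be compared across the sizes.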
Properties of very frequent Lithuanian words and word forms (Labai dažnų lietuvių kalbos žodžių ir žodžių formų ypatybės).
The aim of the article is to discuss the characteristic features of the most frequent words and word forms in Lithuanian and their significance for text analysis, since the most frequent word forms differ from less frequent ones not only in their exceptionally high frequency but also in other properties specific to them. Very frequent words and word forms were identified on the basis of the Corpus of the Contemporary Lithuanian Language compiled by the Centre of Computational Linguistics of Vytautas Magnus University, and their properties were assessed using statistical methods. The statistical analysis showed that very frequent words and word forms have properties of their own that are unusual for less frequent words, so they should be treated as a separate group of words and analysed separately. The statistical data on word distribution support the conclusion that the main purpose of very frequent words and word forms, regardless of the part of speech they belong to (content or function), is related to their functional use rather than to the expression of content. Very frequent words and word forms may therefore be referred to as functional words. The analysis also showed that the distribution of very frequent words and word forms in texts is neither chaotic nor random. As the most frequent structural units of a text, they are directly related to text functions, and together with other formal text features they can be regarded as significant indicators of the functional properties of a text.
Statistical identification of text functions (Statistinis tekstų funkcijų nustatymas)
Dissertation prepared in 1999-2004 at Vytautas Magnus University (Vytauto Didžiojo universitetas). Bibliography: p. 30.
Academic Research and Standards: a discussion on standards for multi-lingual language resources
Frequency lists of pivot words and GSE counts
The resource contains data used to estimate the number of words in Lithuanian texts indexed by the selected Global Search Engines (GSE), namely Google (by Alphabet Inc.), Bing (by Microsoft Corporation), and Yandex (by ООО «Яндекс», Russia). For this purpose, a special list of 100 rare Lithuanian words (pivot words) with specific characteristics was compiled. Shorter lists for the Belarusian, Estonian, Finnish, Latvian, Polish, and Russian languages were also compiled.
Pivot words are words with special characteristics that are used to estimate the number of words in corpora. Pivot words used to estimate the number of words indexed by GSE should meet the following criteria: 1) frequency of occurrence of 10-100; 2) do not coincide with regular words in another language; 3) longer than 6 letters; 4) not of international origin; 5) not foreign loanwords; 6) not proper names of any kind; 7) not headword forms; 8) contain only basic Latin letters; 9) not specific to a particular domain or time period; 10) do not coincide with variants of other words when diacritics are removed; 11) do not coincide, when commonly misspelled, with words in other languages.
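Three of the criteria above (1, 3, and 8) can be checked mechanically; the rest need dictionaries or linguistic judgement. The following is a minimal sketch of such a pre-filter, not the resource's actual tooling. The function name is hypothetical, and treating the 10-100 frequency range as inclusive and the check as case-insensitive are assumptions.

```python
import string

BASIC_LATIN = set(string.ascii_lowercase)  # a-z only, no diacritics


def is_pivot_candidate(word, frequency):
    """Check criteria 1, 3 and 8 for a word and its corpus frequency.

    Passing this check does not make a word a pivot word; the remaining
    criteria (loanwords, proper names, cross-language clashes, ...) are
    not covered here.
    """
    w = word.lower()
    return (
        10 <= frequency <= 100                      # criterion 1 (inclusive bounds assumed)
        and len(w) > 6                              # criterion 3: longer than 6 letters
        and all(ch in BASIC_LATIN for ch in w)      # criterion 8: basic Latin letters only
    )
```

For example, a word containing `ž` or `ė` is rejected by the last condition even if its frequency and length qualify.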
The low frequency of pivot words is crucial for treating the count of document matches reported by GSE as an indicator of the word count. Comparative results for the neighbouring Belarusian, Estonian, Finnish, Latvian, Polish, and Russian languages have also been assessed. The results have been published at https://www.bjmc.lu.lv/fileadmin/user_upload/lu_portal/projekti/bjmc/Contents/10_3_06_Dadurkevicius.pdf
Humanitarinių mokslų fakultetas / Faculty of Humanities; Vytauto Didžiojo universitetas / Vytautas Magnus University; Lituanistikos katedra / Department of Lithuanian Studies
Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015
