
    Description of valency frames of the Croatian ditransitive verbs based on dependency treebanks

    Analysis of verb valency frames comprises a description and inventory of the number and types of arguments for which a given verb opens a slot in a sentence, that is, which the verb governs at the syntactic and morphological levels. After a theoretical consideration of valency and transitivity and an overview of the Croatian language resources based on them, this paper presents a computational description of verb frames grounded in real language usage. The SETimes-HR dependency treebank was selected as the source of corpus data; it is currently the largest syntactically annotated corpus of Croatian and is available online as part of the Universal Dependencies project (https://universaldependencies.org/hr/). Using ditransitive verbs as an example, a model for extracting valency frames was developed with the pyconll package in Python. Annotation errors and difficulties in the computational modeling of the verb's dependency structure were identified, and solutions were proposed. Quantitative analysis of the extracted data showed that ditransitive frames are rare in the corpus, but that ditransitive verbs are often realized in other valency frames. Finally, the concept of syntactic n-grams and collocations was introduced, which can extend the possibilities of frame extraction and improve the accuracy of the model. The presented method proved useful and applicable to future research on valency and verbal syntax, as well as to the development of language technologies for Croatian.
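
The frame extraction described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' model: the paper used the pyconll package, whereas this sketch substitutes a small hand-rolled CoNLL-U reader so that it runs with the standard library alone. The sample sentence, the set of relations counted as frame slots, and the helper names `parse_conllu` and `extract_frames` are all assumptions made for the example.

```python
from collections import Counter

# Relations treated as frame slots (an assumption of this sketch, not the paper's list).
ARGUMENT_RELS = {"nsubj", "obj", "iobj", "obl", "ccomp", "xcomp"}

# A hand-made CoNLL-U sentence for illustration only; it is not taken from SETimes-HR.
SAMPLE = """\
# text = Ivan je dao Ani knjigu.
1\tIvan\tIvan\tPROPN\t_\tCase=Nom\t3\tnsubj\t_\t_
2\tje\tbiti\tAUX\t_\t_\t3\taux\t_\t_
3\tdao\tdati\tVERB\t_\t_\t0\troot\t_\t_
4\tAni\tAna\tPROPN\t_\tCase=Dat\t3\tiobj\t_\t_
5\tknjigu\tknjiga\tNOUN\t_\tCase=Acc\t3\tobj\t_\t_
6\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""


def parse_conllu(text):
    """Yield each sentence as a list of token dicts (ordinary word lines only)."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():          # a blank line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):      # sentence-level comment
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():     # skip multiword ranges (1-2) and empty nodes (1.1)
            continue
        feats = {} if cols[5] == "_" else dict(f.split("=", 1) for f in cols[5].split("|"))
        sentence.append({"id": cols[0], "lemma": cols[2], "upos": cols[3],
                         "feats": feats, "head": cols[6], "deprel": cols[7]})
    if sentence:
        yield sentence


def extract_frames(sentences):
    """Count (verb lemma, sorted argument slots) pairs over all sentences."""
    frames = Counter()
    for sent in sentences:
        for verb in (t for t in sent if t["upos"] == "VERB"):
            slots = []
            for dep in sent:
                rel = dep["deprel"].split(":")[0]   # drop subtypes such as obl:arg
                if dep["head"] == verb["id"] and rel in ARGUMENT_RELS:
                    case = dep["feats"].get("Case")
                    slots.append(f"{rel}:{case}" if case else rel)
            frames[(verb["lemma"], tuple(sorted(slots)))] += 1
    return frames


frames = extract_frames(parse_conllu(SAMPLE))
# A frame counts as ditransitive here when it has both an obj and an iobj slot.
ditransitive = {k: n for k, n in frames.items()
                if any(s.startswith("obj") for s in k[1])
                and any(s.startswith("iobj") for s in k[1])}
```

Recording the case of each dependent alongside its relation keeps the morphological side of government visible in the frame. Whether a given treebank marks datives as `iobj` or `obl` is an annotation decision that a real extraction would have to check against the data.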

    Intralexical and interlexical structures of the nominal part of the Croatian lexicon

    The main goal of this thesis is to describe the intralexical and interlexical structures of Croatian, with an emphasis on nominal suffixation. The description of intralexical structures entails describing the morphological structure of Croatian nouns, while the description of interlexical structures is based on a morphotactic model that shows the word-formation relations among Croatian lexemes and the constraints affecting the possibility of suffixal derivation. For this thesis, the most frequent nouns were collected from the two largest online corpora of Croatian: the Croatian National Corpus and hrWaC. From each corpus, the 5,000 most frequent nouns were extracted via a simple wordlist search. After duplicate entries were removed and the list was manually cleaned, 5,536 of the most frequent Croatian nouns remained for morphological and word-formation analysis. The thesis is divided into three parts. The first part establishes, within the framework of basic linguistic theory, the principles of morphological and word-formation analysis of Croatian nouns. Within morphological analysis, morph analysis and morpheme analysis are distinguished. Morph analysis segments the surface form of a word, while morpheme analysis links the surface morphs, in the deep layer, to the basic morph that represents the morpheme. Word-formation analysis identifies the base word of the derivation and the derivational affixes. Based on the results of both analyses, data are then given on the most frequent Croatian roots in noun formation, compared with the most frequent roots in verb formation, and on the most frequent suffixes and their combinations in the morphological structure of nouns. Such data had not previously been available for Croatian. The first part closes with a description of existing computational resources at the word-formation level, including CroDeriv, the first publicly available computational resource dealing with Croatian morphology at the word-formation level. The structure of the CroDeriv lexical entry is elaborated, and it is shown how the results of this thesis enrich the computational representation of Croatian morphology. The second part analyses the polysemous structure of 19 nominal suffixes that occur in suffixal combinations in the morphological structure of Croatian nouns. A model for describing the polysemous structures of Croatian affixes was developed. It is based on the analysis of a large number of words derived with the same suffixes and follows clearly defined procedures that ensure a consistent analysis, whose results are 1) applicable in building a morphotactic model for Croatian and 2) suitable for a computational description of Croatian word formation. This is an approach to describing the semantic structure of Croatian suffixes for which the principles of description are explicitly stated and whose applicability is then tested on a larger number of suffixes. The analysis established which suffixes can express the same semantic categories, that is, which suffixes are near-synonymous in one of their senses. It also established which of the analysed suffixes can combine with one another, showing that only a very small number of the possible suffixal combinations are realised in Croatian. The third part analyses the attested suffixal combinations in the morphological structure of Croatian nouns. Existing approaches to affix ordering in the world's languages are described, and all of them except the cognitive model of binary suffix combinations are shown to be inapplicable to Croatian data. That model is therefore applied to Croatian and shown to be applicable. However, it is not sufficient for a complete description of suffix ordering in the morphological structure of Croatian nouns and must be supplemented with individual phonological, morphological, syntactic, semantic and etymological principles. In addition, the principles are shown to apply hierarchically in many cases, and an existing word can influence the choice of an atypical suffix when a new word is formed. This first account of the principles affecting affix ordering in Croatian confirms the main hypothesis of the thesis: affix ordering in Croatian is not arbitrary, that is, principles can be established that determine why only certain morpheme combinations are realised.
    Although Croatian is a morphologically rich language, comprehensive and detailed descriptions of its morphological properties are scarce, especially when it comes to morphological and word-formation analysis. The main goal of this thesis is to overcome this shortfall by describing the intralexical and interlexical structures of Croatian nouns derived via suffixation. To achieve this goal, the thesis is divided into three major parts: 1. Morphological and word-formation analysis of Croatian nouns, 2. Semantic description of Croatian nominal suffixes, and 3. The principles of the morphotactic model of the Croatian nominal suffixation. Although models of affix ordering exist for a wide range of languages (cf. Manova and Aronoff 2010), none of these models has been applied to Croatian language data, mainly because morphosemantically analysed lexemes did not exist. The first two parts of the thesis are thus preparatory steps for the morphotactic model in the third part. Our starting hypothesis is that affix ordering in Croatian is not arbitrary and that principles governing the possible morpheme combinations can be established.
    1. Morphological and word-formation analysis of Croatian nouns
    In the first major part of this thesis, we extracted the 5,000 most frequent nouns from each of the two major Croatian corpora: the Croatian National Corpus (Tadić 2009a) and the Croatian Web Corpus hrWaC (Ljubešić and Erjavec 2011). The nouns were obtained via a simple wordlist search and manually cleaned. The initial set consisted of 5,536 Croatian nouns, both motivated (derived) and non-motivated (base).
    Only after the nouns in this initial set were morphologically analysed and their word-formation patterns were established were we able to extract suffixed nouns, which were the main focus of further analysis. However, in order to perform morphological and word-formation analysis, it was necessary to establish principles of analysis. Our model is formulated within the framework of basic linguistic theory (Haspelmath 2009; Dryer 2006; Dixon 1997), a descriptive and non-restrictive theory that enables the description of a wide range of grammatical phenomena. Our approach is a formal, morpheme-based approach. It considers morphology to be a part of grammar, although it allows that some idiosyncratic combinations are stored directly in the lexicon. Moreover, it includes both phonological (e.g. minimal pairs, complementary distribution) and syntactic (e.g. affix ordering) formalisms. Finally, the model presupposes that meaning, especially word-formation meaning, is usually incremental, i.e. compositional. The principles of morphological and word-formation analysis established in this thesis are the first major outcome of our research. We have differentiated between morphological analysis on the one hand and word-formation analysis on the other. Morphological analysis is a hypernym and includes both morph and morpheme analysis. Morph analysis is the analysis of the surface form of a word, and morpheme analysis connects surface morphs with their basic morphs in the deep layer. We have also emphasised that morphological analysis has to include both lists of morphemes and the rules that determine their combinations as a precondition for building a morphotactic model. Moreover, we have emphasised that it is necessary to distinguish between morphological and word-formation analysis. Morphological analysis enables us to determine the intralexical structures of the word analysed, while word-formation analysis enables the description of interlexical structures within word-formation families. The established principles were applied to the analysis of the initial set of Croatian nouns. The morphological analysis resulted in a list of Croatian nominal morphemes, both lexical and affixal, and of possible suffixal combinations. We have demonstrated that only a small number of the possible suffixal combinations actually occur. Moreover, only ca. 20 suffixal combinations occur in the morphological structure of more than 10 derived words. Thus, we have confirmed the first hypothesis: only some of the possible suffixal combinations occur, and some of them are more frequent than others. We have also presented the most frequent lexical morphemes and the most productive nominal suffixes. The data on word-formation families enabled an interlexical description of the Croatian lexicon that had not been possible earlier. The second major outcome of this thesis is the computational representation of the morphological and word-formation analysis of Croatian nouns in CroDeriv, a Croatian derivational lexicon. CroDeriv previously consisted only of morphologically analysed verbs, and our analysis enabled its further expansion in two directions: 1) we have included another major part of speech, nouns, in the lexicon, and 2) we have expanded the structure of the entry with the word-formation analysis and the affixal senses. These expansions will make CroDeriv a unique morphological resource that offers a thorough morphological description of one of the world's languages. As a final step of the first part of the research, we have extracted the nominal suffixes that occur in the confirmed suffixal combinations for the semantic analysis in the second part of the thesis.
    2. Semantic description of Croatian nominal suffixes
    The second major part of this thesis consists of the semantic analysis of Croatian nominal suffixes. First, we have shown that there is no coherent theoretical approach to affixal semantics in the contemporary linguistic literature. The most systematic model is the one presented in Bagasheva (2017). However, the principles that govern the determination of affixal senses are not explicitly stated there. Thus, we have presented our own approach to affixal semantics. It rests on explicitly formulated principles that follow the regular polysemy approach (Apresjan 1974) and on the analysis of numerous nouns derived via the same suffix. We determined 27 semantic categories that can be realised by Croatian nominal suffixes. This approach was immediately applied to a wide range of Croatian nominal suffixes, and their polysemous structures were determined. The semantic analysis of Croatian suffixes confirmed our second hypothesis: suffixes that can combine with a wide range of other suffixes and bases have more complex polysemous structures. The obtained results were used, along with the results of the morphological and word-formation analysis in the first part of the thesis, to establish the principles governing Croatian nominal morphotactics in the third part.
    3. The principles of the morphotactic model of the Croatian nominal suffixation
    In the first part of this section, we have described the existing models of affix ordering and the general principles governing affix order in the languages of the world. We have shown that only the cognitive model of binary suffix combinations, presented in Manova (2011a), has not been challenged so far. Moreover, it is the only model that was built on Slavic data. Thus, we used this model as the starting point for the morphotactic model of the Croatian nominal suffixation. This model was complemented with language-specific principles discovered in the Croatian data. We have focussed our analysis on the most problematic cases: suffixal combinations in which two nominal suffixes have the same semantic properties and in which it cannot be stated that one of them is applied by default. For example, the Croatian suffixes -lo and -lica are both nominal suffixes with instrumental meaning and can follow the verbal thematic suffix, i.e. they can occur in the same suffixal combinations, e.g. sjed-a-lo ʻseatʼ ~ sjed-a-lica ʻseatʼ. These examples show that the governing principles of Manova’s model, although functional in Croatian, are not sufficient to describe Croatian morphotactics in full. We have thus analysed the semantic categories that can be expressed by several suffixes (ʻagent/professionʼ, ʻlocationʼ, ʻinstrumentʼ, ʻproperty/characteristicʼ) to determine additional principles. Moreover, we have extended our analysis to include not only binary suffixal combinations but also base-suffix combinations, to gain deeper insight into the intralexical structures of Croatian nouns. Finally, on the basis of the data analysed in this thesis, we have determined phonological, syntactic, morphological, semantic and etymological principles governing affix ordering in Croatian. These principles enable some of the possible combinations and restrict others. We have also emphasised that the principles are hierarchically governed and that existing words can block the application of the default suffix, thus resulting in atypical combinations. The examples of mirror-image combinations have additionally confirmed that both the order and the meaning of all morphemes in the morphological structure contribute to the meaning of the derived word. Finally, the principles of affix ordering have confirmed the main hypothesis of this thesis: affix ordering in Croatian is not arbitrary, and principles governing the possible morpheme combinations can be established.
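
The idea of binary suffix combinations, and of two suffixes competing for the same position, can be illustrated with a short sketch. It uses only the two segmented forms quoted in the abstract (sjed-a-lo, sjed-a-lica). The hyphenated input format, the assumption that the first morph is always the root, and the names `suffix_pairs`, `pair_counts` and `followers` are choices made for this example, not the thesis's actual data model.

```python
from collections import Counter, defaultdict

# The two segmented forms cited in the abstract; morph boundaries are marked with "-".
SEGMENTED = ["sjed-a-lo", "sjed-a-lica"]


def suffix_pairs(segmented_word):
    """Return the adjacent (binary) suffix combinations of one segmented word."""
    morphs = segmented_word.split("-")
    suffixes = morphs[1:]  # simplifying assumption: the first morph is the root, no prefixes
    return list(zip(suffixes, suffixes[1:]))


# How often each binary suffix combination is attested in the word list.
pair_counts = Counter(pair for word in SEGMENTED for pair in suffix_pairs(word))

# For each suffix, the set of suffixes attested immediately after it.
followers = defaultdict(set)
for first, second in pair_counts:
    followers[first].add(second)

# Suffixes with more than one attested follower mark positions where suffixes compete.
competing = {first: seconds for first, seconds in followers.items() if len(seconds) > 1}
```

On this input, `competing` maps the thematic suffix `a` to both `lo` and `lica`. That is the kind of case the abstract says the binary model alone cannot settle, so additional principles are needed.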

    Annotating Coordination in Dependency Treebanks

    In this paper, we present how coordination, both of clauses and of phrases, is annotated in dependency treebanks. Dependency treebanks are built in accordance with dependency approaches to syntax and are a prerequisite for building parsers, tools for the automatic syntactic annotation of sentences. Special emphasis is given to the annotation of coordination within the Universal Dependencies (UD) project (https://universaldependencies.org/). The UD project aims for consistent annotation of grammatical structures across the world's languages and has so far collected almost 200 treebanks for more than 100 languages, including one for Croatian, the Croatian UD. Before the Croatian UD treebank, the Croatian Dependency Treebank, HOBS (hobs.ffzg.hr), was built following a modified Prague Dependency Treebank specification for annotation at the analytical level. The two treebanks differ in how they annotate particular syntactic structures. We show the main differences in annotating coordination in the two Croatian dependency treebanks, then focus on problematic cases of phrasal and clausal coordination and the solutions adopted for them in each treebank, and discuss the advantages and drawbacks of those solutions.
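
The structural difference between the two annotation styles can be sketched on a toy example. In UD, the first conjunct heads the coordination, later conjuncts attach to it as `conj`, and the conjunction attaches to the following conjunct as `cc`. In Prague-style analytical annotation, the conjunction heads the coordination and the conjuncts are marked as its members. The sentence, the token dictionaries, the function name `ud_to_prague_style`, and the practice of appending `_Co` to the UD relation name (rather than mapping it to a Prague function label) are all simplifying assumptions of this sketch. It is not a conversion used by either treebank.

```python
def ud_to_prague_style(tokens):
    """Re-attach simple UD coordinations so that the conjunction becomes the head.

    tokens: list of dicts with keys id, form, head, deprel (UD-style attachment).
    Returns a new list; the input is not modified.
    """
    out = {t["id"]: dict(t) for t in tokens}
    for first in tokens:
        # Later conjuncts hang off the first conjunct with the relation `conj`.
        conjuncts = [t for t in tokens
                     if t["head"] == first["id"] and t["deprel"] == "conj"]
        if not conjuncts:
            continue
        conjunct_ids = {c["id"] for c in conjuncts}
        # In UD v2 the conjunction (`cc`) attaches to the conjunct that follows it.
        conjunctions = [t for t in tokens
                        if t["deprel"] == "cc" and t["head"] in conjunct_ids]
        if not conjunctions:
            continue  # coordination without a conjunction is left as-is in this sketch
        coord = conjunctions[-1]
        member_label = first["deprel"] + "_Co"  # simplified member marking (assumption)
        # The conjunction takes over the first conjunct's attachment point.
        out[coord["id"]].update(head=first["head"], deprel="Coord")
        for member in [first] + conjuncts:
            out[member["id"]].update(head=coord["id"], deprel=member_label)
    return [out[t["id"]] for t in tokens]


# "Ana i Ivan spavaju" ('Ana and Ivan are sleeping'), a hand-made example in UD style.
ud_tree = [
    {"id": 1, "form": "Ana", "head": 4, "deprel": "nsubj"},
    {"id": 2, "form": "i", "head": 3, "deprel": "cc"},
    {"id": 3, "form": "Ivan", "head": 1, "deprel": "conj"},
    {"id": 4, "form": "spavaju", "head": 0, "deprel": "root"},
]
prague_tree = ud_to_prague_style(ud_tree)
```

After conversion, the conjunction `i` attaches to the verb and both names attach to `i`. Nested coordination, shared modifiers, and coordination without a conjunction are not handled here.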

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two types of co-citation counting: the traditional type, which counts only the first author of a cited work, and a non-traditional type, which takes into account the first five authors of a cited work. Results indicate that the picture produced through the non-traditional counting contains more coherent author groups and is therefore considerably clearer. However, when the same number of top-ranked authors is selected and analyzed, this picture represents fewer specialties in the research field under study than the one produced through traditional first-author counting. Reasons for these effects are discussed.
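
The two counting schemes can be compared in a few lines. This is a minimal sketch under stated assumptions, not the study's procedure. Each citing paper contributes at most one co-citation per author pair, co-authors of the same cited work are counted as co-cited with each other, and the author names and the function `cocitation_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing_papers, max_authors=1):
    """Count author co-citations over a set of citing papers.

    citing_papers: list of reference lists; each reference is an ordered author list.
    max_authors:   how many leading authors of each cited work are considered
                   (1 = traditional first-author counting, 5 = first five authors).
    """
    counts = Counter()
    for references in citing_papers:
        cited_authors = set()
        for authors in references:
            cited_authors.update(authors[:max_authors])
        # Every unordered pair of authors cited by the same paper is one co-citation.
        for pair in combinations(sorted(cited_authors), 2):
            counts[pair] += 1
    return counts


# Two hypothetical citing papers, each citing two works.
papers = [
    [["Smith", "Jones"], ["Lee"]],
    [["Jones", "Smith"], ["Lee", "Kim"]],
]
first_author = cocitation_counts(papers, max_authors=1)
first_five = cocitation_counts(papers, max_authors=5)
```

With first-author counting, Smith and Lee are co-cited once and Jones and Lee once. With the first five authors, both pairs are co-cited twice, and Kim enters the picture at all only under the second scheme.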

    Variations on the Author

    “Variations on the Author” discusses two of Eduardo Coutinho’s recent films (Um Dia na Vida, from 2010, and Últimas Conversas, posthumously released in 2015) and their contribution to the general question of documentary authorship. The director’s filmography is characterized by a consistent yet self-effacing form of authorial self-inscription: Coutinho often features as an interviewer who, rather than expressing opinions, propels discourse; an interviewer who is good at listening. This mode of self-inscription characterizes him as an author who is not expressive but who is nonetheless markedly present on the screen. In Um Dia na Vida, however, Coutinho is completely absent from the image, while Últimas Conversas, on the contrary, includes a confessional prologue that moves the director from the margins to the center of his films. This article examines the ways in which these works stand out in the filmography of a director who offers new insights into the notion of cinematic authorship.