Charles University

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

AI Koditex v1

Author: Milička Jiří
Marklová Anna
Cvrček Václav
Publication venue: Charles University, Faculty of Arts, Department of Linguistics
Publication date: 27/09/2025

AI Koditex is a corpus of Czech texts generated with large language models (LLMs). Its main purpose is to provide a resource for the linguistic comparison of human-written and LLM-generated texts. The corpus is multi-genre, rich in topics, authors, and text types, and comparable with existing human-created corpora. It replicates the reference human Koditex corpus, which follows the Brown Corpus tradition. The texts were generated using models from OpenAI, Anthropic, Alphabet, Meta, and DeepSeek, ranging from GPT-3 (davinci-002) to GPT-4.5, and are tagged according to the Universal Dependencies standard (i.e., the texts are tokenized, lemmatized, and morphologically and syntactically annotated). The subcorpus size varies according to the model used (768k tokens per model on average, 21.5M tokens altogether). The raw data and plain texts are freely available for download under the CC BY 4.0 license; the UD-annotated data are under the CC BY-NC-SA 4.0 license. The corpus is also accessible through the KonText search interface of the Czech National Corpus (https://www.korpus.cz/kontext/query?corpname=ai_koditex_v1)

CEC6-Converter (2025-05-29)

Author: Rüdiger Jan Oliver
Publication venue: Rüdiger, Jan Oliver
Publication date: 29/05/2025

This software converts *.cec6 files into 24 formats that are commonly used in corpus linguistics / NLProc. It runs on all modern operating systems (Windows, Linux, macOS). The binaries have been compiled for the x64 architecture. If you are using a processor (CPU) with an x86 or ARM architecture, please follow the instructions for "other operating systems or x86 / ARM / ARM64"

SynSemClass 5.5

Author: Urešová Zdeňka
Alcaina Cristina Fernández
Bourgonje Peter
Fučíková Eva
Hajič Jan
Hajičová Eva
Kolářová Veronika
Rehm Georg
Rysová Kateřina
Zaczynska Karolina
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 30/06/2025

The SynSemClass event-type ontology evolved from the original idea of creating a bilingual, later multilingual, synonym lexicon. This version (5.5) builds on previous versions but substantially enriches them with new synonym classes (the number has risen from 1546 to 1993). In addition, version 5.5 has been extended by two items: Czech deverbal nouns (a small sample) and hierarchical relations. The hierarchical structure captures specialization and generalization relations between classes that are formally and technically unrelated in the original ontology, and it is now integrated with the main files constituting the lexicon (symsemclass55.zip). As a lexical-semantic resource, this version continues to link to similar resources, such as PDT-Vallex, EngVallex, CzEngVallex, NomVallex, FrameNet, VerbNet, PropBank, Ontonotes, Woxikon, E-VALBU, GUP, German FrameNet, ADESSE, SenSem, AnCora, and Spanish WordNet and FrameNet. Example sentences in which the multilingual synonyms have been used are also included (example_sentences.zip). A version with the original class composition, as automatically pre-suggested but later removed during manual correction and further annotation, is included for completeness and historical reasons (removed_cms.zip). The individual languages are linked as follows (the referenced resources are not included, but all are available online): The Spanish entries are linked to ADESSE (http://adesse.uvigo.es/), Spanish SenSem (https://grial-research.github.io/en/index.html), Spanish WordNet (https://adimen.ehu.eus/cgi-bin/wei/public/wei.consult.perl), AnCora (https://clic.ub.edu/corpus/en/ancoraverb_es), and Spanish FrameNet (http://sfn.spanishfn.org/SFNreports.php). The English entries are linked to EngVallex (http://hdl.handle.net/11858/00-097C-0000-0023-4337-2), CzEngVallex (http://hdl.handle.net/11234/1-1512), FrameNet (https://framenet.icsi.berkeley.edu/), VerbNet (https://uvi.colorado.edu/ and http://verbs.colorado.edu/verbnet/index.html), PropBank (http://propbank.github.io/), Ontonotes (http://clear.colorado.edu/compsem/index.php?page=lexicalresources&sub=ontonotes), and the Open English Wordnet (https://en-word.net/). The Czech verbal entries are linked to PDT-Vallex 4.5 (http://hdl.handle.net/11234/1-5814), Vallex (http://hdl.handle.net/11234/1-4756), and CzEngVallex (http://hdl.handle.net/11234/1-1512). The Czech deverbal nouns are linked to NomVallex (https://ufal.mff.cuni.cz/nomvallex/2.5). The German entries are linked to Woxikon (https://synonyme.woxikon.de), E-VALBU (https://grammis.ids-mannheim.de/verbvalenz), and GUP (https://github.com/UniversalDependencies/UD_German-GSD)

ParCzech4Speech 1.0

Author: Stankov Vladislav
Kopp Matyáš
Bojar Ondřej
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 27/06/2025

We introduce ParCzech4Speech 1.0, a processed version of the ParCzech 4.0 corpus targeted at speech modeling tasks; the largest variant contains 2,695 hours of aligned speech from 587 speakers. We combined the sound recordings of Czech parliamentary speeches with the official transcripts. The recordings were processed with WhisperX and Wav2Vec 2.0 to extract an automatic audio-text alignment. The dataset is offered in three variants: (1) sentence-segmented, with clean boundaries, for automatic speech recognition and speech synthesis tasks, (2) unsegmented, preserving the original utterance flow across sentences, and (3) raw alignment, for further custom refinement for other possible tasks. Note: This release contains alignment data and text segments (official and recognized transcripts). The source audio must be obtained separately from the AudioPSP 24.01 corpus, using the 'filePath' column to locate the corresponding audio file and the 'start'/'end' timestamps to extract specific segments. The official transcripts are available in the ParCzech 4.0 corpus (http://hdl.handle.net/11234/1-5360). The original audio files are available in the AudioPSP 24.01 corpus (http://hdl.handle.net/11234/1-5404). Note: All three variants are provided in both .tsv (tab-separated values) and .parquet (columnar binary) formats. The data content is identical across formats
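The lookup described in the note (locate the audio file via 'filePath', then cut by 'start'/'end') might be sketched as follows. This is only an illustration under stated assumptions: the .tsv file is assumed to have a header row, the timestamps are assumed to be in seconds, and the 'text' column, the sample rows, and the "AudioPSP" root directory are hypothetical. Only the 'filePath', 'start', and 'end' column names come from the description above.

```python
import csv
import io
from pathlib import PurePosixPath


def plan_segments(tsv_text, audio_root):
    """Return (audio_path, start, end) tuples for every usable row.

    Assumes a header row naming the 'filePath', 'start' and 'end' columns
    and timestamps expressed in seconds (an assumption, not confirmed).
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    plan = []
    for row in reader:
        start, end = float(row["start"]), float(row["end"])
        if end <= start:
            # Skip empty or inverted intervals instead of cutting them.
            continue
        audio_path = PurePosixPath(audio_root) / row["filePath"]
        plan.append((str(audio_path), start, end))
    return plan


# Hypothetical sample rows; the real files contain more columns.
sample = (
    "filePath\tstart\tend\ttext\n"
    "2020/a.mp3\t1.5\t4.0\tDobrý den.\n"
    "2020/a.mp3\t4.0\t4.0\t\n"
)
segments = plan_segments(sample, "AudioPSP")
print(segments)
```

Each resulting tuple can then be handed to whatever audio tool is used for the actual cutting.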

EdUKate translation models 2025

Author: Hrabal Miroslav
Popel Martin
Poláková Lucie
Novák Michal
Kloudová Věra
Anisimova Mariia
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 01/01/2025

This package includes three models adapted for sentence-level machine translation in the educational domain: Czech-to-Ukrainian, Czech-to-English, and Czech-to-German. The models are provided as LoRA adapters on top of the EuroLLM-9B-Instruct LLM and can be used in the Charles Translator service (https://translator.cuni.cz) and in the web portal Škola s nadhledem (https://skolasnadhledem.cz/). The models were developed within the EdUKate project, which aims to help reduce the language barriers that non-Czech-speaking children in the Czech Republic face in the Czech school system. The project focuses on the development and dissemination of multilingual digital learning materials for students in primary and secondary schools

DeriNet 2.3

Author: Olbrich Michal
Brezinová Viktória
Dohnalová Šárka
John Vojtěch
Kyjánek Lukáš
Papáček Aleš
Svoboda Emil
Ševčíková Magda
Vidra Jonáš
Žabokrtský Zdeněk
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 29/01/2025

DeriNet is a lexical network modeling derivational and compositional relations in Czech. The nodes of the network represent Czech lexemes, while the edges capture word-formational relations between derived words and their base word(s). The current version, DeriNet 2.3, introduces several key improvements over version 2.2: (a) the set of 1,040,126 lexemes is aligned with the latest version of MorfFlex CZ (version 2.1), (b) 5,781 derivational trees containing loanwords are enriched with etymological information specifying their origins, adopted from the Czech Etymological Lexicon, (c) 8,867 new derivational and 1,262 new compound relations have been identified, resulting in a total of 791,771 derivational and 7,598 compound relations, and (d) the morphological segmentation and classification of morphs have been significantly enhanced

Universal Dependencies 2.16

Author: Zeman Daniel
Nivre Joakim
Abrams Mitchell
Ackermann Elia
Adolphe Jephtey
Aepli Noëmi
Aghaei Hamid
Agić Željko
Ahmadi Amir
Ahrenberg Lars
Ajede Chika Kennedy
Akhundjanova Arofat
Akkurt Furkan
Aleksandravičiūtė Gabrielė
Alfina Ika
Algom Avner
Alnajjar Khalid
Alzetta Chiara
Anastasopoulos Antonios
Andersen Erik
Andrews Matthew
Antonsen Lene
Aoyama Tatsuya
Aplonova Katya
Aquino Angelina
Aragon Carolina
Aranes Glyd
Aranzabe Maria Jesus
Arıcan Bilge Nas
Arnardóttir Þórunn
Arutie Gashaw
Arwidarasti Jessica Naraiswari
Asahara Masayuki
Ásgeirsdóttir Katla
Aslan Deniz Baran
Asmazoğlu Cengiz
Ateyah Luma
Atmaca Furkan
Attia Mohammed
Atutxa Aitziber
Augustinus Liesbeth
Avelãs Mariana
Badmaeva Elena
Bajorat Jana
Balasubramani Keerthana
Ballesteros Miguel
Banerjee Esha
Bank Sebastian
Barbosa Bryan Khelven da Silva
Barbu Mititelu Verginica
Barkarson Starkaður
Basile Rodolfo
Basmov Victoria
Batchelor Colin
Bauer John
Bedir Seyyit Talha
Behzad Shabnam
Belieni Juan
Bémová Alevtina
Bengoetxea Kepa
Benli İbrahim
Ben Moshe Yifat
Benzerrak Marie
Berg Ansu
Berk Gözde
Bhat Riyaz Ahmad
Biagetti Erica
Bick Eckhard
Bielinskienė Agnė
Bilgin Taşdemir Esma Fatıma
Binici Helin
Bjarnadóttir Kristín
Blaschke Verena
Blokland Rogier
Böbel Nina
Bobicev Victoria
Boizou Loïc
Bompolas Stavros
Bonilla Johnatan
Borges Völker Emanuel
Börstell Carl
Bosco Cristina
Bouma Gosse
Bowman Sam
Boyd Adriane
Braggaar Anouck
Branco António
Bras Myriam
Brokaitė Kristina
Bu Lanni
Buráňová Eva
Burchardt Aljoscha
Cabeza Carmen
Cáceres Arandia Natalia
Campos Marisa
Candito Marie
Caron Bernard
Caron Gauthier
Carvalheiro Catarina
Carvalho Rita
Cassidy Lauren
Castro Maria Clara
Castro Sérgio
Cavalcanti Tatiana
Cebiroğlu Eryiğit Gülşen
Cecchini Flavio Massimiliano
Celano Giuseppe G. A.
Çepani Anila
Čéplö Slavomír
Cesur Neslihan
Cetin Savas
Çetinoğlu Özlem
Chalub Fabricio
Chamila Liyanage
Chamoreau Claudine
Chauhan Shweta
Chen Yifei
Chi Ethan
Chika Taishi
Cho Yongseok
Choi Jinho
Chontaeva Bermet
Chun Jayeol
Chung Juyeon
Cignarella Alessandra T.
Cinková Silvie
Collomb Aurélie
Çöltekin Çağrı
Connor Miriam
Corbetta Claudia
Corbetta Daniela
Costa Francisco
Courtin Marine
Crabbé Benoît
Cristescu Mihaela
Cvetkoski Vladimir
Dahan Netanel
Dale Ingerid Løyning
Daniel Philemon
Daoudi Khensa
Dash Bijayalaxmi
Dash Satya Ranjan
Davidson Elizabeth
de Alencar Leonel Figueiredo
Dehouck Mathieu
de Laurentiis Martina
de Marneffe Marie-Catherine
Demir Ahmet
de Paiva Valeria
Derin Mehmet Oguz
de Souza Elvis
Diaz de Ilarraza Arantza
Díaz Hernández Roberto Antonio
Dickerson Carly
Di Felippo Ariani
Dinakaramani Arawinda
Di Nuovo Elisa
Dione Bamba
Dirix Peter
Do Hoa
Dobrovoljc Kaja
Döhmer Caroline
Doyle Adrian
Dozat Timothy
Droganova Kira
Duran Magali Sanches
Dwivedi Puneet
Ebert Christian
Eckhoff Hanne
Eguchi Masaki
Eiche Sandra
Eiselen Roald
Eli Marhaba
Elkahky Ali
Ephrem Binyam
Erina Olga
Erjavec Tomaž
Esher Louise
Eslami Soudabeh
Essaidi Farah
Etienne Aline
Evelyn Wograine
Facundes Sidney
Farkas Richárd
Faryad Ján
Favero Federica
Ferdaousi Jannatul
Fernanda Marília
Fernandez Alcalde Hector
Fethi Amal
Foster Jennifer
Francioni Barbara
Fransen Theodorus
Freitas Cláudia
Fujita Kazunori
Gajdošová Katarína
Galbraith Daniel
Galy Edith
Gamba Federica
Garcia Marcos
García-Miguel José María
Gärdenfors Moa
Gaustad Tanja
Genç Efe Eren
Gerardi Fabrício Ferraz
Gerdes Kim
Gessler Luke
Ginter Filip
Godoy Gustavo
Goenaga Iakes
Gojenola Koldo
Gökırmak Memduh
Goldberg Yoav
Goldin Gili
Gómez Guinovart Xavier
González Saavedra Berta
Griciūtė Bernadeta
Grioni Matias
Grobol Loïc
Grūzītis Normunds
Guillaume Bruno
Guiller Kirian
Guillot-Barbance Céline
Güngör Tunga
Gurevich Vladimir
Habash Nizar
Hafsteinsson Hinrik
Hahn Michael
Hajič Jan
Hajič jr. Jan
Hajičová Eva
Hämäläinen Mika
Hà Mỹ Linh
Han Na-Rae
Hanifmuti Muhammad Yudistira
Harada Takahiro
Hardwick Sam
Harris Kim
Hassert Naïma
Haug Dag
Havelka Jiří
Heinecke Johannes
Hellwig Oliver
Hennig Felix
Hladká Barbora
Hlaváčová Jaroslava
Hociung Florinel
Hoefels Diana
Hohle Petter
Howell Nick
Huang Yidi
Huerta Mendez Marivel
Hwang Jena
Ikeda Takumi
Iliadou Inessa
Ingason Anton Karl
Ion Radu
Irimia Elena
Ishola Ọlájídé
Islamaj Artan
Ito Kaoru
Iurescia Federica
Ivani Jessica K.
Jagodzińska Sandra
Jannat Siratun
Jelínek Tomáš
Jha Apoorva
Jiang Katharine
Job Sylvanus
Jobanputra Mayank
Johannsen Anders
Jónsdóttir Hildur
Jørgensen Fredrik
Ju Zhuoxuan
Juutinen Markus
Kaşıkara Hüner
Kabaeva Nadezhda
Kahane Sylvain
Kanayama Hiroshi
Kanerva Jenna
Kara Neslihan
Karahóǧa Ritván
Kárník Jiří
Kåsen Andre
Kayadelen Tolga
Kengatharaiyer Sarveswaran
Kettnerová Václava
Kharatyan Lilit
Kirchner Jesse
Klementieva Elena
Klyachko Elena
Kocharov Petr
Köhn Arne
Köksal Abdullatif
Kolářová Veronika
Kopacewicz Kamil
Korkiakangas Timo
Köse Mehmet
Koshevoy Alexey
Kote Nelda
Kotsyba Natalia
Kovačić Barbara
Kovalevskaitė Jolanta
Kowner Emmanuelle
Krek Simon
Krishnamurthy Parameswari
Kübler Sandra
Kučová Lucie
Kuqi Adrian
Kuyrukçu Oğuzhan
Kuzgun Aslı
Kwak Sookyoung
Kyle Kris
Laan Käbi
Laippala Veronika
Lambertino Lorenzo
Landau Israel
Lando Tatiana
Larasati Septina Dian
Larrivée Pierre
Lavrentiev Alexei
Lee John
Lê Hồng Phương
Lenci Alessandro
Lertpradit Saran
Leung Herman
Levina Maria
Levine Lauren
Li Cheuk Ying
Li Josie
Li Keying
Li Yixuan
Li Yuan
Lim KyungTae
Lima Padovani Bruna
Lin Yi-Ju Jessica
Lindén Krister
Liu Yang Janet
Liu Zoey
Ljubešić Nikola
Lobzhanidze Irina
Loginova Olga
Lopatková Markéta
Lopes Lucelene
Luftiu Edita
Lukashevskyi Arsenii
Lusito Stefano
Lutgen Anne-Marie
Luthfi Andry
Luukko Mikko
Lyashevskaya Olga
Lynn Teresa
Macketanz Vivien
Mahamdi Menel
Maillard Jean
Makarchuk Ilya
Makazhanov Aibek
Mambrini Francesco
Mandl Michael
Manning Christopher
Manurung Ruli
Marşan Büşra
Mărănduc Cătălina
Mareček David
Marheinecke Katrin
Markantonatou Stella
Martínez Alonso Héctor
Martín Rodríguez Lorena
Martins André
Martins Cláudia
Mašek Jan
Matsuda Hiroshi
Matsumoto Yuji
Mazzei Alessandro
McDonald Ryan
McGuinness Sarah
Mehta Maitrey
Ménard Pierre André
Mendonça Gustavo
Merhav Hilla
Merzhevich Tatiana
Meurer Paul
Miekka Niko
Mikulová Marie
Milano Emilia
Miletić Aleksandra
Miller Aaron
Min Junghyun
Minerbi Yael
Mírovský Jiří
Mischenkova Karina
Missilä Anna
Mititelu Cătălin
Mitrofan Maria
Miyao Yusuke
Mohapatra Biswakalpita
Mojiri Foroushani AmirHossein
Molnár Judit
Moloodi Amirsaeid
Montemagni Simonetta
More Amir
Moreno Romero Laura
Moretti Giovanni
Mori Shinsuke
Morioka Tomohiko
Moro Shigeki
Mortensen Bjartur
Moskalevskyi Bohdan
Muischnek Kadri
Munro Robert
Murawaki Yugo
Mus Nikolett
Müürisep Kaili
Nainwani Pinkey
Nakhlé Mariam
Navarro Horñiacek Juan Ignacio
Nedoluzhko Anna
Nešpore-Bērzkalne Gunta
Nevaci Manuela
Nguyễn Thị Lương
Nguyễn Thị Minh Huyền
Nikaido Yoshihiro
Nikolaev Vitaly
Nitisaroj Rattima
Norrman Victor
Nourian Alireza
Novák Michal
Nunes Maria das Graças Volpe
Nurmi Hanna
Ojala Stina
Ojha Atul Kr.
Óladóttir Hulda
Olúòkun Adédayọ̀
Omura Mai
Onwuegbuzia Emeka
Ordan Noam
Osenova Petya
Östling Robert
Ott Annika
Øvrelid Lilja
Oya Masanori
Özateş Şaziye Betül
Özçelik Merve
Özgür Arzucan
Öztürk Başaran Balkız
Paccosi Teresa
Pajas Petr
Palmero Aprosio Alessio
Panevová Jarmila
Panova Anastasia
Pardo Thiago Alexandre Salgueiro
Parida Shantipriya
Park Hyunji Hayley
Partanen Niko
Pascual Elena
Passarotti Marco
Patejuk Agnieszka
Paulino-Passos Guilherme
Pedonese Giulia
Peeters Oggi
Peljak-Łapińska Angelika
Peng Siyao
Peng Siyao Logan
Pereira Rita
Pereira Sílvia
Perez Cenel-Augusto
Perkova Natalia
Perrier Guy
Petrov Slav
Petrova Daria
Peverelli Andrea
Phelan Jason
Pierre-Louis Claudel
Piitulainen Jussi
Pinter Yuval
Pinto Clara
Pintucci Rodrigo
Pirinen Tommi A
Pitler Emily
Plamada Magdalena
Plank Barbara
Plum Alistair
Poibeau Thierry
Ponomareva Larisa
Popel Martin
Poujade Clamença
Pretkalniņa Lauma
Pretorius Rigardt
Prévost Sophie
Prokopidis Prokopis
Przepiórkowski Adam
Pugh Robert
Puolakainen Tiina
Purschke Christoph
Pyysalo Sampo
Qi Peng
Querido Andreia
Rääbis Andriela
Rabinovich Ella
Rademaker Alexandre
Publication venue: Universal Dependencies Consortium
Publication date: 15/05/2025

Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008)

Universal Dependencies 2.17 models for UDPipe 2 (2025-11-25)

Author: Straka Milan
Publication venue: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication date: 25/11/2025

Tokenizer, POS tagger, lemmatizer, and parser models for 169 treebanks of 93 languages of Universal Dependencies 2.17, created solely using UD 2.17 data (http://hdl.handle.net/11234/1-6036). The model documentation, including performance, can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_217_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2

Antoninus Liberalis, Μεταμορφώσεων συναγωγή (Transformationum congeries)

Author: Kamil Gregor
Publication venue: Filozofická fakulta, Univerzita Karlova
Publication date: 01/01/2025

Μεταμορφώσεων συναγωγή (Latin "Transformationum congeries", English "Collection of Metamorphoses") is a Greek mythographic work in prose attributed to Antoninus Liberalis, an otherwise unknown author, and most likely dated to the 1st or 2nd century CE

Odia Visual Genome

Author: Parida Shantipriya
Sahoo Shashikanta
Sahoo Kalyanamalini
Bojar Ondřej
Dash Satya Ranjan
Publication venue: Charles University, UFAL
Publication date: 2025

The Odia Visual Genome is a multimodal dataset comprising aligned textual and visual data, designed to support research in English-Odia multimodal machine translation as well as broader studies in multimodal language processing. The dataset is derived from the Visual Genome corpus, which provides short English image captions paired with corresponding images. For the Odia Visual Genome, we selected a subset of these captions and automatically translated them into Odia, followed by careful manual post-editing. In the post-editing stage, annotators explicitly considered the associated visual context to ensure semantic adequacy and naturalness of the Odia translations.

The corpus is partitioned into four subsets. The training set contains approximately 29,000 segments, while the development set and the test set contain 1,000 and 1,600 segments, respectively. Both the development and test sets were taken from Hindi Visual Genome 1.1, where they were created via random sampling from the Visual Genome corpus. In addition, a challenge test set of 1,400 segments was prepared for the WAT2019 Multimodal Translation Task. The challenge test set was constructed to specifically target lexical ambiguity in English captions. Candidate items were identified based on embedding similarity, and ambiguous instances were manually selected where visual information plays a crucial role in disambiguation. Although in many cases the surrounding textual context also provides sufficient cues, the inclusion of the image enhances the robustness of disambiguation. Odia Visual Genome was used in the WAT 2025 Multimodal Translation Task (https://ufal.mff.cuni.cz/wat2025english-indicmultimodaltranslation).

Dataset Formats

The dataset contains both textual and visual components.

Textual Data. The training, development, and test partitions are distributed as tab-delimited plain-text files. Each file consists of seven columns:

Column1 - image_id
Column2 - X
Column3 - Y
Column4 - Width
Column5 - Height
Column6 - English Text
Column7 - Odia Text

The bounding-box coordinates (X, Y, Width, Height) specify the rectangular region of the image referenced by the caption.

Visual Data. The image collection contains full-resolution images, each identified by the corresponding image_id. The bounding-box metadata enables linking of captions to specific regions within the images.

Corpus Statistics

Parallel corpus statistics for Odia Visual Genome.

Dataset           Segments   English Words   Odia Words
----------------  ---------  --------------  -----------
Train                 28930          143134       141652
Dev                     998            4922         4912
Test                   1595            7854         7734
Challenge Test         1400            8186         8100
----------------  ---------  --------------  -----------
Total                 32923          164096       162398

The word counts are approximate, prior to tokenization
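The seven-column layout described above could be read along the following lines. This is a minimal sketch, not the dataset's official loader: it assumes the files have no header row and that the coordinates are integers, and the sample row (image id, coordinates, caption, and the placeholder standing in for the Odia text) is made up for illustration.

```python
import csv
import io
from typing import NamedTuple


class Segment(NamedTuple):
    image_id: str
    x: int
    y: int
    width: int
    height: int
    english: str
    odia: str


def read_segments(tsv_text):
    """Parse tab-delimited rows in the order image_id, X, Y, Width, Height, English, Odia."""
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    segments = []
    for cols in reader:
        if not cols:
            continue  # tolerate blank lines
        if len(cols) != 7:
            raise ValueError(f"expected 7 columns, got {len(cols)}")
        image_id, x, y, width, height, english, odia = cols
        segments.append(
            Segment(image_id, int(x), int(y), int(width), int(height), english, odia)
        )
    return segments


def crop_box(seg):
    """Convert (X, Y, Width, Height) to a (left, upper, right, lower) box."""
    return (seg.x, seg.y, seg.x + seg.width, seg.y + seg.height)


# Hypothetical row; "<Odia text>" stands in for a real translation.
sample = "12345\t10\t20\t100\t50\ta red bus\t<Odia text>\n"
segs = read_segments(sample)
print(crop_box(segs[0]))
```

The (left, upper, right, lower) form is the box convention many image libraries expect when cropping a region, which is why the helper converts width and height into absolute corners.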

0

full texts

1,998

metadata records

Updated in last 30 days.
