
    Comparable corpus of parliamentary debates ParlaMint-IL 1.0

    The ParlaMint-IL corpus is the Israeli contribution to the ParlaMint collection of comparable parliamentary corpora (https://www.clarin.eu/parlamint), which contain transcriptions of parliamentary debates of European countries and autonomous regions. The Knesset Corpus follows the ParlaMint encoding guidelines and is fully aligned with version 4.1 of the ParlaMint corpora (cf. http://hdl.handle.net/11356/1912 and http://hdl.handle.net/11356/1911). The corpus comprises transcriptions of all plenary and committee protocols of the Israeli parliament (the Knesset), spanning from 1994 to 2024. It includes more than 12 million speeches and over 400 million words, making it the largest corpus in the ParlaMint collection. All transcriptions are provided in Hebrew, the primary language of Knesset proceedings. The transcriptions are divided by days with information on the term, session and meeting, and contain speeches marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription. The corpus includes extensive metadata, most importantly on speakers (name, gender, year of birth, MP and minister status, party affiliation), and on their political parties and parliamentary groups (name, coalition/opposition status, and Wikipedia-sourced left-to-right political orientation). The transcriptions are also marked with the subcorpora they belong to, i.e. "reference" (until 2020-01-30), "covid" (from 2020-01-31), and "war" (from 2022-02-24). The corpus TEI/XML schemas are included in the distribution. The corpus is available in two variants, the "plain-text" version (ParlaMint-IL.tgz, corresponding to http://hdl.handle.net/11356/1912) and the linguistically annotated version (ParlaMint-IL.ana.tgz, corresponding to http://hdl.handle.net/11356/1911). 
The ParlaMint-IL.ana linguistic annotation includes tokenization; sentence segmentation; lemmatisation; Universal Dependencies part-of-speech, morphological features, and syntactic dependencies; and the 4-class CoNLL-2003 named entities. Morphological and syntactic annotations were produced by a Trankit-based model (https://github.com/nlp-uoregon/trankit) fine-tuned on Knesset data. Named entity recognition was performed using dicta-bert (https://huggingface.co/dicta-il/dictabert), a Hebrew NER model. The "plain-text" version (ParlaMint-IL.tgz) contains the canonical TEI/XML files; derived plain-text files; and derived TSV metadata files for the speeches. The linguistically annotated version (ParlaMint-IL.ana.tgz) contains the canonical TEI/XML files with linguistic annotations; derived CoNLL-U files along with TSV metadata of the speeches; and the derived vertical files (with their registry file), suitable for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText. The ParlaMint-IL corpus is based on data and annotations described in: Goldin, Gili; Wintner, Shuly; and Rabinovich, Ella. The Knesset Corpus: An Annotated Corpus of Hebrew Parliamentary Proceedings. Language Resources and Evaluation (2025). https://doi.org/10.1007/s10579-025-09833-
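The subcorpus labels described above depend only on the sitting date, so the rule can be sketched in a few lines. This is a minimal illustration, not code from the corpus distribution: the helper name `subcorpora_for` is hypothetical, and it assumes that the "covid" subcorpus has no end date, so a sitting on or after 2022-02-24 carries both "covid" and "war". The description gives only start dates, so that overlap is an assumption.

```python
from datetime import date

# Start dates as stated in the corpus description.
# Treating "covid" as open-ended is an assumption of this sketch.
COVID_START = date(2020, 1, 31)
WAR_START = date(2022, 2, 24)


def subcorpora_for(day: date) -> list[str]:
    """Return the subcorpus labels for a sitting date (hypothetical helper)."""
    if day < COVID_START:
        return ["reference"]
    labels = ["covid"]
    if day >= WAR_START:
        labels.append("war")
    return labels
```

For example, `subcorpora_for(date(2020, 1, 30))` falls on the last day of the "reference" period, while any date from 2020-01-31 onward leaves it.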

    An Abstract Machine for Unification Grammars

    MAchine for LInguistic Applications -- User's Guide. Laboratory for Computational Linguistics, Computer Science Department, Technion, Israel Institute of Technology, 32000 Haifa, Israel, January. Wintner, Shuly and Uzzi Ornan. 1991. Syntactic analysis of Hebrew sentences. In Proceedings of the 8th Israeli Symposium on Artificial Intelligence and Computer Vision, pages 201--230. Information Processing Association of Israel, December. Wintner, Shuly and Uzzi Ornan. 1996. Syntactic analysis of Hebrew sentences. Natural Language Engineering, 1(3):261--288. Yizhar, Dana. 1993. Computational grammar for Hebrew noun phrases. Master's thesis, Computer Science Department, Hebrew University, Jerusalem, Israel, June. (In Hebrew). Zajac, Rémi. 1992. Inheritance and constraint-based grammar formalisms. Computational Linguistics, 18(2):159--182. Pollard, Carl J. and M. Drew Moshier. 1990. Unifying partial descriptions of sets. In Philip P. Hanson, editor, Information, Language and Cognition, vol..

    Deterministic Parsing using PCFGs

    We propose the design of deterministic constituent parsers that choose parser actions according to the probabilities of parses of a given probabilistic context-free grammar. Several variants are presented. One of these deterministically constructs a parse structure while postponing commitment to labels. We investigate theoretical time complexities and report experiments.

    Compositional Semantics for Unification-based Linguistics Formalisms

    Contemporary linguistic formalisms have become so rigorous that it is now possible to view them as very high level declarative programming languages. Consequently, grammars for natural languages can be viewed as programs; this view enables the application of various methods and techniques that have proved useful for programming languages to the study of natural languages. This paper adapts the notion of program composition, well developed in the context of logic programming languages, to the domain of linguistic formalisms. We study alternative definitions for the semantics of such formalisms, suggesting a denotational semantics that we show to be compositional and fully abstract. This provides a clear, mathematically sound way of defining grammar modularity.

    The Knesset Meetings Corpus 2004-2005

    The Knesset Meetings Corpus 2004-2005 is made up of two components:

    - Raw texts: 282 files totalling 867,725 lines. These can be downloaded in two formats:
      - As doc files, encoded using windows-1255 encoding:
        - kneset16.zip: contains 164 text files totalling 543,228 lines. MILA host: http://yeda.cs.technion.ac.il:8088/corpus/software/corpora/knesset/txt/docs/kneset16.zip GitHub mirror: https://github.com/NLPH/knesset-2004-2005/blob/master/kneset16.zip?raw=true
        - kneset17.zip: contains 118 text files totalling 324,497 lines. MILA host: http://yeda.cs.technion.ac.il:8088/corpus/software/corpora/knesset/txt/docs/kneset17.zip GitHub mirror: https://github.com/NLPH/knesset-2004-2005/blob/master/kneset17.zip?raw=true
      - As txt files, encoded using utf8 encoding:
        - kneset.tar.gz: an archive of all the raw text files, divided into two folders. GitHub mirror: https://github.com/NLPH/knesset-2004-2005/blob/master/kneset.tar.gz
          - 16: contains 164 text files totalling 543,228 lines.
          - 17: contains 118 text files totalling 324,497 lines.
        - knesset_txt_16.tar.gz: contains 164 text files totalling 543,228 lines. MILA host: http://yeda.cs.technion.ac.il:8088/corpus/software/corpora/knesset/txt/utf8/knesset_txt_16.tar.gz GitHub mirror: https://github.com/NLPH/knesset-2004-2005/blob/master/knesset_txt_16.tar.gz?raw=true
        - knesset_txt_17.zip: contains 118 text files totalling 324,497 lines. MILA host: http://yeda.cs.technion.ac.il:8088/corpus/software/corpora/knesset/txt/utf8/knesset_txt_17.zip GitHub mirror: https://github.com/NLPH/knesset-2004-2005/blob/master/knesset_txt_17.zip?raw=true
    - Tokenized and morphologically tagged texts: tagged versions exist only for the files in the 16 folder. The texts are represented using MILA's XML schema for corpora (http://www.mila.cs.technion.ac.il/eng/resources_standards.html). These can be downloaded in two ways:
      - knesset_tagged_16.tar.gz: an archive of all tokenized and tagged files. MILA host: http://yeda.cs.technion.ac.il:8088/corpus/software/corpora/knesset/tagged/knesset_tagged_16.tar.gz Archive.org mirror: https://archive.org/details/knesset_transcripts_2004_2005
      - By cloning this repository: the unarchived version of these files can be found in the repository, under the knesset_tagged folder.

    The Open Natural Language Processing in Hebrew (NLPH) initiative is a joint effort by members of DataHack and The Public Knowledge Workshop to promote open tools and resources for Natural Language Processing in Hebrew. This community collects resources for NLP in Hebrew, as part of the NLPH project, which you can read more about here. These include corpora, lexicons, dictionaries, treebanks, embeddings, code, services, applications, papers, course materials and presentations, among others. A full list of these resources is maintained here: https://github.com/NLPH/NLPH_Resources If you have a resource you can contribute, to be released under some open license, please submit a pull request, or contact us at [email protected]
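The raw texts above come in a windows-1255 variant and a utf8 variant. A minimal sketch of converting between the two, assuming the windows-1255 content has already been obtained as plain text bytes (the description calls these doc files, and their exact on-disk format is not confirmed here). The function name `cp1255_to_utf8` is hypothetical; `cp1255` is Python's codec name for windows-1255.

```python
def cp1255_to_utf8(raw: bytes) -> bytes:
    """Re-encode windows-1255 (Hebrew) bytes as UTF-8.

    Hypothetical helper: assumes `raw` is plain text, not a binary
    word-processor document.
    """
    text = raw.decode("cp1255")   # windows-1255 -> Python str
    return text.encode("utf-8")   # Python str -> UTF-8 bytes
```

Decoding raises `UnicodeDecodeError` on bytes that are undefined in windows-1255, which is a useful signal that a file is not in the expected encoding.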

    Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings

    We present Knesset-DictaBERT, a large Hebrew language model fine-tuned on the Knesset Corpus, which comprises Israeli parliamentary proceedings. The model is based on the DictaBERT architecture and demonstrates significant improvements in understanding parliamentary language according to the MLM task. We provide a detailed evaluation of the model's performance, showing improvements in perplexity and accuracy over the baseline DictaBERT model. (3 pages, 1 table)
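The abstract reports perplexity and accuracy on the MLM task without stating how they are computed. A minimal sketch under the standard definitions, which may differ from the authors' exact evaluation setup: perplexity as the exponential of the mean negative log-probability assigned to the gold token at each masked position, and accuracy as the share of masked positions whose top prediction matches the gold token. Both function names are hypothetical.

```python
import math


def mlm_perplexity(gold_token_probs):
    """exp(mean negative log-likelihood) over masked positions.

    `gold_token_probs` holds the probability the model assigned to the
    correct token at each masked position (assumed to be > 0).
    """
    nll = [-math.log(p) for p in gold_token_probs]
    return math.exp(sum(nll) / len(nll))


def mlm_accuracy(predicted, gold):
    """Fraction of masked positions where the top prediction is correct."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

Lower perplexity and higher accuracy both indicate a better fit to the masked tokens, which is the direction of improvement the abstract describes.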

    Going Beyond Counting First Authors in Author Co-citation Analysis

    The present study examines one of the fundamental aspects of author co-citation analysis (ACA): the way co-citation counts are defined. Co-citation counting provides the data on which all subsequent statistical analyses and mappings are based. We compare ACA results based on two types of co-citation counting: the traditional type, which counts only the first author of each cited work, and a non-traditional type, which takes into account the first 5 authors of a cited work. Results indicate that the picture produced through this non-traditional author co-citation counting contains more coherent author groups and is therefore considerably clearer. However, this picture represents fewer specialties in the research field being studied than that produced through the traditional first-author co-citation counting when the same number of top-ranked authors is selected and analyzed. Reasons for these effects are discussed.
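The difference between the two counting schemes can be illustrated with a short sketch. It is an assumption-laden illustration, not the study's procedure: the function name and input layout are hypothetical, each pair of authors is counted at most once per citing paper, and co-authors of the same cited work are treated as co-cited, which the abstract does not specify.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists, max_authors=1):
    """Count author co-citations across citing papers.

    `reference_lists` has one entry per citing paper; each entry is a list
    of cited works, and each cited work is a list of author names in byline
    order. Only the first `max_authors` authors of each cited work count:
    1 gives first-author counting, 5 gives the first-five-authors variant.
    """
    counts = Counter()
    for cited_works in reference_lists:
        authors = set()
        for work_authors in cited_works:
            authors.update(work_authors[:max_authors])
        # Every unordered pair of counted authors is co-cited once here.
        for a, b in combinations(sorted(authors), 2):
            counts[(a, b)] += 1
    return counts
```

Calling the function with `max_authors=1` and then `max_authors=5` on the same reference lists shows how the second scheme brings in authors who never appear first on a byline.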