University of Warsaw

Biblioteka Cyfrowa KLF UW (Digital Library of the Formal Linguistics Department at the University of Warsaw)

Od skanów do Unicode (From Scans to Unicode)

Author: Bień Janusz S.
Publication date: 28/09/2012

Skanowane teksty jako korpusy (Scanned Texts as Corpora)

Author: Bień Janusz S.
Publication date: 01/01/2012

Scanned texts as corpora. A modification of the Poliqarp corpus search tool is described, which is oriented towards searching scanned texts with dirty OCR (i.e. fully automatic Optical Character Recognition without any proofreading). The search tool has been in operation since December 2009 and is available at http://wbl.klf.uw.edu.pl/. The two-level regular expressions that can be used in queries make it possible, at least in principle, to work around OCR errors. The crucial property of the search engine is its ability to highlight the hits on the original scans stored in the DjVu format. Although the feature is not original, as it was first used for the Century Dictionary and later for Jamieson’s Etymological Dictionary of the Scottish Language, it is substantially augmented by so-called graphical concordances and a convenient way to bookmark the hits. Our system now handles four dictionaries with a total size of over 40,000 pages. It is expected that other texts will be added to the system in the near future.
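The idea of using regular expressions to work around OCR errors can be illustrated with a minimal sketch. It uses plain Python `re`, not Poliqarp's two-level query syntax, and the confusion table, the function name `tolerant_pattern`, and the sample text are all assumptions made for illustration only.

```python
import re

# Hypothetical table of characters that OCR often confuses with one another.
# This is an illustrative assumption, not data from the described system.
CONFUSIONS = {
    "l": "l1I",
    "o": "o0",
    "e": "ec",
    "s": "sſf",
}


def tolerant_pattern(word: str) -> re.Pattern:
    """Build a regex that matches `word` and its OCR-confusable variants."""
    parts = []
    for ch in word:
        alternatives = CONFUSIONS.get(ch.lower())
        if alternatives:
            # Accept any of the confusable characters in this position.
            parts.append("[" + re.escape(alternatives) + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


# Made-up sample of uncorrected OCR output.
ocr_text = "The o1d schoo1 stood near the old mill."
pattern = tolerant_pattern("old")
matches = pattern.findall(ocr_text)
print(matches)
```

Here a query for "old" also finds the misrecognized form "o1d". The price of this tolerance is a risk of false hits, which is one reason such circumvention works "at least in principle" rather than reliably.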

Czy aby Maciej mi przytaknie? O statusie aby w pytaniach rozstrzygnięcia (Will Maciej Really Agree with Me? On the Status of aby in Yes/No Questions)

Author: Danielewiczowa Magdalena
Publication date: 01/01/2012

Oferta dydaktyczna 2012/2013 dla Instytutu Informatyki UW (Course Offerings 2012/2013 for the Institute of Informatics, University of Warsaw)

Author: Bień Janusz S.
Publication date: 01/01/2012

Narzędzia dygitalizacji tekstów na potrzeby badań filologicznych (Text Digitization Tools for Philological Research)

Author: Bień Janusz S.
Publication venue: Katedra Lingwistyki Formalnej UW
Publication date: 2012

Substantive report on MNSzWiT grant no. N N519 384036, 13.05.2009-12.05.2012

Laudacja (Laudation)

Author: Bogusławski Andrzej
Publication venue: Biuro Promocji UW
Publication date: 01/01/2012

Cross-language Perspective on Lexicon Building and Deployment in IMPACT

Authors: de Does Jesse; Depuydt Katrien; Schulz Klaus; Gotscharek Annette; Ringlstetter Christoph; Bień Janusz S.; Erjavec Tomaž; Kučera Karel; Martinez Isabel; Mihov Stoyan; Souvay Gilles
Publication venue: IMPACT
Publication date: 2012

Od skanów do Unicode (From Scans to Unicode)

Author: Bień Janusz S.
Publication date: 12/11/2012

Scanned publications in digital libraries: new Open Source DjVu tools

Author: Bień Janusz S.
Publication date: 04/10/2012

The DjVu technology is described by its authors as "an image compression technique, a document format, and a software platform for delivering documents images over the Internet"; according to recent statistics, about 80% of the documents stored in Polish digital libraries are in this format. Besides the commercial software supporting this technology, there is also DjVuLibre, a suite of Open Source tools and utilities developed by the technology's creators. The presentation discusses another Open Source suite of programs, which consists of two sets. The first set contains programs for creating and improving DjVu documents, including the results of Optical Character Recognition. A typical OCR program outputs its results as a PDF "sandwich" document containing text under an image (although ABBYY FineReader has been able to save its output directly as DjVu files since version 11, the PDF output contains more information). The pdf2djvu program, conceived by Jakub Wilk (http://jwilk.net/software/pdf2djvu), converts PDF files into DjVu, preserving all the features (e.g. outlines) that are representable in the latter format. Another program, also conceived by Jakub Wilk (http://jwilk.net/software/didjvu), converts graphic files into DjVu documents consisting of foreground (the printed text), mask, and background (e.g. illustrations) layers. Such separation not only allows a high compression ratio to be achieved, but also improves the quality of the results of OCR, which should operate only on the foreground or mask. The third program, named ocrodjvu for historical reasons (http://jwilk.net/software/ocrodjvu), is a wrapper for several Open Source OCR programs, including Tesseract, which achieves quality comparable with commercial systems (cf. e.g. test results). The second set of programs concerns the delivery of DjVu documents to users. It consists of a search engine server and two kinds of clients: marasca, installable as a WWW site, and djview4poliqarp, a standalone client installable on a user's computer. As the server is based on the Poliqarp corpus tool, the whole set is called simply Poliqarp for DjVu. The author of djview4poliqarp is Michał Rudolf; the rest of the system was created by Jakub Wilk. The tools have been developed within a project directed by the present author, and the results are available under the GNU General Public License.

O pewnym typie informacji leksykograficznej nieobecnej w słownikach (On a Certain Type of Lexicographic Information Absent from Dictionaries)

Author: Danielewiczowa Magdalena
Publication venue: Norbertinum
Publication date: 01/01/2012

159 full texts, 403 metadata records

Updated in last 30 days.
