About the application
Introduction
Korpusomat is a simple web application for creating morphosyntactically tagged text corpora; it incorporates the MTAS search engine. Korpusomat integrates a selection of natural language processing tools developed by the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.
The main tools used for processing are:
- Morfeusz 2, an inflectional analyser and generator built upon the Grammatical Dictionary of Polish,
- Concraft, a tagger,
- Liner 2, a named entity recognition tool,
- TermoPL, a terminology extraction tool,
- MTAS, a corpus query engine.
The first two tools, Morfeusz 2 and Concraft, are continually developed and updated. Liner 2 offers several features, including temporal expression recognition, action description recognition, and named entity recognition (NER); Korpusomat currently uses it only for NER. TermoPL extracts terminology, which is shown in the corpus statistics view. MTAS is a corpus search engine developed by the Meertens Instituut under the CLARIN project.
Korpusomat processes plain text files (txt) and most other formats used to store text data (e.g. epub, mobi, doc, rtf, pdf). The full list of supported formats is available here: http://tika.apache.org/1.17/formats.html. All texts are converted to UTF-8 encoding for processing.
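The conversion to UTF-8 can be illustrated with a minimal Python sketch. It is not Korpusomat's actual code: the function name to_utf8 and the fallback order of legacy encodings (chosen here because they are common in Polish texts) are assumptions made for illustration only.

```python
from typing import Iterable, Tuple

# Hypothetical fallback order, NOT taken from Korpusomat itself.
# "utf-8-sig" also accepts plain UTF-8 and strips a leading BOM if present.
FALLBACK_ENCODINGS = ("utf-8-sig", "cp1250", "iso-8859-2")


def to_utf8(raw: bytes,
            encodings: Iterable[str] = FALLBACK_ENCODINGS) -> Tuple[bytes, str]:
    """Return the input re-encoded as UTF-8 and the name of the encoding that decoded it."""
    for name in encodings:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next one
        return text.encode("utf-8"), name
    # Last resort: keep what can be decoded and replace the rest with U+FFFD.
    return raw.decode("utf-8", errors="replace").encode("utf-8"), "utf-8 (lossy)"


if __name__ == "__main__":
    legacy = "zażółć gęślą jaźń".encode("cp1250")
    converted, detected = to_utf8(legacy)
    print(detected, converted.decode("utf-8"))
```

Trying encodings in a fixed order is only a heuristic: single-byte encodings such as cp1250 and ISO-8859-2 accept almost any byte sequence, so the first one in the list usually wins even when the text was written in the other.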
Korpusomat also allows adding articles from web pages. Each added URL is processed by the newspaper library described here.
Conferences
Details of the features are described in the materials below (in Polish).
Sources:
- http://platontv.pl - DARIAH-PL: Sesja 3a, Łukasz Kobyliński "Korpusomat — narzędzie do tworzenia przeszukiwalnych korpusów języka polskiego".
- Linguistic Engineering Group Seminar, with presentation available.
Tools used
Korpusomat integrates a selection of tools for natural language processing.
The included tools are described in, for example:
- Witold Kieraś and Marcin Woliński. Morfeusz 2 – analizator i generator fleksyjny dla języka polskiego. Język Polski, XCVII(1):75–83, 2017.
- Waszczuk J., Kieraś W., Woliński M. (2018) Morphosyntactic Disambiguation and Segmentation for Historical Polish with Graph-Based Conditional Random Fields. In: Sojka P., Horák A., Kopeček I., Pala K. (eds) Text, Speech, and Dialogue. TSD 2018. Lecture Notes in Computer Science, vol 11107. Springer, Cham
- Marcińczuk, Michał; Kocoń, Jan; Gawor, Michał. Recognition of Named Entities for Polish-Comparison of Deep Learning and Conditional Random Fields Approaches. In: Ogrodniczuk, Maciej; Kobyliński, Łukasz (Eds.): Proceedings of the PolEval 2018 Workshop, pp. 63-73, Institute of Computer Science, Polish Academy of Sciences, Warszawa, 2018.
- Marciniak, M., Mykowiecka, A., & Rychlik, P. (2016). TermoPL - a Flexible Tool for Terminology Extraction. LREC.
- Matthijs Brouwer, Hennie Brugman and Marc Kemps-Snijders 2017. MTAS: A Solr/Lucene based multi tier annotation search solution. Selected papers from the CLARIN Annual Conference 2016. Linköping Electronic Conference Proceedings 136: 19–37.
- Piotr Rybak and Alina Wróblewska. Semi-supervised neural system for tagging, parsing and lematization. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 45–54. Association for Computational Linguistics, 2018.