Yatsko's Computational Linguistics Laboratory


A Russian Part-of-speech Tagger

Part-of-speech (POS) tagging is widely used in linguistic programs and systems to annotate text corpora, improve information retrieval efficiency, detect plagiarism, and classify documents. Developing a POS tagger for Russian is far from trivial. Russian is a morphologically rich, highly inflected language, and processing it must contend with the homonymy of stems, suffixes, and endings, as well as with the deletion and alternation of sounds.

    This page presents a Russian POS tagger designed to support an opinion mining system. It was developed as part of a project supported by grant No. 16-07-00014 from the Russian Foundation for Basic Research. The tagger is a preprocessing module that relies on an extensive morphological dictionary and Bayesian analysis. The dictionary structure, the rules for part-of-speech recognition, and the tagger's algorithm are described in my papers; see [35] and [36] in the Publications section. Note that the Russian-language paper is outdated and much shorter.
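To give a rough idea of how a dictionary can be combined with Bayesian analysis, here is a minimal Python sketch. It is an illustration under stated assumptions, not the Y-TAGGRU algorithm, which is described in the papers cited above. The toy dictionary, the tag names, the two-letter ending used as the only feature, and the function names `train` and `tag_word` are all hypothetical. A word found in the dictionary gets its stored tag; any other word gets the tag with the highest naive Bayes score computed from its ending.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy dictionary (word form -> tag); not Y-TAGGRU's real data.
DICTIONARY = {
    "книга": "NOUN",
    "работа": "NOUN",
    "читает": "VERB",
    "думает": "VERB",
    "красивая": "ADJ",
    "новая": "ADJ",
}

SUFFIX_LEN = 2  # assumption: the last two letters are the only feature


def train(dictionary):
    """Count tags and word endings per tag from the dictionary entries."""
    tag_counts = Counter(dictionary.values())
    suffix_counts = defaultdict(Counter)
    for word, tag in dictionary.items():
        suffix_counts[tag][word[-SUFFIX_LEN:]] += 1
    return tag_counts, suffix_counts


def tag_word(word, dictionary, tag_counts, suffix_counts):
    """Return (tag, method): dictionary lookup first, Bayesian guess otherwise."""
    word = word.lower()
    if word in dictionary:
        return dictionary[word], "dictionary"

    suffix = word[-SUFFIX_LEN:]
    known_suffixes = {s for counts in suffix_counts.values() for s in counts}
    vocab_size = len(known_suffixes) + 1  # +1 slot for unseen endings
    total = sum(tag_counts.values())

    best_tag, best_score = None, -math.inf
    for tag, count in sorted(tag_counts.items()):
        prior = count / total
        # Add-one (Laplace) smoothing keeps unseen endings from scoring zero.
        likelihood = (suffix_counts[tag][suffix] + 1) / (count + vocab_size)
        score = math.log(prior) + math.log(likelihood)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag, "bayes"


TAG_COUNTS, SUFFIX_COUNTS = train(DICTIONARY)
```

With this toy data, `tag_word("книга", DICTIONARY, TAG_COUNTS, SUFFIX_COUNTS)` is resolved by lookup, while `tag_word("мечтает", DICTIONARY, TAG_COUNTS, SUFFIX_COUNTS)` is not in the dictionary and is guessed as a verb from its ending. Adding or removing dictionary entries changes both the lookup results and the counts behind the Bayesian guess.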

    Y-TAGGRU is distributed as freeware together with an editable morphological dictionary. Users can delete entries from the dictionary or add new ones to change the tagging results. Input text must be a .txt file in UTF-8 encoding.
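If a text file is stored in another encoding, it has to be converted before tagging. The following Python sketch shows one way to do that. It assumes that a file which is not valid UTF-8 is in Windows-1251 (`cp1251`), a common legacy encoding for Russian text; the function name `ensure_utf8` and the fallback choice are illustrative assumptions, not part of Y-TAGGRU.

```python
from pathlib import Path


def ensure_utf8(src, dst, fallback_encoding="cp1251"):
    """Write a UTF-8 copy of src to dst.

    Returns True if the file had to be re-encoded from fallback_encoding,
    False if it was already valid UTF-8.
    """
    raw = Path(src).read_bytes()
    try:
        text = raw.decode("utf-8")
        converted = False
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Windows-1251.
        text = raw.decode(fallback_encoding)
        converted = True
    # Write bytes directly so that line endings are left untouched.
    Path(dst).write_bytes(text.encode("utf-8"))
    return converted
```

For example, `ensure_utf8("input.txt", "input_utf8.txt")` produces a copy that can be opened as UTF-8; the file names here are placeholders.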

    The screenshot below shows the tagger's interface. The upper section shows the input text, the lower section shows the tagging results, and the right section shows statistics and the recognition methods used. If you have updated the dictionary, click the Refresh button for the changes to take effect. You can select text in the lower section and copy it for further analysis.

Go to the Downloads section of this site to get Y-TAGGRU. The tagger runs on Windows machines.



