CLL-Tagger annotates a text with part-of-speech (POS) tags. It is based on a well-known bidirectional inference algorithm, in which the POS tag assigned to a token depends on the POS tags of the tokens both to its left and to its right.
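
To make the idea of bidirectional inference concrete, here is a minimal sketch in Python. It uses an easiest-first strategy, which is one common way of realizing bidirectional inference: at each step the token whose best tag is most certain is tagged first, and that tag then becomes context for its neighbours on either side. The sketch is not the CLL-Tagger implementation. The function names, the three-word lexicon and the score adjustments are all hypothetical and serve only as an illustration.

```python
from typing import Callable, Dict, List, Optional

# A scorer maps (word, left_tag, right_tag) to {tag: probability}.
# A neighbour tag is None when that neighbour is untagged or absent.
Scorer = Callable[[str, Optional[str], Optional[str]], Dict[str, float]]


def toy_scores(word: str, left_tag: Optional[str],
               right_tag: Optional[str]) -> Dict[str, float]:
    """Hypothetical scorer with a tiny hand-made lexicon (illustration only)."""
    lexicon = {
        "the": {"DT": 1.0},
        "can": {"MD": 0.6, "NN": 0.4},
        "rusts": {"VBZ": 0.7, "NNS": 0.3},
    }
    scores = dict(lexicon.get(word.lower(), {"NN": 1.0}))
    # Left context: a noun reading is more likely right after a determiner.
    if left_tag == "DT" and "NN" in scores:
        scores["NN"] += 0.5
    # Right context: a noun reading is more likely right before a verb.
    if right_tag == "VBZ" and "NN" in scores:
        scores["NN"] += 0.3
    total = sum(scores.values())
    return {tag: s / total for tag, s in scores.items()}


def tag_easiest_first(words: List[str],
                      score: Scorer = toy_scores) -> List[str]:
    """Tag tokens in order of confidence, using tags already assigned
    on either side as context for the remaining tokens."""
    tags: List[Optional[str]] = [None] * len(words)
    while None in tags:
        best = None  # (confidence, index, tag) of the easiest open token
        for i, word in enumerate(words):
            if tags[i] is not None:
                continue
            left = tags[i - 1] if i > 0 else None
            right = tags[i + 1] if i + 1 < len(words) else None
            dist = score(word, left, right)
            tag, conf = max(dist.items(), key=lambda kv: kv[1])
            if best is None or conf > best[0]:
                best = (conf, i, tag)
        _, i, tag = best
        tags[i] = tag
    return tags


print(tag_easiest_first(["The", "can", "rusts"]))  # ['DT', 'NN', 'VBZ']
```

In this toy example "can" is tagged last: taken alone it would be read as a modal (MD), but once "The" on its left and "rusts" on its right have been tagged, the noun reading (NN) wins.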

Part-of-speech tagging has long been widely used in corpus linguistics, and in recent decades it has become an indispensable component of fields such as text mining and text classification/categorization. Its application in these fields faces one major problem: tagging is a time-consuming procedure that badly affects the speed of an NLP system when performed dynamically. To make POS tagging faster, we modified the bidirectional inference algorithm by excluding two parameters that can be computed on the fly once the remaining parameters are known. For details, see our paper [16] in the list of publications.
As a result, CLL-Tagger works much faster than its closest analogue, the T&T tagger developed by Japanese researchers, which employs the same algorithm.

The table below shows the tagging times measured in tests on a machine with a 2.8 GHz Pentium 4 processor and 768 MB of RAM.


 

Text size    T&T Tagger      CLL Tagger
10 KB        2 sec           in no time
50 KB        9 sec           <<1 sec
100 KB       17 sec          1 sec
500 KB       1 min 22 sec    3 sec
1000 KB      2 min 50 sec    6 sec

We also evaluated the quality of our tagger against that of the tagger used in the American National Corpus (ANC) by comparing the texts annotated by each tagger with the same texts annotated manually by human experts. The accuracy of our tagger was 99.27%, i.e. an error rate of 0.73% (about 7.3 errors per 1000 words), while the accuracy of the ANC tagger was 99.33%, i.e. an error rate of 0.67% (about 6.7 errors per 1000 words).
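
The relation between an accuracy figure and an error count per 1000 words can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming the tagger output and the manual annotation are aligned token by token. The function names are hypothetical and do not come from CLL-Tagger or the ANC tools.

```python
from typing import Sequence


def tagging_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of tokens whose predicted tag equals the manually assigned tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be aligned token by token")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty text")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def errors_per_thousand(accuracy: float) -> float:
    """Convert an accuracy in [0, 1] into tagging errors per 1000 words."""
    return (1.0 - accuracy) * 1000.0


# Accuracy figures reported in the text.
print(round(errors_per_thousand(0.9927), 1))  # 7.3
print(round(errors_per_thousand(0.9933), 1))  # 6.7
```

The difference between the two taggers is therefore about 0.6 errors per 1000 words.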

The ANC texts were chosen for the comparison because this corpus is the most recent one and employs the most modern software. While developing StarT, we took the ANC as a model and used the same tagset.

To use CLL-Tagger, open a text from a directory by clicking the “load and tag” button. You can then copy the annotated text to an external editor.