Stemming, as a procedure of automatic morphological analysis, has been an indispensable feature of information retrieval and text summarization since the early 1960s. The general idea underlying stemming is to identify words that are the same in meaning but different in form by removing suffixes and endings. Such identification is important for correct term weighting and significantly increases the effectiveness of information retrieval.
By now a number of stemmers for different languages have been created, the most famous English ones being the Porter stemmer and the Paice/Husk (Lancaster) stemmer. Both stemmers are algorithmic and work on lists of suffixes specific to English.
Y-Stemmer (Yatsko’s stemmer), in contrast with existing stemmers, is built on CLL-Tagger. The input text is first annotated with POS tags, and then the suffixes and endings specific to each part of speech are removed. Because stemming relies on this preliminary POS tagging, the number of overstemming mistakes is reduced.
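The idea of stripping only those suffixes that belong to a word's part of speech can be illustrated with a minimal Python sketch. It is not Y-Stemmer's actual implementation: the suffix tables, the coarse tag names, the minimum stem length and the function name `stem_tagged` are all assumptions made for illustration, and the POS tag is supplied by hand instead of coming from CLL-Tagger.

```python
# Hypothetical suffix tables keyed by a coarse POS tag.
# These are illustrative only and are NOT Y-Stemmer's real rules.
SUFFIXES_BY_POS = {
    "NOUN": ["ies", "es", "s"],
    "VERB": ["ing", "ed", "es", "s"],
    "ADJ": ["est", "er"],
}

# Assumed safeguard: never leave a stem shorter than this.
MIN_STEM_LEN = 3


def stem_tagged(word, pos):
    """Strip the first matching suffix allowed for the given POS tag."""
    lower = word.lower()
    for suffix in SUFFIXES_BY_POS.get(pos, []):
        if lower.endswith(suffix) and len(lower) - len(suffix) >= MIN_STEM_LEN:
            return lower[: -len(suffix)]
    # No suffix for this part of speech applies: leave the word unchanged.
    return lower


if __name__ == "__main__":
    # The same surface form is treated differently depending on its tag.
    print(stem_tagged("building", "VERB"))
    print(stem_tagged("building", "NOUN"))
```

In this sketch "building" tagged as a verb loses "-ing", while "building" tagged as a noun is left intact, because "-ing" is not listed among the noun suffixes. A stemmer that ignores the tag would strip the suffix in both cases, which is the kind of overstemming that POS tagging helps to avoid.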
Another distinctive feature of Y-Stemmer is the identification of irregular verb forms and of nouns and pronouns that have irregular plural forms. For example, Y-Stemmer will identify was, were, are, am as forms of the verb be, and buys, buying, bought as forms of the verb buy. In these cases the stemmer effectively performs lemmatization.
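Handling of irregular forms is commonly done with an exception dictionary that is consulted before any suffix rules are applied. The following Python sketch shows that general technique using only the two examples given above; the table layout and the names `IRREGULAR_FORMS` and `lemma_for` are hypothetical, and nothing here is taken from Y-Stemmer's own code.

```python
# Lemma -> listed word forms, using only the examples from the text.
# A real exception list would be far larger; this one is illustrative.
IRREGULAR_FORMS = {
    "be": ["was", "were", "are", "am"],
    "buy": ["buys", "buying", "bought"],
}

# Invert the table once so that each word form maps directly to its lemma.
FORM_TO_LEMMA = {
    form: lemma
    for lemma, forms in IRREGULAR_FORMS.items()
    for form in forms
}


def lemma_for(word):
    """Return the lemma of a listed form, or None if the word is not listed."""
    return FORM_TO_LEMMA.get(word.lower())


if __name__ == "__main__":
    for token in ["Were", "bought", "table"]:
        print(token, "->", lemma_for(token))
```

A word that is not found in the table returns `None`, which signals that ordinary suffix removal should be applied instead.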
Stemmers are usually integrated into NLP systems and used during text preprocessing. We distribute Y-Stemmer as a stand-alone application purely for testing purposes. It can also be used for educational purposes and for term weighting.
To get your text stemmed, just open it in Y-Stemmer; stems will be given in square brackets.
Check the “only stemmed words” box to get only the stems, without the word forms.
We evaluated our stemmer against the Paice/Husk stemmer and found that the quality of our stemmer is 98.7% (1.3% mistakes per 1000 words), while the quality of the Lancaster stemmer is 88.87%, i.e. 11.13% mistakes per 1000 words. Y-Stemmer outperformed the Paice/Husk stemmer by 9.83 percentage points (see details in our publications).
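The relation between the reported figures can be checked with a few lines of Python. The sketch assumes that quality is simply 100% minus the percentage of mistakes, which is how the numbers above fit together; the function name `quality_percent` is introduced here for illustration only.

```python
def quality_percent(mistake_percent):
    """Quality as 100% minus the mistake rate, rounded to two decimals."""
    return round(100.0 - mistake_percent, 2)


# Mistake rates as reported in the evaluation above.
y_stemmer_quality = quality_percent(1.3)
lancaster_quality = quality_percent(11.13)

# The gap between the two stemmers, in percentage points.
difference = round(y_stemmer_quality - lancaster_quality, 2)

if __name__ == "__main__":
    print(y_stemmer_quality, lancaster_quality, difference)
```

The difference is an absolute gap between two percentages, so it is expressed in percentage points and not as a relative improvement.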