An Overview of Stemming Algorithms

Krithika Nagi
3 min read · Nov 6, 2020


Stemming is the process of reducing inflected or derived words to a common root or base form. It has been used extensively in Information Retrieval and has also become a crucial part of Natural Language Processing (NLP). When vectorizing text, stemming is an essential preprocessing step and improves the overall Machine Learning model’s performance. Stemming algorithms are also popular in search, since users who look up a word are usually interested in similar words as well. Google implemented stemming in its search engine in 2003 and saw the desired results; prior to that, a search for “sleep” would not match words like “sleeping” or “sleepy”.
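
As a quick illustration, here is a minimal sketch of stemming as a preprocessing step before vectorization, using NLTK’s Snowball (English) stemmer; the word list is just an example, and it assumes nltk is installed.

```python
# Minimal sketch: stemming before building a bag-of-words representation,
# using NLTK's Snowball (English) stemmer. Assumes nltk is installed
# (pip install nltk).
from collections import Counter
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
tokens = ["sleep", "sleeping", "sleeps"]

# All three inflections collapse onto the single stem "sleep", so a
# bag-of-words vector counts them as one feature instead of three.
stems = [stemmer.stem(t) for t in tokens]
print(Counter(stems))  # Counter({'sleep': 3})
```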

The first stemming algorithm was introduced by J.B. Lovins in 1968. The next major algorithm was proposed in 1980 by M.F. Porter, and it remains the standard algorithm for stemming today. Later developments led to the various other algorithms that are now finding success.

Stemming algorithms can be divided into three distinct classes: Truncating, Statistical and Mixed. Let’s dive into each of them and look at the well-known algorithms.

1. Truncating

The simplest of all, truncating algorithms work by chopping words down, either by removing affixes or by simply truncating words to a fixed length.

Lovins Stemmer: The Lovins stemmer follows a rule-based affix-removal approach. It is a fairly simple algorithm that applies a fixed set of rules in a single pass: it removes the longest matching ending and then recodes the result into a valid word. It is extremely fast but has a high chance of over-stemming errors. In some cases it fails badly, but on the whole it works well with irregular plurals. A toy sketch of the longest-match idea is shown below.
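
To make the longest-match idea concrete, here is a toy sketch with a handful of made-up endings; the real Lovins stemmer uses a much larger ending list plus conditions and recoding rules.

```python
# Toy sketch of the Lovins-style "longest match" idea: strip the longest
# matching ending in a single pass, keeping a stem of at least 2 characters.
# The endings below are a small illustrative subset, not the real rule list.
ENDINGS = sorted(["ational", "ation", "ing", "ness", "ers", "es", "s"],
                 key=len, reverse=True)

def toy_lovins_stem(word: str) -> str:
    for ending in ENDINGS:  # longest endings are tried first
        if word.endswith(ending) and len(word) - len(ending) >= 2:
            return word[: -len(ending)]
    return word

for w in ["sleepers", "darkness", "singing"]:
    print(w, "->", toy_lovins_stem(w))
# sleepers -> sleep, darkness -> dark, singing -> sing
```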

Porter’s Stemmer: This is an algorithm of roughly 60 rules, applied in a series of steps. The basic stemmer has been modified over the years and is now the most widely used stemmer. Each step deals with suffixes of a certain kind and replaces a longer suffix with a shorter one, subject to conditions on the remaining stem. The algorithm is slower than Lovins’, even though the Lovins lookup table is far more extensive.
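
NLTK ships an implementation of the Porter stemmer, so a quick usage sketch (assuming nltk is installed) looks like this:

```python
# Porter's stemmer is available out of the box in NLTK (pip install nltk).
from nltk.stem import PorterStemmer

porter = PorterStemmer()
for word in ["caresses", "ponies", "running", "relational"]:
    print(word, "->", porter.stem(word))
# Typical outputs:
# caresses -> caress, ponies -> poni, running -> run, relational -> relat
```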

Dawson’s Stemmer: An extension of the Lovins stemmer, except that its lookup table of endings is stored in reversed order, indexed by length. It is also a single-pass stemmer, covers many more suffixes than Lovins, and has proven to be fast in execution.

2. Statistical

Statistical algorithms are based on an analytical approach: they apply statistical analysis and mathematical approximation to derive the stem words.

N-gram: As the name suggests, this algorithm breaks words into n-grams (usually n = 2 or 3), i.e. overlapping substrings of length n, and then applies statistical analysis on top of them to identify which words belong together. Its only disadvantage is that it requires a lot of memory and space to compute and store the n-gram data. A small sketch of the bigram idea follows.
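
A minimal sketch of the bigram approach: break words into character bigrams and use a Dice coefficient over shared bigrams to decide whether two words should be conflated (the word pairs below are just illustrative):

```python
# N-gram conflation sketch: words that share many character bigrams are
# considered variants of the same stem.
def bigrams(word: str) -> set:
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a: str, b: str) -> float:
    ga, gb = bigrams(a), bigrams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

print(dice("sleeping", "sleepy"))      # high similarity -> same group
print(dice("sleeping", "statistics"))  # low similarity -> different group
```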

3. Mixed

Krovetz Stemmer: This stemming algorithm converts plurals to their singular form and past tense to present tense, checking the result against a lexicon. It is often used together with another stemmer, as it cannot single-handedly cope with large documents and does not produce consistent results in terms of accuracy. A toy sketch of the dictionary-check idea is given below.
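
A toy sketch of that dictionary-check idea (the lexicon and the suffix rules here are illustrative placeholders, not the actual Krovetz rule set):

```python
# Toy sketch of the Krovetz idea: remove an inflectional ending and accept
# the result only if it appears in a lexicon. LEXICON and the rules below
# are placeholders for illustration only.
LEXICON = {"sleep", "walk", "pony", "dress"}

def toy_krovetz_stem(word: str) -> str:
    if word in LEXICON:  # already a dictionary word
        return word
    for suffix, repl in [("ies", "y"), ("ing", ""), ("ed", ""),
                         ("es", ""), ("s", "")]:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + repl
            if candidate in LEXICON:  # keep only valid words
                return candidate
    return word

for w in ["sleeping", "ponies", "walked", "dresses"]:
    print(w, "->", toy_krovetz_stem(w))
# sleeping -> sleep, ponies -> pony, walked -> walk, dresses -> dress
```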

Xerox Stemmer: Equipped to handle large volumes of data and produces valid words. However, the chance of over-stemming is high, and its dependence on a lexicon makes it language dependent; its limitation is that it is language specific.

YASS Stemmer: A corpus-based method that needs considerable computational resources to work efficiently. The advantage of this algorithm is that it is language independent, but it is a complex method, involving statistical analysis and a hierarchical clustering approach over the corpus vocabulary.
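
A rough sketch of the corpus-clustering idea, using a simple longest-common-prefix distance as a stand-in for YASS’s actual string distance measures and SciPy for the hierarchical clustering:

```python
# Sketch of corpus-based clustering: compute pairwise distances over the
# vocabulary and cluster words hierarchically; each cluster becomes an
# equivalence class of word forms. The prefix-based distance is only a
# stand-in for YASS's real distance measures.
from itertools import combinations
from scipy.cluster.hierarchy import fcluster, linkage

def prefix_distance(a: str, b: str) -> float:
    lcp = 0
    for x, y in zip(a, b):
        if x != y:
            break
        lcp += 1
    return 1.0 - lcp / max(len(a), len(b))  # 0 = identical, 1 = no overlap

vocab = ["sleep", "sleeping", "sleeps", "astronomy", "astronomer", "walks"]
condensed = [prefix_distance(a, b) for a, b in combinations(vocab, 2)]

# Complete-linkage clustering; the threshold controls how aggressively
# word forms are merged into one class.
Z = linkage(condensed, method="complete")
labels = fcluster(Z, t=0.5, criterion="distance")
for word, label in zip(vocab, labels):
    print(label, word)
```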

References:

Wikipedia references on Stemming, Lovins Stemming, Porter Stemming

A Comparative Study of Stemming Algorithms by Anjali G. Jivani (Nov 2011)
