Porter Stemmer algorithm


Reading time: 90 minutes

Stemming is the process of reducing a word to its word stem that affixes to suffixes and prefixes or to the roots of words known as a lemma. For example: words such as β€œLikes”, ”liked”, ”likely” and ”liking” will be reduced to β€œlike” after stemming.

In 1980, Porter presented a simple algorithm for stemming English language words. The Porter algorithm differs from Lovins-type stemmers (which were developed in 1968) in two major ways. The rules associated with suffix removal are much less complex in case of Porter's Stemmer. Lovins algorithm has been designed principally for the processing of scientific texts. The second difference is that the Porter's stemmer uses a single, unified approach to the handling of context whereas, Lovins' stemmer has separate rules according to the length of the stem remaining after removal of suffix.

We cover:

  • The algorithmic steps in Porter Stemmer algorithm
  • A native implementation in Python
  • Using Porter Stemmer algorithm from NLTK library
  • Conclusion

Algorithm

To present the suffix stripping algorithm in its entirety we will need a few difinitions.

A consonant in a word is a letter other than A, E, I, O or U, and other than Y preceded by a consonant. (The fact that the term consonant is defined to some extent in terms of itself does not make it ambiguous.) So in TOY the consonants are T and Y, and in SYZYGY they are S, Z and G. If a letter is not a consonant it is a vowel.

A consonant will be denoted by c, a vowel by v. A list ccc... of length greater than 0 will be denoted by C, and a list vvv... of length greater than 0 will be denoted by V. Any word, or part of a word, therefore has one of the four forms:

  • CVCV ... C
  • CVCV ... V
  • VCVC ... C
  • VCVC ... V

These may all be represented by the single form

[C]VCVC ... [V]

where the square brackets denote arbitrary presence of their contents. Using ($VC^m$) to denote VC repeated m times, this may again be written as

[C]($ VC^m $)[V]

m will be called the measure of any word or word part when represented in this form. The case m = 0 covers the null word. Here are some examples:

  • m=0 TR, EE, TREE, Y, BY.
  • m=1 TROUBLE, OATS, TREES, IVY.
  • m=2 TROUBLES, PRIVATE, OATEN, ORRERY.

The rules for removing a suffix will be given in the form

(condition) S1 -> S2

This means that if a word ends with the suffix S1, and the stem before S1 satisfies the given condition, S1 is replaced by S2. The condition is usually given in terms of m, e.g.

(m > 1) EMENT ->

Here S1 is 'EMENT' and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2.

The 'condition' part may also contain the following:

  • *S - the stem ends with S (and similarly for the other letters).
  • *v* - the stem contains a vowel.
  • m=2 TROUBLES, PRIVATE, OATEN, ORRERY.
  • *d - the stem ends with a double consonant (e.g. -TT, -SS).
  • *o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).

And the condition part may also contain expressions with and, or and not, so that:

(m>1 and (*S or *T)) : tests for a stem with m>1 ending in S or T, while

(*d and not (*L or *S or *Z)) : tests for a stem ending witha double consonant other than L, S or Z. Elaborate conditions like this are required only rarely.

In a set of rules written beneath each other, only one is obeyed, and this will be the one with the longest matching S1 for the given word. For example, with

  • SSES -> SS
  • IES -> I
  • SS -> SS
  • S ->

(here the conditions are all null) CARESSES maps to CARESS since SSES is the longest match for S1. Equally CARESS maps to CARESS (S1=SS) and CARES to CARE (S1=S).

In the rules below, examples of their application, successful or otherwise, are given on the right in lower case. The algorithm now follows:

Step 1a :

  • SSES -> SS (Example : caresses -> caress)
  • IES -> I (Example : ponies -> poni ; ties -> ti)
  • SS -> SS (Example : caress -> caress)
  • S -> (Example : cats -> cat)

Step 1b :

  • (m>0) EED -> EE (Example : feed -> feed ; agreed -> agree)
  • (v) ED -> (Example : plastered -> plaster ; bled -> bled)
  • (v) ING -> (Example : motoring -> motor ; sing -> sing)
  • S -> (Example : cats -> cat)

If the second or third of the rules in Step 1b is successful, the following is done:

  • AT -> ATE (Example : conflat(ed) -> conflate)
  • BL -> BLE (Example : troubl(ed) -> trouble)
  • IZ -> IZE (Example : siz(ed) -> size)
  • S -> (Example : cats -> cat)
  • (*d and not (*L or *S or *Z)) -> single letter (Example : hopp(ing) -> hop ; tann(ed) -> tan ; fall(ing) -> fall ; hiss(ing) -> hiss ; fizz(ed) -> fizz)
  • (m=1 and *o) -> E (Example : fail(ing) -> fail ; fil(ing) -> file)

The rule to map to a single letter causes the removal of one of the double letter pair. The -E is put back on -AT, -BL and -IZ, so that the suffixes -ATE, -BLE and -IZE can be recognised later. This E may be removed in step 4.

Step 1c :

(\*v\*) Y -> I (Example : happy -> happi ; sky -> sky)

Step 1 deals with plurals and past participles. The subsequent steps are much more straightforward.

Step 2 :

  • (m>0) ATIONAL -> ATE (Example : relational -> relate
  • (m>0) TIONAL -> TION (Example : conditional -> condition ; rational -> rational)
  • (m>0) ENCI -> ENCE (Example : valenci -> valence)
  • (m>0) ANCI -> ANCE (Example : hesitanci -> hesitance)
  • (m>0) IZER -> IZE (Example : digitizer -> digitize)
  • (m>0) ABLI -> ABLE (Example : conformabli -> conformable)
  • (m>0) ALLI -> AL (Example : radicalli -> radical)
  • (m>0) ENTLI -> ENT (differentli -> different)
  • (m>0) ELI -> E (vileli -> vile)
  • (m>0) OUSLI -> OUS (analogousli -> analogous)
  • (m>0) IZATION -> IZE (vietnamization -> vietnamize)
  • (m>0) ATION -> ATE (predication -> predicate)
  • (m>0) ATOR -> ATE (operator -> operate)
  • (m>0) ALISM -> AL (feudalism -> feudal)
  • (m>0) IVENESS -> IVE (decisiveness -> decisive)
  • (m>0) FULNESS -> FUL (hopefulness -> hopeful)
  • (m>0) OUSNESS -> OUS (callousness -> callous)
  • (m>0) ALITI -> AL (formaliti -> formal)
  • (m>0) IVITI -> IVE (sensitiviti -> sensitive)
  • (m>0) BILITI -> BLE (sensibiliti -> sensible)

The test for the string S1 can be made fast by doing a program switch on the penultimate letter of the word being tested. This gives a fairly even breakdown of the possible values of the string S1. It will be seen in fact that the S1-strings in step 2 are presented here in the alphabetical order of their penultimate letter. Similar techniques may be applied in the other steps.

Step 3 :

  • (m>0) ICATE -> IC (Example : triplicate -> triplic)
  • (m>0) ATIVE -> (Example : formative -> form)
  • (m>0) ALIZE -> AL (Example : formalize -> formal)
  • (m>0) ICITI -> IC (Example : electriciti -> electric))
  • (m>0) ICAL -> IC (Example : electrical -> electric)
  • (m>0) FUL -> (Example : hopeful -> hope)
  • (m>0) NESS -> (Example : goodness -> good)

Step 4 :

  • (m>1) AL -> (Example : revival -> reviv)
  • (m>1) ANCE -> (Example : allowance -> allow)
  • (m>1) ENCE -> (Example : inference -> infer)
  • (m>1) ER -> (Example : airliner -> airlin)
  • (m>1) IC -> (Example : gyroscopic -> gyroscop)
  • (m>1) ABLE -> (Example : adjustable -> adjust)
  • (m>1) IBLE -> (Example : defensible -> defens)
  • (m>1) ANT -> (Example : irritant -> irrit)
  • (m>1) EMENT -> (Example : replacement -> replac)
  • (m>1) MENT -> (Example : adjustment -> adjust)
  • (m>1) ENT -> (Example : dependent -> depend)
  • (m>1 and (*S or *T)) ION -> (Example : adoption -> adopt)
  • (m>1) OU -> (Example : homologou -> homolog)
  • (m>1) ISM -> (Example : communism -> commun)
  • (m>1) ATE -> (Example : activate -> activ)
  • (m>1) ITI -> (Example : angulariti -> angular)
  • (m>1) OUS -> (Example : homologous -> homolog)
  • (m>1) IVE -> (Example : effective -> effect)
  • (m>1) IZE -> (Example : bowdlerize -> bowdler)

The suffixes are now removed. All that remains is a little tidying up.

Step 5a :

  • (m>1) E -> (Example : probate -> probat ; rate -> rate)
  • (m=1 and not *o) E -> (Example : cease -> ceas)

Step 5b

(m > 1 and *d and *L) -> single letter

(Example : controll -> control ; roll -> roll)

Now, let's see a native implementation of this algorithm. The algorithm will become crystal clear as you go through code. The original code can be found here here. Thankyou Julia Menchavez for the awesome code :). The original paper by Martin Porter can be found here.

Python code for native implementation

         class PorterStemmer:
    def isCons(self, letter):#function that returns true if a letter is a consonant otherwise false
        if letter == 'a' or letter == 'e' or letter == 'i' or letter == 'o'
        or letter == 'u':
            return False
        else:
            return True

    def isConsonant(self, word, i):#function that returns true only if the letter at i th position 
        #in the argument 'word' is a consonant.But if the letter is 'y' and the letter at i-1 th position 
        #is also a consonant, then it returns false.
        letter = word[i]
        if self.isCons(letter):
            if letter == 'y' and isCons(word[i-1]):
                return False
            else:
                return True
        else:
            return False

    def isVowel(self, word, i):#function that returns true if the letter at i th position in the argument 'word'
        #is a vowel
        return not(isConsonant(word, i))

    # *S
    def endsWith(self, stem, letter):#returns true if the word 'stem' ends with 'letter' 
        if stem.endswith(letter):
            return True
        else:
            return False

    # *v*
    def containsVowel(self, stem):#returns true if the word 'stem' contains a vowel
        for i in stem:
            if not self.isCons(i):
                return True
        return False

    # *d
    def doubleCons(self, stem):#returns true if the word 'stem' ends with 2 consonants
        if len(stem) >= 2:
            if self.isConsonant(stem, -1) and self.isConsonant(stem, -2):
                return True
            else:
                return False
        else:
            return False

    def getForm(self, word):
        #This function takes a word as an input, and checks for vowel and consonant sequences in that word.
        #vowel sequence is denoted by V and consonant sequences by C
        #For example, the word 'balloon' can be divived into following sequences:
        #'b' : C
        #'a' : V
        #'ll': C
        #'oo': V
        #'n' : C
        #So form = [C,V,C,V,C] and formstr = CVCVC
        form = []
        formStr = ''
        for i in range(len(word)):
            if self.isConsonant(word, i):
                if i != 0:
                    prev = form[-1]
                    if prev != 'C':
                        form.append('C')
                else:
                    form.append('C')
            else:
                if i != 0:
                    prev = form[-1]
                    if prev != 'V':
                        form.append('V')
                else:
                    form.append('V')
        for j in form:
            formStr += j
        return formStr

    def getM(self, word):
        #returns value of M which is equal to number of 'VC' in formstr
        #So in above example of word 'balloon', we have 2 'VC'

        form = self.getForm(word)
        m = form.count('VC')
        return m

    # *o
    def cvc(self, word):
        #returns true if the last 3 letters of the word are of the following pattern: consonant,vowel,consonant
        #but if the last word is either 'w','x' or 'y', it returns false
        if len(word) >= 3:
            f = -3
            s = -2
            t = -1
            third = word[t]
            if self.isConsonant(word, f) and self.isVowel(word, s)
            and self.isConsonant(word, t):
                if third != 'w' and third != 'x' and third != 'y':
                    return True
                else:
                    return False
            else:
                return False
        else:
            return False

    def replace(self, orig, rem, rep):
        #this function checks if string 'orig' ends with 'rem' and
        #replaces 'rem' by the substring 'rep'. The resulting string 'replaced'
        #is returned.

        result = orig.rfind(rem)
        base = orig[:result]
        replaced = base + rep
        return replaced

    def replaceM0(self, orig, rem, rep):
        #same as the function replace(), except that it checks the value of M for the 
        #base string. If it is >0 , it replaces 'rem' by 'rep', otherwise it returns the
        #original string

        result = orig.rfind(rem)
        base = orig[:result]
        if self.getM(base) > 0:
            replaced = base + rep
            return replaced
        else:
            return orig

    def replaceM1(self, orig, rem, rep):
        #same as replaceM0(), except that it replaces 'rem' by 'rep', only when M>1 for
        #the base string

        result = orig.rfind(rem)
        base = orig[:result]
        if self.getM(base) > 1:
            replaced = base + rep
            return replaced
        else:
            return orig

    def step1a(self, word):
        #In a given word, this function replaces 'sses' by 'ss', 'ies' by 'i',
        #'ss' by 'ss' and 's' by ''

        """step1a() gets rid of plurals. e.g.

           caresses  ->  caress
           ponies    ->  poni
           ties      ->  ti
           caress    ->  caress
           cats      ->  cat
        """
        if word.endswith('sses'):
            word = self.replace(word, 'sses', 'ss')
        elif word.endswith('ies'):
            word = self.replace(word, 'ies', 'i')
        elif word.endswith('ss'):
            word = self.replace(word, 'ss', 'ss')
        elif word.endswith('s'):
            word = self.replace(word, 's', '')
        else:
            pass
        return word

    def step1b(self, word):
        #this function checks if a word ends with 'eed','ed' or 'ing' and replces these substrings by
        #'ee','' and ''. If after the replacements in case of 'ed' and 'ing', the resulting word
        # -> ends with 'at','bl' or 'iz' : add 'e' to the end of the word
        # -> ends with 2 consonants and its last letter isn't 'l','s' or 'z': remove last letter of the word
        # -> has 1 as value of M and the cvc(word) returns true : add 'e' to the end of the word
        
        '''
        step1b gets rid of -eed -ed or -ing. e.g.

        feed      ->  feed
           agreed    ->  agree
           disabled  ->  disable

           matting   ->  mat
           mating    ->  mate
           meeting   ->  meet
           milling   ->  mill
           messing   ->  mess

           meetings  ->  meet
        '''
        flag = False
        if word.endswith('eed'):
            result = word.rfind('eed')
            base = word[:result]
            if self.getM(base) > 0:
                word = base
                word += 'ee'
        elif word.endswith('ed'):
            result = word.rfind('ed')
            base = word[:result]
            if self.containsVowel(base):
                word = base
                flag = True
        elif word.endswith('ing'):
            result = word.rfind('ing')
            base = word[:result]
            if self.containsVowel(base):
                word = base
                flag = True
        if flag:
            if word.endswith('at') or word.endswith('bl')
            or word.endswith('iz'):
                word += 'e'
            elif self.doubleCons(word) and not self.endsWith(word, 'l')
            and not self.endsWith(word, 's') and not self.endsWith(word, 'z'):
                word = word[:-1]
            elif self.getM(word) == 1 and self.cvc(word):
                word += 'e'
            else:
                pass
        else:
            pass
        return word

    def step1c(self, word):
        #In words ending with 'y' this function replaces 'y' by 'i'
        
        """step1c() turns terminal y to i when there is another vowel in the stem."""

        if word.endswith('y'):
            result = word.rfind('y')
            base = word[:result]
            if self.containsVowel(base):
                word = base
                word += 'i'
        return word

    def step2(self, word):
        #this function checks the value of M, and replaces the suffixes accordingly
        
        """step2() maps double suffices to single ones.
        so -ization ( = -ize plus -ation) maps to -ize etc. note that the
        string before the suffix must give m() > 0.
        """

        if word.endswith('ational'):
            word = self.replaceM0(word, 'ational', 'ate')
        elif word.endswith('tional'):
            word = self.replaceM0(word, 'tional', 'tion')
        elif word.endswith('enci'):
            word = self.replaceM0(word, 'enci', 'ence')
        elif word.endswith('anci'):
            word = self.replaceM0(word, 'anci', 'ance')
        elif word.endswith('izer'):
            word = self.replaceM0(word, 'izer', 'ize')
        elif word.endswith('abli'):
            word = self.replaceM0(word, 'abli', 'able')
        elif word.endswith('alli'):
            word = self.replaceM0(word, 'alli', 'al')
        elif word.endswith('entli'):
            word = self.replaceM0(word, 'entli', 'ent')
        elif word.endswith('eli'):
            word = self.replaceM0(word, 'eli', 'e')
        elif word.endswith('ousli'):
            word = self.replaceM0(word, 'ousli', 'ous')
        elif word.endswith('ization'):
            word = self.replaceM0(word, 'ization', 'ize')
        elif word.endswith('ation'):
            word = self.replaceM0(word, 'ation', 'ate')
        elif word.endswith('ator'):
            word = self.replaceM0(word, 'ator', 'ate')
        elif word.endswith('alism'):
            word = self.replaceM0(word, 'alism', 'al')
        elif word.endswith('iveness'):
            word = self.replaceM0(word, 'iveness', 'ive')
        elif word.endswith('fulness'):
            word = self.replaceM0(word, 'fulness', 'ful')
        elif word.endswith('ousness'):
            word = self.replaceM0(word, 'ousness', 'ous')
        elif word.endswith('aliti'):
            word = self.replaceM0(word, 'aliti', 'al')
        elif word.endswith('iviti'):
            word = self.replaceM0(word, 'iviti', 'ive')
        elif word.endswith('biliti'):
            word = self.replaceM0(word, 'biliti', 'ble')
        return word

    def step3(self, word):
        #this function checks the value of M, and replaces the suffixes accordingly
        
        """step3() dels with -ic-, -full, -ness etc. similar strategy to step2."""

        if word.endswith('icate'):
            word = self.replaceM0(word, 'icate', 'ic')
        elif word.endswith('ative'):
            word = self.replaceM0(word, 'ative', '')
        elif word.endswith('alize'):
            word = self.replaceM0(word, 'alize', 'al')
        elif word.endswith('iciti'):
            word = self.replaceM0(word, 'iciti', 'ic')
        elif word.endswith('ful'):
            word = self.replaceM0(word, 'ful', '')
        elif word.endswith('ness'):
            word = self.replaceM0(word, 'ness', '')
        return word

    def step4(self, word):
        #this function checks the value of M, and replaces the suffixes accordingly
        
        """step4() takes off -ant, -ence etc., in context <c>vcvc<v>{meaning, M >1 for the word}."""

        if word.endswith('al'):
            word = self.replaceM1(word, 'al', '')
        elif word.endswith('ance'):
            word = self.replaceM1(word, 'ance', '')
        elif word.endswith('ence'):
            word = self.replaceM1(word, 'ence', '')
        elif word.endswith('er'):
            word = self.replaceM1(word, 'er', '')
        elif word.endswith('ic'):
            word = self.replaceM1(word, 'ic', '')
        elif word.endswith('able'):
            word = self.replaceM1(word, 'able', '')
        elif word.endswith('ible'):
            word = self.replaceM1(word, 'ible', '')
        elif word.endswith('ant'):
            word = self.replaceM1(word, 'ant', '')
        elif word.endswith('ement'):
            word = self.replaceM1(word, 'ement', '')
        elif word.endswith('ment'):
            word = self.replaceM1(word, 'ment', '')
        elif word.endswith('ent'):
            word = self.replaceM1(word, 'ent', '')
        elif word.endswith('ou'):
            word = self.replaceM1(word, 'ou', '')
        elif word.endswith('ism'):
            word = self.replaceM1(word, 'ism', '')
        elif word.endswith('ate'):
            word = self.replaceM1(word, 'ate', '')
        elif word.endswith('iti'):
            word = self.replaceM1(word, 'iti', '')
        elif word.endswith('ous'):
            word = self.replaceM1(word, 'ous', '')
        elif word.endswith('ive'):
            word = self.replaceM1(word, 'ive', '')
        elif word.endswith('ize'):
            word = self.replaceM1(word, 'ize', '')
        elif word.endswith('ion'):
            result = word.rfind('ion')
            base = word[:result]
            if self.getM(base) > 1
            and (self.endsWith(base, 's') or self.endsWith(base, 't')):
                word = base
            word = self.replaceM1(word, '', '')
        return word

    def step5a(self, word):
        #this function checks if the word ends with 'e'. If it does, it checks the value of
        #M for the base word. If M>1, OR, If M = 1 and cvc(base) is false, it simply removes 'e'
        #ending.
        
        """step5() removes a final -e if m() > 1.
        """

        if word.endswith('e'):
            base = word[:-1]
            if self.getM(base) > 1:
                word = base
            elif self.getM(base) == 1 and not self.cvc(base):
                word = base
        return word

    def step5b(self, word):
        #this function checks if the value of M for the word is greater than 1 and it ends with 2 consonants
        # and it ends with 'l', it removes 'l'
        
        #step5b changes -ll to -l if m() > 1
        if self.getM(word) > 1 and self.doubleCons(word)
        and self.endsWith(word, 'l'):
            word = word[:-1]
        return word

    def stem(self, word):
        #steps to be followed for stemming according to the porter stemmer :)

        word = self.step1a(word)
        word = self.step1b(word)
        word = self.step1c(word)
        word = self.step2(word)
        word = self.step3(word)
        word = self.step4(word)
        word = self.step5a(word)
        word = self.step5b(word)
        return word   

NLTK library

A package called PorterStemmer is available in the NLTK library. It makes life so much more easier for us :p. Let's see how to use it.

Python code for using NLTK package

         #First, we're going to grab and define our stemmer:
         
         from nltk.stem import PorterStemmer
         from nltk.tokenize import sent_tokenize, word_tokenize
         ps = PorterStemmer()
         
         #Now, let's choose some words with a similar stem, like:
         
         example_words =["python","pythoner","pythoning","pythoned","pythonly"]
         
         #Next, we can easily stem by doing something like:
         
         for w in example_words:
             print(ps.stem(w))

Output

python
python
python
python
pythonli

Python code

    #Now let's try stemming a typical sentence, rather than some words:
    
    new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
    
    words = word_tokenize(new_text)

    for w in words:
        print(ps.stem(w))

Output

It
is
import
to
by
veri
pythonli
while
you
are
python
with
python
.
All
python
have
python
poorli
at
least
onc
.

Conclusion

Porter’s algorithm is important for two reasons. First, it provides a simple approach to conflation that seems to work well in practice and that is applicable to a range of languages.

Second, it has spurred interest in stemming as a topic for research in its own right, rather than merely as a low-level component of an information retrieval system. The algorithm was first published in 1980; however, it and its descendants
continue to be employed in a range of applications that stretch far beyond its original intended use.