Open-Source Internship opportunity by OpenGenus for programmers. Apply now.
Reading time: 90 minutes
Stemming is the process of reducing a word to its word stem that affixes to suffixes and prefixes or to the roots of words known as a lemma. For example: words such as βLikesβ, βlikedβ, βlikelyβ and βlikingβ will be reduced to βlikeβ after stemming.
In 1980, Porter presented a simple algorithm for stemming English language words. The Porter algorithm differs from Lovins-type stemmers (which were developed in 1968) in two major ways. The rules associated with suffix removal are much less complex in case of Porter's Stemmer. Lovins algorithm has been designed principally for the processing of scientific texts. The second difference is that the Porter's stemmer uses a single, unified approach to the handling of context whereas, Lovins' stemmer has separate rules according to the length of the stem remaining after removal of suffix.
We cover:
- The algorithmic steps in Porter Stemmer algorithm
- A native implementation in Python
- Using Porter Stemmer algorithm from NLTK library
- Conclusion
Algorithm
To present the suffix stripping algorithm in its entirety we will need a few difinitions.
A consonant in a word is a letter other than A, E, I, O or U, and other than Y preceded by a consonant. (The fact that the term consonant is defined to some extent in terms of itself does not make it ambiguous.) So in TOY the consonants are T and Y, and in SYZYGY they are S, Z and G. If a letter is not a consonant it is a vowel.
A consonant will be denoted by c, a vowel by v. A list ccc... of length greater than 0 will be denoted by C, and a list vvv... of length greater than 0 will be denoted by V. Any word, or part of a word, therefore has one of the four forms:
- CVCV ... C
- CVCV ... V
- VCVC ... C
- VCVC ... V
These may all be represented by the single form
[C]VCVC ... [V]
where the square brackets denote arbitrary presence of their contents. Using ($VC^m$) to denote VC repeated m times, this may again be written as
[C]($ VC^m $)[V]
m will be called the measure of any word or word part when represented in this form. The case m = 0 covers the null word. Here are some examples:
- m=0 TR, EE, TREE, Y, BY.
- m=1 TROUBLE, OATS, TREES, IVY.
- m=2 TROUBLES, PRIVATE, OATEN, ORRERY.
The rules for removing a suffix will be given in the form
(condition) S1 -> S2
This means that if a word ends with the suffix S1, and the stem before S1 satisfies the given condition, S1 is replaced by S2. The condition is usually given in terms of m, e.g.
(m > 1) EMENT ->
Here S1 is 'EMENT' and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2.
The 'condition' part may also contain the following:
*S
- the stem ends with S (and similarly for the other letters).*v*
- the stem contains a vowel.m=2
TROUBLES, PRIVATE, OATEN, ORRERY.*d
- the stem ends with a double consonant (e.g. -TT, -SS).*o
- the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).
And the condition part may also contain expressions with and, or and not, so that:
(m>1 and (*S or *T))
: tests for a stem with m>1 ending in S or T, while
(*d and not (*L or *S or *Z))
: tests for a stem ending witha double consonant other than L, S or Z. Elaborate conditions like this are required only rarely.
In a set of rules written beneath each other, only one is obeyed, and this will be the one with the longest matching S1 for the given word. For example, with
- SSES -> SS
- IES -> I
- SS -> SS
- S ->
(here the conditions are all null) CARESSES maps to CARESS since SSES is the longest match for S1. Equally CARESS maps to CARESS (S1=SS
) and CARES to CARE (S1=S
).
In the rules below, examples of their application, successful or otherwise, are given on the right in lower case. The algorithm now follows:
Step 1a :
- SSES -> SS (Example : caresses -> caress)
- IES -> I (Example : ponies -> poni ; ties -> ti)
- SS -> SS (Example : caress -> caress)
- S -> (Example : cats -> cat)
Step 1b :
- (m>0) EED -> EE (Example : feed -> feed ; agreed -> agree)
- (v) ED -> (Example : plastered -> plaster ; bled -> bled)
- (v) ING -> (Example : motoring -> motor ; sing -> sing)
- S -> (Example : cats -> cat)
If the second or third of the rules in Step 1b is successful, the following is done:
- AT -> ATE (Example : conflat(ed) -> conflate)
- BL -> BLE (Example : troubl(ed) -> trouble)
- IZ -> IZE (Example : siz(ed) -> size)
- S -> (Example : cats -> cat)
- (
*d
and not (*L
or*S
or*Z
)) -> single letter (Example : hopp(ing) -> hop ; tann(ed) -> tan ; fall(ing) -> fall ; hiss(ing) -> hiss ; fizz(ed) -> fizz) - (m=1 and
*o
) -> E (Example : fail(ing) -> fail ; fil(ing) -> file)
The rule to map to a single letter causes the removal of one of the double letter pair. The -E is put back on -AT, -BL and -IZ, so that the suffixes -ATE, -BLE and -IZE can be recognised later. This E may be removed in step 4.
Step 1c :
(\*v\*
) Y -> I (Example : happy -> happi ; sky -> sky)
Step 1 deals with plurals and past participles. The subsequent steps are much more straightforward.
Step 2 :
- (m>0) ATIONAL -> ATE (Example : relational -> relate
- (m>0) TIONAL -> TION (Example : conditional -> condition ; rational -> rational)
- (m>0) ENCI -> ENCE (Example : valenci -> valence)
- (m>0) ANCI -> ANCE (Example : hesitanci -> hesitance)
- (m>0) IZER -> IZE (Example : digitizer -> digitize)
- (m>0) ABLI -> ABLE (Example : conformabli -> conformable)
- (m>0) ALLI -> AL (Example : radicalli -> radical)
- (m>0) ENTLI -> ENT (differentli -> different)
- (m>0) ELI -> E (vileli -> vile)
- (m>0) OUSLI -> OUS (analogousli -> analogous)
- (m>0) IZATION -> IZE (vietnamization -> vietnamize)
- (m>0) ATION -> ATE (predication -> predicate)
- (m>0) ATOR -> ATE (operator -> operate)
- (m>0) ALISM -> AL (feudalism -> feudal)
- (m>0) IVENESS -> IVE (decisiveness -> decisive)
- (m>0) FULNESS -> FUL (hopefulness -> hopeful)
- (m>0) OUSNESS -> OUS (callousness -> callous)
- (m>0) ALITI -> AL (formaliti -> formal)
- (m>0) IVITI -> IVE (sensitiviti -> sensitive)
- (m>0) BILITI -> BLE (sensibiliti -> sensible)
The test for the string S1 can be made fast by doing a program switch on the penultimate letter of the word being tested. This gives a fairly even breakdown of the possible values of the string S1. It will be seen in fact that the S1-strings in step 2 are presented here in the alphabetical order of their penultimate letter. Similar techniques may be applied in the other steps.
Step 3 :
- (m>0) ICATE -> IC (Example : triplicate -> triplic)
- (m>0) ATIVE -> (Example : formative -> form)
- (m>0) ALIZE -> AL (Example : formalize -> formal)
- (m>0) ICITI -> IC (Example : electriciti -> electric))
- (m>0) ICAL -> IC (Example : electrical -> electric)
- (m>0) FUL -> (Example : hopeful -> hope)
- (m>0) NESS -> (Example : goodness -> good)
Step 4 :
- (m>1) AL -> (Example : revival -> reviv)
- (m>1) ANCE -> (Example : allowance -> allow)
- (m>1) ENCE -> (Example : inference -> infer)
- (m>1) ER -> (Example : airliner -> airlin)
- (m>1) IC -> (Example : gyroscopic -> gyroscop)
- (m>1) ABLE -> (Example : adjustable -> adjust)
- (m>1) IBLE -> (Example : defensible -> defens)
- (m>1) ANT -> (Example : irritant -> irrit)
- (m>1) EMENT -> (Example : replacement -> replac)
- (m>1) MENT -> (Example : adjustment -> adjust)
- (m>1) ENT -> (Example : dependent -> depend)
- (m>1 and (*S or *T)) ION -> (Example : adoption -> adopt)
- (m>1) OU -> (Example : homologou -> homolog)
- (m>1) ISM -> (Example : communism -> commun)
- (m>1) ATE -> (Example : activate -> activ)
- (m>1) ITI -> (Example : angulariti -> angular)
- (m>1) OUS -> (Example : homologous -> homolog)
- (m>1) IVE -> (Example : effective -> effect)
- (m>1) IZE -> (Example : bowdlerize -> bowdler)
The suffixes are now removed. All that remains is a little tidying up.
Step 5a :
- (m>1) E -> (Example : probate -> probat ; rate -> rate)
- (m=1 and not *o) E -> (Example : cease -> ceas)
Step 5b
(m > 1 and *d
and *L
) -> single letter
(Example : controll -> control ; roll -> roll)
Now, let's see a native implementation of this algorithm. The algorithm will become crystal clear as you go through code. The original code can be found here here. Thankyou Julia Menchavez for the awesome code :). The original paper by Martin Porter can be found here.
Python code for native implementation
class PorterStemmer:
def isCons(self, letter):#function that returns true if a letter is a consonant otherwise false
if letter == 'a' or letter == 'e' or letter == 'i' or letter == 'o'
or letter == 'u':
return False
else:
return True
def isConsonant(self, word, i):#function that returns true only if the letter at i th position
#in the argument 'word' is a consonant.But if the letter is 'y' and the letter at i-1 th position
#is also a consonant, then it returns false.
letter = word[i]
if self.isCons(letter):
if letter == 'y' and self.isCons(word[i-1]):
return False
else:
return True
else:
return False
def isVowel(self, word, i):#function that returns true if the letter at i th position in the argument 'word'
#is a vowel
return not(isConsonant(word, i))
# *S
def endsWith(self, stem, letter):#returns true if the word 'stem' ends with 'letter'
if stem.endswith(letter):
return True
else:
return False
# *v*
def containsVowel(self, stem):#returns true if the word 'stem' contains a vowel
for i in stem:
if not self.isCons(i):
return True
return False
# *d
def doubleCons(self, stem):#returns true if the word 'stem' ends with 2 consonants
if len(stem) >= 2:
if self.isConsonant(stem, -1) and self.isConsonant(stem, -2):
return True
else:
return False
else:
return False
def getForm(self, word):
#This function takes a word as an input, and checks for vowel and consonant sequences in that word.
#vowel sequence is denoted by V and consonant sequences by C
#For example, the word 'balloon' can be divived into following sequences:
#'b' : C
#'a' : V
#'ll': C
#'oo': V
#'n' : C
#So form = [C,V,C,V,C] and formstr = CVCVC
form = []
formStr = ''
for i in range(len(word)):
if self.isConsonant(word, i):
if i != 0:
prev = form[-1]
if prev != 'C':
form.append('C')
else:
form.append('C')
else:
if i != 0:
prev = form[-1]
if prev != 'V':
form.append('V')
else:
form.append('V')
for j in form:
formStr += j
return formStr
def getM(self, word):
#returns value of M which is equal to number of 'VC' in formstr
#So in above example of word 'balloon', we have 2 'VC'
form = self.getForm(word)
m = form.count('VC')
return m
# *o
def cvc(self, word):
#returns true if the last 3 letters of the word are of the following pattern: consonant,vowel,consonant
#but if the last word is either 'w','x' or 'y', it returns false
if len(word) >= 3:
f = -3
s = -2
t = -1
third = word[t]
if self.isConsonant(word, f) and self.isVowel(word, s)
and self.isConsonant(word, t):
if third != 'w' and third != 'x' and third != 'y':
return True
else:
return False
else:
return False
else:
return False
def replace(self, orig, rem, rep):
#this function checks if string 'orig' ends with 'rem' and
#replaces 'rem' by the substring 'rep'. The resulting string 'replaced'
#is returned.
result = orig.rfind(rem)
base = orig[:result]
replaced = base + rep
return replaced
def replaceM0(self, orig, rem, rep):
#same as the function replace(), except that it checks the value of M for the
#base string. If it is >0 , it replaces 'rem' by 'rep', otherwise it returns the
#original string
result = orig.rfind(rem)
base = orig[:result]
if self.getM(base) > 0:
replaced = base + rep
return replaced
else:
return orig
def replaceM1(self, orig, rem, rep):
#same as replaceM0(), except that it replaces 'rem' by 'rep', only when M>1 for
#the base string
result = orig.rfind(rem)
base = orig[:result]
if self.getM(base) > 1:
replaced = base + rep
return replaced
else:
return orig
def step1a(self, word):
#In a given word, this function replaces 'sses' by 'ss', 'ies' by 'i',
#'ss' by 'ss' and 's' by ''
"""step1a() gets rid of plurals. e.g.
caresses -> caress
ponies -> poni
ties -> ti
caress -> caress
cats -> cat
"""
if word.endswith('sses'):
word = self.replace(word, 'sses', 'ss')
elif word.endswith('ies'):
word = self.replace(word, 'ies', 'i')
elif word.endswith('ss'):
word = self.replace(word, 'ss', 'ss')
elif word.endswith('s'):
word = self.replace(word, 's', '')
else:
pass
return word
def step1b(self, word):
#this function checks if a word ends with 'eed','ed' or 'ing' and replces these substrings by
#'ee','' and ''. If after the replacements in case of 'ed' and 'ing', the resulting word
# -> ends with 'at','bl' or 'iz' : add 'e' to the end of the word
# -> ends with 2 consonants and its last letter isn't 'l','s' or 'z': remove last letter of the word
# -> has 1 as value of M and the cvc(word) returns true : add 'e' to the end of the word
'''
step1b gets rid of -eed -ed or -ing. e.g.
feed -> feed
agreed -> agree
disabled -> disable
matting -> mat
mating -> mate
meeting -> meet
milling -> mill
messing -> mess
meetings -> meet
'''
flag = False
if word.endswith('eed'):
result = word.rfind('eed')
base = word[:result]
if self.getM(base) > 0:
word = base
word += 'ee'
elif word.endswith('ed'):
result = word.rfind('ed')
base = word[:result]
if self.containsVowel(base):
word = base
flag = True
elif word.endswith('ing'):
result = word.rfind('ing')
base = word[:result]
if self.containsVowel(base):
word = base
flag = True
if flag:
if word.endswith('at') or word.endswith('bl')
or word.endswith('iz'):
word += 'e'
elif self.doubleCons(word) and not self.endsWith(word, 'l')
and not self.endsWith(word, 's') and not self.endsWith(word, 'z'):
word = word[:-1]
elif self.getM(word) == 1 and self.cvc(word):
word += 'e'
else:
pass
else:
pass
return word
def step1c(self, word):
#In words ending with 'y' this function replaces 'y' by 'i'
"""step1c() turns terminal y to i when there is another vowel in the stem."""
if word.endswith('y'):
result = word.rfind('y')
base = word[:result]
if self.containsVowel(base):
word = base
word += 'i'
return word
def step2(self, word):
#this function checks the value of M, and replaces the suffixes accordingly
"""step2() maps double suffices to single ones.
so -ization ( = -ize plus -ation) maps to -ize etc. note that the
string before the suffix must give m() > 0.
"""
if word.endswith('ational'):
word = self.replaceM0(word, 'ational', 'ate')
elif word.endswith('tional'):
word = self.replaceM0(word, 'tional', 'tion')
elif word.endswith('enci'):
word = self.replaceM0(word, 'enci', 'ence')
elif word.endswith('anci'):
word = self.replaceM0(word, 'anci', 'ance')
elif word.endswith('izer'):
word = self.replaceM0(word, 'izer', 'ize')
elif word.endswith('abli'):
word = self.replaceM0(word, 'abli', 'able')
elif word.endswith('alli'):
word = self.replaceM0(word, 'alli', 'al')
elif word.endswith('entli'):
word = self.replaceM0(word, 'entli', 'ent')
elif word.endswith('eli'):
word = self.replaceM0(word, 'eli', 'e')
elif word.endswith('ousli'):
word = self.replaceM0(word, 'ousli', 'ous')
elif word.endswith('ization'):
word = self.replaceM0(word, 'ization', 'ize')
elif word.endswith('ation'):
word = self.replaceM0(word, 'ation', 'ate')
elif word.endswith('ator'):
word = self.replaceM0(word, 'ator', 'ate')
elif word.endswith('alism'):
word = self.replaceM0(word, 'alism', 'al')
elif word.endswith('iveness'):
word = self.replaceM0(word, 'iveness', 'ive')
elif word.endswith('fulness'):
word = self.replaceM0(word, 'fulness', 'ful')
elif word.endswith('ousness'):
word = self.replaceM0(word, 'ousness', 'ous')
elif word.endswith('aliti'):
word = self.replaceM0(word, 'aliti', 'al')
elif word.endswith('iviti'):
word = self.replaceM0(word, 'iviti', 'ive')
elif word.endswith('biliti'):
word = self.replaceM0(word, 'biliti', 'ble')
return word
def step3(self, word):
#this function checks the value of M, and replaces the suffixes accordingly
"""step3() dels with -ic-, -full, -ness etc. similar strategy to step2."""
if word.endswith('icate'):
word = self.replaceM0(word, 'icate', 'ic')
elif word.endswith('ative'):
word = self.replaceM0(word, 'ative', '')
elif word.endswith('alize'):
word = self.replaceM0(word, 'alize', 'al')
elif word.endswith('iciti'):
word = self.replaceM0(word, 'iciti', 'ic')
elif word.endswith('ful'):
word = self.replaceM0(word, 'ful', '')
elif word.endswith('ness'):
word = self.replaceM0(word, 'ness', '')
return word
def step4(self, word):
#this function checks the value of M, and replaces the suffixes accordingly
"""step4() takes off -ant, -ence etc., in context <c>vcvc<v>{meaning, M >1 for the word}."""
if word.endswith('al'):
word = self.replaceM1(word, 'al', '')
elif word.endswith('ance'):
word = self.replaceM1(word, 'ance', '')
elif word.endswith('ence'):
word = self.replaceM1(word, 'ence', '')
elif word.endswith('er'):
word = self.replaceM1(word, 'er', '')
elif word.endswith('ic'):
word = self.replaceM1(word, 'ic', '')
elif word.endswith('able'):
word = self.replaceM1(word, 'able', '')
elif word.endswith('ible'):
word = self.replaceM1(word, 'ible', '')
elif word.endswith('ant'):
word = self.replaceM1(word, 'ant', '')
elif word.endswith('ement'):
word = self.replaceM1(word, 'ement', '')
elif word.endswith('ment'):
word = self.replaceM1(word, 'ment', '')
elif word.endswith('ent'):
word = self.replaceM1(word, 'ent', '')
elif word.endswith('ou'):
word = self.replaceM1(word, 'ou', '')
elif word.endswith('ism'):
word = self.replaceM1(word, 'ism', '')
elif word.endswith('ate'):
word = self.replaceM1(word, 'ate', '')
elif word.endswith('iti'):
word = self.replaceM1(word, 'iti', '')
elif word.endswith('ous'):
word = self.replaceM1(word, 'ous', '')
elif word.endswith('ive'):
word = self.replaceM1(word, 'ive', '')
elif word.endswith('ize'):
word = self.replaceM1(word, 'ize', '')
elif word.endswith('ion'):
result = word.rfind('ion')
base = word[:result]
if self.getM(base) > 1
and (self.endsWith(base, 's') or self.endsWith(base, 't')):
word = base
word = self.replaceM1(word, '', '')
return word
def step5a(self, word):
#this function checks if the word ends with 'e'. If it does, it checks the value of
#M for the base word. If M>1, OR, If M = 1 and cvc(base) is false, it simply removes 'e'
#ending.
"""step5() removes a final -e if m() > 1.
"""
if word.endswith('e'):
base = word[:-1]
if self.getM(base) > 1:
word = base
elif self.getM(base) == 1 and not self.cvc(base):
word = base
return word
def step5b(self, word):
#this function checks if the value of M for the word is greater than 1 and it ends with 2 consonants
# and it ends with 'l', it removes 'l'
#step5b changes -ll to -l if m() > 1
if self.getM(word) > 1 and self.doubleCons(word)
and self.endsWith(word, 'l'):
word = word[:-1]
return word
def stem(self, word):
#steps to be followed for stemming according to the porter stemmer :)
word = self.step1a(word)
word = self.step1b(word)
word = self.step1c(word)
word = self.step2(word)
word = self.step3(word)
word = self.step4(word)
word = self.step5a(word)
word = self.step5b(word)
return word
NLTK library
A package called PorterStemmer is available in the NLTK library. It makes life so much more easier for us :p. Let's see how to use it.
Python code for using NLTK package
#First, we're going to grab and define our stemmer:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
#Now, let's choose some words with a similar stem, like:
example_words =["python","pythoner","pythoning","pythoned","pythonly"]
#Next, we can easily stem by doing something like:
for w in example_words:
print(ps.stem(w))
Output
python
python
python
python
pythonli
Python code
#Now let's try stemming a typical sentence, rather than some words:
new_text = "It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
words = word_tokenize(new_text)
for w in words:
print(ps.stem(w))
Output
It
is
import
to
by
veri
pythonli
while
you
are
python
with
python
.
All
python
have
python
poorli
at
least
onc
.
Conclusion
Porterβs algorithm is important for two reasons. First, it provides a simple approach to conflation that seems to work well in practice and that is applicable to a range of languages.
Second, it has spurred interest in stemming as a topic for research in its own right, rather than merely as a low-level component of an information retrieval system. The algorithm was first published in 1980; however, it and its descendants
continue to be employed in a range of applications that stretch far beyond its original intended use.