In this article, we explore lemmatization approaches in NLP in depth and present Python code examples for each.
The topics covered are:
- NLP
- Lemmatization
- Need of Lemmatization
- Approaches to Lemmatization
- WordNet
- WordNet (with POS tag)
- TextBlob
- TextBlob (with POS tag)
- spaCy
- TreeTagger
- Pattern
- Gensim
- Stanford CoreNLP
What is NLP
NLP stands for Natural Language Processing, which is a branch of computer science, artificial intelligence, and linguistics that deals with the interactions between human language and computers. NLP enables computers to understand, analyze, manipulate, and interpret natural language data such as text or speech. Some of the applications of NLP include translation, summarization, speech recognition, sentiment analysis, and topic segmentation.
To perform these tasks effectively, NLP systems need to preprocess the raw text data and normalize it into a standard form. One of the common preprocessing steps in NLP is lemmatization.
What is Lemmatization?
Lemmatization is a text normalization technique that reduces words to their base forms. Unlike stemming, which strips suffixes by rule, lemmatization is context-sensitive and linguistically informed: it uses a dictionary or a corpus to find the lemma, or canonical form, of each word. Lemmatization also takes into account the part-of-speech tag of each word, which can affect its meaning and lemma. For instance, the word "saw" is lemmatized to "see" when it is a verb but stays "saw" when it is a noun.
Original Word ---> Root Word ---> Feature
Meeting ---> Meet ---> (core-word extraction)
Was ---> Be ---> (tense conversion to present tense)
Mice ---> Mouse ---> (plural to singular)
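Note: Dictionary-based lemmatizers often match words case-sensitively, so convert your text to lowercase before lemmatizing unless a later step (such as POS tagging or named entity recognition) depends on capitalization.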
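To make the role of the part-of-speech tag concrete, here is a minimal sketch of a dictionary lookup keyed by word and POS. The table `LEMMA_TABLE` and the function `toy_lemmatize` are hypothetical names invented for this illustration, and the entries are hand-picked; a real lemmatizer relies on a full lexical database.

```python
# Toy lemma dictionary keyed by (lowercased word, POS letter).
# Entries are illustrative only, not taken from any real lexicon.
LEMMA_TABLE = {
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("was", "v"): "be",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(toy_lemmatize("saw", "v"))  # see
print(toy_lemmatize("saw", "n"))  # saw
print(toy_lemmatize("Mice"))      # mouse
print(toy_lemmatize("garden"))    # garden
```

The same surface form maps to different lemmas once the POS changes, which is why the POS-aware variants later in this article tend to give better results.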
Why is Lemmatization needed?
Lemmatization is needed for natural language processing tasks such as text analysis, information retrieval, machine translation, summarization, speech recognition, sentiment analysis, and topic segmentation. By applying lemmatization, we can normalize the text and reduce the vocabulary size, which can improve the performance and accuracy of the models. Lemmatization can also help to handle words with different inflections and derivations that have the same meaning or concept.
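The vocabulary-reduction effect can be sketched with a small, hand-written lemma table. `TOY_LEMMAS` and `lemmatize_tokens` are hypothetical names used only for this example; in practice the mapping would come from one of the libraries described below.

```python
from collections import Counter

# Hypothetical lemma lookup, hand-written for this example only.
TOY_LEMMAS = {
    "cats": "cat",
    "plays": "play",
    "played": "play",
    "playing": "play",
    "was": "be",
}

def lemmatize_tokens(tokens):
    """Replace each token with its lemma when the toy table knows it."""
    return [TOY_LEMMAS.get(token, token) for token in tokens]

tokens = "the cat plays the cats played a cat was playing".split()
lemmas = lemmatize_tokens(tokens)

raw_vocab = set(tokens)       # distinct surface forms
lemma_vocab = set(lemmas)     # distinct lemmas
lemma_counts = Counter(lemmas)

print(len(raw_vocab), len(lemma_vocab))  # 8 5
print(lemma_counts["play"])              # 3
```

Three inflected forms of "play" collapse into a single entry, so a downstream model sees one feature with a count of three instead of three features with a count of one each.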
Various Approaches to Lemmatization:
We will be going over 9 different approaches to perform Lemmatization along with multiple examples and code implementations.
- WordNet
- WordNet (with POS tag)
- TextBlob
- TextBlob (with POS tag)
- spaCy
- TreeTagger
- Pattern
- Gensim
- Stanford CoreNLP
NLTK
The Natural Language Toolkit (NLTK) is a popular open-source library for natural language processing. It provides the WordNetLemmatizer for lemmatization, alongside stemmers such as LancasterStemmer and SnowballStemmer, which truncate words by rule rather than looking up lemmas.
Code Implementation:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # one-time download of the WordNet data
lemmatizer = WordNetLemmatizer()
text = "My cats were playing in the garden"
lemmatized_text = [lemmatizer.lemmatize(word) for word in text.split()]
print(lemmatized_text)
# Output: ['My', 'cat', 'were', 'playing', 'in', 'the', 'garden']
Pros:
Easy to use
Bundles a lemmatizer together with several stemmers
Cons:
Limited accuracy
Requires installation of NLTK library
WordNet:
WordNet is a lexical database that organizes words into synonym sets, called synsets, and provides relationships between these sets. It is widely used for lemmatization as it provides an extensive list of words and their root forms. WordNet can be used with or without POS tags. If no POS tag is given, NLTK's WordNetLemmatizer does not infer one from context; it treats every word as a noun, so a verb form such as "playing" is returned unchanged.
Code Implementation:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()
# Without POS tag
print(lemmatizer.lemmatize('dogs'))
# Output: dog
# With POS tag
print(lemmatizer.lemmatize('playing', wordnet.VERB))
# Output: play
Pros:
WordNet provides a vast database of words and their root forms, making it easy to use.
It can handle different POS tags for words.
Cons:
WordNet may not work accurately for words not present in its database.
It may not be able to handle complex word forms and variations.
WordNet (with POS tag):
Using WordNet with POS tags provides a more accurate lemmatization result as it considers the word's part of speech while finding its root form.
Code Implementation:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()
# With POS tag
print(lemmatizer.lemmatize('playing', wordnet.VERB))
# Output: play
Pros:
Using POS tags provides more accurate lemmatization results.
It is still easy to use with WordNet.
Cons:
WordNet may not handle all word forms and variations accurately.
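In practice the POS tag usually comes from a tagger that emits Penn Treebank tags (for example `nltk.pos_tag`), while the lemmatizer expects WordNet's single-letter tags. A minimal sketch of that mapping follows; `penn_to_wordnet` is a hypothetical helper name, the tagged tokens are hand-written, and falling back to noun is an assumption that mirrors the lemmatizer's own default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter.

    The letters match wordnet.ADJ ('a'), wordnet.VERB ('v'),
    wordnet.ADV ('r') and wordnet.NOUN ('n') in NLTK.
    """
    if tag.startswith("JJ"):
        return "a"
    if tag.startswith("VB"):
        return "v"
    if tag.startswith("RB"):
        return "r"
    return "n"  # assumption: treat everything else as a noun

# Hand-written (word, Penn tag) pairs standing in for tagger output.
tagged = [("cats", "NNS"), ("were", "VBD"), ("playing", "VBG"), ("happily", "RB")]
wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
# [('cats', 'n'), ('were', 'v'), ('playing', 'v'), ('happily', 'r')]
```

With NLTK installed, each pair can then be passed on as `lemmatizer.lemmatize(word, penn_to_wordnet(tag))`.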
TextBlob:
TextBlob is a Python library that provides a simple API for performing common NLP tasks, including lemmatization. It uses the WordNetLemmatizer internally to perform lemmatization.
Code Implementation:
from textblob import Word
# Without POS tag
word = Word('playing')
print(word.lemmatize())
# Output: playing
# With POS tag
word = Word('playing')
print(word.lemmatize('v'))
# Output: play
Pros:
TextBlob provides a simple and easy-to-use API for lemmatization.
It can handle different POS tags for words.
Cons:
TextBlob may not be as accurate as other approaches for complex word forms and variations.
TextBlob (with POS tag):
Using TextBlob with POS tags provides more accurate lemmatization results as it considers the word's part of speech while finding its root form.
Code Implementation:
from textblob import Word
# With POS tag
word = Word('playing')
print(word.lemmatize('v'))
# Output: play
Pros:
Using POS tags provides more accurate lemmatization results.
TextBlob still provides a simple and easy-to-use API.
Cons:
TextBlob may not handle all word forms and variations accurately.
spaCy
spaCy is a popular Python library for NLP tasks that offers built-in lemmatization functionality. It uses an advanced rule-based approach that considers the part-of-speech (POS) tags of words to determine their base form.
Code Implementation:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I am running in a race")
lemmas = [token.lemma_ for token in doc]
print(lemmas)
# Output (spaCy 3.x): ['I', 'be', 'run', 'in', 'a', 'race']
# spaCy 2.x printed the placeholder '-PRON-' instead of 'I'
Pros:
Fast and efficient
Supports various languages
Includes additional features such as named entity recognition and dependency parsing
Cons:
May not always provide accurate lemmas for uncommon words or technical terms
TreeTagger
TreeTagger is a part-of-speech tagger and lemmatizer developed by Helmut Schmid. It is a standalone program that supports over 25 languages and offers high accuracy for lemmatization.
Code Implementation:
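# Requires the TreeTagger program plus the `treetagger` Python wrapper
from treetagger import TreeTagger
tagger = TreeTagger(language='english')
tags = tagger.tag("I am running in a race")
# Assumption: the wrapper returns one [word, POS, lemma] triple per token
lemmas = [tag[2] for tag in tags]
print(lemmas)
# The lemma column holds base forms such as 'be' and 'run';
# '-PRON-' is a spaCy 2.x placeholder that TreeTagger does not emit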
Pros:
High accuracy for lemmatization
Supports various languages
Can be used as a standalone program or as a Python package
Cons:
Requires installation and setup of the TreeTagger program
Not as fast as some other Python libraries
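When TreeTagger is run as a standalone program with token and lemma output enabled, it prints one token per line as tab-separated `token`, `POS`, `lemma` columns, and reports `<unknown>` as the lemma of words it does not know. A minimal sketch of parsing that output is shown below; `parse_treetagger_lines` is a hypothetical helper, the sample text is hand-written rather than captured from a real run, and falling back to the token for unknown lemmas is a design assumption.

```python
def parse_treetagger_lines(raw_output):
    """Parse 'token<TAB>POS<TAB>lemma' lines into (token, pos, lemma) tuples."""
    rows = []
    for line in raw_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        token, pos, lemma = parts
        if lemma == "<unknown>":
            lemma = token  # assumption: keep the surface form when no lemma is known
        rows.append((token, pos, lemma))
    return rows

# Illustrative sample in TreeTagger's three-column layout.
SAMPLE_OUTPUT = "am\tVBP\tbe\nrunning\tVVG\trun\nrace\tNN\trace\nblorf\tNN\t<unknown>\n"

rows = parse_treetagger_lines(SAMPLE_OUTPUT)
lemmas = [lemma for _, _, lemma in rows]
print(lemmas)  # ['be', 'run', 'race', 'blorf']
```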
Pattern
Pattern is a Python library for web mining, natural language processing, and machine learning that includes lemmatization functionality. Its lemma() helper takes a single word and returns the base form of a verb without looking at context; POS-aware lemmas are available through parse() with lemmata=True.
Code Implementation:
from pattern.en import lemma
lemmas = [lemma(word) for word in "I am running in a race".split()]
print(lemmas)
# lemma() treats each word as a verb, e.g. 'am' -> 'be' and 'running' -> 'run';
# it does not emit spaCy's '-PRON-' placeholder
Pros:
Easy to use
Includes additional features such as sentiment analysis and part-of-speech tagging
Cons:
Limited support for non-English languages
May not always provide accurate lemmas for technical terms or domain-specific words.
Gensim
Gensim is an open-source library for natural language processing and topic modeling. Gensim 3.x offered a lemmatization helper, gensim.utils.lemmatize, which relies on the Pattern library rather than WordNet and by default keeps only nouns, verbs, adjectives, and adverbs. The helper was removed in Gensim 4.0, so the code below requires an older Gensim release with Pattern installed.
Code Implementation:
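# Requires Gensim 3.x with the Pattern library installed
from gensim.utils import lemmatize
text = "My cats were playing in the garden"
# Each token is a byte string of the form b'lemma/POS'
lemmatized_text = [token.decode('utf-8').split('/')[0] for token in lemmatize(text)]
print(lemmatized_text)
# Expected output: ['cat', 'be', 'play', 'garden']
# (function words such as 'my', 'in', and 'the' are dropped)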
Pros:
Efficient and easy to use
High accuracy
Cons:
Limited language support
Requires installation of Gensim library
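The `lemma/POS` byte strings returned by Gensim 3.x's `lemmatize` can be unpacked without Gensim itself. The sketch below shows one way to do it; `split_lemma_tag` is a hypothetical helper and the sample tokens are hand-written in that format rather than produced by a real call.

```python
def split_lemma_tag(token):
    """Split a 'lemma/POS' token (bytes or str) into a (lemma, tag) pair."""
    text = token.decode("utf-8") if isinstance(token, bytes) else token
    lemma, sep, tag = text.rpartition("/")
    if not sep:
        return text, ""  # no tag present, return the text unchanged
    return lemma, tag

# Illustrative tokens in the b'lemma/POS' format.
sample_tokens = [b"cat/NN", b"be/VB", b"play/VB", b"garden/NN"]
pairs = [split_lemma_tag(token) for token in sample_tokens]
print(pairs)
# [('cat', 'NN'), ('be', 'VB'), ('play', 'VB'), ('garden', 'NN')]

# Keep only the noun lemmas.
nouns = [lemma for lemma, tag in pairs if tag.startswith("NN")]
print(nouns)  # ['cat', 'garden']
```

Keeping the tag instead of discarding it makes it possible to filter by word class, as the last two lines show.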
Stanford CoreNLP
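Stanford CoreNLP is a Java suite of natural language processing tools that includes lemmatization. It provides high accuracy in lemmatization and supports several languages, including English, Chinese, Arabic, and Spanish. The Python example below uses stanfordnlp, the Stanford NLP Group's neural pipeline package (since superseded by Stanza), rather than the CoreNLP Java toolkit itself.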
Code Implementation:
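import stanfordnlp
# stanfordnlp.download('en')  # one-time download of the English models
# 'pos' is added here on the assumption that the lemmatizer benefits from POS tags
nlp = stanfordnlp.Pipeline(lang='en', processors='tokenize,pos,lemma')
doc = nlp("My cats were playing in the garden")
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.lemma)
# Output (one lemma per line): my cat be play in the garden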
Pros:
High accuracy
Supports multiple languages
Cons:
Requires installation of the stanfordnlp package and its language models
Large memory requirement
I am an aspiring data scientist and a passionate writer. I enjoy working with data using various technologies and sharing my insights on programming topics. If you want to see more of my work or get in touch with me, feel free to visit my GitHub profile.