Text Preprocessing Techniques


Reading time: 20 minutes

In this post, we will look at a variety of text preprocessing techniques frequently used in Natural Language Processing (NLP) tasks, along with their implementations in Python.


Text Preprocessing

Text preprocessing refers to the process of converting raw human-language text into a clean, machine-interpretable form that can be used for further processing in a predictive modeling task.

It is crucial to pre-process the data so that the machine can identify meaningful patterns in it, which in turn helps it make better predictions. The garbage in, garbage out (GIGO) principle always applies to a predictive model, which is why text must be properly pre-processed to achieve a highly accurate model.
The following sections explain and implement a set of common techniques.

We will cover the following text preprocessing techniques:

  • Lowercasing
  • Removing Punctuations
  • Removing Stopwords
  • Stemming
  • Lemmatization
  • Removing Emojis
  • Removing URLs

1. Lowercasing

Lowercasing text is very easy: simply use Python's built-in lower() string method.

def lowercase_text(text):
    return text.lower()

text = 'My name is Akshat Maheshwari.\nI am pursuing my post-graduation from ABV-IIITM Gwalior.'

print(lowercase_text(text))

OUTPUT-
my name is akshat maheshwari.
i am pursuing my post-graduation from abv-iiitm gwalior.
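For most English text lower() is all you need, but Python also offers casefold(), a more aggressive Unicode case normalisation that matters for some other languages. A quick sketch:

```python
# lower() suffices for English text; casefold() applies fuller Unicode
# case folding, which matters for some languages (German 'ß' -> 'ss').
print('Straße'.casefold())  # strasse
print('Straße'.lower())     # straße
```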

2. Removing Punctuations

All punctuation characters (from string.punctuation) are stored in PUNCT_TO_REMOVE.
Using the translate method, every punctuation character is deleted from the text.
The final split/join step collapses any leftover extra whitespace.

import string

PUNCT_TO_REMOVE = string.punctuation

def remove_punctuation(text):
    return ' '.join(text.translate(str.maketrans('', '', PUNCT_TO_REMOVE)).split())

text = "I'm having a lot of punctuations!!. All special characters will be removed ;) :) Is it so ? ## Yes :( I will. "
print(remove_punctuation(text))

OUTPUT-
Im having a lot of punctuations All special characters will be removed Is it so Yes I will
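Notice that deleting punctuation outright glues together tokens that were separated only by a punctuation mark (e.g. "I'm" becomes "Im"). A variant worth considering, sketched below, maps each punctuation character to a space instead and then collapses the whitespace:

```python
import string

# Variant: replace each punctuation character with a space instead of
# deleting it, so tokens separated only by punctuation are not glued
# together (e.g. 'rock/pop' -> 'rock pop', not 'rockpop').
def punctuation_to_space(text):
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return ' '.join(text.translate(table).split())

print(punctuation_to_space('rock/pop,jazz'))  # rock pop jazz
```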

3. Removing Stopwords

A stopword is a commonly used word (such as 'the', 'a', 'an', 'in' ...) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

Here we load the English stopword list using NLTK.
After this, we remove all the stopwords present in the text.

Note- Convert the text to lowercase first, since the NLTK stopword list is all lowercase and the comparison is case-sensitive.

import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')  # run once if the corpus is not yet installed
STOPWORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    return ' '.join([word for word in str(text).split() if word not in STOPWORDS])

text = 'My name is Akshat Maheshwari.\nI am pursuing my post-graduation from ABV-IIITM Gwalior.'

print(remove_stopwords(lowercase_text(text)))

OUTPUT-
name akshat maheshwari. pursuing post-graduation abv-iiitm gwalior.
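Note that stopword matching is done on whole whitespace-separated tokens, so punctuation glued to a word can hide a stopword (as with "maheshwari." above keeping its period). A minimal sketch of why removing punctuation first helps, using a tiny hand-picked stopword set rather than NLTK's so it runs standalone:

```python
import string

# Order matters: the raw token 'is!' will not match the stopword 'is'.
# A tiny hand-picked stopword set is used here for illustration; in
# practice you would use NLTK's full English list.
TOY_STOPWORDS = {'my', 'is', 'it', 'the', 'a', 'i', 'am', 'yes'}

def strip_punct(text):
    return ' '.join(text.translate(str.maketrans('', '', string.punctuation)).split())

def drop_stopwords(text):
    return ' '.join(w for w in text.split() if w not in TOY_STOPWORDS)

text = 'Is it the best? Yes, it is!'
print(drop_stopwords(text.lower()))               # 'best? yes, is!' survive
print(drop_stopwords(strip_punct(text.lower())))  # only 'best' survives
```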

4. Stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.

eg- games, game, gaming all are stemmed to game

Limitation- Stemming may produce a token that is not a meaningful word.
eg- study, studies, studying are all stemmed to studi :(

This issue is resolved with the help of Lemmatization.

Several stemming algorithms are available, such as PorterStemmer, LancasterStemmer, SnowballStemmer and many more.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

text='I love to play games. I like gaming very much. One of my favourite game is Counter Strike.'
print(stem_words(lowercase_text(text)))

OUTPUT-
i love to play games. i like game veri much. one of my favourit game is counter strike.
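Of the alternatives listed above, SnowballStemmer (also known as "Porter2") is a slight refinement of PorterStemmer and supports several languages besides English. A quick sketch, assuming NLTK is installed:

```python
from nltk.stem import SnowballStemmer

# SnowballStemmer ('Porter2') is a refined, multi-language successor
# to PorterStemmer; it still over-stems ('studies' -> 'studi').
snowball = SnowballStemmer('english')

for word in ['games', 'gaming', 'studies', 'studying']:
    print(word, '->', snowball.stem(word))
```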

5. Lemmatization

Lemmatization in linguistics is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma, or dictionary form.

The obtained word is referred to as the lemma, a meaningful word with a dictionary meaning.

Limitation- Slow computation when compared to Stemming.

from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')  # run once if the corpus is not yet installed
lemmatizer = WordNetLemmatizer()

def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

text='I love to play games. I like gaming very much. One of my favourite game is Counter Strike.'
print(lemmatize_words(lowercase_text(text)))

OUTPUT-
i love to play games. i like gaming very much. one of my favourite game is counter strike.

Here every word keeps a proper dictionary meaning, which was not the case with stemming.

6. Removing Emojis

With the increased use of social media and chat platforms, emoji usage has grown significantly, so emojis also need to be pre-processed. Here we use a regular expression to remove them: emojis are simply Unicode characters that fall in specific code-point ranges.

import re

def remove_emoji(string):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

print(remove_emoji("😆 Removing emojis for text preprocessing 🔥😊🔥😊"))

OUTPUT-
Removing emojis for text preprocessing
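The ranges in the pattern above cover the most common emoji blocks but miss some newer ones. A complementary sketch (the extra ranges are approximate, not an exhaustive emoji list):

```python
import re

# Additional (approximate) emoji blocks not covered by the pattern above.
EXTRA_EMOJI = re.compile(
    '['
    '\U0001F900-\U0001F9FF'  # supplemental symbols & pictographs (e.g. 🤖)
    '\U0001FA70-\U0001FAFF'  # symbols & pictographs extended-A
    ']+', flags=re.UNICODE)

def remove_more_emoji(text):
    return EXTRA_EMOJI.sub('', text)

print(remove_more_emoji('robots 🤖 are here'))
```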

7. Removing URLs

With recent advancements in web scraping technologies, a lot of text data is scraped from different websites. The scraped data often contains hyperlinks, which should be removed before doing any predictive analysis.

URLs can be removed using regular expressions.

def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

print(remove_urls('URL of whatsapp https://web.whatsapp.com/'))
print(remove_urls('My LinkedIn profile https://www.linkedin.com/in/akshat-maheshwari/ Lets connect'))

OUTPUT-
URL of whatsapp
My LinkedIn profile Lets connect

Note- The pre-processed text is not fed directly to a predictive model. Instead, it is converted into a feature vector, for example a word embedding. Word embeddings can be frequency-based, binary-based, etc.

A sparse matrix is generally used to represent these vectors. Approaches such as CountVectorizer, TF-IDF Vectorizer and many more are used to encode the text data into a vector of numbers.

Conclusion

In this post, we looked at different text pre-processing techniques and their implementation in Python. Doing proper pre-processing is very crucial and can significantly improve the performance of your predictive model.

There exist many more techniques, such as removing HTML tags, converting emojis to words (and vice versa), expanding chat abbreviations and so on. Do try these techniques in your next project or data science contest and you might see a boost in your model's performance. 😉