Lexicon-based Sentiment Analysis



Reading time: 30 minutes | Coding time: 15 minutes

Introduction

Lexicon-based Sentiment Analysis techniques, as opposed to Machine Learning techniques, compute the sentiment of a document from polarity scores assigned to the positive and negative words it contains.

They can be broadly classified into:

  • Dictionary-based

  • Corpus-based

  • Dictionary-based methods create a database of positive and negative words from an initial set of words by including synonyms and antonyms.

  • Corpus-based methods, on the other hand, obtain the dictionary from the initial set by using statistical techniques.
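
The dictionary-based idea can be illustrated with a minimal sketch. The tiny `SYNONYMS` and `ANTONYMS` tables and the `expand_lexicon` helper below are hypothetical and exist only for illustration; a real system would draw these relations from a resource such as WordNet.

```python
# Toy thesaurus: hypothetical entries chosen for illustration only.
SYNONYMS = {
    "good": {"fine", "great"},
    "great": {"excellent"},
    "bad": {"poor", "awful"},
}
ANTONYMS = {
    "good": {"bad"},
    "great": {"awful"},
}

def expand_lexicon(positive_seeds, negative_seeds, rounds=2):
    """Grow positive/negative word sets from seed words.

    Synonyms inherit the polarity of the word they come from;
    antonyms receive the opposite polarity.
    """
    positive, negative = set(positive_seeds), set(negative_seeds)
    for _ in range(rounds):
        new_pos, new_neg = set(positive), set(negative)
        for word in positive:
            new_pos |= SYNONYMS.get(word, set())
            new_neg |= ANTONYMS.get(word, set())
        for word in negative:
            new_neg |= SYNONYMS.get(word, set())
            new_pos |= ANTONYMS.get(word, set())
        positive, negative = new_pos, new_neg
    return positive, negative

positive_words, negative_words = expand_lexicon({"good"}, {"bad"})
print(sorted(positive_words))  # ['excellent', 'fine', 'good', 'great']
print(sorted(negative_words))  # ['awful', 'bad', 'poor']
```

Each round pulls in words one step further from the seeds, so the number of rounds controls how far the lexicon grows.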

In this article, we will discuss sentiment analysis using WordNet Polarity Detection.

WordNet

  • WordNet is a lexical database of English words, grouped into sets of synonyms, each of which is known as a synset.
  • It is a freely available tool, which can be downloaded from its official website.
  • While WordNet can be loosely termed a thesaurus, it is said to be more semantically accurate, since it groups synonyms according to the specific contexts (senses) in which they are used.
  • Synsets are linked together by the IS-A relationship (more commonly, generalisation). For example, a car is a type of vehicle, just as a truck is.
  • Several approaches make use of this database for lexicon-based Sentiment Analysis, and we will be discussing one of them, SentiWordNet.
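
The IS-A links described above can be pictured with a small sketch. The `IS_A` table here is a hypothetical, hand-written miniature hierarchy rather than data read from WordNet, and the helper names are illustrative.

```python
# Hypothetical miniature IS-A hierarchy (child -> parent); not read from WordNet itself.
IS_A = {
    "car": "motor vehicle",
    "truck": "motor vehicle",
    "motor vehicle": "vehicle",
}

def ancestors(concept):
    """Return the chain of generalisations above a concept, nearest first."""
    chain = []
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain

def is_a(concept, category):
    """True if `category` is reachable from `concept` via IS-A links."""
    return category in ancestors(concept)

print(ancestors("car"))          # ['motor vehicle', 'vehicle']
print(is_a("truck", "vehicle"))  # True
```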

SentiWordNet

  • SentiWordNet operates on the database provided by WordNet.
  • The additional functionality that it provides is the measure of positivity, negativity or neutrality as is required for Sentiment Analysis.
Thus, every synset s is associated with three scores:
    Pos(s): a positivity score
    Neg(s): a negativity score
    Obj(s): an objectivity (neutrality) score
    
    Pos(s) + Neg(s) + Obj(s) = 1
  • The scores are assigned per synset, so they pertain to a word in a particular sense (its context) rather than to the word in general.
  • All three scores range within the values [0,1].
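
The constraint above means the objectivity score is fully determined once positivity and negativity are known. A minimal sketch, using a hypothetical `SynsetScores` container (not part of NLTK) and the values this article reports later for the adjective 'sad':

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SynsetScores:
    """Hypothetical container for the scores of one synset."""
    pos: float
    neg: float

    def __post_init__(self):
        # Each score must lie in [0, 1] and together they cannot exceed 1
        if not (0.0 <= self.pos <= 1.0 and 0.0 <= self.neg <= 1.0):
            raise ValueError("scores must lie in [0, 1]")
        if self.pos + self.neg > 1.0:
            raise ValueError("Pos(s) + Neg(s) cannot exceed 1")

    @property
    def obj(self):
        # Obj(s) is whatever remains after positivity and negativity
        return 1.0 - (self.pos + self.neg)

scores = SynsetScores(pos=0.125, neg=0.75)
print(scores.obj)  # 0.125
```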

The Algorithm

Step 1
Data preprocessing must be performed on the dataset, including removal of stopwords and punctuation marks.
The sentences can be stored in Python dictionaries to make them easier to manipulate.
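
A minimal sketch of this step follows. The `STOPWORDS` set is a small hypothetical list and `preprocess` is an illustrative helper; a real pipeline would likely use a fuller stopword list.

```python
import string

# Small hypothetical stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"i", "the", "a", "an", "is", "was", "and", "of", "to"}

def preprocess(sentence):
    """Lowercase, strip punctuation and drop stopwords."""
    no_punct = sentence.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.lower().split() if word not in STOPWORDS]

reviews = ["I disliked the movie.", "The acting was great!"]
# Store each sentence in a dictionary keyed by a running id
processed = {i: preprocess(text) for i, text in enumerate(reviews)}
print(processed)  # {0: ['disliked', 'movie'], 1: ['acting', 'great']}
```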

Step 2
While using SentiWordNet, it is important to find out the Parts of Speech for each word present in the dictionaries.
Parts of Speech include -

  1. Noun (n)
  2. Verb (v)
  3. Adjective (a)
  4. Adverb (r)
  5. Preposition
  6. Conjunction
  7. Pronoun
  8. Interjection

WordNet covers only the first four (nouns, verbs, adjectives and adverbs). The first three are the most commonly used when analysing the sentiment of a sentence.

Step 3
The polarity of each word, given its POS tag, is obtained using the SentiWordNet functions pos_score(), neg_score() and obj_score().

SentiWordNet is built on WordNet. After having explored multiple synsets on the basis of synonymy or antonymy, with a known value for the polarity for a set of seed or starting words, classifiers are built to obtain the polarity of all the related words/synsets. Thus, it can be said that SentiWordNet determines the polarity of words in a synset using a semi-supervised approach.

Example

Let us consider the sentence -
I disliked the movie

The overall sentiment of the above sentence is negative. The same can be demonstrated using the SentiWordNet functions described below.

  • The negativity score for the word dislike (the verb form) is 0.5.
  • The remaining tokens in the sentence, such as 'I' and 'the', will be filtered out during preprocessing.
  • Meanwhile, the positivity and negativity score of movie is zero, thus making its objectivity score 1.0.

Thus, the overall sentiment of the sentence will be negative, since only positive and negative terms are used to calculate the sentiment.
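
The arithmetic behind this conclusion can be sketched as follows. The negativity of 0.5 for 'dislike' and the zero scores for 'movie' are the values quoted above; the positivity of 0.0 for 'dislike' is an assumption made for this illustration, and `sentence_sentiment` is a hypothetical helper.

```python
# (positivity, negativity) per remaining token.
# The 0.0 positivity for 'dislike' is an assumption made for this sketch.
token_scores = {
    "dislike": (0.0, 0.5),
    "movie": (0.0, 0.0),
}

def sentence_sentiment(scores):
    """Sum positivity minus negativity over all tokens and label the result."""
    total = sum(pos - neg for pos, neg in scores.values())
    if total > 0:
        label = "positive"
    elif total < 0:
        label = "negative"
    else:
        label = "neutral"
    return total, label

print(sentence_sentiment(token_scores))  # (-0.5, 'negative')
```

The objective token 'movie' contributes nothing to the total, so the verb alone decides the sign.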

Code Demonstration

  1. Download the necessary resources after importing nltk
import nltk
nltk.download('sentiwordnet')  # SentiWordNet polarity scores
nltk.download('wordnet')       # WordNet synsets that SentiWordNet builds on
from nltk.corpus import sentiwordnet as swn
  1. The following code can be used to list all the synsets that contain a given word
list(swn.senti_synsets('slow'))

The output is as follows:

[SentiSynset('decelerate.v.01'),
 SentiSynset('slow.v.02'),
 SentiSynset('slow.v.03'),
 SentiSynset('slow.a.01'),
 SentiSynset('slow.a.02'),
 SentiSynset('dense.s.04'),
 SentiSynset('slow.a.04'),
 SentiSynset('boring.s.01'),
 SentiSynset('dull.s.08'),
 SentiSynset('slowly.r.01'),
 SentiSynset('behind.r.03')]
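  1. For a sentence, we first find the POS tag of each token using the following code -
from nltk import word_tokenize, pos_tag
# word_tokenize and pos_tag need NLTK's tokenizer and tagger models to be downloaded first
sentence = 'I disliked the movie'
tokens = word_tokenize(sentence)
after_tagging = pos_tag(tokens)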
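  1. The above code returns the Penn Treebank tags associated with the tokens. They need to be converted to simple WordNet tags, which can be done using the following code
from nltk.corpus import wordnet as wn

def penn_to_wn(tag):  # helper name chosen for illustration
    # Map a Penn Treebank tag to the corresponding WordNet POS constant
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None  # the tag has no WordNet equivalent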
  1. The following code can be used to check the polarity of words

About senti_synsets() function

  • It uses WordNet to look up the synsets that match the word (and, optionally, the part of speech) passed as parameters
  • It returns a Python filter object, which is an iterable, so we convert it into a list in the code below
words = swn.senti_synsets('sad', 'a')  # 'a' specifies that 'sad' is looked up as an adjective
word1 = list(words)[0]  # index 0 selects the first synset returned for the word
print(word1.pos_score())
print(word1.neg_score())
print(word1.obj_score())

The following scores are obtained for the word 'sad':

  1. Positivity Score - 0.125
  2. Negativity Score - 0.75
  3. Objectivity Score - 0.125

Thus, the word can be categorized as a negative word.

  1. Step 3 of the algorithm (the polarity lookup) is repeated to obtain the polarity scores for each token in the sentence.
  2. The sentiment of the sentence as a whole is determined by taking the difference between the total positive and total negative scores of the tokens for which SentiWordNet provides a polarity.

For the example we were considering, let us calculate the SentiWordNet polarity scores for each of the tokens
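
A minimal sketch of that per-token lookup is given below. It assumes nltk is installed with the 'wordnet' and 'sentiwordnet' corpora downloaded, and that 'disliked' has already been reduced to its base form 'dislike'. The helpers `swn_lookup` and `score_tokens` are hypothetical names, and taking the first sense of each word is a simplification rather than the only possible choice.

```python
def swn_lookup(lemma, wn_pos):
    """Return (pos, neg, obj) for the first SentiWordNet sense, or None."""
    # Imported lazily so the rest of the sketch loads without the corpora installed
    from nltk.corpus import sentiwordnet as swn
    senses = list(swn.senti_synsets(lemma, wn_pos))
    if not senses:
        return None
    first = senses[0]
    return first.pos_score(), first.neg_score(), first.obj_score()

def score_tokens(tagged_lemmas, lookup=swn_lookup):
    """Map each (lemma, WordNet POS) pair to its polarity scores."""
    results = {}
    for lemma, wn_pos in tagged_lemmas:
        scores = lookup(lemma, wn_pos)
        if scores is not None:  # skip words SentiWordNet does not know
            results[lemma] = scores
    return results

# Tokens left after preprocessing, paired with their WordNet POS tags
example = [("dislike", "v"), ("movie", "n")]
```

Calling `score_tokens(example)` then returns a dictionary of (positivity, negativity, objectivity) tuples, one per token. The `lookup` parameter is only there so that a different scoring source can be swapped in.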
