Sentiment Analysis using LSTM



Sentiment analysis is one of the most widely used applications of machine learning. Netflix and YouTube use it to suggest videos, Google Search to suggest positive search results in response to a negative term, and Uber Eats to suggest dishes based on your recent activity. In this article, we will build a sentiment analyser from scratch in Python, using the Keras framework and an LSTM network.

What is Sentiment?

Sentiment is a view or opinion that is held or expressed about something, e.g. food, a restaurant or a movie.

How can sentiment be used in AI?

Nowadays, the sentiment expressed in reviews can be used to recommend movies, restaurants, etc.

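Why should we use LSTM in sentiment analysis?

As explained in my earlier article on LSTM, an LSTM is used where information (or, in technical terms, the gradient) has to be preserved over many time steps for future reference. In a review, a word near the beginning can decide the sentiment of the whole text, so the model has to be able to remember it until the end.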
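To make the idea of "preserving" information concrete, here is a minimal sketch of a single LSTM step with one hidden unit, written in plain Python. The weight names and values are hypothetical and chosen only for illustration (they are not what Keras would learn): the forget gate is biased open and the input gate biased shut, so the cell state survives several inputs almost unchanged.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a single-unit LSTM. `w` maps gate name -> (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, b = w[name]
        return squash(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much of the cell state to expose

    c = f * c_prev + i * g             # additive update of the cell state
    h = o * math.tanh(c)               # hidden state passed to the next step
    return h, c

# Hypothetical weights: forget gate wide open, input gate almost closed
weights = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 10.0),
}

h, c = 0.0, 0.8
for x in [0.5, -1.0, 2.0, 0.3]:
    h, c = lstm_step(x, h, c, weights)
print(round(c, 3))  # stays close to the initial value 0.8
```

Because the cell state is updated by addition rather than repeated multiplication through a squashing function, a value written early in the sequence can still be available at the last step.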

IMDB Datasets

Source : Download from here

A dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative).

Data details :
title.akas.tsv.gz - Contains the following information for titles:

titleId (string) - a tconst, an alphanumeric unique identifier of the title
ordering (integer) – a number to uniquely identify rows for a given titleId
title (string) – the localized title
region (string) - the region for this version of the title
language (string) - the language of the title
types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
attributes (array) - Additional terms to describe this alternative title, not enumerated
isOriginalTitle (boolean) – 0: not original title; 1: original title

title.basics.tsv.gz - Contains the following information for titles:

tconst (string) - alphanumeric unique identifier of the title
titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
originalTitle (string) - original title, in the original language
isAdult (boolean) - 0: non-adult title; 1: adult title
startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
runtimeMinutes – primary runtime of the title, in minutes
genres (string array) – includes up to three genres associated with the title

title.crew.tsv.gz – Contains the director and writer information for all the titles in IMDb. Fields include:

tconst (string) - alphanumeric unique identifier of the title
directors (array of nconsts) - director(s) of the given title
writers (array of nconsts) – writer(s) of the given title

title.episode.tsv.gz – Contains the tv episode information. Fields include:

tconst (string) - alphanumeric identifier of episode
parentTconst (string) - alphanumeric identifier of the parent TV Series
seasonNumber (integer) – season number the episode belongs to
episodeNumber (integer) – episode number of the tconst in the TV series

title.principals.tsv.gz – Contains the principal cast/crew for titles:

tconst (string) - alphanumeric unique identifier of the title
ordering (integer) – a number to uniquely identify rows for a given titleId
nconst (string) - alphanumeric unique identifier of the name/person
category (string) - the category of job that person was in
job (string) - the specific job title if applicable, else '\N'
characters (string) - the name of the character played if applicable, else '\N'

title.ratings.tsv.gz – Contains the IMDb rating and votes information for titles:

tconst (string) - alphanumeric unique identifier of the title
averageRating – weighted average of all the individual user ratings
numVotes - number of votes the title has received

name.basics.tsv.gz – Contains the following information for names:

nconst (string) - alphanumeric unique identifier of the name/person
primaryName (string)– name by which the person is most often credited
birthYear – in YYYY format
deathYear – in YYYY format if applicable, else '\N'
primaryProfession (array of strings)– the top-3 professions of the person
knownForTitles (array of tconsts) – titles the person is known for

Analysis using KERAS framework

Keras supports some preprocessed datasets:

CIFAR10 small image classification
CIFAR100 small image classification
IMDB Movie reviews sentiment classification
Reuters newswire topics classification
MNIST database of handwritten digits
Fashion-MNIST database of fashion articles

API calls for getting the data:


        from keras.datasets import imdb
        (x_train,y_train),(x_test,y_test) = imdb.load_data()

This downloads the preprocessed IMDB dataset. Each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
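The frequency-rank encoding and the filtering it enables can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Keras implementation; the helper names `build_rank_index` and `filter_by_rank` and the toy texts are assumptions made for this example, and using 2 as the out-of-vocabulary id follows the special tokens used later in this article.

```python
from collections import Counter

def build_rank_index(texts):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = counts.most_common()  # sorted by descending count
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def filter_by_rank(sequence, num_words, skip_top=0, oov_id=2):
    """Replace ids outside [skip_top, num_words) with the out-of-vocabulary id."""
    return [w if skip_top <= w < num_words else oov_id for w in sequence]

texts = ["the movie was good", "the movie was bad", "the plot"]
index = build_rank_index(texts)
print(index["the"])  # 1, since "the" occurs most often

# "Top 10,000 words, but not the 20 most common ones"
review = [1, 14, 22, 16, 43, 530, 12045, 9, 25000]
print(filter_by_rank(review, num_words=10000, skip_top=20))
# [2, 2, 22, 2, 43, 530, 2, 2, 2]
```

Since an id is also a rank, both filters reduce to a simple range check on the integer, which is why the encoding makes them cheap.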

Building and Training the model:

First we have to import all the dependencies


        import keras
        from keras import backend as k
        from keras.datasets import imdb
        from keras.layers import LSTM, Embedding, Activation, Dense
        from keras.models import Sequential

Set the vocabulary size (the number of most frequent words to keep) and download the dataset


       top_words = 5000
       (X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = top_words)

Data preprocessing
Reviews are not all the same length: one may be a single word, another 20 words and another 300, depending on the viewer. So we have to pad the inputs to make them all of equal length.


          from keras.preprocessing import sequence 
          max_review_length = 500 
          X_train = sequence.pad_sequences(X_train, maxlen=max_review_length) 
          X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

The padded output looks like this:


                array([[   0,    0,    0, ...,   14,    6,  717],
                 [   0,    0,    0, ...,  125,    4, 3077],
                 [  33,    6,   58, ...,    9,   57,  975],
                 ...,
                 [   0,    0,    0, ...,   21,  846,    2],
                 [   0,    0,    0, ..., 2302,    7,  470],
                 [   0,    0,    0, ...,   34, 2005, 2643]])
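The zeros at the start of most rows above come from padding on the left. As a rough sketch of what pad_sequences does with its default settings (pad at the front, and truncate from the front when a review is too long), here is a plain-Python version for a single sequence. The helper name `pad_pre` is made up for this example, and it assumes maxlen is greater than zero.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad `seq` with `value`, or keep only its last `maxlen` items."""
    seq = list(seq)[-maxlen:]                    # drop items from the front if too long
    return [value] * (maxlen - len(seq)) + seq   # fill the front if too short

print(pad_pre([14, 6, 717], 6))              # [0, 0, 0, 14, 6, 717]
print(pad_pre([1, 2, 3, 4, 5, 6, 7, 8], 6))  # [3, 4, 5, 6, 7, 8]
```

Padding at the front keeps the real words next to the end of the sequence, which is the part the LSTM sees last before producing its output.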

To turn the encoded output above back into natural language that a human can read, we have to build a dictionary that maps ids back to words


          word_to_id = keras.datasets.imdb.get_word_index()
          word_to_id = {k:(v+1) for k,v in word_to_id.items()}
          # Reserve ids for the special tokens that appear in the decoded output
          word_to_id["<PAD>"] = 0
          word_to_id["<START>"] = 1
          word_to_id["<UNK>"] = 2
          id_to_word = {value:key for key,value in word_to_id.items()}
          print(' '.join(id_to_word[i] for i in X_train[0]))

output:

<PAD> <PAD> <PAD> ... <PAD> <START> was not for it's self joke professional disappointment see already pretending their staged a every so found of his movies it's third plot good episodes <UNK> in who guess wasn't of doesn't a again plot find <UNK> poor let her a again vegas trouble with fight like that oh a big good for to watching 
essentially but was not a fat centers turn a not well how this for it's self like bad as that natural a not with starts with this for david movie <UNK> of only moments this br special br films of a sell <UNK> for guess their childish an a man this for like musical of his ever more so while there his feelings an to not this role be get when of was others for people <UNK> br a character love <UNK> as found a <UNK> is turner of upon so well it's self fine have early seeing if is a <UNK> social that watch him a sex as plays could by suffering time have through to long <UNK> movie a music not on scene fine have guess of i'm all <UNK> movie more so be whole its his watch a music see for like blue him this for everything of for sits never characters by as for <UNK> but down by
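The decoded review above does not read like natural English, which is what one would expect if the dictionary is shifted relative to the encoded data. The Keras documentation for imdb.load_data lists index_from=3 as the default offset for real words (ids 0, 1 and 2 being reserved for padding, start and unknown), so an offset of 3 rather than 1 is likely what lines the dictionary up with the data. The sketch below shows the idea on a toy word index; the five-word vocabulary and the `decode` helper are assumptions standing in for the real imdb.get_word_index() result.

```python
# Toy stand-in for imdb.get_word_index(): rank 1 is the most frequent word
word_index = {"the": 1, "and": 2, "a": 3, "movie": 4, "great": 5}

INDEX_FROM = 3  # assumed default of index_from in imdb.load_data
word_to_id = {w: i + INDEX_FROM for w, i in word_index.items()}
word_to_id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2})
id_to_word = {i: w for w, i in word_to_id.items()}

def decode(ids):
    """Turn a list of ids back into text, using <UNK> for ids we do not know."""
    return " ".join(id_to_word.get(i, "<UNK>") for i in ids)

# With this offset "the" is stored as 4, "movie" as 7 and "great" as 8
encoded = [0, 0, 1, 4, 7, 2, 8]
print(decode(encoded))  # <PAD> <PAD> <START> the movie <UNK> great
```

The same shifted dictionary would then be the one to use when encoding new text for the model, so that new inputs use the ids the model was trained on.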

Building model
1. Create an instance of the Sequential model.
2. To that instance, add an Embedding layer with the vocabulary size and the dimension of the output vectors.
3. Now add an LSTM layer with 100 units.
4. For the output, add a Dense layer with one node and a sigmoid activation.
5. Finally, compile the model for training with binary cross-entropy as the loss function, Adam as the optimizer and accuracy as the metric.


            embedding_vector_length = 32 
            model = Sequential() 
            model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length)) 
            model.add(LSTM(100)) 
            model.add(Dense(1, activation='sigmoid')) 
            model.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy']) 
            model.summary()

output:


        _________________________________________________________________
        Layer (type)                 Output Shape              Param #   
        =================================================================
        embedding_1 (Embedding)      (None, 500, 32)           160000    
        _________________________________________________________________
        lstm_1 (LSTM)                (None, 100)               53200     
        _________________________________________________________________
        dense_1 (Dense)              (None, 1)                 101       
        =================================================================
        Total params: 213,301
        Trainable params: 213,301
        Non-trainable params: 0
        _________________________________________________________________
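The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for each layer type; the function names are invented for this example. An LSTM has four sets of weights (forget, input, candidate and output), each with input weights, recurrent weights and a bias.

```python
def embedding_params(vocab_size, embed_dim):
    # One vector of length embed_dim per word id
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

counts = [embedding_params(5000, 32), lstm_params(32, 100), dense_params(100, 1)]
print(counts, sum(counts))  # [160000, 53200, 101] 213301
```

These match the 160000, 53200 and 101 parameters reported above, for a total of 213,301.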

Training of the model
1. Fit the training data, i.e. X_train and y_train, to the model created in the previous section.
2. Rather than processing the whole training set in a single step, we divide it into batches and pass one batch at a time. This keeps memory use manageable and lets the weights be updated many times per pass over the data.
3. The batch size defines how many samples go into each batch.
4. An epoch is one complete pass over all the training samples. With mini-batch training the weights are updated after every batch, so with 25,000 samples and a batch size of 128 each epoch performs 196 weight updates.


            model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128)

output:

        Train on 25000 samples, validate on 25000 samples
        Epoch 1/2
        25000/25000 [==============================] - 210s 8ms/step - loss: 0.5347 - acc: 0.7306 - val_loss: 0.3227 - val_acc: 0.8622
        Epoch 2/2
        25000/25000 [==============================] - 203s 8ms/step - loss: 0.2980 - acc: 0.8796 - val_loss: 0.3217 - val_acc: 0.8639
        <keras.callbacks.History at 0x12d03d61d30>
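To see how batch size and epochs relate for this run, the following sketch splits 25,000 samples into batches of 128 and counts them. It is plain Python and only mimics the bookkeeping; `iter_batches` is a name made up for this example and is not part of Keras.

```python
def iter_batches(n_samples, batch_size):
    """Yield (start, end) index pairs that cover the data once, i.e. one epoch."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n_samples, batch_size, epochs = 25000, 128, 2
batches = list(iter_batches(n_samples, batch_size))
updates_per_epoch = len(batches)

print(updates_per_epoch)           # 196 batches, so 196 weight updates per epoch
print(batches[-1])                 # (24960, 25000): the last batch has only 40 samples
print(updates_per_epoch * epochs)  # 392 updates over the whole training run
```

A smaller batch size gives more updates per epoch but each one is based on fewer samples; 128 is the choice made in this article, not a requirement.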

Finally, we evaluate the model's performance by comparing the predicted values with the actual labels of the test set.


        scores = model.evaluate(X_test, y_test, verbose=0)
        print("Accuracy: %.2f%%" % (scores[1]*100))
        print("loss: {}".format((scores[0])))

output:

     Accuracy: 86.39%
     loss: 0.3216540405845642
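The two numbers reported by evaluate correspond to the loss and the metric chosen when compiling the model. As a rough illustration of what they measure, here is binary cross-entropy and accuracy computed by hand on four made-up predictions. The values, the 0.5 threshold and the clipping constant are assumptions for this example, not output of the model above.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum((p > threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
print(accuracy(y_true, y_prob))  # 0.5: the first two are right, the last two wrong
print(round(binary_crossentropy(y_true, y_prob), 4))
```

Accuracy only looks at which side of the threshold a prediction falls on, while the loss also penalises predictions that are right but not confident.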

Checking the model:
1. Take the input from the user.
2. Encode the input. There are two ways to do this: the first is the built-in one_hot encoding function; the second is to map each word to its id (integer) using a dictionary. The code below uses the dictionary.
3. Pass the encoded list of integers to the model via the model.predict function.


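          bad = "professional disappointment"
          good = "i really liked the movie and had fun"
          for review in [good,bad]:
              tmp = []
              for word in review.split(" "):
                  # Fall back to 2 (the <UNK> id) for unknown words and for
                  # ids outside the Embedding layer's vocabulary (top_words)
                  idx = word_to_id.get(word, 2)
                  tmp.append(idx if idx < top_words else 2)
              tmp_padded = sequence.pad_sequences([tmp], maxlen=max_review_length)
              # predict returns an array of shape (1, 1); [0][0] extracts the score
              print("%s. Sentiment: %s" % (review, model.predict(tmp_padded)[0][0]))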

output:

      i really liked the movie and had fun. Sentiment: 0.6231231
      professional disappointment. Sentiment: 0.37993282
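The model outputs a number between 0 and 1 from its sigmoid unit, with higher values meaning more positive. To turn that score into a label, one common convention is to cut at 0.5. The sketch below applies that to the two scores printed above; the threshold and the `label_sentiment` helper are assumptions for illustration.

```python
def label_sentiment(score, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a sentiment label."""
    return "positive" if score >= threshold else "negative"

results = [("i really liked the movie and had fun", 0.6231231),
           ("professional disappointment", 0.37993282)]

for text, score in results:
    print(f"{text!r}: {score:.2f} -> {label_sentiment(score)}")
# 'i really liked the movie and had fun': 0.62 -> positive
# 'professional disappointment': 0.38 -> negative
```

Both scores sit fairly close to 0.5, so the model is not very confident about either of these short inputs.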

Finally, save the model


        model.save('imdb.h5')
