Natural Language Processing (NLP) is the field concerned with enabling computers to work with human language in text form. Once a machine can process text, it can produce remarkably useful results. In this article, we look at some applications of NLP that let a machine handle language-related tasks for us.
We will cover the following applications of NLP:
- Text Generation using GPT-2
- Text Summarization
- Sentiment Analysis
Text Generation using GPT-2
Transformer models are well suited to language modeling. GPT-2 (Generative Pre-trained Transformer 2) is a language model released by OpenAI. It is pre-trained on a large text corpus without labels (unsupervised) and can then generate text or be adapted to other NLP tasks. OpenAI did not release the full model at once but in stages, with each stage offering a larger and more capable model than the previous one.
We will explore the various GPT-2 models. The code is run on Google Colab for the compute resources it provides.
To generate text with GPT-2, we use the Python package "Chatting Transformer", a library for generating text with GPT-2. With it, we can load and use the model in a few lines of code.
Install the Chatting Transformer package with pip:
pip install chattingtransformer
Now import and instantiate the model. Instantiating it for the first time takes a while because the GPT-2 weights are downloaded:
from chattingtransformer import ChattingGPT2
gpt2 = ChattingGPT2()
Now we can use GPT-2 to generate text. We give the model an input text, and it generates a continuation of that text. generate_text() returns the generated text.
text = "Generous people are always"
output = gpt2.generate_text(text)
print(output)
Output: Generous people are always willing to do things for the greater good. In this, I think we should have a more open society, one that is more inclusive and more welcoming.
This continuation was generated with the default GPT-2 model. Several GPT-2 models are available for text generation: gpt2, gpt2-medium, gpt2-large and gpt2-xl.
The models grow in size in that order; the largest, gpt2-xl, has about 1.5 billion parameters.
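To get a feel for why model size matters on a machine with limited RAM, here is a rough back-of-the-envelope sketch. The parameter counts are the approximate figures OpenAI reported for the four releases, and the 4 bytes per parameter is an assumption (32-bit floats); actual memory use while generating text is higher than the raw weight size.

```python
# Approximate parameter counts reported by OpenAI for the four GPT-2 releases.
PARAM_COUNTS = {
    "gpt2": 124_000_000,
    "gpt2-medium": 355_000_000,
    "gpt2-large": 774_000_000,
    "gpt2-xl": 1_558_000_000,
}

BYTES_PER_PARAM = 4  # assumption: weights are stored as 32-bit floats


def weight_size_gb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Return the approximate size of the raw weights in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, count in PARAM_COUNTS.items():
    print(f"{name:12s} ~{weight_size_gb(count):.2f} GB")
```

Under these assumptions the largest model needs roughly 6 GB for its weights alone, which is why a smaller variant is often the practical choice on a free Colab instance.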
Let's use gpt2-large to get better output while staying within the RAM limits of Google Colab.
gpt2 = ChattingGPT2("gpt2-large")
text = "Generous people are always "
output = gpt2.generate_text(text)
print(output)
Output: Generous people are always in the mood to share. I'm a little confused. I haven't found a good place to discuss this, but I see you're still at it. Well, I was, for a while.
As you can see, the default and large models generate different text, because they differ in size and therefore in their learned parameters.
Predefined Methods
Chatting Transformer also exposes several decoding methods that change how the text is generated. The following predefined methods can be used to control the output.
- greedy
- beam-search
- generic-sampling
- top-k-sampling
- top-p-nucleus-sampling
methods = ["greedy", "beam-search", "generic-sampling", "top-k-sampling", "top-p-nucleus-sampling"]
text = "Generous people are always "
for method in methods:
    output = gpt2.generate_text(text, method=method)
    print(method, ":", output, end="\n\n")
Explore how each method behaves and experiment to get results that suit your needs. Choose the model size carefully, though: the largest GPT-2 model can take up to about 6.5 GB.
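To build some intuition for how greedy decoding, top-k sampling and top-p (nucleus) sampling differ, here is a minimal sketch on a single, made-up next-token distribution. It is not how Chatting Transformer implements these methods internally; the probabilities and all function names are hypothetical, and beam search is left out because it needs scores for whole sequences rather than a single step.

```python
import random

# Hypothetical next-token probabilities after the prompt "Generous people are always".
# A real model produces such a distribution over its whole vocabulary.
NEXT_TOKEN_PROBS = {
    "willing": 0.40,
    "kind": 0.25,
    "happy": 0.15,
    "there": 0.10,
    "late": 0.07,
    "purple": 0.03,
}


def greedy(probs):
    """Always pick the single most likely token."""
    return max(probs, key=probs.get)


def top_k_candidates(probs, k):
    """Keep only the k most likely tokens."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def top_p_candidates(probs, p):
    """Keep the smallest set of most likely tokens whose total probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, total = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return kept


def sample(candidates, rng):
    """Draw one token in proportion to its probability (weights need not sum to 1)."""
    tokens = list(candidates)
    weights = [candidates[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
print("greedy:", greedy(NEXT_TOKEN_PROBS))
print("top-k (k=3):", sample(top_k_candidates(NEXT_TOKEN_PROBS, 3), rng))
print("top-p (p=0.85):", sample(top_p_candidates(NEXT_TOKEN_PROBS, 0.85), rng))
```

Greedy decoding always returns the same token, while the two sampling variants can return any token that survives the cut, which is why sampled text tends to be more varied.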
Text Summarization
Text summarization is another application of NLP in which a pre-trained or fine-tuned transformer can be applied. Here we use Hugging Face's Transformers library, which provides thousands of pretrained models. We will use the pipeline API. A pipeline groups together a pre-trained model with the preprocessing that was used during that model's training.
To start working with a pre-trained model, we install the transformers Python package, from which we will import the pipeline function.
pip install transformers
The simplest way to use a language model in transformers is the pipeline API. It gives us a ready-to-use summarization model.
from transformers import pipeline
summarization= pipeline("summarization")
Now that the model is available, we can use the pre-trained transformer to summarize our text. Consider the following text that we want to summarize.
# Text to summarize
original_text= """Coronaviruses are a group of related RNA viruses that cause diseases in mammals and birds. In humans and birds, they cause respiratory tract infections that
can range from mild to lethal. Mild illnesses in humans include some cases of the common cold (which is also caused by other viruses, predominantly rhinoviruses), while more
lethal varieties can cause SARS, MERS, and COVID-19. In cows and pigs they cause diarrhea, while in mice they cause hepatitis and encephalomyelitis. Coronaviruses constitute the
subfamily Orthocoronavirinae, in the family Coronaviridae, order Nidovirales, and realm Riboviria. They are enveloped viruses with a positive-sense
single-stranded RNA genome and a nucleocapsid of helical symmetry. The genome size of coronaviruses ranges from approximately 26 to 32 kilobases, one of the largest among RNA
viruses. They have characteristic club-shaped spikes that project from their surface, which in electron micrographs create an image reminiscent of the solar corona, from which
their name derives."""
This pipeline can use any model that has been fine-tuned on a summarization task. At the time of writing, these include 'bart-large-cnn', 't5-small', 't5-base', 't5-large', 't5-3b' and 't5-11b'.
We will use the default model that the summarization pipeline loads.
summarized_text= summarization(original_text)[0]['summary_text']
Additionally, we can pass min_length and max_length to bound the length of the summary. Both are measured in tokens, and the generated summary will fall within the given range.
summarized_text= summarization(original_text,min_length=5, max_length=20)[0]['summary_text']
Output (this sample is longer than 20 tokens, so it corresponds to the default call rather than the length-limited one): Coronaviruses are a group of related RNA viruses that cause diseases in mammals and birds . In humans and birds, they cause respiratory tract infections that can range from mild to lethal . Mild illnesses in humans include some cases of the common cold . In cows and pigs they cause diarrhea, while in mice they cause hepatitis and encephalomyelitis .
Pre-trained transformers are considered state of the art, but you can also fine-tune a transformer on your own data, depending on the computational power available to you.
Sentiment Analysis
Sentiment analysis is another great application of NLP. It is the process of computationally identifying and categorizing opinions in a piece of text to determine whether the writer's attitude towards a particular topic is positive, negative or neutral.
In the following code, we analyze the sentiment of tweets. Afterwards, a user can enter a tweet and the program will predict its polarity.
For training, we use cleaned Twitter data on which the pre-processing has already been done (text pre-processing workflow), so that we can focus on the sentiment analysis part.
We will use the cleaned CSV file, which contains tweets labeled with their sentiment.
Sentiment is denoted by a numerical value: 0 denotes negative sentiment in the tweet and 1 denotes positive sentiment.
To work with the cleaned CSV, we read the file into a pandas DataFrame:
import pandas as pd
import numpy as np
data= pd.read_csv("https://raw.githubusercontent.com/laxmimerit/twitter-data/master/twitter30k_cleaned.csv")
data.head()
The tweets data looks like this, with 30,000 rows in total.
| | twitts | sentiment |
|---|---|---|
| 0 | robbiebronniman sounds like a great night | 1 |
| 1 | damn the person who stolde my wallet may karma... | 1 |
| 2 | greetings from the piano bench photo | 1 |
| 3 | drewryanscott i love it i love you haha forget... | 1 |
| 4 | kissthestars pretty pretty pretty please pakid... | 0 |
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
We will use TF-IDF (Term Frequency - Inverse Document Frequency). It produces a two-dimensional matrix in which each entry reflects how frequent a particular word is in a particular document, weighted down if the word also appears in many other documents. For a detailed explanation, see https://iq.opengenus.org/tf-idf/. We then split the data into training and test sets and use a Linear Support Vector Classifier to classify the tweets as positive or negative.
x = data['twitts']
y = data['sentiment']
Twitter sentiment prediction is a supervised learning problem, so we assign the independent column (the tweet text) to x and the dependent column (the sentiment label) to y.
vectorizer = TfidfVectorizer(max_features=10000) # keep only the 10,000 most frequent terms as features
X = vectorizer.fit_transform(x) # learn the vocabulary and convert each tweet to a numerical TF-IDF vector
X.shape
Output of X.shape: (30000, 10000)
Each of the 30,000 tweets is now represented by 10,000 features (the maximum we allowed).
The tweets are in text form, but a machine learning model works only with numerical data, which is why we use the TF-IDF vectorizer to convert each tweet into a numerical vector.
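To make the idea concrete, here is a minimal sketch that computes a TF-IDF matrix by hand for three made-up sentences. It uses the textbook formula (relative term frequency times log of inverse document frequency); scikit-learn's TfidfVectorizer applies a smoothed idf and normalizes each row by default, so its numbers will differ. The documents and the helper name tfidf_row are hypothetical.

```python
import math
from collections import Counter

# Three tiny, made-up "tweets" standing in for the real dataset.
docs = [
    "happy happy day",
    "sad day",
    "happy night",
]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf_row(tokens):
    """Return one TF-IDF row (a list aligned with `vocabulary`) for a tokenized document."""
    counts = Counter(tokens)
    row = []
    for word in vocabulary:
        tf = counts[word] / len(tokens)             # relative frequency in this document
        idf = math.log(len(docs) / doc_freq[word])  # rarer across documents -> larger weight
        row.append(tf * idf)
    return row


matrix = [tfidf_row(tokens) for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print([round(value, 3) for value in row])
```

Each row corresponds to one document and each column to one vocabulary word, which is the same layout as the (30000, 10000) matrix produced for the tweets.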
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0) # random_state fixes the shuffle so the split is reproducible
X_train.shape, X_test.shape
# X_train shape= (24000, 10000), X_test shape= (6000, 10000)
clf = LinearSVC(max_iter = 800, C = 0.1)
clf.fit(X_train, y_train)
We use LinearSVC (a linear Support Vector Classifier) to classify the tweets as positive or negative. To do so, we split the data into training and test sets and fit the classifier on the training data.
We now have the vectorized data and a trained model. Let's see whether our model can predict the sentiment of a user's tweet.
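For readers curious about what fitting a linear support vector classifier involves, here is a toy sketch that learns a separating line with subgradient descent on the regularized hinge loss. This is an illustration of the general idea only, not the solver LinearSVC uses; the data points, hyperparameters and function names are all made up for the example.

```python
# Toy 2-feature data: label +1 for "positive", -1 for "negative" (made-up values).
samples = [
    ([2.0, 0.0], 1),
    ([1.5, 0.5], 1),
    ([0.0, 2.0], -1),
    ([0.5, 1.5], -1),
]


def train_linear_svm(data, epochs=100, learning_rate=0.1, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the regularized hinge loss."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for features, label in data:
            margin = label * (sum(wi * xi for wi, xi in zip(w, features)) + b)
            # Regularization shrinks the weights slightly on every step.
            w = [wi - learning_rate * reg * wi for wi in w]
            if margin < 1:
                # Point is misclassified or inside the margin: move the boundary away from it.
                w = [wi + learning_rate * label * xi for wi, xi in zip(w, features)]
                b += learning_rate * label
    return w, b


def predict(w, b, features):
    """Return +1 or -1 depending on which side of the hyperplane the point lies."""
    score = sum(wi * xi for wi, xi in zip(w, features)) + b
    return 1 if score >= 0 else -1


w, b = train_linear_svm(samples)
print([predict(w, b, features) for features, _ in samples])
```

In the tweet classifier the same principle applies, except that each sample has 10,000 TF-IDF features instead of two.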
user_tweet = ["Happy to see you all."]
vec = vectorizer.transform(user_tweet) # transform the tweet into a numerical vector
prediction = clf.predict(vec) # predict with the trained classifier
if prediction[0] == 0:
    print("Tweet sentiment is Negative.")
else:
    print("Tweet sentiment is Positive.")
Output: Tweet sentiment is Positive.
These are applications that you can experiment with. For example, sentiment analysis can be used to gauge customer opinion about a product, and text generation can produce a random comment for a friend's YouTube channel (just for fun 😅). For more applications of NLP, you can refer to [this](http://iq.opengenus.org/applications-of-nlp-part-2/).