Applications of NLP: Extraction from PDF, Language Translation and more

In the previous article, we went through some of the applications of NLP. NLP has numerous applications, and in this article we briefly cover a few more that can be seen in the real world.

In this article we will look at the following applications of NLP:

Text Extraction
Language Translation
Text Classification
Question Answering
Text to Speech
Speech to Text

Text Extraction from PDF

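NLP can be used to work with PDF files, for example to convert a PDF to a text file or to perform other manipulation tasks. We are going to use the PyPDF2 module to read and extract the text of a PDF. Text cannot always be extracted correctly, because a PDF can contain diagrams, tables and other elements that do not map cleanly to plain text.
We will install and import the PyPDF2 module and open the PDF file in Python in binary mode ('rb') to start reading from it. The code below uses the PyPDF2 1.x method names (PdfFileReader, getPage(), extractText() and so on). Later PyPDF2 releases renamed these, and PyPDF2 3.x removes the old names, so an older version may be needed to run the code unchanged.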

!pip install PyPDF2

import PyPDF2 as pdf

file= open('/ASK THE RIGHT QUESTIONS.pdf', 'rb')
file

Now create a reader object so that PyPDF2 can read the PDF, passing the file we opened above as the parameter. PyPDF2 provides numerous methods for working with PDFs. Here we use getNumPages(), which returns the total number of pages in the file, and getIsEncrypted(), which returns True if the PDF is password protected and False otherwise.

pdf_reader= pdf.PdfFileReader(file)

pdf_reader.getNumPages()

pdf_reader.getIsEncrypted()
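When getIsEncrypted() returns True, the pages cannot be read until the file is decrypted. The following is a minimal sketch of that check, assuming a reader object that exposes the PyPDF2 1.x methods getIsEncrypted() and decrypt(); the helper name unlock_if_needed and the password in the usage comment are hypothetical.

```python
def unlock_if_needed(reader, password=""):
    """Return True when `reader` is readable, decrypting it first if necessary.

    `reader` is assumed to expose the PyPDF2 1.x methods used in this
    article: getIsEncrypted() and decrypt(password).
    """
    if not reader.getIsEncrypted():
        return True  # nothing to unlock
    # In PyPDF2 1.x, decrypt() returns 0 when the password is wrong and
    # 1 or 2 when it matches the user or owner password.
    return reader.decrypt(password) != 0


# Usage with the reader created above (the password is a placeholder):
# if unlock_if_needed(pdf_reader, password="my-password"):
#     print(pdf_reader.getNumPages())
```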

Now let's extract the content of a specific page using getPage(), passing the page number as the parameter (page numbering starts at 0). getPage() returns a page object rather than readable text, so to get the text we call extractText() on that object.

page1= pdf_reader.getPage(0)
page1      # displays the page object (PDF internals), not readable text

print(page1.extractText())
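The same two calls can be repeated for every page to pull the text of a whole document. This is a small sketch under the assumption that the reader follows the PyPDF2 1.x interface used above; extract_all_text is a hypothetical helper name, not part of PyPDF2.

```python
def extract_all_text(reader):
    """Return the extracted text of every page, one string per page."""
    pages = []
    for index in range(reader.getNumPages()):
        text = reader.getPage(index).extractText()
        # Pages made of scanned images or drawings may yield empty text.
        pages.append(text.strip())
    return pages


# Usage with the reader created above:
# for number, text in enumerate(extract_all_text(pdf_reader), start=1):
#     print(f"--- page {number} ---")
#     print(text[:200])
```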

Now we will write a new PDF file from pages of the original. We will take the first and last pages of the original PDF and merge them into a new PDF file. First, create a writer object so that PyPDF2 can write to a file. Then fetch the pages we want to merge using pdf_reader and add them to the pdf_writer object. Finally, to save the PDF, open a new file in binary write mode and write the contents of pdf_writer to it. The sample PDF used here has 15 pages, so its last page is at index 14.

pdf_writer= pdf.PdfFileWriter()

page1= pdf_reader.getPage(0)
pdf_writer.addPage(page1)

page15= pdf_reader.getPage(14)
pdf_writer.addPage(page15)

output= open('/Pages1&15.pdf', 'wb')
pdf_writer.write(output)
output.close()

The code above creates and saves a PDF named Pages1&15 that contains only the first and last pages of the original file, with the same styling as the original PDF.
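The page index 14 is specific to the sample file. A slightly more general sketch is shown here, assuming the PyPDF2 1.x reader and writer interfaces used above; copy_pages and the output file name are hypothetical. Negative indices count from the end, so -1 always means the last page.

```python
def copy_pages(reader, writer, indices):
    """Add the pages at `indices` to `writer` and return the resolved positions.

    Indices are 0-based; negative values count from the end of the document.
    """
    total = reader.getNumPages()
    copied = []
    for index in indices:
        position = index if index >= 0 else total + index
        if not 0 <= position < total:
            raise IndexError(f"page index {index} is out of range for {total} pages")
        writer.addPage(reader.getPage(position))
        copied.append(position)
    return copied


# Usage with the objects from this article (output name is a placeholder):
# pdf_writer = pdf.PdfFileWriter()
# copy_pages(pdf_reader, pdf_writer, [0, -1])   # first and last page
# with open('/first_and_last.pdf', 'wb') as output:
#     pdf_writer.write(output)
```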

That is how you can work with PDF files in Python using PyPDF2. The package offers many more methods that you can use for your own purposes.

Language Translation

A natural language is any language that humans read and write, and Python has another useful module for working across such languages. Language translation is another application where the power of NLP can be utilized.
For the language translation task we are going to use the googletrans module to convert text from one language into the destination language the user chooses.

Install the googletrans module, import Translator from it and create a Translator object.

!pip install googletrans==3.1.0a0

from googletrans import Translator

translator= Translator()

We will use the translate() method to translate the sentence. It accepts the parameters text (the text to translate), src (the source language code; if omitted, the module detects the language automatically) and dest (the code of the language the sentence should be translated into).
To get the translated text from the result, use its .text attribute.

sentence= input("Sentence to convert: ")

translatedSent= translator.translate(sentence, src='en', dest='hi')

print("Translated Sentence: ", translatedSent.text)

Output is:

Sentence to convert: It is an example of translation.
Translated Sentence: यह अनुवाद का एक उदाहरण है।
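The same call can be repeated for several destination languages. Below is a minimal sketch, assuming a translator object that behaves like googletrans.Translator (a translate(text, src=..., dest=...) method returning an object with a .text attribute); translate_to_many is a hypothetical helper, and the language codes in the usage comment are only examples.

```python
def translate_to_many(translator, sentence, dest_codes):
    """Translate `sentence` into each language code in `dest_codes`.

    Returns a dict mapping each destination code to the translated text.
    The source language is left as "auto" so the module detects it.
    """
    results = {}
    for code in dest_codes:
        translated = translator.translate(sentence, src="auto", dest=code)
        results[code] = translated.text
    return results


# Usage with the Translator created above:
# translate_to_many(translator, "It is an example of translation.", ["hi", "fr", "de"])
```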

Though this module looks modest, you can use it for various purposes, as it simplifies the demanding task of translation.

Text Classification

Another useful application of NLP is classifying text into categories. One example is hate speech classification, which various social media platforms use to limit hate speech on the Internet.
For text classification we will build a Naive Bayes classifier trained on the 20 newsgroups dataset, which is available through fetch_20newsgroups in sklearn.datasets. The classifier will predict which of the dataset's target classes a given text belongs to.

We start by importing fetch_20newsgroups. This dataset is commonly used for working with text and exploring how words are distributed across categories of documents. We then assign the loaded dataset to the 'data' variable, and data.target_names lists all the categories in the dataset.

import numpy as np

from sklearn.datasets import fetch_20newsgroups

data= fetch_20newsgroups()
data.target_names

These are the target_names, the categories into which the texts are classified. There are 20 different classes:

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles',
'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast',
'talk.politics.misc', 'talk.religion.misc']

Now we define the categories and get the training and testing data from the dataset to build our Naive Bayes classifier. fetch_20newsgroups already provides separate train and test subsets, so we can fetch them directly.

# Defining all the categories
categories= ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

# Getting Training & Testing data from fetch_20newsgroups
train= fetch_20newsgroups(subset= 'train', categories= categories)
test= fetch_20newsgroups(subset= 'test', categories= categories)

To work with text data, we first have to vectorize it, that is, convert it into numbers, so that it is meaningful to the computer. We then pass the vectorized data to the Naive Bayes classifier to train the model. To simplify this flow we create a pipeline with make_pipeline, so that the output of the vectorizer is sent directly to the Naive Bayes classifier.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# pipeline: the TF-IDF vectorizer feeds its output directly into Multinomial Naive Bayes
model= make_pipeline(TfidfVectorizer(), MultinomialNB())

# train the model on the training data
model.fit(train.data, train.target)
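To make the vectorization step less of a black box, here is a rough sketch of how TF-IDF weights can be computed by hand. It uses the smoothed IDF formula and L2 normalisation that TfidfVectorizer applies by default, but splitting on whitespace is a simplification of its tokenizer, and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {word: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count multiplied by the smoothed inverse document frequency.
        weights = {
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        }
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({word: w / norm for word, w in weights.items()})
    return vectors


docs = ["the team won the game", "the new graphics card", "the hockey game"]
vectors = tfidf_vectors(docs)
# "the" occurs in every document, so within the third document it gets a
# lower weight than the rarer word "hockey".
print(vectors[2])
```

Words that appear in many documents carry little information about the category, which is why they are weighted down before the classifier sees them.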

We have our model, so now let's make a prediction on a user-supplied text. predict() returns the index of the predicted class, which we look up in train.target_names to get the category name.

pred= model.predict(['You will win the game'])

train.target_names[pred[0]]

Output: rec.sport.hockey

You can see that the user text is classified as rec.sport.hockey, one of the target classes available in the dataset. You can also build your own labelled dataset according to your needs and train a classifier on it.
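As an illustration of training on your own labelled data, the following is a simplified, from-scratch sketch of a multinomial Naive Bayes classifier. It works on raw word counts with add-one smoothing rather than the TF-IDF features used above, and the four training sentences and two labels are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Count documents and words per class; returns (class_counts, word_counts, vocabulary)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict_naive_bayes(model, text):
    """Return the class with the highest log-probability for `text`."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing keeps unseen class/word pairs from zeroing the score.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration.
texts = [
    "the team will win the game",
    "great goal in the hockey game",
    "the rocket reached orbit",
    "the satellite left orbit today",
]
labels = ["sport", "sport", "space", "space"]
model = train_naive_bayes(texts, labels)
print(predict_naive_bayes(model, "you will win the game"))  # prints: sport
```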

QnA Bot

NLP can work quite precisely with text, and we can use this to understand a user's query and answer it accordingly. To answer queries, however, we have to provide a passage of text in which the answers can be found.
To build the QnA bot we use the allennlp module, which contains a reading comprehension model. We import Predictor from allennlp and load a publicly available pretrained model, which bundles what it needs to tokenize the passage and the question and to extract an answer from the passage. This predictor is based on the Bidirectional Attention Flow (BiDAF) model.

!pip install allennlp==1.0.0 allennlp-models==1.0.0

from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bidaf-model-2020.03.19.tar.gz")

To answer a query we have to provide a passage to the predictor so that it can extract the answer from it. The model can only answer questions that are based on the passage. To get the answer text we read the 'best_span_str' key of the result, which holds the span of the passage that best matches the question.

passage = """
Coronavirus disease 2019 (COVID-19) is a contagious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The first case was identified in Wuhan, China, in December 2019.
It has since spread worldwide, leading to an ongoing pandemic. Symptoms of COVID-19 are variable, but often include fever, cough, fatigue, breathing difficulties, and loss of smell and taste. Symptoms begin one to fourteen days after exposure to the virus. Around one in five infected individuals do not develop any symptoms. While most people have mild symptoms, some people develop acute respiratory distress syndrome (ARDS). ARDS can be precipitated by cytokine storms, multi-organ failure, septic shock, and blood clots. Longer-term damage to organs (in particular, the lungs and heart) has been observed."""

result= predictor.predict(passage= passage, question= "What are the symptoms?")

result['best_span_str']

Output: one to fourteen days after exposure to the virus

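The model returns a span taken directly from the passage. Here it picked the fragment about when symptoms begin rather than the list of symptoms, which shows that the extracted span is the model's best match and that a more specific question may be needed to get the intended answer. The model is not complicated to use, and it can be integrated into a chatbot on platforms such as Slack or Telegram that answers from a predefined passage at the backend.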

Text to Speech and Speech to Text

Text to Speech

We have seen that NLP can translate text into other languages and handle other text-related tasks. Another area where NLP is used is speech, for example having a piece of text read aloud in a human-like voice. For that we will use gTTS (Google Text-to-Speech), a Python library that interfaces with Google Translate's text-to-speech service.
First we install the gTTS module, then import gTTS and provide the text that we want to convert to spoken words.

!pip install gtts

from gtts import gTTS

text= "An example for conversion text to speech"

Since we are using Google Colab, we can use Audio from IPython.display to play the converted audio within the notebook.

from IPython.display import Audio
tts = gTTS(text)
# gTTS produces MP3 data, so save it with an .mp3 extension
sound_file = 'output.mp3'
tts.save(sound_file)
Audio(sound_file, autoplay= True)

This will save the audio as an MP3 file and play the spoken text.
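If the audio is only needed in memory, for example to send it over a network, writing a file can be skipped. This is a small sketch assuming a gTTS object, whose write_to_fp() method writes the MP3 data to any binary file-like object; speech_bytes is a hypothetical helper name.

```python
import io


def speech_bytes(tts):
    """Return the synthesized audio as bytes without writing a file.

    `tts` is assumed to be a gTTS object (or anything with a compatible
    write_to_fp(fp) method that writes audio data to a binary stream).
    """
    buffer = io.BytesIO()
    tts.write_to_fp(buffer)
    return buffer.getvalue()


# Usage (requires an internet connection, as gTTS calls an online service):
# from gtts import gTTS
# audio = speech_bytes(gTTS("An example for conversion text to speech", lang="en"))
# print(len(audio), "bytes of MP3 data")
```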

Speech to Text

We have converted text to speech; the reverse application is converting audio to text. For that we use the SpeechRecognition package, whose recognizer transcribes the speech in an audio clip into text.

!pip install SpeechRecognition

import speech_recognition as sr

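Now we initialize the recognizer, read the audio file, and pass the recorded audio to the recognizer, which converts it into text. recognize_google() sends the audio to Google's web speech API, so an internet connection is required.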

import speech_recognition as sr

audio_file= "/content/input.wav"
r= sr.Recognizer()

with sr.AudioFile(audio_file) as source:
  audio_data= r.record(source)
  text= r.recognize_google(audio_data)
  print(text)

Output text: An example for conversion speech to text
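sr.AudioFile reads WAV, AIFF and FLAC files, so it can help to check an input file before passing it to the recognizer. The sketch below uses only Python's standard wave module to write a short test tone and read back its properties; the file name tone.wav and both helper functions are hypothetical and only illustrate the inspection step.

```python
import math
import os
import struct
import tempfile
import wave


def write_test_tone(path, seconds=1.0, rate=16000, frequency=440.0):
    """Write a mono 16-bit PCM WAV file containing a sine tone."""
    frame_count = int(seconds * rate)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * frequency * n / rate)))
            for n in range(frame_count)
        )
        wav_file.writeframes(frames)


def describe_wav(path):
    """Return the channel count, sample rate and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate": rate,
            "seconds": wav_file.getnframes() / rate,
        }


path = os.path.join(tempfile.gettempdir(), "tone.wav")
write_test_tone(path)
print(describe_wav(path))
```

A file that fails to open here, such as MP3 data saved with a .wav extension, would also fail in sr.AudioFile and needs to be converted first.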

We have provided an audio file for speech to text, but what if we want to convert live speech to text? That can also be done with SpeechRecognition, but we additionally have to install PyAudio so that audio can be captured from the microphone and passed to the recognizer. For live speech recognition we use sr.Microphone() as the input source, and when recording we pass the duration in seconds; otherwise record() keeps listening to the microphone indefinitely. This requires a machine with a microphone attached, so it is intended to be run locally rather than in a hosted notebook.

!pip install PyAudio

import speech_recognition as sr

r= sr.Recognizer()

print('Speak something! ')

with sr.Microphone() as source:
  audio_data= r.record(source, duration= 5)
  text= r.recognize_google(audio_data)
  print(text)

Output: the text recognized from the 5 seconds of recorded speech

These are some of the useful applications that can be implemented using the models and libraries available in Python. Though some of them look simple, integrating them into a larger task, such as a business assistant or a product feature, can make them a valuable additional service.