NLP Project: Compare Text Summarization Models

Internship at OpenGenus

Introduction

In this article, we will go over the basics of Text Summarization, the different approaches to generating automatic summaries, and some real-world applications of Text Summarization. Finally, we will compare various Text Summarization models in Python using ROUGE, a set of metrics for evaluating automatic summaries.

What is Text Summarization?

Text Summarization is the process of shortening a long piece of text, such as an article, essay, or research paper, into a summary that conveys the overarching meaning of the text by retaining key information and leaving out the bits that are not important. There are two broad approaches when it comes to generating automatic summaries, namely:

  1. Extractive Summarization
  2. Abstractive Summarization

Extractive Summarization

Extractive Summarization models concatenate several relevant, information-containing sentences exactly as they are in the source material in order to create short summaries. These models do not generate any text that does not already exist in the source material. Most summarization systems used today happen to be extractive.

Abstractive Summarization

Abstractive Summarization models create summaries that convey the main information in the source material. Certain phrases and clauses may be reused from the source material, but the overall summary is generally rephrased and written in different words. Sentences in the summary may not necessarily be present in the original block of text.

Abstractive Summarization models generally require more computational power. This is because they need to generate grammatically and contextually intact sentences that are relevant to the domain that is being referred to. The model has to first thoroughly understand the source material in order for it to be able to summarize it effectively and meaningfully. This is why most summarization systems used today happen to be Extractive. Extractive Summarization models only need to focus on fulfilling one objective: identifying the important sentences that need to be a part of the summary. Abstractive Summarization models have to take many more details into account before being able to generate an adequate summary.

Before we begin, let us take a look at how we will evaluate our automatically generated summaries.

Evaluation using ROUGE Metrics

ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics used to measure the degree of similarity between a candidate summary (the automatically generated summary) and the target summary (a hand-written reference summary). ROUGE scores are divided into ROUGE-1, ROUGE-2, and ROUGE-L.

ROUGE-1

ROUGE-1 compares the degree of similarity of unigrams in the automatically generated and hand-written summaries. Unigrams, in this case, are individual words. Thus, precision and recall can be calculated by counting how many individual words from the reference summary also appear in the automatically generated summary.

If we have a candidate sentence: "He went to the park"

This sentence can be expressed as a list of unigram tokens:
'He','went','to','the','park'

Let us say we have a reference sentence: "He went to the park yesterday"

This sentence contains the following unigram tokens:
'He','went','to','the','park','yesterday'

Now, we look at all of the unigram tokens captured by the candidate sentence:
'He','went','to','the','park'

Thus, the ROUGE-1 Precision can be calculated as:
(Number of captured unigram tokens) ÷ (Number of candidate unigram tokens)

This gives us 5 ÷ 5 = 1

And, the ROUGE-1 Recall can be calculated as:
(Number of captured unigram tokens) ÷ (Number of reference unigram tokens)

This gives us 5 ÷ 6 = 0.83

The ROUGE-1 F-Score can be calculated as:
2 x (Precision x Recall) ÷ (Precision + Recall)

This gives us 2 x (1 x 0.83) ÷ (1 + 0.83) = 0.907
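The arithmetic above can be sketched in a few lines of Python. This is a simplified illustration using clipped unigram counts, not the official ROUGE implementation:

```python
# Simplified ROUGE-1 sketch: clipped unigram overlap between a candidate
# and a reference summary (not the official ROUGE implementation).
from collections import Counter

def rouge_1(candidate, reference):
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    # A unigram is "captured" at most as many times as it occurs in the reference
    captured = sum(min(count, ref[token]) for token, count in cand.items())
    precision = captured / sum(cand.values())
    recall = captured / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = rouge_1("He went to the park",
                                "He went to the park yesterday")
print(precision, recall, f1)  # precision 1.0, recall 5/6 ≈ 0.83
```

Note that the exact F-score here is 10/11 ≈ 0.909; the 0.907 above comes from rounding the recall to 0.83 before combining.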

ROUGE-2

ROUGE-2 compares the degree of similarity of bigrams in the automatically generated and hand-written summaries. Bigrams, in this case, are pairs of consecutive words. Thus, precision and recall can be calculated by counting how many bigrams from the reference summary also appear in the automatically generated summary.

If we have a candidate sentence: "He likes going to the park"

This sentence can be expressed as a list of bigram tokens:
'He likes','likes going','going to','to the','the park'

If we have a reference sentence: "He really likes going to the park"

This sentence contains the following bigram tokens:
'He really','really likes','likes going','going to','to the','the park'

Now, we look at all of the bigram tokens captured by the candidate sentence:
'likes going','going to','to the','the park'

Thus, the ROUGE-2 Precision can be calculated as:
(Number of captured bigram tokens) ÷ (Number of candidate bigram tokens)

This gives us 4 ÷ 5 = 0.8

And, the ROUGE-2 Recall can be calculated as:
(Number of captured bigram tokens) ÷ (Number of reference bigram tokens)

This gives us 4 ÷ 6 = 0.66

The ROUGE-2 F-Score can be calculated as:
2 x (Precision x Recall) ÷ (Precision + Recall)

This gives us 2 x (0.8 x 0.66) ÷ (0.8 + 0.66) = 0.723
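The same clipped-overlap idea generalizes to n-grams; with n = 2 we get ROUGE-2. A minimal sketch (again, not the official implementation):

```python
# Generalized ROUGE-N sketch: clipped n-gram overlap (n=2 gives ROUGE-2).
from collections import Counter

def ngram_counts(text, n):
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=2):
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    captured = sum(min(count, ref[gram]) for gram, count in cand.items())
    precision = captured / sum(cand.values())
    recall = captured / sum(ref.values())
    return precision, recall, 2 * precision * recall / (precision + recall)

precision, recall, f1 = rouge_n("He likes going to the park",
                                "He really likes going to the park")
print(precision, recall, f1)  # precision 0.8, recall 4/6 ≈ 0.66
```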

ROUGE-L

ROUGE-L measures the longest common subsequence (LCS) between the candidate and reference sentences: the longest sequence of words that appears in both, in the same order, even if other, non-matching words lie between them.

For example, if we have the candidate sentence: "I carried an umbrella to the zoo in case it rained"

This sentence contains the following tokens:
'I','carried','an','umbrella','to','the','zoo','in','case','it','rained'

If we have a reference sentence: "I took an umbrella to the zoo since it could have rained"

This sentence contains the following tokens:
'I','took','an','umbrella','to','the','zoo','since','it','could','have','rained'

Now, we look at all of the captured tokens:
'I','an','umbrella','to','the','zoo','it','rained'

Thus, the ROUGE-L Precision can be calculated as:
(Number of captured tokens) ÷ (Number of candidate tokens)

This gives us 8 ÷ 11 = 0.72

And, the ROUGE-L Recall can be calculated as:
(Number of captured tokens) ÷ (Number of reference tokens)

This gives us 8 ÷ 12 = 0.66

The ROUGE-L F-Score can be calculated as:
2 x (Precision x Recall) ÷ (Precision + Recall)

This gives us 2 x (0.72 x 0.66) ÷ (0.72 + 0.66) = 0.688
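The LCS itself can be computed with the standard dynamic-programming algorithm over word tokens. A simplified sketch (official ROUGE-L additionally handles multi-sentence texts):

```python
# ROUGE-L sketch: dynamic-programming longest common subsequence over
# word tokens (simplified single-sentence version).
def lcs_length(a, b):
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

candidate = "I carried an umbrella to the zoo in case it rained".split()
reference = "I took an umbrella to the zoo since it could have rained".split()
lcs = lcs_length(candidate, reference)  # 8 words match in order
precision = lcs / len(candidate)        # 8 / 11
recall = lcs / len(reference)           # 8 / 12
f1 = 2 * precision * recall / (precision + recall)
```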

List of Text Summarization Models

Now that we know about the two broad categories of summarization models, as well as the evaluation metrics that we will use to score our automatically generated summaries, let us take a look at the different models that we will be comparing in this article.

  1. Luhn's Heuristic Method
  2. TextRank
  3. Latent Semantic Analysis (LSA)
  4. Kullback-Leibler Sum (KL-Sum)
  5. T5 Transformer Model

Luhn's Heuristic Method

Luhn's Heuristic Method is one of the earliest Text Summarization algorithms, published in 1958. It is based on TF-IDF (Term Frequency-Inverse Document Frequency) and selects words of high importance based on their frequency of occurrence. In addition, greater weight is given to words that occur at the beginning of the document.
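The core idea can be illustrated with a toy frequency-based scorer. This is a simplification in the spirit of Luhn's method, not the exact 1958 algorithm, and the stopword list below is a made-up placeholder:

```python
# Toy Luhn-style scoring: a sentence is scored by the corpus-wide
# frequency of its significant (non-stopword) words.
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "is", "it", "to", "in"}  # placeholder list

def significant(sentence):
    return [w.lower().strip(".,") for w in sentence.split()
            if w.lower().strip(".,") not in STOPWORDS]

def luhn_scores(sentences):
    freq = Counter(w for s in sentences for w in significant(s))
    return [sum(freq[w] for w in significant(s)) for s in sentences]

scores = luhn_scores(["Consciousness is awareness",
                      "Awareness of awareness",
                      "The park is nice"])
# The second sentence scores highest: "awareness" is the most frequent word
```

A summarizer would then keep the top-scoring sentences; the sumy library implements the full method for us below.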

Let us now take a look at the document that we will be summarizing using all of the aforementioned Text Summarization models:

"Consciousness, at its simplest, is sentience or awareness of internal and external existence. Despite millennia of analyses, definitions, explanations and debates by philosophers and scientists, consciousness remains puzzling and controversial, being 'at once the most familiar and also the most mysterious aspect of our lives'. Perhaps the only widely agreed notion about the topic is the intuition that consciousness exists. Opinions differ about what exactly needs to be studied and explained as consciousness. Sometimes, it is synonymous with the mind, and at other times, an aspect of mind. In the past, it was one's 'inner life', the world of introspection, of private thought, imagination and volition. Today, it often includes any kind of cognition, experience, feeling or perception. It may be awareness, awareness of awareness, or self-awareness either continuously changing or not. There might be different levels or orders of consciousness, or different kinds of consciousness, or just one kind with different features. Other questions include whether only humans are conscious, all animals, or even the whole universe. The disparate range of research, notions and speculations raises doubts about whether the right questions are being asked. Examples of the range of descriptions, definitions or explanations are: simple wakefulness, one's sense of selfhood or soul explored by 'looking within'; being a metaphorical 'stream' of contents, or being a mental state, mental event or mental process of the brain; having phanera or qualia and subjectivity; being the 'something that it is like' to 'have' or 'be' it; being the 'inner theatre' or the executive control system of the mind."

As we can see, this document contains multiple sentences, and the 'important' ones are not necessarily all that easy to tell apart from the ones that simply give us a brief insight into the history and ideologies surrounding consciousness. We will now use Python to generate an automatic summary of the above document using Luhn's Heuristic Method (Sumy library).

First, we install the Sumy library:

!pip install sumy

Next, we import the necessary packages:

import sumy
import nltk
nltk.download('punkt')
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser

Then, we define our source material (the above document on 'consciousness' will serve as our source material):

source_material = "Consciousness, at its simplest, is sentience or awareness of internal and external existence. Despite millennia of analyses, definitions, explanations and debates by philosophers and scientists, consciousness remains puzzling and controversial, being 'at once, the most familiar and also the most mysterious aspect of our lives'. Perhaps the only widely agreed notion about the topic is the intuition that consciousness exists. Opinions differ about what exactly needs to be studied and explained as consciousness. Sometimes, it is synonymous with the mind, and at other times, an aspect of mind. In the past, it was one's 'inner life', the world of introspection, of private thought, imagination and volition. Today, it often includes any kind of cognition, experience, feeling or perception. It may be awareness, awareness of awareness, or self-awareness either continuously changing or not. There might be different levels or orders of consciousness, or different kinds of consciousness, or just one kind with different features. Other questions include whether only humans are conscious, all animals, or even the whole universe. The disparate range of research, notions and speculations raises doubts about whether the right questions are being asked. Examples of the range of descriptions, definitions or explanations are: simple wakefulness, one's sense of selfhood or soul explored by 'looking within'; being a metaphorical 'stream' of contents, or being a mental state, mental event or mental process of the brain; having phanera or qualia and subjectivity; being the 'something that it is like' to 'have' or 'be' it; being the 'inner theatre' or the executive control system of the mind."

Then, we define the language that we will pass to the Tokenizer, and create a parser:

LANGUAGE = "english"
parser = PlaintextParser.from_string(source_material,Tokenizer(LANGUAGE))

Then, we create the summarizer:

summarizer = LuhnSummarizer()

Now, we will generate an automatic summary using our summarizer. We will restrict the number of sentences so as to acquire a summary that is roughly 100 words long.

testsummary = summarizer(parser.document,sentences_count=3)

Since 'testsummary' is a tuple containing multiple sentences, we will have to concatenate these sentences to form a single string. This will make it possible for us to evaluate the string using ROUGE.

summary = ""
for sentence in testsummary:
  summary += str(sentence) + " "  # add a space so sentences do not run together
print(summary)

We obtain the following output:
Despite millennia of analyses, definitions, explanations and debates by philosophers and scientists, consciousness remains puzzling and controversial, being 'at once, the most familiar and also the most mysterious aspect of our lives'. There might be different levels or orders of consciousness, or different kinds of consciousness, or just one kind with different features. Examples of the range of descriptions, definitions or explanations are: simple wakefulness, one's sense of selfhood or soul explored by 'looking within'; being a metaphorical 'stream' of contents, or being a mental state, mental event or mental process of the brain; having phanera or qualia and subjectivity; being the 'something that it is like' to 'have' or 'be' it; being the 'inner theatre' or the executive control system of the mind.

As we can see, the output consists of what our summarizer picked out as the three most significant sentences. However, as the hand-written summary below shows (we will use it to compare all of our automatically generated summaries), this summary misses out on some key information conveyed by the source material.

Let us now evaluate our automatically generated summary using ROUGE metrics.

We install the 'Rouge' library:

!pip install rouge

We then import the necessary package:

from rouge import Rouge

We define the hand-written, reference summary:

reference = "Consciousness is essentially the awareness of one's internal and external existence. Despite seeming like a fairly trivial concept, the only notion that seems to be widely agreed upon after millenia of theorizing and debating is the fact that consciousness exists. In the past, consciousness was perceived as one's inner life, the world of introspection, of private thought, imagination and volition. Today, this definition includes any kind of cognition, experience, feeling or perception. The disparate range of research, notions and speculations raises doubts about whether the right questions are being asked."

We then compare our automatically generated summary to the reference summary:

ROUGE = Rouge()
ROUGE.get_scores(summary,reference)

On executing the above statements, we get the following output:
[{'rouge-1': {'r': 0.2054794520547945,
'p': 0.18292682926829268,
'f': 0.19354838211363173},
'rouge-2': {'r': 0.011235955056179775,
'p': 0.008620689655172414,
'f': 0.00975609264771217},
'rouge-l': {'r': 0.1780821917808219,
'p': 0.15853658536585366,
'f': 0.16774193050072858}}]

Thus, our ROUGE-1 scores are as follows:
F1 - 0.19354838211363173
Precision - 0.18292682926829268
Recall - 0.2054794520547945

Our ROUGE-2 scores are as follows:
F1 - 0.00975609264771217
Precision - 0.008620689655172414
Recall - 0.011235955056179775

And our ROUGE-L scores are as follows:
F1 - 0.16774193050072858
Precision - 0.15853658536585366
Recall - 0.1780821917808219

As we can see from our ROUGE scores, our automatically generated summary did not score too well when compared to the hand-written reference summary. Let us move to the next Text Summarization model, TextRank.

TextRank

TextRank is a graph-based extractive Text Summarization technique inspired by Google's PageRank. Each sentence is treated as a node in a graph, and edges between sentences are weighted by how similar (in terms of shared words) the two sentences are. A PageRank-style algorithm then assigns a score to each sentence, the sentences are ranked in descending order of their scores, and the top-scoring sentences are included in the summary. TextRank can also be used to extract keywords. Let us now generate an automatic summary using TextRank.
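The mechanics can be sketched in a few lines. This is a toy illustration of the idea only, not gensim's implementation; the similarity function and iteration count are simplifying assumptions:

```python
# Toy TextRank sketch: sentences are graph nodes, edges are weighted by
# word overlap, and a PageRank-style power iteration scores each node.
import math

def similarity(s1, s2):
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    # +1 inside the logs avoids a zero denominator for one-word sentences
    return len(w1 & w2) / (math.log(len(w1) + 1) + math.log(len(w2) + 1))

def textrank(sentences, d=0.85, iterations=20):
    n = len(sentences)
    sim = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives rank from its neighbours, damped by d
        scores = [(1 - d) + d * sum(sim[j][i] / sum(sim[j]) * scores[j]
                                    for j in range(n) if sim[j][i] and sum(sim[j]))
                  for i in range(n)]
    return scores

scores = textrank(["the cat sat on the mat",
                   "the dog sat on the log",
                   "birds fly high"])
# The two overlapping sentences reinforce each other and outrank the third
```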

First, we install the Gensim library:

!pip install gensim

Next, we import the necessary packages:

import gensim
from gensim.summarization import summarize

Now, we define our source material:

source_material = "Consciousness, at its simplest, is sentience or awareness of internal and external existence. Despite millennia of analyses, definitions, explanations and debates by philosophers and scientists, consciousness remains puzzling and controversial, being 'at once, the most familiar and also the most mysterious aspect of our lives'. Perhaps the only widely agreed notion about the topic is the intuition that consciousness exists. Opinions differ about what exactly needs to be studied and explained as consciousness. Sometimes, it is synonymous with the mind, and at other times, an aspect of mind. In the past, it was one's 'inner life', the world of introspection, of private thought, imagination and volition. Today, it often includes any kind of cognition, experience, feeling or perception. It may be awareness, awareness of awareness, or self-awareness either continuously changing or not. There might be different levels or orders of consciousness, or different kinds of consciousness, or just one kind with different features. Other questions include whether only humans are conscious, all animals, or even the whole universe. The disparate range of research, notions and speculations raises doubts about whether the right questions are being asked. Examples of the range of descriptions, definitions or explanations are: simple wakefulness, one's sense of selfhood or soul explored by 'looking within'; being a metaphorical 'stream' of contents, or being a mental state, mental event or mental process of the brain; having phanera or qualia and subjectivity; being the 'something that it is like' to 'have' or 'be' it; being the 'inner theatre' or the executive control system of the mind."

Now, we pass the source material to the 'summarize' function:

summary = summarize(source_material,word_count=100)

And finally, we print the automatically generated summary:

print(summary)

We obtain the following output:
Consciousness, at its simplest, is sentience or awareness of internal and external existence. Despite millennia of analyses, definitions, explanations and debates by philosophers and scientists, consciousness remains puzzling and controversial, being 'at once, the most familiar and also the most mysterious aspect of our lives'. Perhaps the only widely agreed notion about the topic is the intuition that consciousness exists. Other questions include whether only humans are conscious, all animals, or even the whole universe. The disparate range of research, notions and speculations raises doubts about whether the right questions are being asked.

As we can see, the summary contains ideas from throughout the source material, including the basic definition of consciousness, how consciousness seems like a trivial concept but is also very unexplored and vast, and also closes with questions that are being asked about consciousness today.

Let us now evaluate our automatically generated summary using ROUGE metrics.

We install the 'Rouge' library:

!pip install rouge

We then import the necessary package:

from rouge import Rouge

We define the hand-written, reference summary:

reference = "Consciousness is essentially the awareness of one's internal and external existence. Despite seeming like a fairly trivial concept, the only notion that seems to be widely agreed upon after millenia of theorizing and debating is the fact that consciousness exists. In the past, consciousness was perceived as one's inner life, the world of introspection, of private thought, imagination and volition. Today, this definition includes any kind of cognition, experience, feeling or perception. The disparate range of research, notions and speculations raises doubts about whether the right questions are being asked."

We then compare our automatically generated summary to the reference summary:

ROUGE = Rouge()
ROUGE.get_scores(summary,reference)

On executing the above statements, we get the following output:
[{'rouge-1': {'f': 0.45070422035608015,
'p': 0.463768115942029,
'r': 0.4383561643835616},
'rouge-2': {'f': 0.29999999500061736,
'p': 0.2967032967032967,
'r': 0.30337078651685395},
'rouge-l': {'f': 0.43661971331382665,
'p': 0.4492753623188406,
'r': 0.4246575342465753}}]

Thus, our ROUGE-1 scores are as follows:
F1 - 0.45070422035608015
Precision - 0.463768115942029
Recall - 0.4383561643835616

Our ROUGE-2 scores are as follows:
F1 - 0.29999999500061736
Precision - 0.2967032967032967
Recall - 0.30337078651685395

And our ROUGE-L scores are as follows:
F1 - 0.43661971331382665
Precision - 0.4492753623188406
Recall - 0.4246575342465753

In the above summary, we limited the word count to 100, but this can be altered if required. Let us now move on to the next Text Summarization model, the Latent Semantic Analysis (LSA) model.

Latent Semantic Analysis (LSA)

Latent Semantic Analysis is an unsupervised Natural Language Processing (NLP) technique that uses statistics to extract the association between words in a document on the basis of their contextual use. The goal is to identify the most important topics from the source material and to then choose the sentences with the greatest combined weights across the topics. Singular Value Decomposition (SVD) is the statistical technique that is used to uncover the hidden semantic structure of words in the source material. Let us now generate an automatic summary using Latent Semantic Analysis.
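The pipeline described above can be sketched with NumPy. This is a toy illustration of the SVD-based scoring idea, not sumy's LsaSummarizer; the topic count and raw term counts are simplifying assumptions:

```python
# Minimal LSA sketch: build a term-sentence matrix, decompose it with
# SVD, and score each sentence by its weight across the top topics.
import numpy as np

def lsa_scores(sentences, n_topics=2):
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    # Term-sentence count matrix A: rows are terms, columns are sentences
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w in s.lower().split():
            A[vocab.index(w), j] += 1
    # SVD: the rows of Vt place each sentence in latent topic space,
    # and S gives the strength of each topic
    _, S, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(n_topics, len(S))
    # Combined (length of each sentence's vector over the k strongest topics)
    return np.sqrt((S[:k, None] ** 2 * Vt[:k] ** 2).sum(axis=0))
```

A summarizer would then keep the sentences with the largest scores, which is what sumy's LsaSummarizer does for us below.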

First, we install the Sumy library:

!pip install sumy

Next, we import the necessary packages:

import sumy
import nltk
nltk.download('punkt')
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser

Now, we define our source material:

source_material = "Consciousness, at its simplest, is sentience or awareness of internal and external existence. Despite millennia of analyses, definitions, explanations and debates by philosophers and scientists, consciousness remains puzzling and controversial, being 'at once, the most familiar and also the most mysterious aspect of our lives'. Perhaps the only widely agreed notion about the topic is the intuition that consciousness exists. Opinions differ about what exactly needs to be studied and explained as consciousness. Sometimes, it is synonymous with the mind, and at other times, an aspect of mind. In the past, it was one's 'inner life', the world of introspection, of private thought, imagination and volition. Today, it often includes any kind of cognition, experience, feeling or perception. It may be awareness, awareness of awareness, or self-awareness either continuously changing or not. There might be different levels or orders of consciousness, or different kinds of consciousness, or just one kind with different features. Other questions include whether only humans are conscious, all animals, or even the whole universe. The disparate range of research, notions and speculations raises doubts about whether the right questions are being asked. Examples of the range of descriptions, definitions or explanations are: simple wakefulness, one's sense of selfhood or soul explored by 'looking within'; being a metaphorical 'stream' of contents, or being a mental state, mental event or mental process of the brain; having phanera or qualia and subjectivity; being the 'something that it is like' to 'have' or 'be' it; being the 'inner theatre' or the executive control system of the mind."

Then, we define the language that we will pass to the Tokenizer, and create a parser:

LANGUAGE = "english"
parser = PlaintextParser.from_string(source_material,Tokenizer(LANGUAGE))

Then, we create the summarizer:

summarizer = LsaSummarizer()

Now, we will generate an automatic summary using our summarizer. We will restrict the number of sentences so as to acquire a summary that is roughly 100 words long.

testsummary = summarizer(parser.document,sentences_count=4)

Since 'testsummary' is a tuple containing multiple sentences, we will have to concatenate these sentences to form a single string. This will make it possible for us to evaluate the string using ROUGE.

summary = ""
for sentence in testsummary:
  summary += str(sentence) + " "  # add a space so sentences do not run together
print(summary)

We obtain the following output:
Consciousness, at its simplest, is sentience or awareness of internal and external existence. Opinions differ about what exactly needs to be studied and explained as consciousness. In the past, it was one's 'inner life', the world of introspection, of private thought, imagination and volition. Today, it often includes any kind of cognition, experience, feeling or perception. Other questions include whether only humans are conscious, all animals, or even the whole universe. The disparate range of research, notions and speculations raises doubts about whether the right questions are being asked.

As we can see, the summary contains various key points from the source material, including past perceptions of consciousness and how they differ from more recent definitions, the kinds of questions that are being asked about consciousness, and whether the questions that are being asked are the right ones.

Let us now evaluate our automatically generated summary using ROUGE metrics.

We install the 'Rouge' library:

!pip install rouge

We then import the necessary package:

from rouge import Rouge

We define the hand-written, reference summary:

reference = "Consciousness is essentially the awareness of one's internal and external existence. Despite seeming like a fairly trivial concept, the only notion that seems to be widely agreed upon after millenia of theorizing and debating is the fact that consciousness exists. In the past, consciousness was perceived as one's inner life, the world of introspection, of private thought, imagination and volition. Today, this definition includes any kind of cognition, experience, feeling or perception. The disparate range of research, notions and speculations raises doubts about whether the right questions are being asked."

We then compare our automatically generated summary to the reference summary:

ROUGE = Rouge()
ROUGE.get_scores(summary,reference)

On executing the above statements, we get the following output:
[{'rouge-1': {'r': 0.6438356164383562,
'p': 0.6527777777777778,
'f': 0.6482758570692034},
'rouge-2': {'r': 0.47191011235955055,
'p': 0.4772727272727273,
'f': 0.4745762661866003},
'rouge-l': {'r': 0.6301369863013698,
'p': 0.6388888888888888,
'f': 0.6344827536209275}}]

Thus, our ROUGE-1 scores are as follows:
F1 - 0.6482758570692034
Precision - 0.6527777777777778
Recall - 0.6438356164383562

Our ROUGE-2 scores are as follows:
F1 - 0.4745762661866003
Precision - 0.4772727272727273
Recall - 0.47191011235955055

And our ROUGE-L scores are as follows:
F1 - 0.6344827536209275
Precision - 0.6388888888888888
Recall - 0.6301369863013698

As we can see from our ROUGE scores, our automatically generated summary scored the highest so far when compared to our hand-written reference summary. Let us now move on to the next Text Summarization model, the Kullback-Leibler Sum (KL-Sum) model.

Kullback-Leibler Sum

In mathematical statistics, the Kullback-Leibler divergence (also called relative entropy) is a statistical distance that measures how different a probability distribution P is from a reference probability distribution Q. In summarization, the divergence between the word distribution of the summary and that of the source material is inversely related to their similarity: the lower the divergence, the more faithfully the summary reflects the source. The Kullback-Leibler Sum algorithm is a greedy method that builds a summary by appending sentences as long as the KL divergence keeps decreasing. This ensures that the summary's unigram distribution stays close to that of the document set. Let us now generate an automatic summary using these principles.
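The quantity the algorithm tries to keep low can be sketched as follows. This is a simplified unigram KL divergence with add-one smoothing; sumy's internals differ in detail:

```python
# Simplified unigram KL divergence KL(P || Q) between a document's word
# distribution P and a candidate summary's word distribution Q.
import math
from collections import Counter

def kl_divergence(document, summary):
    d_counts = Counter(document.lower().split())
    s_counts = Counter(summary.lower().split())
    vocab = set(d_counts) | set(s_counts)
    d_total = sum(d_counts.values()) + len(vocab)  # add-one smoothing
    s_total = sum(s_counts.values()) + len(vocab)
    kl = 0.0
    for w in vocab:
        p = (d_counts[w] + 1) / d_total  # source-document distribution P
        q = (s_counts[w] + 1) / s_total  # candidate-summary distribution Q
        kl += p * math.log(p / q)
    return kl
```

A greedy KL-Sum summarizer would repeatedly append whichever remaining sentence most reduces this value; the divergence is zero only when the two distributions match exactly.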

First, we install the Sumy library:

!pip install sumy

Next, we import the necessary packages:

import sumy
import nltk
nltk.download('punkt')
from sumy.summarizers.kl import KLSummarizer
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser

Now, we define our source material:

source_material = "Consciousness, at its simplest, is sentience or awareness of internal and external existence. Despite millennia of analyses, definitions, explanations and debates by philosophers and scientists, consciousness remains puzzling and controversial, being 'at once, the most familiar and also the most mysterious aspect of our lives'. Perhaps the only widely agreed notion about the topic is the intuition that consciousness exists. Opinions differ about what exactly needs to be studied and explained as consciousness. Sometimes, it is synonymous with the mind, and at other times, an aspect of mind. In the past, it was one's 'inner life', the world of introspection, of private thought, imagination and volition. Today, it often includes any kind of cognition, experience, feeling or perception. It may be awareness, awareness of awareness, or self-awareness either continuously changing or not. There might be different levels or orders of consciousness, or different kinds of consciousness, or just one kind with different features. Other questions include whether only humans are conscious, all animals, or even the whole universe. The disparate range of research, notions and speculations raises doubts about whether the right questions are being asked. Examples of the range of descriptions, definitions or explanations are: simple wakefulness, one's sense of selfhood or soul explored by 'looking within'; being a metaphorical 'stream' of contents, or being a mental state, mental event or mental process of the brain; having phanera or qualia and subjectivity; being the 'something that it is like' to 'have' or 'be' it; being the 'inner theatre' or the executive control system of the mind."

Then, we define the language that we will pass to the Tokenizer, and create a parser:

LANGUAGE = "english"
parser = PlaintextParser.from_string(source_material,Tokenizer(LANGUAGE))

Then, we create the summarizer:

summarizer = KLSummarizer()

Now, we will generate an automatic summary using our summarizer. We will restrict the number of sentences so as to acquire a summary that is roughly 100 words long.

testsummary = summarizer(parser.document,sentences_count=6)

Since 'testsummary' is a tuple containing multiple sentences, we will have to concatenate these sentences to form a single string. This will make it possible for us to evaluate the string using ROUGE.

summary = ""
for sentence in testsummary:
  summary += str(sentence) + " "  # add a space so sentences do not run together
print(summary)

We obtain the following output:
Consciousness, at its simplest, is sentience or awareness of internal and external existence. Opinions differ about what exactly needs to be studied and explained as consciousness. In the past, it was one's 'inner life', the world of introspection, of private thought, imagination and volition. Today, it often includes any kind of cognition, experience, feeling or perception. It may be awareness, awareness of awareness, or self-awareness either continuously changing or not. There might be different levels or orders of consciousness, or different kinds of consciousness, or just one kind with different features.

As we can see, while the summary does cover several key points conveyed in the source material, it does not mention the questions that are being asked about consciousness today, or whether they are the right ones. We will therefore most likely see a dip in our ROUGE scores when we evaluate it against our hand-written reference summary.

Let us now evaluate our automatically generated summary using ROUGE metrics.

We install the 'Rouge' library:

!pip install rouge

We then import the necessary package:

from rouge import Rouge

We define the hand-written, reference summary:

reference = "Consciousness is essentially the awareness of one's internal and external existence. Despite seeming like a fairly trivial concept, the only notion that seems to be widely agreed upon after millenia of theorizing and debating is the fact that consciousness exists. In the past, consciousness was perceived as one's inner life, the world of introspection, of private thought, imagination and volition. Today, this definition includes any kind of cognition, experience, feeling or perception. The disparate range of research, notions and speculations raises doubts about whether the right questions are being asked."

We then compare our automatically generated summary to the reference summary:

ROUGE = Rouge()
ROUGE.get_scores(summary, reference)

On executing the above statements, we get the following output:
[{'rouge-1': {'r': 0.4383561643835616,
'p': 0.47761194029850745,
'f': 0.4571428521520408},
'rouge-2': {'r': 0.2808988764044944,
'p': 0.28735632183908044,
'f': 0.2840909040915548},
'rouge-l': {'r': 0.4246575342465753,
'p': 0.4626865671641791,
'f': 0.4428571378663266}}]

Thus, our ROUGE-1 scores are as follows:
F1 - 0.4571428521520408
Precision - 0.47761194029850745
Recall - 0.4383561643835616

Our ROUGE-2 scores are as follows:
F1 - 0.2840909040915548
Precision - 0.28735632183908044
Recall - 0.2808988764044944

And our ROUGE-L scores are as follows:
F1 - 0.4428571378663266
Precision - 0.4626865671641791
Recall - 0.4246575342465753
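To build intuition for these numbers: ROUGE-1 recall is the fraction of reference unigrams that also appear in the candidate summary, precision is the fraction of candidate unigrams that appear in the reference, and F1 is their harmonic mean. A minimal sketch, using naive whitespace tokenization (the 'rouge' library's tokenization differs slightly):

```python
from collections import Counter

def rouge_1(candidate, reference):
    # Lowercase and split on whitespace -- a simplification of real ROUGE tokenization
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as it appears in both
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(round(p, 4), round(r, 4), round(f, 4))  # → 0.8333 0.8333 0.8333
```

ROUGE-2 works the same way over bigrams, while ROUGE-L scores the longest common subsequence between the two texts.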

As we can see from our ROUGE scores, the above summary did not score as high as the summary generated by our Latent Semantic Analysis model. It missed out on a key point that was a part of the hand-written reference summary. Let us now move on to the next Text Summarization model, which happens to be an Abstractive Text Summarization model, the T5 Transformer model.

T5 Transformer Model

Transformers are a type of neural network architecture, developed by a group of researchers at Google and the University of Toronto in 2017. They avoid the principle of recurrence entirely, relying instead on an attention mechanism to draw global dependencies between the input and the output. Transformers allow for much more parallelization than sequential models, and can achieve very high translation quality even after being trained for only short periods of time. They can also be trained on very large amounts of data without as much difficulty. Read more about 'Text Summarization using Transformers' here.
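The attention mechanism at the heart of the Transformer can be illustrated in a few lines. Below is a sketch of scaled dot-product attention using NumPy, purely for illustration (the models used in this article run on PyTorch internally):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable row-wise softmax: each row of weights sums to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy self-attention over 4 "token" embeddings of size 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
output, weights = scaled_dot_product_attention(x, x, x)
print(output.shape, weights.shape)  # → (4, 8) (4, 4)
```

Each output row is a weighted mix of all value vectors, which is how every token can attend to every other token in a single, fully parallel step.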

The T5 Transformer model (developed by Google AI in 2020) is an encoder-decoder model that can achieve state-of-the-art results on Natural Language Processing (NLP) tasks, while also being flexible enough to be fine-tuned for more specific problems. It frames all such tasks in a text-to-text format, where the input and output are always strings. Let us now generate an abstractive automatic summary using HuggingFace's T5 Transformer model.

First, we install the necessary libraries:

!pip install transformers
!pip install sentencepiece

Note: Once SentencePiece has been installed, our kernel needs to be restarted in order for further lines of code to run successfully.

Next, we import the necessary packages:

import torch
import json
from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration

Now, we define our source material:

source_material = "Consciousness, at its simplest, is sentience or awareness of internal and external existence. Despite millennia of analyses, definitions, explanations and debates by philosophers and scientists, consciousness remains puzzling and controversial, being 'at once, the most familiar and also the most mysterious aspect of our lives'. Perhaps the only widely agreed notion about the topic is the intuition that consciousness exists. Opinions differ about what exactly needs to be studied and explained as consciousness. Sometimes, it is synonymous with the mind, and at other times, an aspect of mind. In the past, it was one's 'inner life', the world of introspection, of private thought, imagination and volition. Today, it often includes any kind of cognition, experience, feeling or perception. It may be awareness, awareness of awareness, or self-awareness either continuously changing or not. There might be different levels or orders of consciousness, or different kinds of consciousness, or just one kind with different features. Other questions include whether only humans are conscious, all animals, or even the whole universe. The disparate range of research, notions and speculations raises doubts about whether the right questions are being asked. Examples of the range of descriptions, definitions or explanations are: simple wakefulness, one's sense of selfhood or soul explored by 'looking within'; being a metaphorical 'stream' of contents, or being a mental state, mental event or mental process of the brain; having phanera or qualia and subjectivity; being the 'something that it is like' to 'have' or 'be' it; being the 'inner theatre' or the executive control system of the mind."

We ensure that our code runs on the CPU:

device = torch.device('cpu')

Now, we define our model and create a tokenizer. In our case, we will be using a pre-trained model to abstractly summarize our source material:

summarizer = T5ForConditionalGeneration.from_pretrained('t5-base')
tokenizer = T5Tokenizer.from_pretrained('t5-base')

In order for our model to know that it should summarize our source material, we prepend the task prefix "summarize:" to the text:

updated_material = "summarize:" + source_material

We now use our tokenizer to encode the updated material:

tokenized_material = tokenizer.encode(updated_material, return_tensors="pt").to(device)

This is what our encoded updated material looks like when we print it:

print(tokenized_material)

Output:
tensor([[21603, 10, 4302, 7, 75, 2936, 655, 6, 44, 165,
3, 21120, 6, 19, 1622, 23, 1433, 42, 4349, 13,
3224, 11, 3866, 6831, 5, 3, 4868, 3293, 35, 29,
23, 9, 13, 15282, 6, 4903, 7, 6, 7295, 7,
11, 5054, 7, 57, 25857, 7, 11, 7004, 6, 13645,
3048, 4353, 5271, 697, 11, 15202, 6, 271, 3, 31,
144, 728, 6, 8, 167, 3324, 11, 92, 8, 167,
15124, 2663, 13, 69, 1342, 31, 5, 5632, 8, 163,
5456, 4686, 9347, 81, 8, 2859, 19, 8, 26207, 24,
13645, 8085, 5, 411, 22441, 7, 7641, 81, 125, 1776,
523, 12, 36, 7463, 11, 5243, 38, 13645, 5, 3921,
6, 34, 19, 30141, 28, 8, 809, 6, 11, 44,
119, 648, 6, 46, 2663, 13, 809, 5, 86, 8,
657, 6, 34, 47, 80, 31, 7, 3, 31, 77,
687, 280, 31, 6, 8, 296, 13, 16, 30113, 106,
6, 13, 1045, 816, 6, 9675, 11, 5063, 4749, 5,
1960, 6, 34, 557, 963, 136, 773, 13, 23179, 4749,
6, 351, 6, 1829, 42, 8136, 5, 94, 164, 36,
4349, 6, 4349, 13, 4349, 6, 42, 1044, 18, 9,
3404, 655, 893, 11721, 2839, 42, 59, 5, 290, 429,
36, 315, 1425, 42, 5022, 13, 13645, 6, 42, 315,
4217, 13, 13645, 6, 42, 131, 80, 773, 28, 315,
753, 5, 2502, 746, 560, 823, 163, 6917, 33, 13381,
6, 66, 3127, 6, 42, 237, 8, 829, 8084, 5,
37, 8378, 342, 620, 13, 585, 6, 9347, 7, 11,
22547, 7, 3033, 7, 3228, 7, 81, 823, 8, 269,
746, 33, 271, 1380, 5, 19119, 13, 8, 620, 13,
15293, 6, 4903, 7, 42, 7295, 7, 33, 10, 650,
7178, 18154, 6, 80, 31, 7, 1254, 13, 1044, 4500,
42, 3668, 15883, 57, 3, 31, 10119, 441, 31, 117,
271, 3, 9, 21253, 1950, 3, 31, 8103, 31, 13,
10223, 6, 42, 271, 3, 9, 2550, 538, 6, 2550,
605, 42, 2550, 433, 13, 8, 2241, 117, 578, 3,
8237, 1498, 42, 546, 5434, 11, 1426, 10696, 117, 271,
8, 3, 31, 23180, 24, 34, 19, 114, 31, 12,
3, 31, 7965, 31, 42, 3, 31, 346, 31, 34,
117, 271, 8, 3, 31, 77, 687, 8516, 31, 42,
8, 4297, 610, 358, 13, 8, 809, 5, 1]])

We now generate our summary.

tokenized_summary = summarizer.generate(tokenized_material,
                                    num_beams=5,
                                    no_repeat_ngram_size=2,
                                    min_length=100,
                                    max_length=120,
                                    early_stopping=True)

Here, we are introducing constraints on the generation: beam search keeps five candidate sequences, no n-gram of size two may repeat, and the output must be between 100 and 120 tokens long, which corresponds to a summary of roughly 100 words.
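The no_repeat_ngram_size=2 constraint means the decoder will never emit a bigram it has already produced. The helper below is a hypothetical illustration of what that constraint rules out, not part of the transformers API:

```python
def has_repeated_ngram(tokens, n=2):
    # Returns True if any n-gram occurs more than once in the sequence --
    # exactly the situation that no_repeat_ngram_size=n forbids during decoding
    seen = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False

print(has_repeated_ngram(["the", "cat", "sat", "the", "cat"]))  # → True
print(has_repeated_ngram(["the", "cat", "sat", "on", "mat"]))   # → False
```

During beam search, any continuation that would create such a repeat is simply assigned zero probability, which prevents the looping output that greedy decoders are prone to.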

Since the summary we obtain after executing the above line of code is still tokenized, or encoded, it looks like so:

print(tokenized_summary)

Output:
tensor([[ 0, 13645, 6, 44, 165, 3, 21120, 6, 19, 1622,
23, 1433, 42, 4349, 13, 3224, 11, 3866, 6831, 3,
5, 3, 3565, 3293, 35, 29, 23, 9, 13, 15282,
6, 4903, 7, 11, 5054, 7, 6, 13645, 3048, 4353,
5271, 697, 11, 15202, 6, 271, 3, 31, 532, 167,
3324, 11, 92, 8, 167, 15124, 2663, 13, 69, 1342,
31, 16, 8, 657, 6, 34, 47, 80, 31, 7,
4723, 280, 6, 8, 296, 13, 16, 30113, 106, 6,
13, 1045, 816, 6, 9675, 11, 5063, 4749, 5, 469,
34, 557, 963, 136, 773, 13, 23179, 4749, 6, 351,
6, 1829, 42, 8136, 5, 132, 429, 36, 315, 1425,
42, 5022, 13, 13645, 1]])

Now, we will decode the above tokenized summary to obtain a text summary:

summary = tokenizer.decode(tokenized_summary[0], skip_special_tokens=True)

And finally, we print our automatically generated summary:

print(summary)

We obtain the following output:
Consciousness, at its simplest, is sentience or awareness of internal and external existence. Despite millennia of analyses, definitions and debates, consciousness remains puzzling and controversial, being 'the most familiar and also the most mysterious aspect of our lives'. In the past, it was one's inner life, the world of introspection, of private thought, imagination and volition. Today it often includes any kind of cognition, experience, feeling or perception. There might be different levels or orders of consciousness.

As we can see, the summary that we obtained is written in different words when compared to that of our source material. This is because the T5 Transformer model is an Abstractive model.

Let us now evaluate our automatically generated summary using ROUGE metrics.

We install the 'Rouge' library:

!pip install rouge

We then import the necessary package:

from rouge import Rouge

We define the hand-written, reference summary:

reference = "Consciousness is essentially the awareness of one's internal and external existence. Despite seeming like a fairly trivial concept, the only notion that seems to be widely agreed upon after millenia of theorizing and debating is the fact that consciousness exists. In the past, consciousness was perceived as one's inner life, the world of introspection, of private thought, imagination and volition. Today, this definition includes any kind of cognition, experience, feeling or perception. The disparate range of research, notions and speculations raises doubts about whether the right questions are being asked."

We then compare our automatically generated summary to the reference summary:

ROUGE = Rouge()
ROUGE.get_scores(summary,reference)

On executing the above statements, we get the following output:
[{'rouge-1': {'r': 0.410958904109589,
'p': 0.46153846153846156,
'f': 0.43478260371245536},
'rouge-2': {'r': 0.2808988764044944,
'p': 0.30864197530864196,
'f': 0.2941176420698962},
'rouge-l': {'r': 0.410958904109589,
'p': 0.46153846153846156,
'f': 0.43478260371245536}}]

Thus, our ROUGE-1 scores are as follows:
F1 - 0.43478260371245536
Precision - 0.46153846153846156
Recall - 0.410958904109589

Our ROUGE-2 scores are as follows:
F1 - 0.2941176420698962
Precision - 0.30864197530864196
Recall - 0.2808988764044944

And our ROUGE-L scores are as follows:
F1 - 0.43478260371245536
Precision - 0.46153846153846156
Recall - 0.410958904109589

Now that we have evaluated all of our automatically generated summaries, let us take a look at how the models that we tested fared against each other when compared to our hand-written reference summary!

Conclusion

  1. Latent Semantic Analysis (LSA) worked best on our source material when evaluated against our reference summary.
  2. TextRank scored the second highest.
  3. The Kullback-Leibler Sum model scored the third highest.
  4. The T5 Transformer model scored the second lowest.
  5. Luhn's Heuristic Method scored the lowest.
Evaluation Metric     LSA      TextRank   T5       KL-Sum   Luhn's Method
ROUGE-1 Precision     0.6527   0.4637     0.4615   0.4776   0.1829
ROUGE-1 Recall        0.6438   0.4383     0.4109   0.4383   0.2054
ROUGE-1 F-Score       0.6482   0.4507     0.4347   0.4571   0.1935
ROUGE-2 Precision     0.4772   0.2967     0.3086   0.2873   0.0086
ROUGE-2 Recall        0.4719   0.3033     0.2808   0.2808   0.0112
ROUGE-2 F-Score       0.4745   0.2999     0.2941   0.2840   0.0097
ROUGE-L Precision     0.6388   0.4492     0.4615   0.4626   0.1585
ROUGE-L Recall        0.6301   0.4246     0.4109   0.4246   0.1780
ROUGE-L F-Score       0.6344   0.4366     0.4347   0.4428   0.1677

The TextRank model, the Kullback-Leibler Sum model, and the T5 Transformer model all scored very similarly, and could have ranked differently on a different input document. Luhn's Heuristic Method, however, scored significantly worse: as one of the earliest Text Summarization methods, it relies on simple word-frequency heuristics that the other models improve upon.

In terms of readability, however, the T5 Transformer model produced the best summary, as it most closely resembles a human-made summary. While a lot of work is still being done in the field of Abstractive Summarization, it is impressive to see how well the pretrained model summarized our source material. Had our source material contained more sentences, it may well have produced a summary with higher ROUGE scores than the Extractive models we evaluated above.

Thanks for reading!
