Reading time: 30 minutes
Text Summarization is the process of creating a compact yet accurate summary of text documents. In this article, we will cover the different text summarization techniques.
What is the need for Text Summarization?
With the ever-growing amount of information available, it is important to have shorter, meaningful summaries that give better structure to the same information.
Forms of Text Summarization
There are two primary approaches to text summarization. They are -

Extractive Summarization

Within this approach, the most relevant sentences in the text document are reproduced as they are in the summary. No new words or phrases are added.

Abstractive Summarization

This approach, on the other hand, focuses on interpreting the text within documents and generating new phrases that best represent the essence of the document.
The Extractive Approach is mainly based on three steps, as described below.
1. Generation of an Intermediate Representation
The text to be summarized is transformed into an intermediate representation, either a topic representation or an indicator representation.
Each kind of representation differs in complexity and has several techniques for performing it.
2. Assign a score to each sentence
The sentence score directly reflects how important the sentence is to the text.
3. Select Sentences for the Summary
The k most relevant sentences are selected for the summary based on several factors, such as eliminating redundancy and preserving the context.
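The three steps above can be sketched in plain Python. The stop-word list, the frequency-based scoring rule, and the sample text are illustrative assumptions; real systems use richer intermediate representations.

```python
from collections import Counter

# Illustrative stop-word list, not an exhaustive one.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "were"}

def summarize(text: str, k: int = 1) -> str:
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # 1. Intermediate representation: frequencies of non-stop-words.
    words = [w for s in sentences for w in s.lower().split() if w not in STOPWORDS]
    freq = Counter(words)
    # 2. Score each sentence by the frequencies of the words it contains.
    def score(sentence: str) -> int:
        return sum(freq[w] for w in sentence.lower().split() if w in freq)
    # 3. Select the k highest-scoring sentences, kept in original order.
    top = sorted(sentences, key=score, reverse=True)[:k]
    return ". ".join(s for s in sentences if s in top) + "."

text = ("There were bad weather conditions in the town. "
        "Subsequently, the roads were impassable.")
print(summarize(text, k=1))
# → There were bad weather conditions in the town.
```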
The Abstractive Approach is mainly based on the following steps -
1. Establishing a context for the text
An Abstractive Approach works similarly to human understanding of text summarization. Thus, the first step is to understand the context of the text.
2. Generating the summary
Words, based on semantic understanding of the text, are either reproduced from the original text or newly generated.
Consider the following example -

Text: There were bad weather conditions in the town. Subsequently, the roads were impassable.

Extractive summary: There were bad weather conditions in the town. Subsequently, the roads were impassable.

Significant words extracted: Bad weather conditions town. Subsequently, roads impassable.

Abstractive summary: Bad weather conditions made town roads impassable.
Methods of Implementation
Following are the text summarization techniques covered in this article:
- Luhn's Heuristic Method
- Edmundson's Heuristic Method
- SumBasic
- KL-Sum Method
- LexRank
- TextRank
- Reduction
- Latent Semantic Analysis
Listed below are some common methods of text summarization, their advantages and disadvantages -
Luhn's Heuristic Method
- Luhn proposed that the frequency of each word in a text document signifies how significant it is to the document.
- Filler words like 'a', 'and', 'the' and the like are ignored, and more importance is assigned to sentences at the beginning of the document.
- The idea is that any sentence with the maximum occurrences of the highest-frequency words is more important to the meaning of the document than the others.
This is one of the earliest approaches to text summarization and is not considered very accurate.
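A simplified sketch of Luhn's heuristic: the filler-word list and the minimum-count threshold are illustrative choices, and the cluster-based scoring of the original paper is reduced here to a single ratio.

```python
from collections import Counter

# Filler words ignored by the heuristic; an illustrative list.
FILLER = {"a", "an", "and", "the", "of", "in", "to", "is", "it", "was"}

def luhn_scores(sentences, min_count=2):
    # Significant words: non-filler words occurring at least min_count times.
    words = [w for s in sentences for w in s.lower().split() if w not in FILLER]
    counts = Counter(words)
    significant = {w for w, c in counts.items() if c >= min_count}
    scores = []
    for s in sentences:
        toks = s.lower().split()
        hits = sum(1 for w in toks if w in significant)
        # Luhn-style score: (significant words)^2 / sentence length.
        scores.append(hits ** 2 / len(toks) if toks else 0.0)
    return scores

sents = ["the cat chased the mouse.",
         "the cat caught the mouse.",
         "dogs bark."]
print(luhn_scores(sents))  # → [0.8, 0.8, 0.0]
```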
Edmundson's Heuristic Method
- This method uses the idea of defining bonus words and stigma words, i.e. words that are of high or low importance respectively.
- Words in the document title are given additional importance.
- It is one of the earlier methods of text summarization, along with Luhn's Method.
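The bonus/stigma/title idea can be sketched as a simple scoring function. The word lists and weights below are illustrative assumptions, not Edmundson's original cue dictionaries.

```python
# Illustrative cue-word lists; a real system would learn or curate these.
BONUS = {"significant", "important", "shows"}
STIGMA = {"perhaps", "unclear", "maybe"}

def edmundson_score(sentence, title):
    toks = set(sentence.lower().replace(".", "").split())
    title_words = set(title.lower().split())
    return (len(toks & BONUS)               # bonus words raise the score
            - len(toks & STIGMA)            # stigma words lower it
            + 2 * len(toks & title_words))  # title words count double

title = "weather report"
s1 = "The study shows significant weather changes."
s2 = "Perhaps results are unclear."
print(edmundson_score(s1, title), edmundson_score(s2, title))  # → 4 -2
```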
SumBasic
- It is generally used for generating multi-document summaries.
- It applies the basic idea of probability, assuming that the high-frequency words in the bag-of-words model of the document have a higher chance of occurring in the summary of the document.
- A probability is assigned to each word on the basis of its term frequency in the document, and these probabilities are updated as sentences are chosen for the summary.
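The method described above is commonly known as SumBasic. A minimal sketch of its update rule, with toy sentences as an assumption: word probabilities come from term frequency, the best-scoring sentence is picked, and the probabilities of its words are squared so later picks favour new content.

```python
from collections import Counter

def sumbasic(sentences, k=2):
    tokenized = [s.lower().replace(".", "").split() for s in sentences]
    counts = Counter(w for toks in tokenized for w in toks)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}
    chosen = []
    while len(chosen) < k and len(chosen) < len(sentences):
        # Score = average probability of the sentence's words.
        best = max((i for i in range(len(sentences)) if i not in chosen),
                   key=lambda i: sum(prob[w] for w in tokenized[i]) / len(tokenized[i]))
        chosen.append(best)
        for w in tokenized[best]:  # squash picked words' probabilities
            prob[w] = prob[w] ** 2
    return [sentences[i] for i in sorted(chosen)]

sents = ["apples are sweet.", "apples are red.", "bananas are yellow."]
print(sumbasic(sents, k=2))
# → ['apples are sweet.', 'bananas are yellow.']
```

Note how the redundant second sentence is skipped: once "apples" has been used, its probability drops, so the bananas sentence wins the second round.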
KL-Sum Method
- This method is based on the concepts of KL divergence and unigram distributions.
- It adds to the summary those sentences that minimize the divergence between the summary vocabulary and the original input vocabulary.
This method has no explicit way of eliminating redundancy.
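A greedy sketch of this KL-divergence criterion: sentences are added one at a time, each time choosing the one that minimizes the divergence between the document's unigram distribution and the summary's. The smoothing constant and the toy sentences are illustrative assumptions.

```python
import math
from collections import Counter

def unigram_dist(tokens, vocab, eps=1e-9):
    # Smoothed unigram distribution over the full vocabulary.
    c = Counter(tokens)
    total = len(tokens)
    return {w: (c[w] + eps) / (total + eps * len(vocab)) for w in vocab}

def kl(p, q):
    # KL divergence D(p || q); both dicts share the same keys.
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def kl_sum(sentences, k=1):
    tokenized = [s.lower().replace(".", "").split() for s in sentences]
    doc_tokens = [w for toks in tokenized for w in toks]
    vocab = set(doc_tokens)
    p_doc = unigram_dist(doc_tokens, vocab)
    summary, summary_tokens = [], []
    while len(summary) < k:
        # Greedily add the sentence that keeps the summary closest to the doc.
        best = min((i for i in range(len(sentences)) if sentences[i] not in summary),
                   key=lambda i: kl(p_doc, unigram_dist(summary_tokens + tokenized[i], vocab)))
        summary.append(sentences[best])
        summary_tokens += tokenized[best]
    return summary

sents = ["the storm hit the town.", "the town flooded.", "cats purr."]
print(kl_sum(sents, k=1))  # → ['the storm hit the town.']
```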
LexRank
As stated by the authors of this algorithm, it is "based on the concept of eigenvector centrality in a graph representation of sentences".
- Within this algorithm, each sentence recommends the sentences similar to it.
- A graph is created with each sentence as a node, connected to its similar sentences (the similarity measure is usually cosine similarity over TF-IDF vectors).
- Sentences with the most recommendations are more likely to be picked for the summary.
- The idea is that any sentence important to the text document will probably be repeated in similar ways, and will thus have a greater number of similar sentences.
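A compact sketch of this graph-based centrality idea: bag-of-words cosine similarity builds the graph (the original algorithm uses TF-IDF weighting and a similarity threshold), and power iteration approximates the eigenvector centrality scores.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two token lists (bag-of-words counts).
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def lexrank(sentences, iters=50):
    toks = [s.lower().replace(".", "").split() for s in sentences]
    n = len(sentences)
    sim = [[cosine(toks[i], toks[j]) for j in range(n)] for i in range(n)]
    # Row-normalize so each sentence spreads its score over its neighbours.
    rows = [[v / sum(row) for v in row] for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iters):  # power iteration toward the centrality vector
        scores = [sum(scores[j] * rows[j][i] for j in range(n)) for i in range(n)]
    return scores

sents = ["the cat sat on the mat.",
         "the cat sat on the rug.",
         "the dog barked."]
print(lexrank(sents))
```

The two mutually similar cat sentences end up with higher centrality than the isolated dog sentence, which is exactly the "recommendation" effect described above.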
TextRank
This algorithm is similar to LexRank but relatively simpler.
- It works on the same basic principle as LexRank, the only difference being the similarity measure used to construct the edges of the graph.
- In this algorithm, the number of common words measures the similarity between sentences.
- While LexRank can be applied to multiple documents, TextRank is primarily used for single documents.
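The common-word similarity can be sketched as below; normalizing the overlap by the log of the sentence lengths follows the original TextRank paper, while the tokenization is an illustrative simplification. The ranking step itself reuses the same power-iteration centrality as LexRank.

```python
import math

def textrank_similarity(s1, s2):
    # Overlap of distinct words, normalized by the log sentence lengths.
    w1 = set(s1.lower().replace(".", "").split())
    w2 = set(s2.lower().replace(".", "").split())
    common = len(w1 & w2)
    norm = math.log(len(w1)) + math.log(len(w2))
    return common / norm if norm else 0.0

a = "the cat sat on the mat."
b = "the cat slept on the rug."
print(round(textrank_similarity(a, b), 2))  # → 0.93
```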
Reduction
- This method also works on the idea of graph-based modelling of the text document.
- It assigns importance to each sentence in accordance with the sum of the weights of its edges to other sentences.
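A sketch of this edge-weight-sum scoring, using Jaccard word overlap as an illustrative choice of edge weight:

```python
def jaccard(s1, s2):
    # Jaccard overlap between the word sets of two sentences.
    a = set(s1.lower().replace(".", "").split())
    b = set(s2.lower().replace(".", "").split())
    return len(a & b) / len(a | b)

def reduction_scores(sentences):
    # Each sentence's score is the sum of its edge weights to all others.
    return [sum(jaccard(s, other) for other in sentences if other is not s)
            for s in sentences]

sents = ["the cat sat on the mat.",
         "the cat sat on the rug.",
         "markets rallied today."]
print(reduction_scores(sents))
```

The off-topic third sentence shares no words with the others, so its edge weights, and hence its score, are zero.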
Latent Semantic Analysis
- It works on the principle of term frequency combined with singular value decomposition (SVD).
- The idea is to resolve the document space to a "concept space", meaning the document is broken down into the actual underlying concept and comparisons are made within that space.
- This is a more complicated method as compared to others.
Text Summarization finds a wide variety of applications in the creation of headlines, synopses, reviews, book, movie and play summaries, resumes, and so on.