Reading time: 30 minutes
Text Summarization is the process of creating a compact yet accurate summary of text documents. In this article, we will cover the different text summarization techniques.
What is the need for Text Summarization?
With the ever-growing amount of information available, it is important to have shorter, meaningful summaries that give better structure to the same information.
Forms of Text Summarization
There are two primary approaches towards text summarization. They are -
- Extractive
Within this approach, the most relevant sentences in the text document are reproduced verbatim in the summary. No new words or phrases are added.
- Abstractive
This approach, on the other hand, focuses on interpreting the text within documents and generating new phrases that best represent the essence of the document.
Extractive Approach
The Extractive Approach is mainly based on three independent steps as described below.
1. Generation of an Intermediate Representation
The text to be summarized is converted into an intermediate form, either a topic representation or an indicator representation.
Each kind of representation differs in complexity, and there are several techniques for constructing it.
2. Assign a score to each sentence
The Sentence Score directly implies how important the sentence is to the text.
3. Select Sentences for the Summary
The k most relevant sentences are selected for the summary, based on factors such as eliminating redundancy and preserving the context.
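As a minimal illustration of these three steps, the sketch below uses plain Python with raw word frequencies as the intermediate representation and a tiny hand-picked stop-word list; a real system would use a proper tokenizer and a full stop-word list.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "was", "were", "it"}

def extractive_summary(text, k=2):
    # Step 1: intermediate representation -- here, simple word frequencies
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    freq = Counter(words)

    # Step 2: score each sentence by the frequencies of the words it contains
    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z]+", sentence.lower())
                  if w not in STOP_WORDS]
        return sum(freq[w] for w in tokens) / (len(tokens) or 1)

    # Step 3: pick the k highest-scoring sentences, kept in their original order
    chosen = set(sorted(sentences, key=score, reverse=True)[:k])
    return " ".join(s for s in sentences if s in chosen)

text = ("There were bad weather conditions in the town. "
        "Subsequently, the roads were impassable.")
print(extractive_summary(text, k=1))
```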
Abstractive Approach
The Abstractive Approach is mainly based on the following steps -
1. Establishing a context for the text
The Abstractive Approach works similarly to the way a human summarizes text. Thus, the first step is to understand the context of the text.
2. Semantics
Based on a semantic understanding of the text, words and phrases are either reproduced from the original text or newly generated.
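In practice, abstractive summaries are usually generated with sequence-to-sequence neural models rather than hand-written rules. As a rough illustration, assuming the Hugging Face transformers library is installed (the pipeline downloads a pretrained model on first use, and the exact wording of the output depends on that model):

```python
from transformers import pipeline

# Generic summarization pipeline; the default pretrained model is downloaded on first use
summarizer = pipeline("summarization")

text = ("There were bad weather conditions in the town. "
        "Subsequently, the roads were impassable.")

# Generates a short, newly worded summary rather than copying sentences verbatim
print(summarizer(text, max_length=20, min_length=5, do_sample=False)[0]["summary_text"])
```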
Example
Text: There were bad weather conditions in the town. Subsequently, the roads were impassable.
Extractive Approach (content reproduced verbatim, at the sentence or phrase level):
There were bad weather conditions in the town. Subsequently, the roads were impassable.
Bad weather conditions town. Subsequently, roads impassable.
Abstractive Approach (new phrasing that captures the meaning):
Bad weather conditions made town roads impassable.
Methods of Implementation
Following are the text summarization techniques:
- Luhn's Heuristic Method
- Edmundson's Heuristic Method
- SumBasic
- KL-Sum
- LexRank
- TextRank
- Reduction
- Latent Semantic Analysis
Each of these methods is discussed below, along with its advantages and drawbacks and a short Python sketch of the core idea -
Luhn's Heuristic Method
- Luhn proposed that the frequency of a word in a text document indicates how significant it is to the document.
- Filler words such as 'a', 'and', and 'the' are ignored, and sentences appearing earlier in the document are given more importance.
- The idea is that a sentence containing many occurrences of the highest-frequency words contributes more to the meaning of the document than other sentences.
This is one of the earliest approaches to text summarization and is not considered very accurate.
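A simplified sketch of Luhn-style scoring is given below; it treats the whole sentence as the scoring window and the top-N most frequent non-stop words as the "significant" words, both of which are simplifications of the original method.

```python
import re
from collections import Counter

def luhn_scores(sentences, stop_words, top_n=5):
    words = [w for s in sentences
             for w in re.findall(r"[a-z]+", s.lower()) if w not in stop_words]
    # "Significant" words: the most frequent non-stop words in the document
    significant = {w for w, _ in Counter(words).most_common(top_n)}

    scores = []
    for s in sentences:
        tokens = re.findall(r"[a-z]+", s.lower())
        hits = sum(1 for w in tokens if w in significant)
        # Luhn's measure: (number of significant words)^2 / window length,
        # simplified here by using the whole sentence as the window
        scores.append(hits ** 2 / (len(tokens) or 1))
    return scores
```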
Edmundson's Heuristic Method
- This method uses the idea of defining bonus words and stigma words: words of high and low importance, respectively.
- Words in the document title are given additional importance.
- It is one of the earlier methods of text summarization, along with Luhn's Method.
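A minimal sketch of the bonus/stigma/title idea is shown below; the word sets and weights in the usage example are illustrative placeholders, and the full method also uses cue and location features with tuned weights.

```python
import re

def edmundson_scores(sentences, bonus_words, stigma_words, title_words,
                     w_bonus=1.0, w_stigma=1.0, w_title=1.0):
    scores = []
    for s in sentences:
        tokens = set(re.findall(r"[a-z]+", s.lower()))
        # Reward bonus and title words, penalise stigma words
        scores.append(w_bonus * len(tokens & bonus_words)
                      - w_stigma * len(tokens & stigma_words)
                      + w_title * len(tokens & title_words))
    return scores

print(edmundson_scores(
    ["The roads were impassable.", "However, some paths stayed open."],
    bonus_words={"impassable"}, stigma_words={"however"}, title_words={"roads"}))
# -> [2.0, -1.0]
```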
SumBasic
- It is generally used for generating multi-document summaries.
- It applies the basic idea of probability, assuming that the high-frequency words in the bag-of-words model of the document have a higher possibility of occurring in the summary of the document.
- Probabilities are assigned to each word on the basis of their term frequency in the document, and these probabilities are updated as sentences are chosen for the summary.
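The sketch below follows this idea in plain Python; note that full SumBasic first picks the highest-probability word and then the best sentence containing it, which is simplified here to picking the sentence with the highest average word probability.

```python
import re
from collections import Counter

def sumbasic(sentences, k=2):
    tokenize = lambda s: re.findall(r"[a-z]+", s.lower())
    words = [w for s in sentences for w in tokenize(s)]
    prob = {w: c / len(words) for w, c in Counter(words).items()}

    summary = []
    while len(summary) < min(k, len(sentences)):
        # Pick the remaining sentence with the highest average word probability
        best = max((s for s in sentences if s not in summary),
                   key=lambda s: sum(prob[w] for w in tokenize(s))
                   / max(len(tokenize(s)), 1))
        summary.append(best)
        # Square the probabilities of covered words so later picks add new content
        for w in tokenize(best):
            prob[w] = prob[w] ** 2
    return summary
```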
KL-Sum
- This method is based on the concept of KL Divergence and Unigram distribution.
- It adds to the summary those sentences that minimize the divergence between the summary vocabulary and the original input vocabulary.
This method has no explicit way of eliminating redundancy.
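A greedy sketch of this idea is shown below, assuming unigram distributions and a small smoothing constant for words missing from the candidate summary.

```python
import math
import re
from collections import Counter

def kl_sum(sentences, k=2, eps=1e-9):
    tokenize = lambda s: re.findall(r"[a-z]+", s.lower())
    doc_words = [w for s in sentences for w in tokenize(s)]
    vocab = set(doc_words)
    doc_dist = {w: c / len(doc_words) for w, c in Counter(doc_words).items()}

    def divergence(summary_sentences):
        # KL divergence of the candidate summary's unigram distribution
        # from the document's unigram distribution
        words = [w for s in summary_sentences for w in tokenize(s)]
        dist = {w: c / len(words) for w, c in Counter(words).items()}
        return sum(dist.get(w, eps) * math.log(dist.get(w, eps) / doc_dist[w])
                   for w in vocab)

    summary = []
    while len(summary) < min(k, len(sentences)):
        # Greedily add the sentence that keeps the divergence lowest
        best = min((s for s in sentences if s not in summary),
                   key=lambda s: divergence(summary + [s]))
        summary.append(best)
    return summary
```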
LexRank
As stated by the authors of this algorithm, it is "based on the concept of eigenvector centrality in a graph representation of sentences".
- Within this algorithm, each sentence recommends sentences similar to it.
- A graph is created with each node being a sentence, connected to the sentences similar to it (the similarity measure is usually cosine similarity, often over TF-IDF vectors).
- Sentences with the most recommendations are more likely to be picked for the summary.
- The idea is that any sentence important to the text document will probably be restated in similar ways, and will therefore have a larger number of similar sentences.
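A sketch of continuous LexRank is shown below, using plain term-frequency cosine similarity and a fixed number of power-iteration steps; the original algorithm also applies a similarity threshold, which is omitted here.

```python
import math
import re
from collections import Counter
import numpy as np

def tf_cosine(s1, s2):
    # Cosine similarity between simple term-frequency vectors of two sentences
    v1 = Counter(re.findall(r"[a-z]+", s1.lower()))
    v2 = Counter(re.findall(r"[a-z]+", s2.lower()))
    num = sum(v1[w] * v2[w] for w in set(v1) & set(v2))
    den = (math.sqrt(sum(c * c for c in v1.values()))
           * math.sqrt(sum(c * c for c in v2.values())))
    return num / den if den else 0.0

def lexrank(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    sim = np.array([[tf_cosine(a, b) for b in sentences] for a in sentences])
    np.fill_diagonal(sim, 0.0)                           # drop self-loops
    row_sums = sim.sum(axis=1, keepdims=True)
    matrix = sim / np.where(row_sums == 0, 1, row_sums)  # row-stochastic
    scores = np.full(n, 1.0 / n)
    for _ in range(iterations):
        # PageRank-style update: eigenvector centrality with damping
        scores = (1 - damping) / n + damping * matrix.T @ scores
    return scores  # one centrality score per sentence
```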
TextRank
This algorithm is similar to LexRank but relatively simpler.
- This algorithm works on the same basic principle as LexRank, the only difference being the similarity measure used to construct the edges of the graph.
- In this algorithm, the similarity between two sentences is measured by the number of words they have in common.
- While LexRank can be applied to multiple documents, TextRank is primarily used for single documents.
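Only the similarity function changes compared with the LexRank sketch above; the same power-iteration ranking step can be reused. The normalisation below is a close variant of the measure in the TextRank paper, which divides the overlap by the log of the two sentence lengths.

```python
import math
import re

def textrank_similarity(s1, s2):
    w1 = set(re.findall(r"[a-z]+", s1.lower()))
    w2 = set(re.findall(r"[a-z]+", s2.lower()))
    if len(w1) < 2 or len(w2) < 2:
        return 0.0
    # Shared words, normalised by the log lengths of the two sentences
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))
```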
Reduction
- This method also works on the idea of graph-based modelling of the text document.
- It assigns importance to each sentence according to the sum of the weights of its edges to the other sentences.
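A compact sketch, which can reuse any pairwise similarity function such as the tf_cosine helper from the LexRank sketch above:

```python
def reduction_scores(sentences, similarity):
    # Each sentence's importance is the sum of the weights of its edges
    # to every other sentence in the graph
    return [sum(similarity(s, other) for other in sentences if other is not s)
            for s in sentences]
```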
Latent Semantic Analysis
- It works on the principle of Term Frequency along with Singular Value Decomposition.
- The idea is to reduce the document space to a "concept space": the document is broken down into its underlying concepts, and comparisons are made within that space.
- This is a more complicated method as compared to others.
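The sketch below builds a term-by-sentence frequency matrix, applies SVD with NumPy, and selects one sentence per top concept in the style of Gong and Liu; real implementations usually weight terms (for example with TF-IDF) and use more careful selection strategies.

```python
import re
import numpy as np

def lsa_summary(sentences, k=2):
    # Term-by-sentence frequency matrix
    vocab = sorted({w for s in sentences for w in re.findall(r"[a-z]+", s.lower())})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w in re.findall(r"[a-z]+", s.lower()):
            A[index[w], j] += 1

    # SVD: rows of Vt describe how strongly each sentence expresses each concept
    _, _, Vt = np.linalg.svd(A, full_matrices=False)

    # For each of the top k concepts, keep the sentence that expresses it most strongly
    chosen = []
    for concept in Vt[:k]:
        best = int(np.argmax(np.abs(concept)))
        if best not in chosen:
            chosen.append(best)
    return [sentences[i] for i in sorted(chosen)]
```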
Applications
Text Summarization finds a wide variety of applications in the creation of headlines, synopses, reviews, book, movie and play summaries, resumes, and so on.