Reading time: 35 minutes | Coding time: 15 minutes
Introduction
TextRank is a technique used in Natural Language Processing to generate document summaries.
It takes an extractive approach and is an unsupervised, graph-based text summarization technique.
Overview of PageRank
PageRank is an algorithm used to calculate the rank of web pages, and is used by search engines such as Google.
TextRank is based on the PageRank Algorithm.
This algorithm gets its name from Larry Page, one of the co-founders of Google.
Algorithm
- The rank/importance of a page is decided by the number and quality of links to that page.
- The algorithm runs several iterations over the pages until the rank values converge to final values, as in the sketch below.
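To make this concrete, here is a minimal, self-contained sketch of PageRank's power iteration on a hypothetical four-page web (the link structure and the damping factor 0.85 are illustrative assumptions, not real data):

import numpy

# links[i][j] = 1 means page j links to page i (hypothetical 4-page web)
links = numpy.array([[0, 1, 1, 0],
                     [1, 0, 0, 1],
                     [1, 1, 0, 1],
                     [0, 0, 1, 0]], dtype=float)

# each page splits its rank evenly among its out-links (columns sum to 1)
transition = links / links.sum(axis=0)

damping = 0.85
n = links.shape[0]
ranks = numpy.full(n, 1.0 / n)  # start with equal rank for every page

for _ in range(50):  # iterate until the values settle
    ranks = (1 - damping) / n + damping * transition.dot(ranks)

print(ranks)  # higher value = more important page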
How does TextRank work?
The basic steps involved in the TextRank algorithm are as follows -
Step 1
Extract all the sentences from the text document, for example by splitting at full stops, or in any other way in which you wish to define your sentences.
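For illustration, a minimal splitter might look like this (a naive sketch; a real system would use a proper sentence tokenizer such as NLTK's sent_tokenize, since full stops also appear inside abbreviations):

import re

text = "A programming language comprises a set of instructions. It is used to produce various kinds of output."

# naive: split wherever a full stop is followed by whitespace or end of text
sentences = [s.strip() for s in re.split(r"\.\s*", text) if s.strip()]
print(sentences)
# ['A programming language comprises a set of instructions',
#  'It is used to produce various kinds of output']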
Step 2
A Graph is created out of the sentences extracted in Step 1.
The nodes represent the sentences, while the weight on the edge between two nodes is found by using a similarity function, such as Cosine Similarity or Jaccard Similarity.
You can define your own similarity metric for this purpose.
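For example, Cosine Similarity between two sentence vectors can be computed in a few lines of plain Python (a standalone sketch, separate from the library code shown later):

import math

def cosine_similarity(v1, v2):
    """cos(v1, v2) = dot(v1, v2) / (|v1| * |v2|)"""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # a zero vector has no direction
    return dot / (norm1 * norm2)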
Step 3
This step involves finding the importance (score) of each node by iterating the algorithm until convergence, i.e. until the scores stop changing between iterations.
Note: It can be said that this step applies the PageRank algorithm to the document, the only difference being that the nodes are sentences instead of web pages; the power-iteration sketch shown earlier is exactly this computation.
Step 4
The sentences are sorted in descending order of their scores, and the first k sentences are chosen to form the text summary.
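In code, this selection step is a one-liner over the score dictionary (the scores below are made-up values for illustration):

# hypothetical sentence scores produced by the previous step
scores = {"Sentence A": 0.41, "Sentence B": 0.18, "Sentence C": 0.29}

k = 2
summary = sorted(scores, key=scores.get, reverse=True)[:k]
print(summary)  # ['Sentence A', 'Sentence C']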
Example
Let us find the summary for the text given below -
A programming language comprises a set of instructions. It is used to produce various kinds of output. It is a language which involves a computer performing some kind of instructions or algorithm to produce some form of output.
Step 1 - Divide the text into sentences
1 - A programming language comprises a set of instructions
2 - It is used to produce various kinds of output
3 - It is a language which involves a computer performing some kinds of instructions or algorithm to produce some kinds of output
Step 2 - Find vector form of the sentences
word | tf 1 | tf 2 | tf 3 | idf | tf-idf 1 | tf-idf 2 | tf-idf 3 |
---|---|---|---|---|---|---|---|
a | 0.25 | 0 | 0.1 | 0.176 | 0.044 | 0 | 0.0176 |
programming | 0.125 | 0 | 0 | 0.477 | 0.06 | 0 | 0 |
language | 0.125 | 0 | 0.0476 | 0.176 | 0.022 | 0 | 0.0084 |
comprises | 0.125 | 0 | 0 | 0.477 | 0.06 | 0 | 0 |
set | 0.125 | 0 | 0 | 0.477 | 0.06 | 0 | 0 |
of | 0.125 | 0.11 | 0.1 | 0 | 0 | 0 | 0 |
instructions | 0.125 | 0 | 0.0476 | 0.176 | 0.022 | 0 | 0.0084 |
it | 0 | 0.11 | 0.0476 | 0.176 | 0 | 0.005 | 0.0084 |
is | 0 | 0.11 | 0.0476 | 0.176 | 0 | 0.005 | 0.0084 |
used | 0 | 0.11 | 0 | 0.477 | 0 | 0.05 | 0 |
to | 0 | 0.11 | 0 | 0.477 | 0 | 0.05 | 0 |
produce | 0 | 0.11 | 0.0476 | 0.176 | 0 | 0.005 | 0.0084 |
various | 0 | 0.11 | 0 | 0.477 | 0 | 0.05 | 0 |
kinds | 0 | 0.11 | 0.0476 | 0.176 | 0 | 0.005 | 0.0084 |
output | 0 | 0.11 | 0.0476 | 0.176 | 0 | 0.005 | 0.0084 |
which | 0 | 0 | 0.0476 | 0.477 | 0 | 0 | 0.023 |
involves | 0 | 0 | 0.0476 | 0.477 | 0 | 0 | 0.023 |
computer | 0 | 0 | 0.0476 | 0.477 | 0 | 0 | 0.023 |
performing | 0 | 0 | 0.0476 | 0.477 | 0 | 0 | 0.023 |
some | 0 | 0 | 0.1 | 0.477 | 0 | 0 | 0.0477 |
or | 0 | 0 | 0.0476 | 0.477 | 0 | 0 | 0.023 |
algorithm | 0 | 0 | 0.0476 | 0.477 | 0 | 0 | 0.023 |
Sentence | Vector |
---|---|
1 | [0.044, 0.06, 0.022, 0.06, 0.06, 0, 0.022, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] |
2 | [0, 0, 0, 0, 0, 0, 0, 0.005, 0.005, 0.005, 0.005, 0.005, 0.005, 0.005, 0.005, 0, 0, 0, 0, 0, 0, 0] |
3 | [0.0176, 0, 0.0084, 0, 0, 0, 0.0084, 0.0084, 0.0084, 0, 0, 0.0084, 0, 0.0084, 0.0084, 0.023, 0.023, 0.023, 0.023, 0.0477, 0.023, 0.023] |
Step 3 - Compute similarity between the sentences
We use Cosine Similarity to do so. (For simplicity, the off-diagonal entries below are the raw dot products of the tf-idf vectors, i.e. the numerator of the cosine formula; each diagonal entry is 1, since a sentence is always maximally similar to itself.)
Sentence | 1 | 2 | 3 |
---|---|---|---|
1 | 1 | 0 | 0.0011 |
2 | 0 | 1 | 0.00021 |
3 | 0.0011 | 0.00021 | 1 |
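You can reproduce the off-diagonal entries directly from the sentence vectors of Step 2:

v1 = [0.044, 0.06, 0.022, 0.06, 0.06, 0, 0.022] + [0] * 15
v2 = [0] * 7 + [0.005] * 8 + [0] * 7
v3 = [0.0176, 0, 0.0084, 0, 0, 0, 0.0084, 0.0084, 0.0084, 0, 0,
      0.0084, 0, 0.0084, 0.0084, 0.023, 0.023, 0.023, 0.023, 0.0477, 0.023, 0.023]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(round(dot(v1, v3), 4))   # 0.0011
print(round(dot(v2, v3), 5))   # 0.00021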
Step 4 - Form a graph with sentences as nodes and edges represented by the similarity metric
    1         2
     \       /
      \     /
0.0011 \   / 0.00021
        \ /
         3
Step 5 - Apply PageRank to the above graph
- According to the PageRank algorithm, we can treat the sentences (nodes) as webpages and the weighted edges as links between the webpages.
- Applying PageRank to the above graph shows that node 3 is the most important (highest-ranked) sentence, since it is connected to both of the other nodes.
Thus, as is intuitive, sentence 3 is the one most likely to be included in the summary of this text.
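For a quick check, the same conclusion can be reproduced with networkx (used here for illustration only; the implementation below does not depend on it):

import networkx as nx

# build the weighted graph from Step 4
graph = nx.Graph()
graph.add_edge(1, 3, weight=0.0011)
graph.add_edge(2, 3, weight=0.00021)

# PageRank with the usual damping factor; edge weights steer the random walk
scores = nx.pagerank(graph, alpha=0.85, weight="weight")
print(scores)  # node 3 receives the highest score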
TextRank Implementation
- Import the necessary libraries. (This walkthrough follows the TextRankSummarizer from the sumy library; NumPy is imported defensively so a missing installation can be reported cleanly later.)
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals

import math

try:
    import numpy
except ImportError:
    numpy = None  # checked later by _ensure_dependencies_installed()

from ._summarizer import AbstractSummarizer  # relative import from within the sumy package
Next, we define the summarizer class and its constants. During preprocessing, the data is cleaned by eliminating stopwords like is, in, the, etc.
- frozenset() is simply an immutable version of the Python set. Here we define the frozenset _stop_words, which starts out empty.
class TextRankSummarizer(AbstractSummarizer):
    epsilon = 1e-4
    damping = 0.85
    # small number to prevent zero-division error, see https://github.com/miso-belica/sumy/issues/112
    _delta = 1e-7
    _stop_words = frozenset()
- self refers to the current instance of the class.
- We define a read-only property stop_words which returns the frozenset object _stop_words.
    @property
    def stop_words(self):
        return self._stop_words
- We define a setter for the same property using the @stop_words.setter decorator, so stop_words can be both read and assigned.
- The map function applies normalize_word to every word in the iterable words, and the result is stored in the frozenset _stop_words.
    @stop_words.setter
    def stop_words(self, words):
        self._stop_words = frozenset(map(self.normalize_word, words))
- The __call__ method is the entry point of the summarizer. It first invokes _ensure_dependencies_installed(), which raises an exception in case the required library (NumPy) is not installed, then rates the sentences and returns the best ones.
    def __call__(self, document, sentences_count):
        self._ensure_dependencies_installed()

        if not document.sentences:
            return ()

        ratings = self.rate_sentences(document)
        return self._get_best_sentences(document.sentences, sentences_count, ratings)

    @staticmethod
    def _ensure_dependencies_installed():
        if numpy is None:
            raise ValueError("TextRank summarizer requires NumPy. Please, install it by command 'pip install numpy'.")
We use the _create_matrix() method to build the graph of Step 2, represented as a weighted adjacency matrix.
    def _create_matrix(self, document):
        """Create a stochastic matrix for TextRank.

        Element at row i and column j of the matrix corresponds to the
        similarity of sentence i and j, where the similarity is computed
        as the number of common words between them, divided by the sum of
        the logarithms of their lengths. After such a matrix is created,
        it is turned into a stochastic matrix by normalizing over the rows,
        i.e. making the rows sum to one. TextRank uses the PageRank
        algorithm with damping, so a damping factor is incorporated as
        explained in TextRank's paper. The resulting matrix is a stochastic
        matrix ready for the power method.
        """
        sentences_as_words = [self._to_words_set(sent) for sent in document.sentences]
        sentences_count = len(sentences_as_words)
        weights = numpy.zeros((sentences_count, sentences_count))

        # pairwise edge weights (the graph from Step 2)
        for i, words_i in enumerate(sentences_as_words):
            for j, words_j in enumerate(sentences_as_words):
                weights[i, j] = self._rate_sentences_edge(words_i, words_j)

        # delta added to prevent zero-division error
        weights /= (weights.sum(axis=1)[:, numpy.newaxis] + self._delta)

        # incorporate damping: a uniform jump term plus the damped similarity weights
        return numpy.full((sentences_count, sentences_count), (1.0 - self.damping) / sentences_count) \
            + self.damping * weights
    def _to_words_set(self, sentence):
        words = map(self.normalize_word, sentence.words)
        return [self.stem_word(w) for w in words if w not in self._stop_words]
We use the rate_sentences() method to find the importance (score) of each sentence in the document.
- matrix.T returns the transpose of the matrix
- numpy.array() creates an array
- numpy.dot() returns the dot product of two arrays
- numpy.linalg.norm() returns a matrix or vector norm depending on the value of its ord parameter; with the default it returns the Euclidean norm
- The input array for norm() is obtained by subtracting p_vector element-wise from next_p, so the norm measures how much the scores changed in one iteration
    def rate_sentences(self, document):
        matrix = self._create_matrix(document)
        ranks = self.power_method(matrix, self.epsilon)
        return {sent: rank for sent, rank in zip(document.sentences, ranks)}

    @staticmethod
    def power_method(matrix, epsilon):
        transposed_matrix = matrix.T
        sentences_count = len(matrix)
        # start with a uniform score for every sentence
        p_vector = numpy.array([1.0 / sentences_count] * sentences_count)
        lambda_val = 1.0

        # iterate until the scores change by less than epsilon between steps
        while lambda_val > epsilon:
            next_p = numpy.dot(transposed_matrix, p_vector)
            lambda_val = numpy.linalg.norm(numpy.subtract(next_p, p_vector))
            p_vector = next_p

        return p_vector
We use the _rate_sentences_edge() method to compute the weight of the edge between two sentences: the number of words they have in common, normalized by the logarithms of the sentence lengths. (This is analogous to counting links between pages in the PageRank algorithm.)
    @staticmethod
    def _rate_sentences_edge(words1, words2):
        rank = 0
        for w1 in words1:
            for w2 in words2:
                rank += int(w1 == w2)

        if rank == 0:
            return 0.0

        assert len(words1) > 0 and len(words2) > 0
        norm = math.log(len(words1)) + math.log(len(words2))
        if numpy.isclose(norm, 0.):
            # This should only happen when words1 and words2 only have a single word.
            # Thus, rank can only be 0 or 1.
            assert rank in (0, 1)
            return rank * 1.0
        else:
            return rank / norm
Summary
When the summarizer runs, its __call__() method is invoked.
- __call__() calls _ensure_dependencies_installed(), followed by rate_sentences()
- rate_sentences() calls _create_matrix(), which in turn creates the graph
- _create_matrix() calls the supplementary methods _to_words_set() and _rate_sentences_edge() to assign weights to the edges in the graph
- When execution returns to rate_sentences(), it calls power_method(), which iterates until the scores converge and yields the final ranking of the sentences
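To close the loop, here is a minimal usage sketch of this summarizer through the sumy package (assuming sumy and the NLTK tokenizer data it relies on are installed, e.g. pip install sumy):

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

text = ("A programming language comprises a set of instructions. "
        "It is used to produce various kinds of output. "
        "It is a language which involves a computer performing some kind "
        "of instructions or algorithm to produce some form of output.")

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = TextRankSummarizer()

# pick the single highest-ranked sentence as the summary
for sentence in summarizer(parser.document, 1):
    print(sentence)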