Reading time: 30 minutes | Coding time: 10 minutes
In this article, we will work on Luhn's heuristic method for text summarization, one of the earliest approaches to automatic text summarization. Luhn proposed that the frequency of a word in a document signifies how important it is. The idea is that the highest-frequency words (stopwords) and the lowest-frequency words contribute little to the meaning of the document, while sentences with many occurrences of the remaining significant words are the most representative. It is, however, not considered a very accurate approach.
Introduction
Luhn's algorithm is a frequency-based approach (the same idea that TF-IDF later builds on). It selects only the words of higher importance as per their frequency, and higher weights are assigned to the words present at the beginning of the document. It considers the words lying in the middle (shaded) region of the word-frequency distribution: the region on the right contains the highest-occurring words while the region on the left contains the least-occurring words, and both extremes are discarded. Luhn introduced the following criteria during text preprocessing:
- Removing stopwords
- Stemming (Likes->Like)
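As a quick illustration of this preprocessing, here is a minimal sketch; the stopword set and the crude suffix-stripping rule are illustrative assumptions, not Luhn's original lists:

# Minimal preprocessing sketch: stopword removal and naive stemming.
# The stopword set and the suffix rule below are illustrative assumptions.
STOPWORDS = {"a", "an", "the", "is", "and", "of", "to", "in"}

def preprocess(sentence):
    tokens = [w.strip(".!?,()").lower() for w in sentence.split()]
    tokens = [w for w in tokens if w and w not in STOPWORDS]      # remove stopwords
    tokens = [w[:-1] if w.endswith("s") else w for w in tokens]   # crude stemming: likes -> like
    return tokens

print(preprocess("The cat likes the small birds."))  # ['cat', 'like', 'small', 'bird']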
In this method we select the sentences with the highest concentration of salient content terms. For example, suppose a sentence contains 10 words, of which 4 are significant. To calculate the significance, instead of dividing the number of significant words by the total number of words, we divide the square of the number of significant words by the span that contains them. Taking the span to be the whole 10-word sentence, the score obtained from our example would be 4² / 10 = 1.6.
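A tiny sketch of this calculation, using the numbers from the example above:

# Luhn significance score: (significant words in span)^2 / (words in span)
significant_in_span = 4
span_length = 10
score = significant_in_span ** 2 / span_length
print(score)  # 1.6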
Application
Luhn's method works best under the following assumptions (a minimal frequency-filtering sketch follows this list):
- Words with too low a frequency are not significant
- Words with too high a frequency are also not significant (e.g. "is", "and")
- Removing low-frequency words is easy:
  - set a minimum frequency threshold
- Removing common (high-frequency) words is done by:
  - setting a maximum frequency threshold (statistically obtained)
  - comparing against a common-word list
- It is typically used for summarizing technical documents.
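Below is a minimal sketch of this frequency-based filtering. The thresholds and the small common-word list are illustrative assumptions for demonstration, not values prescribed by Luhn:

from collections import Counter

# Illustrative assumptions: the thresholds and common-word list are not Luhn's originals.
MIN_FREQ = 2                                       # words rarer than this are dropped
COMMON_WORDS = {"the", "is", "and", "a", "an", "of"}

def significant_words(text, max_freq_ratio=0.1):
    words = [w.strip(".!?,()").lower() for w in text.split()]
    counts = Counter(w for w in words if w)
    max_freq = max_freq_ratio * len(words)         # statistically derived upper threshold
    return {w for w, c in counts.items()
            if MIN_FREQ <= c <= max_freq and w not in COMMON_WORDS}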
Algorithm
Luhn's method is a simple technique for generating a summary from a given document. The algorithm can be implemented in two stages.
In the first stage, we determine which words are more significant towards the meaning of the document. Luhn states that this is done by performing a frequency analysis and then picking out the words that are significant, i.e. frequent in the document but not merely common English words.
In the second stage, we score each sentence by how concentrated these significant words are within it and select the highest-scoring sentences for the summary. The method usually consists of the following three steps:
- It begins with transforming the content of each sentence into a mathematical representation, or vector (for example a binary bag-of-words representation). The bag of words ignores all the filler words: filler words are the supporting stopwords, such as "an" and "a", that have no real impact on the meaning of the document. We then count all the valuable words that are left; stopwords are simply not considered during this evaluation (see the counting sketch after this list).
- In this step, we evaluate sentences using a sentence-scoring technique. The score used is (number of significant words in the span)² divided by (total number of words in the span). A span here refers to the portion of the sentence (or of the document, in the general case) that contains all of its significant words. TF-IDF can also be used to prioritize the words in a sentence.
- Once the sentence scoring is complete, the last step is simply to select those sentences with the highest overall rankings.
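As referenced in the first step above, here is a minimal bag-of-words counting sketch; the sample sentence and the stopword list are illustrative assumptions:

from collections import Counter

STOPWORDS = {"a", "an", "the", "is", "and", "of"}   # illustrative stopword list

sentence = "OpenGenus is an open source organization and a community"
words = [w.strip(".!?,()").lower() for w in sentence.split()]
bag = Counter(w for w in words if w not in STOPWORDS)
print(bag)   # counts of the content words; 'is', 'an', 'and', 'a' are excluded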
Working
The first step is to load a list of common English words (stopwords) that will be ignored during the frequency analysis. Let us assume we have a file words.txt containing one common word per line.

def load_common_words():
    '''Load the common-word (stopword) list, one word per line, lowercased.'''
    f = open("words.txt", "r")
    common_words = set()
    for word in f:
        common_words.add(word.strip('\n').lower())
    f.close()
    return common_words
In the next step, we find the most common words in the document based on their frequency, while removing the unwanted stopwords. Only roughly the top 10% most frequent of the remaining words are kept as the set of significant words.
def top_words(file_name):
    '''Return the set of most frequent non-stopword words (roughly the top 10%).'''
    f = open(file_name, "r")
    record = {}
    common_words = load_common_words()
    for line in f:
        words = line.split()
        for word in words:
            w = word.strip('.!?,()\n').lower()
            if w in record:            # dict.has_key() was removed in Python 3
                record[w] += 1
            else:
                record[w] = 1
    for word in record.keys():
        if word in common_words:
            record[word] = -1          # push stopwords to the bottom of the ranking
    f.close()
    occur = list(record.keys())
    occur.sort(reverse=True, key=lambda x: record[x])
    return set(occur[: len(occur) // 10])   # integer division for Python 3
The code below calculates the score of each sentence: the square of the number of significant words in the sentence, divided by the length of the span that contains them.
def calculate_score(sentence, metric):
    '''Score a sentence as (significant words)^2 / (length of the span containing them).'''
    words = sentence.split()
    imp_words, total_words, begin, end = 0, 0, 0, 0
    for word in words:
        w = word.strip('.!?,();').lower()
        end += 1                       # words seen since the last significant word
        if w in metric:
            imp_words += 1
            if imp_words == 1:
                begin = total_words    # position of the first significant word
            end = 0
        total_words += 1
    span = total_words - begin - end   # words from first to last significant word, inclusive
    if span != 0:
        return float(imp_words ** 2) / float(span)
    return 0.0
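For example, with a hypothetical significant-word set (chosen here only for illustration), the scoring works as follows:

# Hypothetical significant-word set, for illustration only
metric = {"luhn", "summarization", "frequency"}
s = "Luhn based his summarization method on word frequency analysis"
print(calculate_score(s, metric))
# 3 significant words in a span of 8 words -> 3^2 / 8 = 1.125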
In the final step, we generate our summary by keeping a fixed fraction of the highest-scoring sentences, in their original order.
ABSTRACT_SIZE = 0.05   # fraction of sentences to keep in the summary (top 5%)

def summarize(file_name):
    '''Return a summary made of the highest-scoring sentences, in document order.'''
    f = open(file_name, "r")
    text = ""
    for line in f:
        text += line.replace('!', '.').replace('?', '.').replace('\n', ' ')
    f.close()
    sentences = text.split(".")
    metric = top_words(file_name)
    scores = {}
    for sentence in sentences:
        scores[sentence] = calculate_score(sentence, metric)
    top_sentences = list(sentences)                            # make a copy
    top_sentences.sort(key=lambda x: scores[x], reverse=True)  # sort by score
    top_sentences = top_sentences[:max(1, int(len(scores) * ABSTRACT_SIZE))]  # keep the top fraction
    top_sentences.sort(key=lambda x: sentences.index(x))       # restore original order
    return '. '.join(top_sentences)
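Putting it together, we can call the function on any text file; the filename document.txt below is just a placeholder:

# Example usage; assumes words.txt (the common-word list) and document.txt exist
if __name__ == "__main__":
    print(summarize("document.txt"))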
Quick Implementation
We can implement Luhn's heuristic method quickly by using the sumy library in Python.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer
This imports all the packages needed for performing the summarization.
#Taking a document to be summarized
doc="""OpenGenus Foundation is an open-source non-profit organization with the aim to enable people to work offline for a longer stretch, reduce the time spent on searching by exploiting the fact that almost 90% of the searches are same for every generation and to make programming more accessible.
OpenGenus is all about positivity and innovation.
Over 1000 people have contributed to our missions and joined our family. We have been sponsored by three great companies namely Discourse, GitHub and DigitalOcean. We run one of the most popular Internship program and open-source projects and have made a positive impact over people's life. """
print(doc)
Output
OpenGenus Foundation is an open-source non-profit organization with the aim to enable people to work offline for a longer stretch, reduce the time spent on searching by exploiting the fact that almost 90% of the searches are same for every generation and to make programming more accessible.
OpenGenus is all about positivity and innovation.
Over 1000 people have contributed to our missions and joined our family. We have been sponsored by three great companies namely Discourse, GitHub and DigitalOcean. We run one of the most popular Internship program and open-source projects and have made a positive impact over people's life.
In the next step, we will summarize our document using Luhn's method.
# For strings
parser = PlaintextParser.from_string(doc, Tokenizer("english"))
# Using Luhn's summarizer
summarizer = LuhnSummarizer()
# Summarize the document into 2 sentences
summary = summarizer(parser.document, 2)
Finally, we have our summary:
for sentence in summary:
print(sentence)
Output
OpenGenus Foundation is an open-source non-profit organization with the aim to enable people to work offline for a longer stretch, reduce the time spent on searching by exploiting the fact that almost 90% of the searches are same for every generation and to make programming more accessible.
Over 1000 people have contributed to our missions and joined our family.
Conclusion
From the above article, we learned about Luhn's heuristic method for text summarization. In the end, the following conclusions can be drawn:
- Criteria for sentence selection:
  - Word frequency
  - Phrase frequency
- Type of document: single document
- Corpus suited: technical articles (like OpenGenus IQ articles) and documents
- Advantages: frequent, significant words are well represented in the summary
- Disadvantages: redundancy and low accuracy
Further Reading
- Introduction to Luhn's Algorithm (PDF) by UC Berkeley
- Summarization techniques (PDF) by CMU