tf-idf stands for Term Frequency - Inverse Document Frequency. It is a two-dimensional term-document matrix in which each entry denotes the relative frequency of a particular word in a particular document compared to the other documents. It is a widely used metric in Text Mining and Information Retrieval.
Function - To identify how important a word is to a document
Term Frequency
Definition - The number of times a term appears in a document is known as the term frequency.
- It is denoted by tf(t,d), where tf is the term frequency for the term t in the document d.
- Every term in a document has a weight associated with it.
- The weight is determined by the frequency of appearance of the term in a document.
To Calculate
tf ( t, d ) = n / N
where tf is the term frequency function
t is the term/ word
d is the document
n is the number of occurrences of t in d
N is the total number of terms in d
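As a minimal sketch of this formula (the function name tf and the assumption that a document is a list of lowercase tokens are choices made here for illustration):

```python
def tf(term, document):
    """Term frequency: occurrences of `term` divided by the total number of terms in `document`.

    `document` is assumed to be a list of lowercase tokens.
    """
    n = document.count(term)  # number of occurrences of the term in the document
    N = len(document)         # total number of terms in the document
    return n / N if N > 0 else 0.0
```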
Inverse Document Frequency
Definition - "The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs."
- It is denoted by idf(t,D), where idf is the inverse document frequency for the term t over the set of documents D.
- Some terms like 'a, an, the' occur very frequently in documents. Thus, the weight associated with them could be uncharacteristically high.
- To avoid unnecessary bias being introduced due to the weight associated with these words, IDF is introduced.
- IDF signifies how rare a term is across all the documents: the more documents a term occurs in, the lower its IDF.
To Calculate
idf ( t, D ) = log ( |D| / |{ d ∈ D : t ∈ d }| )
where idf is the inverse document frequency function
t is the term/ word
D is the set of all documents and |D| is the total number of documents
|{ d ∈ D : t ∈ d }| denotes the number of documents in which t occurs
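A corresponding sketch for idf, under the same assumption that the corpus is a list of token lists and using a base-10 logarithm to match the worked example further below:

```python
import math

def idf(term, documents):
    """Inverse document frequency: log10(|D| / number of documents containing `term`).

    `documents` is assumed to be a list of token lists (the corpus D).
    """
    containing = sum(1 for doc in documents if term in doc)  # |{ d in D : t in d }|
    if containing == 0:
        return 0.0  # term absent from every document; avoid division by zero
    return math.log10(len(documents) / containing)
```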
tf-idf
The tf-idf of a term is the product of its tf and idf values.
- The logarithmic term in idf approaches zero as a term appears in more of the documents under consideration.
- Thus, the tf-idf value for a more common term approaches zero.
To Calculate
tf-idf ( t, d, D ) = tf ( t, d ) * idf ( t, D )
where tf-idf is the term frequency - inverse document frequency function
t is the term/ word
d is the document
D is the set of all documents
tf ( t, d ) is the term frequency
idf ( t, D ) is the inverse document frequency
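Reusing the tf and idf helpers sketched above, tf-idf is simply their product:

```python
def tf_idf(term, document, documents):
    """tf-idf of `term` for `document` within the corpus `documents`."""
    return tf(term, document) * idf(term, documents)
```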
Example
Let us calculate the tf-idf for the terms given in the table below. The raw count of the terms in the two documents Doc1 and Doc2 is mentioned in the table.
| term | Doc1 | Doc2 |
|------|------|------|
| a    | 7    | 8    |
| good | 2    | 3    |
| day  | 0    | 2    |
- tf("good", Doc1) = 2/9 = 0.222
- tf("good", Doc2) = 3/13 = 0.231
- idf("good", D) = log(2/2) = 0
- tf-idf("good", Doc1, D) = 0
- tf-idf("good", Doc2, D) = 0
Since the word good appears in both the documents, it is not as informative to our search.
- tf("day", Doc1) = 0/9 = 0
- tf("day", Doc2) = 2/13 = 0.154
- idf("day", D) = log(2/1) = 0.301
- tf-idf("good", Doc1, D) = 0
- tf-idf("good", Doc2, D) = 0.154 * 0.301 = 0.046
Thus, as can be seen, the word day is more informative, since its tf-idf value differs noticeably between the two documents.
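The values above can be reproduced with the helpers sketched earlier; the token lists below are a hypothetical reconstruction of Doc1 and Doc2 from the raw counts in the table:

```python
doc1 = ["a"] * 7 + ["good"] * 2                # 9 terms in total
doc2 = ["a"] * 8 + ["good"] * 3 + ["day"] * 2  # 13 terms in total
corpus = [doc1, doc2]

print(tf_idf("good", doc1, corpus))  # 0.0   (idf of "good" is log10(2/2) = 0)
print(tf_idf("good", doc2, corpus))  # 0.0
print(tf_idf("day", doc1, corpus))   # 0.0   (tf of "day" in Doc1 is 0)
print(tf_idf("day", doc2, corpus))   # ~0.046
```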
Test Your Knowledge
"A higher tf-idf implies more common words". True or False?
Application
- The concept of tf-idf is used in
- Text Mining
- Information Retrieval
- tf-idf word vectors are created to signify how important each word is to a document.
- It is a preliminary step for actions such as finding:
- similar documents (see the sketch after this list)
- searching for terms within documents
- clustering documents
- document summarization
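As a sketch of the similar-documents use case, scikit-learn's TfidfVectorizer can build tf-idf vectors that are then compared with cosine similarity. Note that its default weighting adds smoothing and uses a natural-log idf, so the exact values differ from the formulas above; the example documents here are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "a good day for text mining",
    "a good day",
    "information retrieval with tf-idf",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)  # one tf-idf vector per document

# Pairwise cosine similarity between the tf-idf vectors:
# higher values indicate more similar documents.
print(cosine_similarity(tfidf_matrix))
```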