Understanding TF-IDF (Term Frequency - Inverse Document Frequency)



tf-idf stands for Term Frequency - Inverse Document Frequency. It is a numerical weight that measures how important a particular word is to a particular document relative to the other documents in a collection. Computed for every term and every document, these weights form a 2-dimensional matrix with one entry per (term, document) pair. The metric is widely used in text mining and information retrieval.

Function - To identify how important a word is to a document

Term Frequency

Definition - The number of times a term appears in a document, usually normalized by the length of that document, is known as the term frequency.

  • It is denoted by tf(t,d), where tf is the term frequency for the term t in the document d.
  • Every term in a document has a weight associated with it.
  • The weight is determined by the frequency of appearance of the term in a document.

To Calculate

tf ( t, d ) = n / N
    
where tf is the term frequency function
      t is the term/ word
      d is the document
      n is the number of occurrences of t in d
      N is the total number of terms in d
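
As a minimal sketch of this ratio, the snippet below takes N to be the total number of terms in the document, which is the convention the worked example later in this article uses (2/9 for "good" in Doc1). It assumes the document has already been split into a list of lowercase words. The function name `term_frequency` and the sample sentence are illustrative only.

```python
from collections import Counter


def term_frequency(term, tokens):
    """Return count(term) / len(tokens) for one tokenized document."""
    if not tokens:
        return 0.0  # an empty document has no terms to count
    counts = Counter(tokens)
    return counts[term] / len(tokens)


# Hypothetical sample document: 9 tokens, "good" appears twice
doc = "a good day is a good day to learn".split()
print(term_frequency("good", doc))  # 2 occurrences out of 9 tokens
```

Dividing by the document length keeps long documents from scoring higher simply because they contain more words.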

Inverse Document Frequency

Definition - "The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs."

  • It is denoted by idf(t,D), where idf is the inverse document frequency for the term t across the collection of documents D.
  • Some terms like 'a, an, the' occur very frequently in documents. Thus, the term-frequency weight associated with them could be uncharacteristically high.
  • To avoid unnecessary bias being introduced due to the weight associated with these words, IDF is introduced.
  • IDF signifies how rare the term is across all the documents: the fewer documents contain the term, the higher its IDF.

To Calculate

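idf ( t, D ) = log ( |D| / |{ d ∈ D : t ∈ d }| )

where idf is the inverse document frequency function
      t is the term/ word
      D is the collection of all documents
      |D| is the total number of documents
      |{ d ∈ D : t ∈ d }| is the number of documents in which t occurs
      log is the base-10 logarithm in the example in this article (another base only rescales the values)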
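
The calculation can be sketched in Python by counting the documents that contain the term and taking a base-10 logarithm, which matches log(2/1) = 0.301 in the worked example of this article. The function name `inverse_document_frequency`, the tiny corpus, and the choice to return 0.0 for a term that appears in no document are assumptions made for illustration.

```python
import math


def inverse_document_frequency(term, documents):
    """Return log10(number of documents / documents containing term)."""
    containing = sum(1 for tokens in documents if term in tokens)
    if containing == 0:
        # Assumption: return 0.0 for unseen terms to avoid dividing by zero
        return 0.0
    return math.log10(len(documents) / containing)


# Hypothetical corpus of two tokenized documents
docs = [["a", "good"], ["a", "good", "day"]]
print(inverse_document_frequency("good", docs))  # log10(2/2)
print(inverse_document_frequency("day", docs))   # log10(2/1)
```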

tf-idf

The tf-idf of a term is the product of its tf and its idf.

  • The ratio inside the logarithm approaches 1 as a term appears in more of the documents under consideration, so its idf approaches zero.
  • Thus, the tf-idf value for a more common term approaches zero.

To Calculate

tf-idf ( t, d, D ) = tf ( t, d ) * idf ( t, D )
    
where tf-idf is the term frequency - inverse document frequency function
      t is the term/ word
      d is the document
      D is the collection of all documents
      tf ( t, d ) is the term frequency
      idf ( t, D ) is the inverse document frequency
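
Putting the two factors together, the sketch below builds a small tf-idf matrix with one row per document and one column per vocabulary term. It assumes tf is the term count divided by the document length and idf uses a base-10 logarithm. The helper names and the two sample sentences are hypothetical.

```python
import math


def tf(term, tokens):
    """Share of the document's tokens that equal `term`."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


def idf(term, documents):
    """Base-10 log of (number of documents / documents containing `term`)."""
    containing = sum(1 for tokens in documents if term in tokens)
    # Assumption: a term found in no document gets idf 0.0 instead of an error
    return math.log10(len(documents) / containing) if containing else 0.0


def tf_idf(term, tokens, documents):
    """Product of tf and idf for one term in one document."""
    return tf(term, tokens) * idf(term, documents)


# Hypothetical corpus of two tokenized documents
corpus = [
    "it is a good day".split(),
    "it is a long night".split(),
]
vocabulary = sorted({term for tokens in corpus for term in tokens})

# One row per document, one column per vocabulary term
matrix = [[tf_idf(term, tokens, corpus) for term in vocabulary] for tokens in corpus]
for row in matrix:
    print([round(value, 3) for value in row])
```

Words shared by both sentences ("it", "is", "a") get a weight of zero, while words unique to one sentence get a positive weight only in that sentence's row.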

Example

Let us calculate the tf-idf for the terms given in the table below. The table gives the raw count of each term in the two documents Doc1 and Doc2. Doc1 contains 7 + 2 + 0 = 9 terms in total and Doc2 contains 8 + 3 + 2 = 13. Logarithms are base 10.

term   Doc1   Doc2
a      7      8
good   2      3
day    0      2

  • tf("good", Doc1) = 2/9 = 0.222
  • tf("good", Doc2) = 3/13 = 0.231
  • idf("good", D) = log(2/2) = 0
    • tf-idf("good", Doc1, D) = 0
    • tf-idf("good", Doc2, D) = 0

Since the word "good" appears in both documents, its idf, and hence its tf-idf, is zero, so it does not help to distinguish between them in a search.

  • tf("day", Doc1) = 0/9 = 0
  • tf("day", Doc2) = 2/13 = 0.154
  • idf("day", D) = log(2/1) = 0.301
    • tf-idf("day", Doc1, D) = 0
    • tf-idf("day", Doc2, D) = 0.154 * 0.301 = 0.046

Thus, as can be seen, the word "day" is more informative, since its tf-idf value is non-zero only in Doc2, the one document that contains it.
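
The figures above can be checked with a short script. This is only a sketch: it hard-codes the raw counts from the table, treats the sum of a document's counts as its length, and uses a base-10 logarithm. The dictionary layout and function names are illustrative.

```python
import math

# Raw counts copied from the table above
counts = {
    "Doc1": {"a": 7, "good": 2, "day": 0},
    "Doc2": {"a": 8, "good": 3, "day": 2},
}


def tf(term, doc):
    """Count of the term divided by the total number of terms in the document."""
    return counts[doc][term] / sum(counts[doc].values())


def idf(term):
    """Base-10 log of (number of documents / documents containing the term)."""
    containing = sum(1 for doc in counts if counts[doc][term] > 0)
    return math.log10(len(counts) / containing)


def tf_idf(term, doc):
    return tf(term, doc) * idf(term)


for term in ("good", "day"):
    for doc in counts:
        print(term, doc, round(tf(term, doc), 3), round(idf(term), 3), round(tf_idf(term, doc), 3))
```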

Test Your Knowledge

"A higher tf-idf implies more common words". True or False?

Answer: False. The idf of a term approaches zero as the term appears in more of the documents under consideration. Thus, a higher tf-idf value implies a less common word.

Application

  • The concept of tf-idf is used in
    • Text Mining
    • Information Retrieval
  • tf-idf vectors are built for documents, with each entry signifying the importance of a word to that document.
  • It is a preliminary step for tasks such as:
    • finding similar documents
    • searching for terms within documents
    • clustering documents
    • summarizing documents
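
As one illustration of the "similar documents" use case, tf-idf vectors are commonly compared with cosine similarity. The sketch below is a plain-Python example under the same assumptions as the formulas in this article (tf as a length-normalized count, base-10 idf). The three sample sentences and all function names are hypothetical.

```python
import math


def tf_idf_vector(tokens, documents, vocabulary):
    """Return the tf-idf weights of one document, one entry per vocabulary term."""
    vector = []
    for term in vocabulary:
        tf = tokens.count(term) / len(tokens)
        df = sum(1 for d in documents if term in d)
        idf = math.log10(len(documents) / df) if df else 0.0
        vector.append(tf * idf)
    return vector


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical corpus: two related sentences and one unrelated sentence
corpus = [
    "the cat sat on the mat".split(),
    "the cat lay on the rug".split(),
    "stock prices fell sharply today".split(),
]
vocabulary = sorted({term for tokens in corpus for term in tokens})
vectors = [tf_idf_vector(tokens, corpus, vocabulary) for tokens in corpus]

sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(sim_01, sim_02)
```

The first two sentences share several words, so their similarity is positive, while the third shares no words with the first and scores zero. Library implementations may use different smoothing and normalization, so their numbers need not match this sketch.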