# Understanding TF-IDF (term frequency - inverse document frequency)


tf-idf stands for term frequency - inverse document frequency. It is a numerical weight that reflects how important a word is to one document relative to the other documents in a collection. Computed for every term and every document, the weights form a two-dimensional matrix with one entry per (term, document) pair. It is a widely used metric in text mining and information retrieval.

Function - To identify how important a word is to a document

## Term Frequency

Definition - The number of times a term appears in a document is known as the term frequency.

• It is denoted by tf(t,d), where tf is the term frequency for the term t in the document d.
• Every term in a document has a weight associated with it.
• The weight is determined by the frequency of appearance of the term in a document.

To Calculate

tf ( t, d ) = n / N

where tf is the term frequency function
t is the term/ word
d is the document
n is the number of occurrences of t in d
N is the total number of terms in d
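As a minimal sketch of this calculation, the function below divides the count of a term by the total number of terms in the document, which is the convention the worked example in this article follows (for instance 2/9 for "good" in Doc1). The function name `term_frequency`, the whitespace tokenization, and the sample sentence are assumptions made for illustration only.

```python
from collections import Counter


def term_frequency(term, document_tokens):
    """Relative frequency of `term` in one tokenized document."""
    if not document_tokens:
        # Assumption: an empty document gives every term a frequency of 0.
        return 0.0
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)


# Hypothetical document, split on whitespace.
doc = "a good day is a day well spent".split()
print(term_frequency("day", doc))  # 2 occurrences out of 8 tokens -> 0.25
```

Other tf variants exist (raw counts, for example); this sketch only covers the relative-frequency form.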


## Inverse Document Frequency

Definition - "The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs."

• It is denoted by idf(t, D), where idf is the inverse document frequency for the term t across the collection of documents D.
• Some terms like 'a, an, the' occur very frequently in documents. Weighted by term frequency alone, they would receive a misleadingly high weight.
• To avoid the bias introduced by the weight of these words, IDF is introduced.
• IDF signifies how rare the term is across all the documents: the more documents contain the term, the lower its idf.

To Calculate

idf ( t, D ) = log ( |D| / |{ d ∈ D : t ∈ d }| )

where idf is the inverse document frequency function
t is the term/ word
d is a document
D is the set of all documents, and |D| is the total number of documents
|{ d ∈ D : t ∈ d }| denotes the number of documents in which t occurs
log is the base-10 logarithm in the example in this article
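The idf calculation can be sketched as follows. This assumes a base-10 logarithm, which matches the value log(2/1) = 0.301 used in this article's example. The function name `inverse_document_frequency`, the sample corpus, and the choice to raise an error for a term that appears in no document (where the ratio is undefined) are illustrative assumptions.

```python
import math


def inverse_document_frequency(term, corpus):
    """Base-10 log of (total documents / documents containing `term`)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    if containing == 0:
        # Assumption: the formula is undefined here, so fail loudly.
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log10(len(corpus) / containing)


# Hypothetical corpus of two tokenized documents.
corpus = [
    ["a", "good"],
    ["a", "good", "day"],
]
print(inverse_document_frequency("day", corpus))   # log10(2/1), about 0.301
print(inverse_document_frequency("good", corpus))  # log10(2/2) = 0.0
```

Some implementations add smoothing terms to avoid the division by zero instead; that variant is not shown here.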


## tf-idf

The tf-idf of a term is the product of its tf and its idf.

• The logarithmic term in idf approaches zero as the term appears in a larger share of the documents under consideration.
• Thus, the tf-idf value of a very common term approaches zero.

To Calculate

tf-idf ( t, d, D ) = tf ( t, d ) * idf ( t, D )

where tf-idf is the term frequency - inverse document frequency function
t is the term/ word
d is the document
D is the set of all documents
tf ( t, d ) is the term frequency
idf ( t, D ) is the inverse document frequency
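To connect the formula to the two-dimensional matrix mentioned at the start of this article, here is a rough sketch that computes tf-idf for every term in every document. It assumes tf is the term count divided by the document length and idf uses a base-10 logarithm. The function name `tf_idf_matrix`, the dictionary layout, and the sample corpus are hypothetical choices for illustration.

```python
import math
from collections import Counter


def tf_idf_matrix(corpus):
    """Return {term: [tf-idf in doc 0, tf-idf in doc 1, ...]}."""
    vocabulary = sorted({term for tokens in corpus for term in tokens})
    counts = [Counter(tokens) for tokens in corpus]
    matrix = {}
    for term in vocabulary:
        # Every vocabulary term occurs in at least one document,
        # so doc_freq is never zero.
        doc_freq = sum(1 for c in counts if term in c)
        idf = math.log10(len(corpus) / doc_freq)
        row = []
        for c, tokens in zip(counts, corpus):
            # Assumption: an empty document gets tf = 0 for every term.
            tf = c[term] / len(tokens) if tokens else 0.0
            row.append(tf * idf)
        matrix[term] = row
    return matrix


# Hypothetical corpus of two tokenized documents.
corpus = ["the cat sat".split(), "the dog barked".split()]
for term, row in tf_idf_matrix(corpus).items():
    print(term, [round(value, 3) for value in row])
```

In this sample, "the" occurs in both documents, so its row is all zeros, while "cat" is nonzero only in the first document.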


### Example

Let us calculate the tf-idf for the terms in the table below, which lists the raw count of each term in the two documents Doc1 and Doc2. Doc1 contains 9 terms in total and Doc2 contains 13.

| term | Doc1 | Doc2 |
|------|------|------|
| a    | 7    | 8    |
| good | 2    | 3    |
| day  | 0    | 2    |

• tf("good", Doc1) = 2/9 = 0.222
• tf("good", Doc2) = 3/13 = 0.231
• idf("good", D) = log(2/2) = 0
• tf-idf("good", Doc1, D) = 0
• tf-idf("good", Doc2, D) = 0

Since the word good appears in both documents, its idf is zero, so it does not help distinguish one document from the other in a search.

• tf("day", Doc1) = 0/9 = 0
• tf("day", Doc2) = 2/13 = 0.154
• idf("day", D) = log(2/1) = 0.301
• tf-idf("day", Doc1, D) = 0
• tf-idf("day", Doc2, D) = 0.154 * 0.301 = 0.046

Thus, the word day is more informative: its tf-idf value is zero in Doc1 and nonzero in Doc2, so it helps tell the two documents apart.
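The arithmetic in this example can be checked with a short script. The sketch below starts from the raw counts in the table and assumes a base-10 logarithm and tf defined as the count divided by the total number of terms in the document. The helper names `tf`, `idf`, and `tf_idf` are illustrative.

```python
import math

# Raw counts from the table in this example.
counts = {
    "a":    {"Doc1": 7, "Doc2": 8},
    "good": {"Doc1": 2, "Doc2": 3},
    "day":  {"Doc1": 0, "Doc2": 2},
}
docs = ["Doc1", "Doc2"]

# Total number of terms per document: 9 for Doc1 and 13 for Doc2.
totals = {doc: sum(row[doc] for row in counts.values()) for doc in docs}


def tf(term, doc):
    return counts[term][doc] / totals[doc]


def idf(term):
    containing = sum(1 for doc in docs if counts[term][doc] > 0)
    return math.log10(len(docs) / containing)


def tf_idf(term, doc):
    return tf(term, doc) * idf(term)


for term in counts:
    for doc in docs:
        print(f"{term:>4} {doc}: tf={tf(term, doc):.3f} "
              f"idf={idf(term):.3f} tf-idf={tf_idf(term, doc):.3f}")
```

Rounded to three decimals, the script gives 0.222 and 0.231 for the tf of "good", and 0.046 for the tf-idf of "day" in Doc2, matching the values worked out by hand.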

#### "A higher tf-idf implies more common words". True or False?

False. The logarithmic term in idf approaches zero as a term appears in a larger share of the documents under consideration. Thus, for the same term frequency, a higher tf-idf value implies a less common word.
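To see this trend numerically, the sketch below tabulates idf as a term appears in more and more documents. The corpus size of 10 is a hypothetical figure chosen for illustration, and a base-10 logarithm is assumed.

```python
import math

total_docs = 10  # hypothetical corpus size

# idf for a term that appears in k of the documents, for k = 1..10.
idf_by_doc_count = {
    k: math.log10(total_docs / k) for k in range(1, total_docs + 1)
}

for k, value in idf_by_doc_count.items():
    print(f"term in {k:2d} of {total_docs} documents: idf = {value:.3f}")
```

The values fall steadily as k grows and reach zero when the term appears in every document.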

### Application

• The concept of tf-idf is used in:
  • Text Mining
  • Information Retrieval
• tf-idf word vectors are created to signify the importance of words to documents.
• It is a preliminary step for tasks such as:
  • finding similar documents
  • searching for terms within documents
  • clustering documents
  • document summarization
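As one illustration of the "similar documents" use case, the sketch below builds a tf-idf vector per document and compares vectors with cosine similarity. Cosine similarity is a common choice for comparing such vectors, but it is an assumption here, not something this article prescribes. The function names, the sample corpus, and the requirement that every document is non-empty are likewise assumptions.

```python
import math
from collections import Counter


def tf_idf_vectors(corpus):
    """One tf-idf vector per document, over a shared sorted vocabulary.

    Assumes every document in `corpus` is a non-empty list of tokens.
    """
    vocabulary = sorted({term for tokens in corpus for term in tokens})
    doc_freq = {
        term: sum(1 for tokens in corpus if term in tokens)
        for term in vocabulary
    }
    vectors = []
    for tokens in corpus:
        counts = Counter(tokens)
        vectors.append([
            counts[term] / len(tokens) * math.log10(len(corpus) / doc_freq[term])
            for term in vocabulary
        ])
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical corpus: the first two documents share two words.
corpus = [
    "cats chase mice".split(),
    "cats chase birds".split(),
    "stocks fell sharply".split(),
]
vectors = tf_idf_vectors(corpus)
print(cosine_similarity(vectors[0], vectors[1]))  # shared terms, so above zero
print(cosine_similarity(vectors[0], vectors[2]))  # no shared terms, so 0.0
```

Documents that share rare terms score higher than documents that share only common ones, because the common terms carry a low idf.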