tf-idf stands for Term Frequency - Inverse Document Frequency. It is a two-dimensional term-document matrix in which each entry denotes the relative frequency of a particular word in a particular document compared to the other documents. It is a widely used metric in Text Mining and Information Retrieval.
Function - To identify how important a word is to a document
Term Frequency
Definition - The number of times a term appears in a document is known as the term frequency.
- It is denoted by tf(t,d), where tf is the term frequency for the term t in the document d.
- Every term in a document has a weight associated with it.
- The weight is determined by the frequency of appearance of the term in a document.
To Calculate
tf ( t, d ) = n / N
where tf is the term frequency function
t is the term/ word
d is the document
n is the number of occurrences of t in d
N is the total number of terms in d
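As a minimal sketch of this formula (the function name tf and the assumption that a document is a list of lowercase tokens are choices made here for illustration):

```python
def tf(term, document):
    """Term frequency: occurrences of `term` divided by the total number of terms in `document`.

    `document` is assumed to be a list of lowercase tokens.
    """
    n = document.count(term)  # number of occurrences of the term in the document
    N = len(document)         # total number of terms in the document
    return n / N if N > 0 else 0.0
```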
Inverse Document Frequency
Definition - "The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs."
- It is denoted by idf(t,D), where idf is the inverse document frequency for the term t over the set of documents D.
- Some terms like 'a, an, the' occur very frequently in documents. Thus, the weight associated with them could be uncharacteristically high.
- To avoid unnecessary bias being introduced due to the weight associated with these words, IDF is introduced.
- IDF signifies how rare a term is across all the documents: the more documents a term occurs in, the lower its IDF.
To Calculate
idf ( t, D ) = log ( |D| / |{ d ∈ D : t ∈ d }| )
where idf is the inverse document frequency function
t is the term/ word
D is the set of all documents and |D| is the total number of documents
|{ d ∈ D : t ∈ d }| denotes the number of documents in which t occurs
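A corresponding sketch for idf, under the same assumption that the corpus is a list of token lists and using a base-10 logarithm to match the worked example further below:

```python
import math

def idf(term, documents):
    """Inverse document frequency: log10(|D| / number of documents containing `term`).

    `documents` is assumed to be a list of token lists (the corpus D).
    """
    containing = sum(1 for doc in documents if term in doc)  # |{ d in D : t in d }|
    if containing == 0:
        return 0.0  # term absent from every document; avoid division by zero
    return math.log10(len(documents) / containing)
```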
tf-idf
The tf-idf of a term is the product of its tf and idf values.
- The logarithmic term in idf approaches zero as a term appears in more of the documents under consideration.
- Thus, the tf-idf value for a more common term approaches zero.
To Calculate
tf-idf ( t, d, D ) = tf ( t, d ) * idf ( t, D )
where tf-idf is the term frequency - inverse document frequency function
t is the term/ word
d is the document
D is the set of all documents
tf ( t, d ) is the term frequency
idf ( t, D ) is the inverse document frequency
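Reusing the tf and idf helpers sketched above, tf-idf is simply their product:

```python
def tf_idf(term, document, documents):
    """tf-idf of `term` for `document` within the corpus `documents`."""
    return tf(term, document) * idf(term, documents)
```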
Example
Let us calculate the tf-idf for the terms given in the table below. The raw count of the terms in the two documents Doc1 and Doc2 is mentioned in the table.
| term | Doc1 | Doc2 |
|------|------|------|
| a    | 7    | 8    |
| good | 2    | 3    |
| day  | 0    | 2    |
- tf("good", Doc1) = 2/9 = 0.222
- tf("good", Doc2) = 3/13 = 0.231
- idf("good", D) = log(2/2) = 0
- tf-idf("good", Doc1, D) = 0
- tf-idf("good", Doc2, D) = 0
Since the word good appears in both the documents, it is not as informative to our search.
- tf("day", Doc1) = 0/9 = 0
- tf("day", Doc2) = 2/13 = 0.154
- idf("day", D) = log(2/1) = 0.301
- tf-idf("good", Doc1, D) = 0
- tf-idf("good", Doc2, D) = 0.154 * 0.301 = 0.046
Thus, as can be seen, the word day is more informative, since its tf-idf value differs noticeably between the two documents.
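The values above can be reproduced with the helpers sketched earlier; the token lists below are a hypothetical reconstruction of Doc1 and Doc2 from the raw counts in the table:

```python
doc1 = ["a"] * 7 + ["good"] * 2                # 9 terms in total
doc2 = ["a"] * 8 + ["good"] * 3 + ["day"] * 2  # 13 terms in total
corpus = [doc1, doc2]

print(tf_idf("good", doc1, corpus))  # 0.0   (idf of "good" is log10(2/2) = 0)
print(tf_idf("good", doc2, corpus))  # 0.0
print(tf_idf("day", doc1, corpus))   # 0.0   (tf of "day" in Doc1 is 0)
print(tf_idf("day", doc2, corpus))   # ~0.046
```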
Test Your Knowledge
"A higher tf-idf implies more common words". True or False?
Application
- The concept of tf-idf is used in
- Text Mining
- Information Retrieval
- tf-idf word vectors are created to signify how important each word is to a document.
- It is a preliminary step for actions such as finding:
- similar documents (see the sketch after this list)
- searching for terms within documents
- clustering documents
- document summarization
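As a sketch of the similar-documents use case, scikit-learn's TfidfVectorizer can build tf-idf vectors that are then compared with cosine similarity. Note that its default weighting adds smoothing and uses a natural-log idf, so the exact values differ from the formulas above; the example documents here are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "a good day for text mining",
    "a good day",
    "information retrieval with tf-idf",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)  # one tf-idf vector per document

# Pairwise cosine similarity between the tf-idf vectors:
# higher values indicate more similar documents.
print(cosine_similarity(tfidf_matrix))
```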