**tf-idf** stands for **Term Frequency - Inverse Document Frequency**. It is a numerical statistic that measures how important a word is to a particular document relative to the other documents in a collection. Computed for every term and every document, it yields a two-dimensional matrix in which each entry is the weight of one term in one document. The metric is widely used in text mining and information retrieval.

**Function** - To identify how important a word is to a document

## Term Frequency

**Definition** - The number of times a term appears in a document is known as the term frequency.

- It is denoted by tf(t,d), where tf is the term frequency for the term *t* in the document *d*.
- Every term in a document has a weight associated with it.
- The weight is determined by the frequency of appearance of the term in a document.

**To Calculate**

```
tf ( t, d ) = n / N
where tf is the term frequency function
t is the term/ word
d is the document
n is the number of occurrences of t in d
N is the total number of terms in d
```
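The ratio above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it takes N to be the total number of terms in the document, which is the reading the worked example later in this article relies on (2/9 for "good" in Doc1). The function name `term_frequency` and the token-list input are assumptions made for the sketch.

```python
from collections import Counter


def term_frequency(term, document_tokens):
    """Return n / N: occurrences of `term` divided by the document length."""
    total = len(document_tokens)
    if total == 0:
        # Assumption: an empty document gives tf = 0 rather than an error.
        return 0.0
    return Counter(document_tokens)[term] / total


# Doc1 from the example: "a" appears 7 times, "good" twice, "day" never.
doc1 = ["a"] * 7 + ["good"] * 2
print(round(term_frequency("good", doc1), 3))  # 0.222
```

Tokenization (lowercasing, punctuation handling) is left out on purpose; the function expects a list of already-split terms.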

## Inverse Document Frequency

**Definition** - "The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs."

- It is denoted by idf(t,D), where idf is the inverse document frequency for the term *t* in the collection of documents *D*.
- Some terms like 'a', 'an' and 'the' occur very frequently in almost every document, so a weight based on term frequency alone would be uncharacteristically high for them.
- IDF is introduced to avoid the bias that the weight of these words would otherwise cause.
- IDF signifies how rare the term is across all the documents: the more documents contain the term, the lower its IDF.

**To Calculate**

```
idf ( t, D ) = log ( |D| / |{ d ∈ D : t ∈ d }| )
where idf is the inverse document frequency function
t is the term/ word
d is a document
D is the set of all documents, and |D| is the total number of documents
|{ d ∈ D : t ∈ d }| denotes the number of documents in which t occurs
log is the base-10 logarithm in the example that follows
```
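As a rough sketch, the same quantity can be computed in Python as follows. The function name `inverse_document_frequency` is hypothetical, documents are assumed to be lists of tokens, and the base-10 logarithm is chosen to match the numbers in the example later in this article. Returning 0 for a term that occurs in no document is also an assumption of this sketch; the formula itself is undefined in that case.

```python
import math


def inverse_document_frequency(term, documents):
    """Return log10(total documents / documents containing `term`)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        # Assumption: avoid dividing by zero for a term seen in no document.
        return 0.0
    return math.log10(len(documents) / containing)


docs = [["a", "good"], ["a", "good", "day"]]
print(inverse_document_frequency("good", docs))           # 0.0
print(round(inverse_document_frequency("day", docs), 3))  # 0.301
```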

## tf-idf

The tf-idf of a term is the product of its tf and its idf.

- The more documents a term appears in, the closer the ratio inside the logarithm gets to 1, so the idf approaches zero.
- Thus, the tf-idf value of a very common term approaches zero.

**To Calculate**

```
tf-idf ( t, d, D ) = tf ( t, d ) * idf ( t, D )
where tf-idf is the term frequency - inverse document frequency function
t is the term/ word
d is the document
D is the set of all documents
tf ( t, d ) is the term frequency
idf ( t, D ) is the inverse document frequency
```
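Putting the two factors together, one possible sketch of the product in Python is shown below. The function name `tf_idf`, the token-list representation and the tiny corpus are all illustrative assumptions; the base-10 logarithm and the zero result for unseen terms are conventions chosen for this sketch, not part of the definition above.

```python
import math
from collections import Counter


def tf_idf(term, document, corpus):
    """Return tf(term, document) * idf(term, corpus) for tokenized input."""
    # Term frequency: occurrences of the term over the document length.
    tf = Counter(document)[term] / len(document) if document else 0.0
    # Inverse document frequency: log10 of total docs over docs with the term.
    containing = sum(1 for doc in corpus if term in doc)
    idf = math.log10(len(corpus) / containing) if containing else 0.0
    return tf * idf


corpus = [["the", "cat", "sat"], ["the", "dog", "barked"]]
print(tf_idf("the", corpus[0], corpus))            # 0.0, "the" is in every document
print(round(tf_idf("cat", corpus[0], corpus), 3))  # 0.1
```

The word that appears in every document scores zero, while the word unique to one document gets a positive weight there.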

### Example

Let us calculate the tf-idf for the terms given in the table below. The raw count of the terms in the two documents Doc1 and Doc2 is mentioned in the table.

term | Doc1 | Doc2 |
---|---|---|
a | 7 | 8 |
good | 2 | 3 |
day | 0 | 2 |

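Doc1 contains 7 + 2 + 0 = 9 terms in total and Doc2 contains 8 + 3 + 2 = 13. The logarithm is taken to base 10.

- tf("good", *Doc1*) = 2/9 = 0.222
- tf("good", *Doc2*) = 3/13 = 0.231
- idf("good", *D*) = log(2/2) = 0
- tf-idf("good", *Doc1*, *D*) = 0.222 * 0 = 0
- tf-idf("good", *Doc2*, *D*) = 0.231 * 0 = 0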

Since the word *good* appears in both documents, its idf is zero, so it does not help to distinguish one document from the other in a search.

- tf("day", *Doc1*) = 0/9 = 0
- tf("day", *Doc2*) = 2/13 = 0.154
- idf("day", *D*) = log(2/1) = 0.301
- tf-idf("day", *Doc1*, *D*) = 0 * 0.301 = 0
- tf-idf("day", *Doc2*, *D*) = 0.154 * 0.301 = 0.046

Thus, the word *day* is more informative: it occurs in only one of the two documents, so its tf-idf value is non-zero for Doc2 and zero for Doc1.
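The arithmetic in this example can be checked with a short script. The sketch below starts from the raw counts in the table, assumes the term frequency is normalized by the total number of terms in each document (9 and 13), and uses a base-10 logarithm; the helper names `tf`, `idf` and `tf_idf` are chosen here for illustration.

```python
import math

# Raw counts taken from the table above.
counts = {
    "Doc1": {"a": 7, "good": 2, "day": 0},
    "Doc2": {"a": 8, "good": 3, "day": 2},
}


def tf(term, doc):
    """Count of the term divided by the total number of terms in the document."""
    return counts[doc][term] / sum(counts[doc].values())


def idf(term):
    """log10 of (number of documents / number of documents containing the term)."""
    containing = sum(1 for doc_counts in counts.values() if doc_counts[term] > 0)
    return math.log10(len(counts) / containing)


def tf_idf(term, doc):
    return tf(term, doc) * idf(term)


for term in ("a", "good", "day"):
    for doc in counts:
        print(f"{term:>4} {doc}: tf={tf(term, doc):.3f} "
              f"idf={idf(term):.3f} tf-idf={tf_idf(term, doc):.3f}")
```

Only the row for "day" in Doc2 ends up with a non-zero tf-idf (about 0.046); "a" and "good" occur in both documents and therefore score zero everywhere.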

## Test Your Knowledge

#### "A higher tf-idf implies more common words". True or False?

### Application

- The concept of tf-idf is used in
  - Text Mining
  - Information Retrieval
- tf-idf word vectors are created to signify the importance of words to documents.
- It is a preliminary step for tasks such as:
  - finding similar documents
  - searching for terms within documents
  - clustering documents
  - document summarization
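To make the "similar documents" use case concrete, here is one possible sketch: each document is turned into a vector of tf-idf weights, and two documents are compared with cosine similarity. Cosine similarity is a common choice but it is an assumption here, not something this article prescribes. The function names, the whitespace tokenization and the three sample sentences are all made up for illustration, and the sketch assumes no document is empty.

```python
import math
from collections import Counter


def tf_idf_vectors(corpus):
    """Return (vocabulary, vectors): one tf-idf vector per tokenized document."""
    vocabulary = sorted({term for doc in corpus for term in doc})
    n_docs = len(corpus)
    doc_freq = {t: sum(1 for doc in corpus if t in doc) for t in vocabulary}
    vectors = []
    for doc in corpus:
        counts = Counter(doc)
        vectors.append([
            (counts[t] / len(doc)) * math.log10(n_docs / doc_freq[t])
            for t in vocabulary
        ])
    return vocabulary, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


corpus = [
    "the cat sat on the mat".split(),
    "the cat lay on the rug".split(),
    "stock prices fell sharply today".split(),
]
_, vectors = tf_idf_vectors(corpus)
print(cosine_similarity(vectors[0], vectors[1]))  # positive: shared words
print(cosine_similarity(vectors[0], vectors[2]))  # 0.0: no words in common
```

The first two sentences share several words and get a positive score, while the third shares none with the first and scores zero.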