Zipf's Law in NLP


Zipf's Law is an empirical law, that was proposed by George Kingsley Zipf, an American Linguist.

According to Zipf's law, the frequency of a given word is dependent on the inverse of it's rank . Zipf's law is one of the many important laws that plays a significant part in natural language processing, the other being Heaps' Law.

f(r, α) ∝ 1rα

Here,

  • α ≈ 1
  • r = rank of a word
  • f(r, α) = frequency in the corpus

The word with the second highest rank will have the frequency value ½ of the first word, the word with the third highest rank will have the frequency value ⅓ of the first word, and so on...

Some points to note about Zipf's Law

If we consider a constant for the following law, we have:

f = krα
=> f × rα = k
=> f × r = k
(Since α is nearly equal to 1, we can ignore it for this case)

This constant would be same for every word in expected results, but that is not the case for observed result. This will be further shown in the sample code, and explained later, but just remember this for now - few of the highest and the lowest words will not have the same values of k - only the data in the middle range of ranks follow this - and this is only for our particular document. There are other variants of Zip's law, like Zipf-Mandelbrot law, that tries to fit the frequencies, by adding an offest β, and it is given by:

f(r, α, β) ∝ 1(r+β)α

Here, β ≈ 2.7, and it is also known as the rank offset.

Note:
Values of α and β cannot be always taken as equal to 1 and 2.7 - they are parameters that can be tuned to fit a data perfectly, and there are cases where you may have to adjust the value, if the results are not satisfactory.

Fun fact about Zipf's law

Zipf's law is an empirical law, that models the frequency distribution of words in languages. And by languages, this is not just restricted to English corpora. In fact, it was even found out that the language spoken by dolphins, which are a given set of whistles and high frequency shrills, also follow Zipf's Law!

Terminologies

  • Frequency - Frequency is the number of occurences of a word in the given document or corpus.
  • Rank - The rank of a word can be defined as the position occupied, for having the highest number of occurence in the given document or corpus. A word with the highest frequency will have the highest rank.
  • Document - A piece of printed, written, or electronically stored material that can transmit information or evidence. Latin for Documentum, which denotes a "teaching" or "lesson".
  • Corpus - A collection of texts, or written information (known as documents), usually about the works of a particular author, or a given subject/ topic. Usually stored as electronic format.

Program on Zipf's law

First, we start with importing the required libraries. NLTK is actually where we will be accessing our corpus from.These documents are available at "C:\Users\ashvi\AppData\Roaming\nltk_data\corpora\gutenberg", in Windows.
For this particular code, I will be using "carroll-alice.txt". You may have to download them before.

# %% Import libraries required
from operator import itemgetter
import nltk
import pandas as pd
from nltk.corpus import stopwords
from matplotlib import pyplot as plt

Then we start by adding required variables, like frequency (frequency), word in the document (words_doc) and stopwords (stop_words). It removes symbols like '.', '?'and also articles like 'the', 'an', just to show a few.

# %% Initialize frequency, and import all the words in a corpus
frequency = {}
words_doc = nltk.Text(nltk.corpus.gutenberg.words('carroll-alice.txt'))
stop_words = set(stopwords.words('english'))

Next, we convert all the letters to lower cases, by the following code.

# %% Convert to lower case and remove stop words
words_doc = [word.lower() for word in words_doc if word.isalpha()]
words_doc = [word for word in words_doc if word not in stop_words]

The for-loop will calculate the frequency of the words inside - it creates a dictionary data type of the form {'word' : number}. We also create a DataFrame (df) which is part of the Pandas (pd) library, followed by initializing basic variables like rank, and column_header.
You can also see the collection variable, which will sort the given dictionary frequency.The next part is for printing the table, in ordered fashion.

# %% Calculate the frequency of the words inside
for word in words_doc :
    count = frequency.get(word , 0)
    frequency[ word ] = count + 1

rank = 1
column_header = ['Rank', 'Frequency', 'Frequency * Rank']
df = pd.DataFrame( columns = column_header )
collection = sorted(frequency.items(), key=itemgetter(1), reverse = True)

Here, we finally print our dataframe to the console

Warning!
This is a very machine intensive task, and if performed on an older device, could take some time.

# %%Creating a table for frequency * rank

for word , freq in collection:
    df.loc[word] = [rank, freq, rank*freq]
    rank = rank + 1

print (df)

But this is far from over! Since numbers are not great for visualization, we need to find a better way to show our values. This is what we get as our output in our console, if we execute all the code.

       Rank Frequency Frequency * Rank
to        1      5239             5239
the       2      5201            10402
and       3      4896            14688
of        4      4291            17164
i         5      3178            15890
    ...       ...              ...
true    296        66            19536
heart   297        66            19602
talked  298        66            19668
sir     299        66            19734
ready   300        66            19800

[300 rows x 3 columns]

We will be using one of the most useful tools available in Python - pyplot, provided by the matplotlib library. We chose to use bar graph to represent our values, as it is the perfect way to represent frequencies with respect to words.

# %% Python visualization with pyplot
plt.figure(figsize=(20,20))  #to increase the plot resolution
plt.ylabel("Frequency")
plt.xlabel("Words")
plt.xticks(rotation=90)    #to rotate x-axis values

for word , freq in collection[:30]:
    plt.bar(word, freq)    
plt.show()

Figure_1-1

This is how our visualized data looks like! From this, you can understand that our second word 'alice' is 12α times the first word 'said'. The third word 'little' is 13α; the frequency of the word 'said'. And so, the rest of the frequency keeps getting shorter in the series of 14α, 15α, 16α....

So, overall, our Python program should look something like this

# %% Import libraries required
from operator import itemgetter
import nltk
import pandas as pd
from nltk.corpus import stopwords
from matplotlib import pyplot as plt

# %% Initialize frequency, and import all the words in a corpus
frequency = {}
words_doc = nltk.Text(nltk.corpus.gutenberg.words('carroll-alice.txt'))
stop_words = set(stopwords.words('english'))

# %% Convert to lower case and remove stop words
words_doc = [word.lower() for word in words_doc if word.isalpha()]
words_doc = [word for word in words_doc if word not in stop_words]

# %% Calculate the frequency of the words inside
for word in words_doc :
    count = frequency.get(word , 0)
    frequency[ word ] = count + 1

rank = 1
column_header = ['Rank', 'Frequency', 'Frequency * Rank']
df = pd.DataFrame( columns = column_header )
collection = sorted(frequency.items(), key=itemgetter(1), reverse = True)

# %%Creating a table for frequency * rank

for word , freq in collection:
    df.loc[word] = [rank, freq, rank*freq]
    rank = rank + 1
    
print (df)

# %% Python visualization with pyplot
plt.figure(figsize=(20,20))  #to increase the plot resolution
plt.ylabel("Frequency")
plt.xlabel("Words")
plt.xticks(rotation=90)    #to rotate x-axis values

for word , freq in collection[:30]:
    plt.bar(word, freq)    
plt.show()
# %% End of the program

Significance of k, the constant

The value that we discussed earlier seems to have no relationship with both the dataframe, as well as the visualization. This is because of the fact that there were too many words to fit inside the chart, and the console would truncate automatically.
So, how do we prove that? Spyder IDE has a feature, called the Variable Explorer, from which we can visualize all the possible variables, and we will put that to test.

As said before, the value of k is not valid for the first and the last few data in our example, which will be explained further below, and so, we will choose some random values from between.

Screenshot 1:
Screenshot--30-

Screenshot 2:
Screenshot--31-

We can see that both the screenshots have a large gap in their ranks, but they are values from somewhere in the middle.If you check the value of 'Frequency * Rank' (which is k), you can see that most of the values are nearly similar, especially within a screenshot. So the laws do hold true for words.

Python files for reference:

To access the files, please click on the following link:
https://github.com/Ashvith/Zipf-s-Law
It contains both .py, as well as .ipynb format.

Problems with Zipf's law

Zipf's law require that the size of the material that is to be worked on, should be as large as a mature corpus - and not just from a source or two. The deviations in our first and last few terms were because of the simple reason that we used a simple document with a word count of 26,443 - this isn't sufficient to begin with. For more accurate results, you could use Brown Corpus (nearly 6 million words), Gutenberg or Google Book's corpus (155 million words for American English)- although it could take a lot of time processing, and you may need to move to Google's Colaboratory, or other such platforms.

Note:
You could use the whole Gutenberg corpus, by changing this code snippet:

words_doc = nltk.Text(nltk.corpus.gutenberg.words('carroll-alice.txt'))

to this one:

words_doc = nltk.Text(nltk.corpus.gutenberg.words())

or for Brown corpus:

words_doc = nltk.Text(nltk.corpus.brown.words())

Question 1

What is the importance of Zipf's law in data science and machine learning?

Forms the basis for NLP
Important for understanding AI
Not important
Basics of general programming
NLP is dependent on Zipf's Law, as well as Heap's law, for any basic operation that requires to find how words are related to each other.

Question 2

Is there any specific requirement for Zipf's Law?

Using a large corpus is recommended
Basic data processing is required
Using an english corpus is recommended
None - it works flawlessly
Zipf's law holds true for all languages. It may not work as expected, in a small corpus, so it is recommended that one must use a large corpus. Data processing may be required, but not all documents follow the same format. Some may require no need for cleaning data, or fixing the contents.

Question 3

What is significant about the value of k in expected outcome?

k remains constant for every word
There is no k in Zipf's law
It increases, as the ranks decreases
It decreases, as the ranks increases
According to the formula,
 <center>f &equals; <sup><i>k</i></sup>&frasl;<sub>r<sup>&alpha;</sup></sub></center>
<center>=> f &times; r<sup>&alpha;</sup> &equals; <i>k</i></center>
<center>=> f &times; r &equals; <i>k</i></center>
<center><b>(Since &alpha; is nearly equal to 1, we can ignore it for this case)</b></center>

This means that for the given word, we can have a value of f and r, which would give us the same value of <i>k</i>.

Sources