Jensen Shannon Divergence


This article discusses one of the metrics used to evaluate a model's performance. Such metrics compare the generated output with the provided ground-truth values. Jensen-Shannon Divergence is a technique for comparing distributions that can be easily used in parametric tests.

Table of contents:

What is the need for measuring divergence or similarity between distributions?
Jensen Shannon Divergence (JSD)
Properties of Jensen Shannon Divergence (JSD)
Implementing JSD in Python

What is the need for measuring divergence or similarity between distributions?

Machine Learning tasks often involve probability distributions over continuous as well as discrete data, for example as the outputs of various models or when computing the error between the actual and the predicted output.

Some of the important use cases of finding out the similarities between distributions are:

Comparing the probability distributions of the input and output features over time is useful in data drift detection. Data drift is one of the important reasons why the accuracy of models deployed in production decreases over time.
When calculating the image reconstruction error in Generative Adversarial Networks (GANs), the probability distributions of the original and reconstructed images are compared.
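The drift use case can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production monitor: the helper names `js_divergence_bits` and `drift_score`, the bin count, and the synthetic Gaussian data are all hypothetical choices made for this example. Both samples are binned on shared edges so that the two histograms describe the same outcomes.

```python
import numpy as np

def js_divergence_bits(p, q):
    """Jensen-Shannon divergence (base-2 log) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute 0 by convention
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def drift_score(reference, production, bins=20):
    """Histogram both samples on shared bin edges, then compare the histograms."""
    edges = np.histogram_bin_edges(np.concatenate([reference, production]), bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    prod_counts, _ = np.histogram(production, bins=edges)
    return js_divergence_bits(ref_counts, prod_counts)

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5000)  # feature values at training time
same = rng.normal(loc=0.0, scale=1.0, size=5000)       # production data, no drift
shifted = rng.normal(loc=1.5, scale=1.0, size=5000)    # production data, mean has moved

print("no drift:", drift_score(reference, same))
print("drift   :", drift_score(reference, shifted))
```

The score for the shifted sample should come out clearly larger than the score for the unshifted one. Where to draw the line between the two is application-specific and is not something this sketch decides.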

Jensen Shannon Divergence (JSD)

Jensen-Shannon Divergence (JSD) measures the similarity between two distributions (i.e. the ground truth and the simulated values). In other words, this metric quantifies how much one distribution diverges from the other. It is also known as the information radius (IRad) or total divergence to the average. It is based on the Kullback–Leibler Divergence (KL Divergence), Dₖₗ(P || Q), which is a non-symmetric measure of the difference between distributions: in general, Dₖₗ(P || Q) ≠ Dₖₗ(Q || P). JSD is a bounded symmetrization of the KL divergence that does not require the distributions to be absolutely continuous with respect to each other. Besides being symmetric, it is also a smoothed version of the KL divergence.

Suppose we have a sample x and we want to measure how likely x is under the ground truth distribution p as opposed to the generated distribution q. The likelihood ratio (LR) helps us measure this:

$$LR = \frac{p(x)}{q(x)}$$

A ratio greater than 1 indicates that x is more likely under p, while a ratio less than 1 indicates that x is more likely under q.

For a set of N independent samples, we calculate the total likelihood ratio as the product over the samples:
$$LR = \prod_{i=1}^N \frac{p(x_i)}{q(x_i)}$$

We take the logarithm to convert the product into a sum, which is easier to work with:
$$\log{(LR)} = \sum_{i=1}^N \log\left(\frac{p(x_i)}{q(x_i)}\right)$$

The logarithm preserves ordering, so the earlier interpretation carries over: log(LR) greater than 0 implies that p fits the samples better, while a value less than 0 indicates that q fits better.
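A short sketch can make the log-likelihood ratio concrete. The two Gaussian models and the sample size below are assumptions chosen only for illustration. The samples are drawn from p, so the evidence is expected to favour p.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical models: p plays the ground truth, q the generated distribution.
mu_p, mu_q, sigma = 0.0, 1.0, 1.0

# Samples drawn from p
x = rng.normal(mu_p, sigma, size=200)

# Summing log-ratios equals the log of the product of ratios,
# but avoids multiplying many small numbers together.
log_lr = np.sum(norm.logpdf(x, loc=mu_p, scale=sigma)
                - norm.logpdf(x, loc=mu_q, scale=sigma))

print("log(LR):", log_lr)
print("p fits better" if log_lr > 0 else "q fits better")
```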

We calculate the predictive power to quantify, on average, how much better one model is than the other. When the samples x_i are drawn from p, this average approaches an expectation under p as N grows, and that expectation is the KL divergence:

$$\text{Predictive Power} = \frac{1}{N}\sum_{i=1}^N \log\left(\frac{p(x_i)}{q(x_i)}\right) \approx \int_{-\infty}^{\infty}p(x)\log\left(\frac{p(x)}{q(x)}\right)dx$$

$$D(p(x), q(x)) = E_{x \sim p} \left[ \log\left(\frac{p(x)}{q(x)}\right) \right]$$
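The link between the sample average and the expectation can be checked numerically. The sketch below assumes two Gaussians with hypothetical parameters, because the KL divergence between Gaussians has a well-known closed form to compare against.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical parameters for p and q
mu_p, sigma_p = 0.0, 1.0
mu_q, sigma_q = 1.0, 2.0

# Sample average of log(p(x)/q(x)) with x drawn from p
x = rng.normal(mu_p, sigma_p, size=200_000)
kl_sample_mean = np.mean(norm.logpdf(x, loc=mu_p, scale=sigma_p)
                         - norm.logpdf(x, loc=mu_q, scale=sigma_q))

# Closed-form KL divergence between two univariate Gaussians, in nats
kl_closed_form = (np.log(sigma_q / sigma_p)
                  + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
                  - 0.5)

print("sample average:", kl_sample_mean)
print("closed form   :", kl_closed_form)
```

The two numbers should agree to about two decimal places at this sample size, and the gap shrinks as more samples are drawn.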

The JS Divergence can be calculated as follows:
$$D_{JS}(P || Q) = \frac{1}{2}D_{KL}(P || M) + \frac{1}{2}D_{KL}(Q || M)$$
and M can be calculated as,
$$M = \frac{1}{2}(P + Q)$$ which is a mixture of the two distributions.

or, it can also be written as,

$$JSD(p(x),q(x)) = \frac{1}{2}D\left(p(x),\frac{1}{2}\left(p(x) + q(x)\right) \right) + \frac{1}{2}D\left(q(x),\frac{1}{2}\left(p(x) + q(x)\right) \right) $$
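As a small worked example (the two distributions are chosen here purely for illustration), take P = (1/2, 1/2) and Q = (1, 0) over two outcomes, use base-2 logarithms, and treat 0 log 0 as 0:

$$M = \frac{1}{2}(P + Q) = \left(\frac{3}{4}, \frac{1}{4}\right)$$

$$D_{KL}(P || M) = \frac{1}{2}\log_2\left(\frac{1/2}{3/4}\right) + \frac{1}{2}\log_2\left(\frac{1/2}{1/4}\right) \approx 0.2075$$

$$D_{KL}(Q || M) = 1 \cdot \log_2\left(\frac{1}{3/4}\right) + 0 \approx 0.4150$$

$$D_{JS}(P || Q) = \frac{1}{2}(0.2075 + 0.4150) \approx 0.3113$$

Note that D_KL(P || Q) itself is infinite here, because Q assigns zero probability to an outcome that P allows. The JSD stays finite because M is positive wherever P or Q is.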

The JSD is bounded: its values lie in [0, 1] when the base-2 logarithm is used and in [0, ln(2)] when the natural logarithm (base e) is used.

The logarithm can be taken in base 2 to give units of “bits”, or in base e to give units of “nats”. A score of 0 means that both distributions are identical, and larger positive values indicate how different they are. With the base-2 logarithm, a value of 1 means that the distributions differ as much as possible.
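The two extremes can be checked directly. In this sketch the function names and the `base` parameter are assumptions of the example. It evaluates identical distributions and distributions with no overlap at all, in both bits and nats.

```python
import numpy as np

def kl_divergence(p, q, base=np.e):
    """KL divergence for discrete distributions; terms with p == 0 count as 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base)

def js_divergence(p, q, base=np.e):
    """JS divergence as the average KL divergence to the mixture m."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)

identical = [0.2, 0.3, 0.5]
disjoint_p = [1.0, 0.0]   # all mass on the first outcome
disjoint_q = [0.0, 1.0]   # all mass on the second outcome

print("identical, bits:", js_divergence(identical, identical, base=2))
print("disjoint,  bits:", js_divergence(disjoint_p, disjoint_q, base=2))
print("disjoint,  nats:", js_divergence(disjoint_p, disjoint_q))
print("ln(2)          :", np.log(2))
```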

Properties of Jensen Shannon Divergence (JSD)

It is based on the KL divergence, but it is symmetric: JSD(P || Q) = JSD(Q || P).
It provides a smoothed and normalized version of the KL divergence, because it is bounded between 0 and 1 when the base-2 logarithm is used.
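The symmetry property can be contrasted with the KL divergence on a concrete pair of distributions. The sketch below uses `scipy.special.rel_entr`, which computes the elementwise terms p log(p/q). The example vectors are arbitrary.

```python
import numpy as np
from scipy.special import rel_entr

def kl(p, q):
    """KL divergence in nats, summing the elementwise relative entropy terms."""
    return rel_entr(p, q).sum()

def js(p, q):
    """JS divergence in nats."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Arbitrary example distributions over three outcomes
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.15, 0.05])

print("KL(P || Q):", kl(p, q))
print("KL(Q || P):", kl(q, p))   # differs from KL(P || Q)
print("JS(P || Q):", js(p, q))
print("JS(Q || P):", js(q, p))   # same as JS(P || Q)
```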

Implementing JSD in Python:

import numpy as np

# create the data distribution
data_1 = abs(np.random.randn(1000))
data_2 = np.random.lognormal(size=1000)

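# function to compute KL Divergence
def KLD(p_probs, q_probs):
    """KL Divergence(P || Q), in nats (natural log)."""
    # assumes strictly positive entries; a zero in p_probs would give nan
    return np.sum(p_probs * np.log(p_probs / q_probs))

# function to compute JS Divergence
def JSD(p, q):
    # work on float copies so the caller's arrays are not modified in place
    p = np.array(p, dtype=float)
    q = np.array(q, dtype=float)
    # normalize so that each array sums to 1
    p /= p.sum()
    q /= q.sum()
    # mixture distribution
    m = (p + q) / 2
    return (KLD(p, m) + KLD(q, m)) / 2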
    
# To show JS Divergence is symmetric
result_JSD12 = JSD(data_1, data_2)
print("JS Divergence between data_1 and data_2", result_JSD12)
result_JSD21 = JSD(data_2, data_1)
print("JS Divergence between data_2 and data_1", result_JSD21)

For one random draw of the data, we obtain the following result (the exact value changes from run to run because no random seed is set):

JS Divergence between data_1 and data_2 0.1795276370397127
JS Divergence between data_2 and data_1 0.1795276370397127
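As a cross-check, SciPy ships `scipy.spatial.distance.jensenshannon`. It returns the Jensen-Shannon distance, which is the square root of the divergence, so the result has to be squared before comparing. The sketch below repeats the article's computation on seeded data (the seed is an assumption added for repeatability) and compares it with SciPy.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
data_1 = np.abs(rng.standard_normal(1000))
data_2 = rng.lognormal(size=1000)

# Same steps as the listing: normalize, mix, average the two KL terms (nats)
p = data_1 / data_1.sum()
q = data_2 / data_2.sum()
m = (p + q) / 2
jsd_manual = 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

# SciPy normalizes its inputs and returns the distance, so square it
jsd_scipy = jensenshannon(data_1, data_2) ** 2

# The same quantity in bits, using the base-2 logarithm
jsd_scipy_bits = jensenshannon(data_1, data_2, base=2) ** 2

print("manual (nats):", jsd_manual)
print("scipy  (nats):", jsd_scipy)
print("scipy  (bits):", jsd_scipy_bits)
```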

With this article at OpenGenus, you should now have a clear idea of Jensen-Shannon Divergence.
