Benford's Law in ML

In this article, we explore Benford's Law and its uses in Machine Learning and data analysis.

Contents:

  • What is Benford's Law?
  • History and origin of Benford's law
  • Uses of Benford's law in machine learning and data analysis
  • Benford's law applied in auditing and fraud detection
  • Real-life datasets that do not follow Benford's law
  • Mathematical equation with example
  • How has Benford's law been generalized to other types of digits?
  • Limitations of Benford's law
  • Uses of Benford's law in various fields

1. What is Benford's law?


Benford's law is an observation about the frequency distribution of first digits in real-world data sets. It states that in many naturally occurring data sets, the first (leading) digit is more likely to be small, such as 1 or 2, than large, such as 9. The pattern appears in a wide range of data, including stock prices, population numbers, and physical constants, among others. The main idea is that such data tend to span several orders of magnitude and to be spread roughly evenly on a logarithmic scale. On that scale, the interval of numbers starting with 1 (for example, 100 to 200) is wider than the interval of numbers starting with 9 (900 to 1000), so the probability of a leading digit d is log10(1 + 1/d).

2. History and Origin of Benford's Law

Benford's Law, also known as the First-Digit Law, is a statistical property named after American physicist Frank Benford, who published it in 1938 in a paper called "The Law of Anomalous Numbers". The same pattern had been noted earlier, in 1881, by astronomer Simon Newcomb, but it was Benford who tested it on a wide range of data sets.

The law states that in many real-life datasets, the first digit is more likely to be a smaller digit (e.g. 1, 2, or 3) than a larger digit (e.g. 8 or 9). In other words, a number is more likely to start with a 1 than with a 9. The exact probabilities depend on the base in which the numbers are expressed; in base 10, they are as follows:

1: 30.1%
2: 17.6%
3: 12.5%
4: 9.7%
5: 7.9%
6: 6.7%
7: 5.8%
8: 5.1%
9: 4.6%

This property is useful for detecting data fraud or manipulation, as deviations from Benford's Law can indicate that data has been altered or fabricated.

3. Uses of Benford's law in machine learning and data analysis

Benford's Law has several applications in machine learning and data analysis, including:

  1. Fraud detection: Benford's Law can be used to detect fraudulent data by comparing the distribution of first digits in the data set to the expected distribution according to the law. Significant deviations can indicate that the data has been altered or fabricated.

  2. Data quality assessment: Benford's Law can be used to assess the quality of a data set by checking if it follows the expected distribution of first digits. If the data set does not follow Benford's Law, it may indicate issues with the data collection process or data entry errors.

  3. Outlier detection: Benford's Law can be used to narrow down where outliers are likely to be. A single number cannot deviate from a distribution, but a subset of the data (for example, one vendor, region, or time period) whose first digits deviate significantly from the expected distribution is a good candidate for closer inspection.

  4. Data normalization: Benford's Law is sometimes suggested as a reference distribution when transforming or rescaling data. Whether this improves the accuracy or stability of machine learning models trained on the data depends on the data set and should be checked empirically.

  5. Anomaly detection: Benford's Law can be used as a feature in anomaly detection algorithms to identify unusual patterns in the data. This can be useful in applications such as detecting cyberattacks or detecting unusual patterns in financial transactions.
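Several of the uses above come down to one number that measures how far a data set's first digits are from the Benford proportions. The following is a minimal sketch of such a score, using the mean absolute deviation (MAD) between observed and expected proportions. The function names `first_digit` and `benford_mad` and both sample data sets are assumptions made for illustration, not part of any library.

```python
import math

def first_digit(value) -> int:
    """Return the leading non-zero digit of a non-zero number."""
    # str() gives e.g. '0.00034' or '512'; strip leading zeros and the point.
    return int(str(abs(value)).lstrip("0.")[0])

def benford_mad(values) -> float:
    """Mean absolute deviation between observed and Benford first-digit proportions."""
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    deviations = []
    for d in range(1, 10):
        observed = digits.count(d) / n
        expected = math.log10(1 + 1 / d)
        deviations.append(abs(observed - expected))
    return sum(deviations) / 9

# Illustrative inputs (assumptions): powers of two follow Benford's law closely,
# while integers from 500 to 999 can only start with the digits 5 to 9.
powers_of_two = [2 ** k for k in range(1000)]
narrow_range = list(range(500, 1000))

print("powers of two:", round(benford_mad(powers_of_two), 4))
print("500..999:     ", round(benford_mad(narrow_range), 4))
```

A score close to zero means the first digits are close to the Benford proportions. A score like this could be fed to a model as a feature, computed per account, per user, or per time window.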

4. Benford's law applied in auditing and fraud detection


Benford's Law is commonly used in auditing and fraud detection due to its ability to detect deviations from expected patterns in numerical data.

In auditing, Benford's Law is used to detect irregularities in financial statements, such as balance sheets and income statements. The law provides a way to assess the likelihood of digits appearing in the financial data and can detect patterns that deviate from what is expected. For example, if a financial statement contains an unusually high number of transactions that begin with the digit "9", it could indicate manipulation or fraud.

In fraud detection, Benford's Law can be used to detect irregularities in data sets related to tax returns, voting patterns, and other records. The law provides a way to identify patterns that deviate from what is expected and can indicate the presence of fraud or manipulation.
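To make the "too many 9s" example concrete, here is a small sketch that scores each first digit with a one-proportion z-test against its Benford share. The ledger is invented for illustration: a geometric series of amounts plus 40 made-up entries just under a hypothetical 10,000 approval limit. The helper names and the cut-off of 3 are assumptions as well.

```python
import math
from collections import Counter

def first_digit(value) -> int:
    """Return the leading non-zero digit of a non-zero number."""
    return int(str(abs(value)).lstrip("0.")[0])

def digit_z_scores(values) -> dict:
    """z-score of each first digit's observed share against its Benford share."""
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    scores = {}
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        observed = counts[d] / n
        standard_error = math.sqrt(expected * (1 - expected) / n)
        scores[d] = (observed - expected) / standard_error
    return scores

# Invented ledger: a geometric series of amounts (roughly Benford-like) ...
ledger = [round(1.07 ** k, 2) for k in range(300)]
# ... plus 40 made-up entries just under a hypothetical 10,000 approval limit.
ledger += [9900 + i for i in range(40)]

z = digit_z_scores(ledger)
suspicious = [d for d, score in z.items() if score > 3]  # cut-off is an assumption
print("over-represented first digits:", suspicious)
```

In this invented ledger only the digit 9 is flagged. A flag like this tells an auditor where to look; it does not say why the excess exists.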

5. Real-life datasets that do not follow Benford's law


There are several types of data that do not follow Benford's law. Here are some examples:

1. Uniformly distributed data:
If values are spread evenly over a fixed range (for example, random integers from 1 to 999), each first digit occurs about equally often, so the data will not follow Benford's law.

2. Small datasets:
Benford's law is a large sample phenomenon, so smaller datasets may not exhibit the expected pattern.

3. Human-made data:
Numbers that people invent or adjust by hand, such as fabricated entries in financial reports or tax returns, tend not to follow Benford's law due to intentional or unintentional bias. This is what makes the law useful for fraud detection.

4. Non-naturally occurring datasets:
Some numbers, such as phone numbers or zip codes, are assigned as identifiers rather than measured or counted, so they do not follow Benford's law.

5. Discrete or narrow-range datasets:
Benford's law works best for data that span several orders of magnitude. Datasets restricted to a small set of values or a narrow range, such as ratings from 1 to 5 or adult human heights, do not follow the expected pattern.
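The first item in the list is easy to check by hand. As a small sketch, take every integer from 1 to 999 exactly once (a uniform spread over a fixed range) and count the first digits. The helper name `first_digit` is an assumption for illustration.

```python
from collections import Counter

def first_digit(value) -> int:
    """Return the leading non-zero digit of a non-zero number."""
    return int(str(abs(value)).lstrip("0.")[0])

# Every integer from 1 to 999 exactly once: a uniform spread over a fixed range.
counts = Counter(first_digit(n) for n in range(1, 1000))

for d in range(1, 10):
    print(d, counts[d], round(counts[d] / 999, 3))
```

Each digit appears 111 times, a share of about 11.1% each, instead of the 30.1% down to 4.6% that Benford's law predicts.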


6. Mathematical equation and Example


The mathematical equation for Benford's law is:

P(d) = log10(1 + 1/d)

where P(d) is the probability that the first digit of a number in a given dataset is d (where d is a digit from 1 to 9).

For example,
let's consider a dataset of 1000 numbers that follows Benford's law.
To apply the law, we first calculate the expected proportion of each first digit by using the equation
P(d) = log10(1 + 1/d).
The table below shows the expected proportion of the first digit in Benford's law:

First digit | Expected proportion
1 | 0.301
2 | 0.176
3 | 0.125
4 | 0.097
5 | 0.079
6 | 0.067
7 | 0.058
8 | 0.051
9 | 0.046
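The expected proportions can be reproduced directly from the equation P(d) = log10(1 + 1/d). The sketch below does that; the function name `benford_probability` and its `base` parameter are assumptions added for illustration.

```python
import math

def benford_probability(d: int, base: int = 10) -> float:
    """Probability that the leading digit equals d under Benford's law."""
    if not 1 <= d < base:
        raise ValueError("leading digit must be between 1 and base - 1")
    return math.log(1 + 1 / d, base)

probabilities = {d: benford_probability(d) for d in range(1, 10)}

for d, p in probabilities.items():
    print(d, round(p, 3))

# The nine probabilities sum to 1 (up to floating-point error).
print("total:", round(sum(probabilities.values()), 6))
```

For a dataset of 1000 numbers, multiplying each proportion by 1000 gives the expected count for that digit, for example about 301 numbers starting with 1.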

Next, we randomly generate 1000 numbers that follow Benford's law. The table below shows the actual frequency of each first digit in the generated numbers:

First digit | Observed frequency
1 | 328
2 | 195
3 | 128
4 | 94
5 | 80
6 | 64
7 | 54
8 | 52
9 | 5
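The article does not say how the 1000 numbers were generated. One common approach, assumed here, is log-uniform sampling: draw an exponent uniformly over a whole number of decades and raise 10 to that power. The seed, the range of six decades, and the helper name `first_digit` are arbitrary choices for this sketch, and the counts it prints will not match the table.

```python
import random
from collections import Counter

def first_digit(value) -> int:
    """Return the leading non-zero digit of a non-zero number."""
    return int(str(abs(value)).lstrip("0.")[0])

rng = random.Random(42)  # arbitrary fixed seed so the run is repeatable

# Log-uniform sampling: the exponent is uniform over six whole decades,
# which makes the leading digit follow Benford's law.
samples = [10 ** rng.uniform(0, 6) for _ in range(1000)]

observed = Counter(first_digit(x) for x in samples)
for d in range(1, 10):
    print(d, observed[d])
```

Because the samples are random, the counts fluctuate around the expected values from run to run when the seed is changed.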

To test if the generated numbers follow Benford's law, we calculate the chi-square statistic:

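χ² = ∑ (Observed Frequency - Expected Frequency)² / Expected Frequency

Here the sum runs over the nine first digits, and each expected frequency is 1000 × P(d). The statistic is then compared with the chi-square critical value for 8 degrees of freedom (15.51 at the 5% significance level). A larger value suggests that the data deviate from Benford's law.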
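As a sketch, the statistic can be computed for the observed counts listed in the table above. The 5% significance level and the variable names are assumptions for this example; 15.507 is the standard chi-square critical value for 8 degrees of freedom at that level.

```python
import math

# Observed counts copied from the table above.
observed = {1: 328, 2: 195, 3: 128, 4: 94, 5: 80, 6: 64, 7: 54, 8: 52, 9: 5}
n = sum(observed.values())

# Expected counts under Benford's law, and each digit's share of the statistic.
expected = {d: n * math.log10(1 + 1 / d) for d in observed}
contributions = {d: (observed[d] - expected[d]) ** 2 / expected[d] for d in observed}
chi_square = sum(contributions.values())

CRITICAL_VALUE = 15.507  # chi-square, 8 degrees of freedom, 5% significance

print(f"chi-square = {chi_square:.2f}")
print("deviates from Benford" if chi_square > CRITICAL_VALUE else "consistent with Benford")
for d, c in contributions.items():
    print(d, round(c, 2))
```

With the counts exactly as listed, the statistic is above the critical value. Almost all of it comes from digit 9, whose count of 5 is far below the roughly 46 expected, while the other eight digits are close to their expected counts.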

7. How has Benford's law been generalized to other types of digits?


Benford's law originally described the distribution of the first digit of numbers, but it has been generalized to other digit positions as well.
Here are a few examples:

1. Second-digit law:
The second-digit law describes the distribution of the second digit of numbers, and it can also be used to detect fraud and errors in data. Unlike the first digit, the second digit can be 0, and its distribution is much flatter than the first-digit distribution. Under Benford's law, "0" is the most common second digit (about 12.0%) and "9" is the least common (about 8.5%).

2. Last-digit law:
The last-digit law describes the distribution of the last digit of numbers, and it can also be used to detect fraud and errors in data. The last-digit law is different from Benford's law and the second-digit law because the distribution is approximately uniform rather than logarithmic. In other words, each digit from 0 to 9 is about equally likely to occur as the last digit, provided the numbers have enough digits.
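The second-digit probabilities follow from the same logarithmic rule: the probability that a number starts with the two digits d1 d2 is log10(1 + 1/(10·d1 + d2)), and summing over all first digits d1 gives the probability of the second digit d2. A minimal sketch, with the function name chosen for illustration:

```python
import math

def second_digit_probability(d2: int) -> float:
    """Benford probability that the second digit equals d2 (0 through 9)."""
    # Sum over every possible first digit d1 of P(number starts with "d1 d2").
    return sum(math.log10(1 + 1 / (10 * d1 + d2)) for d1 in range(1, 10))

for d2 in range(10):
    print(d2, round(second_digit_probability(d2), 3))
```

The values fall gently from about 0.120 for "0" to about 0.085 for "9". They are already close to the uniform 0.1 that the last-digit law describes.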


8. Limitations of Benford's law


Here are some limitations to consider:

1. Applicability to specific datasets:
While Benford's law applies to many types of datasets, it may not hold true for every dataset. Some datasets may follow different distributions or have specific features that affect the distribution of their digits.

2. Sensitivity to data manipulation:
The digit distribution is sensitive to any processing of the data, intentional or not. Rounding, truncation, thresholds, and price points can all shift it, so a deviation from Benford's law is a reason to investigate further, not proof of fraud.

3. Limited information on the cause of anomalies:
While Benford's law can detect anomalies in data, it does not provide information on the cause of those anomalies.

4. Limited to numeric data:
Benford's law only applies to numeric data and cannot be applied to other types of data.

5. No guarantee of conformity:
Benford's law is a probabilistic statement and does not guarantee that a given data set will conform to the expected pattern. Some deviation is expected by chance, so conformity should be judged with a statistical test rather than by eye.


9. Uses of Benford's law in various fields


Benford's law has applications in many fields, including:

1. Fraud detection:
Benford's law can be used to detect potential fraud in financial statements, tax returns, and other financial data. It can identify transactions that have an unusual distribution of digits, which may indicate fraudulent activity.

2. Forensic accounting:
Benford's law is a common tool used by forensic accountants to identify financial irregularities. They can use it to detect anomalies in large data sets, such as payroll data or expense reports.

3. Election monitoring:
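Benford's law has been used to monitor elections in several countries for signs of fraud. Analysts compare the digit distribution of reported vote counts with the expected pattern, often using the second digit because precinct sizes constrain the first. The reliability of this approach is disputed, so a deviation should be treated as a prompt for closer inspection rather than as evidence of fraud.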

4. Scientific data analysis:
Benford's law can be used to analyze scientific data, such as measurements of natural phenomena, to identify anomalies or errors in the data. It is particularly useful when dealing with large data sets that are difficult to analyze manually.

5. Quality control:
Benford's law can be used in quality control to identify errors or anomalies in manufacturing or production processes. It can help to identify defective products or inconsistent production runs.

6. Digital image analysis:
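Benford's law can be used to detect anomalies in digital images. Raw pixel values span too narrow a range to follow the law, but quantities derived from them, such as the transform coefficients used in JPEG compression, tend to show a Benford-like distribution of leading digits. Images that have been manipulated or recompressed can show a different distribution from unaltered images.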

7. Epidemiology:
Benford's law has been used to evaluate the quality of population data, such as mortality and birth rates.

8. Pharmacovigilance:
Benford's law has been used to evaluate adverse drug event (ADE) reports in pharmacovigilance.
