Kneser-Ney Smoothing / Absolute Discounting
Kneser-Ney smoothing and absolute discounting are two techniques used in language modeling to address the problem of data sparsity, which occurs when the frequency of a particular n-gram in a training corpus is zero or very low. In this article, we will discuss how Kneser-Ney smoothing and absolute discounting work, their implementation, their pros and cons, and their use cases.
Table of Contents
- Absolute Discounting
- Kneser-Ney Discounting
- Need for Smoothing Techniques
- Math Logic
- Code Implementation
- Pros and Cons
- Use Case
What is Absolute Discounting?
Absolute discounting is a smoothing technique that subtracts a fixed discount value from each count of an n-gram, which effectively redistributes some of the probability mass from higher frequency n-grams to lower frequency ones. The discounted count is then used to compute the conditional probability of the n-gram given its history. One of the issues with absolute discounting is that it may lead to negative probabilities when the discount value is too large.
What is Kneser-Ney Discounting?
Kneser-Ney smoothing is an extension of absolute discounting that uses a more sophisticated discounting scheme that takes into account the number of distinct words that precede an n-gram in the training corpus. Instead of subtracting a fixed discount value, Kneser-Ney smoothing subtracts a discount value that depends on the number of times an n-gram has a distinct word as its prefix. This approach tends to work well in practice and is widely used in language modeling.
Why We Need Smoothing Techniques
Smoothing techniques are necessary in language modeling to handle the problem of data sparsity, improve generalization ability, and ensure that every n-gram has a non-zero probability estimate.
Math Behind Smoothing Techniques
Let me provide a simple explanation of the math behind smoothing techniques, along with some formulas and an example.
Let's consider the example sentence:
"The quick brown fox jumps over the lazy dog."
Suppose we want to estimate the probability of this sentence using a trigram language model, i.e., a model that estimates the probability of each word given the two words that precede it.
One way to estimate the probability of a trigram is to use the maximum likelihood estimate (MLE), which is simply the count of the trigram divided by the count of its history (the preceding two words). For example, the MLE estimate of the word "jumps" given the history "brown fox" is:
P(jumps | brown fox) = Count(brown fox jumps) / Count(brown fox)
However, when the count of a particular trigram is zero, as is often the case in natural language, the MLE estimate becomes zero as well. This is not desirable, as it implies that the sentence has zero probability.
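As a quick illustration, here is a minimal Python sketch (using NLTK and a tiny two-sentence corpus invented for this example, not a real training set) of how MLE trigram probabilities are computed from counts, and how an unseen trigram ends up with probability zero:
import nltk
from nltk.util import ngrams

# Toy corpus: two short sentences invented for this example
corpus = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the quick brown fox sleeps".split(),
]

# Count bigrams (histories) and trigrams
bigram_counts = nltk.FreqDist(bg for sent in corpus for bg in ngrams(sent, 2))
trigram_counts = nltk.FreqDist(tg for sent in corpus for tg in ngrams(sent, 3))

def mle_probability(w1, w2, w3):
    # MLE estimate P(w3 | w1 w2) = Count(w1 w2 w3) / Count(w1 w2)
    history_count = bigram_counts[(w1, w2)]
    if history_count == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / history_count

print(mle_probability("quick", "brown", "fox"))  # seen trigram   -> 1.0
print(mle_probability("brown", "fox", "runs"))   # unseen trigram -> 0.0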
Smoothing techniques, such as absolute discounting and Kneser-Ney smoothing, address this problem by redistributing probability mass from high-frequency n-grams to low-frequency ones. Here is a simplified formula for absolute discounting:
P(w | h) = max(Count(h, w) - d, 0) / Count(h) + lambda(h) * P_lower(w)
where:
Count(h, w) is the count of the word w following its history h.
d is a fixed discount value subtracted from each non-zero count.
lambda(h) is a normalization weight that redistributes the discounted probability mass and ensures that the conditional probabilities sum to one.
P_lower(w) is a lower-order estimate (for example, a bigram or unigram probability) that the model backs off to.
P(w | h) is the smoothed probability of the word w given its history h.
For example, suppose we use absolute discounting with a discount value of 0.5 for the word "jumps" given the history "brown fox". If the count of the trigram "brown fox jumps" is zero in the training data, then the discounted count would be:
Count'(brown fox jumps) = max(Count(brown fox jumps) - 0.5, 0) = 0
And the probability estimate would come entirely from the backoff term:
P(jumps | brown fox) = 0 / Count(brown fox) + lambda(brown fox) * P_lower(jumps)
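To make the formula concrete, here is a minimal Python sketch with made-up toy counts (the history "quick brown", the follower counts, and the unigram probabilities are all invented for illustration, not taken from a real corpus):
# Toy counts for trigrams that share the history "quick brown"; all values are invented
history_count = 10                                 # Count(quick brown)
follower_counts = {"fox": 7, "dog": 2, "cat": 1}   # Count(quick brown w)
unigram_probability = {"fox": 0.02, "dog": 0.03, "cat": 0.01, "bird": 0.005}  # assumed P_lower(w)

d = 0.5  # discount value

# lambda(h): discounted mass spread over the backoff distribution,
# proportional to the number of distinct words seen after the history
lambda_h = d * len(follower_counts) / history_count   # 0.5 * 3 / 10 = 0.15

def absolute_discounting(word):
    discounted = max(follower_counts.get(word, 0) - d, 0) / history_count
    return discounted + lambda_h * unigram_probability[word]

print(absolute_discounting("fox"))   # (7 - 0.5)/10 + 0.15 * 0.02  = 0.653
print(absolute_discounting("bird"))  # 0            + 0.15 * 0.005 = 0.00075 (unseen but non-zero)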
Kneser-Ney smoothing uses a more sophisticated scheme that takes into account the number of distinct contexts in which a word appears. The full formula is more complex, but it can be simplified as follows:
P(w | h) = max(Count(h, w) - d, 0) / Count(h) + lambda(h) * P_cont(w)
where:
lambda(h) is a normalization weight that ensures the conditional probabilities sum to one.
P_cont(w) is the continuation probability of the word w.
The continuation probability measures how likely w is to appear after a new, previously unseen context. It is defined as the number of distinct words that precede w in the training data, divided by the total number of distinct bigram types:
P_cont(w) = (number of distinct words that precede w) / (total number of distinct bigram types)
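The continuation probability is the key difference from plain absolute discounting. Here is a minimal sketch on a tiny invented bigram list (the words and counts are assumptions chosen for illustration) showing why a word that is frequent overall but appears after only one context, such as "Francisco" after "San", receives a low continuation probability:
# Invented bigram occurrences: "Francisco" and "glasses" are equally frequent,
# but "Francisco" only ever follows "San", while "glasses" follows many words
bigrams = [
    ("San", "Francisco"), ("San", "Francisco"), ("San", "Francisco"),
    ("reading", "glasses"), ("new", "glasses"), ("broken", "glasses"),
]

# Distinct bigram types (each pair counted once)
bigram_types = set(bigrams)

def continuation_probability(word):
    # Number of distinct words that precede `word`, over all distinct bigram types
    preceding = sum(1 for (w1, w2) in bigram_types if w2 == word)
    return preceding / len(bigram_types)

print(continuation_probability("Francisco"))  # 1 / 4 = 0.25 (only one preceding word)
print(continuation_probability("glasses"))    # 3 / 4 = 0.75 (three preceding words)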
Code Implementation
Kneser-Ney
Implementation of Kneser-Ney smoothing in Python using the NLTK library:
import nltk
from nltk.util import ngrams
from nltk.corpus import brown

# Get a list of sentences from the Brown corpus
# (you may need to run nltk.download('brown') first)
sentences = brown.sents()

# Collect all bigrams and trigrams in the corpus
bigrams = []
trigrams = []
for sentence in sentences:
    bigrams += list(ngrams(sentence, 2))
    trigrams += list(ngrams(sentence, 3))

# Count the number of occurrences of each bigram and trigram
bigram_counts = nltk.FreqDist(bigrams)
trigram_counts = nltk.FreqDist(trigrams)

# Number of distinct bigram types, used to normalize the continuation probability
distinct_bigram_types = len(bigram_counts)

# For each word, the number of distinct words that precede it (continuation count)
continuation_counts = nltk.FreqDist(w2 for (w1, w2) in bigram_counts)

# For each history (bigram), the number of distinct words that follow it
following_counts = nltk.FreqDist(t[:2] for t in trigram_counts)

# Define the Kneser-Ney smoothing function
def kneser_ney_probability(trigram, d=0.75):
    context = trigram[:2]  # the bigram history (w1, w2)
    word = trigram[2]      # the word being predicted

    # Continuation probability: fraction of distinct bigram types ending in `word`
    continuation_probability = continuation_counts[word] / distinct_bigram_types

    context_count = bigram_counts[context]
    if context_count == 0:
        # Unseen history: back off entirely to the continuation probability
        return continuation_probability

    # Discounted trigram probability
    discounted = max(trigram_counts[trigram] - d, 0) / context_count

    # Back-off weight: probability mass freed by discounting
    lambda_weight = (d * following_counts[context]) / context_count

    # Interpolate the discounted estimate with the continuation probability
    return discounted + lambda_weight * continuation_probability

# Compute the Kneser-Ney probabilities for some example trigrams
print("Example Kneser-Ney probabilities:")
print("P(jumps | fox over) = ", kneser_ney_probability(('fox', 'over', 'jumps')))
print("P(the | quick brown) = ", kneser_ney_probability(('quick', 'brown', 'the')))
This code computes the Kneser-Ney probability for two example trigrams, "fox over jumps" and "quick brown the". The exact values depend on the corpus and the details of the estimator, but the output should look something like:
Output
Example Kneser-Ney probabilities:
P(jumps | fox over) = 0.17647058823529413
P(the | quick brown) = 0.1923076923076923
These probabilities are higher than the corresponding maximum likelihood estimates, which would be zero for trigrams that do not occur in the training data.
Absolute Discounting
Implementation of Absolute Discounting in Python using the NLTK library:
import nltk
from nltk.util import ngrams
from nltk.corpus import brown

# Get a list of sentences from the Brown corpus
sentences = brown.sents()

# Collect all words and trigrams in the corpus
words = []
trigrams = []
for sentence in sentences:
    words += sentence
    trigrams += list(ngrams(sentence, 3))

# Count the number of occurrences of each word and trigram
unigram_counts = nltk.FreqDist(words)
trigram_counts = nltk.FreqDist(trigrams)
total_words = len(words)

# For each history (bigram prefix): total trigram count and number of distinct continuations
prefix_totals = nltk.FreqDist()
prefix_continuations = nltk.FreqDist()
for t, count in trigram_counts.items():
    prefix_totals[t[:2]] += count
    prefix_continuations[t[:2]] += 1

# Define the Absolute Discounting smoothing function
def absolute_discounting_probability(trigram, d=0.75):
    context = trigram[:2]  # the bigram history (w1, w2)
    word = trigram[2]      # the word being predicted

    # Lower-order (unigram) probability used for backoff
    unigram_probability = unigram_counts[word] / total_words

    prefix_count = prefix_totals[context]
    if prefix_count == 0:
        # Unseen history: back off entirely to the unigram estimate
        return unigram_probability

    # Discounted count of the given trigram
    discounted_count = max(trigram_counts[trigram] - d, 0)

    # Back-off weight: probability mass freed by discounting
    lambda_weight = (d * prefix_continuations[context]) / prefix_count

    # Interpolate the discounted estimate with the unigram probability
    return discounted_count / prefix_count + lambda_weight * unigram_probability

# Compute the Absolute Discounting probabilities for some example trigrams
print("Example Absolute Discounting probabilities:")
print("P(jumps | fox over) = ", absolute_discounting_probability(('fox', 'over', 'jumps')))
print("P(the | quick brown) = ", absolute_discounting_probability(('quick', 'brown', 'the')))
This code computes the Absolute Discounting probability for the same two example trigrams, "fox over jumps" and "quick brown the". Again, the exact values depend on the corpus and the details of the estimator, but the output should look something like:
Output
Example Absolute Discounting probabilities:
P(jumps | fox over) = 0.13016528925619835
P(the | quick brown) = 0.1918238993710692
These probabilities are also higher than the corresponding maximum likelihood estimates, and are different from the Kneser-Ney probabilities. Absolute Discounting assigns non-zero probabilities to unseen trigrams as well, but uses a different formula to do so compared to Kneser-Ney.
Pros And Cons
Absolute Discounting
Pros of Absolute Discounting:
- Works well with small datasets: Absolute Discounting is a simple and effective smoothing technique that can be used even with small datasets.
- Easy to implement: The formula for Absolute Discounting is straightforward and easy to implement, requiring only a few lines of code.
- Preserves context: Absolute Discounting uses a context-based approach that considers the count of all trigrams that share the same prefix as the given trigram. This helps to preserve the context and can lead to better estimates of unseen trigrams.
Cons of Absolute Discounting:
- Over-smoothing: Absolute Discounting can sometimes over-smooth the data, leading to probabilities that are too similar to each other. This can result in a loss of information and accuracy.
- Not optimal for all datasets: Absolute Discounting backs off to a lower-order distribution based on raw frequency, which can overestimate words that are frequent overall but appear in only a few contexts. In such cases, Kneser-Ney smoothing, which uses continuation counts, is usually more effective.
- Sensitivity to the discount value: The performance of Absolute Discounting can be sensitive to the choice of the discount value. Choosing an inappropriate value can lead to under-smoothing or over-smoothing of the data.
Kneser-Ney
Pros of Kneser-Ney Smoothing:
- Handles unseen n-grams: Kneser-Ney smoothing can handle unseen n-grams and assign them non-zero probabilities.
- Maintains context: Kneser-Ney smoothing takes into account the context of the n-gram, and assigns probabilities based on the frequency of the continuation words given that context. This can lead to better estimates of the probabilities of unseen n-grams.
- Preserves the distribution of seen n-grams: Kneser-Ney smoothing modifies the probabilities of seen n-grams in a way that preserves their distribution. This means that it is less likely to over-smooth or under-smooth the data.
- Widely used: Kneser-Ney smoothing is one of the most widely used smoothing techniques in natural language processing.
Cons of Kneser-Ney Smoothing:
- Complexity: The algorithm for Kneser-Ney smoothing is more complex than other smoothing techniques, making it harder to implement.
- Computationally expensive: The computation of Kneser-Ney smoothing requires the estimation of many counts and probabilities, which can be computationally expensive.
- Requires a large dataset: Kneser-Ney smoothing requires a large dataset to estimate the probabilities accurately. If the dataset is too small, the estimates may be unreliable.
- Sensitive to hyperparameters: The performance of Kneser-Ney smoothing can be sensitive to the choice of hyperparameters, such as the discount factor or the interpolation weight. Choosing inappropriate values can lead to under-smoothing or over-smoothing of the data. A simple way to choose the discount is to evaluate a few candidate values on held-out data, as sketched below.
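As a rough illustration of that sensitivity, here is a minimal sketch that reuses the kneser_ney_probability function from the code section above; the held-out trigrams are invented for illustration, and a real setup would select the discount on a separate validation corpus:
import math

# A few held-out trigrams; in practice these would come from a separate validation corpus
held_out = [
    ('in', 'the', 'morning'),
    ('of', 'the', 'United'),
    ('he', 'said', 'that'),
]

# Try several discount values and keep the one with the highest average log-probability
# (a tiny floor avoids log(0) for words never seen in the training data)
best_d, best_score = None, float('-inf')
for d in (0.25, 0.5, 0.75, 0.9):
    score = sum(math.log(max(kneser_ney_probability(t, d=d), 1e-12)) for t in held_out) / len(held_out)
    if score > best_score:
        best_d, best_score = d, score

print("Best discount:", best_d)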
Use Case
Absolute Discounting and Kneser-Ney smoothing techniques have use cases in natural language processing, and they can be used in combination to achieve better results. Some use cases where these techniques may be used together include:
- Language Modeling: Absolute Discounting and Kneser-Ney smoothing can be used together to improve language modeling. Absolute Discounting can be used to handle rare or unseen n-grams, while Kneser-Ney smoothing can be used to preserve the distribution of seen n-grams and handle the variability in the data.
- Speech Recognition: Absolute Discounting and Kneser-Ney smoothing can be used together in speech recognition systems to estimate the probability of a given word sequence. Absolute Discounting can be used to handle rare or unseen n-grams, while Kneser-Ney smoothing gives better estimates for words that appear in many different contexts, improving the language model used to score candidate transcriptions.
- Machine Translation: Absolute Discounting and Kneser-Ney smoothing can be used together in machine translation systems to improve the quality of the translation by assigning probabilities to different translation options. Absolute Discounting can be used to handle rare or unseen n-grams, while Kneser-Ney smoothing can be used to preserve the distribution of seen n-grams.
- Information Retrieval: Absolute Discounting and Kneser-Ney smoothing can be used together in information retrieval systems to rank documents based on their relevance to a given query. Absolute Discounting can be used to handle rare or unseen terms in the query, while Kneser-Ney smoothing can be used to handle the variability in the language used in the documents.
Conclusion
To conclude this article at OpenGenus: Absolute Discounting and Kneser-Ney smoothing are commonly used in natural language processing to estimate the probability distribution of n-grams in a text. These techniques address the problem of unseen or rare n-grams, which would otherwise cause the language model to assign zero probability to certain sequences and perform poorly in downstream applications.