In this article, we will be introducing and exploring the idea of Probabilistic algorithms in depth with the different algorithms like Morris Algorithm, HyperLogLog, LogLog and others in this domain.
Table of content:
- Overview of Probabilistic/ Approximate Counting algorithms
- Problem statement of counting
- Approximate counting algorithm (Morris Algorithm)
- Hyperloglog algorithm
4.2 Loglog algorithm
4.4 Improvements & Differences
Overview of Probabilistic/ Approximate Counting algorithms
The probabilistic/approximate counting algorithms allow one to count a large number of events using a small amount of memory.
These type of algorithms are especially useful when the memory aspect for a program, application, etc. in terms of usage and complexity has to be minimal.
In this article, we will be exploring two Probabilistic/Approximate Counting algorithms, which are:
- Approximate counting or Morris's algorithm
- LogLog algorithm
- Hyperloglog algorithm
- Further improvements
Problem statement of counting
In most of the probabilistic algorithms, the counting procedure is usually similar.
Let us assume that we have a hash function denoted by:
fun hash(x: records): scalar range[0 ... 2ᴸ-1]
where length is denoted by L.
The function transforms the records (passed into it) into integers uniformly distributed over the scalar range of binary strings of length L.
It can be observed that if the values of hash(x) are distributed uniformly, the pattern 0ᵏ1 ... appears with a probability of 2⁻ᵏ⁻¹. The main idea is to record observations on the occurrence of the said patterns in a vector BITMAP of length L (BITMAP[0 ... L-1]).
Let us assume that the list/multi-set whose cardinality is to be determined is M, then we the following operations have to be performed to accomplish the objective:
1. for i from 0 to L-1 do: 1.1. put BITMAP[i] = 0 2. for all x in M do: 2.2. put index = ρ(hash(x)) 2.3. if BITMAP[index] = 0 do: 2.3.1. put BITMAP[index] = 1
Another function/method that has been referenced in the above algorithm is ρ(i), which basically represents the position of the least significant 1-bit in the binary representation of i. It is essentially going to return the index of the least significant 1-bit present in the value that is returned by the hash(x) function.
The final value obtained in the BITMAP vector is used as an anchor point for the counting operation.
Keep in mind that most of the advanced algorithms have modifications of their own to the algorithm, however, the base mostly remains the same.
3. Approximate counting algorithm (Morris Algorithm)
The approximate counting algorithm was invented by Robert Morris in 1977.
The algorithm makes use of probabilistic techniques to increment the counter in order for it to keep count of the events and because of this, its exactness isn't really absolute, although it does provide a fairly good estimate of the true value while providing a minimal and yet fairly constant relative error.
This algorithm stemmed from the time when once Morris had to count a large number of events in an 8-bit register. Seeing that the count would probably exceed the 255 mark, he (Morris) decided that it would be better if he developed a probabilistic algorithm with a quantifiable margin of error first and then deployed it.
The algorithm is considered as one of the foundational algorithm or base for streaming algorithms.
A simple and obvious way to create a working probabilistic/approximate counter would be to count every alternate event.
In Morris's algorithm, however, instead of keeping track of the total number of events n or some constant multiple of n, the value we store in the register is the logarithmic value of (1 + n).
How does the counter make the decision of as to which event should be counted or not? One of ways is to make use of the coin flip technique, which is relatively simple. Here is how it works:
We flip a coin and if it comes out to be head, then we increment the counter else we let it be.
The counter will represent an order of magnitude estimate, i.e., an approximation of the log of a value relative to some understood reference value, of the actual count. When the counter is operating, only the exponent is stored of the probabilistic technique that is used to increment the counter. This is done to save space.
For example, with base as 2, the counter can estimate the count to be 1, 2, 4, 8, 16, .... and so on. With base as 3, the counter can estimate the count to be 1, 3, 9, 27, 81, .... and so on.
To increment from 9 to 27, a pseudo-random number is to be generated such that a probability of ¼ (0.25) generates a positive change in the counter. If it does, then the counter is incremented, else the counter will remain at 9.
The following table, with its base as 3, demonstrates the working and the potential values of a counter:
|Binary value of counter||Estimate||Values range for actual count||Expectation|
|0||1||0, or the initial value||0|
|1||3||1, or more||6|
|10||9||2, or more||24|
|11||27||3, or more||78|
|100||81||4, or more||240|
|101||243||5, or more||726|
When the counter holds the value of 100 (binary value), which equates to an exponent of 4 (the decimal equivalent of 100), then the estimated count is 3^4, or 81. There is a fairly low probability that the actual count of increment events was 4 (1 * ⅓ * ⅑ * ¹⁄₂₇ = ¹⁄₇₂₉). The actual count of increment events is likely to be somewhere around 81, but it could be arbitrarily high.
The algorithm can be used for investigating large data sets/streams for patterns. It is particularly useful in applications of data compression, sight and sound recognition, and other artificial intelligence applications.
Let's assume that we have to count up to n by masking use of the Morris's algorithm. We know that it will go up to
O(logn) (remember that the counter only keeps the exponent to save space), and so the total number of bits that would be required to count from 0 up to
O(logn) would become
Hence, the overall space complexity of Morris's algorithm is
4. Hyperloglog algorithm
The Hyperloglog algorithm is derived and is a further extension of the earlier loglog algorithm. The main objective of both these algorithms remains the same, that is, to estimate to number of unique/distinct items in a given list/set/stream. The hyperloglog algorithm determines the cardinality or the uniqueness (number of unique elements) in a list/set/stream.
Like most probabilistic algorithms, the hyperloglog algorithm also uses significantly less memory or space to accomplish the same task as its counterpart algorithms, mainly because of the fact that the algorithm stresses mostly on saving memory and space without compromising its performance and main objective.
Let's have a brief look at the loglog algorithm first in order for us to have a better understanding of the Hyperloglog algorithm.
4.2 Loglog algorithm
Loglog is a probabilistic algorithm that makes use of a hashing function in order to randomize data and then convert them to a form that resembles random binary data. The hashed data set is then reformed into cardinality estimates by the algorithm.
In general, the loglog algorithm makes use of n small bytes of auxiliary memory to estimate the number of unique elements of a list in a single pass with an accuracy that is of the order of 1/√n.
It can be useful in many scenarios, such as to count the number of different words and their cardinality from a whole book very quickly, etc.
In terms of space complexity, the loglog algorithm consumes
O(log(logn)) bits of storage if n is the range till which the algorithm has to operate.
There are many optimized versions of this algorithm like super-loglog, hyperloglog, etc.
Much like the algorithm from which hyperloglog originated (loglog algorithm), it also makes use of a hash function that is applied to every element that is present in the given set/list to acquire a set of uniformly distributed random numbers with the same cardinality as that of the original set.
However, only doing this to obtain the cardinality of a set will result in a large variance. To counter this, the hyperloglog algorithm further sub-divides the set into various subsets, after which the maximum number of leading zeros in the numbers for each subset is calculated. After this operation, a harmonic mean is used to combine the calculated estimates for all the subsets into an estimate of the cardinality of the combined subsets.
The hyperloglog algorithm has three primary operations and its data is stored in an array P of counters called registers with size p that are set to 0 in their initial state.
1> Add operation: This operation is used to add a new element to the set. It consists of computing the hash of the input data d with a hash function h, getting the first b bits (b = log2p). The address of the register that is to be modified is obtained by adding 1 to them. The bits that are remaining are used to compute ρ(w), which returns the position of the 1 that is present at the left-most position in terms of index (in binary form). The maximum value between the current value of the register and ρ(w) is chosen as the new value of the register.
2> Count operation: This operation is used to obtain the cardinality of the set. The harmonic mean of the p registers is calculated and then the estimate E of the count (cardinality) is derived using a constant.
3> Merge operation: This operation is used to merge or join two sets together. For example, if we have two different sets U1 and U2 that are using the hyperloglog algorithm, then the merge operation will obtain the maximum for each pair of registers, i.e., j from 1 to p.
Merge(U1, U2)[j] = max(U1[j], U2[j])
4.4 Improvements & Differences
Although the overall complexity of both the Hyperloglog and the Loglog algorithm is roughly the same, they still have some significant differences between them as Hyperloglog is simply an improved and more efficient version of the loglog algorithm.
The hyperloglog algorithm further divides the sets into sub-sets and then performs further operations on them. This is done mainly to ensure that the variance (when the cardinality is being obtained) remains minimal. This increases the efficiency and accuracy of this algorithm.
The hyperloglog uses the harmonic mean to estimate the cardinality of a given list/set, whereas the loglog algorithm uses the normal mean.
The standard error estimation for the Hyperloglog algorithm is (1.04/√n), whereas it is (1.30/√n) for the loglog algorithm, where n are the number of buckets. The difference, although not very significant, does come into play when the algorithms are to be used on big data sets (which they usually are).
The main application of the hyperloglog algorithm is to estimate accurate cardinality for a given set while being extremely space-sufficient.
It can be used in applications or softwares that have very large data sets and can also be utilized for various forms of graph statistics.
The relative error of Hyperloglog algorithm is generally 1/√p with a space complexity of
O(ε⁻²log(logn) + logn), where n is the cardinality of the set and p is the number of registers.
Taking the primary operations into consideration, the add operation is dependent upon the size of the output of the main hash function, and hence its running time is
O(1), because of its fixed nature. Since the count and merge operations are actually dependent upon the size of the registers (p), they have a running time of
O(p). However, if the number of registers are fixed, then the running time of both these operations will also become constant.
Hence, the hyperloglog algorithm has a overally space complexity of
- "Probabilistic counting algorithms for database applications" by Philippe Flajolet and G. Nigel Martin in "Journal of Computer and System Sciences" October 1985.