In this article, we have explained Reservoir Sampling, a family of randomised algorithms for sampling from large data streams. We have covered two methods: Simple Reservoir and Variable Probability.
Table of Contents
- Method 1: Simple Reservoir
- Method 2: Variable Probability
- Initial proof
- Revision of algorithm
- Proof of new algorithm
Prerequisite: Randomized Algorithms
Reservoir Sampling is a group of randomised algorithms which helps us choose random samples from a large stream of data. We want to be able to choose k random items from a population of unknown size n. The stream is usually too big to fit into main memory, and its length is not known to the algorithm in advance; it is revealed over time. Our algorithm cannot revisit previous items, and it must always be able to produce a simple random sample, without replacement, of k items from the portion of the stream seen so far. The algorithm itself is quite simple, but we will see in this article how elegant it becomes with the right optimisations.
Method 1: Simple Reservoir
Firstly, we can try to solve this problem simply by storing every element of the stream in an array and then sampling random indices from it. This is not a good approach: we do not know how large the stream is, and it is most likely too big to store in memory. We therefore need to change our solution to better fit the problem.
We can bring sorting into our solution: if we assign every element a random tag drawn uniformly between 0 and 1, sort by tag, and take the k items with the lowest tags, we are left with a uniform random sample. However, this method does not decrease our time or space complexity, and it is still infeasible for large datasets.
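The tag-and-sort baseline can be sketched as follows. The function name is illustrative, not from the original text:

```python
import random

def sample_by_sorting(items, k):
    """Baseline: tag every item with a uniform random number, sort by
    tag, and keep the k items with the smallest tags."""
    # Storing and sorting all tagged items costs O(n) memory and
    # O(n log n) time, which is exactly why this is infeasible for
    # huge streams.
    tagged = [(random.random(), item) for item in items]
    tagged.sort()
    return [item for _, item in tagged[:k]]
```

This version needs the whole population in memory at once, which is the limitation the reservoir technique removes.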
Here is where the reservoir technique can be implemented. For a sample of size k, we maintain a reservoir of the k elements with the lowest tags seen so far. For each incoming element, we compare its tag with the largest tag in the reservoir; if that reservoir element's tag is greater than the current element's, we replace it with the current element, then move on to the next item.
To further improve our approach, we can use a heap, which gives us an efficient way to maintain the k smallest-tagged items. This also means we can extract a sample at any point in the stream, so we can handle streams which are effectively infinite: for example, a web server can keep accepting new information while still being able to provide samples of the requests received from clients, with the sample evolving as more data arrives.
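A minimal sketch of the heap-based reservoir, assuming uniform tags in [0, 1) (the function name is my own):

```python
import heapq
import random

def tagged_reservoir(stream, k):
    """Keep the k items with the smallest random tags.

    Python's heapq is a min-heap, so tags are negated: heap[0] then
    holds the LARGEST tag currently in the reservoir.
    """
    heap = []  # entries are (-tag, item)
    for item in stream:
        tag = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (-tag, item))
        elif tag < -heap[0][0]:
            # New tag is smaller than the largest tag in the
            # reservoir, so the new item displaces that element
            # in O(log k) time.
            heapq.heapreplace(heap, (-tag, item))
    return [item for _, item in heap]
```

The current sample can be read off the heap at any point, which is what makes this usable on unbounded streams.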
Method 2: Variable Probability
Currently, our solution costs O(log k) time per element because of the heap. We can reduce this to constant time by using only the reservoir and disregarding the random tags entirely: as we iterate through the stream, we replace elements in the reservoir with a certain probability, and that probability varies as the stream is processed.
To demonstrate, let's say k = 1, meaning the sampling algorithm would work as follows:
- Store the first element as the reservoir element
- For the i-th element, make it the reservoir element with probability 1/i
For another example, if the stream were 5 elements long:
- Store element 1
- Store element 2 with a probability of 1/2
- Store element 3 with a probability of 1/3
- This means elements 1 and 2 each remain the reservoir element with probability 1/2 * 2/3 = 1/3; in other words, all three elements seen so far have a 1/3 chance of being chosen
- Store element 4 with a probability of 1/4
- The earlier elements then each have a 1/3 * 3/4 = 1/4 chance
- Store element 5 with a probability of 1/5
- Leaving the earlier elements each with a 1/4 * 4/5 = 1/5 chance of being picked
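The k = 1 procedure above can be sketched in a few lines (function name mine):

```python
import random

def reservoir_sample_one(stream):
    """Return one uniformly random element from a stream of unknown length."""
    reservoir = None
    for i, item in enumerate(stream, start=1):
        # random.randrange(i) is 0 with probability exactly 1/i, so the
        # i-th item replaces the reservoir element with probability 1/i.
        if random.randrange(i) == 0:
            reservoir = item
    return reservoir
```

Only one element and one counter are kept in memory, regardless of the stream's length.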
Proof by induction
We can prove this method works for any number of elements by using induction:
If our stream has n elements:
- Each element is chosen eventually with a probability of 1/n
If our stream now has n+1 elements:
- Each element should now be chosen with a probability of 1/(n+1)
By the algorithm, the (n+1)-th element becomes the reservoir element with probability 1/(n+1), while by the inductive hypothesis each earlier element held the reservoir with probability 1/n before it arrived.
This means the current reservoir element stays in the reservoir with probability 1 - 1/(n+1), which simplifies to n/(n+1).
It follows that every earlier element ends up as the reservoir element with probability 1/n * n/(n+1) = 1/(n+1), proving that all n+1 elements have an equal probability of being the one currently in the reservoir.
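The same inductive step, written out as one line of algebra:

```latex
\Pr[\text{earlier element in reservoir}]
= \underbrace{\frac{1}{n}}_{\text{held after } n \text{ items}}
\cdot \underbrace{\left(1 - \frac{1}{n+1}\right)}_{\text{survives item } n+1}
= \frac{1}{n}\cdot\frac{n}{n+1}
= \frac{1}{n+1}
```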
To fully solve our initial problem, we need to be able to sample k random elements of our stream, not just one. We maintain a reservoir of k elements, initially filled with the first k elements of the stream.
For the i-th element, when i > k, we add it to the reservoir with probability k/i. If it is added, it replaces a reservoir element chosen uniformly at random, so each existing slot is evicted with probability 1/k.
This leaves our algorithm as:
- Store the first k elements of the stream as the reservoir
- For each later element i, add it to the reservoir with probability k/i, evicting a uniformly random reservoir element
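The two steps above can be sketched as follows (a minimal version; the function name is mine):

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k items from a stream."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # The first k elements fill the reservoir directly.
            reservoir.append(item)
        else:
            # randrange(i) falls below k exactly k times out of i, so
            # the item is kept with probability k/i; the index drawn is
            # also uniform over the k slots, so it doubles as the
            # eviction target.
            j = random.randrange(i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Each element is processed in O(1) time with O(k) memory, which is the constant-time improvement this method was aiming for.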
Proof by induction
Similarly to our last method, this one can also be proved by induction:
If our stream has n elements with k <= n:
- Each element is chosen eventually with a probability of k/n
If our stream now has n+1 elements:
- Each element is chosen eventually with a probability of k/(n+1)
By the algorithm, the (n+1)-th element enters the reservoir with probability k/(n+1), while by the inductive hypothesis each earlier element is in the reservoir with probability k/n before it arrives. Several probabilities need to be considered to find whether an existing element remains in the reservoir:
- New element added to the reservoir: k/(n+1)
- Given it is added, it replaces a different slot: (k-1)/k
- Combining the above: k/(n+1) * (k-1)/k = (k-1)/(n+1)
- New element not added to the reservoir: (n+1-k)/(n+1)
To find the probability of an existing element staying in the reservoir, we add the combined case to the last case:
(n+1-k)/(n+1) + (k-1)/(n+1) = n/(n+1)
We then multiply this by k/n, the probability that the element was in the reservoir to begin with, leaving a total of
n/(n+1) * k/n = k/(n+1)
Therefore we have proved by induction that every element has an equal probability of k/(n+1) of being in the reservoir.
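The same case analysis, collected into two equations:

```latex
\Pr[\text{existing element survives step } n+1]
= \underbrace{\frac{n+1-k}{n+1}}_{\text{new item rejected}}
+ \underbrace{\frac{k}{n+1}\cdot\frac{k-1}{k}}_{\text{accepted, evicts another slot}}
= \frac{n}{n+1}

\Pr[\text{in reservoir after } n+1 \text{ items}]
= \frac{k}{n}\cdot\frac{n}{n+1}
= \frac{k}{n+1}
```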
As mentioned above, reservoir sampling can be implemented in web servers that receive a potentially infinite stream of data and must be able to sample it at any point, without storing the complete history of every request made to them. This can be seen in things like sampling lists of search queries on Google.
Optimised databases plan their queries by considering numerous approaches, which can be evaluated by simulations over a subset of the data. Reservoir sampling can be used to draw that subset from the database, since we might not know how large the dataset is. It also makes it easier to sample only the parts of the data relevant to a specific query.
In this article at OpenGenus, we have reviewed reservoir sampling techniques and developed an optimised algorithm to solve our problem. We have proved by induction that it works for any stream size and any sample size, and we have discussed various applications where these techniques are implemented. After reading this, you will have a good understanding of reservoir sampling.
Thanks for reading!