Jump Consistent Hash: A Fast, Minimal Memory, Consistent Hash Algorithm

Open-Source Internship opportunity by OpenGenus for programmers. Apply now.

In this article, we discuss the jump consistent hashing algorithm which offers improvements compared to previous approaches to consistent hashing for distributed data storage applications.

Table of contents:

Introduction to Consistent Hashing
Consistent hashing algorithms
- Kargers algorithm
- Rendezvous Algorithm
Jump consistent hashing algorithm
- Benefits of jump consistent hash
- Steps
- Code
- Time Complexity Analysis
- Optimization

Prerequisite: Hashing

Introduction to Consistent Hashing

Consistent hashing is a strategy for distributing data in a distributed hash table in a way that nodes in the network can be added and/or removed without having to reorganize data each time.

Properties for consistent hashing according to David Karger's paper:

Monotonicity - data can only be transfered from old to new nodes.
Spread - outputs for different inputs are spread as evenly as possible over the output distribution.
Load - even distribution of work load among nodes.
Balance - objectsare evenly distributed among buckets.

Example

To distribute a dataset among n nodes in a network. We have to compute the hash value of each object and based on that value we assign it to a node with the same hash value within the network.

dht-2--1

Problem

The problem with this approach is the following,
In the distribution above we used the hash function
h(x) = x%9, 9 represents the cardinality which means we expect 9 groups, that is 9 shards.
But data grows in size and we may need more shards, that means the previous hash function does not work anymore and we need to recompute and redistribute data to nodes again.
This is costly especially we we are dealing with huge amounts of data per bucket for multiple nodes in a network.

Consistent hashing algorithms

Kargers algorithm

Given an object the algorithm will compute its hash value and proceed clockwise round a network of nodes in a distributed system until it finds a matching arc associated with the computed hash value and assigns the object to it.
The memory requirements are ***(No. of arcs * range of points covered by the arc)***.

Rendezvous Algorithm

Given an object it computes its hash value and returns a bucket/shard for which the computed hash value yielded the highest value.
This approach was further improved. The idea was to organize buckets in a tree data structure so it would take logn steps but trees come at a cost of rebalancing when new buckets are added or removed.

These algorithms allow for random bucket ids which means easier removal and addition of buckets, this is an advantage when it comes to cache server but not ideal for data storage applications. The reason for this is that in data storage applications a bucket is responsible for data assigned to it and if we add or remove shards randomly we risk low availability for the data owned by that bucket.

Jump consistent hashing algorithm

Jump consistent hashing solves this problem by moving only data whose bucket assignment, that is, data whose hash value changes when there is an increase or decrease in the number of shards.

This is a significant improvement when compared to the David Karger's algorithm which recomputes and reassigns data to shards even though some of the data will be reassigned to the same bucket as it was once before. The memory consumption here is very high when considering large amounts of data that get moved between nodes.

The algorithm saves memory in that fewer resources are needed for even redistribution of data among the shards because it will only deal with a small portion of the affected data.

Out of the four properties for consistent hashing mentioned at the begining of this post, the jump consistent hashing algorithm applies two which are important for data storage applications. These are balance, which states that objects are evenly distributed among buckets and monotonicity which states that when buckets are increased, objects move from old to new buckets to avoid unnecessary rearrangements. The other two are not applicable in the algorithm.

Benefits of jump consistent hash

No storage requirements.
Faster.
Even key distribution.
Even workload distribution.

Steps

The algorithm takes a key and number of buckets as its parameters. ch(key, j)
It considers each successive bucket from 1 to num_of_buckets - 1.
At each step it uses ch(key, j) to compute ch(key, j+1) and decides whether to keep ch(key, j) same as ch(key, j+1) or jump its value to j.
It uses a pseudorandom number generator with the key as its seed to generate uniform random numbers between 0.0 and 1.0, and jumps a fraction of the keys if the value is less than $\frac{1}{(j + 1)}$ .
Finally it returns the bucket.

Code

int ch(int key, int num_buckets){
    random.seed(key);
    // This will track ch(key, j+1).
    int b = 0;     
    for (int j = 1; j < num_buckets; j++){
        if (random.next() < 1.0 / (j + 1)){
            b = j;   
        }
    return b;
}

Time Complexity Analysis

It takes linear time O(n) for a cluster size of n.

Optimization

In the previous approach, we note that jumps are only made occasionally that is when ch(key, j) != ch(key, j+1). To improve the linear complexity to logarithmic O(logn) time, it only computes the jump when the above condition is satisfied.
You can read more on this in the original paper link provided at the end of this post.

int ch(int key, int num_buckets){
    random.seed(key);
    int b = 1;  // bucket number before the previous jump
    int j = 0; // bucket number before the current jump
    while (j < num_buckets){
        b = j;
        r = random.next();
        j = floor((b + 1) / r);
    }
    return = b;
}

Points to note

The jump consistent algorithm does not rehash keys if they are already integer values.
The algorithm's time complexity is logarithmic even as the bucket size grows(billions) as compared to karger's algorithm which deteriorates as data grows.
Buckets are numbered sequentially which makes it suitable for data storage applications.

References

This is based on the research paper "A Fast, Minimal Memory, Consistent Hash Algorithm" (PDF) by John Lamping and Eric Veach from Google.

With this article at OpenGenus, you must have the complete idea of Jump Consistent Hash: A Fast, Minimal Memory, Consistent Hash Algorithm.