Get this book -> Problems on Array: For Interviews and Competitive Programming

A cuckoo filter is a **probabilistic data structure** that is used to test **whether an element is a member of a set** or not. It is similar to Bloom filter. It can show false positives but not false negatives. It can also delete existing items.

It is a hash table that uses cuckoo hashing to resolve collisions. It decreases the space used by only maintaining a fingerprint of the value to be stored in the set. It uses a small **f-bit fingerprint to store the data**. The value of f is decided by the ideal false positive rate.

The **false-positive rate** is the probability that a false alarm will be raised. Even though the true value is negative it will give a positive result. It is calculated as **FP/(FP+TN)**, where FP is the number of false positives and TN is the number of true negatives.

Table of Contents:

- Buckets in cuckoo filter
- Cuckoo hashing in cuckoo filter
- Insertion operation in cuckoo filter
- Lookup function in cuckoo filter
- Delete operation in cuckoo filter
- Implementing cuckoo filter
- Time & Space complexity of Cuckoo Filter
- Advantages of Cuckoo filter over Bloom filter
- Limitations of Cuckoo Filter

# Buckets in cuckoo filter

Cuckoo filter stores the fingerprint of the value to be **stored in an array of buckets**. The number of fingerprints that a bucket can hold is referred to as b. They have a load that describes the percentage of the filter they are currently using and plays a key role in deciding when cuckoo filter size should be resized. It uses the semi-sort technique. The semi-sort technique can compress the buckets by sorting the fingerprints present in them.

# Cuckoo hashing in cuckoo filter

The cuckoo filter **uses cuckoo hashing to resolve collisions**. Cuckoo hashing is a method used to avoid collisions in a hash table. It got its name because of its similarities with cuckoo. As a chick of cuckoo pushes the other eggs out of the nest to make a place of their own. Similarly in cuckoo hashing, to insert a new key into the hash table we push the older key to a new place. Though cuckoo filter is based on cuckoo hashing, they use a derivative called partial-key cuckoo hashing.

# Insertion operation in cuckoo filter

This function is used to **insert the value to be stored into the cuckoo filter**. It uses cuckoo hashing to insert the value.

### Algorithm implementing insertion in the cuckoo filter.

**Step 1:** compute the fingerprint of the value to be stored and store it in a variable called fprint.

**Step 2:** Compute the hash of k(value to be stored), h(k) = k_{h} using the first hash function and store it in a variable called first_bucket.

**Step 3:** Compute the hash of fprint,h(fprint)=k_{f}

**Step 4:** Xor first_bucket and hash of fprint and store it in a variable called second_bucket.

**Step 5:** If first_bucket is empty then insert the fprint in bucket[first_bucket].

**Step 6:** If first_bucket is not empty then insert the fprint in bucket[second_bucket].

**Step 7:** If second_bucket is not empty then reallocate the entry present in the second_bucket to empty space and insert the fprint in bucket[second_bucket].

### Pseudocode of insertion function.

```
Insert(x){
fprint=fingerprint(x);
first_bucket=hash(x);
second_bucket=first_bucket⊕hash(fprint);
if bucket[first_bucket] or bucket[second_bucket] has an empty entry then
add fprint to that bucket;
return Done;
//the two buckets are not empty so we must relocate existing items;
i=randomly pick first_bucket or second_bucket;
for n=0;n<MaxNumKicks;n++ do
randomly select an entry e from bucket[i];
swap fprint and the fingerprint stored in entry e;
i=i⊕hash(fprint);
if bucket[i] has an empty entry then
add fprint to bucket[i];
return Done;
//Hashtable is considered full;
return Failure;
}
```

# Lookup function in cuckoo filter

Lookup function **checks if the given value is present in the cuckoo filter** or not. We will compute first_bucket and second_bucket and we will check if the fingerprint of the given value is present in the first_bucket or the second_bucket then it will return true else return false.

### Algorithm implementing lookup in the cuckoo filter.

**Step 1:** compute the fingerprint of the value to be stored and store it in a variable called fprint.

**Step 2:** Compute the hash of k(value to be stored), h(k) = k_{h} using the first hash function and store it in a variable called first_bucket.

**Step 3:** Compute the hash of fprint,h(fprint)=k_{f}

**Step 4:** Xor first_bucket and hash of fprint and store it in a variable called second_bucket.

**Step 5:** If bucket[first_bucket] or bucket[second_bucket] contains fprint then return true else return false.

### Pseudocode of lookup function.

```
Lookup(x){
fprint=fingerprint(x);
first_bucket=hash(x);
second_bucket=first_bucket⊕hash(fprint);
if bucket[first_bucket] or bucket[second_bucket] has fprint then
return True;
return False;
```

# Delete operation in cuckoo filter

Delete function **removes a copy of the fingerprint of the given value from the bucket**. We will compute first_bucket and second_bucket and we will check if the fingerprint of the given value is present in the first_bucket or the second_bucket then we will remove a copy of fprint from this bucket.

### Algorithm implementing delete in the cuckoo filter.

**Step 1:** compute the fingerprint of the value to be stored and store it in a variable called fprint.

**Step 2:** Compute the hash of k(value to be stored), h(k) = k_{h} using the first hash function and store it in a variable called first_bucket.

**Step 3:** Compute the hash of fprint,h(fprint)=k_{f}

**Step 4:** Xor first_bucket and hash of fprint and store it in a variable called second_bucket.

**Step 5:** If bucket[first_bucket] or bucket[second_bucket] contains fprint then remove a copy of fprint from this bucket.

### Pseudocode of Delete function.

```
Delete(x){
fprint=fingerprint(x);
first_bucket=hash(x);
second_bucket=first_bucket⊕hash(fprint);
if bucket[first_bucket] or bucket[second_bucket] has fprint then
remove a copy of fprint from this bucket;
returnTrue;
return False;
```

# Implementing cuckoo filter

Following is the implementation of Cuckoo Filter in GoLang:

- Go

### Go

```
package main
```

import (
"bytes"
"encoding/binary"
"errors"
"fmt"
"math"
"crypto/sha1"
"math/rand"
)

type bucket []fingerprint
type fingerprint []byte
var hasher = sha1.New()
const retries = 200 // how many times do we try to move items around?

type Cuckoo struct {
buckets []bucket
m uint // buckets
bucket_entries uint // entries per bucket
fingerprint_length uint // fingerprint length
n uint // filter capacity
}

func hash(data []byte) []byte {
hasher.Write([]byte(data))
hash := hasher.Sum(nil)
hasher.Reset()
return hash
}

func nextPower(i uint) uint {
i--
i |= i >> 1
i |= i >> 2
i |= i >> 4
i |= i >> 8
i |= i >> 16
i |= i >> 32
i++
return i
}

func fingerprintLength(b uint, e float64) uint {
f := uint(math.Ceil(math.Log(2 * float64(b) / e)))
f /= 8
if f < 1 {
return 1
}
return f// if f is greater than or equal to 1 return f.
}

// n = len(items), fp = false positive rate
func NewCuckoo(n uint, fp float64) *Cuckoo {
bucket_entries := uint(4)
fingerprint_length := fingerprintLength(bucket_entries, fp)
m := nextPower(n / fingerprint_length * 8)
buckets := make([]bucket, m)
for i := uint(0); i < m; i++ {
buckets[i] = make(bucket, bucket_entries)
}
return &Cuckoo{
buckets: buckets,
m: m,
bucket_entries: bucket_entries,
fingerprint_length: fingerprint_length,
n: n,
}
}

// nextIndex returns the next index for entry, or an error if the bucket is full
func (b bucket) nextIndex() (int, error) {
for i, f := range b {
if f == nil {
return i, nil
}
}
return -1, errors.New("bucket full")
}

// hashes returns h1, h2 and the fingerprint
func (c *Cuckoo) hashes(data string) (uint, uint, fingerprint) {
h := hash([]byte(data))
fprint := h[0:c.fingerprint_length]
first_bucket := uint(binary.BigEndian.Uint32(h))
second_bucket := first_bucket^ uint(binary.BigEndian.Uint32(hash(fprint)))
return first_bucket, second_bucket, fingerprint(fprint)
}

func (b bucket) contains(f fingerprint) (int, bool) {
for i, x := range b {
if bytes.Equal(x, f) {
return i, true
}
}
return -1, false
}

//function that inserts string into the cuckoo filter.
func (c *Cuckoo) insert(input string) {
first_bucket, second_bucket, fprint := c.hashes(input)
// first try bucket one
b1 := c.buckets[first_bucket%c.m]
if i, err := b1.nextIndex(); err == nil {
b1[i] = fprint
return
}
b2 := c.buckets[second_bucket%c.m]
if i, err := b2.nextIndex(); err == nil {
b2[i] = fprint
return
}
// else we need to start relocating items
i := first_bucket
for r := 0; r < retries; r++ {
index := i % c.m
entryIndex := rand.Intn(int(c.bucket_entries))
// swap
fprint, c.buckets[index][entryIndex] = c.buckets[index][entryIndex], fprint
i = i ^ uint(binary.BigEndian.Uint32(hash(fprint)))
b := c.buckets[i%c.m]
if idx, err := b.nextIndex(); err == nil {
b[idx] = fprint
return
}
}
panic("cuckoo filter full")
}

// lookup needle in the cuckoo filter
func (c *Cuckoo) lookup(needle string) bool {
first_bucket, second_bucket, fprint := c.hashes(needle)
_, b1 := c.buckets[first_bucket%c.m].contains(fprint)
_, b2 := c.buckets[second_bucket%c.m].contains(fprint)
return b1 || b2
}

// delete the value from the cuckoo filter
func (c *Cuckoo) delete(needle string) {
first_bucket, second_bucket, fprint := c.hashes(needle)
// try to remove from f1
b1 := c.buckets[first_bucket%c.m]
if ind, ok := b1.contains(fprint); ok {
b1[ind] = nil
return
}

b2 := c.buckets[second_bucket%c.m]
if ind, ok := b2.contains(fprint); ok {
b2[ind] = nil
return
}
}

func main() {
c := NewCuckoo(10, 0.1)
c.insert("hello")
c.insert("world")
ok := c.lookup("world")
fmt.Printf("%v\n", ok)
c.delete("world")
ok = c.lookup("world")
fmt.Printf("%v\n", ok)
}

# Time & Space complexity of Cuckoo Filter

### Time complexity.

- The time complexity of delete is
`O(1)`

because it does not depend upon the input size. - The time complexity of lookup is also
`O(1)`

because it does not depend upon the input size. - The time complexity of insertion has a reasonably high probability of
`O(1)`

if the load is well managed.

### Space complexity.

- A cuckoo filter requires
**( log**bits of space per inserted key_{2}(1/ϵ) + 2 ) / α

**Note:**

`α`

is the hash table load factor.The load factor is the average number of key-value pairs per bucket.It is calculated using formula load factor = total number of key-value pairs / number of buckets.`ϵ`

is the false positive rate.

**Example:**If the false positive rate is 3% and the load factor is around 95.5% then ϵ is .03 and α is 0.955

substituting these values in the formula

bits of space per inserted key =( log_{2}(1/0.03) + 2 ) / 0.955

( log_{2}(33.333) + 2 ) / 0.955

( 5.059 + 2 ) / 0.955

7.059 / 0.955

7.391.

So it takes approximately 7 bits per inserted key.

# Advantages of Cuckoo filter over Bloom filter

- It takes less time for lookups.
- The cuckoo filter has fewer false positives than the bloom filter for the same number of items stored.
- It supports the deletion of items.
- It is simple to implement than counting bloom filters.

# Limitation of Cuckoo filter

- It can only delete elements that are known to be inserted before.
- Insertion can fail and rehashing is required like other cuckoo hash tables.

With this article at OpenGenus, you must have the complete idea of Cuckoo Filter. Enjoy. Read more:

- Bloom Filter: Better than Hash Map by Harsh Bardhan Mishra
- Book on Probabilistic Data Structures by Aditya Chatterjee and Ethan Z. Booker