Probabilistic Data Structures


This is a general overview of probablistic data structures (with examples of data structures over 5 different categories).

Introduction

When it comes to answers we often want the precise exact answer, but there comes a time and place where getting the exact answer may take too much resources and we would be fine with an approximate answer that is close to the correct answer.

Take for instance, Big Data, as Big Data deals with massive amounts of data and processing time, often times the standard data structures such as a hash map, hash set are just not feasible for the problems that are encountered. Big Data problems tend to deal with running out of memory or deterministic data structures taking too long to process.

These issues brought rise for a new type of data structure that gives us an approxiately good enough answer which is known as using a probablistic data structure.

Probablistic Data Structure

By using this type of data structure, we can only safely assume that we have an approximately solution which may or may not be the exact answer but it's in the right direction. These data structures are proven to use either a fixed or sublinear memory and have constant execution time. As mentioned before, the answers may not be exact and have some probability of error.

Any probablistic data structure will rely on some form of probablity such as using randomness, hasing, and etc to reach an approximate solution.

Some of the data structures are rather proven alternative approaches for a data structure but often times they are needed for these cases:

  1. analzing / mining big data sets (more than what a deterministic data structure can handle).
  2. Statistical analysis
  3. A stream of data that need an answer aftwares

Different Probablistic Data Structures

Here is a list of data structures that have been organized in a manner of which their purpose is used for. A small description about each data structure is included.

Frequency

  • Count-min Sketch
    • Memory efficient hash table approach using 1 or more hash functions to estimate (overestimation) frequency counts.

Ranking

  • Random Sampling
    • Uses an internal random selecting algorithm to perform quick linear time random selecting. Useful if you need only a sublist of options without any prioritization.
  • q-digest
    • This data structure is complete binary tree over a set of values where it keeps track of frequency and propagates an estimate of the lower frequency values. Originally was designed for sensor networks but found a place in rank-based statistics.
  • t-digest
    • Useful for detecting anomalies and is typically a tree-based data structure that handles a stream of integers to handle such queries like quantiles, percentiles, and other rank-based statistics.

Similarity

  • MinHash
    • A useful data structure for estimating similarity between two sets of data (strings, numbers, etc) using Jaccard similarity metrics and uses one or more hash functions to quickly evaluate.
  • SimHash
    • Similar to the idea of MinHash but relies on the items to have a hash function along with comparing the corresponding bits by using some metric like the hamming distance.

Cardinality

  • LogLog
    • An algorithm that can be structured into a data structure that deals with distinct elements in a set. It's very quick and can handle lots of elements with less memory than a normal set.
  • HyperLogLog
    • An extension of the LogLog which uses a different way of measuring the distinct difference count.

Membership

  • Bloom Filter
    • A data structure that mimics a hash table but uses bytes from a hash code to determine position and if that element exists. It uses less memory and approximates if an element exists in a set.
  • Counting Bloom filter
    • Simply a generalized versoin of the bloom filter which allows a threshold count number to query the set.
  • Quotient Filter
    • A modified verson of bloom filter which includes metadata about the buckets. This uses more memory than the bloom filter but less than the counting bloom filter.
  • Cuckoo Filter
    • Perhaps a more compact verson of the bloom filter while allowing the delete opertion to be implementable. This uses the cuckoo hashing approach.

Not Classified

  • Hash Tables
  • Kinetic Hanger
    • A heap where inserts and deletes don't need to balance and can be randomized while still maintaining competitive heap performance.
  • Kinetic Heater
    • A priortized queue similar to the kinetic hanger but in practice not as efficient as the better kinetic priority queues.
  • Skip List
    • An ordered-key based data structure that allows for competitive performance dictionary or list while implementation remaining relatively easy. This data structure proves that probability can work along with being able to quick index certain items based on probability.
  • Random Tree
    • Uses stochastic properties while maintain tree-like properties. This has proven to applicable uses in fractals, machine learning, and etc.

Notes

It's important to know that each data structure have their own niche purposes but some can be used for other purposes though it's up to you to decide which is data structure is appropriate to use.

All these data structures can have slight various implementation differences providing different amounts of error or performance.
Such as adding a margin of error threshold or creating deterministic-like behavior within a probablistic data structure.

Just like how there are probablistic data structures, there are probablistic algorithms that also give an approximate solution.