Defaultdict (Default Dictionary) in Python


Let's dive right into one of the most useful data structures from the Python collections module, the defaultdict. The default dictionary is Python's intelligent hash table.

In here we shall talk about:

  • A brief introduction to Python3 dictionaries
  • How do dictionaries work?
  • What is a defaultdict?
  • How is it different from the regular dictionary in Python?
  • The default_factory attribute
  • The __missing__ function
  • defaultdict used as a frequency counter
  • Other important data structures from the Python collections module

A brief introduction to Python3 dictionaries

We may already have come across them in other languages and in the study of data structures. They make a comeback in Python as dictionaries, but we can think of them as *drum roll* hash maps or hash tables.

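Dictionaries are collections of key-value pairs. They were historically unordered; since Python 3.7 they are guaranteed to preserve insertion order, but elements are still looked up by key, not by position.

Dictionaries are defined by a comma-separated list of key-value pairs enclosed in curly braces.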

dictionary = {
    key1: value1,
    key2: value2,
      .
      .
      .
    keyN: valueN,
}

Comparing dictionaries with lists shows us a few similarities:

  • Both are mutable.
  • Both are dynamic. Items can be added or deleted from them as and when needed.
  • Both can be nested. A list of lists is possible to create. So is a dictionary which has dictionaries as its values. Even a dictionary of lists can be created, and a list of dictionaries as well.

And a few differences:

  • List elements are accessed by their position in the list, via indexing.
  • Dictionary elements have no positional index, so they are accessed via keys instead.

Fun fact: Searching for a value in a list takes $\mathcal{O}(n)$ time, whereas looking up a key in a dictionary takes $\mathcal{O}(1)$ on average. Jump on to the next section to know how.
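To make the fun fact a little more concrete, here is a small sketch. The names `names` and `ages` and their contents are made up purely for illustration.

```python
# Hypothetical sample data, for illustration only.
names = ["ross", "chandler", "joey", "monica"]
ages = {"ross": 29, "chandler": 28, "joey": 27, "monica": 26}

# Searching a list for a value scans it element by element: O(n).
position = names.index("joey")

# Looking up a key hashes it and jumps straight to its slot: O(1) on average.
age = ages["joey"]

print(position, age)  # 2 27
```

With four elements the difference is invisible, but the list scan grows with the length of the list while the dictionary lookup does not.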

How do dictionaries work?

In Python, dictionaries are implemented as resizable hash tables. The most common operation on a dictionary is a lookup, and compared to B-trees, hash tables offer better lookup performance and a simpler implementation.

The built-in hash() function in Python calculates a hash code for each key in the dictionary. This hash code determines a location in an internal array where the key-value pair is stored, and the same calculation is used later to look the key up. If the key were a mutable object, its value could change, and thus its hash value could change too, so the entry would no longer be found at its stored location. This is why only hashable objects can be used as keys.

If all the stored keys have different hash values, retrieving a value by its key takes constant time, $\mathcal{O}(1)$.
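A quick way to see which objects can act as keys is to call hash() on them ourselves. This is only a small sketch; the flag name `list_is_hashable` is made up for the example.

```python
# Equal immutable objects always produce equal hash codes.
print(hash("ross") == hash("ross"))        # True
print(hash((1, 2, 3)) == hash((1, 2, 3)))  # True

# A list is mutable, so Python refuses to hash it.
try:
    hash([1, 2, 3])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

print(list_is_hashable)  # False
```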

Take a look at the references section at the end of the article to know what the Python3 documentation has to say about this.

What is a defaultdict?

A default dictionary is a dictionary that automatically assigns default values to keys, if queried keys are not present.

# how to import defaultdict in Python
from collections import defaultdict
help(defaultdict)

defaultdict(default_factory[, ...]) --> dict with default factory
        The default factory is called without arguments to produce
    a new value when a key is not present, in __getitem__ only.
        A defaultdict compares equal to a dict with the same items.
    All remaining arguments are treated the same as if they were
    passed to the dict constructor, including keyword arguments.

Creating a defaultdict is quite simple:

# examples of empty defaultdict initializations
default_dictionary1 = defaultdict(list)
default_dictionary2 = defaultdict(int)
default_dictionary3 = defaultdict(lambda: "some default string")
default_dictionary4 = defaultdict(None)
# examples of non-empty defaultdict initializations
default_dictionary5 = defaultdict(int, {
        "france": 45,
        "germany": 25,
        "india": 64,
})

The signature of the defaultdict() is as follows:

defaultdict(default_factory[, ...]) --> dict with default factory

We shall learn more about the default_factory argument soon. The short version is that default_factory is a callable: when a key is not present in the dictionary, default_factory is called without arguments and whatever it returns becomes the default value for that key.
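As a quick sketch of that idea, here is what three of the factories used above hand back for a missing key. The variable names are made up for the example.

```python
from collections import defaultdict

by_list = defaultdict(list)
by_int = defaultdict(int)
by_text = defaultdict(lambda: "some default string")

# Indexing a missing key calls the factory and stores its result.
print(by_list["missing"])  # []
print(by_int["missing"])   # 0
print(by_text["missing"])  # some default string

# Every missing key gets its own fresh object, not a shared one.
by_list["a"].append(1)
print(by_list["b"])  # []
```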

Available attributes of a defaultdict are:

defaultdict.default_factory
# member 'default_factory' of 'collections.defaultdict' objects

Available methods in defaultdict are:

defaultdict.clear()
# D.clear() -> None. Remove all items from D.

defaultdict.copy()
# D.copy() -> a shallow copy of D.

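defaultdict.default_factory()
# Factory for default value called by __missing__().
# This is the attribute listed above rather than a separate method;
# calling it on an instance returns a fresh default value.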

defaultdict.fromkeys(iterable, value=None, /)
# Create a new dictionary with keys from iterable and values set to value.

defaultdict.get(key, default=None, /)
# Return the value for key if key is in the dictionary, else default

defaultdict.items()
# D.items() -> a set-like object providing a view on D's items

defaultdict.keys()
# D.keys() -> a set-like object providing a view on D's keys

defaultdict.pop()
# D.pop(k[,d]) -> v, remove specified key and return the corresponding value.
# If key is not found, d is returned if given, otherwise KeyError is raised

defaultdict.popitem()
# D.popitem() -> (k, v), remove and return some (key, value) pair as a 2-tuple;
# but raise KeyError if D is empty.

defaultdict.setdefault(key, default=None, /)
# Insert key with a value of default if key is not in the dictionary.
# Return the value for key if key is in the dictionary, else default.

defaultdict.update()
# D.update([E, ]**F) -> None.  Update D from dict/iterable E and F.
# If E is present and has a .keys() method, then does:  for k in E: D[k] = E[k].
# If E is present and lacks a .keys() method, then does:  for k, v in E: D[k] = v.
# In either case, this is followed by: for k in F:  D[k] = F[k]

defaultdict.values()
# D.values() -> an object providing a view on D's values
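Most of these methods behave exactly as they do on a regular dictionary. A short sketch of a few that are worth trying on a defaultdict, using a made-up `scores` dictionary:

```python
from collections import defaultdict

scores = defaultdict(list, {"ross": [89, 78]})

# copy() returns a shallow copy that keeps the original default_factory.
clone = scores.copy()
print(clone.default_factory is list)  # True

# setdefault() inserts the default we pass in, not the factory's value.
print(scores.setdefault("joey", [0]))  # [0]

# pop() with a fallback neither raises nor calls the factory.
print(scores.pop("rachel", "no scores"))  # no scores
print(sorted(scores))  # ['joey', 'ross']
```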

How is it different from the regular dictionary in Python?

The difference from a regular dictionary arises when dealing with keys that are not present in the dictionary.

# an incomplete list of scores on a test
regular_dictionary = {
    "ross": [89,78,95],
    "chandler": [68,77,89],
    "joey": [89,88],
    "monica": [98],
}
default_dictionary = defaultdict(list,{
    "ross": [89,78,95],
    "chandler": [68,77,89],
    "joey": [89,88],
    "monica": [98],
})

# correct way of updating rachel's list of scores in a regular dictionary
if "rachel" in regular_dictionary:
    regular_dictionary["rachel"].extend([77,79])
else:
    regular_dictionary["rachel"] = [77,79]
    
# correct way of updating phoebe's list of scores in a default dictionary
default_dictionary["phoebe"].extend([95,83,79])

When a key is absent from a regular dictionary and code tries to retrieve it, like regular_dictionary["phoebe"], a KeyError is raised. When a key is absent from a default dictionary and code tries to retrieve it, like default_dictionary["rachel"], the default_factory is called and its result is stored under that key and returned. In this case the result is an empty list: []
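The two behaviours can be seen side by side in a small sketch. The names `regular` and `default` are made up for the example. Note the side effect in the last line: merely reading a missing key adds it to the defaultdict.

```python
from collections import defaultdict

regular = {"ross": [89]}
default = defaultdict(list, {"ross": [89]})

try:
    regular["phoebe"]
except KeyError as error:
    print("KeyError:", error)  # KeyError: 'phoebe'

# Reading a missing key inserts it with the factory's value.
print(default["phoebe"])  # []
print(len(regular), len(default))  # 1 2
```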

Question


default_d = defaultdict(list)
regular_d = dict()
sample_list = [1,2,3]
sample_tuple = (1,2,3)

Which case won't generate an error?

default_d[sample_list].append("value")
default_d[sample_tuple].append("value")
regular_d[sample_list].append("value")
regular_d[sample_tuple].append("value")

Only default_d[sample_tuple].append("value") runs without an error.

Only a hashable object can be a key in any dictionary. Some hashable types in Python are int, str and tuple (a tuple is hashable as long as all of its elements are).

Lists are unhashable and thus cannot be dictionary keys, so both lookups with sample_list raise a TypeError.

No such restrictions are put on dictionary values.

Since regular_d doesn't contain the key (1,2,3), a KeyError is raised. default_d, on the other hand, initializes the key by itself with its default value, an empty list [].
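We can check all four cases in one go. This is just a sketch; the helper `attempt` is a made-up function that reports what each lookup does.

```python
from collections import defaultdict

default_d = defaultdict(list)
regular_d = dict()
sample_list = [1, 2, 3]
sample_tuple = (1, 2, 3)

def attempt(mapping, key):
    """Return 'ok', or the name of the exception the lookup raises."""
    try:
        mapping[key].append("value")
        return "ok"
    except (TypeError, KeyError) as error:
        return type(error).__name__

results = [
    attempt(default_d, sample_list),
    attempt(default_d, sample_tuple),
    attempt(regular_d, sample_list),
    attempt(regular_d, sample_tuple),
]
print(results)  # ['TypeError', 'ok', 'TypeError', 'KeyError']
```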

The default_factory attribute

This is the first argument to the defaultdict constructor and it's used by the __missing__() method. If the argument is omitted, default_factory is initialized to None, and the defaultdict then behaves like a regular dictionary for missing keys.

defaultdict(default_factory, **kwargs)
# first argument to the constructor
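Since default_factory is an ordinary writable attribute, we can inspect it and even swap it after the dictionary is created. A small sketch, with made-up variable names:

```python
from collections import defaultdict

plain = defaultdict()         # no factory given
print(plain.default_factory)  # None

try:
    plain["x"]
except KeyError:
    print("behaves like a regular dict")

# The attribute is writable, so the factory can be set later.
plain.default_factory = int
print(plain["x"])  # 0

# Remaining arguments are passed on to the dict constructor.
seeded = defaultdict(int, a=1)
print(seeded["a"], seeded["b"])  # 1 0
```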

The __missing__ function

When a defaultdict (say def_d) is used to retrieve a value with a key (say 'K') that doesn't exist, this dunder function __missing__() is called by the dunder function __getitem__() of def_d.

If the default_factory for def_d is None, __missing__() raises a KeyError.
If the default_factory is something other than None, __missing__() calls it without arguments, stores the result in def_d under the key 'K', and returns that result.
Pseudo code:

def __missing__(self, key):  # called by __getitem__ for a missing key
    if self.default_factory is None:
        raise KeyError(key)
    self[key] = value = self.default_factory()
    return value
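To try the hook out ourselves, here is a minimal sketch of a dict subclass that mimics this behaviour. The class name `SimpleDefaultDict` is hypothetical and this is not how the real defaultdict is implemented; it only relies on the fact that dict.__getitem__ calls __missing__() on subclasses that define it.

```python
class SimpleDefaultDict(dict):
    """Hypothetical stand-in for defaultdict, for illustration only."""

    def __init__(self, default_factory=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.default_factory = default_factory

    def __missing__(self, key):
        # dict.__getitem__ calls this hook when the key is absent.
        if self.default_factory is None:
            raise KeyError(key)
        value = self.default_factory()
        self[key] = value
        return value

counts = SimpleDefaultDict(int)
counts["a"] += 1
print(counts)  # {'a': 1}
```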

Note:

Only the method __getitem__() can call __missing__(). No other operation is allowed to call it. This means that the get() method, when executed on a defaultdict, will, like normal dictionaries, return None as a default value rather than using the default_factory attribute to initialize the default.
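A short sketch of this note in action, using a made-up dictionary `d`. Neither get() nor the in operator touches the factory, so the dictionary stays empty until we use square brackets.

```python
from collections import defaultdict

d = defaultdict(list)

print(d.get("a"))         # None
print(d.get("a", "n/a"))  # n/a
print("a" in d)           # False
print(len(d))             # 0

d["a"]                    # square brackets go through __getitem__
print(len(d))             # 1
```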

defaultdict used as a frequency counter

# dictionary of number of days in each month of a non-leap year
months = {
    "jan": 31, "feb": 28, "mar": 31, "apr":30,
    "may": 31, "jun": 30, "jul": 31, "aug":31,
    "sep": 30, "oct": 31, "nov": 30, "dec":31,
}
freq_of_days = defaultdict(int)
for month, days in months.items():
    freq_of_days[days] += 1

print(sorted(freq_of_days.items()))

Output: [(28, 1), (30, 4), (31, 7)]
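For comparison, here is a sketch of the same counting pattern written three ways: with a defaultdict, with a regular dictionary and get(), and with Counter from the same module. The input word is an arbitrary sample.

```python
from collections import Counter, defaultdict

word = "mississippi"  # arbitrary sample input

with_defaultdict = defaultdict(int)
with_plain_dict = {}
for letter in word:
    with_defaultdict[letter] += 1
    # A regular dict needs an explicit fallback for the first occurrence.
    with_plain_dict[letter] = with_plain_dict.get(letter, 0) + 1

print(sorted(with_defaultdict.items()))
# [('i', 4), ('m', 1), ('p', 2), ('s', 4)]

print(with_defaultdict == with_plain_dict)     # True
print(dict(Counter(word)) == with_plain_dict)  # True
```

The defaultdict version saves us the fallback on every update, which is what makes it handy as a frequency counter.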

Other important data structures from the Python collections module

  1. Named Tuple
  2. OrderedDict
  3. Counter
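As a small taste of these three, here is a brief sketch. The `Point` type and the sample values are made up for illustration.

```python
from collections import Counter, OrderedDict, namedtuple

# namedtuple: a tuple whose fields can be read by name.
Point = namedtuple("Point", ["x", "y"])
p = Point(x=1, y=2)
print(p.x + p.y)  # 3

# OrderedDict: a dictionary with extra methods for reordering.
od = OrderedDict([("a", 1), ("b", 2)])
od.move_to_end("a")
print(list(od))  # ['b', 'a']

# Counter: a dictionary specialised for counting.
print(Counter("hello").most_common(1))  # [('l', 2)]
```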

With this article at OpenGenus, you should now have a complete idea of defaultdict in Python3. Enjoy.