Linked Hash Set in Java

Do not miss this exclusive book on Binary Tree Problems. Get it now for free.

In this article, we will understand the LinkedHashSet<E> Class which has been included in java.util since version 1.4.

Table of Contents:

  1. Introduction and Inheritance
  2. Set in Java
  3. HashSet in Java
  4. LinkedHashSet in Java
  5. Example Usage
  6. Complexity

Introduction and Inheritance

A Linked Hash Set is a time-efficient way to store and retrieve data. The class LinkedHashSet has been included in every Java version since 1.4. To use this class, you will first need to import java.util.LinkedHashSet.

LinkedHashSet<E> extends the HashSet<E> class and implements the Set<E>, Cloneable, and Serializable interfaces. Internally, LinkedHashSet uses a a hash table to store data, and a doubly linked list to maintain entry order. In the following sections we will break down the important qualities of sets, hash tables, and doubly linked lists that LinkedHashSet inherits.

Set in Java

A Set is a collection of elements without any duplicates, and at most one null element. The Set<E> interface in Java describes the qualities of a Set for any class that inherits from the interface. The AbstractSet<E> class is a minimal implementation of this interface which makes it easier for other classes to implement the interface. There are many methods in the Set<E> interface, but the most important are add(E e), remove(Object o), contains(Object o), and size(), which add an element, remove an element, check to see if an element is contained in the set, and check the total number of elements in the set, respectively.

Notably, the elements in a Set are not ordered. Thus, if you print every element in a Set using a for-each loop, the elements will print in some arbitrary order, which may not be the same as the order they were inserted into the Set.

HashSet in Java

The HashSet<E> is an implementation of a Set<E> that uses a hash table. As such, HashSet<E> extends AbstractSet<E> and is backed by an instance of a HashMap<K,V>. We have already discussed the Set functionality, so let's turn our attention to the use of HashMap<K,V>.

Basically, a map is just a collection of key-value pairs. You can think of a dictionary as a type of map, in which every word is a key and every definition is a value (indeed, Java's HashMap is very similar to Python's Dictionary). A HashMap is a kind of map that relies on hashing. Hashing is the process of using some algorithm to take a key as an input and output an integer value called a "hash." Then, the hash is used to determine the location of the value associated with the key.

For instance, suppose you have a key-value pair <1,"one">, and you want to insert this into a HashMap where the hash function is key%10. Well, 1%10=1, so the hash for this specific key is 1. The value "one" might then be inserted into the 1st index of the array that holds all the values in the map. Like other maps, insertion occurs in constant O(1) time. However, the real advatange of HashMaps is that in the best case scenario, search and retrieval also occur in O(1) time. If you want to find the value associated with the key 1, then you can simply compute the hash for that key and check the corresponding location in the array of values.

This best case scenario only occurs if there are no collisions, that is if every key has a unique hash. However, in our exmaple, we can see that 1, 11, 21, 31, and so on will all produce the same hash. A common technique for dealing with these collisions is called linear chaining. If a location in the array is already occupied (by some other key-value pair with the same hash), then we would check the rest of the array in a linear fashion until we come upon an empty space. Similarly, when we search for a key's value, we may have to iterate through the array to find where we inserted that key-value pair as a result of linear chaining. In the worst case, this will take O(n) time.

A HashSet employs a HashMap to speed up the retrieval of elements. As every element e is added to the set, an entry <e,e> is added to the internal HashMap. Then, to find e within the set, a HashSet can use the speedy retrieval from a HashMap instead of iterating through every element like in a regular Set.

LinkedHashSet

So far we have learned that a HashSet is an unordered collection of elements with no repeats that uses a HashMap to locate elements in near-constant time. A LinkedHashSet, which extends HashSet<E>, adds in a doubly-linked list to the equation and solves the problem of order. In a doubly linked list, each item, or "node," in the list contains two pointers pointing to the node in front of it and the node behind it. In a LinkedHashSet, each elements is stored as a node in a doubly linked list following the order in which they were inserted. This can be useful because many times, you will want to be able to see all elements in a collection according to their entry order. A HashMap is still used to find individual elements, but if you print every element in a LinkedHashSet, you will find they always print according to the order in which they were inserted.

Here are some important methods in LinkedHashSet<E>:

  • add(E e): adds an element to the set
  • remove(Object o): removes an element from the set
  • contains(Object o): checks if an element is present in the set
  • iterate(): returns an Iterator<E> object of the set elements, in the order they were entered

Example Usage

Let's look at an example in Java:

import java.util.LinkedHashSet;

class LinkedHashSetDemo {

    public static void main(String[] args) {
        
        //Create a LinkedHashSet of Strings
        LinkedHashSet<String> lhs = new LinkedHashSet<>();

        //Add elements to our LinkedHashSet
        lhs.add("Hello");
        lhs.add("Error");
        lhs.add("World!");


        //Print elements in order of entry
        System.out.print("My Set: ");
        for (String string : lhs) {
            System.out.print(string+" ");
        }
        System.out.println();

        //Test "contains" method
        System.out.println("\"Error\" detected: "+lhs.contains("Error"));

        //Remove "Error" from the set
        System.out.println("Removing \"Error\": "+lhs.remove("Error"));

        //Test contains again
        System.out.println("\"Error\" detected: "+lhs.contains("Error"));

        //See what happens when we try to remove an element that is not in the set
        System.out.println("Removing \"Error\": "+lhs.remove("Error"));

        //Print all elements of modified set
        System.out.print("My Set: ");
        for (String string : lhs) {
            System.out.print(string+" ");
        }
    }

} 

Complexity

We have already touched briefly upon complexity, especially during the HashSet section, but we will examine it in more detail here.

LinkedHashSet mostly operates on the same time complexity as HashSet. This means insertion, removal, and search all occur in constant O(1) time for the best case scenario. But how likely is the best case scenario?

Well, this all depends on the number of collisions. The more collisions that occur in the hash table, the more linear chaining (or some other process for resolving collisions) will be necessary. As we have seen already, this slows down search and removal. To avoid collisions as much as possible, there are a few implementation details to consider:

  1. Choosing a good hash function. A good hash function will not give the same output for many different inputs. The modulo function works pretty well as a hash function, especially if the divisor (the number after the modulus operator) is relatively large. However, the hashCode() method is written into the overarching Object class in Java, so you probably will not have to worry about this detail.
  2. Choosing an Initial Capacity and a Load Factor. The initial capacity sets the limit for how many values you can initially insert into your hash table. In other words, it is the number of distinct hashes that can be generated, aka the range of the hash function. For %10, the initial capacity is 10 (0 through 9, inclusive). A higher initial capacity might improve time complexity, but it takes up more space. The load factor is a float value that determines when to increase the capacity of your hash table. When this occurs, the table typically doubles in size and the entire data set is rehashed. Since there are now twice as many spaces for the same amount of data, collisions typically decrease and operations move closer to being optimal. If the load factor is too low, rehashing, which is an expensive process, will occur too often. If rehashing is too high, you are likely to run into more collisions. A load factor of 0.75 (rehashing will occur at 75% capacity) is generally considered to be a good balance between reducing overhead storage and reducing time complexity.

The LinkedHashSet<E> class has two constructors where you can set either the initial capacity or both the initial capacity and the load factor. Generally, it will perform just as well as the HashSet class, adding, removing, and finding elements in approximately O(1) time. A LinkedHashSet does require slightly more space to store the doubly linked list, but it is usually faster to iterate through a LinkedHashSet than a normal HashSet because it does not need to check every space in the hash table, it can just iterate through the list.

Sign up for FREE 3 months of Amazon Music. YOU MUST NOT MISS.