Median of two sorted arrays

Do not miss this exclusive book on Binary Tree Problems. Get it now for free.

In this article, we have presented 3 Algorithms to find the Median of two sorted arrays efficiently in logarithmic time. This involves a clever use of Binary Search algorithm.

Table of contents:

  1. Introduction to Median
  2. Problem Statement: Median of two sorted arrays
  3. Method 1
  4. Method 2
  5. Method 3
  6. Applications

Let us get started with Median of two sorted arrays.

Introduction to Median

In mathematics and statistics, the median is a value that divides a data into two halves. The left half contains values less than the median and the right half contains value greater than the median.

In order to calculate the median value, we first arrange all the data points in a data set in an ascending order. After that, we pick the value in the middle of the data set to get the median. If the data set has odd number of elements, we can directly pick the value in the middle to get the median.

However, if the data set has even number of elements, we pick the middle two values and calculate the average of the two to get the median value. This is shown in the image below.

Problem Statement: Median of two sorted arrays

In order to calculate the median of two sorted arrays, we would first need to combine the two arrays into a new array and then calculate the median of the resulting array. We can do this using various approaches.

Method 1

The most straightforward way to solve this problem is to do it linearly. That is, we keep a track of all elements equal to the number of elements in first and second list taken together. At the same time, we traverse all elements in the first and the second list, keep on comparing them and tracking the count. Once we reach the middle index (as per the number elements designated for the new list), it implies we have arrived at the middle value i.e the median. We can put two conditions as well to check if the length of the new resulting list is odd or even, and calculate the median accordingly.

The steps to perform this method are as follows-

  1. Find the length of first and second list.
  2. Create index variables, initialized at zero, for each list.
  3. Set two variables as median 1 and median 2. (This is because in case the data set has even number of elements, then we'd need two middle values to calculate the median. In case of odd data points, the median 2 would be ignored.)
  4. Traverse all elements in list 1 and list 2 combined, but only till the middle position (because the median will exactly be at this position/index), while at the same time sorting the traversed elements and setting the median accordingly. Return the final median value in the end.
  5. In case the number of elements are even, set median 2 equal to median 1 and do the same as in step 4. However, this time return the average of median 1 and median 2 as the final median value.

Implementation Example

The following program executes the above algorithm.


# function to find median
def find_median(lst1, lst2):
    m = len(lst1)
    n = len(lst2)
    i = 0
    j = 0
    # we can set median 1 and median 2 as any value initially
    median1, median2 = None, None

    # if the total number of elements (elements in list 1 + elements in list 2) is odd
    if ((m+n) % 2 == 1):
        # x is the count variable
        for x in range(((m+n)//2) + 1):
            if (i != m and j != n):
                if lst1[i] > lst2[j]:
                    median1 = lst2[j]
                    j += 1
                else:
                    median1 = lst1[i]
                    i += 1
            # for remaining elements in lst 1
            elif (i < m):
                median1 = lst[i]
                i += 1
            # for remaining elements in lst 2 i.e when j<n
            else:
                median1 = lst2[j]
                j += 1
        return median1
    # if the total number of elements is even 
    else:
        for x in range(((m+n) // 2) + 1):
            # because we will have two middle values
            # median2 is the middle value before the second middle value in case of even data points

            median2 = median1
            if ( i != m and j != n):
                if lst1[i] > lst2[j]:
                    median1 = lst2[j]
                    j += 1
                else:
                    median1 = lst1[i]
                    i += 1
            elif (i < m):
                median1 = lst[i]
                i += 1
            # for remaining elements in lst 2 i.e when j<n
            else:
                median1 = lst2[j]
                j += 1

        return ((median1 + median2) / 2)

                    
#example lists to test median program
list1 = [1, 2]
list2 = [3, 4, 5]

# call the median function and print the final median value
print(find_median(list1, list2))

The following output is produced by the above program.

3

Complexity

  • Time: in the worst case, in order to traverse and sort all elements in list 1 and list 2, we use the runtime is O(m+n), where m, n are the number of elements in list 1 and list 2 respectively. Even in the best case, the runtime would be the same.
  • Space: this algorithm has a constant space requirement of O(1), in the best and worst case, in order to keep a track of all traversed elements and return the final value of median.

Method 2

Another way to solve this problem is to apply the mathematical formula directly.
We create a new list and add the elements of list 1 and list 2. Then we sort the new list and check the length. Let the length of the list be 'n'. If it is odd, we apply the formula of (n+1)/2. If it's even, we calculate two middle values by applying the formula (n/2) and (n/2)-1. This will give us the middle values required to calculate the median.

The steps involved in this method are as follows-

  1. Combine the two lists into a new list.
  2. Use sorting to arrange the elements in the new list in ascending order.
  3. Check the length of the new list. If it's odd, apply the formula- (n+1)/2 to get the index of the median. It it's even, we need two middle values, the average of which gives us the median, first apply the formula (n/2) to get the first middle value. Then use the formula- (n/2)-1 to get the second middle value. Calculate the average of these two values and return it.

Implementation Example

We can use the following program to implement this method.


def MergeSort(lst):
    # making sure length of list is 1 , which is the midpoint element
    if len(lst) > 1:
        # find midpoint of the list
        midpoint = len(lst)//2
        # divide the list into two parts (sublists), part to the left and right of midpoint
        lefthalf = lst[:midpoint]
        righthalf = lst[midpoint:]
        
        # keep calling the function till length of list becomes 1
        MergeSort(lefthalf)
        MergeSort(righthalf)

        # Assign zero value to all index variables initially
        # k is the index variable for list parameter
        i = j = k = 0
        
        # compare all elements in leftpart and rightpart
        # add elements in sorted order and update the list
        while i < len(lefthalf) and j < len(righthalf):
            if lefthalf[i] < righthalf[j]:
                lst[k] = lefthalf[i]
                i += 1
            else:
                lst[k] = righthalf[j]
                j += 1
            k += 1

        #add any remaining elements in leftpart to the new list  
        while i < len(lefthalf):
            lst[k] = lefthalf[i]
            i += 1
            k += 1
  
        #add any remaining elements in rightpart to the new list  
        while j < len(righthalf):
            lst[k] = righthalf[j]
            j += 1
            k += 1

    return lst

# function to find median
def find_median(lst1, lst2):
    # combine all elements of first and second list
    lst3 = lst1 + lst2
    # sort the new list
    MergeSort(lst3)
    # get the length of the new list
    n = len(lst3)

    
    # in case of even number of elements in new list
    if n % 2 == 0:
        # index of first middle value
        k = n // 2
        mid_value1 = lst3[k]
        mid_value2 = lst3[k-1]
        median = (mid_value1 + mid_value2)/ 2
        return median
    # in case of odd number of elements in new list
    else:
        k = n // 2
        median  = lst3[k]
        return median
                    


#example lists to test median program
list1 = [1, 2, 6]
list2 = [3, 4, 5]

# call the median function and print the final median value
print(find_median(list1, list2))

The above program produces the following output.

3.5

Complexity

  • Time: since we use merge sort here, the best and worst case runtime is O(nlogn). Because we'd need to traverse all elements at each stage when we decrease the length of the list.
  • Space: in the worst case, the space complexity of this method is O(m+n), where m, n are the number of elements in list 1 and list 2 respectively. This is because we create a new list of length (m+n) to store all elements of list 1 and list 2, which we then use to calculate the median. Even in the best case, the space requirement would still be the same.

Method 3

A more efficient method to solve this problem is to use binary search. We can create an algorithm that tells us how many elements we should take from each lists, just enough to get the median of the whole data set, instead of traversing all elements in both lists. The steps involved to create such an algorithm are as follows-

  1. Get the length of each list.
  2. Switch list 1 with list 2 if the length of list 2 is shorter. This is because for one, faster calculation, and two, because we can have few elements from list 1 for median calculation (as it's the shorter list) and the bigger one can supply rest of the elements, but not the other way around.
  3. Set the start (low) and end (high)variables to perform the binary search.
  4. We create a variable called cut_1 and cut_2, for each list respectively. Each cut divides the list into two parts- left and right, during each iteration of the binary search. All the elements in left part must be less than or equal to all the elements in the right part of the list.
  5. Let's suppose we make the cut to divide each list into two parts, left and right
    using an arbitrary number within the range of list 1. And the remaining elements would then be from list 2.
  6. Since the lists are sorted, all we need to do to ensure that we are taking the right elements for median is to check the left-most value in list 1 (just beside the cut) with the value in the right part of list 2 (just after the cut). Likewise, check the left-most value in list 2 just beside the cut with the right part of list just after the cut. The left values must always be less than or equal to the right values as median divides the entire list into two parts. Values to the left of the median are always smaller than the values on the right of the median.
  7. In case there are no values to the left of the cut, we set the left values to be minus infinity. Likewise, in case there are no values to the right of the cut, we set the right values to be infinity.
  8. Once we get all the correct elements for median calculation, if the list contains even number of elements, we need two middle values. We would need to take the first middle value as the maximum of left elements, just beside the cut, because the list is sorted. Likewise, we take the second middle value as the minimum of right elements. In case the number of total elements is odd, the median will be the maximum of the left elements.
  9. If the values in the left part are greater, we decrease the cut and assign that value to the end variable. This makes sure we move now move towards the left of the list, to get smaller values. Otherwise, we keep increasing the cut and assign it to the start variable, to get as much elements from list 1 till all the previous conditions are no longer satisfied.

Implementation Example

The following program implements the above algorithm to calculate the median.

def find_median(lst1,lst2):

        # get the length of both lists
        m, n = len(lst1), len(lst2)

        # we switch lst 1 with lst2 in case lst 1 has more elements
        # this is done to make lst1 the shorter one for faster calculation
        
        if m > n: 
            lst1, m, lst2, n = lst2, n, lst1, m

        # set the starting and ending indexes to decide the cut for median 
        low = 0
        high = m

        # we do a binary seach from here
        while low <= high:

                # a cut divides the lists into left and right part
                # if cut_1 is 0 it means nothing is there on left side of list 1 
                # if cut_1 is length of list then there is nothing on right side 
                # likewise, for cut_2
                cut_1 = (low + high) // 2
                cut_2 = (m + n + 1)//2 - cut_1

               
                
                # values to the left of the median are always smaller than right
                # if there are no values to the left of the cut, 
                # set the left values  to be minus infinity
                # in case there are no values to the right of the cut, 
                # we set the right values to be infinity
                # since cut_1 gives us the number of elements, 
                # to get the index we subtract 1
                
                max_left1 = float('-inf') if cut_1 == 0 else lst1[cut_1 - 1]
                min_right1 = float('inf') if cut_1 == m else lst1[cut_1]

                max_left2 = float('-inf') if cut_2 == 0 else lst2[cut_2 - 1]
                min_right2 = float('inf') if cut_2 == n else lst2[cut_2]
                
                # implies all the correct elements found for median calculation
                if (max_left1 <= min_right2 ) and (max_left2 <= min_right1):

                        # in case total number of elements are even
                        if ((m+n) % 2 == 0):
                                return ((max(max_left1,max_left2) + min(min_right1,min_right2))/2)
                        # in case total number of elements are odd
                        else:
                                return max(max_left1,max_left2)

                # if values in left are greater, implies too ahead in the list
                # so we need to decrease the index to get back to smaller values
                
                elif (max_left1 > min_right2):
                        high = cut_1 - 1
                        
                # else keep on increasing the index 
                else:
                        low = cut_1 + 1

        

# example to test the median program
lst1 = [1,2]
lst2 = [3,4,5]

print('Median:',find_median(lst1,lst2))

The following output is produced by the above program.

Median: 3

Complexity

  • Time: since we use binary search for the above algorithm, the complexity falls down to O(log(min(m,n))), depending on the length of each list, where m, n are the number of elements in list 1 and list 2 respectively. In the best case, the runtime would be O(1) or a constant runtime.
  • Space: we don't use any additional space and the space remains required remains same irrespective of the length of lists. In best and worst case, the space requirement is constant or O(1).

Applications

  • In statistics and mathematics, median is used for data manipulation and tabulation, in order to summarize the data and draw conclusions and inferences about the population.
  • Many areas, including data science, data analysis and data engineering, make use of median and related algorithms for machine learning and data maintenance.

With this article at OpenGenus, you must have the complete idea of finding the Median of two sorted arrays.

Sign up for FREE 3 months of Amazon Music. YOU MUST NOT MISS.