Get this book -> Problems on Array: For Interviews and Competitive Programming
In this article, we have presented 3 Algorithms to find the Median of two sorted arrays efficiently in logarithmic time. This involves a clever use of Binary Search algorithm.
Table of contents:
- Introduction to Median
- Problem Statement: Median of two sorted arrays
- Method 1
- Method 2
- Method 3
Let us get started with Median of two sorted arrays.
Introduction to Median
In mathematics and statistics, the median is a value that divides a data into two halves. The left half contains values less than the median and the right half contains value greater than the median.
In order to calculate the median value, we first arrange all the data points in a data set in an ascending order. After that, we pick the value in the middle of the data set to get the median. If the data set has odd number of elements, we can directly pick the value in the middle to get the median.
However, if the data set has even number of elements, we pick the middle two values and calculate the average of the two to get the median value. This is shown in the image below.
Problem Statement: Median of two sorted arrays
In order to calculate the median of two sorted arrays, we would first need to combine the two arrays into a new array and then calculate the median of the resulting array. We can do this using various approaches.
The most straightforward way to solve this problem is to do it linearly. That is, we keep a track of all elements equal to the number of elements in first and second list taken together. At the same time, we traverse all elements in the first and the second list, keep on comparing them and tracking the count. Once we reach the middle index (as per the number elements designated for the new list), it implies we have arrived at the middle value i.e the median. We can put two conditions as well to check if the length of the new resulting list is odd or even, and calculate the median accordingly.
The steps to perform this method are as follows-
- Find the length of first and second list.
- Create index variables, initialized at zero, for each list.
- Set two variables as median 1 and median 2. (This is because in case the data set has even number of elements, then we'd need two middle values to calculate the median. In case of odd data points, the median 2 would be ignored.)
- Traverse all elements in list 1 and list 2 combined, but only till the middle position (because the median will exactly be at this position/index), while at the same time sorting the traversed elements and setting the median accordingly. Return the final median value in the end.
- In case the number of elements are even, set median 2 equal to median 1 and do the same as in step 4. However, this time return the average of median 1 and median 2 as the final median value.
The following program executes the above algorithm.
# function to find median def find_median(lst1, lst2): m = len(lst1) n = len(lst2) i = 0 j = 0 # we can set median 1 and median 2 as any value initially median1, median2 = None, None # if the total number of elements (elements in list 1 + elements in list 2) is odd if ((m+n) % 2 == 1): # x is the count variable for x in range(((m+n)//2) + 1): if (i != m and j != n): if lst1[i] > lst2[j]: median1 = lst2[j] j += 1 else: median1 = lst1[i] i += 1 # for remaining elements in lst 1 elif (i < m): median1 = lst[i] i += 1 # for remaining elements in lst 2 i.e when j<n else: median1 = lst2[j] j += 1 return median1 # if the total number of elements is even else: for x in range(((m+n) // 2) + 1): # because we will have two middle values # median2 is the middle value before the second middle value in case of even data points median2 = median1 if ( i != m and j != n): if lst1[i] > lst2[j]: median1 = lst2[j] j += 1 else: median1 = lst1[i] i += 1 elif (i < m): median1 = lst[i] i += 1 # for remaining elements in lst 2 i.e when j<n else: median1 = lst2[j] j += 1 return ((median1 + median2) / 2) #example lists to test median program list1 = [1, 2] list2 = [3, 4, 5] # call the median function and print the final median value print(find_median(list1, list2))
The following output is produced by the above program.
- Time: in the worst case, in order to traverse and sort all elements in list 1 and list 2, we use the runtime is O(m+n), where m, n are the number of elements in list 1 and list 2 respectively. Even in the best case, the runtime would be the same.
- Space: this algorithm has a constant space requirement of O(1), in the best and worst case, in order to keep a track of all traversed elements and return the final value of median.
Another way to solve this problem is to apply the mathematical formula directly.
We create a new list and add the elements of list 1 and list 2. Then we sort the new list and check the length. Let the length of the list be 'n'. If it is odd, we apply the formula of (n+1)/2. If it's even, we calculate two middle values by applying the formula (n/2) and (n/2)-1. This will give us the middle values required to calculate the median.
The steps involved in this method are as follows-
- Combine the two lists into a new list.
- Use sorting to arrange the elements in the new list in ascending order.
- Check the length of the new list. If it's odd, apply the formula- (n+1)/2 to get the index of the median. It it's even, we need two middle values, the average of which gives us the median, first apply the formula (n/2) to get the first middle value. Then use the formula- (n/2)-1 to get the second middle value. Calculate the average of these two values and return it.
We can use the following program to implement this method.
def MergeSort(lst): # making sure length of list is 1 , which is the midpoint element if len(lst) > 1: # find midpoint of the list midpoint = len(lst)//2 # divide the list into two parts (sublists), part to the left and right of midpoint lefthalf = lst[:midpoint] righthalf = lst[midpoint:] # keep calling the function till length of list becomes 1 MergeSort(lefthalf) MergeSort(righthalf) # Assign zero value to all index variables initially # k is the index variable for list parameter i = j = k = 0 # compare all elements in leftpart and rightpart # add elements in sorted order and update the list while i < len(lefthalf) and j < len(righthalf): if lefthalf[i] < righthalf[j]: lst[k] = lefthalf[i] i += 1 else: lst[k] = righthalf[j] j += 1 k += 1 #add any remaining elements in leftpart to the new list while i < len(lefthalf): lst[k] = lefthalf[i] i += 1 k += 1 #add any remaining elements in rightpart to the new list while j < len(righthalf): lst[k] = righthalf[j] j += 1 k += 1 return lst # function to find median def find_median(lst1, lst2): # combine all elements of first and second list lst3 = lst1 + lst2 # sort the new list MergeSort(lst3) # get the length of the new list n = len(lst3) # in case of even number of elements in new list if n % 2 == 0: # index of first middle value k = n // 2 mid_value1 = lst3[k] mid_value2 = lst3[k-1] median = (mid_value1 + mid_value2)/ 2 return median # in case of odd number of elements in new list else: k = n // 2 median = lst3[k] return median #example lists to test median program list1 = [1, 2, 6] list2 = [3, 4, 5] # call the median function and print the final median value print(find_median(list1, list2))
The above program produces the following output.
- Time: since we use merge sort here, the best and worst case runtime is O(nlogn). Because we'd need to traverse all elements at each stage when we decrease the length of the list.
- Space: in the worst case, the space complexity of this method is O(m+n), where m, n are the number of elements in list 1 and list 2 respectively. This is because we create a new list of length (m+n) to store all elements of list 1 and list 2, which we then use to calculate the median. Even in the best case, the space requirement would still be the same.
A more efficient method to solve this problem is to use binary search. We can create an algorithm that tells us how many elements we should take from each lists, just enough to get the median of the whole data set, instead of traversing all elements in both lists. The steps involved to create such an algorithm are as follows-
- Get the length of each list.
- Switch list 1 with list 2 if the length of list 2 is shorter. This is because for one, faster calculation, and two, because we can have few elements from list 1 for median calculation (as it's the shorter list) and the bigger one can supply rest of the elements, but not the other way around.
- Set the start (
low) and end (
high)variables to perform the binary search.
- We create a variable called
cut_2, for each list respectively. Each cut divides the list into two parts- left and right, during each iteration of the binary search. All the elements in left part must be less than or equal to all the elements in the right part of the list.
- Let's suppose we make the cut to divide each list into two parts, left and right
using an arbitrary number within the range of list 1. And the remaining elements would then be from list 2.
- Since the lists are sorted, all we need to do to ensure that we are taking the right elements for median is to check the left-most value in list 1 (just beside the cut) with the value in the right part of list 2 (just after the cut). Likewise, check the left-most value in list 2 just beside the cut with the right part of list just after the cut. The left values must always be less than or equal to the right values as median divides the entire list into two parts. Values to the left of the median are always smaller than the values on the right of the median.
- In case there are no values to the left of the cut, we set the left values to be minus infinity. Likewise, in case there are no values to the right of the cut, we set the right values to be infinity.
- Once we get all the correct elements for median calculation, if the list contains even number of elements, we need two middle values. We would need to take the first middle value as the maximum of left elements, just beside the cut, because the list is sorted. Likewise, we take the second middle value as the minimum of right elements. In case the number of total elements is odd, the median will be the maximum of the left elements.
- If the values in the left part are greater, we decrease the cut and assign that value to the end variable. This makes sure we move now move towards the left of the list, to get smaller values. Otherwise, we keep increasing the cut and assign it to the start variable, to get as much elements from list 1 till all the previous conditions are no longer satisfied.
The following program implements the above algorithm to calculate the median.
def find_median(lst1,lst2): # get the length of both lists m, n = len(lst1), len(lst2) # we switch lst 1 with lst2 in case lst 1 has more elements # this is done to make lst1 the shorter one for faster calculation if m > n: lst1, m, lst2, n = lst2, n, lst1, m # set the starting and ending indexes to decide the cut for median low = 0 high = m # we do a binary seach from here while low <= high: # a cut divides the lists into left and right part # if cut_1 is 0 it means nothing is there on left side of list 1 # if cut_1 is length of list then there is nothing on right side # likewise, for cut_2 cut_1 = (low + high) // 2 cut_2 = (m + n + 1)//2 - cut_1 # values to the left of the median are always smaller than right # if there are no values to the left of the cut, # set the left values to be minus infinity # in case there are no values to the right of the cut, # we set the right values to be infinity # since cut_1 gives us the number of elements, # to get the index we subtract 1 max_left1 = float('-inf') if cut_1 == 0 else lst1[cut_1 - 1] min_right1 = float('inf') if cut_1 == m else lst1[cut_1] max_left2 = float('-inf') if cut_2 == 0 else lst2[cut_2 - 1] min_right2 = float('inf') if cut_2 == n else lst2[cut_2] # implies all the correct elements found for median calculation if (max_left1 <= min_right2 ) and (max_left2 <= min_right1): # in case total number of elements are even if ((m+n) % 2 == 0): return ((max(max_left1,max_left2) + min(min_right1,min_right2))/2) # in case total number of elements are odd else: return max(max_left1,max_left2) # if values in left are greater, implies too ahead in the list # so we need to decrease the index to get back to smaller values elif (max_left1 > min_right2): high = cut_1 - 1 # else keep on increasing the index else: low = cut_1 + 1 # example to test the median program lst1 = [1,2] lst2 = [3,4,5] print('Median:',find_median(lst1,lst2))
The following output is produced by the above program.
- Time: since we use binary search for the above algorithm, the complexity falls down to O(log(min(m,n))), depending on the length of each list, where m, n are the number of elements in list 1 and list 2 respectively. In the best case, the runtime would be O(1) or a constant runtime.
- Space: we don't use any additional space and the space remains required remains same irrespective of the length of lists. In best and worst case, the space requirement is constant or O(1).
- In statistics and mathematics, median is used for data manipulation and tabulation, in order to summarize the data and draw conclusions and inferences about the population.
- Many areas, including data science, data analysis and data engineering, make use of median and related algorithms for machine learning and data maintenance.
With this article at OpenGenus, you must have the complete idea of finding the Median of two sorted arrays.