
In this post, we discuss various approaches used to adapt the sequential merge sort algorithm to a parallel computing platform. We present four different approaches to parallel merge sort.

Merge sort is a divide and conquer algorithm. Given an input array, it divides the array into halves, and the same division is applied to each half until individual elements are obtained. Pairs of adjacent elements are then merged into sorted sublists until one fully sorted list is obtained.

Table of contents:

- Sequential merge sort algorithm: MergeSort(arr[], l, r)
- Parallel Merge Sort algorithm
- Approach 1: Quick Merge sort
- Approach 2: Odd-Even merge sort
- Approach 3: Bitonic merge sort
- Approach 4: Parallel merge sort with load balancing

Prerequisite: Merge Sort Algorithm

Let us get started with Parallel Merge Sort.

## Sequential merge sort algorithm: MergeSort(arr[], l, r)

- if r > l,
- Find the mid-point and divide the array into two halves, mid = l + (r-l) / 2.
- Mergesort first half, mergeSort(arr, l, mid)
- Mergesort second half, mergeSort(arr, mid+1, r)
- Merge the two sorted halves, merge(arr, l, mid, r)
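The steps above translate directly into code. Here is a minimal sequential sketch in Python; the helper `merge` copies the two sorted runs before merging them back in place:

```python
# A direct transcription of MergeSort(arr[], l, r) above.
def merge(arr, l, mid, r):
    # Merge the two sorted runs arr[l..mid] and arr[mid+1..r] in place,
    # using temporary copies of the runs.
    left, right = arr[l:mid + 1], arr[mid + 1:r + 1]
    i = j = 0
    k = l
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            arr[k] = left[i]
            i += 1
        else:
            arr[k] = right[j]
            j += 1
        k += 1
    # Copy any leftover elements from either run.
    arr[k:k + len(left) - i] = left[i:]
    k += len(left) - i
    arr[k:k + len(right) - j] = right[j:]

def merge_sort(arr, l, r):
    if r > l:
        mid = l + (r - l) // 2          # find the mid-point
        merge_sort(arr, l, mid)         # sort first half
        merge_sort(arr, mid + 1, r)     # sort second half
        merge(arr, l, mid, r)           # merge the sorted halves
```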

### Analysis

The time complexity is O(n log n) in all cases.

The space complexity is O(n), linear space.

## Parallel Merge Sort algorithm

The key to designing parallel algorithms is to find steps in the algorithm that can be carried out concurrently. That is, examine the sequential algorithm and look for operations that could be carried out simultaneously on a given number of processors.

*Note:* The number of processors will influence the performance of the algorithm.

#### Ideation. Simple parallel merge sort.

In this approach we assume we have an unlimited number of processors. Looking at the merge sort tree of the sequential algorithm, we can assign simultaneous operations to separate processors, which work concurrently to do the dividing and merging at each level.

Given the list [9, 3, 17, 11, 6, 2, 1, 10] and 8 processors.

Here is how the algorithm would sort it in parallel.

Merging would start at the bottom of the tree going up.

Considering the sequential merge sort tree, work starts at the bottom level, after the list has been divided into sublists of one element each. First, the processors are divided among the 4 pairs of single-element sublists and merge them into 4 sorted pairs. Then, moving up the tree, 2 processors merge the pairs into two 4-element sorted sublists, and finally one processor merges the two halves to create a final sorted list of 8 items.

Each processor will be executing in parallel at each level of the tree.
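As an illustration of this ideation, here is a sketch in Python. Threads stand in for the processors here (a real parallel implementation would place each task on its own processor), and `heapq.merge` performs the merge step at each level:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_merge_sort(arr):
    # Base case: a single element is already sorted.
    if len(arr) <= 1:
        return list(arr)
    mid = len(arr) // 2
    # Sort both halves concurrently, as two sibling processors would.
    with ThreadPoolExecutor(max_workers=2) as pool:
        left, right = pool.map(parallel_merge_sort, [arr[:mid], arr[mid:]])
    # The parent merges the two sorted halves.
    return list(heapq.merge(left, right))
```

With the list [9, 3, 17, 11, 6, 2, 1, 10], each level of recursion spawns two concurrent tasks, mirroring the tree described above.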

## Approach 1: Quick Merge sort

A more realistic case is when the number of processors is smaller than the number of elements to be sorted, so we need a strategy for dividing the work among the given p processors.

Assume a large input of size n. Each of the p processes sorts n/p elements of the original list using a sequential quick sort algorithm. Pairs of sorted sublists are then merged by their parent process, and this is repeated up the tree until two sorted halves remain, which are merged by one process at the top of the tree to obtain a fully sorted list.

#### Example.

- Given a list of 4000 items and 8 processors, we divide the list into halves until we have 8 partitions of size 500.
- Each sublist of 500 items is sorted using a fast sequential algorithm, quick sort.
- The sorted sublists are recursively merged moving up the tree until we achieve a fully sorted list.
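The steps above can be sketched in Python. Here `sorted()` stands in for the sequential quick sort, and threads stand in for the p processors:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def quick_merge_sort(arr, p=8):
    if not arr:
        return []
    # Divide the list into p partitions (e.g. 4000 items / 8 = 500 each).
    size = max(1, len(arr) // p)
    chunks = [arr[i:i + size] for i in range(0, len(arr), size)]
    # Each "processor" sorts its partition sequentially; sorted() stands
    # in for the sequential quick sort.
    with ThreadPoolExecutor(max_workers=p) as pool:
        runs = list(pool.map(sorted, chunks))
    # Merge pairs of sorted runs level by level, moving up the tree.
    while len(runs) > 1:
        runs = [list(heapq.merge(runs[i], runs[i + 1]))
                if i + 1 < len(runs) else runs[i]
                for i in range(0, len(runs), 2)]
    return runs[0]
```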

#### Analysis.

The sequential quick sort algorithm sorts in O(n log n) time and merging takes O(log n) steps; the total time complexity is $O((\log n)^2)$.

Space complexity is O(n).

## Approach 2: Odd-Even merge sort

This is a parallel approach to sorting based on the recursive application of the *odd-even merge algorithm*.

The odd-even merge algorithm merges sorted sub-lists in a bottom-up manner starting with sublists of size 2 and merging them into larger lists until the final sorted list is obtained.

The odd-even merge operates in two alternating phases:

- **Even phase**: even-numbered processes exchange values with their right neighbor.
- **Odd phase**: odd-numbered processes exchange values with their right neighbor.

#### Example.

Given an array [9, 3, 17, 11, 6, 2, 1, 10], the algorithm will divide it into two sorted sublists A = [3, 9, 11, 17] and B = [1, 2, 6, 10]. The even and odd phases will arrange the list in the following order.

- E(A) = [3, 11], O(A) = [9, 17]
- E(B) = [1, 6], O(B) = [2, 10]
- Recursively merge(E(A), E(B)) = C = oddEvenMerge([3, 11], [1, 6]) = [1, 3, 6, 11]
- Recursively merge(O(A), O(B)) = D = oddEvenMerge([9, 17], [2, 10]) = [2, 9, 10, 17]
- Interleave C with D: E = [1, (3, 2), (6, 9), (11, 10), 17]
- Rearrange each unordered neighbor pair (c, d) as (min(c, d), max(c, d)): E = [1, 2, 3, 6, 9, 10, 11, 17], a fully sorted list.

#### Algorithm.

- Split each sorted sublist into its even- and odd-indexed elements: E(A), O(A), E(B), O(B).
- Recursively merge(E(A), E(B)) <- C.
- Recursively merge(O(A), O(B)) <- D.
- Interleave C with D.
- Rearrange each unordered neighbor pair (c, d) as (min(c, d), max(c, d)) <- E (sorted array).
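A sketch of this algorithm in Python, assuming both inputs are sorted and of equal power-of-two length. This formulation interleaves as c₁, d₁, c₂, d₂, …, which compares the same neighbor pairs as the example above:

```python
def odd_even_merge(a, b):
    # Merge two sorted lists of equal power-of-two length.
    if len(a) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    # merge(E(A), E(B)) -> C and merge(O(A), O(B)) -> D, recursively.
    c = odd_even_merge(a[::2], b[::2])
    d = odd_even_merge(a[1::2], b[1::2])
    # Interleave C with D.
    e = [0] * (len(c) + len(d))
    e[::2], e[1::2] = c, d
    # Rearrange unordered neighbours as (min, max); the first and
    # last elements are already in place.
    for i in range(1, len(e) - 1, 2):
        if e[i] > e[i + 1]:
            e[i], e[i + 1] = e[i + 1], e[i]
    return e

def odd_even_merge_sort(arr):
    # Sort by recursively merging halves with odd_even_merge.
    if len(arr) <= 1:
        return list(arr)
    mid = len(arr) // 2
    return odd_even_merge(odd_even_merge_sort(arr[:mid]),
                          odd_even_merge_sort(arr[mid:]))
```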

#### Analysis.

Merging two sorted sublists of size n requires

- Two recursive calls to oddEvenMerge each merging n/2 size sublists
- oddEvenMerge([3, 11], [1, 6])
- oddEvenMerge([9, 17], [2, 10])

- n-1 comparison-and-swap operations.
- compare and swap the pairs (3, 2), (6, 9), (11, 10)

#### Parallelizing Odd-Even merge sort.

For parallelizing the oddEvenMergeSort() algorithm, we can run the oddEvenMerge() calls and the compare-and-swap operations concurrently.

That is,

- 2 parallel steps for the compare-and-swap operations in the recursive calls to oddEvenMerge() on lists of size n/2.
- log(n) - 1 parallel steps for the remaining compare-and-swap operations.

#### Analysis.

There are log(n) sorting steps, and oddEvenMerge has time complexity O(log n), so the total time complexity is $O((\log n)^2)$.

Space complexity is O(n).

## Approach 3: Bitonic merge sort

This is another parallel approach based on the bitonic sequence.

A *bitonic sequence* is a sequence of numbers that first increases up to a certain point and then decreases:

$a_1 < a_2 < \dots < a_{i-1} < a_i > a_{i+1} > a_{i+2} > \dots > a_n$

Following this, a *bitonic point* is the point in the sequence before which elements are strictly increasing and after which they are strictly decreasing.

A bitonic point can only arise in arrays where there are both increasing and decreasing values.

A list having these properties can be exploited by a parallel sorting algorithm.

#### Example.

Consider a bitonic sequence.

We can perform compare-and-swap operations with elements from both halves to obtain two bitonic sequences whereby all values in one sequence are smaller than all the values in the other.

To the result we can again apply the compare-and-swap operation, keeping the following in mind:

- All values in the left half of each bitonic sequence are less than all values in the right half.

Finally, we obtain a fully sorted list.

#### Steps.

- Transform the list into a bitonic sequence, as described above.
- Recursively compare-and-swap and merge the sublists to obtain a fully sorted list.

#### Algorithm.

- Convert adjacent pairs of values into increasing and decreasing sequences to form a bitonic sequence.
- Sort each bitonic sequence into increasing and decreasing segments.
- Merge adjacent sequences into a larger bitonic sequence.
- Sort the final bitonic sequence to achieve a sorted list.
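A sequential sketch of these steps in Python. The compare-and-swap loop in `bitonic_merge` is the part that runs concurrently on parallel hardware, and a power-of-two input size is assumed:

```python
def bitonic_sort(arr, ascending=True):
    # Recursively build a bitonic sequence, then sort it.
    if len(arr) <= 1:
        return list(arr)
    mid = len(arr) // 2
    # First half ascending, second half descending -> bitonic sequence.
    first = bitonic_sort(arr[:mid], True)
    second = bitonic_sort(arr[mid:], False)
    return bitonic_merge(first + second, ascending)

def bitonic_merge(seq, ascending):
    # Sort a bitonic sequence into the requested direction.
    if len(seq) <= 1:
        return list(seq)
    seq = list(seq)
    mid = len(seq) // 2
    # Compare-and-swap across the two halves; these comparisons are
    # independent and could all run in parallel.
    for i in range(mid):
        if (seq[i] > seq[i + mid]) == ascending:
            seq[i], seq[i + mid] = seq[i + mid], seq[i]
    # Each half is now bitonic; recurse into both.
    return (bitonic_merge(seq[:mid], ascending) +
            bitonic_merge(seq[mid:], ascending))
```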

#### Analysis.

*Note:* Bitonic parallel sorting is usually done on a list of size $2^n$; otherwise the algorithm fails.

We will consider an unsorted list of size $2^n$ in this analysis.

There are n stages, and within each stage the compare-and-swap operations run concurrently.

In total, sorting takes

$\frac{n(n+1)}{2} = \log(N) \cdot \frac{\log(N) + 1}{2}$

parallel steps, where $N = 2^n$ is the size of the list.

The total time complexity is $O((\log n)^2)$.
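As a quick sanity check of the step count, consider a list of $N = 8$ elements:

```latex
n = \log_2 8 = 3, \qquad
\text{steps} = \frac{n(n+1)}{2} = \frac{3 \cdot 4}{2} = 6
```

so sorting 8 elements takes 6 parallel compare-and-swap stages.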

## Approach 4: Parallel merge sort with load balancing

As we have seen in the previous approaches, there is a need for load balancing: as we move up the tree, some processors sit idle while others are assigned a workload at each level.

This approach aims to distribute work onto all processors such that they all partake in merging throughout the execution of the algorithm.

This achieves more parallelism and a better computational time complexity, thereby shortening the sorting time.

The additional complexity of this approach is how to merge two lists that are distributed among multiple processors while minimizing element movement.

#### Steps

- A group handles one sorted list; it stores the elements by distributing them evenly across all processors in the group. A histogram is computed and used to determine the minimum number of elements to be swapped during merging.
- In the first step, when the algorithm starts, every group contains one processor; in subsequent steps, groups are paired together.
- Each group exchanges its boundaries (min and max elements) to determine the order of processors according to the minimum elements.
- Each group exchanges histograms, and a new histogram covering both groups is computed.
- Each processor divides the bins of the shared histogram so that the lower-indexed processor keeps the smaller half and the higher-indexed processor keeps the larger half.
- Elements are exchanged between partner processors on the basis of the histogram intervals.
- As the algorithm proceeds, the group size doubles due to merging.

To reduce costs while swapping elements, *index swapping* is used: when a processor would have to send most of its elements to a partner processor and receive an equal amount back, the algorithm instead swaps the logical ids of the two processors to minimize element exchange.

*Note:*

- Group size doubles as we go up the tree merging.
- The larger the number of histogram bins, the better the load balancing and the better the concurrency.

#### Algorithm

- Each process p sorts a list of n/p elements locally and computes a local histogram.
- Iterate log(p) times, doing the following:
- Exchange boundary values between partner groups and determine the logical ids of processors for the merged lists.
- Exchange histograms with the partner group, compute a new histogram, and divide it accordingly (lesser, larger) into equal parts.
- Each processor sends keys to its designated processors.
- Each processor locally merges its elements with the received elements to obtain a new sorted list.
- Broadcast the logical ids of processors to the next iteration.
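The histogram exchange and split steps can be illustrated with a toy sketch for one pair of processors. The bin boundaries and the even split rule here are illustrative assumptions, not the exact scheme from the paper:

```python
from bisect import bisect_right

def histogram(elems, bounds):
    # Count how many elements fall into each bin delimited by bounds;
    # len(bounds) boundaries define len(bounds) + 1 bins.
    counts = [0] * (len(bounds) + 1)
    for x in elems:
        counts[bisect_right(bounds, x)] += 1
    return counts

def split_point(counts, total):
    # Walk the combined histogram until half of the combined elements
    # are covered; the lower-indexed processor keeps bins up to and
    # including this index, the higher-indexed processor keeps the rest.
    running, half = 0, total // 2
    for i, c in enumerate(counts):
        running += c
        if running >= half:
            return i
    return len(counts) - 1
```

Two partner processors would each compute a local histogram over shared bin boundaries, sum the histograms, and use the split point to decide which elements to exchange, so only the elements on the wrong side of the split need to move.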

You can read more on this approach here (PDF) explained by Minsoo Jeon and Dongseung Kim from Korea University.

With this article at OpenGenus, you must have the complete idea of Parallel Merge Sort.