My Notes and codes documentation for CUDA learning journey
*Starting of Chapter 12 — Merge
The goal: Merge two sorted arrays into a single sorted list while maintaining order and stability.
Explaining Stability a bit more in detail:
Okay, so let’s try to explain this using the figure below:
![]()
Fig 36_01: Example of merge operation.
Well here, we can illustrate 2 different types of stability at once.
- Across input lists/arrays:
- The element
7from both arrays $A$ and $B$ are handled such that the7from $A$ comes before $B$ and then merged in array $C$. This demonstrates stability across input lists because the order of equal elements is preserved.- Within input list:
- The two elements
10from array $B$ maintain their relative order in the output array $C$. This shows stability within a single input list.Why it matters?
Stability is essential for scenarios where input arrays have been previously sorted based on other keys. For example:
- Array $A$ might have been sorted by name, and then by age.
- Array $B$ might have been sorted similarly.
- When merging these arrays based on age, stability ensures that the previous sorting by name remains intact.
The sequential merge algorithm merges two sorted arrays, $A$ and $B$, into a single sorted array, $C$. Let’s break this down step by step:
Step 1: Initialization:
i for traversing array $A$ (initialized to $0$).j for traversing array $B$ (initialized to $0$).k for filling array $C$ (initialized to $0$).Step 2: Merging Elements from Arrays $A$ and $B$:
i < m && j < n, where m is the size of $A$ and n is the size of $B$).i and k.j and k.Step 3: Handling Remaining Elements
(i == m), copy all remaining elements of array $B$ into array $C$.(j == n), copy all remaining elements of array $A$ into array $C$.Complexity Analysis:
Time Complexity: $O(m+n) \sim O(i+j)$ Space Complexity: $O(m+n)$
Click Here to redirect towards sequential merge implementation comparing with parallel approach (GPU based).
ⓘ Note: Since there are three phases, it's also called as Three Phase Scan Algo. While running the above code, the CPU execution time is way quicker than GPU for smaller arrays (Even till 10,000 elements), however when we put the input array sizes like in range of 10000000's parallesim starts kicking in.
Sample Output for various array sizes: (Array output part is excluded in larger ones to reduce too much output)
For A and B =
100000CPU Time: 969 microseconds GPU Time: 514 microsecondsHowever for smaller array sizes:
Enter size of array A: 12 Enter size of array B: 13 Array A: 7 10 11 14 24 31 34 38 42 48 55 59 Array B: 9 10 15 17 22 29 35 36 45 54 57 60 61 CPU Merged Array: 7 9 10 10 11 14 15 17 22 24 29 31 34 35 36 38 42 45 48 54 55 57 59 60 61 GPU Merged Array: 7 9 10 10 11 14 15 17 22 24 29 31 34 35 36 38 42 45 48 54 55 57 59 60 61 Result: Match :) CPU Time: 0 microseconds GPU Time: 473 microseconds
The reason why GPU execution time is longer than CPU for smaller array sizes is due to the overhead associated with GPU operations. Here's a breakdown of the key factors:
In essence, GPUs are designed for large-scale parallel computations. When the dataset is small, the overheads associated with GPU operations outweigh the benefits of parallelism, resulting in slower execution compared to the CPU.