100 Days of CUDA

My Notes and codes documentation for CUDA learning journey

View the Project on GitHub Firojpaudel/100_days_of_CUDA

Summary of Day 31:

Detailed Explanation of Kogge Stone Algo Continued

Okay, So today let’s dive deeper into this Algo and try to understand its mechanism, why we are using this and code implementation in more detailed way.

1. So what is Kogge-Stone Algo?

Answer: The Kogge-Stone algorithm is a parallel prefix sum algorithm optimized for fast execution with minimal dependencies.

2. Slight Overview:

3. Inclusive Prefix Sum Algorithm


Algorithm Steps


Initialization
  1. The input array $\text{X}$ is loaded into $\text{XY}$.
  2. Each element $\text{XY}[i]$ initially contains $\text{X}[i]$.
Iterations

for $k = 0$ to $\log_2(N) - 1$:

  1. Compute the stride as $2^k$.
  2. For each element $i$:
      XY[i] = XY[i] + XY[i - \text{stride}]
    
    • Otherwise, leave $XY[i]$ unchanged.
Final State

After all iterations, $\text{XY}$ contains the inclusive scan results.


Example with a $16$- Element Array

Fig 31_01: A parallel inclusive scan algorithm based on Kogge-Stone adder design.

Step-by-step Execution:

  1. Initial State: Each element in $\text{XY}$ contains its corresponding input value:
    \text{XY} = [x_0, x_1, x_2, ..., x_{15}]
    
  2. Iteration 1 (Stride= 1): Each element $XY[i]$ is updated by adding it’s immediate left neighbour $(XY[i-1])$
    \text{XY} = [x_0, x_0 + x_1, x_1+x_2, ..., x_{14}+x_{15}]
    
  3. Iteration 2 (Stride= 2): Each element $XY[i]$ is updated by adding the value two positions into its left $(XY[i-2])$:
    \text{XY}= [x_0, x_0 + x_1, x_0+x_1+x_2, x_0+x_1+x_2+x_3, ... ]
    
  4. Iteration 3 (Stride= 4): Each element $XY[i]$ is updated by adding the value four positions into its left $(XY[i-4])$:
\text{XY}= [x_0, x_0 + x_1, x_0+x_1+x_2, x_0+x_1+x_2+x_3, x_0+x_1+x_2+x_3+x_4, ... ]
  1. Continue untill all cumulative sums are computed.

After $\log_2(16) = 4$ iterations, the final array contains all cumulative sums.

Implementation:

Click Here to redirect towards complete inclusive scan.

Click Here to redirect towards complete exclusive scan.

So how different is Exclusive from Inclusive?

Answer:

The main difference between inclusive and exclusive prefix sums lies in whether the current element is included in the sum.

Inclusive Prefix Sum:

Exclusive Prefix Sum:

Example

Let’s consider an input array:

Input: [1, 2, 3, 4, 5]

Inclusive Prefix Sum:

Output: [1, 3, 6, 10, 15]

Exclusive Prefix Sum:

Output: [0, 1, 3, 6, 10]

Implementation Difference

The exclusive scan can be implemented by performing an inclusive scan and then shifting the array to the right by one position and setting the first element to zero.

Fig 31_02: A parallel exclusive scan algorithm based on Kogge-Stone adder design.


Algorithm Steps for Exclusive Scan


Initialization:
Iterations:
Shift Right:
Final State:

Complexity Analysis of Kogge-Stone Algorithm

Time Complexity

Space Complexity

Summary for both Inclusive and Exclusive: Both inclusive and exclusive prefix sum algorithms using Kogge-Stone’s parallel method have the same big-Oh time and space complexity, as only additional $O(1)$ steps are needed for shifting the data.


TL;DR

| **Metric** | **Inclusive Scan** | **Exclusive Scan** | |--------|----------------|----------------| | Parallel Time Complexity | $O(\log(N))$ | $O(\log(N))$ | | Sequential Time Complexity | $O(N \cdot \log (N))$ | $O(N \cdot \log (N))$ | | Space Complexity | $O(N)$ | $O(N)$ |