My Notes and codes documentation for CUDA learning journey
*Exercises from Chapter — 9 first
Solution:
First, mathematical definition of throughput:
\text{Throughput} = \frac{\text{Number of operations}}{\text{Time taken}}
Since each atomic operation takes $100 \space ns$, the throughput in operations per second is:
\frac{1}{100 \space ns}
Now converting this in seconds:
= \frac{1}{100\space \times 10^{-9} \space s}
= 10^7 \text{operations per second}
Solution:
Given:
First, calculating the average latency:
\text{Average Latency} = (\text{L2 Hit Rate} \times \text{L2 Latency}) + (\text{L2 Miss Rate} \times \text{DRAM Latency})
Substituting the values:
= (0.9 \times 4 \space ns )+ (0.1 \times 100 \space ns)
= 3.6 + 10 = 13.6 \space ns
Hence, the throughput is $\frac{1}{\text{Average Latency}}$. So,
= \frac{1}{13.6 \times 10^-9 \space s}
= 73.5 \space\text{million operations per second}
Solution:
So, from previous (Qn. 1) calculation, the max throughput we got was= $10^7 \text{atomic operations per second}$.
Now, the maximum floating point throughput of kernel execution as limited by the throughput of atomic operations is given by:
\text{Floating-Point Throughput} = (\text{Atomic Operations Throughput}) \times (\text{FLOPs/atomic operation}) \\
= (10^7) \times 5 = 50 \times 10^6 \space\text{FLOPs/sec}\\
= 50 \space\text{MFLOPs}
Solution:
From Qn 1: The original DRAM Atomic Operation Throughput = $10^7 \text {atomic op/sec}$
Next, going into shared memory privatization,
\text{Effective Latency} = 1 \times (1 + 0.1) \space ns = 1.1 \space ns
Hence, the new throughput becomes:
= \frac{1}{\text{Effective Latency}} = \frac{1}{1.1 \times 10^-9 \space s} = 0.909 \times 10^9 \space s
Now, finally: the maximum floating-point throughput of atomic operations given $5 \space \text{FLOPs/operation}$
\text{Floating point Throughput} = (0.909 \times 10^9) \times 5 = 4.545 \times 10^9 \text{FLOPs/sec} \\
= 4.545 \text{GFLOPs}
*That’s it for the exercises from Chapter 9*
Starting of *Chapter 10
When we talk about reduction, we mean taking a large dataset (like an array of numbers) and “reducing” it to a single meaningful value. This value could be the sum of all elements, the maximum, the minimum, or even the product of the elements.
Reduction is a fundamental computation pattern because it helps us summarize large amounts of data into something manageable.
For Example: if we have an array {7.0, 2.1, 5.3, 9.0, 11.2}, performing a sum reduction would give us:
7.0+2.1+5.3+9.0+11.2=34.6
Key Components of Reduction:
- Binary Operator: This specifies how we combine two values (e.g., addition for sum reduction or comparison for max reduction).
- Identity Value: This is the starting point for the operation:
- For addition: $0$
- For multiplication: $1$
- For max: $- \infty$
- For min: $+ \infty$
In sequential reduction (Code 26_01), we process elements one by one in a loop. For example:
sum = 0.sum until all elements are processed.
sum = 0.0f;
for (i = 0; i< N; ++i) {
sum += input[i];
}
Code 26_01: A simple sum reduction sequential code
Click Here to redirect towards complete sum reduction CUDA code.
The general form of sequential reduction (Code 26_02) uses an accumulator (acc) initialized to the identity value and applies an operator to combine acc with each element.
acc = IDENTITY;
for(i = 0; i < N; ++i) {
acc = Operator (acc, input[i]);
}
Code 26_02: The general form of a reduction sequential code
Click Here to redirect towards complete max reduction CUDA code.
While this approach is simple and easy to implement, it is slow because it processes elements sequentially.
Parallel reduction is where things get exciting! Instead of processing elements one by one, we divide the work among multiple threads and process them simultaneously. This speeds up computation significantly.
How Parallel Reduction Works
Parallel reduction operates in steps:
- In the first step, pairs of elements are processed in parallel (e.g., finding the max of two numbers at a time).
- In subsequent steps, partial results are combined further until only one result remains.
- This forms a “reduction tree” (Figure 26_01), where:
- Leaves are input elements.
- The root is the final result.
For Example:
{3, 1, 7, 0, 4, 1, 6, 3}.
(3 vs 1), (7 vs 0), (4 vs 1), (6 vs 3) → Results: {3, 7, 4, 6}(3 vs 7), (4 vs 6) → Results: {7, 6}.(7 vs 6) → Result: 7.
Fig 26_01: Example illustratory diagram
Key Properties for Parallel Reduction: To perform parallel reduction effectively:
- Associativity: The operator must be associative $(a \Theta b)\Theta c=a\Theta (b\Theta c)$. For example:
- Addition and max are associative.
- Subtraction is not.
- Commutativity (optional): If $a\Theta b=b\Theta a$, we can rearrange operands freely for optimization.
In sequential reduction:
In parallel reduction:
For example:
However, parallel reduction requires more hardware resources initially (e.g., multiple comparators for max operations).