100 Days of CUDA

My notes and code documentation for my CUDA learning journey

View the Project on GitHub Firojpaudel/100_days_of_CUDA

Summary of Day 71:

I’m still competing today as well.

  1. Tanh implementation:

Click Here to see the implementation using manual tanh.

[!Note]

  • Average performance: $28.74 \text{ GFLOPs}$
  • Average Runtime: $1.20 \text{ ms}$
  • Device: Tesla T4

  • Average performance: $194.73 \text{ GFLOPs}$
  • Average Runtime: $0.25 \text{ ms}$
  • Device: NVIDIA H100
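For readers who cannot follow the link, here is a minimal sketch of what a "manual" tanh kernel can look like, using the identity $\tanh(x) = (e^{2x} - 1)/(e^{2x} + 1)$. It is not the linked implementation: the names `tanh_kernel` and `tanh_on_gpu`, the clamp at $|x| = 10$, and the block size of 256 are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Manual tanh via tanh(x) = (e^{2x} - 1) / (e^{2x} + 1). One thread per element.
__global__ void tanh_kernel(const float* in, float* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Clamp first: beyond |x| = 10 the result rounds to +-1 in float, and the
    // clamp keeps expf(2x) from overflowing to inf (inf / inf would be NaN).
    float x = fminf(fmaxf(in[i], -10.0f), 10.0f);
    float e = expf(2.0f * x);
    out[i] = (e - 1.0f) / (e + 1.0f);
}

// Hypothetical host helper: copies data over, launches the kernel, copies back.
// CUDA error checking is omitted to keep the sketch short.
std::vector<float> tanh_on_gpu(const std::vector<float>& h_in) {
    size_t n = h_in.size();
    std::vector<float> h_out(n);
    if (n == 0) return h_out;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    const int threads = 256;  // assumed block size
    const int blocks = (int)((n + threads - 1) / threads);
    tanh_kernel<<<blocks, threads>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling `tanh_on_gpu({-1.0f, 0.0f, 1.0f})` should agree with `std::tanh` to within a small absolute error. The subtraction `e - 1.0f` loses relative precision for very small inputs, so this form trades some accuracy near zero for simplicity.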
  2. Softmax:

Click Here to redirect towards the code.

[!Note]

  • Average performance: $164.78 \text{ GFLOPs}$
  • Average Runtime: $0.93 \text{ ms}$
  • Device: NVIDIA H100
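As a rough illustration of the technique, the sketch below computes a numerically stable row-wise softmax with one block per row: a shared-memory reduction finds the row maximum, a second one sums $e^{x_j - \max}$, and a final pass normalizes. The input layout (row-major `rows x cols`), the names `softmax_rows` and `softmax_on_gpu`, and `THREADS = 256` are assumptions; the linked code may be organized quite differently.

```cuda
#include <cuda_runtime.h>
#include <cfloat>
#include <vector>

constexpr int THREADS = 256;  // assumed block size; must be a power of two here

// One block per row. Must be launched with exactly THREADS threads per block.
__global__ void softmax_rows(const float* in, float* out, int cols) {
    __shared__ float buf[THREADS];
    const float* x = in + (size_t)blockIdx.x * cols;
    float* y = out + (size_t)blockIdx.x * cols;
    const int t = threadIdx.x;

    // Pass 1: row maximum (subtracting it keeps expf from overflowing).
    float m = -FLT_MAX;
    for (int j = t; j < cols; j += THREADS) m = fmaxf(m, x[j]);
    buf[t] = m;
    __syncthreads();
    for (int s = THREADS / 2; s > 0; s >>= 1) {
        if (t < s) buf[t] = fmaxf(buf[t], buf[t + s]);
        __syncthreads();
    }
    m = buf[0];
    __syncthreads();  // every thread has read buf[0] before buf is reused

    // Pass 2: sum of exp(x - max).
    float sum = 0.0f;
    for (int j = t; j < cols; j += THREADS) sum += expf(x[j] - m);
    buf[t] = sum;
    __syncthreads();
    for (int s = THREADS / 2; s > 0; s >>= 1) {
        if (t < s) buf[t] += buf[t + s];
        __syncthreads();
    }
    sum = buf[0];

    // Pass 3: normalize.
    for (int j = t; j < cols; j += THREADS) y[j] = expf(x[j] - m) / sum;
}

// Hypothetical host helper; CUDA error checking omitted for brevity.
std::vector<float> softmax_on_gpu(const std::vector<float>& h_in, int rows, int cols) {
    std::vector<float> h_out(h_in.size());
    if (h_in.empty()) return h_out;
    const size_t bytes = h_in.size() * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    softmax_rows<<<rows, THREADS>>>(d_in, d_out, cols);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The max subtraction is what makes a row such as `{1000, 1000, 1000}` produce three equal probabilities instead of NaN.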
  3. Vector Addition:

Approach 1:

Trying loop unrolling (4 elements per thread)

Click Here to redirect towards the code.

[!Note]

  • Average performance: $202.72 \text{ GFLOPs}$
  • Average Runtime: $0.95 \text{ ms}$
  • Device: NVIDIA H100
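A minimal sketch of the "4 elements per thread" idea follows. Each thread handles four elements spaced one block width apart, so neighbouring threads still touch neighbouring addresses. The names `vec_add_x4` and `vec_add_x4_on_gpu` and the block size are assumptions, and the linked kernel may unroll differently (for example with `float4` loads).

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int THREADS = 256;   // assumed block size
constexpr int PER_THREAD = 4;  // elements handled by each thread

__global__ void vec_add_x4(const float* a, const float* b, float* c, size_t n) {
    // First element of this block's tile, plus this thread's offset in it.
    size_t base = blockIdx.x * (size_t)(blockDim.x * PER_THREAD) + threadIdx.x;
    #pragma unroll
    for (int k = 0; k < PER_THREAD; ++k) {
        size_t i = base + (size_t)k * blockDim.x;  // stride of one block width
        if (i < n) c[i] = a[i] + b[i];             // guard the ragged tail
    }
}

// Hypothetical host helper; assumes a and b have equal length.
// CUDA error checking omitted for brevity.
std::vector<float> vec_add_x4_on_gpu(const std::vector<float>& a,
                                     const std::vector<float>& b) {
    size_t n = a.size();
    std::vector<float> c(n);
    if (n == 0) return c;
    const size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);
    const size_t tile = (size_t)THREADS * PER_THREAD;
    const int blocks = (int)((n + tile - 1) / tile);  // 4x fewer blocks than 1 elem/thread
    vec_add_x4<<<blocks, THREADS>>>(d_a, d_b, d_c, n);
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}
```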

Approach 2:

Using Shared Memory

Click Here to redirect towards the code.

[!Note]

  • Average performance: $166.15 \text{ GFLOPs}$
  • Average Runtime: $1.16 \text{ ms}$
  • Device: NVIDIA H100
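One way the shared-memory variant might look is sketched below: each thread stages its two operands in shared memory, the block synchronizes, and the sum is written back. The names `vec_add_smem` and `vec_add_smem_on_gpu` and the block size are assumptions, not taken from the linked code.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int THREADS = 256;  // assumed block size; the kernel relies on it

// Must be launched with exactly THREADS threads per block.
__global__ void vec_add_smem(const float* a, const float* b, float* c, size_t n) {
    __shared__ float sa[THREADS];
    __shared__ float sb[THREADS];
    const int t = threadIdx.x;
    size_t i = blockIdx.x * (size_t)THREADS + t;
    if (i < n) {        // stage operands in shared memory
        sa[t] = a[i];
        sb[t] = b[i];
    }
    __syncthreads();    // reached by every thread, in or out of range
    if (i < n) c[i] = sa[t] + sb[t];
}

// Hypothetical host helper; assumes a and b have equal length.
// CUDA error checking omitted for brevity.
std::vector<float> vec_add_smem_on_gpu(const std::vector<float>& a,
                                       const std::vector<float>& b) {
    size_t n = a.size();
    std::vector<float> c(n);
    if (n == 0) return c;
    const size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);
    const int blocks = (int)((n + THREADS - 1) / THREADS);
    vec_add_smem<<<blocks, THREADS>>>(d_a, d_b, d_c, n);
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}
```

In this sketch every element is read exactly once and only by the thread that loaded it, so the staging step adds traffic without any reuse. That would be consistent with this approach measuring slower than plain unrolling, though the linked kernel may differ.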

Using both at the same time:

Click Here to redirect towards the code.

[!Note]

  • Average performance: $245.66 \text{ GFLOPs}$
  • Average Runtime: $0.74 \text{ ms}$
  • Device: NVIDIA H100
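A sketch of how the two ideas could be combined: each block stages a tile of `THREADS * PER_THREAD` elements in shared memory with an unrolled loop, synchronizes, then writes the sums with a second unrolled loop. The kernel name `vec_add_smem_x4`, the helper `vec_add_combined_on_gpu`, and the tile size are assumptions for illustration only.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int THREADS = 256;                 // assumed block size
constexpr int PER_THREAD = 4;                // elements per thread
constexpr int TILE = THREADS * PER_THREAD;   // elements per block (8 KB of shared memory)

// Must be launched with exactly THREADS threads per block.
__global__ void vec_add_smem_x4(const float* a, const float* b, float* c, size_t n) {
    __shared__ float sa[TILE];
    __shared__ float sb[TILE];
    const size_t tile_start = blockIdx.x * (size_t)TILE;
    const int t = threadIdx.x;

    #pragma unroll
    for (int k = 0; k < PER_THREAD; ++k) {   // unrolled, coalesced loads
        int s = t + k * THREADS;             // slot inside the tile
        size_t i = tile_start + s;
        if (i < n) { sa[s] = a[i]; sb[s] = b[i]; }
    }
    __syncthreads();
    #pragma unroll
    for (int k = 0; k < PER_THREAD; ++k) {   // unrolled add and store
        int s = t + k * THREADS;
        size_t i = tile_start + s;
        if (i < n) c[i] = sa[s] + sb[s];
    }
}

// Hypothetical host helper; assumes a and b have equal length.
// CUDA error checking omitted for brevity.
std::vector<float> vec_add_combined_on_gpu(const std::vector<float>& a,
                                           const std::vector<float>& b) {
    size_t n = a.size();
    std::vector<float> c(n);
    if (n == 0) return c;
    const size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);
    const int blocks = (int)((n + TILE - 1) / TILE);
    vec_add_smem_x4<<<blocks, THREADS>>>(d_a, d_b, d_c, n);
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}
```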