My notes and code documentation for my CUDA learning journey.
Improved my previous vector addition kernel to achieve higher throughput.
Click Here to redirect to the code.
> [!note]
> - Performance: $277.82 \text{ GFLOPS}$
> - Runtime: $0.72 \text{ ms}$
> - GPU: NVIDIA H100

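
For reference, here is a minimal sketch of a vector addition kernel. It is not the linked code, only an illustration that uses a grid-stride loop. The names `vector_add` and `vector_add_host`, the block size of 256 threads, and the cap of 65535 blocks are assumptions chosen for the example, not tuned values.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Element-wise c[i] = a[i] + b[i]. The grid-stride loop lets a fixed-size
// grid cover any n, so each thread may process several elements.
__global__ void vector_add(const float* __restrict__ a,
                           const float* __restrict__ b,
                           float* __restrict__ c,
                           size_t n) {
    const size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host helper: copies the inputs to the device, launches the
// kernel, and copies the result back. Returns the first CUDA error seen.
cudaError_t vector_add_host(const float* h_a, const float* h_b, float* h_c, size_t n) {
    if (n == 0) return cudaSuccess;
    const size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;

    cudaError_t err = cudaMalloc(&d_a, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_b, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_c, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        const int threads = 256;  // assumed block size
        const size_t needed = (n + threads - 1) / threads;
        const int blocks = (int)(needed < 65535 ? needed : 65535);  // assumed cap
        vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
        err = cudaGetLastError();  // catches launch-configuration errors
    }

    // The device-to-host copy on the default stream waits for the kernel.
    if (err == cudaSuccess) err = cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);  // cudaFree(nullptr) is a no-op, so this is safe on failure
    cudaFree(d_b);
    cudaFree(d_c);
    return err;
}
```

With `float` data, each output element costs one addition but moves 12 bytes (two loads and one store), so a kernel like this is limited by memory bandwidth rather than arithmetic.
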
Next, wrote a ReLU kernel:
Click Here to redirect to the code
> [!note]
> - Performance: $450.33 \text{ GFLOPS}$
> - Runtime: $0.18 \text{ ms}$
> - GPU: NVIDIA H100
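
As an illustration only (again, not the linked code), a ReLU kernel can be sketched as an element-wise `max(x, 0)` over a flat array. The names `relu` and `relu_host`, the flat 1D layout, and the launch configuration are assumptions made for this example.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Element-wise y[i] = max(x[i], 0) over a flat array of n floats,
// using a grid-stride loop so any n is covered by a fixed-size grid.
__global__ void relu(const float* __restrict__ x,
                     float* __restrict__ y,
                     size_t n) {
    const size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        y[i] = fmaxf(x[i], 0.0f);
    }
}

// Hypothetical host helper: copies the input to the device, launches the
// kernel, and copies the result back. Returns the first CUDA error seen.
cudaError_t relu_host(const float* h_x, float* h_y, size_t n) {
    if (n == 0) return cudaSuccess;
    const size_t bytes = n * sizeof(float);
    float *d_x = nullptr, *d_y = nullptr;

    cudaError_t err = cudaMalloc(&d_x, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_y, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        const int threads = 256;  // assumed block size
        const size_t needed = (n + threads - 1) / threads;
        const int blocks = (int)(needed < 65535 ? needed : 65535);  // assumed cap
        relu<<<blocks, threads>>>(d_x, d_y, n);
        err = cudaGetLastError();  // catches launch-configuration errors
    }

    // The device-to-host copy on the default stream waits for the kernel.
    if (err == cudaSuccess) err = cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x);  // cudaFree(nullptr) is a no-op, so this is safe on failure
    cudaFree(d_y);
    return err;
}
```

A matrix stored contiguously can be passed to this sketch as a flat array of `rows * cols` elements, since ReLU does not depend on the element's position.
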