Summary of Day 86:

Improved my previous vector addition kernel so that it reaches higher throughput (more GFLOPS).

Click Here to redirect to the code.

> [!NOTE]
> Performance: $277.82 \text{ GFLOPS}$
>
> Runtime: $0.72 \text{ ms}$
>
> GPU: NVIDIA H100
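
The log does not say what was changed in the kernel, so the following is only a minimal sketch of one common way to speed up vector addition: a grid-stride loop with 128-bit `float4` loads and stores. The names `vector_add_vec4` and `launch_vector_add` are hypothetical, and the sketch assumes the device pointers come straight from `cudaMalloc` (and are therefore 16-byte aligned).

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical kernel, not the author's actual code.
// Assumes a, b, c are 16-byte aligned (true for pointers returned by cudaMalloc).
__global__ void vector_add_vec4(const float* __restrict__ a,
                                const float* __restrict__ b,
                                float* __restrict__ c,
                                size_t n) {
    size_t tid    = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
    size_t n4     = n / 4;  // number of complete float4 groups

    const float4* a4 = reinterpret_cast<const float4*>(a);
    const float4* b4 = reinterpret_cast<const float4*>(b);
    float4*       c4 = reinterpret_cast<float4*>(c);

    // Main loop: each iteration adds four floats with one 128-bit load per input.
    for (size_t i = tid; i < n4; i += stride) {
        float4 x = a4[i];
        float4 y = b4[i];
        c4[i] = make_float4(x.x + y.x, x.y + y.y, x.z + y.z, x.w + y.w);
    }

    // Tail: the last n % 4 elements are handled one float at a time.
    for (size_t i = n4 * 4 + tid; i < n; i += stride) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host-side launcher; d_a, d_b, d_c are device pointers.
void launch_vector_add(const float* d_a, const float* d_b, float* d_c, size_t n) {
    const int threads = 256;
    size_t groups = n / 4;
    size_t blocks = (groups + threads - 1) / threads;
    if (blocks < 1) blocks = 1;                    // still cover the scalar tail
    if (blocks > (1u << 20)) blocks = (1u << 20);  // grid-stride loop covers the rest
    vector_add_vec4<<<static_cast<unsigned int>(blocks), threads>>>(d_a, d_b, d_c, n);
}
```

Vector addition does one floating-point operation for every twelve bytes moved, so it is typically limited by memory bandwidth rather than arithmetic. Wider loads mainly reduce the number of memory instructions issued per element; whether that helps on a given GPU is something to measure rather than assume.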

Next, wrote a ReLU kernel:

Click Here to redirect to the code

> [!NOTE]
> Performance: $450.33 \text{ GFLOPS}$
>
> Runtime: $0.18 \text{ ms}$
>
> GPU: NVIDIA H100
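
For reference, an elementwise ReLU kernel can be sketched as below. This is an illustrative version, not the linked code: the names `relu_kernel` and `launch_relu` are hypothetical, and the sketch assumes a flat `float` array of length `n` already resident on the device.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical kernel, not the author's actual code.
// Computes out[i] = max(in[i], 0) for every element of a flat float array.
__global__ void relu_kernel(const float* __restrict__ in,
                            float* __restrict__ out,
                            size_t n) {
    size_t tid    = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;

    // Grid-stride loop: correct for any n, whatever grid size is launched.
    for (size_t i = tid; i < n; i += stride) {
        out[i] = fmaxf(in[i], 0.0f);
    }
}

// Hypothetical host-side launcher; d_in and d_out are device pointers.
void launch_relu(const float* d_in, float* d_out, size_t n) {
    const int threads = 256;
    size_t blocks = (n + threads - 1) / threads;
    if (blocks < 1) blocks = 1;
    if (blocks > (1u << 20)) blocks = (1u << 20);  // grid-stride loop covers the rest
    relu_kernel<<<static_cast<unsigned int>(blocks), threads>>>(d_in, d_out, n);
}
```

Like vector addition, ReLU does very little arithmetic per byte moved, so the same memory-access ideas (coalesced or vectorized loads) are the usual things to try when tuning it.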