Summary of Day 71:

I’m still competing today as well.

Tanh implementation:

Click Here to see the implementation using manual tanh 🙂‍↕️

> [!NOTE]
> Average performance: $28.74 \text{ GFLOPs}$
>
> Average Runtime: $1.20 \text{ ms}$
>
> Device: Tesla T4
>
> Average performance: $194.73 \text{ GFLOPs}$
>
> Average Runtime: $0.25 \text{ ms}$
>
> Device: NVIDIA H100
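The linked kernel is not reproduced here. As a rough illustration of what a "manual" tanh might look like, the host-side C++ sketch below builds tanh from a single `exp` call instead of calling the library `tanh`. The names `manual_tanh` and `tanh_reference` are hypothetical, and the formula is an assumption about the approach, not the author's code. A reference like this can also be used to check a GPU kernel's output.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: tanh built from one exp call.
// Uses tanh(x) = sign(x) * (1 - e^{-2|x|}) / (1 + e^{-2|x|}),
// so the exponent is never positive and exp cannot overflow.
inline float manual_tanh(float x) {
    float e = std::exp(-2.0f * std::fabs(x));  // lies in [0, 1]
    float t = (1.0f - e) / (1.0f + e);         // tanh(|x|)
    return std::copysign(t, x);                // restore the sign of x
}

// Element-wise reference: each loop iteration stands in for one GPU thread.
inline std::vector<float> tanh_reference(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = manual_tanh(in[i]);
    }
    return out;
}
```

Working with `-2|x|` keeps large inputs saturating cleanly at ±1 rather than producing `inf / inf`.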

Softmax:

Click Here to redirect towards the code.

> [!NOTE]
> Average performance: $164.78 \text{ GFLOPs}$
>
> Average Runtime: $0.93 \text{ ms}$
>
> Device: NVIDIA H100
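For readers without access to the linked code, here is a minimal host-side C++ sketch of a numerically stable softmax over one vector. It subtracts the maximum before exponentiating, which is the usual way to avoid overflow. Whether the benchmarked kernel does exactly this is an assumption, and `softmax_reference` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for softmax over a single vector.
inline std::vector<float> softmax_reference(const std::vector<float>& x) {
    std::vector<float> y(x.size());
    if (x.empty()) return y;

    // Pass 1: find the maximum so every exponent is <= 0.
    float m = *std::max_element(x.begin(), x.end());

    // Pass 2: exponentiate the shifted values and accumulate their sum.
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(x[i] - m);
        sum += y[i];
    }

    // Pass 3: normalize so the outputs sum to 1.
    for (float& v : y) v /= sum;
    return y;
}
```

On a GPU, the max and sum passes are typically implemented as block-level reductions; the three-pass structure stays the same.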

Vector Addition:

Approach 1:

Trying loop unrolling (4 elements per thread)

Click Here to redirect towards the code.

> [!NOTE]
> Average performance: $202.72 \text{ GFLOPs}$
>
> Average Runtime: $0.95 \text{ ms}$
>
> Device: NVIDIA H100
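The index math behind "4 elements per thread" can be sketched on the host in C++. Each simulated thread owns four consecutive elements, takes a fully unrolled path when all four are in range, and falls back to a bounds-checked loop for the tail. The names `add4_thread` and `vector_add_unrolled` are hypothetical, and the consecutive-element layout is an assumption. The linked kernel may use a different mapping (for example a strided one).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kElemsPerThread = 4;  // "4 elements per thread"

// Work done by one simulated thread with index tid (hypothetical helper).
inline void add4_thread(std::size_t tid, const float* a, const float* b,
                        float* c, std::size_t n) {
    std::size_t base = tid * kElemsPerThread;
    if (base + kElemsPerThread <= n) {
        // Fast path: all four elements are in range, so no per-element check.
        c[base + 0] = a[base + 0] + b[base + 0];
        c[base + 1] = a[base + 1] + b[base + 1];
        c[base + 2] = a[base + 2] + b[base + 2];
        c[base + 3] = a[base + 3] + b[base + 3];
    } else {
        // Tail: fewer than four elements remain, so check each index.
        for (std::size_t i = base; i < n; ++i) c[i] = a[i] + b[i];
    }
}

// Host-side driver: loops over the thread indices a kernel launch would cover.
inline std::vector<float> vector_add_unrolled(const std::vector<float>& a,
                                              const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::size_t n = a.size();
    std::vector<float> c(n);
    std::size_t threads = (n + kElemsPerThread - 1) / kElemsPerThread;  // ceil(n / 4)
    for (std::size_t tid = 0; tid < threads; ++tid) {
        add4_thread(tid, a.data(), b.data(), c.data(), n);
    }
    return c;
}
```

With this mapping a launch needs only a quarter as many threads as elements, rounded up, and only the last thread ever takes the tail path.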

Approach 2:

Using Shared Memory

Click Here to redirect towards the code.

> [!NOTE]
> Average performance: $166.15 \text{ GFLOPs}$
>
> Average Runtime: $1.16 \text{ ms}$
>
> Device: NVIDIA H100

Using both at the same time:

Click Here to redirect towards the code.

> [!NOTE]
> Average performance: $245.66 \text{ GFLOPs}$
>
> Average Runtime: $0.74 \text{ ms}$
>
> Device: NVIDIA H100