Attention-based Transformer models have achieved remarkable performance across artificial intelligence fields, yet the attention computation, a combination of matrix multiplications and the softmax function, remains sub-optimal in hardware implementations. Conventionally, computing attention requires three passes of input memory access, and the on-chip storage requirement scales with the input length; both pose significant memory issues. Moreover, the computational burden becomes heavy for long inputs. This paper proposes an algorithm-hardware co-design for attention. On the algorithm side, a linear-softmax fused kernel combines the matrix multiplications with the non-linear functions, enabling high on-chip memory utilization. On the hardware side, we present an accelerator named BLADE with identical-partial-product removal, which eliminates unnecessary computation by exploiting mathematical properties of the softmax function. Experiments on ViT, Swin Transformer, GPT-2, and LLaMA2 show that the proposed design achieves 10.6%-18.7% energy-efficiency improvements over state-of-the-art FlashAttention implementations.
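For context only (standard background, not a contribution of this work), the attention computation referred to above is the usual scaled dot-product attention with a numerically stable softmax:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,
\qquad
\mathrm{softmax}(x)_i = \frac{e^{x_i - \max_j x_j}}{\sum_j e^{x_j - \max_j x_j}}.
\]

The three memory passes mentioned in the abstract typically arise from this stable softmax schedule: one pass to find each row's maximum, one to compute the exponentials and their sum, and one to normalize and multiply by \(V\); the exact pass structure assumed by the proposed kernel is detailed in the body of the paper.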