AUTHOREA
BLADE: Energy-Efficient Attention Accelerator with Fused Kernel and Bit-Level Redunda...
Zhiwei Lin, Yubin Qin, and 9 more

December 06, 2024
Attention-based Transformer models have achieved remarkable performance across artificial intelligence fields, yet attention computation, a combination of matrix multiplication and the softmax function, remains sub-optimized in hardware implementations. It normally requires three passes over input memory, and the on-chip storage requirement scales with input length, both of which pose significant memory issues; the computational burden is also heavy for long inputs. This paper proposes an algorithm-hardware co-design for attention. On the algorithm side, it uses a linear-softmax fused kernel that fuses the matrix multiplications with the non-linear functions, enabling high utilization of on-chip memory resources. On the hardware side, it presents an accelerator named BLADE with identical-partial-product removal, which eliminates unnecessary computation by exploiting mathematical properties of softmax. Experiments on ViT, Swin Transformer, GPT-2, and LLaMA2 show that the proposed design achieves a 10.6%-18.7% energy-efficiency improvement over state-of-the-art Flash Attention implementations.
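The abstract does not detail BLADE's linear-softmax fused kernel, but the general idea it builds on, computing softmax attention in a single streaming pass so that on-chip storage no longer scales with input length, can be sketched with the online-softmax technique (the same principle underlying the Flash Attention baseline it compares against). The function name and blocking scheme below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def fused_attention(q, K, V, block=2):
    """Illustrative single-pass attention for one query vector q.

    Streams K/V in blocks, keeping a running max m, running softmax
    denominator l, and running weighted sum acc, so the full score
    vector is never materialized. This is a sketch of online-softmax
    fusion, not BLADE's actual kernel (which is not given here).
    """
    d = q.shape[0]
    m = -np.inf          # running maximum of the scores seen so far
    l = 0.0              # running softmax denominator
    acc = np.zeros(d)    # running numerator (weighted sum of V rows)
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(d)  # one block of scores
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                    # rescale old statistics
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / l

# Sanity check against the naive three-pass softmax attention.
rng = np.random.default_rng(0)
q = rng.normal(size=4)
K, V = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
s = K @ q / np.sqrt(4)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(fused_attention(q, K, V), ref)
```

Because each block updates `m`, `l`, and `acc` in constant space, the kernel reads the inputs once instead of the three passes the abstract describes for the naive formulation.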
