BLADE: Energy-Efficient Attention Accelerator with Fused Kernel and
Bit-Level Redundancy Elimination
Abstract
Attention-based Transformer models have achieved remarkable performance
across various artificial intelligence fields, yet the attention
computation, a combination of matrix multiplication and the softmax
function, remains sub-optimal in hardware implementations. Conventionally,
computing attention requires three passes of input memory access, and the
on-chip storage requirement grows with the input length, both of which
pose significant memory issues. Furthermore, the computation burden is
heavy for long inputs. This paper proposes an algorithm-hardware
co-design for attention. On the algorithm side, it uses a linear-softmax
fused kernel that fuses the matrix multiplications with the non-linear
functions, enabling high on-chip memory resource utilization. On the
hardware side, it presents an accelerator named BLADE with identical
partial-product removal, which eliminates unnecessary computation by
exploiting a mathematical property of the softmax function. Experiments
on ViT, Swin Transformer, GPT-2, and LLaMA2 show that the proposed design
achieves a 10.6% to 18.7% energy-efficiency improvement over
state-of-the-art Flash Attention implementations.
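
To illustrate the kind of kernel fusion the abstract refers to, the sketch below shows a single-pass, online-softmax formulation of attention for one query row in plain Python/NumPy. It is a generic, assumption-level illustration of fusing the score matrix multiplication with the softmax normalization so that no full score vector is ever materialized; it is not BLADE's linear-softmax kernel or its partial-product elimination logic.

```python
import numpy as np

def fused_attention_row(q, K, V):
    """Single-pass attention for one query row using an online softmax.

    Generic illustration of kernel fusion: scores, softmax normalization,
    and the weighted sum over V are produced in one sweep over the keys,
    so the full length-N score vector never needs to be stored on chip.
    """
    d = q.shape[0]
    m = -np.inf                  # running maximum of the scores (numerical stability)
    s = 0.0                      # running sum of exp(score - m)
    acc = np.zeros(V.shape[1])   # running exp-weighted sum of value rows

    for k_row, v_row in zip(K, V):
        score = np.dot(q, k_row) / np.sqrt(d)
        m_new = max(m, score)
        scale = np.exp(m - m_new)        # rescale previous partial results
        w = np.exp(score - m_new)        # weight of the current key/value pair
        s = s * scale + w
        acc = acc * scale + w * v_row
        m = m_new

    return acc / s               # equals softmax(q K^T / sqrt(d)) @ V for this row


# Usage check against the conventional multi-pass formulation.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d = 16, 8
    q, K, V = rng.standard_normal(d), rng.standard_normal((N, d)), rng.standard_normal((N, d))

    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    assert np.allclose(fused_attention_row(q, K, V), weights @ V)
```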