
BLADE: Energy-Efficient Attention Accelerator with Fused Kernel and Bit-Level Redundancy Elimination
  • Zhiwei Lin (Tsinghua University)
  • Yubin Qin (Tsinghua University)
  • Jiachen Wang (Tsinghua University)
  • Yang Wang (Tsinghua University)
  • Huanyu Wang (Tsinghua University)
  • Zhe Zheng (Beijing Smart-chip Microelectronics Technology Co., Ltd.)
  • WenPeng Cui (Beijing Smart-chip Microelectronics Technology Co., Ltd.)
  • Shaojun Wei (Tsinghua University)
  • Yang Hu (Tsinghua University)
  • Shouyi Yin (Tsinghua University)

Corresponding Author: Shouyi Yin, yinsy@tsinghua.edu.cn

Abstract

Attention-based Transformer models have achieved remarkable performance across many artificial intelligence fields, yet the attention computation, a combination of matrix multiplication and the softmax function, remains poorly optimized in hardware. A conventional implementation requires three passes over the input in memory, and its on-chip storage requirement grows with the input length, both of which pose significant memory problems. The computation burden is also heavy for long inputs. This paper proposes an algorithm-hardware co-design for attention. On the algorithm side, a linear-softmax fused kernel merges the matrix multiplications with the non-linear functions, enabling high utilization of on-chip memory resources. On the hardware side, an accelerator named BLADE removes identical partial products, eliminating unnecessary computation by exploiting the mathematical properties of softmax. Experiments on ViT, Swin Transformer, GPT-2 and LLaMA2 show that the proposed design achieves a 10.6% to 18.7% energy-efficiency improvement over state-of-the-art Flash Attention implementations.
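
To illustrate the fused-kernel idea described in the abstract, the sketch below computes attention in a single pass over key/value blocks using an online softmax, so that only a running row-wise maximum and denominator need to be kept on chip instead of the full attention matrix. This is a conceptual NumPy sketch only, not the authors' kernel or the BLADE hardware pipeline; the function name and block size are hypothetical choices for the example.

```python
# Conceptual sketch (assumption: not the authors' implementation) of a
# fused matmul + softmax attention computed in one streaming pass.
import numpy as np

def fused_attention(Q, K, V, block=64):
    """Compute softmax(Q K^T / sqrt(d)) V by streaming K/V in blocks."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q, dtype=np.float64)
    running_max = np.full(n, -np.inf)   # row-wise running maximum
    running_sum = np.zeros(n)           # row-wise running softmax denominator

    for start in range(0, K.shape[0], block):
        Kb = K[start:start + block]     # stream one K/V block at a time
        Vb = V[start:start + block]
        scores = (Q @ Kb.T) * scale     # partial logits for this block

        new_max = np.maximum(running_max, scores.max(axis=1))
        # Rescale previously accumulated numerator/denominator to the new max.
        correction = np.exp(running_max - new_max)
        p = np.exp(scores - new_max[:, None])

        running_sum = running_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        running_max = new_max

    return out / running_sum[:, None]

# Quick check against a naive multi-pass reference.
rng = np.random.default_rng(0)
Q = rng.standard_normal((128, 64))
K = rng.standard_normal((128, 64))
V = rng.standard_normal((128, 64))
S = (Q @ K.T) / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(fused_attention(Q, K, V), ref)
```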
03 Dec 2024: Submitted to Electronics Letters
06 Dec 2024: Submission Checks Completed
06 Dec 2024: Assigned to Editor
06 Dec 2024: Review(s) Completed, Editorial Evaluation Pending
07 Dec 2024: Reviewer(s) Assigned