BLADE: Energy-Efficient Attention Accelerator with Fused Kernel and
Bit-Level Redundancy Elimination
Abstract
Attention-based Transformer models have achieved remarkable performance
across various artificial intelligence fields, yet the attention
computation, a combination of matrix multiplication and the softmax
function, remains sub-optimal in hardware implementations. Conventionally,
computing attention requires three passes of input memory access, and the
on-chip storage requirement grows with the input length, both of which
pose significant memory issues. Furthermore, the computation burden is
heavy for long inputs. This paper proposes an algorithm-hardware
co-design for attention. On the algorithm side, it uses a linear-softmax
fused kernel that fuses the matrix multiplications with the non-linear
functions, enabling high on-chip memory resource utilization. On the
hardware side, it presents an accelerator named BLADE with identical
partial-product removal, which eliminates unnecessary computation by
exploiting a mathematical property of the softmax function. Experiments
on ViT, Swin Transformer, GPT-2, and LLaMA2 show that the proposed design
achieves a 10.6% to 18.7% energy-efficiency improvement over
state-of-the-art Flash Attention implementations.
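
To illustrate the kind of kernel fusion the abstract refers to, the sketch below shows a single-pass, online-softmax formulation of attention for one query row in plain Python/NumPy. It is a generic, assumption-level illustration of fusing the score matrix multiplication with the softmax normalization so that no full score vector is ever materialized; it is not BLADE's linear-softmax kernel or its partial-product elimination logic.

```python
import numpy as np

def fused_attention_row(q, K, V):
    """Single-pass attention for one query row using an online softmax.

    Generic illustration of kernel fusion: scores, softmax normalization,
    and the weighted sum over V are produced in one sweep over the keys,
    so the full length-N score vector never needs to be stored on chip.
    """
    d = q.shape[0]
    m = -np.inf                  # running maximum of the scores (numerical stability)
    s = 0.0                      # running sum of exp(score - m)
    acc = np.zeros(V.shape[1])   # running exp-weighted sum of value rows

    for k_row, v_row in zip(K, V):
        score = np.dot(q, k_row) / np.sqrt(d)
        m_new = max(m, score)
        scale = np.exp(m - m_new)        # rescale previous partial results
        w = np.exp(score - m_new)        # weight of the current key/value pair
        s = s * scale + w
        acc = acc * scale + w * v_row
        m = m_new

    return acc / s               # equals softmax(q K^T / sqrt(d)) @ V for this row


# Usage check against the conventional multi-pass formulation.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d = 16, 8
    q, K, V = rng.standard_normal(d), rng.standard_normal((N, d)), rng.standard_normal((N, d))

    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    assert np.allclose(fused_attention_row(q, K, V), weights @ V)
```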