ETRI-Knowledge Sharing Platform

Conference Paper I-FlashAttention: Fully Integer Fused Attention for Efficient Vision Transformers
Cited 0 times in Scopus
Authors
Sehyeon Oh, Yongin Kwon, Jemin Lee
Issue Date
2025-09
Citation
International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES) 2025, pp.25-26
Language
English
Type
Conference Paper
DOI
https://dx.doi.org/10.1145/3742872.3757072
Abstract
Transformer self-attention offers strong expressiveness, but its compute and memory costs grow rapidly with sequence length. The result is frequent off-chip memory access, which becomes a major performance bottleneck. FlashAttention reduces this cost by dividing the sequence into tiles that are computed entirely in on-chip memory, avoiding off-chip storage of intermediate tensors and alleviating the memory-bandwidth pressure. However, the tile-wise online softmax still requires floating-point operations, because numerical stability relies on max-based scaling and accumulation. We propose I-FlashAttention, an integer-only version of FlashAttention. It uses a shift-based exponential approximation and integer max-tracking to perform the online softmax without floating point. All steps, from the INT8 GEMM to the output, are fused into a single Triton kernel. I-FlashAttention is 1.08× faster than FP16 FlashAttention and 7.10× faster than I-ViT.
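
To make the abstract's description more concrete, the following is a minimal NumPy sketch of an integer-only online softmax-attention in the spirit described: scores are assumed to be pre-quantized fixed-point integers, the exponential is approximated with shifts and adds (via exp(x) = 2^(x·log2 e) with log2 e ≈ 1 + 1/2 − 1/16), and the running maximum is tracked in integers so previously accumulated tiles can be rescaled without floating point. The function names, the fixed-point scale, and the particular shift approximation are illustrative assumptions; this is not the paper's fused Triton kernel.

import numpy as np

FRAC_BITS = 8                      # illustrative fixed-point fraction bits for scores
ONE = 1 << FRAC_BITS               # fixed-point representation of 1.0

def shift_exp(x):
    """Shift-and-add approximation of exp(x) for non-positive fixed-point x.
    Uses exp(x) = 2^(x*log2 e), approximates x*log2 e as x + (x >> 1) - (x >> 4)
    (log2 e ~ 1 + 1/2 - 1/16), and linearizes 2^r ~ 1 + r/2 on the fractional
    remainder. Everything stays in integer arithmetic."""
    t = x + (x >> 1) - (x >> 4)    # ~ x * log2(e), still non-positive
    q = (-t) >> FRAC_BITS          # integer part: number of halvings
    r = t + (q << FRAC_BITS)       # fractional remainder in (-ONE, 0]
    return (ONE + (r >> 1)) >> q   # (1 + r/2) * 2^(-q) in fixed point

def vec_shift_exp(x):
    """Elementwise shift_exp over an integer numpy array (entries <= 0)."""
    return np.vectorize(shift_exp)(x.astype(np.int64))

def int_online_attention(score_tiles, value_tiles):
    """Integer-only online softmax-attention over pre-quantized score/value tiles.
    Tracks the running integer row maximum m, rescales the running numerator and
    denominator by shift_exp(m - m_new) whenever a larger maximum appears, and
    divides only once at the end; the full attention matrix is never materialized.
    This mirrors the online-softmax recurrence, not the paper's exact kernel."""
    m = num = den = None
    for s, v in zip(score_tiles, value_tiles):          # s: (rows, tile), v: (tile, d)
        tile_max = s.max(axis=-1, keepdims=True)
        if m is None:
            m_new = tile_max
            num = np.zeros((s.shape[0], v.shape[-1]), dtype=np.int64)
            den = np.zeros((s.shape[0], 1), dtype=np.int64)
        else:
            m_new = np.maximum(m, tile_max)
            scale = vec_shift_exp(m - m_new)            # rescale old accumulators
            num = (num * scale) >> FRAC_BITS
            den = (den * scale) >> FRAC_BITS
        p = vec_shift_exp(s - m_new)                    # ~ exp(s - m_new), fixed point
        num += p @ v.astype(np.int64)                   # integer matmul with the value tile
        den += p.sum(axis=-1, keepdims=True)
        m = m_new
    return num // np.maximum(den, 1)                    # final per-row normalization

# Example usage with synthetic fixed-point scores and integer values:
rows, seq, d, tile = 4, 64, 16, 16
scores = np.random.randint(-4 * ONE, 0, size=(rows, seq))
values = np.random.randint(-8, 8, size=(seq, d))
out = int_online_attention(
    [scores[:, i:i + tile] for i in range(0, seq, tile)],
    [values[i:i + tile] for i in range(0, seq, tile)],
)

Up to the approximation error of shift_exp, this produces the same result as computing the full integer score matrix, applying softmax, and multiplying by the values, while only ever holding one tile of scores at a time.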
KSP Keywords
Exponential approximation, Floating-point operations, Memory Access, Memory bandwidth, On-chip memory, memory cost, numerical stability, off-chip, performance bottleneck