ETRI-Knowledge Sharing Platform


DAWN: Efficient Distribution of Attention Workload in PIM-Enabled Systems for LLM Inference
Cited 0 times in Scopus
Authors
Jaehoon Chung, Jinho Han, Young-Ho Gong, Sung Woo Chung
Issue Date
2026-02
Citation
IEEE Computer Architecture Letters, v.25, no.1
ISSN
1556-6056
Publisher
IEEE
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.1109/LCA.2026.3665202
Abstract
Recently, processing-in-memory (PIM) units have been deployed to accelerate matrix-vector multiplications in large language models (LLMs). However, due to the limited flexibility of PIMs, PIMs require a strict data layout for storing matrices in memory. As LLM inference operates autoregressively, new elements are appended to the stored matrices during inference, necessitating costly data layout reorganization. Nevertheless, since the conventional workload allocation method assigns entire matrices solely to PIMs, it causes data layout reorganization overhead (i.e., excessive memory writes). Furthermore, the significant variance in matrix sizes exacerbates PIM load imbalance. In this letter, we propose DAWN, a novel workload allocation method. DAWN divides matrices into equally sized chunks and employs a single chunk as the allocation unit. DAWN assigns a portion of chunks to traditional accelerators (e.g., neural processing units), which have no constraints on data layout for computation, to mitigate reorganization overhead. DAWN evenly distributes the remaining chunks across PIMs using a greedy approach to achieve PIM load balancing. Our simulation results show that DAWN improves throughput by up to 44.2% (34.8% on average) over the conventional workload allocation method.
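The allocation idea described in the abstract (fixed-size chunks, a share of chunks sent to a layout-agnostic accelerator, the rest spread greedily over PIMs) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not say how the accelerator's share is chosen or what greedy rule is used, so the `npu_fraction` parameter, the choice of the first chunks for the NPU, the least-loaded-first rule, and all function names below are assumptions made for illustration only.

```python
import heapq
from math import ceil


def split_into_chunks(matrix_rows, chunk_rows):
    """Return (matrix_id, chunk_index) pairs, one per fixed-size chunk.

    Assumption: a matrix whose row count is not a multiple of chunk_rows
    is padded, so its last chunk still counts as one full allocation unit.
    """
    chunks = []
    for matrix_id, rows in enumerate(matrix_rows):
        for chunk_index in range(ceil(rows / chunk_rows)):
            chunks.append((matrix_id, chunk_index))
    return chunks


def allocate(matrix_rows, chunk_rows, num_pims, npu_fraction):
    """Split chunks between one NPU and num_pims PIM units.

    Hypothetical policy: the first npu_fraction of all chunks goes to the
    NPU; each remaining chunk goes to the currently least-loaded PIM.
    """
    chunks = split_into_chunks(matrix_rows, chunk_rows)
    npu_count = int(len(chunks) * npu_fraction)
    npu_chunks = chunks[:npu_count]
    pim_chunks = chunks[npu_count:]

    # Min-heap of (load, pim_id); load is the number of chunks assigned.
    heap = [(0, pim_id) for pim_id in range(num_pims)]
    heapq.heapify(heap)
    assignment = {pim_id: [] for pim_id in range(num_pims)}
    for chunk in pim_chunks:
        load, pim_id = heapq.heappop(heap)
        assignment[pim_id].append(chunk)
        heapq.heappush(heap, (load + 1, pim_id))
    return npu_chunks, assignment


if __name__ == "__main__":
    # Three matrices of very different sizes (rows), 100-row chunks.
    npu, pims = allocate([100, 900, 300], chunk_rows=100,
                         num_pims=4, npu_fraction=0.25)
    print("NPU chunks:", npu)
    for pim_id, assigned in pims.items():
        print(f"PIM {pim_id}: {len(assigned)} chunks")
```

Because every chunk has the same size, counting chunks per PIM stands in for its load, which is why the per-PIM chunk counts in this sketch never differ by more than one regardless of how uneven the original matrix sizes are.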
Keyword
Processing-in-memory (PIM), self-attention
KSP Keywords
Allocation method, Allocation unit, Limited flexibility, Load Imbalance, Load balancing, Matrix-vector multiplications, Neural processing, Processing-in-memory, Workload Allocation, data layout reorganization, efficient distribution