ETRI Knowledge Sharing Platform : DAWN: Efficient Distribution of Attention Workload in PIM-Enabled Systems for LLM Inference

Journal Article DAWN: Efficient Distribution of Attention Workload in PIM-Enabled Systems for LLM Inference

Abstract: Processing-in-memory (PIM) units have recently been deployed to accelerate matrix-vector multiplications in large language models (LLMs). Because of their limited flexibility, however, PIMs require a strict data layout for the matrices stored in memory. LLM inference is autoregressive, so new elements are appended to the stored matrices during inference, which necessitates costly data layout reorganization. The conventional workload allocation method assigns entire matrices solely to PIMs and therefore incurs this reorganization overhead (i.e., excessive memory writes). Furthermore, the significant variance in matrix sizes exacerbates PIM load imbalance. In this letter, we propose DAWN, a novel workload allocation method. DAWN divides matrices into equally sized chunks and uses a single chunk as the allocation unit. To mitigate reorganization overhead, DAWN assigns a portion of the chunks to traditional accelerators (e.g., neural processing units), which place no constraints on data layout for computation. To balance PIM load, DAWN distributes the remaining chunks evenly across PIMs using a greedy approach. Our simulation results show that DAWN improves throughput by up to 44.2% (34.8% on average) over the conventional workload allocation method.
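
The chunk-based allocation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which chunks go to the accelerator or how the greedy step is ordered. The sketch assumes that the last chunk of each matrix (the part that grows as tokens are appended) goes to the NPU, and that each remaining chunk is handed to the currently least-loaded PIM. The function names and parameters (`split_into_chunks`, `allocate`, `chunk_rows`, `npu_chunks_per_matrix`) are hypothetical.

```python
import heapq
from math import ceil


def split_into_chunks(matrix_rows, chunk_rows):
    """Return how many equally sized chunks each matrix needs."""
    return [ceil(rows / chunk_rows) for rows in matrix_rows]


def allocate(matrix_rows, chunk_rows, num_pims, npu_chunks_per_matrix=1):
    """Assign chunks, identified as (matrix_index, chunk_index), to an NPU and to PIMs.

    Assumption (not stated in the abstract): the trailing
    `npu_chunks_per_matrix` chunks of every matrix go to the NPU,
    and all other chunks are balanced across the PIM units.
    """
    npu = []
    pim_chunks = []
    for m, n_chunks in enumerate(split_into_chunks(matrix_rows, chunk_rows)):
        cut = max(n_chunks - npu_chunks_per_matrix, 0)
        pim_chunks.extend((m, c) for c in range(cut))
        npu.extend((m, c) for c in range(cut, n_chunks))

    # Greedy balancing: give the next chunk to the least-loaded PIM.
    # Every chunk has the same size, so load is simply a chunk count.
    heap = [(0, p) for p in range(num_pims)]
    heapq.heapify(heap)
    pims = [[] for _ in range(num_pims)]
    for chunk in pim_chunks:
        load, p = heapq.heappop(heap)
        pims[p].append(chunk)
        heapq.heappush(heap, (load + 1, p))
    return npu, pims


# Three matrices of very different sizes, 100 rows per chunk, four PIM units.
npu, pims = allocate([100, 900, 300], chunk_rows=100, num_pims=4)
```

Because the allocation unit is a fixed-size chunk rather than a whole matrix, the per-PIM chunk counts in this sketch differ by at most one even though the matrices vary widely in size.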

KSP Keywords: Allocation method, Allocation unit, Data layout reorganization, Efficient distribution, Limited flexibility, Load balancing, Load imbalance, Matrix-vector multiplications, Neural processing, Processing-in-memory, Workload allocation
