ETRI Knowledge Sharing Platform


Trends in Disaggregated Prefill Technology for Improving LLM Inference Performance (LLM 추론 성능 향상을 위한 Disaggregated Prefill 기술 동향)
Authors
안후영, 윤정은, 이수광, 김경민, 이훈순, 박준혁, 안신영
Issue Date
2025-10
Citation
Electronics and Telecommunications Trends (전자통신동향분석), v.40, no.5, pp.1-12
ISSN
1225-6455
Publisher
Electronics and Telecommunications Research Institute (ETRI)
Language
Korean
Type
Journal Article
DOI
https://dx.doi.org/10.22648/ETRI.2025.J.400501
Abstract
Recent large language models (LLMs) require substantial computational and memory resources, and the tradeoff between latency and throughput becomes more pronounced as input/output sequence lengths and user concurrency increase. Traditional model-centric optimization techniques offer limited solutions to system-level bottlenecks. In particular, in multi-user environments, resource contention and interference between the prefill and decode stages can significantly degrade overall inference performance. This study highlights disaggregated prefill as a system-level approach to address these challenges. By decoupling the LLM inference pipeline into distinct prefill and decode stages and executing them independently on different accelerators, this technique reduces inference latency and maximizes resource utilization. We review state-of-the-art research from 2024 to 2025 and provide a system-level architectural perspective on how disaggregated prefill mitigates performance bottlenecks in LLM inference. The approach is analyzed through four key technical components: (1) prefill/decode instance configuration, (2) key-value cache transmission, (3) key-value cache management, and (4) request scheduling. Our analysis confirms that disaggregated prefill is particularly effective in improving large-scale LLM inference performance, especially in multi-user and heterogeneous hardware environments.
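The abstract describes decoupling the compute-bound prefill stage from the memory-bandwidth-bound decode stage and moving the key-value (KV) cache between them. The following minimal Python sketch illustrates that pipeline in-process only; every name here (prefill_worker, decode_worker, KVCache, the queues, and the toy arithmetic standing in for attention) is a hypothetical placeholder, not the paper's implementation or any framework's API.

```python
# Conceptual sketch of disaggregated prefill (hypothetical names throughout;
# not the paper's implementation or any real framework's API).
from dataclasses import dataclass
from queue import Queue
import threading

@dataclass
class KVCache:
    request_id: int
    keys: list    # per-"layer" key vectors (toy placeholders for GPU tensors)
    values: list  # per-"layer" value vectors

def prefill_worker(prompt_q: Queue, kv_q: Queue) -> None:
    """Prefill instance: processes the whole prompt in one compute-bound pass
    and hands the resulting KV cache to the decode instance."""
    while True:
        req = prompt_q.get()
        if req is None:              # shutdown signal
            kv_q.put(None)
            break
        request_id, prompt_tokens = req
        # Toy stand-in for attention over the full prompt.
        keys = [[t * 0.1 for t in prompt_tokens] for _ in range(2)]
        values = [[t * 0.2 for t in prompt_tokens] for _ in range(2)]
        # "KV-cache transmission": across accelerators this would use a
        # high-bandwidth interconnect; here it is just an in-process queue.
        kv_q.put((KVCache(request_id, keys, values), list(prompt_tokens)))

def decode_worker(kv_q: Queue, max_new_tokens: int = 4) -> None:
    """Decode instance: consumes transferred KV caches and generates tokens
    autoregressively, one step at a time (memory-bandwidth-bound)."""
    while True:
        item = kv_q.get()
        if item is None:
            break
        cache, tokens = item
        for _ in range(max_new_tokens):
            nxt = (sum(tokens) % 7) + 1          # toy next-token rule
            tokens.append(nxt)
            for layer_k, layer_v in zip(cache.keys, cache.values):
                layer_k.append(nxt * 0.1)        # cache grows each decode step
                layer_v.append(nxt * 0.2)
        print(f"request {cache.request_id}: generated {tokens}")

if __name__ == "__main__":
    prompt_q, kv_q = Queue(), Queue()
    threading.Thread(target=prefill_worker, args=(prompt_q, kv_q)).start()
    threading.Thread(target=decode_worker, args=(kv_q,)).start()
    prompt_q.put((0, [3, 1, 4]))
    prompt_q.put((1, [2, 7]))
    prompt_q.put(None)                           # drain and stop both workers
```

In a real deployment the two workers would run on separate accelerators and the in-process queue would be replaced by an interconnect transfer, which corresponds to the KV-cache transmission component identified in the abstract; the other components (instance configuration, cache management, and request scheduling) govern how such workers are provisioned and fed.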
Keyword
AI, Disaggregated Prefill, Heterogeneous Computing, Inference, LLM
KSP Keywords
Cache management, Heterogeneous hardware, Optimization techniques, Request Scheduling, Resource utilization, heterogeneous computing, key-value, language models, large-scale, memory resources, model-centric
This work is distributed under the terms of the Korea Open Government License (KOGL), Type 4 (Type 1 + Commercial Use Prohibition + Change Prohibition).