ETRI Knowledge Sharing Platform


Trends in Disaggregated Prefill Technology for Improving LLM Inference Performance (LLM 추론 성능 향상을 위한 Disaggregated Prefill 기술 동향)
Authors
안후영, 윤정은, 이수광, 김경민, 이훈순, 박준혁, 안신영
Issue Date
2025-10
Citation
Electronics and Telecommunications Trends (전자통신동향분석), v.40, no.5, pp.1-12
ISSN
1225-6455
Publisher
Electronics and Telecommunications Research Institute (ETRI)
Language
Korean
Type
Journal Article
DOI
https://dx.doi.org/10.22648/ETRI.2025.J.400501
Abstract
Recent large language models (LLMs) require substantial computational and memory resources, and the tradeoff between latency and throughput becomes more pronounced as input/output sequence lengths and user concurrency increase. Traditional model-centric optimization techniques offer limited solutions to system-level bottlenecks. In particular, in multi-user environments, resource contention and interference between the prefill and decode stages can significantly degrade overall inference performance. This study highlights disaggregated prefill as a system-level approach to address these challenges. By decoupling the LLM inference pipeline into distinct prefill and decode stages and executing them independently on different accelerators, this technique reduces inference latency and maximizes resource utilization. We review state-of-the-art research from 2024 to 2025 and provide a system-level architectural perspective on how disaggregated prefill mitigates performance bottlenecks in LLM inference. The approach is analyzed through four key technical components: (1) prefill/decode instance configuration, (2) key-value cache transmission, (3) key-value cache management, and (4) request scheduling. Our analysis confirms that disaggregated prefill is particularly effective in improving large-scale LLM inference performance, especially in multi-user and heterogeneous hardware environments.
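The abstract describes decoupling the compute-bound prefill stage from the memory-bandwidth-bound decode stage and moving the key-value (KV) cache between them. The following minimal Python sketch illustrates that pipeline in-process only; every name here (prefill_worker, decode_worker, KVCache, the queues, and the toy arithmetic standing in for attention) is a hypothetical placeholder, not the paper's implementation or any framework's API.

```python
# Conceptual sketch of disaggregated prefill (hypothetical names throughout;
# not the paper's implementation or any real framework's API).
from dataclasses import dataclass
from queue import Queue
import threading

@dataclass
class KVCache:
    request_id: int
    keys: list    # per-"layer" key vectors (toy placeholders for GPU tensors)
    values: list  # per-"layer" value vectors

def prefill_worker(prompt_q: Queue, kv_q: Queue) -> None:
    """Prefill instance: processes the whole prompt in one compute-bound pass
    and hands the resulting KV cache to the decode instance."""
    while True:
        req = prompt_q.get()
        if req is None:              # shutdown signal
            kv_q.put(None)
            break
        request_id, prompt_tokens = req
        # Toy stand-in for attention over the full prompt.
        keys = [[t * 0.1 for t in prompt_tokens] for _ in range(2)]
        values = [[t * 0.2 for t in prompt_tokens] for _ in range(2)]
        # "KV-cache transmission": across accelerators this would use a
        # high-bandwidth interconnect; here it is just an in-process queue.
        kv_q.put((KVCache(request_id, keys, values), list(prompt_tokens)))

def decode_worker(kv_q: Queue, max_new_tokens: int = 4) -> None:
    """Decode instance: consumes transferred KV caches and generates tokens
    autoregressively, one step at a time (memory-bandwidth-bound)."""
    while True:
        item = kv_q.get()
        if item is None:
            break
        cache, tokens = item
        for _ in range(max_new_tokens):
            nxt = (sum(tokens) % 7) + 1          # toy next-token rule
            tokens.append(nxt)
            for layer_k, layer_v in zip(cache.keys, cache.values):
                layer_k.append(nxt * 0.1)        # cache grows each decode step
                layer_v.append(nxt * 0.2)
        print(f"request {cache.request_id}: generated {tokens}")

if __name__ == "__main__":
    prompt_q, kv_q = Queue(), Queue()
    threading.Thread(target=prefill_worker, args=(prompt_q, kv_q)).start()
    threading.Thread(target=decode_worker, args=(kv_q,)).start()
    prompt_q.put((0, [3, 1, 4]))
    prompt_q.put((1, [2, 7]))
    prompt_q.put(None)                           # drain and stop both workers
```

In a real deployment the two workers would run on separate accelerators and the in-process queue would be replaced by an interconnect transfer, which corresponds to the KV-cache transmission component identified in the abstract; the other components (instance configuration, cache management, and request scheduling) govern how such workers are provisioned and fed.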
Keyword
AI, Disaggregated Prefill, Heterogeneous Computing, Inference, LLM
KSP Keywords
Cache management, Heterogeneous hardware, Optimization techniques, Request Scheduling, Resource utilization, heterogeneous computing, key-value, language models, large-scale, memory resources, model-centric
This work is distributed under the terms of the Korea Open Government License (KOGL), Type 4 (Type 1 + Commercial Use Prohibition + Change Prohibition).