ETRI Knowledge Sharing Platform


PF-GEMV: Utilization maximizing architecture in fast matrix–vector multiplication for GPT-2 inference
Cited 1 time in Scopus · Downloaded 93 times
Authors
Hyeji Kim, Yeongmin Lee, Chun-Gi Lyuh
Issue Date
2024-10
Citation
ETRI Journal, v.46, no.5, pp.817-828
ISSN
1225-6463
Publisher
Electronics and Telecommunications Research Institute (ETRI)
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.4218/etrij.2024-0111
Abstract
Owing to the widespread advancement of transformer-based artificial neural networks, artificial intelligence (AI) processors are now required to perform matrix–vector multiplication in addition to the conventional matrix–matrix multiplication. However, current AI processor architectures are optimized for general matrix–matrix multiplications (GEMMs), which causes significant throughput degradation when processing general matrix–vector multiplications (GEMVs). In this study, we propose a port-folding GEMV (PF-GEMV) scheme employing multiformat and low-precision techniques while reusing an outer-product-based processor optimized for conventional GEMM operations. This approach achieves 93.7% utilization in GEMV operations with an 8-bit format on an 8 × 8 processor, resulting in a 7.5× increase in throughput compared with that of the original scheme. Furthermore, when applied to the matrix operations of the GPT-2 large model, a 7× speedup is achieved in single-batch inference.
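For readers unfamiliar with the utilization gap the abstract describes, the NumPy sketch below models an outer-product processing-element (PE) array and counts active PEs per cycle. The 8 × 8 array size and the 93.7% / 7.5× figures come from the abstract; the functional model and the `utilization` helper are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

N = 8  # array size taken from the abstract (8 x 8 processor)

# Functional model of an outer-product GEMM pass: each "cycle" the
# array performs one rank-1 update, C += outer(A[:, k], B[k, :]).
A = np.random.randn(N, N)
B = np.random.randn(N, N)
C = np.zeros((N, N))
for k in range(N):                       # one K-slice per cycle
    C += np.outer(A[:, k], B[k, :])      # all N*N PEs busy for GEMM
assert np.allclose(C, A @ B)

def utilization(m_rows, n_cols):
    """Fraction of PEs doing useful work per cycle for one output tile
    (hypothetical model: an m_rows x n_cols tile mapped onto N x N PEs)."""
    return (min(m_rows, N) * min(n_cols, N)) / (N * N)

print(utilization(N, N))   # GEMM tile: 1.0   (64/64 PEs active)
print(utilization(N, 1))   # GEMV:      0.125 ( 8/64 PEs active)

# Port folding, conceptually: the input ports that would carry the
# missing matrix columns are re-purposed to stream extra slices of the
# single vector each cycle, re-engaging (nearly) all PEs on GEMV.
# Combined with 8-bit operands, the paper reports 93.7% utilization
# and a ~7.5x GEMV throughput gain on the 8 x 8 array.
```

The toy model makes the problem concrete: a GEMV presents only a single output column to a square PE array, so without some form of folding, only one column of PEs does useful work per cycle.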
KSP Keywords
Artificial Neural Network, Low-precision, Matrix Operation, Outer product, Processor architecture, artificial intelligence, matrix multiplication, neural network (NN), transformer-based
This work is distributed under the terms of the Korea Open Government License (KOGL) Type 4: Type 1 + Commercial Use Prohibition + Change Prohibition.