ETRI Knowledge Sharing Platform : Tensor slicing and optimization for multicore NPUs

BROWSE

Titles

논문 검색
Type		SCI
Year	~	Keyword

Detail

List

Journal Article Tensor slicing and optimization for multicore NPUs

Cited 5 time in scopus

Authors: Rafael Sousa, Marcio Pereira, Yongin Kwon, Taeho Kim, Namsoon Jung, Chang Soo Kim, Michael Frank, Guido Araujo

Issue Date: 2023-05

Citation: Journal of Parallel and Distributed Computing, v.175, pp.66-79

ISSN: 0743-7315

Publisher: Elsevier

Language: English

Type: Journal Article

DOI: https://dx.doi.org/10.1016/j.jpdc.2022.12.008

Abstract: Although code generation for Convolution Neural Network (CNN) models has been extensively studied, performing efficient data slicing and parallelization for highly-constrained Multicore Neural Processor Units (NPUs) is still a challenging problem. Given the size of convolutions' input/output tensors and the small footprint of NPU on-chip memories, minimizing memory transactions while maximizing parallelism and MAC utilization are central to any effective solution. This paper proposes a TensorFlow XLA/LLVM compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO), which: (a) maximizes convolution parallelism and memory usage across NPU cores; and (b) reduces data transfers between host and NPU on-chip memories by using DRAM memory burst time estimates to guide tensor slicing. To evaluate the proposed approach, a set of experiments was performed using the NeuroMorphic Processor (NMP), a multicore NPU containing 32 RISC-V cores extended with novel CNN instructions. Experimental results show that TSO is capable of identifying the best tensor slicing that minimizes execution time for a set of CNN models. Speed-ups of up to 21.7% result when comparing the TSO burst-based technique to a no-burst data slicing approach. To validate the generality of the TSO approach, the algorithm was also ported to the Glow Machine Learning framework. The performance of the models were measured on both Glow and TensorFlow XLA/LLVM compilers, revealing similar results.

KSP Keywords: Burst data, Burst time, Convolution neural network(CNN), Data transfer, LLVM compiler, Learning framework, Neural processor, RISC-V, code generation, compiler optimization, execution time

ETRI

218 Gajeong-ro, Yuseong-gu, Daejeon, 34129, KOREA, Contact: sh.kim@etri.re.kr

Please refrain from automatic collection of e-mail addresses posted on this homepage.

제1유형

ETRI-Knowledge Sharing Plaform

BROWSE

Titles

Detail

ETRI