ETRI-Knowledge Sharing Platform

Details

Conference Paper
ShmCaffe: A Distributed Deep Learning Platform with Shared Memory Buffer for HPC Architecture
Cited 14 times in Scopus. Downloaded 5 times.
Authors
안신영, 김중헌, 임은지, 최완, Aziz Mohaisen, 강성원
Publication Date
2018-07
Source
International Conference on Distributed Computing Systems (ICDCS) 2018, pp.1118-1128
DOI
https://dx.doi.org/10.1109/ICDCS.2018.00111
Funded Project
18HS2700, Development of an HPC System for High-Speed Processing of Large-Scale Deep Learning, 최완
Abstract
One reason for the tremendous recent success of deep learning theory and applications is the advance of distributed and parallel high performance computing (HPC). This paper proposes a new distributed deep learning platform, named ShmCaffe, which uses remote shared memory to reduce the communication overhead of sharing parameters in massive deep neural network training. ShmCaffe is built on Soft Memory Box (SMB), a virtual shared memory framework. In the SMB framework, the remote shared memory serves as a shared buffer for asynchronous massive parameter sharing among many distributed deep learning processes. A hybrid method that combines asynchronous and synchronous parameter sharing is also discussed as a way to improve scalability. As a result, ShmCaffe is 10.1 times faster than Caffe and 2.8 times faster than Caffe-MPI for deep neural network training when Inception-v1 is trained with 16 GPUs. We verify the convergence of Inception-v1 model training using ShmCaffe-A and ShmCaffe-H by varying the number of workers. Furthermore, we evaluate the scalability of ShmCaffe by analyzing the computation and communication times per iteration of deep learning training in four convolutional neural network (CNN) models.
Keywords
Deep learning, Distributed deep learning, Shared memory, ShmCaffe, Soft memory box
KSP Suggested Keywords
Communication overhead, Convolution neural network (CNN), Deep neural network (DNN), High performance computing, Learning platform, Learning training, Neural network training, Shared buffer, Virtual shared memory, Deep learning (DL), Hybrid method