ETRI-Knowledge Sharing Platform

EDDIS: Accelerating Distributed Data-Parallel DNN Training for Heterogeneous GPU Cluster
Authors
Shinyoung Ahn, Hooyoung Ahn, Hyeonseong Choi, Jaehyun Lee
Issue Date
2024-05
Citation
International Parallel and Distributed Processing Symposium (IPDPS) 2024, pp.1167-1168
Language
English
Type
Conference Paper
DOI
https://dx.doi.org/10.1109/IPDPSW63119.2024.00194
Abstract
EDDIS is a novel distributed deep learning library designed to efficiently utilize heterogeneous GPU resources for training deep neural networks (DNNs), addressing the scalability and communication challenges of distributed training environments. It offers three training modes (synchronous, asynchronous, and hybrid) and supports both TensorFlow and PyTorch. EDDIS significantly accelerates DNN training on heterogeneous GPUs, achieving up to 17.5x faster training with 16 nodes than with a single node. Notably, the hybrid training mode surpasses Horovod, training the ResNet50 model 2.53 times faster.
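
The EDDIS API itself is not shown on this page, so the sketch below instead illustrates the baseline technique the abstract builds on: synchronous data-parallel training, here using PyTorch's standard DistributedDataParallel (DDP). The model, dataset, hyperparameters, and launch command are illustrative assumptions, not part of EDDIS.

```python
# Minimal sketch of synchronous data-parallel DNN training with PyTorch
# DistributedDataParallel (DDP). This is NOT the EDDIS API; the toy model
# and synthetic dataset are placeholders for illustration only.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # Assumes launch via `torchrun --nproc_per_node=<gpus> train.py`,
    # which sets RANK, WORLD_SIZE, and LOCAL_RANK for each worker.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and synthetic data standing in for a real DNN/dataset.
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    ddp_model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    data = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
    # DistributedSampler shards the dataset so each rank sees a disjoint slice.
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            # backward() triggers an all-reduce that averages gradients
            # across all ranks, so every worker applies the same update
            # and training stays synchronous.
            loss.backward()
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Under this synchronous scheme, every rank stalls at the gradient all-reduce until the slowest worker finishes its step, which is precisely the bottleneck on heterogeneous GPUs that the asynchronous and hybrid modes described in the abstract are designed to mitigate.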
KSP Keywords
Data-parallel, Deep neural network(DNN), Distributed training, GPU Cluster, Hybrid training, communication challenges, deep learning(DL), distributed data, single node