ETRI-Knowledge Sharing Platform

EDDIS: Accelerating Distributed Data-Parallel DNN Training for Heterogeneous GPU Cluster
Authors
Shinyoung Ahn, Hooyoung Ahn, Hyeonseong Choi, Jaehyun Lee
Issue Date
2024-05
Citation
International Parallel and Distributed Processing Symposium (IPDPS) 2024, pp.1167-1168
Language
English
Type
Conference Paper
DOI
https://dx.doi.org/10.1109/IPDPSW63119.2024.00194
Abstract
EDDIS is a novel distributed deep learning library designed to efficiently utilize heterogeneous GPU resources for training deep neural networks (DNNs), addressing the scalability and communication challenges of distributed training environments. It offers three training modes (synchronous, asynchronous, and hybrid) and supports both TensorFlow and PyTorch. EDDIS significantly accelerates DNN training on heterogeneous GPUs, achieving up to 17.5x faster training with 16 nodes than with a single node. Notably, the hybrid training mode surpasses Horovod, training the ResNet50 model 2.53 times faster.
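
The EDDIS API itself is not shown on this page, so the sketch below instead illustrates the baseline technique the abstract builds on: synchronous data-parallel training, here using PyTorch's standard DistributedDataParallel (DDP). The model, dataset, hyperparameters, and launch command are illustrative assumptions, not part of EDDIS.

```python
# Minimal sketch of synchronous data-parallel DNN training with PyTorch
# DistributedDataParallel (DDP). This is NOT the EDDIS API; the toy model
# and synthetic dataset are placeholders for illustration only.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # Assumes launch via `torchrun --nproc_per_node=<gpus> train.py`,
    # which sets RANK, WORLD_SIZE, and LOCAL_RANK for each worker.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and synthetic data standing in for a real DNN/dataset.
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    ddp_model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    data = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
    # DistributedSampler shards the dataset so each rank sees a disjoint slice.
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            # backward() triggers an all-reduce that averages gradients
            # across all ranks, so every worker applies the same update
            # and training stays synchronous.
            loss.backward()
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Under this synchronous scheme, every rank stalls at the gradient all-reduce until the slowest worker finishes its step, which is precisely the bottleneck on heterogeneous GPUs that the asynchronous and hybrid modes described in the abstract are designed to mitigate.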
KSP Keywords
Data-parallel, Deep neural network(DNN), Distributed training, GPU Cluster, Hybrid training, communication challenges, deep learning(DL), distributed data, single node