ETRI Knowledge Sharing Platform

Efficient Data-parallel Distributed DNN Training for Big Dataset under Heterogeneous GPU Cluster
Authors: Shinyoung Ahn, Sookwang Lee, Hyeonseong Choi, Jaehyun Lee
Issue Date: 2024-12
Citation: International Conference on Big Data (Big Data) 2024, pp. 179-188
Language: English
Type: Conference Paper
DOI: https://dx.doi.org/10.1109/BigData62323.2024.10825722
Abstract
Training large-scale deep neural networks (DNNs) with a large number of parameters requires significant computational resources. Despite rapid advancements in GPU technology, limited budgets have forced many institutions to build up their GPU servers incrementally, leading to growing challenges with resource heterogeneity. However, most open-source distributed deep-learning libraries use synchronous training algorithms that perform better on homogeneous GPUs than on heterogeneous ones. As a result, many researchers have struggled to conduct distributed training efficiently on heterogeneous GPU clusters owing to the straggler problem. In this study, we introduce the Efficient Distributed Deep learning lIbrary based on SoftMemoryBox (EDDIS), a novel data-parallel distributed deep learning library. EDDIS overcomes the scalability limitations caused by heterogeneity, enabling efficient utilization of heterogeneous GPU resources. EDDIS trains DNNs synchronously, asynchronously, and in a hybrid manner, and supports both TensorFlow and PyTorch. In a heterogeneous GPU environment, EDDIS's three training modes - synchronous, asynchronous, and hybrid synchronous - accelerate distributed DNN training on 16 nodes by approximately 8.2x, 19x, and 18.7x, respectively, compared to a single node. In particular, the EDDIS hybrid synchronous training mode trains the YOLOv5m model 2.8 times faster than PyTorch DDP and 2.3 times faster than Horovod.
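
The synchronous baseline that the abstract compares against (PyTorch DDP) is the standard all-reduce data-parallel setup: every worker holds a full model replica, trains on its own data shard, and averages gradients with its peers at each step, so the slowest GPU paces the whole group, which is the straggler effect described above for heterogeneous clusters. As a point of reference only, and not the EDDIS API, the minimal sketch below shows that baseline in plain PyTorch; the model, dataset, and hyperparameters are placeholders.

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    def main():
        # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Toy model and synthetic data stand in for a real DNN and big dataset.
        model = DDP(nn.Linear(128, 10).cuda(local_rank), device_ids=[local_rank])
        dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
        sampler = DistributedSampler(dataset)    # shards the data across workers
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)

        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = nn.CrossEntropyLoss()

        for epoch in range(2):
            sampler.set_epoch(epoch)             # reshuffle the shards each epoch
            for x, y in loader:
                x, y = x.cuda(local_rank), y.cuda(local_rank)
                optimizer.zero_grad()
                loss_fn(model(x), y).backward()  # gradients are all-reduced across workers here
                optimizer.step()                 # every rank applies the same averaged update

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with torchrun (e.g., torchrun --nproc_per_node=4 train_ddp.py, where the script name is illustrative), each process trains on its own shard and waits for the gradient all-reduce at every step; this lock-step behavior on mixed GPUs is what EDDIS's asynchronous and hybrid synchronous modes aim to avoid.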
KSP Keywords
Big Dataset, Data-parallel, Deep neural network (DNN), Distributed training, Efficient Utilization, GPU Cluster, Resource heterogeneity, Straggler problem, Training algorithms, computational resources, deep learning (DL)