ETRI Knowledge Sharing Platform

Efficient Data-parallel Distributed DNN Training for Big Dataset under Heterogeneous GPU Cluster
Authors: Shinyoung Ahn, Sookwang Lee, Hyeonseong Choi, Jaehyun Lee
Issue Date: 2024-12
Citation: International Conference on Big Data (Big Data) 2024, pp. 179-188
Language: English
Type: Conference Paper
DOI: https://dx.doi.org/10.1109/BigData62323.2024.10825722
Abstract
Training large-scale deep neural networks (DNNs) with a large number of parameters requires significant computational resources. Despite rapid advancements in GPU technology, limited budgets have forced many institutions to build up their GPU servers incrementally, leading to growing challenges with resource heterogeneity. However, most open-source distributed deep-learning libraries use synchronous training algorithms that perform better on homogeneous GPUs than on heterogeneous ones. As a result, many researchers have struggled to conduct distributed training efficiently on heterogeneous GPU clusters owing to the straggler problem. In this study, we introduce the Efficient Distributed Deep learning lIbrary based on SoftMemoryBox (EDDIS), a novel data-parallel distributed deep learning library. EDDIS overcomes the scalability limitations caused by heterogeneity, enabling efficient utilization of heterogeneous GPU resources. EDDIS trains DNNs synchronously, asynchronously, and in a hybrid manner, and supports both TensorFlow and PyTorch. In a heterogeneous GPU environment, EDDIS's three training modes - synchronous, asynchronous, and hybrid synchronous - accelerate distributed DNN training on 16 nodes by approximately 8.2x, 19x, and 18.7x, respectively, compared to a single node. In particular, the EDDIS hybrid synchronous training mode trains the YOLOv5m model 2.8 times faster than PyTorch DDP and 2.3 times faster than Horovod.
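
The synchronous baseline that the abstract compares against (PyTorch DDP) is the standard all-reduce data-parallel setup: every worker holds a full model replica, trains on its own data shard, and averages gradients with its peers at each step, so the slowest GPU paces the whole group, which is the straggler effect described above for heterogeneous clusters. As a point of reference only, and not the EDDIS API, the minimal sketch below shows that baseline in plain PyTorch; the model, dataset, and hyperparameters are placeholders.

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    def main():
        # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Toy model and synthetic data stand in for a real DNN and big dataset.
        model = DDP(nn.Linear(128, 10).cuda(local_rank), device_ids=[local_rank])
        dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
        sampler = DistributedSampler(dataset)    # shards the data across workers
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)

        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = nn.CrossEntropyLoss()

        for epoch in range(2):
            sampler.set_epoch(epoch)             # reshuffle the shards each epoch
            for x, y in loader:
                x, y = x.cuda(local_rank), y.cuda(local_rank)
                optimizer.zero_grad()
                loss_fn(model(x), y).backward()  # gradients are all-reduced across workers here
                optimizer.step()                 # every rank applies the same averaged update

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with torchrun (e.g., torchrun --nproc_per_node=4 train_ddp.py, where the script name is illustrative), each process trains on its own shard and waits for the gradient all-reduce at every step; this lock-step behavior on mixed GPUs is what EDDIS's asynchronous and hybrid synchronous modes aim to avoid.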
KSP Keywords
Big Dataset, Data-parallel, Deep neural network (DNN), Distributed training, Efficient Utilization, GPU Cluster, Resource heterogeneity, Straggler problem, Training algorithms, computational resources, deep learning (DL)