ETRI Knowledge Sharing Platform

Optimizing Multi-Level Checkpointing for Distributed Deep Learning Workloads on Cloud Spot VM Clusters
Authors
Yonghyeon Cho, Yoochan Kim, Kihyun Kim, Jinwoo Kim, Hong-Yeon Kim, Youngjae Kim
Issue Date
2024-09
Citation
IEEE Access, v.12, pp.116891-116904
ISSN
2169-3536
Publisher
Institute of Electrical and Electronics Engineers Inc.
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.1109/ACCESS.2024.3446770
Abstract
Spot Virtual Machines (Spot VMs) offer access to underutilized computing resources at significant discounts, sometimes up to 90% off regular on-demand pricing. For budget-conscious organizations, using clusters of Spot VMs is an effective strategy for training large-scale distributed deep learning (DDL) models. However, the risk of preemption by cloud providers poses a challenge, as it can result in the loss of unsaved data in memory and local storage. To mitigate this risk, one solution involves using networked storage systems for checkpoints, though their low write throughput can slow down training. An alternative approach is to use the memory of a remote, on-demand computing node for temporary checkpoint storage, balancing data protection with training efficiency. In this paper, we propose a novel approach, ACUTE, to optimize temporary checkpointing in the memory of on-demand nodes during DDL training. ACUTE includes three key optimizations: 1) Check-Mem, which reduces memory copying overhead on the training node; 2) Check-Trans, which accelerates checkpoint data transfer through parallel processing; and 3) Check-Pack, which eliminates unnecessary data unpacking and repacking. Implemented using PyTorch's distributed data-parallel library, ACUTE was evaluated against two other checkpointing schemes on AWS VM instances. Results show that ACUTE reduces makespan delay to nearly zero and achieves, on average, 43.30% faster checkpointing compared to a baseline multi-level checkpointing scheme, without compromising the precision of Deep Neural Network (DNN) models.
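
The three optimizations named in the abstract follow a general asynchronous-checkpointing pattern: copy state off the GPU quickly, then transfer it off the training node in the background and in parallel, without repeatedly repacking the data. The sketch below is a minimal illustration of that pattern in PyTorch, not the paper's ACUTE implementation; snapshot_to_cpu, send_shards_parallel, and the send_fn transport callback are hypothetical names introduced here for illustration.

    import io
    import threading
    from concurrent.futures import ThreadPoolExecutor

    import torch
    import torch.nn as nn


    def snapshot_to_cpu(model: nn.Module) -> dict:
        """Clone model state to CPU memory so training can resume immediately.

        Mirrors the idea behind Check-Mem: the device-to-host copy is the
        only work on the training loop's critical path; everything after
        runs in the background.
        """
        return {name: t.detach().to("cpu", non_blocking=True).clone()
                for name, t in model.state_dict().items()}


    def send_shards_parallel(state: dict, send_fn, num_workers: int = 4) -> None:
        """Serialize and ship the snapshot in parallel chunks (Check-Trans idea).

        send_fn(name, payload) is a hypothetical transport callback, e.g. an
        RPC push to an on-demand node that holds checkpoints in memory. Each
        shard is serialized once and sent as-is, avoiding the unpack/repack
        round trip that Check-Pack eliminates.
        """
        def ship(item):
            name, tensor = item
            buf = io.BytesIO()
            torch.save(tensor, buf)
            send_fn(name, buf.getvalue())

        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            list(pool.map(ship, state.items()))


    def checkpoint_async(model: nn.Module, send_fn) -> threading.Thread:
        """Take a snapshot, then transfer it off the node in the background."""
        state = snapshot_to_cpu(model)  # brief pause in the training loop
        t = threading.Thread(target=send_shards_parallel, args=(state, send_fn))
        t.start()                       # training resumes while the transfer runs
        return t

In a real Spot VM deployment, send_fn would push each shard over the network to the remote on-demand node; for local testing it could simply store the payloads in a dictionary. The paper's actual mechanisms operate inside PyTorch's distributed data-parallel stack and are more involved than this sketch.
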
KSP Keywords
Balancing data, Cloud providers, Computing Node, Computing resources, Data protection, Data transfer, Data-parallel, Deep neural network(DNN), Multi-level, Novel approach, On-demand computing
This work is distributed under the terms of the Creative Commons License (CC BY-NC-ND).