ETRI Knowledge Sharing Platform

Optimizing Multi-Level Checkpointing for Distributed Deep Learning Workloads on Cloud Spot VM Clusters
Authors
Yonghyeon Cho, Yoochan Kim, Kihyun Kim, Jinwoo Kim, Hong-Yeon Kim, Youngjae Kim
Issue Date
2024-09
Citation
IEEE Access, v.12, pp.116891-116904
ISSN
2169-3536
Publisher
Institute of Electrical and Electronics Engineers Inc.
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.1109/ACCESS.2024.3446770
Abstract
Spot Virtual Machines (Spot VMs) offer access to underutilized computing resources at significant discounts, sometimes up to 90% off regular on-demand pricing. For budget-conscious organizations, using clusters of Spot VMs is an effective strategy for training large-scale distributed deep learning (DDL) models. However, the risk of preemption by cloud providers poses a challenge, as it can result in the loss of unsaved data in memory and local storage. To mitigate this risk, one solution involves using networked storage systems for checkpoints, though their low write throughput can slow down training. An alternative approach is to use the memory of a remote, on-demand computing node for temporary checkpoint storage, balancing data protection with training efficiency. In this paper, we propose a novel approach, ACUTE, to optimize temporary checkpointing in the memory of on-demand nodes during DDL training. ACUTE includes three key optimizations: 1) Check-Mem, which reduces memory copying overhead on the training node; 2) Check-Trans, which accelerates checkpoint data transfer through parallel processing; and 3) Check-Pack, which eliminates unnecessary data unpacking and repacking. Implemented using PyTorch's distributed data-parallel library, ACUTE was evaluated against two other checkpointing schemes on AWS VM instances. Results show that ACUTE reduces makespan delay to nearly zero and achieves, on average, 43.30% faster checkpointing compared to a baseline multi-level checkpointing scheme, without compromising the precision of Deep Neural Network (DNN) models.
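
The three optimizations named in the abstract follow a general asynchronous-checkpointing pattern: copy state off the GPU quickly, then transfer it off the training node in the background and in parallel, without repeatedly repacking the data. The sketch below is a minimal illustration of that pattern in PyTorch, not the paper's ACUTE implementation; snapshot_to_cpu, send_shards_parallel, and the send_fn transport callback are hypothetical names introduced here for illustration.

    import io
    import threading
    from concurrent.futures import ThreadPoolExecutor

    import torch
    import torch.nn as nn


    def snapshot_to_cpu(model: nn.Module) -> dict:
        """Clone model state to CPU memory so training can resume immediately.

        Mirrors the idea behind Check-Mem: the device-to-host copy is the
        only work on the training loop's critical path; everything after
        runs in the background.
        """
        return {name: t.detach().to("cpu", non_blocking=True).clone()
                for name, t in model.state_dict().items()}


    def send_shards_parallel(state: dict, send_fn, num_workers: int = 4) -> None:
        """Serialize and ship the snapshot in parallel chunks (Check-Trans idea).

        send_fn(name, payload) is a hypothetical transport callback, e.g. an
        RPC push to an on-demand node that holds checkpoints in memory. Each
        shard is serialized once and sent as-is, avoiding the unpack/repack
        round trip that Check-Pack eliminates.
        """
        def ship(item):
            name, tensor = item
            buf = io.BytesIO()
            torch.save(tensor, buf)
            send_fn(name, buf.getvalue())

        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            list(pool.map(ship, state.items()))


    def checkpoint_async(model: nn.Module, send_fn) -> threading.Thread:
        """Take a snapshot, then transfer it off the node in the background."""
        state = snapshot_to_cpu(model)  # brief pause in the training loop
        t = threading.Thread(target=send_shards_parallel, args=(state, send_fn))
        t.start()                       # training resumes while the transfer runs
        return t

In a real Spot VM deployment, send_fn would push each shard over the network to the remote on-demand node; for local testing it could simply store the payloads in a dictionary. The paper's actual mechanisms operate inside PyTorch's distributed data-parallel stack and are more involved than this sketch.
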
KSP Keywords
Balancing data, Cloud providers, Computing Node, Computing resources, Data protection, Data transfer, Data-parallel, Deep neural network(DNN), Multi-level, Novel approach, On-demand computing
This work is distributed under the terms of the Creative Commons License (CC BY-NC-ND).