ETRI Knowledge Sharing Platform


Title
Malware detection using pre-trained transformer encoder with byte sequences
Authors
Eun-Jin Kim, Yun-Kyung Lee, Sang-Min Lee, Jeong-Nyeo Kim, Ah Reum Kang, Mi-seo Kim, Young-Seob Jeong
Issue Date
2025-10
Citation
PLoS ONE, v.20, no.10, pp.1-16
ISSN
1932-6203
Publisher
Public Library of Science
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.1371/journal.pone.0332307
Abstract
Ordinary users encounter various documents on the network every day, such as news articles, emails, and messages, and most users are vulnerable to malicious attacks. Malicious attack methods continue to evolve, making neural network-based malware detection increasingly appealing to both academia and industry. Recent studies have leveraged byte sequences within files to detect malicious activities, primarily using convolutional neural networks to capture local patterns in the byte sequences. Meanwhile, in natural language processing, Transformer-based language models have demonstrated superior performance across various tasks and have been applied to other domains, such as image analysis and speech recognition. In this paper, we introduce a novel Transformer-based language model for malware detection that processes byte sequences as input. We propose two new pre-training strategies: real-or-fake prediction and same-sequence prediction. Together with conventional pre-training strategies such as masked language modeling and next-sentence prediction, we explore all possible combinations of these approaches. By compiling existing byte sequences for malware detection, we construct a benchmark consisting of three file types (PDF, HWP, and MS Office) for pre-training and fine-tuning. Our empirical results demonstrate that our language model outperforms convolutional neural networks in the malware detection task, achieving a macro F1 score improvement of approximately 2.7%p to 11.1%p. We believe our language model will serve as a foundation model for malware detection services, and we will extend our research to develop a more powerful encoder-based model that can process longer byte sequences.
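The abstract's core idea — feeding raw file bytes to a Transformer encoder pre-trained with masked language modeling — can be sketched as below. This is a minimal illustration, not the paper's implementation: the special-token IDs, the 15% masking rate, and the BERT-style `-100` ignore-label convention are assumptions borrowed from common practice.

```python
import random

# Special token IDs placed after the 256 possible byte values
# (illustrative convention, not taken from the paper): [PAD], [CLS], [MASK].
PAD_ID, CLS_ID, MASK_ID = 256, 257, 258
VOCAB_SIZE = 259

def encode_bytes(data, max_len=16):
    """Map raw file bytes to token IDs: [CLS] + byte values, padded/truncated."""
    ids = [CLS_ID] + list(data[: max_len - 1])
    ids += [PAD_ID] * (max_len - len(ids))
    return ids

def mask_for_mlm(ids, mask_prob=0.15, rng=None):
    """BERT-style masked-language-modeling corruption: replace roughly
    mask_prob of the byte tokens with [MASK]; labels store the original
    byte at masked positions and -100 (ignored in the loss) elsewhere."""
    rng = rng or random.Random(0)
    inputs, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        if tok < 256 and rng.random() < mask_prob:  # never mask [CLS]/[PAD]
            labels[i] = tok
            inputs[i] = MASK_ID
    return inputs, labels

# A PDF file's leading bytes, tokenized and corrupted for pre-training:
sample = encode_bytes(b"%PDF-1.7\x00\x01")
masked, labels = mask_for_mlm(sample)
```

During pre-training, the encoder would be trained to recover the original byte at each masked position; for fine-tuning on malware detection, the `[CLS]` position's representation would instead feed a benign/malicious classifier head.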
KSP Keywords
Convolution neural network(CNN), Detection task, Fine-tuning, Image analysis, Language modeling, Local pattern, Malicious Activity, Malware detection, Natural Language Processing(NLP), Network-based, News articles
This work is distributed under the terms of the Creative Commons License (CC BY).