ETRI Knowledge Sharing Platform


Title
Malware detection using pre-trained transformer encoder with byte sequences
Authors
Eun-Jin Kim, Yun-Kyung Lee, Sang-Min Lee, Jeong-Nyeo Kim, Ah Reum Kang, Mi-seo Kim, Young-Seob Jeong
Issue Date
2025-10
Citation
PLoS ONE, v.20, no.10, pp.1-16
ISSN
1932-6203
Publisher
Public Library of Science
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.1371/journal.pone.0332307
Abstract
Ordinary users encounter various documents on the network every day, such as news articles, emails, and messages, and most users are vulnerable to malicious attacks. Malicious attack methods continue to evolve, making neural network-based malware detection increasingly appealing to both academia and industry. Recent studies have leveraged byte sequences within files to detect malicious activities, primarily using convolutional neural networks to capture local patterns in the byte sequences. Meanwhile, in natural language processing, Transformer-based language models have demonstrated superior performance across various tasks and have been applied to other domains, such as image analysis and speech recognition. In this paper, we introduce a novel Transformer-based language model for malware detection that processes byte sequences as input. We propose two new pre-training strategies: real-or-fake prediction and same-sequence prediction. Together with conventional pre-training strategies such as masked language modeling and next-sentence prediction, we explore all possible combinations of these approaches. By compiling existing byte sequences for malware detection, we construct a benchmark consisting of three file types (PDF, HWP, and MS Office) for pre-training and fine-tuning. Our empirical results demonstrate that our language model outperforms convolutional neural networks in the malware detection task, achieving a macro F1 score improvement of approximately 2.7%p to 11.1%p. We believe our language model will serve as a foundation model for malware detection services, and we will extend our research to develop a more powerful encoder-based model that can process longer byte sequences.
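The abstract's core idea — feeding raw file bytes to a Transformer encoder pre-trained with masked language modeling — can be sketched as below. This is a minimal illustration, not the paper's implementation: the special-token IDs, the 15% masking rate, and the BERT-style `-100` ignore-label convention are assumptions borrowed from common practice.

```python
import random

# Special token IDs placed after the 256 possible byte values
# (illustrative convention, not taken from the paper): [PAD], [CLS], [MASK].
PAD_ID, CLS_ID, MASK_ID = 256, 257, 258
VOCAB_SIZE = 259

def encode_bytes(data, max_len=16):
    """Map raw file bytes to token IDs: [CLS] + byte values, padded/truncated."""
    ids = [CLS_ID] + list(data[: max_len - 1])
    ids += [PAD_ID] * (max_len - len(ids))
    return ids

def mask_for_mlm(ids, mask_prob=0.15, rng=None):
    """BERT-style masked-language-modeling corruption: replace roughly
    mask_prob of the byte tokens with [MASK]; labels store the original
    byte at masked positions and -100 (ignored in the loss) elsewhere."""
    rng = rng or random.Random(0)
    inputs, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        if tok < 256 and rng.random() < mask_prob:  # never mask [CLS]/[PAD]
            labels[i] = tok
            inputs[i] = MASK_ID
    return inputs, labels

# A PDF file's leading bytes, tokenized and corrupted for pre-training:
sample = encode_bytes(b"%PDF-1.7\x00\x01")
masked, labels = mask_for_mlm(sample)
```

During pre-training, the encoder would be trained to recover the original byte at each masked position; for fine-tuning on malware detection, the `[CLS]` position's representation would instead feed a benign/malicious classifier head.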
KSP Keywords
Convolution neural network(CNN), Detection task, Fine-tuning, Image analysis, Language modeling, Local pattern, Malicious Activity, Malware detection, Natural Language Processing(NLP), Network-based, News articles
This work is distributed under the terms of the Creative Commons License (CC BY).