ETRI Knowledge Sharing Platform

Building Robust Korean Speech Recognition Model by Fine-tuning Large Pretrained Model
Authors
Changhan Oh, Cheongbin Kim, Kiyoung Park
Issue Date
2023-09
Citation
말소리와 음성과학 (Phonetics and Speech Sciences), v.15, no.3, pp.75-82
ISSN
2005-8063
Publisher
한국음성학회 (The Korean Society of Speech Sciences)
Language
Korean
Type
Journal Article
DOI
https://dx.doi.org/10.13064/KSSS.2023.15.3.075
Abstract
Automatic speech recognition (ASR) has been revolutionized by deep learning-based approaches, among which self-supervised learning methods have proven particularly effective. In this study, we aim to enhance the performance of OpenAI's Whisper model, a multilingual ASR system, on the Korean language. Whisper was pretrained on a large corpus (around 680,000 hours) of web speech data and has demonstrated strong recognition performance for major languages. However, it faces challenges with languages such as Korean, which was not a major language in its training data. We address this issue by fine-tuning the Whisper model on an additional dataset comprising about 1,000 hours of Korean speech. We also compare its performance against a Transformer model trained from scratch on the same dataset. Our results indicate that fine-tuning significantly improved the Whisper model's Korean speech recognition in terms of character error rate (CER), and that performance improved with increasing model size. However, the Whisper model's English performance deteriorated after fine-tuning, underscoring the need for further research toward robust multilingual models. Our study demonstrates the potential of a fine-tuned Whisper model for Korean ASR applications. Future work will focus on multilingual recognition and optimization for real-time inference.
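The abstract's evaluation metric is character error rate (CER), the standard metric for Korean ASR since Korean text does not segment cleanly into words. As a minimal sketch (not the authors' evaluation code), CER is the character-level Levenshtein edit distance between a hypothesis and a reference transcript, normalized by the reference length:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level Levenshtein distance
    (substitutions + insertions + deletions), divided by the number
    of characters in the reference."""
    ref, hyp = list(reference), list(hypothesis)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[-1][-1] / len(ref)

# One substituted syllable in a 5-character reference -> CER 0.2
print(cer("안녕하세요", "안녕하세오"))  # 0.2
```

In practice a library such as jiwer is typically used for this computation; the sketch above shows only the underlying definition.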
KSP Keywords
Fine-tuning, Korean language, Korean speech, Learning methods, Learning-based, Performance improved, Real-time inference, Recognition model, Recognition performance, Web Speech, automatic speech recognition(ASR)
This work is distributed under the terms of the Creative Commons License (CCL) CC BY-NC.