ETRI-Knowledge Sharing Platform


Journal Article KMSAV: Korean multi‐speaker spontaneous audiovisual dataset
Cited 1 time in Scopus. Downloaded 129 times.
Authors
Kiyoung Park, Changhan Oh, Sunghee Dong
Issue Date
2024-02
Citation
ETRI Journal, v.46, no.1, pp.71-81
ISSN
1225-6463
Publisher
Electronics and Telecommunications Research Institute (ETRI)
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.4218/etrij.2023-0352
Abstract
Recent advances in deep learning for speech and visual recognition have accelerated the development of multimodal speech recognition, yielding many innovative results. We introduce a Korean audiovisual speech recognition corpus. This dataset comprises approximately 150 h of manually transcribed and annotated audiovisual data, supplemented with an additional 2000 h of untranscribed videos collected from YouTube under the Creative Commons License. The dataset is intended to be freely accessible for unrestricted research purposes. Along with the corpus, we propose an open-source framework for automatic speech recognition (ASR) and audiovisual speech recognition (AVSR). We validate the effectiveness of the corpus with evaluations using state-of-the-art ASR and AVSR techniques, capitalizing on both pretrained models and fine-tuning processes. After fine-tuning, ASR and AVSR achieve character error rates of 11.1% and 18.9%, respectively. This error difference highlights the need for improvement in AVSR techniques. We expect that our corpus will be an instrumental resource to support improvements in AVSR.
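The character error rate (CER) quoted in the abstract is conventionally the character-level edit distance between a reference transcript and a recognizer's hypothesis, divided by the reference length. The following is a minimal sketch of that conventional definition, not the authors' evaluation code. The function names, the NFC normalization step (so that each precomposed Hangul syllable counts as one character), and the `ignore_spaces` option are all assumptions made for illustration; the record does not say how the paper handles spacing or normalization.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, ignore_spaces: bool = True) -> float:
    """Character error rate: edit distance divided by reference length.

    Assumption: text is NFC-normalized and, by default, spaces are dropped.
    """
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    if ignore_spaces:
        ref = ref.replace(" ", "")
        hyp = hyp.replace(" ", "")
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a reported CER of 11.1% corresponds to roughly 11 character edits per 100 reference characters. Because insertions count as errors, the value can exceed 1.0 for hypotheses much longer than the reference.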
KSP Keywords
Audiovisual dataset, Audiovisual speech recognition, Creative Commons, Fine-tuning, Visual recognition, Automatic speech recognition (ASR), Deep learning (DL), Error rate, Need for, Open source, State-of-the-art
This work is distributed under the terms of the Korea Open Government License (KOGL)
(Type 4: Type 1 + Commercial Use Prohibition + Change Prohibition)