ETRI Knowledge Sharing Platform : KMSAV: Korean multi‐speaker spontaneous audiovisual dataset

Journal Article KMSAV: Korean multi‐speaker spontaneous audiovisual dataset

Cited 3 times in Scopus

Downloaded 1681 times

Abstract: Recent advances in deep learning for speech and visual recognition have accelerated the development of multimodal speech recognition, yielding many innovative results. We introduce a Korean audiovisual speech recognition corpus. This dataset comprises approximately 150 h of manually transcribed and annotated audiovisual data, supplemented with an additional 2000 h of untranscribed videos collected from YouTube under the Creative Commons License. The dataset is intended to be freely accessible for unrestricted research purposes. Along with the corpus, we propose an open-source framework for automatic speech recognition (ASR) and audiovisual speech recognition (AVSR). We validate the effectiveness of the corpus with evaluations using state-of-the-art ASR and AVSR techniques, capitalizing on both pretrained models and fine-tuning processes. After fine-tuning, ASR and AVSR achieve character error rates of 11.1% and 18.9%, respectively. This error difference highlights the need for improvement in AVSR techniques. We expect that our corpus will be an instrumental resource to support improvements in AVSR.
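The abstract reports results as character error rates (CER). As a minimal sketch of how such a figure is typically computed, the Python below divides the character-level Levenshtein distance between a reference transcript and a recognizer hypothesis by the reference length. The function names `edit_distance` and `cer`, the `ignore_spaces` option, and the NFC normalization step are illustrative assumptions; this record does not state the exact scoring convention used for KMSAV (for example, whether spaces are counted or whether Hangul is scored per syllable or per jamo).

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference character missing from hypothesis
                curr[j - 1] + 1,     # extra character in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, ignore_spaces: bool = True) -> float:
    """Character error rate: edit distance divided by reference length.

    Assumption: text is NFC-normalized so each precomposed Hangul syllable
    counts as one character; spaces are dropped unless ignore_spaces=False.
    """
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    if ignore_spaces:
        ref = "".join(ref.split())
        hyp = "".join(hyp.split())
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under these assumptions, a five-syllable reference with one substituted syllable scores a CER of 0.2. Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis contains many extra characters.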

KSP Keywords: Audiovisual dataset, Audiovisual speech recognition, Creative Commons, Fine-tuning, Visual recognition, Automatic speech recognition (ASR), Deep learning (DL), Error rate, Need for, Open source, State-of-the-art

This work is distributed under the terms of the Korea Open Government License (KOGL), Type 4 (Type 1 + Commercial Use Prohibition + Change Prohibition).

218 Gajeong-ro, Yuseong-gu, Daejeon, 34129, KOREA, Contact: sh.kim@etri.re.kr
