ETRI Knowledge Sharing Platform : English-Korean speech translation corpus (EnKoST-C): Construction procedure and evaluation results

Journal Article: English-Korean speech translation corpus (EnKoST-C): Construction procedure and evaluation results

Cited 9 times in Scopus

Downloaded 2801 times

Abstract: We present an English-Korean speech translation corpus named EnKoST-C. End-to-end model training for speech translation often suffers from a lack of parallel data, such as speech in the source language paired with equivalent text in the target language. Most publicly available speech translation corpora were developed for European languages, and there is currently no public corpus for English-Korean end-to-end speech translation. We therefore built EnKoST-C from TED Talks. In the process, we improved the sentence alignment approach by using subtitle timing information together with bilingual sentence embeddings. The result is a 559-hour English-Korean speech translation corpus. The proposed sentence alignment approach achieved an F-measure of 0.96. We also report the baseline performance of an English-Korean speech translation model trained on EnKoST-C. EnKoST-C is freely available on a Korean government open data hub site.
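
As a rough illustration of the two signals the abstract names (subtitle timing and bilingual sentence embeddings) and of the F-measure used to evaluate alignment, here is a minimal sketch. It is not the authors' method: the weighted-sum scoring, the `weight` parameter, the dictionary keys `span` and `vec`, and all function names are hypothetical, and the embeddings are assumed to come from some bilingual sentence encoder that is not shown.

```python
import math


def time_overlap(a, b):
    """Intersection-over-union of two (start, end) subtitle intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity between two sentence embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align_score(en, ko, weight=0.5):
    """Hypothetical score: weighted mix of timing overlap and embedding similarity."""
    timing = time_overlap(en["span"], ko["span"])
    semantic = cosine(en["vec"], ko["vec"])
    return weight * timing + (1 - weight) * semantic


def f_measure(predicted, gold):
    """Balanced F-measure over sets of (english_index, korean_index) alignment pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

In a sketch like this, candidate English and Korean sentence pairs would be ranked by `align_score`, and the chosen pairs compared against a hand-aligned reference set with `f_measure`.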

KSP Keywords: Construction procedure, End to End(E2E), European languages, F-measure, Freely available, Government Open Data, Korean speech, Parallel data, Source language, Speech translation, TED talks

This work is distributed under the terms of the Korea Open Government License (KOGL)
(Type 4: Type 1 + Commercial Use Prohibition + Change Prohibition)

218 Gajeong-ro, Yuseong-gu, Daejeon, 34129, KOREA, Contact: sh.kim@etri.re.kr
