ETRI-Knowledge Sharing Platform

Details

Journal article: English-Korean speech translation corpus (EnKoST-C): Construction procedure and evaluation results
Cited 3 times in Scopus. Downloaded 125 times.
Authors
방정욱, 맹준규, 박준, 윤승, 김상훈
Publication date
February 2023
Source
ETRI Journal, v.45 no.1, pp.18-27
ISSN
1225-6463
Publisher
Electronics and Telecommunications Research Institute (ETRI)
DOI
https://dx.doi.org/10.4218/etrij.2021-0336
Funded project
21ZS1100, Core technology research for self-improving integrated artificial intelligence, 송화전
Abstract
We present an English-Korean speech translation corpus named EnKoST-C. End-to-end model training for speech translation tasks often suffers from a lack of parallel data, that is, speech data in the source language paired with equivalent text data in the target language. Most publicly available speech translation corpora were developed for European languages, and there is currently no public corpus for English-Korean end-to-end speech translation. We therefore created EnKoST-C, centered on TED Talks. In this process, we enhanced the sentence alignment approach using subtitle time information and bilingual sentence embedding information. As a result, we built a 559-h English-Korean speech translation corpus. The proposed sentence alignment approach achieved an F-measure of 0.96. We also report the baseline performance of an English-Korean speech translation model trained with EnKoST-C. EnKoST-C is freely available on a Korean government open data hub site.
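For readers unfamiliar with the F-measure quoted in the abstract, the following is a minimal sketch of how such a score is commonly computed for sentence alignment, assuming alignments are represented as (source index, target index) pairs and compared against a hand-checked reference. The function name and the pair representation are illustrative assumptions, not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
def alignment_f_measure(predicted, reference):
    """Return (precision, recall, F-measure) for two sets of aligned pairs.

    Each pair is a hypothetical (source_sentence_index, target_sentence_index)
    tuple; this representation is an assumption made for illustration.
    """
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)  # pairs found in both sets
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example: 2 of 3 predicted pairs are correct, 2 of 4 reference pairs found.
predicted_pairs = [(0, 0), (1, 1), (2, 3)]
reference_pairs = [(0, 0), (1, 1), (2, 2), (3, 3)]
p, r, f = alignment_f_measure(predicted_pairs, reference_pairs)
print(f"precision={p:.3f} recall={r:.3f} f_measure={f:.3f}")
```

A score of 0.96, as reported above, would mean that precision and recall over the aligned pairs are both high, since the harmonic mean is pulled down by whichever of the two is lower.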
KSP suggested keywords
Construction procedure, End-to-End (E2E), European languages, F-measure, Freely available, Government Open Data, Korean speech, Parallel data, Source language, Speech translation, TED Talks
This work may be used under the Korea Open Government License (KOGL) Type 4: attribution, no commercial use, no modification.