ETRI-Knowledge Sharing Platform

Details

Journal article: An Effective Style Token Weight Control Technique for End-to-End Emotional Speech Synthesis
Cited 35 times in Scopus. Downloaded 5 times.
Authors
권오성, 장인선, 안충현, 강홍구
Publication date
September 2019
Source
IEEE Signal Processing Letters, v.26 no.9, pp.1383-1387
ISSN
1070-9908
Publisher
IEEE
DOI
https://dx.doi.org/10.1109/LSP.2019.2931673
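Contract project
19HR4400, Development of an emotional expression service to support broadcast viewing for people with visual and hearing impairments, 안충현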
Abstract
In this letter, we propose a high-quality emotional speech synthesis system that uses an emotional vector space, i.e., the weighted sum of global style tokens (GSTs). Our previous research verified the feasibility of GST-based emotional speech synthesis in an end-to-end text-to-speech synthesis framework. However, selecting appropriate reference audio (RA) signals from which to extract emotion embedding vectors for specific types of target emotions remains problematic. To ameliorate the selection problem, we propose an effective way of generating emotion embedding vectors by utilizing the trained GSTs. By assuming that the trained GSTs represent an emotional vector space, we first investigate the distribution of all the training samples depending on the type of each emotion. We then regard the centroid of the distribution as an emotion-specific weighting value, which effectively controls the expressiveness of synthesized speech without the RA that was previously needed for guidance. Finally, we confirm that the proposed controlled weight-based method is superior to the conventional emotion label-based methods in terms of perceptual quality and emotion classification accuracy.
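The centroid idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `emotion_centroids` and `style_embedding`, the toy token matrix, and the toy weights are all hypothetical, and the sketch assumes that each training sample already has a vector of style-token weights (for example, attention weights over the GSTs) plus an emotion label.

```python
import numpy as np


def emotion_centroids(weights, labels):
    """Return the mean style-token weight vector for each emotion label.

    weights: (N, K) array, one K-dimensional token-weight vector per sample
             (assumed to be available from a trained model).
    labels:  length-N sequence of emotion labels.
    """
    weights = np.asarray(weights, dtype=float)
    labels = np.asarray(labels)
    return {str(e): weights[labels == e].mean(axis=0) for e in np.unique(labels)}


def style_embedding(token_weights, tokens):
    """Weighted sum of style tokens: (K,) weights times (K, D) tokens -> (D,)."""
    return np.asarray(token_weights, dtype=float) @ np.asarray(tokens, dtype=float)


# Hypothetical toy data: 3 style tokens of dimension 2, 4 labeled samples.
tokens = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
weights = np.array([[0.8, 0.1, 0.1],
                    [0.6, 0.3, 0.1],
                    [0.1, 0.2, 0.7],
                    [0.1, 0.4, 0.5]])
labels = ["happy", "happy", "sad", "sad"]

centroids = emotion_centroids(weights, labels)

# At synthesis time the centroid stands in for weights that would otherwise
# be derived from a reference audio signal.
happy_embedding = style_embedding(centroids["happy"], tokens)
sad_embedding = style_embedding(centroids["sad"], tokens)
```

In this sketch the per-emotion centroid replaces the reference audio as the source of token weights, which is the role the abstract assigns to the emotion-specific weighting value. How the weights are obtained and how the embedding conditions the synthesizer are not specified on this page.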
KSP suggested keywords
Control technique, Emotion classification, End to End(E2E), High-quality, Perceptual Quality, Selection problem, Synthesized speech, Text-To-Speech(TTS), Text-To-Speech synthesis, Training samples, Weight control