ETRI Knowledge Sharing Platform : Emotion-Aware Speaker Identification With Transfer Learning

Journal Article Emotion-Aware Speaker Identification With Transfer Learning

Cited 9 times in Scopus

Downloaded 159 times

Authors: Kyoungju Noh, Hyuntae Jeong

Issue Date: 2023-07

Citation: IEEE Access, v.11, pp.77292-77306

ISSN: 2169-3536

Publisher: Institute of Electrical and Electronics Engineers Inc.

Language: English

Type: Journal Article

DOI: https://dx.doi.org/10.1109/ACCESS.2023.3297715

Abstract: Speech is a natural communication method used by humans. Speaker identification (SI) technology based on human speech has been used as an entry point for many human–computer-interaction applications. The performance of SI models can degrade when dealing with expressive speech uttered in emotional situations because emotion databases do not have sufficient data on expressive speech to train SI models for various emotional states. Generally, SI models are trained using relatively more samples of “neutral” speech than samples of other emotion classes. In this study, we propose an emotion-aware SI (em-SI) method that uses an emotion-embedding vector generated from a pre-trained speech emotion recognition (SER) model along with the acoustic features of speech data. We assess the performance of this method using individual English and Korean corpora and confirm that the proposed method provides an improved performance on multilingual corpora. The evaluation results show that the SI accuracy of em-SI on the Korean Emotion Multimodal Database (KEMDy19) improved by 3.2%, and the average speaker verification (SV) performance in terms of the equal error rate (EER) was improved by 1.3% compared to that of the baseline SI model. The visualization of the embedding vector of em-SI shows that em-SI maps speech data to an embedding space where both SI and emotional information are simultaneously represented. Through the experiments conducted in this study, we confirmed that the em-SI model, which learns by integrating emotion and speaker embedding information, improved the performance of SI for expressive speech.
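The abstract reports speaker verification performance as an equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. This record does not say how the authors computed it, so the following is only a minimal sketch of a common threshold-sweep approximation. The function name `equal_error_rate` and the score lists are hypothetical, and it assumes that a higher score means the two utterances more likely come from the same speaker.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumption (not from the paper): higher score = more likely same speaker.
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")

    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection rate: genuine trials scoring below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance rate: impostor trials scoring at or above it.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Illustrative scores only, not data from the paper.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine, impostor))
```

With a finite set of trials the two error rates rarely cross exactly, so this sketch returns the midpoint of the two rates at the closest threshold. Evaluation toolkits often interpolate along the ROC curve instead.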

KSP Keywords: Embedding space, Emotion-aware, Emotional states, Equal error rate, Improved performance, Multimodal database, SI model, Speaker Identification(SI), Speaker verification, Speech Emotion recognition, Transfer learning

This work is distributed under the terms of a Creative Commons License (CC BY-NC-ND).
