Speech is a natural way of communication among humans, and advancements in speech emotion recognition (SER) technology can further improve speech-based human-computer interaction (HCI) by enabling systems to understand human emotions. SER systems have traditionally focused on categorizing emotions into discrete classes. However, discrete classes often overlook subtle distinctions between emotions, which vary across individuals and cultures. In this study, we focus on dimensional emotional values, namely valence, arousal, and dominance, as the outputs of an SER model instead of traditional categorical classification. The SER model uses the large pre-trained models Wav2Vec 2.0 and HuBERT as feature encoders that extract features directly from raw audio input. The model's performance is assessed using the mean concordance correlation coefficient (CCC) for models trained on an English-language dataset, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, and a Korean-language dataset, the Korean Emotion Multimodal Database (KEMDy19). On IEMOCAP, the Wav2Vec 2.0-based model trained on the "anger", "happy", "sad", and "neutral" emotion classes achieved a mean CCC of 0.3673, with CCC values of 0.3004, 0.4585, and 0.3431 for valence, arousal, and dominance, respectively. The HuBERT-based model achieved a mean CCC of 0.3573, with CCC values of 0.2789, 0.3295, and 0.3361 for valence, arousal, and dominance, respectively, on the same set of emotion classes. On KEMDy19, the Wav2Vec 2.0-based model achieved a mean CCC of 0.5473, with CCC values of 0.5804 and 0.5142 for valence and arousal, using all available emotion classes in the dataset, and a mean CCC of 0.5580, with CCC values of 0.5941 and 0.5219, on the four emotion classes "anger", "happy", "sad", and "neutral". The HuBERT-based model recorded a mean CCC of 0.5271, with CCC values of 0.5429 and 0.5113 for valence and arousal, using all available emotion classes, and a mean CCC of 0.5392, with CCC values of 0.5765 and 0.5019 for valence and arousal, on the four selected emotion classes. The proposed approach outperforms traditional machine learning methods and previously reported CCC values in the literature. Moreover, dimensional emotional values provide a more fine-grained view of the user's emotional state, allowing a deeper understanding of affect with reduced output dimensionality. Applying such SER technology in areas such as HCI, affective computing, and psychological research can enable more personalized and adaptable user interfaces suited to each individual's emotional needs, and can contribute to advancing our understanding of human factors through the development of emotion recognition systems.
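The sketch below, which is not the authors' implementation, illustrates the two ingredients the abstract describes: the concordance correlation coefficient used to score valence/arousal/dominance predictions, and a regression head on a pre-trained Wav2Vec 2.0 feature encoder. The checkpoint name, mean-pooling, and head size are assumptions for illustration only.

```python
# Minimal sketch (assumptions noted): CCC metric and a hypothetical
# dimensional-emotion regression head on a pre-trained Wav2Vec 2.0 encoder.
import numpy as np
import torch
from transformers import Wav2Vec2Model


def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Concordance correlation coefficient between annotations and predictions."""
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return float(2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2))


class DimensionalSER(torch.nn.Module):
    """Hypothetical head predicting valence, arousal, dominance from raw audio."""

    def __init__(self, checkpoint: str = "facebook/wav2vec2-base", num_dims: int = 3):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)  # feature encoder
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, num_dims)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of 16 kHz raw audio
        hidden = self.encoder(waveform).last_hidden_state  # (batch, frames, hidden)
        pooled = hidden.mean(dim=1)                         # mean-pool over time
        return self.head(pooled)                            # (batch, num_dims)


# Mean CCC over the three dimensions, with placeholder labels and predictions.
dims = ["valence", "arousal", "dominance"]
rng = np.random.default_rng(0)
labels = {d: rng.random(100) for d in dims}
preds = {d: rng.random(100) for d in dims}
print("mean CCC:", np.mean([ccc(labels[d], preds[d]) for d in dims]))
```

An analogous head over a HuBERT encoder would follow the same pattern, swapping the feature encoder while keeping the regression head and CCC evaluation unchanged.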
KSP Keywords
Affective states, Emotional states, English language, Extraction technique, Feature extraction, Fine grained (FG), Human factors, Human-computer interaction, Human emotions, Korean language, Machine learning methods