ETRI Knowledge Sharing Platform

Speech Emotion Recognition Using 2D-CNN with Mel-Frequency Cepstrum Coefficients
Cited 10 times in Scopus · Downloaded 156 times
Authors
Youngsik Eom, Junseong Bang
Issue Date
2021-09
Citation
Journal of Information and Communication Convergence Engineering, v.19, no.3, pp.148-154
ISSN
2234-8255
Publisher
The Korea Institute of Information and Communication Engineering (KIICE)
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.6109/jicce.2021.19.3.148
Abstract
With the advent of context-aware computing, many attempts have been made to understand emotions. Among these attempts, Speech Emotion Recognition (SER) is a method of recognizing the speaker's emotions from speech information. SER succeeds when distinctive features are selected and classified in an appropriate way. In this paper, the performance of SER using neural network models (e.g., a fully connected network (FCN) and a convolutional neural network (CNN)) with Mel-Frequency Cepstral Coefficients (MFCC) is examined in terms of the accuracy and distribution of emotion recognition. For the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset, by tuning model parameters, a two-dimensional Convolutional Neural Network (2D-CNN) model with MFCC showed the best performance, with an average accuracy of 88.54% for five emotions (anger, happiness, calm, fear, and sadness) of men and women. In addition, by examining the distribution of emotion recognition accuracies across the neural network models, the 2D-CNN with MFCC can be expected to achieve an overall accuracy of 75% or more.
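
The following is a minimal sketch of the MFCC-plus-2D-CNN pipeline described in the abstract, assuming librosa for feature extraction and PyTorch for the classifier. The hyperparameters (40 MFCCs, 128 frames, layer sizes) and the file path are illustrative assumptions, not the authors' reported configuration.

# Sketch of the MFCC + 2D-CNN pipeline from the abstract.
# Hyperparameters and architecture are illustrative, not the paper's exact setup.
import librosa
import numpy as np
import torch
import torch.nn as nn

def extract_mfcc(wav_path: str, n_mfcc: int = 40, max_frames: int = 128) -> np.ndarray:
    """Load a speech clip and return a fixed-size 2D MFCC map (n_mfcc x max_frames)."""
    y, sr = librosa.load(wav_path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Pad or truncate along the time axis so every clip has the same shape.
    if mfcc.shape[1] < max_frames:
        mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
    return mfcc[:, :max_frames]

class EmotionCNN(nn.Module):
    """Small 2D-CNN that classifies an MFCC map into 5 emotion classes."""
    def __init__(self, n_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 10 * 32, 128), nn.ReLU(),  # 40x128 input -> 10x32 after two pools
            nn.Linear(128, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mfcc, max_frames)
        return self.classifier(self.features(x))

# Usage (path is a hypothetical RAVDESS clip):
# mfcc = extract_mfcc("ravdess/Actor_01/03-01-05-01-01-01-01.wav")
# logits = EmotionCNN()(torch.tensor(mfcc).unsqueeze(0).unsqueeze(0).float())
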
KSP Keywords
Audio-visual, Best performance, Convolution neural network(CNN), Frequency cepstral coefficients, Mel-Frequency Cepstrum Coefficients(MFCC), Mel-frequency cepstral, Model parameter, Overall accuracy, Speech Emotion recognition, Speech information, context-Aware computing
This work is distributed under the terms of the Creative Commons License (CC BY-NC).