ETRI Knowledge Sharing Platform

Details

Journal Article
Speech Emotion Recognition Using 2D-CNN with Mel-Frequency Cepstrum Coefficients
Cited 9 times in Scopus; downloaded 102 times
Authors
엄영식, 방준성
Publication Date
September 2021
Source
Journal of Information and Communication Convergence Engineering, v.19 no.3, pp.148-154
ISSN
2234-8255
Publisher
The Korea Institute of Information and Communication Engineering (한국정보통신학회)
DOI
https://dx.doi.org/10.6109/jicce.2021.19.3.148
Project
21IR1800, Development of a Conversational Public Safety Knowledge Service (PolBot), 방준성
Abstract
With the advent of context-aware computing, many attempts have been made to understand emotions. Among these attempts, Speech Emotion Recognition (SER) is a method of recognizing a speaker's emotions from speech information. SER succeeds when distinctive features are selected and classified in an appropriate way. In this paper, the performance of SER using neural network models (e.g., a fully connected network (FCN) and a convolutional neural network (CNN)) with Mel-Frequency Cepstral Coefficients (MFCC) is examined in terms of the accuracy and distribution of emotion recognition. On the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset, after tuning model parameters, a two-dimensional Convolutional Neural Network (2D-CNN) model with MFCC showed the best performance, with an average accuracy of 88.54% over five emotions (anger, happiness, calm, fear, and sadness) of men and women. In addition, the distribution of emotion recognition accuracies across neural network models shows that the 2D-CNN with MFCC can be expected to achieve an overall accuracy of 75% or more.
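
To illustrate the pipeline described in the abstract, the sketch below extracts fixed-size MFCC feature maps with librosa and trains a small 2D-CNN classifier with Keras on the five emotion classes. The number of MFCCs, the frame length, the layer sizes, and the training settings are illustrative assumptions for this sketch, not the exact configuration reported in the paper.

import numpy as np
import librosa
from tensorflow.keras import layers, models

# The five emotion classes examined in the paper.
EMOTIONS = ["anger", "happiness", "calm", "fear", "sadness"]

def mfcc_image(wav_path, sr=22050, n_mfcc=40, max_frames=174):
    """Load one clip and return a fixed-size (n_mfcc, max_frames, 1) MFCC map."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Pad or truncate along time so every clip yields the same input shape.
    if mfcc.shape[1] < max_frames:
        mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
    else:
        mfcc = mfcc[:, :max_frames]
    return mfcc[..., np.newaxis]

def build_2d_cnn(input_shape=(40, 174, 1), n_classes=len(EMOTIONS)):
    """Small 2D-CNN classifier over the MFCC time-frequency plane (sizes assumed)."""
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage: wav_paths and integer labels (indices into EMOTIONS) are assumed to be
# prepared from the RAVDESS file names elsewhere.
# X = np.stack([mfcc_image(p) for p in wav_paths])
# y = np.array(labels)
# model = build_2d_cnn()
# model.fit(X, y, epochs=50, validation_split=0.2)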
KSP Keywords
Audio-visual, Best performance, Convolution neural network(CNN), Frequency cepstral coefficients, Mel-Frequency Cepstrum Coefficients(MFCC), Mel-frequency cepstral, Model parameter, Overall accuracy, Speech Emotion recognition, Speech information, context-Aware computing
This work may be used under the terms of the Creative Commons Attribution-NonCommercial (CC BY-NC) license.