ETRI Knowledge Sharing Platform

Details

Journal Article: Speech Emotion Recognition Based on Parallel CNN-Attention Networks with Multi-Fold Data Augmentation
Cited 11 times in Scopus; downloaded 4 times.
Authors
로렌조, 이윤경, 신현순
Publication Date
2022-11
Source
Electronics, v.11 no.23, pp.1-14
ISSN
2079-9292
Publisher
MDPI
DOI
https://dx.doi.org/10.3390/electronics11233935
Research Project
22PB2400, Development of Emotion Recognition and Empathy AI Service Technology for Supporting Non-face-to-face Learning and Industrial Sites, 신현순
Abstract
In this paper, an automatic speech emotion recognition (SER) task of classifying eight different emotions was carried out using parallel networks trained on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). A combination of a CNN-based network and attention-based networks, running in parallel, was used to model both spatial and temporal feature representations. Multiple augmentation techniques, Additive White Gaussian Noise (AWGN), SpecAugment, Room Impulse Response (RIR), and Tanh Distortion, were applied to the training data to further generalize the model representation. Raw audio data were transformed into Mel-spectrograms as the model's input. Leveraging CNNs' proven capability in image classification and spatial feature representation, the spectrograms were treated as images whose height and width correspond to the frequency and time scales. Temporal features were modeled by attention-based modules: a Transformer and a BLSTM-Attention module. The proposed parallel architectures, a CNN-based network running alongside either a Transformer or a BLSTM-Attention module, were compared with standalone CNN and attention-based networks, as well as with hybrid architectures in which CNN layers wrapped in time-distributed wrappers are stacked on attention-based networks. In these experiments, the highest accuracies of 89.33% for the Parallel CNN-Transformer network and 85.67% for the Parallel CNN-BLSTM-Attention network were achieved on a 10% hold-out test set from the dataset. These networks showed promising results based on their accuracies while requiring significantly fewer training parameters than the non-parallel hybrid models.
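To make the parallel architecture described above more concrete, the sketch below outlines a parallel CNN-Transformer classifier over Mel-spectrogram input in PyTorch/torchaudio. It is a minimal illustration, not the configuration reported in the paper: the layer sizes, hyperparameters, the 16 kHz sample rate, the AWGN noise level, and the class name ParallelCNNTransformer are all assumptions made for this example.

import torch
import torch.nn as nn
import torchaudio

class ParallelCNNTransformer(nn.Module):
    """Parallel CNN + Transformer over a Mel-spectrogram input (sizes are illustrative)."""
    def __init__(self, n_mels=128, n_classes=8, d_model=128):
        super().__init__()
        # CNN branch: treats the Mel-spectrogram as a single-channel image
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                        # -> (B, 32, 1, 1)
        )
        # Transformer branch: treats each time frame as a token of n_mels features
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Classifier over the concatenated outputs of both branches
        self.fc = nn.Linear(32 + d_model, n_classes)

    def forward(self, mel):                                 # mel: (B, n_mels, T)
        cnn_out = self.cnn(mel.unsqueeze(1)).flatten(1)     # (B, 32)
        tokens = self.proj(mel.transpose(1, 2))             # (B, T, d_model)
        trans_out = self.transformer(tokens).mean(dim=1)    # (B, d_model)
        return self.fc(torch.cat([cnn_out, trans_out], dim=1))

# Mel-spectrogram front end and a simple AWGN augmentation (parameter values are assumptions)
mel_extractor = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)
wave = torch.randn(1, 16000 * 3)                            # placeholder 3-second waveform
wave = wave + 0.005 * torch.randn_like(wave)                # additive white Gaussian noise
logits = ParallelCNNTransformer()(mel_extractor(wave))      # -> shape (1, 8), one score per emotion

In the paper's setup the two branches are trained jointly and their outputs fused before classification; the same skeleton could be varied by swapping the Transformer branch for a BLSTM with attention.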
KSP Suggested Keywords
Additive white Gaussian noise(AWGN), Attention-Based Models, Audio data, Audio-visual, Augmentation techniques, Data Augmentation, Feature representation, Hybrid Models, Hybrid architecture, Image classification, Model representation
This work is available under the Creative Commons Attribution (CC BY) license.