ETRI - Knowledge Sharing Platform

Speech Emotion Recognition Based on Parallel CNN-Attention Networks with Multi-Fold Data Augmentation
Cited 22 times in Scopus · Downloaded 18 times
Authors
John Lorenzo Bautista, Yun Kyung Lee, Hyun Soon Shin
Issue Date
2022-11
Citation
Electronics, v.11, no.23, pp.1-14
ISSN
2079-9292
Publisher
MDPI
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.3390/electronics11233935
Abstract
In this paper, an automatic speech emotion recognition (SER) task of classifying eight different emotions was carried out using parallel networks trained on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). A combination of a CNN-based network and attention-based networks, running in parallel, was used to model both spatial and temporal feature representations. Multiple augmentation techniques, namely Additive White Gaussian Noise (AWGN), SpecAugment, Room Impulse Response (RIR), and Tanh Distortion, were applied to the training data to further generalize the model representation. Raw audio data were transformed into Mel-spectrograms as the model's input. Exploiting the CNN's proven capability in image classification and spatial feature representation, each spectrogram was treated as an image whose height and width correspond to the spectrogram's time and frequency scales. Temporal feature representations were modeled by attention-based modules: a Transformer and a BLSTM-Attention module. The proposed parallel architectures, CNN-based networks running alongside Transformer and BLSTM-Attention modules, were compared with standalone CNN architectures and attention-based networks, as well as with hybrid architectures in which CNN layers wrapped in time-distributed wrappers are stacked on attention-based networks. In these experiments, accuracies of 89.33% for the Parallel CNN-Transformer network and 85.67% for the Parallel CNN-BLSTM-Attention network were achieved on a 10% hold-out test set from the dataset. These networks showed promising accuracies while requiring significantly fewer training parameters than the non-parallel hybrid models.
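To make the front end concrete, below is a minimal sketch of the Mel-spectrogram transformation the abstract describes, using librosa. The parameter values (sample rate, number of mel bands, FFT size, hop length) are illustrative assumptions, not settings reported in this abstract.

```python
import librosa
import numpy as np

def wav_to_mel(path, sr=22050, n_mels=128, n_fft=2048, hop_length=512):
    """Load a mono clip and return a log-scaled Mel-spectrogram,
    treated downstream as a single-channel image (mel bins x frames)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, frames)
```

The log scaling (power_to_db) is a common choice for spectrogram "images"; the paper's exact preprocessing may differ.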
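The four augmentation techniques named in the abstract can each be expressed in a few lines. The sketches below are generic re-implementations for illustration, not the authors' code; the SNR, gain, and mask sizes are assumed values.

```python
import numpy as np
from scipy.signal import fftconvolve

def add_awgn(y, snr_db=20.0):
    """Additive White Gaussian Noise at a target signal-to-noise ratio (dB)."""
    noise = np.random.randn(len(y))
    scale = np.sqrt(np.mean(y**2) / (10**(snr_db / 10) * np.mean(noise**2)))
    return y + scale * noise

def tanh_distortion(y, gain=4.0):
    """Soft-clipping distortion via tanh; gain controls severity."""
    return np.tanh(gain * y) / np.tanh(gain)

def apply_rir(y, rir):
    """Simulate room acoustics by convolving with a room impulse response."""
    rir = rir / np.sqrt(np.sum(rir**2))  # energy-normalize the RIR
    return fftconvolve(y, rir, mode="full")[:len(y)]

def spec_augment(mel, freq_mask=12, time_mask=20):
    """SpecAugment-style masking: blank a random band of mel bins and frames."""
    mel = mel.copy()
    f0 = np.random.randint(0, mel.shape[0] - freq_mask)
    t0 = np.random.randint(0, mel.shape[1] - time_mask)
    mel[f0:f0 + freq_mask, :] = mel.min()
    mel[:, t0:t0 + time_mask] = mel.min()
    return mel
```

Note that AWGN, RIR, and tanh distortion operate on the raw waveform before feature extraction, while SpecAugment masks the Mel-spectrogram itself.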
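Finally, a hedged PyTorch sketch of the parallel CNN-Transformer idea: a CNN branch treats the spectrogram as a one-channel image for spatial features, an attention branch treats each time frame as a token for temporal features, and the two branch outputs are concatenated before an eight-way emotion classifier. The layer sizes and depths here are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class ParallelCNNTransformer(nn.Module):
    def __init__(self, n_mels=128, d_model=128, n_classes=8):
        super().__init__()
        # Spatial branch: spectrogram as a 1-channel image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())            # -> (B, 32)
        # Temporal branch: each frame is a token of n_mels features.
        self.proj = nn.Linear(n_mels, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(32 + d_model, n_classes)

    def forward(self, mel):                        # mel: (B, n_mels, T)
        spatial = self.cnn(mel.unsqueeze(1))             # (B, 32)
        tokens = self.proj(mel.transpose(1, 2))          # (B, T, d_model)
        temporal = self.transformer(tokens).mean(dim=1)  # (B, d_model)
        return self.head(torch.cat([spatial, temporal], dim=1))
```

Running the branches in parallel and fusing at the classifier, rather than stacking time-distributed CNN layers on top of the attention network, is what keeps the parameter count low relative to the non-parallel hybrid models the abstract compares against.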
KSP Keywords
Additive White Gaussian Noise (AWGN), Attention-Based Models, Audio Data, Audio-Visual, Augmentation Techniques, Data Augmentation, Feature Representation, Hybrid Models, Hybrid Architecture, Image Classification, Model Representation
This work is distributed under the terms of the Creative Commons License (CC BY).