ETRI Knowledge Sharing Platform : Speech Emotion Recognition Based on Parallel CNN-Attention Networks with Multi-Fold Data Augmentation

Journal Article Speech Emotion Recognition Based on Parallel CNN-Attention Networks with Multi-Fold Data Augmentation

Cited 30 times in Scopus

Downloaded 58 times

Authors: John Lorenzo Bautista, Yun Kyung Lee, Hyun Soon Shin

Issue Date: 2022-11

Citation: Electronics, v.11, no.23, pp.1-14

ISSN: 2079-9292

Publisher: MDPI

Language: English

Type: Journal Article

DOI: https://dx.doi.org/10.3390/electronics11233935

Abstract: This paper investigates automatic speech emotion recognition (SER), classifying eight emotions with parallel networks trained on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). A CNN-based network and an attention-based network, running in parallel, were combined to model both spatial and temporal feature representations. Multiple augmentation techniques, namely Additive White Gaussian Noise (AWGN), SpecAugment, Room Impulse Response (RIR), and Tanh Distortion, were applied to the training data to help the model generalize. Raw audio was transformed into Mel-spectrograms as the model's input. To exploit CNNs' proven capability in image classification and spatial feature representation, the spectrograms were treated as images whose height and width correspond to the spectrogram's time and frequency scales. Temporal feature representations were modeled by attention-based Transformer and BLSTM-Attention modules. The proposed architectures, in which a CNN-based network runs in parallel with a Transformer or BLSTM-Attention module, were compared with standalone CNN architectures and attention-based networks, as well as with hybrid architectures in which CNN layers wrapped in time-distributed wrappers are stacked on attention-based networks. In these experiments, the highest accuracies were 89.33% for the Parallel CNN-Transformer network and 85.67% for the Parallel CNN-BLSTM-Attention network, measured on a 10% hold-out test set from the dataset. These networks showed promising accuracy while using significantly fewer trainable parameters than the non-parallel hybrid models.
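Two of the waveform-level augmentations named in the abstract, AWGN and Tanh Distortion, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `add_awgn` and `tanh_distortion`, the `snr_db` and `gain` parameters, and the peak-rescaling step are all assumptions chosen for illustration, since the abstract does not state the settings used.

```python
import numpy as np


def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise scaled to an approximate target SNR in dB.

    Hypothetical helper: the SNR range used in the paper is not given here.
    """
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=np.float64)
    signal_power = np.mean(signal ** 2)
    # SNR(dB) = 10 * log10(P_signal / P_noise)  =>  solve for P_noise.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def tanh_distortion(signal, gain=5.0):
    """Soft-clip the waveform with tanh, then restore the original peak.

    Hypothetical helper: `gain` (must be > 0) controls how hard the
    signal is driven into the saturating part of tanh.
    """
    signal = np.asarray(signal, dtype=np.float64)
    peak = np.max(np.abs(signal))
    if peak == 0.0:
        return signal.copy()  # silence stays silence
    distorted = np.tanh(gain * signal / peak)
    # Rescale so the distorted waveform has the same peak amplitude.
    return distorted * (peak / np.max(np.abs(distorted)))


# Example: a one-second 440 Hz tone sampled at 16 kHz (assumed values).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clean = 0.5 * np.sin(2.0 * np.pi * 440.0 * t)
noisy = add_awgn(clean, snr_db=10.0, rng=np.random.default_rng(0))
clipped = tanh_distortion(clean, gain=5.0)
```

In a pipeline like the one the abstract describes, augmented waveforms of this kind would presumably be converted to Mel-spectrograms afterwards, while SpecAugment operates on the spectrogram itself.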

KSP Keywords: Additive white Gaussian noise(AWGN), Attention-Based Models, Audio data, Audio-visual, Augmentation techniques, Data Augmentation, Feature Representation, Hybrid model, Image Classification, Model representation, Speech Emotion recognition

This work is distributed under the terms of a Creative Commons License (CC BY).
