ETRI Knowledge Sharing Platform


Many-to-Many Audio Spectrogram Transformer: Transformer for Sound Event Localization and Detection
Authors
Sooyoung Park, Youngho Jeong, Taejin Lee
Issue Date
2021-11
Citation
Detection and Classification of Acoustic Scenes and Events (DCASE) 2021: Workshop, pp.105-109
Language
English
Type
Conference Paper
Abstract
Over the past few years, convolutional neural networks (CNNs) have been established as the core architecture for audio classification and detection. In particular, hybrid models that combine a recurrent neural network or a self-attention mechanism with CNNs to capture longer-range context have been widely used. Recently, Transformers, which are pure attention-based architectures, have achieved excellent performance in various fields, showing that CNNs are not essential. In this paper, we investigate the reliance on CNNs for sound event localization and detection by introducing the Many-to-Many Audio Spectrogram Transformer (M2M-AST), a pure attention-based architecture. We adopt multiple classification tokens in the Transformer architecture to easily handle various output resolutions. We empirically show that the proposed M2M-AST outperforms the conventional hybrid model on the TAU-NIGENS Spatial Sound Events 2021 dataset.
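The core idea the abstract describes, replacing the single classification token of a standard Audio Spectrogram Transformer with one classification token per output time step so the encoder produces frame-wise predictions, can be illustrated with a short sketch. The following PyTorch snippet is a minimal illustration, not the authors' implementation: the layer sizes, the patch-embedding scheme, and the frame-wise SED/DoA output heads are all assumptions chosen for brevity.

```python
# Minimal "many-to-many" audio spectrogram Transformer sketch.
# Instead of one [CLS] token, a sequence of classification tokens
# (one per output frame) is prepended, so the encoder emits frame-wise
# SELD predictions. All hyperparameters below are illustrative guesses.
import torch
import torch.nn as nn

class M2MASTSketch(nn.Module):
    def __init__(self, n_classes=12, n_out_frames=60, embed_dim=192,
                 depth=4, n_heads=4, patch=(16, 16), n_mels=64, n_frames=640):
        super().__init__()
        # Non-overlapping patch embedding over the (mel, time) spectrogram.
        self.patch_embed = nn.Conv2d(1, embed_dim, kernel_size=patch, stride=patch)
        n_patches = (n_mels // patch[0]) * (n_frames // patch[1])
        # One learnable classification token per output time step:
        # the "many-to-many" part, output resolution = n_out_frames.
        self.cls_tokens = nn.Parameter(torch.randn(1, n_out_frames, embed_dim) * 0.02)
        self.pos_embed = nn.Parameter(torch.randn(1, n_out_frames + n_patches, embed_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Frame-wise heads: event activity (SED) and direction of arrival (x, y, z).
        self.sed_head = nn.Linear(embed_dim, n_classes)
        self.doa_head = nn.Linear(embed_dim, 3 * n_classes)

    def forward(self, spec):                  # spec: (B, 1, n_mels, n_frames)
        x = self.patch_embed(spec)            # (B, D, mel_patches, time_patches)
        x = x.flatten(2).transpose(1, 2)      # (B, n_patches, D)
        cls = self.cls_tokens.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        y = x[:, : self.cls_tokens.size(1)]   # read out the classification tokens
        return self.sed_head(y).sigmoid(), self.doa_head(y).tanh()

# Example: a 64-mel, 640-frame spectrogram mapped to 60 output frames.
model = M2MASTSketch()
sed, doa = model(torch.randn(2, 1, 64, 640))
print(sed.shape, doa.shape)  # torch.Size([2, 60, 12]) torch.Size([2, 60, 36])
```

Because the number of classification tokens is a hyperparameter independent of the patch grid, the same encoder can be read out at whatever temporal resolution the task requires, which is one plausible reading of the abstract's claim that multiple classification tokens "easily handle various output resolutions".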
KSP Keywords
Attention mechanism, Audio classification, Convolution neural network(CNN), Hybrid model, Many-to-many, Multiple classification, Spatial Sound, event localization, excellent performance, neural network(NN), recurrent neural network(RNN)