ETRI-Knowledge Sharing Platform

Weakly Supervised U-Net with Limited Upsampling for Sound Event Detection
Cited 3 times in Scopus. Downloaded 84 times.
Authors
Sangwon Lee, Hyemi Kim, Gil-Jin Jang
Issue Date
2023-06
Citation
APPLIED SCIENCES-BASEL, v.13, no.11, pp.1-28
ISSN
2076-3417
Publisher
MDPI
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.3390/app13116822
Abstract
Sound event detection (SED) is the task of finding the identities of sound events, as well as their onset and offset timings from audio recordings. When complete timing information is not available in the training data, but only the event identities are known, SED should be solved by weakly supervised learning. The conventional U-Net with global weighted rank pooling (GWRP) has shown a decent performance, but extensive computation is demanded. We propose a novel U-Net with limited upsampling (LUU-Net) and global threshold average pooling (GTAP) to reduce the model size, as well as the computational overhead. The expansion along the frequency axis in the U-Net decoder was minimized, so that the output map sizes were reduced by 40% at the convolutional layers and 12.5% at the fully connected layers without SED performance degradation. The experimental results on a mixed dataset of DCASE 2018 Tasks 1 and 2 showed that our limited upsampling U-Net (LUU-Net) with GTAP was about 23% faster in training and achieved 0.644 in audio tagging and 0.531 in weakly supervised SED tasks in terms of F1 scores, while U-Net with GWRP showed 0.629 and 0.492, respectively. The major contribution of the proposed LUU-Net is the reduction in the computation time with the SED performance being maintained or improved. The other proposed method, GTAP, further improved the training time reduction and provides versatility for various audio mixing conditions by adjusting a single hyperparameter.
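The abstract names global weighted rank pooling (GWRP) as the baseline pooling step that turns a map of per-bin scores into one clip-level score. The sketch below follows the commonly cited definition of GWRP (sort the scores in descending order, weight the j-th largest by decay**j, then normalize); it is not taken from this paper. The function name, the `decay` parameter name, and the sample scores are illustrative assumptions. GTAP is not sketched because the abstract does not define it.

```python
def global_weighted_rank_pooling(scores, decay):
    """Pool a flat list of per-bin scores into a single clip-level score.

    Assumed definition (general GWRP, not confirmed by this abstract):
    the j-th largest score receives weight decay**j, and the weighted
    sum is divided by the sum of the weights.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")

    # Rank the scores from largest to smallest.
    ranked = sorted(scores, reverse=True)

    # Geometric weights: 1, decay, decay**2, ...
    weights = [decay ** rank for rank in range(len(ranked))]

    weighted_sum = sum(w * s for w, s in zip(weights, ranked))
    return weighted_sum / sum(weights)


if __name__ == "__main__":
    # Hypothetical time-frequency scores for one event class, flattened.
    example_scores = [0.1, 0.9, 0.2, 0.8]
    for decay in (0.0, 0.5, 1.0):
        pooled = global_weighted_rank_pooling(example_scores, decay)
        print(f"decay={decay}: pooled score = {pooled:.4f}")
```

With `decay = 0` only the largest score contributes, so the result equals max pooling; with `decay = 1` every score is weighted equally, so the result equals average pooling. Intermediate values interpolate between the two. The sorting step is one reason a rank-based pooling layer can be costly on large output maps.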
KSP Keywords
Global threshold, Mixed Dataset, Sound event detection (SED), Training time, Weakly supervised learning, audio tagging, computation time, mixing conditions, performance degradation, time reduction, training data
This work is distributed under the terms of a Creative Commons License (CCL): CC BY