ETRI Knowledge Sharing Platform : Weakly Supervised U-Net with Limited Upsampling for Sound Event Detection

Journal Article Weakly Supervised U-Net with Limited Upsampling for Sound Event Detection

Cited 3 times in Scopus

Downloaded 426 times

Authors: Sangwon Lee, Hyemi Kim, Gil-Jin Jang

Issue Date: 2023-06

Citation: APPLIED SCIENCES-BASEL, v.13, no.11, pp.1-28

ISSN: 2076-3417

Publisher: MDPI

Language: English

Type: Journal Article

DOI: https://dx.doi.org/10.3390/app13116822

Abstract: Sound event detection (SED) is the task of finding the identities of sound events, as well as their onset and offset timings from audio recordings. When complete timing information is not available in the training data, but only the event identities are known, SED should be solved by weakly supervised learning. The conventional U-Net with global weighted rank pooling (GWRP) has shown a decent performance, but extensive computation is demanded. We propose a novel U-Net with limited upsampling (LUU-Net) and global threshold average pooling (GTAP) to reduce the model size, as well as the computational overhead. The expansion along the frequency axis in the U-Net decoder was minimized, so that the output map sizes were reduced by 40% at the convolutional layers and 12.5% at the fully connected layers without SED performance degradation. The experimental results on a mixed dataset of DCASE 2018 Tasks 1 and 2 showed that our limited upsampling U-Net (LUU-Net) with GTAP was about 23% faster in training and achieved 0.644 in audio tagging and 0.531 in weakly supervised SED tasks in terms of F1 scores, while U-Net with GWRP showed 0.629 and 0.492, respectively. The major contribution of the proposed LUU-Net is the reduction in the computation time with the SED performance being maintained or improved. The other proposed method, GTAP, further improved the training time reduction and provides versatility for various audio mixing conditions by adjusting a single hyperparameter.
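The two pooling ideas named in the abstract can be illustrated with a minimal sketch. `gwrp` follows the commonly described form of global weighted rank pooling (values sorted in descending order and weighted by a decay factor `r`, here an assumed parameter name). `threshold_average_pool` is only one plausible reading of "global threshold average pooling", not the paper's definition: it averages the values at or above a hypothetical `threshold` and, as a further assumption, falls back to the maximum when nothing passes. The function names and the example frame probabilities are illustrative and do not come from the paper.

```python
def gwrp(values, r):
    """Global weighted rank pooling over a flat list of scores.

    Scores are sorted in descending order and weighted by r**rank,
    so r = 1.0 reduces to average pooling and r = 0.0 to max pooling.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    ranked = sorted(values, reverse=True)
    weights = [r ** rank for rank in range(len(ranked))]
    return sum(w * v for w, v in zip(weights, ranked)) / sum(weights)


def threshold_average_pool(values, threshold):
    """Hypothetical threshold-based average pooling (an assumed reading of GTAP).

    Averages only the scores at or above `threshold`. The fallback to the
    maximum when no score passes is an assumption made for this sketch.
    """
    if not values:
        raise ValueError("values must not be empty")
    selected = [v for v in values if v >= threshold]
    if not selected:
        return max(values)
    return sum(selected) / len(selected)


# Illustrative frame-level probabilities for one event class (made-up numbers).
frame_probs = [0.9, 0.8, 0.1, 0.1]

clip_score_avg = gwrp(frame_probs, 1.0)   # behaves like average pooling
clip_score_max = gwrp(frame_probs, 0.0)   # behaves like max pooling
clip_score_thr = threshold_average_pool(frame_probs, 0.5)
```

In both functions a single hyperparameter (`r` or `threshold`) controls how many frames contribute to the clip-level score, which is the kind of adjustment the abstract attributes to GTAP for different mixing conditions.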

KSP Keywords: Global threshold, Mixed Dataset, Sound event detection (SED), Training time, Weakly supervised learning, audio tagging, computation time, mixing conditions, performance degradation, time reduction, training data

This work is distributed under the terms of a Creative Commons License (CC BY).
