ETRI Knowledge Sharing Platform

Improved Alias-and-Separate Speech Coding Framework With Minimal Algorithmic Delay
Cited 1 time in Scopus
Authors
Eunkyun Lee, Seungkwon Beack, Jong Won Shin
Issue Date
2024-12
Citation
IEEE Journal of Selected Topics in Signal Processing, v.18, no.8, pp.1414-1426
ISSN
1932-4553
Publisher
IEEE
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.1109/JSTSP.2024.3501681
Abstract
The Alias-and-Separate (AaS) speech coding framework has shown that wideband (WB) speech can be encoded with a narrowband (NB) speech codec and reconstructed using speech separation. The WB speech is first decimated, which incurs aliasing, and then coded, transmitted, and decoded with an NB codec. The decoded signal is separated into the lower band and the spectrally flipped higher band by a speech separation module; these are expanded, lowpass/highpass filtered, and added together to reconstruct the WB speech. The original AaS system, however, has algorithmic delay originating from the overlap-add operation applied to consecutive segments. This delay can be reduced by omitting the overlap-add procedure, but the quality of the reconstructed speech then degrades due to artifacts at the segment boundaries. In this work, we propose an improved AaS framework with minimal algorithmic delay. The decoded signal is first expanded by inserting zeros in between samples before being processed by the source separation module. Because the expanded signal can be viewed as a summation of frequency-shifted versions of the original signal, the decoded-and-expanded signal is separated into these frequency-shifted signals, which are then multiplied by complex exponentials and summed to reconstruct the original signal. With a carefully designed transposed convolution operation in the separation module, the proposed system requires minimal algorithmic delay while preventing discontinuities at the segment boundaries. Additionally, we propose employing a generative vocoder to further improve the perceived quality, together with a modified multi-resolution short-time Fourier transform (MR-STFT) loss. Experimental results on WB speech coding with an NB codec demonstrated that the proposed system outperformed the original AaS system and an existing WB speech codec in a subjective listening test. We also show, in an experiment on fullband speech coding with a WB codec, that the proposed method can be applied when the decimation factor is not 2.
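
The following NumPy snippet is an illustrative sketch, not the authors' implementation: it only checks the frequency-shift identity behind the zero-insertion step described in the abstract, for a decimation factor of 2, assuming no codec in the loop and an oracle separation; the signal x, the factor M, and all variable names are placeholders for illustration.

# Sketch of the frequency-shift view of zero-insertion in the improved AaS
# framework (decimation factor M = 2, no codec, oracle separation assumed).
import numpy as np

M = 2                                    # decimation factor (WB -> NB)
rng = np.random.default_rng(0)
x = rng.standard_normal(64)              # stand-in for a wideband speech segment

# Decimate without anti-alias filtering: the low band and the spectrally
# flipped high band fold onto each other (aliasing).
y = x[::M]

# Expand by inserting zeros in between samples (no lowpass interpolation).
z = np.zeros_like(x)
z[::M] = y

# The zero-inserted signal equals an average of frequency-shifted copies of x:
#   z[n] = (1/M) * sum_k x[n] * exp(j*2*pi*k*n/M)
n = np.arange(len(x))
shifted = [(x * np.exp(2j * np.pi * k * n / M)) / M for k in range(M)]
assert np.allclose(z, np.sum(shifted, axis=0).real)

# Given a separation of z into those frequency-shifted components, multiplying
# each by the conjugate complex exponential and summing recovers the original.
x_hat = np.sum(
    [s * np.exp(-2j * np.pi * k * n / M) for k, s in enumerate(shifted)], axis=0
)
assert np.allclose(x, x_hat.real)

In the actual framework the separation into the shifted components is performed by a learned module rather than the oracle used above, and the codec sits between the decimation and the zero-insertion steps.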
KSP Keywords
In-between, Inserting zeros, Multi-resolution, Overlap-Add, Perceived quality, Reconstructed speech, Separation module, Short time Fourier transform, Speech Separation, Speech coding, algorithmic delay