ETRI Knowledge Sharing Platform : Generative De-Quantization For Neural Speech Codec Via Latent Diffusion

Conference Paper: Generative De-Quantization For Neural Speech Codec Via Latent Diffusion

Cited 7 times in Scopus

Citation: International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024, pp.1251-1255

Abstract: End-to-end speech coding models achieve high coding gains by learning compact yet expressive features and a powerful decoder in a single network. Solving both tasks in one network is challenging and results in an unwelcome increase in complexity and inferior speech quality. In this paper, we propose to separate the representation learning and information reconstruction tasks. We leverage an end-to-end codec for learning low-dimensional discrete tokens. Instead of using its decoder, we employ a latent diffusion model to de-quantize coded features into a high-dimensional continuous space, relieving the decoder’s burden of de-quantizing and upsampling. To mitigate the issue of over-smooth generation, we introduce midway-infilling with less noise reduction and stronger conditioning. In ablation studies, we investigate the hyperparameters for midway-infilling and latent diffusion spaces of different dimensions. Subjective listening tests show that our model outperforms the state-of-the-art at two low bitrates, 1.5 and 3 kbps. We open-source the project for reproducibility.

KSP Keywords: Coding Gain, Continuous space, End to End (E2E), High-dimensional, Information reconstruction, Noise Reduction (NR), Representation learning, Speech coding, coding models, de-quantization, different dimensions
