ETRI Knowledge Sharing Platform


Generative De-Quantization for Neural Speech Codec via Latent Diffusion
Cited 5 times in Scopus
Authors
Haici Yang, Inseon Jang, Minje Kim
Issue Date
2024-04
Citation
International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024, pp.1251-1255
Publisher
IEEE
Language
English
Type
Conference Paper
DOI
https://dx.doi.org/10.1109/ICASSP48485.2024.10446556
Abstract
End-to-end speech coding models achieve high coding gains by learning compact yet expressive features and a powerful decoder in a single network. Solving such a challenging problem in a single network, however, increases complexity and degrades speech quality. In this paper, we propose to separate the representation learning and information reconstruction tasks. We leverage an end-to-end codec for learning low-dimensional discrete tokens. Instead of using its decoder, we employ a latent diffusion model to de-quantize the coded features into a high-dimensional continuous space, relieving the decoder of the burden of de-quantizing and upsampling. To mitigate over-smooth generation, we introduce midway-infilling, which applies less noise reduction and stronger conditioning. In ablation studies, we investigate the hyperparameters for midway-infilling and latent diffusion spaces of different dimensions. Subjective listening tests show that our model outperforms the state of the art at two low bitrates, 1.5 and 3 kbps. We open-source the project for reproducibility.
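The midway-infilling idea from the abstract, entering the reverse diffusion at an intermediate timestep so that less noise must be removed, while conditioning every step on the coded tokens, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cosine noise schedule, the step count, the DDIM-style deterministic update, and the `denoiser` interface are all assumptions made for the sake of a runnable example.

```python
import numpy as np

def cosine_alpha_bar(t, T):
    """Cumulative signal level at step t under a cosine noise schedule
    (a common choice; the paper's exact schedule is an assumption here)."""
    return np.cos((t / T) * np.pi / 2) ** 2

def midway_infilling(z_quant, denoiser, T=100, t_mid=50, seed=0):
    """Midway-infilling sketch: start the reverse diffusion at an
    intermediate step t_mid (less noise to remove) and condition every
    step on the coded latent z_quant (stronger conditioning).

    `denoiser(z, t, cond)` is a hypothetical stand-in for the trained
    score network; it predicts the noise present in z at step t.
    """
    rng = np.random.default_rng(seed)
    # Initialize from a partially noised version of the quantized
    # latent instead of pure Gaussian noise.
    a_mid = cosine_alpha_bar(t_mid, T)
    z = (np.sqrt(a_mid) * z_quant
         + np.sqrt(1.0 - a_mid) * rng.standard_normal(z_quant.shape))
    for t in range(t_mid, 0, -1):
        a_t = cosine_alpha_bar(t, T)
        a_prev = cosine_alpha_bar(t - 1, T)
        eps = denoiser(z, t, z_quant)                           # predicted noise
        z0_hat = (z - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)  # predicted clean latent
        # DDIM-style deterministic step toward t-1.
        z = np.sqrt(a_prev) * z0_hat + np.sqrt(1.0 - a_prev) * eps
    return z

# Toy check: an oracle denoiser that returns the exact injected noise
# recovers the conditioning latent after the reverse process.
def oracle(z, t, cond, T=100):
    a = cosine_alpha_bar(t, T)
    return (z - np.sqrt(a) * cond) / np.sqrt(1.0 - a)

z_q = np.ones((4, 8))
out = midway_infilling(z_q, oracle)
```

Starting at `t_mid` rather than `T` is what distinguishes midway-infilling from standard sampling: the chain begins closer to the clean latent, so the conditioning signal dominates and the over-smoothing of long reverse trajectories is reduced.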
KSP Keywords
Coding Gain, Continuous space, End to End(E2E), High-dimensional, Information reconstruction, Noise Reduction(NR), Representation learning, Speech coding, coding models, de-quantization, different dimensions