ETRI Knowledge Sharing Platform

Wav2NeRF: Audio-driven realistic talking head generation via wavelet-based NeRF
Cited 2 times in Scopus
Authors
Ah-Hyung Shin, Jae-Ho Lee, Jiwon Hwang, Yoonhyung Kim, Gyeong-Moon Park
Issue Date
2024-08
Citation
Image and Vision Computing, v.148, pp.1-14
ISSN
0262-8856
Publisher
Elsevier BV
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.1016/j.imavis.2024.105104
Abstract
Talking head generation is an essential task in various real-world applications such as film making and virtual reality. To this end, recent works have focused on NeRF-based methods, which can capture the 3D structural information of faces and generate more natural and vivid talking videos. However, existing NeRF-based methods fail to generate accurately audio-synced videos. In this paper, we point out that previous methods do not explicitly model audio-visual representations, which are crucial for precise lip synchronization. Moreover, existing methods struggle to generate high-frequency details, making their results look unnatural. To overcome these problems, we propose a novel audio-synced, high-fidelity NeRF-based talking head generation framework, named Wav2NeRF, which learns audio-visual cross-modality representations and employs the wavelet transform for better visual quality. Specifically, we attach a 2D CNN-based neural rendering decoder to a NeRF-based encoder for fast generation of the whole image, which allows us to employ a new multi-level SyncNet loss for accurate lip synchronization. We also propose a novel cross-attention module to effectively fuse the image and audio representations. In addition, we integrate the wavelet transform into our framework through a proposed wavelet loss function that enhances high-frequency details. We demonstrate that the proposed method renders realistic and audio-synced talking head videos and outperforms current NeRF-based state-of-the-art methods on average across four representative metrics: PSNR (+4.7%), SSIM (+2.2%), LMD (+51.3%), and SyncNet Confidence (+154.7%).
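The abstract names two concrete ingredients: a cross-attention module that fuses audio and image representations, and a wavelet loss that emphasizes high-frequency detail. The PyTorch sketch below illustrates what such components commonly look like; it is not the authors' implementation, and every name, dimension, and design choice here (image features as queries over audio keys/values, a single-level Haar transform, L1 distances between sub-bands, the hf_weight factor) is an assumption made for illustration only.

```python
# Minimal sketch (not the Wav2NeRF code): cross-modal fusion via cross-attention
# and a Haar-wavelet loss penalizing errors in high-frequency sub-bands.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualCrossAttention(nn.Module):
    """Fuses image features (queries) with audio features (keys/values).
    The attention direction and dimensions are assumptions."""

    def __init__(self, img_dim: int = 256, audio_dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, img_dim)  # map audio into image feature space
        self.attn = nn.MultiheadAttention(img_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(img_dim)

    def forward(self, img_feats, audio_feats):
        # img_feats:   (B, N_pixels, img_dim)  flattened spatial features
        # audio_feats: (B, T_audio,  audio_dim) per-frame audio embeddings
        a = self.audio_proj(audio_feats)
        fused, _ = self.attn(query=img_feats, key=a, value=a)
        return self.norm(img_feats + fused)              # residual fusion


def haar_dwt(x):
    """One-level Haar DWT; returns (LL, LH, HL, HH) sub-bands at half resolution."""
    b, c, h, w = x.shape
    filt = x.new_tensor([[[0.5, 0.5], [0.5, 0.5]],       # LL (low-pass)
                         [[0.5, 0.5], [-0.5, -0.5]],     # LH (horizontal detail)
                         [[0.5, -0.5], [0.5, -0.5]],     # HL (vertical detail)
                         [[0.5, -0.5], [-0.5, 0.5]]])    # HH (diagonal detail)
    filt = filt.unsqueeze(1).repeat(c, 1, 1, 1)          # (4c, 1, 2, 2): 4 filters per channel
    out = F.conv2d(x, filt, stride=2, groups=c)          # (B, 4c, H/2, W/2)
    out = out.view(b, c, 4, h // 2, w // 2)
    return out[:, :, 0], out[:, :, 1], out[:, :, 2], out[:, :, 3]


def wavelet_loss(pred, target, hf_weight: float = 1.0):
    """L1 distance between wavelet sub-bands, up-weighting the high-frequency ones."""
    weights = [1.0] + [hf_weight] * 3                    # LL, then LH/HL/HH
    loss = 0.0
    for (p, t), w in zip(zip(haar_dwt(pred), haar_dwt(target)), weights):
        loss = loss + w * F.l1_loss(p, t)
    return loss
```

In the paper these ideas sit inside a larger pipeline (NeRF-based encoder, 2D CNN rendering decoder, multi-level SyncNet loss); the sketch only isolates the two components the abstract describes, under the stated assumptions.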
KSP Keywords
Audio Representation, Audio-visual, Current state, High fidelity, High frequency (HF), Multi-level, Real-world applications, Structural information, Talking head, Visual Representation, Visual quality