ETRI Knowledge Sharing Platform

Wav2NeRF: Audio-driven realistic talking head generation via wavelet-based NeRF
Cited 2 times in Scopus
Authors
Ah-Hyung Shin, Jae-Ho Lee, Jiwon Hwang, Yoonhyung Kim, Gyeong-Moon Park
Issue Date
2024-08
Citation
Image and Vision Computing, v.148, pp.1-14
ISSN
0262-8856
Publisher
Elsevier BV
Language
English
Type
Journal Article
DOI
https://dx.doi.org/10.1016/j.imavis.2024.105104
Abstract
Talking head generation is an essential task in various real-world applications such as film making and virtual reality. To this end, recent works have focused on NeRF-based methods, which can capture the 3D structural information of faces and generate more natural and vivid talking videos. However, existing NeRF-based methods fail to generate accurately audio-synced videos. In this paper, we point out that previous methods do not explicitly model audio-visual representations, which are crucial for precise lip synchronization. Moreover, existing methods struggle to generate high-frequency details, making their results look unnatural. To overcome these problems, we propose a novel audio-synced, high-fidelity NeRF-based talking head generation framework, named Wav2NeRF, which learns audio-visual cross-modality representations and employs the wavelet transform for better visual quality. Specifically, we attach a 2D CNN-based neural rendering decoder to a NeRF-based encoder for fast generation of the whole image, which allows us to employ a new multi-level SyncNet loss for accurate lip synchronization. We also propose a novel cross-attention module to effectively fuse the image and audio representations. In addition, we integrate the wavelet transform into our framework through a proposed wavelet loss function that enhances high-frequency details. We demonstrate that the proposed method renders realistic and audio-synced talking head videos and outperforms current NeRF-based state-of-the-art methods on average across four representative metrics: PSNR (+4.7%), SSIM (+2.2%), LMD (+51.3%), and SyncNet Confidence (+154.7%).
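The abstract names two concrete ingredients: a cross-attention module that fuses audio and image representations, and a wavelet loss that emphasizes high-frequency detail. The PyTorch sketch below illustrates what such components commonly look like; it is not the authors' implementation, and every name, dimension, and design choice here (image features as queries over audio keys/values, a single-level Haar transform, L1 distances between sub-bands, the hf_weight factor) is an assumption made for illustration only.

```python
# Minimal sketch (not the Wav2NeRF code): cross-modal fusion via cross-attention
# and a Haar-wavelet loss penalizing errors in high-frequency sub-bands.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualCrossAttention(nn.Module):
    """Fuses image features (queries) with audio features (keys/values).
    The attention direction and dimensions are assumptions."""

    def __init__(self, img_dim: int = 256, audio_dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, img_dim)  # map audio into image feature space
        self.attn = nn.MultiheadAttention(img_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(img_dim)

    def forward(self, img_feats, audio_feats):
        # img_feats:   (B, N_pixels, img_dim)  flattened spatial features
        # audio_feats: (B, T_audio,  audio_dim) per-frame audio embeddings
        a = self.audio_proj(audio_feats)
        fused, _ = self.attn(query=img_feats, key=a, value=a)
        return self.norm(img_feats + fused)              # residual fusion


def haar_dwt(x):
    """One-level Haar DWT; returns (LL, LH, HL, HH) sub-bands at half resolution."""
    b, c, h, w = x.shape
    filt = x.new_tensor([[[0.5, 0.5], [0.5, 0.5]],       # LL (low-pass)
                         [[0.5, 0.5], [-0.5, -0.5]],     # LH (horizontal detail)
                         [[0.5, -0.5], [0.5, -0.5]],     # HL (vertical detail)
                         [[0.5, -0.5], [-0.5, 0.5]]])    # HH (diagonal detail)
    filt = filt.unsqueeze(1).repeat(c, 1, 1, 1)          # (4c, 1, 2, 2): 4 filters per channel
    out = F.conv2d(x, filt, stride=2, groups=c)          # (B, 4c, H/2, W/2)
    out = out.view(b, c, 4, h // 2, w // 2)
    return out[:, :, 0], out[:, :, 1], out[:, :, 2], out[:, :, 3]


def wavelet_loss(pred, target, hf_weight: float = 1.0):
    """L1 distance between wavelet sub-bands, up-weighting the high-frequency ones."""
    weights = [1.0] + [hf_weight] * 3                    # LL, then LH/HL/HH
    loss = 0.0
    for (p, t), w in zip(zip(haar_dwt(pred), haar_dwt(target)), weights):
        loss = loss + w * F.l1_loss(p, t)
    return loss
```

In the paper these ideas sit inside a larger pipeline (NeRF-based encoder, 2D CNN rendering decoder, multi-level SyncNet loss); the sketch only isolates the two components the abstract describes, under the stated assumptions.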
KSP Keywords
Audio Representation, Audio-visual, Current state, High fidelity, High frequency (HF), Multi-level, Real-world applications, Structural information, Talking head, Visual Representation, Visual quality