Exploiting Temporal Audio-Visual Correlation Embedding for Audio-Driven One-Shot Talking Head Animation Manipulation

Supplementary Material

Comparison results on the Celeb1 dataset.
Comparison results on the Celeb2 dataset.
Comparison results on the HDTF dataset.
Comparison results on the LRW dataset.

Comparison results of state-of-the-art methods and our TAVCE on the Celeb1 [1] dataset:

Each video displays six columns: Source, Audio2Head, MakeItTalk, StyleHEAT, SadTalker, and our TAVCE.

Comparison results of state-of-the-art methods and our TAVCE on the Celeb2 [2] dataset:

Each video displays six columns: Source, Audio2Head, MakeItTalk, StyleHEAT, SadTalker, and our TAVCE.

Comparison results of state-of-the-art methods and our TAVCE on the HDTF [3] dataset:

Each video displays six columns: Source, Audio2Head, MakeItTalk, StyleHEAT, SadTalker, and our TAVCE.

Comparison results of state-of-the-art methods and our TAVCE on the LRW [4] dataset:

Each video displays six columns: Source, Audio2Head, MakeItTalk, StyleHEAT, SadTalker, and our TAVCE.

[1] Nagrani A, Chung J S, Zisserman A. VoxCeleb: a large-scale speaker identification dataset[C]//Interspeech 2017, 2017.

[2] Chung J S, Nagrani A, Zisserman A. VoxCeleb2: Deep speaker recognition[C]//Interspeech 2018, 2018.

[3] Zhang Z, Li L, Ding Y, et al. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 3661-3670.

[4] Chung J S, Zisserman A. Lip reading in the wild[C]//Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016.