Exploiting Temporal Audio-Visual Correlation Embedding for Audio-Driven One-Shot Talking Head Animation Manipulation
Supplementary Material
Comparison results of state-of-the-art methods and our TAVCE on the VoxCeleb1 [1] dataset:
Each video displays six columns: Source, Audio2Head, MakeItTalk, StyleHEAT, SadTalker, and our TAVCE.
Comparison results of state-of-the-art methods and our TAVCE on the VoxCeleb2 [2] dataset:
Each video displays six columns: Source, Audio2Head, MakeItTalk, StyleHEAT, SadTalker, and our TAVCE.
Comparison results of state-of-the-art methods and our TAVCE on the HDTF [3] dataset:
Each video displays six columns: Source, Audio2Head, MakeItTalk, StyleHEAT, SadTalker, and our TAVCE.
Comparison results of state-of-the-art methods and our TAVCE on the LRW [4] dataset:
Each video displays six columns: Source, Audio2Head, MakeItTalk, StyleHEAT, SadTalker, and our TAVCE.
[1] Nagrani A, Chung J S, Zisserman A. VoxCeleb: a large-scale speaker identification dataset[C]//Interspeech 2017. 2017.
[2] Chung J S, Nagrani A, Zisserman A. VoxCeleb2: Deep speaker recognition[C]//Interspeech 2018. 2018.
[3] Zhang Z, Li L, Ding Y, et al. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 3661-3670.
[4] Chung J S, Zisserman A. Lip reading in the wild[C]//Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016.