
Audio-driven Talking Head Animation

Friday, 05/29/2020 3:00pm to 4:30pm
Zoom Meeting
PhD Dissertation Proposal Defense
Speaker: Yang Zhou

Zoom Meeting: https://umass-amherst.zoom.us/j/3151140144

Abstract

Generating believable facial animations is a fundamental problem in computer graphics, with a diverse set of applications ranging from entertainment (e.g., films, games) and medicine (e.g., facial therapy and prosthetics) to mixed reality and education (e.g., language/speech training and cyber-assistants). All of these applications are empowered by the ability to model, simulate, and animate the faces of characters, human or non-human, convincingly. Existing key-framing and performance-capture approaches for creating facial animations are either laborious or hard to edit. In particular, automatically producing expressive animations from input speech remains an open challenge.

In this thesis, we propose novel deep learning-based approaches to produce audio-driven talking head animations for character face rigs and portrait images, ranging from artistic cartoons to real human faces. First, we propose a neural network architecture that automatically animates an input face rig using audio as input. The network has three stages: one that learns to predict a sequence of phoneme groups from audio; another that learns to predict the geometric locations of important facial landmarks from audio; and a final stage that combines the outcomes of the previous stages to produce animation motion curves for FACS-based face rigs. Second, we propose a method that takes as input a still image of a face along with audio and produces animated facial landmarks tailored to the input face and synchronized with the input audio. In addition, it generates whole-head motion dynamics matching the stresses and pauses of the audio. Finally, we apply an image-to-image translation network to produce the final talking head animation from the predicted landmarks. The key insight of our method is to disentangle the content and speaker identity in the input audio signal and drive the animation from the resulting disentangled representations. The content is used for robust synchronization of the lips and nearby facial regions. The speaker information is used to capture the rest of the facial expressions and the head motion dynamics that are important for generating expressive talking head animations. We also show that our approach generalizes to new audio clips and face images not seen during training. Both proposed methods lead to much more expressive animations with higher overall quality than the state of the art.
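To make the second pipeline more concrete, below is a minimal PyTorch-style sketch of its overall structure as described above: a content encoder and a speaker encoder disentangle the audio, and a landmark predictor maps their outputs, together with the landmarks of the input portrait, to per-frame landmark motion. All module names, dimensions, and layer choices here are illustrative assumptions, not the actual thesis implementation.

```python
# Hypothetical sketch of the audio-disentanglement pipeline: content encoder,
# speaker encoder, and landmark predictor. Dimensions and layers are assumptions.
import torch
import torch.nn as nn


class ContentEncoder(nn.Module):
    """Encodes per-frame audio features into a speaker-agnostic content sequence."""
    def __init__(self, audio_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden, batch_first=True)

    def forward(self, audio):                # audio: (B, T, audio_dim)
        out, _ = self.rnn(audio)
        return out                           # (B, T, hidden)


class SpeakerEncoder(nn.Module):
    """Summarizes the clip into a single speaker-style embedding (mean pooling here)."""
    def __init__(self, audio_dim=80, embed=128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(audio_dim, embed), nn.ReLU(),
                                  nn.Linear(embed, embed))

    def forward(self, audio):                # audio: (B, T, audio_dim)
        return self.proj(audio.mean(dim=1))  # (B, embed)


class LandmarkPredictor(nn.Module):
    """Predicts per-frame displacements of 68 2D facial landmarks from the content
    sequence, the speaker embedding, and the landmarks of the input portrait."""
    def __init__(self, hidden=256, embed=128, n_landmarks=68):
        super().__init__()
        in_dim = hidden + embed + n_landmarks * 2
        self.mlp = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, n_landmarks * 2))

    def forward(self, content, speaker, ref_landmarks):
        B, T, _ = content.shape
        spk = speaker.unsqueeze(1).expand(-1, T, -1)           # broadcast over time
        ref = ref_landmarks.flatten(1).unsqueeze(1).expand(-1, T, -1)
        x = torch.cat([content, spk, ref], dim=-1)
        return self.mlp(x).view(B, T, -1, 2)                   # (B, T, 68, 2)


if __name__ == "__main__":
    audio = torch.randn(1, 100, 80)        # 100 frames of (assumed) mel features
    ref_landmarks = torch.randn(1, 68, 2)  # landmarks detected on the portrait
    content = ContentEncoder()(audio)
    speaker = SpeakerEncoder()(audio)
    displacements = LandmarkPredictor()(content, speaker, ref_landmarks)
    print(displacements.shape)             # torch.Size([1, 100, 68, 2])
```

In the actual system the predicted landmark sequence would then be passed, together with the portrait image, to an image-to-image translation network that renders the final talking head frames; that rendering stage is omitted from this sketch.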

Advisor: Evangelos Kalogerakis