Audio-Driven Character Animation

Friday, 05/14/2021 2:00pm to 4:00pm
Zoom Meeting
PhD Thesis Defense
Speaker: Yang Zhou

Zoom Meeting: https://umass-amherst.zoom.us/j/3151140144

Meeting ID: 315 114 0144

Abstract:

Generating believable character animations is a fundamentally important problem in computer graphics and computer vision. It also has a diverse set of applications, ranging from entertainment (e.g., films, games) and medicine (e.g., facial therapy and prosthetics) to mixed reality and education (e.g., language/speech training and cyber-assistants). All of these applications are empowered by the ability to model and animate characters, human or non-human, convincingly. Existing key-framing and performance-capture approaches for creating animations, especially facial animations, are either laborious or hard to edit. In particular, automatically producing expressive animations from input speech remains an open challenge.

In this thesis, we propose novel deep-learning-based approaches to produce speech-audio-driven character animations, including talking-head animations for character face rigs and portrait images, and reenacted gesture animations for natural human videos.

First, we propose a neural network architecture that automatically animates an input face rig from audio. The network has three stages: one that learns to predict a sequence of phoneme groups from the audio; another that learns to predict the geometric locations of important facial landmarks from the audio; and a final stage that combines the outcomes of the previous stages to produce animation motion curves for FACS-based (Facial Action Coding System) face rigs.
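To make the three-stage structure concrete, here is a minimal, hypothetical PyTorch sketch of such a pipeline. The module choices, feature dimensions, and the phoneme-group and FACS-control counts are illustrative assumptions, not the thesis implementation.

# Hypothetical sketch of a three-stage audio-to-rig pipeline (all names,
# dimensions, and counts are illustrative assumptions).
import torch
import torch.nn as nn

class AudioToFACSCurves(nn.Module):
    def __init__(self, audio_dim=80, hidden=256,
                 num_phoneme_groups=20, num_landmarks=68, num_facs_controls=46):
        super().__init__()
        # Stage 1: predict a per-frame phoneme-group sequence from audio features.
        self.phoneme_net = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.phoneme_head = nn.Linear(hidden, num_phoneme_groups)
        # Stage 2: predict 2D facial-landmark locations from audio features.
        self.landmark_net = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.landmark_head = nn.Linear(hidden, num_landmarks * 2)
        # Stage 3: combine both predictions into FACS-based motion curves.
        self.curve_net = nn.Sequential(
            nn.Linear(num_phoneme_groups + num_landmarks * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_facs_controls),
        )

    def forward(self, audio_feats):                          # (B, T, audio_dim)
        ph_hidden, _ = self.phoneme_net(audio_feats)
        phoneme_logits = self.phoneme_head(ph_hidden)        # (B, T, groups)
        lm_hidden, _ = self.landmark_net(audio_feats)
        landmarks = self.landmark_head(lm_hidden)            # (B, T, 2 * landmarks)
        combined = torch.cat([phoneme_logits.softmax(-1), landmarks], dim=-1)
        curves = self.curve_net(combined)                    # (B, T, controls)
        return phoneme_logits, landmarks, curves

# Usage: per-frame audio features in, per-frame rig control curves out.
model = AudioToFACSCurves()
dummy_audio = torch.randn(1, 100, 80)      # 100 frames of 80-dim audio features
_, _, motion_curves = model(dummy_audio)
print(motion_curves.shape)                 # torch.Size([1, 100, 46])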

Second, we propose a method that takes as input a portrait image of a face along with audio, and produces an expressive, synchronized talking-head animation. The portrait image can range from artistic cartoons to real human faces. In addition, our method generates whole-head motion dynamics that match the stresses and pauses in the audio. The key insight of our method is to disentangle the content and the speaker identity in the input audio signal, and to drive the animation from both. The content is used for robust synchronization of the lips and nearby facial regions. The speaker information is used to capture the rest of the facial expressions and the head motion dynamics that are important for generating expressive talking-head animations. Both proposed methods produce considerably more expressive talking-head animations, with higher overall quality than the state of the art.
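The following is a minimal, hypothetical sketch of driving animation from disentangled content and speaker-identity codes. The encoder/decoder choices, embedding sizes, and the landmark-motion output are assumptions for illustration, not the thesis architecture.

# Hypothetical sketch: per-frame content codes drive lip sync, a single
# speaker embedding drives expression/head-motion style (all shapes assumed).
import torch
import torch.nn as nn

class DisentangledTalkingHead(nn.Module):
    def __init__(self, audio_dim=80, content_dim=128, speaker_dim=64,
                 landmark_dim=68 * 2):
        super().__init__()
        # Content encoder: captures what is being said, for lip synchronization.
        self.content_enc = nn.GRU(audio_dim, content_dim, batch_first=True)
        # Speaker encoder: an utterance-level embedding of identity-specific style.
        self.speaker_enc = nn.Sequential(nn.Linear(audio_dim, speaker_dim), nn.ReLU())
        # Decoder: predicts per-frame landmark motion for the portrait.
        self.decoder = nn.GRU(content_dim + speaker_dim, 256, batch_first=True)
        self.head = nn.Linear(256, landmark_dim)

    def forward(self, audio_feats):                        # (B, T, audio_dim)
        content, _ = self.content_enc(audio_feats)         # per-frame content codes
        speaker = self.speaker_enc(audio_feats.mean(1))    # (B, speaker_dim)
        speaker = speaker.unsqueeze(1).expand(-1, content.size(1), -1)
        hidden, _ = self.decoder(torch.cat([content, speaker], dim=-1))
        return self.head(hidden)                           # predicted landmark motion

model = DisentangledTalkingHead()
motion = model(torch.randn(1, 100, 80))
print(motion.shape)                                        # torch.Size([1, 100, 136])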

Lastly, beyond facial animation, we propose a method that generates speech-gesture animation by reenacting a given video to match a target speech audio. The key idea is to split and re-assemble clips from an existing reference video through a novel video motion graph that encodes valid transitions between clips. To seamlessly connect different clips in the reenactment, we propose a pose-aware video blending network that synthesizes video frames around the stitched frames between two clips. Moreover, we develop an audio-based gesture search algorithm to find the optimal order of the reenacted frames. Our system generates reenactments that are consistent with both the audio rhythms and the speech content. Our synthesized videos demonstrate much higher quality and consistency with the target audio than previous work and baselines.
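As a rough illustration of the motion-graph idea, here is a small, hypothetical NumPy sketch: clips become graph nodes, edges connect clips whose boundary poses are close, and a greedy search picks, at each step, the neighboring clip whose features best match the next target-audio segment. The pose-distance test, the audio-matching cost, and the greedy strategy are simplifying assumptions, not the thesis algorithm.

# Hypothetical video-motion-graph sketch with a greedy audio-driven walk.
import numpy as np

def build_motion_graph(clip_end_poses, clip_start_poses, threshold=0.5):
    """Edge (i, j) exists when clip i can transition into clip j, i.e. the
    pose at the end of i is close enough to the pose at the start of j."""
    n = len(clip_end_poses)
    graph = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(n):
            if i != j and np.linalg.norm(clip_end_poses[i] - clip_start_poses[j]) < threshold:
                graph[i].append(j)
    return graph

def reenact(graph, clip_audio_feats, target_audio_segments, start_clip=0):
    """Greedily walk the graph, at each step choosing the neighboring clip
    whose features best match the next target-audio segment."""
    order = [start_clip]
    for segment in target_audio_segments:
        candidates = graph[order[-1]] or list(graph.keys())
        costs = [np.linalg.norm(clip_audio_feats[c] - segment) for c in candidates]
        order.append(candidates[int(np.argmin(costs))])
    return order

# Toy example: 4 clips with random pose/audio features, 3 target segments.
rng = np.random.default_rng(0)
ends, starts = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
graph = build_motion_graph(ends, starts, threshold=4.0)
clip_feats = rng.normal(size=(4, 16))
segments = rng.normal(size=(3, 16))
print(reenact(graph, clip_feats, segments))   # ordered clip indices to stitch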


Advisor: Evangelos Kalogerakis