FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation

Cheng Peng 1,2*,   Zhuo Su 2*†,   Liao Wang 2*,   Chen Guo 1,2,   Zhaohu Li 2,   Chengjiang Long 2,   Zheng Lv 2,   Jingxiang Sun 1,   Chenyangguang Zhang 1,   Yebin Liu 1†

1 Tsinghua University,   2 ByteDance

*Equal contribution    †Corresponding author

Abstract

We present FlexAvatar, a flexible large reconstruction model for high-fidelity 3D head avatars with detailed dynamic deformation from single or sparse images, without requiring camera poses or expression labels. It leverages a transformer-based reconstruction model with structured head query tokens as canonical anchors to aggregate a flexible number of inputs, free of camera poses and expression labels, into a robust canonical 3D representation. For detailed dynamic deformation, we introduce a lightweight UNet decoder conditioned on UV-space position maps, which produces detailed expression-dependent deformations in real time. To better capture rare but critical expression details such as wrinkles and bared teeth, we adopt a data distribution adjustment strategy during training to balance these expressions in the training set. Moreover, an optional, lightweight 10-second refinement can further enhance identity-specific details for extreme identities without affecting deformation quality. Extensive experiments demonstrate that FlexAvatar achieves superior 3D consistency and detailed dynamic realism compared with previous methods, providing a practical solution for animatable 3D avatar creation.
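The abstract mentions a data distribution adjustment strategy that rebalances rare expressions during training. The paper does not specify the exact mechanism, so the following is only a minimal, hypothetical sketch of one such rebalancing scheme, assuming PyTorch and a precomputed per-frame rarity flag:

```python
# Hypothetical rebalancing sketch: upweight rare-expression frames
# (e.g., wrinkles, bared teeth) so each batch sees them more often.
# This illustrates the idea, not the authors' implementation.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, is_rare, rare_boost=4.0, batch_size=8):
    """dataset: any map-style dataset; is_rare: list of bools, one per frame."""
    # Give rare frames a larger sampling weight; common frames keep weight 1.
    weights = torch.tensor([rare_boost if rare else 1.0 for rare in is_rare])
    sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```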

Method Overview

FlexAvatar reconstructs a high-quality Gaussian head avatar by mapping input images with varying expressions and camera views into Gaussian representations in UV space, through the following steps (a code sketch of the full pipeline follows the list):

  1. Use a flexible feed-forward backbone to obtain static Gaussian maps and an identity feature map from the input images.
  2. Convert the driving expression signal into a FLAME UV position map and concatenate it with the identity features.
  3. Feed the concatenated representation into a UNet to generate dynamic Gaussian attributes.
  4. Sample the attributes into FLAME space with LBS (linear blend skinning) for rendering.
  5. Optionally apply an efficient refinement to improve the results.
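As a rough illustration of steps 1 through 4, here is a hedged pipeline sketch in PyTorch. All module and function names (Backbone, DeformUNet, flame_uv_position_map, sample_uv_to_flame, lbs_pose, render_gaussians) are hypothetical placeholders for the components named in the list, not the released API:

```python
# Hedged sketch of the reenactment pipeline; names and shapes are illustrative.
import torch

@torch.no_grad()
def reenact(backbone, deform_unet, images, flame_params, camera):
    # Step 1: flexible feed-forward backbone -> static Gaussian maps
    # and an identity feature map, both in UV space.
    static_gaussians, identity_feat = backbone(images)

    # Step 2: driving expression -> FLAME UV position map, concatenated
    # channel-wise with the identity features.
    pos_map = flame_uv_position_map(flame_params)   # (3, H, W) UV positions
    cond = torch.cat([identity_feat, pos_map], dim=0)

    # Step 3: lightweight UNet predicts expression-dependent Gaussian
    # attributes, here treated as offsets on the static maps.
    dynamic_gaussians = static_gaussians + deform_unet(cond)

    # Step 4: sample the UV maps into FLAME space, articulate with LBS,
    # then splat the Gaussians for the target camera.
    gaussians = sample_uv_to_flame(dynamic_gaussians)
    gaussians = lbs_pose(gaussians, flame_params)
    return render_gaussians(gaussians, camera)
```

Treating the UNet output as an offset on the static maps is one plausible reading of "dynamic Gaussian attributes"; the paper may instead predict the attributes directly.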

Ultimately, our proposed FlexAvatar produces detailed, real-time 360° reenactment renderings.

Video