About Me

I am currently a Research Scientist at the Intelligent Creation Lab, ByteDance. I earned my Ph.D. in Computer Science from the Technical University of Munich, where I was mentored by Prof. Nils Thuerey. Prior to that, I obtained an M.Eng. degree from the School of Mechanical Science and Engineering at Huazhong University of Science and Technology, and a B.Eng. degree from the College of Mechanical and Electrical Engineering at Central South University.

My research interests lie in computer graphics and vision, video generation, and physics-based simulation.

Keywords of Recent Research

  • Virtual Human

  • Video Generation

  • MultiModal Large Language Model

  • Diffusion Model

Education

  1. Technical University of Munich

    2017 — 2022

    Ph.D., Department of Computer Science
    Research Topic: Video Generation; Deep Learning; Physics-based Simulation
    Advisor: Prof. Nils Thuerey

  2. Huazhong University of Science and Technology

    2015 — 2017

    Master, School of Mechanical Science and Engineering, CAD Center
    Research Topic: Multi-disciplinary Simulation and Optimization Algorithms
    Advisor: Prof. Yizhong Wu

  3. Central South University

    2011 — 2015

    Bachelor, College of Mechanical and Electrical Engineering
    Major: Mechanical Design & Manufacturing and Automation

Career

  1. TikTok, ByteDance

    2022.12 — now

    I work as a Research Scientist, focusing on cutting-edge video generation algorithms for virtual human-related applications.

  2. National University of Singapore

    2022.2 — 2022.4

    I worked as a Research Assistant in the CVML Lab under the guidance of Prof. Angela Yao, where I focused on developing generative algorithms for 3D hand reconstruction from 2D images.

  3. Bosch

    2014.8 — 2014.12

    I worked as an Assistant Engineer in the R&D department, specializing in the design and testing of starters.

Main Experience

Projects

  Google Scholar

  • X-Portrait 2: Highly Expressive Portrait Animation

    Xiaochen Zhao, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang, Xiu Li, Linjie Luo, Jinli Suo, Yebin Liu

    We introduce X-Portrait 2, which builds upon our previous work X-Portrait and brings the expressiveness of portrait animation to a whole new level. To achieve this, we build a state-of-the-art expression encoder model that implicitly encodes every minuscule expression from the input by training it on large-scale datasets. This encoder is then combined with powerful generative diffusion models to generate fluid and expressive videos. Our X-Portrait 2 model can transfer subtle and minuscule facial expressions from the actors as well as challenging expressions, including pouting, tongue-out, cheek-puffing, and frowning. High fidelity of emotion preservation can also be achieved in the generated videos.

  • X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention

    You Xie, Hongyi Xu, Guoxian Song, Chao Wang, Yichun Shi, Linjie Luo

    SIGGRAPH 2024

    We propose X-Portrait, an innovative conditional diffusion model tailored for generating expressive and temporally coherent portrait animation. Specifically, given a single portrait as appearance reference, we aim to animate it with motion derived from a driving video, capturing both highly dynamic and subtle facial expressions along with wide-range head movements. At its core, we leverage the generative prior of a pre-trained diffusion model as the rendering backbone, while achieving fine-grained head pose and expression control with novel controlling signals within the framework of ControlNet. In contrast to conventional coarse explicit controls such as facial landmarks, our motion control module learns to interpret the dynamics directly from the original driving RGB inputs. The motion accuracy is further enhanced with a patch-based local control module that effectively strengthens the motion attention to small-scale nuances like eyeball positions. Notably, to mitigate the identity leakage from the driving signals, we train our motion control modules with scaling-augmented cross-identity images, ensuring maximized disentanglement from the appearance reference modules. Experimental results demonstrate the universal effectiveness of X-Portrait across a diverse range of facial portraits and expressive driving sequences, and showcase its proficiency in generating captivating portrait animations with consistently maintained identity characteristics.
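
    As a toy illustration of the scaling-augmented cross-identity training mentioned above, the Python sketch below pairs an appearance reference with a driving frame from a different identity and applies a random scale jitter to the driving frame; the data layout (frames_by_identity), function name, and parameters are hypothetical and do not reflect the released implementation.

    import random
    import torch
    import torchvision.transforms.functional as TF

    def sample_cross_identity_pair(frames_by_identity, out_size=512, scale_range=(0.8, 1.2)):
        """Pick an appearance reference and a driving frame from two different
        identities, then scale-jitter the driving frame so the motion-control
        branch cannot rely on identity-specific shape cues.
        frames_by_identity maps an identity id to a list of (C, H, W) tensors
        (hypothetical data layout for this sketch)."""
        ids = list(frames_by_identity.keys())
        src_id, drv_id = random.sample(ids, 2)           # two distinct identities
        appearance = random.choice(frames_by_identity[src_id])
        driving = random.choice(frames_by_identity[drv_id])

        # Scaling augmentation: resize by a random factor, then crop/pad back
        # to the training resolution.
        s = random.uniform(*scale_range)
        _, h, w = driving.shape
        driving = TF.resize(driving, [int(h * s), int(w * s)], antialias=True)
        driving = TF.center_crop(driving, [out_size, out_size])
        appearance = TF.resize(appearance, [out_size, out_size], antialias=True)
        return appearance, driving

    In such a setup, the appearance frame would condition the reference branch, while the jittered driving frame feeds the motion control module.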

  • Dynamic Avatar in TikTok User Profile

    The algorithm developed by our team has been meticulously optimized and seamlessly integrated into TikTok, empowering users to create their own dynamic avatar profiles. This innovative feature enhances the user experience by making profile customization more engaging, interactive, and enjoyable for millions of users around the globe.
  • DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation

    Chenxu Zhang, Chao Wang, Jianfeng Zhang, Hongyi Xu, Guoxian Song, You Xie, Linjie Luo, Yapeng Tian, Xiaohu Guo, Jiashi Feng

    DREAM-Talk is a diffusion-based, audio-driven framework that generates emotionally expressive talking-face videos from a single portrait image, aiming at accurate lip synchronization together with realistic emotional dynamics.

  • DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis

    Yuming Gu, You Xie, Hongyi Xu, Guoxian Song, Yichun Shi, Di Chang, Jing Yang, Linjie Luo

    CVPR 2024 (Highlight)

    We present DiffPortrait3D, a conditional diffusion model that is capable of synthesizing 3D-consistent photo-realistic novel views from as few as a single in-the-wild portrait. Specifically, given a single RGB input, we aim to synthesize plausible but consistent facial details rendered from novel camera views while retaining both identity and facial expression. In lieu of time-consuming optimization and finetuning, our zero-shot method generalizes well to arbitrary face portraits with unposed camera views, extreme facial expressions, and diverse artistic depictions. At its core, we leverage the generative prior of 2D diffusion models pre-trained on large-scale image datasets as our rendering backbone, while the denoising is guided with disentangled attentive control of appearance and camera pose. To achieve this, we first inject the appearance context from the reference image into the self-attention layers of the frozen UNets. The rendering view is then manipulated with a novel conditional control module that interprets the camera pose by watching a condition image of a crossed subject from the same view. Furthermore, we insert a trainable cross-view attention module to enhance view consistency, which is further strengthened with a novel 3D-aware noise generation process during inference. We demonstrate state-of-the-art results both qualitatively and quantitatively on our challenging in-the-wild and multi-view benchmarks.
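
    To make the appearance-injection step above more concrete, here is a minimal, self-contained PyTorch sketch of a self-attention layer whose keys and values are extended with tokens from the reference portrait; the module, its arguments, and the token shapes are illustrative assumptions, not the paper's code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ReferenceSelfAttention(nn.Module):
        """Self-attention over denoiser tokens in which reference-portrait
        tokens are appended to the keys/values, so appearance context can be
        read at every spatial location (illustrative sketch)."""
        def __init__(self, dim, heads=8):
            super().__init__()
            self.heads = heads
            self.to_q = nn.Linear(dim, dim, bias=False)
            self.to_k = nn.Linear(dim, dim, bias=False)
            self.to_v = nn.Linear(dim, dim, bias=False)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x, ref):
            # x:   (B, N, C) tokens of the frame being denoised
            # ref: (B, M, C) tokens extracted from the reference portrait
            kv_in = torch.cat([x, ref], dim=1)            # keys/values see both
            q, k, v = self.to_q(x), self.to_k(kv_in), self.to_v(kv_in)
            B, N, C = q.shape
            h = self.heads
            q = q.view(B, N, h, C // h).transpose(1, 2)
            k = k.view(B, k.shape[1], h, C // h).transpose(1, 2)
            v = v.view(B, v.shape[1], h, C // h).transpose(1, 2)
            out = F.scaled_dot_product_attention(q, k, v)  # (B, h, N, C//h)
            out = out.transpose(1, 2).reshape(B, N, C)
            return self.proj(out)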

  • Doubao Avatar: Real-time Chatting Avatar on Mobile Devices

    Developed a real-time, audio-driven portrait animation algorithm for single-image inputs, seamlessly integrated into the Doubao App. This innovative solution generates lifelike talking animations using just one portrait image, featuring precise lip-sync, authentic facial expressions, and natural head movements.

  • Learning Temporal Coherence via Self-Supervision for GAN-based Video Generation (TecoGAN)

    Mengyu Chu, You Xie, Jonas Mayer, Laura Leal-Taixe, Nils Thuerey

    SIGGRAPH 2020

    Our work explores temporal self-supervision for GAN-based video generation tasks. While adversarial training successfully yields generative models for a variety of areas, temporal relationships in the generated data are much less explored. Natural temporal changes are crucial for sequential generation tasks, e.g. video super-resolution and unpaired video translation. For the former, state-of-the-art methods often favor simpler norm losses such as L^2 over adversarial training. However, their averaging nature easily leads to temporally smooth results with an undesirable lack of spatial detail. For unpaired video translation, existing approaches modify the generator networks to form spatio-temporal cycle consistencies. In contrast, we focus on improving learning objectives and propose a temporally self-supervised algorithm. For both tasks, we show that temporal adversarial learning is key to achieving temporally coherent solutions without sacrificing spatial detail. We also propose a novel Ping-Pong loss to improve the long-term temporal consistency. It effectively prevents recurrent networks from accumulating artifacts temporally without depressing detailed features. Additionally, we propose a first set of metrics to quantitatively evaluate the accuracy as well as the perceptual quality of the temporal evolution. A series of user studies confirm the rankings computed with these metrics.
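
    The Ping-Pong loss can be sketched in a few lines: run a recurrent generator forward over the sequence and then back again, and penalize any drift between the two outputs for the same frame. The generator interface assumed below, generator(frame, previous_output), is a simplification for illustration, not the released TecoGAN API.

    import torch

    def ping_pong_loss(generator, lr_frames):
        """Illustrative Ping-Pong loss over a list of low-resolution frames.
        The generator is assumed to accept (frame, previous_output) and to
        handle previous_output=None for the first step."""
        forward_out, prev = [], None
        for f in lr_frames:                               # forward sweep: 1 .. n
            prev = generator(f, prev)
            forward_out.append(prev)

        loss = 0.0
        for t in range(len(lr_frames) - 2, -1, -1):       # backward sweep: n-1 .. 1
            prev = generator(lr_frames[t], prev)
            # Outputs of the returning pass should match the forward pass,
            # which suppresses recurrent drift and accumulated artifacts.
            loss = loss + torch.mean((prev - forward_out[t]) ** 2)
        return loss / max(len(lr_frames) - 1, 1)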

  • A Multi-Pass GAN for Fluid Flow Super-Resolution

    Maximilian Werhahn, You Xie, Mengyu Chu, Nils Thuerey

    SCA 2019

    We propose a novel method to up-sample volumetric functions with generative neural networks using several orthogonal passes. Our method decomposes generative problems on Cartesian field functions into multiple smaller sub-problems that can be learned more efficiently. Specifically, we utilize two separate generative adversarial networks: the first one up-scales slices which are parallel to the XY-plane, whereas the second one refines the whole volume along the Z-axis working on slices in the YZ-plane. In this way, we obtain full coverage for the 3D target function and can leverage spatio-temporal supervision with a set of discriminators. Additionally, we demonstrate that our method can be combined with curriculum learning and progressive growing approaches. We arrive at a first method that can up-sample volumes by a factor of eight along each dimension, i.e., increasing the number of degrees of freedom by 512. Large volumetric up-scaling factors such as this one have previously not been attainable as the required number of weights in the neural networks renders adversarial training runs prohibitively difficult. We demonstrate the generality of our trained networks with a series of comparisons to previous work, a variety of complex 3D results, and an analysis of the resulting performance.
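
    A minimal sketch of the two orthogonal passes is given below, assuming net_xy is a 2D generator that up-scales height and width and net_z is a generator that up-samples only the depth axis; both networks and the permutation bookkeeping are stand-ins for the trained GAN generators.

    import torch

    def two_pass_upsample(volume, net_xy, net_z):
        """Illustrative two-pass, slice-wise up-sampling of a density volume
        of shape (1, D, H, W). net_xy is assumed to map (B, 1, H, W) ->
        (B, 1, s*H, s*W); net_z is assumed to up-sample only its first
        spatial axis, (B, 1, D, H') -> (B, 1, s*D, H')."""
        # Pass 1: up-scale every XY slice (one slice per depth index).
        xy = volume.permute(1, 0, 2, 3)          # (D, 1, H, W)
        xy = net_xy(xy)                          # (D, 1, sH, sW)
        vol = xy.permute(1, 0, 2, 3)             # (1, D, sH, sW)

        # Pass 2: re-slice along X and refine the depth axis on YZ slices.
        yz = vol.permute(3, 0, 1, 2)             # (sW, 1, D, sH)
        yz = net_z(yz)                           # (sW, 1, sD, sH)
        return yz.permute(1, 2, 3, 0)            # (1, sD, sH, sW)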

  • tempoGAN: A Temporally Coherent, Volumetric GAN for Super-resolution Fluid Flow

    You Xie, Erik Franz, Mengyu Chu, Nils Thuerey

    SIGGRAPH 2018

    We propose a temporally coherent generative model addressing the super-resolution problem for fluid flows. Our work represents a first approach to synthesize four-dimensional physics fields with neural networks. Based on a conditional generative adversarial network that is designed for the inference of three-dimensional volumetric data, our model generates consistent and detailed results by using a novel temporal discriminator, in addition to the commonly used spatial one. Our experiments show that the generator is able to infer more realistic high-resolution details by using additional physical quantities, such as low-resolution velocities or vorticities. Besides improvements in the training process and in the generated outputs, these inputs offer means for artistic control as well. We additionally employ a physics-aware data augmentation step, which is crucial to avoid overfitting and to reduce memory requirements. In this way, our network learns to generate advected quantities with highly detailed, realistic, and temporally coherent features. Our method works instantaneously, using only a single time-step of low-resolution fluid data. We demonstrate the abilities of our method using a variety of complex inputs and applications in two and three dimensions.
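
    The discriminator structure can be illustrated with a short sketch: a spatial discriminator scores single high-resolution frames, while a temporal discriminator scores three consecutive frames stacked along the channel axis. The advection-based alignment and the additional content and feature-space terms of the full method are omitted here, and the names and sigmoid-output assumption are for illustration only.

    import torch

    def gan_losses(D_s, D_t, generator, lr_seq, hr_seq):
        """Illustrative spatial + temporal GAN losses for a 3-frame window.
        lr_seq / hr_seq are lists of three consecutive low/high-resolution
        frames; D_s and D_t are assumed to output sigmoid scores."""
        fake = [generator(x) for x in lr_seq]            # 3 generated frames

        # Discriminator loss (generated frames detached so only D is updated).
        fake_det = [y.detach() for y in fake]
        d_loss = -(torch.log(D_s(hr_seq[1])).mean()
                   + torch.log(1 - D_s(fake_det[1])).mean()
                   + torch.log(D_t(torch.cat(hr_seq, dim=1))).mean()
                   + torch.log(1 - D_t(torch.cat(fake_det, dim=1))).mean())

        # Generator loss: fool both the spatial and the temporal discriminator.
        g_loss = -(torch.log(D_s(fake[1])).mean()
                   + torch.log(D_t(torch.cat(fake, dim=1))).mean())
        return d_loss, g_loss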

  • Reviving Autoencoder Pretraining

    You Xie, Nils Thuerey

    Neural Computing and Applications (journal)

    The pressing need for pretraining algorithms has been diminished by numerous advances in terms of regularization, architectures, and optimizers. Despite this trend, we re-visit the classic idea of unsupervised autoencoder pretraining and propose a modified variant that relies on a full reverse pass trained in conjunction with a given training task. This yields networks that are as-invertible-as-possible, and share mutual information across all constrained layers. We additionally establish links between singular value decomposition and pretraining and show how it can be leveraged for gaining insights about the learned structures. Most importantly, we demonstrate that our approach yields an improved performance for a wide variety of relevant learning and transfer tasks ranging from fully connected networks over residual neural networks to generative adversarial networks. Our results demonstrate that unsupervised pretraining has not lost its practical relevance in today’s deep learning environment.
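
    A minimal sketch of the full reverse pass: each layer is reused with transposed weights on the way back, and a reconstruction term is added to the task loss so the network stays as-invertible-as-possible. The architecture, the 0.1 weighting, and the random data are placeholders, not the paper's setup.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ReversibleMLP(nn.Module):
        """Small MLP whose reverse pass re-uses the transposed forward
        weights, so a reconstruction loss on the reverse pass pushes the
        layers towards invertibility (a sketch of the idea only)."""
        def __init__(self, dims=(784, 256, 64, 10)):
            super().__init__()
            self.layers = nn.ModuleList(
                [nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)])

        def forward(self, x):
            for i, layer in enumerate(self.layers):
                x = layer(x)
                if i < len(self.layers) - 1:
                    x = torch.relu(x)
            return x

        def reverse(self, y):
            # Walk back through the layers with transposed weights.
            for i, layer in enumerate(list(self.layers)[::-1]):
                y = F.linear(y, layer.weight.t())
                if i < len(self.layers) - 1:
                    y = torch.relu(y)
            return y

    # Joint objective: task loss on the forward pass plus a reconstruction
    # loss on the reverse pass (weighting and random data are placeholders).
    model = ReversibleMLP()
    x = torch.rand(8, 784)
    target = torch.randint(0, 10, (8,))
    logits = model(x)
    loss = F.cross_entropy(logits, target) + 0.1 * F.mse_loss(model.reverse(logits), x)
    loss.backward()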

  • TemporalUV: Capturing Loose Clothing with Temporally Coherent UV Coordinates

    You Xie, Huiqi Mao, Angela Yao, Nils Thuerey

    CVPR 2022

    We propose a novel approach to generate temporally coherent UV coordinates for loose clothing. Our method is not constrained by human body outlines and can capture loose garments and hair. We implemented a differentiable pipeline to learn UV mapping between a sequence of RGB inputs and textures via UV coordinates. Instead of treating the UV coordinates of each frame separately, our data generation approach connects all UV coordinates via feature matching for temporal stability. Subsequently, a generative model is trained to balance the spatial quality and temporal stability. It is driven by supervised and unsupervised losses in both UV and image spaces. Our experiments show that the trained models output high-quality UV coordinates and generalize to new poses. Once a sequence of UV coordinates has been inferred by our model, it can be used to flexibly synthesize new looks and modified visual styles. Compared to existing methods, our approach reduces the computational workload to animate new outfits by several orders of magnitude.
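
    The differentiable core of such a UV pipeline is the texture lookup, sketched below with torch.nn.functional.grid_sample; the (B, H, W, 2) layout and the [0, 1] UV convention are assumptions for this illustration rather than the paper's exact definitions.

    import torch
    import torch.nn.functional as F

    def render_from_uv(texture, uv):
        """Look up a texture at per-pixel UV coordinates.
        texture: (B, C, Ht, Wt); uv: (B, H, W, 2) in [0, 1] (assumed)."""
        grid = uv * 2.0 - 1.0                    # grid_sample expects [-1, 1]
        return F.grid_sample(texture, grid, mode='bilinear',
                             padding_mode='border', align_corners=False)

    # Example: resample a random 'texture' on a regular 128x128 UV grid.
    tex = torch.rand(1, 3, 256, 256)
    ys, xs = torch.meshgrid(torch.linspace(0, 1, 128),
                            torch.linspace(0, 1, 128), indexing='ij')
    uv = torch.stack([xs, ys], dim=-1).unsqueeze(0)   # (1, 128, 128, 2)
    image = render_from_uv(tex, uv)                   # (1, 3, 128, 128)

    Because the lookup is differentiable, losses defined in image space can propagate back to the predicted UV coordinates, which is what allows a new texture or style to be swapped in after the coordinates have been inferred.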

  • UV-Based 3D Hand-Object Reconstruction with Grasp Optimization

    Ziwei Yu, Linlin Yang, You Xie, Ping Chen, Angela Yao

    BMVC 2022 (Spotlight)

    We propose a novel framework for 3D hand shape reconstruction and hand-object grasp optimization from a single RGB image. The representation of hand-object contact regions is critical for accurate reconstructions. Instead of approximating the contact regions with sparse points, as in previous works, we propose a dense representation in the form of a UV coordinate map. Furthermore, we introduce inference-time optimization to fine-tune the grasp and improve interactions between the hand and the object. Our pipeline increases hand shape reconstruction accuracy and produces a vibrant hand texture. Experiments on datasets such as Ho3D, FreiHAND, and DexYCB reveal that our proposed method outperforms the state-of-the-art.
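
    The inference-time grasp refinement can be sketched as a short gradient-descent loop that pulls predicted contact vertices onto the object surface while staying close to the initial prediction; the hand_model interface and the loss weights below are hypothetical simplifications, not the actual pipeline.

    import torch

    def refine_grasp(hand_params, hand_model, object_verts, steps=100, lr=1e-2):
        """Illustrative inference-time optimization of predicted hand
        parameters. hand_model(params) -> (verts (V, 3), contact_mask (V,))
        is an assumed interface (e.g., a MANO-style layer plus a dense
        UV-based contact prediction); object_verts has shape (M, 3)."""
        params = hand_params.detach().clone().requires_grad_(True)
        init = params.detach().clone()
        opt = torch.optim.Adam([params], lr=lr)
        for _ in range(steps):
            verts, contact_mask = hand_model(params)
            # Distance from each hand vertex to its nearest object vertex.
            d = torch.cdist(verts, object_verts).min(dim=1).values   # (V,)
            contact_loss = (contact_mask * d).mean()   # pull contacts onto the object
            reg_loss = ((params - init) ** 2).mean()   # stay near the prediction
            loss = contact_loss + 0.1 * reg_loss
            opt.zero_grad()
            loss.backward()
            opt.step()
        return params.detach()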