DiffuEraser: A Diffusion Model for Video Inpainting

Tongyi Lab, Alibaba Group
TECHNICAL REPORT

Abstract

Recent video inpainting algorithms integrate flow-based pixel propagation with transformer-based generation to leverage optical flow for restoring textures and objects using information from neighboring frames, while completing masked regions through visual Transformers. However, these approaches often encounter blurring and temporal inconsistencies when dealing with large masks, highlighting the need for models with enhanced generative capabilities. Recently, diffusion models have emerged as a prominent technique in image and video generation due to their impressive performance. In this paper, we introduce DiffuEraser, a video inpainting model based on stable diffusion, designed to fill masked regions with greater details and more coherent structures. We incorporate prior information to provide initialization and weak conditioning, which helps mitigate noisy artifacts and suppress hallucinations. Additionally, to improve temporal consistency during long-sequence inference, we expand the temporal receptive fields of both the prior model and DiffuEraser, and further enhance consistency by leveraging the temporal smoothing property of Video Diffusion Models. Experimental results demonstrate that our proposed method outperforms state-of-the-art techniques in both content completeness and temporal consistency while maintaining acceptable efficiency.

Method


Overview of the proposed video inpainting model DiffuEraser, based on Stable Diffusion. The main denoising UNet performs the denoising process to generate the final output. The BrushNet branch extracts features from masked images, which are added to the main denoising UNet layer by layer after a zero convolution block. Temporal attention is incorporated after self-attention and cross-attention to improve temporal consistency.
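The data flow in the figure can be sketched in code. The following is a minimal PyTorch illustration, not the released implementation: `zero_conv`, `add_branch_features`, and `SpatioTemporalBlock` are hypothetical names, and the sketch only shows how BrushNet branch features could be merged into the main denoising UNet through zero-initialized convolutions, and where temporal attention sits relative to self- and cross-attention.

```python
import torch
import torch.nn as nn


def zero_conv(channels):
    """1x1 convolution initialized to zero, so the branch contributes nothing at the start of training."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


def add_branch_features(unet_feats, branch_feats, zero_convs):
    """Add BrushNet-branch feature maps to the main denoising UNet layer by layer,
    each passed through its own zero convolution first."""
    return [u + zc(b) for u, b, zc in zip(unet_feats, branch_feats, zero_convs)]


class SpatioTemporalBlock(nn.Module):
    """Spatial self-attention and cross-attention, followed by temporal attention
    over the frame axis (the ordering described in the caption above)."""

    def __init__(self, dim, context_dim, num_heads=8):
        super().__init__()
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, kdim=context_dim,
                                                vdim=context_dim, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, context, num_frames):
        # x: (batch * frames, tokens, dim); context: (batch * frames, ctx_tokens, context_dim)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]

        h = self.norm2(x)
        x = x + self.cross_attn(h, context, context, need_weights=False)[0]

        # Temporal attention: regroup tokens so attention runs across frames at
        # each spatial location, which smooths the prediction over time.
        bf, n, d = x.shape
        b = bf // num_frames
        t = x.view(b, num_frames, n, d).permute(0, 2, 1, 3).reshape(b * n, num_frames, d)
        h = self.norm3(t)
        t = t + self.temporal_attn(h, h, h, need_weights=False)[0]
        x = t.reshape(b, n, num_frames, d).permute(0, 2, 1, 3).reshape(bf, n, d)
        return x
```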

We incorporate `prior` information to provide initialization and weak conditioning, which helps mitigate noisy artifacts and suppress hallucinations. Additionally, to improve temporal consistency during long-sequence inference, we expand the `temporal receptive fields` of both the prior model and DiffuEraser, and further enhance consistency by leveraging the temporal smoothing capabilities of Video Diffusion Models. Please read the paper for details.
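As a rough illustration of the prior injection, the sketch below assumes a diffusers-style VAE and noise scheduler; the function `init_latents_from_prior` and its details are hypothetical and only approximate the idea of using a conventional inpainter's output as initialization and weak conditioning, not the paper's exact procedure.

```python
import torch


@torch.no_grad()
def init_latents_from_prior(prior_frames, vae, scheduler, num_inference_steps=20):
    """Initialize the diffusion latents from a prior inpainting result rather than pure noise.

    `prior_frames`: output of a conventional video inpainter for the masked clip,
    shaped (frames, 3, H, W) and scaled to [-1, 1].
    """
    scheduler.set_timesteps(num_inference_steps)
    t_start = scheduler.timesteps[:1]  # the largest sampled timestep

    # Encode the prior frames into the VAE latent space (Stable Diffusion scaling).
    latents = vae.encode(prior_frames).latent_dist.sample()
    latents = latents * vae.config.scaling_factor

    # Forward-diffuse the prior latents to t_start: the heavy noise removes most
    # fine detail, so the prior acts as an initialization and weak condition that
    # guides structure without dictating texture, helping suppress hallucinations.
    noise = torch.randn_like(latents)
    return scheduler.add_noise(latents, noise, t_start)
```

In practice the prior could also be re-injected at selected denoising steps; this sketch covers only the latent initialization.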

Comparisons

Notice: Most of the videos are 10 seconds in length. DiffuEraser can generate temporally consistent results with sufficient detail and texture for long-sequence inference.

The Object and Background Are in Relative Motion

DiffuEraser can restore details and objects by leveraging information from adjacent frames while maintaining temporal consistency for long-sequence inference.

The Object and Background Are Relatively Static

DiffuEraser can leverage the robust generative capabilities of Stable Diffusion to create plausible content with enhanced details and textures, while minimizing hallucinations.

Complicated Cases

For complicated cases, DiffuEraser also generates temporally consistent results with enhanced detail and more complete structure for long-sequence inference, all without requiring a text prompt.

BibTeX

@misc{li2025diffueraserdiffusionmodelvideo,
      title={DiffuEraser: A Diffusion Model for Video Inpainting}, 
      author={Xiaowen Li and Haolan Xue and Peiran Ren and Liefeng Bo},
      year={2025},
      eprint={2501.10018},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.10018}, 
}