One Diffusion to Generate Them All
- Duong H. Le*1
- Tuan Pham*2
- Sangho Lee1
- Christopher Clark1
- Aniruddha Kembhavi1
- Stephan Mandt2
- Ranjay Krishna1,3
- Jiasen Lu1
The demo video demonstrates the capabilities of OneDiffusion on several tasks.
Abstract
We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as depth estimation and segmentation. Additionally, OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs. Our model takes a straightforward yet effective approach by treating all tasks as frame sequences with varying noise scales during training, allowing any frame to act as a conditioning image at inference time. Our unified training framework removes the need for specialized architectures, supports scalable multi-task training, and adapts smoothly to any resolution, enhancing both generalization and scalability. Experimental results demonstrate competitive performance across both generation and prediction tasks, such as text-to-image, multi-view generation, ID preservation, depth estimation, and camera pose estimation, despite a relatively small training dataset.
Approach
In OneDiffusion, we frame image generation with multimodal conditions as a sequential modeling problem. By treating all tasks as frame sequences with different noise scales during training, our approach is both simple and highly effective. This design enables any frame to serve as a conditioning image during inference. To train our model, we create the One-Gen dataset, which integrates high-quality data from a variety of sources. It includes standard text-to-image (T2I) data along with synthetic outputs from state-of-the-art models to support tasks such as depth estimation, segmentation, and pose estimation. The dataset also incorporates data for ID customization and multi-view generation, providing diverse conditioning setups.
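To make the per-frame noise recipe concrete, here is a minimal training-step sketch in PyTorch. It is an illustration under stated assumptions, not the released implementation: the `model` call signature, the name `onediffusion_training_step`, and the flow-matching (linear interpolation) corruption are our assumptions, and the paper's actual schedule and prediction target may differ.

```python
import torch
import torch.nn.functional as F

def onediffusion_training_step(model, frames, text_emb):
    """Hypothetical sketch: each task is a sequence of latent "frames"
    (e.g. an RGB image plus its depth map, or several views of a scene),
    and every frame receives its own independently sampled noise level.

    frames:   (B, N, C, H, W) latent frames for one task
    text_emb: conditioning embedding for the task/caption prompt
    """
    B, N = frames.shape[:2]
    # Sample an independent noise level per frame: t = 0 is clean, t = 1 is pure noise.
    t = torch.rand(B, N, device=frames.device)
    noise = torch.randn_like(frames)
    t_ = t.view(B, N, 1, 1, 1)
    # Flow-matching style corruption (an assumption here); a DDPM-style
    # schedule would slot in the same way with different coefficients.
    noisy = (1.0 - t_) * frames + t_ * noise
    # The model sees the whole sequence plus per-frame noise levels and
    # predicts a per-frame target (here, the flow-matching velocity).
    pred = model(noisy, t, text_emb)
    target = noise - frames
    return F.mse_loss(pred, target)
```

At inference, the same interface covers every task: conditioning frames are passed with t = 0 and kept clean throughout sampling, while the frames to be generated start from pure noise, so image-to-depth and depth-to-image differ only in which frame is clamped.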
Results
Citation
Acknowledgements
The website template was borrowed from Jon Barron.