One Diffusion to Generate Them All
- Duong H. Le*1
- Tuan Pham*2
- Sangho Lee1
- Christopher Clark1
- Aniruddha Kembhavi1
- Stephan Mandt2
- Ranjay Krishna1,3
- Jiasen Lu1
The demo video demonstrates the capabilities of OneDiffusion on several tasks.
Abstract
We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as depth estimation and segmentation. Additionally, OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs. Our model takes a straightforward yet effective approach by treating all tasks as frame sequences with varying noise scales during training, allowing any frame to act as a conditioning image at inference time. Our unified training framework removes the need for specialized architectures, supports scalable multi-task training, and adapts smoothly to any resolution, enhancing both generalization and scalability. Experimental results demonstrate competitive performance across both generation and prediction tasks, such as text-to-image, multi-view generation, ID preservation, depth estimation, and camera pose estimation, despite a relatively small training dataset.
Approach
In OneDiffusion, we frame image generation with multimodal conditions as a sequential modeling problem. By treating all tasks as frame sequences with different noise scales during training, our approach is both simple and highly effective. This design enables any frame to serve as a conditioning image during inference. To train our model, we create the One-Gen dataset, which integrates high-quality data from a variety of sources. It includes standard text-to-image (T2I) data along with synthetic outputs from state-of-the-art models to support tasks such as depth estimation, segmentation, and pose estimation. The dataset also incorporates data for ID customization and multi-view generation, providing diverse conditioning setups.
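To make the per-frame noise recipe concrete, here is a minimal training-step sketch in PyTorch. It is an illustration under stated assumptions, not the released implementation: the `model` call signature, the name `onediffusion_training_step`, and the flow-matching (linear interpolation) corruption are our assumptions, and the paper's actual schedule and prediction target may differ.

```python
import torch
import torch.nn.functional as F

def onediffusion_training_step(model, frames, text_emb):
    """Hypothetical sketch: each task is a sequence of latent "frames"
    (e.g. an RGB image plus its depth map, or several views of a scene),
    and every frame receives its own independently sampled noise level.

    frames:   (B, N, C, H, W) latent frames for one task
    text_emb: conditioning embedding for the task/caption prompt
    """
    B, N = frames.shape[:2]
    # Sample an independent noise level per frame: t = 0 is clean, t = 1 is pure noise.
    t = torch.rand(B, N, device=frames.device)
    noise = torch.randn_like(frames)
    t_ = t.view(B, N, 1, 1, 1)
    # Flow-matching style corruption (an assumption here); a DDPM-style
    # schedule would slot in the same way with different coefficients.
    noisy = (1.0 - t_) * frames + t_ * noise
    # The model sees the whole sequence plus per-frame noise levels and
    # predicts a per-frame target (here, the flow-matching velocity).
    pred = model(noisy, t, text_emb)
    target = noise - frames
    return F.mse_loss(pred, target)
```

At inference, the same interface covers every task: conditioning frames are passed with t = 0 and kept clean throughout sampling, while the frames to be generated start from pure noise, so image-to-depth and depth-to-image differ only in which frame is clamped.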
Results
Citation
Acknowledgements
The website template was borrowed from Jon Barron.