One Diffusion to Generate Them All

1Allen Institute for AI      2University of California, Irvine      3University of Washington      * equal contribution

Demo video demonstrating the capabilities of OneDiffusion across several tasks.

Abstract

We introduce OneDiffusion, a versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps, while also handling tasks like image deblurring, upscaling, and reverse processes such as depth estimation and segmentation. Additionally, OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs. Our model takes a straightforward yet effective approach by treating all tasks as frame sequences with varying noise scales during training, allowing any frame to act as a conditioning image at inference time. Our unified training framework removes the need for specialized architectures, supports scalable multi-task training, and adapts smoothly to any resolution, enhancing both generalization and scalability. Experimental results demonstrate competitive performance across both generation and prediction tasks, such as text-to-image, multi-view generation, ID preservation, depth estimation, and camera pose estimation, despite a relatively small training dataset.

Approach

Pipeline of our approach

In OneDiffusion, we frame image generation with multimodal conditions as a sequential modeling problem. By treating all tasks as frame sequences with different noise scales during training, our approach is both simple and highly effective. This design enables any frame to serve as a conditioning image during inference. To train our model, we create the One-Gen dataset, which integrates high-quality data from a variety of sources: standard T2I data along with synthetic outputs from state-of-the-art models supporting tasks such as depth estimation, segmentation, and pose estimation. The dataset also incorporates data for ID customization and multi-view generation, providing diverse conditioning setups.
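The per-frame noise-scale idea can be sketched as follows. This is an illustrative NumPy mock-up under assumptions, not the authors' implementation: the denoiser interface, the cosine schedule, and all names here (`training_step`, `denoiser`, etc.) are hypothetical stand-ins. The key point it demonstrates is that each frame in a sequence receives an independently sampled noise level, so at inference a frame can be held clean (noise scale zero) to act as the condition.

```python
import numpy as np

def training_step(frames, denoiser, num_timesteps=1000, rng=None):
    """One training step on a batch of frame sequences.

    frames: array of shape (B, N, C, H, W) -- B sequences of N frames each.
    denoiser: callable(noisy_frames, timesteps) -> predicted noise,
              a hypothetical stand-in for the actual network.
    """
    rng = np.random.default_rng() if rng is None else rng
    B, N = frames.shape[:2]

    # Independent timestep (i.e. noise scale) per frame, not per sequence.
    t = rng.integers(0, num_timesteps, size=(B, N))
    noise = rng.standard_normal(frames.shape)

    # A simple cosine schedule (an assumption, for illustration only).
    alpha_bar = np.cos(t / num_timesteps * np.pi / 2) ** 2
    a = np.sqrt(alpha_bar)[..., None, None, None]        # broadcast over C, H, W
    s = np.sqrt(1.0 - alpha_bar)[..., None, None, None]

    # Each frame is corrupted according to its own noise scale.
    noisy = a * frames + s * noise

    # The model sees the per-frame noise levels and predicts the noise.
    pred = denoiser(noisy, t)
    return float(np.mean((pred - noise) ** 2))
```

At inference, the same mechanism yields conditioning for free: fix `t = 0` for the frames that serve as conditions (they stay clean) and run the reverse process only on the remaining frames.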

Results

Citation

Acknowledgements


The website template was borrowed from Jon Barron.