Monocular depth estimation (MDE) plays an important role in applications such as image and video editing, scene reconstruction, novel view synthesis, and robotic navigation. The task is ill-posed, however, because a single image carries an inherent scale-distance ambiguity, so learning-based methods must rely on robust semantic knowledge to overcome this limitation and achieve accurate results. Recent progress has seen large diffusion models adapted for MDE by treating depth prediction as a conditional image generation problem, but these models suffer from slow inference. The computational cost of repeatedly evaluating large neural networks during inference has become a major concern in the field.
Recently, various methods have been developed to address the challenges in MDE. Monocular depth estimation typically predicts relative, per-pixel depth, while metric depth estimation yields absolute distances, a richer representation that introduces additional complexity due to variations in camera focal length. Further, surface normal estimation has evolved from early learning-based approaches to sophisticated deep learning methods. Recently, diffusion models have been applied to geometry estimation, with some methods producing multi-view depth and normal maps for single objects. Scene-level depth estimation approaches like VPD have used Stable Diffusion, but generalization to complex, real-world environments remains a challenge.
Researchers from RWTH Aachen University and Eindhoven University of Technology presented a solution to the inefficiency of diffusion-based MDE. They traced the problem to a previously unnoticed flaw in the inference pipeline and fixed it; the corrected model performs comparably to the best-reported configurations while being 200 times faster. On top of this single-step model, they apply end-to-end fine-tuning with task-specific losses to further improve performance. The result is a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. Moreover, the same fine-tuning protocol works directly on Stable Diffusion, achieving performance comparable to state-of-the-art models.
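The article does not spell out the flaw in detail. As a hedged illustration, assuming the issue is of the kind where the DDIM scheduler's timestep spacing prevents a single denoising step from starting at the terminal noise level, the following diffusers snippet contrasts the default "leading" spacing with "trailing" spacing, which enables true single-step denoising. This is a minimal sketch under that assumption, not an excerpt of the authors' code.

```python
# Hedged sketch: how DDIM timestep spacing changes what a single denoising step sees.
# The exact fix in the paper is assumed to be of this kind; this is not the authors' code.
from diffusers import DDIMScheduler

# Default "leading" spacing: with one inference step the scheduler picks t = 0,
# so the model receives pure noise but is asked to take an (almost) no-op denoising step.
leading = DDIMScheduler(num_train_timesteps=1000, timestep_spacing="leading")
leading.set_timesteps(num_inference_steps=1)
print(leading.timesteps)   # tensor([0])

# "Trailing" spacing places the single step at the last training timestep (t = 999),
# so one pass of the U-Net maps pure noise directly to a clean (depth) latent.
trailing = DDIMScheduler(num_train_timesteps=1000, timestep_spacing="trailing")
trailing.set_timesteps(num_inference_steps=1)
print(trailing.timesteps)  # tensor([999])
```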
The proposed method uses two synthetic datasets for training: Hypersim for photorealistic indoor scenes and Virtual KITTI 2 for driving scenarios, both providing high-quality annotations. Evaluation covers a diverse set of benchmarks: NYUv2 and ScanNet for indoor environments, ETH3D and DIODE for mixed indoor-outdoor scenes, and KITTI for outdoor driving. The depth implementation builds on the official Marigold checkpoint, while normal estimation uses a similar setup that encodes normal maps as 3D vectors in the color channels. The team follows Marigold's hyperparameters, training all models for 20,000 iterations with the AdamW optimizer.
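As a rough, non-authoritative sketch of the training ingredients described above: unit surface normals can be fed to a Stable-Diffusion-style VAE directly as a three-channel "image" in [-1, 1], a task-specific depth loss is commonly an affine-invariant one, and the optimizer follows the stated AdamW, 20,000-iteration recipe. The learning rate, the particular loss, and the `model` placeholder below are illustrative assumptions, not details taken from the paper.

```python
# Non-authoritative sketch of the training ingredients described above.
# Learning rate, exact loss, and the `model` stand-in are illustrative assumptions.
import torch
import torch.nn.functional as F

def encode_normals(normals: torch.Tensor) -> torch.Tensor:
    """Pack per-pixel unit normals (B, 3, H, W) into the [-1, 1] three-channel range a
    Stable-Diffusion VAE expects; re-normalizing guards against non-unit inputs."""
    return F.normalize(normals, dim=1)

def decode_normals(image: torch.Tensor) -> torch.Tensor:
    """Recover unit normals from a decoded three-channel prediction."""
    return F.normalize(image, dim=1)

def affine_invariant_depth_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """L1 loss after least-squares scale/shift alignment, a common task-specific depth loss.
    pred, target: (B, H, W) depth maps; mask: (B, H, W) booleans marking valid pixels."""
    losses = []
    for p, t, m in zip(pred, target, mask):
        p_v, t_v = p[m], t[m]
        p_mean, t_mean = p_v.mean(), t_v.mean()
        # Closed-form least squares for scale s and shift b minimizing ||s*p + b - t||^2.
        scale = ((p_v - p_mean) * (t_v - t_mean)).sum() / ((p_v - p_mean) ** 2).sum().clamp(min=1e-6)
        shift = t_mean - scale * p_mean
        losses.append((scale * p_v + shift - t_v).abs().mean())
    return torch.stack(losses).mean()

# Optimizer and schedule length as stated above; the learning rate is a placeholder.
model = torch.nn.Conv2d(4, 4, 1)          # stand-in for the fine-tuned latent U-Net
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
num_iterations = 20_000
```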
The results show that Marigold's multi-step denoising does not behave as expected, with performance degrading as the number of denoising steps increases. The fixed DDIM scheduler delivers superior performance across all step counts. Comparisons between vanilla Marigold, its Latent Consistency Model variant, and the researchers' single-step models show that the fixed DDIM scheduler achieves comparable or better results in a single step without ensembling. The end-to-end fine-tuned Marigold goes further, outperforming all previous configurations, again in a single step and without ensembling. Surprisingly, fine-tuning Stable Diffusion directly yields results similar to those of the Marigold-pretrained model.
In summary, the researchers introduced a solution to the inefficiency of diffusion-based MDE by revealing a critical flaw in the DDIM scheduler implementation, a finding that challenges previous conclusions in diffusion-based monocular depth and normal estimation. They showed that simple end-to-end fine-tuning outperforms more complex training pipelines and architectures, while still supporting the hypothesis that diffusion pretraining provides excellent priors for geometric tasks. The resulting models enable accurate single-step inference and open the door to large-scale data and advanced self-training methods. These findings lay the groundwork for future diffusion models with reliable priors and improved performance in geometry estimation.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
The post Simplifying Diffusion Models: Fine-Tuning for Faster and More Accurate Depth Estimation appeared first on MarkTechPost.