
    Decoupled Diffusion Transformers: Accelerating High-Fidelity Image Generation via Semantic-Detail Separation and Encoder Sharing

    April 22, 2025

    Diffusion Transformers have demonstrated outstanding performance in image generation, surpassing traditional models such as GANs and autoregressive architectures. They operate by gradually adding noise to images during a forward diffusion process and then learning to reverse this process through denoising, which lets the model approximate the underlying data distribution. Unlike the commonly used UNet-based diffusion models, Diffusion Transformers build on the transformer architecture, which has proven effective given sufficient training. However, their training is slow and computationally intensive. A key limitation lies in the architecture itself: at every denoising step, the same modules must both encode low-frequency semantic information and decode high-frequency details, creating an optimization conflict between the two tasks.
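
    To make the forward process concrete, here is a minimal sketch of the noising step a diffusion transformer is trained to invert. It assumes the linear-interpolant (flow-matching style) formulation used by SiT-like models; the function name, tensor shapes, and schedule are illustrative rather than the paper's exact setup.

```python
import torch

def forward_diffuse(x0: torch.Tensor, t: torch.Tensor):
    """Illustrative forward process: interpolate between clean latents and noise.

    x0: clean latents, shape (B, C, H, W); t: timesteps in [0, 1], shape (B,).
    Assumes a linear interpolant (flow-matching style), not the paper's exact schedule.
    """
    noise = torch.randn_like(x0)
    t = t.view(-1, 1, 1, 1)
    x_t = (1.0 - t) * x0 + t * noise   # noisy sample at time t
    velocity = noise - x0              # target velocity field d(x_t)/dt
    return x_t, velocity

# Training sketch: the transformer regresses its output onto the velocity target.
# loss = ((model(x_t, t, class_labels) - velocity) ** 2).mean()
```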

    To address the slow training and performance bottlenecks, recent work has focused on improving the efficiency of Diffusion Transformers through several strategies. These include optimized attention mechanisms, such as linear and sparse attention, to reduce computational cost, and more effective sampling techniques, including log-normal resampling and loss reweighting, to stabilize learning. Methods such as REPA, RCG, and DoD incorporate domain-specific inductive biases, while masked modeling enforces structured feature learning, boosting the model’s reasoning capabilities. Models like DiT, SiT, SD3, Lumina, and PixArt have extended the diffusion transformer framework to areas such as text-to-image and text-to-video generation.

    Researchers from Nanjing University and ByteDance Seed Vision introduce the Decoupled Diffusion Transformer (DDT), which separates the model into a dedicated condition encoder for semantic extraction and a velocity decoder for detailed generation. This decoupled design enables faster convergence and improved sample quality. On the ImageNet 256×256 and 512×512 benchmarks, their DDT-XL/2 model achieves state-of-the-art FID scores of 1.31 and 1.28, respectively, with up to 4× faster training. To further accelerate inference, they propose a statistical dynamic programming method that optimally shares encoder outputs across denoising steps with minimal impact on performance.
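
    To make the decoupled design concrete, here is a structural sketch in PyTorch. The class name DecoupledDiT, the block counts, and the simple additive conditioning are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class DecoupledDiT(nn.Module):
    """Sketch of the decoupled idea: a condition encoder extracts semantic
    features z_t; a velocity decoder turns them into a velocity prediction.
    Block counts and conditioning details are placeholders."""

    def __init__(self, dim=768, enc_blocks=8, dec_blocks=4, num_classes=1000):
        super().__init__()
        self.label_emb = nn.Embedding(num_classes, dim)
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.encoder = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
             for _ in range(enc_blocks)])
        self.decoder = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
             for _ in range(dec_blocks)])
        self.out = nn.Linear(dim, dim)

    def encode(self, tokens, t, y):
        # Condition encoder: fold timestep and class label into the noisy tokens,
        # then extract low-frequency semantic features z_t.
        cond = self.time_mlp(t[:, None])[:, None, :] + self.label_emb(y)[:, None, :]
        h = tokens + cond
        for blk in self.encoder:
            h = blk(h)
        return h  # z_t

    def decode(self, tokens, z_t):
        # Velocity decoder: condition on z_t (here by simple addition) and
        # produce the high-frequency velocity prediction.
        h = tokens + z_t
        for blk in self.decoder:
            h = blk(h)
        return self.out(h)

    def forward(self, tokens, t, y):
        z_t = self.encode(tokens, t, y)
        return self.decode(tokens, z_t), z_t
```

    Returning z_t alongside the velocity prediction is what makes it possible to attach representation alignment or self-conditioning supervision to the encoder output during training.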

    The DDT introduces a condition encoder and a velocity decoder that handle the low- and high-frequency components of image generation separately. The encoder extracts semantic features z_t from the noisy inputs, timesteps, and class labels, which the decoder then uses to estimate the velocity field. To keep z_t consistent across steps, representation alignment and decoder supervision are applied. During inference, a shared self-condition mechanism reduces computation by reusing z_t at certain timesteps, and a dynamic programming approach identifies the optimal timesteps at which to recompute z_t, minimizing the performance loss while accelerating sampling.
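
    The inference-time sharing can be sketched as a plain Euler sampler that refreshes z_t only at selected steps, paired with a toy dynamic program for choosing those steps. The cost table, step budget, and function names are assumptions for illustration, not the paper's exact statistical formulation.

```python
import torch

@torch.no_grad()
def euler_sample_with_sharing(model, x, y, timesteps, recompute_steps):
    """Euler sampling from t=1 (noise) to t=0 (data) with encoder sharing:
    z_t is refreshed only at `recompute_steps`; the decoder runs every step."""
    z_t = None
    for i in range(len(timesteps) - 1):
        t, t_next = float(timesteps[i]), float(timesteps[i + 1])
        t_vec = torch.full((x.shape[0],), t, device=x.device)
        if z_t is None or i in recompute_steps:
            z_t = model.encode(x, t_vec, y)   # fresh semantic features
        v = model.decode(x, z_t)              # decoder-only work otherwise
        x = x + (t_next - t) * v              # Euler update along the velocity field
    return x

def best_recompute_steps(cost, num_steps, budget):
    """Toy dynamic program: cost[i][j] is a precomputed penalty for reusing the
    encoder output of step i over steps [i, j). Choose `budget` recompute points
    minimizing total penalty; this mirrors the idea, not the exact method."""
    INF = float("inf")
    dp = [[INF] * (num_steps + 1) for _ in range(budget + 1)]
    back = [[None] * (num_steps + 1) for _ in range(budget + 1)]
    dp[0][0] = 0.0
    for m in range(1, budget + 1):
        for j in range(1, num_steps + 1):
            for i in range(j):
                cand = dp[m - 1][i] + cost[i][j]
                if cand < dp[m][j]:
                    dp[m][j], back[m][j] = cand, i
    # Backtrack from the full horizon to recover the chosen recompute steps.
    steps, m, j = [], budget, num_steps
    while m > 0 and back[m][j] is not None:
        steps.append(back[m][j])
        j, m = back[m][j], m - 1
    return sorted(steps)
```

    With, say, timesteps = torch.linspace(1.0, 0.0, 51) and a small recompute budget, the encoder runs only a handful of times while the decoder still runs at all 50 steps.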

    The researchers trained their models on 256×256 ImageNet with a batch size of 256, without gradient clipping or warm-up. Using VAE-ft-EMA and Euler sampling, they evaluated performance with FID, sFID, IS, Precision, and Recall. They also built improved baselines with SwiGLU, RoPE, RMSNorm, and lognorm timestep sampling. Their DDT models consistently outperformed prior baselines, particularly at larger model sizes, and converged significantly faster than REPA. Further gains came from encoder-sharing strategies and careful tuning of the encoder-decoder ratio, yielding state-of-the-art FID scores on both 256×256 and 512×512 ImageNet.
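
    As one concrete example of these baseline tweaks, "lognorm sampling" is commonly implemented as the logit-normal timestep distribution popularized by SD3; the sketch below assumes that reading, with illustrative mean and std defaults.

```python
import torch

def sample_lognorm_timesteps(batch_size: int, mean: float = 0.0, std: float = 1.0):
    """Logit-normal timestep sampling (one common reading of 'lognorm sampling'):
    draw a Gaussian and squash it through a sigmoid so training concentrates on
    mid-range noise levels rather than the extremes."""
    u = torch.randn(batch_size) * std + mean
    return torch.sigmoid(u)  # timesteps in (0, 1), densest near 0.5
```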

    In conclusion, the study presents the DDT, which addresses the optimization challenge in traditional diffusion transformers by separating semantic encoding and high-frequency decoding into distinct modules. By scaling encoder capacity relative to the decoder, DDT achieves notable performance gains, especially in larger models. The DDT-XL/2 model sets new benchmarks on ImageNet, achieving faster training convergence and lower FID scores for both 256×256 and 512×512 resolutions. Additionally, the decoupled design enables encoder sharing across denoising steps, significantly improving inference efficiency. A dynamic programming strategy further enhances this by determining optimal sharing points, maintaining image quality while reducing computational load.


    Check out the Paper.


    The post Decoupled Diffusion Transformers: Accelerating High-Fidelity Image Generation via Semantic-Detail Separation and Encoder Sharing appeared first on MarkTechPost.
