Large language models (LLMs) have demonstrated consistent scaling laws, revealing a power-law relationship between pretraining performance and computational resources. Combined with the approximation C ≈ 6ND (where C is training compute in FLOPs, N is the number of model parameters, and D is the number of training tokens), these laws have proven invaluable for optimizing resource allocation and maximizing computational efficiency. However, the field of diffusion models, particularly diffusion transformers (DiT), lacks comparably comprehensive scaling laws. While larger diffusion models have shown improved visual quality and text-image alignment, the precise nature of their scaling properties remains unclear. This gap hinders the ability to accurately predict training outcomes, determine optimal model and data sizes for a given compute budget, and understand the relationships between training resources, model architecture, and performance. Consequently, researchers must rely on costly and potentially suboptimal heuristic configuration searches, impeding efficient progress in the field.
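As a quick worked illustration of that compute approximation (using hypothetical round numbers, not figures from the paper), a minimal Python sketch:

```python
# Hypothetical illustration of the C ≈ 6ND approximation; the values below
# are round placeholders, not numbers from the paper.
N = 1e9          # model parameters (1B)
D = 1e11         # training tokens (100B)
C = 6 * N * D    # approximate training compute in FLOPs
print(f"C ≈ {C:.1e} FLOPs")   # prints: C ≈ 6.0e+20 FLOPs
```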
Previous research has explored scaling laws in various domains, particularly in language models and autoregressive generative models. These studies have established predictable relationships between model performance, model size, and dataset size. In the realm of diffusion models, recent work has empirically demonstrated scaling properties, showing that larger compute budgets generally yield better models. Researchers have also compared scaling behaviors across different architectures and investigated sampling efficiency. However, the field lacks an explicit formulation of scaling laws for diffusion transformers that captures the relationships between compute budget, model size, data quantity, and loss. This gap has limited the ability to optimize resource allocation and predict performance in diffusion transformer models.
Researchers from Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong, ByteDance, and The University of Hong Kong characterize the scaling behavior of diffusion models for text-to-image synthesis, establishing explicit scaling laws for DiT. The study spans compute budgets from 1e17 to 6e18 FLOPs, training models ranging from 1M to 1B parameters. By fitting a parabola to the loss-versus-model-size (isoFLOP) curve at each compute budget, the compute-optimal configuration is identified, yielding power-law relationships between compute budget and optimal model size, consumed data, and training loss. The derived scaling laws are validated by extrapolating to higher compute budgets. The research also demonstrates that generation metrics such as FID follow similar power-law relationships, enabling synthesis quality to be predicted across various datasets.
The study explores scaling laws in diffusion transformers across compute budgets from 1e17 to 6e18 FLOPs. The researchers vary the depth of In-context Transformers from 2 to 15 layers, training with the AdamW optimizer under specific learning-rate schedules and hyperparameters. For each budget, they fit a parabola to the isoFLOP loss curve to identify the optimal loss, model size, and data allocation. Power-law relationships are then established between compute budget and optimal model size, data quantity, and loss. The fitted exponents reveal that the optimal model size grows slightly faster than the optimal data size as the training budget increases. To validate these laws, the researchers extrapolate to a 1.5e21 FLOPs budget and train a 958.3M-parameter model whose loss closely matches the prediction.
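A minimal sketch of this isoFLOP fitting procedure, using synthetic placeholder numbers rather than the paper's measurements, might look like the following:

```python
import numpy as np

# For one compute budget, fit a parabola to loss vs. log(model size) and take
# its minimum as the compute-optimal model size; repeating this across budgets
# yields points for power-law fits such as N_opt ∝ C^a.

def optimal_from_isoflop(model_sizes, losses):
    """Fit a parabola to loss vs. log10(N) and return (N_opt, loss_opt)."""
    x = np.log10(model_sizes)
    a, b, c = np.polyfit(x, losses, 2)   # quadratic coefficients
    x_opt = -b / (2 * a)                 # vertex of the parabola
    return 10 ** x_opt, np.polyval([a, b, c], x_opt)

# Hypothetical isoFLOP sweep at a single budget (placeholder values).
sizes = np.array([5e6, 2e7, 8e7, 3e8, 1e9])        # parameters
losses = np.array([0.52, 0.47, 0.44, 0.45, 0.49])  # training losses
n_opt, loss_opt = optimal_from_isoflop(sizes, losses)

# Given (budget, N_opt) pairs from several budgets, the scaling exponent is
# the slope of a straight-line fit in log-log space.
budgets = np.array([1e17, 3e17, 1e18, 6e18])       # FLOPs
n_opts = np.array([2e7, 5e7, 1.2e8, 4e8])          # placeholder optima
exponent, intercept = np.polyfit(np.log10(budgets), np.log10(n_opts), 1)
print(f"N_opt ≈ {10**intercept:.2e} * C^{exponent:.2f}")
```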
The study validates the scaling laws on out-of-domain data using the COCO 2014 validation set. Four metrics are evaluated on 10,000 data points: validation loss, the variational lower bound (VLB), exact likelihood, and Fréchet Inception Distance (FID). Results show consistent trends across both the Laion5B subset and the COCO validation set, with performance improving as the training budget increases. A vertical offset is observed between the two datasets, with COCO consistently showing higher values. This offset remains relatively constant for validation loss, VLB, and exact likelihood across budgets. For FID, the gap widens with increasing budget but still follows a power-law trend.
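As a hedged sketch of the kind of power-law extrapolation this enables for a metric such as FID (all numbers below are hypothetical placeholders, not results from the paper):

```python
import numpy as np

# Fit FID ≈ k * C^alpha in log-log space and extrapolate to a larger budget.
budgets = np.array([1e17, 3e17, 1e18, 6e18])   # training compute in FLOPs
fids = np.array([42.0, 33.0, 26.0, 19.0])      # made-up FID scores

alpha, log_k = np.polyfit(np.log(budgets), np.log(fids), 1)  # alpha < 0 here
predicted_fid = np.exp(log_k) * (1.5e21) ** alpha
print(f"Predicted FID at 1.5e21 FLOPs: {predicted_fid:.1f}")
```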
Scaling laws provide a robust framework for evaluating model and dataset quality. By analyzing isoFLOP curves at smaller compute budgets, researchers can assess the impact of modifications to model architecture or data pipeline. More efficient models exhibit lower model scaling exponents and higher data scaling exponents, while higher-quality datasets result in lower data scaling exponents and higher model scaling exponents. Improved training pipelines are reflected in smaller loss scaling exponents. The study compares In-Context and Cross-Attention Transformers, revealing that Cross-Attention Transformers achieve better performance with the same compute budget. This approach offers a reliable benchmark for evaluating design choices in model and data pipelines.
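As a toy illustration of this benchmark-style comparison, with made-up scaling exponents standing in for real isoFLOP fits:

```python
# Hypothetical exponent sets for a baseline and a modified design; these are
# placeholders, not the exponents reported in the paper.
baseline  = {"model": 0.60, "data": 0.40, "loss": 0.10}
candidate = {"model": 0.55, "data": 0.45, "loss": 0.09}

# Applying the rules stated above: a lower model-scaling exponent together
# with a higher data-scaling exponent indicates a more efficient model, and a
# smaller loss-scaling exponent indicates an improved training pipeline.
more_efficient_model = (candidate["model"] < baseline["model"]
                        and candidate["data"] > baseline["data"])
improved_pipeline = candidate["loss"] < baseline["loss"]

print("candidate architecture is more compute-efficient:", more_efficient_model)
print("candidate training pipeline is improved:", improved_pipeline)
```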
This study establishes scaling laws for DiT across a wide range of compute budgets. The research confirms a power-law relationship between pretraining loss and compute, enabling accurate predictions of optimal model size, data requirements, and performance. The scaling laws demonstrate robustness across different datasets and can predict image generation quality using metrics like FID. By comparing In-context and Cross-Attention Transformers, the study validates the use of scaling laws as a benchmark for evaluating model and data design. These findings provide valuable guidance for future developments in text-to-image generation using DiT, offering a framework for optimizing resource allocation and performance.