PyTorch has officially launched torchao, a comprehensive PyTorch-native library designed to optimize models for better performance and efficiency. The launch is a milestone in deep learning model optimization, giving users an accessible toolkit that leverages advanced techniques such as low-bit dtypes, quantization, and sparsity. The library is written predominantly in PyTorch code, making it easy for developers working on inference and training workloads to use and integrate.
Key Features of torchao
Provides comprehensive support for various generative AI models, such as Llama 3 and diffusion models, ensuring compatibility and ease of use.
Demonstrates impressive performance gains, achieving up to 97% speedup and significant reductions in memory usage during model inference and training.
Offers versatile quantization techniques, including low-bit dtypes like int4 and float8, to optimize models for inference and training.
Supports dynamic activation quantization and sparsity for various dtypes, enhancing the flexibility of model optimization.
Features Quantization Aware Training (QAT) to minimize accuracy degradation that can occur with low-bit quantization.
Provides easy-to-use workflows for low-precision compute and communication in training, compatible with PyTorch's `nn.Linear` layers.
Introduces experimental support for 8-bit and 4-bit optimizers, serving as a drop-in replacement for AdamW to optimize model training.
Seamlessly integrates with major open-source projects, such as HuggingFace transformers and diffusers, and serves as a reference implementation for accelerating models.
These key features establish torchao as a versatile and efficient deep-learning model optimization library.
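For orientation before the detailed sections that follow, the snippet below is a minimal sketch of one of the simplest entry points, torchao's autoquant, which benchmarks candidate quantization kernels per layer and applies the fastest ones. The toy model and input shapes are placeholders, and the usage follows the project's documented pattern.

```python
import torch
import torchao

# Placeholder model; any PyTorch model containing nn.Linear layers will do.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).cuda().to(torch.bfloat16)

# Wrap the compiled model; autoquant picks per-layer quantization kernels.
model = torchao.autoquant(torch.compile(model, mode="max-autotune"))

# The first call benchmarks candidate kernels and finalizes the choices.
example_input = torch.randn(8, 1024, device="cuda", dtype=torch.bfloat16)
model(example_input)
```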
Advanced Quantization Techniques
One of the standout features of torchao is its robust support for quantization. The library's inference quantization algorithms work over arbitrary PyTorch models that contain `nn.Linear` layers, providing weight-only and dynamic activation quantization for various dtypes and sparse layouts. Developers can select the most suitable technique through the top-level `quantize_` API, which includes options for memory-bound models, such as `int4_weight_only` and `int8_weight_only`, as well as options for compute-bound models, where torchao can perform float8 quantization for additional flexibility in high-performance optimization. Moreover, torchao's quantization techniques are highly composable, so sparsity and quantization can be combined for further gains.
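As an illustration, the sketch below applies the `quantize_` API to a toy model containing `nn.Linear` layers. The model and shapes are placeholders, the choice of int4 weight-only quantization is just one of the available options, and int4 kernels generally expect a CUDA device and bfloat16 weights.

```python
import torch
from torchao.quantization import quantize_, int4_weight_only, int8_weight_only

# Toy stand-in for a model containing nn.Linear layers.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).cuda().to(torch.bfloat16)

# Memory-bound case: weight-only quantization (int8_weight_only() is analogous).
quantize_(model, int4_weight_only())

# The quantized model remains an ordinary PyTorch module and can be compiled.
model = torch.compile(model, mode="max-autotune")
```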
Quantization Aware Training (QAT)
Torchao addresses the potential accuracy degradation associated with post-training quantization, particularly for models quantized at less than 4 bits. The library includes support for Quantization Aware Training (QAT), which has been shown to recover up to 96% of the accuracy degradation on challenging benchmarks like Hellaswag. This feature is integrated as an end-to-end recipe in torchtune, with a minimal tutorial to facilitate its implementation. Incorporating QAT makes torchao a powerful tool for training models with low-bit quantization while maintaining accuracy.
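A hedged sketch of the QAT flow is shown below; the quantizer class and its prototype module path reflect the library around its launch and may have moved in newer releases, and the model, data, and loss are placeholders rather than a real fine-tuning setup.

```python
import torch
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

# Placeholder model; in practice this would be a fine-tunable LLM.
model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.Linear(256, 256))

quantizer = Int8DynActInt4WeightQATQuantizer()
model = quantizer.prepare(model)  # insert fake quantization around nn.Linear

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for _ in range(10):                    # schematic fine-tuning loop
    batch = torch.randn(8, 256)
    loss = model(batch).pow(2).mean()  # placeholder loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model = quantizer.convert(model)  # swap in actual low-bit quantized layers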
Training Optimization with Low Precision
In addition to inference optimization, torchao offers comprehensive support for low-precision computing and communication during training. The library includes easy-to-use workflows for reducing the precision of training compute and distributed communications, beginning with float8 for `torch.nn.Linear` layers.
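Below is a minimal sketch of that workflow, assuming an H100-class GPU that supports float8 matmuls; the model is a placeholder, and only the module-conversion call comes from torchao.

```python
import torch
from torchao.float8 import convert_to_float8_training

# Placeholder model with nn.Linear layers large enough to benefit from float8.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).cuda().to(torch.bfloat16)

# Swap eligible nn.Linear modules for float8 training variants; the rest of
# the training loop (optimizer, loss, backward) is unchanged.
convert_to_float8_training(model)
model = torch.compile(model)
```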
Torchao has demonstrated impressive results, such as a 1.5x speedup for Llama 3 70B pretraining when using float8. The library also provides experimental support for other training optimizations, such as NF4 QLoRA in torchtune, prototype int8 training, and accelerated sparse 2:4 training. These features make torchao a compelling choice for users looking to accelerate training while minimizing memory usage.
Low-Bit Optimizers
Inspired by the pioneering work of bitsandbytes on low-bit optimizers, torchao introduces prototype support for 8-bit and 4-bit optimizers as a drop-in replacement for the widely used AdamW optimizer. This lets users switch to low-bit optimizers seamlessly, further improving training efficiency without significant changes to their existing codebases.
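The sketch below shows the drop-in usage, assuming the prototype import path used around the library's launch (it may differ in later releases); the model, learning rate, and training loop are placeholders.

```python
import torch
from torchao.prototype.low_bit_optim import AdamW8bit  # AdamW4bit is analogous

model = torch.nn.Linear(4096, 4096).cuda()

# Same constructor signature as torch.optim.AdamW.
optimizer = AdamW8bit(model.parameters(), lr=1e-4)

for _ in range(10):                        # schematic training loop
    x = torch.randn(8, 4096, device="cuda")
    loss = model(x).pow(2).mean()          # placeholder loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```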
Integrations and Future Developments
Torchao has been actively integrated into some of the most significant open-source projects in the machine-learning community. These integrations include serving as an inference backend for HuggingFace transformers, contributing to diffusers-torchao for accelerating diffusion models, and providing QLoRA and QAT recipes in torchtune. torchao’s 4-bit and 8-bit quantization techniques are also supported in the SGLang project, making it a valuable tool for those working on research and production deployments.
Moving forward, the PyTorch team has outlined several exciting developments for torchao. These include pushing the boundaries of quantization by going lower than 4-bit, developing performant kernels for high-throughput inference, expanding to more layers, scaling types, or granularities, and supporting additional hardware backends, such as MX hardware.
Key Takeaways from the Launch of torchao
Significant Performance Gains: Achieved up to 97% speedup for Llama 3 8B inference using advanced quantization techniques.
Reduction in Resource Consumption: Demonstrated 73% peak VRAM reduction for Llama 3.1 8B inference and 50% reduction in VRAM for diffusion models.
Versatile Quantization Support: Provides extensive options for quantization, including float8 and int4, with support for QAT to recover accuracy.
Low-Bit Optimizers: Introduced 8-bit and 4-bit optimizers as a drop-in replacement for AdamW.
Integration with Major Open-Source Projects: Actively integrated into HuggingFace transformers, diffusers-torchao, and other key projects.
In conclusion, the launch of torchao represents a major step forward for PyTorch, providing developers with a powerful toolkit to make models faster and more efficient across training and inference scenarios.
Check out the Details and GitHub. All credit for this research goes to the researchers of this project.