Faster Dynamically Quantized Inference with XNNPack

Posted by Alan Kelly, Software Engineer

We are excited to announce that XNNPackâ€™s Fully Connected and Convolution 2D operators now support dynamic range quantization. XNNPack is TensorFlow Liteâ€™s CPU backend and CPUs deliver the widest reach for ML inference and remain the default target for TensorFlow Lite. Consequently, improving CPU inference performance is a top priority. We quadrupled inference performance in TensorFlow Liteâ€™s XNNPack backend compared to the single precision baseline by adding support for dynamic range quantization to the Fully Connected and Convolution operators. This means that more AI powered features may be deployed to older and lower tier devices.

Previously, XNNPack offered users the choice between either full integer quantization, where the weights and activations are stored as signed 8-bit integers, or half-precision (fp16) or single-precision (fp32) floating-point inference. In this article we demonstrate the benefits of dynamic range quantization.

Dynamic Range Quantization

Dynamically quantized models are similar to fully-quantized models in that the weights for the Fully Connected and Convolution operators are quantized to 8-bit integers during model conversion. All other tensors are not quantized, they remain as float32 tensors. During model inference, the floating-point layer activations are converted to 8-bit integers before being passed to the Fully Connected and Convolution operators. The quantization parameters (the zero point and scale) for each row of the activation tensor are calculated dynamically based on the observed range of activations. This maximizes the accuracy of the quantization process as the activations make full use of the 8 quantized bits. In fully-quantized models, these parameters are fixed during model conversion, based on the range of the activation values observed using a representative dataset. The second difference between full quantization and dynamic range quantization is that the output of the Fully Connected and Convolution operators is in 32-bit floating-point format, as opposed to 8-bit integer for fully-quantized operators. With dynamic range quantization, we get most of the performance gains of full quantization, yet with higher overall accuracy.

Traditionally the inference of such models was done using TensorFlow Liteâ€™s native operators. Now dynamically quantized models can benefit from XNNPackâ€™s highly-optimized per-architecture implementations of the Fully Connected and Convolution2D operators. These operators are optimized for all architectures supported by XNNPack (ARM, ARM64, x86 SSE/AVX/AVX512 and WebAssembly), including the latest ArmV9 processors such as the Pixel 8â€™s Tensor G3 CPU or the One Plus 11â€™s SnapDragon 8 Gen 2 CPU.

How can you use it?

Two steps are required to use dynamic range quantization. You must first convert your model from TensorFlow with support for dynamic range quantization. Existing models already converted using dynamic range quantization do not need to be reconverted. Dynamic range quantization can be enabled during model conversion by enabling the
converter.optimizations = [tf.lite.Optimize.DEFAULT] converter flag. Unlike full integer quantization, no representative dataset is required and unsupported operators do not prevent conversion from succeeding. Dynamic range quantization is therefore far more accessible to non-expert users than full integer quantization.

From TensorFlow 2.17, dynamically quantized XNNPack inference will be enabled by default in prebuilt binaries. If you want to use it sooner, the nightly TensorFlow builds may be used.

Mixed Precision Inference

In our previous article we presented the impressive performance gains from using half precision inference. Half-precision and dynamic range quantization may now be combined within XNNPack to get the best possible on-device CPU inference performance on devices which have hardware fp16 support (most phones on sale today do). The Fully Connected and Convolution 2D operators can output fp16 data instead of fp32. The Pixel 3, released in 2018, was the first Pixel model with fp16 support. fp16 uses half as many bits to store a floating-point value compared to fp32, meaning that the relative accuracy of each value is reduced due to the significantly shorter mantissa (10 vs 23 bits). Not all models support fp16 inference, but if a model supports it, the computational cost of vectorized floating-point operators can be reduced by half as the CPU can process twice as much data per instruction. Dynamically quantized models with compute-intensive floating point operators, such as Batch Matrix Multiply and Softmax, can benefit from fp16 inference as well.

Performance Improvements

Below, we present benchmarks on four public models covering common computer vision tasks:

EfficientNetV2 – image classification and feature extraction
Inception-v3 – image classification
Deeplab-v3 – semantic segmentation
Stable Diffusion – image generation (diffusion model)

Each model was converted three times where possible: full float, full 8 bit signed integer quantization and dynamic range quantization. Stable Diffusionâ€™s diffusion model could not be converted using full integer quantization due to unsupported operators. The speed-up versus the original float32 model using TFLiteâ€™s kernels is shown below.

FP32 refers to the baseline float32 model.
QS8 refers to full signed 8 bit integer quantization using XNNPack.
QD8-F32 refers to dynamically quantized 8 bit integers with fp32 activations using XNNPack.
QD8-F16 refers to dynamically quantized 8 bit integers with fp16 activations using XNNPack.

The speed-up versus TFLiteâ€™s dynamically quantized Fully Connected and Convolution operators are shown below. By simply using the latest version of TensorFlow Lite, you can benefit from these speed-ups.

We can clearly see that dynamic range quantization is competitive with, and in some cases can exceed, the performance of full integer quantization. Stable Diffusionâ€™s diffusion model runs up to 6.2 times faster than the original float32 model! This is a game changer for on-device performance.

We would expect that full integer quantization would be faster than dynamic range quantization as all operations are calculated using integer arithmetic. Furthermore, dynamic range quantization has the additional overhead of converting the floating point activations to quantized 8 bit integers. Surprisingly, in two of the three models tested, this is not true. Profiling the models using TFLiteâ€™s profiler solves the mystery. These models are slower due to a combination of quantization artifacts â€” quantized arithmetic is more efficient when the ratio of the input and output scales fall within a certain range and missing op support in XNNPack. These quantization parameters are determined during model conversion based on a provided representative dataset. Since the ratio of the scales falls outside the optimal range, a less optimal path must be taken and performance suffers.

Finally, we demonstrate that model precision is mostly preserved when using mixed precision inference, we compare the image generated by the dynamically quantized Stable Diffusion model using fp32 activations with that generated using fp16 activations with the random number generator seeded with the same number to verify that using fp16 activations for the diffusion model does not impact the quality of the generated image.

Image generated using fp16 inference (left) and fp32 inference (right)

Both generated images are indeed lovely cats, corresponding to the given prompt. The images are indistinguishable which is a very strong indication that the diffusion model is suited to fp16 inference. Of course, as for all neural networks, any quantization strategy should be validated using a large validation dataset, and not just a single test.

Conclusions

Full integer quantization is hard, converting models is difficult, error prone and accuracy is not guaranteed. The representative dataset must be truly representative to minimize quantization errors. Dynamic range quantization offers a compromise between full integer quantization and fp32 inference: the models are of similar size to the fully-quantized model and performance gains are often similar and sometimes exceeds that of fully-quantized models. Using Stable Diffusion, we showed that dynamic range quantization and fp16 inference can be combined giving game changing performance improvements. XNNPackâ€™s dynamic range quantization is now powering Gemini, Google Meet, and Chrome OS audio denoising and will launch in many other products this year. This same technology is now available to our open source users so try it for yourself using the models linked above and following the instructions in the How can I use it section!

Acknowledgements

We would like to thank Frank Barchard and Quentin Khan for contributions towards dynamic range quantization inference in TensorFlow Lite and XNNPack.

Source: Read MoreÂ

Sunshine And March Vibes (2025 Wallpapers Edition)

The Case For Minimal WordPress Setups: A Contrarian View On Theme Frameworks

How To Fix Largest Contentful Paint Issues With Subpart Analysis

How To Prevent WordPress SQL Injection Attacks

Microsoft has closed its “Experience Center” store in Sydney, Australia — as it ramps up a continued digital growth campaign

Bing Search APIs to be “decommissioned completely” as Microsoft urges developers to use its Azure agentic AI alternative

Microsoft might kill the Surface Laptop Studio as production is quietly halted

Minecraft licensing robbed us of this controversial NFL schedule release video

The power of generators

The power of generators

Simplify Factory Associations with Laravel’s UseFactory Attribute

This Week in Laravel: React Native, PhpStorm Junie, and more

Microsoft has closed its “Experience Center” store in Sydney, Australia — as it ramps up a continued digital growth campaign

Microsoft has closed its “Experience Center” store in Sydney, Australia — as it ramps up a continued digital growth campaign

Bing Search APIs to be “decommissioned completely” as Microsoft urges developers to use its Azure agentic AI alternative

Microsoft might kill the Surface Laptop Studio as production is quietly halted

Faster Dynamically Quantized Inference with XNNPack

Dynamic Range Quantization

How can you use it?

Mixed Precision Inference

Performance Improvements

Conclusions

Acknowledgements

Nmap 7.96 Launches with Lightning-Fast DNS and 612 Scripts

CVE-2025-47916 – Invision Community Themeeditor Remote Code Execution

Why tech-savvy leadership is key to cyber insurance readiness

Think DeepSeek has cut AI spending? Think again

Quickgui – virtual machine manager

“If we had a choice in this, we would not pursue any legal actions against Schedule I” — Drug Dealer Simulator developers speak out on controversy involving their publisher

This thermal camera is my new favorite smartphone accessory (and it’s $50 off)

Wix vs WordPress vs Squarespace

Embracing Joy and Empowerment

Sitecore 10.4 is out and hereâ€™s all you need to know about it

Faster Dynamically Quantized Inference with XNNPack

Dynamic Range Quantization

How can you use it?

Mixed Precision Inference

Performance Improvements

Conclusions

Acknowledgements

Related Posts