Large Language Models (LLMs) have made significant advances in natural language processing but face challenges due to their memory and computational demands. Traditional quantization techniques reduce model size by decreasing the bit-width of model weights, which helps mitigate these issues but often leads to performance degradation. The problem is compounded when LLMs must be deployed across diverse resource-constrained scenarios, because quantization-aware training (QAT) then has to be repeated for each deployment configuration, incurring enormous training costs.
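To make the bit-width reduction concrete, here is a minimal round-to-nearest uniform quantization sketch in PyTorch. It only illustrates the general trade-off between bit-width and reconstruction error; it is not the specific quantization scheme used in the paper.

```python
import torch

def quantize_weights(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Simulate uniform round-to-nearest quantization of a weight tensor."""
    qmax = 2 ** bits - 1
    scale = (w.max() - w.min()) / qmax          # step size for the chosen bit-width
    q = torch.clamp(torch.round((w - w.min()) / scale), 0, qmax)
    return q * scale + w.min()                  # dequantize back to float for comparison

w = torch.randn(4096, 4096)
for bits in (4, 3, 2):
    err = (w - quantize_weights(w, bits)).abs().mean().item()
    print(f"{bits}-bit mean absolute reconstruction error: {err:.4f}")
```

As the bit-width drops, the reconstruction error grows, which is why aggressive quantization typically needs QAT or fine-tuning to recover accuracy.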
Researchers from the South China University of Technology, the Hong Kong University of Science and Technology, Tsinghua University, and Salesforce AI Research propose LLM-QFA (Quantization-Aware Fine-tuning once-for-all for LLMs) to address these inefficiencies. Current methods for handling the memory and computational inefficiencies of LLMs include Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). PTQ compresses the model without retraining, enabling quick deployment but often at the cost of significant performance loss, especially at lower bit-widths. QAT, by contrast, integrates quantization errors during training to maintain performance, but it is time-consuming and computationally expensive. The proposed framework aims to train a single “once-for-all” supernet capable of generating various optimal subnets tailored for different deployment scenarios without repeated training.
The LLM-QFA framework tackles the interference issues caused by weight sharing in traditional QAT by decoupling the weights of different quantization configurations. This decoupling is achieved using lightweight Low-Rank adapters, which introduce negligible additional computational cost. Specifically, the method involves quantizing the model weights to different bit-widths (2, 3, and 4 bits) and applying Low-Rank adapters for each configuration. During fine-tuning, only the adapters corresponding to the active quantization configuration are updated, thus avoiding interference between configurations.
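The following sketch conveys the decoupling idea: a frozen (simulated) quantized backbone per bit-width, each paired with its own low-rank adapter, so that only the adapter of the sampled bit-width appears in the forward pass and receives gradients. Class and attribute names (QuantLinearWithAdapters, lora_A, lora_B, etc.) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Round-to-nearest uniform quantization, dequantized back to float."""
    qmax = 2 ** bits - 1
    scale = (w.max() - w.min()) / qmax
    q = torch.clamp(torch.round((w - w.min()) / scale), 0, qmax)
    return q * scale + w.min()

class QuantLinearWithAdapters(nn.Module):
    def __init__(self, in_features, out_features, bit_widths=(2, 3, 4), rank=8):
        super().__init__()
        base = torch.randn(out_features, in_features) * 0.02
        # One frozen quantized copy of the weights per bit-width configuration.
        for b in bit_widths:
            self.register_buffer(f"w_{b}bit", fake_quantize(base, b))
        # One lightweight LoRA adapter (A, B) per configuration.
        self.lora_A = nn.ParameterDict(
            {str(b): nn.Parameter(torch.randn(rank, in_features) * 0.01) for b in bit_widths})
        self.lora_B = nn.ParameterDict(
            {str(b): nn.Parameter(torch.zeros(out_features, rank)) for b in bit_widths})

    def forward(self, x, bits: int):
        w = getattr(self, f"w_{bits}bit")                        # frozen quantized weights
        delta = self.lora_B[str(bits)] @ self.lora_A[str(bits)]  # low-rank update for this config
        return x @ (w + delta).T

layer = QuantLinearWithAdapters(64, 64)
x = torch.randn(8, 64)
loss = layer(x, bits=3).pow(2).mean()
loss.backward()  # gradients flow only into the 3-bit adapter; other configs are untouched
```

Because each configuration owns its adapter, updating one bit-width cannot interfere with the others, which is the point of the decoupling.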
The LLM-QFA framework also adopts a resource-balanced sampling strategy. Earlier uniform sampling strategies favored subnets with average bit-widths, leading to imbalanced training and underfitting of subnets with extreme bit-width configurations. In contrast, resource-balanced sampling uses a non-parametric scheduler to dynamically adjust the sampling rate, ensuring a more balanced allocation of training resources among subnets. This balanced approach helps optimize all subnets effectively, resulting in robust performance across different resource constraints; a sketch of the idea follows.
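Here is a hedged, illustrative version of such a scheduler: it tracks how much training "resource" each bit-width configuration has consumed and biases sampling toward under-trained configurations. This is a non-parametric rule in the spirit of the paper, not its exact scheduler.

```python
import random

class ResourceBalancedSampler:
    def __init__(self, bit_widths=(2, 3, 4)):
        self.bit_widths = list(bit_widths)
        self.resource = {b: 1e-6 for b in self.bit_widths}  # tiny prior to avoid division by zero

    def sample(self) -> int:
        # Probability is inversely proportional to resources already consumed,
        # so under-trained configurations are picked more often.
        inv = [1.0 / self.resource[b] for b in self.bit_widths]
        total = sum(inv)
        return random.choices(self.bit_widths, weights=[v / total for v in inv], k=1)[0]

    def update(self, bits: int, cost: float = 1.0):
        # `cost` could be steps, tokens, or FLOPs attributed to this subnet.
        self.resource[bits] += cost

sampler = ResourceBalancedSampler()
for step in range(1000):
    bits = sampler.sample()
    # ... run one fine-tuning step on the subnet with this bit-width ...
    sampler.update(bits, cost=1.0)
print({b: round(v) for b, v in sampler.resource.items()})  # roughly equal resource counts
```

Over many steps the consumed resources even out across bit-widths, which is what prevents the extreme 2-bit and 4-bit subnets from being underfit.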
LLM-QFA’s performance was evaluated using LLaMA2 models on the MMLU and Common Sense QA benchmarks. The results demonstrated that LLM-QFA could maintain high performance while significantly reducing deployment time compared to traditional QAT methods. For instance, on the MMLU benchmark, LLM-QFA outperformed GPTQ and QA-LoRA methods, particularly under mid-range bit-width constraints, achieving a good balance between performance and resource efficiency. The LLM-QFA framework also showed consistent improvements on the Common Sense QA benchmarks, further validating its effectiveness in diverse deployment scenarios.
In conclusion, the study addresses the critical issue of efficiently deploying large language models across varied resource-constrained environments. By introducing interference-less fine-tuning with Low-Rank adapters and a resource-balanced sampling strategy, the proposed framework significantly reduces the computational cost associated with traditional QAT methods while maintaining and enhancing performance. This approach takes a major step toward making LLMs more adaptable and efficient for real-world applications, even on resource-constrained devices.