Large Language Models (LLMs) have gained significant attention for their impressive performance, with the release of Llama 3.1 in July 2024 being a notable example. However, deploying these models in resource-constrained environments poses significant challenges because of their huge parameter counts. Low-bit quantization has emerged as a popular technique for compressing LLMs, reducing memory and computational demands during inference. Existing research on quantization algorithms has been limited in scope, focusing mainly on pre-trained models rather than the more widely used instruction-tuned models. Understanding how these quantization methods affect accuracy across diverse datasets, model sizes, and training approaches is therefore important.
Existing approaches to LLM quantization fall into Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ); QAT is costly to apply at LLM scale because it requires retraining, so PTQ is more widely adopted despite its potential accuracy loss. Representative methods include LLM.int8(), which uses 8-bit weights and activations, and GPTQ, a layer-wise quantization technique that utilizes inverse Hessian information. On the evaluation side, prior work has examined weight and activation quantization on language modeling tasks, the emergent abilities of quantized LLMs, and trustworthiness dimensions. However, most research relies heavily on accuracy as the primary metric, leaving gaps in understanding how quantization affects crucial tasks such as trustworthiness, dialogue, and long-context scenarios.
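To make the PTQ idea concrete, below is a minimal, library-agnostic sketch of symmetric 8-bit absmax quantization applied to an already-trained weight matrix, the simplest form of post-training quantization. It shows only the round-trip rounding error and omits the outlier handling of LLM.int8() and the Hessian-based weight updates of GPTQ; tensor shapes and names are illustrative, not taken from the paper.

```python
import torch

def quantize_absmax_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization: map the largest magnitude to 127."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate floating-point tensor from int8 codes and a scale."""
    return q.to(torch.float32) * scale

# Round-trip a stand-in "pretrained" weight matrix and measure the error PTQ introduces.
w = torch.randn(1024, 1024)
q, scale = quantize_absmax_int8(w)
w_hat = dequantize_int8(q, scale)
print(f"mean absolute quantization error: {(w - w_hat).abs().mean().item():.6f}")
```

The appeal of PTQ is visible here: no gradients or retraining are needed, only a cheap calibration of scales over existing weights (and, for activation quantization, a small calibration set).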
A team from ETRI, KETI, and Neubla has proposed a comprehensive evaluation of instruction-tuned LLMs across various quantization methods. Their study encompasses models ranging from 7B to 405B parameters, using the GPTQ, AWQ, SmoothQuant, and FP8 quantization techniques. This approach provides a detailed understanding of how different quantization methods affect LLM performance across diverse tasks and model sizes. It also addresses the limitations of previous studies by including the latest models and a wider range of parameter scales, offering insights into the effectiveness of quantization techniques on cutting-edge LLMs.
The study builds a comprehensive evaluation framework around 13 widely used datasets and benchmarks spanning 6 task types. For commonsense question answering (CommonSenseQA), datasets such as ARC, HellaSwag, and Winogrande evaluate the models' human-like reasoning and elementary knowledge. Activation quantization (SmoothQuant) and weight-only quantization methods (GPTQ and AWQ) are implemented with tools such as AutoGPTQ, llmcompressor, and AutoAWQ. GPTQ performs layer-wise quantization and uses inverse Hessian information to mitigate accuracy loss, while AWQ is designed to preserve the precision of the most important weights in LLMs. Both methods use a group size of 128 for quantization.
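To illustrate what the shared group size of 128 means in practice, here is a minimal sketch of plain round-to-nearest 4-bit quantization with one scale and zero point per group of 128 weights along the input dimension. It deliberately omits GPTQ's Hessian-based error compensation and AWQ's activation-aware scaling, and the shapes are illustrative, not taken from the paper.

```python
import torch

def quantize_groupwise_4bit(w: torch.Tensor, group_size: int = 128):
    """Round-to-nearest 4-bit quantization with one scale/zero point per group.

    w has shape [out_features, in_features]; in_features must be divisible
    by group_size. Returns integer codes in [0, 15] plus per-group metadata.
    """
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    w_min = groups.min(dim=-1, keepdim=True).values
    w_max = groups.max(dim=-1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0  # 4 bits -> 16 levels
    zero = (-w_min / scale).round()
    q = torch.clamp((groups / scale + zero).round(), 0, 15).to(torch.uint8)
    return q, scale, zero

def dequantize_groupwise_4bit(q, scale, zero, shape):
    """Map the 4-bit codes back to floating point and restore the original shape."""
    return ((q.to(torch.float32) - zero) * scale).reshape(shape)

w = torch.randn(256, 1024)  # stand-in weight matrix
q, scale, zero = quantize_groupwise_4bit(w, group_size=128)
w_hat = dequantize_groupwise_4bit(q, scale, zero, w.shape)
print(f"mean absolute quantization error: {(w - w_hat).abs().mean().item():.6f}")
```

Grouping limits how far a single outlier weight can stretch the quantization range, which is why GPTQ and AWQ pipelines store per-group scales and zero points alongside the packed 4-bit weights.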
The experimental results show that quantized larger LLMs generally outperform smaller models across most benchmarks, except for hallucination and instruction-following tasks. For example, a 4-bit quantized Llama-2-13B (6.5 GB) outperformed an FP16 Llama-2-7B (14 GB) on most benchmarks, with 4.66% and 1.16% higher accuracy on the OpenLLM Leaderboard-v1 and v2 datasets, respectively. Further, the comparison of quantization methods showed little difference between weight-only quantization (GPTQ and AWQ) and activation quantization (SmoothQuant) in most cases. However, SmoothQuant caused accuracy drops, reaching 2.93% and 9.23% on average for large models such as Llama-3.1-405B compared to FP8 on the OpenLLM Leaderboard-v1 and v2 datasets, respectively.
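The memory figures quoted above follow directly from the parameter count multiplied by the bits per weight. The quick sanity check below ignores quantization metadata (per-group scales and zero points) and any layers kept in higher precision, so real checkpoints are slightly larger.

```python
def approx_weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB: parameters x bits per weight, converted to bytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(approx_weight_memory_gb(13, 4))   # ~6.5 GB for a 4-bit Llama-2-13B
print(approx_weight_memory_gb(7, 16))   # ~14.0 GB for an FP16 Llama-2-7B
```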
In this paper, the team from ETRI, KETI, and Neubla presented a comprehensive evaluation of instruction-tuned LLMs across various quantization methods on 13 datasets spanning 6 task types. The paper covers models ranging from 7B to 405B parameters and uses four quantization methods: GPTQ, AWQ, SmoothQuant, and FP8. The findings reveal that quantized larger LLMs outperformed smaller models in most tasks, with notable exceptions in hallucination detection and instruction following. Weight-only quantization (GPTQ and AWQ) showed superior results on the 405B model. The study also highlighted the limitations of the MT-Bench evaluation method in differentiating between high-performing LLMs.
Check out the Paper. All credit for this research goes to the researchers of this project.