
    Fine-Tuning LLaMA 70B Using Hugging Face Accelerate & DeepSpeed on Multiple Nodes 

    March 31, 2025

    by Luis Pacheco, Uday Yallapragada and Cristian Muñoz

    Large language models (LLMs) like Meta’s LLaMA 70B are revolutionizing natural language processing tasks, but training or fine-tuning them requires massive computational and memory resources. To address these challenges, we employ distributed training across multiple GPU nodes using DeepSpeed and Hugging Face Accelerate.

    This blog walks you through a production-ready setup for fine-tuning the LLaMA 70B model on two nodes, each equipped with H100 GPUs, using:

    • DeepSpeed Stage 3 ZeRO optimization for memory-efficient training, 
    • Hugging Face Accelerate for seamless multi-node orchestration, 
    • 4-bit quantization to drastically reduce memory footprint, 
    • PEFT (Parameter-Efficient Fine-Tuning) via LoRA for lightweight adaptation. 

    Whether you're scaling up your models or optimizing compute costs, this setup enables powerful fine-tuning workflows at scale.

    Setup 

    The architecture involves a master node and a worker node, both running identical fine-tuning processes. Communication is managed via NCCL over TCP, and shared access to the dataset/model is through NFS.

    Diagram Breakdown

    [Architecture diagram: the master node and worker node run identical fine-tuning processes, communicate over NCCL/TCP, and share the model and dataset via NFS.]

    Key Components

    • DeepSpeed Stage 3 (ZeRO-3): 

      • Enables model sharding by partitioning optimizer states, gradients, and parameters across all GPUs. 
      • Critical for training models larger than what fits in the memory of a single GPU. 
    • Hugging Face Accelerate: 

      • Provides a lightweight interface to manage distributed training, wrapping model, optimizer, data loaders, and more. 
      • Handles launching across nodes via the accelerate launch command with the configuration files. 
    • 4-bit Quantization (via bitsandbytes): 

      • Reduces model size in memory, allowing larger models to fit and train faster. 
      • Uses nf4 quantization and bfloat16 (bf16) compute type to balance performance and accuracy. 
    • LoRA (Low-Rank Adaptation) via PEFT: 

      • Fine-tunes only a small set of low-rank adapter weights. 
      • Dramatically reduces the number of trainable parameters, making fine-tuning efficient even on smaller datasets. 
    • Monitoring: 

      • Tools like Grafana, Prometheus, nvidia-smi, and DCGM track system and GPU performance. 
    • Model/Data: 

      • LLaMA 3.3 70B model loaded via NFS. 
      • Amazon Reviews dataset (400 MB) used for binary sentiment classification. 

     Implementation Details 

    Accelerate Configurations 

    Two YAML files define the multi-node setup: 

    default_config_main.yaml

    machine_rank: 0
    num_machines: 2
    distributed_type: DEEPSPEED
    mixed_precision: fp8
    deepspeed_config:
      zero_stage: 3

    default_config_worker.yaml

    machine_rank: 1
    main_process_ip: <your ip address>

     

    These configs enable synchronized training across two nodes, each running 4 processes (1 per GPU). 
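
    Each node is started with the accelerate launch command pointing at its own config file. For readers who prefer to see the same settings in code, the snippet below is a rough programmatic equivalent of these configs. It is a sketch only: the original setup is driven entirely by accelerate launch and the YAML files, and the multi-node fields (machine_rank, num_machines, main_process_ip) come from the launcher environment rather than from Python.

    from accelerate import Accelerator
    from accelerate.utils import DeepSpeedPlugin

    # Multi-node topology (machine_rank, num_machines, main_process_ip) is supplied
    # by the launcher/environment, not by this constructor.
    ds_plugin = DeepSpeedPlugin(zero_stage=3)      # ZeRO Stage 3 sharding
    accelerator = Accelerator(
        mixed_precision="fp8",                     # matches mixed_precision in the YAML above
        deepspeed_plugin=ds_plugin,
    )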

    Python Script Highlights 

    The fine-tuning script llm_finetune_for_blog.py is modular and production-friendly. Here’s a breakdown of its workflow: 

    Data Loading & Preprocessing

    df = pd.concat([...])
    df["label"] = df["rating"].apply(convert_rating)
    train_dataset = datasets.Dataset.from_pandas(train_df)
    • Loads and preprocesses Amazon reviews CSV files. 
    • Converts ratings to binary labels. 
    • Tokenizes using Hugging Face tokenizer with padding and truncation. 
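
    A fuller version of this step might look like the sketch below. The file paths, the 4-or-more-stars threshold in convert_rating, the 90/10 split, and the max_length value are illustrative assumptions; only the overall flow (concatenate the CSVs, derive binary labels, tokenize with padding and truncation) comes from the original.

    import glob
    import pandas as pd
    import datasets
    from transformers import AutoTokenizer

    MODEL_PATH = "/mnt/nfs/models/llama-3.3-70b"   # hypothetical NFS path to the shared model

    # Assumed layout: Amazon-reviews CSVs with "text" and "rating" columns.
    df = pd.concat(
        [pd.read_csv(path) for path in glob.glob("/mnt/nfs/data/amazon_reviews/*.csv")],
        ignore_index=True,
    )

    def convert_rating(rating):
        # Hypothetical mapping: 4-5 stars -> positive (1), 1-3 stars -> negative (0).
        return 1 if rating >= 4 else 0

    df["label"] = df["rating"].apply(convert_rating)
    df = df.drop(columns=["rating"])

    train_df = df.sample(frac=0.9, random_state=42)   # illustrative 90/10 split
    eval_df = df.drop(train_df.index)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    tokenizer.pad_token = tokenizer.eos_token          # LLaMA tokenizers ship without a pad token

    def tokenize(batch):
        return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512)

    train_dataset = datasets.Dataset.from_pandas(train_df, preserve_index=False).map(tokenize, batched=True)
    eval_dataset = datasets.Dataset.from_pandas(eval_df, preserve_index=False).map(tokenize, batched=True)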

    Model & Quantization Setup

    bnb_config = BitsAndBytesConfig(load_in_4bit=True, ...)
    model = AutoModelForSequenceClassification.from_pretrained(..., quantization_config=bnb_config)
    • Loads LLaMA model in 4-bit with nf4 quantization. 
    • Applies gradient checkpointing and enables input gradients. 
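
    Expanding the elided arguments, the quantization setup might look roughly as follows. The nf4 quant type and bf16 compute dtype are stated above; double quantization and the model path are assumptions carried over from the previous sketch.

    import torch
    from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # nf4 quantization, as described above
        bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 compute type
        bnb_4bit_use_double_quant=True,         # assumed; a common memory-saving default
    )

    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_PATH,                             # hypothetical NFS path from the earlier sketch
        num_labels=2,                           # binary sentiment classification
        quantization_config=bnb_config,
    )
    model.config.pad_token_id = tokenizer.pad_token_id
    model.gradient_checkpointing_enable()       # gradient checkpointing, as noted above
    model.enable_input_require_grads()          # enables input gradients so LoRA updates receive them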

    Apply LoRA

    lora_config = LoraConfig(...)
    model = get_peft_model(model, lora_config)
    • LoRA reduces the number of trainable parameters, speeding up training while maintaining performance. 
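
    A plausible LoraConfig for this classification task is sketched below; the rank, alpha, dropout, and target modules are illustrative values, not the ones used in the original script.

    from peft import LoraConfig, TaskType, get_peft_model

    lora_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,             # sequence-classification head
        r=16,                                   # illustrative rank
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],    # assumed attention projections for LLaMA
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()          # confirms only a small fraction of weights train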

    Accelerator Setup

    accelerator = Accelerator(mixed_precision='fp8')
    model, optimizer, train_dataloader, ... = accelerator.prepare(...)
    • Wraps training components with Accelerate to handle distributed training. 
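
    Putting the pieces together, the preparation step might look like this sketch. The batch size, learning rate, scheduler, and epoch count are assumptions for illustration; fp8 mixed precision matches the Accelerate config shown earlier.

    from accelerate import Accelerator
    from torch.optim import AdamW
    from torch.utils.data import DataLoader
    from transformers import default_data_collator, get_linear_schedule_with_warmup

    num_epochs = 3                                                  # illustrative
    train_dataloader = DataLoader(train_dataset, batch_size=4,      # per-GPU batch size (assumed)
                                  shuffle=True, collate_fn=default_data_collator)
    eval_dataloader = DataLoader(eval_dataset, batch_size=4, collate_fn=default_data_collator)
    optimizer = AdamW(model.parameters(), lr=2e-4)                  # typical LoRA learning rate (assumed)
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=num_epochs * len(train_dataloader)
    )

    accelerator = Accelerator(mixed_precision="fp8")
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
    )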

    Training Loop

    for epoch in range(num_epochs):
        for batch in train_dataloader:
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
    • Simple loop structure with loss backpropagation using Accelerate. 
    • The optimizer and scheduler steps are executed per batch. 
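
    Filling in the per-batch optimizer and scheduler steps mentioned above, the loop might read as follows (gradient accumulation and logging from the original script are omitted):

    model.train()
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)   # lets Accelerate/DeepSpeed scale and route the backward pass
            optimizer.step()             # optimizer step per batch, as noted above
            lr_scheduler.step()          # scheduler step per batch
            optimizer.zero_grad()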

    Evaluation

    with torch.no_grad():
        outputs = model(**batch)
        total_eval_loss += outputs.loss.item()
    • Computes average validation loss at the end of training. 
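
    In full, the evaluation pass might be a loop like this sketch, averaging the loss over the validation dataloader; accelerator.print is used so only the main process reports the result.

    model.eval()
    total_eval_loss = 0.0
    for batch in eval_dataloader:
        with torch.no_grad():
            outputs = model(**batch)
        total_eval_loss += outputs.loss.item()

    avg_eval_loss = total_eval_loss / len(eval_dataloader)
    accelerator.print(f"Average validation loss: {avg_eval_loss:.4f}")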

    Conclusion 

    In this blog, we’ve walked through a complete end-to-end setup to fine-tune LLaMA 70B using:

    • Distributed multi-node training with DeepSpeed ZeRO-3, 
    • 4-bit quantization with bitsandbytes to optimize memory usage, 
    • Hugging Face Accelerate for seamless training orchestration, 
    • PEFT via LoRA to fine-tune only critical parts of the model efficiently. 

    This architecture is robust and scalable, and well suited to large-scale enterprise LLM fine-tuning. By combining quantization, PEFT, and distributed computing, you can unlock high-performance fine-tuning workflows for models with tens to hundreds of billions of parameters, all while optimizing for compute cost and speed. 

