
    Training Llama 3.3 Swallow: A Japanese sovereign LLM on Amazon SageMaker HyperPod

    June 13, 2025

    This post is based on a technical report written by Kazuki Fujii, who led the Llama 3.3 Swallow model development.

    The Institute of Science Tokyo has successfully trained Llama 3.3 Swallow, a 70-billion-parameter large language model (LLM) with enhanced Japanese capabilities, using Amazon SageMaker HyperPod. The model demonstrates superior performance in Japanese language tasks, outperforming GPT-4o-mini and other leading models. This technical report details the training infrastructure, optimizations, and best practices developed during the project.

    This post is organized as follows:

    • Overview of Llama 3.3 Swallow
    • Architecture for Llama 3.3 Swallow training
    • Software stack and optimizations employed in Llama 3.3 Swallow training
    • Experiment management

    This post is intended for machine learning (ML) researchers and engineers with experience in distributed LLM training and familiarity with cloud infrastructure and AWS services. Readers who understand model parallelism and optimization techniques will get the most from it, especially those interested in continual pre-training and supervised fine-tuning approaches.

    Overview of Llama 3.3 Swallow

    Llama 3.3 Swallow is a 70-billion-parameter LLM that builds upon Meta’s Llama 3.3 architecture with specialized enhancements for Japanese language processing. The model was developed through a collaboration between the Okazaki Laboratory and Yokota Laboratory at the School of Computing, Institute of Science Tokyo, and the National Institute of Advanced Industrial Science and Technology (AIST).

    The model is available in two variants on Hugging Face:

    • Llama 3.3 Swallow 70B Base v0.4 – The pretrained base model, which serves as the foundation for Japanese language understanding
    • Llama 3.3 Swallow 70B Instruct v0.4 – The instruction-tuned model, optimized for dialogue and task completion

    Both variants are accessible through the tokyotech-llm organization on Hugging Face, providing researchers and developers with flexible options for different application needs.
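
    As a minimal sketch, the model weights can be pulled locally with the Hugging Face CLI. The repository ID below is assumed from the tokyotech-llm naming convention, so verify it against the official model cards before use:

    # Download the base model weights (repository ID assumed from the tokyotech-llm
    # naming convention; confirm it on Hugging Face before running)
    pip install -U "huggingface_hub[cli]"
    huggingface-cli download tokyotech-llm/Llama-3.3-Swallow-70B-v0.4 --local-dir ./swallow-70b-base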

    Training methodology

    The base model was developed through continual pre-training from Meta Llama 3.3 70B Instruct, maintaining the original vocabulary without expansion. The training data primarily consisted of the Swallow Corpus Version 2, a carefully curated Japanese web corpus derived from Common Crawl. To secure high-quality training data, the team employed the Swallow Education Classifier to extract educationally valuable content from the corpus. The following table summarizes the training data used for the base model, approximately 314 billion tokens in total. For compute, the team used 32 ml.p5.48xlarge Amazon Elastic Compute Cloud (Amazon EC2) instances (8 NVIDIA H100 80 GB GPUs each, 256 GPUs in total); continual pre-training took 16 days and 6 hours.

    Training Data                      Number of Training Tokens
    Japanese Swallow Corpus v2         210 billion
    Japanese Wikipedia                 5.3 billion
    English Wikipedia                  6.9 billion
    English Cosmopedia                 19.5 billion
    English DCLM baseline              12.8 billion
    Laboro ParaCorpus                  1.4 billion
    Code Swallow-Code                  50.2 billion
    Math Finemath-4+                   7.85 billion

    For the instruction-tuned variant, the team focused exclusively on Japanese dialogue and code generation tasks. This version was created through supervised fine-tuning of the base model, using the same Japanese dialogue data that proved successful in the previous Llama 3.1 Swallow v0.3 release. Notably, the team made a deliberate choice to exclude English dialogue data from the fine-tuning process to maintain focus on Japanese language capabilities. The following table summarizes the instruction-tuning data used for the instruction-tuned model.

    Training Data                      Number of Training Samples
    Gemma-2-LMSYS-Chat-1M-Synth        240,000
    Swallow-Magpie-Ultra-v0.1          42,000
    Swallow-Gemma-Magpie-v0.1          99,000
    Swallow-Code-v0.3-Instruct-style   380,000

    Performance and benchmarks

    The base model has demonstrated remarkable performance in Japanese language tasks, consistently outperforming several industry-leading models. In comprehensive evaluations, it has shown superior capabilities compared to OpenAI’s GPT-4o (gpt-4o-2024-08-06), GPT-4o-mini (gpt-4o-mini-2024-07-18), GPT-3.5 (gpt-3.5-turbo-0125), and Qwen2.5-72B. These benchmarks reflect the model’s enhanced ability to understand and generate Japanese text. The following graph illustrates the base model performance comparison across these different benchmarks.

    The instruction-tuned model has shown particularly strong performance on the Japanese MT-Bench, as evaluated by GPT-4o-2024-08-06, demonstrating its effectiveness in practical applications. The following graph presents the performance metrics.

    Licensing and usage

    The model weights are publicly available on Hugging Face and can be used for both research and commercial purposes. Users must comply with both the Meta Llama 3.3 license and the Gemma Terms of Use. This open availability aims to foster innovation and advancement in Japanese language AI applications while enforcing responsible usage through appropriate licensing requirements.

    Training infrastructure architecture

    The training infrastructure for Llama 3.3 Swallow was built on SageMaker HyperPod, with a focus on high performance, scalability, and observability. The architecture combines compute, network, storage, and monitoring components to enable efficient large-scale model training. The base infrastructure stack is available as an AWS CloudFormation template for seamless deployment and replication. This template provisions a comprehensive foundation by creating a dedicated virtual private cloud (VPC). The networking layer is complemented by a high-performance Amazon FSx for Lustre file system, alongside an Amazon Simple Storage Service (Amazon S3) bucket configured to store lifecycle scripts, which are used to configure the SageMaker HyperPod cluster.

    Before deploying this infrastructure, it’s essential to make sure the AWS account has the appropriate service quotas. The deployment of SageMaker HyperPod requires specific quota values that often exceed default limits. You should check your current quota against the requirements detailed in SageMaker HyperPod quotas and submit a quota increase request as needed.
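
    One way to check the relevant limits is the Service Quotas CLI; the sketch below simply filters quota names for the P5 instance type, because exact quota codes vary by account and Region:

    # List SageMaker quotas related to ml.p5.48xlarge and review their current values
    aws service-quotas list-service-quotas \
        --service-code sagemaker \
        --query "Quotas[?contains(QuotaName, 'p5.48xlarge')].[QuotaName,Value]" \
        --output table

    After the required quota increases are approved, the base stack can be deployed with the AWS CLI. The template file and stack name below are placeholders; use the CloudFormation template referenced above:

    # Deploy the base infrastructure stack (template path and stack name are placeholders)
    aws cloudformation deploy \
        --template-file sagemaker-hyperpod-base.yaml \
        --stack-name llm-training-base \
        --capabilities CAPABILITY_NAMED_IAM \
        --region ${AWS_REGION}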

    The following diagram illustrates the high-level architecture of the training infrastructure.

    Compute and network configuration

    The compute infrastructure is based on SageMaker HyperPod using a cluster of 32 EC2 P5 instances, each equipped with 8 NVIDIA H100 GPUs. The deployment uses a single spine configuration to provide minimal latency between instances. All communication between GPUs is handled through NCCL over an Elastic Fabric Adapter (EFA), providing high-throughput, low-latency networking essential for distributed training. The SageMaker HyperPod Slurm configuration manages the deployment and orchestration of these resources effectively.
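
    The sketch below shows what a Slurm batch script for this cluster might look like, with environment variables that follow common AWS guidance for NCCL over EFA; the directives and launch command are placeholders rather than the team’s actual configuration:

    #!/bin/bash
    # Hypothetical Slurm batch script for multi-node training over EFA
    #SBATCH --nodes=32
    #SBATCH --ntasks-per-node=8
    #SBATCH --gres=gpu:8
    #SBATCH --exclusive

    export FI_PROVIDER=efa              # use the EFA libfabric provider
    export FI_EFA_USE_DEVICE_RDMA=1     # enable GPUDirect RDMA on P5 instances
    export NCCL_DEBUG=INFO              # surface NCCL and EFA initialization details in logs

    srun python pretrain_gpt.py "$@"    # placeholder for the actual Megatron-LM launch command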

    Storage architecture

    The project implements a hierarchical storage approach that balances performance and cost-effectiveness. At the foundation is Amazon S3, providing long-term storage for training data and checkpoints. To prevent storage bottlenecks during training, the team deployed FSx for Lustre as a high-performance parallel file system. This configuration enables efficient data access patterns across all training nodes, crucial for handling the massive datasets required for the 70-billion-parameter model.

    The following diagram illustrates the storage hierarchy implementation.

    The integration between Amazon S3 and FSx for Lustre is managed through a data repository association, configured using the following AWS Command Line Interface (AWS CLI) command:

    
    aws fsx create-data-repository-association \
        --file-system-id ${FSX_ID} \
        --file-system-path "/hsmtest" \
        --data-repository-path s3://${BUCKET_NAME_DATA} \
        --s3 "AutoImportPolicy={Events=[NEW,CHANGED,DELETED]},AutoExportPolicy={Events=[NEW,CHANGED,DELETED]}" \
        --batch-import-meta-data-on-create \
        --region ${AWS_REGION}
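
    After creation, the association’s status can be confirmed with a describe call; a minimal check that filters on the same file system ID:

    # Confirm the data repository association reached the AVAILABLE lifecycle state
    aws fsx describe-data-repository-associations \
        --region ${AWS_REGION} \
        --query "Associations[?FileSystemId=='${FSX_ID}'].[DataRepositoryPath,Lifecycle]" \
        --output table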
    

    Observability stack

    The monitoring infrastructure combines Amazon Managed Service for Prometheus and Amazon Managed Grafana to provide comprehensive observability. The team integrated DCGM Exporter for GPU metrics and EFA Exporter for network metrics, enabling real-time monitoring of system health and performance. This setup allows for continuous tracking of GPU health, network performance, and training progress, with automated alerting for any anomalies through Grafana Dashboards. The following screenshot shows an example of a GPU health dashboard.
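
    As a quick sanity check that GPU metrics are flowing, the DCGM Exporter endpoint can be queried directly on a compute node; the sketch below assumes the exporter listens on its default port 9400, which may differ in your deployment:

    # Verify that the DCGM Exporter is serving GPU metrics (default port 9400 assumed)
    curl -s http://localhost:9400/metrics | grep -E 'DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_GPU_TEMP' | head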

    Software stack and training optimizations

    The training environment is built on SageMaker HyperPod DLAMI, which provides a preconfigured Ubuntu base Amazon Machine Image (AMI) with essential components for distributed training. The software stack includes CUDA drivers and libraries (such as cuDNN and cuBLAS), NCCL for multi-GPU communication, and AWS-OFI-NCCL for EFA support. On top of this foundation, the team deployed Megatron-LM as the primary framework for model training. The following diagram illustrates the software stack architecture.
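
    A short way to confirm the main layers of this stack on a node is sketched below; exact paths, package names, and versions depend on the DLAMI release, so treat the commands as a starting point:

    # Sanity-check the GPU driver, EFA provider, and PyTorch/NCCL build on a P5 node
    nvidia-smi --query-gpu=name,driver_version --format=csv
    fi_info -p efa -t FI_EP_RDM
    python -c "import torch; print(torch.__version__, torch.cuda.nccl.version())"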

    Distributed training implementation

    The training implementation uses Megatron-LM’s advanced features for scaling LLM training. The framework provides sophisticated model parallelism capabilities, including both tensor and pipeline parallelism, along with efficient data parallelism that supports communication overlap. These features are essential for managing the computational demands of training a 70-billion-parameter model.

    Advanced parallelism and communication

    The team used Megatron-LM’s comprehensive 4D parallelism strategy, which maximizes GPU utilization through careful optimization of communication patterns across four dimensions: data, tensor, pipeline, and sequence parallelism. Data parallelism splits the training batch across GPUs, tensor parallelism divides individual model layers, pipeline parallelism splits the model into stages across GPUs, and sequence parallelism partitions the sequence length dimension. Together, these enable efficient training of massive models.

    The implementation overlaps communication across data parallelism, tensor parallelism, and pipeline parallelism domains, significantly reducing blocking time during computation. This optimized configuration enables efficient scaling across the full cluster of GPUs while maintaining consistently high utilization rates. The following diagram illustrates this communication and computation overlap in distributed training.

    Megatron-LM enables fine-grained communication overlapping through multiple configuration flags: --overlap-grad-reduce and --overlap-param-gather for data-parallel operations, --tp-comm-overlap for tensor parallel operations, and built-in pipeline-parallel communication overlap (enabled by default). These optimizations work together to improve training scalability.
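
    As a concrete sketch of how these pieces fit together, the fragment below combines the parallelism and overlap flags in a single Megatron-LM invocation; the parallel degrees are illustrative placeholders, not the configuration used for Llama 3.3 Swallow:

    # Illustrative Megatron-LM arguments combining model parallelism with communication overlap
    # (parallel sizes are placeholders, not the production configuration)
    MEGATRON_ARGS=(
        --tensor-model-parallel-size 8     # split each layer across 8 GPUs
        --pipeline-model-parallel-size 4   # split the model into 4 pipeline stages
        --sequence-parallel                # partition the sequence dimension within TP groups
        --use-distributed-optimizer        # prerequisite for --overlap-param-gather
        --overlap-grad-reduce              # overlap gradient reduction with the backward pass
        --overlap-param-gather             # overlap parameter all-gather with the forward pass
        --tp-comm-overlap                  # overlap tensor-parallel communication with GEMMs
    )
    srun python pretrain_gpt.py "${MEGATRON_ARGS[@]}"   # plus model, data, and optimizer arguments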

    Checkpointing strategy

    The training infrastructure implements an optimized checkpointing strategy using Distributed Checkpoint (DCP) and asynchronous I/O operations. DCP parallelizes checkpoint operations across all available GPUs, rather than being constrained by tensor and pipeline parallel dimensions as in traditional Megatron-LM implementations. This parallelization, combined with asynchronous I/O, enables the system to:

    • Save checkpoints up to 10 times faster compared to synchronous approaches
    • Minimize training interruption by offloading I/O operations
    • Scale checkpoint performance with the total number of GPUs
    • Maintain consistency through coordinated distributed saves

    The checkpointing system automatically saves model states to the FSx for Lustre file system at configurable intervals, with metadata tracked in Amazon S3. For redundancy, checkpoints are asynchronously replicated to Amazon S3 storage.
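
    The fragment below sketches how such a setup can be expressed through Megatron-LM’s checkpoint arguments; flag names vary between Megatron-LM versions, and the save path and interval shown are assumptions, so verify them against the release you pin:

    # Hedged sketch of distributed, asynchronous checkpointing arguments
    CKPT_ARGS=(
        --save /fsx/checkpoints/llama-3.3-swallow   # hypothetical FSx for Lustre path
        --save-interval 500                         # illustrative interval, in iterations
        --use-dist-ckpt                             # write checkpoints in distributed (DCP) format
        --async-save                                # offload checkpoint I/O from the training loop
    )
    # Replication of completed checkpoints to Amazon S3 can run as a separate job, for example:
    # aws s3 sync /fsx/checkpoints s3://${BUCKET_NAME_DATA}/checkpoints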

    For implementation details on asynchronous DCP, see Asynchronous Saving with Distributed Checkpoint (DCP).

    Experiment management

    In November 2024, the team introduced a systematic approach to resource optimization through the development of a sophisticated memory prediction tool. This tool accurately predicts per-GPU memory usage during training and semi-automatically determines optimal training settings by analyzing all possible 4D parallelism configurations. Based on proven algorithmic research, this tool has become instrumental in maximizing resource utilization across the training infrastructure. The team plans to open source this tool with comprehensive documentation to benefit the broader AI research community.

    The following screenshot shows an example of the memory consumption prediction tool interface.

    Training pipeline management

    The success of the training process relied heavily on maintaining high-quality data pipelines. The team implemented rigorous data curation and cleaning pipelines, maintaining a careful balance in dataset composition across different languages and domains.

    For experiment planning, version control was critical. The team first fixed the versions of the pre-training and instruction tuning libraries to be used in the next experiment cycle. For libraries without formal version releases, the team managed versions using Git branches or tags to ensure reproducibility (see the sketch after the following list). After the versions were locked, the team conducted short-duration training runs to:

    • Measure throughput with different numbers of GPU nodes
    • Search for optimal configurations among distributed training settings identified by the memory prediction library
    • Establish accurate training time estimates for scheduling
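
    The version pinning mentioned above can be as simple as recording an exact Git ref per library; a hypothetical example for Megatron-LM:

    # Pin a training library to an exact ref for one experiment cycle (tag is a placeholder)
    git clone https://github.com/NVIDIA/Megatron-LM.git
    cd Megatron-LM
    git checkout core_r0.8.0                   # tag or branch chosen for this cycle
    git rev-parse HEAD > ../megatron-lm.lock   # record the exact commit for reproducibility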

    The following screenshot shows an example experiment schedule with GPU node allocation, expected training duration, and key milestones across different training phases.

    To optimize storage performance before beginning experiments, training data was preloaded from Amazon S3 to the FSx for Lustre file system to prevent I/O bottlenecks during training. This preloading process used parallel transfers to maximize throughput:

    # Preload data to Lustre filesystem
    find <data/path> -type f -print0 | xargs -0 -n 1 -P 8 sudo lfs hsm_restore

    Monitoring and performance management

    The team implemented a comprehensive monitoring system focused on real-time performance tracking and proactive issue detection. By integrating with Weights & Biases, the system continuously monitors training progress and delivers automated notifications for key events such as job completion or failure and performance anomalies. Weights & Biases provides a set of tools that enable customized alerting through Slack channels. The following screenshot shows an example of a training monitoring dashboard in Slack.

    The monitoring infrastructure excels at identifying both job failures and performance bottlenecks like stragglers. The following figure presents an example of straggler detection showing training throughput degradation.

    Conclusion

    The successful training of Llama 3.3 Swallow represents a significant milestone in the development of LLMs using cloud infrastructure. Through this project, the team has demonstrated the effectiveness of combining advanced distributed training techniques with carefully orchestrated cloud resources. The implementation of efficient 4D parallelism and asynchronous checkpointing has established new benchmarks for training efficiency, and the comprehensive monitoring and optimization tools have provided consistent performance throughout the training process.

    The project’s success is built on several foundational elements: a systematic approach to resource planning and optimization, robust data pipeline management, and a comprehensive monitoring and alerting system. The efficient storage hierarchy implementation has proven particularly crucial in managing the massive datasets required for training a 70-billion-parameter model.

    Looking ahead, the project opens several promising directions for future development. The team plans to open source the memory prediction tool, so other researchers can benefit from the optimizations developed during this project. Further improvements to the training pipelines are under development, along with continued enhancement of Japanese language capabilities. The project’s success also paves the way for expanded model applications across various domains.

    Resources and references

    This section provides key resources and references for understanding and replicating the work described in this post. The resources are organized into documentation for the infrastructure and tools used, as well as model-specific resources for accessing and working with Llama 3.3 Swallow.

    Documentation

    The following resources provide detailed information about the technologies and frameworks used in this project:

    • Amazon SageMaker HyperPod Workshop (Slurm)
    • Amazon SageMaker HyperPod Workshop (Amazon EKS)
    • AWSome Distributed Training

    Model resources

    For more information about Llama 3.3 Swallow and access to the model, refer to the following resources:

    • Official release page
    • Technical documentation
    • Model weights

    About the Authors

    Kazuki Fujii graduated with a bachelor’s degree in Computer Science from Tokyo Institute of Technology in 2024 and is currently a master’s student there (2024–2026). Kazuki is responsible for the pre-training and fine-tuning of the Swallow model series, a state-of-the-art multilingual LLM specializing in Japanese and English as of December 2023. Kazuki focuses on distributed training and building scalable training systems to enhance the model’s performance and infrastructure efficiency.

    Daisuke Miyamoto is a Senior Specialist Solutions Architect for HPC at Amazon Web Services. He is mainly supporting HPC customers in drug discovery, numerical weather prediction, electronic design automation, and ML training.

    Kei Sasaki is a Senior Solutions Architect on the Japan Public Sector team at Amazon Web Services, where he helps Japanese universities and research institutions navigate their cloud migration journeys. With a background as a systems engineer specializing in high-performance computing, Kei supports these academic institutions in their large language model development initiatives and advanced computing projects.

    Keita Watanabe is a Senior GenAI World Wide Specialist Solutions Architect at Amazon Web Services, where he helps develop machine learning solutions using OSS projects such as Slurm and Kubernetes. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. Keita holds a PhD in Science from the University of Tokyo.
