
    Apple Researchers Propose Cut Cross-Entropy (CCE): A Machine Learning Method that Computes the Cross-Entropy Loss without Materializing the Logits for all Tokens into Global Memory

    November 15, 2024

Advancements in large language models (LLMs) have revolutionized natural language processing, with applications spanning text generation, translation, and summarization. These models rely on vast amounts of data, large parameter counts, and expansive vocabularies, necessitating sophisticated techniques to manage computational and memory requirements. A critical component of LLM training is the cross-entropy loss computation, which, while central to model accuracy, presents significant memory challenges because of the sheer size of the vocabulary.

The memory requirements of the cross-entropy loss layer constrain the training of large language models, especially as vocabulary sizes reach hundreds of thousands of tokens. The issue becomes acute in models like Gemma 2 (2B), where the cross-entropy loss computation alone can consume up to 24 GB of memory, accounting for up to 90% of the memory footprint during training. These limitations restrict batch sizes and force trade-offs between model performance and computational feasibility, posing a significant bottleneck for scalability.
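To see why the logit matrix dominates the memory footprint, a rough back-of-the-envelope calculation helps. The token count and dtype below are illustrative assumptions, not the exact Gemma 2 training configuration:

```python
# Back-of-the-envelope memory cost of materializing logits for every token.
# The figures below are illustrative assumptions, not a real training config.

def logits_memory_gb(batch_tokens: int, vocab_size: int, bytes_per_elem: int = 4) -> float:
    """GiB needed to store a full [tokens x vocab] logit matrix."""
    return batch_tokens * vocab_size * bytes_per_elem / 1024**3

# e.g. ~24k tokens in a batch, a 256k-token vocabulary, fp32 logits:
print(f"{logits_memory_gb(24_576, 256_000):.1f} GiB")  # ~23.4 GiB
```

A single matrix on this scale is in the same ballpark as the 24 GB figure cited above, which is why avoiding its materialization pays off so dramatically.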

Previous methods aimed at reducing memory usage, such as FlashAttention and hierarchical vocabularies, have addressed specific components like self-attention but fall short of alleviating the burden of the cross-entropy layer. Chunking methods reduce memory requirements but introduce latency trade-offs, limiting their practical use. Moreover, these approaches fail to fully exploit the sparsity of gradients or to leverage hardware optimizations, leaving room for improvement.

    Researchers at Apple introduced the Cut Cross-Entropy (CCE) method, a novel approach designed to overcome the memory challenges associated with large vocabulary models. Unlike conventional methods that compute and store all logits for tokens in memory, CCE dynamically calculates only the necessary logits and performs log-sum-exp reductions in on-chip memory. This technique eliminates the need to materialize large matrices in GPU memory, significantly reducing the memory footprint. For instance, in the Gemma 2 model, the memory usage for loss computation dropped from 24 GB to just 1 MB, with total classifier head memory consumption reduced from 28 GB to 1 GB.
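The streaming idea behind this can be illustrated in plain NumPy: per-token cross-entropy needs only the correct-token logit and the log-sum-exp over the vocabulary, and the latter can be accumulated chunk by chunk so the full logit matrix never exists at once. This is a minimal sketch of that principle; the actual CCE implementation uses custom CUDA kernels and on-chip memory, and the function and variable names here are illustrative:

```python
import numpy as np

def chunked_cross_entropy(embeddings, classifier, targets, chunk=1024):
    """Mean cross-entropy via a running log-sum-exp over vocabulary chunks,
    so the full [tokens x vocab] logit matrix is never materialized."""
    n_tokens, vocab = embeddings.shape[0], classifier.shape[0]
    # Logit of each position's correct token: one dot product per token.
    correct = np.einsum("nd,nd->n", embeddings, classifier[targets])
    # Running log-sum-exp over the vocabulary, accumulated chunk by chunk.
    running = np.full(n_tokens, -np.inf)
    for start in range(0, vocab, chunk):
        logits = embeddings @ classifier[start:start + chunk].T  # [n, chunk]
        m = logits.max(axis=1)
        chunk_lse = m + np.log(np.exp(logits - m[:, None]).sum(axis=1))
        running = np.logaddexp(running, chunk_lse)
    # loss_i = logsumexp(logits_i) - logit of correct token
    return np.mean(running - correct)
```

Only one `[tokens x chunk]` slice is live at any time, which is the same reason CCE's on-chip reductions keep the loss computation's footprint near-constant in vocabulary size.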

    The core of CCE lies in its efficient computation strategy, which employs custom CUDA kernels to process embeddings and perform reductions. By calculating logits on the fly and avoiding intermediate memory storage, the method capitalizes on shared GPU memory, which is faster and more efficient than traditional global memory usage. Also, gradient filtering selectively skips computations that contribute negligibly to the gradient, leveraging the inherent sparsity of the softmax matrix. Vocabulary sorting optimizes processing by grouping tokens with significant contributions, minimizing wasted computation. Together, these innovations enable a memory-efficient, low-latency loss computation mechanism.
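The gradient-filtering idea rests on a simple fact: the cross-entropy gradient with respect to the logits is the softmax distribution minus a one-hot target, and after a few training steps most softmax entries are vanishingly small. A hypothetical single-token sketch of that filtering, with an illustrative threshold `eps` (the real method applies this inside fused kernels, not as a post-hoc mask):

```python
import numpy as np

def filtered_softmax_grad(logits, target, eps=1e-4):
    """Cross-entropy gradient w.r.t. one token's logits, zeroing softmax
    entries below eps. Each skipped entry perturbs the gradient by at most
    eps, which is the sparsity that gradient filtering exploits."""
    p = np.exp(logits - logits.max())
    p /= p.sum()                  # numerically stable softmax
    grad = p.copy()
    grad[target] -= 1.0           # d(loss)/d(logits) = softmax - one_hot
    keep = p >= eps
    keep[target] = True           # never drop the correct-token entry
    return np.where(keep, grad, 0.0), keep
```

The returned `keep` mask shows how few vocabulary entries actually need their gradient computed, which is where the skipped work comes from.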

The performance gains from CCE are remarkable. Memory reductions enabled a 10-fold increase in batch size for smaller models like GPT-2 and a 1.5-fold increase for larger models like Llama 2 (13B). Training throughput remained unaffected, and experimental results demonstrated stable convergence, matching the performance of traditional methods. For a batch of 8,192 tokens with a vocabulary size of 256,000, CCE achieved a peak memory usage of just 1 MB, compared to 28 GB for baseline methods. Training stability tests on models such as Llama 3 (8B) and Phi 3.5 Mini confirmed the reliability of CCE, with loss curves indistinguishable from those of existing approaches.

    This research highlights several key takeaways:

    • Significant Memory Reduction: CCE reduces memory usage for cross-entropy loss computation to negligible levels, as low as 1 MB for large-scale models like Gemma 2 (2B).  
    • Improved Scalability: By enabling larger batch sizes, the method supports more efficient utilization of computational resources, which is crucial for training extensive models.  
    • Efficiency Gains: Custom CUDA kernels and gradient filtering ensure that the reduction in memory footprint does not compromise training speed or model convergence.  
    • Practical Applicability: The method is adaptable to various architectures and scenarios, with potential applications extending to image classification and contrastive learning.  
    • Future Potential: CCE’s ability to handle large vocabularies with minimal memory impact could facilitate training even more extensive models with improved pipeline balancing.

    In conclusion, the CCE method represents a significant breakthrough in training large language models by addressing the critical bottleneck of memory-intensive cross-entropy loss layers. Through innovative techniques like dynamic logit computation, gradient filtering, and vocabulary sorting, CCE enables dramatic reductions in memory usage without sacrificing speed or accuracy. This advancement not only enhances the efficiency of current models but also paves the way for more scalable and balanced architectures in the future, opening new possibilities for large-scale machine learning.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

    The post Apple Researchers Propose Cut Cross-Entropy (CCE): A Machine Learning Method that Computes the Cross-Entropy Loss without Materializing the Logits for all Tokens into Global Memory appeared first on MarkTechPost.
