Large Language Models (LLMs) have grown in complexity and demand, creating significant challenges for companies aiming to provide scalable and cost-effective Model-as-a-Service (MaaS). The rapid adoption of LLMs across applications has produced highly variable workloads in input/output lengths, arrival rates, and service requirements. Balancing resource utilization across these diverse needs, while meeting different Service Level Objectives (SLOs) for latency and throughput, has become a critical challenge. In addition, conventional LLM serving architectures often assume sufficient resources are available to handle every request, an assumption that is increasingly unrealistic as demand rises, especially during peak usage.
The primary challenge is to maximize throughput without compromising latency—particularly as operational costs rise and GPU resources remain limited. To address these issues, Moonshot AI developed a new architecture.
Moonshot AI Open-Sources its Core Reasoning Architecture: Mooncake
China-based AI company Moonshot AI has officially open-sourced its core reasoning architecture, named Mooncake. Mooncake aims to address key scalability and efficiency challenges in LLM serving. Moonshot AI employs a KVCache-centric disaggregated architecture, which sets Mooncake apart from traditional LLM serving platforms. The first open-source component of Mooncake, called the Transfer Engine, is now available on GitHub, with more components planned for future release.
The core of Mooncake is its KVCache-centric approach to handling computational workloads. By separating the prefill and decoding clusters, Mooncake can dynamically optimize resources, making use of underutilized CPU, DRAM, and SSD resources for efficient caching. This separation is crucial for addressing the diverse computational characteristics of LLM serving stages. The decision to open source Mooncake reflects a commitment to transparency and community-driven improvements in LLM scalability.
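As a rough illustration of that caching idea, the sketch below (hypothetical Python, not part of the Mooncake release) keeps hot KVCache blocks in a DRAM tier and spills colder ones to disk, standing in for the SSD tier, so GPU memory stays free for computation:

```python
import os
import tempfile
from collections import OrderedDict
from typing import Optional


class TieredKVCachePool:
    """Toy two-tier KVCache pool: hot entries in DRAM, cold entries spilled to disk.

    Illustrates offloading cache from GPU memory onto underutilized CPU/DRAM/SSD
    capacity; a real system would use pinned memory and RDMA-style transfers.
    """

    def __init__(self, dram_capacity: int, spill_dir: str):
        self.dram_capacity = dram_capacity          # max blocks held in the DRAM tier
        self.spill_dir = spill_dir                  # stand-in for the SSD tier
        self.dram: "OrderedDict[str, bytes]" = OrderedDict()   # kept in LRU order

    def put(self, key: str, kv_block: bytes) -> None:
        self.dram[key] = kv_block
        self.dram.move_to_end(key)
        while len(self.dram) > self.dram_capacity:  # evict the coldest block to disk
            cold_key, cold_block = self.dram.popitem(last=False)
            with open(os.path.join(self.spill_dir, cold_key), "wb") as f:
                f.write(cold_block)

    def get(self, key: str) -> Optional[bytes]:
        if key in self.dram:                        # DRAM hit
            self.dram.move_to_end(key)
            return self.dram[key]
        path = os.path.join(self.spill_dir, key)
        if os.path.exists(path):                    # SSD hit: promote back to DRAM
            with open(path, "rb") as f:
                block = f.read()
            self.put(key, block)
            return block
        return None                                 # miss: prefill must recompute


pool = TieredKVCachePool(dram_capacity=2, spill_dir=tempfile.mkdtemp())
for i in range(3):
    pool.put(f"block-{i}", b"\x00" * 1024)          # third put spills block-0 to disk
print(pool.get("block-0") is not None)              # True: served from the spill tier
```

A production system would rely on pinned memory and high-speed interconnects rather than files, but the division of labor is the same: cache capacity comes from hardware the GPUs are not using.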
Technical Details
Mooncake leverages a KVCache-centric Prefill-Decoding (PD) separation technique and a storage-computation disaggregated architecture, which have significantly improved the inference throughput of Moonshot AI’s LLM service, Kimi. The KVCache mechanism is central to optimizing both throughput and latency. Instead of keeping GPU resources engaged with all aspects of model serving, Mooncake isolates KVCache usage from computational tasks, allowing it to be managed by underutilized hardware like CPUs and SSDs.
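One reason the cache can help both metrics is prefix reuse: requests that share an opening prompt can hit the same cache entries instead of recomputing them during prefill. The snippet below is a hypothetical illustration of prefix-addressed cache keys, not Mooncake's actual keying scheme:

```python
import hashlib
from typing import List

BLOCK_SIZE = 16   # illustrative cache-block granularity, not Mooncake's actual value


def prefix_block_keys(token_ids: List[int]) -> List[str]:
    """Derive one cache key per full block of the prompt prefix.

    Requests that share an opening prompt produce identical leading keys, so
    their KVCache blocks can be fetched from a pool instead of recomputed.
    """
    keys: List[str] = []
    running = hashlib.sha256()
    full_blocks = len(token_ids) // BLOCK_SIZE
    for b in range(full_blocks):
        block = token_ids[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        running.update(" ".join(map(str, block)).encode())   # chain the prefix hash
        keys.append(running.hexdigest())
    return keys


system_prompt = list(range(16))                    # 16 tokens shared by both requests
keys_a = prefix_block_keys(system_prompt + [101, 102, 103])
keys_b = prefix_block_keys(system_prompt + [201, 202, 203])
print(keys_a[0] == keys_b[0])                      # True: the shared block is reusable
```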
Mooncake’s architecture divides LLM serving into two stages—Prefill and Decoding. During the prefill stage, reusable cache is transferred to prefill instances, which optimizes the first token generation while reducing redundant computations. Then, during the decoding stage, the KVCache is aggregated, allowing for efficient batching. This separation has led to substantial performance improvements.
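In outline, the two-stage flow looks like the sketch below (hypothetical class and method names, not Mooncake's interfaces): a prefill instance produces or reuses the KVCache for a prompt and returns a handle, and a decode instance gathers many handles and generates the next token for every request in a single batched step.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    request_id: str
    prompt_tokens: List[int]


class PrefillInstance:
    """Stage 1: builds KVCache for a prompt, skipping work the pool already holds."""

    def __init__(self, pool: Dict[str, List[int]]):
        self.pool = pool                            # shared, transferable cache store

    def prefill(self, req: Request) -> str:
        key = f"kv:{req.request_id}"
        if key not in self.pool:                    # reuse transferred cache when present
            self.pool[key] = list(req.prompt_tokens)   # stand-in for the attention pass
        return key                                  # handle handed to the decode side


class DecodeInstance:
    """Stage 2: aggregates many requests' KVCache and decodes them in one batch."""

    def __init__(self, pool: Dict[str, List[int]]):
        self.pool = pool

    def decode_step(self, handles: List[str]) -> Dict[str, int]:
        # One batched step producing the next token for every active request.
        return {h: sum(self.pool[h]) % 50_000 for h in handles}


pool: Dict[str, List[int]] = {}                     # stand-in for the transferred KVCache
prefill, decode = PrefillInstance(pool), DecodeInstance(pool)
handles = [prefill.prefill(Request(f"r{i}", list(range(8 + i)))) for i in range(3)]
print(decode.decode_step(handles))                  # three requests served in one batch
```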
By implementing a prediction-based early rejection policy, Mooncake also helps prevent system overload during peak request periods. This approach has been instrumental in maintaining Service Level Objectives (SLOs) for time to first token (TTFT) and time between tokens (TBT), even under high workloads. Experimental results have shown that compared to the baseline, Mooncake achieved up to a fivefold increase in throughput in simulated scenarios and enabled 75% more request handling under real-world workloads.
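A simplified version of the early-rejection policy described above (using an illustrative linear cost model and thresholds, not Mooncake's actual predictor) estimates the TTFT a new request would see given the work already queued, and rejects it at admission if that estimate would breach the SLO:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PendingRequest:
    prompt_tokens: int


class EarlyRejectionPolicy:
    """Reject a request at admission time if its predicted TTFT would miss the SLO.

    The linear cost model (seconds per prompt token) is a stand-in for whatever
    predictor the serving system actually uses.
    """

    def __init__(self, ttft_slo_s: float, seconds_per_token: float):
        self.ttft_slo_s = ttft_slo_s
        self.seconds_per_token = seconds_per_token

    def predicted_ttft(self, queue: List[PendingRequest], new_prompt_tokens: int) -> float:
        queued_work = sum(r.prompt_tokens for r in queue)
        return (queued_work + new_prompt_tokens) * self.seconds_per_token

    def admit(self, queue: List[PendingRequest], new_prompt_tokens: int) -> bool:
        return self.predicted_ttft(queue, new_prompt_tokens) <= self.ttft_slo_s


policy = EarlyRejectionPolicy(ttft_slo_s=2.0, seconds_per_token=0.001)
queue = [PendingRequest(800), PendingRequest(700)]
print(policy.admit(queue, new_prompt_tokens=400))   # True: predicted TTFT is 1.9 s
print(policy.admit(queue, new_prompt_tokens=900))   # False: predicted TTFT is 2.4 s
```

Rejecting early, before any prefill work has been spent on a request that cannot meet its target, helps keep the already-admitted requests within their latency budgets.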
The significance of Mooncake’s open-source release is multi-layered. It represents progress in the decentralization of LLM inference workloads, ensuring that no single hardware component becomes a bottleneck. The KVCache-centric scheduling model balances resource loads effectively, enabling service providers to maximize throughput without violating latency requirements. This efficiency is essential given the growing demand for LLM capabilities across industries.
Experimental results demonstrate that Mooncake achieved a fivefold increase in throughput in some simulated long-context scenarios while maintaining the required SLOs. In real-world settings, Mooncake enabled Kimi to handle 75% more requests compared to previous architectures. These improvements highlight Mooncake’s ability to scale efficiently and reduce costs. The disaggregation approach also provides greater flexibility in adding computational resources on-the-fly, which addresses variability in LLM workloads more efficiently than traditional coupled systems.
The phased open-source rollout also encourages collaborative development. By starting with the Transfer Engine, Moonshot AI aims to gather community insights before releasing additional components. This phased approach is intended to lead to further optimizations and broader adoption across various sectors that need efficient LLM serving solutions.
Conclusion
Moonshot AI’s decision to open source Mooncake reflects a broader industry trend towards transparent and scalable AI development practices. By focusing on KVCache-centric separation, Mooncake addresses the key challenges of LLM serving—latency, efficiency, and scalability. It has already shown significant performance gains, making it a promising framework for LLM serving. Mooncake’s architecture balances computational and caching demands effectively, improving resource utilization, reducing latency, and enhancing overall throughput. The phased open-source approach underscores Moonshot AI’s commitment to continuous improvement and community collaboration.
Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.