
    Xiaomi introduced MiMo-7B: A Compact Language Model that Outperforms Larger Models in Mathematical and Code Reasoning through Rigorous Pre-Training and Reinforcement Learning

    May 2, 2025

    With rising demand for AI systems that can handle tasks involving multi-step logic, mathematical proofs, and software development, researchers have turned their attention toward enhancing models’ reasoning potential. This capability, once believed to be exclusive to human intelligence, is now actively being pursued in smaller-scale models to make them more efficient and widely deployable. As reasoning-based tasks continue to expand in relevance, encompassing academic problem-solving, automated theorem proving, algorithm design, and complex software debugging, language models are expected to become more than general-purpose conversational agents: domain-specific problem solvers that can assist professionals and researchers alike.

    One challenge in building reasoning-focused models is achieving strong, simultaneous performance in mathematics and programming while maintaining a relatively small model size. Most competitive results in these domains are achieved by models with approximately 32 billion parameters or more. These large models are often used because smaller ones struggle with generalization and reward optimization in reinforcement learning tasks, particularly when it comes to code-based problem-solving. Sparse reward feedback, limited high-quality data, and weak base model architecture make it difficult to develop compact yet powerful models. Additionally, the data used to train these models is not always curated with reasoning in mind, often resulting in training inefficiencies and limited gains in problem-solving abilities.

    To address reasoning challenges, several models, including OpenAI’s o-series, DeepSeek R1, and Claude 3.7, have been introduced, leveraging massive parameter counts and complex reinforcement learning strategies. These models employ techniques such as step-by-step planning and backtracking to enhance reasoning, particularly in algorithmic thinking and math-related tasks. However, they depend heavily on post-training stages and underplay the importance of high-quality pre-training data. Many also rely on fixed, template-based reward systems that are prone to reward hacking. Code generation benchmarks often reveal that these models perform inconsistently on challenging tasks because of shallow pre-training foundations and ineffective reward signal modeling during fine-tuning.

    A research team from Xiaomi introduced the MiMo-7B family of language models with a focused approach to overcoming these barriers. The innovation lies in treating pre-training and post-training as equally critical phases for developing reasoning capabilities. The base model, MiMo-7B-Base, was trained from scratch on a dataset of 25 trillion tokens, constructed with a three-stage mixture strategy that progressively increased the share of mathematical and programming content. An additional multi-token prediction (MTP) objective was introduced during pre-training to improve both performance and inference speed. For post-training, the team curated a dataset of 130,000 verifiable math and programming problems, each tagged with a difficulty score. Reinforcement learning was then applied using a difficulty-driven reward framework, allowing more nuanced and effective feedback during training. This resulted in two major variants: MiMo-7B-RL and MiMo-7B-RL-Zero.
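    To make the difficulty-driven reward idea concrete, the sketch below shows one plausible way to weight a verifiable-problem reward by annotated difficulty and per-test-case pass rate. It is an illustrative assumption rather than the exact formulation used by the Xiaomi team; the names difficulty_driven_reward, difficulty, passed_tests, and total_tests are hypothetical.

```python
# Illustrative sketch of a difficulty-driven reward for verifiable problems.
# The exact reward used for MiMo-7B-RL is not reproduced here; field names
# and the weighting scheme are assumptions for demonstration only.

def difficulty_driven_reward(problem: dict, passed_tests: int, total_tests: int) -> float:
    """Scale a per-test-case pass rate by the problem's annotated difficulty.

    Granting partial credit per passed test case, and more credit for harder
    problems, gives the policy a denser signal than a flat pass/fail reward.
    """
    if total_tests == 0:
        return 0.0
    pass_rate = passed_tests / total_tests            # fraction of test cases passed
    weight = 1.0 + problem.get("difficulty", 0.0)     # assumed difficulty score in [0, 1]
    return pass_rate * weight

# Example: a hard problem (difficulty 0.9) with 8 of 10 test cases passing.
reward = difficulty_driven_reward({"difficulty": 0.9}, passed_tests=8, total_tests=10)
print(round(reward, 2))  # 1.52
```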

    The pre-training methodology started by extracting reasoning-heavy content from web pages, academic papers, and books using a custom HTML extraction tool designed to preserve math equations and code snippets. Unlike generic pipelines, this extractor retained structural elements critical to problem-solving domains. The team also enhanced PDF parsing tools to interpret scientific and programming content accurately. To prevent data duplication, global deduplication was applied using URL-based and MinHash techniques. The training corpus was filtered using small language models fine-tuned to tag content quality, replacing outdated heuristic-based filters that often removed valuable reasoning examples. High-quality synthetic reasoning data, generated by advanced models, was added in the final stage of training. Under the three-stage mixture, math and code data rose to roughly 70% of the corpus in stage two, with about 10% synthetic problem-solving content added in stage three. The maximum context length was extended from 8,192 to 32,768 tokens, ensuring the model could handle long-form reasoning problems.
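    As one concrete illustration of the global deduplication step, the sketch below removes near-duplicate documents with MinHash signatures and locality-sensitive hashing via the open-source datasketch library. The shingle size and similarity threshold are assumed values for demonstration, not the settings used in the MiMo pipeline.

```python
# Minimal sketch of content-level near-duplicate filtering with MinHash,
# one way to implement the global deduplication step described above.
# Threshold and shingle size are illustrative assumptions.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128, shingle: int = 5) -> MinHash:
    """Build a MinHash signature over word shingles of a document."""
    m = MinHash(num_perm=num_perm)
    tokens = text.split()
    for i in range(max(1, len(tokens) - shingle + 1)):
        m.update(" ".join(tokens[i:i + shingle]).encode("utf-8"))
    return m

def deduplicate(docs: list, threshold: float = 0.8) -> list:
    """Keep only documents whose signature has no near-duplicate already kept."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for idx, text in enumerate(docs):
        sig = minhash_of(text)
        if not lsh.query(sig):          # no similar document seen so far
            lsh.insert(str(idx), sig)
            kept.append(text)
    return kept
```

    In practice, cheap URL-level deduplication would run before a content-level pass like this, mirroring the two-level scheme described above.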

    In the reinforcement learning stage, the research team engineered a seamless rollout engine to accelerate training and validation. This infrastructure incorporated asynchronous reward computation and early termination mechanisms to reduce GPU idle time, resulting in 2.29 times faster training and 1.96 times faster validation. The model’s policy was optimized using fine-grained rewards derived from test-case difficulty, addressing the sparse-reward issue in programming benchmarks. Data re-sampling techniques were introduced to maintain training stability and increase rollout sampling efficiency. These strategies collectively enabled the MiMo-7B variants to learn effectively, even from a cold start with no supervised fine-tuned initialization.
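    The snippet below is a highly simplified illustration of the asynchronous reward computation idea: finished rollouts are scored in a background thread pool so the accelerator can keep generating instead of waiting on test-case execution. The run_test_cases verifier and the rollout record layout are hypothetical stand-ins, not the MiMo rollout engine itself.

```python
# Simplified illustration of asynchronous reward computation during RL rollouts.
# run_test_cases() and the rollout record layout are hypothetical; a real system
# would execute generated code in a sandbox and add early-termination logic.
from concurrent.futures import ThreadPoolExecutor

def run_test_cases(completion: str, tests: list) -> float:
    """Hypothetical verifier: fraction of test cases the completion passes."""
    passed = sum(1 for check in tests if check(completion))
    return passed / len(tests) if tests else 0.0

def score_rollouts_async(rollouts: list, max_workers: int = 8) -> list:
    """Submit reward computation to a thread pool so generation of the next
    batch is not blocked while test cases execute."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_test_cases, r["completion"], r["tests"])
                   for r in rollouts]
        # Generation of the next batch could proceed here; rewards are
        # collected only when they are needed for the policy update.
        return [f.result() for f in futures]
```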

    Performance evaluation revealed that MiMo-7B-Base achieved a score of 75.2 on the Big-Bench Hard (BBH) task, surpassing other open-source 7B models. It also performed well on SuperGPQA, which includes graduate-level reasoning questions. The post-trained MiMo-7B-RL scored 55.4 on the AIME 2025 benchmark, surpassing OpenAI’s o1-mini by 4.7 points. On code generation tasks, it outperformed much larger models like DeepSeek-R1-Zero-32B and Qwen2.5-32B-RL-Zero on both LiveCodeBench v5 and v6. These results demonstrate that a properly optimized 7B model can rival or even outperform models with more than four times the number of parameters.

    The MiMo-7B project serves as a concrete demonstration of how pre-training, data quality, and reinforcement learning infrastructure contribute to the final reasoning capability of a language model. By rethinking the pipeline from data extraction to reward computation, the Xiaomi research team achieved compact yet powerful models suitable for real-world applications in mathematics, coding, and logic. Their approach highlights the untapped potential of small models and challenges the assumption that size alone determines intelligence or versatility.

    Key Takeaways from the Research on MiMo-7B:  

    1. MiMo-7B was trained on a massive dataset of 25 trillion tokens, targeting reasoning tasks through the use of structured data mixtures.  
    2. The RL stage used 130,000 verifiable math and code problems, each annotated with a difficulty score to enable effective reward shaping.
    3. Three-stage pre-training raised math and coding content to 70%, followed by 10% synthetic problem-solving data.  
    4. A seamless rollout engine increased RL training speed by 2.29 times and validation by 1.96 times.  
    5. MiMo-7B-RL achieved 55.4 on AIME 2025, outperforming OpenAI o1-mini by 4.7 points.  
    6. MiMo-7B models are publicly available and include all checkpoints: base, SFT, and RL variants.  
    7. The model’s success shows that small, well-designed models can rival or exceed the performance of 32B models in reasoning tasks.  

    Check out the Paper and GitHub Page.

    Source: MarkTechPost
