Reasoning systems such as OpenAI's o1 were recently introduced to solve complex tasks through slow-thinking processes. However, large language models on their own struggle to plan, decompose problems, refine ideas, summarize, and reconsider earlier steps, limitations that stem from how they are trained. While slow-thinking systems aim to strengthen reasoning, they still depend on structured guidance and additional processing time, raising doubts about whether they can handle complex tasks without regular human intervention.
Most current reasoning systems rely on fast-thinking approaches that return quick responses but sacrifice depth and accuracy. Industry labs have developed and maintained the strongest of these systems, yet their core techniques are not publicly disclosed. Such systems typically fail at extended thinking, which considerably limits their ability to solve complex problems. Some earlier systems combined tree search with reward models, but those methods either generalized poorly across domains or were too slow for real-world use. Newer systems apply test-time scaling, allotting more computation to generate detailed reasoning steps, called thoughts, before committing to a solution. Fine-tuning large language models on long chains of thought has also improved performance on complex tasks.
To address this, researchers from the Gaoling School of Artificial Intelligence at Renmin University of China and BAAI proposed a three-phase training framework, "imitate, explore, and self-improve," for building reasoning models comparable to OpenAI's o1 system.
In the imitation phase, the model was trained on a small amount of data to follow a fixed output format that separates the reasoning process from the final solution. In the exploration phase, the model tackled difficult problems, generating multiple candidate solutions and refining them against the reference answers, which is especially important for tasks requiring slow thinking. In the self-improvement phase, the resulting high-quality data was fed back through supervised fine-tuning (SFT) and direct preference optimization (DPO) to further strengthen the model's reasoning, with metrics such as length and perplexity used to filter out low-quality traces. The authors note two constraints: the pool of sufficiently challenging problems was limited, and reinforcement learning was not used because of limited resources. The approach therefore focuses on improving the model's reasoning abilities through continuous refinement.
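To make the filtering step concrete, here is a minimal sketch (not the authors' code) of how reasoning traces might be screened by length and perplexity before the next SFT round; the model name follows the paper's backbone, while the thresholds and function names are illustrative assumptions.

```python
# Sketch of length- and perplexity-based filtering of generated reasoning traces.
# Thresholds below are assumptions, not values reported in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-32B-Instruct"  # backbone named in the article
MAX_TOKENS = 8192          # assumed cap on trace length
MAX_PERPLEXITY = 12.0      # assumed quality threshold

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of a candidate reasoning trace under the backbone model."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    loss = model(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

def keep_trace(trace: str) -> bool:
    """Filter rule: discard traces that are too long or too high in perplexity."""
    n_tokens = len(tokenizer(trace).input_ids)
    return n_tokens <= MAX_TOKENS and perplexity(trace) <= MAX_PERPLEXITY

# Usage: candidate_traces would be model-generated (thought, solution) strings
# that already passed an answer-correctness check in the exploration phase.
# filtered = [t for t in candidate_traces if keep_trace(t)]
```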
Researchers evaluated the framework on three challenging benchmarks: MATH-OAI, AIME2024, and GPQA. MATH-OAI contains 500 competition mathematics problems, AIME2024 consists of 30 problems aimed at top high-school students, and GPQA includes 198 multiple-choice questions in biology, physics, and chemistry. The focus was on mathematics, with Qwen2.5-32B-Instruct as the backbone model, compared against systems such as o1-preview, DeepSeek-R1-Lite-Preview, and QwQ-32B. The experiments used greedy search with a generation budget of up to 32k tokens.
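The evaluation setup can be approximated with a short sketch, assuming a standard Hugging Face generation loop rather than the paper's actual harness; the greedy decoding and 32k-token budget come from the text, while the prompt handling and function name are assumptions.

```python
# Sketch of benchmark-style inference: greedy decoding with a 32k-token budget.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def solve(problem: str) -> str:
    """Generate a long reasoning trace plus final answer with greedy search."""
    messages = [{"role": "user", "content": problem}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(
        inputs,
        do_sample=False,       # greedy search, as described in the experiments
        max_new_tokens=32768,  # 32k-token budget from the text
    )
    return tokenizer.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True)

# Accuracy would then be computed by comparing the extracted final answer
# against the MATH-OAI / AIME2024 / GPQA reference answers.
```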
Results showed that slow-thinking systems such as o1-preview performed well, particularly on AIME, while distillation- and exploration-based training also produced competitive outcomes. A model distilled on 3.9k instances reached 90.2% accuracy on MATH-OAI and 46.7% on AIME. Iterative SFT and exploration-based training improved performance on AIME and MATH-OAI, with variants trained on 1.1k instances showing consistent gains. Performance fluctuated, however, because of limited exploration capacity, especially on AIME, which has few test samples. The analysis indicated that excluding hard problems reduced performance, while mixing mathematical data with data from other domains enhanced reasoning ability. A further DPO analysis showed that aligning only the thought process while applying SFT led to stable optimization, though more experiments are needed to refine these strategies. Overall, the combination of iterative training, distillation, and exploration supported improvement across all benchmarks.
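As a rough illustration of where the DPO preference data could come from, the sketch below pairs verified-correct traces with failed attempts on the same problem; the field names follow the common DPO data convention and the whole construction is an assumption, not the authors' pipeline.

```python
# Sketch: assembling DPO preference pairs from exploration-phase outputs.
from typing import Iterable

def build_dpo_pairs(problem: str,
                    correct_traces: Iterable[str],
                    incorrect_traces: Iterable[str]) -> list[dict]:
    """Pair each verified-correct trace with an incorrect one for the same problem."""
    pairs = []
    for chosen, rejected in zip(correct_traces, incorrect_traces):
        pairs.append({
            "prompt": problem,
            "chosen": chosen,      # thought + solution that reached the right answer
            "rejected": rejected,  # trace that reached a wrong answer
        })
    return pairs
```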
In summary, the researchers presented a slow-thinking framework for enhancing reasoning systems and demonstrated its effectiveness on complex problems across domains. By training on high-quality, long-form thought data, the approach enables models to generalize to difficult tasks, particularly in mathematics, and to improve themselves through exploration and flexible thought processes. The research is still at an early stage, however, and a performance gap remains compared with industry-level systems. The framework can nonetheless serve as a baseline for future work in this area.
Check out the Paper. All credit for this research goes to the researchers of this project.