Large Language Models (LLMs) have made considerable advances in natural language understanding and generation through scalable pretraining and fine-tuning. A major challenge persists, however, in strengthening their reasoning abilities, particularly on complex logical and mathematical tasks. Reinforcement Learning from Human Feedback (RLHF) depends on reward models (RMs) fine-tuned with high-quality preference data, yet such data is scarce, costly, and labor-intensive to collect. This scarcity limits the scalability of RMs and creates a critical bottleneck for advancing LLM capabilities in reasoning tasks such as problem-solving and decision-making.
Current approaches to improving reward models, such as Anthropic’s Preference Model Pretraining (PMP), attempt to address data efficiency by pretraining on large public datasets such as those from Reddit or Wikipedia. These datasets, however, are not tailored to reasoning-specific tasks. Annotating preference data for reasoning, especially for complex logical and mathematical problems, is difficult to scale, which limits the applicability of existing methods. The computational cost of these models also makes them impractical for real-time applications, and their reliance on vast amounts of human-annotated data further constrains scalability. As a result, existing methods struggle to deliver the efficiency required for fine-tuning reward models on reasoning tasks.
The researchers from the University of Chinese Academy of Sciences introduced CodePMP, a novel pretraining method that generates large-scale preference data from publicly available source code, specifically tailored for reasoning tasks. By leveraging the structured and logical nature of code, the proposed method synthesizes millions of code-preference pairs for use in training reward models. Two language models, one strong and one weak, are employed to generate chosen and rejected code responses for a given prompt, creating a rich dataset for pretraining. This innovative approach overcomes the limitations of existing methods by automating preference data generation, significantly improving the efficiency and scalability of RM fine-tuning. CodePMP enables models to generalize better across reasoning tasks, providing a cost-effective solution that reduces reliance on human-annotated data.
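To make the pair-synthesis idea concrete, below is a minimal sketch of how chosen and rejected code completions could be generated with a stronger and a weaker model. The model names, prompt, and sampling settings are illustrative assumptions, not the actual checkpoints or pipeline used in the paper.

```python
# Minimal sketch of CodePMP-style preference-pair synthesis (not the authors' code).
# Assumptions: the model names below are placeholders; the paper pairs a stronger
# and a weaker code LLM, and any two such checkpoints could be substituted.
from transformers import pipeline

STRONG_MODEL = "bigcode/starcoder2-7b"  # placeholder for the stronger code LLM
WEAK_MODEL = "bigcode/starcoder2-3b"    # placeholder for the weaker code LLM

strong_gen = pipeline("text-generation", model=STRONG_MODEL, device_map="auto")
weak_gen = pipeline("text-generation", model=WEAK_MODEL, device_map="auto")

def make_preference_pair(prompt: str, max_new_tokens: int = 256) -> dict:
    """Generate a (chosen, rejected) completion pair for one code prompt.

    The stronger model's completion is treated as 'chosen' and the weaker
    model's as 'rejected', mirroring the weak/strong sampling idea in CodePMP.
    """
    chosen = strong_gen(
        prompt, max_new_tokens=max_new_tokens, do_sample=True, return_full_text=False
    )[0]["generated_text"]
    rejected = weak_gen(
        prompt, max_new_tokens=max_new_tokens, do_sample=True, return_full_text=False
    )[0]["generated_text"]
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Example with a hypothetical code prompt derived from a public source file.
pair = make_preference_pair("def merge_sorted_lists(a, b):\n    ")
```

Run at scale over prompts mined from public repositories, this kind of loop is what lets preference data grow to millions of pairs without human annotation.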
CodePMP involves two key components: Reward Modeling (RM) and Language Modeling (LM). For RM, the model is trained on code-preference pairs with a pairwise ranking loss, learning to score higher-quality responses above lower-quality ones. The LM component trains only on the chosen responses, so the model retains general language understanding while its reasoning performance improves. The pretraining dataset comprises 28 million files and 19 billion tokens sourced from GitHub, with a balanced distribution of chosen and rejected responses to ensure unbiased learning. This scalable dataset enables the model to generalize effectively across multiple reasoning tasks, improving RM fine-tuning efficiency.
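The combination of the two objectives can be illustrated with a short PyTorch sketch. The tensor shapes, the `lm_weight` coefficient, and the function name are assumptions made for illustration; the paper's exact formulation and weighting may differ.

```python
# A minimal PyTorch sketch of the two training objectives described above:
# a pairwise ranking loss on (chosen, rejected) rewards, plus an LM loss
# computed only on the chosen response. `lm_weight` is an assumed knob.
import torch
import torch.nn.functional as F

def codepmp_loss(chosen_reward, rejected_reward, chosen_lm_logits, chosen_labels, lm_weight=1.0):
    """Combine the RM pairwise ranking loss with an LM loss on the chosen response.

    chosen_reward, rejected_reward: (batch,) scalar rewards from the RM head.
    chosen_lm_logits: (batch, seq_len, vocab) logits over the chosen response.
    chosen_labels: (batch, seq_len) token ids of the chosen response (-100 = ignore).
    """
    # Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected)
    rm_loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()

    # Standard next-token LM loss, computed on the chosen response only
    lm_loss = F.cross_entropy(
        chosen_lm_logits[:, :-1].reshape(-1, chosen_lm_logits.size(-1)),
        chosen_labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    return rm_loss + lm_weight * lm_loss
```

Keeping the LM term on the chosen responses is what prevents the reward-model pretraining from eroding the model's general language ability.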
CodePMP demonstrated significant improvements across mathematical and logical reasoning tasks. Models pretrained with CodePMP consistently outperformed those without it in both RM accuracy and Best-of-N performance, and the gains held across both 1.5B and 7B model sizes. In mathematical reasoning tasks the pretrained models achieved substantially higher accuracy, and in logical reasoning tasks they showed an enhanced ability to distinguish correct from incorrect reasoning steps. These results highlight CodePMP's effectiveness in boosting RM fine-tuning efficiency, yielding better generalization and performance across diverse reasoning domains.
In conclusion, CodePMP presents a scalable and efficient approach to improve reasoning abilities in large language models by leveraging code-preference pairs generated from publicly available source code. This innovative method addresses the challenge of limited reasoning-specific data and significantly enhances reward model fine-tuning. The improvements achieved through CodePMP are robust across multiple reasoning tasks, indicating that it provides a scalable, cost-effective solution to enhancing LLM performance in areas requiring complex reasoning. The approach holds potential to advance LLMs’ capabilities in domains such as mathematical problem-solving, logical deduction, and decision-making.
Check out the Paper. All credit for this research goes to the researchers of this project.