
    CMU Researchers Propose miniCodeProps: A Minimal AI Benchmark for Proving Code Properties

    December 18, 2024

    AI agents have recently shown promising progress in automating mathematical theorem proving and code correctness verification using interactive theorem provers such as Lean. These tools pair code with formal specifications and proofs, ensuring the code meets its intended requirements and offering strong safeguards in safety-critical applications. Large language models can now assist with each of the fundamental steps of this workflow: writing the code, stating its specification, and proving that the specification holds. Despite this progress, fully automating program verification remains challenging.

    Neural theorem proving for Lean has traditionally relied on models trained on mathematical libraries such as Mathlib, which teach specific definitions and proof strategies. These models have struggled to transfer to program verification, which demands different methods and lemmas. While machine learning has improved proof automation in systems like Coq and Isabelle, comparable progress for Lean-based program verification is still missing. Verification-oriented tools such as Dafny and Verus, and benchmarks such as miniF2F and CoqGym, offer alternatives, but they do not fully address the challenge of adapting mathematical theorem-proving methods to the needs of program verification.

    To address this gap, researchers from Carnegie Mellon University proposed miniCodeProps, a benchmark of 201 program specifications in the Lean proof assistant, targeting the challenge of automatically generating proofs for programs and their specifications. miniCodeProps consists of simple, self-contained programs over lists, natural numbers, and binary trees, with properties of varying proof difficulty. The 201 theorem statements fall into three categories: intuitive properties of lists, trees, and numbers (medley, easy), termination lemmas for recursive functions (termination, medium), and properties of nonstandard sorting algorithms (sorting, hard). The functions primarily operate on linked lists, with some involving natural numbers and binary trees. Termination lemmas matter because Lean 4 requires a proof that every recursive function terminates. The dataset is distributed in jsonlines format and includes essential details such as the proof state and dependencies for each theorem. Examples such as the zip-over-concatenation property and the sorting properties illustrate the difficulty of the task, especially for the more complex sorting algorithms.
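    To make the setup concrete, here is a small, hypothetical example in the style of a miniCodeProps entry: a self-contained Lean 4 function paired with a property to prove. The function and theorem names are illustrative, not taken from the benchmark itself.

    ```lean
    -- A simple self-contained program: list append, written by hand.
    def myAppend : List α → List α → List α
      | [],      ys => ys
      | x :: xs, ys => x :: myAppend xs ys

    -- Specification: the length of the result is the sum of the lengths.
    -- A prover is judged on producing a proof script like the one below.
    theorem myAppend_length (xs ys : List α) :
        (myAppend xs ys).length = xs.length + ys.length := by
      induction xs with
      | nil => simp [myAppend]
      | cons x xs ih => simp [myAppend, ih]; omega
    ```

    A benchmark entry would pair such a theorem statement with its initial proof state and dependencies, leaving the proof script for the model to generate.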

    The evaluation of miniCodeProps focused on two tasks: full-proof generation and tactic-by-tactic generation. In full-proof generation, models must produce a complete proof of a given specification in one attempt. In tactic-by-tactic generation, models are instead asked to suggest the next appropriate tactic from the current proof state, testing incremental reasoning. The evaluation spans the benchmark's difficulty levels, from simple properties of lists and numbers to the harder termination and sorting properties, and measures both efficiency and correctness of the generated proofs or tactics.
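    The evaluation loop for full-proof generation can be sketched as follows. This is a minimal illustration, not the benchmark's harness: the jsonlines field names and the stub prover are assumptions, and a real run would call a model and check its proof with Lean.

    ```python
    import json

    # Two illustrative jsonlines entries (field names are hypothetical).
    sample_jsonl = """\
    {"theorem": "zip_append", "category": "medley", "statement": "..."}
    {"theorem": "sort_perm", "category": "sorting", "statement": "..."}
    """

    def attempt_proof(entry):
        # Stand-in for a model call plus Lean proof checking; here we
        # pretend only the easy medley property gets proved.
        return entry["category"] == "medley"

    def evaluate(jsonl_text):
        # Tally pass/total counts per difficulty category.
        results = {}
        for line in jsonl_text.strip().splitlines():
            entry = json.loads(line)
            passed, total = results.get(entry["category"], (0, 0))
            results[entry["category"]] = (passed + attempt_proof(entry), total + 1)
        # Report a success rate per category.
        return {cat: passed / total for cat, (passed, total) in results.items()}

    print(evaluate(sample_jsonl))  # → {'medley': 1.0, 'sorting': 0.0}
    ```

    Per-category rates like these are exactly what the paper reports: an aggregate success rate for medley, termination, and sorting.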

    The results showed that neural theorem provers such as GPT-4o perform well on the simpler tasks, achieving a 75.6% success rate on the medley properties. Performance on the harder categories was far lower: 4.34% on termination and 6.96% on sorting. The Mathlib-trained model ntp-ctx-1.3B was roughly as effective as GPT-4o, suggesting that domain-specific provers may hold further promise. miniCodeProps thus provides a framework for improving automated theorem-proving agents for code verification, supporting human engineers and offering additional guarantees through diverse reasoning approaches.

    In summary, miniCodeProps is a valuable benchmark for advancing automated ITP-based code verification. Its problems, drawn from a range of inductive problem datasets, enable stepwise progress in checking program properties. Current provers, however, still fall short on its more complicated problems. miniCodeProps can drive advances in verification agents and serve as a baseline for evaluating new approaches to automated code verification.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post CMU Researchers Propose miniCodeProps: A Minimal AI Benchmark for Proving Code Properties appeared first on MarkTechPost.
