    UC Berkeley Introduces CyberGym: A Real-World Cybersecurity Evaluation Framework to Evaluate AI Agents on Large-Scale Vulnerabilities Across Massive Codebases

    June 20, 2025

    Cybersecurity has become a significant area of interest in artificial intelligence, driven by the increasing reliance on large software systems and the expanding capabilities of AI tools. As threats evolve in complexity, ensuring the security of software systems has become more than just a matter of conventional protections; it now intersects with automated reasoning, vulnerability detection, and code-level comprehension. Modern cybersecurity requires tools and methods that can simulate real-world scenarios, identify hidden flaws, and validate system integrity across diverse software infrastructures. Within this environment, researchers have been developing benchmarks and methods to systematically evaluate AI agents’ ability to understand, detect, and even exploit vulnerabilities, drawing parallels with human security researchers. However, bridging the gap between AI reasoning and real-world cybersecurity complexities remains a key challenge.

    Problem with Existing Benchmarks

    One pressing issue is the lack of effective ways to evaluate whether AI systems are truly capable of understanding and handling security tasks under realistic conditions. Simplified benchmark tasks often dominate current testing methods, which rarely mirror the messy and layered reality of large-scale software repositories. These environments involve intricate input conditions, deep code paths, and subtle vulnerabilities that demand more than surface-level inspection. Without robust evaluation methods, it’s difficult to determine whether AI agents can be trusted to perform tasks like vulnerability detection or exploit development. More importantly, current benchmarks don’t reflect the scale and nuance of vulnerabilities found in actively maintained, widely used software systems, leaving a critical evaluation gap.

    Limitations of Current Tools

    Several benchmarks have been used to evaluate cybersecurity capabilities, including Cybench and the NYU CTF Bench. These focus on capture-the-flag-style tasks that offer limited complexity, typically involving small codebases and constrained test environments. Some benchmarks attempt to engage real-world vulnerabilities, but they often do so at a limited scale. Furthermore, many of the tools rely on either synthetic test cases or narrowly scoped challenge problems, which fail to represent the diversity of software inputs, execution paths, and bug types found in actual systems. Even specialized agents created for security analysis have been tested on benchmarks containing only tens or a few hundred tasks, a scale that falls far short of real-world threat landscapes.

    Introducing CyberGym

    Researchers introduced CyberGym, a large-scale and comprehensive benchmarking tool specifically designed to evaluate AI agents in real-world cybersecurity contexts. Developed at the University of California, Berkeley, CyberGym includes 1,507 distinct benchmark tasks sourced from actual vulnerabilities found and patched across 188 major open-source software projects. These vulnerabilities were originally identified by OSS-Fuzz, a continuous fuzzing campaign maintained by Google. To ensure realism, each benchmark instance includes the full pre-patch codebase, an executable, and a textual description of the vulnerability. Agents must generate a proof-of-concept test that reproduces the vulnerability in the unpatched version, and CyberGym evaluates success based on whether the vulnerability is triggered in the pre-patch version and absent in the post-patch one. This benchmark uniquely emphasizes the generation of Proof of Concepts (PoCs), a task that requires agents to traverse complex code paths and synthesize inputs to meet specific security conditions. CyberGym is modular and containerized, enabling easy expansion and reproducibility.
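
    Conceptually, the success check reduces to running the candidate PoC against both builds of the target. The following Python sketch illustrates that pass/fail rule under stated assumptions: the binary names are hypothetical, and a non-zero exit code stands in for real crash detection (this is not the actual CyberGym harness).

        # Minimal sketch of the success criterion described above: a PoC "passes"
        # only if it crashes the pre-patch binary and NOT the post-patch one.
        # Binary paths and the exit-code crash heuristic are illustrative assumptions.
        import subprocess

        def triggers_crash(binary: str, poc_path: str, timeout: int = 30) -> bool:
            """Run a fuzz-target-style binary on the PoC file and report whether it crashed."""
            try:
                result = subprocess.run([binary, poc_path], capture_output=True, timeout=timeout)
            except subprocess.TimeoutExpired:
                return False  # hangs are treated as non-crashes in this sketch
            # Sanitizer-instrumented targets typically abort with a non-zero exit code.
            return result.returncode != 0

        def poc_is_valid(pre_patch_bin: str, post_patch_bin: str, poc_path: str) -> bool:
            """CyberGym-style criterion: crash before the patch, no crash after it."""
            return triggers_crash(pre_patch_bin, poc_path) and not triggers_crash(post_patch_bin, poc_path)

        if __name__ == "__main__":
            # Hypothetical artifact names for one benchmark instance.
            print(poc_is_valid("./target_prepatch", "./target_postpatch", "poc.bin"))

    Because the tasks originate from OSS-Fuzz and the benchmark is containerized, the real harness presumably checks sanitizer reports inside a container rather than a bare exit code, but the pre-patch/post-patch comparison is the same.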

    CyberGym Evaluation Levels

    The evaluation pipeline in CyberGym is built around four difficulty levels (0 through 3) that differ in how much input information the agent receives. At level 0, the agent is given only the codebase, with no hint about the vulnerability. Level 1 adds a natural-language description of the vulnerability. Level 2 introduces a ground-truth proof of concept (PoC) and the corresponding crash stack trace, while level 3 also includes the patch itself and the post-patch codebase. Each level changes how much the agent must infer on its own; at level 1, for instance, agents must locate the vulnerability and reconstruct its context purely from the textual description and the codebase. To ensure benchmark quality, CyberGym applies filters such as checking the informativeness of patch commit messages, validating PoC reproducibility, and removing redundant tasks by comparing stack traces. The final dataset comprises codebases with a median of 1,117 files and 387,491 lines of code, ranging up to over 40,000 files and 7 million lines of code. Patch sizes also vary, modifying a median of one file and seven lines but sometimes spanning 40 files and over 3,000 lines. The vulnerabilities cover a range of crash types, with 30.4% related to heap-buffer-overflow READ and 19.0% due to use of an uninitialized value.
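
    To make the level structure concrete, the following Python sketch (an illustrative data layout, not the official task schema; all field names are assumptions) shows which inputs an agent would see at each level.

        # Illustrative sketch of how the four levels differ only in how much
        # information the agent is shown; field names are assumptions.
        from dataclasses import dataclass
        from typing import Optional

        @dataclass
        class CyberGymTaskInputs:
            codebase_dir: str                              # always provided (level 0 and above)
            vuln_description: Optional[str] = None         # added at level 1
            ground_truth_poc: Optional[bytes] = None       # added at level 2
            crash_stack_trace: Optional[str] = None        # added at level 2
            patch_diff: Optional[str] = None               # added at level 3
            post_patch_codebase_dir: Optional[str] = None  # added at level 3

        def inputs_for_level(task: dict, level: int) -> CyberGymTaskInputs:
            """Assemble the agent-visible inputs for a given evaluation level."""
            return CyberGymTaskInputs(
                codebase_dir=task["codebase_dir"],
                vuln_description=task["description"] if level >= 1 else None,
                ground_truth_poc=task["poc"] if level >= 2 else None,
                crash_stack_trace=task["stack_trace"] if level >= 2 else None,
                patch_diff=task["patch"] if level >= 3 else None,
                post_patch_codebase_dir=task["post_patch_dir"] if level >= 3 else None,
            )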

    Experimental Results

    When tested against this benchmark, existing agents showed limited success. Among the four agent frameworks evaluated (OpenHands, Codex, ENiGMA, and Cybench), the top performer was OpenHands combined with Claude-3.7-Sonnet, which reproduced only 11.9% of target vulnerabilities. Performance dropped sharply for longer PoC inputs: success rates were highest for PoCs under 10 bytes (43.5%) and fell below 8% for lengths over 100 bytes. Open-source models such as DeepSeek-V3 lagged behind, with only a 3.6% success rate, and even specialized models fine-tuned for code reasoning, like SWE-Gym-32B and R2E-Gym-32B, failed to generalize, scoring under 2%. Richer input information at higher levels improved performance: level 3 reached 17.1% success, while level 0 achieved only 3.5%. Analysis also revealed that most successful PoC reproductions occurred between 20 and 40 execution steps, while many runs exceeded 90 steps and ultimately failed. Despite these challenges, agents discovered 15 previously unknown zero-day vulnerabilities and two disclosed but unpatched ones across real-world projects, demonstrating their latent capacity for novel discovery.

    Key Takeaways

    • Benchmark Volume and Realism: CyberGym contains 1,507 tasks derived from real, patched vulnerabilities across 188 software projects, making it the largest and most realistic benchmark of its kind.
    • Agent Limitations: Even the best-performing agent-model combination reproduced only 11.9% of vulnerabilities, with many combinations scoring under 5%.
    • Difficulty Scaling: Providing additional inputs, such as stack traces or patches, significantly improved performance, with level 3 tasks yielding a 17.1% success rate.
    • Length Sensitivity: Agents struggled with tasks involving long PoCs. PoCs exceeding 100 bytes, which made up 65.7% of the dataset, had the lowest success rates.
    • Discovery Potential: 15 new zero-day vulnerabilities were discovered by agent-generated PoCs, validating their potential use in real-world security analysis.
    • Model Behavior: Most successful exploits were generated early in the task execution, with diminishing returns after 80 steps.
    • Tool Interactions: Agents performed better when allowed to interact with tools (e.g., using ‘awk’, ‘grep’, or installing ‘xxd’) and to adapt PoCs based on runtime feedback; a simplified version of this feedback loop is sketched below.
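
    The tool-interaction observation suggests a simple loop: emit candidate PoC bytes, run the instrumented target, read the sanitizer output, and mutate the input. The Python sketch below is purely illustrative; the target path, the crash-signature string, and the input-doubling mutation are assumptions rather than the agents' actual strategy.

        # Hedged sketch of an adapt-from-runtime-feedback loop; all specifics
        # (target path, crash string, mutation) are illustrative assumptions.
        import subprocess
        from typing import Optional

        def run_target(target: str, poc: bytes) -> str:
            """Write the candidate PoC to disk, run the target on it, and return stderr."""
            with open("candidate_poc.bin", "wb") as f:
                f.write(poc)
            try:
                proc = subprocess.run([target, "candidate_poc.bin"],
                                      capture_output=True, text=True, timeout=30)
            except subprocess.TimeoutExpired:
                return ""  # treat hangs as "no useful feedback" in this sketch
            return proc.stderr  # sanitizer reports are usually printed to stderr

        def refine_poc(target: str, seed: bytes, max_iters: int = 10) -> Optional[bytes]:
            """Naively grow the input until the sanitizer reports the expected crash."""
            poc = seed
            for _ in range(max_iters):
                if "heap-buffer-overflow" in run_target(target, poc):
                    return poc                  # crash reproduced
                poc = poc + b"A" * len(poc)     # toy mutation: double the input length
            return None

        if __name__ == "__main__":
            result = refine_poc("./target_prepatch", b"HDR\x00")  # hypothetical target and seed
            print("reproduced" if result else "not reproduced")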

    Conclusion

    In conclusion, this study highlights a critical problem: evaluating AI in cybersecurity is not only challenging but essential for understanding its limitations and capabilities. CyberGym stands out by offering a large-scale, real-world framework for doing so. The researchers addressed the issue with a practical, detailed benchmark that forces agents to reason across entire codebases, generate valid exploits, and adapt through iteration. The results make clear that while current agents show promise, especially in discovering new bugs, there is still a long road ahead before AI can reliably contribute to cybersecurity at scale.


    Check out the Paper, GitHub Page, and Leaderboard. All credit for this research goes to the researchers of this project. This article originally appeared on MarkTechPost.
