
    UC Berkeley Introduces CyberGym: A Real-World Cybersecurity Evaluation Framework to Evaluate AI Agents on Large-Scale Vulnerabilities Across Massive Codebases

    June 20, 2025

    Cybersecurity has become a significant area of interest in artificial intelligence, driven by the increasing reliance on large software systems and the expanding capabilities of AI tools. As threats evolve in complexity, ensuring the security of software systems has become more than just a matter of conventional protections; it now intersects with automated reasoning, vulnerability detection, and code-level comprehension. Modern cybersecurity requires tools and methods that can simulate real-world scenarios, identify hidden flaws, and validate system integrity across diverse software infrastructures. Within this environment, researchers have been developing benchmarks and methods to systematically evaluate AI agents’ ability to understand, detect, and even exploit vulnerabilities, drawing parallels with human security researchers. However, bridging the gap between AI reasoning and real-world cybersecurity complexities remains a key challenge.

    Problem with Existing Benchmarks

    One pressing issue is the lack of effective ways to evaluate whether AI systems are truly capable of understanding and handling security tasks under realistic conditions. Simplified benchmark tasks often dominate current testing methods, which rarely mirror the messy and layered reality of large-scale software repositories. These environments involve intricate input conditions, deep code paths, and subtle vulnerabilities that demand more than surface-level inspection. Without robust evaluation methods, it’s difficult to determine whether AI agents can be trusted to perform tasks like vulnerability detection or exploit development. More importantly, current benchmarks don’t reflect the scale and nuance of vulnerabilities found in actively maintained, widely used software systems, leaving a critical evaluation gap.

    Limitations of Current Tools

    Several benchmarks have been used to evaluate cybersecurity capabilities, including Cybench and the NYU CTF Bench. These focus on capture-the-flag-style tasks that offer limited complexity, typically involving small codebases and constrained test environments. Some benchmarks attempt to engage real-world vulnerabilities, but they often do so at a limited scale. Furthermore, many of the tools rely on either synthetic test cases or narrowly scoped challenge problems, which fail to represent the diversity of software inputs, execution paths, and bug types found in actual systems. Even specialized agents created for security analysis have been tested on benchmarks with only tens or a few hundred tasks, far short of the complexity of real-world threat landscapes.

    Introducing CyberGym

    Researchers introduced CyberGym, a large-scale and comprehensive benchmarking tool specifically designed to evaluate AI agents in real-world cybersecurity contexts. Developed at the University of California, Berkeley, CyberGym includes 1,507 distinct benchmark tasks sourced from actual vulnerabilities found and patched across 188 major open-source software projects. These vulnerabilities were originally identified by OSS-Fuzz, a continuous fuzzing campaign maintained by Google. To ensure realism, each benchmark instance includes the full pre-patch codebase, an executable, and a textual description of the vulnerability. Agents must generate a proof-of-concept test that reproduces the vulnerability in the unpatched version, and CyberGym evaluates success based on whether the vulnerability is triggered in the pre-patch version and absent in the post-patch one. This benchmark uniquely emphasizes the generation of Proof of Concepts (PoCs), a task that requires agents to traverse complex code paths and synthesize inputs to meet specific security conditions. CyberGym is modular and containerized, enabling easy expansion and reproducibility.
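
    To make the success criterion concrete, here is a minimal sketch in Python of how such a check could look, assuming each task provides a sanitizer-instrumented pre-patch executable and a post-patch executable that take the PoC file as a command-line argument. The binary names, the AddressSanitizer string match, and the crash heuristic are illustrative assumptions, not CyberGym's actual containerized harness.

        import subprocess

        def crashes(binary: str, poc_path: str, timeout: int = 30) -> bool:
            """Return True if feeding the PoC to the binary triggers a sanitizer-reported crash."""
            try:
                result = subprocess.run([binary, poc_path], capture_output=True, timeout=timeout)
            except subprocess.TimeoutExpired:
                return False  # hangs are not counted as reproductions in this sketch
            report = result.stderr.decode(errors="replace")
            return result.returncode != 0 and "AddressSanitizer" in report

        def poc_reproduces_vulnerability(pre_patch_bin: str, post_patch_bin: str, poc_path: str) -> bool:
            """Success as described above: the PoC crashes the pre-patch build
            but does not crash the post-patch build."""
            return crashes(pre_patch_bin, poc_path) and not crashes(post_patch_bin, poc_path)

        # Hypothetical usage: binary paths and PoC file name are placeholders.
        if __name__ == "__main__":
            ok = poc_reproduces_vulnerability("./target_pre_patch", "./target_post_patch", "poc.bin")
            print("reproduced" if ok else "not reproduced")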

    CyberGym Evaluation Levels

    The evaluation pipeline in CyberGym is built around four difficulty levels, each providing the agent with more input information than the last. At level 0, the agent receives only the codebase, with no hint about the vulnerability. Level 1 adds a natural-language description of the vulnerability. Level 2 introduces the ground-truth proof of concept (PoC) and its crash stack trace, while level 3 adds the patch itself and the post-patch codebase. Each level poses a different reasoning challenge. For instance, at level 1, agents must infer the vulnerability’s location and context purely from the textual description and the codebase. To ensure benchmark quality, CyberGym applies filters such as checking the informativeness of patch commit messages, validating PoC reproducibility, and removing redundancy by comparing stack traces. The final dataset comprises codebases with a median of 1,117 files and 387,491 lines of code, ranging up to over 40,000 files and 7 million lines of code. Patch sizes also vary, modifying a median of 1 file and 7 lines but sometimes spanning 40 files and over 3,000 lines. The vulnerabilities cover a range of crash types, with 30.4% related to heap-buffer-overflow READ and 19.0% caused by use of uninitialized values.
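
    Because the levels are cumulative, the information handed to an agent can be modeled as a simple per-task structure. The sketch below illustrates that layering; the field and dictionary key names (codebase_dir, pre_patch_dir, and so on) are assumptions made for illustration and do not reflect CyberGym's actual data schema.

        from dataclasses import dataclass
        from typing import Optional

        @dataclass
        class TaskInputs:
            """What the agent can see for one task at a given level.
            Field names are illustrative, not the benchmark's actual schema."""
            codebase_dir: str                              # pre-patch source tree (all levels)
            description: Optional[str] = None              # natural-language description (level 1+)
            ground_truth_poc: Optional[bytes] = None       # crashing input (level 2+)
            crash_stack_trace: Optional[str] = None        # sanitizer stack trace (level 2+)
            patch_diff: Optional[str] = None               # ground-truth fix (level 3)
            post_patch_codebase_dir: Optional[str] = None  # patched source tree (level 3)

        def inputs_for_level(task: dict, level: int) -> TaskInputs:
            """Assemble the cumulative information provided at difficulty level 0-3."""
            inputs = TaskInputs(codebase_dir=task["pre_patch_dir"])
            if level >= 1:
                inputs.description = task["description"]
            if level >= 2:
                inputs.ground_truth_poc = task["poc_bytes"]
                inputs.crash_stack_trace = task["stack_trace"]
            if level >= 3:
                inputs.patch_diff = task["patch"]
                inputs.post_patch_codebase_dir = task["post_patch_dir"]
            return inputs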

    Experimental Results

    When tested against this benchmark, existing agents showed limited success. Among the four agent frameworks evaluated (OpenHands, Codex, ENiGMA, and Cybench), the top performer was OpenHands combined with Claude-3.7-Sonnet, which reproduced only 11.9% of target vulnerabilities. Performance dropped sharply for longer PoC inputs: success rates were highest for PoCs under 10 bytes (43.5%) and fell below 8% for lengths over 100 bytes. Open-source models such as DeepSeek-V3 lagged behind, with only a 3.6% success rate. Even specialized models fine-tuned for code reasoning, such as SWE-Gym-32B and R2E-Gym-32B, failed to generalize, scoring under 2%. Richer input information at higher levels improved performance: level 3 saw 17.1% success, while level 0 achieved only 3.5%. Analysis also revealed that most successful PoC reproductions occurred between 20 and 40 execution steps, with many runs exceeding 90 steps and ultimately failing. Despite these challenges, agents discovered 15 previously unknown zero-day vulnerabilities and two disclosed but unpatched ones across real-world projects, demonstrating their latent capacity for novel discovery.

    Key Takeaways

    • Benchmark Volume and Realism: CyberGym contains 1,507 tasks derived from real, patched vulnerabilities across 188 software projects, making it the largest and most realistic benchmark of its kind.
    • Agent Limitations: Even the best-performing agent-model combination reproduced only 11.9% of vulnerabilities, with many combinations scoring under 5%.
    • Difficulty Scaling: Providing additional inputs, such as stack traces or patches, significantly improved performance, with level 3 tasks yielding a 17.1% success rate.
    • Length Sensitivity: Agents struggled with tasks involving long PoCs. PoCs exceeding 100 bytes, which made up 65.7% of the dataset, had the lowest success rates.
    • Discovery Potential: 15 new zero-day vulnerabilities were discovered by agent-generated PoCs, validating their potential use in real-world security analysis.
    • Model Behavior: Most successful exploits were generated early in the task execution, with diminishing returns after 80 steps.
    • Tool Interactions: Agents performed better when allowed to interact with tools (e.g., using ‘awk’, ‘grep’, or installing ‘xxd’) and adapt PoCs based on runtime feedback.
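
    To illustrate the last two takeaways, here is a toy sketch of a feedback-driven loop in which an agent repeatedly runs a candidate input against an instrumented target and mutates it based on the runtime output. The binary invocation, the AddressSanitizer string check, and the naive byte-append mutation are illustrative assumptions; none of the evaluated agents works exactly this way.

        import subprocess
        from typing import Optional

        def run_target(binary: str, poc: bytes, timeout: int = 30) -> str:
            """Run the (hypothetical) instrumented target on a candidate PoC and
            return its stderr, which carries the sanitizer feedback the agent inspects."""
            with open("candidate.bin", "wb") as f:
                f.write(poc)
            try:
                result = subprocess.run([binary, "candidate.bin"],
                                        capture_output=True, timeout=timeout)
            except subprocess.TimeoutExpired:
                return "TIMEOUT"
            return result.stderr.decode(errors="replace")

        def refine_poc(binary: str, initial_poc: bytes, max_steps: int = 40) -> Optional[bytes]:
            """Toy refinement loop: mutate the candidate input until the target
            reports a sanitizer error, mimicking feedback-driven PoC adaptation."""
            poc = initial_poc
            for _ in range(max_steps):
                feedback = run_target(binary, poc)
                if "AddressSanitizer" in feedback:
                    return poc        # crash reproduced
                poc += b"A" * 8       # naive length mutation; real agents reason about the feedback
            return None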

    Conclusion

    This study highlights a critical problem: evaluating AI in cybersecurity is both challenging and essential for understanding its capabilities and limitations. CyberGym stands out by offering a large-scale, real-world framework for doing so. The researchers addressed the issue with a practical and detailed benchmark that forces agents to reason across entire codebases, generate valid exploits, and adapt through iteration. The results make it clear that while current agents show promise, especially in discovering new bugs, there is still a long road ahead before AI can contribute reliably to cybersecurity at scale.


    Check out the Paper, GitHub Page, and Leaderboard. All credit for this research goes to the researchers of this project.
