Benchmarking AI-assisted developers (and their tools) for superior AI governance

September 23, 2025

A quick browse of LinkedIn, DevTok, and X would lead you to believe that almost every developer has jumped aboard the vibe coding hype train with full gusto. That isn’t far-fetched: 84% of developers confirm they are currently using (or planning to use) AI coding tools in their daily workflows. Even so, a full surrender to autonomous vibe-coding agents remains unusual; Stack Overflow’s 2025 AI Survey revealed that most respondents (72%) are not (yet) vibe coding. Still, adoption is trending upwards, and AI currently generates 41% of all code, for better or worse.

Tools like Cursor and Windsurf represent the latest generation of AI coding assistants, each with a powerful autonomous mode that can make decisions independently based on preset parameters. The speed and productivity gains are undeniable, but a worrying trend is emerging: many of these tools are being deployed in enterprise environments whose teams are not equipped to address the security issues inherent in their use. Human governance is paramount, yet too few security leaders are modernizing their security programs to adequately shield against the risk of AI-generated code.

If the tech stack lacks tools that track not only developer security proficiency but also the trustworthiness of the approved AI coding companions each developer uses, then efforts to uplift the overall security program, and the developers working within it, will fall short of the data insights needed to effect change.

AI and human governance should be a priority

The drawing card of agentic models is their ability to work autonomously and make decisions independently. Embedding them into enterprise environments at scale without appropriate human governance will inevitably introduce security issues that are neither especially visible nor easy to stop.

Long-standing security problems like sensitive data exposure and insufficient logging and monitoring remain, and emerging threats like memory poisoning and tool poisoning are not issues to take lightly. CISOs must take steps to reduce developer risk, and provide continuous learning and skills verification within their security programs, in order to safely adopt agentic AI.
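
Even a thin layer of human governance around an agent’s tool use illustrates the point. The Python sketch below is a hypothetical wrapper, assuming an agent whose tools are plain callables: the allowlist counters tool poisoning, and the audit log addresses insufficient logging and monitoring. The tool names and log format are illustrative assumptions.

```python
import json
import logging
import time

# Audit log for every tool invocation the agent attempts.
logging.basicConfig(filename="agent_audit.log", level=logging.INFO)

# Tools vetted by a human reviewer; anything else is refused outright.
APPROVED_TOOLS = {"read_file", "run_tests"}

def governed_call(tool_name, tool_fn, **kwargs):
    """Run an agent tool only if it is approved, and record the call."""
    if tool_name not in APPROVED_TOOLS:
        logging.warning("blocked unapproved tool: %s", tool_name)
        raise PermissionError(f"{tool_name} is not on the approved tool list")
    logging.info(json.dumps({"ts": time.time(), "tool": tool_name,
                             "args": repr(kwargs)}))
    return tool_fn(**kwargs)
```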

Powerful benchmarking lights your developers’ path

It’s very difficult to make impactful, positive improvements to a security program based solely on anecdotal accounts, limited feedback, and other subjective data points. These types of data, while helpful in correcting the more glaring faults (such as a tool that keeps failing, or personnel time wasted on a low-value, frustrating task), will do little to lift the program to a new level. Sadly, the “people” part of an enterprise security (or, indeed, Secure by Design) initiative is notoriously tricky to measure, and too often neglected as a piece of the puzzle that must be solved as a priority.

This is where governance tools that deliver data points on individual developer security proficiency, categorized by language, framework, and even industry, can make the difference between yet another flat training-and-observability exercise and proper developer risk management. In the latter, the tooling collects the insights needed to plug knowledge gaps, route security-proficient developers to the most sensitive projects, and, importantly, monitor and approve the tools they use each day, such as AI coding companions.
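
To make that concrete, here is a minimal sketch of the kind of records such a governance tool might aggregate. The schema, names, and the 0.8 threshold are illustrative assumptions, not any vendor’s actual data model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical assessment results: one record per secure-coding task attempt.
results = [
    {"dev": "alice", "lang": "python", "category": "sqli", "passed": True},
    {"dev": "alice", "lang": "python", "category": "xss",  "passed": False},
    {"dev": "bob",   "lang": "java",   "category": "sqli", "passed": True},
]

def proficiency_by_category(records):
    """Average pass rate per (developer, language, vulnerability category)."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["dev"], r["lang"], r["category"])].append(int(r["passed"]))
    return {key: mean(vals) for key, vals in buckets.items()}

# Developers above a chosen threshold could be routed to sensitive projects.
scores = proficiency_by_category(results)
cleared = [key for key, rate in scores.items() if rate >= 0.8]
```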

Assessment of agentic AI coding tools and LLMs

Three years on, we can confidently conclude that not all AI coding tools are created equal. More studies are emerging that help differentiate the strengths and weaknesses of each model across a variety of applications. Sonar’s recent study on the coding personalities of each model was quite eye-opening, revealing the different traits of models like Claude Sonnet 4, OpenCoder-8B, Llama 3.2 90B, GPT-4o, and Claude Sonnet 3.7, with insight into how their individual approaches to coding affect code quality and, subsequently, associated security risk. Semgrep’s deep dive into the capabilities of AI coding agents for detecting vulnerabilities also yielded mixed results: its findings generally demonstrated that a security-focused prompt can already identify real vulnerabilities in real applications, but, depending on the vulnerability class, a high volume of false positives created noisy, less valuable results.
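
As an illustration of what a “security-focused prompt” can look like in practice, here is a minimal sketch using the OpenAI Python SDK. The model name and prompt wording are assumptions for demonstration, not the prompts or models used in the studies above.

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SECURITY_PROMPT = (
    "You are a security reviewer. Examine the following code for concrete, "
    "exploitable vulnerabilities (e.g. injection, path traversal, SSRF). "
    "Report only findings you can justify with a specific line and payload; "
    "otherwise say 'no findings'. Keeping false positives low matters more "
    "than exhaustive coverage."
)

def review(snippet: str) -> str:
    """Ask the model for a security review of one code snippet."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice
        messages=[
            {"role": "system", "content": SECURITY_PROMPT},
            {"role": "user", "content": snippet},
        ],
    )
    return response.choices[0].message.content
```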

Our own benchmarking data supports many of Semgrep’s findings. We were able to show that the best LLMs perform comparably to proficient people on a range of limited secure coding tasks. However, there is a significant drop in consistency among LLMs across different stages of tasks, languages, and vulnerability categories. Generally, top developers with security proficiency outperform all LLMs, while average developers do not.
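
One simple way to quantify that inconsistency is to compare a model’s pass rate across vulnerability categories and measure the spread. The sketch below uses made-up numbers for illustration; it is not our benchmark data.

```python
from statistics import mean, pstdev

# Hypothetical per-task pass/fail results for one model, by category.
model_results = {
    "sqli": [1, 1, 1, 0, 1],
    "xss":  [1, 0, 0, 1, 0],
    "auth": [0, 0, 1, 0, 0],
}

rates = {cat: mean(runs) for cat, runs in model_results.items()}
spread = pstdev(rates.values())  # high spread = inconsistent across categories
print(rates, f"spread={spread:.2f}")
```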

With studies like these in mind, we must not lose sight of what we as an industry are allowing into our codebases. AI coding agents are gaining autonomy and general use, and they must be treated like any human with their hands on the tools: their security proficiency, access level, commits, and mistakes assessed with the same fervor applied to the human operating them, with no exceptions. How trustworthy is the output of the tool, and how security-proficient is its operator?

If security leaders cannot answer these questions and plan accordingly, the attack surface will continue to grow by the day. If you don’t know where the code is coming from, make sure it’s not going into any repository, with no exceptions.
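
One way to enforce that last rule mechanically is a commit-msg hook that rejects commits lacking a provenance declaration. The “Code-Provenance:” trailer below is a hypothetical team convention, not an established Git standard, and the approved origins are illustrative.

```python
import sys

# Origins vetted by the security team; edit to match your approved tooling.
ALLOWED = {"human", "cursor", "copilot"}

def main(msg_file):
    """Reject the commit unless its message declares an approved provenance."""
    with open(msg_file) as f:
        lines = f.read().splitlines()
    for line in lines:
        if line.lower().startswith("code-provenance:"):
            value = line.split(":", 1)[1].strip().lower()
            if value in ALLOWED:
                return 0
            print(f"commit rejected: unapproved provenance {value!r}",
                  file=sys.stderr)
            return 1
    print("commit rejected: missing Code-Provenance trailer", file=sys.stderr)
    return 1

if __name__ == "__main__":
    # Git invokes commit-msg hooks with the message file path as argv[1].
    sys.exit(main(sys.argv[1]))
```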

The post Benchmarking AI-assisted developers (and their tools) for superior AI governance appeared first on SD Times.

