AWS Introduces SWE-PolyBench: A New Open-Source Multilingual Benchmark for Evaluating AI Coding Agents

Recent advancements in large language models (LLMs) have enabled the development of AI-based coding agents that can generate, modify, and understand software code. However, the evaluation of these systems remains limited, often constrained to synthetic or narrowly scoped benchmarks, primarily in Python. These benchmarks seldom reflect the structural and semantic diversity of real-world codebases, and as a result, many agents overfit to benchmark-specific patterns rather than demonstrating robust, transferable capabilities.

AWS Introduces SWE-PolyBench: A More Comprehensive Evaluation Framework

To address these challenges, AWS AI Labs has introduced SWE-PolyBench, a multilingual, repository-level benchmark designed for execution-based evaluation of AI coding agents. The benchmark spans 21 GitHub repositories across four widely used programming languages (Java, JavaScript, TypeScript, and Python) and comprises 2,110 tasks that include bug fixes, feature implementations, and code refactorings.

Unlike prior benchmarks, SWE-PolyBench incorporates real pull requests (PRs) that close actual issues and include associated test cases, allowing for verifiable evaluation. A smaller, stratified subset—SWE-PolyBench500—has also been released to support quicker experimentation while preserving task and language diversity.

Technical Structure and Evaluation Metrics

SWE-PolyBench adopts an execution-based evaluation pipeline. Each task includes a repository snapshot and a problem statement derived from a GitHub issue. The system applies the associated ground truth patch in a containerized test environment configured for the respective language ecosystem (for example, Maven for Java and npm for JS/TS). The benchmark then measures outcomes using two types of unit tests: fail-to-pass (F2P) tests, which fail before the fix and pass after it, and pass-to-pass (P2P) tests, which pass both before and after the fix.
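The role of the two test types can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's harness: it assumes the SWE-bench-style convention that a task counts as resolved only when every F2P and every P2P test passes after the patch, and the `TaskTests` structure, the `is_resolved` helper, and the test names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskTests:
    """Test names for one task (hypothetical structure, not the benchmark's schema)."""
    fail_to_pass: frozenset[str]  # fail before the fix, expected to pass after it
    pass_to_pass: frozenset[str]  # pass before the fix, must keep passing after it


def is_resolved(tests: TaskTests, passed_after_patch: set[str]) -> bool:
    """Assumed rule: resolved only if every F2P and P2P test passes after the patch."""
    required = tests.fail_to_pass | tests.pass_to_pass
    return required <= passed_after_patch


# Made-up test names, for illustration only.
tests = TaskTests(
    fail_to_pass=frozenset({"test_issue_fixed"}),
    pass_to_pass=frozenset({"test_existing_a", "test_existing_b"}),
)

# The patch fixes the issue and keeps existing behavior intact.
fixes_and_keeps = is_resolved(
    tests, {"test_issue_fixed", "test_existing_a", "test_existing_b"}
)

# The patch fixes the issue but breaks a previously passing test.
fixes_but_breaks = is_resolved(tests, {"test_issue_fixed", "test_existing_a"})
```

Under this assumed rule, a patch that makes the F2P test pass but breaks a P2P test is not counted as a success, which is why both test sets are tracked.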

To provide a more granular assessment of coding agents, SWE-PolyBench introduces Concrete Syntax Tree (CST)-based metrics. These include both file-level and node-level retrieval scores, assessing the agent’s ability to locate and modify relevant sections of the codebase. These metrics offer insights beyond binary pass/fail outcomes, especially for complex, multi-file modifications.
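As a rough illustration of what a retrieval score measures, the sketch below computes precision and recall between the items an agent modified and the items changed by the reference patch. It is a simplified stand-in, not the benchmark's implementation: the `retrieval_scores` helper and the file paths are hypothetical, and the handling of empty sets is an assumption.

```python
from collections.abc import Hashable, Set


def retrieval_scores(
    predicted: Set[Hashable], reference: Set[Hashable]
) -> tuple[float, float]:
    """Return (precision, recall) of predicted items against reference items."""
    if not predicted or not reference:
        # Assumption: an empty side scores zero rather than raising an error.
        return 0.0, 0.0
    hits = len(predicted & reference)
    return hits / len(predicted), hits / len(reference)


# Hypothetical file paths, for illustration only.
reference_files = {"src/parser.ts", "src/lexer.ts"}
predicted_files = {"src/parser.ts", "src/utils.ts", "src/index.ts", "src/cli.ts"}

# File-level scores: one of four edited files is relevant (precision 0.25),
# and one of two relevant files was found (recall 0.5).
precision, recall = retrieval_scores(predicted_files, reference_files)
```

The same function could be applied at node level by passing sets of hypothetical `(file, node kind, name)` tuples extracted from the CST instead of file paths.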

Empirical Evaluation and Observations

Three open-source coding agents—Aider, SWE-Agent, and Agentless—were adapted for SWE-PolyBench. All used Anthropic’s Claude 3.5 as the underlying model and were modified to handle the multilingual, repository-level requirements of the benchmark.

The evaluation revealed notable differences in performance across languages and task types. For instance, agents performed best on Python tasks (up to 24.1% pass rate) but struggled with TypeScript (as low as 4.7%). Java, despite its higher complexity in terms of average node changes, achieved higher success rates than TypeScript, suggesting that pretraining exposure and syntax familiarity play a critical role in model performance.
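A per-language pass rate of this kind is the share of tasks in that language that an agent resolved. The sketch below shows the arithmetic, assuming each result is recorded as a (language, resolved) pair; the `pass_rate_by_language` helper is hypothetical and the sample data is made up for illustration, not taken from the benchmark.

```python
from collections import defaultdict


def pass_rate_by_language(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Return the percentage of resolved tasks for each language."""
    totals: dict[str, int] = defaultdict(int)
    resolved: dict[str, int] = defaultdict(int)
    for language, ok in results:
        totals[language] += 1
        if ok:
            resolved[language] += 1
    return {lang: 100.0 * resolved[lang] / totals[lang] for lang in totals}


# Made-up outcomes, NOT benchmark results.
toy_results = [
    ("Python", True),
    ("Python", False),
    ("Python", False),
    ("Python", False),
    ("TypeScript", False),
    ("TypeScript", False),
]
rates = pass_rate_by_language(toy_results)
```

Grouping by task category instead of language would use the same computation with a different key.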

Performance also varied with task complexity. Tasks limited to single-function or single-class changes yielded higher success rates (up to 40%), while those requiring mixed or multi-file changes saw a significant drop. Interestingly, high retrieval precision and recall—particularly for file and CST node identification—did not always translate to higher pass rates, indicating that code localization is necessary but insufficient for problem resolution.

Conclusion: Toward Robust Evaluation of AI Coding Agents

SWE-PolyBench presents a robust and nuanced evaluation framework for coding agents, addressing key limitations in existing benchmarks. By supporting multiple programming languages, covering a wider range of task types, and incorporating syntax-aware metrics, it offers a more representative assessment of an agent’s real-world applicability.

The benchmark reveals that while AI agents exhibit promising capabilities, their performance remains inconsistent across languages and tasks. SWE-PolyBench provides a foundation for future research aimed at improving the generalizability, robustness, and reasoning capabilities of AI coding assistants.

Check out the AWS DevOps Blog, Hugging Face – SWE-PolyBench and GitHub – SWE-PolyBench.

The post AWS Introduces SWE-PolyBench: A New Open-Source Multilingual Benchmark for Evaluating AI Coding Agents appeared first on MarkTechPost.
