
    AWS Introduces SWE-PolyBench: A New Open-Source Multilingual Benchmark for Evaluating AI Coding Agents

    April 23, 2025

    Recent advancements in large language models (LLMs) have enabled the development of AI-based coding agents that can generate, modify, and understand software code. However, the evaluation of these systems remains limited, often constrained to synthetic or narrowly scoped benchmarks, primarily in Python. These benchmarks seldom reflect the structural and semantic diversity of real-world codebases, and as a result, many agents overfit to benchmark-specific patterns rather than demonstrating robust, transferable capabilities.

    AWS Introduces SWE-PolyBench: A More Comprehensive Evaluation Framework

    To address these challenges, AWS AI Labs has introduced SWE-PolyBench, a multilingual, repository-level benchmark designed for execution-based evaluation of AI coding agents. The benchmark spans 21 GitHub repositories across four widely used programming languages—Java, JavaScript, TypeScript, and Python—comprising 2,110 tasks that include bug fixes, feature implementations, and code refactorings.

    Unlike prior benchmarks, SWE-PolyBench incorporates real pull requests (PRs) that close actual issues and include associated test cases, allowing for verifiable evaluation. A smaller, stratified subset—SWE-PolyBench500—has also been released to support quicker experimentation while preserving task and language diversity.
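
    For readers who want to experiment, the benchmark data is distributed via Hugging Face (linked at the end of this article). The sketch below shows one way to load the tasks and inspect the language spread; note that the dataset path and the field name are assumptions based on this article, not verified identifiers.

    ```python
    # Illustrative sketch: load SWE-PolyBench tasks from the Hugging Face Hub.
    # NOTE: the dataset path "AmazonScience/SWE-PolyBench" and the "language"
    # field are assumptions, not verified identifiers.
    from collections import Counter

    from datasets import load_dataset

    ds = load_dataset("AmazonScience/SWE-PolyBench", split="test")

    # Count tasks per language to see the multilingual spread described above.
    print(Counter(task["language"] for task in ds))
    ```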

    Technical Structure and Evaluation Metrics

    SWE-PolyBench adopts an execution-based evaluation pipeline. Each task includes a repository snapshot and a problem statement derived from a GitHub issue. The system applies the associated ground-truth patch in a containerized test environment configured for the respective language ecosystem (e.g., Maven for Java, npm for JavaScript/TypeScript). The benchmark then measures outcomes using two types of unit tests: fail-to-pass (F2P) and pass-to-pass (P2P).
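
    To make the F2P/P2P distinction concrete, here is a minimal sketch of how a patch could be classified once test results have been collected before and after applying it. The function and its exact criteria are illustrative, not the benchmark's actual harness.

    ```python
    def task_resolved(results_before, results_after, f2p_tests, p2p_tests):
        """Classify a patch using fail-to-pass (F2P) and pass-to-pass (P2P) tests.

        results_before/results_after map test names to "pass" or "fail",
        as observed by running the suite before and after the patch.
        """
        # F2P tests must fail on the broken snapshot and pass once patched.
        f2p_ok = all(results_before[t] == "fail" and results_after[t] == "pass"
                     for t in f2p_tests)
        # P2P tests must keep passing, guarding against regressions.
        p2p_ok = all(results_before[t] == "pass" and results_after[t] == "pass"
                     for t in p2p_tests)
        return f2p_ok and p2p_ok

    # Example: one targeted fix plus an untouched regression test.
    before = {"test_bug": "fail", "test_core": "pass"}
    after = {"test_bug": "pass", "test_core": "pass"}
    print(task_resolved(before, after, ["test_bug"], ["test_core"]))  # True
    ```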

    To provide a more granular assessment of coding agents, SWE-PolyBench introduces Concrete Syntax Tree (CST)-based metrics. These include both file-level and node-level retrieval scores, assessing the agent’s ability to locate and modify relevant sections of the codebase. These metrics offer insights beyond binary pass/fail outcomes, especially for complex, multi-file modifications.
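
    As a rough illustration, file-level retrieval scores of this kind can be computed as set precision and recall between the files an agent touched and those changed by the ground-truth patch (node-level scores would use CST node identifiers instead). The exact definitions in SWE-PolyBench may differ; this is a sketch under that assumption.

    ```python
    def retrieval_scores(predicted, gold):
        """Set-based precision/recall over file paths or CST node identifiers."""
        predicted, gold = set(predicted), set(gold)
        hits = len(predicted & gold)
        precision = hits / len(predicted) if predicted else 0.0
        recall = hits / len(gold) if gold else 0.0
        return precision, recall

    # Example: the agent edits two files, but the gold patch touches only one.
    p, r = retrieval_scores({"src/app.py", "src/utils.py"}, {"src/app.py"})
    print(p, r)  # 0.5 1.0
    ```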

    Empirical Evaluation and Observations

    Three open-source coding agents—Aider, SWE-Agent, and Agentless—were adapted for SWE-PolyBench. All used Anthropic’s Claude 3.5 as the underlying model and were modified to handle the multilingual, repository-level requirements of the benchmark.

    The evaluation revealed notable differences in performance across languages and task types. For instance, agents performed best on Python tasks (up to 24.1% pass rate) but struggled with TypeScript (as low as 4.7%). Java, despite its higher complexity in terms of average node changes, achieved higher success rates than TypeScript, suggesting that pretraining exposure and syntax familiarity play a critical role in model performance.

    Performance also varied with task complexity. Tasks limited to single-function or single-class changes yielded higher success rates (up to 40%), while those requiring mixed or multi-file changes saw a significant drop. Interestingly, high retrieval precision and recall—particularly for file and CST node identification—did not always translate to higher pass rates, indicating that code localization is necessary but insufficient for problem resolution.

    Conclusion: Toward Robust Evaluation of AI Coding Agents

    SWE-PolyBench presents a robust and nuanced evaluation framework for coding agents, addressing key limitations in existing benchmarks. By supporting multiple programming languages, covering a wider range of task types, and incorporating syntax-aware metrics, it offers a more representative assessment of an agent’s real-world applicability.

    The benchmark reveals that while AI agents exhibit promising capabilities, their performance remains inconsistent across languages and tasks. SWE-PolyBench provides a foundation for future research aimed at improving the generalizability, robustness, and reasoning capabilities of AI coding assistants.


    For more details, check out the AWS DevOps Blog, Hugging Face – SWE-PolyBench, and GitHub – SWE-PolyBench.


    Source: MarkTechPost
