
    Closing the loop on agents with test-driven development

    April 29, 2025

    Traditionally, developers have used test-driven development (TDD) to validate applications before implementing the actual functionality. In this approach, developers follow a cycle where they write a test designed to fail, then execute the minimum code necessary to make the test pass, refactor the code to improve quality, and repeat the process by adding more tests and continuing these steps iteratively.
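The red-green-refactor cycle described above can be sketched in a few lines. This is a minimal, illustrative example; `slugify` is a hypothetical function invented here to show the cycle, not something from the article.

```python
# TDD cycle sketch: the test below was written first (and failed),
# then slugify was implemented with the minimum code to make it pass.

def slugify(title: str) -> str:
    """Minimal implementation, written only after the test was red."""
    return title.strip().lower().replace(" ", "-")

def test_slugify():
    # Step 1: write a failing test.
    assert slugify("Hello World") == "hello-world"
    # Step 2: implement the minimum code to pass (above).
    # Step 3: refactor, add more cases, repeat.
    assert slugify("  Trim Me  ") == "trim-me"

test_slugify()
```

Each new requirement repeats the cycle: add a failing test case, extend the implementation, refactor.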

As AI agents have entered the conversation, the way developers use TDD has changed. Rather than evaluating for exact answers, they are evaluating behaviors, reasoning, and decision-making. Going further, they must continuously adjust based on real-world feedback. This development process also helps mitigate unforeseen hallucinations as we give more control to AI.

    The ideal AI product development process follows the experimentation, evaluation, deployment, and monitoring format. Developers who follow this structured approach can better build reliable agentic workflows. 

Stage 1: Experimentation: In this first phase of test-driven development, developers test whether the models can solve for an intended use case. Best practices include experimenting with prompting techniques and testing on various architectures. Additionally, bringing subject matter experts into this phase will help save engineering time. Other best practices include staying model and inference provider agnostic and experimenting with different modalities.

Stage 2: Evaluation: The next phase is evaluation, where developers create a data set of hundreds of examples to test their models and workflows against. At this stage, developers must balance quality, cost, latency, and privacy. Since no AI system will perfectly meet all these requirements, developers must make trade-offs and define their priorities.

    If ground truth data is available, it can be used to evaluate and test your workflows. Ground truths are often seen as the backbone of AI model validation, as they are high-quality examples demonstrating ideal outputs. If ground truth data is not available, developers can alternatively use another LLM to judge a model's responses. At this stage, developers should also use a flexible framework with various metrics and a large test case bank.
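An evaluation pass over a test case bank can be sketched as follows. This is a simplified illustration, assuming ground truth answers exist; `call_model` is a placeholder for a real LLM call, and exact-match scoring stands in for the richer metrics (semantic similarity, LLM-as-judge) a real framework would use.

```python
# Evaluation sketch: score model outputs against ground-truth answers.

def call_model(prompt: str) -> str:
    # Placeholder for a real LLM call; canned answers for illustration only.
    return {"capital of France?": "Paris"}.get(prompt, "unknown")

def evaluate(cases):
    """Return the fraction of test cases whose output exactly matches ground truth."""
    passed = sum(1 for prompt, truth in cases if call_model(prompt) == truth)
    return passed / len(cases)

cases = [
    ("capital of France?", "Paris"),   # exact match
    ("capital of Spain?", "Madrid"),   # placeholder model misses this one
]
score = evaluate(cases)
```

In practice the case bank would hold hundreds of examples, and the scoring function would be swapped per metric while the harness stays the same.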

    Developers should run evaluations at every stage and have guardrails to check internal nodes. This ensures that the models produce accurate responses at every step in the workflow. Once real data is available, developers can also return to this stage.
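A guardrail on an internal node can be as simple as validating the intermediate payload before the next step runs. The sketch below assumes a workflow that passes JSON strings between steps; the node name and check are illustrative.

```python
import json

def guard_nonempty_json(node_name: str, output: str) -> str:
    """Fail fast if an intermediate step produced an unusable payload."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        raise ValueError(f"guardrail failed at {node_name}: output is not valid JSON")
    if not data:
        raise ValueError(f"guardrail failed at {node_name}: empty payload")
    return output  # pass the validated output on to the next node
```

Catching a malformed intermediate output at the node that produced it is far cheaper than debugging a wrong final answer.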

    Stage 3: Deployment: Once the model is deployed, developers must monitor more than deterministic outputs. This includes logging all LLM calls and tracking inputs, outputs, latency, and the exact steps the AI system took. In doing so, developers can see and understand how the AI operates at every step. This process is becoming even more critical with the introduction of agentic workflows, as this technology is even more complex, can take different workflow paths, and can make decisions independently.
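Logging every call with its inputs, output, and latency can be done with a small tracing decorator. This is a sketch under simple assumptions; the step name and the `summarize` placeholder are invented for illustration, and a production system would ship these records to an observability backend rather than an in-memory list.

```python
import functools
import time

def traced(step_name, log):
    """Decorator that records inputs, output, and latency for each call."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            log.append({
                "step": step_name,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "latency_s": time.perf_counter() - start,
            })
            return result
        return inner
    return wrap

trace = []

@traced("summarize", trace)
def summarize(text):
    # Placeholder for a real LLM call.
    return text[:20]
```

After a run, `trace` holds one record per step, which is enough to reconstruct the exact path the system took.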

    In this stage, developers should maintain stateful API calls, with retry and fallback logic to handle outages and rate limits. Lastly, developers in this stage should ensure reasonable version control by using staging environments and performing regression testing to maintain stability during updates.
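The retry-and-fallback pattern mentioned above can be sketched as a small helper. The provider list, retry counts, and backoff are assumptions for illustration; a real system would catch provider-specific exceptions rather than bare `Exception`.

```python
import time

def call_with_fallback(providers, prompt, retries=2, backoff=0.0):
    """Try each provider in order; retry transient failures before falling back."""
    last_error = None
    for call in providers:
        for attempt in range(retries):
            try:
                return call(prompt)
            except Exception as err:  # e.g. rate limit or outage
                last_error = err
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError("all providers failed") from last_error
```

With `providers=[primary, backup]`, a rate-limited primary is retried, then the backup is tried, and the caller only sees an error if every option is exhausted.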

    Stage 4: Monitoring: After the model is deployed, developers can collect user responses and create a feedback loop. This enables developers to identify edge cases captured in production, continuously improve, and make the workflow more efficient.

    The Role of TDD in Creating Resilient Agentic AI Applications

    A recent Gartner survey revealed that by 2028, 33% of enterprise software applications will include agentic AI. These massive investments must be resilient to achieve the ROI teams are expecting.

    Since agentic workflows use many tools and often have multi-agent structures that execute tasks in parallel, it is no longer enough to measure performance at every level when evaluating them with a test-driven approach; developers must also assess the agents' behavior to ensure that they are making accurate decisions and following the intended logic.
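One way to assert on behavior rather than just the final answer is to check the recorded tool-call trajectory against the intended logic. The trajectory format and tool names below are assumptions invented for illustration.

```python
# Behavioral check: did the required tool calls happen in the intended order?

def followed_intended_logic(trajectory, required_order):
    """True if the required tools appear as an in-order subsequence of the trajectory."""
    it = iter(step["tool"] for step in trajectory)
    # `tool in it` advances the iterator, so order is enforced.
    return all(tool in it for tool in required_order)

run = [
    {"tool": "search_listings"},
    {"tool": "fetch_details"},
    {"tool": "compose_answer"},
]
```

A run that produces a correct answer but skips or reorders required steps would still fail this check, which is the point of behavioral evaluation.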

    Redfin recently announced Ask Redfin, an AI-powered chatbot that powers daily conversations for thousands of users. Using Vellum’s developer sandbox, the Redfin team collaborated on prompts to pick the right prompt/model combination, built complex AI virtual assistant logic by connecting prompts, classifiers, APIs, and data manipulation steps, and systematically evaluated prompt pre-production using hundreds of test cases.

    Following a test-driven development approach, their team could simulate various user interactions, test different prompts across numerous scenarios, and build confidence in their assistant’s performance before shipping to production. 

    Reality Check on Agentic Technologies

    Every AI workflow has some level of agentic behavior. At Vellum, we believe in a six-level framework that breaks down the different levels of autonomy, control, and decision-making for AI systems: from L0: Rule-Based Workflows, where there is no intelligence, to L4: Fully Creative, where the AI is creating its own logic.

    Today, most AI applications sit at L1. The focus is on orchestration—optimizing how models interact with the rest of the system, tweaking prompts, optimizing retrieval and evals, and experimenting with different modalities. L1 systems are also easier to manage and control in production: debugging is more tractable, and failure modes are relatively predictable.

    Test-driven development truly makes its case here, as developers need to continuously improve the models to create a more efficient system. This year, we are likely to see the most innovation in L2, with AI agents being used to plan and reason. 

    As AI agents move up the stack, test-driven development presents an opportunity for developers to better test, evaluate, and refine their workflows. Third-party developer platforms offer enterprises and development teams a platform to easily define and evaluate agentic behaviors and continuously improve workflows in one place.

    The post Closing the loop on agents with test-driven development appeared first on SD Times.
