Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      The Ultimate Guide to Node.js Development Pricing for Enterprises

      July 29, 2025

      Stack Overflow: Developers’ trust in AI outputs is worsening year over year

      July 29, 2025

      Web Components: Working With Shadow DOM

      July 28, 2025

      Google’s new Opal tool allows users to create mini AI apps with no coding required

      July 28, 2025

      5 preinstalled apps you should delete from your Samsung phone immediately

      July 30, 2025

      Ubuntu Linux lagging? Try my 10 go-to tricks to speed it up

      July 30, 2025

      How I survived a week with this $130 smartwatch instead of my Garmin and Galaxy Ultra

      July 30, 2025

      YouTube is using AI to verify your age now – and if it’s wrong, that’s on you to fix

      July 30, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      Time-Controlled Data Processing with Laravel LazyCollection Methods

      July 30, 2025
      Recent

      Time-Controlled Data Processing with Laravel LazyCollection Methods

      July 30, 2025

      Create Apple Wallet Passes in Laravel

      July 30, 2025

      The Laravel Idea Plugin is Now FREE for PhpStorm Users

      July 30, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      New data shows Xbox is utterly dominating PlayStation’s storefront — accounting for 60% of the Q2 top 10 game sales spots

      July 30, 2025
      Recent

      New data shows Xbox is utterly dominating PlayStation’s storefront — accounting for 60% of the Q2 top 10 game sales spots

      July 30, 2025

      Opera throws Microsoft to Brazil’s watchdogs for promoting Edge as your default browser — “Microsoft thwarts‬‭ browser‬‭ competition‬‭‬‭ at‬‭ every‬‭ turn”

      July 30, 2025

      Activision once again draws the ire of players for new Diablo Immortal marketing that appears to have been made with generative AI

      July 30, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Machine Learning»Safeguarding Agentic AI Systems: NVIDIA’s Open-Source Safety Recipe

    Safeguarding Agentic AI Systems: NVIDIA’s Open-Source Safety Recipe

    July 29, 2025

    As large language models (LLMs) evolve from simple text generators to agentic systems —able to plan, reason, and autonomously act—there is a significant increase in both their capabilities and associated risks. Enterprises are rapidly adopting agentic AI for automation, but this trend exposes organizations to new challenges: goal misalignment, prompt injection, unintended behaviors, data leakage, and reduced human oversight. Addressing these concerns, NVIDIA has released an open-source software suite and a post-training safety recipe designed to safeguard agentic AI systems throughout their lifecycle.

    The Need for Safety in Agentic AI

    Agentic LLMs leverage advanced reasoning and tool use, enabling them to operate with a high degree of autonomy. However, this autonomy can result in:

    • Content moderation failures (e.g., generation of harmful, toxic, or biased outputs)
    • Security vulnerabilities (prompt injection, jailbreak attempts)
    • Compliance and trust risks (failure to align with enterprise policies or regulatory standards)

    Traditional guardrails and content filters often fall short as models and attacker techniques rapidly evolve. Enterprises require systematic, lifecycle-wide strategies for aligning open models with internal policies and external regulations.

    NVIDIA’s Safety Recipe: Overview and Architecture

    NVIDIA’s agentic AI safety recipe provides a comprehensive end-to-end framework to evaluate, align, and safeguard LLMs before, during, and after deployment:

    • Evaluation: Before deployment, the recipe enables testing against enterprise policies, security requirements, and trust thresholds using open datasets and benchmarks.
    • Post-Training Alignment: Using Reinforcement Learning (RL), Supervised Fine-Tuning (SFT), and on-policy dataset blends, models are further aligned with safety standards.
    • Continuous Protection: After deployment, NVIDIA NeMo Guardrails and real-time monitoring microservices provide ongoing, programmable guardrails, actively blocking unsafe outputs and defending against prompt injections and jailbreak attempts.

    Core Components

    StageTechnology/ToolsPurpose
    Pre-Deployment EvaluationNemotron Content Safety Dataset, WildGuardMix, garak scannerTest safety/security
    Post-Training AlignmentRL, SFT, open-licensed dataFine-tune safety/alignment
    Deployment & InferenceNeMo Guardrails, NIM microservices (content safety, topic control, jailbreak detect)Block unsafe behaviors
    Monitoring & Feedbackgarak, real-time analyticsDetect/resist new attacks

    Open Datasets and Benchmarks

    • Nemotron Content Safety Dataset v2: Used for pre- and post-training evaluation, this dataset screens for a wide spectrum of harmful behaviors.
    • WildGuardMix Dataset: Targets content moderation across ambiguous and adversarial prompts.
    • Aegis Content Safety Dataset: Over 35,000 annotated samples, enabling fine-grained filter and classifier development for LLM safety tasks.

    Post-Training Process

    NVIDIA’s post-training recipe for safety is distributed as an open-source Jupyter notebook or as a launchable cloud module, ensuring transparency and broad accessibility. The workflow typically includes:

    1. Initial Model Evaluation: Baseline testing on safety/security with open benchmarks.
    2. On-policy Safety Training: Response generation by the target/aligned model, supervised fine-tuning, and reinforcement learning with open datasets.
    3. Re-evaluation: Re-running safety/security benchmarks post-training to confirm improvements.
    4. Deployment: Trusted models are deployed with live monitoring and guardrail microservices (content moderation, topic/domain control, jailbreak detection).

    Quantitative Impact

    • Content Safety: Improved from 88% to 94% after applying the NVIDIA safety post-training recipe—a 6% gain, with no measurable loss of accuracy.
    • Product Security: Improved resilience against adversarial prompts (jailbreaks etc.) from 56% to 63%, a 7% gain.

    Collaborative and Ecosystem Integration

    NVIDIA’s approach goes beyond internal tools—partnerships with leading cybersecurity providers (Cisco AI Defense, CrowdStrike, Trend Micro, Active Fence) enable integration of continuous safety signals and incident-driven improvements across the AI lifecycle.

    How To Get Started

    1. Open Source Access: The full safety evaluation and post-training recipe (tools, datasets, guides) is publicly available for download and as a cloud-deployable solution.
    2. Custom Policy Alignment: Enterprises can define custom business policies, risk thresholds, and regulatory requirements—using the recipe to align models accordingly.
    3. Iterative Hardening: Evaluate, post-train, re-evaluate, and deploy as new risks emerge, ensuring ongoing model trustworthiness.

    Conclusion

    NVIDIA’s safety recipe for agentic LLMs represents an industry-first, openly available, systematic approach to hardening LLMs against modern AI risks. By operationalizing robust, transparent, and extensible safety protocols, enterprises can confidently adopt agentic AI, balancing innovation with security and compliance.


    Check out the NVIDIA AI safety recipe and Technical details. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

    FAQ: Can Marktechpost help me to promote my AI Product and position it in front of AI Devs and Data Engineers?

    Ans: Yes, Marktechpost can help promote your AI product by publishing sponsored articles, case studies, or product features, targeting a global audience of AI developers and data engineers. The MTP platform is widely read by technical professionals, increasing your product’s visibility and positioning within the AI community. [SET UP A CALL]

      The post Safeguarding Agentic AI Systems: NVIDIA’s Open-Source Safety Recipe appeared first on MarkTechPost.

      Source: Read More 

      Facebook Twitter Reddit Email Copy Link
      Previous ArticleIt’s Okay to Be “Just a Wrapper”: Why Solution-Driven AI Companies Win
      Next Article 9 Open Source Cursor Alternatives You Should Use in 2025

      Related Posts

      Machine Learning

      How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

      July 29, 2025
      Machine Learning

      Amazon Develops an AI Architecture that Cuts Inference Time 30% by Activating Only Relevant Neurons

      July 29, 2025
      Leave A Reply Cancel Reply

      For security, use of Google's reCAPTCHA service is required which is subject to the Google Privacy Policy and Terms of Use.

      Continue Reading

      CVE-2025-49012 – Microsoft Azure Entra ID Intune Himmelblau Privilege Escalation Vulnerability

      Common Vulnerabilities and Exposures (CVEs)

      CVE-2025-4060 – PHPGurukul Notice Board System SQL Injection Vulnerability

      Common Vulnerabilities and Exposures (CVEs)

      I’ve tested the Meta Ray-Bans for months, and these 5 features still amaze me

      News & Updates

      CVE-2024-52887 – Shrew Soft VPN Bookmark Script Injection Vulnerability

      Common Vulnerabilities and Exposures (CVEs)

      Highlights

      Predict your future health? Ultrahuman’s new AI tool says it can – for $800/year

      June 17, 2025

      Using a blood test that requires 8-10 vials, Ultrahuman Blood Vision will deliver an assessment…

      CVE-2025-46349 – YesWiki Reflected Cross-Site Scripting (XSS) Vulnerability

      April 29, 2025

      This iconic Xbox RPG is just $25 in an anti-Prime Day deal — don’t miss out on a physical copy while they last

      July 8, 2025

      Bluetooth flaws could let hackers spy through your microphone

      June 29, 2025
      © DevStackTips 2025. All rights reserved.
      • Contact
      • Privacy Policy

      Type above and press Enter to search. Press Esc to cancel.