
    Meet MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue

    December 5, 2024

    Large Language Models (LLMs) are powerful tools for many applications thanks to their knowledge and language-understanding capabilities. However, they are also vulnerable to exploitation, particularly through jailbreaking attacks in multi-round dialogues. Such attacks exploit the complex, sequential nature of human-LLM interactions to subtly manipulate the model’s responses over multiple exchanges. By carefully constructing questions and steering the conversation step by step, attackers can bypass safety controls and induce LLMs to generate illegal, unethical, or otherwise harmful content. This poses a serious challenge to the safe and responsible deployment of these systems.

    Existing work on LLM safety focuses predominantly on single-round attacks, such as those using prompt engineering or encoded harmful queries, and does not address the complexities of multi-round interactions. LLM attacks can be divided into single-round and multi-round attacks. Single-round attacks, which rely on techniques such as prompt engineering and fine-tuning, have had limited success against closed-source models. Multi-round attacks, though less common, exploit sequential, human-like dialogue to elicit harmful responses. Notable methods such as Chain-of-Attack (CoA) improve effectiveness by building semantic links across rounds, but they depend heavily on the conversational abilities of LLMs.

    To address these issues, a team of researchers from Alibaba Group, Beijing Institute of Technology, Nanyang Technological University, and Tsinghua University has proposed MRJ-Agent, a novel multi-round dialogue jailbreaking agent. The agent emphasizes stealth: it uses a risk decomposition strategy that spreads risk across multiple rounds of queries, combined with psychological strategies that strengthen the attacks.

    MRJ-Agent decomposes a toxic query incrementally across rounds, making the attack harder for the LLM to identify or block. It starts with an innocuous question and gradually steers toward more sensitive information, culminating in a harmful response. An information-based control strategy keeps the sub-queries semantically similar to the original harmful query, and psychological tactics are used to minimize the likelihood that the LLM refuses.

    Large-scale experiments show that MRJ-Agent outperforms previous single-round and multi-round attack methods, achieving state-of-the-art attack success rates: 100% on models such as Vicuna-7B and nearly 98% on GPT-4. Owing to its adaptive and exploratory nature, it can develop more generalized attack strategies that apply across diverse models and scenarios. The agent also remains effective and stealthy against defenses such as prompt detectors and system prompts.
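    Attack success rate (ASR), the metric behind the 100% and roughly 98% figures, is conventionally the fraction of attack attempts that a judge deems successful. The article does not describe how the paper judges success, so the sketch below simply assumes a boolean verdict per attempt is already available. The `Attempt` record and `asr_by_model` function are hypothetical names used only for illustration, and the demo data is a toy example, not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """Hypothetical record of one attack attempt: target model name and a judge's verdict."""
    model: str
    succeeded: bool


def asr_by_model(attempts: list[Attempt]) -> dict[str, float]:
    """Return the attack success rate (successful attempts / total attempts) per target model."""
    totals: dict[str, int] = defaultdict(int)
    successes: dict[str, int] = defaultdict(int)
    for a in attempts:
        totals[a.model] += 1
        successes[a.model] += int(a.succeeded)
    # Models appear in the order they were first seen in `attempts`.
    return {m: successes[m] / totals[m] for m in totals}


if __name__ == "__main__":
    # Toy data for illustration only; not numbers from the paper.
    demo = [Attempt("model-a", True)] * 4 + [Attempt("model-b", True), Attempt("model-b", False)]
    print(asr_by_model(demo))  # {'model-a': 1.0, 'model-b': 0.5}
```

    Computing ASR per model, rather than pooled, mirrors how the article reports separate rates for Vicuna-7B and GPT-4.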

    In conclusion, MRJ-Agent exposes the vulnerability of LLMs in multi-round dialogues. By combining risk decomposition with psychological strategies, it substantially raises jailbreak success rates, opens new perspectives for future research on LLM safety, and contributes to the discussion of societal governance as conversational AI systems become increasingly integrated. Keeping human-AI interactions safe is paramount as these systems become more deeply embedded in everyday life.


    Check out the Paper. All credit for this research goes to the researchers of this project.


    The post Meet MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue appeared first on MarkTechPost.

