
    Fine-tuning AdvPrompter: A Novel AI Method to Generate Human-Readable Adversarial Prompt

    May 1, 2024

    Large Language Models (LLMs) have been highly successful and are widely used across many fields. Their outputs are sensitive to the input prompt, and this sensitivity has motivated a line of research that tries to understand and exploit it, including methods that construct prompts for zero-shot and in-context learning. For instance, AutoPrompt identifies task-specific tokens for zero-shot text classification and fact retrieval. It scores candidate tokens using gradients of a task-specific loss in order to find the optimal probability distributions over discrete tokens.
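
    Gradient-based token scoring can be illustrated with a small sketch. It assumes a first-order scheme in which each vocabulary token is scored by the dot product between its embedding and the gradient of the label log-likelihood at the prompt position; the article does not spell out the scoring rule, so this form, the function names, and the toy numbers are all assumptions made for illustration.

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_candidates(
    embeddings: Dict[str, List[float]],
    grad_log_likelihood: List[float],
    k: int,
) -> List[str]:
    """Rank tokens by a first-order estimate of how much placing them at the
    prompt position would raise the label log-likelihood (assumed scheme)."""
    scores = {tok: dot(vec, grad_log_likelihood) for tok, vec in embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical 2-d embeddings and gradient; real models use far larger vectors.
toy_embeddings = {
    "good": [1.0, 0.0],
    "bad": [-1.0, 0.0],
    "movie": [0.0, 1.0],
}
toy_grad = [0.5, 0.1]

candidates = top_k_candidates(toy_embeddings, toy_grad, k=2)
print(candidates)
```

    In a real system the gradient would come from backpropagating the task loss through the model, and each shortlisted token would then be re-evaluated with a forward pass, since the first-order estimate is only approximate.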

    Despite their capabilities, LLMs remain vulnerable to jailbreaking attacks, which cause them to generate harmful or toxic content. Such attacks rely on adversarial prompts, which are typically crafted through manual red-teaming, for example by appending a suffix to a given instruction. This manual process is inefficient and time-consuming. Automated generation of adversarial prompts, on the other hand, frequently produces attacks that lack semantic meaning, are easily detected by perplexity-based filters, and may need gradient information from the TargetLLM.
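
    To make the perplexity-based filtering mentioned above concrete, here is a minimal sketch. It assumes the filter already has per-token natural-log probabilities from some language model; the function names, the threshold of 100, and the toy probabilities are hypothetical and not taken from the paper.

```python
import math
from typing import List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def is_flagged(token_logprobs: List[float], threshold: float) -> bool:
    """Flag a prompt whose perplexity exceeds the (hypothetical) threshold."""
    return perplexity(token_logprobs) > threshold


# Toy inputs: a fluent prompt whose tokens are each fairly likely, and a
# gibberish suffix whose tokens are each very unlikely under the model.
fluent = [math.log(0.5)] * 4
gibberish = [math.log(0.001)] * 4

THRESHOLD = 100.0  # assumed value; in practice tuned on benign prompts
print(is_flagged(fluent, THRESHOLD), is_flagged(gibberish, THRESHOLD))
```

    A human-readable suffix tends to look like the fluent case, which is why readability matters for getting past this kind of filter.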

    Researchers from AI at Meta and the Max Planck Institute for Intelligent Systems in Tübingen, Germany, introduced a novel method that uses another LLM, called AdvPrompter, to generate human-readable adversarial prompts in seconds, roughly 800× faster than existing optimization-based approaches. AdvPrompter is trained with an algorithm called AdvPrompterTrain, which does not need access to the TargetLLM's gradients. Once trained, AdvPrompter generates suffixes that veil the input instruction without changing its meaning, luring the TargetLLM into giving a harmful response.

    The approach proposed by researchers has the following key advantages:

    AdvPrompter generates coherent, human-readable adversarial prompts.

    The researchers' experiments on multiple open-source LLMs demonstrated high attack success rates (ASR) compared with previous approaches such as GCG and AutoDAN.

    The trained AdvPrompter can generate adversarial suffixes using next-token prediction, unlike previous methods such as GCG and AutoDAN, which need to solve new optimization problems for every generated suffix. 
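
    The attack success rate referred to above can be computed in a few lines. This sketch assumes the common "at least one of k attempts succeeds" convention, which the article does not define explicitly; the function name and the toy outcome table are hypothetical.

```python
from typing import List


def asr_at_k(outcomes: List[List[bool]], k: int) -> float:
    """Fraction of instructions for which at least one of the first k
    sampled prompts was judged successful (assumed ASR@k convention)."""
    if not outcomes:
        raise ValueError("need at least one instruction")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(1 for per_instruction in outcomes if any(per_instruction[:k]))
    return hits / len(outcomes)


# Toy results: one row per instruction, one column per sampled prompt.
toy_outcomes = [
    [False, True, False],
    [False, False, False],
    [True, False, False],
    [False, False, True],
]

for k in (1, 2, 3):
    print(k, asr_at_k(toy_outcomes, k))
```

    Under this convention the metric can only stay flat or rise as k grows, because each extra sample is another chance to succeed.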

    Because the trained AdvPrompter decodes with a non-zero temperature, the suffixes it generates are random, which lets users rapidly sample a diverse set of adversarial prompts. Evaluating more samples improves performance and raises the chance of a successful attack, and the gains stabilize at around k = 10, where k is the number of candidates of a score vector. Moreover, the researchers found that even the initial version of Llama2-7b, without any fine-tuning, steadily improves as more samples are evaluated, which indicates that diversity in the generated suffixes helps an attack succeed.
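
    The role of temperature in producing diverse samples can be shown with a generic next-token sampler. This is a standard temperature-scaled softmax rather than code from the paper; the logits and function names are made up for illustration.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities; higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits
sharp = softmax_with_temperature(toy_logits, 0.5)
flat = softmax_with_temperature(toy_logits, 2.0)

rng = random.Random(0)
samples = [sample_token(toy_logits, 1.0, rng) for _ in range(10)]
print(sharp, flat, samples)
```

    With a low temperature almost all probability mass sits on the top token, so repeated sampling yields near-identical outputs; raising the temperature spreads the mass and makes repeated draws differ, which is what allows many distinct suffixes to be sampled for one instruction.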

    In conclusion, the researchers proposed a novel method for automated red-teaming of LLMs. The core of the approach is to train AdvPrompter with an algorithm called AdvPrompterTrain so that it generates human-readable adversarial prompts. A second novel algorithm, AdvPrompterOpt, automatically generates adversarial prompts and is used inside the training loop to fine-tune AdvPrompter's predictions. Future work includes a detailed analysis of safety fine-tuning on automatically generated data, motivated by the increase in the TargetLLM's robustness obtained via AdvPrompter.

    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post Fine-tuning AdvPrompter: A Novel AI Method to Generate Human-Readable Adversarial Prompt appeared first on MarkTechPost.
