This AI Paper from Tel Aviv University Introduces GASLITE: A Gradient-Based Method to Expose Vulnerabilities in Dense Embedding-Based Text Retrieval Systems

Dense embedding-based text retrieval has become the cornerstone for ranking text passages in response to queries. These systems use deep learning models to embed text into vector spaces where semantic similarity can be measured. The approach has been widely adopted in applications such as search engines and retrieval-augmented generation (RAG), where retrieving accurate and contextually relevant information is critical. By building on learned representations, these systems efficiently match queries with relevant content, driving major advances in knowledge-intensive domains.

However, a major challenge for embedding-based retrieval systems is their susceptibility to adversarial manipulation. These systems often build on public corpora, which are not immune to adversarial content. Malicious actors can inject crafted passages into the corpus so that the retrieval system ranks the adversarial entries above legitimate passages for the queries they target. This threatens the integrity of search results by spreading misinformation or introducing biased content, and it endangers the reliability of knowledge systems.

Previous attacks have relied on simple poisoning techniques, such as stuffing passages with the text of targeted queries or embedding misleading information. Although these methods can work against a single targeted query, they are often ineffective when the attack must cover a diverse distribution of queries. Existing defenses also do not address the core vulnerabilities in embedding-based retrieval systems, leaving the systems open to more advanced and subtle attacks.

Researchers at Tel Aviv University introduced GASLITE, a mathematically grounded, gradient-based optimization method for crafting adversarial passages. GASLITE performs better than previous techniques because it optimizes directly in the retrieval model’s embedding space rather than relying on hand-crafted changes to the text. It aligns the adversarial passage with a targeted query distribution, so the passage achieves high visibility in retrieval results. This makes it a potent tool for evaluating vulnerabilities in dense embedding-based systems.

The GASLITE methodology is grounded in mathematical analysis and gradient-based optimization. It constructs adversarial passages from an attacker-chosen prefix combined with an optimized trigger designed to maximize similarity to the targeted query distribution. The optimization uses gradients computed in the embedding space to find promising token substitutions. Unlike previous approaches, GASLITE does not modify the model or edit existing corpus passages; it only generates new text that manipulates the retrieval system’s ranking. This design makes it stealthy and effective: adversarial passages can blend into the corpus without being detected by standard defenses.
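The prefix-plus-trigger idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a hypothetical toy encoder that mean-pools a random embedding table (EMB), and it uses a generic first-order, greedy token-substitution loop. The names EMB, encode, objective, and craft are made up for illustration. With a real transformer encoder the gradient would come from backpropagation rather than the closed form used here.

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 50, 8

# Hypothetical toy embedding table: one random vector per token id.
EMB = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def encode(tokens):
    """Toy encoder (an assumption): mean of the token embeddings."""
    n = len(tokens)
    return [sum(EMB[t][d] for t in tokens) / n for d in range(DIM)]


def objective(passage, query_embs):
    """Average dot-product similarity between the passage and the queries."""
    p = encode(passage)
    return sum(dot(p, q) for q in query_embs) / len(query_embs)


def craft(prefix, trigger_len, query_embs, steps=20):
    """Keep the prefix fixed and greedily optimize the trigger tokens."""
    trigger = [0] * trigger_len
    centroid = [sum(q[d] for q in query_embs) / len(query_embs) for d in range(DIM)]
    n = len(prefix) + trigger_len
    # For this mean-pooled toy encoder, the gradient of the objective with
    # respect to each token's embedding is simply centroid / n.
    grad = [c / n for c in centroid]
    for _ in range(steps):
        improved = False
        for i in range(trigger_len):
            current = trigger[i]
            # First-order estimate of the gain from swapping in each token b.
            gains = [
                dot([EMB[b][d] - EMB[current][d] for d in range(DIM)], grad)
                for b in range(VOCAB_SIZE)
            ]
            best = max(range(VOCAB_SIZE), key=gains.__getitem__)
            if gains[best] <= 1e-12:
                continue
            candidate = trigger[:i] + [best] + trigger[i + 1:]
            # Accept the swap only if a real forward pass confirms the gain.
            if objective(prefix + candidate, query_embs) > objective(prefix + trigger, query_embs):
                trigger = candidate
                improved = True
        if not improved:
            break
    return prefix + trigger


# Usage: ten hypothetical queries, a three-token prefix, a six-token trigger.
queries = [encode([random.randrange(VOCAB_SIZE) for _ in range(5)]) for _ in range(10)]
prefix = [1, 2, 3]
start = prefix + [0] * 6
adv = craft(prefix, 6, queries)
```

Because the toy encoder is linear, the first-order estimate is exact and the loop converges after a single pass. With a nonlinear encoder the estimate is only approximate, which is why the sketch re-checks each candidate with a forward pass before accepting it.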

The authors tested GASLITE on nine state-of-the-art retrieval models under various threat scenarios. The method consistently outperformed baseline approaches, achieving a 61-100% success rate in placing adversarial passages within the top 10 results for concept-specific queries. These results required minimal poisoning: adversarial passages made up just 0.0001% of the corpus. This top-10 visibility held across most retrieval models, showcasing the method’s precision and efficiency. In single-query attacks, the method consistently ranked adversarial content as the top result, even under the most stringent conditions.
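To make the reported quantities concrete, here is a small sketch of how a top-k success rate and a poisoning rate could be computed. The function names, the tie-breaking rule, and all numbers are illustrative assumptions, not the paper's evaluation code or data.

```python
def appeared_at_k(corpus_scores, adv_scores, k=10):
    """True if the best adversarial passage would land in the top k for one query.

    Ties are resolved in favour of the adversarial passage (an assumption).
    """
    best_adv = max(adv_scores)
    higher = sum(1 for s in corpus_scores if s > best_adv)
    return higher < k


def success_rate(per_query, k=10):
    """Fraction of queries for which an adversarial passage reaches the top k."""
    hits = sum(appeared_at_k(corpus, adv, k) for corpus, adv in per_query)
    return hits / len(per_query)


def poisoning_rate_percent(num_adversarial, corpus_size):
    """Adversarial passages as a percentage of the original corpus size."""
    return 100.0 * num_adversarial / corpus_size


# Made-up similarity scores for two queries: (corpus scores, adversarial scores).
per_query = [
    ([0.9, 0.8, 0.1], [0.85]),  # one corpus passage scores higher
    ([0.9, 0.8, 0.7], [0.10]),  # three corpus passages score higher
]
rate_at_2 = success_rate(per_query, k=2)

# Hypothetical corpus of 10 million passages with 10 injected passages.
poison_pct = poisoning_rate_percent(10, 10_000_000)
```

At this hypothetical scale, a 0.0001% poisoning rate corresponds to only about ten injected passages.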

Further analysis showed that embedding-space geometry and the choice of similarity metric strongly influence model susceptibility. Models using dot-product similarity were particularly vulnerable because GASLITE exploited this measure to achieve optimal alignment with targeted query distributions. The researchers also found that models with anisotropic embedding spaces, where even random text pairs have high similarity, were more susceptible to attacks. This points to the importance of understanding embedding-space properties when designing retrieval systems.
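Two of these properties are easy to see on toy vectors. The sketch below is an illustration under stated assumptions, not the paper's analysis. It shows that scaling a vector raises its dot-product score while leaving cosine similarity unchanged, which is one plausible reason dot-product models offer more room to an attacker. It also shows that adding a shared offset to otherwise random vectors, a simple stand-in for anisotropy, raises the average similarity between random pairs.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs: a rough anisotropy probe."""
    total, pairs = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += cosine(vectors[i], vectors[j])
            pairs += 1
    return total / pairs


# 1) Norm sensitivity: tripling the passage vector triples the dot product
#    but does not change the cosine similarity.
query = [1.0, 2.0]
passage = [2.0, 1.0]
inflated = [3.0 * x for x in passage]

# 2) Anisotropy: a shared offset makes random vectors point the same way.
rng = random.Random(0)
isotropic = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(50)]
anisotropic = [[x + 3.0 for x in v] for v in isotropic]
iso_score = mean_pairwise_cosine(isotropic)
aniso_score = mean_pairwise_cosine(anisotropic)
```

In practice, a probe like mean_pairwise_cosine would be run on embeddings of random passages from the actual model, not on synthetic vectors.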

The work underscores the need for strong defenses against adversarial manipulation in embedding-based retrieval systems. The authors recommend hybrid retrieval approaches that combine dense and sparse retrieval techniques, which can reduce the risks posed by methods such as GASLITE. By exposing vulnerabilities in current retrieval systems, the work paves the way for more secure and resilient technologies.
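One common way to combine dense and sparse rankings is reciprocal rank fusion (RRF). The article does not say which fusion method the authors have in mind, so RRF is used here only as a familiar stand-in, and the rankings and document ids are made up. The sketch shows how a passage that tops the dense ranking but sits at the bottom of the sparse ranking can lose its first place after fusion.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is a commonly used default, not a value taken from the paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings for one query: the adversarial passage "adv" wins the
# dense ranking but has little lexical overlap, so the sparse ranker puts it last.
dense = ["adv", "d1", "d2", "d3"]
sparse = ["d1", "d2", "d3", "adv"]
fused = reciprocal_rank_fusion([dense, sparse])
```

In this toy case the adversarial passage drops from first to second place. Fusion lowers its rank without removing it, which matches the framing of hybrid retrieval as a way to reduce risk rather than eliminate it.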

The researchers call for urgent attention to the risks that such adversarial attacks pose to dense embedding-based systems. The minimal effort GASLITE needs to manipulate search results shows how severe such attacks could be. By characterizing critical vulnerabilities and proposing actionable defenses, this work provides valuable insights into improving the robustness and reliability of retrieval models.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

The post This AI Paper from Tel Aviv University Introduces GASLITE: A Gradient-Based Method to Expose Vulnerabilities in Dense Embedding-Based Text Retrieval Systems appeared first on MarkTechPost.
