ML applications

AgentPoison: A Novel Red Teaming Approach and Backdoor Attack Targeting Generic and RAG-based LLM Agents by Poisoning their Long-Term Memory or RAG Knowledge Base

Editorial Team

July 26, 2024
4 Min Read

AgentPoison: A Novel Red Teaming Approach and Backdoor Attack Targeting Generic and RAG-based LLM Agents by Poisoning their Long-Term Memory or RAG Knowledge Base

Recent advances in large language models (LLMs) have made it possible to use LLM agents in many areas, including safety-critical ones like finance, healthcare, and self-driving cars. Usually, these agents use an LLM to understand tasks for making plans, and they can use external tools, like third-party APIs, to carry out those plans. However, their focus is mostly on efficacy and generalization, and their trustworthiness is yet to be explored thoroughly. The use of potentially unreliable knowledge bases is the main challenge to the trustworthiness of LLM agents. For instance, state-of-the-art LLMs can produce harmful responses if given malicious examples during knowledge-enabled reasoning.

Current attacks on large language models (LLMs), like jailbreaking during testing and backdooring in-context learning, are inefficient against LLM agents using retrieval-augmented generation (RAG). Jailbreaking attacks, such as GCG, face difficulties due to a strong retrieval process that can handle injected harmful content. Backdoor attacks, like BadChain, use weak triggers that fail to retrieve malicious demonstrations in LLM agents, which leads to poor attack success. This paper discusses two related works: LLM agents based on RAG and red-teaming LLM agents. Recent studies on backdoor attacks in LLM agents focus only on poisoning the training data of LLM backbones and do not evaluate the safety of more advanced RAG-based LLM agents.

A team of researchers from the University of Chicago, the University of Illinois at Urbana-Champaign, the University of Wisconsin-Madison, and the University of California, Berkeley, have introduced a new method called AGENTPOISON. It is the first backdoor attack targeting generic LLM agents based on RAG. AGENTPOISON is launched by corrupting the long-term memory or knowledge base using a few harmful examples of the victim LLM agent, including a valid question, a special trigger, and some adversarial targets. This method makes the agent, retrieve these harmful examples whenever the query has a special trigger, causing the agent to produce the adversarial outcomes shown in the examples.

To show how AGENTPOISON works in different real-world situations, three types of agents are chosen for various tasks, (a) Agent-Driver for self-driving cars, (b) ReAct agent for answering questions that require a lot of knowledge, and (c) EHRAgent for managing healthcare records. The following metrics are considered:

The attack success rate for retrieval (ASR-r): This is the percentage of test cases where all the examples, retrieved from the database are poisoned.
The attack success rate for the target action (ASR-a): This is the percentage of test cases where the agent performs the target action (like a “sudden stop”) after retrieving poisoned examples successfully.

The results obtained from the experiments demonstrate that AGENTPOISON has a high attack success rate and good benign utility. Compared to the other methods, the proposed method has minimal impact on benign performance, with an average of only 0.74%, while outperforming the baselines by achieving a retrieval success rate of 81.2%. It generates target actions 59.4% of the time, where 62.6% of these actions impact the environment as intended. This method also transfers well across different embedders, creating a unique cluster in the embedding space that remains unique even with similar data distributions.

In summary, researchers have introduced a new red-teaming method called AGENTPOISON that extensively evaluates the safety and reliability of RAG-based LLM agents. It uses a special algorithm to map queries into a particular and compact area in the embedding space, to ensure high retrieval accuracy and a high success rate for attacks. Moreover, the proposed method does not need any model training, and the optimized trigger is highly adaptable, stealthy, and coherent. Extensive experiments on three real-world agents demonstrate that AGENTPOISON outperforms all four baseline methods across the four key metrics present in this paper.

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter..

Don’t Forget to join our 46k+ ML SubReddit

Find Upcoming AI Webinars here

Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.