Researchers uncover an all-too-easy trick to bypass LLM safeguards

Editorial Team

July 21, 2024
3 Min Read

Researchers uncover an all-too-easy trick to bypass LLM safeguards

A team at EPFL has discovered an alarming security flaw in leading AI language models: Simply rephrasing malicious queries in the past tense often allows users to bypass protective measures and obtain detailed answers normally blocked by the system.

Large language models (LLMs) like ChatGPT and GPT-4o are designed to reject potentially harmful requests. However, Maksym Andriushchenko and Nicolas Flammarion from the École polytechnique fédérale de Lausanne (EPFL) demonstrate in their study “Does Refusal Training in LLMs Generalize to the Past Tense?” that these safeguards can frequently be circumvented by rewording questions in the past tense.

For instance, if a user asks ChatGPT (GPT-4o) how to make a Molotov cocktail, the model refuses. But if the question is rephrased to ask how people used to make them, the model provides step-by-step instructions. I was able to replicate this effect with GPT-4o.

The team systematically evaluated this method across six state-of-the-art language models, including Llama-3 8B, GPT-3.5 Turbo, and GPT-4o. They used GPT-3.5 Turbo to automatically convert malicious queries from the JailbreakBench dataset into past tense forms.

While only 1% of direct malicious requests succeeded with GPT-4o, the success rate jumped to 88% after 20 past-tense reformulation attempts. For sensitive topics like hacking and fraud, the method achieved 100% success rates.

Interestingly, the researchers found that future-tense reformulations were less effective. This suggests that the models’ protective measures tend to classify questions about the past as less harmful than hypothetical future scenarios.

According to Andriushchenko and Flammarion, these results show that widely used alignment techniques such as SFT, RLHF, and adversarial training – employed to enhance model safety – can be fragile and don’t always generalize as intended.

“We believe that the generalization mechanisms underlying current alignment methods are understudied and require further research,” the researchers write.

This study highlights the unpredictability of LLM technology and raises questions about its use in critical operations and infrastructure. The newly discovered vulnerability could compromise existing security measures, and the fact that such an obvious and easily exploitable flaw went undetected for so long is troubling, to say the least.

Recommendation

The researchers also show a way to mitigate this security issue: A GPT-3.5 fine-tuned with critical past tense prompts and corresponding rejections was able to reliably detect and reject sensitive prompts.