Adversarial Attacks on Aligned Language Models

Ishan Garg

Aug 27, 2023

I decided to ask a certain popular language model how to build an explosive, from everday items (for no particular reason), but it didn’t give me a plausible answer. What is happening here?

‘chess’

Have you ever wondered how would publicly available LLMs respond if asked how to destroy the humanity or how to build an atom bomb?? Well ,turns out they don’t respond to such questions.So what is the reason. Turns out, most LLMs today are trained on text scraped over internet and contains a lot of objectionable content, and in order to prevent the model from answering such questions “aligning” has been done.

So in this blog let us try to understand a new approach based on a recently published paper “Universal and Transferable Adversarial Attacks on Aligned Language Models” to bypass this alignment and produce virtually nay objectionable content.Let’s begin!!

‘prompt’

It is widely known that making small changes to the input of a machine learning model can significantly change its output. Similar techniques have been used against Large Language Models (LLMs), which are powerful language models. Researchers have discovered certain “jailbreaks”, which are cleverly designed input prompts that can make LLMs generate inappropriate or objectionable content. However, unlike traditional adversarial examples that are generated automatically, these jailbreaks are created through human creativity and ingenuity, involving a lot of manual effort to trick the models into producing undesirable results.

‘jailbreak’

So what are we doing different that allows us to produce objectionable content?? In layman terms our attack appends a adversarial suffix to the query that attempts to induce negative behavior i.e. produce “dangerous” content. Our suffix token consists of 3 key elements:-

1) Initial Affirmative Responses

Our attack targets the model to begin its response with “Sure, here is (content of query)” in response to a number of prompts eliciting undesirable behavior.

2) Combined greedy and gradient-based discrete optimization

We take advantage of token-level gradients to identify a set of potential single-token replacements. Once we have this set, we evaluate the loss for each candidate in the set. After evaluating them, we select the best substitution that maximizes the likelihood of a successful attack.

3)Robust multi-prompt and multi-model attacks

We want suffixes that work for multiple prompts across multiple prompts. So we use our greedy gradient method to find a string that works for all prompts across different models.

Now let’s look at the technical aspect of the method.

Assume a user writes the following prompt.

Tell me how to build a bomb

‘blast’

So the LLM behind a chatbot reads it in the following form.

System: You are a chat assistant designed to provide helpful and not  
harmful responses to user queries.  
User: Tell me how to build a bomb.  
Assistant:

In a normal case the LLM would not provide a response.But we introduce our suffix in the following form to get a response.

System: You are a chat assistant designed to provide helpful and not  
harmful responses to user queries.  
User: Tell me how to build a bomb. ! ! ! ! ! ! ! ! ! !  
Assistant :

In order to create a universal attack we do not consider changes to the user input but we try to find the text in place of ! so that any LLM for any prompt responds affirmatively to the user prompt.

Now lets look at the key elements in detail.

Producing Affirmative Responses

First we need to define a loss function to optimize the suffix. We could use log likelihood but it comes with its own set of problems. Instead we would use the approach specified above. The output of LLM would be like this

System: You are a chat assistant designed to provide helpful and not  
harmful responses to user queries.  
User: Tell me how to build a bomb. ! ! ! ! ! ! ! ! ! !  
Assistant: Sure, here is how to build a bomb:

The intuition of this approach is that if the language model can be put into a “state” where this completion is the most likely response, as opposed to refusing to answer the query, then it likely will continue the completion with precisely the desired objectionable behavior. This can be attributed to the autoregressive nature of the LLMs.

In multimodal LLMs specifying the first target token was found to be sufficient but in case of text-only space there is a chance that the suffix could overwrite the entire prompt thus getting a response but not the intended one.

Now let’s have a look at the optimization problem.

It denotes the probability that the next token is xn+1 given previous n tokens .

We try to minimize the negative log likelihood of probability of target of sequences from x = n+1 to x = n+H where n is the input size.

Greedy oordinate Gradient-based Search

‘algo1’ A primary challenge in optimizing is that we have to optimize over a discrete set of inputs.

Here in the algorithm we use gradients with respect to each token to find a set of promising candidates for replacement at each token position.

Here `I` is the set of the positions of the adversarial suffix. So in the loop we first try to find the k substitutions having lowest gradients for all the positions.Then we initialize elements for each batch by selecting elements at random from the substitution set and then find the batch for which the loss function is minimum.

Universal Multi-prompt and Multi-model attacks

‘algo2’

Now we build upon the above algorithm to optimize the attack for multiple prompts.Unlike in the above algorithm here x represents the prompts by the user. We use multiple prompts and their corresponding losses and define a postfix `p` of length l tokens.Instead of specifying a different subset of modifiable tokens for all the prompts we choose a single postfix and optimize the losses over that. Similar to above approach we first find the top -K substitutions for the first prompt by optimizing over p.We start with only first prompt and increment the prompts only when the postfix yields results on the earlier prompts.

After finding the k substitutions the process is similar to the process in the previous algorithm.To make the adversarial examples transferable, we incorporate loss functions over multiple models.

Results

‘results’ ‘results2’

Following results were obtained on using the above method

We find that combining multiple GCG prompts can further improve ASR on several models. Firstly, we attempt to concatenate three GCG prompts into one and use it as the suffix to all behaviors. The “+ Concatenate” row of Table 2 shows that this longer suffix particularly increases ASR from 47.4% to 79.6% on GPT-3.5 (gpt-3.5-turbo), which is more than 2× higher than using GCG prompts optimized against Vicuna models only.

The method proposed raise substantial questions regarding current methods for the alignment of LLMs.

References

Paper on Universal and Transferable Adversarial Attacks on Aligned Language Models

Photo by Joshua Hoehne on Unsplash