
Image Hijacks: Adversarial Images can Control Generative Models at Runtime

Luke Bailey, Euan Ong, Stuart Russell, Scott Emmons, ICML 2024

Summary

The paper investigates a security vulnerability in Vision-Language Models (VLMs) called “Image Hijacks”: adversarial images that, when fed into a VLM, manipulate the model’s behavior at inference time, forcing it to produce outputs desired by the attacker. The authors introduce a novel algorithm, “Behavior Matching,” for crafting such image hijacks, then explore various attack scenarios using this technique, demonstrating its effectiveness against LLaVA. Notably, these attacks are automated, require only subtle image perturbations, and can even mimic the effects of specific textual instructions.

Contributions

- Behavior Matching: a general algorithm for crafting image hijacks that force a VLM to produce attacker-chosen outputs across a range of user contexts.
- Prompt Matching: an extension for target behaviors that are difficult to specify as a dataset, which distills the effect of a textual prompt into an image hijack.
- A systematic evaluation against LLaVA across four attack scenarios, showing that the attacks are automated and require only subtle image perturbations.

Method

Formalization:

The authors formalize a VLM as a function $M_\phi(x, \text{ctx})$ that takes an image $x$ and a text context $\text{ctx}$ as input and outputs $\text{out}$, a sequence of logits.

Target Behavior:
The desired behavior is defined as a function $B$ mapping input contexts to target logit sequences ($B: C \to \text{Logits}$, where $C$ is the set of input contexts).
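
As a concrete anchor for the notation, here is a minimal PyTorch-style sketch of $M_\phi$ and a simple behavior $B$, assuming a Hugging Face-style VLM wrapper; the names `vlm`, `M_phi`, `B`, and the target string are illustrative, not the authors' code.

```python
import torch

# The VLM as a function M_phi(x, ctx) -> logits of shape (batch, seq_len, vocab).
# `vlm` is an assumed Hugging Face-style model taking pixel_values/input_ids.
def M_phi(vlm, image: torch.Tensor, ctx_ids: torch.Tensor) -> torch.Tensor:
    return vlm(pixel_values=image, input_ids=ctx_ids).logits

# A simple "fixed string" behavior B: every context maps to the same target
# string (standing in for its tokenized target logit sequence).
def B(ctx: str) -> str:
    return "You have been hijacked."  # placeholder target, not from the paper
```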

Optimization:
The algorithm uses projected gradient descent to find an image hijack $\hat{x}$ that minimizes the cross-entropy loss between the VLM’s output given the hijack and the target logits over a set of input contexts $C$. This approach ensures that the generated hijack generalizes across different user inputs, as it’s optimized to produce the target behavior for a range of contexts.

Mathematical Representation:

The goal is to find:

\[\hat{x} = \underset{x \in \text{Image}}{\arg \min} \sum_{\text{ctx} \in C} \left[ L\left(M^\text{force}_\phi(x, \text{ctx}, B(\text{ctx})), B(\text{ctx})\right) \right]\]

where:

- $M^\text{force}_\phi(x, \text{ctx}, B(\text{ctx}))$ denotes the VLM run with teacher forcing, i.e. its output logits when the target sequence $B(\text{ctx})$ is fed in as the continuation of the context;
- $L$ is the cross-entropy loss between the VLM’s logits and the target logits;
- $C$ is the set of input contexts the hijack is trained over; and
- $\text{Image}$ is the set of valid (and, for subtle attacks, perturbation-constrained) images.
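
Below is a minimal PyTorch sketch of this optimization, assuming an $\ell_\infty$ perturbation budget; the names and hyperparameters (`vlm`, `processor`, `steps`, `eps`) are assumptions, and image preprocessing/normalization is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def behavior_matching(vlm, processor, contexts, target_text,
                      steps=500, lr=1e-2, eps=8 / 255):
    """Projected gradient descent on the image pixels so that, for every
    context, the VLM's teacher-forced logits match the target sequence."""
    target_ids = processor.tokenizer(target_text, return_tensors="pt").input_ids
    x0 = torch.rand(1, 3, 224, 224)        # base image (centre of the eps-ball)
    x = x0.clone().requires_grad_(True)

    for _ in range(steps):
        loss = 0.0
        for ctx in contexts:
            ctx_ids = processor.tokenizer(ctx, return_tensors="pt").input_ids
            # Teacher forcing: feed context + target, then score the
            # positions whose next-token predictions are the target tokens.
            input_ids = torch.cat([ctx_ids, target_ids], dim=1)
            logits = vlm(pixel_values=x, input_ids=input_ids).logits
            tgt_logits = logits[:, ctx_ids.shape[1] - 1 : -1, :]
            loss = loss + F.cross_entropy(
                tgt_logits.reshape(-1, tgt_logits.shape[-1]),
                target_ids.reshape(-1),
            )
        loss.backward()
        with torch.no_grad():
            x -= lr * x.grad.sign()                # signed-gradient step
            x.clamp_(min=x0 - eps, max=x0 + eps)   # project into eps-ball
            x.clamp_(0.0, 1.0)                     # stay a valid image
            x.grad.zero_()
    return x.detach()
```

The projection step is what keeps the perturbation subtle: after every update the hijack is clamped back into a small $\ell_\infty$ ball around the base image. The exact constraint set ($\text{Image}$ in the equation above) is a design choice; the eps-ball here is one common option.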

Prompt Matching:

For attacks where defining the target behavior directly through a dataset is difficult (e.g., disinformation attacks), the authors introduce “Prompt Matching”. This method leverages the fact that it’s often easier to characterize the desired behavior using a specific text prompt. Instead of training on a dataset of (context, target_text) pairs, Prompt Matching trains on (context, target_logits) pairs, where the target logits are generated by the VLM when provided with the target prompt concatenated with the context.
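
A sketch of how these (context, target_logits) pairs could be collected, again under the assumed Hugging Face-style interface above; `output_scores` and `return_dict_in_generate` are standard transformers generation options, but the surrounding names are illustrative.

```python
import torch

@torch.no_grad()
def prompt_matching_targets(vlm, processor, target_prompt, contexts,
                            clean_image, max_new_tokens=64):
    """For each context, run the VLM on target_prompt + context and record
    the logits of its generation; these serve as B(ctx) during training."""
    pairs = []
    for ctx in contexts:
        ids = processor.tokenizer(target_prompt + ctx,
                                  return_tensors="pt").input_ids
        gen = vlm.generate(pixel_values=clean_image, input_ids=ids,
                           max_new_tokens=max_new_tokens,
                           output_scores=True, return_dict_in_generate=True)
        # gen.scores is one (batch, vocab) logits tensor per generated token.
        pairs.append((ctx, torch.stack(gen.scores, dim=1)))
    return pairs
```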

Results

The authors systematically evaluate the effectiveness of their image hijacks in four attack scenarios: