Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models

Yuancheng Xu¹, Jiarui Yao², Manli Shu³, Yanchao Sun⁴, Zichu Wu⁵
Ning Yu⁶, Tom Goldstein¹, Furong Huang¹

¹University of Maryland, College Park    ²University of Illinois Urbana-Champaign
    ³Salesforce Research     ⁴Apple
    ⁵University of Waterloo     ⁶Netflix Eyeline Studios

NeurIPS 2024


Responses of the clean and poisoned LLaVA-1.5 models. The poisoned samples are crafted using a different VLM, MiniGPT-v2.

Method

The attacker’s goal is to manipulate the model into responding to original concept images with texts consistent with a destination concept, using stealthy poison samples that can evade human visual inspection.

A poison sample consists of a poison image that looks like a clean image from the destination concept, and a congruent text description. The text description is generated from the clean destination concept image using any off-the-shelf VLM. The poison image is crafted by adding an imperceptible perturbation to the clean destination concept image so that it matches an original concept image in the latent feature space.
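One way to write the crafting step as an optimization problem is shown below. The notation is introduced here for illustration and does not appear on this page: F is the vision encoder, x_d the clean destination concept image, x_o the original concept image, x_p the poison image, and ε a perturbation budget. The use of an L∞ bound to enforce imperceptibility is an assumption.

```latex
\min_{x_p} \; \bigl\| F(x_p) - F(x_o) \bigr\|_2
\quad \text{subject to} \quad
\| x_p - x_d \|_\infty \le \epsilon
```

A small ε keeps x_p visually indistinguishable from x_d, while the objective pulls its features toward those of x_o.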

When trained on these poison samples, the VLM learns to associate the original concept features (in the poison image) with the destination concept texts, achieving the attacker's goal.
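The feature-matching step described above can be sketched with projected gradient descent. This is a minimal sketch under stated assumptions, not the authors' implementation: the vision encoder is replaced by a hypothetical linear map `W` so the gradient can be written by hand, images are flattened vectors in [0, 1], and the budget `eps = 8/255`, the step size, and the iteration count are illustrative choices. A real attack would backpropagate through the VLM's vision encoder instead.

```python
import numpy as np


def craft_poison(x_dest, x_orig, W, eps=8 / 255, step=1 / 255, iters=200):
    """Perturb x_dest inside an L-inf ball so that W @ x_poison approaches W @ x_orig."""
    target = W @ x_orig                       # features of the original concept image
    delta = np.zeros_like(x_dest)             # perturbation, initialized to zero
    for _ in range(iters):
        residual = W @ (x_dest + delta) - target
        grad = 2.0 * (W.T @ residual)         # gradient of ||residual||^2 w.r.t. delta
        delta = delta - step * np.sign(grad)  # signed gradient descent step
        delta = np.clip(delta, -eps, eps)     # project back into the L-inf ball
        # keep the perturbed image a valid image with pixels in [0, 1]
        delta = np.clip(x_dest + delta, 0.0, 1.0) - x_dest
    return x_dest + delta


rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))                # hypothetical linear stand-in for the encoder
x_dest = rng.uniform(0.2, 0.8, size=64)      # clean destination concept image (toy data)
x_orig = rng.uniform(0.2, 0.8, size=64)      # original concept image (toy data)

x_poison = craft_poison(x_dest, x_orig, W)
dist_before = np.linalg.norm(W @ x_dest - W @ x_orig)
dist_after = np.linalg.norm(W @ x_poison - W @ x_orig)
max_change = np.max(np.abs(x_poison - x_dest))
print(f"feature distance: {dist_before:.3f} -> {dist_after:.3f}")
print(f"max pixel change: {max_change:.4f}")
```

The poison image stays within `eps` of the destination image in every pixel, while its features move closer to those of the original concept image. Pairing `x_poison` with a caption of the destination image gives one poison sample.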

Illustration of how Shadowcast crafts a poison sample with visually matching image and text descriptions.

Below is another example of a poison sample, where the original concept is "Junk Food" and the destination concept is "Healthy and Nutritious Food". The poison image looks like the clean destination concept image, and the text matches what the image visibly shows.

Experiment

We consider the following four tasks for poisoning attacks exemplifying the practical risks of VLMs, ranging from misidentifying political figures to disseminating healthcare misinformation.

The red ones are Label Attacks, which trick VLMs into misidentifying class labels, such as mistaking Donald Trump for Joe Biden. The green ones are Persuasion Attacks, which leverage VLMs’ text generation capabilities to craft narratives, such as portraying junk food as health food, through persuasive and seemingly rational descriptions.

Attack tasks and their associated concepts.

We study both grey-box and black-box scenarios. In the grey-box setting, the attacker only has access to the VLM’s vision encoder (no need to access the whole VLM as in the white-box setting). In the black-box setting, the adversary has no access to the specific VLM under attack and instead utilizes an alternate open-source VLM. We evaluate the attack success rates under different poison ratios.
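To make the metric concrete, here is a minimal sketch of how an attack success rate could be computed for a Label Attack. The keyword-matching rule is an assumption, since this page does not specify the exact criterion, and the responses are made-up strings rather than model outputs. Persuasion Attacks produce free-form narratives and would need a different judge.

```python
def attack_success_rate(responses, dest_keyword, orig_keyword):
    """Fraction of responses that mention the destination concept but not the original one."""
    if not responses:
        return 0.0
    dest = dest_keyword.lower()
    orig = orig_keyword.lower()
    hits = 0
    for text in responses:
        lowered = text.lower()
        if dest in lowered and orig not in lowered:
            hits += 1
    return hits / len(responses)


# Hypothetical responses to test images of the original concept (illustrative only).
responses = [
    "This is Joe Biden speaking at a podium.",
    "The image shows Donald Trump at a rally.",
    "Joe Biden is waving to the crowd.",
    "A man in a suit, likely Biden.",
]
asr = attack_success_rate(responses, dest_keyword="Biden", orig_keyword="Trump")
print(f"attack success rate: {asr:.2f}")
```

Repeating this computation for models trained at each poison ratio would give one success-rate curve per task.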

Grey-box results

Attack success rate of Label Attack for LLaVA-1.5.

Attack success rate of Persuasion Attack for LLaVA-1.5.

Shadowcast begins to demonstrate a significant impact (over 60% attack success rate) with a poison rate of under 1% (or 30 poison samples)!
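As a quick check of the arithmetic, the sketch below computes the poison ratio for several poison-set sizes. The clean training set size of 3,500 and the definition of the ratio as poison samples over all training samples are both assumptions made for illustration; neither is stated on this page.

```python
def poison_ratio(n_poison, n_clean):
    """Share of poison samples in the combined training set."""
    return n_poison / (n_poison + n_clean)


N_CLEAN = 3500  # assumed number of clean training samples (illustrative)

for n_poison in (5, 10, 30, 50, 100):
    ratio = poison_ratio(n_poison, N_CLEAN)
    print(f"{n_poison:>3} poison samples -> {ratio:.2%}")
```

Under these assumptions, 30 poison samples make up about 0.85% of the training data, consistent with the "under 1%" figure quoted above.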

Black-box results

(Architecture transferability) Attack success rate for LLaVA-1.5 when InstructBLIP (left) and MiniGPT-v2 (right) are used to craft poison images.

Shadowcast is still effective even across different VLM architectures!

Attack Robustness

What if the VLM uses image data augmentation (as a defense method) during training? Will the poisoned VLM exhibit the targeted behaviour when different text prompts are used? Our evaluation shows that Shadowcast remains effective in both cases.

Attack success rate for augmented LLaVA-1.5 with/without poison augmentation

(Data augmentation) Attack success rate for LLaVA-1.5 trained with data augmentation, when poison images are crafted without augmentation (left) and with augmentation (right).
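One plausible way to express crafting "with augmentation" is to average the feature-matching loss over random augmentations of the poison image. This formulation is an assumption for illustration, not taken from this page: T denotes a hypothetical distribution over training-time augmentations t, F the vision encoder, x_d the clean destination concept image, x_o the original concept image, x_p the poison image, and ε a perturbation budget.

```latex
\min_{x_p} \; \mathbb{E}_{t \sim \mathcal{T}}
\bigl\| F\bigl(t(x_p)\bigr) - F(x_o) \bigr\|_2
\quad \text{subject to} \quad
\| x_p - x_d \|_\infty \le \epsilon
```

Optimizing over augmented views encourages the perturbation to survive the same transformations the victim applies during training.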

Attack success rates with diverse prompts

(Generalization to diverse prompts) Attack success rates when diverse prompts are used during test time.

Ethics and Disclosure

This study uncovers a pivotal vulnerability in the visual instruction tuning of large vision-language models (VLMs), demonstrating how adversaries might exploit data poisoning to disseminate misinformation undetected. While the attack methodologies and objectives detailed in this research introduce new risks to VLMs, the concept of data poisoning is not new, having been a topic of focus in the security domain for over a decade. By bringing these findings to light, our intent is not to facilitate attacks but rather to sound an alarm in the VLM community. Our disclosure aims to elevate vigilance among VLM developers and users, advocate for stringent data examination practices, and catalyze the advancement of robust data cleaning and defensive strategies. We believe that exposing these vulnerabilities is a crucial step towards fostering comprehensive studies of defense mechanisms and ensuring the secure deployment of VLMs in various applications.