Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models

University of Maryland, College Park1    JP Morgan AI Research2
    University of Waterloo3     Salesforce Research4

February 2024

Responses of the clean and poisoned LLaVA-1.5 models. The poison samples are crafted using a different VLM, MiniGPT-v2.

Abstract

Vision-Language Models (VLMs) excel in generating textual responses from visual inputs, yet their versatility raises significant security concerns. This study takes the first step in exposing VLMs' susceptibility to data poisoning attacks that can manipulate responses to innocuous, everyday prompts. We introduce Shadowcast, a stealthy data poisoning attack method where poison samples are visually indistinguishable from benign images with matching texts. Shadowcast demonstrates effectiveness in two attack types. The first is Label Attack, tricking VLMs into misidentifying class labels, such as confusing Donald Trump for Joe Biden. The second is Persuasion Attack, which leverages VLMs' text generation capabilities to craft narratives, such as portraying junk food as health food, through persuasive and seemingly rational descriptions. We show that Shadowcast is highly effective in achieving the attacker's intentions using as few as 50 poison samples. Moreover, these poison samples remain effective across various prompts and are transferable across different VLM architectures in the black-box setting. This work reveals how poisoned VLMs can generate convincing yet deceptive misinformation and underscores the importance of data quality for responsible deployments of VLMs.

TL;DR: Shadowcast is the first stealthy data poisoning attack against Vision-Language Models (VLMs). The poisoned VLMs can disseminate misinformation coherently, subtly shifting users’ perceptions.

Method

The attacker’s goal is to manipulate the model into responding to original concept images with texts consistent with a destination concept, using stealthy poison samples that can evade human visual inspection.

A poison sample consists of a poison image that looks like a clean image from the destination concept, and a congruent text description. The text description is generated from the clean destination concept image using any off-the-shelf VLM. The poison image is crafted by introducing imperceptible perturbation to the clean destination concept image, to match an original concept image in the latent feature space.
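The text-generation step can be sketched with any public captioning model. Below is a minimal Python sketch that uses BLIP image captioning from Hugging Face purely as an illustrative stand-in for "any off-the-shelf VLM"; the model choice and file name are assumptions, not the paper's exact setup.

# Minimal sketch of the caption step, assuming BLIP image captioning as a
# stand-in for "any off-the-shelf VLM" (the paper does not prescribe a model).
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def describe_destination_image(path: str) -> str:
    """Generate a congruent text description for a clean destination-concept image."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = captioner.generate(**inputs, max_new_tokens=60)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Hypothetical usage: caption = describe_destination_image("clean_healthy_meal.jpg")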

When trained on these poison samples, the VLM learns to associate the original concept features (carried by the poison image) with the destination concept texts, achieving the attacker's goal.
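The image-crafting step admits a compact sketch: starting from the clean destination concept image, run projected gradient descent on an L∞-bounded perturbation so that the vision encoder's feature of the poison image approaches that of an original concept image. The perturbation budget, step size, and iteration count below are illustrative assumptions, not the paper's exact hyperparameters.

# Sketch of poison-image crafting via feature matching (PyTorch). The vision
# encoder is treated as a callable that maps a batch of images to features;
# epsilon, step_size and num_steps are illustrative, not the paper's values.
import torch

def craft_poison_image(vision_encoder, x_dest, x_orig,
                       epsilon=8 / 255, step_size=1 / 255, num_steps=200):
    """Perturb x_dest (destination concept) so its feature matches x_orig (original concept)."""
    with torch.no_grad():
        target_feature = vision_encoder(x_orig)       # feature the poison image should mimic
    delta = torch.zeros_like(x_dest, requires_grad=True)
    for _ in range(num_steps):
        poison = (x_dest + delta).clamp(0, 1)
        loss = ((vision_encoder(poison) - target_feature) ** 2).sum()
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()    # signed-gradient step on the feature distance
            delta.clamp_(-epsilon, epsilon)           # keep the perturbation visually imperceptible
        delta.grad.zero_()
    return (x_dest + delta.detach()).clamp(0, 1)      # looks like x_dest, but matches x_orig in feature space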


Illustration of how Shadowcast crafts a poison sample with visually matching image and text descriptions.


Below is another example of a poison sample, where the original concept is "Junk Food" and the destination concept is "Healthy and Nutritious Food". The poison image looks like the clean destination concept image, and the text description is congruent with what the image shows.


Experiment

We consider the following four attack tasks, which exemplify practical risks of VLM poisoning ranging from misidentifying political figures to disseminating healthcare misinformation.

The red ones are Label Attacks, which trick VLMs into misidentifying class labels, such as confusing Donald Trump for Joe Biden. The green ones are Persuasion Attacks, which leverage VLMs' text generation capabilities to craft narratives, such as portraying junk food as health food, through persuasive and seemingly rational descriptions.


Attack tasks and their associated concepts.


We study both grey-box and black-box scenarios. In the grey-box setting, the attacker only has access to the VLM’s vision encoder (no need to access the whole VLM as in the white-box setting). In the black-box setting, the adversary has no access to the specific VLM under attack and instead utilizes an alternate open-source VLM. We evaluate the attack success rates under different poison ratios.
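Concretely, the only thing that changes between the two settings is which vision encoder is handed to the crafting routine sketched in the Method section. The snippet below is an illustrative wrapper around a CLIP vision tower (LLaVA-1.5 uses a CLIP ViT-L/14 encoder); the pooled-feature choice and the omission of CLIP preprocessing are simplifications for the sketch, not the paper's exact implementation.

# Illustrative: the grey-box attacker plugs the victim's own vision tower into
# the crafting routine; the black-box attacker substitutes the encoder of a
# different open-source VLM. CLIP normalization/preprocessing is omitted here.
import torch
from transformers import CLIPVisionModel

clip_vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
clip_vision.requires_grad_(False)

def greybox_encoder(pixel_values: torch.Tensor) -> torch.Tensor:
    # Use the pooled ViT feature as the latent representation to match.
    return clip_vision(pixel_values=pixel_values).pooler_output

# poison = craft_poison_image(greybox_encoder, x_dest, x_orig)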

Grey-box results

Shadowcast begins to demonstrate a significant impact (over 60% attack success rate) with a poison rate of under 1% (or 30 poison samples)!

Black-box results


(Architecture transferability) Attack success rate for LLaVA-1.5 when InstructBLIP (left) and MiniGPT-v2 (right) are used to craft poison images.


Shadowcast is still effective even across different VLM architectures!

Attack Robustness

What if the VLM uses image data augmentation (as a defense) during training? Will the poisoned VLM still exhibit the targeted behavior when different text prompts are used? Our evaluation shows that the attack remains effective in both cases: it survives standard image augmentation during training, and the poisoned model exhibits the targeted behavior across a variety of text prompts.
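For concreteness, the two checks can be set up roughly as follows; the augmentation pipeline and the prompt list are illustrative, not the paper's exact evaluation protocol.

# Illustrative setup for the two robustness checks: (1) apply standard image
# augmentations to all training images, poison samples included, then fine-tune
# as usual; (2) query the poisoned model with prompts unseen during poisoning.
import torchvision.transforms as T

train_augmentation = T.Compose([
    T.RandomResizedCrop(336, scale=(0.8, 1.0)),   # 336px matches LLaVA-1.5's input resolution
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

held_out_prompts = [
    "What is shown in this image?",
    "Describe this picture in detail.",
    "Can you tell me what this photo depicts?",
]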

Ethics and Disclosure

This study uncovers a pivotal vulnerability in the visual instruction tuning of large vision language models (VLMs), demonstrating how adversaries might exploit data poisoning to disseminate misinformation undetected. While the attack methodologies and objectives detailed in this research introduce new risks to VLMs, the concept of data poisoning is not new, having been a topic of focus in the security domain for over a decade. By bringing these findings to light, our intent is not to facilitate attacks but rather to sound an alarm in the VLM community. Our disclosure aims to elevate vigilance among VLM developers and users, advocate for stringent data examination practices, and catalyze the advancement of robust data cleaning and defensive strategies. In doing so, we believe that exposing these vulnerabilities is a crucial step towards fostering comprehensive studies in defense mechanisms and ensuring the secure deployment of VLMs in various applications.


BibTeX

@article{xu2024shadowcast,
  title={Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models},
  author={Xu, Yuancheng and Yao, Jiarui and Shu, Manli and Sun, Yanchao and Wu, Zichu and Yu, Ning and Goldstein, Tom and Huang, Furong},
  journal={arXiv preprint arXiv:2402.06659},
  year={2024}
}