The attacker’s goal is to manipulate the model into responding to original concept images with texts consistent with a destination concept, using
stealthy poison samples that can evade human visual inspection.
A poison sample consists of a poison image that looks like a clean image from the destination concept, and a congruent text description. The text description is generated from the clean
destination concept image using any off-the-shelf VLM. The poison image is crafted by introducing imperceptible perturbation to the clean destination concept image, to match an original
concept image in the latent feature space.
When training on these poison samples, the VLM learns to associate the the original concept feature (in the poison image) with the destination concept texts, achieving the attacker's goal.