Assessing Neural Network Robustness via Adversarial Pivotal Tuning

1University of Copenhagen, 2KTH Stockholm

APT uses the full capacity of a pretrained generator to produce semantic adversarial manipulations


Visualizations

Overview of the Generated Manipulations


Row 1 shows the input images. Row 2 shows the images resulting from our manipulations. Rows 3 and 4 show the results of Dual manifold adversarial robustness: row 3 uses pixel-space adversarial manipulations applied to StyleGAN-XL's reconstructions, and row 4 uses latent-space manipulations applied via StyleGAN-XL. Our method manipulates images in a non-trivial but class-preserving manner, using the full capacity of a pretrained StyleGAN generator. For example, it removes the eye of the mantis (second column), changes the type of race car (third column), changes the color of the crab tail (fifth column), removes the text on a spaceship (seventh column), and removes some of the ropes (eighth column). All of these are class-preserving examples that fool a pretrained PRIME-ResNet50 classifier. In contrast, Dual manifold adversarial robustness either generates noisy, less realistic images (row 3) or images that differ significantly in semantics and do not preserve the input class (row 4).

Abstract

The ability to assess the robustness of image classifiers to a diverse set of manipulations is essential to their deployment in the real world. Recently, semantic manipulations of real images have been considered for this purpose, as they may not arise under standard adversarial settings. However, such semantic manipulations are often limited to style, color, or attribute changes. While expressive, these manipulations do not exploit the full capacity of a pretrained generator to effect adversarial image manipulations. In this work, we aim to leverage the full capacity of a pretrained image generator to generate highly detailed, diverse, and photorealistic image manipulations. Inspired by recent GAN-based image inversion methods, we propose a method called Adversarial Pivotal Tuning (APT). APT first finds a pivot latent-space input to a pretrained generator that best reconstructs an input image. It then adjusts the weights of the generator to create small but semantic manipulations that fool a pretrained classifier. Crucially, APT changes both the input and the weights of the pretrained generator while preserving its expressive latent editing capability, thus allowing the use of its full capacity in creating semantic adversarial manipulations. We demonstrate that APT generates a variety of semantic image manipulations which preserve the input image class but fool a variety of pretrained classifiers. We further demonstrate that classifiers trained to be robust on other robustness benchmarks are not robust to our generated manipulations, and we propose an approach to improve robustness towards them.

TL;DR: We propose a framework for generating photorealistic images that fool a classifier using automatic semantic manipulations.


The Adversarial Pivotal Tuning (APT) framework


In the first step, we optimize a style code w_p using standard latent optimization with loss L_o, while keeping the generator G frozen. The loss is computed between the ground-truth image x_gt and the generated image x_gen. In the second step, we freeze w_p and finetune G (shown in red) using three objectives: a reconstruction objective L_rec; the Projected GAN objective L_PG, computed using the discriminator D; and our fooling objective L_CE, computed using the classifier C. An asterisk (∗) indicates a frozen component.
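The sketch below illustrates how these two steps could look in PyTorch. It is a minimal, illustrative outline, not the paper's implementation: G (pretrained StyleGAN-XL generator), D (Projected GAN discriminator), C (pretrained classifier), and lpips_fn (a perceptual distance such as LPIPS) are assumed handles, and all step counts and loss weights are placeholder values.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the two APT steps. `G`, `D`, `C` and `lpips_fn` are
# assumed handles (see lead-in); weights and step counts are illustrative.

def latent_optimization(G, x_gt, w_init, lpips_fn, steps=500, lr=0.01):
    """Step 1: find a pivot style code w_p reconstructing x_gt, with G frozen."""
    w = w_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        x_gen = G(w)
        loss = F.mse_loss(x_gen, x_gt) + lpips_fn(x_gen, x_gt)  # L_o
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()  # the frozen pivot w_p

def pivotal_tuning(G, D, C, w_p, x_gt, y, lpips_fn, steps=300, lr=3e-4,
                   lam_rec=1.0, lam_pg=0.1, lam_ce=0.5):
    """Step 2: freeze w_p and finetune G's weights with L_rec, L_PG and L_CE."""
    opt = torch.optim.Adam(G.parameters(), lr=lr)
    for _ in range(steps):
        x_gen = G(w_p)
        l_rec = F.mse_loss(x_gen, x_gt) + lpips_fn(x_gen, x_gt)  # stay close to the input
        l_pg = F.softplus(-D(x_gen)).mean()                      # realism via the discriminator
        l_ce = -F.cross_entropy(C(x_gen), y)                     # push C away from the true label
        loss = lam_rec * l_rec + lam_pg * l_pg + lam_ce * l_ce
        opt.zero_grad()
        loss.backward()
        opt.step()
    return G
```

Note that the fooling objective enters with a negative sign: minimizing the total loss maximizes the classifier's cross-entropy on the true label, while L_rec and L_PG keep the manipulation small and photorealistic.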




Manipulations using different classifiers


The top row shows input images. The middle row shows APT manipulations for a ResNet-50 classifier, and the bottom row shows APT manipulations for a FAN-ViT classifier. Columns 1–4 and 7 illustrate similar manipulations for both classifiers, columns 5–6 show texture and spatial manipulations, and the last column showcases a fooling image without a clear APT manipulation.




Transferability of APT generated samples


For the ImageNet-1k validation set, we consider samples generated to fool a PRIME-ResNet50 (PRIME) and a FAN-ViT (FAN) pretrained classifier. We then test the accuracy (Acc) and the mean softmax probability of the labelled class (Conf) on those samples. The left column indicates the classifier on which we tested the accuracy of the real or generated samples. ∗ indicates the accuracy and confidence of samples generated and tested using the same classifier.
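For clarity, the two reported metrics could be computed as in the sketch below. This is an assumed implementation: `loader` is any iterable of (image, label) batches, e.g. APT samples generated against the other classifier.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def acc_and_conf(classifier, loader, device="cuda"):
    """Accuracy (Acc) and mean softmax probability of the labelled class
    (Conf) over (image, label) pairs. Sketch under assumed data handling."""
    classifier.eval()
    correct, conf_sum, n = 0, 0.0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        probs = F.softmax(classifier(x), dim=1)
        correct += (probs.argmax(dim=1) == y).sum().item()            # Acc numerator
        conf_sum += probs.gather(1, y.unsqueeze(1)).sum().item()      # Conf numerator
        n += y.numel()
    return correct / n, conf_sum / n
```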




Average accuracy and confidence on APT samples using PRIME-ResNet50 before and after fine-tuning.


We investigate the effect of fine-tuning a PRIME-ResNet50 model on our generated fooling images. We find that accuracy on the fooling images increases after fine-tuning.
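A minimal sketch of such a fine-tuning loop is shown below, assuming the APT fooling images are mixed into ordinary supervised batches alongside real ImageNet data. The mixing strategy, optimizer, and hyperparameters here are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F
from itertools import cycle

def finetune_on_apt(model, real_loader, apt_loader, epochs=1, lr=1e-4):
    """Fine-tune a classifier on a mix of real batches and APT fooling
    images, so robustness to APT manipulations improves without
    forgetting the original data. Hyperparameters are placeholders."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        # cycle() lets the (smaller) APT set be reused within an epoch
        for (xr, yr), (xa, ya) in zip(real_loader, cycle(apt_loader)):
            x = torch.cat([xr, xa])
            y = torch.cat([yr, ya])
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```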




Acknowledgement

This research was supported by the Pioneer Centre for AI, DNRF grant number P1.



BibTeX

@article{christensen2022apt,
    author  = {Christensen, Peter Ebert and Snæbjarnarson, Vésteinn and Dittadi, Andrea and Belongie, Serge and Benaim, Sagie},
    title   = {Assessing Neural Network Robustness via Adversarial Pivotal Tuning},
    journal = {arXiv preprint},
    year    = {2022},
}