Adversarial Attacks

Jonathan Hui
20 min read · Jul 20, 2020
“My Wife and My Mother-In-Law”

Is the drawing above a portrait of a young lady or an old lady? Creating such an optical illusion to fool humans takes special skill. Unfortunately, deep learning (DL) models are much easier to fool. In the figure below, the left column contains the original images, and the right column contains the modified images with carefully calculated noise added: right image = left image + middle image (the noise image is exaggerated 10x and value-shifted for visual demonstration). Even though the two columns look the same to humans, the left images are correctly classified by a DNN classifier while the right images are misclassified as “ostrich, Struthio camelus”. The deep network is fooled.


Worse, these seemingly harmless perturbations often trick the model into giving high-confidence predictions, such as 99.3% confidence that the panda picture on the right is a gibbon.


This kind of attack can be easily carried out in real situations. For example, with some carefully designed eyeglass frames, the wearers in the top row below fool the VGG-Face CNN into classifying their faces as the well-known persons in the second row.


The figure below is another example where we put a sticker close to a banana and cause the classifier to change the prediction from a banana to a toaster.


However, successful attacks should take into consideration that the subjects may be viewed from different viewpoints and lighting. Nevertheless, experiments show many attacks are pretty robust in real situations.

And this kind of attack can be dangerous. For example, researchers can make adversarial attacks look like harmless graffiti, yet these patterns can fool a classifier into misclassifying traffic signs. For the road sign below, the adversary identifies the points where changes have the maximum effect. Then it calculates the colors needed and prints them as patches like the ones below. This poses major issues for autonomous driving. In particular, the attack is successful from different distances and viewpoints.


Creating adversarial attacks

In the figure below, the artist carefully adds features to make it look like an old lady while the new additions will not negatively impact the look of the young lady too much. For example, the right eyebrow of the old lady (marked in red below) does not distort the ear of the young lady too much.

In DL’s adversarial attacks, changes are often made to individual pixels independently. The objective is to find the minimum pixel changes to the image x that

  • drop the probability pᵢ(x) the most for the ground truth label i, while
  • increasing pⱼ(x) the most for a wrong label j (or a specific target label j).

In DL training, we adjust weights to maximize the probability pᵢ(x) using backpropagation. Since the output is differentiable with respect to the input x as well as the weights, we can also use backpropagation to change x pixel-wise while keeping the weights fixed. But this time, we change x in the direction of the steepest gradient that lowers pᵢ(x). This gives us the biggest bang for the buck in misclassifying x with negligible visual changes.

Why adversarial attacks are so successful

Experiments show the attack is relatively easy to create with a pretty high success rate. In fact, if it is done properly, a majority of these adversarial samples will be misclassified. In addition, it works across a wide spectrum of DL architecture designs. Therefore, the weakness likely comes from issues inherent in a wide spectrum of DL architectures and training methods.

One speculation is that these models are complex, with highly non-linear decision boundaries. The overcapacity of these models makes room for overfitting, which is then easily exploited. Small changes, if you know how to make them, can cause dramatic changes in predictions. However, Ian Goodfellow argued that the primary cause is instead the linear nature of those classifiers. This conflicting view changes the landscape of countermeasures significantly. For example, if overfitting were the primary cause, we could apply regularization. But traditional regularization methods, like dropout, are ineffective at fending off adversarial attacks.

First, it is observed that many adversarial samples created with a particular trained model still work on models with different designs and/or trained with different datasets. In fact, these models often assign the same wrong labels to the adversarial samples. If overfitting were the primary cause, the adversarial samples should be dataset- and model-specific. Such transferability discredits the overfitting explanation.

Second, the paper claims that the relationship between the input x and the logit output (the score value before the softmax layer) is quite piecewise-linear for the trained model across different architectures and training datasets.

Let’s create a mathematical model for the linear behavior before we revisit it later. To create a perturbation, we add η to the image x. Since the logit output is linear in x with weights w, the perturbation simply adds a second term wᵀη to the output: wᵀ(x + η) = wᵀx + wᵀη.

Modified from source

So to increase or decrease the score effectively with the smallest change η, the perturbation should point in a direction similar to w (or opposite to it). For example, to increase the score the most, η should be aligned with w so that the dot product wᵀη is the largest. Intuitively, w acts as an amplifier for η if they align well. If η is high-dimensional, these amplified changes (Σwᵢηᵢ) add up. In adversarial examples, the total amplified effect completely changes the predictions, often with high confidence, while each pixel-wise change ηᵢ is too small to notice visually. Small things add up quickly!
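To make the “small things add up” argument concrete, here is a minimal sketch in plain Python. The weights are made up for illustration (every |wᵢ| is 0.5), and the helper names `sign` and `score_shift` are not from the paper. It perturbs each input dimension by only ε = 0.01 in the direction of sign(w) and reports how much the linear score wᵀx moves as the dimension grows.

```python
def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def score_shift(w, eps):
    """Change in the linear score w.x when x is perturbed by eta = eps * sign(w)."""
    eta = [eps * sign(wi) for wi in w]
    return sum(wi * ei for wi, ei in zip(w, eta))


# Hypothetical weights: magnitude 0.5, alternating in sign.
for n in (10, 1000, 100000):
    w = [0.5 if i % 2 == 0 else -0.5 for i in range(n)]
    print(n, score_shift(w, eps=0.01))
```

The per-dimension change never exceeds 0.01, yet the shift equals ε times the sum of |wᵢ|, so it grows linearly with the number of dimensions.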

If the models were non-linear and overfitted, we would expect pockets of mislabeling scattered throughout the input space of x. Instead, in the MNIST experiments, the mislabeled samples lie close to the direction of w. This is an early sign that overfitting may not be the primary cause.

Source (Pockets of adversarial samples vs. along a plane)

Fast gradient sign method (FGSM)

To constrain the magnitude of the change, we restrict the max norm of η, the maximum absolute value of ηᵢ, to be no larger than a threshold ε (the magnitude of perturbation allowed).

In the fast gradient sign method (FGSM), η is proposed to be:

η = ε · sign(∇ₓJ(θ, x, y))

where θ denotes the model parameters, J is the training cost, and y is the ground truth label. This formulation ensures that the max norm of the perturbation is within ε and that it roughly points in the direction of the steepest ascent in cost. Even though it is not the optimal direction, it is popular because it is cheap to compute and effective.
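As a rough illustration of FGSM, the sketch below applies η = ε · sign(∇ₓJ) to a tiny logistic-regression classifier instead of a deep network, so the input gradient has the closed form (p − y)·w. The weights, the sample and the function names are hypothetical, chosen only to keep the example dependency-free.

```python
import math


def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def loss_grad_x(w, b, x, y):
    """Gradient of the cross-entropy cost J with respect to the input x."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """One-step attack: x_adv = x + eps * sign(grad_x J(theta, x, y))."""
    g = loss_grad_x(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]


w, b = [2.0, -3.0, 1.0, -1.5], 0.0    # hypothetical trained weights
x, y = [0.3, 0.1, 0.2, 0.1], 1         # a sample whose true label is 1
x_adv = fgsm(w, b, x, y, eps=0.1)
print("clean p(y=1):", predict(w, b, x))
print("adversarial p(y=1):", predict(w, b, x_adv))
```

Each input value moves by at most ε = 0.1, but the predicted probability of the true class drops below 0.5, so the label flips.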

Targeted fast gradient sign method (T-FGSM)

For the sample to be misclassified as a specific label, rather than just mislabeled, we can calculate J above w.r.t. the target label instead. Then we subtract the resulting perturbation from x rather than adding it. In short, we change x in the direction of steepest descent of the cost for the target label. This method is called the targeted fast gradient sign method (T-FGSM).

Basic iterative method (BIM)/Projected Gradient Descent (PGD)

An obvious improvement to FGSM is to apply it over multiple iterations instead of only one. To keep each change small, every iteration of the basic iterative method/projected gradient descent (BIM/PGD) uses a smaller step α (with αT = ε, where T is the number of iterations). Clipping is applied to constrain the new image to be within the ε max norm of the original image.


Conceptually, in each iteration, BIM finds the “most adversarial” example, the one with the largest change in cost, within a fixed distance from Xₙ.
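The iteration can be sketched as follows, again on a hypothetical logistic-regression model rather than a DNN. The step size α = ε / steps is an assumption that follows the αT = ε convention, and the usual extra clip to the valid pixel range is left out for brevity.

```python
import math


def grad_x(w, b, x, y):
    """Input gradient of the logistic cross-entropy cost: (p - y) * w."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]


def clip(v, lo, hi):
    return max(lo, min(hi, v))


def bim(w, b, x, y, eps, steps):
    """Iterated FGSM with step alpha = eps / steps and clipping to the eps-ball."""
    alpha = eps / steps
    x_adv = list(x)
    for _ in range(steps):
        g = grad_x(w, b, x_adv, y)
        # Small signed-gradient step from the current adversarial image.
        x_adv = [xa + alpha * ((gi > 0) - (gi < 0)) for xa, gi in zip(x_adv, g)]
        # Project back so that |x_adv_i - x_i| <= eps for every pixel.
        x_adv = [clip(xa, xi - eps, xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


w, b = [2.0, -3.0, 1.0, -1.5], 0.0    # hypothetical trained weights
x, y = [0.3, 0.1, 0.2, 0.1], 1
x_adv = bim(w, b, x, y, eps=0.1, steps=10)
print(x_adv)
```

For this linear toy model the gradient sign never changes, so the result coincides with one-step FGSM; the extra iterations only pay off when the gradient changes along the path, as it does in a non-linear network.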

This kind of attack is called a white-box attack because we have detailed knowledge of the model design and parameters, which lets us compute the gradient for the attack. (Later, we will discuss attacks in which we do not know the inner workings.) As the equations show, the calculation is simple, and empirical results show that these adversarial samples are extremely effective in fooling DNNs. For example, with ε = 0.1, a convolutional maxout network on a preprocessed CIFAR-10 dataset assigns an average probability of 96.6% to the incorrect labels for the adversarial samples.

Model Linearity

The math is based on the assumption that the logit output is linearly related to x. That is, when we start with a specific example (say the number 4 in MNIST), we should see the logit values follow a piecewise linear pattern as we increase or decrease the value of ε. The left diagram below plots the logit values for all the classes as ε is changed. When ε=0 (or close to 0), the red line (class 4) has the highest value and the model predicts the input as “4”.


But as the change increases, different classes will be predicted. The yellow boxes in the right figure above mark ε=0, or close to 0. There ε remains small and the predictions are still correct. But as ε moves further away, the predictions become incorrect even though the input still resembles “4” visually. When ε gets larger still, the visual input turns into garbage. The most important finding here is that the model (at least for MNIST) behaves in a quite piecewise-linear way.

The paper carries out another MNIST experiment comparing the predictions of a maxout model with those of a shallow softmax network and a shallow RBF network. The objective is to show whether the logit output of the more complex MNIST model behaves more linearly or non-linearly. The results show that the maxout model’s predictions are closer to those of the softmax network (a linear model) than to those of the RBF network (a non-linear model).

The intuition is that activation functions and weight initializations are purposely designed to encourage piecewise linearity. At least in early training, these initializations keep models training and learning in the non-saturated linear region. The catch is that linear-like models are easier to train but unfortunately more susceptible to adversarial attacks.

Note: This section presents the linearity claim as is. Please note that this is not a closed research topic. Some researchers, like Carlini, argue that the MNIST dataset is too simple to support such a general conclusion. More complex datasets like ImageNet may show different experimental results.

The momentum iterative fast gradient sign method (MI-FGSM)

In many optimization methods in DL, momentum is applied for better stability and model convergence in training. In MI-FGSM, a very similar concept is applied to compute the gradient used for the perturbation. Instead of a vanilla gradient calculation in FGSM, equation (6) below calculates an adjusted gradient with a similar momentum concept.
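A minimal sketch of the momentum update, as we understand it from the MI-FGSM paper: the accumulated gradient is g ← μ·g + ∇ₓJ / ‖∇ₓJ‖₁, and the image then moves by α·sign(g). The callable `grad_fn`, the decay factor default `mu=1.0`, and the step α = ε / steps are assumptions made for illustration.

```python
def mi_fgsm(grad_fn, x, eps, steps, mu=1.0):
    """Momentum iterative FGSM.

    grad_fn(x) must return the gradient of the cost w.r.t. the input x
    (a hypothetical callable standing in for backpropagation).
    """
    alpha = eps / steps
    g = [0.0] * len(x)            # accumulated (momentum) gradient
    x_adv = list(x)
    for _ in range(steps):
        grad = grad_fn(x_adv)
        l1 = sum(abs(v) for v in grad) or 1.0   # guard against a zero gradient
        # Normalize the new gradient by its L1 norm before accumulating it.
        g = [mu * gi + v / l1 for gi, v in zip(g, grad)]
        x_adv = [xa + alpha * ((gi > 0) - (gi < 0)) for xa, gi in zip(x_adv, g)]
    return x_adv


# Toy usage with a constant, made-up gradient.
x_adv = mi_fgsm(lambda x: [1.0, -2.0, 0.0], [0.0, 0.0, 0.0], eps=0.3, steps=3)
print(x_adv)
```

Because the steps sum to ε, the perturbation stays inside the ε max-norm ball without explicit clipping.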


Jacobian-based Saliency Map Attack (JSMA)

This paper introduces another attack called JSMA. It is a greedy algorithm that runs for many iterations, each of which changes one pixel to push the sample toward the targeted misclassification. It computes ∇Z(x) (where Z is the logit score for the target label) to build a saliency map. Next, it picks and changes the pixel that yields the largest increase (the largest gradient). The iterations continue until either a set threshold of modified pixels is reached or the data is misclassified.
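The greedy loop described above can be sketched as follows. This is a simplification, not the paper’s full saliency map: it ranks pixels only by the gradient of the target logit, and the callables `grad_fn` and `is_adversarial`, the step `theta` and the pixel budget `max_pixels` are hypothetical names for this illustration.

```python
def jsma_like(grad_fn, x, is_adversarial, theta=1.0, max_pixels=10):
    """Greedily raise one pixel per iteration, picking the largest target-logit gradient."""
    x = list(x)
    changed = set()
    while len(changed) < max_pixels and not is_adversarial(x):
        g = grad_fn(x)
        candidates = [i for i in range(len(x)) if i not in changed]
        if not candidates:
            break
        i = max(candidates, key=lambda k: g[k])   # most salient untouched pixel
        x[i] = min(1.0, x[i] + theta)             # keep the pixel in [0, 1]
        changed.add(i)
    return x, changed


# Toy target logit Z(x) = w.x, whose gradient is simply w.
w = [0.1, 2.0, -1.0, 0.5]
x_adv, changed = jsma_like(
    grad_fn=lambda x: w,
    x=[0.0, 0.0, 0.0, 0.0],
    is_adversarial=lambda x: sum(wi * xi for wi, xi in zip(w, x)) > 2.2,
)
print(x_adv, sorted(changed))
```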


DeepFool adopts the same iterative approach to gradually move the image across the decision boundary with minimum change.


In each iteration, it approximates the classifier with a linear model, using the first-order Taylor expansion, to find the next image x that gets closer to the decision boundary with the smallest change.
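For a binary classifier that is exactly linear, f(x) = w·x + b, the smallest L2 step to the decision boundary has the closed form r = −f(x)·w / ‖w‖². The sketch below takes that single step with a small overshoot so the sample actually crosses the boundary. Applying it repeatedly to a linearized non-linear model is the idea behind DeepFool; the weights and the overshoot value here are illustrative assumptions.

```python
def score(w, b, x):
    """Linear decision function f(x) = w.x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def deepfool_linear_step(w, b, x, overshoot=0.02):
    """Move x just past the hyperplane f(x) = 0 along the shortest (L2) path."""
    f = score(w, b, x)
    norm_sq = sum(wi * wi for wi in w)
    r = [-(1.0 + overshoot) * f * wi / norm_sq for wi in w]
    return [xi + ri for xi, ri in zip(x, r)]


w, b = [3.0, 4.0], -1.0    # hypothetical linear classifier
x = [1.0, 1.0]
x_adv = deepfool_linear_step(w, b, x)
print(score(w, b, x), score(w, b, x_adv))
```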


Carlini and Wagner Attacks

The adversarial attack problem is often defined as minimizing the pixel changes plus a loss function. This loss function measures how well the adversarial sample x’ is misclassified as the incorrect target label l:


where c is a parameter found by binary search to obtain the minimum difference that still creates an adversarial sample. In C&W attacks, the objective is defined as:


The loss function here is refined into a margin loss, i.e. the cost stops decreasing once the misclassification is already doing very well. As the target class becomes more likely than the second most likely class, the cost function decreases. But there is a floor at −𝜅 below which the cost does not decrease, acknowledging that any further improvement can be neglected. When 𝜅 = 0, the adversarial examples are low-confidence: they are only just classified as the target class. As 𝜅 increases, the model classifies the adversarial example with higher confidence. These are called high-confidence adversarial examples. The paper claims that C&W attacks are stronger than attacks like FGSM and JSMA.
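The margin term can be written in a few lines. This sketch assumes the commonly cited form max(max over i≠t of Zᵢ − Zₜ, −𝜅) on the logits Z; the function name and the example logits are made up.

```python
def cw_margin_loss(logits, target, kappa=0.0):
    """Margin loss: best non-target logit minus target logit, floored at -kappa."""
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)


print(cw_margin_loss([2.0, 5.0, 1.0], target=0))              # target not yet winning
print(cw_margin_loss([9.0, 5.0, 1.0], target=0, kappa=0.0))   # winning, floor at 0
print(cw_margin_loss([9.0, 5.0, 1.0], target=0, kappa=2.0))   # floor at -kappa
```

With 𝜅 = 0 the optimizer has no incentive to go beyond just winning; a larger 𝜅 keeps rewarding a wider margin until the target logit leads by 𝜅.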

Universal adversarial attack

The universal adversarial attack finds a single perturbation vector δ that can be added to each sample and cause most of them to be misclassified. (This has been demonstrated by experiments with a high success rate.)

In each iteration, for a real sample that does not yet fool the model, it formulates an optimization problem that finds the minimum additional perturbation to δ needed to fool the model. The optimization is then solved using methods like L-BFGS, which is itself one of the attack methods. Then δ is updated with that perturbation, and the process continues until all samples are processed. Eventually, the final perturbation enables most samples to fool the network.

So how can we defend against these attacks?

Adversarial training

Data augmentation is a common technique in improving model accuracy. The input image space is often large but the real data is sparse. We can imagine far more images than what real life has. The blue “+” and “-” below are the data available for the training. As shown in Figure C below, there are pockets of data points (artificially created data points) that can be misclassified if the model’s decision boundary is linear.


Augmenting the training dataset with real images will not make it dense enough to defend against adversarial attacks. Indeed, most artificially created data (like random noise) will likely be mislabeled by DL models. Many researchers believe that the sparsity of real data is a major obstacle for any adversarial defense to be effective.

Since the adversarial samples are simple and can be easily calculated, we can include them in the training to counter the attack. But instead of training on them separately, we can combine the cost functions of the original and the adversarial samples and train on them together:

J̃(θ, x, y) = α·J(θ, x, y) + (1 − α)·J(θ, x + ε·sign(∇ₓJ(θ, x, y)), y)

The second term above penalizes the model when the adversarial sample is not predicted as the label y of the original sample.
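A small sketch of the combined cost, α·J(x) + (1 − α)·J(x_adv), where x_adv is the FGSM sample. It uses a hypothetical logistic-regression model so that everything fits in plain Python; the weights, sample and default values of `eps` and `alpha` are illustrative only.

```python
import math


def prob(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Cross-entropy cost J(theta, x, y) for a binary label y."""
    p = prob(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def fgsm(w, b, x, y, eps):
    """x + eps * sign(grad_x J); for this model grad_x J = (p - y) * w."""
    p = prob(w, b, x)
    g = [(p - y) * wi for wi in w]
    return [xi + eps * ((gi > 0) - (gi < 0)) for xi, gi in zip(x, g)]


def adversarial_loss(w, b, x, y, eps=0.1, alpha=0.5):
    """Weighted sum of the clean cost and the cost on the FGSM sample."""
    x_adv = fgsm(w, b, x, y, eps)
    return alpha * loss(w, b, x, y) + (1.0 - alpha) * loss(w, b, x_adv, y)


w, b = [2.0, -3.0, 1.0, -1.5], 0.0    # hypothetical weights
x, y = [0.3, 0.1, 0.2, 0.1], 1
print(loss(w, b, x, y), adversarial_loss(w, b, x, y))
```

In real training this combined cost would be minimized over the weights, with x_adv recomputed from the current weights at every step.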

Without going into detail, the Goodfellow paper demonstrates that this adversarial training provides a form of regularization similar to techniques like L1 regularization, and that it performs better than dropout. This strategy also modifies the weights significantly. When the weights are visualized, they are more clustered around similar values and their purpose is easier to interpret.

However, the trained model is still vulnerable to iterative adversarial attacks like PGD. To address that, we can switch from FGSM to PGD when generating the adversarial samples. Experiments show that this type of adversarial training produces a more universally robust model.

Defensive distillation

Nothing is 100% correct. Should the number in the red box below be 6 or 8?

However, when we train a DNN, we use hard labels, i.e. we assign 100% probability to the “ground truth” and 0% to the others. In reality, information carries uncertainty. After the training, the output of the DNN model is a probability distribution (say, 0.1, 0.02, …, 0.05). This distribution captures the uncertainty better and models many real problems better.

In defensive distillation, the output of this network, the soft labels, is used to train a second but smaller network. If it is trained properly, the second network can achieve the same accuracy even though it has a smaller capacity (supported by empirical results), and it generalizes better. The second training creates a smoother surface in the directions that adversarial attacks exploit. Therefore, it enhances resilience to perturbations.


Gradient masking

Some adversarial defenses belong to a category called gradient masking. They construct a model (like a nearest neighbor classifier) that does not have a useful gradient. Or, at least around the real data, the gradient information is too poor in quality to guide the attack in any direction. In some defenses, a non-smooth or non-differentiable preprocessor is applied to the input first so that the gradient information is not helpful for the attack.


Some researchers believe that adversarial training and defensive distillation actually perform some form of gradient masking that makes it more difficult to exploit the gradient. Above is a more extreme model: the gradient is zero almost everywhere and therefore provides no information on where to search to increase or decrease the cost.

Black box attacks

In many situations, the design or the parameters of the model are not known. An attack carried out without access to the inner workings is called a black-box attack. We can query the model for the label of an input but nothing more. So how can these attacks be done in the absence of helpful gradients?

Let’s recall that adversarial samples generalize and transfer well. This observation has a profound implication. Many existing trained models generalize decently across architectures and training sets. They behave similarly. They often make the same predictions and, more importantly, exactly the same mistakes. This implies that those models exploit similar data patterns or features in making their predictions. This commonality makes them vulnerable to a common attack. We can attack similar but easier models instead of the original one. So even if the model is unknown or has useless gradient information, we can use adversarial samples from other models to attack the same problem domain.

Alternatively, we can build a new surrogate model using queried or known labels. In a nutshell, because of the observed generalization, we can train a DNN model that resembles the behavior of others. Then we use it as a substitute model for attacks. So even if the original model has strong gradient masking behavior, we can still train a DNN to approximate its behavior. Similar to defensive distillation, the trained model can have a smoother surface in the directions that adversarial attacks exploit. Therefore, the substitute model can be used to compute and counteract the adversarial attacks.


On the defense side, because of the transferability, we can add adversarial samples generated from other models to a model’s own training dataset. This helps it fend off black-box attacks better.

Adversarial Example Detector

While we want label predictions to be as accurate as possible, an alternative approach to the problem is to detect whether an image has been modified or created by an attacker. We can treat adversarial samples as one additional class and add these samples to the training dataset. So instead of N possible classes, the classifier predicts N+1 classes, where the last one is for the adversarial samples. Another approach is to build a binary classifier that marks whether an input is an adversarial sample or not.


The adversarial methods add noise-like data to real images to fool the classifier. One defense strategy, known as denoising, is to remove the added manipulation and restore the adversarial samples closer to the originals. This method adopts the autoencoder architecture, which first encodes the image’s features and later decodes them back. With the adversarial samples as input, the autoencoder is trained to minimize the reconstruction loss between its output and the original non-corrupted data. Therefore, the reconstructed image will have the manipulated data removed (in theory). Once this autoencoder is trained, we use it to denoise images before feeding them to classifiers.


Ensemble methods

Just as elsewhere in DL, ensemble methods can be applied to adversarial defenses. For denoising, they can be applied at two levels. First, we build multiple denoisers, each built differently: with different techniques and models for creating the adversarial samples, different autoencoder structures, different hyperparameters, and so on. The key point is that each denoiser may be better at cleaning up different aspects and types of adversarial attacks. Second, we build multiple classifiers/verifiers, again with different training conditions and designs. Denoised images are then fed into each classifier/verifier, and the final prediction is made by some voting method (like majority voting or soft voting). The intuition is similar to that of other ensemble methods. Simpler models may not be perfect, but they are easier to train and do not make the same mistakes, so the collective judgment improves accuracy. However, this strategy is based on the assumption that we can train autoencoders and classifiers that differ enough from one another. That is still disputable, since the way many trained models generalize alike makes this approach less promising than we may expect.

High-level representation guided denoiser (HGD)

Other denoisers do not compute the cost from the reconstruction error at the raw pixel level. Instead, they minimize the difference between features taken one level (f₋₁) or two levels (f₋₂) before the softmax layer of the classifier, for the denoised image (x hat) and the original image x. The insight is that we want the extracted features of both images to be similar, rather than the low-level pixel values. We want the images to be perceived as similar.

Source (The denoiser is applied to convert x* to generate the denoised image x hat.)


Many attacks exploit knowledge of the model to its full extent so that only minimal pixel changes are needed. If heuristics such as randomization are introduced, those conditions change, which can hurt such “finely-tuned” attacks. Real data should be affected to a lesser extent, as the model is (hopefully) trained to generalize well on real data. In one approach, random nodes in the DNN model are dropped during inference.

In another defense, the input image is resized to a random size and padded with zeros around it.
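A toy version of this random resize-and-pad step might look like the sketch below. It uses nested lists and nearest-neighbour resizing to stay dependency-free; the function name, the square output canvas and the way the random size is drawn are assumptions for illustration, not the exact procedure of any particular paper.

```python
import random


def random_resize_pad(img, out_size, rng):
    """Resize img to a random square size, then zero-pad it to out_size x out_size.

    Assumes the image is no larger than out_size in either dimension.
    """
    h, w = len(img), len(img[0])
    new = rng.randint(h, out_size)    # random side length in [h, out_size]
    # Nearest-neighbour resize to new x new.
    resized = [[img[r * h // new][c * w // new] for c in range(new)]
               for r in range(new)]
    top = rng.randint(0, out_size - new)     # random position inside the canvas
    left = rng.randint(0, out_size - new)
    out = [[0] * out_size for _ in range(out_size)]
    for r in range(new):
        out[top + r][left:left + new] = resized[r]
    return out


img = [[1, 2], [3, 4]]
out = random_resize_pad(img, 5, random.Random(0))
for row in out:
    print(row)
```

At inference time a fresh random size and offset would be drawn for every query, so an attacker cannot tune the perturbation to one fixed preprocessing.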


Another approach adds noise before the convolution layer to perturb the input of the CNN layer.


So we run many inferences under that randomness to see whether they still give consistent predictions. Mathematically, for real images, we expect the uncertainty below to be close to zero after making L inferences. So checking the prediction consistency is one possible candidate for adversarial detection.


Penalize Layers’ Lipschitz Constant

Other defenses recommend training a more robust model. One possibility is to penalize the model at each hidden layer when a Lipschitz constraint is violated. The high-level idea is to constrain how much each layer’s output can change. This makes the gradient less steep at each layer and, hopefully, forces a larger change to the input before an adversarial effect appears.




We can use the GAN concept to build a better classifier (discriminator) for detecting adversarial samples. The model computes the gradient of the classifier output F w.r.t. the input x and uses it as an input to the generator, which creates the adversarial perturbation. The classifier is trained alternately to discriminate between the real and the perturbed input. In addition, the loss function is backpropagated to the generator so that it learns how to create better attacks. The classifier and the generator are trained in step to improve each other, and eventually the classifier gets better and better at discriminating adversarial samples.

Effectiveness of the Defenses

Security problems are always a tug-of-war. Many defenses, including those discussed, can be evaded by adversarial attacks that target the specific defense. Many defenses that once looked promising empirically have since been shown to be vulnerable to counterattacks. Here is a slide from Ian Goodfellow in 2017 on what defenses have failed.


But technologies keep changing and improving. What once worked may no longer work, and what once failed can be improved. So you are warned.

The fundamental question is whether we have discovered the intrinsic properties that adversarial samples must possess. Without them, the defense can fail. Many detectors look for traces that the attack algorithms leave behind. However, many of these traces are now known to be unnecessary byproducts. For example, the digits in MNIST are centered, and in natural images most of the pixels near the border are zero. Because of these border pixels, the original images have small values in the last PCA components. Many adversarial attacks change image pixels based on gradients regardless of position, so they change the border pixels to be non-zero. This contributes to the success of PCA-based detectors in detecting MNIST adversarial samples. As a countermeasure, the attacker can add constraints so that the pixel changes also prioritize preserving the PCA-component values.

Another key observation is that defenses that are effective for MNIST may not be useful for complex datasets, like ImageNet. To evaluate a defense, we cannot use MNIST alone.

As a direct quote from a paper on reviewing 10 defense methods:

Our attacks work by defining a special attacker-loss function that captures the requirement that the adversarial examples must fool the defense, and optimizing for this loss function. We discover that the specific loss function chosen is critical to effectively defeating the defense.

For example, the term l₂ below is an additional constraint added so it penalizes traces that a known detector wants to catch.


But this is a double-edged sword: the defender can also modify its cost function to fend off an attack.

Other applications

Reinforcement learning

While this article focuses on adversarial attacks on deep learning, reinforcement learning is shown to be vulnerable to such attacks also.


Researchers have also started to apply these attacks to NLP. For example, by swapping a couple of characters, a sentiment analyzer may make completely different predictions.


Credits & references

Explaining and Harnessing Adversarial Examples

Intriguing properties of neural networks

Attacking Machine Learning with Adversarial Examples

Robust Physical-World Attacks on Deep Learning Visual Classification

Ian Goodfellow video: Adversarial Examples and Adversarial Training

Adversarial Machine Learning at Scale

Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods

Adversarial examples in the physical world

Practical Black-Box Attacks against Machine Learning


Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks

Towards the Science of Security and Privacy in Machine Learning

Denoising and Verification Cross-Layer Ensemble Against Black-box Adversarial Attacks

Adversarial Attacks and Defenses in Deep Learning

Boosting Adversarial Attacks with Momentum

Ensemble Adversarial Training: Attacks and Defenses

Defense against Adversarial Attacks Using High-Level Representation Guided Denoiser

Adversarial Attacks on Neural Network Policies

Adversarial Attacks and Defenses in Images, Graphs and Text: A Review

Towards Evaluating the Robustness of Neural Networks

DeepFool: a simple and accurate method to fool deep neural networks