Introduction
Machine learning has made inroads into many industries, such as finance, healthcare, retail, transportation and autonomous driving. Machine learning gives computers the capability to learn without being explicitly programmed, which allows them to make accurate predictions based on patterns in the data. The machine learning process involves feeding data to a model (algorithm); the model identifies patterns in the data and makes predictions. During training, the model is fed training data, makes predictions on it, and is then tweaked until we get the desired accuracy. New data is then fed into the model to test that accuracy, and the model is retrained until it gives the desired outcome.
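As a minimal sketch of this train/evaluate/retrain loop (the dataset and classifier below are toy choices used purely for illustration):

```python
# Minimal sketch of the train/evaluate loop described above (illustrative only).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)                       # toy data standing in for "training data"
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)                  # the "model" (algorithm)
model.fit(X_train, y_train)                                # training: fit parameters to the data

print("held-out accuracy:", model.score(X_test, y_test))   # evaluate; tweak and retrain if too low
```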
An adversarial machine learning attack is a technique in which one tries to fool deep learning models with false or deceptive data, with the goal of causing the model to make inaccurate predictions. The objective of the adversary is to cause the model to malfunction.
The success of machine learning is attributed to big datasets being fed to classifiers (models) to make predictions. You train a classifier by minimizing a function that measures the error it makes on this data: by adjusting the classifier's parameters, you minimize the error of the predictions made on the training data.
Adversarial attacks exploit the same underlying learning mechanism, but aim to maximize the probability of error on the input data. They are possible because of inaccurate or misrepresentative data used during training, or because maliciously designed data is fed to an already trained model.
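The contrast can be made concrete with a short sketch; the toy linear model, learning rate and tensor shapes below are illustrative assumptions rather than any specific attack:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 3)           # toy classifier: 10 features -> 3 classes
x = torch.randn(1, 10)                   # one training example
y = torch.tensor([2])                    # its true label

# Training step: gradient descent on the *parameters* to reduce the loss.
loss = F.cross_entropy(model(x), y)
loss.backward()
with torch.no_grad():
    for p in model.parameters():
        p -= 0.1 * p.grad                # move parameters downhill

# Adversarial step: gradient *ascent* on the *input*, with the parameters frozen.
x_adv = x.clone().requires_grad_(True)
loss = F.cross_entropy(model(x_adv), y)
loss.backward()
x_adv = (x_adv + 0.1 * x_adv.grad.sign()).detach()   # move the input uphill (increase the error)
```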
To get an idea of how adversarial attacks have gained prominence: in 2014 there were hardly any papers about adversarial attacks on the preprint server Arxiv.org, while today there are more than 1,000 research papers on adversarial attacks and adversarial examples. In 2013, researchers from Google and NYU published a paper titled “Intriguing properties of neural networks,” which showcased the essence of adversarial attacks on neural networks.
Adversarial attacks, and techniques for defending against them, are becoming common themes at conferences including Black Hat, DEF CON, ICLR, and others.
Types of Adversarial Attacks
Adversarial attack vectors can take several forms.
Evasion – As the name suggests, these attacks are carried out to avoid detection and target models that have already been trained. An adversary introduces data intentionally designed to deceive the trained model into making errors. This is one of the most prevalent types of attack.
Poisoning – These attacks are carried out during the training phase. The adversary provides contaminated (misrepresented or inaccurate) data during training, forcing the model to make wrong predictions.
Model extraction – Here, the adversary interacts with a model deployed in production and tries to reconstruct a local copy of it: a substitute model that is, for example, 99.9% in agreement with the deployed model and therefore essentially identical for most practical tasks. This is also called a model stealing attack.
How are Adversarial Examples Generated?
Machine learning uses two types of techniques: supervised learning, which trains a model on known input and output data so that it can predict future outputs, and unsupervised learning, which finds hidden patterns or intrinsic structure in input data. At the core of training is a loss function, a measure of how well your prediction model does at predicting the expected outcome or value.
Loss is the penalty for a bad prediction. That is, loss is a number indicating how bad the model’s prediction was on a single example. If the model’s prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average.
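For example, with a cross-entropy loss, a single prediction that assigns probability 0.7 to the true class incurs a small but non-zero penalty, while a perfect prediction incurs none (the numbers here are made up for illustration):

```python
import math

p_true_class = 0.7                  # hypothetical probability the model assigns to the correct class
loss = -math.log(p_true_class)      # cross-entropy loss for this single example
print(round(loss, 3))               # ~0.357: non-zero because the prediction is imperfect
print(-math.log(1.0))               # 0.0: a perfect prediction has zero loss
```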
In adversarial attacks, the adversary can trick the system either by contaminating the input data or by pushing the predicted outcome away from the original expected prediction. Attacks can be classified as targeted or untargeted. In a targeted attack, noise is intentionally introduced into an input to cause the model to give a specific incorrect prediction. In an untargeted attack, the adversary simply tries to find any input that tricks the model.
Here are some examples:
- Just adding a few pieces of tape can trick a self-driving car into misclassifying a stop sign as a speed limit sign: an original stop-sign image is converted into an adversarial sample using a few strategically placed strips of tape.
- Researchers at Harvard were able to fool a medical imaging system into classifying a benign mole as malignant with 100% confidence.
- In a speech-to-text transcription neural network, a small perturbation added to the original waveform caused it to be transcribed as any phrase the adversary chose.
- Deep neural networks for face recognition have been attacked with carefully fabricated eyeglass frames.
The existence of these adversarial examples means that systems incorporating deep neural network models actually carry a very high security risk.
Adversarial Perturbation
An adversarial perturbation is any modification of a clean image that retains the semantics of the original input but fools a machine learning model into misclassifying it. The way this works is that the adversary computes the derivative (gradient) of the function that performs the classification, uses it to craft noise that is added to the input image, and feeds the result back to the function to trick the classifier. An imperceptible perturbation added to the original input image in this way is enough to create an adversarial image.
Popular Adversarial Attack methods include the following:
Limited-memory BFGS (L-BFGS) – The L-BFGS method is a non-linear, gradient-based numerical optimization algorithm used to minimize the amount of perturbation added to the image. While it is effective at generating adversarial samples, it is computationally intensive.
Fast Gradient Sign Method (FGSM) – A simple and fast gradient-based method for generating adversarial examples that minimizes the maximum amount of perturbation added to any pixel of the image while still causing misclassification (a minimal sketch appears after this list).
Jacobian-based Saliency Map Attack (JSMA) – Unlike FGSM, this method uses feature selection to minimize the number of features modified while still causing misclassification. Flat perturbations are added to features iteratively, in decreasing order of saliency value. It is computationally more intensive than FGSM, but has the advantage that very few features are perturbed.
DeepFool Attack – This untargeted adversarial sample generation technique aims to minimize the Euclidean distance between the perturbed sample and the original sample. Decision boundaries between classes are estimated, and perturbations are added iteratively. It is effective at producing adversarial samples with few perturbations, but is more computationally intensive than FGSM and JSMA.
Carlini & Wagner Attack (C&W) – This technique builds on the L-BFGS attack, but without box constraints and with different objective functions. It is quite effective at generating adversarial examples, and the examples it generates have been able to defeat state-of-the-art defenses such as defensive distillation and adversarial training.
Generative Adversarial Networks (GAN) – GANs have been used to generate adversarial attacks by pitting two neural networks against each other: one acts as the generator and the other as the discriminator. The two networks play a zero-sum game in which the generator tries to produce samples that the discriminator will misclassify, while the discriminator tries to distinguish real samples from ones created by the generator. Training a generative adversarial network is very computationally intensive and can be highly unstable.
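As a concrete illustration of the gradient-based methods above (see the FGSM entry), here is a minimal sketch of an FGSM-style attack; `model`, `images`, `labels` and the `eps` budget are hypothetical placeholders for a PyTorch image classifier and its data:

```python
import torch
import torch.nn.functional as F

def fgsm(model, images, labels, eps=0.03):
    """Single-step, untargeted FGSM sketch: perturb each pixel by at most eps."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    # Untargeted: step *up* the loss gradient so the true class becomes less likely.
    # (A targeted variant would instead step *down* the gradient of the loss
    # computed against the attacker's chosen target label.)
    adv = images + eps * images.grad.sign()
    return adv.clamp(0, 1).detach()          # keep pixel values in a valid range
```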
Black Box vs. White Box Attacks
An adversary may or may not have knowledge of the target model and can perform the following two types of attack:
Black box attack – The adversary has no knowledge of the model (how deep or wide the neural network is) or its parameters, and does not have access to the training dataset; they can only observe the output of the model. This makes it the hardest setting to exploit, but a successful black box attack can be very effective. In this case, the adversary may create adversarial examples with a substitute model built from scratch, or even without any model at all.
White box attack – The adversary has full knowledge of the deployed model: its architecture, input and output parameters, and the training dataset. The adversary can adapt and directly craft adversarial samples on the target model. Such an attack is also known as an adaptive, gradient-based, or iterative attack. The adaptive aspect refers to the adversary’s ability to modify the attack as they receive feedback from the model: they generate an initial set of inputs, observe the model’s response, and modify the inputs based on that response to make them more effective at evading the model’s defenses. This process is repeated iteratively until the adversary finds inputs that reliably fool the model. Such attacks are particularly challenging to defend against because the adversary can adjust their strategy in real time to overcome any defenses the model may have in place. Additionally, because the adversary has access to the model’s architecture and parameters, they can use this information to craft attacks specifically tailored to the model.
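The iterative loop described above can be sketched roughly as follows; `model`, `image`, `label`, the step sizes and the perturbation budget are assumptions for illustration, and real adaptive attacks add further refinements:

```python
import torch
import torch.nn.functional as F

def iterative_attack(model, image, label, eps=0.03, step=0.007, iters=10):
    """White-box sketch: repeatedly nudge the input using the model's own gradients."""
    adv = image.clone().detach()
    for _ in range(iters):
        adv = adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(adv), label)
        grad = torch.autograd.grad(loss, adv)[0]        # full gradient access = white box
        adv = adv + step * grad.sign()                  # small step that increases the loss
        adv = image + (adv - image).clamp(-eps, eps)    # keep the perturbation small
        adv = adv.clamp(0, 1)                           # keep pixel values valid
    return adv.detach()
```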
Currently, defense approaches that are effective against black-box attacks remain vulnerable to adaptive white-box attacks, and it is challenging to develop defenses that can completely protect a model from an adaptive attack.
| Description | Black box | White box |
| --- | --- | --- |
| Adversary knowledge | Restricted knowledge: the adversary can only observe the network's output on some probed inputs. | Detailed knowledge of the network architecture and the parameters resulting from training. |
| Strategy | Based on a greedy local search that builds an implicit approximation to the actual gradient by observing how the output changes as the input is probed. | Based on the gradient of the network's loss function with respect to the input. |

Source: http://leonardtang.me/posts/AA-Survey
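To make the black-box “Strategy” row concrete, here is a rough sketch of gradient estimation by probing; `query_model` is a hypothetical function returning the deployed model’s score for the true class, and the coordinate-by-coordinate loop is deliberately naive (real attacks query far more efficiently):

```python
import numpy as np

def estimate_gradient(query_model, x, delta=1e-3):
    """Approximate the gradient of the model's score by probing one coordinate at a time."""
    grad = np.zeros_like(x)
    base = query_model(x)
    for i in range(x.size):
        probe = x.copy()
        probe.flat[i] += delta
        grad.flat[i] = (query_model(probe) - base) / delta
    return grad                                         # implicit approximation of the true gradient

def black_box_step(query_model, x, eps=0.03):
    g = estimate_gradient(query_model, x)
    return np.clip(x - eps * np.sign(g), 0, 1)          # step that lowers the true-class score
```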
Examples of Black Box Attacks
The various ways in which practical black box attacks can manifest are described below:
Physical Attacks
These attacks involve adding something ‘physically’ to the input data to trick the model, and are usually easier to realize. For example, CMU research showed that an adversary could trick facial recognition models simply by adding carefully crafted, colorful eyeglass frames, turning an original face image into an adversarial sample.
Out of Distribution (OOD) Attack
Black box attacks can also be carried out via out-of-distribution (OOD) attacks. An out-of-distribution attack is a type of adversarial attack where the attacker uses data points that are outside of the distribution of the training data used to build the model.
Machine learning models are trained on a specific dataset that represents the distribution of the problem space that they are intended to solve. An adversary can attempt to trick a model by providing input data that falls outside of this distribution. This can cause the model to produce incorrect outputs, leading to serious consequences in real-world applications such as self-driving cars, medical diagnosis systems, and fraud detection systems.
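A small sketch of the underlying problem, using a scikit-learn digit classifier queried with random noise; the exact confidence value will vary from run to run, but the point is that the model has no built-in way to say “this input is not a digit”:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)     # trained only on handwritten digits

noise = np.random.uniform(0, 16, size=(1, 64))          # digit pixels range 0-16; this is not a digit
probs = model.predict_proba(noise)[0]
print("predicted class:", probs.argmax(), "confidence:", round(probs.max(), 2))
```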
How Can We Trust Machine Learning?
As machine learning makes more decisions for us and the systems involved become more complex, how can we trust machine learning?
The core principles of trust revolve around the following questions:
- Can I trust what I am building?
- Can you trust what I built?
- Can we trust what we are all building?
To answer the above questions, there are three important qualities we need to consider:
- Clarity
- Competency
- Alignment
Clarity is the quality of communicating well and being easily understood. It is about understanding why we are making a particular decision and whether we are doing it for the right reasons. Clarity can help humans make more informed decisions, and it requires being clear about which metric is the right one to consider.
Competency is the quality of having sufficient knowledge, judgement, skill or strength for a particular task. In machine learning, competency is all about evaluation: we need to test trained models more systematically. We have little insight into how a system might behave in the real world based on the training we do offline, so benchmark and test datasets are at best a weak proxy for what can happen in the real world.
Alignment is the most complex of the three. It is a state of agreement or co-operation among persons, groups, nations, etc. with a common cause or viewpoint. It means agreeing on the balance between competing concerns and trying to answer the question: does my system have the same cause or viewpoint that I hope it has? Every choice you make when you create a system impacts people, and those choices have to be aligned. The choice of data is one of the most important decisions defining the behavior of a machine learning model; the diversity and coverage of that data are important to avoid biases and the perpetuation of stereotypes.