In this final part of the series, we look at detecting Deepfakes videos with machine learning (ML) and deep learning (DL). This is an active research area, so our goal is to explore the possibilities and spot trends. Because the field is still in an early phase, we will not detail implementation aspects that are likely to change. We hope the discussion offers some insight into where detection may go next.
One of the major shortcomings of early Deepfakes videos is that the source dataset does not fully cover the conditions of the target (illumination, pose, expression, etc.). From this perspective, researchers look for anomalies: missing data that triggers unusual behaviors and artifacts.
For example, many early Deepfakes videos use faces collected from still images. This creates a gross imbalance in the training data. Typically, there is a lack of side views and of images with closed eyes. In the latter case, the trained decoder has a hard time generating frames in which the person should be blinking. For example, Nicolas Cage in this Deepfakes video rarely blinks.
This paper uses object detection to extract the eye area and a combination of CNN, LSTM, and fully-connected networks to determine whether the eye in a particular frame is blinking (1 if it blinks). From there, we can determine whether the video shows the blinking pattern of a normal person.
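To make the last step concrete, here is a minimal sketch of turning per-frame blink labels into a blink rate. The function names, the 30 fps frame rate, and the cutoff of 2 blinks per minute are illustrative assumptions, not values from the paper.

```python
def count_blinks(frame_labels):
    """Count blink events: each run of consecutive 1s is one blink."""
    blinks = 0
    previous = 0
    for label in frame_labels:
        if label == 1 and previous == 0:
            blinks += 1
        previous = label
    return blinks


def blinks_per_minute(frame_labels, fps=30.0):
    """Blink rate for a clip, given one 0/1 label per frame."""
    minutes = len(frame_labels) / fps / 60.0
    return count_blinks(frame_labels) / minutes if minutes > 0 else 0.0


def looks_suspicious(frame_labels, fps=30.0, min_rate=2.0):
    # min_rate is a hypothetical cutoff chosen for illustration only.
    return blinks_per_minute(frame_labels, fps) < min_rate
```

A real detector would more likely compare the whole blinking pattern against reference statistics rather than use a single cutoff.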
Unfortunately, this type of detection approach can be defeated easily by identifying what is missing and augmenting the data manually or automatically.
Deepfakes’ Algorithm Issue
The breakthrough in Deepfakes is applying deep networks from deep learning (DL) to reenact the source actor on the target actor. A deep network lets us learn rules that are otherwise too complicated to program. These DL models are relatively simple and not necessarily state-of-the-art. That being said, trained Deepfakes models can produce good-quality videos, but not without flaws. To fix these flaws, we can introduce more advanced DL technologies, but that requires AI expertise and far more expensive GPU resources. Alternatively, many Deepfakes packages offer one-off solutions borrowed from computer graphics or VR to fix some of the potential flaws. However, these approaches are less sophisticated than DL and leave traces or inconsistencies that can be detected.
Inconsistent head pose
One of the common options in Deepfakes is to apply warping using facial landmarks. Visually, this blends the swapped face with the original face better.
But that may introduce inconsistency in the head pose. Images (j) and (m) compute the head pose (the blue arrow) using landmarks from the whole face. Images (i) and (l) compute the head pose (the red arrow) using landmarks from the center of the face only. In the second row, where the center of the face is swapped and warped, the computed poses from (l) and (m) will not match.
Warping in 2-D fails to recreate the 3-D pose correctly.
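As a rough illustration of comparing the two pose estimates, the sketch below measures the angle between two 3-D direction vectors. The function names and the 10-degree tolerance are hypothetical; the pose vectors themselves would come from a landmark-based pose estimator that is not shown here.

```python
import math


def angle_between_degrees(u, v):
    """Angle between two 3-D direction vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against tiny floating-point overshoot.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def pose_mismatch(whole_face_dir, central_face_dir, max_degrees=10.0):
    # max_degrees is a hypothetical tolerance, not a value from the paper.
    return angle_between_degrees(whole_face_dir, central_face_dir) > max_degrees
```

A fixed threshold is a simplification; a trained classifier over such pose differences would be a natural replacement.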
The issue is mainly caused by the warping algorithm that uses simple affine operations. However, this problem is fixable. For example, new constraints can be added to impose the consistency of the pose. In some generative models, a 3-D face model is also built such that a new face can be rendered according to this 3-D model.
Warping and Blending Artifacts
As discussed in the previous article, warping and blending the swapped face with the background creates artifacts that may be detected visually. So it is natural to train a CNN classifier to determine whether a face is real or fabricated. In this paper, VGG- and ResNet-based networks extract features that are then fed into a binary classifier to judge whether the image is manipulated.
The research paper argues that the warping and blending algorithms lead to artifacts that can be detected by a classifier. Nevertheless, that brings up an important question. If the Deepfakes algorithm is improved or changed, will this classifier still work? For example, in some Deepfakes packages, we can use a GAN to generate the whole face instead of the inner face only. Will this avoid detection?
Therefore, to measure the success of a detector, we also need to know how well it generalizes. In particular, how can we detect fabricated videos made with new technologies and implementations? This sounds like mission impossible. But it is an issue we need to address in real life, and we will discuss it later.
Biometric authentication like FaceID has gained wide adoption in the mobile world. But face recognition is not the only way to authenticate a person. So how can we measure and quantify this information?
How you move, how you act and how you speak define who you are.
OpenFace is a popular toolkit for facial landmark detection, head pose estimation, facial expression, and eye-gaze estimation.
Here are some examples of detected facial landmarks.
As shown below, the gaze will be represented by two vectors (in green) and the pose by a 3-D bounding box (in blue).
The expression is represented by the presence and intensity of the properties below, which are called action units (AU).
Here is an example of the detected facial expression using AU.
As shown below, the head pose and the eye gaze of the fabricated video will not match the pattern of the original person.
So far, we have discussed one-off approaches to detecting fabricated videos. Such a one-trick pony can be countered relatively easily. Instead, we should open multiple fronts in the battle. Rather than only matching the eye gaze with the head pose, we make sure all the extracted properties are coherent with one another.
Facial expressions and movements
The paper “Protecting World Leaders Against Deep Fakes” extracts 16 AUs plus four more properties: the head rotation in the x- and z-directions, the 3-D height between the upper and lower lips, and the 3-D width between the corners of the mouth. Here is an example of the intensity of the “eyebrow lift” AU for an Obama video clip.
It then computes the Pearson correlation coefficients between these 20 parameters. Picking any two of the 20 gives ₂₀C₂ = 190 pairs and therefore 190 coefficients. These coefficients quantify the correlations between features.
With a moving window, a video is split into overlapping 10-second clips. Then, for each clip, it computes these 190 coefficients from the features collected in each frame. The key idea is to spot trends that are correlated in a person’s expression, head movement, etc. For example, how likely is this person to raise an eyebrow when the jaw drops? With these extracted features, it builds a one-class support vector machine (SVM) to classify whether the video is real or fake.
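The pair counting and the correlation step can be sketched in plain Python. The function names are illustrative, and treating a constant feature as uncorrelated is an assumption made only so the sketch never divides by zero.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant feature as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def correlation_vector(features):
    """features: list of per-feature time series (one value per frame).

    Returns one coefficient per unordered pair of features.
    """
    return [pearson(features[i], features[j])
            for i, j in combinations(range(len(features)), 2)]
```

With 20 feature series, `correlation_vector` returns `math.comb(20, 2)` values, which is the 190-dimensional descriptor described above.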
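The moving window can be sketched as follows. The text only says that the 10-second clips overlap, so the 1-second stride and the 30 fps frame rate here are assumptions.

```python
def clip_windows(num_frames, fps=30, clip_seconds=10, stride_seconds=1):
    """Return (start, end) frame indices of overlapping clips; end is exclusive."""
    clip_len = int(clip_seconds * fps)
    stride = int(stride_seconds * fps)
    return [(start, start + clip_len)
            for start in range(0, num_frames - clip_len + 1, stride)]
```

Each window would then yield one 190-dimensional correlation vector, and those vectors are the samples given to the one-class SVM.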
The research paper also uses t-SNE to plot these features in 2-D space and uses different colors for different persons. As shown, the samples from the same person are clustered together. This demonstrates the use of one-class SVM is feasible in classifying a person's behavior.
Detecting manipulated photos was heavily studied before Deepfakes, and many techniques remain relevant. But sometimes we overlook them.
Operations on a raw image leave traces that can be detected.
For example, in JPEG compression, when converting DCT coefficients from floating-point to integer values, we can use floor, ceiling, or rounding. If either of the former two is used, it introduces a periodic artifact in the form of a single darker or brighter pixel (rounding does not). The second row in the diagram below zooms in on part of the original image in the first row. As shown, when flooring or ceiling is used, JPEG dimples (white or black dots) appear.
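A 1-D toy example gives some intuition for why the artifact shows up as a single pixel. Flooring shifts every coefficient down by roughly half a unit on average, and the inverse DCT of such a constant shift concentrates at the first sample of the block. This is a simplified illustration that ignores the quantization tables; it is not the analysis from the paper.

```python
import math


def idct_1d(coeffs):
    """Orthonormal inverse DCT-II of a short sequence."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        out.append(total)
    return out


# Average error added to each of the 8 coefficients of a block:
# about -0.5 for flooring, about 0 for rounding.
floor_bias = idct_1d([-0.5] * 8)
round_bias = idct_1d([0.0] * 8)
```

In `floor_bias`, the first sample has by far the largest magnitude and is negative (a darker pixel), while `round_bias` is zero everywhere. Ceiling would flip the sign and give a brighter pixel.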
And below is the number of camera models that carry these JPEG dimples.
When part of the image is manipulated, the intensity of the JPEG dimples changes. Columns (a) and (c) below are the original and the manipulated images. Columns (b) and (d) show the corresponding strength of the JPEG dimples, with black being the highest. As shown, the strength of the JPEG dimples diminishes (the white areas) where the image has been manipulated.
Nevertheless, the effectiveness of this detector decreases if high-level compression is applied in post-production. In addition, if the image is edited and saved with Photoshop, the dimples will disappear.
Even though this method may not be robust, it leads to one important observation.
If part of an image originates from a different source (a different camera) or has been manipulated, there should be traces to detect.
Previously, many detection methods located fabricated videos by identifying shortfalls such as warping. But this one-off approach is hard to generalize, particularly because algorithms can be changed or improved. Instead, we may ask whether these generators share some common designs that are so fundamental that they are hard to replace or change.
In theory, if such a design does exist, we can use face tracking to extract the face area and use a classifier to identify any trace of it.
For example, images generated by early GAN models have noticeable checkerboard patterns.
As shown below, the deconvolution (transposed convolution) in the upsampling may trigger uneven overlaps that explain the pattern above. Some spatial locations separated by a periodic distance will receive a stronger signal from the previous layer. This is one of the detectable traces that deconvolution may leave.
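The uneven overlap is easy to reproduce by counting how many input positions contribute to each output position of a 1-D transposed convolution. This is a minimal sketch that assumes no padding; the function name is illustrative.

```python
def overlap_counts(num_inputs, kernel_size, stride):
    """How many inputs contribute to each output of a 1-D transposed convolution."""
    out_len = (num_inputs - 1) * stride + kernel_size
    counts = [0] * out_len
    for i in range(num_inputs):
        for j in range(kernel_size):
            counts[i * stride + j] += 1
    return counts
```

With a kernel of size 3 and stride 2, the interior counts alternate between 1 and 2, which is the periodic pattern described above. With a kernel of size 4 and stride 2, the kernel size is divisible by the stride and the interior counts are uniform.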
Since resampling is one of the most common technologies in generative models, let’s explore whether we can detect fabricated videos through resampling detection.
Deepfakes’ source images are downsampled to extract the latent factors and then upsampled to generate the image. This paper detects such resampling when it is present in a fabricated image.
First, a 64 × 64 patch is extracted every 8 pixels. This divides an image into overlapping patches (the second diagram from the left above). To spot the presence of resampling, it first estimates the periodic correlations among the interpolated pixels. Specifically, it applies a 3 × 3 Laplacian filter to every pixel in each patch, followed by the Radon transform and an FFT to find the periodicity. (We will not explain the technical details here. Please refer to the original paper.) These computed values then serve as hand-crafted features for six classifiers, each composed of 2 fully-connected layers. Each classifier is trained to learn one aspect of resampling characteristics (namely, JPEG quality thresholded above or below 85, upsampling, downsampling, clockwise rotation, counterclockwise rotation, and shearing).
Alternatively (the second diagram above), it uses a deep network to learn the features instead of hand-crafting them. (The paper reports different pros and cons for these two approaches.)
These classifiers produce a multi-channel heatmap indicating where resampling may have taken place (one channel for each resampling characteristic). Edge-preserving bilateral filtering is then applied to improve region detection accuracy. Otsu thresholds determine which channel(s) will be used for the final segmentation. Next, random-walker segmentation converts these channels into binary images indicating whether each pixel is resampled. Finally, a bitwise-OR operation combines all these binary images into a final mask image.
Besides using the fully-connected layers above to classify the pixels, the paper also proposes an alternative using an LSTM model, as shown below.
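The first two steps (patch extraction and the Laplacian filter) can be sketched in plain Python. The 4-neighbour kernel used here is one common 3 × 3 Laplacian and is an assumption, since the text does not specify which kernel the paper uses. The Radon transform and FFT steps are not shown.

```python
def patch_origins(height, width, patch=64, stride=8):
    """Top-left corners of overlapping patches."""
    return [(r, c)
            for r in range(0, height - patch + 1, stride)
            for c in range(0, width - patch + 1, stride)]


def laplacian_3x3(image):
    """Apply the 4-neighbour Laplacian to the interior pixels of a 2-D list."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]  # border pixels are left at zero
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            out[r][c] = (image[r - 1][c] + image[r + 1][c]
                         + image[r][c - 1] + image[r][c + 1]
                         - 4 * image[r][c])
    return out
```

For a 128 × 128 image, `patch_origins` yields 9 × 9 = 81 patches. The Laplacian is zero on smooth linear gradients, so what remains is the fine-scale residual where interpolation leaves its periodic correlations.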
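The final merging step can be sketched as follows. This is a deliberately simplified version: it binarizes each channel directly with an Otsu threshold as a stand-in for the random-walker segmentation, and it skips the bilateral filtering and channel selection. Channels are assumed to be flattened lists of integers in the range 0 to 255.

```python
def otsu_threshold(values, levels=256):
    """Otsu threshold for integer values in [0, levels)."""
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = len(values)
    total_sum = sum(i * h for i, h in enumerate(hist))
    # Default: if no split exists (constant channel), nothing exceeds it.
    best_t, best_var = levels - 1, -1.0
    count0, sum0 = 0, 0
    for t in range(levels):
        count0 += hist[t]
        sum0 += t * hist[t]
        count1 = total - count0
        if count0 == 0 or count1 == 0:
            continue
        mean0 = sum0 / count0
        mean1 = (total_sum - sum0) / count1
        between = count0 * count1 * (mean0 - mean1) ** 2  # between-class variance (unnormalized)
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def combine_channels(channels):
    """Threshold each flattened channel, then OR the binary masks."""
    masks = []
    for channel in channels:
        t = otsu_threshold(channel)
        masks.append([1 if v > t else 0 for v in channel])
    return [int(any(bits)) for bits in zip(*masks)]
```

A pixel ends up in the final mask if any resampling channel flags it, which is what the bitwise-OR expresses.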
CNN based generator
This behavior is further generalized by another research paper that focuses on the upsampling CNN layers on which many generative models, including GANs, depend. Supported by empirical results, it demonstrates that this general type of generated image can be detected relatively easily.
The paper starts with a classifier trained on ProGAN-generated images and real images. ProGAN is chosen for the training dataset because it produces high-quality images and has a relatively simple CNN design. The classifier is based on ResNet-50 pre-trained on ImageNet and is then further trained as a binary classifier. Later, the detector is tested on images generated by other technologies. Surprisingly, the detector works well on them, including Deepfakes, even without further training on the added data. In other words, the classifier generalizes to CNN-based generators without further training.
This detector reuses models developed for general computer vision, and its training follows that of typical deep learning classifiers. But for the classifier to generalize well, careful pre-/post-processing and data augmentation of the training dataset are important.
For pre-processing, an image is resized so that its shorter side is 256 if that side is smaller than 256. It is then randomly cropped to 224 × 224 and randomly flipped horizontally.
In training, additional data augmentation can be applied to the training dataset (ProGAN images). The paper tests a few scenarios, from no augmentation to applying Gaussian blur and/or JPEG compression to a fixed percentage (50% or 10%) of the images. Here is the AP score when the ProGAN-trained model is tested on images generated by other technologies. As shown, data augmentation helps in most cases. However, augmentation does hurt Deepfakes detection in some scenarios.
In the real world, post-operations like blurring and JPEG compression are frequently applied to real or fabricated images. These operations usually reduce the effectiveness of the detector because some of the characteristics of the fabricated video are masked out. Therefore, the AP score for the detector decreases as we apply blurring and compression to both real and fabricated images. To improve accuracy, the paper augments the training dataset with blurring and compression. The diagrams below show the results for different augmentation scenarios. In general, augmentation improves the AP score of the detector compared with no data augmentation.
Data diversity also helps. If the training dataset contains only one type of image, the detector's performance suffers. When the number of image classes used for training (different object categories) increases from 2 to 16, AP improves. Nevertheless, the improvement diminishes afterward; the additional classes likely provide little extra information.
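The geometry of this pre-processing can be sketched without any image library. The function names are illustrative, and the 50% flip probability is an assumption, since the text only says the flip is random.

```python
import random


def resized_shape(height, width, min_side=256):
    """Upscale so the shorter side is min_side; leave larger images alone."""
    shorter = min(height, width)
    if shorter >= min_side:
        return height, width
    scale = min_side / shorter
    return round(height * scale), round(width * scale)


def random_crop_and_flip(height, width, crop=224, rng=random):
    """Pick a crop box (top, left, bottom, right) and a flip decision."""
    top = rng.randint(0, height - crop)
    left = rng.randint(0, width - crop)
    flip = rng.random() < 0.5  # assumed flip probability
    return (top, left, top + crop, left + crop), flip
```

In practice these values would be passed to an image library that performs the actual resize, crop, and flip.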
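One way to read the blur and/or JPEG scenarios is that each operation is applied independently with a fixed probability per image. The sketch below encodes that reading; it is an assumption about the setup, and the operation names are placeholders rather than real image operations.

```python
import random


def choose_augmentations(prob, rng):
    """Independently decide whether to blur and/or JPEG-compress one image."""
    ops = []
    if rng.random() < prob:
        ops.append("gaussian_blur")
    if rng.random() < prob:
        ops.append("jpeg_compression")
    return ops
```

Calling `choose_augmentations(0.5, random.Random(0))` once per training image would blur about half of the images and compress about half, with some images receiving both.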
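For readers unfamiliar with the metric, average precision (AP) can be computed from the detector's scores as follows. This is a common rank-based definition; the exact variant used in the paper is not stated in the text.

```python
def average_precision(labels, scores):
    """AP: mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

Here label 1 marks a fabricated image and a higher score means the detector considers the image more likely to be fabricated. A perfect ranking gives an AP of 1.0.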
The paper also performs a frequency analysis after applying a high-pass filter. As shown, in the majority of cases, there is a distinct visual difference between the real and generated images. This indicates that CNN generated images do leave low-level CNN traces that a classifier can look for.
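The idea behind this analysis can be shown with a 1-D toy signal: remove the smooth content with a high-pass filter, then look at the spectrum of what is left. The 3-tap moving-average filter and the synthetic period-2 artifact are stand-ins chosen for illustration; the paper works on 2-D images with its own filter.

```python
import cmath


def high_pass(signal):
    """Residual after subtracting a 3-tap moving average (interior samples only)."""
    return [signal[i] - (signal[i - 1] + signal[i] + signal[i + 1]) / 3.0
            for i in range(1, len(signal) - 1)]


def dft_magnitudes(signal):
    """Magnitude of the discrete Fourier transform for frequencies 0..N/2."""
    n = len(signal)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal)))
            for k in range(n // 2 + 1)]


# Smooth ramp as a stand-in for natural image content.
smooth = [0.1 * t for t in range(18)]
# The same ramp with a synthetic period-2 artifact added.
artifact = [s + 0.5 * (-1) ** t for t, s in enumerate(smooth)]
```

The spectrum of `high_pass(smooth)` is essentially zero, while the spectrum of `high_pass(artifact)` has a single strong peak at the highest frequency. A periodic upsampling artifact would stand out in the same way.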
This paper builds an autoencoder that extracts latent factors from an image and reconstructs the image from them. The model is also trained so that the latent factors shown in red below have high activation when the input image is real. On the other hand, if the image is fabricated, the latent factors in blue should have high activation. At test time, an image is classified as real or fabricated by comparing the activations of the blue and the red latent factors.
This model serves two purposes: it extracts latent factors that can reconstruct the images, and it classifies the images as real or not. The former objective keeps the classifier from overfitting to a specific type of fabricated image. The learned latent factors are more general and hopefully serve a broader purpose.
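The test-time decision can be sketched as a comparison between the two groups of latent factors. The flat latent layout (the "real" partition first) and the use of the mean absolute activation are assumptions made for illustration; the encoder that produces the latent vector is not shown.

```python
def classify_from_latent(latent, num_real):
    """latent: flat list; the first num_real entries are the 'real' partition."""
    real_part = latent[:num_real]
    fake_part = latent[num_real:]
    real_energy = sum(abs(v) for v in real_part) / len(real_part)
    fake_energy = sum(abs(v) for v in fake_part) / len(fake_part)
    return "real" if real_energy >= fake_energy else "fabricated"
```

For example, a latent vector whose first half is strongly activated and whose second half is near zero would be labeled "real".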
To improve accuracy for new technologies, it can be further fine-tuned with new images from different technologies. However, it is assumed (or proven) that this can be achieved with few-shot learning. If the model is first trained with the proper generalization, the extracted latent factors will require minor changes to adapt to new generative models.
The diagram below is the model design of the autoencoder/classifier.
Other types of detectors
Detectors for fabricated images and videos are a huge and active research topic. It is not our intention to do a comprehensive review of different technologies. But these are other detectors that you may check out. A comparison for some of the detectors above can be found here.
Rich Models for Steganalysis of Digital Images. Steganalysis detects the existence of messages embedded (piggybacked) in data or an image. Below is one example of embedding an image in another image.
The paper develops a detector based on steganalysis. By developing models for the noise components of an image, it detects potential modifications. However, the performance degrades as compression increases and falls below human performance for low-quality images.
Recasting Residual-based Local Descriptors as Convolutional Neural Networks: an Application to Image Forgery Detection It feeds the Steganalysis features above to a CNN classifier instead of handcrafted machine learning models.
Recasting Residual-based Local Descriptors as Convolutional Neural Networks: an Application to Image Forgery Detection The operations performed on an image will leave traces and can be detected in the form of recurrent micropatterns. Steganalysis usually provides tools to extract these residual-based local descriptors. This paper demonstrates that it can be done through CNN models also.
A Deep Learning Approach To Universal Image Manipulation Detection Using A New Convolutional Layer. A new convolutional layer is designed for an otherwise typical classifier. Traditional convolutional layers extract features representing the content of an image. The new design instead tries to learn the features of fabricated videos.
As stated in the paper:
The key idea behind developing this layer is that certain local structural relationships exist between pixels independent of an image’s content. Manipulations will alter these local relationships in a detectable way.
This basic belief is similar to the steganalysis paper which claims there are existing natural patterns (or probability distribution) that will be altered if images are manipulated.
Distinguishing Computer Graphics from Natural Images Using Convolution Neural Networks. It uses typical CNN architectures but with a custom pooling layer that computes the mean, variance, maximum, and minimum of the extracted features. Those statistics serve as a good discriminator for finding manipulated images.
This paper details a CNN classifier for detecting fabricated videos.
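The statistical pooling idea can be sketched for a single feature map. The function name is illustrative, and using the population variance is an assumption.

```python
def statistical_pool(feature_map):
    """Reduce one 2-D feature map to (mean, variance, max, min)."""
    values = [v for row in feature_map for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance, max(values), min(values)
```

Applied to every feature map, this produces four numbers per map regardless of the spatial size of the input.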
XceptionNet A classifier that builds on Google Inception Network
Do GANs leave artificial fingerprints? A look at the GAN fingerprints for the forensic purpose.
This paper computes co-occurrence matrices directly on the image pixels of each of the red, green, and blue channels. These matrices become the input features for a CNN classifier.
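A co-occurrence matrix counts how often each pair of pixel values appears at a fixed offset. The sketch below uses horizontally adjacent pixels; that offset is an assumption, since the text does not say which one the paper uses.

```python
def cooccurrence_matrix(channel, levels=256):
    """Count horizontally adjacent pixel-value pairs in one colour channel."""
    matrix = [[0] * levels for _ in range(levels)]
    for row in channel:
        for left, right in zip(row, row[1:]):
            matrix[left][right] += 1
    return matrix
```

Computing one such matrix for each of the three colour channels gives a three-channel input of size 256 × 256 for 8-bit images.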
FakeSpotter: A Simple yet Robust Baseline for Spotting AI-Synthesized Fake Faces Instead of using the last layer as the input to the classifier (the bottom diagram), it uses layer-wise features as the input for the classifier.
Detecting and Simulating Artifacts in GAN Fake Images A GAN simulator, AutoGAN, that can simulate the artifacts produced by the common GAN models. This provides the training dataset in detecting GAN-based generators.
Fighting Fake News: Image Splice Detection via Learned Self-Consistency This paper uses the automatically recorded photo EXIF metadata (information recorded by the cameras) to train a classifier in determining whether the metadata is consistent. It examines two random patches and predicts whether they have consistent meta-data.
The learned consistency model can be applied to an image to determine whether its metadata is self-consistent, i.e., whether all of its parts come from the same source.
Tampering Detection and Localization through Clustering of Camera-Based CNN Features. The forgery detection assumes that the parts of a tampered image come from different camera models. Here is the model design.
Exploiting Spatial Structure for Localizing Manipulated Image Regions It develops a CNN-LSTM model to detect the boundary between the original and the manipulated region.
Exploiting Spatial Structure for Localizing Manipulated Image Regions The paper approaches the problem as an object segmentation problem in which the object is the tampered region. The system is composed of two Faster R-CNN networks. The first one takes the original image and detects artifacts like strong contrast differences and unnatural tampered boundaries. The second one takes input extracted from a steganalysis rich model filter layer to discover the noise inconsistency between the original image and the tampered regions.
As quoted from the paper:
For de-interlaced video, we explicitly model the correlations introduced by de-interlacing algorithms, and show how tampering can destroy these correlations. For the interlaced video, we measure the inter-field and inter-frame motions which for an authentic video are the same, but for a doctored video may be different.
A Video Forensic Technique for Detecting Frame Deletion and Insertion. Detects the insertion and deletion of whole frames in digital videos.
When part of an image is manipulated, its statistical information changes. One group of researchers verifies the consistency among these statistics, for example, whether the pose matches the eye gaze. Some researchers look for abnormalities compared with unmanipulated images. Others verify consistency with other parts of the image. In addition, when images are manipulated by certain operations, like warping and blending, they leave traces that can be detected. For example, a manipulated area shows a different frequency spectrum from a non-manipulated area.
But this type of one-trick pony can be defeated easily. The good news is that a chain is only as strong as its weakest link. To win the war, we need to open multiple fronts and check many factors instead of just one. However, the rules for such detection are complex, so many researchers have switched to deep learning to model the classifier. In the beginning, the input features of these classifiers were handcrafted. As the field progresses, DL models are designed to extract these features with supervised learning.
But there is an unusual challenge. In most computer vision problems, the training dataset is relatively static. In Deepfakes, technologies change and improve, and there are other variations, including how well the fabricated videos are produced. Developing such a dataset is therefore more challenging. That leads to the question of how well a detector can generalize. To answer it, some researchers ask whether some fundamental designs are shared among current and future generative models. The current answer leads us to CNN upsampling.
So far, most research focuses on spatial consistency within individual frames. But temporal coherence is a great challenge for fabricated videos. However, processing sequences of frames may make model training unmanageable. One possible direction is to apply attention in the model design.