Detect AI-generated Images & Deepfakes (Part 1)
Who wants to be a millionaire? More than 2,000 competitors say so, competing for a total of one million dollars in prize money in the Deepfake Detection Challenge (DFDC). The goal of the March 2020 challenge is to create technologies that detect Deepfakes and manipulated media.
Update: Here is a quick follow-up on the competition since this article was first written. With 35K models submitted, the winning entry for DFDC came from Selim Seferbekov, whose model had an accuracy of 65% in spotting Deepfakes. The accuracy was a little lower than I expected, as many fake videos in the dataset were not high-quality productions. But it demonstrates how hard it is to build an automatic detection solution, and that there is plenty of room to improve.
According to the Facebook paper:
Selim Seferbekov, used MTCNN for face detection and an EfficientNet B-7 for feature encoding. Structured parts of faces were dropped during training as a form of augmentation. The second solution, WM, used the Xception architecture for frame-by-frame feature extraction, and a WSDAN model for augmentation. The third submission, NTechLab, used an ensemble of EfficientNets in addition to using the mixup augmentation during training.
Due to the kernel time limits (computation time) established in the competition, the MTCNN detector was chosen for face detection over S3FD for speed. Selim Seferbekov then expanded the detected face area by 30% and used it as input to EfficientNets to extract facial features. In addition, the training dataset was heavily augmented, including cutout and partial dropout (shown below). This improves the generalization of the detector. Congratulations!
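To make the two preprocessing ideas concrete, here is a minimal NumPy sketch of expanding a face box and applying a cutout. It is not the winning solution's code. The function names, the reading of "30%" as 30% of the box width and height split evenly between both sides, and the cutout size limit are all assumptions made for illustration.

```python
import numpy as np

def expand_box(box, image_shape, margin=0.30):
    """Grow an (x1, y1, x2, y2) face box by `margin` of its size, clipped to the image.

    Assumption: the margin is split evenly between the two sides of each axis.
    """
    x1, y1, x2, y2 = box
    height, width = image_shape[:2]
    dx = (x2 - x1) * margin / 2.0
    dy = (y2 - y1) * margin / 2.0
    return (
        max(0, int(round(x1 - dx))),
        max(0, int(round(y1 - dy))),
        min(width, int(round(x2 + dx))),
        min(height, int(round(y2 + dy))),
    )

def cutout(face, rng, max_fraction=0.4):
    """Zero out one random rectangle of the face crop (a simple cutout augmentation)."""
    out = face.copy()
    h, w = out.shape[:2]
    # Hypothetical size limit: the rectangle covers less than max_fraction per side.
    ch = int(rng.integers(1, max(2, int(h * max_fraction))))
    cw = int(rng.integers(1, max(2, int(w * max_fraction))))
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    out[top:top + ch, left:left + cw] = 0
    return out
```

The expanded box keeps some context around the face (hairline, jaw, boundary), which is where many blending artifacts show up. The cutout forces the detector not to rely on any single facial region.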
In December 2019, Facebook removed 682 accounts that allegedly used deceptive practices to push pro-Trump narratives to about 55 million users. As Facebook stated, some of these accounts used profile photos generated by artificial intelligence and masqueraded as Americans. It is widely reported that the photos were generated by a public website that uses StyleGAN to produce profile pictures. The photos below are generated by an improved version called StyleGAN2, which is also publicly available.
Can you spot which image below is fake? Which one is created by StyleGAN?
This one is easy. It is the left one because of the artifacts present in many StyleGAN photos. Just for fun, here are a few more.
All the images on the left are fakes. My accuracy in spotting StyleGAN photos is higher than 95%. But StyleGAN2 is far harder. All the photos below are fake.
GANs and Deepfakes have become more than research topics or engineers’ toys. What started as an innovative concept or application can now be used as a communication weapon. If you want more examples, here is another widely distributed video created with Adobe After Effects and FakeApp (a Deepfakes application).
Design & Implementation Flaws
Design and implementation usually come with shortcomings and mistakes. For example, the instance normalization method used in StyleGAN often triggers blob artifacts and color bleed in generated images. These reveal the fake images easily.
However, as with other GAN and Deepfakes technologies, countermeasures are introduced. For example, the blob artifacts in StyleGAN are resolved in StyleGAN2, which uses weight demodulation as the alternative normalization method.
For StyleGAN2, if you look in detail, you can still find some flaws. For example, the structure of the background below does not seem right. The rendered structures do not maintain the correct form of lines or shapes.
Symmetry is also hard to maintain. For example, one ear may have an earring but not the other. In the following picture, the pose of the right shoulder does not match that of the left shoulder.
In Deepfakes, step ① below builds a common encoder to encode the latent factors of pictures of two different persons. In steps ② and ③, it builds two separate decoders to reconstruct the first and second person's photos respectively. To reconstruct the images correctly, the encoder must capture all the variations in a person’s photos, i.e. the latent factors that capture information like the pose, the expression, the illumination, etc.
Let’s replace Mary’s face in a video with Amy’s. We capture the latent factors of Mary’s face in the video and render them with Amy’s decoder. Therefore, the rendered Amy face will have the same pose, lighting, and emotional expression as the original video.
However, if it is not done properly, this turns into a “cut & paste” operation with obvious artifacts on the boundary where the face is pasted.
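The shared-encoder, two-decoder idea can be sketched in a few lines. This is only a structural illustration with random, untrained linear layers in NumPy. Real Deepfakes tools use convolutional networks, and the sizes and function names here are assumptions rather than anything taken from a specific package.

```python
import numpy as np

rng = np.random.default_rng(0)
PIXELS, LATENT = 64 * 64 * 3, 128  # illustrative sizes only

# One encoder shared by both persons, one decoder per person (untrained weights).
encoder = rng.normal(0, 0.01, (PIXELS, LATENT))
decoder_mary = rng.normal(0, 0.01, (LATENT, PIXELS))
decoder_amy = rng.normal(0, 0.01, (LATENT, PIXELS))

def encode(face):
    """Map a 64x64x3 face to latent factors (pose, expression, lighting, ...)."""
    return np.tanh(face.reshape(-1) @ encoder)

def decode(latent, decoder):
    """Render a face from latent factors with one person's decoder."""
    return (latent @ decoder).reshape(64, 64, 3)

def reconstruction_loss(face, decoder):
    """Training objective for one person: encode, then decode with that person's decoder."""
    return float(np.mean((decode(encode(face), decoder) - face) ** 2))

def swap_to_amy(mary_face):
    """Inference: encode Mary's frame, but render it with Amy's decoder."""
    return decode(encode(mary_face), decoder_amy)
```

During training, Mary's photos only ever pass through `decoder_mary` and Amy's through `decoder_amy`, while both update the same encoder. The swap works because the encoder is pushed to describe pose and expression in a way both decoders understand.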
To resolve that, the encoder can learn a mask to blend the new face with the original better.
Nevertheless, merging the new face onto the original one is tricky. Ghosting effects, tone changes, and obvious boundaries usually give away low-budget productions, including some videos in the DFDC dataset.
Another technique can be applied to improve quality. Swapping faces using facial landmarks predates the current AI era. An area of the face is cut out and warped from its own landmarks to the target landmarks.
Then Gaussian blur is applied to smooth out the edges. But the skin tones and lightness will probably not match. As discussed before, this can be addressed with Deepfakes.
Some Deepfakes implementations detect the facial landmarks and warp the replaced face to match the original landmarks. This creates a better pose and matches the shape and dimensions of the original face better. To reduce the awkward boundaries, Gaussian blur is applied, in particular on the edge area.
Next, let’s examine low-budget Deepfakes productions first. Many high-budget versions still have some of these flaws, but far fewer, and they are more subtle.
Faces in many Deepfakes videos are unusually blurry. There are two major reasons. First, the new face needs to blend well with the rest of the image, so filters are applied that blur the face slightly. Second, many low-budget productions train the model on low-resolution face pictures. Training cost grows quickly with face resolution (the pixel count alone grows with the square of the side length), so a low resolution relaxes both the GPU memory requirement and the training time. In the early days, many low-budget productions used a face resolution of 64 × 64 and produced blurry faces.
Now, many high-budget productions select the input resolution carefully (usually a higher resolution). Combined with days of training on high-end graphics cards, the quality of the video can be significantly improved and the fake becomes hard to detect.
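Blending with a blurred mask can be sketched as follows. This is a minimal NumPy illustration, not the code of any particular Deepfakes package. The helper names and the default `sigma` are assumptions, and landmark warping is left out.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, normalised to sum to 1."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur_mask(mask, sigma=3.0):
    """Separable Gaussian blur of a 2-D float mask (edge-padded, same output size)."""
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    padded = np.pad(mask, pad, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def blend(original, new_face, mask, sigma=3.0):
    """Alpha-blend new_face over original using a feathered mask in [0, 1]."""
    alpha = np.clip(blur_mask(mask.astype(float), sigma), 0.0, 1.0)[..., None]
    return alpha * new_face + (1.0 - alpha) * original
```

A larger `sigma` hides the seam better but also smears more of the original face into the new one, which is one source of the ghosting and softness discussed in this article.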
We can also compare the sharpness, lighting and color tone with other faces in the video. If the other person is real, you may spot the difference easily.
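One rough way to put a number on such a sharpness comparison is the variance of the Laplacian, a common blur heuristic. The sketch below assumes a grayscale frame as a NumPy array and two hand-picked boxes (the suspected face and a reference face). The function names are illustrative, and no particular ratio should be read as proof of manipulation.

```python
import numpy as np

def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian; lower values suggest a blurrier patch."""
    g = gray.astype(float)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def sharpness_ratio(gray_frame, face_box, reference_box):
    """Ratio below 1 means the face patch is blurrier than the reference patch."""
    def crop(box):
        x1, y1, x2, y2 = box
        return gray_frame[y1:y2, x1:x2]
    return laplacian_variance(crop(face_box)) / (laplacian_variance(crop(reference_box)) + 1e-9)
```

The measure depends on content as well as blur (a smooth cheek scores lower than a beard), so it is best used to compare similar regions, such as two faces in the same frame.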
However, in Jordan Peele’s video on Obama, there is only one person in the video. Masks are applied to restrict the change to the mouth & jaw area of Obama only. Other parts of the face are not touched. But, if you look at the video closely, you will still find the mouth area is more blurry compared with the eyes.
Again, this is for low-budget productions only. Many high-budget Deepfakes videos are trained on higher-resolution faces, with the final video in 1440p. So even if the faces are slightly blurred, they still have higher fidelity than what we usually watch in the HD format (720p). This high fidelity lowers our guard against considering them fakes. But in the snapshot below, there are areas where Gaussian blur is unevenly applied, which indicates the image has been manipulated.
However, there are videos where the original faces have heavy makeup or are overexposed. If the model is trained correctly, it will not be easy to locate the flaws mentioned above.
The snapshot on the left below is from a “high-budget” Deepfakes video in high resolution (1440p). It has more detail than the HD version (720p), and it is hard to observe any of the blurriness mentioned before. This is just another example of how Deepfakes can overcome some earlier perceptions of them, like poor fidelity.
In some swapped faces, the skin tone looks unnatural.
Or is it just a bad tanning session of the celebrities? 😂
One way to overcome this problem is by selecting candidates with similar skin tones, hairstyles, and the shapes of the face to swap.
Here, Paul Rudd's face is replaced by Jimmy Fallon's face.
In addition, candidates are selected that are good at impersonating people’s voices, gestures, and expressions.
When we merge the replaced face with the original face, if the mask or the merging is not done properly, we may see two sets of eyebrows: one set from the new face and the other from the original face.
A double chin can happen also but it is harder to tell whether it is natural or not if you do not know the original person well.
While trying to spot abnormalities in the facial area, we can compare the face with other parts of the body. Obviously, you cannot put a 60-year-old actor's face on a 20-something actress, in particular when that is Jennifer Lawrence. The skin texture and the smoothness of the arm will not match the face.
In general, look for differences in tone, sharpness, and texture between the impersonated face and the rest of the current video frame, as well as the rest of the video.
While we explore spatial inconsistency, we can also explore the temporal inconsistency.
One of the major weaknesses of Deepfakes is that video frames are generated independently, frame by frame. Such independence may produce frames with noticeably different tones, lighting, and shadows compared with the previous frame. When the video is played back, flickering occurs.
Sometimes, the quality of the replaced frames is so bad that the bad frames are manually or automatically removed. If not too many frames are skipped, you may not notice it unless you pay close attention.
We take a couple of snapshots below. Even though they are very close in time, the sharpness and tones are noticeably different.
The diagram below shows another two frames with quite different RGB distributions.
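A simple way to look for this kind of frame-to-frame jump is to track the mean color of the face crop over time. The sketch below assumes the face crops have already been extracted as H×W×3 arrays with values in 0 to 255. The function names and the threshold are arbitrary illustrations, not tuned values.

```python
import numpy as np

def channel_means(face_crops):
    """Mean R, G, B of each face crop; input is a list of HxWx3 arrays."""
    return np.array([crop.reshape(-1, 3).mean(axis=0) for crop in face_crops])

def flicker_scores(face_crops):
    """Largest per-channel jump in mean colour between consecutive frames."""
    means = channel_means(face_crops)
    return np.abs(np.diff(means, axis=0)).max(axis=1)

def suspicious_frames(face_crops, threshold=8.0):
    """Indices i where frame i differs sharply from frame i-1 (threshold is a guess)."""
    scores = flicker_scores(face_crops)
    return [int(i) + 1 for i in np.flatnonzero(scores > threshold)]
```

Real scene cuts and lighting changes also trigger this measure, so it helps to compare the face crop against a background crop from the same frames: a jump that appears only in the face is more telling.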
If you play back the video below at 0.25 speed, skin shimmering and unnatural tone changes occur when the head is moving.
In Deepfakes, quick movements often make it hard to create frames with proper temporal smoothness. The changes in the latent factors between neighboring frames may be incorrectly exaggerated by the decoder. This is not easy to solve unless we add an extra term to the cost function to penalize such temporal jitter during training. (This may require some ad-hoc changes to the design and implementation.)
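As an illustration of what such an extra term could look like, the sketch below penalizes frame-to-frame changes in the generated faces that the source video does not show. This is one possible formulation assumed for illustration, not the loss of any specific Deepfakes implementation. The array layout (frames, height, width, channels) and the `weight` value are also assumptions.

```python
import numpy as np

def temporal_penalty(generated, originals):
    """Penalise frame-to-frame change in the output that the source video does not show."""
    gen_delta = np.diff(generated, axis=0)   # change between consecutive generated frames
    src_delta = np.diff(originals, axis=0)   # change between consecutive source frames
    return float(np.mean((gen_delta - src_delta) ** 2))

def total_loss(generated, targets, originals, weight=0.1):
    """Per-frame reconstruction error plus the weighted temporal term."""
    reconstruction = float(np.mean((generated - targets) ** 2))
    return reconstruction + weight * temporal_penalty(generated, originals)
```

Training with such a term means feeding short clips rather than shuffled single frames, which is part of why it needs changes to the usual per-frame pipeline.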
In Deepfakes, there are areas that you should pay special attention to in spotting fake videos. One is the border area of the face where it merges with the original.
But for more serious productions, the artifacts will be less noticeable or unobservable. Better algorithms or manual manipulations may be done in masking the new faces on top of the background.
Here is another “high-budget” production. It is quite flawless unless you pay attention to the edges on Gillian Anderson’s face.
Post-production video editing
In general, adding training data to match the face angle or applying automatic color augmentation during training will solve most of the artifacts mentioned in this article. Nevertheless, manual post-production video editing with a mask is often done to solve the remaining issues.
One of the key shortfalls of most Deepfakes videos is the teeth area. It is hard for the decoder to reconstruct a small area that has a well-defined structure. Often, the teeth in Deepfakes are blurry.
In other cases, the teeth are misaligned, or individual teeth are stretched or shrunk.
In one video, I found that it renders too many teeth. Sometimes, there are a lot of ghosting effects in the rendered teeth, and the teeth look different across video frames. Even in some “high-budget” Deepfakes videos that have high fidelity, the teeth can still render incorrectly. As shown below, a few teeth are connected together.
While I was comparing the Deepfakes reproduction of The Silence of the Lambs with the original, I found that a few seconds of the original clip are missing.
I speculate that the missing part contains a pose with the camera viewing from below the jaw of Anthony Hopkins. It is highly likely that the producers did not have enough video frames of Willem Dafoe to train the Deepfakes model to reproduce this scene correctly, so it was edited out manually. In many Deepfakes videos, the side view of the impersonator is one of the weakest links.
While the Breaking Bad Deepfakes video does an excellent job of impersonating Donald Trump, its side view does not do so well.
Nevertheless, this problem can be solved by adding relevant video frames in model training. We will discuss this later.
An object that moves across the face and obscures it can sometimes confuse the Deepfakes model. The key reason is that the model does not have enough data to learn such situations correctly. In one “high-budget” production, it looks like someone took a bite out of the lid obscuring the face on the left. Therefore, I often look for obscured faces and check whether anything is wrong.
Glare & reflection
Some of the glare or reflections in Deepfakes look exaggerated, missing, or lacking the proper complexity. Again, this is the problem Deepfakes have in rendering small structures. Nevertheless, this cue usually increases my confidence that a video is real rather than helping me flag a fake one.
In many “low-budget” productions, the temple of the eyeglasses is missing.
We build the Deepfakes model with still frames in 2-D. Operations such as warping may lose important 3-D information during the process. For example, we may see lazy eyes in the video that are not present in the original video.
This kind of problem can happen in GAN also as explained by the StyleGAN2 paper:
In this example the teeth do not follow the pose but stay aligned to the camera, as indicated by the blue line.
Politicians & celebrities
Face shape and aspect
Politicians and celebrities are one major source of impersonation. Deepfakes are often applied to celebrity pornographic videos.
In the majority of cases today, the outline of the face is not replaced. Therefore, we can build a database of the face shapes of public figures to spot forgeries. Newer technology may apply GANs to replace the outline of the face as well, but this is still in an early phase. As a side note, contrary to a common misconception, most Deepfakes applications do not use GANs.
For example, the long forehead of Stallone in the Terminator Deepfakes does not look right for Stallone.
The term “high-budget” production in this article does not necessarily mean a project that spends tons of money. We actually refer to projects that have people with the right know-how, a decent graphics card, and a reasonable amount of time (days) to train the model. Collecting, selecting, and cleaning up the training dataset is critical to the quality of the project. It is not hard to gain the professional knowledge either. There are tons of online tutorials and free tools. You may need some trial and error, but no AI knowledge is needed. (AI knowledge may help, but many guides will give you enough suggestions.) Manual post-production editing is often applied to produce top-quality videos. Many people with video editing experience can learn the whole process quickly.
In this article, we may make it sound like Deepfakes are easy to spot visually. That is not true for the latest videos, as the general public gains more know-how in producing them. There is no one-size-fits-all troubleshooting guide for detecting Deepfakes videos. Different videos may have different mistakes. Worse, fewer mistakes are made and they are harder to find. In later articles, we will look into some programmatic ways of detecting them. With the knowledge in this article, here are a couple of videos for you to analyze and apply what you have learned.
One of the obvious mistakes is the eye, if you look closer. The pupil is not a circle!
As mentioned before, the boundary also reveals the Deepfake video.
Let’s check out another video.
However, the wrinkle around the eyes does not match the smoothness around the chin. In many Deepfakes videos of celebrities, this happens very often. But again, this can be simply a bad botox session.
The shadow on the side of the face does not seem natural. Unfortunately, it is not obvious enough to be a definite factor in calling it fake.
In addition, the scar on the face is hard to reproduce, as it is hard to collect data of Jared Kushner with scars. Instead, the reproduced frame shows only blurred marks on the face.
Here is another fun Deepfakes video
and the original for you to detect any issues.
Deepfakes in Politics
Despite the media attention on Deepfakes, the abuse of Deepfakes in politics is still relatively small in 2020. More likely, it will be used as a last-minute surprise rather than a daily attacking mechanism. Many existing political Deepfakes videos come with a disclosure that they are created with Deepfakes (like the ones below). But this can change as software like Reface, FaceSwap, and DeepFaceLab becomes more popular with the general public.
The videos below encourage safeguarding the 2020 elections and voting rights. They come with a disclosure at the end that the videos are fake. They were filmed with two actors who have face shapes similar to Putin's and Kim's respectively and who imitate their accents in reciting the script. The faces were then swapped with Putin’s and Kim’s faces using an open-source Deepfakes package, and the result was improved by post-production video editing. Because of the higher quality requirements, the whole process took 10 days, which is longer than average.
But there remain some shortcomings that many high-quality videos overlook.
- If you pay close attention to the teeth, you will find the rendering is wrong once in a while.
- Some areas of the face are more blurry than other parts.
- The movement of the chin and the edge area stands out against the background.
Yet the largest giveaway is the head movement. Many politicians speak with much larger and more frequent head movements, as we will demonstrate in later articles.
We tend to believe what we want to believe. A fake video of Nancy Pelosi that appeared to show slurred or drunken speech circulated around the Internet. This low-quality fabrication was not created with Deepfakes. Instead, the video was slowed down by 25% and the pitch was altered to make it sound like she was slurring her words. The lesson here is that low-quality fake videos can also be widely distributed. Content is pushed on social platforms by engagement algorithms, and none of it passes through any journalistic standard. So do check the source carefully. Social media is usually a poor source of information.
The challenge of fake videos also poses problems for real news. We will probably often hear politicians frame their scandals as fakes. This happened before Deepfakes, but it can now be more confusing.
Deepfakes is just one of the many approaches to generate fake videos. Part 2 looks into more academic approaches in this area first.
Detect AI-generated Images & Deepfakes (Part 2)
Deepfakes has gained tremendous traction because many easy-to-use free software packages are available which require no…
Part 3 of the series looks into the detail of two popular packages in generating Deepfakes: Faceswap & DeepFaceLab.
Detect AI-generated Images & Deepfakes (Part 3)
Two popular tools in generating Deepfakes video are Faceswap and DeepFaceLab. This article is not a tutorial for both…
Finally, we look at ways of detecting Deepfakes and fabricated images/videos with machine learning and deep learning.
Detect AI-generated Images & Deepfakes (Part 4)
Finally, our last part of the series looks at detecting Deepfakes videos with machine learning (ML) and/or deep…
Credits & References
For the second picture of a woman's head, I originally got it from a royalty-free source but unfortunately cannot trace that source anymore.