Detect AI-generated Images & Deepfakes (Part 3)

Jonathan Hui
Apr 5, 2020



Two popular tools for generating Deepfakes videos are Faceswap and DeepFaceLab. Our intention is to learn the process and gain insights into how Deepfakes videos are actually produced. We will also learn the measures people take to correct some shortfalls of the fabricated videos. However, this article is not a tutorial for either tool. We will not detail logistical steps or basic deep learning configuration. There are guides that serve this purpose better. We assume you have basic Deep Learning (DL) knowledge. The general DL training technique is not specific to Deepfakes and is covered nicely in those guides.

Faceswap

In many Deepfakes software packages, the process is broken down into steps that offer different commands and options. So let’s look into them in sequential order.

Extraction

Extraction is the phase where faces are collected from video frames or still images. This phase generates two directories separately and independently: the source and the target. The source is the face, the identity that we want to impose on top of the target video. This process creates the facesets needed for training.

The extraction process is sub-divided into detection, alignment, and mask generation. In detection, faces are detected and collected from video frames or from a folder containing still images. Alignment locates the 68-point face landmarks. This information is needed to transform faces during training if “Warp to Landmarks” is enabled, and to locate the face during the video conversion that produces the fabricated video.
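To make the idea concrete, here is a minimal sketch (not Faceswap’s actual code) of detection plus 68-point landmark alignment using the dlib package. The frame filename and the pre-trained landmark model file are assumptions for illustration.

```python
# Not Faceswap's code -- a minimal sketch of detection plus 68-point
# alignment with dlib, assuming shape_predictor_68_face_landmarks.dat
# has been downloaded separately.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

frame = cv2.imread("frame_0001.png")              # hypothetical frame file
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

for box in detector(gray):                        # face detection
    shape = predictor(gray, box)                  # 68-point landmarks
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
    print(box, landmarks[:5])
```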


The landmark information is also needed to realign a face. The Deepfakes encoder is trained with a fixed input size determined by the selected training model (say 128×128 pixels), i.e. once a model is selected and configured, it takes faces of one fixed size only. So after face detection, re-alignment is performed to produce the final face crop at that size.

But we do not want the face to be too small. Therefore, if the detected face is smaller than a pre-configured size, it will be dropped.
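A rough sketch of this realignment and size filtering, assuming a 128×128 model input. The reference landmark positions and the minimum-size heuristic below are made up for illustration and are not Faceswap’s actual values.

```python
# Illustrative realignment: warp the detected face to a fixed 128x128 model
# input with a transform estimated from the landmarks, and drop faces that
# look too small. Reference points and thresholds are invented for the sketch.
import cv2
import numpy as np

MODEL_SIZE = 128
MIN_FACE_SIZE = 40                       # drop faces smaller than this (pixels)

# Three anchor points (left eye, right eye, mouth center) in the 128x128 crop.
REFERENCE = np.float32([[38, 48], [90, 48], [64, 96]])

def align_face(frame, landmarks):
    """landmarks: 68x2 array from the alignment step."""
    left_eye = landmarks[36:42].mean(axis=0)
    right_eye = landmarks[42:48].mean(axis=0)
    mouth = landmarks[48:68].mean(axis=0)
    # Rough face-size estimate from the eye distance; skip tiny faces.
    if np.linalg.norm(right_eye - left_eye) * 3 < MIN_FACE_SIZE:
        return None
    src = np.float32([left_eye, right_eye, mouth])
    matrix = cv2.getAffineTransform(src, REFERENCE)
    return cv2.warpAffine(frame, matrix, (MODEL_SIZE, MODEL_SIZE))
```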

Mask generation creates a mask separating the face and the rest of the image (background) which will be used in training and conversion.

Source (Samples of the masked face)

This mask and the landmark information will be stored in an alignments file.
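As an illustration only, a simple landmark-based mask can be built from the convex hull of the landmark points. Faceswap’s masking plugins and its alignments file format are more sophisticated than this sketch; the JSON-style storage below is just a stand-in.

```python
# Illustration only: a landmark-based mask via convex hull, plus a toy
# JSON-like "alignments" store. Not Faceswap's actual masking or file format.
import cv2
import json
import numpy as np

def landmark_mask(shape_hw, landmarks):
    """Fill the convex hull of the landmark points: face = 255, background = 0."""
    mask = np.zeros(shape_hw, dtype=np.uint8)
    hull = cv2.convexHull(np.asarray(landmarks, dtype=np.int32))
    cv2.fillConvexPoly(mask, hull, 255)
    return mask

def save_alignments(path, entries):
    """entries: {frame filename: {"landmarks": [...], "mask": [...]}}"""
    with open(path, "w") as f:
        json.dump(entries, f)
```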

Like other steps, you can choose different plugins/extensions to perform a specific task, including different detection or alignment packages, and there are other optional configurations to choose from. Many tutorials give general suggestions and tradeoffs. Otherwise, it will be a trial-and-error process that depends on the artifacts in the generated video and the training configuration. For example, different masking methods have different coverage of the facial areas and different capabilities in handling face obstructions. Your choices can strongly depend on the issues you discover in the fabricated video.


Faceswap also provides options like “normalizing the input” so that faces are easier to detect in bad lighting conditions. If a video is used for source extraction, we can sample one frame every half second to a second. This avoids flooding the source dataset with too many similar samples that unnecessarily prolong the training time.
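A minimal sketch of that sampling with OpenCV, keeping roughly two frames per second; the video filename and output folder are placeholders.

```python
# A sketch of sampling about two frames per second from a source video with
# OpenCV, so the faceset is not flooded with near-duplicate frames.
import cv2

cap = cv2.VideoCapture("source.mp4")            # hypothetical file name
fps = cap.get(cv2.CAP_PROP_FPS) or 25           # fall back if fps is unknown
step = max(1, int(fps / 2))                     # keep ~2 frames per second

index = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % step == 0:
        cv2.imwrite(f"src_frames/frame_{saved:05d}.png", frame)
        saved += 1
    index += 1
cap.release()
```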

Sorting

The facesets dictate how well the model learns. This is the most important factor for each project.

Many problems, including flickering, are caused by missing data samples with poses (angles) similar to the target video. Indeed, to beat some Deepfakes detection software or to remove typical artifacts, we can focus on improving the facesets alone. Therefore, after collecting the facesets, the first task is to clean them up and add new data.

Nevertheless, this is an iterative process. We will continuously identify problems in the converted videos and add new source video frames to bridge any data gaps.


To clean up the facesets, we examine them visually to remove false positives (images wrongly detected as faces) and “other faces” that do not belong to the target person. Faceswap provides sorting options so we can eliminate bad samples quickly. For example, with the “sort by face” option, faces are renamed according to similarity (i.e. identity). Faces with the wrong identity therefore end up at the beginning or end of the list, and we can remove these files quickly.
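The idea behind “sort by face” can be sketched with face embeddings. The snippet below is not Faceswap’s sorter; it uses the face_recognition package and a hypothetical folder name to rank faces by how far they are from the average identity, so outliers are easy to spot.

```python
# Not Faceswap's sorter -- a sketch of ranking face crops by identity
# similarity using embeddings from the face_recognition package.
import glob
import numpy as np
import face_recognition

paths = sorted(glob.glob("src_faces/*.png"))           # hypothetical folder
embeddings, kept = [], []
for path in paths:
    image = face_recognition.load_image_file(path)
    encodings = face_recognition.face_encodings(image)
    if encodings:                                      # skip crops with no detectable face
        embeddings.append(encodings[0])
        kept.append(path)

center = np.mean(embeddings, axis=0)                   # the "average" identity
distances = [np.linalg.norm(e - center) for e in embeddings]

# Largest distance first: review or delete these entries in the viewer.
for dist, path in sorted(zip(distances, kept), reverse=True):
    print(f"{dist:.3f}  {path}")
```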

Clean the Alignments File

With faces deleted in the previous step, this step removes the corresponding entries in the alignments file and renames the face files back to the originals. Again, we only cover the high-level picture in many sections here; please refer to the guide for detailed instructions.

Manually Fix Alignments

Next, we can manually review and fix any alignment issues, such as an upside-down face. For the training dataset, we mainly make sure the masks are correct. For conversion, this step is more important: we want to select the right face if multiple persons are present and correct any missing or imprecise alignments.

Extracting a training set from an alignments file

After the cleanup, we can extract the faceset from the alignments file.

Merging facesets for training

Sometimes the extraction phase takes too long, so we want to break it up. We build up the training dataset incrementally, repeating the process above many times with different folders. Finally, we can merge them together with Faceswap.

In addition, to create a high-quality source faceset, we need to collect sources with different variations in expression, angle (pose), and illumination so that the model can learn to mimic the conditions of the target scenes. Often, we review the generated video frames for flickering and artifacts. Then we augment the training data with samples that mimic the problematic target frames, at least with a similar pose.

Get high-quality videos and images if you can. The head angles should include looking right, left, up, down, straight at the camera, and everything in between. This may require more than one source video to cover. You also need different expressions, including open or closed mouths, blinking, smiling, disgust, anger, happiness, etc. If done correctly, 1K to 10K extracted faces are well suited for training. They don’t need to match the exact combinations in the scene, since the model should be able to replicate those combinations itself. If you are merging facesets from multiple sources, you may want to eliminate faces that look too similar, since duplicates slow down training.

In many good-quality Deepfakes videos, the producers make sure the person in the target video has a similar hairstyle, face shape, and camera angles compared with the source. Avoid sources and destinations with strong directional lighting; recreating the shadows is hard. We want flat lighting.

Left (Wikipedia) Right (UC Berkeley photo by Stephen McNally)

Next, we will focus on training.

Choosing a training model

Faceswap provides many training models that use different input dimensions for the face (e.g. 64 × 64, 128 × 128, 512 × 512). Some of them also have different model designs and configurable parameters. (Some knowledge of deep learning will be helpful for complex models.) However, doubling the width and height of the input at least quadruples the training time. Since training takes days, you may experiment a little to find the sweet spot. Some practitioners increase the face dimension only after reviewing the lower-resolution results first.

Model configuration settings

Once you select a model, there are training options to configure.

Modified from source

By reducing the face coverage, the model can focus on re-rendering a smaller area of the face with a higher level of detail. The mask, if enabled, focuses training on the face area and gives less importance to the background. Again, you need to play around with these settings for your videos.

Also, there are options for augmenting the data automatically.


For example, the image can be color augmented similar to the one below.

Source

It improves the skin tone of the merged video.
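Color augmentation itself is easy to sketch. The ranges below are arbitrary and Faceswap’s own augmentation pipeline differs, but the idea is simple random shifts in brightness, contrast, and color balance.

```python
# Not Faceswap's augmentation code -- a minimal sketch of random color
# augmentation, assuming BGR uint8 face crops loaded with OpenCV.
import cv2
import numpy as np

def color_augment(face, rng=np.random.default_rng()):
    """Randomly jitter brightness, contrast, and color balance of a face crop."""
    img = face.astype(np.float32)
    contrast = rng.uniform(0.8, 1.2)            # random contrast scale
    brightness = rng.uniform(-20, 20)           # random brightness offset
    img = img * contrast + brightness
    img += rng.uniform(-10, 10, size=(1, 1, 3)) # per-channel color balance shift
    return np.clip(img, 0, 255).astype(np.uint8)
```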


Training

Now, we are ready for training.


And we can monitor the progress. The first column shows the original faces, the second the decoded (reconstructed) faces, and the third the swapped faces.

Source

We can also monitor the loss values.

Source
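Conceptually, these packages train a shared encoder with one decoder per identity, and the swap happens by feeding one person’s face through the other person’s decoder. Below is a PyTorch sketch of that idea; the layer sizes, learning rate, and loss are illustrative and not Faceswap’s actual model.

```python
# Conceptual sketch of the shared-encoder / two-decoder training behind
# face swapping -- not Faceswap's actual architecture or hyperparameters.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 5, stride=2, padding=2),
                         nn.LeakyReLU(0.1))

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(conv_block(3, 64), conv_block(64, 128),
                                 conv_block(128, 256), nn.Flatten(),
                                 nn.Linear(256 * 16 * 16, 512))
    def forward(self, x):                 # x: (N, 3, 128, 128)
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(512, 256 * 16 * 16)
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(256, 128, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid())
    def forward(self, z):
        x = self.fc(z).view(-1, 256, 16, 16)
        return self.net(x)                # (N, 3, 128, 128)

encoder = Encoder()                                # shared between both identities
decoder_a, decoder_b = Decoder(), Decoder()        # one decoder per identity
opt = torch.optim.Adam(list(encoder.parameters()) +
                       list(decoder_a.parameters()) +
                       list(decoder_b.parameters()), lr=5e-5)
loss_fn = nn.L1Loss()

def train_step(warped_a, target_a, warped_b, target_b):
    # Each identity is reconstructed through the shared encoder and its own decoder.
    loss = (loss_fn(decoder_a(encoder(warped_a)), target_a) +
            loss_fn(decoder_b(encoder(warped_b)), target_b))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# At conversion time, the swap is decoder_a(encoder(face_b)): person B's face
# is re-rendered with person A's identity.
```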

Conversion

Finally, after the model is trained, we specify the video to be converted. This step is relatively simple. We will not go into details because the conversion code is being reworked.

DeepFaceLab (DFL 2.0)

Next, we will look into another popular package, DFL 2.0. Here is a video illustrating the steps with different scripts. You can go through it quickly to get some high-level ideas.

Many steps and options are similar to Faceswap’s. For simplicity, we will be brief or skip some details.

Workspace cleanup/deletion

Remove the old datasets from the workspace and restart a fresh session.

Frames extraction from source video

Extract frames from the source video.

Frames extraction from target video

Extract frames from the target video.

Source faces extraction/alignment

The source faces are extracted together with their landmarks and saved as 512×512 images. Options like whole face, full face, mid-half face, and half face are provided for different face coverage during training.


DFL also provides options to upsample and enhance images for low-resolution faces.

Source faceset clean up

DFL provides many options to sort faces for viewing and cleaning. Some options, like face yaw, face pitch, hue, brightness, etc…, can be very helpful in identifying missing data.


There are other options that are helpful in deleting bad faces, including blur, face size, brightness, one face in an image, etc…

Here are the general recommendations for cleaning up the data. The faces are grouped, with different colored boundaries indicating their potential issues.

Modified from source

But there are some exceptions. If an image is very unique, such as a hard-to-find angle or expression that also occurs in the target, you may keep it or try to fix it. On the other hand, if the quality is too bad and there are too many such images, you may delete some or all of them.

Destination preparation

This step performs the extraction and alignment of the destination faces.

Destination cleanup

Again, we can use the sorting tool to preview the facesets and perform the cleanup. For this step, it is important to ensure that all the target faces are extracted and aligned with precise landmarks.

First, we can sort by histogram to explore similarities between faces. This is handy for rejecting faces that do not belong to the person of interest. We can also use it to delete zoomed-in/out faces, false positives, and “other faces”. We will also delete incorrectly aligned or rotated faces, or fix them manually. For frames where extraction failed, we can extract the face manually with tools in DFL.
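Histogram sorting is easy to illustrate. The sketch below is not DFL’s implementation; it ranks face crops in a hypothetical folder by color-histogram similarity to a reference face so that outliers stand out.

```python
# Not DFL's implementation -- a minimal sketch of sorting face crops by
# color-histogram similarity with OpenCV, assuming a folder of face images.
import cv2
import glob

def hist(path):
    img = cv2.imread(path)
    h = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
    return cv2.normalize(h, h).flatten()

paths = sorted(glob.glob("dst_faces/*.jpg"))    # hypothetical folder name
hists = [hist(p) for p in paths]
ref = hists[0]                                  # compare everything to the first face
scores = [cv2.compareHist(ref, h, cv2.HISTCMP_CORREL) for h in hists]

# Faces least similar to the reference sort to the front of this list,
# which makes outliers (other people, false positives) easy to spot.
for score, path in sorted(zip(scores, paths)):
    print(f"{score:.3f}  {path}")
```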

Training

There are two models available for training. Quick96 requires 2–4GB of video memory and supports 96×96 face resolution. It has few configurable parameters and is geared towards low-budget productions. SAEHD is more powerful, with more configurable options. It requires 6GB+ of video memory and supports resolution up to 512×512 with half face, mid-half face, full face, and whole face modes.

The whole face mode trains the whole face, including part of the hair area. Below is the result of whole face mode training.

Source

There are 2 main architectures to choose from in SAEHD:

  • DF: It requires the source and destination faces to have a similar face shape because it performs a more direct face swap without morphing. It works better on frontal shots but may be worse on side profiles. To perform well, the source faceset should cover all the face angles required by the target video.
  • LIAE: It performs morphing, so it does not have a strict requirement on face shape. But again, to perform well, it is recommended that the source and the target have similar facial features, such as the eyes and nose. Also, it handles side profiles better than frontal shots.

There are experimental HD versions of both, aimed at improving quality. Here are some sample outputs:

Source

SAEHD has many configurable parameters for the model and the DL training.

  • Change the model capacity for the encoder, decoder, and the masker.
  • DL parameters: Batch size, total iterations, backup frequency, gradient clipping.
  • Eye priority with a better focus on eye rendering.
  • Enable GAN (but enable it only after the trained model is quite mature).

There are other options that make the merged face look closer to the source or take on more of the style of the target video. Another option is color transfer, which makes the swapped face match the skin tone and color of the target better. (We skip a lot of details here, so refer to the guides if needed.)
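One classic way to do such a color transfer is Reinhard-style statistics matching in LAB space. The sketch below only illustrates that idea; DFL offers several transfer modes, and this is not necessarily how they are implemented.

```python
# A sketch of one common color-transfer technique (Reinhard-style mean/std
# matching in LAB space) -- an illustration of the idea, not DFL's code.
import cv2
import numpy as np

def color_transfer(swapped_face, target_face):
    """Shift the swapped face's color statistics toward the target face."""
    src = cv2.cvtColor(swapped_face, cv2.COLOR_BGR2LAB).astype(np.float32)
    dst = cv2.cvtColor(target_face, cv2.COLOR_BGR2LAB).astype(np.float32)
    for c in range(3):
        s_mean, s_std = src[..., c].mean(), src[..., c].std() + 1e-6
        d_mean, d_std = dst[..., c].mean(), dst[..., c].std()
        src[..., c] = (src[..., c] - s_mean) * (d_std / s_std) + d_mean
    out = np.clip(src, 0, 255).astype(np.uint8)
    return cv2.cvtColor(out, cv2.COLOR_LAB2BGR)
```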

Merging

Merging blends the learned face over the target frames to form the final video. Here is the converter preview window.

It includes options on:

  • How to overlay the new face, including the use of a histogram to match the new face with the target better.
  • Masks: expand or contract the mask, or control its other features (see the blending sketch after this list).
  • Motion blur.
  • Image enhancement: for the eyes, teeth, and the detail and texture of the face.
  • Sharpen or blur the learned face.
  • Scale the learned face.
  • Ways of creating masks (such as from landmarks or through learning).
  • A color transfer that makes the swapped face match the skin tone and color of the target better.
  • Image degrade modes: make the original frame more blurry or decrease the color bit depth.
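To show what the mask-based blending boils down to, here is a minimal sketch of feathered alpha blending, assuming the learned face has already been warped into frame coordinates and a binary mask is available. DFL’s merger is far more configurable than this.

```python
# Not DFL's merger -- a minimal sketch of feathered alpha blending of a
# learned face back into the target frame.
import cv2
import numpy as np

def blend(frame, warped_face, mask, feather=15):
    """Alpha-blend warped_face over frame using a blurred (feathered) 0/255 mask."""
    alpha = cv2.GaussianBlur(mask.astype(np.float32) / 255.0, (0, 0), feather)
    alpha = alpha[..., None]                      # HxWx1 so it broadcasts over BGR
    out = (alpha * warped_face.astype(np.float32) +
           (1 - alpha) * frame.astype(np.float32))
    return out.astype(np.uint8)
```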

Conversion of frames into video

Next, the converted frames are combined into a video.
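DFL ships scripts for this step. Under the hood, the common approach is an ffmpeg call along these lines; the frame rate, file pattern, and codec here are only examples.

```python
# A generic example of stitching numbered frames into an H.264 video with
# ffmpeg -- DFL provides its own scripts; values below are placeholders.
import subprocess

subprocess.run([
    "ffmpeg",
    "-framerate", "30",                    # match the original video's frame rate
    "-i", "merged/frame_%05d.png",         # hypothetical merged-frame pattern
    "-c:v", "libx264",
    "-pix_fmt", "yuv420p",                 # broad player compatibility
    "result.mp4",
], check=True)
```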

Post-processing

Post-processing with a video editing tool (like After Effects or HitFilm Express) is important for high-budget productions. The swapped faces are manipulated to match the illumination, tone, etc. of the background face, and the mask is refined to blend the swapped face better. The first video below is a generic tutorial on video editing for face replacement.

Source

The second video demonstrates the post-processing for a Deepfakes video.

Source

Tips in Deepfakes

Here are some dos and don’ts that people follow:

  • Use a source actor that has a similar face and skin tone as the target actor.
  • Use a wig to create a hairstyle similar to the target actor’s.
  • Avoid directional lighting. Use flat lighting.
  • Avoid profile shots (shot from the side of the face) if possible. They are hard to get right.
  • Avoid close-up shots if possible, or train the model with a face resolution as close to the target face size as possible.
  • Collect a source faceset with many different face angles.
  • Avoid face obstructions in the target video.
  • Make sure the face alignment is correct when preparing the faceset.
  • Invest heavily in post-production to remove flaws.

Next

Finally, we come to the last part of the series: detecting Deepfakes with deep learning.

Credits and References

Faceswap

DeepFaceLab

How DeepFakes FaceSwap Work: Step by Step Explanation with Codes

HitFilm Express

How We Faked Keanu Reeves Stopping a Robbery
