How deep learning fakes videos (Deepfake) and how to detect them

Basic concept

The concept of Deepfakes is very simple. Let’s say we want to transfer the face of person A to a video of person B.

Source: Derpfakes and wikipedia

Before training, we need to prepare thousands of images of both persons. We can take a shortcut and use a face detection library to scrape facial pictures from their videos. Spend significant time improving the quality of these facial pictures, because it impacts the final result significantly.

  • Remove any frames that contain more than one person.
  • Make sure you have an abundance of video footage. Extract facial pictures that cover different poses, face angles, and facial expressions.
  • Remove any low-quality, tinted, small, badly lit, or occluded facial pictures.
  • Some resemblance between the two persons may help, such as a similar face shape.
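
The cleanup rules above can be sketched as a simple filter. This is only an illustration: the `FaceBox` record, the `sharpness` score, and both thresholds are hypothetical stand-ins for whatever your face detection library reports, not values from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class FaceBox:
    """One detected face (hypothetical record, not a real library type)."""
    width: int        # face width in pixels
    height: int       # face height in pixels
    sharpness: float  # any blur score where higher means sharper


def keep_frame(faces, min_side=64, min_sharpness=0.5):
    """Return True if a frame is usable as training data.

    A frame is kept only when exactly one face is detected and that
    face is both large enough and sharp enough. Thresholds are assumed.
    """
    if len(faces) != 1:                          # zero or several people
        return False
    face = faces[0]
    if min(face.width, face.height) < min_side:  # face too small
        return False
    return face.sharpness >= min_sharpness       # drop blurry faces


# Toy detections for four frames.
frames = {
    "frame_001": [FaceBox(120, 130, 0.9)],
    "frame_002": [FaceBox(120, 130, 0.9), FaceBox(80, 90, 0.8)],
    "frame_003": [FaceBox(40, 42, 0.9)],
    "frame_004": [FaceBox(110, 115, 0.2)],
}
kept = [name for name, faces in frames.items() if keep_frame(faces)]
print(kept)
```

Tint, lighting, and occlusion checks are left out here; they would be further conditions in `keep_frame`.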
A 2 × 2 grid point example.

Deep network model

Let’s take a short break to illustrate what the autoencoder may look like. (Some basic knowledge of CNNs is needed here.) The encoder is composed of 5 convolution layers that extract features, followed by 2 dense layers. It then uses a convolution layer to upsample the image. The decoder continues the upsampling with 4 more convolution layers until it reconstructs the 64 × 64 image.
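
To make the layer description concrete, here is a rough trace of how the spatial size could change through such a network. The layer counts come from the text, but the strides, the 4 × 4 reshape after the dense layers, and the split of the decoder into 3 upsampling layers plus 1 output layer are my assumptions for illustration, not the exact configuration of any implementation.

```python
def conv_out(size, stride=2):
    """Spatial size after a 'same'-padded convolution with this stride."""
    return -(-size // stride)  # ceiling division


def upsample_out(size, factor=2):
    """Spatial size after a convolution block that upsamples by `factor`."""
    return size * factor


def trace_autoencoder(input_size=64):
    """Return (stage, spatial_size) pairs through the assumed network."""
    trace = [("input", input_size)]
    size = input_size
    for i in range(5):                       # encoder: 5 strided convolutions
        size = conv_out(size)
        trace.append((f"enc_conv{i + 1}", size))
    trace.append(("dense1", 1))              # dense layers have no spatial grid
    size = 4                                 # assumed: dense2 reshaped to 4 x 4
    trace.append(("dense2_reshaped", size))
    size = upsample_out(size)                # encoder's upsampling convolution
    trace.append(("enc_upsample", size))
    for i in range(3):                       # decoder: 3 upsampling convolutions
        size = upsample_out(size)
        trace.append((f"dec_upsample{i + 1}", size))
    trace.append(("dec_output_conv", size))  # 4th decoder conv keeps the size
    return trace


for stage, size in trace_autoencoder():
    print(f"{stage:16s} {size} x {size}")
```

Under these assumptions the image shrinks from 64 × 64 to 2 × 2 in the encoder and is rebuilt to 64 × 64 by the end of the decoder.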

Problems

Don’t get too excited. If you use a bad implementation or a bad configuration, or if your model is not properly trained, you will get a result like the following video instead. (Check out the first few seconds. I have already cued the video to around 3:37.)

  • apply a Gaussian filter to further diffuse the mask boundary area,
  • configure the application to expand or contract the mask further, or
  • control the shape of the mask.
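
As a minimal sketch of the first option, the code below softens a hard binary mask with a separable Gaussian blur and then uses it to blend the created face into the target frame. It uses plain Python lists on a toy 7 × 7 image; the `radius` and `sigma` values are assumptions, and a real pipeline would use an image library instead.

```python
import math


def gaussian_kernel(radius, sigma):
    """Return a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Blur each row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [[sum(k * row[min(max(x + i - radius, 0), width - 1)]
                 for i, k in enumerate(kernel))
             for x in range(width)]
            for row in image]


def transpose(image):
    """Swap rows and columns of a 2-D list."""
    return [list(col) for col in zip(*image)]


def feather_mask(mask, radius=2, sigma=1.0):
    """Soften a binary mask: horizontal pass, then vertical pass."""
    kernel = gaussian_kernel(radius, sigma)
    blurred = blur_rows(mask, kernel)
    return transpose(blur_rows(transpose(blurred), kernel))


def blend(created, target, mask):
    """Per-pixel mix: mask=1 takes the created face, mask=0 the target."""
    return [[m * c + (1 - m) * t for c, t, m in zip(cr, tr, mr)]
            for cr, tr, mr in zip(created, target, mask)]


# Toy example: a 3 x 3 square mask in the middle of a 7 x 7 frame.
size = 7
hard_mask = [[1.0 if 2 <= y <= 4 and 2 <= x <= 4 else 0.0
              for x in range(size)] for y in range(size)]
soft_mask = feather_mask(hard_mask)
created = [[1.0] * size for _ in range(size)]  # stand-in for the created face
target = [[0.0] * size for _ in range(size)]   # stand-in for the target frame
result = blend(created, target, soft_mask)
```

With the softened mask, the blend weight falls off gradually from the center of the face toward the border instead of switching abruptly at the mask edge.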
Better mask

In our previous effort, the mask was pre-configured. We can do a much better job if the mask depends on the input image and the created face.

GAN

In a GAN, we introduce a deep network discriminator (a CNN classifier) to distinguish whether facial images are original or created by the computer. When we feed real images to this discriminator, we train the discriminator itself to recognize real images better. When we feed created images into the discriminator, we use it to train our autoencoder to create more realistic images. We turn this into a race in which the created images eventually become indistinguishable from the real ones.
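
The two sides of this race can be written as two losses. The sketch below uses the standard binary cross-entropy GAN formulation, which is my assumption about the setup rather than the exact losses of any Deepfake implementation. Note that in this formulation the discriminator is also trained on created images, with the label "fake".

```python
import math


def bce(prediction, label, eps=1e-7):
    """Binary cross-entropy for one probability `prediction` in (0, 1)."""
    p = min(max(prediction, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def discriminator_loss(scores_real, scores_fake):
    """The discriminator wants real images -> 1 and created images -> 0."""
    real = sum(bce(s, 1.0) for s in scores_real) / len(scores_real)
    fake = sum(bce(s, 0.0) for s in scores_fake) / len(scores_fake)
    return real + fake


def generator_loss(scores_fake):
    """The autoencoder wants its created images to be scored as real."""
    return sum(bce(s, 1.0) for s in scores_fake) / len(scores_fake)


# Discriminator scores are probabilities that an image is real.
confident = discriminator_loss([0.9, 0.8], [0.1, 0.2])  # tells them apart
fooled = discriminator_loss([0.5, 0.5], [0.5, 0.5])     # cannot tell
print(confident < fooled)
```

When the discriminator outputs 0.5 for everything, it can no longer separate real from created images, which is the end state the race aims for.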

Loss function

Besides the reconstruction cost, GAN adds generator and discriminator costs to train the model. Indeed, we can add additional loss functions to perfect our model. A common one is the edge cost, which measures whether the target image and the created image have the same edges at the same locations. Some people also look into perceptual loss. The reconstruction cost measures the pixel difference between the target image and the created image. However, this may not be a good metric for how our brains perceive objects. Therefore, some people use a perceptual loss to replace the original reconstruction loss. This is fairly advanced, so I will let enthusiasts read the paper in the reference section instead. You can further analyze where your fake videos perform badly and introduce a new cost function to address the problem.
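
Here is a small sketch of how a pixel reconstruction cost and an edge cost could be combined with the adversarial term. Using finite differences as the edge map, the L1 distance, and the weights `w_edge` and `w_adv` are all assumptions for illustration; real implementations may define the edge cost differently.

```python
def l1_loss(a, b):
    """Mean absolute difference between two equally sized 2-D images."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def gradients(image):
    """Horizontal and vertical finite differences as simple edge maps."""
    dx = [[row[x + 1] - row[x] for x in range(len(row) - 1)] for row in image]
    dy = [[image[y + 1][x] - image[y][x] for x in range(len(image[0]))]
          for y in range(len(image) - 1)]
    return dx, dy


def edge_loss(target, created):
    """Penalize edges that differ in strength or location."""
    tdx, tdy = gradients(target)
    cdx, cdy = gradients(created)
    return l1_loss(tdx, cdx) + l1_loss(tdy, cdy)


def total_loss(target, created, adversarial, w_edge=1.0, w_adv=0.1):
    """Weighted sum of reconstruction, edge, and adversarial costs."""
    return (l1_loss(target, created)
            + w_edge * edge_loss(target, created)
            + w_adv * adversarial)


# A vertical edge, the same edge shifted by one pixel, and a brighter copy.
target = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
shifted = [[0.0, 0.0, 0.0, 1.0] for _ in range(4)]
brighter = [[v + 0.2 for v in row] for row in target]

print(l1_loss(target, shifted), edge_loss(target, shifted))
print(l1_loss(target, brighter), edge_loss(target, brighter))
```

The brighter copy has a nonzero pixel cost but an edge cost of essentially zero, while the shifted edge is penalized by both. This shows how the two terms capture different kinds of errors.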

Demonstration

Let me pick some good Deepfake videos and see whether you can detect them now. Play them in slow motion and pay special attention to:

  • Is the face overly blurred compared with the non-facial areas of the video?
  • Does it flicker?
  • Does the skin tone change near the edge of the face?
  • Does it have a double chin, double eyebrows, or double edges on the face?
  • When the face is partially blocked by hands or other objects, does it flicker or get blurry?
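
The blur and flicker checks above can be turned into rough numeric scores. The sketch below uses the variance of a Laplacian as a sharpness measure and the mean frame-to-frame difference as a flicker measure. Both are common heuristics that I am assuming here, not a proven detector. They are most useful for comparison, for example the face region against a background region of the same video.

```python
def laplacian_variance(image):
    """Sharpness score: variance of the 4-neighbour Laplacian (interior only)."""
    values = []
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            values.append(image[y - 1][x] + image[y + 1][x]
                          + image[y][x - 1] + image[y][x + 1]
                          - 4 * image[y][x])
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def flicker_score(frames):
    """Mean absolute pixel change between consecutive frames of one region."""
    total, count = 0.0, 0
    for prev, cur in zip(frames, frames[1:]):
        for row_p, row_c in zip(prev, cur):
            for p, c in zip(row_p, row_c):
                total += abs(c - p)
                count += 1
    return total / count


# Toy grayscale regions: a detailed checkerboard and a featureless patch.
sharp = [[float((x + y) % 2) for x in range(6)] for y in range(6)]
flat = [[0.5] * 6 for _ in range(6)]
print(laplacian_variance(sharp), laplacian_variance(flat))

# The same region over three frames, steady versus changing brightness.
brighter = [[v + 0.3 for v in row] for row in flat]
steady = [flat, flat, flat]
flickering = [flat, brighter, flat]
print(flicker_score(steady), flicker_score(flickering))
```

A face region whose sharpness score is much lower than the rest of the frame, or whose flicker score is much higher, is worth a closer look by eye.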

Lip sync from audio

The video made by Jordan Peele is one of the toughest to identify as fake. But if you look closer, Obama’s lower lip is blurrier than the other parts of the face. Therefore, I suspect that instead of swapping out the face, this is a real Obama video in which the mouth is fabricated to lip-sync with fake audio.

More thoughts

It is particularly interesting to see how we apply AI concepts to create new ideas and new products, but not without a warning! The social impact can be huge. In fact, do not publish any fake videos for fun! It can get you into legal trouble and hurt your online reputation. I looked into this topic because of my interest in meta-learning and adversarial detection. Your energy is better spent on things that are more innovative. On the other hand, fake videos are here to stay and will keep improving. It is not my purpose to make better fake videos. Through this process, I hope we learn how to better apply GANs to reconstruct an image. Maybe one day this will be helpful in detecting tumors.

Reference

Synthesizing Obama: Learning Lip Sync from Audio

Credits

Photo credits (Head scarf)
