
AI: Snap a UI mockup & finish the prototype in seconds.

Source Alex Yee

Brainstorming, mockups, and prototyping take up a large portion of the time in product design. Ideally, we would prototype those mockups and run them in user tests. However, engineering resources are hard to allocate. It would be a dream if the UI coding could be automated from a simple sketch. So let’s look at how AI can do this from a snapshot of a UI mockup. We will review the process first, followed by a demonstration of how this can be done in seconds instead of days.

A group of designers may hold a brainstorming session and sketch out many design ideas.

Then the designers spend hours finishing the layout design.

Then they may use tools to convert the design into a prototype, or have engineers code it from the layout mockup.

From creating layouts to making prototypes for phones, it can take hours or even days if engineers are involved. However, with the application of Deep Learning in AI, the whole process can be automated. Here is a 3-minute demonstration from UIzard in generating native mobile applications directly from sketches.

A full product still needs engineers to finish the business logic and the backend integration. But as a prototype, it may be good to go, since designers just want to demonstrate or test the UI flow with users. The underlying technology, pix2code, is developed by UIzard. It may still take some time to become a commercial product, but the potential for automating UI programming is real. The first half of this article focuses on applying AI to automate the design process and the front-end UI coding. The second half is for AI enthusiasts interested in the technical side of implementing a deep network that generates code from a sketch.

Airbnb Sketching Interfaces

Airbnb has demonstrated similar ideas with their Airbnb Sketching Interfaces. This is a 30-second video showing how Airbnb UI designs can be generated from UI sketches. Airbnb standardizes the UI components for all their applications. In fact, the whole system contains about 150 types of UI components. The video demonstrates how the application generates designs that follow this standard set of UI components.

With much simpler sketches, the Airbnb application generates UI interfaces that follow a company-wide UI design guide.

Design exploration

The ideas so far focus on imitating a mockup or translating a mockup into a company-wide standard design. However, designers are often asked to provide multiple design options. Can we apply AI to generate design options?

Let’s take a look at how AI generates anime. The following is an example of generating anime characters using deep learning. In the right panel, we choose different attributes, like hair color, and the left panel generates a character with the corresponding style automatically.

Generating content with different styling, using technology like GANs, is under heavy research. Theoretically, we can apply a similar concept to explore different design variants, including different color schemes, layouts, and data hierarchies.


All these AI concepts lead to an obvious question: will front-end developers become obsolete? Humans always automate what they do, and developers learn new technologies every couple of years. Not all coding will be automated; business logic, for example, will not. The answer is not obvious, and I will leave the social impact for future discussions. For now, I will focus on the opportunities and challenges.

In HTML and CSS coding, there are many hidden rules imposed by browser implementations and their shortfalls. As a result, UI implementation is sometimes a trial-and-error effort. Deep learning is good at extracting patterns from millions of examples and formulating rules. Front-end UI coding is very similar to human language: many rules are not organized and not easy to explain. AI has demonstrated superhuman performance in such scenarios, just as AlphaGo beat the Go master. So AI will eventually win in UI coding. Nevertheless, both pix2code and Airbnb Sketching Interfaces are in the prototype phase only. It will take 10–100x the effort to generalize the solution for commercial success.

However, even though UI layout coding is tedious, the most important challenges in front-end coding are flexibility and maintainability: how easy is it to make global changes, and how easily does the code break after changes? pix2code needs to demonstrate how well it groups and organizes information. Can components share CSS? Is the CSS well organized and easy to maintain? For Airbnb, this is much easier to solve. Since the UI is standardized, we can always have predefined CSS for all 150 UI components. As long as we can break down and classify a sketch correctly into those components, they can share the predefined CSS. This problem is similar to object detection in AI, which is pretty well developed.

In addition, we can apply this technology to UI prototyping first. Since prototyping code is thrown away, the quality requirement is much lower. Engineering resources are also hard to allocate for prototypes. So this market will be much more proven and ready.

For the remainder of the article, we focus on how to build a deep network that generates code from a sketch.

Create image captions using RNN

pix2code uses a deep network model composed of a CNN and two LSTM networks. Its design is similar to the image captioning in Deep Learning. For example, the image-captioning model reads a yellow bus picture below and generates the caption “A yellow school bus idles near a park.” automatically.

We feed the image into a CNN network to extract image features. Together with the label (the true caption provided by the sample) as input, the LSTM module generates captions.

Let’s unroll the LSTM to see how the model is trained. For example, we have the true label “<start> A yellow school bus idles near a park . <end>”. We feed each word into each LSTM cell at the bottom. In the diagram below, for the first token “<start>”, the LSTM predicts the word “A”. We continue with the second token “A” in the true label which predicts “new”. Eventually, it predicts the caption as “A new bus is parking near the street.”
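The training setup above is commonly called teacher forcing. A minimal sketch, assuming a simple whitespace tokenizer and a hypothetical helper named `teacher_forcing_pairs`, shows how each input token is paired with the next token of the true caption as its target:

```python
def teacher_forcing_pairs(caption):
    """Pair every token of the true caption with the token that follows it.

    During training the LSTM always receives the true previous token as
    input, regardless of what it predicted at the previous time step.
    """
    tokens = caption.split()
    # Inputs are all tokens but the last; targets are shifted by one position.
    return list(zip(tokens[:-1], tokens[1:]))


pairs = teacher_forcing_pairs("<start> A yellow school bus idles near a park . <end>")
for given, expected in pairs[:3]:
    print(f"input={given!r:10} target={expected!r}")
```

Even if the model wrongly predicts “new” after “A”, the next input is still the true token “yellow”, which keeps early mistakes from compounding during training.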

Once it is trained, we make an inference by feeding validation images into the model. For simplicity, we reuse the school bus image as our example. We start with the first input token “<start>”. The LSTM produces the word “A”. Here is the major difference from training: for the next input to the LSTM cell, we use the output from the last step, i.e. we feed the last output “A” to the model in the next time step. “A” predicts “yellow”, which is then fed into the model to produce “bus”. We repeat these steps, which eventually generate the caption “A yellow bus idles near a park.”


pix2code takes a hand-drawn UI mockup from a designer and feeds it to the deep network to produce an Xcode project with the UI design. pix2code can also produce code for Android or for Web applications using different HTML/CSS/JS frameworks.

pix2code model

Here is the model architecture. pix2code consists of an encoder (the LSTM and CNN on the left side) and a decoder (LSTM’). The CNN encodes the GUI picture into latent features. Each training sample comes with a context containing information about the GUI design. The LSTM encodes the corresponding context of a GUI.


The context is the DSL code (Domain specific language) of the GUI mockups.

The context above has a stack of rows and a footer holding UI elements. Since we are only interested in GUI components and their layout, the actual textual values of the components are ignored. This significantly reduces the vocabulary size and allows the tokens (like <stack>) to be encoded as one-hot vectors rather than word embeddings. This saves the model from training an embedding layer.
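As an illustration of the one-hot encoding, here is a minimal sketch. The vocabulary below is hypothetical and much smaller than a real DSL vocabulary; only the encoding idea matters.

```python
# Hypothetical DSL vocabulary, for illustration only.
VOCAB = ["<start>", "<end>", "stack", "row", "footer", "label", "btn", "{", "}", ","]
TOKEN_TO_INDEX = {token: index for index, token in enumerate(VOCAB)}


def one_hot(token):
    """Return a vector of zeros with a single 1 at the token's vocabulary index."""
    vector = [0] * len(VOCAB)
    vector[TOKEN_TO_INDEX[token]] = 1
    return vector


# Encode a short DSL fragment token by token.
encoded = [one_hot(t) for t in ["<start>", "stack", "{", "row", "}", "<end>"]]
print(one_hot("stack"))
```

With only a handful of tokens, each vector stays short, which is why no learned embedding is needed.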

Vision model encoder (CNN)

Images are rescaled to 256x256 pixels without preserving the aspect ratio, and pixel values are normalized. The vision model is composed of 3 convolutional modules. Each module consists of 2 convolutional layers with 3x3 filters and stride 1, followed by a 2x2 max-pool for downsampling and a dropout layer for regularization. The convolutional modules output 32, 64, and 128 channels respectively. Because each max-pool halves the width and height, the three modules shrink the 256x256 input by a factor of 8, so the final feature map is 32x32x128 if the convolutions are padded (28x28x128 if they are not). The data is then flattened and fed into two fully connected layers of size 1024 with ReLU activations and dropout.
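To check the output shape, we can trace the spatial size through the layers described above. This is a minimal sketch in plain Python, not model code. The padding mode is not stated in the text, so both common choices are shown, and the helper names are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding="valid"):
    """Spatial size after one convolution."""
    if padding == "same":
        return -(-size // stride)  # ceil(size / stride): padding keeps the size at stride 1
    return (size - kernel) // stride + 1  # "valid": no padding, the border is lost


def pool_out(size, pool=2):
    """Spatial size after a non-overlapping max-pool."""
    return size // pool


def vision_output_shape(size=256, channels=(32, 64, 128), padding="valid"):
    """Trace height/width through 3 modules of (conv, conv, max-pool)."""
    for _ in channels:
        size = conv_out(size, padding=padding)
        size = conv_out(size, padding=padding)
        size = pool_out(size)
    return (size, size, channels[-1])


print(vision_output_shape(padding="same"))   # (32, 32, 128)
print(vision_output_shape(padding="valid"))  # (28, 28, 128)
```

Either way, the flattened vector is large (32 * 32 * 128 = 131072 values with padding), which is why the two fully connected layers of size 1024 follow.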

Language model encoder (LSTM)

The context is encoded by a language model consisting of a stack of two LSTM layers. Each LSTM is unrolled into 48 time steps (48 LSTM cells). The prediction at each time step is a vector of 128 dimensions. (i.e. h1 at time step 1 is a vector with 128 elements.)

Decoder (LSTM’)

The latent features for both the context and the image are concatenated and fed to a decoder. The decoder contains a stack of two LSTM layers with an output dimension of 512 at each time step. Its output is then fed into a fully connected layer that uses softmax to compute a probability for each token in the vocabulary. We select the output DSL token with the highest probability. For example, if our vocabulary size is just 5, the model may make a prediction of (0.05, 0.1, 0.05, 0.3, 0.5) to represent the probability of each token in the vocabulary.
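The last step can be sketched in a few lines. This is an illustrative example only: the five-token vocabulary is hypothetical, and the logits are constructed so that softmax reproduces the probabilities from the example above.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


VOCAB = ["stack", "row", "label", "btn", "footer"]  # hypothetical tokens

# Logits chosen so that softmax yields (0.05, 0.1, 0.05, 0.3, 0.5).
logits = [math.log(p) for p in (0.05, 0.1, 0.05, 0.3, 0.5)]
probs = softmax(logits)

# Greedy choice: pick the token with the highest probability.
best = max(range(len(probs)), key=probs.__getitem__)
print(VOCAB[best])  # footer
```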


For the context with tokens (x1, x2, x3, x4, x5, …), we create a sliding window to feed data into the LSTM for training. We start with the first training sample (0, 0, …, 0, x1).

We slide the window to the left once to prepare the next training sample. The following diagram indicates the next two training samples fitted into the model.
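The sliding window can be sketched as follows. This is a minimal illustration that assumes the training target of each window is the token that follows it; the function name and the short window length used in the demo are hypothetical.

```python
def sliding_window_samples(tokens, context_length=48, pad=0):
    """Build (context, next_token) training pairs from one token sequence.

    The sequence is left-padded with zeros so that the first context is
    (0, 0, ..., 0, x1) and every context has exactly `context_length` items.
    """
    padded = [pad] * (context_length - 1) + list(tokens)
    samples = []
    for i in range(len(tokens) - 1):
        context = padded[i:i + context_length]  # window ending at tokens[i]
        target = tokens[i + 1]                  # token the model should predict
        samples.append((context, target))
    return samples


# Demo with a window of 4 instead of 48 to keep the output readable.
for context, target in sliding_window_samples(["x1", "x2", "x3", "x4", "x5"], context_length=4):
    print(context, "->", target)
```

Each step shifts the window by one token, so a sequence of n tokens yields n - 1 training samples for the same image.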

The model is trained with mini-batches of 64 image-sequence pairs. The total loss, using the cross entropy, for a single image is:
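The formula itself is not reproduced in the text here. As a minimal sketch, assuming the standard multiclass cross-entropy summed over the time steps of one image-sequence pair, the loss is the negative log-probability that the model assigns to each true next token. The function name and the toy distributions are hypothetical.

```python
import math


def sequence_cross_entropy(predicted, target_indices):
    """Sum of -log p(true token) over all time steps of one sample.

    predicted:      one probability distribution over the vocabulary per time step
    target_indices: vocabulary index of the true next token at each time step
    """
    return -sum(math.log(dist[idx]) for dist, idx in zip(predicted, target_indices))


# Toy example: 3 time steps, vocabulary of 3 tokens.
predicted = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.25, 0.25, 0.5],
]
targets = [0, 1, 2]
loss = sequence_cross_entropy(predicted, targets)
print(round(loss, 4))
```

The loss is 0 only when the model puts probability 1 on every true token, and it grows as the probability of a true token shrinks.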


To make predictions, we feed the GUI image and a context of 48 tokens with values (0, 0, …, 0, <start>) into the model. With the first prediction h1 from the model, we create another context (0, 0, …, 0, <start>, h1) for the second prediction h2. We continue the process until the model predicts the <end> token. The resulting sequence of DSL tokens (<start>, h1, h2, …, <end>) is compiled to the desired target language (HTML, Xcode) using traditional compiler techniques.
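The prediction loop can be sketched as below. This is a minimal illustration: `predict_next` stands in for the trained model, and the toy predictor is a hypothetical lookup table that ignores the image, which a real model would also receive.

```python
def greedy_decode(predict_next, context_length=48, pad=0, max_tokens=150):
    """Generate DSL tokens one at a time until <end> is predicted."""
    context = [pad] * (context_length - 1) + ["<start>"]
    output = ["<start>"]
    for _ in range(max_tokens):  # safety cap in case <end> is never predicted
        token = predict_next(context)
        output.append(token)
        if token == "<end>":
            break
        # Slide the window: drop the oldest entry, append the new prediction.
        context = context[1:] + [token]
    return output


# Hypothetical stand-in for the model: the next token depends only on the last one.
SCRIPT = {"<start>": "stack", "stack": "{", "{": "row", "row": "}", "}": "<end>"}


def toy_predict(context):
    return SCRIPT[context[-1]]


tokens = greedy_decode(toy_predict)
print(tokens)
```

The context always holds exactly 48 entries, so the model input has a fixed size no matter how many tokens have been generated.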


We use BLEU to measure how closely our outputs match the true labels. It compares the n-grams of the two sequences, here for n = 1 to 4, and gives each n-gram precision an equal weight of 0.25. If the true label is (<start>, tk1, tk2, tk3, <end>) and the prediction is (<start>, tk1, tk2, wr3, <end>), the calculation is:

BLEU = (4/5) * 0.25 + (2/4) * 0.25 + (1/3) * 0.25 + (0/2) * 0.25

= 0.2 + 0.125 + 0.083 + 0 = 0.408

Since the prediction and the true label have the same length, no brevity penalty applies and the BLEU score is not reduced further.
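The calculation above can be reproduced with a short sketch. Note the assumption: it follows the equal-weight arithmetic average used in this example, whereas the standard BLEU definition combines the n-gram precisions with a geometric mean and a brevity penalty. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token tuples in the sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simple_bleu(reference, prediction, max_n=4):
    """Equal-weight average of the 1-gram to max_n-gram precisions."""
    score = 0.0
    for n in range(1, max_n + 1):
        predicted = ngrams(prediction, n)
        if not predicted:
            continue  # prediction too short for this n: counts as precision 0
        ref_counts = Counter(ngrams(reference, n))
        pred_counts = Counter(predicted)
        # Clip each match count so a repeated n-gram is not over-credited.
        matches = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        score += (matches / len(predicted)) / max_n
    return score


truth = ["<start>", "tk1", "tk2", "tk3", "<end>"]
guess = ["<start>", "tk1", "tk2", "wr3", "<end>"]
score = simple_bleu(truth, guess)
print(round(score, 3))  # 0.408
```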


Here is the Keras code snippet for building the vision model (source). This implementation consists of 3 convolutional modules using max pooling, dropout, and ReLU, followed by 2 fully connected layers.

The second code snippet is the language model encoder with a stack of 2 LSTM:

Finally, this is the decoder with a stack of 2 LSTM and the optimizer:

Airbnb Sketching Interfaces

pix2code is similar to the language translation problem. Instead of translating text into a different language, we transcribe images into a UI DSL. Since Airbnb Sketching Interfaces supports only about 150 components (words), the model will be much easier to train, with better performance. But such a model is less general.

We can customize the training for each company, but this approach will be hard to scale. In addition, you cannot draw arbitrary designs, only ones in your design guide. But many corporations have strict design guidelines, so this may not be an issue.

Future exploration


In cognitive science, selective attention describes how we restrict our attention to particular objects in our surroundings. It helps us focus, so we can tune out irrelevant information and concentrate on what really matters. Attention helps us learn more efficiently. Instead of looking at the whole image at every time step, we use the current LSTM state to narrow our focus. In the following picture, each output caption word is generated from a more focused region of interest determined by the LSTM state.
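The idea can be sketched with soft attention over a few image regions. This is a simplified illustration, not part of pix2code: it assumes dot-product scoring between the LSTM state and each region's feature vector (attention-based captioning models typically learn the scoring function instead), and all names and numbers are hypothetical.

```python
import math


def soft_attention(state, regions):
    """Weight image regions by their relevance to the current LSTM state.

    state:   the current LSTM state vector
    regions: one feature vector per image region, same length as `state`
    Returns the attention weights and the weighted sum of region features.
    """
    scores = [sum(s * r for s, r in zip(state, region)) for region in regions]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(regions[0])
    context = [sum(w * region[d] for w, region in zip(weights, regions)) for d in range(dim)]
    return weights, context


regions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # toy features for 3 regions
state = [2.0, 0.0]                               # toy LSTM state
weights, context = soft_attention(state, regions)
print([round(w, 3) for w in weights])
```

The region whose features align best with the state receives the largest weight, so the next word is generated mostly from that part of the image.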

Other possible improvements to pix2code

  • Bidirectional LSTM models
  • Emil Wallner has suggested the use of stride 2 instead of max-pool in CNN to improve accuracy.


Completely replacing the layout coding task with AI may still be years away. The accuracy needs to improve for more complicated designs. But some corporations have strict design guidelines, which may make it happen sooner rather than later.

Other resources

The pix2code research paper.

The pix2code GitHub code.

The pix2code dataset.

