Graph Neural Networks (GNN, GAE, STGNN)

Jonathan Hui
10 min readMar 14, 2021

In general, Graph Neural Networks (GNN) refer to the general concept of applying neural networks (NNs) on graphs. In a previous article, we cover GCN which is one of the popular approaches in GNN. But in some literature, GNN may refer to a more specific approach that the hidden state of a node depends on its last states and its neighbors. We can view this as message passings with its neighboring nodes. Mathematically, it has the form of:

In this article, we will look into some of its specific implementations. Then, we will broaden our view on GNN again and covers other GNNs that we have not discussed yet. For all these topics, we will give an overview only, and please refer to individual papers for details.

GNN (Graph Neural Networks)

Some literature may refer to this original GNN model as Recurrent Graph Neural Network (RecGNN). In this section, we stay with the term GNN. The general concept of GNN is to exchange information (message) constantly with its neighbors until a stable equilibrium is reached. This behaves similarly to an RNN as weights are shared in each recurrent step. In contrast, GCN does not share weights between their hidden layers (For example, Grec below shares the same parameters).


The propagation rule for GNN can be generalized as:

where hᵥ⁽⁰⁾, for t=0, is initialized randomly. At each time step t, we propagate the hidden layer feature from its neighbors to update the hidden features of a node. For h to reach a stable state, the parameterized function f must be a contraction mapping. It shrinks the distance between two points after projecting them into a latent space. But many GNN implementations are more specific and f does not use all the parameters on the R.H.S. So for the next three sections, let’s overview some of these GNN implementations.

Gated Graph Neural Network (GGNN)

In RNN, the hidden state is updated by the input and the last hidden state.

To be exact, the equations used in GRU are:

In GGNN, it applies the following equations in updating the hidden state.


where hᵥ⁽⁰⁾ = xᵥ and xᵥ is the annotation for node v (some kind of initial node labeling depending on the problem domains). Here, GGNN applies GRU (equation 6) to its previous hidden state of node v and the new hidden state updates. The following is the exact equation used in the paper for your reference. It looks complicated but it is simply the GRU version on GNN if you look into the details.


where W and U are learnable parameters, z and r are the gates and A indicates the connectivity between nodes.


Graph LSTM

Instead of GRU, we can apply LSTM instead. As shown below, the are different variants of Graph LSTM with slightly different equations. For comparison and as a reference, the first row below is the equations using GRU and the rest are the models based on LSTM.

Modified from source


The diagram below is a document graph that represents various dependencies such as linear context (adjacent words), syntactic dependencies, and discourse relations.


For example, the diagram below indicates a syntactic dependency (“subj”) between “Mary” and “ran”.


The sentence in the first example suggests that tumors with L858E mutation in the EGFR gene respond to the drug gefitinib. If we have a triple defined as (drug, gene, mutation), these sentences will suggest the triple (gefitinib, EGFR, L858E) have a “respond” relation. So let’s go through an example to find out the relations of this triple.

In the architecture below, words in the sentences are encoded with word embeddings. Then it uses a graph LSTM to learn a contextual representation for each word. Next, we concatenate the contextual representation for the words (gefitinib, EGFR, L858E) together. Finally, we use a relation classifier to score (classify) the relations of these three words. So the relation “Respond” (say R₄) should have the highest score.


Now, we have finished our discussion on the original GNN models. Let’s broaden our view and look at other generic GNN models.

Graph autoencoders (GAEs)

GAE is the cousin of autoencoder in deep learning.


It encodes a graph into a latent feature space and reconstructs the graph from it. First, the encoder applies graph convolutional layers to generate a network embedding for each node. We train the model such that this representation will capture the topological information of a node. Next, the decoder computes the pair-wise similarity between network embeddings followed by an activation function. Finally, the decoder reconstructs the graph adjacency matrix using this calculation. GAE is trained to minimize the reconstruction loss between the real adjacency matrix and the decoded adjacency matrix.


There are more design variants in GAEs. Let’s get into a specific example to explain it.

Structural Deep Network Embedding (SDNE)

SDNE exploits the first-order and second-order proximity to preserve the network structure. The first-order proximity is the local pairwise similarity between nodes connected by edges. The loss function L₁ (first-order proximity) below penalizes the difference of the encoder’s network embeddings for neighboring nodes. It assumes two connected nodes should have similar latent representations. The encoder uses the adjacency matrix Aᵢ (S in the paper) as the input feature vector, i.e. xᵢ = Aᵢ. Since Aᵢ characterizes the neighborhood structure of node i, we expect the encoder to capture this information in the dense latent representation of this node.

Source 1, 2

The second-order proximity, L₂, penalizes the reconstruction loss for the node xᵢ.

where b is derived from the nodes’ connectivity A and β is a tunable hyperparameter.

In a social network, when two people are not connected, it does not necessarily mean they are not friends or they are not similar. Consequently, we should pay more attention to the positive samples (nodes connected with edges). Hence, the loss function above penalizes wrong predictions for positive connectivity more.

Variational Graph Auto-Encoders

SDNE focuses on node structural information but ignores feature information of nodes. Variational Graph Auto-Encoders encode both node structural information A and node feature information X.

The decoder decodes network structure from their embeddings by reconstructing the graph adjacency matrix  using node similarity.

Spatial-temporal graph neural networks (STGNNs)

So far, we have studied spatial graphs only.

Source (Pose estimation)

Next, we are going to analyze spatial-temporal graphs that utilize both spatial and temporal information to make predictions. For example, the video below captures both spatial and temporal information of a pose. Therefore, the node representation in the hidden layers depends on neighboring nodes in the spatial and temporal directions. By examining the spatial-temporal graph, we can classify the type of human action in the video. Other applications of the spatial-temporal graphs include traffic flow and driver maneuver anticipation.


To capture the temporal information, we apply RNN with the current input and the hidden states in the last time step.


To incorporate the spatial dependency, we insert the graph convolution Gconv below.


Graph convolutional recurrent network (GCRN)

GCRNs are similar to the LSTM models. But it applies graph convolution to the spatial data xt and temporal data ht instead of matrix multiplication.

Modified from source

Diffusion convolutional recurrent neural network (DCRNN)

DCRNN utilizes the encoder-decoder architecture and the GRU to capture the spatial and temporal information of a graph. Similar to the last section, the matrix multiplications will be replaced. For DCRNN, the matrix operations in GRU are replaced by diffusion convolution.


A diffusion process transfers information from one node to one of its neighbors according to a probability transition matrix P. Mathematically, Diffusion Convolutional Neural Network (DCNN) is formulated as:


But in practice, this diffusion process can be implemented as a random walk on G with restart probability α. The stationary distribution of the diffusion process can be represented as random walks of different lengths on the graph and formulated as:

where D₀⁻¹ W is similar to D⁻¹A in the GCN.

DCRNN includes the reversed direction diffusion process. So it is bidirectional and offers more flexibility to capture information from both the upstream and the downstream traffic. The convolution is defined as


where θ is the trainable filter parameters. This is similar to the DCNN but in both directions.



However, RNN-based models are usually not easy to train. Alternatively, we can use 1D-CNN to explore temporal dependency instead. In a CNN-based solution, a graph convolutional layer is followed by a 1D-CNN layer. The graph convolutional layer operates on A (adjacency matrix) and Xᵗ (nodes’ input features at time t) to capture the spatial dependency. The following 1D-CNN layer slides over X along the time axis to capture the temporal dependency. Finally, we can concatenate the output of the CNN and apply affine operations to generate a prediction for each node.

Spatial Temporal Graph Convolutional Networks (ST-GCN)


ST-GCN works on a spatial-temporal graph. Instead of performing spatial and temporal convolution independently, ST-GCN performs convolutions over neighboring nodes collected in both spatial and temporal directions.

Modified from source

As discussed in GCN, the challenge of applying convolution on a graph is to associate weights to nodes in a consistent order.

ST-GCN proposes three partition strategies in dividing and grouping the neighbors of a root node (central node): (b) Uni-labeling, (c) distance partitioning, and (d) spatial configuration partition. In each partition scheme, it assigns labels to nodes systematically and uses these labels to associate the weights (or to calculate the weights) in a consistent manner.

In uni-labeling partitioning, all nodes have the same label (green). In Distance partitioning, there are two labels — the root node itself with distance 0 (green) and other neighboring points with distance 1. (blue). In Spatial configuration partitioning (d), nodes are labeled according to their distances to the skeleton gravity center (black cross) compared with that of the root node. Centripetal nodes are closer to the black cross than the center node and mark as blue while centrifugal nodes are further away and mark as yellow.

In the distance partition, it labels a node with l measures its node distance from the center node. In our case, it is either 0 or 1 since we consider the immediate neighbors only.

And here is the labels used in spatial configuration partitioning:

where rᵢ is the distance from the gravity center to joint i. Then, the associated weight is derived with this label l accordingly to the specific partition strategy.

For CNN convolutions, the output of x can be formulated as:

with h and w are the labels to identify the associated weight for x. And p is the sampling function that enumerates the neighbors of location x.

In ST-GCN, the label l below will be used to identify the weight, instead of using h and w.

where Z is a normalization factor.