Deep Learning: Variational Auto-Encoders

September 30, 2018

Variational Auto-Encoders (VAE) are a probabilistic (variational) extension of classical auto-encoders. Disclaimer: This article does not approach the concept of VAE from a bayesian perspective (check out the references below for this). The focus lies instead on highlighting the key components and their effects on the practical results such as the latent embedding space structure.

Building Blocks of an Auto-Encoder

x: Input tensor (high-dimensional) f.e. image from MNIST dataset (handwritten digits).
Encoder e(x, p_e) → z: Neural network e compressing input x into a lower dimensional tensor z using parameters p_e.
Latent space z: Hidden lower n-dimensional space (also called bottleneck) in which the compressed representation of input data lies, represented by z ∈ ℝ^n.
Decoder d(z, p_d) → x’: Neural network d reconstructing a lossy representation x’ of the original input x using parameters p_d from a point z generated by e(x).
Auto-Encoder d(e(x, p_e), p_d) → x’: Encoder and decoder chained together as a single model.
Loss Function L(x, x’): Reconstruction loss defined by e.g. mean squared error between x and x’ L(x, x’) = MSE(x, x') and possibly regularization terms e.g. sparsity constraint on z or weights. Used to compute gradients for parameters p_d and p_e minimizing the loss.

Making Auto-Encoders Variational

We turn a classical auto-encoder into a variational one by modifying the encoder. Instead of transforming the input x directly into the single vector (latent space) z ∈ ℝ^n, we instead map x into two vectors: z_mean ∈ ℝ^n & z_var ∈ ℝ^n. These two vectors are parametrize a Gaussian normal distribution (via mean and variance) from which we sample the latent vector z: z ~ gauss(z_mean, z_var). This makes our encoder variational (probabilistic), basically adding gaussian noise to our encoder models output vector z.

Why should we add noise to the encoder? Doing so will generate many more different sample points of z for the decoder to learn reconstructions from, forcing the decoder to generate smooth interpolations between local samples in the latent space.

Computing the derivatives of the random gaussian distribution parameterized by the two output vectors z_mean and z_var of the encoder is achieved by reparameterizing z ~ gauss(z_mean, z_var) into z = z_mean + sqrt(z_var) * gauss(0, 1).

This so called reparameterization trick enables us to take the derivatives of z with respect to either z_mean or z_var, which are necessary to back-propagate the error-signal through the variational sampling layer when using gradient descent as the parameter optimizer.

To prevent the variational encoder of “cheating” by placing different samples far apart from each other in z (avoiding our desired property of smooth local interpolations) we add an additional loss term to our reconstruction loss function, giving the total loss:

L(x, x’) = MSE(x, x') + KL(gauss(z_mean, z_var), gauss(0, 1))

This additional loss term is defined as the Kullback-Leibler-divergence (non-symmetric measure of the difference between two probability distributions) between the encoders output gauss(z_mean, z_var) ∈ ℝ^n and an isotropic standard normal distribution gauss(0, 1) ∈ ℝ^n. This forces the latent space to be standard Gaussian distributed (achieving the desired smooth local interpolation).

Beta-VAE

By adding a tunable parameter β (beta) to the Kullback-Leibler-Loss: β * KL(gauss(z_mean, z_var), gauss(0, 1)) which is 1.0 for the normal VAE, we can vary how much we force the latent space to be standard normally distributed (no interdimensional correlations / spherical covariance matrix). This has an interesting effect on the structure of the latent space dimensions: a higher β parameter (>1) promotes disentanglement of the individual latent space dimensions. Making them independent and often interpretable features. For example a disentangled dimension might represent the abstract feature of gender (negative number may represent male attributes and positive numbers female attributes).

The Underlying Mechanism: Increasing the β parameter disentangles the latent features and works by creating an information capacity bottleneck - it makes using any latent dimension "expensive" by penalizing large deviations from gauss(0, 1). This constraint forces the model to use each dimension efficiently, making specialization (pure features per dimension) more cost-effective than redundant or mixed encodings across dimensions (penalizing inter-dimensional correlations).

Interestingly, this reveals a broader principle: any constraint that limits effective information capacity per dimension can promote disentanglement. For example, quantizing latent representations creates a similar information bottleneck through precision limits rather than loss induced via KL-Divergence, potentially achieving comparable disentanglement effects through the same underlying mechanism of forced dimensional efficiency.

Show me the code

Get your hands dirty and play around:

Beta-VAE

Further Questions

How can we exploit the fact that the decoder is the inverse of the encoder function and vice versa (weights should be in inverse relationship)?

See “Invertible Autoencoders”; TODO: Lookup reversible layers

How is the compressed latent embedding z related with the data distribution of X (f.e. training images)?

„In fact, this simple autoencoder often ends up learning a low-dimensional representation very similar to PCAs.“ - http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/

If a simple dense AE approximates PCA, what does a convolutional AE approximate? Local PCAs for each kernel?

Can we train the encoder unsupervised (independently from the decoder) by understanding how the input features are compressed into z?

The latent space z is shaped by the loss function, which consists of the MSE between input and output image plus the KL-divergence between an isotropic standard gaussian and z. -> How can we replace the MSE term of the loss to decouple the decoder from the encoder while training? Define a loss that considers class density and overlap -> reward high class density embeddings, penalize multi-class overlap?

References

Excellent detailed blog post: https://lilianweng.github.io/posts/2018-08-12-vae/
Bayesian perspective: https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
Applications: https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf
Intro from the OG authors: https://arxiv.org/pdf/1906.02691

#machine-learning