I have been learning about Normalizing flows since last few days. It is one of those famous Generative models in Machine Learning and Deep Learning. Generative models, as the name suggests, are developed to learn distribution of a given data and then, based on the distribution learnt, can generate new data points that did not exist before, but look quite real. You must be aware of ‘Deepfakes’, which mostly use GANs (Generative Adversarial Networks) from what I know. In this blog, I will try to explain Normalizing Flows. The other two extremely famous Generative models are GANs and VAEs (Variational Autoencoders). Though the purpose of all these architectures is to efficiently learn distribution of a given data, GANs and Normalizing flows can’t be compared in terms of the approaches, but some comparisons are possible between VAE and Normalizing flows, which I will be talking about. I would like to mention that there are many resources online on Normalizing flows, which I have loved learning from. This blog contains my understanding and perspective on them. I recently decided that I would try writing blogs on whatever I learn, whenever possible. Do comment and let me know what you think about this blog.
Technically, everything in ML and DL is about learning distributions of the data at hand. Even in a supervised learning approach, the way I look at it, we are essentially trying to learn a categorical distribution for observations in the data (classification) or in a regression setup, we are finding some distribution of the target variable conditional on the features. There are various ways to learn distributions, one of the most common ways is to assume a parametric distribution (normal distribution in case of continuous variables) and estimate the parameters using maximum likelihood. In fact, in VAE, we assume the prior of the latent space(z) to follow a standard normal distribution and we are approximating the posterior q(z/x) (variational inference) by a normal distribution and expect to get good enough mu and sigma through the encoder neural network. In VAE, since p(x) which would be an integral of p(x,z) over the entire latent space, is intractable, the goal is to optimize an approximate log likelihood called ELBO, which is less than or equal to the actual log likelihood. I won’t go into details of VAE, but here is a good blog on it: https://anotherdatum.com/vae.html. The derivation of ELBO mainly uses Jensen’s Inequality ( f(E(X))≥E(f(x)) ). The main issue here is the assumption of a normal distribution. Yes, I agree that according to Central Limit Theorem (CLT), we can approximate every distribution by a normal distribution if we have sufficiently large data, but in reality, not all datasets really seem to be normally distributed. They could be following complicated multi-modal distributions and even if some follow normal distributions, wouldn’t it be a good idea if we could learn the type of distribution from the data itself through some sophisticated technique! Normalizing flows aim to do exactly this. Let’s discuss how.
Normalizing flows do this by first taking a simple distribution of a latent space Z (typically normal distribution, as you might have guessed) and then, applying a series of bijective and differentiable functions on it in such a way that the resulting transformation is from Z to the original data X. Why bijective and differentiable functions? This is intuitive because we would want to map each x in X to a unique z in Z and vice-versa in such a way that a small change in x also causes a small change in z (and the other way). This also implies that we want the Jacobian of the transformation to be invertible, which is crucial for ‘Change of Variables for Probability distributions’. Let’s pictorially see what we want:
The description of transformations can be written as:
What does change of variables mean for probability distribution? Can we say that there is an exact correspondence in the probability distribution for x and z? In other ways, can we say that if x = f(z) for some x and z, then their probabilities of occurrence is the same? It naturally feels, that would be the case because after all, f is a bijective function as it is a composition of some K bijective functions, but we are forgetting one important fact about probabilities. The probability density functions for the latent space Z and for the space of actual data X, both individually have to integrate to 1. Note that however large my data X is, it is still a much smaller sample in the vast possibilities of values it can take. If X has n dimensions, it is just a sample in R raised to n. We can similarly say about the latent space Z. The PDFs have to integrate to 1 in these much broader and vast spaces. Therefore, it is important to note that there is more than bijection happening here. The transformation could be causing different extents of scaling for vectors from one space to the other. Here is where Jacobian comes into play. A Jacobian matrix is a matrix which contains all partial derivatives of the transformation.
The above discussion leads us to understand that the relation between the PDFs of X and Z are related via the rate by which x changes with respect to a change in z (and the other way) or by how much an instantaneous change in x (or z) brings about a change in z (or x). This information is represented by a Jacobian matrix.
Now, what is that quantity related to a transformation or matrix which actually shows the extent to which scaling has occurred as a result of applying that transformation. If your answer is Determinant of the matrix or transformation, you are right! (you could read about the geometric interpretation of a Determinant of a matrix here). In our case, since we are interested in how PDFs change, we are more interested in the instantaneous changes or the Jacobian. Therefore the change of variable formula gives below relation between p(x) and p(z):
Note that we are only interested in the extent of instantaneous scaling by the transformation, so the direction doesn’t matter to us. We therefore only consider the magnitude of the determinant of the Jacobian. This is intuitive since the probabilities will only get affected by the scaling, but not by the direction of the scaling. Another way of thinking about this is that the volume covered when z changes instantaneously is magnitude of determinant of Jacobian times the corresponding volume covered when x changes. So if in the above equations, mod of determinant of dz/dx is 4, then to find a corresponding z for a random x that is sampled from a small region in X, one would have to look at 4 times the region or volume in Z space (in all these examples, we are considering Z and X to have same number of dimensions).
We earlier saw that f is the composition of K bijective and differentiable transformations. So, in the above equations, if we would like to break down the effect of each of these transformations, we would get below:
In order to get the best possible K transformations on Z that would give the most appropriate distribution for X, we have the following log likelihood that we would like to maximize:
As you might have guessed, the K bijective and differentiable functions cannot be completely arbitrary. They usually are of a particular type and then, the task would be to get best possible parameters for these so that we get the right distribution over X. Also, computing Jacobian can get tedious for many types of transformations and the idea here will be to use transformations for which Jacobians are easy to arrive at and also, for which determinants can be computed easily. Determinant is easy to compute for diagonal and triangular matrices since for such cases, it is simply the product of entries in the diagonal of the matrix. Therefore, the aim is to think of bijective and differentiable functions for which Jacobian is a lower triangular matrix.
We will now dive into types of flows which are basically the type of bijective functions chosen. The early choices were ‘planar flows’ and ‘radial flows’ which are also given in the original paper.
Planar flows: One could think of these as taking hyperplanes in the Z space and then, applying some differentiable non-linearity and translation to them. This would either expand or contract the original hyperplane. As you might have already realized, the purpose of designing such flows is that the determinants of the Jacobians would be easy to compute. Here is how it looks in terms of equations:
where u and w have the same number of dimensions as z, say d. b is a real number and h is an element wise non-linearity.
The determinant of the Jacobian turns out to be:
Radial flows: Like planar flows, this applies differentiable non-linearity to d-dimensional spheres, which results in either an expansion or contraction (after translation). It is given as below:
The effect of planar and radial flows are clearly only seem for smaller volumes. This makes it less efficient to learn the distribution of data using them especially if the number of dimensions of the input data is really high. One option is to use several layers of planar and radial flows.
Autoregressive flows: These flows are designed smartly. As the name suggests, the ith dimension is transformed as a function of dimensions 1 to i, like shown below. The Jacobian would clearly be a triangular matrix.
But the data need not be sequential!! This makes it look like these flows are best for images or text or audio or time series, for which sequence of features is of some importance. I think this is true. However, the order of dimensions is not necessarily kept fixed. In many cases, for every mini-batch, the ordering is random. So maybe, such an approach can also be tried on non-sequential data.
Following are types of autoregressive flows
R-NVP (Real Non Volume Preserving) flow: This is also referred to as affine coupling layer. In this transformation, the first j dimensions are fixed. The remaining dimensions are scaled and translated where the scaling and translation are functions of the elements in first j dimensions.
where ⊙ is the element-wise product and S and T are the scale and translation functions, respectively, both mapping j dimensions to d-j dimensions.
with ⊘ denoting element-wise division. Note that for f to be invertible T and S need not be invertible. In fact, T and S are usually implemented using a neural network. So they are almost never invertible. Also, inverse of f does not require evaluating inverse of S and T and the Jacobian of f does not require Jacobians of S and T. These are very useful observations since it allows us to be as flexible with S and T as possible because of which we can make them as complex as we like. The Jacobian is clearly a lower triangular matrix, making it easier for computing its determinant.
The ordering of dimensions can be random for different mini-batches and epochs. This brings in more complexity, but helps in arriving at a more appropriate distribution for a given data.
A less generalized version of R-NVP is NICE (Non-Linear Independent Component Estimation)which was thought of by the same writers that came up with R-NVP. The difference between NICE and R-NVP is that NICE doesn’t include scaling. This is not preferable since we would want changing volumes so that we capture complexities of a distribution. NICE can be written as below:
To be more expressive than R-NVP, one can do similar transformation like R-NVP, but with all the dimensions, here is how that would look:
where 2≤ i ≤ d (d is dimension of z). Here mu and sigma functions are required to map i-1 dimensions to 1 dimension for every i.
Masked Autoencoders for density estimation (MADE): These are like autoencoders with fewer connections. Connections are removed based on masked matrices. This is an example of an autoregressive model. Below is the explanation:
In the above image, the inputs are randomly numbers from 1 to 3 (d, if the dimension of an input sample is d). We have the same number of outputs since this is an autoencoder and the idea is to be able to reconstruct the input. The nodes in the hidden layers are numbered randomly from 1 to d-1 (2, in this case). Now, the connections are based on how this numbering happens. An output node numbered as i is connected to all the nodes that are numbered less than i (this is how the autoregressive property is enforced). An input node numbered as j is connected to all nodes that are numbered greater than or equal to j in the first hidden layer. Similarly, for every hidden layer, the nodes are connected to nodes in the subsequent hidden layer such that node numbered m is connected to all nodes that are numbered greater than or equal to m. A masked autoregressive flow (MAF) is similar, except it can be a normal MLP (multi-layer perceptron).
For both MADE and MAF, random ordering is performed for every mini-batch, random ordering can also be done while testing to make an ensemble and average the results.
That is it for now. For more details, you could look at the following papers:
For NICE: https://arxiv.org/abs/1410.8516, For Real NVP: https://arxiv.org/abs/1605.08803, For Glow (Generative Flow with invertible 1*1 convolutions): https://arxiv.org/abs/1807.03039, Variational Inference with Normalizing flows: https://arxiv.org/abs/1505.05770, Masked Autoregressive flows for density estimation: https://arxiv.org/abs/1705.07057, Masked autoencoders for density estimation: https://arxiv.org/abs/1502.03509.
A few blogs:
https://deepgenerativemodels.github.io/notes/flow/, https://blog.evjang.com/2018/01/nf1.html, https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html, http://akosiorek.github.io/ml/2018/04/03/norm_flows.html
A nice youtube video: