Normalizing Flows

Introduction to Generative Models

Main types of generative models:

  • Autoregressive models
  • Variational Autoencoders (VAE)
  • Generative Adversarial Networks (GAN)

Issues:

  • Autoregressive models are slow because they generate elements sequentially.
  • VAEs optimize ELBO instead of true likelihood.
  • GANs are powerful but unstable and less effective for tasks like speech generation.

Reference: https://www.youtube.com/watch?v=uXY18nzdSsM

Flow-based Models

Flow-based models optimize the log-likelihood of data $p_G(x)$, obtained by transforming samples from a Gaussian distribution $\pi(z)$:

z \rightarrow G \rightarrow x

Objective:

G^* = \arg\max_{G} \sum_{i=1}^{m} \log p_G(x^i)

Equivalent to minimizing:

\arg\min_G D_{KL}(p_{\text{data}} \| p_G)

Flow-based models directly optimize log-likelihood.
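This equivalence can be sanity-checked numerically: for a fixed data distribution, the model with the lower KL divergence also has the lower expected negative log-likelihood, since the two objectives differ only by the (constant) entropy of the data. A small NumPy sketch with made-up discrete distributions:

```python
import numpy as np

# Hypothetical 3-outcome data distribution and two candidate models.
p_data = np.array([0.5, 0.3, 0.2])

def neg_log_likelihood(p_model):
    # Expected negative log-likelihood under p_data (cross-entropy).
    return -np.sum(p_data * np.log(p_model))

def kl(p, q):
    return np.sum(p * np.log(p / q))

p_good = np.array([0.45, 0.35, 0.2])   # close to p_data
p_bad = np.array([0.1, 0.1, 0.8])      # far from p_data

# Lower KL goes with lower expected NLL...
assert kl(p_data, p_good) < kl(p_data, p_bad)
assert neg_log_likelihood(p_good) < neg_log_likelihood(p_bad)

# ...because KL(p_data || p) = NLL(p) - H(p_data), so the gaps are identical.
gap_kl = kl(p_data, p_bad) - kl(p_data, p_good)
gap_nll = neg_log_likelihood(p_bad) - neg_log_likelihood(p_good)
assert np.isclose(gap_kl, gap_nll)
```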

Timeline

2014 — NICE
2016 — RealNVP
2018 — Glow

Jacobian Matrix

Given:

\mathbf{x} = f(\mathbf{z}), \quad \mathbf{z} = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}

Jacobian:

J_f = \begin{bmatrix} \frac{\partial x_1}{\partial z_1} & \frac{\partial x_1}{\partial z_2} \\ \frac{\partial x_2}{\partial z_1} & \frac{\partial x_2}{\partial z_2} \end{bmatrix}

Inverse Jacobian:

J_{f^{-1}} = \begin{bmatrix} \frac{\partial z_1}{\partial x_1} & \frac{\partial z_1}{\partial x_2} \\ \frac{\partial z_2}{\partial x_1} & \frac{\partial z_2}{\partial x_2} \end{bmatrix}

Relationship:

J_{f^{-1}} \cdot J_f = I

Determinant of Jacobian

For an invertible matrix $A$:

\det(A^{-1}) = \frac{1}{\det(A)}

Thus:

\det(J_{f^{-1}}) = \frac{1}{\det(J_f)}

This identity is what lets us compute the likelihood via the change-of-variables formula below.
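Both identities can be verified numerically for a toy invertible map. The map below (and its closed-form inverse and Jacobians) is an arbitrary choice for illustration:

```python
import numpy as np

# Toy invertible map: x1 = 2*z1, x2 = z2 + z1**2 (illustrative assumption).
def f(z):
    return np.array([2*z[0], z[1] + z[0]**2])

def f_inv(x):
    return np.array([x[0]/2, x[1] - (x[0]/2)**2])

def J_f(z):                              # forward Jacobian
    return np.array([[2.0, 0.0],
                     [2*z[0], 1.0]])

def J_f_inv(x):                          # inverse-map Jacobian
    return np.array([[0.5, 0.0],
                     [-x[0]/2, 1.0]])

z = np.array([0.7, -1.3])
x = f(z)

# J_{f^{-1}}(x) . J_f(z) = I at corresponding points x = f(z)
assert np.allclose(J_f_inv(x) @ J_f(z), np.eye(2))
# det(J_{f^{-1}}) = 1 / det(J_f)
assert np.isclose(np.linalg.det(J_f_inv(x)), 1/np.linalg.det(J_f(z)))
```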

Change of Variable Theorem

If $\mathbf{x} = f(\mathbf{z})$:

p(\mathbf{x})\, |\det(J_f)| = \pi(\mathbf{z})

Thus:

p(\mathbf{x}) = \pi(\mathbf{z})\, |\det(J_{f^{-1}})|
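For a linear map $x = Az$ of a standard Gaussian, the change-of-variables density can be checked against the known closed form $x \sim \mathcal{N}(0, AA^\top)$. A NumPy sketch with an arbitrary invertible $A$:

```python
import numpy as np

# z ~ N(0, I); x = A z for an invertible A (illustrative choice).
A = np.array([[1.5, 0.3],
              [0.0, 0.8]])

def gauss_density(v, cov):
    # Multivariate normal density N(v; 0, cov), computed directly.
    d = len(v)
    norm = np.sqrt((2*np.pi)**d * np.linalg.det(cov))
    return np.exp(-0.5 * v @ np.linalg.solve(cov, v)) / norm

x = np.array([0.4, -1.1])

# Change of variables: p(x) = pi(f^{-1}(x)) |det J_{f^{-1}}|
z = np.linalg.solve(A, x)
p_flow = gauss_density(z, np.eye(2)) * abs(np.linalg.det(np.linalg.inv(A)))

# Analytic check: a linear map of a Gaussian gives x ~ N(0, A A^T)
p_true = gauss_density(x, A @ A.T)
assert np.isclose(p_flow, p_true)
```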

2D Illustration

Mapping a small square around $\mathbf{z}'$ in the latent space to the deformed region around $\mathbf{x}'$ gives:

\pi(\mathbf{z}') = p(\mathbf{x}') \left| \det \begin{bmatrix} \frac{\partial x_1}{\partial z_1} & \frac{\partial x_1}{\partial z_2} \\ \frac{\partial x_2}{\partial z_1} & \frac{\partial x_2}{\partial z_2} \end{bmatrix} \right|

Thus:

\pi(\mathbf{z}') = p(\mathbf{x}') |\det(J_f)|

New Objective

Using:

p_G(x) = \pi(G^{-1}(x))\, |\det(J_{G^{-1}})|

We obtain:

G^* = \arg\max_G \sum_{i=1}^m \left[ \log \pi(G^{-1}(x^i)) + \log |\det(J_{G^{-1}})| \right]
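A minimal instance of this objective: a one-parameter flow $x = G(z) = s\,z$ with $z \sim \mathcal{N}(0,1)$, fitted by grid search to data drawn from $\mathcal{N}(0, 2^2)$. This is a toy setup, not from the source; the maximizer should land near $s = 2$:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 2.0, size=5000)   # toy data from N(0, 2^2)

def log_likelihood(s):
    z = data / s                          # G^{-1}(x) = x / s
    log_pi = -0.5*z**2 - 0.5*np.log(2*np.pi)
    log_det = -np.log(abs(s))             # log|det J_{G^{-1}}| = -log|s|
    return np.sum(log_pi + log_det)

# Grid-search the single parameter instead of gradient ascent.
grid = np.linspace(0.5, 4.0, 351)
s_star = grid[np.argmax([log_likelihood(s) for s in grid])]
assert abs(s_star - 2.0) < 0.15           # MLE recovers the data scale
```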

Why Coupling Layers

Problems:

  1. High-dimensional Jacobians make determinants expensive.
  2. Training requires $G^{-1}$, so $G$ must be invertible and dimension-preserving.
  3. Need multiple invertible transformations → flows.

Coupling Layers

We apply a sequence of invertible transformations:

z^i = G_1^{-1}(G_2^{-1}(\cdots G_K^{-1}(x^i)))

Density:

p_K(x^i) = \pi(z^i) \prod_{h=1}^{K} |\det(J_{G_h^{-1}})|

Log density:

\log p_K(x^i) = \log \pi(z^i) + \sum_{h=1}^{K} \log|\det(J_{G_h^{-1}})|

During training we optimize $G^{-1}$, but use $G$ for generation.
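The sum-of-log-determinants decomposition can be checked on a chain of scalar affine maps (toy coefficients chosen for illustration), since the composed map is itself affine with a known closed form:

```python
import numpy as np

# Three scalar affine steps G_h(u) = a_h*u + b_h (illustrative values).
a = np.array([1.5, 0.7, 2.0])
b = np.array([0.2, -0.4, 0.1])

def log_pi(z):                        # standard normal base density
    return -0.5*z**2 - 0.5*np.log(2*np.pi)

x = 1.3
u = x
for h in reversed(range(3)):          # z = G_1^{-1}(G_2^{-1}(G_3^{-1}(x)))
    u = (u - b[h]) / a[h]
z = u

# log p_K(x) = log pi(z) + sum_h log|det J_{G_h^{-1}}| = log pi(z) - sum_h log|a_h|
log_p = log_pi(z) - np.sum(np.log(np.abs(a)))

# Sanity check against the single composed affine map x = A*z + B
A = a[0]*a[1]*a[2]
B = a[2]*a[1]*b[0] + a[2]*b[1] + b[2]
log_p_composed = log_pi((x - B)/A) - np.log(abs(A))
assert np.isclose(log_p, log_p_composed)
```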

Affine Coupling Layer

To compute $G^{-1}$:

z_{i \le d} = x_i, \quad z_{i > d} = \frac{x_i - \gamma_i}{\beta_i}

Determinant:

\det(J_G) = \prod_{i=d+1}^{D} \beta_i

Image:
Affine Coupling Layer

Determinant intuition:
Determinant
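A NumPy sketch of one affine coupling layer with $D = 4$, $d = 2$. The $\beta,\gamma$ network is replaced by a hand-picked function of the untouched half (all names illustrative); the round trip and the determinant formula are checked numerically:

```python
import numpy as np

def split(v):
    return v[:2], v[2:]                  # untouched half / transformed half

def beta_gamma(v_left):
    beta = np.exp(0.3 * v_left)          # exp keeps the scales positive
    gamma = 0.5 * v_left - 0.1
    return beta, gamma

def G(z):                                # forward: z -> x
    zl, zr = split(z)
    beta, gamma = beta_gamma(zl)
    return np.concatenate([zl, beta * zr + gamma])

def G_inv(x):                            # inverse: x -> z
    xl, xr = split(x)
    beta, gamma = beta_gamma(xl)         # beta, gamma depend only on x_{<=d}
    return np.concatenate([xl, (xr - gamma) / beta])

z = np.array([0.4, -1.0, 0.7, 2.0])
x = G(z)
assert np.allclose(G_inv(x), z)          # exact invertibility

# Central-difference Jacobian of G; its determinant should equal prod(beta).
eps = 1e-6
J = np.column_stack([(G(z + eps*e) - G(z - eps*e)) / (2*eps)
                     for e in np.eye(4)])
beta, _ = beta_gamma(z[:2])
assert np.isclose(np.linalg.det(J), np.prod(beta), rtol=1e-4)
```

The Jacobian is block-triangular (identity block plus a diagonal $\beta$ block), which is why the determinant reduces to a product of scales.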

Masking

Two masking strategies:

  1. Checkerboard mask
  2. Channel-wise masking (using squeezing to increase channels)
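Both strategies can be sketched in a few lines: a checkerboard mask splits the variables in half, and a squeeze reshapes $(C, H, W)$ to $(4C, H/2, W/2)$ so that channel-wise masking becomes possible (toy shapes, for illustration):

```python
import numpy as np

# Checkerboard mask for an H x W feature map: True/False alternating.
H, W = 4, 4
checker = (np.add.outer(np.arange(H), np.arange(W)) % 2).astype(bool)

def squeeze(x):
    # Space-to-depth: trade spatial resolution for channels,
    # (C, H, W) -> (4C, H/2, W/2).
    C, H, W = x.shape
    x = x.reshape(C, H//2, 2, W//2, 2)
    return x.transpose(0, 2, 4, 1, 3).reshape(4*C, H//2, W//2)

x = np.arange(1*4*4, dtype=float).reshape(1, 4, 4)
y = squeeze(x)
assert y.shape == (4, 2, 2)              # 4x the channels, half the resolution
assert checker.sum() == H * W // 2       # mask splits variables in half
```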

Composition of Coupling Layers

NVP Composition

Multi-scale Architecture

Example structure:

  • Variables at different scales are not equivalent
  • Factorization:
p(z_1, z_3, z_5) = p(z_1|z_3,z_5)\, p(z_3|z_5)\, p(z_5)

Standardization:

\hat{z}_1 = \frac{z_1 - \mu(z_2)}{\sigma(z_2)}

Image:
Multi-scale

GLOW

Glow architecture:

Glow

Glow properties:

Glow Table

Convolution as Invertible Transform

Glow mixes channels with a $1 \times 1$ convolution: an invertible $3 \times 3$ matrix $W$ (for 3 channels) applied at every spatial position:

x = Wz

If $W$ is invertible:

J_f = W

Image:
Convolution
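A sketch of the $1 \times 1$ convolution as channel mixing: the same $C \times C$ matrix acts at every pixel, so the layer is invertible and its total log-determinant is $HW \log|\det W|$ (a random $W$ is invertible with probability 1; shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W = 3, 4, 4
Wmat = rng.normal(size=(C, C))           # random C x C mixing matrix

# 1x1 convolution = the same matrix applied to the channel vector at each pixel.
z = rng.normal(size=(C, H, W))
x = np.einsum('ij,jhw->ihw', Wmat, z)

# Invertible: recover z exactly with W^{-1}.
z_rec = np.einsum('ij,jhw->ihw', np.linalg.inv(Wmat), x)
assert np.allclose(z_rec, z)

# Per-pixel Jacobian is W, so the total log|det| is H*W * log|det W|.
total_log_det = H * W * np.log(abs(np.linalg.det(Wmat)))
assert np.isfinite(total_log_det)
```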

Results

See OpenAI Glow results: https://openai.com/index/glow/

Result
Result 2