<p><strong>Graph convolutional neural networks</strong><br />Matthew N. Bernstein, 2023-09-24 (<a href="https://mbernste.github.io/posts/gcn">https://mbernste.github.io/posts/gcn</a>)</p>
<p><em>Graphs are ubiquitous mathematical objects that describe a set of relationships between entities; however, they are challenging to model with traditional machine learning methods, which require that the input be represented as vectors. In this post, we will discuss graph convolutional networks (GCNs): a class of neural network designed to operate on graphs. We will discuss the intuition behind the GCN and how it is similar to and different from the convolutional neural network (CNN) used in computer vision. We will conclude by presenting a case-study training a GCN to classify molecule toxicity.</em></p>
<h2 id="introduction">Introduction</h2>
<p>Graphs are ubiquitous mathematical objects that describe a set of relationships between entities; however, they are challenging to model with traditional machine learning methods, which require that the input be represented as a tensor. Graphs break this paradigm because the order of their nodes and edges is arbitrary, and the model must be capable of accommodating this feature. In this post, we will discuss graph convolutional networks (GCNs) as presented by <a href="https://arxiv.org/abs/1609.02907">Kipf and Welling (2017)</a>: a class of neural network designed to operate on graphs. As their name suggests, graph convolutional neural networks can be understood as performing a convolution in the same way that traditional convolutional neural networks (CNNs) perform a convolution-like operation (i.e., <a href="https://en.wikipedia.org/wiki/Cross-correlation">cross correlation</a>) when operating on images. This analogy is depicted below:</p>
<p><br /></p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/GCN_vs_CNN_overview.png" alt="drawing" width="500" /></center>
<p>In this post, we will discuss the intuition behind the GCN and how it is similar to and different from the CNN. We will conclude by presenting a case-study training a GCN to classify molecule toxicity.</p>
<h2 id="inputs-and-outputs-of-a-gcn">Inputs and outputs of a GCN</h2>
<p>Fundamentally, a GCN takes as input a graph together with a set of <a href="https://en.wikipedia.org/wiki/Feature_(machine_learning)">feature vectors</a> where each node is associated with its own feature vector. The GCN is then composed of a series of graph convolutional layers (to be discussed in the next section) that iteratively transform the feature vectors at each node. The output is the same graph, now with an output vector associated with each node. These output vectors can be (and often are) of different dimension than the input vectors. This is depicted below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/GCN_output_vectors_per_node.png" alt="drawing" width="800" /></center>
<p>If the task at hand is a “node-level” task, such as performing <a href="https://en.wikipedia.org/wiki/Statistical_classification">classification</a> on the nodes, then these per-node vectors can be treated as the model’s final outputs. For node-level classification, these output vectors could, for example, encode the probabilities that each node is associated with each class.</p>
<p>Alternatively, we may be interested in performing a “graph-level” task, where instead of building a model that produces an output per node, we are interested in a task that requires an output over the graph as a whole. For example, we may be interested in classifying whole graphs rather than individual nodes. In this scenario, the per-node vectors could be fed, collectively, into another neural network (such as a simple <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron">multilayer perceptron</a>) that operates on all of them to produce a single output vector. This scenario is depicted below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/GCN_output_vectors_per_graph.png" alt="drawing" width="800" /></center>
<p>Note, GCNs can also perform “edge-level” tasks, but we will not discuss this here. See <a href="https://distill.pub/2021/gnn-intro/">this article by Sanchez-Lengeling <em>et al</em>. (2021)</a> for a discussion on how GCNs can perform various types of tasks with graphs.</p>
<p>In the next sections we will dig deeper into the graph convolutional layer.</p>
<h2 id="the-graph-convolutional-layer">The graph convolutional layer</h2>
<p>GCNs are composed of stacked <strong>graph convolutional layers</strong> in much the same way that traditional CNNs are composed of convolutional layers. Each convolutional layer takes as input the nodes’ vectors from the previous layer (for the first layer this would be the input feature vectors) and produces corresponding output vectors for each node. To do so, the graph convolutional layer pools the vectors from each node’s neighbors as depicted below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/GCN_conv_layer_message_pass_1.png" alt="drawing" width="800" /></center>
<p><br /></p>
<p>In the schematic above, node A’s vector, denoted $\boldsymbol{x}_A$, is pooled/aggregated with the vectors of its neighbors, $\boldsymbol{x}_B$ and $\boldsymbol{x}_C$. This pooled vector is then transformed/updated to form node A’s vector in the next layer, denoted $\boldsymbol{h}_A$. This same procedure is carried out over every node. Below we show this same procedure, but on node D, which entails aggregating $\boldsymbol{x}_D$ with $\boldsymbol{x}_B$:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/GCN_conv_layer_message_pass_2.png" alt="drawing" width="800" /></center>
<p><br /></p>
<p>This procedure is often called <strong>message passing</strong> since each node is “passing” its vector to its neighbors in order to update their vectors. Each node’s “message” is the vector associated with it.</p>
<p>Now, how exactly does a GCN perform the aggregation and update? To answer this, we will now dig into the mathematics of the graph convolutional layer. Let $\boldsymbol{X} \in \mathbb{R}^{n \times d}$ be the features corresponding to the nodes where $n$ is the number of nodes and $d$ is the number of features. That is, row $i$ of $\boldsymbol{X}$ stores the features of node $i$. Let $\boldsymbol{A}$ be the adjacency matrix of this graph where</p>
\[\boldsymbol{A}_{i,j} := \begin{cases} 1,& \text{if there is an edge between node} \ i \ \text{and} \ j \\ 0, & \text{otherwise}\end{cases}\]
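<p>To make these two inputs concrete, here is a minimal NumPy sketch that builds $\boldsymbol{A}$ and $\boldsymbol{X}$ for a small graph. The graph (nodes A to D with edges A-B, A-C, and B-D) and the feature values are made up for illustration; they are assumptions, not data from this post.</p>

```python
import numpy as np

# Hypothetical toy graph: nodes A, B, C, D with edges A-B, A-C, B-D.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "D")]
index = {name: i for i, name in enumerate(nodes)}

# Build the symmetric adjacency matrix of the undirected graph.
n = len(nodes)
A = np.zeros((n, n))
for u, v in edges:
    A[index[u], index[v]] = 1.0
    A[index[v], index[u]] = 1.0

# One feature vector per node (n = 4 nodes, d = 2 features).
# The values are arbitrary placeholders.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 0.5]])

print(A)
print(X.shape)
```

<p>Row $i$ of <code>A</code> marks the neighbors of node $i$, and row $i$ of <code>X</code> holds that node’s features.</p>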
<p>Note that the matrices $\boldsymbol{X}$ and $\boldsymbol{A}$ are the two inputs required by a GCN for a given graph. The graph convolutional layer can then be expressed as a function on these two inputs that outputs a matrix representing the vectors associated with each node at the next layer. This function is given by:</p>
\[f(\boldsymbol{X}, \boldsymbol{A}) := \sigma\left(\boldsymbol{D}^{-1/2}(\boldsymbol{A}+\boldsymbol{I})\boldsymbol{D}^{-1/2} \boldsymbol{X}\boldsymbol{W}\right)\]
<p>where,</p>
\[\begin{align*}\boldsymbol{A} \in \mathbb{R}^{n \times n} &:= \text{The adjacency matrix} \\ \boldsymbol{I} \in \mathbb{R}^{n \times n} &:= \text{The identity matrix} \\ \boldsymbol{D} \in \mathbb{R}^{n \times n} &:= \text{The degree matrix of } \ \boldsymbol{A}+\boldsymbol{I} \\ \boldsymbol{X} \in \mathbb{R}^{n \times d} &:= \text{The input data (i.e., the per-node feature vectors)} \\ \boldsymbol{W} \in \mathbb{R}^{d \times w} &:= \text{The layer's weights} \\ \sigma(.) &:= \text{The activation function (e.g., ReLU)}\end{align*}\]
<p>When I first saw this equation I found it to be quite confusing. To break it down, here is what each <a href="https://mbernste.github.io/posts/matrix_multiplication/">matrix multiplication</a> is doing in this function:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/GCN_layer_equation_annotated.png" alt="drawing" width="350" /></center>
<p><br /></p>
<p>Let’s examine the first operation: $\boldsymbol{A}+\boldsymbol{I}$. This operation simply adds ones along the diagonal entries of the adjacency matrix. This is the equivalent of adding self-loops to the graph where each node has an edge pointing to itself. We need this because when we perform message passing, each node should pass its vector to itself (since each node aggregates its own vector together with its neighbors).</p>
<p>The matrix $\boldsymbol{D}$ is the <a href="https://en.wikipedia.org/wiki/Degree_matrix">degree matrix</a> of $\boldsymbol{A}+\boldsymbol{I}$. This is a diagonal matrix where element $i,i$ stores the total number of neighboring nodes to node $i$ (including itself). That is,</p>
\[\boldsymbol{D} := \begin{bmatrix}d_{1,1} & 0 & 0 & \dots & 0 \\ 0 & d_{2,2} & 0 & \dots & 0 \\ 0 & 0 & d_{3,3} & \dots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & d_{n,n}\end{bmatrix}\]
<p>where $d_{i,i}$ is the number of nodes adjacent to node $i$ (i.e., its direct neighbors), counting node $i$ itself because of the self-loop.</p>
<p>The matrix $\boldsymbol{D}^{-1/2}$ is the matrix formed by taking the reciprocal of the square root of each diagonal entry in $\boldsymbol{D}$. That is,</p>
\[\boldsymbol{D}^{-1/2} := \begin{bmatrix}\frac{1}{\sqrt{d_{1,1}}} & 0 & 0 & \dots & 0 \\ 0 & \frac{1}{\sqrt{d_{2,2}}} & 0 & \dots & 0 \\ 0 & 0 & \frac{1}{\sqrt{d_{3,3}}} & \dots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & \frac{1}{\sqrt{d_{n,n}}}\end{bmatrix}\]
<p>As we will discuss in the next section, left and right multiplying $\boldsymbol{A}+\boldsymbol{I}$ by $\boldsymbol{D}^{-1/2}$ can be viewed as “normalizing” the adjacency matrix. We defer what we mean by “normalizing”, and why this is an important step, to that section; for now, the important point to realize is that, like $\boldsymbol{A}+\boldsymbol{I}$, the matrix $\boldsymbol{D}^{-1/2}(\boldsymbol{A}+\boldsymbol{I})\boldsymbol{D}^{-1/2}$ has a non-zero entry at element $i,j$ only if nodes $i$ and $j$ are adjacent (or $i = j$). For ease of notation, let’s let $\tilde{\boldsymbol{A}}$ denote this normalized matrix. That is,</p>
\[\tilde{\boldsymbol{A}} := \boldsymbol{D}^{-1/2}(\boldsymbol{A}+\boldsymbol{I})\boldsymbol{D}^{-1/2}\]
<p>Then,</p>
\[\tilde{\boldsymbol{A}}_{i,j} = \begin{cases} \frac{1}{\sqrt{d_{i,i} d_{j,j}}} ,& \text{if} \ i = j \ \text{or there is an edge between node} \ i \ \text{and} \ j \\ 0, & \text{otherwise}\end{cases}\]
<p>With this notation, we can simplify the graph convolutional layer function as follows:</p>
\[f(\boldsymbol{X}, \boldsymbol{A}) := \sigma\left(\tilde{\boldsymbol{A}}\boldsymbol{X}\boldsymbol{W}\right)\]
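<p>The construction of $\tilde{\boldsymbol{A}}$ can be checked numerically. The sketch below uses NumPy and a made-up four-node graph (an assumption for illustration) to build $\boldsymbol{A}+\boldsymbol{I}$, $\boldsymbol{D}$, and $\boldsymbol{D}^{-1/2}$, and then confirms that each non-zero entry equals $1/\sqrt{d_{i,i} d_{j,j}}$.</p>

```python
import numpy as np

# Hypothetical toy graph with edges 0-1, 0-2, 1-3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)

A_loops = A + np.eye(4)                    # A + I: add self-loops
deg = A_loops.sum(axis=1)                  # diagonal of D (neighbors incl. self)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))   # D^{-1/2}
A_tilde = D_inv_sqrt @ A_loops @ D_inv_sqrt

# Entry-wise formula: 1 / sqrt(d_ii * d_jj) wherever A + I is non-zero.
expected = A_loops / np.sqrt(np.outer(deg, deg))

print(deg)
print(np.allclose(A_tilde, expected))
```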
<p>Next, let’s turn to the matrix $\tilde{\boldsymbol{A}}\boldsymbol{X}$. This matrix-product is performing the aggregation function/message passing that we described previously. That is, for every feature, we take a weighted sum of the features of the adjacent nodes where the weights are determined by $\tilde{\boldsymbol{A}}$.</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/GCN_aggregation_matrices.png" alt="drawing" width="700" /></center>
<p><br /></p>
<p>Let $\bar{\boldsymbol{x}}_i$ denote the vector at node $i$ representing the aggregated features. We see that this vector is given by:</p>
\[\begin{align*}\bar{\boldsymbol{x}}_i &= \sum_{j=1}^n \tilde{a}_{i,j} \boldsymbol{x}_j \\ &= \sum_{j \in \text{Neigh}(i)} \tilde{a}_{i,j} \boldsymbol{x}_j \\ &= \sum_{j \in \text{Neigh}(i)} \frac{1}{\sqrt{d_{i,i} d_{j,j}}} \boldsymbol{x}_j\end{align*}\]
<p>That is, it is simply computed by taking a weighted sum of the neighboring vectors where the weights are stored in the normalized adjacency matrix. We will discuss these neighbor-weights in more detail in the next section, but it is important to note that these weights are <em>not learned</em> weights – that is, they are not parameters to the model. Rather they are determined based only on the input graph itself.</p>
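<p>As a sanity check, the matrix product and the per-node sum above should agree. The following NumPy sketch compares the two for one node of a made-up graph; the graph and feature values are assumptions chosen only for illustration.</p>

```python
import numpy as np

# Hypothetical toy graph with edges 0-1, 0-2, 1-3 and arbitrary features.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 0.5]])

A_loops = A + np.eye(4)
deg = A_loops.sum(axis=1)
A_tilde = A_loops / np.sqrt(np.outer(deg, deg))

# Aggregation for all nodes at once.
aggregated = A_tilde @ X

# The same aggregation for node 0, written as an explicit weighted sum
# over its neighborhood (which includes node 0 via the self-loop).
i = 0
neigh = np.flatnonzero(A_loops[i])
x_bar = sum(X[j] / np.sqrt(deg[i] * deg[j]) for j in neigh)

print(np.allclose(aggregated[i], x_bar))
```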
<p>So where are the learned weights/parameters to the model? They are stored in the matrix $\boldsymbol{W}$. In the next matrix multiplication, we “update” the aggregated feature vectors according to these weights via $\left(\tilde{\boldsymbol{A}}\boldsymbol{X}\right)\boldsymbol{W}$:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/GCN_update_matrices.png" alt="drawing" width="700" /></center>
<p>These vectors are then passed to the activation function, $\sigma$, before being output by the layer. This activation function injects non-linearity into the model.</p>
<p>One key point to note is that the dimensionality of the weight matrix, $\boldsymbol{W}$, does not depend on the number of nodes in the graph. Thus, we see that the graph convolutional layer can operate on graphs of any size so long as the feature vectors at each node are of the same dimension!</p>
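<p>To illustrate this point, here is a minimal NumPy sketch of a single layer, $\sigma(\tilde{\boldsymbol{A}}\boldsymbol{X}\boldsymbol{W})$ with ReLU as $\sigma$, applied with the same weight matrix to two graphs of different sizes. The helper names <code>normalize_adjacency</code> and <code>gcn_layer</code>, the graphs, and the random weights are all assumptions made for this example.</p>

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    A_loops = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_loops.sum(axis=1))
    return A_loops * np.outer(d_inv_sqrt, d_inv_sqrt)

def gcn_layer(X, A, W):
    """One graph convolutional layer: aggregate, update, then ReLU."""
    return np.maximum(normalize_adjacency(A) @ X @ W, 0.0)

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))  # d = 2 input features, w = 3 output features

# A 3-node path graph and a 5-node cycle graph.
A_small = np.array([[0, 1, 0],
                    [1, 0, 1],
                    [0, 1, 0]], dtype=float)
A_large = np.zeros((5, 5))
for i in range(5):
    A_large[i, (i + 1) % 5] = 1.0
    A_large[(i + 1) % 5, i] = 1.0

H_small = gcn_layer(rng.normal(size=(3, 2)), A_small, W)
H_large = gcn_layer(rng.normal(size=(5, 2)), A_large, W)
print(H_small.shape, H_large.shape)  # (3, 3) (5, 3)
```

<p>Only the number of rows changes between the two outputs; the weights are shared across graphs.</p>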
<p>We can visualize the graph convolutional layer at a given node using a network diagram highlighting the neural network architecture:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/GCN_as_neural_net.png" alt="drawing" width="800" /></center>
<p>Now, so far we have discussed only a single graph convolutional layer. We can create a multi-layer GCN by stacking graph convolutional layers together where the output of one layer is fed as input to the next layer! That is, the embedded vector at each node, $\boldsymbol{h}_i$, that is output by a graph convolutional layer can be treated as input to the next layer! Mathematically, this would be described as</p>
\[\begin{align*} \boldsymbol{H}_1 &:= f_{\boldsymbol{W}_1}(\boldsymbol{X}, \boldsymbol{A}) \\ \boldsymbol{H}_2 &:= f_{\boldsymbol{W}_2}(\boldsymbol{H}_1, \boldsymbol{A}) \\ \boldsymbol{H}_3 &:= f_{\boldsymbol{W}_3}(\boldsymbol{H}_2, \boldsymbol{A})\end{align*}\]
<p>where $\boldsymbol{H}_1$, $\boldsymbol{H}_2$, and $\boldsymbol{H}_3$ are the embedded node vectors at layers 1, 2 and 3 respectively. The matrices $\boldsymbol{W}_1$, $\boldsymbol{W}_2$, and $\boldsymbol{W}_3$ are the weight matrices that parameterize each layer. A schematic illustration of stacked graph convolutional layers is depicted below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/GCN_stacked_layers.png" alt="drawing" width="350" /></center>
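<p>The three stacked layers can be sketched in a few lines of NumPy. The layer widths (6, 8, 8, 3), the path graph, and the random weights below are arbitrary assumptions; the point is only that the same normalized adjacency matrix is reused at every layer while each layer has its own weight matrix.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-node path graph 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_loops = A + np.eye(4)
deg = A_loops.sum(axis=1)
A_tilde = A_loops / np.sqrt(np.outer(deg, deg))

# Input features and one weight matrix per layer (W_1, W_2, W_3).
X = rng.normal(size=(4, 6))
dims = [6, 8, 8, 3]
weights = [rng.normal(size=(dims[k], dims[k + 1])) for k in range(3)]

# H_1 = f_{W_1}(X, A), H_2 = f_{W_2}(H_1, A), H_3 = f_{W_3}(H_2, A)
H = X
shapes = []
for W in weights:
    H = np.maximum(A_tilde @ H @ W, 0.0)
    shapes.append(H.shape)

print(shapes)  # [(4, 8), (4, 8), (4, 3)]
```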
<h2 id="normalizing-the-adjacency-matrix">Normalizing the adjacency matrix</h2>
<p>Let’s take a closer look at the normalized adjacency matrix $ \tilde{\boldsymbol{A}} := \boldsymbol{D}^{-1/2}(\boldsymbol{A}+\boldsymbol{I})\boldsymbol{D}^{-1/2}$. What is the intuition behind this matrix, and what do we mean by “normalized”?</p>
<p>To understand this normalized matrix, let us first consider what happens in the convolutional layer if we don’t perform any normalization and instead naively use the raw adjacency matrix (with ones along the diagonal), $\boldsymbol{A}+\boldsymbol{I}$. For ease of notation let</p>
\[\hat{\boldsymbol{A}} := \boldsymbol{A} + \boldsymbol{I}\]
<p>Then, the graph convolutional layer function without normalization would be:</p>
\[f_{\text{unnormalized}}(\boldsymbol{X}, \boldsymbol{A}) := \sigma(\hat{\boldsymbol{A}}\boldsymbol{X}\boldsymbol{W})\]
<p>In the aggregation step, the aggregated features for node $i$, again denoted as $\bar{\boldsymbol{x}}_i$, will be given by</p>
\[\begin{align*}\bar{\boldsymbol{x}}_i &= \sum_{j=1}^n \hat{a}_{i,j} \boldsymbol{x}_j \\ &= \sum_{j=1}^n \mathbb{I}(j \in \text{Neigh}(i)) \boldsymbol{x}_j \\ &= \sum_{j \in \text{Neigh}(i)} \boldsymbol{x}_j\end{align*}\]
<p>where $\mathbb{I}$ is the <a href="https://en.wikipedia.org/wiki/Indicator_function">indicator function</a> and $\text{Neigh}(i)$ is the set of neighbors of node $i$ (which, because of the self-loops, includes node $i$ itself).</p>
<p>We see that this aggregation step simply adds together all of the feature vectors of $i$’s adjacent nodes with its own feature vector. A problem becomes apparent: for nodes that have many neighbors, this sum will be large and we will get vectors with large magnitudes. Conversely, for nodes with few neighbors, this sum will result in vectors with small magnitudes. This is not a desirable property! When attempting to train our neural network, each node’s vector will be highly dependent on the number of neighbors that surround it, and it will be challenging to optimize weights that look for signals in the neighboring nodes that are independent of the number of neighbors. Another problem is that if we have multiple layers, the vector associated with a given node may blow up in magnitude the deeper into the layers it goes, which can lead to numerical stability issues. Thus, we need a way to perform this aggregation step so that the aggregated vector for each node is of similar magnitude and is not dependent on each node’s number of neighbors.</p>
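<p>A small numerical example makes the magnitude problem visible. The sketch below uses NumPy and a made-up star graph (one hub connected to 50 leaves, every node carrying the feature value 1); these choices are assumptions for illustration only.</p>

```python
import numpy as np

# Hypothetical star graph: node 0 is a hub connected to 50 leaf nodes.
n_leaves = 50
n = n_leaves + 1
A = np.zeros((n, n))
A[0, 1:] = 1.0
A[1:, 0] = 1.0

A_hat = A + np.eye(n)   # raw adjacency matrix with self-loops
X = np.ones((n, 1))     # every node has the same single feature

# Unnormalized aggregation: a plain sum over each neighborhood.
summed = A_hat @ X
print(summed[0, 0], summed[1, 0])  # 51.0 2.0
```

<p>Although every node started with an identical feature, the hub’s aggregated value is roughly 25 times larger than a leaf’s purely because of its degree.</p>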
<p>One idea to mitigate this issue would be to take the <em>mean</em> of the neighboring vectors rather than the sum. That is to compute,</p>
\[\bar{\boldsymbol{x}}_i = \frac{1}{\left\vert \text{Neigh}(i) \right\vert}\sum_{j \in \text{Neigh}(i)} \boldsymbol{x}_j\]
<p>Here, for node $i$, we simply divide the sum by the number of neighbors of node $i$. We can accomplish this averaging operation across all nodes in the graph at once if we normalize the adjacency matrix as follows:</p>
\[\boldsymbol{D}^{-1}\hat{\boldsymbol{A}}\]
<p>Using this version of a normalized matrix for our convolutional layer, we would have:</p>
\[f_{\text{mean}}(\boldsymbol{X}, \boldsymbol{A}) := \sigma(\boldsymbol{D}^{-1}\hat{\boldsymbol{A}}\boldsymbol{X}\boldsymbol{W})\]
<p>We can confirm that the aggregation step at node $i$ would be taking a mean of the vectors of the neighboring nodes:</p>
\[\begin{align*}\bar{\boldsymbol{x}}_i &= \sum_{j=1}^n \frac{1}{d_{i,i}} \hat{a}_{i,j} \boldsymbol{x}_j \\ &= \frac{1}{d_{i,i}}\sum_{j=1}^n \mathbb{I}(j \in \text{Neigh}(i)) \boldsymbol{x}_j \\ &= \frac{1}{\left\vert \text{Neigh}(i) \right\vert}\sum_{j \in \text{Neigh}(i)} \boldsymbol{x}_j \end{align*}\]
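<p>The same check can be done numerically. In this NumPy sketch (graph and features are made-up assumptions), each row of $\boldsymbol{D}^{-1}\hat{\boldsymbol{A}}$ sums to one, and multiplying by $\boldsymbol{X}$ reproduces the plain average over each node’s neighborhood.</p>

```python
import numpy as np

# Hypothetical toy graph with edges 0-1, 0-2, 1-3 and arbitrary features.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 0.5]])

A_hat = A + np.eye(4)
deg = A_hat.sum(axis=1)

# D^{-1} A_hat: divide row i by the degree of node i.
A_mean = A_hat / deg[:, None]
aggregated = A_mean @ X

# Direct average over each neighborhood (self included) for comparison.
direct = np.array([X[np.flatnonzero(A_hat[i])].mean(axis=0) for i in range(4)])

print(A_mean.sum(axis=1))
print(np.allclose(aggregated, direct))
```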
<p>This normalization is a reasonable approach, but <a href="https://arxiv.org/abs/1609.02907">Kipf and Welling (2017)</a> propose a slightly different normalization method that goes a step further than simple averaging. Their normalization is given by</p>
\[\tilde{\boldsymbol{A}} := \boldsymbol{D}^{-1/2}\hat{\boldsymbol{A}}\boldsymbol{D}^{-1/2}\]
<p>which results in each element of this normalized matrix being</p>
\[\tilde{\boldsymbol{A}}_{i,j} = \begin{cases} \frac{1}{\sqrt{d_{i,i} d_{j,j}}} ,& \text{if} \ i = j \ \text{or there is an edge between node} \ i \ \text{and} \ j \\ 0, & \text{otherwise}\end{cases}\]
<p>We note that this normalization performs a correction similar to mean normalization (i.e., $\boldsymbol{D}^{-1}\hat{\boldsymbol{A}}$) because the edge weight between adjacent nodes $i$ and $j$ will be smaller if node $i$ is connected to many nodes, and larger if it is connected to few nodes.</p>
<p>However, this raises the question: why use this alternative normalization approach rather than the more straightforward mean normalization? It turns out that this alternative normalization approach normalizes for something beyond how many neighbors each node has: <em>it also normalizes for how many neighbors each neighbor has</em>.</p>
<p>Let us say we have some node, Node $i$, with two neighbors: Neighbor 1 and Neighbor 2. Neighbor 1’s <em>only neighbor</em> is $i$. In contrast, Neighbor 2 is neighbors with many nodes in the graph (including $i$). Intuitively, because Neighbor 2 has so many neighbors, it has the opportunity to pass its message to more nodes in the graph. In contrast, Neighbor 1 can only pass its message to Node $i$ and thus, its influence on the rest of the graph is dictated by how Node $i$ is passing along its message. We see that Neighbor 1 is sort of “disempowered” relative to Neighbor 2 just based on its location in the graph. Is there some way to compensate for this imbalance? It turns out that the alternative normalization approach (i.e., $\boldsymbol{D}^{-1/2}\hat{\boldsymbol{A}}\boldsymbol{D}^{-1/2}$), does just that!</p>
<p>Recall that the $i, j$ element of $\boldsymbol{D}^{-1/2}\hat{\boldsymbol{A}}\boldsymbol{D}^{-1/2}$ is given by $\frac{1}{\sqrt{d_{i,i} d_{j,j}}}$. We see that this value will not only be lower if node $i$ has many neighbors, it will also be lower if node $j$, its neighbor, has many neighbors! This helps to boost the signal propagated by nodes with few neighbors relative to nodes with many neighbors.</p>
<p>We illustrate how this works in the schematic below where we color the nodes in a graph according to the weights associated with neighbors of Node 4 according to both mean normalization, $\boldsymbol{D}^{-1}\hat{\boldsymbol{A}}$, and the alternative normalization, $\boldsymbol{D}^{-1/2}\hat{\boldsymbol{A}}\boldsymbol{D}^{-1/2}$ (i.e., the 4th row of these two matrices):</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/GCN_compare_graph_norm.png" alt="drawing" width="700" /></center>
<p>We see that the mean normalization provides equal weight to all of the neighbors of Node 4 (including itself). In contrast, the alternative normalization gives the highest weight to Node 5, because it has few neighbors, and gives a lower weight to Node 1 because it has so many neighbors.</p>
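<p>The same effect can be reproduced on a small made-up graph. The NumPy sketch below is not the graph from the figure; it is a hypothetical six-node graph in which node 4 has one high-degree neighbor (node 0, a hub) and one low-degree neighbor (node 5). We compare row 4 of the two normalized matrices.</p>

```python
import numpy as np

# Hypothetical graph: node 0 is a hub (edges to 1, 2, 3, 4); node 5 hangs off node 4.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5)]
n = 6
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = 1.0
    A[v, u] = 1.0

A_hat = A + np.eye(n)
deg = A_hat.sum(axis=1)

A_mean = A_hat / deg[:, None]                 # D^{-1} A_hat
A_sym = A_hat / np.sqrt(np.outer(deg, deg))   # D^{-1/2} A_hat D^{-1/2}

# Weights node 4 places on its neighbors 0, 4 (itself), and 5.
print(np.round(A_mean[4], 3))  # equal weights of about 0.333 on nodes 0, 4, 5
print(np.round(A_sym[4], 3))   # about 0.258 on node 0, 0.333 on itself, 0.408 on node 5
```

<p>Mean normalization treats the hub and the leaf identically, whereas the symmetric normalization down-weights the hub and up-weights the leaf.</p>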
<h2 id="comparing-gcns-and-cnns">Comparing GCNs and CNNs</h2>
<p>In the prior section, we described the graph convolution layer as performing a message passing procedure. However, there is <a href="https://mbernste.github.io/posts/understanding_3d/">another perspective</a> from which we can view this process: as performing a convolution operation similar to the convolution-like operation that is performed by CNNs on images (hence the name graph <em>convolutional</em> neural network). Specifically, we can view the message passing procedure instead as the process of passing a <strong>filter</strong> (also called a <strong>kernel</strong>) over each node such that when the filter is centered over a given node, it combines data from the nearby nodes to produce the output vector for that node.</p>
<p>Let’s start with a CNN on images and recall how the filter is passed over each pixel and the values of the neighboring pixels are combined to form the output value at the next layer:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/GCN_passing_filter_image.png" alt="drawing" width="350" /></center>
<p><br /></p>
<p>In a similar manner, for GCNs, a filter is passed over each node and the values of the neighboring nodes are combined to form the output value at the next layer:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/GCN_passing_filter_graph.png" alt="drawing" width="550" /></center>
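<p>One way to see the connection concretely is to pick a graph whose structure is as regular as a pixel grid. In the NumPy sketch below (an illustrative assumption, not an example from the original analysis), the graph is a ring of six nodes. Every node then has exactly three neighbors including itself, so the aggregation $\tilde{\boldsymbol{A}}\boldsymbol{x}$ reduces to sliding a fixed averaging filter of width three around the ring.</p>

```python
import numpy as np

# Ring (cycle) graph on 6 nodes: node i is connected to i-1 and i+1 (mod 6).
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = 1.0
    A[(i + 1) % n, i] = 1.0

A_loops = A + np.eye(n)
deg = A_loops.sum(axis=1)
A_tilde = A_loops / np.sqrt(np.outer(deg, deg))

# A single scalar feature per node.
x = np.arange(n, dtype=float)

# Graph aggregation versus a circular width-3 averaging filter.
graph_out = A_tilde @ x
filter_out = (np.roll(x, 1) + x + np.roll(x, -1)) / 3.0

print(np.allclose(graph_out, filter_out))
```

<p>One difference from a CNN is visible here: the filter coefficients come from the graph structure, while the learned parameters live in $\boldsymbol{W}$.</p>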
<h2 id="aggregating-information-across-nodes-to-perform-graph-level-tasks">Aggregating information across nodes to perform graph-level tasks</h2>
<p>To perform a graph-level task, such as classifying graphs, we need to aggregate information across nodes. A simple way to do this is to pool the vectors associated with the nodes into a single vector associated with the entire graph. This pooling can be done by taking the mean of each feature (mean pooling) or the maximum of each feature (max pooling). This aggregated vector can then be used as input to a fully connected, multi-layer perceptron that produces the final output:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/GCN_graph_pool_vectors.png" alt="drawing" width="700" /></center>
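<p>Both pooling operations are one-liners. The NumPy sketch below applies them to a made-up matrix of per-node output vectors and passes the mean-pooled vector through a single linear unit with a sigmoid as a stand-in for the final network; the matrix <code>H</code> and the readout weights <code>w</code> and <code>b</code> are hypothetical values chosen for illustration.</p>

```python
import numpy as np

# Hypothetical per-node output vectors from the last GCN layer (3 nodes, 3 features).
H = np.array([[0.2, 1.0, 0.0],
              [0.4, 0.0, 3.0],
              [0.0, 2.0, 0.0]])

mean_pooled = H.mean(axis=0)   # mean pooling: one value per feature
max_pooled = H.max(axis=0)     # max pooling: one value per feature
print(mean_pooled, max_pooled)

# Pooling does not depend on the (arbitrary) order of the nodes.
permuted = H[[2, 0, 1]]
print(np.allclose(permuted.mean(axis=0), mean_pooled))

# Stand-in for the final network: one linear unit followed by a sigmoid.
w = np.array([0.5, -1.0, 0.25])  # hypothetical weights
b = 0.1                          # hypothetical bias
prob = 1.0 / (1.0 + np.exp(-(mean_pooled @ w + b)))
```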
<h2 id="case-study-classifying-molecule-toxicity">Case study: classifying molecule toxicity</h2>
<p>In computational biology, graph neural networks are commonly applied to computational tasks operating on molecular structures. A graph is a natural data structure for encoding a molecule; each node represents an atom and each edge connects two atoms that are bonded together. An example is depicted below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/GCN_molecule_as_graph.png" alt="drawing" width="800" /></center>
<p>As a case-study for applying GCNs to a real task, we will implement and train a GCN to classify molecular toxicity. This is a binary-classification task where we are provided a molecule and our goal is to classify it as either toxic or not toxic. More specifically, we will use the <a href="https://ntp.niehs.nih.gov/whatwestudy/tox21">Tox21</a> dataset downloaded from <a href="https://moleculenet.org/datasets-1">MoleculeNet</a>. Applying GCNs to this dataset has been explored by the scientific community (e.g., <a href="https://doi.org/10.1186/s13321-021-00570-8">Chen <em>et al</em>. (2021)</a>), and we will reproduce such efforts here. Specifically, we focus on the task of predicting <a href="https://doi.org/10.1111/j.1365-2567.2009.03054.x">aryl hydrocarbon receptor activation</a> encoded by the “NR-AhR” column of the Tox21 dataset.</p>
<p>Each molecule is represented as a <a href="https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system">SMILES</a> string. To decode these molecules into graphs, we use the <a href="https://github.com/pckroon/pysmiles">pysmiles</a> Python package, which converts each string to a <a href="https://networkx.org/">NetworkX</a> graph. We then split the molecules/graphs into a random training set (85% of the data) and test set (the remaining 15%).</p>
<p>For node features, we use 1) the element of the atom, 2) the number of implicit hydrogens, 3) the charge of the atom, and lastly, 4) the <a href="https://en.wikipedia.org/wiki/Aromaticity">aromaticity</a> of each atom. After normalizing each adjacency matrix, we train the model using binary cross-entropy loss.</p>
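<p>To give a feel for what these inputs look like, the sketch below hand-builds the graph and feature matrix for a single small molecule, ethanol (SMILES <code>CCO</code>), using only NumPy. The atom attributes are typed in by hand as a stand-in for what a SMILES parser would return, the element vocabulary <code>ELEMENTS</code> is hypothetical, and aromaticity is encoded as a single 0/1 flag for brevity, so this is a simplified illustration rather than the exact encoding used in the Appendix.</p>

```python
import numpy as np

# Hand-typed heavy atoms of ethanol (CH3-CH2-OH); a stand-in for parser output.
atoms = [
    {"element": "C", "hcount": 3, "charge": 0, "aromatic": False},
    {"element": "C", "hcount": 2, "charge": 0, "aromatic": False},
    {"element": "O", "hcount": 1, "charge": 0, "aromatic": False},
]
bonds = [(0, 1), (1, 2)]

# Hypothetical element vocabulary used for the one-hot encoding.
ELEMENTS = ["C", "N", "O"]

def atom_features(atom):
    """One-hot element, then hydrogen count, charge, and aromatic flag."""
    one_hot = [1.0 if atom["element"] == e else 0.0 for e in ELEMENTS]
    return one_hot + [float(atom["hcount"]),
                      float(atom["charge"]),
                      float(atom["aromatic"])]

# Node feature matrix X (one row per atom) and adjacency matrix A (one entry per bond).
X = np.array([atom_features(a) for a in atoms])
A = np.zeros((len(atoms), len(atoms)))
for i, j in bonds:
    A[i, j] = 1.0
    A[j, i] = 1.0

print(X)
print(A)
```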
<p>Finally, we apply the model to the test set and generate an <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC curve</a> and <a href="https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html#:~:text=The%20precision%2Drecall%20curve%20shows,a%20low%20false%20negative%20rate.">precision-recall curve</a>:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/GCN_Tox21_ROC_PR.png" alt="drawing" width="450" /></center>
<p>The AUROC is 0.86, which isn’t too bad!</p>
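<p>For readers unfamiliar with the metric, the AUROC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The NumPy sketch below computes it that way on four made-up predictions; the helper <code>auroc</code> and the toy labels and scores are assumptions for illustration and are unrelated to the Tox21 results.</p>

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# Toy labels and predicted probabilities (made up).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(auroc(y_true, y_score))  # 0.75
```

<p>Here three of the four positive-negative pairs are ordered correctly, giving 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.</p>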
<p>The code for this analysis is shown in the Appendix to this post and can be executed <a href="https://colab.research.google.com/drive/1Z9HJqUHV257DITCOOy1DwgDa_nGCuE4m?usp=sharing">on Google Colab</a>.</p>
<p>Note, this is not a very efficient implementation. Notably, we are storing each graph’s adjacency matrix and node feature vectors in a list rather than a tensor. A more efficient strategy would be to store each training batch as a tensor in order to <a href="https://en.wikipedia.org/wiki/Array_programming">vectorize</a> the forward and backward passes. Advanced graph neural network software packages have tricks for overcoming such inefficiencies such as those <a href="https://pytorch-geometric.readthedocs.io/en/latest/advanced/batching.html">described here</a>.</p>
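<p>As a rough illustration of one such batching scheme, the NumPy sketch below places the normalized adjacency matrices of two graphs along the diagonal of one larger matrix and concatenates their node features, so that a single matrix product processes both graphs at once. This reflects our reading of the block-diagonal idea in the linked documentation; the helper name, graphs, and weights are made-up assumptions, and real libraries use sparse matrices rather than dense ones.</p>

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    A_loops = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_loops.sum(axis=1))
    return A_loops * np.outer(d_inv_sqrt, d_inv_sqrt)

rng = np.random.default_rng(0)

# Two hypothetical graphs: a single edge (2 nodes) and a triangle (3 nodes).
A1 = np.array([[0, 1], [1, 0]], dtype=float)
A2 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
X1 = rng.normal(size=(2, 4))
X2 = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 3))

# Block-diagonal batch: no edges connect the two graphs.
A_batch = np.zeros((5, 5))
A_batch[:2, :2] = normalize_adjacency(A1)
A_batch[2:, 2:] = normalize_adjacency(A2)
X_batch = np.vstack([X1, X2])
graph_id = np.array([0, 0, 1, 1, 1])  # which graph each node belongs to

# One layer applied to the whole batch, then mean pooling per graph.
H_batch = np.maximum(A_batch @ X_batch @ W, 0.0)
pooled = np.stack([H_batch[graph_id == g].mean(axis=0) for g in range(2)])

# Reference: process each graph separately.
H1 = np.maximum(normalize_adjacency(A1) @ X1 @ W, 0.0)
H2 = np.maximum(normalize_adjacency(A2) @ X2 @ W, 0.0)
print(np.allclose(pooled[0], H1.mean(axis=0)),
      np.allclose(pooled[1], H2.mean(axis=0)))
```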
<h2 id="related-links">Related links</h2>
<ul>
<li><strong>A Gentle Introduction to Graph Neural Networks</strong> <a href="https://distill.pub/2021/gnn-intro/">https://distill.pub/2021/gnn-intro/</a></li>
<li><strong>Graph Convolutional Networks</strong> <a href="https://tkipf.github.io/graph-convolutional-networks/">https://tkipf.github.io/graph-convolutional-networks/</a></li>
</ul>
<h2 id="appendix">Appendix</h2>
<p>The full code for the toxicity analysis is shown below (and can be run on <a href="https://colab.research.google.com/drive/1Z9HJqUHV257DITCOOy1DwgDa_nGCuE4m?usp=sharing">Google Colab</a>).</p>
<p>First, download the dataset via the following commands:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -O https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/tox21.csv.gz
gunzip -f tox21.csv.gz
</code></pre></div></div>
<p>The Python code is then:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils import shuffle
from pysmiles import read_smiles
import networkx as nx

# Load the dataset
df_tox21 = pd.read_csv('tox21.csv').set_index('mol_id')

# The column of the table encoding the toxicity labels
TASK = 'NR-AhR'

# Keep only molecules with a 0/1 label for the task, and only the relevant columns
df_tox21_task = df_tox21.loc[df_tox21[TASK].isin([1.,0.])].loc[:,[TASK, 'smiles']]

# Train-test split ratio
TRAIN_FRACTION = 0.85

# Shuffle the data (the split below must use the shuffled table)
df_tox21_task = shuffle(df_tox21_task, random_state=123)

# Make the split
split_ind = int(len(df_tox21_task) * TRAIN_FRACTION)
df_tox21_train = df_tox21_task[:split_ind]
df_tox21_test = df_tox21_task[split_ind:]
# Draw the first training molecule as a sanity check
net = read_smiles(df_tox21_train.iloc[0]['smiles'])
nx.draw(net, labels=dict(net.nodes(data='element')))

# Normalize the adjacency matrices
def normalize_adj(A):
    # Fill diagonal with one (i.e., add self-edge)
    A_mod = A + torch.eye(A.shape[1])

    # Create degree matrix for each graph
    diag = torch.sum(A_mod, axis=1)
    D = torch.diag(diag)

    # Create the normalizing matrix
    # (i.e., the inverse square root of the degree matrix)
    D_mod = torch.linalg.inv(torch.sqrt(D))

    # Create the normalized adjacency matrix
    # (named A_hat here; this is the matrix denoted A-tilde in the text)
    A_hat = torch.matmul(D_mod, torch.matmul(A_mod, D_mod))
    A_hat = A_hat.to(torch.float64)
    return A_hat

print("Loading network graph models...")
mol_nets_train = [
    read_smiles(s)
    for s in df_tox21_train['smiles']
]
A_train = [
    torch.tensor(nx.adjacency_matrix(net).todense())
    for net in mol_nets_train
]
A_train = [
    normalize_adj(adj)
    for adj in A_train
]
mol_nets_test = [
    read_smiles(s)
    for s in df_tox21_test['smiles']
]
A_test = [
    torch.tensor(nx.adjacency_matrix(net).todense())
    for net in mol_nets_test
]
A_test = [
    normalize_adj(adj)
    for adj in A_test
]

# Generate node-level features
def generate_features(net, attrs, one_hot_encoders):
    feat = None

    # For each attribute, compute the features for this attribute
    # per node
    for attr, enc in zip(attrs, one_hot_encoders):
        node_to_val = nx.get_node_attributes(net, attr)
        vals = []
        for node in sorted(node_to_val.keys()):
            val = node_to_val[node]
            vals.append([val])

        # Encode the values for this attribute as a feature vector
        if enc is not None:
            attr_feat = torch.tensor(enc.transform(vals).todense(), dtype=torch.float64)
        else:
            attr_feat = torch.tensor(vals, dtype=torch.float64)

        # Concatenate the feature vector for this feature to the full
        # feature vector for each node
        if feat is None:
            feat = attr_feat
        else:
            feat = torch.cat((feat, attr_feat), dim=1)
    return feat

# Features to encode
ATTRIBUTES = ['element', 'charge', 'aromatic', 'hcount']
IS_ONE_HOT = [True, False, True, False]
# Create a one-hot encoder for each categorical attribute
# (None marks attributes that are used as raw numeric values)
encoders = []
for attr, is_one_hot in zip(ATTRIBUTES, IS_ONE_HOT):
    if is_one_hot:
        all_vals = set()
        for net in mol_nets_train:
            all_vals.update(nx.get_node_attributes(net, attr).values())
        all_vals = sorted(all_vals)
        print(f"All values of '{attr}' in training set: ", all_vals)
        enc = OneHotEncoder(handle_unknown='ignore')
        enc.fit([[x] for x in all_vals])
        encoders.append(enc)
    else:
        encoders.append(None)

# Build training tensors
X_train = []
for net in mol_nets_train:
    feats = generate_features(net, ATTRIBUTES, encoders)
    X_train.append(feats)
y_train = torch.tensor(df_tox21_train[TASK].to_numpy())

# Build test tensors
X_test = []
for net in mol_nets_test:
    feats = generate_features(net, ATTRIBUTES, encoders)
    X_test.append(feats)
y_test = torch.tensor(df_tox21_test[TASK].to_numpy())
# Implement the GCN
class GCNLayer(torch.nn.Module):
    def __init__(self, dim, hidden_dim):
        super(GCNLayer, self).__init__()
        self.W = torch.zeros(
            dim,
            hidden_dim,
            requires_grad=True,
            dtype=torch.float64
        )
        torch.nn.init.xavier_uniform_(self.W, gain=1.0)
    def forward(self, A_hat, h):
        # Aggregate
        h = torch.matmul(A_hat, h)
        # Update
        h = torch.matmul(h, self.W)
        h = F.relu(h)
        return h
    def parameters(self):
        return [self.W]
class GCN(torch.nn.Module):
    def __init__(self, x_dim, hidden_dim1, hidden_dim2, hidden_dim3):
        super(GCN, self).__init__()
        # Convolutional layers
        self.layer1 = GCNLayer(x_dim, hidden_dim1)
        self.layer2 = GCNLayer(hidden_dim1, hidden_dim2)
        self.layer3 = GCNLayer(hidden_dim2, hidden_dim3)
        # Output layer linear layer
        self.linear = torch.nn.Linear(hidden_dim3, 1, dtype=torch.float64)
    def forward(self, A_hat, x):
        # Aggregate
        #x = torch.matmul(A_hat, x)
        # GCN layers
        x = self.layer1(A_hat, x)
        x = self.layer2(A_hat, x)
        x = self.layer3(A_hat, x)
        # Global average pooling
        x = torch.mean(x, axis=0)
        x = self.linear(x)
        return F.sigmoid(x)
    def parameters(self):
        params = self.layer1.parameters() \
            + self.layer2.parameters() \
            + self.layer3.parameters() \
            + list(self.linear.parameters())
        return params
def train_gcn(A, X, y, batch_size=100, n_epochs=10, lr=0.1):
    # Input validation
    assert len(A) == len(X)
    assert len(X) == len(y)
    # Instantiate model, optimizer, and loss function
    model = GCN(
        X[0].shape[1], # Input dimensions
        20, # Layer 1 dimensions
        20, # Layer 2 dimensions
        5 # Final layer dimensions
    )
    optimizer = optim.Adam(model.parameters(), lr=lr)
    bce = torch.nn.BCELoss()
    # Training loop
    for epoch in range(n_epochs):
        # Shuffle the dataset upon each epoch
        inds = list(np.arange(len(X)))
        random.shuffle(inds)
        X = [X[i] for i in inds]
        A = [A[i] for i in inds]
        y = torch.tensor(y[inds])
        loss_sum = 0
        for start in range(0,len(A),batch_size):
            # Compute the start and end indices for the batch
            end = start + min(batch_size, len(X)-start)
            # Forward pass
            pred = torch.concat([
                model.forward(A[i], X[i])
                for i in range(start, end)
            ])
            # Compute loss on the batch
            loss = bce(pred, y[start:end])
            loss_sum = loss_sum + float(loss)
            # Take gradient step
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print(f"Epoch: {epoch}. Mean loss: {loss_sum/len(A)}")
    return model
# Train the model
model = train_gcn(
    A_train,
    X_train,
    y_train,
    batch_size=200,
    n_epochs=100,
    lr=0.001
)
# Run the model on the test set
y_pred = torch.concat([
    model.forward(A_i, X_i)
    for A_i, X_i in zip(A_test, X_test)
])
# Create and display the ROC curve
fpr, tpr, _ = roc_curve(
    y_test.numpy(),
    y_pred.detach().numpy(),
    pos_label=1
)
roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr).plot()
# Create and display the PR-curve
precision, recall, thresholds = precision_recall_curve(
    y_test.numpy(), y_pred.detach().numpy(), pos_label=1
)
PrecisionRecallDisplay(
    recall=recall,
    precision=precision
).plot()
# Compute AUROC
print(auc(fpr, tpr))
</code></pre></div></div>
<p><strong>What determinants tell us about linear transformations</strong> (Matthew N. Bernstein, 2023-09-04, https://mbernste.github.io/posts/determinants)</p>
<p><em>The determinant of a matrix is often taught as a function that measures the volume of the parallelepiped formed by that matrix’s columns. In this post, we will go a step further in our understanding of the determinant and discuss what the determinant tells us about the linear transformation that is characterized by the matrix. In short, the determinant tells us how much a matrix’s linear transformation grows or shrinks space. The sign of the determinant tells us whether the matrix also inverts space.</em></p>
<h2 id="introduction">Introduction</h2>
<p>In the previous post, we <a href="https://mbernste.github.io/posts/determinantsformula/">derived the formula</a> for the determinant by showing that the determinant describes the geometric volume of the high dimensional parallelepiped formed by the columns of a matrix. But that is not the full story!</p>
<p>As with most topics, it helps to view determinants from <a href="https://mbernste.github.io/posts/understanding_3d/">multiple perspectives</a>, which we will attempt to do here. To understand determinants from multiple perspectives, we will also need to view matrices from multiple perspectives. Recall from a <a href="https://mbernste.github.io/posts/matrices/">previous post</a> that there are three perspectives for viewing matrices:</p>
<ol>
<li><strong>Perspective 1:</strong> As a table of values</li>
<li><strong>Perspective 2:</strong> As a list of column vectors (or row vectors)</li>
<li><strong>Perspective 3:</strong> As a <a href="https://mbernste.github.io/posts/matrices_linear_transformations/">linear transformation</a> between vector spaces</li>
</ol>
<p>In our last post, we explored the determinant by viewing matrices from Perspective 2. That is, the determinant of a matrix describes the geometric volume of the parallelepiped formed by the column vectors of the matrix. In this post, we will explore determinants by viewing matrices from Perspective 3 and explore what the determinant tells us about the linear transformation characterized by a given matrix. To preview, the determinant tells us two things about the linear transformation:</p>
<ol>
<li>How much a matrix’s linear transformation grows or shrinks space</li>
<li>Whether the matrix’s linear transformation inverts space</li>
</ol>
<p>We’ll conclude by putting these two pieces together and describe how the determinant can be thought about as describing space scaled by a “signed volume”.</p>
<h2 id="the-determinant-describes-how-much-a-matrix-grows-or-shrinks-space">The determinant describes how much a matrix grows or shrinks space</h2>
<p>To review, one can <a href="https://mbernste.github.io/posts/matrices/">view a matrix</a> as characterizing a <a href="https://mbernste.github.io/posts/matrices_linear_transformations/">linear transformation</a> between vector spaces. That is, given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, we can form a function $T$ that maps vectors in $\mathbb{R}^n$ to $\mathbb{R}^m$ using <a href="https://mbernste.github.io/posts/matrix_vector_mult/">matrix-vector multiplication</a>:</p>
\[T(\boldsymbol{x}) := \boldsymbol{Ax}\]
<p>With this in mind, let’s think about what a matrix $\boldsymbol{A} \in \mathbb{R}^{2 \times 2}$ will do to the standard basis vectors in $\mathbb{R}^2$. Specifically, we see that the first standard basis vector will be transformed to the first column-vector of $\boldsymbol{A}$:</p>
\[\begin{bmatrix}a & b \\ c & d\end{bmatrix}\begin{bmatrix}1 \\ 0\end{bmatrix} = \begin{bmatrix}a \\ c\end{bmatrix}\]
<p>Similarly, the second standard basis vector will be transformed to the second column of $\boldsymbol{A}$:</p>
\[\begin{bmatrix}a & b \\ c & d\end{bmatrix}\begin{bmatrix}0 \\ 1\end{bmatrix} = \begin{bmatrix}b \\ d\end{bmatrix}\]
<p>Thus, if we multiply $\boldsymbol{A}$ by the matrix that is formed by using the two standard basis vectors as columns (which is just the identity matrix), we get back $\boldsymbol{A}$:</p>
\[\boldsymbol{AI} = \boldsymbol{A}\]
<p>Here, we are viewing the matrix $\boldsymbol{A}$ as a function and are viewing $\boldsymbol{I}$ as a list of vectors. We see that $\boldsymbol{A}$ transforms the column vectors in $\boldsymbol{I}$ into the column vectors of $\boldsymbol{A}$ itself. Moreover, we see that the column vectors of $\boldsymbol{I}$ form the unit cube (in two dimensions, the unit square) and $\boldsymbol{A}$ transforms this unit cube into a parallelogram with an area equal to $\lvert \text{Det}(\boldsymbol{A}) \rvert$. Thus we see that the matrix $\boldsymbol{A}$ has, in a sense, blown up the area of the original cube, which is one, to an object that has a size equal to $\lvert \text{Det}(\boldsymbol{A}) \rvert$.</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/determinant_grow_unit_cube.png" alt="drawing" width="700" /></center>
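<p>To make this concrete, here is a minimal sketch in plain Python. The matrix entries and the helper names <code>matvec</code> and <code>det2</code> are illustrative assumptions rather than anything from the figure; the point is only that the standard basis vectors land on the columns of the matrix and that the unit square’s area is scaled by the absolute value of the determinant:</p>

```python
def matvec(A, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def det2(M):
    """Determinant of a 2x2 matrix given as a list of rows."""
    (p, q), (r, s) = M
    return p * s - q * r

# Arbitrary example matrix (hypothetical values)
A = [[2.0, 1.0],
     [0.5, 3.0]]

e1_image = matvec(A, [1.0, 0.0])  # equals the first column of A
e2_image = matvec(A, [0.0, 1.0])  # equals the second column of A

# The unit square has area 1, so the area of its image is |Det(A)|
area_scale = abs(det2(A))
print(e1_image, e2_image, area_scale)
```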
<p>This pattern does not hold for the unit cube alone, nor does it hold for just $\mathbb{R}^2$. In fact, any hypercube in $m$-dimensional space that is transformed by some matrix $\boldsymbol{A} \in \mathbb{R}^{m \times m}$ will become a parallelepiped with a volume that is grown or shrunk by a factor equal to $\lvert \text{Det}(\boldsymbol{A}) \rvert$. To see why, examine what happens to a hypercube with sides of length $c$, which we can represent as the matrix $c\boldsymbol{I}$:</p>
\[c\boldsymbol{I} = \begin{bmatrix}c & 0 & \dots & 0 \\ 0 & c & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & c \end{bmatrix}\]
<p>where</p>
\[\text{Volume}(c\boldsymbol{I}) = c^{m}\]
<p>because each of the $m$ sides is of length $c$. If we transform this hypercube by $\boldsymbol{A}$, we get a parallelepiped represented by the matrix $\boldsymbol{A}c\boldsymbol{I} = c\boldsymbol{A}$. Its volume is given by the determinant $\lvert\text{Det}(c\boldsymbol{A})\rvert$:</p>
\[\text{Volume}(c\boldsymbol{A}) = \lvert \text{Det}(c\boldsymbol{a}_{*,1}, \dots, c\boldsymbol{a}_{*,m})\rvert\]
<p>where $\boldsymbol{a}_{*,1}, \dots, \boldsymbol{a}_{*,m}$ are the column-vectors of $\boldsymbol{A}$. Notice that $c$ is multiplying each column-vector. From the <a href="https://mbernste.github.io/posts/determinantsformula/">previous post</a>, recall that the determinant is linear with respect to each column-vector so we can “pull out” each $c$ coefficient:</p>
\[\begin{align*}\text{Volume}(c\boldsymbol{A}) &= \lvert \text{Det}(c\boldsymbol{a}_{*,1}, \dots, c\boldsymbol{a}_{*,m}) \rvert \\ &= \lvert c^m \rvert \lvert \text{Det}(\boldsymbol{a}_{*,1}, \dots, \boldsymbol{a}_{*,m}) \rvert \\ &= \text{Volume}(c\boldsymbol{I}) \lvert \text{Det}(\boldsymbol{A}) \rvert \end{align*}\]
<p>Thus we see that the volume of our cube was scaled by a factor $\lvert \text{Det}(\boldsymbol{A}) \rvert$.</p>
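<p>As a quick numerical sanity check of this derivation, the sketch below compares $\lvert \text{Det}(c\boldsymbol{A}) \rvert$ with $c^m \lvert \text{Det}(\boldsymbol{A}) \rvert$ for a small $2 \times 2$ example. The matrix entries, the value of <code>c</code>, and the helper name <code>det2</code> are assumptions chosen for illustration:</p>

```python
def det2(M):
    """Determinant of a 2x2 matrix given as a list of rows."""
    (p, q), (r, s) = M
    return p * s - q * r

# Hypothetical example values
A = [[2.0, 1.0],
     [0.5, 3.0]]
c = 3.0
m = 2  # dimension of the space

# Transforming the cube cI by A gives the matrix cA
cA = [[c * entry for entry in row] for row in A]

cube_volume = c ** m                       # Volume(cI) = c^m
transformed_volume = abs(det2(cA))         # Volume(cA) = |Det(cA)|
predicted = cube_volume * abs(det2(A))     # Volume(cI) * |Det(A)|
print(transformed_volume, predicted)
```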
<p>Without proving it formally here, we can now intuitively see that the volume of <em>any</em> object will be scaled by the factor $\lvert \text{Det}(\boldsymbol{A}) \rvert$ when transformed by $\boldsymbol{A}$. This is because we can always approximate the volume of an object by filling the object with small hypercubes and summing the volumes of those hypercubes together. As we shrink the hypercubes ever smaller, we get a more accurate approximation of the volume. Under transformation by a matrix $\boldsymbol{A}$, all of those tiny hypercubes will be scaled by $\lvert \text{Det}(\boldsymbol{A})\rvert$ and thus, the full volume of the object will be scaled by this value as well. This idea can be visualized below where we see the area of a circle scaled under transformation by a matrix $\boldsymbol{A}$:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/determinant_scales_circle.png" alt="drawing" width="700" /></center>
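<p>The same idea can be checked numerically. The sketch below approximates a unit circle by a 360-sided polygon, transforms every vertex by a matrix, and compares the ratio of the two polygon areas (computed with the shoelace formula) against $\lvert \text{Det}(\boldsymbol{A}) \rvert$. The matrix values and the function name <code>shoelace_area</code> are illustrative assumptions:</p>

```python
import math

def shoelace_area(points):
    """Signed area of a polygon given as a list of (x, y) vertices."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return total / 2.0

# Hypothetical example matrix
A = [[2.0, 1.0],
     [0.5, 3.0]]
det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]

# Polygon approximation of the unit circle
circle = [
    (math.cos(2 * math.pi * k / 360), math.sin(2 * math.pi * k / 360))
    for k in range(360)
]
# Apply the linear transformation to every vertex
transformed = [
    (A[0][0] * x + A[0][1] * y, A[1][0] * x + A[1][1] * y)
    for x, y in circle
]

ratio = abs(shoelace_area(transformed)) / abs(shoelace_area(circle))
print(ratio, abs(det_A))
```

<p>Because the shoelace formula is itself built from $2 \times 2$ determinants, the ratio matches $\lvert \text{Det}(\boldsymbol{A}) \rvert$ up to floating-point error for any polygon, not just this one.</p>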
<p><br /></p>
<p><strong>Quick aside: Intuiting why $\text{Det}(\boldsymbol{AB}) = \text{Det}(\boldsymbol{A})\text{Det}(\boldsymbol{B})$</strong></p>
<p>We can now gain a much better intuition for Theorem 8 presented in <a href="https://mbernste.github.io/posts/determinantsformula/">the previous post</a>, which states that for two square matrices, $\boldsymbol{A}, \boldsymbol{B} \in \mathbb{R}^{n \times n}$ it holds that</p>
\[\text{Det}(\boldsymbol{AB}) = \text{Det}(\boldsymbol{A})\text{Det}(\boldsymbol{B})\]
<p>First, recall that a matrix product $\boldsymbol{AB}$ can be interpreted as a <a href="https://mbernste.github.io/posts/matrix_multiplication/">composition of linear transformations</a>. That is, the transformation carried out by $\boldsymbol{AB}$ is equivalent to the transformation carried out by $\boldsymbol{B}$ followed by $\boldsymbol{A}$. Let’s now think about how the area of an object will change as we first transform it by $\boldsymbol{B}$ followed by $\boldsymbol{A}$. First, transforming it by $\boldsymbol{B}$ will scale its area by a factor of $\lvert \text{Det}(\boldsymbol{B}) \rvert$. Then, transforming it by $\boldsymbol{A}$ will scale its area by a factor of $\lvert \text{Det}(\boldsymbol{A}) \rvert$. The total change of its area is thus $\lvert \text{Det}(\boldsymbol{B}) \rvert \lvert \text{Det}(\boldsymbol{A}) \rvert$. This can be visualized below:</p>
<p><br /></p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/determinant_matrix_product.png" alt="drawing" width="800" /></center>
<p><br /></p>
<p>Above we see a unit cube first transformed into a parallelogram by $\boldsymbol{B}$. Its area grows by a factor of $\lvert \text{Det}(\boldsymbol{B}) \rvert$. This parallelogram is then transformed into another parallelogram by $\boldsymbol{A}$. Its area grows by an additional factor of $\lvert \text{Det}(\boldsymbol{A}) \rvert$. Thus, the final scaling factor of the unit cube’s area under $\boldsymbol{AB}$ is $\lvert \text{Det}(\boldsymbol{A}) \rvert \lvert \text{Det}(\boldsymbol{B})\rvert$. Equivalently, because the unit cube was transformed by $\boldsymbol{AB}$, its area grew by a factor of $\lvert \text{Det}(\boldsymbol{AB}) \rvert$.</p>
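<p>A small numerical check of the product rule, sketched in plain Python. The two matrices and the helper names <code>matmul2</code> and <code>det2</code> are assumptions made for illustration:</p>

```python
def det2(M):
    """Determinant of a 2x2 matrix given as a list of rows."""
    (p, q), (r, s) = M
    return p * s - q * r

def matmul2(A, B):
    """Product of two 2x2 matrices given as lists of rows."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]

# Hypothetical example matrices
A = [[2.0, 1.0],
     [0.5, 3.0]]
B = [[1.0, -2.0],
     [4.0, 0.5]]

AB = matmul2(A, B)          # apply B first, then A
det_of_product = det2(AB)
product_of_dets = det2(A) * det2(B)
print(det_of_product, product_of_dets)
```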
<h2 id="interpreting-the-sign-of-the-determinant">Interpreting the sign of the determinant</h2>
<p>So far, our discussion of the determinant has focused on volume, but we have glossed over the fact that this interpretation of the determinant requires taking its absolute value. What does the sign of the determinant capture? If determinants capture volume, then how can they be negative (intuitively, volume is only a positive quantity)?</p>
<p>It turns out that the sign of the determinant captures something else about a matrix’s linear transformation other than how much it grows or shrinks space: it captures whether or not a matrix “inverts” space. That is, a matrix with a positive determinant will maintain the orientation of vectors in the original space relative to one another, but a matrix with a negative determinant will invert their orientation.</p>
<p>As an example, let us consider the matrix:</p>
\[\boldsymbol{A} := \begin{bmatrix}0 & 1 \\ 1 & 0\end{bmatrix}\]
<p>This matrix represents the identity matrix, but with its two columns flipped. The determinant of $\boldsymbol{A}$ is -1. Why? By Axiom 1, the determinant of the identity matrix is 1. By <a href="https://mbernste.github.io/posts/determinantsformula/">Theorem 1 in my previous post</a>, swapping two columns will make the determinant negative. Thus, the determinant of $\boldsymbol{A}$ is simply -1. (Note, if you perform <em>two</em> swaps, the matrix no longer inverts space though this is a bit hard to visualize in high dimensions).</p>
<p>Below is an illustration of what happens to a set of vectors that form the outline of a hand when transformed by the matrix $\boldsymbol{A}$.</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/negative_determinant_inversion_2D.png" alt="drawing" width="700" /></center>
<p>Here we see that this matrix simply flipped the orientation of vectors across the thick dotted line (you can see this by tracing the location of the thumb outlined by the thin dotted lines).</p>
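<p>One way to see the flip numerically is to track the <em>signed</em> area of a small shape before and after the transformation. In the sketch below, a triangle whose vertices are listed counter-clockwise has positive signed area; after applying the column-swapped identity matrix, the same vertex order runs clockwise and the signed area becomes negative. The triangle and the function name <code>signed_area</code> are illustrative assumptions:</p>

```python
def signed_area(points):
    """Shoelace formula: positive for counter-clockwise vertex order."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return total / 2.0

# The identity matrix with its two columns swapped
A = [[0, 1],
     [1, 0]]

triangle = [(0, 0), (1, 0), (0, 1)]  # counter-clockwise
flipped = [
    (A[0][0] * x + A[0][1] * y, A[1][0] * x + A[1][1] * y)
    for x, y in triangle
]

area_before = signed_area(triangle)
area_after = signed_area(flipped)
print(area_before, area_after)
```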
<p>This same phenomenon occurs in higher dimensions too. Here is an example in three dimensions where a 3D hand is transformed by a matrix $\boldsymbol{A}$ that again represents the identity matrix, but with the first and third columns flipped. Notice how the hand went from being a right hand to a left hand by the transformation:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/determinant_inversion_hand.png" alt="drawing" width="700" /></center>
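<p>The right-hand-to-left-hand flip can also be sketched with the scalar triple product, which is positive for a right-handed triple of vectors and negative for a left-handed one. The helper names below are assumptions for illustration; the matrix is the identity with its first and third columns swapped, as described above:</p>

```python
def cross(u, v):
    """Cross product of two 3D vectors."""
    return [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def handedness(x, y, z):
    """Scalar triple product (x cross y) dot z; positive means right-handed."""
    return dot(cross(x, y), z)

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# Identity matrix with the first and third columns swapped
P = [[0, 0, 1],
     [0, 1, 0],
     [1, 0, 0]]

e1, e2, e3 = [1, 0, 0], [0, 1, 0], [0, 0, 1]
before = handedness(e1, e2, e3)
after = handedness(matvec(P, e1), matvec(P, e2), matvec(P, e3))
print(before, after)
```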
<h2 id="intuiting-determinants-as-signed-volume">Intuiting determinants as “signed volume”</h2>
<p>In <a href="https://en.wikipedia.org/wiki/Determinant">some explanations</a>, the determinant is explained as describing a “signed volume”. What is meant by signed volume? For me, it helps to think about determinants in a similar way that we think about integrals. Integrals express the “signed” area under a curve where the sign tells you whether there is more area above versus below zero. Consider a sequence of univariate functions where each function’s curve approaches zero until it crosses zero and becomes more negative:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/integral_analogy_determinant.png" alt="drawing" width="700" /></center>
<p>We see that the integral starts out as positive, shrinks to zero, and then becomes more negative.</p>
<p>Analogously, we can see that as two vectors are rotated towards one another, the determinant is positive but decreases until the vectors are aligned. If the vectors are aligned the determinant is zero. As they cross one another further, the determinant becomes more and more negative. This is visualized below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/negative_determinant_rotate_vecs.png" alt="drawing" width="800" /></center>
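<p>This behavior can be reproduced with a few lines of Python. As a sketch, assume one vector is fixed at $(1, 0)$ and the other is a unit vector at angle $\theta$ from it; the determinant of the matrix with these two columns then works out to $\sin\theta$. The specific angles and the helper name <code>det_cols</code> are illustrative choices:</p>

```python
import math

def det_cols(u, v):
    """Determinant of the 2x2 matrix whose columns are u and v."""
    return u[0] * v[1] - v[0] * u[1]

u = (1.0, 0.0)
angles = [90, 45, 0, -45, -90]  # degrees between u and the rotating vector

dets = []
for degrees in angles:
    theta = math.radians(degrees)
    v = (math.cos(theta), math.sin(theta))
    dets.append(det_cols(u, v))

for degrees, d in zip(angles, dets):
    print(degrees, round(d, 3))
```

<p>The determinant is positive while the second vector sits counter-clockwise from the first, hits zero when the two are aligned, and turns negative once they have crossed.</p>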
<p><br /></p>
<p>Thus, the “sign” of the determinant can be thought about in a similar way as the sign of an integral. A negative integral tells you that the function has more area below zero than above zero. A negative determinant tells you that two column vectors, in a sense, “crossed” one another, thus inverting space across those two column vectors.</p>
<p><strong>Deriving the formula for the determinant</strong> (Matthew N. Bernstein, 2023-09-03, https://mbernste.github.io/posts/determinants_formula)</p>
<p><em>The determinant is a function that maps each square matrix to a value that describes the volume of the parallelepiped formed by that matrix’s columns. While this idea is fairly straightforward conceptually, the formula for the determinant is quite confusing. In this post, we will derive the formula for the determinant in an effort to make it less mysterious. Much of my understanding of this material comes from <a href="http://faculty.fairfield.edu/mdemers/linearalgebra/documents/2019.03.25.detalt.pdf">these lecture notes</a> by Mark Demers re-written in my own words.</em></p>
<h2 id="introduction">Introduction</h2>
<p>The <strong>determinant</strong> is a function that maps square <a href="https://mbernste.github.io/posts/matrices/">matrices</a> to real numbers:</p>
\[\text{Det} : \mathbb{R}^{m \times m} \rightarrow \mathbb{R}\]
<p>where the absolute value of the determinant describes the volume of the parallelepiped formed by the matrix’s columns. This is illustrated below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/determinant_overview.png" alt="drawing" width="550" /></center>
<p><br /></p>
<p>While this idea is fairly straightforward conceptually, the formula for the determinant is quite confusing:</p>
\[\text{Det}(\boldsymbol{A}) := \begin{cases} a_{1,1}a_{2,2} - a_{1,2}a_{2,1} & \text{if $m = 2$} \\ \sum_{i=1}^m (-1)^{i+1} a_{i,1} \text{Det}(\boldsymbol{A}_{-i, -1}) & \text{if $m > 2$}\end{cases}\]
<p>Here, $\boldsymbol{A}_{-i, -1}$ denotes the matrix formed by deleting the $i$th row and first column of $\boldsymbol{A}$. Note that this is a <a href="https://en.wikipedia.org/wiki/Recursive_definition">recursive definition</a> where the base case is a $2 \times 2$ matrix.</p>
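<p>To see the recursion in action, here is a minimal sketch of the definition above in plain Python. The function name <code>det</code> and the list-of-rows representation are assumptions made for illustration. Note that with zero-based indexing the sign $(-1)^{i+1}$ becomes <code>(-1) ** i</code>:</p>

```python
def det(A):
    """Determinant by expansion along the first column.

    A is a square matrix (m >= 2) given as a list of rows.
    """
    m = len(A)
    # Base case: 2x2 matrix
    if m == 2:
        return A[0][0] * A[1][1] - A[0][1] * A[1][0]
    total = 0
    for i in range(m):
        # Minor: delete row i and the first column
        minor = [row[1:] for k, row in enumerate(A) if k != i]
        total += (-1) ** i * A[i][0] * det(minor)
    return total

print(det([[1, 2], [3, 4]]))
print(det([[1, 2, 3], [4, 5, 6], [7, 8, 10]]))
```

<p>This direct translation makes $m$ recursive calls on $(m-1) \times (m-1)$ matrices, so it is only practical for small matrices; it is meant to mirror the formula, not to be an efficient implementation.</p>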
<p>When one is first taught determinants, one is usually expected to take it as a given that this formula calculates the volume of an $m$-dimensional parallelepiped; however, if you’re like me, this is not at all obvious. How on earth does this formula calculate volume? Moreover, why is it recursive?</p>
<p>In this post, I am going to attempt to derive this formula from first principles. We will start with the base case of a $2 \times 2$ matrix, verify that it indeed computes the volume of the parallelogram formed by the columns of the matrix, and then move on to the determinant for larger matrices.</p>
<h2 id="2-times-2-matrices">$2 \times 2$ matrices</h2>
<p>Let’s first only look at the $m = 2$ case and verify that this equation computes the area of the parallelogram formed by the matrix’s columns. Let’s say we have a matrix</p>
\[\boldsymbol{A} := \begin{bmatrix}a & b \\ c & d\end{bmatrix}\]
<p>Then we see that the area can be obtained by computing the area of the rectangle that encompasses the parallelogram and subtracting the areas of the triangles around it:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/TwoByTwoDeterminant.png" alt="drawing" width="700" /></center>
<p><br /></p>
<p>Simplifying the equation above we get</p>
\[\text{Det}(\boldsymbol{A}) = ad - bc\]
<p>This is exactly the definition for the $2 \times 2$ determinant. So far so good.</p>
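<p>For readers who want the algebra spelled out, here is one way the simplification can go. This is a sketch that assumes $a, b, c, d > 0$ with $ad > bc$, that the bounding rectangle has side lengths $a + b$ and $c + d$, and that the region around the parallelogram is split into two triangles of area $\frac{1}{2}ac$, two triangles of area $\frac{1}{2}bd$, and two small rectangles of area $bc$. The figure may partition the surrounding region differently, but the total is the same:</p>
\[\begin{align*}\text{Area} &= (a + b)(c + d) - 2\left(\tfrac{1}{2}ac\right) - 2\left(\tfrac{1}{2}bd\right) - 2bc \\ &= ac + ad + bc + bd - ac - bd - 2bc \\ &= ad - bc\end{align*}\]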
<h2 id="defining-the-determinant-for-m-times-m-matrices-a-generalization-of-geometric-volume">Defining the determinant for $m \times m$ matrices: a generalization of geometric volume</h2>
<p>Moving on to $m > 2$, the definition of the determinant is</p>
\[\text{Det}(\boldsymbol{A}) := \sum_{i=1}^m (-1)^{i+1} a_{i,1} \text{Det}(\boldsymbol{A}_{-i,-1})\]
<p>Before understanding this equation, we must first ask ourselves what we really mean by “volume” in $m$-dimensional space. In fact, it is the process of answering this very question that brings us to the equation above. Specifically, we will formulate a set of three axioms that attempt to capture the notion of “geometric volume” in a very abstract way that applies to higher dimensions. Then, we will show that the only formula that satisfies these axioms is the formula for the determinant shown above!</p>
<p>These axioms are as follows:</p>
<p><strong>1. The determinant of the identity matrix is one</strong></p>
<p>The first axiom states that</p>
\[\text{Det}(\boldsymbol{I}) := 1\]
<p>Why do we want this to be an axiom? First, we note that the parallelepiped formed by the columns of the identity matrix, $\boldsymbol{I}$, is a hypercube in $m$-dimensional space:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/DeterminantIdentityMatrix.png" alt="drawing" width="400" /></center>
<p>We would like our notion of “geometric volume” to match our common intuition that the volume of a cube is simply the product of the sides of the cube. In this case, they’re all of length one so the volume, and thus the determinant, should be one.</p>
<p><strong>2. If two columns of a matrix are equal, then its determinant is zero</strong></p>
<p>For a given matrix $\boldsymbol{A}$, if any two columns $\boldsymbol{a}_{*,i}$ and $\boldsymbol{a}_{*,j}$ are equal, then the determinant of $\boldsymbol{A}$ should be zero.</p>
<p>Why do we want this to be an axiom? We first note that if two columns of a matrix are equal, then the parallelepiped formed by its columns is flat. For example, here’s a depiction of a parallelepiped formed by the columns of a $3 \times 3$ matrix with two columns that are equal:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/FlatParallelepiped.png" alt="drawing" width="400" /></center>
<p>We see that the parallelepiped is flat and lies within a hyperplane.</p>
<p>We would like our notion of “geometric volume” to match our common intuition that the volume of a flat object is zero. Thus, when any two columns of a matrix are equal, we would like the determinant to be zero.</p>
<p><strong>3. The determinant of a matrix is linear with respect to each column vector</strong></p>
<p>Before digging into this final axiom, let us define some notation to make our discussion easier. Specifically, for the remainder of this post, we will often represent the determinant of a matrix as a function with either a single matrix argument, $\text{Det}(\boldsymbol{A})$, or with multiple vector arguments $\text{Det}(\boldsymbol{a}_{*,1}, \dots, \boldsymbol{a}_{*,n})$ where $\boldsymbol{a}_{*,1}, \dots, \boldsymbol{a}_{*,n}$ are the $n$ columns of $\boldsymbol{A}$.</p>
<p>Now, the final axiom for the determinant is that $\text{Det}$ is a <a href="https://mbernste.github.io/posts/matrices_linear_transformations/">linear function</a> with respect to each argument vector. For $\text{Det}$ to be linear with respect to each argument is to imply two conditions. First, for a given constant $k$, it holds that,</p>
\[\forall j \in [n], \ \text{Det}(\boldsymbol{a}_{*,1}, \dots, k\boldsymbol{a}_{*,j}, \dots \boldsymbol{a}_{*,n}) = k\text{Det}(\boldsymbol{a}_{*,1}, \dots, \boldsymbol{a}_{*,j}, \dots \boldsymbol{a}_{*,n})\]
<p>and second, that</p>
\[\forall j \in [n], \ \text{Det}(\boldsymbol{a}_{*,1}, \dots, \boldsymbol{a}_{*,j} + \boldsymbol{v}, \dots \boldsymbol{a}_{*,n}) = \text{Det}(\boldsymbol{a}_{*,1}, \dots, \boldsymbol{a}_{*,j}, \dots \boldsymbol{a}_{*,n}) + \text{Det}(\boldsymbol{a}_{*,1}, \dots, \boldsymbol{v}, \dots \boldsymbol{a}_{*,n})\]
<p>Why do we wish the linearity of $\text{Det}$ to be an axiom? Because it turns out that the volume of a two-dimensional parallelogram is linear with respect to the vectors that form its sides. We can prove this both algebraically as well as geometrically. Let’s start with the algebraic proof starting with a parallelogram defined by the columns of the following matrix:</p>
\[\boldsymbol{A} := \begin{bmatrix}a & b \\ c & d\end{bmatrix}\]
<p>Let’s say we multiply one of the column vectors by $k$ to form $\boldsymbol{A}'$:</p>
\[\boldsymbol{A}' := \begin{bmatrix}ka & b \\ kc & d\end{bmatrix}\]
<p>Its determinant is</p>
\[\begin{align*}\text{Det}(\boldsymbol{A}') &:= kad - bkc \\ &= k(ad - bc) \\ &= k\text{Det}(\boldsymbol{A})\end{align*}\]
<p>Now let’s consider another matrix formed by taking the first column of $\boldsymbol{A}$ and adding a vector $\boldsymbol{v}$:</p>
\[\boldsymbol{A}' := \begin{bmatrix}a + v_1 & b \\ c + v_2 & d\end{bmatrix}\]
<p>Its determinant is</p>
\[\begin{align*}\text{Det}(\boldsymbol{A}') &:= (a + v_1)d - b(c + v_2) \\ &= ad + v_1d - bc - bv_2 \\ &= (ad - bc) + (v_1d - bv_2) \\ &= \text{Det}\left(\begin{bmatrix}a & b \\ c & d\end{bmatrix}\right) + \text{Det}\left(\begin{bmatrix}v_1 & b \\ v_2 & d\end{bmatrix} \right) \end{align*}\]
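<p>The two algebraic identities above can be spot-checked with concrete numbers. In the sketch below, the $2 \times 2$ determinant is written as a function of its two column vectors; the specific vectors, the scalar <code>k</code>, and the helper name <code>det_cols</code> are hypothetical values chosen for illustration:</p>

```python
def det_cols(u, v):
    """Determinant of the 2x2 matrix whose columns are u and v."""
    return u[0] * v[1] - v[0] * u[1]

# Columns of A: u = (a, c) and v = (b, d); values are arbitrary
u = (2, 1)
v = (1, 3)
k = 4
w = (5, -2)  # the vector added to the first column

# Condition 1: scaling a column by k scales the determinant by k
scaled_u = (k * u[0], k * u[1])
scaling_holds = det_cols(scaled_u, v) == k * det_cols(u, v)

# Condition 2: adding w to a column splits the determinant into a sum
summed_u = (u[0] + w[0], u[1] + w[1])
additivity_holds = det_cols(summed_u, v) == det_cols(u, v) + det_cols(w, v)

print(scaling_holds, additivity_holds)
```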
<p>To provide more intuition about why this linearity property holds, let’s look at it geometrically. As a preliminary observation, notice how if we skew one of the edges of a parallelogram along the axis of the other edge, then the area remains the same. We can see this in the figure below by noticing that the area of the yellow triangle is subtracted from the first parallelogram, but is added to the second:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/DeterminantSkewParallogramAxis.png" alt="drawing" width="500" /></center>
<p>With this observation in mind, we can now show why the determinant is linear from a geometric perspective. Let’s start with the first condition of linearity, which says that if we scale one of the sides of a parallelogram by $k$, then the area of the parallelogram is scaled by $k$. Below, we show a parallelogram where we scale one of the vectors, $\boldsymbol{v}$, by $k$.</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/Determinant_linearity_axiom1_part1.png" alt="drawing" width="500" /></center>
<p>We see that we can skew both sides of the parallelogram to be orthogonal to one another, forming a rectangle that preserves the area of the parallelogram. This rectangle has sides of length $a$ and $b$ and thus an area of $ab$. When we skew the enlarged parallelogram in the same way, we form a rectangle with sides of length $a$ and $kb$ and thus an area of $kab$. We know the sides of the enlarged parallelogram are of length $a$ and $kb$ by observing that the two shaded triangles shown below are <a href="https://www.mathsisfun.com/geometry/triangles-similar.html">similar</a>:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/Determinant_linearity_axiom1_part2.png" alt="drawing" width="500" /></center>
<p>The second condition of linearity states that if we break apart one of the vectors that forms an edge of the parallelogram into two vectors, then those two vectors form two “sub-parallelograms” whose total area equals the area of the original parallelogram. This is shown in the following “visual proof”:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/Determinant_linearity_axiom2.png" alt="drawing" width="800" /></center>
<h2 id="deriving-the-formula-for-a-determinant">Deriving the formula for a determinant</h2>
<p>In the previous section, we outlined three axioms that define fundamental ways in which the volume of a parallelogram is related to the vectors that form its sides. It turns out that the <em>only</em> formula that satisfies these axioms is the following:</p>
\[\text{Det}(\boldsymbol{A}) := \begin{cases} a_{1,1}a_{2,2} - a_{1,2}a_{2,1} & \text{if $m = 2$} \\ \sum_{i=1}^m (-1)^{i+1} a_{i,1} \text{Det}(\boldsymbol{A}_{-i,-1}) & \text{if $m > 2$}\end{cases}\]
<p>We will start by assuming that there exists a function $\text{Det}: \mathbb{R}^{m \times m} \rightarrow \mathbb{R}$ that satisfies our three axioms and will subsequently prove a series of theorems that will build up to this final formula. Many of these theorems make heavy use of the fact that invertible matrices can be decomposed into the product of elementary matrices. For an in-depth discussion of elementary matrices, <a href="https://mbernste.github.io/posts/row_reduction/">see my previous post</a>.</p>
<p>The theorems required to derive this formula are outlined below, and their proofs are given in the Appendix to this post.</p>
<p><span style="color:#0060C6"><strong>Theorem 1:</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times m}$, if we exchange any two column-vectors of $\boldsymbol{A}$ to form a new matrix $\boldsymbol{A}'$, then $\text{Det}(\boldsymbol{A}') = -\text{Det}(\boldsymbol{A})$</span></p>
<p><span style="color:#0060C6"><strong>Theorem 2:</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times m}$, if its column-vectors are <a href="https://mbernste.github.io/posts/linear_independence/">linearly dependent</a>, then its determinant is zero.</span></p>
<p><span style="color:#0060C6"><strong>Theorem 3:</strong> Given a triangular matrix, $\boldsymbol{A} \in \mathbb{R}^{m \times m}$, its determinant can be computed by multiplying its diagonal entries.</span></p>
<p><span style="color:#0060C6"><strong>Theorem 4:</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times m}$, adding a multiple of one column-vector of $\boldsymbol{A}$ to another column-vector does not change the determinant of $\boldsymbol{A}$.</span></p>
<p><span style="color:#0060C6"><strong>Theorem 5:</strong> Given an <a href="https://mbernste.github.io/posts/row_reduction/">elementary matrix</a> that represents row-scaling, $\boldsymbol{E} \in \mathbb{R}^{m \times m}$, where $\boldsymbol{E}$ scales the $j$th row of a system of linear equations by $k$, its determinant is simply $k$.</span></p>
<p><span style="color:#0060C6"><strong>Theorem 6:</strong> Given an <a href="https://mbernste.github.io/posts/row_reduction/">elementary matrix</a> that represents row-swapping, $\boldsymbol{E} \in \mathbb{R}^{m \times m}$ that swaps the $i$th and $j$th rows of a system of linear equations, its determinant is simply -1.</span></p>
<p><span style="color:#0060C6"><strong>Theorem 7:</strong> Given an <a href="https://mbernste.github.io/posts/row_reduction/">elementary matrix</a> that represents a row-sum, $\boldsymbol{E} \in \mathbb{R}^{m \times m}$, that adds row $j$ multiplied by $k$ to row $i$, its determinant is simply 1.</span></p>
<p><span style="color:#0060C6"><strong>Theorem 8:</strong> Given matrices $\boldsymbol{A}, \boldsymbol{B} \in \mathbb{R}^{m \times m}$, it holds that $\text{Det}(\boldsymbol{AB}) = \text{Det}(\boldsymbol{A})\text{Det}(\boldsymbol{B})$</span></p>
<p><span style="color:#0060C6"><strong>Theorem 9:</strong> Given a square matrix $\boldsymbol{A}$, it holds that $\text{Det}(\boldsymbol{A}) = \text{Det}(\boldsymbol{A}^T)$.</span></p>
<p><span style="color:#0060C6"><strong>Theorem 10:</strong> The determinant of a matrix is linear with respect to the row vectors of the matrix.</span></p>
<p>With these theorems in hand we can derive the final formula for the determinant:</p>
<p><span style="color:#0060C6"><strong>Theorem 11:</strong> Let $\text{Det} : \mathbb{R}^{m \times m} \rightarrow \mathbb{R}$ be a function that satisfies the following three properties:</span></p>
<p><span style="color:#0060C6">1. $\text{Det}(\boldsymbol{I}) = 1$</span></p>
<p><span style="color:#0060C6">2. Given $\boldsymbol{A} \in \mathbb{R}^{m \times m}$, if any two columns of $\boldsymbol{A}$ are equal, then $\text{Det}(\boldsymbol{A}) = 0$</span></p>
<p><span style="color:#0060C6">3. $\text{Det}$ is linear with respect to the column-vectors of its input.</span></p>
<p><span style="color:#0060C6">Then $\text{Det}$ is given by</span></p>
<p><span style="color:#0060C6">\(\text{Det}(\boldsymbol{A}) := \begin{cases} a_{1,1}a_{2,2} - a_{1,2}a_{2,1} & \text{if $m = 2$} \\ \sum_{i=1}^m (-1)^{i+1} a_{i,1} \text{Det}(\boldsymbol{A}_{-i,-1}) & \text{if $m > 2$}\end{cases}\)</span></p>
<p>Below is an illustration of how each theorem depends on the other theorems. Note, they all flow downward until we can prove the final formula for the determinant in Theorem 11:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/determinant_theorem_map.png" alt="drawing" width="400" /></center>
<p>All of the proofs are left to the Appendix below this blog post.</p>
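<p>To make the recursion in Theorem 11 concrete, here is a minimal Python sketch of the formula. The function name <code>det</code>, the nested-list representation of a matrix, and the $1 \times 1$ base case are assumptions made for illustration and are not part of the derivation. The recursion takes exponential time, so this is only meant for small matrices.</p>

```python
def det(a):
    """Determinant via expansion along the first column (Theorem 11).

    `a` is a square matrix given as a list of rows (a hypothetical
    representation chosen for this sketch).
    """
    m = len(a)
    if m == 1:
        # Assumed base case, not stated in the post: a 1 x 1 matrix.
        return a[0][0]
    if m == 2:
        return a[0][0] * a[1][1] - a[0][1] * a[1][0]
    total = 0
    for i in range(m):
        # Sub-matrix with row i and the first column deleted.
        minor = [row[1:] for r, row in enumerate(a) if r != i]
        # (-1) ** i with 0-based i matches (-1)^(i+1) with 1-based i.
        total += (-1) ** i * a[i][0] * det(minor)
    return total


print(det([[1, 2, 3], [4, 5, 6], [7, 8, 10]]))  # prints -3
```

<p>For the $3 \times 3$ example, the expansion is $1 \cdot (50 - 48) - 4 \cdot (20 - 24) + 7 \cdot (12 - 15) = -3$.</p>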
<h2 id="related-links">Related links</h2>
<ul>
<li>Lecture notes by Mark Demers: <a href="http://faculty.fairfield.edu/mdemers/linearalgebra/documents/2019.03.25.detalt.pdf">http://faculty.fairfield.edu/mdemers/linearalgebra/documents/2019.03.25.detalt.pdf</a></li>
<li>Lecture notes by Dan Margalit, Joseph Rabinoff, Ben Williams: <a href="https://personal.math.ubc.ca/~tbjw/ila/determinants-volumes.html">https://personal.math.ubc.ca/~tbjw/ila/determinants-volumes.html</a></li>
<li>Explanation by 3Blue1Brown: <a href="https://www.3blue1brown.com/lessons/determinant">https://www.3blue1brown.com/lessons/determinant</a></li>
</ul>
<h2 id="appendix">Appendix</h2>
<p><span style="color:#0060C6"><strong>Theorem 1:</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times m}$, if we exchange any two column-vectors of $\boldsymbol{A}$ to form a new matrix $\boldsymbol{A}'$, then $\text{Det}(\boldsymbol{A}') = -\text{Det}(\boldsymbol{A})$.</span></p>
<p><strong>Proof:</strong></p>
<p>Let columns $i$ and $j$ be the columns that we exchange within $\boldsymbol{A}$. For ease of notation, let us define</p>
\[\text{Det}_{i,j}(\boldsymbol{a}_{*,i}, \boldsymbol{a}_{*,j}) := \text{Det}(\boldsymbol{a}_{*,1}, \dots, \boldsymbol{a}_{*,i}, \dots, \boldsymbol{a}_{*,j}, \dots, \boldsymbol{a}_{*,m})\]
<p>to be the determinant of $\boldsymbol{A}$ as a function of only the $i$th and $j$th column-vectors of $\boldsymbol{A}$ where the other column-vectors are held fixed. Then, we see that</p>
\[\begin{align*} \text{Det}_{i,j}(\boldsymbol{a}_{*,i}, \boldsymbol{a}_{*,j}) &= \text{Det}_{i,j}(\boldsymbol{a}_{*,i}, \boldsymbol{a}_{*,j}) + \text{Det}_{i,j}(\boldsymbol{a}_{*,i}, \boldsymbol{a}_{*,i}) && \text{Axiom 2} \\ &= \text{Det}_{i,j}(\boldsymbol{a}_{*,i}, \boldsymbol{a}_{*,i} + \boldsymbol{a}_{*,j}) && \text{Axiom 3} \\ &= \text{Det}_{i,j}(\boldsymbol{a}_{*,i}, \boldsymbol{a}_{*,i} + \boldsymbol{a}_{*,j}) - \text{Det}_{i,j}(\boldsymbol{a}_{*,i} + \boldsymbol{a}_{*,j}, \boldsymbol{a}_{*,i} + \boldsymbol{a}_{*,j}) && \text{Axiom 2} \\ &= \text{Det}_{i,j}(-\boldsymbol{a}_{*,j}, \boldsymbol{a}_{*,i} + \boldsymbol{a}_{*,j}) && \text{Axiom 3} \\ &= -\text{Det}_{i,j}(\boldsymbol{a}_{*,j}, \boldsymbol{a}_{*,i} + \boldsymbol{a}_{*,j}) && \text{Axiom 3} \\ &= -(\text{Det}_{i,j}(\boldsymbol{a}_{*,j}, \boldsymbol{a}_{*,i}) + \text{Det}_{i,j}(\boldsymbol{a}_{*,j},\boldsymbol{a}_{*,j})) && \text{Axiom 3} \\ &= -\text{Det}_{i,j}(\boldsymbol{a}_{*,j}, \boldsymbol{a}_{*,i}) && \text{Axiom 2}\end{align*}\]
<p>$\square$</p>
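<p>As a quick sanity check (an illustration rather than part of the proof), the sign flip is easy to see in the $2 \times 2$ case using the formula from Theorem 11:</p>
\[\text{Det}\left(\begin{bmatrix} a & b \\ c & d \end{bmatrix}\right) = ad - bc, \qquad \text{Det}\left(\begin{bmatrix} b & a \\ d & c \end{bmatrix}\right) = bc - ad = -(ad - bc)\]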
<p><span style="color:#0060C6"><strong>Theorem 2:</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times m}$, if its column-vectors are <a href="https://mbernste.github.io/posts/linear_independence/">linearly dependent</a>, then its determinant is zero.</span></p>
<p><strong>Proof:</strong></p>
<p>Given a matrix $\boldsymbol{A}$ with columns,</p>
\[\boldsymbol{A} := \begin{bmatrix}\boldsymbol{a}_{*,1}, \boldsymbol{a}_{*,2}, \dots, \boldsymbol{a}_{*,m}\end{bmatrix}\]
<p>if the column-vectors of $\boldsymbol{A}$ are <a href="https://mbernste.github.io/posts/linear_independence/">linearly dependent</a>, then there exists a vector $\boldsymbol{a}_{*,j}$ that can be expressed as a linear combination of the remaining vectors:</p>
\[\boldsymbol{a}_{*,j} = \sum_{i \neq j} c_i\boldsymbol{a}_{*,i}\]
<p>for some set of constants $c_i$. Thus we can write the determinant as:</p>
\[\begin{align*}\text{Det}(\boldsymbol{A}) &= \text{Det}(\boldsymbol{a}_{*,1}, \boldsymbol{a}_{*,2}, \dots, \boldsymbol{a}_{*,j}, \dots, \boldsymbol{a}_{*,m}) \\ &= \text{Det}\left(\boldsymbol{a}_{*,1}, \boldsymbol{a}_{*,2}, \dots, \sum_{i \neq j} c_i\boldsymbol{a}_{*,i}, \dots, \boldsymbol{a}_{*,m}\right) \\ &= \sum_{i \neq j} \text{Det}\left(\boldsymbol{a}_{*,1}, \boldsymbol{a}_{*,2}, \dots, c_i\boldsymbol{a}_{*,i}, \dots, \boldsymbol{a}_{*,m}\right) && \text{Axiom 3} \\ &= \sum_{i \neq j} c_i \text{Det}\left(\boldsymbol{a}_{*,1}, \boldsymbol{a}_{*,2}, \dots, \boldsymbol{a}_{*,i}, \dots, \boldsymbol{a}_{*,m}\right) && \text{Axiom 3} \\ &= 0 && \text{Axiom 2} \end{align*}\]
<p>In the last line, we see that all of the determinants in the summation are zero because each term is the determinant of a matrix that has a duplicate column vector.</p>
<p>$\square$</p>
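<p>For a small illustration (using the $2 \times 2$ formula from Theorem 11, with numbers chosen arbitrarily), take a matrix whose second column is twice its first:</p>
\[\text{Det}\left(\begin{bmatrix} 1 & 2 \\ 3 & 6 \end{bmatrix}\right) = 1 \cdot 6 - 2 \cdot 3 = 0\]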
<p><span style="color:#0060C6"><strong>Theorem 3:</strong> Given a triangular matrix, $\boldsymbol{A} \in \mathbb{R}^{m \times m}$, its determinant can be computed by multiplying its diagonal entries.</span></p>
<p><strong>Proof:</strong></p>
<p>We will start with an upper triangular $3 \times 3$ matrix:</p>
\[\boldsymbol{A} := \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ 0 & a_{2,2} & a_{2,3} \\ 0 & 0 & a_{3,3}\end{bmatrix}\]
<p>Now, we will take the second column-vector and decompose it into the sum of two vectors. Because the determinant is linear by Axiom 3, we can rewrite the determinant as follows:</p>
\[\text{Det}(\boldsymbol{A}) = \text{Det}\left(\begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ 0 & 0 & a_{2,3} \\ 0 & 0 & a_{3,3}\end{bmatrix} \right) + \text{Det}\left(\begin{bmatrix} a_{1,1} & 0 & a_{1,3} \\ 0 & a_{2,2} & a_{2,3} \\ 0 & 0 & a_{3,3}\end{bmatrix}\right)\]
<p>Note that in the first term of this sum, the column vectors are <em>linearly dependent</em> because the second column-vector can be re-written as a multiple of the first. Thus, according to Theorem 2, its determinant is zero. Hence, the entire first term is zero. Thus, we have:</p>
\[\text{Det}(\boldsymbol{A}) = \text{Det}\left(\begin{bmatrix} a_{1,1} & 0 & a_{1,3} \\ 0 & a_{2,2} & a_{2,3} \\ 0 & 0 & a_{3,3}\end{bmatrix}\right)\]
<p>We can repeat this process with the third column vector by decomposing it into the sum of three vectors and then utilizing the fact that the determinant is linear:</p>
\[\text{Det}(\boldsymbol{A}) = \text{Det}\left(\begin{bmatrix} a_{1,1} & 0 & a_{1,3} \\ 0 & a_{2,2} & 0 \\ 0 & 0 & 0\end{bmatrix}\right) + \text{Det}\left(\begin{bmatrix} a_{1,1} & 0 & 0 \\ 0 & a_{2,2} & a_{2,3} \\ 0 & 0 & 0\end{bmatrix}\right) + \text{Det}\left(\begin{bmatrix} a_{1,1} & 0 & 0 \\ 0 & a_{2,2} & 0 \\ 0 & 0 & a_{3,3}\end{bmatrix}\right)\]
<p>Again, the first and second terms are zero because the columns of each matrix are linearly dependent. This leaves only the third term, which is the determinant of a diagonal matrix. Finally, we see that</p>
\[\begin{align*}\text{Det}(\boldsymbol{A}) &= \text{Det}\left(\begin{bmatrix} a_{1,1} & 0 & 0 \\ 0 & a_{2,2} & 0 \\ 0 & 0 & a_{3,3}\end{bmatrix}\right) \\ &= a_{1,1}a_{2,2}a_{3,3}\text{Det}\left(\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1\end{bmatrix}\right) && \text{Axiom 3} \\ &= a_{1,1}a_{2,2}a_{3,3} && \text{Axiom 1}\end{align*}\]
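<p>Thus, we see that the determinant of this triangular matrix reduces to the determinant of a diagonal matrix, which can be computed by multiplying the entries along the diagonal. The same column-by-column argument applies to triangular matrices of any size.</p>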
<p>$\square$</p>
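<p>As an illustrative check with arbitrarily chosen numbers, expanding an upper triangular matrix along its first column with the formula from Theorem 11 gives the product of the diagonal entries:</p>
\[\text{Det}\left(\begin{bmatrix} 2 & 1 & 4 \\ 0 & 3 & 5 \\ 0 & 0 & 7 \end{bmatrix}\right) = 2 \, \text{Det}\left(\begin{bmatrix} 3 & 5 \\ 0 & 7 \end{bmatrix}\right) - 0 + 0 = 2(3 \cdot 7 - 5 \cdot 0) = 42 = 2 \cdot 3 \cdot 7\]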
<p><span style="color:#0060C6"><strong>Theorem 4:</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times m}$, adding a multiple of one column-vector of $\boldsymbol{A}$ to another column-vector does not change the determinant of $\boldsymbol{A}$.</span></p>
<p><strong>Proof:</strong></p>
<p>Suppose we add $k$ times column $j$ to column $i$. Then,</p>
\[\begin{align*}\text{Det}(\boldsymbol{a}_{*,1} \dots, k\boldsymbol{a}_{*,j} + \boldsymbol{a}_{*,i}, \dots, \boldsymbol{a}_{*,j}, \dots, \boldsymbol{a}_{*,m}) &= \text{Det}(\boldsymbol{a}_{*,1} \dots, k\boldsymbol{a}_{*,j}, \dots, \boldsymbol{a}_{*,j}, \dots, \boldsymbol{a}_{*,m}) + \text{Det}(\boldsymbol{a}_{*,1} \dots, \boldsymbol{a}_{*,i}, \dots, \boldsymbol{a}_{*,j}, \dots, \boldsymbol{a}_{*,m}) && \text{Axiom 3} \\ &= k\text{Det}(\boldsymbol{a}_{*,1} \dots, \boldsymbol{a}_{*,j}, \dots, \boldsymbol{a}_{*,j}, \dots, \boldsymbol{a}_{*,m}) + \text{Det}(\boldsymbol{a}_{*,1} \dots, \boldsymbol{a}_{*,i}, \dots, \boldsymbol{a}_{*,j}, \dots, \boldsymbol{a}_{*,m}) && \text{Axiom 3} \\ &= \text{Det}(\boldsymbol{a}_{*,1} \dots, \boldsymbol{a}_{*,i}, \dots, \boldsymbol{a}_{*,j}, \dots, \boldsymbol{a}_{*,m}) && \text{Axiom 2}\end{align*}\]
<p>The last line follows from the fact that the first term is computing the determinant of a matrix that has duplicate column-vectors. By Axiom 2, its determinant is zero.</p>
<p>$\square$</p>
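<p>A small worked example (numbers chosen arbitrarily, determinants computed with the $2 \times 2$ formula from Theorem 11): adding two times the first column to the second column leaves the determinant unchanged.</p>
\[\text{Det}\left(\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\right) = 4 - 6 = -2, \qquad \text{Det}\left(\begin{bmatrix} 1 & 2 + 2 \cdot 1 \\ 3 & 4 + 2 \cdot 3 \end{bmatrix}\right) = \text{Det}\left(\begin{bmatrix} 1 & 4 \\ 3 & 10 \end{bmatrix}\right) = 10 - 12 = -2\]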
<p><span style="color:#0060C6"><strong>Theorem 5:</strong> Given an elementary matrix that represents row-scaling, $\boldsymbol{E} \in \mathbb{R}^{m \times m}$, where $\boldsymbol{E}$ scales the $j$th row of a system of linear equations by $k$, its determinant is simply $k$.</span></p>
<p><strong>Proof:</strong></p>
<p>Such a matrix would be a diagonal matrix with all ones along the diagonal except for the $j$th entry, which would be $k$. For example, a $4 \times 4$ row-scaling matrix that scales the second row by $k$ would look as follows:</p>
\[\boldsymbol{E} := \begin{bmatrix}1 & 0 & 0 & 0 \\ 0 & k & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1\end{bmatrix}\]
<p>Note that this is a triangular matrix and thus, by Theorem 3, its determinant is given by the product of its diagonal entries, which is simply $k$.</p>
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 6:</strong> Given an elementary matrix that represents row-swapping, $\boldsymbol{E} \in \mathbb{R}^{m \times m}$ that swaps the $i$th and $j$th rows of a system of linear equations, its determinant is simply -1.</span></p>
<p><strong>Proof:</strong></p>
<p>A row-swapping matrix that swaps the $i$th and $j$th rows of a system of linear equations can be formed by simply swapping the $i$th and $j$th column vectors of the identity matrix. For example, a $4 \times 4$ row-swapping matrix that swaps the second and third rows would look as follows:</p>
\[\boldsymbol{E} := \begin{bmatrix}1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1\end{bmatrix}\]
<p>Axiom 1 for the definition of the determinant states that the determinant of the identity matrix is 1. According to Theorem 1, if we swap two column-vectors of a matrix, its determinant is multiplied by -1. Here we are swapping two column-vectors of the identity matrix yielding a determinant of -1.</p>
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 7:</strong> Given an elementary matrix that represents a row-sum, $\boldsymbol{E} \in \mathbb{R}^{m \times m}$, that adds row $j$ multiplied by $k$ to row $i$, its determinant is simply 1.</span></p>
<p><strong>Proof:</strong></p>
<p>An elementary matrix representing a row-sum, $\boldsymbol{E} \in \mathbb{R}^{m \times m}$, that adds row $j$ multiplied by $k$ to row $i$ is simply the identity matrix, but with element $(i, j)$ equal to $k$. For example, a $4 \times 4$ row-sum matrix that adds three times the first row to the third would be given by:</p>
\[\boldsymbol{E} := \begin{bmatrix}1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 3 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1\end{bmatrix}\]
<p>This matrix is a lower-triangular matrix. By Theorem 3, the determinant of a triangular matrix is the product of the diagonal entries. In this case, all of the diagonal entries are 1. Thus, the determinant is 1.</p>
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 8:</strong> Given matrices $\boldsymbol{A}, \boldsymbol{B} \in \mathbb{R}^{m \times m}$, it holds that $\text{Det}(\boldsymbol{AB}) = \text{Det}(\boldsymbol{A})\text{Det}(\boldsymbol{B})$</span></p>
<p><strong>Proof:</strong></p>
<p>First, if $\boldsymbol{A}$ is singular, then $\boldsymbol{AB}$ is also singular. By Theorem 2, we know that the determinant of a singular matrix is zero and thus it trivially holds that $\text{Det}(\boldsymbol{AB}) = \text{Det}(\boldsymbol{A})\text{Det}(\boldsymbol{B})$ since both $\text{Det}(\boldsymbol{AB}) = 0$ and also $\text{Det}(\boldsymbol{A})\text{Det}(\boldsymbol{B}) = 0$ (since $\text{Det}(\boldsymbol{A})=0$). The same is true if $\boldsymbol{B}$ is singular. Thus, our proof will focus only on the case in which $\boldsymbol{A}$ and $\boldsymbol{B}$ are both invertible.</p>
<p>We first begin by proving that given an elementary matrix $\boldsymbol{E}$, it holds that</p>
\[\text{Det}(\boldsymbol{AE}) = \text{Det}(\boldsymbol{A})\text{Det}(\boldsymbol{E})\]
<p>To show this, let us consider each of the three types of elementary matrices individually. First, if $\boldsymbol{E}$ is a scaling matrix where the $i$th diagonal entry is a scalar $k$, then $\boldsymbol{AE}$ will scale the $i$th column of $\boldsymbol{A}$ by $k$. By Axiom 3 of the determinant, scaling a single column by $k$ will scale the full determinant by $k$. Moreover, by Theorem 5, the determinant of $\boldsymbol{E}$ is $k$. Thus,</p>
\[\begin{align*}\text{Det}(\boldsymbol{AE}) &= \text{Det}(\boldsymbol{A})k && \text{Axiom 3} \\ &= \text{Det}(\boldsymbol{A})\text{Det}(\boldsymbol{E}) && \text{Theorem 5}\end{align*}\]
<p>Now let’s consider the case where $\boldsymbol{E}$ is a row-swapping matrix that swaps rows $i$ and $j$. Here, $\boldsymbol{AE}$ will swap the $i$th and $j$th columns of $\boldsymbol{A}$. By Theorem 1, swapping any two columns of a matrix flips the sign of the determinant. Moreover, by Theorem 6, the determinant of $\boldsymbol{E}$ is $-1$. Thus,</p>
\[\begin{align*}\text{Det}(\boldsymbol{AE}) &= -\text{Det}(\boldsymbol{A}) && \text{Theorem 1} \\ &= \text{Det}(\boldsymbol{A})\text{Det}(\boldsymbol{E}) && \text{Theorem 6}\end{align*}\]
<p>Finally let’s consider the case where $\boldsymbol{E}$ is an elementary matrix that adds a multiple of one row to another. Here, $\boldsymbol{AE}$ will add a multiple of one column of $\boldsymbol{A}$ to another column. By Theorem 4, this does not change the determinant. Moreover, by Theorem 7, the determinant of $\boldsymbol{E}$ is simply 1. Thus,</p>
\[\begin{align*}\text{Det}(\boldsymbol{AE}) &= \text{Det}(\boldsymbol{A}) && \text{Theorem 4} \\ &= \text{Det}(\boldsymbol{A})\text{Det}(\boldsymbol{E}) && \text{Theorem 7}\end{align*}\]
<p>Now we have proven that for an invertible matrix $\boldsymbol{A}$ and elementary matrix $\boldsymbol{E}$ it holds that $\text{Det}(\boldsymbol{AE}) = \text{Det}(\boldsymbol{A})\text{Det}(\boldsymbol{E})$. Let’s now turn to the general case where we are multiplying two invertible matrices $\boldsymbol{A}$ and $\boldsymbol{B}$.</p>
<p>As <a href="https://mbernste.github.io/posts/row_reduction/">we have shown</a>, any invertible matrix can be decomposed as the product of some sequence of elementary matrices. Thus, writing $\boldsymbol{B} = \boldsymbol{E}_1\boldsymbol{E}_2 \dots \boldsymbol{E}_k$, we have</p>
\[\boldsymbol{AB} = \boldsymbol{A}\boldsymbol{E}_1\boldsymbol{E}_2 \dots \boldsymbol{E}_k\]
<p>Then we apply our newly proven fact that for an elementary matrix $\boldsymbol{E}$ it holds that $\text{Det}(\boldsymbol{AE}) = \text{Det}(\boldsymbol{A})\text{Det}(\boldsymbol{E})$. We apply this fact in an iterative way from right to left as shown below:</p>
\[\begin{align*}\text{Det}(\boldsymbol{AB}) &= \text{Det}(\boldsymbol{A}\boldsymbol{E}_1\boldsymbol{E}_2 \dots \boldsymbol{E}_{k-1}\boldsymbol{E}_k) \\ &= \text{Det}(\boldsymbol{A}\boldsymbol{E}_1\boldsymbol{E}_2 \dots \boldsymbol{E}_{k-1})\text{Det}(\boldsymbol{E}_k) \\ &= \text{Det}(\boldsymbol{A}\boldsymbol{E}_1\boldsymbol{E}_2 \dots \boldsymbol{E}_{k-2})\text{Det}(\boldsymbol{E}_{k-1})\text{Det}(\boldsymbol{E}_k) \\ &= \text{Det}(\boldsymbol{A})\prod_{i=1}^k \text{Det}(\boldsymbol{E}_i) \end{align*}\]
<p>Now we apply the rule $\text{Det}(\boldsymbol{AE}) = \text{Det}(\boldsymbol{A})\text{Det}(\boldsymbol{E})$ in reverse to recombine the elementary matrices, again moving from right to left:</p>
\[\begin{align*}\text{Det}(\boldsymbol{AB}) &= \text{Det}(\boldsymbol{A})\prod_{i=1}^k \text{Det}(\boldsymbol{E}_i) \\ &= \text{Det}(\boldsymbol{A})\left( \prod_{i=1}^{k-2}\text{Det}(\boldsymbol{E}_i)\right) \text{Det}(\boldsymbol{E}_{k-1}\boldsymbol{E}_{k}) \\ &= \text{Det}(\boldsymbol{A})\left( \prod_{i=1}^{k-3}\text{Det}(\boldsymbol{E}_i)\right) \text{Det}(\boldsymbol{E}_{k-2} \boldsymbol{E}_{k-1}\boldsymbol{E}_{k}) \\ &= \text{Det}(\boldsymbol{A}) \text{Det}(\boldsymbol{E}_1\boldsymbol{E}_2 \dots \boldsymbol{E}_{k-1}\boldsymbol{E}_k) \\ &= \text{Det}(\boldsymbol{A})\text{Det}(\boldsymbol{B})\end{align*}\]
<p>$\square$</p>
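<p>To illustrate the result with arbitrarily chosen $2 \times 2$ matrices (determinants computed with the formula from Theorem 11), let</p>
\[\boldsymbol{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \qquad \boldsymbol{B} = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}, \qquad \boldsymbol{AB} = \begin{bmatrix} 2 & 3 \\ 4 & 7 \end{bmatrix}\]
<p>Then $\text{Det}(\boldsymbol{A}) = 4 - 6 = -2$ and $\text{Det}(\boldsymbol{B}) = 0 - 1 = -1$, while $\text{Det}(\boldsymbol{AB}) = 14 - 12 = 2 = (-2)(-1)$.</p>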
<p><span style="color:#0060C6"><strong>Theorem 9:</strong> Given a square matrix $\boldsymbol{A}$, it holds that $\text{Det}(\boldsymbol{A}) = \text{Det}(\boldsymbol{A}^T)$.</span></p>
<p><strong>Proof:</strong></p>
<p>First, if $\boldsymbol{A}$ is singular, then $\boldsymbol{A}^T$ <a href="https://en.wikipedia.org/wiki/Transpose#:~:text=The%20transpose%20of%20an%20invertible,either%20of%20these%20equivalent%20expressions.">is also singular</a>. By Theorem 2, the determinant of a singular matrix is zero and thus, $\text{Det}(\boldsymbol{A}) = \text{Det}(\boldsymbol{A}^T) = 0$.</p>
<p>If $\boldsymbol{A}$ is invertible, then we can express $\boldsymbol{A}$ as the product of some sequence of elementary matrices:</p>
\[\boldsymbol{A} = \boldsymbol{E}_1\boldsymbol{E}_2 \dots \boldsymbol{E}_k\]
<p>Then,</p>
\[\boldsymbol{A}^T = \boldsymbol{E}_k^T \boldsymbol{E}_{k-1}^T \dots \boldsymbol{E}_1^T\]
<p>We note that the determinant of every elementary matrix is equal to the determinant of its transpose. Both scaling elementary matrices and row-swapping matrices are symmetric and thus, their transposes are equal to themselves. Thus the determinant of their transpose is equal to their own determinant. For a row-sum elementary matrix, the transpose is still a triangular matrix with ones along its diagonal and thus, its determinant also equals the determinant of its transpose (since by Theorem 3, the determinant of a triangular matrix can be computed by multiplying the diagonal entries).</p>
<p>Then, we apply Theorem 8 and see that</p>
\[\begin{align*}\text{Det}(\boldsymbol{A}) &= \text{Det}(\boldsymbol{E}_1\boldsymbol{E}_2 \dots \boldsymbol{E}_k) \\ &= \text{Det}(\boldsymbol{E}_1) \text{Det}(\boldsymbol{E}_2) \dots \text{Det}(\boldsymbol{E}_k) \\ &= \text{Det}(\boldsymbol{E}_1^T) \text{Det}(\boldsymbol{E}_2^T) \dots \text{Det}(\boldsymbol{E}_k^T) \\ &= \text{Det}(\boldsymbol{E}_{k}^T) \text{Det}(\boldsymbol{E}_{k-1}^T) \dots \text{Det}(\boldsymbol{E}_1^T) \\ &= \text{Det}(\boldsymbol{E}_k^T \boldsymbol{E}_{k-1}^T \dots \boldsymbol{E}_1^T) \\ &= \text{Det}(\boldsymbol{A}^T)\end{align*}\]
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 10:</strong> The determinant of a matrix is linear with respect to the row vectors of the matrix.</span></p>
<p><strong>Proof:</strong></p>
<p>This proof follows from Theorem 9 and Axiom 3 of the determinant. Specifically, consider a square matrix $\boldsymbol{A} \in \mathbb{R}^{m \times m}$ and let $\boldsymbol{a}_{1,*}, \dots, \boldsymbol{a}_{m,*}$ be the row vectors of $\boldsymbol{A}$. Also, let the $j$th row be represented as the sum of two vectors, $\boldsymbol{a}_{j,*} = \boldsymbol{u} + \boldsymbol{v}$:</p>
\[\boldsymbol{A} := \begin{bmatrix}\boldsymbol{a}_{1, *} \\ \vdots \\ \boldsymbol{u} + \boldsymbol{v} \\ \vdots \\ \boldsymbol{a}_{m, *}\end{bmatrix}\]
<p>The determinant of $\boldsymbol{A}$ is then:</p>
\[\begin{align*}\text{Det}(\boldsymbol{A}) &= \text{Det}(\boldsymbol{A}^T) && \text{By Theorem 9} \\ &= \text{Det}\left(\boldsymbol{a}_{1,*}, \dots, \boldsymbol{v} + \boldsymbol{u}, \dots, \boldsymbol{a}_{m,*} \right) \\ &= \text{Det}\left(\boldsymbol{a}_{1,*}, \dots, \boldsymbol{v}, \dots, \boldsymbol{a}_{m,*} \right) + \text{Det}\left(\boldsymbol{a}_{1,*}, \dots, \boldsymbol{u}, \dots, \boldsymbol{a}_{m,*} \right) && \text{By Axiom 3} \end{align*}\]
<p>Next, let the $j$th row be scaled by some constant $c$. That is,</p>
\[\boldsymbol{A} := \begin{bmatrix}\boldsymbol{a}_{1, *} \\ \vdots \\ c\boldsymbol{a}_{j,*} \\ \vdots \\ \boldsymbol{a}_{m, *}\end{bmatrix}\]
<p>Then,</p>
\[\begin{align*} \text{Det}(\boldsymbol{A}) &= \text{Det}(\boldsymbol{A}^T) && \text{By Theorem 9} \\ &= \text{Det}\left( \boldsymbol{a}_{1,*}, \dots, c\boldsymbol{a}_{j,*}, \dots, \boldsymbol{a}_{m,*} \right) \\ &= c\text{Det}\left(\boldsymbol{a}_{1,*}, \dots, \boldsymbol{a}_{j,*}, \dots, \boldsymbol{a}_{m,*} \right) && \text{By Axiom 3} \end{align*}\]
<p>$\square$</p>
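<p>As a small illustration of additivity in a row (arbitrary numbers, $2 \times 2$ formula from Theorem 11), split the first row $(1, 2)$ into $\boldsymbol{u} = (1, 0)$ and $\boldsymbol{v} = (0, 2)$:</p>
\[\text{Det}\left(\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\right) = -2, \qquad \text{Det}\left(\begin{bmatrix} 1 & 0 \\ 3 & 4 \end{bmatrix}\right) + \text{Det}\left(\begin{bmatrix} 0 & 2 \\ 3 & 4 \end{bmatrix}\right) = 4 + (-6) = -2\]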
<p><span style="color:#0060C6"><strong>Theorem 11:</strong> Let $\text{Det} : \mathbb{R}^{m \times m} \rightarrow \mathbb{R}$ be a function that satisfies the following three properties:</span></p>
<p><span style="color:#0060C6">1. $\text{Det}(\boldsymbol{I}) = 1$</span></p>
<p><span style="color:#0060C6">2. Given $\boldsymbol{A} \in \mathbb{R}^{m \times m}$, if any two columns of $\boldsymbol{A}$ are equal, then $\text{Det}(\boldsymbol{A}) = 0$</span></p>
<p><span style="color:#0060C6">3. $\text{Det}$ is linear with respect to the column-vectors of its input.</span></p>
<p><span style="color:#0060C6">Then $\text{Det}$ is given by</span></p>
<p><span style="color:#0060C6">\(\text{Det}(\boldsymbol{A}) := \begin{cases} a_{1,1}a_{2,2} - a_{1,2}a_{2,1} & \text{if $m = 2$} \\ \sum_{i=1}^m (-1)^{i+1} a_{i,1} \text{Det}(\boldsymbol{A}_{-i,-1}) & \text{if $m > 2$}\end{cases}\)</span></p>
<p><strong>Proof:</strong></p>
<p>Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times m}$, let $\boldsymbol{A}_{-i,-j}$ be the sub-matrix of $\boldsymbol{A}$ where the $i$th row and $j$th column are deleted. For example, for a $3 \times 3$ matrix</p>
\[\boldsymbol{A} := \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix}\]
<p>$\boldsymbol{A}_{-1,-1}$ would be</p>
\[\boldsymbol{A}_{-1,-1} = \begin{bmatrix} e & f \\ h & i \end{bmatrix}\]
<p>Now, consider an elementary matrix $\boldsymbol{E} \in \mathbb{R}^{m \times m}$. Let us define $\boldsymbol{E}'$ to be an elementary matrix in $\mathbb{R}^{(m+1) \times (m+1)}$ that is formed by taking $\boldsymbol{E}$ and adding a new first row and first column whose only non-zero entry is a 1 in the top-left corner. That is,</p>
\[\boldsymbol{E}' := \begin{bmatrix}1 & 0 & \dots & 0 \\ 0 & & & & \\ \vdots & & \boldsymbol{E} & & \\ 0 & & & &\end{bmatrix}\]
<p>Notice that $\boldsymbol{E}'$ is an elementary matrix that represents the same operation as $\boldsymbol{E}$, but performs this operation on a matrix in $\mathbb{R}^{(m+1) \times (m+1)}$ instead of a matrix in $\mathbb{R}^{m \times m}$ and leaves the first row alone. Thus, by Theorems 5, 6, and 7 it follows that:</p>
\[\text{Det}(\boldsymbol{E}') = \text{Det}(\boldsymbol{E})\]
<p>Let’s keep this fact in the back of our mind, but now turn our attention towards $\boldsymbol{A}$. Let us say that $\boldsymbol{A}$ is a matrix where the first column-vector only has a non-zero entry in the first row. That is, let’s say $\boldsymbol{A}$ looks as follows</p>
\[\boldsymbol{A} = \begin{bmatrix}a_{1,1} & a_{1,2} & \dots & a_{1,m} \\ 0 & & & & \\ \vdots & & \boldsymbol{A}_{-1,-1} & & \\ 0 & & & &\end{bmatrix}\]
<p>Then we can show that $\text{Det}(\boldsymbol{A}) = a_{1,1}\text{Det}(\boldsymbol{A}_{-1,-1})$ via the following (see notes below the derivation for more details on some of the key steps):</p>
\[\begin{align*}\text{Det}(\boldsymbol{A}) &= \text{Det}\left( \begin{bmatrix}a_{1,1} & 0 & \dots & 0 \\ 0 & & & & \\ \vdots & & \boldsymbol{A}_{-1,-1} & & \\ 0 & & & &\end{bmatrix} \right) + \text{Det}\left( \begin{bmatrix}0 & a_{1,2} & \dots & 0 \\ 0 & & & & \\ \vdots & & \boldsymbol{A}_{-1,-1} & & \\ 0 & & & &\end{bmatrix} \right) + \dots + \text{Det}\left( \begin{bmatrix}0 & 0 & \dots & a_{1,m} \\ 0 & & & & \\ \vdots & & \boldsymbol{A}_{-1,-1} & & \\ 0 & & & &\end{bmatrix} \right) \ \text{Theorem 10} \\ &= \text{Det}\left( \begin{bmatrix}a_{1,1} & 0 & \dots & 0 \\ 0 & & & & \\ \vdots & & \boldsymbol{A}_{-1,-1} & & \\ 0 & & & &\end{bmatrix} \right) \ \text{by Theorem 2 (see Note 1)} \\ &= a_{1,1}\text{Det}\left( \begin{bmatrix}1 & 0 & \dots & 0 \\ 0 & & & & \\ \vdots & & \boldsymbol{A}_{-1,-1} & & \\ 0 & & & &\end{bmatrix}\right) \ \text{by Axiom 3} \\ &= a_{1,1}\text{Det}\left( \begin{bmatrix}1 & 0 & \dots & 0 \\ 0 & & & & \\ \vdots & & \boldsymbol{E}_1 \boldsymbol{E}_2 \dots \boldsymbol{E}_k & & \\ 0 & & & &\end{bmatrix}\right) \ \text{see Note 2} \\ &= a_{1,1}\text{Det}(\boldsymbol{E}'_1 \boldsymbol{E}'_2 \dots \boldsymbol{E}'_k) \ \text{see Note 3} \\ &= a_{1,1}\text{Det}(\boldsymbol{E}'_1) \text{Det}(\boldsymbol{E}'_2) \dots \text{Det}(\boldsymbol{E}'_k) \ \text{Theorem 8} \\ &= a_{1,1}\text{Det}(\boldsymbol{E}_1) \text{Det}(\boldsymbol{E}_2) \dots \text{Det}(\boldsymbol{E}_k) \\ &= a_{1,1}\text{Det}(\boldsymbol{E}_1\boldsymbol{E}_2 \dots \boldsymbol{E}_k) \ \text{Theorem 8} \\ &= a_{1,1}\text{Det}(\boldsymbol{A}_{-1,-1}) \end{align*}\]
<p><strong>Note 1:</strong> Notice in the previous line, all of the determinants except the first are zero since the first column vector of each of their matrix arguments is the zero vector. Thus, these are all singular matrices and by Theorem 2, their determinants are zero.</p>
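<p><strong>Note 2:</strong> If $\boldsymbol{A}_{-1,-1}$ is invertible, then we can factor it into a product of elementary matrices:</p>
\[\boldsymbol{A}_{-1,-1} = \boldsymbol{E}_1 \boldsymbol{E}_2 \dots \boldsymbol{E}_k\]
<p>for some number $k$ of elementary matrices. If $\boldsymbol{A}_{-1,-1}$ is instead singular, its column-vectors are linearly dependent, and so are the last $m-1$ column-vectors of the full matrix; by Theorem 2, both determinants are then zero and the result still holds.</p>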
<p><strong>Note 3:</strong> Here, we use the fact that for some elementary matrix $\boldsymbol{E}$ and some matrix $\boldsymbol{B} \in \mathbb{R}^{m \times m}$, it holds that</p>
\[\begin{bmatrix}1 & 0 & \dots & 0 \\ 0 & & & & \\ \vdots & & \boldsymbol{E}\boldsymbol{B} & & \\ 0 & & & &\end{bmatrix} = \boldsymbol{E}' \begin{bmatrix}1 & 0 & \dots & 0 \\ 0 & & & & \\ \vdots & & \boldsymbol{B} & & \\ 0 & & & &\end{bmatrix}\]
<p>We then apply this iteratively to</p>
\[\begin{bmatrix}1 & 0 & \dots & 0 \\ 0 & & & & \\ \vdots & & \boldsymbol{E}_1 \boldsymbol{E}_2 \dots \boldsymbol{E}_k & & \\ 0 & & & &\end{bmatrix}\]
<p><br /></p>
<p>Finally, at long last, we can derive the formula for the determinant. Let us consider a general matrix $\boldsymbol{A}$:</p>
\[\boldsymbol{A} = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} & \dots & a_{1,m} \\ a_{2,1} & & & & \\ a_{3,1} & & & & \\ \vdots & & \boldsymbol{A}_{-1,-1} & & \\ a_{m,1} & & & &\end{bmatrix}\]
<p>Then,</p>
\[\begin{align*} \text{Det}(\boldsymbol{A}) &= \text{Det}\left(\begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} & \dots & a_{1,m} \\ a_{2,1} & & & & \\ a_{3,1} & & & & \\ \vdots & & \boldsymbol{A}_{-1,-1} & & \\ a_{m,1} & & & &\end{bmatrix}\right) \\ &= \text{Det}\left(\begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} & \dots & a_{1,m} \\ 0 & & & & \\ 0 & & & & \\ \vdots & & \boldsymbol{A}_{-1,-1} & & \\ 0 & & & &\end{bmatrix}\right)+ \text{Det}\left(\begin{bmatrix}0 & a_{1,2} & a_{1,3} & \dots & a_{1,m} \\ a_{2,1} & & & & \\ 0 & & & & \\ \vdots & & \boldsymbol{A}_{-1,-1} & & \\ 0 & & & &\end{bmatrix}\right) + \text{Det}\left(\begin{bmatrix} 0 & a_{1,2} & a_{1,3} & \dots & a_{1,m} \\ 0 & & & & \\ a_{3,1} & & & & \\ \vdots & & \boldsymbol{A}_{-1,-1} & & \\ 0 & & & &\end{bmatrix}\right) + \dots + \text{Det}\left(\begin{bmatrix} 0 & a_{1,2} & a_{1,3} & \dots & a_{1,m} \\ 0 & & & & \\ 0 & & & & \\ \vdots & & \boldsymbol{A}_{-1,-1} & & \\ a_{m,1} & & & &\end{bmatrix}\right)\end{align*}\]
<p>For each term, we can move the row with a non-zero element in the first column to the top row while maintaining the relative order of the remaining $m-1$ rows. Performing this operation on each term in the summation will result in an alternation of addition and subtraction. The reason for this is that if the row we are moving to the first row is even-numbered, this procedure will require an odd number of row swaps, each of which flips the sign of the determinant (by Theorems 1 and 9). On the other hand, if the row is odd-numbered, this procedure will require an even number of swaps. This is illustrated by the following schematic:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/determinant_formula_number_swaps.png" alt="drawing" width="700" /></center>
<p><br /></p>
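<p>The swap-counting argument can also be checked with a short Python sketch. The helper name <code>move_row_to_top</code> and the string labels for rows are hypothetical, chosen only for illustration. Moving the row at 1-based position $i$ to the top with adjacent swaps takes $i - 1$ swaps, which gives the sign $(-1)^{i-1} = (-1)^{i+1}$.</p>

```python
def move_row_to_top(rows, i):
    """Move the row at 0-based index i to the top using adjacent swaps.

    Returns the reordered rows and the number of swaps performed.
    The relative order of the other rows is preserved.
    """
    rows = list(rows)  # copy so the caller's list is untouched
    swaps = 0
    while i > 0:
        rows[i - 1], rows[i] = rows[i], rows[i - 1]
        i -= 1
        swaps += 1
    return rows, swaps


rows = ["r1", "r2", "r3", "r4"]
for i in range(len(rows)):
    reordered, swaps = move_row_to_top(rows, i)
    # Each swap flips the sign of the determinant once.
    print(i + 1, reordered, swaps, (-1) ** swaps)
```

<p>Rows 1 and 3 need an even number of swaps (sign $+1$), while rows 2 and 4 need an odd number (sign $-1$), matching the alternation described above.</p>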
<p>Thus, we have</p>
\[\begin{align*}\text{Det}(\boldsymbol{A}) \\ \ \ &= \text{Det}\left(\begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} & \dots & a_{1,m} \\ 0 & & & & \\ 0 & & & & \\ \vdots & & \boldsymbol{A}_{-1,-1} & & \\ 0 & & & &\end{bmatrix}\right) - \text{Det}\left(\begin{bmatrix}a_{2,1} & a_{2,2} & a_{2,3} & \dots & a_{2,m} \\ 0 & & & & \\ 0 & & & & \\ \vdots & & \boldsymbol{A}_{-2,-1} & & \\ 0 & & & &\end{bmatrix}\right) + \text{Det}\left(\begin{bmatrix} a_{3,1} & a_{3,2} & a_{3,3} & \dots & a_{3,m} \\ 0 & & & & \\ 0 & & & & \\ \vdots & & \boldsymbol{A}_{-3,-1} & & \\ 0 & & & &\end{bmatrix}\right) - \dots \pm \text{Det}\left(\begin{bmatrix} a_{m,1} & a_{m,2} & a_{m,3} & \dots & a_{m,m} \\ 0 & & & & \\ 0 & & & & \\ \vdots & & \boldsymbol{A}_{-m,-1} & & \\ 0 & & & &\end{bmatrix}\right) \\ &= \text{Det}\left(\begin{bmatrix} a_{1,1} & 0 & 0 & \dots & 0 \\ 0 & & & & \\ 0 & & & & \\ \vdots & & \boldsymbol{A}_{-1,-1} & & \\ 0 & & & &\end{bmatrix}\right) - \text{Det}\left(\begin{bmatrix}a_{2,1} & 0 & 0 & \dots & 0 \\ 0 & & & & \\ 0 & & & & \\ \vdots & & \boldsymbol{A}_{-2,-1} & & \\ 0 & & & &\end{bmatrix}\right) + \text{Det}\left(\begin{bmatrix} a_{3,1} & 0 & 0 & \dots & 0 \\ 0 & & & & \\ 0 & & & & \\ \vdots & & \boldsymbol{A}_{-3,-1} & & \\ 0 & & & &\end{bmatrix}\right) - \dots \pm \ \text{Det}\left(\begin{bmatrix} a_{m,1} & 0 & 0 & \dots & 0 \\ 0 & & & & \\ 0 & & & & \\ \vdots & & \boldsymbol{A}_{-m,-1} & & \\ 0 & & & &\end{bmatrix}\right) \\ &= a_{1,1}\text{Det}(\boldsymbol{A}_{-1,-1}) - a_{2,1}\text{Det}(\boldsymbol{A}_{-2,-1}) + a_{3,1}\text{Det}(\boldsymbol{A}_{-3,-1}) - \dots \pm a_{m,1}\text{Det}(\boldsymbol{A}_{-m,-1})\end{align*}\]
<p>Thus we have arrived at our recursive formula where, for each term (corresponding to each row), we compute the determinant of a sub-matrix. This recursion proceeds all the way down until we reach a $2 \times 2$ matrix, whose determinant is $a_{1,1}a_{2,2} - a_{1,2}a_{2,1}$. Putting it all together, we arrive at the formula for the determinant:</p>
\[\text{Det}(\boldsymbol{A}) := \begin{cases} a_{1,1}a_{2,2} - a_{1,2}a_{2,1} & \text{if $m = 2$} \\ \sum_{i=1}^m (-1)^{i+1} a_{i,1} \text{Det}(\boldsymbol{A}_{-i,-1}) & \text{if $m > 2$}\end{cases}\]
<p>$\square$</p>Matthew N. BernsteinThe determinant is a function that maps each square matrix to a value that describes the volume of the parallelepiped formed by that matrix’s columns. While this idea is fairly straightforward conceptually, the formula for the determinant is quite confusing. In this post, we will derive the formula for the determinant in an effort to make it less mysterious. Much of my understanding of this material comes from these lecture notes by Mark Demers re-written in my own words.Vector spaces induced by matrices: column, row, and null spaces2023-06-19T00:00:00-07:002023-06-19T00:00:00-07:00https://mbernste.github.io/posts/matrix_spaces<p><em>Matrices are one of the fundamental objects studied in linear algebra. While on their surface they appear like simple tables of numbers, this simplicity hides deeper mathematical structures that they contain. In this post, we will dive into the deeper structures within matrices by discussing three vector spaces that are induced by every matrix: a column space, a row space, and a null space.</em></p>
<h2 id="introduction">Introduction</h2>
<p>Matrices are one of the fundamental objects studied in linear algebra. While on their surface they appear like simple tables of numbers, <a href="https://mbernste.github.io/posts/matrices/">as we have previously described</a>, this simplicity hides deeper mathematical structures that they contain. In this post, we will dive into the deeper structures within matrices by showing three vector spaces that are implicitly defined by every matrix:</p>
<ol>
<li>A column space</li>
<li>A row space</li>
<li>A null space</li>
</ol>
<p>Not only will we discuss the definition for these spaces and how they relate to one another, we will also discuss how to best intuit these spaces and what their properties tell us about the matrix itself.</p>
<p>To understand these spaces, we will need to look at matrices from <a href="https://mbernste.github.io/posts/understanding_3d/">different perspectives</a>. <a href="https://mbernste.github.io/posts/matrices/">In a previous discussion on matrices</a>, we discussed how there are three complementary perspectives for viewing matrices:</p>
<ul>
<li><strong>Perspective 1:</strong> A matrix as a table of numbers</li>
<li><strong>Perspective 2:</strong> A matrix as a list of vectors (both row and column vectors)</li>
<li><strong>Perspective 3:</strong> A matrix as a function mapping vectors from one space to another</li>
</ul>
<p>By viewing matrices through these perspectives we can gain a better intuition for the vector spaces induced by matrices. Let’s get started.</p>
<h2 id="column-spaces">Column spaces</h2>
<p>The <strong>column space</strong> of a matrix is simply the <a href="https://mbernste.github.io/posts/vector_spaces/">vector space</a> <a href="https://mbernste.github.io/posts/linear_independence/">spanned</a> by its column-vectors:</p>
<p><span style="color:#0060C6"><strong>Definition 1 (column space):</strong> Given a matrix $\boldsymbol{A}$, the <strong>column space</strong> of $\boldsymbol{A}$ is the vector space spanned by the column-vectors of $\boldsymbol{A}$</span></p>
<p>To understand the column space of a matrix $\boldsymbol{A}$, we will consider the matrix from Perspectives 2 and 3 – that is, $\boldsymbol{A}$ as a list of column vectors and as a function mapping vectors from one space to another.</p>
<p><strong>Understanding the column space when viewing matrices as lists of column vectors</strong></p>
<p>The least abstract way to view the column space of a matrix is to treat the matrix as a simple list of column-vectors. For example:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/matrix_visualize_column_vectors.png" alt="drawing" width="600" /></center>
<p>The column space is then the vector space that is <a href="https://mbernste.github.io/posts/linear_independence/">spanned</a> by these three vectors. We see that in the example above, the column space is all of $\mathbb{R}^2$ since we can form <em>any</em> two-dimensional vector using a linear combination of these three vectors:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/matrix_visualize_column_space.png" alt="drawing" width="350" /></center>
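<p>The claim that these columns span all of $\mathbb{R}^2$ can be checked numerically. The sketch below assumes the example is the $2 \times 3$ matrix written out later in this post; the target vector <code>b</code> is an arbitrary choice made for illustration.</p>

```python
import numpy as np

# Example matrix (assumed to match the one used in the figures of this post).
A = np.array([[1.0, 2.0, 1.0],
              [0.0, 1.0, -1.0]])

# The columns span R^2 exactly when the rank of A equals 2.
rank = np.linalg.matrix_rank(A)

# Any target b in R^2 can then be written as a linear combination of the columns.
b = np.array([3.0, -4.0])  # arbitrary target vector
coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
reconstruction = A @ coeffs  # should reproduce b
```

<p>Here <code>coeffs</code> holds one valid set of coefficients; because there are three columns in a two-dimensional space, many other choices would work equally well.</p>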
<p><strong>Understanding the column space when viewing matrices as functions</strong></p>
<p>To gain a deeper understanding of the significance of the column space of a matrix, we will now view matrices as <a href="https://mbernste.github.io/posts/matrices_as_functions/">functions between vector spaces</a>. That is, recall that for a given matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, we can view this matrix as a function that maps vectors from $\mathbb{R}^n$ to vectors in $\mathbb{R}^m$. This mapping is implemented by <a href="https://mbernste.github.io/posts/matrix_vector_mult/">matrix-vector multiplication</a>. A vector $\boldsymbol{x} \in \mathbb{R}^n$ is mapped to vector $\boldsymbol{b} \in \mathbb{R}^m$ via</p>
\[\boldsymbol{Ax} = \boldsymbol{b}\]
<p>Stated more explicitly, we can define a function $T: \mathbb{R}^n \rightarrow \mathbb{R}^m$ as:</p>
\[T(\boldsymbol{x}) := \boldsymbol{Ax}\]
<p>It turns out that the column space is simply the <a href="https://en.wikipedia.org/wiki/Range_of_a_function">range</a> of this function $T$! That is, it is the set of all vectors that $\boldsymbol{A}$ is capable of mapping to. To see why this is the case, recall that we can view matrix-vector multiplication between $\boldsymbol{A}$ and $\boldsymbol{x}$ as the act of taking a linear combination of the columns of $\boldsymbol{A}$ using the elements of $\boldsymbol{x}$ as coefficients:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/matrix_vector_multiplication_column_space2.png" alt="drawing" width="600" /></center>
<p>Here we see that the output of this matrix-defined function will always be contained in the span of the column vectors of $\boldsymbol{A}$.</p>
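<p>As a small illustration of this view, the sketch below computes $\boldsymbol{Ax}$ twice: once directly, and once as an explicit linear combination of the columns of $\boldsymbol{A}$. The matrix is assumed to be this post's example matrix, and the vector <code>x</code> is made up for the example.</p>

```python
import numpy as np

A = np.array([[1.0, 2.0, 1.0],
              [0.0, 1.0, -1.0]])
x = np.array([2.0, -1.0, 3.0])  # illustrative input vector

# Matrix-vector product computed directly.
direct = A @ x

# The same product as a linear combination of the columns of A,
# using the elements of x as the coefficients.
combination = sum(x[j] * A[:, j] for j in range(A.shape[1]))
```

<p>Both computations give the same vector, which is why every output of $T$ lies in the column space.</p>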
<h2 id="row-spaces">Row spaces</h2>
<p>The <strong>row space</strong> of a matrix is the vector space spanned by its row-vectors:</p>
<p><span style="color:#0060C6"><strong>Definition 2 (row space):</strong> Given a matrix $\boldsymbol{A}$, the <strong>row space</strong> of $\boldsymbol{A}$ is the vector space spanned by the row-vectors of $\boldsymbol{A}$</span></p>
<p>To understand the row space of a matrix $\boldsymbol{A}$, we will consider the matrix from Perspective 2 – that is, $\boldsymbol{A}$ as a list of row vectors. For example:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/matrix_visualize_row_vectors.png" alt="drawing" width="600" /></center>
<p>The row space is then the vector space that is spanned by these vectors. We see that in this example, the row space is a hyperplane:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/matrix_visualize_row_space.png" alt="drawing" width="350" /></center>
<p>Unlike the column space, the row space cannot be interpreted as either the domain or range of the function defined by the matrix. So what is the geometric significance of the row space in the context of Perspective 3 (viewing matrices as functions)? Unfortunately, this does not become evident until we discuss the <em>null space</em> in the next section!</p>
<h2 id="null-spaces">Null spaces</h2>
<p>The <strong>null space</strong> of a matrix is the third vector space that is induced by matrices. To understand the null space, we will need to view matrices from Perspective 3: matrices as functions between vector spaces.</p>
<p>Specifically, the null space of a matrix $\boldsymbol{A}$ is the set of all vectors that $\boldsymbol{A}$ maps to the zero vector. That is, the null space is the set of all vectors $\boldsymbol{x} \in \mathbb{R}^n$ for which $\boldsymbol{Ax} = \boldsymbol{0}$:</p>
<p><span style="color:#0060C6"><strong>Definition 3 (null space):</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, the <strong>null space</strong> of $\boldsymbol{A}$ is the set of vectors, $\{\boldsymbol{x} \in \mathbb{R}^n \mid \boldsymbol{Ax} = \boldsymbol{0}\}$</span></p>
<p>It turns out that there is a key relationship between the null space and the row space of a matrix: the null space is the <strong>orthogonal complement</strong> to the row space (Theorem 1 in the Appendix to this post). Before going further, let us define the orthogonal complement. Given a vector space $(\mathcal{V}, \mathcal{F})$, the orthogonal complement to this vector space is another vector space, $(\mathcal{V}', \mathcal{F})$, where all vectors in $\mathcal{V}'$ are orthogonal to all vectors in $\mathcal{V}$:</p>
<p><span style="color:#0060C6"><strong>Definition 4 (orthogonal complement):</strong> Given two vector spaces $(\mathcal{V}, \mathcal{F})$ and $(\mathcal{V}', \mathcal{F})$ that share the same scalar field, each is an <strong>orthogonal complement</strong> to the other if $\forall \boldsymbol{v} \in \mathcal{V}, \ \forall \boldsymbol{v}' \in \mathcal{V}' \ \langle \boldsymbol{v}, \boldsymbol{v}' \rangle = 0$</span></p>
<p>With this definition in hand, we can state the relationship between the null space and the row space more formally:</p>
<p><span style="color:#0060C6"><strong>Theorem 1 (null space is orthogonal complement of row space):</strong> Given a matrix $\boldsymbol{A}$, the null space of $\boldsymbol{A}$ is the orthogonal complement to the row space of $\boldsymbol{A}$.</span></p>
<p>To see why the null space and row space are orthogonal complements, recall that we can view matrix-vector multiplication between a matrix $\boldsymbol{A}$ and a vector $\boldsymbol{x}$ as the process of taking a dot product of each row of $\boldsymbol{A}$ with $\boldsymbol{x}$:</p>
\[\boldsymbol{Ax} := \begin{bmatrix} \boldsymbol{a}_{1,*} \cdot \boldsymbol{x} \\ \boldsymbol{a}_{2,*} \cdot \boldsymbol{x} \\ \vdots \\ \boldsymbol{a}_{m,*} \cdot \boldsymbol{x} \end{bmatrix}\]
<p>If $\boldsymbol{x}$ is in the null space of $\boldsymbol{A}$ then this means that $\boldsymbol{Ax} = \boldsymbol{0}$, which means that every dot product shown above is zero. That is,</p>
\[\begin{align*}\boldsymbol{Ax} &= \begin{bmatrix} \boldsymbol{a}_{1,*} \cdot \boldsymbol{x} \\ \boldsymbol{a}_{2,*} \cdot \boldsymbol{x} \\ \vdots \\ \boldsymbol{a}_{m,*} \cdot \boldsymbol{x} \end{bmatrix} \\ &= \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \\ &= \boldsymbol{0} \end{align*}\]
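<p>Recall, if the dot product between a pair of vectors is zero, then the two vectors are orthogonal. Thus we see that if $\boldsymbol{x}$ is in the null space of $\boldsymbol{A}$ it <em>has</em> to be orthogonal to every row-vector of $\boldsymbol{A}$, and therefore to every linear combination of the rows, which is to say every vector in the row space. Conversely, any vector that is orthogonal to every row of $\boldsymbol{A}$ satisfies $\boldsymbol{Ax} = \boldsymbol{0}$ and so lies in the null space. This means that the null space is the orthogonal complement to the row space!</p>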
<p>We can visualize the relationship between the row space and null space using our example matrix:</p>
\[\begin{bmatrix}1 & 2 & 1 \\ 0 & 1 & -1\end{bmatrix}\]
<p>The null space for this matrix comprises all of the vectors that point along the red vector shown below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/null_space_compliment_row_space_3.png" alt="drawing" width="500" /></center>
<p>Notice that this red vector is orthogonal to the hyperplane that represents the row space of $\boldsymbol{A}$.</p>
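<p>This orthogonality can also be verified numerically. The sketch below uses the singular value decomposition to obtain a basis for the null space of the example matrix; using the SVD is one common way to do this and is a choice made here for illustration, not something the figures depend on.</p>

```python
import numpy as np

A = np.array([[1.0, 2.0, 1.0],
              [0.0, 1.0, -1.0]])

# Rows of Vt beyond the rank form an orthonormal basis of the null space.
_, singular_values, Vt = np.linalg.svd(A)
rank = int(np.sum(singular_values > 1e-10))
null_basis = Vt[rank:]  # shape: (n - rank, n)

# Every null-space basis vector is orthogonal to every row of A,
# so all of these dot products should be (numerically) zero.
row_dot_products = A @ null_basis.T
```

<p>For this matrix the null space is a line, and solving $\boldsymbol{Ax} = \boldsymbol{0}$ by hand shows that it points along $(-3, 1, 1)$.</p>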
<h2 id="rank-the-intrinsic-dimensionality-of-the-row-and-column-space">Rank: the intrinsic dimensionality of the row and column space</h2>
<p>The <a href="https://mbernste.github.io/posts/intrinsic_dimensionality/">intrinsic dimensionality</a> of the row space and column space are also related to one another and tell us a lot about the matrix itself. Recall, the intrinsic dimensionality of a set of vectors is given by the maximal number of linearly independent vectors in the set. With this in mind, we can form the following definitions that describe the intrinsic dimensionalities of the row space and column space:</p>
<p><span style="color:#0060C6"><strong>Definition 5 (column rank):</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, the <strong>column rank</strong> of $\boldsymbol{A}$ is the size of the largest subset of the columns of $\boldsymbol{A}$ that is linearly independent.</span></p>
<p><span style="color:#0060C6"><strong>Definition 6 (row rank):</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, the <strong>row rank</strong> of $\boldsymbol{A}$ is the size of the largest subset of the rows of $\boldsymbol{A}$ that is linearly independent.</span></p>
<p>It turns out that the intrinsic dimensionalities of the row space and column space are always equal, and thus the column rank will always equal the row rank:</p>
<p><span style="color:#0060C6"><strong>Theorem 2 (row rank equals column rank):</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, its row rank equals its column rank.</span></p>
<p>Because the row rank and column rank are equal, we can simply talk about the <strong>rank</strong> of a matrix without the need to delineate whether we mean the row rank or the column rank.</p>
<p>Moreover, because the row rank equals the column rank of a matrix, a matrix of shape $m \times n$ can <em>at most</em> have a rank that is the minimum of $m$ and $n$. For example, a matrix with 3 rows and 5 columns can <em>at most</em> be of rank 3 (but it might be less!). In fact, we observed this phenomenon in our previous example matrix, which has a rank of 2:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/column_space_row_space_dimensionality.png" alt="drawing" width="700" /></center>
<p>As we can see, the column space spans all of $\mathbb{R}^2$ and thus, its intrinsic dimensionality is two. The row space spans a hyperplane in $\mathbb{R}^3$ and thus, its intrinsic dimensionality is also two.</p>
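<p>To see the equality of row rank and column rank on a less trivial case, the sketch below counts linearly independent vectors by Gaussian elimination in exact arithmetic, first on the rows and then on the columns. The helper <code>count_independent</code> and the $3 \times 5$ matrix <code>M</code> are hypothetical, introduced only for this illustration.</p>

```python
from fractions import Fraction

def count_independent(vectors):
    """Count linearly independent vectors via Gaussian elimination (exact arithmetic)."""
    rows = [[Fraction(v) for v in vec] for vec in vectors]
    count = 0
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        # Find a not-yet-used row with a nonzero entry in this column.
        pivot = next((r for r in range(count, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[count], rows[pivot] = rows[pivot], rows[count]
        # Eliminate this column from all rows below the pivot row.
        for r in range(count + 1, len(rows)):
            factor = rows[r][col] / rows[count][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[count])]
        count += 1
    return count

# Hypothetical 3 x 5 matrix whose third row is the sum of the first two.
M = [[1, 0, 2, 1, 3],
     [0, 1, 1, 4, 2],
     [1, 1, 3, 5, 5]]

row_rank = count_independent(M)                              # rows as vectors
column_rank = count_independent([list(c) for c in zip(*M)])  # columns as vectors
```

<p>Although the matrix has three rows and five columns, both counts come out to two, below the upper bound of $\min(3, 5) = 3$.</p>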
<h2 id="nullity-the-intrinsic-dimensionality-of-the-null-space">Nullity: the intrinsic dimensionality of the null space</h2>
<p>Where the rank of a matrix describes the intrinsic dimensionality of the row and column spaces of a matrix, the <strong>nullity</strong> describes the intrinsic dimensionality of the null space:</p>
<p><span style="color:#0060C6"><strong>Definition 7 (nullity):</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, the <strong>nullity</strong> of $\boldsymbol{A}$ is the dimension of the null space of $\boldsymbol{A}$, that is, the maximum number of linearly independent vectors in the null space of $\boldsymbol{A}$.</span></p>
<p>There is a key relationship between nullity and rank: they sum to the number of columns of $\boldsymbol{A}$! This is the statement of the rank-nullity theorem (proof provided in the Appendix to this post):</p>
<p><span style="color:#0060C6"><strong>Theorem 3 (rank-nullity theorem):</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, it holds that $\text{rank} + \text{nullity} = n$.</span></p>
<p>Below we illustrate this theorem with two examples:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/row_space_null_space_orthogonal_examples.png" alt="drawing" width="800" /></center>
<p>On the left, we have a matrix whose rows span a hyperplane in $\mathbb{R}^3$, which is of dimension 2. The null space is thus a line, which has dimension 1. In contrast, on the right we have a matrix whose rows span a line in $\mathbb{R}^3$, which is of dimension 1. The null space here is a hyperplane that is orthogonal to this line. In both examples, the dimensionality of the row space and null space sum to 3, which is the number of columns of both matrices!</p>
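<p>The same bookkeeping can be sketched in code. The two matrices below are hypothetical stand-ins for the two cases just described (rows spanning a plane, and rows spanning a line); they are not necessarily the matrices drawn in the figure. The nullity is counted from an SVD-based null-space basis rather than computed as $n$ minus the rank, so the check is not circular.</p>

```python
import numpy as np

def rank_and_nullity(A, tol=1e-10):
    """Return (rank, nullity), with the nullity counted from a null-space basis."""
    _, s, Vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    null_basis = Vt[rank:]  # orthonormal basis of the null space
    assert np.allclose(A @ null_basis.T, 0.0)
    return rank, null_basis.shape[0]

# Hypothetical matrices: rows spanning a plane, and rows spanning a line, in R^3.
plane_rows = np.array([[1.0, 2.0, 1.0],
                       [0.0, 1.0, -1.0]])
line_rows = np.array([[1.0, 2.0, 1.0],
                      [2.0, 4.0, 2.0]])

results = [rank_and_nullity(M) for M in (plane_rows, line_rows)]
```

<p>In both cases the two numbers returned add up to 3, the number of columns.</p>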
<h2 id="summarizing-the-relationships-between-matrix-spaces">Summarizing the relationships between matrix spaces</h2>
<p>We can summarize the properties of the column space, row space, and null space with the following table organized around Perspective 2 (matrices as lists of vectors) and Perspective 3 (matrices as functions):</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/row_space_column_space_summary_table.png" alt="drawing" width="700" /></center>
<p><br /></p>
<p>Moreover, we can summarize the relationships between these spaces with the following figure:</p>
<p><br /></p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/row_space_column_space_summary_network.png" alt="drawing" width="600" /></center>
<h2 id="the-spaces-induced-by-invertible-matrices">The spaces induced by invertible matrices</h2>
<p>We conclude by discussing the vector spaces induced by <a href="https://mbernste.github.io/posts/inverse_matrices/">invertible matrices</a>. Recall that a square matrix $\boldsymbol{A} \in \mathbb{R}^{n \times n}$ is invertible if and only if its columns are linearly independent (see <a href="https://mbernste.github.io/posts/inverse_matrices/">Theorem 4 in the Appendix to my previous blog post</a>). This implies that for invertible matrices, it holds that:</p>
<ol>
<li>The column space is all of $\mathbb{R}^n$, since the $n$ columns are linearly independent. This implies that the column rank is $n$</li>
<li>The row space is all of $\mathbb{R}^n$, since by Theorem 2 the row rank equals the column rank</li>
<li>The nullity is zero, since by Theorem 3 the nullity plus the rank must equal the number of columns</li>
</ol>
<p>We call an invertible matrix <strong>full rank</strong> since the rank equals the number of rows and columns. The rank is “full” because it cannot be increased any further past the number of its columns/rows!</p>
<p>Moreover, we see that there is only <em>one</em> vector in the null space of an invertible matrix since its nullity is zero (a dimensionality of zero corresponds to a single point). If we think back on our discussion of invertible matrices as characterizing <a href="https://mbernste.github.io/posts/matrices_as_functions/">invertible functions</a>, then this fact makes sense. For a function to be invertible, it must be <a href="https://en.wikipedia.org/wiki/Injective_function">one-to-one</a> and <a href="https://en.wikipedia.org/wiki/Surjective_function">onto</a>. So if we use an invertible matrix $\boldsymbol{A}$ to define the function</p>
\[T(\boldsymbol{x}) := \boldsymbol{Ax}\]
<p>then it holds that every vector, $\boldsymbol{b}$, in the range of the function $T$ has exactly one vector, $\boldsymbol{x}$, in the domain of $T$ for which $T(\boldsymbol{x}) = \boldsymbol{b}$. This must also hold for the zero vector. Thus, there must be only one vector, $\boldsymbol{x}$, for which $\boldsymbol{Ax} = \boldsymbol{0}$. Hence, the null space comprises just a single vector.</p>
<p>Now we may ask, what vector is this single member of the null space? It turns out, it’s the zero vector! We see this by applying <a href="https://mbernste.github.io/posts/linear_independence/">Theorem 1 from this previous blog post</a>.</p>
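<p>These three properties are easy to check on a concrete case. The matrix below is a hypothetical invertible $3 \times 3$ matrix (its determinant is 3), chosen only for illustration.</p>

```python
import numpy as np

# Hypothetical invertible 3 x 3 matrix (its determinant is nonzero).
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])

rank = np.linalg.matrix_rank(A)
nullity = A.shape[1] - rank  # by the rank-nullity theorem

# The unique solution of Ax = 0 is the zero vector.
x = np.linalg.solve(A, np.zeros(3))
```

<p>The matrix is full rank, its nullity is zero, and the only vector it maps to zero is the zero vector itself.</p>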
<h2 id="appendix">Appendix</h2>
<p><span style="color:#0060C6"><strong>Theorem 1 (null space is orthogonal complement of row space):</strong> Given a matrix $\boldsymbol{A}$, the null space of $\boldsymbol{A}$ is the orthogonal complement to the row space of $\boldsymbol{A}$.</span></p>
<p><strong>Proof:</strong></p>
<p>To prove that the null space of $\boldsymbol{A}$ is the orthogonal complement of the row space, we must show that every vector in the null space is orthogonal to every vector in the row space. Consider vector $\boldsymbol{x}$ in the null space of $\boldsymbol{A}$. By the definition of the null space (Definition 3), this means that $\boldsymbol{Ax} = \boldsymbol{0}$. That is,</p>
\[\begin{align*}\boldsymbol{Ax} &= \begin{bmatrix} \boldsymbol{a}_{1,*} \cdot \boldsymbol{x} \\ \boldsymbol{a}_{2,*} \cdot \boldsymbol{x} \\ \vdots \\ \boldsymbol{a}_{m,*} \cdot \boldsymbol{x} \end{bmatrix} \\ &= \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \\ &= \boldsymbol{0} \end{align*}\]
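<p>For each row $i$, we have $\boldsymbol{a}_{i,*} \cdot \boldsymbol{x} = 0$, so every row vector of $\boldsymbol{A}$ is orthogonal to $\boldsymbol{x}$. Because every vector in the row space is a linear combination of these row vectors, and the dot product is linear, $\boldsymbol{x}$ is orthogonal to every vector in the row space. Conversely, any vector that is orthogonal to every row of $\boldsymbol{A}$ satisfies $\boldsymbol{Ax} = \boldsymbol{0}$ and therefore lies in the null space.</p>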
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 2 (row rank equals column rank):</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, the row rank equals the column rank</span></p>
<p><strong>Proof:</strong></p>
<p>This proof is described on <a href="https://en.wikipedia.org/wiki/Rank_(linear_algebra)">Wikipedia</a>, provided here in my own words.</p>
<p>Let $r$ be the row rank of $\boldsymbol{A}$ and let $\boldsymbol{b}_1, \dots, \boldsymbol{b}_r \in \mathbb{R}^n$ be a set of basis vectors for the row space of $\boldsymbol{A}$. Now, let $c_1, c_2, \dots, c_r$ be coefficients such that</p>
\[\sum_{i=1}^r c_i \boldsymbol{Ab}_i = \boldsymbol{0}\]
<p>Furthermore, let</p>
\[\boldsymbol{v} := \sum_{i=1}^r c_i\boldsymbol{b}_i\]
<p>We see that</p>
\[\begin{align*} \sum_{i=1}^r c_i \boldsymbol{Ab}_i &= \boldsymbol{0} \\ \implies \sum_{i=1}^r \boldsymbol{A}c_i \boldsymbol{b}_i &= \boldsymbol{0} \\ \implies \boldsymbol{A} \sum_{i=1}^r c_i\boldsymbol{b}_i &= \boldsymbol{0} \\ \boldsymbol{Av} &= \boldsymbol{0} \end{align*}\]
<p>With this in mind, we can prove that $\boldsymbol{v}$ must be the zero vector. To do so, we first note that $\boldsymbol{v}$ is in both the row space of $\boldsymbol{A}$ and the null space of $\boldsymbol{A}$. It is in the row space of $\boldsymbol{A}$ because it is a linear combination of the basis vectors of the row space of $\boldsymbol{A}$. It is in the null space of $\boldsymbol{A}$ because $\boldsymbol{Av} = \boldsymbol{0}$. From Theorem 1, $\boldsymbol{v}$ must be orthogonal to all vectors in the row space of $\boldsymbol{A}$, which includes itself. The only vector that is orthogonal to itself is the zero vector and thus, $\boldsymbol{v}$ must be the zero vector.</p>
<p>This in turn implies that $c_1, \dots, c_r$ must all be zero. We know this because $\boldsymbol{b}_1, \dots, \boldsymbol{b}_r \in \mathbb{R}^n$ are basis vectors and are therefore linearly independent: the only linear combination of them that equals the zero vector is the one whose coefficients are all zero. Thus we have proven that the only assignment of values for $c_1, \dots, c_r$ for which $\sum_{i=1}^r c_i \boldsymbol{Ab}_i = \boldsymbol{0}$ is the assignment for which they are all zero. By <a href="https://mbernste.github.io/posts/linear_independence/">Theorem 1 in a previous post</a>, this implies that $\boldsymbol{Ab}_1, \dots, \boldsymbol{Ab}_r$ must be linearly independent.</p>
<p>Moreover, by the definition of matrix-vector multiplication, we know that $\boldsymbol{Ab}_1, \dots, \boldsymbol{Ab}_r$ are in the column space of $\boldsymbol{A}$. Thus, we have proven that there exist <em>at least</em> $r$ linearly independent vectors in the column space of $\boldsymbol{A}$. This means that the column rank of $\boldsymbol{A}$ is <em>at least</em> $r$. That is,</p>
\[\text{column rank of} \ \boldsymbol{A} \geq \text{row rank of} \ \boldsymbol{A}\]
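<p>We can repeat this exercise on the transpose of $\boldsymbol{A}$, whose column rank is the row rank of $\boldsymbol{A}$ and whose row rank is the column rank of $\boldsymbol{A}$. This tells us that</p>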
\[\text{row rank of} \ \boldsymbol{A} \geq \text{column rank of} \ \boldsymbol{A}\]
<p>These statements together imply that the column rank and row rank of $\boldsymbol{A}$ are equal.</p>
<p>$\square$</p>
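<p>The central step of this proof, that mapping a basis of the row space through $\boldsymbol{A}$ yields linearly independent vectors in the column space, can be sanity-checked on this post's example matrix. This is a minimal sketch: it assumes the two rows of the example matrix serve as the basis $\boldsymbol{b}_1, \boldsymbol{b}_2$, which is valid here because those rows are linearly independent.</p>

```python
import numpy as np

A = np.array([[1.0, 2.0, 1.0],
              [0.0, 1.0, -1.0]])

# The two rows of A are linearly independent, so they form a basis of the row space.
row_basis = A.copy()  # b_1 and b_2 stored as rows
r = row_basis.shape[0]

# Map each basis vector through A: the columns of `images` are A b_1 and A b_2.
images = A @ row_basis.T

# If the images are linearly independent, the column rank is at least r.
images_rank = np.linalg.matrix_rank(images)
```

<p>Here <code>images</code> is a $2 \times 2$ matrix of rank 2, so the column space contains at least $r = 2$ linearly independent vectors, as the proof requires.</p>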
<p><span style="color:#0060C6"><strong>Theorem 3 (rank-nullity theorem):</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, it holds that $\text{rank} + \text{nullity} = n$.</span></p>
<p><strong>Proof:</strong></p>
<p>This proof is described on <a href="https://en.wikipedia.org/wiki/Rank%E2%80%93nullity_theorem">Wikipedia</a>, provided here in my own words along with supplemental schematics of the matrices used in the proof.</p>
<p>Let $r$ be the rank of the matrix. This means that there are $r$ linearly independent column vectors in $\boldsymbol{A}$. Without loss of generality, we can arrange $\boldsymbol{A}$ so that the first $r$ columns are linearly independent, and the remaining $n - r$ columns can be written as a linear combination of the first $r$ columns. That is, we can write:</p>
\[\boldsymbol{A} = \begin{pmatrix} \boldsymbol{A}_1 & \boldsymbol{A}_2 \end{pmatrix}\]
<p>where $\boldsymbol{A}_1$ and $\boldsymbol{A}_2$ are the two partitions of the matrix as shown below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/rank_nullity_theorem_partition_A.png" alt="drawing" width="500" /></center>
<p><br /></p>
<p>Because the columns of $\boldsymbol{A}_2$ are linear combinations of the columns of $\boldsymbol{A}_1$, there exists a matrix $\boldsymbol{B} \in \mathbb{R}^{r \times (n-r)}$ for which</p>
\[\boldsymbol{A}_2 = \boldsymbol{A}_1 \boldsymbol{B}\]
<p>This is depicted below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/rank_nullity_theorem_A2_as_A1B.png" alt="drawing" width="500" /></center>
<p><br /></p>
<p>Now, consider a matrix</p>
\[\boldsymbol{X} := \begin{pmatrix} -\boldsymbol{B} \\ \boldsymbol{I}_{n-r} \end{pmatrix}\]
<p>That is, $\boldsymbol{X}$ is formed by concatenating the $(n-r) \times (n-r)$ identity matrix below the $-\boldsymbol{B}$ matrix. Now, we see that $\boldsymbol{AX} = \boldsymbol{0}$:</p>
\[\begin{align*}\boldsymbol{AX} &= \begin{pmatrix} \boldsymbol{A}_1 & \boldsymbol{A}_1\boldsymbol{B} \end{pmatrix} \begin{pmatrix} -\boldsymbol{B} \\ \boldsymbol{I}_{n-r} \end{pmatrix} \\ &= -\boldsymbol{A}_1\boldsymbol{B} + \boldsymbol{A}_1\boldsymbol{B} \\ &= \boldsymbol{0} \end{align*}\]
<p>Depicted schematically,</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/rank_nullity_theorem_AX.png" alt="drawing" width="500" /></center>
<p><br /></p>
<p>Thus, we see that every column of $\boldsymbol{X}$ is in the null space of $\boldsymbol{A}$.</p>
<p>We now show that these column vectors are linearly independent. To do so, we will consider a vector $\boldsymbol{u} \in \mathbb{R}^{n-r}$ such that</p>
\[\boldsymbol{Xu} = \boldsymbol{0}\]
<p>For this to hold, we see that $\boldsymbol{u}$ must be zero:</p>
\[\begin{align*}\boldsymbol{Xu} &= \boldsymbol{0} \\ \implies \begin{pmatrix} -\boldsymbol{B} \\ \boldsymbol{I}_{n-r} \end{pmatrix}\boldsymbol{u} &= \begin{pmatrix} \boldsymbol{0}_r \\ \boldsymbol{0}_{n-r} \end{pmatrix} \\ \\ \implies \begin{pmatrix} -\boldsymbol{Bu} \\ \boldsymbol{u} \end{pmatrix} &= \begin{pmatrix} \boldsymbol{0}_r \\ \boldsymbol{0}_{n-r} \end{pmatrix} \end{align*}\]
<p>The bottom block of this equation states that $\boldsymbol{u} = \boldsymbol{0}_{n-r}$, so the only solution is the zero vector. By <a href="https://mbernste.github.io/posts/linear_independence/">Theorem 1 in a previous post</a>, this proves that the columns of $\boldsymbol{X}$ are linearly independent. So we have shown that there exist $n-r$ linearly independent vectors in the null space of $\boldsymbol{A}$, which means the nullity is <em>at least</em> $n-r$.</p>
<p>We now show that <em>any</em> other vector in the null space of $\boldsymbol{A}$ that is not a column of $\boldsymbol{X}$ can be written as a linear combination of the columns of $\boldsymbol{X}$. If we can prove this fact, we will have proven that the nullity is exactly equal to $n-r$ and is not greater.</p>
<p>We start by again considering a vector $\boldsymbol{u} \in \mathbb{R}^n$ that we assume is in the null space of $\boldsymbol{A}$. We partition this vector into two segments: one segment, $\boldsymbol{u}_1$, comprising the first $r$ elements and a second segment, $\boldsymbol{u}_2$, comprising the remaining $n-r$ elements:</p>
\[\boldsymbol{u} = \begin{pmatrix}\boldsymbol{u}_1 \\ \boldsymbol{u}_2 \end{pmatrix}\]
<p>Because we assume that $\boldsymbol{u}$ is in the null space, it must hold that $\boldsymbol{Au} = \boldsymbol{0}$. Depicted schematically:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/rank_nullity_theorem_Au.png" alt="drawing" width="500" /></center>
<p><br /></p>
<p>Solving for $\boldsymbol{u}$, we see that</p>
\[\begin{align*} \boldsymbol{Au} &= \boldsymbol{0} \\ \begin{pmatrix} \boldsymbol{A}_1 & \boldsymbol{A}_2 \end{pmatrix} \begin{pmatrix}\boldsymbol{u}_1 \\ \boldsymbol{u}_2 \end{pmatrix} &= \boldsymbol{0} \\ \begin{pmatrix} \boldsymbol{A}_1 & \boldsymbol{A}_1\boldsymbol{B} \end{pmatrix} \begin{pmatrix}\boldsymbol{u}_1 \\ \boldsymbol{u}_2 \end{pmatrix} &= \boldsymbol{0} \\ \implies \boldsymbol{A}_1\boldsymbol{u}_1 + \boldsymbol{A}_1\boldsymbol{B}\boldsymbol{u}_2 &= \boldsymbol{0} \\ \implies \boldsymbol{A}_1 (\boldsymbol{u}_1 + \boldsymbol{Bu}_2) &= \boldsymbol{0} \\ \implies \boldsymbol{u}_1 + \boldsymbol{Bu}_2 &= \boldsymbol{0} \\ \implies \boldsymbol{u}_1 &= -\boldsymbol{Bu}_2 \end{align*}\]
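<p>Here, the step from $\boldsymbol{A}_1 (\boldsymbol{u}_1 + \boldsymbol{Bu}_2) = \boldsymbol{0}$ to $\boldsymbol{u}_1 + \boldsymbol{Bu}_2 = \boldsymbol{0}$ holds because the columns of $\boldsymbol{A}_1$ are linearly independent, so the only vector that $\boldsymbol{A}_1$ maps to the zero vector is the zero vector. Thus,</p>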
\[\begin{align*}\boldsymbol{u} &= \begin{pmatrix}\boldsymbol{u}_1 \\ \boldsymbol{u}_2 \end{pmatrix} \\ &= \begin{pmatrix} -\boldsymbol{Bu}_2 \\ \boldsymbol{u}_2 \end{pmatrix} \\ &= \begin{pmatrix} -\boldsymbol{B} \\ \boldsymbol{I}_{n-r} \end{pmatrix}\boldsymbol{u}_2 \\ &= \boldsymbol{X}\boldsymbol{u}_2 \end{align*}\]
<p>Thus, we see that $\boldsymbol{u}$ must be a linear combination of the columns of $\boldsymbol{X}$! We have therefore shown that:</p>
<ol>
<li>There exist $n-r$ linearly independent vectors in the null space of $\boldsymbol{A}$</li>
<li>Any vector in the null space can be expressed as a linear combination of these linearly independent vectors</li>
</ol>
<p>This proves that the nullity is $n-r$, and thus, the nullity $n-r$ plus the rank $r$, equals $n$.</p>
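<p>As a numerical sanity check of the construction used in this proof, the sketch below builds a small matrix of the form $\begin{pmatrix} \boldsymbol{A}_1 & \boldsymbol{A}_1\boldsymbol{B} \end{pmatrix}$ and the corresponding $\boldsymbol{X}$. The particular values of <code>A1</code> and <code>B</code> (with $m = 3$, $n = 5$, $r = 2$) are hypothetical, and the null-space vectors are obtained independently via the SVD so that the final comparison is not circular.</p>

```python
import numpy as np

# Hypothetical example with m = 3, n = 5, r = 2.
A1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])            # r linearly independent columns
B = np.array([[2.0, 1.0, 0.0],
              [1.0, -1.0, 3.0]])       # shape r x (n - r)
A = np.hstack([A1, A1 @ B])            # A = (A_1  A_1 B)

r, n = A1.shape[1], A.shape[1]
X = np.vstack([-B, np.eye(n - r)])     # X = (-B stacked on top of I_{n-r})

# Every column of X lies in the null space of A, and the columns are independent.
AX = A @ X
X_rank = np.linalg.matrix_rank(X)

# Null-space vectors found independently (rows of Vt beyond the rank).
_, s, Vt = np.linalg.svd(A)
null_vectors = Vt[int(np.sum(s > 1e-10)):]

# Each null-space vector u should equal X u_2, where u_2 is its last n - r entries.
rebuilt = (X @ null_vectors[:, r:].T).T
```

<p>Under these assumptions, <code>AX</code> is the zero matrix, <code>X</code> has rank $n - r = 3$, and every independently computed null-space vector is recovered as $\boldsymbol{X}\boldsymbol{u}_2$.</p>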
<p>$\square$</p>Matthew N. BernsteinVariational autoencoders2023-03-14T00:00:00-07:002023-03-14T00:00:00-07:00https://mbernste.github.io/posts/vae<p><em>Variational autoencoders (VAEs) are a family of deep generative models with use cases that span many applications, from image processing to bioinformatics. There are two complementary ways of viewing the VAE: as a probabilistic model that is fit using variational Bayesian inference, or as a type of autoencoding neural network. In this post, we present the mathematical theory behind VAEs, which is rooted in Bayesian inference, and how this theory leads to an emergent autoencoding algorithm. We also discuss the similarities and differences between VAEs and standard autoencoders. Lastly, we present an implementation of a VAE in PyTorch and apply it to the task of modeling the MNIST dataset of hand-written digits.</em></p>
<h2 id="introduction">Introduction</h2>
<p>Variational autoencoders (VAEs), introduced by <a href="https://arxiv.org/abs/1312.6114_">Kingma and Welling (2013)</a>, are a class of probabilistic models that find latent, low-dimensional representations of data. VAEs are thus a method for performing <a href="https://en.wikipedia.org/wiki/Dimensionality_reduction">dimensionality reduction</a> to reduce data down to their <a href="https://mbernste.github.io/posts/intrinsic_dimensionality/">intrinsic dimensionality</a>.</p>
<p>As their name suggests, VAEs are a type of <strong>autoencoder</strong>. An autoencoder is a model that takes a vector, $\boldsymbol{x}$, compresses it into a lower-dimensional vector, $\boldsymbol{z}$, and then decompresses $\boldsymbol{z}$ back into an approximation of $\boldsymbol{x}$. The architecture of an autoencoder can be visualized as follows:</p>
<p> </p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/autoencoder.png" alt="drawing" width="350" /></center>
<p> </p>
<p>Here we see that one function (usually a neural network), $h_\phi$, compresses $\boldsymbol{x}$ into a low-dimensional data point, $\boldsymbol{z}$, and then another function (also a neural network), $f_\theta$, decompresses it back into an approximation of $\boldsymbol{x}$, here denoted as $\boldsymbol{x}'$. The variables $\phi$ and $\theta$ denote the parameters of the two neural networks.</p>
<p>VAEs can be understood as a type of autoencoder like the one shown above, but with some important differences: unlike standard autoencoders, VAEs are probabilistic models and, as we will see in this post, their “autoencoding” ability emerges from how the probabilistic model is defined and fit.</p>
<p>In summary, it helps to view VAEs <a href="https://mbernste.github.io/posts/understanding_3d/">from two angles</a>:</p>
<ol>
<li><strong>Probabilistic generative model:</strong> VAEs are probabilistic generative models of independent, identically distributed samples, $\boldsymbol{x}_1, \dots, \boldsymbol{x}_n$. In this model, each sample, $\boldsymbol{x}_i$, is associated with a latent (i.e. unobserved), lower-dimensional variable $\boldsymbol{z}_i$. Variational autoencoders are a generative model in that they describe a joint distribution over samples and their associated latent variable, $p(\boldsymbol{x}, \boldsymbol{z})$.</li>
<li><strong>Autoencoder:</strong> VAEs are a form of autoencoder. Unlike traditional autoencoders, VAEs can be viewed as <em>probabilistic</em> rather than deterministic; given an input sample, $\boldsymbol{x}_i$, the compressed representation of $\boldsymbol{x}_i$, $\boldsymbol{z}_i$, is randomly generated.</li>
</ol>
<p>In this blog post we will show how VAEs can be viewed through both of these lenses. We will then provide an example implementation of a VAE and apply it to the MNIST dataset of hand-written digits.</p>
<h2 id="vaes-as-probabilistic-generative-models">VAEs as probabilistic generative models</h2>
<p>At its foundation, a VAE defines a probabilistic generative process for “generating” data points that reside in some $D$-dimensional vector space. This generative process goes as follows: we first sample a latent variable $\boldsymbol{z} \in \mathbb{R}^{J}$ where $J < D$ from some distribution such as a standard normal distribution:</p>
\[\boldsymbol{z} \sim N(\boldsymbol{0}, \boldsymbol{I})\]
<p>Then, we use a deterministic function to map $\boldsymbol{z}$ to the parameters, $\boldsymbol{\psi}$, of another distribution used to sample $\boldsymbol{x} \in \mathbb{R}^D$. Most commonly, we construct $\boldsymbol{\psi}$ from $\boldsymbol{z}$ using neural networks:</p>
\[\begin{align*} \boldsymbol{\psi} &:= f_{\theta}(\boldsymbol{z}) \\ \boldsymbol{x} &\sim \mathcal{D}(\boldsymbol{\psi}) \end{align*}\]
<p>where $\mathcal{D}$ is a parametric distribution and $f$ is a neural network parameterized by a set of parameters $\theta$. Here’s a schematic illustration of the generative process:</p>
<p> </p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/VAE_generative_process_shapes.png" alt="drawing" width="700" /></center>
<p> </p>
<p>This generative process can be visualized graphically below:</p>
<p> </p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/VAE_generative_process.png" alt="drawing" width="700" /></center>
<p> </p>
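<p>The generative process above can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not part of the post: the dimensions, the tiny one-hidden-layer network standing in for $f_\theta$, its random (untrained) weights, and the choice of a Gaussian for $\mathcal{D}$ with a fixed noise scale <code>sigma_decoder</code> are all hypothetical.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: data dimension D, latent dimension J (J < D), hidden width H.
D, J, H = 5, 2, 16
sigma_decoder = 0.1  # assumed fixed noise scale of the Gaussian likelihood

# Randomly initialized weights stand in for the decoder parameters theta.
theta = {
    "W1": rng.normal(size=(J, H)), "b1": np.zeros(H),
    "W2": rng.normal(size=(H, D)), "b2": np.zeros(D),
}

def f_theta(z, theta):
    """Map latent vectors z of shape (n, J) to parameters psi of shape (n, D)."""
    hidden = np.tanh(z @ theta["W1"] + theta["b1"])
    return hidden @ theta["W2"] + theta["b2"]

def sample_from_model(n, theta):
    """Draw n samples (z, x) from the generative process."""
    z = rng.standard_normal(size=(n, J))   # z ~ N(0, I)
    psi = f_theta(z, theta)                # psi := f_theta(z)
    # x ~ N(psi, sigma_decoder^2 I): here psi is the mean of a Gaussian.
    x = psi + sigma_decoder * rng.standard_normal(size=psi.shape)
    return z, x

z, x = sample_from_model(1000, theta)
print(z.shape, x.shape)
```

<p>Even though both sampling steps are Gaussian, the non-linearity inside <code>f_theta</code> means the samples <code>x</code> need not look Gaussian when pooled together.</p>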
<p>Interestingly, this model enables us to fit very complicated distributions. That’s because although the distribution of $\boldsymbol{z}$ and the conditional distribution of $\boldsymbol{x}$ given $\boldsymbol{z}$ may both be simple (e.g., both normal distributions), the non-linear mapping between $\boldsymbol{z}$ and $\psi$ via the neural network leads to the marginal distribution of $\boldsymbol{x}$ becoming complex:</p>
<p> </p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/VAE_marginal.png" alt="drawing" width="350" /></center>
<p> </p>
<h2 id="using-variational-inference-to-fit-the-model">Using variational inference to fit the model</h2>
<p>Now, let’s say we are given a dataset consisting of data points $\boldsymbol{x}_1, \dots, \boldsymbol{x}_n \in \mathbb{R}^D$ that were generated by a VAE. We may be interested in two tasks:</p>
<ol>
<li>For fixed $\theta$, for each $\boldsymbol{x}_i$, compute the posterior distribution $p_{\theta}(\boldsymbol{z}_i \mid \boldsymbol{x}_i)$</li>
<li>Find the maximum likelihood estimates of $\theta$</li>
</ol>
<p>Unfortunately, for a fixed $\theta$, solving for the posterior $p_{\theta}(\boldsymbol{z}_i \mid \boldsymbol{x}_i)$ using Bayes’ theorem is intractable because the denominator in Bayes’ theorem requires marginalizing over $\boldsymbol{z}_i$:</p>
\[p_\theta(\boldsymbol{z}_i \mid \boldsymbol{x}_i) = \frac{p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i)p(\boldsymbol{z}_i)}{\int p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i)p(\boldsymbol{z}_i) \ d\boldsymbol{z}_i }\]
<p>This marginalization requires solving an integral over all of the dimensions of the latent space! This is not feasible to calculate. Estimating $\theta$ via maximum likelihood estimation also requires solving this integral:</p>
\[\begin{align*}\hat{\theta} &:= \text{argmax}_\theta \prod_{i=1}^n p_\theta(\boldsymbol{x}_i) \\ &= \text{argmax}_\theta \prod_{i=1}^n \int p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i)p(\boldsymbol{z}_i) \ d\boldsymbol{z}_i \end{align*}\]
<p>Variational autoencoders find approximate solutions to both of these intractable inference problems using <a href="https://mbernste.github.io/posts/variational_inference/">variational inference</a>. First, let’s assume that $\theta$ is fixed and attempt to approximate $p_\theta(\boldsymbol{z}_i \mid \boldsymbol{x}_i)$. Variational inference is a method for performing such approximations by first choosing a set of probability distributions, $\mathcal{Q}$, called the <em>variational family</em>, and then finding the distribution $q(\boldsymbol{z}_i) \in \mathcal{Q}$ that is “closest to” $p_\theta(\boldsymbol{z}_i \mid \boldsymbol{x}_i)$.</p>
<p>Variational inference uses the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL-divergence</a> between $q(\boldsymbol{z}_i)$ and $p_\theta(\boldsymbol{z}_i \mid \boldsymbol{x}_i)$ as its measure of “closeness”. Thus, the goal of variational inference is to minimize the KL-divergence. It turns out that the task of minimizing the KL-divergence is equivalent to the task of maximizing a quantity called the <a href="https://mbernste.github.io/posts/elbo/">evidence lower bound (ELBO)</a>, which is defined as</p>
\[\begin{align*} \text{ELBO}(q) &:= E_{\boldsymbol{z}_1, \dots, \boldsymbol{z}_n \overset{\text{i.i.d.}}{\sim} q}\left[ \sum_{i=1}^n \log p_\theta(\boldsymbol{x}_i, \boldsymbol{z}_i) - \sum_{i=1}^n \log q(\boldsymbol{z}_i) \right] \\ &= \sum_{i=1}^n E_{z_i \sim q} \left[\log p_\theta(\boldsymbol{x}_i, \boldsymbol{z}_i) - \log q(\boldsymbol{z}_i) \right] \end{align*}\]
<p>Thus, variational inference entails finding</p>
\[\hat{q} := \text{arg max}_{q \in \mathcal{Q}} \ \text{ELBO}(q)\]
<p>Now, so far we have assumed that $\theta$ is fixed. Is it possible to find both $q$ and $\theta$ jointly? As we discuss in a <a href="https://mbernste.github.io/posts/reparameterization_vi/">previous post on variational inference</a>, it is perfectly reasonable to define the ELBO as a function of <em>both</em> $q$ and $\theta$ and then to maximize the ELBO jointly with respect to both of these parameters:</p>
\[\hat{q}, \hat{\theta} := \text{arg max}_{q, \theta} \ \text{ELBO}(q, \theta)\]
<p>Why is this a reasonable thing to do? Recall the ELBO is a <em>lower bound</em> on the marginal log-likelihood $\log p_\theta(\boldsymbol{x}_1, \dots, \boldsymbol{x}_n)$. Thus, increasing the ELBO with respect to $\theta$ raises a lower bound on the log-likelihood.</p>
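<p>To make the lower-bound property concrete, here is a small worked example on a toy model that is <em>not</em> the VAE from this post: a one-dimensional model with $z \sim N(0, 1)$ and $x \mid z \sim N(z, 1)$, chosen only because everything is available in closed form. In this toy model the marginal is $x \sim N(0, 2)$, the exact posterior is $N(x/2, 1/2)$, and the ELBO for a Gaussian $q(z) = N(m, s^2)$ can be written down analytically. The function names are hypothetical.</p>

```python
import math

def log_marginal(x):
    """log p(x) for the toy model z ~ N(0, 1), x | z ~ N(z, 1), so x ~ N(0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x**2 / 4.0

def elbo(x, m, s):
    """Closed-form ELBO for q(z) = N(m, s^2) in the toy model."""
    # E_q[log p(x | z)]
    e_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s**2)
    # E_q[log p(z)]
    e_log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (m**2 + s**2)
    # -E_q[log q(z)], i.e. the entropy of q
    entropy = 0.5 * math.log(2 * math.pi * math.e * s**2)
    return e_log_lik + e_log_prior + entropy

x = 1.3
print("log p(x):              ", log_marginal(x))
print("ELBO, arbitrary q:     ", elbo(x, m=0.0, s=1.0))
print("ELBO, q = posterior:   ", elbo(x, m=x / 2, s=math.sqrt(0.5)))
```

<p>The ELBO for an arbitrary $q$ sits below $\log p(x)$, and the gap closes when $q$ equals the exact posterior, which is the sense in which maximizing the ELBO is equivalent to minimizing the KL-divergence.</p>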
<h2 id="variational-family-used-by-vaes">Variational family used by VAEs</h2>
<p>VAEs use a variational family with the following form:</p>
\[\mathcal{Q} := \left\{ N(h^{(1)}_\phi(\boldsymbol{x}), \text{diag}(\exp(h^{(2)}_\phi(\boldsymbol{x})))) \mid \phi \in \mathbb{R}^R \right\}\]
<p>where $h^{(1)}_\phi$ and $h^{(2)}_\phi$ are two neural networks that map the original object, $\boldsymbol{x}$, to the mean, $\boldsymbol{\mu}$, and the logarithm of the variance, $\log \boldsymbol{\sigma}^2$, of the approximate posterior distribution. $R$ is the number of parameters to these neural networks.</p>
<p>Said a different way, we define $q_\phi(\boldsymbol{z} \mid \boldsymbol{x})$ as</p>
\[q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) := N(h^{(1)}_\phi(\boldsymbol{x}), \text{diag}(\exp(h^{(2)}_\phi(\boldsymbol{x}))))\]
<p>Said a third way, the approximate posterior distribution can be sampled via the following process:</p>
\[\begin{align*}\boldsymbol{\mu} &:= h^{(1)}_\phi(\boldsymbol{x}) \\ \log \boldsymbol{\sigma}^2 &:= h^{(2)}_\phi(\boldsymbol{x}) \\ \boldsymbol{z} &\sim N(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma^2})) \end{align*}\]
<p>This can be visualized as follows:</p>
<p> </p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/VAE_encoder.png" alt="drawing" width="700" /></center>
<p> </p>
<p>Note that $h^{(1)}_\phi$ and $h^{(2)}_\phi$ may either be two entirely separate neural networks or may share some subset of parameters. We use $h_\phi$ to refer to the full neural network (or union of two separate neural networks) comprising both $h^{(1)}_\phi$ and $h^{(2)}_\phi$ as shown below:</p>
<p> </p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/VAE_encoder_architecture.png" alt="drawing" width="500" /></center>
<p> </p>
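<p>As a rough sketch of the encoder described above, the following NumPy code implements $h_\phi$ as a shared hidden layer followed by two linear heads, one for the mean and one for the log-variance. The layer sizes, the ReLU activation, the random untrained weights, and the parameter names are assumptions made for illustration only.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: data dimension D, hidden width H, latent dimension J.
D, H, J = 6, 8, 2

# phi collects every encoder parameter; the two heads share the first layer.
phi = {
    "W": rng.normal(scale=0.1, size=(D, H)), "b": np.zeros(H),        # shared layer
    "W_mu": rng.normal(scale=0.1, size=(H, J)), "b_mu": np.zeros(J),  # head h^(1)
    "W_lv": rng.normal(scale=0.1, size=(H, J)), "b_lv": np.zeros(J),  # head h^(2)
}

def h_phi(x, phi):
    """Return (mu, log_var) of the approximate posterior for each row of x."""
    hidden = np.maximum(0.0, x @ phi["W"] + phi["b"])  # ReLU
    mu = hidden @ phi["W_mu"] + phi["b_mu"]            # h^(1)_phi(x)
    log_var = hidden @ phi["W_lv"] + phi["b_lv"]       # h^(2)_phi(x)
    return mu, log_var

x = rng.normal(size=(4, D))
mu, log_var = h_phi(x, phi)
sigma2 = np.exp(log_var)  # diagonal of the covariance matrix
print(mu.shape, log_var.shape)
```

<p>Having the network output $\log \boldsymbol{\sigma}^2$ rather than $\boldsymbol{\sigma}^2$ lets the head be an unconstrained linear layer while the variance recovered through <code>np.exp</code> is always positive.</p>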
<p>Thus, maximizing the ELBO over $\mathcal{Q}$ reduces to maximizing the ELBO over the neural network parameters $\phi$ (in addition to $\theta$ as discussed previously):</p>
\[\begin{align*}\hat{\phi}, \hat{\theta} &= \text{arg max}_{\phi, \theta} \ \text{ELBO}(\phi, \theta) \\ &:= \text{arg max}_{\phi, \theta} \ \sum_{i=1}^n E_{\boldsymbol{z}_i \sim q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i)}\left[ \log p_\theta(\boldsymbol{x}_i, \boldsymbol{z}_i) - \log q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i) \right] \end{align*}\]
<p>One detail to point out here is that the approximation of the posterior over each $\boldsymbol{z}_i$ is defined by a set of parameters $\phi$ that are shared across all samples $\boldsymbol{z}_1, \dots, \boldsymbol{z}_n$. That is, we use a single set of neural network parameters $\phi$ to encode every posterior distribution $q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i)$. Note, we <em>could</em> have gone a different route and defined a <em>separate</em> variational distribution $q_i$ for each $\boldsymbol{z}_i$ that is not conditioned on $\boldsymbol{x}_i$; that is, we could define the variational posterior as $q_{\phi_i}(\boldsymbol{z}_i)$, where each $\boldsymbol{z}_i$ has its own set of parameters $\phi_i$. Why don’t we do this instead? The answer is that for extremely large datasets it’s easier to perform VI when $\phi$ is shared across all data points because the number of parameters we need to search over in our optimization no longer grows with the number of data points. This act of defining a common set of parameters shared across all of the independent posteriors is called <strong>amortized variational inference</strong>.</p>
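<p>A quick back-of-the-envelope count illustrates the point about dataset size. The numbers below are assumptions chosen for illustration: a dataset the size of the MNIST training set (60,000 images of 784 pixels), a 10-dimensional latent space, and a hypothetical encoder with a single shared hidden layer of 400 units feeding the mean and log-variance heads.</p>

```python
# Hypothetical sizes for illustration.
n, J = 60_000, 10              # number of data points, latent dimension
x_dim, hidden_dim = 784, 400   # input dimension, assumed hidden width

# Non-amortized VI: each data point gets its own mean and log-variance vector.
per_datapoint_params = n * 2 * J

# Amortized VI: one shared layer (weights + biases) plus two linear heads.
encoder_params = (x_dim * hidden_dim + hidden_dim) + 2 * (hidden_dim * J + J)

print(per_datapoint_params)  # 1200000, grows linearly with n
print(encoder_params)        # 322020, independent of n
```

<p>For small datasets the per-data-point parameterization can actually be the smaller of the two; the advantage of amortization is that its parameter count stays fixed as $n$ grows, and the same encoder can be applied to new data points.</p>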
<h2 id="maximizing-the-elbo">Maximizing the ELBO</h2>
<p>Now that we’ve set up the optimization problem, we need to solve it. Unfortunately, the expectation present in the ELBO makes this difficult as it requires integrating over all possible values for $\boldsymbol{z}_i$:</p>
\[\begin{align*}\text{ELBO}(\phi, \theta) &= \sum_{i=1}^n E_{\boldsymbol{z}_i \sim q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i)}\left[ \log p_\theta(\boldsymbol{x}_i, \boldsymbol{z}_i) - \log q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i) \right] \\ &= \sum_{i=1}^n \int_{\boldsymbol{z}_i} q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i) \left[ \log p_\theta(\boldsymbol{x}_i, \boldsymbol{z}_i) - \log q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i) \right] \ d\boldsymbol{z}_i \end{align*}\]
<p>We address this challenge by using the <strong>reparameterization gradient</strong> method, which we discussed in a <a href="https://mbernste.github.io/posts/reparameterization_vi/">previous blog post</a>. We will review this method here; however, see my previous post for a more detailed explanation.</p>
<p>In brief, the reparameterization method maximizes the ELBO via stochastic gradient ascent in which stochastic gradients are formulated by first performing the <strong>reparameterization trick</strong> followed by Monte Carlo sampling. The reparameterization trick works as follows: we “reparameterize” the distribution $q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i)$ in terms of a surrogate random variable $\boldsymbol{\epsilon}_i \sim \mathcal{D}$ and a deterministic function $g_\phi$ in such a way that sampling $\boldsymbol{z}_i$ from $q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i)$ is performed as follows (note that in this section $\mathcal{D}$ denotes the surrogate distribution rather than the likelihood distribution from the generative process):</p>
\[\begin{align*}\boldsymbol{\epsilon}_i &\sim \mathcal{D} \\ z_i &:= g_\phi(\boldsymbol{\epsilon}_i, x_i)\end{align*}\]
<p>One way to think about this is that instead of sampling $\boldsymbol{z}_i$ directly from our variational posterior $q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i)$, we “re-design” the generative process of $\boldsymbol{z}_i$ such that we first sample a surrogate random variable $\boldsymbol{\epsilon}_i$ and then transform $\boldsymbol{\epsilon}_i$ into $\boldsymbol{z}_i$ all while ensuring that in the end, the distribution of $\boldsymbol{z}_i$ still follows $q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i)$. Following the reparameterization trick, we can re-write the ELBO as follows:</p>
\[\text{ELBO}(\phi, \theta) := \sum_{i=1}^n E_{\boldsymbol{\epsilon}_i \sim \mathcal{D}}\left[ \log p_\theta(\boldsymbol{x}_i, g_\phi(\boldsymbol{\epsilon}_i, \boldsymbol{x}_i)) - \log q_\phi(g_\phi(\boldsymbol{\epsilon}_i, \boldsymbol{x}_i) \mid \boldsymbol{x}_i) \right]\]
<p>We then approximate the ELBO via Monte Carlo sampling. That is, for each sample, $i$, we first sample random variables from our surrogate distribution $\mathcal{D}$:</p>
\[\boldsymbol{\epsilon}'_{i,1}, \dots, \boldsymbol{\epsilon}'_{i,L} \sim \mathcal{D}\]
<p>Then we can compute a Monte Carlo approximation to the ELBO:</p>
\[\tilde{\text{ELBO}}(\phi, \theta) := \frac{1}{n} \sum_{i=1}^n \frac{1}{L} \sum_{l=1}^L \left[ \log p_\theta(\boldsymbol{x}_i, g_\phi(\boldsymbol{\epsilon}'_{i,l}, \boldsymbol{x}_i)) - \log q_\phi(g_\phi(\boldsymbol{\epsilon}'_{i,l}, \boldsymbol{x}_i) \mid \boldsymbol{x}_i) \right]\]
<p>Now the question becomes, what reparameterization can we use? Recall that for the VAEs discussed here $q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i)$ is a normal distribution:</p>
\[q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i) := N(h^{(1)}_\phi(\boldsymbol{x}_i), \text{diag}(\exp(h^{(2)}_\phi(\boldsymbol{x}_i))))\]
<p>This naturally can be reparameterized as:</p>
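\[\begin{align*}\boldsymbol{\epsilon}_i &\sim N(\boldsymbol{0}, \boldsymbol{I}) \\ \boldsymbol{z}_i &:= h^{(1)}_\phi(\boldsymbol{x}_i) + \exp\left(\tfrac{1}{2} h^{(2)}_\phi(\boldsymbol{x}_i)\right) \odot \boldsymbol{\epsilon}_i \end{align*}\]
<p>Thus, our function $g_\phi$ simply scales $\boldsymbol{\epsilon}_i$ element-wise by the standard deviation $\exp\left(\tfrac{1}{2} h^{(2)}_\phi(\boldsymbol{x}_i)\right)$ (recall that $h^{(2)}_\phi$ outputs the <em>log-variance</em>, so halving it before exponentiating gives the standard deviation) and shifts it by the mean $h^{(1)}_\phi(\boldsymbol{x}_i)$. That is,</p>
\[g_\phi(\boldsymbol{\epsilon}_i, \boldsymbol{x}_i) := h^{(1)}_\phi(\boldsymbol{x}_i) + \exp\left(\tfrac{1}{2} h^{(2)}_\phi(\boldsymbol{x}_i)\right) \odot \boldsymbol{\epsilon}_i\]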
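<p>The following NumPy sketch checks the Gaussian reparameterization empirically. Because the encoder outputs the log-variance, the standard deviation used to scale the noise is $\exp(\tfrac{1}{2}\log \boldsymbol{\sigma}^2)$. The particular values of <code>mu</code> and <code>log_var</code> are made up and stand in for the encoder outputs for a single data point.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def g(eps, mu, log_var):
    """Reparameterized sample: shift by the mean, scale by the standard deviation."""
    return mu + np.exp(0.5 * log_var) * eps

# Made-up encoder outputs for one data point (J = 2).
mu = np.array([1.0, -2.0])
log_var = np.array([0.0, np.log(0.25)])  # variances 1.0 and 0.25

# Sample the surrogate noise, then transform it deterministically.
eps = rng.standard_normal(size=(200_000, 2))
z = g(eps, mu, log_var)

print("empirical mean:    ", z.mean(axis=0))
print("empirical variance:", z.var(axis=0))
```

<p>The empirical mean and variance of <code>z</code> should be close to <code>mu</code> and <code>np.exp(log_var)</code>, confirming that the transformed noise follows the intended normal distribution while all of the randomness lives in <code>eps</code>.</p>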
<p>Because $\tilde{\text{ELBO}}(\phi, \theta)$ is differentiable with respect to both $\phi$ and $\theta$ (notice that $f_\theta$ and $h_\phi$ are neural networks, which are differentiable), we can form the gradient:</p>
\[\nabla_{\phi, \theta} \tilde{\text{ELBO}}(\phi, \theta) = \frac{1}{n} \sum_{i=1}^n \frac{1}{L} \sum_{l=1}^L \nabla_{\phi, \theta} \left[ \log p_\theta(\boldsymbol{x}_i, g_\phi(\boldsymbol{\epsilon}'_{i,l}, \boldsymbol{x}_i)) - \log q_\phi(g_\phi(\boldsymbol{\epsilon}'_{i,l}, \boldsymbol{x}_i) \mid \boldsymbol{x}_i) \right]\]
<p>To compute this gradient, we can apply <a href="https://en.wikipedia.org/wiki/Automatic_differentiation">automatic differentiation</a>. We can then use it to perform gradient ascent on the ELBO or, equivalently, <a href="https://en.wikipedia.org/wiki/Gradient_descent">gradient descent</a> on the negative ELBO. Thus, we can utilize the extensive toolkit developed for training deep learning models!</p>
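<p>As a sanity check on the Monte Carlo approximation, the sketch below estimates the ELBO with reparameterized samples and compares it to a closed-form value. It uses a one-dimensional toy model that is not the VAE from this post ($z \sim N(0,1)$, $x \mid z \sim N(z, 1)$, $q(z) = N(m, s^2)$), chosen only because the exact ELBO is available; all function names and numeric values are hypothetical.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(v, mean, var):
    """Log-density of a univariate normal distribution."""
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (v - mean) ** 2 / var

def mc_elbo(x, m, log_var, L):
    """Monte Carlo ELBO estimate using L reparameterized samples."""
    eps = rng.standard_normal(L)                 # surrogate noise
    z = m + np.exp(0.5 * log_var) * eps          # z = g(eps)
    log_joint = log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0)
    log_q = log_normal(z, m, np.exp(log_var))
    return np.mean(log_joint - log_q)

def exact_elbo(x, m, log_var):
    """Closed-form ELBO for the toy model."""
    s2 = np.exp(log_var)
    return (-np.log(2 * np.pi)
            - 0.5 * ((x - m) ** 2 + s2)          # from E_q[log p(x | z)]
            - 0.5 * (m ** 2 + s2)                # from E_q[log p(z)]
            + 0.5 * np.log(2 * np.pi * np.e * s2))  # entropy of q

x, m, log_var = 1.3, 0.2, np.log(0.8)
estimate = mc_elbo(x, m, log_var, L=200_000)
print(estimate, exact_elbo(x, m, log_var))
```

<p>In a real VAE the closed form is unavailable, and a deep learning framework would differentiate the Monte Carlo estimate with respect to the parameters automatically.</p>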
<h2 id="reducing-the-variance-of-the-stochastic-gradients">Reducing the variance of the stochastic gradients</h2>
<p>For the VAE model there is a modification that we can make to reduce the variance of the Monte Carlo gradients. We first re-write the original ELBO in a different form:</p>
\[\begin{align*}\text{ELBO}(\phi, \theta) &= \sum_{i=1}^n E_{\boldsymbol{z}_i \sim q} \left[ \log p_\theta(\boldsymbol{x}_i, \boldsymbol{z}_i) - \log q( \boldsymbol{z}_i \mid \boldsymbol{x}_i) \right] \\ &= \sum_{i=1}^n \int q(\boldsymbol{z}_i \mid \boldsymbol{x}_i) \left[\log p_\theta(\boldsymbol{x}_i, \boldsymbol{z}_i) - \log q(\boldsymbol{z}_i \mid \boldsymbol{x}_i) \right] \ d\boldsymbol{z}_i \\ &= \sum_{i=1}^n \int q(\boldsymbol{z}_i \mid \boldsymbol{x}_i) \left[\log p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i) + \log p(\boldsymbol{z}_i) - \log q(\boldsymbol{z}_i \mid \boldsymbol{x}_i) \right] \ d\boldsymbol{z}_i \\ &= \sum_{i=1}^n E_{\boldsymbol{z}_i \sim q} \left[ \log p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i) \right] + \sum_{i=1}^n \int q(\boldsymbol{z}_i \mid \boldsymbol{x}_i) \left[\log p(\boldsymbol{z}_i) - \log q(\boldsymbol{z}_i \mid \boldsymbol{x}_i) \right] \ d\boldsymbol{z}_i \\ &= \sum_{i=1}^n E_{\boldsymbol{z}_i \sim q} \left[ \log p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i) \right] + \sum_{i=1}^n E_{\boldsymbol{z}_i \sim q}\left[ \log \frac{ p(\boldsymbol{z}_i)}{q(\boldsymbol{z}_i \mid \boldsymbol{x}_i)} \right] \\ &= \sum_{i=1}^n E_{\boldsymbol{z}_i \sim q} \left[ \log p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i) \right] - \sum_{i=1}^n KL(q(\boldsymbol{z}_i \mid \boldsymbol{x}_i) \ || \ p(\boldsymbol{z}_i)) \end{align*}\]
<p>Recall the VAEs we have considered in this blog post have defined $p(\boldsymbol{z})$ to be the standard normal distribution $N(\boldsymbol{0}, \boldsymbol{I})$. In this particular case, it turns out that the KL-divergence term above can be expressed analytically (See the Appendix to this post):</p>
\[KL(q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i) \mid\mid p(\boldsymbol{z}_i)) = -\frac{1}{2} \sum_{j=1}^J \left(1 + h^{(2)}_\phi(\boldsymbol{x}_i)_j - \left(h^{(1)}_\phi(\boldsymbol{x}_i)\right)_j^2 - \exp(h^{(2)}_\phi(\boldsymbol{x}_i)_j) \right)\]
<p>Note above the KL-divergence is calculated by summing over each dimension in the latent space. The full ELBO is:</p>
\[\begin{align*} \text{ELBO}(\phi, \theta) &= \frac{1}{n} \sum_{i=1}^n \left[\frac{1}{2} \sum_{j=1}^J \left(1 + h^{(2)}_\phi(\boldsymbol{x}_i)_j - \left(h^{(1)}_\phi(\boldsymbol{x}_i)\right)_j^2 - \exp(h^{(2)}_\phi(\boldsymbol{x}_i)_j) \right) + E_{\boldsymbol{z}_i \sim q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i)} \left[\log p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i) \right]\right] \end{align*}\]
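<p>The closed-form KL-divergence is easy to verify numerically. The sketch below implements the formula for a diagonal Gaussian against a standard normal prior and compares it with a Monte Carlo estimate of $E_q[\log q(\boldsymbol{z}) - \log p(\boldsymbol{z})]$. The values of <code>mu</code> and <code>log_var</code> are made up and play the role of $h^{(1)}_\phi(\boldsymbol{x}_i)$ and $h^{(2)}_\phi(\boldsymbol{x}_i)$.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

# Made-up encoder outputs for one data point (J = 2).
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -1.0])

kl_exact = kl_to_standard_normal(mu, log_var)

# Monte Carlo check: average log q(z) - log p(z) over samples z ~ q.
eps = rng.standard_normal(size=(200_000, 2))
z = mu + np.exp(0.5 * log_var) * eps
log_q = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * log_var - 0.5 * eps**2, axis=1)
log_p = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * z**2, axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_exact, kl_mc)
```

<p>The two numbers should agree up to Monte Carlo error, and the divergence is exactly zero when <code>mu</code> and <code>log_var</code> are both zero, i.e. when the approximate posterior equals the prior.</p>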
<p>Then, we can apply the reparameterization trick to this formulation of the ELBO and derive the following Monte Carlo approximation:</p>
\[\begin{align*}\text{ELBO}(\phi, \theta) &\approx \frac{1}{n} \sum_{i=1}^n \left[\frac{1}{2} \sum_{j=1}^J \left(1 + h^{(2)}_\phi(\boldsymbol{x}_i)_j - \left(h^{(1)}_\phi(\boldsymbol{x}_i)\right)_j^2 - \exp(h^{(2)}_\phi(\boldsymbol{x}_i)_j) \right) + \frac{1}{L} \sum_{l=1}^L \left[\log p_\theta(\boldsymbol{x}_i \mid g_\phi(\boldsymbol{\epsilon}'_{i,l}, \boldsymbol{x}_i)) \right] \right]\end{align*}\]
<p>Though this equation looks daunting, the feature to notice is that it is differentiable with respect to both $\phi$ and $\theta$. Therefore, we can apply automatic differentiation to derive the gradients that are needed to perform stochastic gradient ascent!</p>
<p>Lastly, one may ask: why does this stochastic gradient have lower variance than the version discussed previously? Intuitively, terms within the ELBO’s expectation are being “pulled out” and computed analytically (i.e., the KL-divergence). Since these terms are computed analytically, less of the estimate depends on the randomness from sampling each $\boldsymbol{\epsilon}'_{i,l}$, and thus there tends to be less overall variability.</p>
<h2 id="vaes-as-autoencoders">VAEs as autoencoders</h2>
<p>So far, we have described VAEs in the context of probabilistic modeling. That is, we have described how the VAE is a probabilistic model that describes each high-dimensional datapoint, $\boldsymbol{x}_i$, as being “generated” from a lower dimensional data point $\boldsymbol{z}_i$. This generating procedure utilizes a neural network to map $\boldsymbol{z}_i$ to the parameters of the distribution $\mathcal{D}$ required to sample $\boldsymbol{x}_i$. Moreover, we can infer the parameters and latent variables of this model via VI. To do so, we solve a sort of inverse problem in which we use a neural network to map each $\boldsymbol{x}_i$ into parameters of the variational posterior distribution $q$ required to sample $\boldsymbol{z}_i$.</p>
<p>Now, what happens if we tie the variational posterior $q_\phi(\boldsymbol{z} \mid \boldsymbol{x})$ to the data generating distribution $p_\theta(\boldsymbol{x} \mid \boldsymbol{z})$? That is, given a data point $\boldsymbol{x}$, we first sample $\boldsymbol{z}$ from the variational posterior distribution,</p>
\[\boldsymbol{z} \sim q_\phi(\boldsymbol{z} \mid \boldsymbol{x})\]
<p>then we generate a new data point, $\boldsymbol{x}'$, from $p_\theta(\boldsymbol{x} \mid \boldsymbol{z})$:</p>
\[\boldsymbol{x}' \sim p_\theta(\boldsymbol{x} \mid \boldsymbol{z})\]
<p>We can visualize this process schematically below:</p>
<p> </p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/VAE_as_autoencoder.png" alt="drawing" width="800" /></center>
<p> </p>
<p>Notice the similarity of the above process to the standard autoencoder:</p>
<p> </p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/autoencoder.png" alt="drawing" width="350" /></center>
<p> </p>
<p>We see that a VAE performs the same sort of compression and decompression as a standard autoencoder! One can view the VAE as a “probabilistic” autoencoder. Instead of mapping each $\boldsymbol{x}_i$ directly to $\boldsymbol{z}_i$, the VAE maps $\boldsymbol{x}_i$ to a <em>distribution</em> over $\boldsymbol{z}_i$ from which $\boldsymbol{z}_i$ is <em>sampled</em>. This randomly sampled $\boldsymbol{z}_i$ is then used to parameterize the distribution from which $\boldsymbol{x}_i$ is sampled.</p>
<h2 id="viewing-the-vae-loss-function-as-regularized-reconstruction-loss">Viewing the VAE loss function as regularized reconstruction loss</h2>
<p>Let’s take a closer look at the loss function for the VAE with our new perspective of VAEs as being probabilistic autoencoders. The (exact) loss function is the negative of the ELBO:</p>
\[\begin{align*} \text{loss}_{\text{VAE}}(\phi, \theta) &= -\sum_{i=1}^n \left( E_{\boldsymbol{z}_i \sim q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i)} \left[\log p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i) \right] - KL(q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i) \ || \ p(\boldsymbol{z}_i)) \right) \end{align*}\]
<p>Notice there are two terms with opposite signs. The first term, $\log p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i)$, can be seen as a <strong>reconstruction loss</strong> because it will push the model towards reconstructing the original $\boldsymbol{x}_i$ from its compressed representation, $\boldsymbol{z}_i$.</p>
<p>This can be made especially evident if our model assumes that $p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i)$ is a normal distribution. That is,</p>
\[p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i) := N(f_\theta(\boldsymbol{z}_i), \sigma^2_{\text{decoder}}\boldsymbol{I})\]
<p>where $\sigma_{\text{decoder}}$ is the standard deviation of the Gaussian noise around $f_\theta(\boldsymbol{z}_i)$. In this scenario, the VAE will attempt to minimize the squared error between the decoded $\boldsymbol{x}'_i$ and the original input data point $\boldsymbol{x}_i$. We can see this by writing out the analytical form of $\log p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i)$ and highlighting the squared error in red:</p>
\[\begin{align*} \text{loss}_{\text{VAE}}(\phi, \theta) &= -\sum_{i=1}^n \left( E_{\boldsymbol{z}_i \sim q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i)} \left[\log p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i) \right] - KL(q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i) \ || \ p(\boldsymbol{z}_i)) \right) \\ &= -\sum_{i=1}^n \left( E_{\boldsymbol{z}_i \sim q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i)} \left[-\frac{D}{2} \log \left(2 \pi \sigma_{\text{decoder}}^2\right) - \frac{ \color{red}{||\boldsymbol{x}_i - f_\theta(\boldsymbol{z}_i) ||_2^2}}{2 \sigma_{\text{decoder}}^2} \right] - KL(q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i) \ || \ p(\boldsymbol{z}_i)) \right) \end{align*}\]
<p>Recall, in the loss function for the standard autoencoder, we are also minimizing this squared error as seen below:</p>
\[\text{loss}_{AE} := \frac{1}{n} \sum_{i=1}^n \color{red}{||\boldsymbol{x}_i - f_\theta(h_\phi(\boldsymbol{x}_i)) ||_2^2}\]
<p>where $h_\phi$ is the encoding neural network and $f_\theta$ is the decoding neural network.</p>
<p>Thus, both the VAE and standard autoencoder will seek to minimize the squared error between the decoded data point $\boldsymbol{x}'_i$ and the original data point $\boldsymbol{x}_i$. In this regard, the two models are quite similar!</p>
<p>In contrast to standard autoencoders, the VAE also has a KL-divergence term with opposite sign to the reconstruction loss term. Notice how this term will push the model to generate latent variables from $q_\phi(\boldsymbol{z}_i \mid \boldsymbol{x}_i)$ that follow the prior distribution, $p(\boldsymbol{z}_i)$, which in our case is a standard normal. We can think of this KL-term as a <strong>regularization term</strong> on the reconstruction loss. That is, the model seeks to reconstruct each $\boldsymbol{x}_i$; however, it also seeks to ensure that the latent $\boldsymbol{z}_i$’s are distributed according to a standard normal distribution!</p>
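<p>The two pieces of the loss can be computed separately, which makes the “reconstruction plus regularization” reading explicit. The NumPy sketch below assumes a Gaussian decoder with fixed noise scale <code>sigma_decoder</code>, uses a single decoded sample per data point, drops the additive constant that does not depend on the parameters, and averages over the batch. The function name and its arguments are hypothetical.</p>

```python
import numpy as np

def vae_loss(x, x_mean, mu, log_var, sigma_decoder=1.0):
    """Single-sample estimate of the negative ELBO (up to a constant), batch-averaged.

    x:        (n, D) original data points
    x_mean:   (n, D) decoder outputs f_theta(z) for one sampled z per data point
    mu:       (n, J) encoder means
    log_var:  (n, J) encoder log-variances
    """
    # Reconstruction term: scaled squared error from the Gaussian log-likelihood.
    recon = np.sum((x - x_mean) ** 2, axis=1) / (2.0 * sigma_decoder**2)
    # Regularization term: closed-form KL to the standard normal prior.
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)
    return np.mean(recon + kl), np.mean(recon), np.mean(kl)

x = np.ones((3, 4))

# Perfect reconstruction and a posterior equal to the prior: both terms vanish.
total, recon, kl = vae_loss(x, x, np.zeros((3, 2)), np.zeros((3, 2)))
print(total, recon, kl)

# Poor reconstruction and a posterior shifted away from the prior.
total2, recon2, kl2 = vae_loss(x, np.zeros((3, 4)), np.ones((3, 2)), np.zeros((3, 2)))
print(total2, recon2, kl2)
```

<p>Setting the KL term aside recovers a scaled version of the standard autoencoder’s squared-error loss, while a smaller <code>sigma_decoder</code> weights reconstruction more heavily relative to the regularizer.</p>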
<h2 id="comparing-the-implementation-of-a-vae-with-that-of-an-autoencoder">Comparing the implementation of a VAE with that of an autoencoder</h2>
<p>If our generative model assumes that $p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i)$ is the normal distribution $N(f_\theta(\boldsymbol{z}_i), \sigma^2_{\text{decoder}}\boldsymbol{I})$, then the implementations of a standard autoencoder and a VAE are quite similar. To see this similarity, let’s examine the computation graph of the loss function that we would use to train each model. For the standard autoencoder, the computation graph looks like:</p>
<p> </p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/autoencoder.png" alt="drawing" width="350" /></center>
<p> </p>
<p>Below, we show Python code that defines and trains a simple autoencoder using <a href="https://pytorch.org">PyTorch</a>. This autoencoder has one fully connected hidden layer in the encoder and decoder. The function <code class="language-plaintext highlighter-rouge">train_model</code> accepts a <a href="https://numpy.org/">numpy</a> array, <code class="language-plaintext highlighter-rouge">X</code>, that stores the data matrix $X \in \mathbb{R}^{n \times D}$:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torchvision.transforms</span> <span class="k">as</span> <span class="n">transforms</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">DataLoader</span><span class="p">,</span> <span class="n">TensorDataset</span>
<span class="kn">import</span> <span class="nn">torch.optim</span> <span class="k">as</span> <span class="n">optim</span>
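<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="k">as</span> <span class="n">F</span>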
<span class="c1"># Define autoencoder model
</span><span class="k">class</span> <span class="nc">autoencoder</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">x_dim</span><span class="p">,</span>
        <span class="n">hidden_dim</span><span class="p">,</span>
        <span class="n">z_dim</span><span class="o">=</span><span class="mi">10</span>
    <span class="p">):</span>
        <span class="nb">super</span><span class="p">(</span><span class="n">autoencoder</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">()</span>
        <span class="c1"># Define encoder layers
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">enc_layer1</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">x_dim</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">enc_layer2</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_dim</span><span class="p">,</span> <span class="n">z_dim</span><span class="p">)</span>
        <span class="c1"># Define decoder layers
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">dec_layer1</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">z_dim</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dec_layer2</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_dim</span><span class="p">,</span> <span class="n">x_dim</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">encoder</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="c1"># Define encoder network
</span>        <span class="n">x</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">enc_layer1</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
        <span class="n">z</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">enc_layer2</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">z</span>
    <span class="k">def</span> <span class="nf">decoder</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">z</span><span class="p">):</span>
        <span class="c1"># Define decoder network
</span>        <span class="n">output</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dec_layer1</span><span class="p">(</span><span class="n">z</span><span class="p">))</span>
        <span class="n">output</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dec_layer2</span><span class="p">(</span><span class="n">output</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">output</span>
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="c1"># Define the full network
</span>        <span class="n">z</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">encoder</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">output</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">decoder</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">output</span>
<span class="k">def</span> <span class="nf">train_model</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">learning_rate</span><span class="o">=</span><span class="mf">1e-3</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span><span class="p">,</span> <span class="n">num_epochs</span><span class="o">=</span><span class="mi">15</span><span class="p">):</span>
    <span class="c1"># Create DataLoader object to generate minibatches
</span>    <span class="n">X</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">X</span><span class="p">).</span><span class="nb">float</span><span class="p">()</span>
    <span class="n">dataset</span> <span class="o">=</span> <span class="n">TensorDataset</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">dataloader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Instantiate model and optimizer
</span> <span class="n">model</span> <span class="o">=</span> <span class="n">autoencoder</span><span class="p">(</span><span class="n">x_dim</span><span class="o">=</span><span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">hidden_dim</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span> <span class="n">z_dim</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="n">learning_rate</span><span class="p">)</span>
<span class="c1"># Define the loss function
</span> <span class="k">def</span> <span class="nf">loss_function</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="n">recon_loss</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">mse_loss</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">reduction</span><span class="o">=</span><span class="s">'sum'</span><span class="p">)</span>
<span class="k">return</span> <span class="n">recon_loss</span>
<span class="c1"># Train the model
</span> <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_epochs</span><span class="p">):</span>
<span class="n">epoch_loss</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">dataloader</span><span class="p">:</span>
<span class="c1"># Zero the gradients
</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
<span class="c1"># Get batch
</span> <span class="n">x</span> <span class="o">=</span> <span class="n">batch</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="c1"># Forward pass
</span> <span class="n">output</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="c1"># Calculate loss
</span> <span class="n">loss</span> <span class="o">=</span> <span class="n">loss_function</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
<span class="c1"># Backward pass
</span> <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
<span class="c1"># Update parameters
</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
<span class="c1"># Add batch loss to epoch loss
</span> <span class="n">epoch_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span>
<span class="c1"># Print epoch loss
</span> <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Epoch </span><span class="si">{</span><span class="n">epoch</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">/</span><span class="si">{</span><span class="n">num_epochs</span><span class="si">}</span><span class="s">, Loss: </span><span class="si">{</span><span class="n">epoch_loss</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>
<p>For a VAE that assumes $p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i)$ to be a normal distribution, the computation graph looks like:</p>
<p> </p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/VAE_computation_graph.png" alt="drawing" width="500" /></center>
<p> </p>
<p>One note: we’ve made a slight change to the notation used up to this point. Here, the output of the decoder, $\boldsymbol{x}'$, is interpreted as the <em>mean</em> of the normal distribution $p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i)$ rather than as a sample from this distribution.</p>
<p>The PyTorch code for the autoencoder would then be slightly altered in the following ways:</p>
<ol>
<li>The loss function uses the approximated ELBO (a reconstruction term plus a KL-divergence term) rather than mean-squared error alone</li>
<li>In the forward pass for the VAE, there is an added step that randomly samples $\boldsymbol{\epsilon}_i$ in order to generate $\boldsymbol{z}_i$</li>
</ol>
<p>Aside from those two differences, the two implementations are quite similar. Below, we show code implementing a simple VAE. Note that here we sample only one value of $\epsilon_{i}$ per data point (that is, $L := 1$). In practice, if the training set is large enough, a single Monte Carlo sample per data sample often suffices to achieve good performance.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torchvision.datasets</span> <span class="k">as</span> <span class="n">datasets</span>
<span class="kn">import</span> <span class="nn">torchvision.transforms</span> <span class="k">as</span> <span class="n">transforms</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">torch.optim</span> <span class="k">as</span> <span class="n">optim</span>
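<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="k">as</span> <span class="n">F</span>
<span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">TensorDataset</span><span class="p">,</span> <span class="n">DataLoader</span>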
<span class="k">class</span> <span class="nc">VAE</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span>
<span class="bp">self</span><span class="p">,</span>
<span class="n">x_dim</span><span class="p">,</span>
<span class="n">hidden_dim</span><span class="p">,</span>
<span class="n">z_dim</span><span class="o">=</span><span class="mi">10</span>
<span class="p">):</span>
<span class="nb">super</span><span class="p">(</span><span class="n">VAE</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">()</span>
<span class="c1"># Define encoding layers
</span> <span class="bp">self</span><span class="p">.</span><span class="n">enc_layer1</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">x_dim</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">enc_layer2_mu</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_dim</span><span class="p">,</span> <span class="n">z_dim</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">enc_layer2_logvar</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_dim</span><span class="p">,</span> <span class="n">z_dim</span><span class="p">)</span>
<span class="c1"># Define decoding layers
</span> <span class="bp">self</span><span class="p">.</span><span class="n">dec_layer1</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">z_dim</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">dec_layer2</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_dim</span><span class="p">,</span> <span class="n">x_dim</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">encoder</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">enc_layer1</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="c1"># Leave mu and logvar unconstrained (a ReLU would force both to be non-negative)
</span> <span class="n">mu</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">enc_layer2_mu</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">logvar</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">enc_layer2_logvar</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span>
<span class="k">def</span> <span class="nf">reparameterize</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span><span class="p">):</span>
<span class="n">std</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">logvar</span><span class="o">/</span><span class="mi">2</span><span class="p">)</span>
<span class="n">eps</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn_like</span><span class="p">(</span><span class="n">std</span><span class="p">)</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">mu</span> <span class="o">+</span> <span class="n">std</span> <span class="o">*</span> <span class="n">eps</span>
<span class="k">return</span> <span class="n">z</span>
<span class="k">def</span> <span class="nf">decoder</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">z</span><span class="p">):</span>
<span class="c1"># Define decoder network
</span> <span class="n">output</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dec_layer1</span><span class="p">(</span><span class="n">z</span><span class="p">))</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">dec_layer2</span><span class="p">(</span><span class="n">output</span><span class="p">))</span>
<span class="k">return</span> <span class="n">output</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">encoder</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">z</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">reparameterize</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span><span class="p">)</span>
<span class="n">output</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">decoder</span><span class="p">(</span><span class="n">z</span><span class="p">)</span>
<span class="k">return</span> <span class="n">output</span><span class="p">,</span> <span class="n">z</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span>
<span class="c1"># Define the loss function
</span><span class="k">def</span> <span class="nf">loss_function</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span><span class="p">):</span>
<span class="n">recon_loss</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">mse_loss</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">reduction</span><span class="o">=</span><span class="s">'sum'</span><span class="p">)</span>
<span class="n">kl_loss</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.5</span> <span class="o">*</span> <span class="n">torch</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">logvar</span> <span class="o">-</span> <span class="n">mu</span><span class="p">.</span><span class="nb">pow</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">-</span> <span class="n">logvar</span><span class="p">.</span><span class="n">exp</span><span class="p">())</span>
<span class="k">return</span> <span class="n">recon_loss</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">kl_loss</span>
<span class="k">def</span> <span class="nf">train_model</span><span class="p">(</span>
<span class="n">X</span><span class="p">,</span>
<span class="n">learning_rate</span><span class="o">=</span><span class="mf">1e-3</span><span class="p">,</span>
<span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span><span class="p">,</span>
<span class="n">num_epochs</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>
<span class="n">hidden_dim</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span>
<span class="n">latent_dim</span><span class="o">=</span><span class="mi">50</span>
<span class="p">):</span>
<span class="c1"># Define the VAE model
</span> <span class="n">model</span> <span class="o">=</span> <span class="n">VAE</span><span class="p">(</span><span class="n">x_dim</span><span class="o">=</span><span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">hidden_dim</span><span class="o">=</span><span class="n">hidden_dim</span><span class="p">,</span> <span class="n">z_dim</span><span class="o">=</span><span class="n">latent_dim</span><span class="p">)</span>
<span class="c1"># Define the optimizer
</span> <span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">lr</span><span class="o">=</span><span class="n">learning_rate</span><span class="p">)</span>
<span class="c1"># Convert X to a PyTorch tensor
</span> <span class="n">X</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">X</span><span class="p">).</span><span class="nb">float</span><span class="p">()</span>
<span class="c1"># Create DataLoader object to generate minibatches
</span> <span class="n">dataset</span> <span class="o">=</span> <span class="n">TensorDataset</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">dataloader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Train the model
</span> <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_epochs</span><span class="p">):</span>
<span class="n">epoch_loss</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">dataloader</span><span class="p">:</span>
<span class="c1"># Zero the gradients
</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
<span class="c1"># Get batch
</span> <span class="n">x</span> <span class="o">=</span> <span class="n">batch</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="c1"># Forward pass
</span> <span class="n">output</span><span class="p">,</span> <span class="n">z</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="c1"># Calculate loss
</span> <span class="n">loss</span> <span class="o">=</span> <span class="n">loss_function</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">logvar</span><span class="p">)</span>
<span class="c1"># Backward pass
</span> <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
<span class="c1"># Update parameters
</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
<span class="c1"># Add batch loss to epoch loss
</span> <span class="n">epoch_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span>
<span class="c1"># Print epoch loss
</span> <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Epoch </span><span class="si">{</span><span class="n">epoch</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">/</span><span class="si">{</span><span class="n">num_epochs</span><span class="si">}</span><span class="s">, Loss: </span><span class="si">{</span><span class="n">epoch_loss</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
<span class="k">return</span> <span class="n">model</span>
</code></pre></div></div>
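<p>Another quick point to note about the above implementation is that we weighted the KL-divergence term in the loss function by 0.5. Because a normal distribution with fixed variance $\sigma^2$ contributes a reconstruction term of $\frac{1}{2\sigma^2} \lVert \boldsymbol{x}_i - \boldsymbol{x}'_i \rVert^2$ to the loss, rescaling the loss shows that a weight of $w$ on the KL-divergence term corresponds to hard-coding the variance of $p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i)$ to $\sigma^2 = w/2$. A weight of 0.5 therefore corresponds to a smaller variance ($\sigma^2 = 0.25$) than the unit variance of a standard normal likelihood, which places more emphasis on reconstruction accuracy.</p>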
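<p>The correspondence between the KL weight and the decoder variance can be checked numerically. The following is a minimal sketch, assuming a decoder distribution with fixed variance $\sigma^2$ and dropping the additive constant of the Gaussian log-density. The helper names <code>gaussian_neg_elbo</code> and <code>weighted_loss</code> and the toy tensors are hypothetical and are not part of the implementation above:</p>

```python
import torch
import torch.nn.functional as F

def gaussian_neg_elbo(output, x, kl, sigma2):
    # Negative ELBO for a decoder N(x; output, sigma2 * I), constant dropped
    recon = F.mse_loss(output, x, reduction='sum') / (2.0 * sigma2)
    return recon + kl

def weighted_loss(output, x, kl, kl_weight):
    # Same form as the loss used above: squared error plus a weighted KL term
    return F.mse_loss(output, x, reduction='sum') + kl_weight * kl

torch.manual_seed(0)
x = torch.rand(4, 6)        # toy "data"
output = torch.rand(4, 6)   # toy "reconstruction"
kl = torch.tensor(3.0)      # toy KL-divergence value

kl_weight = 0.5
sigma2 = kl_weight / 2.0    # variance implied by the KL weight

lhs = weighted_loss(output, x, kl, kl_weight)
# Rescaling the Gaussian negative ELBO by 2 * sigma2 recovers the weighted loss
rhs = 2.0 * sigma2 * gaussian_neg_elbo(output, x, kl, sigma2)
print(lhs.item(), rhs.item())
```

<p>The two printed values agree, so under these assumptions minimizing the weighted loss is equivalent to minimizing the negative ELBO with $\sigma^2 = w/2$.</p>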
<h2 id="applying-a-vae-on-mnist">Applying a VAE on MNIST</h2>
<p>Let’s run the code shown above on MNIST. MNIST is a dataset consisting of 28x28 pixel images of hand-written digits. We will use a latent representation of length 50 (that is, $\boldsymbol{z} \in \mathbb{R}^{50}$). Note that the code shown above implements a model that flattens each image into a vector and uses fully connected layers in both the encoder and decoder. For improved performance, one may instead want to use a <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">convolutional neural network</a> architecture, which has been shown to better model imaging data. Regardless, after enough training, the model was able to reconstruct images that were not included in the training data. Here’s an example of the VAE reconstructing an image of the digit “7” that it was not trained on:</p>
<p> </p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/VAE_MNIST_reconstruction.png" alt="drawing" width="600" /></center>
<p> </p>
<p>As VAEs are generative models, we can use them to generate new data! To do so, we first sample $\boldsymbol{z} \sim N(\boldsymbol{0}, \boldsymbol{I})$ and then output $f_\theta(\boldsymbol{z})$. Here are a few examples of generated samples:</p>
<p> </p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/example_MNIST_generative_process.png" alt="drawing" width="800" /></center>
<p> </p>
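<p>The sampling procedure can be sketched as follows. This is a minimal sketch rather than the code used to produce the figure: <code>decoder</code> here is an untrained, hypothetical stand-in with the same layer sizes as the decoder above, so its outputs are meaningless until it is replaced by a trained model (for example, the <code>decoder</code> method of the model returned by <code>train_model</code>).</p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
z_dim, hidden_dim, x_dim = 50, 256, 784   # 784 = 28 * 28 flattened pixels

# Hypothetical stand-in for the trained decoder f_theta (untrained here)
decoder = nn.Sequential(
    nn.Linear(z_dim, hidden_dim), nn.ReLU(),
    nn.Linear(hidden_dim, x_dim), nn.ReLU(),
)

num_samples = 8
with torch.no_grad():
    z = torch.randn(num_samples, z_dim)   # z ~ N(0, I)
    x_mean = decoder(z)                   # f_theta(z), one row per sample

# Reshape each flattened output back into a 28x28 image
images = x_mean.reshape(num_samples, 28, 28)
print(images.shape)
```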
<p>Lastly, let’s explore the latent space learned by the model. First, let’s take images of a “3” and a “7” and encode them into the latent space by sampling from $q_\phi(\boldsymbol{z} \mid \boldsymbol{x})$. Let $\boldsymbol{z}_1$ and $\boldsymbol{z}_2$ be the latent vectors for “3” and “7” respectively:</p>
<p> </p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/VAE_encode_3_7.png" alt="drawing" width="650" /></center>
<p> </p>
<p>Then, let’s interpolate between $\boldsymbol{z}_1$ and $\boldsymbol{z}_2$, and for each interpolated vector, $\boldsymbol{z}'$, we’ll compute $f_\theta(\boldsymbol{z}')$. Interestingly, we see a smooth transition between these digits as the 3 gradually morphs into the 7:</p>
<p> </p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/VAE_interpolate_3_to_7.png" alt="drawing" width="800" /></center>
<p> </p>
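<p>A minimal sketch of this interpolation, assuming it is linear. The helper <code>interpolate</code> and the placeholder vectors are hypothetical; in practice $\boldsymbol{z}_1$ and $\boldsymbol{z}_2$ would come from the encoder, and each row of <code>path</code> would be passed through the decoder:</p>

```python
import torch

def interpolate(z1, z2, num_steps=10):
    # Weights t_k run from 0 to 1; row k is (1 - t_k) * z1 + t_k * z2
    t = torch.linspace(0.0, 1.0, num_steps).unsqueeze(1)
    return (1.0 - t) * z1 + t * z2

# Placeholder latent vectors standing in for the encodings of "3" and "7"
z1 = torch.zeros(50)
z2 = torch.ones(50)

path = interpolate(z1, z2, num_steps=5)
print(path.shape)   # one interpolated latent vector per row
```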
<h2 id="when-to-use-a-vae-versus-a-standard-autoencoder">When to use a VAE versus a standard autoencoder</h2>
<p>There are three key advantages that VAEs provide over standard autoencoders that make VAEs the better choice for certain types of problems:</p>
<p>1. <strong>Data generation:</strong> As a generative model, a VAE provides a method for <em>generating</em> new samples. To generate a new sample, one first samples a latent variable from the prior: $\boldsymbol{z} \sim p(\boldsymbol{z})$. Then one samples a new data sample, $\boldsymbol{x}$, from $p_\theta(\boldsymbol{x} \mid \boldsymbol{z})$. We demonstrated this ability in the prior section when we used our VAE to generate new MNIST digits.</p>
<p>2. <strong>Control of the latent space:</strong> VAEs enable tighter control over the structure of the latent space. As we saw previously, the ELBO’s KL-divergence term will push the model’s encoder towards encoding samples such that their latent random variables are distributed like the prior distribution. In this post, we discussed models that use a standard normal distribution as a prior. In this case, the latent random variables will tend to be distributed like a standard normal distribution and thus, will group in a sort of spherical pattern in the latent space. Below, we show the distribution of the first latent variable for each of 1,000 MNIST test digits using the VAE described in the previous section (right). The orange line shows the density function of the standard normal distribution. We also show the joint distribution of the first and second latent variables (left) for each of these 1,000 MNIST test digits:</p>
<p> </p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/VAE_latent_space.png" alt="drawing" width="650" /></center>
<p> </p>
<p>3. <strong>Modeling complex distributions:</strong> VAEs provide a principled way to learn low dimensional representations of data that are distributed according to more complicated distributions. That is, they can be used when $p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i)$ is more complicated than the Gaussian distribution-based model we have discussed so far. One example of an application of VAEs using non-Gaussian distributions comes from <a href="https://en.wikipedia.org/wiki/Single-cell_transcriptomics">single-cell RNA-seq</a> analysis in the field of genomics. As described by <a href="https://www.nature.com/articles/s41592-018-0229-2">Lopez et al. (2018)</a>, <a href="https://scvi-tools.org/">scVI</a> is a tool that uses a VAE to model vectors of counts that are assumed to be distributed according to a <a href="https://en.wikipedia.org/wiki/Zero-inflated_model">zero-inflated</a> negative binomial distribution. That is, the distribution $p_\theta(\boldsymbol{x}_i \mid \boldsymbol{z}_i)$ is a zero-inflated negative binomial.</p>
<p>Although we will not go into depth on this model here (perhaps in a future blog post), it provides an example of how VAEs can be easily extended to model data with specific distributional assumptions. In a sense, VAEs are “modular”: you can pick and choose your distributions. As long as the likelihood function is differentiable with respect to $\theta$ and the variational distribution is differentiable with respect to $\phi$, you can fit the model by stochastic gradient ascent on the ELBO using the reparameterization trick!</p>
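<p>As a small illustration of swapping in a different likelihood, the sketch below replaces the Gaussian likelihood with a Bernoulli likelihood, a common choice for pixel intensities in $[0, 1]$. This is an assumed variant rather than anything used in this post: the function name <code>bernoulli_vae_loss</code> is hypothetical, and the decoder is assumed to output unnormalized logits rather than ReLU activations. Only the reconstruction term changes; the KL-divergence term is the same as before:</p>

```python
import torch
import torch.nn.functional as F

def bernoulli_vae_loss(logits, x, mu, logvar):
    # Negative Bernoulli log-likelihood of x given the decoder logits
    recon_loss = F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
    # Closed-form KL divergence between N(mu, diag(exp(logvar))) and N(0, I)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_loss

# Toy inputs: a batch of 4 flattened "images" with values in [0, 1]
torch.manual_seed(0)
x = torch.rand(4, 784)
logits = torch.zeros(4, 784)   # placeholder decoder outputs
mu = torch.zeros(4, 50)        # placeholder encoder means
logvar = torch.zeros(4, 50)    # placeholder encoder log-variances

loss = bernoulli_vae_loss(logits, x, mu, logvar)
print(loss.item())
```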
<h2 id="appendix-derivation-of-the-kl-divergence-term-when-the-variational-posterior-and-prior-are-gaussian">Appendix: Derivation of the KL-divergence term when the variational posterior and prior are Gaussian</h2>
<p>If we choose the variational distribution $q_\phi(\boldsymbol{z} \mid \boldsymbol{x})$ to be defined by the following generative process:</p>
\[\begin{align*}\boldsymbol{\mu} &:= h^{(1)}_\phi(\boldsymbol{x}) \\ \log \boldsymbol{\sigma}^2 &:= h^{(2)}_\phi(\boldsymbol{x}) \\ \boldsymbol{z} &\sim N(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2)) \end{align*}\]
<p>where $h^{(1)}$ and $h^{(2)}$ are the two encoding neural networks, and we choose the prior distribution over $\boldsymbol{z}$ to be a standard normal distribution:</p>
\[p(\boldsymbol{z}) := N(\boldsymbol{0}, \boldsymbol{I})\]
<p>then the KL-divergence from $p(\boldsymbol{z})$ to $q_\phi(\boldsymbol{z} \mid \boldsymbol{x})$ is given by the following formula:</p>
\[KL(q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) \mid\mid p(\boldsymbol{z})) = -\frac{1}{2} \sum_{j=1}^J \left(1 + h^{(2)}_\phi(\boldsymbol{x})_j - \left(h^{(1)}_\phi(\boldsymbol{x})\right)_j^2 - \exp\left(h^{(2)}_\phi(\boldsymbol{x})_j\right) \right)\]
<p>where $J$ is the dimensionality of $\boldsymbol{z}$.</p>
<p><strong>Proof:</strong></p>
<p>First, let’s re-write the KL-divergence as follows:</p>
\[\begin{align*}KL(q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) || p(\boldsymbol{z})) &= \int q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) \log \frac{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}{p(\boldsymbol{z})} \ d\boldsymbol{z} \\ &= \int q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) \log q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) \ d\boldsymbol{z} - \int q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) \log p(\boldsymbol{z}) \ d\boldsymbol{z}\end{align*}\]
<p>First, let’s compute the first term, $\int q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) \log q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) \ d\boldsymbol{z}$:</p>
\[\begin{align*}\int q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) \log q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) \ d\boldsymbol{z} &= \int N(\boldsymbol{z}; \boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma^2})) \log N(\boldsymbol{z}; \boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma^2})) \ d\boldsymbol{z} \\
&= \int N(\boldsymbol{z}; \boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma^2})) \sum_{j=1}^J \log N(z_j; \mu_j, \sigma^2_j) \ d\boldsymbol{z} \\
&= \int N(\boldsymbol{z}; \boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma^2})) \sum_{j=1}^J \left[\log\left(\frac{1}{\sqrt{2 \pi \sigma_j^2}}\right) - \frac{1}{2} \frac{(z_j - \mu_j)^2}{\sigma_j^2} \right] \ d\boldsymbol{z} \\
&= -\frac{J}{2} \log(2 \pi) - \frac{1}{2}\sum_{j=1}^J \log \sigma_j^2 - \frac{1}{2} \sum_{j=1}^J \int N(\boldsymbol{z}; \boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma^2}))\frac{(z_j - \mu_j)^2}{\sigma_j^2} \ d\boldsymbol{z} \\
&= -\frac{J}{2} \log(2 \pi) - \frac{1}{2}\sum_{j=1}^J \log \sigma_j^2 - \frac{1}{2} \sum_{j=1}^J \int_{z_j} N(z_j; \mu_j, \sigma^2_j)\frac{(z_j - \mu_j)^2}{\sigma_j^2} \int_{\boldsymbol{z}_{i \neq j}} \prod_{i \neq j} N(z_i; \mu_i, \sigma^2_i) \ d\boldsymbol{z}_{i \neq j} \ dz_j \\
&= -\frac{J}{2} \log(2 \pi) - \frac{1}{2}\sum_{j=1}^J \log \sigma_j^2 - \frac{1}{2} \sum_{j=1}^J \int N(z_j; \mu_j, \sigma^2_j)\frac{(z_j - \mu_j)^2}{\sigma_j^2} \ dz_j && \text{Note 1} \\
&= -\frac{J}{2} \log(2 \pi) - \frac{1}{2}\sum_{j=1}^J \log \sigma_j^2 - \frac{1}{2} \sum_{j=1}^J \frac{1}{\sigma_j^2} \int N(z_j; \mu_j, \sigma^2_j)(z_j^2 - 2z_j\mu_j + \mu_j^2) \ dz_j \\
&= -\frac{J}{2} \log(2 \pi) - \frac{1}{2}\sum_{j=1}^J \log \sigma_j^2 - \frac{1}{2} \sum_{j=1}^J \frac{1}{\sigma_j^2} \left(E[z_j^2] - E[2z_j\mu_j] + \mu_j^2 \right) && \text{Note 2} \\
&= -\frac{J}{2} \log(2 \pi) - \frac{1}{2}\sum_{j=1}^J \log \sigma_j^2 - \frac{1}{2} \sum_{j=1}^J \frac{1}{\sigma_j^2} \left(\mu_j^2 + \sigma_j^2 - 2\mu_j^2 + \mu_j^2 \right) && \text{Note 3} \\
&= -\frac{J}{2} \log(2 \pi) - \frac{1}{2}\sum_{j=1}^J \log \sigma_j^2 - \frac{1}{2} \sum_{j=1}^J 1 \\
&= -\frac{J}{2} \log(2 \pi) - \frac{1}{2} \sum_{j=1}^J (1 + \log \sigma_j^2) \end{align*}\]
<p>Now, let us compute the second term, $\int q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) \log p(\boldsymbol{z}) \ d\boldsymbol{z}$:</p>
\[\begin{align*}\int q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) \log p(\boldsymbol{z}) \ d\boldsymbol{z} &= \int N(\boldsymbol{z}; \boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2)) \log N(\boldsymbol{z}; \boldsymbol{0}, \boldsymbol{I}) \ d\boldsymbol{z} \\ &= \int N(\boldsymbol{z}; \boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2)) \sum_{j=1}^J \log N(z_j; 0, 1) \ d\boldsymbol{z} \\ &= \int N(\boldsymbol{z}; \boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2)) \sum_{j=1}^J \left[\log \frac{1}{\sqrt{2\pi}} - \frac{1}{2} z_j^2\right] \ d\boldsymbol{z} \\ &= J \log \frac{1}{\sqrt{2\pi}} \int N(\boldsymbol{z}; \boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2)) \ d\boldsymbol{z} - \frac{1}{2}\int N(\boldsymbol{z}; \boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2)) \sum_{j=1}^J z_j^2 \ d\boldsymbol{z} \\ &= -\frac{J}{2} \log (2 \pi) - \frac{1}{2}\int N(\boldsymbol{z}; \boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2)) \sum_{j=1}^J z_j^2 \ d\boldsymbol{z} \\ &= -\frac{J}{2} \log (2 \pi) - \frac{1}{2} \int \sum_{j=1}^J z_j^2 \prod_{j'=1}^J N(z_{j'}; \mu_{j'}, \sigma^2_{j'}) \ d\boldsymbol{z} \\ &= -\frac{J}{2} \log (2 \pi) - \frac{1}{2} \sum_{j=1}^J \int_{z_j} z_j^2 N(z_j; \mu_j, \sigma^2_j) \int_{\boldsymbol{z}_{i \neq j}} \prod_{i \neq j} N(z_i; \mu_i, \sigma^2_i) \ d\boldsymbol{z}_{i \neq j} \ dz_j \\ &= -\frac{J}{2} \log (2 \pi) - \frac{1}{2} \sum_{j=1}^J \int z_j^2 N(z_j; \mu_j, \sigma^2_j) \ dz_j && \text{Note 4} \\ &= -\frac{J}{2} \log (2 \pi) - \frac{1}{2} \sum_{j=1}^J \left( \mu_j^2 + \sigma_j^2 \right) && \text{Note 5} \end{align*}\]
<p>Combining the two terms, we arrive at the formula:</p>
\[KL(q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) || p(\boldsymbol{z})) = -\frac{1}{2} \sum_{j=1}^J \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma^2_j\right)\]
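<p>As a quick sanity check, the closed form above can be compared numerically against PyTorch's built-in KL divergence between normal distributions. The following is a minimal sketch rather than part of the derivation; the helper name <code>kl_diag_gaussian_to_standard_normal</code> and the parameter values are hypothetical and chosen only for illustration:</p>

```python
import torch
from torch.distributions import Normal, kl_divergence


def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), where log_var = log(sigma^2)."""
    return -0.5 * torch.sum(1.0 + log_var - mu**2 - torch.exp(log_var))


# Hypothetical variational parameters for a J = 3 dimensional latent space
mu = torch.tensor([0.5, -1.0, 2.0])
log_var = torch.tensor([0.0, -0.5, 0.3])

closed_form = kl_diag_gaussian_to_standard_normal(mu, log_var)

# Reference value: sum of the per-dimension KL divergences computed by PyTorch
q = Normal(mu, torch.exp(0.5 * log_var))
p = Normal(torch.zeros(3), torch.ones(3))
reference = kl_divergence(q, p).sum()

print(closed_form.item(), reference.item())
```

<p>Because both distributions factor over dimensions, summing the per-dimension divergences should agree with the closed form up to floating-point error.</p>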
<p><strong>Note 1:</strong> We see that $\int_{\boldsymbol{z}_{i \neq j}} \prod_{i \neq j} N(z_i; \mu_i, \sigma^2_i) \ d\boldsymbol{z}_{i \neq j}$ integrates to 1 since this is simply integrating the density function of a multivariate normal distribution.</p>
<p><strong>Note 2:</strong> We see that $\int z_j^2 N(z_j; \mu_j, \sigma^2_j) \ dz_j$ is simply $E[z_j^2]$ where $z_j \sim N(z_j; \mu_j, \sigma^2_j)$. Similarly, $\int 2z_j \mu_j N(z_j; \mu_j, \sigma^2_j) \ dz_j$ is simply $E[2 z_j \mu_j]$.</p>
<p><strong>Note 3:</strong> We use the equation for the variance of random variable $X$:</p>
\[\text{Var}(X) = E(X^2) - E(X)^2\]
<p>to see that</p>
\[E[z_j^2] = \mu_j^2 + \sigma_j^2\]
<p><strong>Note 4:</strong> See Note 1.</p>
<p><strong>Note 5:</strong> See Notes 2 and 3.</p>
<p>$\square$</p>Matthew N. BernsteinVariational autoencoders (VAEs) are a family of deep generative models with use cases that span many applications, from image processing to bioinformatics. There are two complementary ways of viewing the VAE: as a probabilistic model that is fit using variational Bayesian inference, or as a type of autoencoding neural network. In this post, we present the mathematical theory behind VAEs, which is rooted in Bayesian inference, and how this theory leads to an emergent autoencoding algorithm. We also discuss the similarities and differences between VAEs and standard autoencoders. Lastly, we present an implementation of a VAE in PyTorch and apply it to the task of modeling the MNIST dataset of hand-written digits.Blackbox variational inference via the reparameterization gradient2022-11-05T00:00:00-07:002022-11-05T00:00:00-07:00https://mbernste.github.io/posts/reparameterization_vi<p><em>Variational inference (VI) is a mathematical framework for doing Bayesian inference by approximating the posterior distribution over the latent variables in a latent variable model when the true posterior is intractable. In this post, we will discuss a flexible variational inference algorithm, called blackbox VI via the reparameterization gradient, that works “out of the box” for a wide variety of models with minimal need for the tedious mathematical derivations that deriving VI algorithms usually require. We will then use this method to do Bayesian linear regression.</em></p>
<h2 id="introduction">Introduction</h2>
<p>In a <a href="https://mbernste.github.io/posts/variational_inference/">previous blog post</a>, we presented the variational inference (VI) paradigm for estimating posterior distributions when computing them is intractable. To review, VI is used in situations in which we have a model that involves hidden random variables $Z$, observed data $X$, and some posited probabilistic model over the hidden and observed random variables \(P(Z, X)\). Our goal is to compute the posterior distribution $P(Z \mid X)$. In an ideal situation, we would do so by using Bayes theorem:</p>
\[p(z \mid x) = \frac{p(x \mid z)p(z)}{p(x)}\]
<p>where \(z\) and \(x\) are realizations of \(Z\) and \(X\) respectively and \(p(.)\) are probability mass/density functions for the distributions implied by their arguments. In practice, it is often difficult to compute $p(z \mid x)$ via Bayes theorem because the denominator $p(x)$ does not have a closed form.</p>
<p>Instead of computing \(p(z \mid x)\) exactly via Bayes theorem, VI attempts to find another distribution $q(z)$, from some set of distributions $\mathcal{Q}$ called the <strong>variational family</strong>, that minimizes the KL-divergence to \(p(z \mid x)\). This minimization occurs implicitly by maximizing a surrogate quantity called the <a href="https://mbernste.github.io/posts/elbo/">evidence lower bound (ELBO)</a>:</p>
\[\text{ELBO}(q) := E_{Z \sim q}\left[\log p(x, Z) - \log q(Z) \right]\]
<p>Thus, our goal is to solve the following maximization problem:</p>
\[\hat{q} := \text{arg max}_{q \in \mathcal{Q}} \text{ELBO}(q)\]
<p>When each member of $\mathcal{Q}$ is characterized by a set of parameters $\phi$, the ELBO can be written as a function of $\phi$:</p>
\[\text{ELBO}(\phi) := E_{Z \sim q_\phi}\left[\log p(x, Z) - \log q_\phi(Z) \right]\]
<p>Then, the optimization problem reduces to optimizing over $\phi$:</p>
\[\hat{\phi} := \text{arg max}_{\phi} \text{ELBO}(\phi)\]
<p>In this post, we will present a flexible method, called <strong>blackbox variational inference via the reparameterization gradient</strong>, co-invented by <a href="https://arxiv.org/abs/1312.6114">Kingma and Welling (2014)</a> and <a href="https://arxiv.org/abs/1401.4082">Rezende, Mohamed, and Wierstra (2014)</a>, for solving this optimization problem under the following conditions:</p>
<ol>
<li>$q$ is parameterized by some set of variational parameters $\phi$ and is differentiable with respect to these parameters</li>
<li>$p$ is differentiable with respect to $z$</li>
<li>Sampling from $q_\phi$ can be performed via the <strong>reparameterization trick</strong> (to be discussed)</li>
</ol>
<p>The method is often called “blackbox” VI because it enables practitioners to avoid the tedious, model-specific mathematical derivations that developing VI algorithms often requires (as an example of such a tedious derivation, see the <a href="https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf">Appendix</a> to the original paper presenting <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">Latent Dirichlet Allocation</a>). That is, blackbox VI works for a large set of models $p$ and $q$, acting as a “blackbox” in which one needs only to input $p$ and $q$ and the algorithm performs VI automatically.</p>
<p>At its core, the reparameterization gradient method is a method for performing stochastic gradient ascent on the ELBO. It does so by employing a reformulation of the ELBO using a clever technique called the “reparameterization trick”. In this post, we will review stochastic gradient ascent, present the reparameterization trick, and finally, dig into an example of implementing the reparameterization gradient method in <a href="https://pytorch.org/">PyTorch</a> in order to perform Bayesian linear regression.</p>
<h2 id="stochastic-gradient-ascent-of-the-elbo">Stochastic gradient ascent of the ELBO</h2>
<p><a href="https://en.wikipedia.org/wiki/Gradient_descent">Gradient ascent</a> is a straightforward method for solving optimization problems for continuous functions and is a very heavily studied method in machine learning. Thus, it is natural to attempt to optimize the ELBO via gradient ascent. Applying gradient ascent to the ELBO would entail iteratively computing the <a href="https://en.wikipedia.org/wiki/Gradient">gradient</a> of the ELBO with respect to $\phi$ and then updating our estimate of $\phi$ via this gradient. That is, at each iteration $t$, we have some estimate of $\phi$, denoted $\phi_t$ that we will update as follows:</p>
\[\phi_{t+1} := \phi_t + \alpha \nabla_\phi \left. \text{ELBO}(\phi) \right|_{\phi_t}\]
<p>where $\alpha$ is the learning rate. This step is repeated until we converge on a local maximum of the ELBO.</p>
<p>Now, the question becomes, how do we compute the gradient of the ELBO? A key challenge here is dealing with the expectation (i.e., the integral) in the ELBO. The reparameterization gradient method addresses this challenge by performing <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">stochastic gradient ascent</a> using <em>computationally tractable</em> random gradients instead of the computationally <em>intractable</em> exact gradients.</p>
<p>To review, stochastic gradient ascent works as follows: Instead of computing the exact gradient of the ELBO with respect to $\phi$, we formulate a random variable $V(\phi)$ whose expectation is the gradient of the ELBO at $\phi$; that is, one for which $E[V(\phi)] = \nabla_\phi \text{ELBO}(\phi)$. Then, at iteration $t$, we sample an approximate gradient from $V(\phi_t)$ and take a small step in the direction of this random gradient:</p>
\[\begin{align*} v &\sim V(\phi_t) \\ \phi_{t+1} &:= \phi_t + \alpha v \end{align*}\]
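<p>To make the update rule concrete, here is a minimal sketch of stochastic gradient ascent on a toy objective. This is not the ELBO: the objective $f(\phi) = -(\phi - 3)^2$, the noise model, and the function name <code>noisy_gradient</code> are all assumptions made purely for illustration. The only property that matters is that the sampled gradient is unbiased for the exact gradient:</p>

```python
import torch

torch.manual_seed(0)


def noisy_gradient(phi, noise_std=1.0):
    """Sample v ~ V(phi) for the toy objective f(phi) = -(phi - 3)^2.

    The exact gradient is -2 * (phi - 3); adding zero-mean Gaussian noise
    gives a random gradient whose expectation is the exact gradient.
    """
    exact = -2.0 * (phi - 3.0)
    return exact + noise_std * torch.randn(())


phi = torch.tensor(0.0)   # initial estimate phi_0
alpha = 0.01              # learning rate

for t in range(5000):
    v = noisy_gradient(phi)   # v ~ V(phi_t)
    phi = phi + alpha * v     # phi_{t+1} := phi_t + alpha * v

print(phi.item())
```

<p>Even though every individual step uses a noisy gradient, the iterates drift toward the maximizer $\phi = 3$ and then fluctuate around it, with smaller learning rates giving smaller fluctuations.</p>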
<p>The question now becomes, how do we formulate a distribution $V(\phi)$ whose expectation is the gradient of the ELBO, $\nabla_\phi \text{ELBO}(\phi)$? As discussed in the next section, the reparameterization trick will enable one approach towards formulating such a distribution.</p>
<h2 id="the-reparameterization-trick">The reparameterization trick</h2>
<p>Before discussing how we formulate our distribution of stochastic gradients $V(\phi)$, let us first present the reparameterization trick (<a href="https://arxiv.org/abs/1312.6114">Kingma and Welling, 2014</a>; <a href="https://arxiv.org/abs/1401.4082">Rezende, Mohamed, and Wierstra, 2014</a>). It works as follows: we “reparameterize” the distribution $q_\phi(z)$ in terms of a surrogate random variable $\epsilon \sim \mathcal{D}$ and a deterministic function $g_\phi$ in such a way that sampling $z$ from $q_\phi(z)$ is performed as follows:</p>
\[\begin{align*}\epsilon &\sim \mathcal{D} \\ z &:= g_\phi(\epsilon)\end{align*}\]
<p>One way to think about this is that instead of sampling $z$ directly from our variational posterior $q_\phi(z)$, we “re-design” the generative process of $z$ such that we first sample a surrogate random variable $\epsilon$ and then transform $\epsilon$ into $z$ all while ensuring that in the end, the distribution of $z$ still follows $q_\phi$. Crucially, $\mathcal{D}$ must be something we can easily sample from such as a standard normal distribution.
Below we depict an example of this process for a hypothetical case where $q_\phi$ looks like a ring. We first sample from a unimodal distribution $\mathcal{D}$ and then transform these samples via $g_\phi$ into samples drawn from $q_\phi$ that form a ring:</p>
<p> </p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/reparameterization_example_ring.png" alt="drawing" width="700" /></center>
<p> </p>
<p>Reparameterizing $q_\phi(z)$ can sometimes be difficult, but with the right variational family it can be made easy. In particular, if $q_\phi(z)$ belongs to a <a href="https://en.wikipedia.org/wiki/Location%E2%80%93scale_family">location-scale family</a>, then reparameterization becomes quite simple. For example, if $q_\phi(z)$ is a Gaussian distribution</p>
\[q_\phi := N(\mu, \sigma^2)\]
<p>where the variational parameters are simply $\phi := \left\{\mu, \sigma^2\right\}$ (i.e., the mean $\mu$ and variance $\sigma^2$), we can reparameterize $q_\phi(z)$ such that sampling $z$ is done as follows:</p>
\[\begin{align*}\epsilon &\sim N(0, 1) \\ z &:= \mu + \sigma \epsilon\end{align*}\]
<p>Here, the surrogate random variable $\epsilon$ follows a standard normal distribution. The deterministic function $g_\phi$ simply scales $\epsilon$ by $\sigma$ and shifts it by $\mu$. Note that $z \sim N(\mu, \sigma^2) = q_\phi(z)$ and thus, this is a valid reparameterization.</p>
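<p>The Gaussian reparameterization can be sketched in a few lines of PyTorch. The values of $\mu$ and $\sigma$ below are hypothetical, and the test function $z^2$ is chosen only because its expectation, $\mu^2 + \sigma^2$, has gradients that are easy to verify by hand:</p>

```python
import torch

torch.manual_seed(0)

# Hypothetical variational parameters
mu = torch.tensor(2.0, requires_grad=True)
sigma = torch.tensor(0.5, requires_grad=True)

eps = torch.randn(100_000)   # epsilon ~ N(0, 1); does not depend on mu or sigma
z = mu + sigma * eps         # z = g_phi(epsilon), distributed as N(mu, sigma^2)

sample_mean = z.mean().item()
sample_std = z.std().item()

# Because z is a deterministic function of (mu, sigma) given eps, gradients
# flow through the samples. Here E[z^2] = mu^2 + sigma^2, so the gradients
# should be close to 2 * mu and 2 * sigma respectively.
(z**2).mean().backward()

print(sample_mean, sample_std)
print(mu.grad.item(), sigma.grad.item())
```

<p>The key point is that the randomness lives entirely in <code>eps</code>, so automatic differentiation can treat the samples as ordinary differentiable functions of the variational parameters.</p>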
<p>Now, what does this reparameterization trick get us? How does it enable us to compute random gradients of the ELBO? First, we note that following the reparameterization trick, we can re-write the ELBO as follows:</p>
\[\text{ELBO}(\phi) := E_{\epsilon \sim \mathcal{D}} \left[ \log p(x, g_\phi(\epsilon)) - \log q_\phi(g_\phi(\epsilon)) \right]\]
<p>This formulation enables us to now approximate the ELBO via Monte Carlo sampling. That is, we can first sample random variables from our surrogate distribution $\mathcal{D}$:</p>
\[\epsilon'_1, \dots, \epsilon'_L \sim \mathcal{D}\]
<p>Then we can compute a Monte Carlo approximation to the ELBO:</p>
\[\tilde{\text{ELBO}}(\phi) := \frac{1}{L} \sum_{l=1}^L \left[ \log p(x, g_\phi(\epsilon'_l)) - \log q_\phi(g_\phi(\epsilon'_l)) \right]\]
<p>So long as $g_\phi$ is differentiable with respect to $\phi$ and $p$ is differentiable with respect to $z$, we can take the gradient of this approximation:</p>
\[\nabla_\phi \tilde{\text{ELBO}}(\phi) := \nabla_\phi \frac{1}{L} \sum_{l=1}^L \left[ \log p(x, g_\phi(\epsilon'_l)) - \log q_\phi(g_\phi(\epsilon'_l)) \right]\]
<p>Notice that $\nabla_\phi \tilde{\text{ELBO}}(\phi)$ is a random vector (which we previously denoted by $v$ in the general case) where the randomness comes from sampling $\epsilon'_1, \dots, \epsilon'_L$ from $\mathcal{D}$. Moreover, it can be proven that</p>
\[E[\nabla_\phi \tilde{\text{ELBO}}(\phi)] = \nabla_\phi \text{ELBO}(\phi)\]
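<p>To sketch why this holds, write $f_\phi(\epsilon) := \log p(x, g_\phi(\epsilon)) - \log q_\phi(g_\phi(\epsilon))$ (a shorthand introduced here only for this argument). Because $\mathcal{D}$ does not depend on $\phi$, and assuming enough regularity to exchange the gradient and the expectation, we have:</p>
\[\begin{align*} E\left[\nabla_\phi \tilde{\text{ELBO}}(\phi)\right] &= E\left[\frac{1}{L} \sum_{l=1}^L \nabla_\phi f_\phi(\epsilon'_l)\right] \\ &= \frac{1}{L} \sum_{l=1}^L E_{\epsilon \sim \mathcal{D}}\left[\nabla_\phi f_\phi(\epsilon)\right] \\ &= \nabla_\phi E_{\epsilon \sim \mathcal{D}}\left[f_\phi(\epsilon)\right] \\ &= \nabla_\phi \text{ELBO}(\phi) \end{align*}\]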
<p>Thus, the process of sampling $\epsilon'_1, \dots, \epsilon'_L$ from $\mathcal{D}$, computing the approximate ELBO, and then calculating the gradient of this approximation is equivalent to sampling from a distribution of random gradients $V(\phi)$ whose expectation is the gradient of the ELBO. Here we also see why, when implementing the reparameterization trick, $\mathcal{D}$ must be easy to sample from: we use samples from this distribution to form samples from $V(\phi)$.</p>
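<p>As an illustration, the sketch below computes a Monte Carlo estimate of the ELBO and its gradient for a toy model that is not part of this post: a prior $p(z) = N(0, 1)$, a likelihood $p(x \mid z) = N(x; z, 1)$, and a Gaussian $q_\phi(z) = N(\mu, \sigma^2)$. This model, the observed value of $x$, and the function name <code>elbo_estimate</code> are assumptions chosen because the exact ELBO gradient is available in closed form for comparison:</p>

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

x = torch.tensor(1.0)                           # a single (hypothetical) observation
mu = torch.tensor(0.0, requires_grad=True)      # variational mean
sigma = torch.tensor(0.5, requires_grad=True)   # variational standard deviation


def elbo_estimate(mu, sigma, n_samples):
    """Monte Carlo estimate of the reparameterized ELBO using L = n_samples draws."""
    eps = torch.randn(n_samples)                 # epsilon'_1, ..., epsilon'_L ~ N(0, 1)
    z = mu + sigma * eps                         # z_l = g_phi(epsilon'_l)
    log_joint = Normal(0.0, 1.0).log_prob(z) + Normal(z, 1.0).log_prob(x)
    log_q = Normal(mu, sigma).log_prob(z)
    return (log_joint - log_q).mean()


elbo_tilde = elbo_estimate(mu, sigma, n_samples=200_000)
elbo_tilde.backward()   # automatic differentiation yields one sample of V(phi)

# For this toy model the exact ELBO is, up to an additive constant,
# -0.5 * mu^2 - 0.5 * (x - mu)^2 - sigma^2 + log(sigma), so that:
exact_grad_mu = x.item() - 2.0 * mu.item()                   # = 1.0 for these values
exact_grad_sigma = 1.0 / sigma.item() - 2.0 * sigma.item()   # = 1.0 for these values

print(mu.grad.item(), exact_grad_mu)
print(sigma.grad.item(), exact_grad_sigma)
```

<p>With a large number of samples, the stochastic gradient should land close to the exact gradient. In practice, far fewer samples are typically used per step, trading a noisier gradient for cheaper iterations.</p>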
<h2 id="joint-optimization-of-both-variational-and-model-parameters">Joint optimization of both variational and model parameters</h2>
<p>In many cases, not only do we have a model with latent variables $z$, but we also have model parameters $\theta$. That is, our joint distribution $p(x, z)$ is parameterized by some set of parameters $\theta$. Thus, we denote the full joint distribution as $p_\theta(x, z)$. In this scenario, how do we estimate the posterior $p_\theta(z \mid x)$ if we don’t know the true value of $\theta$?</p>
<p>One idea would be to place a prior distribution over $\theta$ and consider $\theta$ to be a latent variable like $z$ (that is, let $z$ include both latent variables <em>and</em> model parameters). However, this may not always be desirable. First, we may not need a full posterior distribution over $\theta$. Moreover, as we’ve seen in this blog post, estimating posteriors is challenging! Is it possible to arrive at a point estimate of $\theta$ while <em>jointly</em> estimating $p_\theta(z \mid x)$?</p>
<p>It turns out that inference of $\theta$ can be performed by simply maximizing the ELBO in terms of <em>both</em> the variational parameters $\phi$ <em>and</em> the model parameters $\theta$. That is, to cast the inference task as</p>
\[\hat{\phi}, \hat{\theta} := \text{arg max}_{\phi, \theta} \ \text{ELBO}(\phi, \theta)\]
<p>where the ELBO now becomes a function of both $\phi$ and $\theta$:</p>
\[\text{ELBO}(\phi, \theta) := E_{Z \sim q_\phi}\left[\log p_\theta(x, Z) - \log q_\phi(Z) \right]\]
<p>A natural question arises: why is this valid? To answer, recall that the ELBO is a <a href="https://mbernste.github.io/posts/variational_inference/"><em>lower bound</em> of the log-likelihood</a>, $\log p_\theta(x)$. Thus, if we maximize the ELBO in terms of $\theta$ we are maximizing a lower bound of the log-likelihood. By maximizing the ELBO in terms of <em>both</em> $\phi$ and $\theta$, we see that we are simultaneously minimizing the KL-divergence between $q_\phi(z)$ and $p_\theta(z \mid x)$ while also maximizing a lower bound of the log-likelihood.</p>
<p>This may remind you of the <a href="https://mbernste.github.io/posts/em/">EM algorithm</a>, where we iteratively optimize the ELBO in terms of a surrogate distribution $q$ and the model parameters $\theta$. In blackbox VI we do the same sort of thing, but instead of using a coordinate ascent algorithm as done in EM, blackbox VI uses a gradient ascent algorithm. Given this argument, one might think they are equivalent (and would produce the same estimates of $\theta$); however, there is one crucial difference: at the $t$th step of the EM algorithm, the $q$ that maximizes the ELBO is the <em>exact</em> distribution $p_{\theta_t}(z \mid x)$ where $\theta_t$ is the estimate of $\theta$ at the $t$th time step. Because of this, EM is guaranteed to converge on a local maximum of the log-likelihood $\log p_\theta(x)$. In contrast, in VI, the variational family $\mathcal{Q}$ may not include $p_{\theta}(z \mid x)$ at all and thus, the estimate of $\theta$ produced by VI is not guaranteed to be a local maximum of the log-likelihood like it is in EM. In practice, however, maximizing the lower bound for the log-likelihood (i.e., $\text{ELBO}(\phi, \theta)$) often works well even if it does not have the same guarantee as EM.</p>
<h2 id="example-bayesian-linear-regression">Example: Bayesian linear regression</h2>
<p>The reparameterized gradient method can be applied to a wide variety of models. Here, we’ll apply it to Bayesian linear regression. Let’s first describe the probabilistic model behind linear regression. Our data consists of covariates $\boldsymbol{x}_1, \dots, \boldsymbol{x}_n \in \mathbb{R}^J$ paired with response variables $y_1, \dots, y_n \in \mathbb{R}$. Our data model is then defined as</p>
\[p(y_1, \dots, y_n \mid \boldsymbol{x}_1, \dots, \boldsymbol{x}_n) := \prod_{i=1}^n N(y_i; \boldsymbol{\beta}^T\boldsymbol{x}_i, \sigma^2)\]
<p>where $N(.; a, b)$ is the probability density function of the normal distribution parameterized by mean $a$ and variance $b$. We will assume that the first covariate for each $\boldsymbol{x}_i$ is defined to be 1 and thus, the first coefficient of $\boldsymbol{\beta}$ is the intercept term.</p>
<p>That is, we assume that each $y_i$ is “generated” from its $\boldsymbol{x}_i$ via the following process:</p>
\[\begin{align*}\mu_i & := \boldsymbol{\beta}^T\boldsymbol{x}_i \\ y_i &\sim N(\mu_i, \sigma^2)\end{align*}\]
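<p>This generative process can be simulated directly. The sketch below draws a small toy dataset; the true coefficients, noise level, and covariate values are hypothetical and are not the ones used to produce the figures in this post:</p>

```python
import torch

torch.manual_seed(0)

n = 5
beta_true = torch.tensor([1.0, 2.0])   # hypothetical [intercept, slope]
sigma_true = 0.5                       # hypothetical noise standard deviation

# The first covariate is fixed to 1 so that beta_true[0] acts as the intercept
covariate = torch.linspace(-2.0, 2.0, n)
X = torch.stack([torch.ones(n), covariate], dim=1)   # shape (n, 2)

mu = X @ beta_true                      # mu_i := beta^T x_i
y = mu + sigma_true * torch.randn(n)    # y_i ~ N(mu_i, sigma^2)

print(X)
print(y)
```

<p>Each row of <code>X</code> is one $\boldsymbol{x}_i$, and every response shares the same noise standard deviation <code>sigma_true</code>.</p>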
<p>Notice that the variance $\sigma^2$ is constant across all data points and thus, our model assumes <a href="https://en.wikipedia.org/wiki/Homoscedasticity_and_heteroscedasticity">homoscedasticity</a>. Furthermore, our data only consists of the pairs $(\boldsymbol{x}_1, y_1), \dots, (\boldsymbol{x}_n, y_n)$, but we don’t know $\boldsymbol{\beta}$ or $\sigma^2$. We can infer the value of these variables using Bayesian inference! Note, Bayesian linear regression <em>can</em> be performed exactly (no need for VI) with a specific <a href="https://en.wikipedia.org/wiki/Conjugate_prior">conjugate prior</a> over both $\boldsymbol{\beta}$ and $\sigma^2$, but we will use VI to demonstrate the approach.</p>
<p>Specifically, we will define a prior distribution over $\boldsymbol{\beta}$, denoted $p(\boldsymbol{\beta})$. For simplicity, let us assume that all parameters are independently and normally distributed, with each parameter’s prior having a mean of zero and a large variance of $C$ (because we are unsure, a priori, what the parameters are). That is, let</p>
\[p(\boldsymbol{\beta}) := \prod_{j=1}^J N(\beta_j; 0, C)\]
<p>Then, our complete data likelihood is given by</p>
\[p(y_1, \dots, y_n, \boldsymbol{\beta} \mid \boldsymbol{x}_1, \dots, \boldsymbol{x}_n) := \prod_{j=1}^J N(\beta_j; 0, C) \prod_{i=1}^n N(y_i; \boldsymbol{\beta}^T\boldsymbol{x}_i, \sigma^2)\]
<p>We will treat $\sigma^2$ as a parameter to the model rather than a random variable. Our goal is to compute the posterior distribution of $\boldsymbol{\beta}$:</p>
\[p(\boldsymbol{\beta} \mid y_1, \dots, y_n, \boldsymbol{x}_1, \dots, \boldsymbol{x}_n)\]
<p>We can approximate this posterior using blackbox VI via the reparameterization gradient! To do so, we must first specify our approximate posterior distribution. For simplicity, we will assume that $q_\phi(\boldsymbol{\beta})$ factors into independent normal distributions (like the prior):</p>
\[q_\phi(\boldsymbol{\beta}) := \prod_{j=1}^J N(\beta_j; \mu_j, \tau^2_j)\]
<p>Note that the full set of variational parameters $\phi$ is the collection of mean and variance parameters for all of the normal distributions. Let us represent these means and variances as vectors:</p>
\[\begin{align*}\boldsymbol{\mu} &:= [\mu_0, \mu_1, \dots, \mu_J] \\ \boldsymbol{\tau}^2 &:= [\tau^2_0, \tau^2_1, \dots, \tau^2_J] \end{align*}\]
<p>Then the variational parameters are:</p>
\[\phi := \{\boldsymbol{\mu}, \boldsymbol{\tau}^2 \}\]
<p>We will treat the variance $\sigma^2$ as a model parameter for which we wish to find a point estimate. That is, $\theta := \sigma^2$. We will now attempt to find $q_\phi$ and $\theta$ jointly via blackbox VI. First, we must derive a reparameterization of $q_\phi$. This can be done quite easily as follows:</p>
\[\begin{align*}\boldsymbol{\epsilon} &\sim N(\boldsymbol{0}, \boldsymbol{I}) \\ \boldsymbol{\beta} &= \boldsymbol{\mu} + \boldsymbol{\epsilon} \odot \boldsymbol{\tau} \end{align*}\]
<p>where $\odot$ represents element-wise multiplication between two vectors. Finally, the reparameterized ELBO for this model is:</p>
\[\text{ELBO}(\phi, \theta) := E_{\boldsymbol{\epsilon} \sim N(\boldsymbol{0}, \boldsymbol{I})}\left[\sum_{j=0}^J \log N(\mu_j + \epsilon_j \tau_j; 0, C) + \sum_{i=1}^n \log N(y_i; (\boldsymbol{\mu} + \boldsymbol{\epsilon} \odot \boldsymbol{\tau})^T\boldsymbol{x}_i, \sigma^2) - \sum_{j=0}^J \log N(\mu_j + \epsilon_j \tau_j; \mu_j, \tau^2_j)\right]\]
<p>Now, we can use this reparameterized ELBO to perform stochastic gradient ascent! This may appear daunting, but can be done automatically with the help of automatic differentiation algorithms!</p>
<p>In the Appendix to this blog post, we show an implementation for univariate linear regression in Python using <a href="https://pytorch.org/">PyTorch</a> that you can execute in <a href="https://colab.research.google.com/drive/1xCFRHMXhwXisZal9yeBp3TdRmFj2Z1Jg?usp=sharing">Google Colab</a>. To test the method, we will simulate a small toy dataset consisting of five data points:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/Bayesian_linear_regression_example1_data.png" alt="drawing" width="300" /></center>
<p>Below is the output of the reparameterization gradient method when fit on these data. In the left-most figure, we show the five data points (blue dots), the true model (red line), the posterior mean (black line), and five samples from the posterior (grey lines). In the middle and right-hand panels we show the density function of the variational posteriors for the slope, $q(\beta_1)$, and the intercept, $q(\beta_0)$, respectively (black line). The grey vertical lines show the randomly sampled slopes and intercepts shown in the left-most figure:</p>
<p> </p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/Bayesian_linear_regression_example1.png" alt="drawing" width="1200" /></center>
<p> </p>
<p>Finally, let’s compare our variational posterior to the posterior we would get if we ran an alternative method for approximating it. Specifically, let’s compare our results to the results we’d get from <a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">Markov Chain Monte Carlo (MCMC)</a>. In MCMC, instead of using an analytical form of a probability distribution to approximate the posterior (as done in VI), we instead <em>sample</em> from the posterior and use these samples to form our approximation. We will use <a href="https://mc-stan.org/">Stan</a> to implement an MCMC approach for this model (code shown in the Appendix to this blog post). Below, we show the approximate posterior from MCMC (blue) along with the approximate posterior from our VI algorithm (red):</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/Bayesian_linear_regression_example1_VI_vs_MCMC.png" alt="drawing" width="1200" /></center>
<p>As you can see, the two results are very similar!</p>
<h2 id="appendix">Appendix</h2>
<p>Below, we show our implementation of Bayesian linear regression via the reparameterized gradient method. There are a few points to note regarding this implementation. First, we break apart the terms of the ELBO to make the code clear, but the number of lines could be reduced. Second, instead of taking the gradient with respect to the variances $\boldsymbol{\tau}^2$ and $\sigma^2$, we take it with respect to the log standard deviations $\log \boldsymbol{\tau}$ and $\log \sigma$ in order to ensure that the standard deviations remain positive throughout the optimization procedure. Third, we use the <a href="https://arxiv.org/abs/1412.6980">Adam</a> optimizer to adapt the step size rather than use a fixed step size as would be done in standard gradient ascent.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.optim</span> <span class="k">as</span> <span class="n">optim</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="k">def</span> <span class="nf">bayesian_linear_regression_blackbox_vi</span><span class="p">(</span>
<span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">prior_mean</span><span class="p">,</span> <span class="n">prior_std</span><span class="p">,</span>
<span class="n">n_iters</span><span class="o">=</span><span class="mi">2000</span><span class="p">,</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">n_monte_carlo</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span>
<span class="p">):</span>
<span class="s">"""
Parameters
----------
X
NxJ matrix of covariates where the final covariate is a dummy variable
consisting of all ones that correspond to the intercept
y
N-length array of response variables
prior_mean
J-length array of the prior means of the parameters beta and
intercept (final coefficient)
prior_std
J-length array of the prior standard deviations of the parameters
beta and intercept (final coefficient)
n_iters
Number of iterations
lr
Learning rate
n_monte_carlo
Number of Monte Carlo samples to use to approximate the ELBO
"""</span>
<span class="c1"># Instantiate input tensors
</span> <span class="n">n_dims</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>
<span class="c1"># Variational parameters
</span> <span class="n">q_mean</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">n_dims</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">q_logstd</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">ones_like</span><span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Model parameters
</span> <span class="n">logsigma</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Data structures to keep track of learning
</span> <span class="n">losses</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">q_means</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">q_logstds</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">q_logsigma_means</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">q_logsigma_logstds</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># Instantiate Adam optimizer
</span> <span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">([</span><span class="n">q_mean</span><span class="p">,</span> <span class="n">q_logstd</span><span class="p">,</span> <span class="n">logsigma</span><span class="p">],</span> <span class="n">lr</span><span class="o">=</span><span class="n">lr</span><span class="p">)</span>
<span class="c1"># Perform blackbox VI
</span> <span class="k">for</span> <span class="nb">iter</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_iters</span><span class="p">):</span>
<span class="c1"># Generate L monte carlo samples
</span> <span class="n">eps_beta</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">n_monte_carlo</span><span class="p">,</span> <span class="n">n_dims</span><span class="p">))</span>
<span class="c1"># Construct random betas and sigma from these samples
</span> <span class="n">beta</span> <span class="o">=</span> <span class="n">q_mean</span> <span class="o">+</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">q_logstd</span><span class="p">)</span> <span class="o">*</span> <span class="n">eps_beta</span>
<span class="c1"># An LxN matrix storing the mean dot(beta_l, x_i)
</span> <span class="c1"># for each Monte Carlo sample l and data point i
</span> <span class="n">y_means</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span> <span class="n">X</span><span class="p">.</span><span class="n">T</span><span class="p">)</span>
<span class="c1"># The distribution N(dot(beta_l, x_i), 1)
</span> <span class="c1"># This is the distribution of the residuals
</span> <span class="n">y_dist</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">distributions</span><span class="p">.</span><span class="n">normal</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span>
<span class="n">y_means</span><span class="p">,</span>
<span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">logsigma</span><span class="p">.</span><span class="n">repeat</span><span class="p">(</span><span class="n">y_means</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="mi">1</span><span class="p">).</span><span class="n">T</span><span class="p">)</span>
<span class="p">)</span>
<span class="c1"># An LxN matrix of the log probabilities
</span> <span class="c1"># \log p(y_i \mid x_i, beta_l)
</span> <span class="n">y_probs</span> <span class="o">=</span> <span class="n">y_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span>
<span class="n">y</span><span class="p">.</span><span class="n">repeat</span><span class="p">(</span><span class="n">y_means</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="mi">1</span><span class="p">)</span>
<span class="p">)</span>
<span class="c1"># An L-length array storing the log probabilities
</span> <span class="c1"># \sum_{i=1}^N \log p(y_i \mid x_i, beta_l)
</span> <span class="c1"># for all L Monte Carlo samples
</span> <span class="n">y_prob_per_l</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">y_probs</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># The prior distribution of each parameter \beta
</span> <span class="c1"># given by N(prior_mean, prior_std)
</span> <span class="n">prior_beta_mean</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">beta</span><span class="p">[</span><span class="mi">0</span><span class="p">]).</span><span class="n">repeat</span><span class="p">(</span><span class="n">y_prob_per_l</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">prior_mean</span><span class="p">)</span>
<span class="n">prior_beta_std</span> <span class="o">=</span> <span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">ones_like</span><span class="p">(</span><span class="n">beta</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="o">*</span> <span class="n">prior_std</span><span class="p">).</span><span class="n">repeat</span><span class="p">(</span><span class="n">y_prob_per_l</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="mi">1</span><span class="p">)</span>
<span class="n">prior_beta_dist</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">distributions</span><span class="p">.</span><span class="n">normal</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span>
<span class="n">prior_beta_mean</span><span class="p">,</span>
<span class="n">prior_beta_std</span>
<span class="p">)</span>
<span class="c1"># An LxD length matrix of \log p(\beta_{l,d}), which is
</span> <span class="c1"># the prior log probabilities of each parameter
</span> <span class="n">prior_beta_probs</span> <span class="o">=</span> <span class="n">prior_beta_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span>
<span class="c1"># An L-length array of probabilities
</span> <span class="c1"># \log p(\beta_l) = \sum_{d=1}^D \log p(\beta_{l,d})
</span> <span class="n">prior_beta_per_l</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">torch</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">prior_beta_probs</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># An L-length array of probabilities
</span> <span class="n">y_beta_prob_per_l</span> <span class="o">=</span> <span class="n">y_prob_per_l</span> <span class="o">+</span> <span class="n">prior_beta_per_l</span>
<span class="c1"># The variational distribution over beta approximating the posterior
</span> <span class="c1"># N(q_mean, q_std)
</span> <span class="n">beta_dist</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">distributions</span><span class="p">.</span><span class="n">normal</span><span class="p">.</span><span class="n">Normal</span><span class="p">(</span>
<span class="n">q_mean</span><span class="p">,</span>
<span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">q_logstd</span><span class="p">)</span>
<span class="p">)</span>
<span class="c1"># An LxD-length matrix of the variational log probabilities of each parameter
</span> <span class="c1"># \log q(beta_{l,d})
</span> <span class="n">q_beta_probs</span> <span class="o">=</span> <span class="n">beta_dist</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span>
<span class="c1"># An L-length array of \log q(beta_l)
</span> <span class="n">q_beta_prob_per_l</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">q_beta_probs</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># An L-length array of the ELBO value for each Monte Carlo sample
</span> <span class="n">elbo_per_l</span> <span class="o">=</span> <span class="n">y_beta_prob_per_l</span> <span class="o">-</span> <span class="n">q_beta_prob_per_l</span>
<span class="c1"># The final loss value!
</span> <span class="n">loss</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span> <span class="o">*</span> <span class="n">torch</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">elbo_per_l</span><span class="p">)</span>
<span class="c1"># Take gradient step
</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
<span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
<span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
<span class="c1"># Store values related to current optimization step
</span> <span class="n">losses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="nb">float</span><span class="p">(</span><span class="n">loss</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">numpy</span><span class="p">()))</span>
<span class="n">q_means</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">q_mean</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">numpy</span><span class="p">()))</span>
<span class="n">q_logstds</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">q_logstd</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">numpy</span><span class="p">()))</span>
<span class="k">return</span> <span class="n">q_mean</span><span class="p">,</span> <span class="n">q_logstd</span><span class="p">,</span> <span class="n">q_means</span><span class="p">,</span> <span class="n">q_logstds</span><span class="p">,</span> <span class="n">losses</span>
</code></pre></div></div>
<p>Here is code implementing Bayesian linear regression in Stan via <a href="https://pystan.readthedocs.io/en/latest/">PyStan</a>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">stan</span>
<span class="n">STAN_MODEL</span> <span class="o">=</span> <span class="s">"""
data {
int<lower=0> N;
vector[N] X;
vector[N] Y;
}
parameters {
real logsigma;
real intercept;
real slope;
}
model {
intercept ~ normal(0, 100);
slope ~ normal(0, 100);
Y ~ normal(intercept + slope * X, exp(logsigma));
}
"""</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">"N"</span><span class="p">:</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span>
<span class="s">"X"</span><span class="p">:</span> <span class="n">X</span><span class="p">.</span><span class="n">T</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span>
<span class="s">"Y"</span><span class="p">:</span> <span class="n">Y</span>
<span class="p">}</span>
<span class="n">posterior</span> <span class="o">=</span> <span class="n">stan</span><span class="p">.</span><span class="n">build</span><span class="p">(</span><span class="n">STAN_MODEL</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">)</span>
<span class="n">fit</span> <span class="o">=</span> <span class="n">posterior</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span>
<span class="n">num_chains</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
<span class="n">num_samples</span><span class="o">=</span><span class="mi">1000</span>
<span class="p">)</span>
<span class="n">slopes</span> <span class="o">=</span> <span class="n">fit</span><span class="p">[</span><span class="s">"slope"</span><span class="p">]</span>
<span class="n">intercepts</span> <span class="o">=</span> <span class="n">fit</span><span class="p">[</span><span class="s">"intercept"</span><span class="p">]</span>
<span class="n">logsigmas</span> <span class="o">=</span> <span class="n">fit</span><span class="p">[</span><span class="s">"logsigma"</span><span class="p">]</span>
</code></pre></div></div>Matthew N. BernsteinVariational inference (VI) is a mathematical framework for doing Bayesian inference by approximating the posterior distribution over the latent variables in a latent variable model when the true posterior is intractable. In this post, we will discuss a flexible variational inference algorithm, called blackbox VI via the reparameterization gradient, that works “out of the box” for a wide variety of models with minimal need for the tedious mathematical derivations that deriving VI algorithms usually require. We will then use this method to do Bayesian linear regression.Row reduction with elementary matrices2022-10-02T00:00:00-07:002022-10-02T00:00:00-07:00https://mbernste.github.io/posts/row_reduction<p><em>In this post we discuss the row reduction algorithm for solving a system of linear equations that has exactly one solution. We will then show how the row reduction algorithm can be represented as a sequence of matrix multiplications involving a special class of matrices called elementary matrices. That is, each elementary matrix represents a single elementary row operation in the row reduction algorithm.</em></p>
<h2 id="introduction">Introduction</h2>
<p>In a <a href="https://mbernste.github.io/posts/systems_linear_equations/">previous blog post</a>, we showed how systems of linear equations can be represented as a matrix equation. For example, the system of linear equations,</p>
\[\begin{align*}a_{1,1}x_1 + a_{1,2}x_2 + a_{1,3}x_3 &= b_1 \\ a_{2,1}x_1 + a_{2,2}x_2 + a_{2,3}x_3 &= b_2 \\ a_{3,1}x_1 + a_{3,2}x_2 + a_{3,3}x_3 &= b_3 \end{align*}\]
<p>can be represented succinctly as</p>
\[\boldsymbol{Ax} = \boldsymbol{b}\]
<p>where $\boldsymbol{A}$ is the matrix of coefficients $a_{1,1}, a_{1,2}, \dots, a_{3,3}$, $\boldsymbol{x}$ is the vector of variables $x_1, x_2,$ and $x_3$, and $\boldsymbol{b}$ is the vector of constants $b_1, b_2,$ and $b_3$. Furthermore, we noted that this system will have exactly one solution if $\boldsymbol{A}$ is an <a href="https://mbernste.github.io/posts/inverse_matrices/">invertible matrix</a>.</p>
<p>In this post, we will discuss how one can find this solution using a process called <strong>row reduction</strong>, which entails performing a series of algebraic operations on the system. We will then show how the row reduction algorithm can be represented as a process that entails <a href="https://mbernste.github.io/posts/matrix_multiplication/">multiplying</a> $\boldsymbol{A}$ by a series of matrices called <strong>elementary matrices</strong> in order to convert $\boldsymbol{A}$ to the identity matrix. Each elementary matrix represents a single step of the row reduction algorithm.</p>
<h2 id="row-reduction">Row reduction</h2>
<p>Before digging into matrices, let’s first discuss how one can go about solving a system of linear equations. Say we have a system with three equations and three variables:</p>
\[\begin{align*}a_{1,1}x_1 + a_{1,2}x_2 + a_{1,3}x_3 &= b_1 \\ a_{2,1}x_1 + a_{2,2}x_2 + a_{2,3}x_3 &= b_2 \\ a_{3,1}x_1 + a_{3,2}x_2 + a_{3,3}x_3 &= b_3 \end{align*}\]
<p>To solve such a system, our goal is to perform simple algebraic operations on these equations until we convert the system to one with the following form:</p>
\[\begin{align*}x_1 &= c_1 \\ x_2 &= c_2 \\ x_3 &= c_3 \end{align*}\]
<p>where $c_1, c_2$, and $c_3$ are the solutions to the system – that is, they are the values we can assign to $x_1, x_2$, and $x_3$ so that all of the equations in the system are valid.</p>
<p>Now, what kinds of algebraic operations can we perform on the equations of the system to solve it? There are three main categories, called <strong>elementary row operations</strong> (we’ll soon see why they have this name):</p>
<ol>
<li><strong>Scalar multiplication</strong>: Simply multiply both sides of one of the equations by a scalar.</li>
<li><strong>Row swap</strong>: You can move one equation above or below another. Note, the order in which the equations are written is irrelevant to the solution, so swapping rows is really just changing how we organize the equations. Nonetheless, this organization will be important as we demonstrate how elementary matrices can be used to solve the system.</li>
<li><strong>Row sum</strong>: Add a multiple of one equation to another. This is valid because if each equation holds, then the two sides of the first equation are the same quantity, so adding a multiple of the first equation to the second amounts to adding the same quantity to both sides of the second equation.</li>
</ol>
<p>Let’s use these operations to solve the following system:</p>
\[\begin{align*}-x_1 - 2 x_2 + x_3 &= -3 \\ 3 x_2 &= 3 \\ 2 x_1 + 4 x_2 &= 10\end{align*}\]
<p>1. First, we <em>row swap</em> the first and third equations:</p>
\[\begin{align*}2 x_1 + 4 x_2 &= 10 \\ 3 x_2 &= 3 \\ -x_1 - 2 x_2 + x_3 &= -3\end{align*}\]
<p>2. Next, let’s perform <em>scalar multiplication</em> and multiply the first equation by 1/2:</p>
\[\begin{align*}x_1 + 2 x_2 &= 5 \\ 3 x_2 &= 3 \\ -x_1 - 2 x_2 + x_3 &= -3\end{align*}\]
<p>3. Next, let’s perform a <em>row sum</em> and add the first row to the third:</p>
\[\begin{align*}x_1 + 2 x_2 &= 5 \\ 3 x_2 &= 3 \\ x_3 &= 2\end{align*}\]
<p>4. Next, let’s perform <em>scalar multiplication</em> and multiply the second equation by 1/3:</p>
\[\begin{align*}x_1 + 2 x_2 &= 5 \\ x_2 &= 1 \\ x_3 &= 2\end{align*}\]
<p>5. Finally, let’s perform a <em>row sum</em> and add -2 multiplied by the second row to the first:</p>
\[\begin{align*}x_1 &= 3 \\ x_2 &= 1 \\ x_3 &= 2\end{align*}\]
<p>And there we go, we’ve solved the system using these elementary row operations. This process of using elementary row operations to solve a system of linear equations is called <strong>row reduction</strong>.</p>
<h2 id="elementary-row-operations-in-matrix-notation">Elementary row operations in matrix notation</h2>
<p>Recall, we can represent a system of linear equations as a <a href="https://mbernste.github.io/posts/systems_linear_equations/">matrix equation</a>. For example, the linear system that we just solved can be written as:</p>
\[\begin{bmatrix}-1 & -2 & 1 \\ 0 & 3 & 0 \\ 2 & 4 & 0 \end{bmatrix}\begin{bmatrix}x_1 \\ x_2 \\ x_3\end{bmatrix} = \begin{bmatrix}-3 \\ 3 \\ 10\end{bmatrix}\]
<p>When solving the system using the elementary row operations, we needn’t write out all of the equations. Really, all we need to do is keep track of how $\boldsymbol{A}$ and \(\boldsymbol{b}\) are being transformed upon each iteration. For ease of notation, we can join $\boldsymbol{A}$ and $\boldsymbol{b}$ into a single matrix, called an <strong>augmented matrix</strong>. In our example, this augmented matrix would look like:</p>
\[\begin{bmatrix}-1 & -2 & 1 & -3 \\ 0 & 3 & 0 & 3 \\ 2 & 4 & 0 & 10 \end{bmatrix}\]
<p>In the augmented matrix, the final column stores $\boldsymbol{b}$ and all of the previous columns store the columns of $\boldsymbol{A}$. We can now carry out the row operations directly on this augmented matrix as follows:</p>
<p>1. <em>Row swap</em>: swap the first and third equations:</p>
\[\begin{bmatrix}2 & 4 & 0 & 10 \\ 0 & 3 & 0 & 3 \\ -1 & -2 & 1 & -3 \end{bmatrix}\]
<p>2. <em>Scalar multiplication</em>: Multiply the first equation by 1/2:</p>
\[\begin{bmatrix}1 & 2 & 0 & 5 \\ 0 & 3 & 0 & 3 \\ -1 & -2 & 1 & -3 \end{bmatrix}\]
<p>3. <em>Row sum</em>: add the first row to the third:</p>
\[\begin{bmatrix}1 & 2 & 0 & 5 \\ 0 & 3 & 0 & 3 \\ 0 & 0 & 1 & 2 \end{bmatrix}\]
<p>4. <em>Scalar multiplication</em>: Multiply the second equation by 1/3:</p>
\[\begin{bmatrix}1 & 2 & 0 & 5 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 2 \end{bmatrix}\]
<p>5. <em>Row sum</em>: add -2 multiplied by the second row to the first:</p>
\[\begin{bmatrix}1 & 0 & 0 & 3 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 2 \end{bmatrix}\]
<p>Now, let’s re-write the augmented matrix as a matrix equation:</p>
\[\begin{bmatrix}1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix}x_1 \\ x_2 \\ x_3\end{bmatrix} = \begin{bmatrix}3 \\ 1 \\ 2\end{bmatrix}\]
<p>Note that \(\boldsymbol{A}\) has been <em>transformed</em> into the identity matrix \(\boldsymbol{I}\). This will be a key observation as we move into the next section.</p>
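<p>The five row operations above can also be replayed numerically. The following is a minimal sketch, assuming NumPy is available; the variable names <code>M</code> and <code>solution</code> are illustrative and not part of the original post:</p>

```python
import numpy as np

# Augmented matrix [A | b] for the example system
M = np.array([
    [-1.0, -2.0, 1.0, -3.0],
    [ 0.0,  3.0, 0.0,  3.0],
    [ 2.0,  4.0, 0.0, 10.0],
])

# 1. Row swap: exchange the first and third rows
M[[0, 2]] = M[[2, 0]]
# 2. Scalar multiplication: multiply the first row by 1/2
M[0] = 0.5 * M[0]
# 3. Row sum: add the first row to the third
M[2] = M[2] + M[0]
# 4. Scalar multiplication: multiply the second row by 1/3
M[1] = M[1] / 3.0
# 5. Row sum: add -2 times the second row to the first
M[0] = M[0] - 2.0 * M[1]

# The left 3x3 block is now the identity; the last column holds the solution
solution = M[:, -1]
print(M)
```

<p>After the final step, the last column of <code>M</code> holds the values of $x_1, x_2$, and $x_3$ found above.</p>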
<h2 id="elementary-matrices">Elementary matrices</h2>
<p>Notice how during the row reduction process we transformed the matrix $\boldsymbol{A}$ using a series of steps until it became the identity matrix $\boldsymbol{I}$. In fact, each of these elementary row operations can be represented as a matrix. Such a matrix that represents an elementary row operation is called an <strong>elementary matrix</strong>.</p>
<p>To demonstrate how our elementary row operations can be performed using matrix multiplication, let’s look back at our example. We start with the matrix</p>
\[\boldsymbol{A} := \begin{bmatrix}-1 & -2 & 1 \\ 0 & 3 & 0 \\ 2 & 4 & 0 \end{bmatrix}\]
<p>Then, first we <em>row swap</em> the first and third equations:</p>
\[\underbrace{\begin{bmatrix}0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}}_{\boldsymbol{E}_1} \underbrace{\begin{bmatrix}-1 & -2 & 1 \\ 0 & 3 & 0 \\ 2 & 4 & 0 \end{bmatrix}}_{\boldsymbol{A}} = \begin{bmatrix}2 & 4 & 0 \\ 0 & 3 & 0 \\ -1 & -2 & 1 \end{bmatrix}\]
<p>Then perform <em>scalar multiplication</em> and multiply the first equation by 1/2:</p>
\[\underbrace{\begin{bmatrix}1/2 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\boldsymbol{E}_2} \underbrace{\begin{bmatrix}2 & 4 & 0 \\ 0 & 3 & 0 \\ -1 & -2 & 1 \end{bmatrix}}_{\boldsymbol{E}_1\boldsymbol{A}} = \begin{bmatrix}1 & 2 & 0 \\ 0 & 3 & 0 \\ -1 & -2 & 1 \end{bmatrix}\]
<p>Then perform a <em>row sum</em> and add the first row to the third:</p>
\[\underbrace{\begin{bmatrix}1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}}_{\boldsymbol{E}_3} \underbrace{\begin{bmatrix}1 & 2 & 0 \\ 0 & 3 & 0 \\ -1 & -2 & 1 \end{bmatrix}}_{\boldsymbol{E}_2\boldsymbol{E}_1\boldsymbol{A}} = \begin{bmatrix}1 & 2 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 1 \end{bmatrix}\]
<p>Then perform <em>scalar multiplication</em> and multiply the second equation by 1/3:</p>
\[\underbrace{\begin{bmatrix}1 & 0 & 0 \\ 0 & 1/3 & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\boldsymbol{E}_4} \underbrace{\begin{bmatrix}1 & 2 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\boldsymbol{E}_3\boldsymbol{E}_2\boldsymbol{E}_1\boldsymbol{A}} = \begin{bmatrix}1 & 2 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\]
<p>Then perform a <em>row sum</em> and add -2 multiplied by the second row to the first:</p>
\[\underbrace{\begin{bmatrix}1 & -2 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\boldsymbol{E}_5} \underbrace{\begin{bmatrix}1 & 2 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\boldsymbol{E}_4\boldsymbol{E}_3\boldsymbol{E}_2\boldsymbol{E}_1\boldsymbol{A}} = \begin{bmatrix}1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\]
<p>Notice, we’ve derived a series of matrices that, when multiplied by $\boldsymbol{A}$, produce the identity matrix:</p>
\[\boldsymbol{E}_5\boldsymbol{E}_4\boldsymbol{E}_3\boldsymbol{E}_2\boldsymbol{E}_1\boldsymbol{A} = \boldsymbol{I}\]
<p>By the <a href="https://mbernste.github.io/posts/inverse_matrices/">definition of an inverse matrix</a>, we see that the matrix formed by $\boldsymbol{E}_5\boldsymbol{E}_4\boldsymbol{E}_3\boldsymbol{E}_2\boldsymbol{E}_1$ is the inverse of $\boldsymbol{A}$! That is,</p>
\[\boldsymbol{A}^{-1} = \boldsymbol{E}_5\boldsymbol{E}_4\boldsymbol{E}_3\boldsymbol{E}_2\boldsymbol{E}_1\]
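<p>This identity can be checked numerically. The sketch below assumes NumPy and simply transcribes the five elementary matrices from the example; <code>np.linalg.inv</code> is used only as an independent reference, and the names <code>A_inv</code> and <code>x</code> are illustrative:</p>

```python
import numpy as np

A = np.array([[-1.0, -2.0, 1.0],
              [ 0.0,  3.0, 0.0],
              [ 2.0,  4.0, 0.0]])
b = np.array([-3.0, 3.0, 10.0])

# Elementary matrices, one per row operation in the example
E1 = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])    # swap rows 1 and 3
E2 = np.array([[0.5, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])    # scale row 1 by 1/2
E3 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]])    # add row 1 to row 3
E4 = np.array([[1.0, 0.0, 0.0], [0.0, 1 / 3, 0.0], [0.0, 0.0, 1.0]])  # scale row 2 by 1/3
E5 = np.array([[1.0, -2.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])   # add -2 * row 2 to row 1

# The product of the elementary matrices, applied right to left
A_inv = E5 @ E4 @ E3 @ E2 @ E1

# Applying the same product to b yields the solution of Ax = b
x = A_inv @ b
print(A_inv @ A)
print(x)
```

<p>Note that the matrices are multiplied in the reverse of the order in which the operations were applied, since the first operation, $\boldsymbol{E}_1$, must sit closest to $\boldsymbol{A}$.</p>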
<p>Thus, we have found a way to decompose the inverse of $\boldsymbol{A}$ into a set of matrices that when multiplied together yield its inverse. Each of these matrices represents a transformation on $\boldsymbol{A}$ equivalent to an elementary row operation that one would use to solve an equation of the form $\boldsymbol{Ax} = \boldsymbol{b}$!</p>Matthew N. BernsteinIn this post we discuss the row reduction algorithm for solving a system of linear equations that has exactly one solution. We will then show how the row reduction algorithm can be represented as a sequence of matrix multiplications involving a special class of matrices called elementary matrices. That is, each elementary matrix represents a single elementary row operation in the row reduction algorithm.Reasoning about systems of linear equations using linear algebra2022-06-12T00:00:00-07:002022-06-12T00:00:00-07:00https://mbernste.github.io/posts/systems_of_linear_equations<p><em>In this blog post, we will discuss the relationship between matrices and systems of linear equations. Specifically, we will show how systems of linear equations can be represented as a single matrix equation. Solutions to the system of linear equations can be reasoned about by examining the characteristics of the matrices and vectors in that matrix equation.</em></p>
<h2 id="introduction">Introduction</h2>
<p>In this blog post, we will discuss the relationship between <a href="https://mbernste.github.io/posts/matrices/">matrices</a> and systems of linear equations. Specifically, we will show how systems of linear equations can be represented as a single matrix equation. Solutions to the system of linear equations can be reasoned about by examining the characteristics of the matrices and vectors involved in that matrix equation.</p>
<h2 id="systems-of-linear-equations">Systems of linear equations</h2>
<p>A <strong>system of linear equations</strong> is a set of <a href="https://en.wikipedia.org/wiki/Linear_equation">linear equations</a> that all utilize the same set of variables, but each equation differs by the coefficients that multiply those variables.</p>
<p>For example, say we have three variables, $x_1, x_2$, and $x_3$. A system of linear equations involving these three variables can be written as:</p>
\[\begin{align*}3 x_1 + 2 x_2 - x_3 &= 1 \\ 2 x_1 - 2 x_2 + 4 x_3 &= -2 \\ -x_1 + 0.5 x_2 - x_3 &= 0 \end{align*}\]
<p>A <strong>solution</strong> to this system of linear equations is an assignment to the variables $x_1, x_2, x_3$, such that all of the equations are simultaneously true. In our case, a solution would be given by $(x_1, x_2, x_3) = (1, -2, -2)$.</p>
<p>More abstractly, we could write a system of linear equations as</p>
\[\begin{align*}a_{1,1}x_1 + a_{1,2}x_2 + a_{1,3}x_3 &= b_1 \\ a_{2,1}x_1 + a_{2,2}x_2 + a_{2,3}x_3 &= b_2 \\ a_{3,1}x_1 + a_{3,2}x_2 + a_{3,3}x_3 &= b_3 \end{align*}\]
<p>where $a_{1,1}, \dots, a_{3,3}$ are the coefficients and $b_1, b_2,$ and $b_3$ are the constant terms, all treated as <em>fixed</em>. By “fixed”, we mean that we assume that $a_{1,1}, \dots, a_{3,3}$ and $b_1, b_2,$ and $b_3$ are known. In contrast, $x_1, x_2,$ and $x_3$ are unknown. We can try different values for $x_1, x_2,$ and $x_3$ and test whether or not that assignment is a solution to the system.</p>
<h2 id="reasoning-about-the-solutions-to-a-system-of-linear-equations-by-respresenting-the-system-as-a-matrix-equation">Reasoning about the solutions to a system of linear equations by representing the system as a matrix equation</h2>
<p>Now, a natural question is: given a system of linear equations, how many solutions does it have? Will a system always have a solution? If it does have a solution, will it have only one? We will use concepts from linear algebra to address these questions.</p>
<p>First, note that we can write a system of linear equations much more succinctly using <a href="https://mbernste.github.io/posts/matrix_vector_mult/">matrix-vector</a> multiplication. That is,</p>
\[\begin{bmatrix}a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \end{bmatrix} \begin{bmatrix}x_1 \\ x_2 \\ x_3\end{bmatrix} = \begin{bmatrix}b_1 \\ b_2 \\ b_3\end{bmatrix}\]
<p>If we let the matrix of coefficients be $\boldsymbol{A}$, the vector of variables be $\boldsymbol{x}$, and the vector of constants be $\boldsymbol{b}$, then we could write this even more succinctly as:</p>
\[\boldsymbol{Ax} = \boldsymbol{b}\]
<p>This is an important point: any system of linear equations can be written succinctly as an equation using matrix-vector multiplication. By viewing systems of linear equations through this lens, we can reason about the number of solutions to a system of linear equations using properties of the matrix $\boldsymbol{A}$!</p>
<p>Given this newfound representation for systems of linear equations, recall from our <a href="https://mbernste.github.io/posts/matrix_vector_mult/">discussion of matrix-vector multiplication</a> that a matrix $\boldsymbol{A}$ multiplying a vector $\boldsymbol{x}$ can be understood as taking a linear combination of the column vectors of $\boldsymbol{A}$ using the elements of $\boldsymbol{x}$ as the coefficients:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/matrix_vec_mult_as_lin_comb.png" alt="drawing" width="700" /></center>
<p>Thus, we see that a solution to a system of linear equations, $\boldsymbol{x}$, is any set of weights for which, if we take a weighted sum of the columns of $\boldsymbol{A}$, we get the vector $\boldsymbol{b}$. That is, a solution exists only if $\boldsymbol{b}$ is a vector that lies in the <a href="https://mbernste.github.io/posts/linear_independence/">span</a> of the columns of $\boldsymbol{A}$!</p>
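<p>To make the column view concrete, here is a small sketch using the example system from earlier in this post and its solution $(1, -2, -2)$. It assumes NumPy; the names <code>combo</code> and <code>x_solved</code> are illustrative, and <code>np.linalg.solve</code> is used only because this particular $\boldsymbol{A}$ happens to be invertible:</p>

```python
import numpy as np

# Coefficient matrix and constants of the example system
A = np.array([[ 3.0,  2.0, -1.0],
              [ 2.0, -2.0,  4.0],
              [-1.0,  0.5, -1.0]])
b = np.array([1.0, -2.0, 0.0])

# The solution stated in the text
x = np.array([1.0, -2.0, -2.0])

# Weighted sum of the columns of A, using the elements of x as weights
combo = x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2]

# Solving the system directly recovers the same weights
x_solved = np.linalg.solve(A, b)
print(combo, A @ x, x_solved)
```

<p>Both the weighted sum of columns and the matrix-vector product <code>A @ x</code> reproduce $\boldsymbol{b}$.</p>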
<p>From this observation, we can begin to draw some conclusions about the number of solutions a given system of linear equations will have based on the properties of \(\boldsymbol{A}\). Specifically, the system will either have:</p>
<ol>
<li><strong>No solution.</strong> This will occur if \(\boldsymbol{b}\) lies <em>outside</em> the span of the columns of $\boldsymbol{A}$. This means that there is no way to construct $\boldsymbol{b}$ from the columns of $\boldsymbol{A}$, and thus there is no vector of weights $\boldsymbol{x}$ that will satisfy the equation $\boldsymbol{Ax} = \boldsymbol{b}$. Note, this can only occur if $\boldsymbol{A}$ is singular. Why? Recall that an <a href="https://mbernste.github.io/posts/inverse_matrices/">invertible matrix</a> maps each input vector $\boldsymbol{x}$ to a unique output vector $\boldsymbol{b}$ and each output $\boldsymbol{b}$ corresponds to a unique input $\boldsymbol{x}$. Said more succinctly, an invertible matrix characterizes a one-to-one and onto <a href="https://mbernste.github.io/posts/matrices_linear_transformations/">linear transformation</a>. Therefore, if $\boldsymbol{A}$ is invertible, then for any given $\boldsymbol{b}$, there <em>must</em> exist a vector, $\boldsymbol{x}$, that solves the equation $\boldsymbol{Ax} = \boldsymbol{b}$. If such a vector does not exist, then $\boldsymbol{A}$ must not be invertible.</li>
<li><strong>Exactly one solution.</strong> This will occur if $\boldsymbol{A}$ is invertible. As discussed above, an invertible matrix characterizes a one-to-one and onto linear transformation and thus, for any given $\boldsymbol{b}$, there will be exactly one vector, $\boldsymbol{x}$, that solves the equation $\boldsymbol{Ax} = \boldsymbol{b}$.</li>
<li><strong>Infinitely many solutions.</strong> This will occur if $\boldsymbol{b}$ lies <em>inside</em> the span of the columns of $\boldsymbol{A}$, but $\boldsymbol{A}$ is <em>not</em> invertible. Why would there be an infinite number of solutions? <a href="https://mbernste.github.io/posts/inverse_matrices/">Recall</a> that if $\boldsymbol{A}$ is not invertible, then the columns of $\boldsymbol{A}$ are <a href="https://mbernste.github.io/posts/linear_independence/">linearly dependent</a>, meaning that there are an infinite number of ways to take a weighted sum of the columns of $\boldsymbol{A}$ to get $\boldsymbol{b}$. Thus, there are an infinite number of vectors that satisfy $\boldsymbol{Ax} = \boldsymbol{b}$.</li>
</ol>Matthew N. BernsteinIn this blog post, we will discuss the relationship between matrices and systems of linear equations. Specifically, we will show how systems of linear equations can be represented as a single matrix equation. Solutions to the system of linear equations can be reasoned about by examining the characteristics of the matrices and vectors in that matrix equation.Span and linear independence2022-06-11T00:00:00-07:002022-06-11T00:00:00-07:00https://mbernste.github.io/posts/linear_independence<p><em>A very important concept in linear algebra is that of linear independence. In this blog post we present the definition for the span of a set of vectors. Then, we use this definition to discuss the definition for linear independence. Finally, we discuss some intuition into this fundamental idea.</em></p>
<h2 id="introduction">Introduction</h2>
<p>A very important concept in the study of vector spaces is that of <em>linear independence</em>. At a high level, a set of vectors is said to be <strong>linearly independent</strong> if you cannot form any vector in the set using any linear combination of the other vectors in the set. If a set of vectors does not have this quality – that is, a vector in the set can be formed from some linear combination of others – then the set is said to be <strong>linearly dependent</strong>.</p>
<p>In this post, we will first present a more fundamental concept, the <em>span</em> of a set of vectors, and then move on to the definition for linear independence. Finally, we will discuss high-level intuition for why the concept of linear independence is so important.</p>
<h2 id="span">Span</h2>
<p>Given a set of vectors, the <strong>span</strong> of these vectors is the set of the vectors that can be “generated” by taking linear combinations of these vectors. More rigorously,</p>
<p><span style="color:#0060C6"><strong>Definition 1 (span):</strong> Given a <a href="https://mbernste.github.io/posts/vector_spaces/">vector space</a>, $(\mathcal{V}, \mathcal{F})$ and a set of vectors $S := \{ \boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_n \} \subseteq \mathcal{V}$, the <strong>span</strong> of $S$, denoted $\text{Span}(S)$, is the set of all vectors that can be formed by taking a linear combination of vectors in $S$. That is,</span></p>
<center><span style="color:#0060C6">$$\text{Span}(S) := \left\{ \sum_{i=1}^n c_i\boldsymbol{x}_i \mid c_1, \dots, c_n \in \mathcal{F} \right\}$$ </span></center>
<p>Intuitively, you can think of $S$ as a set of “building blocks” and $\text{Span}(S)$ as the set of all vectors that can be “constructed” from these building blocks. To illustrate this point, we show two vectors, $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$ (left), and two examples of vectors in their span (right):</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/span_of_vectors.png" alt="drawing" width="600" /></center>
<p>In this example we see that we can construct <em>ANY</em> two-dimensional vector from $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$. Thus, the span of these two vectors is all of $\mathbb{R}^2$! This is not always the case. In the figure below, we show an example of two vectors in $\mathbb{R}^2$ with a different span:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/span_of_vectors_2.png" alt="drawing" width="600" /></center>
<p>Here, $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$ don’t span all of $\mathbb{R}^2$, but rather, only the line on which $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$ lie.</p>
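<p>One way to check span membership numerically: a vector $\boldsymbol{b}$ lies in $\text{Span}(S)$ exactly when appending $\boldsymbol{b}$ as an extra column to the matrix whose columns are the vectors in $S$ does not increase the rank. The following is a minimal sketch using NumPy. The helper name <code>in_span</code>, the tolerance, and the example vectors are assumptions chosen for illustration; they are not the vectors drawn in the figures above.</p>

```python
import numpy as np

def in_span(vectors, b, tol=1e-10):
    """Return True if b is a linear combination of the given vectors."""
    A = np.column_stack(vectors)          # columns are the vectors in S
    augmented = np.column_stack([A, b])   # append b as an extra column
    # b is in the span exactly when adding it does not increase the rank
    rank_A = np.linalg.matrix_rank(A, tol=tol)
    rank_aug = np.linalg.matrix_rank(augmented, tol=tol)
    return bool(rank_A == rank_aug)

# Two vectors pointing in different directions (hypothetical values)
x1, x2 = np.array([1.0, 0.5]), np.array([-0.5, 1.0])
# Two vectors lying on the same line through the origin (hypothetical values)
y1, y2 = np.array([1.0, 2.0]), np.array([-2.0, -4.0])

target = np.array([3.0, 1.0])
spans_plane = in_span([x1, x2], target)               # target reachable from x1, x2?
spans_line = in_span([y1, y2], target)                # target reachable from y1, y2?
on_line = in_span([y1, y2], np.array([0.5, 1.0]))     # a point on the shared line
print(spans_plane, spans_line, on_line)
```

<p>The rank comparison is used here instead of solving for the coefficients directly because it works the same way whether or not the vectors in $S$ are themselves redundant.</p>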
<h2 id="linear-dependence-and-independence">Linear dependence and independence</h2>
<p>Given a <a href="https://mbernste.github.io/posts/vector_spaces/">vector space</a>, $(\mathcal{V}, \mathcal{F})$, and a set of vectors $S := \{ \boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_n \} \subseteq \mathcal{V}$, the vectors are said to be <strong>linearly independent</strong> if each vector lies outside the span of the remaining vectors. Otherwise, the vectors are said to be <strong>linearly dependent</strong>. More rigorously,</p>
<p><span style="color:#0060C6"><strong>Definition 2 (linear independence):</strong> Given a <a href="https://mbernste.github.io/posts/vector_spaces/">vector space</a>, $(\mathcal{V}, \mathcal{F})$ and a set of vectors $S := \{ \boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_n \} \subseteq \mathcal{V}$, $S$ is called <strong>linearly independent</strong> if for each vector $\boldsymbol{x}_i \in S$, it holds that $\boldsymbol{x}_i \notin \text{Span}(S \setminus \{ \boldsymbol{x}_i \})$.</span></p>
<p><span style="color:#0060C6"><strong>Definition 3 (linear dependence):</strong> Given a <a href="https://mbernste.github.io/posts/vector_spaces/">vector space</a>, $(\mathcal{V}, \mathcal{F})$ and a set of vectors $S := \{ \boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_n \} \subseteq \mathcal{V}$, $S$ is called <strong>linearly dependent</strong> if there exists a vector $\boldsymbol{x}_i \in S$, such that $\boldsymbol{x}_i \in \text{Span}(S \setminus \{ \boldsymbol{x}_i \})$.</span></p>
<p>Said differently, a set of vectors is linearly independent if you cannot form any of the vectors in the set using a linear combination of the other vectors. Below we demonstrate a set of linearly independent vectors (left) and a set of linearly dependent vectors (right):</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/linear_independence.png" alt="drawing" width="600" /></center>
<p>Why is the set on the right linearly dependent? As you can see below, we can use any two of the vectors to construct the third:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/linear_independence_symmetry.png" alt="drawing" width="600" /></center>
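<p>The definition of linear independence can be translated fairly directly into code: for each vector, test whether it lies in the span of the others. The sketch below does this with a least-squares fit, which finds the linear combination of the other vectors that comes closest to the vector being tested. The function names, the tolerance, and the example vectors are assumptions for illustration, and the sketch assumes the set contains at least two vectors.</p>

```python
import numpy as np

def in_span_of_others(vectors, i, tol=1e-10):
    """Check whether vectors[i] is a linear combination of the other vectors."""
    others = [v for j, v in enumerate(vectors) if j != i]
    A = np.column_stack(others)
    # Least squares finds the combination of the others closest to vectors[i]
    coeffs, *_ = np.linalg.lstsq(A, vectors[i], rcond=None)
    # A (near) zero residual means vectors[i] can be formed from the others
    return bool(np.linalg.norm(A @ coeffs - vectors[i]) < tol)

def is_linearly_independent(vectors):
    """True if no vector lies in the span of the remaining vectors."""
    return not any(in_span_of_others(vectors, i) for i in range(len(vectors)))

# Hypothetical example sets in R^2
independent_set = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
dependent_set = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]

result_independent = is_linearly_independent(independent_set)
result_dependent = is_linearly_independent(dependent_set)
print(result_independent, result_dependent)
```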
<p>Linear dependence/independence among a set of vectors implies another key property about these vectors stated in the following theorems (proven in the Appendix to this post):</p>
<p><span style="color:#0060C6"><strong>Theorem 1 (forming the zero vector from linearly dependent vectors):</strong> Given a <a href="https://mbernste.github.io/posts/vector_spaces/">vector space</a>, $(\mathcal{V}, \mathcal{F})$ and a set of vectors $S := \{ \boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_n \} \subseteq \mathcal{V}$, $S$ is linearly dependent if and only if there exists a set of coefficients $c_1, \dots, c_n$ for which $\sum_{i=1}^n c_i\boldsymbol{x}_i = \boldsymbol{0}$ and at least one coefficient is non-zero.</span></p>
<p><span style="color:#0060C6"><strong>Corollary:</strong> $S$ is linearly independent if and only if the only assignment of values to coefficients $c_1, \dots, c_n$ for which $\sum_{i=1}^n c_i\boldsymbol{x}_i = \boldsymbol{0}$ is the assignment for which all $c_1, \dots, c_n$ are zero.</span></p>
<p>Theorem 1 essentially says that if a set of vectors is linearly dependent, then one can construct the zero vector using a linear combination of the vectors that isn’t the “trivial” linear combination of setting all of the coefficients to zero. This is illustrated below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/linear_dependence_zero_vector.png" alt="drawing" width="300" /></center>
<p>In contrast, the Corollary says that the only way to construct the zero vector from a set of linearly independent vectors is by setting all of the coefficients to zero.</p>
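<p>To see Theorem 1 in action, we can ask NumPy for a set of non-trivial coefficients that combine linearly dependent vectors into the zero vector. The sketch below reads such coefficients off the singular value decomposition. The matrix <code>X</code> is an assumed example in which the third column is the sum of the first two; using the SVD is one convenient way to find the coefficients, not the only one.</p>

```python
import numpy as np

# Columns are the vectors x_1, x_2, x_3; here x_3 = x_1 + x_2 (assumed example)
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

# With full_matrices=True (the default), Vt is 3 x 3. Because X has rank 2,
# the last row of Vt is orthogonal to the row space of X, so it lies in the
# null space of X.
_, singular_values, Vt = np.linalg.svd(X)
c = Vt[-1]                 # coefficients c_1, c_2, c_3 (unit length, so not all zero)

combination = X @ c        # c_1 x_1 + c_2 x_2 + c_3 x_3
print("coefficients:", c)
print("combination:", combination)
```

<p>The coefficients are determined only up to scale: multiplying all of them by the same non-zero number gives another valid non-trivial combination.</p>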
<h2 id="intuition-for-linear-independence">Intuition for linear independence</h2>
<p>There are two ways I think about linear independence: in terms of information content and in terms of <a href="https://mbernste.github.io/posts/intrinsic_dimensionality/">intrinsic dimensionality</a>. Let me explain.</p>
<p>First, if a set of vectors is linearly dependent, then in a sense there is “redundant information” within these vectors. What do I mean by “redundant information”? In a linearly dependent set, there exists a vector for which, if we removed that vector, the span of the set would remain the same! On the other hand, for a linearly independent set of vectors, each vector is vital for defining the span of the set’s vectors. If you remove even one vector, the span of the vectors will change (it will become smaller)!</p>
<p>One can also think about the concept of linear dependence/independence in terms of <a href="https://mbernste.github.io/posts/intrinsic_dimensionality/">intrinsic dimensionality</a>. That is, a set of $n$ linearly independent vectors $S := \{ \boldsymbol{x}_1, \dots, \boldsymbol{x}_n \}$ spans a space with an <a href="https://mbernste.github.io/posts/intrinsic_dimensionality/">intrinsic dimensionality</a> of $n$ because in order to specify any vector $\boldsymbol{v}$ in the span of these vectors, one must specify the coefficients $c_1, \dots, c_n$ to construct $\boldsymbol{v}$ from the vectors in $S$. That is,</p>
\[\boldsymbol{v} = c_1\boldsymbol{x}_1 + \dots + c_n\boldsymbol{x}_n\]
<p>However, if $S$ is linearly dependent, then we can throw away “redundant” vectors in $S$. In fact, we see that the intrinsic dimensionality of the span of a linearly dependent set $S$ is the size of the largest subset of $S$ that is linearly independent!</p>
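<p>Both intuitions can be checked numerically, since the dimension of the span of a set of vectors equals the rank of the matrix whose columns are those vectors. In the sketch below (the helper <code>span_dimension</code> and the vectors are assumptions for illustration), removing a redundant vector leaves the dimension unchanged, whereas removing a vector from an independent set shrinks it.</p>

```python
import numpy as np

def span_dimension(vectors):
    """Dimension of the span, computed as the rank of the column matrix."""
    return int(np.linalg.matrix_rank(np.column_stack(vectors)))

x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([0.0, 1.0, 0.0])
x3 = x1 + 2.0 * x2            # redundant: already in Span({x1, x2})

dim_all = span_dimension([x1, x2, x3])       # the dependent set
dim_without_x3 = span_dimension([x1, x2])    # drop the redundant vector
dim_without_x2 = span_dimension([x1])        # drop a vector from an independent set
print(dim_all, dim_without_x3, dim_without_x2)
```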
<h2 id="appendix">Appendix</h2>
<p><span style="color:#0060C6"><strong>Theorem 1 (forming the zero vector from linearly dependent vectors):</strong> Given a <a href="https://mbernste.github.io/posts/vector_spaces/">vector space</a>, $(\mathcal{V}, \mathcal{F})$ and a set of vectors $S := \{ \boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_n \} \subseteq \mathcal{V}$, $S$ is linearly dependent if and only if there exists an assignment of values to coefficients $c_1, \dots, c_n$ for which $\sum_{i=1}^n c_i\boldsymbol{x}_i = \boldsymbol{0}$ and at least one coefficient is non-zero.</span></p>
<p><strong>Proof:</strong></p>
<p>We must prove both directions of the “if and only if”. Let’s start by proving that if there exists an assignment of coefficients that are not all zero for which $\sum_{i=1}^n c_i\boldsymbol{x}_i = \boldsymbol{0}$, then $S$ is linearly dependent.</p>
<p>Let’s assume that we have a set of coefficients $c_1, \dots c_n$ such that</p>
\[\sum_{i=1}^n c_i\boldsymbol{x}_i = \boldsymbol{0}\]
<p>and that not all of the coefficients are zero. Let $C$ be the set of indices of the coefficients that are not zero. Then, we can write</p>
\[\sum_{i \in C} c_i\boldsymbol{x}_i = \boldsymbol{0}\]
<p>There are now two scenarios to consider: $C$ is of size 1 and $C$ is of size greater than 1. Let’s first assume $C$ is of size 1 and let $k$ be the index of the one and only coefficient that is non-zero. Then</p>
\[c_k\boldsymbol{x}_k = \boldsymbol{0}\]
<p>We see that $\boldsymbol{x}_k$ must be the zero vector (because $c_k$ is nonzero). This implies that the zero vector is in $S$. The zero vector is in the span of the remaining vectors in $S$ (since we can form the zero vector from a linear combination of the remaining vectors by simply setting their coefficients to zero). This implies that $S$ is linearly dependent.</p>
<p>Let’s assume that $C$ is of size greater than one. Then for any $k \in C$, we can write:</p>
\[\begin{align*} \sum_{i \in C \setminus \{ k \} } c_i \boldsymbol{x}_i &= - c_k \boldsymbol{x}_k \\ \implies -\frac{1}{c_k} \sum_{i \in C \setminus \{ k \} } c_i\boldsymbol{x}_i &= \boldsymbol{x}_k \\ \implies \sum_{i \in C \setminus \{ k \}} -\frac{c_i}{c_k} \boldsymbol{x}_i &= \boldsymbol{x}_k \end{align*}\]
<p>Thus we see that $\boldsymbol{x}_k$ is in the span of the remaining vectors and thus $S$ is linearly dependent.</p>
<p>Now we will prove the other direction of the “if and only if”: if $S$ is linearly dependent then there exists an assignment of coefficients that are not all zero for which $\sum_{i=1}^n c_i\boldsymbol{x}_i = \boldsymbol{0}$.</p>
<p>If $S$ is linearly dependent then there exists a vector in $S$ that we can form using a linear combination of the remaining vectors in $S$. Without loss of generality, let $\boldsymbol{x}_n$ be this vector and let $\boldsymbol{x}_1, \dots, \boldsymbol{x}_{n-1}$ be the remaining vectors. There are now two scenarios to consider: $\boldsymbol{x}_n$ is the zero vector or $\boldsymbol{x}_n$ is not the zero vector.</p>
<p>If $\boldsymbol{x}_n$ is the zero vector, then we see that we can assign zero to the coefficients $c_1, \dots, c_{n-1}$ and <em>any</em> non-zero value to $c_n$ and the following will hold:</p>
\[c_n \boldsymbol{x}_n + \sum_{i=1}^{n-1} c_i \boldsymbol{x}_i = \boldsymbol{0}\]
<p>Thus there exists an assignment of coefficients that are not all zero for which $\sum_{i=1}^n c_i\boldsymbol{x}_i = \boldsymbol{0}$.</p>
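<p>Now let’s say that $\boldsymbol{x}_n$ is not the zero vector. Because $\boldsymbol{x}_n$ lies in the span of the remaining vectors, there exist coefficients $c_1, \dots, c_{n-1}$ such that</p>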
\[\sum_{i=1}^{n-1} c_i \boldsymbol{x}_i = \boldsymbol{x}_n \implies \left(\sum_{i=1}^{n-1} c_i \boldsymbol{x}_i\right) - \boldsymbol{x}_n = \boldsymbol{0}\]
<p>Here, the coefficient for $\boldsymbol{x}_n$ is $-1$, which is non-zero. Thus there exists an assignment of coefficients that are not all zero for which $\sum_{i=1}^n c_i\boldsymbol{x}_i = \boldsymbol{0}$.</p>
<p>$\square$</p>Matthew N. BernsteinA very important concept in linear algebra is that of linear independence. In this blog post we present the definition for the span of a set of vectors. Then, we use this definition to discuss the definition for linear independence. Finally, we discuss some intuition into this fundamental idea.Functionals and functional derivatives2022-04-10T00:00:00-07:002022-04-10T00:00:00-07:00https://mbernste.github.io/posts/functional_derivatives<p><em>The calculus of variations is a field of mathematics that deals with the optimization of functions of functions, called functionals. This topic was not taught to me in my computer science education, but it lies at the foundation of a number of important concepts and algorithms in the data sciences such as gradient boosting and variational inference. In this post, I will provide an explanation of the functional derivative and show how it relates to the gradient of an ordinary multivariate function.</em></p>
<h2 id="introduction">Introduction</h2>
<p>Multivariate calculus concerns itself with infinitesimal changes of numerical functions – that is, functions that accept a vector of real numbers and output a real number:</p>
\[f : \mathbb{R}^n \rightarrow \mathbb{R}\]
<p>In this blog post, we discuss the <strong>calculus of variations</strong>, a field of mathematics that generalizes the ideas in multivariate calculus relating to infinitesimal changes of traditional numeric functions to <em>functions of functions</em>, called <em>functionals</em>. Specifically, given a set of functions, $\mathcal{F}$, a <strong>functional</strong> is a mapping from $\mathcal{F}$ to the real numbers:</p>
\[F : \mathcal{F} \rightarrow \mathbb{R}\]
<p>Functionals are quite prevalent in machine learning and statistical inference. For example, <a href="https://mbernste.github.io/posts/entropy/">information entropy</a> can be considered a functional on probability mass functions. For a given <a href="https://mbernste.github.io/posts/measure_theory_2/">discrete random variable</a>, $X$, entropy can be thought of as a function that accepts as input $X$’s probability mass function, $p_X$, and outputs a real number:</p>
\[H(p_X) := -\sum_{x \in \mathcal{X}} p_X(x) \log p_X(x)\]
<p>where $\mathcal{X}$ is the <a href="https://en.wikipedia.org/wiki/Support_(mathematics)">support</a> of $p_X$.</p>
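<p>In code, a functional is simply a function that takes another function as its argument and returns a number. The sketch below writes entropy this way. The two probability mass functions, <code>uniform</code> and <code>skewed</code>, and the four-element support are assumptions made up for illustration, and the natural logarithm is used.</p>

```python
import math

def entropy(pmf, support):
    """A functional: takes a function (the pmf) and returns a real number."""
    return -sum(pmf(x) * math.log(pmf(x)) for x in support if pmf(x) > 0)

support = [1, 2, 3, 4]

def uniform(x):
    # Every outcome is equally likely
    return 0.25

def skewed(x):
    # Most of the probability mass sits on the first outcome
    return {1: 0.7, 2: 0.1, 3: 0.1, 4: 0.1}[x]

h_uniform = entropy(uniform, support)
h_skewed = entropy(skewed, support)
print(h_uniform, h_skewed)
```

<p>Different input functions give different output numbers; here the uniform distribution has the larger entropy of the two.</p>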
<p>Another example of a functional is the <a href="https://mbernste.github.io/posts/elbo/">evidence lower bound (ELBO)</a>: a function that, like entropy, operates on probability distributions. The ELBO is a foundational quantity used in the popular <a href="https://mbernste.github.io/posts/em/">EM algorithm</a> and <a href="https://mbernste.github.io/posts/variational_inference/">variational inference</a> used for performing statistical inference with probabilistic models.</p>
<p>In this blog post, we will review some concepts in traditional calculus such as partial derivatives, directional derivatives, and gradients in order to introduce the definition of the <strong>functional derivative</strong>, which is simply the generalization of the gradient of numeric functions to functionals.</p>
<h2 id="a-review-of-derivatives-and-gradients">A review of derivatives and gradients</h2>
<p>In this section, we will introduce a few important concepts in multivariate calculus: derivatives, partial derivatives, directional derivatives, and gradients.</p>
<h3 id="derivatives">Derivatives</h3>
<p>Before going further, let’s quickly review the basic definition of the derivative for a univariate function $g$ that maps real numbers to real numbers. That is,</p>
\[g : \mathbb{R} \rightarrow \mathbb{R}\]
<p>The derivative of $g$ at input $x$, denoted $\frac{dg(x)}{dx}$, describes the rate of change of $g$ at $x$. It is defined rigorously as</p>
\[\frac{dg(x)}{dx} := \lim_{h \rightarrow 0}\frac{g(x+h)-g(x)}{h}\]
<p>Geometrically, $\frac{dg(x)}{dx}$ is the slope of the line that is tangential to $g$ at $x$ as depicted below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/derivative.png" alt="drawing" width="550" /></center>
<p>In this schematic, we depict the value of $h$ getting smaller and smaller. As it does, the slope of the line approaches that of the line that is tangential to $g$ at $x$. This slope is the derivative $\frac{dg(x)}{dx}$.</p>
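<p>The same shrinking-$h$ picture can be reproduced numerically. The sketch below evaluates the difference quotient from the definition for a sequence of decreasing step sizes. The choice of $g = \sin$ and $x = 1$ is an assumption made for illustration; it is convenient because the exact derivative, $\cos(x)$, is known.</p>

```python
import math

def difference_quotient(g, x, h):
    """Slope of the line through (x, g(x)) and (x + h, g(x + h))."""
    return (g(x + h) - g(x)) / h

g = math.sin        # assumed example function; its exact derivative is cos
x = 1.0
exact = math.cos(x)

# Absolute error of the difference quotient for shrinking step sizes
step_sizes = [1.0, 0.1, 0.01, 0.001]
errors = [abs(difference_quotient(g, x, h) - exact) for h in step_sizes]
for h, err in zip(step_sizes, errors):
    print(h, err)
```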
<h3 id="partial-derivatives">Partial derivatives</h3>
<p>We will now consider a continuous <em>multivariate</em> function $f$ that maps real-valued vectors $\boldsymbol{x} \in \mathbb{R}^n$ to real numbers. That is,</p>
\[f: \mathbb{R}^n \rightarrow \mathbb{R}\]
<p>Given $\boldsymbol{x} \in \mathbb{R}^n$, the <strong>partial derivative</strong> of $f$ with respect to the $i$th component of $\boldsymbol{x}$, denoted $\frac{\partial f(\boldsymbol{x})}{\partial x_i}$ is simply the derivative of $f$ if we hold all the components of $\boldsymbol{x}$ fixed, except for the $i$th component. Said differently, it tells us the rate of change of $f$ with respect to the $i$th dimension of the vector space in which $\boldsymbol{x}$ resides! This can be visualized below for a function $f : \mathbb{R}^2 \rightarrow \mathbb{R}$:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/partial_derivative.png" alt="drawing" width="450" /></center>
<p>As seen above, the partial derivative $\frac{\partial f(\boldsymbol{x})}{\partial x_1}$ is simply the derivative of the function $f(x_1, x_2)$ when holding $x_2$ fixed. That is, it is the slope of the line tangent to the function $f(x_1, x_2)$ when $x_2$ is held fixed.</p>
<h3 id="directional-derivatives">Directional derivatives</h3>
<p>We can see that the partial derivative of $f(\boldsymbol{x})$ with respect to the $i$th dimension of the vector space can be expressed as</p>
\[\frac{\partial f(\boldsymbol{x})}{\partial x_i} := \lim_{h \rightarrow 0} \frac{f(\boldsymbol{x} + h\boldsymbol{e}_i) - f(\boldsymbol{x})}{h}\]
<p>where $\boldsymbol{e}_i$ is the $i$th <a href="https://en.wikipedia.org/wiki/Standard_basis">standard basis vector</a> – that is, the vector of all zeroes except for a one in the $i$th position.</p>
<p>Geometrically, we can view the $i$th partial derivative of $f(\boldsymbol{x})$ as $f$’s rate of change along the direction of the $i$th standard basis vector of the vector space.</p>
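<p>The limit above translates almost verbatim into a finite-difference approximation: take a small step $h$ along $\boldsymbol{e}_i$ and divide the change in $f$ by $h$. The sketch below is a minimal illustration. The function $f(x_1, x_2) = x_1^2 + 3x_1x_2$, the step size, and the helper name <code>partial_derivative</code> are assumptions, and note that the code indexes dimensions from zero.</p>

```python
import numpy as np

def partial_derivative(f, x, i, h=1e-6):
    """Approximate the i-th partial derivative of f at x by taking a small
    step along the i-th standard basis vector e_i (zero-based index)."""
    e_i = np.zeros_like(x)
    e_i[i] = 1.0
    return (f(x + h * e_i) - f(x)) / h

def f(x):
    # Assumed example: f(x_1, x_2) = x_1^2 + 3 x_1 x_2
    return x[0] ** 2 + 3.0 * x[0] * x[1]

x = np.array([1.0, 2.0])
df_dx1 = partial_derivative(f, x, 0)   # by hand: 2*x_1 + 3*x_2 = 8 at this point
df_dx2 = partial_derivative(f, x, 1)   # by hand: 3*x_1 = 3 at this point
print(df_dx1, df_dx2)
```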
<p>Thinking along these lines, there is nothing stopping us from generalizing this idea to <em>any unit vector</em> rather than just the standard basis vectors. Given some unit vector $\boldsymbol{v}$, we define the <strong>directional derivative</strong> of $f(\boldsymbol{x})$ along the direction of $\boldsymbol{v}$ as</p>
\[D_{\boldsymbol{v}}f(\boldsymbol{x}) := \lim_{h \rightarrow 0} \frac{f(\boldsymbol{x} + h\boldsymbol{v}) - f(\boldsymbol{x})}{h}\]
<p>Geometrically, this is simply the rate of change of $f$ along the direction at which $\boldsymbol{v}$ is pointing! This can be viewed schematically below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/directional_derivative.png" alt="drawing" width="450" /></center>
<p>For a given vector $\boldsymbol{v}$, we can derive a formula for $D_{\boldsymbol{v}}f(\boldsymbol{x})$. That is, we can show that:</p>
\[D_{\boldsymbol{v}}f(\boldsymbol{x}) = \sum_{i=1}^n \left( \frac{\partial f(\boldsymbol{x})}{\partial x_i} \right) v_i\]
<p>See Theorem 1 in the Appendix of this post for a proof of this equation. Now, if we define the vector of all partial derivatives of $f$ at $\boldsymbol{x}$ as</p>
\[\nabla f(\boldsymbol{x}) := \begin{bmatrix}\frac{\partial f(\boldsymbol{x})}{\partial x_1} & \frac{\partial f(\boldsymbol{x})}{\partial x_2} & \dots & \frac{\partial f(\boldsymbol{x})}{\partial x_n} \end{bmatrix}\]
<p>Then we can represent the directional derivative as simply the <a href="https://en.wikipedia.org/wiki/Dot_product">dot product</a> between $\nabla f(\boldsymbol{x})$ and $\boldsymbol{v}$:</p>
\[D_{\boldsymbol{v}}f(\boldsymbol{x}) = \nabla f(\boldsymbol{x}) \cdot \boldsymbol{v}\]
<p>This vector, $\nabla f(\boldsymbol{x})$, is called the <strong>gradient vector</strong> of $f$ at $\boldsymbol{x}$.</p>
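<p>As a quick numerical sanity check, we can compare the two expressions for the directional derivative: the difference quotient from its definition and the dot product of the gradient with $\boldsymbol{v}$. The example function $f(x_1, x_2) = x_1^2 + 3x_1x_2$, its hand-computed gradient, the point $\boldsymbol{x}$, and the unit vector $\boldsymbol{v}$ below are all assumptions chosen for illustration.</p>

```python
import numpy as np

def f(x):
    # Assumed example: f(x_1, x_2) = x_1^2 + 3 x_1 x_2
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def grad_f(x):
    # Gradient of the example function, computed by hand
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

x = np.array([1.0, 2.0])
v = np.array([3.0, 4.0]) / 5.0          # a unit vector

h = 1e-6
limit_form = (f(x + h * v) - f(x)) / h  # difference quotient from the definition
dot_form = grad_f(x) @ v                # gradient dotted with v
print(limit_form, dot_form)
```

<p>The two numbers agree up to the error introduced by using a small but finite $h$.</p>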
<h3 id="gradients">Gradients</h3>
<p>As described above, the <strong>gradient vector</strong>, $\nabla f(\boldsymbol{x})$, is the vector constructed by taking the partial derivative of $f$ at $\boldsymbol{x}$ along each basis vector. It turns out that the gradient vector points in the <em>direction of steepest ascent</em> along $f$’s surface at $\boldsymbol{x}$. This can be shown schematically below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/gradient.png" alt="drawing" width="450" /></center>
<p>We prove this property of the gradient vector in Theorem 2 of the Appendix to this post.</p>
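<p>Before looking at the proof, the claim can be checked empirically by brute force: compute the directional derivative along many unit vectors and see which one gives the largest value. The sketch below does this in two dimensions. The gradient used is that of the assumed example $f(x_1, x_2) = x_1^2 + 3x_1x_2$ at an assumed point, and the number of sampled directions is arbitrary.</p>

```python
import numpy as np

def grad_f(x):
    # Gradient of the assumed example f(x_1, x_2) = x_1^2 + 3 x_1 x_2
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

x = np.array([1.0, 2.0])
g = grad_f(x)

# Unit vectors at evenly spaced angles around the circle
angles = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# Directional derivative along each unit vector: gradient dotted with v
directional = directions @ g
best = directions[np.argmax(directional)]      # sampled direction of steepest ascent
gradient_direction = g / np.linalg.norm(g)     # unit vector along the gradient
print(best, gradient_direction)
```

<p>The best sampled direction matches the normalized gradient up to the angular resolution of the sampling.</p>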
<h2 id="functional-derivatives">Functional derivatives</h2>
<p>Now, we will seek to generalize the notion of gradients to functionals. We’ll let $\mathcal{F}$ be some set of functions, and for simplicity, we’ll let each $f$ be a continuous real-valued function. That is, for each $f \in \mathcal{F}$, we have $f: \mathbb{R} \rightarrow \mathbb{R}$. Then, we’ll consider a functional $F$ that maps each $f \in \mathcal{F}$ to a number. That is,</p>
\[F: \mathcal{F} \rightarrow \mathbb{R}\]
<p>Now, we’re going to spoil the punchline with the definition for the functional derivative:</p>
<p><span style="color:#0060C6"><strong>Definition 1 (Functional derivative):</strong> Given a function $f \in \mathcal{F}$, the <strong>functional derivative</strong> of $F$ at $f$, denoted $\frac{\partial{F}}{\partial f}$, is defined to be the function for which: </span></p>
<p><span style="color:#0060C6">\(\begin{align*}\int \frac{\partial F}{\partial f}(x) \eta(x) \ dx &= \lim_{h \rightarrow 0}\frac{F(f + h \eta) - F(f)}{h} \\ &= \frac{d F(f + h\eta)}{dh}\bigg\rvert_{h=0}\end{align*}\)</span></p>
<p><span style="color:#0060C6">where $h$ is a scalar and $\eta$ is an arbitrary function in $\mathcal{F}$.</span></p>
<p>Whoa. What is going on here? How on earth does this define the functional derivative? And why is the functional derivative, $\frac{\partial{F}}{\partial f}$, buried inside such a seemingly complicated equation?</p>
<p>Let’s break it down.</p>
<p>First, notice the similarity of the right-hand side of the equation of Definition 1 to the definition of the directional derivative:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/directional_gradient_functional_derivative.png" alt="drawing" width="350" /></center>
<p>Indeed, the equation in Definition 1 describes the analogue of the directional derivative for functionals! That is, it describes the rate of change of $F$ at $f$ in the direction of the function $\eta$!</p>
<p>How does this work? As we shrink $h$ down to an infinitesimally small number, $f + h \eta$ will become arbitrarily close to $f$. In the illustration below, we see an example function $f$ (red) and another function $\eta$ (blue). As $h$ gets smaller, the function $f + h\eta$ (purple) becomes more similar to $f$:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/function_variationn.png" alt="drawing" width="450" /></center>
<p>Thus, we see that $h \eta$ is the “infinitesimal” change to $f$ that is analogous to the infinitesimal change to $\boldsymbol{x}$ that we describe by $h\boldsymbol{v}$ in the definition of the directional derivative. The quantity $h \eta$ is called a <strong>variation</strong> of $f$ (hence the name “calculus of variations”).</p>
<p>Now, so far we have only shown that the equation in Definition 1 describes something analogous to the directional derivative for multivariate numerical functions. We showed this by comparing the right-hand side of the equation in Definition 1 to the definition of the directional derivative. However, as Definition 1 states, the functional derivative itself is defined to be the function $\frac{\partial F}{\partial f}$ within the integral on the left-hand side of the equation. What is going on here? Why is <em>this</em> the functional derivative?</p>
<p>Now, it is time to recall the gradient for traditional multivariate functions. Specifically, notice the similarity between the alternative formulation of the directional derivative, which uses the gradient, and the left-hand side of the equation in Definition 1:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/functional_derivative_gradient.png" alt="drawing" width="450" /></center>
<p>Notice that these equations have similar forms. Instead of a summation in the definition of the directional derivative, we have an integral in the equation for Definition 1. Moreover, instead of summing over elements of the vector $\boldsymbol{v}$, we “sum” (using an integral) each value of $\eta(x)$. Lastly, instead of each partial derivative of $f$, we now have each value of the function $\frac{\partial F}{\partial f}$ for each $x$. This function, $\frac{\partial F}{\partial f}(x)$, is analogous to the gradient! It is thus called the functional derivative!</p>
<p>To drive this home further, recall that we can represent the directional derivative as the dot product between the gradient vector and $\boldsymbol{v}$:</p>
\[D_{\boldsymbol{v}}f(\boldsymbol{x}) := \nabla f(\boldsymbol{x}) \cdot \boldsymbol{v}\]
<p>To make this relationship clearer, we note that the dot product is an <a href="https://en.wikipedia.org/wiki/Inner_product_space">inner product</a>. Thus, we can write this definition in a more general way as</p>
\[D_{\boldsymbol{v}}f(\boldsymbol{x}) := \langle \nabla f(\boldsymbol{x}), \boldsymbol{v} \rangle\]
<p>We also recall that a valid inner product between continuous functions $f$ and $g$ is</p>
\[\langle f, g \rangle := \int f(x)g(x) dx\]
<p>Thus, we see that</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/functional_derivative_gradient_w_inner_product.png" alt="drawing" width="450" /></center>
<p>Said differently, the functional derivative of a functional, $F$, at a function $f$, denoted $\frac{\partial F}{\partial f}$, is the function for which, given any arbitrary function $\eta$, the inner product between $\frac{\partial F}{\partial f}$ and $\eta$ is the directional derivative of $F$ in the direction of $\eta$!</p>
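<p>To make this less abstract, here is a small numerical sketch of Definition 1 on a discretized grid. It uses an assumed example functional, $F(f) = \int_0^1 f(x)^2 \, dx$, for which expanding $F(f + h\eta)$ and differentiating with respect to $h$ at $h = 0$ gives $\int_0^1 2f(x)\eta(x) \, dx$, so the functional derivative is $2f(x)$. The grid, the restriction to the interval $[0, 1]$, and the particular choices of $f$ and $\eta$ are all assumptions for illustration; integrals are approximated by Riemann sums.</p>

```python
import numpy as np

# Discretize [0, 1]; integrals become Riemann sums weighted by dx
x = np.linspace(0.0, 1.0, 1001)
dx = x[1] - x[0]

def F(f_vals):
    # Assumed example functional: F(f) = integral of f(x)^2 over [0, 1]
    return np.sum(f_vals ** 2) * dx

f_vals = np.sin(2.0 * np.pi * x)       # the function f at which we differentiate
eta = x * (1.0 - x)                    # an arbitrary "direction" eta

# Right-hand side of Definition 1: rate of change of F along eta
h = 1e-6
rhs = (F(f_vals + h * eta) - F(f_vals)) / h

# Left-hand side: inner product of the candidate derivative 2 f(x) with eta
dF_df = 2.0 * f_vals
lhs = np.sum(dF_df * eta) * dx
print(lhs, rhs)
```

<p>On the grid, the function values play the role of the vector components, and the Riemann sum plays the role of the dot product, which is exactly the analogy with the gradient described above.</p>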
<h2 id="an-example-the-functional-derivative-of-entropy">An example: the functional derivative of entropy</h2>
<p>As a toy example, let’s derive the functional derivative of <a href="https://mbernste.github.io/posts/entropy/">information entropy</a>. Recall from the beginning of this post that the entropy $H$ of a discrete random variable $X$ can be viewed as a functional of $X$’s probability mass function $p_X$. More specifically, $H$ is defined as</p>
\[H(p_X) := \sum_{x \in \mathcal{X}} - p_X(x) \log p_X(x)\]
<p>where $\mathcal{X}$ is the support of $p_X$.</p>
<p>Let’s derive its functional derivative. Because $p_X$ is defined on a discrete set, the integral in Definition 1 becomes a sum over $\mathcal{X}$. Let’s start with an arbitrary probability mass function $\eta : \mathcal{X} \rightarrow [0,1]$. Then, we can write out the equation that defines the functional derivative:</p>
\[\sum_{x \in \mathcal{X}} \frac{\partial H}{\partial p_X}(x) \eta(x) = \frac{d H(p_X + h\eta)}{dh}\bigg\rvert_{h=0}\]
<p>Let’s simplify this equation:</p>
\[\begin{align*}
\sum_{x \in \mathcal{X}} \frac{\partial H}{\partial p_X}(x) \eta(x)
&= \frac{d H(p_X + h\eta)}{dh}\bigg\rvert_{h=0} \\
&= \frac{d}{dh} \sum_{x \in \mathcal{X}} -(p_X(x) + h\eta(x))\log(p_X(x) + h\eta(x))\bigg\rvert_{h=0} \\
&= \sum_{x \in \mathcal{X}} \left( - \eta(x)\log(p_X(x) + h\eta(x)) - \eta(x) \right) \bigg\rvert_{h=0} \\
&= \sum_{x \in \mathcal{X}} (-1 - \log p_X(x))\eta(x)\end{align*}\]
<p>Now we see that $\frac{\partial H}{\partial p_X}(x) = -1 - \log p_X(x)$ and thus, this is the functional derivative!</p>
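<p>We can sanity-check this result numerically by comparing both sides of the defining equation for a concrete case. In the sketch below, the probability mass function <code>p</code> and the direction <code>eta</code> are assumed example values on a three-element support, and the natural logarithm is used. Note that $p_X + h\eta$ is not itself normalized for $h \neq 0$; the entropy formula is simply evaluated on it as written in the derivation.</p>

```python
import numpy as np

def entropy(p):
    # H(p) = -sum_x p(x) log p(x), using the natural logarithm
    return -np.sum(p * np.log(p))

p = np.array([0.5, 0.3, 0.2])      # assumed example probability mass function
eta = np.array([0.2, 0.5, 0.3])    # assumed example direction

# Right-hand side: rate of change of H along eta, via a small finite step
h = 1e-6
rhs = (entropy(p + h * eta) - entropy(p)) / h

# Left-hand side: sum over x of the derived functional derivative times eta
functional_derivative = -1.0 - np.log(p)
lhs = np.sum(functional_derivative * eta)
print(lhs, rhs)
```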
<h2 id="appendix">Appendix</h2>
<p><span style="color:#0060C6"><strong>Theorem 1:</strong> Given a differentiable function $f : \mathbb{R}^n \rightarrow \mathbb{R}$, vectors $\boldsymbol{x}, \boldsymbol{v} \in \mathbb{R}^n$, where $\boldsymbol{v}$ is a unit vector, then $D_{\boldsymbol{v}} f(\boldsymbol{x}) = \sum_{i=1}^n \left( \frac{\partial f(\boldsymbol{x})}{\partial x_i} \right) v_i$.</span></p>
<p><strong>Proof:</strong></p>
<p>Consider $\boldsymbol{x}$ and $\boldsymbol{v}$ to be fixed and let us define the function $g(z) := f(\boldsymbol{x} + z\boldsymbol{v})$. Then,</p>
\[\frac{dg(z)}{dz} = \lim_{h \rightarrow 0} \frac{g(z+h) - g(z)}{h}\]
<p>Evaluating this derivative at $z = 0$, we see that</p>
\[\begin{align*} \frac{dg(z)}{dz}\bigg\rvert_{z=0} &= \lim_{h \rightarrow 0} \frac{g(h) - g(0)}{h} \\ &= \lim_{h \rightarrow 0} \frac{f(\boldsymbol{x} + h\boldsymbol{v}) - f(\boldsymbol{x})}{h} \\ &= D_{\boldsymbol{v}} f(\boldsymbol{x}) \end{align*}\]
<p>We can also express $\frac{dg(z)}{dz}$ another way by applying the <a href="https://en.wikipedia.org/wiki/Chain_rule#Multivariable_case">multivariate chain rule</a>. Doing so, we see that</p>
\[\frac{dg(z)}{dz} = \sum_{i=1}^n D_i f(\boldsymbol{x} + z\boldsymbol{v}) \frac{d (x_i + zv_i)}{dz}\]
<p>where $D_i f(\boldsymbol{x} + z\boldsymbol{v})$ is the partial derivative of $f$ with respect to its $i$th argument when evaluated at $\boldsymbol{x} + z\boldsymbol{v}$. Now, we again evaluate this derivative at $z = 0$ and see that</p>
\[\begin{align*} \frac{dg(z)}{dz}\bigg\rvert_{z=0} &= \sum_{i=1}^n D_i f(\boldsymbol{x}) v_i \\ &= \sum_{i=1}^n \frac{\partial f(\boldsymbol{x})}{\partial x_i} v_i \end{align*}\]
<p>So we have now derived two equivalent forms of $\frac{dg(z)}{dz}\bigg\rvert_{z=0}$. Putting them together we see that</p>
\[D_{\boldsymbol{v}} f(\boldsymbol{x}) = \sum_{i=1}^n \frac{\partial f(\boldsymbol{x})}{\partial x_i} v_i\]
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 2:</strong> Given a differentiable function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ and vector $\boldsymbol{x} \in \mathbb{R}^n$, $f$’s direction of steepest ascent at $\boldsymbol{x}$ is the direction pointed to by the gradient $\nabla f(\boldsymbol{x})$.</span></p>
<p><strong>Proof:</strong></p>
<p>As shown in Theorem 1, given an arbitrary unit vector $\boldsymbol{v} \in \mathbb{R}^n$, the directional derivative $D_{\boldsymbol{v}} f(\boldsymbol{x})$ can be calculated by taking the dot product of the gradient vector with $\boldsymbol{v}$:</p>
\[D_{\boldsymbol{v}} f(\boldsymbol{x}) = \nabla f(\boldsymbol{x}) \cdot \boldsymbol{v}\]
<p>The dot product can be computed as</p>
\[\nabla f(\boldsymbol{x}) \cdot \boldsymbol{v} = ||\nabla f(\boldsymbol{x})|| ||\boldsymbol{v}|| \cos \theta\]
<p>where $\theta$ is the angle between the two vectors. Because $\boldsymbol{v}$ is a unit vector, $||\boldsymbol{v}|| = 1$, and $||\nabla f(\boldsymbol{x})||$ does not depend on $\boldsymbol{v}$. The $\cos$ function is maximized (and equals 1) when $\theta = 0$, and thus the directional derivative is maximized when $\theta = 0$. Thus, the unit vector that maximizes the directional derivative is the vector pointing in the same direction as the gradient, which proves that the gradient points in the direction of steepest ascent.</p>
<p>$\square$</p>Matthew N. BernsteinThe calculus of variations is a field of mathematics that deals with the optimization of functions of functions, called functionals. This topic was not taught to me in my computer science education, but it lies at the foundation of a number of important concepts and algorithms in the data sciences such as gradient boosting and variational inference. In this post, I will provide an explanation of the functional derivative and show how it relates to the gradient of an ordinary multivariate function.