Jekyll 2022-08-28T08:09:42-07:00 https://mbernste.github.io/feed.xml Matthew N. Bernstein Personal website Matthew N. Bernstein Elementary matrices and the general linear group 2022-08-09T00:00:00-07:00 2022-08-09T00:00:00-07:00 https://mbernste.github.io/posts/elementary_matrices<p><em>THIS POST IS CURRENTLY UNDER CONSTRUCTION</em></p> <h2 id="introduction">Introduction</h2> <p>In a <a href="https://mbernste.github.io/posts/systems_linear_equations/">previous blog post</a>, we showed how systems of linear equations can be represented as a matrix equation. For example, the system of linear equations</p> \begin{align*}a_{1,1}x_1 + a_{1,2}x_2 + a_{1,3}x_3 &amp;= b_1 \\ a_{2,1}x_1 + a_{2,2}x_2 + a_{2,3}x_3 &amp;= b_2 \\ a_{3,1}x_1 + a_{3,2}x_2 + a_{3,3}x_3 &amp;= b_3 \end{align*} <p>can be represented succinctly as</p> $\boldsymbol{Ax} = \boldsymbol{b}$ <p>where $\boldsymbol{A}$ is the matrix of coefficients $a_{1,1}, a_{1,2}, \dots, a_{3,3}$ and $\boldsymbol{b}$ is the vector of constants $b_1, b_2,$ and $b_3$. Furthermore, we noted that this system will have exactly one solution if $\boldsymbol{A}$ is an <a href="https://mbernste.github.io/posts/inverse_matrices/">invertible matrix</a>.</p> <p>In this post, we will discuss how one can solve for this exact solution using a series of operations on the system that involve a particular class of invertible matrices called the <strong>elementary matrices</strong>. We will show that any invertible matrix can be converted to another invertible matrix by performing some sequence of matrix multiplications with elementary matrices. In fact, this reveals an elegant mathematical structure regarding invertible matrices: they form a <strong>group</strong>. 
That is, you can always go from one invertible matrix to another via a series of multiplications by elementary matrices!</p> <h2 id="elementary-row-operations">Elementary row operations</h2> <p>Before digging into matrices, let’s first discuss how one can go about solving a system of linear equations. Say we have a system with three equations and three variables:</p> \begin{align*}a_{1,1}x_1 + a_{1,2}x_2 + a_{1,3}x_3 &amp;= b_1 \\ a_{2,1}x_1 + a_{2,2}x_2 + a_{2,3}x_3 &amp;= b_2 \\ a_{3,1}x_1 + a_{3,2}x_2 + a_{3,3}x_3 &amp;= b_3 \end{align*} <p>To solve such a system, our goal is to perform simple algebraic operations on these equations until we convert the system to one with the following form:</p> \begin{align*}x_1 &amp;= c_1 \\ x_2 &amp;= c_2 \\ x_3 &amp;= c_3 \end{align*} <p>where $c_1, c_2$, and $c_3$ are the solutions to the system – that is, they are the values we can assign to $x_1, x_2$, and $x_3$ so that all of the equations in the system are valid.</p> <p>Now, what kinds of algebraic operations can we perform on the equations of the system to solve it? There are three main categories, called <strong>elementary row operations</strong> (we’ll soon see why they have this name):</p> <ol> <li><strong>Scalar multiplication</strong>: Simply multiply both sides of one of the equations by a nonzero scalar. For example, multiplying both sides of $2 x_1 + 4 x_2 = 10$ by $1/2$ gives $x_1 + 2 x_2 = 5$.</li> <li><strong>Row swap</strong>: You can move one equation above or below another. Note, the order in which the equations are written is irrelevant to the solution, so swapping rows is really just changing how we organize the equations. Nonetheless, this organization will be important as we demonstrate how elementary matrices can be used to solve the system.</li> <li><strong>Row sum</strong>: Add a multiple of one equation to another. 
Note, this is a perfectly valid operation because, if the equality truly holds, then this amounts to simply adding the same quantity to both sides of a given equation.</li> </ol> <p>Let’s use these operations to solve the following system:</p> \begin{align*}-x_1 - 2 x_2 + x_3 &amp;= -3 \\ 3 x_2 &amp;= 3 \\ 2 x_1 + 4 x_2 &amp;= 10\end{align*} <ol> <li>First, we <em>row swap</em> the first and third equations:</li> </ol> \begin{align*}2 x_1 + 4 x_2 &amp;= 10 \\ 3 x_2 &amp;= 3 \\ -x_1 - 2 x_2 + x_3 &amp;= -3\end{align*} <ol start="2"> <li>Next, let’s perform <em>scalar multiplication</em> and multiply the first equation by 1/2:</li> </ol> \begin{align*}x_1 + 2 x_2 &amp;= 5 \\ 3 x_2 &amp;= 3 \\ -x_1 - 2 x_2 + x_3 &amp;= -3\end{align*} <ol start="3"> <li>Next, let’s perform a <em>row sum</em> and add the first row to the third:</li> </ol> \begin{align*}x_1 + 2 x_2 &amp;= 5 \\ 3 x_2 &amp;= 3 \\ x_3 &amp;= 2\end{align*} <ol start="4"> <li>Next, let’s perform <em>scalar multiplication</em> and multiply the second equation by 1/3:</li> </ol> \begin{align*}x_1 + 2 x_2 &amp;= 5 \\ x_2 &amp;= 1 \\ x_3 &amp;= 2\end{align*} <ol start="5"> <li>Finally, let’s perform a <em>row sum</em> and add -2 multiplied by the second row to the first:</li> </ol> \begin{align*}x_1 &amp;= 3 \\ x_2 &amp;= 1 \\ x_3 &amp;= 2\end{align*} <p>And there we go, we’ve solved the system using these elementary row operations.</p> <h2 id="elementary-row-operations-in-matrix-notation">Elementary row operations in matrix notation</h2> <p>Recall that we can represent a system of linear equations as a <a href="https://mbernste.github.io/posts/systems_linear_equations/">matrix equation</a>.</p> <p>The linear system that we just solved can be written as:</p> $\begin{bmatrix}-1 &amp; -2 &amp; 1 \\ 0 &amp; 3 &amp; 0 \\ 2 &amp; 4 &amp; 0 \end{bmatrix}\begin{bmatrix}x_1 \\ x_2 \\ x_3\end{bmatrix} = \begin{bmatrix}-3 \\ 3 \\ 10\end{bmatrix}$ <p>When solving the system using the 
elementary row operations, we needn’t write out all of the equations. Really, all we need to do is keep track of how $\boldsymbol{A}$ and $\boldsymbol{b}$ are being transformed upon each iteration. For ease of notation, we can join $\boldsymbol{A}$ and $\boldsymbol{b}$ into a single matrix, called an <strong>augmented matrix</strong>. In our example, this augmented matrix would look like:</p> $\begin{bmatrix}-1 &amp; -2 &amp; 1 &amp; -3 \\ 0 &amp; 3 &amp; 0 &amp; 3 \\ 2 &amp; 4 &amp; 0 &amp; 10 \end{bmatrix}$ <p>In the augmented matrix, the final column stores $\boldsymbol{b}$ and all of the previous columns store the columns of $\boldsymbol{A}$. Our execution of the row operations can now operate only on this augmented matrix as follows:</p> <ol> <li><em>Row swap</em>: Swap the first and third rows:</li> </ol> $\begin{bmatrix}2 &amp; 4 &amp; 0 &amp; 10 \\ 0 &amp; 3 &amp; 0 &amp; 3 \\ -1 &amp; -2 &amp; 1 &amp; -3 \end{bmatrix}$ <ol start="2"> <li><em>Scalar multiplication</em>: Multiply the first row by 1/2:</li> </ol> $\begin{bmatrix}1 &amp; 2 &amp; 0 &amp; 5 \\ 0 &amp; 3 &amp; 0 &amp; 3 \\ -1 &amp; -2 &amp; 1 &amp; -3 \end{bmatrix}$ <ol start="3"> <li><em>Row sum</em>: Add the first row to the third:</li> </ol> $\begin{bmatrix}1 &amp; 2 &amp; 0 &amp; 5 \\ 0 &amp; 3 &amp; 0 &amp; 3 \\ 0 &amp; 0 &amp; 1 &amp; 2 \end{bmatrix}$ <ol start="4"> <li><em>Scalar multiplication</em>: Multiply the second row by 1/3:</li> </ol> $\begin{bmatrix}1 &amp; 2 &amp; 0 &amp; 5 \\ 0 &amp; 1 &amp; 0 &amp; 1 \\ 0 &amp; 0 &amp; 1 &amp; 2 \end{bmatrix}$ <ol start="5"> <li><em>Row sum</em>: Add $-2$ times the second row to the first:</li> </ol> $\begin{bmatrix}1 &amp; 0 &amp; 0 &amp; 3 \\ 0 &amp; 1 &amp; 0 &amp; 1 \\ 0 &amp; 0 &amp; 1 &amp; 2 \end{bmatrix}$ <p>Now, let’s re-write the augmented matrix as a matrix equation:</p> $\begin{bmatrix}1 &amp; 0 &amp; 0 \\ 0 &amp; 1 &amp; 0 \\ 0 &amp; 0 &amp; 1 \end{bmatrix}\begin{bmatrix}x_1 \\ x_2 \\ x_3\end{bmatrix} = \begin{bmatrix}3 \\ 1 \\ 2\end{bmatrix}$ 
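<p>The five row operations above can also be sketched in code. The following is a minimal illustration in plain Python, assuming exact arithmetic via the standard-library <code>fractions</code> module. The helper names <code>swap_rows</code>, <code>scale_row</code>, and <code>add_multiple</code> are hypothetical and introduced only for this example; this is not a general-purpose solver, just a replay of the specific steps used on the augmented matrix.</p>

```python
from fractions import Fraction

# Hypothetical helpers, one per elementary row operation.
# Each returns a new matrix and leaves its input untouched.

def swap_rows(M, i, j):
    """Row swap: exchange rows i and j (0-indexed)."""
    M = [row[:] for row in M]
    M[i], M[j] = M[j], M[i]
    return M

def scale_row(M, i, c):
    """Scalar multiplication: multiply row i by the nonzero scalar c."""
    M = [row[:] for row in M]
    M[i] = [c * v for v in M[i]]
    return M

def add_multiple(M, src, dst, c):
    """Row sum: add c times row src to row dst."""
    M = [row[:] for row in M]
    M[dst] = [d + c * s for d, s in zip(M[dst], M[src])]
    return M

# Augmented matrix [A | b] from the example above.
aug = [[Fraction(v) for v in row] for row in [
    [-1, -2, 1, -3],
    [0, 3, 0, 3],
    [2, 4, 0, 10],
]]

aug = swap_rows(aug, 0, 2)               # step 1: swap rows 1 and 3
aug = scale_row(aug, 0, Fraction(1, 2))  # step 2: multiply row 1 by 1/2
aug = add_multiple(aug, 0, 2, 1)         # step 3: add row 1 to row 3
aug = scale_row(aug, 1, Fraction(1, 3))  # step 4: multiply row 2 by 1/3
aug = add_multiple(aug, 1, 0, -2)        # step 5: add -2 * row 2 to row 1

# The last column of the reduced augmented matrix holds x_1, x_2, x_3.
solution = [row[-1] for row in aug]
print(solution)
```

<p>Using <code>Fraction</code> rather than floating point keeps values such as 1/3 exact, so the reduced matrix can be compared against the hand computation without rounding concerns.</p>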
<p>Note that $\boldsymbol{A}$ has been <em>transformed</em> to the identity matrix $\boldsymbol{I}$. This will be a key observation as we move into the next section.</p> <h2 id="elementary-matrices">Elementary matrices</h2> <p>Notice that, through a sequence of elementary row operations, we transformed the matrix $\boldsymbol{A}$ step by step until it became the identity matrix $\boldsymbol{I}$. In fact, each of these elementary row operations can be represented as a matrix that left-multiplies the matrix being transformed; this matrix is obtained by applying that same row operation to $\boldsymbol{I}$. Such a matrix that represents an elementary row operation is called an <strong>elementary matrix</strong>.</p> <p>To demonstrate that our elementary row operations can be performed using matrix multiplication, let’s look back at our example. We start with the matrix</p> $\boldsymbol{A} := \begin{bmatrix}-1 &amp; -2 &amp; 1 \\ 0 &amp; 3 &amp; 0 \\ 2 &amp; 4 &amp; 0 \end{bmatrix}$ <p>Then, first we <em>row swap</em> the first and third rows:</p> $\underbrace{\begin{bmatrix}0 &amp; 0 &amp; 1 \\ 0 &amp; 1 &amp; 0 \\ 1 &amp; 0 &amp; 0 \end{bmatrix}}_{\boldsymbol{E}_1} \underbrace{\begin{bmatrix}-1 &amp; -2 &amp; 1 \\ 0 &amp; 3 &amp; 0 \\ 2 &amp; 4 &amp; 0 \end{bmatrix}}_{\boldsymbol{A}} = \begin{bmatrix}2 &amp; 4 &amp; 0 \\ 0 &amp; 3 &amp; 0 \\ -1 &amp; -2 &amp; 1 \end{bmatrix}$ <p>Then perform <em>scalar multiplication</em> and multiply the first row by 1/2:</p> $\underbrace{\begin{bmatrix}1/2 &amp; 0 &amp; 0 \\ 0 &amp; 1 &amp; 0 \\ 0 &amp; 0 &amp; 1 \end{bmatrix}}_{\boldsymbol{E}_2} \underbrace{\begin{bmatrix}2 &amp; 4 &amp; 0 \\ 0 &amp; 3 &amp; 0 \\ -1 &amp; -2 &amp; 1 \end{bmatrix}}_{\boldsymbol{E}_1\boldsymbol{A}} = \begin{bmatrix}1 &amp; 2 &amp; 0 \\ 0 &amp; 3 &amp; 0 \\ -1 &amp; -2 &amp; 1 \end{bmatrix}$ <p>Then perform a <em>row sum</em> and add the first row to the third:</p> $\underbrace{\begin{bmatrix}1 &amp; 0 &amp; 0 \\ 0 &amp; 1 &amp; 0 \\ 1 &amp; 0 &amp; 1 \end{bmatrix}}_{\boldsymbol{E}_3} \underbrace{\begin{bmatrix}1 &amp; 2 &amp; 0 \\ 0 
&amp; 3 &amp; 0 \\ -1 &amp; -2 &amp; 1 \end{bmatrix}}_{\boldsymbol{E}_2\boldsymbol{E}_1\boldsymbol{A}} = \begin{bmatrix}1 &amp; 2 &amp; 0 \\ 0 &amp; 3 &amp; 0 \\ 0 &amp; 0 &amp; 1 \end{bmatrix}$ <p>Then perform <em>scalar multiplication</em> and multiply the second row by 1/3:</p> $\underbrace{\begin{bmatrix}1 &amp; 0 &amp; 0 \\ 0 &amp; 1/3 &amp; 0 \\ 0 &amp; 0 &amp; 1 \end{bmatrix}}_{\boldsymbol{E}_4} \underbrace{\begin{bmatrix}1 &amp; 2 &amp; 0 \\ 0 &amp; 3 &amp; 0 \\ 0 &amp; 0 &amp; 1 \end{bmatrix}}_{\boldsymbol{E}_3\boldsymbol{E}_2\boldsymbol{E}_1\boldsymbol{A}} = \begin{bmatrix}1 &amp; 2 &amp; 0 \\ 0 &amp; 1 &amp; 0 \\ 0 &amp; 0 &amp; 1 \end{bmatrix}$ <p>Then perform a <em>row sum</em> and add -2 multiplied by the second row to the first:</p> $\underbrace{\begin{bmatrix}1 &amp; -2 &amp; 0 \\ 0 &amp; 1 &amp; 0 \\ 0 &amp; 0 &amp; 1 \end{bmatrix}}_{\boldsymbol{E}_5} \underbrace{\begin{bmatrix}1 &amp; 2 &amp; 0 \\ 0 &amp; 1 &amp; 0 \\ 0 &amp; 0 &amp; 1 \end{bmatrix}}_{\boldsymbol{E}_4\boldsymbol{E}_3\boldsymbol{E}_2\boldsymbol{E}_1\boldsymbol{A}} = \begin{bmatrix}1 &amp; 0 &amp; 0 \\ 0 &amp; 1 &amp; 0 \\ 0 &amp; 0 &amp; 1 \end{bmatrix}$ <p>Notice, we’ve derived a series of matrices that, when multiplied by $\boldsymbol{A}$, produce the identity matrix:</p> $\boldsymbol{E}_5\boldsymbol{E}_4\boldsymbol{E}_3\boldsymbol{E}_2\boldsymbol{E}_1\boldsymbol{A} = \boldsymbol{I}$ <p>By the <a href="https://mbernste.github.io/posts/inverse_matrices/">definition of an inverse matrix</a>, we see that the matrix formed by $\boldsymbol{E}_5\boldsymbol{E}_4\boldsymbol{E}_3\boldsymbol{E}_2\boldsymbol{E}_1$ is the inverse of $\boldsymbol{A}$! That is,</p> $\boldsymbol{A}^{-1} = \boldsymbol{E}_5\boldsymbol{E}_4\boldsymbol{E}_3\boldsymbol{E}_2\boldsymbol{E}_1$ <p>Thus, we have found a way to decompose the inverse of $\boldsymbol{A}$ into a product of elementary matrices. 
Each of these matrices represents a transformation on $\boldsymbol{A}$ equivalent to an elementary row operation that one would use to solve an equation of the form $\boldsymbol{Ax} = \boldsymbol{b}$!</p> <h2 id="the-general-linear-group">The general linear group</h2> <p>Now, we will note a few important observations about elementary matrices and invertible matrices in general:</p> <p>First, notice that each elementary matrix is invertible. The inverse of an elementary matrix that scales a row by a nonzero constant $c$ is simply the elementary matrix that scales that row by $\frac{1}{c}$. The inverse of an elementary matrix that swaps two rows is simply the elementary matrix that swaps the rows back to their original configuration. The inverse of an elementary matrix that adds a multiple of one row to another is simply the elementary matrix that subtracts that same multiple. Thus we see that not only are elementary matrices invertible, but their inverses are also elementary matrices!</p> <p>Because these elementary matrices are invertible, instead of starting with some invertible matrix $\boldsymbol{A}$ and producing the identity matrix $\boldsymbol{I}$ via some sequence of multiplications,</p> $\boldsymbol{I} = \boldsymbol{E}_n \dots \boldsymbol{E}_1\boldsymbol{A}$ <p>we can instead start with the identity matrix and produce $\boldsymbol{A}$:</p> $\boldsymbol{A} = \boldsymbol{E}^{-1}_1 \dots \boldsymbol{E}^{-1}_n\boldsymbol{I}$ <p>From this fact, we can see that we can go from any invertible matrix to another by multiplying the matrix with some series of elementary matrices. For example, say we have two different invertible matrices $\boldsymbol{A}$ and $\boldsymbol{B}$. 
Then we can transform $\boldsymbol{A}$ into the identity matrix via some sequence of multiplications by elementary matrices:</p> $\boldsymbol{I} = \boldsymbol{E}_n \dots \boldsymbol{E}_1\boldsymbol{A}$ <p>We can also transform $\boldsymbol{B}$ into the identity matrix via some sequence of multiplications by elementary matrices:</p> $\boldsymbol{I} = \boldsymbol{E}_{n+m} \dots \boldsymbol{E}_{n+1}\boldsymbol{B}$ <p>We can convert $\boldsymbol{A}$ to $\boldsymbol{B}$ by converting $\boldsymbol{A}$ to $\boldsymbol{I}$ and then converting $\boldsymbol{I}$ to $\boldsymbol{B}$ via the inverses of the elementary matrices used to convert $\boldsymbol{B}$ to $\boldsymbol{I}$:</p> $\boldsymbol{B} = \boldsymbol{E}^{-1}_{n+1} \dots \boldsymbol{E}^{-1}_{n+m}\boldsymbol{E}_n \dots \boldsymbol{E}_1\boldsymbol{A}$ <p>Let the matrix $\boldsymbol{C}$ be defined as:</p> $\boldsymbol{C} := \boldsymbol{E}^{-1}_{n+1} \dots \boldsymbol{E}^{-1}_{n+m}\boldsymbol{E}_n \dots \boldsymbol{E}_1$ <p>Then, we see that</p> $\boldsymbol{B} = \boldsymbol{CA}$ <p>Notably, $\boldsymbol{C}$ is also an invertible matrix because all of the elementary matrices we multiplied together to produce $\boldsymbol{C}$ are invertible!</p> Reasoning about systems of linear equations using linear algebra 2022-06-12T00:00:00-07:00 2022-06-12T00:00:00-07:00 https://mbernste.github.io/posts/systems_of_linear_equations<p><em>In this blog post, we will discuss the relationship between matrices and systems of linear equations. Specifically, we will show how systems of linear equations can be represented as a single matrix equation. 
Solutions to the system of linear equations can be reasoned about by examining the characteristics of the matrices and vectors in that matrix equation.</em></p> <h2 id="introduction">Introduction</h2> <p>In this blog post, we will discuss the relationship between <a href="https://mbernste.github.io/posts/matrices/">matrices</a> and systems of linear equations. Specifically, we will show how systems of linear equations can be represented as a single matrix equation. Solutions to the system of linear equations can be reasoned about by examining the characteristics of the matrices and vectors involved in that matrix equation.</p> <h2 id="systems-of-linear-equations">Systems of linear equations</h2> <p>A <strong>system of linear equations</strong> is a set of <a href="https://en.wikipedia.org/wiki/Linear_equation">linear equations</a> that all utilize the same set of variables, but each equation differs by the coefficients that multiply those variables.</p> <p>For example, say we have three variables, $x_1$, $x_2$, and $x_3$. A system of linear equations involving these three variables can be written as:</p> \begin{align*}3 x_1 + 2 x_2 - x_3 &amp;= 1 \\ 2 x_1 - 2 x_2 + 4 x_3 &amp;= -2 \\ -x_1 + 0.5 x_2 - x_3 &amp;= 0 \end{align*} <p>A <strong>solution</strong> to this system of linear equations is an assignment to the variables $x_1$, $x_2$, $x_3$, such that all of the equations are simultaneously true. In our case, a solution would be given by $(x_1, x_2, x_3) = (1, -2, -2)$.</p> <p>More abstractly, we could write a system of linear equations as</p> \begin{align*}a_{1,1}x_1 + a_{1,2}x_2 + a_{1,3}x_3 &amp;= b_1 \\ a_{2,1}x_1 + a_{2,2}x_2 + a_{2,3}x_3 &amp;= b_2 \\ a_{3,1}x_1 + a_{3,2}x_2 + a_{3,3}x_3 &amp;= b_3 \end{align*} <p>where $a_{1,1}, \dots, a_{3,3}$ are the coefficients and $b_1$, $b_2$, and $b_3$ are the constant terms, all treated as <em>fixed</em>. By “fixed”, we mean that we assume that $a_{1,1}, \dots, a_{3,3}$ and $b_1$, $b_2$, and $b_3$ are known. 
In contrast, $x_1$, $x_2$, and $x_3$ are unknown. We can try different values for $x_1$, $x_2$, and $x_3$ and test whether or not that assignment is a solution to the system.</p> <h2 id="reasoning-about-the-solutions-to-a-system-of-linear-equations-by-respresenting-the-system-as-a-matrix-equation">Reasoning about the solutions to a system of linear equations by representing the system as a matrix equation</h2> <p>Now, a natural question is: given a system of linear equations, how many solutions does it have? Will a system always have a solution? If it does have a solution, will it have only one? We will use concepts from linear algebra to address these questions.</p> <p>First, note that we can write a system of linear equations much more succinctly using <a href="https://mbernste.github.io/posts/matrix_vector_mult/">matrix-vector</a> multiplication. That is,</p> $\begin{bmatrix}a_{1,1} &amp; a_{1,2} &amp; a_{1,3} \\ a_{2,1} &amp; a_{2,2} &amp; a_{2,3} \\ a_{3,1} &amp; a_{3,2} &amp; a_{3,3} \end{bmatrix} \begin{bmatrix}x_1 \\ x_2 \\ x_3\end{bmatrix} = \begin{bmatrix}b_1 \\ b_2 \\ b_3\end{bmatrix}$ <p>If we let the matrix of coefficients be $\boldsymbol{A}$, the vector of variables be $\boldsymbol{x}$, and the vector of constants be $\boldsymbol{b}$, then we could write this even more succinctly as:</p> $\boldsymbol{Ax} = \boldsymbol{b}$ <p>This is an important point: any system of linear equations can be written succinctly as an equation using matrix-vector multiplication. 
By viewing systems of linear equations through this lens, we can reason about the number of solutions to a system of linear equations using properties of the matrix $\boldsymbol{A}$!</p> <p>Given this newfound representation for systems of linear equations, recall from our <a href="https://mbernste.github.io/posts/matrix_vector_mult/">discussion of matrix-vector multiplication</a> that a matrix $\boldsymbol{A}$ multiplying a vector $\boldsymbol{x}$ can be understood as taking a linear combination of the column vectors of $\boldsymbol{A}$ using the elements of $\boldsymbol{x}$ as the coefficients:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/matrix_vec_mult_as_lin_comb.png" alt="drawing" width="700" /></center> <p>Thus, we see that the solution to a system of linear equations, $\boldsymbol{x}$, is any set of weights for which, if we take a weighted sum of the columns of $\boldsymbol{A}$, we get the vector $\boldsymbol{b}$. That is, a solution $\boldsymbol{x}$ exists exactly when $\boldsymbol{b}$ lies in the <a href="https://mbernste.github.io/posts/linear_independence/">span</a> of the columns of $\boldsymbol{A}$!</p> <p>From this observation, we can begin to draw some conclusions about the number of solutions a given system of linear equations will have based on the properties of $\boldsymbol{A}$. Specifically, the system will either have:</p> <ol> <li><strong>No solution.</strong> This will occur if $\boldsymbol{b}$ lies <em>outside</em> the span of the columns of $\boldsymbol{A}$. This means that there is no way to construct $\boldsymbol{b}$ from the columns of $\boldsymbol{A}$, and thus there is no vector of weights $\boldsymbol{x}$ that will satisfy the equation $\boldsymbol{Ax} = \boldsymbol{b}$. Note, this can only occur if $\boldsymbol{A}$ is singular. Why? 
Recall that an <a href="https://mbernste.github.io/posts/inverse_matrices/">invertible matrix</a> maps each input vector $\boldsymbol{x}$ to a unique output vector $\boldsymbol{b}$ and each output $\boldsymbol{b}$ corresponds to a unique input $\boldsymbol{x}$. Said more succinctly, an invertible matrix characterizes a one-to-one and onto <a href="https://mbernste.github.io/posts/matrices_linear_transformations/">linear transformation</a>. Therefore, if $\boldsymbol{A}$ is invertible, then for any given $\boldsymbol{b}$, there <em>must</em> exist a vector, $\boldsymbol{x}$, that solves the equation $\boldsymbol{Ax} = \boldsymbol{b}$. If such a vector does not exist, then $\boldsymbol{A}$ must not be invertible.</li> <li><strong>Exactly one solution.</strong> This will occur if $\boldsymbol{A}$ is invertible. As discussed above, an invertible matrix characterizes a one-to-one and onto linear transformation and thus, for any given $\boldsymbol{b}$, there will be exactly one vector, $\boldsymbol{x}$, that solves the equation $\boldsymbol{Ax} = \boldsymbol{b}$.</li> <li><strong>Infinitely many solutions.</strong> This will occur if $\boldsymbol{b}$ lies <em>inside</em> the span of the columns of $\boldsymbol{A}$, but $\boldsymbol{A}$ is <em>not</em> invertible. Why would there be an infinite number of solutions? <a href="https://mbernste.github.io/posts/inverse_matrices/">Recall</a> that if $\boldsymbol{A}$ is not invertible, then the columns of $\boldsymbol{A}$ are <a href="https://mbernste.github.io/posts/linear_independence/">linearly dependent</a>, meaning that there are an infinite number of ways to take a weighted sum of the columns of $\boldsymbol{A}$ to get $\boldsymbol{b}$. Thus, there are an infinite number of vectors that satisfy $\boldsymbol{Ax} = \boldsymbol{b}$.</li> </ol>
Span and linear independence 2022-06-11T00:00:00-07:00 2022-06-11T00:00:00-07:00 https://mbernste.github.io/posts/linear_independence<p><em>An extremely important concept in linear algebra is that of linear independence. In this blog post we present the definition for the span of a set of vectors. Then, we use this definition to discuss the definition for linear independence. Finally, we discuss some intuition into this fundamental idea.</em></p> <h2 id="introduction">Introduction</h2> <p>An extremely important concept in the study of vector spaces is that of <em>linear independence</em>. At a high level, a set of vectors is said to be <strong>linearly independent</strong> if you cannot form any vector in the set using any linear combination of the other vectors in the set. If a set of vectors does not have this quality – that is, a vector in the set can be formed from some linear combination of others – then the set is said to be <strong>linearly dependent</strong>.</p> <p>In this post, we will present a more foundational concept, the <em>span</em> of a set of vectors, and then move on to the definition for linear independence. Finally, we will discuss a high-level intuition for why the concept of linear independence is so important.</p> <h2 id="span">Span</h2> <p>Given a set of vectors, the <strong>span</strong> of the set of vectors is the set of all vectors that can be “constructed” by taking linear combinations of vectors in that set. 
More rigorously,</p> <p><span style="color:#0060C6"><strong>Definition 1 (span):</strong> Given a <a href="https://mbernste.github.io/posts/vector_spaces/">vector space</a>, $(\mathcal{V}, \mathcal{F})$, and a set of vectors $S := \{\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_n\} \subseteq \mathcal{V}$, the <strong>span</strong> of $S$, denoted $\text{Span}(S)$, is the set of all vectors that can be formed by taking a linear combination of vectors in $S$. That is,</span></p> <center><span style="color:#0060C6">$\text{Span}(S) := \left\{ \sum_{i=1}^n c_i\boldsymbol{x}_i \mid c_1, \dots, c_n \in \mathcal{F} \right\}$ </span></center> <p>Intuitively, you can think of $S$ as a set of “building blocks” and $\text{Span}(S)$ as the set of all vectors that can be “constructed” from the building blocks in $S$. To illustrate this point, we show in the figure below two vectors, $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$ (left), and two examples of vectors in their span (right):</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/span_of_vectors.png" alt="drawing" width="600" /></center> <p>Note that in this example we could construct <em>ANY</em> two-dimensional vector from $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$. Thus, the span of these two vectors is all of $\mathbb{R}^2$! This is not always the case. 
In the figure below, we show an example with a different span:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/span_of_vectors_2.png" alt="drawing" width="600" /></center> <p>This time, $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$ don’t span all of $\mathbb{R}^2$, but rather, only the line on which $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$ lie.</p> <h2 id="linear-independence">Linear independence</h2> <p>Given a <a href="https://mbernste.github.io/posts/vector_spaces/">vector space</a>, $(\mathcal{V}, \mathcal{F})$, and a set of vectors $S := \{\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_n\} \subseteq \mathcal{V}$, the vectors are said to be <strong>linearly independent</strong> if each vector lies outside the span of the remaining vectors. More rigorously,</p> <p><span style="color:#0060C6"><strong>Definition 2 (linear independence):</strong> Given a <a href="https://mbernste.github.io/posts/vector_spaces/">vector space</a>, $(\mathcal{V}, \mathcal{F})$, and a set of vectors $S := \{\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_n\} \subseteq \mathcal{V}$, $S$ is called <strong>linearly independent</strong> if for each vector $\boldsymbol{x}_i \in S$, it holds that $\boldsymbol{x}_i \notin \text{Span}(S \setminus \{\boldsymbol{x}_i\})$.</span></p> <p>Said differently, a set of vectors is linearly independent if you cannot form any of the vectors in the set using a linear combination of the other vectors. Below we demonstrate a set of linearly independent vectors (left) and a set of linearly dependent vectors (right):</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/linear_independence.png" alt="drawing" width="600" /></center> <p>Why is the set on the right linearly dependent? 
As you can see below, we can use any two of the vectors to construct the third:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/linear_independence_symmetry.png" alt="drawing" width="600" /></center> <h2 id="intuition">Intuition</h2> <p>There are two ways I think about linear independence: in terms of information content and in terms of <a href="https://mbernste.github.io/posts/intrinsic_dimensionality/">intrinsic dimensionality</a>. Let me explain.</p> <p>First, if a set of vectors is linearly dependent, then in a sense there is “redundant information” within the vectors. What do we mean by redundant? In a linearly dependent set of vectors, there is at least one vector that can be removed while the span of the set remains the same! On the other hand, for a linearly independent set of vectors, each vector is vital for defining the span of the set’s vectors. If you remove even one vector, the span of the vectors will change (in fact, it will become smaller)!</p> <p>At a more geometric level of thinking, a set of $n$ linearly independent vectors $S := \{ \boldsymbol{x}_1, \dots, \boldsymbol{x}_n \}$ spans a space with an <a href="https://mbernste.github.io/posts/intrinsic_dimensionality/">intrinsic dimensionality</a> of $n$ because in order to specify any vector $\boldsymbol{v}$ in the span of these vectors, one must specify the coefficients $c_1, \dots, c_n$ to construct $\boldsymbol{v}$ from the vectors in $S$. That is,</p> $\boldsymbol{v} = c_1\boldsymbol{x}_1 + \dots + c_n\boldsymbol{x}_n$ <p>However, if $S$ is linearly dependent, then we can throw away “redundant” vectors in $S$. In fact, we see that the intrinsic dimensionality of the span of a linearly dependent set $S$ is the size of the largest subset of $S$ that is linearly independent!</p>
Functionals and functional derivatives 2022-04-10T00:00:00-07:00 2022-04-10T00:00:00-07:00 https://mbernste.github.io/posts/functional_derivatives<p><em>The calculus of variations is a field of mathematics that deals with the optimization of functions of functions, called functionals. This topic was not taught to me in my computer science education, but it lies at the foundation of a number of important concepts and algorithms in the data sciences such as gradient boosting and variational inference. In this post, I will provide an explanation of the functional derivative and show how it relates to the gradient of an ordinary multivariate function.</em></p> <h2 id="introduction">Introduction</h2> <p>Multivariate calculus concerns itself with infinitesimal changes of numerical functions – that is, functions that accept a vector of real numbers and output a real number:</p> $f : \mathbb{R}^n \rightarrow \mathbb{R}$ <p>In this blog post, we discuss the <strong>calculus of variations</strong>, a field of mathematics that generalizes the ideas in multivariate calculus relating to infinitesimal changes of traditional numeric functions to <em>functions of functions</em>, called <em>functionals</em>. Specifically, given a set of functions, $\mathcal{F}$, a <strong>functional</strong> is a mapping between $\mathcal{F}$ and the real numbers:</p> $F : \mathcal{F} \rightarrow \mathbb{R}$ <p>Functionals are quite prevalent in machine learning and statistical inference. For example, <a href="https://mbernste.github.io/posts/entropy/">information entropy</a> can be considered a functional on probability mass functions. 
For a given <a href="https://mbernste.github.io/posts/measure_theory_2/">discrete random variable</a>, $X$, entropy can be thought about as a function that accepts as input $X$’s probability mass function, $p_X$, and outputs a real number:</p> $H(p_X) := -\sum_{x \in \mathcal{X}} p_X(x) \log p_X(x)$ <p>where $\mathcal{X}$ is the <a href="https://en.wikipedia.org/wiki/Support_(mathematics)">support</a> of $p_X$.</p> <p>Another example of a functional is the <a href="https://mbernste.github.io/posts/elbo/">evidence lower bound (ELBO)</a>: a function that, like entropy, operates on probability distributions. The ELBO is a foundational quantity used in the popular <a href="https://mbernste.github.io/posts/em/">EM algorithm</a> and <a href="https://mbernste.github.io/posts/variational_inference/">variational inference</a> used for performing statistical inference with probabilistic models.</p> <p>In this blog post, we will review some concepts in traditional calculus such as partial derivatives, directional derivatives, and gradients in order to introduce the definition of the <strong>functional derivative</strong>, which is simply the generalization of the gradient of numeric functions to functionals.</p> <h2 id="a-review-of-derivatives-and-gradients">A review of derivatives and gradients</h2> <p>In this section, we will introduce a few important concepts in multivariate calculus: derivatives, partial derivatives, directional derivatives, and gradients.</p> <h3 id="derivatives">Derivatives</h3> <p>Before going further, let’s quickly review the basic definition of the derivative for a univariate function $g$ that maps real numbers to real numbers. That is,</p> $g : \mathbb{R} \rightarrow \mathbb{R}$ <p>The derivative of $g$ at input $x$, denoted $\frac{dg(x)}{dx}$, describes the rate of change of $g$ at $x$. 
It is defined rigorously as</p> $\frac{dg(x)}{dx} := \lim_{h \rightarrow 0}\frac{g(x+h)-g(x)}{h}$ <p>Geometrically, $\frac{dg(x)}{dx}$ is the slope of the line that is tangential to $g$ at $x$ as depicted below:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/derivative.png" alt="drawing" width="550" /></center> <p>In this schematic, we depict the value of $h$ getting smaller and smaller. As it does, the slope of the line approaches that of the line that is tangential to $g$ at $x$. This slope is the derivative $\frac{dg(x)}{dx}$.</p> <h3 id="partial-derivatives">Partial derivatives</h3> <p>We will now consider a continuous <em>multivariate</em> function $f$ that maps real-valued vectors $\boldsymbol{x} \in \mathbb{R}^n$ to real numbers. That is,</p> $f: \mathbb{R}^n \rightarrow \mathbb{R}$ <p>Given $\boldsymbol{x} \in \mathbb{R}^n$, the <strong>partial derivative</strong> of $f$ with respect to the $i$th component of $\boldsymbol{x}$, denoted $\frac{\partial f(\boldsymbol{x})}{\partial x_i}$, is simply the derivative of $f$ if we hold all the components of $\boldsymbol{x}$ fixed, except for the $i$th component. Said differently, it tells us the rate of change of $f$ with respect to the $i$th dimension of the vector space in which $\boldsymbol{x}$ resides! This can be visualized below for a function $f : \mathbb{R}^2 \rightarrow \mathbb{R}$:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/partial_derivative.png" alt="drawing" width="450" /></center> <p>As seen above, the partial derivative $\frac{\partial f(\boldsymbol{x})}{\partial x_2}$ is simply the derivative of the function $f(x_1, x_2)$ when holding $x_1$ fixed. 
That is, it is the slope of the line tangent to $f(x_1, x_2)$ when $x_1$ is held fixed.</p> <h3 id="directional-derivatives">Directional derivatives</h3> <p>We can see that the partial derivative of $f(\boldsymbol{x})$ with respect to the $i$th dimension of the vector space can be expressed as</p> $\frac{\partial f(\boldsymbol{x})}{\partial x_i} := \lim_{h \rightarrow 0} \frac{f(\boldsymbol{x} + h\boldsymbol{e}_i) - f(\boldsymbol{x})}{h}$ <p>where $\boldsymbol{e}_i$ is the $i$th <a href="https://en.wikipedia.org/wiki/Standard_basis">standard basis vector</a> – that is, the vector of all zeroes except for a one in the $i$th position.</p> <p>Geometrically, we can view the $i$th partial derivative of $f(\boldsymbol{x})$ as $f$’s rate of change along the direction of the $i$th standard basis vector of the vector space.</p> <p>Thinking along these lines, there is nothing stopping us from generalizing this idea to <em>any unit vector</em> rather than just the standard basis vectors. Given some unit vector $\boldsymbol{v}$, we define the <strong>directional derivative</strong> of $f(\boldsymbol{x})$ along the direction of $\boldsymbol{v}$ as</p> $D_{\boldsymbol{v}}f(\boldsymbol{x}) := \lim_{h \rightarrow 0} \frac{f(\boldsymbol{x} + h\boldsymbol{v}) - f(\boldsymbol{x})}{h}$ <p>Geometrically, this is simply the rate of change of $f$ along the direction in which $\boldsymbol{v}$ is pointing! This can be viewed schematically below:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/directional_derivative.png" alt="drawing" width="450" /></center> <p>For a given vector $\boldsymbol{v}$, we can derive a formula for $D_{\boldsymbol{v}}f(\boldsymbol{x})$. That is, we can show that:</p> $D_{\boldsymbol{v}}f(\boldsymbol{x}) = \sum_{i=1}^n \left( \frac{\partial f(\boldsymbol{x})}{\partial x_i} \right) v_i$ <p>See Theorem 1 in the Appendix of this post for a proof of this equation. 
Now, if we define the vector of all partial derivatives of $f(\boldsymbol{x})$ as</p> $\nabla f(\boldsymbol{x}) := \begin{bmatrix}\frac{\partial f(\boldsymbol{x})}{\partial x_1} &amp; \frac{\partial f(\boldsymbol{x})}{\partial x_2} &amp; \dots &amp; \frac{\partial f(\boldsymbol{x})}{\partial x_n} \end{bmatrix}$ <p>Then we can represent the directional derivative as simply the <a href="https://en.wikipedia.org/wiki/Dot_product">dot product</a> between $\nabla f(\boldsymbol{x})$ and $\boldsymbol{v}$:</p> $D_{\boldsymbol{v}}f(\boldsymbol{x}) = \nabla f(\boldsymbol{x}) \cdot \boldsymbol{v}$ <p>This vector, $\nabla f(\boldsymbol{x})$, is called the <strong>gradient vector</strong> of $f$ at $\boldsymbol{x}$.</p> <h3 id="gradients">Gradients</h3> <p>As described above, the <strong>gradient vector</strong>, $\nabla f(\boldsymbol{x})$, is the vector constructed by taking the partial derivative of $f$ at $\boldsymbol{x}$ along each basis vector. It turns out that the gradient vector points in the <em>direction of steepest ascent</em> along $f$’s surface at $\boldsymbol{x}$. This can be shown schematically below:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/gradient.png" alt="drawing" width="450" /></center> <p>We prove this property of the gradient vector in Theorem 2 of the Appendix to this post.</p> <h2 id="functional-derivatives">Functional derivatives</h2> <p>Now, we will seek to generalize the notion of the gradient to functionals. We’ll let $\mathcal{F}$ be some set of functions, and for simplicity, we’ll let each $f$ be a continuous real-valued function. That is, for each $f \in \mathcal{F}$, we have $f: \mathbb{R} \rightarrow \mathbb{R}$. Then, we’ll consider a functional $F$ that maps each $f \in \mathcal{F}$ to a number. 
That is,</p> $F: \mathcal{F} \rightarrow \mathbb{R}$ <p>Now, we’re going to spoil the punchline with the definition for the functional derivative:</p> <p><span style="color:#0060C6"><strong>Definition 1 (Functional derivative):</strong> Given a function $f \in \mathcal{F}$, the <strong>functional derivative</strong> of $F$ at $f$, denoted $\frac{\partial{F}}{\partial f}$, is defined to be the function for which: </span></p> <p><span style="color:#0060C6">\begin{align*}\int \frac{\partial F}{\partial f}(x) \eta(x) \ dx &amp;= \lim_{h \rightarrow 0}\frac{F(f + h \eta) - F(f)}{h} \\ &amp;= \frac{d F(f + h\eta)}{dh}\bigg\rvert_{h=0}\end{align*}</span></p> <p><span style="color:#0060C6">where $h$ is a scalar and $\eta$ is an arbitrary function in $\mathcal{F}$.</span></p> <p>Whoa. What is going on here? How on earth does this define the functional derivative? And why is the functional derivative, $\frac{\partial{F}}{\partial f}$, buried inside such a seemingly complicated equation?</p> <p>Let’s break it down.</p> <p>First, notice the similarity of the right-hand side of the equation of Definition 1 to the definition of the directional derivative:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/directional_gradient_functional_derivative.png" alt="drawing" width="350" /></center> <p>Indeed, the equation in Definition 1 describes the analog of the directional derivative for functionals! That is, it describes the rate of change of $F$ at $f$ in the direction of the function $\eta$!</p> <p>How does this work? As we shrink $h$ down to an infinitesimally small number, $f + h \eta$ will become arbitrarily close to $f$. In the illustration below, we see an example function $f$ (red) and another function $\eta$ (blue). 
As $h$ gets smaller, the function $f + h\eta$ (purple) becomes more similar to $f$:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/function_variationn.png" alt="drawing" width="450" /></center> <p>Thus, we see that $h \eta$ is the “infinitesimal” change to $f$ that is analogous to the infinitesimal change to $\boldsymbol{x}$ that we describe by $h\boldsymbol{v}$ in the definition of the directional derivative. The quantity $h \eta$ is called a <strong>variation</strong> of $f$ (hence the word “variations” in the name “calculus of variations”).</p> <p>Now, so far we have only shown that the equation in Definition 1 describes something analogous to the directional derivative for multivariate numerical functions. We showed this by comparing the right-hand side of the equation in Definition 1 to the definition of the directional derivative. However, as Definition 1 states, the functional derivative itself is defined to be the function $\frac{\partial F}{\partial f}$ within the integral on the left-hand side of the equation. What is going on here? Why is <em>this</em> the functional derivative?</p> <p>Now, it is time to recall the gradient for traditional multivariate functions. Specifically, notice the similarity between the alternative formulation of the directional derivative, which uses the gradient, and the left-hand side of the equation in Definition 1:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/functional_derivative_gradient.png" alt="drawing" width="450" /></center> <p>Notice that these equations have similar forms. Instead of a summation in the definition of the directional derivative, we have an integral in the equation for Definition 1. Moreover, instead of summing over elements of the vector $\boldsymbol{v}$, we “sum” (using an integral) over each value of $\eta(x)$. 
Lastly, instead of each partial derivative of $f$, we now have each value of the function $\frac{\partial F}{\partial f}$ for each $x$. This function, $\frac{\partial F}{\partial f}(x)$, is analogous to the gradient! It is thus called the functional derivative!</p> <p>To drive this home further, recall that we can represent the directional derivative as the dot product between the gradient vector and $\boldsymbol{v}$:</p> $D_{\boldsymbol{v}}f(\boldsymbol{x}) = \nabla f(\boldsymbol{x}) \cdot \boldsymbol{v}$ <p>To make this relationship clearer, we note that the dot product is an <a href="https://en.wikipedia.org/wiki/Inner_product_space">inner product</a>. Thus, we can write this relationship in a more general way as</p> $D_{\boldsymbol{v}}f(\boldsymbol{x}) = \langle \nabla f(\boldsymbol{x}), \boldsymbol{v} \rangle$ <p>We also recall that a valid inner product between continuous functions $f$ and $g$ is</p> $\langle f, g \rangle := \int f(x)g(x) dx$ <p>Thus, we see that</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/functional_derivative_gradient_w_inner_product.png" alt="drawing" width="450" /></center> <p>Said differently, the functional gradient of a functional, $F$, at a function $f$, denoted $\frac{\partial F}{\partial f}$, is the function for which, given any arbitrary function $\eta$, the inner product between $\frac{\partial F}{\partial f}$ and $\eta$ is the directional derivative of $F$ in the direction of $\eta$!</p> <h2 id="an-example-the-functional-derivative-of-entropy">An example: the functional derivative of entropy</h2> <p>As a toy example, let’s derive the functional derivative of <a href="https://mbernste.github.io/posts/entropy/">information entropy</a>. Recall from the beginning of this post that the entropy $H$ of a discrete random variable $X$ can be viewed as a function on $X$’s probability mass function $p_X$. 
More specifically, $H$ is defined as</p> $H(p_X) := \sum_{x \in \mathcal{X}} - p_X(x) \log p_X(x)$ <p>where $\mathcal{X}$ is the support of $p_X$.</p> <p>Let’s derive its functional derivative. Let’s start with an arbitrary probability mass function $\eta : \mathcal{X} \rightarrow [0,1]$. Then, we can write out the equation that defines the functional derivative:</p> $\sum_{x \in \mathcal{X}} \frac{\partial H}{\partial p_X}(x) \eta(x) = \frac{d H(p_X + h\eta)}{dh}\bigg\rvert_{h=0}$ <p>Let’s simplify this equation:</p> \begin{align*} \sum_{x \in \mathcal{X}} \frac{\partial H}{\partial p_X}(x) \eta(x) &amp;= \frac{d H(p_X + h\eta)}{dh}\bigg\rvert_{h=0} \\ &amp;= \frac{d}{dh} \sum_{x \in \mathcal{X}} -(p_X(x) + h\eta(x))\log(p_X(x) + h\eta(x))\bigg\rvert_{h=0} \\ &amp;= \sum_{x \in \mathcal{X}} - \eta(x)\log(p_X(x) + h\eta(x)) - \eta(x)\bigg\rvert_{h=0} \\ &amp;= \sum_{x \in \mathcal{X}} (-1 - \log p_X(x))\eta(x)\end{align*} <p>Now we see that $\frac{\partial H}{\partial p_X}(x) = -1 - \log p_X(x)$ and thus, this is the functional derivative!</p> <h2 id="appendix">Appendix</h2> <p><span style="color:#0060C6"><strong>Theorem 1:</strong> Given a differentiable function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ and vectors $\boldsymbol{x}, \boldsymbol{v} \in \mathbb{R}^n$, where $\boldsymbol{v}$ is a unit vector, it holds that $D_{\boldsymbol{v}} f(\boldsymbol{x}) = \sum_{i=1}^n \left( \frac{\partial f(\boldsymbol{x})}{\partial x_i} \right) v_i$.</span></p> <p><strong>Proof:</strong></p> <p>Consider $\boldsymbol{x}$ and $\boldsymbol{v}$ to be fixed and let $g(z) := f(\boldsymbol{x} + z\boldsymbol{v})$. 
Then,</p> $\frac{dg(z)}{dz} = \lim_{h \rightarrow 0} \frac{g(z+h) - g(z)}{h}$ <p>Evaluating this derivative at $z = 0$, we see that</p> \begin{align*} \frac{dg(z)}{dz}\bigg\rvert_{z=0} &amp;= \lim_{h \rightarrow 0} \frac{g(h) - g(0)}{h} \\ &amp;= \lim_{h \rightarrow 0} \frac{f(\boldsymbol{x} + h\boldsymbol{v}) - f(\boldsymbol{x})}{h} \\ &amp;= D_{\boldsymbol{v}} f(\boldsymbol{x}) \end{align*} <p>We can then apply the <a href="https://en.wikipedia.org/wiki/Chain_rule#Multivariable_case">multivariate chain rule</a> and see that</p> $\frac{dg(z)}{dz} = \sum_{i=1}^n D_i f(\boldsymbol{x} + z\boldsymbol{v}) \frac{d (x_i + zv_i)}{dz}$ <p>where $D_i f(\boldsymbol{x} + z\boldsymbol{v})$ is the partial derivative of $f$ with respect to its $i$th argument when evaluated at $\boldsymbol{x} + z\boldsymbol{v}$.</p> <p>Now, evaluating this derivative at $z = 0$, we see that</p> \begin{align*} \frac{dg(z)}{dz}\bigg\rvert_{z=0} &amp;= \sum_{i=1}^n D_i f(\boldsymbol{x}) v_i \\ &amp;= \sum_{i=1}^n \frac{\partial f(\boldsymbol{x})}{\partial x_i} v_i \end{align*} <p>Putting these two results together, we see that</p> $D_{\boldsymbol{v}} f(\boldsymbol{x}) = \sum_{i=1}^n \frac{\partial f(\boldsymbol{x})}{\partial x_i} v_i$ <p>$\square$</p> <p><span style="color:#0060C6"><strong>Theorem 2:</strong> Given a differentiable function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ and vector $\boldsymbol{x} \in \mathbb{R}^n$, $f$’s direction of steepest ascent is the direction pointed to by the gradient $\nabla f(\boldsymbol{x})$.</span></p> <p><strong>Proof:</strong></p> <p>As shown in Theorem 1, given an arbitrary unit vector $\boldsymbol{v} \in \mathbb{R}^n$, the directional derivative $D_{\boldsymbol{v}} f(\boldsymbol{x})$ can be calculated by taking the dot product of the gradient vector with $\boldsymbol{v}$:</p> $D_{\boldsymbol{v}} f(\boldsymbol{x}) = \nabla f(\boldsymbol{x}) \cdot \boldsymbol{v}$ <p>The dot product can be computed as</p> $\nabla f(\boldsymbol{x}) \cdot \boldsymbol{v} = ||\nabla f(\boldsymbol{x})|| 
||\boldsymbol{v}|| \cos \theta$ <p>where $\theta$ is the angle between the two vectors. The $\cos$ function is maximized (and equals 1) when $\theta = 0$ and thus, since $||\boldsymbol{v}|| = 1$, the directional derivative is maximized when $\theta = 0$. Thus, the unit vector that maximizes the directional derivative is the vector pointing in the same direction as the gradient, which proves that the gradient points in the direction of steepest ascent.</p> <p>$\square$</p>Matthew N. BernsteinThe calculus of variations is a field of mathematics that deals with the optimization of functions of functions, called functionals. This topic was not taught to me in my computer science education, but it lies at the foundation of a number of important concepts and algorithms in the data sciences such as gradient boosting and variational inference. In this post, I will provide an explanation of the functional derivative and show how it relates to the gradient of an ordinary multivariate function.Normed vector spaces2021-11-23T00:00:00-08:002021-11-23T00:00:00-08:00https://mbernste.github.io/posts/normed_vector_space<p><em>When first introduced to Euclidean vectors, one is taught that the length of the vector’s arrow is called the norm of the vector. In this post, we present the more rigorous and abstract definition of a norm and show how it generalizes the notion of “length” to non-Euclidean vector spaces. We also discuss how the norm induces a metric function on pairs of vectors so that one can discuss distances between vectors.</em></p> <h2 id="introduction">Introduction</h2> <p>A <strong>normed vector space</strong> is a vector space in which each vector is associated with a scalar value called a <strong>norm</strong>. 
In a standard Euclidean vector space, the length of each vector is a norm:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/Norm.png" alt="drawing" width="200" /></center> <p>The more abstract, rigorous definition of a norm generalizes this notion of length to any vector space as follows:</p> <p><span style="color:#0060C6"><strong>Definition 1 (normed vector space):</strong> A <strong>normed vector space</strong> is a vector space $(\mathcal{V}, \mathcal{F})$ associated with a function $||.|| : \mathcal{V} \rightarrow \mathbb{R}$, called a <strong>norm</strong>, that obeys the following axioms:</span></p> <ol> <li><span style="color:#0060C6">$\forall \boldsymbol{v} \in \mathcal{V}, \ \ ||\boldsymbol{v}|| \geq 0$</span></li> <li><span style="color:#0060C6">$\forall \boldsymbol{v} \in \mathcal{V}, \forall \alpha \in \mathcal{F}, \ \ ||\alpha\boldsymbol{v}|| = |\alpha| ||\boldsymbol{v}||$</span></li> <li><span style="color:#0060C6">$\forall \boldsymbol{v}, \boldsymbol{u} \in \mathcal{V}, \ \ ||\boldsymbol{u} + \boldsymbol{v}|| \leq ||\boldsymbol{u}|| + ||\boldsymbol{v}||$</span></li> </ol> <p>Here, we outline the intuition behind each axiom in the definition above and describe how these axioms capture this idea of length:</p> <ul> <li>Axiom 1 says that no vector should have a negative length. This enforces our intuition that a “length” is a non-negative quantity.</li> <li>Axiom 2 says that if we multiply a vector by a scalar, its length should be scaled by the magnitude (i.e. the absolute value) of that scalar. This axiom ties the notion of scaling vectors (Axiom 6 in the <a href="https://mbernste.github.io/posts/vector_spaces/">definition of a vector space</a>) to the notion of “length” for a vector. It essentially says that to scale a vector is to stretch the vector.</li> <li>Axiom 3 says that the length of the sum of two vectors should not exceed the sum of the lengths of each vector. 
This enforces our intuition that if we add together two objects that each have a “length”, the resultant object should not exceed the sum of the lengths of the original objects.</li> </ul> <p>Following the axioms for a normed vector space, one can also show that only the zero vector has zero length (Theorem 1 in the Appendix to this post).</p> <h2 id="unit-vectors">Unit vectors</h2> <p>In a normed vector space, a <strong>unit vector</strong> is a vector with norm equal to one. Given a vector $\boldsymbol{v}$, a unit vector can be derived by simply dividing the vector by its norm (Theorem 2 in the Appendix). This unit vector, called the <strong>normalized vector</strong> of $\boldsymbol{v}$ is denoted $\hat{\boldsymbol{v}}$. In a Euclidean vector space, the normalized vector $\hat{\boldsymbol{v}}$ is the unit vector that points in the same direction as $\boldsymbol{v}$.</p> <p>Unit vectors are important because they generalize the idea of “direction” in Euclidean spaces to vector spaces that are not Euclidean. In a Euclidean space, the unit vectors all fall in a sphere of radius one around the origin:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/UnitVectors.png" alt="drawing" width="200" /></center> <p>Thus, the set of all unit vectors can be used to define the set of all “directions” that vectors can point in the vector space. Because of this, one can form any vector by decomposing it into a unit vector multiplied by some scalar. 
In this way, unit vectors generalize the notion of “direction” in a Euclidean vector space to non-Euclidean vector spaces.</p> <h2 id="normed-vector-spaces-are-also-metric-spaces">Normed vector spaces are also metric spaces</h2> <p>All normed vector spaces are also <a href="https://en.wikipedia.org/wiki/Metric_(mathematics)">metric spaces</a> – that is, the norm function induces a metric function on pairs of vectors that can be interpreted as a “distance” between them (Theorem 3 in the Appendix). This metric is defined simply as:</p> $d(\boldsymbol{x}, \boldsymbol{y}) := \|\boldsymbol{x} - \boldsymbol{y}\|$ <p>That is, if one subtracts one vector from the other, then the “length” of the resultant vector can be interpreted as the “distance” between those vectors.</p> <p>In the figure below we show how the norm can be used to form a metric between Euclidean vectors. On the left, we depict two vectors, $\boldsymbol{v}$ and $\boldsymbol{u}$, as arrows. On the right we depict these vectors as points in Euclidean space. The distance between these points is given by the norm of the difference vector between $\boldsymbol{u}$ and $\boldsymbol{v}$.</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/NormAsMetric.png" alt="drawing" width="400" /></center> <h2 id="examples-of-norms">Examples of norms</h2> <p>Notably, a norm is a function that satisfies a set of axioms and thus, one may consider multiple norms when looking at a vector space. For example, there are multiple norms that are commonly associated with Euclidean vector spaces. 
Here are just a few examples:</p> <p><strong>L2 norm</strong></p> <p>The L2 norm is the most common norm as it is simply the Euclidean length of a vector, that is, the Euclidean distance between the origin and the vector’s point in a coordinate vector space:</p> $\vert\vert \boldsymbol{x} \vert\vert_2 := \sqrt{\sum_{i=1}^n x_i^2}$ <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/L2Norm.png" alt="drawing" width="200" /></center> <p><strong>L1 norm</strong></p> <p>The L1 norm is simply the sum of the absolute values of the elements of the vector:</p> $\vert\vert \boldsymbol{x} \vert\vert_1 := \sum_{i=1}^n \vert x_i \vert$ <p>The L1 norm is also called the <strong>Manhattan norm</strong> or <strong>taxicab norm</strong> because it calculates distances as if one has to take streets around city blocks to get from one point to another:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/L1Norm.png" alt="drawing" width="200" /></center> <p><strong>Infinity norm</strong></p> <p>The infinity norm is simply the maximum absolute value among the elements of a vector:</p> $\vert\vert \boldsymbol{x} \vert\vert_{\infty} := \text{max}\{\vert x_1 \vert, \vert x_2 \vert, \dots, \vert x_n \vert\}$ <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/LInftyNorm.png" alt="drawing" width="200" /></center> <h2 id="appendix">Appendix</h2> <p><span style="color:#0060C6"><strong>Theorem 1 (Only the zero vector has zero norm):</strong> Given a vector space $(\mathcal{V}, \mathcal{F})$ with norm $\vert\vert . 
\vert\vert$, it holds that $\vert\vert \boldsymbol{v} \vert\vert = 0 \iff \boldsymbol{v} = \boldsymbol{0}$</span></p> <p><strong>Proof:</strong></p> \begin{align*}\vert\vert \boldsymbol{0} \vert\vert &amp;= \vert\vert 0 \boldsymbol{v} \vert\vert &amp;&amp; \text{for any } \boldsymbol{v} \in \mathcal{V} \\ &amp;= \vert 0 \vert \vert\vert \boldsymbol{v} \vert\vert &amp;&amp; \text{by Axiom 2} \\ &amp;= 0\end{align*} <p>Note the first line is proven in Theorem 2 in my <a href="https://mbernste.github.io/posts/vector_spaces/">previous blog post</a> on vector spaces.</p> <p>$\square$</p> <p><span style="color:#0060C6"><strong>Theorem 2 (Formation of unit vector):</strong> Given a vector space $(\mathcal{V}, \mathcal{F})$ with norm $\vert\vert . \vert\vert$, the vector $\hat{\boldsymbol{v}} := \frac{\boldsymbol{v}}{\vert\vert \boldsymbol{v} \vert\vert}$ has norm equal to one.</span></p> <p><strong>Proof:</strong></p> \begin{align*}\vert\vert \hat{\boldsymbol{v}} \vert\vert &amp;= \vert\vert \frac{\boldsymbol{v}}{\vert\vert \boldsymbol{v} \vert\vert} \vert\vert \\ &amp;=\frac{1}{\vert\vert\boldsymbol{v}\vert\vert} \vert\vert\boldsymbol{v}\vert\vert &amp;&amp; \text{by Axiom 2} \\ &amp;= 1 \end{align*} <p>$\square$</p> <p><span style="color:#0060C6"><strong>Theorem 3 (Norm-induced metric):</strong> Given a vector space $(\mathcal{V}, \mathcal{F})$ with norm $\vert\vert . \vert\vert$, the function $d(\boldsymbol{x}, \boldsymbol{y}) := \vert\vert \boldsymbol{x} - \boldsymbol{y} \vert\vert$ where $\boldsymbol{x}, \boldsymbol{y} \in \mathcal{V}$ is a metric.</span></p> <p><strong>Proof:</strong></p> <p>To prove that $d$ is a metric, we need to show that it satisfies the three axioms of a metric function. 
First we need to show that</p> $d(\boldsymbol{x}, \boldsymbol{y}) = 0 \iff \boldsymbol{x} = \boldsymbol{y}$ <p>This can be proven as follows:</p> \begin{align*} d(\boldsymbol{x}, \boldsymbol{y}) = 0 \implies &amp; \vert\vert\boldsymbol{x} - \boldsymbol{y} \vert\vert = 0 \\ \implies &amp; \vert\vert \boldsymbol{x} + (-\boldsymbol{y})\vert\vert = 0 \\ \implies &amp; \boldsymbol{x} + (-\boldsymbol{y}) = \boldsymbol{0} &amp;&amp; \text{by Theorem 1} \\ \implies &amp; -\boldsymbol{y} = -\boldsymbol{x} \\ \implies &amp; \boldsymbol{y} = \boldsymbol{x}\end{align*} <p>The last line follows from Theorem 4 in my <a href="https://mbernste.github.io/posts/vector_spaces/">previous blog post</a> on vector spaces. Going the other direction, we assume that $\boldsymbol{x} = \boldsymbol{y}$. Then</p> \begin{align*} \vert\vert \boldsymbol{x} - \boldsymbol{y} \vert\vert &amp;= \vert\vert \boldsymbol{x} - \boldsymbol{x} \vert\vert \\ &amp;= \vert\vert \boldsymbol{0} \vert\vert \\ &amp;= 0 &amp;&amp; \text{by Theorem 1}\end{align*} <p>Second, we need to show that $d(\boldsymbol{x}, \boldsymbol{y}) \geq 0$. This fact is already evident based on Axiom 1 in Definition 1 above.</p> <p>Third and finally, $d$ needs to satisfy the <a href="https://en.wikipedia.org/wiki/Triangle_inequality">triangle inequality</a>. 
That is, we need to show that $\forall \boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z} \in \mathcal{V}$, it holds that</p> $d(\boldsymbol{x}, \boldsymbol{y}) \leq d(\boldsymbol{x}, \boldsymbol{z}) + d(\boldsymbol{z}, \boldsymbol{y})$ <p>This is proven as follows:</p> \begin{align*} d(\boldsymbol{x}, \boldsymbol{y}) &amp;= \vert\vert\boldsymbol{x} - \boldsymbol{y} \vert\vert \\ &amp;= \vert\vert\boldsymbol{x} - \boldsymbol{z} + \boldsymbol{z} - \boldsymbol{y} \vert\vert \\ &amp;= \vert\vert(\boldsymbol{x} - \boldsymbol{z}) + (\boldsymbol{z} - \boldsymbol{y}) \vert\vert \\ &amp; \leq \vert\vert\boldsymbol{x} - \boldsymbol{z}\vert\vert\ + \vert\vert\ \boldsymbol{z} - \boldsymbol{y} \vert\vert &amp;&amp; \text{by Axiom 3 of Definition 1} \\ &amp;= d(\boldsymbol{x}, \boldsymbol{z}) + d(\boldsymbol{z}, \boldsymbol{y})\end{align*} <p>$\square$</p>Matthew N. BernsteinWhen first introduced to Euclidean vectors, one is taught that the length of the vector’s arrow is called the norm of the vector. In this post, we present the more rigorous and abstract definition of a norm and show how it generalizes the notion of “length” to non-Euclidean vector spaces. We also discuss how the norm induces a metric function on pairs of vectors so that one can discuss distances between vectors.The overloaded equals sign2021-11-09T00:00:00-08:002021-11-09T00:00:00-08:00https://mbernste.github.io/posts/equal_vs_definition<p><em>Two of the most important relationships in mathematics, namely equality and definition, are both denoted using the same symbol – namely, the equals sign. The overloading of this symbol confuses students in mathematics and computer programming. 
In this post, I argue for the use of two different symbols for these two fundamentally different operators.</em></p> <h2 id="introduction">Introduction</h2> <p>I find it unfortunate that two of the most important relationships in mathematics, namely <strong>equality</strong> and <strong>definition</strong>, are often denoted using the exact same symbol – namely, the equal sign: “=”. Early in my learning days, I believe that this <a href="https://en.wikipedia.org/wiki/Operator_overloading">overloading</a> of the equal sign led to more confusion than necessary and I have personally witnessed it confuse students.</p> <p>To ensure that we’re on the same page, let’s first define these two notions. Let’s start with the idea of <strong>equality</strong>. Let’s say we have two entities, which we will denote using the symbols $X$ and $Y$. The statement “$X$ equals $Y$”, denoted $X = Y$, means that $X$ and $Y$ <strong>are the same thing</strong>.</p> <p>For example, let’s say we have a right-triangle with edge lengths $a$, $b$ and $c$, where $c$ is the hypotenuse. The <a href="https://en.wikipedia.org/wiki/Pythagorean_theorem">Pythagorean Theorem</a> says that $a^2 + b^2 = c^2$. Said differently, the quantity $c^2$ <em>is the same quantity</em> as the quantity $a^2 + b^2$.</p> <p>Now, let’s move on to <strong>definition</strong>. Given some entity denoted with the symbol $Y$, the statement “let $X$ be $Y$”, also often denoted $X = Y$, means that one should use the symbol “$X$” to refer to the entity referred to by “$Y$”.</p> <p>For example, in introductory math textbooks it is common to define the sine function in reference to a right-triangle:</p> $\sin \theta = \frac{\text{opposite}}{\text{hypotenuse}}$ <p>This is a definition. 
We are <strong>assigning</strong> the symbol/concept $\sin \theta$ to be the ratio of the length of the triangle’s opposite side to the length of its hypotenuse.</p> <p>The fundamental difference between equality and definition is that in the equality relationship between $X$ and $Y$, both the symbols $X$ and $Y$ are bound to entities – that is, they “refer” to entities. The statement $Y = X$ is simply a comment about those two entities, namely, that they are the same. In contrast, in a definition, only one of the two symbols is bound to an entity. The act of stating a definition is the act of <em>binding a known entity to a new symbol</em>. For example, the symbol “$\text{foo} \ \theta$” is meaningless. What exactly is “foo”? We don’t know because we have not defined it.</p> <h2 id="overloading-the-equal-sign-creates-confusion-in-mathematics">Overloading the equal sign creates confusion in mathematics</h2> <p>I was tutoring someone who was teaching themselves pre-calculus out of a textbook, and they were quite confused by the statement,</p> $\sin \theta = \frac{\text{opposite}}{\text{hypotenuse}}$ <p>They asked me, “Why is $\sin \theta$ equal to the quantity $\frac{\text{opposite}}{\text{hypotenuse}}$?” They never explicitly stated so, but it became evident that their confusion was not the good kind of confusion. It wasn’t, “Why are the ratios between the sides of a right-triangle functions of the angles between those sides?” Nor, “Why is this definition important?” Rather, their confusion seemed to stem from the very existence of this mysterious object, “$\sin \theta$”. Their question was more along the lines of, “What <em>is</em> this mysterious thing? And why on earth is it equal to the ratio of the sides of the triangle?”</p> <p>Their confusion arose from the erroneous interpretation of this statement as describing an equality rather than a definition. 
The mystery was, at least partly, alleviated by the clarification that $\sin \theta$ is not an object that existed before we saw this statement – rather, this statement <em>created the object for the first time</em>. The statement is <em>defining</em> $\sin \theta$ to be the ratio of the opposite side to the hypotenuse.</p> <p>The really interesting quality of this definition is that the ratios of the sides of a right triangle are functions of its angles regardless of the lengths of the sides. That is, that we can create this definition at all!</p> <h2 id="overloading-the-equal-sign-creates-confusion-in-computer-programming">Overloading the equal sign creates confusion in computer programming</h2> <p>Anyone who has taught introductory computer programming is familiar with the very common confusion between the <a href="https://en.wikipedia.org/wiki/Assignment_(computer_science)">assignment operator</a> and <a href="https://en.wikipedia.org/wiki/Relational_operator#Equality">equality operator</a> in programming languages.</p> <p>For example, in many programming languages, like C and Python, the assignment operator uses the standard equals sign. That is, the statement <code class="language-plaintext highlighter-rouge">x = y</code> assigns the value referenced by symbol <code class="language-plaintext highlighter-rouge">y</code> to symbol <code class="language-plaintext highlighter-rouge">x</code>. In contrast, the statement <code class="language-plaintext highlighter-rouge">x == y</code> returns either <code class="language-plaintext highlighter-rouge">True</code> or <code class="language-plaintext highlighter-rouge">False</code> depending on whether the value referenced by <code class="language-plaintext highlighter-rouge">x</code> is equal to the value referenced by <code class="language-plaintext highlighter-rouge">y</code>. 
Though I have not seen any data on the topic, I wonder whether teaching these two operators from the very beginning of a student’s mathematical education would alleviate this common confusion.</p> <h2 id="use--instead-of--to-denote-definition">Use “:=” instead of “=” to denote definition</h2> <p>I think it’s important to use the symbol “:=” to denote definition. I prefer this symbol over the popular “$\equiv$” symbol because it emphasizes the asymmetry of the statement. That is, $X := Y$ means “use $X$ as a symbol for $Y$”, which differs from “use $Y$ as a symbol for $X$.” In contrast, the standard equals sign “=” is appropriately symmetric.</p> <p>Using the appropriate symbol to distinguish definition statements from equality statements may go a long way, at least in proportion to the effort of using them, towards alleviating confusion in students of math and computer science.</p>Matthew N. BernsteinTwo of the most important relationships in mathematics, namely equality and definition, are both denoted using the same symbol – namely, the equals sign. The overloading of this symbol confuses students in mathematics and computer programming. In this post, I argue for the use of two different symbols for these two fundamentally different operators.Vector spaces2021-10-27T00:00:00-07:002021-10-27T00:00:00-07:00https://mbernste.github.io/posts/vector_spaces<p><em>The concept of a vector space is a foundational concept in mathematics, physics, and the data sciences. In this post, we first present and explain the definition of a vector space and then go on to describe properties of vector spaces. Lastly, we present a few examples of vector spaces that go beyond the usual Euclidean vectors that are often taught in introductory math and science courses.</em></p> <h2 id="introduction">Introduction</h2> <p>The concept of a vector space is a foundational concept in mathematics, physics, and the data sciences. 
In most introductory courses, only vectors in a Euclidean space are discussed. That is, vectors are presented as arrays of numbers:</p> $\boldsymbol{x} = \begin{bmatrix}1 \\ 2\end{bmatrix}$ <p>If the array of numbers is of length two or three, then one can visualize the vector as an arrow:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/EuclideanVector.png" alt="drawing" width="300" /></center> <p>While this definition is adequate for most applications of vector spaces, there exists a more abstract, and therefore more sophisticated definition of vector spaces that is required to have a deeper understanding of topics in math, statistics, and machine learning. In this post, we will dig into the abstract definition for vector spaces and discuss a few of their properties. Moreover, we will look at a few examples of vector spaces outside of the usual Euclidean vectors and see how the formal definition generalizes to other mathematical constructs such as <a href="https://mbernste.github.io/posts/matrices/">matrices</a> and functions.</p> <h2 id="formal-definition">Formal definition</h2> <p>As we mentioned before, vectors are usually introduced as arrays of numbers, and consequently, as arrows. 
These arrows can be added together and scaled as depicted below:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/AddScaleVectors.png" alt="drawing" width="420" /></center> <p>A <strong>vector space</strong> generalizes this notion of adding and scaling things that behave like Euclidean vectors.</p> <p>At a more rigorous mathematical level, a vector space consists of both a set of vectors $\mathcal{V}$ and a <a href="https://en.wikipedia.org/wiki/Field_(mathematics)">field</a> of scalars $\mathcal{F}$ for which one can add together vectors in $\mathcal{V}$ as well as scale these vectors by elements in the field $\mathcal{F}$ according to a specific list of rules (in most cases, the field of scalars is the real numbers, $\mathbb{R}$). These rules are spelled out in the definition for a vector space:</p> <p><span style="color:#0060C6"><strong>Definition 1 (vector space):</strong> Given a set of objects $\mathcal{V}$ called vectors and a field $\mathcal{F} := (C, +, \cdot, -, ^{-1}, 0, 1)$ where $C$ is the set of elements in the field, called scalars, the tuple $(\mathcal{V}, \mathcal{F})$ is a <strong>vector space</strong> if for all $\boldsymbol{v}, \boldsymbol{u}, \boldsymbol{w} \in \mathcal{V}$ and $c, d \in C$, the following ten axioms hold:</span></p> <ol> <li><span style="color:#0060C6">$\boldsymbol{u} + \boldsymbol{v} \in \mathcal{V}$</span></li> <li><span style="color:#0060C6">$\boldsymbol{u} + \boldsymbol{v} = \boldsymbol{v} + \boldsymbol{u}$</span></li> <li><span style="color:#0060C6">$(\boldsymbol{u} + \boldsymbol{v}) + \boldsymbol{w} = \boldsymbol{u} + (\boldsymbol{v} + \boldsymbol{w})$</span></li> <li><span style="color:#0060C6">There exists a zero vector $\boldsymbol{0} \in \mathcal{V}$ such that $\boldsymbol{u} + \boldsymbol{0} = \boldsymbol{u}$</span></li> <li><span style="color:#0060C6">For each $\boldsymbol{u} \in \mathcal{V}$ there exists a $\boldsymbol{u}' \in \mathcal{V}$ such that $\boldsymbol{u} 
+ \boldsymbol{u}' = \boldsymbol{0}$. We call $\boldsymbol{u}'$ the negative of $\boldsymbol{u}$ and denote it as $-\boldsymbol{u}$</span></li> <li><span style="color:#0060C6">The scalar multiple of $\boldsymbol{u}$ by $c$, denoted by $c\boldsymbol{u}$ is in $\mathcal{V}$</span></li> <li><span style="color:#0060C6">$c(\boldsymbol{u} + \boldsymbol{v}) = c\boldsymbol{u} + c\boldsymbol{v}$</span></li> <li><span style="color:#0060C6">$(c + d)\boldsymbol{u} = c\boldsymbol{u} + d\boldsymbol{u}$</span></li> <li><span style="color:#0060C6">$c(d\boldsymbol{u}) = (cd)\boldsymbol{u}$</span></li> <li><span style="color:#0060C6">$1\boldsymbol{u} = \boldsymbol{u}$</span></li> </ol> <p>Axioms 1-5 of the definition describe how vectors can be added together. Axioms 6-10 describe how these vectors can be scaled using the field of scalars.</p> <h2 id="properties">Properties</h2> <p>The ten axioms outlined in the definition for a vector space may seem somewhat arbitrary (at least, they did to me); however, as we will show, these axioms are sufficient for ensuring that vector spaces have all of the properties that we intuitively associate with Euclidean vectors. Specifically, from these axioms, we can derive the following properties:</p> <ol> <li><strong>The zero vector is unique</strong> (Theorem 1 in the Appendix). There is only one distinct zero vector in a vector space. Notice in a Euclidean vector space, there is only one point at the origin, which represents the zero vector in Euclidean spaces.</li> <li><strong>Any vector multiplied by the zero scalar is the zero vector</strong> (Theorem 2 in the Appendix). The zero scalar converts any vector into the zero vector. That is, given a vector $\boldsymbol{v}$, it holds that $0\boldsymbol{v} = \boldsymbol{0}$. This generalizes the notion of how multiplying a vector in a Euclidean space by zero should shrink the vector to the origin.</li> <li><strong>The negative of a vector is unique</strong> (Theorem 3 in the Appendix). 
Given a vector $\boldsymbol{v}$, we denote its negative vector as $-\boldsymbol{v}$. This is analogous to each real number $x \in \mathbb{R}$ having a matching negative number $-x$ that lies $|x|$ distance from 0 on the opposite side of 0.</li> <li><strong>Multiplying a vector by the scalar -1 produces its negative vector</strong> (Theorem 4 in the Appendix). That is, given a vector $\boldsymbol{v}$, it holds that $-1\boldsymbol{v} = -\boldsymbol{v}$. This is analogous to the fact that if you multiply any number $x$ by $-1$ you get the number $-x$ that lies $|x|$ distance from 0 on the opposite side of 0.</li> <li><strong>The zero vector multiplied by any scalar is the zero vector</strong> (Theorem 5 in the Appendix). The zero vector remains the zero vector despite being multiplied by any scalar. That is, $c\boldsymbol{0} = \boldsymbol{0}$ for any $c \in \mathcal{F}$. This is analogous to the fact that zero multiplied by any number remains zero.</li> <li><strong>The only vector whose negative is not distinct from itself is the zero vector</strong> (Theorem 6 in the Appendix). For every vector other than the zero vector, its negative vector is a distinct vector in the vector space. For the zero vector, its negative is itself. This is analogous to the fact that for any number $x \neq 0$, the number $-x$ is a distinct number from $x$ that lies on the opposite side of 0. However, for $x = 0$, $-x = x$.</li> </ol> <h2 id="examples-of-vector-spaces">Examples of vector spaces</h2> <p><strong>The real numbers</strong></p> <p>It turns out that the real numbers are themselves a vector space (when equipped with standard addition and multiplication). In this vector space, the real numbers are both the vectors and the scalars! Here, the number zero acts as the zero vector. 
This example may be a bit trivial and silly; however, I like it because it highlights the generality of the definition of a vector space.</p> <p><strong>Matrices</strong></p> <p>Although generally not thought of as vectors, the space of real-valued <a href="https://mbernste.github.io/posts/matrices/">matrices</a> of a fixed size $$\mathbb{R}^{m \times n}$$ form a vector space in which the matrices are vectors. Intuitively, you can add matrices together:</p> $\begin{bmatrix}1 &amp; 2 \\ 3 &amp; 4\end{bmatrix} + \begin{bmatrix}3 &amp; 2 \\ 2 &amp; 5\end{bmatrix} = \begin{bmatrix}4 &amp; 4 \\ 5 &amp; 9\end{bmatrix}$ <p>You can also scale them:</p> $2\begin{bmatrix}1 &amp; 2 \\ 3 &amp; 4\end{bmatrix} = \begin{bmatrix}2 &amp; 4 \\ 6 &amp; 8\end{bmatrix}$ <p>The zero matrix acts as the zero vector:</p> $\begin{bmatrix}0 &amp; 0 \\ 0 &amp; 0\end{bmatrix}$ <p>This may seem a bit confusing because as we discuss in <a href="https://mbernste.github.io/posts/matrices_as_functions/">another blog post</a>, matrices act as functions between Euclidean vector spaces. Nonetheless, matrices can form vector spaces all on their own, distinct from the vector spaces that they act upon!</p> <p><strong>Functions</strong></p> <p>Sets of functions can also form vector spaces! In fact, the real power in the definition for a vector space reveals itself when dealing with functions, and the fact that some sets of functions form vector spaces lies at the foundation for many fundamental ideas in mathematics, physics, and the data sciences such as <a href="https://en.wikipedia.org/wiki/Fourier_transform">Fourier transforms</a> and <a href="https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space">reproducing kernel Hilbert spaces</a>.</p> <p>For example, the set of all continuous, real-valued functions forms a vector space. 
Intuitively, we see that such functions act like vectors in that we can add them together:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/AddFunctionsLikeVectors.png" alt="drawing" width="600" /></center> <p>We can also scale functions. In the following figure, the function $g$ is scaled by $c$:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/ScaleFunctionLikeVector.png" alt="drawing" width="300" /></center> <p>Lastly, the zero function acts as the zero vector. Here we depict the zero function, which outputs 0 for all inputs:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/ZeroFunctionLikeVector.png" alt="drawing" width="300" /></center> <h2 id="appendix-proofs-of-properties-of-vector-spaces">Appendix: Proofs of properties of vector spaces</h2> <p><span style="color:#0060C6"><strong>Theorem 1 (Uniqueness of zero vector):</strong> Given vector space $(\mathcal{V}, \mathcal{F})$, the zero vector is unique.</span></p> <p><strong>Proof:</strong></p> <p>Assume for the sake of contradiction that there exists a vector $\boldsymbol{a}$ such that $\boldsymbol{a} \neq \boldsymbol{0}$ and that $\forall \boldsymbol{v} \in \mathcal{V}$</p> $\boldsymbol{a} + \boldsymbol{v} = \boldsymbol{v}$ <p>Then, this implies that</p> $\boldsymbol{a} + \boldsymbol{0} = \boldsymbol{0}$ <p>However, Axiom 4 of the definition for a vector space states that adding the zero vector to any vector leaves that vector unchanged, so it must be that $\boldsymbol{a} + \boldsymbol{0} = \boldsymbol{a}$. Together, these two equations imply that $\boldsymbol{a} = \boldsymbol{0}$.</p> <p>Since we assumed $\boldsymbol{a} \neq \boldsymbol{0}$, we reach a contradiction. Therefore, there does not exist a vector $\boldsymbol{a} \neq \boldsymbol{0}$ for which $\forall \boldsymbol{v} \in \mathcal{V} \ \ \ \boldsymbol{a} + \boldsymbol{v} = \boldsymbol{v}$. 
Thus, the zero-vector is unique.</p> <p>$\square$</p> <p><span style="color:#0060C6"><strong>Theorem 2 (The product of the zero scalar and any vector is the zero vector):</strong> Given a vector space $(\mathcal{V}, \mathcal{F})$, it holds that $\forall \boldsymbol{v} \in \mathcal{V}, 0\boldsymbol{v} = \boldsymbol{0}$.</span></p> <p><strong>Proof:</strong></p> <p>Assume for the sake of contradiction that, for some vector $\boldsymbol{v} \in \mathcal{V}$, there exists a vector $\boldsymbol{a} \neq \boldsymbol{0}$ such that</p> $0\boldsymbol{v} = \boldsymbol{a}$ <p>Now, for any scalar $c \neq 0$, we have</p> \begin{align*}c\boldsymbol{v} &amp;= (c + 0)\boldsymbol{v} \\ &amp;= c\boldsymbol{v} + 0\boldsymbol{v} &amp;&amp; \text{by Axiom 8} \\ &amp;= c\boldsymbol{v} + \boldsymbol{a}\end{align*} <p>Our assumption that $\boldsymbol{a} \neq \boldsymbol{0}$ must be false because, by Theorem 1, the only vector $\boldsymbol{a}$ for which $c\boldsymbol{v} + \boldsymbol{a} = c\boldsymbol{v}$ holds is the zero-vector.</p> <p>$\square$</p> <p><span style="color:#0060C6"><strong>Theorem 3 (Each vector has a unique negative vector):</strong> Given a vector space $(\mathcal{V}, \mathcal{F})$ and vector $\boldsymbol{v} \in \mathcal{V}$, its negative, $-\boldsymbol{v}$, is unique. That is, $\boldsymbol{v} + \boldsymbol{a} = \boldsymbol{0} \iff \boldsymbol{a} = -\boldsymbol{v}$.</span></p> <p><strong>Proof:</strong></p> <p>We need only prove $\boldsymbol{v} + \boldsymbol{a} = \boldsymbol{0} \implies \boldsymbol{a} = -\boldsymbol{v}$. 
The other direction is stated in the axioms for the definition of a vector space.</p> \begin{align*}\boldsymbol{v} + \boldsymbol{a} &amp;= \boldsymbol{0} \\ \implies -\boldsymbol{v} + \boldsymbol{v} + \boldsymbol{a} &amp;= -\boldsymbol{v} + \boldsymbol{0} \\ \implies [-\boldsymbol{v} + \boldsymbol{v}] + \boldsymbol{a} &amp;= -\boldsymbol{v} \\ \implies \boldsymbol{0} + \boldsymbol{a} &amp;= -\boldsymbol{v} &amp;&amp;\text{by Axiom 5} \\ \implies \boldsymbol{a} &amp;= -\boldsymbol{v} &amp;&amp; \text{by Axiom 4}\end{align*} <p>$\square$</p> <p><span style="color:#0060C6"><strong>Theorem 4 (Derivation of a vector’s negative):</strong> Given a vector $\boldsymbol{v} \in \mathcal{V}$, its negative is $(-1)\boldsymbol{v}$. That is, $-\boldsymbol{v} = (-1)\boldsymbol{v}$.</span></p> <p><strong>Proof:</strong></p> \begin{align*}\boldsymbol{v} + (-1)\boldsymbol{v} &amp;= (1)\boldsymbol{v} + (-1)\boldsymbol{v} &amp;&amp; \text{by Axiom 10} \\ &amp;= (1-1)\boldsymbol{v} &amp;&amp; \text{by Axiom 8} \\ &amp;= 0\boldsymbol{v} \\ &amp;= \boldsymbol{0} &amp;&amp; \text{by Theorem 2}\end{align*} <p>Then, by Axiom 5 and Theorem 3, it must be that $(-1)\boldsymbol{v} = -\boldsymbol{v}$.</p> <p>$\square$</p> <p><span style="color:#0060C6"><strong>Theorem 5 (The zero vector multiplied by any scalar is the zero vector):</strong> Given a vector space $(\mathcal{V}, \mathcal{F})$, it holds that $c\boldsymbol{0} = \boldsymbol{0}$ for every scalar $c$.</span></p> <p><strong>Proof:</strong> \begin{align*}\boldsymbol{0} + \boldsymbol{0} &amp;= \boldsymbol{0} &amp;&amp; \text{by Axiom 4} \\ c(\boldsymbol{0} + \boldsymbol{0}) &amp;= c\boldsymbol{0} \\ c\boldsymbol{0} + c\boldsymbol{0} &amp;= c\boldsymbol{0} &amp;&amp; \text{by Axiom 7}\end{align*}</p> <p>By Theorem 1, the only vector $\boldsymbol{a}$ in $\mathcal{V}$ for which $\boldsymbol{a} + \boldsymbol{v} = \boldsymbol{v}$ for all vectors $\boldsymbol{v} \in \mathcal{V}$ is the zero vector $\boldsymbol{0}$. 
Thus, $c\boldsymbol{0} = \boldsymbol{0}$.</p> <p>$\square$</p> <p><span style="color:#0060C6"><strong>Theorem 6 (The zero vector is its own negative):</strong> Given a vector space $(\mathcal{V}, \mathcal{F})$, it holds that $-\boldsymbol{0} = \boldsymbol{0}$</span></p> <p><strong>Proof:</strong></p> \begin{align*}\boldsymbol{a} + (-\boldsymbol{a}) &amp;= \boldsymbol{0} &amp;&amp; \text{by Axiom 5} \\ \boldsymbol{a} + \boldsymbol{a} &amp;= \boldsymbol{0} &amp;&amp; \text{assuming } \boldsymbol{a} = -\boldsymbol{a} \\ \implies 2\boldsymbol{a} &amp;= \boldsymbol{0} \\ \implies \boldsymbol{a} &amp;= \boldsymbol{0} &amp;&amp; \text{by Theorem 5}\end{align*} <p>Thus, if we assume $\boldsymbol{a} = -\boldsymbol{a}$, then $\boldsymbol{a}$ must be the zero vector.</p> <p>$\square$</p>Matthew N. BernsteinThe concept of a vector space is a foundational concept in mathematics, physics, and the data sciences. In this post, we first present and explain the definition of a vector space and then go on to describe properties of vector spaces. Lastly, we present a few examples of vector spaces that go beyond the usual Euclidean vectors that are often taught in introductory math and science courses.Invertible matrices2021-10-20T00:00:00-07:002021-10-20T00:00:00-07:00https://mbernste.github.io/posts/inverse_matrices<p><em>In this post, we discuss invertible matrices: those matrices that characterize invertible linear transformations. We discuss three different perspectives for intuiting inverse matrices as well as several of their properties.</em></p> <h2 id="introduction">Introduction</h2> <p>As we have discussed in depth, matrices can be viewed <a href="https://mbernste.github.io/posts/matrices_as_functions/">as functions</a> between vector spaces. In this post, we will discuss matrices that represent <a href="https://en.wikipedia.org/wiki/Inverse_function">inverse functions</a>. 
Such matrices are called <strong>invertible matrices</strong> and their corresponding inverse function is characterized by an <strong>inverse matrix</strong>.</p> <p>More rigorously, the inverse matrix of a matrix $\boldsymbol{A}$ is defined as follows:</p> <p><span style="color:#0060C6"><strong>Definition 1 (Inverse matrix):</strong> Given a square matrix $\boldsymbol{A} \in \mathbb{R}^{n \times n}$, its <strong>inverse matrix</strong> is the matrix $\boldsymbol{C}$ that when either left or right multiplied by $\boldsymbol{A}$, yields the identity matrix. That is, if for a matrix $\boldsymbol{C}$ it holds that $$\boldsymbol{AC} = \boldsymbol{CA} = \boldsymbol{I}$$, then $\boldsymbol{C}$ is the inverse of $\boldsymbol{A}$. This inverse matrix, $\boldsymbol{C}$ is commonly denoted as $\boldsymbol{A}^{-1}$.</span></p> <p>This definition might seem a bit opaque, so in the remainder of this blog post we will explore a number of <a href="https://mbernste.github.io/posts/understanding_3d/">complementary perspectives</a> for viewing inverse matrices.</p> <h2 id="intuition-behind-invertible-matrices">Intuition behind invertible matrices</h2> <p>Here are three ways to understand invertible matrices:</p> <ol> <li>An invertible matrix characterizes an invertible linear transformation</li> <li>An invertible matrix preserves the dimensionality of transformed vectors</li> <li>An invertible matrix computes a change of coordinates for a vector space</li> </ol> <p>Below we will explore each of these perspectives.</p> <p><strong>1. An invertible matrix characterizes an invertible linear transformation</strong></p> <p>Any matrix $\boldsymbol{A}$ for which there exists an inverse matrix $\boldsymbol{A}^{-1}$ characterizes an invertible linear transformation. 
That is, given an invertible matrix $\boldsymbol{A}$, the linear transformation $$T(\boldsymbol{x}) := \boldsymbol{Ax}$$ has an inverse linear transformation $T^{-1}(\boldsymbol{x})$ defined as $T^{-1}(\boldsymbol{x}) := \boldsymbol{A}^{-1}\boldsymbol{x}$.</p> <p>Recall, for a function to be invertible it must be both <a href="https://en.wikipedia.org/wiki/Surjective_function">onto</a> and <a href="https://en.wikipedia.org/wiki/Injective_function">one-to-one</a>. We show in the Appendix to this blog post that if $\boldsymbol{A}$ is invertible, then $T(\boldsymbol{x})$ is both onto (Theorem 2) and one-to-one (Theorem 3).</p> <p>At a more intuitive level, the inverse of a matrix $\boldsymbol{A}$ is the matrix that “reverts” vectors transformed by $\boldsymbol{A}$ back to their original vectors:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/matrix_inverse.png" alt="drawing" width="500" /></center> <p>Thus, since matrix multiplication encodes a composition of the matrices’ linear transformations, it follows that a matrix multiplied by its inverse yields the identity matrix $\boldsymbol{I}$, which characterizes the linear transformation that maps vectors back to themselves.</p> <p><strong>2. A singular matrix collapses vectors into a lower-dimensional subspace</strong></p> <p>A singular matrix “collapses” or “compresses” vectors into an intrinsically lower dimensional space whereas an invertible matrix preserves the <a href="https://mbernste.github.io/posts/intrinsic_dimensionality/">intrinsic dimensionality</a> of the vectors.</p> <p>This follows from the fact that a matrix is invertible if and only if its columns are linearly independent (Theorem 4 in the Appendix). 
Recall that a set of $n$ <a href="https://mbernste.github.io/posts/linear_independence/">linearly independent vectors</a> $$S := \{ \boldsymbol{x}_1, \dots, \boldsymbol{x}_n \}$$ spans a space with an intrinsic dimensionality of $n$ because in order to specify any vector $\boldsymbol{b}$ in the vector space, one must specify the coefficients $c_1, \dots, c_n$ such that</p> $\boldsymbol{b} = c_1\boldsymbol{x}_1 + \dots + c_n\boldsymbol{x}_n$ <p>However, if $S$ is not linearly independent, then we can throw away “redundant” vectors in $S$ that can be constructed from the remaining vectors. Thus, the intrinsic dimensionality of a linearly dependent set $S$ is the size of the largest subset of $S$ that is linearly independent.</p> <p>When a matrix $\boldsymbol{A}$ is singular, its columns are linearly dependent and thus the column space of the matrix is inherently of lower dimension than the number of columns. Thus, when $\boldsymbol{A}$ multiplies a vector $\boldsymbol{x}$, it transforms $\boldsymbol{x}$ into this lower dimensional space. Once transformed, there is no way to transform it back to its original vector because certain dimensions of the vector were “lost” in this transformation.</p> <p>To make this more concrete, an example is shown below:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/matrix_inverse_lin_ind.png" alt="drawing" width="1000" /></center> <p>In Panel A of this figure, we show the column vectors of a matrix $\boldsymbol{A} \in \mathbb{R}^{3 \times 3}$ that span a plane. In Panel B, we show a solution to the equation $\boldsymbol{Ax} = \boldsymbol{b}$. In Panel C, we show another solution to $\boldsymbol{Ax} = \boldsymbol{b}$. Notice that there are multiple vectors in $\mathbb{R}^3$ that $$\boldsymbol{A}$$ maps to $\boldsymbol{b}$. Thus, there does not exist an inverse mapping and therefore no inverse matrix for $\boldsymbol{A}$. 
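</p> <p>A small numerical sketch in plain Python may help here. The matrix below is a hypothetical example (it is not the matrix drawn in the figure) whose third column is the sum of its first two columns, and <code class="language-plaintext highlighter-rouge">matvec</code> is a helper defined only for this illustration. Two different input vectors are sent to the same output vector:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def matvec(A, x):
    """Multiply the matrix A (given as a list of rows) by the vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

# Hypothetical singular matrix: column 3 equals column 1 plus column 2
A = [[1, 0, 1],
     [0, 1, 1],
     [0, 0, 0]]

x1 = [1, 1, 0]  # one input vector
x2 = [0, 0, 1]  # a different input vector

b1 = matvec(A, x1)
b2 = matvec(A, x2)
print(b1, b2)   # both products are the same vector
</code></pre></div></div>
<p>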
These multiple mappings from $\mathbb{R}^3$ to $\boldsymbol{b}$ arise directly from the fact that the columns of $\boldsymbol{A}$ are linearly dependent.</p> <p>Also notice that this singular matrix maps vectors in $\mathbb{R}^3$ to vectors that lie on the plane in $\mathbb{R}^3$ that is spanned by its column vectors. All vectors on a plane in $\mathbb{R}^3$ have an intrinsic dimensionality of two rather than three because we only need to specify coefficients for two of the column vectors in $\boldsymbol{A}$ to specify a point on the plane. We can throw away the third. Thus, we see that this singular matrix collapses points from the full 3-dimensional space $\mathbb{R}^3$ to the 2-dimensional space on the plane spanned by the columns of $\boldsymbol{A}$.</p> <p><strong>3. An invertible matrix computes a change of coordinates for a vector space</strong></p> <p>A vector $\boldsymbol{x} \in \mathbb{R}^n$ can be viewed as the coordinates for a point in a coordinate system. That is, the vector $\boldsymbol{x}$ provides a value along each dimension: $x_i$ is the value along dimension $i$. The coordinate system we use is, in a mathematical sense, arbitrary. To see why it’s arbitrary, notice in the figure below that we can specify locations in $\mathbb{R}^2$ using either the grey coordinate system or the blue coordinate system:</p> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/coordinate_change.png" alt="drawing" width="600" /></center> <p>We see that there is a one-to-one and onto mapping between coordinates in each of these two alternative coordinate systems. The point $\boldsymbol{x}$ is located at $[-4, -2]$ in the grey coordinate system and at $[-1, 1]$ in the blue coordinate system.</p> <p>Thus we see that all coordinate systems are able to provide an unambiguous location for points in the space and thus, there is a one-to-one and onto mapping between them. 
Nonetheless, it often helps to have some coordinate system that acts as a reference to every other coordinate system. This reference coordinate system is usually defined by the <a href="https://en.wikipedia.org/wiki/Standard_basis">standard basis vectors</a> $\{\boldsymbol{e}_1, \dots, \boldsymbol{e}_n\}$ where $\boldsymbol{e}_i$ consists of all zeros except for a one at index $i$.</p> <p>All coordinate systems can then be constructed from the coordinate system defined by the standard basis vectors. This is depicted in the previous figure in which the reference coordinate system is depicted by the grey grid and is constructed from the orthonormal basis vectors $\boldsymbol{e}_1$ and $\boldsymbol{e}_2$. An alternative coordinate system is depicted by the blue grid and is constructed from the basis vectors $\boldsymbol{a}_1$ and $\boldsymbol{a}_2$.</p> <p>Now, how do invertible matrices enter the picture? Well, an invertible matrix $\boldsymbol{A} := [\boldsymbol{a}_1, \dots, \boldsymbol{a}_n]$ can be viewed as an operator that converts vectors described in terms of some set of basis vectors $\{\boldsymbol{a}_1, \dots, \boldsymbol{a}_n\}$ back to a description in terms of the standard basis vectors $\{\boldsymbol{e}_1, \dots, \boldsymbol{e}_n\}$. That is, if we have some vector $\boldsymbol{x} \in \mathbb{R}^n$, then $\boldsymbol{Ax}$ can be understood to be the vector in the standard basis <em>if</em> $\boldsymbol{x}$ was described according to the basis formed by the columns of $\boldsymbol{A}$.</p> <p>Another way to think about this is that if we have some vector $\boldsymbol{x} \in \mathbb{R}^n$ described according to the standard basis, then we can describe $\boldsymbol{x}$ in terms of an alternative basis $\boldsymbol{a}_1, \dots, \boldsymbol{a}_n$ by multiplying $\boldsymbol{x}$ by the inverse of the matrix $\boldsymbol{A} := [ \boldsymbol{a}_1, \dots, \boldsymbol{a}_n]$. 
That is, $$\boldsymbol{x}_{\boldsymbol{A}} := \boldsymbol{A}^{-1}\boldsymbol{x}$$ is the representation of $\boldsymbol{x}$ in terms of the basis formed by the columns of $\boldsymbol{A}$.</p> <h2 id="properties">Properties</h2> <p>Below we discuss several properties of invertible matrices that provide further intuition into how they behave and also provide algebraic rules that can be used in derivations.</p> <ol> <li><strong>The columns of an invertible matrix are linearly independent</strong> (Theorem 4 in the Appendix).</li> <li><strong>Taking the inverse of an inverse matrix gives you back the original matrix</strong>. Given an invertible matrix $\boldsymbol{A}$ with inverse $\boldsymbol{A}^{-1}$, it follows from the definition of invertible matrices, that $\boldsymbol{A}^{-1}$ is also invertible with its inverse being $\boldsymbol{A}$. That is, $$(\boldsymbol{A}^{-1})^{-1} = \boldsymbol{A}$$ This also follows from the fact that the inverse of an inverse function $f^{-1}$ is simply the original function $f$.</li> <li><strong>The result of multiplying invertible matrices is invertible</strong> (Theorem 5 in the Appendix). Given two invertible matrices $\boldsymbol{A}, \boldsymbol{B} \in \mathbb{R}^{n \times n}$, the matrix that results from their multiplication is invertible. That is, $\boldsymbol{AB}$ is invertible and its inverse is given by $$(\boldsymbol{AB})^{-1} = \boldsymbol{B}^{-1}\boldsymbol{A}^{-1}$$ Recall that <a href="https://mbernste.github.io/posts/matrix_multiplication/">matrix multiplication</a> results in a matrix that characterizes the composition of the linear transformations characterized by the factor matrices. That is, $\boldsymbol{ABx}$ first transforms $\boldsymbol{x}$ with $\boldsymbol{B}$ and then transforms the result with $\boldsymbol{A}$. 
It follows that in order to invert this composition of transformations, one must first pass the vector through $\boldsymbol{A}^{-1}$ and then through $\boldsymbol{B}^{-1}$:</li> </ol> <center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/inverse_matrix_mult.png" alt="drawing" width="900" /></center> <h2 id="appendix-proofs-of-properties-of-invertible-matrices">Appendix: Proofs of properties of invertible matrices</h2> <p><span style="color:#0060C6"><strong>Theorem 1 (Null space of an invertible matrix):</strong> The null space of an invertible matrix $\boldsymbol{A} \in \mathbb{R}^{n \times n}$ consists of only the zero vector $\boldsymbol{0}$.</span></p> <p><strong>Proof:</strong></p> <p>We must prove that</p> $\boldsymbol{Ax} = \boldsymbol{0}$ <p>has only the trivial solution $\boldsymbol{x} := \boldsymbol{0}$.</p> \begin{align*}\boldsymbol{Ax} &amp;= \boldsymbol{0} \\ \implies \boldsymbol{A}^{-1}\boldsymbol{Ax} &amp;= \boldsymbol{A}^{-1}\boldsymbol{0} \\ \implies \boldsymbol{x} &amp;= \boldsymbol{0} \end{align*} <p>$\square$</p> <p><span style="color:#0060C6"><strong>Theorem 2 (Invertible matrices characterize onto functions):</strong> An invertible matrix $\boldsymbol{A} \in \mathbb{R}^{n \times n}$ characterizes an onto linear transformation. </span></p> <p><strong>Proof:</strong></p> <p>Let $\boldsymbol{b} \in \mathbb{R}^n$. 
Then, there exists a vector, $\boldsymbol{x} \in \mathbb{R}^n$ such that $$\boldsymbol{Ax} = \boldsymbol{b}$$ This solution is precisely</p> $\boldsymbol{x} := \boldsymbol{A}^{-1}\boldsymbol{b}$ <p>as we see below:</p> \begin{align*}&amp;\boldsymbol{A}(\boldsymbol{A}^{-1}\boldsymbol{b}) = \boldsymbol{b} \\ \implies &amp; (\boldsymbol{AA}^{-1})\boldsymbol{b} = \boldsymbol{b} &amp;&amp; \text{associative law} \\ \implies &amp; \boldsymbol{I}\boldsymbol{b} = \boldsymbol{b} &amp;&amp; \text{definition of inverse matrix} \\ \implies &amp; \boldsymbol{b} = \boldsymbol{b} \end{align*} <p>$\square$</p> <p><span style="color:#0060C6"><strong>Theorem 3 (Invertible matrices characterize one-to-one functions):</strong> An invertible matrix $\boldsymbol{A} \in \mathbb{R}^{n \times n}$ characterizes a one-to-one linear transformation.</span></p> <p><strong>Proof:</strong></p> <p>For the sake of contradiction assume that there exist two vectors $\boldsymbol{x}$ and $\boldsymbol{x}'$ such that $\boldsymbol{x} \neq \boldsymbol{x}'$ and that $$\boldsymbol{Ax} = \boldsymbol{b}$$ and $$\boldsymbol{Ax}' = \boldsymbol{b}$$ where $\boldsymbol{b} \neq \boldsymbol{0}$. Then,</p> \begin{align*} \boldsymbol{Ax} - \boldsymbol{Ax}' &amp;= \boldsymbol{0} \\ \implies \boldsymbol{A}(\boldsymbol{x} - \boldsymbol{x}') &amp;= \boldsymbol{0}\end{align*} <p>By Theorem 1, it must hold that</p> $\boldsymbol{x} - \boldsymbol{x}' = \boldsymbol{0}$ <p>which implies that $\boldsymbol{x} = \boldsymbol{x}'$. This contradicts our original assumption. Therefore, it must hold that there do not exist two distinct vectors $\boldsymbol{x}$ and $\boldsymbol{x}'$ that map to the same vector via the invertible matrix $\boldsymbol{A}$. 
Therefore, $\boldsymbol{A}$ encodes a one-to-one function.</p> <p>$\square$</p> <p><span style="color:#0060C6"><strong>Theorem 4 (Column vectors of invertible matrices are linearly independent):</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{n \times n}$, $\boldsymbol{A}$ is invertible if and only if its columns $$\boldsymbol{a}_{*,1}, \dots, \boldsymbol{a}_{*,n}$$ are linearly independent. </span></p> <p><strong>Proof:</strong></p> <p>We first prove the $\implies$ direction: we assume that $\boldsymbol{A}$ is invertible and show that under this assumption, the only solution to $$\boldsymbol{a}_{*,1}x_1 + \dots + \boldsymbol{a}_{*,n}x_n = \boldsymbol{0}$$ is $\boldsymbol{x} := \boldsymbol{0}$, which is the condition for linear independence.</p> \begin{align*}\boldsymbol{a}_{*,1}x_1 + \dots + \boldsymbol{a}_{*,n}x_n &amp;= \boldsymbol{0} \\ \implies \boldsymbol{Ax} &amp;= \boldsymbol{0} \\ \implies \boldsymbol{A}^{-1}\boldsymbol{Ax} &amp;= \boldsymbol{A}^{-1}\boldsymbol{0} \\ \implies \boldsymbol{x} &amp;= \boldsymbol{0} \end{align*} <p>We now prove the $\impliedby$ direction: we assume the columns of $\boldsymbol{A}$ are linearly independent and show that under this assumption there exists a matrix $\boldsymbol{C}$ such that</p> $\boldsymbol{CA} = \boldsymbol{AC} = \boldsymbol{I}$ <p>Since the columns of $\boldsymbol{A}$ are linearly independent, the <a href="https://en.wikipedia.org/wiki/Row_echelon_form#Reduced_row_echelon_form">reduced row echelon form</a> of $\boldsymbol{A}$ has a <a href="https://en.wikipedia.org/wiki/Pivot_element">pivot</a> in every column. Because $\boldsymbol{A}$ is square, its reduced row echelon form is therefore the identity matrix. This means that there exists a sequence of <a href="https://en.wikipedia.org/wiki/Elementary_matrix">elementary row matrices</a> $\boldsymbol{E}_1, \dots, \boldsymbol{E}_k$ such that when multiplied by $\boldsymbol{A}$, they produce the identity matrix. 
That is, $$(\boldsymbol{E}_1\dots\boldsymbol{E}_k)\boldsymbol{A} = \boldsymbol{I}$$</p> <p>Though we do not prove it formally here, elementary row matrices are invertible. That is, you can always “undo” the transformation imposed by an elementary row matrix (e.g. for an elementary row matrix that swaps rows, you can always swap them back). Furthermore, since the product of invertible matrices is also invertible, $(\boldsymbol{E}_1\dots\boldsymbol{E}_k)$ is invertible. Thus,</p> \begin{align*} &amp; (\boldsymbol{E}_1\dots\boldsymbol{E}_k)\boldsymbol{A} = \boldsymbol{I} \\ \implies &amp; (\boldsymbol{E}_1\dots\boldsymbol{E}_k)^{-1} (\boldsymbol{E}_1 \dots \boldsymbol{E}_k)\boldsymbol{A} = (\boldsymbol{E}_1 \dots \boldsymbol{E}_k)^{-1}\boldsymbol{I} \\ \implies &amp; \boldsymbol{A} = (\boldsymbol{E}_1 \dots \boldsymbol{E}_k)^{-1} \boldsymbol{I} \\ \implies &amp; \boldsymbol{A} = \boldsymbol{I}(\boldsymbol{E}_1 \dots \boldsymbol{E}_k)^{-1} \\ \implies &amp; \boldsymbol{A}(\boldsymbol{E}_1 \dots \boldsymbol{E}_k) = \boldsymbol{I}(\boldsymbol{E}_1 \dots \boldsymbol{E}_k)^{-1}(\boldsymbol{E}_1 \dots \boldsymbol{E}_k) \\ \implies &amp; \boldsymbol{A}(\boldsymbol{E}_1 \dots \boldsymbol{E}_k) = \boldsymbol{I} \end{align*} <p>Hence, $\boldsymbol{C} := (\boldsymbol{E}_1 \dots \boldsymbol{E}_k)$ is the matrix for which $\boldsymbol{AC} = \boldsymbol{CA} = \boldsymbol{I}$ and is thus $\boldsymbol{A}$’s inverse.</p> <p>$\square$</p> <p><span style="color:#0060C6"><strong>Theorem 5 (Inverse of matrix product):</strong> Given two invertible matrices $\boldsymbol{A}, \boldsymbol{B} \in \mathbb{R}^{n \times n}$, the inverse of their product $\boldsymbol{AB}$ is given by $\boldsymbol{B}^{-1}\boldsymbol{A}^{-1}$.</span></p> <p><strong>Proof:</strong></p> <p>We seek the inverse matrix $\boldsymbol{X}$ such that $$(\boldsymbol{AB})\boldsymbol{X} = \boldsymbol{I}$$:</p> \begin{align*} &amp; \boldsymbol{ABX} = \boldsymbol{I} \\ \implies &amp; \boldsymbol{A}^{-1}\boldsymbol{ABX} = \boldsymbol{A}^{-1}\boldsymbol{I}\\ \implies &amp;\boldsymbol{B}^{-1}\boldsymbol{BX} = 
\boldsymbol{B}^{-1}\boldsymbol{A}^{-1}\boldsymbol{I} \\ \implies &amp;\boldsymbol{X} = \boldsymbol{B}^{-1}\boldsymbol{A}^{-1} \end{align*} <p>$\square$</p>Matthew N. BernsteinIn this post, we discuss invertible matrices: those matrices that characterize invertible linear transformations. We discuss three different perspectives for intuiting inverse matrices as well as several of their properties.Perplexity: a more intuitive measure of uncertainty than entropy2021-10-08T00:00:00-07:002021-10-08T00:00:00-07:00https://mbernste.github.io/posts/perplexity<p><em>Like entropy, perplexity is an information-theoretic quantity that describes the uncertainty of a random variable. In fact, perplexity is simply a monotonic function of entropy and thus, in some sense, they can be used interchangeably. So why do we need it? In this post, I’ll discuss why perplexity is a more intuitive measure of uncertainty than entropy.</em></p> <h2 id="introduction">Introduction</h2> <p>Perplexity is an information-theoretic quantity that crops up in a number of contexts such as <a href="https://en.wikipedia.org/wiki/Perplexity">natural language processing</a> and is a parameter for the popular <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> algorithm used for dimensionality reduction.</p> <p>Like <a href="https://mbernste.github.io/posts/entropy/">entropy</a>, perplexity provides a measure of the amount of uncertainty of a random variable. In fact, perplexity is simply a monotonic function of entropy. Given a discrete random variable, $X$, perplexity is defined as:</p> $\text{Perplexity}(X) := 2^{H(X)}$ <p>where $H(X)$ is the entropy of $X$.</p> <p>When I first saw this definition, I did not understand its purpose. That is, if perplexity is simply exponentiated entropy, why do we need it? 
After all, we have a good intuition for entropy already: it describes <a href="https://mbernste.github.io/posts/sourcecoding/">the number of bits</a> needed to encode random samples from $X$’s probability distribution. So why perplexity?</p> <h2 id="an-intuitive-measure-of-uncertainty">An intuitive measure of uncertainty</h2> <p>Perplexity is often used instead of entropy because it is arguably more intuitive to our human minds than entropy. Of course, as we’ve discussed in a <a href="https://mbernste.github.io/posts/sourcecoding/">previous blog post</a>, entropy describes the number of bits needed to encode random samples from a distribution, which one may argue is already intuitive; however, I would argue the contrary. If I tell you that a given random variable has an entropy of 7, how should you <em>feel</em> about that at a gut level?</p> <p>Arguably, perplexity provides a more human way of thinking about the random variable’s uncertainty and that is because the perplexity of a uniform, discrete random variable with $K$ outcomes is $K$ (see the Appendix to this post)! For example, the perplexity of a fair coin is two and the perplexity of a fair six-sided die is six. This provides a frame of reference for interpreting a perplexity value. That is, if the perplexity of some random variable $X$ is 20, our uncertainty towards the outcome of $X$ is equal to the uncertainty we would feel towards a fair 20-sided die. 
This helps <em>intuit</em> the uncertainty at a more gut level!</p> <h2 id="appendix">Appendix</h2> <p><span style="color:#0060C6"><strong>Theorem:</strong> Given a discrete uniform random variable $X \sim \text{Cat}(p_1, p_2, \dots, p_K)$ where $\forall i,j \in [K], p_i = p_j = 1/K$, it holds that the perplexity of $X$ is $K$.</span></p> <p><strong>Proof:</strong></p> \begin{align*} \text{Perplexity}(X) &amp;:= 2^{H(X)} \\ &amp;= 2^{-\sum_{i=1}^K \frac{1}{K} \log_2 \frac{1}{K}} \\ &amp;= 2^{-\log_2 \frac{1}{K}} \\ &amp;= \frac{1}{2^{\log_2 \frac{1}{K}}} \\ &amp;= K \end{align*} <p>$\square$</p>Matthew N. BernsteinLike entropy, perplexity is an information-theoretic quantity that describes the uncertainty of a random variable. In fact, perplexity is simply a monotonic function of entropy and thus, in some sense, they can be used interchangeably. So why do we need it? In this post, I’ll discuss why perplexity is a more intuitive measure of uncertainty than entropy.Variational inference2021-05-31T00:00:00-07:002021-05-31T00:00:00-07:00https://mbernste.github.io/posts/variational_inference<p><em>In this post, I will present a high-level explanation of variational inference: a paradigm for estimating a posterior distribution when computing it explicitly is intractable. Variational inference finds an approximate posterior by solving a specific optimization problem that seeks to minimize the disparity between the true posterior and the approximate posterior.</em></p> <h2 id="introduction">Introduction</h2> <p>Variational inference is a high-level paradigm for estimating a posterior distribution when computing it explicitly is intractable. More specifically, variational inference is used in situations in which we have a model that involves hidden random variables $Z$, observed data $X$, and some posited probabilistic model over the hidden and observed random variables $$P(Z, X)$$. Our goal is to compute the posterior distribution $P(Z \mid X)$. 
In an ideal situation, we would do so by using Bayes’ theorem:</p> $p(z \mid x) = \frac{p(x \mid z)p(z)}{p(x)}$ <p>where $$z$$ and $$x$$ are realizations of $$Z$$ and $$X$$ respectively and $$p(.)$$ are probability mass/density functions for the distributions implied by their arguments.</p> <p>In practice, it is often difficult to compute $p(z \mid x)$ via Bayes’ theorem because the denominator $p(x)$ does not have a closed form. Usually, the denominator $p(x)$ can only be expressed as an integral that marginalizes over $z$: $p(x) = \int p(x, z) \ dz$. In such scenarios, we’re often forced to approximate $p(z \mid x)$ rather than compute it directly. Variational inference is one such approximation technique.</p> <h2 id="intuition">Intuition</h2> <p>Instead of computing $$p(z \mid x)$$ exactly via Bayes’ theorem, variational inference attempts to find another distribution $q(z)$ that is “close” to $$p(z \mid x)$$ (how we define “closeness” between distributions will be addressed later in this post). Ideally, $q(z)$ is easier to evaluate than $$p(z \mid x)$$, and, if $$p(z \mid x)$$ and $$q(z)$$ are similar, then we can use $$q(z)$$ as a replacement for $p(z \mid x)$ for any relevant downstream tasks.</p> <p>We restrict our search for $$q(z)$$ to a family of surrogate distributions over $$Z$$, called the <strong>variational distribution family</strong>, denoted by the set of distributions $\mathcal{Q}$. Our goal then is to find the distribution $q \in \mathcal{Q}$ that makes $q(z)$ as “close” to $p(z \mid x)$ as possible. When each member of $\mathcal{Q}$ is characterized by the values of a set of parameters $\phi$, we call $\phi$ the <strong>variational parameters</strong>. 
Our goal is then to find the value for $\hat{\phi}$ that makes $q(z \mid \phi)$ as close to $p(z \mid x)$ as possible and return $$q(z \mid \hat{\phi})$$ as our approximation of the true posterior.</p> <h2 id="details">Details</h2> <p>Variational inference uses the KL-divergence from $p(z \mid x)$ to $q(z)$ as a measure of “closeness” between these two distributions:</p> $KL(q(z) \ || \ p(z \mid x)) := E_{Z \sim q}\left[\log\frac{q(Z)}{p(Z \mid x)} \right]$ <p>Thus, variational inference attempts to find</p> $\hat{q} := \text{argmin}_q \ KL(q(z) \ || \ p(z \mid x))$ <p>and then returns $\hat{q}(z)$ as the approximation to the posterior.</p> <p>Variational inference minimizes the KL-divergence by maximizing a surrogate quantity called the <strong>evidence lower bound (ELBO)</strong> (for a more in-depth discussion of the evidence lower bound, you can check out <a href="https://mbernste.github.io/posts/elbo/">my previous blog post</a>):</p> $\text{ELBO}(q) := E_{Z \sim q}\left[\log p(x, Z) \right] - E_{Z \sim q}\left[\log q(Z) \right]$ <p>That is, we can formulate an optimization problem that seeks to maximize the ELBO:</p> $\hat{q} := \text{argmax}_q \ \text{ELBO}(q)$ <p>The solution to this optimization problem is equivalent to the solution that minimizes the KL-divergence between $q(z)$ and $p(z \mid x)$. 
To see why this works, we can show that the KL-divergence can be formulated as the difference between the marginal log-likelihood of the observed data, $$\log p(x)$$ (called the <em>evidence</em>) and the ELBO:</p> \begin{align*}KL(q(z) \ || \ p(z \mid x)) &amp;= E_{Z \sim q}\left[\log\frac{q(Z)}{p(Z \mid x)} \right] \\ &amp;= E_{Z \sim q}\left[\log q(Z) \right] - E_{Z \sim q}\left[\log p(Z \mid x) \right] \\ &amp;= E_{Z \sim q}\left[\log q(Z) \right] - E_{Z \sim q}\left[\log \frac{p(Z, x)}{p(x)} \right] \\ &amp;= E_{Z \sim q}\left[\log q(Z) \right] - E_{Z \sim q}\left[\log p(Z, x) \right] + E_{Z \sim q}\left[\log p(x) \right] \\ &amp;= \log p(x) - \left( E_{Z \sim q}\left[\log p(x, Z) \right] - E_{Z \sim q}\left[\log q(Z) \right] \right)\\ &amp;= \log p(x) - \text{ELBO}(q)\end{align*} <p>Because $\log p(x)$ does not depend on $q$, minimizing the KL-divergence over $q$ is equivalent to maximizing the ELBO over $q$. Moreover, because the KL-divergence is non-negative, the ELBO is never larger than $\log p(x)$, which is why it is called a lower bound on the evidence.</p> <p>Conceptually, variational inference allows us to formulate our approximate Bayesian inference problem as an optimization problem. By formulating the problem as such, we can approach this optimization problem using the full toolkit available to us from the field of <a href="https://en.wikipedia.org/wiki/Mathematical_optimization">mathematical optimization</a>!</p> <h2 id="why-is-this-method-called-variational-inference">Why is this method called “variational” inference?</h2> <p>The term “variational” in “variational inference” comes from the mathematical area of <a href="https://en.wikipedia.org/wiki/Calculus_of_variations">the calculus of variations</a>. The calculus of variations is all about optimization problems that optimize <em>functions of functions</em>, called <a href="https://mbernste.github.io/posts/functionals/">functionals</a>.</p> <p>More specifically, let’s say we have some set of functions $\mathcal{F}$ where each $f \in \mathcal{F}$ maps items from some set $A$ to some set $B$. 
That is,</p> $f: A \rightarrow B$ <p>Let’s say we have some function $g$ that maps functions in $\mathcal{F}$ to real numbers $\mathbb{R}$. That is,</p> $g: \mathcal{F} \rightarrow \mathbb{R}$ <p>Then, we may wish to solve an optimization problem of the form:</p> $\text{arg max}_{f \in \mathcal{F}} g(f)$ <p>This is precisely the problem addressed in the calculus of variations. In the case of variational inference, the functional, $g$, that we are optimizing is the ELBO. The set of functions, $\mathcal{F}$, that we are searching over is the set of <a href="https://mbernste.github.io/posts/measure_theory_2/">measurable functions</a> in the variational family, $\mathcal{Q}$.</p>Matthew N. BernsteinIn this post, I will present a high-level explanation of variational inference: a paradigm for estimating a posterior distribution when computing it explicitly is intractable. Variational inference finds an approximate posterior by solving a specific optimization problem that seeks to minimize the disparity between the true posterior and the approximate posterior.