<h1 id="functionals-and-functional-derivatives">Functionals and functional derivatives</h1>
<p><em>Matthew N. Bernstein, 2022-04-10</em></p>
<p><em>The calculus of variations is a field of mathematics that deals with the optimization of functions of functions, called functionals. This topic was not taught to me in my computer science education, but it lies at the foundation of a number of important concepts and algorithms in the data sciences such as gradient boosting and variational inference. In this post, I will provide an explanation of the functional derivative and show how it relates to the gradient of an ordinary multivariate function.</em></p>
<h2 id="introduction">Introduction</h2>
<p>Multivariate calculus concerns itself with infinitesimal changes of numerical functions – that is, functions that accept a vector of real numbers and output a real number:</p>
\[f : \mathbb{R}^n \rightarrow \mathbb{R}\]
<p>In this blog post, we discuss the <strong>calculus of variations</strong>, a field of mathematics that generalizes the ideas in multivariate calculus relating to infinitesimal changes of traditional numeric functions to <em>functions of functions</em>, called <em>functionals</em>. Specifically, given a set of functions, $\mathcal{F}$, a <strong>functional</strong> is a mapping between $\mathcal{F}$ and the real numbers:</p>
\[F : \mathcal{F} \rightarrow \mathbb{R}\]
<p>Functionals are quite prevalent in machine learning and statistical inference. For example, <a href="https://mbernste.github.io/posts/entropy/">information entropy</a> can be considered a functional on probability mass functions. For a given <a href="https://mbernste.github.io/posts/measure_theory_2/">discrete random variable</a>, $X$, entropy can be thought of as a function that accepts as input $X$’s probability mass function, $p_X$, and outputs a real number:</p>
\[H(p_X) := -\sum_{x \in \mathcal{X}} p_X(x) \log p_X(x)\]
<p>where $\mathcal{X}$ is the <a href="https://en.wikipedia.org/wiki/Support_(mathematics)">support</a> of $p_X$.</p>
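<p>To make the "functional" perspective concrete, here is a minimal Python sketch (the representation of the pmf as a dictionary is an illustrative choice) in which entropy accepts a whole probability mass function as input and returns a real number:</p>

```python
import math

def entropy(p_X):
    # H(p_X) = -sum over the support of p_X(x) * log p_X(x), using the natural log
    return -sum(p * math.log(p) for p in p_X.values() if p > 0)

# A fair coin has entropy log(2) nats
p_fair = {'heads': 0.5, 'tails': 0.5}
print(entropy(p_fair))  # ≈ 0.6931
```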
<p>Another example of a functional is the <a href="https://mbernste.github.io/posts/elbo/">evidence lower bound (ELBO)</a>: a function that, like entropy, operates on probability distributions. The ELBO is a foundational quantity used in the popular <a href="https://mbernste.github.io/posts/em/">EM algorithm</a> and <a href="https://mbernste.github.io/posts/variational_inference/">variational inference</a> used for performing statistical inference with probabilistic models.</p>
<p>In this blog post, we will review some concepts in traditional calculus such as partial derivatives, directional derivatives, and gradients in order to introduce the definition of the <strong>functional derivative</strong>, which is simply the generalization of the gradient of numeric functions to functionals.</p>
<h2 id="a-review-of-derivatives-and-gradients">A review of derivatives and gradients</h2>
<p>In this section, we will introduce a few important concepts in multivariate calculus: derivatives, partial derivatives, directional derivatives, and gradients.</p>
<h3 id="derivatives">Derivatives</h3>
<p>Before going further, let’s quickly review the basic definition of the derivative for a univariate function $g$ that maps real numbers to real numbers. That is,</p>
\[g : \mathbb{R} \rightarrow \mathbb{R}\]
<p>The derivative of $g$ at input $x$, denoted $\frac{dg(x)}{dx}$, describes the rate of change of $g$ at $x$. It is defined rigorously as</p>
\[\frac{dg(x)}{dx} := \lim_{h \rightarrow 0}\frac{g(x+h)-g(x)}{h}\]
<p>Geometrically, $\frac{dg(x)}{dx}$ is the slope of the line that is tangential to $g$ at $x$ as depicted below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/derivative.png" alt="drawing" width="550" /></center>
<p>In this schematic, we depict the value of $h$ getting smaller and smaller. As it does, the slope of the secant line approaches that of the line tangential to $g$ at $x$. This slope is the derivative $\frac{dg(x)}{dx}$.</p>
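<p>We can mimic this limit numerically. Below is a small Python sketch (the function $g(x) = x^2$ is an illustrative choice) that approximates the derivative by plugging a small but finite $h$ into the limit definition:</p>

```python
def derivative(g, x, h=1e-6):
    # Approximate dg(x)/dx using the limit definition with a small finite h
    return (g(x + h) - g(x)) / h

g = lambda x: x ** 2
print(derivative(g, 3.0))  # ≈ 6.0, the exact derivative of x^2 at x = 3
```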
<h3 id="partial-derivatives">Partial derivatives</h3>
<p>We will now consider a continuous <em>multivariate</em> function $f$ that maps real-valued vectors $\boldsymbol{x} \in \mathbb{R}^n$ to real numbers. That is,</p>
\[f: \mathbb{R}^n \rightarrow \mathbb{R}\]
<p>Given $\boldsymbol{x} \in \mathbb{R}^n$, the <strong>partial derivative</strong> of $f$ with respect to the $i$th component of $\boldsymbol{x}$, denoted $\frac{\partial f(\boldsymbol{x})}{\partial x_i}$ is simply the derivative of $f$ if we hold all the components of $\boldsymbol{x}$ fixed, except for the $i$th component. Said differently, it tells us the rate of change of $f$ with respect to the $i$th dimension of the vector space in which $\boldsymbol{x}$ resides! This can be visualized below for a function $f : \mathbb{R}^2 \rightarrow \mathbb{R}$:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/partial_derivative.png" alt="drawing" width="450" /></center>
<p>As seen above, the partial derivative $\frac{\partial f(\boldsymbol{x})}{\partial x_1}$ is simply the derivative of the function $f(x_1, x_2)$ when holding $x_2$ fixed. That is, it is the slope of the line tangent to $f(x_1, x_2)$ along the $x_1$-axis when $x_2$ is held fixed.</p>
<h3 id="directional-derivatives">Directional derivatives</h3>
<p>We can see that the partial derivative of $f(\boldsymbol{x})$ with respect to the $i$th dimension of the vector space can be expressed as</p>
\[\frac{\partial f(\boldsymbol{x})}{\partial x_i} := \lim_{h \rightarrow 0} \frac{f(\boldsymbol{x} + h\boldsymbol{e}_i) - f(\boldsymbol{x})}{h}\]
<p>where $\boldsymbol{e}_i$ is the $i$th <a href="https://en.wikipedia.org/wiki/Standard_basis">standard basis vector</a> – that is, the vector of all zeroes except for a one in the $i$th position.</p>
<p>Geometrically, we can view the $i$th partial derivative of $f(\boldsymbol{x})$ as $f$’s rate of change along the direction of the $i$th standard basis vector of the vector space.</p>
<p>Thinking along these lines, there is nothing stopping us from generalizing this idea to <em>any unit vector</em> rather than just the standard basis vectors. Given some unit vector $\boldsymbol{v}$, we define the <strong>directional derivative</strong> of $f(\boldsymbol{x})$ along the direction of $\boldsymbol{v}$ as</p>
\[D_{\boldsymbol{v}}f(\boldsymbol{x}) := \lim_{h \rightarrow 0} \frac{f(\boldsymbol{x} + h\boldsymbol{v}) - f(\boldsymbol{x})}{h}\]
<p>Geometrically, this is simply the rate of change of $f$ along the direction at which $\boldsymbol{v}$ is pointing! This can be viewed schematically below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/directional_derivative.png" alt="drawing" width="450" /></center>
<p>For a given vector $\boldsymbol{v}$, we can derive a formula for $D_{\boldsymbol{v}}f(\boldsymbol{x})$. That is, we can show that:</p>
\[D_{\boldsymbol{v}}f(\boldsymbol{x}) = \sum_{i=1}^n \left( \frac{\partial f(\boldsymbol{x})}{\partial x_i} \right) v_i\]
<p>See Theorem 1 in the Appendix of this post for a proof of this equation. Now, if we define the vector of all partial derivatives $f(\boldsymbol{x})$ as</p>
\[\nabla f(\boldsymbol{x}) := \begin{bmatrix}\frac{\partial f(\boldsymbol{x})}{\partial x_1} & \frac{\partial f(\boldsymbol{x})}{\partial x_2} & \dots & \frac{\partial f(\boldsymbol{x})}{\partial x_n} \end{bmatrix}\]
<p>Then we can represent the directional derivative as simply the <a href="https://en.wikipedia.org/wiki/Dot_product">dot product</a> between $\nabla f(\boldsymbol{x})$ and $\boldsymbol{v}$:</p>
\[D_{\boldsymbol{v}}f(\boldsymbol{x}) = \nabla f(\boldsymbol{x}) \cdot \boldsymbol{v}\]
<p>This vector, $\nabla f(\boldsymbol{x})$, is called the <strong>gradient vector</strong> of $f$ at $\boldsymbol{x}$.</p>
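<p>As a sanity check, here is a Python sketch (the function $f$ and the vectors below are illustrative choices, not from the post) that approximates the directional derivative by its limit definition and compares it against the dot product of the gradient with $\boldsymbol{v}$:</p>

```python
import math

def partial(f, x, i, h=1e-6):
    # Central-difference approximation of the i-th partial derivative of f at x
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    return (f(xp) - f(xm)) / (2 * h)

def directional(f, x, v, h=1e-6):
    # Limit definition: rate of change of f at x along the unit vector v
    xv = [xi + h * vi for xi, vi in zip(x, v)]
    return (f(xv) - f(x)) / h

f = lambda x: x[0] ** 2 + 3 * x[1]
x = [1.0, 2.0]
v = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # a unit vector

grad = [partial(f, x, 0), partial(f, x, 1)]      # ≈ [2, 3]
dot = sum(g * vi for g, vi in zip(grad, v))      # gradient dotted with v
print(directional(f, x, v), dot)  # both ≈ 3.5355 (i.e., 5 / sqrt(2))
```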
<h3 id="gradients">Gradients</h3>
<p>As described above, the <strong>gradient vector</strong>, $\nabla f(\boldsymbol{x})$, is the vector constructed by taking the partial derivative of $f$ at $\boldsymbol{x}$ along each basis vector. It turns out that the gradient vector points in the <em>direction of steepest ascent</em> along $f$’s surface at $\boldsymbol{x}$. This can be shown schematically below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/gradient.png" alt="drawing" width="450" /></center>
<p>We prove this property of the gradient vector in Theorem 2 of the Appendix to this post.</p>
<h2 id="functional-derivatives">Functional derivatives</h2>
<p>Now, we will seek to generalize the notion of the gradients to functionals. We’ll let $\mathcal{F}$ be some set of functions, and for simplicity, we’ll let each $f$ be a continuous real-valued function. That is, for each $f \in \mathcal{F}$, we have $f: \mathbb{R} \rightarrow \mathbb{R}$. Then, we’ll consider a functional $F$ that maps each $f \in \mathcal{F}$ to a number. That is,</p>
\[F: \mathcal{F} \rightarrow \mathbb{R}\]
<p>Now, we’re going to spoil the punchline with the definition for the functional derivative:</p>
<p><span style="color:#0060C6"><strong>Definition 1 (Functional derivative):</strong> Given a function $f \in \mathcal{F}$, the <strong>functional derivative</strong> of $F$ at $f$, denoted $\frac{\partial{F}}{\partial f}$, is defined to be the function for which: </span></p>
<p><span style="color:#0060C6">\(\begin{align*}\int \frac{\partial F}{\partial f}(x) \eta(x) \ dx &= \lim_{h \rightarrow 0}\frac{F(f + h \eta) - F(f)}{h} \\ &= \frac{d F(f + h\eta)}{dh}\bigg\rvert_{h=0}\end{align*}\)</span></p>
<p><span style="color:#0060C6">where $h$ is a scalar and $\eta$ is an arbitrary function in $\mathcal{F}$.</span></p>
<p>Woah. What is going on here? How on earth does this define the functional derivative? And why is the functional derivative, $\frac{\partial{F}}{\partial f}$ buried inside such a seemingly complicated equation?</p>
<p>Let’s break it down.</p>
<p>First, notice the similarity of the right-hand side of the equation of Definition 1 to the definition of the directional gradient:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/directional_gradient_functional_derivative.png" alt="drawing" width="350" /></center>
<p>Indeed, the equation in Definition 1 describes the analogy of the directional derivative for functionals! That is, it describes the rate of change of $F$ at $f$ in the direction of the function $\eta$!</p>
<p>How does this work? As we shrink $h$ down to an infinitesimally small number, $f + h \eta$ will become arbitrarily close to $f$. In the illustration below, we see an example function $f$ (red) and another function $\eta$ (blue). As $h$ gets smaller, the function $f + h\eta$ (purple) becomes more similar to $f$:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/function_variationn.png" alt="drawing" width="450" /></center>
<p>Thus, we see that $h \eta$ is the “infinitesimal” change to $f$ that is analogous to the infinitesimal change to $\boldsymbol{x}$ that we describe by $h\boldsymbol{v}$ in the definition of the directional derivative. The quantity $h \eta$ is called a <strong>variation</strong> of $f$ (hence the word “variational” in the name “calculus of variations”).</p>
<p>Now, so far we have only shown that the equation in Definition 1 describes something analogous to the directional derivative for multivariate numerical functions. We showed this by comparing the right-hand side of the equation in Definition 1 to the definition of the directional gradient. However, as Definition 1 states, the functional derivative itself is defined to be the function $\frac{\partial F}{\partial f}$ within the integral on the left-hand side of the equation. What is going on here? Why is <em>this</em> the functional derivative?</p>
<p>Now, it is time to recall the gradient for traditional multivariate functions. Specifically, notice the similarity between the alternative formulation of the directional derivative, which uses the gradient, and the left-hand side of the equation in Definition 1:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/functional_derivative_gradient.png" alt="drawing" width="450" /></center>
<p>Notice that these equations have similar forms. Instead of a summation in the definition of the directional derivative, we have an integral in the equation for Definition 1. Moreover, instead of summing over elements of the vector $\boldsymbol{v}$, we “sum” (using an integral) over the values of $\eta(x)$. Lastly, instead of the partial derivatives of $f$, we now have the values of the function $\frac{\partial F}{\partial f}$ at each $x$. This function, $\frac{\partial F}{\partial f}(x)$, is analogous to the gradient! It is thus called the functional derivative!</p>
<p>To drive this home further, recall that we can represent the directional derivative as the dot product between the gradient vector and $\boldsymbol{v}$:</p>
\[D_{\boldsymbol{v}}f(\boldsymbol{x}) = \nabla f(\boldsymbol{x}) \cdot \boldsymbol{v}\]
<p>To make this relationship clearer, we note that the dot product is an <a href="https://en.wikipedia.org/wiki/Inner_product_space">inner product</a>. Thus, we can write this definition in a more general way as</p>
\[D_{\boldsymbol{v}}f(\boldsymbol{x}) := \langle \nabla f(\boldsymbol{x}), \boldsymbol{v} \rangle\]
<p>We also recall that a valid inner product between continuous functions $f$ and $g$ is</p>
\[\langle f, g \rangle := \int f(x)g(x) dx\]
<p>Thus, we see that</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/functional_derivative_gradient_w_inner_product.png" alt="drawing" width="450" /></center>
<p>Said differently, the functional derivative of a functional, $F$, at a function $f$, denoted $\frac{\partial F}{\partial f}$, is the function for which, given any arbitrary function $\eta$, the inner product between $\frac{\partial F}{\partial f}$ and $\eta$ is the directional derivative of $F$ in the direction of $\eta$!</p>
<h2 id="an-example-the-functional-derivative-of-entropy">An example: the functional derivative of entropy</h2>
<p>As a toy example, let’s derive the functional derivative of <a href="https://mbernste.github.io/posts/entropy/">information entropy</a>. Recall from the beginning of this post that the entropy $H$ of a discrete random variable $X$ can be viewed as a functional of $X$’s probability mass function $p_X$. More specifically, $H$ is defined as</p>
\[H(p_X) := \sum_{x \in \mathcal{X}} - p_X(x) \log p_X(x)\]
<p>where $\mathcal{X}$ is the support of $p_X$.</p>
<p>Let’s derive its functional derivative. We start with an arbitrary probability mass function $\eta : \mathcal{X} \rightarrow [0,1]$. Then, we can write out the equation that defines the functional derivative:</p>
\[\sum_{x \in \mathcal{X}} \frac{\partial H}{\partial p_X}(x) \eta(x) = \frac{d H(p_X + h\eta)}{dh}\bigg\rvert_{h=0}\]
<p>Let’s simplify this equation:</p>
\[\begin{align*}
\sum_{x \in \mathcal{X}} \frac{\partial H}{\partial p_X}(x) \eta(x)
&= \frac{d H(p_X + h\eta)}{dh}\bigg\rvert_{h=0} \\
&= \frac{d}{dh} \sum_{x \in \mathcal{X}} -(p_X(x) + h\eta(x))\log(p_X(x) + h\eta(x))\bigg\rvert_{h=0} \\
&= \sum_{x \in \mathcal{X}} \left(- \eta(x)\log(p_X(x) + h\eta(x)) - \eta(x)\right)\bigg\rvert_{h=0} \\
&= \sum_{x \in \mathcal{X}} (-1 - \log p_X(x))\eta(x)\end{align*}\]
<p>Now we see that $\frac{\partial H}{\partial p_X}(x) = -1 - \log p_X(x)$ and thus, this is the functional derivative!</p>
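<p>We can sanity check this result numerically. The Python sketch below (the particular pmf and direction function are arbitrary illustrative choices) compares the directional derivative of $H$ at $p_X$ along $\eta$, computed via the limit definition with a small finite $h$, against the sum $\sum_x (-1 - \log p_X(x))\eta(x)$:</p>

```python
import math

def entropy(p):
    # H(p) = -sum_x p(x) log p(x) over a finite support
    return -sum(pi * math.log(pi) for pi in p)

p = [0.2, 0.3, 0.5]      # the probability mass function p_X
eta = [0.5, 0.25, 0.25]  # an arbitrary direction function eta

h = 1e-6
# Left-hand side: numerical directional derivative of H at p along eta
lhs = (entropy([pi + h * ei for pi, ei in zip(p, eta)]) - entropy(p)) / h

# Right-hand side: inner product of the claimed functional derivative with eta
rhs = sum((-1 - math.log(pi)) * ei for pi, ei in zip(p, eta))

print(lhs, rhs)  # the two values agree to several decimal places
```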
<h2 id="appendix">Appendix</h2>
<p><span style="color:#0060C6"><strong>Theorem 1:</strong> Given a differentiable function $f : \mathbb{R}^n \rightarrow \mathbb{R}$, vectors $\boldsymbol{x}, \boldsymbol{v} \in \mathbb{R}^n$, where $\boldsymbol{v}$ is a unit vector, then $D_{\boldsymbol{v}} f(\boldsymbol{x}) = \sum_{i=1}^n \left( \frac{\partial f(\boldsymbol{x})}{\partial x_i} \right) v_i$.</span></p>
<p><strong>Proof:</strong></p>
<p>Consider $\boldsymbol{x}$ and $\boldsymbol{v}$ to be fixed and let $g(z) := f(\boldsymbol{x} + z\boldsymbol{v})$. Then,</p>
\[\frac{dg(z)}{dz} = \lim_{h \rightarrow 0} \frac{g(z+h) - g(z)}{h}\]
<p>Evaluating this derivative at $z = 0$, we see that</p>
\[\begin{align*} \frac{dg(z)}{dz}\bigg\rvert_{z=0} &= \lim_{h \rightarrow 0} \frac{g(h) - g(0)}{h} \\ &= \lim_{h \rightarrow 0} \frac{f(\boldsymbol{x} + h\boldsymbol{v}) - f(\boldsymbol{x})}{h} \\ &= D_{\boldsymbol{v}} f(\boldsymbol{x}) \end{align*}\]
<p>We can then apply the <a href="https://en.wikipedia.org/wiki/Chain_rule#Multivariable_case">multivariate chain rule</a> and see that</p>
\[\frac{dg(z)}{dz} = \sum_{i=1}^n D_i f(\boldsymbol{x} + z\boldsymbol{v}) \frac{d (x_i + zv_i)}{dz}\]
<p>where $D_i f(\boldsymbol{x} + z\boldsymbol{v})$ is the partial derivative of $f$ with respect to its $i$th argument when evaluated at $\boldsymbol{x} + z\boldsymbol{v}$.</p>
<p>Now, evaluating this derivative at $z = 0$, we see that</p>
\[\begin{align*} \frac{dg(z)}{dz}\bigg\rvert_{z=0} &= \sum_{i=1}^n D_i f(\boldsymbol{x}) v_i \\ &= \sum_{i=1}^n \frac{\partial f(\boldsymbol{x})}{\partial x_i} v_i \end{align*}\]
<p>Putting these two results together, we see that</p>
\[D_{\boldsymbol{v}} f(\boldsymbol{x}) = \sum_{i=1}^n \frac{\partial f(\boldsymbol{x})}{\partial x_i} v_i\]
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 2:</strong> Given a differentiable function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ and vector $\boldsymbol{x} \in \mathbb{R}^n$, $f$’s direction of steepest ascent is the direction pointed to by the gradient $\nabla f(\boldsymbol{x})$.</span></p>
<p><strong>Proof:</strong></p>
<p>As shown in Theorem 1, given an arbitrary unit vector $\boldsymbol{v} \in \mathbb{R}^n$, the directional derivative $D_{\boldsymbol{v}} f(\boldsymbol{x})$ can be calculated by taking the dot product of the gradient vector with $\boldsymbol{v}$:</p>
\[D_{\boldsymbol{v}} f(\boldsymbol{x}) = \nabla f(\boldsymbol{x}) \cdot \boldsymbol{v}\]
<p>The dot product can be computed as</p>
\[\nabla f(\boldsymbol{x}) \cdot \boldsymbol{v} = ||\nabla f(\boldsymbol{x})|| ||\boldsymbol{v}|| \cos \theta\]
<p>where $\theta$ is the angle between the two vectors. The $\cos$ function is maximized (and equals 1) when $\theta = 0$ and thus, the directional derivative is maximized when $\theta = 0$. Thus, the unit vector that maximizes the directional derivative is the vector pointing in the same direction as the gradient, proving that the gradient points in the direction of steepest ascent.</p>
<p>$\square$</p>
<h1 id="normed-vector-spaces">Normed vector spaces</h1>
<p><em>Matthew N. Bernstein, 2021-11-23</em></p>
<p><em>When first introduced to Euclidean vectors, one is taught that the length of the vector’s arrow is called the norm of the vector. In this post, we present the more rigorous and abstract definition of a norm and show how it generalizes the notion of “length” to non-Euclidean vector spaces. We also discuss how the norm induces a metric function on pairs of vectors so that one can discuss distances between vectors.</em></p>
<h2 id="introduction">Introduction</h2>
<p>A <strong>normed vector space</strong> is a vector space in which each vector is associated with a scalar value called a <strong>norm</strong>. In a standard Euclidean vector space, the length of each vector is a norm:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/Norm.png" alt="drawing" width="200" /></center>
<p>The more abstract, rigorous definition of a norm generalizes this notion of length to any vector space as follows:</p>
<p><span style="color:#0060C6"><strong>Definition 1 (normed vector space):</strong> A <strong>normed vector space</strong> is a vector space $(\mathcal{V}, \mathcal{F})$ associated with a function $||.|| : \mathcal{V} \rightarrow \mathbb{R}$, called a <strong>norm</strong>, that obeys the following axioms:</span></p>
<ol>
<li><span style="color:#0060C6">$\forall \boldsymbol{v} \in \mathcal{V}, \ \ ||\boldsymbol{v}|| \geq 0$</span></li>
<li><span style="color:#0060C6">$\forall \boldsymbol{v} \in \mathcal{V}, \forall \alpha \in \mathcal{F}, \ \ ||\alpha\boldsymbol{v}|| = |\alpha| ||\boldsymbol{v}||$</span></li>
<li><span style="color:#0060C6">$\forall \boldsymbol{v}, \boldsymbol{u} \in \mathcal{V}, \ \ ||\boldsymbol{u} + \boldsymbol{v}|| \leq ||\boldsymbol{u}|| + ||\boldsymbol{v}||$</span></li>
</ol>
<p>Here, we outline the intuition behind each axiom in the definition above and describe how these axioms capture this idea of length:</p>
<ul>
<li>Axiom 1 says that all vectors should have a positive length. This enforces our intuition that a “length’’ is a positive quantity.</li>
<li>Axiom 2 says that if we multiply a vector by a scalar, it’s length should increase by the magnitude (i.e. the absolute value) of that scalar. This axiom ties together the notion of scaling vectors (Axiom 6 in the <a href="https://mbernste.github.io/posts/vector_spaces/">definition of a vector space</a>) to the notion of “length” for a vector. It essentially says that to scale a vector is to stretch the vector.</li>
<li>Axiom 3 says that the length of the sum of two vectors should not exceed the sum of the lengths of each vector. This enforces our intuition that if we add together two objects that each have a “length”, the resultant object should not exceed the sum of the lengths of the original objects.</li>
</ul>
<p>Following the axioms for a normed vector space, one can also show that only the zero vector has zero length (Theorem 1 in the Appendix to this post).</p>
<h2 id="unit-vectors">Unit vectors</h2>
<p>In a normed vector space, a <strong>unit vector</strong> is a vector with norm equal to one. Given a vector $\boldsymbol{v}$, a unit vector can be derived by simply dividing the vector by its norm (Theorem 2 in the Appendix). This unit vector, called the <strong>normalized vector</strong> of $\boldsymbol{v}$ is denoted $\hat{\boldsymbol{v}}$. In a Euclidean vector space, the normalized vector $\hat{\boldsymbol{v}}$ is the unit vector that points in the same direction as $\boldsymbol{v}$.</p>
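<p>In Python, assuming the standard Euclidean (L2) norm, this normalization looks like the following sketch:</p>

```python
import math

def l2_norm(v):
    # Euclidean length of v
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    # v_hat = v / ||v|| has norm one and, in Euclidean space,
    # points in the same direction as v
    n = l2_norm(v)
    return [x / n for x in v]

v = [3.0, 4.0]
v_hat = normalize(v)
print(v_hat, l2_norm(v_hat))  # [0.6, 0.8] with norm 1.0
```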
<p>Unit vectors are important because they generalize the idea of “direction” in Euclidean spaces to vector spaces that are not Euclidean. In a Euclidean space, the unit vectors all lie on the sphere of radius one centered at the origin:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/UnitVectors.png" alt="drawing" width="200" /></center>
<p>Thus, the set of all unit vectors can be used to define the set of all “directions” that vectors can point in the vector space. Because of this, one can form any vector by decomposing it into a unit vector multiplied by some scalar. In this way, unit vectors generalize the notion of “direction” in a Euclidean vector space to non-Euclidean vector spaces.</p>
<h2 id="normed-vector-spaces-are-also-metric-spaces">Normed vector spaces are also metric spaces</h2>
<p>All normed vector spaces are also <a href="https://en.wikipedia.org/wiki/Metric_(mathematics)">metric spaces</a> – that is, the norm function induces a metric function on pairs of vectors that can be interpreted as a “distance” between them (Theorem 3 in the Appendix). This metric is defined simply as:</p>
\[d(\boldsymbol{x}, \boldsymbol{y}) := \|\boldsymbol{x} - \boldsymbol{y}\|\]
<p>That is, if one subtracts one vector from the other, then the “length” of the resultant vector can be interpreted as the “distance” between those vectors.</p>
<p>In the figure below we show how the norm can be used to form a metric between Euclidean vectors. On the left, we depict two vectors, $\boldsymbol{v}$ and $\boldsymbol{u}$, as arrows. On the right we depict these vectors as points in Euclidean space. The distance between these points is given by the norm of the difference vector between $\boldsymbol{u}$ and $\boldsymbol{v}$.</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/NormAsMetric.png" alt="drawing" width="400" /></center>
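<p>The sketch below (again assuming the Euclidean L2 norm) implements this norm-induced metric directly, by taking the norm of the difference vector:</p>

```python
import math

def l2_norm(v):
    # Euclidean length of v
    return math.sqrt(sum(x * x for x in v))

def distance(x, y):
    # d(x, y) := ||x - y||, the norm of the difference vector
    return l2_norm([xi - yi for xi, yi in zip(x, y)])

u = [1.0, 2.0]
v = [4.0, 6.0]
print(distance(u, v))  # 5.0, the Euclidean distance between the two points
```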
<h2 id="examples-of-norms">Examples of norms</h2>
<p>Notably, a norm is a function that satisfies a set of axioms and thus, one may consider multiple norms when looking at a vector space. For example, there are multiple norms that are commonly associated with Euclidean vector spaces. Here are just a few examples:</p>
<p><strong>L2 norm</strong></p>
<p>The L2 norm is the most common norm as it is simply the Euclidean distance between points in a coordinate vector space:</p>
\[\vert\vert \boldsymbol{x} \vert\vert_2 := \sqrt{\sum_{i=1}^n x_i^2}\]
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/L2Norm.png" alt="drawing" width="200" /></center>
<p><strong>L1 norm</strong></p>
<p>The L1 norm is simply the sum of the absolute values of the elements of the vector:</p>
\[\vert\vert \boldsymbol{x} \vert\vert_1 := \sum_{i=1}^n \vert x_i \vert\]
<p>The L1 norm is also called the <strong>Manhattan norm</strong> or <strong>taxicab norm</strong> because it calculates distances as if one has to take streets around city blocks to get from one point to another:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/L1Norm.png" alt="drawing" width="200" /></center>
<p><strong>Infinity norm</strong></p>
<p>The infinity norm is simply the maximum absolute value among the elements of a vector:</p>
\[\vert\vert \boldsymbol{x} \vert\vert_{\infty} := \text{max}\{\vert x_1 \vert, \vert x_2 \vert, \dots, \vert x_n \vert\}\]
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/LInftyNorm.png" alt="drawing" width="200" /></center>
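<p>The three norms above can be sketched in a few lines of Python. Note that all three take absolute values, so each satisfies Axiom 2 of Definition 1:</p>

```python
import math

def l2_norm(x):
    # Euclidean distance from the origin
    return math.sqrt(sum(xi ** 2 for xi in x))

def l1_norm(x):
    # Sum of absolute values ("taxicab" length)
    return sum(abs(xi) for xi in x)

def linf_norm(x):
    # Maximum absolute value among the elements
    return max(abs(xi) for xi in x)

x = [3.0, -4.0]
print(l2_norm(x), l1_norm(x), linf_norm(x))  # 5.0 7.0 4.0
```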
<h2 id="appendix">Appendix</h2>
<p><span style="color:#0060C6"><strong>Theorem 1 (Only the zero vector has zero norm):</strong> Given a vector space $(\mathcal{V}, \mathcal{F})$ with norm $\vert\vert . \vert\vert$, it holds that $\vert\vert \boldsymbol{v} \vert\vert = 0 \iff \boldsymbol{v} = \boldsymbol{0}$</span></p>
<p><strong>Proof:</strong></p>
\[\begin{align*}\vert\vert \boldsymbol{0} \vert\vert &= \vert\vert 0 \boldsymbol{v} \vert\vert && \text{for any $\boldsymbol{v} \in \mathcal{V}$} \\
&= \vert 0 \vert \vert\vert \boldsymbol{v} \vert\vert && \text{by Axiom 2} \\ &= 0\end{align*}\]
<p>Note the first line is proven in Theorem 2 in my <a href="https://mbernste.github.io/posts/vector_spaces/">previous blog post</a> on vector spaces.</p>
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 2 (Formation of unit vector):</strong> Given a vector space $(\mathcal{V}, \mathcal{F})$ with norm $\vert\vert . \vert\vert$, the vector $\hat{\boldsymbol{v}} := \frac{\boldsymbol{v}}{\vert\vert \boldsymbol{v} \vert\vert}$ has norm equal to one.</span></p>
<p><strong>Proof:</strong></p>
\[\begin{align*}\vert\vert \hat{\boldsymbol{v}} \vert\vert &= \left\vert\left\vert \frac{\boldsymbol{v}}{\vert\vert \boldsymbol{v} \vert\vert} \right\vert\right\vert \\
&=\frac{1}{\vert\vert\boldsymbol{v}\vert\vert} \vert\vert\boldsymbol{v}\vert\vert && \text{by Axiom 2} \\ &= 1 \end{align*}\]
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 3 (Norm-induced metric):</strong> Given a vector space $(\mathcal{V}, \mathcal{F})$ with norm $\vert\vert . \vert\vert$, the function $d(\boldsymbol{x}, \boldsymbol{y}) := \vert\vert \boldsymbol{x} - \boldsymbol{y} \vert\vert$ where $\boldsymbol{x}, \boldsymbol{y} \in \mathcal{V}$ is a metric.</span></p>
<p><strong>Proof:</strong></p>
<p>To prove that $d$ is a metric, we need to show that it satisfies the axioms of a metric function. First, we need to show that</p>
\[d(\boldsymbol{x}, \boldsymbol{y}) = 0 \iff \boldsymbol{x} = \boldsymbol{y}\]
<p>This can be proven as follows:</p>
\[\begin{align*} d(\boldsymbol{x}, \boldsymbol{y}) = 0 \implies & \vert\vert\boldsymbol{x} - \boldsymbol{y} \vert\vert = 0 \\ \implies & \vert\vert \boldsymbol{x} + (-\boldsymbol{y})\vert\vert = 0 \\ \implies & \boldsymbol{x} + (-\boldsymbol{y}) = \boldsymbol{0} && \text{by Theorem 1} \\ \implies & -\boldsymbol{y} = -\boldsymbol{x} \\ \implies & \boldsymbol{y} = \boldsymbol{x}\end{align*}\]
<p>The last line follows from Theorem 4 in my <a href="https://mbernste.github.io/posts/vector_spaces/">previous blog post</a> on vector spaces. Going the other direction, we assume that \(\boldsymbol{x} = \boldsymbol{y}\). Then</p>
\[\begin{align*} \vert\vert \boldsymbol{x} - \boldsymbol{y} \vert\vert &= \vert\vert \boldsymbol{x} - \boldsymbol{x} \vert\vert \\ &= \vert\vert \boldsymbol{0} \vert\vert \\ &= 0 && \text{by Theorem 1}\end{align*}\]
<p>Second, we need to show that $d(\boldsymbol{x}, \boldsymbol{y}) \geq 0$. This fact follows immediately from Axiom 1 in Definition 1 above. We also note that $d$ is symmetric: by Axiom 2, $d(\boldsymbol{x}, \boldsymbol{y}) = \vert\vert \boldsymbol{x} - \boldsymbol{y} \vert\vert = \vert -1 \vert \ \vert\vert \boldsymbol{y} - \boldsymbol{x} \vert\vert = d(\boldsymbol{y}, \boldsymbol{x})$.</p>
<p>Third and finally, $d$ needs to satisfy the <a href="https://en.wikipedia.org/wiki/Triangle_inequality">triangle inequality</a>. That is, we need to show that $\forall \boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z} \in \mathcal{V}$, it holds that</p>
\[d(\boldsymbol{x}, \boldsymbol{y}) \leq d(\boldsymbol{x}, \boldsymbol{z}) + d(\boldsymbol{z}, \boldsymbol{y})\]
<p>This is proven as follows:</p>
\[\begin{align*} d(\boldsymbol{x}, \boldsymbol{y}) &= \vert\vert\boldsymbol{x} - \boldsymbol{y} \vert\vert \\ &= \vert\vert\boldsymbol{x} - \boldsymbol{z} + \boldsymbol{z} - \boldsymbol{y} \vert\vert \\ &= \vert\vert(\boldsymbol{x} - \boldsymbol{z}) + (\boldsymbol{z} - \boldsymbol{y}) \vert\vert \\ & \leq \vert\vert\boldsymbol{x} - \boldsymbol{z}\vert\vert\ + \vert\vert\ \boldsymbol{z} - \boldsymbol{y} \vert\vert && \text{by Axiom 3 of Definition 1} \\ &= d(\boldsymbol{x}, \boldsymbol{z}) + d(\boldsymbol{z}, \boldsymbol{y})\end{align*}\]
<p>$\square$</p>
<h1 id="the-overloaded-equals-sign">The overloaded equals sign</h1>
<p><em>Matthew N. Bernstein, 2021-11-09</em></p>
<p><em>Two of the most important relationships in mathematics, namely equality and definition, are both denoted using the same symbol – namely, the equals sign. The overloading of this symbol confuses students in mathematics and computer programming. In this post, I argue for the use of two different symbols for these two fundamentally different operators.</em></p>
<h2 id="introduction">Introduction</h2>
<p>I find it unfortunate that two of the most important relationships in mathematics, namely <strong>equality</strong> and <strong>definition</strong>, are often denoted using the exact same symbol – namely, the equal sign: “=”. Early in my own education, I believe this <a href="https://en.wikipedia.org/wiki/Operator_overloading">overloading</a> of the equal sign caused more confusion than was necessary, and I have personally witnessed it confuse students.</p>
<p>To ensure that we’re on the same page, let’s first define these two notions. Let’s start with the idea of <strong>equality</strong>. Let’s say we have two entities, which we will denote using the symbols $X$ and $Y$. The statement “$X$ equals $Y$”, denoted $X = Y$, means that $X$ and $Y$ <strong>are the same thing</strong>.</p>
<p>For example, let’s say we have a right-triangle with edge lengths $a$, $b$ and $c$, where $c$ is the hypotenuse. The <a href="https://en.wikipedia.org/wiki/Pythagorean_theorem">Pythagorean Theorem</a> says that $a^2 + b^2 = c^2$. Said differently, the quantity $c^2$ <em>is the same quantity</em> as the quantity $a^2 + b^2$.</p>
<p>Now, let’s move on to <strong>definition</strong>. Given some entity denoted with the symbol $Y$, the statement “let $X$ be $Y$”, also often denoted $X = Y$, means that one should use the symbol “$X$” to refer to the entity referred to by “$Y$”.</p>
<p>For example, in introductory math textbooks it is common to define the sine function in reference to a right-triangle:</p>
\[\sin \theta = \frac{\text{opposite}}{\text{hypotenuse}}\]
<p>This is a definition. We are <strong>assigning</strong> the symbol/concept $\sin \theta$ to be the ratio of the length of the triangle’s opposite side to the length of its hypotenuse.</p>
<p>The fundamental difference between equality and definition is that in the equality relationship between $X$ and $Y$, both the symbols $X$ and $Y$ are bound to entities – that is, they “refer” to entities. The statement $Y = X$ is simply a comment about those two entities, namely, that they are the same. In contrast, in a definition, only one of the two symbols is bound to an entity. The act of stating a definition is the act of <em>binding a known entity to a new symbol</em>. For example, the symbol “$\text{foo} \ \theta$” is meaningless. What exactly is “foo”? We don’t know because we have not defined it.</p>
<h2 id="overloading-the-equal-sign-creates-confusion-in-mathematics">Overloading the equal sign creates confusion in mathematics</h2>
<p>I was tutoring someone who was teaching themselves pre-calculus out of a textbook, and they were quite confused by the statement,</p>
\[\sin \theta = \frac{\text{opposite}}{\text{hypotenuse}}\]
<p>They asked me, “Why is $\sin \theta$ equal to the quantity $\frac{\text{opposite}}{\text{hypotenuse}}$?” They never explicitly stated so, but it became evident that their confusion was not the good kind of confusion. It wasn’t, “Why are the ratios between the sides of a right-triangle functions of the angles between those sides?” Nor, “Why is this definition important?” Rather, their confusion seemed to stem from the very existence of this mysterious object, “$\sin \theta$”. Their question was more along the lines of, “What <em>is</em> this mysterious thing? And why on earth is it equal to the ratio of the sides of the triangle?”</p>
<p>Their confusion arose from the erroneous interpretation of this statement as describing an equality rather than a definition. The mystery was, at least partly, alleviated by the clarification that $\sin \theta$ is not an object that existed before we saw this statement – rather, this statement <em>created the object for the first time</em>. The statement is <em>defining</em> $\sin \theta$ to be the ratio between the opposite side to the hypotenuse.</p>
<p>The truly interesting quality of this definition is that the ratios between the sides of a right triangle are a function of its angles, regardless of the lengths of the sides – that is, that we can make this definition at all!</p>
<h2 id="overloading-the-equal-sign-creates-confusion-in-computer-programming">Overloading the equal sign creates confusion in computer programming</h2>
<p>Anyone who has taught introductory computer programming is familiar with the very common confusion between the <a href="https://en.wikipedia.org/wiki/Assignment_(computer_science)">assignment operator</a> and <a href="https://en.wikipedia.org/wiki/Relational_operator#Equality">equality operator</a> in programming languages.</p>
<p>For example, in many programming languages, like C and Python, the assignment operator uses the standard equals sign. That is, the statement <code class="language-plaintext highlighter-rouge">x = y</code> assigns the value referenced by symbol <code class="language-plaintext highlighter-rouge">y</code> to symbol <code class="language-plaintext highlighter-rouge">x</code>. In contrast, the statement <code class="language-plaintext highlighter-rouge">x == y</code> returns either <code class="language-plaintext highlighter-rouge">True</code> or <code class="language-plaintext highlighter-rouge">False</code> depending on whether the value referenced by <code class="language-plaintext highlighter-rouge">x</code> is equal to the value referenced by <code class="language-plaintext highlighter-rouge">y</code>. Though I have not seen any data on the topic, I wonder whether teaching these two operators from the very beginning of a student’s mathematical education would alleviate this common confusion.</p>
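<p>To make the distinction concrete, here is a minimal sketch in Python, where assignment and equality use different operators:</p>

```python
# Assignment binds a name to a value; equality tests whether two values match.
x = 5          # assignment: the name x now refers to the value 5
y = 5          # assignment: the name y now refers to the value 5
assert x == y  # equality test: True, since both names refer to equal values

x = y + 1      # re-assignment: x now refers to 6
assert x != y  # the equality relationship no longer holds
```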
<h2 id="use--instead-of--to-denote-definition">Use “:=” instead of “=” to denote definition</h2>
<p>I think it’s important to use the symbol “:=” to denote definition. I prefer this symbol over the popular “$\equiv$” symbol because it emphasizes the asymmetry of the statement. That is, $X := Y$ means “use $X$ as a symbol for $Y$”, which differs from “use $Y$ as a symbol for $X$.” In contrast, the standard equals sign “=” is appropriately symmetric.</p>
<p>Using the appropriate symbol to distinguish definition statements from equality statements may go a long way, at least in proportion to the effort of using them, towards alleviating confusion in students of math and computer science.</p>Matthew N. BernsteinTwo of the most important relationships in mathematics, namely equality and definition, are both denoted using the same symbol – namely, the equals sign. The overloading of this symbol confuses students in mathematics and computer programming. In this post, I argue for the use of two different symbols for these two fundamentally different operators.Vector spaces2021-10-27T00:00:00-07:002021-10-27T00:00:00-07:00https://mbernste.github.io/posts/vector_spaces<p><em>The concept of a vector space is a foundational concept in mathematics, physics, and the data sciences. In this post, we first present and explain the definition of a vector space and then go on to describe properties of vector spaces. Lastly, we present a few examples of vector spaces that go beyond the usual Euclidean vectors that are often taught in introductory math and science courses.</em></p>
<h2 id="introduction">Introduction</h2>
<p>The concept of a vector space is a foundational concept in mathematics, physics, and the data sciences. In most introductory courses, only vectors in a Euclidean space are discussed. That is, vectors are presented as arrays of numbers:</p>
\[\boldsymbol{x} = \begin{bmatrix}1 \\ 2\end{bmatrix}\]
<p>If the array of numbers is of length two or three, then one can visualize the vector as an arrow:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/EuclideanVector.png" alt="drawing" width="300" /></center>
<p>While this definition is adequate for most applications of vector spaces, there exists a more abstract, and therefore more sophisticated definition of vector spaces that is required to have a deeper understanding of topics in math, statistics, and machine learning. In this post, we will dig into the abstract definition for vector spaces and discuss a few of their properties. Moreover, we will look at a few examples of vector spaces outside of the usual Euclidean vectors and see how the formal definition generalizes to other mathematical constructs such as <a href="https://mbernste.github.io/posts/matrices/">matrices</a> and functions.</p>
<h2 id="formal-definition">Formal definition</h2>
<p>As we mentioned before, vectors are usually introduced as arrays of numbers, and consequently, as arrows. These arrows can be added together and scaled as depicted below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/AddScaleVectors.png" alt="drawing" width="420" /></center>
<p>A <strong>vector space</strong> generalizes this notion of adding and scaling things that behave like Euclidean vectors.</p>
<p>At a more rigorous mathematical level, a vector space consists of both a set of vectors $\mathcal{V}$ and a <a href="https://en.wikipedia.org/wiki/Field_(mathematics)">field</a> of scalars $\mathcal{F}$ for which one can add together vectors in $\mathcal{V}$ as well as scale these vectors by elements in the field $\mathcal{F}$ according to a specific list of rules (in most cases, the field of scalars is the real numbers, $\mathbb{R}$). These rules are spelled out in the definition for a vector space:</p>
<p><span style="color:#0060C6"><strong>Definition 1 (vector space):</strong> Given a set of objects $\mathcal{V}$ called vectors and a field $\mathcal{F} := (C, +, \cdot, -, ^{-1}, 0, 1)$ where $C$ is the set of elements in the field, called scalars, the tuple $(\mathcal{V}, \mathcal{F})$ is a <strong>vector space</strong> if for all $\boldsymbol{v}, \boldsymbol{u}, \boldsymbol{w} \in \mathcal{V}$ and $c, d \in C$, the following ten axioms hold:</span></p>
<ol>
<li><span style="color:#0060C6">$\boldsymbol{u} + \boldsymbol{v} \in \mathcal{V}$</span></li>
<li><span style="color:#0060C6">$\boldsymbol{u} + \boldsymbol{v} = \boldsymbol{v} + \boldsymbol{u}$</span></li>
<li><span style="color:#0060C6">$(\boldsymbol{u} + \boldsymbol{v}) + \boldsymbol{w} = \boldsymbol{u} + (\boldsymbol{v} + \boldsymbol{w})$</span></li>
<li><span style="color:#0060C6">There exists a zero vector $\boldsymbol{0} \in \mathcal{V}$ such that $\boldsymbol{u} + \boldsymbol{0} = \boldsymbol{u}$</span></li>
<li><span style="color:#0060C6">For each $\boldsymbol{u \in \mathcal{V}}$ there exists a $\boldsymbol{u’} \in \mathcal{V}$ such that $\boldsymbol{u} + \boldsymbol{u’} = \boldsymbol{0}$. We call $\boldsymbol{u}’$ the negative of $\boldsymbol{u}$ and denote it as $-\boldsymbol{u}$</span></li>
<li><span style="color:#0060C6">The scalar multiple of $\boldsymbol{u}$ by $c$, denoted by $c\boldsymbol{u}$ is in $\mathcal{V}$</span></li>
<li><span style="color:#0060C6">$c(\boldsymbol{u} + \boldsymbol{v}) = c\boldsymbol{u} + c\boldsymbol{v}$</span></li>
<li><span style="color:#0060C6">$(c + d)\boldsymbol{u} = c\boldsymbol{u} + d\boldsymbol{u}$</span></li>
<li><span style="color:#0060C6">$c(d\boldsymbol{u}) = (cd)\boldsymbol{u}$</span></li>
<li><span style="color:#0060C6">$1\boldsymbol{u} = \boldsymbol{u}$</span></li>
</ol>
<p>Axioms 1-5 of the definition describe how vectors can be added together. Axioms 6-10 describe how these vectors can be scaled using the field of scalars.</p>
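<p>Though the axioms are abstract, they are easy to check numerically for ordinary Euclidean vectors. The following sketch verifies several of them with NumPy using arbitrary example vectors (the specific values are illustrative):</p>

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
w = np.array([7.0, 8.0, 9.0])
c, d = 2.0, 3.0
zero = np.zeros(3)

assert np.allclose(u + v, v + u)                # Axiom 2: commutativity
assert np.allclose((u + v) + w, u + (v + w))    # Axiom 3: associativity
assert np.allclose(u + zero, u)                 # Axiom 4: the zero vector
assert np.allclose(u + (-u), zero)              # Axiom 5: negatives
assert np.allclose(c * (u + v), c * u + c * v)  # Axiom 7: distributivity over vectors
assert np.allclose((c + d) * u, c * u + d * u)  # Axiom 8: distributivity over scalars
assert np.allclose(c * (d * u), (c * d) * u)    # Axiom 9: compatibility of scaling
assert np.allclose(1 * u, u)                    # Axiom 10: scalar identity
```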
<h2 id="properties">Properties</h2>
<p>The ten axioms outlined in the definition for a vector space may seem somewhat arbitrary (at least, they did for me); however, as we will show, these axioms are sufficient for ensuring that vector spaces have all of the properties that we intuitively associate with Euclidean vectors. Specifically, from these axioms, we can derive the following properties:</p>
<ol>
<li><strong>The zero vector is unique</strong> (Theorem 1 in the Appendix). There is only one distinct zero vector in a vector space. Notice in a Euclidean vector space, there is only one point at the origin, which represents the zero vector in Euclidean spaces.</li>
<li><strong>Any vector multiplied by the zero scalar is the zero vector</strong> (Theorem 2 in the Appendix). The zero scalar converts any vector into the zero vector. That is, given a vector $\boldsymbol{v}$, it holds that $0\boldsymbol{v} = \boldsymbol{0}$. This generalizes the notion of how multiplying a vector in a Euclidean space by zero should shrink the vector to the origin.</li>
<li><strong>The negative of a vector is unique</strong> (Theorem 3 in the Appendix). Given a vector $\boldsymbol{v}$, we denote its negative vector as $-\boldsymbol{v}$. This is analogous to each real number $x \in \mathbb{R}$ having a matching negative number $-x$ that lies $|x|$ distance from 0 on the opposite side of 0.</li>
<li><strong>Multiplying a negative vector by the scalar -1 produces its negative vector</strong> (Theorem 4 in the Appendix). That is, given a vector $\boldsymbol{v}$, it holds that $-1\boldsymbol{v} = -\boldsymbol{v}$. This is analogous to the fact that if you multiply any number $x$ by $-1$ you get the number $-x$ that lies $|x|$ distance from 0 on the opposite side of 0.</li>
<li><strong>The zero vector multiplied by any scalar is the zero vector</strong> (Theorem 5 in the Appendix). The zero vector remains the zero vector despite being multiplied by any scalar. That is, $c\boldsymbol{0} = \boldsymbol{0}$ for any $c \in \mathcal{F}$. This is analogous to the fact that zero multiplied by any number remains zero.</li>
<li><strong>The only vector whose negative is not distinct from itself is the zero vector</strong> (Theorem 6 in the Appendix). For every vector other than the zero vector, its negative vector is a distinct vector in the vector space. For the zero vector, its negative is itself. This is analogous to the fact that for any number $x \neq 0$, the number $-x$ is a distinct number from $x$ that lies on the opposite side of 0. However, for $x = 0$, $-x = x$.</li>
</ol>
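<p>These derived properties can likewise be checked numerically for Euclidean vectors (a sketch with an arbitrary example vector):</p>

```python
import numpy as np

v = np.array([1.0, -2.0, 3.0])
zero = np.zeros(3)

assert np.allclose(0 * v, zero)       # Property 2: 0v = 0
assert np.allclose((-1) * v, -v)      # Property 4: (-1)v = -v
assert np.allclose(5.0 * zero, zero)  # Property 5: c0 = 0
assert np.allclose(-zero, zero)       # Property 6: -0 = 0
```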
<h2 id="examples-of-vector-spaces">Examples of vector spaces</h2>
<p><strong>The real numbers</strong></p>
<p>It turns out that the real numbers are themselves a vector space (when equipped with standard addition and multiplication). In this vector space, the real numbers are both the vectors and the scalars! Here, the number zero acts as the zero vector. This example may be a bit trivial and silly; however, I like it because it highlights the generality of the definition of a vector space.</p>
<p><strong>Matrices</strong></p>
<p>Although generally not thought of as vectors, the space of real-valued <a href="https://mbernste.github.io/posts/matrices/">matrices</a> of a fixed size \(\mathbb{R}^{m \times n}\) form a vector space in which the matrices are vectors. Intuitively, you can add matrices together:</p>
\[\begin{bmatrix}1 & 2 \\ 3 & 4\end{bmatrix} + \begin{bmatrix}3 & 2 \\ 2 & 5\end{bmatrix} = \begin{bmatrix}4 & 4 \\ 5 & 9\end{bmatrix}\]
<p>You can also scale them:</p>
\[2\begin{bmatrix}1 & 2 \\ 3 & 4\end{bmatrix} = \begin{bmatrix}2 & 4 \\ 6 & 8\end{bmatrix}\]
<p>The zero matrix acts as the zero vector:</p>
\[\begin{bmatrix}0 & 0 \\ 0 & 0\end{bmatrix}\]
<p>This may seem a bit confusing because as we discuss in <a href="https://mbernste.github.io/posts/matrices_as_functions/">another blog post</a>, matrices act as functions between Euclidean vector spaces. Nonetheless, matrices can form vector spaces all on their own, distinct from the vector spaces that they act upon!</p>
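<p>As a quick numerical check, the matrix computations above can be reproduced with NumPy, which implements matrix addition and scalar multiplication exactly as vector space operations:</p>

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[3, 2], [2, 5]])

# Matrix addition behaves like vector addition
assert np.array_equal(A + B, np.array([[4, 4], [5, 9]]))

# Scaling a matrix behaves like scaling a vector
assert np.array_equal(2 * A, np.array([[2, 4], [6, 8]]))

# The zero matrix plays the role of the zero vector
Z = np.zeros((2, 2), dtype=int)
assert np.array_equal(A + Z, A)
```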
<p><strong>Functions</strong></p>
<p>Sets of functions can also form vector spaces! In fact, the real power in the definition for a vector space reveals itself when dealing with functions, and the fact that some sets of functions form vector spaces lies at the foundation for many fundamental ideas in mathematics, physics, and the data sciences such as <a href="https://en.wikipedia.org/wiki/Fourier_transform">Fourier transforms</a> and <a href="https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space">reproducing kernel Hilbert spaces</a>.</p>
<p>For example, the set of all continuous, real-valued functions forms a vector space. Intuitively we see that such functions act like vectors in that we can add them together:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/AddFunctionsLikeVectors.png" alt="drawing" width="600" /></center>
<p>We can also scale functions. In the following figure, the function $g$ is scaled by $c$:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/ScaleFunctionLikeVector.png" alt="drawing" width="300" /></center>
<p>Lastly, the zero function acts as the zero vector. Here we depict the zero function, which outputs 0 for all inputs:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/ZeroFunctionLikeVector.png" alt="drawing" width="300" /></center>
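<p>We can mimic this vector space of functions in code by treating the functions themselves as vectors. In the sketch below, the helper names <code class="language-plaintext highlighter-rouge">add_fn</code>, <code class="language-plaintext highlighter-rouge">scale_fn</code>, and <code class="language-plaintext highlighter-rouge">zero_fn</code> are illustrative, not standard library functions:</p>

```python
import math

def add_fn(f, g):
    """The 'vector sum' of two functions: (f + g)(x) := f(x) + g(x)."""
    return lambda x: f(x) + g(x)

def scale_fn(c, f):
    """A scalar multiple of a function: (c f)(x) := c * f(x)."""
    return lambda x: c * f(x)

def zero_fn(x):
    """The zero vector of this space: the function that always outputs 0."""
    return 0.0

# Build the function x -> sin(x) + 2 cos(x) by adding and scaling
h = add_fn(math.sin, scale_fn(2.0, math.cos))

x = 1.3
assert abs(h(x) - (math.sin(x) + 2.0 * math.cos(x))) < 1e-12

# The zero function acts as the zero vector
assert add_fn(math.sin, zero_fn)(x) == math.sin(x)
```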
<h2 id="appendix-proofs-of-properties-of-vector-spaces">Appendix: Proofs of properties of vector spaces</h2>
<p><span style="color:#0060C6"><strong>Theorem 1 (Uniqueness of zero vector):</strong> Given vector space $(\mathcal{V}, \mathcal{F})$, the zero vector is unique.</span></p>
<p><strong>Proof:</strong></p>
<p>Assume for the sake of contradiction that there exists a vector $\boldsymbol{a}$ such that $\boldsymbol{a} \neq \boldsymbol{0}$ and that $\forall \boldsymbol{v} \in \mathcal{V}$</p>
\[\boldsymbol{a} + \boldsymbol{v} = \boldsymbol{v}\]
<p>Then, this implies that</p>
\[\boldsymbol{a} + \boldsymbol{0} = \boldsymbol{0}\]
<p>However, Axiom 4 of the definition for a vector space states that if a vector is the zero-vector, it must be that $\boldsymbol{a} + \boldsymbol{0} = \boldsymbol{a}$.</p>
<p>Since $\boldsymbol{a} \neq \boldsymbol{0}$ , we reach a contradiction. Therefore, there does not exist a vector $\boldsymbol{a} \neq \boldsymbol{0}$ for which $\forall \boldsymbol{v} \in \mathcal{V} \ \ \ \boldsymbol{a} + \boldsymbol{v} = \boldsymbol{v}$. Thus, the zero-vector is unique.</p>
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 2 (The product of the zero scalar and any vector is the zero vector):</strong> Given a vector space $(\mathcal{V}, \mathcal{F})$, it holds that $\forall \boldsymbol{v} \in \mathcal{V}, 0\boldsymbol{v} = \boldsymbol{0}$.</span></p>
<p><strong>Proof:</strong></p>
<p>Assume for the sake of contradiction that there exist a vector $\boldsymbol{v} \in \mathcal{V}$ and a vector $\boldsymbol{a} \neq \boldsymbol{0}$ such that</p>
\[0\boldsymbol{v} = \boldsymbol{a}\]
<p>Now, for any scalar $c \neq 0$, we have</p>
\[\begin{align*}c\boldsymbol{v} &= (c + 0)\boldsymbol{v} \\ &= c\boldsymbol{v} + 0\boldsymbol{v} && \text{by Axiom 8} \\ &= c\boldsymbol{v} + \boldsymbol{a}\end{align*}\]
<p>Our assumption that $\boldsymbol{a} \neq \boldsymbol{0}$ must be false because, by Theorem 1, the only vector $\boldsymbol{a}$ for which $c\boldsymbol{v} + \boldsymbol{a} = c\boldsymbol{v}$ can hold is the zero vector.</p>
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 3 (Each vector has a unique negative vector):</strong> Given a vector space $(\mathcal{V}, \mathcal{F})$ and vector $\boldsymbol{v} \in \mathcal{V}$, its negative, $-\boldsymbol{v}$, is unique. That is, $\boldsymbol{v} + \boldsymbol{a} = \boldsymbol{0} \iff \boldsymbol{a} = -\boldsymbol{v}$.</span></p>
<p><strong>Proof:</strong></p>
<p>We need only prove $\boldsymbol{v} + \boldsymbol{a} = \boldsymbol{0} \implies \boldsymbol{a} = -\boldsymbol{v}$. The other direction is stated in the axioms for the definition of a vector space.</p>
\[\begin{align*}\boldsymbol{v} + \boldsymbol{a} &= \boldsymbol{0} \\ \implies -\boldsymbol{v} + \boldsymbol{v} + \boldsymbol{a} &= -\boldsymbol{v} + \boldsymbol{0} \\ \implies [-\boldsymbol{v} + \boldsymbol{v}] + \boldsymbol{a} &= -\boldsymbol{v} \\ \implies \boldsymbol{0} + \boldsymbol{a} &= -\boldsymbol{v} &&\text{by Axiom 5} \\ \implies \boldsymbol{a} &= -\boldsymbol{v} && \text{by Axiom 4}\end{align*}\]
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 4 (Derivation of a vector’s negative):</strong> Given a vector $\boldsymbol{v} \in \mathcal{V}$, its negative is $(-1)\boldsymbol{v}$. That is, $-\boldsymbol{v} = (-1)\boldsymbol{v}$.</span></p>
<p><strong>Proof:</strong></p>
\[\begin{align*}\boldsymbol{v} + (-1)\boldsymbol{v} &= (1)\boldsymbol{v} + (-1)\boldsymbol{v} && \text{by Axiom 10} \\ &= (1-1)\boldsymbol{v} && \text{by Axiom 8} \\ &= 0\boldsymbol{v} \\ &= \boldsymbol{0} && \text{by Theorem 2}\end{align*}\]
<p>Then, by Axiom 5, it must be that $(-1)\boldsymbol{v} = -\boldsymbol{v}$.</p>
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 5 (The zero vector multiplied by any scalar is the zero vector):</strong> Given a vector space $(\mathcal{V}, \mathcal{F})$, it holds that $c\boldsymbol{0} = \boldsymbol{0}$ for any $c \in \mathcal{F}$.</span></p>
<p><strong>Proof:</strong></p>
\[\begin{align*}\boldsymbol{0} + \boldsymbol{0} &= \boldsymbol{0} && \text{by Axiom 4} \\ c(\boldsymbol{0} + \boldsymbol{0}) &= c\boldsymbol{0} \\ c\boldsymbol{0} + c\boldsymbol{0} &= c\boldsymbol{0} && \text{by Axiom 8}\end{align*}\]
<p>By Theorem 1, the only vector $\boldsymbol{a}$ in $\mathcal{V}$ for which $\boldsymbol{a} + \boldsymbol{v} = \boldsymbol{v}$ for all vectors $\boldsymbol{v} \in \mathcal{V}$ is the zero vector $\boldsymbol{0}$. Thus, $c\boldsymbol{0} = \boldsymbol{0}$.</p>
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 6 (The zero vector is its own negative):</strong> Given a vector space $(\mathcal{V}, \mathcal{F})$, it holds that $-\boldsymbol{0} = \boldsymbol{0}$. Moreover, the zero vector is the only vector that is its own negative.</span></p>
<p><strong>Proof:</strong></p>
\[\begin{align*}\boldsymbol{a} + -\boldsymbol{a} &= \boldsymbol{0} && \text{by Axiom 5} \\ \boldsymbol{a} + \boldsymbol{a} &= \boldsymbol{0} && \text{assume $\boldsymbol{a} = -\boldsymbol{a}$} \\ \implies 2\boldsymbol{a} &= \boldsymbol{0} \\ \implies \boldsymbol{a} &= \boldsymbol{0} && \text{by Theorem 5}\end{align*}\]
<p>Thus, if we assume $\boldsymbol{a} = -\boldsymbol{a}$, then $\boldsymbol{a}$ must be the zero vector.</p>
<p>$\square$</p>Matthew N. BernsteinThe concept of a vector space is a foundational concept in mathematics, physics, and the data sciences. In this post, we first present and explain the definition of a vector space and then go on to describe properties of vector spaces. Lastly, we present a few examples of vector spaces that go beyond the usual Euclidean vectors that are often taught in introductory math and science courses.Invertible matrices2021-10-20T00:00:00-07:002021-10-20T00:00:00-07:00https://mbernste.github.io/posts/inverse_matrices<p><em>In this post, we discuss invertible matrices: those matrices that characterize invertible linear transformations. We discuss three different perspectives for intuiting inverse matrices as well as several of their properties.</em></p>
<h2 id="introduction">Introduction</h2>
<p>As we have discussed in depth, matrices can be viewed <a href="https://mbernste.github.io/posts/matrices_as_functions/">as functions</a> between vector spaces. In this post, we will discuss matrices that represent <a href="https://en.wikipedia.org/wiki/Inverse_function">inverse functions</a>. Such matrices are called <strong>invertible matrices</strong> and their corresponding inverse function is characterized by an <strong>inverse matrix</strong>.</p>
<p>More rigorously, the inverse matrix of a matrix $\boldsymbol{A}$ is defined as follows:</p>
<p><span style="color:#0060C6"><strong>Definition 1 (Inverse matrix):</strong> Given a square matrix $\boldsymbol{A} \in \mathbb{R}^{n \times n}$, its <strong>inverse matrix</strong> is the matrix $\boldsymbol{C}$ that, when either left or right multiplied by $\boldsymbol{A}$, yields the identity matrix. That is, if for a matrix $\boldsymbol{C}$ it holds that \(\boldsymbol{AC} = \boldsymbol{CA} = \boldsymbol{I}\), then $\boldsymbol{C}$ is the inverse of $\boldsymbol{A}$. This inverse matrix, $\boldsymbol{C}$, is commonly denoted as $\boldsymbol{A}^{-1}$.</span></p>
<p>This definition might seem a bit opaque, so in the remainder of this blog post we will explore a number of <a href="https://mbernste.github.io/posts/understanding_3d/">complementary perspectives</a> for viewing inverse matrices.</p>
<h2 id="intuition-behind-invertible-matrices">Intuition behind invertible matrices</h2>
<p>Here are three ways to understand invertible matrices:</p>
<ol>
<li>An invertible matrix characterizes an invertible linear transformation</li>
<li>An invertible matrix preserves the dimensionality of transformed vectors</li>
<li>An invertible matrix computes a change of coordinates for a vector space</li>
</ol>
<p>Below we will explore each of these perspectives.</p>
<p><strong>1. An invertible matrix characterizes an invertible linear transformation</strong></p>
<p>Any matrix $\boldsymbol{A}$ for which there exists an inverse matrix $\boldsymbol{A}^{-1}$ characterizes an invertible linear transformation. That is, given an invertible matrix $\boldsymbol{A}$, the linear transformation \(T(\boldsymbol{x}) := \boldsymbol{Ax}\) has an inverse linear transformation $T^{-1}(\boldsymbol{x})$ defined as $T^{-1}(\boldsymbol{x}) := \boldsymbol{A}^{-1}\boldsymbol{x}$.</p>
<p>Recall, for a function to be invertible it must be both <a href="https://en.wikipedia.org/wiki/Surjective_function">onto</a> and <a href="https://en.wikipedia.org/wiki/Injective_function">one-to-one</a>. We show in the Appendix to this blog post that if $\boldsymbol{A}$ is invertible, then $T(\boldsymbol{x})$ defined using an invertible matrix $\boldsymbol{A}$ is both onto (Theorem 2) and one-to-one (Theorem 3).</p>
<p>At a more intuitive level, the inverse of a matrix $\boldsymbol{A}$ is the matrix that “reverts” vectors transformed by $\boldsymbol{A}$ back to their original vectors:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/matrix_inverse.png" alt="drawing" width="500" /></center>
<p>Thus, since matrix multiplication encodes a composition of the matrices’ linear transformations, it follows that a matrix multiplied by its inverse yields the identity matrix $\boldsymbol{I}$, which characterizes the linear transformation that maps vectors back to themselves.</p>
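<p>This behavior is easy to verify numerically with NumPy (a sketch with an arbitrary invertible matrix):</p>

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])  # an invertible matrix (nonzero determinant)
A_inv = np.linalg.inv(A)

# A matrix composed with its inverse yields the identity matrix...
assert np.allclose(A @ A_inv, np.eye(2))
assert np.allclose(A_inv @ A, np.eye(2))

# ...so A_inv "reverts" a vector transformed by A back to the original
x = np.array([3.0, -1.0])
assert np.allclose(A_inv @ (A @ x), x)
```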
<p><strong>2. An invertible matrix preserves the dimensionality of transformed vectors</strong></p>
<p>A singular matrix “collapses” or “compresses” vectors into an intrinsically lower dimensional space whereas an invertible matrix preserves the <a href="https://mbernste.github.io/posts/intrinsic_dimensionality/">intrinsic dimensionality</a> of the vectors.</p>
<p>This follows from the fact that a matrix is invertible if and only if its columns are linearly independent (Theorem 4 in the Appendix). Recall a set of $n$ linearly independent vectors \(S := \{ \boldsymbol{x}_1, \dots, \boldsymbol{x}_n \}\) spans a space with an intrinsic dimensionality of $n$ because in order to specify any vector $\boldsymbol{b}$ in the vector space, one must specify the coefficients $c_1, \dots, c_n$ such that</p>
\[\boldsymbol{b} = c_1\boldsymbol{x}_1 + \dots + c_n\boldsymbol{x}_n\]
<p>However, if $S$ is not linearly independent, then we can throw away “redundant” vectors in $S$ that can be constructed from the remaining vectors. Thus, the intrinsic dimensionality of a linearly dependent set $S$ is the maximum sized subset of $S$ that is linearly independent.</p>
<p>When a matrix $\boldsymbol{A}$ is singular, its columns are linearly dependent, and thus the column space of the matrix is inherently of lower dimension than the number of columns. Thus, when $\boldsymbol{A}$ multiplies a vector $\boldsymbol{x}$, it transforms $\boldsymbol{x}$ into this lower dimensional space. Once transformed, there is no way to transform it back to its original vector because certain dimensions of the vector were “lost” in the transformation.</p>
<p>To make this more concrete, an example is shown below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/matrix_inverse_lin_ind.png" alt="drawing" width="1000" /></center>
<p>In Panel A of this figure, we show the column vectors of a matrix $\boldsymbol{A} \in \mathbb{R}^{3 \times 3}$ that span a plane. In Panel B, we show the solution to the equation $\boldsymbol{Ax} = \boldsymbol{b}$. In Panel C, we show another solution to $\boldsymbol{Ax} = \boldsymbol{b}$. Notice that there are multiple vectors in $\mathbb{R}^3$ that \(\boldsymbol{A}\) maps to $\boldsymbol{b}$. Thus, there does not exist an inverse mapping and therefore no inverse matrix to $\boldsymbol{A}$. These multiple mappings from $\mathbb{R}^3$ to $\boldsymbol{b}$ arise directly from the fact that the columns of $\boldsymbol{A}$ are linearly dependent.</p>
<p>Also notice that this singular matrix maps vectors in $\mathbb{R}^3$ to vectors that lie on the plane in $\mathbb{R}^3$ spanned by its column vectors. All vectors on a plane in $\mathbb{R}^3$ have an intrinsic dimensionality of two rather than three because we only need to specify coefficients for two of the column vectors in $\boldsymbol{A}$ to specify a point on the plane. We can throw away the third. Thus, we see that this singular matrix collapses points from the full 3-dimensional space $\mathbb{R}^3$ to the 2-dimensional space on the plane spanned by the columns of $\boldsymbol{A}$.</p>
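<p>The following NumPy sketch illustrates this collapse with a hypothetical singular matrix whose third column is the sum of its first two columns:</p>

```python
import numpy as np

# A singular matrix: the third column is the sum of the first two
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])

# Its columns span only a 2-dimensional subspace of R^3
assert np.linalg.matrix_rank(A) == 2

# Two different vectors map to the same output, so no inverse mapping can exist
x1 = np.array([1.0, 1.0, 0.0])
x2 = np.array([0.0, 0.0, 1.0])
assert np.allclose(A @ x1, A @ x2)

# Attempting to invert the matrix fails
try:
    np.linalg.inv(A)
    raised = False
except np.linalg.LinAlgError:
    raised = True
assert raised
```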
<p><strong>3. An invertible matrix computes a change of coordinates for a vector space</strong></p>
<p>A vector $\boldsymbol{x} \in \mathbb{R}^n$ can be viewed as the coordinates for a point in a coordinate system. That is, the vector $\boldsymbol{x}$ provides a value along each dimension – $x_i$ is the value along dimension $i$. The coordinate system we use is, in a mathematical sense, arbitrary. To see why it’s arbitrary, notice in the figure below that we can specify locations in $\mathbb{R}^2$ using either the grey coordinate system or the blue coordinate system:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/coordinate_change.png" alt="drawing" width="600" /></center>
<p>We see that there is a one-to-one and onto mapping between coordinates in each of these two alternative coordinate systems. The point $\boldsymbol{x}$ is located at $[-4, -2]$ in the grey coordinate system and as $[-1, 1]$ in the blue coordinate system.</p>
<p>Thus we see that all coordinate systems are able to provide an unambiguous location for points in the space and thus, there is a one-to-one and onto mapping between them. Nonetheless, it often helps to have some coordinate system that acts as a reference to every other coordinate system. This reference coordinate system is usually defined by the <a href="https://en.wikipedia.org/wiki/Standard_basis">standard basis vectors</a> $\{\boldsymbol{e}_1, \dots, \boldsymbol{e}_n\}$ where $\boldsymbol{e}_i$ consists of all zeros except for a one at index $i$.</p>
<p>All coordinate systems can then be constructed from the coordinate system defined by the standard basis vectors. This is depicted in the previous figure in which the reference coordinate system is depicted by the grey grid and is constructed by the orthonormal basis vectors $\boldsymbol{e}_1$ and $\boldsymbol{e}_2$. An alternative coordinate system is depicted by the blue grid and is constructed from the basis vectors $\boldsymbol{a}_1$ and $\boldsymbol{a}_2$.</p>
<p>Now, how do invertible matrices enter the picture? Well, an invertible matrix $\boldsymbol{A} := [\boldsymbol{a}_1, \dots, \boldsymbol{a}_n]$ can be viewed as an operator that converts vectors described in terms of some set of basis vectors $\{\boldsymbol{a}_1, \dots, \boldsymbol{a}_n\}$ back to a description in terms of the standard basis vectors $\{\boldsymbol{e}_1, \dots, \boldsymbol{e}_n\}$. That is, if we have some vector $\boldsymbol{x} \in \mathbb{R}^n$, then $\boldsymbol{Ax}$ can be understood to be the vector in the standard basis <em>if</em> $\boldsymbol{x}$ was described according to the basis formed by the columns of $\boldsymbol{A}$.</p>
<p>Another way to think about this is that if we have some vector $\boldsymbol{x} \in \mathbb{R}^n$ described according to the standard basis, then we can describe $\boldsymbol{x}$ in terms of an alternative basis $\boldsymbol{a}_1, \dots, \boldsymbol{a}_n$ by multiplying $\boldsymbol{x}$ by the inverse of the matrix $\boldsymbol{A} := [ \boldsymbol{a}_1, \dots, \boldsymbol{a}_n]$. That is \(\boldsymbol{x}_{\boldsymbol{A}} := \boldsymbol{A}^{-1}\boldsymbol{x}\) is the representation of $\boldsymbol{x}$ in terms of the basis formed by the columns of $\boldsymbol{A}$.</p>
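<p>The following sketch demonstrates this change of coordinates with NumPy (the basis vectors are illustrative, not those from the figure):</p>

```python
import numpy as np

# Columns of A are the basis vectors of an alternative coordinate system
a1 = np.array([2.0, 1.0])
a2 = np.array([-1.0, 1.0])
A = np.column_stack([a1, a2])

# A point described in the alternative basis...
x_in_A = np.array([-1.0, 1.0])

# ...is mapped to standard coordinates by multiplying by A
x_std = A @ x_in_A
assert np.allclose(x_std, -1.0 * a1 + 1.0 * a2)

# Conversely, A^{-1} converts standard coordinates into A-coordinates
assert np.allclose(np.linalg.inv(A) @ x_std, x_in_A)
```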
<h2 id="properties">Properties</h2>
<p>Below we discuss several properties of invertible matrices that provide further intuition into how they behave and also provide algebraic rules that can be used in derivations.</p>
<ol>
<li><strong>The columns of an invertible matrix are linearly independent</strong> (Theorem 4 in the Appendix).</li>
<li><strong>Taking the inverse of an inverse matrix gives you back the original matrix</strong>. Given an invertible matrix $\boldsymbol{A}$ with inverse $\boldsymbol{A}^{-1}$, it follows from the definition of invertible matrices that $\boldsymbol{A}^{-1}$ is also invertible with its inverse being $\boldsymbol{A}$. That is,
\((\boldsymbol{A}^{-1})^{-1} = \boldsymbol{A}\)
This also follows from the fact that the inverse of an inverse function $f^{-1}$ is simply the original function $f$.</li>
<li><strong>The result of multiplying invertible matrices is invertible</strong> (Theorem 5 in the Appendix). Given two invertible matrices $\boldsymbol{A}, \boldsymbol{B} \in \mathbb{R}^{n \times n}$, the matrix that results from their multiplication is also invertible. That is, $\boldsymbol{AB}$ is invertible and its inverse is given by
\((\boldsymbol{AB})^{-1} = \boldsymbol{B}^{-1}\boldsymbol{A}^{-1}\)
Recall that <a href="https://mbernste.github.io/posts/matrix_multiplication/">matrix multiplication</a> produces a matrix that characterizes the composition of the linear transformations characterized by the factor matrices. That is, $\boldsymbol{ABx}$ first transforms $\boldsymbol{x}$ with $\boldsymbol{B}$ and then transforms the result with $\boldsymbol{A}$. It follows that in order to invert this composition of transformations, one must first pass the vector through $\boldsymbol{A}^{-1}$ and then through $\boldsymbol{B}^{-1}$:</li>
</ol>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/inverse_matrix_mult.png" alt="drawing" width="900" /></center>
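<p>Properties 2 and 3 are easy to check numerically. Below is a small sketch using NumPy with randomly generated matrices (random matrices are invertible with probability 1):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arbitrary invertible matrices.
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))

# Property 2: the inverse of the inverse is the original matrix.
inv_inv_A = np.linalg.inv(np.linalg.inv(A))
assert np.allclose(inv_inv_A, A)

# Property 3: (AB)^{-1} = B^{-1} A^{-1}  (note the reversed order).
lhs = np.linalg.inv(A @ B)
rhs = np.linalg.inv(B) @ np.linalg.inv(A)
assert np.allclose(lhs, rhs)

print("both properties verified")
```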
<h2 id="appendix-proofs-of-properties-of-invertible-matrices">Appendix: Proofs of properties of invertible matrices</h2>
<p><span style="color:#0060C6"><strong>Theorem 1 (Null space of an invertible matrix):</strong> The null space of an invertible matrix $\boldsymbol{A} \in \mathbb{R}^{n \times n}$ consists of only the zero vector $\boldsymbol{0}$.</span></p>
<p><strong>Proof:</strong></p>
<p>We must prove that</p>
\[\boldsymbol{Ax} = \boldsymbol{0}\]
<p>has only the trivial solution $\boldsymbol{x} := \boldsymbol{0}$.</p>
\[\begin{align*}\boldsymbol{Ax} &= \boldsymbol{0} \\ \implies \boldsymbol{A}^{-1}\boldsymbol{Ax} &= \boldsymbol{A}^{-1}\boldsymbol{0} \\ \implies \boldsymbol{x} &= \boldsymbol{0} \end{align*}\]
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 2 (Invertible matrices characterize onto functions):</strong> An invertible matrix $\boldsymbol{A} \in \mathbb{R}^{n \times n}$ characterizes an onto linear transformation. </span></p>
<p><strong>Proof:</strong></p>
<p>Let $\boldsymbol{b} \in \mathbb{R}^n$ be arbitrary. We must show that there exists a vector $\boldsymbol{x} \in \mathbb{R}^n$ such that
\(\boldsymbol{Ax} = \boldsymbol{b}\)
This solution is precisely</p>
\[\boldsymbol{x} := \boldsymbol{A}^{-1}\boldsymbol{b}\]
<p>as we see below:</p>
\[\begin{align*}&\boldsymbol{A}(\boldsymbol{A}^{-1}\boldsymbol{b}) = \boldsymbol{b} \\ \implies & (\boldsymbol{AA}^{-1})\boldsymbol{b} = \boldsymbol{b} && \text{associative law} \\ \implies & \boldsymbol{I}\boldsymbol{b} = \boldsymbol{b} && \text{definition of inverse matrix} \\ \implies & \boldsymbol{b} = \boldsymbol{b} \end{align*}\]
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 3 (Invertible matrices characterize one-to-one functions):</strong> An invertible matrix $\boldsymbol{A} \in \mathbb{R}^{n \times n}$ characterizes a one-to-one linear transformation.</span></p>
<p><strong>Proof:</strong></p>
<p>For the sake of contradiction, assume that there exist two vectors $\boldsymbol{x}$ and $\boldsymbol{x}'$ such that $\boldsymbol{x} \neq \boldsymbol{x}'$ and that
\(\boldsymbol{Ax} = \boldsymbol{b}\)
and
\(\boldsymbol{Ax}' = \boldsymbol{b}\)
for some vector $\boldsymbol{b} \in \mathbb{R}^n$. Then,</p>
\[\begin{align*} \boldsymbol{Ax} - \boldsymbol{Ax}' &= \boldsymbol{0} \\ \implies \boldsymbol{A}(\boldsymbol{x} - \boldsymbol{x}') &= \boldsymbol{0}\end{align*}\]
<p>By Theorem 1, it must hold that</p>
\[\boldsymbol{x} - \boldsymbol{x}' = \boldsymbol{0}\]
<p>which implies that $\boldsymbol{x} = \boldsymbol{x}'$. This contradicts our original assumption. Therefore, there do not exist two distinct vectors $\boldsymbol{x}$ and $\boldsymbol{x}'$ that map to the same vector via the invertible matrix $\boldsymbol{A}$. Therefore, $\boldsymbol{A}$ encodes a one-to-one function.</p>
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 4 (Column vectors of invertible matrices are linearly independent):</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{n \times n}$, $\boldsymbol{A}$ is invertible if and only if \(\boldsymbol{a}_{*,1}, \dots, \boldsymbol{a}_{*,n}\) are linearly independent. </span></p>
<p><strong>Proof:</strong></p>
<p>We first prove the $\implies$ direction: we assume that $\boldsymbol{A}$ is invertible and show that under this assumption, the only solution to
\(\boldsymbol{a}_{*,1}x_1 + \dots + \boldsymbol{a}_{*,n}x_n = \boldsymbol{0}\)
is $\boldsymbol{x} := \boldsymbol{0}$, which is the condition for linear independence.</p>
\[\begin{align*}\boldsymbol{a}_{*,1}x_1 + \dots + \boldsymbol{a}_{*,n}x_n &= \boldsymbol{0} \\ \implies \boldsymbol{Ax} &= \boldsymbol{0} \\ \implies \boldsymbol{A}^{-1}\boldsymbol{Ax} &= \boldsymbol{A}^{-1}\boldsymbol{0} \\ \implies \boldsymbol{x} &= \boldsymbol{0} \end{align*}\]
<p>We now prove the $\impliedby$ direction: we assume the columns of $\boldsymbol{A}$ are linearly independent and show that under this assumption there exists a matrix $\boldsymbol{C}$ such that</p>
\[\boldsymbol{CA} = \boldsymbol{AC} = \boldsymbol{I}\]
<p>Since the columns of $\boldsymbol{A}$ are linearly independent, then the <a href="https://en.wikipedia.org/wiki/Row_echelon_form#Reduced_row_echelon_form">reduced row echelon form</a> of $\boldsymbol{A}$ has a <a href="https://en.wikipedia.org/wiki/Pivot_element">pivot</a> in every column. This means that there exists a sequence of <a href="https://en.wikipedia.org/wiki/Elementary_matrix">elementary row matrices</a> $\boldsymbol{E}_1, \dots, \boldsymbol{E}_k$ such that when multiplied by $\boldsymbol{A}$, they produce the identity matrix. That is,
\((\boldsymbol{E}_1\dots\boldsymbol{E}_k)\boldsymbol{A} = \boldsymbol{I}\)</p>
<p>Though not proven formally, it can be seen that elementary row matrices are invertible. That is, you can always “undo” the transformation imposed by an elementary row matrix (e.g. for an elementary row matrix that swaps rows, you can always swap them back). Furthermore, since the product of invertible matrices is also invertible, $(\boldsymbol{E}_1\dots\boldsymbol{E}_k)$ is invertible. Thus,</p>
\[\begin{align*} & (\boldsymbol{E}_1\dots\boldsymbol{E}_k)\boldsymbol{A} = \boldsymbol{I} \\ \implies & (\boldsymbol{E}_1\dots\boldsymbol{E}_k)^{-1} (\boldsymbol{E}_1 \dots \boldsymbol{E}_k)\boldsymbol{A} = (\boldsymbol{E}_1 \dots \boldsymbol{E}_k)^{-1}\boldsymbol{I} \\ \implies & \boldsymbol{A} = (\boldsymbol{E}_1 \dots \boldsymbol{E}_k)^{-1} \boldsymbol{I} \\ \implies & \boldsymbol{A} = \boldsymbol{I}(\boldsymbol{E}_1 \dots \boldsymbol{E}_k)^{-1} \\ \implies & \boldsymbol{A}(\boldsymbol{E}_1 \dots \boldsymbol{E}_k) = \boldsymbol{I}(\boldsymbol{E}_1 \dots \boldsymbol{E}_k)^{-1}(\boldsymbol{E}_1 \dots \boldsymbol{E}_k) \end{align*}\]
<p>Hence, $\boldsymbol{C} := (\boldsymbol{E}_1 \dots \boldsymbol{E}_k)$ is the matrix for which $\boldsymbol{AC} = \boldsymbol{CA} = \boldsymbol{I}$ and is thus $\boldsymbol{A}$’s inverse.</p>
<p>$\square$</p>
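<p>Theorem 4 suggests a practical numerical check: a square matrix is invertible exactly when its rank equals its number of columns. A small sketch (the example matrices are made up for illustration):</p>

```python
import numpy as np

# Columns of A are linearly independent, so A is invertible.
A = np.array([[1.0, 0.0],
              [1.0, 1.0]])

# The second column of B is twice the first (linearly dependent),
# so B is singular.
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])

rank_A = np.linalg.matrix_rank(A)
rank_B = np.linalg.matrix_rank(B)
print(rank_A, rank_B)  # 2 1

# Attempting to invert a singular matrix raises an error.
try:
    np.linalg.inv(B)
except np.linalg.LinAlgError:
    print("B is singular")
```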
<p><span style="color:#0060C6"><strong>Theorem 5 (Inverse of matrix product):</strong> Given two invertible matrices $\boldsymbol{A}, \boldsymbol{B} \in \mathbb{R}^{n \times n}$, the inverse of their product $\boldsymbol{AB}$ is given by $\boldsymbol{B}^{-1}\boldsymbol{A}^{-1}$.</span></p>
<p><strong>Proof:</strong></p>
<p>We seek the inverse matrix $\boldsymbol{X}$ such that
\((\boldsymbol{AB})\boldsymbol{X} = \boldsymbol{I}\):</p>
\[\begin{align*} & \boldsymbol{ABX} = \boldsymbol{I} \\ \implies & \boldsymbol{A}^{-1}\boldsymbol{ABX} = \boldsymbol{A}^{-1}\boldsymbol{I}\\ \implies & \boldsymbol{BX} = \boldsymbol{A}^{-1} \\ \implies &\boldsymbol{B}^{-1}\boldsymbol{BX} = \boldsymbol{B}^{-1}\boldsymbol{A}^{-1} \\ \implies &\boldsymbol{X} = \boldsymbol{B}^{-1}\boldsymbol{A}^{-1} \\ \end{align*}\]
<p>$\square$</p>Matthew N. BernsteinIn this post, we discuss invertible matrices: those matrices that characterize invertible linear transformations. We discuss three different perspectives for intuiting inverse matrices as well as several of their properties.Perplexity: a more intuitive measure of uncertainty than entropy2021-10-08T00:00:00-07:002021-10-08T00:00:00-07:00https://mbernste.github.io/posts/perplexity<p><em>Like entropy, perplexity is an information theoretic quantity that describes the uncertainty of a random variable. In fact, perplexity is simply a monotonic function of entropy and thus, in some sense, they can be used interchangeably. So why do we need it? In this post, I’ll discuss why perplexity is a more intuitive measure of uncertainty than entropy.</em></p>
<h2 id="introduction">Introduction</h2>
<p>Perplexity is an information theoretic quantity that crops up in a number of contexts such as <a href="https://en.wikipedia.org/wiki/Perplexity">natural language processing</a> and is a parameter for the popular <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> algorithm used for dimensionality reduction.</p>
<p>Like <a href="https://mbernste.github.io/posts/entropy/">entropy</a>, perplexity provides a measure of the amount of uncertainty of a random variable. In fact, perplexity is simply a monotonic function of entropy. Given a discrete random variable, $X$, perplexity is defined as:</p>
\[\text{Perplexity}(X) := 2^{H(X)}\]
<p>where $H(X)$ is the entropy of $X$.</p>
<p>When I first saw this definition, I did not understand its purpose. That is, if perplexity is simply exponentiated entropy, why do we need it? After all, we have a good intuition for entropy already: it describes <a href="https://mbernste.github.io/posts/sourcecoding/">the number of bits</a> needed to encode random samples from $X$’s probability distribution. So why perplexity?</p>
<h2 id="an-intuitive-measure-of-uncertainty">An intuitive measure of uncertainty</h2>
<p>Perplexity is often used instead of entropy due to the fact that it is arguably more intuitive to our human minds than entropy. Of course, as we’ve discussed in a <a href="https://mbernste.github.io/posts/sourcecoding/">previous blog post</a>, entropy describes the number of bits needed to encode random samples from a distribution, which one may argue is already intuitive; however, I would argue the contrary. If I tell you that a given random variable has an entropy of 7, how should you <em>feel</em> about that at a gut level?</p>
<p>Arguably, perplexity provides a more human way of thinking about the random variable’s uncertainty and that is because the perplexity of a uniform, discrete random variable with K outcomes is K (see the Appendix to this post)! For example, the perplexity of a fair coin is two and the perplexity of a fair six-sided die is six. This provides a frame of reference for interpreting a perplexity value. That is, if the perplexity of some random variable X is 20, our uncertainty towards the outcome of X is equal to the uncertainty we would feel towards a 20-sided die. This helps <em>intuit</em> the uncertainty at a more gut level!</p>
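<p>This frame of reference is easy to see computationally. Below is a small sketch (not from the original post) that computes perplexity directly from a probability mass function:</p>

```python
import numpy as np

def perplexity(p):
    """Perplexity of a discrete distribution: 2 raised to its entropy in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # outcomes with zero probability contribute nothing to entropy
    return 2.0 ** (-np.sum(p * np.log2(p)))

coin = perplexity([0.5, 0.5])           # fair coin -> 2.0
die = perplexity([1 / 6] * 6)           # fair six-sided die -> 6.0 (up to float error)
skewed = perplexity([0.9, 0.05, 0.05])  # biased 3-outcome distribution

print(coin, die, skewed)
```

<p>Note that the skewed three-outcome distribution has a perplexity below 2: it is less “perplexing” than even a fair coin, since one outcome dominates.</p>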
<h2 id="appendix">Appendix</h2>
<p><span style="color:#0060C6"><strong>Theorem:</strong> Given a discrete uniform random variable $X \sim \text{Cat}(p_1, p_2, \dots, p_K)$ where $\forall i,j \in [K], p_i = p_j = 1/K$, it holds that the perplexity of $X$ is $K$.</span></p>
<p><strong>Proof:</strong></p>
\[\begin{align*}
\text{Perplexity}(X) &:= 2^{H(X)} \\
&= 2^{-\sum_{i=1}^K \frac{1}{K} \log_2 \frac{1}{K}} \\
&= 2^{-\log_2 \frac{1}{K}} \\
&= \frac{1}{2^{\log_2 \frac{1}{K}}} \\
&= K
\end{align*}\]
<p>$\square$</p>Matthew N. BernsteinLike entropy, perplexity is an information theoretic quantity that describes the uncertainty of a random variable. In fact, perplexity is simply a monotonic function of entropy and thus, in some sense, they can be used interchangeably. So why do we need it? In this post, I’ll discuss why perplexity is a more intuitive measure of uncertainty than entropy.Variational inference2021-05-31T00:00:00-07:002021-05-31T00:00:00-07:00https://mbernste.github.io/posts/variational_inference<p><em>In this post, I will present a high-level explanation of variational inference: a paradigm for estimating a posterior distribution when computing it explicitly is intractable. Variational inference finds an approximate posterior by solving a specific optimization problem that seeks to minimize the disparity between the true posterior and the approximate posterior.</em></p>
<h2 id="introduction">Introduction</h2>
<p>Variational inference is a high-level paradigm for estimating a posterior distribution when computing it explicitly is intractable. More specifically, variational inference is used in situations in which we have a model that involves hidden random variables $Z$, observed data $X$, and some posited probabilistic model over the hidden and observed random variables \(P(Z, X)\). Our goal is to compute the posterior distribution $P(Z \mid X)$. Under an ideal situation, we would do so by using Bayes theorem:</p>
\[p(z \mid x) = \frac{p(x \mid z)p(z)}{p(x)}\]
<p>where \(z\) and \(x\) are realizations of \(Z\) and \(X\) respectively and \(p(.)\) are probability mass/density functions for the distributions implied by their arguments.</p>
<p>In practice, it is often difficult to compute $p(z \mid x)$ via Bayes theorem because the denominator $p(x)$ does not have a closed form. Usually, the denominator $p(x)$ can only be expressed as an integral that marginalizes over $z$: $p(x) = \int p(x, z) \ dz$. In such scenarios, we’re often forced to approximate $p(z \mid x)$ rather than compute it directly. Variational inference is one such approximation technique.</p>
<h2 id="intuition">Intuition</h2>
<p>Instead of computing \(p(z \mid x)\) exactly via Bayes theorem, variational inference attempts to find another distribution $q(z)$ that is “close” to \(p(z \mid x)\) (how we define “closeness” between distributions will be addressed later in this post). Ideally, $q(z)$ is easier to evaluate than \(p(z \mid x)\), and, if \(p(z \mid x)\) and \(q(z)\) are similar, then we can use \(q(z)\) as a replacement for $p(z \mid x)$ for any relevant downstream tasks.</p>
<p>We restrict our search for \(q(z)\) to a family of surrogate distributions over \(Z\), called the <strong>variational distribution family</strong>, denoted by the set of distributions $\mathcal{Q}$. Our goal then is to find the distribution $q \in \mathcal{Q}$ that makes $q(z)$ as “close” to $p(z \mid x)$ as possible. When each member of $\mathcal{Q}$ is characterized by the values of a set of parameters $\phi$, we call $\phi$ the <strong>variational parameters</strong>. Our goal is then to find the value $\hat{\phi}$ that makes $q(z \mid \hat{\phi})$ as close to $p(z \mid x)$ as possible and return \(q(z \mid \hat{\phi})\) as our approximation of the true posterior.</p>
<h2 id="details">Details</h2>
<p>Variational inference uses the KL-divergence from $p(z \mid x)$ to $q(z)$ as a measure of “closeness” between these two distributions:</p>
\[KL(q(z) \ || \ p(z \mid x)) := E_{Z \sim q}\left[\log\frac{q(Z)}{p(Z \mid x)} \right]\]
<p>Thus, variational inference attempts to find</p>
\[\hat{q} := \text{argmin}_q \ KL(q(z) \ || \ p(z \mid x))\]
<p>and then returns $\hat{q}(z)$ as the approximation to the posterior.</p>
<p>Variational inference minimizes the KL-divergence by maximizing a surrogate quantity called the <strong>evidence lower bound (ELBO)</strong> (For a more in-depth discussion of the evidence lower bound, you can check out <a href="https://mbernste.github.io/posts/elbo/">my previous blog post</a>):</p>
\[\text{ELBO}(q) := E_{Z \sim q}\left[\log p(x, Z) \right] - E_{Z \sim q}\left[\log q(Z) \right]\]
<p>That is, we can formulate an optimization problem that seeks to maximize the ELBO:</p>
\[\hat{q} := \text{argmax}_q \ \text{ELBO}(q)\]
<p>The solution to this optimization problem is equivalent to the solution that minimizes the KL-divergence between $q(z)$ and $p(z \mid x)$. To see why this works, we can show that the KL-divergence can be formulated as the difference between the marginal log-likelihood of the observed data, \(\log p(x)\) (called the <em>evidence</em>) and the ELBO:</p>
\[\begin{align*}KL(q(z) \ || \ p(z \mid x)) &= E_{Z \sim q}\left[\log\frac{q(Z)}{p(Z \mid x)} \right] \\ &= E_{Z \sim q}\left[\log q(Z) \right] - E_{Z \sim q}\left[\log p(Z \mid x) \right] \\ &= E_{Z \sim q}\left[\log q(Z) \right] - E_{Z \sim q}\left[\log \frac{p(Z, x)}{p(x)} \right] \\ &= E_{Z \sim q}\left[\log q(Z) \right] - E_{Z \sim q}\left[\log p(Z, x) \right] + E_{Z \sim q}\left[\log p(x) \right] \\ &= \log p(x) - \left( E_{Z \sim q}\left[\log p(x, Z) \right] - E_{Z \sim q}\left[\log q(Z) \right] \right)\\ &= \log p(x) - \text{ELBO}(q)\end{align*}\]
<p>Because $\log p(x)$ does not depend on $q$, one can treat the ELBO as a function of $q$ and maximize the ELBO.</p>
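<p>This decomposition can be verified numerically on a toy model. Below is a sketch with a binary latent variable and made-up joint probabilities (the specific numbers are invented for illustration):</p>

```python
import numpy as np

# Toy model: a binary latent variable Z and one fixed observation x.
# The joint probabilities p(Z = z, x) below are made-up numbers.
p_joint = np.array([0.3, 0.1])   # p(Z=0, x), p(Z=1, x)
p_x = p_joint.sum()              # evidence p(x), obtained by marginalizing over z
p_post = p_joint / p_x           # exact posterior p(Z | x) via Bayes theorem

q = np.array([0.6, 0.4])         # a candidate variational distribution q(Z)

# ELBO(q) = E_q[log p(x, Z)] - E_q[log q(Z)]
elbo = np.sum(q * np.log(p_joint)) - np.sum(q * np.log(q))
# KL(q || p(Z|x)) = E_q[log q(Z) / p(Z|x)]
kl = np.sum(q * np.log(q / p_post))

# The identity log p(x) = ELBO(q) + KL(q || p(Z|x)) holds exactly,
# so maximizing the ELBO over q minimizes the KL-divergence.
print(np.log(p_x), elbo + kl)
```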
<p>Conceptually, variational inference allows us to formulate our approximate Bayesian inference problem as an optimization problem. By formulating the problem as such, we can approach this optimization problem using the full toolkit available to us from the field of <a href="https://en.wikipedia.org/wiki/Mathematical_optimization">mathematical optimization</a>!</p>
<h2 id="why-is-this-method-called-variational-inference">Why is this method called “variational” inference?</h2>
<p>The term “variational” in “variational inference” comes from the mathematical area of <a href="https://en.wikipedia.org/wiki/Calculus_of_variations">the calculus of variations</a>. The calculus of variations is all about optimization problems that optimize <em>functions of functions</em>, called <a href="https://mbernste.github.io/posts/functionals/">functionals</a>.</p>
<p>More specifically, let’s say we have some set of functions $\mathcal{F}$ where each $f \in \mathcal{F}$ maps items from some set $A$ to some set $B$. That is,</p>
\[f: A \rightarrow B\]
<p>Let’s say we have some function $g$ that maps functions in $\mathcal{F}$ to real numbers $\mathbb{R}$. That is,</p>
\[g: \mathcal{F} \rightarrow \mathbb{R}\]
<p>Then, we may wish to solve an optimization problem of the form:</p>
\[\text{arg max}_{f \in \mathcal{F}} g(f)\]
<p>This is precisely the problem addressed in the calculus of variations. In the case of variational inference, the functional, $g$, that we are optimizing is the ELBO. The set of functions, $\mathcal{F}$, that we are searching over is the set of <a href="https://mbernste.github.io/posts/measure_theory_2/">measurable functions</a> in the variational family, $\mathcal{Q}$.</p>
<h2 id="introduction">Introduction</h2>
<p>In my <a href="https://mbernste.github.io/posts/cell_types_cell_states/">previous post</a>, I outlined a conceptual framework for defining and reasoning about “cell types”. Specifically, I noted that the idea of a “cell type” can be viewed as a human-made partition on the universal cellular state space: the set of all possible states a living cell can exist in and the transitions between them. This idea can be summarized in the following figure:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cellular_state_space_ontologies.png" alt="drawing" width="400" /></center>
<p>In this framework, the task of cataloging cell types involves identifying “useful” subsets of cell states and giving those subsets names. Then, one can create a hierarchy of cell types by simply computing the subset-relationships between those sets of cell states.</p>
<p>While this framework is conceptually clean and simple, there are a number of problems with implementing it in the real world. These problems include:</p>
<ol>
<li>We don’t know the full cellular state space</li>
<li>We don’t have a language for describing subsets of that state space</li>
<li>We don’t have a way of agreeing on how to partition the state space</li>
</ol>
<p>Problems 1 and 2 are hard, and I’ll save a discussion on these problems for later. In this post I will only discuss Problem 3: how do we agree on partitions of the state space. Said much more simply: how do we agree on a definition for a cell type.</p>
<p>In my opinion there are three core strategies that have been proposed by the scientific community; however, these ideas have taken different forms. In this post, I will attempt to tease out and more rigorously describe each of these strategies.</p>
<p>These strategies are:</p>
<ol>
<li><strong>Every scientist for themself.</strong> Come up with your own cell type definition based on your own needs. In fact, this idea is embraced by a number of single-cell RNA-seq cell type classifiers such as <a href="https://www.nature.com/articles/s41592-019-0535-3">Garnett</a>. Garnett features a “<a href="https://cole-trapnell-lab.github.io/garnett/classifiers/">zoo</a>” of cell type classifiers that one can create and then use to label a new dataset.</li>
<li><strong>Crowdsourcing.</strong> In this strategy, one may look at all of the published genomics data out there in public repositories and use these data to come to a consensus of how the scientific community as a whole defines cell types. This is the core idea behind <a href="https://www.cell.com/iscience/fulltext/S2589-0042(20)31110-X">CellO</a>, a cell type classification tool that I worked on that uses the collection of publicly available primary cell data to train cell type classifiers.</li>
<li><strong>Central authority.</strong> This is the idea behind the Human Cell Atlas. The idea here is that a single group, or committee, will collect tons of data and attempt to define the various cell types. These cell types will then serve as a reference for all of science.</li>
</ol>
<p>Let me dig a bit into each of these strategies.</p>
<h2 id="every-scientist-for-themself">Every scientist for themself</h2>
<p>This is more or less the current state of affairs (minus the whole cellular state space framework). That is, each scientist has some unique definition of a cell type that may vary, perhaps slightly, from those of other scientists who use the same cell type name. In the cellular state space framework, this scenario looks something like the following:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cell_type_every_scientist_for_themself.png" alt="drawing" width="350" /></center>
<p>The benefit to this approach is that there is no need to come to a consensus on how to define a particular cell type. Just pick your own! However, without a common language for defining the cell states that they are using to define their cell types, this framework can easily suffer from the problem that two scientists might be using the same term to discuss two different cell types! This happens all the time. For example, when two scientists use two different sets of marker genes to label cell types in a single-cell RNA-seq dataset, they are likely choosing different subsets of the cellular state space. Just take a look at the <a href="https://academic.oup.com/nar/article/47/D1/D721/5115823">CellMarker database</a>, a database of literature-curated marker genes, and you will see that there are often multiple sets of marker genes used to define the same cell type. In computer science parlance, the cell type names are <a href="https://en.wikipedia.org/wiki/Function_overloading">overloaded</a>.</p>
<p><a href="https://www.nature.com/articles/s41592-019-0535-3">Garnett</a> is a cell type classification tool that, in some sense, embraces this idea. They have a model <a href="https://cole-trapnell-lab.github.io/garnett/classifiers/">zoo</a> where you can deposit your pre-trained classifiers that have been trained on data that was labelled based on your own, personal cell type definitions. Moreover, they provide a markdown language in which you define your cell types, and your cell type hierarchy based on your own choice of marker genes.</p>
<h2 id="crowdsourcing">Crowdsourcing</h2>
<p>Here’s a less prevalent but, I think, intriguing approach: let’s take the union of all cellular states that scientists have used and come to a consensus partition of the cellular state space. That is, if multiple scientific publications have slightly different definitions for “T cell”, let’s just use the union of all of them. This is depicted in the figure below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cell_type_crowdsource.png" alt="drawing" width="700" /></center>
<p>I argue that cell type classification tools that are trained on public data take this approach. For example, our own tool, <a href="https://www.cell.com/iscience/fulltext/S2589-0042(20)31110-X">CellO</a>, was trained on a collection of primary cell samples from the Sequence Read Archive. Another method that takes this approach is <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3834796/">URSA</a>. Importantly, the training labels used for training CellO and URSA are provided by the scientists who submitted their data. As discussed previously, they might have differing definitions for their cell types; however, this might be a good thing! We’re essentially crowdsourcing the definition of cell type to build a universal cell type classifier.</p>
<p>In fact, you can use such models to build up marker genes for defining each cell type. Because these marker genes are derived from models that are trained on an amalgamation of samples that might use slightly different definitions, one can view these definitions as sort of a consensus definition from the scientific community. You can check out CellO’s derived marker genes <a href="https://uwgraphics.github.io/CellOViewer/">here</a>.</p>
<p>One problem with this approach is that it is difficult to formalize the cell types defined in this way. Furthermore, it is prone to bad data, and thus, one might want to curate the samples one uses to create a consensus.</p>
<h2 id="central-authority">Central authority</h2>
<p>Lastly, one can rely on a central authority to define cell types. This is the idea behind the <a href="https://www.humancellatlas.org">Human Cell Atlas</a> (HCA). The goal of the HCA is to bring together an international consortium of scientists to map out the cellular state space and come to agreed upon partitions of the state space from which one can then use for all of science. This strategy is the most ambitious! Of course, this is a massive undertaking, but if it works, would help to remove ambiguity and clarify our understanding of human biology.</p>Matthew N. BernsteinIn my previous post, I outlined a conceptual framework for defining and reasoning about “cell types”. Specifically, I noted that the idea of a “cell type” can be viewed as a human-made partition on the universal cellular state space. In this post, I attempt to distill three strategies for partitioning this state space and agreeing on cell type definitions.On cell types and cell states2021-03-03T00:00:00-08:002021-03-03T00:00:00-08:00https://mbernste.github.io/posts/cell_types_states_and_ontologies<p><em>The advent of single-cell genomics has brought about new efforts to characterize and catalog all of the cell types in the human body. Despite these efforts, the very definition of a “cell type” is under debate. In this post, I will discuss a conceptual framework for defining cell types as subsets of states in an underlying cellular state space. Moreover, I will link the cellular state space to biomedical ontologies that attempt to capture biological knowledge regarding cell types.</em></p>
<h2 id="introduction">Introduction</h2>
<p>With the advent of single-cell genomics, researchers are now able to probe molecular biology at the single-cell level. That is, scientists are able to measure some aspect of a cell, such as its transcriptome (RNA-seq) or its open chromatin regions (ATAC-seq), for thousands, and <a href="https://science.sciencemag.org/content/370/6518/eaba7721/tab-figures-data">sometimes even millions</a> of cells at a time. These new technologies have brought about new efforts to map and catalog all of the cell types in the human body. The premier effort of this kind is the <a href="https://www.humancellatlas.org">Human Cell Atlas</a>, an international consortium of researchers who have set themselves on the journey towards creating “comprehensive reference maps of all human cells—the fundamental units of life—as a basis for both understanding human health and diagnosing, monitoring, and treating disease.”</p>
<p>Of course, before one begins to catalog cell types, one must define what they mean by “cell type”. This has become a topic of hot debate. Before the age of single-cell genomics, a rigorous definition was usually not necessary. Colloquially, a cell type is a category of cells in the body that performs a certain function. Commonly, cell types are considered to be relatively stable. For example, a cell in one’s skin will not, as far as we know, spontaneously morph into a neuron.</p>
<p>Unfortunately, researchers found that such a fuzzy definition does not suffice as a foundational definition from which one could go on to create “reference maps”. One reason for this is that the resolution provided by single-cell technologies enables one to find clusters of similar cells, which one may deem to be a “cell type”, at ever more extreme resolutions. For example, <a href="https://academic.oup.com/database/article/doi/10.1093/database/baaa073/6008692">Svensson et al. (2021)</a> found that as researchers measure more cells, they tend to find more “cell types”. Here’s Figure 5 from their preprint:</p>
<center><img src="https://www.biorxiv.org/content/biorxiv/early/2019/10/17/742304/F5.large.jpg?width=800&height=600&carousel=1" alt="drawing" width="700" /></center>
<p>Moreover, we now know that cells are actually pretty plastic. While skin cells naturally don’t morph into neurons, they can be induced to morph into neurons <a href="https://www.nature.com/articles/nbt.1946">using special treatments</a>. Moreover, cells do switch their functions relatively often. A T cell floating in the blood stream can “activate” to fight an infection. Do we call transient cell states “cell types”? Do we include them in our catalog?</p>
<p>Lastly, there is the question of how to handle diseased cells. Is a neuron that is no longer able to perform the function that a neuron usually performs still a neuron? Does a “diseased” neuron get its own cell type definition? What criteria do we use to determine whether a cell is “diseased”?</p>
<p>There is not yet an agreement in the scientific community on how to answer these questions. Nonetheless, in this post, I will convey a perspective, which combines many existing ideas in the field, that will attempt to answer them. This perspective is a mental framework for thinking about cell types, cell states, and what it means to “catalog” a cell type.</p>
<h2 id="the-cellular-state-space">The cellular state space</h2>
<p>First, let’s get the obvious out of the way: the concept of “cell type” is human-made. Nature does not create categories; rather, we create categories in our minds. Categories are fundamental building blocks of our mental processes. In nature, there are <em>only cell states</em>. That is, every cell simply exists in a certain configuration: it is expressing certain genes, it is comprised of certain proteins, and its genome is chemically and spatially configured in a specific way. Moreover, cells <em>change</em> their state over time. A cell is in a constant state of flux as it goes about its function and responds to external stimuli.</p>
<p>In computer science parlance, we can think about the set of cell states as a <a href="https://en.wikipedia.org/wiki/State_space">state space</a>. That is, the cell always exists in a specific, single state at a specific time, and over time it <em>transitions</em> to new states. If these states are finite (or <a href="https://en.wikipedia.org/wiki/Countable_set">countable</a>), one can view the state space as a <a href="https://en.wikipedia.org/wiki/Cellular_automaton">cellular automaton</a>, where the state space can be represented by a <a href="https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)">graph</a>, in which nodes in the graph are states and edges are transitions between states. This is depicted in the figure below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cellular_state_space.png" alt="drawing" width="350" /></center>
<p>In reality, the state space of a cell is continuous, but for the purposes of this discussion, we will use the simplification that the state space is discrete and can be represented by a graph.</p>
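<p>To make the graph view concrete, here is a minimal Python sketch of a discretized state space, with hypothetical T-cell states and transitions (the state names are made up for illustration):</p>

```python
# A minimal sketch of a discrete cellular state space as a directed graph.
# The states and transitions here are hypothetical; a real state space is
# continuous and astronomically larger.
state_space = {
    "naive_T":     ["activated_T"],
    "activated_T": ["memory_T", "exhausted_T"],
    "memory_T":    ["activated_T"],
    "exhausted_T": [],
}

def reachable(graph, start):
    """Return the set of states reachable from `start` via transitions."""
    seen, stack = set(), [start]
    while stack:
        state = stack.pop()
        if state not in seen:
            seen.add(state)
            stack.extend(graph[state])
    return seen

print(reachable(state_space, "naive_T"))
```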
<p>This idea is not new. In fact there is a whole subfield of computational biology that seeks to <a href="https://en.wikipedia.org/wiki/Cellular_model">model cells</a>, and other biological systems, as computational state spaces.</p>
<h2 id="a-cell-type-is-a-subset-of-states">A cell type is a subset of states</h2>
<p>I argue that one can define a <em>cell type</em> to simply be a <strong>subset of cell states in the cellular state space</strong>. For example, when one talks about a “T cell”, they are inherently talking about all states in the cell state space in which the cell is performing a function that we have named “T cell”. Importantly, a cell type is a human-made partition on the cellular state space. This is depicted in the figure below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cellular_state_space_cell_type.png" alt="drawing" width="350" /></center>
<p>Importantly, one can define cell types arbitrarily. In fact, any member of the <a href="https://en.wikipedia.org/wiki/Power_set">power set</a> of cell states could be given a name and considered to be a cell type! Of course, as human beings with particular goals (such as treating disease), only a very small number of subsets of the state space are useful to think about. Thus, it might not be a good idea to go ahead and create millions of cell types, even though we could.</p>
<h2 id="cataloging-cell-types-with-ontologies">Cataloging cell types with ontologies</h2>
<p>A big question is, how do we organize all of these cell types? One idea that I find particularly compelling is to use <a href="https://en.wikipedia.org/wiki/Knowledge_graph">knowledge graphs</a> or <a href="https://en.wikipedia.org/wiki/Ontology_(information_science)">ontologies</a> (the two concepts are very similar, with a few subtle differences). In such graphs, each node represents a concept and an edge between two concepts represents a relationship between them. The <em>subtype</em> relationship between two concepts is often denoted using an edge labelled “is a”. For example, if we have a knowledge graph containing the nodes “car” and “vehicle”, we would draw an “is a” edge between them, which encodes the knowledge that “every car is a vehicle.”</p>
<p>In the cellular state space, these “is a” edges are simply subset relationships. If one cell type’s set of states is a subset of another cell type’s set of states, then we can draw an “is a” edge between them in the cell type ontology. For example if we have “Cell Type B is a Cell Type A”, this means that any cell in the set of states labelled “Cell Type B” is also in the set of states labelled as “Cell Type A”. This is depicted in the figure below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cellular_state_space_ontologies.png" alt="drawing" width="400" /></center>
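<p>This subset view makes ontology construction mechanical. The following sketch (with hypothetical cell types and state names) derives “is a” edges directly from strict-subset relationships:</p>

```python
# Sketch: cell types as named subsets of a (hypothetical) cellular state
# space. "is a" edges of the ontology are derived from strict-subset
# relationships between those sets.
cell_types = {
    "immune cell":  {"s1", "s2", "s3", "s4", "s5", "s6"},
    "T cell":       {"s1", "s2", "s3", "s4"},
    "CD4+ T cell":  {"s1", "s2"},
}

def is_a_edges(types):
    """Return (child, parent) pairs where child's states are a strict subset of parent's."""
    return sorted((a, b) for a in types for b in types
                  if a != b and types[a] < types[b])

print(is_a_edges(cell_types))
```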
<h2 id="viewing-disease-through-the-lense-of-cellular-state-spaces">Viewing disease through the lens of cellular state spaces</h2>
<p>The idea of defining cell types to be subsets of cell states enables one to define disease cell types. That is, a diseased cell type is simply a collection of cell states just like any other cell type. This is depicted in the figure below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cellular_states_diseased.png" alt="drawing" width="400" /></center>
<p>Because diseased cell types are represented in the same framework as any other cell type, we can add them to an ontology of cell types as discussed previously:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cellular_states_diseased_ontology.png" alt="drawing" width="400" /></center>
<h2 id="viewing-batch-effects-through-the-lense-of-cellular-state-spaces">Viewing batch effects through the lens of cellular state spaces</h2>
<p>Another important point to keep in mind is that the subgraph of cell states used to define a given cell type need not be connected in the cellular state space. For example, one individual’s T cells are almost certainly in a slightly different state than another individual’s T cells owing to differences in genotype and environment. Nonetheless, we may still wish to call both of these cells “T cells”.</p>
<p>This may also occur in two samples of cultured cells. The two cell cultures may not be grown under the exact same conditions and thus, there may be a slight difference in the cellular states of the cells in the two cultures. Nonetheless, we may wish to still define the cells in the two cultures to be of the same cell types.</p>
<p>We do so as follows: we extend the cellular state space to include multiple individuals or multiple samples (i.e., multiple <em>batches</em>). This results in two disconnected, approximately <a href="https://en.wikipedia.org/wiki/Graph_isomorphism">isomorphic</a> subgraphs. This is depicted in the figure below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cellular_state_space_isomorphism.png" alt="drawing" width="500" /></center>
<p>From this angle, we can more rigorously define the common task in single-cell analysis of removing batch effects between two samples. That is, our goal is to find the isomorphism between the two samples’ cellular state spaces. Of course, in practice, we don’t have access to the underlying cellular state space, so we are left with heuristics. (For example, <a href="https://www.nature.com/articles/nbt.4091">Haghverdi et al. (2018)</a> propose a method that detects <em>mutual nearest neighbors</em> between cells belonging to two different batches and then uses these neighbors to transform the cells into a common space.)</p>
<h2 id="putting-these-ideas-into-practice">Putting these ideas into practice</h2>
<p>All of the ideas that I presented in this post are horribly simplified. In general, our knowledge of cellular function and the underlying state space of the biochemistry of cells is woefully incomplete and thus, it remains impossible to rigorously define a cell type as a set of cellular states as I discussed here. Nonetheless, I find it to be a useful mental model for thinking about cell types, cell states, and for placing open problems in bioinformatics into a common conceptual framework.</p>
<h2 id="further-reading">Further reading</h2>
<ul>
<li>An article by Cole Trapnell discussing the differences between cell types and cell states: <a href="https://genome.cshlp.org/content/25/10/1491.full.html">https://genome.cshlp.org/content/25/10/1491.full.html</a></li>
<li>An article by Samantha Morris on the ongoing discussion on how to think about cell types and cell states: <a href="https://dev.biologists.org/content/146/12/dev169748.abstract">https://dev.biologists.org/content/146/12/dev169748.abstract</a></li>
<li>Opinions on how to define a cell type: <a href="https://www.cell.com/cell-systems/pdf/S2405-4712(17)30091-1.pdf">https://www.cell.com/cell-systems/pdf/S2405-4712(17)30091-1.pdf</a></li>
<li>Human Cell Atlas: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5762154/">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5762154/</a></li>
<li>An exploration of the distinction between cell type and cell state in Clytia Medusa by Tara Chari <em>et al.</em>: <a href="https://www.biorxiv.org/content/10.1101/2021.01.22.427844v2.full.pdf">https://www.biorxiv.org/content/10.1101/2021.01.22.427844v2.full.pdf</a></li>
</ul>Matthew N. BernsteinThe advent of single-cell genomics has brought about new efforts to characterize and catalog all of the cell types in the human body. Despite these efforts, the very definition of a “cell type” is under debate. In this post, I will discuss a conceptual framework for defining cell types as subsets of states in an underlying cellular state space. Moreover, I will link the cellular state space to biomedical ontologies that attempt to capture biological knowledge regarding cell types.RNA-seq: the basics2021-01-07T00:00:00-08:002021-01-07T00:00:00-08:00https://mbernste.github.io/posts/rna_seq_basics<p><em>RNA sequencing (RNA-seq) has become a ubiquitous tool in biomedical research for measuring gene expression in a population of cells, or a single cell, across the genome. Despite its ubiquity, RNA-seq is relatively complex and there exists a large research effort towards developing statistical and computational methods for analyzing the raw data that it produces. In this post, I will provide a high level overview of RNA-seq and describe how to interpret some of the common units in which gene expression is measured from an RNA-seq experiment.</em></p>
<h2 id="introduction">Introduction</h2>
<p>RNA sequencing (RNA-seq) measures the transcription of each gene in a biological sample (i.e., a group of cells or a single cell). In this post, I will review the RNA-seq protocol and explain how to interpret the most commonly used units of gene expression derived from an RNA-seq experiment: transcripts per million (TPM). I will also contrast transcripts per million with another common unit of expression: reads per kilobase per million mapped reads (RPKM). This post will assume a basic understanding of the <a href="https://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology">Central Dogma</a> of molecular biology.</p>
<p>Getting started, let’s review the inputs and outputs of an RNA-seq experiment. We’re given a biological sample consisting of a cell or a population of cells, and our goal is to estimate the <strong>transcript abundances</strong> from each gene in the sample – that is, the <em>fraction</em> of transcripts in the sample that originate from each gene. A toy example is depicted below where the genome consists of only three genes: a Blue gene, a Green gene, and a Yellow gene.</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/RNA_seq_input_output.png" alt="drawing" width="700" /></center>
<p>The transcript abundances can be encoded as a vector of numbers where each element $i$ of the vector stores the fraction of transcripts in the sample originating from gene $i$. This vector is often called a <strong>gene expression profile</strong>.</p>
<h2 id="overview-of-rna-seq">Overview of RNA-seq</h2>
<p>Here are the general steps of an RNA-seq experiment:</p>
<ol>
<li><strong>Isolation:</strong> Isolate RNA molecules from a cell or population of cells.</li>
<li><strong>Fragmentation:</strong> Break RNA molecules into fragments (on the order of a few hundred bases long).</li>
<li><strong>Reverse transcription:</strong> Reverse transcribe the RNA into DNA.</li>
<li><strong>Amplification:</strong> Amplify the DNA molecules using <a href="https://en.wikipedia.org/wiki/Polymerase_chain_reaction">polymerase chain reaction</a>.</li>
<li><strong>Sequencing:</strong> Feed the amplified DNA fragments to a sequencer. The sequencer randomly samples fragments and records a short subsequence from the end (or both ends) of the fragment (on the order of a hundred bases long). These measured subsequences are called <strong>sequencing reads</strong>. A sequencing experiment generates millions of reads that are then stored in a digital file.</li>
<li><strong>Alignment:</strong> Computationally align the reads to the genome. That is, find a character-to-character match between each read and a subsequence within the genome. This is a challenging computational task given that genomes consist of billions of bases and a typical RNA-seq experiment generates millions of reads. (Caveat: New algorithms, such as kallisto (<a href="https://www.nature.com/articles/nbt.3519">Bray et al. 2016</a>) and Salmon (<a href="https://www.nature.com/articles/nmeth.4197">Patro et al. 2017</a>), circumvent the computationally expensive task of performing character-to-character alignment via approximate alignments called “pseudoalignment” and “quasi-alignment” respectively. These ideas are <a href="https://liorpachter.wordpress.com/2015/11/01/what-is-a-read-mapping/">very similar</a>.)</li>
<li><strong>Quantification:</strong> For each gene, count the number of reads that align to the gene. (Caveat: because of sequencing errors and the presence of reads that align to multiple genes, one performs <a href="https://academic.oup.com/bioinformatics/article/26/4/493/243395">statistical inference</a> to infer the gene of origin for each read. That is, the read “counts” for each gene are inferred quantities. Simple counting of reads aligning to each gene can be viewed as a crude inference procedure.)</li>
</ol>
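<p>A toy version of the quantification step can be sketched as follows. The read and gene names are made up, and, as noted above, real pipelines infer the gene of origin for ambiguous reads rather than simply counting:</p>

```python
from collections import Counter

# Toy version of the quantification step: tally reads per gene from a list
# of (read, gene) alignments. Read and gene names are hypothetical; real
# pipelines statistically infer the gene of origin for ambiguous reads.
alignments = [
    ("read1", "Blue"), ("read2", "Blue"), ("read3", "Green"),
    ("read4", "Yellow"), ("read5", "Green"), ("read6", "Blue"),
]

read_counts = Counter(gene for _read, gene in alignments)
print(read_counts)  # Blue: 3, Green: 2, Yellow: 1
```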
<p>By design, each step of the RNA-seq protocol preserves, in expectation, the relative abundance of each transcript. Here’s a figure illustrating all of these steps (Taken from <a href="https://search.proquest.com/openview/af4f51ec373a0b13438c59e7731adeed/1?pq-origsite=gscholar&cbl=18750&diss=y">Bernstein 2019</a>). This figure depicts a toy example where the genome consists of only five genes specified by the colors red, blue, purple, green, and orange:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/RNA_seq_schematic.png" alt="drawing" width="800" /></center>
<h2 id="an-abstracted-overview-of-rna-seq">An abstracted overview of RNA-seq</h2>
<p>The RNA-seq protocol may appear somewhat complex so let’s look at an abstracted view of the procedure. In this abstracted view, we will reduce RNA-seq down to two steps. First, we extract all of the transcripts from the cells in the sample. Then, we randomly sample <em>locations</em> along all of the transcripts in the sample. That is, each read is viewed as a <em>sampled location</em> from some transcript in the sample. Of course, this is not physically what RNA-seq is doing, but it is a mathematically equivalent process (or at least approximately equivalent; there are a few caveats, but this is the gist of it).</p>
<p>In the figure below, we depict a toy example where we have a total of three genes in the genome, each with only one isoform: a Blue gene, a Green gene, and a Yellow gene. We then extract 13 total transcripts from the sample: 7 transcripts from the Blue gene, 4 transcripts from the Green gene, and 2 transcripts from the Yellow gene. In reality, a single cell contains <a href="https://www.qiagen.com/us/resources/faq?id=06a192c2-e72d-42e8-9b40-3171e1eb4cb8&lang=en">hundreds of thousands</a> of transcripts. We can then think of the reads that we generate from the RNA-seq experiment as random locations along these 13 transcripts. Here we depict 10 reads:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/RNA_seq_abstracted.png" alt="drawing" width="800" /></center>
<p>Because we are sampling <em>locations</em> along all of the transcripts in the sample, we will tend to get more reads from longer genes and fewer reads from shorter genes. Thus, these counts will not alone be an accurate estimation of the fraction of transcripts from each gene.</p>
<p>Let’s say in this toy example the Blue gene is 4 bases long, the Green gene is 7 bases long, and the Yellow gene is 2 bases long. Then, if we sample many reads, the fraction of locations/reads sampled from each transcript will converge to the following:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/RNA_seq_reads_vs_transcript_abundance.png" alt="drawing" width="700" /></center>
<p>Notice how these fractions differ from the fraction of transcripts that originate from each gene. Notably, the fraction of reads from the Green gene is higher than the fraction of <em>transcripts</em> from the Green gene. This is because the Green gene is longer and thus, when we sample locations along the transcripts, we are more likely to select locations along a transcript from the Green gene. In the next section, we will discuss how to counteract this effect in order to recover the fraction of transcripts from each gene.</p>
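<p>We can check this convergence numerically using the toy lengths (4, 7, and 2 bases) and transcript counts (7, 4, and 2) from the example:</p>

```python
# Reproduce the toy example: read fractions converge to length-weighted
# transcript counts, not to the transcript fractions themselves.
lengths     = {"Blue": 4, "Green": 7, "Yellow": 2}   # gene lengths (bases)
transcripts = {"Blue": 7, "Green": 4, "Yellow": 2}   # transcripts in sample

total_transcripts = sum(transcripts.values())
theta = {g: transcripts[g] / total_transcripts for g in transcripts}

total_bases = sum(lengths[g] * transcripts[g] for g in lengths)
read_frac = {g: lengths[g] * transcripts[g] / total_bases for g in lengths}

print(theta)      # Blue ≈ 0.538, Green ≈ 0.308, Yellow ≈ 0.154
print(read_frac)  # Blue ≈ 0.467, Green ≈ 0.467, Yellow ≈ 0.067
```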
<h2 id="estimating-the-fraction-of-transcripts-from-each-gene">Estimating the fraction of transcripts from each gene</h2>
<p>Before we get started, let’s define some mathematical notation:</p>
<ol>
<li>Let $G$ be the number of genes.</li>
<li>Let $N$ be the number of reads.</li>
<li>Let $c_i$ be the number of reads aligning to gene $i$.</li>
<li>Let $t_i$ be the number of transcripts from gene $i$ in the sample.</li>
<li>Let $l_i$ be the length of gene $i$.</li>
</ol>
<p>Now let’s look at the quantity that we are after: the fraction of transcripts from each gene, which we will denote as $\theta_i$.</p>
\[\theta_i := \frac{t_i}{\sum_{j=1}^G t_j}\]
<p>How do we estimate this from our read counts? First, we realize that the total number of nucleotides belonging to gene $i$ in the sample can be computed by multiplying the length of gene $i$ by the number of transcripts from gene $i$:</p>
\[n_i := l_it_i\]
<p>This is the total number of RNA bases within all of the RNA transcripts floating around in the sample that originated from gene $i$.</p>
<p>Furthermore, recall that each read can be thought of as a randomly sampled location from the set of all possible locations along the transcripts in the sample. In this light, $n_i$ represents the total number of possible start sites for a given read from gene $i$. Therefore, the fraction of reads we would expect to see from gene $i$ is</p>
\[p_i := \frac{l_it_i}{\sum_{j=1}^G l_jt_j} = \frac{n_i}{\sum_{j=1}^G n_j}\]
<p>Another way to look at this is as the probability that if we select a read, that read will have originated from gene $i$. This is simply the probability parameter for a <a href="https://en.wikipedia.org/wiki/Bernoulli_distribution">Bernoulli random variable</a>, and thus, its maximum likelihood estimate is simply:</p>
\[\hat{p}_i := \frac{c_i}{N}\]
<p>With our estimates, we can then estimate $\hat{\theta}_i$ as follows:</p>
\[\hat{\theta}_i := \frac{\hat{p}_i}{l_i} \left(\sum_{j=1}^G \frac{\hat{p}_j}{l_j} \right)^{-1}\]
<p>Let’s derive it:</p>
\[\begin{align*} \theta_i &= \frac{t_i}{\sum_{j=1}^G t_j} \\ &= \frac{ \frac{n_i}{l_i} }{ \sum_{j=1}^G \frac{n_j}{l_j}} && \text{because} \ n_i = l_it_i \implies t_i = \frac{n_i}{l_i} \\ &= \frac{ \frac{p_i \sum_{j=1}^G n_j}{l_i} }{\sum_{j=1}^G p_j \frac{\sum_{k=1}^G n_k}{l_j}} && \text{because} \ p_i = \frac{n_i}{\sum_{j=1}^G n_j} \implies n_i = p_i \sum_{j=1}^G n_j \\ &= \frac{ \frac{p_i}{l_i}} {\sum_{j=1}^G \frac{p_j}{l_j}} \end{align*}\]
<p>Then, to estimate $\theta_i$, we simply plug in our estimate $\hat{p}_i$ for each gene to arrive at our estimate $\hat{\theta}_i$.</p>
<p>Note that these $\theta_i$ values will typically be very small because there are so many genes. Therefore, it is common to multiply each $\theta_i$ by one million. The resulting values, called <strong>transcripts per million (TPM)</strong>, tell you the number of transcripts originating from each gene out of every million transcripts in the sample:</p>
\[\text{TPM}_i := 10^6 \times \frac{p_i}{l_i} \left(\sum_{j=1}^G \frac{p_j}{l_j} \right)^{-1}\]
<p>Thus, if we substitute $\hat{p}_i$ into the above equation, we have an <em>estimate</em> of the transcripts per million in the sample for gene $i$. We’ll use $\hat{\text{TPM}}$ to differentiate <em>estimated</em> TPMs from true TPMs. That is,</p>
\[\hat{\text{TPM}}_i := 10^6 \times \frac{\hat{p}_i}{l_i} \left(\sum_{j=1}^G \frac{\hat{p}_j}{l_j} \right)^{-1}\]
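<p>The estimator above can be sketched in a few lines of Python. The read counts here are made up, but chosen to be proportional to the expected read fractions from the toy example, so the estimate recovers the true transcript fractions:</p>

```python
# Sketch of the estimator: read counts -> p_hat -> theta_hat -> estimated
# TPM. Counts are made up, proportional to the toy example's expected
# read fractions (28:28:4).
counts  = {"Blue": 140, "Green": 140, "Yellow": 20}  # c_i
lengths = {"Blue": 4, "Green": 7, "Yellow": 2}       # l_i
N = sum(counts.values())

p_hat = {g: counts[g] / N for g in counts}                       # c_i / N
norm = sum(p_hat[g] / lengths[g] for g in p_hat)                 # sum_j p_hat_j / l_j
theta_hat = {g: (p_hat[g] / lengths[g]) / norm for g in p_hat}   # estimated fractions
tpm_hat = {g: 1e6 * theta_hat[g] for g in theta_hat}             # estimated TPM

print(theta_hat)  # recovers Blue = 7/13, Green = 4/13, Yellow = 2/13
```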
<h2 id="handling-genes-with-multiple-isoforms">Handling genes with multiple isoforms</h2>
<p>Most genes in the human genome are <a href="https://en.wikipedia.org/wiki/Alternative_splicing">alternatively spliced</a>, resulting in multiple isoforms of the gene. In the example above, we assumed that each gene had only one isoform. How do we handle the case in which a gene has multiple isoforms?</p>
<p>In fact, this is quite trivial. We simply compute the fraction of transcripts <em>of each isoform</em> as described above, and then simply sum the fractions of all isoforms for each gene to arrive at the fraction of transcripts originating from the gene. This is depicted in the figure below where we now assume that the Blue gene produces two isoforms:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/RNA_seq_isoform_abundances.png" alt="drawing" width="400" /></center>
<p>Thus, if we have isoform-level estimates of each gene’s TPM, then we simply sum these estimates across isoforms for each gene to arrive at an estimate of the TPM for the gene as a whole.</p>
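<p>As a sketch, with hypothetical isoform-level TPM estimates:</p>

```python
# Sketch: gene-level TPM estimates as sums of isoform-level estimates.
# Isoform names and TPM values are hypothetical.
isoform_tpm = {
    ("Blue", "isoform1"):   300_000.0,
    ("Blue", "isoform2"):   238_462.0,
    ("Green", "isoform1"):  307_692.0,
    ("Yellow", "isoform1"): 153_846.0,
}

gene_tpm = {}
for (gene, _isoform), tpm in isoform_tpm.items():
    gene_tpm[gene] = gene_tpm.get(gene, 0.0) + tpm

print(gene_tpm["Blue"])  # 538462.0
```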
<h2 id="handling-noise-and-multi-mapped-reads">Handling noise and multi-mapped reads</h2>
<p>So far, we have assumed an idealized scenario in which we know with certainty which gene “produced” each read. In reality, this is not the case. Sometimes, a read may align to multiple isoforms within a single gene (extremely common), or it might align to multiple genes (common enough to affect results), or it might align imperfectly to a gene and we might wonder whether the read really was produced by the gene in the first place. That is, was the mismatch in alignment due to a sequencing error, or was the read <em>not</em> produced by that gene at all (for example, the read may have been produced by a contaminant DNA fragment)?</p>
<p>Because in the real-world, we don’t know which gene produced each read, we have to infer it. State-of-the-art methods perform this inference under an assumed probabilistic generative model (<a href="https://doi.org/10.1093/bioinformatics/btp692">Li et al. 2011</a>) of the reads-generating process (to be discussed in a future post).</p>
<h2 id="rpkm-versus-tpm">RPKM versus TPM</h2>
<p>In the <a href="https://doi.org/10.1038/nmeth.1226">early days of RNA-seq</a>, read counts were summarized in units of <strong>reads per kilobase per million mapped reads (RPKM)</strong>. As will be discussed in the next section, RPKMs are known to suffer from a fundamental issue.</p>
<p>Before digging into the problem with RPKM, let’s first define it. Recall, the issue with the raw read counts is that we will tend to sample more reads from longer isoforms/genes and thus, the raw counts will not reflect the relative abundance of each isoform or gene. To get around this, we might try the following normalization procedure: simply divide the fraction of reads from each gene/isoform by the length of each gene/isoform. That is,</p>
\[\frac{c_i}{N l_i} = \frac{\hat{p}_i}{l_i}\]
<p>Here we see that $\frac{c_i}{l_i}$ is the <em>number</em> of reads <em>per base</em> of the gene/isoform. That is, it is the average number of reads generated from each base along the gene/isoform. Then, if we divide this quantity by the total number of reads, $N$, we arrive at the number of reads per base of the gene/isoform <em>per read</em>.</p>
<p>This is a bit confusing. It almost seems circular that we’re computing the number of reads per base per read. If that’s confusing, here’s another way to think about it: $\frac{c_i}{N l_i}$ is the <em>fraction</em> of the reads that were generated, on average, by each base of gene $i$. This inherently normalizes for gene length because the units are in terms of a single base of the gene!</p>
<p>Because $N$ is very large (on the order of millions), and so too is $l_i$ (on the order of thousands), we multiply $\frac{c_i}{N l_i}$ by $10^9$. The resulting units are reads per kilobase per million mapped reads of a given gene:</p>
\[\text{RPKM}_i := 10^9 \times \frac{c_i}{N l_i}\]
<p>Note that $10^9$ is the result of multiplying by one thousand bases and one million reads (hence, “<strong>kilo</strong>bases per <strong>million</strong> mapped reads”).</p>
<p>With read counts normalized into units of RPKM, we can compare expression values between genes without worrying about gene length. That is, if we have two genes, $i$ and $j$, and we find that $\text{RPKM}_i > \text{RPKM}_j$, we can infer that gene $i$ is likely more highly expressed than gene $j$.</p>
<p>Now, let’s compare RPKM to estimates of TPM. We see that RPKMs can be viewed as “unnormalized” estimates of TPMs:</p>
\[\begin{align*} \hat{\text{TPM}}_i &:= 10^6 \times \frac{\hat{p}_i}{l_i} \left(\sum_{j=1}^G \frac{\hat{p}_j}{l_j} \right)^{-1} \\ &= 10^{6} \times \frac{10^9 \hat{p}_i}{N l_i} \left(\sum_{j=1}^G \frac{10^9 \hat{p}_j}{N l_j}\right)^{-1} \\ &= 10^{6} \times \frac{ \text{RPKM}_i }{\sum_{j=1}^G \text{RPKM}_j} \end{align*}\]
<p>At a higher level, one can contrast RPKM from estimated TPM by viewing RPKM as a <strong>normalization of the read counts</strong>, whereas TPM is an estimate of a <strong>physical quantity</strong> (<a href="https://arxiv.org/abs/1104.3889">Pachter 2011</a>). That is, one can attempt to <em>estimate</em> TPMs from the read counts, or, one can normalize the read counts using RPKMs. In the next section we will discuss a fundamental problem with RPKMs and show that TPMs are generally preferred.</p>
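<p>This relationship is easy to verify numerically. Using made-up counts and the toy gene lengths, we can compute estimated TPMs directly and via renormalized RPKMs and confirm they agree:</p>

```python
# Numerically check the identity: estimated TPMs are RPKMs renormalized
# to sum to one million. Counts and lengths are made up.
counts  = {"Blue": 140, "Green": 140, "Yellow": 20}
lengths = {"Blue": 4, "Green": 7, "Yellow": 2}
N = sum(counts.values())

rpkm = {g: 1e9 * counts[g] / (N * lengths[g]) for g in counts}
tpm_from_rpkm = {g: 1e6 * rpkm[g] / sum(rpkm.values()) for g in rpkm}

# Direct TPM estimate from the earlier formula.
p_hat = {g: counts[g] / N for g in counts}
norm = sum(p_hat[g] / lengths[g] for g in p_hat)
tpm_direct = {g: 1e6 * (p_hat[g] / lengths[g]) / norm for g in p_hat}

assert all(abs(tpm_from_rpkm[g] - tpm_direct[g]) < 1e-6 for g in counts)
```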
<h2 id="problems-with-rpkm">Problems with RPKM</h2>
<p>The problem with RPKM values is that, although they do allow us to compare relative transcript abundances <em>between two genes within a single sample</em>, they do not allow us to compare relative transcript abundances of a <em>single gene between two samples</em>.</p>
<p>Let’s illustrate this with an example. In the figure below, we depict two samples with the same three genes as used previously, each with only one isoform. Again, the Blue gene is of length 4, the Green gene is of length 7, and the Yellow gene is of length 2. The two samples have the same fraction of transcripts originating from the Yellow gene, but differ in the fraction of transcripts originating from the Blue and Green genes. If we generated many reads, assuming no noise, then the RPKMs would converge to the values depicted below the pie charts:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/problem_w_RPKM.png" alt="drawing" width="500" /></center>
<p>As you can see, the RPKM values differ for the Yellow gene between the two samples even though the fraction of transcripts from the Yellow gene is the same between the two samples! This is not desirable.</p>
<p>Why is this the case? Recall that RPKMs can be viewed as un-normalized TPM estimates. As shown by <a href="https://doi.org/10.1093/bioinformatics/btp692">Li and Dewey (2011)</a>, it turns out that the normalization factor includes the <em>mean length</em> of all of the transcripts in the sample (see the Appendix to this blog post for the full derivation):</p>
\[\hat{\text{TPM}}_i = \text{RPKM}_i \left[ N 10^{-3} \sum_{k=1}^G \hat{\theta}_k l_k \right]\]
<p>We see that the term \(\sum_{k=1}^G \hat{\theta}_k l_k\) is the mean length of all of the transcripts in the sample. Thus, the normalization constant required to transform each $\text{RPKM}_i$ value into the estimate $\hat{\text{TPM}}_i$ depends on the abundances of <em>other</em> transcripts in the sample, not just the abundance of the specific gene/isoform.</p>
<p>In our toy example above, the mean length of transcripts in Sample 2 is greater than in Sample 1 because Sample 2 contains relatively more transcripts of the Green gene, which is longer than the Blue gene.</p>
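<p>We can demonstrate this numerically. In the sketch below, the two samples’ transcript fractions are made up, but the Yellow gene’s fraction is identical in both; its expected RPKM nonetheless differs:</p>

```python
# Demonstrate the RPKM problem: Yellow's transcript fraction (theta) is
# identical in both samples, yet its expected RPKM differs. The theta
# values are made up; lengths are the toy lengths (4, 7, 2).
lengths = {"Blue": 4, "Green": 7, "Yellow": 2}

def expected_rpkm(theta, lengths):
    """Expected RPKM when read fractions converge to p_i = l_i*theta_i / sum_j l_j*theta_j."""
    total = sum(theta[g] * lengths[g] for g in theta)
    return {g: 1e9 * (theta[g] * lengths[g] / total) / lengths[g]
            for g in theta}

sample1 = {"Blue": 0.5, "Green": 0.3, "Yellow": 0.2}
sample2 = {"Blue": 0.3, "Green": 0.5, "Yellow": 0.2}

r1, r2 = expected_rpkm(sample1, lengths), expected_rpkm(sample2, lengths)
print(r1["Yellow"], r2["Yellow"])  # differ despite identical theta for Yellow
```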
<h2 id="problems-with-tpm">Problems with TPM</h2>
<p>In the previous section we showed that estimated TPMs are preferred to RPKMs because estimated TPMs allow one to compare <em>estimated relative transcript abundances</em> between two samples. This is a nice advantage over RPKMs; however, it’s important to keep in mind that because TPMs are simply scaled fractions, they do not enable us to compare absolute expression between two samples. They’re relative expression values.</p>
<p>For example, when comparing the estimated TPMs for some gene $i$ between two samples, which we’ll call Sample 1 and Sample 2, it may be that the TPM is larger in Sample 1 even though the gene is more lowly expressed there in terms of absolute expression. Here’s an example to illustrate:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/relative_vs_absolute_abundance.png" alt="drawing" width="500" /></center>
<p>As you can see, the absolute number of transcripts from the Blue gene is lower in Sample 1 than in Sample 2, but the <em>fraction</em> of transcripts (and thus, the TPM) of the Blue gene is higher in Sample 1 than in Sample 2.</p>
<p>If RNA-seq only enables us to compute the relative abundances of transcripts within a sample, how is one to compare expression between multiple samples? This is a challenging problem with a number of proposed solutions. One method involves injecting the sample with RNA whose abundance is known, called <a href="https://en.wikipedia.org/wiki/RNA_spike-in">spike-in RNA</a>, and using the spike-in RNA abundance as a baseline from which to estimate absolute expression. Other solutions involve using <a href="https://en.wikipedia.org/wiki/Housekeeping_gene">house-keeping genes</a>, whose expression is assumed to be constant between samples and thus can serve as a baseline for estimating absolute abundances (in a similar vein to the spike-in method). Another method, called <a href="https://doi.org/10.1186/gb-2010-11-10-r106">median-ratio normalization</a>, makes the assumption that most genes are not differentially expressed between the samples being compared and, using this assumption, proposes a procedure for normalizing counts between samples (to be discussed in a future post).</p>
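<p>As a rough sketch of the median-ratio idea (the count table below is made up): each sample’s size factor is the median of its gene-wise ratios to a geometric-mean pseudo-reference, and dividing counts by the size factors puts the samples on a common scale:</p>

```python
import math

# Sketch of median-ratio (median-of-ratios) normalization. Each sample's
# size factor is the median of its gene-wise ratios to a geometric-mean
# pseudo-reference. The count table is made up.
counts = {
    "sample1": [100, 200, 50, 400],
    "sample2": [200, 400, 100, 800],
}
n_genes = 4

# Pseudo-reference: per-gene geometric mean of counts across samples.
reference = [
    math.exp(sum(math.log(counts[s][g]) for s in counts) / len(counts))
    for g in range(n_genes)
]

def median(values):
    values = sorted(values)
    mid = len(values) // 2
    return values[mid] if len(values) % 2 else (values[mid - 1] + values[mid]) / 2

size_factors = {
    s: median([counts[s][g] / reference[g] for g in range(n_genes)])
    for s in counts
}
# Dividing each sample's counts by its size factor puts them on a common scale.
normalized = {s: [c / size_factors[s] for c in counts[s]] for s in counts}
```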
<h2 id="further-reading">Further reading</h2>
<ul>
<li>A similar, yet more succinct and more advanced, blog post by Harold Pimentel discussing common units of expression from RNA-seq: <a href="https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/">https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/</a></li>
<li>A more rigorous summary of the statistical methods behind RNA-seq analysis by Lior Pachter: <a href="https://arxiv.org/abs/1104.3889">https://arxiv.org/abs/1104.3889</a></li>
<li>A nice tutorial on RNA-seq normalization: <a href="https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html">https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html</a></li>
</ul>
<h2 id="appendix">Appendix</h2>
<p>Deriving the normalizing constant between RPKM and TPM:</p>
\[\begin{align*}\hat{\text{TPM}}_i &= 10^{6} \frac{ \text{RPKM}_i }{\sum_{j=1}^G \text{RPKM}_j} \\ &= \text{RPKM}_i \left[\frac{10^6 }{10^9 \sum_{j=1}^G \frac{\hat{p}_j}{N l_j} } \right] \\ &= \text{RPKM}_i \left[ N 10^{-3} \left(\sum_{j=1}^G \frac{\hat{p}_j}{l_j} \right)^{-1} \right] \\ &= \text{RPKM}_i \left[ N 10^{-3} \left( \sum_{j=1}^G \frac{ \frac{\hat{\theta}_j l_j}{\sum_{k=1}^G \hat{\theta}_kl_k} } {l_j} \right)^{-1} \right] && \text{because} \ \hat{p}_j = \frac{\hat{\theta}_j l_j}{\sum_{k=1}^G \hat{\theta}_kl_k} \\ &= \text{RPKM}_i \left[ N 10^{-3} \left(\sum_{k=1}^G \hat{\theta}_k l_k \right) \left( \sum_{j=1}^G \hat{\theta}_j \right)^{-1} \right] \\ &= \text{RPKM}_i \left[ N 10^{-3} \sum_{k=1}^G \hat{\theta}_k l_k \right] && \text{because} \ \sum_j \hat{\theta}_j = 1 \end{align*}\]Matthew N. BernsteinRNA sequencing (RNA-seq) has become a ubiquitous tool in biomedical research for measuring gene expression in a population of cells, or a single cell, across the genome. Despite its ubiquity, RNA-seq is relatively complex and there exists a large research effort towards developing statistical and computational methods for analyzing the raw data that it produces. In this post, I will provide a high level overview of RNA-seq and describe how to interpret some of the common units in which gene expression is measured from an RNA-seq experiment.