Matthew N. Bernstein, Personal website (https://mbernste.github.io/feed.xml, generated by Jekyll, 2021-06-16)

Hypothesis testing versus machine learning: A unified perspective (2021-06-14, https://mbernste.github.io/posts/statistical_test_ml)

<p>THIS POST IS CURRENTLY UNDER CONSTRUCTION</p>
<h2 id="introduction">Introduction</h2>
<p>Statistical hypothesis testing and supervised machine learning are two very different frameworks for making binary decisions with data. A <a href="https://doi.org/10.1016/j.patter.2020.100115">recent article</a> by Jingyi Jessica Li and Xin Tong discusses the differences between these two strategies and offers guidance on which one to choose for a given binary decision problem. Indeed, the two strategies are very different: they were born from two different scientific fields (statistics and computer science) and are generally best suited for different kinds of problems. For some problems, the choice of method is obvious. For example, when attempting to decide whether the means of two groups differ based on a finite sample, the obvious best approach is hypothesis testing. Alternatively, when attempting to classify images as being of, say, either a dog or a cat, machine learning is often the much better choice.</p>
<p>Nonetheless, there are problems for which the best method is not so obvious. Such problems are often found in computational biology. For example, inferring <a href="https://en.wikipedia.org/wiki/Gene_regulatory_network">gene regulatory networks</a> is a problem that has been addressed using <a href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2217-z">both hypothesis testing and supervised machine learning</a>. Of course, understanding the differences between the two frameworks, and their respective strengths and weaknesses, is vital for choosing the appropriate strategy for one’s problem. On the surface, the biggest difference between hypothesis testing and supervised machine learning is that one approach requires training data while the other does not. That is, machine learning requires training data but (seemingly) few assumptions about the data, whereas hypothesis testing requires (seemingly) no training data but strong assumptions about the data in the form of the null distribution. This seems to be a huge difference!</p>
<p>In the remainder of this blog post, I will argue that this difference is not as vast as it seems. When dealing with complex data, such as for applications in computational biology, the development of a novel hypothesis test requires looking at a lot of data! In some sense, one can view this as “learning” or “training”, even though it isn’t being performed by a machine; rather, it’s being performed by a statistician. Conversely, all machine learning algorithms make assumptions about the data; they just do so a bit more implicitly than hypothesis testing does. These implicit assumptions, collectively called the <em>inductive biases</em> of a learning algorithm, are crucial to understand when applying machine learning to a problem.</p>
<p>Before digging into these commonalities, I will provide a very brief (and not comprehensive) review of the binary classification task, hypothesis testing, and supervised machine learning. The language and mathematical notation I use will be a sort of hybrid between the language and notation traditionally used in statistics and machine learning. My goal is to highlight the similarities between the two approaches.</p>
<h2 id="making-binary-decisions-with-data">Making binary decisions with data</h2>
<p>The problem we’re interested in addressing is the following: we’re given data, $X$, describing some object or sample and we want to use $X$ in order to make some kind of binary decision between two choices $C_0$ or $C_1$. We’ll use the variable $Y$ to denote the correct choice and thus, we are attempting to determine whether $Y = C_0$ or $Y = C_1$.</p>
<p>There are many scenarios that fit this description. For example, $X$ might be an image of either a cat or a dog and we are interested in classifying which it is. In this case, $Y$ is the decision regarding whether the image is of a cat or dog. A very different example would be that $X$ is a sample of datapoints (e.g., test scores) from two populations (e.g., students from two different schools) and we are interested in deciding whether the true (unobserved) means of the two groups are equal based upon our observed, but limited sample $X$. Here, $C_0$ might be the conclusion/decision that indeed the two groups of students have the same underlying test scores whereas $C_1$ is the conclusion that the two groups’ mean scores differ.</p>
<p>Traditionally, machine learning is used for examples like the former (classifying images as either cats or dogs) and hypothesis testing is used for the latter (deciding whether the mean scores between two populations of students are equal). While indeed machine learning is usually a better fit for problems similar to the former and statistical testing for the latter (see the <a href="https://doi.org/10.1016/j.patter.2020.100115">article</a> by Li and Tong), this does not necessarily have to be the case. Both of these problems can be cast as the problem of making a binary decision between two choices, which can be addressed by either hypothesis testing or machine learning.</p>
<h2 id="a-brief-review-of-hypothesis-testing">A brief review of hypothesis testing</h2>
<p>In hypothesis testing, we summarize our data $X$ using a summarization function $T$ called a <strong>statistic</strong>. For example, if $X$ consists of test scores for a sample of students, then $T$ might be a function that simply computes the mean test score. When performing hypothesis testing, one makes the very strong assumption that if one of the two choices, say $C_0$, is correct, then $T(X)$ will follow a specified distribution. In hypothesis testing parlance, this distribution is called the <a href="https://en.wikipedia.org/wiki/Null_distribution#:~:text=Null%20distribution%20is%20a%20tool,is%20said%20to%20be%20true">null distribution</a>, which we’ll denote as $P_{\text{null}}$. In hypothesis testing, one computes the probability of drawing a sample from the null distribution that is “at least as extreme” as $T(X)$. This probability is called a <a href="https://en.wikipedia.org/wiki/P-value">p-value</a>:</p>
\[\text{p-value} := \int P_{\text{null}}(x) \ \mathbb{I}(x \ \text{is more extreme than} \ T(X)) \ dx\]
<p>where $\mathbb{I}$ is the indicator function.</p>
<p>A low p-value means that the null distribution is a poor explanation for our observed $T(X)$ and that we should look elsewhere for an explanation of $X$. That is, a low p-value supports choosing the alternative category, $C_1$, for $X$ rather than $C_0$ (making this choice is called “rejecting the null hypothesis”; the null hypothesis is the hypothesis that the null distribution produced $T(X)$). Traditionally, one pre-specifies a threshold, $\alpha$, such that if the p-value is below $\alpha$, one chooses $C_1$ rather than $C_0$.</p>
<p>In mathematical notation, this decision function is:</p>
\[f(T(X)) := \begin{cases}C_1 \ \text{if} \ \text{p-value} < \alpha \\ C_0 \ \text{otherwise} \end{cases}\]
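<p>To make this concrete, here is a minimal sketch of such a decision function for the two-school test-score example, using only the Python standard library. The scores are hypothetical, and a permutation test stands in for whatever null distribution one might assume; the point is just the shape of $f$: compute $T(X)$, compare it against the null distribution, and threshold the resulting p-value at $\alpha$.</p>

```python
import random
from statistics import mean

def permutation_test_decision(sample_a, sample_b, alpha=0.05, n_perm=10_000, seed=0):
    """Decide between C0 (equal means) and C1 (differing means).

    T(X) is the absolute difference of sample means. Under C0 the group
    labels are exchangeable, so the null distribution is approximated by
    shuffling the labels many times and counting how often the shuffled
    statistic is at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            extreme += 1
    p_value = (extreme + 1) / (n_perm + 1)  # add-one to avoid p = 0
    return "C1" if p_value < alpha else "C0"

# Hypothetical test scores from two schools
school_a = [82, 75, 90, 68, 77, 85, 79, 88]
school_b = [70, 65, 72, 60, 68, 74, 66, 71]
print(permutation_test_decision(school_a, school_b))
```

<p>Here the statistician, not a learning algorithm, chose both the statistic and the null distribution; that design choice is exactly where the “training” discussed later enters.</p>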
<h2 id="a-brief-review-of-supervised-machine-learning">A brief review of supervised machine learning</h2>
<p>In supervised machine learning for binary classification, our data $X$ is summarized using a summarization function $T$ in a similar manner to hypothesis testing. In machine learning, $T(X)$ is called the vector of <a href="https://en.wikipedia.org/wiki/Feature_selection">features</a> of $X$ and is usually a numerical vector rather than a single number, as is usually the case in hypothesis testing. Furthermore, to use machine learning, one is required to have on hand a set of <em>training examples</em> consisting of items and their associated correct decisions. We denote these item-decision pairs as</p>
\[\mathcal{D} := (X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)\]
<p>Then, given these training examples, we employ a <strong>learning algorithm</strong> that looks at these data and finds a decision function, $f$, that will perform the binary decision when given $T(X)$.</p>
<p>If one views the learning algorithm itself as a function, $\mathcal{A}$, that takes as input a training set $\mathcal{D}$ and outputs a decision function $f$, then we can formulate the complete machine learning-based decision making algorithm as:</p>
\[f(T(X)) := \mathcal{A}(\mathcal{D})(T(X))\]
<p>where $\mathcal{A}(\mathcal{D})$ is the decision function output by the learning algorithm $\mathcal{A}$ when trained on dataset $\mathcal{D}$.</p>
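<p>As a toy illustration of this formulation (hypothetical data, and a deliberately simple nearest-centroid rule standing in for any learning algorithm), note that $\mathcal{A}$ really is just a function that consumes a dataset $\mathcal{D}$ and emits a decision function $f$:</p>

```python
from statistics import mean

def nearest_centroid_learner(training_data):
    """A toy learning algorithm A: consumes a dataset D of
    (feature vector, label) pairs and returns a decision function f
    that assigns a new feature vector to the class whose centroid
    (per-dimension mean of the training vectors) is nearest."""
    by_label = {}
    for features, label in training_data:
        by_label.setdefault(label, []).append(features)
    centroids = {
        label: [mean(dim) for dim in zip(*vectors)]
        for label, vectors in by_label.items()
    }

    def decision_function(features):
        def dist2(label):
            return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
        return min(centroids, key=dist2)

    return decision_function

# Hypothetical training set D of (T(X), Y) pairs
D = [([1.0, 1.2], "C0"), ([0.8, 1.0], "C0"),
     ([3.0, 3.1], "C1"), ([3.2, 2.9], "C1")]
f = nearest_centroid_learner(D)  # f = A(D)
print(f([0.9, 1.1]))             # a point near the C0 training examples
```

<p>Any supervised method, from logistic regression to a deep network, fits this same interface; only the internals of $\mathcal{A}$ change.</p>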
<h2 id="both-approaches-may-require-training-data">Both approaches may require “training” data</h2>
<p>The use of training data in machine learning is obvious; training models using data is the whole point! On the other hand, the use of training data in hypothesis testing is much less obvious, but I would argue it is still present, at least in the development of hypothesis tests for complex problems. Here, I use “training data” in a very loose sense to mean all data that was used, by either person or machine, to formulate the decision function $f$. In machine learning, $f$ is output by the learning algorithm $\mathcal{A}$. In hypothesis testing, $f$ is handcrafted by a person. This handcrafting of $f$ almost always requires data.</p>
<p>As an illustrative example, let’s look at the problem of <a href="">identifying differentially expressed genes</a> in <a href="https://mbernste.github.io/posts/rna_seq_basics/">RNA-seq</a> data. In this problem, one is given gene expression measurements from two conditions and is interested in identifying the subset of genes whose mean expression differs between the two conditions. This fundamental problem has been addressed by a multitude of approaches in bioinformatics.</p>
<h2 id="both-approaches-require-assumptions-about-the-data">Both approaches require assumptions about the data</h2>
<p>It is obvious that hypothesis testing requires one to make strong assumptions about the data. These assumptions take the form of the null distribution – that is, we assume a probability distribution over $T(X)$ conditioned on $Y = C_0$. If data as extreme as $T(X)$ looks unlikely under the null distribution, we choose $C_1$.</p>
<p>In supervised machine learning, the assumptions made about the data are less obvious, but they are always present. In fact, their presence is a mathematical certainty! The <a href="">No Free Lunch Theorem</a> in statistical learning theory states that no machine learning algorithm will work for every possible distribution of data. That is, ANY algorithm will work well on some distributions and poorly on others. One can view the distributions on which the model works well as the assumptions that the model is making about the data. These assumptions are often called the <strong>inductive bias</strong> of the algorithm.</p>
<p>Here are a few examples of some inductive biases of well-known algorithms:</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Logistic_regression">Logistic regression</a> assumes that the individual elements of $T(X)$ (assuming $T(X)$ is a feature vector) contribute additively and independently to the likelihood that $C_1$ is a better choice than $C_0$.</li>
  <li><a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">Convolutional neural networks</a> assume that useful groups of elements within $T(X)$ ($T(X)$ is assumed to be a tensor) are invariant to spatial translation; that is, a useful local pattern remains useful wherever it appears in the input. If this is not the case, then convolutional neural networks might not be the best choice.</li>
</ul>
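<p>The first bullet can be demonstrated with a small standard-library sketch on the classic XOR problem (hypothetical data; a hand-rolled gradient-descent logistic regression, not any particular library’s implementation). Because XOR’s label depends on an <em>interaction</em> between the two features, a model whose inductive bias is “features contribute additively” cannot fit it, while adding the interaction term $x_1 x_2$ as an extra feature changes the bias and makes the problem easy:</p>

```python
import math

def train_logistic(points, labels, lr=0.5, epochs=2000):
    """Minimal logistic regression fit by stochastic gradient descent."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1 / (1 + math.exp(-z))
            g = p - y  # gradient of the log-loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# XOR: the label is 1 iff exactly one feature is 1. No additive
# (linear) combination of the raw features separates the classes.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

f_raw = train_logistic(X, y)
acc_raw = sum(f_raw(x) == t for x, t in zip(X, y)) / 4

# Adding the interaction feature x1*x2 changes the inductive bias,
# making the same learning algorithm succeed.
X_aug = [x + [x[0] * x[1]] for x in X]
f_aug = train_logistic(X_aug, y)
acc_aug = sum(f_aug(x) == t for x, t in zip(X_aug, y)) / 4
print(acc_raw, acc_aug)
```

<p>The algorithm is unchanged between the two runs; only the feature map $T$, and with it the set of distributions the model can fit, differs.</p>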
<h2 id="a-note-on-inference-versus-prediction">A note on “inference” versus “prediction”</h2>

Variational inference (2021-05-31, https://mbernste.github.io/posts/variational_inference)

<p><em>In this post, I will present a high-level explanation of variational inference: a paradigm for estimating a posterior distribution when computing it explicitly is intractable. Variational inference finds an approximate posterior by solving a specific optimization problem that seeks to minimize the disparity between the true posterior and the approximate posterior.</em></p>
<h2 id="introduction">Introduction</h2>
<p>Variational inference is a high-level paradigm for estimating a posterior distribution when computing it explicitly is intractable. More specifically, variational inference is used in situations in which we have a model that involves hidden random variables $Z$, observed data $X$, and some posited probabilistic model over the hidden and observed random variables, \(P(Z, X)\). Our goal is to compute the posterior distribution $P(Z \mid X)$. In an ideal situation, we would do so by using Bayes theorem:</p>
\[p(z \mid x) = \frac{p(x \mid z)p(z)}{p(x)}\]
<p>where \(z\) and \(x\) are realizations of \(Z\) and \(X\) respectively and \(p(.)\) are probability mass/density functions for the distributions implied by their arguments.</p>
<p>In practice, it is often difficult to compute $p(z \mid x)$ via Bayes theorem because the denominator $p(x)$ does not have a closed form. Usually, the denominator $p(x)$ can only be expressed as an integral that marginalizes over $z$: $p(x) = \int p(x, z) \ dz$. In such scenarios, we’re often forced to approximate $p(z \mid x)$ rather than compute it directly. Variational inference is one such approximation technique.</p>
<h2 id="intuition">Intuition</h2>
<p>Instead of computing \(p(z \mid x)\) exactly via Bayes theorem, variational inference attempts to find another distribution $q(z)$ that is “close” to \(p(z \mid x)\) (how we define “closeness” between distributions will be addressed later in this post). Ideally, $q(z)$ is easier to evaluate than \(p(z \mid x)\), and, if \(p(z \mid x)\) and \(q(z)\) are similar, then we can use \(q(z)\) as a replacement for $p(z \mid x)$ for any relevant downstream tasks.</p>
<p>We restrict our search for \(q(z)\) to a family of surrogate distributions over \(Z\), called the <strong>variational distribution family</strong>, denoted by the set of distributions $\mathcal{Q}$. Our goal then is to find the distribution $q \in \mathcal{Q}$ that makes $q(z)$ as “close” to $p(z \mid x)$ as possible. When each member of $\mathcal{Q}$ is characterized by the values of a set of parameters $\phi$, we call $\phi$ the <strong>variational parameters</strong>. Our goal is then to find the value $\hat{\phi}$ that makes $q(z \mid \hat{\phi})$ as close to $p(z \mid x)$ as possible and return \(q(z \mid \hat{\phi})\) as our approximation of the true posterior.</p>
<h2 id="details">Details</h2>
<p>Variational inference uses the KL-divergence from $p(z \mid x)$ to $q(z)$ as a measure of “closeness” between these two distributions:</p>
\[KL(q(z) \ || \ p(z \mid x)) := E_{Z \sim q}\left[\log\frac{q(Z)}{p(Z \mid x)} \right]\]
<p>Thus, variational inference attempts to find</p>
\[\hat{q} := \text{argmin}_q \ KL(q(z) \ || \ p(z \mid x))\]
<p>and then returns $\hat{q}(z)$ as the approximation to the posterior.</p>
<p>Variational inference minimizes the KL-divergence by maximizing a surrogate quantity called the <strong>evidence lower bound (ELBO)</strong> (For a more in-depth discussion of the evidence lower bound, you can check out <a href="https://mbernste.github.io/posts/elbo/">my previous blog post</a>):</p>
\[\text{ELBO}(q) := E_{Z \sim q}\left[\log p(x, Z) \right] - E_{Z \sim q}\left[\log q(Z) \right]\]
<p>That is, we can formulate an optimization problem that seeks to maximize the ELBO:</p>
\[\hat{q} := \text{argmax}_q \ \text{ELBO}(q)\]
<p>The solution to this optimization problem is equivalent to the solution that minimizes the KL-divergence between $q(z)$ and $p(z \mid x)$. To see why this works, we can show that the KL-divergence can be formulated as the difference between the marginal log-likelihood of the observed data, \(\log p(x)\) (called the <em>evidence</em>) and the ELBO:</p>
\[\begin{align*}KL(q(z) \ || \ p(z \mid x)) &= E_{Z \sim q}\left[\log\frac{q(Z)}{p(Z \mid x)} \right] \\ &= E_{Z \sim q}\left[\log q(Z) \right] - E_{Z \sim q}\left[\log p(Z \mid x) \right] \\ &= E_{Z \sim q}\left[\log q(Z) \right] - E_{Z \sim q}\left[\log \frac{p(Z, x)}{p(x)} \right] \\ &= E_{Z \sim q}\left[\log q(Z) \right] - E_{Z \sim q}\left[\log p(Z, x) \right] + E_{Z \sim q}\left[\log p(x) \right] \\ &= \log p(x) - \left( E_{Z \sim q}\left[\log p(x, Z) \right] - E_{Z \sim q}\left[\log q(Z) \right] \right)\\ &= \log p(x) - \text{ELBO}(q)\end{align*}\]
<p>Because $\log p(x)$ does not depend on $q$, one can treat the ELBO as a function of $q$ and maximize the ELBO.</p>
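<p>To make this concrete, here is a small sketch on a toy conjugate model of my own choosing, with a hand-derived gradient ascent, purely for illustration: let $Z \sim N(0, 1)$ and $X \mid Z \sim N(Z, 1)$, with variational family $q(z) = N(\mu, s^2)$. For this model the ELBO has a closed form, and because the model is conjugate we know the exact posterior, $N(x/2, 1/2)$, so we can check that maximizing the ELBO recovers it:</p>

```python
import math

def elbo(mu, s, x):
    """Closed-form ELBO for Z ~ N(0,1), X|Z ~ N(Z,1), q(z) = N(mu, s^2):
    E_q[log p(Z)] + E_q[log p(x|Z)] plus the entropy of q."""
    e_log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (mu ** 2 + s ** 2)
    e_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - mu) ** 2 + s ** 2)
    entropy = 0.5 * math.log(2 * math.pi * math.e * s ** 2)
    return e_log_prior + e_log_lik + entropy

def fit_variational(x, lr=0.05, steps=2000):
    """Maximize the ELBO by gradient ascent on (mu, s).

    Differentiating the closed form above:
      dELBO/dmu = x - 2*mu,   dELBO/ds = 1/s - 2*s."""
    mu, s = 0.0, 1.0
    for _ in range(steps):
        mu += lr * (x - 2 * mu)
        s += lr * (1 / s - 2 * s)
    return mu, s

mu, s = fit_variational(x=3.0)
# The exact posterior for x = 3.0 is N(1.5, 0.5), so the ELBO optimum
# should approach mu = 1.5 and s^2 = 0.5.
print(mu, s ** 2)
```

<p>In realistic models the ELBO has no closed form and one resorts to Monte Carlo gradient estimates, but the optimization structure is the same.</p>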
<p>Conceptually, variational inference allows us to formulate our approximate Bayesian inference problem as an optimization problem. By formulating the problem as such, we can approach this optimization problem using the full toolkit available to us from the field of <a href="https://en.wikipedia.org/wiki/Mathematical_optimization">mathematical optimization</a>!</p>
<h2 id="why-is-this-method-called-variational-inference">Why is this method called “variational” inference?</h2>
<p>The term “variational” in “variational inference” comes from the mathematical area of <a href="https://en.wikipedia.org/wiki/Calculus_of_variations">the calculus of variations</a>. The calculus of variations is all about optimization problems that optimize <em>functions of functions</em> (called <strong>functionals</strong>).</p>
<p>More specifically, let’s say we have some set of functions $\mathcal{F}$ where each $f \in \mathcal{F}$ maps items from some set $A$ to some set $B$. That is,</p>
\[f: A \rightarrow B\]
<p>Let’s say we have some function $g$ that maps functions in $\mathcal{F}$ to real numbers $\mathbb{R}$. That is,</p>
\[g: \mathcal{F} \rightarrow \mathbb{R}\]
<p>Then, we may wish to solve an optimization problem of the form:</p>
\[\text{arg max}_{f \in \mathcal{F}} g(f)\]
<p>This is precisely the problem addressed by the calculus of variations. In the case of variational inference, the functional $g$ that we are optimizing is the ELBO, and the set of functions $\mathcal{F}$ that we are searching over is the set of <a href="https://mbernste.github.io/posts/measure_theory_2/">measurable functions</a> in the variational family, $\mathcal{Q}$.</p>

Three strategies for cataloging cell types (2021-03-04, https://mbernste.github.io/posts/three_strategies_cell_type_cataloging)

<p><em>In my previous post, I outlined a conceptual framework for defining and reasoning about “cell types”. Specifically, I noted that the idea of a “cell type” can be viewed as a human-made partition on the universal cellular state space. In this post, I attempt to distill three strategies for partitioning this state space and agreeing on cell type definitions.</em></p>
<h2 id="introduction">Introduction</h2>
<p>In my <a href="https://mbernste.github.io/posts/cell_types_cell_states/">previous post</a>, I outlined a conceptual framework for defining and reasoning about “cell types”. Specifically, I noted that the idea of a “cell type” can be viewed as a human-made partition on the universal cellular state space: the set of all possible states a living cell can exist in and the transitions between them. This idea can be summarized in the following figure:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cellular_state_space_ontologies.png" alt="drawing" width="400" /></center>
<p>In this framework, the task of cataloging cell types involves identifying “useful” subsets of cell states and giving those subsets names. Then, one can create a hierarchy of cell types by simply computing the subset-relationships between those sets of cell states.</p>
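<p>As a toy sketch of that last step (the state names and cell type definitions below are hypothetical), computing a hierarchy from subset relationships is straightforward once each cell type is written down as a set of states:</p>

```python
# Hypothetical cell types, each defined as a set of named cell states
cell_types = {
    "cell":        {"s1", "s2", "s3", "s4", "s5"},
    "immune cell": {"s1", "s2", "s3"},
    "T cell":      {"s1", "s2"},
    "B cell":      {"s3"},
}

def hierarchy_edges(types):
    """Return (parent, child) pairs where the child's state set is a
    strict subset of the parent's, keeping only direct edges (pairs
    with no intermediate type between them)."""
    strict = [(p, c) for p in types for c in types if types[c] < types[p]]
    return [(p, c) for (p, c) in strict
            if not any((p, m) in strict and (m, c) in strict for m in types)]

for parent, child in sorted(hierarchy_edges(cell_types)):
    print(f"{parent} -> {child}")
```

<p>The hard part, of course, is not this computation but agreeing on the sets themselves, which is the subject of the rest of this post.</p>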
<p>While this framework is conceptually clean and simple, there are a number of problems with implementing it in the real world. These problems include:</p>
<ol>
<li>We don’t know the full cellular state space</li>
<li>We don’t have a language for describing subsets of that state space</li>
<li>We don’t have a way of agreeing on how to partition the state space</li>
</ol>
<p>Problems 1 and 2 are hard, and I’ll save a discussion on these problems for later. In this post I will only discuss Problem 3: how do we agree on partitions of the state space. Said much more simply: how do we agree on a definition for a cell type.</p>
<p>In my opinion there are three core strategies that have been proposed by the scientific community; however, these ideas have taken different forms. In this post, I will attempt to tease out and more rigorously describe each of these strategies.</p>
<p>These strategies are:</p>
<ol>
<li><strong>Every scientist for themself.</strong> Come up with your own cell type definition based on your own needs. In fact, this idea is embraced by a number of single-cell RNA-seq cell type classifiers such as <a href="https://www.nature.com/articles/s41592-019-0535-3">Garnett</a>. Garnett features a “<a href="https://cole-trapnell-lab.github.io/garnett/classifiers/">zoo</a>” of cell type classifiers that one can create and then use to label a new dataset.</li>
<li><strong>Crowdsourcing.</strong> In this strategy, one may look at all of the published genomics data out there in public repositories and use these data to come to a consensus of how the scientific community as a whole defines cell types. This is the core idea behind <a href="https://www.cell.com/iscience/fulltext/S2589-0042(20)31110-X">CellO</a>, a cell type classification tool that I worked on that uses the collection of publicly available primary cell data to train cell type classifiers.</li>
<li><strong>Central authority.</strong> This is the idea behind the Human Cell Atlas. The idea here is that a single group, or committee, will collect tons of data and attempt to define the various cell types. These cell types will then serve as a reference for all of science.</li>
</ol>
<p>Let me dig a bit into each of these strategies.</p>
<h2 id="every-scientist-for-themself">Every scientist for themself</h2>
<p>This is more or less the current state of affairs (minus the whole cellular state space framework). That is, each scientist has some unique definition of a cell type that may vary, perhaps slightly, from that of another scientist who uses that same cell type name. In the cellular state space framework, this scenario looks something like the following:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cell_type_every_scientist_for_themself.png" alt="drawing" width="350" /></center>
<p>The benefit of this approach is that there is no need to come to a consensus on how to define a particular cell type. Just pick your own! However, without a common language for defining the cell states that they are using to define their cell types, this framework can easily suffer from the problem that two scientists might be using the same term to discuss two different cell types! This happens all the time. For example, when two scientists use two different sets of marker genes to label cell types in a single-cell RNA-seq dataset, they are likely choosing different subsets of the cellular state space. Just take a look at the <a href="https://academic.oup.com/nar/article/47/D1/D721/5115823">CellMarker database</a>, a database of literature-curated marker genes, and you will see that there are often multiple sets of marker genes used to define the same cell type. In computer science parlance, the cell type names are <a href="https://en.wikipedia.org/wiki/Function_overloading">overloaded</a>.</p>
<p><a href="https://www.nature.com/articles/s41592-019-0535-3">Garnett</a> is a cell type classification tool that, in some sense, embraces this idea. It has a model <a href="https://cole-trapnell-lab.github.io/garnett/classifiers/">zoo</a> where you can deposit pre-trained classifiers that have been trained on data labelled according to your own, personal cell type definitions. Moreover, Garnett provides a markup language in which you define your cell types and your cell type hierarchy based on your own choice of marker genes.</p>
<h2 id="crowdsourcing">Crowdsourcing</h2>
<p>Here’s a less prevalent, but I think intriguing, approach: take the union of the cell-state subsets that individual scientists have used and derive a consensus partition of the cellular state space. That is, if multiple scientific publications have slightly different definitions for “T cell”, let’s just use the union of all of them. This is depicted in the figure below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cell_type_crowdsource.png" alt="drawing" width="700" /></center>
<p>I argue that cell type classification tools that are trained on public data take this approach. For example, our own tool, <a href="https://www.cell.com/iscience/fulltext/S2589-0042(20)31110-X">CellO</a>, was trained on a collection of primary cell samples from the Sequence Read Archive. Another method that takes this approach is <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3834796/">URSA</a>. Importantly, the training labels used for training CellO and URSA are provided by the scientists who submitted their data. As discussed previously, they might have differing definitions for their cell types; however, this might be a good thing! We’re essentially crowdsourcing the definition of cell type to build a universal cell type classifier.</p>
<p>In fact, you can use such models to build up marker genes for defining each cell type. Because these marker genes are derived from models that are trained on an amalgamation of samples that might use slightly different definitions, one can view these definitions as sort of a consensus definition from the scientific community. You can check out CellO’s derived marker genes <a href="https://uwgraphics.github.io/CellOViewer/">here</a>.</p>
<p>One problem with this approach is that it is difficult to formalize the cell types defined in this way. Furthermore, it is prone to bad data, and thus, one might want to curate the samples one uses to create a consensus.</p>
<h2 id="central-authority">Central authority</h2>
<p>Lastly, one can rely on a central authority to define cell types. This is the idea behind the <a href="https://www.humancellatlas.org">Human Cell Atlas</a> (HCA). The goal of the HCA is to bring together an international consortium of scientists to map out the cellular state space and come to agreed-upon partitions of that space, which can then serve as a reference for all of science. This strategy is the most ambitious! It is a massive undertaking, but if it works, it would help to remove ambiguity and clarify our understanding of human biology.</p>

On cell types and cell states (2021-03-03, https://mbernste.github.io/posts/cell_types_states_and_ontologies)

<p><em>The advent of single-cell genomics has brought about new efforts to characterize and catalog all of the cell types in the human body. Despite these efforts, the very definition of a “cell type” is under debate. In this post, I will discuss a conceptual framework for defining cell types as subsets of states in an underlying cellular state space. Moreover, I will link the cellular state space to biomedical ontologies that attempt to capture biological knowledge regarding cell types.</em></p>
<h2 id="introduction">Introduction</h2>
<p>With the advent of single-cell genomics, researchers are now able to probe molecular biology at the single-cell level. That is, scientists are able to measure some aspect of a cell, such as its transcriptome (RNA-seq) or its open chromatin regions (ATAC-seq), for thousands, and <a href="https://science.sciencemag.org/content/370/6518/eaba7721/tab-figures-data">sometimes even millions</a> of cells at a time. These new technologies have brought about new efforts to map and catalog all of the cell types in the human body. The premier effort of this kind is the <a href="https://www.humancellatlas.org">Human Cell Atlas</a>, an international consortium of researchers who have set themselves on the journey towards creating “comprehensive reference maps of all human cells—the fundamental units of life—as a basis for both understanding human health and diagnosing, monitoring, and treating disease.”</p>
<p>Of course, before one begins to catalog cell types, one must define what they mean by “cell type”. This has become a topic of hot debate. Before the age of single-cell genomics, a rigorous definition was usually not necessary. Colloquially, a cell type is a category of cells in the body that performs a certain function. Commonly, cell types are considered to be relatively stable. For example, a cell in one’s skin will not, as far as we know, spontaneously morph into a neuron.</p>
<p>Unfortunately, researchers found that such a fuzzy definition does not suffice as a foundational definition from which one could go on to create “reference maps”. One reason for this is that the resolution provided by single-cell technologies enables one to find clusters of similar cells, which one may deem to be a “cell type”, at ever more extreme resolutions. For example, <a href="https://academic.oup.com/database/article/doi/10.1093/database/baaa073/6008692">Svensson et al. (2021)</a> found that as researchers measure more cells, they tend to find more “cell types”. Here’s Figure 5 from their preprint:</p>
<center><img src="https://www.biorxiv.org/content/biorxiv/early/2019/10/17/742304/F5.large.jpg?width=800&height=600&carousel=1" alt="drawing" width="700" /></center>
<p>Moreover, we now know that cells are actually pretty plastic. While skin cells naturally don’t morph into neurons, they can be induced to morph into neurons <a href="https://www.nature.com/articles/nbt.1946">using special treatments</a>. Moreover, cells do switch their functions relatively often. A T cell floating in the blood stream can “activate” to fight an infection. Do we call transient cell states “cell types”? Do we include them in our catalog?</p>
<p>Lastly, there is the question of how to handle diseased cells. Is a neuron that is no longer able to perform the function that a neuron usually performs still a neuron? Does a “diseased” neuron get its own cell type definition? What criteria do we use to determine whether a cell is “diseased”?</p>
<p>There is not yet an agreement in the scientific community on how to answer these questions. Nonetheless, in this post, I will convey a perspective, which combines many existing ideas in the field, that will attempt to answer them. This perspective is a mental framework for thinking about cell types, cell states, and what it means to “catalog” a cell type.</p>
<h2 id="the-cellular-state-space">The cellular state space</h2>
<p>First, let’s get the obvious out of the way: the concept of “cell type” is human-made. Nature does not create categories, rather, we create categories in our minds. Categories are fundamental building blocks of our mental processes. In nature, there are <em>only cell states</em>. That is, every cell simply exists in a certain configuration. It is expressing certain genes. It is comprised of certain proteins. Its genome is chemically and spatially configured in a specific way. Moreover, cells <em>change</em> their state over time. A cell is in a constant state of flux as it goes about its function and responds to external stimuli.</p>
<p>In computer science parlance, we can think about the set of cell states as a <a href="https://en.wikipedia.org/wiki/State_space">state space</a>. That is, the cell always exists in a specific, single state at a specific time, and over time it <em>transitions</em> to new states. If these states are finite (or <a href="https://en.wikipedia.org/wiki/Countable_set">countable</a>), one can view the state space as a <a href="https://en.wikipedia.org/wiki/Cellular_automaton">cellular automaton</a>, where the state space can be represented by a <a href="https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)">graph</a>, in which nodes in the graph are states and edges are transitions between states. This is depicted in the figure below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cellular_state_space.png" alt="drawing" width="350" /></center>
<p>In reality, the state space of a cell is continuous, but for the purposes of this discussion, we will use the simplification that the state space is discrete and can be represented by a graph.</p>
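For concreteness, here is a minimal sketch of such a discrete state space encoded as a directed graph. The state names and transitions are entirely made up for illustration:

```python
# A toy, discrete cellular state space encoded as a directed graph.
# State names and transitions are hypothetical, purely for illustration.
state_space = {
    "S1": ["S2"],        # S1 can transition to S2
    "S2": ["S1", "S3"],  # S2 can return to S1 or move on to S3
    "S3": ["S3"],        # S3 only transitions to itself
}

def reachable(start, space):
    """Return the set of states reachable from `start` by following edges."""
    seen, frontier = set(), [start]
    while frontier:
        state = frontier.pop()
        if state not in seen:
            seen.add(state)
            frontier.extend(space[state])
    return seen
```

With this encoding, questions like “which states can a cell in state S1 eventually occupy?” become simple graph traversals.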
<p>This idea is not new. In fact there is a whole subfield of computational biology that seeks to <a href="https://en.wikipedia.org/wiki/Cellular_model">model cells</a>, and other biological systems, as computational state spaces.</p>
<h2 id="a-cell-type-is-a-subset-of-states">A cell type is a subset of states</h2>
<p>I argue that one can define a <em>cell type</em> to simply be a <strong>subset of cell states in the cellular state space</strong>. For example, when one talks about a “T cell”, they are inherently talking about all states in the cell state space in which the cell is performing a function that we have named “T cell”. Importantly, a cell type is a human-made partition on the cellular state space. This is depicted in the figure below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cellular_state_space_cell_type.png" alt="drawing" width="350" /></center>
<p>Importantly, one can define cell types arbitrarily. In fact, any member of the <a href="https://en.wikipedia.org/wiki/Power_set">power set</a> of cell states could be given a name and considered to be a cell type! Of course, as human beings with particular goals (such as treating disease), only a very small number of subsets of the state space are useful to think about. Thus, it might not be a good idea to go ahead and create millions of cell types, even though we could.</p>
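Under this view, a cell type is just a named subset of states, and a cell “is of” a type whenever its current state lies in that subset. A toy sketch (all state and type names are hypothetical):

```python
# Hypothetical toy state space and some named subsets ("cell types").
states = {"s1", "s2", "s3", "s4", "s5"}

cell_types = {
    "T cell":           {"s1", "s2", "s3"},
    "activated T cell": {"s2", "s3"},
    "neuron":           {"s4", "s5"},
}

def types_of(state, cell_types):
    """A cell belongs to every type whose subset contains its current state."""
    return {name for name, subset in cell_types.items() if state in subset}
```

Note that a single state can belong to multiple cell types at once, which is exactly what makes subtype relationships possible.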
<h2 id="cataloging-cell-types-with-ontologies">Cataloging cell types with ontologies</h2>
<p>A big question is, how do we organize all of these cell types? One idea that I find particularly compelling is to use <a href="https://en.wikipedia.org/wiki/Knowledge_graph">knowledge graphs</a> or <a href="https://en.wikipedia.org/wiki/Ontology_(information_science)">ontologies</a> (the two concepts are very similar, with a few subtle differences). In such graphs, each node represents a concept and an edge between two concepts represents a relationship between them. For example, the <em>subtype</em> relationship between two concepts is often denoted using an edge labelled “is a”. If we have a knowledge graph containing the nodes “car” and “vehicle”, we would draw an “is a” edge between them, which encodes the knowledge that “every car is a vehicle.”</p>
<p>In the cellular state space, these “is a” edges are simply subset relationships. If one cell type’s set of states is a subset of another cell type’s set of states, then we can draw an “is a” edge between them in the cell type ontology. For example if we have “Cell Type B is a Cell Type A”, this means that any cell in the set of states labelled “Cell Type B” is also in the set of states labelled as “Cell Type A”. This is depicted in the figure below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cellular_state_space_ontologies.png" alt="drawing" width="400" /></center>
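The subset test that generates these “is a” edges can be sketched directly in code. The cell types and states below are hypothetical placeholders:

```python
# Hypothetical cell types, each defined as a subset of cell states.
cell_types = {
    "Cell Type A": {"s1", "s2", "s3", "s4"},
    "Cell Type B": {"s2", "s3"},
    "Cell Type C": {"s4"},
}

# Draw an "is a" edge from a child type to a parent type whenever the
# child's states are a subset of the parent's states.
is_a_edges = [
    (child, parent)
    for child, child_states in cell_types.items()
    for parent, parent_states in cell_types.items()
    if child != parent and child_states <= parent_states
]
```

Here both Cell Type B and Cell Type C get an “is a” edge to Cell Type A, because every state in each is also a state of Cell Type A.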
<h2 id="viewing-disease-through-the-lense-of-cellular-state-spaces">Viewing disease through the lens of cellular state spaces</h2>
<p>The idea of defining cell types to be subsets of cell states enables one to define disease cell types. That is, a diseased cell type is simply a collection of cell states just like any other cell type. This is depicted in the figure below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cellular_states_diseased.png" alt="drawing" width="400" /></center>
<p>Because diseased cell types are represented in the same framework as any other cell type, we can add them to an ontology of cell types as discussed previously:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cellular_states_diseased_ontology.png" alt="drawing" width="400" /></center>
<h2 id="viewing-batch-effects-through-the-lense-of-cellular-state-spaces">Viewing batch effects through the lens of cellular state spaces</h2>
<p>Another important point to keep in mind is that the subgraph of cell states used to define a given cell type need not be connected in the cellular state space. For example, one individual’s T cells are almost certainly in a slightly different state than another individual’s T cells owing to differences in genotype and environment. Nonetheless, we may still wish to call both of these cells “T cells”.</p>
<p>This may also occur in two samples of cultured cells. The two cell cultures may not be grown under the exact same conditions and thus, there may be a slight difference in the cellular states of the cells in the two cultures. Nonetheless, we may wish to still define the cells in the two cultures to be of the same cell types.</p>
<p>We do so as follows: we extend the cellular state space to include multiple individuals or multiple samples (i.e., multiple <em>batches</em>). This results in two disconnected, and approximately <a href="https://en.wikipedia.org/wiki/Graph_isomorphism">isomorphic</a> subgraphs. This is depicted in the figure below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/cellular_state_space_isomorphism.png" alt="drawing" width="500" /></center>
<p>From this angle, we can more rigorously frame the common task in single-cell analysis of removing batch effects between two samples: the goal is to find the isomorphism between the two subgraphs of the cellular state space that the cells in the two samples occupy. Of course, in practice we don’t have access to the underlying cellular state space, so we are left with heuristics. (For example, <a href="https://www.nature.com/articles/nbt.4091">Haghverdi et al. (2018)</a> propose a method that detects <em>mutual nearest neighbors</em> between cells belonging to two different batches and then uses these neighbors to transform the cells into a common space.)</p>
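The mutual-nearest-neighbors idea can be sketched in a few lines. This is a bare-bones illustration of the concept, not the actual implementation of Haghverdi et al. (2018), and the example points are made up:

```python
import numpy as np

def mutual_nearest_neighbors(X, Y, k=2):
    """Find pairs (i, j) such that Y[j] is among the k nearest neighbors of
    X[i] AND X[i] is among the k nearest neighbors of Y[j] (Euclidean).
    X and Y are cells-by-features matrices from two batches."""
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # pairwise distances
    nn_xy = np.argsort(d, axis=1)[:, :k]    # k nearest Y-cells for each X-cell
    nn_yx = np.argsort(d, axis=0)[:k, :].T  # k nearest X-cells for each Y-cell
    return {(i, int(j)) for i in range(len(X)) for j in nn_xy[i] if i in nn_yx[j]}
```

The returned pairs are the anchors one would then use to estimate a transformation between the two batches.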
<h2 id="putting-these-ideas-into-practice">Putting these ideas into practice</h2>
<p>All of the ideas that I presented in this post are horribly simplified. In general, our knowledge of cellular function and the underlying state space of the biochemistry of cells is woefully incomplete and thus, it remains impossible to rigorously define a cell type as a set of cellular states as I discussed here. Nonetheless, I find it to be a useful mental model for thinking about cell types, cell states, and for placing open problems in bioinformatics into a common conceptual framework.</p>
<h2 id="further-reading">Further reading</h2>
<ul>
<li>An article by Cole Trapnell discussing the differences between cell types and cell states: <a href="https://genome.cshlp.org/content/25/10/1491.full.html">https://genome.cshlp.org/content/25/10/1491.full.html</a></li>
<li>An article by Samantha Morris on the ongoing discussion on how to think about cell types and cell states: <a href="https://dev.biologists.org/content/146/12/dev169748.abstract">https://dev.biologists.org/content/146/12/dev169748.abstract</a></li>
<li>Opinions on how to define a cell type: <a href="https://www.cell.com/cell-systems/pdf/S2405-4712(17)30091-1.pdf">https://www.cell.com/cell-systems/pdf/S2405-4712(17)30091-1.pdf</a></li>
<li>Human Cell Atlas: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5762154/">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5762154/</a></li>
<li>An exploration of the distinction between cell type and cell state in Clytia Medusa by Tara Chari <em>et al.</em>: <a href="https://www.biorxiv.org/content/10.1101/2021.01.22.427844v2.full.pdf">https://www.biorxiv.org/content/10.1101/2021.01.22.427844v2.full.pdf</a></li>
</ul>

<p>Matthew N. Bernstein</p>

<p><em>The advent of single-cell genomics has brought about new efforts to characterize and catalog all of the cell types in the human body. Despite these efforts, the very definition of a “cell type” is under debate. In this post, I will discuss a conceptual framework for defining cell types as subsets of states in an underlying cellular state space. Moreover, I will link the cellular state space to biomedical ontologies that attempt to capture biological knowledge regarding cell types.</em></p>

<h1 id="rna-seq-the-basics">RNA-seq: the basics</h1>

<p>2021-01-07 · <a href="https://mbernste.github.io/posts/rna_seq_basics">https://mbernste.github.io/posts/rna_seq_basics</a></p>

<p><em>RNA sequencing (RNA-seq) has become a ubiquitous tool in biomedical research for measuring gene expression in a population of cells, or a single cell, across the genome. Despite its ubiquity, RNA-seq is relatively complex and there exists a large research effort towards developing statistical and computational methods for analyzing the raw data that it produces. In this post, I will provide a high level overview of RNA-seq and describe how to interpret some of the common units in which gene expression is measured from an RNA-seq experiment.</em></p>
<h2 id="introduction">Introduction</h2>
<p>RNA sequencing (RNA-seq) measures the transcription of each gene in a biological sample (i.e., a group of cells or a single cell). In this post, I will review the RNA-seq protocol and explain how to interpret the most commonly used unit of gene expression derived from an RNA-seq experiment: transcripts per million (TPM). I will also contrast transcripts per million with another common unit of expression: reads per kilobase per million mapped reads (RPKM). This post will assume a basic understanding of the <a href="https://en.wikipedia.org/wiki/Central_dogma_of_molecular_biology">Central Dogma</a> of molecular biology.</p>
<p>Getting started, let’s review the inputs and outputs of an RNA-seq experiment. We’re given a biological sample consisting of a cell or a population of cells, and our goal is to estimate the <strong>transcript abundances</strong> from each gene in the sample – that is, the <em>fraction</em> of transcripts in the sample that originate from each gene. A toy example is depicted below where the genome consists of only three genes: a Blue gene, a Green gene, and a Yellow gene.</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/RNA_seq_input_output.png" alt="drawing" width="700" /></center>
<p>The transcript abundances can be encoded as a vector of numbers where each element $i$ of the vector stores the fraction of transcripts in the sample originating from gene $i$. This vector is often called a <strong>gene expression profile</strong>.</p>
<h2 id="overview-of-rna-seq">Overview of RNA-seq</h2>
<p>Here are the general steps of an RNA-seq experiment:</p>
<ol>
<li><strong>Isolation:</strong> Isolate RNA molecules from a cell or population of cells.</li>
<li><strong>Fragmentation:</strong> Break RNA molecules into fragments (on the order of a few hundred bases long).</li>
<li><strong>Reverse transcription:</strong> Reverse transcribe the RNA into DNA.</li>
<li><strong>Amplification:</strong> Amplify the DNA molecules using <a href="https://en.wikipedia.org/wiki/Polymerase_chain_reaction">polymerase chain reaction</a>.</li>
<li><strong>Sequencing:</strong> Feed the amplified DNA fragments to a sequencer. The sequencer randomly samples fragments and records a short subsequence from the end (or both ends) of the fragment (on the order of a hundred bases long). These measured subsequences are called <strong>sequencing reads</strong>. A sequencing experiment generates millions of reads that are then stored in a digital file.</li>
<li><strong>Alignment:</strong> Computationally align the reads to the genome. That is, find a character-to-character match between each read and a subsequence within the genome. This is a challenging computational task given that genomes consist of billions of bases and a typical RNA-seq experiment generates millions of reads. (Caveat: New algorithms, such as kallisto (<a href="https://www.nature.com/articles/nbt.3519">Bray et al. 2016</a>) and Salmon (<a href="https://www.nature.com/articles/nmeth.4197">Patro et al. 2017</a>), circumvent the computationally expensive task of performing character-to-character alignment via approximate alignments called “pseudoalignment” and “quasi-aligmnent” respectively. These ideas are <a href="https://liorpachter.wordpress.com/2015/11/01/what-is-a-read-mapping/">very similar</a>.)</li>
<li><strong>Quantification:</strong> For each gene, count the number of reads that align to the gene. (Caveat: because of sequencing errors and the presence of reads that align to multiple genes, one performs <a href="https://academic.oup.com/bioinformatics/article/26/4/493/243395">statistical inference</a> to infer the gene of origin for each read. That is, the read “counts” for each gene are inferred quantities. Simple counting of reads aligning to each gene can be viewed as a crude inference procedure.)</li>
</ol>
<p>By design, each step of the RNA-seq protocol preserves, in expectation, the relative abundance of each transcript. Here’s a figure illustrating all of these steps (Taken from <a href="https://search.proquest.com/openview/af4f51ec373a0b13438c59e7731adeed/1?pq-origsite=gscholar&cbl=18750&diss=y">Bernstein 2019</a>). This figure depicts a toy example where the genome consists of only five genes specified by the colors red, blue, purple, green, and orange:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/RNA_seq_schematic.png" alt="drawing" width="800" /></center>
<h2 id="an-abstracted-overview-of-rna-seq">An abstracted overview of RNA-seq</h2>
<p>The RNA-seq protocol may appear somewhat complex so let’s look at an abstracted view of the procedure. In this abstracted view, we will reduce RNA-seq down to two steps. First, we extract all of the transcripts from the cells in the sample. Then, we randomly sample <em>locations</em> along all of the transcripts in the sample. That is, each read is viewed as a <em>sampled location</em> from some transcript in the sample. Of course, this is not physically what RNA-seq is doing, but it is a mathematically equivalent process (or at least approximately equivalent; there are a few caveats, but this is the gist of it).</p>
<p>In the figure below, we depict a toy example where we have a total of three genes in the genome, each with only one isoform: a Blue gene, a Green gene, and a Yellow gene. We then extract 13 total transcripts from the sample: 7 transcripts from the Blue gene, 4 transcripts from the Green gene, and 2 transcripts from the Yellow gene. In reality, a single cell contains <a href="https://www.qiagen.com/us/resources/faq?id=06a192c2-e72d-42e8-9b40-3171e1eb4cb8&lang=en">hundreds of thousands</a> of transcripts. We can then think of the reads that we generate from the RNA-seq experiment as random locations along these 13 transcripts. Here we depict 10 reads:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/RNA_seq_abstracted.png" alt="drawing" width="800" /></center>
<p>Because we are sampling <em>locations</em> along all of the transcripts in the sample, we will tend to get more reads from longer genes and fewer reads from shorter genes. Thus, these counts will not alone be an accurate estimation of the fraction of transcripts from each gene.</p>
<p>Let’s say in this toy example the Blue gene is 4 bases long, the Green gene is 7 bases long, and the Yellow gene is 2 bases long. Then, if we sample many reads, the fraction of locations/reads sampled from each transcript will converge to the following:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/RNA_seq_reads_vs_transcript_abundance.png" alt="drawing" width="700" /></center>
<p>Notice how these fractions differ from the fraction of transcripts that originate from each gene. Notably, the fraction of reads from the Green gene is higher than the fraction of <em>transcripts</em> from the Green gene. This is because the Green gene is longer and thus, when we sample locations along the transcripts, we are more likely to select locations along a transcript from the Green gene. In the next section, we will discuss how to counteract this effect in order to recover the fraction of transcripts from each gene.</p>
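We can sanity-check this convergence with a small simulation. This is a sketch using the toy numbers above (gene lengths 4, 7, and 2 bases; 7, 4, and 2 transcripts): each read is modeled as a uniformly sampled base position, so a gene's expected read share is its share of <em>bases</em>, not its share of transcripts:

```python
import random

random.seed(0)

# Toy numbers from the text: (gene length, number of transcripts) per gene.
genes = {"Blue": (4, 7), "Green": (7, 4), "Yellow": (2, 2)}

# One list entry per base across all transcripts; sampling a read is
# sampling a position uniformly from this list.
positions = [g for g, (length, count) in genes.items() for _ in range(length * count)]
reads = random.choices(positions, k=100_000)

frac = {g: reads.count(g) / len(reads) for g in genes}
```

The read fractions converge to $l_i t_i / \sum_j l_j t_j$ (roughly 0.47, 0.47, 0.07 here), not to the transcript fractions 7/13, 4/13, 2/13.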
<h2 id="estimating-the-fraction-of-transcripts-from-each-gene">Estimating the fraction of transcripts from each gene</h2>
<p>Before we get started, let’s define some mathematical notation:</p>
<ol>
<li>Let $G$ be the number of genes.</li>
<li>Let $N$ be the number of reads.</li>
<li>Let $c_i$ be the number of reads aligning to gene $i$.</li>
<li>Let $t_i$ be the number of transcripts from gene $i$ in the sample.</li>
<li>Let $l_i$ be the length of gene $i$.</li>
</ol>
<p>Now let’s look at the quantity that we are after: the fraction of transcripts from each gene, which we will denote as $\theta_i$.</p>
\[\theta_i := \frac{t_i}{\sum_{j=1}^G t_j}\]
<p>How do we estimate this from our read counts? First, we realize that the total number of nucleotides belonging to gene $i$ in the sample can be computed by multiplying the length of gene $i$ by the number of transcripts from gene $i$:</p>
\[n_i := l_it_i\]
<p>This is the total number of RNA bases within all of the RNA transcripts floating around in the sample that originated from gene $i$.</p>
<p>Furthermore, recall that each read can be thought of as a randomly sampled location from the set of all possible locations along the transcripts in the sample. In this light, $n_i$ represents the total number of possible start sites for a given read from gene $i$. Therefore, the fraction of reads we would expect to see from gene $i$ is</p>
\[p_i := \frac{l_it_i}{\sum_{j=1}^G l_jt_j} = \frac{n_i}{\sum_{j=1}^G n_j}\]
<p>Another way to look at this is as the probability that if we select a read, that read will have originated from gene $i$. This is simply the probability parameter for a <a href="https://en.wikipedia.org/wiki/Bernoulli_distribution">Bernoulli random variable</a>, and thus, its maximum likelihood estimate is simply:</p>
\[\hat{p}_i := \frac{c_i}{N}\]
<p>With our estimates, we can then estimate $\hat{\theta}_i$ as follows:</p>
\[\hat{\theta}_i := \frac{\hat{p}_i}{l_i} \left(\sum_{j=1}^G \frac{\hat{p}_j}{l_j} \right)^{-1}\]
<p>Let’s derive it:</p>
\[\begin{align*} \theta_i &= \frac{t_i}{\sum_{j=1}^G t_j} \\ &= \frac{ \frac{n_i}{l_i} }{ \sum_{j=1}^G \frac{n_j}{l_j}} && \text{because} \ n_i = l_it_i \implies t_i = \frac{n_i}{l_i} \\ &= \frac{ \frac{p_i \sum_{j=1}^G n_j}{l_i} }{\sum_{j=1}^G p_j \frac{\sum_{k=1}^G n_k}{l_j}} && \text{because} \ p_i = \frac{n_i}{\sum_{j=1}^G n_j} \implies n_i = p_i \sum_{j=1}^G n_j \\ &= \frac{ \frac{p_i}{l_i}} {\sum_{j=1}^G \frac{p_j}{l_j}} \end{align*}\]
<p>Then, to estimate $\theta_i$, we simply plug in our estimate $\hat{p}_i$ for each gene to arrive at our estimate $\hat{\theta}_i$.</p>
<p>Note that these $\theta_i$ values will typically be very small because there are so many genes. Therefore, it is common to multiply each $\theta_i$ by one million. The resulting values, called <strong>transcripts per million (TPM)</strong>, tell you the number of transcripts in the sample originating from each gene out of every million transcripts:</p>
\[\text{TPM}_i := 10^6 \times \frac{p_i}{l_i} \left(\sum_{j=1}^G \frac{p_j}{l_j} \right)^{-1}\]
<p>Thus, if we substitute $\hat{p}_i$ into the above equation, we have an <em>estimate</em> of the transcripts per million in the sample for gene $i$. We’ll use $\hat{\text{TPM}}$ to differentiate <em>estimated</em> TPMs from true TPMs. That is,</p>
\[\hat{\text{TPM}}_i := 10^6 \times \frac{\hat{p}_i}{l_i} \left(\sum_{j=1}^G \frac{\hat{p}_j}{l_j} \right)^{-1}\]
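Putting the estimators together on the toy example, here is a sketch that uses the exact expected read counts ($c_i \propto l_i t_i$, with no sampling noise) and recovers the true transcript fractions:

```python
import numpy as np

# Toy example: genes of length 4, 7 and 2 present at 7, 4 and 2 transcripts.
# Expected read counts are proportional to length * transcript count.
lengths = np.array([4.0, 7.0, 2.0])
counts = np.array([28.0, 28.0, 4.0])  # N = 60 reads in total

p_hat = counts / counts.sum()                            # estimated read fractions
theta_hat = (p_hat / lengths) / (p_hat / lengths).sum()  # estimated transcript fractions
tpm_hat = 1e6 * theta_hat                                # estimated TPM
```

Dividing each read fraction by the gene's length undoes the length bias, recovering $\hat{\theta} = (7/13, 4/13, 2/13)$, the true transcript fractions from the earlier figure.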
<h2 id="handling-genes-with-multiple-isoforms">Handling genes with multiple isoforms</h2>
<p>Most genes in the human genome are <a href="https://en.wikipedia.org/wiki/Alternative_splicing">alternatively spliced</a>, resulting in multiple isoforms of the gene. In the example above, we assumed that each gene had only one isoform. How do we handle the case in which a gene has multiple isoforms?</p>
<p>In fact, this is quite trivial. We simply compute the fraction of transcripts <em>of each isoform</em> as described above, and then simply sum the fractions of all isoforms for each gene to arrive at the fraction of transcripts originating from the gene. This is depicted in the figure below where we now assume that the Blue gene produces two isoforms:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/RNA_seq_isoform_abundances.png" alt="drawing" width="400" /></center>
<p>Thus, if we have isoform-level estimates of each gene’s TPM, then we simply sum these estimates across isoforms for each gene to arrive at an estimate of the TPM for the gene as a whole.</p>
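The gene-level aggregation is a simple sum over isoforms. In this sketch, the isoform names and TPM values are made-up numbers:

```python
# Hypothetical isoform-level TPM estimates (made-up numbers); the gene-level
# TPM is simply the sum over each gene's isoforms.
isoform_tpm = {
    ("Blue", "isoform_1"): 300_000.0,
    ("Blue", "isoform_2"): 150_000.0,
    ("Green", "isoform_1"): 400_000.0,
    ("Yellow", "isoform_1"): 150_000.0,
}

gene_tpm = {}
for (gene, _isoform), tpm in isoform_tpm.items():
    gene_tpm[gene] = gene_tpm.get(gene, 0.0) + tpm
```

Because TPMs are fractions scaled by $10^6$, the gene-level values still sum to one million.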
<h2 id="handling-noise-and-multi-mapped-reads">Handling noise and multi-mapped reads</h2>
<p>So far, we have assumed an idealized scenario in which we know with certainty which gene “produced” each read. In reality, this is not the case. Sometimes, a read may align to multiple isoforms within a single gene (extremely common), or it might align to multiple genes (common enough to affect results), or it might align imperfectly to a gene and we might wonder whether the read really was produced by the gene in the first place. That is, was the mismatch in alignment due to a sequencing error, or was the read <em>not</em> produced by that gene at all (for example, the read may have been produced by a contaminant DNA fragment)?</p>
<p>Because, in the real world, we don’t know which gene produced each read, we have to infer it. State-of-the-art methods perform this inference under an assumed probabilistic generative model (<a href="https://doi.org/10.1093/bioinformatics/btp692">Li et al. 2011</a>) of the read-generating process (to be discussed in a future post).</p>
<h2 id="rpkm-versus-tpm">RPKM versus TPM</h2>
<p>In the <a href="https://doi.org/10.1038/nmeth.1226">early days of RNA-seq</a>, read counts were summarized in units of <strong>reads per kilobase per million mapped reads (RPKM)</strong>. As will be discussed in the next section, RPKMs are known to suffer from a fundamental issue.</p>
<p>Before digging into the problem with RPKM, let’s first define it. Recall, the issue with the raw read counts is that we will tend to sample more reads from longer isoforms/genes and thus, the raw counts will not reflect the relative abundance of each isoform or gene. To get around this, we might try the following normalization procedure: simply divide the fraction of reads from each gene/isoform by the length of each gene/isoform. That is,</p>
\[\frac{c_i}{N l_i} = \frac{\hat{p}_i}{l_i}\]
<p>Here we see that $\frac{c_i}{l_i}$ is the <em>number</em> of reads <em>per base</em> of the gene/isoform. That is, it is the average number of reads generated from each base along the gene/isoform. Then, if we divide this quantity by the total number of reads, $N$, we arrive at the number of reads per base of the gene/isoform <em>per read</em>.</p>
<p>This is a bit confusing; it almost seems circular that we’re computing the number of reads per base per read. Here’s another way to think about it: $\frac{c_i}{N l_i}$ is the <em>fraction</em> of the reads that were generated, on average, by each base of gene $i$. This inherently normalizes for gene length because the units are in terms of a single base of the gene!</p>
<p>Because $N$ is very large (on the order of millions) and $l_i$ is also large (on the order of thousands), the quantity $\frac{c_i}{N l_i}$ is extremely small, so we multiply it by $10^9$. The resulting units are reads per kilobase per million mapped reads of a given gene:</p>
\[\text{RPKM}_i := 10^9 \times \frac{c_i}{N l_i}\]
<p>Note that $10^9$ is the result of multiplying by one thousand bases and one million reads (hence, “reads per <strong>kilo</strong>base per <strong>million</strong> mapped reads”).</p>
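The definition above translates directly into code. A sketch on the same toy counts and lengths used throughout:

```python
import numpy as np

def rpkm(counts, lengths):
    """Reads per kilobase per million mapped reads:
    counts / (N / 1e6) / (lengths / 1e3)  ==  1e9 * counts / (N * lengths)."""
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    return 1e9 * counts / (counts.sum() * lengths)
```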
<p>With read counts normalized into units of RPKM, we can compare expression values between genes without worrying about gene length. That is, if we have two genes, $i$ and $j$, and we find that $\text{RPKM}_i > \text{RPKM}_j$, we can infer that a larger fraction of the transcripts in the sample likely originate from gene $i$ than from gene $j$.</p>
<p>Now, let’s compare RPKM to estimates of TPM. We see that RPKMs can be viewed as “unnormalized” estimates of TPMs:</p>
\[\begin{align*} \hat{\text{TPM}}_i &:= 10^6 \times \frac{\hat{p}_i}{l_i} \left(\sum_{j=1}^G \frac{\hat{p}_j}{l_j} \right)^{-1} \\ &= 10^{6} \times \frac{10^9 c_i}{N l_i} \left(\sum_{j=1}^G \frac{10^9 c_j}{N l_j}\right)^{-1} && \text{because} \ \hat{p}_i = \frac{c_i}{N} \\ &= 10^{6} \times \frac{ \text{RPKM}_i }{\sum_{j=1}^G \text{RPKM}_j} \end{align*}\]
<p>At a higher level, one can contrast RPKM with estimated TPM by viewing RPKM as a <strong>normalization of the read counts</strong>, whereas TPM is an estimate of a <strong>physical quantity</strong> (<a href="https://arxiv.org/abs/1104.3889">Pachter 2011</a>). That is, one can attempt to <em>estimate</em> TPMs from the read counts, or, one can normalize the read counts using RPKMs. In the next section we will discuss a fundamental problem with RPKMs and show that TPMs are generally preferred.</p>
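The relation above is easy to verify numerically: renormalizing RPKMs to sum to one million recovers the TPM estimates. A sketch on the toy counts:

```python
import numpy as np

counts = np.array([28.0, 28.0, 4.0])   # toy read counts
lengths = np.array([4.0, 7.0, 2.0])    # toy gene lengths

rpkm = 1e9 * counts / (counts.sum() * lengths)
tpm_from_rpkm = 1e6 * rpkm / rpkm.sum()  # RPKMs renormalized to sum to 1e6

p_hat = counts / counts.sum()
tpm_direct = 1e6 * (p_hat / lengths) / (p_hat / lengths).sum()
```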
<h2 id="problems-with-rpkm">Problems with RPKM</h2>
<p>The problem with RPKM values is that, although they do allow us to compare relative transcript abundances <em>between two genes within a single sample</em>, they do not allow us to compare relative transcript abundances of a <em>single gene between two samples</em>.</p>
<p>Let’s illustrate this with an example. In the figure below, we depict two samples with the same three genes as used previously, each with only one isoform. Again, the Blue gene is of length 4, the Green gene is of length 7, and the Yellow gene is of length 2. The two samples have the same fraction of transcripts originating from the Yellow gene, but differ in the fraction of transcripts originating from the Blue and Green genes. If we generated many reads, assuming no noise, then the RPKMs would converge to the values depicted below the pie charts:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/problem_w_RPKM.png" alt="drawing" width="500" /></center>
<p>As you can see, the RPKM values differ for the Yellow gene between the two samples even though the fraction of transcripts from the Yellow gene is the same between the two samples! This is not desirable.</p>
<p>Why is this the case? Recall that RPKMs can be viewed as un-normalized TPM estimates. As shown by <a href="https://doi.org/10.1093/bioinformatics/btp692">Li and Dewey (2011)</a>, it turns out that the normalization factor includes the <em>mean length</em> of all of the transcripts in the sample (see the Appendix to this blog post for the full derivation):</p>
\[\hat{\text{TPM}}_i = \text{RPKM}_i \left[ 10^{-3} \sum_{k=1}^G \hat{\theta}_k l_k \right]\]
<p>We see that the term \(\sum_{k=1}^G \hat{\theta}_k l_k\) is the mean length of all of the transcripts in the sample. Thus, the normalization constant required to transform each $\text{RPKM}_i$ value into the estimate $\hat{\text{TPM}}_i$ depends on the abundances of <em>all</em> transcripts in the sample, not just those of the specific gene/isoform.</p>
<p>In our toy example above, the mean length of transcripts in Sample 2 is greater than in Sample 1 because Sample 2 contains relatively more transcripts of the Green gene, which is the longer gene.</p>
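We can demonstrate this numerically. In the sketch below, the transcript counts are hypothetical (chosen so that the Yellow gene's transcript fraction, 2/13, is identical in the two samples, in the spirit of the figure above), and the expected read counts are taken to be proportional to length times transcript count:

```python
import numpy as np

lengths = np.array([4.0, 7.0, 2.0])  # Blue, Green, Yellow

def expected_rpkm_and_tpm(transcript_counts):
    """Noise-free expected RPKM and TPM given the true number of transcripts
    of each gene (expected read counts proportional to length * count)."""
    counts = lengths * np.asarray(transcript_counts, dtype=float)
    p = counts / counts.sum()
    rpkm = 1e9 * p / lengths
    tpm = 1e6 * (p / lengths) / (p / lengths).sum()
    return rpkm, tpm

rpkm1, tpm1 = expected_rpkm_and_tpm([7, 4, 2])  # Sample 1
rpkm2, tpm2 = expected_rpkm_and_tpm([4, 7, 2])  # Sample 2: Blue/Green swapped
```

The Yellow gene (index 2) gets the same TPM in both samples but a different RPKM, because swapping Blue and Green changes the mean transcript length.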
<h2 id="problems-with-tpm">Problems with TPM</h2>
<p>In the previous section we showed that estimated TPMs are preferred to RPKMs because estimated TPMs allow one to compare <em>estimated relative transcript abundances</em> between two samples. This is a nice advantage over RPKMs; however, it’s important to keep in mind that because TPMs are simply scaled fractions, they do not enable us to compare absolute expression between two samples. They’re relative expression values.</p>
<p>For example, when comparing the estimated TPMs for some gene $i$ between two samples, which we’ll call Sample 1 and Sample 2, it may be that the TPM is larger in Sample 1 even though the gene’s absolute expression (the total number of its transcripts) is lower in Sample 1. Here’s an example to illustrate:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/relative_vs_absolute_abundance.png" alt="drawing" width="500" /></center>
<p>As you can see, the absolute number of transcripts from the Blue gene is lower in Sample 1 than in Sample 2, but the <em>fraction</em> of transcripts (and thus, the TPM) of the Blue gene is higher in Sample 1 than in Sample 2.</p>
<p>If RNA-seq only enables us to compute the relative abundances of transcripts within a sample, how is one to compare expression between multiple samples? This is a challenging problem with a number of proposed solutions. One method involves injecting the sample with RNA whose abundance is known, called <a href="https://en.wikipedia.org/wiki/RNA_spike-in">spike-in RNA</a>, and using the spike-in RNA abundance as a baseline from which to estimate absolute expression. Other solutions use <a href="https://en.wikipedia.org/wiki/Housekeeping_gene">housekeeping genes</a>, whose expression is assumed to be constant between samples and thus can serve as a baseline from which to estimate absolute abundances (in a similar vein to the spike-in method). Another method, called <a href="https://doi.org/10.1186/gb-2010-11-10-r106">median-ratio normalization</a>, makes the assumption that most genes are not differentially expressed between the samples being compared and, using this assumption, proposes a procedure for normalizing counts between samples (to be discussed in a future post).</p>
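As an illustration of that last idea, here is a minimal sketch of median-of-ratios size-factor estimation in the style of Anders &amp; Huber (2010). It assumes all counts are strictly positive; in practice, genes with zero counts are excluded from the reference:

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Per-sample size factors via the median-of-ratios idea (Anders & Huber
    2010). `counts` is a genes-by-samples matrix of raw read counts; a
    sketch that assumes all counts are positive."""
    counts = np.asarray(counts, dtype=float)
    log_ref = np.mean(np.log(counts), axis=1)      # per-gene log geometric mean
    log_ratios = np.log(counts) - log_ref[:, None] # each count vs. its reference
    return np.exp(np.median(log_ratios, axis=0))   # per-sample size factor
```

Dividing each sample's counts by its size factor puts the samples on a common scale; if one sample simply has twice the sequencing depth of another, its size factor comes out twice as large.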
<h2 id="further-reading">Further reading</h2>
<ul>
<li>A similar, yet more succinct and more advanced, blog post by Harold Pimentel discussing common units of expression from RNA-seq: <a href="https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/">https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/</a></li>
<li>A more rigorous summary of the statistical methods behind RNA-seq analysis by Lior Pachter: <a href="https://arxiv.org/abs/1104.3889">https://arxiv.org/abs/1104.3889</a></li>
<li>A nice tutorial on RNA-seq normalization: <a href="https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html">https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html</a></li>
</ul>
<h2 id="appendix">Appendix</h2>
<p>Deriving the normalizing constant between RPKM and TPM:</p>
\[\begin{align*}\hat{\text{TPM}}_i &= 10^{6} \frac{ \text{RPKM}_i }{\sum_{j=1}^G \text{RPKM}_j} \\ &= \text{RPKM}_i \left[\frac{10^6 }{10^9 \sum_{j=1}^G \frac{\hat{p}_j}{l_j} } \right] \\ &= \text{RPKM}_i \left[ 10^{-3} \left(\sum_{j=1}^G \frac{\hat{p}_j}{l_j} \right)^{-1} \right] \\ &= \text{RPKM}_i \left[ 10^{-3} \left( \sum_{j=1}^G \frac{ \frac{\hat{\theta}_j l_j}{\sum_{k=1}^G \hat{\theta}_kl_k} } {l_j} \right)^{-1} \right] && \text{because} \ \hat{p}_j = \frac{\hat{\theta}_j l_j}{\sum_{k=1}^G \hat{\theta}_kl_k} \\ &= \text{RPKM}_i \left[ 10^{-3} \left(\sum_{k=1}^G \hat{\theta}_k l_k \right) \left( \sum_{j=1}^G \hat{\theta}_j \right)^{-1} \right] \\ &= \text{RPKM}_i \left[ 10^{-3} \sum_{k=1}^G \hat{\theta}_k l_k \right] && \text{because} \ \sum_j \hat{\theta}_j = 1 \end{align*}\]

<h1 id="intrinsic-dimensionality">Intrinsic dimensionality</h1>

<p>2020-12-29 · <a href="https://mbernste.github.io/posts/intrinsic_dimensionality">https://mbernste.github.io/posts/intrinsic_dimensionality</a></p>

<p><em>In my formal education, I found that the concept of “intrinsic dimensionality” was never explicitly taught; however, it undergirds so many concepts in linear algebra and the data sciences such as the rank of a matrix and feature selection. In this post I will discuss the difference between the extrinsic dimensionality of a space versus its intrinsic dimensionality.</em></p>
<h2 id="introduction">Introduction</h2>
<p>An important concept in linear algebra and the data sciences is the idea of <strong>intrinsic dimensionality</strong>. I found that in my formal education this concept was never explicitly taught; however, it undergirds so many concepts in linear algebra and data analysis. In this post I will discuss the difference between the <strong>extrinsic dimensionality</strong> of a space versus its <strong>intrinsic dimensionality</strong>. These general ideas provide a nice framework for understanding such diverse concepts as the <a href="https://en.wikipedia.org/wiki/Rank_(linear_algebra)">rank of a matrix</a> in linear algebra as well as <a href="https://en.wikipedia.org/wiki/Dimensionality_reduction">dimension reduction</a> and <a href="https://en.wikipedia.org/wiki/Feature_extraction">feature selection</a> in machine learning.</p>
<h2 id="what-exactly-is-a-space">What exactly is a “space”?</h2>
<p>Before jumping into the dimensionality of spaces, let’s first address a very basic question: what is a space? According to <a href="https://en.wikipedia.org/wiki/Space_(mathematics)">Wikipedia</a>, a space is a <a href="https://en.wikipedia.org/wiki/Set_(mathematics)">set</a> of objects with some “structure.” This definition is extremely general, perhaps even trivial, but I think it deserves emphasizing that a space is, first and foremost, a <em>set</em>. When we refer to “three-dimensional space”, we are in essence describing the set of all <em>points</em> in three dimensions. As I’ve said, this may seem trivial at first, but I think it will be an important idea as we move on to describe the concept of “intrinsic dimensionality.”</p>
<h2 id="what-is-a-dimension">What is a dimension?</h2>
<p>Let’s move on to another basic question: what does it mean for a space to be three-dimensional versus two-dimensional? More generally, what does it mean for a space to be $D$-dimensional? A basic answer to this question is that a $D$-dimensional space is a space in which one uses $D$ pieces of information (i.e., characteristics), called <strong>dimensions</strong>, to describe each object in that space. For example, in three-dimensional Euclidean space (3D space), we need three pieces of information to describe each point: its value along the x-axis, its value along the y-axis, and its value along the z-axis:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/3D_space.png" alt="drawing" width="400" /></center>
<h2 id="intrinsic-vs-extrinsic-dimensionality">Intrinsic vs. extrinsic dimensionality</h2>
<p>Now that we have some basic ideas down – namely, “space” and “dimensions” – let’s move on to the core of this blog post: intrinsic dimensionality. Before we move on, let me spoil the ending: the <strong>intrinsic dimensionality</strong> of a space is the number of <em>required</em> pieces of information that we need to describe each object in the space, which may differ from the number of pieces of information that we <em>are</em> using, which we call the <strong>extrinsic dimensionality</strong> of the space.</p>
<p>Let’s make this concrete with an example. Let’s say we’re in some situation where we are dealing with objects in an extrinsically $D$-dimensional space. Thus, we are dealing with $D$ pieces of information for describing objects in that space. One may ask a simple question: do we <em>really</em> need $D$ pieces of information to describe each object? Or can we get by with fewer?</p>
<p>Take the following example: we want to describe all points on a flat piece of paper in 3D space. That is, all of the points that we care about will lie <em>only</em> on the sheet of paper.</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/paper_in_3D.png" alt="drawing" width="500" /></center>
<p>Of course, we can use each point’s x, y, and z coordinates, but do we really need three pieces of information to describe each point on the paper? The answer is no, we really only need two! Intuitively, we can specify each point on the paper using two coordinates: its distance from the left edge of the paper and its distance from the top edge of the paper:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/paper_in_3D_coordinates.png" alt="drawing" width="400" /></center>
<p>Of course, representing each point on the paper using these new coordinates requires keeping track of the position of the left edge and top edge in 3D space; however, once we have this information handy <em>every</em> point on the paper can be represented using only these two coordinates.</p>
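To make this concrete, here is a small numerical sketch (the particular placement of the paper and all names below are illustrative assumptions, not from the post): fix an origin at one corner of the paper plus unit vectors along its top and left edges; then every point on the paper is reachable from just two coordinates, and those two coordinates can be recovered with dot products.

```python
import numpy as np

# A hypothetical paper placed in 3D: an origin (a corner of the paper) and
# two orthonormal direction vectors along the top and left edges.
origin = np.array([1.0, 2.0, 0.5])
top_edge = np.array([1.0, 0.0, 0.0])   # unit vector along the top edge
left_edge = np.array([0.0, 0.6, 0.8])  # unit vector along the left edge

def to_3d(u, v):
    """Map a point's two paper coordinates (u, v) to its 3D coordinates."""
    return origin + u * top_edge + v * left_edge

def to_paper(p):
    """Recover (u, v) from a 3D point known to lie on the paper."""
    d = p - origin
    return float(d @ top_edge), float(d @ left_edge)

p = to_3d(3.0, 2.0)   # a point 3 units along the top edge, 2 down the left
u, v = to_paper(p)    # recovers the two intrinsic coordinates
```

Even though `p` carries three extrinsic coordinates, the pair `(u, v)` suffices once the paper's geometry is fixed.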
<p>The <strong>intrinsic dimensionality</strong> of a space is the number of <em>required</em> pieces of information for representing each object. In the piece of paper example, only two coordinates are needed to describe each point on the paper, and thus, it can be said that the “space” of the paper (i.e., the set of all points that lie on the paper) is <em>intrinsically</em> only two-dimensional rather than three-dimensional. Notably, the intrinsic dimensionality of a space may differ from its extrinsic dimensionality. That is, even though we may be representing each point on the paper using its original three coordinates in 3D space, we could instead use only two.</p>
<p>When will the intrinsic dimensionality of a space be smaller than its extrinsic dimensionality? Intuitively, this will happen when the space that we care about can be formed by taking a <em>subset</em> of the full extrinsic space. In the piece of paper example above, we only care about the <em>subset of points</em> in 3D space that lie on the piece of paper. By taking a subset, we are in essence coming up with a new, smaller space than the full 3D space. Such spaces are called <strong>subspaces</strong>.</p>
<h2 id="another-example">Another example</h2>
<p>Here’s another example of a piece of paper embedded in 3D space, but this time, the paper is rolled up in a “Swiss roll” shape:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/swiss_roll.png" alt="drawing" width="400" /></center>
<p>Again, one may say that the set of points that lie on the Swiss roll forms an intrinsically two-dimensional space rather than a three-dimensional one. The reason is that we can represent each point using two pieces of information: its distance along the width of the roll and its distance around the roll:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/swiss_roll_coordinates.png" alt="drawing" width="400" /></center>
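To sketch this intuition in code (the parameterization below is one hypothetical choice for illustration; the post does not fix a particular one), we can generate every 3D point on a roll from just two numbers: a position `t` around the spiral and a position `w` along the roll's width.

```python
import numpy as np

def swiss_roll(t, w):
    """Hypothetical Swiss-roll parameterization: t sweeps around the
    spiral, w runs along the roll's width."""
    return np.array([t * np.cos(t), w, t * np.sin(t)])

# 500 points in 3D, all generated from only 2 intrinsic coordinates each.
points = np.array([swiss_roll(t, w)
                   for t in np.linspace(1.0, 4 * np.pi, 50)
                   for w in np.linspace(0.0, 2.0, 10)])
```

Each row of `points` has three extrinsic coordinates, yet the set is intrinsically two-dimensional: knowing `(t, w)` pins down the point exactly.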
<p>This is very high level and we haven’t described a rigorous system of coordinates for actually describing each point on the roll, but intuitively one can see that if one fixed the geometry of the roll, one could easily describe each point on the roll with only two dimensions.</p>

Matrix multiplication
2020-12-26T00:00:00-08:00
https://mbernste.github.io/posts/matrix_multiplication

<p><em>At first glance, the definition for the product of two matrices can be unintuitive. In this post, we discuss three perspectives for viewing matrix multiplication. It is the third perspective that gives this “unintuitive” definition its power: that matrix multiplication represents the composition of linear transformations.</em></p>
<h2 id="introduction">Introduction</h2>
<p>Matrix multiplication is an operation between two <a href="https://mbernste.github.io/posts/matrices/">matrices</a> that creates a new matrix. For students introduced to matrix multiplication, it can be puzzling to learn that matrix multiplication does <strong>not</strong> simply entail computing the product of each pair of corresponding entries between the two matrices. That is,</p>
\[\begin{bmatrix} a_{1,1} & a_{1,2} \\ a_{2,1} & a_{2,2}\end{bmatrix}\begin{bmatrix}b_{1,1} & b_{1,2} \\ b_{2,1} & b_{2,2}\end{bmatrix} \boldsymbol{\neq} \begin{bmatrix}a_{1,1}b_{1,1} & a_{1,2}b_{1,2} \\ a_{2,1}b_{2,1} & a_{2,2}b_{2,2}\end{bmatrix}\]
<p>Rather, matrix multiplication is defined in what can look to be a much more complicated way:</p>
<p><span style="color:#0060C6"><strong>Definition 1 (Matrix multiplication):</strong> The product of an $m \times n$ matrix $\boldsymbol{A}$ <strong>matrix multiplied</strong> by an $n \times p$ matrix $\boldsymbol{B}$ is given by</span></p>
<center><span style="color:#0060C6">$$\boldsymbol{AB} := \begin{bmatrix} \boldsymbol{A}\boldsymbol{b}_{*,1} & \boldsymbol{A}\boldsymbol{b}_{*,2} & \dots & \boldsymbol{A}\boldsymbol{b}_{*,p} \end{bmatrix}$$</span></center>
<p><span style="color:#0060C6">where $\boldsymbol{b}_{*,i}$ is the $i$th column of $\boldsymbol{B}$.</span></p>
<p>That is, given two matrices $\boldsymbol{A}$ and $\boldsymbol{B}$, each column of the product matrix $\boldsymbol{AB}$ is formed by performing <a href="https://mbernste.github.io/posts/matrix_vector_mult/">matrix-vector multiplication</a> between $\boldsymbol{A}$ and each column of $\boldsymbol{B}$. Note that this definition requires that the number of columns of the first matrix be equal to the number of rows of the second matrix. If this doesn’t hold, then matrix multiplication is not defined.</p>
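As a quick sanity check of Definition 1 (a NumPy sketch, not part of the original post; the matrices are arbitrary), we can build $\boldsymbol{AB}$ column by column via matrix-vector multiplication and compare it against a library matrix product:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))  # an m x n matrix
B = rng.standard_normal((4, 2))  # an n x p matrix

# Definition 1: the ith column of AB is the matrix-vector product of A
# with the ith column of B.
AB = np.column_stack([A @ B[:, i] for i in range(B.shape[1])])
```

Here `AB` agrees entry for entry with NumPy's built-in product `A @ B`, and has shape $m \times p$.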
<p>This complicated definition raises the question: what is the intuition behind matrix multiplication? And why isn’t it defined as simply the pairwise products between corresponding elements of two matrices, as described above? In this post, we’ll look at three ways of viewing matrix multiplication, and hopefully it will become evident that this more complicated definition is much more powerful than a naive definition that computes pairwise products. Its power comes from the third perspective that we will discuss: that matrix multiplication represents the <a href="https://en.wikipedia.org/wiki/Function_composition">composition</a> of <a href="https://mbernste.github.io/posts/matrices_linear_transformations/">linear transformations</a>.</p>
<h2 id="three-perspectives-for-understanding-matrix-multiplication">Three perspectives for understanding matrix multiplication</h2>
<p>There are at least three perspectives from which one can view matrix multiplication, each depending on the perspective taken on each of the matrix factors. Recall that we can view a matrix <a href="https://mbernste.github.io/posts/matrices/">via a number of perspectives</a>:</p>
<ol>
<li>As a list of column vectors</li>
<li>As a list of row vectors</li>
<li>As a <a href="https://mbernste.github.io/posts/matrices_linear_transformations/">linear transformation</a></li>
</ol>
<p>Given these various ways of possibly viewing each matrix factor, $\boldsymbol{A}$ and $\boldsymbol{B}$, we can view their product, $\boldsymbol{AB}$, as follows:</p>
<ol>
<li><strong>Matrix multiplication computes a linear transformation on a set of vectors:</strong> If we view $\boldsymbol{A}$ as a linear transformation and $\boldsymbol{B}$ as a list of column vectors, the columns of the product matrix $\boldsymbol{AB}$ are the results of transforming each column of $\boldsymbol{B}$ under $\boldsymbol{A}$.</li>
<li><strong>Matrix multiplication computes the dot products for pairs of vectors:</strong> This perspective follows from viewing $\boldsymbol{A}$ as an ordered list of row-vectors and viewing $\boldsymbol{B}$ as an ordered list of column-vectors. The product matrix $\boldsymbol{AB}$ then stores all of the pair-wise dot products between the rows of $\boldsymbol{A}$ and columns of $\boldsymbol{B}$.</li>
<li><strong>Matrix multiplication computes the composition of two linear transformations:</strong> If we view both $\boldsymbol{A}$ <em>and</em> $\boldsymbol{B}$ as linear transformations, then the product matrix is a linear transformation formed by taking the <a href="https://en.wikipedia.org/wiki/Function_composition">composition</a> of linear transformations defined by $\boldsymbol{A}$ and $\boldsymbol{B}$.</li>
</ol>
<p>It is this third perspective that is the most abstract and, arguably, the most powerful. Let’s dig into each of these perspectives.</p>
<p><strong>Matrix multiplication computes a linear transformation on a set of vectors</strong></p>
<p>This perspective follows most directly from our definition of matrix multiplication: if we view $\boldsymbol{A}$ as a linear transformation and we view the matrix $\boldsymbol{B}$ as an ordered list of column vectors</p>
\[\boldsymbol{B} := \begin{bmatrix} \boldsymbol{b}_{*,1} & \boldsymbol{b}_{*,2} & \dots & \boldsymbol{b}_{*,p} \end{bmatrix}\]
<p>then each column of $\boldsymbol{AB}$ is computed by taking the linear transformation characterized by $\boldsymbol{A}$ of each $\boldsymbol{b}_{*,i}$:</p>
\[\boldsymbol{AB} := \begin{bmatrix} \boldsymbol{A}\boldsymbol{b}_{*,1} & \boldsymbol{A}\boldsymbol{b}_{*,2} & \dots & \boldsymbol{A}\boldsymbol{b}_{*,p} \end{bmatrix}\]
<p><strong>Matrix multiplication computes dot products for pairs of vectors</strong></p>
<p>If we view the matrix $\boldsymbol{A}$ as a list of row-vectors and the matrix $\boldsymbol{B}$ as a list of column vectors, then the product $\boldsymbol{AB}$ is the matrix that stores all of the pair-wise dot products of the vectors in $\boldsymbol{A}$ and $\boldsymbol{B}$. More specifically, the $i,j$th element of $\boldsymbol{AB}$ is the dot product between the $i$th row of $\boldsymbol{A}$ and the $j$th column of $\boldsymbol{B}$ (see Theorem 1 in the Appendix to this post). This fact, often called the <strong>row-column rule</strong>, can be used for computing each element of $\boldsymbol{AB}$. This is illustrated below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/row_column_rule.png" alt="drawing" width="350" /></center>
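The row-column rule can likewise be checked numerically (an illustrative sketch; the matrices and names are mine, not from the post): fill in each entry of the product as a single dot product, and compare against the full matrix product.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))  # m x n
B = rng.standard_normal((4, 2))  # n x p

# Row-column rule: the (i, j)th entry of AB is the dot product of the
# ith row of A with the jth column of B.
AB = np.array([[A[i, :] @ B[:, j] for j in range(B.shape[1])]
               for i in range(A.shape[0])])
```

Every entry built this way matches the corresponding entry of `A @ B`.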
<p><strong>Matrix multiplication computes the composition of two linear transformations:</strong></p>
<p>The third and final perspective for viewing matrix multiplication requires that we view <em>both</em> $\boldsymbol{A}$ <em>and</em> $\boldsymbol{B}$ as linear transformations. Then, it turns out that the matrix $\boldsymbol{AB}$ is the matrix that characterizes the linear transformation formed by the composition of the linear transformations characterized by $\boldsymbol{A}$ and $\boldsymbol{B}$ (See Theorem 2 in the Appendix to this post). That is, given two linear transformations</p>
\[\begin{align*}f(\boldsymbol{x}) &:= \boldsymbol{Ax} \\ g(\boldsymbol{x}) &:= \boldsymbol{Bx}\end{align*}\]
<p>the matrix $\boldsymbol{AB}$ is the matrix that characterizes the composition $f \circ g$:</p>
\[f \circ g(\boldsymbol{x}) := f(g(\boldsymbol{x})) = \boldsymbol{A}(\boldsymbol{Bx})\]
<p>This is proven in Theorem 2 in the Appendix to this post and is illustrated below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/matrix_composition.png" alt="drawing" width="500" /></center>
<p>Recall that a matrix’s number of rows determines the dimension of the vectors in its range, and its number of columns determines the dimension of the vectors in its domain. Since $\boldsymbol{AB}$ characterizes the composition $f \circ g$, it follows that the matrix $\boldsymbol{AB}$ maps from the domain of $\boldsymbol{B}$ to the range of $\boldsymbol{A}$.</p>
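A small numerical check of this composition property (a sketch assuming nothing beyond the definitions above; the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 3))  # characterizes f: R^3 -> R^2
B = rng.standard_normal((3, 4))  # characterizes g: R^4 -> R^3
x = rng.standard_normal(4)

# AB characterizes f ∘ g: applying AB to x gives the same result as
# first applying B, then applying A.
composed = (A @ B) @ x
sequential = A @ (B @ x)
```

Note also that `A @ B` has shape `(2, 4)`: it maps from the domain of $\boldsymbol{B}$ ($\mathbb{R}^4$) to the range of $\boldsymbol{A}$ ($\mathbb{R}^2$).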
<h2 id="properties-of-matrix-multiplication">Properties of matrix multiplication</h2>
<p>Here are several properties of matrix multiplication that can be used in calculations and derivations involving matrices. For the sake of succinctness, we push all of the proofs to the Appendix of the blog post:</p>
<ol>
<li><strong>Associative property for matrices</strong> (Theorem 3): $\boldsymbol{A}(\boldsymbol{BC}) = (\boldsymbol{AB})\boldsymbol{C}$</li>
<li><strong>Commutative property of scalars</strong> (Theorem 4): $r(\boldsymbol{AB}) = (r\boldsymbol{A})\boldsymbol{B} = \boldsymbol{A}(r\boldsymbol{B})$ where $r$ is a scalar.</li>
<li><strong>Left distributive property</strong> (Theorem 5): $\boldsymbol{A}(\boldsymbol{B} + \boldsymbol{C}) = \boldsymbol{AB} + \boldsymbol{AC}$</li>
<li><strong>Right distributive property</strong> (Theorem 6): $(\boldsymbol{B} + \boldsymbol{C})\boldsymbol{A} = \boldsymbol{BA} + \boldsymbol{CA}$</li>
<li><strong>Identity</strong> (Theorem 7): $\boldsymbol{I}_m\boldsymbol{A} = \boldsymbol{A} = \boldsymbol{AI}_n$</li>
</ol>
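Each of these properties can be spot-checked numerically (an illustrative sketch with random matrices; it demonstrates, rather than proves, the theorems):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))
D = rng.standard_normal((3, 4))  # same shape as B, for the distributive laws
r = 2.5

assert np.allclose(A @ (B @ C), (A @ B) @ C)    # associativity
assert np.allclose(r * (A @ B), (r * A) @ B)    # scalars commute
assert np.allclose(r * (A @ B), A @ (r * B))
assert np.allclose(A @ (B + D), A @ B + A @ D)  # left distributive
assert np.allclose((B + D) @ C, B @ C + D @ C)  # right distributive
assert np.allclose(np.eye(2) @ A, A)            # identity (left and right)
assert np.allclose(A @ np.eye(3), A)
```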
<h2 id="appendix">Appendix</h2>
<p><span style="color:#0060C6"><strong>Theorem 1 (Row-column rule):</strong> Given an $m \times n$ matrix $\boldsymbol{A}$ and an $n \times p$ matrix $\boldsymbol{B}$, the $i,j$th element of $\boldsymbol{AB}$ is computed by \(\boldsymbol{a}_{i,*} \cdot \boldsymbol{b}_{*,j}\), where \(\boldsymbol{a}_{i,*}\) is the $i$th row of $\boldsymbol{A}$ and \(\boldsymbol{b}_{*,j}\) is the $j$th column of $\boldsymbol{B}$.</span></p>
<p><strong>Proof:</strong></p>
\[\begin{align*}
\boldsymbol{AB} &= \begin{bmatrix} \boldsymbol{A}\boldsymbol{b}_{*,1} & \boldsymbol{A}\boldsymbol{b}_{*,2} & \dots & \boldsymbol{A}\boldsymbol{b}_{*,p} \end{bmatrix} \\ &= \begin{bmatrix} \boldsymbol{a}_{*,1}b_{1,1} + \dots + \boldsymbol{a}_{*,n}b_{n,1} & \dots & \boldsymbol{a}_{*,1}b_{1,p} + \dots + \boldsymbol{a}_{*,n}b_{n,p} \end{bmatrix} \\ &= \begin{bmatrix} a_{1,1}b_{1,1} + \dots + a_{1,n} b_{n,1} & \dots & a_{1,1}b_{1,p} + \dots + a_{1,n}b_{n,p} \\
a_{2,1}b_{1,1} + \dots + a_{2,n} b_{n,1} & \dots & a_{2,1}b_{1,p} + \dots + a_{2,n}b_{n,p} \\
\vdots & \ddots & \vdots \\
a_{m,1}b_{1,1} + \dots + a_{m,n} b_{n,1} & \dots & a_{m,1}b_{1,p} + \dots + a_{m,n}b_{n,p}
\end{bmatrix} \\ &= \begin{bmatrix} \boldsymbol{a}_{1,*} \cdot \boldsymbol{b}_{*,1} & \dots & \boldsymbol{a}_{1,*} \cdot \boldsymbol{b}_{*,p} \\ \boldsymbol{a}_{2,*} \cdot \boldsymbol{b}_{*,1} & \dots & \boldsymbol{a}_{2,*} \cdot \boldsymbol{b}_{*,p} \\ \vdots & \ddots & \vdots \\ \boldsymbol{a}_{m,*} \cdot \boldsymbol{b}_{*,1} & \dots & \boldsymbol{a}_{m,*} \cdot \boldsymbol{b}_{*,p} \end{bmatrix} \end{align*}\]
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 2:</strong> Matrix multiplication between an $m \times n$ matrix $\boldsymbol{A}$ and an $n \times p$ matrix $\boldsymbol{B}$ results in a matrix $\boldsymbol{AB}$ such that given a vector $\boldsymbol{x} \in \mathbb{R}^p$, the following holds:</span></p>
<center><span style="color:#0060C6">$$(\boldsymbol{AB})\boldsymbol{x} = \boldsymbol{A}(\boldsymbol{Bx})$$</span></center>
<p><strong>Proof:</strong> First, let’s expand $\boldsymbol{Bx}$:</p>
\[\begin{align*} \boldsymbol{Bx} &= \boldsymbol{b}_{*,1}x_1 + \boldsymbol{b}_{*,2}x_2 + \dots + \boldsymbol{b}_{*,p}x_p \\ &= \begin{bmatrix} b_{1,1}x_1 + b_{1,2}x_2 + \dots + b_{1,p}x_p \\ b_{2,1}x_1 + b_{2,2}x_2 + \dots + b_{2,p}x_p \\ \vdots \\ b_{n,1}x_1 + b_{n,2}x_2 + \dots + b_{n,p}x_p \end{bmatrix} \end{align*}\]
<p>Now,</p>
\[\begin{align*}(\boldsymbol{AB})\boldsymbol{x} &= \begin{bmatrix} \boldsymbol{A}\boldsymbol{b}_{*,1} & \boldsymbol{A}\boldsymbol{b}_{*,2} & \dots & \boldsymbol{A}\boldsymbol{b}_{*,p} \end{bmatrix} \boldsymbol{x} \\ &= \boldsymbol{A}\boldsymbol{b}_{*,1}x_1 + \boldsymbol{A}\boldsymbol{b}_{*,2}x_2 + \dots + \boldsymbol{A}\boldsymbol{b}_{*,p}x_p \\ &= \left(\boldsymbol{a}_{*,1}b_{1,1} + \dots + \boldsymbol{a}_{*,n} b_{n,1}\right)x_1 + \left(\boldsymbol{a}_{*,1}b_{1,2} + \dots + \boldsymbol{a}_{*,n} b_{n,2}\right)x_2 + \dots + \left(\boldsymbol{a}_{*,1}b_{1,p} + \dots + \boldsymbol{a}_{*,n} b_{n,p}\right)x_p \\ &= \left(\boldsymbol{a}_{*,1}b_{1,1}x_1 + \dots + \boldsymbol{a}_{*,n}b_{n,1}x_1\right) + \left(\boldsymbol{a}_{*,1}b_{1,2}x_2 + \dots + \boldsymbol{a}_{*,n} b_{n,2}x_2\right) + \dots + \left(\boldsymbol{a}_{*,1}b_{1,p}x_p + \dots + \boldsymbol{a}_{*,n} b_{n,p}x_p\right) \\ &= \boldsymbol{a}_{*,1}(b_{1,1}x_1 + \dots + b_{1,p}x_p) + \boldsymbol{a}_{*,2}(b_{2,1}x_1 + \dots + b_{2,p}x_p) + \dots + \boldsymbol{a}_{*,n}(b_{n,1}x_1 + \dots + b_{n,p}x_p) \\ &= \boldsymbol{a}_{*,1}(\boldsymbol{Bx})_1 + \boldsymbol{a}_{*,2}(\boldsymbol{Bx})_2 + \dots + \boldsymbol{a}_{*,n}(\boldsymbol{Bx})_n \\ &= \boldsymbol{A}(\boldsymbol{Bx}) \end{align*}\]
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 3 (Associative property for matrices):</strong> Given matrices $\boldsymbol{A} \in \mathbb{R}^{m \times l}$, $\boldsymbol{B} \in \mathbb{R}^{l \times p}$, and $\boldsymbol{C} \in \mathbb{R}^{p \times n}$ it holds that $\boldsymbol{A}(\boldsymbol{BC}) = (\boldsymbol{AB})\boldsymbol{C}$.</span></p>
<p><strong>Proof:</strong> Since matrix multiplication can be understood as a composition of functions, and since composition of functions is associative, it follows that matrix multiplication is associative. $\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 4 (Commutative property of scalars):</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, a matrix $\boldsymbol{B} \in \mathbb{R}^{n \times p}$, and scalar $r$, it holds that $r(\boldsymbol{AB}) = (r\boldsymbol{A})\boldsymbol{B} = \boldsymbol{A}(r\boldsymbol{B})$</span></p>
<p><strong>Proof:</strong> First we prove $r(\boldsymbol{AB}) = (r\boldsymbol{A})\boldsymbol{B}$:</p>
\[\begin{align*}r(\boldsymbol{AB}) &= r\begin{bmatrix} \boldsymbol{A}\boldsymbol{b}_{*,1} & \dots & \boldsymbol{A}\boldsymbol{b}_{*,p} \end{bmatrix} \\ &= \begin{bmatrix} r\boldsymbol{A}\boldsymbol{b}_{*,1} & \dots & r\boldsymbol{A}\boldsymbol{b}_{*,p} \end{bmatrix} \\ &= (r\boldsymbol{A})\boldsymbol{B} \end{align*}\]
<p>Next, we prove $r(\boldsymbol{AB}) = \boldsymbol{A}(r\boldsymbol{B})$:</p>
\[\begin{align*}r(\boldsymbol{AB}) &= r\begin{bmatrix} \boldsymbol{A}\boldsymbol{b}_{*,1} & \dots & \boldsymbol{A}\boldsymbol{b}_{*,p} \end{bmatrix} \\ &= \begin{bmatrix} r\boldsymbol{A}\boldsymbol{b}_{*,1} & \dots & r\boldsymbol{A}\boldsymbol{b}_{*,p} \end{bmatrix} \\ &= \begin{bmatrix} \boldsymbol{A}(r\boldsymbol{b}_{*,1}) & \dots & \boldsymbol{A}(r\boldsymbol{b}_{*,p}) \end{bmatrix} && \text{linearity of matrix-vector multiplication} \\ &= \boldsymbol{A}(r\boldsymbol{B}) \end{align*}\]
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 5 (Left distributive property of matrix multiplication):</strong> Given matrices $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, $\boldsymbol{B} \in \mathbb{R}^{n \times p}$, and $\boldsymbol{C} \in \mathbb{R}^{n \times p}$, the following holds: $\boldsymbol{A}(\boldsymbol{B} + \boldsymbol{C}) = \boldsymbol{AB} + \boldsymbol{AC}$</span></p>
<p><strong>Proof:</strong></p>
\[\begin{align*} \boldsymbol{A}(\boldsymbol{B} + \boldsymbol{C}) &= \begin{bmatrix}\boldsymbol{A}(\boldsymbol{b}_{*,1} + \boldsymbol{c}_{*,1}) & \dots & \boldsymbol{A}(\boldsymbol{b}_{*,p} + \boldsymbol{c}_{*,p}) \end{bmatrix} && \text{definition of matrix multiplication} \\ &= \begin{bmatrix}(\boldsymbol{A}\boldsymbol{b}_{*,1} + \boldsymbol{A}\boldsymbol{c}_{*,1}) & \dots & (\boldsymbol{A}\boldsymbol{b}_{*,p} + \boldsymbol{A}\boldsymbol{c}_{*,p}) \end{bmatrix} && \text{linearity of matrix-vector multiplication} \\ &= \begin{bmatrix}\boldsymbol{A}\boldsymbol{b}_{*,1} & \dots & \boldsymbol{A}\boldsymbol{b}_{*,p} \end{bmatrix} + \begin{bmatrix} \boldsymbol{A}\boldsymbol{c}_{*,1} & \dots & \boldsymbol{A}\boldsymbol{c}_{*,p} \end{bmatrix} && \text{definition of matrix-addition} \\ &= \boldsymbol{AB} + \boldsymbol{AC} && \text{definition of matrix multiplication}\end{align*}\]
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 6 (Right distributive property of matrix multiplication):</strong> Given matrices $\boldsymbol{A} \in \mathbb{R}^{n \times p}$, $\boldsymbol{B} \in \mathbb{R}^{m \times n}$, and $\boldsymbol{C} \in \mathbb{R}^{m \times n}$, the following holds: $(\boldsymbol{B} + \boldsymbol{C})\boldsymbol{A} = \boldsymbol{BA} + \boldsymbol{CA}$</span></p>
<p><strong>Proof:</strong></p>
\[\begin{align*}(\boldsymbol{B} + \boldsymbol{C})\boldsymbol{A} &= \begin{bmatrix}(\boldsymbol{B} + \boldsymbol{C})\boldsymbol{a}_{*,1} & \dots & (\boldsymbol{B} + \boldsymbol{C})\boldsymbol{a}_{*,p} \end{bmatrix} && \text{definition of matrix-matrix multiplication} \\ &= \begin{bmatrix}(\boldsymbol{B}\boldsymbol{a}_{*,1} + \boldsymbol{C}\boldsymbol{a}_{*,1}) & \dots & (\boldsymbol{B}\boldsymbol{a}_{*,p} + \boldsymbol{C}\boldsymbol{a}_{*,p}) \end{bmatrix} && \text{definition of matrix addition} \\ &= \begin{bmatrix}\boldsymbol{B}\boldsymbol{a}_{*,1} & \dots & \boldsymbol{B}\boldsymbol{a}_{*,p} \end{bmatrix} + \begin{bmatrix} \boldsymbol{C}\boldsymbol{a}_{*,1} & \dots & \boldsymbol{C}\boldsymbol{a}_{*,p} \end{bmatrix} \\ &= \boldsymbol{BA} + \boldsymbol{CA}\end{align*}\]
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 7 (Identity):</strong> Given an $m \times n$ matrix $\boldsymbol{A}$, the following holds: $\boldsymbol{I}_m\boldsymbol{A} = \boldsymbol{A} = \boldsymbol{AI}_n$</span></p>
<p><strong>Proof:</strong> By the fact that an identity function simply maps each element in its domain back to itself, it follows that the composition of a function $f$ and the identity function is simply the function $f$. The <a href="https://mbernste.github.io/posts/matrices_as_functions/">identity matrix defines the identity function on vectors</a>. Furthermore, matrix multiplication represents a composition of linear transformations. Thus, it follows that any matrix multiplied on the left or right by the identity matrix returns the original matrix (i.e., the function itself). Therefore, $\boldsymbol{I}_m\boldsymbol{A} = \boldsymbol{A} = \boldsymbol{AI}_n$. $\square$</p>

Matrices characterize linear transformations
2020-12-21T00:00:00-08:00
https://mbernste.github.io/posts/matrices_linear_transformations

<p><em>Linear transformations are functions mapping vectors between two vector spaces that preserve vector addition and scalar multiplication. In this post, we show that there exists a one-to-one correspondence between linear transformations between coordinate vector spaces and matrices. Thus, we can view a matrix as representing a unique linear transformation between coordinate vector spaces.</em></p>
<h2 id="introduction">Introduction</h2>
<p>As previously discussed, <a href="https://mbernste.github.io/posts/matrix_vector_mult/">matrix-vector multiplication</a> enables us to view <a href="https://mbernste.github.io/posts/matrices_as_functions/">matrices as functions</a> between vector spaces. It turns out that matrices define a very specific type of function: <strong>linear transformations</strong>. In fact, <em>any</em> linear transformation between coordinate vector spaces can be characterized by a <em>single, unique matrix</em>. That is, there is a one-to-one mapping between linear transformations and matrices. Thus, in some sense, we can say that a matrix <em>is</em> a linear transformation.</p>
<h2 id="linear-transformations">Linear transformations</h2>
<p>A linear transformation is a function that maps vectors from one vector space to vectors in another vector space such that it preserves scalar multiplication and vector addition. Specifically, a linear transformation is defined as follows:</p>
<p><span style="color:#0060C6"><strong>Definition 1 (Linear transformation):</strong> Given vector spaces $(\mathcal{V}, \mathcal{F})$ and $(\mathcal{U}, \mathcal{F})$, a function $T : \mathcal{V} \rightarrow \mathcal{U}$ is a <strong>linear transformation</strong>, or is simply called <strong>linear</strong>, if for all $\boldsymbol{u}, \boldsymbol{v} \in \mathcal{V}$ and all scalars $c \in \mathcal{F}$,</span></p>
<center><span style="color:#0060C6">$$T(\boldsymbol{u} + \boldsymbol{v}) = T(\boldsymbol{u}) + T(\boldsymbol{v})$$</span></center>
<p><span style="color:#0060C6">and</span></p>
<center><span style="color:#0060C6">$$T(c\boldsymbol{u}) = cT(\boldsymbol{u})$$</span></center>
<p>The figure below illustrates a linear transformation $T$ applied to three vectors (red, blue, and purple). The vectors on the left are in $T$’s domain and the vectors on the right are in $T$’s range. The dotted lines connect each vector in the domain to the vector in the range that it is mapped to by $T$. Notice that $T(\text{red}) + T(\text{blue}) = T(\text{red} + \text{blue})$.</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/linear_transform.png" alt="drawing" width="500" /></center>
<h2 id="matrices-perform-linear-transformations">Matrices perform linear transformations</h2>
<p>Matrix-vector multiplication has the following two properties:</p>
<ol>
<li>$\boldsymbol{A}(\boldsymbol{v} + \boldsymbol{u}) = \boldsymbol{Av} + \boldsymbol{Au}$</li>
<li>$\boldsymbol{A}(c\boldsymbol{v}) = c\boldsymbol{Av}$</li>
</ol>
<p>Thus, if we hold some matrix $\boldsymbol{A}$ as fixed and define a function $T(\boldsymbol{x}) := \boldsymbol{Ax}$, then it follows that</p>
<ol>
<li>$T(\boldsymbol{v} + \boldsymbol{u}) = T(\boldsymbol{v}) + T(\boldsymbol{u})$</li>
<li>$T(c\boldsymbol{v}) = cT(\boldsymbol{v})$</li>
</ol>
<p>That is, $T$ is a linear transformation. These facts are proven in Theorem 1 in the Appendix to this post.</p>
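These two properties are easy to verify numerically for any particular matrix (a sketch with an arbitrary random matrix; the names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3))

def T(x):
    """The function defined by holding A fixed: T(x) := Ax."""
    return A @ x

u = rng.standard_normal(3)
v = rng.standard_normal(3)
c = -1.7

# T preserves vector addition and scalar multiplication, i.e., T is linear.
assert np.allclose(T(u + v), T(u) + T(v))
assert np.allclose(T(c * u), c * T(u))
```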
<h2 id="every-linear-transformation-is-characterized-by-a-matrix">Every linear transformation is characterized by a matrix</h2>
<p>Perhaps more interestingly, it turns out that <em>every</em> linear transformation between finite-dimensional vector spaces is defined by a unique matrix that performs the transformation. That is, if I have two vector spaces, say $\mathbb{R}^m$ and $\mathbb{R}^n$, and I have a linear transformation $T$ mapping vectors between them, then there exists a single unique matrix $\boldsymbol{A}_T$ that performs $T$’s mapping. That is, $T(\boldsymbol{x}) = \boldsymbol{A}_T\boldsymbol{x}$. This matrix is called the <strong>standard matrix</strong> of $T$.</p>
<p>Because every matrix performs a linear transformation <em>and</em> every linear transformation is characterized by a matrix, it follows that there is a one-to-one mapping between linear transformations and matrices. Thus, in some sense, we can say that a matrix <em>is</em> a linear transformation.</p>
<h2 id="computing-the-standard-matrix-of-a-linear-transformation">Computing the standard matrix of a linear transformation</h2>
<p>A natural question is: given a linear transformation $T$, how do we compute its standard matrix? In fact, it’s quite simple. $\boldsymbol{A}_T$ is computed as:</p>
\[\boldsymbol{A}_T = \begin{bmatrix}T(\boldsymbol{e}_1) & T(\boldsymbol{e}_2) & \dots & T(\boldsymbol{e}_m) \end{bmatrix}\]
<p>where \(\boldsymbol{e}_i\) is the $i$th basis vector of $\mathbb{R}^m$ (i.e., the vector with every element equal to zero except for the $i$th element, which is equal to one).</p>
<p>What this says is that in order to form the standard matrix for a linear transformation $T$, you simply compute the vectors that result from transforming the basis vectors under $T$. The resultant vectors form the columns of $T$’s standard matrix! This is depicted in the figure below where we visualize the basis vectors for $\mathbb{R}^2$ and the two column vectors of the standard matrix for some linear transformation $T: \mathbb{R}^2 \rightarrow \mathbb{R}^2$:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/columns_standard_matrix.png" alt="drawing" width="400" /></center>
<p>This is proven in Theorem 2 in the Appendix to this post.</p>
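Here is a small sketch of this recipe (the rotation below is a hypothetical example transformation chosen for illustration, not one from the post): apply $T$ to each basis vector and stack the results as columns to obtain the standard matrix.

```python
import numpy as np

def T(x):
    """A hypothetical linear transformation T: R^2 -> R^2 given only as a
    Python function (rotation by 90 degrees counterclockwise)."""
    return np.array([-x[1], x[0]])

# Standard matrix: its columns are T applied to the basis vectors e_1, e_2.
A_T = np.column_stack([T(e) for e in np.eye(2)])

x = np.array([3.0, 4.0])
```

Multiplying by `A_T` now reproduces `T`'s mapping for any input: `A_T @ x` equals `T(x)`.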
<h2 id="appendix">Appendix</h2>
<p><span style="color:#0060C6"><strong>Theorem 1 (Matrices define linear transformations):</strong> The function $T(\boldsymbol{x}) := \boldsymbol{Ax}$ is a linear transformation.</span></p>
<p><strong>Proof:</strong> We show that for all $\boldsymbol{u}, \boldsymbol{v}$ in the domain of $T$ and for all scalars $c$, the following conditions hold:</p>
<ol>
<li>$\boldsymbol{A}(\boldsymbol{u} + \boldsymbol{v}) = \boldsymbol{A}\boldsymbol{u} + \boldsymbol{A}\boldsymbol{v}$</li>
<li>$\boldsymbol{A}(c\boldsymbol{u}) = c(\boldsymbol{A}\boldsymbol{u})$</li>
</ol>
<p>Proof of 1:</p>
\[\begin{align*}\boldsymbol{A}(\boldsymbol{u} + \boldsymbol{v}) &= \boldsymbol{a}_{*,1}(u_1 + v_1) + \boldsymbol{a}_{*,2}(u_2 + v_2) + \dots + \boldsymbol{a}_{*,n}(u_n + v_n) \\ &= \boldsymbol{a}_{*,1}u_1 + \boldsymbol{a}_{*,1}v_1 + \boldsymbol{a}_{*,2}u_2 + \boldsymbol{a}_{*,2}v_2 + \dots + \boldsymbol{a}_{*,n}u_n + \boldsymbol{a}_{*,n}v_n \\ &= (\boldsymbol{a}_{*,1}u_1 + \boldsymbol{a}_{*,2}u_2 + \dots + \boldsymbol{a}_{*,n}u_n) + (\boldsymbol{a}_{*,1}v_1 + \boldsymbol{a}_{*,2}v_2 + \dots + \boldsymbol{a}_{*,n}v_n) \\ &= \boldsymbol{Au} + \boldsymbol{Av}\end{align*}\]
<p>Proof of 2:</p>
\[\begin{align*}\boldsymbol{A}(c\boldsymbol{u}) &= \boldsymbol{a}_{*,1}(cu_1) + \boldsymbol{a}_{*,2}(cu_2) + \dots + \boldsymbol{a}_{*,n}(cu_n) \\ &= c\boldsymbol{a}_{*,1}(u_1) + c\boldsymbol{a}_{*,2}(u_2) + \dots + c\boldsymbol{a}_{*,n}(u_n) \\ &= c(\boldsymbol{a}_{*,1}u_1 + \boldsymbol{a}_{*,2}u_2 + \dots + \boldsymbol{a}_{*,n}u_n) \\ &= c(\boldsymbol{Au})\end{align*}\]
<p>Now, we see that</p>
\[\begin{align*}T(\boldsymbol{u} + \boldsymbol{v}) &= \boldsymbol{A}(\boldsymbol{u} + \boldsymbol{v}) \\ &= \boldsymbol{Au} + \boldsymbol{Av} \\ &= T(\boldsymbol{u}) + T(\boldsymbol{v})\end{align*}\]
<p>and</p>
\[\begin{align*}T(c\boldsymbol{u}) &= \boldsymbol{A}(c\boldsymbol{u}) \\ &= c\boldsymbol{Au} \\ &= cT(\boldsymbol{u})\end{align*}\]
<p>$\square$</p>
<p><span style="color:#0060C6"><strong>Theorem 2 (Standard matrix of a linear transformation):</strong> Given a linear transformation $T: \mathbb{R}^m \rightarrow \mathbb{R}^n$, it holds that </span></p>
<center><span style="color:#0060C6">$$T(\boldsymbol{x}) = \boldsymbol{A}_T\boldsymbol{x}$$</span></center>
<p><span style="color:#0060C6">where $\boldsymbol{A}_T$ is defined as</span></p>
<center><span style="color:#0060C6">$$A_T := \begin{bmatrix}T(\boldsymbol{e}_1) & T(\boldsymbol{e}_2) & \dots & T(\boldsymbol{e}_m) \end{bmatrix}$$</span></center>
<p><span style="color:#0060C6">where $\boldsymbol{e}_i$ is the $i$th basis vector of $\mathbb{R}^m$.</span></p>
<p><strong>Proof:</strong></p>
\[\begin{align*} \boldsymbol{x} &= \boldsymbol{Ix} \\ &= \boldsymbol{e}_1x_1 + \boldsymbol{e}_2x_2 + \dots + \boldsymbol{e}_mx_m \\ \implies T(\boldsymbol{x}) &= T(\boldsymbol{e}_1x_1 + \boldsymbol{e}_2x_2 + \dots + \boldsymbol{e}_mx_m) \\ &= T(\boldsymbol{e}_1x_1) + T( \boldsymbol{e}_2x_2) + \dots + T(\boldsymbol{e}_mx_m) && \text{$T$ is linear} \\ &= T(\boldsymbol{e}_1)x_1 + T( \boldsymbol{e}_2)x_2 + \dots + T(\boldsymbol{e}_m)x_m && \text{$T$ is linear} \\ &= \begin{bmatrix} T(\boldsymbol{e}_1) & T(\boldsymbol{e}_2) & \dots & T(\boldsymbol{e}_m) \end{bmatrix} \boldsymbol{x} \\ &= \boldsymbol{A}_T\boldsymbol{x} \end{align*}\]
<p>$\square$</p>Matthew N. BernsteinLinear transformations are functions mapping vectors between two vector spaces that preserve vector addition and scalar multiplication. In this post, we show that there exists a one-to-one correspondence between linear transformations between coordinate vector spaces and matrices. Thus, we can view a matrix as representing a unique linear transformation between coordinate vector spaces.Matrices as functions2020-12-20T00:00:00-08:002020-12-20T00:00:00-08:00https://mbernste.github.io/posts/matrices_as_functions<p><em>At the core of linear algebra is the idea that matrices represent functions. In this post, we’ll look at a few common, elementary functions and discuss their corresponding matrices.</em></p>
<h2 id="introduction">Introduction</h2>
<p>Recall that the definition of <a href="https://mbernste.github.io/posts/matrix_vector_mult/">matrix-vector multiplication</a> enables us to treat matrices as functions in the following sense: if we hold a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$ as fixed, this matrix maps vectors in $\mathbb{R}^n$ to vectors in $\mathbb{R}^m$. That is, we can define a function $T : \mathbb{R}^n \rightarrow \mathbb{R}^m$ as:</p>
\[T(\boldsymbol{x}) := \boldsymbol{A}\boldsymbol{x}\]
<p>In this post, we’ll look at a few common, elementary functions and discuss their corresponding matrices.</p>
<h2 id="the-identity-matrix-defines-the-identity-function">The identity matrix defines the identity function</h2>
<p>Recall an identity function $f$ for a set $S$ is the function $f(x) := x$ for all $x \in S$. In the context of a function $T$ over a vector space $\mathbb{R}^n$, the identity function $T(\boldsymbol{x}) := \boldsymbol{x}$ for all $\boldsymbol{x} \in \mathbb{R}^n$ is defined using the <strong>identity matrix</strong> for $\mathbb{R}^n$. The identity matrix for $\mathbb{R}^n$, denoted $\boldsymbol{I}_n$ (or simply $\boldsymbol{I}$ if the dimensionality is implied by the context), is a square matrix of all zeros except for ones along the diagonal:</p>
<p><span style="color:#0060C6"><strong>Definition 1 (Identity matrix):</strong> Each real-valued, Euclidean vector space $\mathbb{R}^n$ is associated with an <strong>identity matrix</strong>, denoted $\boldsymbol{I}_{n \times n}$ (or simply $\boldsymbol{I}$ if the dimensionality is implied by the context), which is a square matrix consisting of zeros in the off-diagonal entries and ones along the diagonal.</span></p>
<p>For example, the identity matrix for $\mathbb{R}^3$ is defined as</p>
\[\boldsymbol{I}_3 := \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\]
<p>It can easily be shown that applying matrix-vector multiplication using an identity matrix $\boldsymbol{I}_n$ with any vector $\boldsymbol{x} \in \mathbb{R}^n$ will result in the same vector $\boldsymbol{x}$. Thus, a function $T(\boldsymbol{x}) := \boldsymbol{I}\boldsymbol{x}$ is the identity function with domain $\mathbb{R}^n$. This is depicted schematically below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/identity_matrix_as_function.png" alt="drawing" width="500" /></center>
<h2 id="the-zero-matrix-defines-the-zero-function">The zero matrix defines the zero function</h2>
<p>Recall that a zero function $f$ for a set $S$ is the function $f(x) := 0$ for all $x \in S$. In the context of a function $T$ over a vector space $\mathbb{R}^n$, the zero function $T(\boldsymbol{x}) := \boldsymbol{0}$ for all $\boldsymbol{x} \in \mathbb{R}^n$ is defined using the <strong>zero matrix</strong> for $\mathbb{R}^n$. The zero matrix for $\mathbb{R}^n$, denoted $\boldsymbol{0}_n$, is a square matrix of all zeros:</p>
<p><span style="color:#0060C6"><strong>Definition 2 (Zero matrix):</strong> Each real-valued, Euclidean vector space $\mathbb{R}^n$ is associated with a <strong>zero matrix</strong>, denoted $\boldsymbol{0}_{n \times n}$, which is a square matrix consisting of all zeros.</span></p>
<p>For example, the zero matrix for $\mathbb{R}^3$ is defined as</p>
\[\boldsymbol{0}_3 := \begin{bmatrix} 0 & 0 & 0\\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}\]
<p>It can easily be shown that applying matrix-vector multiplication using a zero matrix $\boldsymbol{0}_n$ with any vector $\boldsymbol{x} \in \mathbb{R}^n$ will result in the zero vector $\boldsymbol{0}$. Thus, a function $T(\boldsymbol{x}) := \boldsymbol{0}_n\boldsymbol{x}$ is the zero function for domain $\mathbb{R}^n$. This is depicted schematically below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/zero_matrix_as_function.png" alt="drawing" width="500" /></center>
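<p>Both facts — that $\boldsymbol{I}_n\boldsymbol{x} = \boldsymbol{x}$ and $\boldsymbol{0}_n\boldsymbol{x} = \boldsymbol{0}$ — can be spot-checked with a short pure-Python sketch (a numerical check on one example vector, not a proof):</p>

```python
# Spot-check that the identity matrix acts as the identity function and
# the zero matrix acts as the zero function (plain Python, no libraries).

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def zero(n):
    return [[0.0] * n for _ in range(n)]

x = [4.0, -2.0, 7.0]
assert matvec(identity(3), x) == x            # I_3 x = x
assert matvec(zero(3), x) == [0.0, 0.0, 0.0]  # 0_3 x = 0
```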
<h2 id="an-inverse-matrix-defines-an-inverse-function">An inverse matrix defines an inverse function</h2>
<p>Let’s say we have some function $T(\boldsymbol{x}) := \boldsymbol{Ax}$, and suppose there exists a matrix $\boldsymbol{C}$ such that the function $F(\boldsymbol{x}) := \boldsymbol{Cx}$ is the inverse function of $T$. That is, \(F = T^{-1}\) where</p>
\[T^{-1}(T(\boldsymbol{x})) = \boldsymbol{x}\]
<p>If this is the case, then we call $\boldsymbol{C}$ the <strong>inverse matrix</strong> of $\boldsymbol{A}$. Usually, we will denote this inverse matrix $\boldsymbol{C}$ as $\boldsymbol{A}^{-1}$. This is depicted in the figure below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/matrix_inverse.png" alt="drawing" width="500" /></center>
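<p>A minimal sketch of this round trip for the $2 \times 2$ case, using the standard closed-form inverse of a $2 \times 2$ matrix (the specific matrix below is a hypothetical example):</p>

```python
# For a 2x2 matrix A = [[a, b], [c, d]] with det = ad - bc != 0, the
# inverse is (1/det) [[d, -b], [-c, a]]. We verify A^{-1}(Ax) = x.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def inverse_2x2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible")
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[2.0, 1.0], [1.0, 1.0]]
A_inv = inverse_2x2(A)              # [[1.0, -1.0], [-1.0, 2.0]]
x = [3.0, 5.0]
assert matvec(A_inv, matvec(A, x)) == x  # applying A then A^{-1} recovers x
```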
<p>Not every matrix has an inverse. Matrices that have inverses are a special category of matrix called <strong>invertible matrices</strong> and have special properties, which we will discuss in a future post.</p>Matthew N. BernsteinAt the core of linear algebra is the idea that matrices represent functions. In this post, we’ll look at a few common, elementary functions and discuss their corresponding matrices.Matrix-vector multiplication2020-12-19T00:00:00-08:002020-12-19T00:00:00-08:00https://mbernste.github.io/posts/matrix_vector_multiplication<p><em>Matrix-vector multiplication is an operation between a matrix and a vector that produces a new vector. In this post, I’ll define matrix-vector multiplication as well as three angles from which to view this concept. The third angle entails viewing matrices as functions between vector spaces.</em></p>
<h2 id="introduction">Introduction</h2>
<p>In a <a href="https://mbernste.github.io/posts/matrices/">previous post</a>, we discussed three ways one can view a matrix: as a table of values, as a list of vectors, and finally, as a function. It is the third way of viewing matrices that really gives them their power. Here, we’ll introduce an operation between matrices and vectors, called <strong>matrix-vector multiplication</strong>, which will enable us to use matrices as functions.</p>
<p>Matrix-vector multiplication is an operation between a matrix and a vector that produces a new vector. Notably, matrix-vector multiplication is only defined between a matrix and a vector where the length of the vector equals the number of columns of the matrix. It is defined as follows:</p>
<p><span style="color:#0060C6"><strong>Definition 1 (Matrix-vector multiplication):</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$ and vector $\boldsymbol{x} \in \mathbb{R}^n$ the <strong>matrix-vector multiplication</strong> of $\boldsymbol{A}$ and $\boldsymbol{x}$ is defined as </span></p>
<center><span style="color:#0060C6">$$\boldsymbol{A}\boldsymbol{x} := x_1\boldsymbol{a}_{*,1} + x_2\boldsymbol{a}_{*,2} + \dots + x_n\boldsymbol{a}_{*,n}$$ </span></center>
<p><span style="color:#0060C6">where $\boldsymbol{a}_{*,i}$ is the $i$th column vector of $\boldsymbol{A}$.</span></p>
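<p>A direct sketch of Definition 1 in plain Python (representing a matrix as a list of its rows; the $3 \times 2$ matrix below is an arbitrary example):</p>

```python
# Compute Ax as the linear combination x_1 a_{*,1} + ... + x_n a_{*,n}
# of A's columns, exactly as in Definition 1.

def matvec_columns(A, x):
    m, n = len(A), len(A[0])
    result = [0.0] * m
    for j in range(n):                    # for each column a_{*,j} ...
        for i in range(m):
            result[i] += x[j] * A[i][j]   # ... accumulate x_j * a_{*,j}
    return result

A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]   # a 3x2 matrix
x = [2.0, 1.0]
assert matvec_columns(A, x) == [4.0, 10.0, 16.0]  # 2*[1,3,5] + 1*[2,4,6]
```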
<p>Like most mathematical concepts, matrix-vector multiplication can be <a href="">viewed from multiple angles</a>, at various levels of abstraction. These views come in handy when we attempt to conceptualize the various ways in which we utilize matrix-vector multiplication to model real-world problems. Below are three ways that I find useful for conceptualizing matrix-vector multiplication ordered from least to most abstract:</p>
<ol>
<li><strong>As a “row-wise”, vector-generating process:</strong> Matrix-vector multiplication defines a process for creating a new vector from an existing vector, where each element of the new vector is “generated” by taking a weighted sum of a row of the matrix using the elements of the vector as the coefficients</li>
<li><strong>As taking a linear combination of the columns of a matrix:</strong> Matrix-vector multiplication is the process of taking a linear combination of the columns of a matrix using the elements of a vector as the coefficients</li>
<li><strong>As evaluating a function between vector spaces:</strong> Matrix-vector multiplication allows a matrix to define a mapping between two vector spaces.</li>
</ol>
<p>I find all three of the perspectives useful. The first two perspectives provide a way of understanding the <em>mechanism</em> of matrix-vector multiplication whereas the third perspective provides the <em>essence</em> of matrix-vector multiplication. It is this third perspective of matrix-vector multiplication that enables us to view matrices as functions, as we discussed in the <a href="https://mbernste.github.io/posts/matrices/">previous post</a>.</p>
<h2 id="matrix-vector-multiplication-as-a-row-wise-vector-generating-process">Matrix-vector multiplication as a “row-wise”, vector-generating process</h2>
<p>A useful way to view the mechanism of matrix-vector multiplication between a matrix $\boldsymbol{A}$ and a vector $\boldsymbol{x}$ is to see it as a sort of “process” (or even as a computer program) that constructs each element of the output vector iteratively, one row of $\boldsymbol{A}$ at a time. Specifically, for each row $i$ of $\boldsymbol{A}$, we compute the dot product between $\boldsymbol{x}$ and the $i$th row of the matrix, thereby producing the $i$th element of the output vector (see Theorem 1 in the Appendix to this post):</p>
\[\boldsymbol{Ax} = \begin{bmatrix} \boldsymbol{a}_{1,*} \cdot \boldsymbol{x} \\ \boldsymbol{a}_{2,*} \cdot \boldsymbol{x} \\ \vdots \\ \boldsymbol{a}_{m,*} \cdot \boldsymbol{x} \\ \end{bmatrix}\]
<p>where $\boldsymbol{a}_{i,*}$ is the $i$th row-vector in $\boldsymbol{A}$. This process is illustrated schematically in Panel A of the figure below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/matrix_vec_mult_as_process.png" alt="drawing" width="700" /></center>
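<p>This row-wise view translates almost line-for-line into code. A minimal sketch (matrix stored as a list of rows; the matrix and vector are arbitrary examples):</p>

```python
# Row-wise matrix-vector multiplication: the i-th output element is the
# dot product of the i-th row of A with x.

def matvec_rows(A, x):
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]   # a 3x2 matrix
x = [2.0, 1.0]
assert matvec_rows(A, x) == [4.0, 10.0, 16.0]  # [1*2+2*1, 3*2+4*1, 5*2+6*1]
```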
<h2 id="matrix-vector-multiplication-as-taking-a-linear-combination-of-the-columns-of-a-matrix">Matrix-vector multiplication as taking a linear combination of the columns of a matrix</h2>
<p>Matrix-vector multiplication between a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$ and vector $\boldsymbol{x} \in \mathbb{R}^n$ can be understood as taking a linear combination of the column vectors of $\boldsymbol{A}$ using the elements of $\boldsymbol{x}$ as the coefficients. This is illustrated schematically below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/matrix_vec_mult_as_lin_comb.png" alt="drawing" width="700" /></center>
<p>We can also view this geometrically:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/matrix_vec_mult_as_linear_comb_geom.png" alt="drawing" width="720" /></center>
<p>In Panel A, we depict two column vectors of some matrix $\boldsymbol{A} \in \mathbb{R}^{3 \times 2}$. In Panel B, we take a linear combination of the two column vectors of \(\boldsymbol{A}\) according to the elements of some vector $\boldsymbol{x}$ thus producing the vector $\boldsymbol{Ax}$ (black vector) as shown in Panel C.</p>
<h2 id="matrix-vector-multiplication-as-evaluating-a-function-between-vector-spaces">Matrix-vector multiplication as evaluating a function between vector spaces</h2>
<p>If we hold a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$ as fixed, this matrix maps vectors in $\mathbb{R}^n$ to vectors in $\mathbb{R}^m$. Making this more explicit, we can define a function $T : \mathbb{R}^n \rightarrow \mathbb{R}^m$ as</p>
\[T(\boldsymbol{x}) := \boldsymbol{Ax}\]
<p>where $T$ uses the matrix $\boldsymbol{A}$ to perform the mapping.</p>
<p>This is illustrated schematically below:</p>
<center><img src="https://raw.githubusercontent.com/mbernste/mbernste.github.io/master/images/matrix_as_function.png" alt="drawing" width="600" /></center>
<p>In fact, as we show in a later post, such a matrix-defined function is a <a href="https://en.wikipedia.org/wiki/Linear_map">linear function</a>. Even more importantly, <em>any</em> linear function between finite-dimensional vector spaces is uniquely defined by some matrix.</p>
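<p>Linearity here means $T(\boldsymbol{u} + \boldsymbol{v}) = T(\boldsymbol{u}) + T(\boldsymbol{v})$ and $T(c\boldsymbol{u}) = cT(\boldsymbol{u})$. A short numerical spot-check of both properties (one example matrix and a pair of example vectors — a sanity check, not a proof):</p>

```python
# Spot-check that T(x) := Ax preserves vector addition and scalar
# multiplication for one example matrix A and example vectors u, v.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

A = [[1.0, 2.0], [0.0, 1.0]]
T = lambda x: matvec(A, x)

add = lambda a, b: [ai + bi for ai, bi in zip(a, b)]
scale = lambda c, a: [c * ai for ai in a]

u, v, c = [1.0, 2.0], [3.0, -1.0], 4.0
assert T(add(u, v)) == add(T(u), T(v))   # additivity
assert T(scale(c, u)) == scale(c, T(u))  # homogeneity
```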
<h2 id="appendix">Appendix</h2>
<p><span style="color:#0060C6"><strong>Theorem 1:</strong> Given a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$ and vector $\boldsymbol{x} \in \mathbb{R}^n$, it holds that</span></p>
<center><span style="color:#0060C6">$$\boldsymbol{Ax} = \begin{bmatrix}\boldsymbol{a}_{1,*} \cdot \boldsymbol{x} \\ \boldsymbol{a}_{2,*} \cdot \boldsymbol{x} \\ \vdots \\ \boldsymbol{a}_{m,*} \cdot \boldsymbol{x} \\ \end{bmatrix}$$</span></center>
<p><span style="color:#0060C6">where $\boldsymbol{a}_{i,*}$ is the $i$th row-vector in $\boldsymbol{A}$.</span></p>
<p><strong>Proof:</strong></p>
\[\begin{align*} \boldsymbol{A}\boldsymbol{x} &:= x_1\boldsymbol{a}_{*,1} + x_2\boldsymbol{a}_{*,2} + \dots + x_n\boldsymbol{a}_{*,n} \\ &= x_1 \begin{bmatrix}a_{1,1} \\ a_{2,1} \\ \vdots \\ a_{m, 1} \end{bmatrix} + x_2\begin{bmatrix}a_{1,2} \\ a_{2,2} \\ \vdots \\ a_{m, 2} \end{bmatrix} + \dots + x_n\begin{bmatrix}a_{1,n} \\ a_{2,n} \\ \vdots \\ a_{m, n} \end{bmatrix} \\ &= \begin{bmatrix}x_1a_{1,1} \\ x_1a_{2,1} \\ \vdots \\ x_1a_{m, 1} \end{bmatrix} + \begin{bmatrix}x_2a_{1,2} \\ x_2a_{2,2} \\ \vdots \\ x_2a_{m, 2} \end{bmatrix} + \dots + \begin{bmatrix}x_na_{1,n} \\ x_na_{2,n} \\ \vdots \\ x_na_{m, n} \end{bmatrix} \\ &= \begin{bmatrix} x_1a_{1,1} + x_2a_{1,2} + \dots + x_na_{1,n} \\ x_1a_{2,1} + x_2a_{2,2} + \dots + x_na_{2,n} \\ \vdots \\ x_1a_{m,1} + x_2a_{m,2} + \dots + x_na_{m,n} \\ \end{bmatrix} \\ &= \begin{bmatrix} \sum_{i=1}^n a_{1,i}x_i \\ \sum_{i=1}^n a_{2,i}x_i \\ \vdots \\ \sum_{i=1}^n a_{m,i}x_i \\ \end{bmatrix} \\ &= \begin{bmatrix}\boldsymbol{a}_{1,*} \cdot \boldsymbol{x} \\ \boldsymbol{a}_{2,*} \cdot \boldsymbol{x} \\ \vdots \\ \boldsymbol{a}_{m,*} \cdot \boldsymbol{x} \\ \end{bmatrix}\end{align*}\]
<p>$\square$</p>Matthew N. BernsteinMatrix-vector multiplication is an operation between a matrix and a vector that produces a new vector. In this post, I’ll define matrix-vector multiplication as well as three angles from which to view this concept. The third angle entails viewing matrices as functions between vector spaces.