Posts by Tags

Gaussian mixture models

17 minute read

Published: November 24, 2020

Gaussian mixture models are a very popular method for data clustering. Here I will define the Gaussian mixture model and also derive the EM algorithm for performing maximum likelihood estimation of its paramters.

The graph Laplacian

12 minute read

Published: November 11, 2020

At the heart of of a number of important machine learning algorithms, such as spectral clustering, lies a matrix called the graph Laplacian. In this post, I’ll walk through the intuition behind the graph Laplacian and describe how it represents the discrete analogue to the Laplacian operator on continuous multivariate functions.

Demystifying measure-theoretic probability theory (part 3: expectation)

10 minute read

Published: May 11, 2020

In this series of posts, I present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

Median-ratio normalization for bulk RNA-seq data

11 minute read

Published: November 24, 2023

In a previous post, we discussed how RNA-seq provides measurements of relative expression between genes rather than measurements of absolute expression. In this post, we will discuss median-ratio normalization: a procedure that attempts to scale each sample’s read counts so that differences in the read counts between samples better reflects differences in absolute expression. We will start by describing the underlying assumption that must be met for median-ratio normalization to work and then walk through the details of the algorithm.

Three strategies for cataloging cell types

7 minute read

Published: March 04, 2021

In my previous post, I outlined a conceptual framework for defining and reasoning about “cell types”. Specifically, I noted that the idea of a “cell type” can be viewed as a human-made partition on the universal cellular state space. In this post, I attempt to distill three strategies for partitioning this state space and agreeing on cell type definitions.

On cell types and cell states

10 minute read

Published: March 03, 2021

The advent of single-cell genomics has brought about new efforts to characterize and catalog all of the cell types in the human body. Despite these efforts, the very definition of a “cell type” is under debate. In this post, I will discuss a conceptual framework for defining cell types as subsets of states in an underlying cellular state space. Moreover, I will link the cellular state space to biomedical ontologies that attempt to capture biological knowledge regarding cell types.

RNA-seq: the basics

19 minute read

Published: January 07, 2021

RNA sequencing (RNA-seq) has become a ubiquitous tool in biomedical research for measuring gene expression in a population of cells, or a single cell, across the genome. Despite its ubiquity, RNA-seq is relatively complex and there exists a large research effort towards developing statistical and computational methods for analyzing the raw data that it produces. In this post, I will provide a high level overview of RNA-seq and describe how to interpret some of the common units in which gene expression is measured from an RNA-seq experiment.

Understanding attention

33 minute read

Published: December 21, 2025

Attention is a type of layer in a neural network that is widely regarded to be one of the most important breakthroughs that enabled the development of modern AI systems and large language models. At its heart, attention is a mechanism for explicitly drawing relationships between items in a set. In natural language processing, the set being processed are words (or tokens) and attention enables the model to relate those words to one another even when those words lie far away from eachother in the body of text. In this blog post, we will step through the attention mechanism both mathematically and intuitively. We then present a minimal example of a neural network that uses attention to perform binary classification in a task that is not solveable using a naïve bag-of-words model.

Median-ratio normalization for bulk RNA-seq data

11 minute read

Published: November 24, 2023

In a previous post, we discussed how RNA-seq provides measurements of relative expression between genes rather than measurements of absolute expression. In this post, we will discuss median-ratio normalization: a procedure that attempts to scale each sample’s read counts so that differences in the read counts between samples better reflects differences in absolute expression. We will start by describing the underlying assumption that must be met for median-ratio normalization to work and then walk through the details of the algorithm.

Three strategies for cataloging cell types

7 minute read

Published: March 04, 2021

In my previous post, I outlined a conceptual framework for defining and reasoning about “cell types”. Specifically, I noted that the idea of a “cell type” can be viewed as a human-made partition on the universal cellular state space. In this post, I attempt to distill three strategies for partitioning this state space and agreeing on cell type definitions.

On cell types and cell states

10 minute read

Published: March 03, 2021

The advent of single-cell genomics has brought about new efforts to characterize and catalog all of the cell types in the human body. Despite these efforts, the very definition of a “cell type” is under debate. In this post, I will discuss a conceptual framework for defining cell types as subsets of states in an underlying cellular state space. Moreover, I will link the cellular state space to biomedical ontologies that attempt to capture biological knowledge regarding cell types.

RNA-seq: the basics

19 minute read

Published: January 07, 2021

RNA sequencing (RNA-seq) has become a ubiquitous tool in biomedical research for measuring gene expression in a population of cells, or a single cell, across the genome. Despite its ubiquity, RNA-seq is relatively complex and there exists a large research effort towards developing statistical and computational methods for analyzing the raw data that it produces. In this post, I will provide a high level overview of RNA-seq and describe how to interpret some of the common units in which gene expression is measured from an RNA-seq experiment.

Intuiting biology (Part 1: Order and chaos in the crowded cell)

7 minute read

Published: November 24, 2024

Cells are crowded spaces packed with biomolecules colliding and interacting with one another. Despite this chaotic environment, biologists routinely describe intracellular functions using the clean mathematical language of networks. In this post I will attempt to reconcile these two seemingly contradictory perspectives of the cell. This post will serve as a first part in a series of blog posts I hope to write where I will collect and connect some of the works that have helped me better “intuit” biology as a person coming to biology from the field of computer science.

Notes on The Art of War by Sun Tzu

30 minute read

Published: November 16, 2024

I am currently reading Sun Tzu’s Art of War and am finding much wisdom in it. I have been taking notes during my reading and I thought I’d share them in this post. Here I cover Books 1 and 2.

Functionals and functional derivatives

13 minute read

Published: April 10, 2022

The calculus of variations is a field of mathematics that deals with the optimization of functions of functions, called functionals. This topic was not taught to me in my computer science education, but it lies at the foundation of a number of important concepts and algorithms in the data sciences such as gradient boosting and variational inference. In this post, I will provide an explanation of the functional derivative and show how it relates to the gradient of an ordinary multivariate function.

Three strategies for cataloging cell types

7 minute read

Published: March 04, 2021

In my previous post, I outlined a conceptual framework for defining and reasoning about “cell types”. Specifically, I noted that the idea of a “cell type” can be viewed as a human-made partition on the universal cellular state space. In this post, I attempt to distill three strategies for partitioning this state space and agreeing on cell type definitions.

On cell types and cell states

10 minute read

Published: March 03, 2021

The advent of single-cell genomics has brought about new efforts to characterize and catalog all of the cell types in the human body. Despite these efforts, the very definition of a “cell type” is under debate. In this post, I will discuss a conceptual framework for defining cell types as subsets of states in an underlying cellular state space. Moreover, I will link the cellular state space to biomedical ontologies that attempt to capture biological knowledge regarding cell types.

Gaussian mixture models

17 minute read

Published: November 24, 2020

Gaussian mixture models are a very popular method for data clustering. Here I will define the Gaussian mixture model and also derive the EM algorithm for performing maximum likelihood estimation of its paramters.

Visualizing covariance

1 minute read

Published: May 21, 2020

Covariance quantifies to what extent two random variables are linearly correlated. In this post, I will outline a visualization of covariance that helped me better intuit this concept.

Assessing the utility of data visualizations based on dimensionality reduction

24 minute read

Published: March 02, 2024

We human beings use our vision as our chief sense for understanding the world, and thus when we are confronted with data, we try to understand that data through visualization. Dimensionality reduction methods, such as PCA, t-SNE, and UMAP, are approaches designed to enable the visualization of high-dimensional data. Unfortunately, because these methods inevitably distort aspects of the data, these methods are receiving new scrutiny. In this post, I propose that dimensionality reduction requires a “probabilistic” framework of interpretation rather than a “deterministic” one wherein conclusions one draws from a dimensionality reduction plot have some probability of not actually being true of the data. I will propose that this does not mean these plots are not useful. Rather, to evaluate their utility, I will argue that empirical user studies of these methods will shed light on whether these methods provide more benefit or more harm in practice.

Understanding attention

33 minute read

Published: December 21, 2025

Attention is a type of layer in a neural network that is widely regarded to be one of the most important breakthroughs that enabled the development of modern AI systems and large language models. At its heart, attention is a mechanism for explicitly drawing relationships between items in a set. In natural language processing, the set being processed are words (or tokens) and attention enables the model to relate those words to one another even when those words lie far away from eachother in the body of text. In this blog post, we will step through the attention mechanism both mathematically and intuitively. We then present a minimal example of a neural network that uses attention to perform binary classification in a task that is not solveable using a naïve bag-of-words model.

Denoising diffusion probabilistic models (Part 2: Theoretical justification)

11 minute read

Published: October 20, 2024

In Part 1 of this series, we introduced the denoising diffusion probabilistic model for modeling and sampling from complex distributions. We described the diffusion model as a model that can generate new samples by learning how to reverse a diffusion process. In this post, we provide more theoretical justification for the objective function used to fit diffusion models and make connections between the diffusion model and other concepts in statistical inference and probabilistic modeling.

Denoising diffusion probabilistic models (Part 1: Definition and derivation)

57 minute read

Published: June 28, 2024

Diffusion models are a family of state-of-the-art probabilistic generative models that have achieved ground breaking results in a number of fields ranging from image generation to protein structure design. In Part 1 of this two-part series, I will walk through the denoising diffusion probabilistic model (DDPM) as presented by Ho, Jain, and Abbeel (2020). Specifically, we will walk through the model definition, the derivation of the objective function, and the training and sampling algorithms. We will conclude by walking through an implementation of a simple diffusion model in PyTorch and apply it to the MNIST dataset of hand-written digits.

Graph convolutional neural networks

25 minute read

Published: September 24, 2023

Graphs are ubiqitous mathematical objects that describe a set of relationships between entities; however, they are challenging to model with traditional machine learning methods, which require that the input be represented as vectors. In this post, we will discuss graph convolutional networks (GCNs): a class of neural network designed to operate on graphs. We will discuss the intution behind the GCN and how it is similar and different to the convolutional neural network (CNN) used in computer vision. We will conclude by presenting a case-study training a GCN to classify molecule toxicity.

Variational autoencoders

33 minute read

Published: March 14, 2023

Variational autoencoders (VAEs) are a family of deep generative models with use cases that span many applications, from image processing to bioinformatics. There are two complimentary ways of viewing the VAE: as a probabilistic model that is fit using variational Bayesian inference, or as a type of autoencoding neural network. In this post, we present the mathematical theory behind VAEs, which is rooted in Bayesian inference, and how this theory leads to an emergent autoencoding algorithm. We also discuss the similarities and differences between VAEs and standard autoencoders. Lastly, we present an implementation of a VAE in PyTorch and apply it to the task of modeling the MNIST dataset of hand-written digits.

True understanding is “seeing” in 3D

3 minute read

Published: November 01, 2020

In this post, I will discuss an analogy that I find useful for thinking about what it means to “understand” something: True understanding of a concept is akin to “seeing” the concept in its native three-dimensional space, whereas partial understanding is merely seeing a two-dimensional projection of that inherently three-dimensional concept.

Shannon’s Source Coding Theorem (Foundations of information theory: Part 3)

13 minute read

Published: October 19, 2020

The mathematical field of information theory attempts to mathematically describe the concept of “information”. In the first two posts, we discussed the concepts of self-information and information entropy. In this post, we step through Shannon’s Source Coding Theorem to see how the information entropy of a probability distribution describes the best-achievable efficiency required to communicate samples from the distribution.

Information entropy (Foundations of information theory: Part 2)

8 minute read

Published: August 07, 2020

The mathematical field of information theory attempts to mathematically describe the concept of “information”. In this series of posts, I will attempt to describe my understanding of how, both philosophically and mathematically, information theory defines the polymorphic, and often amorphous, concept of information. In the first post, we discussed the concept of self-information. In this second post, we will build on this foundation to discuss the concept of information entropy.

The evidence lower bound (ELBO)

3 minute read

Published: May 25, 2020

The evidence lower bound is an important quantity at the core of a number of important algorithms used in statistical inference including expectation-maximization and variational inference. In this post, I describe its context, definition, and derivation.

Demystifying measure-theoretic probability theory (part 3: expectation)

10 minute read

Published: May 11, 2020

In this series of posts, I present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

Reproducing kernel Hilbert spaces and the kernel trick

22 minute read

Published: December 14, 2024

If you’re a practitioner of machine learning, then there is little doubt you have seen or used an algorithm that falls into the general category of kernel methods. The premier example of such methods is the support vector machine. When introduced to these algorithms, one is taught that one must provide the algorithm with a kernel function that, intuitively, computes a degree of “similarity” between the objects you are classifying. In practice, one can get pretty far with only this understanding; however, to understand these methods more deeply, one must understand a mathematical object called a reproducing kernel Hilbert space (RKHS). In this post, I will explain the definition of a RKHS and exactly how they produce the kernels used in kernel methods thereby laying a rigorous foundation for a deeper understanding of these methods.

Matrices as functions

3 minute read

Published: December 20, 2020

At the core of linear algebra is the idea that matrices represent functions. In this post, we’ll look at a few common, elementary functions and discuss their corresponding matrices.

Median-ratio normalization for bulk RNA-seq data

11 minute read

Published: November 24, 2023

In a previous post, we discussed how RNA-seq provides measurements of relative expression between genes rather than measurements of absolute expression. In this post, we will discuss median-ratio normalization: a procedure that attempts to scale each sample’s read counts so that differences in the read counts between samples better reflects differences in absolute expression. We will start by describing the underlying assumption that must be met for median-ratio normalization to work and then walk through the details of the algorithm.

Three strategies for cataloging cell types

7 minute read

Published: March 04, 2021

In my previous post, I outlined a conceptual framework for defining and reasoning about “cell types”. Specifically, I noted that the idea of a “cell type” can be viewed as a human-made partition on the universal cellular state space. In this post, I attempt to distill three strategies for partitioning this state space and agreeing on cell type definitions.

On cell types and cell states

10 minute read

Published: March 03, 2021

The advent of single-cell genomics has brought about new efforts to characterize and catalog all of the cell types in the human body. Despite these efforts, the very definition of a “cell type” is under debate. In this post, I will discuss a conceptual framework for defining cell types as subsets of states in an underlying cellular state space. Moreover, I will link the cellular state space to biomedical ontologies that attempt to capture biological knowledge regarding cell types.

RNA-seq: the basics

19 minute read

Published: January 07, 2021

RNA sequencing (RNA-seq) has become a ubiquitous tool in biomedical research for measuring gene expression in a population of cells, or a single cell, across the genome. Despite its ubiquity, RNA-seq is relatively complex and there exists a large research effort towards developing statistical and computational methods for analyzing the raw data that it produces. In this post, I will provide a high level overview of RNA-seq and describe how to interpret some of the common units in which gene expression is measured from an RNA-seq experiment.

Perplexity: a more intuitive measure of uncertainty than entropy

2 minute read

Published: October 08, 2021

Like entropy, perplexity is an information theoretic quantity that describes the uncertainty of a random variable. In fact, perplexity is simply a monotonic function of entropy and thus, in some sense, they can be used interchangeabley. So why do we need it? In this post, I’ll discuss why perplexity is a more intuitive measure of uncertainty than entropy.

Shannon’s Source Coding Theorem (Foundations of information theory: Part 3)

13 minute read

Published: October 19, 2020

The mathematical field of information theory attempts to mathematically describe the concept of “information”. In the first two posts, we discussed the concepts of self-information and information entropy. In this post, we step through Shannon’s Source Coding Theorem to see how the information entropy of a probability distribution describes the best-achievable efficiency required to communicate samples from the distribution.

Information entropy (Foundations of information theory: Part 2)

8 minute read

Published: August 07, 2020

The mathematical field of information theory attempts to mathematically describe the concept of “information”. In this series of posts, I will attempt to describe my understanding of how, both philosophically and mathematically, information theory defines the polymorphic, and often amorphous, concept of information. In the first post, we discussed the concept of self-information. In this second post, we will build on this foundation to discuss the concept of information entropy.

What is information? (Foundations of information theory: Part 1)

4 minute read

Published: June 13, 2020

The mathematical field of information theory attempts to mathematically describe the concept of “information”. In this series of posts, I will attempt to describe my understanding of how, both philosophically and mathematically, information theory defines the polymorphic, and often amorphous, concept of information. In this first post, I will describe Shannon’s self-information.

True understanding is “seeing” in 3D

3 minute read

Published: November 01, 2020

In this post, I will discuss an analogy that I find useful for thinking about what it means to “understand” something: True understanding of a concept is akin to “seeing” the concept in its native three-dimensional space, whereas partial understanding is merely seeing a two-dimensional projection of that inherently three-dimensional concept.

Intrinsic dimensionality

6 minute read

Published: December 29, 2020

In my formal education, I found that the concept of “intrinsic dimensionality” was never explicitly taught; however, it undergirds so many concepts in linear algebra and the data sciences such as the rank of a matrix and feature selection. In this post I will discuss the difference between the extrinsic dimensionality of a space versus its intrinsic dimensionality.

Three strategies for cataloging cell types

7 minute read

Published: March 04, 2021

In my previous post, I outlined a conceptual framework for defining and reasoning about “cell types”. Specifically, I noted that the idea of a “cell type” can be viewed as a human-made partition on the universal cellular state space. In this post, I attempt to distill three strategies for partitioning this state space and agreeing on cell type definitions.

On cell types and cell states

10 minute read

Published: March 03, 2021

The advent of single-cell genomics has brought about new efforts to characterize and catalog all of the cell types in the human body. Despite these efforts, the very definition of a “cell type” is under debate. In this post, I will discuss a conceptual framework for defining cell types as subsets of states in an underlying cellular state space. Moreover, I will link the cellular state space to biomedical ontologies that attempt to capture biological knowledge regarding cell types.

A framework for making sense of metrics in technical organizations

17 minute read

Published: February 10, 2026

If you work in a quantitative or technical field, there is little doubt that you or your team has worked long and hard to define which metrics to measure and track. Using data-driven metrics is a critical practice for making rational decisions and deciphering truth in a complex and noisy world. However, as others have pointed out, an over-reliance on metrics can lead to poor outcomes. In an effort to better articulate the value and risks inherent in metrics, in this blog post, I will present a mental framework for thinking about metrics that has helped me reason about their value and risks with a bit more clarity.

Dot product

9 minute read

Published: December 09, 2024

The dot product is a fundamental operation on two Euclidean vectors that captures a notion of similarity between the vectors. In this post, we’ll define the dot product and offer a number of angles for which to intuit the idea captured by this fundamental operation.

The invertible matrix theorem

14 minute read

Published: January 28, 2024

Throughout my blog posts on linear algebra, we have proven various properties about invertible matrices. In this post we bring, all of these statements into a single location and form a set of statements called the “invertible matrix theorem”. Each statement in the invertible matrix theorem proves that the matrix is invertible and implies all of the rest of the statements.

What determinants tell us about linear transformations

10 minute read

Published: September 04, 2023

The determinant of a matrix is often taught as a function that measures the volume of the parallelepiped formed by that matrix’s columns. In this post, we will go a step further in our understanding of the determinant and discuss what the determinant tells us about the linear transformation that is characterized by the matrix. In short, the determinant tells us how much a matrix’s linear transformation grows or shrinks space. The sign of the determinant tells us whether the matrix also inverts space.

Deriving the formula for the determinant

35 minute read

Published: September 03, 2023

The determinant is a function that maps each square matrix to a value that describes the volume of the parallelepiped formed by that matrix’s columns. While this idea is fairly straightforward conceptually, the formula for the determinant is quite confusing. In this post, we will derive the formula for the determinant in an effort to make it less mysterious. Much of my understanding of this material comes from these lecture notes by Mark Demers re-written in my own words.

Vector spaces induced by matrices: column, row, and null spaces

21 minute read

Published: June 19, 2023

Matrices are one of the fundamental objects studied in linear algebra. While on their surface they appear like simple tables of numbers, this simplicity hides deeper mathematical structures that they contain. In this post, we will dive into the deeper structures within matrices by discussing three vector spaces that are induced by every matrix: a column space, a row space, and a null space.

Row reduction with elementary matrices

10 minute read

Published: October 02, 2022

In this post we discuss the row reduction algorithm for solving a system of linear equations that have exactly one solution. We will then show how the row reduction algorithm can be represented as a process involving a sequence of matrix multiplications involving a special class of matrices called elementary matrices. That is, each elementary matrix represents a single elementary row operation in the row reduction algorithm.

Reasoning about systems of linear equations using linear algebra

5 minute read

Published: June 12, 2022

In this blog post, we will discuss the relationship between matrices and systems of linear equations. Specifically, we will show how systems of linear equations can be represented as a single matrix equation. Solutions to the system of linear equations can be reasoned about by examining the characteristics of the matrices and vectors in that matrix equation.

Span and linear independence

10 minute read

Published: June 11, 2022

A very important concept linear algebra is that of linear independence. In this blog post we present the definition for the span of a set of vectors. Then, we use this definition to discuss the definition for linear independence. Finally, we discuss some intuition into this fundamental idea.

Normed vector spaces

8 minute read

Published: November 23, 2021

When first introduced to Euclidean vectors, one is taught that the length of the vector’s arrow is called the norm of the vector. In this post, we present the more rigorous and abstract definition of a norm and show how it generalizes the notion of “length” to non-Euclidean vector spaces. We also discuss how the norm induces a metric function on pairs of vectors so that one can discuss distances between vectors.

Vector spaces

11 minute read

Published: October 27, 2021

The concept of a vector space is a foundational concept in mathematics, physics, and the data sciences. In this post, we first present and explain the definition of a vector space and then go on to describe properties of vector spaces. Lastly, we present a few examples of vector spaces that go beyond the usual Euclidean vectors that are often taught in introductory math and science courses.

Invertible matrices

11 minute read

Published: October 20, 2021

In this post, we discuss invertible matrices: those matrices that characterize invertible linear transformations. We discuss three different perspectives for intuiting inverse matrices as well as several of their properties.

Matrix multiplication

11 minute read

Published: December 26, 2020

At first glance, the definition for the product of two matrices can be unintuitive. In this post, we discuss three perspectives for viewing matrix multiplication. It is the third perspective that gives this “unintuitive” definition its power: that matrix multiplication represents the composition of linear transformations.

Matrices characterize linear transformations

5 minute read

Published: December 21, 2020

Linear transformations are functions mapping vectors between two vector spaces that preserve vector addition and scalar multiplication. In this post, we show that there exists a one-to-one corresondence between linear transformations between coordinate vector spaces and matrices. Thus, we can view a matrix as representing a unique linear transformation between coordinate vector spaces.

Matrices as functions

3 minute read

Published: December 20, 2020

At the core of linear algebra is the idea that matrices represent functions. In this post, we’ll look at a few common, elementary functions and discuss their corresponding matrices.

Matrix-vector multiplication

5 minute read

Published: December 19, 2020

Matrix-vector multiplication is an operation between a matrix and a vector that produces a new vector. In this post, I’ll define matrix vector multiplication as well as three angles from which to view this concept. The third angle entails viewing matrices as functions between vector spaces

Introducing matrices

7 minute read

Published: December 13, 2020

Here, I will introduce the three main ways of thinking about matrices. This high-level description of the multi-faceted way of thinking about matrices would have helped me better intuit matrices when I was first introduced to them in my undergraduate linear algebra course.

Matrices characterize linear transformations

5 minute read

Published: December 21, 2020

Linear transformations are functions mapping vectors between two vector spaces that preserve vector addition and scalar multiplication. In this post, we show that there exists a one-to-one corresondence between linear transformations between coordinate vector spaces and matrices. Thus, we can view a matrix as representing a unique linear transformation between coordinate vector spaces.

Matrix-vector multiplication

5 minute read

Published: December 19, 2020

Matrix-vector multiplication is an operation between a matrix and a vector that produces a new vector. In this post, I’ll define matrix vector multiplication as well as three angles from which to view this concept. The third angle entails viewing matrices as functions between vector spaces

Understanding attention

33 minute read

Published: December 21, 2025

Attention is a type of layer in a neural network that is widely regarded to be one of the most important breakthroughs that enabled the development of modern AI systems and large language models. At its heart, attention is a mechanism for explicitly drawing relationships between items in a set. In natural language processing, the set being processed are words (or tokens) and attention enables the model to relate those words to one another even when those words lie far away from eachother in the body of text. In this blog post, we will step through the attention mechanism both mathematically and intuitively. We then present a minimal example of a neural network that uses attention to perform binary classification in a task that is not solveable using a naïve bag-of-words model.

Denoising diffusion probabilistic models (Part 2: Theoretical justification)

11 minute read

Published: October 20, 2024

In Part 1 of this series, we introduced the denoising diffusion probabilistic model for modeling and sampling from complex distributions. We described the diffusion model as a model that can generate new samples by learning how to reverse a diffusion process. In this post, we provide more theoretical justification for the objective function used to fit diffusion models and make connections between the diffusion model and other concepts in statistical inference and probabilistic modeling.

Denoising diffusion probabilistic models (Part 1: Definition and derivation)

57 minute read

Published: June 28, 2024

Diffusion models are a family of state-of-the-art probabilistic generative models that have achieved ground breaking results in a number of fields ranging from image generation to protein structure design. In Part 1 of this two-part series, I will walk through the denoising diffusion probabilistic model (DDPM) as presented by Ho, Jain, and Abbeel (2020). Specifically, we will walk through the model definition, the derivation of the objective function, and the training and sampling algorithms. We will conclude by walking through an implementation of a simple diffusion model in PyTorch and apply it to the MNIST dataset of hand-written digits.

Graph convolutional neural networks

25 minute read

Published: September 24, 2023

Graphs are ubiqitous mathematical objects that describe a set of relationships between entities; however, they are challenging to model with traditional machine learning methods, which require that the input be represented as vectors. In this post, we will discuss graph convolutional networks (GCNs): a class of neural network designed to operate on graphs. We will discuss the intution behind the GCN and how it is similar and different to the convolutional neural network (CNN) used in computer vision. We will conclude by presenting a case-study training a GCN to classify molecule toxicity.

Variational autoencoders

33 minute read

Published: March 14, 2023

Variational autoencoders (VAEs) are a family of deep generative models with use cases that span many applications, from image processing to bioinformatics. There are two complimentary ways of viewing the VAE: as a probabilistic model that is fit using variational Bayesian inference, or as a type of autoencoding neural network. In this post, we present the mathematical theory behind VAEs, which is rooted in Bayesian inference, and how this theory leads to an emergent autoencoding algorithm. We also discuss the similarities and differences between VAEs and standard autoencoders. Lastly, we present an implementation of a VAE in PyTorch and apply it to the task of modeling the MNIST dataset of hand-written digits.

Blackbox variational inference via the reparameterization gradient

21 minute read

Published: November 05, 2022

Variational inference (VI) is a mathematical framework for doing Bayesian inference by approximating the posterior distribution over the latent variables in a latent variable model when the true posterior is intractable. In this post, we will discuss a flexible variational inference algorithm, called blackbox VI via the reparameterization gradient, that works “out of the box” for a wide variety of models with minimal need for the tedious mathematical derivations that deriving VI algorithms usually require. We will then use this method to do Bayesian linear regression.

Variational inference

5 minute read

Published: May 31, 2021

In this post, I will present a high-level explanation of variational inference: a paradigm for estimating a posterior distribution when computing it explicitly is intractable. Variational inference finds an approximate posterior by solving a specific optimization problem that seeks to minimize the disparity between the true posterior and the approximate posterior.

Gaussian mixture models

17 minute read

Published: November 24, 2020

Gaussian mixture models are a very popular method for data clustering. Here I will define the Gaussian mixture model and also derive the EM algorithm for performing maximum likelihood estimation of its paramters.

The evidence lower bound (ELBO)

3 minute read

Published: May 25, 2020

The evidence lower bound is an important quantity at the core of a number of important algorithms used in statistical inference including expectation-maximization and variational inference. In this post, I describe its context, definition, and derivation.

Expectation-maximization: theory and intuition

13 minute read

Published: May 13, 2020

Expectation-maximization (EM) is a popular algorithm for performing maximum-likelihood estimation of the parameters in a latent variable model. In this post, I discuss the theory behind, and intuition into this algorithm.

Demystifying Euler’s number

16 minute read

Published: January 26, 2025

Euler’s number $e := 2.71828\dots$ has, to me, always been a semi-mysterious number. While I understood many facts about $e$, I never felt I ever truly understood what it really was – it’s core essence so to speak. I believe that part of the reason for my confusion is that $e$ is often taught coming from two seemingly different perspectives: Either it is introduced in the context of compound interest or it is introduced in the context of calculus as being the base of the exponential function whose derivative is itself. Thanks to an excellent explanation by Grant Sanderson’s 3Blue1Brown video, I now better understand this constant and how these two perspectives relate to one another. In this blog post, I will attempt to describe, in my own words, my understanding of Euler’s number and expound on Sanderson’s explanation.

Reproducing kernel Hilbert spaces and the kernel trick

22 minute read

Published: December 14, 2024

If you’re a practitioner of machine learning, then there is little doubt you have seen or used an algorithm that falls into the general category of kernel methods. The premier example of such methods is the support vector machine. When introduced to these algorithms, one is taught that one must provide the algorithm with a kernel function that, intuitively, computes a degree of “similarity” between the objects you are classifying. In practice, one can get pretty far with only this understanding; however, to understand these methods more deeply, one must understand a mathematical object called a reproducing kernel Hilbert space (RKHS). In this post, I will explain the definition of a RKHS and exactly how they produce the kernels used in kernel methods thereby laying a rigorous foundation for a deeper understanding of these methods.

Dot product

9 minute read

Published: December 09, 2024

The dot product is a fundamental operation on two Euclidean vectors that captures a notion of similarity between the vectors. In this post, we’ll define the dot product and offer a number of angles for which to intuit the idea captured by this fundamental operation.

The invertible matrix theorem

14 minute read

Published: January 28, 2024

Throughout my blog posts on linear algebra, we have proven various properties about invertible matrices. In this post we bring, all of these statements into a single location and form a set of statements called the “invertible matrix theorem”. Each statement in the invertible matrix theorem proves that the matrix is invertible and implies all of the rest of the statements.

The binomial theorem

4 minute read

Published: October 16, 2023

The binomial theorem appears in many proofs across mathematics and mathematical statistics. In this post, I will walk through a proof of this theorem.

What determinants tell us about linear transformations

10 minute read

Published: September 04, 2023

The determinant of a matrix is often taught as a function that measures the volume of the parallelepiped formed by that matrix’s columns. In this post, we will go a step further in our understanding of the determinant and discuss what the determinant tells us about the linear transformation that is characterized by the matrix. In short, the determinant tells us how much a matrix’s linear transformation grows or shrinks space. The sign of the determinant tells us whether the matrix also inverts space.

Deriving the formula for the determinant

35 minute read

Published: September 03, 2023

The determinant is a function that maps each square matrix to a value that describes the volume of the parallelepiped formed by that matrix’s columns. While this idea is fairly straightforward conceptually, the formula for the determinant is quite confusing. In this post, we will derive the formula for the determinant in an effort to make it less mysterious. Much of my understanding of this material comes from these lecture notes by Mark Demers re-written in my own words.

Vector spaces induced by matrices: column, row, and null spaces

21 minute read

Published: June 19, 2023

Matrices are one of the fundamental objects studied in linear algebra. While on their surface they appear like simple tables of numbers, this simplicity hides deeper mathematical structures that they contain. In this post, we will dive into the deeper structures within matrices by discussing three vector spaces that are induced by every matrix: a column space, a row space, and a null space.

Row reduction with elementary matrices

10 minute read

Published: October 02, 2022

In this post we discuss the row reduction algorithm for solving a system of linear equations that have exactly one solution. We will then show how the row reduction algorithm can be represented as a process involving a sequence of matrix multiplications involving a special class of matrices called elementary matrices. That is, each elementary matrix represents a single elementary row operation in the row reduction algorithm.

Reasoning about systems of linear equations using linear algebra

5 minute read

Published: June 12, 2022

In this blog post, we will discuss the relationship between matrices and systems of linear equations. Specifically, we will show how systems of linear equations can be represented as a single matrix equation. Solutions to the system of linear equations can be reasoned about by examining the characteristics of the matrices and vectors in that matrix equation.

Span and linear independence

10 minute read

Published: June 11, 2022

A very important concept linear algebra is that of linear independence. In this blog post we present the definition for the span of a set of vectors. Then, we use this definition to discuss the definition for linear independence. Finally, we discuss some intuition into this fundamental idea.

Functionals and functional derivatives

13 minute read

Published: April 10, 2022

The calculus of variations is a field of mathematics that deals with the optimization of functions of functions, called functionals. This topic was not taught to me in my computer science education, but it lies at the foundation of a number of important concepts and algorithms in the data sciences such as gradient boosting and variational inference. In this post, I will provide an explanation of the functional derivative and show how it relates to the gradient of an ordinary multivariate function.

Normed vector spaces

8 minute read

Published: November 23, 2021

When first introduced to Euclidean vectors, one is taught that the length of the vector’s arrow is called the norm of the vector. In this post, we present the more rigorous and abstract definition of a norm and show how it generalizes the notion of “length” to non-Euclidean vector spaces. We also discuss how the norm induces a metric function on pairs of vectors so that one can discuss distances between vectors.

The overloaded equals sign

5 minute read

Published: November 09, 2021

Two of the most important relationships in mathematics, namely equality and definition, are both denoted using the same symbol – namely, the equals sign. The overloading of this symbol confuses students in mathematics and computer programming. In this post, I argue for the use of two different symbols for these two fundamentally different operators.

Vector spaces

11 minute read

Published: October 27, 2021

The concept of a vector space is a foundational concept in mathematics, physics, and the data sciences. In this post, we first present and explain the definition of a vector space and then go on to describe properties of vector spaces. Lastly, we present a few examples of vector spaces that go beyond the usual Euclidean vectors that are often taught in introductory math and science courses.

Invertible matrices

11 minute read

Published: October 20, 2021

In this post, we discuss invertible matrices: those matrices that characterize invertible linear transformations. We discuss three different perspectives for intuiting inverse matrices as well as several of their properties.

Intrinsic dimensionality

6 minute read

Published: December 29, 2020

In my formal education, I found that the concept of “intrinsic dimensionality” was never explicitly taught; however, it undergirds so many concepts in linear algebra and the data sciences such as the rank of a matrix and feature selection. In this post I will discuss the difference between the extrinsic dimensionality of a space versus its intrinsic dimensionality.

Matrix multiplication

11 minute read

Published: December 26, 2020

At first glance, the definition for the product of two matrices can be unintuitive. In this post, we discuss three perspectives for viewing matrix multiplication. It is the third perspective that gives this “unintuitive” definition its power: that matrix multiplication represents the composition of linear transformations.

Matrices characterize linear transformations

5 minute read

Published: December 21, 2020

Linear transformations are functions mapping vectors between two vector spaces that preserve vector addition and scalar multiplication. In this post, we show that there exists a one-to-one corresondence between linear transformations between coordinate vector spaces and matrices. Thus, we can view a matrix as representing a unique linear transformation between coordinate vector spaces.

Matrices as functions

3 minute read

Published: December 20, 2020

At the core of linear algebra is the idea that matrices represent functions. In this post, we’ll look at a few common, elementary functions and discuss their corresponding matrices.

Matrix-vector multiplication

5 minute read

Published: December 19, 2020

Matrix-vector multiplication is an operation between a matrix and a vector that produces a new vector. In this post, I’ll define matrix vector multiplication as well as three angles from which to view this concept. The third angle entails viewing matrices as functions between vector spaces

Introducing matrices

7 minute read

Published: December 13, 2020

Here, I will introduce the three main ways of thinking about matrices. This high-level description of the multi-faceted way of thinking about matrices would have helped me better intuit matrices when I was first introduced to them in my undergraduate linear algebra course.

The graph Laplacian

12 minute read

Published: November 11, 2020

At the heart of of a number of important machine learning algorithms, such as spectral clustering, lies a matrix called the graph Laplacian. In this post, I’ll walk through the intuition behind the graph Laplacian and describe how it represents the discrete analogue to the Laplacian operator on continuous multivariate functions.

Demystifying measure-theoretic probability theory (part 3: expectation)

10 minute read

Published: May 11, 2020

In this series of posts, I present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

Demystifying measure-theoretic probability theory (part 2: random variables)

10 minute read

Published: January 04, 2020

In this series of posts, I present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

Demystifying measure-theoretic probability theory (part 1: probability spaces)

13 minute read

Published: December 30, 2019

In this series of posts, I will present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

Invertible matrices

11 minute read

Published: October 20, 2021

In this post, we discuss invertible matrices: those matrices that characterize invertible linear transformations. We discuss three different perspectives for intuiting inverse matrices as well as several of their properties.

Matrix multiplication

11 minute read

Published: December 26, 2020

At first glance, the definition for the product of two matrices can be unintuitive. In this post, we discuss three perspectives for viewing matrix multiplication. It is the third perspective that gives this “unintuitive” definition its power: that matrix multiplication represents the composition of linear transformations.

Matrices as functions

3 minute read

Published: December 20, 2020

At the core of linear algebra is the idea that matrices represent functions. In this post, we’ll look at a few common, elementary functions and discuss their corresponding matrices.

Matrix-vector multiplication

5 minute read

Published: December 19, 2020

Matrix-vector multiplication is an operation between a matrix and a vector that produces a new vector. In this post, I’ll define matrix vector multiplication as well as three angles from which to view this concept. The third angle entails viewing matrices as functions between vector spaces

Introducing matrices

7 minute read

Published: December 13, 2020

Here, I will introduce the three main ways of thinking about matrices. This high-level description of the multi-faceted way of thinking about matrices would have helped me better intuit matrices when I was first introduced to them in my undergraduate linear algebra course.

Demystifying measure-theoretic probability theory (part 3: expectation)

10 minute read

Published: May 11, 2020

In this series of posts, I present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

Demystifying measure-theoretic probability theory (part 2: random variables)

10 minute read

Published: January 04, 2020

In this series of posts, I present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

Demystifying measure-theoretic probability theory (part 1: probability spaces)

13 minute read

Published: December 30, 2019

In this series of posts, I will present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

Demystifying measure-theoretic probability theory (part 2: random variables)

10 minute read

Published: January 04, 2020

In this series of posts, I present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

A framework for making sense of metrics in technical organizations

17 minute read

Published: February 10, 2026

If you work in a quantitative or technical field, there is little doubt that you or your team has worked long and hard to define which metrics to measure and track. Using data-driven metrics is a critical practice for making rational decisions and deciphering truth in a complex and noisy world. However, as others have pointed out, an over-reliance on metrics can lead to poor outcomes. In an effort to better articulate the value and risks inherent in metrics, in this blog post, I will present a mental framework for thinking about metrics that has helped me reason about their value and risks with a bit more clarity.

Three strategies for cataloging cell types

7 minute read

Published: March 04, 2021

In my previous post, I outlined a conceptual framework for defining and reasoning about “cell types”. Specifically, I noted that the idea of a “cell type” can be viewed as a human-made partition on the universal cellular state space. In this post, I attempt to distill three strategies for partitioning this state space and agreeing on cell type definitions.

On cell types and cell states

10 minute read

Published: March 03, 2021

The advent of single-cell genomics has brought about new efforts to characterize and catalog all of the cell types in the human body. Despite these efforts, the very definition of a “cell type” is under debate. In this post, I will discuss a conceptual framework for defining cell types as subsets of states in an underlying cellular state space. Moreover, I will link the cellular state space to biomedical ontologies that attempt to capture biological knowledge regarding cell types.

The overloaded equals sign

5 minute read

Published: November 09, 2021

Two of the most important relationships in mathematics, namely equality and definition, are both denoted using the same symbol – namely, the equals sign. The overloading of this symbol confuses students in mathematics and computer programming. In this post, I argue for the use of two different symbols for these two fundamentally different operators.

Blackbox variational inference via the reparameterization gradient

21 minute read

Published: November 05, 2022

Variational inference (VI) is a mathematical framework for doing Bayesian inference by approximating the posterior distribution over the latent variables in a latent variable model when the true posterior is intractable. In this post, we will discuss a flexible variational inference algorithm, called blackbox VI via the reparameterization gradient, that works “out of the box” for a wide variety of models with minimal need for the tedious mathematical derivations that deriving VI algorithms usually require. We will then use this method to do Bayesian linear regression.

Denoising diffusion probabilistic models (Part 2: Theoretical justification)

11 minute read

Published: October 20, 2024

In Part 1 of this series, we introduced the denoising diffusion probabilistic model for modeling and sampling from complex distributions. We described the diffusion model as a model that can generate new samples by learning how to reverse a diffusion process. In this post, we provide more theoretical justification for the objective function used to fit diffusion models and make connections between the diffusion model and other concepts in statistical inference and probabilistic modeling.

Denoising diffusion probabilistic models (Part 1: Definition and derivation)

57 minute read

Published: June 28, 2024

Diffusion models are a family of state-of-the-art probabilistic generative models that have achieved ground breaking results in a number of fields ranging from image generation to protein structure design. In Part 1 of this two-part series, I will walk through the denoising diffusion probabilistic model (DDPM) as presented by Ho, Jain, and Abbeel (2020). Specifically, we will walk through the model definition, the derivation of the objective function, and the training and sampling algorithms. We will conclude by walking through an implementation of a simple diffusion model in PyTorch and apply it to the MNIST dataset of hand-written digits.

Variational autoencoders

33 minute read

Published: March 14, 2023

Variational autoencoders (VAEs) are a family of deep generative models with use cases that span many applications, from image processing to bioinformatics. There are two complimentary ways of viewing the VAE: as a probabilistic model that is fit using variational Bayesian inference, or as a type of autoencoding neural network. In this post, we present the mathematical theory behind VAEs, which is rooted in Bayesian inference, and how this theory leads to an emergent autoencoding algorithm. We also discuss the similarities and differences between VAEs and standard autoencoders. Lastly, we present an implementation of a VAE in PyTorch and apply it to the task of modeling the MNIST dataset of hand-written digits.

Perplexity: a more intuitive measure of uncertainty than entropy

2 minute read

Published: October 08, 2021

Like entropy, perplexity is an information theoretic quantity that describes the uncertainty of a random variable. In fact, perplexity is simply a monotonic function of entropy and thus, in some sense, they can be used interchangeabley. So why do we need it? In this post, I’ll discuss why perplexity is a more intuitive measure of uncertainty than entropy.

Variational inference

5 minute read

Published: May 31, 2021

In this post, I will present a high-level explanation of variational inference: a paradigm for estimating a posterior distribution when computing it explicitly is intractable. Variational inference finds an approximate posterior by solving a specific optimization problem that seeks to minimize the disparity between the true posterior and the approximate posterior.

Gaussian mixture models

17 minute read

Published: November 24, 2020

Gaussian mixture models are a very popular method for data clustering. Here I will define the Gaussian mixture model and also derive the EM algorithm for performing maximum likelihood estimation of its paramters.

The evidence lower bound (ELBO)

3 minute read

Published: May 25, 2020

The evidence lower bound is an important quantity at the core of a number of important algorithms used in statistical inference including expectation-maximization and variational inference. In this post, I describe its context, definition, and derivation.

Visualizing covariance

1 minute read

Published: May 21, 2020

Covariance quantifies to what extent two random variables are linearly correlated. In this post, I will outline a visualization of covariance that helped me better intuit this concept.

Expectation-maximization: theory and intuition

13 minute read

Published: May 13, 2020

Expectation-maximization (EM) is a popular algorithm for performing maximum-likelihood estimation of the parameters in a latent variable model. In this post, I discuss the theory behind, and intuition into this algorithm.

Demystifying measure-theoretic probability theory (part 3: expectation)

10 minute read

Published: May 11, 2020

In this series of posts, I present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

Demystifying measure-theoretic probability theory (part 2: random variables)

10 minute read

Published: January 04, 2020

In this series of posts, I present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

Demystifying measure-theoretic probability theory (part 1: probability spaces)

13 minute read

Published: December 30, 2019

In this series of posts, I will present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

Demystifying measure-theoretic probability theory (part 2: random variables)

10 minute read

Published: January 04, 2020

In this series of posts, I present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

What is information? (Foundations of information theory: Part 1)

4 minute read

Published: June 13, 2020

The mathematical field of information theory attempts to mathematically describe the concept of “information”. In this series of posts, I will attempt to describe my understanding of how, both philosophically and mathematically, information theory defines the polymorphic, and often amorphous, concept of information. In this first post, I will describe Shannon’s self-information.

Three strategies for cataloging cell types

7 minute read

Published: March 04, 2021

In my previous post, I outlined a conceptual framework for defining and reasoning about “cell types”. Specifically, I noted that the idea of a “cell type” can be viewed as a human-made partition on the universal cellular state space. In this post, I attempt to distill three strategies for partitioning this state space and agreeing on cell type definitions.

On cell types and cell states

10 minute read

Published: March 03, 2021

The advent of single-cell genomics has brought about new efforts to characterize and catalog all of the cell types in the human body. Despite these efforts, the very definition of a “cell type” is under debate. In this post, I will discuss a conceptual framework for defining cell types as subsets of states in an underlying cellular state space. Moreover, I will link the cellular state space to biomedical ontologies that attempt to capture biological knowledge regarding cell types.

The graph Laplacian

12 minute read

Published: November 11, 2020

At the heart of of a number of important machine learning algorithms, such as spectral clustering, lies a matrix called the graph Laplacian. In this post, I’ll walk through the intuition behind the graph Laplacian and describe how it represents the discrete analogue to the Laplacian operator on continuous multivariate functions.

Perplexity: a more intuitive measure of uncertainty than entropy

2 minute read

Published: October 08, 2021

Like entropy, perplexity is an information theoretic quantity that describes the uncertainty of a random variable. In fact, perplexity is simply a monotonic function of entropy and thus, in some sense, they can be used interchangeabley. So why do we need it? In this post, I’ll discuss why perplexity is a more intuitive measure of uncertainty than entropy.

Variational inference

5 minute read

Published: May 31, 2021

In this post, I will present a high-level explanation of variational inference: a paradigm for estimating a posterior distribution when computing it explicitly is intractable. Variational inference finds an approximate posterior by solving a specific optimization problem that seeks to minimize the disparity between the true posterior and the approximate posterior.

Gaussian mixture models

17 minute read

Published: November 24, 2020

Gaussian mixture models are a very popular method for data clustering. Here I will define the Gaussian mixture model and also derive the EM algorithm for performing maximum likelihood estimation of its paramters.

The evidence lower bound (ELBO)

3 minute read

Published: May 25, 2020

The evidence lower bound is an important quantity at the core of a number of important algorithms used in statistical inference including expectation-maximization and variational inference. In this post, I describe its context, definition, and derivation.

Visualizing covariance

1 minute read

Published: May 21, 2020

Covariance quantifies to what extent two random variables are linearly correlated. In this post, I will outline a visualization of covariance that helped me better intuit this concept.

Expectation-maximization: theory and intuition

13 minute read

Published: May 13, 2020

Expectation-maximization (EM) is a popular algorithm for performing maximum-likelihood estimation of the parameters in a latent variable model. In this post, I discuss the theory behind, and intuition into this algorithm.

Demystifying measure-theoretic probability theory (part 3: expectation)

10 minute read

Published: May 11, 2020

In this series of posts, I present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

Demystifying measure-theoretic probability theory (part 2: random variables)

10 minute read

Published: January 04, 2020

In this series of posts, I present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

Demystifying measure-theoretic probability theory (part 1: probability spaces)

13 minute read

Published: December 30, 2019

In this series of posts, I will present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

Understanding attention

33 minute read

Published: December 21, 2025

Attention is a type of layer in a neural network that is widely regarded to be one of the most important breakthroughs that enabled the development of modern AI systems and large language models. At its heart, attention is a mechanism for explicitly drawing relationships between items in a set. In natural language processing, the set being processed are words (or tokens) and attention enables the model to relate those words to one another even when those words lie far away from eachother in the body of text. In this blog post, we will step through the attention mechanism both mathematically and intuitively. We then present a minimal example of a neural network that uses attention to perform binary classification in a task that is not solveable using a naïve bag-of-words model.

Understanding attention

33 minute read

Published: December 21, 2025

Attention is a type of layer in a neural network that is widely regarded to be one of the most important breakthroughs that enabled the development of modern AI systems and large language models. At its heart, attention is a mechanism for explicitly drawing relationships between items in a set. In natural language processing, the set being processed are words (or tokens) and attention enables the model to relate those words to one another even when those words lie far away from eachother in the body of text. In this blog post, we will step through the attention mechanism both mathematically and intuitively. We then present a minimal example of a neural network that uses attention to perform binary classification in a task that is not solveable using a naïve bag-of-words model.

Demystifying Euler’s number

16 minute read

Published: January 26, 2025

Euler’s number $e := 2.71828\dots$ has, to me, always been a semi-mysterious number. While I understood many facts about $e$, I never felt I ever truly understood what it really was – it’s core essence so to speak. I believe that part of the reason for my confusion is that $e$ is often taught coming from two seemingly different perspectives: Either it is introduced in the context of compound interest or it is introduced in the context of calculus as being the base of the exponential function whose derivative is itself. Thanks to an excellent explanation by Grant Sanderson’s 3Blue1Brown video, I now better understand this constant and how these two perspectives relate to one another. In this blog post, I will attempt to describe, in my own words, my understanding of Euler’s number and expound on Sanderson’s explanation.

Reproducing kernel Hilbert spaces and the kernel trick

22 minute read

Published: December 14, 2024

If you’re a practitioner of machine learning, then there is little doubt you have seen or used an algorithm that falls into the general category of kernel methods. The premier example of such methods is the support vector machine. When introduced to these algorithms, one is taught that one must provide the algorithm with a kernel function that, intuitively, computes a degree of “similarity” between the objects you are classifying. In practice, one can get pretty far with only this understanding; however, to understand these methods more deeply, one must understand a mathematical object called a reproducing kernel Hilbert space (RKHS). In this post, I will explain the definition of a RKHS and exactly how they produce the kernels used in kernel methods thereby laying a rigorous foundation for a deeper understanding of these methods.

Dot product

9 minute read

Published: December 09, 2024

The dot product is a fundamental operation on two Euclidean vectors that captures a notion of similarity between the vectors. In this post, we’ll define the dot product and offer a number of angles for which to intuit the idea captured by this fundamental operation.

Denoising diffusion probabilistic models (Part 2: Theoretical justification)

11 minute read

Published: October 20, 2024

In Part 1 of this series, we introduced the denoising diffusion probabilistic model for modeling and sampling from complex distributions. We described the diffusion model as a model that can generate new samples by learning how to reverse a diffusion process. In this post, we provide more theoretical justification for the objective function used to fit diffusion models and make connections between the diffusion model and other concepts in statistical inference and probabilistic modeling.

Denoising diffusion probabilistic models (Part 1: Definition and derivation)

57 minute read

Published: June 28, 2024

Diffusion models are a family of state-of-the-art probabilistic generative models that have achieved ground breaking results in a number of fields ranging from image generation to protein structure design. In Part 1 of this two-part series, I will walk through the denoising diffusion probabilistic model (DDPM) as presented by Ho, Jain, and Abbeel (2020). Specifically, we will walk through the model definition, the derivation of the objective function, and the training and sampling algorithms. We will conclude by walking through an implementation of a simple diffusion model in PyTorch and apply it to the MNIST dataset of hand-written digits.

Assessing the utility of data visualizations based on dimensionality reduction

24 minute read

Published: March 02, 2024

We human beings use our vision as our chief sense for understanding the world, and thus when we are confronted with data, we try to understand that data through visualization. Dimensionality reduction methods, such as PCA, t-SNE, and UMAP, are approaches designed to enable the visualization of high-dimensional data. Unfortunately, because these methods inevitably distort aspects of the data, these methods are receiving new scrutiny. In this post, I propose that dimensionality reduction requires a “probabilistic” framework of interpretation rather than a “deterministic” one wherein conclusions one draws from a dimensionality reduction plot have some probability of not actually being true of the data. I will propose that this does not mean these plots are not useful. Rather, to evaluate their utility, I will argue that empirical user studies of these methods will shed light on whether these methods provide more benefit or more harm in practice.

The invertible matrix theorem

14 minute read

Published: January 28, 2024

Throughout my blog posts on linear algebra, we have proven various properties about invertible matrices. In this post we bring, all of these statements into a single location and form a set of statements called the “invertible matrix theorem”. Each statement in the invertible matrix theorem proves that the matrix is invertible and implies all of the rest of the statements.

Median-ratio normalization for bulk RNA-seq data

11 minute read

Published: November 24, 2023

In a previous post, we discussed how RNA-seq provides measurements of relative expression between genes rather than measurements of absolute expression. In this post, we will discuss median-ratio normalization: a procedure that attempts to scale each sample’s read counts so that differences in the read counts between samples better reflects differences in absolute expression. We will start by describing the underlying assumption that must be met for median-ratio normalization to work and then walk through the details of the algorithm.

The binomial theorem

4 minute read

Published: October 16, 2023

The binomial theorem appears in many proofs across mathematics and mathematical statistics. In this post, I will walk through a proof of this theorem.

Graph convolutional neural networks

25 minute read

Published: September 24, 2023

Graphs are ubiqitous mathematical objects that describe a set of relationships between entities; however, they are challenging to model with traditional machine learning methods, which require that the input be represented as vectors. In this post, we will discuss graph convolutional networks (GCNs): a class of neural network designed to operate on graphs. We will discuss the intution behind the GCN and how it is similar and different to the convolutional neural network (CNN) used in computer vision. We will conclude by presenting a case-study training a GCN to classify molecule toxicity.

What determinants tell us about linear transformations

10 minute read

Published: September 04, 2023

The determinant of a matrix is often taught as a function that measures the volume of the parallelepiped formed by that matrix’s columns. In this post, we will go a step further in our understanding of the determinant and discuss what the determinant tells us about the linear transformation that is characterized by the matrix. In short, the determinant tells us how much a matrix’s linear transformation grows or shrinks space. The sign of the determinant tells us whether the matrix also inverts space.

Deriving the formula for the determinant

35 minute read

Published: September 03, 2023

The determinant is a function that maps each square matrix to a value that describes the volume of the parallelepiped formed by that matrix’s columns. While this idea is fairly straightforward conceptually, the formula for the determinant is quite confusing. In this post, we will derive the formula for the determinant in an effort to make it less mysterious. Much of my understanding of this material comes from these lecture notes by Mark Demers re-written in my own words.

Vector spaces induced by matrices: column, row, and null spaces

21 minute read

Published: June 19, 2023

Matrices are one of the fundamental objects studied in linear algebra. While on their surface they appear like simple tables of numbers, this simplicity hides deeper mathematical structures that they contain. In this post, we will dive into the deeper structures within matrices by discussing three vector spaces that are induced by every matrix: a column space, a row space, and a null space.

Variational autoencoders

33 minute read

Published: March 14, 2023

Variational autoencoders (VAEs) are a family of deep generative models with use cases that span many applications, from image processing to bioinformatics. There are two complimentary ways of viewing the VAE: as a probabilistic model that is fit using variational Bayesian inference, or as a type of autoencoding neural network. In this post, we present the mathematical theory behind VAEs, which is rooted in Bayesian inference, and how this theory leads to an emergent autoencoding algorithm. We also discuss the similarities and differences between VAEs and standard autoencoders. Lastly, we present an implementation of a VAE in PyTorch and apply it to the task of modeling the MNIST dataset of hand-written digits.

Blackbox variational inference via the reparameterization gradient

21 minute read

Published: November 05, 2022

Variational inference (VI) is a mathematical framework for doing Bayesian inference by approximating the posterior distribution over the latent variables in a latent variable model when the true posterior is intractable. In this post, we will discuss a flexible variational inference algorithm, called blackbox VI via the reparameterization gradient, that works “out of the box” for a wide variety of models with minimal need for the tedious mathematical derivations that deriving VI algorithms usually require. We will then use this method to do Bayesian linear regression.

Row reduction with elementary matrices

10 minute read

Published: October 02, 2022

In this post we discuss the row reduction algorithm for solving a system of linear equations that have exactly one solution. We will then show how the row reduction algorithm can be represented as a process involving a sequence of matrix multiplications involving a special class of matrices called elementary matrices. That is, each elementary matrix represents a single elementary row operation in the row reduction algorithm.

Reasoning about systems of linear equations using linear algebra

5 minute read

Published: June 12, 2022

In this blog post, we will discuss the relationship between matrices and systems of linear equations. Specifically, we will show how systems of linear equations can be represented as a single matrix equation. Solutions to the system of linear equations can be reasoned about by examining the characteristics of the matrices and vectors in that matrix equation.

Span and linear independence

10 minute read

Published: June 11, 2022

A very important concept linear algebra is that of linear independence. In this blog post we present the definition for the span of a set of vectors. Then, we use this definition to discuss the definition for linear independence. Finally, we discuss some intuition into this fundamental idea.

Functionals and functional derivatives

13 minute read

Published: April 10, 2022

The calculus of variations is a field of mathematics that deals with the optimization of functions of functions, called functionals. This topic was not taught to me in my computer science education, but it lies at the foundation of a number of important concepts and algorithms in the data sciences such as gradient boosting and variational inference. In this post, I will provide an explanation of the functional derivative and show how it relates to the gradient of an ordinary multivariate function.

Normed vector spaces

8 minute read

Published: November 23, 2021

When first introduced to Euclidean vectors, one is taught that the length of the vector’s arrow is called the norm of the vector. In this post, we present the more rigorous and abstract definition of a norm and show how it generalizes the notion of “length” to non-Euclidean vector spaces. We also discuss how the norm induces a metric function on pairs of vectors so that one can discuss distances between vectors.

The overloaded equals sign

5 minute read

Published: November 09, 2021

Two of the most important relationships in mathematics, namely equality and definition, are both denoted using the same symbol – namely, the equals sign. The overloading of this symbol confuses students in mathematics and computer programming. In this post, I argue for the use of two different symbols for these two fundamentally different operators.

Vector spaces

11 minute read

Published: October 27, 2021

The concept of a vector space is a foundational concept in mathematics, physics, and the data sciences. In this post, we first present and explain the definition of a vector space and then go on to describe properties of vector spaces. Lastly, we present a few examples of vector spaces that go beyond the usual Euclidean vectors that are often taught in introductory math and science courses.

Invertible matrices

11 minute read

Published: October 20, 2021

In this post, we discuss invertible matrices: those matrices that characterize invertible linear transformations. We discuss three different perspectives for intuiting inverse matrices as well as several of their properties.

Perplexity: a more intuitive measure of uncertainty than entropy

2 minute read

Published: October 08, 2021

Like entropy, perplexity is an information theoretic quantity that describes the uncertainty of a random variable. In fact, perplexity is simply a monotonic function of entropy and thus, in some sense, they can be used interchangeabley. So why do we need it? In this post, I’ll discuss why perplexity is a more intuitive measure of uncertainty than entropy.

Variational inference

5 minute read

Published: May 31, 2021

In this post, I will present a high-level explanation of variational inference: a paradigm for estimating a posterior distribution when computing it explicitly is intractable. Variational inference finds an approximate posterior by solving a specific optimization problem that seeks to minimize the disparity between the true posterior and the approximate posterior.

RNA-seq: the basics

19 minute read

Published: January 07, 2021

RNA sequencing (RNA-seq) has become a ubiquitous tool in biomedical research for measuring gene expression in a population of cells, or a single cell, across the genome. Despite its ubiquity, RNA-seq is relatively complex and there exists a large research effort towards developing statistical and computational methods for analyzing the raw data that it produces. In this post, I will provide a high level overview of RNA-seq and describe how to interpret some of the common units in which gene expression is measured from an RNA-seq experiment.

Intrinsic dimensionality

6 minute read

Published: December 29, 2020

In my formal education, I found that the concept of “intrinsic dimensionality” was never explicitly taught; however, it undergirds so many concepts in linear algebra and the data sciences such as the rank of a matrix and feature selection. In this post I will discuss the difference between the extrinsic dimensionality of a space versus its intrinsic dimensionality.

Matrix multiplication

11 minute read

Published: December 26, 2020

At first glance, the definition for the product of two matrices can be unintuitive. In this post, we discuss three perspectives for viewing matrix multiplication. It is the third perspective that gives this “unintuitive” definition its power: that matrix multiplication represents the composition of linear transformations.

Matrices characterize linear transformations

5 minute read

Published: December 21, 2020

Linear transformations are functions mapping vectors between two vector spaces that preserve vector addition and scalar multiplication. In this post, we show that there exists a one-to-one corresondence between linear transformations between coordinate vector spaces and matrices. Thus, we can view a matrix as representing a unique linear transformation between coordinate vector spaces.

Matrices as functions

3 minute read

Published: December 20, 2020

At the core of linear algebra is the idea that matrices represent functions. In this post, we’ll look at a few common, elementary functions and discuss their corresponding matrices.

Matrix-vector multiplication

5 minute read

Published: December 19, 2020

Matrix-vector multiplication is an operation between a matrix and a vector that produces a new vector. In this post, I’ll define matrix vector multiplication as well as three angles from which to view this concept. The third angle entails viewing matrices as functions between vector spaces

Introducing matrices

7 minute read

Published: December 13, 2020

Here, I will introduce the three main ways of thinking about matrices. This high-level description of the multi-faceted way of thinking about matrices would have helped me better intuit matrices when I was first introduced to them in my undergraduate linear algebra course.

Gaussian mixture models

17 minute read

Published: November 24, 2020

Gaussian mixture models are a very popular method for data clustering. Here I will define the Gaussian mixture model and also derive the EM algorithm for performing maximum likelihood estimation of its paramters.

The graph Laplacian

12 minute read

Published: November 11, 2020

At the heart of of a number of important machine learning algorithms, such as spectral clustering, lies a matrix called the graph Laplacian. In this post, I’ll walk through the intuition behind the graph Laplacian and describe how it represents the discrete analogue to the Laplacian operator on continuous multivariate functions.

Shannon’s Source Coding Theorem (Foundations of information theory: Part 3)

13 minute read

Published: October 19, 2020

The mathematical field of information theory attempts to mathematically describe the concept of “information”. In the first two posts, we discussed the concepts of self-information and information entropy. In this post, we step through Shannon’s Source Coding Theorem to see how the information entropy of a probability distribution describes the best-achievable efficiency required to communicate samples from the distribution.

Information entropy (Foundations of information theory: Part 2)

8 minute read

Published: August 07, 2020

The mathematical field of information theory attempts to mathematically describe the concept of “information”. In this series of posts, I will attempt to describe my understanding of how, both philosophically and mathematically, information theory defines the polymorphic, and often amorphous, concept of information. In the first post, we discussed the concept of self-information. In this second post, we will build on this foundation to discuss the concept of information entropy.

What is information? (Foundations of information theory: Part 1)

4 minute read

Published: June 13, 2020

The mathematical field of information theory attempts to mathematically describe the concept of “information”. In this series of posts, I will attempt to describe my understanding of how, both philosophically and mathematically, information theory defines the polymorphic, and often amorphous, concept of information. In this first post, I will describe Shannon’s self-information.

The evidence lower bound (ELBO)

3 minute read

Published: May 25, 2020

The evidence lower bound is an important quantity at the core of a number of important algorithms used in statistical inference including expectation-maximization and variational inference. In this post, I describe its context, definition, and derivation.

Visualizing covariance

1 minute read

Published: May 21, 2020

Covariance quantifies to what extent two random variables are linearly correlated. In this post, I will outline a visualization of covariance that helped me better intuit this concept.

Expectation-maximization: theory and intuition

13 minute read

Published: May 13, 2020

Expectation-maximization (EM) is a popular algorithm for performing maximum-likelihood estimation of the parameters in a latent variable model. In this post, I discuss the theory behind, and intuition into this algorithm.

Demystifying measure-theoretic probability theory (part 3: expectation)

10 minute read

Published: May 11, 2020

In this series of posts, I present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

Demystifying measure-theoretic probability theory (part 2: random variables)

10 minute read

Published: January 04, 2020

In this series of posts, I present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

Demystifying measure-theoretic probability theory (part 1: probability spaces)

13 minute read

Published: December 30, 2019

In this series of posts, I will present my understanding of some basic concepts in measure theory — the mathematical study of objects with “size”— that have enabled me to gain a deeper understanding into the foundations of probability theory.

Blackbox variational inference via the reparameterization gradient

21 minute read

Published: November 05, 2022

Variational inference (VI) is a mathematical framework for doing Bayesian inference by approximating the posterior distribution over the latent variables in a latent variable model when the true posterior is intractable. In this post, we will discuss a flexible variational inference algorithm, called blackbox VI via the reparameterization gradient, that works “out of the box” for a wide variety of models with minimal need for the tedious mathematical derivations that deriving VI algorithms usually require. We will then use this method to do Bayesian linear regression.

Assessing the utility of data visualizations based on dimensionality reduction

24 minute read

Published: March 02, 2024

We human beings use our vision as our chief sense for understanding the world, and thus when we are confronted with data, we try to understand that data through visualization. Dimensionality reduction methods, such as PCA, t-SNE, and UMAP, are approaches designed to enable the visualization of high-dimensional data. Unfortunately, because these methods inevitably distort aspects of the data, these methods are receiving new scrutiny. In this post, I propose that dimensionality reduction requires a “probabilistic” framework of interpretation rather than a “deterministic” one wherein conclusions one draws from a dimensionality reduction plot have some probability of not actually being true of the data. I will propose that this does not mean these plots are not useful. Rather, to evaluate their utility, I will argue that empirical user studies of these methods will shed light on whether these methods provide more benefit or more harm in practice.

Matthew N. Bernstein

Posts by Tags

Gaussian mixture models

Laplacian matrix

Lebesgue integration

RNA-seq

attention

bioinformatics

biology

book review

calculus of variations

cell type

clustering

covariance

data science

deep learning

education

entropy

evidence lower bound

expectation

functional analysis

functions

gene expression

information theory

insight

intrinsic dimensionality

knowledge representation

leadership

linear algebra

linear transformation

machine learning

mathematics

matrices

measure theory

measureable function

metrics

ontologies

pedagogy