A framework for making sense of metrics in technical organizations



If you work in a quantitative or technical field, there is little doubt that you or your team has worked long and hard to define which metrics to measure and track. Using data-driven metrics is a critical practice for making rational decisions and deciphering truth in a complex and noisy world. However, as others have pointed out, an over-reliance on metrics can lead to poor outcomes. In this post, I propose a mental model for conceptualizing metrics and discuss how metrics approximate value in specific, structurally constrained ways. Those constraints determine what the metric can and cannot be used for. Organizations that ignore those constraints misuse metrics; in the extreme case, the only honest move is to stop using them.

Introduction

For good reason, the importance of data-driven reasoning is deeply ingrained in the culture of quantitative and technical disciplines. At the same time, the systems we both build and operate within are too complex to be fully grasped by the human mind; to understand them, we must measure them. The combination of this cultural tradition and the simple need to understand complex systems leads technical organizations to fixate on metrics. Managers repeat quotes like, “You can’t manage what you can’t measure.” Software engineers create dashboards and databases to track metrics. Meetings often begin with an overview of where the project or business stands in terms of the metrics.

I would by no means be the first to point out that an over-reliance on metrics can lead to poor outcomes. For example, Goodhart’s Law states, “When a measure becomes a target, it ceases to be a good measure.” As Jeff Bezos has pointed out, organizations often end up managing the “proxy for truth” rather than the thing they actually value.

In this blog post, I will present a mental framework for thinking about metrics that lets us more precisely articulate the kinds of failure modes an over-reliance on metrics can lead to. More specifically, I will:

  1. View the system being measured as a high-dimensional object living in an abstract space of possible systems.
  2. Define a value function that quantifies how “good” a given system is.
  3. Distinguish between exploratory metrics, which describe the system, and value-approximating metrics, which stand in for the value function.
  4. Describe several ways in which a value-approximating metric can deviate from the true value function, and what each deviation implies about how the metric should (or should not) be used.

Systems as high-dimensional objects

Before discussing metrics, we will first generalize the system being measured as a high-dimensional object in some abstract space (akin to a vector space). By “system”, I mean any complicated thing that a given technical organization seeks to understand or improve.

For example, the system under consideration might be an entire business. Businesses are complicated, “high-dimensional” objects in that they have many components: employees, processes, capital, debt, revenue, and so on. A piece of technology that an organization is building, like a website, is also such a system: it has many components, including traffic, latency, lines of code, and security vulnerabilities.

In a very abstract way, one can imagine that any given system resides in a “space” comprising other similar systems. For simplicity, let’s take a business: we can summarize a business in terms of a large list of numbers, such as number of employees, sales per month, cost of goods sold, cash on hand, debt, and so on (the list can go on and on). Given such a list (which could be extremely long), we can place the business in a coordinate vector space where each location in the space corresponds to some (possibly non-existent) business. A schematic showing three dimensions is shown below:

[Figure: schematic placing businesses as points in a three-dimensional coordinate space]

We will denote this space of possible systems (e.g., businesses) as $\mathcal{X}$. A given system $x$ is a member of $\mathcal{X}$, denoted $x \in \mathcal{X}$.
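
To make this concrete, here is a minimal sketch in Python, assuming we summarize a hypothetical business with just five of the attributes listed above; the attribute names and numbers are made up purely for illustration:

```python
import numpy as np

# Attributes that define the dimensions of our (tiny) space of businesses.
# A real list would be far longer; these five are illustrative only.
ATTRIBUTES = ["num_employees", "sales_per_month", "cost_of_goods_sold",
              "cash_on_hand", "debt"]

def as_point(business: dict) -> np.ndarray:
    """Map a business (attribute -> value) to its coordinates in the space X."""
    return np.array([business[a] for a in ATTRIBUTES], dtype=float)

# A hypothetical business becomes a single point x in X.
x = as_point({
    "num_employees": 42,
    "sales_per_month": 1.2e6,
    "cost_of_goods_sold": 7.5e5,
    "cash_on_hand": 3.0e5,
    "debt": 1.0e5,
})
print(x.shape)  # (5,) -- one coordinate per attribute
```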

Value functions tell you how “good” a system is

We consider cases in which the goal of an organization is either to improve some system under measurement, $x \in \mathcal{X}$, or, at the very least, to assess how “good” the system is in terms of some subjective or economic measure of value.

At risk of being pedantic, let’s define a value function to be a function $V$ that maps systems in $\mathcal{X}$ to real numbers that quantify the value of those systems:

\[V : \mathcal{X} \rightarrow \mathbb{R}\]

That is, $V(x)$ tells us how much to value system $x$. If we have two systems, $x_1$ and $x_2$, then $V(x_1) > V(x_2)$ tells us we should prefer $x_1$ to $x_2$. We can depict this schematically as a heatmap over a small, two-dimensional space of systems. Organizations that seek to maximize $V$ are, in a sense, performing a form of “gradient ascent” on $V$:

\[x' \leftarrow x + \nabla V(x)\]

where the goal is to make iterative progress along $V$. Below is a toy example:
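
Here is a minimal numerical sketch of this idea, assuming a made-up quadratic value function over a two-dimensional space of systems; the peak location, starting point, and step size $\eta$ (which the update rule above leaves implicit) are all arbitrary choices for illustration:

```python
import numpy as np

PEAK = np.array([3.0, -1.0])      # location of the most valuable system (made up)

def V(x: np.ndarray) -> float:
    """A toy value function: largest (closest to zero) at the peak."""
    return -float(np.sum((x - PEAK) ** 2))

def grad_V(x: np.ndarray) -> np.ndarray:
    """Gradient of the toy value function."""
    return -2.0 * (x - PEAK)

x = np.zeros(2)                   # start from some initial system
eta = 0.1                         # step size for each improvement
for _ in range(50):
    x = x + eta * grad_V(x)       # x' <- x + eta * grad V(x)

print(np.round(x, 3), round(V(x), 6))  # x approaches [3., -1.]; V(x) approaches 0
```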

In some situations, the value function is obvious. For example, when considering a business, the value function might simply be its profitability. In other cases, however, the value function is not so easy to define; a classic example is judging the aesthetic value of art.

I would also argue that even in technical fields, where the system under study should admit an easy-to-define value function, it is often not as clear-cut as one would hope. For example, in my current field of cellular perturbation modeling, the value function is still being hashed out by the research community. In perturbation modeling, the goal is to predict how a biological cell will respond to a chemical or genetic perturbation. A cell’s response to perturbation is very high-dimensional: we measure how gene expression changes across all ~20k genes in the human genome. While the field agrees that accurate perturbation modeling will enable breakthroughs in how we develop new therapeutics (enabling in silico simulation of cellular behavior rather than requiring expensive wet lab experiments), there is little consensus on how to evaluate the performance of a given perturbation model, given the high dimensionality of the output and the complexity of downstream use-cases.

Metrics are functions that can be either exploratory or value-approximating

Using this same framework, we can define a metric to be some function that, like $V$, projects a system, $x \in \mathcal{X}$, to a number. That is, we can define a metric, $f$, as a function

\[f: \mathcal{X} \rightarrow \mathbb{R}\]

Metrics fall into two fundamentally different categories: those that are intended to approximate the value function $V$, and those that are intended purely to explore and describe the structure of $\mathcal{X}$. These two uses impose very different requirements. A value-approximating metric makes an implicit claim about the relationship between $f$ and $V$; to rely on such a metric, one must understand where this approximation holds. Exploratory metrics, by contrast, make no claim about value at all. I believe that making a clear distinction between these two kinds of metrics can bring clarity to discussions around them.

Exploratory metrics: Those that seek to describe $\mathcal{X}$

In many cases, organizations do not necessarily seek to approximate the value function, but instead simply seek to understand the system. To do so, they create a collection of metrics, each describing some specific aspect of the system: $f_1, f_2, \dots, f_M$. In this way, these metrics reduce a complex, high-dimensional system down to a space of just a few dimensions (in this sense, metrics act as a form of dimensionality reduction):

\[f_1(x), f_2(x), \dots, f_M(x)\]

The goal here is mechanistic understanding of the system. It is not to see whether the system is improving over time, but rather to gain a holistic understanding that may lead to new insights into how to improve the system downstream of these metrics.

Exploratory metrics are almost always critical for understanding, though I would note that numbers alone may not suffice. Sometimes, one must also understand the geometric relationships between these metrics, which are better grasped via visualizations (whether scatterplots, heatmaps, or something else). Said differently, tracking a metric alone may not be sufficient to gain adequate understanding; rather, understanding may only come from synthesizing these metrics into a comprehensible visual format.
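
As a minimal sketch of what this kind of exploratory reduction might look like in practice (the systems, their attributes, and the two metrics below are entirely synthetic and hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# 200 hypothetical systems, each summarized by 50 raw attributes
# (purely synthetic numbers for illustration).
systems = rng.normal(size=(200, 50))

# Two exploratory metrics, each collapsing a system to a single number.
def f1(x: np.ndarray) -> float:
    """An overall 'magnitude' summary of the system."""
    return float(np.linalg.norm(x))

def f2(x: np.ndarray) -> float:
    """A 'spread' summary of the system's attributes."""
    return float(np.std(x))

# Reduce each 50-dimensional system to the 2-dimensional point (f1(x), f2(x)),
# then look at the geometry of the whole collection, not just the raw numbers.
points = np.array([[f1(x), f2(x)] for x in systems])
plt.scatter(points[:, 0], points[:, 1], s=10)
plt.xlabel("$f_1(x)$")
plt.ylabel("$f_2(x)$")
plt.title("Systems viewed through two exploratory metrics")
plt.show()
```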

One final point on this topic: it is important to keep the distinction between exploratory metrics and value-approximating metrics clear. As soon as one modifies the system in order to optimize some metric $f(x)$, that move implicitly turns the metric into a value-approximating one rather than an exploratory one.

Value-approximating metrics: Those that seek to approximate $V$

A value-approximating metric is a metric, $f(x)$, that is treated as a proxy for the value function $V(x)$. Unlike in exploratory analyses, here the organization moves the system in the direction of the gradient of $f$, using that gradient as a proxy for the gradient of $V$:

\[\nabla f(x) \approx \nabla V(x)\]

Because of this, it is important to have a handle on how $f$ is or isn’t a good approximation of $V$. In the following sections, I will briefly describe several ways in which $f$ may deviate from $V$ and how, in each situation, one should approach $f$.

Locally accurate, but globally inaccurate

In this scenario, the metric $f$ is very close to $V$ in some local neighborhood around $x$; however, as the organization optimizes for $f$, it pushes the system into regions where $f$ is no longer a good approximation. This is a form of Goodhart’s Law.

This is illustrated schematically in the figure below. In the top left plot, we show the true value function, $V$, and the trajectory we would take if we were optimizing with respect to it. In the top right plot, we show a metric function, $f$, and the trajectory we would take if we were optimizing that instead. In the bottom plot, we superimpose the two trajectories (blue = $f$, orange = $V$). As you can see, the two trajectories start off very close, but then diverge as $x$ is optimized toward maximizing $f$.

[Figure: top left, the true value function $V$ and its optimization trajectory; top right, the metric $f$ and its trajectory; bottom, the two trajectories superimposed, diverging as $f$ is optimized]
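
Here is a minimal numerical sketch of this failure mode, using made-up one-dimensional functions: the true value is $V(x) = \sin(x)$ and the metric is $f(x) = x$, which agree closely near $x = 0$ (a first-order Taylor expansion) but diverge far from it:

```python
import numpy as np

# Made-up functions purely for illustration. Near x = 0, f(x) ~= V(x),
# but globally the two diverge.
def V(x):       # the true (latent) value function
    return np.sin(x)

def f(x):       # the metric being optimized
    return x

def grad_f(x):  # gradient of the metric
    return 1.0

x, eta = 0.0, 0.1
trajectory = []
for _ in range(60):
    x = x + eta * grad_f(x)            # climb the metric, not the true value
    trajectory.append((x, f(x), V(x)))

for x_t, f_t, V_t in trajectory[::15]:
    print(f"x = {x_t:4.1f}   metric f = {f_t:5.2f}   true value V = {V_t:5.2f}")
# The metric climbs forever, but the true value peaks near x ~= 1.57 and then
# falls: continuing to optimize f eventually destroys value.
```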

A premier example of such a scenario occurred recently in the Virtual Cell Challenge held by the Arc Institute. In this challenge, research groups competed to develop a perturbation model that would be scored via two metrics: a measure of how well a model could discriminate between perturbations (perturbation discrimination score) and a measure of how accurately the model could predict which genes would change the most (differential expression overlap).

In this challenge, groups found that they could game the metrics by applying absurd transformations to the data; in fact, purely random predictions could be made to score highly. These metrics, while perhaps a good proxy for the latent value function within the regime for which the Arc Institute designed the challenge, were a poor proxy in distal regions of the space.

Proportional to $V$

In this situation, the metric $f$ is proportional to $V$. That is, $V(x) = c f(x)$ for some unknown positive constant $c$. This is illustrated in the figure below:

This is actually a very nice situation to find oneself in; however, it comes with a key challenge: we don’t know when we have achieved success. If we define some success criterion in terms of $V$, and $f$ is only proportional to $V$ up to an unknown constant $c$, then we cannot translate that criterion into a threshold on $f$. For example, if success means $V(x) \geq 100$, then observing $f(x) = 40$ tells us nothing about whether we have succeeded unless we know $c$.

Monotonically correlated with $V$

In this situation, the metric $f$ is monotonically related to $V$: if $f$ increases, then so does $V$; however, it is not clear by how much $V$ increases. In some regimes, $f$ and $V$ may be tightly linked, whereas in others a small change in $f$ may correspond to a vast change in $V$.
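
As a tiny made-up illustration, suppose the true relationship were $V = e^{f}$, which is perfectly monotone; a one-unit gain in the metric would then correspond to wildly different gains in value depending on where in the space you are:

```python
import numpy as np

# A made-up monotone relationship between metric and value: V = exp(f).
# Any increase in f guarantees an increase in V, but by very different amounts.
f_values = np.array([1.0, 2.0, 9.0, 10.0])
V_values = np.exp(f_values)

print(V_values[1] - V_values[0])  # f: 1 -> 2 raises V by roughly 4.7
print(V_values[3] - V_values[2])  # f: 9 -> 10 raises V by roughly 13,900
```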

Not all systems admit an accurate value-measuring metric

Sometimes, because the value function is so complex, it is simply not possible to develop a metric that tracks it accurately enough to be relied upon. I believe this kind of situation is not uncommon: many “functions” or “mappings” in the real world are incredibly noisy, unintuitive, and non-linear. In fact, the very success of machine learning as a discipline is driven by its ability to learn complex mappings from data.

When one is confronted with an intractable value function, it is often wiser to admit this outright than to spend valuable time and energy on finding an elusive value-approximating metric. Relying on a poor value-approximator is a road to ruin and is exactly what Bezos warns against in his message on “proxies”.

Sometimes, one just has to admit defeat: We can’t quantify “good” even though we know it when we see it.

This is indeed a challenging situation to find oneself in. It means that one cannot rely upon an easy, automated way to assess the state of the system and make decisions. One may feel lost at sea without a compass! But I would argue that this situation can be navigated, and to do so, one must first acknowledge that one is lost. Once the limitations are acknowledged, organizations can plan around them rather than unknowingly optimize a poor proxy.

For example, if we know that evaluating $V$ requires human judgment that cannot be automated via a metric, we can allocate resources accordingly, such as by building processes that explicitly incorporate subjective assessment. Or, if the value function is difficult to pin down because the downstream use-cases of the system are too complex, one can simulate real-world use and assess how well the system performs in those simulated settings.