<p>Modern Descartes - Essays by Brian Lee (https://www.moderndescartes.com/essays)</p>
<h1>Bulk and Cut as a Metaphor for Codebase Growth</h1>
<p>2022-01-23</p>
<p>In bodybuilding, “bulk and cut” refers to a cycle of gaining weight (building both fat and muscle), and then losing weight (dropping fat while retaining muscle). Over multiple cycles, you grow muscle more efficiently than if you used an incremental muscle-building strategy. (Supposedly.)</p>
<p>As with lots of nutrition advice, evidence for bulk and cut is slim, but it’s a seductively simple idea. It’s easier to get your body to build muscle during a period of caloric excess, and afterwards you can focus on burning off fat while retaining your muscle growth.</p>
<p>Bulk and cut is a great metaphor for ideal codebase growth.</p>
<p>The bulk is when your codebase expands to meet new product requirements. During the bulk, your codebase accumulates “muscle” - useful functionality and abstractions. It also accumulates “fat” - redundant or obsolete functionality and abstractions.</p>
<p>The cut is when you trim the fat from your codebase, creating a cleaner basis from which to start your next bulk cycle. The result is a lean, value-creating codebase that you’re proud to show off, and a solid starting point for your next bulk and cut cycle.</p>
<p>One reason I like this metaphor is that you can empathize with the codebase that’s gone through the “bulk and bulk” program. Imagine how a bodybuilder might feel after one of those! They would feel bloated and unhappy with their body, and that’s just how an engineer would feel in that codebase. Cutting after bulking is the responsible way to grow your codebase/body.</p>
<p>Another reason I like this metaphor is that bulking is just as important as cutting. Building new code is so much more effective when you’re not worried about your muscle-to-fat ratio. Not only that, it’s much easier to figure out which code is fat and which code is muscle retrospectively than to try to figure it out prospectively.</p>
<p>There are obvious parallels with the concept of “tech debt” here. I like the bulk and cut metaphor better because the accrual of “fat” is a natural and inevitable byproduct of bulking, which is how many people try to explain this nebulous idea that “some tech debt is good”.</p>
<h1>Carbon Neutral Concrete - the numbers</h1>
<p>2021-08-02</p>
<p>I recently saw a <a href="https://news.ycombinator.com/item?id=28036927">post from Heimdal</a>, a startup claiming to do carbon-neutral concrete production. In this essay, I’ll explain why this is a problem worth working on, and take a rough stab at calculating whether the hard reality of chemistry and economics supports their business model.</p>
<h2 id="cement-and-carbon-dioxide">Cement and carbon dioxide</h2>
<p>Cement is an aggregate of lime a.k.a. calcium oxide (<span class="math inline">\(\textrm{CaO}\)</span>), silica (<span class="math inline">\(\textrm{SiO}_2\)</span>), aluminum oxide (<span class="math inline">\(\textrm{Al}_2\textrm{O}_3\)</span>), and other stuff. Cement is about 50-60% lime by mass, making it the majority component.</p>
<p>Unfortunately for the environment, calcium oxide is produced by mining limestone (<span class="math inline">\(\textrm{CaCO}_3\)</span>) and heating it at ~900 degrees Celsius until it decomposes to form calcium oxide and carbon dioxide. The net result is that concrete production is carbon-intensive. Heimdal proposes to fix this (and make money while doing so) by directly creating calcium oxide from seawater.</p>
<p>Actually, the largest consumer of lime isn’t even the concrete business: the iron smelting industry consumes the most lime <a href="https://www.eula.eu/wp-content/uploads/2019/02/A-Competitive-and-Efficient-Lime-Industry-Technical-report-by-Ecofys_0.pdf">(source)</a>. Anyway, this just bolsters the idea that carbon-free lime manufacture is a worthy goal. Heimdal’s sales pitch is a bit roundabout. They make some pitch about “carbon fixation” followed by “profit by selling the resulting limestone to the concrete industry”. But since the concrete industry is just going to burn the carbon dioxide off again, let’s just pretend the business is lime production.</p>
<h2 id="the-hard-chemical-reality">The hard chemical reality</h2>
<p>In this section, I compute the absolute theoretical minimum cost, based on the thermodynamics and chemistry of what Heimdal is proposing. This ignores a variety of other costs like anti-fouling membranes to purify the seawater, the cost of the alkaline sorbent, the cost of the overvoltage required to make the reaction proceed at a reasonable rate, capital costs for the plant itself, the cost to transport the very bulky products to where they are needed (quarries are local for a reason!), and so on.</p>
<p>With all those caveats in mind, let’s see what reality has to say about Heimdal’s business idea.</p>
<p>Heimdal talks about an alkaline sorbent (which is probably an <a href="https://en.wikipedia.org/wiki/Amine_gas_treating">amine scrubbing system</a>). This alkaline sorbent is catalytic, so we can completely ignore it for now.</p>
<p>The first reaction is the removal of calcium from seawater:</p>
<p><span class="math display">\[\begin{equation}
\textrm{Ca}^{2+} + 2\textrm{OH}^- \rightarrow \textrm{Ca(OH)}_2
\end{equation}\]</span></p>
<p>with a solubility product <span class="math inline">\(\textrm{K}_{sp}\)</span> of 6e-6, or <span class="math inline">\(\Delta G = -30\)</span>kJ/mol. (This number is negative, indicating an energetic downhill.)</p>
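<p>As a quick check of the conversion from solubility product to free energy, here is a minimal sketch. It assumes a temperature of 25 °C (298.15 K), which the essay does not state, and the variable names are purely illustrative.</p>

```python
import math

R = 8.314      # gas constant, J/(mol*K)
T = 298.15     # assumed temperature (25 C), not stated in the essay
K_SP = 6e-6    # solubility product of Ca(OH)2 quoted above

# Precipitation is the reverse of dissolution, so
# delta_G = -R*T*ln(1/K_sp) = R*T*ln(K_sp).
delta_g_kj = R * T * math.log(K_SP) / 1000

print(f"delta G of precipitation: {delta_g_kj:.1f} kJ/mol")  # about -30 kJ/mol
```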
<p>This reaction consumes hydroxide, and so we need a net source of hydroxide. Otherwise, the result is net ocean acidification, which would cause a net outgassing of carbon dioxide from the ocean into the atmosphere. That source of hydroxide is electrolysis, as hinted by their sales pitch mentioning the sale of hydrogen gas.</p>
<p><span class="math display">\[\begin{equation}
2\textrm{e}^- + 2\textrm{H}_2\textrm{O} \rightarrow 2\textrm{OH}^- + \textrm{H}_2
\end{equation}\]</span></p>
<p>Well, we’ve found our hydroxide, but now we’re short some electrons. We need an electron source. The two biggest electron sources used in electrolysis at a commercial scale are graphite (in <a href="https://en.wikipedia.org/wiki/Hall%E2%80%93H%C3%A9roult_process">aluminum production</a>) and chloride anions (in <a href="https://en.wikipedia.org/wiki/Chloralkali_process">chlorine production</a>). The graphite is cheaper and more effective but it’s consumed to form carbon dioxide, so that’s a nonstarter. Let’s assume chloride as a source of electrons.</p>
<p><span class="math display">\[\begin{equation}
2\textrm{Cl}^- \rightarrow 2\textrm{e}^- + \textrm{Cl}_2
\end{equation}\]</span></p>
<p>The last two half-reactions come together to create a redox reaction with an electric potential of 1.36V, or <span class="math inline">\(\Delta G = 262\)</span>kJ/mol. This number is positive, indicating a big energetic uphill which will be overcome by an electricity input.</p>
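<p>The conversion from cell potential to free energy follows the standard relation ΔG = nFE. A minimal sketch, assuming n = 2 electrons transferred per formula unit (as in the half-reactions above) and the usual value of the Faraday constant:</p>

```python
FARADAY = 96485.0   # charge per mole of electrons, C/mol
n_electrons = 2     # electrons transferred, from the half-reactions above
potential_v = 1.36  # cell potential quoted above, in volts

# Energy input needed to push the reaction uphill: delta_G = n * F * E
delta_g_kj = n_electrons * FARADAY * potential_v / 1000

print(f"delta G of electrolysis: {delta_g_kj:.0f} kJ/mol")  # about 262 kJ/mol
```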
<p><span class="math display">\[\begin{equation}
2\textrm{H}^+ + 2\textrm{Cl}^- \rightarrow \textrm{H}_2 + \textrm{Cl}_2
\end{equation}\]</span></p>
<p>Adding all of these reactions together we have so far:</p>
<p><span class="math display">\[\begin{equation}
2\textrm{H}^+ + 2\textrm{Cl}^- + \textrm{Ca}^{2+} + 2\textrm{OH}^- \rightarrow \textrm{H}_2 + \textrm{Cl}_2 + \textrm{Ca(OH)}_2
\end{equation}\]</span></p>
<p>For some bookkeeping, we add 2x water autodissociation reactions (each one having <span class="math inline">\(K_{w}\)</span> of 1e-14, <span class="math inline">\(\Delta G = -80\)</span>kJ/mol) and the quicklime hydration reaction with <span class="math inline">\(\Delta G = 20\)</span>kJ/mol. The final net reaction is:</p>
<p><span class="math display">\[\begin{equation}
2\textrm{Cl}^- + \textrm{Ca}^{2+} + \textrm{H}_2\textrm{O} \rightarrow \textrm{H}_2 + \textrm{Cl}_2 + \textrm{CaO}
\end{equation}\]</span></p>
<p>This reaction has an unadjusted cost of -30 + 262 + 2*-80 + 20 = 92 kJ/mol. In seawater, chloride ions have a concentration of ~0.5M, and calcium is about 0.01M. After adjusting for the unfavorable concentration gradients, we have a net energetic cost of <strong>107 kJ/mol</strong>.</p>
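<p>The concentration adjustment can be reproduced with the relation ΔG = ΔG° + RT ln Q. The sketch below takes the essay's per-step figures as given and assumes 25 °C, with water, solid CaO, and the product gases treated as having unit activity (an idealization on my part).</p>

```python
import math

R = 8.314e-3   # gas constant, kJ/(mol*K)
T = 298.15     # assumed temperature (25 C)

# Per-step figures exactly as summed in the essay, in kJ/mol
unadjusted = -30 + 262 + 2 * -80 + 20

chloride_molar = 0.5    # seawater chloride concentration from the essay
calcium_molar = 0.01    # seawater calcium concentration from the essay

# Reaction quotient Q = 1 / ([Cl-]^2 [Ca2+]); dilute reactants make Q > 1,
# which raises the energetic cost.
adjustment = -R * T * math.log(chloride_molar**2 * calcium_molar)
total = unadjusted + adjustment

print(f"unadjusted: {unadjusted} kJ/mol")
print(f"adjustment: {adjustment:.1f} kJ/mol")
print(f"total:      {total:.0f} kJ/mol")
```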
<h2 id="market-prices">Market prices</h2>
<p>Since we’re using electrolysis, let’s assume a market rate of 13 cents per kWh.</p>
<p><span class="math display">\[\frac{\textrm{107 kJ}}{\textrm{1 mol}} \cdot \frac{\textrm{1 mol}}{\textrm{56 g}} \cdot \frac{\textrm{1e6 g}}{\textrm{1 metric ton}} \cdot \frac{\textrm{1 kWh}}{\textrm{3600 kJ}} \cdot \frac{\textrm{13 cents}}{\textrm{1 kWh}} = \$70 / \textrm{metric ton CaO}\]</span></p>
<p>As a reminder, this is the theoretical minimum price, not taking into account any overhead or inefficiencies. In exchange for $70 of electricity, we get 1 metric ton of CaO, 36 kg of hydrogen gas, and 1.2 metric tons of chlorine gas.</p>
<p>How much money could we make this way? The market price of lime is $100/mton; hydrogen gas is $2/kg and chlorine gas is $250/mton. The market for lime is perhaps 100 times the size of the markets for hydrogen and chlorine gases, so at scale we’ll have market-distorting problems, but for now, let’s take the book price as given. In total, we can expect to make ~$475.</p>
<p>The physics says $70 in, $475 out. The big question now is, “What is the reality factor?” meaning the additional cost of all the overhead/inefficiencies. If we have a reality factor of greater than 7x, this would be a commercially unviable proposition.</p>
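<p>The cost and revenue arithmetic can be checked with a few lines. This is a sketch using the essay's prices and the 107 kJ/mol figure; the molar masses are standard values, and because the chlorine mass is left unrounded here, the revenue comes out slightly above the essay's ~$475.</p>

```python
ENERGY_KJ_PER_MOL = 107        # net energetic cost from the essay
ELECTRICITY_USD_PER_KWH = 0.13

MOLAR_MASS_CAO = 56.0          # g/mol, as used in the essay
MOLAR_MASS_H2 = 2.016          # g/mol
MOLAR_MASS_CL2 = 70.9          # g/mol

moles_per_ton = 1e6 / MOLAR_MASS_CAO

# Electricity cost per metric ton of CaO (1 kWh = 3600 kJ)
kwh_per_ton = ENERGY_KJ_PER_MOL * moles_per_ton / 3600
cost_usd = kwh_per_ton * ELECTRICITY_USD_PER_KWH

# One mole each of H2 and Cl2 is produced per mole of CaO
hydrogen_kg = moles_per_ton * MOLAR_MASS_H2 / 1e3
chlorine_tons = moles_per_ton * MOLAR_MASS_CL2 / 1e6

# Market prices quoted in the essay
revenue_usd = 100 * 1 + 2 * hydrogen_kg + 250 * chlorine_tons
break_even_factor = revenue_usd / cost_usd

print(f"electricity cost: ${cost_usd:.0f} per ton of CaO")
print(f"byproducts: {hydrogen_kg:.0f} kg H2, {chlorine_tons:.2f} t Cl2")
print(f"revenue: ${revenue_usd:.0f}, break-even reality factor: {break_even_factor:.1f}x")
```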
<h2 id="estimating-the-reality-factor">Estimating the reality factor</h2>
<p>Can we get an estimate for the reality factor?</p>
<p>I repeated this exact same analysis for the <a href="https://en.wikipedia.org/wiki/Chloralkali_process">Chloralkali process</a>, a 100-year-old industrial process that is used on the scale of tens of millions of tons annually. The result is that it costs a theoretical minimum of $235 in electricity to generate $250 of chlorine gas, $60 of hydrogen gas, and $140 of sodium hydroxide, for total revenue of $450. This suggests that a <em>commercially mature</em> process operating at tens of millions of tons has about a 2x reality factor.</p>
<p>So I think there’s space for this idea to be profitable. (From me, that’s a ringing endorsement!)</p>
<p>More than just profitability, can they make a dent in our carbon problem? The scale of our carbon problem is on the order of 30 billion tons of CO2 per year, of which maybe 1 billion tons are due to lime production. Heimdal is currently at the scale of 1 ton per year - <strong>nine orders of magnitude</strong> away from making a difference. Even assuming they sustain a ridiculous, Silicon Valley-style growth rate of 50% year over year, it will take them about 50 years to grow to the point where they are actually making a dent in our carbon problem. I wish them good luck.</p>
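<p>The 50-year figure follows from compound growth. A minimal sketch, assuming a starting scale of 1 ton per year and a target of 1 billion tons per year as stated above:</p>

```python
import math

start_tons_per_year = 1.0
target_tons_per_year = 1e9   # nine orders of magnitude larger
annual_growth = 1.5          # 50% year-over-year growth

# Solve start * growth**years = target for years
years = math.log(target_tons_per_year / start_tons_per_year) / math.log(annual_growth)

print(f"years of sustained 50% growth needed: {years:.0f}")
```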
<h1>Homeostasis and Volatility</h1>
<p>2021-03-28</p>
<p>Nature is remarkably resilient to a wide range of environmental conditions. One of the key strategies involved is <em>homeostasis</em>: the use of <a href="https://en.wikipedia.org/wiki/Negative_feedback">negative feedback loops</a> to respond to fluctuations in the environment.</p>
<p>Homeostasis is a strategy that works well as long as the variation doesn’t exceed the system’s capacity. This capacity is typically designed or evolved (in Nature) in response to some historical range of experience. Levees might be designed to withstand a hundred-year flood; supply chains develop enough of a buffer to smooth over monthly fluctuations in supply and demand. However, homeostasis has its limits.</p>
<p>In this essay, I’ll talk about some of Nature’s homeostatic systems and the parallels in humanity’s response to COVID-19.</p>
<h2 id="trees-and-climatological-homeostasis">Trees and Climatological Homeostasis</h2>
<p>Trees are not the first thing that comes to mind when homeostasis is mentioned. We typically see trees as passive endurers of the environment. In reality, trees are active participants in their environment.</p>
<p>In areas with frequent and light rains, tree roots grow shallow and wide, whereas in areas with infrequent and heavy rains, tree roots grow deep and narrow. This is ideal because light rain won’t go deep into the ground before evaporating, but heavy rains will penetrate more deeply into the ground.</p>
<p>In gusty areas, tree branches tend to grow thick, but in calmer areas, tree branches grow long and thin, an effect that all greenhouse owners are aware of. Thicker branches are less likely to break in high winds, but longer branches grant increased access to sunlight.</p>
<p>While it’s interesting that trees have figured this all out, even more interesting is how trees execute this optimal algorithm. One simple guess is evolution: species with shallow roots are favored in areas with light rain, and species with deep roots are favored in areas with heavy rain. While true, it’s not the dominant mechanism.</p>
<p>The actual algorithms are quite simple and fiendishly clever. Root growth is <a href="https://pubmed.ncbi.nlm.nih.gov/15642523/">favored in locations where the soil is more damp</a>. As rainwater flows over the landscape and percolates through the soil, trees will grow roots exactly where the water is. Similarly, branch growth is <a href="https://en.wikipedia.org/wiki/Reaction_wood">favored in areas of high mechanical stress</a>. As trees flex in response to winds and to their own weight under gravity, they grow additional wood in exactly the areas that need reinforcement.</p>
<p>These decentralized homeostatic algorithms are incredibly powerful. The economic forces of supply and demand are decentralized homeostatic algorithms that enable modern supply chains and economies to exist.</p>
<h2 id="environmental-drift-and-extinction-events">Environmental drift and extinction events</h2>
<p>Trees’ adaptive strategies help them tailor their growth to their environment. However, these strategies fail in extreme events, like hurricane-strength winds or severe drought.</p>
<p>Why doesn’t Nature do more to prepare for extreme events? To state the obvious, predicting the future is hard, and resources are always scarce in the rat race of evolution. As long as extreme events occur predictably infrequently, evolution will discover the optimal level of gambling on success. Think of it as Nature’s <a href="https://en.wikipedia.org/wiki/Kelly_criterion">Kelly criterion</a>.</p>
<p>What if the environment drifts, causing extreme events to happen more frequently? For example, as sea levels rise, what used to be a 100-year flooding event may become a 10-year or annual flooding pattern. When change is rapid enough, Nature is doomed to suffer setbacks. The earth’s history is full of extinction events, where entire species disappear because not a single individual of that species could cope. Still, life as a whole keeps on chugging along, because among the diversity of species on Earth, some will survive.</p>
<p>Yet, we would be foolish to ascribe any planning motivation to Nature. Nature did not diversify its species because it was trying to anticipate or prepare for extinction events. Nature diversified its species as a side effect of each species optimizing for its own niche. Millions of microclimates and predator-prey relationships lead to as many species.</p>
<p>To draw an analogy to human systems, what this implies is that diversity cannot or should not be <em>planned</em> as it is not an evolutionarily stable equilibrium. Instead, diversity should arise as the natural solution to a diversity of ecological niches. With many companies, many cultures, many governments, many ecosystems, a humanity-wide robustness to adversity arises.</p>
<p>A contemporary example of this is the contrast between Western and Asian responses to COVID-19. Asian countries, thanks to their recent experiences with <a href="https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome">SARS</a> and <a href="https://en.wikipedia.org/wiki/Middle_East_respiratory_syndrome">MERS</a>, have had a strong cultural immune response to COVID. As a result, they had broad community support and cooperation in contact tracing, mask wearing, and lockdown as necessary. Western countries, on the other hand, have suffered from an <a href="https://marginalrevolution.com/marginalrevolution/2020/03/herd-immunity-time-consistency-and-the-epidemic-yoyo.html">epidemic yo-yo</a>, failing to convince their citizens of the steps they need to take to curtail the pandemic’s effects. The situation is akin to the European colonizers’ introduction of infectious diseases to North America.</p>
<h2 id="stress-induced-mutagenesis">Stress-induced mutagenesis</h2>
<p>So far, I’ve characterized Nature as a diversity of species, some of which roll over and die every so often in response to environmental drift. This is, of course, a simplification. Actually, evolution has figured out some very clever ways to respond to environmental drift.</p>
<p>Bacteria are capable of <em>stress-induced mutation</em>, in which environmental stress triggers a higher mutation rate. Mutation is normally harmful, in that the vast majority of genetic change is detrimental. However, a higher mutation rate allows for the discovery of new mutations that may allow some of the bacteria’s descendants to survive future environmental changes.</p>
<p>This is somewhat of an evolutionary magic trick. How does evolution favor the preservation of a mutagenesis gene whose sole purpose is to shoot its own foot (on average) when the going gets rough?</p>
<p>The answer is robustness to environmental volatility. Imagine a population of bacteria, some individuals with a working mutagenesis gene, some without. When environmental shifts occur, a large fraction of individuals begin to die out - both those with and without the gene. If the environmental shift persists long enough, the individuals without a mutagenesis gene die out completely, but the individuals with a working mutagenesis gene will have adapted, surviving to reestablish the population.</p>
<p>The U.K. and U.S. responses to COVID-19 are a demonstration of cultural mutagenesis (or lack thereof). The U.K. has demonstrated itself capable of rapidly adopting strategies like <a href="https://www.technologyreview.com/2021/03/02/1020169/should-the-us-start-prioritizing-first-vaccine-doses-to-beat-the-variants/">first doses first</a>, whereas the U.S. sticks with the boring but suboptimal two-dose strategy. Of course, mutagenesis can also generate suboptimal strategies, like the U.K.’s initial embrace of the “cocoon strategy”, in which the elderly and vulnerable were encouraged to cocoon while the rest of society was encouraged to carry on in their path to herd immunity. Still, I think the U.K.’s ability to try unconventional strategies bodes well for its future success as a country.</p>
<h2 id="a-plea-for-cultural-acceptance">A plea for cultural acceptance</h2>
<p>It is rarely the case that one culture is strictly superior to another. Yet, that’s what our red-blue political leaders would like us to believe. Cultural diversity is a source of resilience and new ideas to problems, and we should embrace it. It is not difficult to find something to admire in each different culture, and to bring a part of it back to your own life.</p>
<h1>Data-oriented Programming in Python</h1>
<p>2020-09-13</p>
<p>Many users of Python deprioritize performance in favor of soft benefits like ergonomics, business value, and simplicity. Users who prioritize performance typically end up on faster compiled languages like C++ or Java.</p>
<p>One group of users is left behind, though. The scientific computing community has lots of raw data they need to process, and would very much like performance. Yet, they struggle to move away from Python, because of network effects, and because Python’s beginner-friendliness is appealing to scientists for whom programming is not a first language. So, how can Python users achieve some fraction of the performance that their C++ and Java friends enjoy?</p>
<p>In practice, scientific computing users rely on the NumPy family of libraries, e.g. NumPy, SciPy, TensorFlow, PyTorch, CuPy, JAX, etc. The sheer proliferation of these libraries suggests that the NumPy model is getting something right. In this essay, I’ll talk about what makes NumPy so effective, and where the next generation of Python numerical computing libraries (e.g. TensorFlow, PyTorch, JAX) seems to be headed.</p>
<h2 id="data-good-pointers-bad">Data good, pointers bad</h2>
<p>A pesky fact of computing is that computers can compute far faster than we can deliver data to compute on. In particular, data transfer <em>latency</em> is the Achilles’ heel of data devices (both RAM and storage). Manufacturers disguise this weakness by emphasizing improvements in data transfer <em>throughput</em>, but latency continues to stagnate. Ultimately, this means that any chained data access patterns, where one data retrieval must be completed before the next may proceed, are the worst case for computers.</p>
<p>These worst-case chained data access patterns are unfortunately quite common – so common that they have a name you may be familiar with: a pointer.</p>
<p>Pointers have always been slow. In the ’80s and ’90s, our hard drives were essentially optimized record players, with a read head riding on top of a spinning platter. These hard drives had physical limitations: The disk could only spin so fast without shattering, and the read head was also mechanical, limiting its movement speed. Disk seeks were slow, and the programs that were most severely affected were databases. Some ways that databases dealt with these physical limitations are:</p>
<ul>
<li>Instead of using binary trees (requiring <span class="math inline">\(\log_2 N\)</span> disk seeks), B-trees with a much higher branching factor <span class="math inline">\(k\)</span> were used, only requiring <span class="math inline">\(\log_k N\)</span> disk seeks.</li>
<li>Indices were used to query data without having to read the full contents of each row.</li>
<li>Vertically-oriented databases optimized for read-heavy workloads (e.g. summary statistics over one field, across entire datasets), by reorganizing from <a href="https://en.wikipedia.org/wiki/AoS_and_SoA">arrays of structs to structs of arrays</a>. This maximized effective disk throughput, since no extraneous data was loaded.</li>
</ul>
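<p>To make the first point in the list above concrete, here is a small sketch that counts the worst-case number of levels (and hence disk seeks) needed to reach a record. The record count of one billion and the branching factor of 256 are illustrative assumptions, not figures from any particular database.</p>

```python
def tree_depth(num_records, branching_factor):
    """Smallest depth d such that branching_factor**d >= num_records."""
    depth = 0
    capacity = 1
    while capacity < num_records:
        capacity *= branching_factor
        depth += 1
    return depth


num_records = 10**9  # assumed dataset size

binary_seeks = tree_depth(num_records, 2)     # binary tree: one seek per level
btree_seeks = tree_depth(num_records, 256)    # B-tree with an assumed fan-out of 256

print(f"binary tree: {binary_seeks} seeks, B-tree: {btree_seeks} seeks")
```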
<p>Today, compute speed is roughly <span class="math inline">\(10^5 - 10^6\)</span> times faster than in 1990. Today, RAM is roughly <span class="math inline">\(10^5\)</span> times faster than HDDs from 1990. I was amused and unsurprised to find that Raymond Hettinger’s <a href="https://www.youtube.com/watch?v=npw4s1QTmPg">excellent talk on the evolution of Python’s in-memory <code>dict</code> implementation</a> plays out like a brief history of early database design. Time, rather than healing things, has only worsened the compute-memory imbalance.</p>
<h2 id="numpys-optimizations">NumPy’s optimizations</h2>
<h3 id="boxing-costs">Boxing costs</h3>
<p>In many higher-level languages, raw data comes in boxes containing metadata and a pointer to the actual data. In Python, the PyObject box holds reference counts, so that the garbage collector can operate generically on all Python entities.</p>
<p>Boxing creates two sources of inefficiency:</p>
<ul>
<li>The metadata bloats the data, reducing the data density of our expensive memory.</li>
<li>The pointer indirection creates another round trip of memory retrieval latency.</li>
</ul>
<p>A NumPy array can hold many raw data within a single PyObject box, <em>provided that all of those data are of the same type</em> (int32, float32, etc.). By doing this, NumPy amortizes the cost of boxing over multiple data.</p>
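<p>A rough way to see the boxing overhead is to compare the memory used by a list of Python floats with a NumPy array holding the same values. This is only a sketch: <code>sys.getsizeof</code> reports shallow sizes, the exact numbers depend on the Python build, and the variable names are illustrative.</p>

```python
import sys

import numpy as np

n = 100_000

# A list stores pointers; every float it points to is its own boxed PyObject.
boxed = [float(i) for i in range(n)]
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)

# A NumPy array stores the raw 4-byte values back to back inside a single box.
packed = np.arange(n, dtype=np.float32)
packed_bytes = packed.nbytes

print(f"list of floats: {boxed_bytes / n:.1f} bytes per element")
print(f"float32 array:  {packed_bytes / n:.1f} bytes per element")
```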
<p>In <a href="/essays/deep_dive_mcts">my previous investigations into Monte Carlo tree search</a>, a naive UCT implementation performed poorly because it instantiated millions of UCTNode objects whose sole purpose was to hold a handful of float32 values. In the optimized UCT implementation, these nodes were replaced with NumPy arrays, reducing memory usage by a factor of 30.</p>
<h3 id="attribute-lookup-function-dispatch-costs">Attribute lookup / function dispatch costs</h3>
<p>Python’s language design forces an unusually large amount of pointer chasing. I mentioned boxing as one layer of pointer indirection, but really it’s just the tip of the iceberg.</p>
<p>Python has no problem handling the following code, even though each of these multiplications invokes a completely different implementation.</p>
<pre><code>>>> mixed_list = [1, 1.0, 'foo', ('bar',)]
>>> for obj in mixed_list:
...     print(obj * 2)
...
2
2.0
foofoo
('bar', 'bar')</code></pre>
<p>Python accomplishes this with a minimum of two layers of pointer indirection:</p>
<ol type="1">
<li>Look up the type of the object.</li>
<li>Look up and execute the <code>__mul__</code> function from that type’s operation registry.</li>
</ol>
<p>Additional layers of pointer indirection may be required if the <code>__mul__</code> method is defined on a superclass: the chain of superclasses must be traversed, one pointer at a time, until an implementation is found.</p>
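<p>The two lookup steps can be spelled out by hand. The sketch below is a simplification: the real binary-operator protocol also considers the reflected <code>__rmul__</code> method, and CPython performs these lookups through C-level type slots rather than Python attribute access. The <code>Base</code> and <code>Child</code> classes are hypothetical, added only to show the superclass traversal.</p>

```python
mixed_list = [1, 1.0, "foo", ("bar",)]

for obj in mixed_list:
    obj_type = type(obj)          # layer 1: follow the object's pointer to its type
    multiply = obj_type.__mul__   # layer 2: find __mul__ on that type
    assert multiply(obj, 2) == obj * 2


class Base:
    def __mul__(self, other):
        return "Base.__mul__"


class Child(Base):
    pass  # no __mul__ here, so the lookup continues to Base


# The method resolution order is the chain that gets walked during lookup.
mro_names = [cls.__name__ for cls in Child.__mro__]
child_result = Child() * 2

print(mro_names)
print(child_result)
```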
<p>Attribute lookup is similarly fraught; <code>@property</code>, <code>__getattr__</code>, and <code>__getattribute__</code> provide users with flexibility that incurs pointer chasing overhead with something as simple as executing <code>a.b</code>. Access patterns like <code>a.b.c.d</code> create exactly the chained data access patterns that are a worst-case for data retrieval latency.</p>
<p>To top it all off, merely <em>resolving</em> the object is expensive: there’s a stack of lexical scopes (local, nonlocal, then global) that are checked in order to find the variable name. Each check requires a dictionary lookup, another source of pointer indirection.</p>
<p>As the saying goes: “We can solve any problem by introducing an extra level of indirection… except for the problem of too many levels of indirection”. The NumPy family of libraries deals with this indirection, not by removing it, but again by sharing its cost over multiple data.</p>
<pre><code>>>> import numpy as np
>>> homogenous_array = np.arange(5, dtype=np.float32)
>>> multiply_by_two = homogenous_array * 2
>>> multiply_by_two
array([0., 2., 4., 6., 8.], dtype=float32)</code></pre>
<p>Sharing a single box for multiple data allows NumPy to retain the expressiveness of Python while minimizing the cost of the dynamism. As before, this works because of the additional constraint that all data in a NumPy array must have identical type.</p>
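<p>One way to observe this amortization is to time the same multiplication done element by element in pure Python and done once on a whole array. This is a sketch rather than a rigorous benchmark: the array size and repetition count are arbitrary, and the measured ratio will vary by machine.</p>

```python
import timeit

import numpy as np

n = 100_000
values = list(range(n))
array = np.arange(n, dtype=np.float32)


def pure_python():
    # Type lookup and __mul__ dispatch happen once per element.
    return [v * 2 for v in values]


def vectorized():
    # Type lookup and dispatch happen once for the whole array.
    return array * 2


t_pure = timeit.timeit(pure_python, number=20)
t_vec = timeit.timeit(vectorized, number=20)

print(f"pure Python: {t_pure:.4f} s, NumPy: {t_vec:.4f} s")
```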
<h2 id="the-frontier-jit">The Frontier: JIT</h2>
<p>So far, we’ve seen that NumPy doesn’t solve any of Python’s fundamental problems when it comes to pointer overhead. Instead, it merely puts a bandaid on the problem by sharing those costs across multiple data. It’s a pretty successful strategy – in my hands (<a href="/essays/vectorized_pagerank">1</a>, <a href="/essays/deep_dive_mcts">2</a>), I find that NumPy can typically achieve 30-60x speedups over pure Python solutions to dense numerical code. However, given that C code typically achieves <a href="https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/python3-gcc.html">100-200x performance</a> over pure Python on dense numerical code (common in scientific computing), it would be nice if we could further reduce the Python overhead.</p>
<p>Tracing <a href="https://en.wikipedia.org/wiki/Just-in-time_compilation">JITs</a> promise to do exactly this. Roughly, the strategy is to trace the execution of the code and record the pointer chasing outcomes. Then, when you call the same code snippet, reuse the recorded outcomes! NumPy amortizes Python overhead over multiple data, and JIT amortizes Python overhead over multiple function calls.</p>
<p>(I should note that I’m most familiar with the tracing JITs used by TensorFlow and JAX. <a href="https://doc.pypy.org/en/latest/">PyPy</a> and <a href="https://numba.pydata.org/">Numba</a> are two alternate JIT implementations that have a longer history, but I don’t know enough about them to treat them fairly, so my apologies to readers.)</p>
<p>Tracing unlocks many wins typically reserved for compiled languages. For example, once you have the entire trace in one place, operations can be fused together (e.g., to make use of the <a href="https://en.wikipedia.org/wiki/FMA_instruction_set">fused multiply-add instructions</a> common to most modern computers), memory layouts can be optimized, and so on. TensorFlow’s <a href="https://www.tensorflow.org/guide/graph_optimization">Grappler</a> is one such implementation of this idea. Traces can also be <a href="https://en.wikipedia.org/wiki/Backpropagation">walked backwards</a> to automatically compute derivatives. Traces can be compiled for different hardware configurations, so that the same Python code executes on CPU, GPU, and TPU. JAX can <a href="https://jax.readthedocs.io/en/latest/notebooks/quickstart.html#Auto-vectorization-with-vmap">autovectorize traces</a>, adding a batch dimension to all operations. Finally, a trace can be exported in a language-agnostic manner, allowing a program defined in Python to be executed in <a href="https://www.tensorflow.org/js">Javascript</a>, <a href="https://www.tensorflow.org/tfx/guide/serving">C++</a>, or more.</p>
<p>Unsurprisingly, there’s a catch to all this. NumPy can amortize Python overhead over multiple data, but only if that data is the same type. JIT can amortize Python overhead over multiple function calls, but only if the function calls would have resulted in the same pointer chasing outcomes. Retracing the function to verify this would defeat the purpose of JIT, so instead, TensorFlow/JAX JIT uses array shape and dtype to guess at whether a trace is reusable. This heuristic is necessarily conservative, rules out otherwise legal programs, often requires unnecessarily specific shape information, and doesn’t make any guarantees against mischievous tinkering. Furthermore, data-dependent tracing is a known issue (<a href="https://pytorch.org/docs/stable/generated/torch.jit.trace.html">1</a>, <a href="https://jax.readthedocs.io/en/latest/notebooks/Common_Gotchas_in_JAX.html#python-control-flow-+-JIT">2</a>). I worked on <a href="https://blog.tensorflow.org/2018/07/autograph-converts-python-into-tensorflow-graphs.html">AutoGraph</a>, a tool to address data-dependent tracing. Still, the engineering benefits of a shared tracing infrastructure are too good to pass up. I expect to see JIT-based systems flourish in the future and iron out their user experience.</p>
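<p>The record-and-reuse idea, and the way it can go wrong, can be illustrated with a toy tracer. This is a deliberately tiny sketch and not how TensorFlow or JAX are implemented: <code>Tracer</code> and <code>toy_jit</code> are hypothetical names, only multiplication and negation are recorded, and traces are cached by array shape and dtype.</p>

```python
import numpy as np


class Tracer:
    """Wraps a concrete array and records every operation applied to it."""

    def __init__(self, value, tape):
        self.value = value
        self.tape = tape

    def __mul__(self, other):
        self.tape.append(lambda v: v * other)
        return Tracer(self.value * other, self.tape)

    def __neg__(self):
        self.tape.append(lambda v: -v)
        return Tracer(-self.value, self.tape)

    def sum(self):
        # Returns a concrete number, so Python control flow sees real data
        # and the branch taken gets baked into the recorded trace.
        return self.value.sum()


def toy_jit(fn):
    cache = {}

    def wrapped(x):
        key = (x.shape, x.dtype)
        if key not in cache:
            tape = []
            fn(Tracer(x, tape))  # run the Python code once, recording operations
            cache[key] = tape
            wrapped.trace_count += 1
        out = x
        for op in cache[key]:    # replay the recorded operations, skipping Python
            out = op(out)
        return out

    wrapped.trace_count = 0
    return wrapped


@toy_jit
def f(x):
    if x.sum() > 0:  # data-dependent control flow
        return x * 2
    return -x


a = f(np.array([1.0, 2.0]))    # traced: the "x * 2" branch is recorded
b = f(np.array([-1.0, -2.0]))  # same shape and dtype: trace reused, wrong branch
c = f(np.ones(3))              # new shape: retraced

print(a, b, c, f.trace_count)
```

<p>The second call returns the doubled array even though plain Python would have taken the negation branch, which is the data-dependent tracing hazard in miniature.</p>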
<h2 id="conclusion">Conclusion</h2>
<p>The NumPy API specifically addresses Python’s performance problems for the kinds of programs that scientific computing users want to write. It encourages users to write code in ways that minimize pointer overhead. Coincidentally, this way of writing code is a fruitful abstraction for tracing JITs targeting vastly parallel computing architectures like GPU and TPU. (Some people argue that <a href="https://dl.acm.org/citation.cfm?id=3321441">machine learning is stuck in a rut</a> due to this NumPy monoculture.) In any case, tracing JITs built on top of NumPy-like APIs are flourishing, and they are by far the easiest way to access the exponentially growing compute available on the cloud. It’s a good time to be a Python programmer in machine learning!</p>
<p>Thanks to Alexey Radul for commenting on drafts of this essay.</p>
<h1>Vectorizing Graph Algorithms</h1>
<p>2020-09-13</p>
<p>Graph convolutional neural networks (GNNs) have become <a href="https://trends.google.com/trends/explore?date=2010-07-22%202020-08-22&q=graph%20convolutional%20networks,graph%20neural%20networks">increasingly popular</a> over the last few years. Deep neural networks have reinvented multiple fields (image recognition, language translation, audio recognition/generation), so the appeal of GNNs for graph data is self-evident.</p>
<p>GNNs present a computational challenge to most deep learning libraries, which are typically optimized for dense matrix multiplications and additions. While graph algorithms can typically be expressed as <span class="math inline">\(O(N^3)\)</span> dense matrix multiplications on adjacency matrices, such representations are inefficient as most real-world graphs are sparse, only requiring <span class="math inline">\(O(N)\)</span> operations in theory. On the other hand, sparse representations like adjacency lists have ragged shapes, which are not typical fare for deep learning libraries.</p>
<p>In this essay, I’ll demonstrate how to use NumPy APIs to implement computationally efficient GNNs for sparse graphs, using PageRank as a stand-in for GNNs.</p>
<h2 id="a-simplified-gnn-pagerank">A simplified GNN: PageRank</h2>
<p>In a GNN, each node/edge contains some information, and over multiple iterations, every node passes information to its neighbors, who update their state in response. There’s a lively subfield of machine learning exploring all sorts of variations on how this information is sent, received, and acted upon. <a href="https://arxiv.org/abs/1806.01261">(See this review for more)</a></p>
<p>For our investigation into graph representations, let’s stick to the simplest possible variation - one in which each node sends one number to its neighbors and each neighbor simply averages the information coming in.</p>
<p>As it turns out, this variation was invented several decades ago, and is most popularly known as <a href="https://en.wikipedia.org/wiki/PageRank">PageRank</a>.</p>
<p>In PageRank, you have each node (a website) distribute its importance score evenly to its neighbors (websites it links to). Iterate until the importance scores converge, and that gives you the importance of each website within the Internet. To handle disjoint subgraphs and to numerically stabilize the algorithm, a fraction of each node’s score is redistributed evenly to every other node in the graph.</p>
<p>As far as linear algebra goes, there’s a closed-form solution, but I’ll approximate the answer by executing the iterative procedure as described above.</p>
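<p>For the curious, here’s a minimal sketch of what the closed-form route could look like, assuming a row-stochastic transition matrix <code>P</code>. At convergence the scores satisfy <code>s = d * (s @ P) + (1 - d) / N</code>, which is a linear system that can be solved directly. The function names and the tiny 3-node graph are illustrative assumptions, not part of the benchmark code.</p>

```python
import numpy as np

def pagerank_closed_form(P, d=0.85):
    """Solve s = d * (s @ P) + (1 - d) / N directly; P must be row-stochastic."""
    N = P.shape[0]
    # Rearranged: s @ (I - d * P) = b, i.e. (I - d * P).T @ s = b
    b = np.full(N, (1 - d) / N)
    return np.linalg.solve((np.eye(N) - d * P).T, b)

def pagerank_iterative(P, num_iterations=100, d=0.85):
    """The iterative procedure described above, for comparison."""
    N = P.shape[0]
    score = np.ones(N) / N
    for _ in range(num_iterations):
        score = d * (score @ P) + (1 - d) / N
    return score

# Hypothetical 3-node graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
P = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
closed = pagerank_closed_form(P)
iterated = pagerank_iterative(P)
```

<p>The direct solve costs a dense factorization, so it only makes sense for small graphs; the iterative procedure is the one that scales.</p>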
<p>I’ll note that the choice of PageRank as a benchmark favors methods with low overhead, because the amount of computation being done is so small (just a single number being passed along each edge in each iteration). In a typical GNN, a vector would be passed instead, allowing more work to be dispatched for the same overhead.</p>
<h2 id="implementing-pagerank">Implementing PageRank</h2>
<h3 id="naive-solution-python-adjacency-list">Naive solution (Python adjacency list)</h3>
<p>First, here’s the naive solution, which looks exactly like you would expect, given my verbal description.</p>
<pre><code>def pagerank_naive(N, num_iterations=100, d=0.85):
    node_data = [{'score': 1.0/N, 'new_score': 0} for _ in range(N)]
    adj_list = adjacency_list(N)
    for _ in range(num_iterations):
        for from_id, to_ids in enumerate(adj_list):
            score_packet = node_data[from_id]['score'] / len(to_ids)
            for to_id in to_ids:
                node_data[to_id]['new_score'] += score_packet
        for data_dict in node_data:
            data_dict['score'] = data_dict['new_score'] * d + (1 - d) / N
            data_dict['new_score'] = 0
    return np.array([data_dict['score'] for data_dict in node_data])</code></pre>
<h3 id="dense-numpy-solution-numpy-adjacency-matrix">Dense NumPy solution (NumPy adjacency matrix)</h3>
<p>The adjacency matrix solution is probably the most elegant implementation of PageRank.</p>
<pre><code>def pagerank_dense(N, num_iterations=100, d=0.85):
    adj_matrix = adjacency_matrix(N)
    transition_matrix = adj_matrix / np.sum(adj_matrix, axis=1, keepdims=True)
    transition_matrix = d * transition_matrix + (1 - d) / N
    score = np.ones([N], dtype=np.float32) / N
    for _ in range(num_iterations):
        score = score @ transition_matrix
    return score</code></pre>
<p>This algorithm most closely matches the math involved. It’s also the fastest of the Python implementations as long as N is small (N < 1000). As mentioned in the introduction, it has one fatal flaw: because it is based on adjacency matrices, memory usage scales as <span class="math inline">\(O(N^2)\)</span>, and computations scale as <span class="math inline">\(O(N^3)\)</span>, regardless of the sparsity of the graph. Since most real graphs are sparse, this solution wastes a lot of time multiplying and adding zeros, and does not scale well. At N = 3,000, the original naive solution matches the dense array solution in overall speed.</p>
<h3 id="sparse-solution-numpy-flattened-adjacency-list">Sparse solution (NumPy flattened adjacency list)</h3>
<p>To scale to larger sparse graphs, we’ll have to use a graph representation that is more compact than an adjacency matrix. The original naive implementation used adjacency lists, so let’s optimize that.</p>
<p>For the most part, the conversion from Python to NumPy is fairly straightforward. The difficulty comes from converting the loops over each node’s adjacency list, which requires rethinking our data representation. An adjacency list is a list of lists of varying lengths (also called a “ragged” array). Since vectorized operations are most efficient with rectangular shapes, we should find some way to coerce this data into rectangular form. (SciPy and other libraries have native support for sparse matrices, but I find it more understandable to manually implement sparsity.)</p>
<p>The simplest way to get rectangular adjacency lists is to pad each adjacency list to the length of the longest such list using some sentinel value.</p>
<pre><code>adj_list = [[0], [1, 2, 3], [4, 5], [6], [7, 8, 9]]
# padded version, using -1 as a sentinel value
[[0, -1, -1], [1, 2, 3], [4, 5, -1], [6, -1, -1], [7, 8, 9]]</code></pre>
<p>However, this doesn’t work very well when there’s a long tail of highly-connected nodes, which is common in real-world graphs. Every node’s adjacency list would be excessively padded, negating any efficiency gains.</p>
<p>The next simplest way to do this is to just flatten the adjacency lists into one long array:</p>
<pre><code>adj_list = [[0], [1, 2, 3], [4, 5], [6], [7, 8, 9]]
# flattened version
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]</code></pre>
<p>This representation loses information about where each sublist begins and ends. To salvage the situation, we’ll store the originating list index in a separate array:</p>
<pre><code>adj_list = [[0], [1, 2, 3], [4, 5], [6], [7, 8, 9]]
# becomes...
from_nodes = [0, 1, 1, 1, 2, 2, 3, 4, 4, 4]
to_nodes = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]</code></pre>
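<p>Here’s a minimal sketch of that conversion in NumPy. The helper name <code>flatten_adjacency_list</code> is my own for this illustration, and it assumes a non-empty adjacency list.</p>

```python
import numpy as np

def flatten_adjacency_list(adj_list):
    """Flatten a ragged adjacency list into parallel (from_nodes, to_nodes) arrays."""
    lengths = np.array([len(to_ids) for to_ids in adj_list])
    # Repeat each source index once per outgoing edge.
    from_nodes = np.repeat(np.arange(len(adj_list)), lengths)
    # Concatenate all the destination lists into one long array.
    to_nodes = np.concatenate(
        [np.asarray(to_ids, dtype=np.int64) for to_ids in adj_list])
    return from_nodes, to_nodes

adj_list = [[0], [1, 2, 3], [4, 5], [6], [7, 8, 9]]
from_nodes, to_nodes = flatten_adjacency_list(adj_list)
```

<p>Each position in the two arrays describes one edge, so both arrays always have the same length regardless of how ragged the original lists were.</p>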
<p>These two arrays are essentially pointers in vectorized form. With a flattened adjacency list representation, we can vectorize the core loop.</p>
<pre><code>for node, out_nodes in enumerate(adj_list):
    score_packet = g.nodes[node]['score'] / len(out_nodes)
    for out_node in out_nodes:
        g.nodes[out_node]['new_score'] += score_packet
# becomes...
# NumPy dialect
score_packets = score[from_nodes]
new_score = np.zeros_like(score)
np.add.at(new_score, to_nodes, score_packets)
# TensorFlow dialect
score_packets = tf.gather(score, from_nodes)
new_score = tf.math.unsorted_segment_sum(score_packets, to_nodes, N)</code></pre>
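<p>To show how the gather and scatter-add steps fit into a full iteration, here’s a self-contained sketch in the NumPy dialect. It is an illustration rather than the exact benchmark code: the name <code>pagerank_sparse</code>, the explicit division by out-degree, and the 3-node graph are assumptions, and the sketch further assumes every node has at least one outgoing edge.</p>

```python
import numpy as np

def pagerank_sparse(from_nodes, to_nodes, N, num_iterations=100, d=0.85):
    """PageRank on a flattened adjacency list (assumes no node has zero out-edges)."""
    # Out-degree of each node, so that each node splits its score evenly.
    out_degree = np.bincount(from_nodes, minlength=N)
    score = np.ones(N) / N
    for _ in range(num_iterations):
        # Gather: one packet per edge, read from the edge's source node.
        score_packets = score[from_nodes] / out_degree[from_nodes]
        # Scatter-add: accumulate packets at each edge's destination node.
        new_score = np.zeros_like(score)
        np.add.at(new_score, to_nodes, score_packets)
        score = d * new_score + (1 - d) / N
    return score

# Hypothetical 3-node graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
from_nodes = np.array([0, 0, 1, 2])
to_nodes = np.array([1, 2, 2, 0])
score = pagerank_sparse(from_nodes, to_nodes, N=3)
```

<p>As a design note, <code>np.bincount(to_nodes, weights=score_packets, minlength=N)</code> computes the same scatter-add and can be swapped in for the <code>np.add.at</code> call.</p>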
<p>This representation has other benefits. Pointer latency is the bane of computing (a topic I explore <a href="/essays/data_oriented_python">here</a>), and yet pointers are a necessary layer of indirection in graphs. Modern computers try to ameliorate the situation with a hierarchy of caches, where each successive layer has <a href="https://gist.github.com/jboner/2841832">lower latency, but also lower capacity</a>. By reimplementing 64-bit pointers as a compact vector of 16-bit integer offsets into a contiguous array, our data fits better into these caches and benefits from the faster data latency.</p>
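<p>A quick sketch of the size difference, using made-up edge data. Note the assumption baked into 16-bit offsets: node ids must fit in a signed 16-bit integer, so a graph with more than 32,767 nodes would need a wider dtype such as <code>np.int32</code>.</p>

```python
import numpy as np

num_edges = 30_000
# Hypothetical edge endpoints for a graph with 10,000 nodes.
to_nodes_64 = np.arange(num_edges, dtype=np.int64) % 10_000
to_nodes_16 = to_nodes_64.astype(np.int16)

bytes_64 = to_nodes_64.nbytes            # 8 bytes per index
bytes_16 = to_nodes_16.nbytes            # 2 bytes per index
max_index_16 = np.iinfo(np.int16).max    # largest node id that fits in int16
```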
<h3 id="performance-comparison">Performance comparison</h3>
<p>I ran all three algorithms on sparse graphs of various sizes, all having roughly 3 times as many edges as vertices.</p>
<p><img alt="pagerank implementation speed comparison" src="/static/vectorized_pagerank/pagerank_shootout1.svg" style="width: 100%"/></p>
<p>This chart demonstrates a number of interesting things:</p>
<ul>
<li>The NumPy implementations have a hockey-stick look, due to the cost of the Python overhead at low N. The naive Python implementation shows no overhead, but you can also argue that it shows <em>only</em> overhead ;).</li>
<li>Both naive and vectorized sparse implementations demonstrate the expected linear scaling.</li>
<li>My vectorized sparse implementation performs up to 60x faster than the original naive implementation.</li>
<li>The dense implementation starts off much faster due to its extreme simplicity (minimizing any Python overhead), but loses to the sparse implementation at around N = 300, due to its cubic scaling.</li>
</ul>
<h2 id="jit-and-c-implementations">JIT and C implementations</h2>
<p>For fun, I also implemented the flattened adjacency list solution in C, to estimate the performance ceiling. I was surprised to find that despite having optimized Python performance by 60x, another 5x in potential optimizations existed.</p>
<p>I suspected that the gap was due to residual Python dispatch overhead. A potential solution is JIT compilation, which traces the dispatch patterns, eliminating that overhead from the final execution. I tried using Numba, TensorFlow, and JAX to close the gap.</p>
<p><img alt="pagerank implementation speed comparison" src="/static/vectorized_pagerank/pagerank_shootout2.svg" style="width: 100%"/></p>
<p>Or, expressed in relative speed to the C implementation:</p>
<p><img alt="pagerank implementation speed comparison" src="/static/vectorized_pagerank/pagerank_shootout2_normalized.svg" style="width: 100%"/></p>
<p>(I’m not familiar with PyTorch; if you’d like to see PyTorch on this graph, send me a <a href="https://github.com/brilee/python_pagerank/pulls">pull request</a>).</p>
<p>Numba is the winner in this particular performance benchmark. TensorFlow and JAX also showed modest performance gains of 25%, and their overlap is expected as they’re both being compiled to <a href="https://www.tensorflow.org/xla">XLA</a>.</p>
<p>The Numba benchmark comes with some caveats. Numba didn’t actually support compiling the sparse index / gather APIs I’d used, so I had to rewrite the core loop.</p>
<pre><code># TensorFlow version
score_packets = tf.gather(score, from_nodes)
score = tf.math.unsorted_segment_sum(score_packets, to_nodes, N)
# Numba rewrite
new_score = np.zeros_like(score)
for i in range(from_nodes.shape[0]):
    new_score[to_nodes[i]] += score[from_nodes[i]]
score = new_score</code></pre>
<p>For comparison, here’s my C implementation.</p>
<pre><code>// C code
for (int e = 0; e < num_edges; e++) {
    new_scores[dest[e]] += scores[src[e]];
}</code></pre>
<p>I’d consider the Numba implementation akin to inlining some C code into Python. My guess is that the XLA compiler was unable to rewrite this code, leading to TF and JAX’s underperformance.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I’ve shown how to flatten our adjacency list representation to get performant GNN code. This representation may seem complicated, but I’ve found it to be quite versatile in performantly expressing a wide array of GNN architectures.</p>
<p>While Numba can bring us to within 1.5x performance of handwritten C, it requires essentially writing some inline C and wrestling with obfuscated compilation errors, so I don’t consider this to be a free win.</p>
<p>TensorFlow/JAX can get within 4x performance of handwritten C while using idiomatic NumPy APIs. Given that TensorFlow and JAX also come with automatic differentiation (table stakes for doing machine learning!) and a broader ecosystem of practitioners, I’m pretty happy using those libraries.</p>
<p>You can find the full implementation details on <a href="https://github.com/brilee/python_pagerank">GitHub</a>.</p>
<p>Thanks to Alexey Radul for commenting on drafts of this essay.</p>
Estimating vapor pressure from boiling point2019-10-30T00:00:00Z2019-10-30T00:00:00Ztag:www.moderndescartes.com,2019-10-30:/essays/vapor_pressure<p>Did you know that you can estimate the volatility of a substance at room temperature if you know its boiling point? This can be a useful calculation if you’d like to estimate, e.g. ppm of some volatile organic compound in the air.</p>
<h2 id="the-equations">The equations</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Clausius%E2%80%93Clapeyron_relation">Clausius-Clapeyron equation</a> relates several quantities: two temperatures <span class="math inline">\(T_1\)</span> and <span class="math inline">\(T_2\)</span>, the vapor pressures <span class="math inline">\(P_1\)</span> and <span class="math inline">\(P_2\)</span> at those temperatures, the ideal gas constant <span class="math inline">\(R\)</span>, and the <a href="https://en.wikipedia.org/wiki/Enthalpy_of_vaporization">enthalpy of vaporization</a>.</p>
<p><span class="math display">\[ \ln{\frac{P_2}{P_1}} = \frac{\Delta H_{vap}}{R}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)\]</span></p>
<p>We know most of these quantities. <span class="math inline">\(T_1\)</span>, <span class="math inline">\(P_1\)</span> are (boiling point, 1 bar), by definition of the boiling point. <span class="math inline">\(T_2\)</span> is room temperature. <span class="math inline">\(P_2\)</span> is the quantity we want to compute. <span class="math inline">\(R\)</span>, the ideal gas constant, is known.</p>
<p>The only missing quantity is <span class="math inline">\(\Delta H_{vap}\)</span>, the enthalpy of vaporization. This is an empirically measured value which is more difficult to measure, and therefore less commonly measured than boiling point. However, <a href="https://en.wikipedia.org/wiki/Trouton%27s_rule">Trouton’s Rule</a> is an observation that empirically, almost all organic molecules have near-identical <em>entropy</em> of vaporization <span class="math inline">\(\Delta S_{vap}\)</span> = 85 J/(K·mol), or about 10.5 times <span class="math inline">\(R\)</span>. (Note that entropy != enthalpy.) If we could link entropy back to enthalpy somehow, we’d have all the quantities we’d need to estimate volatility at room temperature.</p>
<p>Luckily, the <a href="https://en.wikipedia.org/wiki/Gibbs_free_energy">Gibbs equation</a> says that at the boiling point, we have an equilibrium between vapor and liquid phases given as</p>
<p><span class="math display">\[ \Delta G = 0 = \Delta H - T\Delta S \]</span></p>
<p>i.e. <span class="math inline">\(\Delta H_{vap} = T_{BP}\Delta S_{vap} = 10.5T_{BP}R\)</span>.</p>
<p>All together, this yields the following equation for estimating vapor pressure at room temperature of a substance:</p>
<p><span class="math display">\[ \ln{\frac{P_2}{\textrm{[1 bar]}}} = \frac{10.5T_{BP}R}{R}\left(\frac{1}{T_{BP}} - \frac{1}{T_2}\right) \]</span></p>
<p><span class="math display">\[ P_2 = \textrm{[1 bar]} \cdot e^{10.5 \left(1 - \frac{T_{BP}}{T_2} \right)} \]</span></p>
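<p>As a minimal sketch, the final equation translates directly into a few lines of Python. The function name and the <code>trouton_ratio</code> parameter are my own labels for this illustration; temperatures are assumed to be in kelvin and the result is in bar.</p>

```python
import math

def estimate_vapor_pressure_bar(boiling_point_K, temperature_K, trouton_ratio=10.5):
    """Estimate vapor pressure (in bar) from the normal boiling point via Trouton's rule."""
    # P2 = [1 bar] * exp(10.5 * (1 - T_BP / T2))
    return math.exp(trouton_ratio * (1 - boiling_point_K / temperature_K))

water_at_room_temp = estimate_vapor_pressure_bar(373, 300)
at_boiling_point = estimate_vapor_pressure_bar(373, 373)
```

<p>Since Trouton’s Rule is only an empirical approximation, the output should be read as an order-of-magnitude estimate rather than a precise value.</p>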
<h2 id="sanity-check">Sanity check</h2>
<p>When <span class="math inline">\(T_2 = T_{BP}\)</span>, the exponential turns into a factor of 1, as desired. At a temperature that’s half of the boiling point, the vapor pressure is <span class="math inline">\(e^{-10.5}\)</span> bar. At a temperature that’s twice the boiling point, the exponent is <span class="math inline">\(10.5 \left(1 - \frac{1}{2}\right)\)</span>, so the vapor pressure is <span class="math inline">\(e^{5.25}\)</span> bar. At a temperature that’s 80% of the boiling point (e.g. water boils at 373 K, and room temperature is 300 K), we have a vapor pressure of <span class="math inline">\(e^{-10.5 \cdot 0.25}\)</span> = 0.07 bar. Water at room temperature has a saturation vapor pressure of about 4% of an atmosphere, or 0.04 bar, so this is about right.</p>
<h2 id="a-fun-calculation">A fun calculation</h2>
<p>When you take a sniff of an essential oil, how many grams of material did you just inhale?</p>
<p>Let’s assume the essential oil has a boiling point of 200C, or 470K. Room temperature is 300K. Plugging these numbers into our equation yields a vapor pressure of 0.003 bar. A sniff is maybe 100 mL of air. Using the ideal gas law, we end up with <span class="math inline">\(10^{-5}\)</span> moles of material in that whiff. Assuming that our essential oil has a molecular weight of about 200 daltons, this yields 2 milligrams of material. If the oil is diluted to 10% in some carrier medium, then the vapor pressure drops accordingly and we’d have inhaled 0.2 milligrams of material.</p>
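<p>The same arithmetic, spelled out as a short script. The inputs (470 K boiling point, 300 K room temperature, 100 mL sniff, 200 g/mol) come from the paragraph above; the variable names are just for illustration.</p>

```python
import math

R = 0.08314  # ideal gas constant in L*bar/(mol*K)

def estimate_vapor_pressure_bar(boiling_point_K, temperature_K):
    """Trouton's-rule estimate of vapor pressure, in bar."""
    return math.exp(10.5 * (1 - boiling_point_K / temperature_K))

pressure_bar = estimate_vapor_pressure_bar(470, 300)  # roughly 0.003 bar
moles = pressure_bar * 0.1 / (R * 300)                # n = PV / RT for a 0.1 L sniff
milligrams = moles * 200 * 1000                       # 200 g/mol, converted from g to mg
```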
Understanding the AUROC metric2019-10-29T00:00:00Z2019-10-29T00:00:00Ztag:www.moderndescartes.com,2019-10-29:/essays/auc_intuition<p>In this essay I’ll dig into the difficulties of comparing two binary classifiers.</p>
<p>Classification is the task of predicting which class an instance belongs to. Many different kinds of classifiers exist - random forests and neural networks are two of the most popular in the machine learning world, but there are many others out there. How do we know which one is the best?</p>
<p>As it turns out, comparing classifiers is surprisingly hard. You would think that just comparing accuracy would be the easy answer, but unfortunately accuracy has a number of disadvantages. For example, if you were trying to detect a rare disease that occurs in 1% of people, and your classifier perfectly detected the rare disease (but accidentally marked 2% of healthy people as diseased), then the accuracy of that classifier would be about 98%. A classifier that only ever says “no disease” would have an accuracy of 99%. You would probably agree that the first classifier is much more useful than the second one, and yet the second classifier has the higher accuracy.</p>
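<p>To make the arithmetic concrete, here’s a small sketch that works through the rare-disease example for a hypothetical population of 10,000 people. The variable names are illustrative.</p>

```python
population = 10_000
diseased = population // 100        # 1% prevalence: 100 people
healthy = population - diseased     # 9,900 people

# Classifier A: catches every case, but flags 2% of healthy people.
false_positives_a = healthy * 2 // 100                       # 198 people
accuracy_a = (population - false_positives_a) / population
recall_a = diseased / diseased

# Classifier B: always says "no disease".
accuracy_b = healthy / population
recall_b = 0 / diseased
```

<p>Classifier B wins on accuracy while catching none of the cases, which is exactly the problem.</p>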
<p>Since accuracy doesn’t really work that well, people look at a variety of other metrics, like recall (what percent of diseases did we catch?) and precision (what percent of disease classifications were correct?) and so on. Each has strengths and weaknesses. You can find the full matrix of possible metrics on <a href="https://en.wikipedia.org/wiki/Sensitivity_and_specificity">Wikipedia</a>, and although I’ve studied this page thoroughly, the truth is that my eyes glaze over pretty quickly. It doesn’t help that multiple fields have reinvented these concepts independently and each field has invented its own names for the same thing - sensitivity, specificity, recall, precision, PPV, NPV, etc. “Type I error” and “Type II error” definitely take the cake for worst naming ever. I’ve attended a number of talks given by statisticians to machine learning experts (and vice versa), and the amount of confusion generated by bad naming of such basic ideas is simultaneously hilarious and saddening.</p>
<p>In practice, the metric that many studies use is <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">AUROC</a> (synonyms: AUC, ROC). Unfortunately, the ROC is defined by referring to true positive rates and false positive rates and false negative rates and all those things that make your eyes glaze over. (Pop quiz: can you recall what the X and Y axes of the ROC curve are, off the top of your head?)</p>
<p>So instead, let’s try and figure out a better metric by starting from scratch. (Spoiler: we’re going to rederive AUROC).</p>
<h2 id="counting-mistakes">Counting Mistakes</h2>
<p>One way to think of a classifier is as a sorting algorithm. Given a set of patients who may or may not have a disease, the classifier gives a score to each patient, and then we can order the patients from “least likely to have disease” to “most likely to have disease”. The raw output of the classifier is unimportant, other than providing a way to create an ordinal ranking of patients.</p>
<p>We’re throwing away some information here by converting scores into ranks. But since a classifier’s prediction score needs to be rescaled and calibrated, throwing away the exact scores is not a big deal. And ranking makes sense from the doctor’s perspective, since they can then go down the list from most likely to least likely, until they decide that the marginal gain isn’t worth the time.</p>
<p>A perfect ranking would look something like <code>H H H H H H H D D D D</code> (H = healthy, D = diseased), a less-than-perfect ranking would look like <code>H H D H H H H D D H D</code>, and the worst possible ranking would be <code>D D D D H H H H H H H</code> - a completely backwards ranking. So we’d like a metric that gives the best ranking a score of 1, and the worst ranking a score of 0, and everything else somewhere in between. We’d expect a random ordering to have a score of 0.5, since it actually takes some skill to get the ranking exactly backwards.</p>
<p>Here’s a simple idea: let’s count the number of mistakes the model makes. Every time the model ranks a healthy person as “more likely to have a disease” over someone who actually has the disease, that’s a mistake. The classifier’s overall score will then be one minus the fraction of potential mistakes that it actually made.</p>
<pre><code>Ranking      | Mistakes | Score
--------------------------------
(H, H, D, D) | 0 / 4    | 1
(H, D, H, D) | 1 / 4    | 0.75
(H, D, D, H) | 2 / 4    | 0.5
(D, H, H, D) | 2 / 4    | 0.5
(D, H, D, H) | 3 / 4    | 0.25
(D, D, H, H) | 4 / 4    | 0</code></pre>
<h2 id="implementation">Implementation</h2>
<pre><code>import itertools

def fraction_correct(rearrangement):
    positive_indices = []
    negative_indices = []
    for i, item in enumerate(rearrangement):
        if item == 0:
            negative_indices.append(i)
        else:
            positive_indices.append(i)
    # Uses the punning of True / 1, False / 0 in Python
    correct_pairings = sum(p > n for p, n in itertools.product(
        positive_indices, negative_indices))
    total_pairings = len(positive_indices) * len(negative_indices)
    return correct_pairings / total_pairings</code></pre>
<p>As mentioned earlier, it turns out that this function yields identical results to sklearn’s AUC implementation.</p>
<pre><code>>>> data = [0, 0, 0, 1, 1]
>>> scores = list(range(len(data)))
>>> for rearrangement in sorted(set(itertools.permutations(data))):
...     print("%s %0.3f %0.3f" % (
...         rearrangement,
...         sklearn.metrics.roc_auc_score(rearrangement, scores),
...         fraction_correct(rearrangement)))
(0, 0, 0, 1, 1) 1.000 1.000
(0, 0, 1, 0, 1) 0.833 0.833
(0, 0, 1, 1, 0) 0.667 0.667
(0, 1, 0, 0, 1) 0.667 0.667
(0, 1, 0, 1, 0) 0.500 0.500
(0, 1, 1, 0, 0) 0.333 0.333
(1, 0, 0, 0, 1) 0.500 0.500
(1, 0, 0, 1, 0) 0.333 0.333
(1, 0, 1, 0, 0) 0.167 0.167
(1, 1, 0, 0, 0) 0.000 0.000</code></pre>
<h2 id="why-does-this-work">Why does this work?</h2>
<p>I think it’s pretty surprising that these two scores turn out to measure the same thing. One is a statement about discrete probabilities - “what’s the chance that a random pairwise comparison is correctly ordered?”. The other is a statement about continuous curves - “the area under the curve plotting true positive rate vs. false positive rate”.</p>
<p>To explain this equivalence, let’s go back to our pairwise comparison algorithm. The implementation as written is <span class="math inline">\(O(N^2)\)</span>, because it constructs a list of positive and negative examples, and then checks all pairwise comparisons. It’s a bit inefficient, especially if you want to evaluate the performance of your classifier over a large dataset. There’s actually an <span class="math inline">\(O(N)\)</span> algorithm: keep a running counter of positive examples, and every time you see a negative example, you know you’ve just made that many mistakes. Codifying this idea, we arrive at the following:</p>
<pre><code>def fraction_correct_optimized(rearrangement):
    negative_seen = 0
    positive_seen = 0
    mistakes_seen = 0
    for item in rearrangement:
        if item == 0:
            negative_seen += 1
            mistakes_seen += positive_seen
        else:
            positive_seen += 1
    fraction_mistakes = mistakes_seen / (negative_seen * positive_seen)
    return 1 - fraction_mistakes</code></pre>
<p>We can verify that this also returns identical results to our <span class="math inline">\(O(N^2)\)</span> implementation and the reference AUC implementation.</p>
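<p>Here’s a sketch of one way to run that check without any dependencies: compare the two implementations over every distinct ordering of a small label set. Both functions are restated compactly so the snippet stands alone, and the choice of four negatives and three positives is arbitrary.</p>

```python
import itertools

def fraction_correct(rearrangement):
    """O(N^2): check every (positive, negative) pair of positions."""
    positives = [i for i, item in enumerate(rearrangement) if item == 1]
    negatives = [i for i, item in enumerate(rearrangement) if item == 0]
    correct = sum(p > n for p, n in itertools.product(positives, negatives))
    return correct / (len(positives) * len(negatives))

def fraction_correct_optimized(rearrangement):
    """O(N): each negative is a mistake against every positive seen so far."""
    negative_seen = positive_seen = mistakes_seen = 0
    for item in rearrangement:
        if item == 0:
            negative_seen += 1
            mistakes_seen += positive_seen
        else:
            positive_seen += 1
    return 1 - mistakes_seen / (negative_seen * positive_seen)

data = [0, 0, 0, 0, 1, 1, 1]
rearrangements = sorted(set(itertools.permutations(data)))
max_difference = max(
    abs(fraction_correct(r) - fraction_correct_optimized(r))
    for r in rearrangements)
```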
<p>So the funny thing about this implementation is that it looks like the rectangular approximation to the integral of a curve. Let me rename the variables to make it a bit more obvious:</p>
<pre><code>X, Y = 0, 1
def area_under_curve(steps):
    current_x = 0
    current_y = 0
    area = 0
    for step_direction in steps:
        if step_direction == X:
            current_x += 1
            area += current_y
        if step_direction == Y:
            current_y += 1
    normalized_area = area / (current_x * current_y)
    return 1 - normalized_area</code></pre>
<p>So the x-axis = true negatives seen, and y-axis = true positives seen. We can flip one of these two axes to get rid of the “1 minus area” in the last line of code; therefore, the ROC curve plots true positive rate as a function of (1 - true negative rate), aka false positive rate. After having gone through this derivation, it’s much easier to remember what the ROC’s axes are.</p>
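<p>To close the loop, here’s a sketch that builds the ROC curve in its conventional orientation (false positive rate on the x-axis, true positive rate on the y-axis) and integrates it directly, with no “1 minus” correction. The helper names <code>roc_points</code> and <code>area_under_points</code> are my own for this illustration, and the input is assumed to be a ranking of 0/1 labels from lowest score to highest with no tied scores.</p>

```python
def roc_points(rearrangement):
    """Return (false positive rate, true positive rate) points for a ranking.

    `rearrangement` lists labels from lowest score to highest.
    """
    num_pos = sum(rearrangement)
    num_neg = len(rearrangement) - num_pos
    fp = tp = 0
    points = [(0.0, 0.0)]
    # Lower the threshold one item at a time, starting from the highest score.
    for item in reversed(rearrangement):
        if item == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / num_neg, tp / num_pos))
    return points

def area_under_points(points):
    """Trapezoidal area under a piecewise-linear curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

auroc = area_under_points(roc_points((0, 0, 1, 0, 1)))
```

<p>For the ranking <code>(0, 0, 1, 0, 1)</code> this gives 5/6, matching the 0.833 in the table of results earlier.</p>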
Why study algorithms?2019-04-02T00:00:00Z2019-04-02T00:00:00Ztag:www.moderndescartes.com,2019-04-02:/essays/why_algorithms<p>A common sentiment is that algorithms and data structures are useless to know, other than to pass the interview process at most software companies, and that one should just learn enough to get by those interviews. I strongly disagree with this notion, and I’ll try to explain why I think it’s valuable to study algorithms and data structures in depth.</p>
<h2 id="human-algorithms">Human algorithms</h2>
<p>Let’s say you have a used deck of cards, and you suspect you’ve lost a card. You’d like to know which card is missing, if any. As a human, how would you go about doing this? There are many possible solutions, but here’s the one I would probably use: sort the cards by rank and suit, using an insertion sort, and then scan through the sorted cards to see if any cards are missing. The insertion sort would involve holding a stack of sorted cards, picking up the next card, and inserting it into its place in the stack, one card at a time.</p>
<p>Breaking this down, there are a few assumptions that play into my choice of algorithm.</p>
<p>First, human working memory can only contain about 7-10 items. If the task were instead to name a missing color of the rainbow, then you would not bother sorting the list of given colors - you would just scan the list and name the missing color. But you could not do this with a full deck of cards. So the sorting step is necessary for humans.</p>
<p>Second, the human visual system is relatively fast at scanning through a fanned-out stack of cards, and figuring out where the next card should be inserted. This happens much faster than the physical act of picking up and manipulating a card, so it’s effectively free.</p>
<p>Third, the real world allows one to inject a card into an existing stack of cards with approximately constant cost. Thus, an insertion sort is <span class="math inline">\(O(N)\)</span> in real life, whereas on computers it is typically <span class="math inline">\(O(N^2)\)</span>.</p>
<p>Combining all of these aspects of the problem at hand, we conclude that sorting the deck via insertion sort, then scanning the sorted deck, is an efficient way to verify a complete deck.</p>
<h2 id="computer-algorithms">Computer algorithms</h2>
<p>Faced with the same task, a computer would handle this a bit differently. One possible solution: First, allocate an array of 52 bits. Then, for each card, you would flip the appropriate bit from 0 to 1 to mark it seen. Finally, scanning through the array, you’d look for any unflipped bits.</p>
<p>Another possible solution: keep a running sum of all cards seen (A of diamonds = 1, 2 of diamonds = 2, …), and then check whether the sum matched the expected sum <span class="math inline">\(1 + 2 + \ldots + 52\)</span>. (This solution only works if at most 1 card is missing; otherwise it cannot distinguish which cards are missing.)</p>
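<p>Both approaches fit in a few lines. This is a sketch under a couple of assumptions: cards are numbered 1 through 52, and a Python list of booleans stands in for the array of bits. The function names are illustrative.</p>

```python
def find_missing_with_bits(cards, deck_size=52):
    """Mark each card as seen, then scan for unmarked slots."""
    seen = [False] * (deck_size + 1)  # index 0 is unused
    for card in cards:
        seen[card] = True
    return [card for card in range(1, deck_size + 1) if not seen[card]]

def find_missing_with_sum(cards, deck_size=52):
    """Compare a running sum to 1 + 2 + ... + 52 (valid only if at most one card is missing)."""
    expected = deck_size * (deck_size + 1) // 2
    missing = expected - sum(cards)
    return missing if missing != 0 else None

# Hypothetical deck that has lost card number 17.
deck = [card for card in range(1, 53) if card != 17]
```

<p>The bit-flipping version can report any number of missing cards, while the running-sum version returns a single number and silently gives a misleading answer if two or more cards are gone.</p>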
<p>Already, we can see that what is “easy” for humans is not necessarily easy for computers, and vice versa. Human working memory is small, but we can do pattern recognition over our visual field very quickly. Computers can memorize a large amount of arbitrary data and do arithmetic with ease, but to process an image would require a deep convolutional neural network of many millions of operations.</p>
<h2 id="rules-of-the-game">Rules of the game</h2>
<p>Given that humans and computers have different constraints on their operation, they naturally end up with different algorithms for the same task. This also means that normal human intuition for what the “obvious” way to do something isn’t necessarily aligned with what a computer is good at doing. So one reason to study algorithms is to learn the rules of the game for computers and hone your intuition about efficient ways to do things on a computer.</p>
<p>In a broader sense, algorithms is about understanding the consequences of a particular set of rules. As it turns out, the rules of the game have actually been slowly changing over the last half-century, and the algorithms that have been published in textbooks aren’t necessarily the right ones for today’s computers.</p>
<p>Take, for example, memory access. It used to be true decades ago that memory access was about the same cost as arithmetic. But today, that’s not true anymore: <a href="http://norvig.com/21-days.html#answers">Latency numbers every programmer should know</a> tells us that main memory access is actually ridiculously slow. Modern processors have a hierarchy of caches which get progressively larger and slower, and textbook algorithms run best when they fit entirely on the fastest caching levels, where their decades-old assumptions hold true.</p>
<p>So of the two algorithms given above for detecting missing cards, the running sum algorithm ends up a few times faster than the bit-flipping algorithm. While both algorithms have <span class="math inline">\(O(N)\)</span> runtime, one solution requires going back to memory to overwrite a 1, whereas the other one updates a number in-place.</p>
<p>Another example is the dramatic rise in hard drive capacity over the last few decades. Hard drives have gone from gigabytes in capacity to terabytes of capacity over the course of a decade. And yet, the speed of disk reading has been fundamentally limited by the physical constraints of spinning a platter at ~7200 RPM, and thus the ratio of hard drive capacity to read/write speed has dramatically shifted. As a result, storage space is relatively cheap, compared to the cost of actually reading that storage space. I remember when <a href="https://aws.amazon.com/glacier/">Amazon Glacier</a> was first announced, there was a lot of speculation as to what secret storage medium Amazon had invented that resulted in such a peculiar pricing structure (nearly free to store data, but expensive to actually read that data). There is no speculation needed if you understand hard drive trends. And nowadays, SSDs change that equation again - Facebook has published a <a href="https://research.fb.com/publications/reducing-dram-footprint-with-nvm-in-facebook/">few</a> <a href="https://research.fb.com/publications/bandana-using-non-volatile-memory-for-storing-deep-learning-models/">recent</a> papers describing how SSDs (also referred to as NVM, non-volatile memory) can be directly used as a slower caching layer for various serving systems.</p>
<p>Yet another example: in machine learning, where teams are investigating the use of customized hardware to execute giant neural networks, it turns out that different scaling limits are reached - like the bandwidth of connections between chips. So here, an entirely new set of algorithms is needed that works around these constraints. For example, when doing a <code>reduce_sum</code> over computational results from N chips, you end up in an N-to-1 bottleneck at the accumulator. However, if you wire up the chips in a big circle, then you eliminate the bottleneck, at the cost of increasing overall bandwidth to <span class="math inline">\(N^2\)</span> and a latency of <span class="math inline">\(N\)</span>. And if you wire up the chips in a 2-D toroidal configuration, you can reduce first in one direction, then the other, to reduce the latency to <span class="math inline">\(\sqrt{N}\)</span>. (See the <a href="https://arxiv.org/abs/1811.06992">TPU v3 paper</a> for more.)</p>
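<p>Here’s a toy simulation of the “big circle” idea, to show where the <span class="math inline">\(N^2\)</span> bandwidth and <span class="math inline">\(N\)</span> latency come from. It’s a deliberately simple scheme in which every chip forwards whatever it last received to its neighbor; it is an assumption for illustration, not a description of how any real accelerator implements its reductions.</p>

```python
def ring_reduce_sum(values):
    """Simulate a simple ring all-reduce: each chip forwards what it last received."""
    n = len(values)
    totals = list(values)      # running sum held by each chip
    in_flight = list(values)   # value each chip will send next
    messages_sent = 0
    for _ in range(n - 1):     # n - 1 sequential steps: latency grows like N
        # Chip i receives the in-flight value of its neighbor, chip (i - 1) % n.
        received = [in_flight[(i - 1) % n] for i in range(n)]
        messages_sent += n     # n messages per step: about N^2 messages in total
        totals = [t + r for t, r in zip(totals, received)]
        in_flight = received
    return totals, messages_sent

totals, messages_sent = ring_reduce_sum([1, 2, 3, 4])
```

<p>No chip ever receives more than one message per step, so there is no single accumulator acting as a bottleneck, but every value has to travel all the way around the ring.</p>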
<h2 id="conclusion">Conclusion</h2>
<p>All of these examples seem pretty esoteric. Do you need to know algorithms if you’re not working on new hardware?</p>
<p>At the leading edge, innovation ends up changing the rules of the game, and if you’re working there, then you had better have a solid grasp on algorithms. Then, as the hardest problems are worked out, services and libraries are created for the rest of us. Still, you cannot effectively use those services/libraries unless you understand the underlying technology’s ideal scenarios and limitations. I’ve heard many stories about projects that were killed because their database technology choices were fundamentally mismatched with the usage patterns they had. Finally, the technology matures enough that a community grows around each correct pairing of technology and application, and then you won’t have to know algorithms to make a good choice. PHP/WordPress has turned into a pretty solid development platform for DIY websites.</p>
Forays into 3D geometry2019-01-10T00:00:00Z2019-01-10T00:00:00Ztag:www.moderndescartes.com,2019-01-10:/essays/3d_geometry<p>I’ve been trying to understand <a href="https://arxiv.org/abs/1802.08219">this paper</a> and as part of this process, realized that I should learn more geometry, algebra, and group theory. So I spent a week digging into the math of 3D geometry and here’s what I’ve learned so far.</p>
<p>I stumbled on these concepts in a highly nonlinear fashion, so my notes are going to be a bit scattered. I don’t know how useful they’ll be to other people - probably, they won’t be? This is more for my future self.</p>
<h2 id="mappings">Mappings</h2>
<p>A mapping is a correspondence between two domains. For example, the exponential function maps real numbers to positive real numbers. Furthermore, it turns the group operation of addition on the reals into multiplication on the positive reals.</p>
<p><span class="math display">\[ x + y = z \Leftrightarrow e^xe^y = e^{x+y} = e^z \]</span></p>
<p>This is a useful transformation to make because addition is easier to think about than multiplication. In this case, the mapping is also bijective, meaning that one can losslessly convert back and forth between the two domains.</p>
<p>One common pattern is that you can transform your numbers into the domain that’s easier to reason about, do your computation there, and then convert back afterwards. Eventually, you get annoyed with converting back and forth, and you start reasoning entirely in the transformed domain. This happens, for example, in converting between time/frequency domains for sound using the Fourier transform - everyone thinks in the frequency domain even though the raw data always comes in the time domain.</p>
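<p>As a small sketch of this transform, compute, convert-back pattern, here is the exponential/logarithm mapping used to multiply numbers by adding their logarithms. The helper name <code>product_via_logs</code> and the sample numbers are made up for illustration.</p>

```python
import math


def product_via_logs(xs):
    """Multiply positive numbers by adding in the log domain."""
    log_total = sum(math.log(x) for x in xs)  # work in the additive domain
    return math.exp(log_total)                # map back at the end


small = [1e-200, 1e-200]
direct = small[0] * small[1]                  # underflows in floating point
log_domain = sum(math.log(p) for p in small)  # stays perfectly representable
```

<p>The second half hints at why people end up staying in the transformed domain: the direct product of two tiny numbers underflows to zero, while the sum of their logarithms is an ordinary number.</p>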
<h2 id="the-circle-group">The circle group</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Circle_group">circle group</a> consists of the set of all complex numbers with modulus 1 - in other words, the unit circle on the complex plane. It’s a pretty simple group to understand, but it shows how we’re going to try and attack 3D rotations. There are multiple ways to look at this.</p>
<ul>
<li>You can represent this group using rotation <span class="math inline">\(\theta\)</span> from the x-axis (i.e. polar coordinates), using addition.</li>
<li>You can represent this group using complex numbers, under multiplication.</li>
<li>You can work with the matrices of the following form, under multiplication. (This form is convenient because multiplication by this matrix is equivalent to rotating a vector by <span class="math inline">\(\theta\)</span>.)</li>
</ul>
<span class="math display">\[\begin{bmatrix}
\cos \theta & -\sin \theta \\
\sin \theta & \cos \theta \\
\end{bmatrix}\]</span>
<p>All of these representations are tied together by <a href="https://en.wikipedia.org/wiki/Euler%27s_formula">Euler’s formula</a>, which states that <span class="math inline">\(e^{i\theta} = \cos \theta + i\sin \theta\)</span>.</p>
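<p>Here is a quick numerical check that the three representations agree, composing two rotations by adding angles, by multiplying unit complex numbers, and by multiplying rotation matrices. The helper names and the two sample angles are arbitrary choices for this sketch.</p>

```python
import cmath
import math


def rotation_matrix(theta):
    """2x2 rotation matrix for an angle theta in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul2(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


a, b = 0.7, 1.9

angle_sum = a + b                                    # 1. angles, under addition
z = cmath.exp(1j * a) * cmath.exp(1j * b)            # 2. complex numbers, under multiplication
m = matmul2(rotation_matrix(a), rotation_matrix(b))  # 3. matrices, under multiplication

expected_z = cmath.exp(1j * angle_sum)
expected_m = rotation_matrix(angle_sum)
```

<p>The first column of <code>m</code> holds the real and imaginary parts of <code>z</code>, which is Euler’s formula showing up in matrix form.</p>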
<p>Somewhat surprisingly, Euler’s formula also works if you consider the exponentiation to be a <a href="https://en.wikipedia.org/wiki/Matrix_exponential">matrix exponentiation</a>, and you use the <a href="https://en.wikipedia.org/wiki/Complex_number#Matrix_representation_of_complex_numbers">matrix representation</a> of complex numbers.</p>
<p><span class="math display">\[\begin{gather}
\exp\left(
\begin{bmatrix}
0 & -\theta \\
\theta & 0 \\
\end{bmatrix}\right)
=
\begin{bmatrix}
\cos \theta & -\sin \theta \\
\sin \theta & \cos \theta \\
\end{bmatrix}
\end{gather}\]</span></p>
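<p>One way to see why this holds, as a sketch: write <span class="math inline">\(J\)</span> (a shorthand introduced here for illustration) for the matrix that plays the role of <span class="math inline">\(i\)</span>. Since <span class="math inline">\(J^2 = -I\)</span>, the power series for the exponential splits into even and odd terms, which are exactly the series for cosine and sine.</p>
<p><span class="math display">\[\begin{gather}
J =
\begin{bmatrix}
0 & -1 \\
1 & 0 \\
\end{bmatrix},
\quad J^2 = -I \\
\exp(\theta J) = \sum_{k=0}^{\infty} \frac{\theta^k J^k}{k!}
= \left(1 - \frac{\theta^2}{2!} + \frac{\theta^4}{4!} - \cdots\right) I
+ \left(\theta - \frac{\theta^3}{3!} + \frac{\theta^5}{5!} - \cdots\right) J
= \cos\theta \, I + \sin\theta \, J
\end{gather}\]</span></p>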
<h2 id="matrix-exponentiation">Matrix exponentiation</h2>
<p>It turns out that subbing in matrix exponentiation for regular exponentiation basically works most of the time, and additionally has some other surprising properties.</p>
<ul>
<li>The determinant of a real matrix’s exponentiation is strictly positive and the result is therefore invertible. (This is analogous to <span class="math inline">\(e^x > 0\)</span> for real <span class="math inline">\(x\)</span>).</li>
<li>It commutes with transposition, so that <span class="math inline">\(e^{\left(M^T\right)} = \left(e^M\right)^T\)</span></li>
</ul>
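<p>Both properties can be sketched in one line each. The first follows from the standard identity relating the determinant of a matrix exponential to the trace; the second follows from transposing the power series term by term.</p>
<p><span class="math display">\[\begin{gather}
\det\left(e^M\right) = e^{\operatorname{tr} M} > 0 \\
\left(e^M\right)^T = \sum_{k=0}^{\infty} \frac{\left(M^k\right)^T}{k!} = \sum_{k=0}^{\infty} \frac{\left(M^T\right)^k}{k!} = e^{\left(M^T\right)}
\end{gather}\]</span></p>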
<p>There are a few things that don’t carry over seamlessly, related to the fact that matrix multiplication isn’t commutative.</p>
<ul>
<li><em>If</em> <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span> commute (<span class="math inline">\(XY = YX\)</span>), then <span class="math inline">\(e^{X + Y} = e^Xe^Y\)</span>.</li>
</ul>
<p>Answering the question of what happens to <span class="math inline">\(e^{X+Y}\)</span> when <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span> don’t commute leads to a rabbit hole of gnarly algebra.</p>
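<p>A small worked example shows the failure concretely. The two matrices below are chosen purely for illustration: each squares to zero, so its exponential is just <span class="math inline">\(I\)</span> plus the matrix, while their sum squares to <span class="math inline">\(I\)</span>, so its exponential is built from <span class="math inline">\(\cosh\)</span> and <span class="math inline">\(\sinh\)</span>.</p>
<p><span class="math display">\[\begin{gather}
X =
\begin{bmatrix}
0 & 1 \\
0 & 0 \\
\end{bmatrix},
\quad
Y =
\begin{bmatrix}
0 & 0 \\
1 & 0 \\
\end{bmatrix},
\quad
XY =
\begin{bmatrix}
1 & 0 \\
0 & 0 \\
\end{bmatrix}
\neq
\begin{bmatrix}
0 & 0 \\
0 & 1 \\
\end{bmatrix}
= YX \\
e^Xe^Y = (I + X)(I + Y) =
\begin{bmatrix}
2 & 1 \\
1 & 1 \\
\end{bmatrix} \\
e^{X+Y} = \cosh(1)\, I + \sinh(1)\,(X + Y) \approx
\begin{bmatrix}
1.543 & 1.175 \\
1.175 & 1.543 \\
\end{bmatrix}
\end{gather}\]</span></p>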
<h3 id="skew-symmetric-matrices">Skew-symmetric matrices</h3>
<p>A skew-symmetric matrix is a matrix whose transpose is equal to its negation: <span class="math inline">\(A^T = -A\)</span>. It turns out that all matrices can be broken down into a symmetric matrix plus a skew-symmetric matrix, by defining <span class="math inline">\(B = \frac{1}{2}(A + A^T)\)</span> to be the symmetrical part of <span class="math inline">\(A\)</span>, and then realizing that what’s left over, <span class="math inline">\(A - B\)</span>, is skew-symmetric.</p>
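<p>Spelling out that last step, with <span class="math inline">\(C\)</span> as a name (introduced here) for the leftover part:</p>
<p><span class="math display">\[\begin{gather}
C = A - B = \frac{1}{2}\left(A - A^T\right) \\
B^T = \frac{1}{2}\left(A^T + A\right) = B, \quad
C^T = \frac{1}{2}\left(A^T - A\right) = -C, \quad
A = B + C
\end{gather}\]</span></p>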
<p>Symmetric matrices over real numbers, by the <a href="https://en.wikipedia.org/wiki/Spectral_theorem">spectral theorem</a>, can be represented as a diagonal matrix in some basis. In plain English, this means that symmetric matrices correspond to transformations that are purely “rescaling” along some set of axes, with no rotations.</p>
<p>So that means that the skew-symmetric remainder of a matrix probably corresponds to the rotational part of the transformation. This seems to be related to the fact that a skew-symmetric matrix’s eigenvalues are all purely imaginary, but I don’t really fully understand the connection here.</p>
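<p>The more literal connection might be that when you exponentiate a skew-symmetric matrix, you get an orthogonal matrix (a matrix for which <span class="math inline">\(MM^T = I\)</span>). The proof is pretty simple, and the step that combines the exponents is allowed because <span class="math inline">\(A\)</span> and <span class="math inline">\(A^T = -A\)</span> commute:</p>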
<p><span class="math display">\[e^A\left(e^A\right)^T = e^Ae^{A^T} = e^{A + A^T} = e^{A - A} = e^0 = I\]</span></p>
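<p>The same fact can be checked numerically. The sketch below computes the matrix exponential with a truncated power series, which is adequate for a small matrix like this one but is not how a numerical library would do it. The helper names and the sample matrix <code>A</code> are made up for illustration.</p>

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transpose(a):
    return [list(row) for row in zip(*a)]


def mat_exp(a, terms=30):
    """Truncated power series: I + A + A^2/2! + ... (fine for small norms)."""
    result = identity(len(a))
    term = identity(len(a))
    for k in range(1, terms):
        # term holds A^k / k!, built from the previous term
        term = [[x / k for x in row] for row in matmul(term, a)]
        result = [[r + t for r, t in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    return result


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# An arbitrary skew-symmetric matrix: its transpose equals its negation.
A = [[0.0, -0.3, 0.5],
     [0.3, 0.0, -0.2],
     [-0.5, 0.2, 0.0]]

R = mat_exp(A)
should_be_identity = matmul(R, transpose(R))
determinant = det3(R)
```

<p>Up to floating-point error, <code>should_be_identity</code> is the identity matrix and <code>determinant</code> is 1, so <code>R</code> is a proper rotation.</p>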
<h3 id="orthogonal-matrices">Orthogonal matrices</h3>
<p>You may remember orthogonal matrices as those transformations that preserve distances - aka rotations and inversions (“improper” rotations). SO(3) consists of the orthogonal matrices of determinant +1, thus excluding inversions. The exponential map is surjective - for every element of SO(3), there exists a skew-symmetric matrix that exponentiates to it. (It’s not a bijection: rotation angles about the same axis that differ by a multiple of <span class="math inline">\(2\pi\)</span> exponentiate to the same rotation, just as in the 2-D case.)</p>
<h2 id="mapping-between-so3-and-skew-symmetric-matrices">Mapping between SO(3) and skew-symmetric matrices</h2>
<p>Earlier, we looked at the circle group, which was a toy example showing that we could map 2-D rotations between {multiplication of 2-D matrices} and {addition over plain angles}. Now, to understand 3-D rotations, we’ll try to map between {multiplication of 3-D matrices} and {addition of skew-symmetric matrices}.</p>
<p>It turns out that actually this doesn’t work with finite rotation matrices. So we’ll just brush the problem under the rug by invoking a “tangent space” around the identity, which means the space of infinitesimal rotations. This space is represented by lowercase so(3) and has an orthogonal basis set which I’ll call <span class="math inline">\(\{s_1, s_2, s_3\}\)</span> with concrete representations of</p>
<p><span class="math display">\[\begin{gather}
s_1 =
\begin{bmatrix}
0 & 0 & 0 \\
0 & 0 & \alpha \\
0 & -\alpha & 0 \\
\end{bmatrix}
\quad
s_2 =
\begin{bmatrix}
0 & 0 & \alpha \\
0 & 0 & 0 \\
-\alpha & 0 & 0 \\
\end{bmatrix}
\quad
s_3 =
\begin{bmatrix}
0 & \alpha & 0 \\
-\alpha & 0 & 0 \\
0 & 0 & 0 \\
\end{bmatrix}
\end{gather}\]</span></p>
<p>These skew-symmetric matrices are not themselves rotations; they’re derivatives. To make them rotations, you have to exponentiate them, which for infinitesimal <span class="math inline">\(\alpha\)</span> turns out to be equivalent to adding them to the identity matrix: <span class="math inline">\(e^{s_i} = I + s_i\)</span>. This is analogous to the real numbers, where <span class="math inline">\(e^x \approx 1 + x\)</span> for <span class="math inline">\(x \approx 0\)</span>.</p>
<p>Since <span class="math inline">\(\alpha\)</span> is considered to be an infinitesimal, raising <span class="math inline">\((I + s_i)^k\)</span> to a power <span class="math inline">\(k\)</span> just results in the matrix <span class="math inline">\((I + ks_i)\)</span> because all second order terms disappear. Also, addition within so(3) corresponds to multiplication in the exponential map. <span class="math inline">\(m_i + m_j = m_k \Leftrightarrow e^{m_i}e^{m_j} = e^{m_i + m_j} = e^{m_k}\)</span> for arbitrary <span class="math inline">\(m \in so(3)\)</span>. So this is nice; we’ve got something resembling the circle group for 3 dimensions. Unfortunately, this only works for infinitesimal rotations and completely falls apart for finite rotations.</p>
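<p>A numerical sketch of where things fall apart: rotations about the x and y axes nearly commute when the angle is tiny, with a discrepancy that shrinks like the square of the angle, but they visibly fail to commute for a finite angle. The helper names here are hypothetical, and the axes and angles are arbitrary choices.</p>

```python
import math


def rot_x(t):
    c, s = math.cos(t), math.sin(t)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]


def rot_y(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def commutation_gap(angle):
    """Largest entrywise difference between Rx*Ry and Ry*Rx."""
    xy = matmul(rot_x(angle), rot_y(angle))
    yx = matmul(rot_y(angle), rot_x(angle))
    return max(abs(xy[i][j] - yx[i][j]) for i in range(3) for j in range(3))


gap_finite = commutation_gap(math.pi / 2)  # order doesn't just matter, it dominates
gap_small = commutation_gap(1e-3)          # second-order small
gap_smaller = commutation_gap(1e-4)        # 10x smaller angle, ~100x smaller gap
```

<p>The second-order term that the infinitesimal picture throws away is exactly what the finite case can’t ignore.</p>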
<p>I then stumbled across this <a href="https://en.wikipedia.org/wiki/Baker%E2%80%93Campbell%E2%80%93Hausdorff_formula">monstrosity of a formula</a>, which takes these infinitesimal rotations of so(3) and shows how to map them back to the normal rotations of SO(3). It also answers the question of what happens to <span class="math inline">\(e^{X+Y}\)</span> if <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span> don’t commute.</p>
<p>If you squint hard enough, it looks like a Taylor series expansion, in the sense that a Taylor series shows how to take the local derivative information (aka this tangent space business), and use that to extrapolate to the entire function. I can’t imagine anyone actually using this formula in practice, but a quantum information friend of mine says he uses this all the time.</p>
<h2 id="su2-and-quaternions">SU(2) and Quaternions</h2>
<p>At this point, I was trying to find a more computationally insightful or useful way to approach finite rotations. As it turns out, SO(3) is very closely related to SU(2), the group of unitary 2x2 matrices with determinant 1, as well as to the quaternions.</p>
<p>The best intuition I had was the Wikipedia segment describing the <a href="https://en.wikipedia.org/wiki/3D_rotation_group#Topology">topology of SO(3)</a>. If that’s the topology of SO(3), then SU(2) can be thought of as not just the unit sphere, but the entire space, using a projective geometry as described in these <a href="https://eater.net/quaternions">3blue1brown videos</a>. Since the unit sphere representing SO(3) is only half of the space and has funny connectivity, that explains all of this “twice covering” and “you have to spin 720 degrees to get back to where you started” business.</p>
<p>Computationally speaking, I found the 3blue1brown videos very enlightening. In short: the <span class="math inline">\((i,j,k)\)</span> component determines the axis of rotation, and the balance between the real component and the imaginary components determines the degree of rotation. This ends up being basically the topological description of SO(3) given by Wikipedia, with the additional restriction that the real component should remain positive to stay in SO(3).</p>
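<p>Here is a minimal sketch of that recipe using the standard unit-quaternion construction: the real part is the cosine of half the angle, and the <span class="math inline">\((i,j,k)\)</span> part is the axis scaled by the sine of half the angle. Quaternions are stored as plain <code>(w, x, y, z)</code> tuples, and all function names are made up for this example.</p>

```python
import math


def quat_from_axis_angle(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` about `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    s = math.sin(angle / 2) / norm
    return (math.cos(angle / 2), axis[0] * s, axis[1] * s, axis[2] * s)


def quat_mul(p, q):
    """Hamilton product of two quaternions."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (pw * qw - px * qx - py * qy - pz * qz,
            pw * qx + px * qw + py * qz - pz * qy,
            pw * qy - px * qz + py * qw + pz * qx,
            pw * qz + px * qy - py * qx + pz * qw)


def rotate(q, v):
    """Apply q v q^{-1}; for a unit quaternion the inverse is the conjugate."""
    w, x, y, z = q
    conj = (w, -x, -y, -z)
    _, rx, ry, rz = quat_mul(quat_mul(q, (0.0, *v)), conj)
    return (rx, ry, rz)


q = quat_from_axis_angle((0, 0, 1), math.pi / 2)   # quarter turn about z
quarter = rotate(q, (1.0, 0.0, 0.0))               # x-axis goes to y-axis
half = rotate(quat_mul(q, q), (1.0, 0.0, 0.0))     # composing = multiplying
negated = rotate(tuple(-c for c in q), (1.0, 0.0, 0.0))  # -q is the same rotation
full_turn = quat_from_axis_angle((0, 0, 1), 2 * math.pi)
```

<p>Two things from the text show up here: <code>q</code> and <code>-q</code> rotate vectors identically, which is the double cover, and a full 360 degree turn gives a quaternion with real part -1 rather than +1, so it takes 720 degrees to return to the starting quaternion.</p>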
<h3 id="side-note-lie-groups-and-algebras">Side note: Lie groups and algebras</h3>
<p>Lie groups are groups that have a continuous transformation (i.e. the rotation stuff we’ve been talking about). SO(3), SU(2), and quaternions of unit norm can be considered different Lie <em>groups</em> but they all have the same local structure when you zoom in on their tangent space at the identity (their Lie <em>algebra</em>). (<a href="https://en.wikipedia.org/wiki/Lie_algebra#Relation_to_Lie_groups">More here</a>). Mathematicians like to categorize things, so they don’t particularly care about computing rotations; they just want to be able to show that two algebras must be the same. There’s some topological connection; since SU(2) is simply connected (aka none of this ‘identify opposite points on the sphere’ business), this somehow implies that it must be a universal cover of all Lie groups with the same Lie algebra.</p>
<h2 id="geometric-algebras">Geometric algebras</h2>
<p>Ultimately, I found that the physicists’ and mathematicians’ accounts of 3D rotations basically talked past each other and I didn’t walk away with much insight on algebraic structure. I think the quaternions came closest; since the application of quaternions is done as <span class="math inline">\(qxq^{-1}\)</span>, it implies that simply multiplying quaternions is enough to get the composed rotation.</p>
<p>I happened to stumble upon <a href="https://en.wikipedia.org/wiki/Geometric_algebra">Geometric Algebras</a>, whose introductory tome can be found <a href="http://geocalc.clas.asu.edu/pdf/OerstedMedalLecture.pdf">in this lecture by Hestenes in 2002</a>. So far it looks like it will deliver on its ambitious goal, claiming that “conventional treatments employ an awkward mixture of vector, matrix and spinor or quaternion methods… GA provides a unified, coordinate-free treatment of rotations and reflections”.</p>
<p>I can’t really explain this stuff any better than it’s been presented in Hestenes’s lecture, so you should go look at that. I found that understanding GA was made much simpler by knowing all of this other stuff.</p>
<p>So that’s roughly where I ended up. And my Christmas break is over, so I guess I’ll pick this up some other day.</p>
<p>Thanks to the many people I bugged about this stuff over the past week or two.</p>
Algorithmic Bias2018-11-25T00:00:00Z2018-11-25T00:00:00Ztag:www.moderndescartes.com,2018-11-25:/essays/algorithmic_bias<p>Algorithmic bias has been around forever. Recently, it’s a very relevant issue as companies left and right promise to revolutionize industries with “machine learning” and “artificial intelligence”, by which they typically mean good ol’ algorithms and statistical techniques, but also sometimes deep learning.</p>
<p>The primary culprits in algorithmic bias are the same as they’ve always been: careless choice of objective function, fuzzy data provenance, and Goodhart’s law. Deep neural networks have their own issues relating to interpretability and blackbox behavior in addition to all of the issues I’ll discuss here. I won’t go into any of those issues, since what I cover here is already quite extensive.</p>
<p>I hope to convince you that algorithmic bias is very easy to create and that you can’t just hope to dodge it by being lucky.</p>
<h2 id="algorithms-are-less-biased-than-people">“Algorithms are less biased than people”</h2>
<p>I commonly run into the idea that because humans are often found to be biased, we should replace them with algorithms, which will be unbiased by virtue of not involving humans.</p>
<p>For example, see <a href="https://www.cnet.com/news/amazon-go-avoid-discrimination-shopping-commentary/">this testimonial about Amazon’s new cashierless stores</a>, or this <a href="http://www.pnas.org/content/108/17/6889">oft-cited finding that judges issue harsher sentences when hungry</a>.</p>
<p>I don’t doubt that humans can be biased. That being said, I also believe that algorithms reflect the choices made by their human creators. Those choices can be biased, either intentionally or not. With careful thought, an algorithm can be designed to be unbiased. But unless demonstrated otherwise, you should assume algorithms to be biased by default.</p>
<h3 id="a-koan">A koan</h3>
<p>Bias doesn’t necessarily have to be intentional. Here’s a koan I particularly like, from <a href="http://www.catb.org/jargon/html/koans.html">the Jargon File</a>.</p>
<blockquote>
<p>Sussman attains enlightenment</p>
<p>In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.</p>
<p>“What are you doing?”, asked Minsky.</p>
<p>“I am training a randomly wired neural net to play Tic-Tac-Toe” Sussman replied.</p>
<p>“Why is the net wired randomly?”, asked Minsky.</p>
<p>“I do not want it to have any preconceptions of how to play”, Sussman said.</p>
<p>Minsky then shut his eyes.</p>
<p>“Why do you close your eyes?”, Sussman asked his teacher.</p>
<p>“So that the room will be empty.”</p>
<p>At that moment, Sussman was enlightened.</p>
</blockquote>
<p>There are many ideas in this koan, but one that I take in particular, is that all algorithms have their tendencies, whether or not you understand what they are yet. It is the job of the algorithm designers to understand what those tendencies are, and decide if they constitute biases that need correcting.</p>
<p>The definitive example here is probably <a href="https://en.wikipedia.org/wiki/Gerrymandering">gerrymandering</a>, for which multiple proposed algorithmic solutions exist - all of them biased in different ways in favor of rural or urban voters. Algorithms have not solved the gerrymandering bias problem; they’ve merely shifted the debate to “which algorithm’s biases are we okay with?”</p>
<h2 id="what-are-you-optimizing-for">What are you optimizing for?</h2>
<p>The easiest way for bias to slip into an algorithm is via the optimization target.</p>
<p>One pernicious way in which bias sneaks into algorithms is via implicitly defined optimization targets. If we are optimizing “total X” for some X, then a good question is “how is X defined?”. If X is the classification error over some dataset, then the demographic makeup of the dataset implicitly defines how important it is to optimize for one group over the other.</p>
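<p>A toy illustration of how the dataset’s makeup sets the weights: the records below are entirely made up, with one group supplying 95% of the examples. The overall error rate looks fine even though the smaller group is served badly.</p>

```python
from collections import defaultdict


def error_rates(records):
    """Return (overall_error, {group: error}) from (group, is_correct) pairs."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        if not is_correct:
            errors[group] += 1
    overall = sum(errors.values()) / sum(totals.values())
    per_group = {g: errors[g] / totals[g] for g in totals}
    return overall, per_group


# Hypothetical evaluation results: 950 majority examples, 50 minority examples.
records = ([("majority", True)] * 930 + [("majority", False)] * 20
           + [("minority", True)] * 30 + [("minority", False)] * 20)

overall, per_group = error_rates(records)
```

<p>Here the overall error is 4%, but the minority group’s error is 40%. An optimizer that only sees “total error” has little reason to care about the difference.</p>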
<p>For example, image classification algorithms are judged by their ability to correctly classify the images in ImageNet or OpenImages. Unfortunately, we are only now realizing that what we thought was a wide variety of images is actually heavily biased towards Western cultures, because we harvested images from the English-speaking subset of the Internet, and because we hired English-speaking human labelers to annotate the images, and because the categories we are classifying for make the most sense in a Western context. The image classifiers trained on ImageNet and OpenImages are thus great at recognizing objects familiar to Westerners, but horrible at labeling images from other cultures. This <a href="https://www.kaggle.com/c/inclusive-images-challenge">Kaggle challenge</a> asks teams to train a classifier that does well on images from cultures they haven’t trained on.</p>
<p>Another example is this contest to <a href="https://www.bostonglobe.com/opinion/2017/12/22/don-blame-algorithm-for-doing-what-boston-school-officials-asked/lAsWv1Rfwqmq6Jfm5ypLmJ/story.html">algorithmically optimize bus schedules in Boston</a>. The <a href="/static/algorithmic_bias/bps_challenge_overview.pdf">original contest statement</a> asked teams to optimize for the fewest number of busses required to bus all students around. Even though the optimization target doesn’t explicitly prioritize any one group of people over the other, I’d guess that patterns in housing and geography would result in systematic bias in the resulting bus routes. (This example is not so different from the gerrymandering example.)</p>
<p>Finally, a classic paper in this field by <a href="https://arxiv.org/abs/1610.02413">Hardt, Price, and Srebro</a> points out that there isn’t an obvious way to define fairness for subgroups in a classification problem (e.g. loan applications or college admissions). You can require the score thresholds to be equal across subgroups. You can require that an equal percentage of qualified applicants be accepted from each subgroup. You can require that the demographic makeup of the accepted applicants match the demographic makeup of all applicants. (And you’ll find people who think each of these choices is the ‘obvious’ way to do it.) Unfortunately, it’s impossible to simultaneously optimize all of these criteria. You can see a very nice <a href="https://research.google.com/bigpicture/attacking-discrimination-in-ml/">interactive visualization of this phenomenon</a>.</p>
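<p>The tension between these criteria shows up even in a tiny made-up example. The sketch below is not the paper’s method; it just computes, for hypothetical applicants given as <code>(group, score, qualified)</code> tuples, each group’s acceptance rate and the fraction of its qualified applicants who get accepted, under a choice of per-group thresholds.</p>

```python
def summarize(applicants, thresholds):
    """Return {group: (acceptance_rate, qualified_acceptance_rate)}."""
    stats = {}
    for group in sorted({g for g, _, _ in applicants}):
        members = [(s, q) for g, s, q in applicants if g == group]
        accepted = [(s, q) for s, q in members if s >= thresholds[group]]
        qualified = [m for m in members if m[1]]
        accepted_qualified = [a for a in accepted if a[1]]
        stats[group] = (len(accepted) / len(members),
                        len(accepted_qualified) / len(qualified))
    return stats


# Hypothetical data: the score is systematically lower for group B.
applicants = [
    ("A", 0.9, True), ("A", 0.8, True), ("A", 0.7, True),
    ("A", 0.6, False), ("A", 0.4, False),
    ("B", 0.8, True), ("B", 0.6, True), ("B", 0.5, True),
    ("B", 0.4, False), ("B", 0.3, False), ("B", 0.2, False),
]

equal_thresholds = summarize(applicants, {"A": 0.65, "B": 0.65})
equal_opportunity = summarize(applicants, {"A": 0.65, "B": 0.5})
```

<p>With equal thresholds, all of group A’s qualified applicants are accepted but only a third of group B’s. Lowering B’s threshold equalizes that rate, but now the thresholds differ and the overall acceptance rates (60% versus 50%) still don’t match.</p>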
<h2 id="where-does-your-data-come-from">Where does your data come from?</h2>
<p>With so many industries being automated by computers, data about operations and human behavior are becoming more widely available - and with it, the temptation to just grab whatever data stream happens to be available. However, data streams come with many caveats which are often ignored.</p>
<p>The most important caveat is that such data streams are often observational, not experimental. What this means is that there has been no particular care taken to create a control group; what you see is what you have. Observational data is often deeply confounded with spurious correlations - the famous “one glass of wine a day is good for you” study was confounded by the fact that wine drinkers form a different demographic than beer drinkers or liquor drinkers. So far, there is no systematic or automatic way to separate causation from correlation. It’s an active area of research.</p>
<p>That being said, that doesn’t mean that all results from observational data are useless. The correlations you find will often be good enough to form a basis for a predictive model. However, unless you dig into your model’s results to figure out where that predictive power is coming from, it’s highly likely that you have unintentional bias lurking in your model. Even if you don’t have demographic categories as an input to your model, there are a million ways to accidentally introduce demographic information via a side channel - for example, zip codes.</p>
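<p>One rough way to check for such a side channel is to see how well the supposedly neutral feature predicts the demographic category on its own. The sketch below uses made-up zip codes and group labels, and guesses each record’s group as the most common group in its zip code.</p>

```python
from collections import Counter, defaultdict


def proxy_accuracy(records):
    """Accuracy of guessing the group from the feature alone (majority vote)."""
    by_feature = defaultdict(Counter)
    for feature, group in records:
        by_feature[feature][group] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_feature.values())
    return correct / len(records)


# Hypothetical data: two zip codes with very different demographic mixes.
records = ([("00001", "A")] * 45 + [("00001", "B")] * 5
           + [("00002", "A")] * 10 + [("00002", "B")] * 40)

baseline = Counter(g for _, g in records).most_common(1)[0][1] / len(records)
with_zip = proxy_accuracy(records)
```

<p>In this toy data, always guessing the largest group is right 55% of the time, but knowing the zip code raises that to 85%. A model that sees zip codes effectively sees most of the demographic information too.</p>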
<p>Another risk of data dependencies is that even if you do all the hard work of separating causation from correlation, and accounting for bias, you may find your model has developed some unintentional bias a year later, when the collection methodology of your data has shifted. Maybe the survey you were administering has changed its wording, or the survey website broke for (probably wealthier) Safari users only, or the designers changed the font to be low-contrast, discouraging older users and those with poor eyesight from taking your survey. This paper from <a href="https://ai.google/research/pubs/pub43146">D. Sculley et al.</a> lays out the problems often faced when putting machine learning into production, and makes recommendations like proactively pruning input streams to minimize the risk of data shift.</p>
<p>Related to the idea of shifting data dependencies, companies will typically develop models in isolation, looking narrowly at the data streams they have available. The problem here is that nobody has a full picture of the dependency chains of data and models, and bias can accumulate when algorithms consume other algorithms’ output without careful consideration. For example, when a loan office looks at credit history to make a loan decision, each previous loan decision was probably also made by looking at credit history, leading to a positive feedback loop of bias. <a href="https://www.blog.google/technology/ai/ai-principles/">Google’s AI Principles</a> acknowledge that distinguishing fair and unfair bias is difficult, and that we should seek to avoid creating or reinforcing bias. By not reinforcing bias, you can avoid contributing to the problem, no matter what everyone else is doing.</p>
<h2 id="are-you-keeping-your-model-up-to-date">Are you keeping your model up-to-date?</h2>
<p>Goodhart’s law says:</p>
<blockquote>
<p>When a measure becomes a target, it ceases to be a good measure.</p>
</blockquote>
<p>What we’d like from our algorithms is a nuanced truth. But it’s much easier to fake it with something that happens to work. The problems mentioned above (imprecise objective function, shifting data distributions) are often where algorithms can be exploited by adversaries.</p>
<p>For example, Google’s search has to deal with search engine optimizers trying to exploit <a href="https://en.wikipedia.org/wiki/PageRank">PageRank</a>. 10-15 years ago, link farms were pretty common and people would try to find ways to sneak links to their websites onto other people’s websites (often via user-submitted content), which is how <a href="https://en.wikipedia.org/wiki/Nofollow">rel=nofollow</a> came about. Ultimately this happened because PageRank was an imprecise approximation of ‘quality’.</p>
<p>Another more recent example is <a href="https://en.wikipedia.org/wiki/Tay_(bot)">Tay</a> going off the rails when users tweeted hateful messages at it. This one is due to optimizing to predict human tweets (an optimization target that was implicitly weighted by the set of tweets in its training data). The vulnerability here is pretty obvious: submit enough messages, and you can overwhelm the training data with your own.</p>
<p>There are far too many examples to list here. Chances are, if there’s any nontrivial incentive to game an algorithm, it can and will be gamed.</p>
<h2 id="there-is-no-silver-bullet">There is no silver bullet</h2>
<p>This is an essay that could easily have been written 20 years ago with different examples. None of what I’ve mentioned so far is particularly new. What’s new is that people seem to think that machine learning magically solves these problems. A classic quip comes to mind: “There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.”</p>
<p>We can do better.</p>
<p>The advent of ML means that we have to be more, not less, careful about bias issues. The increased burden of proof means that for most applications, we should probably stick with the same old models we’ve been using for the last few decades: logistic regression, random forests, etc. I think that ML should be used sparingly and to enable technologies that were previously completely intractable: for example, anything involving images or audio/text understanding was previously impossible but is now within our reach.</p>
<p>There is still a lot of good we can do with simpler methods that we understand.</p>