Modern Descartes - Essays by Brian Leehttp://www.moderndescartes.com/essays2019-05-12T00:00:00ZTensorFlow 2 Documentation2019-05-12T00:00:00Z2019-05-12T00:00:00Ztag:www.moderndescartes.com,2019-05-12:/essays/tf2_docs<p>I’ve been putting my technical blogging experience to good use.</p>
<p>For the past 9 months or so, I’ve been working on AutoGraph, one of the marquee features of TensorFlow 2. TF2 is a drastically different system and much of my time has been spent thinking through the best ways to use TF2 as a whole. I wrote up my experiences as an internal doc that became pretty popular. That doc is now public documentation as <a href="https://www.tensorflow.org/alpha/guide/effective_tf2">Effective TensorFlow 2.0</a>. I’ve also written the majority of the official documentation on tf.function and AutoGraph, which includes the <a href="https://www.tensorflow.org/alpha/guide/autograph">quick start</a> and the <a href="https://www.tensorflow.org/alpha/tutorials/eager/tf_function">deep dive</a>.</p>
<p>I hope you find these docs useful!</p>
Why study algorithms?2019-04-02T00:00:00Z2019-04-02T00:00:00Ztag:www.moderndescartes.com,2019-04-02:/essays/why_algorithms<p>A common sentiment is that algorithms and data structures are useless to know, other than to pass the interview process at most software companies, and that one should just learn enough to get by those interviews. I strongly disagree with this notion, and I’ll try to explain why I think it’s valuable to study algorithms and data structures in depth.</p>
<h1 id="human-algorithms">Human algorithms</h1>
<p>Let’s say you have a used deck of cards, and you suspect you’ve lost a card. You’d like to know which card is missing, if any. As a human, how would you go about doing this? There are many possible solutions, but here’s the one I would probably use: sort the cards by rank and suit, using an insertion sort, and then scan through the sorted cards to see if any cards are missing. The insertion sort would involve holding a stack of sorted cards, picking up the next unsorted card, and inserting it into its proper place, one card at a time.</p>
<p>Breaking this down, there are a few assumptions that play into my choice of algorithm.</p>
<p>First, human working memory can only contain about 7-10 items. If the task were instead to name a missing color of the rainbow, then you would not bother sorting the list of given colors - you would just scan the list and name the missing color. But you could not do this with a full deck of cards. So the sorting step is necessary for humans.</p>
<p>Second, the human visual system is relatively fast at scanning through a fanned-out stack of cards, and figuring out where the next card should be inserted. This happens much faster than the physical act of picking up and manipulating a card, so it’s effectively free.</p>
<p>Third, the real world allows one to inject a card into an existing stack of cards with approximately constant cost. Thus, an insertion sort is <span class="math inline">\(O(N)\)</span> in real life, whereas on computers it is typically <span class="math inline">\(O(N^2)\)</span>.</p>
<p>Combining all of these aspects of the problem at hand, we conclude that sorting the deck via insertion sort, then scanning the sorted deck, is an efficient way to verify a complete deck.</p>
<h1 id="computer-algorithms">Computer algorithms</h1>
<p>Faced with the same task, a computer would handle this a bit differently. One possible solution: First, allocate an array of 52 bits. Then, for each card, you would flip the appropriate bit from 0 to 1 to mark it seen. Finally, scanning through the array, you’d look for any unflipped bits.</p>
<p>Another possible solution: keep a running sum of all cards seen (A of diamonds = 1, 2 of diamonds = 2, …), and then check whether the sum matched the expected sum <span class="math inline">\(1 + 2 + \ldots + 52\)</span>. (This solution only works if at most 1 card is missing; otherwise it cannot distinguish which cards are missing.)</p>
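<p>Both approaches can be sketched in a few lines of Python. This is a minimal sketch that assumes cards are encoded as the integers 1 through 52 (following the “A of diamonds = 1” numbering above); the function names are my own and purely illustrative.</p>

```python
def find_missing_bitmap(cards, deck_size=52):
    """Mark each seen card in a table of flags, then scan for unset flags."""
    seen = [False] * (deck_size + 1)  # index 0 is unused
    for card in cards:
        seen[card] = True
    return [card for card in range(1, deck_size + 1) if not seen[card]]


def find_missing_running_sum(cards, deck_size=52):
    """Compare the sum of seen cards against 1 + 2 + ... + deck_size.

    Only valid when at most one card is missing. Returns None for a full deck.
    """
    expected = deck_size * (deck_size + 1) // 2
    difference = expected - sum(cards)
    return difference if difference != 0 else None


deck = [card for card in range(1, 53) if card != 17]  # card 17 is lost
print(find_missing_bitmap(deck))       # [17]
print(find_missing_running_sum(deck))  # 17
```

<p>The flag table can report any number of missing cards, while the running sum needs only a single number of storage but breaks down as soon as two or more cards are gone.</p>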
<p>Already, we can see that what is “easy” for humans is not necessarily easy for computers, and vice versa. Human working memory is small, but we can do pattern recognition over our visual field very quickly. Computers can memorize a large amount of arbitrary data and do arithmetic with ease, but to process an image would require a deep convolutional neural network of many millions of operations.</p>
<h1 id="rules-of-the-game">Rules of the game</h1>
<p>Given that humans and computers have different constraints on their operation, they naturally end up with different algorithms for the same task. This also means that normal human intuition about the “obvious” way to do something isn’t necessarily aligned with what a computer is good at doing. So one reason to study algorithms is to learn the rules of the game for computers and hone your intuition about efficient ways to do things on a computer.</p>
<p>In a broader sense, the study of algorithms is about understanding the consequences of a particular set of rules. As it turns out, the rules of the game have actually been slowly changing over the last half-century, and the algorithms that have been published in textbooks aren’t necessarily the right ones for today’s computers.</p>
<p>Take, for example, memory access. It used to be true decades ago that memory access was about the same cost as arithmetic. But today, that’s not true anymore: <a href="http://norvig.com/21-days.html#answers">Latency numbers every programmer should know</a> tells us that main memory access is actually ridiculously slow. Modern processors have a hierarchy of caches which get progressively larger and slower, and textbook algorithms run best when they fit entirely on the fastest caching levels, where their decades-old assumptions hold true.</p>
<p>So of the two algorithms given above for detecting missing cards, the running sum algorithm ends up a few times faster than the bit-flipping algorithm. While both algorithms have <span class="math inline">\(O(N)\)</span> runtime, one solution requires going back to memory to overwrite a 1, whereas the other one updates a number in-place.</p>
<p>Another example is the dramatic rise in hard drive capacity over the last few decades. Hard drives have gone from gigabytes in capacity to terabytes of capacity over the course of a decade. And yet, the speed of disk reading has been fundamentally limited by the physical constraints of spinning a platter at ~7200 RPM, and thus the ratio of hard drive capacity to read/write speed has dramatically shifted. As a result, storage space is relatively cheap, compared to the cost of actually reading that storage space. I remember when <a href="https://aws.amazon.com/glacier/">Amazon Glacier</a> was first announced, there was a lot of speculation as to what secret storage medium Amazon had invented that resulted in such a peculiar pricing structure (nearly free to store data, but expensive to actually read that data). There is no speculation needed if you understand hard drive trends. And nowadays, SSDs change that equation again - Facebook has published a <a href="https://research.fb.com/publications/reducing-dram-footprint-with-nvm-in-facebook/">few</a> <a href="https://research.fb.com/publications/bandana-using-non-volatile-memory-for-storing-deep-learning-models/">recent</a> papers describing how SSDs (also referred to as NVM, non-volatile memory) can be used directly as a slower caching layer for various serving systems.</p>
<p>Yet another example: in machine learning, where teams are investigating the use of customized hardware to execute giant neural networks, it turns out that different scaling limits are reached - like the bandwidth of connections between chips. So here, an entirely new set of algorithms is needed that works around these constraints. For example, when doing a <code>reduce_sum</code> over computational results from N chips, you need <span class="math inline">\(2N\)</span> cycles and <span class="math inline">\(2N\)</span> overall bandwidth to aggregate all those numbers in one central chip and relay those results back. However, if you wire up the chips in a big circle, then you can double the speed to <span class="math inline">\(N\)</span> cycles, at the cost of increasing overall bandwidth to <span class="math inline">\(N^2\)</span>. And if you wire up the chips in a 2-D toroidal configuration, then you actually just need <span class="math inline">\(2\sqrt{N}\)</span> cycles and overall bandwidth of <span class="math inline">\(2N\)</span> to aggregate the numbers, by first adding in the horizontal direction, and then in the vertical direction. (See the <a href="https://arxiv.org/abs/1811.06992">TPU v3 paper</a> for more.)</p>
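<p>The “horizontal first, then vertical” idea can be illustrated with a toy Python sketch. It assumes N is a perfect square and only shows the order of the additions; it does not model the wraparound links, timing, or bandwidth of real hardware. The function names are hypothetical, and the cycle counts are simply the figures quoted above.</p>

```python
import math


def two_phase_reduce_sum(grid):
    """Sum a square grid of per-chip values in two phases."""
    # Phase 1: each row of chips adds its values along the horizontal direction.
    row_totals = [sum(row) for row in grid]
    # Phase 2: the row totals are added along the vertical direction.
    return sum(row_totals)


def cycle_estimates(n_chips):
    """Cycle counts as quoted in the text for N chips (N a perfect square)."""
    return {
        "central chip": 2 * n_chips,
        "ring": n_chips,
        "2-D torus": 2 * math.isqrt(n_chips),
    }


side = 4
grid = [[row * side + col for col in range(side)] for row in range(side)]
print(two_phase_reduce_sum(grid))    # 120
print(cycle_estimates(side * side))
```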
<h1 id="conclusion">Conclusion</h1>
<p>All of these examples seem pretty esoteric. Do you need to know algorithms if you’re not working on new hardware?</p>
<p>I think it depends on where in the technology life cycle you want to be. On the leading end of this cycle, you had better understand your algorithms. On the tail end of this cycle, you probably don’t need to know algorithms as much.</p>
<p>At the leading edge, innovation ends up changing the rules of the game, and if you’re working there, then you had better have a solid grasp on algorithms. Then, as the hardest problems are worked out, services and libraries are created for the rest of us. Still, you cannot effectively use those services/libraries unless you understand the underlying technology. I’ve heard many stories about projects that were killed because their database technology choices were fundamentally wrong. And eventually, the technology matures enough that a community grows around each correct pairing of technology and applications, and then you won’t have to know algorithms to make a good choice - e.g. PHP/Wordpress has turned into a pretty solid development platform for DIY-websites.</p>
Forays into 3D geometry2019-01-10T00:00:00Z2019-01-10T00:00:00Ztag:www.moderndescartes.com,2019-01-10:/essays/3d_geometry<p>I’ve been trying to understand <a href="https://arxiv.org/abs/1802.08219">this paper</a> and as part of this process, realized that I should learn more geometry, algebra, and group theory. So I spent a week digging into the math of 3D geometry and here’s what I’ve learned so far.</p>
<p>I stumbled on these concepts in a highly nonlinear fashion, so my notes are going to be a bit scattered. I don’t know how useful they’ll be to other people - probably, they won’t be? This is more for my future self.</p>
<h1 id="mappings">Mappings</h1>
<p>A mapping is a correspondence between two domains. For example, the exponential function maps real numbers to positive real numbers. Furthermore, it transforms the group operation of addition into multiplication (and its inverse, the logarithm, transforms multiplication back into addition).</p>
<p><span class="math display">\[ x + y = z \Leftrightarrow e^xe^y = e^{x+y} = e^z \]</span></p>
<p>This is a useful transformation to make because addition is easier to think about than multiplication. In this case, the mapping is also bijective, meaning that one can losslessly convert back and forth between the two domains.</p>
<p>One common pattern is that you can transform your numbers into the domain that’s easier to reason about, do your computation there, and then convert back afterwards. Eventually, you get annoyed with converting back and forth, and you start reasoning entirely in the transformed domain. This happens, for example, in converting between time/frequency domains for sound using the Fourier transform - everyone thinks in the frequency domain even though the raw data always comes in the time domain.</p>
<h1 id="the-circle-group">The circle group</h1>
<p>The <a href="https://en.wikipedia.org/wiki/Circle_group">circle group</a> consists of the set of all complex numbers with modulus 1 - in other words, the unit circle on the complex plane. It’s a pretty simple group to understand, but it shows how we’re going to try and attack 3D rotations. There are multiple ways to look at this.</p>
<ul>
<li>You can represent this group as an angle of rotation <span class="math inline">\(\theta\)</span> from the x-axis (i.e. polar coordinates), under addition.</li>
<li>You can represent this group using complex numbers, under multiplication.</li>
<li>You can work with the matrices of the following form, under multiplication. (This form is convenient because multiplication by this matrix is equivalent to rotating a vector by <span class="math inline">\(\theta\)</span>.)</li>
</ul>
<span class="math display">\[\begin{bmatrix}
\cos \theta & -\sin \theta \\
\sin \theta & \cos \theta \\
\end{bmatrix}\]</span>
<p>All of these representations are tied together by <a href="https://en.wikipedia.org/wiki/Euler%27s_formula">Euler’s formula</a>, which states that <span class="math inline">\(e^{i\theta} = \cos \theta + i\sin \theta\)</span>.</p>
<p>Somewhat surprisingly, Euler’s formula also works if you consider the exponentiation to be a <a href="https://en.wikipedia.org/wiki/Matrix_exponential">matrix exponentiation</a>, and you use the <a href="https://en.wikipedia.org/wiki/Complex_number#Matrix_representation_of_complex_numbers">matrix representation</a> of complex numbers.</p>
<p><span class="math display">\[\begin{gather}
\exp\left(
\begin{bmatrix}
0 & -\theta \\
\theta & 0 \\
\end{bmatrix}\right)
=
\begin{bmatrix}
\cos \theta & -\sin \theta \\
\sin \theta & \cos \theta \\
\end{bmatrix}
\end{gather}\]</span></p>
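<p>This identity is easy to check numerically. The following is a minimal sketch in plain Python that approximates the matrix exponential with a truncated Taylor series (<span class="math inline">\(I + M + M^2/2! + \ldots\)</span>); the helper names and the choice of 30 terms are my own assumptions, and a real project would more likely reach for a library routine.</p>

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=30):
    """Truncated Taylor series: I + M + M^2/2! + ..."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds M^k / k!, starting at k = 0
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


theta = 0.75
generator = [[0.0, -theta], [theta, 0.0]]
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]
exponential = mat_exp(generator)
matches = all(
    math.isclose(exponential[i][j], rotation[i][j], abs_tol=1e-12)
    for i in range(2) for j in range(2)
)
print(matches)
```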
<h1 id="matrix-exponentiation">Matrix exponentiation</h1>
<p>It turns out that subbing in matrix exponentiation for regular exponentiation basically works most of the time, and additionally has some other surprising properties.</p>
<ul>
<li>The determinant of a real matrix’s exponentiation is strictly positive and the result is therefore invertible. (This is analogous to <span class="math inline">\(e^x > 0\)</span> for real <span class="math inline">\(x\)</span>).</li>
<li>It commutes with transposition, so that <span class="math inline">\(e^{\left(M^T\right)} = \left(e^M\right)^T\)</span></li>
</ul>
<p>There are a few things that don’t carry over seamlessly, related to the fact that matrix multiplication isn’t commutative.</p>
<ul>
<li><em>If</em> <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span> commute (<span class="math inline">\(XY = YX\)</span>), then <span class="math inline">\(e^{X + Y} = e^Xe^Y\)</span>.</li>
</ul>
<p>Answering the question of what happens to <span class="math inline">\(e^{X+Y}\)</span> when <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span> don’t commute leads to a rabbit hole of gnarly algebra.</p>
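<p>As a small worked example of my own (not taken from a reference), here are two matrices that don’t commute, chosen so that the exponentials can be computed by hand. Since <span class="math inline">\(X^2 = Y^2 = 0\)</span>, their exponential series stop after the linear term, and since <span class="math inline">\((X+Y)^2 = I\)</span>, the series for <span class="math inline">\(e^{X+Y}\)</span> collapses into hyperbolic functions:</p>
<p><span class="math display">\[\begin{gather}
X = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad
Y = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}, \quad
XY = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \neq
\begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} = YX \\
e^Xe^Y = (I + X)(I + Y) = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix}, \quad
e^{X+Y} = \begin{bmatrix} \cosh 1 & \sinh 1 \\ \sinh 1 & \cosh 1 \end{bmatrix}
\approx \begin{bmatrix} 1.543 & 1.175 \\ 1.175 & 1.543 \end{bmatrix}
\end{gather}\]</span></p>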
<h2 id="skew-symmetric-matrices">Skew-symmetric matrices</h2>
<p>A skew-symmetric matrix is a matrix whose transpose is equal to its negation: <span class="math inline">\(A^T = -A\)</span>. It turns out that every square matrix can be broken down into a symmetric matrix plus a skew-symmetric matrix, by defining <span class="math inline">\(B = \frac{1}{2}(A + A^T)\)</span> to be the symmetrical part of <span class="math inline">\(A\)</span>, and then realizing that what’s left over, <span class="math inline">\(A - B\)</span>, is skew-symmetric.</p>
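<p>To spell out why the leftover part is skew-symmetric, with the same <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> as above:</p>
<p><span class="math display">\[(A - B)^T = A^T - \frac{1}{2}\left(A^T + A\right) = \frac{1}{2}\left(A^T - A\right) = -\left(A - \frac{1}{2}\left(A + A^T\right)\right) = -(A - B)\]</span></p>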
<p>Symmetric matrices over real numbers, by the <a href="https://en.wikipedia.org/wiki/Spectral_theorem">spectral theorem</a>, can be represented as a diagonal matrix in some basis. In plain English, this means that symmetric matrices correspond to transformations that are purely “rescaling” along some set of axes, with no rotations.</p>
<p>So that means that the skew-symmetric remainder of a matrix probably corresponds to the rotational part of the transformation. This seems to be related to the fact that a skew-symmetric matrix’s eigenvalues are all purely imaginary, but I don’t really fully understand the connection here.</p>
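<p>The more literal connection might be that when you exponentiate a skew-symmetric matrix, you get an orthogonal matrix (a matrix for which <span class="math inline">\(MM^T = I\)</span>). The proof is pretty simple, using the fact that <span class="math inline">\(A\)</span> and <span class="math inline">\(A^T = -A\)</span> commute:</p>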
<p><span class="math display">\[e^A\left(e^A\right)^T = e^Ae^{A^T} = e^{A + A^T} = e^{A - A} = e^0 = I\]</span></p>
<h2 id="orthogonal-matrices">Orthogonal matrices</h2>
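<p>You may remember orthogonal matrices as those transformations that preserve distances - aka rotations and inversions (“improper” rotations). SO(3) consists of the orthogonal matrices of determinant +1, thus excluding inversions. The exponential map is surjective - for every element of SO(3), there exists a skew-symmetric matrix that exponentiates to it. (It’s not a bijection, though: rotating by <span class="math inline">\(\theta\)</span> or by <span class="math inline">\(\theta + 2\pi\)</span> about the same axis gives the same rotation, much as angles in the 2-D case are only defined modulo <span class="math inline">\(2\pi\)</span>.)</p>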
<h1 id="mapping-between-so3-and-skew-symmetric-matrices">Mapping between SO(3) and skew-symmetric matrices</h1>
<p>Earlier, we looked at the circle group, which was a toy example showing that we could map 2-D rotations between {multiplication of 2-D matrices} and {addition over plain angles}. Now, to understand 3-D rotations, we’ll try to map between {multiplication of 3-D matrices} and {addition of skew-symmetric matrices}.</p>
<p>It turns out that actually this doesn’t work with finite rotation matrices. So we’ll just brush the problem under the rug by invoking a “tangent space” around the identity, which means the space of infinitesimal rotations. This space is represented by lowercase so(3) and has an orthogonal basis set which I’ll call <span class="math inline">\(\{s_1, s_2, s_3\}\)</span> with concrete representations of</p>
<p><span class="math display">\[\begin{gather}
s_1 =
\begin{bmatrix}
0 & 0 & 0 \\
0 & 0 & \alpha \\
0 & -\alpha & 0 \\
\end{bmatrix}
\quad
s_2 =
\begin{bmatrix}
0 & 0 & \alpha \\
0 & 0 & 0 \\
-\alpha & 0 & 0 \\
\end{bmatrix}
\quad
s_3 =
\begin{bmatrix}
0 & \alpha & 0 \\
-\alpha & 0 & 0 \\
0 & 0 & 0 \\
\end{bmatrix}
\end{gather}\]</span></p>
<p>These skew-symmetric matrices are not themselves rotations; they’re derivatives. To make them rotations, you have to exponentiate them, which to first order in the infinitesimal <span class="math inline">\(\alpha\)</span> turns out to be equivalent to adding them to the identity matrix: <span class="math inline">\(e^{s_i} = I + s_i\)</span>. This is analogous to the real numbers, where <span class="math inline">\(e^x \approx 1 + x\)</span> for <span class="math inline">\(x \approx 0\)</span>.</p>
<p>Since <span class="math inline">\(\alpha\)</span> is considered to be an infinitesimal, raising <span class="math inline">\((I + s_i)^k\)</span> to a power <span class="math inline">\(k\)</span> just results in the matrix <span class="math inline">\((I + ks_i)\)</span> because all second order terms disappear. Also, addition within so(3) corresponds to multiplication in the exponential map. <span class="math inline">\(m_i + m_j = m_k \Leftrightarrow e^{m_i}e^{m_j} = e^{m_i + m_j} = e^{m_k}\)</span> for arbitrary <span class="math inline">\(m \in so(3)\)</span>. So this is nice; we’ve got something resembling the circle group for 3 dimensions. Unfortunately, this only works for infinitesimal rotations and completely falls apart for finite rotations.</p>
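<p>The claim about powers can be made explicit with a binomial expansion, where every term beyond the linear one carries at least a factor of <span class="math inline">\(\alpha^2\)</span> and is therefore dropped:</p>
<p><span class="math display">\[(I + s_i)^k = I + ks_i + \binom{k}{2}s_i^2 + \ldots \approx I + ks_i\]</span></p>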
<p>I then stumbled across this <a href="https://en.wikipedia.org/wiki/Baker%E2%80%93Campbell%E2%80%93Hausdorff_formula">monstrosity of a formula</a>, which takes these infinitesimal rotations of so(3) and shows how to map them back to the normal rotations of SO(3). It also answers the question of what happens to <span class="math inline">\(e^{X+Y}\)</span> if <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span> don’t commute.</p>
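<p>For reference, the first few terms of that formula look like this, where <span class="math inline">\([X, Y] = XY - YX\)</span> is the commutator. When <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span> commute, every bracket vanishes and only <span class="math inline">\(X + Y\)</span> survives:</p>
<p><span class="math display">\[\log\left(e^Xe^Y\right) = X + Y + \frac{1}{2}[X, Y] + \frac{1}{12}\left([X, [X, Y]] + [Y, [Y, X]]\right) + \ldots\]</span></p>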
<p>If you squint hard enough, it looks like a Taylor series expansion, in the sense that a Taylor series shows how to take the local derivative information (aka this tangent space business), and use that to extrapolate to the entire function. I can’t imagine anyone actually using this formula in practice, but a quantum information friend of mine says he uses this all the time.</p>
<h1 id="su2-and-quaternions">SU(2) and Quaternions</h1>
<p>At this point, I was trying to find a more computationally insightful or useful way to approach finite rotations. As it turns out, SO(3) is very closely related to SU(2), the set of unitary 2x2 matrices with determinant 1, as well as to the quaternions.</p>
<p>The best intuition I had was the Wikipedia segment describing the <a href="https://en.wikipedia.org/wiki/3D_rotation_group#Topology">topology of SO(3)</a>. If that’s the topology of SO(3), then SU(2) can be thought of as not just the unit sphere, but the entire space, using a projective geometry as described in these <a href="https://eater.net/quaternions">3blue1brown videos</a>. Since the unit sphere representing SO(3) is only half of the space and has funny connectivity, that explains all of this “twice covering” and “you have to spin 720 degrees to get back to where you started” business.</p>
<p>Computationally speaking, I found the 3blue1brown videos very enlightening. In short: the <span class="math inline">\((i,j,k)\)</span> component determines the axis of rotation, and the balance between the real component and the imaginary components determines the degree of rotation. This ends up being basically the topological description of SO(3) given by Wikipedia, with the additional restriction that the real component should remain positive to stay in SO(3).</p>
<h2 id="side-note-lie-groups-and-algebras">Side note: Lie groups and algebras</h2>
<p>Lie groups are groups whose elements form a continuous family (i.e. the rotation stuff we’ve been talking about). SO(3), SU(2), and quaternions of unit norm can be considered different Lie <em>groups</em> but they all have the same local structure when you zoom in on their tangent space at the identity (their Lie <em>algebra</em>). (<a href="https://en.wikipedia.org/wiki/Lie_algebra#Relation_to_Lie_groups">More here</a>). Mathematicians like to categorize things, so they don’t particularly care about computing rotations; they just want to be able to show that two algebras must be the same. There’s some topological connection; since SU(2) is simply connected (aka none of this ‘identify opposite points on the sphere’ business), this somehow implies that it must be a universal cover of all Lie groups with the same Lie algebra.</p>
<h1 id="geometric-algebras">Geometric algebras</h1>
<p>Ultimately, I found that the physicists’ and mathematicians’ accounts of 3D rotations basically talked past each other and I didn’t walk away with much insight on algebraic structure. I think the quaternions came closest; since the application of quaternions is done as <span class="math inline">\(qxq^{-1}\)</span>, it implies that simply multiplying quaternions is enough to get the composed rotation.</p>
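<p>The composition property is easy to see in code. Here is a minimal sketch in plain Python, assuming quaternions are stored as <code>(w, x, y, z)</code> tuples and that rotations are given by unit quaternions (so the inverse is just the conjugate). All of the function names are my own.</p>

```python
import math


def q_mul(a, b):
    """Hamilton product of quaternions given as (w, x, y, z) tuples."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def q_conj(q):
    """Conjugate, which equals the inverse for a unit quaternion."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def axis_angle(axis, angle):
    """Unit quaternion for a rotation by `angle` about a unit-length `axis`."""
    half = angle / 2.0
    s = math.sin(half)
    return (math.cos(half), axis[0] * s, axis[1] * s, axis[2] * s)


def rotate(q, v):
    """Apply q v q^{-1} to a 3-vector v, embedded as a pure quaternion."""
    w, x, y, z = q_mul(q_mul(q, (0.0, v[0], v[1], v[2])), q_conj(q))
    return (x, y, z)


quarter_turn_z = axis_angle((0.0, 0.0, 1.0), math.pi / 2)
quarter_turn_x = axis_angle((1.0, 0.0, 0.0), math.pi / 2)
v = (1.0, 0.0, 0.0)

# Rotating in two steps (about z first, then about x) ...
two_steps = rotate(quarter_turn_x, rotate(quarter_turn_z, v))
# ... matches rotating once by the product of the two quaternions.
composed = rotate(q_mul(quarter_turn_x, quarter_turn_z), v)
same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(two_steps, composed))
print(same)
```

<p>Note the order of the product: the rotation applied first sits on the right, just as with matrices.</p>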
<p>I happened to stumble upon <a href="https://en.wikipedia.org/wiki/Geometric_algebra">Geometric Algebras</a>, whose introductory tome can be found <a href="http://geocalc.clas.asu.edu/pdf/OerstedMedalLecture.pdf">in this lecture by Hestenes in 2002</a>. So far it looks like it will deliver on its ambitious goal, claiming that “conventional treatments employ an awkward mixture of vector, matrix and spinor or quaternion methods… GA provides a unified, coordinate-free treatment of rotations and reflections”.</p>
<p>I can’t really explain this stuff any better than it’s been presented in Hestenes’s lecture, so you should go look at that. I found that understanding GA was made much simpler by knowing all of this other stuff.</p>
<p>So that’s roughly where I ended up. And my Christmas break is over, so I guess I’ll pick this up some other day.</p>
<p>Thanks to the many people I bugged about this stuff over the past week or two.</p>
Algorithmic Bias2018-11-25T00:00:00Z2018-11-25T00:00:00Ztag:www.moderndescartes.com,2018-11-25:/essays/algorithmic_bias<p>Algorithmic bias has been around forever. Recently, it has become a very relevant issue as companies left and right promise to revolutionize industries with “machine learning” and “artificial intelligence”, by which they typically mean good ol’ algorithms and statistical techniques, but also sometimes deep learning.</p>
<p>The primary culprits in algorithmic bias are the same as they’ve always been: careless choice of objective function, fuzzy data provenance, and Goodhart’s law. Deep neural networks have their own issues relating to interpretability and blackbox behavior in addition to all of the issues I’ll discuss here. I won’t go into any of those issues, since what I cover here is already quite extensive.</p>
<p>I hope to convince you that algorithmic bias is very easy to create and that you can’t just hope to dodge it by being lucky.</p>
<h1 id="algorithms-are-less-biased-than-people">“Algorithms are less biased than people”</h1>
<p>I commonly run into the idea that because humans are often found to be biased, we should replace them with algorithms, which will be unbiased by virtue of not involving humans.</p>
<p>For example, see <a href="https://www.cnet.com/news/amazon-go-avoid-discrimination-shopping-commentary/">this testimonial about Amazon’s new cashierless stores</a>, or this <a href="http://www.pnas.org/content/108/17/6889">oft-cited finding that judges issue harsher sentences when hungry</a>.</p>
<p>I don’t doubt that humans can be biased. That being said, I also believe that algorithms reflect the choices made by their human creators. Those choices can be biased, either intentionally or not. With careful thought, an algorithm can be designed to be unbiased. But unless demonstrated otherwise, you should assume algorithms to be biased by default.</p>
<h2 id="a-koan">A koan</h2>
<p>Bias doesn’t necessarily have to be intentional. Here’s a koan I particularly like, from <a href="http://www.catb.org/jargon/html/koans.html">the Jargon File</a>.</p>
<blockquote>
<p>Sussman attains enlightenment</p>
<p>In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.</p>
<p>“What are you doing?”, asked Minsky.</p>
<p>“I am training a randomly wired neural net to play Tic-Tac-Toe” Sussman replied.</p>
<p>“Why is the net wired randomly?”, asked Minsky.</p>
<p>“I do not want it to have any preconceptions of how to play”, Sussman said.</p>
<p>Minsky then shut his eyes.</p>
<p>“Why do you close your eyes?”, Sussman asked his teacher.</p>
<p>“So that the room will be empty.”</p>
<p>At that moment, Sussman was enlightened.</p>
</blockquote>
<p>There are many ideas in this koan, but one that I take in particular, is that all algorithms have their tendencies, whether or not you understand what they are yet. It is the job of the algorithm designers to understand what those tendencies are, and decide if they constitute biases that need correcting.</p>
<p>The definitive example here is probably <a href="https://en.wikipedia.org/wiki/Gerrymandering">gerrymandering</a>, for which multiple proposed algorithmic solutions exist - all of them biased in different ways in favor of rural or urban voters. Algorithms have not solved the gerrymandering bias problem; they’ve merely shifted the debate to “which algorithm’s biases are we okay with?”</p>
<h1 id="what-are-you-optimizing-for">What are you optimizing for?</h1>
<p>The easiest way for bias to slip into an algorithm is via the optimization target.</p>
<p>One pernicious way in which bias sneaks into algorithms is via implicitly defined optimization targets. If we are optimizing “total X” for some X, then a good question is “how is X defined?”. If X is the classification error over some dataset, then the demographic makeup of the dataset implicitly defines how important it is to optimize for one group over the other.</p>
<p>For example, image classification algorithms are judged by their ability to correctly classify the images in ImageNet or OpenImages. Unfortunately, we are only now realizing that what we thought was a wide variety of images is actually heavily biased towards Western cultures, because we harvested images from the English-speaking subset of the Internet, and because we hired English-speaking human labelers to annotate the images, and because the categories we are classifying for make the most sense in a Western context. The image classifiers trained on ImageNet and OpenImages are thus great at recognizing objects familiar to Westerners, but horrible at labeling images from other cultures. This <a href="https://www.kaggle.com/c/inclusive-images-challenge">Kaggle challenge</a> asks teams to train a classifier that does well on images from cultures they haven’t trained on.</p>
<p>Another example is this contest to <a href="https://www.bostonglobe.com/opinion/2017/12/22/don-blame-algorithm-for-doing-what-boston-school-officials-asked/lAsWv1Rfwqmq6Jfm5ypLmJ/story.html">algorithmically optimize bus schedules in Boston</a>. The <a href="/static/algorithmic_bias/bps_challenge_overview.pdf">original contest statement</a> asked teams to optimize for the fewest number of busses required to bus all students around. Even though the optimization target doesn’t explicitly prioritize any one group of people over the other, I’d guess that patterns in housing and geography would result in systematic bias in the resulting bus routes. (This example is not so different from the gerrymandering example.)</p>
<p>Finally, a classic paper in this field by <a href="https://arxiv.org/abs/1610.02413">Hardt, Price, and Srebro</a> points out that there isn’t an obvious way to define fairness for subgroups in a classification problem (e.g. loan applications or college admissions). You can require the score thresholds to be equal across subgroups. You can require that an equal percentage of qualified applicants be accepted from each subgroup. You can require that the demographic makeup of the accepted applicants match the demographic makeup of all applicants. (And you’ll find people who think each of these choices is the ‘obvious’ way to do it.) Unfortunately, it’s impossible to simultaneously optimize all of these criteria. You can see a very nice <a href="https://research.google.com/bigpicture/attacking-discrimination-in-ml/">interactive visualization of this phenomenon</a>.</p>
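<p>A toy example makes the tension concrete. The sketch below uses entirely made-up applicants (pairs of a score and whether the applicant is truly qualified) for two hypothetical groups, and a helper function of my own naming. It only illustrates how the criteria can pull apart on one small dataset; it is not the method from the paper.</p>

```python
def rates(applicants, threshold):
    """Return (acceptance rate, fraction of qualified applicants accepted)."""
    accepted = [qualified for score, qualified in applicants if score >= threshold]
    total_qualified = sum(qualified for _, qualified in applicants)
    return len(accepted) / len(applicants), sum(accepted) / total_qualified


# Made-up data: (score, truly qualified?) for two hypothetical groups.
group_a = [(90, True), (80, True), (70, True), (60, False), (50, False), (40, False)]
group_b = [(70, True), (60, True), (50, False), (40, True), (30, False), (20, False)]

# Criterion 1: the same score threshold for everyone.
print(rates(group_a, 65))  # accepts half of group A, including all qualified members
print(rates(group_b, 65))  # accepts one sixth of group B, one third of its qualified members

# Criterion 2: per-group thresholds chosen so every qualified applicant is accepted.
print(rates(group_a, 65))
print(rates(group_b, 40))  # qualified acceptance now matches, overall acceptance does not
```

<p>With a shared threshold, both the overall acceptance rate and the acceptance rate among qualified applicants differ between the groups. Lowering group B’s threshold equalizes the latter, but the overall acceptance rates still disagree, so the third criterion remains unmet.</p>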
<h1 id="where-does-your-data-come-from">Where does your data come from?</h1>
<p>With so many industries being automated by computers, data about operations and human behavior are becoming more widely available - and with it, the temptation to just grab whatever data stream happens to be available. However, data streams come with many caveats which are often ignored.</p>
<p>The most important caveat is that such data streams are often observational, not experimental. What this means is that there has been no particular care taken to create a control group; what you see is what you have. Observational data is often deeply confounded with spurious correlations - the famous “one glass of wine a day is good for you” study was confounded by the fact that wine drinkers form a different demographic than beer drinkers or liquor drinkers. So far, there is no systematic or automatic way to tease apart correlation and causation. It’s an active area of research.</p>
<p>That being said, not all results from observational data are useless. The correlations you find will often be good enough to form a basis for a predictive model. However, unless you dig into your model’s results to figure out where that predictive power is coming from, it’s highly likely that you have unintentional bias lurking in your model. Even if you don’t have demographic categories as an input to your model, there are a million ways to accidentally introduce demographic information via a side channel - for example, zip codes.</p>
<p>Another risk of data dependencies is that even if you do all the hard work of teasing apart correlation and causation, and accounting for bias, you may find your model has developed some unintentional bias a year later, when the collection methodology of your data has shifted. Maybe the survey you were administering has changed its wording, or the survey website broke for (probably wealthier) Safari users only, or the designers changed the font to be low-contrast, discouraging older users and those with poor eyesight from taking your survey. This paper from <a href="https://ai.google/research/pubs/pub43146">D Sculley et al</a> lays out the problems often faced when putting machine learning into production, and makes recommendations like proactively pruning input streams to minimize the risk of data shift.</p>
<p>Related to the idea of shifting data dependencies, companies will typically develop models in isolation, looking narrowly at the data streams they have available. The problem here is that nobody has a full picture of the dependency chains of data and models, and bias can accumulate when algorithms consume other algorithms’ input without careful consideration. For example, when a loan office looks at credit history to make a loan decision, each previous loan decision was probably also made by looking at credit history, leading to a positive feedback loop of bias. <a href="https://www.blog.google/technology/ai/ai-principles/">Google’s AI Principles</a> acknowledge that distinguishing fair and unfair bias is difficult, and that we should seek to avoid creating or reinforcing bias. By not reinforcing bias, you can avoid contributing to the problem, no matter what everyone else is doing.</p>
<h1 id="are-you-keeping-your-model-up-to-date">Are you keeping your model up-to-date?</h1>
<p>Goodhart’s law says:</p>
<blockquote>
<p>When a measure becomes a target, it ceases to be a good measure.</p>
</blockquote>
<p>What we’d like from our algorithms is a nuanced truth. But it’s much easier to fake it with something that happens to work. The problems mentioned above (imprecise objective function, shifting data distributions) are often where algorithms can be exploited by adversaries.</p>
<p>For example, Google’s search has to deal with search engine optimizers trying to exploit <a href="https://en.wikipedia.org/wiki/PageRank">PageRank</a>. 10-15 years ago, link farms were pretty common and people would try to find ways to sneak links to their websites onto other people’s websites (often via user-submitted content), which is how <a href="https://en.wikipedia.org/wiki/Nofollow">rel=nofollow</a> came about. Ultimately this happened because PageRank was an imprecise approximation of ‘quality’.</p>
<p>Another more recent example is <a href="https://en.wikipedia.org/wiki/Tay_(bot)">Tay</a> going off the rails when users tweeted hateful messages at it. This one is due to optimizing to predict human tweets (an optimization target that was implicitly weighted by the set of tweets in its training data). The vulnerability here is pretty obvious: submit enough messages, and you can overwhelm the training data with your own.</p>
<p>There are far too many examples to list here. Chances are, if there’s any nontrivial incentive to game an algorithm, it can and will be gamed.</p>
<h1 id="there-is-no-silver-bullet">There is no silver bullet</h1>
<p>This is an essay that could easily have been written 20 years ago with different examples. None of what I’ve mentioned so far is particularly new. What’s new is that people seem to think that machine learning magically solves these problems. A classic quip comes to mind: “There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.”</p>
<p>We can do better.</p>
<p>The advent of ML means that we have to be more, not less, careful about bias issues. The increased burden of proof means that for most applications, we should probably stick with the same old models we’ve been using for the last few decades: logistic regression, random forests, etc. I think that ML should be used sparingly and to enable technologies that were previously completely intractable: for example, anything involving images or audio/text understanding was previously impossible but is now within our reach.</p>
<p>There is still a lot of good we can do with simpler methods that we understand.</p>
An Adversarial Adversarial Paper Review Review (2018-10-18)
<p>This is a review of a review: <a href="https://arxiv.org/abs/1807.06732">“Motivating the Rules of the Game for Adversarial Example Research”</a>. (Disclaimer: I’ve tried to represent the conclusions of the original review as faithfully as possible but it’s possible I’ve misunderstood the paper.)</p>
<h1 id="background">Background</h1>
<p>As we bring deep neural networks into production use cases, attackers will naturally want to exploit such systems. There are many ways that such systems could be attacked, but one attack in particular seems to have captured the imagination of many researchers. You’ve probably seen <a href="https://arxiv.org/abs/1412.6572">this example</a> somewhere.</p>
<p><img src="/static/adversarial_paper_review/panda_gibbon.png" title="Panda + noise = Gibbon" style="display: block; margin: 0 auto; width:100%;"/> The image that started it all.</p>
<p>I remember thinking this example was cute, and figured it wasn’t useful because it depended on exact pixel manipulation and access to the network’s inner workings. <a href="https://arxiv.org/abs/1607.02533">Variants have since been discovered</a> where adversarial examples can be printed out and photographed and retain their adversarial classification. <a href="https://arxiv.org/abs/1707.08945">Other variants have been discovered</a> where stickers can trigger an adversarial classification.</p>
<p><img src="/static/adversarial_paper_review/stop_sign.jpg" title="Not a stop sign anymore" style="display: block; margin: 0 auto; width:100%;"/> Not a stop sign anymore</p>
<p>Alongside papers discovering new variants of this attack, there has also been a rise in papers about defenses: some propose new defenses against these attacks, and others find new holes in the defenses already proposed.</p>
<p>This review paper, “Motivating the Rules of the Game for Adversarial Example Research”, surveys the cottage industry of adversarial example research and suggests that most of it is not useful.</p>
<h1 id="whats-wrong-with-the-literature">What’s wrong with the literature?</h1>
<p>The review suggests three flaws.</p>
<p>The first flaw is that this entire line of research isn’t particularly motivated by any real attack. An image that has been perturbed in a way that is imperceptible to the human eye makes for a flashy demo, but there is no attack that is enabled by this imperceptibility.</p>
<p>The second flaw is that the quantitative measure of “human perceptibility” is simplified to an <span class="math inline">\(l_p\)</span> norm with some radius. This simplification is made because it’s kind of hard to quantify human perceptibility. Unfortunately, this is a pretty meaningless result because human perception is so much more than just <span class="math inline">\(l_p\)</span> norms. It’s a kind of <a href="https://en.wikipedia.org/wiki/Streetlight_effect">streetlight error</a>: researchers have spent a lot of effort thoroughly proving that the key is not under the streetlight. But as soon as you look a little bit outside of the lit region, you find the key. As a result, defense after defense has been defeated despite being “proven” to work.</p>
<p>The third flaw is that most of the defenses proposed in the literature result in degraded performance on a non-adversarial test set. Thus, the defenses guard against a very specific set of attacks (which, because of flaws 1 and 2, are pointless to defend against), while increasing the success rate of the stupid attack (iterate through non-adversarial examples until you find something the model messes up).</p>
<p>So let’s go into each suggested flaw in detail.</p>
<h1 id="lack-of-motivation">Lack of motivation</h1>
<p>The review describes many dimensions of attacks on ML systems. There’s a lot of good discussion on targeted (attacker induces a specific mislabeling) vs untargeted attacks (attacker induces any mislabeling); whitebox vs blackbox attacks; various levels of constraint on the attacker (ranging from “attacker needs to perturb a specific example” to “attacker can provide any example”); physical vs digital attacks; and more.</p>
<p>The key argument in this segment of the paper is that all of the proposed attacks that purport to be of the “indistinguishable perturbation” category are really of the weaker “functional content preservation”, “content constraint”, or “nonsuspicious content” categories.</p>
<h2 id="functional-content-constraint">Functional Content Constraint</h2>
<p>In this attack, the attacker must evade classification while preserving functional content. The function can vary broadly.</p>
<p>The paper gives examples like email spammers sending V14Gr4 spam, malware authors writing executables that evade antivirus scans, and trolls uploading NSFW content to SFW forums. Another example I thought of: Amazon probably needs to obfuscate their product pages from scrapers while retaining shopping functionality.</p>
<h2 id="content-constraint">Content constraint</h2>
<p>This attack is similar to the functional content constraint, where “looking like the original” is the “function”. This is not the same as imperceptibility: instead of being human-imperceptible from the original, attacks only need to be human-recognizable as the original.</p>
<p>The paper gives examples like bypassing YouTube’s <a href="https://support.google.com/youtube/answer/6013276?hl=en">ContentID system</a> with a pirated video or uploading revenge porn to Facebook while evading their <a href="https://www.facebook.com/fbsafety/posts/1666174480087050">content matching system</a>. In each example, things like dramatically scaling / cropping / rotating / adding random boxes to the image are all fair game, as long as the content remains recognizable.</p>
<h2 id="nonsuspicious-constraint">Nonsuspicious constraint</h2>
<p>In the nonsuspicious constraint scenario, an attacker has to fool an ML system and simultaneously appear nonsuspicious to humans.</p>
<p>The paper gives the example of evading automated facial recognition against a database at an airport without making airport security suspicious. Another example I thought of would be posting selected images on the Internet to contaminate image search results (say, by getting <a href="https://www.google.com/search?q=google+gorilla+black+people">search results for “gorillas” to show black people</a>).</p>
<h2 id="no-content-constraint">No content constraint</h2>
<p>The paper gives examples like unlocking a stolen smartphone (i.e. bypassing biometric authentication) or designing a TV commercial to trigger undesired behavior in a Google Home / Amazon Echo product.</p>
<h2 id="indistinguishable-perturbations">Indistinguishable Perturbations</h2>
<p>Just to be clear on what we’re arguing against:</p>
<blockquote>
<p>“Recent work has frequently taken an adversarial example to be a restricted (often small) perturbation of a correctly-handled example […] the language in many suggests or implies that the degree of perceptibility of the perturbations is an important aspect of their security risk.”</p>
</blockquote>
<p>Here are some of the attacks that have been proposed in the literature, and why they are actually instances of other categories.</p>
<ul>
<li>“Fool a self-driving car by making it not recognize a stop sign” (<a href="https://arxiv.org/abs/1707.08945">paper</a>, <a href="https://arxiv.org/abs/1710.03337">paper</a>): This is actually a nonsuspicious constraint. One could easily just dangle a fake tree branch in front of the stop sign to obscure it.</li>
<li>“Evade malware classification” (<a href="https://arxiv.org/abs/1802.04528">paper</a>, <a href="https://arxiv.org/abs/1606.04435">paper</a>). This is already given as an example of functional content preservation. Yet this exact attack is quoted in a few adversarial perturbation papers.</li>
<li>“Avoid traffic cameras with perturbed license plates” (<a href="https://blog.ycombinator.com/how-adversarial-attacks-work/">blog post</a>). This is an example of a nonsuspicious constraint; it would be far easier to spray a glare-generating coating on the license plate or strategically apply mud, than to adversarially perturb it.</li>
</ul>
<p>The authors have this to say:</p>
<blockquote>
<p>“In contrast to the other attack action space cases, at the time of writing, we were unable to find a compelling example that required indistinguishability. In many examples, the attacker would benefit from an attack being less distinguishable, but indistinguishability was not a hard constraint. For example, the attacker may have better deniability, or be able to use the attack for a longer period of time before it is detected.”</p>
</blockquote>
<h1 id="human-imperceptibility-l_p-norm">Human imperceptibility != <span class="math inline">\(l_p\)</span> norm</h1>
<p>As a reminder, an <span class="math inline">\(l_p\)</span> norm quantifies the difference between two vectors and is defined, for some power <span class="math inline">\(p\)</span> (commonly <span class="math inline">\(p = 1, 2, \infty\)</span>), as</p>
<p><span class="math display">\[ \|x - y\|_p = \left( \sum_i |x_i - y_i|^p \right)^{1/p} \]</span></p>
<p>where the <span class="math inline">\(p = \infty\)</span> case is taken as the limit, <span class="math inline">\(\max_i |x_i - y_i|\)</span>.</p>
<p>It doesn’t take long to see that <span class="math inline">\(l_p\)</span> norms are a very bad way to measure perceptibility. Small translations are imperceptible to the human eye; yet they have huge <span class="math inline">\(l_p\)</span> norms because <span class="math inline">\(l_p\)</span> norms use a pixel-by-pixel comparison. The best way we have right now to robustly measure “imperceptibility” is with GANs, and yet that just begs the question, because the discriminative network is itself a deep neural network.</p>
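<p>To make the pixel-by-pixel point concrete, here is a minimal NumPy sketch. The striped toy image, the one-pixel shift, and the 0.01 brightening are all illustrative assumptions, not anything taken from the paper; the point is only that a shift a human would barely register lands much farther away in every <span class="math inline">\(l_p\)</span> norm than a change touching every pixel.</p>

```python
import numpy as np

def lp_distance(x, y, p):
    """l_p distance between two images, compared pixel by pixel."""
    return np.linalg.norm((x - y).ravel(), ord=p)

# Toy 32x32 "image": vertical stripes, 4 pixels wide, with values 0.0 or 1.0.
cols = np.arange(32)
image = np.tile(((cols // 4) % 2).astype(float), (32, 1))

# A one-pixel horizontal translation (wrapping around at the border).
shifted = np.roll(image, shift=1, axis=1)

# A faint, uniform brightening applied to every single pixel.
brightened = image + 0.01

for p in (1, 2, np.inf):
    print(f"p={p}: shift={lp_distance(image, shifted, p):.2f}, "
          f"brighten={lp_distance(image, brightened, p):.2f}")
```

<p>An image with sharper or more frequent edges would make the gap larger still, since only pixels at an edge change under a small shift.</p>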
<p>You may wonder - so what if <span class="math inline">\(l_p\)</span> norm is a bad match for human perception? Why not solve this toy problem for now and generalize later? The problem is that the standard mathematical phrasing of the adversarial defense problem is to minimize the maximal adversarial loss. Unfortunately, trying to bound the maximum adversarial loss is an exercise in futility, because the bound is approaching from the wrong direction. The result: “Difficulties in measuring robustness in the standard <span class="math inline">\(l_p\)</span> perturbation rules have led to numerous cycles of falsification… a combined 18 prior defense proposals are not as robust as originally reported.”</p>
<h1 id="adversarial-defense-comes-at-a-cost">Adversarial defense comes at a cost</h1>
<p>There are <a href="https://arxiv.org/abs/1706.06083">many</a> <a href="https://arxiv.org/abs/1801.09344">examples</a> <a href="https://arxiv.org/abs/1705.09064">in</a> <a href="https://arxiv.org/abs/1805.08006">the</a> <a href="https://arxiv.org/abs/1805.07816">literature</a> <a href="https://arxiv.org/abs/1412.5068">of</a> <a href="https://arxiv.org/abs/1703.09202">adversarial</a> <a href="https://arxiv.org/abs/1704.02654">defenses</a> that end up degrading the accuracy of the model on nonadversarial examples. This is problematic because the simplest possible attack is to just try things until you find something that gets misclassified.</p>
<p>As an example of how easy this is, <a href="https://arxiv.org/abs/1712.02779">this paper</a> claims that “choosing the worst out of 10 random transformations [translation plus rotation] is sufficient to reduce the accuracy of these models by 26% on MNIST, 72% on CIFAR10, and 28% on ImageNet (Top 1)”. I interpret this to mean that you only need to check about 10 nonadversarial examples before you find one the network misclassifies. This is a black-box attack that would work on basically any image classifier.</p>
<p>In some ways, the existence of adversarial attacks is only surprising if you believe the hype that we can classify images perfectly now.</p>
<h1 id="conclusions-and-recommendations">Conclusions and recommendations</h1>
<p>The authors make some concrete suggestions for people interested in adversarial defense as an area of research:</p>
<ul>
<li>Consider an actual threat model within a particular domain, like “Attackers are misspelling their words to evade email spam filters”.</li>
<li>Try to quantify the spectrum between “human imperceptible” changes and “content preserving” changes, acknowledging that defenses will want to target different points on this spectrum.</li>
<li>Enumerate a set of transformations known to be content-preserving (rotation, translation, scaling, adding a random black line to the image, etc.), and then hold out some subset of these transformations during training. At test time, test if your model is robust to the held-out transformations.</li>
</ul>
<p>My personal opinion is that if you want to safeguard your ML system, the first threat you should consider is that of “an army of internet trolls trying all sorts of stupid things to see what happens”, rather than “a highly sophisticated adversary manipulating your neural network”.</p>
Carbon dioxide and closed rooms (2018-10-18)
<p>Recently, Scott Alexander <a href="http://slatestarcodex.com/2018/08/23/carbon-dioxide-an-open-door-policy/">posted about the possibility that <span class="math inline">\(\textrm{CO}_2\)</span> could accumulate in a closed room overnight</a>, leading to decreased sleep quality. As with anything Scott posts, it made me really believe him for a while. And I certainly wasn’t the only one - I visited the Bay Area this week and found some friends debating whether they should buy an air quality monitor to test their <span class="math inline">\(\textrm{CO}_2\)</span> levels.</p>
<p>In this case, though, the chemistry just doesn’t support this hypothesis. Scott mentions in his post: “I can’t figure out how to convert external ppm to internal likely level of carbon dioxide in the blood”. Here’s how to do that calculation.</p>
<h1 id="a-primer-on-partial-pressures">A Primer on Partial Pressures</h1>
<p>You probably remember from high school physics that potential energy is a good way to figure out whether some configuration of objects is stable with respect to another. A ball on top of a hill will roll down because doing so will convert its gravitational potential energy into kinetic energy. And a spring will relax to its point of minimal potential energy. And so on.</p>
<p>Pressure is a quantity that has units of energy per volume. It’s the analogous quantity to potential energy, but for continuous media like liquids and gases. Gases and liquids will flow from areas of high pressure to areas of low pressure, with a driving force proportional to the gradient in pressure.</p>
<p>Something that’s interesting about pressures is that they are independent of each other. So if there is a partial pressure of 0.4 millibars of <span class="math inline">\(\textrm{CO}_2\)</span> in the air, its behavior is unaffected by the presence of other gases - whether it’s 200 millibars of oxygen or 790 millibars of nitrogen. (This can change in ultra high-pressure regimes but we’re not dealing with those conditions.) So although Scott’s post and most other internet sources discuss <span class="math inline">\(\textrm{CO}_2\)</span> in units of “parts per million”, this is the wrong unit, because it talks about <span class="math inline">\(\textrm{CO}_2\)</span> as a percentage of the total air. If there were 10 bars of nitrogen but the same 0.4 millibars of <span class="math inline">\(\textrm{CO}_2\)</span>, the parts per million of <span class="math inline">\(\textrm{CO}_2\)</span> would drop precipitously but the chemistry would not change.</p>
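<p>A quick sketch of why “parts per million” depends on the other gases while the partial pressure does not. The function name is made up for illustration, and the air is simplified to just the oxygen and nitrogen pressures quoted above.</p>

```python
def co2_ppm(co2_mbar, other_gases_mbar):
    """CO2 expressed as parts per million of the total gas pressure."""
    total_mbar = co2_mbar + sum(other_gases_mbar)
    return 1e6 * co2_mbar / total_mbar

# Same 0.4 millibars of CO2 in both cases; only the nitrogen changes.
normal_air = co2_ppm(0.4, [200, 790])         # oxygen, nitrogen
extra_nitrogen = co2_ppm(0.4, [200, 10_000])  # oxygen, 10 bars of nitrogen

print(f"{normal_air:.0f} ppm vs {extra_nitrogen:.0f} ppm")
```

<p>The ppm figure falls by roughly a factor of ten even though the amount of <span class="math inline">\(\textrm{CO}_2\)</span> available to dissolve in blood is unchanged.</p>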
<p>Another relevant fact is that gases can dissolve in water. When dissolved, it’s possible to express the quantity as a concentration (grams or moles per liter). As it turns out, equilibrium concentration is proportional to pressure (<a href="https://en.wikipedia.org/wiki/Henry%27s_law">Henry’s law</a>), and so for our purposes, we can express <span class="math inline">\(\textrm{CO}_2\)</span> concentration in units of pressure. This works because the lung contains enough surface area and <a href="https://en.wikipedia.org/wiki/Carbonic_anhydrase">carbonic anhydrase</a> that equilibrium can be considered to be reached within a second. (I’m just going to assert proof by evolution.)</p>
<h1 id="carbon-dioxide-and-the-body">Carbon Dioxide and the Body</h1>
<p>Venous blood entering the lungs arrives with a <span class="math inline">\(\textrm{CO}_2\)</span> partial pressure of about 45-60 millibars. (At sea level and standard atmospheric conditions, this is the equivalent of 45,000-60,000 “ppm” of <span class="math inline">\(\textrm{CO}_2\)</span>.) On the other hand, the typical quantity of <span class="math inline">\(\textrm{CO}_2\)</span> in the air is about 0.4 millibars (400 ppm at standard conditions). The efficiency of <span class="math inline">\(\textrm{CO}_2\)</span> expulsion is then proportional to the difference between 45-60 millibars and 0.4 millibars. The highest indoor level of <span class="math inline">\(\textrm{CO}_2\)</span> mentioned in Scott’s post is about 5 millibars.</p>
<p>At 5 millibars, you would get a <span class="math inline">\(\textrm{CO}_2\)</span> expulsion efficiency of <span class="math inline">\(\frac{50 - 5}{50 - 0.4} \approx\)</span> 90% per breath. What would happen is that as <span class="math inline">\(\textrm{CO}_2\)</span> very slowly built up in the bloodstream, you would breathe 10% more rapidly to compensate. Given that standard human breathing varies significantly, I think it’s safe to say that you wouldn’t notice it. (Try it! I found it hard to maintain exactly 1 breath per 4 seconds, within a tolerance of 10% for each cycle.)</p>
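<p>The same arithmetic can be run for other room levels. This is a minimal sketch of the ratio above, assuming (as the formula does) a venous partial pressure of 50 millibars and that breathing rate scales as the inverse of per-breath efficiency; the function names are made up for illustration.</p>

```python
VENOUS_CO2_MBAR = 50.0   # assumed value, within the 45-60 millibar range
OUTDOOR_CO2_MBAR = 0.4   # typical outdoor air

def expulsion_efficiency(room_co2_mbar):
    """Per-breath CO2 expulsion relative to breathing outdoor air."""
    return ((VENOUS_CO2_MBAR - room_co2_mbar)
            / (VENOUS_CO2_MBAR - OUTDOOR_CO2_MBAR))

def extra_breathing(room_co2_mbar):
    """Fractional increase in breathing rate needed to compensate."""
    return 1.0 / expulsion_efficiency(room_co2_mbar) - 1.0

for mbar in (2.5, 5.0, 10.0):
    print(f"{mbar:>4} mbar: efficiency {expulsion_efficiency(mbar):.0%}, "
          f"breathing rate +{extra_breathing(mbar):.0%}")
```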
<h1 id="conclusion">Conclusion</h1>
<p>At 5 millibars (5,000 ppm), you would start breathing at a 10% elevated rate, which you probably wouldn’t notice. At 10 millibars (10,000 ppm), you would start breathing at a 25% elevated rate, which is probably noticeable. These computations are consistent with the findings in <a href="https://www.nap.edu/read/11170/chapter/5#49">this report</a>.</p>
<p>People with indoor levels at 2000-3000 ppm shouldn’t worry - this corresponds to a mere 5% elevated breathing rate.</p>
A Deep Dive into Monte Carlo Tree Search (2018-05-15)
<p><em>(This was originally a sponsor talk given at PyCon 2018. Unfortunately there is no video.)</em></p>
<h1 id="a-brief-history-of-go-ai">A brief history of Go AI</h1>
<p>The very first Go AIs used multiple modules to handle each aspect of playing Go - life and death, capturing races, opening theory, endgame theory, and so on. The idea was that by having experts program each module using heuristics, the AI would become an expert in all areas of the game. All that came to a grinding halt with the introduction of Monte Carlo Tree Search (MCTS) around 2008. MCTS is a tree search algorithm that dumped the idea of modules in favor of a generic tree search algorithm that operated in all stages of the game. MCTS AIs still used hand-crafted heuristics to make the tree search more efficient and accurate, but they far outperformed non-MCTS AIs. Go AIs then continued to improve through a mix of algorithmic improvements and better heuristics. In 2016, AlphaGo leapfrogged the best MCTS AIs by replacing some heuristics with deep learning models, and <a href="https://deepmind.com/blog/alphago-zero-learning-scratch/">AlphaGoZero</a> in 2017 completely replaced all heuristics with learned models.</p>
<p>AlphaGoZero learns by repeatedly playing against itself, then distilling that experience back into the neural network. This reinforcement learning loop is so robust that it can figure out how to play Go starting from random noise. There are two key requirements for this loop to work: that the self-play games represent a higher level of gameplay than the raw neural network output, and that the training process successfully distills this knowledge.</p>
<p>This essay digs into the “how do you reach a higher level of gameplay?” part of the process. Despite replacing all human heuristics, AlphaGoZero still uses tree search algorithms at its core. I hope to convince you that AlphaGoZero’s success is as much due to this algorithm as it is due to machine learning.</p>
<p>Since this was originally a PyCon talk, I’ll also demonstrate the algorithm in Python and show some Python-specific tricks for optimizing the implementation, based on my experience working on <a href="https://github.com/tensorflow/minigo">MiniGo</a>.</p>
<h1 id="exploration-and-exploitation">Exploration and Exploitation</h1>
<p>Let’s start by asking a simpler question: how do you rank submissions on Hacker News?</p>
<p><img src="/static/deep_dive_mcts/hn_screenshot.png" title="Front page of HN" style="display: block; margin: 0 auto; width: 80%;"/></p>
<p>There’s a tension between wanting to show the highest rated submissions (exploitation), but also wanting to discover the good submissions among the firehose of new submissions (exploration). If you show the highest rated submissions only, you’ll get a rich-get-richer effect where you never discover new stories.</p>
<p>The canonical solution to this problem is to use Upper Confidence Bounds.</p>
<figure>
<img src="/static/deep_dive_mcts/hn_screenshot_annotated.png" title="HN ranking = quality + upper confidence bound" style="display: block; margin: 0 auto; width: 80%;"/>
<figcaption style="text-align: center; font-size: larger">
Ranking = Quality + Upper confidence bound
</figcaption>
</figure>
<p>The idea is simple. Instead of ranking according to estimated rating, you add a bonus based on how uncertain you are about the rating. In this example, the top submission on HN has fewer upvotes than the second-ranked submission, but it’s also newer. So it gets a bigger uncertainty bonus. The uncertainty bonus fades over time, and that submission will fall in ranking unless it can prove its worth with more upvotes.</p>
<p>This is an instance of the <a href="https://en.wikipedia.org/wiki/Multi-armed_bandit">Multi Armed Bandit</a> problem and has a pretty extensive literature if you want to learn more.</p>
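<p>As a rough illustration of “quality plus uncertainty bonus”, here is a sketch using UCB1, one standard formula from the bandit literature. This is not how Hacker News actually ranks stories; the function name and the upvote and view counts are made up for the example.</p>

```python
import math

def ucb1_scores(upvotes, views):
    """UCB1 score per item: observed upvote rate plus an uncertainty bonus.

    The bonus shrinks as an item accumulates views. Every item is assumed
    to have been viewed at least once.
    """
    total_views = sum(views)
    return [
        up / n + math.sqrt(2 * math.log(total_views) / n)
        for up, n in zip(upvotes, views)
    ]

# Hypothetical data: an established story and a newer one with few views.
upvotes = [90, 4]
views = [100, 5]
scores = ucb1_scores(upvotes, views)

for up, n, score in zip(upvotes, views, scores):
    print(f"rate={up / n:.2f} views={n:>3} score={score:.2f}")
```

<p>The newer story has the lower observed rate but the higher score, so it is shown first until more views either confirm or deflate it.</p>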
<h1 id="uct-upper-confidence-bounds-applied-to-trees">UCT = Upper Confidence bounds applied to Trees</h1>
<p>So how does this help us understand AlphaGoZero? Playing a game has a lot in common with the multi-armed bandit problem: when reading into a game variation, you want to balance between playing the strongest known response, and exploring new variations that could turn out to be good moves. So it makes sense that we can reuse the UCB idea.</p>
<p>This figure from the AlphaGoZero paper lays out the steps.</p>
<p><img src="/static/deep_dive_mcts/alphago_uct_diagram.png" title="UCT diagram from AGZ paper" style="display: block; margin: 0 auto; width: 80%;"/></p>
<ol type="1">
<li><p>First, we select a new variation to evaluate. This is done by recursively picking the move that has highest Q+U score until we reach a variation that we have not yet evaluated.</p></li>
<li><p>Next, we pass the variation to a neural network for evaluation. We get back two things: an array of probabilities, indicating the net’s preference for each followup move, and a position evaluation.</p>
<p>Normally, with a UCB algorithm, all of the options have equal uncertainty. But in this case, the neural network gives us an array of probabilities indicating which followup moves are plausible. Those moves get higher upper confidence bounds, ensuring that our tree search looks at those moves first.</p>
<p>The position evaluation can be returned in one of two ways: in an absolute sense, where 1 = black wins, -1 = white wins, or in a relative sense, where 1 = player to play is winning; -1 = player to play is losing. Either way, we’ll have to be careful about what we mean by “maximum Q score”; we want to reorient Q so that we’re always picking the best move for Black or White when it’s their turn. The Python implementation I show will use the absolute sense.</p>
<p>As a historical note, the first MCTS Go AIs attempted to evaluate positions by randomly playing them out to the end and scoring the finished game. This is where the Monte Carlo in MCTS comes from. But now that we no longer do the MC part of MCTS, MCTS is somewhat of a misnomer. So the proper name should really just be UCT search.</p></li>
<li><p>Finally, we walk back up the game tree, averaging in the position evaluation at each node along the way. The net result is that a node’s Q score will be the average of its subtree’s evaluations.</p></li>
</ol>
<p>This process is repeated for as long as we’d like; each additional search fleshes out the game tree with one new variation. UCT search is a neat algorithm because it can be stopped at any time with no wasted work, and unlike the <a href="https://en.wikipedia.org/wiki/Minimax">minimax algorithm</a>, UCT search is flexible enough to explore widely or deeply as it sees fit. For deep, narrow sequences like reading ladders, this flexibility is important.</p>
<h1 id="a-basic-implementation-in-python">A basic implementation in Python</h1>
<p>Let’s start at the top. The following code is a straightforward translation of each step discussed above.</p>
<pre><code>def UCT_search(game_state, num_reads):
    root = UCTNode(game_state)
    for _ in range(num_reads):
        leaf = root.select_leaf()
        child_priors, value_estimate = NeuralNet.evaluate(leaf.game_state)
        leaf.expand(child_priors)
        leaf.backup(value_estimate)
    return max(root.children.items(),
               key=lambda item: item[1].number_visits)</code></pre>
<p>The node class is also pretty straightforward: it has references to the game state it represents, pointers to its parent and children nodes, and a running tally of evaluation results.</p>
<pre><code>class UCTNode():
    def __init__(self, game_state, parent=None, prior=0):
        self.game_state = game_state
        self.is_expanded = False
        self.parent = parent  # Optional[UCTNode]
        self.children = {}  # Dict[move, UCTNode]
        self.prior = prior  # float
        self.total_value = 0  # float
        self.number_visits = 0  # int</code></pre>
<p>Step 1 (selection) occurs by repeatedly selecting the child node with the largest Q + U score. Q is calculated as the average of all evaluations. U is a bit more complex; the important part of the U formula is that it has the number of visits in the denominator, ensuring that as a node is repeatedly visited, its uncertainty bonus shrinks inversely proportional to the number of visits.</p>
<pre><code>def Q(self):  # returns float
    return self.total_value / (1 + self.number_visits)

def U(self):  # returns float
    return (math.sqrt(self.parent.number_visits)
            * self.prior / (1 + self.number_visits))

def best_child(self):
    return max(self.children.values(),
               key=lambda node: node.Q() + node.U())

def select_leaf(self):
    current = self
    while current.is_expanded:
        current = current.best_child()
    return current</code></pre>
<p>Step 2 (expansion) is pretty straightforward: mark the node as expanded and create child nodes to be explored on subsequent iterations.</p>
<pre><code>def expand(self, child_priors):
    self.is_expanded = True
    for move, prior in enumerate(child_priors):
        self.add_child(move, prior)

def add_child(self, move, prior):
    self.children[move] = UCTNode(
        self.game_state.play(move), parent=self, prior=prior)</code></pre>
<p>Step 3 (backup) is also mostly straightforward: increment visit counts and add the value estimation to the tally. The one tricky step is that the value estimate must be inverted, depending on whose turn it is to play. This ensures that the “max” Q value is in fact the best Q from the perspective of the player whose turn it is to play.</p>
<pre><code>def backup(self, value_estimate):
    current = self
    while current.parent is not None:
        current.number_visits += 1
        current.total_value += (value_estimate *
                                self.game_state.to_play)
        current = current.parent</code></pre>
<p>And there we have it - a barebones implementation of UCT search in about 50 lines of Python.</p>
<p>Unfortunately, the basic implementation performs rather poorly. When executed with <span class="math inline">\(10^4\)</span> iterations of search, this implementation takes 30 seconds to execute, consuming 2 GB of memory. Given that many Go engines commonly execute <span class="math inline">\(10^5\)</span> or even <span class="math inline">\(10^6\)</span> searches before selecting a move, this is rather poor performance. This implementation has stubbed out the gameplay logic and the neural network execution, so the time and space shown here represents overhead due purely to search.</p>
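<p>For reference, stubs of the kind described here could look roughly like the following. The class names are chosen to match the calls made in the code above (<code>game_state.play</code>, <code>game_state.to_play</code>, <code>NeuralNet.evaluate</code>), but the bodies, the 362-move count, and the sign convention for <code>to_play</code> are assumptions for illustration rather than the exact stubs used for the timings.</p>

```python
import random

BOARD_MOVES = 19 * 19 + 1  # every point on a 19x19 board, plus "pass"

class GameState:
    """Stub game state: tracks only whose turn it is.

    Assumed convention: to_play is 1 for Black and -1 for White.
    """
    def __init__(self, to_play=1):
        self.to_play = to_play

    def play(self, move):
        # Ignore the move itself; just hand the turn to the other player.
        return GameState(-self.to_play)

class NeuralNet:
    """Stub network: random move priors and a random position evaluation."""
    @staticmethod
    def evaluate(game_state):
        priors = [random.random() for _ in range(BOARD_MOVES)]
        value = random.uniform(-1.0, 1.0)  # absolute sense: 1 = Black wins
        return priors, value
```

<p>With stubs like these, every call into the game logic or the network is nearly free, so whatever time and memory a search consumes is attributable to the tree bookkeeping itself.</p>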
<p>What went wrong? Do we just blame Python for being slow?</p>
<p>Well, kind of. The problem with the basic implementation is that we instantiate hundreds of UCTNode objects, solely for the purpose of iterating over them and doing some arithmetic on each node to calculate Q and U. Each individual operation is fast, but when we are executing thousands of Python operations (attribute access, addition, multiplication, comparisons) to select a variation, the whole thing inevitably becomes slow.</p>
<h1 id="optimizing-for-performance-using-numpy">Optimizing for performance using NumPy</h1>
<p>One strategy for minimizing the number of Python operations is to get more bang for the buck, by using NumPy.</p>
<p>The way NumPy works is by executing the same operation across an entire vector or matrix of elements. Adding two vectors in NumPy only requires one NumPy operation, regardless of the size of the vectors. NumPy will then delegate the actual computation to an implementation done in C or sometimes even Fortran.</p>
<pre><code>>>> nodes = [(0.7, 0.1), (0.3, 0.3), (0.4, 0.2)]
>>> q_plus_u = [_1 + _2 for _1, _2 in nodes]
>>> q_plus_u
[0.8, 0.6, 0.6]
>>> max(range(len(q_plus_u)), key=lambda i: q_plus_u[i])
0
>>> import numpy as np
>>> q = np.array([0.7, 0.3, 0.4])
>>> u = np.array([0.1, 0.3, 0.2])
>>> q_plus_u = q + u
>>> q_plus_u
array([0.8, 0.6, 0.6])
>>> np.argmax(q_plus_u)
0</code></pre>
<p>This switch in coding style is an instance of the <a href="https://en.wikipedia.org/wiki/AOS_and_SOA">array of structs vs struct of arrays idea</a> which appears over and over in various contexts. Row- vs. column-oriented databases are another place this idea pops up.</p>
<p>So how do we integrate NumPy into our basic implementation? The first step is to switch perspectives; instead of having each node know about its own Q/U statistics, each node now knows about the Q/U statistics of its children.</p>
<pre><code>class UCTNode():
    def __init__(self, game_state,
                 move, parent=None):
        self.game_state = game_state
        self.move = move
        self.is_expanded = False
        self.parent = parent  # Optional[UCTNode]
        self.children = {}  # Dict[move, UCTNode]
        self.child_priors = np.zeros(
            [362], dtype=np.float32)
        self.child_total_value = np.zeros(
            [362], dtype=np.float32)
        self.child_number_visits = np.zeros(
            [362], dtype=np.float32)</code></pre>
<p>This already results in huge memory savings. Now, we only add child nodes when exploring a new variation, rather than eagerly expanding all child nodes. The result is that we instantiate a hundred times fewer UCTNode objects, so the overhead there is gone. NumPy is also great about packing in the bytes - the memory consumption of a NumPy array containing 362 float32 values is not much more than 362 * 4 bytes. The Python equivalent would have a PyObject wrapper around every float, resulting in a much larger memory footprint.</p>
<p>Now that each node no longer knows about its own statistics, we create aliases for a node’s statistics by using property getters and setters. These allow us to transparently proxy these properties to the relevant entry in the parent’s child arrays.</p>
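<p>To make the memory claim concrete, here is a minimal sketch that compares the two storage layouts for one node’s worth of statistics. Reading 362 as the 19 * 19 board points plus a pass move is my assumption, and the exact byte counts for the pure-Python version depend on the interpreter, so the script prints them rather than promising specific numbers.</p>
<pre><code>import sys

import numpy as np

N_MOVES = 362  # assumed: 19 * 19 board points plus one pass move

# Struct-of-arrays storage: one contiguous buffer of raw float32 values.
packed = np.zeros([N_MOVES], dtype=np.float32)
packed_bytes = packed.nbytes  # raw data only: 362 * 4 bytes

# Pure-Python storage: a list of pointers to individually boxed floats.
boxed = [float(i) for i in range(N_MOVES)]
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)

print("NumPy data bytes:", packed_bytes)
print("NumPy array object bytes:", sys.getsizeof(packed))
print("Python list plus floats bytes:", boxed_bytes)</code></pre>
<p>The NumPy array pays a small fixed header on top of its data, while the list pays for a pointer plus a full float object per element.</p>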
<pre><code>@property
def number_visits(self):
    return self.parent.child_number_visits[self.move]

@number_visits.setter
def number_visits(self, value):
    self.parent.child_number_visits[self.move] = value

@property
def total_value(self):
    return self.parent.child_total_value[self.move]

@total_value.setter
def total_value(self, value):
    self.parent.child_total_value[self.move] = value</code></pre>
<p>These aliases work for both reading and writing values - as a result, the rest of the code stays about the same! There is no sacrifice in code clarity to accommodate the NumPy perspective switch. As an example, see the new implementation of <code>child_U</code> which uses the property <code>self.number_visits</code>.</p>
<pre><code>def child_Q(self):
    return self.child_total_value / (1 + self.child_number_visits)

def child_U(self):
    return math.sqrt(self.number_visits) * (
        self.child_priors / (1 + self.child_number_visits))

def best_child(self):
    return np.argmax(self.child_Q() + self.child_U())</code></pre>
<p>They look nearly identical to the original declarations. There are two differences: <code>best_child</code> now returns the index of the best move rather than a node object, and each arithmetic operation, which previously worked on one Python float, now operates over an entire array.</p>
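<p>Since <code>best_child</code> hands back the result of <code>np.argmax</code>, leaf selection needs some way to turn a move index into a node, creating that node only on first visit. The following is a minimal sketch of one way to do that, not necessarily the code in the repo. The names <code>maybe_add_child</code>, <code>DummyNode</code>, and <code>StubGameState</code> are placeholders I made up for illustration, and the uniform priors stand in for a real network output.</p>
<pre><code>import collections
import math

import numpy as np

N_MOVES = 362  # size of the child arrays used in the text


class DummyNode:
    """Hypothetical placeholder parent, so the root can proxy its statistics."""

    def __init__(self):
        self.parent = None
        self.child_total_value = collections.defaultdict(float)
        self.child_number_visits = collections.defaultdict(float)


class StubGameState:
    """Hypothetical game state that only tracks whose turn it is."""

    def __init__(self, to_play=1):
        self.to_play = to_play

    def play(self, move):
        return StubGameState(-self.to_play)


class UCTNode:
    def __init__(self, game_state, move, parent=None):
        self.game_state = game_state
        self.move = move
        self.is_expanded = False
        self.parent = parent
        self.children = {}  # only the children that have actually been visited
        self.child_priors = np.zeros([N_MOVES], dtype=np.float32)
        self.child_total_value = np.zeros([N_MOVES], dtype=np.float32)
        self.child_number_visits = np.zeros([N_MOVES], dtype=np.float32)

    @property
    def number_visits(self):
        return self.parent.child_number_visits[self.move]

    def child_Q(self):
        return self.child_total_value / (1 + self.child_number_visits)

    def child_U(self):
        return math.sqrt(self.number_visits) * (
            self.child_priors / (1 + self.child_number_visits))

    def best_child(self):
        # Cast to a plain int so the result is an ordinary dict key.
        return int(np.argmax(self.child_Q() + self.child_U()))

    def maybe_add_child(self, move):
        # Create the child lazily, the first time the search walks through it.
        if move not in self.children:
            self.children[move] = UCTNode(
                self.game_state.play(move), move, parent=self)
        return self.children[move]

    def select_leaf(self):
        current = self
        while current.is_expanded:
            current = current.maybe_add_child(current.best_child())
        return current

    def expand(self, child_priors):
        self.is_expanded = True
        self.child_priors = child_priors


root = UCTNode(StubGameState(), move=None, parent=DummyNode())
root.expand(np.full([N_MOVES], 1.0 / N_MOVES, dtype=np.float32))
leaf = root.select_leaf()
print(leaf.move, len(root.children))</code></pre>
<p>After one call to <code>select_leaf</code>, the root holds exactly one child object even though it carries statistics for all 362 moves, which is where the drop in object count comes from.</p>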
<p>How does this optimized implementation perform?</p>
<p>When doing the same <span class="math inline">\(10^4\)</span> iterations of search, this implementation runs in <strong>90 MB of memory (a 20x improvement) and in 0.8 seconds (a 40x improvement)</strong>. In fact, the memory improvement is understated, as 20 MB of the 90 MB footprint is due to the Python VM and imported NumPy modules. If you let the NumPy implementation run for the full 30 seconds that the previous implementation ran for, it completes 300,000 iterations while consuming 2 GB of memory - so <strong>30x more iterations in the same time and space</strong>.</p>
<p>The performance wins here come from eliminating thousands of repetitive Python operations and unnecessary objects, and replacing them with a handful of NumPy operations operating on compact arrays of floats. This requires a perspective shift in the code, but with judicious use of <code>@property</code> decorators, readability is preserved.</p>
<h1 id="other-components-of-a-uct-implementation">Other components of a UCT implementation</h1>
<p>The code I’ve shown so far is pretty barebones. A UCT implementation must handle these additional details:</p>
<ul>
<li>Disallowing illegal moves.</li>
<li>Detecting when a variation represents a completed game, and scoring according to the actual rules, rather than the network’s approximation.</li>
<li>Imposing a move limit to prevent arbitrarily long games.</li>
</ul>
<p>Additionally, the following optimizations can be considered:</p>
<ul>
<li>Subtree reuse</li>
<li>Pondering (thinking during the opponent’s time)</li>
<li>Parent-Q initialization</li>
<li>Tuning relative weights of Q, U</li>
<li>Virtual Losses</li>
</ul>
<p>One of these optimizations is particularly simple yet incredibly important to AlphaGoZero’s operation - virtual losses.</p>
<h2 id="virtual-losses">Virtual losses</h2>
<p>Until now, I’ve been talking about the Python parts of UCT search. But there’s also a neural network to consider, and one of the things we know about the GPUs that execute the calculations is that GPUs like big batches. Instead of passing in just one variation at a time, it would be preferable to pass in 8 or 16 variations at once.</p>
<p>Unfortunately, the algorithms as implemented above are 100% deterministic, meaning that repeated calls to <code>select_leaf()</code> will return the same variation each time!</p>
<p>Fixing this requires changing five lines: <img src="/static/deep_dive_mcts/virtual_losses_diff.png" title="changes needed for virtual losses" style="display: block; margin: 0 auto; width: 80%;"/></p>
<p>This change causes <code>select_leaf</code> to pretend that it already knows the evaluation result (a loss) and to apply it to every node it passes through. This causes subsequent calls to <code>select_leaf</code> to avoid this exact variation, instead picking the second most interesting variation. After submitting a batch of multiple variations to the neural network, the virtual loss is reverted and replaced with the actual evaluation.</p>
<p>(The 5 line change is a bit of an oversimplification; implementing virtual losses requires handling a bunch of edge cases, like “what if the same leaf gets selected twice despite the virtual loss” and “tree consists of one root node”)</p>
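<p>To see the effect in isolation, here is a toy sketch of virtual losses on a single node with three children. It is not the diff pictured above. The helper names, the priors, the parent visit count, and the “network” values are all made up for illustration, and a loss is assumed to be worth -1 from the selecting player’s point of view. In a real tree the same bookkeeping would be applied to every node on the path to the leaf.</p>
<pre><code>import numpy as np

priors = np.array([0.5, 0.3, 0.2], dtype=np.float32)  # made-up priors
total_value = np.zeros(3, dtype=np.float32)
number_visits = np.zeros(3, dtype=np.float32)
parent_visits = 10.0  # assumed visit count of the parent node


def best_child():
    q = total_value / (1 + number_visits)
    u = np.sqrt(parent_visits) * priors / (1 + number_visits)
    return int(np.argmax(q + u))


def add_virtual_loss(move):
    # Pretend this child was already visited once and lost.
    number_visits[move] += 1
    total_value[move] -= 1


def revert_virtual_loss(move):
    number_visits[move] -= 1
    total_value[move] += 1


# Without virtual losses, selection is deterministic: the same child each time.
repeated = [best_child() for _ in range(3)]

# With virtual losses, each selection discourages the next one from repeating it.
batch = []
for _ in range(3):
    move = best_child()
    add_virtual_loss(move)
    batch.append(move)

# Once the batch is evaluated, swap each virtual loss for the real result.
fake_evaluations = [0.1, -0.2, 0.3]  # stand-ins for network outputs
for move, value in zip(batch, fake_evaluations):
    revert_virtual_loss(move)
    number_visits[move] += 1
    total_value[move] += value

print(repeated, batch)</code></pre>
<p>The deterministic selector picks the same child three times, while the version with virtual losses spreads the batch across three different children.</p>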
<p>The overall scaling made possible by virtual losses is something like 50x. This number comes from significantly increased throughput on the GPU (say, 8x throughput). Also, now that leaf selection and leaf evaluation have been completely decoupled, you can actually scale up the number of GPUs - the match version of AlphaGoZero actually had 4 TPUs cooperating on searching a single game tree. So that’s another 4x. And finally, since the CPU and GPU/TPU are now executing in parallel instead of in series, you can think of it as another 2x speedup.</p>
<h1 id="summary">Summary</h1>
<p>I’ve shown how and why UCT search works, a basic Python implementation as well as an optimized implementation using NumPy, and another optimization that gives smoother integration of UCT search with multiple GPUs/TPUs.</p>
<p>Hopefully you’ll agree with me that UCT search is a significant contribution to AlphaGoZero’s reinforcement learning loop.</p>
<p>The example code shown here is available in a <a href="https://github.com/brilee/python_uct">git repo</a>. You can see the productionized version with all the optimizations in the <a href="https://github.com/tensorflow/minigo">Minigo codebase</a> - see <code>mcts.py</code> and <code>strategies.py</code> in particular.</p>
Visualizing TensorFlow's streaming shufflers2018-04-04T00:00:00Z2018-04-04T00:00:00Ztag:www.moderndescartes.com,2018-04-04:/essays/shuffle_viz<h1 id="introduction">Introduction</h1>
<p>If you’ve ever played Magic: The Gathering or other card games involving large, unwieldy decks of cards, you’ve probably wondered: How the heck am I supposed to shuffle this thing? How would I even know if I were shuffling properly?</p>
<p>As it turns out, there are similar problems in machine learning, where training datasets routinely exceed the size of your machine’s memory. Shuffling here is very important; imagine you (the model) are swimming through the ocean (the data) trying to predict an average water temperature (the outcome). You won’t really be able to give a good answer because the ocean is not well shuffled.</p>
<p>In practice, insufficiently shuffled datasets tend to manifest as spiky loss curves: the loss drops very low as the model overfits to one type of data, and then when the data changes style, the loss spikes back up to random chance levels, and then steadily overfits again.</p>
<p>TensorFlow provides a rather simple API for shuffling data streams: <a href="https://www.tensorflow.org/programmers_guide/datasets#randomly_shuffling_input_data">Dataset.shuffle(buffer_size)</a>. I wanted to understand how the level of shuffledness changed as you fiddled with the <code>buffer_size</code> parameter.</p>
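<p>For intuition, a streaming shuffler can be modeled in a few lines of plain Python. This is a sketch of my understanding of the behavior (fill a buffer, emit a random element from it, replace that element with the next input), not TensorFlow’s actual implementation. The function name, the seed parameter, and the pass-through treatment of a zero-sized buffer are my own choices.</p>
<pre><code>import random


def streaming_shuffle(items, buffer_size, seed=0):
    """Yield items in an order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if buffer_size > len(buffer):
            buffer.append(item)  # still filling the buffer
        elif not buffer:
            yield item  # buffer_size of 0: pass through unshuffled
        else:
            # Emit a random buffered element and put the new item in its slot.
            idx = rng.randrange(len(buffer))
            yield buffer[idx]
            buffer[idx] = item
    # The input is exhausted: drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


out = list(streaming_shuffle(range(1024), buffer_size=10))
# How far ahead of its original position did any element manage to jump?
max_early = max(value - position for position, value in enumerate(out))
print(max_early)</code></pre>
<p>In this model an element can never be emitted more than <code>buffer_size - 1</code> positions earlier than where it started, which is one way to see why a small buffer cannot break up large-scale structure.</p>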
<h1 id="visualizing-shuffledness">Visualizing shuffledness</h1>
<p>The seemingly simple way to measure shuffledness would be to come up with some measure of shuffledness, and compare this number between different invocations of <code>dataset.shuffle()</code>. But I spent a while trying to come up with an equation that could measure shuffledness and came up blank. As it turns out, people have come up with complicated test suites like <a href="https://en.wikipedia.org/wiki/Diehard_tests">Diehard</a> or <a href="https://en.wikipedia.org/wiki/TestU01">Crush</a> to try to measure the quality of pseudorandom number generators, so suffice it to say that it’s a hard problem.</p>
<p>Instead, I decided I’d try to visualize the data directly, in a way that would highlight unshuffled patches of data.</p>
<p>To do this, we use the Hilbert Curve, a space-filling fractal that can take a 1D sequence of data and shove it into a 2D space, in a way that if two points are close to each other in the 1D sequence, then they’ll be close in 2D space.</p>
<table style="margin: 0 auto; caption-side: bottom">
<caption>
Hilbert curves of order 1…5
</caption>
<tr>
<td>
<img src="/static/shuffling_viz/hilbert_curve_1.svg">
</td>
<td>
<img src="/static/shuffling_viz/hilbert_curve_2.svg">
</td>
<td>
<img src="/static/shuffling_viz/hilbert_curve_3.svg">
</td>
<td>
<img src="/static/shuffling_viz/hilbert_curve_4.svg">
</td>
<td>
<img src="/static/shuffling_viz/hilbert_curve_5.svg">
</td>
</tr>
</table>
<p>Each element of the list then gets mapped to a color on the color wheel.</p>
<figure style="text-align: center">
<img src="/static/shuffling_viz/basic_scaling_1024_0.png"> <img src="/static/shuffling_viz/basic_scaling_1024_1.png">
<figcaption>
A sorted list and a shuffled list.
</figcaption>
</figure>
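<p>The mapping from list position to pixel can be sketched as follows. This uses the standard iterative distance-to-coordinate conversion for the Hilbert curve and maps each value to a hue with the standard library’s <code>colorsys</code>. The function names and the choice of full saturation and brightness are assumptions for illustration, and the plotting code that produced the images above may differ.</p>
<pre><code>import colorsys


def hilbert_d2xy(side, d):
    """Map distance d along a Hilbert curve to (x, y) on a side x side grid.

    side must be a power of two.
    """
    x = y = 0
    t = d
    s = 1
    while side > s:
        rx = (t // 2) % 2
        ry = (t ^ rx) % 2
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x  # rotate the quadrant
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def render(values):
    """Place values[i] at the i-th point of the curve, colored by hue.

    Assumes len(values) is a power of four and values is a permutation
    of range(len(values)).
    """
    total = len(values)
    side = int(round(total ** 0.5))
    grid = [[None] * side for _ in range(side)]
    for position, value in enumerate(values):
        x, y = hilbert_d2xy(side, position)
        grid[y][x] = colorsys.hsv_to_rgb(value / total, 1.0, 1.0)
    return grid


grid = render(list(range(1024)))  # a sorted list on a 32 x 32 grid</code></pre>
<p>Because consecutive positions on the curve are always neighboring pixels, any run of unshuffled data shows up as a contiguous patch of similar color.</p>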
<h1 id="exploring-shuffler-configurations">Exploring shuffler configurations</h1>
<h2 id="basic-shuffling">Basic shuffling</h2>
<p>Let’s start with the simplest shuffle. We’ll start with a dataset and stream it through a shuffler of varying size. In the following table, we have datasets of size <span class="math inline">\(2^{10}, 2^{12}, 2^{14}\)</span>, and shufflers of buffer size 0%, 1%, 10%, 50%, and 100% of the data size.</p>
<table style="margin: 0 auto; caption-side: bottom">
<caption>
A single shuffler of buffer size ratio 0 - 1, acting on datasets of size <span class="math inline">\(2^{10} - 2^{14}\)</span>.
</caption>
<tr>
<th colspan="7">
Buffer size ratio
</th>
</tr>
<tr>
<th rowspan="4">
# data
</th>
<td />
<th>
0
</th>
<th>
0.01
</th>
<th>
0.1
</th>
<th>
0.5
</th>
<th>
1
</th>
</tr>
<tr>
<th>
1024
</th>
<td>
<img src="/static/shuffling_viz/basic_scaling_1024_0.png">
</td>
<td>
<img src="/static/shuffling_viz/basic_scaling_1024_0.01.png">
</td>
<td>
<img src="/static/shuffling_viz/basic_scaling_1024_0.1.png">
</td>
<td>
<img src="/static/shuffling_viz/basic_scaling_1024_0.5.png">
</td>
<td>
<img src="/static/shuffling_viz/basic_scaling_1024_1.png">
</td>
</tr>
<tr>
<th>
4096
</th>
<td>
<img src="/static/shuffling_viz/basic_scaling_4096_0.png">
</td>
<td>
<img src="/static/shuffling_viz/basic_scaling_4096_0.01.png">
</td>
<td>
<img src="/static/shuffling_viz/basic_scaling_4096_0.1.png">
</td>
<td>
<img src="/static/shuffling_viz/basic_scaling_4096_0.5.png">
</td>
<td>
<img src="/static/shuffling_viz/basic_scaling_4096_1.png">
</td>
</tr>
<tr>
<th>
16384
</th>
<td>
<img src="/static/shuffling_viz/basic_scaling_16384_0.png">
</td>
<td>
<img src="/static/shuffling_viz/basic_scaling_16384_0.01.png">
</td>
<td>
<img src="/static/shuffling_viz/basic_scaling_16384_0.1.png">
</td>
<td>
<img src="/static/shuffling_viz/basic_scaling_16384_0.5.png">
</td>
<td>
<img src="/static/shuffling_viz/basic_scaling_16384_1.png">
</td>
</tr>
</table>
<p>As it turns out, using a simple <code>dataset.shuffle()</code> is good enough to scramble the exact ordering of the data when making multiple passes over your data, but it’s not good for much else. It completely fails to destroy any large-scale correlations in your data.</p>
<p>Another interesting discovery here is that the buffer size ratio [buffer size / dataset size] appears to be scale-free, meaning that even as we scale up to a much larger dataset, the qualitative shuffling behavior remains unchanged as long as the buffer size ratio stays the same. This gives us the confidence to say that our toy examples here will generalize to real datasets.</p>
<h2 id="chained-shufflers">Chained shufflers</h2>
<p>The next thought I had was whether you could do any better by chaining multiple <code>.shuffle()</code> calls in a row. To keep the comparison fair, I kept the memory budget constant, so if I used 4 chained shuffle calls, each shuffle call would get 1/4 the buffer size. In the following table, we have 1, 2, or 4 chained shufflers, with buffer size ratios of 0%, 1%, 10%, and 50%. <em>All graphs from here on use a dataset size of <span class="math inline">\(2^{14}\)</span>.</em></p>
<table style="margin: 0 auto; caption-side: bottom">
<caption>
Multiple chained shufflers (1, 2, or 4) with varying buffer sizes.
</caption>
<tr>
<th colspan="5">
# chained shufflers
</th>
</tr>
<tr>
<th rowspan="5">
buffer size
</th>
<td />
<th>
1
</th>
<th>
2
</th>
<th>
4
</th>
</tr>
<tr>
<th>
0
</th>
<td>
<img src="/static/shuffling_viz/chained_scaling_0_1.png">
</td>
<td>
<img src="/static/shuffling_viz/chained_scaling_0_2.png">
</td>
<td>
<img src="/static/shuffling_viz/chained_scaling_0_4.png">
</td>
</tr>
<tr>
<th>
0.01
</th>
<td>
<img src="/static/shuffling_viz/chained_scaling_0.01_1.png">
</td>
<td>
<img src="/static/shuffling_viz/chained_scaling_0.01_2.png">
</td>
<td>
<img src="/static/shuffling_viz/chained_scaling_0.01_4.png">
</td>
</tr>
<tr>
<th>
0.1
</th>
<td>
<img src="/static/shuffling_viz/chained_scaling_0.1_1.png">
</td>
<td>
<img src="/static/shuffling_viz/chained_scaling_0.1_2.png">
</td>
<td>
<img src="/static/shuffling_viz/chained_scaling_0.1_4.png">
</td>
</tr>
<tr>
<th>
0.5
</th>
<td>
<img src="/static/shuffling_viz/chained_scaling_0.5_1.png">
</td>
<td>
<img src="/static/shuffling_viz/chained_scaling_0.5_2.png">
</td>
<td>
<img src="/static/shuffling_viz/chained_scaling_0.5_4.png">
</td>
</tr>
</table>
<p>The discovery here is that, for the same total memory budget, chaining several small shufflers shuffles the data less thoroughly than using one big shuffler.</p>
<h2 id="sharded-shuffling">Sharded shuffling</h2>
<p>It seems, then, that we need some way to create large-scale movement of data. The simplest way to do this is to shard your data into multiple smaller chunks. In fact, if you’re working on very large datasets, chances are your data is already sharded to begin with. In the following table, we have 1, 2, 4, or 8 shards of data, with buffer size ratios of 0%, 1%, 10%, and 50%. The order of shards is randomized.</p>
<table style="margin: 0 auto; caption-side: bottom">
<caption>
A single shuffler reading (1, 2, 4, or 8) shards in random order.
</caption>
<tr>
<th colspan="6">
number of shards
</th>
</tr>
<tr>
<th rowspan="5">
buffer size
</th>
<td />
<th>
1
</th>
<th>
2
</th>
<th>
4
</th>
<th>
8
</th>
</tr>
<tr>
<th>
0
</th>
<td>
<img src="/static/shuffling_viz/sharded_scaling_0_1.png">
</td>
<td>
<img src="/static/shuffling_viz/sharded_scaling_0_2.png">
</td>
<td>
<img src="/static/shuffling_viz/sharded_scaling_0_4.png">
</td>
<td>
<img src="/static/shuffling_viz/sharded_scaling_0_8.png">
</td>
</tr>
<tr>
<th>
0.01
</th>
<td>
<img src="/static/shuffling_viz/sharded_scaling_0.01_1.png">
</td>
<td>
<img src="/static/shuffling_viz/sharded_scaling_0.01_2.png">
</td>
<td>
<img src="/static/shuffling_viz/sharded_scaling_0.01_4.png">
</td>
<td>
<img src="/static/shuffling_viz/sharded_scaling_0.01_8.png">
</td>
</tr>
<tr>
<th>
0.1
</th>
<td>
<img src="/static/shuffling_viz/sharded_scaling_0.1_1.png">
</td>
<td>
<img src="/static/shuffling_viz/sharded_scaling_0.1_2.png">
</td>
<td>
<img src="/static/shuffling_viz/sharded_scaling_0.1_4.png">
</td>
<td>
<img src="/static/shuffling_viz/sharded_scaling_0.1_8.png">
</td>
</tr>
<tr>
<th>
0.5
</th>
<td>
<img src="/static/shuffling_viz/sharded_scaling_0.5_1.png">
</td>
<td>
<img src="/static/shuffling_viz/sharded_scaling_0.5_2.png">
</td>
<td>
<img src="/static/shuffling_viz/sharded_scaling_0.5_4.png">
</td>
<td>
<img src="/static/shuffling_viz/sharded_scaling_0.5_8.png">
</td>
</tr>
</table>
<h2 id="parallel-read-sharded-shuffling">Parallel-read sharded shuffling</h2>
<p>The last table didn’t look particularly great, but wait till you see this one. A logical next step with sharded data is to read multiple shards concurrently. Luckily, TensorFlow’s <a href="https://www.tensorflow.org/api_docs/python/tf/data/Dataset#interleave">dataset.interleave</a> API makes this really easy to do.</p>
<p>The following table has 1, 2, 4, or 8 shards, with 1, 2, 4, or 8 of those shards being read in parallel. <em>All graphs from here on use a buffer size ratio of 1%.</em></p>
<table style="margin: 0 auto; caption-side: bottom">
<caption>
A single shuffler reading multiple shards in parallel.
</caption>
<tr>
<th colspan="6">
shards read in parallel
</th>
</tr>
<tr>
<th rowspan="5">
# shards
</th>
<td />
<th>
1
</th>
<th>
2
</th>
<th>
4
</th>
<th>
8
</th>
</tr>
<tr>
<th>
1
</th>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_1_1.png">
</td>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<th>
2
</th>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_2_1.png">
</td>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_2_2.png">
</td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<th>
4
</th>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_4_1.png">
</td>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_4_2.png">
</td>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_4_4.png">
</td>
<td>
</td>
</tr>
<tr>
<th>
8
</th>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_8_1.png">
</td>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_8_2.png">
</td>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_8_4.png">
</td>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_8_8.png">
</td>
</tr>
</table>
<p>We’re starting to see some interesting things, namely that when #shards = #parallel reads, we get some pretty darn good shuffling. There are still a few issues: because all the shards are exactly the same size, we see stark boundaries when a set of shards is completed simultaneously. Additionally, because each shard is unshuffled, we see a slowly changing gradient across the image as each shard is read from front to back in parallel. This pattern is most apparent in the 2, 2 and 4, 4 table entries.</p>
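<p>Both effects are easy to reproduce with a pure-Python model of the read pattern. The sketch below only approximates what an interleaved read does (one element at a time from each open shard, opening the next shard when one runs dry). The function names are made up, and it assumes the data divides evenly into shards. The resulting stream would then be fed through the shuffle buffer.</p>
<pre><code>def make_shards(data, num_shards):
    """Split data into equally sized contiguous shards (assumes even division)."""
    size = len(data) // num_shards
    return [data[i * size:(i + 1) * size] for i in range(num_shards)]


def interleave(shards, cycle_length):
    """Read cycle_length shards at a time, taking one element from each in turn."""
    pending = [iter(shard) for shard in shards]
    active = [pending.pop(0) for _ in range(min(cycle_length, len(pending)))]
    while active:
        still_active = []
        for reader in active:
            try:
                item = next(reader)
            except StopIteration:
                # This shard is finished; open the next unread shard, if any.
                if pending:
                    still_active.append(pending.pop(0))
                continue
            still_active.append(reader)
            yield item
        active = still_active


data = list(range(16))
one_at_a_time = list(interleave(make_shards(data, 4), cycle_length=1))
two_at_a_time = list(interleave(make_shards(data, 4), cycle_length=2))
all_at_once = list(interleave(make_shards(data, 4), cycle_length=4))
print(two_at_a_time)
print(all_at_once)</code></pre>
<p>With four shards read at once, the stream steps through all shards in lockstep, which is the slow gradient. With two read at once, both readers finish together and the stream jumps to the next pair of shards, which is the stark boundary.</p>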
<h2 id="parallel-read-sharded-shuffling-with-shard-size-jittering">Parallel-read sharded shuffling, with shard size jittering</h2>
<p>Next, I tried jittering the shard sizes to try and fix the shard boundary issue. The following table is identical to the previous one, except that shard sizes range from 0.75x to 1.5x the size of the previous table’s shards.</p>
<table style="margin: 0 auto; caption-side: bottom">
<caption>
A single shuffler reading multiple shards in parallel (shard sizes jittered).
</caption>
<tr>
<th colspan="6">
shards read in parallel
</th>
</tr>
<tr>
<th rowspan="5">
# shards
</th>
<td />
<th>
1
</th>
<th>
2
</th>
<th>
4
</th>
<th>
8
</th>
</tr>
<tr>
<th>
1
</th>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_jittered_1_1.png">
</td>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<th>
2
</th>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_jittered_2_1.png">
</td>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_jittered_2_2.png">
</td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<th>
4
</th>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_jittered_4_1.png">
</td>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_jittered_4_2.png">
</td>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_jittered_4_4.png">
</td>
<td>
</td>
</tr>
<tr>
<th>
8
</th>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_jittered_8_1.png">
</td>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_jittered_8_2.png">
</td>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_jittered_8_4.png">
</td>
<td>
<img src="/static/shuffling_viz/parallel_read_scaling_jittered_8_8.png">
</td>
</tr>
</table>
<p>This table doesn’t look that great; the big blobs of color occur because whichever shard is the biggest ends up being the only shard left over at the end. We’ve succeeded in smearing the sharp shard boundaries we saw in the previous table, but jittering has not solved the large-scale gradient in color.</p>
<h2 id="multi-stage-shuffling">Multi-stage shuffling</h2>
<p>So now we’re back to reading in parallel from many shards. How might we shuffle the data within each shard? Well, if sharding the original dataset results in shards that fit in memory, then we can just shuffle them - simple enough. But if not, then we can actually just recursively shard our files until they get small enough to fit in memory! The number of sharding stages would then grow as log(N).</p>
<p>Here’s what two-stage shuffling looks like. Each stage is shuffled with the same parameters - number of shards, number of shards read in parallel, and buffer size.</p>
<table style="margin: 0 auto; caption-side: bottom">
<caption>
Recursive sharding/shuffling, with two stages of sharding/shuffling.
</caption>
<tr>
<th colspan="6">
shards read in parallel
</th>
</tr>
<tr>
<th rowspan="5">
# shards
</th>
<td />
<th>
1
</th>
<th>
2
</th>
<th>
4
</th>
<th>
8
</th>
</tr>
<tr>
<th>
1
</th>
<td>
<img src="/static/shuffling_viz/twice_shuffled_1_1.png">
</td>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<th>
2
</th>
<td>
<img src="/static/shuffling_viz/twice_shuffled_2_1.png">
</td>
<td>
<img src="/static/shuffling_viz/twice_shuffled_2_2.png">
</td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<th>
4
</th>
<td>
<img src="/static/shuffling_viz/twice_shuffled_4_1.png">
</td>
<td>
<img src="/static/shuffling_viz/twice_shuffled_4_2.png">
</td>
<td>
<img src="/static/shuffling_viz/twice_shuffled_4_4.png">
</td>
<td>
</td>
</tr>
<tr>
<th>
8
</th>
<td>
<img src="/static/shuffling_viz/twice_shuffled_8_1.png">
</td>
<td>
<img src="/static/shuffling_viz/twice_shuffled_8_2.png">
</td>
<td>
<img src="/static/shuffling_viz/twice_shuffled_8_4.png">
</td>
<td>
<img src="/static/shuffling_viz/twice_shuffled_8_8.png">
</td>
</tr>
</table>
<p>This table shows strictly superior results to our original parallel read table.</p>
<h1 id="conclusions">Conclusions</h1>
<p>I’ve shown here a setup of recursive shuffling that should pretty reliably shuffle data that is perfectly sorted. In practice, your datasets will have different kinds of sortedness at different scales. The important thing is to be able to break correlations at each of these scales.</p>
<p>To summarize:</p>
<ul>
<li>A single streaming shuffler can only remove correlations that are closer than its buffer size.</li>
<li>Shard your data and read in parallel.</li>
<li>Shards should themselves be free of large-scale correlations.</li>
<li>For really big datasets, use multiple passes of shuffling.</li>
</ul>
<p>All code can be found on <a href="https://github.com/brilee/shuffling_experiments">GitHub</a>.</p>
Rewriting moderndescartes.com as a static site2018-03-10T00:00:00Z2018-03-10T00:00:00Ztag:www.moderndescartes.com,2018-03-10:/essays/gcs_static<p>(This was originally a talk given at <a href="https://www.meetup.com/Hack-Tell-Boston/">Hack && Tell Boston</a>.)</p>
<p>You may or may not have noticed that my website is now being served from Google Cloud Storage. If you haven’t, great! My wallet, on the other hand, has definitely noticed:</p>
<p><img src="/static/gcs_billing.png" title="My GCS bill: 1 cent a month." style="margin: 0 auto; width: 100%"/></p>
<p>Read on for a quick rundown of how I went about converting my website.</p>
<hr />
<h1 id="whats-a-static-site">What’s a static site?</h1>
<p>First, let’s clear up the difference between a static and a dynamic site.</p>
<p>google.com is your prototypical dynamic site: it takes your query, your login status, location, and lots of other things to decide what results to return. It then puts together an HTML page which it sends back to your browser to be displayed. Chances are, this exact HTML has never been sent to anybody else before.</p>
<p>A static site, on the other hand, sends the same HTML file to everyone. Because it is so simple to serve static files, many services will host files for you, like content delivery networks, or in my case, Google Cloud Storage.</p>
<h1 id="a-blast-from-the-past">A blast from the past</h1>
<p>The previous backend for this website was a Django 1.4 instance. I hadn’t upgraded it in almost 6 years - Django 1.4 was originally released in March 2012! There was also a MySQL server, and who knows what security holes there were in the whole thing. Compiling new essays relied on an ancient Python Markdown library that I could no longer find on PyPI. To add insult to injury, I paid $80/year for the privilege of maintaining this tire fire of a website backend.</p>
<p>The breaking point was when I got a new laptop and was unable to reinstall the Python Markdown library, which meant that I couldn’t render/view my essays locally before pushing them live.</p>
<h1 id="compiling-static-pages">Compiling static pages</h1>
<p>The first order of business was finding a replacement Markdown compiler. I first started with a handful of Python libraries, but none of them could handle hybrid Markdown/LaTeX documents. I then remembered that <a href="https://pandoc.org">Pandoc</a> existed, and just like that, all of my problems were magically solved.</p>
<p>So, to recompile all of my webpages, I passed the raw Markdown essays through Pandoc, and then through <a href="http://jinja.pocoo.org/docs/2.10/">jinja2</a> (a templating engine with Django-like syntax) to compile the webpages as they had previously been rendered on my website.</p>
<p>For static assets, nothing more than a <code>cp -r</code> was needed.</p>
<p>For my RSS feed, I used a <a href="https://github.com/getpelican/feedgenerator">standalone version of Django’s RSS feed generator</a>. Technically, RSS feeds are dynamic, but mine only ever changes when a new essay is posted, so I could simply regenerate the feed each time I pushed a new essay.</p>
<p>Everything went into a staging directory, after which a single command deploys the whole thing: <code>gsutil -m rsync -r -d staging gs://www.moderndescartes.com</code>.</p>
<h1 id="hiccups-with-the-new-gcs-static-site">Hiccups with the new GCS static site</h1>
<p>As it turns out, having a static site doesn’t just mean serving files, but also correctly setting various HTTP headers on the responses. For example, I had to figure out how to get GCS to put a <code>Content-Type: text/html</code> header on my extensionless files. I also ended up having to add <code>&lt;meta charset="utf-8"/&gt;</code> to my pages, because this was something Django had transparently handled for me before.</p>
<p>I also discovered that <code>cp -p</code> would preserve modification times, which meant that I could prevent <code>gsutil rsync</code> from attempting to reupload my entire static directory every time I recompiled my files.</p>
<p>There was an unfruitful detour into gzip-land; <code>gsutil cp -z</code> would apparently gzip files automatically before uploading them, and would even automatically set the content-encoding property, so that the pages would be served correctly. But <code>gsutil rsync</code> didn’t support this flag, so instead of maintaining a weird hybrid push step of <code>gsutil cp</code> for my HTML pages and <code>gsutil rsync</code> for my static assets, I abandoned the idea altogether.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Writing my own static site compiler has been remarkably easy to do. Between jinja2 for templating and Pandoc for correctly compiling Markdown/LaTeX files, I’ve ended up with 100 lines of Python that replace the entire Django installation I used to have. The initial prototype took me 3 hours on a plane to write; cleaning up the push/deploy process took another few hours of futzing around with gsutil doc pages.</p>
<p>To take a peek at the code supporting this static site, <a href="https://github.com/brilee/modern-descartes-v2">check out the Git repo</a>.</p>
AlphaGo Zero: an analysis2017-12-03T00:00:00Z2017-12-03T00:00:00Ztag:www.moderndescartes.com,2017-12-03:/essays/agz<p>Where to begin?</p>
<p>I think the most mindblowing part of this paper is just how simple the core idea is: if you have an algorithm that can trade computation time for better assessments, and a way to learn from these better assessments, then you have a way to bootstrap yourself to a higher level. There’s a lot of deserved buzz around how this is an extremely general RL technique that can be used for anything.</p>
<p>I think there are two things special to Go that make this RL technique viable.</p>
<p>First - Go piggybacks off of the success of convolutional networks. Out of the many kinds of networks, convnets for image processing are definitely the kind with the clearest theoretical justification and the most real-world success. Most games have a more arbitrary, less spatial/geometric sense of gameplay, and would require carefully designed network architectures. Go gets to use a vanilla convnet architecture with almost no modification.</p>
<p>Second - Monte Carlo Tree Search (MCTS) is the logical successor of minimax for turn-based games that have a large branching factor. MCTS has been investigated by computer Go researchers for 10 years now, and the community has had a long time to understand how MCTS behaves in favorable and unfavorable positions, and to discover algorithmic optimizations like virtual losses (more on this later). Another algorithmic breakthrough like MCTS will be needed to handle games that have a continuous time dimension.</p>
<h2 id="convnets-and-tpus">Convnets and TPUs</h2>
<p>What makes convnets so appropriate for Go? Well, a Go board is a reasonably large square grid, and there’s nothing particularly special about any point, other than its distance from the edge of the board. That’s exactly the kind of input that convnets do well on.</p>
<p>The recent development of residual network architectures has also allowed convnets to scale to ridiculous depths. The original AlphaGo used 12 layers of 3x3 convolutions, which meant that information could only have propagated a <a href="https://en.wikipedia.org/wiki/Taxicab_geometry">Manhattan distance</a> of 24. To compensate, a set of handcrafted input features helped information propagate through the 19x19 board, by computing ladders and shared liberties of a long chain. But with resnets, the 40 (or 80) convolutions apparently eliminate the need for these handcrafted features. Resnets are clearly an upgrade from regular convnets, and it’s unsurprising that the new AlphaGo uses them.</p>
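<p>As a rough check on the propagation argument, here is a minimal sketch. It takes the simplification above at face value (every layer is a 3x3 convolution), and the function names are made up for illustration. Each layer moves information one point along each axis, so at most two points of Manhattan distance when moving diagonally.</p>

```python
BOARD_SIZE = 19

def axis_reach(num_layers, kernel_size=3):
    """Points that information can travel along one axis after num_layers convolutions."""
    return num_layers * (kernel_size - 1) // 2

def manhattan_reach(num_layers, kernel_size=3):
    """Best-case Manhattan distance, achieved by moving diagonally at every layer."""
    return 2 * axis_reach(num_layers, kernel_size)

# Edge-to-edge distance along one axis, and corner-to-corner Manhattan distance.
edge_to_edge = BOARD_SIZE - 1
corner_to_corner = 2 * (BOARD_SIZE - 1)

for layers in (12, 40, 80):
    spans_board = axis_reach(layers) >= edge_to_edge
    print(layers, manhattan_reach(layers), spans_board)
```

<p>Under these assumptions, 12 layers cannot carry information from one edge of the board to the other, while 40 or 80 layers can do so with room to spare.</p>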
<p>The downside of convnets is the rather large amount of computation they require. Compared to a fully connected network of the same size on a 19x19 board, convnets require roughly 14500 times fewer parameters[1]. The number of add-multiplies doesn’t actually change, though, and the efficiency gains instead encourage researchers to just make their convnets bigger. The net result is a crapton of required computation. That, in turn, implies a lot of TPUs.</p>
<p><img src="/static/hummer.jpg" title="The modern convnet is like a Hummer: a complete gas guzzler" style="display: block; margin: 0 auto; width:100%;"/> Fig. 1: The modern convnet</p>
<p>A lot of press has focused on the use of “4 TPUs” by AGZ, which is the number of TPUs that is used by the competition version of AGZ. This is the wrong thing to focus on. The important number is 2000 TPUs. This is the number of TPUs used during the self-play bootstrapping phase[2]. If you want to trade computation time for improved play, you’ll need a lot of sustained computation time! There is almost certainly room for more efficient use of computation.</p>
<p>[1] Each pixel in the 19x19x256 output of one conv layer only looks at a 3x3x256 slice of the previous layer, and additionally, each of the 19x19 points share the same weights. Thus, the equivalent fully connected network would have used 361^2 / 9 = 14480x more weights.</p>
<p>[2] This number isn’t given in the paper, but it can be extrapolated from their numbers (5 million self-play games, 250 moves per game, 0.4s thinking time per move = 500 million compute seconds = 6000 compute days, which was done in 3 real days). Therefore, something like 2000 TPUs in parallel. Aja Huang has confirmed this number.</p>
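<p>The arithmetic in both footnotes can be reproduced in a few lines. This is only a back-of-the-envelope sketch using the figures quoted above; the variable names are mine, and biases and other layer details are ignored.</p>

```python
# Footnote 1: weights in one 3x3 conv layer vs. an equivalent fully connected layer.
BOARD_POINTS = 19 * 19   # 361 points on the board
KERNEL_POINTS = 3 * 3    # each output looks at a 3x3 neighborhood
CHANNELS = 256

conv_weights = KERNEL_POINTS * CHANNELS * CHANNELS
dense_weights = (BOARD_POINTS * CHANNELS) * (BOARD_POINTS * CHANNELS)
weight_ratio = dense_weights / conv_weights  # equals 361^2 / 9

# Footnote 2: TPUs needed to finish the self-play games in 3 days.
games = 5_000_000
moves_per_game = 250
seconds_per_move = 0.4
real_days = 3

compute_seconds = games * moves_per_game * seconds_per_move
compute_days = compute_seconds / (24 * 60 * 60)
tpus_in_parallel = compute_days / real_days

print(round(weight_ratio), round(compute_days), round(tpus_in_parallel))
```

<p>The exact results come out slightly below the rounded figures in the footnotes (a little under 6000 compute days and a little under 2000 TPUs), which is consistent with “something like 2000”.</p>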
<h2 id="mcts-with-virtual-losses">MCTS with Virtual Losses</h2>
<p>In addition to having a lot of TPUs, it’s important to optimize things so that the TPU is at 100% utilization. The traditional MCTS update algorithm, as described by the AGZ paper itself, goes like this:</p>
<ul>
<li>Pick a new variation in the game tree to explore, balancing between reading more deeply into favorable variations, and exploring new variations.</li>
<li>Ask the neural network what it thinks of that position. (It will return a value estimation, and the most likely followup moves).</li>
<li>Record the followup moves and value estimate. Increment the visit counts for all positions leading to that variation.</li>
</ul>
<p>Unfortunately, the algorithm as described here is an inherently sequential process. Because MCTS is deterministic, rerunning the first step will always return the same variation, until the updated value estimates and visit counts have been incorporated. And yet, the supporting information in the AGZ paper describes batching up positions in multiples of 8, for optimal TPU throughput.</p>
<p>The simplest way to achieve this sort of parallelism is to just play 8 games in parallel. However, there are a few things that make this approach less simple than it initially seems. First is the ragged edges problem: what happens when your games don’t end after the same number of moves? Second is the latency problem: for the purposes of delivering games to the training process, you would rather have 1 game completed every minute, than 8 games completed every 8 minutes. Third is that this method of parallelism severely hamstrings your competition strength, where you want to focus all of your computation time on one game.</p>
<p>So we’d like to figure out how to get a TPU (or four) to operate concurrently on one game tree. That method is virtual losses, and it works as follows:</p>
<ul>
<li>Pick a new variation in the game tree to explore, balancing between reading more deeply into favorable variations, and exploring new variations. Increment the visit count for all positions leading to that variation.</li>
<li>Ask the neural network what it thinks of that position. (It will return a value estimation, and the most likely followup moves).</li>
<li>Record the followup moves and value estimate, realigning value estimates to the standard MCTS algorithm.</li>
</ul>
<p>The method is called virtual losses, because by incrementing visit counts without adding the value estimate, you are adding a value estimate of “0” - or in other words, a (virtual) loss. The net effect is that you can now rerun the first step because it will give you a different variation each time. Therefore, you can run this step repeatedly to get a batch of 8 positions to send to the TPU, and even while the TPU is working, you can continue to run the first step to prepare the next batch.</p>
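<p>Here is a toy sketch of that idea, not the AGZ implementation. The <code>Node</code> class, the function names, the priors, and the fake network values are all invented for illustration, and the selection score is only a PUCT-style formula that I am assuming for the example. Values are kept in the range 0 to 1 from a single player’s perspective; a real two-player search would flip perspective at each ply.</p>

```python
import math

class Node:
    """One position in the search tree (hypothetical minimal structure)."""
    def __init__(self, prior, parent=None):
        self.prior = prior        # policy prior for the move leading here
        self.parent = parent
        self.children = {}        # move -> Node
        self.visits = 0           # visit count
        self.total_value = 0.0    # sum of value estimates, each in [0, 1]

    def mean_value(self):
        return self.total_value / self.visits if self.visits else 0.0

def select_leaf(root, c_puct=1.0):
    """Step 1: walk down the tree, balancing mean value against exploration."""
    node = root
    while node.children:
        sqrt_n = math.sqrt(max(node.visits, 1))
        def score(child):
            explore = c_puct * child.prior * sqrt_n / (1 + child.visits)
            return child.mean_value() + explore
        node = max(node.children.values(), key=score)
    return node

def add_virtual_loss(leaf):
    """Count the visit now; adding no value is the same as recording a 0 (a loss)."""
    node = leaf
    while node is not None:
        node.visits += 1
        node = node.parent

def backup_value(leaf, value):
    """Step 3: add the real value. The visit was already counted at selection."""
    node = leaf
    while node is not None:
        node.total_value += value
        node = node.parent

root = Node(prior=1.0)
for move, prior in [("a", 0.30), ("b", 0.28), ("c", 0.22), ("d", 0.20)]:
    root.children[move] = Node(prior, parent=root)

# Without virtual losses, selection is deterministic: same leaf every time.
same_leaf_twice = select_leaf(root) is select_leaf(root)

# With virtual losses, repeated selection spreads out over different leaves.
batch = []
for _ in range(4):
    leaf = select_leaf(root)
    add_virtual_loss(leaf)
    batch.append(leaf)

# Pretend the network returned these values for the batch, then back them up.
fake_values = [0.6, 0.4, 0.5, 0.5]
for leaf, value in zip(batch, fake_values):
    backup_value(leaf, value)
```

<p>Once every value in the batch has been backed up, the visit counts and value sums are the same as if those four leaves had been evaluated one at a time. Virtual losses do not guarantee distinct leaves in general; in this toy tree the priors happen to be flat enough that all four differ.</p>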
<p>It seems like a small detail, but virtual losses allow MCTS to scale horizontally by a factor of about 50x. 32x comes from 4 TPUs and a batch size of 8 all working on one game tree, and another 2x for allowing both CPU and TPU to work concurrently.</p>
<h2 id="mcts-in-the-random-regime">MCTS in the Random Regime</h2>
<p>The second mindblowing part of the AGZ paper was that this bootstrapping actually worked, even starting from random noise. The MCTS would just be averaging a bunch of random value estimates at the start, so how would it make any progress at all?</p>
<p>Here’s how I think it would work:</p>
<ul>
<li>A bunch of completely random games are played. The value net learns that just counting black stones and white stones correlates with who wins. MCTS is also probably good enough to figure out that if black passes early, then white will also pass to end the game and win by komi, at least according to Tromp-Taylor rules. The training data thus lacks any passes, and the policy network learns not to pass.</li>
<li>MCTS starts to figure out that capture is a good move, because the value net has learned to count stones, so the policy network starts to capture stones. Simultaneously, the value net starts to learn that spots surrounded by stones of one color can be considered to belong to that color.</li>
<li>???</li>
<li>Beat Lee Sedol.</li>
</ul>
<p>Joking aside, the insight here is that there are only two ways to impart true signal into this system: during MCTS (if two passes are executed), and during training of the value net. The value net will therefore be the first to learn; MCTS will then play moves that guide the game towards high value states (taking the opponent’s moves into consideration); and only then will the policy network start outputting those moves. The resulting games are slightly more complex, allowing the value network to learn more sophisticated ideas. In a way, the whole thing reminds me of Fibonacci numbers: your game strength is a combination of the value network from last generation and the policy network from two generations past.</p>
<h2 id="conclusion">Conclusion</h2>
<p>AlphaGo Zero’s reinforcement learning algorithm is an accomplishment that I think should have happened a decade from now, but was made possible today because of Go’s amenability to convnets and MCTS.</p>
<h2 id="miscellaneous-thoughts">Miscellaneous thoughts</h2>
<ul>
<li><p>The first AlphaGo paper trained the policy network first, then froze those weights while training the weights of a value branch. The current AlphaGo concurrently computes both policy and value halves, and trains with a combined loss function. This is really elegant in several ways: the two objectives regularize each other; it halves the computation time required; and it integrates perfectly with MCTS, which requires evaluating both policy and value parts for each variation investigated.</p></li>
<li><p>The reason 400 games were played to evaluate whether a candidate was better than its predecessor is that 400 games put the 55% win rate threshold at exactly 2 standard deviations from an even result. In particular, the variance of a binomial process is <span class="math inline">\(N p (1-p)\)</span>, so if the null hypothesis is p = 0.5 (the candidates do not differ), then the expected variance is 400 * 0.5 * 0.5 = 100. The standard deviation is thus +/- 10, and 200 wins +/- 10 corresponds to a 50% +/- 2.5% win rate. A 55% win rate is therefore 2 standard deviations.</p></li>
<li>There were a lot of interconnected hyperparameters, making this far from a drop-in algorithm:
<ul>
<li>the number of moves randomly sampled at the start to generate enough variation, compared to the size of the board, the typical length of game, and the typical branching factor/entropy introduced by each move.</li>
<li>the size of the board, the dispersion factor of the Dirichlet noise, the <span class="math inline">\(L_2\)</span> regularization strength that causes the policy net’s outputs to be more disperse, the number of MCTS searches that were used to overcome the magnitude of the priors, and the <span class="math inline">\(P_{UCT}\)</span> constant (not given in paper) used to weight the policy priors against the MCTS searches.</li>
<li>the learning rate, batch size, and training speed as compared to the rate at which new games were generated.</li>
<li>the depth of old games that were revisited during training, compared to the rate at which new versions of AGZ were promoted.</li>
</ul></li>
<li><p>On the subject of sampling old games - I wonder if it’s analogous to the way that strong Go players have an intuitive sense of whether a group is alive. They’re almost always right, but maybe they need to read it out every so often to keep their intuition sharp.</p></li>
</ul>
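<p>The significance calculation for the 400-game evaluation matches above can be checked with a short sketch. The function name is hypothetical, and it assumes each game is an independent coin flip under the null hypothesis.</p>

```python
import math

def standard_deviations_from_null(wins, games, p_null=0.5):
    """How many binomial standard deviations the win count is from the null mean."""
    mean = games * p_null
    std = math.sqrt(games * p_null * (1 - p_null))
    return (wins - mean) / std

games = 400
wins_at_threshold = 220  # 55% of 400 games

print(standard_deviations_from_null(wins_at_threshold, games))
```

<p>With 400 games the standard deviation is 10 wins, so 220 wins sits 2 standard deviations above the 200 wins expected from evenly matched players.</p>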