<p>Modern Descartes - Essays by Brian Lee (<a href="https://www.moderndescartes.com/essays">moderndescartes.com/essays</a>) - “I seek, therefore I am”</p>
<h1>Finding New Mountains to Climb</h1>
<p> Originally posted 2024-02-01</p>
<p> Tagged: <a href="/essays/tags/personal">personal</a></p>
<hr />
<p>A long overdue personal update!</p>
<h2 id="machine-learnings-manifest-destiny">Machine Learning’s Manifest
Destiny</h2>
<p>Manifest Destiny, for my non-American and otherwise history-unaware
readers, comes from the era of American history where settlers rushed
westward in pursuit of the “free” land that was being doled out by the
U.S. government. The Destiny was that America would eventually grow from
coast to coast to create a glorious future. The omitted fine print was
that you might have to evict the native inhabitants of that land. It’s a
historical era with strong similarities to recent times.</p>
<p>I’m, of course, talking about machine learning circa 2015-2020. It was
obvious to everyone in ML that there was so much low-hanging fruit
across every subject, and ML was destined to pick that fruit and move
the state of the art forward to a glorious future. There were
researchers scattering in every direction, hoping to stake claims in
“unoccupied” territory and be the first to apply ML to a problem. Like
the original Manifest Destiny, it ignored existing subject matter
inhabitants and betrayed the arrogance of thousands of ML researchers.
But with breakthroughs like AlphaGo, AlphaFold, incredible computer
vision accuracy and voice synthesis, some arrogance was warranted!</p>
<p><a href="https://www.moderndescartes.com/essays/my_ml_path/">At the
start of 2017</a>, I had just wrapped up a nine month sabbatical where I
taught myself machine learning, and started working as an ML engineer at
Verily. I had a great time with my Noogler project, which was to
investigate the potential utility and data readiness of an electronic
medical record dataset. The answer was a pretty resounding “not ready”,
as documented in “<a
href="https://www.moderndescartes.com/essays/deep_learning_emr/">Deep
learning on EMRs is doomed to fail</a>”. Still, this work was exactly
what I’d hoped to do in the big leagues, and I was excited for my next
project.</p>
<p>Unfortunately, the next project didn’t live up to my expectations. We
built some data pipelines that were already <a
href="https://www.moderndescartes.com/essays/sql_join/">tech-debt
laden</a> on delivery, and then it turned out that what we’d built was a
complete mismatch to what the business actually needed, which was two
business analysts and a SWE. Instead, our team of 10 SWEs/ML engineers
built a janky pipeline that needed 4 SWEs just to keep afloat.</p>
<p>Frustrated with this project, I started putting my energy into
Minigo, an ambitious attempt to replicate AlphaGo Zero. (Spoiler alert:
<a href="https://openreview.net/forum?id=H1eerhIpLV">we succeeded</a>.)
It was technically a 20% project, but was effectively a second 80%
project. It was fun work, though, which is why I didn’t mind. On the
strength of that work, I was able to transfer into Google Brain in 2018
to work on <a
href="https://www.tensorflow.org/guide/function#autograph_transformations">TensorFlow
2’s AutoGraph</a> feature.</p>
<p>AutoGraph is a Python-to-Python compiler that transforms imperative
Python control flow into the equivalent functional tf.cond and
tf.while_loop calls, making it easier to express complex control flow -
things like <a
href="https://deepmind.google/discover/blog/wavenet-a-generative-model-for-raw-audio/">WaveNet</a>,
beam search, and meta optimizers. It got me acquainted with compilers
(which was a big gap in my CS knowledge), and taught me how to unleash
Google’s systems to automatically run millions of lines of internal
TensorFlow code as a test input to our compilers. It was the perfect
demonstration of how Google allowed research engineers to work at the
intersection of powerful engineering infrastructure and cutting-edge ML
research.</p>
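<p>To make the transformation concrete, here is a minimal sketch (my own
illustration, not from the TensorFlow docs) of the kind of code AutoGraph
handles: an imperative Python <code>if</code> on a Tensor becomes a
<code>tf.cond</code> in the traced graph.</p>
<pre><code>import tensorflow as tf

@tf.function
def clip(x):
    # AutoGraph rewrites this Python `if` into a tf.cond when x is a
    # Tensor, so the branch lives inside the graph rather than being
    # fixed at trace time.
    if x > 10:
        x = tf.constant(10.0)
    return x

print(clip(tf.constant(25.0)))  # tf.Tensor(10.0, shape=(), dtype=float32)</code></pre>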
<p>After TF2 launched, I transitioned to working on graph neural
networks for organic chemistry and, more specifically, olfaction. As an
ex-chemist, I’d never imagined that the field I’d abandoned in 2011
would be relevant again. Our team got some <a
href="https://arxiv.org/abs/1910.10685">strong early results</a>, and
started a variety of collaborations to find experimental confirmation of
our results.</p>
<h2 id="complacency-shattered.">Complacency, shattered.</h2>
<p>2021 was an inflection point for me. I’d been promoted to tech lead
of my team, and for the first time since dropping out of grad school, I
finally stopped feeling like I was “behind” some of my peers who had
taken the straight path through a CS degree and into
Google/Facebook/Dropbox right out of undergrad. (I also happened to know
Greg Brockman from <a
href="https://www.acs.org/education/students/highschool/olympiad/about/participants/past-teams.html#:~:text=High%20School%2C%20NJ-,Gregory%20Brockman,-%2C%20(silver)%2C%20Red%20River">high
school summer camp</a>, which really didn’t help my imposter syndrome…)
I was now the one fielding questions from my friends in academia about
switching to industry, instead of the other way around. I had their
dream job - academic freedom to work on the forefront of science, with a
big tech salary.</p>
<p>I got complacent. I thought the environment at Google Brain was
intellectually stimulating enough to continue my personal growth, but
motivation, for me, comes from within. I had a frank conversation with
my manager and skip manager: I told them that the project I’d been
assigned was somebody else’s L5 promo project, but it wasn’t an L6
project. I didn’t feel like I was being stretched or being given room to
grow. They gave me some vague statements about how what I was doing was
the most impactful thing I could be doing for Google right now, but they
didn’t tell me I was wrong.</p>
<p>After that conversation, I felt my motivation drop; an easy project
that should have taken 6 months to finish dragged on to take 18 months
instead. In retrospect, while I’d diagnosed the problem correctly, my
demands were laughably naïve. L6 projects aren’t given to you; you
create them out of thin air. Your ability to envision and materialize a
new future is specifically what it means to be L6.</p>
<p>Unfortunately for me, envisioning a new future would be equivalent to
declaring mutiny on my manager, who was very protective of his
intellectual ownership of our team’s mission. I should have realized
this when I wrote a precursor to <a
href="https://www.moderndescartes.com/essays/rgb_odor/#in-search-of-the-rgb-of-odor">this
manifesto on the science of smell</a> in 2019 and received a deflection
rather than enthusiastic support from my manager. There was simply no
room on my team to grow, and it would have been better for myself and
for my team to have somebody who was trying harder.</p>
<p>Eventually, external events forced my hand. In early 2022, out of the
blue, our team was pulled into a conference room and told,
“Congratulations! You guys are now a startup!” Our manager, unhappy
with the size of our team, had petitioned Google leadership to grow
headcount by an order of magnitude, and the resolution was that Google
would not fund our expansion. Undeterred, he found external VC funding
and negotiated a spinout of our project as a new startup, and we were
finally informed as the ink dried on that deal.</p>
<p>From our perspective as Google employees, our project had effectively
been canceled, and our choices were to either find a new project, or
quit Google and follow our now ex-boss to Osmo. Our boss was now
offering us our old jobs back at “market rates for startup developers”.
“You get to put ‘founding engineer’ on your resumé - it’s just as good
as being an actual cofounder!”, he told us. I could tell I would not
succeed in negotiating a serious offer from him, and decided to look for
a new project at Google.</p>
<p>I spent the rest of 2022 asking pointed questions of myself and of
the L7+ PMs, VPs, and execs who had helped orchestrate the spinout. I
learned a great deal about how Google leadership made decisions: how
incentive structures aligned at each level of management, and how, if
you had the right idea to generate greater shareholder profit, you held
the moral high ground in escalating as far as you needed in order to
realize your vision. During this time I searched for new projects and
helped evaluate chemistry/ML project proposals in Climate. I tried
(unsuccessfully) to pitch my own project on refrigerant mixture design,
after seeing an unfilled niche. I would have kept on pitching, but
fatherhood called and I went on paternity leave.</p>
<p>During those sleepless nights with our newborn baby, I started
writing <a href="https://www.moderndescartes.com/essays/why_brain/">“Why
does Brain Exist”</a> and pondered my future. In the end, I never
resolved my project matching situation, as I got laid off shortly before
my paternity leave ended.</p>
<h2 id="post-layoff-blues">Post-layoff blues</h2>
<p>It took me a few months after the layoff to regain my footing. I was
still underwater from the first-time parent transition, and I was
uncertain about what direction I wanted to go next. As I talked with
many friends, I realized that a year after my boss’s betrayal, I was
still not quite sure what happened, why, or how I could protect myself
in the future. I refused to consider another big tech job, and I had an
irrational fear of anything LLM-related, given that it would bring me
closer to working with MBA-types.</p>
<p>I tried to understand things from my ex-boss’s perspective, talking
with other founders about how they viewed their ownership and relative
contributions from team members. I also read MBA/consultant literature.
I learned that on the stakeholder inform-consult-negotiate
classification scheme, I hadn’t even qualified for the “inform” bucket.
My conclusion was ultimately that Google execs and my management chain
did it because they could and because there was greater shareholder
value to be created. While I take some solace in knowing that my ex-boss
lost a year of momentum by having to rehire and re-gel his team, I
acknowledge that he probably found a replacement for me who’s willing to
work at market rates for startup developers and be happy with having
“founding engineer” on their resumé.</p>
<p>I resolved to at least learn enough about business to fall into the
“consult” bucket in the future.</p>
<p>I set up my LinkedIn for the first time, and learned that the vast
majority of my friends and friend-of-friend consultants were current or
former BCG. In the same way that I benefited from <a
href="https://www.moderndescartes.com/essays/readability/">Google’s deep
and rich internal culture</a> around best engineering practices, I
anticipated I might benefit from BCG’s similarly deep and rich internal
culture around best consulting practices. I practiced my case studies,
did some part-time consulting to get experience, and practiced my
presentation skills by rewriting my Brain essay in <a
href="https://en.wikipedia.org/wiki/MECE_principle">MECE format</a>. I
networked my way into meetings with sympathetic BCG partners who
referred me to their hiring directors. Unfortunately, I was stymied by
BCG’s hiring freeze. McKinsey and Bain were also in a similar
situation.</p>
<p>Around this time, an old friend reached out and convinced me to join
Motional as a systems engineer, working to collate, analyze, and present
information to executives on progress in capabilities and safety. I
figured that this would both tickle my technical itch as well as get me
the exec-facing experience I was looking for. I did enjoy getting up to
speed on the self-driving technology stack, but I was overwhelmed by the
crazy levels of bureaucracy I saw at Motional. I felt like there was no
way I could make a difference in the company’s success. I think that if
I’d stayed, I would have learned a great deal about negotiating with
unaligned parties, but I would also have gone mad! The “efficient
bureaucracy” theme running through some of my recent essays was my way
of working through the frustration I felt at Motional. So I quit.</p>
<p>Most recently, I’ve been having a great deal of fun at <a
href="https://lilacml.com/">Lilac</a>, a startup focused on text data
analysis and curation using LLMs. I have fully recovered from my malaise
- I’m quite actively involved in helping shape narrative/product/company
direction and don’t feel the ugh wall that previously stopped me from
getting near VCs.</p>
<h2 id="takeaways">Takeaways</h2>
<p>My biggest personal mistake was that I didn’t find a new mountain to
climb once I had summited the ML mountain. Ultimately, the reason I
stayed was that I was pinned by others’ expectations of my job: why
would anyone leave what was essentially a dream job? For example, I’d
considered and decided against going back to grad school, realizing that
I must be the only insane person who thought that particular side was
greener.</p>
<p>I felt far more alive in the four month period following the spinout
when I could focus on trying to pitch new projects in
Climate/ML/Chemistry.</p>
<p>I don’t think it was necessarily a mistake that I got screwed by my
ex-boss; that happens to everyone eventually. If it happens to you, my
advice is: don’t take it personally. Just remember what it feels like.
Don’t forget it. Learn what you can, and move on. And don’t try to
preempt it - once you start leaning into politics, it’s far too easy to
rely on it as a crutch to progress career-wise, instead of progressing
by honing your technical abilities.</p>
<h1>Bureaucratic Leverage</h1>
<p> Originally posted 2023-12-12</p>
<p> Tagged: <a href="/essays/tags/system_dynamics">system_dynamics</a>, <a href="/essays/tags/management">management</a>, <a href="/essays/tags/popular">popular</a></p>
<hr />
<p>Why do we hate bureaucracy?</p>
<p>Taken literally, a bureaucracy is just an organization tasked with
ensuring some outcome. In the public sector, OSHA ensures worker safety,
FDA ensures drug safety, EPA ensures environmental protection; in the
private sector, HR ensures legal compliance, IT ensures trade secrets
and data privacy, and so on. Yet even if people agree with the outcome,
they often disagree with the implementation. Bureaucracies have an
endless talent for finding wasteful and ineffective solutions.</p>
<p>Bureaucracies are ineffective due to a lack of accountability. If a
bureaucrat imposes a wasteful policy, what are the consequences? Well,
as long as they are achieving their desired outcome, they are doing
their job, regardless of the pain they inflict on others. They can wield
legal, technical, or financial penalties to force compliance. And
paradoxically, when bureaucrats fail to achieve their desired outcome,
they often get a bigger budget or a bigger stick to wield, rather than
being fired for incompetence. The inability to recognize failure goes
hand in hand with the inability to recognize success: competent and
ambitious people avoid working for bureaucracies because their efforts
go unrewarded. Bureaucracies end up staffed with middling managers, and
we have learned to hate them.</p>
<p>I don’t know how to solve this problem in the public sector, but I
think it’s solvable in the private sector, because there is
theoretically a CEO who is incentivized to maximize the overall
effectiveness of the company; they just need the right tactics. The
solution is simple: <strong>hold bureaucracy accountable by forcing them
to do the actual work</strong>. Let me explain.</p>
<h2 id="bureaucratic-leverage">Bureaucratic leverage</h2>
<p>Bureaucracies usually don’t do any work. This is true in two layered
senses:</p>
<ul>
<li>they don’t accomplish primary objectives; they are in the business
of ensuring secondary objectives.</li>
<li>they don’t do the work of accomplishing the secondary objectives
either; the work is usually pushed onto the same people accomplishing
the primary objectives.</li>
</ul>
<p>To give a concrete example: the FDA doesn’t research, develop, or
manufacture the drugs; pharmaceutical companies do. The FDA merely
ensures that the drugs are safe and effective. And in ensuring so, the
FDA doesn’t run the clinical trials; instead, the pharmaceutical
companies are responsible for running the trials at cost, and submitting
the paperwork to the FDA.</p>
<p><strong>Bureaucratic leverage</strong> is defined as the ratio of
work produced for <em>external entities</em> to do, relative to the
amount of work <em>directly done</em> by the bureaucracy. In this
example, the FDA’s 2023 human drugs budget was $2.3 billion <a
href="https://www.fda.gov/media/166182/download?attachment">(reference)</a>,
while the U.S. clinical trials market was $25 billion <a
href="https://www.biospace.com/article/releases/u-s-clinical-trials-industry-is-rising-rapidly-usd-35-1-bn-by-2030/">(reference)</a>.
To a first approximation, the FDA’s human drugs subdivision therefore
has a bureaucratic leverage ratio of 11x.</p>
<p>To give another example, GitHub in 2020 <a
href="https://github.com/github/renaming">changed the default git branch
name from <code>master</code> to <code>main</code></a>, a change
intended to promote greater inclusivity of historically and currently
enslaved peoples. I would estimate that roughly 3 person-months of
GitHub’s effort went into considering the impact and implementation
details of this change - a very generous and thoughtful investment into
inclusivity. Yet, the change imposes a global cost that I would roughly
estimate at ~1 million affected developers * 15 minutes per developer ≈
~1,500 person-months of effort, for an approximate bureaucratic leverage
ratio of 500x.</p>
<p>A high bureaucratic leverage ratio is not intrinsically a bad thing.
However, scope insensitivity is a real problem: when a bureaucrat wields
100x leverage, it is a heavy responsibility that is easily
underestimated. There are situations I’ve seen at Google where every
hour of downtime costs the company millions of dollars - and a crack
team of site reliability engineers whose combined hourly wages are tens
of thousands of dollars are desperately working to get it back up. That
is the level of urgency that a 100x leverage ratio <em>should</em>
demand. Does the typical bureaucrat with a 100x leverage ratio behave
with that level of urgency? Absolutely not.</p>
<h2 id="creating-bureaucratic-accountability">Creating bureaucratic
accountability</h2>
<p>“Force bureaucracies to do the work” now takes on a more precise
definition: we should hold bureaucracies to a 1x bureaucratic leverage
ratio.</p>
<p>The rationale is simple: it is globally efficient for a bureaucracy
to spend 1 unit of time, if it will reduce everyone else’s workload by
more than 1 unit of time. At this breakeven point, the bureaucracy will
have done roughly 50% of the total work. This rule is not meant to be
taken too literally, since these quantities are difficult to measure
precisely.</p>
<p>From the bureaucrat’s point of view, this means that they have two
budgets to manage: their internal budget, and the external budget for
asking other organizations to do something. Bureaucrats will be
incentivized to reexamine and optimize their external demands. From
everyone else’s perspective, they can be assured that what they’re asked
to do has been priority-sorted - or if it hasn’t been, they can at least
be assured that there’s a limited amount of it they’ll be asked to
do.</p>
<p>The truth is, this external budget has always existed - implicitly -
in the form of compliance. Consider a badly run IT/security department.
They run third-party security scanners on the company’s servers, and
file hundreds of low-value automated tickets with other teams to fix.
They require frequent password changes and relogins. They ban
installation of all non-approved apps and drag their heels on approving
new apps. A few days of “work” can easily generate years of lost
productivity for product teams, if their demands are taken at face
value.</p>
<p>In practice, people have limited tolerance for bullshit; if you flood
their bug tracker with automated security reports, they’ll just bookmark
a custom search page that filters out security reports. If you require
frequent password changes, they’ll use a formulaic password or keep a
password post-it on their monitor. If you drag your heels on approving
apps, they’ll upload the company data to a webapp or run it off a USB
stick. The oft-cited “bullshit umbrella” role of managers is essentially
a rate-limiter on bureaucracy.</p>
<p>On the positive side, centralization of work creates economies of
scale - a topic <a
href="/essays/codemates/#you-get-a-papercut-you-get-a-papercut-everybody-gets-a-papercut">I’ve
previously discussed in the context of code quality</a>. A bureaucrat
forced to grapple with personally doing a lot of repetitive paperwork
will very quickly decide that some paperwork was never necessary and
will invest in solutions to autofill fields where possible.</p>
<h2 id="embedded-bureaucrats">Embedded bureaucrats</h2>
<p>In a past life as a bureaucrat, my manager asked me to spend my first
three months doing a rotation with a partner team. On
paper it looked like he was just donating his headcount to other teams,
but I was amazed at how many secondary benefits came out of this
rotation program.</p>
<ul>
<li>It made us insiders: by working alongside the partner team, we
became friends and our requests were readily accepted by the partner
team.</li>
<li>We empathized with our partners: since we knew what burden our
requests would create, we could try to avoid wasted effort and respect
our partners’ time.</li>
<li>It made us credible: our partners could see that we were competent
and that they should believe us if we said something was necessary.</li>
<li>It was an advance payment: by giving free resources to the partner
team, we could later ask for at least that much without questions
asked.</li>
</ul>
<p>This rotation arrangement seems like a no-brainer to me, at least
within the company context. Something I don’t understand is why this is
frowned upon as a “revolving door” of corruption in the public sector,
when it is so plainly beneficial in the private sector.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Creating bureaucratic accountability begins with eliminating excuses,
which eventually lets us measure performance, punish poor performance,
and reward excellence. The usual excuse is noncompliance - “we made bad
demands and people ignored/worked around us. If only we had a bigger
stick to enforce compliance, we would have been able to accomplish our
goals.” A bureaucracy with a limited external budget means that people
who find ways to do more with less will be rewarded - the right
alignment of incentives.</p>
<h1>How to be a Good Codemate</h1>
<p> Originally posted 2023-10-11</p>
<p> Tagged: <a href="/essays/tags/software_engineering">software_engineering</a></p>
<hr />
<p>Every year, millions of college first years become roommates with
people they’ve never met before. There are no parents to set rules, and
that one roommate who keeps leaving their stuff everywhere doesn’t
seem to understand that they’re making everybody’s life harder. Sometimes,
one person does the thankless work of cleaning up after their roommates.
Arguments erupt over allocation of duties, and the takeaway lesson is
inevitably, “get your own place as soon as you can afford it”.</p>
<p>A similar situation plays out in tech companies as recent graduates
become codemates with engineers they’ve never met before. Normally there
are “parents” (senior engineers) to supervise, but this isn’t always the
case. Maybe the company hired too many junior engineers; maybe the
senior engineers are overloaded; maybe the senior engineers can’t or
won’t mentor. For junior engineers in these situations, here is a crash
course on <strong>How To Be a Good Codemate</strong>.</p>
<h2 id="why-you-should-care">Why you should care</h2>
<p>You should care if any of these happen regularly:</p>
<ul>
<li>The commands you were running yesterday don’t run today, and you
find yourself double-checking the main branch to see if that’s broken,
too.</li>
<li>You spend half your work day trying to diagnose a mysterious bug,
only to find out that the fix involves updating a dependency or
executing some other manual environment-altering step that you would
have never figured out on your own.</li>
<li>Your effort is spent in equal parts writing the actual code, and
packaging the code in a way that satisfies all of the automated linters,
tests, links to ticketing systems, and other blocking requirements.</li>
<li>You encounter significant merging/rebasing costs when trying to
merge your work into a quickly-moving main branch.</li>
<li>Your continuous integration pipeline routinely flakes, and the
recommended workaround is simply “try running it again”. People insist
on carefully reviewing every change, because they don’t trust that CI
will catch bugs.</li>
</ul>
<h2 id="how-to-be-a-good-codemate">How to be a good codemate</h2>
<p>The common theme here is that <em>none of the above problems have to
do with coding ability or code quality</em>. They have, instead, everything
to do with setting expectations on how you communicate and collaborate
within a shared codebase. You may additionally run into frustrations
directly related to bad code, but today I’d like to focus on non-coding
frustrations.</p>
<h3 id="tell-your-teammates-what-youre-working-on">Tell your teammates
what you’re working on</h3>
<blockquote>
<p>The commands you were running yesterday don’t run today, and you find
yourself double-checking the main branch to see if that’s broken,
too.</p>
</blockquote>
<p>When one person’s code change breaks another person’s workflow, it
isn’t necessarily the fault of the person who submitted the change. If
they didn’t know how to test your workflow, then they can’t possibly be
expected <em>not</em> to break you. So you have to tell them how to test
your workflow - ideally as a unit test, integration test, or in its
simplest form, a bash script or other command line invocation.</p>
<p>Assuming you have a continuous integration system configured, you can
even hook this script into CI as an integration test; a sketch follows
the list below. (Only do this if your script runs in less than, say, 30
seconds, perhaps by taking advantage of a flag like
<code>--data_fraction=0.001</code>. Long CI runtimes are an expensive
tax on development - avoid if at all possible.) Adding your scripts to
CI comes with two main benefits:</p>
<ul>
<li>your codemates can’t accidentally break you - they will be stopped
by CI!</li>
<li>in fact, your codemates will fix your script for you, by updating
your code, flags or whatever else the fix may involve. This is globally
optimal, as the person making the breaking change usually knows better
how to fix the breakage. (If this is not true, then it’s only fair for
them to ask you for help in fixing your script.)</li>
</ul>
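<p>As promised above, here is a sketch of such a smoke test (the
<code>train.py</code> entry point and its flags are hypothetical
stand-ins for your own workflow):</p>
<pre><code>import subprocess

def test_training_smoke():
    # Run the real entry point on a tiny slice of data so the test
    # finishes in seconds; any change that breaks the flags, imports,
    # or wiring now fails CI instead of surprising a teammate.
    result = subprocess.run(
        ["python", "train.py", "--data_fraction=0.001"],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0, result.stderr</code></pre>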
<p>As the number of engineers collaborating in a codebase goes up, the
frequency of inadvertent breakage events goes up quadratically. The <a
href="https://abseil.io/resources/swe-book/html/ch11.html#the_beyonceacutesemicolon_rule">Beyonce
Rule</a>, as it is called at Google, is simply, “If you liked it, you
shoulda put a test on it.”, and is the only way to scalably inform your
team how not to break your code.</p>
<h3 id="some-user-assembly-required">Some User Assembly Required</h3>
<blockquote>
<p>You spend half your work day trying to diagnose a mysterious bug,
only to find out that the fix involves updating a dependency or
executing some other manual environment-altering step that you would
have never figured out on your own.</p>
</blockquote>
<p>Occasionally, a commit will require manual action for continued
correctness. For example, an updated dependency might require everyone
to run <code>pip install --upgrade some_library==newer.version</code>.
Or perhaps some AWS account permissions or buckets got changed and
everyone needs to update their .aws config file.</p>
<p>Changes requiring manual steps <strong>need to be announced
publicly</strong>. There is nothing sillier than having multiple people
independently debug a weird error for 1-2 hours before they all
simultaneously arrive in the team chat and ask, “Is anyone else seeing
this error?”, only for the offending committer to say, “Oh yeah, you
need to run XXX”. 5 minutes of writing up an announcement can save hours
of wasted time.</p>
<p>The real pro move is to use tools that transparently and
automatically install and use the currently checked-in configuration, to
eliminate this entire class of bugs.</p>
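<p>A low-tech sketch of that idea (the file names are illustrative, and
dedicated lockfile-aware tools do this better): hash the dependency
manifest at startup and fail fast on a stale environment.</p>
<pre><code>import hashlib
from pathlib import Path

REQUIREMENTS = Path("requirements.txt")
STAMP = Path(".requirements.sha256")  # written by your install script

def check_environment():
    # One actionable error beats a dozen people independently debugging
    # mysterious version-skew failures.
    current = hashlib.sha256(REQUIREMENTS.read_bytes()).hexdigest()
    if not STAMP.exists() or STAMP.read_text().strip() != current:
        raise RuntimeError(
            "Dependencies changed since your last install; "
            "run 'pip install -r requirements.txt' first."
        )</code></pre>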
<h3
id="you-get-a-papercut-you-get-a-papercut-everybody-gets-a-papercut">You
get a papercut, you get a papercut, everybody gets a papercut!</h3>
<blockquote>
<p>Your effort is spent in equal parts writing the actual code, and
packaging the code in a way that satisfies all of the automated linters,
test coverage requirements, links to tickets, and other blocking
requirements.</p>
</blockquote>
<p>Scrum masters aside, engineers are a frequent source of their own
bureaucratic slowdowns. Here’s how that might happen. Let’s say you want
to enable a new linter rule, which will cause a hundred new lint errors
to start appearing throughout the codebase. Instead of doing the boring
work of fixing all hundred errors concomitantly with the linter
configuration change, a tempting option is to use the “hold-the-line”
feature of some linters, allowing the new linter rule to go through, but
only enforcing the lint errors once somebody (else) touches the
offending code. This is a terrible idea.</p>
<p>What could have been 30 minutes of one engineer’s time, now becomes
something like 10 to 100 engineers * 5 minutes of time (due to context
switching costs) - a tremendous waste of time. Centralizing the work has
three main benefits: it is globally efficient for one engineer to do it,
it creates economies of scale (maybe you figure out a clever regex to
fix it all at once), and it puts the burden of proof on the right person
- if you don’t think the changes are worth your personal effort, then
why would you distribute that burden onto everyone else?</p>
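<p>The “clever regex” version of centralizing the work is often a
throwaway codemod; the particular fix below (retiring the deprecated
<code>assertEquals</code> alias) is just an illustration:</p>
<pre><code>import re
from pathlib import Path

# Rewrite every offending file in one commit, submitted together with
# the new lint rule.
for path in Path("src").rglob("*.py"):
    text = path.read_text()
    fixed = re.sub(r"\bassertEquals\(", "assertEqual(", text)
    if fixed != text:
        path.write_text(fixed)</code></pre>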
<p>I’ve been pleased with the industry transition from linters that nag
you about formatting issues, to formatters that automatically fix those
formatting issues. The latter requires more up front investment but
saves time in the long run. More engineers should try to adopt this
mindset.</p>
<h3 id="sorting-the-bookshelf-by-color">Sorting the Bookshelf by
Color</h3>
<blockquote>
<p>You encounter significant merging/rebasing costs when trying to merge
your work into a quickly-moving main branch.</p>
</blockquote>
<p>This one is a fundamentally hard problem - with <span
class="math inline">\(N\)</span> engineers working closely together,
there are <span class="math inline">\(O(N^2)\)</span> opportunities to
step on each others’ toes. One common toe-stepping maneuver is
refactoring - renaming modules, renaming variables/classes, moving
attributes/functions/classes around, regrouping code, or even fixing
whitespace. Because refactoring touches a small number of lines of code
across many files, merge conflicts are inevitable.</p>
<p>Refactoring the codebase has benefits: it compresses the mental map
needed to understand how the codebase works. However, it also has costs:
people have to relearn their mental map. A needless refactor is like
sorting a bookshelf by color - unnecessary, annoying, and
productivity-destroying. So the first rule of refactoring is Don’t
Refactor. Try to get your formatting and naming right the first time.
(Related discussion: <a href="/essays/noutils">better naming for
utils</a>)</p>
<p>The second rule of refactoring is: don’t mix refactors with feature
changes. Refactoring changes are 10-100x easier to review than normal
feature-adding changes. This is also true for every engineer who must
resolve merge conflicts by applying the refactoring rule to their own
code. If you mix refactors with feature changes, what happens is that
the fast-path to understanding and applying the changes is no longer a
valid shortcut! This is aggravating to everyone involved.</p>
<p>At larger scales, managing refactors requires a new set of tools and
approaches; search for “Rosie” in the <a
href="https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext#FNF">Google
monorepo whitepaper</a> for a sketch of the complexities involved.</p>
<h3 id="a-leap-of-faith">A Leap of Faith</h3>
<blockquote>
<p>Your continuous integration pipeline routinely flakes, and the
recommended workaround is simply “try running it again”. People insist
on carefully reviewing every change, because they don’t trust that CI
will catch bugs.</p>
</blockquote>
<p>CI needs active maintenance. Flaky tests build up, causing CI to have
to rerun multiple times before passing; the number of untested workflows
(and accidental breakage) increases; CI runtime only ever seems to
increase. Eventually, what happens is that people stop trusting their
CI. People are desensitized to the frequent “CI Is Broken at HEAD!”
automated pings. Code review slows to a grind; since CI cannot be
trusted, the most senior engineers become reviewing bottlenecks as they
are the only ones who can anticipate whether a change is safe to merge.
Every change needs a customized set of manual tests to demonstrate
correctness.</p>
<p>I prefer the “leap of faith” strategy: just pretend that your CI can
be trusted, even if you don’t think it should be! Then, when you
inevitably merge broken code - figure out what sort of test would have
caught your mistake, and add it! You may think that breaking the main
branch is expensive, but it is equally expensive to suffer through a
worry-laden code review process. Having a quick rollback procedure is a
good way to mitigate accidental breakage, and further enables this
strategy.</p>
<h2 id="conclusion">Conclusion</h2>
<p>As teams scale in size, coordination headwinds and Conway’s law are
inevitable. Ultimately, the solution is to embrace Conway’s law, and
shard the codebase along organizational lines, to reduce the <span
class="math inline">\(O(N^2)\)</span> cost of coordinating many peoples’
work. Still, a closely knit working group is more powerful than smaller
independent groups, if they can manage not to step on each others’ toes.
By identifying and solving these coordination issues, individual teams
can forestall the inevitable Conway sharding.</p>
<h1>Simplifying Fluffy Constructors in Unit Tests</h1>
<p> Originally posted 2023-09-23</p>
<p> Tagged: <a href="/essays/tags/software_engineering">software_engineering</a></p>
<hr />
<p>The archetypal unit test looks like this:</p>
<pre><code>arg1 = ...
arg2 = ...
expected_output = ...
actual_output = function_to_test(arg1, arg2)
assertEqual(expected_output, actual_output)</code></pre>
<p>A very common problem is that, over time, objects accumulate fields
and subobjects, until it takes significant effort just to construct an
object. Constructing <code>arg1</code>, <code>arg2</code>, and
<code>expected_output</code> can take hundreds of lines, while the
function call and the assertion are just two lines. These tests are like
cotton candy: a tremendous amount of fluff with a tiny core. Well, at
least cotton candy is tasty. This fluff is tedious to write, tedious to
review, and tedious to scroll through, which leads to less unit testing
than is optimal. It’s like chatting with that overly friendly downstairs
neighbor who takes thirty minutes to tell you that the condo insurance
is up for renewal.</p>
<p>The most common coping mechanism for fluffy constructors is the
singleton: one example object that feeds into every test. Often, this
singleton ends up in the setUp() method shared by all tests. The many
fields of the shared singleton are pinned by various different unit
tests’ assertions, and gradually it becomes impossible to either
customize the object, or to add new unit tests. When the test class
reaches this point, the process starts all over with a new freshly made
singleton object and a new test class. This seems a little bit silly.
But how can we do better?</p>
<h2 id="factory-methods-hide-fluff">Factory methods hide fluff</h2>
<p>The first step towards simplifying fluffy tests is to decide which
details are relevant.</p>
<p>Take this test:</p>
<pre><code>car1 = Vehicle(
mass_kg=2000,
location=Location(x_m=0, y_m=0),
velocity=Velocity(x_m_s=4, y_m_s=3),
heading=math.atan2(3, 4),
width_m=1.8,
length_m=4.0,
emergency_vehicle=False,
)
car2 = Vehicle(
mass_kg=2000,
location=Location(x_m=4, y_m=-2),
velocity=Velocity(x_m_s=0, y_m_s=5),
heading=math.atan2(5, 0),
width_m=1.8,
length_m=4.0,
emergency_vehicle=False,
)
self.assertEqual(car1.speed_m_s, 5)
self.assertEqual(car2.speed_m_s, 5)
self.assertTrue(willCollideWithin5sec(car1, car2))</code></pre>
<p>Many of these fields are irrelevant, so we may as well hide them
behind a factory method that sets sensible defaults.</p>
<pre><code>car1 = make_suv(
location=Location(x_m=0, y_m=0),
velocity=Velocity(x_m_s=4, y_m_s=3),
)
car2 = make_suv(
location=Location(x_m=4, y_m=-2),
velocity=Velocity(x_m_s=0, y_m_s=5),
)
self.assertEqual(car1.speed_m_s, 5)
self.assertEqual(car2.speed_m_s, 5)
self.assertTrue(willCollideWithin5sec(car1, car2))</code></pre>
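<p>A sketch of what <code>make_suv</code> might look like (the defaults
are illustrative):</p>
<pre><code>def make_suv(**overrides):
    """Builds a default SUV; tests override only the fields they care about."""
    defaults = dict(
        mass_kg=2000,
        location=Location(x_m=0, y_m=0),
        velocity=Velocity(x_m_s=0, y_m_s=0),
        heading=0.0,
        width_m=1.8,
        length_m=4.0,
        emergency_vehicle=False,
    )
    defaults.update(overrides)
    return Vehicle(**defaults)</code></pre>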
<p>You might object that factory methods just hide the fluff. It’s true
that if you only have one unit test, this new solution is the same
number of lines of code. But as the marginal cost of testing drops,
you’ll get more tests. It’s also easier to manually verify that the unit
test is correct.</p>
<h2 id="dsls-hide-syntactic-fluff">DSLs hide syntactic fluff</h2>
<p>In certain cases, the fluff is due to language syntax itself! You
might think it isn’t possible to eliminate this type of fluff, but
writing your own DSL is a powerful technique to do just that.</p>
<p>Which would you rather see?</p>
<pre><code>go_board = np.array([
[go.KO,1,1,0,0,0,0,0,0],
[1,-1,0,0,0,0,0,0,0],
[-1,0,-1,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0],
[0,0,0,0,0,0,0,0,0],
])</code></pre>
<p>or</p>
<pre><code>go_board = parse_board('''
*XX......
XO.......
O.O......
.........
.........
.........
.........
.........
.........
''')</code></pre>
<p>The latter contains far less visual noise, with half the total
characters. It features a sensible null character, flexible whitespace
for convenient embedding of inline data, and monospaced content.
Definitely easier to read and write.</p>
<p>In many cases, you can reuse existing DSLs instead of having to
create your own. This lets you even skip writing the parser - you can
use a library for that. Instead of manually constructing a Pandas
dataframe, why not just embed and parse a .csv? Instead of manually
constructing a giant config object, why not parse YAML? Instead of
manually constructing nodes and an adjacency graph, why not parse <a
href="https://en.wikipedia.org/wiki/DOT_(graph_description_language)">DOT</a>?
Possibly the most obscure DSL I’ve ever written is for <a
href="https://github.com/open-reaction-database/ord-schema/blob/9c9e852d5e1b5680d6545eafeca1ccf47d87b641/ord_schema/macros/workups.py">organic
chemistry reaction workups</a>!</p>
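<p>As one example of borrowing a DSL, a hand-built dataframe can become
an embedded CSV literal (the columns here are made up):</p>
<pre><code>import io

import pandas as pd

# Let pandas' battle-tested CSV parser do the work instead of
# hand-assembling columns.
trades = pd.read_csv(io.StringIO("""\
symbol,price,qty
AAPL,190.5,10
GOOG,140.2,5
"""))</code></pre>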
<h2 id="conclusion">Conclusion</h2>
<p>To defluff is to be human. We describe weather as sunny, cloudy, or
rainy without having to specify temperature, humidity, cloud cover, or
wind conditions. If a concept has been around for more than a decade,
chances are, a very compact DSL already exists for it, and you won’t
have to invent a new one.</p>
<p>Fluffy unit tests are annoying to read and write, but more than that,
they discourage writing more unit tests. By investing in methods to
defluff object constructors, it becomes a lot easier to write
comprehensive unit test suites, and the unit tests become far easier to
manually verify. As a bonus, factory methods and DSLs often end up quite
useful outside of writing test cases, too - they make it easier to write
tutorial notebooks or to construct ad-hoc objects during a debugging
session.</p>
<h1>Optimal Bureaucracy</h1>
<p> Originally posted 2023-09-12</p>
<p> Tagged: <a href="/essays/tags/math">math</a>, <a href="/essays/tags/software_engineering">software_engineering</a>, <a href="/essays/tags/system_dynamics">system_dynamics</a></p>
<hr />
<p>This past week, I ended up annoyed with continuous integration
systems. In short: the CI consisted of three stages in the following
order: basic linters/formatters (10 minutes, 90% pass rate), unit tests
(25 minutes, 95% pass rate), and SonarQube (5 minutes, 95% pass rate).
You would think linters/unit tests should have 100% pass rate, but hey,
developers are lazy and they push minor edits and hit the merge button,
without bothering to run the tests locally first. I happened to run into
a SonarQube issue (and it wasn’t locally reproducible), so I had to wait
for this long pipeline to iterate on the fix. This was frustrating and I
had the vague intuition that SonarQube <em>shouldn’t be at the end of
the CI pipeline</em>. But was that actually true, or was I just unlucky
that it was the last step that was failing?</p>
<p>It turns out that this type of question occurs in many different
settings.</p>
<ul>
<li>In drug development, drugs go through multiple optimization stages -
first, the molecular structure is optimized for drug activity; then it
is tweaked for <a href="https://en.wikipedia.org/wiki/ADME">ADME
properties</a>; then it’s tested in animals for toxicity; then it goes
through a series of clinical trials. Is this the right ordering?</li>
<li>At Google, project launches had to go through multiple reviews:
security review, legal, regulatory, SRE, product, (and probably more I’m
forgetting). One typical failure mode of this process is that the
engineering is done up front and then engineers get annoyed and push
back when a reviewer tells them that their product is fundamentally
illegal/insecure/doesn’t fit with the product portfolio. Should the
engineers have instead worked to satisfy some subset of the reviewers
before even starting on their work?</li>
<li>You’re trying to figure out where a group of family/friends will get
together for a big reunion. Everybody has their own constraints and
preferences for where/when/how this should happen. In what order should
you check your suggested plan with everyone?</li>
<li>Any <a
href="https://en.wikipedia.org/wiki/Waterfall_model">waterfall</a>
project has to choose the order in which they address requirements
and/or stakeholders.</li>
</ul>
<p>Let’s formalize the CI question to a more general setting and solve
that problem.</p>
<h2 id="the-math">The Math</h2>
<p>Here’s how you might formalize this problem.</p>
<p>You have <span class="math inline">\(n\)</span> requirements, each
with some cost <span class="math inline">\(C_i\)</span> and probability
of success <span class="math inline">\(P_i\)</span>. The probabilities
are all independent. You must complete all requirements in sequence; if
any requirement fails you must start over from scratch. What ordering of
requirements will minimize the total expected cost of the process?</p>
<h3 id="solution">Solution</h3>
<p>Let <span class="math inline">\(E_i\)</span> denote the expected cost
of completing steps <span class="math inline">\(1\)</span> through <span
class="math inline">\(i\)</span>; then</p>
<p><span class="math display">\[E_i = \frac{E_{i-1} + C_i}{P_i}\]</span>
<span class="math display">\[E_n = \frac{C_1}{P_1P_2P_3\ldots P_n} +
\frac{C_2}{P_2P_3\ldots P_n} + \frac{C_3}{P_3\ldots P_n} + \ldots +
\frac{C_n}{P_n}\]</span></p>
<p>We could brute force over all <span class="math inline">\(n!\)</span>
possible orderings and pick the one with lowest cost - but hey, if this
were the best solution available, I wouldn’t be writing about it :D</p>
<p>What would a more elegant solution look like? Some wishful thinking
says that if we found some function <span class="math inline">\(F(C_i,
P_i)\)</span>, and sorted the tasks by this function, then we could
achieve <span class="math inline">\(O(n \log n)\)</span>.</p>
<p>If we try solving the case with <span
class="math inline">\(n=2\)</span>, we’ll find that we end up with a
candidate function <span class="math inline">\(F =
\frac{C_i}{1-P_i}\)</span>. (I use <span
class="math inline">\(\stackrel{?}{<}\)</span> to denote unknown
ordering)</p>
<p><span class="math display">\[
\begin{align*}
\frac{C_1}{P_1P_2} + \frac{C_2}{P_2} \stackrel{?}{<}&
\frac{C_2}{P_1P_2} + \frac{C_1}{P_1} \\
C_1\frac{1-P_2}{P_1P_2} \stackrel{?}{<}& C_2\frac{1-P_1}{P_1P_2}
\\
\frac{C_1}{1 - P_1} \stackrel{?}{<}& \frac{C_2}{1-P_2}
\end{align*}
\]</span></p>
<p>The implication is that the ordering <span class="math inline">\(1,
2\)</span> is more efficient than <span class="math inline">\(2,
1\)</span> if and only if <span class="math inline">\(F(1) <
F(2)\)</span>.</p>
<p>Does sorting by <span class="math inline">\(F\)</span> yield an
optimal solution? Yes. Consider any two adjacent steps <span
class="math inline">\(i\)</span> and <span
class="math inline">\(i+1\)</span>. The cost up through <span
class="math inline">\(i-1\)</span> does not depend on <span
class="math inline">\(i, i+1\)</span>, and the specific ordering of
<span class="math inline">\((i, i+1)\)</span> vs <span
class="math inline">\((i+1, i)\)</span> does not constrain steps <span
class="math inline">\(i+2...\)</span> in any way. So we are free to
optimize the ordering of <span class="math inline">\(i\)</span> and
<span class="math inline">\(i+1\)</span> without regard to the rest of
the sequence. The optimal ordering of <span
class="math inline">\(i\)</span> and <span
class="math inline">\(i+1\)</span> turns out to be identical to the
solved case of <span class="math inline">\(n=2\)</span> - sort by <span
class="math inline">\(F\)</span>! Any order inversions can be made more
efficient by flipping the two elements. Writ large, this implies that
you can bubble sort your list.</p>
<p>The conclusion: optimal ordering is accomplished by sorting
requirements by <span class="math inline">\(F(i) = \frac{C_i}{1 -
P_i}\)</span>.</p>
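<p>As a quick numerical sanity check (my own sketch, not part of the
proof), sorting by <span class="math inline">\(F\)</span> matches a
brute-force search over orderings:</p>
<pre><code>import itertools
import math
import random

def expected_cost(stages):
    # stages: list of (cost, p_success); on failure, restart from scratch.
    # Implements the recurrence E_i = (E_{i-1} + C_i) / P_i.
    e = 0.0
    for cost, p in stages:
        e = (e + cost) / p
    return e

random.seed(0)
stages = [(random.uniform(1, 30), random.uniform(0.5, 0.99))
          for _ in range(6)]
greedy = sorted(stages, key=lambda s: s[0] / (1 - s[1]))
best = min(itertools.permutations(stages), key=expected_cost)
assert math.isclose(expected_cost(greedy), expected_cost(best))</code></pre>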
<h2 id="optimizing-ci">Optimizing CI</h2>
<p>Recall that we had basic linters/formatters (10 minutes, 90% pass
rate), unit tests (25 minutes, 95% pass rate), and SonarQube (5 minutes,
95% pass rate). The function F for these three stages evaluates to 100,
500, and 100. So we can say that SonarQube should always happen before
unit tests, and is tied with the linters/formatters. However, in the
scenario where SonarQube is already failing, then the probability of
passing it with your fixes is something lower, perhaps 50-75%. In this
scenario, SonarQube clearly belongs at the start of the CI pipeline,
with a score of 10-20.</p>
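<p>In code, the scoring is one line per stage (numbers from the scenario
above):</p>
<pre><code># F = C / (1 - P); lower scores should run earlier.
stages = {"linters": (10, 0.90), "unit tests": (25, 0.95), "sonarqube": (5, 0.95)}
for name, (cost, p) in stages.items():
    print(name, round(cost / (1 - p)))  # linters 100, unit tests 500, sonarqube 100</code></pre>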
<p>In this toy example, I use time as a cost metric, but it can of
course encompass compute or SaaS costs as well.</p>
<p>I’d love to see CI platforms allow the flexibility to reorder stages
according to which one failed most recently, or to have more intelligent
ordering according to empirically observed failure rates and durations.
One could argue that SonarQube should be made available locally, but
there are many valid CI use cases that can’t be run locally - for
example TensorFlow’s CI used to run against {Windows, Linux} X {CPU,
GPU, TPU} targets, with corresponding hardware/OS maintenance
requirements. (This CI burden is the primary reason why TensorFlow <a
href="https://discuss.tensorflow.org/t/2-10-last-version-to-support-native-windows-gpu/12404">dropped
native Windows support</a>!) As it is, most CI configurations are done
by hand and don’t ever change.</p>
<h2 id="acknowledgments">Acknowledgments</h2>
<p>Thanks to <a
href="https://www.linkedin.com/in/jay-leeds-6a588919a/">Jay Leeds</a>
for providing a solution and alerting me that a close variation of this
problem was posed in <a
href="https://www.youtube.com/watch?v=c_ilfGOnBtE">2023’s ICPC NAC</a>.
(It seems the author of that problem was similarly frustrated with their
CI!)</p>
<h1>Better ways to name your utils module</h1>
<p> Originally posted 2023-09-06</p>
<p> Tagged: <a href="/essays/tags/software_engineering">software_engineering</a></p>
<hr />
<p>As the joke goes, the two hardest problems in computer science are 1)
cache invalidation and 2) naming. Like all good jokes, there’s a kernel
of truth in there. Naming is poetry; the perfect name has both precise
meaning and precise connotation. Naming is also mathematics; the
judicious choice of which concepts deserve a name can trivialize a
problem. In software engineering terms, good naming is equivalent to
good factoring of the problem domain into the right abstractions and
APIs.</p>
<p>A utils.py module is a failure of naming. Let’s talk about ways we
can improve the situation.</p>
<h2 id="utils.py-considered-harmful"><code>utils.py</code> Considered
Harmful</h2>
<p>see also: <code>helper.py</code>, <code>misc.py</code></p>
<p>First, a quick rundown: why exactly are utility modules considered
harmful?</p>
<p>At a philosophical level: code should make its intent clear. Nobody
would let a function name like <code>do_stuff()</code> pass code review.
So why tolerate an equally ambiguous module name?</p>
<p>At a practical level: utility modules tend to accumulate
dependencies, causing everything to depend on everything via the utils
bottleneck. It’s a great breeding ground for inadvertent circular
dependencies.</p>
<p>At a social level: the existence of one module named
<code>utils.py</code> implicitly grants permission to create more,
leading up to that doleful moment when you have to resolve a name
collision between two or more <code>utils</code> modules.</p>
<h3 id="not-considered-harmful-foo_utils.py">Not Considered Harmful:
foo_utils.py</h3>
<p>While the ideal codebase should not have any <code>utils.py</code> in
it, a pragmatic compromise is to categorize your utility code.
<code>foo_utils.py</code> demonstrates intention: it is about
<code>foo</code>, but more importantly, it is not about stuff that is
not <code>foo</code>. <code>utils/foo.py</code> is also okay.</p>
<h2 id="sorting-your-utilities">Sorting your utilities</h2>
<p>I’ve seen many flavors of utility code which could be easily sorted
into more appropriate categories. See if any of the following categories
match your code:</p>
<p><code>$PLATFORM_utils</code> - hacks, workarounds, and codified usage
patterns for a platform’s deficiencies and inconveniences. Retry/backoff
logic? Concurrency and consistency workarounds? Missing primitives?
Auth? Environment management?</p>
<p><code>testing_utils</code> - make the testing process easier
(randomness, parametrization, fuzzing, customized assertions, etc.). Do
not put test fixtures or mocks here! Those belong alongside the unit
tests that consume them. There are lots of great libraries out there for
making your tests better - <code>mock</code>, <code>parametrized</code>,
<code>hypothesis</code>, to name a few.</p>
<p><code>$DOMAIN_SPECIFIC_CONCEPT</code> - Your domain probably has some
domain-specific concepts that are not obvious to outsiders. Middleware?
Augmentation? Symmetries? If you’re relatively inexperienced in the
problem domain, you may not realize these concepts exist, and reinvent
them poorly in the utils module. Read other OSS codebases, papers, or
books to learn what these concepts are. A special-shoutout goes to
parsers and compilers, which people reinvent badly on a regular
basis.</p>
<p><code>base</code> - Foundational data types, definitions, and
concepts that are used pervasively throughout the codebase. This will
get imported everywhere, so keep a strict watch on its dependencies.</p>
<p><code>$SYSTEM_client</code>: When data generated by one system is
consumed by another, and their APIs don’t quite align, then some adaptor
code is needed to munge the data formats. If both systems are under your
control, you should figure out a better API. If one or both of those
systems is from a third party provider, then you have no choice but to
write adaptor code. As a company grows, it’s pragmatic for teams to
start treating each other as third parties, depending on org chart
distance.</p>
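<p>Adaptor code is usually mundane field-munging; a sketch, with all
field names hypothetical:</p>
<pre><code>from datetime import datetime, timezone

def upstream_event_to_internal(event: dict) -> dict:
    # The upstream system emits camelCase JSON with millisecond
    # timestamps; our code wants snake_case fields and datetime objects.
    return {
        "user_id": event["userId"],
        "created_at": datetime.fromtimestamp(
            event["createdAtMs"] / 1000, tz=timezone.utc),
    }</code></pre>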
<p><code>visualizations</code>: often used interactively and tends to
invoke libraries with unique GUI or system dependencies that don’t work
on CI or other headless deployments.</p>
<p>single-use code: Code that only has one caller. This often makes its
way into utility modules in an attempt to hide ugly code. You should
keep this code right next to its caller, or inline the code. Nobody is
being fooled by the indirection.</p>
<h2 id="codebase-maintainers-do-you-have-a-utils-problem">Codebase
maintainers: Do you have a utils problem?</h2>
<p>Try measuring the percentage of code that lives in “utils” modules.
You can accomplish this by running <code>cloc</code> on your codebase,
and then running
<code>find . -name "utils.py" | cloc --list-file=-</code> to get
util-specific metrics.</p>
<p>Broken down by percent of code in utils modules:</p>
<ul>
<li>0-2%: Healthy.</li>
<li>2-10%: Unhealthy. Share this essay with your team and discourage
further additions to utils.py</li>
<li>10+%: Morbid. Your codebase – or possibly your management – needs an
intervention of some sort.</li>
</ul>
<p>If you have a healthy amount of code in utils modules,
congratulations! I’d suggest writing up the <code>cloc</code> commands
as a script to monitor regression; a sketch of such a script follows.
Of course, beware Goodhart’s law, and don’t hold anyone to this
metric!</p>
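<p>A sketch of that script (assuming <code>cloc</code>’s
<code>--json</code> output and the same <code>find</code> pattern as
above):</p>
<pre><code>import json
import subprocess

def cloc_code_lines(args, stdin=None):
    out = subprocess.run(["cloc", "--json", *args], input=stdin,
                         capture_output=True, text=True, check=True).stdout
    return json.loads(out)["SUM"]["code"]

total = cloc_code_lines(["."])
utils_files = subprocess.run(["find", ".", "-name", "utils.py"],
                             capture_output=True, text=True, check=True).stdout
utils = cloc_code_lines(["--list-file=-"], stdin=utils_files)
print(f"{100 * utils / total:.1f}% of code lives in utils modules")</code></pre>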
<h1>Blogging FAQs</h1>
<p> Originally posted 2023-07-11</p>
<p> Tagged: <a href="/essays/tags/personal">personal</a></p>
<hr />
<h2 id="how-long-does-it-take-me-to-write-an-essay">How long does it
take me to write an essay?</h2>
<p>Definitely the most FAQ.</p>
<p>I spend anywhere from 5-100 hours spread over 1-12 calendar weeks.
“<a href="/essays/why_brain">Why did Google Brain exist?</a>” and “<a
href="/essays/data_oriented_python">Data Oriented Programming in
Python</a>” took on the longer side, and “<a
href="/essays/research_code">A Research Codebase Manifesto</a>” took on
the shorter side (but only because I’d already done so much thinking on
the topic). This FAQ took 2 hours in two sittings.</p>
<p>There is no clear separation between ideation/writing/editing; I
think by writing. I’ve deleted as much as 80% of my writing before
publication. Most of my essays are based on insights that I’ve already
vaguely understood from the work I’ve done over the years. The writing
time goes into isolating the insight, testing its limits, and finding
the clearest way to present it.</p>
<p>This is a lot of time to spend per essay. I think it just reflects my
personal expectations on how thoroughly I want to understand an idea
before moving on. My expectations have risen over time, probably more
quickly than my ability to meet my own expectations, leading to overall
fewer essays and more time spent per essay. Still, as long as I manage
to publish a few essays a year, I’m happy with my output. I wouldn’t
want to publish anything I didn’t feel proud of.</p>
<h2 id="why-do-you-write">Why do you write?</h2>
<p>There are three main reasons. First, I like thinking, and writing is
the best way to think critically and honestly. Second, I like teaching,
and writing is a great way to pass on wisdom. Third, it gets me street
cred. It’s like a resumé, but one that people actually enjoy reading. It
opens doors to conversations that wouldn’t otherwise happen.</p>
<h2 id="how-do-i-get-started-writing">How do I get started writing?</h2>
<p>Write. It’s really that simple.</p>
<p>It doesn’t matter what platform or tech you use. The set of people
who talk about the right way to start a blog will have stopped blogging
within a year on average, and it is precisely these people who seem to
be drawn to arguing about whether Medium or Substack is a better place
to blog. Ignore them. In ten years, people will still be reading my blog
and as they stumble on this FAQ, they will wonder what Substack was.</p>
<p>My only recommendation is to (eventually) own your content in a
platform-neutral format, so that you can migrate platforms at will. I
started on Blogger.com; moved to a personal domain (a django site hosted
by… Hostgator, IIRC?), then <a href="/essays/gcs_static">rewrote it as a
static site generator</a> hosted on Google Cloud Storage. The latest
iteration is a static site hosted on Firebase, for its SSL support. You
can see the <a
href="https://github.com/brilee/modern-descartes-v2">static site
generator and raw essay files on GitHub</a>.</p>
<h2 id="what-should-i-write-about">What should I write about?</h2>
<p>Write what you know and have experienced. The world is a better place
when you delete your hot take on a flamebait topic where you have no
particular expertise.</p>
<p>For many of us, this means job experience. As a result, you might
feel like you would be violating confidentiality or that you might be
offending your employer by writing about your work. Your gut instinct is
probably correct. So, instead of writing about company strategy, write
about the general strategic landscape that would enable someone to
understand the company strategy (if they had access to it). Instead of
complaining about your employer, think through how the situation might
be improved. It’s more positive-sum, lets you roleplay your boss’s job,
and helps you level up your soft skills. Finally, you can write on a
time delay. “<a href="/essays/deep_learning_emr">Deep Learning on EMRs
is Doomed to Fail</a>” was an essay I could have written in 2017, but at
that time it would have been nonobvious and definitely confidential. In
2022, it was something that everybody in the field already knew, and
having quit Verily in 2018, I was able to say it out loud, since there
was no chance it would be mistaken as an official company stance.</p>
<p>Not all of my writing is public. I’ve written well-reviewed
documentation at Google on TensorFlow and ML readability, and was
fortunate enough to have some of it published as <a
href="https://www.tensorflow.org/guide/function">official TensorFlow
guides</a>.</p>
<h2 id="how-do-i-get-better-at-writing">How do I get better at
writing?</h2>
<p>Write more. Be intentional about the editing process. Read your blog
posts out loud. Try to explain the idea to a friend. Ask a more
experienced writer to give you feedback. Decide who your target audience
is and decide what their level of background is. (I target a relatively
sophisticated audience because it’s honestly tedious to write too much
introductory material.)</p>
<p>My early essays are bad. I look back at them and think, “wow, I have
no idea what this guy is trying to say”. I do occasionally go back and
re-edit them, and I have even deleted some that I deemed unsalvageable.
It’s now been over ten years since I started writing, and my writing has
improved tremendously. Yours will too.</p>
<p>The good news is that at first, nobody will read your blog, so
there’s really nothing to be ashamed of by publishing your writing.</p>
<p>To get people to read your blog, share your work on Reddit, Facebook,
Linkedin, HN, wherever you can get eyeballs on your work. It’ll probably
disappear into the noise; sometimes you’ll get <a
href="https://www.reddit.com/r/programmingcirclejerk/comments/14re3ky/readability_googles_temple_to_engineering/">snarky
comments on how bad your writing is and how horrible of a person you
must be</a>. I used to get upset when I saw those comments. Nowadays, I
endorse the “u mad bro?” school of philosophy.</p>
<h2 id="have-more-questions">Have more questions?</h2>
<p>Email me.</p>
Readability: Google's Temple to Engineering Excellence
<p> Originally posted 2023-07-03</p>
<p> Tagged: <a href="/essays/tags/software_engineering">software_engineering</a>, <a href="/essays/tags/management">management</a>, <a href="/essays/tags/popular">popular</a></p>
<hr />
<p>When reflecting on my six years at Google, its readability process
stands out as unique within the tech landscape.</p>
<p>As a readability mentor, I’ve reviewed roughly 100,000 lines of
Python code at Google, written by hundreds of different authors. In
doing this, I am one of thousands at Google who collectively have
shepherded hundreds of thousands of Googlers through the readability
process. The sheer scale of this program has shaped the entire tech
industry’s conception of “idiomatic Python/Java/C++/Go”.</p>
<p>I want to discuss what readability is, how it affects Googlers
(myself and others), its cultural significance within Google, and
whether it makes sense to recreate it outside of Google’s walls.</p>
<h2 id="readability-at-google">Readability at Google</h2>
<p>At Google, every change is required to have one approval from a
maintainer of that corner of the codebase. Most companies do this -
nothing strange here. However, uniquely to Google, every change is
<em>also</em> required to have one approval from somebody who “has
readability”. Having readability means that you know the language’s ins
and outs, design patterns, ecosystem of libraries, and idiomatic usage
at Google, and are thereby trusted to catch any issues in language
usage. The readability requirement is satisfied if either the author or
any reviewer has readability. I would estimate that about a third to a
half of Googlers have readability in their primary work language.</p>
<p>To get readability, you submit code you’ve written, and a readability
mentor is randomly drawn from a pool to review your code. You are
encouraged to read and follow the relevant style guides to avoid trivial
back-and-forth. After writing enough “good” code in that language, you
are granted readability.</p>
<p>You can learn more about Readability in the <a
href="https://abseil.io/resources/swe-book/html/ch03.html#readability_standardized_mentorship_thr">SWE
Book</a>.</p>
<h2 id="readabilitys-impact-on-individual-googlers">Readability’s impact
on individual Googlers</h2>
<p>To many Nooglers, Readability’s enshrinement at the very core of the
code submission mechanism seems like unnecessary bureaucracy. To many
veteran Googlers, readability <em>still</em> seems like unnecessary
bureaucracy. You can choose not to get readability (totally allowed!),
but if too few people on your team have readability, then the team’s
work can grind to a crawl when those reviewers go on vacation. Googlers
undertaking their readability pilgrimage will statistically encounter
that one readability reviewer who takes pride in their prowess as a
human code linter, or has a vendetta against some language feature.
These are just some of the many valid reasons to dislike readability at
Google.</p>
<p>Still, for every person who dislikes the readability process, there
are many others who have been helped. Many of my coworkers have become
much better engineers by making the readability pilgrimage, and I hope
that the hundreds of diffs I’ve reviewed have directly improved the
codebase at Google. Readability’s systematic influence on the Google
codebase also leads to a more consistent baseline level of code quality,
relative to other companies of a similar size.</p>
<h3 id="readability-and-me">Readability and Me</h3>
<p>My own readability pilgrimage was rough. Early on, I made the mistake
of submitting a minor improvement to a former intern’s research code for
readability progress. The assigned reviewer tore the entire file apart,
not just the changes I’d made! That was an unpleasant and major
unanticipated scope creep, and from what I learned later as a
readability mentor, that negative review probably set me back 5-10
diffs’ worth of readability progress. Frankly, that early review was a total
waste of both our time, and probably cost Google $5,000 in lost
productivity if you also account for the additional unnecessary
readability reviews I had to undergo.</p>
<p>Later, as a readability mentor, I got to enjoy reading regular rants
on the mentor-private mailing list about various language features. My
work on AutoGraph’s AST compiler magic got a special callout, which I’m
actually quite proud of :trollface: (For the record: I was the one
person on the team who tried very hard to product manage the scope of
AutoGraph’s magic to the smallest useful set of composable
transformations. I am quite aware of the usability dangers of magic
language features!) So I definitely got to see all of the ugliness that
the readability process was capable of generating.</p>
<p>Still, I thought that readability, done right, was valuable, and
signed up to be a mentor. At first, it took me well over an hour per
diff reviewed - a glacial pace of one line of code per minute! It is
difficult to read a random diff from a codebase you’ve never seen
before, whose conventions you are unfamiliar with, from a workstream
you’ve never heard of, where the local maintainer has already worked
with the author to bring the code to an acceptable bar of quality - and
try to contribute something useful to the review.</p>
<p>I could have taken the easy way out by picking at nits that the
linter missed. But to me, readability mentorship meant Engineering
Excellence, broadly interpreted. Beyond just style and testing, I
commented on code architecture, maintainability, library usage, systems
design, build-or-buy decisions, and much more. Having experienced the
unexpected scope creep review myself, I knew that I should not ask for
the author to rewrite their entire codebase. I had a patience budget I
was allotted for each review, and I tried to use it as wisely as I
could.</p>
<p>Several months into my service as a readability mentor, I found that
I could review code 10 times faster than when I first started. I could
understand, within minutes, the change’s intent, the context of why the
surrounding codebase might look the way it did, and sometimes even the
author’s career history. (e.g. “You seem to come from a Java background.
This visitor design pattern is more concisely expressed in Python using
custom tree iterators.”) I’ve become a measurably 10x engineer, at least
in this one specific ability to review code.</p>
<h2 id="readabilitys-role-in-google-culture">Readability’s role in
Google culture</h2>
<p>Readability is just the tip of the Google culture iceberg. In the
early days, Craig Silverstein, employee #1 at Google, would <a
href="https://abseil.io/resources/swe-book/html/ch03.html#readability_standardized_mentorship_thr:~:text=In%20Google%E2%80%99s%20early%20days%2C%20Craig%20Silverstein">carefully
and thoroughly review</a> every new hire’s first CL for best practices
and uniform style. I don’t know if Craig anticipated what readability
would become, but it’s safe to say that he and other early Googlers
understood the multiplicative returns of consistent code style, engineer
fungibility, excellent tooling and centralized systems on programmer
productivity.</p>
<p>Today, Google boasts a unified build system, monorepo, bug tracker,
containerization system, database systems, big data systems, protobufs,
and more. Many secondary systems work their magic, like the ability to
<a href="https://abseil.io/resources/swe-book/html/ch22.html">manage
refactoring diffs spanning millions of lines of code</a> across many
files, and the ability to <a
href="https://testing.googleblog.com/2023/04/sensenmann-code-deletion-at-scale.html">detect
and automate deletion of dead code</a>.</p>
<p>Google’s core technical thesis is that global conformity’s benefits
outweigh local inefficiencies. This engineering-centric culture probably
chases away many product-minded, entrepreneurial, and exploratory types,
to its detriment. In the Research org, I saw researchers who chafed at
readability and only tolerated Google systems to the extent that it got
them TPU compute time. On the positive side, Google attracts the world’s
best engineers, and it delivers technically superior results, even as
management pulls boneheaded product moves.</p>
<p>If you accept Google’s core thesis, readability is merely the scaling
mechanism and melting pot by which global conformity is accomplished. I
vividly remember one readability review for some thorny AutoGraph-laden
TensorFlow 2 code written by an engineer in a random not-researchy part
of Google. There were probably only 10 people in the world besides
myself who could have properly reviewed this code, and it happened to
land in my queue. I’m certain that among the readability mentors, there
are many other domain experts who distribute their expertise throughout
Google. The only other Google-wide melting pots I can think of are code
search and LLM-driven codegen tools, but these don’t have the human
touch that readability brings.</p>
<h2 id="should-your-company-implement-readability">Should your company
implement Readability?</h2>
<p>The defining features of readability at Google are</p>
<ol type="1">
<li>consensus on a bar for readability</li>
<li>a process for mentoring engineers until they qualify for
readability.</li>
<li>programmatic enforcement that every change should be authored or
reviewed by someone with readability.</li>
</ol>
<p>The first two criteria are uncontroversial. Many companies have style
guides, and many companies have a single-language codebase with no room
for cultural drift. Many companies also have a tacit expectation that
senior engineers will review junior engineers’ code until they can be
trusted to review each other’s code.</p>
<p>Criterion (3) is the most difficult to implement, both technically
and organizationally. GitHub has a protected branches feature, but it
doesn’t have any way to add concepts like “Python readability”, and I
don’t think GitHub is in any rush to implement readability.
Organizationally, the programmatic enforcement is what causes the most
grumbling, and I could easily see a VP undermining readability by
declaring their upcoming product launch is more important than enforcing
readability.</p>
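<p>To make criterion (3) concrete, here is a minimal sketch of the check a
submit queue could run. Everything in it is hypothetical - the roster file,
the extension-to-language mapping, and how your review tooling surfaces
authors and approvers will all vary:</p>
<pre><code># readability_gate.py - hypothetical pre-submit check, not Google's actual tooling.
import json
import os

EXTENSIONS = {".py": "python", ".java": "java", ".cc": "cpp", ".go": "go"}

def readability_satisfied(author, approvers, changed_files,
                          roster_path="readability.json"):
    """Pass only if, for every gated language touched by the change, the
    author or at least one approver appears in that language's roster."""
    with open(roster_path) as f:
        roster = json.load(f)  # e.g. {"python": ["alice", "bob"], "java": ["carol"]}
    touched = {EXTENSIONS.get(os.path.splitext(path)[1]) for path in changed_files}
    touched.discard(None)  # ignore files in languages we don't gate
    participants = set(approvers) | {author}
    return all(participants.intersection(roster.get(lang, ())) for lang in touched)
</code></pre>
<p>The hard part is not this logic; it’s keeping the roster honest and
defending the gate from the first VP with a launch deadline.</p>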
<p>What benefit would be worth putting a roadblock into the gameplay
loop of an engineer’s workflow? A safety-critical project, perhaps?
Alternatively, maybe you would like to replicate Google’s engineering
culture.</p>
<p>I personally disagree that Google’s global consistency outweighed
local inefficiencies. Apple and Amazon have a reputation for having a
very distinct working experience depending on where you land within the
company, and this is supposed to be bad. Yet, this also means that teams
can move quickly and without consulting the rest of the company. Google
felt like death by a thousand cuts; the larger Google got, the stronger
the pressures to use monolithic products and processes that were not
adaptable to every possible use case. I saw one particular story play
out repeatedly: “We never launched or demoed our research project
because our only deployment option was to go through all the
privacy/regulatory/legal/PR/Pubapprove signoff processes and spend
several months figuring out the TensorFlow serving stack.” Ultimately, I
believe it’s better to shard the company into smaller, more agile
divisions with distinct subcultures adapted to the needs at hand.</p>
<p>So, my answer is that no, companies should not implement Google’s
version of readability. This should be unsurprising; with how many
Xooglers are floating around, we would see more readability outside of
Google if it actually made sense. Simultaneously, Google’s culture has
too much momentum at this point; it should preserve the readability
process and embrace the types of products and problems that its
engineering-centric culture is best suited to.</p>
<h3 id="readability-lite">Readability Lite</h3>
<p>A <a href="https://promys.org/">summer math program</a> I attended in
high school had a phrase: “Prove or disprove and salvage if possible”.
Having disproven readability, I would like to salvage it by proposing
“Readability Lite”, which consists of:</p>
<ol type="1">
<li>consensus on a bar for readability</li>
<li>a process for mentoring engineers until they qualify for
readability.</li>
<li>a non-blocking mechanism to encourage people to get
readability.</li>
</ol>
<p>This variant salvages the mentorship program while hopefully
eliminating most of the grumbling. It differs from informal mentorship
because it creates opportunities to learn from engineers across the
company, not just within your own team, and it creates an organizational
expectation that engineers can and should strive to master their
craft.</p>
<p>For (1), I would suggest the bar should include: an understanding of
the language’s memory model; awareness, if not a solid grasp, of
language solutions to typical tasks (servers/clients,
serialization/deserialization, regex, metaprogramming, arrays, time,
I/O, logging, performance measurement/debugging/optimization);
understanding of the nuances of dependency management; good testing
practices (including how to architect for easy testability); and some
understanding of why the company’s technical choices suit its
technical/product requirements. Not all “typical tasks” will be
applicable to every company; pick and choose as appropriate! For (2),
you would want to find and incentivize senior/staff+ engineers who want
to mentor others, possibly through some sort of citizenship expectation
in performance evaluations. For (3), I think the simplest solution is to
make readability a requirement for promotion to senior engineer. [cue
flamewar on whether my bar is too high or too low for “senior”
engineers]</p>
<p>Implementation-wise, it’s best to start early. It’s difficult to
bootstrap readability past, say, 100 engineers, because you’d need to
get >20 senior engineers to agree on a bar for readability. If you
assume a quarter of the company has readability (1:3 senior:junior
ratio), that the company is growing at 20% YoY (+25% new hires -5%
attrition), then ~6% of the company’s engineers will need readability
every year. Assuming one readability mentor can mentor 3-5 or so people
to readability every year, 1-2% of all engineers, or about 5-10% of
senior engineers, need to be readability mentors.</p>
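<p>For the skeptical, here is the back-of-envelope arithmetic as a runnable
sketch (every number is an assumption from the paragraph above):</p>
<pre><code># Staffing math for Readability Lite; all inputs are assumptions.
readability_fraction = 0.25        # 1:3 senior:junior ratio
new_hires, attrition = 0.25, 0.05  # +25% hires - 5% attrition = 20% YoY growth

# Holders needed next year, minus holders surviving attrition:
target = readability_fraction * (1 + new_hires - attrition)
surviving = readability_fraction * (1 - attrition)
grants_needed = target - surviving
print(grants_needed)  # 0.0625 -> ~6% of engineers earn readability each year

mentees_per_mentor = 4  # assume each mentor carries 3-5 mentees per year
mentors = grants_needed / mentees_per_mentor
print(mentors)                         # ~1.6% of all engineers
print(mentors / readability_fraction)  # ~6% of senior engineers
</code></pre>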
<h2 id="conclusion">Conclusion</h2>
<p>I learned a great deal, both on the receiving and giving end of
Google’s readability process, and I’m really grateful that Google
invested so much in growing its engineering talent. I understand why
nobody else would want to implement Google’s readability, but I would
love to see more Readability Lite in the world. Please let me know if
you try this out at your company and how it turns out.</p>
Why did Google Brain exist?
<p> Originally posted 2023-04-26</p>
<p> Tagged: <a href="/essays/tags/machine_learning">machine_learning</a>, <a href="/essays/tags/system_dynamics">system_dynamics</a>, <a href="/essays/tags/popular">popular</a></p>
<hr />
<p><em><a
href="https://www.golem.de/news/brain-deepmind-hinter-den-kulissen-von-googles-ki-forschung-2306-174989.html">Auf
Deutsch lesen</a></em></p>
<p>This essay was originally written in December 2022 as I pondered the
future of my job. I sat on it because I wasn’t sure of the optics of
posting such an essay while employed by Google Brain. But then Google
made my decision easier by laying me off in January. My severance check
cleared, and last week, Brain and DeepMind merged into one new unit,
killing the Brain brand in favor of “<a
href="https://blog.google/technology/ai/april-ai-update/">Google
DeepMind</a>”. As somebody with a unique perspective and the unique
freedom to share it, I hope I can shed some light on the question of
Brain’s existence. I’ll lay out the many reasons for Brain’s existence
and assess their continued validity in today’s economic conditions.</p>
<h2 id="the-industry-research-lab">The Industry Research Lab</h2>
<p>I want to start by precisely describing the paradox that needs to be
explained.</p>
<p>Academics have always faced the dilemma of research freedom in
academia versus higher pay in industry. It’s not surprising that as a
machine learning expert, Google will pay you handsome sums to do ML. The
tradeoff is usually that you have to work on recommender systems, ad
optimization, search ranking, etc., instead of pure research.</p>
<p>To be clear, Brain hosts many researchers and projects, many of which
are directly or indirectly profitable. For example, many researchers
focus on improving optimizers, architecture search, and hyperparameter
search. This research is directly profitable, as it lowers compute cost
to achieve a given level of performance. I don’t think this needs any
further explanation.</p>
<p>What needs explanation is why Google Brain (alongside DeepMind,
OpenAI, FAIR, and others) funds hundreds of ML researchers to work on
pure research, seemingly just for research’s sake, while still
compensating an order of magnitude more than academia would. For
example, my team worked on <a href="/essays/rgb_odor/">machine learning
for olfaction</a>. What is Google doing, funding research on smell?
What’s the catch? This is the question I would like to answer.</p>
<h2 id="prestige">Prestige</h2>
<p>Most academics assume that Brain is angling for prestige: “Brain is
in a bidding war with other industrial research labs to hire the best
researchers, so that they can be the most prestigious research group,
which will in turn help them hire the best researchers”. After all, this
is how academia in the U.S. works: with a trinity of funding,
students/postdocs, and principal investigators (PIs). In principle,
funding goes to the most talented PIs and students/postdocs;
students/postdocs go where the most talented PIs and funding are; PIs go
where they can find talented students/postdocs and funding.</p>
<p>Universities are directly incentivized to maximize prestige, as they
take a (<a
href="https://austinhenley.com/blog/grantbudget.html">surprisingly
large</a>) cut of all research funding. Industry research labs don’t
have the same incentive structure. Rather than profiting from
maintaining a prestigious lab, it ends up costing more to keep top
researchers from defecting. Uber AI Labs seemed to exist solely for
prestige (ego?) reasons and was duly canceled by Dara “Adult
Supervision” Khosrowshahi after he took over from Uber founder Travis
Kalanick.</p>
<p>Prestige confers two main effects: a positive brand image in the
consumer space, and easier hiring, both within pure research and in
applied ML. For example, I hadn’t even considered applying to Apple
during my <a href="/essays/my_ml_path">job hunt</a> several years ago,
due to their lack of ML presence! Perhaps Apple didn’t recruit machine
learning experts precisely because they did not need machine learning
experts - a <a
href="https://blog.pragmaticengineer.com/apple-job-cuts-tide/">sensible
decision in line with Apple’s growth philosophy</a>. But if you do need
to hire several thousand ML engineers, it makes sense to fund a handful
of top ML researchers as a prestige play. I believe that my team,
working on ML for olfaction, was partly a prestige play.</p>
<p>Prestige-oriented research still makes sense today as a hiring
tactic, but given that the industry is collectively cutting recruiting
budgets, prestige spending must also be reduced.</p>
<h2 id="mbas-and-golden-eggs">MBAs and Golden Eggs</h2>
<p>The next obvious reason for Google to invest in pure research is for
the breakthrough discoveries it has yielded and can continue to
yield.</p>
<p>As a rudimentary brag sheet, Brain gave Google TensorFlow, TPUs, <a
href="https://arxiv.org/abs/1609.08144">significantly improved
Translate</a>, <a href="https://jax.readthedocs.io/en/latest/">JAX</a>,
and <a href="https://arxiv.org/abs/1706.03762">Transformers</a>. These
are just the projects that were {pure research at inception} X {have
significant profit impact today}; if I loosen either constraint, the
list would be far longer, e.g. <a
href="https://jamanetwork.com/journals/jama/article-abstract/2588763">ML
for medical imaging</a>, and <a
href="https://ai.googleblog.com/2017/11/automl-for-large-scale-image.html">AutoML</a>,
to name a few.</p>
<p>Brain’s freewheeling, bottom-up, researcher-centric culture was
arguably what generated these breakthroughs. Jeff Dean is in charge of
research, precisely because he embodies these ideals. If instead an MBA
were in control, the MBA culture would trickle down, killing the goose
that lays the golden eggs. Better to hand over the golden eggs after
they’re laid and keep the MBAs in the shadows.</p>
<p>Over time, two trends have empowered MBAs to act more openly. The
first is the economic backdrop: with a tightening economy and with
increased competition from OpenAI/VC-funded AI startups, Google feels a
need to be more responsible and directed about its research investments.
The second is increased familiarity with ML’s capabilities. In the early
days of deep learning, nobody knew what it could be capable of, and the
researchers were given the privilege of charting a research vision.
Today, thought leaders casually opine on how and where ML will be
useful, and MBAs feel like this is an acceptable substitute for expert
opinion. The result is reduced researcher freedom and more top-down
direction.</p>
<p>As an amusing anecdote, Google’s researcher promotion criteria were
for some time linked to external recognition of research significance.
If Google’s promo committees, formed of senior researchers, can’t even
decide whether their own research is significant, then what chance would
MBAs have? In the very near future, I would expect researcher promotion
criteria to shift towards delivered business value, rather than external
recognition of research impact.</p>
<p>Today, I see a similar wave of researcher empowerment with LLMs, as
once again, nobody but the researchers can credibly opine on their
capabilities. Even then, every LLM researcher can feel MBAs breathing
down their necks in a way that wasn’t the case during deep learning’s
ascendancy.</p>
<h2 id="the-51-attack">The 51% attack</h2>
<p>Another reason for Google’s funding of open-ended research is to
maintain its lead in machine learning.</p>
<p>Google had stayed ahead of the industry for well over a decade in
large-scale systems programming. Systems like <a
href="https://research.google/pubs/pub62/">MapReduce</a> (Hadoop), <a
href="https://research.google/pubs/pub39966/">Spanner</a> (CockroachDB),
and <a href="https://research.google/pubs/pub48190/">Zanzibar</a>
(AuthZed) solved problems that the industry was only beginning to
realize were problems, and it took 5-10 years for viable alternatives
(indicated in parentheses), copycatting the corresponding Google
whitepapers, to be available to competitors.</p>
<p>When Google open-sourced TensorFlow, it was clear that they had done
it again. This early success suggested that Google would be able to stay
ahead of its competitors by generously funding long-shot research
bets.</p>
<p>Unfortunately, this early lead would be completely squandered within
a few short years, with PyTorch/Nvidia GPUs easily overtaking
TensorFlow/Google TPUs. ML was, and frankly is, still too nascent to
have significant technical barriers to entry. The sustained eye-popping
funding for AI companies generated a surge in supply, with the number of
ML researchers growing ~25% YoY for the past decade. I taught myself
enough ML to blend in with the researchers at Brain over a relatively
short 2 years, and so have many others. Nobody, not even Google, can
afford to throw money into a bottomless pit.</p>
<p>Developing an early lead in a field (<em>cough</em> Transformers
<em>cough</em>) is also only valuable to the extent that Google can
translate that research edge into product. Brain’s <a
href="https://docs.google.com/presentation/d/1WrkeJ9-CjuotTXoa4ZZlB3UPBXpxe4B3FMs9R9tn34I/edit#slide=id.g164b1bac824_0_3835">recent
talent exodus</a> is in no small part due to internal perception that
Google was sitting on groundbreaking research rather than developing it
to its potential. ChatGPT raised serious existential questions for
Brain. If we take Google’s inability to execute on research translation
as a constant, then does it even make sense to invest internally in
open-ended speculative research? Google’s <a
href="https://www.ft.com/content/583ead66-467c-4bd5-84d0-ed5df7b5bf9c">$400M
investment in Anthropic AI</a> is a bad look: Google execs are hedging
their research bets on external research groups.</p>
<h2 id="catalyst-theory">Catalyst theory</h2>
<p>One unusual thing about Brain is its liberal publication policy - <a
href="https://medium.com/criteo-engineering/neurips-2020-comprehensive-analysis-of-authors-organizations-and-countries-a1b55a08132e">Brain
often outpublishes entire universities at top-tier ML conferences</a>.
Having invested vast sums into open-ended research, why give it away for
free? The major reasons for publication are 1) prestige and 2) because
researchers can quit and take the knowledge with them anyway. A more
subtle reason is 3) to catalyze growth in a field.</p>
<p>The catalyst theory is that by publishing key research in areas
relevant to Google’s core business, that research direction will move in
a way that benefits Google. For example, Google has always been
interested in better NLP, and the publication of key research like <a
href="https://research.google/pubs/pub43155/">seq2seq in 2014</a> and <a
href="https://arxiv.org/abs/1706.03762">Transformers in 2017</a>
catalyzed the growth of the entire NLP field. Google is one of the few
companies with both the consumer surface area and the computational
might to scale up ML deployments to a billion users, so Google benefits
from the overall advancement of the field.</p>
<p>In peacetime mode, it makes sense to spend $X to grow the overall
pie, as long as your slice of the pie grows more than $X. In wartime
mode, it also matters how much your competitors’ slice of the pie is
growing. OpenAI’s alliance with Microsoft means that there is another
giant out there with both the consumer surface area and the
computational might to scale up ML deployments. As Google transitions to
<a href="https://a16z.com/2011/04/14/peacetime-ceo-wartime-ceo/">wartime
mode</a>, the catalyst theory is almost certainly dead at Google.</p>
<h2 id="retainer-fee">Retainer fee</h2>
<p>DARPA’s mission statement is “to prevent and create technological
surprise”. The best defense is a good offense, but it certainly doesn’t
hurt to have a stable of technical experts who can quickly understand
and respond to unexpected developments in the field. When times are
good, the experts can focus on original research, and when times are
bad, the experts will be drafted to work on defensive projects. Seems
reasonable, although the drawback of this plan is that there is no way
to guarantee that the experts will actually stick around when you cancel
their pet projects. To remove any doubt, you could also just lay them
off 🙂. Sarcasm aside, Google wasn’t wrong to lay me off; the fact that
I started writing this essay 5 months ago was a strong indicator to me
back then that I should be looking for new jobs.</p>
<p>Times are now bad. I expect to see Google call upon its researchers
to focus on LLMs, first with the carrot, and then the stick.</p>
<h2 id="tech-hubris">Tech Hubris</h2>
<p>Many of Brain’s open-ended research projects are quite
interdisciplinary in nature. As previously mentioned, my team worked on
<a
href="https://ai.googleblog.com/2022/09/digitizing-smell-using-molecular-maps.html">ML
for olfaction</a>, and Brain is also pioneering advances in ML for <a
href="https://jamanetwork.com/journals/jama/fullarticle/2588763">medical
imaging</a>, <a
href="https://ai.googleblog.com/2020/03/a-neural-weather-model-for-eight-hour.html">weather
modeling</a>, <a
href="https://ai.googleblog.com/2018/07/improving-connectomics-by-order-of.html">neuronal
imaging</a>, <a
href="https://ai.googleblog.com/2017/12/deepvariant-highly-accurate-genomes.html">DNA
variant calling</a>, <a href="https://magenta.tensorflow.org/">music and
art</a>, <a
href="https://ai.googleblog.com/2022/03/using-deep-learning-to-annotate-protein.html">protein
annotation</a>, and probably more that I’ve missed. There are many
success stories outside of Brain, like AlphaGo and AlphaFold.</p>
<p>There is no question that these interdisciplinary efforts have
yielded much fruit. However, two countervailing trends have reduced
Google’s willingness to continue funding these efforts.</p>
<p>The first is researcher demographics. There is nothing quite as
annoying as a <a href="https://xkcd.com/793/">physicist first
encountering a new field</a>. On the other hand, there is nothing quite
as transcendental as a domain expert learning physics (or in this case,
machine learning). Given the long timelines of a PhD program, the vast
majority of early ML researchers were self-taught crossovers from other
fields. This created the conditions for excellent interdisciplinary work
to happen. This transitional anomaly is unfortunately mistaken by most
people to be an inherent property of machine learning to upturn existing
fields. It is not.</p>
<p>Today, the vast majority of new ML researcher hires are freshly
minted PhDs, who have only ever studied problems from the ML point of
view. I’ve seen repeatedly that it’s much harder for an ML PhD to learn
chemistry than for a chemist to learn ML. (This may be survivorship
bias; the only chemists I encounter are those that have successfully
learned ML, whereas I see ML researchers attempt and fail to learn
chemistry all the time.) In any case, I expect the quality and success
rate of later interdisciplinary projects to drop correspondingly. Even
if Google execs don’t understand the nature of the trend, they will
notice the decreasing quality of the breakthroughs.</p>
<p>The second is that from a business perspective, it turns out it is
much easier for incumbents to learn machine learning than it is for
Google to learn a new business field. <a
href="https://www.healthcaredive.com/news/google-disbands-health-unit-as-chief-departs-for-cerner/605387/">Google
Health</a> is the most prominent example, but I have seen this pattern
play out repeatedly in other domains. I am skeptical that DeepMind’s
Isomorphic Labs will get much further. On the other hand, companies like
Recursion Pharmaceuticals and Relay Therapeutics, staffed with a mix of
career biologists and chemists-turned-ML engineers, have done well. The
benefits of interdisciplinary ML breakthroughs seem to go to incumbents,
and do not form a strong basis for a new business line for Google.</p>
<h2 id="the-brain-deepmind-merger">The Brain-DeepMind Merger</h2>
<p>Where to begin? My thoughts on this are jumbled, and in the interest
of a timely blog post, I will present them in bulleted list form…</p>
<ul>
<li>Google execs apparently thought the DeepMind branding was stronger
than Brain branding. Alternatively, Demis refused to sign off on the
merger unless the DeepMind name stayed.</li>
<li>This merger is probably a prelude to a greater restructuring.</li>
<li>Neither side “won” this merger. I think both Brain and DeepMind
lose. I expect to see many project cancellations, project mergers, and
reallocations of headcount over the next few months, as well as
attrition.</li>
<li>With fewer projects to go around, I expect to see a lot of middle
management get cut or leave.</li>
<li>I expect there to be a lot of turbulence due to DeepMind’s top-down
culture clashing with Brain’s bottom-up culture. The turbulence will
bring any merger efficiency gains down to, or even below zero.</li>
</ul>
<h2 id="the-road-ahead">The road ahead</h2>
<p>Despite Brain’s tremendous value creation from its early funding of
open-ended ML research, it is becoming increasingly apparent to Google
that it does not know how to capture that value. Google is of course not
obligated to fund open-ended research, but it will nevertheless be a sad
day for researchers and for the world if Google turns down its
investments.</p>
<p>Google is already a second-mover in many consumer and business
product offerings and it seems like that’s the way it will be in ML
research as well. I hope that Google at least does well at being second
place. There’s lots of room for winners in machine learning.</p>
A Research Codebase Manifesto
<p> Originally posted 2023-02-14</p>
<p> Tagged: <a href="/essays/tags/software_engineering">software_engineering</a>, <a href="/essays/tags/machine_learning">machine_learning</a>, <a href="/essays/tags/python">python</a>, <a href="/essays/tags/popular">popular</a></p>
<hr />
<p><em>Note: Multiple people have told me that this essay could equally
well have been titled “A Startup Codebase Manifesto”. YMMV.</em></p>
<p>At Google Brain, I was the tech lead of a team with multiple
researchers and engineers actively running experiments and committing
changes to a shared codebase. This codebase has generated feedback like
“you have no idea how much i miss our old codebase”, “this is a textbook
example of what a research codebase should look like”, and “I was
curious how company X’s research codebase would look and it’s a complete
mess compared to your codebase”. (For curious googlers: you can find
this codebase if you search for CLs submitted by brianklee@).</p>
<p>Managing a research codebase is difficult. I have heard of other
research teams that attempted to join their many research subprojects’
codebases, only to run into issues around code ossification, slower
iteration cycles, and general researcher frustration. Yet other research
teams, wary of these issues, embrace the academic baseline of untended
anarchy (yes, even at Google).</p>
<p>Here are some of the lessons I’ve learned in helping our team make
the best use of our codebase.</p>
<p>For some context, our team was roughly a 1:2 mix of engineers to
researchers, and we worked on machine learning applied to molecular
property prediction and representation learning. My advice is probably
more useful for industry research groups and less useful for academic
research groups. It will be difficult to bootstrap this type of codebase
discipline without an engineering champion in your group.</p>
<h2 id="codebase-evolution">Codebase evolution</h2>
<p>Writing a one-person research codebase is easy. The difficulty arises
when you try to maintain this codebase over multiple people and over
time. Software engineering best practices are designed to alleviate
these issues, but the usual recommendations don’t always work, because
research codebases change far faster than product codebases. The stakes
are higher, too - a stagnant product codebase can still generate
business value, but a stagnant research codebase simply fails at its
core purpose: to investigate and evaluate new ideas.</p>
<p>Here are some of the most common ways research teams respond to
evolving research interests.</p>
<ul>
<li>Change code without caring about compatibility. The result is spooky
breakage at a distance. A researcher can check in changes that progress
their research by 1x and retard everyone else’s research by 0.5x each,
for a net drag on productivity. If everybody has their own solo
codebase, then there are fewer costs to breakage, but also fewer
benefits to collaboration. (This is the academic default.)</li>
<li>Carefully update code, maintaining compatibility with project code.
As older projects accumulate, the backwards compatibility tax grows and
grows.</li>
<li>Don’t change code. Often groups end up in this category because
their research group turned into a product group, but regardless of the
reason, it spells the death of new research.</li>
<li>Start over from scratch, copying code snippets from the old codebase
as needed.</li>
</ul>
<p>Each strategy has its pros and cons. I found the following strategy
effective within my team.</p>
<h2 id="the-three-tier-codebase">The Three-Tier Codebase</h2>
<p>This strategy is a mix of approaches (2) and (4); a sketch of the
resulting directory layout follows the list.</p>
<ul>
<li>Core. Libraries for reusable components like cloud data storage,
notebook tooling, neural network libraries, model
serialization/deserialization, statistics tests, visualization, testing
libraries, hyperparameter optimization frameworks, wrappers and
convenience functions built on top of third-party libraries. Engineers
typically work here.
<ul>
<li>Code is reviewed to engineering standards. Code is tested, covered
by continuous integration, and should never be broken. Very low
tolerance for tech debt.</li>
<li>Breaking changes to core code should be accompanied by fixes to
affected project code. The project owner should assist in identifying
potential breakage. No need to fix experimental code.</li>
</ul></li>
<li>Projects. A new top-level folder for each major effort (rough
criterion: a project represents 1-6 months of work). Engineers and
researchers work here.
<ul>
<li>Code is reviewed for correctness. Testing is recommended but
optional, as is continuous integration.</li>
<li>No cross-project dependencies. If you need code from a different
project, either go through the effort of polishing the code into core,
or clone the code.</li>
</ul></li>
<li>Experimental. Anything goes. Typically used by researchers. I
suggest namespacing by time (e.g. a new directory every month).
<ul>
<li>Rubber-stamp approvals. Code review is optional and comments may be
ignored without justification. Do not plug this into continuous
integration.</li>
<li>The goal of this directory is to create a safe space for researchers
so that they do not need to hide their work. By passively observing
research code “in the wild”, engineers can understand research pain
points.</li>
<li>Any research result that is shared outside the immediate research
group may not be derived from experimental code.</li>
</ul></li>
</ul>
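<p>Concretely, a repository following this structure might look something
like the sketch below (all file and folder names are hypothetical):</p>
<pre><code>repo/
  core/                 # engineering standards: reviewed, tested, CI
    storage.py
    gnn.py
  projects/             # one folder per 1-6 month effort; no cross-project deps
    solubility_2023/
      train.py
  experimental/         # anything goes; rubber-stamped, never depended upon
    2023_02_alice/      # namespaced by time (and owner)
      scratch.ipynb
</code></pre>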
<p>The key idea is that when project-specific code is not generating
research value, it is not worth upkeep and should be amputated. By
configuring project-specific code to be amputation-ready, the codebase
as a whole stays healthier. If this feels strange to you, remember that
your job in a research group isn’t to write code, it’s to do research,
and this remains true whether your job description says Engineer or
Researcher/Scientist.</p>
<p>This structure solves for some tricky dynamics, which I will explain
further.</p>
<h3 id="engineerresearcher-collaboration">Engineer/Researcher
collaboration</h3>
<p>Tensions arise when engineers and researchers interact in a single
codebase. Engineers have a shared understanding of software best
practices, e.g. testing code, reusable functions, single-responsibility
principle, etc. Researchers, on the other hand, don’t see the benefits
of such best practices and resent the drag on their individual
productivity.</p>
<p>This tension most commonly manifests during code review. Engineers
tend to impose demands on researchers’ code before it can be checked in,
whereas researchers tend to rubber-stamp each others’ code, leaving
engineers to feel like they are permanently on clean-up duty.
Researchers, annoyed by the slowdown in code velocity, will evade the
code review mechanism by iterating in private on a solo code repository
or by working entirely in notebooks instead of proper modules.
Engineers’ tools go underutilized because codebases are not
integrated.</p>
<p>One of the strengths of the three-tier codebase is that it helps
engineers and researchers collaborate by setting code review
expectations. The benefits include healthier team dynamics, increased
probability of correctness, mutual learning opportunities, and overall a
happier team.</p>
<h3 id="keeping-track-of-code">Keeping track of code</h3>
<p>Another strength of the three-tier codebase is centralization of
code. Centralization creates a single source of truth, encourages core
code reuse, and streamlines workflows. It’s important enough that the
sole purpose of the Experimental directory is to discourage the creation
of private codebases. A Colab notebook on Google Drive or an unpushed
git branch on your laptop’s hard drive count as private codebases, in
this reckoning. Ultimately, a shared codebase is a foundation for shared
progress and learning.</p>
<p>In the absence of centralization, many inefficiencies arise. If you
haven’t struggled to recover the precise version of some notebook that
generated figure 4 in your paper, which Reviewer 2 is now critiquing,
then have you really done research? What about haggling with your IT
department’s privacy lawyers to try and salvage a python notebook from a
former intern’s returned laptop?</p>
<p>That being said, you shouldn’t bother checking in every snippet of
throwaway code. A good rule of thumb is that you should check in code
only if the result was interesting enough to share with your team. (I
mean result in a general sense: an explanation, knowledge, a specific
number, and <em>especially</em> a dataset.) If you wouldn’t pollute
their mindspace during group meetings, why would you pollute the
codebase?</p>
<h2 id="a-comment-on-notebooks">A comment on notebooks</h2>
<p>Some people hate notebooks because they are sometimes not much more
legible than a transcript of an interpreter session. They can even
introduce new and exciting failure modes, usually due to out-of-order
execution or hidden state due to overwritten/deleted cells. Yet, they’re
an indispensable part of the research toolkit.</p>
<p>Not all notebooks are worth checking in. As mentioned before, a good
cutoff criterion is whether the notebook generates a research result
that you thought interesting enough to share with your team. When you
check in a notebook, the following steps will minimize unnecessary
sadness for future readers and users of the notebook (including
yourself):</p>
<ul>
<li>delete nonessential cells</li>
<li>check in cell output (but do trim noisy/verbose output)</li>
<li>restart the kernel and run your notebook from top to bottom to check
for out-of-order execution issues (a sketch of automating this follows
below)</li>
</ul>
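<p>The last step is mechanical enough to automate. A minimal sketch,
assuming <code>jupyter nbconvert</code> is installed (the filename is a
placeholder):</p>
<pre><code># Re-execute a notebook top to bottom before check-in; a nonzero exit
# means some cell fails when run in order.
import subprocess

subprocess.run(
    ["jupyter", "nbconvert", "--to", "notebook", "--execute",
     "--inplace", "figure4_analysis.ipynb"],
    check=True,
)
</code></pre>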
<p>Despite my statement about experimental being “anything goes”, I do
think the above steps are easy enough that they should be insisted upon
even for experimental code.</p>
<h2 id="keeping-up-with-the-times">Keeping up with the times</h2>
<p>One final pathology endemic to research codebases is the build-or-buy
dilemma. By their very nature, research codebases are typically on the
cutting edge of what people are interested in building, and there are
rarely well-built libraries for the thing you are trying to accomplish.
So at first, build is really the only option. But unless you have a
large enough engineering budget (cough DeepMind cough) that you can
create your own ecosystem of well-polished first-party solutions, time
will eventually produce a third-party solution that does it better.</p>
<p>The three-tier codebase forces an explicit decision to polish and
promote project code into core. Good judgment is necessary to decide
whether to polish something into core or to procrastinate by just
copying old code into a new project directory. Neither decision is
necessarily wrong. My hit rate was roughly 60%: about that share of our
core libraries remain the best available solutions to their problems,
which seems decent. As a case study from the other 40%, consider our
graph neural network library.</p>
<p>We built our own TF2 graph neural network (GNN) library in late 2019,
mere months after TF2’s release. It was customized for molecules, taking
advantage of carbon’s four-valence constraint to optimize the adjacency
list representation. I was the resident TF2 expert in the Cambridge
research office, so it seemed like a natural choice at the time. But if
we had to restart today I would probably go with <a
href="https://github.com/deepmind/jraph">JAX/Jraph</a>, publicly
released in late 2020.</p>
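<p>To illustrate the valence trick (the shapes and padding convention here
are my guesses, not our library’s actual implementation): if every heavy
atom has at most four bonds, the ragged adjacency list collapses into a
dense (n_atoms, 4) array, and a message-passing step becomes a single
gather:</p>
<pre><code># Fixed-valence adjacency sketch; assumes only numpy.
import numpy as np

# Ethanol heavy atoms: C-C-O. Each atom gets exactly 4 neighbor slots,
# padded with -1 where a bond is absent.
neighbors = np.array([
    [1, -1, -1, -1],   # atom 0 (C) bonded to atom 1
    [0,  2, -1, -1],   # atom 1 (C) bonded to atoms 0 and 2
    [1, -1, -1, -1],   # atom 2 (O) bonded to atom 1
])

features = np.random.randn(3, 8)                  # (n_atoms, feature_dim)
padded = np.vstack([features, np.zeros((1, 8))])  # index -1 hits this zero row
messages = padded[neighbors].sum(axis=1)          # (n_atoms, feature_dim)
</code></pre>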
<p>We never made the jump to JAX/Jraph, because the cost-benefit never
seemed to be worth it. (JAX’s static shape requirements and
serialization weaknesses significantly increased the migration costs,
while our small molecule datasets limited the upside to better GNN
architectures.) While the existing GNN libraries worked well for what we
were doing, it impeded new research in subtle ways - hypergraph or
multi-molecule architectures were forever on our horizon because they
were difficult to implement. I overinvested in our GNN libraries and
they subsequently got interwoven into our workflows, making it difficult
to migrate away.</p>
<p>The “obvious” solution is to cut your losses early and migrate as
soon as a better library is identified, but that’s easier said than
done. It’s particularly impressive if you can identify and switch to a
library with a better trajectory, even before it reaches feature parity
with your existing libraries. DeepMind’s early 2020 decision to shift
their entire organization to JAX continues to impress me with its
foresight.</p>
<h2 id="parting-thoughts">Parting thoughts</h2>
<p>Academic research groups often complain about the inequality of
resources relative to industrial research groups. Compute resources
typically come to mind, but another important inequality is industry’s
ability and willingness to hire engineers alongside researchers. The
difference is structural: even though engineering wisdom is free and
readily available online to any grad student who wishes to obtain it, it
rarely happens because that’s not what gets you your PhD. And even then,
merely hiring an engineer is not enough to make your research group more
productive. Integrating the research and engineering worlds requires
researchers to understand when engineering is necessary, and engineers
to understand when it is not. It’s a culture shift that’s hard to pull
off.</p>