For Want of a JOIN
Originally posted 2022-11-27
Obligatory disclaimer: all opinions are mine and not of my employer
A story of a data pipeline gone wrong.
For want of a JOIN
Once upon a time, I was on a small team tasked with building a data pipeline. On this team was someone we’ll call Bob, who was fresh out of grad school and hadn’t had much experience with software engineering, and no experience with SQL. As part of this data pipeline, Bob had to combine the data from two tables. JOIN is a core feature of SQL, so this ought to be simple, right? Unfortunately, Bob hadn’t learned any SQL past SELECT and WHERE.
So, instead of joining the two tables:
python_client.query("SELECT * FROM A JOIN B ON A.id = B.id WHERE...")
= python_client.query("SELECT id FROM B WHERE...") rows = python_client.query("SELECT * FROM A WHERE id IN (%s)" % ', '.join(rows))result
Other than the unnecessary second round-trip to the database, this worked just fine for our launch dataset. With the launch deadline looming, this code got checked in as is.
For want of a bounded query string
We launched, and started running this pipeline daily on the latest data dumps. One day, we hit the following error message:
ERROR: SQL query may not exceed 10,000 characters
Oops. As the JOIN condition grew, the generated SQL query was growing longer and longer, until BigQuery rejected it outright. Bob knew how to fix this problem! The IDs were at most 8 characters each, so accounting for the commas and whitespace, he could make sure to stay under the character limit:
= python_client.query("SELECT id FROM B WHERE...") rows =  result for chunk in more_itertools.chunked(rows, 1000): "SELECT * FROM A WHERE id IN (%s)" % ', '.join(chunk))) result.extend(python_client.query(
The code chugged along.
For want of a fast pipeline
As our dataset continued to grow, the number of chunks in this
for loop ticked upwards. Eventually, as the pipeline
execution time started to push several hours, the team made a push to
optimize the pipeline. Bob added concurrent queries to parallelize the
= python_client.query("SELECT id FROM B WHERE...") rows =  pending_queries for chunk in more_itertools.chunked(rows, 1000): "SELECT * FROM A WHERE id IN (%s)" % ', '.join(chunk))) pending_queries.append(python_client.run( = list(itertools.chain(*(q.result() for q in pending_queries)))results
This brought execution time back down to a more tolerable slowness.
For want of parallel queries
Unfortunately, this solution was but a band-aid. We ran into the following error not long afterwards:
WARNING: Concurrent INTERACTIVE query limit reached (50). To increase concurrent queries, drop to BATCH priority, or contact us to increase your quota limits.
Bob did both, firing off an email to our GCP contact and dropping our query priority to BATCH. I don’t remember what happened to our quota, but what I do remember is that our pipeline didn’t get any faster… BATCH priority meant that our queries could take arbitrarily long to execute, depending on whether BigQuery was having a busy day! Our pipeline execution times bounced around between a few hours to half a day. Despite each day’s pipeline run kicking off around midnight, we started getting occasional questions from our analysts around 8AM about where today’s data was.
This was the last I saw of this particular train wreck, as I left the company soon after.
It’s never just the one thing
You may be wondering why nobody commented on the missing JOIN during code review. And how did the issue persist for so long?
I was the code reviewer on the initial PR, and yes, I’d commented on the issue. Unfortunately, the PR totalled 2,000 lines of code, and this was just one of many issues I found in that diff. When I asked our TL what to do, they advised me to ship it, since we had a launch deadline upcoming and working code was better than no code. I figured this was a “disagree and commit” situation so I let the code through.
Later, the easy availability of all of those IDs in Python meant that other subsequent steps came to depend on those values. That was why it was so difficult to revert back to the simple JOIN statement.
We also didn’t have much breathing room to do things right; organizationally, the missing JOIN performance issue was just one of many fires. Post-launch, our on-call shifts were best described as disappointing. Imagine being woken up at 4AM to find out that some user submitted a fake phone number, uncovering an edge case where the frontend phone number validator disagreed with the backend validator. (Our leadership had mandated that all signup errors warranted paging the on-call). And when the fix was submitted for the phone number bug, bureaucratic Q/C and release processes meant that our on-calls continued to be paged for bogus phone numbers for a few more weeks.
What lessons should we take away from this story?
- Learn the capabilities of your tools. With the exception of NPM modules, tools are designed to solve problems, possibly the ones you have.
- 2000 lines is far too late to get feedback. Smaller changes are more likely to receive actionable feedback.
- If you encounter an unusually round system limit, you’re probably using the system in a way its designers explicitly discourage.
- As an inexperienced engineer, you have unknown unknowns. Find mentors to help identify them.
For the code reviewer:
- Pick your battles when reviewing code; you, the code author, and the project all have a patience budget for code review that should not be exceeded. Ask your TL for backup.
For the TL:
- Operate on data where it resides. Python is a control plane, not a data plane.
- Parallelizing something you haven’t optimized first is as likely to slow it down as to speed it up.
- Communicate expectations around engineering workflow.
- More broadly, have a system in place to identify and address cultural issues. Others may not realize they need to ask for help.
For the manager:
- Budget extra time for junior engineers to learn their tools and best practices.
- Help Bob find mentors.
- Make clear that Bob’s expectations are not just about producing code and helping meet the launch deadline, but growing to the point where his work needs minimal supervision from more senior members of the team.
- Understand how rigid your project’s launch date and scope are, and be intentional about tech debt.
- Does your team have a TL? Are you the TL? (Bad idea)