Startup update 15: Scraping PDFs is still hard

Originally posted 2025-08-04

Tagged: cartesian_tutor, llms

Obligatory disclaimer: all opinions are mine and not of my employer


Progress update

I did a bunch of scraping this week to try to get past years’ chemistry olympiads into a format that can be presented in the app. It turned out to be surprisingly hard: chemistry relies on diagrams in a variety of places (the problem statement, the multiple-choice options, the solutions) and at varied spots within multipart questions. The diagrams range from plots to Lewis dot structures, organic structures, molecular orbital diagrams, and so on. So figuring out how to scrape images out of PDFs and then associate them with the right place in the database schema is a bit annoying. Unfortunately, this is table stakes, so there’s no skipping it. I also anticipate that getting an automated grader working later on will be annoying. That’s this week’s excuse for not having something new up 🫠
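To make the association problem concrete: each extracted image needs a pointer into a nested question structure, not just a page number. A minimal sketch of what that might look like (all names here are hypothetical illustrations, not my actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Diagram:
    image_path: str  # cropped image pulled out of the PDF
    placement: str   # "statement", "choice_B", "solution", ...

@dataclass
class Subpart:
    label: str       # e.g. "(c)"
    text: str
    diagrams: list[Diagram] = field(default_factory=list)

@dataclass
class Problem:
    number: int
    statement: str
    subparts: list[Subpart] = field(default_factory=list)
    diagrams: list[Diagram] = field(default_factory=list)

# A diagram can attach to the problem statement, a specific
# multiple-choice option, or a subpart's solution, which is why a
# flat "images on page N" extraction isn't enough.
p = Problem(number=7, statement="...",
            subparts=[Subpart(label="(c)", text="...",
                              diagrams=[Diagram("p7c_mo.png", "solution")])])
```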

On the plus side, having gone through a few of the recent exams more closely, I am getting reacquainted with the difficulty inflation of recent years and the types of content that I’ll have to cover in my app. I’m taking notes on problem-type composition, obscure tricks that I forgot I knew and will have to write lessons for, test-taking tips, etc. etc.

I am also dreading figuring out how to generate / grade / teach organic chemistry using LLMs. LLMs really just aren’t there yet when it comes to reasoning graphically about electron pushing and the like. (Maybe if I ask them to output SVGs?) I think this might just have to be old-school textbook style or some other placeholder. OTOH, if I do eventually figure it out, the pre-med market for teaching organic chemistry becomes a natural target audience.

On scraping PDFs

This is STILL stupidly hard. This “old” (from 5 months ago) post from HN led me to believe that we were here. We are not here yet.

Structured PDF extraction feels like a three-circle Venn diagram: extract images (with bounding boxes) from a PDF, extract structured content from a PDF, and extract from long PDFs. The kicker: it’s not even a “pick 2 of 3” situation, it’s a “pick 1 of 3” situation.

Long PDF X Structured Content: Scraping 2-3 page PDFs works fine, but as you approach 10 pages, the likelihood that the parser makes it all the way through and outputs correctly formatted JSON/CSV/whatever drops steadily. Markdown does not count as “structured” in my book.
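In practice this means wrapping every structured-extraction call in a validate-and-retry loop, because the failure mode is usually malformed or incomplete JSON rather than a clean error. A minimal stdlib-only sketch, where `call_llm` and the required keys are hypothetical stand-ins for whatever API and schema you actually use:

```python
import json

REQUIRED_KEYS = {"number", "statement", "subparts"}  # hypothetical schema

def parse_problems(raw: str) -> list[dict]:
    """Parse LLM output and reject structurally incomplete results."""
    data = json.loads(raw)  # raises JSONDecodeError on malformed output
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of problems")
    for item in data:
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"problem is missing keys: {missing}")
    return data

def extract_with_retry(pdf_text: str, call_llm, max_tries: int = 3) -> list[dict]:
    """Retry extraction until it parses; longer PDFs fail here more often."""
    last_err = None
    for _ in range(max_tries):
        try:
            return parse_problems(call_llm(pdf_text))
        except (json.JSONDecodeError, ValueError) as err:
            last_err = err
    raise RuntimeError(f"extraction failed after {max_tries} tries: {last_err}")
```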

Long PDF X Images: Gemini gives bounding boxes nicely, but it has no way of referring to multiple images: it can’t say “bounding box X1,X2,Y1,Y2 from attachment 3 of 8”.
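The workaround I’ve landed on is to send one page per request and track the page index caller-side, since the model can’t reference attachments by index itself. Gemini’s documented bounding-box convention is `[ymin, xmin, ymax, xmax]` normalized to a 0–1000 scale, so the conversion back to pixel coordinates is mechanical (a sketch; `crop_record` is just my hypothetical bookkeeping shape):

```python
def to_pixel_box(box_1000, page_width_px, page_height_px):
    """Convert a Gemini-style [ymin, xmin, ymax, xmax] box (normalized
    to 0-1000 per the Gemini docs) into pixel coordinates in PIL's
    Image.crop ordering: (left, upper, right, lower)."""
    ymin, xmin, ymax, xmax = box_1000
    return (int(xmin / 1000 * page_width_px),
            int(ymin / 1000 * page_height_px),
            int(xmax / 1000 * page_width_px),
            int(ymax / 1000 * page_height_px))

def crop_record(page_index, box_1000, width_px, height_px, label):
    """Since the model can't say "attachment 3 of 8", the page index
    is carried caller-side, one page per request."""
    return {"page": page_index,
            "pixel_box": to_pixel_box(box_1000, width_px, height_px),
            "label": label}
```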

Images X Structured Content: Gemini handles PDFs natively, but Claude requires conversion to images first, and if you want bounding boxes, Gemini requires image conversion too. Structured extraction from images is okayyy, but comes with a higher error rate than PDF input.
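The image-conversion step itself is mundane; the sketch below uses PyMuPDF (`fitz`), though pdf2image would work as well. The only detail worth noting is the scale factor: PDF user space is 72 points per inch, so rendering at a target dpi means a dpi/72 zoom matrix.

```python
def dpi_to_zoom(dpi: float) -> float:
    """PDF user space is 72 points per inch, so a target dpi maps to a
    zoom factor of dpi / 72 for the render matrix."""
    return dpi / 72.0

def render_pages(pdf_path: str, out_dir: str, dpi: int = 200) -> list[str]:
    """Render each page of a PDF to a PNG; returns the output paths."""
    import fitz  # PyMuPDF; imported lazily so the pure helper has no deps
    zoom = dpi_to_zoom(dpi)
    paths = []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            path = f"{out_dir}/page_{i:03d}.png"
            pix.save(path)
            paths.append(path)
    return paths
```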

So then I end up in this unfortunate situation where I’m orchestrating a multipart pipeline:

  • First, convert the PDF to a markdown file, preserving text content without being too precious about precise formatting. This drastically shrinks the input, from a 1 MB, 10-page PDF to a ~30 kB text file, but loses image content and relative positioning.
  • Then, convert the markdown blob into a more structured format.
  • Finally, pass the PDF page by page through a bounding-box extractor, while simultaneously asking the LLM to annotate each extracted image with “this comes from problem 7, subpart (c)”, so that I can piece things together later.
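Stitching the outputs back together then amounts to a join on (problem number, subpart label) between the structured text and the annotated crops. Roughly, with hypothetical dict shapes:

```python
def attach_images(problems: list[dict], crops: list[dict]) -> list[dict]:
    """Merge bounding-box crops (step 3) into structured problems
    (step 2), keyed on the (problem, subpart) annotation the LLM
    attached to each crop. Returns crops that matched nothing."""
    index = {}
    for p in problems:
        p.setdefault("images", [])
        index[(p["number"], None)] = p  # problem-level diagrams
        for sp in p.get("subparts", []):
            sp.setdefault("images", [])
            index[(p["number"], sp["label"])] = sp
    orphans = []
    for crop in crops:
        target = index.get((crop["problem"], crop.get("subpart")))
        if target is None:
            orphans.append(crop)  # mis-annotated crops go to manual review
        else:
            target["images"].append(crop["path"])
    return orphans
```

The orphan list matters in practice: the annotation step is itself an LLM call, so some crops come back tagged with a problem or subpart that doesn’t exist.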

Debugging the orchestration layer while also debugging the individual steps was stupidly annoying. So I am now experimenting with providing the above steps as an MCP server, and letting the Claude CLI / desktop app orchestrate everything instead of writing the orchestration code myself. The fun part is that I can now file my scraping costs under my Claude Pro subscription instead of paying per token! (Sorry, I may be part of the reason why these subscriptions are getting limited…) The other fun part is that when you use Claude Desktop as the orchestration layer, you naturally get to use the Claude interface to debug what the orchestration is doing: what artifacts it’s generating, where in the process it is, and whether it ran out of context window (type “continue” to continue, with surprisingly graceful partial-task recovery).
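For reference, the shape of that server is small. A sketch assuming the official `mcp` Python SDK’s FastMCP helper; the tool bodies are stubs standing in for the real steps:

```python
def pdf_to_markdown(pdf_path: str) -> str:
    """Step 1: lossy PDF -> markdown conversion (stub)."""
    raise NotImplementedError

def markdown_to_problems(markdown: str) -> str:
    """Step 2: markdown -> structured problems as JSON (stub)."""
    raise NotImplementedError

def extract_page_images(pdf_path: str, page: int) -> str:
    """Step 3: bounding-box extraction and annotation for one page (stub)."""
    raise NotImplementedError

def build_server():
    # Imported lazily; requires `pip install mcp`.
    from mcp.server.fastmcp import FastMCP
    server = FastMCP("pdf-scraper")
    for tool in (pdf_to_markdown, markdown_to_problems, extract_page_images):
        server.add_tool(tool)
    return server

if __name__ == "__main__":
    build_server().run()  # Claude Desktop / CLI connects as the orchestrator
```

The point of the split into three tools is exactly the pipeline above: Claude decides the ordering, retries, and bookkeeping, and each tool stays a dumb, individually debuggable step.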