Startup update 6: Usage plans+limits

Originally posted 2025-06-01

Tagged: cartesian_tutor, llms

Obligatory disclaimer: all opinions are mine and not of my employer


Progress update

I implemented usage and cost logging, a limited free plan/trial (which was my biggest blocker to putting something online), and polished the CSS/UI. For now, if you exceed the freebie tier, you’ll just get a message saying “sign up for the waitlist”; I’ll follow up with the billing implementation, hopefully within the next week. There was a surprising amount of state I had to manage to surface subscription and free-plan status (blocking new conversations, blocking new messages, etc.). The CSS polish was a sobering exercise in seeing how much counterproductive and extraneous CSS Cursor had generated for me :P.
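For the curious, the gating itself is conceptually simple; the fiddly part is threading the resulting state through the UI. Here’s a minimal TypeScript sketch of the shape of it - the names, types, and limits are illustrative, not my actual implementation:

```typescript
// Illustrative types and limits - not the real implementation.
type Plan = "free" | "subscribed";

interface UsageStatus {
  plan: Plan;
  messagesUsed: number;  // messages sent in the current period
  messageLimit: number;  // the freebie-tier allowance
}

type Gate =
  | { allowed: true }
  | { allowed: false; reason: "waitlist" }; // surfaced as “sign up for the waitlist”

// Decide whether the user may start a new conversation or send another message.
function gateAction(status: UsageStatus): Gate {
  if (status.plan === "subscribed") return { allowed: true };
  if (status.messagesUsed < status.messageLimit) return { allowed: true };
  return { allowed: false, reason: "waitlist" };
}

// Example: a free user who has exhausted the freebie tier gets the waitlist message.
const blocked = gateAction({ plan: "free", messagesUsed: 50, messageLimit: 50 });
console.log(blocked); // { allowed: false, reason: "waitlist" }
```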

I demoed the app in person to a number of folks of varying demographics, and the high-level bit is that Socratic teaching works pretty well for both conceptual learning and problem solving. I was amused to find the LLM occasionally glossing over mistakes the users made, almost as if they hadn’t made them. For example, a 4th grader made an arithmetic mistake - they multiplied instead of dividing on both sides - and the LLM just pretended they had done the right thing. This happens with roughly 10% of mistakes. I was less amused to find that the behavior is highly irreproducible. I considered adding “Be nitpicky” to the prompt, but I’m honestly not sure it helps because, again, irreproducible. 🙁 In the end, this is highly context-dependent - maybe such arithmetic mistakes are in the same tier as typos and grammatical errors, and it would be annoying if the LLM nitpicked those!
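If I ever do want to put a number on “roughly 10%,” the obvious shape is to replay the same planted mistake many times and count how often the tutor flags it. A rough sketch - the model name, prompts, and the keyword scoring are all placeholders, not what the app actually does:

```typescript
// Crude repeated-sampling harness to estimate how often the tutor flags a
// planted arithmetic mistake. Model, prompts, and scoring are placeholders.
import OpenAI from "openai";

const client = new OpenAI();

const SYSTEM = "You are a Socratic math tutor. Guide the student with questions.";
const STUDENT = "To solve 3x = 12, I multiplied both sides by 3 and got x = 36.";

async function flaggedMistake(): Promise<boolean> {
  const resp = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: SYSTEM },
      { role: "user", content: STUDENT },
    ],
  });
  const reply = resp.choices[0].message.content ?? "";
  // Naive scoring: does the reply question the multiply-vs-divide step?
  return /divid|multiplied|check that step/i.test(reply);
}

async function main(n = 20) {
  let flagged = 0;
  for (let i = 0; i < n; i++) {
    if (await flaggedMistake()) flagged++;
  }
  console.log(`Flagged the mistake in ${flagged}/${n} runs`);
}

main();
```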

Now, if only there were someone with expertise in LLM quality evaluation who knew of a methodology for measuring and improving these sorts of irreproducible LLM quirks… (sarcasm alert: that’s me.) I do think it’s telling that, despite being an expert on this, I literally don’t think it’s yet important enough to do. Maybe orienting Lilac + Databricks agent eval around agent quality was always an exercise in wishful thinking. (This perspective alone is worth the month I’ve spent on this startup!) It blows my mind that the Databricks Agent Eval team invested something like 4-5 person-years into building an eval product when, by simply giving one person the freedom and license to just fuck off for 2 months and attempt to actually build an agent, they could have learned what issues people actually go through when building agents. I mean, yes, we talked to people, but clearly we didn’t learn the right lessons, or we were too strongly anchored on the idea of quality.

On the go-to-market side of things, I talked with Fred, an old friend and cofounder of PrepScholar, about how he found PMF. His answer? Set up a gazillion landing pages, throw some money into Google ads, and see which landing pages convert. He knew they’d found something when they put up an SAT prep landing page and got three phone calls that same day from parents who wanted to sign their kid up and were disappointed not to find a working app behind the landing page. Perhaps Google ads / SEO is more of a played-out game with less alpha these days, but the general idea probably still holds.