Startup update 8: YOLO mode

Originally posted 2025-06-16

Tagged: cartesian_tutor, llms

Obligatory disclaimer: all opinions are mine and not of my employer


Progress update

This past week, I visited BCA, my old high school, to give a talk to the math team + compsci students and pitch Cartesian Tutor. I think it went okay, but obviously there’s a lot to learn and improve here as far as my pitch goes. I can see from the logs that a few kids tried the app, but nothing sticky. Based on how the talk landed and the initial queries the kids put in, I did some basic polish passes over the front page, app UI, and help pages.

In addition to students and parents, professionals are a third potential target audience. I am increasingly using my own app to teach myself the basics of XYZ, and I’ve been surprised at how well it works. Specifically, in https://www.moderndescartes.com/essays/llm_shibboleths I talk about how you need basic fluency in a topic to effectively prompt LLMs. Cartesian Tutor is actually pretty good at getting me to that basic fluency level in, e.g., marketing, SEO, OAuth2, frontend work, etc.!

I think my biggest problem right now is lack of feedback. In the grand user funnel of “awareness -> interest -> willingness to try out -> using the product correctly/effectively -> convinced to pay”, I have minimal signal, so I’m not sure where I need to improve. I also don’t really have my messaging ironed out as far as students vs. parents go.

To solve these two problems, I think I need (1) more volume of attempts through my funnel and (2) more feedback signal per failed attempt.

So my call to action for you folks: try out the app, starting from the homepage, and have a conversation long enough to trigger the “you ran out of free messages” limit (it should take about 5-20 back-and-forths). In exchange, I will happily upgrade you to an unlimited plan for life* (*contingent on the startup/app continuing to exist). To qualify for this offer, tell me at least three places where you found the “next step” of the funnel nonobvious.

YOLO mode

Nowadays the hypesters are talking about autonomous agent mode and how the “best” engineers are invoking multiple Claude instances and letting them run in YOLO mode in the background. I tried this out and came away with a few thoughts:

  • This is partly a response to just how long it can take agents to do their thing. A lot of time is required for the agent to plan, scan your codebase to figure out where to get started, make changes, and iterate.
  • This is partly a “leap of faith” marketing pitch; Cursor/Windsurf/Claude Code want you to just try it and see how well it does; if it works well, then they can up their prices/revenue by an order of magnitude.
  • It almost works: about 15% of the time I can accept the changes as-is; about 35% of the time I want minor fixups; and 50% of the time it just used a bad approach that makes me think, “I’m glad I’m on a fixed-price Claude Pro plan, because if I’d paid a dollar for Claude to generate this garbage, I’d be pretty annoyed.” At the same time, because it’s background mode and not really consuming my attention, the human cost of reviewing the code is lower than if I were babysitting the agent coding process.
  • The “plan then execute” philosophy is important because it cuts off the bad 50% of approaches from the start, before they waste everyone’s time and tokens (see the sketch after this list).
  • Nowadays, I am thinking mostly about the non-code parts of my app. That makes background agent mode a nice complement: I can have it tackle minor bugfixes/polish issues while I think about marketing/sales. I imagine that if I were working a full-time job, 50% of my time would be consumed by Slack messages, design-doc reviews, meetings, email, etc. An agent running in the background demands just the right amount of “occasional check-in to review code” to complement that 50% of fragmented attention time.
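
To make the “plan then execute” point concrete, here’s a minimal sketch of the gating pattern in Python. The llm() helper is a hypothetical placeholder (stubbed out below), not Cursor’s or Claude Code’s actual API; the point is just that the cheap planning phase gets a human veto before the expensive execution phase runs.

    def llm(prompt: str) -> str:
        """Stub standing in for a real model call (hypothetical helper)."""
        return f"<model response to: {prompt[:40]}...>"

    def plan_then_execute(task: str) -> str:
        # Phase 1: a cheap, human-reviewable plan -- a few hundred tokens.
        plan = llm(f"Outline a step-by-step plan (no code yet) for: {task}")
        print(plan)
        # The veto happens here, before any tokens are spent on execution.
        if input("Execute this plan? [y/N] ").strip().lower() != "y":
            return "aborted: bad approach caught at the planning stage"
        # Phase 2: the expensive part only runs for approved plans.
        return llm(f"Implement this plan:\n{plan}\n\nTask: {task}")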

Overall my take is, “I don’t really see this scaling to autonomous coding at this success rate,” especially because autonomous mode gets far less user-micromanagement feedback, so any RL signal these coding-agent companies collect is much, much weaker. The interactive sessions where I micromanage Claude are probably far more instructive for their RL training efforts. OTOH, perhaps this is the lesson to take away from the DeepSeek GRPO paper: that end-to-end traces of accepted vs. rejected chains are “enough” to RL your way to a better model.
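
To unpack that last sentence: GRPO needs no learned critic; it scores a group of sampled traces for the same task and normalizes each trace’s reward against the group’s mean and standard deviation. Below is a minimal sketch of that advantage computation, using a made-up batch of binary accept/reject rewards (the kind of signal an autonomous agent’s “accept the diff?” button would produce). The formula matches the GRPO paper; the scenario is my own illustration.

    import numpy as np

    def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
        # Group-relative advantage: normalize each trace's reward
        # against the group mean/std -- no value function needed.
        return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Hypothetical: 8 agent traces for one task; the user accepted
    # three of the resulting diffs (1.0) and rejected five (0.0).
    rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])
    print(grpo_advantages(rewards))
    # Accepted traces get positive advantage, rejected ones negative,
    # so the policy is nudged toward whatever the accepted traces did.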