LLM Shibboleths determine AI effectiveness

Originally posted 2025-05-28

Tagged: llms, software engineering

Obligatory disclaimer: all opinions are mine and not my employer’s


Coding assistants promise to revolutionize software development, so why do some developers sing their praises while others find them useless? The answer lies between the keyboard and the chair, but it’s more than simple user error. Your level of expertise silently shapes the way you interact with the AI, allowing two people to have completely different experiences with the same AI on the same subject.

In this essay I’ll discuss how this is possible and what you can do about it.

What’s an LLM Shibboleth?

LLM training datasets include novices asking questions that don’t make sense, as well as experts who know exactly which piece of information they’re missing and how to ask for it. How you ask your question determines which part of the Internet hivemind you get back. The original shibboleth was a habit of the tongue, near-undetectable by the outgroup, yet characteristically identifying ingroup members to each other. Experts use LLM shibboleths constantly without even realizing it.

Here’s a simple example. When I ask “what can cause runny nose”, I get Dr. Google fluff mode, complete with lawyer-coached warnings to consult my doctor for medical advice; but when I ask “runny nose differential diagnosis”, I get Med Student mode, with the actual medical braindump that I wanted.

As an experienced backend engineer, I know exactly what to ask Cursor for, and in what order - “Add a deleted_at field to the X model.” “Ensure that CRUD methods for X filter out rows where deleted_at is not null.” “Add a delete endpoint that soft-deletes X.” The phrases “deleted_at” (as opposed to “is_deleted”), “migration”, “CRUD”, “not null”, and “soft delete” are all shibboleths that conjure precisely the parts of the AI subconscious corresponding to “experienced backend engineer”, and the text of my prompts matches precisely the Git commit summaries that an experienced backend engineer would write for that pull request - since that’s presumably what these LLMs were trained on.
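
For concreteness, here’s a minimal sketch of what those three prompts should add up to, assuming an Express app backed by node-postgres; the items table and routes are illustrative stand-ins, not the actual project’s schema.

```typescript
// A sketch of the soft-delete pattern the prompts above describe.
// Assumes an Express app and a node-postgres (pg) pool; the "items"
// table and routes are illustrative, not from the original project.
import express from 'express';
import { Pool } from 'pg';

const app = express();
const pool = new Pool(); // connection settings come from PG* env vars

// Migration (run once, outside the app):
//   ALTER TABLE items ADD COLUMN deleted_at TIMESTAMPTZ;

// Reads exclude soft-deleted rows rather than ever dropping them.
app.get('/items', async (_req, res) => {
  const { rows } = await pool.query(
    'SELECT * FROM items WHERE deleted_at IS NULL'
  );
  res.json(rows);
});

// DELETE soft-deletes: it stamps deleted_at instead of removing the row.
app.delete('/items/:id', async (req, res) => {
  await pool.query(
    'UPDATE items SET deleted_at = now() WHERE id = $1 AND deleted_at IS NULL',
    [req.params.id]
  );
  res.status(204).end();
});

app.listen(3000);
```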

As a novice frontend engineer, I don’t have the vocabulary to ask for anything more precise than “Make the listX call happen only once the auth token is available”, and in response, Cursor injects a spaghetti-code callback on top of the auth function, auth().onSuccess(() => store.listX()). Completely useless. (To be fair to the AI, maybe this is representative of the median frontend codebase?) Later on, as I learned more frontend and had to debug the race conditions and inconsistent state transitions introduced by these earlier AI-assisted coding sessions, I came to understand just how buggy the generated code was.
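
To make the failure mode concrete, here’s a hedged sketch contrasting that callback with a store-driven version that can’t double-fire; authToken, items, and /api/x are illustrative names, not the app’s real code.

```typescript
// A sketch of the race-prone callback vs. a guarded alternative.
// Assumes a Svelte app; authToken, items, and /api/x are illustrative
// stand-ins for the real auth store and listX call.
import { writable } from 'svelte/store';

const authToken = writable<string | null>(null);
const items = writable<unknown[]>([]);

// What the agent produced (paraphrased): a success callback bolted
// onto the auth function. If auth() succeeds again later (e.g. on a
// token refresh), listX fires again, unordered against other state:
//
//   auth().onSuccess(() => store.listX());

// One more robust shape: react to the token store itself, and guard
// so the initial fetch happens exactly once.
let fetched = false;
authToken.subscribe(async (token) => {
  if (token === null || fetched) return;
  fetched = true;
  const res = await fetch('/api/x', {
    headers: { Authorization: `Bearer ${token}` },
  });
  items.set(await res.json());
});
```

An expert wouldn’t necessarily write this exact guard, but they would know to prompt for “derive the fetch from the token store” rather than “make the call happen after auth” - which is the shibboleth at work.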

When using Perplexity, I’ve learned to trust my spidey sense. Often, when I dig into the sources, I find that Perplexity has chosen to summarize a series of forum questions from confused novices - the ones whose phrasing most closely matched my own novice phrasing in a domain I didn’t know much about.

Gell-Mann AmnesiAI

Gell-Mann Amnesia is the tendency of individuals to critically assess media reports in a domain they are knowledgeable about, yet continue to trust reporting in other areas despite recognizing similar potential inaccuracies.

I’m currently building an AI tutoring app, and I asked it to teach me about the Carnot cycle - something I’m vaguely familiar with but never felt I truly understood. I came away very impressed by its explanatory abilities. But when I later tried to get it to do Olympiad math problems - something I have extensive past experience with, and something OpenAI and Anthropic explicitly train their models for - I could see that it was spouting absolute nonsense, despite various claims that frontier models can handle math olympiad problems.

What surprises me most about AI is how frequently I go from a seemingly good, trust-building interaction straight into an incoherent, trust-destroying one in a different domain. Maybe what’s happening is that LLMs talk so smoothly that they trigger some underlying ‘deference to authority’ instinct in humans - the textual equivalent of charisma. Whatever it is, the effect is hard to ignore even when you know it’s happening.

Obligatory coding agent tips

Given these challenges in using LLMs properly, here are some strategies I’ve found useful:

Universal advice:

  • Frequent commits: you will frequently want to nuke the entire 500-line change the agent spat out, and it’s far easier to do this when you can just run git restore . and start over.
  • Aim to generate ~1 bug per agent ask. You need a sense of how frequently LLMs write bugs so you can dynamically calibrate how much code you ask for in one shot. Once you have multiple bugs, they interact with each other and make debugging exponentially harder.
  • You must test; otherwise bugs slip by and you run into the compounding debugging difficulties from the previous bullet point.
  • Restart if the agent is going in circles; long conversations are bad because of anchoring effects (https://arxiv.org/abs/2505.06120).
  • Read the release notes from Cursor/Windsurf; you will often go, “wait, what, that was possible?”.

For novices:

  • You can ask the AI to generate an “onboarding document”, e.g. “write an onboarding section in the README for configuring postgres for local development”. This will often reveal unknown unknowns.
  • Build a foundational understanding of the domain you’re working in. Look for human-written explanation docs. LLMs are particularly bad at generating explanation-style documentation.

For experts:

  • Be mindful of architectural decisions. Set a good example for the AI to use as a template for additional features. (This is not that different from how junior devs work in an enterprise context!)
  • Lean on your intuition for how you would break up a task for a more junior employee.
  • LLMs have a good intuitive grasp of coding styles and refactoring techniques; they just need you to tell them which one to use.
  • Use the AI enough to understand where its weaknesses and strengths are, and fill in the gaps with your experience.

This interview with Anthropic engineers is excellent; around the 32-minute mark, they stress the importance of having a human think about large-scale code quality and codebase organization. I suspect that vibecoded apps will start collapsing under their own complexity after roughly 3-5 major features; under that threshold, they’re probably good enough.

Conclusion

While building my product’s backend, I was pretty impressed with how quickly Cursor helped me put together something working. I could see where the AI was strongest and weakest, and after some false starts where I tried vibecoding and auto-accepting a bit too much code at once, I found a nice rhythm where I could smash out features as fast as I could think through how they should be built.

In comparison, while building the frontend, ~75% of my time has been spent reading documentation on TypeScript, Svelte, OAuth2, browser APIs, etc., so that I can build up my frontend expertise to the point where I (1) understand the code being generated, (2) understand the “right” architecture, and (3) know the right shibboleths to prompt the AI with. The actual AI-assisted coding and testing takes only 25% of my time.

A prerequisite to becoming an expert is developing enough taste to recognize quality work. Today, LLMs can generate quality work - when prompted appropriately. I think artisanal code will eventually become a curiosity, much like luxury mechanical watches or bespoke tailoring. We’re entering an era where we’ll see large quantities of industrially produced code, designed by a smaller number of experts with the taste and skill to spec out products - even if they’ve only mechanically written code a few times in their lives, just to get a feel for it. Even though I believe this current AI bubble won’t make it to AGI, I also believe there will be plenty of change to cope with.