LLM Shibboleths determine AI effectiveness

Originally posted 2025-05-28

Tagged: llms, software engineering

Obligatory disclaimer: all opinions are mine and not those of my employer


Coding assistants promise to revolutionize software development, so why do some developers sing their praises while others find them useless? The answer lies between the keyboard and the chair, but it’s more than simple user error. Your level of expertise silently shapes the way you prompt, so two people can have completely different experiences with the same AI on the same subject.

In this essay I’ll discuss how this is possible and what you can do about it.

What’s an LLM Shibboleth?

LLM training data includes novices asking questions that don’t make sense, as well as experts who know exactly which piece of information they’re missing and ask for it precisely. How you phrase your question determines which part of the Internet hivemind you get back. The original shibboleth was a habit of the tongue, near-undetectable by the outgroup, yet unmistakably identifying ingroup members to each other. Experts use LLM shibboleths constantly without even realizing it.

Here’s a simple example. When I ask “what can cause runny nose”, I get Dr. Google fluff mode, with lawyer-coached advice warning me to consult my doctor, but when I ask “runny nose differential diagnosis”, I get Med Student mode with the actual medical braindump that I wanted.

As an experienced backend engineer, I know exactly what to ask Cursor for, and what order to ask it in - “Add a field to the X model for deleted_at”. “Ensure that CRUD methods for X filter on deleted_at not null”. “Add a delete endpoint that soft-deletes X.” The phrases “deleted_at” (as opposed to “is_deleted”), “migration”, “CRUD”, “not null”, and “soft delete” are all shibboleths that conjure precisely the parts of the AI subconscious that correspond to “experienced backend engineer”, and the text of my prompts matches almost exactly the Git commit summaries that an experienced backend engineer would write for that pull request - since that’s presumably what these LLMs were trained on.
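To make the pattern concrete, here’s a rough sketch of what those prompts amount to, written in TypeScript with purely illustrative names (X, listX, softDeleteX) and an in-memory array standing in for the database:

```typescript
// Illustrative model: deleted_at records *when* a row was soft-deleted,
// unlike an is_deleted boolean, so it doubles as an audit trail.
interface X {
  id: number;
  name: string;
  deleted_at: Date | null; // null means the row is live
}

// CRUD reads filter on deleted_at being null, so soft-deleted rows
// disappear from normal queries without ever leaving the table.
function listX(rows: X[]): X[] {
  return rows.filter((row) => row.deleted_at === null);
}

// The "delete" endpoint stamps deleted_at instead of removing the row.
function softDeleteX(rows: X[], id: number): boolean {
  const row = rows.find((r) => r.id === id && r.deleted_at === null);
  if (!row) return false; // already deleted, or never existed
  row.deleted_at = new Date();
  return true;
}
```

In a real backend the filtering would of course happen in the database query rather than in memory, but the shape of the feature is exactly what those three prompts describe.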

As a novice frontend engineer, I don’t have the vocabulary to ask anything more precise than “Make the listX call happen only once the auth token is available”, and in response, Cursor injects a spaghetti-code callback on top of the auth function: auth().onSuccess(() => store.listX()). Completely useless. (To be fair to the AI, maybe this is representative of the median frontend codebase?) Later on, as I learned more frontend and had to debug race conditions and inconsistent state transitions introduced by these earlier AI-assisted coding sessions, I started to understand just how buggy the generated code was.
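For contrast, here’s a rough sketch of what I eventually wanted instead of the callback spaghetti. It assumes the auth token lives in a Svelte store; the store and function names (authToken, items, listX) are hypothetical stand-ins, not my actual code:

```typescript
import { writable } from 'svelte/store';

// Hypothetical stores: the token starts out null and is set once auth completes.
export const authToken = writable<string | null>(null);
export const items = writable<unknown[]>([]);

// Hypothetical API call that needs the token before it can succeed.
async function listX(token: string): Promise<unknown[]> {
  const res = await fetch('/api/x', {
    headers: { Authorization: `Bearer ${token}` },
  });
  return res.json();
}

// React to the token becoming available, and guard against firing twice,
// instead of hanging the fetch off the auth function's success callback.
let fetched = false;
authToken.subscribe((token) => {
  if (!token || fetched) return;
  fetched = true;
  listX(token).then((data) => items.set(data));
});
```

In an actual Svelte component you’d more likely express this as a reactive statement, but the point is the same: the fetch is gated on application state rather than bolted onto the auth flow’s success callback.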

When using Perplexity, I’ve learned to trust my spidey sense. Often when I dig into the sources, I find that Perplexity has chosen to summarize a series of forum questions from confused novices, whose question phrasing most closely matched my novice phrasing, in a domain I didn’t know much about.

Gell-Mann AmnesiAI

Gell-Mann Amnesia is the tendency to critically assess media reports in a domain you know well, yet keep trusting reporting in other areas, despite having just seen how inaccurate it can be.

I’m currently building an AI tutoring app, and I asked it to teach me about the Carnot cycle - something I’m vaguely familiar with but never felt like I truly understood. I came away very impressed by its explanatory abilities. But when I later tried to get it to do olympiad math problems - something I have extensive past experience with, and something OpenAI and Anthropic explicitly train for - I could see that it was spouting absolute nonsense, despite various claims that frontier models can handle math olympiad problems.

I think what is most surprising to me about AI is how frequently I go from having a seemingly good, trust-building interaction with the AI, to having another incoherent trust-destroying interaction in a different domain. Maybe what’s happening is that LLMs talk so smoothly that they trigger some underlying ‘deference to authority’ instinct in humans - the textual equivalent of charisma. Whatever it is, it’s hard to ignore the effect even when you know it’s happening.

Conclusion

While building the backend of my product, I was pretty impressed with how quickly Cursor helped me get it working. I could see where the AI was strongest and weakest, and after some false starts where I tried vibecoding and auto-accepting a bit too much code at once, I found a nice rhythm where I could smash out features as fast as I could think through how they should be built.

In comparison, while building the frontend of my product, ~75% of my time has been spent reading documentation on TypeScript, Svelte, OAuth2, browser APIs, etc., so that I can build up my frontend expertise to the point where I (1) understand the code being generated, (2) understand the “right” architecture, and (3) know the right shibboleths to prompt the AI. The actual AI-assisted coding + testing only takes 25% of my time.

A prerequisite to becoming an expert is developing enough taste to recognize quality work. Today, LLMs can generate quality work - when prompted appropriately. I think artisanal code will eventually become a curiosity, much like luxury mechanical watches or bespoke tailoring. We’re entering an era where we’ll see large quantities of industrially produced code, designed by a smaller number of experts with the taste and skill to spec out products - even if they’ve only mechanically written code a few times in their lives, just to get a feel for it. Even though I believe this current AI bubble won’t get us to AGI, I also believe there will be plenty of change to cope with.