
Blog posts

How to cure AI psychosis

To any afflicted CEO in my readership: I want to cure your AI psychosis.

⚠️ First, a disclaimer ⚠️ This post does not discuss legitimate medical issues; I'm not a doctor and none of this is medical advice. Talk to a real medical doctor if your mental health is in question.

As a quick aside, I'm experimenting with recording my blog posts for YouTube. So if you'd rather see that, here it is:

View on YouTube

Instead, let's talk about the fun kind of AI psychosis. The kind where people see LLMs as a genie-in-the-bottle, able to grant any wish.

Aaron Levie, the CEO of Box, recently tweeted that CEOs are uniquely prone to this kind of AI psychosis. He says that CEOs do not perform the "last mile" of work. They never see the effort required to coax LLMs to perform useful work. They just see the happy path. "Here ya go boss! I made this prototype with an LLM prompt, just like you asked."

If you're a CEO and you're working daily with LLMs and you have a good sense of their limitations: I hope you're having a good day! You can close the tab. But for the rest of you, buckle up, because we need to talk.

Anyone riding the NYC subway these days can observe this themselves. The ads plastering the trains are preposterous. They make promises like, "make an entire presentation with a single prompt." Obviously this is disingenuous. But there must be some subway riders influenced by this, at least enough to visit the website. And those people must somehow be responsible for inking SaaS contracts. And you might wonder, "Why can't it? Why can't it make a whole presentation with a single prompt?" Let's save that question for the end of this post. We can decide together whether it's just "a single prompt" or there is an incredible amount of hidden work.

The answer has to do with two questions.

  1. How easily can the LLM supplement its context?
  2. How quickly can the LLM verify its output?

But we'll talk about that later.

Let's go back to the beginning.

At release, ChatGPT could do things that no computer had ever done before. You could type any request you wanted, and it could do it. Need to write an email to a business? ChatGPT can do it. A poem about your friend in the style of Walker, Texas Ranger? Giddyup partner.

So why didn't the entire world lose their jobs at this point? Because LLMs were absolutely useless for real work. They confidently lied, over and over. They fabricated citations. They became legitimately unhinged if a conversation extended more than a few rounds of back-and-forth. They couldn't follow simple step-by-step instructions without getting lost. They generated code that wouldn't even compile, or used imaginary APIs on imaginary libraries. They couldn't count the letters in a word. They failed at trivial reasoning exercises. Sometimes their rudimentary safety checks prevented you from doing mundane tasks, but other times the LLM urged people to kill themselves. And that's an incomplete list of their shortcomings. They had some utility, but the outputs were largely a novelty.

But LLMs have improved so quickly over the past few years. Each of the above problems has better mitigations, and LLMs have gained new capabilities. They got chain-of-thought reasoning strengthened with reinforcement learning. They can call tools. There are now protocols that allow LLMs to interact with your data in any application. Every few months they get model upgrades that supercharge their abilities.

And they are world beaters. They are solving novel math problems that humans have been trying to solve for decades. They can perform complicated research tasks and produce convincing summaries of the results. They beat software engineers in coding competitions. And by all accounts, they can write college-level essays.

So I don't blame anyone for drawing the trend line from these capabilities and saying, "if an LLM can perform high-level mathematics, or implement a complicated step-by-step pull request, then surely it can do any job."

But I have to level with you: in all that time, with all of those improvements, they have gone from "useless for real work" to "somewhat useful for real work." They cannot produce a final product unsupervised if the quality matters at all.

They will save some time if you are skilled at using LLMs. But LLM use is a skill, and it's a skill with a low skill floor and a high skill ceiling. An unskilled person will take longer to produce a professional-level output than if they had done it themselves. You can also be an LLM veteran and do something that worked before, and it just produces a worse result this time. Unlucky.

Panel 1: How CEOs think LLMs work - almost all time is saved except the part the LLM runs. Panel 2: How LLMs actually work. You have to assemble context, run the LLM, and verify its output, and you do save a little bit of time but it's not almost the whole task.

This diagram isn't meant to be pixel perfect. But I stand by the idea; for each LLM prompt, there is some intrinsic amount of context building and some intrinsic amount of verification. And if the output is too far from the desired result, then you're likely losing time.

If you want to break outside of the tech sphere, go talk to a competent lawyer about how useful they find LLMs. They're great, they save so much time! You can summarize really complex documents and generate briefs and do all of these other things! But you need to check every single thing that it outputs. You don't want to build your argument on a hallucinated case or an incorrect summary. You don't want to email LLM slop to opposing counsel.

What the hell? Why can LLMs solve math problems, but you can't rely on them to build a legal case without obsessively double-checking everything they do? Why can an LLM beat software engineers at a coding competition, but it can't build a simple presentation, a task that many people might delegate to an intern?

Let's revisit the two questions I wrote above. They explain the whole story.

  1. How easily can the LLM supplement its context?
  2. How quickly can the LLM verify its output?

And let's look at this through the lens of "Build a presentation with a single prompt." Let's really consider everything that you would need to build a successful presentation.

How easily can the LLM supplement its context?

Ok, what context is required to build a presentation in a business setting? Let's say that you're building a pitch deck for a new product offering, and you want potential customers to buy. Your customers might want to know…

  1. What problem they are currently experiencing.
  2. What value the new product offering delivers.
  3. A high-level overview of how the product works.
  4. Data that reinforces your story.
  5. Industry research explaining how similar companies solve this problem.
  6. Social proof from someone that tried the product.
  7. The story about how they successfully migrate to the product.
  8. Any offers, trials, concierge services, etc. that make it easier to perform a pilot project for the service.

To build a successful presentation, you might need to additionally consider…

  1. How long is the scheduled time?
  2. Is there anything that would piss off your investors?
  3. Is there anything that would piss off key stakeholders?
  4. Do you know something about the audience that should be incorporated into the presentation, like their tech stack, stated preferences, etc?
  5. Do you have the ability to request that new art assets be made, or are you stuck working from some pre-canned set?

You also might not want them to know a few things.

  1. That you'd accept a steeper discount if they said they were trying a competitor.
  2. There's a feature that you're pitching that is still technically a roadmap feature and not code complete.
  3. Everything is a giant shitshow and we're just keeping it together for public appearances.

And the list goes on. So can the LLM easily supplement its context? Obviously not! The LLM might need to snake its tendrils into every single data source that your company has, including the brains of people who work there. And at a large organization, you will have so much data that it cannot fit into the context window of an LLM. So now the LLM needs to solve the open research question of "how do I find the exact right context that I need to answer all of these questions?"

How quickly can the LLM verify its output?

How do you need to verify a presentation that you didn't write?

First, you need to flip through the presentation. Look for any weird outputs. Are there overlapping regions? Things that are a little too close? When you put two images next to each other, do they visually clash in some way? Some of this could be handled by the LLM but some of it is just the human experience.

Then you need to consider the structure. Is this a compelling pitch? Is this ordered correctly? Should any slides be omitted? Is anything missing?

Then you need to nitpick the content. Are all of the sentences correct? Is it using good word choices throughout the presentation? Did it misunderstand any of the source material, so that something needs to be reworked for accuracy?

And finally, you'll want to actually give the presentation slide-by-slide. Does it all flow naturally when spoken out loud? Are there any unnatural moments? Does anything need to be restructured or reworked?

The LLM cannot really verify its output at all, so it can't iterate and recurse on the problem. But there are aspects that the LLM might be able to assist with, like finding source documents, answering questions about data that it can access, and generating the slides. So it's very likely that you could build the presentation faster, but there's also a tremendous amount of hidden work involved: uncovering the full context of your problem, deciding what you need to present, and being a strong advocate for the audience in mapping the LLM output into a convincing presentation.

The cure for LLM psychosis

The cure is simply the ability to find the hidden work.

If the LLM can add to its own context via simple searches, or even perform its own logical thinking like building its own lemmas and theorems when investigating a math problem, then the problem is in the LLM's sweet spot. It doesn't need people at all, it can just churn by itself! However, if it needs to sort through more information than it can access within its own context, then the LLM is going to need extra assistance and there is hidden work.

If the LLM is working on a programming problem and can simply write its own unit tests and run the program on sample inputs to verify its own outputs, then you are in the LLM's sweet spot. However, once verification depends on questions like "How do human beings feel when they view the output?" and "Does this otherwise-correct change introduce unwanted emergent behavior when run in the production system?", there starts to be a lot of extra hidden work in getting the LLM to function within an ecosystem.
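To make the verification half concrete, here is a minimal sketch of the loop that cheap verification makes possible. Everything in it is hypothetical: `propose` stands in for an LLM call, `verify_abs` stands in for running generated unit tests on sample inputs, and the stub proposer just replays three canned attempts. It is not how any particular agent harness works.

```python
from typing import Callable, Optional, Tuple

# Hypothetical sample inputs and expected outputs for an absolute-value function.
SAMPLE_CASES = [(3, 3), (-4, 4), (0, 0)]


def verify_abs(candidate: Callable[[int], int]) -> Optional[str]:
    """Return None if every sample passes, otherwise a description of the first failure."""
    for given, expected in SAMPLE_CASES:
        actual = candidate(given)
        if actual != expected:
            return f"f({given}) returned {actual}, expected {expected}"
    return None


def iterate_until_verified(
    propose: Callable[[Optional[str]], Callable[[int], int]],
    verify: Callable[[Callable[[int], int]], Optional[str]],
    max_attempts: int = 5,
) -> Tuple[Optional[Callable[[int], int]], int]:
    """Ask for candidates until one verifies or the attempt budget runs out."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = propose(feedback)  # stand-in for prompting a model with the last failure
        feedback = verify(candidate)
        if feedback is None:
            return candidate, attempt
    return None, max_attempts


def make_stub_proposer():
    """Stand-in for an LLM: replays two wrong attempts, then a correct one."""
    attempts = iter([
        lambda x: x,
        lambda x: -x,
        lambda x: x if x >= 0 else -x,
    ])

    def propose(feedback):
        return next(attempts)

    return propose


solution, attempts_used = iterate_until_verified(make_stub_proposer(), verify_abs)
print(f"verified after {attempts_used} attempts")
```

The loop only works because `verify_abs` is fast and mechanical. Swap in a check like "does the audience find this slide convincing?" and there is nothing to put in `verify`, which is where the hidden work comes from.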

Why are we getting worse at software engineering?

Software quality is obviously getting worse. I'm not talking about LLM slop features nobody asked for. I'm not talking about services collapsing under unprecedented LLM-powered demand. Companies are obviously shipping user-visible bugs at an accelerating rate. And consequently, the software we are using is getting worse and worse.

As a quick aside, I'm experimenting with recording my blog posts for YouTube. So if you'd rather see that, here it is:

View on YouTube

Back to our regularly scheduled blogging!

You can thank GitHub for this post. I recently commented on a deleted line in GitHub and hit submit. An error dialogue appeared saying that some block of client-side code couldn't find the line ID. GitHub has had almost two decades to perfect "comment on code." Yet it regressed. And hilariously, I just triggered a Google Docs copy/paste bug while typing this[0]. And heaven help any heavy Claude Desktop users. Everything's getting worse around us.

I blame three factors for this.

  1. As you write code faster, the acceptable error rate drops.
  2. Implementation time is becoming decoupled from competence.
  3. The value of implementation becomes so high that slack time will trend to zero.

I have faith that we can overcome these as a discipline, but we can't do it by applying our old bag of tricks. Our old bag of tricks got us our old error rate. We need to evolve.

As you write code faster, the acceptable error rate drops[1]

I think this is obvious but it's worth stating. If you're shipping code faster, and you have a certain regression rate per change, then obviously you are shipping more regressions. It's just math.

So there are 2 pieces of this: are we shipping faster, and how do our users experience that?

Why does this matter? Let's say that you flip a switch and can immediately double your code production. Nothing else changes. Just twice as much code as before. There are a few consequences:

First, you ship 2x more user-visible regressions over any time period. This makes sense, right? Each change has about the same chance to introduce a regression as it did before. So on average, double the code means double the regressions.

The problem with user-visible regressions is that users encounter them. After you flip this "double code production" switch, the number of active issues in your product will trend towards 2x the baseline. If users are lucky, you notice the issues they encounter in some way, and you can go and fix the issue yourself with your magical new implementation rate. But often, users have to tell you that you fucked up. "This menu is broken in this configuration and I can't even click this option anymore." And that's slow. You need to aggregate the reports, try to reproduce, prioritize the fix, etc. And your best users, your power users? They're the poor saps that trigger all of your new bugs, over and over again.

I'm sure somebody wants to counterargue "LLMs can detect all of these kinds of issues and automatically fix them," and I don't want to hear it. Software is obviously degrading all around us! Whatever LLMs are currently doing isn't enough to resolve it. And if you have some magic technique that the rest of us aren't applying, please scream it from the rooftops. Even better, try to get it integrated into the official Claude Code harness the official way: by randomly Tweeting at members of the team on the off chance one of them notices.

So when you hear someone say "we're shipping 100x faster" and they can't explain how they ship 1% of the number of errors that they did before, run away from their software before it explodes.
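Here is the arithmetic from this section as a small sketch. The numbers (100 changes per week, a 2% regression rate, three weeks for a reported bug to get fixed) are made up for illustration, and the active-issue estimate assumes regressions arrive at a steady rate and each one stays open for the same average time.

```python
def regressions_per_week(changes_per_week: float, regression_rate: float) -> float:
    """Expected number of regressions shipped each week."""
    return changes_per_week * regression_rate


def active_issues(changes_per_week: float, regression_rate: float, weeks_to_fix: float) -> float:
    """Open regressions once new arrivals and fixes balance out.

    Assumption: each regression stays open for `weeks_to_fix` on average.
    """
    return regressions_per_week(changes_per_week, regression_rate) * weeks_to_fix


def rate_needed_to_hold_steady(baseline_rate: float, speedup: float) -> float:
    """Per-change regression rate required to ship no more regressions than before."""
    return baseline_rate / speedup


# Hypothetical baseline: 100 changes/week, 2% of changes regress, 3 weeks to fix.
baseline = active_issues(100, 0.02, 3)
doubled = active_issues(200, 0.02, 3)

print(f"baseline active issues: {baseline:.1f}")
print(f"after doubling output:  {doubled:.1f}")
print(f"rate needed at 100x:    {rate_needed_to_hold_steady(0.02, 100):.4%}")
```

Whatever baseline you plug in, the ratio is what matters: doubling output at the same per-change rate doubles the open issues, and a 100x speedup needs a per-change rate of one hundredth of the original just to break even.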

Implementation time is becoming decoupled from competence

In the old days[2], if you didn't know how to do something, then it took you a long time. But if you put in the reps, you'd develop expertise and get faster and faster, and eventually it became a natural part of your workflow.

This slowness was a blessing. The slowness was learning. You allowed the problem to impress itself on your brain. You weren't just reading the theory. You were actually developing the muscle memory for execution. You were learning every wrinkle, every pitfall, every exception, and you learned how to handle each of them.

Man, that went out the window, didn't it? Now you can be as clueless as you want to be. When I set up my blog recently, I spent weeks hammering the codebase into the right shape. Even though I generated it with Claude Code, I still have a good idea of how the code is organized and what each piece does. I chose to understand the project.

And when it came time to actually deploy my blog to Digital Ocean, I didn't want to understand. I didn't want to remember how to use Ansible and look up guides for hardening VPS instances. I didn't want to spend days tracking down the cause of obscure error messages. I just have one hour per day of side project time. I don't want to waste it. I told Claude everything I wanted: the Makefile command names, Ansible deployments, hardening, etc. It finished within 20 minutes. And sure, I checked over everything to make sure it wasn't leaking API tokens or anything. But I just read the generated code. I never truly allowed it to flow through my brain.

Did it do a good job? Not any worse than I would have done setting up my first VPS with Ansible in 6 years.

Did I learn anything? Absolutely not! I'm running Caddy in production now, and I have no idea how it's different from Nginx beyond automatically setting up HTTPS.

And that is my central point here. Implementation time is becoming decoupled from competence. Whether I knew how to set up a droplet or not, it would have taken about 20 minutes either way. Sure, an expert might have added bells and whistles, set up some extra monitoring, got Tailscale going, and whatever else experts do with their VPSes. But I did 3 days of 1-hour-a-day side project time in like 20 minutes. And that's bad! I shouldn't be the type of person that can set up a VPS in 20 minutes. It should actually take me a long time because I don't know what I'm doing. In some ways, it's actually dangerous that I can do this.

And that's what we're seeing throughout the industry. People can accelerate tasks outside of their own expertise. So review and expertise are becoming an increasingly important part of the job. Just because someone sent you a change no longer means that they have the competence level required to get it working. I sure didn't when I deployed my blog.

I'm not impressed anymore when somebody says that they pointed Claude at a ticket with its MCP or CLI, implemented the code, wrote tests, and pushed the result to a GitHub pull request with the GitHub CLI[3]. That's where the work starts now: actually evaluating the prompts and output for correctness, for scalability, for maintainability. For removing all of the little quirks that LLMs introduce. I'm impressed when engineers say, "I found this problem that I wouldn't have otherwise" or "it tuned this better than I could tune it myself" or "I had this insight I never would have had by myself."

Does this LLM speed boost lead to software correctness? If anything, you can now ship code faster if you're clueless because you're unconstrained by reality. You're not poring over the code looking for API keys it's leaking, looking for obvious scaling bottlenecks, looking for unnecessary bundles you're sending the client. You've merged, and you're already looking for the next ticket. And the number of active regressions in your product just ticked up a bit.

The value of implementation becomes so high that slack time will trend to zero

Slack[4] is an important concept in system resilience. It's the amount of time that an entity is unallocated. It is your tolerance to deviations from the norm. In manufacturing, slack might be the amount of time that a factory is not utilized. At zero slack (i.e. the factory has to run every hour to meet its demand), even a single hour of downtime needs to be made up. That's when the system comes under pressure. This is when accidents and errors creep in.

For software engineers, slack looks like unscheduled time. From the top down, unscheduled time sounds bad. This is time where engineers don't have guaranteed outcomes. But in reality, a little bit of slack can be some of the most important time that they spend. This is the time that they delete dead code, that they build that observability dashboard that everyone has been putting off, that they say, "This weird thing has been bothering me for a few weeks, I need to look into it. Oh fuck!". It's when you have a chance to say to your teammates, "Why does this part of the codebase feel wrong? What can we do about it?" and whiteboard for a week and come up with the architecture that powers you for the next 5 years.

Obviously no leader says that they want to suffer Knight Capital's fate, or that they wished they didn't have that dashboard that noticed that launch regression, or that they wished the subtle data loss bug was still live in prod, etc. But how do you make it a repeatable business outcome when it comes as the result of unallocated time? You can't.

You might say, "All of these things should obviously be part of any project." But that just misses the point of how software is made. I wrote more about it here, but the tl;dr is that in most methodologies, you set an objective like "Make widget Foo", you set the launch date along with the initial scoping of the project, and then repeatedly negotiate the scope until you have Foo on the launch date (or maybe pushed back a week or two). When your schedule starts to slip, do you know what gets descoped first? Your nice-to-have dashboards, the code cleanup tickets from the last project, etc. All the slack work goes right out the window.

I expect the industry to operate with less and less slack in the future. If we can really accelerate feature implementation, then there's some rate of implementation where it doesn't make sense to give your engineers unallocated time anymore. It just becomes too valuable to perform feature work. "So you're telling me that it used to take my team 3 months of work to figure out whether we might get +/- 2% from an engineering change? And now we can turn it around in 6 weeks?" Your quarter just got 6 weeks back, and you can bet that you are not spending it on anything but implementing more product features. Every implementation hour just got twice as valuable. That downtime between projects you used to have? Now you're spending it writing the specs for the next project.

And it leads to worse software outcomes. Without slack you can't even handle minor bumps in the road. What happens if a project suddenly needs more headcount? There's nowhere to borrow it from; everybody's already allocated. Something needs to get bumped, but you already burned a bunch of engineering time on it. You're already starting to lose the gains. Who's going to go back and delete all of the branches of the old experiments? Nobody. Who's going to spend a week whiteboarding to determine the future architecture of your company? Nobody. Who's going to investigate that weird thing before it becomes a huge problem? Nobody.

Hope for the future

I don't want to just be Doom and Gloom about the future. We can do something about this. Are we going to reduce the error rate by 100x? Probably not. But increased implementation speed applies to everything. It means that we don't have to be stuck in the old paradigm where we pair every single implementation change with a unit test and say that the test means that we verified that our software works.

I can imagine a future where an LLM with the right skill could actually verify that every single line of code has a test that fails if it regresses.
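One existing technique that approximates this check is mutation testing: deliberately break the code one place at a time and see whether any test notices. The sketch below is my own toy illustration, not a real tool. The `clamp` function, both tests, and the single "disable one branch" mutation are hypothetical, and a serious mutation tester applies many more kinds of changes.

```python
import ast

# Hypothetical code under test. The backslash keeps `def` on line 1 of the string.
SOURCE = '''\
def clamp(value, low, high):
    if value < low:
        return low
    if value > high:
        return high
    return value
'''


class DisableBranch(ast.NodeTransformer):
    """Replace the condition of one chosen `if` with False, simulating a regression."""

    def __init__(self, target_index):
        self.target_index = target_index
        self.seen = 0
        self.mutated_line = None

    def visit_If(self, node):
        self.generic_visit(node)
        if self.seen == self.target_index:
            self.mutated_line = node.lineno
            node.test = ast.Constant(value=False)
        self.seen += 1
        return node


def surviving_mutants(source, function_name, test):
    """Return line numbers of branches that can be broken without `test` failing."""
    branch_count = sum(isinstance(n, ast.If) for n in ast.walk(ast.parse(source)))
    survivors = []
    for index in range(branch_count):
        mutator = DisableBranch(index)
        tree = ast.fix_missing_locations(mutator.visit(ast.parse(source)))
        namespace = {}
        exec(compile(tree, "<mutant>", "exec"), namespace)
        try:
            test(namespace[function_name])
        except Exception:
            continue  # the test noticed the regression
        survivors.append(mutator.mutated_line)
    return survivors


def weak_test(clamp):
    # Only exercises the happy path.
    assert clamp(5, 0, 10) == 5


def strong_test(clamp):
    assert clamp(5, 0, 10) == 5
    assert clamp(-1, 0, 10) == 0
    assert clamp(11, 0, 10) == 10


print("unprotected lines (weak test):  ", surviving_mutants(SOURCE, "clamp", weak_test))
print("unprotected lines (strong test):", surviving_mutants(SOURCE, "clamp", strong_test))
```

With the weak test, both branches can be disabled without anything failing, so lines 2 and 4 are reported as unprotected. The strong test catches both mutations and the list comes back empty.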

I can imagine a future where it becomes so cheap to produce integration tests that we default to integration tests over unit tests. Anyone who's worked with me knows that I hate mocking frameworks and think they lead to worse engineering outcomes, so this could become the perfect axe for me to grind.

I can imagine a world where we get so good at writing integration tests (because we exercise the muscle so much more) that they don't flake all that much.

And it's a bit dangerous, right? It's dangerous for me to assign LLMs these magical capabilities. Just because they can be given a prompt and produce an output doesn't mean it's correct. That doesn't mean it would help.

But the good news is that software engineering is more verifiable than it's ever been. It's never been easier to just take a change and open up one worktree or one checkout and do one technique there and then run a second implementation in parallel and compare the two outputs. What is different? Do you like one more than the other? Did one have more errors than the other?
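As a rough sketch of that comparison step, assume the two implementations can be loaded as plain functions (in practice they would live in separate worktrees or checkouts). The helper and the two word counters below are hypothetical examples, not part of any existing tool.

```python
from typing import Any, Callable, Iterable, List, Tuple


def _run(impl: Callable[[Any], Any], value: Any) -> Tuple[str, Any]:
    """Call impl and capture either its result or the name of the exception it raised."""
    try:
        return ("ok", impl(value))
    except Exception as exc:
        return ("error", type(exc).__name__)


def compare_implementations(
    impl_a: Callable[[Any], Any],
    impl_b: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> List[Tuple[Any, Tuple[str, Any], Tuple[str, Any]]]:
    """Run both implementations on every input and collect the cases where they differ."""
    disagreements = []
    for value in inputs:
        result_a = _run(impl_a, value)
        result_b = _run(impl_b, value)
        if result_a != result_b:
            disagreements.append((value, result_a, result_b))
    return disagreements


# Two hypothetical attempts at "count the words in a string".
def count_words_a(text: str) -> int:
    return len(text.split())      # splits on any run of whitespace


def count_words_b(text: str) -> int:
    return len(text.split(" "))   # splits on every single space


disagreements = compare_implementations(count_words_a, count_words_b, ["a b", "", "a  b"])
for value, result_a, result_b in disagreements:
    print(f"input {value!r}: A gave {result_a}, B gave {result_b}")
```

The two versions agree on the ordinary input and disagree on the empty string and the double space. The tool doesn't tell you which one is right, only where a human (or a spec) needs to make the call.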

But I'm not trying to prescribe Exact Solutions. We can imagine a world beyond what software engineering was in 2023. It doesn't have to just be an implementation paired with a new set of unit tests until you retire. We can work together and share our results, share what's working.

Footnotes

[0]: They fixed it since I started writing this, but here was the repro: triple-click a single-line paragraph. Type Ctrl-C (Cmd-C on Mac). Triple click a headline. Type Ctrl-Shift-v (Cmd-Shift-v) to paste without formatting. The line is replaced with the paste content plus three newlines in a row. Even assuming that it needed to keep both the source and destination paragraph's newlines, where did the third one come from?

[1]: While I was writing this post, I read a different treatment of this subject here that views this through the lens of maintenance costs. If you feel like you'd be more swayed by a devex argument, check this out!

[2]: Before 2025.

[3]: I mean, I'm very impressed with the Claude Code and Codex teams that they made an agent where this is even possible. Holy cow. What a time to be alive.

[4]: I am absolutely not referring to the SaaS product here.

Experimenting with YouTube shorts

I made my first YouTube short, and I learned a lot about the format.

You can watch it here. I'm not going to embed it because the YouTube embedding payload is like a megabyte, which is bigger than the entire rest of my blog. So click the link and go watch it.

As usual, my level of respect for Gen Z has only increased. My lessons are below, but first... why did I do this?

Why am I making YouTube shorts?

I'm getting with the times.

I've blogged on and off in some form since college. Since I've had my kid I've done it less. But I enjoy writing, and I always want to write more.

My posts used to be really high effort. I've swung for the fences in the hopes of getting on the Hacker News or Reddit homepages. And this sometimes works, because sometimes I do end up on the homepages. But this sets the bar really high. Too high to usually justify writing. And so, I rarely write.

So I want to stop letting perfect be the enemy of good. I want to become good at rapidly producing content. And I want to meet the internet where it is. A lot of people want to read long-form posts. A lot of people want to watch long-form videos. And a lot of people want to watch shorts. And so, I want to practice working in each format, so that I can communicate to a broader audience.

What did I learn?

I found that making a short from scratch is much harder than making a full YouTube video. Every frame needs to provide value. I needed to cut over half of my video to make a 73-second video. And I was aiming for sub-60. I just couldn't pull it off with what I recorded.

The subject was pretty simple: Claude Code has had several performance regressions over the past few months, and three of them were fixed today. So I would basically interleave my "confessional" shot with stills from the tweet and blog post. When the blog post stills were up, I would explain them. When the confessional shot was up, I would explain my experience with them.

Well, I had to cut almost all of my confessionals entirely. I had to edit pauses out. I had to cut within sentences to economize. I basically had to throw all of the fluff and padding out. The next time, I need to plan from the beginning to have concise sentences.

I'm going to film a longer video (paired with a blog post) tomorrow, and I also hope to cut that up into shorts. But that means that I need to go over my shot list and script, and figure out what I want to be a short. And I need to make those sections punchy!

The reach is crazy

Within 30 minutes of publishing the short, it already has 90 views. I'm sure their attention was much shallower and they are much less attached than someone who made it through my YouTube video. So I'm ultimately not sure how "good" the traffic quality is. But I also don't want to discount it entirely; would I get a lot of value from publishing these over and over again? What if I added common branding between my blog, longer-form YouTube videos, and my shorts? God, now I need to pay someone on Fiverr to make me a logo.