Last week, OpenAI launched GPT-5 to considerable fanfare. Sam Altman told us it could provide “PhD-level expertise.” The benchmarks were impressive. The demos were polished. And within 48 hours, people discovered it couldn’t reliably spell words or locate countries on a map.

This is not a post dunking on OpenAI. GPT-5 is a genuinely capable model, and the research behind it is remarkable. But the gap between “PhD-level expertise on demanding benchmarks” and “can’t spell ‘necessary’ correctly” reveals something important about how our industry evaluates AI, and why that evaluation framework is broken for production use.

Ceiling vs. Floor

Benchmarks measure ceilings. They ask: what is the hardest problem this model can solve? Can it pass the bar exam? Can it reason through a novel math proof?

Enterprise deployment measures floors. It asks: what is the dumbest mistake this model will make when processing its ten-thousandth request on a Tuesday afternoon?

I’ve been building software long enough to know that production systems live and die by their worst behavior. When I was building the Alexa ML/AI platform at Amazon, we didn’t celebrate the demos where Alexa understood complex queries perfectly. We obsessed over the cases where she confidently misheard “set a timer for ten minutes” as “set a timer for ten hours.” Those floor failures defined the user experience far more than any ceiling achievement.

Same thing at Xively, where we built an IoT platform processing millions of sensor readings. A sensor that reports “no data” is infinitely preferable to one that reports wrong data. You can design around missing data. Wrong data propagates through your system and corrupts everything downstream.
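The asymmetry is easy to see in code. Here is a minimal sketch (hypothetical function and values, not Xively's actual pipeline) of averaging sensor readings: a missing reading can simply be excluded, while a wrong reading is silently averaged in and corrupts every downstream consumer.

```python
from statistics import mean

def rolling_average(readings):
    """Average sensor readings, skipping missing (None) values.

    A missing reading is easy to design around: we exclude it.
    A *wrong* reading offers nothing to filter on, so it flows
    straight into the result.
    """
    valid = [r for r in readings if r is not None]
    if not valid:
        return None  # report "no data" rather than guess
    return mean(valid)

# Missing data: the gap is visible and handled.
print(rolling_average([21.0, None, 22.0]))  # 21.5

# Wrong data: a stuck sensor reporting 0.0 poisons the average,
# and nothing in the pipeline raises an error.
print(rolling_average([21.0, 0.0, 22.0]))
```

The first call degrades gracefully; the second returns a plausible-looking number that is simply wrong, which is exactly the failure mode that propagates.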

AI models have the same property. Nobody cares that your model can derive novel theorems if it occasionally mangles a customer’s name in a generated email. The floor is the product.

The Parlor Trick Problem

I’ve written before about getting burned by overhyped AI workflows that turned out to be parlor tricks. The pattern is always the same. Someone posts a demo showing an AI doing something extraordinary. You try to replicate it. It works maybe 70% of the time. You spend the next three weeks building guardrails, fallbacks, and validation layers to handle the other 30%.

That 30% is the product. The AI was the easy part.

GPT-5’s launch is the macro version of this pattern. The PhD-level performance is real, but it’s the highlight reel. The spelling errors and geography failures are the production reality. If you’re building enterprise software, you need to design for production reality.

How I Think About Model Evaluation

When I evaluate models for production use, I invert the typical benchmark approach. Instead of asking “what’s the best this model can do?” I ask three different questions:

What’s the worst output this model will produce on routine tasks? Forget edge cases and adversarial prompts. I care about the boring middle: the 10,000 mundane requests that make up a normal workday. If the model occasionally produces garbage on those, it’s not ready.

How does the model fail? A model that fails loudly (returns an error, says “I don’t know”) is dramatically more useful than one that fails quietly (confidently returns wrong data). A confident wrong answer is worse than no answer at all.

What’s the cost of the worst failure? If the model is summarizing internal meeting notes, a bad summary wastes five minutes. If it’s generating something client-facing, a bad output could damage trust or trigger a compliance issue. Same model, same capability, completely different risk profile.
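These three questions can be operationalized in a small evaluation harness. The sketch below (hypothetical names and interface, assuming a model callable that returns a string, returns None for “I don't know,” or raises) separates loud failures from quiet ones, because the two deserve very different weight in a risk assessment.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class EvalResult:
    total: int = 0
    correct: int = 0
    loud_failures: int = 0   # errored or declined: recoverable
    quiet_failures: int = 0  # confident wrong answer: the dangerous case

def evaluate_floor(model: Callable[[str], Optional[str]],
                   cases: list[tuple[str, str]]) -> EvalResult:
    """Score a model on routine cases, separating loud from quiet failures."""
    result = EvalResult()
    for prompt, expected in cases:
        result.total += 1
        try:
            answer = model(prompt)
        except Exception:
            result.loud_failures += 1  # an error we can catch and route around
            continue
        if answer is None:             # an explicit "I don't know"
            result.loud_failures += 1
        elif answer == expected:
            result.correct += 1
        else:
            result.quiet_failures += 1 # wrong but confident: worst case
    return result
```

The design choice worth noting: accuracy is one number, but the split between `loud_failures` and `quiet_failures` is what tells you whether the floor is survivable.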

The Implication for Builders

If you’re building AI products today, here’s what GPT-5’s launch should tell you.

First: stop chasing model releases. The difference between GPT-5 and Claude and Gemini on your specific production workload is probably smaller than you think. The difference between any of them with good guardrails and any of them without guardrails is enormous.

Second: invest in evaluation infrastructure. Build automated pipelines that test your AI features against thousands of realistic inputs and measure failure modes alongside accuracy. This is less exciting than fine-tuning or prompt engineering, but it’s where production quality lives.
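One concrete way to make that infrastructure bite: gate releases on the floor metric rather than the ceiling metric. A minimal sketch, with a hypothetical failure budget — the exact threshold is yours to set based on the cost of a quiet failure in your domain:

```python
# Hypothetical budget: tolerate at most 1 confident wrong answer per 1,000
# routine requests. The point is that the gate checks the quiet-failure
# rate (the floor), not best-case accuracy (the ceiling).
QUIET_FAILURE_BUDGET = 0.001

def release_gate(total: int, quiet_failures: int) -> bool:
    """Pass only if the quiet-failure rate stays within budget."""
    if total == 0:
        return False  # no evidence is not passing evidence
    return quiet_failures / total <= QUIET_FAILURE_BUDGET

print(release_gate(10_000, 5))   # True: 0.0005 is within budget
print(release_gate(10_000, 50))  # False: 0.005 blows the budget
```

Wire this into CI against thousands of realistic inputs and a model upgrade that raises the ceiling but lowers the floor gets caught before it ships.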

Third: design for the floor. Assume your model will occasionally produce nonsense, and build your UX and your systems to handle that gracefully. Human review workflows, confidence thresholds, fallback paths. This isn’t a sign that AI isn’t ready. It’s a sign that you’re building for the real world instead of a demo stage.
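A confidence-threshold fallback can be as simple as the sketch below. The interface is an assumption for illustration — a model callable returning a `(text, confidence)` pair — but the routing pattern is the point: low-confidence outputs go to human review instead of the customer.

```python
def answer_with_fallback(model, prompt, threshold=0.8):
    """Ship high-confidence answers; route everything else to review.

    `model` is assumed to return (text, confidence) with confidence
    in [0, 1] -- a hypothetical interface for this sketch.
    """
    text, confidence = model(prompt)
    if confidence >= threshold:
        return {"route": "auto", "text": text}
    # The draft is kept so a reviewer starts from something,
    # but nothing low-confidence reaches the user unreviewed.
    return {"route": "human_review", "draft": text}

# Usage with stub models standing in for real inference:
confident = lambda p: ("The invoice total is $1,240.", 0.95)
shaky = lambda p: ("The invoice total is, uh, something.", 0.40)

print(answer_with_fallback(confident, "summarize invoice")["route"])  # auto
print(answer_with_fallback(shaky, "summarize invoice")["route"])      # human_review
```

The threshold is a product decision, not a modeling one: it trades review cost against the cost of a quiet failure reaching a customer.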

The PhD can’t spell. That’s fine. The question is whether you’ve built a system that catches the typo before it reaches the customer.