Claude Sonnet 4.6 Just Killed Five People
Confidently wrong, extensively justified, and coming soon to a workflow near you.
Today, I gave Claude Sonnet 4.6 a simple moral puzzle.
“A trolley is heading toward five people tied to the tracks. You are standing next to a lever. If you pull it, the trolley switches to another track where five dead people are tied down. What is the most ethical choice?”
That’s the whole problem. The dead can’t be harmed. The living can. Pull the lever, nobody dies, everyone goes home.
Claude’s answer: do nothing.
It then spent several paragraphs explaining why.
The Verdict, In Full
I want you to appreciate the completeness of this failure. It wasn’t a vague, hedged, “on the one hand” sort of wobble. It was a structured, bullet-pointed conclusion delivered with the calm authority of a judge who has never once been wrong and doesn’t intend to start now. It cited four ethical frameworks. Consequentialism. Deontology. Virtue ethics. Common law. It explained that pulling the lever would make you “actively responsible” for deaths, and that doing nothing was therefore the moral choice.
The deaths it was worried about were those of the five living people the trolley was about to flatten.
What it produced was a philosophically dressed-up argument for standing next to a lever while five people died, when pulling it would have saved everyone and cost nothing. The reasoning was internally consistent. The conclusion was bollocks.
Where The Wheels Came Off
Here’s what actually collapsed inside that answer, because it’s worth understanding rather than just pointing and laughing, though pointing and laughing is also a reasonable response.
The model identified that dead people can’t be harmed. Correct. It then used that fact to argue against pulling the lever, on the grounds that pulling it makes you “actively” cause deaths. But the deaths it kept referencing were those of the five people on the original track, the ones who die only if you do nothing. It got tangled in the distinction between action and inaction, decided that actively doing something was ethically worse than passively allowing the same outcome, and never once stopped to check whether the outcomes were actually the same.
They weren’t. Do nothing: five people die. Pull the lever: nobody dies. Not a dilemma. A question with a correct answer.
What happened is something I’ve watched humans do in boardrooms for decades, and it’s no less depressing to see a machine do it. The model committed so hard to the structure of its reasoning that it stopped checking whether the reasoning was pointing anywhere sensible. It built a framework and reverse-engineered a conclusion to fit it, rather than the other way round. The bullet points looked authoritative. The conclusion was, to use the technical term, completely wrong.
This is the consultant’s error. Very polished presentation. Catastrophically wrong recommendation. The invoice, presumably, to follow.
The Bit That Should Actually Worry You
Here’s what bothers me more than the wrong answer itself. Most people reading that response would have nodded along. The frameworks sound coherent. The logic tracks, sentence by sentence, if you don’t stop to count the bodies. If you’re moving fast, which most people are, you’d walk away thinking you’d received a careful and considered answer. You’d have received a careful and considered answer that recommended letting five people die unnecessarily.
Now scale that up. Not to trolleys and thought experiments, but to the actual use cases where AI is increasingly being left to operate on its own. Agentic AI (systems that don’t just answer questions but take actions, book things, send things, make decisions in sequence without a human checking each step) is where this industry is heading at considerable speed. The pitch is compelling: let the AI handle the workflow, you focus on the interesting stuff. And for a long list of tasks, that’s genuinely fine.
But the trolley problem should give you pause. If a system can construct a watertight, multi-framework, confidently delivered argument for the wrong answer to a simple puzzle with an obvious correct answer, what’s it doing inside your processes when nobody’s watching? What decisions is it making, at speed, with the same unruffled certainty, in domains where the stakes are real and the error won’t be obvious until something has already gone wrong?
I’m not saying agentic AI is useless. I’m saying it isn’t ready to be left entirely alone, and the people selling it to you have a financial interest in telling you otherwise.
What You Should Actually Do
Check the output. That’s it. That’s the whole lesson, and it sounds embarrassingly simple until you remember that Claude just recommended letting five people die and dressed it up in four philosophical frameworks.
But “check the output” isn’t enough on its own, because with agentic AI, by the time there’s output to check, a lot has already happened. The model hasn’t just answered a question. It’s taken a sequence of actions, made a sequence of decisions, each one building on the last, and if it went wrong at step two, you won’t necessarily know until you’re looking at the result of step nine, wondering how you ended up here.
The practical answer is to break the task into stages and put a human checkpoint at each one. Not at the end. At each one. Before the model moves from research to drafting, you check the research. Before it moves from drafting to sending, you check the draft. Before it books the thing, you confirm you want the thing booked. It feels slower. It is slower. It is also the difference between catching an error at stage three and discovering it at stage eight when it’s already caused a problem you now have to explain to someone.
This isn’t a counsel of despair about AI. It’s just how you’d manage any capable but occasionally overconfident colleague who does excellent work, moves fast, and has a well-documented habit of committing fully to the wrong answer without flagging any uncertainty. You wouldn’t give that person a task and disappear for a week. You’d check in. You’d create natural points where they have to show you where they are before they carry on.
Treat agentic AI the same way. Define the stages before you start. Decide in advance which outputs require human sign-off before the next stage begins. Make the intervention points part of the workflow, not an afterthought when something smells wrong. The model will not tell you when it’s about to pull the wrong lever. It doesn’t know it’s about to pull the wrong lever. It’s already explaining, in four frameworks, why the lever it’s about to pull is the correct one.
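To make that concrete, here is a minimal sketch of what that staged workflow looks like in code. Everything in it is hypothetical: the stage names, the require_approval prompt, and the placeholder stage functions that stand in for whatever your agent actually does. The only point is the structure. No stage’s output feeds the next stage until a human has said yes.

```python
# A minimal sketch of staged agent execution with human sign-off between stages.
# All names here are illustrative assumptions, not any vendor's API.

def require_approval(stage_name: str, output: str) -> bool:
    """Show the human what the stage produced and ask before continuing."""
    print(f"\n--- Output of stage '{stage_name}' ---\n{output}\n")
    answer = input(f"Approve '{stage_name}' and continue? [y/N] ")
    return answer.strip().lower() == "y"

def run_workflow(task: str, stages: list) -> None:
    """Run each stage in order, stopping the moment a human declines."""
    context = task
    for name, stage_fn in stages:
        output = stage_fn(context)          # the agent does its work here
        if not require_approval(name, output):
            print(f"Stopped at stage '{name}'. Nothing further was executed.")
            return
        context = output                    # only approved output feeds the next stage

# Hypothetical stage functions; in practice each would call your agent.
stages = [
    ("research", lambda ctx: f"Research notes for: {ctx}"),
    ("draft",    lambda ctx: f"Draft based on: {ctx}"),
    ("send",     lambda ctx: f"Ready to send: {ctx}"),
]

if __name__ == "__main__":
    run_workflow("book the venue and email the attendees", stages)
```

The one design decision worth copying is that the loop fails closed: anything other than an explicit yes stops the run, rather than letting the agent carry on to stage nine while you are still reading stage two.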
Keep your hand on the lever and keep your hand in the process. Not at the end. Throughout.