Part II: How Do I Work With This Thing?

The Tests Are for You

Professional test output is not designed for humans. It's designed for machines — exit codes, stack traces, assertion counts, coverage percentages. A passing suite tells you everything passed. A failing one tells you something failed, somewhere, with a line number. What it doesn't tell you is what happened. What the board looked like. What the score was. Why the domino that should have scored ten points scored zero.

This is the metamer problem. In color science, metamers are stimuli that are physically different but perceptually identical — two different light spectra that look like the same shade of blue. In AI-assisted development, the metamer problem runs in both directions. Code that looks correct to the AI can be visually wrong to a human. And code that looks wrong in test output can be functionally correct but presented in a way that hides the logic. If you can't see what the AI produced — not the code, but the result of the code — you can't verify that the code is right. The test passes. The output is garbage. Exit code zero.

The lesson came from building a dominoes game. The AI could write game logic. It could write tests for game logic. But it could not see the board. It had no concept of domino placement, rotation, or how a chain of tiles looks when rendered. Bugs that were obvious at a glance — a tile connected to the wrong end, a spinner scoring only one side, a chain end displaying the inner face instead of the outer — were invisible to the AI's tests. The tests checked invariants: total tiles in play, score divisible by five, correct player turn order. The invariants all passed. The game was broken.
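The gap can be made concrete. Below is a minimal sketch, with hypothetical names, of the kind of invariant test the text describes: every check passes even though the placement is visibly wrong, because nothing ever looks at the join between tiles.

```python
# Sketch of an invariant-only test, using hypothetical names modeled on the
# dominoes example. The board is broken -- [5|6] followed by [1|6], so the
# adjacent faces (6 and 1) don't match -- but every invariant still passes.

def test_invariants_pass_on_broken_board():
    board = [(5, 6), (1, 6)]  # a placement a human would reject at a glance
    score = 10
    players = ["P1", "P2"]

    assert len(board) == 2        # total tiles in play: passes
    assert score % 5 == 0         # score divisible by five: passes
    assert players[0] == "P1"     # correct player turn order: passes
    # Nothing here checks that adjacent faces match. Exit code zero.

test_invariants_pass_on_broken_board()
```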

The fix was building test output that humans could read. Not minimizing it, not tucking it behind verbosity flags — making the default output show you the setup, the action, and the result in a form you could trace by hand. Board states rendered as text: [5|6] → [6-6] → [6|1], with chain ends labeled, counts summed, scores annotated. If the scoring was wrong, you could see the board state and add the ends yourself. If the placement was wrong, you could see the chain and spot where the logic diverged. The test wasn't checking a boolean. It was showing you a scene.
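A test in that style might look like the following sketch. The `Tile` and `Chain` classes are hypothetical stand-ins, not the book's actual code; the point is the shape of the output: setup, action, and result printed as a scene you can re-check by hand.

```python
# A human-readable test, sketched with hypothetical Tile/Chain classes.
# Instead of asserting a bare boolean, the default output shows the board,
# the labeled ends, and the arithmetic behind the score.

class Tile:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def __repr__(self):
        return f"[{self.left}|{self.right}]"

class Chain:
    def __init__(self):
        self.tiles = []

    def play(self, tile):
        self.tiles.append(tile)

    def ends(self):
        # The open pip values at each end of the chain.
        return (self.tiles[0].left, self.tiles[-1].right)

    def render(self):
        return " → ".join(repr(t) for t in self.tiles)

def test_scoring_is_visible():
    chain = Chain()
    for tile in [Tile(5, 6), Tile(6, 6), Tile(6, 1)]:
        chain.play(tile)
    left, right = chain.ends()
    score = left + right
    # The scene, not just a verdict:
    print(f"board:  {chain.render()}")
    print(f"ends:   left={left}, right={right}")
    print(f"score:  {left} + {right} = {score}")
    assert score == 6  # a human can re-add the ends from the board line

test_scoring_is_visible()
```

If the assertion fails, the printed scene is already on screen: you see the chain, you add the ends yourself, and you know whether the test or the code is wrong.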

This evolved into dedicated infrastructure. An artifact collector accumulated board snapshots, scoring breakdowns, event sequences, and move lists during test execution, then generated styled HTML pages you could open in a browser and read like a case study. Scoring scenarios got their own page. Spinner lifecycle got its own page. Game replays got their own page. The test suite wasn't just verifying correctness. It was documenting behavior in a format that a human could audit and an AI could be held accountable against.
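A collector like that can be very small. The sketch below uses invented names and a bare-bones page layout; the described infrastructure is surely richer, but the mechanism is the same: tests record snapshots as they run, and a report step writes them out as one HTML page.

```python
# Minimal sketch of an artifact collector (hypothetical API). Tests call
# record() during execution; to_html() turns the accumulated snapshots into
# a single page you can open in a browser and read like a case study.

import html

class ArtifactCollector:
    def __init__(self):
        self.artifacts = []  # (title, body) pairs accumulated by tests

    def record(self, title, body):
        self.artifacts.append((title, body))

    def to_html(self):
        sections = "\n".join(
            f"<h2>{html.escape(title)}</h2>\n<pre>{html.escape(body)}</pre>"
            for title, body in self.artifacts
        )
        return f"<!DOCTYPE html><html><body>\n{sections}\n</body></html>"

collector = ArtifactCollector()
collector.record("Scoring: both ends summed",
                 "board:  [5|6] → [6-6] → [6|1]\nscore:  5 + 1 = 6")
collector.record("Replay: hand 3, move 7",
                 "P2 plays [6|1] on the right end")

with open("test_report.html", "w") as f:
    f.write(collector.to_html())
```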

The tests became a Rosetta Stone. The original stone worked because it carved the same message in three different scripts — and if you could read one, you could decode the others. Human-readable tests do the same thing. The same behavior, written three ways: what the code says, what the AI claims it does, and what actually happens when it runs. The code is the script you can't read — dense, formal, machine-native. The AI's assertion is the summary — a simplified claim that something passed. The human-readable output is the script you can actually read, the one that lets you verify the other two. When all three agree, you know the system works. When they disagree, you know exactly where to look.

There's also a growth problem that tests solve. The more detail you put into one part of the system, the fuzzier everything else becomes. You refine the scoring engine, and the placement logic drifts. You add spinner chains, and the end-of-hand calculation breaks. Software built with AI hits this wall fast because the AI is eager to build and reluctant to check. Without tests, every addition breaks something else, and you can't tell what broke or when. Tests are what let the software get meaningfully bigger. They freeze the parts you're not touching so you can work on the parts you are.

The practical rule: always write tests before a refactor. If you're about to ask the AI to restructure something, get the tests in place first. Let the tests describe the current behavior in human-readable output. Then refactor. Then read the output again. If it changed, you know exactly what shifted and can decide whether the shift was intentional. Without the before-picture, you're trusting the AI's claim that nothing broke — and you already know what that claim is worth.

And writing tests is trivially easy now. The same AI that can't see the visual bugs can generate test scaffolding, test data, and test infrastructure in seconds. The barrier that kept people from writing comprehensive tests — the tedium, the boilerplate, the setup — is gone. The only barrier left is knowing that you need them and knowing what they need to show you.

AI has no perceptual ground truth. It processes tokens, not pixels. It can verify that a number equals another number, but it cannot see that a domino is facing the wrong way. Every domain has its equivalent of the visual bug — the thing that's correct by every metric the machine can check, and obviously wrong the moment a human looks at it. The tests are a Rosetta Stone. Build them with every script.
