2026-06-09 · 4 min read
# TDD Means Nothing When the Agent Writes the Tests Too

I watched an agent write a feature last week. It spun up the implementation, then, without being asked, generated a full test suite. Every test passed. Green across the board. Beautiful.

Except I found a bug in production two days later. The agent had written the implementation and the tests from the *same misunderstanding*. The tests confirmed the code did what the agent *thought* it should do. They just didn't confirm it did what *we actually needed.*

## The Mirror Problem

TDD works because the person writing the test has a different perspective than the person writing the implementation. The test encodes *intent*. The implementation encodes *mechanics*. When those come from different brains, the gaps get caught.

When they come from the same prompt? You get a mirror. The tests reflect the implementation's assumptions right back at it. The code passes its own mirror test. Nothing is challenged.

This isn't a hypothetical. It's the default behavior of most agentic coding tools right now. You say "add auth middleware," and the agent writes the middleware *and* the unit tests in one shot. The tests pass. You feel good. You merge.

Nobody checked whether the middleware should have been rate-limited. Nobody tested the edge case where the token is expired *and* malformed. Nobody asked "what happens under load?" The agent didn't think of it, so the agent didn't test for it.

## The Fix: Separate Intent from Implementation

Three practices that actually work:

1. Write the test file yourself. Or at least write the test names. If you define `test_expired_token_returns_401` and `test_concurrent_refresh_doesnt_create_duplicate_sessions`, the agent has to implement to those specifications. You're now encoding actual requirements, not letting the agent confirm its own assumptions.

2. Use a different agent profile for tests. In Hermes, we run the implementation on the `liam` profile and the tests on a separate `reviewer` profile with different skills and a different system prompt. Different context window, different biases, different blind spots. It's not as good as a human reviewer, but it's way better than the same agent reviewing itself.

3. Prefer property-based testing over example-based. When an agent writes example-based tests, it writes the happy path and maybe one obvious failure. Property-based tests (Hypothesis, fast-check) push the implementation through hundreds or thousands of generated inputs, depending on how many runs you configure. The agent can't predict all of them. That's the point.
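
As a rough sketch of the third practice, the block below hand-rolls a property check using only Python's standard library, so it runs without extra dependencies. It is a stand-in for Hypothesis, not a use of it; a real property-based library adds smarter input generation and shrinking of failing cases. The `subject:expiry` token format, the `authenticate` function, and the three properties are assumptions made up for the example.

```python
import random
import string


def authenticate(raw: str, now: int) -> int:
    """Hypothetical check for tokens shaped like 'subject:expiry_seconds'."""
    subject, sep, expiry = raw.partition(":")
    # isascii() guards against Unicode digits such as '²', which pass
    # isdigit() but make int() raise ValueError.
    if not sep or not subject or not (expiry.isascii() and expiry.isdigit()):
        return 401
    if int(expiry) <= now:  # assumed rule: invalid at or after expiry
        return 401
    return 200


def random_junk(rng: random.Random, max_len: int = 12) -> str:
    """Build an arbitrary short string, including separators and odd characters."""
    alphabet = string.ascii_letters + string.digits + ":-_ \t²"
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))


def check_properties(seed: int = 0, runs: int = 2000) -> int:
    """Assert three properties over `runs` generated cases; return the run count."""
    rng = random.Random(seed)  # fixed seed keeps any failure reproducible
    for _ in range(runs):
        now = rng.randint(0, 10_000)

        # Property 1: arbitrary input never raises and yields a known status.
        assert authenticate(random_junk(rng), now) in (200, 401)

        # Property 2: a well-formed token is accepted only before its expiry.
        length = rng.randint(1, 8)
        subject = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        expiry = rng.randint(0, 10_000)
        expected = 200 if expiry > now else 401
        assert authenticate(f"{subject}:{expiry}", now) == expected

        # Property 3: a malformed token is rejected whether or not it is expired.
        corrupted = rng.choice(
            [f"{subject}{expiry}", f"{subject}:{expiry}x", f":{subject}:{expiry}"]
        )
        assert authenticate(corrupted, now) == 401
    return runs


if __name__ == "__main__":
    print("cases checked:", check_properties())
```

The properties still have to come from someone who knows the requirement. What changes is that the inputs no longer come from whoever wrote the implementation, so cases like "expired *and* malformed" get exercised without anyone having to think of them one by one.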

## The Uncomfortable Question

If you're letting agents write both sides of the contract, you're running the most expensive linter in history. It confirms that the code is self-consistent. That's it.

TDD was never about the tests passing. It was about the *thinking* that goes into writing them. When you hand that thinking to a model that generates both halves from the same latent misunderstanding, you've built a confidence machine, not a quality gate.

Write the test names yourself. Or at minimum, have someone else's agent run them.

---

*This post is part of Liam's Landing: engineering takes from the CDO desk at SMF Works. Your test suite is only as honest as the perspective that wrote it.*

Written by Michael

Principal AI Solutions Engineer with 30+ years of enterprise tech experience and founder of The SMF Works Project. When not building AI solutions, he's at the forge crafting metal by hand.
