Using an LLM as a Test Compiler

Our warehouse management system (WMS), which runs the fulfillment and logistics side of the business, has a dedicated team constantly working on maintenance and enhancements. Although we started with an off-the-shelf WMS, we've grown the system in scope and sophistication over time. Some of that recent growth includes an overhaul of our testing and deployment process, most recently the implementation of an end-to-end (E2E) testing framework.

This is a well-worn path but, like many things, agentic systems open new doors and challenge fundamentals. So before we committed to investing in a traditional E2E framework and test suite, we wanted to answer the question: Could an LLM just do the E2E testing for us, given a plain description of what to check? We looked at third-party tools and sketched a few ideas of our own. Turns out the short answer is "yes", but the long answer is "not well."

The choice we didn't want to make

With traditional, hand-coded E2E tests, you write scripts that drive a browser, check the results, and run them in CI. When they pass, you trust them. They're deterministic and they're (relatively) fast. The catch is what it costs to own them. Someone has to write the automation, hunt down the right selectors, and chase the flakiness. Every new feature needs a new test suite. Every time the UI moves, something breaks and a human goes and fixes it.

On the other side, you point an LLM at the browser and just let it go. Implementation and maintenance become super easy. You describe a workflow and the agent clicks around. But it's slow and nondeterministic. An agent that might take a different path on each run, at the speed of model inference, burning tokens every time, isn't something you want standing between your team and a production deploy.

The insight: these aren't mutually exclusive

This doesn't happen often in our craft, but in this case it turns out it's entirely possible to take the best part of each option and combine them.

I like to use a compiler as an analogy here. A compiler is an expensive, complicated tool. But you don't ship it inside your app or run it on every request. You run it once, at build time, to turn human-friendly source code into a fast, deterministic artifact. Then you run that artifact, with the compiler nowhere in sight.

Taking the analogy a bit further:

Source: a plain-language Markdown file describing the workflow, with the goal, the steps, and the checks. Cheap to write, and anyone can read it.
Compiler: Claude, plus a set of skills that capture how we build tests. It reads the description, opens the app, figures out how the workflow behaves, and writes the code.
Output: an ordinary Playwright + pytest test. Deterministic, and it runs in CI with no AI involved at all.

That model aligns perfectly with our problem, and testing is the ideal application: a test case is literally a prescriptive, unambiguous set of actions and assertions that are easy for an LLM to implement and even easier for it to know when it got it right. Also, as the test suite grows, we get a detailed library of examples and mappings between plain-language Markdown and working pytest code for free as a byproduct.

The building part is cheap in both time and effort. You describe the behavior in plain language, and the compiler handles the automation work. The running part is fast and free, because it's plain pytest with no model in it and nothing improvising a different path from one run to the next.

Here's roughly what a test's "source" Markdown looks like:

    # Receive a Purchase Order

    ## What to test
    Receive a purchase order with several line items, taking in the
    full quantity on every line. This is the core receiving path.
    If it breaks, we can't take stock in.

    ## Steps
    1. Open the receiving screen and look up the purchase order by number.
    2. For each line, enter a received quantity equal to the ordered quantity.
    3. Confirm the receipt.

    ## What to check
    - The order's status becomes "Received"
    - Inventory goes up for each received item
    - A receipt transaction shows up in the history

Two critical things live in that file. The first is the high-level goal: what the workflow is for, and what success looks like. Keeping it in the file means the point of the test doesn't get lost as the file evolves. The second is the actual path, spelled out, the same way you'd be precise in source code rather than leave the compiler guessing.
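
Because the definition follows a fixed shape, it is also easy to read mechanically. The sketch below is not our tooling; it is a minimal illustration, assuming the three section headings shown above, of how a definition could be split into its goal, steps, and checks. The names `WorkflowDefinition` and `parse_definition` are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class WorkflowDefinition:
    """Structured view of one plain-language test definition."""
    title: str = ""
    goal: str = ""
    steps: list[str] = field(default_factory=list)
    checks: list[str] = field(default_factory=list)


def parse_definition(markdown: str) -> WorkflowDefinition:
    """Split a definition into title, goal, numbered steps, and bulleted checks."""
    definition = WorkflowDefinition()
    section = None
    goal_lines: list[str] = []
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("## "):
            section = line[3:].strip().lower()
        elif line.startswith("# "):
            definition.title = line[2:].strip()
        elif section == "what to test":
            goal_lines.append(line)
        elif section == "steps" and (m := re.match(r"\d+\.\s+(.*)", line)):
            definition.steps.append(m.group(1))
        elif section == "what to check" and line.startswith("- "):
            definition.checks.append(line[2:])
    definition.goal = " ".join(goal_lines)
    return definition


SAMPLE = """\
# Receive a Purchase Order

## What to test
Receive a purchase order with several line items, taking in the
full quantity on every line.

## Steps
1. Open the receiving screen and look up the purchase order by number.
2. For each line, enter a received quantity equal to the ordered quantity.
3. Confirm the receipt.

## What to check
- The order's status becomes "Received"
- Inventory goes up for each received item
- A receipt transaction shows up in the history
"""

definition = parse_definition(SAMPLE)
print(definition.title, len(definition.steps), len(definition.checks))
```

A structured view like this makes it simple to reject a definition that is missing its goal or has no checks before any agent time is spent on it.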

That precision matters most for the autonomous side. A vague definition compiles into a slightly different test every "build". A prescriptive one lets the compiler work with little hand-holding. It also keeps things grounded when the app changes months later: a failing test can be read against a precise record of how the workflow is supposed to go, so an agent can tell "the app broke" apart from "the test just needs recompiling."

The bonus insight: the framework is mostly instructions

Once we were thinking in compiler terms, a second insight revealed itself. If the compiler is "Claude plus a set of skills," then the skills are the framework. All the behavior we'd normally write as framework code could instead be written as instructions the model follows at build time: how to pick what to test, how to poke around an unfamiliar UI, what our selector and data conventions are, how a test file should be laid out.

That left us with just a small amount of actual framework code to write and maintain. Most of what makes our E2E framework ours lives in a few skill files and a shared conventions doc, not in a codebase. The framework is, more or less, a carefully written set of instructions that give the development-time coding agent everything it needs both to help write the Markdown "code" for a test case and to run the "compilation" process that (mostly autonomously) converts it into concrete pytest code.

What test implementation looks like

We start with a test generation skill that facilitates a conversation with explicit checkpoints, and it front-loads the human input so the agent can do the slow part on its own.

A session goes roughly like this:

Propose what to test. The agent looks at what you've been working on and suggests a few candidate tests. You pick one or describe your own.
Pin down the intent. It restates what it thinks you mean and asks about anything ambiguous: which record to use, whether you're testing the view path or the edit path. No UI questions yet.
Explore the real application. The agent uses the local Playwright MCP to walk the workflow in a live browser, reading each page through its accessibility tree (much cheaper on context than screenshots, and more reliable) to discover the real steps and selectors live while a human is there to clarify anything along the way.
Write the definition, then stop. Next it writes the plain-language definition (the goal, the prescribed steps, and the checks) and pauses at a hard gate: does this capture what we're building? Nothing gets generated until you say yes. This is the compiler equivalent of signing off on the source before you hit build.
Compile, then check its own work. Once the definition's approved, the agent goes off on its own. It writes the test, runs it, and fixes it until it passes. Depending on the test this could take some time.
Demonstrate and commit. When it's done, it waits for you to return, replays the finished test in a browser you can watch, points out anything it had to guess at, and commits once you're happy.
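
The approval gate and the fix-until-green loop in the session above can be sketched in a few lines. This is a simplified model, not our skill's implementation: `generate` and `run_test` are hypothetical stand-ins for the agent and the pytest run, and the `max_attempts` cap is an assumption added so the loop cannot run forever.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CompileResult:
    passed: bool
    attempts: int
    code: str


def compile_definition(
    definition: str,
    approved: bool,
    generate: Callable[[str, str], str],
    run_test: Callable[[str], tuple[bool, str]],
    max_attempts: int = 5,
) -> CompileResult:
    """Generate test code from an approved definition, retrying until it passes."""
    if not approved:
        # The hard gate: nothing is generated before a human signs off.
        raise PermissionError("definition has not been approved")

    feedback = ""
    code = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(definition, feedback)  # agent writes or repairs the test
        passed, feedback = run_test(code)      # run it and capture the failure output
        if passed:
            return CompileResult(True, attempt, code)
    return CompileResult(False, max_attempts, code)


# Stand-ins that fail once, then succeed after seeing the failure output.
def fake_generate(definition: str, feedback: str) -> str:
    return "v2" if feedback else "v1"


def fake_run(code: str) -> tuple[bool, str]:
    return (True, "") if code == "v2" else (False, "selector not found")


result = compile_definition("...", approved=True, generate=fake_generate, run_test=fake_run)
print(result)
```

Passing the previous failure output back into `generate` is what turns a blind retry into a repair step.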

The front-loading is what makes it practical. Your part is a handful of quick decisions near the start, plus a review at the end. Everything in between, exploring the UI implementation and generating code and fixing it, happens without you. That's the slow part, but it is mostly autonomous and parallelizable. In the end you're approving a spec and reviewing a result, not interactively narrating every turn.

Why this delivers both

Cheap to build doesn't mean instant. A generation session is real work and takes real time. The agent is exploring a live UI and checking its own output, and none of that is free. But the work is mostly hands-off, and it needs almost no specialized development skills. You don't need to know Playwright to write a test. You need to be able to describe the workflow. We've traded scarce skilled-engineer hours for mostly unattended agent time.

Safe to trust is the deterministic half. Because the output is plain pytest with no model in it, a test does the same thing every time, in seconds, very cheaply.

Maintenance is what quietly kills most traditional suites. Here it stays cheap, for the same reason. When the UI shifts, a human doesn't need to go spelunking through the test to re-find selectors. You point the "compiler" at the same definition and recompile, and the fresh wiring comes out the other end. Because the durable, prescriptive description lives in the source and only the fragile parts get regenerated, the test keeps up with the app instead of slowly rotting behind it. Triage gets the same grounding: a precise description to check a failure against, to differentiate a flaky test from actual broken code.

None of this is domain-specific

While we built this, another insight presented itself: none of this is specific to our WMS codebase. It's "skills plus a thin layer of code," and the skills don't care what application you point them at. So we're already looking at whether the same "compiler" can write E2E tests for other teams and codebases across the company.

In fact, this model isn't even specific to testing, but by nature testing is the easiest application of it. Spec-driven agentic development is a whole other thing that deserves its own blog post... or blog.

How any team can try this

You don't need our stack to apply this model to testing:

Split the what from the how. Write tests as plain-language definitions and treat those as the source of truth. The executable code is an output, not the thing humans maintain.
Be prescriptive in the definition. Define the goal and the actual steps independently, the way you'd be precise in defining requirements and source code. Loose intent makes regeneration drift, but a precise definition keeps the autonomous parts grounded over time.
Put the AI at build time, not runtime. Have it generate standard tests in whatever framework you already trust (Playwright, Cypress, etc.), so execution stays deterministic and migration stays brownfield-friendly.
Write your conventions down as skills. Selector preferences, data setup patterns, file layout: put them where the agent can read them. The tests themselves eventually become the convention documentation, but an explicit document jumpstarts that.
Keep humans at the gates. Sign off on the definition before any code gets written, and review the output before it merges. Front-load the questions and context so the agent can run unattended in between.
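
The "build time, not runtime" rule in the list above is simple enough to enforce mechanically. As a hedged sketch, a CI check could parse each generated test and fail if it imports a model SDK. The deny-list and the function name `model_imports` are hypothetical; adjust the list to whatever clients your organization uses.

```python
import ast

# Hypothetical deny-list of model SDK packages that must not appear in test code.
FORBIDDEN = {"anthropic", "openai"}


def model_imports(source: str) -> set[str]:
    """Return the forbidden top-level packages imported by a test module."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        found.update(name.split(".")[0] for name in names)
    return found & FORBIDDEN


clean = "import pytest\nfrom playwright.sync_api import expect\n"
leaky = "import anthropic\nfrom playwright.sync_api import expect\n"
print(model_imports(clean), model_imports(leaky))
```

Using the syntax tree rather than a text search avoids false positives from comments or strings that merely mention a package name.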

The takeaway

These insights weren't really about the AI writing tests; that's just what led us there. They were about where AI sits in our process, something we're evaluating across all of our processes. Think of it as a build-time thing instead of a runtime thing, and you can keep the good parts of each approach, for testing and beyond. The source is plain Markdown anyone can read. The output is fast, deterministic test code. And the "framework" is mostly just instructions.
