Earlier this week I gave a talk at Bondev, Bonnier News' developer conference, about how I currently work with AI coding agents. Two workflows took up most of the slides, but the real point sat at the start and the end of the talk.
The talk was titled "From BDD to LLMs — How I work with agents (this week)", and the parenthetical was an important part of the title. Contrary to a lot of the posts in my social media feeds, I don't have any final answer or silver bullet for how to use AI agents for development. I do have a few thoughts and a fair amount of recent experimentation, and that's what I shared on stage.
Why I'm excited
I've written before about why BDD and AI agents feel like such a good fit to me. The talk built on some of that and went further.
The framing I used at Bondev is that AI agents may be a new level of abstraction in software development. I'm honestly not sure yet — but if they are, the lineage is interesting. From punch cards to assembler, from assembler to C, from C to C++, every meaningful step in the history of programming has been about giving humans a better way to express intent and to build mental models of growing systems. Agents, when they actually work, do something similar. They let me operate closer to what I want and a little further away from the line-by-line how.
Whether or not "new abstraction" is the right label, this matters to me because I became convinced a long time ago that software development is fundamentally about understanding, not typing. The developers I've learned the most from have rarely been the fastest typists. They've been the ones who understood the context they were working in — how things fit together, why something behaved the way it did, the trade-offs that had been made.
That conviction is also what drew me to BDD around 15 years ago. BDD lets you capture understanding in a form that can be persisted and shared. More importantly, it shifts the question from "how should I implement this?" to "what should I implement, and how do I prove that I have?". I'm still convinced that's the better question to start with.
The first time I gave a coding agent a half-decent specification and it produced something that actually worked, I was hooked. In fact, agentic coding has become something of an addiction. The last thing I do most evenings is ask Claude to build something, and the first thing I do in the morning is check on the progress from my phone. That's probably not healthy, but it is fun.
The reason this works for me, I think, is that behaviour-first thinking transfers directly. The thing I used to write as a feature test, I now often write as a brief. The skill is largely the same.
Agents are powerful, but unreliable
Agents are also far from perfect. They misunderstand. They hallucinate. They are overly positive. They sometimes straight up lie. My answer to that isn't to avoid them but to treat it as an interesting engineering problem: how do we use this new technology to develop software faster without giving up quality?

I don't have the answer. I do have two workflows I'm currently using. I want to stress that I'm not showing them because I think I've found the perfect approach. I'm showing them because I believe every developer and every team needs to experiment, see what works for them, and converge on practices that fit their context. These are mine, this week.
Workflow 1 — Pragmatic
The first workflow is the pragmatic one I use at work. It assumes an existing codebase, normal pull requests, and colleagues who don't necessarily work with agents the same way I do. The focus here is moving faster along an already established route. I'm essentially acting as a tech lead instructing another developer who does the actual coding, and then trying to ensure quality and automate as much of the rest as I can.

It starts outside the coding agent. I write a detailed brief in a markdown editor — usually Obsidian — covering the goal, the why, edge cases, important implementation details, and how the work should fit into the existing system. The "why" may seem unnecessary when you're talking to an AI, but I find that providing it leads to better decisions and, perhaps just as importantly, to a more natural vocabulary in the agent's messages back to me. My rule of thumb is that the brief should be at least as good as an email I'd send to a human developer asking them to build the feature.
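To make that rule of thumb concrete, here is what such a brief might look like. The feature, the wording and the headings are all invented for illustration; my actual briefs vary with the task:

```markdown
# Brief: Export search results as CSV

## Goal
Let editors download the current search result list as a CSV file.

## Why
Editors currently copy results into spreadsheets by hand to share them
with the print desk; this is slow and error-prone.

## Edge cases
- Empty result sets should still produce a valid CSV with headers.
- Fields may contain commas and newlines; quote them correctly.
- Result sets can be large; stream rows rather than buffering everything.

## Fit with the existing system
Reuse the existing search query object. Add a new export endpoint next
to the current search endpoint rather than building a separate service.
```

Notice that the "why" section is short but present — it gives the agent something to reason about when the brief inevitably leaves a decision open.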
From the brief I ask Claude Code or Codex to create a plan. AGENTS.md plays an important role here. The brief tells the agent the what and the why; AGENTS.md tells it the how — testing approach, code standards and the things I want consistently applied across the project. I've spent quite a lot of effort on getting that file right.
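To illustrate the brief/AGENTS.md split, here is a hypothetical excerpt of the kind of "how" rules such a file can carry — these are examples of the genre, not my actual file:

```markdown
# AGENTS.md (hypothetical excerpt)

## Testing
- Write a failing test before implementing behaviour.
- Integration tests live in tests/integration and must run independently.

## Code standards
- Prefer small, pure functions; avoid shared mutable state.
- No new dependencies without flagging it in the plan first.

## Process
- Every change ends with a short summary of what changed and why.
```

The brief changes with every task; this file changes rarely, and that's exactly why it's worth the effort to get right.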
I review the plan myself, and for complex work I also ask another model to review it. That lets me focus on the what in the plan while the agentic review tends to be good at finding issues in the how.
With the plan reviewed and revised, implementation starts. After that, a small skill creates the pull request — Claude tends to write a good description, which comes in handy in the next step. Then I let Copilot review the PR on GitHub. I could do that part locally, but the GitHub review is genuinely good and the UI is familiar. Once the review is in, another small skill pulls down the comments, fixes the obvious ones automatically, surfaces anything that needs my judgment, and responds to the comments and marks them resolved.
Then the PR is merged.
If I had to take two things away from this workflow, they would be:
The initial brief matters a lot. Don't expect the agent to read your mind. Give it the same kind of context you'd give a human developer, and think it through before you delegate.
And: the chance of a good implementation goes up significantly when another agent reviews. I use them to review plans, challenge assumptions, look for missing edge cases and inspect pull requests. It doesn't make the workflow deterministic, but it adds feedback loops, and it reduces the chance that one agent's misunderstanding becomes the final result.
Workflow 2 — Experimental
The first workflow still assumes the traditional development model. Agents help me move faster, but the output is still code that goes through a normal PR process and is reviewed in the usual way.
The second workflow is more experimental, and the ambition is bigger. I want to treat agentic development as a new level of abstraction. By default, I do not want to look at the generated code.
That probably sounds provocative. It's similar, though, to how I worked as a .NET developer — I wrote C#, compiled it, and trusted the compiler to produce bytecode. I didn't inspect the generated bytecode after every build unless I was investigating something unusual. The question I'm exploring with this workflow is: can I create an environment where I focus on business intent, specifications, plans, tests, reviews and decision records, and where I only inspect the generated code when the system tells me I need to?

A few things make this possible. The first is the sandbox. I want to run Claude Code or Codex with permission prompts disabled, so the agent can execute a plan without stopping to ask me about every command. Doing that directly on my machine would be irresponsible — I don't want an autonomous agent to have broad access to my computer, my credentials, or other projects. So I move that autonomy into a per-project container with isolated tools, limited host access, and scoped credentials such as a project-specific GitHub token. Inside the sandbox, the agent can move faster. Outside it, my machine and the rest of my work stay protected.
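As a sketch of the idea — not my actual setup — a per-project sandbox can be as simple as a container image with only the tools the agent needs, started with just the project directory mounted and a project-scoped token:

```dockerfile
# Hypothetical per-project sandbox image: only the project's toolchain,
# nothing inherited from the host.
FROM node:22-slim

RUN apt-get update && apt-get install -y --no-install-recommends \
    git ca-certificates && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

# Started with only the project mounted and a scoped credential, e.g.:
#   docker run --rm -it \
#     -v "$PWD":/workspace \
#     -e GITHUB_TOKEN="$PROJECT_SCOPED_TOKEN" \
#     my-project-sandbox
```

The point isn't the specific base image or tooling; it's that everything the autonomous agent can touch is enumerated, scoped, and disposable.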
An important detail is that this environment is itself version controlled and easy to change. That's why I'm currently building it myself rather than reaching for an existing sandboxing solution. I want to be able to experiment freely and learn, rather than rely on something pre-packaged.
As in Workflow 1, things start with a well-crafted brief or specification.
The first step I run inside the sandbox is a custom skill called /plan-review. It explores the codebase and is strictly instructed to produce a TDD-oriented plan. The plan file is a real working artifact, but it's transient — it lives while the work is being done and helps bridge context between steps and sessions. Once the plan is complete it goes through two parallel reviews: one by an adversarial subagent, and one using a different model entirely. The original planning agent then consolidates the feedback and updates the plan.
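To give a feel for what such a skill can look like, here is a heavily abbreviated, hypothetical sketch — Claude Code skills are markdown instruction files, and none of the wording below is my actual skill:

```markdown
---
name: plan-review
description: Produce and review a TDD-oriented implementation plan.
---

1. Explore the parts of the codebase relevant to the brief.
2. Write a plan to a working file: every step begins with a failing test.
3. Ask an adversarial subagent to attack the plan's assumptions.
4. In parallel, ask a different model to review the same plan.
5. Consolidate both reviews and update the plan before implementation.
```

The strict TDD instruction matters: without it, plans tend to drift back towards "implement, then test".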
Next comes /implement-plan. This is where the implementation actually happens, often with one or several subagents running Sonnet. Once it's done, I want the agent to do what I'd expect of a good human developer — apply the boy scout rule, leave the campground cleaner than you found it. So a separate boy scout agent goes over the changed files and their dependencies, cleans up dead code, fixes stale comments and does smaller refactorings. After that, the finished implementation goes through parallel code reviews, the results are consolidated, and any issues are addressed. For larger features I sometimes let Claude orchestrate additional rounds of reviews until no review finds anything critical or major. My current personal record is eleven rounds.
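The "review until nothing critical or major remains" loop is simple to express. This sketch stubs out the agents entirely — the reviewer and fixer are stand-in functions, not real agent calls — because the control flow is the part that matters:

```python
# Sketch of an agentic review loop: keep running review rounds until all
# reviewers come back clean, with a cap so disagreement can't loop forever.
# `reviewers` and `fix` are stand-ins for real agent invocations.

def run_review_rounds(reviewers, fix, max_rounds=15):
    """Run review rounds until no reviewer reports critical/major findings."""
    for round_number in range(1, max_rounds + 1):
        # Each reviewer returns a list of critical/major findings.
        findings = [f for review in reviewers for f in review()]
        if not findings:
            return round_number  # clean round: we're done
        fix(findings)  # hand consolidated findings back for fixes
    raise RuntimeError("review loop did not converge")


# Usage with a stubbed reviewer: two rounds find issues, the third is clean.
issues = [["missing edge case"], ["stale comment"], []]
reviewer = lambda: issues.pop(0)
rounds = run_review_rounds([reviewer], fix=lambda findings: None)
```

The cap is important: two models can disagree indefinitely, and a bounded loop turns that from a hang into a signal that a human needs to look.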
Finally, /finalize turns the temporary work into something more durable in the form of an architecture decision record. The ADR captures what changed and why, and it's something future runs of the workflow can read for context about earlier decisions and the reasoning behind them.
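The records follow the familiar ADR shape. A hypothetical record from such a run — invented content, but representative of what gets captured — might look like:

```markdown
# ADR 0012: Stream CSV exports instead of buffering

## Status
Accepted

## Context
Exports of large result sets exhausted memory when the whole file was
built up front.

## Decision
Write rows to the response as they are produced.

## Consequences
Memory use is now constant, but the export can no longer report a total
row count in advance.
```

Because future runs read these, the "Context" and "Consequences" sections do most of the work — they're what stops a later agent from quietly undoing a deliberate trade-off.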
A couple of related experiments
There are two smaller experiments I'm running alongside Workflow 2 that are worth mentioning.
The first is using a throwaway proof of concept as a thinking tool. Sometimes I create a branch and let the agent build something quickly without tests or reviews. Not because I trust that code, but because building it helps me discover what I actually want. Then I extract the brief or specification, throw the branch away, and start the real workflow from there.
The second is a skill called /bdd-spec. Working in Workflow 2, I've started to feel a bit disconnected from the code. That's actually what I'm aiming for — but I don't want to feel disconnected from how the application works. So I've been experimenting with a skill that helps me discover one or more Given–When–Then tests for a feature, with the rest of the surrounding infrastructure (the skills, AGENTS.md, comments inside the test) all geared towards preventing agents from changing those tests without me explicitly asking them to. The idea is to keep a thin layer of intent — high level features — that's owned by me.
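A sketch of what such a protected test can look like, in Gherkin — the feature and wording are invented, but the guard comment is the interesting part:

```gherkin
# OWNED BY HUMAN: agents must not modify this file unless I explicitly
# ask for a change to this feature's behaviour.
Feature: Export search results as CSV

  Scenario: Empty result set still produces a valid file
    Given a search that matches no articles
    When I export the results as CSV
    Then I receive a CSV file containing only the header row
```

The comment alone wouldn't be enough, which is why the same instruction is repeated in the skills and AGENTS.md — defence in depth against a helpful agent "fixing" my intent.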
What I actually wanted to say
The workflows took up most of the slides, but they weren't the point. The point was the shift they hint at.
For me, the most interesting thing about working with agents isn't that I get to type less production code. It's the move from writing code, to shaping and guiding the system that generates the code. Context engineering, review loops, safety boundaries, verification steps — these are turning into the actual craft. The skill isn't to understand the underlying matrix operations of these models. It's to learn, by experimenting, how to use them effectively.
I closed the talk with three things, which I'll close this post with too:
1. TDD/BDD gives us a head start. If you've been practising BDD, you've been writing specifications and thinking about behaviour before code for years. That skill transfers directly. Use it.
2. Discipline and empathy are key. Agents aren't mind readers. It's tempting to get lazy and just ask them to fix things. Don't. Give them context and clarity. Use the empathy you'd use with a colleague.
3. Programming the programmer is the new engineering. I increasingly believe that using agents to write code and then reviewing it manually isn't where this is going. The future is creating an environment where we don't need to look at the code, and can ascend to a higher level of development.
I'm not certain about any of that, of course. It's just where I've ended up this week.
PS: the historical framing around increasing levels of abstraction in programming was heavily inspired by Doc Norton's excellent blog post We're About to Unwind Fifty Years of "Progress". Worth reading in full — Doc takes the thinking further than I do, into what programming languages themselves might look like once agents are the primary authors.