
The first post in this series opened with layoffs, regret, and rehiring. It also used autopilot as the precedent that matters: automation does not remove the system; it redistributes control inside it.

That is the design problem for engineering organizations.

The goal is not to maximize agent count. It is to place automation where it reduces load without erasing the handoffs, validation, and judgment the organization still needs.

From diagnosis to operating model

At scale, the unit of design is not the model. It is the workflow the model participates in.

OpenAI frames agents as systems that accomplish tasks through tools and guardrails, not isolated prompts. Microsoft makes the same point operationally in its Azure and Copilot Studio multi-agent guidance: keep orchestration simple, validate typed payloads, persist state across stages, and require human approval for high-impact actions.

The same pattern already shows up in GitHub Copilot’s cloud agent flow. Research happens first. A plan is formed. Work happens on a branch. Checks run. The diff gets reviewed. Only then does the pull request move forward.

A concrete operating model

For a feature request that touches API behavior, UI text, tests, and rollout, a workable flow looks like this:

| Stage | Primary actor | Output | Gate |
| --- | --- | --- | --- |
| Planning | Human lead or planning agent | Task packet with scope, constraints, and done definition | Scope review |
| Implementation | Coding agent | Branch diff plus changed-files manifest | Tests and linters |
| Evaluation | Evaluator agent plus CI | Eval bundle with results, risks, and policy checks | Quality threshold |
| Approval | Human reviewer | Pull request decision | Merge approval |
| Rollout | Release automation plus human oversight | Deployment state and rollback path | Production gate |

That shape matters because each stage has a clear responsibility, a clear artifact, and a clear point where the next actor can say no.
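
To make that concrete, here is a minimal sketch of the staged shape in Python. Everything in it is illustrative rather than a real framework: each stage produces an artifact, and a gate decides whether the handoff to the next actor happens at all.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Artifact:
    stage: str     # which stage produced it
    payload: dict  # task packet, diff manifest, eval bundle, ...

@dataclass
class Stage:
    name: str
    actor: str  # "human lead", "coding agent", "human reviewer", ...
    run: Callable[[Optional[Artifact]], Artifact]  # produces this stage's artifact
    gate: Callable[[Artifact], bool]               # the point where the next actor can say no

def run_pipeline(stages: list[Stage]) -> Optional[Artifact]:
    artifact: Optional[Artifact] = None
    for stage in stages:
        artifact = stage.run(artifact)
        if not stage.gate(artifact):
            # A failed gate stops the handoff instead of letting work drift forward.
            print(f"{stage.name}: gate said no to {stage.actor}'s artifact")
            return None
    return artifact
```

The point of the sketch is the gate: work never flows forward by default.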

The principles that follow are not abstract best practices. They are the places teams usually lose control once agents move past the demo stage.

Principle 0 — Structure before automation

If the workflow is ambiguous, undocumented, or negotiated ad hoc in chat, agents will not fix it. They will execute the ambiguity faster. OpenAI’s guidance is blunt on this point: “clear instructions reduce ambiguity” and improve agent decision-making.

Principle 1 — Use the interfaces the organization already understands

Requirements, specs, design docs, diffs, test plans, and approval records already move work between humans. Agent stages should plug into the same interfaces rather than inventing a parallel AI-only process. OpenAI explicitly recommends teams “use existing documents” when defining agent routines.
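
One hedged way to picture that: a task packet whose fields mirror the sections a human-facing spec already carries, rather than a new AI-only format. The field names below are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TaskPacket:
    # The same sections a spec or design doc already carries.
    scope: str                                                 # what the change covers
    constraints: list[str] = field(default_factory=list)      # what must not change
    done_definition: list[str] = field(default_factory=list)  # acceptance criteria
    design_doc_url: str = ""                                   # link into the existing docs system
```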

Principle 2 — Match orchestration to the shape of the work

Use sequential flow for linear dependencies, specialization only when roles truly diverge, and maker-checker loops where quality is the main problem. Do not choose a pattern because it sounds advanced. Microsoft says to “use the lowest level of complexity that reliably meets your requirements”.
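
As a rough sketch, assuming plain functions stand in for agents, two of those shapes look like this; neither needs a framework.

```python
# Illustrative only: two orchestration shapes built from plain functions.

def sequential(steps, task):
    """Linear dependencies: each step consumes the previous step's output."""
    result = task
    for step in steps:
        result = step(result)
    return result

def maker_checker(make, check, task, max_rounds=3):
    """Quality is the main problem: a maker proposes, a checker accepts or rejects."""
    for _ in range(max_rounds):
        draft = make(task)
        ok, feedback = check(draft)
        if ok:
            return draft
        task = feedback  # fold the checker's objections into the next attempt
    raise RuntimeError("maker-checker loop did not converge; escalate to a human")
```

The escalation at the end is deliberate: a loop that cannot converge is a signal for a human, not a reason to raise max_rounds.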

Principle 3 — Separate planning, execution, and evaluation

One actor can sometimes do more than one role, but the responsibilities should still be distinct. GitHub’s cloud agent flow deliberately separates “research a repository, create a plan, and make code changes on a branch” before review. A system that cannot tell the difference between deciding, doing, and checking will drift.
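
A sketch of what keeping the roles distinct can look like, assuming Python protocols as the role boundary: one actor may implement more than one of these, but the workflow only ever addresses the role.

```python
from typing import Protocol

class Planner(Protocol):
    def plan(self, request: str) -> list[str]: ...  # deciding

class Executor(Protocol):
    def execute(self, step: str) -> str: ...        # doing

class Evaluator(Protocol):
    def evaluate(self, output: str) -> bool: ...    # checking

def run(planner: Planner, executor: Executor, evaluator: Evaluator, request: str) -> bool:
    for step in planner.plan(request):
        output = executor.execute(step)
        if not evaluator.evaluate(output):
            return False  # the checker can say no, regardless of who did the work
    return True
```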

Principle 4 — Put validation and approval inside the workflow

Tests are not cleanup. Human review is not a vague fallback. Deterministic checks and approval gates belong at the points where artifacts are produced and irreversible actions become possible. OpenAI argues that high-risk actions should trigger human oversight, and Microsoft goes further: “require human approvals for high-impact cross-agent actions”.
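
In sketch form, with run_tests, request_human_approval, and do_merge as hypothetical callables rather than a real API, the ordering is the whole point:

```python
def merge_with_gates(diff, run_tests, request_human_approval, do_merge):
    # Deterministic check at the point where the artifact is produced.
    if not run_tests(diff):
        raise ValueError("tests failed; the diff never reaches a reviewer")
    # Approval gate immediately before the irreversible action.
    if not request_human_approval(diff):
        raise PermissionError("merge rejected by the human reviewer")
    # Only now does the irreversible step happen.
    return do_merge(diff)
```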

Principle 5 — Optimize the real constraint, not the busiest stage

If review is slow, improve review. If validation is brittle, improve validation. If context transfer is failing, improve the artifact. More agents do not help if the bottleneck sits elsewhere. Microsoft recommends tracking performance and resource usage metrics for each agent so that you can establish a baseline, find bottlenecks, and optimize.
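
A minimal version of that instrumentation, assuming an in-memory dict where a real team would use its metrics backend:

```python
import time
from collections import defaultdict

stage_seconds: dict[str, list[float]] = defaultdict(list)

def timed(stage_name, fn, *args):
    """Run one stage and record how long it took."""
    start = time.monotonic()
    try:
        return fn(*args)
    finally:
        stage_seconds[stage_name].append(time.monotonic() - start)

def candidate_bottleneck() -> str:
    # The stage with the largest total elapsed time is where to look first.
    return max(stage_seconds, key=lambda name: sum(stage_seconds[name]))
```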

Principle 6 — Keep humans at the irreversible edges

Aviation absorbed autopilot by redesigning supervision and takeover, not by pretending pilots were obsolete. Walterskirchen’s core warning is that “pilots had been promoted from operators to supervisors of the machine”. Engineering organizations should do the same with merges, production changes, policy decisions, and exception handling.
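
A deliberately small sketch of that boundary, with an invented action list: reversible actions can stay automated, irreversible ones route to a person.

```python
# Illustrative categories, not policy: adjust to your own blast radius.
IRREVERSIBLE = {"merge", "deploy_production", "change_policy", "delete_data"}

def route(action, automated_handler, human_handler):
    handler = human_handler if action in IRREVERSIBLE else automated_handler
    return handler(action)
```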

Principle 7 — Clarity before speed

Agentic systems don’t turn bad systems into good ones. They turn unclear systems into faster chaos. That is the lesson behind the layoffs in Part 1, the throughput limits in Part 3, and the autopilot precedent that runs through the series.

What durable scale actually looks like

Taken together, the lesson is narrow and unsentimental: agents belong inside engineering organizations as bounded contributors, not as replacements for the organization itself.

They can draft, implement, and evaluate. But they only create durable leverage when handoffs are clear, validation is built in, and humans remain accountable at the irreversible edges.

The systems that scale will not look like autonomous swarms. They will look like disciplined engineering teams with better tools.

That also turned out to be the right way to write about agentic systems in the first place.

👉 Part 8 is a postmortem on that process, and on why the writing only became good once it was treated as a system too.