Writing this series with AI: a postmortem
This article is part of a series on agentic systems:
- Agentic systems are struggling to scale (this should feel familiar)
- Agentic systems are still in the artisanal era
- Agentic systems are bound by the same fundamental limits
- Artifacts are compression: how systems handle complexity
- Why all systems become pipelines
- Long-lived systems need modularity
- Designing agentic systems for engineering organizations
- Writing this series with AI: a postmortem 👈
This series was mostly written with AI assistance.
I want to say that plainly because pretending otherwise would be silly. What feels more worth saying is that the parts that work did not come from some perfect prompt. They came from turning a messy writing process into something more structured: plans, artifacts, review passes, explicit roles, and a lot of manual correction when the drafts drifted or sounded wrong.
That is also why this felt worth writing down. If the series has a useful point, I more or less learned it while writing it: AI was helpful, but only after I gave it a system to work inside.
The workflow
I did not get here by asking for seven finished essays and polishing the result. The process was much more iterative. Microsoft Research describes LLM-assisted writing as a process where users “actively refine, explore, and co-construct text”, which feels close to what happened here.
| Phase | Primary tool | Output | What mattered most |
|---|---|---|---|
| Discovery | ChatGPT voice chat | Raw ideas, tensions, candidate directions | Speed and breadth |
| Architecture | ChatGPT plus manual synthesis | High-level plan, narrative spine, article map | Coherence |
| Evidence | ChatGPT deep research plus manual triage | Source list, prior art, examples | Source quality |
| Persistence | Markdown repository | Plans, research notes, article briefs | Stable artifacts |
| Drafting | Copilot | Initial article drafts | Throughput |
| Review | Copilot plus manual critique | Gap analysis, rewrite targets, defect patterns | Judgment |
ChatGPT handled discovery, planning, and source gathering. Copilot handled the file-based drafting and review loops. My part was to keep cutting weak sources, fixing tone, and deciding what survived.
In practice the loop was simple:
- Use ChatGPT to brainstorm, outline, and gather sources.
- Turn that into article plans and stable files in the repo.
- Use Copilot to draft and review against those files.
- Manually rewrite the parts that still felt weak, generic, or unsupported.
That separation mattered more than the specific tools.
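For illustration only, that loop can be sketched as a small pipeline over stable artifacts. Everything here is hypothetical and invented for this sketch: the `Article` record, `run_loop`, and the `draft_fn`/`review_fn` callables are not part of any tool I used, they just make the shape of the loop concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """One article in the series, tracked as files rather than chat context."""
    slug: str
    plan: str = ""  # the stable article brief from the planning phase
    draft: str = ""  # the current draft
    issues: list[str] = field(default_factory=list)  # open review findings


def run_loop(article: Article, draft_fn, review_fn, max_passes: int = 3) -> Article:
    """Draft, review, and redraft against the stable plan until the
    review pass comes back clean or the pass budget runs out."""
    for _ in range(max_passes):
        # Drafting always works against the plan plus the last review's findings,
        # never against a fading conversational context.
        article.draft = draft_fn(article.plan, article.issues)
        article.issues = review_fn(article.plan, article.draft)
        if not article.issues:
            break
    return article
```

The point of the sketch is the interface, not the code: drafting and review only ever see the plan and the draft, so the argument lives in files instead of in chat history.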
What actually made it work
If I had to reduce it to a few things, they would be structure, artifacts, and iteration.
OpenAI argues that customers “typically achieve greater success with an incremental approach”. That is a better description of the workflow than any story about fully autonomous generation.
Moving the work out of chat and into plans, briefs, and drafts made it easier to keep the argument stable. Once those artifacts existed, the models had something concrete to work against instead of a fading conversational context.
The useful skill was not inventing the perfect prompt. It was learning how to inspect output, identify the failure mode, and redirect the next pass without reopening the whole problem.
Most of the quality still came from manual judgment: rejecting obscure sources, removing citation clutter, rewriting robotic transitions, and cutting sections that sounded tidy but dead. That is probably the part people underestimate most when they talk about AI-assisted writing.
Where the models struggled
The hardest failures were not grammar or obvious hallucinations. They were structural.
The models were prone to soft argument drift: flattening distinctions, overexplaining points that had already been made, or turning a causal sequence into a list of adjacent observations. That showed up most clearly in endings and transitions, which often sounded like mini summaries instead of the next step in an argument.
They were also too willing to keep weak evidence if it fit the paragraph: obscure sources, repetitive links, or claims that sounded supported until the source was inspected closely.
And then there was the prose itself. It could look polished while still feeling generic. The literature review paper on AI-assisted writing warns that “LLM outputs always appear at first glance to be well written, well informed and thought out, but closer reading reveals gaps, biases and lack of depth”. That matches my experience almost exactly.
The uncomfortable part is that the workflow depended on me being able to notice those problems. If I could not tell when a source was weak or a section had subtly drifted, the process would have produced something smoother and worse.
What I would do differently next time
This is probably the most useful part of the postmortem. The series reached a result I am happy with, but not efficiently.
1. Lock the constraints and artifacts earlier
I would define the thesis, source bar, stylistic anti-patterns, and review rubric much sooner. I would also create a claim map and source map for each article before drafting begins. A lot of the late-stage editing came from discovering those constraints too late.
2. Separate review and measure it
Too many prompts tried to fix everything at once. Next time I would separate factual review, structural review, prose review, and citation review. I would also track time per article, major rewrite passes, source replacements, and human editing time after the first draft. Without that, it is easy to say AI made the process faster and much harder to say where it actually did.
3. Encode the recurring roles into the system
I kept rediscovering the same roles in ad hoc prompts: series architect, source triage reviewer, article planner, structural critic, style enforcer, citation hygiene reviewer. Next time I would codify some of those into reusable skills or specialized agents with clearer contracts.
OpenAI’s guidance is to “maximize a single agent’s capabilities first”, so I would still start small. But GitHub’s documentation notes that “you can create specialized custom agents for different tasks”, and that seems like the right direction for work that keeps repeating the same evaluation patterns.
GitHub also puts this plainly: “The more Copilot cloud agent knows about the code in your repository, the tools you use, and your coding standards and practices, the more effective it will become”. The writing equivalent is straightforward: the more the system knows about source quality, tone, formatting preferences, and the series spine, the less time gets wasted rediscovering them.
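As a sketch of what codifying those roles might look like, here is a hypothetical role contract; the `ReviewRole` type, the rubrics, and `prompt_for` are all invented for illustration and do not correspond to any real agent API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewRole:
    """A recurring review role, codified once instead of rediscovered in ad hoc prompts."""
    name: str
    focus: str  # what this pass is allowed to judge
    rubric: tuple[str, ...]  # the checks it must apply, and nothing else


ROLES = [
    ReviewRole("structural critic", "argument flow",
               ("no drift from the thesis", "endings advance the argument, not summarize it")),
    ReviewRole("citation hygiene reviewer", "evidence",
               ("no obscure sources", "each claim matches its source on close reading")),
    ReviewRole("style enforcer", "prose",
               ("no robotic transitions", "cut sections that sound tidy but dead")),
]


def prompt_for(role: ReviewRole, draft: str) -> str:
    """Build a single-purpose review prompt from the role contract."""
    checks = "\n".join(f"- {c}" for c in role.rubric)
    return (f"You are the {role.name}. Review only {role.focus}.\n"
            f"Apply exactly these checks:\n{checks}\n\nDraft:\n{draft}")
```

Each role reviews one thing against a fixed rubric, which is exactly the property the ad hoc prompts kept losing.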
The process proved the thesis
The series argued that agentic systems do not scale because of intelligence alone. They scale when work is decomposed into artifacts, pipelines, interfaces, and validation loops.
The writing process followed the same pattern. One-shot generation did not produce publishable essays. Structured plans, stable artifacts, repeated review loops, and human intervention did.
The literature review paper is even harsher on this point, warning that “a press-button strategy leaving AI to do the work is a recipe for disaster”. That is exactly right. The more autonomous the drafting looked, the more human the editing had to become.
This also says something about the immaturity of my writing process. I got by because it was only a few articles, but it would not scale to something like a multi-chapter book without the improvements above.
So if there is one lesson I would keep from this process, it is this: AI did not replace the writing system. It made the writing system more necessary.