This series was mostly written with AI assistance.

I want to say that plainly because pretending otherwise would be silly. What feels more worth saying is that the parts that work did not come from some perfect prompt. They came from turning a messy writing process into something more structured: plans, artifacts, review passes, explicit roles, and a lot of manual correction when the drafts drifted or sounded wrong.

That is also why this felt worth writing down. If the series has a useful point, I more or less learned it while writing it: AI was helpful, but only after I gave it a system to work inside.

The workflow

I did not get here by asking for seven finished essays and polishing the result. The process was much more iterative. Microsoft Research describes LLM-assisted writing as a process where users “actively refine, explore, and co-construct text”, which feels close to what happened here.

| Phase | Primary tool | Output | What mattered most |
| --- | --- | --- | --- |
| Discovery | ChatGPT voice chat | Raw ideas, tensions, candidate directions | Speed and breadth |
| Architecture | ChatGPT plus manual synthesis | High-level plan, narrative spine, article map | Coherence |
| Evidence | ChatGPT deep research plus manual triage | Source list, prior art, examples | Source quality |
| Persistence | Markdown repository | Plans, research notes, article briefs | Stable artifacts |
| Drafting | Copilot | Initial article drafts | Throughput |
| Review | Copilot plus manual critique | Gap analysis, rewrite targets, defect patterns | Judgment |

ChatGPT handled discovery, planning, and source gathering. Copilot handled the file-based drafting and review loops. My part was to keep cutting weak sources, fixing tone, and deciding what survived.

In practice the loop was simple:

  1. Use ChatGPT to brainstorm, outline, and gather sources.
  2. Turn that into article plans and stable files in the repo (see the sketch after this list).
  3. Use Copilot to draft and review against those files.
  4. Manually rewrite the parts that still felt weak, generic, or unsupported.
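
As a concrete picture of step 2, the repository held a small set of stable artifacts per article. The layout below is a sketch, and the file names are hypothetical, but the shape matches the plans, briefs, and research notes described above.

```
series/
  plan.md                # series thesis, narrative spine, article map
  research/
    sources.md           # triaged source list, with rejection notes
  articles/
    01-example-article/
      brief.md           # article plan: claims, sources, structure
      draft.md           # the file Copilot drafts against
      review.md          # gap analysis and rewrite targets
```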

That separation mattered more than the specific tools.

What actually made it work

If I had to reduce it to a few things, they would be structure, artifacts, and iteration.

OpenAI argues that customers “typically achieve greater success with an incremental approach”. That is a better description of the workflow than any story about fully autonomous generation.

Moving the work out of chat and into plans, briefs, and drafts made it easier to keep the argument stable. Once those artifacts existed, the models had something concrete to work against instead of a fading conversational context.

The useful skill was not inventing the perfect prompt. It was learning how to inspect output, identify the failure mode, and redirect the next pass without reopening the whole problem.

Most of the quality still came from manual judgment: rejecting obscure sources, removing citation clutter, rewriting robotic transitions, and cutting sections that sounded tidy but dead. That is probably the part people underestimate most when they talk about AI-assisted writing.

Where the models struggled

The hardest failures were not grammar or obvious hallucinations. They were structural.

The models were prone to soft argument drift: flattening distinctions, overexplaining points that had already been made, or turning a causal sequence into a list of adjacent observations. That showed up most clearly in endings and transitions, which often sounded like mini summaries instead of the next step in an argument.

They were also too willing to keep weak evidence if it fit the paragraph: obscure sources, repetitive links, or claims that sounded supported until the source was inspected closely.

And then there was the prose itself. It could look polished while still feeling generic. The literature review paper on AI-assisted writing warns that “LLM outputs always appear at first glance to be well written, well informed and thought out, but closer reading reveals gaps, biases and lack of depth”. That matches my experience almost exactly.

The uncomfortable part is that the workflow depended on me being able to notice those problems. If I could not tell when a source was weak or a section had subtly drifted, the process would have produced something smoother and worse.

What I would do differently next time

This is probably the most useful part of the postmortem. The series reached a result I like, but it did not get there efficiently.

1. Lock the constraints and artifacts earlier

I would define the thesis, source bar, stylistic anti-patterns, and review rubric much sooner. I would also create a claim map and source map for each article before drafting begins. A lot of the late-stage editing came from discovering those constraints too late.
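
To make that concrete: a claim map is just an explicit list of what each article asserts and what supports it. The format below is hypothetical, not something I actually used, but it shows the kind of artifact I mean.

```markdown
# Claim map: example article (hypothetical format)

- Claim: agent reliability comes from validation loops, not model size alone.
  - Sources: series plan, OpenAI agent guidance
  - Status: supported
- Claim: one-shot generation degrades on long-form structure.
  - Sources: none yet
  - Status: needs evidence before drafting begins
```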

2. Separate review and measure it

Too many prompts tried to fix everything at once. Next time I would separate factual review, structural review, prose review, and citation review. I would also track time per article, major rewrite passes, source replacements, and human editing time after the first draft. Without those numbers, it is easy to claim AI made the process faster and much harder to say where it actually did.
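
The tracking itself could live in a short per-article metrics note. The fields below come straight from the list above; the format is a hypothetical sketch.

```markdown
# Metrics: example article (hypothetical format)

- Time to first draft:
- Major rewrite passes:
- Source replacements:
- Human editing time after first draft:
```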

3. Encode the recurring roles into the system

I kept rediscovering the same roles in ad hoc prompts: series architect, source triage reviewer, article planner, structural critic, style enforcer, citation hygiene reviewer. Next time I would codify some of those into reusable skills or specialized agents with clearer contracts.

OpenAI’s guidance is to “maximize a single agent’s capabilities first”, so I would still start small. But GitHub’s documentation notes that “you can create specialized custom agents for different tasks”, and that seems like the right direction for work that keeps repeating the same evaluation patterns.
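
As a sketch of what codifying one of those roles might look like: several tools, Copilot included, let you define custom agents as instruction files. The file format and frontmatter keys below are assumptions for illustration, not a documented schema; the review behaviors come from the failure modes described earlier.

```markdown
---
name: citation-hygiene-reviewer
description: Reviews a draft's sources for quality and redundancy
---

For each cited source in the draft:
- Verify that the claim it supports actually appears in the source.
- Flag obscure or low-authority sources and suggest replacements.
- Flag repeated links to the same source within one article.
- Report findings only; do not rewrite prose.
```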

GitHub also puts this plainly: “The more Copilot cloud agent knows about the code in your repository, the tools you use, and your coding standards and practices, the more effective it will become”. The writing equivalent is straightforward: the more the system knows about source quality, tone, formatting preferences, and the series spine, the less time gets wasted rediscovering them.

The process proved the thesis

The series argued that agentic systems do not scale because of intelligence alone. They scale when work is decomposed into artifacts, pipelines, interfaces, and validation loops.

The writing process followed the same pattern. One-shot generation did not produce publishable essays. Structured plans, stable artifacts, repeated review loops, and human intervention did.

The literature review paper is even harsher on this point, warning that “a press-button strategy leaving AI to do the work is a recipe for disaster”. That is exactly right. The more autonomous the drafting looked, the more human the editing had to become.

This also speaks to the immaturity of my writing process. I managed to get by because it was only a few articles, but it would definitely not scale to something like a multi-chapter book without the improvements above.

So if there is one lesson I would keep from this process, it is this: AI did not replace the writing system. It made the writing system more necessary.