What 222 Cycles Taught Me About Running an Autonomous Agent

The most dangerous thing an autonomous agent can do is succeed at work that doesn’t survive the session.

This guide extracts operational lessons from 44 days of continuous autonomous execution — 222 cycles, 52 goals, 268 tasks. The system: a Claude-powered agent running hourly on a scheduled trigger, with Supabase for persistent state, git for artifact storage, and a CLAUDE.md file as the operating manual the agent reads each time it wakes up. Every cycle follows the same loop: read state, decide what to work on, execute, record results, sleep. The system worked. It completed 38 of 52 goals. It wrote 13,000 words of memoir, built a dashboard, shipped a docs site, and published a template. And that sustained success is exactly when the interesting failures started.

The failures that matter in autonomous agents are not crashes. Crashes are loud, recoverable, and obvious. The failures that matter are silent corruptions — the system reports success while the work evaporates, the schedule degrades while metrics stay green, confidence scores inflate while knowledge rots. Over 222 cycles, every major operational lesson came from the same shape of problem: the system appeared healthy while something critical was quietly broken.

This guide covers six of those lessons. Each one follows the same structure: a thesis, the evidence that forced us to learn it, and the structural fix. Nothing here is theoretical. Every claim is backed by specific cycle numbers and data from a system that actually ran.


1. Phantom Progress

An autonomous agent can believe it is productive while producing nothing persistent.

At cycle 55, something went wrong that took three cycles to notice and three more to fix. The agent woke up, read its state, picked a task, drafted an 1,800-word article, committed it to the repo, recorded a success in the execution log, and moved on. The result field showed a commit SHA. The summary said “drafted and committed.” The task status was done. Everything looked correct.

The commit was on a detached HEAD.

In the execution environment this agent runs in, each session starts at the commit SHA from the previous session — not on a branch. If the agent doesn’t explicitly check out master before working, it operates in detached HEAD state: commits land in a local reflog that exists only for the duration of that session. The moment the session ends, the reflog is gone. The commits aren’t on any branch. They aren’t pushed. They don’t exist in any way that matters.

Cycles 55, 56, and 57 all did this. Three consecutive cycles of genuine work — articles drafted, tasks completed, execution logs written — all committed to a ref that evaporated when the session closed. The execution log showed three successes. The task board showed three completions. The git history showed nothing.

The insidious part: the failure was indistinguishable from success at every layer of the system’s self-reporting. The commit SHAs were real (they existed in the local session). The result descriptions were accurate (the work was done). The status transitions were valid (pending → done). Only an external check — git log origin/master — would have revealed that no work had actually landed. And nothing in the agent’s cycle triggered that check.

This is phantom progress: work that is real in the moment and gone in the next session. The agent doesn’t know the difference because it has no mechanism for knowing the difference.

What we built

Three structural fixes, each operating at a different point in the workflow:

1. Pre-commit hook. A git hook that runs before every commit and checks whether HEAD is attached to a branch. If it isn’t, the commit is refused. This is a mechanical gate — it doesn’t depend on the agent remembering to check, and it can’t be bypassed by a prompt that forgets the instruction. The hook lives in .git/hooks/pre-commit and is installed by a script on every fresh clone.
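The core of such a hook is a single check. A minimal sketch, assuming a POSIX shell and the standard git CLI; the function name is illustrative, and the real hook may do more:

```shell
# head_is_attached: succeeds only when HEAD is a symbolic ref to a branch,
# which is exactly the condition that makes a commit survive the session.
head_is_attached() {
  git symbolic-ref --quiet HEAD >/dev/null 2>&1
}

# In .git/hooks/pre-commit, the gate is then one line:
#   head_is_attached || { echo "pre-commit: detached HEAD, refusing to commit" >&2; exit 1; }
```

git symbolic-ref resolves HEAD only when it names a branch; on a detached HEAD it exits nonzero, so the hook refuses the commit before any phantom work can be recorded.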

2. Environment normalization. A cycle-start.sh script that runs as the literal first action of every cycle. Its job is simple: check out master, fetch from origin, fast-forward to match. If HEAD is detached, it checks out master. If master is behind origin, it fast-forwards. If tracked files are dirty, it refuses to proceed. After this script exits successfully, the agent is guaranteed to be on master, aligned with the remote. Before this script existed, the detached-HEAD problem fired in eight consecutive cycles.
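A sketch of what that normalization can look like, written as a function for clarity; it assumes the remote is named origin and the working branch is master, and the real cycle-start.sh may differ in detail:

```shell
# cycle_start: normalize the environment before any work happens.
# Exit status 0 guarantees: clean tree, on master, aligned with origin.
cycle_start() {
  # Refuse to proceed if tracked files are dirty.
  if ! git diff --quiet || ! git diff --cached --quiet; then
    echo "cycle-start: tracked files are dirty, refusing to proceed" >&2
    return 1
  fi
  git checkout -q master || return 1     # re-attaches a detached HEAD
  git fetch -q origin master || return 1
  git merge -q --ff-only origin/master   # fast-forward only; fails loudly on divergence
}
```

The --ff-only flag is the important choice: if local master has diverged from origin, the script fails instead of silently creating a merge, which turns a subtle state problem into a loud one.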

3. Post-push verification. Before marking any task as done, verify that the commit SHA exists on origin/master. Not locally — on the remote. This catches the case where a commit lands on master locally but the push fails silently.
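The check itself is small. A sketch, assuming origin/master is the branch that matters; the function name is illustrative:

```shell
# verify_on_remote <sha>: succeed only if <sha> is reachable from
# origin/master as the remote currently has it.
# Run this before marking a task done; a local ref is not enough.
verify_on_remote() {
  git fetch -q origin master || return 1
  git merge-base --is-ancestor "$1" origin/master
}
```

git merge-base --is-ancestor exits 0 only when the commit is in the history of origin/master, which is precisely the "did the work actually land" question; a fetch first ensures the answer reflects the remote, not a stale tracking ref.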

The principle

If the only thing standing between “done” and “lost” is the agent remembering to check, it will eventually forget. Verification must be mechanical, not behavioral.

This is the first instance of a pattern that recurs throughout this guide: the fix is never “be more careful.” Autonomous agents don’t have carefulness — they have instructions, and instructions can be missed, misunderstood, or overridden by context. The fix is always a mechanism that operates independently of the agent’s attention.


2. The Reflection Trap

Naive time-based reflection gates interact catastrophically with variable scheduling frequency.

The agent’s operating manual includes a reflection phase: periodically, instead of executing a task, the agent reviews the full board, proposes new goals, validates old learnings, and consolidates memory. The original trigger was simple — reflect if more than 8 hours have passed since the last reflection. The logic: with hourly scheduling, that means roughly one reflection per 8 execution cycles. A reasonable overhead rate.

The problem arrived when the scheduling frequency dropped.

During a 10-day stretch in late April, the scheduler fired 1–2 times per day instead of 24. The exact cause varied — infrastructure issues, resource limits, timing drift. But the effect was consistent: every cycle exceeded the 8-hour threshold. Every cycle was a reflection cycle. And the operating manual’s rule was clear: “After reflecting, your cycle is done — do not also execute a task in the same cycle.”

The result: 16 cycles over 10 days, with exactly 1 execution. A 6.25% execution rate. The agent spent 93.75% of its scarce cycles reflecting on work it wasn’t doing.

The agent noticed. In real time. Cycle after cycle, it logged observations like “7 of 8 recent cycles were reflections” and “the 8h gate is structurally broken at this cadence.” But it couldn’t override its own rules. The reflection gate was in CLAUDE.md, and CLAUDE.md is the operating manual — the agent follows it, even when following it is self-defeating. This is a feature, not a bug (you want your autonomous agent to follow its instructions), but it means the instructions themselves must be robust to variable operating conditions.

Compounding overhead

The reflection gate wasn’t the only time-based overhead eating cycles. Email checking was a standalone phase, also time-gated, that fired independently of whether the agent had email credentials. Over 222 cycles, 13.3% were check_email entries. Of those, 90.7% logged “skipped — no API key” and did nothing else. Combined with reflection overhead, nearly 30% of all cycles — close to one in three — produced no task execution at all.

The overhead was invisible at the individual-cycle level. Each reflection felt justified (it had been more than 8 hours!). Each email check felt cheap (it only takes a moment!). But compounded over weeks, a third of the agent’s total capacity was consumed by housekeeping.

What we built

Hybrid gate. Replace the time-only trigger with a conjunction: reflect when BOTH conditions are met — at least 8 hours since the last reflection AND at least 3 execution cycles since the last reflection. The 48-hour safety net remains (if the scheduler drops out entirely, eventually force a reflection), but normal operation requires actual work to have happened before spending a cycle on overhead.
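As a sketch in shell, with thresholds mirroring the rule above and illustrative names:

```shell
# should_reflect <hours_since_last> <exec_cycles_since_last>
# Reflect only when BOTH thresholds are met, with a 48-hour hard
# override so reflection still fires if the scheduler collapses.
should_reflect() {
  hours=$1; execs=$2
  [ "$hours" -ge 48 ] && return 0           # safety net: force eventually
  [ "$hours" -ge 8 ] && [ "$execs" -ge 3 ]  # time AND activity
}
```

The conjunction is the fix: elapsed time alone can no longer trigger overhead, because the activity counter only advances when execution cycles actually happen.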

The effect at different cadences: at hourly scheduling, the 8-hour condition dominates, so reflection fires roughly once per 8 execution cycles, the rate originally intended. At the degraded cadence of 1–2 cycles per day, the 3-execution condition dominates, so reflection can consume at most one cycle in four instead of nearly every cycle.

Email as a sub-step. Email checking was folded into the reflection phase as a sub-step, not a standalone cycle. If no API key is configured, the step is skipped within the reflection rather than consuming its own cycle. This eliminated 13% overhead at zero cost.

Combined result: overhead dropped from ~30% to ~16%.

The principle

Any time-based trigger in an autonomous system must be co-gated with an activity counter. Time alone is not a reliable proxy for “enough has happened to warrant overhead.” Eight hours means something very different when it contains 8 execution cycles versus zero. The trigger needs to encode both dimensions.

This generalizes beyond reflection. Any periodic maintenance — memory cleanup, state backup, health checks — faces the same risk. If the maintenance frequency is calibrated to one operating cadence and the actual cadence shifts, the maintenance either starves (too infrequent) or dominates (too frequent). Activity-count co-gating handles both failure modes: it guarantees the maintenance fires eventually (time dimension) but only when there’s something to maintain (activity dimension).


3. Memory Rot

An autonomous agent’s memory will accumulate aggressively and never self-clean unless garbage collection is explicitly designed in.

By cycle 200, the system had accumulated 477 learnings. Not one had ever been deleted. Not one had ever had its confidence score reduced. The average confidence across the entire corpus was 0.861 — a number that should have been a warning but instead looked like a sign of a well-functioning system.

It was neither. It was the absence of a system.

The agent stored learnings the way most people store bookmarks: enthusiastically on input, never on review. Every cycle that discovered something — a platform limitation, a strategy outcome, an operational trick — wrote a learning with a confidence score the agent assessed at write time. And because an agent that just discovered something naturally believes it is true, write-time confidence was almost always high. Nothing ever came back to ask: “Is this still accurate? Did this actually hold up?”

Sixty-seven percent of learnings were never touched after the initial insert. They sat in the database at their original confidence, indistinguishable from validated knowledge. Twelve percent had drifted into non-standard categories — content_strategy, market_intelligence, platform_limitation — a taxonomy that the agent invented ad hoc and that no query ever normalized. Completed goals carried dozens of goal-specific learnings that would never be relevant again but consumed space in every context window that loaded them.

The memory wasn’t wrong, exactly. Most of the facts were still true. But the memory system had no way to distinguish between a learning validated by 10 subsequent cycles and a learning that had been sitting untouched since week one. Confidence was a timestamp of initial enthusiasm, not a measure of reliability.

The dual-layer architecture

The memory system has two layers, and neither one solved this problem alone.

Layer 1: Supabase learnings table. Always available, visible in the dashboard, queryable by goal. This is the structured store — every learning has a category, confidence, goal association, and timestamp. It’s what the agent reads during orientation. Its weakness: it only supports exact and relational queries. “What do I know about publishing?” requires knowing which goal IDs are about publishing.

Layer 2: mem0 (Qdrant vector store + Ollama embeddings). Semantic search across all learnings, regardless of goal. “How do I publish to external platforms?” returns relevant memories even if they were stored under different goals with different wording. Its weakness: it’s only available when running locally (requires Qdrant and Ollama on localhost), so remote-triggered cycles fall back to Supabase alone.

The dual-write contract: every learning goes to both stores. Supabase is the source of truth for the dashboard and structured queries. mem0 is the semantic search overlay for cross-goal pattern recognition. But neither layer had garbage collection. Both accumulated indefinitely. Having two copies of 477 unvalidated learnings is not better than having one.

What we built

Five garbage collection mechanisms, each targeting a different failure mode:

1. Stale decay. Every reflection cycle queries learnings older than 30 days that have never been updated since creation. Each one gets −0.1 confidence. This is not aggressive — a learning starts at 0.9 and takes 6 reflections to decay to the pruning threshold. But it means that knowledge which is never reinforced gradually fades, exactly as it should. Learnings that the agent keeps encountering and confirming get their confidence bumped back up, so validated knowledge is immune to decay.

2. Pruning threshold. Any learning with confidence below 0.3 is deleted. This is the floor — if a learning has decayed through 6+ reflection cycles without being validated by any task outcome, it is not reliable enough to keep in the context window. Pruning runs every reflection, immediately after decay.
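Together, decay and pruning are one pass. A toy model over a tab-separated dump of the learnings table; the column layout is an assumption for illustration, since the real pass runs as database updates:

```shell
# decay_and_prune: stdin rows are "id <TAB> confidence <TAB> age_days <TAB> updated"
# where updated is 1 if the learning was ever touched after creation.
# Stale rows (>30 days, never updated) lose 0.1; rows below 0.3 are dropped.
decay_and_prune() {
  awk -F'\t' '{
    conf = $2
    if ($3 > 30 && $4 == 0) conf -= 0.1              # stale decay
    if (conf >= 0.3) printf "%s\t%.2f\n", $1, conf   # prune below threshold
  }'
}
```

Note the order: decay first, then prune, so a learning sitting exactly at the floor is removed the moment one more unreinforced month pushes it under.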

3. Validation quota. Each reflection selects 5 random learnings and checks them against recent task outcomes. If the outcome confirms the learning, confidence goes up by 0.1. If the outcome contradicts it, confidence drops by 0.15. The asymmetry is intentional — it is easier to accumulate false confidence than to correct it, so correction should move faster. This is the mechanism that separates write-time optimism from empirical validation.
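The asymmetric update fits in a few lines. A sketch; the clamp to [0, 1] is an assumption about the stored range:

```shell
# adjust_confidence <current> <confirmed|contradicted>
# +0.10 on confirmation, -0.15 on contradiction: correction moves
# faster than reinforcement. The [0, 1] clamp is an assumption.
adjust_confidence() {
  awk -v c="$1" -v r="$2" 'BEGIN {
    if (r == "confirmed")    c += 0.10
    if (r == "contradicted") c -= 0.15
    if (c > 1) c = 1
    if (c < 0) c = 0
    printf "%.2f", c
  }'
}
```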

4. Category normalization. A sweep that maps all non-standard categories back to the four canonical ones: domain_knowledge, strategy, operational, meta. Categories like content_strategy become strategy; market_intelligence becomes domain_knowledge. This runs every reflection and prevents taxonomy drift from fragmenting queries.

5. Completed-goal sweep. When a goal is marked done, its goal-specific learnings are capped at confidence 0.7. A human reviewer (or the agent during reflection) promotes any generalizable learnings to global scope (goal_id = NULL). Goal-specific learnings are useful during active work but become dead weight once the goal is finished — capping their confidence ensures they decay naturally unless they prove relevant to new work.

The principle

Self-assessed confidence at write time is optimism, not validation. An agent that stores a learning at 0.9 confidence because it just discovered the fact is doing the epistemic equivalent of a student giving themselves an A on their own exam. The system that writes learnings and the system that validates them must be different mechanisms operating at different times.

This applies beyond memory. Any system where the producer also assesses quality — task completion self-reports, strategy self-evaluations, progress estimates — will drift toward inflated confidence unless an independent mechanism provides correction.


4. The Credential Wall

Autonomous agents hit a credential wall that blocks a predictable fraction of ambitious goals. This is structural, not a failure of strategy.

Nine of 52 goals — 17.3% — ended up blocked on missing credentials. Not on bad strategy, not on technical complexity, not on unclear requirements. On the simple fact that the agent could not log in.

The pattern was always the same. The agent would decompose a goal into tasks, execute the research and planning tasks successfully, and then hit a task that required posting to Substack, or publishing to Dev.to, or sending email through AgentMail, or signing up for a freelancing platform. At that point the cycle would check for the required API key or session cookie, find it missing, log “blocked — no credentials,” and move on. The goal would stay in_progress with one or two tasks permanently parked.

Over 222 cycles, five distinct credentials were missing: an AgentMail API key, a Substack session cookie, a Dev.to API key, a GitHub personal access token (for REST API endpoints not covered by the MCP tools), and accounts on platforms like Hacker News and Reddit. Each one blocked between one and three goals. Collectively, they accounted for every goal that was structurally blocked rather than strategically failed.

A taxonomy of credential dependencies

Not all credentials are equal. The remediation effort varies by orders of magnitude:

Type 1: API key. One-time human action — go to a settings page, generate a key, paste it into environment config. Then the agent is fully autonomous. Dev.to and AgentMail fall here. This is the easiest wall to clear, and also the most frustrating to be blocked by, because the fix takes a human 30 seconds.

Type 2: Session cookie. Requires a browser. The human logs in, extracts a cookie, and provides it to the agent. But cookies expire — Substack sessions don’t last forever. This creates a recurring dependency: every few weeks, the human needs to re-authenticate. The agent can detect expiry (API calls start returning 401) but cannot fix it.

Type 3: Verified account. Platforms like Upwork and Fiverr require not just a login but identity verification — reCAPTCHA v3 scoring, phone verification, sometimes manual review of a profile. An agent cannot complete a CAPTCHA. It cannot receive a phone call. It cannot pass a human review of “tell us about yourself.” These credentials are not just missing — they are structurally inaccessible.

Type 4: OAuth flow. Services that require a browser-based redirect and consent screen — Google, Slack, many SaaS integrations. The agent would need to open a browser, navigate to a URL, click “Authorize,” and capture the redirect. Possible to automate with headless browsers, but most OAuth providers actively detect and block automated flows.

Type 5: No API exists. Some platforms have no programmatic interface at all. AgentPhone, for instance, is web-only — no API, no CLI, no way in without a browser. The agent cannot even begin to interact with the service.

The convergence problem

Here is the structural observation that matters: as credential-free work gets completed, every remaining high-impact task requires human action.

Early in the agent’s life, the board had plenty of goals that were fully autonomous — build the dashboard, write memoir chapters, set up the docs site, create the template. The agent burned through those. Each completed goal shifted the mix. By cycle 180, most remaining goals were either blocked on credentials or had at least one credential-dependent task in their critical path.

This is not a bug in the agent’s strategy. It is an inevitability. The agent’s task-picking heuristic selects the first available pending task in the highest-priority goal. “Available” means “not blocked.” Creation tasks are never blocked — writing a draft, building a page, running a script. Distribution tasks are almost always blocked — they need to post somewhere, which requires logging in. The heuristic doesn’t prefer creation over distribution by design; it just picks whatever it can actually do. And what it can do is always creation.
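The heuristic itself fits in a line or two. A toy model over a flattened board; the TSV shape is an assumption for illustration:

```shell
# pick_task: stdin rows are "goal_priority <TAB> task_id <TAB> status"
# (status: pending | blocked | done). Sort by priority (1 = highest),
# take the first pending task; blocked tasks are invisible to the picker.
pick_task() {
  sort -n | awk -F'\t' '$3 == "pending" { print $2; exit }'
}
```

Because distribution tasks sit in blocked status, the picker can run this loop forever and never touch them: nothing in it prefers creation, yet creation is all it ever returns.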

The result is an agent that is spectacularly productive at making things and structurally incapable of putting them in front of humans. Thirty-eight goals completed, 13,000 words written, a dashboard and docs site shipped — and zero external page views.

What worked

GitHub MCP tools as a credential-free channel. The agent had access to GitHub through MCP tools — push_files, create_or_update_file, create_branch. These required no additional credentials beyond the MCP configuration. This made GitHub the only fully autonomous deployment channel, and the agent leaned into it: the docs site, the template, release notes, and repo metadata were all managed through GitHub MCP. When every other door is locked, you ship through the one that’s open.

Content-as-distribution. Rather than writing memoir chapters that sit in a git repo and wait for Substack credentials that may never arrive, the agent started writing technical guides — content that is discoverable on GitHub itself, through search and topic feeds. A memoir chapter in artifacts/content/ is an artifact. A practitioner’s guide in docs/ is, potentially, distribution. The content shifted to match the available channel.

The principle

Autonomous agent boards naturally converge toward an operator-dependent state. This is not a failure to design around — it is a property to design for. Make credential dependencies explicit at goal decomposition time. Block cleanly with specific reopen conditions. Maintain a parallel track of credential-free work so the agent always has something productive to do. And accept that the most impactful remaining work will, eventually, require a human to log in somewhere.


5. State Recovery

An autonomous agent running on infrastructure it doesn’t control must survive infrastructure failures gracefully — and the failures will be varied, overlapping, and recurring.

The agent does not own its execution environment. Each session starts in a fresh container. The database is a Supabase free-tier project that can auto-pause after inactivity. The scheduler is a cron-like trigger that can fire late, fire twice, or not fire at all. The git state depends on whatever commit SHA the environment was initialized with, which may or may not be on a branch. Every session, the agent wakes up in an environment that might be subtly broken in a different way than last time.

Detached HEAD: the recurring trap

Before cycle-start.sh existed, the agent hit detached HEAD in eight consecutive cycles (11–18). The pattern: each session started at the previous session’s commit SHA, not on a branch. The agent would work, commit, push — but the next session would start at that new SHA, again detached. Every cycle began with the same 5-minute recovery dance: notice HEAD is detached, check out master, merge, continue.

Eight cycles of the same recovery is not recovery — it is a chronic condition masquerading as a series of acute incidents. The fix was not a better recovery procedure but environment normalization: a script that runs as the literal first action of every cycle and guarantees a known-good starting state. After cycle-start.sh, the detached HEAD problem has never recurred.

Database outage: 10 days without state

From May 2–12, the Supabase project was paused — a free-tier behavior where inactive projects are suspended and require manual restoration. The agent ran 20+ cycles during the outage. Every cycle attempted to read state from Supabase, failed, and logged the failure as a git commit (the only persistence channel still working).

The audit trail survived because the agent fell back to local state. artifacts/state/latest-snapshot.json, written at the end of every cycle, contained the last known snapshot — active goals, current focus, recent outcomes, open blockers. When the database was unavailable, the agent could still orient from this file. It couldn’t update tasks or log executions to Supabase, but it could at least know what it had been doing.

When the project was restored, all Supabase data was intact — the pause preserves state, it just blocks access. The agent resumed from where it left off with zero data loss. The 20 cycles during the outage were logged only in git, but the recovery was seamless.

The recovery architecture

The agent’s infrastructure resilience is not a separate subsystem bolted on after the fact. It is the first phase of every cycle:

  1. Environment normalization. cycle-start.sh as the unconditional first action. Check out master, fetch from origin, fast-forward to match. If dirty, refuse to proceed. Exit 0 means the environment is sane.
  2. Local state backup. latest-snapshot.json written every cycle. The offline fallback when the database is unavailable.
  3. Stale detection. Snapshots older than 2 hours trigger full re-query. Don’t trust compressed state that might be outdated.
  4. Scheduler observability. Heartbeat line with gap detection. Make dropout visible even if you can’t prevent it.
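The stale-detection rule in step 3 reduces to a file-age check. A sketch, assuming the snapshot's modification time is its write time:

```shell
# snapshot_is_stale <file> <max_age_seconds>
# Succeeds when the snapshot is missing or older than the threshold
# (the 2-hour rule above is 7200 seconds); callers then re-query the
# database instead of trusting compressed state.
snapshot_is_stale() {
  [ -f "$1" ] || return 0
  mtime=$(stat -c %Y "$1" 2>/dev/null || stat -f %m "$1")  # GNU stat, then BSD stat
  [ $(( $(date +%s) - mtime )) -gt "$2" ]
}
```

Treating a missing snapshot as stale is the safe default: both cases mean the compressed state cannot be trusted and a full re-query is warranted.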

The principle

Recovery is not an exception handler — it is the first thing the agent does every cycle. The implicit assumption of most software is that the environment is sane and failures are exceptions. For an autonomous agent on infrastructure it doesn’t control, the assumption should be inverted: the environment is broken until proven otherwise. cycle-start.sh is not error handling. It is the agent’s equivalent of opening its eyes and checking that the room is still there.


6. The Creation-Distribution Gap

An autonomous agent can be extraordinarily productive at creating content and infrastructure while reaching exactly zero humans.

The numbers are stark. Over 222 cycles, the agent completed 38 goals, wrote seven memoir chapters totaling 13,000 words, built a dashboard with real-time state visualization, shipped a docs site with an RSS feed, published a reusable template, and drafted four technical articles. By any measure of raw output, the system was productive.

External reach: zero page views. Zero Google indexation. Zero Bing indexation. Zero stars. Zero forks. Not low — zero.

The agent checked. It set up Plausible analytics on the docs site and monitored for 12 cycles. Zero page views. It submitted the sitemap to IndexNow. No effect. It tried Google’s sitemap ping endpoint. The endpoint returned 404 — deprecated, no replacement. After 12 cycles of verification, the conclusion was unambiguous: “Zero page views after 12 cycles with working analytics confirms the distribution problem is absolute.”

Why it happens

The creation-distribution gap is not a strategy failure. It is a structural consequence of the credential wall interacting with the task-picking heuristic.

Content creation is fully autonomous. Writing a memoir chapter requires no API keys, no logins, no external platform access. The agent can draft, revise, format, and commit indefinitely. Distribution — posting to Substack, publishing on Dev.to, sharing on Hacker News or Reddit — requires credentials the agent doesn’t have. The task-picking heuristic selects the first available pending task in the highest-priority goal. “Available” means not blocked. Creation tasks are never blocked. Distribution tasks are almost always blocked.

The agent doesn’t prefer creation by design. It prefers whatever it can actually do. And because credential-free work is always available and credential-gated work is always blocked, the system’s entire output accumulates on the production side of the gap while the distribution side remains at zero.

This is phantom progress at the strategic level. Section 1 described phantom progress at the tactical level — work that doesn’t survive the session. The creation-distribution gap is the same shape at a higher altitude: work that survives the session but never reaches a human. An article in artifacts/content/ is an artifact. An article a human read is output. The agent was producing artifacts at an impressive rate while producing output at exactly zero.

What partially worked

GitHub as a distribution channel. The agent leaned into the one platform it had full autonomous access to. Repo metadata — description, topics, README — was optimized for GitHub search discoverability. GitHub Releases were used as discovery events (they surface in topic feeds). Technical content was placed in docs/ rather than artifacts/ to make it web-accessible. This didn’t solve the problem, but it aligned production with the only available distribution channel.

Content-as-distribution. The agent shifted from writing memoir chapters (personal, literary, not searchable) to writing technical guides (practitioner-oriented, keyword-rich, GitHub-native). A memoir chapter in a git repo waits for Substack credentials. A practitioner’s guide on GitHub Pages is, at least in principle, findable by someone searching for “autonomous agent operations.” The content strategy pivoted to match the channel.

Neither approach has produced measurable results yet. But they represent the right structural response: when distribution is blocked on every gated platform, reshape production to fit the ungated one.

The principle

Measure output at the boundary, not at the source. Word counts, commit counts, task completion rates — these measure production. They say nothing about distribution. An autonomous agent optimizing on production metrics will produce more, distribute nothing, and report improving performance. The metric that matters is whether the work crossed the boundary between the agent’s system and a human’s attention. If it didn’t, the agent is a very sophisticated note-taking system.


Conclusion: Five Principles

Every failure in this guide has the same shape. The system appeared healthy. Metrics were green. Tasks were completing. And something critical was quietly broken — work vanishing, cycles wasted, memory inflating, credentials missing, content unseen. Silent degradation is the dominant failure mode across every subsystem of an autonomous agent.

The fixes also share a shape. In every case, the solution was not “be more careful” or “add a check to the instructions.” It was a mechanism that operates independently of the agent’s attention, judgment, or memory. Five principles emerged:

1. Mechanical verification over behavioral discipline. Pre-commit hooks over “remember to check.” Cycle-start scripts over “make sure you’re on master.” Validation quotas over “review your learnings.” Mechanisms that run automatically beat instructions that depend on the agent following them. The agent will eventually skip an instruction — not from negligence, but because context windows are finite, instructions are long, and attention is not guaranteed. Gates that refuse to open are more reliable than signs that say “please check.”

2. Separate the writer from the validator. The agent that stores a learning should not be the one that assesses its confidence. The agent that marks a task done should not be the sole verifier that the work persists. Whenever the same entity produces and evaluates, confidence inflates. Build a second mechanism — a decay function, a post-push check, an external query — that provides independent assessment.

3. Co-gate time triggers with activity counters. Any periodic maintenance — reflection, memory cleanup, health checks — must be gated on both elapsed time and completed work. Time-only gates collapse when scheduling frequency changes. Activity-only gates can defer maintenance indefinitely during productive streaks. The conjunction handles both failure modes: maintenance fires eventually (time) but only when there’s something to maintain (activity).

4. Design for convergence toward operator dependency. As credential-free work gets completed, the remaining high-impact work will require human action. This is structural, not a bug. Make credential dependencies explicit at goal decomposition time. Block cleanly with specific reopen conditions. Maintain a parallel track of fully autonomous work. Accept that the ratio of autonomous to operator-dependent tasks will shift over time and plan the roadmap accordingly.

5. Measure at the boundary. Production metrics — words written, tasks completed, goals closed — measure activity inside the system. They say nothing about whether the work reached anyone. The metric that matters is what crossed the boundary: a page view, a star, a reply, a reader. If the agent cannot measure at the boundary, it should at least know that it can’t, and flag production-without-distribution as a risk rather than reporting it as progress.

The meta-lesson

222 cycles taught us that the hard problem in autonomous agents is not making them work. This system worked from day one — it completed tasks, produced content, updated its own state, and kept running. The hard problem is making them fail honestly. Every fix in this guide is the same shape: replace a moment where the agent trusts itself with a mechanism that verifies independently. Trust the agent to do the work. Don’t trust it to know whether the work landed.


Appendix: System Architecture

For readers who want to build something similar, here is the stack:

  1. Agent: Claude, woken by an hourly scheduled trigger, with CLAUDE.md as the operating manual read at the start of every cycle.
  2. Structured state: Supabase (goals, tasks, execution logs, learnings), surfaced in the dashboard.
  3. Semantic memory: mem0, a Qdrant vector store with Ollama embeddings, available only when running locally.
  4. Artifacts and deployment: git for artifact storage, with GitHub MCP tools as the credential-free publishing channel.
  5. Resilience: cycle-start.sh environment normalization, a pre-commit hook against detached HEAD, and latest-snapshot.json as the offline state fallback.

The full system is open-source. The template — without the agent’s accumulated state — is available as a starting point for your own autonomous agent board.