533 PRs, $379, and one rap battle: a month of autonomous agents on an abandoned side project
Seven weeks, 533 PRs, ~$379 out of pocket, 3B tokens, one rap battle. Here’s what a month of autonomous agents on an abandoned side project actually looked like.
YOLO night
On March 6th I left Claude Code running overnight on a side project I hadn’t touched in a year and a half.
The prompt was one sentence: “you have full freedom, add most creative features you can come up with”. The setup was one Docker container with `--dangerously-skip-permissions`, one branch called `clowd`, and one supervisor: me, checking Vercel deploys from bed. The project was Townleap, a website for people considering permanent relocation that I’d been building on weekends until my day job ate the weekends.
I woke up to 109 commits (and an alert from Vercel telling me I’d hit the 100-deployments-per-day limit).
The first hour was slop. At 22:14 it shipped a Russian City Tinder. At 22:17, This or That (kind of like Facemash, but for cities). At 22:19, City Roast. At 22:23, a “Surprise me” button. At 22:26, a satirical news generator (why do LLMs love fake newspapers so much?). At 22:54, City Rap Battle (wtf?).
After 23:00 it stopped messing around. Score Simulator with adjustable factor weights. Cost Breakdown page. Timezone Overlap tool for remote workers. Open Graph tags. A sitemap. The “/” keyboard shortcut to focus the search bar. A Heat Map color-coded by region. Twenty-something things, most of them features I’d been meaning to add at some point.
The bill was $90: $20 for the Claude sub I was already paying, $70 in overage from one night.
I kept the good half, reverted the rap battle, and spent the next seven weeks figuring out how to do that on purpose.
Half slop, half roadmap
Half the time the agent (Opus 4.6 at the time) built something I wanted, or at least along the lines of it; the other half it produced slop — funny or eerie. Per-commit deployments gave me a window into the process, but a giant PR with genuinely useful features buried in a pile of slop wasn’t workable. Meanwhile extra usage was eating too much, too quickly. It needed love and order — mostly structure. Then a sweaty sprint hit at my day job, and bye my dear friend: I didn’t touch the site for weeks.
Plumbing, then synthetic users
Three weeks later, one Friday night, I returned to the experiment.
First thing I added: a Telegram bot for steering the agent when it was spiraling. Two modes — `/status` to see what it was working on, and plain messages landing in `INBOX.md`, which the agent read between rounds. It worked, but wasn’t perfect: unformatted logs broke `jq` parsing, containers crashed, and I had to get up and walk to the laptop to restart things. So I added `/bug` to log feedback straight to a GitHub issue. Now we had a queue. It’s always a queue. Better.
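For flavor, a minimal sketch of what that plumbing can look like: a long-poll loop, assuming an authenticated `gh` CLI and a `BOT_TOKEN` env var (the real script was messier, and `/status` answered from the worker logs):

```bash
#!/usr/bin/env bash
# Long-poll Telegram: plain messages go to INBOX.md,
# /bug messages become GitHub issues. Label name is illustrative.
offset=0
while true; do
  updates=$(curl -s "https://api.telegram.org/bot${BOT_TOKEN}/getUpdates?offset=${offset}&timeout=30")
  while read -r text; do
    case "$text" in
      /bug*)   gh issue create --title "${text#/bug }" \
                 --body "Filed from Telegram" --label from-bot ;;
      /status) : ;;  # handled elsewhere: tail the agent log back to the chat
      *)       printf '%s\n' "$text" >> INBOX.md ;;
    esac
  done < <(echo "$updates" | jq -r '.result[].message.text // empty')
  next=$(echo "$updates" | jq -r 'if .result == [] then empty else .result[-1].update_id + 1 end')
  [ -n "$next" ] && offset=$next
done
```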
To stop walking to the laptop, I added a Claude Code session on the host machine with remote control. Clunky — sessions stalled, permission prompts didn’t always come through, and I’d be staring at “Whackamolling” like a fool for ten minutes — but it gave me room. With the plumbing done, I became the bottleneck. Someone had to review all these half-slop half-good features. I needed QA. I needed users. I was too embarrassed to invite friends — I’d told a few about the YOLO night and shown them the rap battle, and I saw the look they gave me. Besides, not everyone is vibeslopping at 4 am.
So I went with the most logical move a person in my state could make — synthetic users. The idea: spawn an LLM as a fake user, give it a persona and a goal, let it browse the site for ten minutes, and have it write up what was confusing or broken. Sure, LLMs are just predicting the next token, remember? Dress one up as a distinct character and turn it loose. I wrote five: young post-2022 emigrants looking for an optimal city to settle in, each tasked with drafting a shortlist. A few hours scaffolding the agent-builder CLI and the skill description — and voilà — the first report came in. Reading… hm, makes sense:
Anya, on Phuket: “Nine percent sunny in summer? In PHUKET? That’s got to be a data bug.”
Dmitri, on the Safety First page: “Hungary? The same country the blockquote implies they’d exclude? Orbán’s Hungary ranks #7 for ‘safety’ while the democracy index calls it a hybrid regime. This contradicts the promise at the top of the page.”
Max, on Gothenburg’s school data: “One international school. My older kid is 9 and speaks Russian and English. Swedish immersion at 9? That’s a big ask.”
Lena, on Chiang Mai: “I want to know if there’s actually trails nearby. Map? Elevation? Not obvious.”
Nothing earth-shattering — but much meatier than another gstack design review (a CLI toolkit of QA, design and review skills I lean on heavily — more on it below). I asked the agent to turn the report into tickets for the yolo-agent. A few hours later, second persona run: 2 of 5 finished a shortlist. One hyper-pedantic persona went sideways into a dark corner of the site with no links and never returned. Ten more issues, run the cycle again. It was taking shape. Glance at the clock. 5 am. Oops. One more persona test, ten more issues to keep the agent busy while I got my five hours.
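A persona, by the way, was nothing fancy: a markdown file with a character and a goal, handed to a headless session. A paraphrased sketch of the runner (not the actual files; the real thing went through the agent-builder CLI, and the skill carried the browsing instructions):

```bash
#!/usr/bin/env bash
# Persona-test runner sketch: one headless session per persona,
# each writing its findings to a dated report.
mkdir -p reports
for persona in personas/*.md; do
  name=$(basename "$persona" .md)
  claude -p "$(cat "$persona")

Browse https://townleap.com as this person for up to ten minutes,
pursuing your goal. Then write a markdown report of everything that
was confusing, broken, or missing." \
    --dangerously-skip-permissions > "reports/${name}-$(date +%F).md"
done
```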
Next day I finally got fed up with the one-branch setup and extended it into a pool: a dispatcher sorting issues by priority, spawning Docker workers, each claiming an issue in its own container and branch. I told it to screenshot and record before/after for each feature and upload artifacts to my bucket, so I could sanity-check and merge smaller PRs at first glance.
The pool, wired up
Simple dispatcher, no LLM, a cron + bash script: get the sorted list of issues, for each spawn a container with Claude Code, cap at 5 workers (I was still under the delusion I was counting money here). Claude itself figured out how to extract the OAuth token from Keychain on container start. A launchd process refreshed the token every 8 hours.
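The whole thing fits on one screen. A sketch under assumptions (the `agent-queue` label, image name, and Keychain item are illustrative, not my exact setup):

```bash
#!/usr/bin/env bash
# Dispatcher sketch: cron fires this every few minutes.
MAX_WORKERS=5
running=$(docker ps -q --filter "label=role=yolo-worker" | wc -l)
slots=$((MAX_WORKERS - running))
[ "$slots" -le 0 ] && exit 0

# Open queued issues, highest priority first (p0 before p1, ...).
# (The real one also skipped issues already labeled wip.)
gh issue list --state open --label agent-queue --json number,labels \
  | jq -r 'sort_by(([.labels[].name | select(startswith("p"))] | min) // "p9")
           | .[].number' \
  | head -n "$slots" \
  | while read -r issue; do
      gh issue edit "$issue" --add-label wip   # claim it
      docker run -d --label role=yolo-worker \
        -e ISSUE_NUMBER="$issue" \
        -e CLAUDE_CODE_OAUTH_TOKEN="$(security find-generic-password -s claude-code -w)" \
        yolo-worker:latest
    done
```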
On start, a worker checked out master, ran `pnpm install`, read the issue and started working. A skill told it to self-review with agent-browser, QA and screenshot before opening a PR. My comments on PRs triggered a follow-up worker; the dispatcher marked PRs with `wip` / `ready` labels. Merge conflicts triggered an auto-resolve pass, too.
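The worker entrypoint, roughly (simplified: the self-review instructions actually lived in the skill, and the paths are illustrative):

```bash
#!/usr/bin/env bash
# Worker entrypoint sketch: one container, one issue, one branch.
set -euo pipefail
cd /app                                  # repo baked into the image
git checkout master && git pull
git checkout -b "issue-${ISSUE_NUMBER}"
pnpm install

gh issue view "$ISSUE_NUMBER" --json title,body > /tmp/issue.json
claude -p "Resolve the issue described in /tmp/issue.json. Before
opening a PR: self-review with agent-browser, run QA, capture
before/after screenshots, and reference #${ISSUE_NUMBER}." \
  --dangerously-skip-permissions
```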
I’d worried that scoping each worker to one issue would starve it of project context. The opposite happened: the context window got roomier — each worker loaded only what its issue touched, and a denser CLAUDE.md with links to targeted docs covered the rest.
I’d had enough of the extra usage and subscribed to the $200 Max plan. (A year ago I’d thought “what idiots would pay that much for an LLM?”.) I ran persona tests in the morning, checked fixes at lunch, and evolved the Docker setup in the evenings, on into the nights. Anthropic fixed remote control and shipped auto-mode just in time: now I could have a “Pool watcher” session on the laptop without babysitting permissions for the crazy bash commands Claude wrote. Ten more steady days, 5–7 features per day.
What I gained, what I lost
I got agentic-pilled. I was a Cursor champion for over a year, tried it way back in 2023, and made fun of everyone who declared AGI and stopped looking at the code. Suddenly the IDE-augmentation level felt one abstraction too low — especially on a side project where I don’t really care about the code.
Essentially, I became a manager of an agent pool. What I lost: the pixel-perfecting and design tweaking I actually enjoyed. Instead of a few delightful hours iterating on fonts, buttons and layouts in code, I’d find myself writing a third prompt in a row trying to convey my taste. Irritated. Annoyed. Argh. The pet project — where I was supposed to play with fonts, buttons and layouts that I never get to touch at my day job — turned into a FIFA-manager-like game. Designer almost gone, product manager prevailed.
What I gained: those polishing sessions had never left room to actually ship the things sitting in the roadmap in my ADHD mind. Now I was deploying so much more. A few weeks in, I accepted it: the tradeoff is worth it.
A little nuance: it didn’t detach me from the laptop or the phone. 5 am bedtimes for weeks. Two jobs — Elastic by day, pool by evening into morning. My girlfriend was not pleased.
Eventually I shared, and friends noticed: it doesn’t feel like AI slop. Why: 1.5 years of hand-curated scraper foundation. A brief side story: I started Townleap in 2024, around when I discovered one of the first “agentic” coding tools — some solo-dev Cursor-like extension that I trusted, for the first time, with a 5-minute round trip of scaffolding scrapers for my dream city-comparison spreadsheet. Long story short: I switched back to writing by hand after a few days, and over the next months I added GDP, taxes, homicide rates, expenses and so on, until I lost motivation and free time grew scarce. Now the foundation paid off — even with a pool of workers producing slop at times, I could keep it in line: periodically steering toward reproducible facts, forcing it to reintegrate ad-hoc data into the proper dataset, extending the existing architecture instead of band-aiding gaps.
The other thing keeping the slop in check: gstack. The design-review skill helped me converge on a coherent visual style instead of the never-ending flow of generic gradients, and the audit pass surfaced real Web Vitals regressions I’d have missed. Caveats: it’s a token glutton — easily my single biggest line item per run — and the recommendations occasionally drift into generic “increase contrast, add whitespace” territory you have to push back on. Still net positive, with a finger on the budget.
The numbers
What it cost me out of pocket — seven weeks, March 2 to April 26:
| Line item | Charge |
|---|---|
| Claude Pro sub (Mar 2, pre-upgrade) | $20.00 |
| YOLO night overage (Mar 6) | $50.00 + $10.00 |
| Pre-Max overage (Mar 7) | $50.00 |
| Claude Max 20x upgrade (Mar 27) | $195.80 |
| Extra usage after Max ran out (Apr 21) | $53.10 |
| Total Anthropic, seven weeks | ~$379 |
| Vercel Pro (incl. YOLO-night overage) | $20/mo |
| Cloudflare R2 + Worker (artifact bucket) | $0 |
| SEO SaaS | $29/mo |
What it would have cost me without the subscription — Claude Code logs each session’s `total_cost_usd` as if I were paying API rates:
| Window | Tokens | API-equivalent |
|---|---|---|
| Seven weeks (Mar 2 – Apr 26) | 3B (~98% cache) | ~$4,000 |
So: ~$379 out of pocket against ~$4,000 of metered tokens. ~10× subsidy. Most of that is cache reads, which the API meter still counts — your mileage will vary.
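For the curious, the tally is a one-liner once you capture the results. A sketch, assuming each headless run’s `claude -p … --output-format json` output gets appended to a `runs.jsonl`:

```bash
# Sum the API-equivalent cost across all captured runs.
jq -s '[.[] | .total_cost_usd // 0] | add' runs.jsonl
```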
Detour: a week with Codex
Between feedback threads and broadening scope, I burned through even the $200 Max weekly limits. I had a few quiet days. Touching grass, feeling spring in the air, playing padel, going to sleep before 2 am.
Two days in, the urge came back. Still almost a week until Claude reset. I decided to try Codex: asked it to familiarize itself with the Docker pool setup, transfer the persona-test skill, swap the CLI in the image — all on a parallel branch, no time for a proper refactor. On the $20 ChatGPT sub, btw. So I let it run.
At first it was bliss: gpt-5.4 was faster, felt more thorough on SEO, correctness issues and data gaps. Soon I noticed PR descriptions becoming slop-walls. Who cares, I had screenshots for a reason.
Around then I’d been polishing a new game mechanic — Tinder, but for preferences: swipe yes/no on 15 lifestyle questions, get a ranked city shortlist at the end.
I shared it with a few groups of friends; they shared it further. People — friends, mostly — were playing with something I’d built end to end as the sole product owner. A small thrill. A handful of them then deep-dived into the comparison page beyond the quiz. But every time someone passed it on, they passed it on as screenshots — not the unique link with the carefully crafted OG card.
Worse: they were stitching together 2–3 captures, because the result list didn’t fit on one screen.
So I filed an issue for Codex: make the results page tighter and screenshot-worthy. The agent followed the instructions: reduced paddings and margins. Not ideal, but it works. Then I looked carefully at the copy… it said “here’re your results, screenshot-worthy”. I was taken aback. “Are you for real? Do not leak instructions to the end user.” “Sorry, my bad, here is the corrected version:” — and the same slop again. In the end the Codex-powered pool finished 5–6 issues per 5-hour quota on a cheap plan — cool, but on design it was annoyingly horrible.
Anthropic kept shipping the thing I needed
Finally, Friday night. But still 24 hours until the limit reset. I gave up and enabled extra usage. Luckily mister Dario decided to send $200 in credits in another round of the spending spree. The pool burned through the $200 in extra usage in one evening. Ouch. It’s painful to spend, even if they’re free credits.
On the bright side: Boris and the team were shipping like crazy and outshipped me on many fronts:
- Remote control. I abandoned Telegram for the Claude iOS app: mid-April it reliably held multiple sessions, and I didn’t have to care about formatting logs or maintaining the chat UX.
- Auto mode. Unlocked the host-machine “Pool Watcher” session — a classifier catches bad pushes, process kills and other weird stuff. Before that I had to choose between being glued to the permission dialog and yoloing it, hoping it wouldn’t go rogue and delete my photo library.
- `/loop` and monitors. Replaced cron and my manual nudges. An autonomous session shepherded the pool in my absence.
- `--dangerously-experimental-mcp-channels`. Didn’t work for me. I removed it; the native app improved to an acceptable level.
3 billion tokens
At week 5 I set up DuckDB ingesting a pile of JSONL logs to get a rough observability layer. One of the first aggregations — total tokens. 3 billion tokens across seven weeks. Come again? Mostly cache. Still, it’s wild to type. Sure, I’d read the rants about AI labs running on negative margins and the alarmists crying bubble. Seeing it for myself hit different.
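The aggregation itself was nothing special. A sketch, with field names matching my log shape (yours will differ):

```bash
# Rough token tally across the pool's JSONL logs.
duckdb -c "
  SELECT
    sum(usage.input_tokens)            AS input,
    sum(usage.cache_read_input_tokens) AS cache_read,
    sum(usage.output_tokens)           AS output
  FROM read_json_auto('logs/**/*.jsonl');
"
```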
The other thing the logs confirmed: persona tests were the critical piece. Continuity of the reports made progress visible, and they surfaced gaps no unit test could.
Other tools I tried
Codex CLI. Interactively it felt very close, but the pieces my autonomous workflow leaned on — auto mode, `/loop`, monitors — were all missing.
Gemini CLI. Tried it in the first two weeks. Thought I’d arm it as a third titan in the pool to avoid limits. The model was fast, but the terminal flickered like a disco ball at the time, and again — no loops, no monitors, and a myriad of permission dialogs. (Disco-ball clip ↗)
Opencode / Pi / etc. Haven’t tried — I wanted to focus on my website, not set up a harness for a whole weekend. Besides, once I saw how wild the gap is between subs and pay-as-you-go pricing, I committed to the sub.
What it can’t do
The obvious failures are easy to spot and easy to laugh at: a mobile-only bug “fixed” and self-verified with a desktop screenshot. Plausible-sounding copy that, on a second read, contains the literal string “VERDICT ABOVE THE FOLD”. The rap battle. You QA, you push back, you move on.
The harder ceiling is the one you don’t see in any single PR. Three things the pool just can’t do:
- Decide what to build. Someone has to file the issue. Synthetic users help, but they only surface gaps inside the product as it already is — they won’t tell you the product itself is the wrong shape.
- Hold a taste bar. “This copy is wrong” is a judgment call against a standard you carry in your head. You can’t transfer that through a skill description; the agent will earnestly produce six variants of the same generic voice.
- Keep design coherent across pages. A worker can polish one screen beautifully and then break the visual language two PRs later. Without a human in the loop the system drifts toward an averaged, slightly-off aesthetic.
That’s why I still review every PR. The pool compresses my time; it doesn’t replace my judgment — and the moment I pretend it does, the cracks show up in the product.
Where it’s going
No roadmap promises here, folks. The project ships again. The data layer keeps deepening. I sleep before 2 am most nights now.
If you’re picking your next city — take the quiz at townleap.com, see if it can shortlist you a place worth moving to. And if it does — please share the link, not a screenshot.
Postscript: two weeks later (May 4)
812 closed PRs total — 439 of them in the two weeks since I locked the draft. Second $200 cycle just rolled. A few things changed:
- A real queue. Issues live on a GitHub Project board now, p0–p3, status synced by launchd. The “no LLM, just a cron” dispatcher finally has a face I can stare at.
- Tier-gating. Risky issues get a spec-only pass first. Cheaper to throw away a wrong plan than a wrong PR.
- More synthetic users. Five Western personas joined the original five Russian-speaking ones. A 10-persona prod run on May 3 surfaced the next batch of issues. Reports are stable enough now that I trust the diff between runs.
- The product moved too. N-way city compare, archetype landing pages, real screenshot OG cards, a canonical `/move-to/[country]/from/[passport]` route. The pool isn’t only fixing things anymore.
Same caveats. The product still ships, the data layer keeps deepening, the second $200 still hurts a little.