I built Refinery to test a thesis: that a small portfolio of AI workstreams could out-generate any human researcher if you let it run overnight on cron and scored results against a rubric. Twelve workstreams. Prompt research, eval design, content hooks, arch patterns, monetization, news reactor, trending takes. Every night the system would spin up, generate hypotheses, run them, score them, keep the winners, discard the losers.
Three thousand four hundred eighty-nine experiments later, I shut it down.
The final grader-era keep count was zero.
Three failures compounded. None of them were about the AI. All of them were about the ground truth.
Failure one: a baseline bug
The scoring rubric compared each new experiment against a historical baseline. The baseline got recomputed nightly from "best results so far." Sounds reasonable. What it actually did was drift upward over time as lucky outliers crept in, then punish every future experiment for not beating the luckiest result on the luckiest night. The rubric became self-ratcheting — a ladder with no rungs below.
By the time I noticed, the system was flagging objectively good work as regressions because it was being measured against the noise ceiling.
Failure two: orphaned targets
Several workstreams had no downstream consumer. News reactor produced takes that never became articles. Trending takes drafted openers that never became threads. The experiments kept running, kept getting scored, kept being archived — but nothing was pulling them into the real world.
An experiment that nobody reads is not a failed experiment. It's not even an experiment. It's a cron job producing compost.
Failure three: no ground truth
The fatal one. For a research lab to learn, it needs signal from outside the lab. Refinery graded itself. A rubric I wrote, applied by a model I picked, on outputs that the same system produced. It was an echo chamber with extra steps.
The grader never said "this one, actually good." It said "this one scored 0.7 against the baseline on the rubric I gave you." Those are not the same sentence.
What I would have needed
Real readers. Real buyers. Real open rates. Some signal that lives outside the model and the rubric and the cron schedule. Not a human-in-the-loop for every experiment — that would have killed the overnight runs. But a downstream loop that made one percent of the work matter to somebody.
I did not have that. I kept telling myself the scores would eventually correlate. They did not. They were never going to.
The retirement
I stopped the cron schedules on April 18. Archived the 3,489 experiment rows. Wrote the postmortem. Kept the codebase as a reference for future work on scoring pipelines — the orchestration, the nightly dispatch, the workstream pattern — all of it is usable. The diagnosis is not "Refinery was bad." The diagnosis is "Refinery was complete without being connected."
The next version, if there is one, starts at the reader. Not the experiment.