I've recently been building workflows using [Claude Routines](https://claude.ai/code/routines), which let you launch Claude background sessions with a cron job or webhook.

When Routines were released, subscribers could schedule 15 Claude sessions via cron (up from 3). That let me move from a "CI helper" to a complete workflow, which enabled much more personal coding:

```mermaid
xychart-beta
    title "Commits in 8 Personal Projects, Last 6 Months (Oct 13, 2025 – Apr 19, 2026)"
    x-axis ["Oct 13", "Oct 20", "Oct 27", "Nov 3", "Nov 10", "Nov 17", "Nov 24", "Dec 1", "Dec 8", "Dec 15", "Dec 22", "Dec 29", "Jan 5", "Jan 12", "Jan 19", "Jan 26", "Feb 2", "Feb 9", "Feb 16", "Feb 23", "Mar 2", "Mar 9", "Mar 16", "Mar 23", "Mar 30", "Apr 6", "Apr 13"]
    y-axis "Commits" 0 --> 70
    line [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 9, 4, 8, 5, 0, 17, 27, 44, 11, 18, 65, 16, 45, 56, 67, 60]
```

The workflow is composed entirely of stateless jobs (five Claude Routines, one serverless container) that treat GitHub as a [deque](https://en.wikipedia.org/wiki/Double-ended_queue) of issues and PRs. The cost is flat for Claude subscribers. The Routines are scoped to specific repos, and I filter to only my own comments, issues, and PRs to avoid adversarial input.

The separation of concerns across jobs makes each portion easy to scale quickly (example: to open more PRs for human review, schedule more PR agents). Cost and token usage can be similarly managed via model selection and the frequency of each job.

Because the jobs are stateless, each portion of the workflow is self-healing in the case of outages at Anthropic, GitHub, or Scaleway. Details of the five-agent workflow are below; here's the [repo for the deterministic job](https://github.com/dominicburkart/PRodder) and the gists for all of the Routine prompts: [issue fifo](https://gist.github.com/DominicBurkart/6bd9a9a665e583bc94abdabae89f9b04), [issue lifo](https://gist.github.com/DominicBurkart/4bf30c9611b701c5eb083c88563fdf21), [pr fifo](https://gist.github.com/DominicBurkart/92218c722a3b936b45d204220e4c8392), [pr lifo](https://gist.github.com/DominicBurkart/d5e66363d485a09513f69bcfe75a39f6), [arch + quality](https://gist.github.com/DominicBurkart/e364d530f893c985f59c12d8dee8fa09).

# Architecture

```mermaid
flowchart LR
    Start(("start")) --> WriteIssues[/"human with research agents writes issues"/]
    WriteIssues --> IssueDeque[("GitHub Issues")]
    IssueDeque -- LIFO --> ImplLatest[["latest N issue owner
    cron"]]
    IssueDeque -- FIFO --> ImplOldest[["oldest N issue owner
    cron"]]
    ImplLatest --> DraftPRs[("GitHub PRs
    draft:true")]
    ImplOldest --> DraftPRs
    ArchQuality[["arch + quality agents
    cron"]] -. agent-sourced PRs .-> DraftPRs
    Dependabot[["dependabot etc.
    cron"]] -. deterministic source .-> ReadyPRs[("GitHub PRs
    draft:false")]
    DraftPRs -- LIFO --> CIAgentLatest[["latest N PR owner
    cron"]]
    DraftPRs -- FIFO --> CIAgentOldest[["oldest N PR owner
    cron"]]
    CIAgentLatest --> MergeableJudge{"Judge: Work Complete & Mergeable?"}
    CIAgentOldest --> MergeableJudge
    MergeableJudge -- no --> DraftPRs
    MergeableJudge -- yes --> PromoteDraft["promote to draft:false"]
    MergeableJudge -- misaligned
    issue still open, has link to failed attempt --> PRClosed(("PR closed"))
    PromoteDraft --> ReadyPRs
    ReadyPRs --> HumanReview[/"human review"/] & RebaseOutOfDate
    RebaseOutOfDate -. base merged .-> DemoteDraft
    DemoteDraft -. demoted .-> DraftPRs
    HumanReview --> Merged(("PR merged")) & PRClosed
    HumanReview -- demoted with comments --> DraftPRs

    subgraph guard["deterministic job"]
            DemoteDraft["demote to draft:true
            if unmergeable"]
            RebaseOutOfDate["update PRs out-of-date
            with base branch"]
    end

    style WriteIssues fill:#7B5BA5,color:#fdfbf7,stroke:#5A4080
    style HumanReview fill:#7B5BA5,color:#fdfbf7,stroke:#5A4080
    style ImplLatest fill:#CC6600,color:#fdfbf7,stroke:#8F4700
    style ImplOldest fill:#CC6600,color:#fdfbf7,stroke:#8F4700
    style ArchQuality fill:#CC6600,color:#fdfbf7,stroke:#8F4700
    style Dependabot fill:#2D6DB1,color:#fdfbf7,stroke:#1E4A7A
    style CIAgentLatest fill:#CC6600,color:#fdfbf7,stroke:#8F4700
    style CIAgentOldest fill:#CC6600,color:#fdfbf7,stroke:#8F4700
    style DemoteDraft fill:#2D6DB1,color:#fdfbf7,stroke:#1E4A7A
    style RebaseOutOfDate fill:#2D6DB1,color:#fdfbf7,stroke:#1E4A7A
    style IssueDeque fill:#f5f0e8,color:#2d2d2d,stroke:#0D7B6B
    style DraftPRs fill:#f5f0e8,color:#2d2d2d,stroke:#0D7B6B
    style ReadyPRs fill:#f5f0e8,color:#2d2d2d,stroke:#0D7B6B
    style MergeableJudge fill:#ffffff,color:#1a1a2e,stroke:#1a1a2e
    style Merged fill:#ffffff,color:#1a1a2e,stroke:#1a1a2e
    style PRClosed fill:#ffffff,color:#1a1a2e,stroke:#1a1a2e
    style PromoteDraft fill:#ffffff,color:#1a1a2e,stroke:#1a1a2e
    style Start fill:#ffffff,color:#1a1a2e,stroke:#1a1a2e
```
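The deterministic guard job in the diagram reduces to two checks per ready PR. Here's a minimal sketch of that decision logic, with hypothetical field names loosely mirroring the GitHub REST pull-request object (the real PRodder implementation differs):

```python
# Hypothetical sketch of the deterministic guard job's decision logic.
# Field names (draft, mergeable, behind_by) are assumptions loosely
# based on the GitHub REST API; the actual PRodder code differs.

def guard_actions(ready_prs):
    """For each non-draft PR, decide whether to demote it to draft
    (unmergeable) or update its branch (behind the base branch)."""
    actions = []
    for pr in ready_prs:
        if pr["draft"]:
            continue  # the guard only touches ready-for-review PRs
        if pr["mergeable"] is False:
            # Conflicts with base: send it back to the draft queue
            # so a PR agent picks it up on the next cron run.
            actions.append((pr["number"], "demote_to_draft"))
        elif pr["behind_by"] > 0:
            # Cleanly mergeable but stale: update the branch so CI
            # re-runs against the current base.
            actions.append((pr["number"], "update_branch"))
    return actions
```

Keeping the guard deterministic means it never needs a model call: it only reads PR metadata and flips draft status, which is cheap to run on a tight schedule.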

Claude.ai/code supports orchestration: each individual agent can access multiple PRs and fan out to many subagents (for a PR agent, as of writing: 15 PRs × (1 review subagent + 1 CI-iteration subagent + 1 comment-addressing subagent) = 45 subagents total).

<figure style="margin: 1.5em 0;">
  <img src="/assets/claude_routine_pr_agent_session.png" alt="A PR agent session in the claude.ai/code UI, showing a session with access to 8 repos, 45 subagents launched, and actions taken incrementally as subagents completed." style="width: 100%; display: block; margin-left: auto; margin-right: auto; border-radius: 8px; height: auto;" />
  <figcaption style="text-align: center; font-style: italic; margin-top: 0.5em;">A code agent spawning 45 subagents to manage 15 draft PRs</figcaption>
</figure>

[LIFO](https://en.wikipedia.org/wiki/Stack_(abstract_data_type)) + [FIFO](https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics)) is a common pattern in [scheduling algorithms](https://en.wikipedia.org/wiki/Work_stealing) and distributed systems. The FIFO path reserves half of the agents for the oldest issues and PRs, on the assumption that they are the most complex and will require the most turns to complete. The LIFO component addresses the newest issues and PRs as quickly as possible: some can be implemented and approved for human review in one or a few iterations, and system throughput should not be blocked by saturating every agent with the most complex tasks.
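The deque split reduces to taking both ends of a creation-date-sorted list. A minimal sketch, assuming a hypothetical item shape (anything carrying a `created_at` key):

```python
# Sketch of splitting open work items between the two agent paths:
# one batch from the oldest end of the deque, one from the newest.
# The item shape is an assumption, not a GitHub API payload.

def split_deque(items, n):
    """Return (oldest_batch, newest_batch) of up to n items each,
    never assigning the same item to both batches."""
    ordered = sorted(items, key=lambda it: it["created_at"])
    oldest = ordered[:n]
    newest = [it for it in reversed(ordered) if it not in oldest][:n]
    return oldest, newest
```

In the actual workflow this selection happens inside each Routine's prompt rather than in code; the sketch is just the semantics.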

For the implementation of most issues, human attention is only necessary to create the issue and to approve the completed PR. PR agents can also elect to close PRs based on negative reviews; future issue agent iterations can then identify failed prior attempts during the code research phase. Issue agents can't close issues: I want to determine what gets worked on and what doesn't.

# Scaling

Scaling a job-based system like this is easy: schedule more jobs that produce the thing that you want, probably PRs open for human review.

I was predominantly interested in orchestration approaches to context management. As of writing, individual sessions struggle with more than 50 subagents and subscribers can make up to 15 Routine calls per day, so after a job reaches a few dozen subagents, scaling horizontally by scheduling more jobs was the easiest approach.

Context window optimization is harder to do in background sessions. There's an observability gap here: Claude sessions lack dedicated observability configuration, and granular data on a session's context utilization would be useful.

# Models & Evaluation

I duplicated the PR agents, running half with Opus and half with Sonnet (4 agents total: 2 LIFO and 2 FIFO), since some problems need more turns in CI while others need more attention and planning, and I wanted to collect longitudinal data on outcomes with both.

My small-N reviews indicate Sonnet 4.6 is better at PR iteration than Opus 4.7 (1 mil), both in terms of PR promotions and closing PRs, but worse in terms of premature promotion. Opus appears to promote and demote less because it is slower and more brittle, and so often fails to complete.

I only looked at a few dozen sessions, though, so that finding is qualitative. I haven't collected the dataset needed to evaluate issue agents or interaction effects between issue and PR agents, and there are a lot of externalities: the repo's context changes constantly through edits by me and the arch + quality agents, on top of changes to the claude.ai/code agents, harnesses, etc. (there was nearly a week where GitHub failed to retain authentication for more than 2 hours, for example).

Given that Mythos and Sonnet 4.7 are both arriving imminently, I think the right move here is to write a script so I can re-run the analysis as new models come out, or on some other cadence (quarterly?). That could also help me evaluate context changes in my repo, though only post-facto. For now, given the externalities, I don't think a formal evaluation system with direct comparisons would be productive.
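Such a script could reduce to grouping PR outcomes by model label. A sketch, with an assumed record shape rather than a real GitHub API payload:

```python
from collections import defaultdict

# Hypothetical sketch of the re-runnable analysis: given per-PR outcome
# records (shape assumed), compute outcome rates per model so runs are
# comparable across model releases or quarters.

def outcome_rates(records):
    """records: iterable of {"model": str, "outcome": str}, where
    outcome is one of "promoted", "closed", "stalled"."""
    counts = defaultdict(lambda: defaultdict(int))
    for r in records:
        counts[r["model"]][r["outcome"]] += 1
    rates = {}
    for model, by_outcome in counts.items():
        total = sum(by_outcome.values())
        rates[model] = {o: c / total for o, c in by_outcome.items()}
    return rates
```

Because the inputs come from historical GitHub data, the same script can be re-run against any time window after a model or context change.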

# Cost

As noted, this workflow has a flat cost. Routines optimize usage around the Anthropic subscription's window-based metering, and any individual job's cadence can be adjusted, for example to conserve tokens for foreground agents during certain periods.

Currently, Anthropic subscription-based usage is extremely inexpensive compared to API billing: community estimates vary, but are consistently over a 90% reduction (Anthropic does not publish official figures). Rightsizing the subscription based on usage is easy, given that there are essentially two levers: does my limiter reduce to "usage" as metered by Anthropic, and how does performance change when I use a different model?

Though I would like more granular control of model selection while orchestrating fanouts, I see that as more of an industry problem than a background agent problem. Coding agents need access to their own coding agents so they can defer well-defined, specialized work to cheaper, specialized models. From a resource optimization perspective, it seems likely that an ecosystem of many specialized models for specific stacks/use-cases will complement the smaller number of models trained by the provider running the background agent.

# Alternatives & Limitations

I used a deque approach for processing issues and PRs because it's easy and it works, but we can just as easily pull from the middle ([GitHub stores issues and PRs in MySQL](https://github.blog/engineering/infrastructure/mysql-high-availability-at-github/)). It's not difficult to prioritize work differently, for example via:
- human attention: a Routine could be dedicated to shadowing the PRs I am working on with foreground agents during times I am inactive (work hours, overnight).
- smallest PRs: use the deterministic job to find the 15 PRs with the smallest diffs, then launch a PR agent for them via a webhook.
- structural blockers: ask a derivative of the PR agent to identify PRs that block resolution on the most number of issues/other PRs, then iterate on them.
- random chance: Routines can select open issues/PRs entirely at random, so long as two Routines don't run at the same time. Since Routines time out, non-overlap can be scheduled, but that's brittle (Anthropic can change timeout behavior).
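The smallest-PRs strategy, for example, is little more than a sort over diff sizes. A sketch (`additions`/`deletions` match the GitHub pulls API; the rest is an assumption, not the actual deterministic job):

```python
# Sketch of the smallest-PRs prioritization: pick the n open PRs with
# the smallest diffs to hand to a PR agent via webhook. The additions/
# deletions fields mirror the GitHub pulls API; everything else is assumed.

def smallest_prs(open_prs, n=15):
    """Return the n open PRs with the smallest total diff size."""
    return sorted(open_prs, key=lambda pr: pr["additions"] + pr["deletions"])[:n]
```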

Since the current deque architecture is working well for me, I would prioritize an eval suite set up to be able to track changes in my workflow or in the repo's context as discussed under Models & Evaluation, even if it's correlational/post-facto.

With this approach, my foreground agent work is dedicated to large, central projects (like building an eval system for a coding harness) and changes that require CI write access (I don't let Claude.ai/code edit CI, which has been a useful guardrail).

# Conclusion

Background agents are my preferred UX for interacting with coding agents. I like the interfaces provided by Claude Code and have found them useful enough to build a workflow around them. GitHub can be configured with some practical guardrails that facilitate agentic development (example: background agents can't modify CI configurations), and you can get surprisingly far in terms of orchestration design without writing any code. Some code can provide inexpensive guardrails and further conserve human attention. Having everything run as stateless jobs makes it easy to maintain and scale.

Since I have something that works, the next step is optimization. I noted strategies other than my LIFO/FIFO approach that I can evaluate using historically available GitHub API metrics: rate of lines of code, number of issues resolved, number of dedicated subagents per PR or line of code. I also noted that delegation interfaces for specialized models would likely be key for cost reduction. I'm not in a rush, though: the workflow has already improved my productivity, and the GitHub data needed for evaluation isn't going away.