# Dagain
DAG-based orchestration for coding agents
## The Forgetting Problem
AI coding agents can write entire features. They can debug gnarly race conditions. They can refactor legacy code that nobody wants to touch.
But give them a task that takes more than twenty minutes, and they fall apart.
Not because they're stupid. Because they forget. Every turn of conversation dilutes the original goal. By turn fifty, the agent is confidently solving the wrong problem. By turn one hundred, you're babysitting.
This isn't a model limitation. It's an architecture problem. We've been running agents in chat threads when we should be running them on work graphs.
Dagain (DAG + "again") is our answer: a directed acyclic graph that knows how to retry.
## Work as a DAG
Dagain models work as a directed acyclic graph. Each node is a discrete unit:
```
plan → implement → implement → verify → integrate
                 ↘ implement ↗
```
Nodes have three properties:
- Inputs — the context this node needs
- Outputs — what it produces
- Dependencies — what must finish first
The graph is the source of truth. Not the conversation. Not the agent's memory. The graph.
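
To make the node model concrete, here is a minimal TypeScript sketch. The field names and status values are illustrative assumptions, not Dagain's actual schema.

```ts
// Hypothetical shape of a node in the work graph (illustrative only).
type NodeStatus = "pending" | "running" | "done" | "failed" | "checkpoint";

interface WorkNode {
  id: string;           // e.g. "task-001"
  title: string;        // human-readable description of the unit of work
  status: NodeStatus;
  inputs: string[];     // context keys this node needs
  outputs: string[];    // artifact keys this node is expected to produce
  dependsOn: string[];  // ids of nodes that must finish first
}

// A tiny graph: plan fans out to two implement nodes, which converge on verify.
const graph: WorkNode[] = [
  { id: "plan",   title: "Plan the feature", status: "done",    inputs: [],          outputs: ["plan.md"],  dependsOn: [] },
  { id: "impl-a", title: "Implement API",    status: "running", inputs: ["plan.md"], outputs: ["api.diff"], dependsOn: ["plan"] },
  { id: "impl-b", title: "Implement UI",     status: "pending", inputs: ["plan.md"], outputs: ["ui.diff"],  dependsOn: ["plan"] },
  { id: "verify", title: "Run tests",        status: "pending", inputs: ["api.diff", "ui.diff"], outputs: ["report.md"], dependsOn: ["impl-a", "impl-b"] },
];
```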
## Fresh Context, Every Time
When Dagain executes a node, it builds a packet: a small file containing only what that node needs to run. The goal. The relevant prior decisions. The specific artifacts it depends on.
Then it launches a fresh process with that packet.
This is the opposite of "keep cramming context into the prompt." Each node starts clean, with surgical context. No accumulated confusion. No goal drift.
The agent can still access shared state—Dagain exposes database pointers via environment variables ($DAGAIN_DB, $DAGAIN_NODE_ID). But it pulls what it needs. It doesn't carry everything forever.
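
As a rough sketch of the packet idea (the `Packet` shape and `buildPacket` helper are hypothetical, introduced only for illustration):

```ts
// Illustrative packet assembly: copy only what one node declares it needs.
interface Packet {
  goal: string;                        // the node's objective
  decisions: string[];                 // relevant prior decisions, not the whole transcript
  artifacts: Record<string, string>;   // only the artifacts this node depends on
}

function buildPacket(
  node: { title: string; inputs: string[] },
  store: Map<string, string>,          // shared artifact store (backed by SQLite in Dagain)
  decisions: string[],
): Packet {
  const artifacts: Record<string, string> = {};
  for (const key of node.inputs) {
    const value = store.get(key);
    if (value !== undefined) artifacts[key] = value;  // surgical context: declared inputs only
  }
  return { goal: node.title, decisions, artifacts };
}
```

The fresh process receives this packet plus the `DAGAIN_DB` and `DAGAIN_NODE_ID` pointers from its environment, and nothing else.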
## Agents Propose, Supervisors Apply
Here's where it gets interesting.
Runners don't modify the graph directly. They can't mark themselves done. They can't spawn new tasks. They can only propose changes by returning structured results:
```json
{
  "status": "success",
  "next": {
    "setStatus": [{"id": "task-001", "status": "done"}],
    "addNodes": [{"id": "task-002", "title": "Add tests"}]
  }
}
```
The supervisor—a separate process—parses these proposals and applies them to the graph. This separation is load-bearing:
- The graph is always consistent
- Every state change is logged
- A misbehaving agent can't corrupt the work
- Recovery is trivial: just replay from the last good state
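
To see why this separation holds up, here is a simplified sketch of the supervisor side, using the proposal shape from the example above; the `applyProposal` function and its validation rules are illustrative, not Dagain's actual implementation:

```ts
// Illustrative supervisor step: validate a runner's proposal, then apply it and log it.
interface Proposal {
  status: "success" | "failure";
  next?: {
    setStatus?: { id: string; status: string }[];
    addNodes?: { id: string; title: string }[];
  };
}

type Graph = Map<string, { title: string; status: string }>;

function applyProposal(proposal: Proposal, graph: Graph, log: string[]): void {
  const setStatus = proposal.next?.setStatus ?? [];
  const addNodes = proposal.next?.addNodes ?? [];

  // Validate everything before touching the graph, so a bad proposal changes nothing.
  for (const s of setStatus) {
    if (!graph.has(s.id)) throw new Error(`unknown node: ${s.id}`);
  }
  for (const n of addNodes) {
    if (graph.has(n.id)) throw new Error(`duplicate node id: ${n.id}`);
  }

  // Apply, logging every state change; the runner itself never touches the graph.
  for (const s of setStatus) {
    graph.get(s.id)!.status = s.status;
    log.push(`setStatus ${s.id} -> ${s.status}`);
  }
  for (const n of addNodes) {
    graph.set(n.id, { title: n.title, status: "pending" });
    log.push(`addNode ${n.id}`);
  }
}
```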
## SQLite is the Memory
All state lives in .dagain/state.sqlite:
| Table | Purpose |
|---|---|
| nodes | The work graph |
| deps | Node dependencies |
| kv_latest | Current artifacts and context |
| kv_history | Full history of all artifacts |
| mailbox | Control messages (pause, resume, cancel) |
This means:
- Crash? Restart and continue from where you stopped.
- Want to inspect state? `sqlite3 .dagain/state.sqlite "SELECT * FROM nodes"`
- Need to debug a failure? The full history is right there.
No Redis. No external services. One file, version-controllable, inspectable, portable.
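
For programmatic inspection, any SQLite client works. Here is a sketch using the third-party better-sqlite3 package; only the table names come from the list above, the rest is generic:

```ts
// Sketch: read-only inspection of Dagain's state with the better-sqlite3 package.
import Database from "better-sqlite3";

const db = new Database(".dagain/state.sqlite", { readonly: true });

// Dump the work graph and the latest artifacts.
console.log(db.prepare("SELECT * FROM nodes").all());
console.log(db.prepare("SELECT * FROM kv_latest").all());
```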
## Failure is a First-Class Citizen
Real work fails. Dagain expects this—it's in the name.
Each node has a retry policy. When retries are exhausted, Dagain doesn't just give up—it escalates. The failure bubbles up to the nearest planning node, which triggers a replan. If that fails, it escalates further. All the way to the root if necessary.
When human judgment is required, nodes enter a checkpoint state. Execution pauses, you make the call, and the decision is recorded permanently. No re-asking. No context loss. The graph remembers.
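
A minimal sketch of this escalation rule, assuming a simple parent pointer between nodes; the types and function are illustrative, not Dagain's internals:

```ts
// Illustrative escalation: retry until the policy is exhausted, then bubble the failure
// up to the nearest planning ancestor; with no planner left, pause at a checkpoint.
interface TaskNode {
  id: string;
  kind: "plan" | "execute";
  parent?: TaskNode;
  attempts: number;
  maxRetries: number;
}

type FailureAction =
  | { action: "retry"; target: TaskNode }
  | { action: "replan"; target: TaskNode }
  | { action: "checkpoint"; target: TaskNode };

function onFailure(node: TaskNode): FailureAction {
  if (node.attempts < node.maxRetries) {
    return { action: "retry", target: node };     // retry policy not yet exhausted
  }
  for (let p = node.parent; p; p = p.parent) {
    if (p.kind === "plan") {
      return { action: "replan", target: p };     // escalate to the nearest planner
    }
  }
  return { action: "checkpoint", target: node };  // wait for a human decision
}
```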
## Runner Pools: Graceful Degradation
Not all tasks need the same firepower. A simple rename? Codex handles it. A gnarly refactor with subtle type constraints? Maybe Claude should take over.
Dagain supports runner pools—configure a role with multiple runners, and Dagain promotes through them on failure:
```json
{
  "roles": {
    "executor": ["codexMedium", "codex", "claude"]
  },
  "supervisor": {
    "runnerPool": {
      "mode": "promotion",
      "promoteOn": ["timeout", "missing_result", "spawn_error"],
      "promoteAfterAttempts": 2
    }
  }
}
```
The logic is simple:
- Start with the first runner (cheapest, fastest)
- If it times out or crashes, promote immediately to the next
- If it returns a valid result but the task still fails, retry twice before promoting
- Once you hit the last runner, stick with it
This gives you cost efficiency by default, with automatic escalation when things get hard. Most tasks complete on the cheap runner. The expensive one only spins up when needed.
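
Here is a sketch of that promotion logic, mirroring the config above; the state shape and function are illustrative, not Dagain's actual code:

```ts
// Illustrative promotion logic for a runner pool.
type FailureKind = "timeout" | "missing_result" | "spawn_error" | "task_failed";

interface PoolState {
  runners: string[];   // e.g. ["codexMedium", "codex", "claude"]
  index: number;       // which runner is currently active
  attempts: number;    // failed attempts on the current runner
}

const PROMOTE_ON: FailureKind[] = ["timeout", "missing_result", "spawn_error"];
const PROMOTE_AFTER_ATTEMPTS = 2;

function nextRunner(state: PoolState, failure: FailureKind): string {
  const atLastRunner = state.index === state.runners.length - 1;
  const promote =
    PROMOTE_ON.includes(failure) ||                 // crash/timeout: promote immediately
    state.attempts + 1 >= PROMOTE_AFTER_ATTEMPTS;   // valid result, but task failed twice

  if (promote && !atLastRunner) {
    state.index += 1;                               // escalate to the next runner
    state.attempts = 0;
  } else {
    state.attempts += 1;                            // stick with the current (or last) runner
  }
  return state.runners[state.index];
}
```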
## Parallelism That Doesn't Break Things
Dagain supports concurrent execution with --workers N. Independent nodes run in parallel. Dependencies are respected automatically.
For code changes, parallelism usually means merge hell. Dagain solves this with git worktrees: each worker operates in an isolated directory. Merges happen one at a time, with automatic conflict detection. You get speed without chaos.
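
A sketch of the worktree idea: each worker gets its own checkout via `git worktree`, so edits never collide on disk. The directory layout and branch names below are assumptions, not Dagain's actual scheme:

```ts
// Sketch: give each worker an isolated git worktree for its code changes.
import { execFileSync } from "node:child_process";

function createWorkerWorktree(repoRoot: string, workerId: number, baseBranch = "main"): string {
  const dir = `${repoRoot}/.dagain/worktrees/worker-${workerId}`;
  const branch = `dagain/worker-${workerId}`;
  // `git worktree add -b <branch> <path> <base>` checks out a separate working directory.
  execFileSync("git", ["worktree", "add", "-b", branch, dir, baseBranch], { cwd: repoRoot });
  return dir;
}
```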
## You're Always in Control
Dagain includes a chat interface connected to the same database:
```bash
npx dagain chat
```
From there you can:
- See what's running, what's blocked, what failed
- Pause execution mid-flight
- Inject decisions or override plans
- Query the full history
The graph runs autonomously. But you're never locked out.
## Try It
```bash
npx dagain init --goal "Add authentication to the API"
npx dagain run --live

# In another terminal
npx dagain status
npx dagain chat
```
## Why This Exists
The bottleneck on AI coding isn't intelligence. It's architecture.
Long-running agents need durable state. Complex tasks need decomposition. Failures need structured recovery. Humans need visibility and control.
Dagain is the missing layer: a work graph that persists, a packet model that keeps context fresh, and an execution model that treats failure as normal. The agent does the work. The graph keeps it honest.
Open source: github.com/knot0-com/dagain