# Dagain
DAG-based orchestration for coding agents
## The Forgetting Problem
AI coding agents can write entire features. They can debug gnarly race conditions. They can refactor legacy code that nobody wants to touch.
But give them a task that takes more than twenty minutes, and they fall apart.
Not because they're stupid. Because they forget. Every turn of conversation dilutes the original goal. By turn fifty, the agent is confidently solving the wrong problem. By turn one hundred, you're babysitting.
This isn't a model limitation. It's an architecture problem. We've been running agents in chat threads when we should be running them on work graphs.
Dagain (DAG + "again") is our answer: a directed acyclic graph that knows how to retry.
## Work as a DAG
Dagain models work as a directed acyclic graph. Each node is a discrete unit:
```
plan → implement → implement → verify → integrate
                 ↘ implement ↗
```
Nodes have three properties:
- Inputs — the context this node needs
- Outputs — what it produces
- Dependencies — what must finish first
The graph is the source of truth. Not the conversation. Not the agent's memory. The graph.
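
To make the node model concrete, here is a minimal TypeScript sketch. The field names and status values are illustrative assumptions, not Dagain's actual schema.

```ts
// Hypothetical shape of a node in the work graph (illustrative only).
type NodeStatus = "pending" | "running" | "done" | "failed" | "checkpoint";

interface WorkNode {
  id: string;           // e.g. "task-001"
  title: string;        // human-readable description of the unit of work
  status: NodeStatus;
  inputs: string[];     // context keys this node needs
  outputs: string[];    // artifact keys this node is expected to produce
  dependsOn: string[];  // ids of nodes that must finish first
}

// A tiny graph: plan fans out to two implement nodes, which converge on verify.
const graph: WorkNode[] = [
  { id: "plan",   title: "Plan the feature", status: "done",    inputs: [],          outputs: ["plan.md"],  dependsOn: [] },
  { id: "impl-a", title: "Implement API",    status: "running", inputs: ["plan.md"], outputs: ["api.diff"], dependsOn: ["plan"] },
  { id: "impl-b", title: "Implement UI",     status: "pending", inputs: ["plan.md"], outputs: ["ui.diff"],  dependsOn: ["plan"] },
  { id: "verify", title: "Run tests",        status: "pending", inputs: ["api.diff", "ui.diff"], outputs: ["report.md"], dependsOn: ["impl-a", "impl-b"] },
];
```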
## Fresh Context, Every Time
When Dagain executes a node, it builds a packet: a small file containing only what that node needs to run. The goal. The relevant prior decisions. The specific artifacts it depends on.
Then it launches a fresh process with that packet.
This is the opposite of "keep cramming context into the prompt." Each node starts clean, with surgical context. No accumulated confusion. No goal drift.
The agent can still access shared state—Dagain exposes database pointers via environment variables ($DAGAIN_DB, $DAGAIN_NODE_ID). But it pulls what it needs. It doesn't carry everything forever.
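
As a rough sketch of the packet idea (the `Packet` shape and `buildPacket` helper are hypothetical, introduced only for illustration):

```ts
// Illustrative packet assembly: copy only what one node declares it needs.
interface Packet {
  goal: string;                        // the node's objective
  decisions: string[];                 // relevant prior decisions, not the whole transcript
  artifacts: Record<string, string>;   // only the artifacts this node depends on
}

function buildPacket(
  node: { title: string; inputs: string[] },
  store: Map<string, string>,          // shared artifact store (backed by SQLite in Dagain)
  decisions: string[],
): Packet {
  const artifacts: Record<string, string> = {};
  for (const key of node.inputs) {
    const value = store.get(key);
    if (value !== undefined) artifacts[key] = value;  // surgical context: declared inputs only
  }
  return { goal: node.title, decisions, artifacts };
}
```

The fresh process receives this packet plus the `DAGAIN_DB` and `DAGAIN_NODE_ID` pointers from its environment, and nothing else.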
## Agents Propose, Supervisors Apply
Here's where it gets interesting.
Runners don't modify the graph directly. They can't mark themselves done. They can't spawn new tasks. They can only propose changes by returning structured results:
```json
{
  "status": "success",
  "next": {
    "setStatus": [{"id": "task-001", "status": "done"}],
    "addNodes": [{"id": "task-002", "title": "Add tests"}]
  }
}
```
The supervisor—a separate process—parses these proposals and applies them to the graph. This separation is load-bearing:
- The graph is always consistent
- Every state change is logged
- A misbehaving agent can't corrupt the work
- Recovery is trivial: just replay from the last good state
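
To see why this separation holds up, here is a simplified sketch of the supervisor side, using the proposal shape from the example above; the `applyProposal` function and its validation rules are illustrative, not Dagain's actual implementation:

```ts
// Illustrative supervisor step: validate a runner's proposal, then apply it and log it.
interface Proposal {
  status: "success" | "failure";
  next?: {
    setStatus?: { id: string; status: string }[];
    addNodes?: { id: string; title: string }[];
  };
}

type Graph = Map<string, { title: string; status: string }>;

function applyProposal(proposal: Proposal, graph: Graph, log: string[]): void {
  const setStatus = proposal.next?.setStatus ?? [];
  const addNodes = proposal.next?.addNodes ?? [];

  // Validate everything before touching the graph, so a bad proposal changes nothing.
  for (const s of setStatus) {
    if (!graph.has(s.id)) throw new Error(`unknown node: ${s.id}`);
  }
  for (const n of addNodes) {
    if (graph.has(n.id)) throw new Error(`duplicate node id: ${n.id}`);
  }

  // Apply, logging every state change; the runner itself never touches the graph.
  for (const s of setStatus) {
    graph.get(s.id)!.status = s.status;
    log.push(`setStatus ${s.id} -> ${s.status}`);
  }
  for (const n of addNodes) {
    graph.set(n.id, { title: n.title, status: "pending" });
    log.push(`addNode ${n.id}`);
  }
}
```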
## SQLite is the Memory
All state lives in .dagain/state.sqlite:
| Table | Purpose |
|---|---|
| nodes | The work graph |
| deps | Node dependencies |
| kv_latest | Current artifacts and context |
| kv_history | Full history of all artifacts |
| mailbox | Control messages (pause, resume, cancel) |
This means:
- Crash? Restart and continue from where you stopped.
- Want to inspect state? `sqlite3 .dagain/state.sqlite "SELECT * FROM nodes"`
- Need to debug a failure? The full history is right there.
No Redis. No external services. One file, version-controllable, inspectable, portable.
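
For programmatic inspection, any SQLite client works. Here is a sketch using the third-party better-sqlite3 package; only the table names come from the list above, the rest is generic:

```ts
// Sketch: read-only inspection of Dagain's state with the better-sqlite3 package.
import Database from "better-sqlite3";

const db = new Database(".dagain/state.sqlite", { readonly: true });

// Dump the work graph and the latest artifacts.
console.log(db.prepare("SELECT * FROM nodes").all());
console.log(db.prepare("SELECT * FROM kv_latest").all());
```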
## Failure is a First-Class Citizen
Real work fails. Dagain expects this—it's in the name.
Each node has a retry policy. When retries are exhausted, Dagain doesn't just give up—it escalates. The failure bubbles up to the nearest planning node, which triggers a replan. If that fails, it escalates further. All the way to the root if necessary.
When human judgment is required, nodes enter a checkpoint state. Execution pauses, you make the call, and the decision is recorded permanently. No re-asking. No context loss. The graph remembers.
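
A minimal sketch of this escalation rule, assuming a simple parent pointer between nodes; the types and function are illustrative, not Dagain's internals:

```ts
// Illustrative escalation: retry until the policy is exhausted, then bubble the failure
// up to the nearest planning ancestor; with no planner left, pause at a checkpoint.
interface TaskNode {
  id: string;
  kind: "plan" | "execute";
  parent?: TaskNode;
  attempts: number;
  maxRetries: number;
}

type FailureAction =
  | { action: "retry"; target: TaskNode }
  | { action: "replan"; target: TaskNode }
  | { action: "checkpoint"; target: TaskNode };

function onFailure(node: TaskNode): FailureAction {
  if (node.attempts < node.maxRetries) {
    return { action: "retry", target: node };     // retry policy not yet exhausted
  }
  for (let p = node.parent; p; p = p.parent) {
    if (p.kind === "plan") {
      return { action: "replan", target: p };     // escalate to the nearest planner
    }
  }
  return { action: "checkpoint", target: node };  // wait for a human decision
}
```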
## Runner Pools: Graceful Degradation
Not all tasks need the same firepower. A simple rename? Codex handles it. A gnarly refactor with subtle type constraints? Maybe Claude should take over.
Dagain supports runner pools—configure a role with multiple runners, and Dagain promotes through them on failure:
```json
{
  "roles": {
    "executor": ["codexMedium", "codex", "claude"]
  },
  "supervisor": {
    "runnerPool": {
      "mode": "promotion",
      "promoteOn": ["timeout", "missing_result", "spawn_error"],
      "promoteAfterAttempts": 2
    }
  }
}
```
The logic is simple:
- Start with the first runner (cheapest, fastest)
- If it times out or crashes, promote immediately to the next
- If it returns a valid result but the task still fails, retry twice before promoting
- Once you hit the last runner, stick with it
This gives you cost efficiency by default, with automatic escalation when things get hard. Most tasks complete on the cheap runner. The expensive one only spins up when needed.
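
Here is a sketch of that promotion logic, mirroring the config above; the state shape and function are illustrative, not Dagain's actual code:

```ts
// Illustrative promotion logic for a runner pool.
type FailureKind = "timeout" | "missing_result" | "spawn_error" | "task_failed";

interface PoolState {
  runners: string[];   // e.g. ["codexMedium", "codex", "claude"]
  index: number;       // which runner is currently active
  attempts: number;    // failed attempts on the current runner
}

const PROMOTE_ON: FailureKind[] = ["timeout", "missing_result", "spawn_error"];
const PROMOTE_AFTER_ATTEMPTS = 2;

function nextRunner(state: PoolState, failure: FailureKind): string {
  const atLastRunner = state.index === state.runners.length - 1;
  const promote =
    PROMOTE_ON.includes(failure) ||                 // crash/timeout: promote immediately
    state.attempts + 1 >= PROMOTE_AFTER_ATTEMPTS;   // valid result, but task failed twice

  if (promote && !atLastRunner) {
    state.index += 1;                               // escalate to the next runner
    state.attempts = 0;
  } else {
    state.attempts += 1;                            // stick with the current (or last) runner
  }
  return state.runners[state.index];
}
```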
## Parallelism That Doesn't Break Things
Dagain supports concurrent execution with --workers N. Independent nodes run in parallel. Dependencies are respected automatically.
For code changes, parallelism usually means merge hell. Dagain solves this with git worktrees: each worker operates in an isolated directory. Merges happen one at a time, with automatic conflict detection. You get speed without chaos.
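
A sketch of the worktree idea: each worker gets its own checkout via `git worktree`, so edits never collide on disk. The directory layout and branch names below are assumptions, not Dagain's actual scheme:

```ts
// Sketch: give each worker an isolated git worktree for its code changes.
import { execFileSync } from "node:child_process";

function createWorkerWorktree(repoRoot: string, workerId: number, baseBranch = "main"): string {
  const dir = `${repoRoot}/.dagain/worktrees/worker-${workerId}`;
  const branch = `dagain/worker-${workerId}`;
  // `git worktree add -b <branch> <path> <base>` checks out a separate working directory.
  execFileSync("git", ["worktree", "add", "-b", branch, dir, baseBranch], { cwd: repoRoot });
  return dir;
}
```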
## You're Always in Control
Dagain includes a chat interface connected to the same database:
```bash
npx dagain chat
```
From there you can:
- See what's running, what's blocked, what failed
- Pause execution mid-flight
- Inject decisions or override plans
- Query the full history
The graph runs autonomously. But you're never locked out.
## Try It
```bash
npx dagain init --goal "Add authentication to the API"
npx dagain run --live

# In another terminal
npx dagain status
npx dagain chat
```
## Why This Exists
The bottleneck on AI coding isn't intelligence. It's architecture.
Long-running agents need durable state. Complex tasks need decomposition. Failures need structured recovery. Humans need visibility and control.
Dagain is the missing layer: a work graph that persists, a packet model that keeps context fresh, and an execution model that treats failure as normal. The agent does the work. The graph keeps it honest.
Open source: github.com/knot0-com/dagain