Why do long-running agents need checkpoints?
An agent that runs for minutes or hours — working through a large task, many tool calls, or multiple steps — will eventually be interrupted. A process crashes, an API times out, a deployment restarts the service, or the task pauses to wait for a human. Without checkpoints, any interruption means starting over, throwing away completed work and repeating every expensive model call up to that point.
Checkpointing makes the agent's progress durable. At safe points, the agent persists its state (the steps completed, intermediate results, and its position in the plan) to a store outside the process. When it resumes, it reloads that state and continues from the last checkpoint rather than from the beginning. For long or costly tasks, this is the difference between a system that recovers cheaply and one that is prohibitively expensive to run reliably.
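The save-and-resume cycle described above can be sketched in a few lines. This is a minimal illustration rather than a production design: it assumes a JSON file on local disk as the external store, and the names `save_checkpoint`, `load_checkpoint`, and `run_agent` are hypothetical, not taken from any particular framework.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state to a temp file, then rename it over the target.

    A crash mid-write leaves the previous checkpoint intact.
    """
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Atomic on POSIX when both paths are on the same filesystem.
    os.replace(tmp_name, path)


def load_checkpoint(path: Path) -> dict:
    """Return the last saved state, or a fresh state if none exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"next_step": 0, "results": []}


def run_agent(steps, path: Path) -> list:
    """Run steps in order, checkpointing after each one completes."""
    state = load_checkpoint(path)
    for i in range(state["next_step"], len(steps)):
        state["results"].append(steps[i]())  # the expensive call
        state["next_step"] = i + 1
        save_checkpoint(path, state)  # safe point: the step is fully done
    return state["results"]
```

If the process dies partway through, calling `run_agent` again with the same path skips every step that already finished. A real deployment would more likely use a database or object store, but the shape of the loop stays the same.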
What state needs to be saved?
A useful checkpoint captures enough to reconstruct the agent's situation, typically:
- Progress — which steps or sub-tasks are done, and what remains in the plan.
- Intermediate results — outputs already produced, so they aren't recomputed.
- Working state — the relevant context or pointers to it (often files on disk, following the file-system-as-context pattern).
- Position in control flow — where the orchestrator was, so it knows which step to run next.
The aim is for a resumed run to be indistinguishable from one that never stopped.
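One way to picture the four categories above is as fields on a single serializable record. The sketch below is an assumption about how such a record might look; the `Checkpoint` class and its field names are hypothetical, chosen only to mirror the list.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Checkpoint:
    # Progress: which steps are done and what remains in the plan.
    completed_steps: list = field(default_factory=list)
    remaining_steps: list = field(default_factory=list)
    # Intermediate results, keyed by step name, so they aren't recomputed.
    results: dict = field(default_factory=dict)
    # Working state: pointers to context kept elsewhere (e.g. files on disk).
    context_files: list = field(default_factory=list)
    # Position in control flow: the step the orchestrator should run next.
    next_step: Optional[str] = None

    def to_json(self) -> str:
        """Serialize the checkpoint for storage outside the process."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "Checkpoint":
        """Rebuild a checkpoint from its stored JSON form."""
        return cls(**json.loads(raw))
```

Keeping everything JSON-serializable is a deliberate constraint in this sketch: anything that cannot be written out and read back (open connections, in-memory handles) has to be re-created on resume rather than stored.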
How does checkpointing enable human-in-the-loop workflows?
Checkpointing isn't only for failures — it's what makes deliberate pauses practical. When an agent needs human approval mid-task, it checkpoints, waits, and resumes when the human responds, without holding a process open the whole time. This same durability lets agents run as asynchronous, long-lived jobs rather than single blocking calls — the foundation for the kind of long-running orchestration Prodinit builds into agentic systems.
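The pause-and-resume flow for human approval might look roughly like the following. It is a sketch under stated assumptions: state lives in a JSON file, and the function names `run_until_approval` and `resume_with_decision`, along with the status strings, are hypothetical. In practice the second function would be triggered by whatever delivers the human's response, such as a webhook or a queue consumer.

```python
import json
from pathlib import Path


def run_until_approval(path: Path, draft: str) -> str:
    """Do the work before the approval gate, checkpoint, then return."""
    state = {"status": "awaiting_approval", "draft": draft}
    path.write_text(json.dumps(state))
    # The process can exit here; nothing stays open while the human decides.
    return state["status"]


def resume_with_decision(path: Path, approved: bool) -> str:
    """Reload the checkpoint and continue once the human has responded."""
    state = json.loads(path.read_text())
    if state["status"] != "awaiting_approval":
        raise ValueError(f"cannot resume from status {state['status']!r}")
    state["status"] = "approved" if approved else "rejected"
    path.write_text(json.dumps(state))
    if approved:
        return f"published: {state['draft']}"
    return "discarded"
```

The status check guards against applying the same decision twice, which matters when the response can be delivered more than once.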