architecture · agentic-ai · workflow-design

Building AI Workflows That Know When to Stop and Ask

One of the hardest problems in building agentic AI is not making it smart; it is making it interruptible. Here is what we learned designing pause-resume workflows that feel seamless to the user.

dataface.ai engineering · February 26, 2026 · 7 min read

We are building a product where an AI can take actions on behalf of a user: connecting datasources, querying databases, resolving references, fetching context, and constructing answers that span multiple data sources. The core of what we built is an agentic workflow engine: a directed graph of nodes, edges, and conditional routing, powered by LangGraph.

For the most part, this works beautifully. The agent navigates the graph, invokes the right nodes in sequence, and produces a response. But we quickly hit a wall with a category of situations the graph does not handle well by default: when the agent cannot proceed without a human answering a question.

The native interrupt: freezing time

Previously, "resuming" a graph meant saving a checkpoint, taking the user's input, and conditionally routing the graph to skip already-completed nodes. It was essentially a DAG workaround: start from the beginning but fast-forward through the past. This was terribly inefficient and convoluted to maintain.

We recently migrated our workflow to use LangGraph's native interrupt() and Command(resume=...) API. This fundamentally changed how we handle HTTP boundaries in agentic workflows. Here is how it works:

  1. When a node realizes it needs user input (e.g., an ambiguous intent or a required password), it calls interrupt(payload).
  2. This immediately freezes graph execution mid-node. LangGraph's checkpointer serializes the exact state of the runner and stores it safely in our database (Redis or Postgres). The HTTP response returns to the client, asking for input.
  3. When the user responds, we invoke the graph with Command(resume=user_answer). The checkpointer retrieves the frozen state, and execution resumes at the exact line of code where it paused.

No prior nodes are re-executed. No complex "resume_from" routing logic is needed. The workflow continues its natural DAG progression inline, yielding an incredibly efficient and clean architecture.

Where does the "frozen runner" live? It lives entirely in the checkpointer database. The HTTP backend remains completely stateless. The frozen state is seamlessly rehydrated across completely different server instances.

State serialization: the part nobody talks about

Our graph state is a Pydantic model. It contains nested objects: lists of datasource references (some partially resolved), query classification outputs, summary chunks, and execution trace entries. Serializing this to Redis on every checkpoint sounds straightforward until you encounter:

  • Enum fields that serialize to their string representation but need to deserialize back to the enum type.
  • Optional fields that are None vs. absent, which behave differently in Pydantic v2.
  • Lists of union types that need discriminated deserialization.
  • Circular references in nested context objects (rare, but devastating when they appear).

We ended up implementing a strict serialization contract for the checkpoint: always call .model_dump(mode="json") on the state before writing to Redis, and always validate through StateModel.model_validate() on read. No shortcutting. The extra validation on read is the thing that saved us: it caught schema drift between a new deployment and a checkpoint that was written with an older schema.
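The contract boils down to a symmetric pair of functions at the storage boundary. A sketch with a deliberately tiny state model (the `Intent` enum and `WorkflowState` fields are invented; our real state is much larger):

```python
from enum import Enum

from pydantic import BaseModel


class Intent(str, Enum):
    QUERY = "query"
    SUMMARIZE = "summarize"


class WorkflowState(BaseModel):
    intent: Intent
    sources: list[str] = []


def checkpoint_write(state: WorkflowState) -> dict:
    # mode="json" coerces enums (and datetimes, etc.) into JSON-safe
    # primitives, so what lands in Redis is plain data.
    return state.model_dump(mode="json")


def checkpoint_read(raw: dict) -> WorkflowState:
    # Full validation on read is what catches schema drift between a new
    # deployment and a checkpoint written under an older schema.
    return WorkflowState.model_validate(raw)


raw = checkpoint_write(WorkflowState(intent=Intent.QUERY, sources=["sales_db"]))
restored = checkpoint_read(raw)
```

If an old checkpoint no longer fits the current schema, `model_validate` raises a `ValidationError` at read time, at the boundary, instead of a confusing type error deep inside a node.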

How the user experience of this feels

From the user's side, they type a question. The AI responds, not with an answer, but with a clarifying question and a set of options to pick from. The user picks one. The AI continues and delivers the result.

It sounds simple. But what makes it feel seamless versus jarring is what is preserved between the pause and the resume. If the agent loses track of what it was doing, if the resumed response does not connect coherently to the original question, users notice immediately. The conversation feels broken.

Getting the state right is not a backend concern alone. It shows up directly in the quality of the conversation.

What we would do differently

If you are building something similar, here is what we would tell ourselves earlier:

  • Design your state schema with serialization in mind from day one. Retrofitting it is painful.
  • Treat the HITL checkpoint token as a first-class part of your API contract: version it, document it, test it.
  • Write integration tests that pause at the HITL node and resume with different inputs, not just happy-path tests that skip it.
  • Log the full state diff at every resume. When something breaks in production, that diff is the only thing that tells you what changed.
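The last point, logging the state diff at every resume, needs surprisingly little code. A minimal sketch of a diff helper (`state_diff` is a hypothetical name; it compares the serialized checkpoint dicts from before the pause and after the resume):

```python
def state_diff(before: dict, after: dict) -> dict:
    """Return {key: (old, new)} for every top-level key that changed
    between two serialized checkpoints. Missing keys show up as None."""
    keys = set(before) | set(after)
    return {
        k: (before.get(k), after.get(k))
        for k in keys
        if before.get(k) != after.get(k)
    }


# Example: the resume filled in an answer and added a trace entry.
diff = state_diff(
    {"question": "revenue?", "answer": ""},
    {"question": "revenue?", "answer": "queried sales_db", "trace": ["resumed"]},
)
```

A top-level diff is usually enough to answer "what did this resume actually change?"; for deeply nested state you would recurse, but that is the same idea.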

We are still refining this. There are edge cases around concurrent sessions and checkpoint expiry that we have not fully nailed down. But the core model is working, and it is one of the more interesting engineering problems we have solved building this product.
