Here's what we'll cover
This is Part 3 of our Super Bowl scaling series. Part 1 covered our scaling strategy and the "Async All The Things" philosophy. Part 2 covered the frontend: how we built a zero-API online visit that collects an entire session's worth of data in the browser and submits it as a single payload. This article picks up where that payload lands: what happens on the backend after a user clicks submit.
When we set out to build the backend for the No-V onboarding flow, a single checkout submit action needed to trigger 20+ steps across half a dozen microservices. Account creation, token exchange, address updates, payment processing. Some complete in milliseconds, others depend on external events that may not have happened yet. All of them need to succeed for a patient to be onboarded successfully.
We considered adopting a framework like Temporal for workflow orchestration, and we looked at adapting custom orchestrator engines already in use by other services internally. But we opted to start simple: a Postgres table as our work queue and a basic for loop as our orchestrator, iterating through a series of synchronous Python functions.
This article walks through the design choices that made that level of simplicity viable, and why it turned out to be the right call.
The problem
A user fills out the onboarding online visit and clicks submit. The backend needs to:
Create a new patient account (or look up an existing one)
Update the patient's address, phone number, and personal information
Save insurance information
Set up payment details
Persist health questionnaire answers
Execute checkout, which triggers payment processing and adds the patient to the Body Program
Each of these steps calls a different downstream service. Some are fast, some are slow, and some can fail transiently.
The question for us was: how do we reliably process a large spike of incoming payloads, many of which might be duplicates (a typical consequence of multi-region retry/hedging behavior, which we'll talk about later), through all 20+ steps without losing or duplicating any submissions, all at a potential request rate of thousands per second?
Architecture: save now, process later
We split the problem into two phases: a synchronous HTTP handler that accepts submissions, and a background worker that processes them.
This is the "Async All The Things" philosophy from Part 1 applied at the service level: accept the user's data immediately, process it on our own schedule (which we can fully control).
Accepting submissions
An SQS queue sits upstream in the API gateway, buffering submissions before they reach us. This decouples the frontend from the processing rate of the intake service, and lets us absorb traffic spikes without backpressure.
The submission endpoint itself is deliberately lightweight. Its only job is to persist the payload and return. No validation, no downstream calls, no computation.
from typing import Any
from uuid import UUID

from sqlalchemy.exc import IntegrityError


def handle_submission(
    payload: dict[str, Any],
    idempotency_key: UUID,
) -> None:
    with get_session() as session:
        workflow = Workflow(
            idempotency_key=idempotency_key,
            payload=payload,
        )
        session.add(workflow)
        try:
            session.commit()
        except IntegrityError:
            # Duplicate idempotency key: this submission was already saved, so drop it
            session.rollback()
The payload is stored as JSONB in a workflow row with status PENDING. The idempotency key, a client-generated UUID (discussed in Part 2), is covered by a unique database constraint. If the same submission arrives twice, the second one hits an IntegrityError and is dropped.
The design follows EAFP (easier to ask forgiveness than permission): rather than querying for an existing idempotency key before inserting, we attempt the insert directly and let the database constraint catch duplicates. One round trip, not two. The hot path stays free of payload hashing or cache lookups. The database constraint is the single deduplication mechanism.
We expect to see frequent IntegrityErrors in production, and that's by design. The frontend makes hedged requests to different regions to maximize availability, so duplicate submissions are a normal part of operation, not an error condition.
The result is a synchronous path that does one INSERT, returns one response, and has a single failure mode: the database is unavailable. (If that happens, the upstream SQS queue retries delivery until the database recovers.)
Processing submissions
A separate background process, the worker, polls the database in a loop, grabs the next pending submission, and runs it through the step list:
workflow = session.execute(
    select(Workflow)
    .where(Workflow.status == "PENDING")
    .where(Workflow.invisible_until <= func.now())
    .order_by(Workflow.created_at)
    .limit(1)
    .with_for_update(skip_locked=True)
).scalar_one_or_none()
SELECT FOR UPDATE SKIP LOCKED turns the database table into a concurrent work queue. If one worker has already locked a row, other workers skip it and grab the next available one. No contention, no duplicate processing. This is the same problem that dedicated message brokers and task frameworks solve, but here, the database we're already using handles it natively. No additional infrastructure, no separate queue service, no framework-specific serialization format. The workflow data and the queue are the same table.
The worker itself is a plain while True loop. No async/await, no event loop, no concurrency within a single execution. The next section explains why.
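In sketch form, the worker loop is little more than this (claim_next_pending and process_workflow are illustrative helper names, not our production code; claim_next_pending wraps the SELECT ... FOR UPDATE SKIP LOCKED query shown above):
import time

POLL_INTERVAL_SECONDS = 1.0  # illustrative tuning knob, not our actual setting


def run_worker() -> None:
    while True:
        with get_session() as session:
            # Claim the next visible PENDING row that no other worker has locked
            workflow = claim_next_pending(session)
            if workflow is not None:
                # Run all 20+ steps and commit COMPLETE, FAILED, or a retry backoff
                process_workflow(session=session, workflow=workflow)
        if workflow is None:
            # Nothing ready: idle briefly before polling again
            time.sleep(POLL_INTERVAL_SECONDS)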
Designing for simplicity
No async, no problem
Each workflow runs as a single-threaded, synchronous function call. One worker picks up one submission, processes it through all 20+ steps from start to finish, then picks up the next one.
This eliminates an entire class of problems. There are no race conditions between steps. No callback chains. No event loop gotchas. When something goes wrong, the stack trace reads top-to-bottom, just like the code.
The trade-off is throughput. A single worker can only process one submission at a time. But for our system, this trade-off is actually a feature: it lets us control downstream load with a single knob.
Scaling by adding workers
Processing time for an entire workflow is consistent, with a p95 of 3 seconds across all 20+ steps. Most of that time is spent waiting on external HTTP calls, not in local computation.
Rather than making individual workers concurrent, we control throughput by adjusting the number of worker instances. Need more capacity? Add workers. The database handles coordination via SKIP LOCKED, ensuring no two workers grab the same row.
The capacity math is straightforward: with a p95 processing time of 3 seconds, a single worker handles roughly 20 submissions per minute (60 / 3 = 20). The system might be receiving thousands of requests per second, but the rate at which they get processed is fully under our control, thanks to the decoupled save-now, process-later architecture.
Flattening the curve
During traffic spikes, we don't necessarily want or need to scale workers up. The goal isn't to process every submission as fast as possible. Our design gives us a knob to control how quickly submissions flow into downstream services. During a Super Bowl ad, the submission endpoint and SQS queue absorb the spike, while the workers drain the queue at a steady, controlled pace. This is where the "flattening of the curve" from Part 1 happens: downstream services never see the spike at all. They see the same predictable volume they handle on any normal day. We can scale our intake service without requiring every downstream service to scale for the same spike.
In practice, we keep the worker count stable and let the queue absorb bursts. Adding more workers is always an option if we need to drain faster, but the system doesn't depend on it.
One list to rule them all
The system handles three distinct types of submissions: brand-new users, existing users returning for a new visit, and users whose accounts were recently created via a separate async onboarding flow. Each type carries different data, uses different authentication mechanisms, and requires a different subset of the 20+ steps.
The natural instinct is to build three separate workflows, or at least to add branching logic in the orchestrator. We did neither. Every submission, regardless of type, runs through the exact same step list:
steps: list[WorkflowStep] = [
    validate_payload,
    validate_duplicate_email,
    create_account,
    # ... +18 more steps
]
Steps that don't apply to a given submission type handle this internally. A step checks the submission type and returns early if it doesn't apply:
def create_account(*, workflow: Workflow, context: WorkflowStepContext) -> dict[str, Any] | None:
    if workflow.payload.get("authentication_mode") != "net_new":
        # Not a brand-new user: the account already exists, so this step is a no-op
        return None
    # ... proceed with account creation
The orchestrator has zero awareness of submission types. Adding a new type means updating individual steps, not rewriting the orchestration logic. Keeping the step list flat also simplifies retries.
Not every step in the list is critical to completing the onboarding. Some steps are designed as best-effort: they attempt an operation, but if it fails, the workflow continues rather than blocking. This applies to steps where alternative paths exist to resolve the outcome later.
For example, if a payment card fails to attach, the workflow still completes. The patient is onboarded, and a downstream dunning flow handles payment recovery separately. Similarly, steps that set non-critical preferences or link optional verifications are allowed to fail without marking the entire workflow as FAILED.
This distinction between steps that must succeed and steps that should succeed keeps the workflow moving for the patient while ensuring nothing is silently lost. Failed best-effort steps are logged and can be investigated or retried through other mechanisms.
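In code, a best-effort step simply catches its own failure and lets the workflow continue. A sketch of what that can look like (the payments client, exception type, and payload keys here are illustrative, not our actual integration):
import logging

logger = logging.getLogger(__name__)


def attach_payment_card(*, workflow: Workflow, context: WorkflowStepContext) -> dict[str, Any] | None:
    try:
        card_id = payments_client.attach_card(
            patient_uuid=context.patient_uuid,
            card_token=workflow.payload["payment"]["card_token"],
        )
    except PaymentServiceError:
        # Best-effort: log and move on; the downstream dunning flow recovers payment later
        logger.warning("card attach failed; continuing workflow")
        return None
    return {"payment_card_id": card_id}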
When in doubt, start over
When a step fails with a transient error, the workflow retries from Step 1, not from the point of failure. The entire step list runs again from the beginning.
This approach relies on steps being safe to re-execute. Steps that don't apply to the current submission type return early, as described above. Steps that make downstream calls are generally idempotent (updating an address to the same value is a no-op). On a retry, steps that can self-skip do so without making any downstream network calls, so the workflow quickly advances to the point where it previously failed.
Restarting from scratch also sidesteps a category of edge cases that checkpoint-based systems need to handle:
What if the step order changes between attempts? With checkpointing, a deploy that reorders steps mid-retry could skip a step that now needs to run earlier, or re-run a step that's moved later. Restarting from scratch always uses the current step order.
What if new steps are added? A checkpoint that says "resume at Step 12" is meaningless if a deploy added two new steps before Step 12. Restarting from scratch runs the full current list, including any new steps.
What about in-flight workflows during a deploy? Always restarting means no rollout complexity, no dreaded migrations, and no backfills.
Because the step list is flat (no conditional branches), restarting is straightforward. The orchestrator doesn't need to figure out which branch was taken on the previous attempt. It just runs the same list again.
How the steps work
Just a function
Steps are plain Python functions. There's no base class, no decorator, and no registration mechanism. A Protocol defines the structural contract:
class WorkflowStep(Protocol):
    __name__: str

    def __call__(
        self, *, workflow: Workflow, context: WorkflowStepContext
    ) -> dict[str, Any] | None: ...
Each step receives two arguments: the workflow record (which contains the raw payload) and a context object. It returns either None or a dictionary of values to add to the context.
The step list itself is a literal Python list in a single function. Adding a step, removing one, or changing the order is a one-line edit.
Steps communicate through a shared context object, a Pydantic model that accumulates state as the workflow progresses:
class WorkflowStepContext(BaseModel):
    # Subset of fields that are populated progressively as steps execute
    patient_uuid: UUID | None = None
    visit_uuid: str | None = None
    payment_card_id: str | None = None
    # ...
The executor runs each step and merges its return value into the context:
for step_func in steps:
    result = step_func(workflow=workflow, context=context)
    if result:
        for key, value in result.items():
            setattr(context, key, value)
The account creation step returns {"patient_uuid": ...}. Later steps read context.patient_uuid to make API calls on behalf of the patient. Each step reads from and writes to the context without knowing which other step produced the data it depends on. This keeps steps decoupled: they can be reordered, replaced, or tested in isolation as long as the context contract is maintained.
Three statuses and a timestamp
A workflow has exactly three statuses: PENDING, COMPLETE, or FAILED. There is no IN_PROGRESS state, no RETRYING, no WAITING_FOR_EVENT. A workflow is either waiting to be processed, done, or dead. This simplicity is deliberate: every workflow in the table can be understood at a glance, and monitoring reduces to counting rows by status.
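That monitoring query is, in sketch form, a single GROUP BY over the same table (illustrative, using the same SQLAlchemy session as the worker code):
from sqlalchemy import func, select

# The entire queue "dashboard": how many workflows are waiting, done, or dead
status_counts = session.execute(
    select(Workflow.status, func.count()).group_by(Workflow.status)
).all()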
Additionally, each workflow row has an invisible_until timestamp column. The worker's polling query includes a filter:
WHERE status = 'PENDING'
AND invisible_until <= now()
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED
When a step fails with a transient error, the worker doesn't move the row to a separate retry queue or schedule a future job. It pushes invisible_until into the future:
workflow.retry_count += 1
workflow.invisible_until = calculate_invisible_until(
    workflow.retry_count,
    base_delay=error.base_delay,
    max_delay=error.max_delay,
    jitter_factor=error.jitter_factor,
)
session.commit()
The row stays in PENDING status but becomes invisible to the polling query until the backoff period expires. When it expires, the next poll picks it up and the workflow runs through all steps again from the beginning.
No separate retry queue. No scheduled jobs. One column, one query filter.
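For completeness, calculate_invisible_until can be as simple as exponential backoff with jitter, capped at a maximum delay. A sketch (the formula and its parameters here are illustrative, not necessarily what we run in production):
import random
from datetime import datetime, timedelta, timezone


def calculate_invisible_until(
    retry_count: int,
    *,
    base_delay: float,
    max_delay: float,
    jitter_factor: float,
) -> datetime:
    # Exponential backoff: the delay doubles with each retry, capped at max_delay
    delay = min(base_delay * (2 ** (retry_count - 1)), max_delay)
    # Jitter spreads retries out so a burst of failures doesn't retry in lockstep
    delay += delay * jitter_factor * random.random()
    return datetime.now(timezone.utc) + timedelta(seconds=delay)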
Event-driven retry acceleration
We use scheduled retries as our default path to ensure eventual progress, but this can sometimes add unnecessary latency.
When a workflow is blocked due to an unsatisfied dependency (e.g., a related object created asynchronously in another service), we’ll listen for Kafka events that signal when a workflow is unblocked. The consumer updates the workflow row using the same simple queue semantics (set invisible_until to now()) to short-circuit the long wait time, making the workflow available for immediate processing.
session.execute(
    update(Workflow)
    .where(Workflow.linking_id == account_id)
    .where(Workflow.status == "PENDING")
    .where(Workflow.invisible_until > func.now())
    .values(invisible_until=func.now())
)
This creates a hybrid: poll-based retries as the reliable fallback, with event-driven acceleration as the fast path. If the message bus is healthy, the workflow resumes within seconds of the account being created. If the message bus is having a bad day, or the upstream service failed to publish the event, the poll-based retry still fires after the backoff period. Nothing is lost either way.
Rolling back without letting go
There's a subtle concurrency problem in the worker's failure path. When a step fails, the worker needs to do two things: roll back any database side effects from the failed step, and update the workflow row (either mark it FAILED or increment the retry count). But session.rollback() would roll back the entire transaction, including the SELECT FOR UPDATE lock on the row. In the window between the rollback and the next commit, another worker could grab the same row.
The solution is PostgreSQL savepoints, exposed in SQLAlchemy as nested transactions:
savepoint = session.begin_nested()
try:
    execute_workflow(workflow)
    workflow.status = "COMPLETE"
    session.commit()
except RetryableStepError as e:
    savepoint.rollback()  # undo step's writes, keep the row lock
    workflow.retry_count += 1
    workflow.invisible_until = calculate_invisible_until(...)
    session.commit()
except Exception:
    savepoint.rollback()  # undo step's writes, keep the row lock
    workflow.status = "FAILED"
    session.commit()
session.begin_nested() creates a savepoint before the steps run. On failure, rolling back to the savepoint undoes the step's database writes while the outer transaction, and its row lock, stays open. The worker can then safely update the workflow status and commit, with no window for another worker to interfere.
What this design doesn’t do
Simplicity has costs:
Not every step is retryable. Only a subset of steps raise RetryableStepError. Most treat failures as terminal, marking the workflow FAILED for manual investigation. This is partly by design (some failures genuinely aren't transient, like invalid payload data) and partly pragmatic. Making a step safely retryable requires ensuring it's idempotent or can detect its own prior side effects. That analysis takes time, and we prioritized shipping a working system over covering every edge case on day one.
Idempotency is a convention, not a guarantee. Restart-from-scratch retries assume steps can handle being re-run. Most steps achieve this by being naturally idempotent (updating an address to the same value is a no-op) or by self-skipping based on the submission type. But this property isn't enforced by the framework. It's a convention that each step author must uphold. A step that isn't idempotent and runs before a retryable step could cause problems on retry. Step ordering mitigates this, but the constraint is documented rather than mechanically enforced.
Restart-from-scratch is wasteful on late-stage failures. If Step 19 out of 21 fails transiently, the retry re-runs all 21 steps. Steps that aren't relevant to the current submission type return early without making downstream calls, so most of the re-run is cheap. But steps relevant to the current submission type that have already completed successfully will make redundant downstream calls. At our current scale this is negligible, but it's something to monitor as traffic grows.
Conclusion
We set out to build a workflow engine and chose the simplest approach that could work: synchronous Python functions, a database table with a few well-chosen columns, and a polling loop. The system processes submissions with a p95 of 3 seconds (which the patient never notices, since it all happens asynchronously after they submit), scales by adding workers, retries by re-running from the start, and uses a single flat step list for all submission types, with branching logic pushed into the steps themselves. Together with the frontend architecture from Part 2 and the scaling strategy from Part 1, it gave us a system that handled Super Bowl traffic without breaking a sweat.
These choices won't be right for every system. But for a system that needs to be reliable, observable, and debuggable by anyone on the team, we've found that boring is a feature. The most maintainable code is the code that doesn't need a guide to understand.