PyPI: relay-core · Python · Workflow · AI Agents · Open Source · Infrastructure

Relay

Reliable AI operations — making sure your automated workflows finish what they start, every time.

🔬Note

relay-core 1.27.0 — available now on PyPI. Built in Python 3.12+. MIT licensed.

The problem nobody talks about honestly

AI agents are impressive in a demo. A human types a request, the agent springs into action — scraping data, calling services, crunching numbers, delivering results. It feels like magic.

But demos don't run your business. Production does.

Here's what actually happens when you put AI agents to work: your agent starts a multi-step task — maybe it's gathering market data, calling an external service, running analysis, then writing results to a report. Halfway through, something goes wrong. A server goes down. A network connection drops. An external API takes too long to respond.

What happens to all that work? It's gone. The agent has no memory of what it completed. No record of what step it was on. No way to pick up where it left off without running everything from scratch — including all the time and money you already spent on the steps that succeeded.

And if you're lucky enough to avoid crashes, scaling becomes its own headache. Want to run 10 agents at the same time? You need to figure out queuing, distribution, timeout handling, and retry logic yourself. Want 100? Now you're building infrastructure instead of building products.

We saw this pattern across team after team, project after project:

⚠️Problem

The recurring pain points:

  • A single failure erases hours of in-progress work
  • No way to resume from where things stopped — every failure means starting over
  • Retry logic is patched together differently in every project
  • Agents can't "remember" partial results across restarts
  • Scaling up means duplicating fragile, custom-built coordination code
  • No visibility into what's running, what's stuck, or what failed

The uncomfortable truth is that most "production AI agents" aren't truly reliable. They assume everything will go perfectly. And they fall apart the moment the real world pushes back.


Why we built Relay

We got tired of babysitting agents.

Not building them — that part is genuinely exciting. We were tired of the operational overhead: the custom retry wrappers, the ad-hoc state tracking written to databases with crossed fingers, the middle-of-the-night alerts because an agent silently failed on step 7 of 12 and no one noticed until the customer complained.

We looked at what the infrastructure world had figured out. Solutions like Temporal are brilliant but operationally heavy — you're deploying and managing a separate cluster. Azure's Durable Task Framework is powerful but deeply tied to the Microsoft ecosystem. There was nothing that felt native to Python, worked with modern async patterns, and was light enough to install and actually use in an afternoon.

So we built Relay.

The name comes from the idea of relaying work between steps — passing tasks forward reliably, like a relay race where each runner hands off the baton without dropping it. Your workflow steps are the runners. Relay is what makes sure the baton always gets to the next runner — reliable, recoverable, and observable — even when individual runners stumble.

Relay is inspired by the best ideas from enterprise workflow systems, but designed from the ground up to be simple to adopt. SQLite by default for getting started. PostgreSQL when you're ready to scale. No new infrastructure to manage. Just your code and a worker process.


What Relay actually is

Here's where it gets interesting — because Relay is two things at once, and understanding both angles changes how you think about it.

As a Reliable Workflow Engine

At its core, Relay is a durable workflow engine. Every step in a workflow is tracked and recorded. If your process crashes, Relay reconstructs exactly where execution was — then continues from that point forward.

Figure: Relay's architecture. A worker polls the task queue, executes workflow steps, and persists events to the database for deterministic replay and automatic recovery.

This isn't just saving checkpoints. Relay records every step as an event and replays that history deterministically after a failure. The workflow logic itself is separated from the actions it performs (API calls, database writes, external service integrations), so when something fails, only the failed step is retried, not the entire workflow from the beginning.

The result: workflows that survive server crashes, network failures, and deployment restarts — automatically, with no extra code from you.
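To make the replay idea concrete, here's a minimal sketch in plain Python. It is not Relay's implementation: `ReplayContext`, `run_step`, and the in-memory `log` are hypothetical names invented for illustration, and a real engine would persist the log to a database rather than a dict. The point is only the shape of the technique: record a step's result once it succeeds, and on a later attempt reuse the recorded result instead of running the step again.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Hypothetical in-memory event log; a durable engine would store this in a database.
EventLog = dict[str, Any]


class ReplayContext:
    """Runs each step once and replays recorded results on later attempts."""

    def __init__(self, log: EventLog) -> None:
        self.log = log
        self.executed: list[str] = []  # steps actually run during this attempt

    async def run_step(self, name: str, fn: Callable[[], Awaitable[Any]]) -> Any:
        if name in self.log:
            # Replay: the step finished in an earlier attempt, so reuse its result.
            return self.log[name]
        result = await fn()
        self.log[name] = result  # recorded only after the step succeeds
        self.executed.append(name)
        return result


calls = {"fetch": 0, "analyze": 0}


async def fetch() -> int:
    calls["fetch"] += 1
    return 21


async def analyze_flaky(value: int) -> int:
    calls["analyze"] += 1
    if calls["analyze"] == 1:
        raise ConnectionError("simulated outage")
    return value * 2


async def workflow(ctx: ReplayContext) -> int:
    data = await ctx.run_step("fetch", fetch)
    return await ctx.run_step("analyze", lambda: analyze_flaky(data))


async def demo() -> tuple[int, list[str]]:
    log: EventLog = {}
    try:
        await workflow(ReplayContext(log))  # first attempt fails at "analyze"
    except ConnectionError:
        pass
    ctx = ReplayContext(log)  # simulated restart: new context, same log
    result = await workflow(ctx)
    return result, ctx.executed


result, executed = asyncio.run(demo())
```

On the second attempt only `analyze` actually runs; `fetch` is served from the log, so its cost is paid once. This only works if the workflow function makes the same sequence of `run_step` calls every time, which is why the coordination logic is kept separate from the side effects.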

As an AI Agent Execution Platform

Now think about your AI agents through this lens.

An agent is essentially a workflow with more steps and more uncertainty. It has inputs, it produces outputs, and it coordinates a sequence of actions — many of which have real costs or consequences. Sound familiar?

Relay gives agents what they've always needed:

  • Persistent memory across failures — state is saved after every step, not held in volatile memory
  • Step-level recovery — a failed step retries from itself, not from the beginning
  • Human-in-the-loop — external systems can pause and resume agents mid-execution for approvals or reviews
  • Full visibility — every step, state change, and action is logged and inspectable
  • Concurrent scale — run dozens of agent workflows in parallel across distributed workers

Your agent stops being a fragile script and becomes a reliable, auditable business process.
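The human-in-the-loop item in the list above is the one that tends to be hardest to picture, so here's a rough in-process sketch of the pattern using only `asyncio`. Everything in it is an assumption made for illustration: `ApprovalGate`, `agent`, and `reviewer` are hypothetical names, not part of Relay's API, and a durable engine would persist the "waiting" state so the pause survives a restart.

```python
import asyncio


class ApprovalGate:
    """Hypothetical helper: lets an agent wait for an external decision."""

    def __init__(self) -> None:
        self._event = asyncio.Event()
        self.approved = False

    async def wait(self) -> bool:
        await self._event.wait()
        return self.approved

    def resolve(self, approved: bool) -> None:
        self.approved = approved
        self._event.set()


async def agent(gate: ApprovalGate, trace: list[str]) -> str:
    trace.append("draft_report")
    trace.append("waiting_for_approval")
    approved = await gate.wait()  # the agent is paused here
    if not approved:
        trace.append("rejected")
        return "cancelled"
    trace.append("publish_report")
    return "published"


async def reviewer(gate: ApprovalGate, trace: list[str]) -> None:
    await asyncio.sleep(0.01)  # stands in for a human taking their time
    trace.append("human_approved")
    gate.resolve(True)


async def demo() -> tuple[str, list[str]]:
    gate = ApprovalGate()
    trace: list[str] = []
    outcome, _ = await asyncio.gather(agent(gate, trace), reviewer(gate, trace))
    return outcome, trace


outcome, trace = asyncio.run(demo())
```

The agent does its cheap preparatory work, stops at the gate, and only performs the consequential step once someone outside the workflow resolves it.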


How it works

You define your data shapes, your activities (the steps that interact with external systems), and your workflow (the coordination logic). Relay handles the rest — tracking, recovery, retries, and monitoring.

import asyncio
from typing import TypedDict
import relay

# 1. Define your data types
class OrderInput(TypedDict):
    order_id: str
    customer_email: str

class OrderState(TypedDict):
    payment_confirmed: bool
    email_sent: bool

# 2. Activities = steps that interact with external systems
#    (these are automatically retried, timed-out, and tracked)
@relay.activity(name="process_payment", retry_count=3, timeout_seconds=30)
async def process_payment(order_id: str) -> bool:
    # Call your payment provider here
    return True

@relay.activity(name="send_email", retry_count=2)
async def send_confirmation_email(email: str, order_id: str) -> None:
    # Call your email service here
    pass

# 3. Workflow = the coordination logic
#    (automatically recovers from any failure point)
@relay.workflow(name="OrderProcessing", version="1.0.0")
class OrderWorkflow(relay.Workflow[OrderInput, OrderState]):

    @relay.step(name="process_payment")
    async def payment_step(self, ctx: relay.WorkflowContext[OrderInput, OrderState]):
        success = await ctx.activity(process_payment, ctx.input["order_id"])
        await ctx.state.set("payment_confirmed", success)

    @relay.step(name="send_confirmation")
    async def notification_step(self, ctx: relay.WorkflowContext[OrderInput, OrderState]):
        if ctx.state["payment_confirmed"]:
            await ctx.activity(
                send_confirmation_email,
                ctx.input["customer_email"],
                ctx.input["order_id"],
            )
            await ctx.state.set("email_sent", True)

# 4. Start it
async def main():
    handle = await OrderWorkflow.start(
        OrderInput(order_id="ORD-12345", customer_email="user@example.com")
    )
    result = await handle.result()
    print(f"Completed: {result}")

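# Guard the entry point so importing this module doesn't start a workflow
if __name__ == "__main__":
    asyncio.run(main())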
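The activities above declare `retry_count` and `timeout_seconds`, but the exact semantics Relay gives them aren't spelled out here. As a rough mental model only, this is what a hand-rolled retry-with-timeout wrapper looks like in plain `asyncio`. The names `run_with_retries`, `max_attempts`, and `backoff_seconds` are hypothetical and chosen for this sketch; they are not Relay parameters.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def run_with_retries(
    fn: Callable[[], Awaitable[T]],
    *,
    max_attempts: int,
    timeout_seconds: float,
    backoff_seconds: float = 0.0,
) -> T:
    """Call fn up to max_attempts times, bounding each attempt by a timeout."""
    last_error: Exception | None = None
    for attempt in range(1, max_attempts + 1):
        try:
            return await asyncio.wait_for(fn(), timeout=timeout_seconds)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < max_attempts:
                # Linear backoff between attempts (zero by default).
                await asyncio.sleep(backoff_seconds * attempt)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


attempts = 0


async def flaky_payment() -> bool:
    """Simulated provider: too slow once, drops the connection once, then succeeds."""
    global attempts
    attempts += 1
    if attempts == 1:
        await asyncio.sleep(1.0)  # exceeds the timeout below
    if attempts == 2:
        raise ConnectionError("simulated network drop")
    return True


confirmed = asyncio.run(
    run_with_retries(flaky_payment, max_attempts=3, timeout_seconds=0.05)
)
```

This is the kind of wrapper that ends up copied, slightly differently, into every project. Declaring the policy on the activity keeps it in one place and lets the engine record each attempt.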

Then manage your workflows with simple commands:

# Set up the database (SQLite by default)
relay init

# Start a worker (4 concurrent task processors)
relay worker --workers 4

# See what's running
relay list --status RUNNING

# Inspect a workflow's full history
relay inspect <workflow-id> --events

Relay also ships with a web dashboard — a real-time view of running workflows, event histories, task queues, and performance metrics. Everything you need to monitor your automated processes at a glance.

# Launch the dashboard
relay web

# Access at http://localhost:8000

What's next

Relay is actively maintained and already running in production. The roadmap includes:

  • Remote workers — distribute execution across multiple machines without changing your workflow code
  • First-class agent support — built-in tracking for AI-specific steps like model calls and tool usage
  • OpenTelemetry integration — industry-standard monitoring for every workflow and activity
  • Managed cloud option — a hosted execution platform for teams who don't want to manage their own infrastructure

The core engine is stable and the API is clean. If you're building AI agents or complex automated workflows in Python, Relay is the missing layer between your code and reliable execution.


📦Install

Get started in one line:

pip install relay-core

Then run relay init to set up your database, relay worker to start processing, and relay web to see it all in a dashboard. The source is on GitHub, and PRs are welcome.