Aetheris — Durable Execution Runtime for AI Agents

Why Aetheris

Production agents fail in ways
your code doesn't handle.

Happy-path code runs fine in demos. In production, workers crash mid-run, retries duplicate payments, and nobody can prove what the agent actually did.

Without Aetheris

✗ Worker crash → entire task restarts from zero
✗ Retry logic re-sends emails, re-charges cards
✗ No audit trail → impossible post-incident analysis
✗ Human-in-the-loop requires custom state machines
✗ Long-running jobs silently time out with no recovery

With Aetheris

✓ Resume from last durable checkpoint automatically
✓ Invocation ledger guarantees at-most-once execution
✓ Every transition traced, replayable, and auditable
✓ StatusParked for wait states — no custom state machine
✓ Lease fencing keeps jobs alive across worker restarts

Architecture

Drop in the runtime.
Keep your agent logic.

Aetheris sits between your agent and the outside world. You submit work via HTTP or Go SDK; the runtime handles durability, scheduling, and side-effect safety.

Your Agent

Python · Go · HTTP

→

Aetheris API

POST /api/agents/:id/message

→

Aetheris Runtime

Planner → TaskGraph → Executor

→

Tool Ledger

At-most-once guarantees

→

External Tools

APIs · DBs · Services

PostgreSQL job store · Checkpoint runner · Lease fencing · DAG compiler · OpenTelemetry traces · Replay engine
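The Planner → TaskGraph → Executor stage can be pictured as compiling a plan into a dependency graph and running each node once its dependencies have finished. The following is a minimal sketch of that idea using only Python's standard `graphlib`; the step names and the `run_graph` helper are hypothetical and are not taken from the Aetheris codebase.

```python
from graphlib import TopologicalSorter


def run_graph(graph, handlers):
    """Run every node after all of its dependencies; return results by node."""
    results = {}
    # graph maps each node to the set of nodes it depends on.
    for node in TopologicalSorter(graph).static_order():
        deps = {dep: results[dep] for dep in graph.get(node, ())}
        results[node] = handlers[node](deps)
    return results


# Hypothetical three-step plan: fetch, then summarize, then send.
graph = {"fetch": set(), "summarize": {"fetch"}, "send": {"summarize"}}
handlers = {
    "fetch": lambda deps: ["story a", "story b"],
    "summarize": lambda deps: f"{len(deps['fetch'])} stories",
    "send": lambda deps: f"sent: {deps['summarize']}",
}
results = run_graph(graph, handlers)
```

A real executor would persist each node's result before moving on, which is what makes the graph resumable after a crash.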

Core Guarantees

Three properties you can
reason about formally.

Aetheris is built around a small set of guarantees that compose correctly. Each one has a formal definition in the runtime spec.

⚡

Crash Recovery

When a worker dies, the scheduler's lease fencing detects the timeout and assigns the job to the next available worker, which resumes from the last committed checkpoint. Zero manual intervention required.

// Job resumes from checkpoint on worker restart
func Resume(ctx context.Context, jobID string) error {
    checkpoint, err := store.LoadCheckpoint(jobID)
    if err != nil {
        return err
    }
    return runner.ContinueFrom(ctx, checkpoint)
}
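To make the resume behavior concrete, here is a small self-contained Python sketch of checkpointed execution. It is an illustration under assumptions, not the runtime's implementation: `CheckpointStore` is an in-memory stand-in for a durable store, `run_job` and its `crash_before` parameter are hypothetical, and lease handling is left out entirely.

```python
class CheckpointStore:
    """In-memory stand-in for a durable checkpoint store (hypothetical)."""

    def __init__(self):
        self._next_step = {}
        self._state = {}

    def load(self, job_id):
        # Unknown jobs start at step 0 with empty state.
        return self._next_step.get(job_id, 0), dict(self._state.get(job_id, {}))

    def commit(self, job_id, next_step, state):
        self._next_step[job_id] = next_step
        self._state[job_id] = dict(state)


def run_job(store, job_id, steps, crash_before=None):
    """Run steps in order, committing a checkpoint after each one."""
    index, state = store.load(job_id)
    executed = []
    while index < len(steps):
        if index == crash_before:
            raise RuntimeError("simulated worker crash")
        name, fn = steps[index]
        state = fn(state)
        index += 1
        store.commit(job_id, index, state)
        executed.append(name)
    return state, executed


steps = [
    ("fetch", lambda s: {**s, "docs": 3}),
    ("summarize", lambda s: {**s, "summary": f"{s['docs']} docs"}),
    ("notify", lambda s: {**s, "notified": True}),
]
store = CheckpointStore()
try:
    run_job(store, "job_abc123", steps, crash_before=2)  # first worker dies
except RuntimeError:
    pass
# A second worker picks the job up and only runs what is left.
final_state, executed_on_resume = run_job(store, "job_abc123", steps)
```

Note that a crash between running a step and committing its checkpoint would re-run that step on resume, which is why checkpointing alone is not enough for side effects and a ledger is needed as well.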

🔒

At-most-once Execution

Every tool invocation is recorded in an immutable ledger before execution. On retry, the runtime checks the ledger — if the call already happened, the recorded result is returned without re-executing.

// Ledger prevents duplicate side effects
if result, ok := ledger.Lookup(invocationID); ok {
    return result, nil // idempotent reply
}
return tool.Execute(ctx, params)
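The record-before-execute pattern described above can be sketched in a few lines of Python. This is a hedged illustration: `InvocationLedger` is a hypothetical in-memory class, and the handling of a call that was started but never finished (refuse to re-run and surface it for reconciliation) is an assumption about what at-most-once implies, not documented Aetheris behavior.

```python
class InvocationLedger:
    """In-memory stand-in for a durable invocation ledger (hypothetical)."""

    def __init__(self):
        self._entries = {}

    def invoke(self, invocation_id, tool, params):
        entry = self._entries.get(invocation_id)
        if entry is not None:
            if entry["status"] == "done":
                return entry["result"]  # replay the recorded result
            # Started but never finished: the outcome is unknown, so the
            # call is not re-run (assumed policy for at-most-once).
            raise RuntimeError(f"{invocation_id}: outcome unknown")
        # Record intent before touching the outside world.
        self._entries[invocation_id] = {"status": "started", "result": None}
        result = tool(**params)
        self._entries[invocation_id] = {"status": "done", "result": result}
        return result


charges = []


def charge_card(amount):
    charges.append(amount)  # the side effect we must not duplicate
    return {"charged": amount}


ledger = InvocationLedger()
first = ledger.invoke("inv-1", charge_card, {"amount": 42})
retry = ledger.invoke("inv-1", charge_card, {"amount": 42})
```

After the retry, `charges` still holds a single entry and both calls return the same recorded result.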

🔍

Replayable Audit Trail

Every state transition is appended to a durable event log with full causality. Any job can be replayed step-by-step for debugging, compliance review, or post-incident root cause analysis.

# Full trace for any job
curl http://localhost:8080/api/jobs/{id}/trace/page

# Deterministic replay in sandbox
aetheris replay --job job_abc123
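Replay from an append-only log amounts to folding the recorded events back into a state, in order. The sketch below shows that idea in plain Python; the `Event` and `EventLog` types, the event kinds, and the `upto` parameter are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    seq: int
    kind: str
    payload: dict


@dataclass
class EventLog:
    events: list = field(default_factory=list)

    def append(self, kind, payload):
        # Events are only ever appended, never edited or removed.
        event = Event(len(self.events) + 1, kind, dict(payload))
        self.events.append(event)
        return event


def replay(events, upto=None):
    """Fold events in sequence order into a job state, optionally stopping early."""
    state = {"status": "pending", "steps": []}
    for event in sorted(events, key=lambda e: e.seq):
        if upto is not None and event.seq > upto:
            break
        if event.kind == "step_completed":
            state["steps"].append(event.payload["name"])
            state["status"] = "running"
        elif event.kind == "job_completed":
            state["status"] = "completed"
    return state


log = EventLog()
log.append("step_completed", {"name": "fetch"})
log.append("step_completed", {"name": "summarize"})
log.append("job_completed", {})
full = replay(log.events)
after_first_step = replay(log.events, upto=1)
```

Because the fold is deterministic, replaying the same events always yields the same state, and stopping at an earlier sequence number shows what the job looked like at that point.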

Quickstart

Running in under 3 minutes.

Clone, start the embedded runtime, and submit your first durable job. No cloud account, no Kubernetes, no configuration files required.

1

Clone and start

git clone https://github.com/Colin4k1024/Aetheris.git
cd Aetheris
make run-embedded

2

Verify the runtime is live

curl http://localhost:8080/api/health

3

Submit your first message

curl -X POST http://localhost:8080/api/agents/default/message \
  -H "Content-Type: application/json" \
  -d '{"content": "Summarize the top 3 HN stories"}'
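If you would rather script the same call than use curl, the request can be built with Python's standard library alone. This sketch only constructs the request; the `build_message_request` helper is hypothetical, and the URL and payload simply mirror the curl command above.

```python
import json
from urllib.request import Request


def build_message_request(base_url, agent_id, content):
    """Build the POST request that mirrors the curl command."""
    body = json.dumps({"content": content}).encode("utf-8")
    return Request(
        url=f"{base_url}/api/agents/{agent_id}/message",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_message_request(
    "http://localhost:8080", "default", "Summarize the top 3 HN stories"
)
# To send it against a running instance: urllib.request.urlopen(req)
```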

1

Register an external HTTP agent

# agents.yaml
agents:
  - id: my-agent
    type: external_http
    endpoint: http://localhost:9000/run
    timeout: 30s

2

Start your agent server

# Python example (any HTTP server works)
from flask import Flask, request, jsonify
app = Flask(__name__)

@app.route('/run', methods=['POST'])
def run():
    goal = request.json['goal']
    return jsonify({'answer': f'Done: {goal}'})

if __name__ == '__main__':
    # Listen on the port registered in agents.yaml
    app.run(port=9000)

3

Submit work via Aetheris

curl -X POST http://localhost:8080/api/agents/my-agent/message \
  -H "Content-Type: application/json" \
  -d '{"content": "your goal here"}'

1

Install the Python SDK

pip install aetheris-sdk

2

Submit a durable job

from aetheris import Client

client = Client("http://localhost:8080")
job = client.submit(
    agent="default",
    message="Summarize the report",
)
print(job.id, job.status)

3

Inspect the trace

trace = client.trace(job.id)
for step in trace.steps:
    print(step.name, step.status, step.duration_ms)

Key endpoints

GET

/api/health

Runtime liveness check

POST

/api/agents/:id/message

Submit work to an agent

GET

/api/jobs/:id/trace/page

View step-level execution trace

GET

/metrics

Prometheus metrics
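The /metrics endpoint serves the Prometheus text exposition format, which is easy to read in a script for quick local checks. Below is a minimal parser sketch that handles plain samples only (it skips comments and ignores timestamps); the metric names in the sample text are invented for illustration and are not the names Aetheris actually exports.

```python
def parse_metrics(text):
    """Parse samples from Prometheus text exposition format into a dict."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "}" in line:
            # Labelled sample: the name ends at the closing brace.
            cut = line.rfind("}") + 1
            name, rest = line[:cut], line[cut:]
        else:
            name, _, rest = line.partition(" ")
        samples[name] = float(rest.split()[0])
    return samples


# Hypothetical scrape output, for illustration only.
sample_text = """\
# HELP example_jobs_total Jobs by final status.
# TYPE example_jobs_total counter
example_jobs_total{status="completed"} 12
example_jobs_total{status="failed"} 1
example_workers 3
"""
metrics = parse_metrics(sample_text)
```

For anything beyond a quick check, a maintained Prometheus client or a real Prometheus server is the better tool.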

Integrations

Works with the stack you already use.

Bring your existing agent written in any language. The external HTTP boundary means Aetheris is language-agnostic by design.

🐍

Python

SDK + HTTP adapter

🐹

Go

Native SDK (Eino ADK)

🟨

Node.js

openclaw-adapter

🌐

Any HTTP

External agent contract

🐘

PostgreSQL

Durable job store

📊

Prometheus

Metrics & observability

🔭

OpenTelemetry

Distributed tracing

⚙️

MCP

Model Context Protocol

Live Runtime Detected

Jump into operational surfaces.

This page is being served by a running Aetheris instance. These shortcuts reach it directly.

✅

API Health

Sanity-check the running service and verify the API is live.

Open →

📊

Prometheus Metrics

Inspect exported runtime metrics for local debugging and ops dashboards.

Open →

🔭

Trace Overview

Jump into the built-in trace UI for multi-job inspection and replay.

Open →

Resources

Everything you need to go deeper.

From the five-minute quickstart to formal execution semantics — the full picture is in the docs.

🚀

Quickstart Guide

Embedded mode, external HTTP agents, and your first durable job in 5 minutes.

Read →

📖

API Reference

Full HTTP surface, job lifecycle endpoints, and operational APIs.

Read →

🔌

External HTTP Adapter

Connect any Python, JavaScript, or Go agent through the durable HTTP boundary.

Read →

🔒

Runtime Guarantees

How checkpointing, replay, and side-effect safety are modeled formally.

Read →

🐍

Python SDK

Packaged client for durable job submission from Python applications.

Browse →

💡

Examples

Working demos for agent integrations, workflows, and runtime patterns.

Browse →

The reliability layer
your AI agents are missing.

Ready to make your agents reliable?
