v2.5.3  ·  Open Source  ·  Go 1.26

The reliability layer
your AI agents are missing.

Aetheris wraps long-running agents in durable execution semantics — crash recovery, at-most-once side effects, replayable audit trails — without forcing you to rewrite your stack.

Durable: Checkpointed jobs
At-most-once: Side effects
Replayable: Audit trail
HTTP · Go · Python: Integration styles

Production agents fail in ways
your code doesn't handle.

Happy-path code runs fine in demos. In production, workers crash mid-run, retries duplicate payments, and nobody can prove what the agent actually did.

Without Aetheris
  • Worker crash → entire task restarts from zero
  • Retry logic re-sends emails, re-charges cards
  • No audit trail → impossible post-incident analysis
  • Human-in-the-loop requires custom state machines
  • Long-running jobs silently time out with no recovery
With Aetheris
  • Resume from last durable checkpoint automatically
  • Invocation ledger guarantees at-most-once execution
  • Every transition traced, replayable, and auditable
  • StatusParked for wait states — no custom state machine
  • Lease fencing keeps jobs alive across worker restarts
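To make the wait-state bullet concrete, here is a minimal sketch of a job that parks while it waits for a human decision and wakes only on the matching signal. Only the name StatusParked comes from the text above; the `Job` class, its `park` and `deliver` methods, and the signal name are illustrative assumptions, not the runtime's API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    RUNNING = "running"
    PARKED = "parked"  # stands in for the runtime's StatusParked
    COMPLETED = "completed"


@dataclass
class Job:
    """Hypothetical job record; a real runtime would persist these fields."""
    job_id: str
    status: Status = Status.RUNNING
    waiting_for: Optional[str] = None
    inbox: dict = field(default_factory=dict)

    def park(self, signal_name: str) -> None:
        # Record what the job is waiting for so a worker can be released.
        self.status = Status.PARKED
        self.waiting_for = signal_name

    def deliver(self, signal_name: str, payload: dict) -> bool:
        # Wake the job only if it is parked on this exact signal.
        if self.status is not Status.PARKED or self.waiting_for != signal_name:
            return False
        self.inbox[signal_name] = payload
        self.waiting_for = None
        self.status = Status.RUNNING
        return True


job = Job("job_abc123")
job.park("human_approval")
job.deliver("human_approval", {"approved": True})
```

Because the wait is plain data (a status plus a signal name), it can be checkpointed like any other state, which is what removes the need for a hand-written state machine.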

Drop in the runtime.
Keep your agent logic.

Aetheris sits between your agent and the outside world. You submit work over HTTP or through the Go or Python SDK; the runtime handles durability, scheduling, and side-effect safety.

Your Agent (Python · Go · HTTP)
  → Aetheris API (POST /api/agents/:id/message)
  → Aetheris Runtime (Planner → TaskGraph → Executor)
  → Tool Ledger (at-most-once guarantees)
  → External Tools (APIs · DBs · Services)
PostgreSQL job store · Checkpoint runner · Lease fencing · DAG compiler · OpenTelemetry traces · Replay engine

Three properties you can
reason about formally.

Aetheris is built around a small set of guarantees that compose correctly. Each one has a formal definition in the runtime spec.

Crash Recovery

When a worker dies, its lease on the job times out. The scheduler's lease fencing detects the timeout and assigns the job to the next available worker, which resumes from the last committed checkpoint. No manual intervention is required.

// Job resumes from checkpoint on worker restart
func Resume(ctx context.Context, jobID string) error {
    checkpoint, err := store.LoadCheckpoint(jobID)
    if err != nil {
        return fmt.Errorf("load checkpoint for job %s: %w", jobID, err)
    }
    return runner.ContinueFrom(ctx, checkpoint)
}
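The resume behavior can be tried end to end with a small simulation. This is a sketch under stated assumptions: `CheckpointStore` is an in-memory stand-in for the durable store, and `run_job` with its `crash_before` parameter is a hypothetical helper that exists only to simulate a worker dying mid-run.

```python
class CheckpointStore:
    """In-memory stand-in for a durable checkpoint store."""

    def __init__(self):
        self._next_step = {}

    def load(self, job_id):
        # Index of the first step that has not been committed yet.
        return self._next_step.get(job_id, 0)

    def commit(self, job_id, next_step):
        self._next_step[job_id] = next_step


def run_job(store, job_id, steps, executed, crash_before=None):
    """Run steps in order, committing a checkpoint after each one."""
    start = store.load(job_id)
    for i in range(start, len(steps)):
        if i == crash_before:
            raise RuntimeError("simulated worker crash")
        executed.append(steps[i])
        store.commit(job_id, i + 1)


store = CheckpointStore()
executed = []
steps = ["plan", "fetch", "summarize"]

try:
    run_job(store, "job_abc123", steps, executed, crash_before=2)
except RuntimeError:
    pass  # the first worker died after committing two steps

# A second worker picks the job up and continues from the checkpoint.
run_job(store, "job_abc123", steps, executed)
print(executed)
```

Each step appears once in `executed` because the second run starts at the committed index. A crash between running a step and committing it would still repeat that step, which is the gap the invocation ledger is meant to close.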
🔒

At-most-once Execution

Every tool invocation is recorded in an immutable ledger before execution. On retry, the runtime checks the ledger — if the call already happened, the recorded result is returned without re-executing.

// Ledger prevents duplicate side effects
if result, ok := ledger.Lookup(invocationID); ok {
    return result, nil // idempotent reply
}
return tool.Execute(ctx, params)
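A fuller sketch of the same idea, including the write-before-execute step described above. The `Ledger` class, `invoke` function, and `InvocationInDoubt` error are illustrative names, not the runtime's API. The sketch also assumes that a call recorded as started but never completed must not be re-run automatically, since re-running it would break the at-most-once property.

```python
class InvocationInDoubt(Exception):
    """The call was started but its outcome was never recorded."""


class Ledger:
    """In-memory stand-in for a durable, append-only invocation ledger."""

    def __init__(self):
        self._entries = {}

    def lookup(self, invocation_id):
        return self._entries.get(invocation_id)

    def begin(self, invocation_id):
        self._entries[invocation_id] = {"state": "started", "result": None}

    def complete(self, invocation_id, result):
        self._entries[invocation_id] = {"state": "completed", "result": result}


def invoke(ledger, invocation_id, tool, params):
    entry = ledger.lookup(invocation_id)
    if entry is not None:
        if entry["state"] == "completed":
            return entry["result"]  # replay the recorded result
        raise InvocationInDoubt(invocation_id)  # never re-run blindly
    ledger.begin(invocation_id)  # record intent before executing
    result = tool(params)
    ledger.complete(invocation_id, result)
    return result


charges = []


def charge_card(params):
    # Hypothetical side-effecting tool.
    charges.append(params["amount"])
    return {"charged": params["amount"]}


ledger = Ledger()
first = invoke(ledger, "inv_1", charge_card, {"amount": 42})
retry = invoke(ledger, "inv_1", charge_card, {"amount": 42})
```

The retry returns the recorded result and `charges` still holds a single entry. An in-doubt invocation surfaces as an error so an operator or a reconciliation step can decide what happened.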
🔍

Replayable Audit Trail

Every state transition is appended to a durable event log with full causality. Any job can be replayed step-by-step for debugging, compliance review, or post-incident root cause analysis.

# Full trace for any job
curl http://localhost:8080/api/jobs/{id}/trace/page

# Deterministic replay in sandbox
aetheris replay --job job_abc123
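Conceptually, replay is a fold over the event log: state is rebuilt from recorded events and no tool is called again. The sketch below assumes a hypothetical event shape (`seq`, `job_id`, `kind`, `data`) and two made-up event kinds; the real log format is defined by the runtime spec, not here.

```python
import json


class EventLog:
    """Append-only log; events are stored as serialized, immutable strings."""

    def __init__(self):
        self._events = []

    def append(self, job_id, kind, data):
        event = {
            "seq": len(self._events) + 1,
            "job_id": job_id,
            "kind": kind,
            "data": data,
        }
        self._events.append(json.dumps(event, sort_keys=True))

    def events(self, job_id):
        decoded = (json.loads(raw) for raw in self._events)
        return [e for e in decoded if e["job_id"] == job_id]


def replay(events):
    """Rebuild job state from events alone, without executing any tool."""
    state = {"status": "pending", "steps": []}
    for event in events:
        if event["kind"] == "step_completed":
            state["steps"].append(event["data"]["name"])
        elif event["kind"] == "status_changed":
            state["status"] = event["data"]["to"]
    return state


log = EventLog()
log.append("job_abc123", "status_changed", {"to": "running"})
log.append("job_abc123", "step_completed", {"name": "plan"})
log.append("job_abc123", "status_changed", {"to": "completed"})
state = replay(log.events("job_abc123"))
```

Replaying the same events always yields the same state, which is the property that makes the log usable for debugging and compliance review.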

Running in under 3 minutes.

Clone, start the embedded runtime, and submit your first durable job. No cloud account, no Kubernetes, no configuration files required.

1. Clone and start
git clone https://github.com/Colin4k1024/Aetheris.git
cd Aetheris
make run-embedded
2. Verify the runtime is live
curl http://localhost:8080/api/health
3. Submit your first message
curl -X POST http://localhost:8080/api/agents/default/message \
  -H "Content-Type: application/json" \
  -d '{"content": "Summarize the top 3 HN stories"}'
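The same submission can be made from Python with only the standard library. This is a sketch that mirrors the curl call above: the URL, header, and `content` field come from that command, while the helper names `build_message_request` and `submit` are illustrative. The response body is returned as raw text because its shape is not specified here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"


def build_message_request(agent_id, content, base_url=BASE_URL):
    """Build the POST request without sending it."""
    body = json.dumps({"content": content}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/agents/{agent_id}/message",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(agent_id, content, timeout=10.0):
    """Send the message and return (HTTP status, raw response text)."""
    request = build_message_request(agent_id, content)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


# With the embedded runtime running:
#   status, text = submit("default", "Summarize the top 3 HN stories")
```

Keeping request construction separate from sending makes the call easy to inspect or unit-test without a live runtime.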
1. Register an external HTTP agent
# agents.yaml
agents:
  - id: my-agent
    type: external_http
    endpoint: http://localhost:9000/run
    timeout: 30s
2. Start your agent server
# Python example (any HTTP server works)
from flask import Flask, request, jsonify
app = Flask(__name__)

@app.route('/run', methods=['POST'])
def run():
    goal = request.json['goal']
    return jsonify({'answer': f'Done: {goal}'})

if __name__ == '__main__':
    # Listen on the port named in agents.yaml
    app.run(port=9000)
3. Submit work via Aetheris
curl -X POST http://localhost:8080/api/agents/my-agent/message \
  -H "Content-Type: application/json" \
  -d '{"content": "your goal here"}'
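Since any HTTP server works, the agent in step 2 can also be written without Flask. The sketch below uses only the Python standard library and assumes the same contract as the Flask example: a POST to /run carrying a JSON body with a `goal` field, answered with a JSON object containing `answer`. The names `handle_goal`, `AgentHandler`, and `make_server` are illustrative.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_goal(payload):
    """Agent logic, kept separate from HTTP plumbing."""
    return {"answer": f"Done: {payload['goal']}"}


class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/run":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            body = json.dumps(handle_goal(payload)).encode("utf-8")
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "expected a JSON object with a 'goal' field")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port=9000):
    # Port 9000 matches the endpoint in agents.yaml.
    return HTTPServer(("127.0.0.1", port), AgentHandler)


# To serve: make_server().serve_forever()
```

Malformed requests get a 400 instead of an unhandled exception, which keeps failures visible in the job trace rather than buried in agent logs.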
1. Install the Python SDK
pip install aetheris-sdk
2. Submit a durable job
from aetheris import Client

client = Client("http://localhost:8080")
job = client.submit(
    agent="default",
    message="Summarize the report",
)
print(job.id, job.status)
3. Inspect the trace
trace = client.trace(job.id)
for step in trace.steps:
    print(step.name, step.status, step.duration_ms)
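Once steps are available, a short helper can turn them into a summary. This sketch relies only on the three step attributes used above (`name`, `status`, `duration_ms`). The `Step` dataclass is a stand-in for whatever object the SDK returns, and the status strings in the sample are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Stand-in for an SDK trace step; only the fields used above."""
    name: str
    status: str
    duration_ms: int


def summarize(steps):
    """Return total duration, the slowest step, and a count per status."""
    total_ms = sum(step.duration_ms for step in steps)
    slowest = max(steps, key=lambda step: step.duration_ms, default=None)
    by_status = {}
    for step in steps:
        by_status[step.status] = by_status.get(step.status, 0) + 1
    return {
        "total_ms": total_ms,
        "slowest": slowest.name if slowest else None,
        "by_status": by_status,
    }


# Sample data with hypothetical step names and statuses.
sample = [Step("plan", "ok", 120), Step("fetch", "ok", 850), Step("summarize", "ok", 430)]
summary = summarize(sample)
```

Because the helper is duck-typed, passing `trace.steps` from the SDK should work as long as each step exposes those three attributes.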
Key endpoints
GET   /api/health               Runtime liveness check
POST  /api/agents/:id/message   Submit work to an agent
GET   /api/jobs/:id/trace/page  View step-level execution trace
GET   /metrics                  Prometheus metrics

Works with the stack you already use.

Bring your existing agent written in any language. The external HTTP boundary means Aetheris is language-agnostic by design.

🐍 Python: SDK + HTTP adapter
🐹 Go: Native SDK (Eino ADK)
🟨 Node.js: openclaw-adapter
🌐 Any HTTP: External agent contract
🐘 PostgreSQL: Durable job store
📊 Prometheus: Metrics & observability
🔭 OpenTelemetry: Distributed tracing
⚙️ MCP: Model Context Protocol

Everything you need to go deeper.

From the five-minute quickstart to formal execution semantics — the full picture is in the docs.

Ready to make your agents reliable?

Open source · MIT License · No vendor lock-in

Get started in 5 minutes →
View on GitHub