Relier makes Celery reliable. One decorator wraps your existing tasks with crash recovery, idempotent execution, two-tier timeouts, graceful shutdown, admission control, and a DLQ without changing your function bodies or your Redis broker.
Every task either completes, hands off to another worker, or lands in the Dead Letter Queue with a traceable reason. Nothing silently disappears.
→ Docs · Quickstart · Benchmarks
Vanilla Celery:
@celery_app.task
def charge_customer(customer_id: str, amount_cents: int):
    return stripe.charge(customer_id, amount_cents)

charge_customer.delay("cus_abc", 5000)
# - Worker dies mid-charge -> task lost
# - Network blip causes retry -> customer charged twice
# - Stripe hangs -> task hangs the worker forever
# - Traffic spike -> queue floods, cascade failure

With Relier (same function, four added kwargs):
from relier import rl_task

@rl_task(
    queue="high_priority",
    idempotent=True,   # idempotent execution via atomic Redis Lua
    soft_timeout=8,    # cleanup hook fires at 8s
    hard_timeout=10,   # cancelled at 10s
)
async def charge_customer(customer_id: str, amount_cents: int) -> dict:
    return await stripe.charge(customer_id, amount_cents)

await charge_customer.apush("cus_abc", 5000)
# - Worker dies -> Phoenix re-queues within ~7s (p99), same args; idempotency
# stops a double-charge
# - Network blip -> cached result returned, no second charge
# - Stripe hangs -> cancelled at 10s, quarantined to DLQ with full payload
# - Traffic spike -> AdmissionRejectedError with Retry-After, HTTP 429 ready

That's the entire migration. Your function body doesn't change. Your call site
swaps .delay(...) for await task.apush(...) (async) or task.push(...)
(sync, for Flask / Django views / scripts).
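
The soft/hard timeout pair in the example above is easier to picture in miniature. The following is a minimal sketch using only the standard library, not Relier's implementation; `run_with_timeouts` and `on_soft_timeout` are hypothetical names chosen for illustration. The soft deadline calls a cleanup hook while the task keeps running, and the hard deadline cancels it.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def run_with_timeouts(
    coro: Awaitable[Any],
    soft_timeout: float,
    hard_timeout: float,
    on_soft_timeout: Callable[[], None],
) -> Any:
    """Run coro; call the hook at the soft deadline, cancel at the hard one."""
    task = asyncio.ensure_future(coro)

    def fire_soft_hook() -> None:
        # Only fire if the task is still running at the soft deadline.
        if not task.done():
            on_soft_timeout()

    soft_handle = asyncio.get_running_loop().call_later(soft_timeout, fire_soft_hook)
    try:
        # wait_for cancels the task and raises TimeoutError at the hard deadline.
        return await asyncio.wait_for(task, timeout=hard_timeout)
    finally:
        soft_handle.cancel()  # no-op if the hook already fired


async def demo() -> list:
    events: list = []

    async def fast() -> str:
        await asyncio.sleep(0.01)
        return "done"

    async def hung() -> str:
        await asyncio.sleep(10)  # stands in for a dependency that never answers
        return "never"

    events.append(await run_with_timeouts(fast(), 0.05, 0.1, lambda: events.append("soft")))
    try:
        await run_with_timeouts(hung(), 0.05, 0.1, lambda: events.append("soft"))
    except asyncio.TimeoutError:
        events.append("hard")
    return events


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The fast task finishes before either deadline, so no hook fires. The hung task triggers the hook first and is cancelled afterwards, which is the ordering a cleanup hook depends on.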
| Problem | Vanilla Celery | With Relier |
|---|---|---|
| Worker OOM-killed mid-task | Lost forever, no trace | Phoenix re-queues within ~7 s (p99) |
| Non-idempotent retries | Your problem to solve | idempotent=True atomic Lua prevents concurrent duplicate execution |
| No task timeouts | Zombie tasks block workers | Two-tier soft/hard timeout with cleanup hooks |
| Ungraceful deploys | ~40% of in-flight tasks silently lost | SIGTERM drain + handoff to other workers |
| No visibility | celery inspect, then squint | rl tasks inflight --follow, structured output |
| Traffic spikes | Queue floods, cascade failures | Atomic admission control, Retry-After |
| Poison-pill tasks | Crash workers forever | Quarantined to DLQ after max_resurrections |
| Schema drift on rolling deploy | Old payloads on new code fail silently | Versioned envelope + sequential migrations — old and new workers run simultaneously safely |
All eight covered. Same Celery programming model. Same Redis broker. No new infrastructure to operate beyond what you already have.
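
The "versioned envelope + sequential migrations" row can be sketched as follows. This is an illustration under stated assumptions, not Relier's envelope format: the payload history, the `MIGRATIONS` registry and the `upgrade` helper are all hypothetical, and signing is left out. Each migration lifts a payload by exactly one version, so a worker on newer code can still consume envelopes produced before the deploy.

```python
from typing import Callable, Dict

# Hypothetical payload history for charge_customer:
#   v1: {"customer_id", "amount"}        amount in dollars
#   v2: {"customer_id", "amount_cents"}  amount in integer cents
#   v3: v2 plus "currency"


def _v1_to_v2(payload: dict) -> dict:
    out = {k: v for k, v in payload.items() if k != "amount"}
    out["amount_cents"] = int(round(payload["amount"] * 100))
    return out


def _v2_to_v3(payload: dict) -> dict:
    return {**payload, "currency": "usd"}


# MIGRATIONS[n] upgrades a payload from version n to version n + 1.
MIGRATIONS: Dict[int, Callable[[dict], dict]] = {1: _v1_to_v2, 2: _v2_to_v3}
CURRENT_VERSION = 3


def upgrade(envelope: dict) -> dict:
    """Apply migrations one step at a time until the envelope is current."""
    version, payload = envelope["version"], envelope["payload"]
    if version > CURRENT_VERSION:
        # Policy choice made for this sketch only: refuse envelopes from the future.
        raise ValueError(f"envelope v{version} is newer than this worker (v{CURRENT_VERSION})")
    while version < CURRENT_VERSION:
        payload = MIGRATIONS[version](payload)
        version += 1
    return {"version": version, "payload": payload}


if __name__ == "__main__":
    old = {"version": 1, "payload": {"customer_id": "cus_abc", "amount": 50.0}}
    print(upgrade(old))
```

Keeping every step single-version means a new schema change only ever adds one function to the chain; older payloads reach the current shape by composition.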
Relier is a thin wrapper around Celery, not a replacement for it.
You keep your workers (celery -A relier.tasks.app worker), your Redis broker,
your queue names, your @task intuition. Relier adds a lifecycle layer on top:
heartbeat tracking, resurrection, idempotency, timeouts, graceful shutdown. Your
function bodies don't change. Your infrastructure doesn't change. You add one
decorator, switch .delay() to .push(), and you're done.
Relier is not Temporal or Hatchet.
Temporal and Hatchet are workflow engines. They model multi-step workflows with deterministic replay, activity retries across process restarts, and saga compensation. That's a fundamentally different problem and a fundamentally different programming model. If you need long-running workflows spanning hours, human approval steps, or saga rollbacks, use one of those.
Relier is for teams that already have Celery tasks and want them to stop disappearing. No workflow model. No deterministic replay. No new service to operate. Same Redis you already have.
Relier is not a DAG runner.
Prefect, Airflow, Dagster, Luigi: these schedule and orchestrate pipelines of dependent tasks. They have UIs, schedulers, and retry policies baked into a pipeline definition. Relier has none of that.
Relier makes individual Celery tasks reliable. What those tasks do, when they run, and how they depend on each other is still your problem and Celery's.
vs. building it yourself. Most teams write some subset of this: an
idempotency table, sometimes a heartbeat-based resurrector, occasionally a DLQ.
The pieces are individually well-understood. Composing them correctly (fence tokens
for the GC-pause-victim case, AOF + noeviction preflight checks, thundering-herd
defences on resurrection batches) is what Relier exists to spare you from. The
chaos suite ships first-party so you can verify the guarantees hold on your own
cluster, not just trust ours.
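
The GC-pause-victim case mentioned above is the reason fence tokens exist: a worker that stalls long enough to be declared dead may wake up and try to commit after its task has been handed to someone else. A minimal in-memory sketch of the idea follows; `FencedStore` is a hypothetical stand-in, not Relier's Redis protocol.

```python
from typing import Optional


class FencedStore:
    """In-memory stand-in for a result store that checks fence tokens."""

    def __init__(self) -> None:
        self._token = 0  # highest token issued so far
        self.result: Optional[str] = None

    def acquire_lease(self) -> int:
        """Issue a new, strictly larger token; every older token becomes stale."""
        self._token += 1
        return self._token

    def commit(self, token: int, value: str) -> bool:
        """Accept the write only from the holder of the newest token."""
        if token != self._token:
            return False
        self.result = value
        return True


if __name__ == "__main__":
    store = FencedStore()
    token_a = store.acquire_lease()    # worker A starts, then stalls in a long pause
    token_b = store.acquire_lease()    # task is re-queued; worker B takes over
    print(store.commit(token_b, "B"))  # True: B holds the newest token
    print(store.commit(token_a, "A"))  # False: A's late write is rejected
    print(store.result)                # B
```

Without the token check, worker A's late commit would silently overwrite worker B's result.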
pip install relier

Requirements: Python 3.11+, Redis 7+ with AOF persistence and
maxmemory-policy noeviction. Relier preflight-checks both and refuses to
start if either is wrong.
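
To see what such a preflight check has to look at, here is a rough sketch that validates a dict shaped like the reply to Redis `CONFIG GET`. The function name `preflight_errors` is hypothetical and this is not Relier's own check; only the two Redis setting names, `appendonly` and `maxmemory-policy`, are real.

```python
from typing import Dict, List


def preflight_errors(config: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    errors: List[str] = []
    if config.get("appendonly") != "yes":
        errors.append("appendonly must be 'yes' (AOF persistence)")
    if config.get("maxmemory-policy") != "noeviction":
        errors.append("maxmemory-policy must be 'noeviction'")
    return errors


if __name__ == "__main__":
    print(preflight_errors({"appendonly": "yes", "maxmemory-policy": "noeviction"}))
    print(preflight_errors({"appendonly": "no", "maxmemory-policy": "allkeys-lru"}))
```

Both settings matter for the same reason: without AOF a Redis restart can drop task state, and with an eviction policy Redis may discard keys under memory pressure.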
If you're already running Celery and want it to stop losing tasks — yes.
If you're starting a new project and open to a different paradigm — consider Temporal or Hatchet first. Relier is a reliability layer for existing Celery deployments, not a reason to choose Celery over modern alternatives.
If you need workflow orchestration, DAGs, or deterministic replay — use Prefect, Airflow, or Temporal. Relier makes individual tasks reliable; it doesn't orchestrate pipelines.
# tasks.py
from relier import rl_task

@rl_task(idempotent=True, hard_timeout=30)
async def send_invoice(invoice_id: str) -> dict:
    await charge_card(invoice_id)
    await email_invoice(invoice_id)
    return {"invoice_id": invoice_id}

# main.py (FastAPI)
from fastapi import FastAPI
from tasks import send_invoice

app = FastAPI()

@app.post("/invoices/{invoice_id}/send")
async def dispatch(invoice_id: str) -> dict:
    await send_invoice.apush(invoice_id)
    return {"status": "queued"}

# Three processes - bare metal, no Docker required
# --include=tasks tells the worker where your @rl_task functions live
celery -A relier.tasks.app worker -l info -Q high_priority,default,low_priority,re-queue --include=tasks
rl run-resurrector
uvicorn main:app

Or get the full stack (Redis + workers + resurrector + OTel + Grafana) if you've cloned the repo:

make dev   # docker-compose.yml, single-node Redis with AOF
make prod  # docker-compose.prod.yml, Redis HA with Sentinel + backup

Full quickstart: docs/quickstart.md.
# Seed a long-running task, SIGKILL the worker that's running it,
# watch Phoenix re-queue it onto a healthy worker, live.
rl chaos worker-kill --seed --watch --watch-duration 60

Five chaos scenarios ship with Relier: worker-kill, network-partition,
load-spike, task-corrupt, slow-task. They let you prove the reliability
claims against your own cluster, your own task code, your own Redis. Most
projects ship a test suite; Relier also ships a chaos suite.
Full guide: docs/chaos-guide.md.
Measured by the built-in bench suite (docker compose -f docker-compose.bench.yml up --build) on Linux with prefork workers and synthetic 0.5 s tasks. All claims are verified end-to-end, not as microbenchmarks against a mock.
Numbers below: Relier v0.1.7, captured 2026-06-03 (9/9 claims verified). Re-run with make bench-docker to compare on your hardware.
Environment: Linux (Docker, python:3.11-slim, prefork=4), Redis 7.2 AOF, 500 tasks × 5 kills

| Metric | Relier 0.1.7 | Vanilla Celery | Vanilla +acks_late |
|---|---|---|---|
| Task delivery rate (5 SIGKILL) | 100% (500/500) | 92.0% (460/500) | 96.0% (480/500, 0 dup) |
| OOM recovery avg / p99 | 6.9 s / 7.0 s | ∞ lost | partial (visibility) |
| Dual-OOM (2 concurrent tasks) | 2/2 · 7.0 s | both lost | partial (visibility) |
| Idempotent recovery (delayed) | re-ran 1.0 s | ∞ lost | partial (visibility) |
| Idempotency (50 submissions) | 1 execution | 50 executions | 50 executions |
| Admission control p99 / max | 0.323 ms / 1.15 ms | n/a | n/a |
| Graceful shutdown (3 cycles) | 100% | 0% | 0% |
| Dispatch overhead (net avg) | +0.99 ms | n/a | n/a |
| Cold-start to first task | 1.00 s avg | n/a | n/a |
| Resurrection under load (5 kill) | 5/5 · 1.1 s p99 | all lost | partial (visibility) |
| Worker RAM (idle, per process) | +16 MB/proc | n/a | n/a |
| File descriptor leak Δ | +0 (stable) | n/a | n/a |
+0.99 ms per dispatch pays for: atomic admission check, SHA-256-signed envelope wrap, heartbeat registration. On any task that does real work (a DB query, an HTTP call, an AI inference), this is invisible.
At 1.76 ms average per dispatch, a single async producer sustains ~570 apush() calls/second per thread. FastAPI producers fan out well past 1,000/second.
The admission control Lua script stays under 1 ms at p99 (0.323 ms), meaning the tail-latency cost of the admission check is bounded for the vast majority of requests. The "Vanilla +acks_late" column shows what flipping task_acks_late=True actually buys you: partial recovery (96.0% vs 92.0%) but not Relier's 100%, because the Redis broker's visibility_timeout default (~1 hour) gates redelivery long after most completions would have happened.
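
For readers unfamiliar with fixed-window limiting, the sketch below shows the decision an admission check makes and where a Retry-After value comes from. It is an in-memory illustration, not Relier's Lua script: `FixedWindowLimiter` is a hypothetical name, the exception class here is a local stand-in for the one Relier raises, and the clock is passed in so the behaviour is deterministic. In Redis the check and the increment would run atomically inside one script.

```python
import math
from typing import Dict


class AdmissionRejectedError(Exception):
    """Local stand-in for the error raised when a request is not admitted."""

    def __init__(self, retry_after: int) -> None:
        super().__init__(f"admission rejected, retry after {retry_after}s")
        self.retry_after = retry_after


class FixedWindowLimiter:
    """Admit at most `limit` calls per `window_s`-second window."""

    def __init__(self, limit: int, window_s: int) -> None:
        self.limit = limit
        self.window_s = window_s
        self._counts: Dict[int, int] = {}

    def admit(self, now: float) -> None:
        window = int(now // self.window_s)
        count = self._counts.get(window, 0)
        if count >= self.limit:
            # Seconds until the next window opens, rounded up for Retry-After.
            window_end = (window + 1) * self.window_s
            raise AdmissionRejectedError(max(1, math.ceil(window_end - now)))
        # Keep only the current window's counter; older windows are irrelevant.
        self._counts = {window: count + 1}


if __name__ == "__main__":
    limiter = FixedWindowLimiter(limit=2, window_s=10)
    limiter.admit(0.0)
    limiter.admit(3.0)
    try:
        limiter.admit(4.5)
    except AdmissionRejectedError as exc:
        print(exc.retry_after)  # seconds left in the current window
    limiter.admit(10.0)  # a new window admits again
```

An HTTP handler can map the exception directly to a 429 response with a `Retry-After` header set to `retry_after`.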
The default run above kills 5 workers across 500 tasks. The --scale profile raises the sample size on every test — so dedup, recovery, and admission numbers rest on a meaningful N, not a token handful — and still passes 9/9:
Environment: Linux (Docker, prefork), synthetic 0.05 s tasks, python -m bench.bench --scale

| Metric | Relier 0.1.7 | Vanilla Celery |
|---|---|---|
| Delivery rate (10,000 tasks, 10 kills) | 100% (10,000/10,000) | 99.07% default · 99.86% acks_late (0 dup) |
| Duplicate prevention (2,000 submissions) | 1/2,000 ran | 2,000/2,000 ran |
| Worker OOM recovery (20 cycles) | 7.0 s avg / 7.0 s p99 | ∞ lost |
| Admission control p99 (50,000 samples) | 0.248 ms (p99.9 0.338 ms) | n/a |
| Graceful shutdown (5 cycles) | 100% | 8.4% |
| Resurrection under load (25 inflight) | 25/25 · 6.3 s p99 | ∞ all lost |
| Worker RAM (idle, per process) | +14.1 MB/proc | n/a |
The headline guarantees don't soften under 20× the load: delivery stays at 100%, dedup holds at 1-of-2,000, and resurrection recovers all 25 simultaneously-killed in-flight tasks within the same heartbeat-bound window (p99 6.3 s) — recovery does not degrade with the number of concurrent deaths.
Full methodology, per-test breakdowns, and Docker Compose instructions: docs/benchmarks.md.
Test 7 reports Redis ops/sec with N tasks inflight and the same workers idle, both as measured — it doesn't subtract them, because a worker busy inside a task polls the broker less than an idle one, so the inflight figure can read below idle. Relier's own per-task steady-state cost is the heartbeat refresh: 2 ops every heartbeat_ttl/2 s = 0.4 ops/sec/task (~400/s at 1k inflight, ~4,000/s at 10k) — deterministic and tiny.
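
The heartbeat figures above follow from a one-line formula, reproduced here so you can plug in your own numbers. The 10-second `heartbeat_ttl` default in this sketch is inferred from the quoted 0.4 ops/sec/task (2 ops every 5 s) and is an assumption, as is the helper name.

```python
def heartbeat_ops_per_sec(
    inflight: int,
    heartbeat_ttl_s: float = 10.0,  # assumed: implied by 0.4 ops/sec/task
    ops_per_refresh: int = 2,
) -> float:
    """Steady-state Redis ops/sec spent on heartbeat refreshes."""
    refresh_interval_s = heartbeat_ttl_s / 2
    return inflight * ops_per_refresh / refresh_interval_s


if __name__ == "__main__":
    for n in (1, 1_000, 10_000):
        print(n, heartbeat_ops_per_sec(n))
```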
The real Redis cost is per-task lifecycle ops (dispatch + register + complete), about 13–16 ops per task end-to-end. Capacity scales with task turnover rate, not inflight count:
| Workload | Tasks/sec | Redis ops/sec | Single-master Redis |
|---|---|---|---|
| 1M tasks/day | ~12 | ~180 | trivial |
| 10M tasks/day | ~120 | ~1,800 | trivial |
| 100M tasks/day | ~1,200 | ~18,000 | comfortable |
| 1B tasks/day | ~12,000 | ~180,000 | needs sharding |
Long-running tasks are cheap at the steady-state level (just the 0.4 ops/sec/task heartbeat), so you can hold tens of thousands of concurrent ETL jobs inflight without saturating Redis. Single-master Redis tops out at roughly 7,000–10,000 tasks/sec end-to-end (100k–150k ops/sec ÷ ~15 ops/task); past that, the path is scaling Redis vertically, Redis Cluster (Relier ships hash-tagged keys for this), or a RabbitMQ broker. Full breakdown: docs/benchmarks.md § Scaling ceiling.
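
The capacity table and the single-master ceiling come from the same arithmetic. The sketch below uses hypothetical helper names and assumes 15 ops per task, a working figure within the quoted 13–16 range. It gives slightly lower ops/sec than the table because the table rounds tasks/sec up before multiplying.

```python
SECONDS_PER_DAY = 86_400
OPS_PER_TASK = 15  # assumed working figure within the quoted 13-16 range


def redis_ops_per_sec(tasks_per_day: int, ops_per_task: int = OPS_PER_TASK) -> float:
    """Redis ops/sec generated by task turnover alone."""
    return tasks_per_day / SECONDS_PER_DAY * ops_per_task


def max_tasks_per_sec(redis_ops_budget: int, ops_per_task: int = OPS_PER_TASK) -> float:
    """Task throughput a given Redis ops/sec budget can sustain."""
    return redis_ops_budget / ops_per_task


if __name__ == "__main__":
    for per_day in (1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
        print(per_day, round(redis_ops_per_sec(per_day)))
    print(round(max_tasks_per_sec(100_000)), round(max_tasks_per_sec(150_000)))
```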
- Zero job loss (Phoenix Pattern): heartbeat-based crash detection, atomic re-queue with lease + fence tokens.
- Idempotent execution: atomic Redis Lua prevents concurrent duplicate execution of the same logical task via claim/in-flight/completed states. `@rl_task(idempotent=True)` for automatic keying; `idempotency_lock(key, ttl)` for manual control with `lock.set_result(value)`; result is committed automatically on context exit, lock released automatically on exception.
- Two-tier timeouts: soft (cleanup hook) + hard (asyncio cancellation), enforced on async tasks.
- Checkpointing: `ctx.set_partial(state)` in the soft-timeout hook saves progress to Redis; the next resurrection resumes from that state instead of starting over.
- Graceful shutdown: SIGTERM drain phase, handoff to Phoenix for tasks that won't finish in time.
- Dead Letter Queue: full payload + reason + resurrection history. CLI to inspect, release, retry-all, purge.
- Admission control: atomic Lua-based fixed-window limiter, returns `Retry-After`.
- SLO burn-rate tracking: 1h / 6h / 3d windows, Google SRE-style burn rates, JSON or table output.
- Schema versioning: signed envelopes with sequential migrations for rolling deploys; old and new workers can run simultaneously without payload mismatches.
- Full OpenTelemetry: every lifecycle event emits spans and metrics. Bundled OTel -> Prometheus -> Grafana stack.
- Redis HA out of the box: Sentinel-based failover, replicas, hourly RDB backups, optional S3 offsite.
- Async-first, sync-compatible: `apush` for asyncio (FastAPI), `push` for sync code (Flask, Django, scripts).
- Chaos suite: five scenarios to verify the guarantees on your cluster.
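
The claim/in-flight/completed states named in the idempotency bullet can be pictured with a small in-memory model. This is a sketch under assumptions, not Relier's Lua script or its `idempotency_lock` API: `IdempotencyStore` and `run_once` are hypothetical, and a Redis-backed version would perform `claim` as one atomic script so two workers cannot both win.

```python
from typing import Any, Callable, Dict, Tuple


class IdempotencyStore:
    """In-memory stand-in for the per-key idempotency record."""

    def __init__(self) -> None:
        self._state: Dict[str, Tuple[str, Any]] = {}

    def claim(self, key: str) -> Tuple[str, Any]:
        """Return ("claimed", None), ("in_flight", None) or ("completed", result)."""
        if key not in self._state:
            self._state[key] = ("in_flight", None)
            return ("claimed", None)
        return self._state[key]

    def complete(self, key: str, result: Any) -> None:
        self._state[key] = ("completed", result)

    def release(self, key: str) -> None:
        self._state.pop(key, None)


def run_once(store: IdempotencyStore, key: str, fn: Callable[[], Any]) -> Any:
    """Execute fn at most once per key; later callers get the cached result."""
    status, cached = store.claim(key)
    if status == "completed":
        return cached
    if status == "in_flight":
        raise RuntimeError(f"{key} is already running elsewhere")
    try:
        result = fn()
    except Exception:
        store.release(key)  # failed attempt: allow a retry to claim again
        raise
    store.complete(key, result)
    return result


if __name__ == "__main__":
    calls = []
    store = IdempotencyStore()

    def charge() -> dict:
        calls.append(1)
        return {"charge_id": "ch_1"}

    results = [run_once(store, "charge:cus_abc:5000", charge) for _ in range(50)]
    print(len(calls), results[0] == results[-1])
```

Fifty submissions with the same key produce one execution, which is the behaviour the idempotency benchmark row measures.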
Full feature reference: docs/.

| Guide | What it covers |
|---|---|
| Quickstart | 5-minute working setup |
| Celery Primer | If you've never used Celery |
| Core Concepts | What each mechanism does and why |
| Integration Recipes | FastAPI, Flask, Django, scripts |
| Patterns Cookbook | Idempotency keys, checkpoints, dedicated workers |
| Troubleshooting & FAQ | First place to look when things break |
| API Reference | Every @rl_task option, every dispatch method |
| Configuration | Every RELIER_* env var |
| CLI Reference | Every rl subcommand, what it touches in Redis |
| Deployment | Bare metal, Docker dev, Docker prod, Kubernetes |
| Durability & HA | What's protected against which failure mode |
| Architecture | Internals: async bridge, Redis keys, Lua scripts |
| Metrics Reference | OTel metric names and labels for dashboards |
| Chaos Guide | How to verify the guarantees yourself |
- PyPI project links (v0.1.7): `pyproject.toml` now declares `[project.urls]` (Homepage, Documentation, Repository, Issues, Changelog), so the PyPI sidebar links straight to the docs, repo, and changelog. Packaging metadata only: no code or behaviour change, benchmarks unchanged.
- Teal CLI palette (v0.1.6): all `rl` commands now use a consistent teal/dark theme, with structured log output, coloured flags and env vars, dimmed comments, and coloured extra fields in `rl run-resurrector`.
- Resurrector claim grace period (v0.1.6): the "never claimed" false-positive warning on cold worker starts is fixed. The resurrector now waits `resurrection_claim_grace_period` seconds (default 30 s) before declaring a re-queued task unclaimed, eliminating false alerts during slow worker startups.
- Bench refresh (v0.1.6): idempotent recovery re-runs 1.0 s after restart, dual-OOM resurrects both tasks in 7.0 s, resurrection under load at p99 1.1 s (5 inflight) and 6.3 s (25 inflight, `--scale`).
Full history in the CHANGELOG.
Relier is pre-1.0. The API is stabilising but may change before 1.0. The internals (Redis key layout, Lua scripts, fence-token protocol) are production-grade and have been validated against the bundled chaos suite, including under network partitions and mass worker failure.
If you're considering it for production: read Durability & HA first, then run the chaos suite against a staging cluster that mirrors your prod setup. File issues for anything that surprises you. Those are the inputs that get the project to 1.0.
Issues and pull requests welcome. Particularly valuable:
- Real-world workloads that don't fit the current Patterns Cookbook
- Failure modes the durability matrix doesn't cover
- Documentation gaps you hit while integrating
- Performance numbers from your environment (`make bench` output plus a one-line spec)
git clone https://github.com/getrelier/relier
cd relier
cp .env.example .env # fill in your Redis URL
make setup # venv + dev deps + pre-commit
make test # unit tests
make test-integration # integration tests against test-container Redis
make bench # synthetic bench smoke (no Docker, ~2 min)
make bench-docker # full bench in Docker with Prometheus + Grafana

Open a PR against main. Quality gates: make lint check test must pass; make test-integration is recommended if you touched anything in core/ or tasks/.
- Issues: bugs, feature requests, questions via the issue templates above
- Discussions: github.com/getrelier/relier/discussions for ideas, integrations, show and tell
- X / Twitter: @relierdev for release announcements and short-form updates
- Releases: watch this repo for new releases; the changelog is in each GitHub Release
MIT. See LICENSE.
Built on Celery, Redis, asyncio, and OpenTelemetry. The Phoenix Pattern owes its name to the obvious metaphor; the fence-token approach is borrowed from Martin Kleppmann's writeups on distributed locking. The explicit-checkpoint philosophy is shared with Faust, Temporal (despite their different model), and AWS Step Functions. When production systems converge on a design choice, it's worth noticing.


