Background Jobs in FastAPI on EKS: APScheduler vs Celery vs EventBridge — What to Pick and Why

Your FastAPI service is a coffee shop counter. Customers walk in, order a flat white, and expect it in 90 seconds. That’s the deal.

Now imagine the barista also has to roast the beans, descale the machine, and repaint the walls — mid-order. That’s what happens when you stuff long-running work inside an API request. It starts innocently (“just send a welcome email”), and three weeks later your requests are timing out, your pods are sweating, and your users are rage-refreshing like it personally offended them.

The fix is conceptually simple: FastAPI responds fast. Background jobs do the heavy lifting elsewhere. But “elsewhere” means different things depending on what kind of job you’re running — and on EKS, getting this wrong doesn’t just slow you down, it runs the same job five times across five pods and lets you discover it in production.

Let’s fix that.

Two Kinds of Jobs, Very Different Needs

Before you pick a tool, figure out what you’re actually building. Most background jobs fall into one of two buckets, and confusing them is the root cause of a lot of bad architectural decisions.

The first bucket is user-triggered work — the user clicks “Generate Report” or “Export CSV” and you need to do something heavy without making them stare at a spinner for 40 seconds. Your API should return a 202 Accepted immediately and hand the real work off somewhere else.
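The hand-off itself is small. Here's a framework-agnostic sketch of that accept-and-dispatch move — the queue is a stdlib stand-in for your real broker (Redis, SQS), and the endpoint name and payload fields are made up for illustration; in FastAPI this would be the body of a POST handler returning status code 202:

```python
import queue
import uuid

job_queue = queue.Queue()  # stand-in for the real broker (Redis, SQS, ...)

def submit_report(user_id):
    """The body of a hypothetical POST /reports endpoint: enqueue the
    heavy work, hand back a job id, return 202 Accepted immediately."""
    job_id = str(uuid.uuid4())
    job_queue.put({"job_id": job_id, "user_id": user_id, "task": "generate_report"})
    # The caller gets a handle to poll later — not the finished report.
    return {"status": 202, "job_id": job_id}

response = submit_report("user-42")
print(response["status"])  # 202 — accepted, not done
```

The user's browser gets an answer in milliseconds; the 40 seconds of real work happen wherever your workers live.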

The second bucket is scheduled work — nightly cleanup, hourly sync, daily digest emails, the stuff that needs to happen on a clock regardless of whether any user is around. Nobody clicked anything. Time did.

This distinction matters because the right tool for bucket one is wrong for bucket two, and vice versa. Keep it in mind as we walk through the options.

Option A: APScheduler — The Alarm Clock Inside Your App

APScheduler is the fastest path from zero to a scheduled task. You import a library, register a function, tell it “every 10 minutes,” and it just… runs. It lives inside your FastAPI app. No new infrastructure. No queue. No separate deployment. This is genuinely appealing.

Here’s the catch, and it’s a good one: you are on EKS, which means you have multiple pods.

Each pod runs its own copy of the app. Each copy runs its own scheduler. If you have five replicas and you schedule a cleanup job for 2AM, congratulations — you’ve scheduled five cleanup jobs for 2AM. Your database will have opinions about this.

(EKS Cluster)
┌─────────────────────┐
│  FastAPI Pod #1     │  runs job at 2AM
├─────────────────────┤
│  FastAPI Pod #2     │  also runs job at 2AM
├─────────────────────┤
│  FastAPI Pod #3     │  also also runs job at 2AM
└──────────┬──────────┘
           │  ×3
           ▼
   DB / External APIs
     (very confused)


There are two ways to make APScheduler safe on EKS. The first is to give it its own home: a dedicated scheduler deployment with replicas: 1 that runs nothing but the scheduler, while your API pods stay clean and stateless. The second is a distributed lock — every pod tries to acquire a Redis or database lock at job time, only one wins, the rest skip gracefully. Both work. The dedicated pod approach is simpler to reason about.
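The distributed-lock approach fits in one sketch. In production the lock is a single Redis call — with redis-py, redis.set("lock:cleanup:2am", pod_name, nx=True, ex=300) — but here's an in-memory stand-in that shows the contract: five pods race, exactly one wins, the rest shrug and move on:

```python
import threading
import time

class FakeRedis:
    """In-memory stand-in for the one Redis primitive the lock needs:
    SET key value NX EX ttl — set only if the key doesn't already exist."""

    def __init__(self):
        self._store = {}
        self._mutex = threading.Lock()

    def set_nx_ex(self, key, value, ttl_seconds):
        with self._mutex:
            entry = self._store.get(key)
            if entry is not None and entry[1] > time.monotonic():
                return False     # someone else holds the lock
            self._store[key] = (value, time.monotonic() + ttl_seconds)
            return True

def run_job_if_lock_won(redis, pod_name, results):
    # Every pod calls this at 2AM; only the lock winner runs the job.
    if redis.set_nx_ex("lock:cleanup:2am", pod_name, ttl_seconds=300):
        results.append(pod_name)  # this pod does the cleanup
    # losers skip gracefully — no error, no duplicate work

redis = FakeRedis()
winners = []
threads = [threading.Thread(target=run_job_if_lock_won, args=(redis, f"pod-{i}", winners))
           for i in range(5)]
for t in threads: t.start()
for t in threads: t.join()
print(winners)  # exactly one pod name
```

The TTL matters: if the winning pod dies mid-job, the lock expires and the job can run on the next tick instead of being held hostage forever.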

APScheduler is the right call when your scheduled tasks are lightweight, your team wants something running today, or you’re laying groundwork before committing to a heavier system. Just don’t let it share a pod with your API in production without either of those guardrails in place. The convenience is real; the footgun is also real.

Option B: Celery + Redis — The To-Do List with a Team

Celery flips the mental model. Instead of running work inside FastAPI, your API becomes a dispatcher: it drops a task onto a Redis queue and returns immediately. A separate fleet of worker pods picks up tasks and does the actual work. The API never waits. The workers never deal with HTTP traffic. Everyone stays in their lane.

Client → FastAPI Pods → Redis Queue → Worker Pods → DB / Services
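The split in that flow is the whole idea, so here's a minimal stdlib stand-in for it — one producer, one consumer, a queue in between. With real Celery, the worker function would be decorated with @app.task and run by a celery worker process, and the dispatch line would be generate_report.delay(user_id); everything else here is illustrative scaffolding:

```python
import queue
import threading

task_queue = queue.Queue()   # stands in for the Redis broker
done = []

def worker():
    """Worker pod: blocks on the queue, does the heavy work, never
    touches HTTP. (In Celery: a @app.task run by `celery worker`.)"""
    while True:
        task = task_queue.get()
        if task is None:
            break                # shutdown signal
        done.append(f"report-for-{task['user_id']}")
        task_queue.task_done()

def dispatch(user_id):
    """API pod: enqueue and return immediately.
    (In Celery: generate_report.delay(user_id) inside the endpoint.)"""
    task_queue.put({"task": "generate_report", "user_id": user_id})

w = threading.Thread(target=worker)
w.start()
dispatch("alice")
dispatch("bob")
task_queue.join()    # demo only — the real API never waits for workers
task_queue.put(None)
w.join()
print(done)  # ['report-for-alice', 'report-for-bob']
```

Note what the API function knows about the work: nothing. That ignorance is the feature — it's why the API stays fast no matter how slow the reports get.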

This architecture fits Kubernetes beautifully because Kubernetes is very good at running worker pools. Need more throughput? Scale the worker deployment. Hitting memory issues? Give workers a different resource profile than your API pods. Tasks failing intermittently? Celery has native retry support with exponential backoff, so transient failures stop being incidents.

The honest trade-off is complexity. You’re now operating Redis, a worker deployment, and a task queue — all of which need monitoring, alerting, and someone who understands what’s happening when a task silently disappears. Debugging async flows is genuinely harder than debugging request/response, because the failure might happen five minutes after the user clicked the button, in a pod they never interacted with.

There’s also the at-least-once delivery reality of any queue system. A task can be delivered more than once — due to worker crashes, pod restarts, or network blips — so your task logic needs to be idempotent. Running it twice should produce the same result as running it once. This is good engineering practice anyway, but Celery makes it mandatory.
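One common shape for idempotency: derive a stable key from the task's identity and inputs (not a random ID per delivery), and record completion, so a redelivered task becomes a no-op. The names and the in-memory set here are illustrative — in production the "seen" store is a DB table with a unique constraint, or Redis:

```python
import hashlib

processed = set()   # in production: a unique-keyed DB table or Redis set

def idempotency_key(task_name, payload):
    # Stable key from the task's identity + inputs, so a redelivery of
    # the SAME task produces the SAME key.
    raw = f"{task_name}:{sorted(payload.items())}"
    return hashlib.sha256(raw.encode()).hexdigest()

def charge_customer(payload):
    """Safe to run even if the queue delivers it twice."""
    key = idempotency_key("charge_customer", payload)
    if key in processed:
        return "skipped"         # second delivery: no double charge
    processed.add(key)
    # ... do the actual charge here ...
    return "charged"

payload = {"customer_id": "c-7", "amount_cents": 1999}
print(charge_customer(payload))  # charged
print(charge_customer(payload))  # skipped — same task, delivered twice
```

One caveat the sketch glosses over: check-then-add isn't atomic. In a real system you lean on something that is — a database unique constraint or a Redis SETNX — so two workers racing on the same key can't both "win".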

For anything user-triggered and expensive — PDF generation, report exports, email blasts, file processing, third-party API calls that take seconds — Celery with Redis is the right choice on EKS. It scales cleanly, retries gracefully, and keeps your API feeling fast even when the underlying work is slow.

Option C: EventBridge + SQS + EKS Workers — The Cron That Lives Outside Your Cluster

EventBridge is AWS’s managed scheduler. It doesn’t know or care how many pods you’re running, because it lives entirely outside your cluster. You define a schedule (cron or rate expression), EventBridge fires a message to an SQS queue at the appointed time, and your worker pods consume from that queue and do the work.

┌─────────────────────────┐
│  EventBridge Schedule   │  (fires once, on schedule)
└────────────┬────────────┘
             ▼
         ┌───────┐
         │  SQS  │  (buffers, retries, no duplicates from scheduling)
         └───┬───┘
             ▼
  ┌─────────────────────┐
  │   EKS Worker Pods   │
  └──────────┬──────────┘
             ▼
      DB / External APIs

The key property here is that EventBridge fires once. The scheduling duplication problem that haunts APScheduler on multi-pod deployments simply doesn’t exist. SQS adds durability and natural backpressure — if your workers are slow, messages queue up instead of hammering your downstream services.
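The worker side of this pipeline is just a polling loop. Here's the shape of one cycle with the SQS client injected as plain functions so the sketch is runnable as-is — in real code, receive and delete are boto3's sqs.receive_message and sqs.delete_message (which take a QueueUrl and a ReceiptHandle):

```python
def drain_queue(receive, delete, handle):
    """One poll cycle of an SQS worker: receive, process, then delete.
    Deleting only AFTER a successful handle() is what makes retries work:
    if the pod dies mid-task, the message becomes visible again and
    another worker picks it up."""
    results = []
    for message in receive():
        results.append(handle(message))
        delete(message)   # ack: safe to remove only once handled
    return results

# In-memory stand-in so the loop runs here without AWS.
inbox = [{"body": "nightly-cleanup"}, {"body": "refresh-cache"}]
handled = drain_queue(
    receive=lambda: list(inbox),
    delete=lambda msg: inbox.remove(msg),
    handle=lambda msg: f"done:{msg['body']}",
)
print(handled)   # ['done:nightly-cleanup', 'done:refresh-cache']
print(inbox)     # [] — everything acknowledged
```

The delete-after-handle ordering is the whole contract with SQS: the visibility timeout hides an in-flight message from other workers, and deleting it is your acknowledgment that the work actually finished.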

The trade-offs are straightforward. This is AWS-specific infrastructure, so if portability matters, you’re accumulating platform debt. Your worker code still needs to be written and operated. And SQS is also “at-least-once,” so idempotency still applies — EventBridge solves the multi-pod scheduling problem, not the delivery guarantee problem.

For cron-style jobs — nightly data sync, daily cleanup, batch refresh, scheduled pipeline runs — this pattern is hard to beat when you’re already on AWS. The schedules are durable, visible in the AWS console, and completely decoupled from your deployment lifecycle. Rolling out a new version of your API at 2AM doesn’t interfere with your cleanup job, because your cleanup job has no idea your API exists.

So What Should You Actually Use?

The honest answer for most FastAPI services on EKS is a hybrid: Celery + Redis for user-triggered work, EventBridge → SQS → workers for scheduled work. They solve different problems, they don’t overlap, and they each do their job well.

APScheduler earns its place in specific situations — lightweight scheduled tasks, internal tooling, teams who need something running by end of day and will revisit architecture next quarter. Use it with a dedicated scheduler pod, not embedded in your API replicas, and you’ll be fine.

The temptation is to crown one tool King of Everything and make it do all the jobs. Don’t. Yes, Celery has a beat scheduler that can run cron jobs, and yes, it technically works — the same way you can use a chef’s knife to open a wine bottle. Technically a success. Spiritually, a disaster. The moment your Redis cluster hiccups, your scheduled jobs vanish with it, and now you’re debugging two problems at 2 AM instead of one. EventBridge exists precisely so your cron jobs don’t care whether Redis is having a bad Tuesday. Use the tool that was built for the job.

The Four Rules That Apply Regardless

Whatever you choose, these four properties will determine whether you’re paged at 3AM.

Idempotency. Any job that can run twice should produce the same result as a job that ran once. This isn’t optional — it’s the cost of distributed systems.

Retries with backoff. Networks fail. External APIs have bad days. A job that gives up on the first error is a job that creates incidents. Add retries, add exponential backoff, add a dead-letter queue for tasks that still fail after N attempts.
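The backoff schedule itself is one line of arithmetic: delay = base × 2^attempt, capped so retries don't wander off to the heat death of the universe, with a max-attempt cutoff that routes the task to the dead-letter queue. The constants here are illustrative:

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0, jitter=False):
    """Exponential backoff: 1s, 2s, 4s, 8s... capped at `cap`.
    Jitter (uniform in [0, delay]) avoids a thundering herd of
    synchronized retries; it's off here so the output is deterministic."""
    delay = min(cap, base * (2 ** attempt))
    return random.uniform(0, delay) if jitter else delay

MAX_ATTEMPTS = 5   # after this, the task goes to the dead-letter queue

schedule = [backoff_delay(a) for a in range(MAX_ATTEMPTS)]
print(schedule)    # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Celery can do this for you with per-task retry options, and SQS dead-letter queues handle the "after N attempts" part natively — the point is that the schedule, the cap, and the cutoff are decisions you should make on purpose, not defaults you discover during an incident.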

Resource isolation. Workers should not compete with API pods for CPU and memory. Give them separate deployments with separate resource limits. Your API’s latency should not degrade because a background job decided to import a large CSV.

Observability. Logs, metrics, and alerts for job failures and long-running tasks. A job that silently fails is worse than a job that visibly fails, because the silent one might be running your billing calculations.

Got questions or a war story about background jobs in production? Drop it in the comments.


Background Jobs in FastAPI on EKS: APScheduler vs Celery vs EventBridge — What to Pick and Why was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
