Cut Claude usage by ~85% in a job search pipeline (16k → 900 tokens/app) — here’s what worked

Like many here, I kept running into Claude usage limits when building anything non-trivial.

I was working with a job search automation pipeline (based on the Career-Ops project), and the naive flow was burning ~16k tokens per application — completely unsustainable.

So I spent some time reworking it with a focus on token efficiency as a first-class concern, not an afterthought.

🚀 Results

  • ~85% reduction in token usage
  • ~900 tokens per application
  • Most repeated context calls eliminated
  • Much more stable under usage limits

⚡ What actually helped (practical takeaways)

1. Prompt caching (biggest win)

  • Cached system + profile context (cache_control: ephemeral)
  • Break-even after 2 calls, strong gains after that
  • ~40% reduction on repeated operations

👉 If you're re-sending the same context every time, you're wasting tokens.
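The caching setup is roughly this shape — a minimal sketch of building a Messages API request where the static system + profile blocks are marked cacheable (`PROFILE_CONTEXT` and the model ID are illustrative placeholders, not from the repo):

```python
# Sketch: mark the large, unchanging context as cacheable so only
# the per-job content is billed at the full input rate after call #1.
PROFILE_CONTEXT = "Candidate profile: skills, history, preferences ..."  # placeholder

def build_request(job_description: str) -> dict:
    """Build kwargs for client.messages.create() with the static
    system + profile context flagged for prompt caching."""
    return {
        "model": "claude-3-5-haiku-latest",  # illustrative model ID
        "max_tokens": 512,
        "system": [
            {"type": "text", "text": "You are a job-application assistant."},
            {
                "type": "text",
                "text": PROFILE_CONTEXT,
                # Written to the cache on the first call; subsequent
                # calls within the TTL read it at a reduced token cost.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [{"role": "user", "content": job_description}],
    }
```

Only the final `messages` entry changes per application; everything above it stays byte-identical so the cache prefix keeps matching.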

2. Model routing instead of defaulting to Sonnet/Opus

  • Lightweight tasks → Haiku
  • Medium reasoning → Sonnet
  • Heavy tasks only → Opus

👉 Most steps don’t need expensive models.
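A router can be as simple as a lookup keyed on task type — this is a sketch with made-up task names and illustrative model IDs (check the API docs for current ones):

```python
# Sketch: route each pipeline step to the cheapest adequate model.
MODEL_TIERS = {
    "light": "claude-3-5-haiku-latest",  # extraction, classification
    "medium": "claude-sonnet-4-0",       # fit evaluation, summaries
    "heavy": "claude-opus-4-0",          # long-form tailored writing
}

# Task names here are hypothetical examples, not the repo's actual steps.
LIGHT_TASKS = {"extract_fields", "classify_listing", "fill_form"}
HEAVY_TASKS = {"tailor_cover_letter"}

def route_model(task: str) -> str:
    """Pick the cheapest model tier that can handle the task;
    default to the medium tier for anything unrecognized."""
    if task in LIGHT_TASKS:
        return MODEL_TIERS["light"]
    if task in HEAVY_TASKS:
        return MODEL_TIERS["heavy"]
    return MODEL_TIERS["medium"]
```

Defaulting unknown tasks to the middle tier keeps a new step from silently landing on the most expensive model.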

3. Precompute anything reusable

  • Built an answer bank (25 standard responses) in one call
  • Reused across applications

👉 Eliminated ~94% of LLM calls during form filling.
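The answer-bank pattern looks something like this — one batched call up front, then plain dict lookups during form filling, with the LLM only as a fallback (function names and the prompt are my own sketch, not the repo's code):

```python
import json

def build_answer_bank(questions, llm_call):
    """One batched LLM call answers every standard question at once;
    llm_call(prompt) -> str is any function that hits the API."""
    prompt = (
        "Answer each job-application question concisely. "
        "Reply as a JSON object mapping question to answer:\n"
        + json.dumps(questions)
    )
    return json.loads(llm_call(prompt))

def fill_field(question, bank, llm_call):
    """Hit the precomputed bank first; only questions outside the
    standard set cost an API call."""
    if question in bank:
        return bank[question]
    return llm_call(f"Answer this application question: {question}")
```

With ~25 standard questions covered, almost every form field becomes a zero-token lookup.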

4. Avoid duplicate work

  • TF-IDF semantic dedup (threshold 0.82)
  • Filters duplicate job listings before evaluation

👉 Prevents burning tokens on the same content repeatedly.
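The dedup step can be done without any LLM at all — here's a self-contained sketch using TF-IDF vectors and cosine similarity with the 0.82 threshold (pure stdlib for illustration; a library like scikit-learn would do the same job):

```python
import math
import re
from collections import Counter

def _tfidf_vectors(docs):
    """Smoothed TF-IDF vector (as a dict) for each document."""
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency per term
    n = len(docs)
    return [
        {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]

def _cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedup_listings(listings, threshold=0.82):
    """Drop any listing whose similarity to an already-kept
    listing meets the threshold; only survivors reach the LLM."""
    vecs = _tfidf_vectors(listings)
    kept, kept_vecs = [], []
    for text, vec in zip(listings, vecs):
        if all(_cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(text)
            kept_vecs.append(vec)
    return kept
```

Near-identical reposts of the same role score well above 0.82 and get filtered before any evaluation tokens are spent.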

5. Reduce “over-intelligence”

  • Added a lightweight classifier step before heavy reasoning
  • Only escalate to deeper models when needed

👉 Not everything needs full LLM reasoning.
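The gate before escalation can be a trivial heuristic — a sketch with placeholder keywords and thresholds, just to show the triage shape:

```python
def triage(listing: str) -> str:
    """Cheap pre-LLM gate: decide 'skip', 'light', or 'deep'
    before spending any tokens. Keywords are illustrative."""
    text = listing.lower()
    # Hypothetical relevance keywords for this candidate's profile.
    if not any(k in text for k in ("python", "backend", "ml")):
        return "skip"   # no LLM call at all
    if len(text.split()) < 50:
        return "light"  # short listing: a Haiku-level pass is enough
    return "deep"       # long, relevant listing: full evaluation
```

Even a crude gate like this means the expensive models only ever see listings that survived two free checks.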

🧠 Key insight

Most Claude workflows hit limits not because they’re complex —
but because they recompute everything every time.

🧩 Curious about others’ setups

  • How are you handling repeated context?
  • Anyone using caching aggressively in multi-step pipelines?
  • Any good patterns for balancing Haiku vs Sonnet vs Opus?

https://github.com/maddykws/jubilant-waddle

Inspired by Santiago Fernández’s Career-Ops — this is a fork focused on efficiency + scaling under usage limits.

submitted by /u/distanceidiot