
How to Save Time and Money on Repeated LLM Calls with Ephemeral Caching
The Problem
A large prompt can rapidly incur costs because the model charges per input and output token. Prompt development, or prompt engineering, is an iterative process of designing, refining, and optimizing a prompt to guide a Large Language Model toward desirable outputs. It's during this phase especially that high costs can creep up on you, particularly if you also have a large context or knowledge base.
In addition to high costs, large prompts can also result in increased latency. Large prompts cause high latency primarily because they increase the Time to First Token (TTFT): the model must process the entire input sequence through every layer of its neural network before generating a response.
Large prompts are common in RAG systems, document Q&A, and coding assistants. And in a multi-turn conversation, the prompt has to be processed along with the ever-growing conversation history.
The Solution
Enter Prompt Caching! It’s an amazing technique to have in your toolbox. It stores and reuses the initial, static parts of LLM prompts (the prefix), significantly reducing latency and lowering costs.
By avoiding re-processing consistent instructions, long context, or examples, it boosts efficiency, especially in chatbots and RAG systems.
Prompt caching also reduces Time to First Token by storing and reusing the computational state of frequently used prompts (like system instructions or long context). By bypassing the prefill phase for cached segments, you can expect up to 85% lower latency and 90% lower costs for cached inputs!
👉So, let’s do a quick review of the benefits!
- Cost savings — Cached tokens are cheaper to process than fresh ones. Cache reads can be billed at a fraction of the normal input token cost.
- Faster responses — When a large prompt (like a long system prompt or document) is cached, the model doesn’t need to reprocess it on every request, reducing latency.
- Efficiency for repeated context — If you’re sending the same large block of text repeatedly (e.g., a long document, codebase, or system prompt), caching avoids redundant processing on every call.
🚧The tradeoff is that cached content must meet a minimum token threshold to be eligible, and there’s a small cache write cost on the first call. But for anything involving large, repeated context, the savings add up quickly!
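To make the tradeoff concrete, here is a minimal back-of-the-envelope calculator. This is a sketch with assumed numbers: a placeholder base input price, and Anthropic's published multipliers of roughly 1.25x base for a cache write and 0.1x for a cache read.

```python
# Rough cost model for prompt caching (illustrative numbers only).
BASE_PRICE_PER_MTOK = 3.00   # assumed base input price, $/million tokens
CACHE_WRITE_MULT = 1.25      # cache writes cost ~25% more than base
CACHE_READ_MULT = 0.10       # cache reads cost ~10% of base

def input_cost(prefix_tokens: int, calls: int, cached: bool) -> float:
    """Input-side cost of sending the same prefix on every call."""
    per_tok = BASE_PRICE_PER_MTOK / 1_000_000
    if not cached:
        return prefix_tokens * calls * per_tok
    write = prefix_tokens * CACHE_WRITE_MULT * per_tok        # first call writes
    reads = prefix_tokens * (calls - 1) * CACHE_READ_MULT * per_tok
    return write + reads

uncached = input_cost(3200, 10, cached=False)
cached = input_cost(3200, 10, cached=True)
print(f"uncached ${uncached:.4f} vs cached ${cached:.4f}")
```

With a ~3,200-token prefix sent 10 times, the cached path pays the small write premium once and then gets 90%-off reads, so the savings compound with every additional call.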
Full Code can be found on my Github!
Cache Experiment
I will run the experiment first without caching and then with caching, and observe the input tokens and the time each run takes.
👉Anthropic requires a minimum of 1024 tokens to be eligible for caching
👉If more than 5 minutes pass between requests, the cache expires
👉First request always writes, never reads
🚨If you would like to recreate this experiment for yourself, make sure that the prompt is longer than the documents.
Here are the specs for our experiment:
- Model name and version: claude-sonnet-4-6
- Data Set: TED Talks — transcripts of different TED Talks, subset to the first 10 rows
- Prompt: ~3197 tokens
👉Our LLM will take in the TED Talk transcript and the prompt, and extract the topics.
NOTE: For the purposes of this, we don’t care about the output, only if the input tokens are being cached!
👉We will use ephemeral caching
The project structure looks like this:
topic_modeling/
├── main.py ← orchestration only (API client + pipeline loop)
├── config.py ← load_config() + PRICE constants
├── config.yaml ← runtime settings
├── logger.py ← setup_logging()
├── models.py ← UsageStats, SessionStats, DocResult
├── printer.py ← print_result() terminal formatter
├── prompt.py ← get_prompt() (your system prompt)
├── topics.py ← get_topics() to import list of topics
├── documents.py ← load_documents() CSV loader
├── .env
Ephemeral Caching
Ephemeral means you're marking part of your prompt to be cached temporarily by the model. Here's the full breakdown. As of this writing, ephemeral is the only supported cache type. It tells the API to store the prompt prefix in a cache so that subsequent requests sharing the same prefix don't have to reprocess it each time. The cache has a 5-minute Time To Live (TTL) by default; after that it expires and the prefix must be recomputed.
Creating a cached prompt is as easy as passing in a cache_control dictionary:
"cache_control": {
"type": "ephemeral" # ✅ caching the prompt
}And setting the default headers:
self.client = anthropic.Anthropic(api_key=api_key,
default_headers={"anthropic-beta": "prompt-caching-2024-07-31"})
👉When you make an API request, headers are metadata sent alongside the request that tell the server how to handle it. The anthropic-beta header specifically is how Anthropic lets you opt into features that are still in beta — not yet on by default for everyone.
👉Without it, the API ignores your cache_control blocks entirely: it processes every request fresh, as if caching doesn't exist.
👉It's worth noting that on newer Claude models like claude-sonnet-4-6, Anthropic has been rolling out automatic caching that doesn't require this header, but explicit caching with cache_control still needs it to be reliable. Keeping the header in is the safest approach regardless of model version.
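Putting the pieces together, a single cached request might look like the sketch below. The helper name build_request and the placeholder strings are mine; the message structure follows Anthropic's Messages API, with cache_control set on the system block as described above.

```python
import os

def build_request(system_prompt: str, transcript: str) -> dict:
    """Assemble Messages API kwargs with a cache breakpoint on the system prompt."""
    return {
        "model": "claude-sonnet-4-6",  # model name taken from the experiment specs
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # cache this prefix
            }
        ],
        "messages": [{"role": "user", "content": transcript}],
    }

kwargs = build_request("...your long topic-detection prompt...",
                       "...one TED Talk transcript...")

# Only hit the API if a key is configured.
if os.environ.get("ANTHROPIC_API_KEY"):
    import anthropic
    client = anthropic.Anthropic(
        default_headers={"anthropic-beta": "prompt-caching-2024-07-31"}
    )
    response = client.messages.create(**kwargs)
```

Because the system block sits before the conversation turns, every call that reuses the same system prompt shares the same cacheable prefix, while the transcript varies per request.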
About the dataset
Each row corresponds to a Talk on TED.com and each column details Metadata (generic/speaker/talk related information) plus Transcript.
The Prompt
The prompt is extra long for maximum impact: it embeds a list of 1,000 topics (interpolated below as topics_str).
f"""You are an expert topic detection and text analysis system with deep knowledge across
multiple domains including science, technology, politics, economics, culture, sports,
entertainment, health, environment, law, education, philosophy, religion, and more.
For each transcript number your primary responsibility is to analyze text and identify all major and minor topics,
themes, entities, and concepts present within it with high precision and recall.
CONFIDENCE AND UNCERTAINTY: When the text contains hedged language (e.g., "may",
"could", "allegedly", "reportedly"), reflect this uncertainty in your relevance
scores by skewing toward the lower end of each band. Overconfident scoring on
ambiguous content reduces the utility of the analysis.
AUDIENCE AND REGISTER: Consider who the intended audience appears to be — general
public, domain experts, policymakers, investors — and how that shapes the framing
of topics. Technical jargon density, assumed background knowledge, and rhetorical
style are all signals.
ENTITY DISAMBIGUATION: When the same name could refer to multiple entities (e.g.,
"Apple" as a company vs. a fruit, "Jordan" as a person vs. a country), use
surrounding context to disambiguate and assign the correct entity type. Always
prefer the interpretation most supported by the full text.
## Sentiment and Tone Guidelines
For each major topic, assess the sentiment expressed:
POSITIVE: The text expresses favorable, optimistic, or supportive views
NEGATIVE: The text expresses unfavorable, critical, or pessimistic views
NEUTRAL: The text presents information without strong sentiment
MIXED: The text presents both positive and negative views on the topic
## Relevance Scoring Guidelines
Score each topic by its relevance to the overall text:
0.9 - 1.0: Central topic, text is primarily about this
0.7 - 0.8: Major topic, discussed extensively
0.5 - 0.6: Moderate topic, discussed in some detail
0.3 - 0.4: Minor topic, briefly mentioned
0.1 - 0.2: Peripheral topic, only tangentially referenced
## Entity Type Guidelines
Classify named entities using the following topics:
{topics_str}
## Instructions
You will be given a TEXT to analyze. Carefully read the entire text before making
any assessments. Identify all topics from the broadest themes down to the most
specific details. Pay special attention to:
1. The primary subject matter of the text
2. All named entities mentioned
3. Any domain-specific terminology or jargon
4. Implicit topics that are not explicitly stated but clearly implied
5. The overall sentiment and tone of the text
6. Relationships between different topics
7. Any controversial or sensitive subjects
Respond ONLY with a JSON object in this exact format, nothing else,
no markdown, no preamble, no explanation outside the JSON:
{{
"primary_topic": {{
"name": "the main topic of the text",
"category": "category from the list above",
"relevance_score": 0.0 to 1.0,
"sentiment": "positive" or "negative" or "neutral" or "mixed",
"description": "two to three sentence description of how this topic appears in the text"
}},
"major_topics": [
{{
"name": "topic name",
"category": "category from the list above",
"relevance_score": 0.7 to 0.8,
"sentiment": "positive" or "negative" or "neutral" or "mixed",
"description": "one to two sentence description"
}}
],
"minor_topics": [
{{
"name": "topic name",
"category": "category from the list above",
"relevance_score": 0.1 to 0.6,
"sentiment": "positive" or "negative" or "neutral" or "mixed"
}}
],
"named_entities": [
{{
"name": "entity name as it appears in text",
"type": "PERSON" or "ORGANIZATION" or "LOCATION" or "PRODUCT" or "EVENT" or "CONCEPT" or "DATE" or "QUANTITY",
"relevance_score": 0.0 to 1.0,
"description": "brief description of this entity's role in the text"
}}
],
"keywords": ["list", "of", "important", "keywords", "from", "the", "text"],
"overall_sentiment": "positive" or "negative" or "neutral" or "mixed",
"domain": "primary domain or field of the text",
"summary": "two to three sentence summary of the text and its main topics"
}}
"""
Load the Dataset
import pandas as pd

df_ted = pd.read_csv('TED_Talk.csv').head(10)[['talk__id','transcript']]
talk__id transcript
0 66 Good morning. How are you?(Audience) Good.It's...
1 2405 A few years ago, I got one of those spam email...
2 1569 So I want to start by offering you a free no-t...
3 848 How do you explain when things don't go as we ...
4 1042 So, I'll start with this: a couple years ago, ...
5 2034 The human voice: It's the instrument we all pl...
6 2458 So in college, I was a government major, which...
7 2225 When I was a kid, the disaster we worried abou...
8 13587 Hello everyone. I'm Sam, and I just turned 17....
9 1647 Hi. My name is Cameron Russell, and for the la...
Set Cache Control in invoke_model
.....
# Static system prompt — marked for caching once, reused every call
self._system = [
{
"type": "text",
"text": get_prompt(),
"cache_control": {"type": "ephemeral"}, # ← cache breakpoint
}
]
.....
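After each call you can verify that caching actually happened by inspecting the usage block on the response. The Messages API reports cache_creation_input_tokens and cache_read_input_tokens alongside the regular input_tokens; a tiny helper (the function name is my own) makes the outcome explicit:

```python
from types import SimpleNamespace

def cache_status(usage) -> str:
    """Classify a response's usage block: did this call write, read, or miss the cache?"""
    wrote = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read:
        return f"cache hit: {read} tokens read from cache"
    if wrote:
        return f"cache write: {wrote} tokens stored (first call, or cache expired)"
    return "no caching: full prompt processed fresh"

# Works with any object exposing those attributes, e.g. response.usage.
# Dummy values below mirror the experiment's ~3,190-token prefix:
first = SimpleNamespace(cache_creation_input_tokens=3190, cache_read_input_tokens=0)
later = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=3190)
print(cache_status(first))
print(cache_status(later))
```

This matches the rules listed earlier: the first request always writes, and only subsequent requests within the TTL read from the cache.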
Verdict
With Caching
══════════════════════════════════════════════════════════════
SESSION SUMMARY
══════════════════════════════════════════════════════════════
Documents processed : 3
Skipped (this run) : 0
Failed : 0
──────────────────────────────────────────────────────────────
Cache write tokens : 3,190
Cache read tokens : 6,380 ← 90% off
Regular input tokens : 94
Output tokens : 1,141
──────────────────────────────────────────────────────────────
Total time : 19.74s
Actual cost : $ 0.03127
Without caching : $ 0.04611
Saved by caching : $ 0.01483 (32.2%)
══════════════════════════════════════════════════════════════
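The "With Caching" numbers above check out arithmetically. Assuming list prices of $3 per million input tokens and $15 per million output tokens, with cache writes billed at 1.25x and cache reads at 0.1x the input price (my assumed rates), the session summary reproduces:

```python
IN, OUT = 3.00 / 1e6, 15.00 / 1e6   # assumed $/token for input and output

cached_run = (
    3_190 * IN * 1.25    # cache write tokens (first call, 25% premium)
    + 6_380 * IN * 0.10  # cache read tokens (two follow-up calls, 90% off)
    + 94 * IN            # regular input tokens
    + 1_141 * OUT        # output tokens
)
uncached_run = (3_190 + 6_380 + 94) * IN + 1_141 * OUT

print(f"actual:   ${cached_run:.5f}")    # ≈ $0.03127, matching the summary
print(f"baseline: ${uncached_run:.5f}")  # ≈ $0.04611
```

Note that most of the remaining cost is output tokens, which caching doesn't touch; the 32% overall saving comes entirely from the input side.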
Without Caching
══════════════════════════════════════════════════════════════
SESSION SUMMARY
══════════════════════════════════════════════════════════════
Documents processed : 3
Skipped (this run) : 0
Failed : 0
──────────────────────────────────────────────────────────────
Cache write tokens : 0
Cache read tokens : 0
Regular input tokens : 9,664
Output tokens : 1,277
──────────────────────────────────────────────────────────────
Total time : 25.39s
Actual cost : $ 0.04815
Without caching : $ 0.04815
Saved by caching : $ 0.00000 (0.0%)
══════════════════════════════════════════════════════════════
Wow!! We have clear evidence that our prompt is being cached, and when using the cache, the overall response is faster and cheaper!
Now, this is a very straightforward and simple example, but I believe it effectively demonstrates the usefulness of caching!
Let me know what you think and if you have any questions!
Use Prompt Caching to Reduce Input Tokens with Claude was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.