Your Claude Code bill went up. Or you’re burning through your Pro usage limits twice as fast as you were a month ago. Same workflows, same habits. You didn’t change anything.
Two real bugs. One architectural blind spot. And a five-minute timer that’s been quietly eating your quota every time you grab coffee.
What Reddit Found
A post titled “PSA: Claude Code has two cache bugs that can silently 10-20x your API costs” hit 914 upvotes last week. The author spent days reverse-engineering Claude Code’s 228MB binary with Ghidra, a MITM proxy, and radare2. They found two independent bugs.
Bug 1: A sentinel replacement in the standalone binary breaks the cache when your conversation mentions billing internals. The replacement targets the first occurrence of the sentinel in the JSON body, so if your chat history contains the sentinel string, it replaces the wrong one. Cache prefix broken. Full rebuild on every request.
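Here’s the failure mode in miniature. This is a reconstruction from the Reddit write-up, not Anthropic’s actual code; the sentinel value and the request shape are made up for illustration:

```typescript
// Reconstruction of the reported failure mode, not Anthropic's code.
// The sentinel value and request shape here are illustrative.
const SENTINEL = "<<BILLING_SENTINEL>>";

const body = JSON.stringify({
  messages: [
    // The user happened to quote the sentinel string in conversation...
    { role: "user", content: `Why does ${SENTINEL} show up in my logs?` },
  ],
  // ...but the placeholder the binary meant to fill lives down here.
  system: `internal billing marker: ${SENTINEL}`,
});

// replace() with a string pattern swaps only the FIRST occurrence,
// which is now the one inside the chat history:
const patched = body.replace(SENTINEL, "per-request-billing-value");

// If the replacement value varies per request, the serialized history
// changes on every call, so the cached prefix never matches again.
```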
Bug 2: Every --resume causes a full cache miss since v2.1.69. A new deferred_tools_delta attachment gets injected at different positions in fresh vs resumed sessions. The message prefix changes completely. One-time cache rebuild on every resume.
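You can see why this busts the cache by comparing prefixes side by side. Another sketch; the block shapes are assumed, and only the attachment name comes from the post:

```typescript
// Sketch of the --resume miss. Block shapes are assumed; only the
// attachment name comes from the reverse-engineering post.
const freshSession = [
  { kind: "system_prompt" },
  { kind: "attachment", name: "deferred_tools_delta" }, // injected early...
  { kind: "message", text: "first user message" },
];

const resumedSession = [
  { kind: "system_prompt" },
  { kind: "message", text: "first user message" },
  { kind: "attachment", name: "deferred_tools_delta" }, // ...or injected late
];

// Prompt caching matches on an exact serialized prefix, so the two
// sessions diverge at block 1 and the old cache is useless:
const divergeAt = freshSession.findIndex(
  (block, i) => JSON.stringify(block) !== JSON.stringify(resumedSession[i]),
);
console.log(`prefix diverges at block ${divergeAt}`); // -> 1
```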
Another post hit 2,600 upvotes: “Thanks to the leaked source code, I patched the root cause of the insane token drain.” People aren’t just complaining. They’re reverse-engineering the binary and shipping their own patches.
And then there’s this post at 384 upvotes: “I got less than an hour in two 5-hour limits debugging a script. My weekly limit went from 48% to 84% in just those two sessions. I’m afraid to ask Claude a question.”
Afraid to ask a question. About a tool you’re paying for. That’s where we are.
The Bigger Problem Nobody’s Fixing
Those bugs are real and Anthropic should patch them. But there’s a cost multiplier that affects everyone, bug or no bug. It’s baked into how prompt caching works.
Anthropic’s prompt cache has a ~5-minute TTL. Every interaction resets the clock. Step away for six minutes and your entire prompt cache expires. Your next message rebuilds the whole thing from scratch.
Five minutes sounds reasonable until you think about what you actually do between messages. Read a long response. Think about it. Type a careful reply — especially if you’re reviewing a plan or architecture decision, which you should be doing carefully. That’s easily 5-10 minutes right there, and you never left your chair. Running two Claude Code agents? Now one is idle while you’re focused on the other. Double the exposure.
To be fair to Anthropic, the TTL exists for a reason. Each user’s 200k token context takes roughly 400MB of GPU memory in the KV cache. Multiply that by a million concurrent users and you’re looking at roughly 400 TB of GPU RAM: nearly 5,000 H100s just to hold everyone’s cache. At current prices, that’s over $140 million in hardware sitting idle while you’re on Slack. The 5-minute TTL isn’t spite. It’s economics.
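If you want to sanity-check that, the arithmetic fits in a few lines. Every figure here is a rough assumption from the paragraph above, not Anthropic’s real numbers:

```typescript
// Back-of-envelope only: per-user KV-cache size, user count, and H100
// price are all rough assumptions, not real infrastructure data.
const bytesPerUser = 400e6;                 // ~400 MB per 200k-token context
const concurrentUsers = 1_000_000;

const totalBytes = bytesPerUser * concurrentUsers;  // 4e14 B ≈ 400 TB
const h100Bytes = 80e9;                             // 80 GB per H100
const gpusNeeded = totalBytes / h100Bytes;          // ~5,000 H100s

const assumedH100Price = 28_000;                    // ~$25-30k street price
const idleHardware = gpusNeeded * assumedH100Price;
console.log(`~$${(idleHardware / 1e6).toFixed(0)}M in idle GPUs`); // ~$140M
```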
But understanding why the TTL exists doesn’t change what it costs you. And here’s the part most people miss: a cache miss doesn’t just cost you the regular input rate. It costs more.
From Anthropic’s prompt caching docs: “5-minute cache write tokens are 1.25 times the base input tokens price. Cache reads are 10% of base input token price.”
That’s a 12.5× spread. Cache reads at 10% of base. Cache writes at 125% of base. Every time the TTL expires and you send a new message, you’re not just “reloading” the cache — you’re paying a 25% premium on top of the regular input price to rebuild it.
For API users, that’s dollars. For Pro subscribers, it’s the hidden token burn chewing through your 5-hour usage window in 45 minutes. Three coffee breaks a day at 150k tokens of context? That’s 450k tokens of cache writes in pure rebuilds, a chunk of your daily quota gone before you’ve written any real code.
| Scenario | Context Size | Cache Read | Cache Write (miss) | Spread |
|---|---|---|---|---|
| Quick chat | 50k tokens | ~$0.015 | ~$0.19 | 12.5× |
| Deep session | 150k tokens | ~$0.045 | ~$0.56 | 12.5× |
| Marathon | 300k tokens | ~$0.09 | ~$1.13 | 12.5× |
The spread is always 12.5×. (The dollar figures assume Sonnet-class pricing: $3 per million base input tokens.) But 12.5× of $0.09 is very different from 12.5× of $0.015. Before the 1M context update, people rotated sessions more often and kept contexts small. Now conversations run longer, contexts balloon, and every cache miss costs more. That’s why your limits started evaporating after the update even though you didn’t change how you work.
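If you want to check the table or plug in your own context sizes, the whole thing is one function. The $3-per-million-token base rate is the Sonnet-class assumption from above; swap in your model’s rate:

```typescript
// Cache economics for a given context size. $3/MTok base input price
// (Sonnet-class) assumed; swap in your model's rate.
const BASE_PER_MTOK = 3.0;
const READ_MULTIPLIER = 0.10;   // cache reads: 10% of base
const WRITE_MULTIPLIER = 1.25;  // 5-min cache writes: 125% of base

function cacheCosts(contextTokens: number) {
  const mtok = contextTokens / 1e6;
  return {
    read: mtok * BASE_PER_MTOK * READ_MULTIPLIER,
    writeOnMiss: mtok * BASE_PER_MTOK * WRITE_MULTIPLIER,
    spread: WRITE_MULTIPLIER / READ_MULTIPLIER, // always 12.5x
  };
}

console.log(cacheCosts(150_000));
// -> { read: 0.045, writeOnMiss: 0.5625, spread: 12.5 }
```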
The “Leak” That’s Saving You Money
When Anthropic’s source code surfaced (April Fools or not — still unclear), people dug into the caching logic in claude.ts and realized the cache has a TTL. Step away too long, your entire conversation gets re-cached from scratch on the next message. Some started manually pinging their sessions to keep the cache warm. Others called this a “usage leak” — the system “wasting tokens” on empty messages.
They had it backwards.
A keepalive ping is a tiny message that resets the TTL. Costs fractions of a cent. Without it, your next real message triggers a full cache rebuild at 1.25× base rate. The pings aren’t the problem. The pings are the cheapest insurance you’re not buying.
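If you wanted to hand-roll the idea against the raw API, it’s a timer plus a near-empty message. A sketch using the official TypeScript SDK; the model alias, ping text, and interval are our choices, and note that every ping still pays cache-read on your whole prefix (cheap, not free):

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Hand-rolled keepalive sketch. The interval, ping text, and model
// alias are our choices; adapt to your setup.
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the env

async function keepalivePing(cachedSystemPrompt: string) {
  await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1, // the reply is irrelevant; we only want the cache touch
    system: [
      {
        type: "text",
        text: cachedSystemPrompt,
        cache_control: { type: "ephemeral" }, // refreshes the ~5-min TTL
      },
    ],
    messages: [{ role: "user", content: "ping" }],
  });
}

// Fire ~45 seconds before expiry, roughly what Soma does:
// setInterval(() => keepalivePing(myPrompt), (5 * 60 - 45) * 1000);
```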
What the Low-Bill Users Do
Two strategies. That’s it.
- They work in focused bursts and finish before the cache expires.
- They use something that keeps the cache warm while they’re away.
Most people do neither. Work for 20 minutes, switch to Slack for 10, come back, pay full price to rebuild 200k+ tokens of context. Every single time.
How Soma Handles This
Soma ships a keepalive system that solves this without running up an infinite tab.
Automatic pings. Soma watches the cache countdown. At ~45 seconds before expiry, it sends a tiny ping that resets the TTL.
5 lives. You get 5 pings per idle period. Each ping fires at ~45 seconds before the TTL expires, so each life covers about 4 minutes and 15 seconds of idle time. Five lives on top of the initial five-minute window: roughly 26 minutes of protection. Enough to cover a coffee break, a standup, or a code review in another repo.
We capped it at 5 on purpose. Remember the GPU math above: roughly 400 TB to hold everyone’s cache. If every agent ran unlimited keepalives, nobody’s cache would ever evict. Anthropic’s infrastructure costs would spike, and the rational response would be to patch out the caching mechanism entirely or shorten the TTL. Five pings is a fair-use line: enough to protect you, not enough to break the system that makes caching possible in the first place.
Smart reset. Send a real message, lives reset to 5. You only spend them when you’re idle.
Auto-exhale. Lives run out and you’ve burned more than 75k tokens? Soma saves your session state automatically. From there you can keep going, or run soma inhale — fresh session, ~5k tokens, and an agent that’s more focused on what you were actually working on than the bloated session you just left. The preload carries the goal, not the entire conversation.
The notification:
```
♥ Keepalive 3/5 (2 left, 45s was remaining)
```
Configuration:
```json
{
  "keepalive": {
    "maxPings": 5,
    "autoExhale": true,
    "autoExhaleMinTokens": 75000
  }
}
```
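Conceptually, the whole lives mechanism is a small state machine. A sketch of the bookkeeping, not Soma’s actual source:

```typescript
// Sketch of the keepalive bookkeeping (not Soma's actual source).
const TTL_MS = 5 * 60 * 1000;
const LEAD_MS = 45 * 1000;          // fire ~45s before expiry
const MAX_PINGS = 5;

let livesLeft = MAX_PINGS;
let timer: ReturnType<typeof setTimeout> | undefined;

function armTimer(sendPing: () => void, onExhausted: () => void) {
  clearTimeout(timer);
  if (livesLeft === 0) {
    onExhausted();                  // auto-exhale: save session state
    return;
  }
  timer = setTimeout(() => {
    livesLeft -= 1;
    sendPing();                     // resets the server-side TTL
    armTimer(sendPing, onExhausted);
  }, TTL_MS - LEAD_MS);
}

function onRealMessage(sendPing: () => void, onExhausted: () => void) {
  livesLeft = MAX_PINGS;            // smart reset: real traffic refunds lives
  armTimer(sendPing, onExhausted);
}
```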
Your System Prompt Is the Other Problem
Even if Anthropic fixes both bugs tomorrow, your system prompt size still determines how much a cache miss costs. Every token in that prompt gets re-cached at 1.25× when the TTL expires.
When we were rethinking how system prompts should actually work, we read Claude’s entire system prompt to see if we were missing anything. All 1,191 lines. What we found was roughly 25,000 tokens of instructions sent with every single conversation. 72% of it is tool documentation and legal compliance. The copyright section repeats the same rule six times. The entire behavioural block appears twice, word for word. A user debugging a Python script gets the full artifact storage API spec, the crisis mental health protocol, and 5,000 tokens of copyright law.
That’s the static prompt tax. And every time your cache expires, you’re rebuilding all 25,000 tokens of it at 1.25× base rate. Including the tools you’ll never touch in that session.
Soma’s prompt is dynamic. Every protocol, muscle, and skill has a heat score based on how often you actually use it:
- 🔥 Hot — full protocol body + TL;DR summaries for muscles (~800 tokens each)
- 🌡 Warm — one-line breadcrumb per protocol, short TL;DR per muscle (~50-100 tokens)
- ❄️ Cold — just the name (~5 tokens)
- 💤 Inactive — not loaded, zero tokens
Use something often, it stays hot. Ignore it, it cools and drops out of the prompt. Nothing loads unless it earns its place.
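In code, the tiering is a straightforward scoring-and-render pass. A sketch; the thresholds, names, and token budgets are illustrative, not Soma’s real ones:

```typescript
// Sketch of heat-tiered prompt assembly. Thresholds and shapes are
// illustrative; the tiers and rough budgets mirror the list above.
type Heat = "hot" | "warm" | "cold" | "inactive";

interface Capability {
  name: string;       // e.g. a protocol, muscle, or skill
  body: string;       // full protocol text
  tldr: string;       // one-line summary
  usesPerWeek: number;
}

function heatOf(c: Capability): Heat {
  if (c.usesPerWeek >= 5) return "hot";
  if (c.usesPerWeek >= 1) return "warm";
  if (c.usesPerWeek > 0) return "cold";
  return "inactive";
}

function render(c: Capability): string {
  switch (heatOf(c)) {
    case "hot":      return `${c.body}\nTL;DR: ${c.tldr}`; // ~800 tokens
    case "warm":     return c.tldr;                        // ~50-100 tokens
    case "cold":     return c.name;                        // ~5 tokens
    case "inactive": return "";                            // not loaded
  }
}

const systemPrompt = (caps: Capability[]) =>
  caps.map(render).filter(Boolean).join("\n\n");
```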
Typical result: 5-8k tokens instead of 25k. When a cache miss does happen, rebuilding 6k at 1.25× base rate costs 76% less than rebuilding 25k. And every token that is loaded is relevant to what you’re actually doing — not a copy-pasted legal section from a team that never talked to the team that wrote the behavioural rules.
Five Things You Can Do Right Now
You don’t need Soma for these. They’ll help regardless.
1. Don’t walk away mid-session. If your break is longer than 5 minutes, save your state first. Your next message will cost 12.5× more than it needs to. And don’t multi-task across agents — while you’re focused on one, the other’s cache is dying. Pick one, finish the task, move on.
With Soma, you’ve got ~26 minutes of keepalive per agent, and it doesn’t constantly ding you for permissions (no matter how many times you tell Claude Code “don’t ask again”). It just does things. We might spend two hours planning a feature (reading MAPs, reviewing architecture, writing the approach) while the cache sits warm. When we shipped soma doctor, it needed to update custom agents across versions without breaking anything people had changed. That’s a hard problem: 10 migration phases, version chain-walking, template comparison, auto-fix tiers. We solved it with a single MAP that knows the full sequence: detect the version gap, chain the phase files, verify each jump, update settings, body templates, protocols, all without the agent asking “what should I do next?” The MAP already knows.
And because Soma loves scripts, most of it is automated before the agent even wakes up. On boot, the doctor script checks your project version against the agent version, walks the migration chain, adds missing settings keys with safe defaults, scaffolds any missing body templates, installs new bundled protocols, converts legacy formats — all silently, all idempotent. By the time Soma starts thinking, the grunt work is already done. The agent picks up where the script left off and handles the parts that need judgment. Scripts for the mechanical. Agent for the contextual. That’s the split.
2. Rotate at ~50% context. Don’t let it grow to 300k tokens. Start fresh with a summary. Smaller context, cheaper rebuilds.
3. Skip --resume. On Claude Code v2.1.69+, every resume triggers a full cache miss. Start a fresh session with a summary of where you left off instead. Claude Code offers compaction for long contexts, but that’s lossy — it summarizes your conversation and throws away detail to free space. Soma takes a different approach: exhale writes a focused briefing of where you are and what’s next, then soma inhale starts fresh at ~5k tokens with an agent that knows the goal, not one reconstructing it from a compressed summary.
4. Trim your system prompt. Your CLAUDE.md, AGENTS.md, custom instructions — all cached. All rebuilt from scratch on every miss. Smaller is cheaper.
5. Use npx instead of the standalone binary. The sentinel bug only exists in the custom Bun fork. The npm package running on standard Node doesn’t have it: `npm install -g @anthropic-ai/claude-code` gets you the standard build.
Soma has 22 lines of learned behavior for every line of compiled code. Over 35,000 lines of protocols, muscles, and workflows — but only the ones relevant to your current session actually load. The rest stay cold, costing you nothing.
The cache bugs will get fixed. The 5-minute TTL probably won’t. The question is whether your tools are built for that reality or pretending it doesn’t exist.