A developer reached out to us after getting his first month's AI API bill: $1,200. He'd estimated $300.
His agent was doing research tasks: searching the web, reading documents, extracting structured data, writing summaries. It ran about 10,000 tasks that month.
When we looked at his usage breakdown, we found three problems:
- Every task was hitting Claude Sonnet regardless of complexity
- No prompt caching — identical system prompts were being charged full price on every request
- Token counts were 3x higher than necessary because he was including full document text when he only needed extracted fields
After optimization: $430/month. Same agent, same tasks, same quality. 64% cost reduction.
This isn't unusual. Most teams overpay for AI APIs by 30-70% because they haven't measured what they're actually paying per task — and which parts of the task are expensive vs cheap.
Here's the playbook.
Step 1: Measure Cost Per Task, Not Cost Per Token
Token prices are how providers charge you. But the unit that matters for your product is cost per completed task.
Different tasks have wildly different token costs even if the task itself seems similar:
| Task | Avg tokens | At Claude Sonnet prices | At Gemini Flash prices |
|------|-----------|------------------------|----------------------|
| Classify sentiment (1 sentence) | 80 | $0.0004 | $0.00002 |
| Summarize 10-page document | 8,000 | $0.035 | $0.002 |
| Extract structured data (5 fields) | 500 | $0.003 | $0.0002 |
| Write code (medium function) | 3,500 | $0.017 | $0.001 |
| Research + write 500-word report | 25,000 | $0.12 | $0.007 |
The "research and write" task costs 300x more than sentiment classification at the same model. Before you optimize, you need to know which tasks in your agent account for most of your spend.
Log cost per task:
```python
import os
import time

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("HEXACLAW_API_KEY"),
    base_url="https://api.hexaclaw.com/v1"
)

def run_task(task_type: str, prompt: str, model: str = "claude-sonnet-4-6") -> dict:
    start = time.time()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    usage = response.usage
    duration = time.time() - start

    # Log for analysis (calculate_cost and log_task are your own helpers)
    cost_estimate = calculate_cost(model, usage.prompt_tokens, usage.completion_tokens)
    log_task({
        "task_type": task_type,
        "model": model,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "duration_ms": int(duration * 1000),
        "cost_usd": cost_estimate
    })

    return {"content": response.choices[0].message.content, "cost": cost_estimate}
```
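The snippet above calls `calculate_cost`, which you implement against your provider's price sheet. A minimal sketch, using the per-1M-token prices quoted in this post (substitute your actual rates; the price table here is an assumption, not an API):

```python
# Hypothetical price table: (input, output) in USD per 1M tokens.
# Substitute the rates from your provider's pricing page.
PRICES = {
    "claude-sonnet-4-6": (3.90, 19.50),
    "gemini-2.5-flash": (0.15, 0.60),
}

def calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of one request from its token counts."""
    input_price, output_price = PRICES[model]
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000
```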
After a week of logging, sort by cost_usd * count to find your most expensive task types. Those are where optimization has the highest leverage.
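That sort takes only a few lines over the logged records; a sketch, assuming log entries are dicts shaped like the ones `log_task` receives:

```python
from collections import defaultdict

def top_task_types(logs: list[dict], n: int = 3) -> list[tuple[str, float]]:
    """Rank task types by total spend (sum of cost_usd over all runs)."""
    totals = defaultdict(float)
    for entry in logs:
        totals[entry["task_type"]] += entry["cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

logs = [
    {"task_type": "research_report", "cost_usd": 0.12},
    {"task_type": "classify", "cost_usd": 0.0004},
    {"task_type": "research_report", "cost_usd": 0.11},
]
print(top_task_types(logs))  # research_report ranks first
```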
Step 2: Match Model to Task Complexity
The most common and highest-impact optimization. Using Claude Sonnet ($3.90/$19.50 per 1M input/output tokens) for a classification task that Gemini Flash ($0.15/$0.60 per 1M tokens) handles equally well is roughly a 26x cost difference on input tokens and 33x on output tokens for that task.
Task complexity tiers:
Tier 1 — Cheap models work fine:
- Single-label classification (sentiment, category, intent)
- Simple extraction (pull specific fields from text)
- Short Q&A with known facts
- Translation
- Basic summarization
Best models: gemini-2.5-flash, claude-haiku-4-5, gpt-4.1-mini
Tier 2 — Mid-tier models are the sweet spot:
- Multi-step reasoning
- Code generation (simple to medium)
- Document analysis and synthesis
- Nuanced writing tasks
Best models: claude-sonnet-4-6, gpt-4.1, gemini-2.5-pro
Tier 3 — Only for the hardest tasks:
- Complex reasoning with ambiguity
- Hard coding tasks (architecture, debugging complex systems)
- Advanced math and logic
- Tasks where errors are very costly
Best models: claude-opus-4-6, o3, o4-mini
Implementing routing:
```python
def select_model(task_complexity: str, task_type: str) -> str:
    """Select the cheapest model that handles this task well."""
    # Force expensive models only for known hard tasks
    if task_type in ["complex_code", "architecture", "advanced_math"]:
        return "claude-opus-4-6"

    # Mid-tier for reasoning and quality writing
    if task_complexity == "medium" or task_type in ["code", "analysis", "report"]:
        return "claude-sonnet-4-6"

    # Budget models for everything else
    return "gemini-2.5-flash"

# Or let HexaClaw route automatically
response = client.chat.completions.create(
    model="auto:cheap",  # HexaClaw picks cheapest capable model
    messages=[...]
)
```
One beta user's agent was running 50,000 classification tasks per month at $0.003/each on Claude Sonnet = $150/month. After routing classification to Gemini Flash at $0.00002/each = $1/month. Same accuracy. $149/month saved on one task type.
Step 3: Use Prompt Caching
Most agents have a system prompt that doesn't change between requests. If your system prompt is 2,000 tokens long and you run 10,000 requests per day, you're paying for 20M input tokens per day that are identical.
Claude and GPT models both support prompt caching. On Claude, subsequent requests with the same prefix read from the cache at roughly 10% of the base input token cost; OpenAI's cached-input discount differs, but the principle is the same.
Without caching:
- System prompt: 2,000 tokens × 10,000 requests × $3.90/1M = $78/day
With caching:
- First request: $0.0078 (full price)
- Subsequent requests: $0.00078 (90% discount)
- Approximate daily cost: ~$8/day
$70/day savings from one change.
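The same arithmetic as a small helper you can point at your own numbers (the 0.9 default is the ~90% Claude cache-read discount described above; adjust it for your provider):

```python
def daily_prompt_cost(prompt_tokens: int, requests: int, price_per_m: float,
                      cached_discount: float = 0.0) -> float:
    """Daily cost of sending the same prompt prefix on every request.

    cached_discount is the fraction knocked off cached input tokens
    (0.9 ~= a 90% discount). The first request always pays full price.
    """
    full = prompt_tokens * price_per_m / 1_000_000
    if cached_discount == 0.0:
        return full * requests
    return full + full * (1 - cached_discount) * (requests - 1)

print(daily_prompt_cost(2000, 10_000, 3.90))       # ~78 without caching
print(daily_prompt_cost(2000, 10_000, 3.90, 0.9))  # ~7.8 with caching
```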
To enable caching on HexaClaw, set the cache_control parameter:
```python
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": system_prompt,
                    "cache_control": {"type": "ephemeral"}  # Cache this prefix
                }
            ]
        },
        {"role": "user", "content": user_message}
    ]
)
```
The cache stays warm for up to 5 minutes. For agents that process requests continuously, this means nearly every request hits the cache.
Step 4: Reduce Token Waste in Prompts
Token waste is when you're sending more tokens than the task requires.
Common sources of token waste:
Sending full documents when you need summaries:
```python
# Expensive: 10,000-token document in context
prompt = f"Extract the deadline from this contract:\n\n{full_contract_text}"

# Cheaper: pre-extract the relevant section
relevant_section = extract_section(full_contract_text, "deadline")  # ~200 tokens
prompt = f"Extract the deadline from this section:\n\n{relevant_section}"
```
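`extract_section` is whatever pre-filter fits your documents. One cheap, model-free option is a keyword window over paragraphs; a minimal sketch (the paragraph-splitting heuristic here is an assumption, not a library call):

```python
def extract_section(text: str, topic: str, window: int = 1) -> str:
    """Naive pre-filter: keep only paragraphs mentioning the topic,
    plus `window` paragraphs of surrounding context."""
    paragraphs = text.split("\n\n")
    hits = [i for i, p in enumerate(paragraphs) if topic.lower() in p.lower()]
    keep: set[int] = set()
    for i in hits:
        keep.update(range(max(0, i - window), min(len(paragraphs), i + window + 1)))
    return "\n\n".join(paragraphs[i] for i in sorted(keep))
```

For contracts with reliable headings, splitting on section titles instead of blank lines is usually more precise; the point is to spend a few lines of Python instead of thousands of input tokens.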
Verbose system prompts:
# Before: 800 tokens
You are an expert customer support agent with deep knowledge of our product.
Your job is to help customers resolve their issues. You should be polite,
professional, and helpful at all times. You should not discuss competitors.
You should not make promises about refunds unless the issue is clearly our fault.
[... continues for 30 more lines ...]
# After: 150 tokens
Customer support agent. Scope: product issues only. Tone: professional.
Refunds: only if fault is ours. Do not discuss competitors.
Sending conversation history naively: Every message you include in context costs tokens. Long conversations can have 20K+ tokens of history for a 200-token question. Options:
- Summarize old conversation history (keep last N turns, summarize earlier)
- Use vector retrieval to fetch only relevant history
- Cap conversation length and start fresh
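The first option can be sketched in a few lines. Here `summarize` is injected as a parameter and stands in for a cheap-model summarization call; it is an assumption of this sketch, not an SDK function:

```python
from typing import Callable

def trim_history(messages: list[dict], summarize: Callable[[list[dict]], str],
                 keep_last: int = 6) -> list[dict]:
    """Keep the last `keep_last` messages verbatim; collapse everything
    older into a single summary message."""
    if len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = summarize(old)  # e.g. one cheap-model call over the old turns
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + recent
```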
Measuring output token waste:
If your agent regularly outputs more tokens than needed, constrain it:
```python
response = client.chat.completions.create(
    model="claude-sonnet-4-6",
    max_tokens=500,  # Set appropriate limits per task
    messages=[
        {"role": "system", "content": "Be concise. Answer in 1-3 sentences unless asked for detail."},
        {"role": "user", "content": question}
    ]
)
```
Step 5: Batch Requests Where Possible
Real-time interactive agents need immediate responses. Background agents don't. If you're processing documents, running analyses, or generating reports asynchronously, batching helps in two ways: some providers offer dedicated batch APIs at around a 50% discount for asynchronous jobs, and even without one you can pack independent items into a single request to amortize the shared prompt.
```python
# Instead of 100 individual requests:
for doc in documents:
    result = run_agent(doc)  # 100 separate API calls

# Batch into one request where possible:
batch_prompt = "\n---\n".join([
    f"Document {i+1}:\n{doc}" for i, doc in enumerate(documents[:10])
])
results = run_agent(f"Classify each of the following documents:\n\n{batch_prompt}")
```
Batching works well for classification, extraction, and summarization where documents are independent. It doesn't work for tasks that require sequential reasoning.
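Batching only pays off if you can reliably map the model's answer back to the inputs. A sketch of the parsing side, assuming the prompt asked for exactly one numbered `N. label` line per document:

```python
def parse_batch_labels(output: str, n_docs: int) -> list[str]:
    """Parse 'N. label' lines from a batched classification response.
    Assumes the prompt asked for one numbered line per document."""
    labels = {}
    for line in output.splitlines():
        line = line.strip()
        if not line or "." not in line:
            continue
        num, _, label = line.partition(".")
        if num.strip().isdigit():
            labels[int(num.strip())] = label.strip()
    # Fall back to a sentinel when the model skipped a document
    return [labels.get(i + 1, "UNPARSED") for i in range(n_docs)]

print(parse_batch_labels("1. positive\n2. negative\n3. neutral", 3))
# → ['positive', 'negative', 'neutral']
```

Count the `UNPARSED` sentinels: if they're frequent, shrink the batch size or tighten the output-format instruction before trusting batched results.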
Step 6: Monitor Continuously
Cost optimization is not a one-time project. Model prices change. Your usage patterns change. New capabilities enable cheaper approaches.
Set up monitoring for:
- Daily cost per task type — catch when a task starts consuming more tokens than expected
- Model distribution — verify routing is working as intended
- Cache hit rate — if caching is enabled but hit rate is low, something's wrong
- Token anomalies — individual requests with 10x normal token counts are almost always bugs
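The last check in that list is easy to automate against the Step 1 logs; a sketch, assuming log entries shaped like the ones `log_task` receives:

```python
from statistics import median

def find_token_anomalies(logs: list[dict], factor: float = 10.0) -> list[dict]:
    """Flag requests whose total tokens exceed `factor` x the median
    for their task type."""
    by_type: dict[str, list[int]] = {}
    for entry in logs:
        total = entry["prompt_tokens"] + entry["completion_tokens"]
        by_type.setdefault(entry["task_type"], []).append(total)
    baselines = {t: median(v) for t, v in by_type.items()}
    return [
        entry for entry in logs
        if entry["prompt_tokens"] + entry["completion_tokens"]
        > factor * baselines[entry["task_type"]]
    ]
```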
HexaClaw's usage dashboard shows cost breakdown by service and model, with daily and monthly views. Available at hexaclaw.com/dashboard/credits/history.
The Quick Win Checklist
If you've never optimized your AI API costs, start here:
- [ ] Identify your top 3 task types by total spend (log for 1 week)
- [ ] Check if any classification or extraction tasks are running on premium models
- [ ] Enable prompt caching if you have a static system prompt
- [ ] Set `max_tokens` limits on all requests based on expected output length
- [ ] Reduce system prompt length by 30-50% (almost always possible)
- [ ] Route high-volume simple tasks to Gemini Flash or Claude Haiku
These six changes typically reduce AI API costs by 30-50% with no degradation in output quality.
HexaClaw's credit dashboard shows exactly where your credits are going — by model, by service, and by day. Sign up at hexaclaw.com/signup, and the built-in usage analytics will make it easy to find the optimization opportunities in your specific workload.