Making Architectural Decisions to Optimise LLM Cost
Baljeet Dogra
Cost control is not a post-launch optimisation—it is an architecture problem. Every decision you push left in the pipeline, before a paid LLM call, compounds into real savings at scale.
If you have already read about what goes wrong when LLM bills spiral, this article is the counterpart: how to design systems that do not explode in the first place. The patterns below come from production agents that score businesses across multiple dimensions—only one of which needs expensive vision.
The golden rule: push decisions left
Every decision you can make before hitting a paid API saves money. The further left in the pipeline you resolve something, the cheaper it is.
Cost spectrum (cheapest → most expensive)
A well-designed agent routes most work through the left side. The expensive model is reserved for tasks that genuinely need it—not as a default.
In a typical multi-scorer setup, three of four dimensions can run on rules, free APIs, or cheap models. The design scorer—vision plus reasoning—is often the only one that breaks the rule. Architecture is about making that exception deliberate, gated, and cached.
Decision 1: when to call the expensive model at all
The biggest architectural win is a pre-qualification gate before any LLM call. Only inputs that pass basic checks deserve the expensive vision score.
Business URL received
│
▼
Is the site reachable? ──No──→ Skip. Log as unreachable.
│
Yes
▼
Does it have HTTPS? ──No──→ Pitch immediately. No scoring needed.
│ ("Your site has a security warning")
Yes
▼
Is it in the cache? ──Yes──→ Return cached score. Zero cost.
│
No
▼
Run cheap scorers first
(performance, discoverability, conversion — all free)
│
▼
Total across 3 scorers < 12/30? ──Yes──→ Strong enough lead.
│ Run design scorer (Opus).
No
▼
Score is low enough without design. Skip vision. Use proxy.
The insight: if a business already scores poorly on three free dimensions, the design score rarely changes the pitch angle—you already know what to say. Only run Opus when design is the deciding dimension.
Decision 2: model tiering
Match model capability to task complexity. Never use a model for a task a rule can do.
| Task | Complexity | Right approach | Wrong approach |
|---|---|---|---|
| Design visual scoring | High — vision + reasoning | claude-opus-4 (or equivalent) | Haiku (no vision) |
| Pitch angle generation | Low — rule lookup | Rule-based logic (free) | Any LLM |
| Email draft personalisation | Medium — tone + context | claude-haiku-4-5 | Opus (overkill) |
| Classification (sensitive vs not) | Low | Haiku or rules | Opus |
Keep pitch-angle logic rule-based. Only escalate to an LLM when the output genuinely requires open-ended language generation.
Decision 3: caching strategy
Not all data ages at the same rate. Cache each data type for the right duration.
| Data | Staleness risk | Cache TTL |
|---|---|---|
| Design score (vision) | Low — sites rarely redesign | 30 days |
| PageSpeed score | Medium — can change with deploys | 7 days |
| Google Business Profile | Low — ratings shift slowly | 7 days |
| On-page SEO signals | Medium — content changes | 7 days |
| Conversion signals | Medium | 7 days |
| Site reachability | High — can go down anytime | 1 day |
Two-layer cache architecture
Request hits → Redis (fast, in-memory, TTL per dimension)
│ miss
▼
Disk/DB cache (slower, longer TTL, survives restarts)
│ miss
▼
Live API call → write to both caches
Redis key pattern: score:{domain}:{dimension}:{YYYY-MM-DD} — invalidate one dimension without clearing others.
Decision 4: batching vs real-time
Running one business at a time is the most expensive pattern. Batching changes your cost profile significantly.
- • Screenshot batching — accumulate 10 screenshots and send them in one multi-image message. Providers charge per token, not per message, so you amortise system-prompt cost across many businesses.
- • Async processing — score businesses in a queue overnight when you have backlog, not real-time. Off-peak processing enables aggressive retry-and-cache logic without user-facing latency pressure.
Decision 5: spend guardrails in code
Cost optimisation without guardrails is just optimism. Build limits into the agent itself.
DAILY_BUDGET_USD = 10.00
COST_PER_VISION_CALL = 0.015 # update when pricing changes
MAX_VISION_CALLS_TODAY = int(DAILY_BUDGET_USD / COST_PER_VISION_CALL)
class BudgetGuard:
def __init__(self):
self.vision_calls_today = self._load_from_redis("vision_calls:today")
def can_run_vision(self) -> bool:
return self.vision_calls_today < MAX_VISION_CALLS_TODAY
def record_vision_call(self):
self.vision_calls_today += 1
self._persist_to_redis("vision_calls:today", self.vision_calls_today)
When the budget guard fires, fall back to proxy scoring—you still get a result, just a less rich one.
The architecture in one view
Discovery agent output
│
▼
Reachability check (free)
│
▼
Cache lookup (free)
│ miss
▼
Free scorers in parallel
(performance + discoverability + conversion)
│
▼
Gate: is vision needed?
├── No → proxy design score → merge → pitch
└── Yes → budget guard → Opus vision → merge → pitch
│
budget exhausted?
└── proxy fallback
│
▼
Cache result by dimension + TTL
│
▼
Pitch angle (rules, free)
│
▼
Email draft (Haiku, only on confirmed leads)
This is cost architecture: not one trick, but a pipeline where expensive steps are rare, gated, cached, batched, and bounded. Prototype code often calls the best model on every path; production code earns that call.
Related reading: 10 LLM Cost Explosion Traps and AI Pricing Tips: How to Control AI Costs Effectively.
Need help designing cost-aware AI?
I help teams architect agents, scoring pipelines, and LLM integrations with predictable spend—gates, tiering, and guardrails built in from day one.
Get in Touch