Making Architectural Decisions to Optimise LLM Cost

If you have already read about what goes wrong when LLM bills spiral, this article is the counterpart: how to design systems whose costs do not spiral in the first place. The patterns below come from production agents that score businesses across multiple dimensions, only one of which needs an expensive vision model.

The golden rule: push decisions left

Every decision you can make before hitting a paid API saves money. The further left in the pipeline you resolve something, the cheaper it is.

Cost spectrum (cheapest → most expensive)

Rules → Scraping → Free API → Cheap LLM → Expensive LLM

A well-designed agent routes most work through the left side. The expensive model is reserved for tasks that genuinely need it—not as a default.

In a typical multi-scorer setup, three of four dimensions can run on rules, free APIs, or cheap models. The design scorer—vision plus reasoning—is often the only one that breaks the rule. Architecture is about making that exception deliberate, gated, and cached.
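One way to read the cost spectrum in code is as an ordered chain of resolvers, tried left to right, stopping at the first one that produces an answer. The sketch below is only an illustration: the `Resolver` type, the tier names, the per-call costs, and the stub resolvers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Resolver:
    name: str                                # tier label, e.g. "rules"
    cost_usd: float                          # assumed cost per call (illustrative)
    resolve: Callable[[str], Optional[int]]  # a score, or None to escalate


def resolve_left_first(url: str, chain: list[Resolver]) -> tuple[Optional[int], str, float]:
    """Try each tier in order and stop at the first that answers.

    Returns (score, tier_name, total_cost_usd).
    """
    spent = 0.0
    for tier in chain:
        spent += tier.cost_usd
        score = tier.resolve(url)
        if score is not None:
            return score, tier.name, spent
    return None, "unresolved", spent


# Hypothetical stubs standing in for real checks.
def rule_check(url: str) -> Optional[int]:
    # A rule settles the obvious case without any API call.
    return 0 if url.startswith("http://") else None


def cheap_llm(url: str) -> Optional[int]:
    return 6  # placeholder answer from a cheap model


CHAIN = [
    Resolver("rules", 0.0, rule_check),
    Resolver("cheap_llm", 0.001, cheap_llm),
]
```

Ordering the chain by cost means an input only pays for a tier when every cheaper tier has declined to answer.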

Decision 1: when to call the expensive model at all

The biggest architectural win is a pre-qualification gate before any LLM call. Only inputs that pass basic checks deserve the expensive vision score.

Business URL received
        │
        ▼
Is the site reachable? ──No──→ Skip. Log as unreachable.
        │
       Yes
        ▼
Does it have HTTPS? ──No──→ Pitch immediately. No scoring needed.
        │                   ("Your site has a security warning")
       Yes
        ▼
Is it in the cache? ──Yes──→ Return cached score. Zero cost.
        │
       No
        ▼
Run cheap scorers first
(performance, discoverability, conversion — all free)
        │
        ▼
Total across 3 scorers < 12/30? ──Yes──→ Score is low enough without design.
        │                                Skip vision. Use proxy.
        No
        ▼
Design is the deciding dimension. Run design scorer (Opus).

The insight: if a business already scores poorly on three free dimensions, the design score rarely changes the pitch angle—you already know what to say. Only run Opus when design is the deciding dimension.
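The gate can be sketched as a single function that returns the next action. This is a minimal sketch under stated assumptions: the `Action` names and the injected callables (`is_reachable`, `cached_score`, `free_scores`) are hypothetical, the HTTPS check is reduced to a URL-scheme test, and the threshold follows the insight above (a low total on the free scorers means vision is skipped).

```python
from enum import Enum
from typing import Callable, Optional


class Action(Enum):
    SKIP_UNREACHABLE = "skip_unreachable"
    PITCH_NO_HTTPS = "pitch_no_https"
    RETURN_CACHED = "return_cached"
    USE_PROXY = "use_proxy_design_score"
    RUN_VISION = "run_vision"


LOW_SCORE_THRESHOLD = 12  # out of 30, the figure used in the flowchart


def prequalify(
    url: str,
    is_reachable: Callable[[str], bool],
    cached_score: Callable[[str], Optional[dict]],
    free_scores: Callable[[str], dict[str, int]],
) -> Action:
    """Walk the gate top to bottom; every check here is free."""
    if not is_reachable(url):
        return Action.SKIP_UNREACHABLE
    # Simplification: a real check would also follow redirects.
    if not url.lower().startswith("https://"):
        return Action.PITCH_NO_HTTPS
    if cached_score(url) is not None:
        return Action.RETURN_CACHED
    total = sum(free_scores(url).values())
    if total < LOW_SCORE_THRESHOLD:
        # Already a clear pitch; design would not change the angle.
        return Action.USE_PROXY
    return Action.RUN_VISION
```

Because the function only decides and never calls a paid API itself, it is cheap to unit-test each branch with stub callables.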

Decision 2: model tiering

Match model capability to task complexity. Never use a model for a task a rule can do.

Task	Complexity	Right approach	Wrong approach
Design visual scoring	High (vision + reasoning)	claude-opus-4 (or equivalent)	Haiku (weaker on nuanced visual judgement)
Pitch angle generation	Low (rule lookup)	Rule-based logic (free)	Any LLM
Email draft personalisation	Medium (tone + context)	claude-haiku-4-5	Opus (overkill)
Classification (sensitive vs not)	Low	Haiku or rules	Opus (overkill)

Keep pitch-angle logic rule-based. Only escalate to an LLM when the output genuinely requires open-ended language generation.
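As an illustration of keeping pitch angles rule-based, the sketch below pairs a task-to-model lookup with a plain dictionary of angles. The model names are copied from the table as written and are not verified API identifiers; the task keys, dimension names, and pitch sentences are hypothetical.

```python
from typing import Optional

# Task -> model, mirroring the table. None means "no LLM, use rules".
MODEL_FOR_TASK: dict[str, Optional[str]] = {
    "design_visual_scoring": "claude-opus-4",
    "pitch_angle": None,
    "email_personalisation": "claude-haiku-4-5",
    "sensitivity_classification": "claude-haiku-4-5",
}

# Hypothetical pitch angles keyed by the weakest dimension.
PITCH_ANGLES: dict[str, str] = {
    "performance": "Your site loads slowly on mobile.",
    "discoverability": "Customers struggle to find you in search.",
    "conversion": "Visitors have no clear way to contact you.",
    "design": "Your site looks dated next to competitors.",
}


def pitch_angle(scores: dict[str, int]) -> str:
    """Return the angle for the lowest-scoring dimension.

    On a tie, the dimension that appears first in `scores` wins.
    """
    weakest = min(scores, key=scores.get)
    return PITCH_ANGLES[weakest]
```

A lookup like this is deterministic and costs nothing per call, which is the argument for not sending it to a model.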

Decision 3: caching strategy

Not all data ages at the same rate. Cache each data type for the right duration.

Data	Staleness risk	Cache TTL
Design score (vision)	Low — sites rarely redesign	30 days
PageSpeed score	Medium — can change with deploys	7 days
Google Business Profile	Low — ratings shift slowly	7 days
On-page SEO signals	Medium — content changes	7 days
Conversion signals	Medium	7 days
Site reachability	High — can go down anytime	1 day

Two-layer cache architecture

Request hits → Redis (fast, in-memory, TTL per dimension)
                    │ miss
                    ▼
              Disk/DB cache (slower, longer TTL, survives restarts)
                    │ miss
                    ▼
              Live API call → write to both caches

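Redis key pattern: score:{domain}:{dimension}, with expiry set by that dimension's TTL, so you can invalidate one dimension without clearing the others. Keep the scoring date in the cached value rather than in the key: a key such as score:{domain}:{dimension}:{YYYY-MM-DD} misses on every later day, even while the entry is still within its TTL.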
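The read-through flow can be sketched with two interchangeable stores. This is a simplified sketch, not a production cache: `TtlStore` is a hypothetical in-memory stand-in for both Redis and the disk/DB layer, both layers share the same TTL for brevity, the dimension names are shortened versions of the table rows, and the key omits any date suffix so that lookups on later days still hit.

```python
import time
from typing import Callable, Optional

DAY = 86_400  # seconds

# Per-dimension TTLs, taken from the table above.
TTL = {
    "design": 30 * DAY,
    "pagespeed": 7 * DAY,
    "business_profile": 7 * DAY,
    "seo": 7 * DAY,
    "conversion": 7 * DAY,
    "reachability": 1 * DAY,
}


class TtlStore:
    """In-memory stand-in for one cache layer."""

    def __init__(self, clock: Callable[[], float] = time.time):
        self._data: dict[str, tuple[float, int]] = {}
        self._clock = clock

    def get(self, key: str) -> Optional[int]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: int, ttl: float) -> None:
        self._data[key] = (self._clock() + ttl, value)


def get_score(
    domain: str,
    dimension: str,
    fast: TtlStore,
    slow: TtlStore,
    fetch_live: Callable[[str, str], int],
) -> int:
    key = f"score:{domain}:{dimension}"
    ttl = TTL[dimension]
    value = fast.get(key)
    if value is not None:
        return value
    value = slow.get(key)
    if value is not None:
        # Repopulate the fast layer. Note this restarts its TTL clock.
        fast.set(key, value, ttl)
        return value
    value = fetch_live(domain, dimension)  # the only paid step
    fast.set(key, value, ttl)
    slow.set(key, value, ttl)
    return value
```

Injecting the clock keeps expiry testable without sleeping; a real deployment would rely on the expiry features of the stores themselves.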

Decision 4: batching vs real-time

Running one business at a time is the most expensive pattern. Batching changes your cost profile significantly.

• Screenshot batching: accumulate 10 screenshots and send them in one multi-image message. Providers charge per token, not per message, so the system prompt is paid for once per batch instead of once per business.
• Async processing: when you have a backlog, score businesses from a queue overnight rather than in real time. Without user-facing latency pressure, you can retry and cache aggressively.
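The amortisation argument can be checked with a little arithmetic. The sketch below groups items into batches and counts input tokens when the system prompt is sent once per batch; the token figures in the usage note are invented for illustration and are not provider numbers.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def input_tokens(
    n_businesses: int,
    batch_size: int,
    system_tokens: int,
    tokens_per_image: int,
) -> int:
    """Total input tokens when the system prompt is sent once per batch."""
    n_batches = -(-n_businesses // batch_size)  # ceiling division
    return n_batches * system_tokens + n_businesses * tokens_per_image
```

With a hypothetical 800-token system prompt and 1,500 tokens per screenshot, 100 businesses cost 230,000 input tokens one at a time and 158,000 in batches of 10. The image tokens are unchanged; only the repeated system prompt shrinks.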

Decision 5: spend guardrails in code

Cost optimisation without guardrails is just optimism. Build limits into the agent itself.

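import datetime

DAILY_BUDGET_USD = 10.00
COST_PER_VISION_CALL = 0.015   # update when pricing changes
MAX_VISION_CALLS_TODAY = int(DAILY_BUDGET_USD / COST_PER_VISION_CALL)

class BudgetGuard:
    """Caps vision calls per day.

    `store` is any Redis-like client exposing get/incr/expire,
    for example redis.Redis.
    """

    def __init__(self, store):
        self.store = store

    def _key(self) -> str:
        # Date-stamped key, so the counter starts from zero each UTC day.
        today = datetime.datetime.now(datetime.timezone.utc).date()
        return f"vision_calls:{today.isoformat()}"

    def can_run_vision(self) -> bool:
        return int(self.store.get(self._key()) or 0) < MAX_VISION_CALLS_TODAY

    def record_vision_call(self) -> None:
        key = self._key()
        self.store.incr(key)               # atomic, so safe across workers
        self.store.expire(key, 2 * 86400)  # drop old counters after two days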

When the budget guard fires, fall back to proxy scoring—you still get a result, just a less rich one.
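The fallback itself can be a few lines around whatever guard you use. In this sketch, `design_score`, `vision_score`, and `proxy_score` are hypothetical names, and the guard is assumed to expose the same two methods as the class above.

```python
from typing import Callable, Protocol


class Guard(Protocol):
    def can_run_vision(self) -> bool: ...
    def record_vision_call(self) -> None: ...


def design_score(
    url: str,
    guard: Guard,
    vision_score: Callable[[str], int],
    proxy_score: Callable[[str], int],
) -> tuple[int, str]:
    """Return (score, source), where source is "vision" or "proxy"."""
    if not guard.can_run_vision():
        # Budget exhausted: still return a result, just a less rich one.
        return proxy_score(url), "proxy"
    # Count the call before spending, so a failed but billed call
    # still uses up budget.
    guard.record_vision_call()
    return vision_score(url), "vision"
```

Returning the source alongside the score lets downstream steps, or a later re-scoring job, tell proxy results apart from real vision scores.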

The architecture in one view

Discovery agent output
         │
         ▼
   Reachability check (free)
         │
         ▼
   Cache lookup (free)
         │ miss
         ▼
   Free scorers in parallel
   (performance + discoverability + conversion)
         │
         ▼
   Gate: is vision needed?
   ├── No → proxy design score → merge → pitch
   └── Yes → budget guard → Opus vision → merge → pitch
                                │
                          budget exhausted?
                                └── proxy fallback
         │
         ▼
   Cache result by dimension + TTL
         │
         ▼
   Pitch angle (rules, free)
         │
         ▼
   Email draft (Haiku, only on confirmed leads)

This is cost architecture: not one trick, but a pipeline where expensive steps are rare, gated, cached, batched, and bounded. Prototype code often calls the best model on every path; production code earns that call.