The Running Executive Summary: How to Stop Paying for the Same Conversation Twice
Baljeet Dogra
Series: Part 1 — Thread state and profile state · Part 2 — Email headers for thread scoping · Part 3 (this article)
Once a thread has a reliable boundary, the next question has the biggest cost and quality impact: what do you actually send the model on each new message? Full history replay grows cost quadratically. Sliding windows drop constraints silently. The running executive summary—updated after every turn, passed forward as summary plus latest message—keeps per-turn cost flat and context accurate.
With the thread boundary settled, there are three common answers to what goes into each call, and only one of them holds up as a thread gets long.
Option 1: replay full history
The simplest approach—send every message in the thread, every time—fails on cost in a way that's easy to underestimate. Cost per turn doesn't stay flat; it grows with every reply, because each new call re-sends everything before it.
Take a thread averaging 150 tokens per message. By turn 20, a full-history call re-sends 19 × 150 = 2,850 tokens of prior conversation just to process one new 150-token message. Summed across all 20 turns, the replayed history alone comes to about 28,500 tokens: cumulative cost grows roughly with n², the square of the number of turns, rather than linearly. The conversation gets more expensive to continue the longer it runs, even though each individual message is the same size.
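The arithmetic behind that claim can be written out under the same assumption of a constant message size. The symbols m (tokens per message), n (number of turns), and k (the current turn) are introduced here for illustration only:

```latex
\begin{align*}
\text{history re-sent on turn } k &= (k-1)\,m \\
\text{history re-sent over } n \text{ turns} &= \sum_{k=1}^{n} (k-1)\,m = \frac{n(n-1)}{2}\,m \\
n = 20,\ m = 150: \quad & 19 \times 150 = 2{,}850 \text{ on turn 20}, \qquad
  \frac{20 \times 19}{2} \times 150 = 28{,}500 \text{ in total}
\end{align*}
```

Because the total scales with n(n-1)/2, doubling the length of a thread roughly quadruples what you pay to replay it.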
It also has a quality cost, not just a financial one. As irrelevant early turns pile up, the model has to work harder to find what is actually relevant to the current message. This is sometimes called context drift: instructions or facts from early in a long context get buried under volume and stop reliably influencing the model's output.
Option 2: fixed sliding window
Cap it—only send the last N messages. This solves the cost-growth problem but trades it for a correctness problem: anything outside the window is gone. If a user states a hard constraint on turn 2 ("I want a refund, not store credit") and the window is six messages wide, that constraint silently disappears by turn 8, and the agent can contradict an explicit instruction without anyone noticing why.
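To make the eviction concrete, here is a minimal sketch that tracks which turn numbers a six-message window still holds. It assumes one message per turn, as in the example above; `visible_turns` is a hypothetical helper, not part of any library.

```python
from collections import deque

WINDOW_SIZE = 6  # window width from the example above


def visible_turns(window_size: int, total_turns: int) -> list[int]:
    """Return the turn numbers still inside the window after `total_turns` messages."""
    window = deque(maxlen=window_size)
    for turn in range(1, total_turns + 1):
        window.append(turn)  # a full deque silently evicts its oldest entry
    return list(window)


for total in (7, 8):
    turns = visible_turns(WINDOW_SIZE, total)
    print(f"after turn {total}: window holds {turns}, turn 2 visible: {2 in turns}")
# after turn 7: window holds [2, 3, 4, 5, 6, 7], turn 2 visible: True
# after turn 8: window holds [3, 4, 5, 6, 7, 8], turn 2 visible: False
```

Nothing raises an error or logs a warning when turn 2 falls out, which is exactly why the resulting contradiction is hard to trace.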
Option 3: the running executive summary
The pattern that avoids both failure modes: maintain one summary per thread, update it after every turn, and pass forward only [current summary] + [latest message]—never the raw history.
Mechanically, this adds one extra step after each agent reply: a call (ideally to a fast, cheap model—the task doesn't need your best model) that takes the existing summary plus the new exchange and produces an updated summary, which overwrites the stored one.
SUMMARIZE_PROMPT = """
Update the conversation summary below with the new exchange.
Preserve all open commitments, stated constraints, and decisions —
do not drop them even if the conversation has moved on.
Current summary:
{current_summary}
New exchange:
User: {new_user_message}
Agent: {new_agent_reply}
Updated summary (structured fields, not prose):
- open_items:
- constraints_stated:
- decisions_made:
- next_expected_action:
"""
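One way the update step could be wired around that prompt is sketched below. The names `update_summary`, `build_turn_input`, and `call_model` are hypothetical: `call_model` stands in for whatever client you use to send a prompt to a fast, cheap model and get text back, and no particular SDK is assumed.

```python
from typing import Callable


def update_summary(
    prompt_template: str,
    current_summary: str,
    new_user_message: str,
    new_agent_reply: str,
    call_model: Callable[[str], str],
) -> str:
    """Fold the latest exchange into the stored summary and return the new one."""
    prompt = prompt_template.format(
        current_summary=current_summary or "(empty: first turn)",
        new_user_message=new_user_message,
        new_agent_reply=new_agent_reply,
    )
    # The caller overwrites the stored summary with this return value.
    return call_model(prompt).strip()


def build_turn_input(current_summary: str, latest_message: str) -> str:
    """What the agent sees on each turn: the summary plus the newest message only."""
    return (
        f"Conversation summary:\n{current_summary}\n\n"
        f"Latest message:\n{latest_message}"
    )
```

In use, each turn would call `build_turn_input(summary, user_message)` to produce the agent's input, then `update_summary(SUMMARIZE_PROMPT, summary, user_message, agent_reply, call_model)` once the reply is out, storing the result for the next turn.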
Structuring the summary into explicit fields rather than letting the model write free prose matters more than it looks like it should. A prose summary can quietly omit a dollar figure or a stated constraint because it didn't seem central to the narrative; a constraints_stated field forces the model to either carry it forward or visibly drop it, which makes the failure mode detectable instead of silent.
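Here is a rough sketch of how that detectability could be put to work: parse the fields out of two successive summaries and flag any constraint that was present before the update and absent after it. The one-line-per-field layout with semicolon-separated items is an assumption made for this example, and `parse_summary` and `dropped_constraints` are hypothetical helpers.

```python
FIELDS = ("open_items", "constraints_stated", "decisions_made", "next_expected_action")


def parse_summary(text: str) -> dict[str, list[str]]:
    """Parse '- field: a; b' lines into {field: [items]}; other lines are ignored."""
    parsed: dict[str, list[str]] = {name: [] for name in FIELDS}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("- ") or ":" not in line:
            continue
        name, _, value = line[2:].partition(":")
        name = name.strip()
        if name in parsed:
            parsed[name] = [item.strip() for item in value.split(";") if item.strip()]
    return parsed


def dropped_constraints(old_summary: str, new_summary: str) -> list[str]:
    """Constraints listed in the old summary that no longer appear in the new one."""
    old = parse_summary(old_summary)["constraints_stated"]
    new = set(parse_summary(new_summary)["constraints_stated"])
    return [constraint for constraint in old if constraint not in new]


old = "- open_items: refund request\n- constraints_stated: refund, not store credit"
new = "- open_items: shipping label\n- constraints_stated:\n- decisions_made: return approved"
print(dropped_constraints(old, new))  # ['refund, not store credit']
```

The comparison is exact-match, so a reworded constraint is flagged too. That errs on the side of a false alarm, which is usually the cheaper mistake here.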
What this does to token cost
Compare the two approaches across a 20-turn thread, each message averaging 150 tokens, with the summary held to a steady ~200 tokens regardless of thread length:
| Approach | Tokens sent on turn 20 | Total tokens across all 20 turns |
|---|---|---|
| Full history | ~3,000 (2,850 of history + new message) | ~31,500 |
| Running summary | ~350 (summary + new message) | ~7,000 |
The summary approach doesn't just save tokens on later turns—it keeps the per-turn cost roughly flat as the thread grows, instead of growing with it. That flatness is also the quality win: the model is never asked to find the signal inside a haystack of old turns, because the haystack was already compressed into the signal before the call was made.
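The comparison is easy to reproduce. The sketch below counts the new message in both approaches, so full history comes to 3,000 tokens on turn 20 and 31,500 in total (2,850 and 28,500 if you count only the replayed history). It makes the same simplifications as the comparison above: a summary of constant size from the first turn, and no accounting for the separate summary-update call. The function names are illustrative.

```python
def full_history_tokens(turn: int, tokens_per_message: int = 150) -> int:
    """Tokens sent on `turn` when every earlier message is replayed with the new one."""
    return turn * tokens_per_message


def running_summary_tokens(tokens_per_message: int = 150, summary_tokens: int = 200) -> int:
    """Tokens sent on any turn: the fixed-size summary plus the new message."""
    return summary_tokens + tokens_per_message


TURNS = 20
full_total = sum(full_history_tokens(t) for t in range(1, TURNS + 1))
summary_total = TURNS * running_summary_tokens()

print(full_history_tokens(TURNS), full_total)   # 3000 31500
print(running_summary_tokens(), summary_total)  # 350 7000
```

Changing `TURNS` to 40 shows the gap widening: the full-history total roughly quadruples while the running-summary total only doubles.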
Where this breaks, and how to guard against it
Lossy compression can silently drop something critical
A free-text summarisation step has no guarantee it preserves the one number or commitment that actually matters. The mitigation is the structured-fields approach above: treat certain categories—amounts, deadlines, explicit commitments—as things that must appear verbatim in a dedicated field, not paraphrased into general prose where they're easy to lose.
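A verbatim requirement is something you can check mechanically. As a minimal sketch, the code below pulls amounts and dates out of an incoming message and reports any that do not appear character-for-character in the updated summary. The two regular expressions are deliberately narrow assumptions (currency-prefixed amounts and ISO dates) that you would extend for your own domain, and the helper names are hypothetical.

```python
import re

# Hypothetical patterns for values that must survive verbatim.
AMOUNT = re.compile(r"[$£€]\s?\d+(?:,\d{3})*(?:\.\d+)?")
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def critical_values(message: str) -> list[str]:
    """Amounts and ISO dates found in a message, in that order."""
    return AMOUNT.findall(message) + ISO_DATE.findall(message)


def missing_from_summary(message: str, summary: str) -> list[str]:
    """Critical values from the message that the summary does not contain verbatim."""
    return [value for value in critical_values(message) if value not in summary]


message = "Please refund $1,249.50 by 2025-03-14."
good = "- constraints_stated: refund of $1,249.50 due by 2025-03-14"
bad = "- constraints_stated: refund of about $1,250 due mid-March"

print(missing_from_summary(message, good))  # []
print(missing_from_summary(message, bad))   # ['$1,249.50', '2025-03-14']
```

A non-empty result is a signal to retry the summary update or to append the missing values to the relevant field yourself before storing it.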
Summary-of-a-summary drift over very long threads
Folding turn 20 into a summary that's already been folded nineteen times compounds small losses. For threads that run unusually long, a periodic full re-summarisation directly from the raw message log—say, every 10–15 turns—resets that compounding rather than letting it accumulate indefinitely. This costs more than an incremental update, but it's a rare event compared to the per-turn updates, so it doesn't undo the overall savings.
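The scheduling logic for that reset can be very small. In this sketch, `incremental` and `from_scratch` are hypothetical callables wrapping your two summarisation prompts, and the interval of 12 is simply one assumed value inside the 10 to 15 turn range mentioned above.

```python
from typing import Callable

RESUMMARIZE_EVERY = 12  # assumed value within the suggested 10 to 15 turn range


def next_summary(
    turn: int,
    current_summary: str,
    latest_exchange: str,
    raw_log: list[str],
    incremental: Callable[[str, str], str],
    from_scratch: Callable[[list[str]], str],
    every: int = RESUMMARIZE_EVERY,
) -> str:
    """Rebuild from the raw log on every `every`-th turn; otherwise fold incrementally.

    `raw_log` is expected to already include the latest exchange.
    """
    if turn > 0 and turn % every == 0:
        return from_scratch(raw_log)
    return incremental(current_summary, latest_exchange)
```

This only works if the raw message log is still stored somewhere. The summary replaces the transcript in what you send the model, not in what you keep.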
You won't notice degradation until it's already caused a bad response
The fix here isn't vigilance, it's a test. Build a small set of synthetic threads with known "must retain" facts planted early on, run them through your summarisation pipeline, and check on every change to the prompt or model whether those facts survive to the final summary. This turns an invisible failure mode into a regression test that fails loudly in CI instead of quietly in production.
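A retention check along those lines might look like the sketch below. The thread, the planted facts, and the helper names are all made up for illustration. `summarize` is whatever function wraps your real summary-update call, taking the current summary plus one exchange and returning the new summary.

```python
from typing import Callable

Summarizer = Callable[[str, str, str], str]  # (summary, user_msg, agent_reply) -> summary

# Synthetic thread with "must retain" facts planted in the first two turns.
THREAD = [
    ("I want a refund, not store credit.", "Understood, refund only."),
    ("The order total was $84.20.", "Thanks, noted."),
    ("Can you also update my address?", "Done."),
]
MUST_RETAIN = ["refund, not store credit", "$84.20"]


def run_thread(turns: list[tuple[str, str]], summarize: Summarizer) -> str:
    """Feed every exchange through the summariser and return the final summary."""
    summary = ""
    for user_msg, agent_reply in turns:
        summary = summarize(summary, user_msg, agent_reply)
    return summary


def lost_facts(summary: str, must_retain: list[str]) -> list[str]:
    """Planted facts that do not appear verbatim in the summary."""
    return [fact for fact in must_retain if fact not in summary]


def check_retention(summarize: Summarizer) -> None:
    """Raise AssertionError naming any planted fact the pipeline dropped."""
    missing = lost_facts(run_thread(THREAD, summarize), MUST_RETAIN)
    assert not missing, f"summary dropped planted facts: {missing}"
```

Calling `check_retention` from a test in your suite, with the production summariser passed in, gives you the loud CI failure. Substring matching is strict by design, so it pairs naturally with a prompt that asks for amounts and commitments verbatim.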
Putting the series together
Across these three posts:
- Part 1 defines why memory should be split into ephemeral thread state and durable profile state.
- Part 2 defines where the boundary sits for thread state using the threading headers email already provides.
- Part 3 (this article) defines what actually lives inside that thread state—a structured, continuously-updated summary rather than a growing transcript.
Together they form a reference architecture for memory that stays cheap and stays accurate as conversations get longer, rather than degrading on both fronts at once.