Skip to main content

Session compaction

Long conversations grow without bound. Replaying every turn into the prompt eventually overflows the model's context window, drives prefill cost up, and dilutes recent intent with stale tool output. Vystak compacts sessions in three layers, all running on the agent server next to the LangGraph checkpoint.

When to enable it

  • Agents that hold long-running conversations — support assistants, multi-turn debugging copilots, anything with sessions.engine = postgres or sqlite.
  • Agents that emit large tool outputs (file reads, search dumps, exec stdout) where most of each prefill is dead weight.

If your agent does one-shot turns or always starts a fresh thread, compaction has nothing to do — leave it off.

The three layers

Layer 1 — tool-output prune

Always-on, no LLM call, pure function. Before every turn, scans ToolMessage entries older than the last few user→assistant turns. If their content is bigger than prune_tool_output_bytes (default 4 KB), they're rewritten as head + "...truncated N bytes..." + tail. The last N turns are preserved byte-for-byte.

This is cheap defense — it catches the common case of an agent quoting a 50 KB file read into the prompt forever.

Layer 3 — threshold pre-call summarize

Runs in the prompt callable, before the LLM sees the messages. Token estimate (sync tokenizer when the model exposes one, otherwise a calibrated chars/3.5 fallback) decides whether prefill is at trigger_pct × context_window. If so:

  1. Split into older (everything except the recent zone) + recent.
  2. Summarize older with the configured summarizer model.
  3. Replace older messages with a single SystemMessage(summary).

Subsequent turns read latest_compaction(thread_id) and assemble [summary] + messages_after_up_to_id so the summary persists across turns. The original LangGraph checkpoint is never rewritten — every generation lives in the vystak_compactions table.

A 60-second / 70%-coverage idempotency guard suppresses Layer 3 when a recent compaction already covers most of the message list, so the threshold never re-summarizes the same span.

Manual /compact

POST /v1/sessions/{thread_id}/compact with optional {"instructions": "..."} body. Forces a compaction regardless of trigger state. Useful as a slash command (/compact in vystak-chat) or as a checkpoint at task boundaries.

What's not Layer 2

The original design called for an autonomous-tool middleware (the model decides when to summarize at clean task boundaries). LangChain renamed that API in 1.1 (SummarizationMiddleware) and removed the autonomous-tool variant. The remaining threshold class is incompatible with vystak's prompt= callable architecture, so vystak doesn't wire it. Layer 3 in the prompt callable provides the same threshold guarantee. (See vystak.schema.compaction for the full rationale.)

Schema

agents:
- name: chatty
model: agent_model
sessions:
type: postgres
provider: {name: docker, type: docker}
compaction:
mode: aggressive # off | conservative | aggressive
trigger_pct: 0.3 # optional override (0 < x < 1)
keep_recent_pct: 0.2 # optional override
prune_tool_output_bytes: 4096
target_tokens: 50000 # post-compaction target
context_window: 5000 # override the model's nominal window
summarizer: # optional — falls back to agent.model
name: summarizer
provider: {name: anthropic, type: anthropic}
model_name: claude-haiku-4-5-20251001
api_keys: {name: ANTHROPIC_API_KEY}

Mode presets

Fieldconservative (default)aggressive
trigger_pct0.750.60
keep_recent_pct0.100.20
prune_tool_output_bytes40961024
target_tokenshalf of context_windowquarter

mode: off short-circuits the runtime — app_factory.build_agent_app skips wiring the compactor + pruner, and no compaction-specific calls fire during a turn.

context_window

Defaults to a built-in table (200K for current Claude/Sonnet/Haiku, 128K for gpt-4o, 1M for gpt-4.1, 200K for unknown models). Override to:

  • Test compaction quickly — set 5000 and compaction fires within a handful of turns instead of needing a real long conversation.
  • Match an unfamiliar model — when you point ANTHROPIC_API_URL at a non-Anthropic endpoint or run a model not in the built-in table.

What gets stored

Every compaction (Layer 3 and manual) appends a row to vystak_compactions:

columndescription
thread_id, generationcomposite PK, generation increments per thread
summary_textthe summary text
up_to_message_idstable vystak_msg_id of the last message replaced
triggerthreshold or manual
summarizer_modelactual model name used for the summary
input_tokens, output_tokensusage from the summarizer call
created_attimestamp

The LangGraph checkpoint is never rewritten — older generations stay queryable for audit and debugging.

Inspection endpoints

Generated on every compaction-enabled agent:

  • POST /v1/sessions/{thread_id}/compact — force a compaction
  • GET /v1/sessions/{thread_id}/compactions — list all generations
  • GET /v1/sessions/{thread_id}/compactions/{generation} — full row

The chat-channel proxy (vystak-channel-chat) forwards all three so they're reachable through the OpenAI-compatible front door.

vystak-chat slash commands /compact [instructions] and /compactions resolve thread_id from the most recent previous_response_id and call the inspection endpoints.

Failure handling

LayerOn summarizer errorUser-visible
Prunen/a — pure function
Thresholdhard-truncate to target_tokens, set _vystak_compaction_fallback on call configobservable via x_vystak SSE chunk; logged WARNING
ManualHTTP 502 with code: "compaction_failed"direct error to caller

The threshold layer fails open by design — a missed compaction is preferable to a dropped turn. Manual is interactive, so it fails loudly.

Token estimation strategy

Three tiers, in order:

  1. Cheap early-out — last turn's usage_metadata.input_tokens plus chars/3.5 × 1.10 on new messages. Skips the pre-flight if well below threshold.
  2. Provider tokenizer — calls model.aget_num_tokens_from_messages (async) or get_num_tokens_from_messages (sync, run in a worker thread). Anthropic models expose only the sync version in langchain 1.x.
  3. Calibrated chars/3.5 — last-resort fallback. The 4-chars/token GPT heuristic underestimates Anthropic prefill; Claude's tokenizer yields ~3.5 chars/token in practice, plus we add a 10% safety margin for system prompt and tool definitions.

Probe results are cached per-model so models without either tokenizer log INFO once instead of WARNING every turn.

Observability

In-process counters per agent (vystak_compaction_total{layer, trigger, outcome}, vystak_compaction_input_tokens_total{layer}, vystak_compaction_messages_compacted{layer}, vystak_compaction_estimate_error{provider}, vystak_compaction_suppressions{layer, reason}).

Structured logs:

vystak.compaction.threshold.suppressed thread_id=... covered=0.85 seconds_since=12
vystak.compaction.threshold.fallback thread_id=... reason="rate limited"

Example

examples/docker-compaction/ is a complete working setup with Postgres sessions + chat channel + Slack channel + compaction tuned for fast dev loops (5K context, 0.3 trigger). Walks through deploy → drive → inspect → destroy.

  • Services — Postgres / SQLite session store backing
  • Channelsvystak-channel-chat proxies the inspection endpoints

Status

Compaction is wired into agents scaffolded from vystak-template-langchain-python when compaction.mode != "off". The runtime lives at _vystak/runtime/compaction/{pruner.py,compactor.py} inside the user's project. Tested end-to-end against Postgres sessions on Docker; stub-tested for SQLite and in-memory backends.