Token Budget Management

Token budgets give you fine-grained control over how tokens are allocated across different context sources. Instead of a single max_tokens cap where all sources compete, you can assign dedicated portions to system prompts, memory, retrieval, tools, and other sources.

Why Token Budgets?

Without budgets, all ContextItem objects compete for the same token pool. A large retrieval result can crowd out conversation history. A verbose system prompt can leave no room for RAG context.

Token budgets solve this by:

Allocating per-source caps -- each source gets its own share of the window.
Reserving tokens -- a fixed number of tokens is held back for the LLM's response.
Defining overflow strategies -- you control what happens when a source exceeds its cap.
Tracking shared pool usage -- diagnostics report how many tokens each source consumed and how many came from the shared pool.

Core Models

TokenBudget

The top-level budget model defines the total token capacity and how it is divided.

from anchor import TokenBudget, BudgetAllocation, SourceType

budget = TokenBudget(
    total_tokens=8192,
    reserve_tokens=1200,  # Hold back for the LLM response
    allocations=[
        BudgetAllocation(source=SourceType.SYSTEM, max_tokens=800, priority=10),
        BudgetAllocation(source=SourceType.MEMORY, max_tokens=800, priority=8),
        BudgetAllocation(source=SourceType.RETRIEVAL, max_tokens=3200, priority=5),
    ],
)

Field	Type	Description
`total_tokens`	`int`	Total token budget (must be > 0)
`allocations`	`list[BudgetAllocation]`	Per-source allocations
`reserve_tokens`	`int`	Tokens reserved for the LLM response (default: 0)

The model validates that the sum of max_tokens across all allocations, plus reserve_tokens, does not exceed total_tokens. If it does, a ValueError is raised at construction time.
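The validation rule can be illustrated with a small stand-in model. This is a minimal sketch, not anchor's implementation: `SketchAllocation` and `SketchBudget` are hypothetical dataclasses that mirror only the documented fields and the documented check.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SketchAllocation:
    """Hypothetical stand-in for BudgetAllocation (only the fields needed here)."""
    source: str
    max_tokens: int


@dataclass
class SketchBudget:
    """Hypothetical stand-in for TokenBudget; mirrors the documented check only."""
    total_tokens: int
    reserve_tokens: int = 0
    allocations: list[SketchAllocation] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.total_tokens <= 0:
            raise ValueError("total_tokens must be > 0")
        allocated = sum(a.max_tokens for a in self.allocations)
        if allocated + self.reserve_tokens > self.total_tokens:
            raise ValueError(
                f"allocations ({allocated}) + reserve ({self.reserve_tokens}) "
                f"exceed total_tokens ({self.total_tokens})"
            )


# 800 + 800 + 3200 = 4800 allocated, plus 1200 reserve = 6000 <= 8192: accepted.
ok = SketchBudget(
    total_tokens=8192,
    reserve_tokens=1200,
    allocations=[
        SketchAllocation("system", 800),
        SketchAllocation("memory", 800),
        SketchAllocation("retrieval", 3200),
    ],
)

# 4800 allocated plus 4000 reserve = 8800 > 8192: rejected at construction time.
try:
    SketchBudget(total_tokens=8192, reserve_tokens=4000, allocations=list(ok.allocations))
    over_budget_rejected = False
except ValueError:
    over_budget_rejected = True

print(over_budget_rejected)
```

Failing at construction time means a misconfigured budget surfaces when the application starts, not on the first request that happens to fill the window.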

BudgetAllocation

Defines how many tokens a single source type may consume.

from anchor import BudgetAllocation, SourceType

alloc = BudgetAllocation(
    source=SourceType.RETRIEVAL,
    max_tokens=3200,
    priority=5,
    overflow_strategy="truncate",  # or "drop"
)

Field	Type	Default	Description
`source`	`SourceType`	--	The source type this allocation applies to
`max_tokens`	`int`	--	Maximum tokens for this source (must be > 0)
`priority`	`int`	5	Priority used for ordering (1--10)
`overflow_strategy`	`"truncate" | "drop"`	`"truncate"`	What to do when the source exceeds its cap

Overflow Strategies

When a source produces more items than its allocation allows, the overflow strategy determines what happens.

Truncate (default)

Items are sorted by (-priority, -score): highest priority first, with ties broken by highest score. Items are kept in that order until the cap is reached; the rest overflow.

Source "retrieval" cap: 2000 tokens

Item A (800 tokens, score=0.95) --> KEPT     (800 / 2000)
Item B (700 tokens, score=0.85) --> KEPT     (1500 / 2000)
Item C (600 tokens, score=0.70) --> OVERFLOW (would exceed 2000)
Item D (400 tokens, score=0.60) --> OVERFLOW
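The walkthrough above can be reproduced with a few lines of plain Python. This is a sketch under stated assumptions, not the library's code: `Item` and `truncate_to_cap` are hypothetical names, and the rule that everything after the first non-fitting item overflows is inferred from the example, where Item D would fit on its own but is still marked as overflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for a ContextItem."""
    name: str
    tokens: int
    score: float
    priority: int = 5


def truncate_to_cap(items: list[Item], cap: int) -> tuple[list[Item], list[Item]]:
    """Split items into (kept, overflow) under a per-source token cap."""
    # Highest priority first, then highest score.
    ordered = sorted(items, key=lambda it: (-it.priority, -it.score))
    kept: list[Item] = []
    overflow: list[Item] = []
    used = 0
    cap_reached = False
    for item in ordered:
        if not cap_reached and used + item.tokens <= cap:
            kept.append(item)
            used += item.tokens
        else:
            # Assumption: once one item fails to fit, all later items overflow too.
            cap_reached = True
            overflow.append(item)
    return kept, overflow


items = [
    Item("A", 800, 0.95),
    Item("B", 700, 0.85),
    Item("C", 600, 0.70),
    Item("D", 400, 0.60),
]
kept, overflow = truncate_to_cap(items, cap=2000)
print([it.name for it in kept])      # ['A', 'B']
print([it.name for it in overflow])  # ['C', 'D']
```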

Drop

If the total tokens for the source exceed the cap, all items for that source are dropped. This is useful when partial retrieval context is worse than no retrieval context.

Source "retrieval" cap: 2000 tokens

Total items: 2500 tokens --> ALL DROPPED (exceeds cap)

Drop strategy

Use "drop" only when your application requires all-or-nothing behavior for a source. In most cases, "truncate" is the safer choice.
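The all-or-nothing behavior of "drop" reduces to a single comparison. The helper below is a hypothetical illustration working on bare token counts, not a function exported by the library; it assumes a source whose total exactly equals the cap is kept, since the text only says items are dropped when the total exceeds the cap.

```python
def apply_drop(token_counts: list[int], cap: int) -> list[int]:
    """Return every token count if the total fits the cap, otherwise nothing."""
    return list(token_counts) if sum(token_counts) <= cap else []


print(apply_drop([800, 700, 600, 400], cap=2000))  # [] because 2500 > 2000
print(apply_drop([800, 700], cap=2000))            # [800, 700] because 1500 <= 2000
```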

Reserve Tokens

The reserve_tokens value is subtracted from the pipeline's max_tokens to give the effective context window. This guarantees space for the LLM's response.

from anchor import ContextPipeline, TokenBudget

budget = TokenBudget(total_tokens=8192, reserve_tokens=1200)
pipeline = ContextPipeline(max_tokens=8192).with_budget(budget)
# Effective context window = 8192 - 1200 = 6992 tokens

The pipeline will raise a PipelineExecutionError if reserve_tokens >= max_tokens (leaving zero or negative space for context).
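The arithmetic and the guard can be sketched in isolation. `effective_context_window` is a hypothetical helper, and it raises ValueError only to stay dependency-free; the pipeline itself is documented to raise PipelineExecutionError.

```python
def effective_context_window(max_tokens: int, reserve_tokens: int) -> int:
    """Tokens left for context after holding back the response reserve."""
    if reserve_tokens >= max_tokens:
        # Stand-in for the pipeline's PipelineExecutionError.
        raise ValueError("reserve_tokens leaves no room for context")
    return max_tokens - reserve_tokens


print(effective_context_window(8192, 1200))  # 6992

try:
    effective_context_window(8192, 8192)
    reserve_rejected = False
except ValueError:
    reserve_rejected = True
print(reserve_rejected)  # True
```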

Shared Pool

Tokens not explicitly allocated to any source form the shared pool. Sources without an allocation compete for this pool during window assembly.

budget = TokenBudget(
    total_tokens=8192,
    reserve_tokens=1200,          # 1200
    allocations=[
        BudgetAllocation(source=SourceType.SYSTEM, max_tokens=800),    # 800
        BudgetAllocation(source=SourceType.RETRIEVAL, max_tokens=3200), # 3200
    ],
)
print(budget.shared_pool)  # 8192 - 1200 - 800 - 3200 = 2992

Items from sources with no explicit allocation (e.g., SourceType.MEMORY, SourceType.CONVERSATION, SourceType.USER in the example above) draw from the shared pool.

The get_allocation() method returns the per-source cap if one exists, or the shared pool size as a fallback:

print(budget.get_allocation(SourceType.SYSTEM))     # 800
print(budget.get_allocation(SourceType.RETRIEVAL))   # 3200
print(budget.get_allocation(SourceType.MEMORY))      # 2992 (shared pool)
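The fallback behavior amounts to a dictionary lookup with the shared pool as the default. The following is a sketch of that logic with hypothetical function names and plain strings in place of SourceType; it is not how the library is necessarily implemented.

```python
def shared_pool(total_tokens: int, reserve_tokens: int, caps: dict[str, int]) -> int:
    """Tokens left after the reserve and every explicit per-source cap."""
    return total_tokens - reserve_tokens - sum(caps.values())


def allocation_for(source: str, total_tokens: int, reserve_tokens: int,
                   caps: dict[str, int]) -> int:
    """Per-source cap if one exists, otherwise the shared pool size."""
    return caps.get(source, shared_pool(total_tokens, reserve_tokens, caps))


caps = {"system": 800, "retrieval": 3200}
print(shared_pool(8192, 1200, caps))               # 2992
print(allocation_for("system", 8192, 1200, caps))  # 800
print(allocation_for("memory", 8192, 1200, caps))  # 2992
```

Note that the fallback value is the size of the whole pool, not a per-source guarantee: every unallocated source sees the same 2992 and they compete for it.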

Preset Factories

Three factory functions provide sensible defaults for common application types. Each accepts a max_tokens parameter and returns a configured TokenBudget.

default_chat_budget

Optimized for conversational applications with moderate retrieval.

from anchor import default_chat_budget

budget = default_chat_budget(max_tokens=8192)

Source	Allocation	Percentage
System	819	10%
Memory	819	10%
Conversation	1638	20%
Retrieval	2048	25%
Reserve	1228	15%
Shared pool	--	20%

default_rag_budget

Optimized for RAG-heavy applications where retrieval dominates.

from anchor import default_rag_budget

budget = default_rag_budget(max_tokens=8192)

Source	Allocation	Percentage
System	819	10%
Memory	409	5%
Conversation	819	10%
Retrieval	3276	40%
Reserve	1228	15%
Shared pool	--	20%

default_agent_budget

Optimized for agentic applications with tool usage.

from anchor import default_agent_budget

budget = default_agent_budget(max_tokens=8192)

Source	Allocation	Percentage
System	1228	15%
Memory	819	10%
Conversation	1228	15%
Retrieval	1638	20%
Tool	1228	15%
Reserve	1228	15%
Shared pool	--	10%
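The allocation columns in the three preset tables are consistent with flooring each percentage of max_tokens, and the sketch below reproduces them for max_tokens=8192. The rounding rule is inferred from the tables and has not been checked against the library source; `PRESET_PERCENTAGES` and `split_tokens` are hypothetical names, and treating the shared pool as the remainder is likewise an assumption.

```python
# Percentages copied from the preset tables; the structure is hypothetical.
PRESET_PERCENTAGES = {
    "chat": {"system": 10, "memory": 10, "conversation": 20,
             "retrieval": 25, "reserve": 15},
    "rag": {"system": 10, "memory": 5, "conversation": 10,
            "retrieval": 40, "reserve": 15},
    "agent": {"system": 15, "memory": 10, "conversation": 15,
              "retrieval": 20, "tool": 15, "reserve": 15},
}


def split_tokens(max_tokens: int, percentages: dict[str, int]) -> dict[str, int]:
    """Floor each percentage of max_tokens and report what is left over."""
    parts = {name: max_tokens * pct // 100 for name, pct in percentages.items()}
    # Assumption: the shared pool is whatever remains after the floored parts.
    parts["shared_pool"] = max_tokens - sum(parts.values())
    return parts


for preset, percentages in PRESET_PERCENTAGES.items():
    print(preset, split_tokens(8192, percentages))
```

Under these assumptions the remainder comes out a few tokens above the nominal shared-pool percentage, because the fractions lost to flooring accumulate there.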

Custom budgets

The presets are a starting point. For production workloads, construct a TokenBudget directly with allocations tuned to your application's data distribution.

Using Budgets with the Pipeline

Attach a budget to the pipeline with .with_budget():

from anchor import ContextPipeline, default_rag_budget

budget = default_rag_budget(max_tokens=8192)
pipeline = (
    ContextPipeline(max_tokens=8192)
    .with_budget(budget)
    .add_system_prompt("You are a helpful assistant.")
)
result = pipeline.build("What is context engineering?")

You can also pass the budget directly to the constructor:

pipeline = ContextPipeline(max_tokens=8192, budget=budget)

Budget Diagnostics

When a budget is configured, the pipeline's diagnostics include extra fields that track how tokens were spent:

result = pipeline.build("What is context engineering?")
d = result.diagnostics

# Tokens used per source type
print(d.get("token_usage_by_source"))
# e.g. {"system": 45, "retrieval": 1200, "memory": 300}

# Tokens used by sources without explicit allocations
print(d.get("shared_pool_usage"))
# e.g. 300

# Items dropped because a source exceeded its cap
print(d.get("budget_overflow_by_source"))
# e.g. {"retrieval": 3}  -- 3 retrieval items were dropped
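For logging, the three fields can be folded into one line per source. The helper below is a hypothetical convenience, not part of the library; it assumes diagnostics behaves like a dict with the keys shown above, and it runs here against a hand-built dictionary using the example values from the comments.

```python
def summarize_budget_diagnostics(diagnostics: dict) -> list[str]:
    """Render per-source usage and overflow counts as log-friendly lines."""
    usage = diagnostics.get("token_usage_by_source") or {}
    overflow = diagnostics.get("budget_overflow_by_source") or {}
    lines = []
    for source in sorted(usage):
        dropped = overflow.get(source, 0)
        lines.append(f"{source}: {usage[source]} tokens, {dropped} dropped")
    lines.append(f"shared pool: {diagnostics.get('shared_pool_usage', 0)} tokens")
    return lines


# Hand-built stand-in for result.diagnostics, using the example values above.
example = {
    "token_usage_by_source": {"system": 45, "retrieval": 1200, "memory": 300},
    "shared_pool_usage": 300,
    "budget_overflow_by_source": {"retrieval": 3},
}
summary = summarize_budget_diagnostics(example)
for line in summary:
    print(line)
```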

Budget overflow vs window overflow

Budget overflow happens during per-source cap enforcement (Stage 4a). Window overflow happens when total items still exceed max_tokens after budget filtering (Stage 4b). Both are tracked in diagnostics.

Source Types

The SourceType enum defines the valid source categories:

Value	Description
`SourceType.SYSTEM`	System prompts and instructions
`SourceType.MEMORY`	Persistent memory entries
`SourceType.CONVERSATION`	Conversation history turns
`SourceType.RETRIEVAL`	RAG / search results
`SourceType.TOOL`	Tool or function call outputs
`SourceType.USER`	Direct user-provided context