Document Ingestion
Ingest raw text and files, chunk them with different strategies, enrich metadata, and index the results into a retriever.
Overview
This example demonstrates:
- Using DocumentIngester with different chunkers
- Comparing FixedSizeChunker, RecursiveCharacterChunker, and SentenceChunker
- Enriching chunk metadata with MetadataEnricher
- Indexing chunks into a DenseRetriever
- Querying the indexed documents
Full Example
import math
from anchor import (
    ContextPipeline,
    DenseRetriever,
    DocumentIngester,
    FixedSizeChunker,
    InMemoryContextStore,
    InMemoryVectorStore,
    MetadataEnricher,
    QueryBundle,
    RecursiveCharacterChunker,
    SentenceChunker,
    retriever_step,
)
# ---------------------------------------------------------------
# 1. Deterministic embedding function
# ---------------------------------------------------------------
def embed_fn(text: str) -> list[float]:
    # Toy embedding: seed from the character codes, then normalize to unit length.
    seed = sum(ord(c) for c in text) % 10000
    raw = [math.sin(seed * 1000 + i) for i in range(64)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw
# ---------------------------------------------------------------
# 2. Sample document (a technical article)
# ---------------------------------------------------------------
article = """
# Context Engineering for AI

Context engineering is the emerging discipline of designing systems that
provide language models with the right information at the right time.
Unlike prompt engineering, which focuses on crafting individual prompts,
context engineering builds entire pipelines that assemble context from
multiple sources.

## Why Context Matters

Language models are only as good as the context they receive. A model
with perfect reasoning but poor context will produce poor results.
Context engineering addresses this by systematically managing what
information flows into the model's context window.

## Key Components

The core components of a context engineering system include:

1. Retrieval: Finding relevant documents from a knowledge base using
   dense embeddings, sparse keyword matching, or hybrid approaches.
2. Memory: Managing conversation history with token-aware sliding
   windows that automatically evict old turns when the budget fills.
3. Token Budgets: Allocating finite context window space across
   competing sources like system prompts, memory, and retrieval.
4. Formatting: Converting assembled context into the specific format
   required by each LLM provider (Anthropic, OpenAI, etc.).

## Best Practices

Start simple and add complexity only when needed. A basic pipeline
with a single retriever and sliding-window memory handles most use
cases. Add reranking, hybrid retrieval, and query transformation
only after measuring baseline performance with evaluation metrics.
"""
# ---------------------------------------------------------------
# 3. Compare chunking strategies
# ---------------------------------------------------------------
print("=== Chunking Strategy Comparison ===\n")
chunkers = {
    "FixedSize(100 tokens)": FixedSizeChunker(chunk_size=100, overlap=20),
    "Recursive(100 tokens)": RecursiveCharacterChunker(chunk_size=100, overlap=20),
    "Sentence(100 tokens)": SentenceChunker(chunk_size=100, overlap=1),
}
for name, chunker in chunkers.items():
    ingester = DocumentIngester(chunker=chunker)
    items = ingester.ingest_text(article, doc_id="context-engineering")
    print(f"{name}:")
    print(f" Chunks produced: {len(items)}")
    for i, item in enumerate(items):
        tokens = item.token_count
        preview = item.content[:60].replace("\n", " ")
        print(f" [{i}] ({tokens} tokens) {preview}...")
    print()
# ---------------------------------------------------------------
# 4. Use MetadataEnricher for custom metadata
# ---------------------------------------------------------------
print("=== Metadata Enrichment ===\n")
class TopicTagger(MetadataEnricher):
    """Tag chunks with topics based on keyword presence."""

    TOPICS = {
        "retrieval": ["retrieval", "embedding", "search", "keyword"],
        "memory": ["memory", "conversation", "sliding", "evict"],
        "formatting": ["format", "anthropic", "openai", "provider"],
        "budgets": ["token", "budget", "allocat", "window"],
    }

    def enrich(
        self,
        chunk_text: str,
        chunk_index: int,
        total_chunks: int,
        metadata: dict,
    ) -> dict:
        text_lower = chunk_text.lower()
        tags = []
        for topic, keywords in self.TOPICS.items():
            if any(kw in text_lower for kw in keywords):
                tags.append(topic)
        metadata["topics"] = tags
        metadata["position"] = f"{chunk_index + 1}/{total_chunks}"
        return metadata


ingester = DocumentIngester(
    chunker=RecursiveCharacterChunker(chunk_size=120, overlap=20),
    enricher=TopicTagger(),
)
items = ingester.ingest_text(article, doc_id="context-engineering")
for item in items:
    topics = item.metadata.get("topics", [])
    pos = item.metadata.get("position", "?")
    preview = item.content[:50].replace("\n", " ")
    print(f" [{pos}] topics={topics} | {preview}...")
# ---------------------------------------------------------------
# 5. Index chunks into a DenseRetriever
# ---------------------------------------------------------------
print("\n=== Indexing into DenseRetriever ===\n")
dense = DenseRetriever(
    vector_store=InMemoryVectorStore(),
    context_store=InMemoryContextStore(),
    embed_fn=embed_fn,
)
count = dense.index(items)
print(f" Indexed {count} chunks")
# ---------------------------------------------------------------
# 6. Query the indexed documents
# ---------------------------------------------------------------
print("\n=== Querying Indexed Documents ===\n")
pipeline = (
    ContextPipeline(max_tokens=2048)
    .add_step(retriever_step("search", dense, top_k=3))
    .add_system_prompt("Answer based on the provided context.")
)
queries = [
    "What is context engineering?",
    "How does memory management work?",
    "What are the best practices?",
]
for q in queries:
    query = QueryBundle(query_str=q, embedding=embed_fn(q))
    result = pipeline.build(query)
    # Keep only the items that came from the retriever step.
    retrieved = [
        item for item in result.window.items
        if item.source.value == "retrieval"
    ]
    print(f" Q: {q}")
    print(f" Results: {len(retrieved)} items")
    for r in retrieved:
        score = f"{r.score:.3f}"
        topics = r.metadata.get("topics", [])
        print(f" [{score}] topics={topics} "
              f"| {r.content[:50].replace(chr(10), ' ')}...")
    print()
# ---------------------------------------------------------------
# 7. Inspect chunk metadata
# ---------------------------------------------------------------
print("=== Chunk Metadata Details ===\n")
sample = items[0]
print(f" id: {sample.id}")
print(f" source: {sample.source.value}")
print(f" priority: {sample.priority}")
print(f" token_count: {sample.token_count}")
print(" metadata:")
for key, value in sample.metadata.items():
    print(f" {key}: {value}")
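The embed_fn in step 1 is a stand-in for a real embedding model: it derives a seed from the sum of the character codes, so it is deterministic but carries no semantic signal. The sketch below repeats the function so that it runs on its own, then checks two properties that follow from its definition: the output has unit length, and two strings made of the same characters in a different order get the same vector. The cosine helper is a hypothetical illustration and not part of anchor.

```python
import math


def embed_fn(text: str) -> list[float]:
    # Same toy embedding as in the full example.
    seed = sum(ord(c) for c in text) % 10000
    raw = [math.sin(seed * 1000 + i) for i in range(64)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw


def cosine(a: list[float], b: list[float]) -> float:
    # Hypothetical helper: cosine similarity of two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vec = embed_fn("context engineering")
length = math.sqrt(sum(x * x for x in vec))

# Anagrams share a character-code sum, so they map to the same vector.
same = cosine(embed_fn("listen"), embed_fn("silent"))
```

Because of this, the scores printed in step 6 say nothing about meaning. Swap in a real embedding model before drawing conclusions about chunking or retrieval quality.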
Key Concepts
Built-in Chunkers
| Chunker | Strategy | Best For |
|---|---|---|
| FixedSizeChunker | Fixed token count with overlap | Uniform chunk sizes |
| RecursiveCharacterChunker | Hierarchy: paragraph, line, sentence, word | General-purpose text |
| SentenceChunker | Sentence-boundary aware | Prose and articles |
| SemanticChunker | Embedding similarity boundaries | Topically diverse text |
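To make the fixed-size strategy in the table concrete, here is a minimal sketch of chunking with overlap. It assumes whitespace-separated words as a stand-in for tokens and uses a hypothetical function name, fixed_size_chunks. It is not FixedSizeChunker's implementation, which may tokenize and handle boundaries differently.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size words; adjacent windows share overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()  # assumption: one word approximates one token
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample_text = " ".join(f"w{i}" for i in range(10))
chunks = fixed_size_chunks(sample_text, chunk_size=4, overlap=1)
```

With ten words, a window of four and an overlap of one, each chunk starts three words after the previous one, and the last word of a chunk reappears as the first word of the next.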
Built-in Parsers
DocumentIngester auto-detects parsers by file extension:
| Parser | Extensions | Dependencies |
|---|---|---|
| PlainTextParser | .txt | None |
| MarkdownParser | .md, .markdown | None |
| HTMLParser | .html, .htm | None |
| PDFParser | .pdf | pypdf (optional) |
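Extension-based detection can be pictured as a lookup from file suffix to parser. The sketch below illustrates the general technique and is not anchor's internals: the PARSERS_BY_EXTENSION mapping and the pick_parser function are hypothetical names, the case-insensitive match is an assumption, and only the parser names and extensions come from the table above.

```python
from pathlib import Path

# Hypothetical mapping built from the table above.
PARSERS_BY_EXTENSION = {
    ".txt": "PlainTextParser",
    ".md": "MarkdownParser",
    ".markdown": "MarkdownParser",
    ".html": "HTMLParser",
    ".htm": "HTMLParser",
    ".pdf": "PDFParser",
}


def pick_parser(path: Path) -> str:
    """Return the parser name for a path, matching the suffix case-insensitively."""
    suffix = path.suffix.lower()
    try:
        return PARSERS_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"no parser registered for {suffix!r}") from None


parser_name = pick_parser(Path("docs/guide.md"))
```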
Ingesting Files
from pathlib import Path

from anchor import DocumentIngester, RecursiveCharacterChunker

ingester = DocumentIngester(
    chunker=RecursiveCharacterChunker(chunk_size=512),
)
# Single file
items = ingester.ingest_file(Path("docs/guide.md"))
# Entire directory (recursive)
items = ingester.ingest_directory(
    Path("docs/"),
    extensions=[".md", ".txt"],
)
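A recursive directory ingest with an extension filter amounts to walking the tree and keeping the matching files. The following sketch uses only the standard library to show that selection step. collect_files is a hypothetical helper, and how ingest_directory orders or filters files internally is not specified here.

```python
import tempfile
from pathlib import Path


def collect_files(root: Path, extensions: list[str]) -> list[Path]:
    """Recursively list files under root whose suffix is in extensions."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )


# Build a small throwaway tree to demonstrate the filter.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "guide.md").write_text("# Guide")
    (root / "nested").mkdir()
    (root / "nested" / "notes.txt").write_text("notes")
    (root / "nested" / "logo.png").write_bytes(b"")
    found = [
        p.relative_to(root).as_posix()
        for p in collect_files(root, [".md", ".txt"])
    ]
```

The image file is skipped because its suffix is not in the list, while the text file in the nested directory is picked up by the recursive walk.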
Chunk Size Guidelines
- Small chunks (100-200 tokens): better precision, more chunks to search
- Large chunks (500-1000 tokens): more context per result, fewer chunks
- Start with 256-512 tokens and adjust based on evaluation metrics
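The trade-off in these guidelines can be estimated before ingesting anything. For a document of N tokens split into windows of size S with overlap O, a sliding window advances S - O tokens at a time, which gives roughly ceil((N - O) / (S - O)) chunks. The sketch below applies that estimate. estimate_chunk_count is a hypothetical helper, and chunkers that respect sentence or paragraph boundaries will deviate from it.

```python
def estimate_chunk_count(total_tokens: int, chunk_size: int, overlap: int = 0) -> int:
    """Estimate how many sliding-window chunks a document produces."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - overlap
    # Integer ceiling of (total_tokens - overlap) / step.
    return (total_tokens - overlap + step - 1) // step


# A 10,000-token document at three chunk sizes, each with about 10% overlap.
for size in (128, 512, 1000):
    overlap = size // 10
    print(size, estimate_chunk_count(10_000, size, overlap))
```

Smaller windows multiply the number of vectors to store and search, which is the cost side of the precision they buy.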
SemanticChunker Signature
SemanticChunker requires a batch embedding function with signature (list[str]) -> list[list[float]], not the single-string embed_fn used by DenseRetriever.
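If only a single-string embedding function is available, one way to satisfy that batch signature is a thin wrapper that maps it over the input list. This is a sketch: to_batch_embed_fn is a hypothetical helper, the embed_fn shown is the toy function from the full example, and how the resulting function is passed to SemanticChunker depends on its constructor, which is not shown here.

```python
import math
from typing import Callable

EmbedFn = Callable[[str], list[float]]
BatchEmbedFn = Callable[[list[str]], list[list[float]]]


def embed_fn(text: str) -> list[float]:
    # Toy single-string embedding from the full example.
    seed = sum(ord(c) for c in text) % 10000
    raw = [math.sin(seed * 1000 + i) for i in range(64)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw


def to_batch_embed_fn(single: EmbedFn) -> BatchEmbedFn:
    """Wrap a single-string embedder so it accepts a list of strings."""
    def batch(texts: list[str]) -> list[list[float]]:
        return [single(t) for t in texts]
    return batch


batch_embed_fn = to_batch_embed_fn(embed_fn)
vectors = batch_embed_fn(["first sentence.", "second sentence."])
```

With a real embedding service, a native batch call is usually preferable to a loop like this, since it avoids one request per sentence.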
Next Steps
- RAG Pipeline -- build a retrieval pipeline around your documents
- Custom Retriever -- implement a custom storage backend
- Evaluation Workflow -- measure chunking quality