
# Evaluation API Reference

All classes are importable from `anchor.evaluation`. For usage examples see the Evaluation Guide.


## RetrievalMetrics

Immutable Pydantic model holding metrics for a single retrieval evaluation. All values are bounded in [0.0, 1.0].

```python
from anchor.evaluation import RetrievalMetrics
```

| Field | Type | Description |
| --- | --- | --- |
| `precision_at_k` | `float` | Fraction of retrieved items that are relevant |
| `recall_at_k` | `float` | Fraction of relevant items that were retrieved |
| `f1_at_k` | `float` | Harmonic mean of precision and recall |
| `mrr` | `float` | Reciprocal of the rank of the first relevant item |
| `ndcg` | `float` | Normalized Discounted Cumulative Gain |
| `hit_rate` | `float` | 1.0 if at least one relevant item was retrieved |
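
Because this is a plain Pydantic model, it can also be constructed directly, for example in tests. The values below are illustrative, and the sketch assumes every field is accepted as a keyword argument:

```python
from anchor.evaluation import RetrievalMetrics

# Illustrative values; in practice these come from RetrievalMetricsCalculator.
metrics = RetrievalMetrics(
    precision_at_k=0.5,
    recall_at_k=0.25,
    f1_at_k=0.3333,
    mrr=1.0,
    ndcg=0.8,
    hit_rate=1.0,
)
print(metrics.f1_at_k)
```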

## RAGMetrics

Immutable Pydantic model for RAGAS-style RAG evaluation. All values are bounded in [0.0, 1.0].

```python
from anchor.evaluation import RAGMetrics
```

| Field | Type | Description |
| --- | --- | --- |
| `faithfulness` | `float` | How faithful the answer is to the provided contexts |
| `answer_relevancy` | `float` | How relevant the answer is to the query |
| `context_precision` | `float` | Precision of retrieved contexts for the query |
| `context_recall` | `float` | Recall of retrieved contexts against ground truth |

## EvaluationResult

Combines retrieval and RAG metrics into a single evaluation result.

```python
from anchor.evaluation import EvaluationResult
```

| Field | Type | Description |
| --- | --- | --- |
| `retrieval_metrics` | `RetrievalMetrics \| None` | Retrieval metrics, if computed |
| `rag_metrics` | `RAGMetrics \| None` | RAG metrics, if computed |
| `metadata` | `dict[str, Any]` | Arbitrary metadata for the run |

## RetrievalMetricsCalculator

Computes standard IR metrics from ranked results and a known set of relevant document IDs. No LLM dependencies.

```python
from anchor.evaluation import RetrievalMetricsCalculator
```

Constructor:

```python
RetrievalMetricsCalculator(k: int = 10)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `k` | `int` | `10` | Default cutoff for top-k evaluation |

**Warning:** Raises `ValueError` if `k < 1`.

### evaluate(retrieved, relevant, k=None) -> RetrievalMetrics

```python
def evaluate(
    self,
    retrieved: list[ContextItem],
    relevant: list[str],
    k: int | None = None,
) -> RetrievalMetrics
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `retrieved` | `list[ContextItem]` | -- | Items in ranked order |
| `relevant` | `list[str]` | -- | Ground-truth relevant document IDs |
| `k` | `int \| None` | `None` | Cutoff override; falls back to the instance default |

```python
from anchor.evaluation import RetrievalMetricsCalculator
from anchor.models.context import ContextItem, SourceType

calc = RetrievalMetricsCalculator(k=5)
items = [ContextItem(id="a", content="x", source=SourceType.RETRIEVAL)]
metrics = calc.evaluate(items, relevant=["a", "b"], k=3)
print(metrics.precision_at_k)  # 1.0
```

## LLMRAGEvaluator

RAGAS-style RAG evaluator driven by user-supplied callback functions. Each callback returns a float in [0.0, 1.0].

```python
from anchor.evaluation import LLMRAGEvaluator
```

Constructor:

```python
LLMRAGEvaluator(
    *,
    faithfulness_fn: Callable[[str, list[str]], float] | None = None,
    relevancy_fn: Callable[[str, str], float] | None = None,
    precision_fn: Callable[[str, list[str]], float] | None = None,
    recall_fn: Callable[[str, list[str], str], float] | None = None,
)
```

| Parameter | Type | Description |
| --- | --- | --- |
| `faithfulness_fn` | `(answer, contexts) -> float` | Grounding check |
| `relevancy_fn` | `(query, answer) -> float` | Relevance check |
| `precision_fn` | `(query, contexts) -> float` | Context precision |
| `recall_fn` | `(query, contexts, ground_truth) -> float` | Context recall |

### evaluate(query, answer, contexts, ground_truth=None) -> RAGMetrics

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `query` | `str` | -- | The original user query |
| `answer` | `str` | -- | The generated answer |
| `contexts` | `list[str]` | -- | Context strings fed to the generator |
| `ground_truth` | `str \| None` | `None` | Reference answer for recall |

Dimensions without registered callbacks return 0.0.
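
A minimal sketch of wiring in callbacks. The `overlap_score` helper below is a toy lexical-overlap heuristic standing in for a real LLM judge; it is illustrative only and not part of the library:

```python
from anchor.evaluation import LLMRAGEvaluator

def overlap_score(text: str, references: list[str]) -> float:
    # Toy heuristic standing in for an LLM judge; returns a value in [0.0, 1.0].
    words = set(text.lower().split())
    ref_words = set(" ".join(references).lower().split())
    return len(words & ref_words) / len(words) if words else 0.0

evaluator = LLMRAGEvaluator(
    faithfulness_fn=lambda answer, contexts: overlap_score(answer, contexts),
    relevancy_fn=lambda query, answer: overlap_score(answer, [query]),
)

metrics = evaluator.evaluate(
    query="What is NDCG?",
    answer="NDCG is a ranking quality metric.",
    contexts=["NDCG (Normalized Discounted Cumulative Gain) measures ranking quality."],
)
print(metrics.faithfulness, metrics.answer_relevancy)
# precision_fn and recall_fn were not registered, so those dimensions are 0.0.
```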


## PipelineEvaluator

Orchestrates retrieval and RAG evaluation into a single result.

```python
from anchor.evaluation import PipelineEvaluator
```

Constructor:

```python
PipelineEvaluator(
    *,
    retrieval_calculator: RetrievalMetricsCalculator | None = None,
    rag_evaluator: LLMRAGEvaluator | None = None,
)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `retrieval_calculator` | `RetrievalMetricsCalculator \| None` | `None` | Defaults to a new instance |
| `rag_evaluator` | `LLMRAGEvaluator \| None` | `None` | Optional LLM-based evaluator |

`evaluate_retrieval(retrieved, relevant, k=10) -> RetrievalMetrics` -- evaluates retrieval only.

`evaluate_rag(query, answer, contexts, ground_truth=None) -> RAGMetrics` -- evaluates RAG only.

**Warning:** Raises `ValueError` if no `rag_evaluator` was configured.

`evaluate(...) -> EvaluationResult` -- runs both evaluations.

```python
def evaluate(
    self,
    query: str,
    answer: str,
    retrieved: list[ContextItem],
    relevant: list[str],
    contexts: list[str],
    ground_truth: str | None = None,
    k: int = 10,
    metadata: dict[str, Any] | None = None,
) -> EvaluationResult
```
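
A sketch of a full pipeline evaluation, combining the calculator and callback-based evaluator shown above. The retrieved item, relevance labels, and the toy `relevancy_fn` are illustrative, not prescribed by the library:

```python
from anchor.evaluation import (
    LLMRAGEvaluator,
    PipelineEvaluator,
    RetrievalMetricsCalculator,
)
from anchor.models.context import ContextItem, SourceType

evaluator = PipelineEvaluator(
    retrieval_calculator=RetrievalMetricsCalculator(k=5),
    rag_evaluator=LLMRAGEvaluator(
        relevancy_fn=lambda query, answer: 1.0 if answer else 0.0,  # toy callback
    ),
)

retrieved = [
    ContextItem(id="doc-1", content="NDCG measures ranking quality.", source=SourceType.RETRIEVAL),
]

result = evaluator.evaluate(
    query="What does NDCG measure?",
    answer="NDCG measures ranking quality.",
    retrieved=retrieved,
    relevant=["doc-1"],
    contexts=[item.content for item in retrieved],
    k=5,
    metadata={"run": "demo"},
)
print(result.retrieval_metrics.precision_at_k, result.rag_metrics.answer_relevancy)
```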

## EvaluationSample (A/B testing)

A single evaluation sample. Defined in `anchor.evaluation.ab_testing`.

```python
from anchor.evaluation import EvaluationSample
```

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `query` | `str` | -- | The query string |
| `relevant_ids` | `list[str]` | `[]` | Relevant document IDs |
| `metadata` | `dict[str, Any]` | `{}` | Arbitrary metadata |

## EvaluationDataset (A/B testing)

A collection of `EvaluationSample` instances.

```python
from anchor.evaluation import EvaluationDataset
```

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `samples` | `list[EvaluationSample]` | `[]` | The evaluation samples |
| `name` | `str` | `""` | Optional dataset name |
| `metadata` | `dict[str, Any]` | `{}` | Arbitrary metadata |
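
A small dataset can be built directly from samples; the queries and document IDs below are illustrative:

```python
from anchor.evaluation import EvaluationDataset, EvaluationSample

dataset = EvaluationDataset(
    name="smoke-test",
    samples=[
        EvaluationSample(query="what is ndcg", relevant_ids=["doc-1"]),
        EvaluationSample(query="define mrr", relevant_ids=["doc-2", "doc-3"]),
    ],
)
print(len(dataset.samples))
```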

## AggregatedMetrics (A/B testing)

Aggregated retrieval metrics across multiple evaluation samples.

```python
from anchor.evaluation import AggregatedMetrics
```

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `mean_precision` | `float` | `0.0` | Mean precision@k |
| `mean_recall` | `float` | `0.0` | Mean recall@k |
| `mean_f1` | `float` | `0.0` | Mean F1@k |
| `mean_mrr` | `float` | `0.0` | Mean MRR |
| `mean_ndcg` | `float` | `0.0` | Mean NDCG |
| `num_samples` | `int` | `0` | Number of samples evaluated |

## ABTestResult

Result of an A/B test comparing two retrievers.

```python
from anchor.evaluation import ABTestResult
```

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `metrics_a` | `AggregatedMetrics` | -- | Metrics for retriever A |
| `metrics_b` | `AggregatedMetrics` | -- | Metrics for retriever B |
| `winner` | `str` | -- | `"a"`, `"b"`, or `"tie"` |
| `p_value` | `float` | -- | Paired t-test p-value |
| `is_significant` | `bool` | -- | Whether the result is statistically significant |
| `significance_level` | `float` | `0.05` | Threshold for significance |
| `per_metric_comparison` | `dict[str, dict[str, Any]]` | `{}` | Per-metric deltas |
| `metadata` | `dict[str, Any]` | `{}` | Arbitrary metadata |

## ABTestRunner

Runs an A/B test comparing two retrievers on a shared dataset.

```python
from anchor.evaluation import ABTestRunner
```

Constructor:

```python
ABTestRunner(evaluator: PipelineEvaluator, dataset: EvaluationDataset)
```

| Parameter | Type | Description |
| --- | --- | --- |
| `evaluator` | `PipelineEvaluator` | Evaluator for computing retrieval metrics |
| `dataset` | `EvaluationDataset` | Shared evaluation dataset |

### run(retriever_a, retriever_b, k=10, significance_level=0.05) -> ABTestResult

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `retriever_a` | `Retriever` | -- | First retriever |
| `retriever_b` | `Retriever` | -- | Second retriever |
| `k` | `int` | `10` | Top-k cutoff |
| `significance_level` | `float` | `0.05` | p-value threshold |
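
A sketch of running a comparison. `bm25_retriever` and `dense_retriever` are placeholders for two `Retriever` instances configured elsewhere in your application; the dataset is illustrative:

```python
from anchor.evaluation import (
    ABTestRunner,
    EvaluationDataset,
    EvaluationSample,
    PipelineEvaluator,
)

dataset = EvaluationDataset(
    samples=[EvaluationSample(query="what is ndcg", relevant_ids=["doc-1"])],
)
runner = ABTestRunner(evaluator=PipelineEvaluator(), dataset=dataset)

# bm25_retriever and dense_retriever are placeholder Retriever instances.
result = runner.run(bm25_retriever, dense_retriever, k=10, significance_level=0.05)
print(result.winner, result.p_value, result.is_significant)
```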

## BatchEvaluator

Runs evaluation over an entire dataset and aggregates results. Importable from `anchor.evaluation.batch`.

```python
from anchor.evaluation.batch import BatchEvaluator
```

Constructor:

```python
BatchEvaluator(*, evaluator: PipelineEvaluator, retriever: Retriever, top_k: int = 10)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `evaluator` | `PipelineEvaluator` | -- | Per-sample evaluator |
| `retriever` | `Retriever` | -- | Retriever for fetching items |
| `top_k` | `int` | `10` | Items to retrieve per query |

### evaluate(dataset, k=10) -> AggregatedMetrics

Returns the batch-module variant of `AggregatedMetrics`, with `count`, `mean_precision`, `mean_recall`, `mean_f1`, `mean_mrr`, `mean_ndcg`, `mean_hit_rate`, `p95_precision`, `p95_recall`, `min_precision`, `min_recall`, and `per_sample_results`.
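
A sketch of a batch run. `my_retriever` is a placeholder for a `Retriever` configured elsewhere; the dataset is illustrative:

```python
from anchor.evaluation import EvaluationDataset, EvaluationSample, PipelineEvaluator
from anchor.evaluation.batch import BatchEvaluator

dataset = EvaluationDataset(
    samples=[EvaluationSample(query="what is ndcg", relevant_ids=["doc-1"])],
)

# my_retriever is a placeholder Retriever instance.
batch = BatchEvaluator(evaluator=PipelineEvaluator(), retriever=my_retriever, top_k=10)
aggregated = batch.evaluate(dataset, k=10)
print(aggregated.mean_precision, aggregated.mean_ndcg)
```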


## HumanJudgment

A single human relevance judgment for a query-document pair.

```python
from anchor.evaluation import HumanJudgment
```

| Field | Type | Constraints | Description |
| --- | --- | --- | --- |
| `query` | `str` | -- | The evaluated query |
| `item_id` | `str` | -- | Document ID being judged |
| `relevance` | `int` | `0 <= x <= 3` | Relevance score |
| `annotator` | `str` | -- | Annotator identifier |
| `metadata` | `dict[str, Any]` | -- | Arbitrary metadata |

## HumanEvaluationCollector

Collects human relevance judgments and computes inter-annotator agreement.

```python
from anchor.evaluation import HumanEvaluationCollector
```

Constructor: `HumanEvaluationCollector()` -- no parameters.

Property: `judgments -> list[HumanJudgment]` -- a copy of all collected judgments.

`add_judgment(judgment: HumanJudgment) -> None` -- adds a single judgment.

`add_judgments(judgments: list[HumanJudgment]) -> None` -- adds multiple judgments.

`compute_agreement() -> float` -- Cohen's kappa over (query, item_id) pairs judged by at least two annotators. Returns 0.0 if no overlapping judgments exist.

`to_dataset(threshold: int = 2) -> EvaluationDataset` -- converts the judgments into an EvaluationDataset. Items whose mean relevance is at or above `threshold` are considered relevant.

`compute_metrics() -> dict[str, float]` -- returns `mean_relevance`, `agreement`, `num_judgments`, `num_annotators`, and `num_queries`.
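
A sketch of collecting judgments from two annotators and deriving a dataset. The judgments are illustrative, and `metadata` is omitted on the assumption that it defaults to an empty dict:

```python
from anchor.evaluation import HumanEvaluationCollector, HumanJudgment

collector = HumanEvaluationCollector()
collector.add_judgments([
    HumanJudgment(query="what is ndcg", item_id="doc-1", relevance=3, annotator="alice"),
    HumanJudgment(query="what is ndcg", item_id="doc-1", relevance=2, annotator="bob"),
    HumanJudgment(query="what is ndcg", item_id="doc-2", relevance=0, annotator="alice"),
])

print(collector.compute_agreement())  # Cohen's kappa over doubly-judged pairs
print(collector.compute_metrics())    # mean_relevance, agreement, counts

# doc-1 has mean relevance 2.5 >= threshold 2, so it becomes a relevant ID.
dataset = collector.to_dataset(threshold=2)
```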