Evaluation API Reference¶
All classes are importable from anchor.evaluation. For usage examples see the Evaluation Guide.
RetrievalMetrics¶
Immutable Pydantic model holding metrics for a single retrieval evaluation. All values are bounded in [0.0, 1.0].
| Field | Type | Description |
|---|---|---|
| precision_at_k | float | Fraction of retrieved items that are relevant |
| recall_at_k | float | Fraction of relevant items that were retrieved |
| f1_at_k | float | Harmonic mean of precision and recall |
| mrr | float | Reciprocal of the rank of the first relevant item |
| ndcg | float | Normalized Discounted Cumulative Gain |
| hit_rate | float | 1.0 if at least one relevant item was retrieved |
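Because RetrievalMetrics is a plain Pydantic model, it can also be constructed directly, which is convenient for tests and fixtures. A minimal sketch, assuming standard Pydantic keyword construction; the values below are illustrative:

```python
from anchor.evaluation import RetrievalMetrics

# Build a metrics object by hand, e.g. as a test fixture.
# Every value must fall within [0.0, 1.0].
metrics = RetrievalMetrics(
    precision_at_k=0.5,
    recall_at_k=0.25,
    f1_at_k=1 / 3,   # harmonic mean of 0.5 and 0.25
    mrr=1.0,
    ndcg=0.63,
    hit_rate=1.0,
)
print(metrics.f1_at_k)
```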
RAGMetrics¶
Immutable Pydantic model for RAGAS-style RAG evaluation. All values bounded in [0.0, 1.0].
| Field | Type | Description |
|---|---|---|
| faithfulness | float | How faithful the answer is to the provided contexts |
| answer_relevancy | float | How relevant the answer is to the query |
| context_precision | float | Precision of retrieved contexts for the query |
| context_recall | float | Recall of retrieved contexts against ground truth |
EvaluationResult¶
Combines retrieval and RAG metrics into a single evaluation result.
| Field | Type | Description |
|---|---|---|
| retrieval_metrics | RetrievalMetrics \| None | Retrieval metrics, if computed |
| rag_metrics | RAGMetrics \| None | RAG metrics, if computed |
| metadata | dict[str, Any] | Arbitrary metadata for the run |
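A minimal sketch of consuming an EvaluationResult, assuming standard Pydantic keyword construction. Both metric blocks are optional, so check for None before reading them:

```python
from anchor.evaluation import EvaluationResult, RetrievalMetrics

# Either metrics block may be absent, depending on what was computed.
result = EvaluationResult(
    retrieval_metrics=RetrievalMetrics(
        precision_at_k=1.0, recall_at_k=0.5, f1_at_k=2 / 3,
        mrr=1.0, ndcg=1.0, hit_rate=1.0,
    ),
    rag_metrics=None,
    metadata={"run_id": "demo"},
)

if result.retrieval_metrics is not None:
    print(result.retrieval_metrics.recall_at_k)
if result.rag_metrics is None:
    print("RAG metrics were not computed for this run")
```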
RetrievalMetricsCalculator¶
Computes standard IR metrics from ranked results and a known set of relevant document IDs. No LLM dependencies.
Constructor:
| Parameter | Type | Default | Description |
|---|---|---|---|
| k | int | 10 | Default cutoff for top-k evaluation |
Warning
Raises ValueError if k < 1.
evaluate(retrieved, relevant, k=None) -> RetrievalMetrics
```python
def evaluate(
    self,
    retrieved: list[ContextItem],
    relevant: list[str],
    k: int | None = None,
) -> RetrievalMetrics
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| retrieved | list[ContextItem] | -- | Items in ranked order |
| relevant | list[str] | -- | Ground-truth relevant document IDs |
| k | int \| None | None | Cutoff override; falls back to instance default |
```python
from anchor.evaluation import RetrievalMetricsCalculator
from anchor.models.context import ContextItem, SourceType

calc = RetrievalMetricsCalculator(k=5)
items = [ContextItem(id="a", content="x", source=SourceType.RETRIEVAL)]
metrics = calc.evaluate(items, relevant=["a", "b"], k=3)
print(metrics.precision_at_k)  # 1.0
```
LLMRAGEvaluator¶
RAGAS-style RAG evaluator driven by user-supplied callback functions. Each callback returns a float in [0.0, 1.0].
Constructor:
```python
LLMRAGEvaluator(
    *,
    faithfulness_fn: Callable[[str, list[str]], float] | None = None,
    relevancy_fn: Callable[[str, str], float] | None = None,
    precision_fn: Callable[[str, list[str]], float] | None = None,
    recall_fn: Callable[[str, list[str], str], float] | None = None,
)
```
| Parameter | Type | Description |
|---|---|---|
| faithfulness_fn | (answer, contexts) -> float | Grounding check |
| relevancy_fn | (query, answer) -> float | Relevance check |
| precision_fn | (query, contexts) -> float | Context precision |
| recall_fn | (query, contexts, ground_truth) -> float | Context recall |
evaluate(query, answer, contexts, ground_truth=None) -> RAGMetrics
| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | -- | The original user query |
| answer | str | -- | The generated answer |
| contexts | list[str] | -- | Context strings fed to the generator |
| ground_truth | str \| None | None | Reference answer for recall |
Dimensions without registered callbacks return 0.0.
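A minimal sketch of wiring callbacks, using toy heuristics in place of real LLM judges; the callback bodies below are illustrative and not part of the library:

```python
from anchor.evaluation import LLMRAGEvaluator

# Toy heuristics standing in for LLM-backed judges; each returns a
# score in [0.0, 1.0], as the evaluator expects.
def fake_faithfulness(answer: str, contexts: list[str]) -> float:
    return 1.0 if any(answer in c or c in answer for c in contexts) else 0.0

def fake_relevancy(query: str, answer: str) -> float:
    overlap = set(query.lower().split()) & set(answer.lower().split())
    return min(1.0, len(overlap) / max(1, len(query.split())))

evaluator = LLMRAGEvaluator(
    faithfulness_fn=fake_faithfulness,
    relevancy_fn=fake_relevancy,
)

metrics = evaluator.evaluate(
    query="What is the capital of France?",
    answer="Paris is the capital of France.",
    contexts=["Paris is the capital of France."],
)
print(metrics.faithfulness, metrics.answer_relevancy)
# context_precision and context_recall are 0.0: no callbacks were registered
```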
PipelineEvaluator¶
Orchestrates retrieval and RAG evaluation into a single result.
Constructor:
```python
PipelineEvaluator(
    *,
    retrieval_calculator: RetrievalMetricsCalculator | None = None,
    rag_evaluator: LLMRAGEvaluator | None = None,
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| retrieval_calculator | RetrievalMetricsCalculator \| None | None | Defaults to a new instance |
| rag_evaluator | LLMRAGEvaluator \| None | None | Optional LLM-based evaluator |
evaluate_retrieval(retrieved, relevant, k=10) -> RetrievalMetrics -- evaluates retrieval only.
evaluate_rag(query, answer, contexts, ground_truth=None) -> RAGMetrics -- evaluates RAG only.
Warning
Raises ValueError if no rag_evaluator was configured.
evaluate(...) -> EvaluationResult -- runs both evaluations.
```python
def evaluate(
    self,
    query: str,
    answer: str,
    retrieved: list[ContextItem],
    relevant: list[str],
    contexts: list[str],
    ground_truth: str | None = None,
    k: int = 10,
    metadata: dict[str, Any] | None = None,
) -> EvaluationResult
```
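A hedged end-to-end sketch that pairs the default retrieval calculator with a stub RAG evaluator. The ContextItem construction mirrors the earlier example, and the constant relevancy callback is illustrative only:

```python
from anchor.evaluation import LLMRAGEvaluator, PipelineEvaluator
from anchor.models.context import ContextItem, SourceType

# Default retrieval calculator plus a stub RAG evaluator with one callback.
evaluator = PipelineEvaluator(
    rag_evaluator=LLMRAGEvaluator(relevancy_fn=lambda query, answer: 1.0),
)

retrieved = [
    ContextItem(id="a", content="Paris is the capital of France.", source=SourceType.RETRIEVAL),
    ContextItem(id="b", content="Berlin is the capital of Germany.", source=SourceType.RETRIEVAL),
]

result = evaluator.evaluate(
    query="What is the capital of France?",
    answer="Paris.",
    retrieved=retrieved,
    relevant=["a"],
    contexts=[item.content for item in retrieved],
    k=2,
    metadata={"experiment": "demo"},
)
print(result.retrieval_metrics.precision_at_k)  # 0.5: one of two retrieved items is relevant
print(result.rag_metrics.answer_relevancy)      # 1.0 from the stub callback
```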
EvaluationSample (A/B testing)¶
A single evaluation sample. Defined in anchor.evaluation.ab_testing.
| Field | Type | Default | Description |
|---|---|---|---|
| query | str | -- | The query string |
| relevant_ids | list[str] | [] | Relevant document IDs |
| metadata | dict[str, Any] | {} | Arbitrary metadata |
EvaluationDataset (A/B testing)¶
A collection of EvaluationSample instances.
| Field | Type | Default | Description |
|---|---|---|---|
| samples | list[EvaluationSample] | [] | The evaluation samples |
| name | str | "" | Optional dataset name |
| metadata | dict[str, Any] | {} | Arbitrary metadata |
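A minimal sketch of building a dataset; unspecified fields are left at their defaults, and the import path follows the note at the top of this page:

```python
from anchor.evaluation import EvaluationDataset, EvaluationSample

dataset = EvaluationDataset(
    name="capitals-smoke-test",
    samples=[
        EvaluationSample(query="capital of France", relevant_ids=["doc-paris"]),
        EvaluationSample(query="capital of Germany", relevant_ids=["doc-berlin"]),
    ],
)
print(len(dataset.samples))  # 2
```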
AggregatedMetrics (A/B testing)¶
Aggregated retrieval metrics across multiple evaluation samples.
| Field | Type | Default | Description |
|---|---|---|---|
| mean_precision | float | 0.0 | Mean precision@k |
| mean_recall | float | 0.0 | Mean recall@k |
| mean_f1 | float | 0.0 | Mean F1@k |
| mean_mrr | float | 0.0 | Mean MRR |
| mean_ndcg | float | 0.0 | Mean NDCG |
| num_samples | int | 0 | Number of samples evaluated |
ABTestResult¶
Result of an A/B test comparing two retrievers.
| Field | Type | Default | Description |
|---|---|---|---|
| metrics_a | AggregatedMetrics | -- | Metrics for retriever A |
| metrics_b | AggregatedMetrics | -- | Metrics for retriever B |
| winner | str | -- | "a", "b", or "tie" |
| p_value | float | -- | Paired t-test p-value |
| is_significant | bool | -- | Whether the result is statistically significant |
| significance_level | float | 0.05 | Threshold for significance |
| per_metric_comparison | dict[str, dict[str, Any]] | {} | Per-metric deltas |
| metadata | dict[str, Any] | {} | Arbitrary metadata |
ABTestRunner¶
Runs an A/B test comparing two retrievers on a shared dataset.
Constructor:
| Parameter | Type | Description |
|---|---|---|
| evaluator | PipelineEvaluator | Evaluator for computing retrieval metrics |
| dataset | EvaluationDataset | Shared evaluation dataset |
run(retriever_a, retriever_b, k=10, significance_level=0.05) -> ABTestResult
| Parameter | Type | Default | Description |
|---|---|---|---|
| retriever_a | Retriever | -- | First retriever |
| retriever_b | Retriever | -- | Second retriever |
| k | int | 10 | Top-k cutoff |
| significance_level | float | 0.05 | p-value threshold |
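A hedged sketch of running a comparison. Here retriever_a and retriever_b stand in for existing Retriever instances (their construction is outside this reference), and dataset is an EvaluationDataset such as the one built above:

```python
from anchor.evaluation import ABTestRunner, PipelineEvaluator

# retriever_a and retriever_b are assumed to be pre-built Retriever instances;
# dataset is the EvaluationDataset constructed in the previous example.
runner = ABTestRunner(evaluator=PipelineEvaluator(), dataset=dataset)
result = runner.run(retriever_a, retriever_b, k=10, significance_level=0.05)

print(result.winner)                          # "a", "b", or "tie"
print(result.p_value, result.is_significant)  # paired t-test outcome
print(result.metrics_a.mean_ndcg, result.metrics_b.mean_ndcg)
```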
BatchEvaluator¶
Runs evaluation over an entire dataset and aggregates results. Importable from anchor.evaluation.batch.
Constructor:
| Parameter | Type | Default | Description |
|---|---|---|---|
| evaluator | PipelineEvaluator | -- | Per-sample evaluator |
| retriever | Retriever | -- | Retriever for fetching items |
| top_k | int | 10 | Items to retrieve per query |
evaluate(dataset, k=10) -> AggregatedMetrics
Returns AggregatedMetrics (batch module variant) with count, mean_precision, mean_recall, mean_f1, mean_mrr, mean_ndcg, mean_hit_rate, p95_precision, p95_recall, min_precision, min_recall, and per_sample_results.
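A hedged sketch of batch evaluation; as with the A/B runner, the retriever and dataset are assumed to already exist:

```python
from anchor.evaluation import PipelineEvaluator
from anchor.evaluation.batch import BatchEvaluator

# `retriever` is assumed to be an existing Retriever instance; `dataset` is an
# EvaluationDataset such as the one built in the A/B testing example.
batch = BatchEvaluator(evaluator=PipelineEvaluator(), retriever=retriever, top_k=10)
aggregated = batch.evaluate(dataset, k=10)

print(aggregated.count, aggregated.mean_precision, aggregated.mean_ndcg)
print(aggregated.p95_precision, aggregated.min_recall)
```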
HumanJudgment¶
A single human relevance judgment for a query-document pair.
| Field | Type | Constraints | Description |
|---|---|---|---|
| query | str | -- | The evaluated query |
| item_id | str | -- | Document ID being judged |
| relevance | int | 0 <= x <= 3 | Relevance score |
| annotator | str | -- | Annotator identifier |
| metadata | dict[str, Any] | -- | Arbitrary metadata |
HumanEvaluationCollector¶
Collects human relevance judgments and computes inter-annotator agreement.
Constructor: HumanEvaluationCollector() -- no parameters.
Property: judgments -> list[HumanJudgment] -- copy of all collected judgments.
add_judgment(judgment: HumanJudgment) -> None -- add a single judgment.
add_judgments(judgments: list[HumanJudgment]) -> None -- add multiple judgments.
compute_agreement() -> float -- Cohen's kappa over (query, item_id) pairs judged by at least two annotators. Returns 0.0 if no overlapping judgments exist.
to_dataset(threshold: int = 2) -> EvaluationDataset -- converts judgments into an EvaluationDataset. Items with mean relevance at or above threshold are considered relevant.
compute_metrics() -> dict[str, float] -- returns mean_relevance, agreement, num_judgments, num_annotators, num_queries.
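A minimal sketch of collecting judgments and deriving a dataset; the metadata field is omitted and assumed to have a default:

```python
from anchor.evaluation import HumanEvaluationCollector, HumanJudgment

collector = HumanEvaluationCollector()
collector.add_judgments([
    HumanJudgment(query="capital of France", item_id="doc-paris", relevance=3, annotator="alice"),
    HumanJudgment(query="capital of France", item_id="doc-paris", relevance=2, annotator="bob"),
    HumanJudgment(query="capital of France", item_id="doc-rome", relevance=0, annotator="alice"),
])

print(collector.compute_agreement())  # Cohen's kappa over the overlapping judgments
print(collector.compute_metrics())    # mean_relevance, agreement, num_judgments, ...

# Items with mean relevance >= threshold become relevant_ids in the derived dataset.
dataset = collector.to_dataset(threshold=2)
print(dataset.samples[0].relevant_ids)  # expected to contain "doc-paris"
```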