
Standard evaluation function for optimizers. Executes a module on a dataset, applies a metric to each example, and returns detailed per-example results plus aggregated statistics.

This is the core evaluation function used by all optimizers. It wraps evaluate() with enhanced output (see the inspection sketch after this list) including:

  • Per-example timing and error information

  • Aggregated cost tracking

  • Standard error computation

  • Multi-epoch evaluation for statistical significance (when epochs > 1)
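For orientation, a minimal inspection sketch, assuming an EvalResult named result such as the one created in the Examples below (and assuming the per-example error column is NA for successful rows; that semantics is not stated here):

# Per-example error information: rows whose error column is non-missing
subset(result@examples, !is.na(error))

# Aggregated cost tracking
result@total_tokens
result@total_cost
result@total_latency_ms

# Standard error of per-example scores (SD / sqrt(n))
result@std_error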

Usage

eval_program(
  program,
  dataset,
  metric,
  .llm = NULL,
  control = NULL,
  epochs = 1L,
  ...
)

Arguments

program

A DSPrrr module to evaluate.

dataset

A data frame containing test examples.

metric

A metric function for scoring predictions.

.llm

Optional ellmer Chat object for LLM calls.

control

An OptimizerControl object or NULL for defaults.

epochs

Integer; number of times to repeat the evaluation for statistical significance. Defaults to 1L. When > 1, the result additionally reports the standard deviation of mean scores across epochs and a 95% confidence interval.

...

Additional arguments passed to evaluate().

Value

An EvalResult object containing:

  • examples: tibble with per-example row_id, score, error, predicted, and input columns (prefixed with input_*)

  • mean_score: mean score across successful evaluations

  • std_error: standard error of per-example scores (SD / sqrt(n))

  • n_evaluated: number of successful evaluations

  • n_errors: number of failed evaluations

  • total_tokens: total tokens used

  • total_cost: total cost in USD

  • total_latency_ms: total time in milliseconds

When epochs > 1, additional fields:

  • epochs: number of epochs run

  • epoch_scores: list of score vectors, one per epoch

  • score_std: standard deviation of mean scores across epochs

  • ci_lower, ci_upper: 95% confidence interval bounds
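These multi-epoch statistics can be recomputed by hand from epoch_scores. A minimal sketch, assuming the interval is a normal approximation centered on the grand mean of the epoch means (the package's exact formula may differ):

# Mean score for each epoch, from the list of per-example score vectors
epoch_means <- vapply(result@epoch_scores, mean, numeric(1), na.rm = TRUE)

# Spread of the epoch means and a 1.96-sigma (95%) interval around their mean
score_std <- sd(epoch_means)
grand_mean <- mean(epoch_means)
c(ci_lower = grand_mean - 1.96 * score_std,
  ci_upper = grand_mean + 1.96 * score_std)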

Examples

if (FALSE) { # \dontrun{
# A minimal predict module with a question -> answer signature
sig <- signature("question -> answer")
mod <- module(sig, type = "predict")

# A tiny two-example test set; `answer` holds the expected outputs
dataset <- tibble::tibble(
  question = c("What is 2+2?", "What is 3+3?"),
  answer = c("4", "6")
)

result <- eval_program(
  mod,
  dataset,
  metric = metric_exact_match(field = "answer"),
  .llm = ellmer::chat_openai()
)

result@mean_score  # aggregate score across the dataset
result@examples    # per-example scores, errors, predictions, and inputs
} # }
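
When epochs > 1, the same call also populates the multi-epoch fields described under Value. A sketch reusing the module, dataset, and metric from above (epochs = 3L is an arbitrary choice for illustration):

result_multi <- eval_program(
  mod,
  dataset,
  metric = metric_exact_match(field = "answer"),
  .llm = ellmer::chat_openai(),
  epochs = 3L
)

result_multi@score_std  # spread of mean scores across the 3 epochs
result_multi@ci_lower   # 95% confidence interval bounds
result_multi@ci_upper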