Generic evaluation entry point for DSPrrr modules. Executes the module on a dataset, applies a metric to each example, and returns aggregate statistics together with the predictions and metadata required for downstream analysis.

Usage

evaluate(module, ...)

Arguments

module

A DSPrrr module created with module().

...

Arguments passed to methods:

  • data: A data frame or tibble containing columns that match the module's signature inputs plus any expected fields used by metric.

  • metric: A function applied per example with signature metric(prediction, expected_row).

Additional arguments passed to run_dataset():

  • .llm: Optional ellmer chat object

  • .parallel: Logical; whether to allow parallel execution

  • .progress: Logical; whether to display progress while evaluating

  • .return_format: Character; "simple" returns only scores, predictions, and error information, while "structured" (default) also includes full metadata and the augmented data

  • epochs: Integer; number of times to repeat the evaluation (default = 1L). When > 1, each example is evaluated multiple times so that run-to-run variation in the scores can be quantified.
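
The metric signature described above can be illustrated with a small hand-written metric. This is only a sketch: the name metric_sentiment_ci is hypothetical, and it assumes that prediction can be indexed by field name (for example, a named list) and that expected_row is a one-row data frame taken from data. The actual shape of these objects may differ.

```r
# Hypothetical custom metric: case-insensitive match on the `sentiment` field.
# Assumption: `prediction` can be indexed by field name (e.g. a named list),
# and `expected_row` is a one-row data frame taken from `data`.
metric_sentiment_ci <- function(prediction, expected_row) {
  predicted <- tolower(trimws(as.character(prediction[["sentiment"]])))
  expected <- tolower(trimws(as.character(expected_row[["sentiment"]])))
  identical(predicted, expected)
}

# Stand-in objects so the metric can be tried without calling an LLM.
mock_prediction <- list(sentiment = "Positive ")
mock_expected <- data.frame(text = "I love it!", sentiment = "positive")

metric_sentiment_ci(mock_prediction, mock_expected)
```

A function like this would be passed as metric = metric_sentiment_ci. Because it returns a logical, its results would be coerced to numeric scores as described under Value.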

Value

A list. When .return_format = "structured" (the default), it contains the following elements:

  • mean_score: numeric mean over all successful metric evaluations.

  • scores: per-example numeric scores (coerced from logical metrics).

  • predictions: list of model outputs.

  • metadata: list of metadata captured from run().

  • n_evaluated: number of successful evaluations.

  • n_errors: number of metric failures.

  • errors: character vector with error messages, when any.

  • feedbacks: per-example textual feedback when the metric returns list(score = , feedback = ) (see metric_with_feedback()); NA otherwise.

  • data: input data augmented with prediction metadata.
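
The feedbacks element is populated when a metric returns list(score = , feedback = ). A minimal sketch of such a metric follows. The name metric_sentiment_feedback is hypothetical, and the code assumes that both arguments can be indexed by field name. It is not the implementation behind metric_with_feedback().

```r
# Hypothetical metric returning both a score and textual feedback.
# Assumption: `prediction` and `expected_row` can be indexed by field name.
metric_sentiment_feedback <- function(prediction, expected_row) {
  predicted <- as.character(prediction[["sentiment"]])
  expected <- as.character(expected_row[["sentiment"]])
  correct <- identical(predicted, expected)
  list(
    score = as.numeric(correct),
    feedback = if (correct) {
      "Correct label."
    } else {
      sprintf("Expected '%s' but got '%s'.", expected, predicted)
    }
  )
}

# Try the metric on stand-in objects, without calling an LLM.
out <- metric_sentiment_feedback(
  list(sentiment = "neutral"),
  data.frame(text = "Awful.", sentiment = "negative")
)
out$score
out$feedback
```

Returning the reason alongside the score keeps each failure explainable when inspecting results one example at a time.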

When epochs > 1, additional fields are included:

  • epoch_scores: list of numeric vectors, one per epoch

  • score_std: standard deviation of mean scores across epochs

  • ci_95: 95% confidence interval for the mean score (numeric vector of length 2)
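
To make the epoch fields concrete, the sketch below computes comparable quantities from made-up scores. The t-based interval is an assumption chosen for illustration; the package may compute ci_95 with a different formula.

```r
# Made-up scores for three epochs over the same four examples.
epoch_scores <- list(
  c(1, 0, 1, 1),
  c(1, 1, 1, 1),
  c(1, 0, 1, 0)
)

# One mean per epoch, then the spread of those means.
epoch_means <- vapply(epoch_scores, mean, numeric(1))
score_std <- sd(epoch_means)

# Assumption: a t-based interval on the epoch means; the package may use
# a different formula.
n_epochs <- length(epoch_means)
half_width <- qt(0.975, df = n_epochs - 1) * score_std / sqrt(n_epochs)
ci_95 <- c(mean(epoch_means) - half_width, mean(epoch_means) + half_width)

epoch_means
score_std
ci_95
```

With only a few epochs the interval is wide, so a small difference in mean_score between two modules may not be meaningful.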

When .return_format = "simple":

  • mean_score, scores, predictions, n_evaluated, n_errors, errors (omits metadata and data for lighter-weight results)

Examples

if (FALSE) { # \dontrun{
classifier <- module(
  signature("text -> sentiment: enum('positive', 'negative', 'neutral')"),
  type = "predict"
)

testset <- dsp_trainset(
  text = c("I love it!", "Awful.", "It's fine."),
  sentiment = c("positive", "negative", "neutral")
)

result <- evaluate(
  classifier,
  data = testset,
  metric = metric_exact_match(field = "sentiment"),
  .llm = ellmer::chat_openai()
)

result$mean_score # accuracy across the test set
result$scores # per-example scores
result$n_errors # number of examples where the metric failed
} # }