
Generic evaluation entry point for DSPrrr modules. Executes the module on a dataset, applies a metric to each example, and returns aggregate statistics together with the predictions and metadata required for downstream analysis.

Usage

evaluate(module, ...)

Arguments

module

A DSPrrr module created with module().

...

Arguments passed to methods:

  • data: A data frame or tibble whose columns match the module's signature inputs, plus any expected-output columns used by metric.

  • metric: A function applied per example with signature metric(prediction, expected_row), returning a numeric or logical score (a sketch follows this list).

Additional arguments passed to run_dataset():

  • .llm: Optional ellmer chat object

  • .parallel: Logical; whether to allow parallel execution

  • .progress: Logical; whether to display progress while evaluating

  • .return_format: Character; "structured" (the default) includes full metadata and the input data, while "simple" omits them for a lighter-weight result
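As a rough illustration of how these pieces fit together, the sketch below builds a small evaluation set and an exact-match metric, then passes both to evaluate(). The module my_module, the question and expected_answer columns, and the prediction$answer field are hypothetical placeholders; the real names depend on your module's signature and outputs.

# Hypothetical evaluation data: `question` feeds the module's signature
# inputs; `expected_answer` is only read by the metric.
eval_data <- data.frame(
  question        = c("What is 2 + 2?", "What is the capital of France?"),
  expected_answer = c("4", "Paris")
)

# Per-example metric: receives the module's prediction for one row and the
# matching input row. A logical result is coerced to 0/1 when scores are
# averaged. The `answer` field is an assumption about the module's output.
exact_match <- function(prediction, expected_row) {
  identical(trimws(prediction$answer), expected_row$expected_answer)
}

result <- evaluate(
  my_module,                 # a DSPrrr module created with module()
  data      = eval_data,
  metric    = exact_match,
  .parallel = FALSE,         # run sequentially
  .progress = TRUE           # show progress while evaluating
)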

Value

A list whose elements depend on .return_format. When .return_format = "structured" (the default):

  • mean_score: numeric mean over all successful metric evaluations.

  • scores: per-example numeric scores (coerced from logical metrics).

  • predictions: list of model outputs.

  • metadata: list of metadata captured from run().

  • n_evaluated: number of successful evaluations.

  • n_errors: number of metric failures.

  • errors: character vector of error messages, if any occurred.

  • data: input data augmented with prediction metadata.

When .return_format = "simple":

  • mean_score, scores, predictions, n_evaluated, n_errors, errors (metadata and data are omitted for a lighter-weight result)
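Continuing the hypothetical sketch above, the structured result can be inspected element by element, and the lighter-weight form requested explicitly:

# Inspect the structured result (the default)
result$mean_score     # mean over successful metric evaluations
result$scores         # per-example numeric scores
result$n_evaluated    # number of successful evaluations
result$n_errors       # number of metric failures
result$errors         # error messages, when any occurred

# Request the lighter-weight form instead: metadata and data are omitted
slim <- evaluate(
  my_module,
  data           = eval_data,
  metric         = exact_match,
  .return_format = "simple"
)
names(slim)   # mean_score, scores, predictions, n_evaluated, n_errors, errors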

See also