Evaluate a DSPrrr module

Generic evaluation entry point for DSPrrr modules. Executes the module on a dataset, applies a metric to each example, and returns aggregate statistics together with the predictions and metadata required for downstream analysis.

Usage

evaluate(module, ...)

Arguments

module

A DSPrrr module created with module().

...

Arguments passed to methods:

data: A data frame or tibble containing columns that match the module's signature inputs plus any expected fields used by metric.
metric: A function applied per example with signature metric(prediction, expected_row).
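As a rough illustration of the metric(prediction, expected_row) signature, the sketch below defines an exact-match metric in base R. The function name exact_match, the answer field on the prediction, and the answer column in the data are all hypothetical; the real shape of a prediction depends on the module's signature.

```r
# Hypothetical metric: assumes the prediction is a list with an `answer`
# field and the expected row carries the reference in an `answer` column.
exact_match <- function(prediction, expected_row) {
  got  <- tolower(trimws(as.character(prediction$answer)))
  want <- tolower(trimws(as.character(expected_row$answer)))
  identical(got, want)
}

# Stand-in values for one example; no model call is involved.
prediction   <- list(answer = " Paris ")
expected_row <- data.frame(question = "Capital of France?", answer = "paris")

exact_match(prediction, expected_row)
```

Returning a logical keeps the metric simple; as noted under Value, logical results are coerced to numeric scores.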

Additional arguments passed to run_dataset():

.llm: Optional ellmer chat object
.parallel: Logical; whether to allow parallel execution
.progress: Logical; whether to display progress while evaluating
.return_format: Character; "simple" returns scores, predictions, and error counts without metadata or data; "structured" (default) also includes full metadata and data
epochs: Integer; number of times to repeat the evaluation (default = 1L). When > 1, each example is evaluated once per epoch so that run-to-run variation in the score can be quantified.

Value

A list. When .return_format = "structured" (the default), it contains:

mean_score: numeric mean over all successful metric evaluations.
scores: per-example numeric scores (logical metric results are coerced to numeric).
predictions: list of model outputs.
metadata: list of metadata captured from run().
n_evaluated: number of successful evaluations.
n_errors: number of metric failures.
errors: character vector of error messages, if any.
data: input data augmented with prediction metadata.
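The aggregate fields above can be pictured with a small base R sketch. This is illustrative only and not the package's internals: the predictions, the metric, and the choice to record a failed example as NA in scores are all assumptions made for the example.

```r
# Illustrative only: stand-in predictions and expected values.
predictions <- list(list(answer = "4"), list(answer = "5"), list(answer = NULL))
expected    <- data.frame(answer = c("4", "6", "9"))

# A hypothetical metric that fails when the prediction has no answer.
metric <- function(prediction, expected_row) {
  if (is.null(prediction$answer)) stop("prediction has no answer")
  prediction$answer == expected_row$answer
}

scores <- rep(NA_real_, length(predictions))
errors <- character(0)
for (i in seq_along(predictions)) {
  result <- tryCatch(
    metric(predictions[[i]], expected[i, , drop = FALSE]),
    error = function(e) e
  )
  if (inherits(result, "error")) {
    errors <- c(errors, conditionMessage(result))  # keep the message
  } else {
    scores[i] <- as.numeric(result)  # logical TRUE/FALSE becomes 1/0
  }
}

summary_stats <- list(
  mean_score  = mean(scores, na.rm = TRUE),  # mean over successful evaluations
  scores      = scores,
  n_evaluated = sum(!is.na(scores)),
  n_errors    = length(errors),
  errors      = errors
)
```

Catching errors per example means one failing metric call does not abort the whole run; it shows up in n_errors and errors instead.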

When epochs > 1, additional fields are included:

epoch_scores: list of numeric vectors, one per epoch
score_std: standard deviation of mean scores across epochs
ci_95: 95% confidence interval for the mean score (numeric vector of length 2)
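One plausible way to derive score_std and ci_95 from epoch_scores is sketched below in base R. The scores are made up, and the t-based interval is an assumption; the method the package actually uses for ci_95 is not stated here and may differ.

```r
# Hypothetical per-epoch scores: three epochs over four examples.
epoch_scores <- list(
  c(1, 0, 1, 1),
  c(1, 1, 1, 0),
  c(0, 0, 1, 1)
)

epoch_means <- vapply(epoch_scores, mean, numeric(1))  # one mean per epoch
mean_score  <- mean(epoch_means)
score_std   <- sd(epoch_means)  # spread of the per-epoch means

# Assumption: a t-based interval on the epoch means; the package may
# use a different method.
n_epochs   <- length(epoch_means)
half_width <- qt(0.975, df = n_epochs - 1) * score_std / sqrt(n_epochs)
ci_95      <- c(mean_score - half_width, mean_score + half_width)
```

With only a few epochs the interval is wide, which is a reminder that a small epochs value gives only a coarse picture of variation.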

When .return_format = "simple":

mean_score, scores, predictions, n_evaluated, n_errors, and errors. metadata and data are omitted to keep the result lightweight.
