Generic evaluation entry point for DSPrrr modules. Executes the module on a dataset, applies a metric to each example, and returns aggregate statistics together with the predictions and metadata required for downstream analysis.
Arguments
- module
A DSPrrr module created with module().
- ...
Arguments passed to methods:
  - data: A data frame or tibble containing columns that match the module's signature inputs, plus any expected fields used by metric.
  - metric: A function applied per example, with signature metric(prediction, expected_row).
Additional arguments passed to run_dataset():
  - .llm: Optional ellmer chat object.
  - .parallel: Logical; whether to allow parallel execution.
  - .progress: Logical; whether to display progress while evaluating.
  - .return_format: Character; "simple" returns just scores and predictions, "structured" (default) includes full metadata and data.
  - epochs: Integer; number of times to repeat the evaluation (default = 1L). When > 1, each example is evaluated multiple times so that run-to-run variation can be quantified.
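The metric(prediction, expected_row) contract can be illustrated with a small hand-written metric. This is a minimal sketch in base R and not part of DSPrrr: the name metric_sentiment_match is hypothetical, and it assumes that the prediction arrives as a list with a sentiment element and that expected_row is a one-row data frame. The actual shapes passed by evaluate() may differ.

```r
# Hypothetical custom metric following the documented signature
# metric(prediction, expected_row). Assumes (not confirmed by this page)
# that `prediction` is a list with a `sentiment` element and that
# `expected_row` is a one-row data frame holding the expected label.
metric_sentiment_match <- function(prediction, expected_row) {
  predicted <- tolower(trimws(as.character(prediction$sentiment)))
  expected <- tolower(trimws(as.character(expected_row$sentiment)))
  # Logical result; the Value section notes that logical scores are
  # coerced to numeric
  identical(predicted, expected)
}

# Stand-in inputs for a quick check without calling an LLM
metric_sentiment_match(
  list(sentiment = " Positive"),
  data.frame(sentiment = "positive")
)
```

Normalising case and whitespace before comparing is a design choice for this sketch; a stricter metric would compare the raw strings.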
Value
A list. When .return_format = "structured" (default), its elements are:
- mean_score: numeric mean over all successful metric evaluations.
- scores: per-example numeric scores (coerced from logical metrics).
- predictions: list of model outputs.
- metadata: list of metadata captured from run().
- n_evaluated: number of successful evaluations.
- n_errors: number of metric failures.
- errors: character vector with error messages, when any.
- feedbacks: per-example textual feedback when the metric returns list(score = , feedback = ) (see metric_with_feedback()); NA otherwise.
- data: input data augmented with prediction metadata.
When epochs > 1, additional fields are included:
- epoch_scores: list of numeric vectors, one per epoch.
- score_std: standard deviation of mean scores across epochs.
- ci_95: 95% confidence interval for the mean score (numeric vector of length 2).
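This page does not state how score_std and ci_95 are derived from epoch_scores. One plausible reading is sketched below in base R, using the standard deviation of the per-epoch means and a t-based interval. The score values are made up for illustration, and the interval method is an assumption; the package may compute it differently.

```r
# Illustrative per-epoch scores (made-up values: three epochs, four examples)
epoch_scores <- list(
  c(1, 0, 1, 1),
  c(1, 1, 1, 0),
  c(0, 0, 1, 1)
)

# Mean score of each epoch
epoch_means <- vapply(epoch_scores, mean, numeric(1))

# Standard deviation of the epoch means
score_std <- sd(epoch_means)

# t-based 95% interval around the mean of the epoch means
# (assumed method, not confirmed by the documentation)
n_epochs <- length(epoch_means)
half_width <- qt(0.975, df = n_epochs - 1) * score_std / sqrt(n_epochs)
ci_95 <- c(mean(epoch_means) - half_width, mean(epoch_means) + half_width)
```

With only a few epochs the t quantile is large, so the interval is wide; more epochs narrow it at the cost of more model calls.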
When .return_format = "simple":
- mean_score, scores, predictions, n_evaluated, n_errors, errors (omits metadata and data for lighter-weight results)
See also
- run() for executing without metrics
- run_dataset() for batch execution without metrics
- optimize_grid() for parameter optimization
- metric_exact_match(), metric_contains() for built-in metrics
Examples
if (FALSE) { # \dontrun{
  classifier <- module(
    signature("text -> sentiment: enum('positive', 'negative', 'neutral')"),
    type = "predict"
  )

  testset <- dsp_trainset(
    text = c("I love it!", "Awful.", "It's fine."),
    sentiment = c("positive", "negative", "neutral")
  )

  result <- evaluate(
    classifier,
    data = testset,
    metric = metric_exact_match(field = "sentiment"),
    .llm = ellmer::chat_openai()
  )

  result$mean_score  # accuracy across the test set
  result$scores      # per-example scores
  result$n_errors    # examples where the metric failed
} # }