
Why Metrics Drive Better LLM Applications
Source: vignettes/concepts-why-metrics-matter.Rmd
This article explains why metrics are essential for building reliable LLM applications and how dsprrr’s evaluation workflow enables systematic improvement.
The Problem with “It Looks Good”
Most LLM development follows this pattern:
- Write a prompt
- Try a few examples
- Outputs “look good”
- Ship it
This feels productive but hides a dangerous assumption: your intuition about output quality is reliable. It isn’t.
Why Intuition Fails
Confirmation bias: You notice examples that work and forget ones that don’t. After tweaking a prompt, you test it on the example that was failing. It works now. You don’t re-test the examples that were working before.
Incomplete coverage: You test 5 examples manually. Your production data has thousands of variations. The examples you tested aren’t representative.
Changing baselines: You improve the prompt. But did it actually get better, or does it just look different? Without quantitative comparison, you can’t know.
Subjective judgment: “Good enough” varies by person and by mood. What seems acceptable on Friday afternoon might look inadequate Monday morning.
Metrics as Ground Truth
A metric converts subjective judgment into objective measurement:
# Without metrics: "This looks right"
result <- run(mod, text = "I love this product!")
# result: "positive" # Is this correct? Probably?
# With metrics: This IS right (or wrong)
score <- metric_exact_match()(result, "positive")
# score: TRUE or FALSE, no ambiguity
Metrics provide:
- Reproducibility: Same inputs → same scores, every time
- Comparability: Version A scores 0.78, Version B scores 0.82
- Automation: Evaluate thousands of examples without human review
- Progress tracking: Did this change help or hurt?
Built-in Metrics
dsprrr provides common metrics out of the box:
Exact Match
For classification and extraction tasks where the answer must be exactly right:
metric <- metric_exact_match()
metric("positive", "positive") # TRUE
metric("Positive", "positive") # FALSE (case sensitive by default)
# Case insensitive
metric <- metric_exact_match(ignore_case = TRUE)
metric("Positive", "positive") # TRUE
# Field extraction for structured outputs
metric <- metric_exact_match(field = "sentiment")
metric(
  list(sentiment = "positive", confidence = 0.9),
  list(sentiment = "positive")
) # TRUE
Contains
For tasks where the answer should include specific content:
metric <- metric_contains()
metric("The capital of France is Paris.", "Paris") # TRUE
metric("The capital of France is Paris.", "London") # FALSEF1 Score
For text generation where partial overlap matters:
metric <- metric_f1()
metric("the quick brown fox", "the fast brown fox") # 0.75
# Tokens: {the, quick, brown, fox} vs {the, fast, brown, fox}
# Overlap: {the, brown, fox} = 3 tokens
# Precision: 3/4, Recall: 3/4, F1: 0.75
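The arithmetic in those comments can be reproduced by hand. Below is a minimal sketch of token-level F1 in plain R, assuming simple whitespace tokenization; the built-in metric_f1() may tokenize or normalize differently:
# Minimal sketch of token-level F1, assuming whitespace tokenization.
# The built-in metric_f1() may tokenize or normalize differently.
token_f1 <- function(prediction, expected) {
  pred_tokens <- unique(strsplit(tolower(prediction), "\\s+")[[1]])
  exp_tokens <- unique(strsplit(tolower(expected), "\\s+")[[1]])
  overlap <- length(intersect(pred_tokens, exp_tokens))
  if (overlap == 0) return(0)
  precision <- overlap / length(pred_tokens)
  recall <- overlap / length(exp_tokens)
  2 * precision * recall / (precision + recall)
}
token_f1("the quick brown fox", "the fast brown fox") # 0.75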
Custom Metrics
For domain-specific evaluation:
# Check if output is valid JSON
metric_valid_json <- function(prediction, expected) {
  tryCatch({
    jsonlite::fromJSON(prediction)
    TRUE
  }, error = function(e) FALSE)
}
# Check if sentiment matches with confidence threshold
metric_confident_match <- function(prediction, expected) {
  if (!is.list(prediction)) return(FALSE)
  prediction$sentiment == expected$sentiment &&
    prediction$confidence >= 0.8
}
The Evaluation Workflow
dsprrr’s evaluate() function runs your module on a dataset and computes metrics:
# Prepare labeled data
test_data <- tibble::tibble(
  text = c("Great product!", "Terrible service", "It's okay"),
  sentiment = c("positive", "negative", "neutral")
)
# Evaluate
result <- evaluate(
  mod,
  data = test_data,
  metric = metric_exact_match(field = "sentiment"),
  .llm = llm
)
result$mean_score # 0.67 (2 out of 3 correct)
result$scores # c(TRUE, TRUE, FALSE)
result$predictions # What the model actually said
result$n_evaluated # 3
result$n_errors # 0
This single function call (see the conceptual sketch after this list):
- Runs your module on every example
- Applies your metric to each prediction
- Aggregates results into actionable statistics
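Conceptually, it amounts to a loop like the sketch below. This is not dsprrr’s actual implementation: it assumes run() accepts the signature’s input field as a named argument (as in run(mod, text = ...) earlier), and error handling is omitted.
# Conceptual sketch only -- not how evaluate() is implemented internally.
# Assumes run() takes the signature's input field as a named argument and
# that the metric returns a logical or numeric score per example.
metric <- metric_exact_match()
scores <- vapply(seq_len(nrow(test_data)), function(i) {
  prediction <- run(mod, text = test_data$text[i])
  isTRUE(metric(prediction, test_data$sentiment[i]))
}, logical(1))
mean(scores) # roughly result$mean_score
sum(!scores) # examples the metric marked wrong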
Metric-Driven Development
With metrics in place, development becomes systematic:
1. Start with Evaluation
Before writing any prompt, create your evaluation dataset:
# Collect or create labeled examples
eval_data <- tibble::tibble(
  question = c(
    "What is 2+2?",
    "What is the capital of France?",
    "Who wrote Romeo and Juliet?"
  ),
  answer = c("4", "Paris", "Shakespeare")
)
This forces you to define success upfront. What counts as a correct answer? What edge cases matter?
2. Establish a Baseline
Run evaluation with your initial prompt:
baseline <- evaluate(mod_v1, eval_data, metric_exact_match())
baseline$mean_score # 0.65
Now you have a number to beat.
3. Iterate with Measurement
Every change gets measured:
# Try different instructions
mod_v2 <- module(
  signature("question -> answer", instructions = "Be concise."),
  type = "predict"
)
v2_result <- evaluate(mod_v2, eval_data, metric_exact_match())
v2_result$mean_score # 0.70 - better!
# Try adding examples
mod_v3 <- mod_v2$clone()
mod_v3$demos <- list(
  list(inputs = list(question = "What is 3+3?"), output = "6")
)
v3_result <- evaluate(mod_v3, eval_data, metric_exact_match())
v3_result$mean_score # 0.75 - even better!
No guessing. No “it feels better”. Just numbers.
4. Automate Optimization
Let the computer search for you:
mod$optimize_grid(
  devset = train_data,
  metric = metric_exact_match(),
  parameters = list(
    temperature = c(0, 0.3, 0.7),
    instructions = c("Be concise.", "Think step by step.", "Answer directly.")
  )
)
# Best configuration is automatically selected
mod$state$best_score # 0.82
Optimization only works because metrics provide objective feedback.
Choosing the Right Metric
Different tasks need different metrics:
| Task | Recommended Metric | Why |
|---|---|---|
| Classification | metric_exact_match() | Answer must be exactly right |
| Extraction | metric_exact_match(field = "...") | Extract specific field |
| Generation | metric_f1() | Partial credit for overlap |
| Yes/No questions | metric_exact_match(ignore_case = TRUE) | “Yes” = “yes” |
| Contains keyword | metric_contains() | Answer includes key info |
| Complex evaluation | Custom function | Domain-specific logic |
When Exact Match is Too Strict
Sometimes equivalent answers look different:
# These are all correct, but exact match fails
metric_exact_match()("4", "four") # FALSE
metric_exact_match()("Paris", "paris") # FALSE
metric_exact_match()("$100", "100 dollars") # FALSESolutions: - Normalize outputs before comparison - Use
ignore_case = TRUE - Write a custom metric that handles
variations - Use F1 for partial credit
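For example, a custom metric can normalize both sides before comparing. This is a minimal sketch; the specific normalization rules here are assumptions, not part of dsprrr:
# A minimal sketch: normalize case, surrounding whitespace, and trailing
# punctuation before comparing. The specific rules are illustrative only.
metric_normalized_match <- function(prediction, expected) {
  normalize <- function(x) {
    x <- tolower(trimws(x))
    gsub("[[:punct:]]+$", "", x) # drop trailing punctuation
  }
  normalize(prediction) == normalize(expected)
}
metric_normalized_match("Paris.", "paris") # TRUE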
When F1 is Too Lenient
F1 gives partial credit for word overlap, but sometimes that’s wrong:
metric_f1()("The answer is yes", "The answer is no") # 0.75
# High score despite opposite meaning!
For tasks where meaning matters more than words, consider:
- Exact match on key fields
- Semantic similarity metrics
- Model-graded evaluation (use another LLM to judge; see the sketch below)
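A model-graded metric can be written as an ordinary custom metric that calls a second module. The sketch below is hypothetical: it assumes a judge module with a single comparison input and a verdict output, and it reuses run() with a named input argument as shown earlier; adapt it to your own setup.
# Hypothetical sketch of a model-graded metric -- adapt to your setup.
# Assumes a judge module with one input field ("comparison") and one output
# ("verdict"), and that run() takes the input field as a named argument.
judge <- module(
  signature(
    "comparison -> verdict",
    instructions = "Answer yes if the prediction conveys the same meaning as the reference, otherwise answer no."
  ),
  type = "predict"
)
metric_llm_judge <- function(prediction, expected) {
  verdict <- run(judge, comparison = paste0(
    "Prediction: ", prediction, "\nReference: ", expected
  ))
  grepl("yes", tolower(verdict), fixed = TRUE)
}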
The Data Split
Proper evaluation requires data discipline:
# Split your data
set.seed(42)
n <- nrow(all_data)
# 60% for training (used during optimization)
train_idx <- sample(n, size = floor(0.6 * n))
# 20% for validation (used to select best config)
remaining <- setdiff(1:n, train_idx)
val_idx <- sample(remaining, size = floor(length(remaining) / 2))
# 20% for test (never touched until final evaluation)
test_idx <- setdiff(remaining, val_idx)
trainset <- all_data[train_idx, ]
valset <- all_data[val_idx, ]
testset <- all_data[test_idx, ]
Training set: Used to generate demonstrations and optimize prompts.
Validation set: Used to compare configurations and select the best.
Test set: Used once, at the end, to report final performance. Never optimize against it.
If you optimize against your test set, you’re measuring how well you overfit, not how well you’ll generalize.
Common Pitfalls
Metric Doesn’t Match Task
# Task: Generate creative stories
# WRONG: Exact match (too strict)
evaluate(mod, data, metric_exact_match())
# RIGHT: Custom metric for creativity/coherence
evaluate(mod, data, metric_story_quality)
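metric_story_quality is not a built-in; you would define it yourself. A purely illustrative sketch follows, in which the 100-word threshold and the required_elements field of the expected value are assumptions about your data:
# Purely illustrative: not a dsprrr built-in. Assumes each expected value is a
# list carrying a required_elements character vector; the 100-word minimum is
# an arbitrary assumption.
metric_story_quality <- function(prediction, expected) {
  n_words <- length(strsplit(prediction, "\\s+")[[1]])
  long_enough <- n_words >= 100
  has_elements <- all(vapply(
    expected$required_elements,
    function(el) grepl(el, prediction, ignore.case = TRUE),
    logical(1)
  ))
  mean(c(long_enough, has_elements)) # partial credit between 0 and 1
}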
The Payoff
Metric-driven development takes more upfront effort:
- Collecting labeled data
- Choosing appropriate metrics
- Maintaining train/val/test splits
But it pays off:
- Confidence: You know when things work
- Progress: You can measure improvement
- Debugging: You can identify failure modes
- Automation: Optimization becomes possible
- Communication: Share concrete numbers, not vibes
Further Reading
- Tutorial: Optimize Your Module - Metrics in action
- How Optimization Works - Why metrics enable optimization
- API Reference: Metrics - Built-in metrics