
Tutorial 4: Improving with Examples
Source: vignettes/tutorial-improve-with-demos.Rmd

In Tutorial 3, you built extractors that return structured data. But how do you make them more accurate? The answer: show the LLM examples of correct behavior.
This technique—called few-shot learning—is one of the most powerful ways to improve LLM performance.
Time: 25-30 minutes
Prerequisites
- Completed Tutorial 3
- OPENAI_API_KEY set in your environment
Step 1: The Problem
Let’s build a customer support ticket classifier:
sig <- signature(
"ticket -> category: enum('billing', 'technical', 'shipping', 'general')",
instructions = "Classify the customer support ticket."
)
classifier <- module(sig, type = "predict")
Test it:
run(classifier, ticket = "My package hasn't arrived yet", .llm = chat)
#> $category
#> [1] "shipping"
run(classifier, ticket = "I was charged twice for my order", .llm = chat)
#> $category
#> [1] "billing"
run(classifier, ticket = "The app keeps crashing when I try to login", .llm = chat)
#> $category
#> [1] "technical"
This works, but how do we know it’s accurate? And how do we improve it?
Step 2: Create Training Data
First, let’s create labeled examples using
dsp_trainset():
trainset <- dsp_trainset(
ticket = c(
"I was charged twice for the same item",
"How do I update my payment method?",
"The website won't load on my phone",
"My password reset email never arrived",
"When will my order ship?",
"Can I change my delivery address?",
"The product arrived damaged",
"I need a refund for my subscription",
"How do I contact customer service?",
"What are your business hours?"
),
category = c(
"billing",
"billing",
"technical",
"technical",
"shipping",
"shipping",
"shipping",
"billing",
"general",
"general"
)
)
trainset
#> ticket category
#> 1 I was charged twice for the same item billing
#> 2 How do I update my payment method? billing
#> 3 The website won't load on my phone technical
#> 4 My password reset email never arrived technical
#> 5 When will my order ship? shipping
#> 6 Can I change my delivery address? shipping
#> 7 The product arrived damaged shipping
#> 8 I need a refund for my subscription billing
#> 9 How do I contact customer service? general
#> 10 What are your business hours? general
The dsp_trainset() function creates a properly formatted tibble in which each input column is paired with its expected output column.
Step 3: Measure Baseline Accuracy
Before improving, let’s measure current performance:
baseline_results <- run_dataset(classifier, trainset, .llm = chat)
baseline_results
#> # A tibble: 10 × 3
#> ticket category result
#> <chr> <chr> <list>
#> 1 I was charged twice for the same item billing <chr [1]>
#> 2 How do I update my payment method? billing <chr [1]>
#> 3 The website won't load on my phone technical <chr [1]>
#> 4 My password reset email never arrived technical <chr [1]>
#> 5 When will my order ship? shipping <chr [1]>
#> 6 Can I change my delivery address? shipping <chr [1]>
#> 7 The product arrived damaged shipping <chr [1]>
#> 8 I need a refund for my subscription billing <chr [1]>
#> 9 How do I contact customer service? general <chr [1]>
#> 10 What are your business hours? general <chr [1]>
Calculate accuracy:
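The accuracy chunk itself is missing from the rendered output. A minimal base-R sketch, assuming each element of the result list-column holds a single predicted category string (as the <chr [1]> entries above indicate):

```r
# Flatten the list-column of predictions and compare to the expected labels
predictions <- unlist(baseline_results$result)
total <- nrow(trainset)
correct <- sum(predictions == trainset$category)
baseline_accuracy <- correct / total
baseline_accuracy
```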
Step 4: Add Manual Demonstrations
The simplest way to improve: show the LLM examples. Add them via the
demos parameter:
classifier_with_demos <- module(
sig,
type = "predict",
demos = list(
list(
inputs = list(ticket = "I was charged twice for the same item"),
output = list(category = "billing")
),
list(
inputs = list(ticket = "The app crashes on startup"),
output = list(category = "technical")
),
list(
inputs = list(ticket = "My package is late"),
output = list(category = "shipping")
)
)
)
Each demo has inputs (a list of input values) and output (a list of expected outputs).
Test the improved classifier:
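The rendered output for this check is missing. A spot-check in the same style as Step 1 (the ticket text here is an invented example, and the reply depends on the model):

```r
run(classifier_with_demos, ticket = "My tracking number doesn't work", .llm = chat)
```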
Step 5: Measure Improvement
Run on the same test set:
improved_results <- run_dataset(classifier_with_demos, trainset, .llm = chat)
#> Processing 5/10 | 50% | ETA: 1s
#> Processing 10/10 | 100% | ETA: 0s
#>
correct_improved <- sum(unlist(improved_results$result) == trainset$category)
improved_accuracy <- correct_improved / nrow(trainset)
cat("Baseline accuracy:", scales::percent(baseline_accuracy), "\n")
#> Baseline accuracy: 100%
cat("Improved accuracy:", scales::percent(improved_accuracy), "\n")
#> Improved accuracy: 100%
cat("Improvement:", scales::percent(improved_accuracy - baseline_accuracy), "\n")
#> Improvement: 0%
Step 6: Automatic Demo Selection with LabeledFewShot
Manually picking demos is tedious. The LabeledFewShot
teleprompter automatically selects good examples from your training
data:
# Create a teleprompter that selects 3 examples
teleprompter <- LabeledFewShot(k = 3L)
# Compile the module with the teleprompter
compiled <- compile_module(
program = classifier,
teleprompter = teleprompter,
trainset = trainset
)
The compile_module() function takes your module, a teleprompter strategy, and training data. It returns an improved module with automatically selected demonstrations.
Step 7: Check What Was Selected
See which examples the teleprompter chose:
# The compiled module now has demos
compiled$config$demos
#> NULL
If this prints NULL, the selected examples are not exposed under config here; the prompts logged in Step 9 below show that three demonstrations were nonetheless injected.
Step 8: Compare All Three Versions
Let’s see how each version performs:
# Run all three on the test set
results_baseline <- run_dataset(classifier, trainset, .llm = chat)
#> Processing 8/10 | 80% | ETA: 1s
#> Processing 10/10 | 100% | ETA: 0s
#>
results_manual <- run_dataset(classifier_with_demos, trainset, .llm = chat)
#> Processing 7/10 | 70% | ETA: 1s
#> Processing 10/10 | 100% | ETA: 0s
#>
results_compiled <- run_dataset(compiled, trainset, .llm = chat)
#> Processing 4/10 | 40% | ETA: 2s
#> Processing 10/10 | 100% | ETA: 0s
#>
# Calculate accuracies
acc_baseline <- mean(unlist(results_baseline$result) == trainset$category)
acc_manual <- mean(unlist(results_manual$result) == trainset$category)
acc_compiled <- mean(unlist(results_compiled$result) == trainset$category)
comparison <- tibble(
Version = c("Baseline (no demos)", "Manual demos (3)", "LabeledFewShot (3)"),
Accuracy = scales::percent(c(acc_baseline, acc_manual, acc_compiled))
)
comparison
#> # A tibble: 3 × 2
#> Version Accuracy
#> <chr> <chr>
#> 1 Baseline (no demos) 100%
#> 2 Manual demos (3) 100%
#> 3 LabeledFewShot (3) 100%
Step 9: Using Built-in Metrics
dsprrr provides metrics for common evaluation tasks. Use
metric_exact_match() for classification:
metric <- metric_exact_match(field = "category")
# Evaluate the compiled module
eval_result <- evaluate(compiled, trainset, metric = metric, .llm = chat)
#> Processing 3/10 | 30% | ETA: 3s
#> Processing 9/10 | 90% | ETA: 0s
#> Processing 10/10 | 100% | ETA: 0s
#>
eval_result
#> $mean_score
#> [1] 1
#>
#> $scores
#> [1] 1 1 1 1 1 1 1 1 1 1
#>
#> $predictions
#> $predictions[[1]]
#> $predictions[[1]]$category
#> [1] "billing"
#>
#>
#> $predictions[[2]]
#> $predictions[[2]]$category
#> [1] "billing"
#>
#>
#> $predictions[[3]]
#> $predictions[[3]]$category
#> [1] "technical"
#>
#>
#> $predictions[[4]]
#> $predictions[[4]]$category
#> [1] "technical"
#>
#>
#> $predictions[[5]]
#> $predictions[[5]]$category
#> [1] "shipping"
#>
#>
#> $predictions[[6]]
#> $predictions[[6]]$category
#> [1] "shipping"
#>
#>
#> $predictions[[7]]
#> $predictions[[7]]$category
#> [1] "shipping"
#>
#>
#> $predictions[[8]]
#> $predictions[[8]]$category
#> [1] "billing"
#>
#>
#> $predictions[[9]]
#> $predictions[[9]]$category
#> [1] "general"
#>
#>
#> $predictions[[10]]
#> $predictions[[10]]$category
#> [1] "general"
#>
#>
#>
#> $n_evaluated
#> [1] 10
#>
#> $n_errors
#> [1] 0
#>
#> $errors
#> character(0)
#>
#> $metadata
#> $metadata[[1]]
#> $metadata[[1]]$latency_ms
#> [1] 411.8221
#>
#> $metadata[[1]]$prompt_length
#> [1] 272
#>
#> $metadata[[1]]$prompt
#> [1] "Example 1:\nticket: The website won't load on my phone\nOutput: technical\n\nExample 2:\nticket: What are your business hours?\nOutput: general\n\nExample 3:\nticket: How do I update my payment method?\nOutput: billing\n\n\n# Input: ticket\nticket: I was charged twice for the same item"
#>
#> $metadata[[1]]$instructions
#> [1] "Classify the customer support ticket."
#>
#> $metadata[[1]]$timestamp
#> [1] "2026-01-09 16:16:05 UTC"
#>
#> $metadata[[1]]$batch_index
#> [1] 1
#>
#>
#> $metadata[[2]]
#> $metadata[[2]]$latency_ms
#> [1] 419.3778
#>
#> $metadata[[2]]$prompt_length
#> [1] 269
#>
#> $metadata[[2]]$prompt
#> [1] "Example 1:\nticket: The website won't load on my phone\nOutput: technical\n\nExample 2:\nticket: What are your business hours?\nOutput: general\n\nExample 3:\nticket: How do I update my payment method?\nOutput: billing\n\n\n# Input: ticket\nticket: How do I update my payment method?"
#>
#> $metadata[[2]]$instructions
#> [1] "Classify the customer support ticket."
#>
#> $metadata[[2]]$timestamp
#> [1] "2026-01-09 16:16:05 UTC"
#>
#> $metadata[[2]]$batch_index
#> [1] 2
#>
#>
#> $metadata[[3]]
#> $metadata[[3]]$latency_ms
#> [1] 413.7697
#>
#> $metadata[[3]]$prompt_length
#> [1] 269
#>
#> $metadata[[3]]$prompt
#> [1] "Example 1:\nticket: The website won't load on my phone\nOutput: technical\n\nExample 2:\nticket: What are your business hours?\nOutput: general\n\nExample 3:\nticket: How do I update my payment method?\nOutput: billing\n\n\n# Input: ticket\nticket: The website won't load on my phone"
#>
#> $metadata[[3]]$instructions
#> [1] "Classify the customer support ticket."
#>
#> $metadata[[3]]$timestamp
#> [1] "2026-01-09 16:16:06 UTC"
#>
#> $metadata[[3]]$batch_index
#> [1] 3
#>
#>
#> $metadata[[4]]
#> $metadata[[4]]$latency_ms
#> [1] 416.1301
#>
#> $metadata[[4]]$prompt_length
#> [1] 272
#>
#> $metadata[[4]]$prompt
#> [1] "Example 1:\nticket: The website won't load on my phone\nOutput: technical\n\nExample 2:\nticket: What are your business hours?\nOutput: general\n\nExample 3:\nticket: How do I update my payment method?\nOutput: billing\n\n\n# Input: ticket\nticket: My password reset email never arrived"
#>
#> $metadata[[4]]$instructions
#> [1] "Classify the customer support ticket."
#>
#> $metadata[[4]]$timestamp
#> [1] "2026-01-09 16:16:06 UTC"
#>
#> $metadata[[4]]$batch_index
#> [1] 4
#>
#>
#> $metadata[[5]]
#> $metadata[[5]]$latency_ms
#> [1] 440.4459
#>
#> $metadata[[5]]$prompt_length
#> [1] 259
#>
#> $metadata[[5]]$prompt
#> [1] "Example 1:\nticket: The website won't load on my phone\nOutput: technical\n\nExample 2:\nticket: What are your business hours?\nOutput: general\n\nExample 3:\nticket: How do I update my payment method?\nOutput: billing\n\n\n# Input: ticket\nticket: When will my order ship?"
#>
#> $metadata[[5]]$instructions
#> [1] "Classify the customer support ticket."
#>
#> $metadata[[5]]$timestamp
#> [1] "2026-01-09 16:16:06 UTC"
#>
#> $metadata[[5]]$batch_index
#> [1] 5
#>
#>
#> $metadata[[6]]
#> $metadata[[6]]$latency_ms
#> [1] 441.6645
#>
#> $metadata[[6]]$prompt_length
#> [1] 268
#>
#> $metadata[[6]]$prompt
#> [1] "Example 1:\nticket: The website won't load on my phone\nOutput: technical\n\nExample 2:\nticket: What are your business hours?\nOutput: general\n\nExample 3:\nticket: How do I update my payment method?\nOutput: billing\n\n\n# Input: ticket\nticket: Can I change my delivery address?"
#>
#> $metadata[[6]]$instructions
#> [1] "Classify the customer support ticket."
#>
#> $metadata[[6]]$timestamp
#> [1] "2026-01-09 16:16:07 UTC"
#>
#> $metadata[[6]]$batch_index
#> [1] 6
#>
#>
#> $metadata[[7]]
#> $metadata[[7]]$latency_ms
#> [1] 448.1862
#>
#> $metadata[[7]]$prompt_length
#> [1] 262
#>
#> $metadata[[7]]$prompt
#> [1] "Example 1:\nticket: The website won't load on my phone\nOutput: technical\n\nExample 2:\nticket: What are your business hours?\nOutput: general\n\nExample 3:\nticket: How do I update my payment method?\nOutput: billing\n\n\n# Input: ticket\nticket: The product arrived damaged"
#>
#> $metadata[[7]]$instructions
#> [1] "Classify the customer support ticket."
#>
#> $metadata[[7]]$timestamp
#> [1] "2026-01-09 16:16:07 UTC"
#>
#> $metadata[[7]]$batch_index
#> [1] 7
#>
#>
#> $metadata[[8]]
#> $metadata[[8]]$latency_ms
#> [1] 454.9153
#>
#> $metadata[[8]]$prompt_length
#> [1] 270
#>
#> $metadata[[8]]$prompt
#> [1] "Example 1:\nticket: The website won't load on my phone\nOutput: technical\n\nExample 2:\nticket: What are your business hours?\nOutput: general\n\nExample 3:\nticket: How do I update my payment method?\nOutput: billing\n\n\n# Input: ticket\nticket: I need a refund for my subscription"
#>
#> $metadata[[8]]$instructions
#> [1] "Classify the customer support ticket."
#>
#> $metadata[[8]]$timestamp
#> [1] "2026-01-09 16:16:08 UTC"
#>
#> $metadata[[8]]$batch_index
#> [1] 8
#>
#>
#> $metadata[[9]]
#> $metadata[[9]]$latency_ms
#> [1] 456.0275
#>
#> $metadata[[9]]$prompt_length
#> [1] 269
#>
#> $metadata[[9]]$prompt
#> [1] "Example 1:\nticket: The website won't load on my phone\nOutput: technical\n\nExample 2:\nticket: What are your business hours?\nOutput: general\n\nExample 3:\nticket: How do I update my payment method?\nOutput: billing\n\n\n# Input: ticket\nticket: How do I contact customer service?"
#>
#> $metadata[[9]]$instructions
#> [1] "Classify the customer support ticket."
#>
#> $metadata[[9]]$timestamp
#> [1] "2026-01-09 16:16:08 UTC"
#>
#> $metadata[[9]]$batch_index
#> [1] 9
#>
#>
#> $metadata[[10]]
#> $metadata[[10]]$latency_ms
#> [1] 601.1727
#>
#> $metadata[[10]]$prompt_length
#> [1] 264
#>
#> $metadata[[10]]$prompt
#> [1] "Example 1:\nticket: The website won't load on my phone\nOutput: technical\n\nExample 2:\nticket: What are your business hours?\nOutput: general\n\nExample 3:\nticket: How do I update my payment method?\nOutput: billing\n\n\n# Input: ticket\nticket: What are your business hours?"
#>
#> $metadata[[10]]$instructions
#> [1] "Classify the customer support ticket."
#>
#> $metadata[[10]]$timestamp
#> [1] "2026-01-09 16:16:09 UTC"
#>
#> $metadata[[10]]$batch_index
#> [1] 10
#>
#>
#>
#> $data
#> # A tibble: 10 × 5
#> ticket category result .metadata .chat
#> <chr> <chr> <list> <list> <list>
#> 1 I was charged twice for the same i… billing <named list> <named list> <Chat>
#> 2 How do I update my payment method? billing <named list> <named list> <Chat>
#> 3 The website won't load on my phone technic… <named list> <named list> <Chat>
#> 4 My password reset email never arri… technic… <named list> <named list> <Chat>
#> 5 When will my order ship? shipping <named list> <named list> <Chat>
#> 6 Can I change my delivery address? shipping <named list> <named list> <Chat>
#> 7 The product arrived damaged shipping <named list> <named list> <Chat>
#> 8 I need a refund for my subscription billing <named list> <named list> <Chat>
#> 9 How do I contact customer service? general <named list> <named list> <Chat>
#> 10 What are your business hours? general <named list> <named list> <Chat>
#>
#> attr(,"class")
#> [1] "dsprrr_evaluation"
The evaluate() function returns detailed results including:
- mean_score: Overall accuracy
- scores: Per-example scores
- predictions: What the model predicted
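These components can be pulled out directly, for example to report accuracy and flag misclassified examples (field names as printed above; a sketch):

```r
# Overall accuracy across the training set
eval_result$mean_score
# Indices of examples the module got wrong (score below 1)
which(eval_result$scores < 1)
```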
Step 10: Varying the Number of Examples
More examples aren’t always better. Let’s compare:
results <- list()
for (k in c(1L, 2L, 3L, 4L, 5L)) {
tp <- LabeledFewShot(k = k)
compiled_k <- compile_module(classifier, tp, trainset)
eval_k <- evaluate(compiled_k, trainset, metric = metric, .llm = chat)
results[[as.character(k)]] <- tibble(
k = k,
accuracy = eval_k$mean_score
)
}
#> Processing 5/10 | 50% | ETA: 2s
#> Processing 10/10 | 100% | ETA: 0s
#>
#> Processing 3/10 | 30% | ETA: 3s
#> Processing 7/10 | 70% | ETA: 1s
#> Processing 10/10 | 100% | ETA: 0s
#>
#> Processing 3/10 | 30% | ETA: 4s
#> Processing 8/10 | 80% | ETA: 1s
#> Processing 10/10 | 100% | ETA: 0s
#>
#> Processing 3/10 | 30% | ETA: 4s
#> Processing 9/10 | 90% | ETA: 1s
#> Processing 10/10 | 100% | ETA: 0s
#>
#> Processing 4/10 | 40% | ETA: 4s
#> Processing 8/10 | 80% | ETA: 1s
#> Processing 10/10 | 100% | ETA: 0s
#>
do.call(rbind, results)
#> # A tibble: 5 × 2
#> k accuracy
#> * <int> <dbl>
#> 1 1 1
#> 2 2 1
#> 3 3 1
#> 4 4 1
#> 5 5 1
You’ll often find that 2-4 examples work best. Too many can confuse the LLM or exceed context limits.
What You Learned
In this tutorial, you:
- Created labeled training data with dsp_trainset()
- Measured baseline accuracy before improving
- Added manual demonstrations with the demos parameter
- Used LabeledFewShot for automatic demo selection
- Compiled modules with compile_module()
- Evaluated with metric_exact_match()
- Experimented with different numbers of examples
Why Demos Work
LLMs are excellent at pattern matching. When you show examples of correct behavior:
- They understand the format you want
- They learn edge cases specific to your domain
- They pick up on subtle distinctions between categories
This is called “in-context learning”—the LLM learns from the examples in the prompt without any weight updates.
The Few-Shot Trade-off
| More Examples | Fewer Examples |
|---|---|
| Better pattern coverage | Lower token costs |
| More edge cases shown | Faster inference |
| Risk of context overflow | May miss edge cases |
Start with 2-3 examples, measure, and adjust.
Next Steps
You’ve improved your module with examples. But what about other parameters—temperature, instructions, templates? Continue to:
- Tutorial 5: Finding Best Configuration — Grid search over parameters
- Quick Reference — All metrics and teleprompters
- How Optimization Works — The theory behind teleprompters