Introduction
dsprrr provides a comprehensive suite of DSPy-inspired optimizers (teleprompters) for automatically improving your LLM programs. This guide covers the advanced optimizers beyond basic few-shot learning.
For basic optimization concepts, see
vignette("compilation-optimization").
Quick Reference: Choosing an Optimizer
| If you want… | Use this optimizer | Complexity |
|---|---|---|
| Add labeled examples as demos | LabeledFewShot | Low |
| Bootstrap demos from LLM outputs | BootstrapFewShot | Medium |
| Bootstrap + search multiple configs | BootstrapFewShotWithRandomSearch | Medium |
| Dynamic per-query demo selection | KNNFewShot | Medium |
| Optimize instructions (not demos) | COPRO | Medium |
| Joint instruction + demo optimization | MIPROv2 | High |
| Focus on hard examples | SIMBA | Medium |
| Multi-objective optimization | GEPA | High |
| Combine multiple strategies | Ensemble | Low |
Decision Tree
Start here: Do you have labeled training data?
│
├─ NO → Use zero-shot or manually write demos
│
└─ YES → How much data?
    │
    ├─ < 30 examples → LabeledFewShot (simple few-shot)
    │
    └─ 30+ examples → What do you want to optimize?
        │
        ├─ Just add demos → BootstrapFewShot
        │
        ├─ Demos + search configs → BootstrapFewShotWithRandomSearch
        │
        ├─ Different demos per query → KNNFewShot
        │
        ├─ Improve instructions → COPRO
        │
        ├─ Both instructions + demos → MIPROv2
        │
        ├─ Handle hard cases better → SIMBA
        │
        ├─ Balance quality vs cost → GEPA
        │
        └─ Combine multiple optimized modules → Ensemble
Dataset Sizing Guidance
| Dataset Size | Recommended Optimizers | Notes |
|---|---|---|
| 10-30 examples | LabeledFewShot | Minimal optimization |
| 30-100 examples | BootstrapFewShot, COPRO | Good starting point |
| 100-300 examples | BootstrapFewShotWithRandomSearch, MIPROv2 | Meaningful search |
| 300+ examples | All optimizers | Full optimization potential |
Key principles:
- Train/validation split: Use 70-80% for training, 20-30% for validation
- Diverse examples: Ensure coverage of edge cases and all output categories (a stratified split sketch follows the code below)
- Quality over quantity: 50 high-quality examples beat 500 noisy ones
library(dsprrr)
# Split your dataset properly
full_data <- tibble::tibble(
question = c(...), # Your examples
answer = c(...)
)
set.seed(42)
n <- nrow(full_data)
train_idx <- sample(n, size = floor(0.7 * n))
trainset <- full_data[train_idx, ]
valset <- full_data[-train_idx, ]
# Or use the built-in helper
splits <- split_dataset(full_data, prop = 0.7, seed = 42)
trainset <- splits$train
valset <- splits$val
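For classification-style tasks where answers fall into a small set of categories, a plain random split can leave some categories out of the training set. The sketch below stratifies the split by a label column in base R; it assumes your data has such a column (the answer field is just a stand-in here), and split_dataset() may already cover this case, so check its arguments first.
# Stratified 70/30 split (sketch): sample within each label group so every
# category appears in the training set.
set.seed(42)
idx_by_label <- split(seq_len(nrow(full_data)), full_data$answer)
train_idx <- unlist(lapply(idx_by_label, function(idx) {
  idx[sample.int(length(idx), max(1L, floor(0.7 * length(idx))))]
}))
trainset <- full_data[train_idx, ]
valset <- full_data[-train_idx, ]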
Setup
library(dsprrr)
library(ellmer)
# Configure your LLM
llm <- chat_openai(model = "gpt-4o-mini")
# Example training data for demonstrations
trainset <- dsp_trainset(
question = c(
"What is the capital of France?",
"Who wrote Romeo and Juliet?",
"What is the chemical symbol for gold?",
"When did World War II end?",
"What is the largest planet in our solar system?"
),
answer = c(
"Paris",
"William Shakespeare",
"Au",
"1945",
"Jupiter"
)
)
# Base module to optimize
qa_module <- module(
signature("question -> answer"),
type = "predict"
)
BootstrapFewShot
Bootstraps demonstrations by running the module on training examples and keeping successful outputs as demos.
Best for: When you want the LLM to generate its own demonstration format.
tp <- BootstrapFewShot(
metric = metric_exact_match(field = "answer"),
max_bootstrapped_demos = 4L,
max_labeled_demos = 2L,
max_rounds = 3L,
max_errors = 5L,
seed = 42L
)
compiled <- compile(tp, qa_module, trainset, .llm = llm)
# Check what demos were bootstrapped
print(compiled$demos)
# Run the optimized module
result <- run(compiled, question = "What is the speed of light?", .llm = llm)
Parameters:
- max_bootstrapped_demos: Maximum LLM-generated demos to include
- max_labeled_demos: Maximum labeled examples from trainset
- max_rounds: Bootstrapping iterations
- metric: Evaluation metric (defaults to exact match; a custom metric sketch follows this list)
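metric_exact_match() suits short factual answers, but metric can be any scoring function. Below is a hedged sketch of a more forgiving metric; it assumes the function(pred, expected) convention used in the GEPA example later in this vignette, with both arguments exposing an answer field, and returns a score in [0, 1].
# Case- and whitespace-insensitive match on the answer field (sketch; adjust
# the field name if your signature uses different outputs).
metric_relaxed_match <- function(pred, expected) {
  norm <- function(x) tolower(trimws(as.character(x)))
  as.numeric(norm(pred$answer) == norm(expected$answer))
}

tp <- BootstrapFewShot(
  metric = metric_relaxed_match,
  max_bootstrapped_demos = 4L,
  seed = 42L
)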
BootstrapFewShotWithRandomSearch
Combines bootstrapping with random search over configurations. Produces multiple candidate programs and selects the best.
Best for: When you want to explore different demo combinations and find the optimal configuration.
tp <- BootstrapFewShotWithRandomSearch(
metric = metric_exact_match(field = "answer"),
max_bootstrapped_demos = 4L,
max_labeled_demos = 2L,
num_candidate_programs = 8L,
num_threads = 4L,
seed = 42L
)
compiled <- compile(tp, qa_module, trainset, valset = valset, .llm = llm)
# Access the best score
print(compiled$config$optimizer$best_score)
# Access candidate programs (list of program metadata)
candidates <- compiled$config$optimizer$candidate_programs
Parameters:
- num_candidate_programs: Number of configurations to try
- num_threads: Parallel evaluation threads
- All BootstrapFewShot parameters are inherited
Tip: Use Ensemble to combine the best
optimized module with other strategies:
# Compile with different strategies and ensemble them
mod1 <- compile(BootstrapFewShotWithRandomSearch(), qa_module, trainset, valset = valset, .llm = llm)
mod2 <- compile(COPRO(), qa_module, trainset, valset = valset, .llm = llm)
mod3 <- compile(LabeledFewShot(k = 3L), qa_module, trainset, .llm = llm)
# Ensemble with weights from validation scores
ens <- ensemble(
list(mod1, mod2, mod3),
reduce_fn = reduce_weighted_vote(),
weights = c(0.90, 0.85, 0.80) # Validation scores
)
KNNFewShot
Selects demonstrations dynamically based on similarity to the input query. Uses embeddings to find the most relevant examples.
Best for: Tasks where example relevance varies significantly by query.
tp <- KNNFewShot(
k = 3L,
vectorizer = function(texts) {
# Use any embedding function
ragnar::embed_openai(texts)
},
cache_embeddings = TRUE # Cache embeddings for efficiency
)
compiled <- compile(tp, qa_module, trainset, .llm = llm)
# Each query now gets personalized demos based on similarity
result <- run(compiled, question = "What is DNA made of?", .llm = llm)
Parameters:
- k: Number of nearest neighbors to use as demos
- vectorizer: Function that converts text to embeddings (toy sketch below)
- cache_embeddings: Boolean to enable embedding caching
- input_text: Which input field to use for similarity (default: first input)
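Any embedding function can serve as the vectorizer, not just ragnar::embed_openai(). The toy below illustrates the expected shape under the assumption (unverified) that KNNFewShot wants a function mapping a character vector to a numeric matrix with one row per text; it counts letter frequencies, which is far too crude for real similarity search but keeps the example offline.
# Toy vectorizer: 26 letter-frequency features per text (illustrative only).
toy_vectorizer <- function(texts) {
  chars <- strsplit(tolower(texts), "")
  t(vapply(chars, function(ch) {
    ch <- ch[ch %in% letters]
    as.numeric(table(factor(ch, levels = letters)))
  }, numeric(26)))
}

tp <- KNNFewShot(k = 3L, vectorizer = toy_vectorizer)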
COPRO (Coordinate Prompt Optimization)
Optimizes instructions through coordinate ascent. Generates and tests instruction variants to find the best wording.
Best for: When your task benefits from better instructions rather than more demos.
# Optionally use a different model for instruction generation
prompt_llm <- chat_openai(model = "gpt-4o")
tp <- COPRO(
metric = metric_exact_match(field = "answer"),
prompt_model = prompt_llm, # Model to generate instruction candidates
breadth = 5L, # Candidates per iteration
depth = 3L, # Number of iterations
init_temperature = 1.4,
seed = 42L
)
compiled <- compile(tp, qa_module, trainset, valset = valset, .llm = llm)
# Check the optimized instructions
print(compiled$signature@instructions)
# View optimization history
history <- compiled$config$optimizer$history
print(history)
Parameters:
- breadth: Number of instruction candidates per iteration
- depth: Number of coordinate ascent iterations
- prompt_model: LLM for generating instructions (can differ from task LLM)
- init_temperature: Temperature for instruction generation
How it works:
- Starts with the current instructions as the baseline
- Generates breadth instruction variants
- Evaluates each on the validation set
- Keeps the best, using failed examples to improve the next proposals
- Repeats for depth iterations (a conceptual sketch of this loop follows)
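The sketch below spells that loop out. Both helpers are hypothetical placeholders, not dsprrr functions: propose_instructions() stands in for asking the prompt model for rewrites, and score_on_valset() stands in for evaluating a candidate on the validation set.
# Conceptual sketch of COPRO-style coordinate ascent (placeholders only).
propose_instructions <- function(instructions, n) {
  # In reality: the prompt model generates n rewrites, informed by past failures.
  replicate(n, paste(instructions, "(variant)"), simplify = FALSE)
}
score_on_valset <- function(instructions) {
  # In reality: run the module with these instructions over valset and score it.
  runif(1)
}

breadth <- 5L
depth <- 3L
best <- "Answer the question."
best_score <- score_on_valset(best)

for (iter in seq_len(depth)) {
  candidates <- propose_instructions(best, breadth)
  scores <- vapply(candidates, score_on_valset, numeric(1))
  if (max(scores) > best_score) {  # keep the best variant, otherwise keep the baseline
    best_score <- max(scores)
    best <- candidates[[which.max(scores)]]
  }
}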
MIPROv2
Multi-prompt Instruction Proposal Optimizer. Jointly optimizes instructions and demonstrations using Bayesian optimization.
Best for: Maximum optimization when you have sufficient data and compute budget.
tp <- MIPROv2(
metric = metric_exact_match(field = "answer"),
auto = "medium", # Preset: "light", "medium", or "heavy"
num_candidates = 10L, # Optional: override instruction candidates
init_temperature = 1.0,
prompt_model = chat_openai(model = "gpt-4o"),
seed = 42L
)
compiled <- compile(tp, qa_module, trainset, valset = valset, .llm = llm)
# MIPROv2 optimizes both instructions and demos
print(compiled$signature@instructions)
print(compiled$demos)
Parameters:
- auto: Preset level: "light", "medium", or "heavy"
- num_candidates: Number of instruction candidates to generate (overrides the preset)
- prompt_model: LLM for generating instruction proposals
- max_bootstrapped_demos, max_labeled_demos: Demo limits
- Supports log_dir for detailed trial logging
Presets:
# Light preset (faster, less thorough)
tp_light <- MIPROv2(
metric = metric_exact_match(),
auto = "light"
)
# Heavy preset (slower, more thorough)
tp_heavy <- MIPROv2(
metric = metric_exact_match(),
auto = "heavy"
)
SIMBA (Self-Improving Model-Based Augmentation)
Focuses optimization on hard examples that the model struggles with.
Best for: When your model performs well on average but fails on edge cases.
tp <- SIMBA(
metric = metric_exact_match(field = "answer"),
bsize = 32L, # Mini-batch size for evaluation
num_candidates = 6L, # Candidate demos per step
max_steps = 8L, # Optimization iterations
max_demos = 4L, # Maximum demos to include
seed = 42L
)
compiled <- compile(tp, qa_module, trainset, valset = valset, .llm = llm)
# SIMBA iteratively improves on hard examples
print(compiled$demos)
Parameters:
- bsize: Mini-batch size for evaluation
- num_candidates: Number of demo candidates per step
- max_steps: Number of optimization iterations
- max_demos: Maximum demonstrations to include
How it works:
- Identifies examples where the model fails
- Generates targeted demos for those cases
- Re-evaluates and repeats
- Builds a demo set that covers edge cases (a hand-rolled version of the first step is sketched below)
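Before reaching for SIMBA it can help to see which examples the current module fails on. The sketch below does this by hand; it assumes the trainset behaves like a data frame, that run() returns an object with an $answer field (as elsewhere in this vignette), and it makes one LLM call per row, so mind the cost.
# Which training questions does the base module get wrong? (sketch)
wrong <- vapply(seq_len(nrow(trainset)), function(i) {
  pred <- run(qa_module, question = trainset$question[i], .llm = llm)
  !identical(trimws(as.character(pred$answer)), trainset$answer[i])
}, logical(1))

trainset[wrong, ]  # the hard examples SIMBA would focus on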
GEPA (Guided Evolutionary Prompt Algorithm)
Multi-objective optimization balancing quality vs. cost (or other objectives).
Best for: Production systems where you need to balance accuracy against token usage or latency.
# Single-objective GEPA
tp <- GEPA(
metric = metric_exact_match(field = "answer"),
population_size = 20L,
generations = 10L,
mutation_rate = 0.1,
crossover_rate = 0.7,
seed = 42L
)
# Multi-objective GEPA with named metrics
tp_multi <- GEPA(
metrics = list(
quality = metric_exact_match(field = "answer"),
brevity = function(pred, expected) 1 / (1 + nchar(as.character(pred$answer)))
),
population_size = 20L,
generations = 10L,
selection = "pareto", # Use Pareto selection for multi-objective
seed = 42L
)
compiled <- compile(tp_multi, qa_module, trainset, valset = valset, .llm = llm)
# GEPA returns Pareto-optimal solutions for multi-objective
pareto <- compiled$config$optimizer$pareto_frontier
print(pareto)
Parameters:
- metric: Single metric function (for single-objective)
- metrics: Named list of metric functions (for multi-objective)
- population_size: Number of candidates per generation
- generations: Evolution iterations
- selection: Selection strategy ("tournament" or "pareto")
- mutation_rate, crossover_rate: Genetic algorithm parameters
Ensemble
Combines multiple compiled modules using voting or aggregation strategies.
Best for: Maximum robustness by combining diverse optimization strategies.
# First, compile modules with different strategies
mod_bootstrap <- compile(BootstrapFewShot(), qa_module, trainset, .llm = llm)
mod_copro <- compile(COPRO(), qa_module, trainset, valset = valset, .llm = llm)
mod_knn <- compile(KNNFewShot(k = 3L), qa_module, trainset, .llm = llm)
# Combine with ensemble
ens <- ensemble(
list(mod_bootstrap, mod_copro, mod_knn),
reduce_fn = reduce_majority()
)
# Or use validation scores as weights
ens_weighted <- ensemble(
list(mod_bootstrap, mod_copro, mod_knn),
reduce_fn = reduce_weighted_vote(),
weights = c(0.85, 0.90, 0.82) # Validation scores
)
# Run the ensemble
result <- run(ens, question = "What is photosynthesis?", .llm = llm)
Reduce Functions
# Majority voting (default)
reduce_majority()
# Weighted voting using module weights
reduce_weighted_vote()
# Just take the first successful output
reduce_first()
# Score outputs with a metric
reduce_best_by_metric(
metric = metric_f1(field = "answer")
)
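You can also supply your own reduce function. The sketch below assumes, without verification against the dsprrr API, that a reduce function receives the list of module outputs and returns a single one; check the documentation of the reduce_* helpers for the actual contract before relying on this.
# Hypothetical custom reducer (sketch): drop empty answers, then majority-vote.
reduce_nonempty_majority <- function(outputs) {
  answers <- vapply(outputs, function(o) trimws(as.character(o$answer)), character(1))
  keep <- nzchar(answers)
  if (!any(keep)) return(outputs[[1]])
  winner <- names(which.max(table(answers[keep])))
  outputs[[which(answers == winner)[1]]]
}

ens_custom <- ensemble(
  list(mod_bootstrap, mod_copro, mod_knn),
  reduce_fn = reduce_nonempty_majority
)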
Ensemble via Teleprompter
# Ensemble teleprompter wraps existing compiled modules
tp <- Ensemble(
reduce_fn = reduce_weighted_vote(),
weights = c(0.9, 0.85, 0.8)
)
ens <- compile(tp, programs = list(mod1, mod2, mod3))
Tracking and Logging
All optimizers support logging for debugging and reproducibility.
Trial Logging
# Enable logging to a directory
tp <- BootstrapFewShotWithRandomSearch(
metric = metric_exact_match(),
log_dir = "optimization_logs/experiment_001",
seed = 42L
)
compiled <- compile(tp, qa_module, trainset, .llm = llm)
# Log directory contains:
# - trials.jsonl: All trial results
# - best_program.rds: Serialized best module
# - config.json: Optimizer configuration
Analyzing Trials
# Read trial logs
trials <- read_trials_jsonl("optimization_logs/experiment_001/trials.jsonl")
# Examine trial results
print(trials)
# Plot optimization progress
library(ggplot2)
ggplot(trials, aes(x = trial_id, y = score)) +
geom_line() +
geom_point() +
labs(title = "Optimization Progress", x = "Trial", y = "Score")
Accessing Optimizer State
# After compilation, access optimizer metadata
compiled$config$optimizer$name
compiled$config$optimizer$params
compiled$config$optimizer$best_score
compiled$config$optimizer$trials
# For BootstrapFewShotWithRandomSearch
compiled$config$optimizer$candidate_programs # List of candidate metadata
# For COPRO
compiled$config$optimizer$history
# For GEPA (multi-objective)
compiled$config$optimizer$pareto_frontier
Reproducibility
All optimizers support deterministic seeds:
# Set seed for reproducibility
tp <- BootstrapFewShot(
metric = metric_exact_match(),
seed = 42L
)
# Same seed = same results
compiled1 <- compile(tp, qa_module, trainset, .llm = llm)
compiled2 <- compile(tp, qa_module, trainset, .llm = llm)
# Demos will be identical
identical(compiled1$demos, compiled2$demos) # TRUE
Error Handling
Optimizers gracefully handle LLM errors:
tp <- BootstrapFewShot(
metric = metric_exact_match(),
max_errors = 10L # Allow up to 10 errors before failing
)
# Compilation continues despite some failed examples
compiled <- compile(tp, qa_module, trainset, .llm = llm)
# Check how many errors occurred
compiled$config$optimizer$n_errors
Performance Tips
1. Start Simple, Then Advance
# Step 1: Try LabeledFewShot first
simple <- compile(LabeledFewShot(k = 3L), qa_module, trainset, .llm = llm)
simple_score <- evaluate(simple, valset, metric_exact_match(), .llm = llm)$mean_score
best_module <- simple
best_score <- simple_score
# Step 2: If not good enough, try bootstrapping
if (best_score < 0.8) {
bootstrap <- compile(
BootstrapFewShot(metric = metric_exact_match()),
qa_module, trainset, .llm = llm
)
bootstrap_score <- evaluate(bootstrap, valset, metric_exact_match(), .llm = llm)$mean_score
if (bootstrap_score > best_score) {
best_module <- bootstrap
best_score <- bootstrap_score
}
}
# Step 3: If still not good enough, try COPRO
if (best_score < 0.85) {
copro <- compile(
COPRO(metric = metric_exact_match()),
qa_module, trainset, valset = valset, .llm = llm
)
copro_score <- evaluate(copro, valset, metric_exact_match(), .llm = llm)$mean_score
if (copro_score > best_score) {
best_module <- copro
best_score <- copro_score
}
}
# Use the best performing module
print(paste("Best score:", best_score))2. Use Parallel Evaluation
# BootstrapFewShotWithRandomSearch supports parallel evaluation
tp <- BootstrapFewShotWithRandomSearch(
num_candidate_programs = 16L,
num_threads = 8L # Evaluate 8 candidates in parallel
)
3. Cache Embeddings for KNNFewShot
# Enable embedding caching in KNNFewShot
tp <- KNNFewShot(
k = 3L,
vectorizer = ragnar::embed_openai,
cache_embeddings = TRUE # Cache computed embeddings
)
# Or create a caching vectorizer manually
cached_embed <- local({
cache <- new.env(parent = emptyenv())
function(texts) {
key <- digest::digest(texts)
if (!exists(key, envir = cache)) {
cache[[key]] <- ragnar::embed_openai(texts)
}
cache[[key]]
}
})
tp <- KNNFewShot(k = 3L, vectorizer = cached_embed)
4. Monitor Costs
# Track costs during optimization
session_cost() # Total cost so far
# Costs are tracked in traces and trial logs
# Use session_cost() for aggregate view
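To attribute cost to a single optimization run, snapshot the session total before and after compiling. This is a sketch that assumes session_cost() returns a cumulative numeric value, as the comments above suggest.
# Estimate what one compile() call cost (sketch).
cost_before <- session_cost()
compiled <- compile(
  BootstrapFewShot(metric = metric_exact_match(field = "answer")),
  qa_module, trainset, .llm = llm
)
cost_after <- session_cost()
cost_after - cost_before  # approximate cost of this optimization run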
Summary
| Optimizer | Optimizes | Data Needs | Compute Cost |
|---|---|---|---|
| LabeledFewShot | Demos | Low (10+) | Very Low |
| BootstrapFewShot | Demos | Medium (30+) | Low |
| BootstrapFewShotWithRandomSearch | Demos + Config | Medium (50+) | Medium |
| KNNFewShot | Dynamic Demos | Medium (50+) | Low |
| COPRO | Instructions | Medium (30+) | Medium |
| MIPROv2 | Instructions + Demos | High (100+) | High |
| SIMBA | Hard Example Demos | Medium (50+) | Medium |
| GEPA | Multi-objective | High (100+) | High |
| Ensemble | Combines modules | N/A | Varies |
Further Reading
Tutorials:
- Finding Best Configuration — Hands-on grid search
- Taking to Production — Deploy optimized modules
How-to Guides:
- Compile & Optimize — Basic optimization concepts
- Evaluate with Vitals — Integration with the vitals package
Concepts:
- How Optimization Works — Theory behind teleprompters
- Why Metrics Matter — Choosing the right metric
Reference:
- Quick Reference — Metrics and teleprompter syntax
