
How Prompt Optimization Works
Source: vignettes/concepts-optimization-theory.Rmd
This article explains how dsprrr optimizes prompts automatically. Understanding the theory helps you choose the right optimizer and configure it effectively.
The Optimization Problem
When you build an LLM application, you’re implicitly defining an optimization problem:
Given:
- A task specification (signature)
- Training examples with expected outputs
- An evaluation metric

Find:
- The prompt configuration that maximizes metric performance
Traditional prompt engineering solves this by hand: you write prompts, test them, tweak them. dsprrr automates this search.
What Gets Optimized
LLM performance depends on several factors that can be tuned:
1. Instructions
The system prompt or task description:
# Original
sig <- signature(
  "text -> sentiment",
  instructions = "Classify the sentiment."
)

# Variant 1
sig_detailed <- signature(
  "text -> sentiment",
  instructions = "Analyze the emotional tone. Consider word choice, context, and implicit meanings."
)

# Variant 2
sig_concise <- signature(
  "text -> sentiment",
  instructions = "positive, negative, or neutral?"
)

Different instructions can dramatically affect performance. The right wording depends on the model, the task, and the data distribution.
2. Demonstrations (Few-Shot Examples)
Examples shown to the model before the actual input:
# No demos
mod$demos <- list()
# With demos
mod$demos <- list(
  list(inputs = list(text = "I love this!"), output = "positive"),
  list(inputs = list(text = "This is terrible."), output = "negative"),
  list(inputs = list(text = "It's okay."), output = "neutral")
)

More demos aren’t always better. The right demos matter more than the number of demos. Demos that are:
- Too similar may not generalize
- Too different may confuse the model
- Poorly formatted may teach bad patterns
3. Temperature and Other Parameters
Model generation parameters:
mod$config$temperature <- 0 # Deterministic
mod$config$temperature <- 0.7 # Some randomness
mod$config$temperature <- 1.0  # More creative

Lower temperature is typically better for classification tasks. Higher temperature helps with generation tasks where diversity matters.
The Search Space
The combination of all tunable parameters forms a search space. For a simple classifier:
| Parameter | Possible Values | Count |
|---|---|---|
| Instructions | 5 variants | 5 |
| Temperature | 0, 0.3, 0.7 | 3 |
| Demo count | 0, 2, 4, 8 | 4 |
| Demo selection | Random, diverse, similar | 3 |
Total configurations: 5 × 3 × 4 × 3 = 180 combinations
Evaluating each on a 100-example dataset means 18,000 LLM calls. This is why smart search strategies matter.
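The arithmetic is easy to check in base R. Here is a minimal sketch that enumerates the table above with expand.grid(); the variant labels and values are placeholders for illustration, not part of the dsprrr API:

# Enumerate the example search space from the table above
grid <- expand.grid(
  instructions = paste0("variant_", 1:5),
  temperature = c(0, 0.3, 0.7),
  demo_count = c(0L, 2L, 4L, 8L),
  demo_selection = c("random", "diverse", "similar"),
  stringsAsFactors = FALSE
)
nrow(grid)        # 180 configurations
nrow(grid) * 100  # 18,000 LLM calls to score each one on 100 examples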
Teleprompters: The Optimization Strategies
dsprrr calls its optimizers teleprompters (from DSPy). Each teleprompter implements a different search strategy.
LabeledFewShot: The Simplest Approach
Just add k examples from your training set as demonstrations:
tp <- LabeledFewShot(k = 4L)
compiled <- compile(tp, mod, trainset)

How it works:
- Sample k examples from training data
- Format as demonstrations
- Add to module
When to use:
- Quick baseline
- When you have high-quality labeled data
- When the task is straightforward

Limitations:
- Doesn’t optimize demo selection
- Doesn’t tune other parameters
- May pick suboptimal examples
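Conceptually, the whole strategy is a single sampling step. The sketch below shows the idea using the demos format from earlier; it assumes trainset is a data frame with text and sentiment columns, and it illustrates the behaviour rather than dsprrr's internal code:

# Sample k labeled examples and attach them as demonstrations
k <- 4L
idx <- sample(seq_len(nrow(trainset)), k)
mod$demos <- lapply(idx, function(i) {
  list(
    inputs = list(text = trainset$text[i]),
    output = trainset$sentiment[i]
  )
})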
BootstrapFewShot: Self-Improving Demonstrations
Uses the model to generate demonstrations, keeping only successful ones:
tp <- BootstrapFewShot(
  metric = metric_exact_match(),
  max_bootstrapped_demos = 4L,
  max_labeled_demos = 8L
)
compiled <- compile(tp, mod, trainset)

How it works:
- Start with labeled examples as initial demos
- Run teacher model on remaining training examples
- Score each prediction with your metric
- Keep predictions that score above threshold as new demos
- Optionally repeat for multiple rounds
Why this works: The model generates outputs in its own “voice”. Demonstrations that match the model’s natural output style are more effective than human-written examples.
When to use:
- When you have a reliable metric
- When labeled examples don’t transfer well
- When you want the model to learn its own format
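To make the bootstrap loop concrete, here is a rough sketch in plain R. It assumes trainset has text and sentiment columns, assumes the metric can be called as metric(prediction, reference), and uses a hypothetical run_module() helper for a single teacher call; the real teleprompter handles all of this for you:

metric <- metric_exact_match()
bootstrapped <- list()
for (i in seq_len(nrow(trainset))) {
  example <- trainset[i, ]
  prediction <- run_module(mod, list(text = example$text))  # hypothetical single call
  # Keep the teacher's output as a demo only if it scores well
  if (metric(prediction, example$sentiment) >= 1) {
    bootstrapped[[length(bootstrapped) + 1]] <- list(
      inputs = list(text = example$text),
      output = prediction
    )
  }
  if (length(bootstrapped) >= 4L) break  # max_bootstrapped_demos
}
mod$demos <- bootstrapped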
GridSearchTeleprompter: Systematic Exploration
Tests every combination of instruction and template variants:
variants <- tibble::tibble(
  id = c("concise", "detailed", "structured"),
  instructions = c(
    "Classify sentiment briefly.",
    "Analyze the emotional tone considering context and nuance.",
    "Task: Sentiment classification\nOutput: positive, negative, or neutral"
  )
)

tp <- GridSearchTeleprompter(
  metric = metric_exact_match(),
  variants = variants
)
compiled <- compile(tp, mod, trainset)

How it works:
- Define a grid of parameter configurations
- Evaluate each on a validation set
- Select the best-performing configuration
When to use:
- When you have specific hypotheses to test
- When the search space is small
- When you need interpretable results

Limitations:
- Scales poorly as parameters are added
- May miss optimal combinations not in the grid
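The grid can be walked with an ordinary loop. Here is a sketch of the selection step, reusing the variants table above and the evaluate() call shown later in this article; the module() constructor and the mean_score field are assumptions for illustration only:

scores <- numeric(nrow(variants))
for (i in seq_len(nrow(variants))) {
  sig_i <- signature("text -> sentiment", instructions = variants$instructions[i])
  mod_i <- module(sig_i)  # assumed constructor, for illustration only
  scores[i] <- evaluate(mod_i, valset, metric_exact_match())$mean_score
}
variants$id[which.max(scores)]  # best-performing variant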
BootstrapFewShotWithRandomSearch: Combined Strategy
Combines demo bootstrapping with parameter search:
tp <- BootstrapFewShotWithRandomSearch(
  metric = metric_exact_match(),
  num_candidate_programs = 8L,
  max_bootstrapped_demos = 4L
)
compiled <- compile(tp, mod, trainset)

How it works:
- Generate multiple demo configurations via bootstrapping
- Create candidate programs with different demo subsets
- Evaluate all candidates
- Return the best performer
When to use:
- When BootstrapFewShot alone isn’t enough
- When you want to explore demo combinations
- When you have compute budget for more trials
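The random-search half of the strategy looks roughly like the sketch below, where demo_pool is a list of bootstrapped demos and copy_module() is the same placeholder used in the pseudocode later in this article; both are illustrative assumptions:

# Random subsets of the bootstrapped demo pool become candidate programs
candidates <- lapply(seq_len(8L), function(i) {          # num_candidate_programs
  sample(demo_pool, size = min(4L, length(demo_pool)))   # max_bootstrapped_demos
})
scores <- vapply(candidates, function(demos) {
  mod_i <- copy_module(mod)
  mod_i$demos <- demos
  evaluate(mod_i, valset, metric_exact_match())$mean_score
}, numeric(1))
best_demos <- candidates[[which.max(scores)]]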
SIMBA: Embedding-Based Selection
Uses semantic similarity to select diverse, representative demos:
tp <- SIMBA(
  metric = metric_exact_match(),
  max_demos = 4L,
  embed_fn = embed_openai()
)
compiled <- compile(tp, mod, trainset)

How it works:
- Embed all training examples
- Cluster by semantic similarity
- Select diverse representatives from clusters
- Use as demonstrations
Why this works: Coverage matters more than quantity. A diverse set of demos shows the model the range of possible inputs and appropriate outputs.
When to use:
- When you have many training examples
- When examples vary significantly
- When you want coverage over the input space
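A rough sketch of the clustering idea using stats::kmeans() is shown below. It assumes embed_fn() returns one numeric embedding row per input and that trainset has text and sentiment columns; SIMBA's actual selection logic may differ:

embed_fn <- embed_openai()
emb <- embed_fn(trainset$text)   # assumed: numeric matrix, one row per example
km <- kmeans(emb, centers = 4L)  # one cluster per desired demo

# Pick the example closest to each cluster centre
idx <- vapply(seq_len(4L), function(cl) {
  members <- which(km$cluster == cl)
  d <- colSums((t(emb[members, , drop = FALSE]) - km$centers[cl, ])^2)
  members[which.min(d)]
}, integer(1))

mod$demos <- lapply(idx, function(i) {
  list(inputs = list(text = trainset$text[i]), output = trainset$sentiment[i])
})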
KNNFewShot: Dynamic Demo Selection
Selects demos based on similarity to each test input:
tp <- KNNFewShot(
  metric = metric_exact_match(),
  k = 3L,
  embed_fn = embed_openai()
)
compiled <- compile(tp, mod, trainset)

How it works:
- At compile time: embed all training examples
- At runtime: find k nearest neighbors to the input
- Use those neighbors as demonstrations
Why this works: Relevant examples are more helpful than random ones. If the input is about sports, sports-related demos are more useful than cooking demos.
When to use:
- When inputs span diverse domains
- When context-specific demos help
- When you can afford embedding costs at inference
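The runtime selection step amounts to a nearest-neighbour lookup. Here is a sketch under the same assumptions as above (embed_fn() returns numeric embedding rows, trainset has text and sentiment columns); the teleprompter wires this lookup into the module for you:

embed_fn <- embed_openai()
train_emb <- embed_fn(trainset$text)  # assumed: one embedding row per example

select_demos <- function(input_text, k = 3L) {
  q <- as.numeric(embed_fn(input_text))
  # Cosine similarity between the query and every training embedding
  sims <- (train_emb %*% q) / (sqrt(rowSums(train_emb^2)) * sqrt(sum(q^2)))
  idx <- order(sims, decreasing = TRUE)[seq_len(k)]
  lapply(idx, function(i) {
    list(inputs = list(text = trainset$text[i]), output = trainset$sentiment[i])
  })
}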
COPRO: Instruction Optimization
Generates and refines instructions using the LLM itself:
tp <- COPRO(
  metric = metric_exact_match(),
  breadth = 5L,
  depth = 3L
)
compiled <- compile(tp, mod, trainset)

How it works:
- Generate breadth instruction candidates
- Evaluate each on training data
- Take top performers, ask LLM to improve them
- Repeat for depth iterations
Why this works: LLMs are good at understanding what makes instructions effective. They can propose refinements humans wouldn’t think of.
When to use:
- When instructions matter more than demos
- When you have compute budget for exploration
- When you want automatically-generated instructions
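The breadth/depth loop can be sketched as follows. propose_instructions() and refine_instructions() stand in for the LLM calls and are hypothetical, as is the mod_i$signature assignment; the sketch only illustrates the shape of the search:

candidates <- propose_instructions(task = "sentiment classification", n = 5L)  # breadth

for (round in seq_len(3L)) {                                                   # depth
  scores <- vapply(candidates, function(instr) {
    mod_i <- copy_module(mod)
    mod_i$signature <- signature("text -> sentiment", instructions = instr)
    evaluate(mod_i, valset, metric_exact_match())$mean_score
  }, numeric(1))

  top <- candidates[order(scores, decreasing = TRUE)][1:2]
  candidates <- c(top, refine_instructions(top, n = 3L))  # ask the LLM to improve them
}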
MIPROv2: Multi-Parameter Optimization
Jointly optimizes instructions and demonstrations:
tp <- MIPROv2(
  metric = metric_exact_match(),
  num_candidates = 10L,
  init_temperature = 1.4
)
compiled <- compile(tp, mod, trainset)

MIPROv2 uses Bayesian optimization to jointly search instruction and demo combinations. It is the most general-purpose optimizer when you have the compute budget for state-of-the-art results.
GEPA: Genetic Programming
Evolves prompts through mutation and crossover:
tp <- GEPA(
  metric = metric_exact_match(),
  population_size = 20L,
  generations = 10L
)
compiled <- compile(tp, mod, trainset)

GEPA uses evolutionary algorithms (selection, mutation, crossover) to explore the prompt space. It is a good choice when you have compute budget and want creative exploration.
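For intuition, one run of such a search might look like the sketch below, where random_prompt(), score_prompt(), mutate_prompt(), and crossover_prompts() are all hypothetical placeholders for LLM-driven operations, not dsprrr functions:

population <- replicate(20L, random_prompt(), simplify = FALSE)   # population_size

for (gen in seq_len(10L)) {                                       # generations
  fitness <- vapply(population, score_prompt, numeric(1))         # metric-based score
  parents <- population[order(fitness, decreasing = TRUE)][1:10]  # selection

  children <- lapply(seq_len(10L), function(i) {
    pair <- sample(parents, 2L)
    mutate_prompt(crossover_prompts(pair[[1]], pair[[2]]))        # crossover + mutation
  })
  population <- c(parents, children)
}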
Choosing an Optimizer
Use this decision tree:
Start
│
├─ "I just want a quick baseline"
│    └─> LabeledFewShot
│
├─ "I have a reliable metric and want better demos"
│    └─> BootstrapFewShot
│
├─ "I have specific instruction variants to test"
│    └─> GridSearchTeleprompter
│
├─ "I want best results, have compute budget"
│    └─> MIPROv2 or BootstrapFewShotWithRandomSearch
│
├─ "My inputs are diverse, need adaptive demos"
│    └─> KNNFewShot
│
└─ "I want to explore creatively"
     └─> GEPA
The Evaluation Loop
All optimizers share a common evaluation pattern:
# Pseudocode for optimizer loop
best_score <- -Inf
scores <- list()

for (candidate in candidates) {
  # 1. Create candidate module
  mod_candidate <- copy_module(base_module)
  mod_candidate <- apply_config(mod_candidate, candidate)

  # 2. Evaluate on validation set
  result <- evaluate(mod_candidate, valset, metric)

  # 3. Track score
  scores[[candidate$id]] <- result$mean_score

  # 4. Update best if improved
  if (result$mean_score > best_score) {
    best_score <- result$mean_score
    best_candidate <- candidate
  }
}

# 5. Apply best configuration
final_module <- apply_config(base_module, best_candidate)

This loop is the core of all optimization. Different teleprompters differ in how they generate candidates.
Practical Considerations
Data Requirements
| Optimizer | Min Training Examples | Recommended |
|---|---|---|
| LabeledFewShot | k (usually 4) | 20+ |
| BootstrapFewShot | 10 | 50+ |
| GridSearch | 20 | 50+ |
| MIPROv2 | 30 | 100+ |
More data generally helps, but with diminishing returns after ~100 examples.
Compute Costs
Optimization requires many LLM calls. Rough estimates per optimizer:
| Optimizer | LLM Calls (100 examples) |
|---|---|
| LabeledFewShot | 0 (no evaluation) |
| BootstrapFewShot | 100-200 |
| GridSearch | variants × examples |
| MIPROv2 | 500-2000 |
| GEPA | population × generations × examples |
Use cheaper models during optimization, then evaluate final config on your target model.
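For the formula-style rows, the arithmetic is straightforward. Using the numbers shown elsewhere in this article:

3 * 100        # GridSearch: 3 instruction variants × 100 examples = 300 calls
20 * 10 * 100  # GEPA: population 20 × 10 generations × 100 examples = 20,000 calls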
Validation Strategy
Split your data:
- Training set: Used to generate demos
- Validation set: Used to evaluate candidates during optimization
- Test set: Used to evaluate final model (never touched during optimization)
# Good practice
set.seed(42)
n <- nrow(data)
train_idx <- sample(n, size = floor(0.6 * n))
val_idx <- sample(setdiff(1:n, train_idx), size = floor(0.2 * n))
test_idx <- setdiff(1:n, c(train_idx, val_idx))
trainset <- data[train_idx, ]
valset <- data[val_idx, ]
testset <- data[test_idx, ]
# Optimize
compiled <- compile(tp, mod, trainset, valset = valset)
# Final evaluation
evaluate(compiled, testset, metric)

Reproducibility
Use seeds for reproducible optimization:
tp <- BootstrapFewShot(
  metric = metric_exact_match(),
  seed = 42L
)

Why Optimization Works
Prompt optimization works because:
LLMs are sensitive: Small prompt changes cause big output changes. This creates a navigable landscape where search can find improvements.
Metrics provide signal: Evaluation scores guide the search toward better configurations. Without metrics, you’re optimizing blind.
The search space is structured: Good prompts share characteristics. Optimization exploits this structure.
Transfer happens: Prompts optimized on training data often generalize to test data. The improvements aren’t just memorization.
Connection to Machine Learning
Prompt optimization parallels traditional ML:
| ML Concept | Prompt Optimization |
|---|---|
| Model parameters | Prompt text, demos |
| Training data | Labeled examples |
| Loss function | Metric (negated) |
| Gradient descent | Teleprompter search |
| Hyperparameters | Temperature, demo count |
| Validation | Held-out evaluation |
The key difference: in ML, we update model weights. In prompt optimization, we update the text we send to the model. The model itself stays fixed.
Further Reading
- Tutorial: Optimize Your Module - Hands-on optimization
- Understanding Signatures & Modules - Core abstractions
- Why Metrics Matter - Choosing the right metric
- DSPy Paper - Academic foundations