
The DSPy Philosophy: Programs, Not Prompts
Source: vignettes/concepts-dspy-philosophy.Rmd

This article explains the philosophy behind dsprrr and why treating prompts as programs leads to better LLM applications.
The Prompt Engineering Problem
Traditional LLM development looks like this:
- Write a prompt
- Test it manually
- Tweak the wording
- Test again
- Repeat until it “seems to work”
This approach has fundamental problems:
Fragility: A prompt that works perfectly with GPT-4 may fail with Claude. A prompt optimized for one model version breaks when the provider updates their model. Your carefully crafted prompt is tied to a specific moment in time.
Subjectivity: “Seems to work” isn’t a metric. Without systematic evaluation, you’re cherry-picking examples that confirm your prompt works while ignoring failures. Confirmation bias runs rampant.
Maintenance burden: As requirements change, you accumulate prompt variants. Which version is production? Which was tested? Your prompt library becomes technical debt.
No composability: Prompts are strings. You can’t easily combine two prompts, pass one to another, or build pipelines. Every integration is ad-hoc string manipulation.
The Insight: Prompts Are Programs
DSPy (Declarative Self-improving Language Programs), developed at Stanford NLP, introduced a paradigm shift: treat prompts as programs that can be optimized.
Instead of:

- Writing prompt strings
- Manually testing and tweaking

You:

- Declare input/output specifications
- Let the framework generate prompts
- Optimize automatically with your data
This is the same shift that happened in machine learning: from hand-tuned features to learned representations.
Three Pillars of DSPy Thinking
Pillar 1: Signatures Define Contracts
A signature declares what you want, not how to ask for it:
``` r
# This is a contract: given a question, return an answer
signature("question -> answer")

# This is also a contract: given text, return a sentiment classification
signature("text -> sentiment: enum('positive', 'negative', 'neutral')")
```

The signature is an interface. It says nothing about:

- What prompt template to use
- How to format the examples
- What instructions to include
Those details are implementation—handled by the framework, not you.
Why this matters: When you separate interface from implementation, you can change the implementation without changing the interface. Optimize the prompt? The signature stays the same. Switch models? The signature stays the same. Add few-shot examples? The signature stays the same.
Pillar 2: Modules Encapsulate Behavior
A module wraps a signature with executable logic:
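A minimal sketch of what that looks like (`signature()` and `$predict()` appear elsewhere in this article; the `module_predict()` constructor and its `chat` argument are assumptions, not confirmed API):

``` r
library(dsprrr)
library(ellmer)

# A sentiment classifier: a signature plus a chat backend
# (module_predict() is a hypothetical constructor name)
classifier <- module_predict(
  signature("text -> sentiment: enum('positive', 'negative', 'neutral')"),
  chat = chat_openai()
)

# Executing the module builds the prompt, calls the model, and parses the output
classifier$predict(text = "I loved this package!")
```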
Modules are:
- Stateful: They remember their configuration, demos, and traces
- Composable: You can chain modules together
- Optimizable: Their parameters can be tuned automatically
Think of modules like functions in programming. A function has:

- A signature (input types → output types)
- Implementation (the code)
- State (closure variables)

A dsprrr module has:

- A signature (inputs → outputs)
- Implementation (LLM calls with a prompt template)
- State (configuration, demos, traces)
Why this matters: Modules are the unit of reuse. You build a module once, optimize it, and deploy it. The optimization results travel with the module.
Pillar 3: Optimizers Find What Works
Here’s the radical part: instead of hand-tuning prompts, you define a metric and let the optimizer search:
``` r
mod$optimize_grid(
  devset = training_data,
  metric = metric_exact_match(),
  parameters = list(temperature = c(0, 0.3, 0.7))
)
```

The optimizer:

- Tries each configuration
- Measures performance against your metric
- Keeps the best one
This works because:
- LLMs are sensitive to prompt wording (small changes → big effects)
- The search space is navigable (parameters are mostly continuous)
- Evaluation is cheap compared to training
Why this matters: Human intuition about prompts is unreliable. Systematic search finds configurations humans wouldn’t try. And unlike manual tuning, optimization is reproducible.
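To make “reproducible” concrete, here is a sketch of scoring the optimized module on held-out data (`evaluate()` and `metric_exact_match()` are named in this article, but the exact arguments shown here, and the `heldout_data` tibble of labeled examples, are assumptions):

``` r
# Score the optimized module on data it was not tuned on.
# Because the dataset and metric are explicit, anyone can re-run this
# and obtain the same number.
results <- evaluate(
  mod,
  dataset = heldout_data,
  metric = metric_exact_match()
)
results
```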
Mental Model: LLM Applications as Statistical Programs
Here’s a useful way to think about it:
Traditional ML: Data → Model → Predictions

- The model has learned parameters
- Training optimizes those parameters
- Inference uses the trained parameters

LLM Applications: Prompt → LLM → Response

- The prompt configures behavior
- The prompt is the “learned” artifact
- Inference uses the optimized prompt

In both cases, you have:

- A specification (model architecture / signature)
- Learned parameters (weights / prompt configuration)
- An optimization process (gradient descent / grid search + teleprompters)
dsprrr makes this analogy explicit. A “compiled” module is like a trained model—it has optimized parameters (prompt configuration, demos) found through optimization.
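The analogy, side by side in code (the `lm()` half is ordinary R; the dsprrr half reuses `mod`, `training_data`, and the `optimize_grid()` call from above and is only a sketch, including the named-argument call to `$predict()`):

``` r
# Traditional ML: declare a specification, fit parameters to data, predict
fit <- lm(mpg ~ wt + hp, data = mtcars)  # learned parameters: coefficients
predict(fit, newdata = head(mtcars))     # inference uses the fitted model

# LLM application: declare a signature, optimize the prompt configuration, predict
mod$optimize_grid(
  devset = training_data,
  metric = metric_exact_match(),
  parameters = list(temperature = c(0, 0.3, 0.7))
)
mod$predict(question = "What is the capital of France?")  # uses the optimized configuration
```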
Why This Matters for R Users
R users already think statistically. You’re comfortable with:

- Defining models declaratively (`y ~ x1 + x2`)
- Fitting models to data (`lm()`, `glm()`)
- Evaluating with held-out data
- Comparing models with metrics

dsprrr brings this workflow to LLMs:

- Define the task declaratively (signatures)
- Fit to data (teleprompters)
- Evaluate with held-out data (`evaluate()`)
- Compare with metrics
The tidyverse integration makes this natural:
``` r
# This feels like R
trainset |>
  run_dataset(classifier) |>
  mutate(correct = sentiment == expected_sentiment) |>
  summarize(accuracy = mean(correct))
```

The dsprrr Adaptation
dsprrr brings DSPy’s ideas to R with some adaptations:
S7 for Signatures: Immutable, validated type objects. Once you define a signature, it doesn’t change. This enables optimization—you can swap prompts while keeping the interface stable.
R6 for Modules: Stateful execution context. Modules accumulate traces, store optimized configurations, and provide an object-oriented API (`$predict()`, `$optimize_grid()`).

ellmer Integration: dsprrr builds on ellmer’s chat infrastructure. You get the same `chat_openai()`, `chat_anthropic()` objects, with structured outputs handled via ellmer’s type system.

tidyverse Conventions: Tibbles everywhere. Pipe-friendly APIs. `run_dataset()` returns a tibble. Traces are tibbles. Everything composes with dplyr.
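Putting those pieces together (a sketch: `module_predict()` is a hypothetical constructor, while `chat_anthropic()`, `signature()`, and `run_dataset()` come from the points above):

``` r
library(dsprrr)
library(ellmer)
library(dplyr)

# S7 signature (immutable contract) + R6 module (stateful executor),
# backed by an ellmer chat object
classifier <- module_predict(
  signature("text -> sentiment: enum('positive', 'negative', 'neutral')"),
  chat = chat_anthropic()
)

# run_dataset() returns a tibble, so results drop straight into dplyr
trainset |>
  run_dataset(classifier) |>
  count(sentiment)
```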
When Traditional Prompting Still Works
Not everything needs dsprrr. Use plain prompts when:
The task is simple and stable: If you need to summarize text and the requirements won’t change, a direct ellmer call is fine.
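For reference, that direct call is just ordinary ellmer (the prompt here is a placeholder):

``` r
library(ellmer)

# One-off summarization: no signature, no metric, no optimization
chat <- chat_openai()
chat$chat("Summarize this text in two sentences: <paste your text here>")
```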
You’re exploring: In early development, before you have training data or clear requirements, manual prompting helps you understand the problem.
The prompt is truly one-off: A single query to answer a single question doesn’t benefit from the optimization infrastructure.
You don’t have labeled data: Optimization needs a metric, which needs expected outputs. Without labels, you can’t optimize.
The Paradigm Shift
The move from prompt engineering to prompt programming mirrors other shifts in software development:
| From | To |
|---|---|
| Assembly | High-level languages |
| Manual memory management | Garbage collection |
| Hand-tuned SQL | Query optimizers |
| Feature engineering | Deep learning |
| Prompt engineering | Prompt programming |
Each shift:

- Raises the abstraction level
- Lets humans focus on what, not how
- Enables optimization beyond human intuition
- Improves reproducibility
dsprrr is this shift for LLM applications in R.
Practical Implications
If you accept this philosophy, your workflow changes:
Before (prompt engineering):
- Write prompt
- Try examples manually
- Tweak wording
- Deploy when it “feels right”
- Hope it keeps working
After (prompt programming):
- Define signature
- Create labeled dataset
- Define metric
- Optimize
- Evaluate on held-out data
- Deploy with confidence
- Monitor with traces
The second workflow is:

- Reproducible (anyone can re-run the optimization)
- Measurable (you have metrics, not vibes)
- Maintainable (change the data, re-optimize)
- Debuggable (traces show what happened)
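A compressed version of the second workflow in code (a sketch that reuses the `classifier` module, `metric_exact_match()`, and `run_dataset()` from earlier examples; `trainset`, `testset`, and the `expected_sentiment` label column are placeholders):

``` r
# 1-3. Signature, labeled data, and metric are already in hand
#      (`classifier`, `trainset`, `testset`, metric_exact_match())

# 4. Optimize against the training data
classifier$optimize_grid(
  devset = trainset,
  metric = metric_exact_match(),
  parameters = list(temperature = c(0, 0.3, 0.7))
)

# 5-6. Evaluate on held-out data, then deploy if the numbers hold up
testset |>
  run_dataset(classifier) |>
  dplyr::summarize(accuracy = mean(sentiment == expected_sentiment))

# 7. Monitor: the module's traces record what happened at run time
```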
Further Reading
- Tutorial 1: Your First LLM Call — See signatures in action
- Understanding Signatures & Modules — Why S7 and R6
- How Optimization Works — The theory behind teleprompters
- DSPy Paper — The academic foundations
- DSPy Documentation — The Python original