Tutorial 3: Extracting Structured Data
Source:vignettes/tutorial-structured-outputs.Rmd
tutorial-structured-outputs.RmdIn Tutorial 2, you built a classifier that returns a single value. But real extraction tasks need multiple fields: names and dates, sentiment and confidence, entities and relationships.
In this tutorial, you’ll extract complex, structured data from text. We’re switching the running example from sentiment labels to news articles and emails because extraction is where multi-field outputs shine—but the workflow you learned in Tutorial 2 (signature → module → run) stays exactly the same. Only the output type grows richer.
Time: 25-30 minutes
What You’ll Build
An entity extractor that pulls structured information from news articles and emails.
Prerequisites
- Completed Tutorial 2
-
OPENAI_API_KEYset in your environment
Step 1: Multiple Output Fields
So far you’ve seen single outputs like -> answer or
-> sentiment. Add more outputs with commas:
sig <- signature("text -> sentiment, confidence: number")
extractor <- module(sig, type = "predict")
run(extractor, text = "This product is absolutely fantastic!", .llm = chat)You get back both sentiment and confidence
in a structured result.
Try another:
run(extractor, text = "It was okay, nothing special.", .llm = chat)Notice the confidence is lower for ambiguous text.
Step 2: Typed Multiple Outputs
Add types to each output field:
sig <- signature(
"review -> sentiment: enum('positive', 'negative', 'neutral'), stars: int, summary: string"
)
analyzer <- module(sig, type = "predict")
result <- run(
analyzer,
review = "I've been using this blender for 6 months now. It's incredibly powerful and easy to clean. The only downside is it's quite loud. Overall, I'm very happy with it.",
.llm = chat
)
resultYou get sentiment, a star rating, and a summary—all typed correctly.
Step 3: Complex Structures with type_object()
For nested or complex data, use ellmer’s type system directly:
sig <- signature(
inputs = list(
input("article", description = "News article to analyze")
),
output_type = type_object(
headline = type_string("A concise headline"),
sentiment = type_enum(values = c("positive", "negative", "neutral")),
word_count = type_integer()
),
instructions = "Analyze the news article."
)
article_analyzer <- module(sig, type = "predict")Test it with a news snippet:
article <- "
Scientists at MIT announced a breakthrough in solar panel efficiency today.
The new panels can convert 47% of sunlight to electricity, nearly double
the current commercial standard. The technology uses a novel layered
approach that captures more of the light spectrum. Researchers expect
commercial applications within 3-5 years.
"
run(article_analyzer, article = article, .llm = chat)Step 4: Arrays of Values
Extract lists of items with type_array():
sig <- signature(
inputs = list(
input("text", description = "Text to extract entities from")
),
output_type = type_object(
people = type_array(type_string(), description = "Names of people mentioned"),
organizations = type_array(type_string(), description = "Organizations mentioned"),
locations = type_array(type_string(), description = "Places mentioned")
),
instructions = "Extract named entities from the text."
)
entity_extractor <- module(sig, type = "predict")Test with a news article:
news <- "
Apple CEO Tim Cook met with President Biden at the White House yesterday
to discuss manufacturing jobs. Cook announced that Apple will invest
$430 billion in the United States over the next five years, creating
20,000 new jobs. The meeting also included Treasury Secretary Janet Yellen
and Commerce Secretary Gina Raimondo.
"
result <- run(entity_extractor, text = news, .llm = chat)
resultAccess the arrays directly:
result$people
result$organizationsStep 5: Nested Objects
For hierarchical data, nest type_object() calls:
sig <- signature(
inputs = list(
input("email", description = "Email message to parse")
),
output_type = type_object(
sender = type_object(
name = type_string(),
email = type_string()
),
subject = type_string(),
priority = type_enum(values = c("low", "normal", "high", "urgent")),
action_items = type_array(type_string()),
requires_response = type_boolean()
),
instructions = "Parse the email and extract key information."
)
email_parser <- module(sig, type = "predict")Test with an email:
email <- "
From: Sarah Johnson <sarah.johnson@techcorp.com>
Subject: Q4 Budget Review - Action Required
Hi team,
Please review the attached Q4 budget proposal by Friday. We need to:
1. Confirm department allocations
2. Identify any cost-saving opportunities
3. Submit final numbers to finance
This is time-sensitive as the board meeting is next Monday.
Thanks,
Sarah
"
result <- run(email_parser, email = email, .llm = chat)
resultAccess nested fields:
result$sender$name
result$action_items
result$priorityStep 6: Building an Email Triage System
Let’s combine what you’ve learned into a practical system:
sig <- signature(
inputs = list(
input("email", description = "Email to triage")
),
output_type = type_object(
category = type_enum(
values = c("meeting", "task", "fyi", "urgent", "spam"),
description = "Email category"
),
summary = type_string("One-sentence summary"),
action_required = type_boolean(),
suggested_response = type_enum(
values = c("reply_now", "reply_later", "forward", "archive", "delete"),
description = "Recommended action"
)
),
instructions = "Triage the email for inbox management."
)
triage <- module(sig, type = "predict")Process a batch of emails:
emails <- tibble(
id = 1:3,
email = c(
"Meeting tomorrow at 3pm to discuss Q1 results. Please confirm attendance.",
"FYI - The office will be closed on Monday for the holiday.",
"URGENT: Server down! Need immediate assistance to restore services."
)
)
results <- run_dataset(triage, emails, .llm = chat)
resultsStep 7: Handling Optional Fields
Some fields might not always be present. Make them nullable:
sig <- signature(
inputs = list(
input("text", description = "Text that may mention a date")
),
output_type = type_object(
has_date = type_boolean(),
date = type_string("Date in YYYY-MM-DD format, or null if no date mentioned"),
confidence = type_number()
),
instructions = "Extract date information if present. Set date to empty string if no date."
)
date_extractor <- module(sig, type = "predict")
run(date_extractor, text = "Let's meet next Tuesday", .llm = chat)
run(date_extractor, text = "Great weather today!", .llm = chat)What You Learned
In this tutorial, you:
- Extracted multiple output fields with comma notation
- Added types to each field
- Built complex structures with
type_object() - Extracted lists with
type_array() - Created nested/hierarchical outputs
- Built a practical email triage system
- Handled optional fields
The Structure Advantage
Structured extraction is powerful because:
- Guaranteed types: Numbers come back as numbers, not strings
- Predictable shape: Your downstream code knows exactly what to expect
- Validation: The LLM must conform to your schema
- Composability: Results plug directly into R data structures
Next Steps
Your extractor works, but can it be improved? Continue to:
- Tutorial 4: Improving with Examples — Add demonstrations to improve accuracy
- Quick Reference — All output types at a glance
- How Optimization Works — Why examples improve LLM performance