Introduction

Baseline correction is a fundamental preprocessing step for spectroscopic and chromatographic data. Baselines can drift due to instrument effects, sample scattering, fluorescence, or detector response, obscuring the chemical information in your measurements.

The measure package provides several baseline correction methods as recipe steps, plus the ability to use custom R functions or Python’s pybaselines library.

The problem: Baseline drift

Let’s visualize baseline issues in NIR spectra:

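library(dplyr)
library(tidyr)
library(ggplot2)
library(recipes)
library(measure)

# The meats data ship with the modeldata package
data(meats, package = "modeldata")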

# Convert to long format for visualization
meats_viz <- meats |>
  mutate(id = row_number()) |>
  pivot_longer(
    cols = starts_with("x_"),
    names_to = "channel",
    values_to = "absorbance"
  ) |>
  mutate(channel = as.integer(gsub("x_", "", channel)))

# Plot raw spectra (the meats channels are absorbance values)
meats_viz |>
  filter(id <= 20) |>
  ggplot(aes(x = channel, y = absorbance, group = id, color = factor(id))) +
  geom_line(alpha = 0.7) +
  labs(
    x = "Channel",
    y = "Absorbance",
    title = "Raw NIR Spectra with Baseline Variation"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Notice the vertical offset between spectra? This baseline shift isn’t related to the chemical composition we want to model.

Built-in baseline correction methods

Asymmetric Least Squares (ALS)

step_measure_baseline_als() uses the Asymmetric Least Squares algorithm, which works well for spectra where peaks point predominantly in one direction (e.g., absorption peaks going up or emission peaks going down).

rec_als <- recipe(water + fat + protein ~ ., data = meats) |>
  step_measure_input_wide(starts_with("x_")) |>
  step_measure_baseline_als(lambda = 1e6, p = 0.01)

processed_als <- bake(prep(rec_als), new_data = NULL)

# Visualize using unnest
plot_als <- processed_als |>
  slice(1:20) |>
  mutate(id = row_number()) |>
  unnest(.measures)

ggplot(plot_als, aes(x = location, y = value, group = id, color = factor(id))) +
  geom_line(alpha = 0.7) +
  labs(
    x = "Channel",
    y = "Corrected Absorbance",
    title = "ALS Baseline Correction"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Key parameters:

  • lambda: Smoothness penalty. Higher values give a smoother, stiffer baseline. Try 10^4 to 10^9.
  • p: Asymmetry parameter (between 0 and 1). Lower values push the baseline toward the lower envelope of the signal. Try 0.001 to 0.1.
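
To make the roles of lambda and p concrete, here is a minimal base-R sketch of the asymmetric least squares idea: a curvature-penalised smoother that is refit repeatedly with asymmetric weights. It is an illustration under stated assumptions, not the code behind step_measure_baseline_als(). The helper name als_sketch(), the fixed iteration count, and the dense matrix algebra are choices made here for brevity.

```r
# Illustrative ALS baseline; als_sketch() is a hypothetical helper,
# not a function from the measure package.
als_sketch <- function(y, lambda = 1e5, p = 0.01, n_iter = 10) {
  n <- length(y)
  # Second-difference operator: penalises curvature in the baseline
  D <- diff(diag(n), differences = 2)
  penalty <- lambda * crossprod(D)
  w <- rep(1, n)
  z <- y
  for (i in seq_len(n_iter)) {
    # Weighted, penalised least squares fit
    z <- as.numeric(solve(diag(w) + penalty, w * y))
    # Points above the fit (likely peaks) get weight p, the rest get 1 - p
    w <- ifelse(y > z, p, 1 - p)
  }
  z
}

# Synthetic spectrum: sloping baseline plus one peak
x <- 1:100
true_baseline <- 0.5 + 0.01 * x
y <- true_baseline + exp(-(x - 50)^2 / (2 * 3^2))

baseline_hat <- als_sketch(y, lambda = 1e5, p = 0.01)
corrected <- y - baseline_hat
```

In this sketch, raising lambda stiffens the estimated baseline, while lowering p makes the fit ignore points above it more strongly, so the estimate settles under the peaks.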

Polynomial baseline

step_measure_baseline_poly() fits a polynomial to the spectrum and subtracts it. It is simple and fast, and works well for gentle baseline curvature.

rec_poly <- recipe(water + fat + protein ~ ., data = meats) |>
  step_measure_input_wide(starts_with("x_")) |>
  step_measure_baseline_poly(degree = 2)

processed_poly <- bake(prep(rec_poly), new_data = NULL)

# Visualize using unnest
plot_poly <- processed_poly |>
  slice(1:20) |>
  mutate(id = row_number()) |>
  unnest(.measures)

ggplot(plot_poly, aes(x = location, y = value, group = id, color = factor(id))) +
  geom_line(alpha = 0.7) +
  labs(
    x = "Channel",
    y = "Corrected Absorbance",
    title = "Polynomial Baseline Correction (degree = 2)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
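
The idea behind a polynomial baseline can be sketched in a few lines of base R: fit a low-degree polynomial to the spectrum by ordinary least squares and subtract the fitted values. The helper poly_baseline_sketch() and the synthetic data are hypothetical; the package step may differ in its details.

```r
# Illustrative polynomial baseline; poly_baseline_sketch() is a hypothetical
# helper, not part of the measure package.
poly_baseline_sketch <- function(location, value, degree = 2) {
  fit <- lm(value ~ poly(location, degree))
  as.numeric(fitted(fit))
}

# Synthetic spectrum: quadratic baseline plus a small, narrow peak
location <- 1:100
curved_baseline <- 2 + 0.03 * location - 0.0002 * location^2
peak <- 0.5 * exp(-(location - 60)^2 / (2 * 2^2))
value <- curved_baseline + peak

baseline_hat <- poly_baseline_sketch(location, value, degree = 2)
corrected <- value - baseline_hat
```

Because ordinary least squares treats peak points like any other data, large peaks pull the fitted polynomial upward. That is one reason a plain polynomial fit suits gentle curvature and modest peaks.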

Robust fitting baseline

step_measure_baseline_rf() uses robust local regression (LOESS with iterative reweighting) to estimate the baseline. It is a good choice for complex baseline shapes.
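
As a rough illustration of what iterative reweighting buys, the base-R sketch below compares an ordinary LOESS fit with a robust one (family = "symmetric" in stats::loess) on synthetic data containing one sharp spike. The data and variable names are made up for this example, and the step's internals may differ.

```r
set.seed(1)

# Synthetic signal: linear baseline, small noise, and one sharp spike
location <- 1:100
truth <- 1 + 0.02 * location
value <- truth + rnorm(100, sd = 0.01)
value[50] <- value[50] + 5

# Ordinary LOESS versus robust LOESS (iteratively down-weights large residuals)
plain_fit <- loess(value ~ location, span = 0.5)
robust_fit <- loess(value ~ location, span = 0.5, family = "symmetric")

# Error of each baseline estimate at the spike location
err_plain <- abs(plain_fit$fitted[50] - truth[50])
err_robust <- abs(robust_fit$fitted[50] - truth[50])
```

The robust fit should stay closer to the underlying line at the spike, which is the behaviour you want when peaks should not drag the baseline with them.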

rec_rf <- recipe(water + fat + protein ~ ., data = meats) |>
  step_measure_input_wide(starts_with("x_")) |>
  step_measure_baseline_rf(span = 0.5) |>
  step_measure_output_wide(prefix = "nir_")

processed_rf <- bake(prep(rec_rf), new_data = NULL)
#> Warning: Values from `value` are not uniquely identified; output will contain list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(water, fat, protein, location)) |>
#>   dplyr::filter(n > 1L)

Detrending

step_measure_detrend() removes polynomial trends from spectra. It is the simplest form of baseline correction: it removes only the overall trend.

rec_detrend <- recipe(water + fat + protein ~ ., data = meats) |>
  step_measure_input_wide(starts_with("x_")) |>
  step_measure_detrend(degree = 1) |>
  step_measure_output_wide(prefix = "nir_")

processed_detrend <- bake(prep(rec_detrend), new_data = NULL)
#> Warning: Values from `value` are not uniquely identified; output will contain list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(water, fat, protein, location)) |>
#>   dplyr::filter(n > 1L)

GPC/SEC baseline correction

step_measure_baseline_gpc() is specialized for chromatography data where baseline regions exist at the start and end of the chromatogram.

# For chromatography data (chromatogram_data is a placeholder for your own data)
rec_gpc <- recipe(outcome ~ ., data = chromatogram_data) |>
  step_measure_input_long(signal, location = vars(time)) |>
  step_measure_baseline_gpc(left_frac = 0.05, right_frac = 0.05, method = "linear")
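
One plausible reading of a linear, region-based baseline is sketched below in base R: average the signal in the leading and trailing fractions of the chromatogram, then draw a straight line through those two anchor points. This is an assumption about the general technique, not a description of the package's implementation. gpc_linear_sketch() and the synthetic chromatogram are hypothetical.

```r
# Illustrative region-based linear baseline; gpc_linear_sketch() is a
# hypothetical helper, not part of the measure package.
gpc_linear_sketch <- function(time, signal, left_frac = 0.05, right_frac = 0.05) {
  n <- length(signal)
  k_left <- max(1, floor(n * left_frac))
  k_right <- max(1, floor(n * right_frac))
  left_idx <- seq_len(k_left)
  right_idx <- seq(n - k_right + 1, n)

  # Anchor points: mean time and mean signal in each baseline region
  x0 <- mean(time[left_idx])
  y0 <- mean(signal[left_idx])
  x1 <- mean(time[right_idx])
  y1 <- mean(signal[right_idx])

  # Straight line through the two anchors, evaluated at every time point
  slope <- (y1 - y0) / (x1 - x0)
  y0 + slope * (time - x0)
}

# Synthetic chromatogram: linear drift plus one peak in the middle
time <- seq(0, 20, length.out = 401)
drift <- 0.1 + 0.005 * time
signal <- drift + exp(-(time - 10)^2 / 2)

baseline_hat <- gpc_linear_sketch(time, signal)
corrected <- signal - baseline_hat
```

The approach only works when the first and last fractions of the run are free of peaks, which is why left_frac and right_frac need to match your chromatograms.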

Custom baseline functions

step_measure_baseline_custom() lets you provide any R function for baseline estimation. Your function receives a measure_tbl (tibble with location and value columns) and should return a numeric vector of baseline values.
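
Before wiring a function into a recipe, it can help to check that it honours this contract. The sketch below uses a plain data frame as a stand-in for a measure_tbl. check_baseline_fn() and flat_min_baseline() are hypothetical helpers written for this example, not part of the package.

```r
# Stand-in for a measure_tbl: a data frame with location and value columns
toy_spectrum <- data.frame(
  location = 1:50,
  value = 0.02 * (1:50) + sin(seq(0, pi, length.out = 50))
)

# Hypothetical helper: run a candidate baseline function and verify that it
# returns one finite numeric baseline value per input row
check_baseline_fn <- function(fn, x, ...) {
  out <- fn(x, ...)
  stopifnot(
    is.numeric(out),
    length(out) == nrow(x),
    !anyNA(out)
  )
  invisible(out)
}

# A deliberately simple baseline: a flat line at the minimum of the signal
flat_min_baseline <- function(x, offset = 0) {
  rep(min(x$value) + offset, nrow(x))
}

baseline <- check_baseline_fn(flat_min_baseline, toy_spectrum, offset = 0)
```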

Using a function

# Simple moving minimum baseline
moving_min_baseline <- function(x, window = 51) {
  y <- x$value
  n <- length(y)
  baseline <- numeric(n)
  half_win <- window %/% 2

  # The baseline at i is the minimum within a window centred on i
  for (i in seq_len(n)) {
    start <- max(1, i - half_win)
    end <- min(n, i + half_win)
    baseline[i] <- min(y[start:end], na.rm = TRUE)
  }
  baseline
}

rec_custom <- recipe(water + fat + protein ~ ., data = meats) |>
  step_measure_input_wide(starts_with("x_")) |>
  step_measure_baseline_custom(.fn = moving_min_baseline, window = 101) |>
  step_measure_output_wide(prefix = "nir_")

processed_custom <- bake(prep(rec_custom), new_data = NULL)
#> Warning: Values from `value` are not uniquely identified; output will contain list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(water, fat, protein, location)) |>
#>   dplyr::filter(n > 1L)

Using a formula

For quick one-liners, use the formula interface where .x is the measure_tbl:

# LOESS baseline using formula
rec_loess <- recipe(water + fat + protein ~ ., data = meats) |>
  step_measure_input_wide(starts_with("x_")) |>
  step_measure_baseline_custom(
    .fn = ~ stats::loess(.x$value ~ .x$location, span = span)$fitted,
    span = 0.3
  ) |>
  step_measure_output_wide(prefix = "nir_")

processed_loess <- bake(prep(rec_loess), new_data = NULL)
#> Warning: Values from `value` are not uniquely identified; output will contain list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(water, fat, protein, location)) |>
#>   dplyr::filter(n > 1L)

Extracting the baseline

With step_measure_baseline_custom() and step_measure_baseline_py(), set subtract = FALSE to get the baseline itself instead of the corrected signal:

# Use custom baseline with subtract = FALSE to extract the baseline
rec_extract <- recipe(water + fat + protein ~ ., data = meats) |>
  step_measure_input_wide(starts_with("x_")) |>
  step_measure_baseline_custom(
    .fn = ~ stats::loess(.x$value ~ .x$location, span = 0.5)$fitted,
    subtract = FALSE
  )

baselines <- bake(prep(rec_extract), new_data = NULL)

# The .measures column now contains the estimated baselines
baselines |>
  slice(1) |>
  unnest(.measures) |>
  head()
#> # A tibble: 6 × 5
#>   water   fat protein location value
#>   <dbl> <dbl>   <dbl>    <dbl> <dbl>
#> 1  60.5  22.5    16.7        1  2.62
#> 2  60.5  22.5    16.7        2  2.62
#> 3  60.5  22.5    16.7        3  2.62
#> 4  60.5  22.5    16.7        4  2.62
#> 5  60.5  22.5    16.7        5  2.62
#> 6  60.5  22.5    16.7        6  2.62

Python pybaselines integration

step_measure_baseline_py() provides access to over 50 baseline correction algorithms from the Python pybaselines library.

Setup

First, install pybaselines:

# Install reticulate if needed
install.packages("reticulate")

# Declare pybaselines as a Python requirement; reticulate provisions it on demand
reticulate::py_require("pybaselines")

Using pybaselines methods

# Asymmetric Least Squares
rec_py_asls <- recipe(water + fat + protein ~ ., data = meats) |>
  step_measure_input_wide(starts_with("x_")) |>
  step_measure_baseline_py(method = "asls", lam = 1e6, p = 0.01) |>
  step_measure_output_wide(prefix = "nir_")

# SNIP algorithm (good for spectroscopy)
rec_py_snip <- recipe(water + fat + protein ~ ., data = meats) |>
  step_measure_input_wide(starts_with("x_")) |>
  step_measure_baseline_py(method = "snip", max_half_window = 40) |>
  step_measure_output_wide(prefix = "nir_")

# Modified polynomial
rec_py_modpoly <- recipe(water + fat + protein ~ ., data = meats) |>
  step_measure_input_wide(starts_with("x_")) |>
  step_measure_baseline_py(method = "modpoly", poly_order = 3) |>
  step_measure_output_wide(prefix = "nir_")

# Morphological (rolling ball)
rec_py_mor <- recipe(water + fat + protein ~ ., data = meats) |>
  step_measure_input_wide(starts_with("x_")) |>
  step_measure_baseline_py(method = "rolling_ball", half_window = 30) |>
  step_measure_output_wide(prefix = "nir_")

Available pybaselines methods

Category        Methods
Whittaker       asls, iasls, airpls, arpls, drpls, psalsa
Polynomial      poly, modpoly, imodpoly, loess, quant_reg
Morphological   mor, imor, rolling_ball, tophat
Spline          pspline_asls, pspline_airpls, mixture_model
Smooth          snip, swima, noise_median
Classification  dietrich, golotvin, fastchrom

See the pybaselines documentation for the full list and parameter details.

Using derivatives for baseline correction

Savitzky-Golay derivatives remain a powerful approach for baseline correction, especially in NIR spectroscopy:

# First derivative removes constant baseline
rec_d1 <- recipe(water + fat + protein ~ ., data = meats) |>
  step_measure_input_wide(starts_with("x_")) |>
  step_measure_savitzky_golay(window_side = 7, differentiation_order = 1) |>
  step_measure_output_wide(prefix = "nir_")

# Second derivative removes constant and linear baseline components
rec_d2 <- recipe(water + fat + protein ~ ., data = meats) |>
  step_measure_input_wide(starts_with("x_")) |>
  step_measure_savitzky_golay(window_side = 7, differentiation_order = 2) |>
  step_measure_output_wide(prefix = "nir_")
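
Why derivatives remove baselines can be checked with plain finite differences in base R. The sketch below uses diff() as a stand-in for a Savitzky-Golay derivative, which additionally smooths. The synthetic spectra are made up for this example.

```r
# One peak, seen with no baseline, a constant offset, and offset plus slope
x <- 1:50
peak <- exp(-(x - 25)^2 / 20)
offset_spec <- peak + 3
sloped_spec <- peak + 3 + 0.1 * x

# First difference: the constant offset cancels
d1_clean <- diff(peak)
d1_offset <- diff(offset_spec)
same_first <- isTRUE(all.equal(d1_clean, d1_offset))

# Second difference: both the offset and the linear slope cancel
d2_clean <- diff(peak, differences = 2)
d2_sloped <- diff(sloped_spec, differences = 2)
same_second <- isTRUE(all.equal(d2_clean, d2_sloped))
```

Note that the first difference of the sloped spectrum still carries the slope as a constant shift of 0.1, which is why a linear baseline needs the second derivative.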

Choosing a method

Method                          Best for             Tunable   Notes
step_measure_baseline_als()     Most spectroscopy    Yes       Good general-purpose choice
step_measure_baseline_poly()    Gentle curvature     Yes       Simple and fast
step_measure_baseline_rf()      Complex baselines    Yes       Robust to outliers
step_measure_detrend()          Simple trends        Yes       Fastest option
step_measure_baseline_gpc()     Chromatography       No        Uses baseline regions
step_measure_baseline_custom()  Special cases        Optional  Maximum flexibility
step_measure_baseline_py()      Advanced algorithms  Yes       50+ methods available
step_measure_savitzky_golay()   NIR/IR spectra       Yes       Also smooths and differentiates

Hyperparameter tuning

Most baseline correction steps are tunable with the tidymodels tuning framework:

library(tune)

# Create tunable recipe
rec_tunable <- recipe(water ~ ., data = meats) |>
  step_measure_input_wide(starts_with("x_")) |>
  step_measure_baseline_als(
    lambda = tune(),  # Will be tuned
    p = tune()        # Will be tuned
  ) |>
  step_measure_output_wide(prefix = "nir_")

# The tunable parameters are automatically detected
tunable(rec_tunable)
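
A candidate grid for these two parameters can be built with base R. The sketch below spans the ranges suggested earlier for lambda and p. It assumes the grid columns should be named after the tune() placeholders, so that the data frame could be passed as the grid argument of tune::tune_grid() together with a workflow and resamples (not shown here).

```r
# Candidate values: lambda on a log10 scale, p across three orders of magnitude
baseline_grid <- expand.grid(
  lambda = 10^(4:9),
  p = c(0.001, 0.01, 0.1)
)

# 6 lambda values x 3 p values = 18 candidate combinations
nrow(baseline_grid)
```

Spacing lambda on a log scale is a deliberate choice here: its useful range covers several orders of magnitude, so a linear grid would waste most candidates at the high end.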

For step_measure_baseline_custom(), you can declare tunable parameters explicitly:

rec_custom_tunable <- recipe(water ~ ., data = meats) |>
  step_measure_input_wide(starts_with("x_")) |>
  step_measure_baseline_custom(
    .fn = ~ stats::loess(.x$value ~ .x$location, span = span)$fitted,
    span = 0.5,
    tunable = list(
      span = list(pkg = "dials", fun = "degree", range = c(0.1, 0.9))
    )
  )

Complete preprocessing pipeline

Baseline correction often works best as part of a complete preprocessing pipeline:

rec_complete <- recipe(water + fat + protein ~ ., data = meats) |>
  step_measure_input_wide(starts_with("x_")) |>
  # Baseline correction
  step_measure_baseline_als(lambda = 1e6, p = 0.01) |>
  # Scatter correction
  step_measure_snv() |>
  # Smoothing with mild derivative
  step_measure_savitzky_golay(window_side = 5, differentiation_order = 1) |>
  # Output for modeling
  step_measure_output_wide(prefix = "nir_")

final_data <- bake(prep(rec_complete), new_data = NULL)
#> Warning: Values from `value` are not uniquely identified; output will contain list-cols.
#>  Use `values_fn = list` to suppress this warning.
#>  Use `values_fn = {summary_fun}` to summarise duplicates.
#>  Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(water, fat, protein, location)) |>
#>   dplyr::filter(n > 1L)
final_data[1:5, 1:8]
#> # A tibble: 5 × 8
#>   water   fat protein nir_01    nir_02    nir_03    nir_04    nir_05   
#>   <dbl> <dbl>   <dbl> <list>    <list>    <list>    <list>    <list>   
#> 1  60.5  22.5    16.7 <dbl [1]> <dbl [1]> <dbl [1]> <dbl [1]> <dbl [1]>
#> 2  46    40.1    13.5 <dbl [1]> <dbl [1]> <dbl [1]> <dbl [1]> <dbl [1]>
#> 3  71     8.4    20.5 <dbl [1]> <dbl [1]> <dbl [1]> <dbl [1]> <dbl [1]>
#> 4  72.8   5.9    20.7 <dbl [2]> <dbl [2]> <dbl [2]> <dbl [2]> <dbl [2]>
#> 5  58.3  25.5    15.5 <dbl [2]> <dbl [2]> <dbl [2]> <dbl [2]> <dbl [2]>

Summary