step_measure_emsc() creates a specification of a recipe step that applies Extended Multiplicative Scatter Correction (EMSC) to spectral data. EMSC models wavelength-dependent scatter effects with polynomial terms in wavelength and removes them from each spectrum.

Usage

step_measure_emsc(
  recipe,
  degree = 2L,
  reference = "mean",
  measures = NULL,
  role = NA,
  trained = FALSE,
  ref_spectrum = NULL,
  locations = NULL,
  skip = FALSE,
  id = recipes::rand_id("measure_emsc")
)

Arguments

recipe

A recipe object.

degree

Polynomial degree for wavelength-dependent terms. Default is 2. Higher values can model more complex scatter effects but risk overfitting.

reference

Reference spectrum method: "mean" (default) or "median". Alternatively, a numeric vector can be supplied as the reference spectrum.

measures

An optional character vector of measure column names.

role

Not used.

trained

Logical indicating if the step has been trained.

ref_spectrum

The learned reference spectrum (after training).

locations

The location values for polynomial terms (after training).

skip

Logical. Should the step be skipped when baking?

id

Unique step identifier.

Value

An updated recipe with the new step added.

Details

Extended MSC (EMSC) extends standard MSC by modeling wavelength-dependent scatter effects. For a spectrum \(x_i\) and reference \(x_r\), the model is:

$$x_i = a_i + b_i \cdot x_r + c_i \cdot \lambda + d_i \cdot \lambda^2 + \dots + \epsilon$$

The corrected spectrum is:

$$\mathrm{EMSC}(x_i) = \frac{x_i - a_i - c_i \cdot \lambda - d_i \cdot \lambda^2 - \dots}{b_i}$$

The polynomial terms (\(\lambda\), \(\lambda^2\), etc.) account for wavelength-dependent baseline effects that vary between samples.
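The model and correction above can be sketched in base R. This is a minimal illustration and not the package's internal implementation: the helper `emsc_correct()`, the scaled wavelength axis, and the synthetic reference spectrum are all assumptions made for the example. The coefficients are estimated by ordinary least squares, which is one common way to fit the model.

```r
# Toy EMSC sketch in base R (hypothetical helper, not the package's code).
lambda <- seq(-1, 1, length.out = 100)   # assumed scaled wavelength axis
x_ref  <- exp(-((lambda - 0.2) / 0.15)^2) +
  0.5 * exp(-((lambda + 0.4) / 0.1)^2)   # synthetic reference spectrum

# Simulate one spectrum from the model with known coefficients (no noise).
a <- 0.3; b <- 1.5; c1 <- 0.2; d <- -0.1
x_i <- a + b * x_ref + c1 * lambda + d * lambda^2

emsc_correct <- function(x, ref, lambda, degree = 2L) {
  # Design matrix columns: intercept, reference, lambda^1 ... lambda^degree
  poly_terms <- outer(lambda, seq_len(degree), `^`)
  X <- cbind(1, ref, poly_terms)
  coefs <- qr.solve(X, x)                # least-squares estimates
  # Additive offset plus polynomial baseline, then divide by the scale b_i
  baseline <- coefs[1] + drop(poly_terms %*% coefs[-(1:2)])
  as.numeric((x - baseline) / coefs[2])
}

x_corr <- emsc_correct(x_i, x_ref, lambda, degree = 2L)

# Because x_i was generated exactly from the model, the corrected
# spectrum should match the reference up to numerical precision.
max(abs(x_corr - x_ref))
```

With noisy or real spectra the corrected spectrum will not reproduce the reference exactly; the residual then carries the sample-specific signal.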

When to use EMSC vs MSC:

  • Use MSC for simple additive/multiplicative scatter

  • Use EMSC when scatter effects vary with wavelength

  • Start with degree = 2 and increase it only if the corrected spectra still show a curved baseline; higher degrees risk overfitting
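The guidance above can be illustrated with a small base R comparison. It is a sketch on synthetic data, not a benchmark: the spectrum, its coefficients, and the `rmse()` helper are assumptions for illustration. Plain MSC is fitted as intercept plus reference only, while the EMSC fit adds linear and quadratic wavelength terms.

```r
# Synthetic spectrum with a wavelength-dependent (curved) baseline.
lambda <- seq(-1, 1, length.out = 100)   # assumed scaled wavelength axis
x_ref  <- exp(-(lambda / 0.2)^2)         # synthetic reference spectrum
x_i    <- 0.1 + 1.2 * x_ref + 0.3 * lambda - 0.2 * lambda^2

# MSC: fit x_i ~ a + b * x_ref, then remove offset and scale.
beta_msc <- qr.solve(cbind(1, x_ref), x_i)
x_msc    <- (x_i - beta_msc[1]) / beta_msc[2]

# EMSC (degree 2): additionally fit and remove lambda and lambda^2 terms.
P         <- cbind(lambda, lambda^2)
beta_emsc <- qr.solve(cbind(1, x_ref, P), x_i)
x_emsc    <- (x_i - beta_emsc[1] - drop(P %*% beta_emsc[3:4])) / beta_emsc[2]

# Hypothetical helper: root-mean-square difference from the reference.
rmse <- function(u, v) sqrt(mean((u - v)^2))
rmse_msc  <- rmse(x_msc, x_ref)
rmse_emsc <- rmse(x_emsc, x_ref)

# MSC leaves the wavelength-dependent baseline in place; EMSC removes it.
c(msc = rmse_msc, emsc = rmse_emsc)
```

On this toy spectrum the MSC result keeps a visible residual baseline because the linear and quadratic terms cannot be absorbed by an offset and a scale alone.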

Examples

library(recipes)
library(measure)  # provides meats_long and the step_measure_*() functions

rec <- recipe(water + fat + protein ~ ., data = meats_long) |>
  update_role(id, new_role = "id") |>
  step_measure_input_long(transmittance, location = vars(channel)) |>
  step_measure_emsc(degree = 2) |>
  prep()

bake(rec, new_data = NULL)
#> # A tibble: 215 × 5
#>       id water   fat protein .measures
#>    <int> <dbl> <dbl>   <dbl>    <meas>
#>  1     1  60.5  22.5    16.7 [100 × 2]
#>  2     2  46    40.1    13.5 [100 × 2]
#>  3     3  71     8.4    20.5 [100 × 2]
#>  4     4  72.8   5.9    20.7 [100 × 2]
#>  5     5  58.3  25.5    15.5 [100 × 2]
#>  6     6  44    42.7    13.7 [100 × 2]
#>  7     7  44    42.7    13.7 [100 × 2]
#>  8     8  69.3  10.6    19.3 [100 × 2]
#>  9     9  61.4  19.9    17.7 [100 × 2]
#> 10    10  61.4  19.9    17.7 [100 × 2]
#> # ℹ 205 more rows