step_measure_despike() creates a specification of a recipe step that
detects and removes spikes (sudden, brief outliers) from measurement data.
Spikes are common artifacts in spectroscopy (cosmic rays in Raman, detector
glitches) and chromatography (electrical noise).
Arguments
- recipe
A recipe object.
- measures
An optional character vector of measure column names.
- window
The window size for local statistics. Must be an odd integer of at least 3. Default is 5. Tunable via smooth_window().
- threshold
The threshold multiplier for spike detection. Points deviating more than threshold * MAD from the local median are flagged. Default is 5. Tunable via despike_threshold().
- method
How to replace detected spikes. One of "interpolate" (default, linear interpolation from neighbors), "median" (replace with local median), or "mean" (replace with local mean).
- max_width
Maximum width (in points) of a spike. Consecutive outliers wider than this are not considered spikes. Default is 3.
- role
Not used.
- trained
Logical indicating if the step has been trained.
- skip
Logical. Should the step be skipped when baking?
- id
Unique step identifier.
Details
Spike detection uses a robust local statistic approach:
For each point, calculate the local median and MAD (Median Absolute Deviation) within a sliding window.
Flag points where |value - local_median| > threshold * MAD.
Group consecutive flagged points into spike regions.
If a spike region is no wider than max_width, replace it using the specified method.
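The procedure described above can be sketched in base R. This is an illustration only, not the package's implementation: the function name despike_sketch() is hypothetical, and the truncated windows at the edges, the handling of a zero MAD, and the choice to leave spikes that touch the ends of the series untouched are all assumptions made for the sketch. Only the "interpolate" replacement is shown.

```r
# Illustrative sketch only; NOT the package's implementation.
# despike_sketch() is a hypothetical name used for this example.
despike_sketch <- function(x, window = 5, threshold = 5, max_width = 3) {
  stopifnot(window >= 3, window %% 2 == 1)
  n <- length(x)
  half <- (window - 1) / 2
  flagged <- logical(n)

  # Local median and scaled MAD per point, then flag large deviations.
  for (i in seq_len(n)) {
    # Assumption: the window is simply truncated near the edges.
    idx <- max(1, i - half):min(n, i + half)
    med <- median(x[idx])
    s <- mad(x[idx], center = med)  # stats::mad() scales by 1.4826 by default
    # Assumption: with a zero MAD, any nonzero deviation is flagged.
    flagged[i] <- abs(x[i] - med) > threshold * s
  }

  # Group consecutive flagged points into runs.
  runs <- rle(flagged)
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1

  # Replace only flagged runs that are at most max_width points wide.
  out <- x
  for (k in which(runs$values & runs$lengths <= max_width)) {
    left <- starts[k] - 1
    right <- ends[k] + 1
    if (left < 1 || right > n) next  # assumption: skip runs touching the ends
    span <- starts[k]:ends[k]
    # Linear interpolation between the nearest unflagged neighbors.
    out[span] <- approx(c(left, right), x[c(left, right)], xout = span)$y
  }
  out
}

# Smooth signal with a single-point spike added at position 20.
set.seed(1)
t_grid <- seq(0, 2 * pi, length.out = 50)
x <- sin(t_grid) + rnorm(50, sd = 0.01)
x[20] <- x[20] + 5
cleaned <- despike_sketch(x)
c(before = x[20], after = cleaned[20], truth = sin(t_grid[20]))
```

In this sketch a run of flagged points longer than max_width is left as it is, which is how a broad genuine peak would survive the filter.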
MAD is scaled by 1.4826 to be consistent with standard deviation for normally distributed data.
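As a quick check of the scaling, the snippet below compares the raw MAD, the scaled MAD, and the standard deviation on simulated normal data. The sample size, seed, and standard deviation are arbitrary choices for illustration. Base R's stats::mad() applies the same 1.4826 constant by default.

```r
set.seed(42)
x <- rnorm(1e5, mean = 0, sd = 2)

# Raw MAD: median of absolute deviations from the median.
raw_mad <- median(abs(x - median(x)))

# Scaling by 1.4826 puts the MAD on the same scale as the standard deviation
# when the data are normally distributed.
scaled_mad <- 1.4826 * raw_mad

c(sd = sd(x), raw_mad = raw_mad, scaled_mad = scaled_mad)

# stats::mad() uses constant = 1.4826 by default, so the two agree.
all.equal(scaled_mad, mad(x))
```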
This approach is robust because:
Median and MAD are not affected by the spikes themselves
The threshold adapts to local noise levels
The max_width parameter prevents removing genuine peaks
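The first point can be illustrated with a toy window of five values, invented for this example. When one value is replaced by a spike, the mean and standard deviation are inflated enough that a mean/sd rule with the same multiplier would miss the spike, while the median/MAD rule still flags it.

```r
# Toy window of five values (made up for illustration), then the same
# window with the center value replaced by a spike.
baseline <- c(10.1, 9.9, 10.0, 10.2, 9.8)
spiked <- baseline
spiked[3] <- 60

# Classical statistics are pulled toward the spike; robust ones barely move.
rbind(
  baseline = c(mean = mean(baseline), sd = sd(baseline),
               median = median(baseline), mad = mad(baseline)),
  spiked   = c(mean = mean(spiked), sd = sd(spiked),
               median = median(spiked), mad = mad(spiked))
)

threshold <- 5

# Median/MAD rule: the spike stands out clearly from the local median.
robust_flag <- abs(spiked[3] - median(spiked)) > threshold * mad(spiked)

# Mean/sd rule with the same multiplier: the spike inflates sd and masks itself.
classical_flag <- abs(spiked[3] - mean(spiked)) > threshold * sd(spiked)

c(robust = robust_flag, classical = classical_flag)
```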
Examples
library(recipes)
rec <- recipe(water + fat + protein ~ ., data = meats_long) |>
  update_role(id, new_role = "id") |>
  step_measure_input_long(transmittance, location = vars(channel)) |>
  step_measure_despike(threshold = 5) |>
  prep()
bake(rec, new_data = NULL)
#> # A tibble: 215 × 5
#> id water fat protein .measures
#> <int> <dbl> <dbl> <dbl> <meas>
#> 1 1 60.5 22.5 16.7 [100 × 2]
#> 2 2 46 40.1 13.5 [100 × 2]
#> 3 3 71 8.4 20.5 [100 × 2]
#> 4 4 72.8 5.9 20.7 [100 × 2]
#> 5 5 58.3 25.5 15.5 [100 × 2]
#> 6 6 44 42.7 13.7 [100 × 2]
#> 7 7 44 42.7 13.7 [100 × 2]
#> 8 8 69.3 10.6 19.3 [100 × 2]
#> 9 9 61.4 19.9 17.7 [100 × 2]
#> 10 10 61.4 19.9 17.7 [100 × 2]
#> # ℹ 205 more rows