Pareto Scaling

step_measure_scale_pareto() creates a specification of a recipe step that applies Pareto scaling at each measurement location. Pareto scaling is a compromise between no scaling and auto-scaling (unit-variance scaling) and is commonly used in metabolomics.

Usage

step_measure_scale_pareto(
  recipe,
  measures = NULL,
  role = NA,
  trained = FALSE,
  learned_params = NULL,
  skip = FALSE,
  id = recipes::rand_id("measure_scale_pareto")
)

Arguments

recipe: A recipe object. The step will be added to the sequence of operations for this recipe.
measures: An optional character vector of measure column names to process. If NULL (the default), all measure columns (columns with class measure_list) will be processed. Use this to limit processing to specific measure columns when working with multiple measurement types.
role: Not used by this step since no new variables are created.
trained: A logical to indicate if the quantities for preprocessing have been estimated.
learned_params: A named list containing the learned means and standard deviations for each measure column. This is NULL until the step is trained.
skip: A logical. Should the step be skipped when the recipe is baked by recipes::bake()? While all operations are baked when recipes::prep() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.
id: A character string that is unique to this step to identify it.

Value

An updated version of recipe with the new step added to the sequence of any existing operations.

Details

Pareto scaling mean-centers the values at each measurement location and divides them by the square root of the standard deviation rather than by the standard deviation itself. Large values are down-weighted relative to no scaling, but less strongly than under auto-scaling, so larger fold changes still carry more weight.

For a data matrix $X$, the transformation is:

$$X_{\text{scaled}} = \frac{X - \bar{X}}{\sqrt{s_X}}$$

where $\bar{X}$ and $s_X$ are the column-wise mean and standard deviation computed from the training data.

If a column has zero standard deviation (constant values), that column is only centered, not scaled.
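The transformation and the zero-variance rule can be sketched in base R. This is a minimal illustration on a made-up matrix, not the package's internal code: the names X, mu, s, and denom are chosen here for the example, and rows are assumed to be samples with one column per measurement location.

```r
# Toy data: 5 samples (rows) x 4 measurement locations (columns).
set.seed(1)
X <- matrix(rnorm(20, mean = 10, sd = 3), nrow = 5)
X[, 4] <- 7  # constant column, so its standard deviation is zero

mu <- colMeans(X)      # column-wise means
s  <- apply(X, 2, sd)  # column-wise standard deviations

# Divide by sqrt(s); use 1 for constant columns so they are only centered.
denom <- ifelse(s > 0, sqrt(s), 1)

# Subtract the column means, then divide each column by its denominator.
X_scaled <- sweep(sweep(X, 2, mu, "-"), 2, denom, "/")

round(colMeans(X_scaled), 10)  # every column is centered at zero
apply(X_scaled, 2, sd)         # equals sqrt(s) for the non-constant columns
```

After auto-scaling, every non-constant column would have a standard deviation of 1. Here the standard deviation becomes sqrt(s), so more variable locations keep more influence than they would under auto-scaling, but less than with no scaling.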

The means and standard deviations are learned during prep() from the training data and stored for use when applying the transformation to new data during bake().
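The split between learning and applying can be illustrated with two small base R functions. pareto_fit() and pareto_apply() are hypothetical helpers written for this sketch and are not part of any package; they only mimic the roles that prep() and bake() play for this step, assuming one column per measurement location.

```r
# Hypothetical helper: learn column-wise means and standard deviations.
pareto_fit <- function(x) {
  list(mean = colMeans(x), sd = apply(x, 2, sd))
}

# Hypothetical helper: apply previously learned parameters to any matrix
# that has the same columns as the training data.
pareto_apply <- function(x, params) {
  denom <- ifelse(params$sd > 0, sqrt(params$sd), 1)
  sweep(sweep(x, 2, params$mean, "-"), 2, denom, "/")
}

set.seed(2)
train    <- matrix(rnorm(30, mean = 5, sd = 2), nrow = 10)
new_data <- matrix(rnorm(9, mean = 5, sd = 2), nrow = 3)

params    <- pareto_fit(train)               # plays the role of prep()
train_out <- pareto_apply(train, params)     # like bake(new_data = NULL)
new_out   <- pareto_apply(new_data, params)  # like bake() on new data
```

Because new_data is transformed with the training means and standard deviations, its columns are generally not centered exactly at zero. That is intended: the same transformation is applied to every data set.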

No selectors should be supplied to this step function. The data should be in the internal format produced by step_measure_input_wide() or step_measure_input_long().

References

van den Berg, R.A., Hoefsloot, H.C., Westerhuis, J.A., Smilde, A.K., and van der Werf, M.J. 2006. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics, 7:142.

Examples

library(recipes)
library(measure)  # provides meats_long and the step_measure_*() functions

rec <-
  recipe(water + fat + protein ~ ., data = meats_long) |>
  update_role(id, new_role = "id") |>
  step_measure_input_long(transmittance, location = vars(channel)) |>
  step_measure_scale_pareto() |>
  prep()

bake(rec, new_data = NULL)
#> # A tibble: 215 × 6
#>       id water   fat protein .measures channel    
#>    <int> <dbl> <dbl>   <dbl>    <meas> <list>     
#>  1     1  60.5  22.5    16.7 [100 × 2] <int [100]>
#>  2     2  46    40.1    13.5 [100 × 2] <int [100]>
#>  3     3  71     8.4    20.5 [100 × 2] <int [100]>
#>  4     4  72.8   5.9    20.7 [100 × 2] <int [100]>
#>  5     5  58.3  25.5    15.5 [100 × 2] <int [100]>
#>  6     6  44    42.7    13.7 [100 × 2] <int [100]>
#>  7     7  44    42.7    13.7 [100 × 2] <int [100]>
#>  8     8  69.3  10.6    19.3 [100 × 2] <int [100]>
#>  9     9  61.4  19.9    17.7 [100 × 2] <int [100]>
#> 10    10  61.4  19.9    17.7 [100 × 2] <int [100]>
#> # ℹ 205 more rows