Delayed operations enable us to process samples faster on big datasets. See the Details section for how they work.

DelayedOperation(
  name,
  fun = NULL,
  params = list(),
  params_iter = list(),
  fun_extract = NULL,
  fun_aggregate = NULL
)

Arguments

name

A name for the delayed operation, only used for printing.

fun

A function that takes a sample and returns a modified sample.

params

A named list with additional arguments to be passed to fun.

params_iter

A named list with additional arguments to be passed to fun. Unlike params, each argument must itself be a named list with one element per sample, so that each sample receives the value stored under its name.

fun_extract

A function that takes a modified sample and returns an extracted object.

fun_aggregate

A function that takes a dataset and a list of extracted objects and returns a modified dataset.
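
The difference between params and params_iter can be sketched in plain R. This is not the package's implementation: apply_to_samples() is a hypothetical helper, the samples are toy numeric vectors, and the sketch assumes that fun receives the sample as its first argument, followed by the shared params and the per-sample entries of params_iter.

```r
# Hypothetical helper, for illustration only: applies `fun` to every sample,
# passing the shared `params` plus the per-sample values from `params_iter`.
apply_to_samples <- function(samples, fun, params = list(), params_iter = list()) {
  out <- lapply(names(samples), function(sample_name) {
    # For each iterated argument, pick the value stored under this sample's name
    per_sample <- lapply(params_iter, function(values) values[[sample_name]])
    do.call(fun, c(list(samples[[sample_name]]), params, per_sample))
  })
  names(out) <- names(samples)
  out
}

# Toy samples: numeric vectors instead of real sample objects
samples <- list(s1 = c(1, 2, 3), s2 = c(10, 20, 30))

# `scale` is shared by all samples; `offset` differs per sample
rescale <- function(sample, scale, offset) sample * scale + offset

result <- apply_to_samples(
  samples,
  fun = rescale,
  params = list(scale = 2),
  params_iter = list(offset = list(s1 = 0, s2 = -5))
)
```

Here both samples are multiplied by the same scale, while s1 and s2 each receive their own offset, looked up by sample name.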

Value

A DelayedOperation object

Details

Let's say we have a pipeline with two actions (e.g. smooth() and detectPeaks()), and we want to apply it to a dataset with two samples (e.g. s1, s2).

The following simple pseudocode executes all actions on all samples. It is only meant to give an idea of how the loops are arranged:

dataset <- list(s1, s2)
actions <- list(smooth, detectPeaks)
# Outer loop over actions, inner loop over samples
for (action in actions) {
  for (i in seq_along(dataset)) {
    dataset[[i]] <- action(dataset[[i]])
  }
}

When the dataset is big, samples are stored on disk, and each sample is loaded before it is used and saved afterwards:

dataset <- list(s1, s2)
actions <- list(smooth, detectPeaks)
# Every action reads and saves every sample
for (action in actions) {
  for (i in seq_along(dataset)) {
    sample <- read_from_disk(i)
    sample <- action(sample)
    save_to_disk(sample, i)
  }
}

We can avoid most of this saving and loading by swapping the loop order, so that each sample is read once, processed by all actions, and saved once:

dataset <- list(s1, s2)
actions <- list(smooth, detectPeaks)
# Outer loop over samples: one read and one save per sample
for (i in seq_along(dataset)) {
  sample <- read_from_disk(i)
  for (action in actions) {
    sample <- action(sample)
  }
  save_to_disk(sample, i)
}
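
To make the saving concrete, the two loop orders can be run on toy data while counting disk accesses. This is only a sketch under stated assumptions: read_from_disk() and save_to_disk() are hypothetical helpers backed by RDS files in a temporary directory, the samples are plain numeric vectors, and the two actions are placeholder functions rather than real smoothing or peak detection.

```r
# Toy samples and placeholder actions (stand-ins for smooth / detectPeaks)
samples <- list(s1 = c(1, 4, 2, 8), s2 = c(3, 3, 9, 1))
actions <- list(function(x) x * 2, function(x) x + 1)

# Hypothetical on-disk storage: one RDS file per sample in a temp directory
sample_dir <- tempfile("samples_")
dir.create(sample_dir)
sample_path <- function(i) file.path(sample_dir, paste0("sample_", i, ".rds"))

io_count <- 0L
read_from_disk <- function(i) {
  io_count <<- io_count + 1L
  readRDS(sample_path(i))
}
save_to_disk <- function(sample, i) {
  io_count <<- io_count + 1L
  saveRDS(sample, sample_path(i))
}
# Writes the original samples to disk without touching the counter
reset_disk <- function() {
  for (i in seq_along(samples)) saveRDS(samples[[i]], sample_path(i))
}
read_all <- function() lapply(seq_along(samples), function(i) readRDS(sample_path(i)))

# Order 1: outer loop over actions, so every action reads and saves every sample
reset_disk()
io_count <- 0L
for (action in actions) {
  for (i in seq_along(samples)) {
    save_to_disk(action(read_from_disk(i)), i)
  }
}
io_action_major <- io_count
result_action_major <- read_all()

# Order 2: outer loop over samples, so each sample is read and saved once
reset_disk()
io_count <- 0L
for (i in seq_along(samples)) {
  sample <- read_from_disk(i)
  for (action in actions) sample <- action(sample)
  save_to_disk(sample, i)
}
io_sample_major <- io_count
result_sample_major <- read_all()
```

With two samples and two actions, the first order performs eight disk accesses and the second performs four, while both leave the same processed samples on disk. The gap grows with the number of stacked actions.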

This requires that when we apply an operation to the dataset, the operation is delayed, so we can stack many delayed operations and run them all at once.

The DelayedOperation class allows us to store all pending actions and run them afterwards when the data is needed.

In addition, because each sample is processed independently of the others, samples can be processed in parallel if enough cores and RAM are available.
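
Since every sample goes through the same stack of actions independently, the outer loop over samples maps naturally onto a parallel apply. The sketch below uses parallel::mclapply() from base R purely as an illustration; the source does not state which parallel backend is used, and run_all_actions(), the toy samples, and the placeholder actions are assumptions.

```r
library(parallel)

# Toy samples and placeholder actions (hypothetical)
samples <- list(s1 = c(1, 4, 2, 8), s2 = c(3, 3, 9, 1))
actions <- list(function(x) x * 2, function(x) x + 1)

# Runs every pending action, in order, on a single sample
run_all_actions <- function(sample, actions) {
  for (action in actions) sample <- action(sample)
  sample
}

# mclapply() relies on forking, which is not available on Windows,
# so fall back to a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One sample per task; each worker holds only its own sample in memory
processed <- mclapply(samples, run_all_actions, actions = actions, mc.cores = n_cores)
```

Each worker needs enough memory for the sample it is processing, which is why the number of cores should be chosen with the available RAM in mind.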

The DelayedOperation class also covers the case where we want to extract some information from each sample (e.g. the Reverse Ion Chromatogram, RIC) and combine it across samples, for instance into a matrix with the RICs of all samples. The loops above are extended so that, after an action modifies a sample, we can extract something from that sample and keep it. After all actions have executed, we can aggregate the extracted results and save them into the dataset. This is implemented here with the fun_extract and fun_aggregate functions. It is used, for instance, in the getRIC() implementation, to extract the RIC from each sample and afterwards aggregate the RICs into a matrix.
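
The extract and aggregate steps can be sketched by extending the sample-wise loop. This is an illustration, not the package's code: the dataset is a plain list with a hypothetical summary_matrix slot, the samples are numeric vectors, extraction is assumed to run once per sample after its actions, and the extracted object is a simple min/max summary rather than a real RIC.

```r
# Toy dataset: a list holding the samples and a slot for the aggregated result
dataset <- list(
  samples = list(s1 = c(1, 4, 2, 8), s2 = c(3, 3, 9, 1)),
  summary_matrix = NULL
)
actions <- list(function(x) x * 2, function(x) x + 1)

# fun_extract: takes a modified sample and returns an extracted object
fun_extract <- function(sample) c(min = min(sample), max = max(sample))

# fun_aggregate: takes the dataset and the list of extracted objects,
# and returns the modified dataset
fun_aggregate <- function(dataset, extracted) {
  dataset$summary_matrix <- do.call(rbind, extracted)
  dataset
}

extracted <- list()
for (sample_name in names(dataset$samples)) {
  sample <- dataset$samples[[sample_name]]
  for (action in actions) sample <- action(sample)
  dataset$samples[[sample_name]] <- sample
  # Extract while the sample is still in memory, so it is not reloaded later
  extracted[[sample_name]] <- fun_extract(sample)
}

# Aggregate once, after all samples have been processed
dataset <- fun_aggregate(dataset, extracted)
```

The aggregated matrix has one row per sample, and it is built without reading any sample a second time.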