Data analysis — nmr_data_analysis • AlpsNMR

Data analysis in AlpsNMR can be performed on both nmr_dataset_1D full spectra and nmr_dataset_peak_table peak tables.

Usage

nmr_data_analysis(
  dataset,
  y_column,
  identity_column,
  external_val,
  internal_val,
  data_analysis_method,
  .enable_parallel = TRUE
)

Arguments

dataset: An nmr_dataset_family object
y_column: A string with the name of the y column (present in the metadata of the dataset)
identity_column: NULL or a string with the name of the identity column (present in the metadata of the dataset).
external_val, internal_val: A list with two elements: iterations and test_size. See random_subsampling for further details
data_analysis_method: An nmr_data_analysis_method object
.enable_parallel: Set to FALSE to disable parallelization.
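
As a rough illustration of what iterations and test_size control in external_val and internal_val, the following base-R sketch draws repeated random train/test splits. The helper sketch_random_subsampling is hypothetical and is not part of AlpsNMR; the package's own random_subsampling may differ in details such as rounding or how samples are grouped.

```r
# Hypothetical helper, NOT the AlpsNMR implementation: draws `iterations`
# random splits, each holding out a fraction `test_size` of the samples.
sketch_random_subsampling <- function(sample_idx, iterations, test_size) {
    num_test <- round(length(sample_idx) * test_size)
    lapply(seq_len(iterations), function(i) {
        # Pick positions rather than values, so short index vectors are safe
        test <- sort(sample_idx[sample.int(length(sample_idx), size = num_test)])
        list(training = setdiff(sample_idx, test), test = test)
    })
}

set.seed(1)
splits <- sketch_random_subsampling(1:32, iterations = 3, test_size = 0.25)

# With 32 samples and test_size = 0.25, each split keeps 24 samples for
# training and holds out 8 for testing
lengths(splits[[1]])
```

Each iteration draws its test set independently, so a given sample may appear in the test set of several iterations or of none.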

Value

A list with the following elements:

train_test_partitions: A list with the indices used for training and testing in each of the cross-validation iterations
inner_cv_results: The output returned by train_evaluate_model on each inner cross-validation
inner_cv_results_digested: The output returned by choose_best_inner.
outer_cv_results: The output returned by train_evaluate_model on each outer cross-validation
outer_cv_results_digested: The output returned by train_evaluate_model_digest_outer.

Details

The workflow consists of a double cross-validation strategy that uses random subsampling to split the samples into train and test sets. The classification model and the metric used to choose the best model can be customized (see new_nmr_data_analysis_method()). For now, the only implemented method is a PLSDA classification model that selects the best model by the area under the ROC curve (see the examples here and plsda_auroc_vip_method).
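
The nesting of the two validation loops can be sketched in base R as follows. This is a minimal sketch of the index structure only, assuming plain random subsampling; the helper split_once is hypothetical, model fitting is omitted, and the actual partitioning performed by nmr_data_analysis may differ.

```r
# Sketch of the double (nested) cross-validation index structure.
# Assumption: plain random subsampling; no model is trained here.
set.seed(42)
sample_idx <- 1:32
external_val <- list(iterations = 3, test_size = 0.25)
internal_val <- list(iterations = 3, test_size = 0.25)

# Hypothetical helper: one random train/test split of the indices in `idx`
split_once <- function(idx, test_size) {
    test <- idx[sample.int(length(idx), size = round(length(idx) * test_size))]
    list(training = setdiff(idx, test), test = test)
}

outer_splits <- lapply(seq_len(external_val$iterations), function(i) {
    outer <- split_once(sample_idx, external_val$test_size)
    # Inner splits only use the outer training samples, so the outer test
    # samples never influence the choice of hyperparameters (e.g. ncomp).
    outer$inner <- lapply(seq_len(internal_val$iterations), function(j) {
        split_once(outer$training, internal_val$test_size)
    })
    outer
})

# Sizes of the first outer split and of its first inner split
lengths(outer_splits[[1]][c("training", "test")])
lengths(outer_splits[[1]]$inner[[1]])
```

The inner loop is where candidate models would be compared, and the outer test samples are reserved for estimating the performance of the chosen model.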

Examples

# Data analysis for a table of integrated peaks

## Generate an artificial nmr_dataset_peak_table:
### Generate artificial metadata:
num_samples <- 32 # use an even number in this example
num_peaks <- 20
metadata <- data.frame(
    NMRExperiment = as.character(1:num_samples),
    Condition = rep(c("A", "B"), times = num_samples / 2)
)

### The matrix with peaks
peak_means <- runif(n = num_peaks, min = 300, max = 600)
peak_sd <- runif(n = num_peaks, min = 30, max = 60)
peak_matrix <- mapply(function(mu, sd) rnorm(num_samples, mu, sd),
    mu = peak_means, sd = peak_sd
)
colnames(peak_matrix) <- paste0("Peak", 1:num_peaks)

## Artificial differences depending on the condition:
peak_matrix[metadata$Condition == "A", "Peak2"] <-
    peak_matrix[metadata$Condition == "A", "Peak2"] + 70

peak_matrix[metadata$Condition == "A", "Peak6"] <-
    peak_matrix[metadata$Condition == "A", "Peak6"] - 60

### The nmr_dataset_peak_table
peak_table <- new_nmr_dataset_peak_table(
    peak_table = peak_matrix,
    metadata = list(external = metadata)
)

## We will use a double cross validation, splitting the samples with random
## subsampling both in the external and internal validation.
## The classification model will be a PLSDA, exploring at most 3 latent
## variables.
## The best model will be selected based on the area under the ROC curve
methodology <- plsda_auroc_vip_method(ncomp = 3)
model <- nmr_data_analysis(
    peak_table,
    y_column = "Condition",
    identity_column = NULL,
    external_val = list(iterations = 3, test_size = 0.25),
    internal_val = list(iterations = 3, test_size = 0.25),
    data_analysis_method = methodology
)
## Area under ROC for each outer cross-validation iteration:
model$outer_cv_results_digested$auroc
#> # A tibble: 3 × 3
#>   cv_outer_iteration ncomp   auc
#>                <int> <int> <dbl>
#> 1                  1     1 0.933
#> 2                  2     1 0.875
#> 3                  3     2 1    
## Rank Product of the Variable Importance in the Projection
## (Lower means more important)
sort(model$outer_cv_results_digested$vip_rankproducts)
#>     Peak2     Peak6    Peak17    Peak11     Peak3    Peak14    Peak20    Peak12 
#>  1.259921  1.587401  3.556893  5.646216  6.316360  7.651725  8.962809  9.049114 
#>     Peak9    Peak19    Peak15     Peak1     Peak4     Peak5     Peak7    Peak13 
#>  9.165656  9.654894 11.100998 11.686316 12.091887 12.428930 12.632719 13.782348 
#>    Peak16    Peak18    Peak10     Peak8 
#> 13.924767 14.227573 14.986655 17.324782
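
The rank products printed above are consistent with a geometric mean of per-iteration ranks: with three outer iterations, 1.259921 is the cube root of 2 and 1.587401 is the cube root of 4. The sketch below computes a rank product in that way from a small made-up VIP matrix. The data are hypothetical, and the exact computation inside AlpsNMR is an assumption here.

```r
# Hypothetical VIP scores: rows are peaks, columns are outer CV iterations.
vip <- matrix(
    c(2.1, 1.9, 1.7,
      1.8, 2.0, 1.9,
      0.4, 0.6, 0.5),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("Peak2", "Peak6", "Peak17"), paste0("iter", 1:3))
)

# Rank within each iteration; rank 1 is the largest VIP.
ranks <- apply(-vip, 2, rank)

# Rank product as the geometric mean of the ranks across iterations
# (assumed definition, lower means more important).
vip_rankproducts <- apply(ranks, 1, function(r) exp(mean(log(r))))
sort(vip_rankproducts)
```

A peak that ranks first in every iteration would get a rank product of exactly 1, which is the lowest possible value.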