1. Introduction

MSclassifR is an R package dedicated to the automated classification of mass spectra with machine learning methods. It was developed with the aim of identifying very similar species or phenotypes from mass spectra obtained by Matrix-Assisted Laser Desorption/Ionization Time-Of-Flight Mass Spectrometry (MALDI-TOF MS). However, the functions of this package can also be used to classify other categories associated with mass spectra, or mass spectra obtained with other mass spectrometry techniques. It includes easy-to-use functions for pre-processing mass spectra, determining discriminant mass-over-charge (m/z) values from a library of mass spectra corresponding to different categories, and predicting the category (species, phenotype, etc.) associated with a mass spectrum from a list of selected m/z values.

In this vignette, we seek to discriminate two species of the bacterial genus Klebsiella by MALDI-TOF mass spectrometry, using the MSclassifR package on a data set of 18 mass spectra (Lasch et al. (2016)). The raw data were downloaded from this link (Lasch et al. (2016)).

2. Install packages

The installation of the MSclassifR package requires packages from Bioconductor, so you may have to install the latest version of the BiocManager package first. The MSclassifR package imports the other necessary packages from CRAN. It is recommended to use the latest version of R.

## install BiocManager if not installed
    if (!require("BiocManager", quietly = TRUE))
        install.packages("BiocManager")

## Install the mixOmics and multtest packages from Bioconductor
   BiocManager::install(c("multtest","mixOmics", "limma", "qvalue"))

## Install MSclassifR package
install.packages("MSclassifR")
## For signal processing and machine learning
require(MSclassifR)
## Loading required package: MSclassifR
## For additional signal processing functions
### install MALDIquantForeign and MALDIquant if not installed:
#install.packages("MALDIquantForeign")
#install.packages("MALDIquant")
require(MALDIquantForeign)
## Loading required package: MALDIquantForeign
## Loading required package: MALDIquant
## 
## This is MALDIquant version 1.21
## Quantitative Analysis of Mass Spectrometry Data
##  See '?MALDIquant' for more information about this package.
require(MALDIquant)

## For additional machine learning functions 

### install caret if not installed:
#install.packages("caret")
require(caret)
## Loading required package: caret
## Loading required package: ggplot2
## Loading required package: lattice

3. Import spectra and metadata

## URLs to load the MS data and metadata
MSdataK <- "https://agodmer.github.io/MSData/Klebsiella/KlebsiellaRKIspectra.RData"

MSMetadataK <- "https://agodmer.github.io/MSData/Klebsiella/KlebsiellaRKImetadata.RData"

## Load mass spectra
load(url(MSdataK))

## Load metadata
load(url(MSMetadataK))
## To import spectra in Bruker .flex format:
#MyBrukerFlexspectra <- MALDIquantForeign::importBrukerFlex("YourPathway")

4. Plotting mass spectra

## Plotting mass spectra
MSclassifR::PlotSpectra(SpectralData=KlebsiellaRKIspectra[[1]])

5. Signal processing and peaks detection

We need to perform the signal processing (intensity transformation, smoothing, baseline removal) and peak detection (peak detection and peak binning) steps with the SignalProcessing and PeakDetection functions. Of note, the optimal workflow can be determined by changing the default parameters of these two functions; see ?SignalProcessing and ?PeakDetection for more details.

## Signal processing of mass spectra
spectra <- MSclassifR::SignalProcessing(KlebsiellaRKIspectra)
## Peaks detection
peaks <- MSclassifR::PeakDetection(spectra, labels = KlebsiellaRKImetadata$number)
## Inspect spectra after signal processing
MSclassifR::PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],col_spec="blue",col_peak="black")

## Build an intensity matrix from the detected peaks
IntMat <- MALDIquant::intensityMatrix(peaks)

## Rows are named according to selected metadata
rownames(IntMat) <-  paste(KlebsiellaRKImetadata$strains, KlebsiellaRKImetadata$number) 

## Replace "NA" values (peaks absent in a spectrum) with 0 in the intensity matrix
IntMat[is.na(IntMat)] <- 0

## Normalize each spectrum according to its maximum intensity
## Create a basic normalization function
norma <- function(x) x/max(x)
## Apply it to each row; note that apply() returns the result transposed
IntMat <- apply(IntMat, 1, norma)
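As a sanity check, the behaviour of apply() here can be verified on a toy matrix: applying over rows returns the result with spectra as columns, which is why the matrix is transposed back afterwards. This is a minimal sketch using only base R; the toy values are invented for illustration.

```r
## Toy 2 x 3 intensity matrix: two spectra (rows), three peaks (columns)
toy <- matrix(c(2, 4, 1,
                10, 5, 20), nrow = 2, byrow = TRUE)
norma <- function(x) x/max(x)
## apply() over rows returns one normalized spectrum per *column*...
toy_norm <- apply(toy, 1, norma)
## ...so transposing restores one spectrum per row, each with maximum 1
t(toy_norm)
##      [,1] [,2] [,3]
## [1,]  0.5 1.00 0.25
## [2,]  0.5 0.25 1.00
```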

## Transpose the matrix back (spectra as rows) for statistical analysis and name it X
X <- t(IntMat)
Y <- factor(KlebsiellaRKImetadata$species)
set.seed(123) # For reproducibility
## Split the data into training and test sets to tune the models on the training set

## Create the index with p = 0.7 (70% of spectra in the training set)
Index_skfold <- caret::createDataPartition(Y, p = 0.7, list=FALSE)

## Train set creation
X_train <- X[Index_skfold,]
Y_train <- Y[Index_skfold]

## Test set creation
X_test <- X[-Index_skfold,]
Y_test <- Y[-Index_skfold]

## Reformat the Y_test labels (replace spaces with dots) to match the predicted class names
Y_testb <- gsub(" ", ".", Y_test)
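The space-to-dot substitution mirrors what caret does internally when it turns factor levels into valid R variable names via make.names(), so the reference labels match the predicted class names. A minimal base-R sketch (the species label below is only an example):

```r
## caret-style conversion of a factor level into a valid R name
make.names("Klebsiella pneumoniae")
## [1] "Klebsiella.pneumoniae"
## ...which is equivalent to the gsub() call above for simple labels
identical(make.names("Klebsiella pneumoniae"),
          gsub(" ", ".", "Klebsiella pneumoniae"))
## [1] TRUE
```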

6. Variables selection

The selection of discriminant mass-over-charge values can be performed with several methods available in the SelectionVar function of the MSclassifR package. The RFERF method uses the Recursive Feature Elimination (RFE) algorithm coupled with random forests (RF); the RFEGlmnet method uses RFE coupled with logistic regression; the sPLSDA method uses sparse partial least squares discriminant analysis (sPLS-DA) (Rohart et al. (2017)) and selects variables from the ones kept in the latent components of the model, with an automatic choice of the number of components; the VSURF method uses RF with the VSURF function of the eponymous package (Genuer et al. (2015)).

## a. using RFE and Random forest model
set.seed(90) # for reproducibility
a <- SelectionVar(X_train,
                  Y_train,
                  MethodSelection = c("RFERF"),
                  MethodValidation = c("LOOCV"),
                  Metric = "Accuracy",
                  PreProcessing = c("center","scale","nzv","corr"),
                  Sizes = c(5:8))


## b. using RFE and glmnet (logistic regression) model
set.seed(90) # for reproducibility
b <- SelectionVar(X_train,
                  Y_train,
                  MethodSelection = c("RFEGlmnet") ,
                  MethodValidation = c("LOOCV"),
                  Metric = "Accuracy",
                  PreProcessing = c("center","scale","nzv","corr"),
                  Sizes = c(5:8))


## c. using the sPLSDA method
set.seed(90) # for reproducibility
c <- SelectionVar(X_train,
                  Y_train,
                  MethodSelection = c("sPLSDA"),
                  MethodValidation = c("LOOCV"),
                  PreProcessing = c("scale","nzv"),
                  Sizes = c(5:8))

## d. using VSURF
set.seed(90) # for reproducibility
d <- SelectionVar(X_train,
                  Y_train,
                  MethodSelection = c("VSURF"))
## Keep selected variables in a list
list_moz <- list("RFERF" = a$sel_moz,
                 "RFEglmnet" = b$sel_moz,
                 "sPLSDA" = c$sel_moz,
                 "VSURF" = d$sel_moz)
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
Peaks2=a$sel_moz,col_spec="blue",col_peak="black")

PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
Peaks2=b$sel_moz,col_spec="blue",col_peak="black")

MSclassifR::PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
Peaks2=c$sel_moz,col_spec="blue",col_peak="black")

MSclassifR::PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
Peaks2=d$sel_moz,col_spec="blue",col_peak="black")

7. Estimation of a multinomial logistic regression

7.1 Use the selected variables with RFERF method

With the selected discriminant masses, we can create a prediction model. The LogReg function allows estimating a prediction model with k-fold cross-validation in order to predict the category to which a mass spectrum belongs. Two main kinds of models can be estimated: linear, or nonlinear with neural networks (nnet method), random forests (rf method), support vector machines with a linear kernel (svm method), or eXtreme Gradient Boosting (xgb method). Hyperparameters are randomly searched, except for eXtreme Gradient Boosting, where a grid search is performed.

In this part, we create different prediction models with the discriminant mass-over-charge values previously selected using the four methods described above.

Each prediction model is estimated with repeated k-fold cross-validation (k = 2, 2 repeats) on the training set. Next, the accuracy of each estimated model is evaluated on the test set.

set.seed(123) # for reproducibility
## Select variables found with RFERF 
sel_moz=a$sel_moz

## linear multinomial regression
model_lm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2)
#Estimated model:
#model_lm

## nonlinear multinomial regression using neural networks
model_nn=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="nnet")
#Estimated model:
#model_nn

## nonlinear multinomial regression using random forests
model_rf=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="rf")
#Estimated model:
#model_rf

## nonlinear multinomial regression using xgboost
model_xgb=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="xgb")
#Estimated model:
#model_xgb

## nonlinear multinomial regression using svm
model_svm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="svm")
#Estimated model:
#model_svm

Let’s look at the performance of the different prediction models.

## Keep models in a list
list_model_A <- list("lm" = model_lm$train_mod,
                      "nnet" = model_nn$train_mod,
                      "rf" = model_rf$train_mod,
                      "Xgboost" = model_xgb$train_mod,
                      "svm" = model_svm$train_mod)

## Plot performance of the prediction models
model_A <- caret::resamples(list_model_A)
bwplot(model_A)

## Probabilities of belonging to each category for the mass spectra on test set
prob_cat=MSclassifR::PredictLogReg(peaks = peaks[-Index_skfold],
                                   model = list_model_A,
                                   moz = sel_moz,
                                   Reference = Y_test)
Estimating the accuracy of each method on the test data set:

## Split the table according to the method used
ResultatsModelA <- split(prob_cat$Prob.results, prob_cat$Prob.results$method)

## Compute the accuracy of each method
DF_A <- data.frame(lapply(ResultatsModelA, function(x)(sum(Y_testb == x[,5])/length(Y_testb))), row.names = "RFERF accuracy")

t(DF_A)
##             RFERF accuracy
## comb_fisher              1
## max_vote                 1
## multinom                 1
## nnet                     1
## rf                       1
## svmLinear2               1
## xgbTree                  1
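As a cross-check of the hand-computed accuracy, caret::confusionMatrix gives the same figure together with additional metrics such as sensitivity and kappa. This is a minimal sketch on invented toy labels, not on the prob_cat results above:

```r
library(caret)
## Toy predicted and reference labels (invented for illustration)
pred <- factor(c("K.pneumoniae", "K.variicola", "K.pneumoniae", "K.variicola"))
ref  <- factor(c("K.pneumoniae", "K.variicola", "K.variicola",  "K.variicola"))
cm <- caret::confusionMatrix(pred, ref)
## 3 of the 4 toy predictions match the reference
cm$overall["Accuracy"]
## Accuracy 
##     0.75 
```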

7.2 Use the selected variables with RFEGlmnet method

set.seed(123) # for reproducibility
## Select variables found with RFEGlmnet
sel_moz=b$sel_moz

## linear multinomial regression
model_lm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2)
#Estimated model:
#model_lm

## nonlinear multinomial regression using neural networks
model_nn=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="nnet")
#Estimated model:
#model_nn

## nonlinear multinomial regression using random forests
model_rf=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="rf")
#Estimated model:
#model_rf

## nonlinear multinomial regression using xgboost
model_xgb=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="xgb")
#Estimated model:
#model_xgb

## nonlinear multinomial regression using svm
model_svm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="svm")
#Estimated model:
#model_svm
## Keep models in a list
list_model_B <- list("lm" = model_lm$train_mod,
                      "nnet" = model_nn$train_mod,
                      "rf" = model_rf$train_mod,
                      "Xgboost" = model_xgb$train_mod,
                      "svm" = model_svm$train_mod)

## Plot performance of the prediction models
model_B <- caret::resamples(list_model_B)
bwplot(model_B)

## Probabilities of belonging to each category for the mass spectra on test set
prob_cat=MSclassifR::PredictLogReg(peaks = peaks[-Index_skfold],
                                   model = list_model_B,
                                   moz = sel_moz,
                                   Reference = Y_test)
## Split the table according to the method used
ResultatsModelB <- split(prob_cat$Prob.results, prob_cat$Prob.results$method)

## Compute the accuracy of each method
DF_B <- data.frame(lapply(ResultatsModelB, function(x)(sum(Y_testb == x[,5])/length(Y_testb))), row.names = "RFEGlmnet accuracy")

t(DF_B)
##             RFEGlmnet accuracy
## comb_fisher                1.0
## max_vote                   1.0
## multinom                   1.0
## nnet                       1.0
## rf                         1.0
## svmLinear2                 1.0
## xgbTree                    0.4

7.3 Use the selected variables with sPLSDA method

set.seed(123) # for reproducibility
## Select variables found with sPLSDA
sel_moz=c$sel_moz

## linear multinomial regression
model_lm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2)
#Estimated model:
#model_lm

## nonlinear multinomial regression using neural networks
model_nn=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy",  number=2, repeats=2, kind="nnet")
#Estimated model:
#model_nn

## nonlinear multinomial regression using random forests
model_rf=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="rf")
#Estimated model:
#model_rf

## nonlinear multinomial regression using xgboost
model_xgb=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="xgb")
#Estimated model:
#model_xgb

## nonlinear multinomial regression using svm
model_svm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="svm")
#Estimated model:
#model_svm
## Keep models in a list
list_model_C <- list("lm" = model_lm$train_mod,
                      "nnet" = model_nn$train_mod,
                      "rf" = model_rf$train_mod,
                      "Xgboost" = model_xgb$train_mod,
                      "svm" = model_svm$train_mod)

## Plot performance of the prediction models
model_C <- caret::resamples(list_model_C)
bwplot(model_C)

## Probabilities of belonging to each category for the mass spectra on test set
prob_cat=MSclassifR::PredictLogReg(peaks = peaks[-Index_skfold],
                                   model = list_model_C,
                                   moz = sel_moz,
                                   Reference = Y_test)
## Split the table according to the method used
ResultatsModelC <- split(prob_cat$Prob.results, prob_cat$Prob.results$method)

## Compute the accuracy of each method
DF_C <- data.frame(lapply(ResultatsModelC, function(x)(sum(Y_testb == x[,5])/length(Y_testb))), row.names = "sPLSDA accuracy")

t(DF_C)
##             sPLSDA accuracy
## comb_fisher             0.4
## max_vote                0.4
## multinom                0.4
## nnet                    0.4
## rf                      1.0
## svmLinear2              0.4
## xgbTree                 0.6

Prediction models cannot be created with the VSURF method in this example because only one variable was selected. Finally, the accuracies of all the methods used on the test data set can be summarized:

## Summarize results on test set
DFf <- rbind(DF_A, DF_B, DF_C)

## Print results
t(DFf)
##             RFERF accuracy RFEGlmnet accuracy sPLSDA accuracy
## comb_fisher              1                1.0             0.4
## max_vote                 1                1.0             0.4
## multinom                 1                1.0             0.4
## nnet                     1                1.0             0.4
## rf                       1                1.0             1.0
## svmLinear2               1                1.0             0.4
## xgbTree                  1                0.4             0.6

8. Conclusion

This vignette illustrates how MSclassifR allows estimating many different machine learning-based models to identify two species of the genus Klebsiella from MALDI-TOF mass spectra, and how to use these models to predict the species associated with a new mass spectrum. In this example, 14 of the 21 models reached an accuracy equal to 1. For instance, perfect accuracies were obtained on the test data set with variable selection by RFE-RF and RFE-Glmnet coupled with the multinom, nnet, rf and svm methods.

References

## Session info
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
## [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
## [5] LC_TIME=French_France.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] caret_6.0-92           lattice_0.20-45        ggplot2_3.3.6         
## [4] MALDIquantForeign_0.13 MALDIquant_1.21        MSclassifR_0.3.0      
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.2                  ggstance_0.3.5             
##   [3] tidyselect_1.1.2            grid_4.1.3                 
##   [5] ranger_0.14.1               BiocParallel_1.28.3        
##   [7] maptools_1.1-4              pROC_1.18.0                
##   [9] munsell_0.5.0               codetools_0.2-18           
##  [11] mutoss_0.1-12               performanceEstimation_1.1.0
##  [13] xgboost_1.6.0.1             future_1.27.0              
##  [15] withr_2.5.0                 colorspace_2.0-3           
##  [17] Biobase_2.54.0              highr_0.9                  
##  [19] knitr_1.39                  rstudioapi_0.13            
##  [21] stats4_4.1.3                robustbase_0.95-0          
##  [23] listenv_0.8.0               labeling_0.4.2             
##  [25] Rdpack_2.4                  mnormt_2.1.0               
##  [27] polyclip_1.10-0             farver_2.1.1               
##  [29] parallelly_1.32.1           vctrs_0.4.1                
##  [31] generics_0.1.3              TH.data_1.1-1              
##  [33] cp4p_0.3.6                  ipred_0.9-13               
##  [35] xfun_0.31                   geepack_1.3.4              
##  [37] randomForest_4.7-1.1        R6_2.5.1                   
##  [39] doParallel_1.0.17           cachem_1.0.6               
##  [41] reshape_0.8.9               assertthat_0.2.1           
##  [43] scales_1.2.0                multcomp_1.4-19            
##  [45] nnet_7.3-17                 gtable_0.3.0               
##  [47] globals_0.15.1              sandwich_3.0-2             
##  [49] timeDate_4021.104           rlang_1.0.4                
##  [51] Rborist_0.2-3               splines_4.1.3              
##  [53] ModelMetrics_1.2.2.2        broom_1.0.0                
##  [55] mosaicCore_0.9.0            yaml_2.3.5                 
##  [57] reshape2_1.4.4              abind_1.4-5                
##  [59] backports_1.4.1             qvalue_2.26.0              
##  [61] tools_4.1.3                 lava_1.6.10                
##  [63] gstat_2.0-9                 ellipsis_0.3.2             
##  [65] readBrukerFlexData_1.9.0    jquerylib_0.1.4            
##  [67] RColorBrewer_1.1-3          proxy_0.4-27               
##  [69] geeM_0.10.1                 BiocGenerics_0.40.0        
##  [71] ggformula_0.10.1            ggridges_0.5.3             
##  [73] TFisher_0.2.0               Rcpp_1.0.9                 
##  [75] plyr_1.8.7                  base64enc_0.1-3            
##  [77] waveslim_1.8.3              purrr_0.3.4                
##  [79] rpart_4.1.16                MBA_0.0-9                  
##  [81] zoo_1.8-10                  haven_2.5.0                
##  [83] ggrepel_0.9.1               magrittr_2.0.3             
##  [85] data.table_1.14.2           RSpectra_0.16-1            
##  [87] spacetime_1.2-8             mvtnorm_1.1-3              
##  [89] matrixStats_0.62.0          hms_1.1.1                  
##  [91] evaluate_0.15               XML_3.99-0.10              
##  [93] mclust_5.4.10               gridExtra_2.3              
##  [95] shape_1.4.6                 compiler_4.1.3             
##  [97] ellipse_0.4.3               tibble_3.1.8               
##  [99] crayon_1.5.1                htmltools_0.5.3            
## [101] corpcor_1.6.10              tidyr_1.2.0                
## [103] lubridate_1.8.0             DBI_1.1.3                  
## [105] tweenr_1.0.2                MASS_7.3-55                
## [107] Matrix_1.4-0                car_3.1-0                  
## [109] cli_3.3.0                   rbibutils_2.2.8            
## [111] parallel_4.1.3              metap_1.8                  
## [113] gower_1.0.0                 qqconf_1.2.3               
## [115] igraph_1.3.4                forcats_0.5.1              
## [117] pkgconfig_2.0.3             sn_2.0.2                   
## [119] numDeriv_2016.8-1.1         foreign_0.8-82             
## [121] signal_0.7-7                sp_1.5-0                   
## [123] readMzXmlData_2.8.1         recipes_1.0.1              
## [125] foreach_1.5.2               rARPACK_0.11-0             
## [127] bslib_0.4.0                 hardhat_1.2.0              
## [129] multtest_2.50.0             VSURF_1.1.0                
## [131] prodlim_2019.11.13          stringr_1.4.0              
## [133] digest_0.6.29               rmarkdown_2.14             
## [135] intervals_0.15.2            MESS_0.5.9                 
## [137] MALDIrppa_1.1.0-1           lifecycle_1.0.1            
## [139] nlme_3.1-155                mltools_0.3.5              
## [141] jsonlite_1.8.0              carData_3.0-5              
## [143] mixOmics_6.18.1             UBL_0.0.7                  
## [145] limma_3.50.3                fansi_1.0.3                
## [147] labelled_2.9.1              pillar_1.8.0               
## [149] fuzzyjoin_0.1.6             fastmap_1.1.0              
## [151] plotrix_3.8-2               DEoptimR_1.0-11            
## [153] survival_3.2-13             glue_1.6.2                 
## [155] xts_0.12.1                  FNN_1.1.3.1                
## [157] iterators_1.0.14            glmnet_4.1-4               
## [159] ggforce_0.3.3               class_7.3-20               
## [161] stringi_1.7.6               sass_0.4.2                 
## [163] automap_1.0-16              mathjaxr_1.6-0             
## [165] dplyr_1.0.9                 e1071_1.7-11               
## [167] future.apply_1.9.0
Genuer,R. et al. (2015) VSURF: An R Package for Variable Selection Using Random Forests. The R Journal, 7, 19–33.
Gibb,S. and Strimmer,K. (2012) MALDIquant: a versatile R package for the analysis of mass spectrometry data. Bioinformatics, 28, 2270–2271.
Kuhn,M. (2008) Building predictive models in r using the caret package. Journal of Statistical Software, Articles, 28, 1–26.
Lasch,P. et al. (2016) A MALDI-TOF Mass Spectrometry Database for Identification and Classification of Highly Pathogenic Microorganisms from the Robert Koch- Institute (RKI).
Rohart,F. et al. (2017) mixOmics: An R package for ’omics feature selection and multiple data integration. PLoS Computational Biology, 13, e1005752.