MSclassifR is an R package dedicated to the automated classification of mass spectra with machine learning methods. It was developed with the aim of identifying very similar species or phenotypes from mass spectra obtained by Matrix-Assisted Laser Desorption Ionisation - Time Of Flight Mass Spectrometry (MALDI-TOF MS). However, the different functions of this package can also be used to classify other categories associated with mass spectra, or mass spectra obtained with other mass spectrometry techniques. It includes easy-to-use functions for pre-processing mass spectra, functions to determine discriminant mass-over-charge values (m/z) from a library of mass spectra corresponding to different categories, and functions to predict the category (species, phenotype, etc.) associated with a mass spectrum from a list of selected mass-over-charge values.
In this vignette, we seek to discriminate two species of the bacterial genus Klebsiella by MALDI-TOF mass spectrometry with a data set of 18 mass spectra (Lasch et al. (2016)) using the MSclassifR package. The raw data were downloaded from this link (Lasch et al. (2016)).
The installation of the MSclassifR package requires the installation of packages from Bioconductor, so you might have to install the latest version of the BiocManager package. The MSclassifR package imports the other necessary packages from CRAN. It is recommended to use the latest version of R.
Install the MSclassifR package using the following code:
## Install BiocManager if not installed
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
## Install the multtest, mixOmics, limma and qvalue packages from Bioconductor
BiocManager::install(c("multtest","mixOmics", "limma", "qvalue"))
## Install MSclassifR package
install.packages("MSclassifR")
Once the MSclassifR package is installed, we also need the MALDIquantForeign, MALDIquant and caret packages from CRAN. They can be installed with the command install.packages().
## For signal processing and machine learning
require(MSclassifR)
## Loading required package: MSclassifR
## For additional signal processing functions
### install MALDIquantForeign and MALDIquant if not installed:
#install.packages("MALDIquantForeign")
#install.packages("MALDIquant")
require(MALDIquantForeign)
## Loading required package: MALDIquantForeign
## Loading required package: MALDIquant
##
## This is MALDIquant version 1.21
## Quantitative Analysis of Mass Spectrometry Data
## See '?MALDIquant' for more information about this package.
require(MALDIquant)
## For additional machine learning functions
### install caret if not installed:
#install.packages("caret")
require(caret)
## Loading required package: caret
## Loading required package: ggplot2
## Loading required package: lattice
## URLs to load the MS data and metadata
MSdataK <- "https://agodmer.github.io/MSData/Klebsiella/KlebsiellaRKIspectra.RData"
MSMetadataK <- "https://agodmer.github.io/MSData/Klebsiella/KlebsiellaRKImetadata.RData"
## Load mass spectra
load(url(MSdataK))
## Load metadata
load(url(MSMetadataK))
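A quick inspection of the loaded objects can help to check that the download succeeded. The following is a minimal optional sketch using the object names loaded above:
## Optional: quick inspection of the loaded objects
length(KlebsiellaRKIspectra) # 18 mass spectra
str(KlebsiellaRKImetadata)   # metadata stored as a data frame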
If your mass spectra are in the .flex format from a Bruker MALDI-TOF MS instrument, you could use the importBrukerFlex function from the MALDIquantForeign package. Of note, you will have to create your own corresponding metadata as a data frame in this case.
## For importing data in the .flex format
#MyBrukerFlexspectra <- MALDIquantForeign::importBrukerFlex("YourPathway")
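As a minimal, hypothetical sketch of such a metadata data frame (the species and strains columns mirror those of KlebsiellaRKImetadata; adapt the values to your own samples):
## Hypothetical metadata for spectra imported from .flex files
#MyMetadata <- data.frame(species = rep("Unknown species", length(MyBrukerFlexspectra)),
#                         strains = rep("Unknown strain", length(MyBrukerFlexspectra)))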
## Plotting mass spectra
MSclassifR::PlotSpectra(SpectralData=KlebsiellaRKIspectra[[1]])
We need to perform the signal processing steps (including intensity transformation, smoothing and baseline removal) and the peak detection steps (including peak detection and peak binning) with the SignalProcessing and PeakDetection functions. Of note, the optimal workflow can be determined by changing the default parameters of these two functions; see ?SignalProcessing and ?PeakDetection for more details.
## Signal processing of mass spectra
spectra <- MSclassifR::SignalProcessing(KlebsiellaRKIspectra)
## Peaks detection
peaks <- MSclassifR::PeakDetection(spectra, labels = KlebsiellaRKImetadata$number)
## Inspect spectra after signal processing
MSclassifR::PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],col_spec="blue",col_peak="black")
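For finer control over the pre-processing, the same steps can also be reproduced directly with the MALDIquant toolbox. The sketch below uses hypothetical parameter values and is not necessarily equivalent to the defaults of SignalProcessing and PeakDetection:
## Manual pre-processing sketch with MALDIquant (hypothetical parameters)
#s <- MALDIquant::transformIntensity(KlebsiellaRKIspectra, method = "sqrt")
#s <- MALDIquant::smoothIntensity(s, method = "SavitzkyGolay", halfWindowSize = 10)
#s <- MALDIquant::removeBaseline(s, method = "SNIP", iterations = 100)
#s <- MALDIquant::calibrateIntensity(s, method = "TIC")
#p <- MALDIquant::detectPeaks(s, method = "MAD", SNR = 3)
#p <- MALDIquant::binPeaks(p, tolerance = 0.002)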
An intensity matrix can then be built from the detected peaks using the intensityMatrix function of the MALDIquant package (Gibb and Strimmer (2012)).
## Compute an intensity matrix from the detected peaks
IntMat <- MALDIquant::intensityMatrix(peaks)
## Rows are named according to the selected metadata
rownames(IntMat) <- paste(KlebsiellaRKImetadata$strains, KlebsiellaRKImetadata$number)
## Replace the NA values in the intensity matrix by 0
IntMat[is.na(IntMat)] <- 0
## Normalize peaks according to the maximum intensity of each spectrum
## Create a basic normalization function
norma <- function(x) x/max(x)
## Apply this function to each spectrum (row)
IntMat <- apply(IntMat, 1, norma)
## Transpose the matrix for statistical analysis and name it X
X <- t(IntMat)
Y <- factor(KlebsiellaRKImetadata$species)
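After these steps, each row of X corresponds to one normalized spectrum and Y holds its species label. A quick optional sanity check is that every row maximum equals 1 after the normalization:
## Optional sanity check: each normalized spectrum has a maximum intensity of 1
all(abs(apply(X, 1, max) - 1) < 1e-12)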
We split the data into a train set and a test set for the machine learning step using the caret package (Kuhn (2008)) and its createDataPartition function.
set.seed(123) # For reproducibility
## Split data to tune the model according to the train set
## Index creation with p = 0.7
Index_skfold <- caret::createDataPartition(Y, p = 0.7, list = FALSE)
## Train set creation
X_train <- X[Index_skfold,]
Y_train <- Y[Index_skfold]
## Test set creation
X_test <- X[-Index_skfold,]
Y_test <- Y[-Index_skfold]
## Replace spaces by dots in Y_test to match the labels returned by the models
Y_testb <- gsub(" ", ".", Y_test)
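Since createDataPartition performs a stratified split, both species should remain represented in the train and the test sets. This can be verified with a quick optional check:
## Optional check: class balance in the train and test sets
table(Y_train)
table(Y_test)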
The selection of discriminant mass-over-charge values can be performed with several methods proposed in the SelectionVar function of our MSclassifR package. The RFERF method uses the Recursive Feature Elimination (RFE) algorithm coupled with Random Forests (RF); the RFEGlmnet method uses RFE coupled with logistic regression; the sPLSDA method uses sparse partial least squares discriminant analysis (sPLS-DA) (Rohart et al. (2017)) and selects variables from the ones kept in the latent components of the model, using an automatic choice of the number of components; the VSURF method uses RF with the VSURF function and package (Genuer et al. (2015)). The VSURF method performs variable selection in three steps: a thresholding step, an interpretation step and a prediction step (Genuer et al. (2015)).
## a. using RFE and a Random Forest model
set.seed(90) # for reproducibility
a <- SelectionVar(X_train,
Y_train,
MethodSelection = c("RFERF"),
MethodValidation = c("LOOCV"),
Metric = "Accuracy",
PreProcessing = c("center","scale","nzv","corr"),
Sizes = c(5:8))
## b. using RFE and a Glmnet (logistic regression) model
set.seed(90) # for reproducibility
b <- SelectionVar(X_train,
Y_train,
MethodSelection = c("RFEGlmnet") ,
MethodValidation = c("LOOCV"),
Metric = "Accuracy",
PreProcessing = c("center","scale","nzv","corr"),
Sizes = c(5:8))
## c. using the sPLSDA method
set.seed(90) # for reproducibility
c <- SelectionVar(X_train,
Y_train,
MethodSelection = c("sPLSDA"),
MethodValidation = c("LOOCV"),
PreProcessing = c("scale","nzv"),
Sizes = c(5:8))
## d. using VSURF
set.seed(90) # for reproducibility
d <- SelectionVar(X_train,
Y_train,
MethodSelection = c("VSURF"))
## Keep selected variables in a list
list_moz <- list("RFERF" = a$sel_moz,
"RFEglmnet" = b$sel_moz,
"sPLSDA" = c$sel_moz,
"VSURF" = d$sel_moz)
The m/z values selected with the RFERF method are displayed with the red dotted lines.
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
            Peaks2=a$sel_moz,col_spec="blue",col_peak="black")
The m/z values selected with the RFEGlmnet method are displayed with the red dotted lines.
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
            Peaks2=b$sel_moz,col_spec="blue",col_peak="black")
The m/z values selected with the sPLSDA method are displayed with the red dotted lines.
MSclassifR::PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
            Peaks2=c$sel_moz,col_spec="blue",col_peak="black")
The m/z values selected with the VSURF method are displayed with the red dotted lines.
MSclassifR::PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
            Peaks2=d$sel_moz,col_spec="blue",col_peak="black")
Only one m/z value was selected with the VSURF method. This is insufficient to build a prediction model, so this selection method will not be included in the further analyses.
With the selected discriminant masses, we can create a prediction model. The LogReg function allows estimating a prediction model with k-fold cross-validation in order to predict the category to which a mass spectrum belongs. Two main kinds of models can be estimated: linear, or nonlinear with neural networks (nnet method), random forests (rf method), support vector machines with a linear kernel (svm method), or eXtreme Gradient Boosting (xgb method). Hyperparameters are randomly searched, except for eXtreme Gradient Boosting, where a grid search is performed.
In this part, we create different prediction models with the discriminant mass-over-charge values previously selected by the four methods described above. Each prediction model is estimated with repeated k-fold cross-validation (k = 2, with 2 repeats) on the train data set. Next, the accuracies of the estimated models are evaluated on the test data set.
set.seed(123) # for reproducibility
## Select variables found with RFERF
sel_moz=a$sel_moz
## linear multinomial regression
model_lm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2)
#Estimated model:
#model_lm
## nonlinear multinomial regression using neural networks
model_nn=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="nnet")
#Estimated model:
#model_nn
## nonlinear multinomial regression using random forests
model_rf=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="rf")
#Estimated model:
#model_rf
## nonlinear multinomial regression using xgboost
model_xgb=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="xgb")
#Estimated model:
#model_xgb
## nonlinear multinomial regression using svm
model_svm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="svm")
#Estimated model:
#model_svm
Let’s look at the performances of the different prediction models.
## Keep models in a list
list_model_A <- list("lm" = model_lm$train_mod,
"nnet" = model_nn$train_mod,
"rf" = model_rf$train_mod,
"Xgboost" = model_xgb$train_mod,
"svm" = model_svm$train_mod)
## Plot performances of the prediction models
model_A <- caret::resamples(list_model_A)
bwplot(model_A)
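A textual summary of the same resampling results can complete the box-and-whisker plot; this sketch assumes that the train_mod elements returned by LogReg are standard caret train objects, as used above:
## Textual summary of the cross-validation performances
summary(model_A)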
## Probabilities of belonging to each category for the mass spectra of the test set
prob_cat=MSclassifR::PredictLogReg(peaks = peaks[-Index_skfold],
model = list_model_A,
moz = sel_moz,
Reference = Y_test)
## Split the table according to the method used
ResultatsModelA <- split(prob_cat$Prob.results, prob_cat$Prob.results$method)
## Compute the accuracy for each method
DF_A <- data.frame(lapply(ResultatsModelA, function(x)(sum(Y_testb == x[,5])/length(Y_testb))), row.names = "RFERF accuracy")
t(DF_A)
## RFERF accuracy
## comb_fisher 1
## max_vote 1
## multinom 1
## nnet 1
## rf 1
## svmLinear2 1
## xgbTree 1
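Beyond raw accuracy, a confusion matrix gives a more detailed view for a given method. The sketch below makes the same assumption as the accuracy computation above, namely that the fifth column of Prob.results holds the predicted category:
## Confusion matrix for one method (here "multinom"), as an illustration
caret::confusionMatrix(data = factor(ResultatsModelA$multinom[, 5],
                                     levels = levels(factor(Y_testb))),
                       reference = factor(Y_testb))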
set.seed(123) # for reproducibility
## Select variables found with RFEGlmnet
sel_moz=b$sel_moz
## linear multinomial regression
model_lm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2)
#Estimated model:
#model_lm
## nonlinear multinomial regression using neural networks
model_nn=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="nnet")
#Estimated model:
#model_nn
## nonlinear multinomial regression using random forests
model_rf=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="rf")
#Estimated model:
#model_rf
## nonlinear multinomial regression using xgboost
model_xgb=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="xgb")
#Estimated model:
#model_xgb
## nonlinear multinomial regression using svm
model_svm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="svm")
#Estimated model:
#model_svm
## Keep models in a list
list_model_B <- list("lm" = model_lm$train_mod,
"nnet" = model_nn$train_mod,
"rf" = model_rf$train_mod,
"Xgboost" = model_xgb$train_mod,
"svm" = model_svm$train_mod)
## Plot performances of the prediction models
model_B <- caret::resamples(list_model_B)
bwplot(model_B)
## Probabilities of belonging to each category for the mass spectra of the test set
prob_cat=MSclassifR::PredictLogReg(peaks = peaks[-Index_skfold],
model = list_model_B,
moz = sel_moz,
Reference = Y_test)
## Split the table according to the method used
ResultatsModelB <- split(prob_cat$Prob.results, prob_cat$Prob.results$method)
## Compute the accuracy for each method
DF_B <- data.frame(lapply(ResultatsModelB, function(x)(sum(Y_testb == x[,5])/length(Y_testb))), row.names = "RFEGlmnet accuracy")
t(DF_B)
## RFEGlmnet accuracy
## comb_fisher 1.0
## max_vote 1.0
## multinom 1.0
## nnet 1.0
## rf 1.0
## svmLinear2 1.0
## xgbTree 0.4
set.seed(123) # for reproducibility
## Select variables found with sPLSDA
sel_moz=c$sel_moz
## linear multinomial regression
model_lm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2)
#Estimated model:
#model_lm
## nonlinear multinomial regression using neural networks
model_nn=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="nnet")
#Estimated model:
#model_nn
## nonlinear multinomial regression using random forests
model_rf=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="rf")
#Estimated model:
#model_rf
## nonlinear multinomial regression using xgboost
model_xgb=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="xgb")
#Estimated model:
#model_xgb
## nonlinear multinomial regression using svm
model_svm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), Metric = "Accuracy", number=2, repeats=2, kind="svm")
#Estimated model:
#model_svm
## Keep models in a list
list_model_C <- list("lm" = model_lm$train_mod,
"nnet" = model_nn$train_mod,
"rf" = model_rf$train_mod,
"Xgboost" = model_xgb$train_mod,
"svm" = model_svm$train_mod)
## Plot performances of the prediction models
model_C <- caret::resamples(list_model_C)
bwplot(model_C)
## Probabilities of belonging to each category for the mass spectra of the test set
prob_cat=MSclassifR::PredictLogReg(peaks = peaks[-Index_skfold],
model = list_model_C,
moz = sel_moz,
Reference = Y_test)
## Split the table according to the method used
ResultatsModelC <- split(prob_cat$Prob.results, prob_cat$Prob.results$method)
## Compute the accuracy for each method
DF_C <- data.frame(lapply(ResultatsModelC, function(x)(sum(Y_testb == x[,5])/length(Y_testb))), row.names = "sPLSDA accuracy")
t(DF_C)
## sPLSDA accuracy
## comb_fisher 0.4
## max_vote 0.4
## multinom 0.4
## nnet 0.4
## rf 1.0
## svmLinear2 0.4
## xgbTree 0.6
Prediction models could not be created with the VSURF method in this example because only one variable was selected. Finally, the accuracies of all the methods used on the test data set can be summarized:
## Summarize results on test set
DFf <- rbind(DF_A, DF_B, DF_C)
## Print results
t(DFf)
## RFERF accuracy RFEGlmnet accuracy sPLSDA accuracy
## comb_fisher 1 1.0 0.4
## max_vote 1 1.0 0.4
## multinom 1 1.0 0.4
## nnet 1 1.0 0.4
## rf 1 1.0 1.0
## svmLinear2 1 1.0 0.4
## xgbTree 1 0.4 0.6
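As a quick check of the conclusion below, the number of selection/model combinations reaching a perfect accuracy can be counted directly from this table:
## Count the combinations with a perfect accuracy on the test set (14 of 21)
sum(DFf == 1)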
This vignette illustrates how MSclassifR allows estimating many different machine learning-based models to identify two species of the genus Klebsiella from MALDI-TOF mass spectra, and how to use these models to predict the species of a new incoming mass spectrum. In this example, 14 of the 21 models have an accuracy equal to 1. For instance, perfect accuracies were obtained on the test data set with variable selection by RFERF and RFEGlmnet coupled with the multinom, nnet, rf and svm methods.
## Session info
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
## [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
## [5] LC_TIME=French_France.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] caret_6.0-92 lattice_0.20-45 ggplot2_3.3.6
## [4] MALDIquantForeign_0.13 MALDIquant_1.21 MSclassifR_0.3.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.2 ggstance_0.3.5
## [3] tidyselect_1.1.2 grid_4.1.3
## [5] ranger_0.14.1 BiocParallel_1.28.3
## [7] maptools_1.1-4 pROC_1.18.0
## [9] munsell_0.5.0 codetools_0.2-18
## [11] mutoss_0.1-12 performanceEstimation_1.1.0
## [13] xgboost_1.6.0.1 future_1.27.0
## [15] withr_2.5.0 colorspace_2.0-3
## [17] Biobase_2.54.0 highr_0.9
## [19] knitr_1.39 rstudioapi_0.13
## [21] stats4_4.1.3 robustbase_0.95-0
## [23] listenv_0.8.0 labeling_0.4.2
## [25] Rdpack_2.4 mnormt_2.1.0
## [27] polyclip_1.10-0 farver_2.1.1
## [29] parallelly_1.32.1 vctrs_0.4.1
## [31] generics_0.1.3 TH.data_1.1-1
## [33] cp4p_0.3.6 ipred_0.9-13
## [35] xfun_0.31 geepack_1.3.4
## [37] randomForest_4.7-1.1 R6_2.5.1
## [39] doParallel_1.0.17 cachem_1.0.6
## [41] reshape_0.8.9 assertthat_0.2.1
## [43] scales_1.2.0 multcomp_1.4-19
## [45] nnet_7.3-17 gtable_0.3.0
## [47] globals_0.15.1 sandwich_3.0-2
## [49] timeDate_4021.104 rlang_1.0.4
## [51] Rborist_0.2-3 splines_4.1.3
## [53] ModelMetrics_1.2.2.2 broom_1.0.0
## [55] mosaicCore_0.9.0 yaml_2.3.5
## [57] reshape2_1.4.4 abind_1.4-5
## [59] backports_1.4.1 qvalue_2.26.0
## [61] tools_4.1.3 lava_1.6.10
## [63] gstat_2.0-9 ellipsis_0.3.2
## [65] readBrukerFlexData_1.9.0 jquerylib_0.1.4
## [67] RColorBrewer_1.1-3 proxy_0.4-27
## [69] geeM_0.10.1 BiocGenerics_0.40.0
## [71] ggformula_0.10.1 ggridges_0.5.3
## [73] TFisher_0.2.0 Rcpp_1.0.9
## [75] plyr_1.8.7 base64enc_0.1-3
## [77] waveslim_1.8.3 purrr_0.3.4
## [79] rpart_4.1.16 MBA_0.0-9
## [81] zoo_1.8-10 haven_2.5.0
## [83] ggrepel_0.9.1 magrittr_2.0.3
## [85] data.table_1.14.2 RSpectra_0.16-1
## [87] spacetime_1.2-8 mvtnorm_1.1-3
## [89] matrixStats_0.62.0 hms_1.1.1
## [91] evaluate_0.15 XML_3.99-0.10
## [93] mclust_5.4.10 gridExtra_2.3
## [95] shape_1.4.6 compiler_4.1.3
## [97] ellipse_0.4.3 tibble_3.1.8
## [99] crayon_1.5.1 htmltools_0.5.3
## [101] corpcor_1.6.10 tidyr_1.2.0
## [103] lubridate_1.8.0 DBI_1.1.3
## [105] tweenr_1.0.2 MASS_7.3-55
## [107] Matrix_1.4-0 car_3.1-0
## [109] cli_3.3.0 rbibutils_2.2.8
## [111] parallel_4.1.3 metap_1.8
## [113] gower_1.0.0 qqconf_1.2.3
## [115] igraph_1.3.4 forcats_0.5.1
## [117] pkgconfig_2.0.3 sn_2.0.2
## [119] numDeriv_2016.8-1.1 foreign_0.8-82
## [121] signal_0.7-7 sp_1.5-0
## [123] readMzXmlData_2.8.1 recipes_1.0.1
## [125] foreach_1.5.2 rARPACK_0.11-0
## [127] bslib_0.4.0 hardhat_1.2.0
## [129] multtest_2.50.0 VSURF_1.1.0
## [131] prodlim_2019.11.13 stringr_1.4.0
## [133] digest_0.6.29 rmarkdown_2.14
## [135] intervals_0.15.2 MESS_0.5.9
## [137] MALDIrppa_1.1.0-1 lifecycle_1.0.1
## [139] nlme_3.1-155 mltools_0.3.5
## [141] jsonlite_1.8.0 carData_3.0-5
## [143] mixOmics_6.18.1 UBL_0.0.7
## [145] limma_3.50.3 fansi_1.0.3
## [147] labelled_2.9.1 pillar_1.8.0
## [149] fuzzyjoin_0.1.6 fastmap_1.1.0
## [151] plotrix_3.8-2 DEoptimR_1.0-11
## [153] survival_3.2-13 glue_1.6.2
## [155] xts_0.12.1 FNN_1.1.3.1
## [157] iterators_1.0.14 glmnet_4.1-4
## [159] ggforce_0.3.3 class_7.3-20
## [161] stringi_1.7.6 sass_0.4.2
## [163] automap_1.0-16 mathjaxr_1.6-0
## [165] dplyr_1.0.9 e1071_1.7-11
## [167] future.apply_1.9.0