MSclassifR
is a R package dedicated for Automated
classification of mass-spectra with machine learning methods.It was
developed with the aim of identifying very similar species or phenotypes
of bacteria from mass spectra obtained by Matrix Assisted Laser
Desorption Ionisation - Time Of Flight Mass Spectrometry (MALDI-TOF MS).
However, the different functions of this package can also be used to
classify other categories associated to mass spectra; or from mass
spectra obtained with other mass spectrometry techniques. It includes
easy-to-use functions for pre-processing mass spectra, functions to
determine discriminant mass-over-charge values (m/z) from a library of
mass spectra corresponding to different categories, and functions to
predict the category (species, phenotypes, etc.) associated to a mass
spectrum from a list of selected mass-over-charge values.
In this vignette, we search to discriminate five species of the genus
Ecrobia by MALDI-TOF mass spectrometry with a data set of 55
mass spectra (Lasch et al.
(2016)) using MSclassifR
package. The raw data
concerning from a research paper (Wilke et
al., 2020) were downloaded from this link.
The installation of the MSclassifR
package requires the
installation of packages from Bioconductor
, so you you
might have to install the latest version of the BiocManager
package. The MSclassifR
package imports the other necessary
packages from the CRAN. In addition, it is recommended to install the
latest version of R or versions >= 4.0.
MSclassifR
using the following code:## install BiocManager if not installed
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
## Install the mixOmics and multtest packages from Bioconductor
BiocManager::install(c("multtest","mixOmics"))
## Install MSclassifR package
install.packages("MSclassifR")
MSclassifR
package:## For spectral easy spectral treatment and machine learning
require(MSclassifR)
## Le chargement a nécessité le package : MSclassifR
## Warning: le package 'MSclassifR' a été compilé avec la version R 4.1.3
MALDIquantForeign
, MALDIquant
and
caret
packages from the CRAN. The installation is possible
with the command install.packages()
.## time measurement
start_time <- Sys.time()
## For additional spectral treatment functions
require(MALDIquantForeign)
## Le chargement a nécessité le package : MALDIquantForeign
## Le chargement a nécessité le package : MALDIquant
##
## This is MALDIquant version 1.21
## Quantitative Analysis of Mass Spectrometry Data
## See '?MALDIquant' for more information about this package.
require(MALDIquant)
## For additional machine learning functions
require(caret)
## Le chargement a nécessité le package : caret
## Warning: le package 'caret' a été compilé avec la version R 4.1.3
## Le chargement a nécessité le package : ggplot2
## Le chargement a nécessité le package : lattice
For this vignette, we will load mass spectra and their respective metadata from a public repository on GitHub.
## url for load the MS data
MSdataE <- "https://agodmer.github.io/MSData/Ecrobia/MassSpectra_Ecrobia.Rdata"
MSMetadataE <- "https://agodmer.github.io/MSData/Ecrobia/metaData_Ecrobia.Rdata"
## Load mass spectra
### Mass spectra
load(url(MSdataE))
### Metadata
load(url(MSMetadataE))
If you want to use your own data with .flex format from Bruker
MALDI-TOF MS, you could use importBrukerFlex
from
MALDIquantForeign package
. Of note, you will have to create
your own corresponding metadata as a data frame in this case.
## for data .flex format import
#MyBrukerFlexspectra <- MALDIquantForeign::importBrukerFlex("YourPathway")
It can be useful to inspect the spectra visually before starting the analysis. This step allows eliminating poor quality spectra.
## Visual quality control
MSclassifR::PlotSpectra(SpectralData=MassSpectra_Ecrobia[[1]])
## Signal processing
spectra <- MSclassifR::SignalProcessing(MassSpectra_Ecrobia)
## Peaks detection
peaks <- MSclassifR::PeakDetection(spectra, labels = metaData_Ecrobia$Number_strain)
You can plot the spectral processing performed and the peaks selected (black crosses).
## Inspect spectra after signal processing
MSclassifR::PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],col_spec="blue",col_peak="black")
At this step, we create an intensity matrix with m/z values in
columns and sample names in rows data with the
intensityMatrix
function in MALDIquant
package
(Gibb and Strimmer (2012)).
## Perfom an Intensity Matrix with spectra data
IntMat <- MALDIquant::intensityMatrix(peaks)
## Rows are named according to selected metadata
rownames(IntMat) <- paste(metaData_Ecrobia$Strain_name_spot, metaData_Ecrobia$Number_strain)
## Remove "NA" in the intensityMatrix
IntMat[is.na(IntMat)] <- 0
## Normalise peaks according to the maximum intensity
## Create basic function
norma<-function(x) x/(max(x))
## Apply this function
IntMat <- apply(IntMat,1,norma)
## Transpose Matrix for statistical analysis and named X
X <- t(IntMat)
Y <- factor(metaData_Ecrobia$Species)
The dataset is splitted it into training (70%) and test (30%)
datasets using caret
package Kuhn
(2008) and the createDataPartition
function for
machine learning step.
set.seed(123) # For reproductibility
## Split data to tune the model according train set
## Index creation with p = 0,7
Index_skfold <- caret::createDataPartition(Y, p = 0.7, list=FALSE)
## Train set creation
X_train <- X[Index_skfold,]
Y_train <- Y[Index_skfold]
## Test set creation
X_test <- X[-Index_skfold,]
Y_test <- Y[-Index_skfold]
## Format Y_test as a factor for analysis
Y_testb <- gsub(" ", ".",Y_test)
The selection of discriminant mass-over-charge values can be
performed with several methods proposed in the SelectionVar
function of our MSclassifR
package. The RFERF
method uses the Recursive Feature Elimination (RFE) algorithm coupled
with Random Forests (RF); the RFEGlmnet
method uses RFE
coupled with Logistic regression; the sPLSDA
method uses
sparse partial least squares discriminant analysis (sPLS-DA) (Rohart F et al. (2017)). The sPLSDA
method selects variables from the ones kept in latent components of the
model using an automatic choice of the number of components; the
VSURF
method uses RF with the VSURF
function
and package (Genuer et al.
(2015)).
For this example, we used a k-fold cross validation (k = 2) with 5 to
70 variables. The VSURF
method performs three steps
variables selection : thresholding step, interpretation step and
prediction step (Genuer et al.
(2015)).
## a. using RFE and Random forest model
set.seed(2) # for reproducibility
a <- SelectionVar(X_train,
Y_train,
MethodSelection = c("RFERF"),
MethodValidation = c("cv"),
NumberCV = 2,
PreProcessing = c("center","scale","nzv","corr"),
Sizes = c(5:70))
## b. using RFE and Random forest model
set.seed(2) # for reproducibility
b <- SelectionVar(X_train,
Y_train,
MethodSelection = c("RFEGlmnet") ,
MethodValidation = c("cv"),
NumberCV = 2,
PreProcessing = c("center","scale","nzv","corr"),
Sizes = c(5:70))
## c. using sPLDA method
set.seed(2) # for reproducibility
c <- SelectionVar(X_train,
Y_train,
MethodSelection = c("sPLSDA"),
MethodValidation = c("cv"),
NumberCV = 2,
PreProcessing = c("scale","nzv"),
Sizes = c(5:70))
## d. using VSURF
set.seed(2) # for reproducibility
d <- SelectionVar(X_train,
Y_train,
MethodSelection = c("VSURF"))
## Keep selected variables in a list
list_moz <- list("RFERF" = a$sel_moz,
"RFEglmnet" = b$sel_moz,
"sPLSDA" = c$sel_moz,
"VSURF" = d$sel_moz)
RFERF
with
the red dotted lines.PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
Peaks2=a$sel_moz,col_spec="blue",col_peak="black")
RFEglmnet
method with the red dotted lines.PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
Peaks2=b$sel_moz,col_spec="blue",col_peak="black")
sPLSDA
method with the red dotted lines.MSclassifR::PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
Peaks2=c$sel_moz,col_spec="blue",col_peak="black")
VSURF
method
with the red dotted lines.MSclassifR::PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
Peaks2=d$sel_moz,col_spec="blue",col_peak="black")
With the selected discriminant masses, we can create a prediction
model. The LogReg
function allows estimating a prediction
model with k-fold cross validation in order to predict the category to
which a mass spectrum belongs. Two main kinds of models can be
estimated: linear or nonlinear (with neural networks (nnet
method), random forests (rf
method), support vector
machines with linear kernel (svm
method), or eXtreme
Gradient Boosting (xgb
method)). Hyperparameters are
randomly searched, except for the eXtreme Gradient Boosting where a grid
search is performed.
In this part, we will create different prediction models with discriminant mass-over-charge values previously selected ones using the 4 methods described above.
Each prediction models were estimated with a repeated k-fold cross validation (k = 2, and 2 repeats)in the train data set. Next, the accuracies of estimated models are evaluated on the test data set.
## Select variables found with RFERF
sel_moz=a$sel_moz
## linear multinomial regression
model_lm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2)
#Estimated model:
#model_lm
## nonlinear multinomial regression using neural networks
model_nn=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2, kind="nnet")
#Estimated model:
#model_nn
## nonlinear multinomial regression using random forests
model_rf=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2, kind="rf")
#Estimated model:
#model_rf
## nonlinear multinomial regression using xgboost
model_xgb=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number = 2, repeats=2, kind="xgb")
#Estimated model:
#model_xgb
## nonlinear multinomial regression using svm
model_svm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2, kind="svm")
#Estimated model:
#model_svm
Let’s look at the performances of different prediction models on test dataset.
## Keep models in a list
list_model_A <- list("lm" = model_lm$train_mod,
"nnet" = model_nn$train_mod,
"rf" = model_rf$train_mod,
"Xgboost" = model_xgb$train_mod,
"svm" = model_svm$train_mod)
## Plot performances of prediction model
model_A <- caret::resamples(list_model_A)
bwplot(model_A)
## Probabilities of belonging to each category for the mass spectra on test set
prob_cat=MSclassifR::Predict_LogReg(peaks = peaks[-Index_skfold],
model = list_model_A,
moz = sel_moz)
## Split table according method used
ResultatsModelA <- split(prob_cat,prob_cat$method)
## Calcul accuracy for each method
DF_A <- data.frame(lapply(ResultatsModelA, function(x)(sum(Y_testb == x[,8])/length(Y_testb))), row.names = "RFERF accuracy")
t(DF_A)
## RFERF accuracy
## comb_fisher 0.21428571
## max_vote 0.28571429
## multinom 0.85714286
## nnet 0.85714286
## rf 1.00000000
## svmLinear2 0.07142857
## xgbTree 0.85714286
## Select variables found with RFGlmnet
sel_moz=b$sel_moz
## linear multinomial regression
model_lm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2)
#Estimated model:
#model_lm
## nonlinear multinomial regression using neural networks
model_nn=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2, kind="nnet")
#Estimated model:
#model_nn
## nonlinear multinomial regression using random forests
model_rf=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2, kind="rf")
#Estimated model:
#model_rf
## nonlinear multinomial regression using xgboost
model_xgb=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2, kind="xgb")
#Estimated model:
#model_xgb
## nonlinear multinomial regression using svm
model_svm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2, kind="svm")
#Estimated model:
#model_svm
## Keep models in a list
list_model_B <- list("lm" = model_lm$train_mod,
"nnet" = model_nn$train_mod,
"rf" = model_rf$train_mod,
"Xgboost" = model_xgb$train_mod,
"svm" = model_svm$train_mod)
## Plot performances of prediction model
model_B <- caret::resamples(list_model_B)
bwplot(model_B)
## Probabilities of belonging to each category for the mass spectra on test set
prob_cat=MSclassifR::Predict_LogReg(peaks = peaks[-Index_skfold],
model = list_model_B,
moz = sel_moz)
## Split table according method used
ResultatsModelB <- split(prob_cat,prob_cat$method)
## Calcul accuracy for each method
DF_B <- data.frame(lapply(ResultatsModelB, function(x)(sum(Y_testb == x[,8])/length(Y_testb))), row.names = "RFEGlmnet accuracy")
t(DF_B)
## RFEGlmnet accuracy
## comb_fisher 0.2857143
## max_vote 0.2857143
## multinom 1.0000000
## nnet 1.0000000
## rf 0.9285714
## svmLinear2 0.4285714
## xgbTree 0.7857143
## Select variables found with RFGlmnet
sel_moz=c$sel_moz
## linear multinomial regression
model_lm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2)
#Estimated model:
#model_lm
## nonlinear multinomial regression using neural networks
model_nn=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2, kind="nnet")
#Estimated model:
#model_nn
## nonlinear multinomial regression using random forests
model_rf=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2, kind="rf")
#Estimated model:
#model_rf
## nonlinear multinomial regression using xgboost
model_xgb=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2, kind="xgb")
#Estimated model:
#model_xgb
## nonlinear multinomial regression using svm
model_svm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2, kind="svm")
#Estimated model:
#model_svm
## Keep models in a list
list_model_C <- list("lm" = model_lm$train_mod,
"nnet" = model_nn$train_mod,
"rf" = model_rf$train_mod,
"Xgboost" = model_xgb$train_mod,
"svm" = model_svm$train_mod)
## Plot performances of prediction model
model_C <- caret::resamples(list_model_C)
bwplot(model_C)
## Probabilities of belonging to each category for the mass spectra on test set
prob_cat=MSclassifR::Predict_LogReg(peaks = peaks[-Index_skfold],
model = list_model_C,
moz = sel_moz)
## Split table according method used
ResultatsModelC <- split(prob_cat,prob_cat$method)
## Calcul accuracy for each method
DF_C <- data.frame(lapply(ResultatsModelC, function(x)(sum(Y_testb == x[,8])/length(Y_testb))), row.names = "sPLSDA accuracy")
t(DF_C)
## sPLSDA accuracy
## comb_fisher 0.2857143
## max_vote 0.2142857
## multinom 1.0000000
## nnet 1.0000000
## rf 1.0000000
## svmLinear2 0.2857143
## xgbTree 0.7857143
## Select variables found with RFGlmnet
sel_moz=d$sel_moz
## linear multinomial regression
model_lm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2)
#Estimated model:
#model_lm
## nonlinear multinomial regression using neural networks
model_nn=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2, kind="nnet")
#Estimated model:
#model_nn
## nonlinear multinomial regression using random forests
model_rf=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2, kind="rf")
#Estimated model:
#model_rf
## nonlinear multinomial regression using xgboost
model_xgb=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2, kind="xgb")
#Estimated model:
#model_xgb
## nonlinear multinomial regression using svm
model_svm=MSclassifR::LogReg(X=X_train, moz=sel_moz, Y=factor(Y_train), number=2, repeats=2, kind="svm")
#Estimated model:
#model_svm
## Keep models in a list
list_model_D <- list("lm" = model_lm$train_mod,
"nnet" = model_nn$train_mod,
"rf" = model_rf$train_mod,
"Xgboost" = model_xgb$train_mod,
"svm" = model_svm$train_mod)
## Plot performances of prediction model
model_D <- caret::resamples(list_model_D)
bwplot(model_D)
## Probabilities of belonging to each category for the mass spectra on test set
prob_cat=MSclassifR::Predict_LogReg(peaks = peaks[-Index_skfold],
model = list_model_D,
moz = as.numeric(sel_moz))
## Split table according method used
ResultatsModelD <- split(prob_cat,prob_cat$method)
## Calcul accuracy for each method
DF_D <- data.frame(lapply(ResultatsModelC, function(x)(sum(Y_testb == x[,8])/length(Y_testb))), row.names = "VSURF accuracy")
t(DF_D)
## VSURF accuracy
## comb_fisher 0.2857143
## max_vote 0.2142857
## multinom 1.0000000
## nnet 1.0000000
## rf 1.0000000
## svmLinear2 0.2857143
## xgbTree 0.7857143
the accuracies of all the used method in the test dataset can be summarized:
## Summarize results on test set
DFf <- rbind(DF_A, DF_B, DF_C, DF_D)
## Print results
t(DFf)
## RFERF accuracy RFEGlmnet accuracy sPLSDA accuracy VSURF accuracy
## comb_fisher 0.21428571 0.2857143 0.2857143 0.2857143
## max_vote 0.28571429 0.2857143 0.2142857 0.2142857
## multinom 0.85714286 1.0000000 1.0000000 1.0000000
## nnet 0.85714286 1.0000000 1.0000000 1.0000000
## rf 1.00000000 0.9285714 1.0000000 1.0000000
## svmLinear2 0.07142857 0.4285714 0.2857143 0.2857143
## xgbTree 0.85714286 0.7857143 0.7857143 0.7857143
This vignette illustrates how MSclassifR
allows
estimating many different machine learning-based models to identify two
species of the genus Ecrobia from MALDI-TOF mass spectra, and
how to use these models to predict the species of a new coming mass
spectrum. In the example we used, nines models of the 28 models have
accuracy equal to 1. For instance, perfect accuracies were obtained on
the test data set with variables selection by RFE-Glmnet
,
sPLSDA
and VSURF
coupled with lm
,
nnet
methods.
## time measurement
end_time <- Sys.time()
end_time - start_time
## Time difference of 48.34758 mins
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
## [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
## [5] LC_TIME=French_France.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] caret_6.0-91 lattice_0.20-45 ggplot2_3.3.5
## [4] MALDIquantForeign_0.13 MALDIquant_1.21 MSclassifR_0.2.0
##
## loaded via a namespace (and not attached):
## [1] sn_2.0.2 plyr_1.8.6 igraph_1.2.11
## [4] splines_4.1.2 BiocParallel_1.28.3 listenv_0.8.0
## [7] qqconf_1.1.1 TH.data_1.1-0 MALDIrppa_1.1.0-1
## [10] digest_0.6.29 foreach_1.5.2 htmltools_0.5.2
## [13] fansi_1.0.3 magrittr_2.0.2 doParallel_1.0.17
## [16] recipes_0.2.0 globals_0.14.0 gower_1.0.0
## [19] matrixStats_0.61.0 rARPACK_0.11-0 sandwich_3.0-1
## [22] hardhat_0.2.0 colorspace_2.0-3 signal_0.7-7
## [25] ggrepel_0.9.1 rbibutils_2.2.7 xfun_0.30
## [28] dplyr_1.0.8 crayon_1.5.1 jsonlite_1.8.0
## [31] readBrukerFlexData_1.8.5 survival_3.2-13 zoo_1.8-9
## [34] iterators_1.0.14 glue_1.6.2 gtable_0.3.0
## [37] ipred_0.9-12 readMzXmlData_2.8.1 future.apply_1.8.1
## [40] shape_1.4.6 BiocGenerics_0.40.0 DEoptimR_1.0-11
## [43] scales_1.2.0 mvtnorm_1.1-3 DBI_1.1.2
## [46] Rcpp_1.0.8 metap_1.8 plotrix_3.8-2
## [49] tmvnsim_1.0-2 proxy_0.4-26 stats4_4.1.2
## [52] lava_1.6.10 prodlim_2019.11.13 Rborist_0.2-3
## [55] glmnet_4.1-3 RColorBrewer_1.1-3 TFisher_0.2.0
## [58] ellipsis_0.3.2 pkgconfig_2.0.3 XML_3.99-0.9
## [61] farver_2.1.0 nnet_7.3-16 sass_0.4.1
## [64] utf8_1.2.2 tidyselect_1.1.2 labeling_0.4.2
## [67] rlang_1.0.2 reshape2_1.4.4 munsell_0.5.0
## [70] tools_4.1.2 xgboost_1.5.2.1 cli_3.2.0
## [73] generics_0.1.2 ranger_0.13.1 mathjaxr_1.6-0
## [76] evaluate_0.15 stringr_1.4.0 fastmap_1.1.0
## [79] yaml_2.3.5 fuzzyjoin_0.1.6 ModelMetrics_1.2.2.2
## [82] knitr_1.38 robustbase_0.93-9 purrr_0.3.4
## [85] randomForest_4.7-1 future_1.24.0 nlme_3.1-153
## [88] compiler_4.1.2 rstudioapi_0.13 e1071_1.7-9
## [91] waveslim_1.8.2 tibble_3.1.6 bslib_0.3.1
## [94] stringi_1.7.6 highr_0.9 RSpectra_0.16-0
## [97] Matrix_1.3-4 VSURF_1.1.0 multtest_2.50.0
## [100] vctrs_0.3.8 mutoss_0.1-12 pillar_1.7.0
## [103] lifecycle_1.0.1 Rdpack_2.3 jquerylib_0.1.4
## [106] data.table_1.14.2 corpcor_1.6.10 R6_2.5.1
## [109] gridExtra_2.3 parallelly_1.31.0 codetools_0.2-18
## [112] MASS_7.3-54 assertthat_0.2.1 withr_2.5.0
## [115] mnormt_2.0.2 multcomp_1.4-18 parallel_4.1.2
## [118] mixOmics_6.18.1 grid_4.1.2 rpart_4.1-15
## [121] timeDate_3043.102 tidyr_1.2.0 class_7.3-19
## [124] rmarkdown_2.13 pROC_1.18.0 numDeriv_2016.8-1.1
## [127] Biobase_2.54.0 lubridate_1.8.0 base64enc_0.1-3
## [130] ellipse_0.4.2