1. Introduction

MSclassifR is a R package dedicated for automated classification of mass-spectra with machine learning methods. It was developed with the aim of identifying very similar species or phenotypes from mass spectra obtained by Matrix Assisted Laser Desorption Ionisation - Time Of Flight Mass Spectrometry (MALDI-TOF MS). However, the different functions of this package can also be used to classify other categories associated to mass spectra; or from mass spectra obtained with other mass spectrometry techniques. It includes easy-to-use functions for pre-processing mass spectra, functions to determine discriminant mass-over-charge values (m/z) from a library of mass spectra corresponding to different categories, and functions to predict the category (species, phenotypes, etc.) associated to a mass spectrum from a list of selected mass-over-charge values.

In this vignette, we apply the variable selection methods available in MSclassifR to select features (peptides, proteins, metabolites, …) that have been quantified in samples belonging to several biological conditions.

2. Importing data in R

To illustrate, we use a gold standard dataset available in the cp4p R package. The method can also be applied on other datasets composed of quantified values by MS in several samples.

library(cp4p)

# dataset from the cp4p R package
data("LFQRatio2")
#head of the dataset
head(LFQRatio2)
##       A.R1     A.R2     A.R3     B.R1     B.R2     B.R3 Welch.test.pval
## 1 29.77090 29.74317 29.84074 28.85878 28.77653 28.86370     0.000019600
## 2 24.03103 24.64009 24.39310 23.42070 23.39777 23.39856     0.032817385
## 3 28.03507 27.92677 27.89972 26.99389 27.18830 27.30287     0.005204326
## 4 28.26083 28.17760 28.26700 27.17026 27.27719 27.18158     0.000025900
## 5 27.39125 27.40795 27.47112 26.69750 26.70028 26.62840     0.000025000
## 6 27.72605 27.79743 27.61491 26.68030 26.73931 26.68043     0.000975483
##   Organism                             Majority.protein.IDs
## 1    human P02768upsedyp ALBU_HUMAN_upsedyp - CON__P02768-1
## 2    human    P69905upsedyp HBA_HUMAN_upsedyp - CON__P01966
## 3    human                O00762upsedyp UBE2C_HUMAN_upsedyp
## 4    human                 O76070upsedyp SYUG_HUMAN_upsedyp
## 5    human                 P00167upsedyp CYB5_HUMAN_upsedyp
## 6    human                P00709upsedyp LALBA_HUMAN_upsedyp
#Log2 intensities of proteins in the first sample of the first biological condition 
plot(LFQRatio2[,1],type="l")

3. Variable selection

Next we can applied the SelectionVar function to select features:

library(MSclassifR)

#select the abundance values of the dataset
Xtrain=t(as.matrix(LFQRatio2[,1:6]))
#names of the proteins
colnames(Xtrain)=LFQRatio2[,9]

#selection of proteins
set.seed(20) #for reproductibility 
list_diff=SelectionVar(X=Xtrain,
                       Y=factor(c(rep("A",3),rep("B",3))),
                       MethodSelection = c("RFERF"),
                       MethodValidation = c("cv"),
                       Metric = "Accuracy",
                       NumberCV = 2,
                       PreProcessing = NULL,
                       Sizes = c(2:70))
## Selection variables with RFE and RF method
## Le chargement a nécessité le package : ggplot2
## Le chargement a nécessité le package : lattice
# Print results
list_diff
## $result
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (2 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy Kappa AccuracySD KappaSD Selected
##          2        1     1          0       0        *
##          3        1     1          0       0         
##          4        1     1          0       0         
##          5        1     1          0       0         
##          6        1     1          0       0         
##          7        1     1          0       0         
##          8        1     1          0       0         
##          9        1     1          0       0         
##         10        1     1          0       0         
##         11        1     1          0       0         
##         12        1     1          0       0         
##         13        1     1          0       0         
##         14        1     1          0       0         
##         15        1     1          0       0         
##         16        1     1          0       0         
##         17        1     1          0       0         
##         18        1     1          0       0         
##         19        1     1          0       0         
##         20        1     1          0       0         
##         21        1     1          0       0         
##         22        1     1          0       0         
##         23        1     1          0       0         
##         24        1     1          0       0         
##         25        1     1          0       0         
##         26        1     1          0       0         
##         27        1     1          0       0         
##         28        1     1          0       0         
##         29        1     1          0       0         
##         30        1     1          0       0         
##         31        1     1          0       0         
##         32        1     1          0       0         
##         33        1     1          0       0         
##         34        1     1          0       0         
##         35        1     1          0       0         
##         36        1     1          0       0         
##         37        1     1          0       0         
##         38        1     1          0       0         
##         39        1     1          0       0         
##         40        1     1          0       0         
##         41        1     1          0       0         
##         42        1     1          0       0         
##         43        1     1          0       0         
##         44        1     1          0       0         
##         45        1     1          0       0         
##         46        1     1          0       0         
##         47        1     1          0       0         
##         48        1     1          0       0         
##         49        1     1          0       0         
##         50        1     1          0       0         
##         51        1     1          0       0         
##         52        1     1          0       0         
##         53        1     1          0       0         
##         54        1     1          0       0         
##         55        1     1          0       0         
##         56        1     1          0       0         
##         57        1     1          0       0         
##         58        1     1          0       0         
##         59        1     1          0       0         
##         60        1     1          0       0         
##         61        1     1          0       0         
##         62        1     1          0       0         
##         63        1     1          0       0         
##         64        1     1          0       0         
##         65        1     1          0       0         
##         66        1     1          0       0         
##         67        1     1          0       0         
##         68        1     1          0       0         
##         69        1     1          0       0         
##         70        1     1          0       0         
##       1481        1     1          0       0         
## 
## The top 2 variables (out of 2):
##    sp P06169 PDC1_YEAST, sp P10592 HSP72_YEAST
## 
## 
## $sel_moz
## [1] "P02768upsedyp ALBU_HUMAN_upsedyp - CON__P02768-1"
## [2] "P02788upsedyp TRFL_HUMAN_upsedyp"                
## [3] "P69905upsedyp HBA_HUMAN_upsedyp - CON__P01966"   
## [4] "sp P06169 PDC1_YEAST"

As you can see, we tried to select between 2 and 70 proteins (Sizes argument), and the selection using random forest (RFERF) allows reaching an accuracy of 1 with few human proteins, i.e. the conditions A and B can be predicted using these proteins:

list_diff$sel_moz
## [1] "P02768upsedyp ALBU_HUMAN_upsedyp - CON__P02768-1"
## [2] "P02788upsedyp TRFL_HUMAN_upsedyp"                
## [3] "P69905upsedyp HBA_HUMAN_upsedyp - CON__P01966"   
## [4] "sp P06169 PDC1_YEAST"

This function uses 2 cross-validations, it selects at least 2 proteins at each cross-validation. The union of all the selected proteins along the cross-validations can thus be superior to 2.

If you are searching all the proteins that are differential (instead of finding the ones allowing to predict the conditions with accuracy), you can use the SelectionVarStat function:

#selection of proteins
list_diff=SelectionVarStat(X=Xtrain,
                           Y=factor(c(rep("A",3),rep("B",3))))
## abh method is used. pi0 is estimated to 0.9743417 .
list_diff$sel_moz
##  [1] "P02768upsedyp ALBU_HUMAN_upsedyp - CON__P02768-1"
##  [2] "P69905upsedyp HBA_HUMAN_upsedyp - CON__P01966"   
##  [3] "O00762upsedyp UBE2C_HUMAN_upsedyp"               
##  [4] "O76070upsedyp SYUG_HUMAN_upsedyp"                
##  [5] "P00167upsedyp CYB5_HUMAN_upsedyp"                
##  [6] "P00709upsedyp LALBA_HUMAN_upsedyp"               
##  [7] "P00915upsedyp CAH1_HUMAN_upsedyp"                
##  [8] "P00918upsedyp CAH2_HUMAN_upsedyp"                
##  [9] "P01008upsedyp ANT3_HUMAN_upsedyp"                
## [10] "P01112upsedyp RASH_HUMAN_upsedyp"                
## [11] "P01375upsedyp TNFA_HUMAN_upsedyp"                
## [12] "P02753upsedyp RET4_HUMAN_upsedyp"                
## [13] "P02787upsedyp TRFE_HUMAN_upsedyp"                
## [14] "P02788upsedyp TRFL_HUMAN_upsedyp"                
## [15] "P04040upsedyp CATA_HUMAN_upsedyp"                
## [16] "P05413upsedyp FABPH_HUMAN_upsedyp"               
## [17] "P06396upsedyp GELS_HUMAN_upsedyp"                
## [18] "P06732upsedyp KCRM_HUMAN_upsedyp"                
## [19] "P08263upsedyp GSTA1_HUMAN_upsedyp"               
## [20] "P08758upsedyp ANXA5_HUMAN_upsedyp"               
## [21] "P09211upsedyp GSTP1_HUMAN_upsedyp"               
## [22] "P10599upsedyp THIO_HUMAN_upsedyp"                
## [23] "P10636-8upsedyp TAU_HUMAN_upsedyp"               
## [24] "P12081upsedyp SYHC_HUMAN_upsedyp"                
## [25] "P16083upsedyp NQO2_HUMAN_upsedyp"                
## [26] "P41159upsedyp LEP_HUMAN_upsedyp"                 
## [27] "P51965upsedyp UB2E1_HUMAN_upsedyp"               
## [28] "P55957upsedyp BID_HUMAN_upsedyp"                 
## [29] "P61626upsedyp LYSC_HUMAN_upsedyp"                
## [30] "P62937upsedyp PPIA_HUMAN_upsedyp"                
## [31] "P63165upsedyp SUMO1_HUMAN_upsedyp"               
## [32] "P68871upsedyp HBB_HUMAN_upsedyp"                 
## [33] "P99999upsedyp CYC_HUMAN_upsedyp"                 
## [34] "Q06830upsedyp PRDX1_HUMAN_upsedyp"               
## [35] "sp P02994 EF1A_YEAST"                            
## [36] "sp P05740 RL17A_YEAST"                           
## [37] "sp P10659 METK1_YEAST"                           
## [38] "sp P50095 IMDH3_YEAST"                           
## [39] "sp P60010 ACT_YEAST"

Using the default parameters (LIMMA t-tests, ABH method to correct p-values, FDR of 5%), 39 proteins are selected with 5 Yeast proteins among them, resulting to a ‘biological’ false discovery proportion equal to 5/39=12.8%. Of note, the true number of human proteins in the dataset is 38, so we select 34/38=89.4% of the human proteins.

Of note, SelectionVarStat allows estimating the expected number of differential proteins:

#expected number of differential proteins
nbs=list_diff$nb_to_sel
nbs
## [1] 37

Next, we can also use the selection using random forest (RFERF) by specifying we are searching at least this number of proteins:

#selection of proteins
set.seed(24) #for reproductibility
list_diff=SelectionVar(X=Xtrain,
                       Y=factor(c(rep("A",3),rep("B",3))),
                       MethodSelection = c("RFERF"),
                       MethodValidation = c("cv"),
                       Metric = "Accuracy",
                       NumberCV = 2,
                       PreProcessing = NULL,
                       Sizes = c(nbs:70))
## Selection variables with RFE and RF method
list_diff$sel_moz
##  [1] "O00762upsedyp UBE2C_HUMAN_upsedyp"               
##  [2] "O76070upsedyp SYUG_HUMAN_upsedyp"                
##  [3] "P00167upsedyp CYB5_HUMAN_upsedyp"                
##  [4] "P00709upsedyp LALBA_HUMAN_upsedyp"               
##  [5] "P00915upsedyp CAH1_HUMAN_upsedyp"                
##  [6] "P00918upsedyp CAH2_HUMAN_upsedyp"                
##  [7] "P01008upsedyp ANT3_HUMAN_upsedyp"                
##  [8] "P01112upsedyp RASH_HUMAN_upsedyp"                
##  [9] "P01127upsedyp PDGFB_HUMAN_upsedyp"               
## [10] "P01344upsedyp IGF2_HUMAN_upsedyp"                
## [11] "P01375upsedyp TNFA_HUMAN_upsedyp"                
## [12] "P02753upsedyp RET4_HUMAN_upsedyp"                
## [13] "P02768upsedyp ALBU_HUMAN_upsedyp - CON__P02768-1"
## [14] "P02787upsedyp TRFE_HUMAN_upsedyp"                
## [15] "P02788upsedyp TRFL_HUMAN_upsedyp"                
## [16] "P04040upsedyp CATA_HUMAN_upsedyp"                
## [17] "P05413upsedyp FABPH_HUMAN_upsedyp"               
## [18] "P06396upsedyp GELS_HUMAN_upsedyp"                
## [19] "P06732upsedyp KCRM_HUMAN_upsedyp"                
## [20] "P08263upsedyp GSTA1_HUMAN_upsedyp"               
## [21] "P08758upsedyp ANXA5_HUMAN_upsedyp"               
## [22] "P09211upsedyp GSTP1_HUMAN_upsedyp"               
## [23] "P10145upsedyp IL8_HUMAN_upsedyp"                 
## [24] "P10599upsedyp THIO_HUMAN_upsedyp"                
## [25] "P10636-8upsedyp TAU_HUMAN_upsedyp"               
## [26] "P12081upsedyp SYHC_HUMAN_upsedyp"                
## [27] "P15559upsedyp NQO1_HUMAN_upsedyp"                
## [28] "P16083upsedyp NQO2_HUMAN_upsedyp"                
## [29] "P41159upsedyp LEP_HUMAN_upsedyp"                 
## [30] "P51965upsedyp UB2E1_HUMAN_upsedyp"               
## [31] "P55957upsedyp BID_HUMAN_upsedyp"                 
## [32] "P61626upsedyp LYSC_HUMAN_upsedyp"                
## [33] "P62937upsedyp PPIA_HUMAN_upsedyp"                
## [34] "P63165upsedyp SUMO1_HUMAN_upsedyp"               
## [35] "P68871upsedyp HBB_HUMAN_upsedyp"                 
## [36] "P69905upsedyp HBA_HUMAN_upsedyp - CON__P01966"   
## [37] "P99999upsedyp CYC_HUMAN_upsedyp"

This function uses 2 cross-validations and selects at least 37 proteins at each cross-validation. The union of all the selected proteins along the cross-validations can thus be superior to 37. Of note, this way of doing allows to select 37/38=97.3% of the human proteins present in the dataset, without a high increasing of the ‘biological’ false discovery proportion.