2. SVM_performance.py

2.1. Description

Calculates performance metrics for an SVM model using k-fold cross-validation; a conceptual sketch follows the metric list.

  • F1_micro
  • F1_macro
  • Accuracy
  • Precision
  • Recall
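
Conceptually, the evaluation resembles the sketch below; it assumes scikit-learn's cross_val_score, whose scoring names ('f1_micro', 'f1_macro', 'accuracy', 'precision', 'recall') map directly onto the metrics above. The function name and variables are illustrative, not the script's actual code.

  # Sketch: k-fold cross-validated SVM scoring for the five metrics above.
  import numpy as np
  from sklearn.svm import SVC
  from sklearn.model_selection import cross_val_score

  def evaluate_svm(X, y, c_value=1.0, kernel="linear", n_fold=5):
      """Print per-fold scores for each metric reported by the script."""
      model = SVC(C=c_value, kernel=kernel)
      for metric in ("f1_micro", "f1_macro", "accuracy", "precision", "recall"):
          scores = cross_val_score(model, X, y, cv=n_fold, scoring=metric)
          print(metric, np.round(scores, 6))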

2.2. Options

--version show program’s version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input_file=INPUT_FILE
 Tab- or space-separated file. The first column contains sample IDs; the second column contains integer sample labels (must be 0 or 1); the third column contains sample label names (strings, consistent with column 2); the remaining columns contain the features used to build the SVM model.
-n N_FOLD, --nfold=N_FOLD
 The original sample is randomly partitioned into n equal-sized subsamples (2 <= n <= 10). Of the n subsamples, a single subsample is retained as the validation data for testing the model, and the remaining n − 1 subsamples are used as training data. default=5
-p N_THREAD, --nthread=N_THREAD
 Number of threads to use. default=2
-C C_VALUE, --cvalue=C_VALUE
 Penalty parameter C of the SVM; regularization strength is inversely proportional to C. default=1.0
-k S_KERNEL, --kernel=S_KERNEL
 Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’ or ‘precomputed’. default=linear
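
For reference, the options above map onto a command-line interface like the argparse sketch below (a hypothetical re-creation; the actual script may use a different parsing library, and the help strings are paraphrased from the descriptions above):

  # Hypothetical re-creation of the documented CLI with argparse.
  import argparse

  parser = argparse.ArgumentParser(prog="SVM_performance.py")
  parser.add_argument("-i", "--input_file", required=True,
                      help="tab- or space-separated file of IDs, labels and features")
  parser.add_argument("-n", "--nfold", type=int, default=5, choices=range(2, 11),
                      help="number of cross-validation folds")
  parser.add_argument("-p", "--nthread", type=int, default=2,
                      help="number of threads to use")
  parser.add_argument("-C", "--cvalue", type=float, default=1.0,
                      help="SVM penalty parameter C")
  parser.add_argument("-k", "--kernel", default="linear",
                      choices=["linear", "poly", "rbf", "sigmoid", "precomputed"],
                      help="SVM kernel type")
  args = parser.parse_args()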

2.3. Input file format

ID        Label  Label_name  feature_1  feature_2  feature_3  ...  feature_n
sample_1  1      WT          1560       795        0.9716     ...  ...
sample_2  1      WT          784        219        0.4087     ...  ...
sample_3  1      WT          2661       2268       1.1691     ...  ...
sample_4  0      Mut         643        198        0.5458     ...  ...
sample_5  0      Mut         534        87         1.0545     ...  ...
sample_6  0      Mut         332        75         0.5115     ...  ...
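
A file in this layout can be split into labels and a feature matrix along the lines of the pandas sketch below (the column handling follows the description in 2.2; the filename and variable names are placeholders, not the script's actual loader):

  # Sketch: split the input table into IDs, labels and features (assumed layout).
  import pandas as pd

  df = pd.read_csv("input.tsv", sep=r"\s+")  # tab- or space-separated
  sample_ids = df.iloc[:, 0]                 # column 1: sample IDs
  y = df.iloc[:, 1].astype(int)              # column 2: labels (0 or 1)
  label_names = df.iloc[:, 2]                # column 3: label names
  X = df.iloc[:, 3:].to_numpy(dtype=float)   # remaining columns: features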

2.4. Example input file

$ cat lung_CES_5features.tsv
TCGA_ID Label   Group   gsva_p53_activated      gsva_p53_repressed      ssGSEA_p53_activated    ssGSEA_p53_repressed    PC1
TCGA-22-4593-11A        0       Normal  0.97337963      -0.965872505    0.446594884     -0.332230329    10.12036762
TCGA-22-4609-11A        0       Normal  0.974507532     -0.971830001    0.480743696     -0.373937866    12.57932272
TCGA-22-5471-11A        0       Normal  0.981934732     -0.991054313    0.465087717     -0.354705367    11.50908022
TCGA-22-5472-11A        0       Normal  0.914660832     -0.889643616    0.433541263     -0.316566781    7.96785884
TCGA-22-5478-11A        0       Normal  0.983080513     -0.989789407    0.478239013     -0.370840097    11.81998124
TCGA-22-5481-11A        0       Normal  0.958950969     -0.973021839    0.441116626     -0.325822867    10.62201083
TCGA-22-5482-11A        0       Normal  0.97113164      -0.976324136    0.471515295     -0.362373723    10.78576876
TCGA-22-5483-11A        0       Normal  0.957377049     -0.986013986    0.378674475     -0.253223408    7.487083257
TCGA-22-5489-11A        0       Normal  0.963911525     -0.982725528    0.45219094      -0.339061168    9.49806089
TCGA-22-5491-11A        0       Normal  0.981934732     -0.991054313    0.475345705     -0.367218333    12.2813137
TCGA-33-4587-11A        0       Normal  0.90739615      -0.930774072    0.403446401     -0.281428331    9.368460346
TCGA-33-6737-11A        0       Normal  0.962025316     -0.957522049    0.495340808     -0.391557543    10.79155095
TCGA-34-7107-11A        0       Normal  0.949717514     -0.934120795    0.451010344     -0.337452999    10.04177079
TCGA-34-8454-11A        0       Normal  0.992397661     -0.987269255    0.480060883     -0.372603029    10.6050578
...

2.5. Command

$ python3 SVM_performance.py -i lung_CES_5features.tsv -C 10

Note

There is no rule of thumb for choosing a C value; try a range of different C values and pick the one that gives the best performance scores. One common approach, a cross-validated grid search over C, is sketched below.
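
A minimal sketch, assuming scikit-learn's GridSearchCV; the candidate grid and the f1_macro scoring choice are illustrative, not recommendations from the script. X and y are the feature matrix and labels loaded from the input file, as in the sketch in 2.3:

  # Sketch: choose C by cross-validated grid search (illustrative grid).
  from sklearn.svm import SVC
  from sklearn.model_selection import GridSearchCV

  grid = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      scoring="f1_macro", cv=5)
  grid.fit(X, y)  # X, y: features and 0/1 labels from the input file
  print(grid.best_params_, grid.best_score_)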

2.6. Output to screen

Preprocessing data ...
Evaluate metric(s) by cross-validation ...
F1 score is the weighted average of the precision and recall. F1 = 2 * (precision * recall) / (precision + recall)


F1_macro calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
       Iteration 1: 1.000000
       Iteration 2: 0.983518
       Iteration 3: 1.000000
       Iteration 4: 1.000000
       Iteration 5: 0.967273
F1-macro: 0.9902 (+/- 0.0262)


F1_micro calculate metrics globally by counting the total true positives, false negatives and false positives.
       Iteration 1: 1.000000
       Iteration 2: 0.986301
       Iteration 3: 1.000000
       Iteration 4: 1.000000
       Iteration 5: 0.972222
F1-micro: 0.9917 (+/- 0.0222)


accuracy is equal to F1_micro for binary classification problem
       Iteration 1: 1.000000
       Iteration 2: 0.986301
       Iteration 3: 1.000000
       Iteration 4: 1.000000
       Iteration 5: 0.972222
Accuracy: 0.9917 (+/- 0.0222)


Precision = tp / (tp + fp). It measures "out of all *predictive positives*, how many are correctly predicted?"
       Iteration 1: 1.000000
       Iteration 2: 1.000000
       Iteration 3: 1.000000
       Iteration 4: 1.000000
       Iteration 5: 1.000000
Precision: 1.0000 (+/- 0.0000)


Recall = tp / (tp + fn). Recall (i.e. sensitivity) measures "out of all  *positives*, how many are correctly predicted?"
       Iteration 1: 1.000000
       Iteration 2: 0.980769
       Iteration 3: 1.000000
       Iteration 4: 1.000000
       Iteration 5: 0.960784
Recall: 0.9883 (+/- 0.0313)
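
The "(+/- )" margin in each summary line appears to be two standard deviations around the mean of the per-fold scores, a common convention in scikit-learn examples; this is inferred from the numbers above, not from the script's source. The F1-macro line can be reproduced like so:

  # Sketch: recompute the "F1-macro: 0.9902 (+/- 0.0262)" summary line.
  import numpy as np

  scores = np.array([1.000000, 0.983518, 1.000000, 1.000000, 0.967273])
  print("F1-macro: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))
  # prints: F1-macro: 0.9902 (+/- 0.0262)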