Performance evaluation of different machine learning algorithms in presence of outliers using gene expression data
Classification of samples into one or more populations is one of the main objectives of gene expression data (GED) analysis. Many machine learning algorithms were employed in several studies to perform this task. However, these studies did not consider the outliers problem. GEDs are often contaminated by outliers due to several steps involve in the data generating process from hybridization of DNA samples to image analysis. Most of the algorithms produce higher false positives and lower accuracies in presence of outliers, particularly for lower number of replicates in the biological conditions. Therefore, in this paper, a comprehensive study has been carried out among five popular machine learning algorithms (SVM, RF, Naïve Bayes, k-NN and LDA) using both simulated and real gene expression datasets, in absence and presence of outliers. Three different rates of outliers (5%, 10% and 50%) and six performance indices (TPR, FPR, TNR, FNR, FDR and AUC) were considered to investigate the performance of five machine learning algorithms. Both simulated and real GED analysis results revealed that SVM produced comparatively better performance than the other four algorithms (RF, Naïve Bayes, k-NN and LDA) for both small-and-large sample sizes.
J. bio-sci. 28: 69-80, 2020