Performance evaluation of different machine learning algorithms in presence of outliers using gene expression data

  • M Shahjaman Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
  • MM Rashid Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
  • MI Asifuzzaman Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
  • H Akter Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
  • SMS Islam Institute of Biological Sciences, University of Rajshahi, Bangladesh
  • MNH Mollah Bioinformatics Lab., Department of Statistics, University of Rajshahi, Bangladesh
Keywords: Classification, DE gene, GED, Outliers, Robustness

Abstract

Classification of samples into one or more populations is one of the main objectives of gene expression data (GED) analysis. Many machine learning algorithms have been employed in several studies to perform this task. However, these studies did not consider the problem of outliers. GEDs are often contaminated by outliers because of the several steps involved in the data-generating process, from hybridization of DNA samples to image analysis. Most of the algorithms produce more false positives and lower accuracy in the presence of outliers, particularly when the number of replicates in the biological conditions is small. Therefore, this paper presents a comprehensive comparison of five popular machine learning algorithms (SVM, RF, Naïve Bayes, k-NN and LDA) using both simulated and real gene expression datasets, in the absence and presence of outliers. Three outlier rates (5%, 10% and 50%) and six performance indices (TPR, FPR, TNR, FNR, FDR and AUC) were considered to investigate the performance of the five algorithms. The results from both simulated and real GED analyses revealed that SVM performed comparatively better than the other four algorithms (RF, Naïve Bayes, k-NN and LDA) for both small and large sample sizes.
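For readers unfamiliar with the six performance indices named in the abstract, the following is a minimal sketch of their standard textbook definitions for a two-class problem. It is not the authors' code: the function names (`confusion_counts`, `performance_indices`, `auc`), the 0/1 label coding, and the toy labels and scores are assumptions made purely for illustration, and AUC is computed here through the rank-based (Mann-Whitney) formulation.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels coded as 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def _ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def performance_indices(y_true, y_pred):
    """Return TPR, FPR, TNR, FNR and FDR from predicted class labels."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "TPR": _ratio(tp, tp + fn),  # true positive rate (sensitivity)
        "FPR": _ratio(fp, fp + tn),  # false positive rate
        "TNR": _ratio(tn, fp + tn),  # true negative rate (specificity)
        "FNR": _ratio(fn, tp + fn),  # false negative rate
        "FDR": _ratio(fp, tp + fp),  # false discovery rate
    }


def auc(y_true, scores):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive sample has the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return _ratio(wins, len(pos) * len(neg))


# Toy example (hypothetical labels and scores, not data from the paper).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]

indices = performance_indices(y_true, y_pred)
indices["AUC"] = auc(y_true, scores)
print(indices)
```

Under these definitions TPR + FNR = 1 and FPR + TNR = 1, so a better-performing classifier shows higher TPR, TNR and AUC together with lower FPR, FNR and FDR.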

J. bio-sci. 28: 69-80, 2020

Published
2019-12-28
How to Cite
Shahjaman, M., Rashid, M., Asifuzzaman, M., Akter, H., Islam, S., & Mollah, M. (2019). Performance evaluation of different machine learning algorithms in presence of outliers using gene expression data. Journal of Bio-Science, 28, 69-80. https://doi.org/10.3329/jbs.v28i0.44712
Section
Articles