Improved k-nearest neighbors approach for incomplete and contaminated gene expression datasets

Authors

  • MI Asifuzzaman Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
  • H Akter Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
  • MM Rashid Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
  • MNH Mollah Bioinformatics Lab., Department of Statistics, University of Rajshahi, Bangladesh
  • SMS Islam Institute of Biological Sciences, University of Rajshahi, Bangladesh
  • M Shahjaman Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh

DOI:

https://doi.org/10.3329/jbs.v27i0.44669

Keywords:

Gene expression data, IQR, Missing values, Outliers, Robustness

Abstract

With the rapid development of high-throughput DNA microarray technologies, researchers can measure the expression profiles of thousands of genes simultaneously at low cost. These large gene expression (GE) datasets often contain missing values or outliers that arise at various stages of the data-generating process. Most statistical methods were developed for complete datasets, so they perform poorly when subsequent analyses are run on incomplete data. Numerous methods for imputing missing values are available in the literature. Although missing value imputation and outlier handling are equally important for analyzing GE data, most methods perform these tasks separately and can produce misleading results. Therefore, in this paper, an attempt is made to develop a new hybrid approach that is robust against outliers and missing values simultaneously. We compare the performance of the proposed method with that of the popular missing value imputation method K-NN in feature selection, using both simulated and real GE datasets. The results obtained from the simulated and real data studies show that the proposed method outperforms K-NN in the presence of different percentages of missing values and outliers. In the absence of outliers, when only missing values are present, the proposed method performs on par with the other methods.
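
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of one way IQR-based outlier handling and K-NN imputation could be combined. It is not the authors' published method. The function names (mask_iqr_outliers, knn_impute, robust_knn_impute), the 1.5 x IQR fence, the choice to treat flagged outliers as missing, the row-per-gene layout, and the toy data are all assumptions made for illustration.

```python
import math
from statistics import quantiles


def mask_iqr_outliers(row, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with NaN (assumed rule)."""
    observed = sorted(v for v in row if not math.isnan(v))
    if len(observed) < 4:  # too few values to estimate quartiles sensibly
        return list(row)
    q1, _, q3 = quantiles(observed, n=4, method="inclusive")
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v if math.isnan(v) or lo <= v <= hi else math.nan for v in row]


def knn_impute(matrix, n_neighbors=2):
    """Fill each NaN with the mean of that sample's values in the nearest genes.

    Rows are genes, columns are samples. Distance between two genes is the
    root-mean-square difference over the samples observed in both.
    """
    imputed = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        missing = [j for j, v in enumerate(row) if math.isnan(v)]
        if not missing:
            continue
        dists = []
        for m, other in enumerate(matrix):
            if m == i:
                continue
            shared = [(a, b) for a, b in zip(row, other)
                      if not math.isnan(a) and not math.isnan(b)]
            if shared:
                rms = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                dists.append((rms, m))
        dists.sort()
        for j in missing:
            # Nearest genes that actually have a value for sample j.
            donors = [matrix[m][j] for _, m in dists
                      if not math.isnan(matrix[m][j])][:n_neighbors]
            if donors:
                imputed[i][j] = sum(donors) / len(donors)
    return imputed


def robust_knn_impute(matrix, n_neighbors=2, k=1.5):
    """Mask IQR outliers gene by gene, then impute all NaNs with K-NN."""
    masked = [mask_iqr_outliers(row, k) for row in matrix]
    return knn_impute(masked, n_neighbors)


# Toy data (made up): 4 genes x 6 samples, one outlier and one missing value.
nan = math.nan
data = [
    [1.0, 1.1, 0.9, 1.0, 50.0, 1.05],  # 50.0 is a planted outlier
    [1.0, 1.1, 0.9, 1.1, 1.0, nan],    # one missing value
    [1.1, 1.0, 1.0, 0.9, 1.1, 1.0],
    [5.0, 5.2, 4.9, 5.1, 5.0, 5.1],
]
plain = knn_impute(data)          # K-NN alone, outlier left in place
robust = robust_knn_impute(data)  # outlier masked first, then K-NN
```

In this toy example the planted outlier inflates the distance between the first two genes, so plain K-NN picks a dissimilar gene as a neighbor when filling the missing value; masking the outlier first avoids that. Whether the published method works this way should be checked against the full paper.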

J. bio-sci. 27: 31-41, 2019

Published

2019-12-26

How to Cite

Asifuzzaman, M., Akter, H., Rashid, M., Mollah, M., Islam, S., & Shahjaman, M. (2019). Improved k-nearest neighbors approach for incomplete and contaminated gene expression datasets. Journal of Bio-Science, 27, 31–41. https://doi.org/10.3329/jbs.v27i0.44669

Section

Articles