Improved k-nearest neighbors approach for incomplete and contaminated gene expression datasets

MI Asifuzzaman; H Akter; MM Rashid; MNH Mollah; SMS Islam; M Shahjaman

doi:10.3329/jbs.v27i0.44669

Authors

MI Asifuzzaman Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
H Akter Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
MM Rashid Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
MNH Mollah Bioinformatics Lab., Department of Statistics, University of Rajshahi, Bangladesh
SMS Islam Institute of Biological Sciences, University of Rajshahi, Bangladesh
M Shahjaman Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh

Keywords:

Gene expression data, IQR, Missing values, Outliers, Robustness

Abstract

With the rapid development of high-throughput DNA microarray technologies, researchers can measure the expression profiles of thousands of genes simultaneously at low cost. These massive gene expression (GE) datasets often contain missing values or outliers that arise at various stages of the data-generating process. Most statistical methods were developed for complete datasets, so they perform poorly when applied to incomplete data, and subsequent analyses can miss their targets. Numerous methods for imputing missing values are available in the literature. Although missing value imputation and outlier handling are equally important for analyzing GE data, most methods perform these tasks separately and can produce misleading results. Therefore, in this paper, we attempt to develop a new hybrid approach that is robust against outliers and missing values simultaneously. We compare the performance of the proposed method with that of the popular missing value imputation method K-NN in feature selection, using both simulated and real GE datasets. The results obtained from the simulated and real data studies show that the proposed method outperforms K-NN in the presence of different percentages of missing values and outliers. In the absence of outliers, when only missing values are present, the proposed method performs on par with the other methods.

J. bio-sci. 27: 31-41, 2019
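
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of one way an IQR rule (named in the keywords) could be combined with K-NN imputation. It is an assumption, not the authors' published procedure. The function names `mask_outliers_iqr`, `knn_impute`, and `robust_knn_impute`, the 1.5 × IQR fence, and the layout (genes in rows, samples in columns) are all hypothetical choices made for the example.

```python
import numpy as np


def mask_outliers_iqr(X, factor=1.5):
    """Replace entries outside [Q1 - factor*IQR, Q3 + factor*IQR] of their
    gene (row) with NaN. The 1.5 fence is an assumed, conventional default."""
    X = np.asarray(X, dtype=float).copy()
    q1 = np.nanpercentile(X, 25, axis=1, keepdims=True)
    q3 = np.nanpercentile(X, 75, axis=1, keepdims=True)
    iqr = q3 - q1
    with np.errstate(invalid="ignore"):  # comparisons with NaN are False
        outlier = (X < q1 - factor * iqr) | (X > q3 + factor * iqr)
    X[outlier] = np.nan
    return X


def knn_impute(X, k=3):
    """Fill each missing entry with the mean of that sample's values in the
    k nearest genes (rows) that have the entry observed."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    n_genes = X.shape[0]
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        missing = np.isnan(X[i])
        # Root-mean-square distance over samples observed in both genes.
        dists = np.full(n_genes, np.inf)
        for j in range(n_genes):
            if j == i:
                continue
            shared = ~missing & ~np.isnan(X[j])
            if shared.any():
                diff = X[i, shared] - X[j, shared]
                dists[j] = np.sqrt(np.mean(diff ** 2))
        order = np.argsort(dists)
        for col in np.where(missing)[0]:
            donors = [j for j in order
                      if np.isfinite(dists[j]) and not np.isnan(X[j, col])][:k]
            if donors:  # leave the gap if no gene can donate a value
                filled[i, col] = X[donors, col].mean()
    return filled


def robust_knn_impute(X, k=3, factor=1.5):
    """Treat IQR-flagged outliers as missing, then impute everything by K-NN."""
    return knn_impute(mask_outliers_iqr(X, factor=factor), k=k)
```

Calling `robust_knn_impute(X, k=2)` on a genes-by-samples array treats flagged outliers as additional missing cells before imputation, whereas plain `knn_impute(X, k=2)` leaves outliers in place and lets them distort the neighbor distances.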
