The Impact of Variable Omission on Variable Importance Measures of CART, Random Forest, and Boosting Algorithms

W Holmes Finch

doi:10.3329/jsr.v55i2.58809

Authors

W Holmes Finch Ball State University, Muncie, IN, USA

Keywords:

Regression models; Machine learning algorithms; Missing values

Abstract

When researchers use statistical models to gain insights into various phenomena, they tacitly assume that most (if not all) of the predictor variables that are important for the outcome are included in the analysis. In practice this may not always be possible, whether because some important variables could not be measured or because the researcher was not aware of all such predictors. Prior research has shown that when important variables are omitted from linear or nonlinear regression models, the model coefficients can be biased, with greater bias associated with larger correlations between the omitted and retained variables. However, very little work has examined how such omissions affect the performance of variable importance measures used with popular machine learning algorithms. The purpose of this simulation study was therefore to address this gap in the literature and provide insights into the impact of such omissions on variable importance measures for classification and regression trees, random forests, and boosting algorithms. Results showed that when an important variable is omitted from an analysis, other predictors that are correlated with it and/or involved in an interaction with it will have inflated variable importance measures. An empirical example and implications of these results are discussed.

Journal of Statistical Research 2021, Vol. 55, No. 2, pp. 335-358
