The Impact of Variable Omission on Variable Importance Measures of CART, Random Forest, and Boosting Algorithms
Keywords: Regression models; Machine learning algorithms; Missing values
When researchers use statistical models to gain insights into various phenomena, they tacitly assume that most (if not all) of the predictor variables that are important for the outcome are included in the analysis. In practice this may not always be possible, whether because some important variables could not be measured or because the researcher was not aware of all such predictors. Prior research has shown that when important variables are omitted from linear or nonlinear regression models, the model coefficients can be biased, with greater bias associated with larger correlations between the omitted and retained variables. However, very little work has examined how such omissions affect the variable importance measures used with popular machine learning algorithms. The purpose of this simulation study was to address this gap in the literature by examining the impact of such omissions on variable importance measures for classification and regression trees (CART), random forests, and boosting algorithms. Results showed that when an important variable is omitted from an analysis, other predictors that are correlated with it and/or involved in an interaction with it have inflated variable importance measures. An empirical example and the implications of these results are discussed.
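The inflation effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' simulation design; it is a minimal illustration under assumed settings: three standard normal predictors, of which x1 and x2 are correlated at an assumed 0.7, an additive outcome with unit coefficients, a scikit-learn random forest, and permutation importance measured as the increase in test-set mean squared error. The function names `simulate` and `importances` are hypothetical helpers introduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance


def simulate(n=2000, rho=0.7, seed=0):
    """Draw x1, x2 (correlated at rho) and an independent x3; y is their sum plus noise.

    The sample size, correlation, and unit coefficients are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho, 0.0],
           [rho, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
    X = rng.multivariate_normal(np.zeros(3), cov, size=n)
    y = X.sum(axis=1) + rng.normal(scale=1.0, size=n)
    return X, y


def importances(X_train, y_train, X_test, y_test, seed=0):
    """Fit a random forest and return the mean permutation importance of each column.

    With scoring="neg_mean_squared_error", each importance is the average
    increase in test MSE after shuffling that column.
    """
    forest = RandomForestRegressor(n_estimators=200, random_state=seed)
    forest.fit(X_train, y_train)
    result = permutation_importance(
        forest, X_test, y_test,
        scoring="neg_mean_squared_error",
        n_repeats=10,
        random_state=seed,
    )
    return result.importances_mean


X, y = simulate()
half = len(y) // 2
X_tr, X_te, y_tr, y_te = X[:half], X[half:], y[:half], y[half:]

# Full analysis: all three predictors are available.
full = importances(X_tr, y_tr, X_te, y_te)

# Reduced analysis: x2 (column 1) is omitted, leaving x1 and x3.
keep = [0, 2]
reduced = importances(X_tr[:, keep], y_tr, X_te[:, keep], y_te)

print("x1 importance, full model:   ", round(float(full[0]), 3))
print("x1 importance, x2 omitted:   ", round(float(reduced[0]), 3))
print("x3 importance, full model:   ", round(float(full[2]), 3))
print("x3 importance, x2 omitted:   ", round(float(reduced[1]), 3))
```

Under these assumptions, the importance of x1 should be noticeably larger when x2 is omitted, because x1 then acts as a proxy for the missing predictor. Permutation importance is used here rather than the forest's built-in impurity-based importances because the latter are normalized to sum to one, so dropping any predictor would raise the remaining shares mechanically.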
Journal of Statistical Research 2021, Vol. 55, No. 2, pp. 335-358
Copyright (c) 2021 Journal of Statistical Research
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.