Modelling of South African Hypertension: Application of Classical Quantile Regression

Background: High blood pressure, medically known as hypertension, is the major risk factor for cardiovascular diseases (CVDs) and premature death globally. The aim of the present study was to explore possible interactions amongst the risk factors of systolic blood pressure (SBP) and diastolic blood pressure (DBP) in South Africa. Methods: A retrospective study was conducted using data acquired from the South African National Income Dynamics Study Wave 5 Household Survey, which was carried out in 2017-2018. A final data set of 21 180 adults was utilized for data analysis. The hierarchical group-lasso approach was applied to detect interactions between the risk factors of SBP and DBP, and classical quantile regression analysis was then performed. Results: Using only the upper quantiles, body mass index (BMI), age, race, never exercised, and the following nine interactions were found to be significant determinants of hypertension in South Africa: BMI and age, BMI and gender male, age and never exercised, gender male and race African, race coloured and depression some or little of the time, BMI and cigarette consumption, age and race white, gender male and employment status, and never exercised and cigarette consumption. Conclusion: The evidence of this study suggests that it is ideal to consider interactions amongst risk factors when modelling hypertension.


Introduction
High blood pressure, medically known as hypertension, is the major risk factor for cardiovascular diseases (CVDs) and premature death globally. CVDs are conditions that affect the structures or functions of the heart. These include abnormal heart rhythms or arrhythmias, aorta disease and Marfan syndrome, congenital heart disease, coronary artery disease (narrowing of the arteries), deep vein thrombosis and pulmonary embolism, heart attack, heart failure, heart muscle disease (cardiomyopathy), heart valve disease, pericardial disease, peripheral vascular disease, rheumatic heart disease, stroke and vascular disease (blood vessel disease). CVDs are the number one cause of death worldwide: an estimated 17.9 million people died from CVDs in 2016, representing 31% of all global deaths 1 . Hypertension is responsible for 7.6 million deaths per annum globally 2 .
The prevalence and burden of hypertension are rising across the world, especially in low- and middle-income countries including South Africa 3 . Approximately 27.4% of men and 26.1% of women in South Africa had raised blood pressure in 2015 4 . Based on these high prevalence rates of hypertension in South Africa, this study sought to establish the prevalence of hypertension amongst adults in South Africa attributable to high systolic and diastolic blood pressure. Hypertension is defined here as systolic blood pressure (SBP) readings greater than or equal to 140 mmHg and/or diastolic blood pressure (DBP) readings greater than or equal to 90 mmHg 5 .
Most previous studies on hypertension have used descriptive statistics to study the prevalence and awareness of hypertension in South Africa 6, 7, 8 . Some have applied mean regression 7, 9, 10 . In addition to descriptive statistics, this study also applies classical quantile regression. Modelling hypertension using quantile regression (QR) is more appropriate than using descriptive statistics and mean regression only, in that it provides the flexibility to estimate the influence of potential risk factors on the upper quantiles (75% or 95%) of the conditional distribution of hypertension. When modelling hypertension, it makes more sense to model high values of systolic and diastolic blood pressure, which correspond to the upper part of the distribution of either SBP or DBP 11 . Hence, the aim of the present study was to explore possible interactions amongst the risk factors of SBP and DBP. Modelling interactions can play an important role when predicting diseases 12 .

Materials and Methods
In this section, the data and variables, theoretical models and data analysis techniques applied in this paper are presented.

Data and Variables
This was a retrospective study conducted using data acquired from the South African National Income Dynamics Study (NIDS) Wave 5 Household Survey, which was carried out in 2017-2018. NIDS provides good quality anthropometric, sociodemographic and behavioural data sampled across the South African population 7 .
From this secondary data, records for South African male and female adults aged 18 years and above were extracted for analysis. NIDS was initiated by the South African Presidency in order to track changes in the well-being of South African citizens across the entire country 13 . Hence, this survey provided nationally representative data.
The target population for NIDS was private households in all nine provinces of South Africa, and residents in workers' hostels, convents and monasteries. The frame excludes other collective living quarters, such as student hostels, old age homes, hospitals, prisons and military barracks. Fieldworkers were trained and instructed to interview and collect data on subjects residing at selected households.
Data cleaning was conducted before analyzing the data. The data cleaning process involved dropping observations with missing data for any of the variables used in the study (7 127) and participants aged below 18 (1 803). A final data set of 21 180 adults was obtained from the 30 110 participants originally observed.
The variables used in this study are systolic blood pressure, diastolic blood pressure, non-modifiable risk factors (age, gender and race) and modifiable risk factors (BMI, exercises, cigarette consumption, depression and employment status). Systolic blood pressure and diastolic blood pressure are the dependent variables whilst age, body mass index, gender, race, exercises, cigarette consumption, depression and employment status are the independent variables. Age was computed by subtracting date of birth from date of interview, and body mass index was calculated by dividing weight (kg) by the square of height in meters (m²).
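The derived variables described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names are hypothetical, the use of completed years for age is an assumption (the text only says that date of birth was subtracted from date of interview), and the hypertension rule uses the 140/90 mmHg thresholds stated in the Introduction.

```python
from datetime import date


def age_in_years(date_of_birth: date, date_of_interview: date) -> int:
    """Completed years between birth and interview (assumed convention)."""
    years = date_of_interview.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the interview year.
    if (date_of_interview.month, date_of_interview.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def body_mass_index(weight_kg: float, height_m: float) -> float:
    """BMI = weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2


def is_hypertensive(sbp_mmhg: float, dbp_mmhg: float) -> bool:
    """SBP >= 140 mmHg and/or DBP >= 90 mmHg."""
    return sbp_mmhg >= 140 or dbp_mmhg >= 90


# Hypothetical respondent, not taken from the NIDS data.
print(age_in_years(date(1980, 6, 15), date(2017, 3, 1)))  # 36
print(round(body_mass_index(81.0, 1.80), 1))              # 25.0
print(is_hypertensive(138, 92))                           # True
```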

Classical Quantile Regression and Computational Methods
In statistical modelling, regression analysis is one of the most widely used and powerful multivariate techniques for assessing the impact of a set of variables on a certain outcome variable. Classical or standard linear regression centers on the expectation of a variable $Y$ conditional on the values of a set of variables $X$, thus $E(Y \mid X)$ 14 . This is called the regression function. Since this function focuses on a specific location, which is the mean, quantile regression extends this approach to allow the conditional distribution of $Y$ on $X$ to be established at different locations 15 . The $\tau$th sample quantile $\hat{\xi}_\tau$ can then be obtained as the solution to the problem of minimizing a sum of asymmetrically weighted absolute deviations. The optimization problem is defined as:

$$\hat{\xi}_\tau = \arg\min_{\xi \in \mathbb{R}} \left[ \sum_{i:\, y_i \geq \xi} \tau\, |y_i - \xi| + \sum_{i:\, y_i < \xi} (1 - \tau)\, |y_i - \xi| \right]$$

Equivalent to:

$$\hat{\xi}_\tau = \arg\min_{\xi \in \mathbb{R}} \sum_{i=1}^{n} \rho_\tau(y_i - \xi),$$

where $\rho_\tau(u) = u\,(\tau - I(u < 0))$ is called the pinball loss function 16 . Linear programming methods can then be utilised to obtain quantile regression estimates 17 .
These linear programming methods include the simplex algorithm of Barrodale and Roberts (1973), the Frisch-Newton interior point algorithm described in Portnoy and Koenker (1997) and the Sparse Frisch-Newton algorithm with pre-processing. The Barrodale and Roberts (1973) simplex algorithm is the default method implemented in the R package called quantreg 20 . The implementation of the simplex method and further developments in linear programming have made quantile regression computationally competitive with classical linear regression methods 21 .
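To make the pinball loss concrete, the following Python sketch evaluates it on a small set of hypothetical SBP readings and finds the sample quantile by direct search over the observed values. It is an illustration of the minimisation problem only, not the linear programming routines used by quantreg; the function names and the data are assumptions made for this example.

```python
def pinball_loss(u: float, tau: float) -> float:
    """rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def total_loss(xi: float, ys, tau: float) -> float:
    """Sum of pinball losses of the residuals y_i - xi."""
    return sum(pinball_loss(y - xi, tau) for y in ys)


def sample_quantile(ys, tau: float) -> float:
    """Minimise the total pinball loss over the observed values.

    The objective is convex and piecewise linear in xi, so a minimiser
    can always be found among the data points themselves.
    """
    return min(sorted(ys), key=lambda xi: total_loss(xi, ys, tau))


# Hypothetical SBP readings in mmHg, not taken from the NIDS data.
readings = [110, 118, 125, 132, 141, 150, 163]

print(sample_quantile(readings, 0.50))  # 132
print(sample_quantile(readings, 0.75))  # 150
```

Under-predictions are weighted by $\tau$ and over-predictions by $1 - \tau$, which is why raising $\tau$ from 0.50 to 0.75 moves the minimiser from the median towards the upper tail.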

Pairwise Hierarchical Interactions
According to Bien et al. (2013), a pairwise interaction model containing all possible pairwise products of the predictors is presented in the following form:

$$Y = \beta_0 + \sum_{j} \beta_j X_j + \frac{1}{2} \sum_{j \neq k} \Theta_{jk} X_j X_k + \varepsilon$$

An interaction between two independent variables, say $X_j$ and $X_k$, occurs when the effect of $X_j$ on $Y$ varies depending on the level or value of $X_k$. Pairwise hierarchical interactions in this study are estimated in a manner that satisfies strong hierarchy, and variable selection is then applied via the group-lasso. A model is said to obey strong hierarchy if, whenever an interaction is estimated to be nonzero, both of its main effects are included in the model 12 . Weak hierarchy is obtained as long as at least one of its main effects is present 12 . Interactions amongst variables can play an important role in predicting diseases such as hypertension 23 .
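The two hierarchy conditions can be expressed as simple checks on a selected model. The sketch below is illustrative: the function names and the variable labels are hypothetical, and a model is represented only by the names of its selected main effects and interaction pairs.

```python
def obeys_strong_hierarchy(main_effects, interactions) -> bool:
    """Every selected interaction has BOTH of its main effects selected."""
    selected = set(main_effects)
    return all(a in selected and b in selected for a, b in interactions)


def obeys_weak_hierarchy(main_effects, interactions) -> bool:
    """Every selected interaction has AT LEAST ONE of its main effects selected."""
    selected = set(main_effects)
    return all(a in selected or b in selected for a, b in interactions)


# Hypothetical selections, for illustration only.
interactions = [("bmi", "age")]

print(obeys_strong_hierarchy({"bmi", "age"}, interactions))  # True
print(obeys_strong_hierarchy({"age"}, interactions))         # False
print(obeys_weak_hierarchy({"age"}, interactions))           # True
```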
Basically, there are three possible cases of interaction:
• Interaction between two continuous variables.
• Interaction between two categorical variables.
• Interaction between a categorical variable and a continuous variable.

Interaction between two continuous variables
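Let $X_1$ and $X_2$ be two continuous variables; then the interaction between these continuous variables is given by their product, which enters the model as an additional term:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \Theta_{12} X_1 X_2 + \varepsilon$$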
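As a worked illustration in the setting of this study, consider BMI and age as the two continuous variables in a quantile regression for SBP. The coefficient symbols below are notation assumed for this example and are not taken from the fitted models. Differentiating the conditional quantile with respect to BMI shows that the effect of BMI is no longer a single number but depends on age:

```latex
Q_\tau(\mathrm{SBP} \mid \mathrm{BMI}, \mathrm{age})
  = \beta_0(\tau) + \beta_1(\tau)\,\mathrm{BMI} + \beta_2(\tau)\,\mathrm{age}
    + \theta(\tau)\,\mathrm{BMI} \times \mathrm{age}
\quad\Longrightarrow\quad
\frac{\partial Q_\tau(\mathrm{SBP} \mid \mathrm{BMI}, \mathrm{age})}{\partial\, \mathrm{BMI}}
  = \beta_1(\tau) + \theta(\tau)\,\mathrm{age}
```

A positive $\theta(\tau)$ would therefore mean that, at quantile $\tau$, each additional unit of BMI is associated with a larger increase in SBP for older respondents than for younger ones.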

Interaction between two categorical variables
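Let $F_1$ and $F_2$ be two categorical variables with levels $i = 1, \dots, L_1$ and $j = 1, \dots, L_2$ respectively; then the interaction between these categorical variables is given by:

$$Y = \mu + \sum_{i} \theta_1^{i}\, I(F_1 = i) + \sum_{j} \theta_2^{j}\, I(F_2 = j) + \sum_{i,j} \theta_{1:2}^{ij}\, I(F_1 = i, F_2 = j) + \varepsilon$$

Whenever there is an interaction between two categorical variables, a separate interaction term is taken at each combination of levels of the two variables.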

Interaction between a categorical variable and a continuous variable
Let $F$ be a categorical variable with $L$ levels and $Z$ be a continuous variable; then the interaction between these two variables can result in one of the following cases 12 :
• no main effects and no interaction,
• one main effect ($F$ or $Z$),
• two main effects,
• two main effects and their interaction.
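The three kinds of interaction can be illustrated by constructing the corresponding columns of a design matrix. The sketch below is a simplified illustration: the helper names, the toy records and the choice of the first level as the reference category are all assumptions made for this example and do not reproduce the internal coding used by the R packages.

```python
def dummy_code(values, levels):
    """Return one 0/1 indicator column per non-reference level.

    The first entry of `levels` is treated as the reference category
    (an assumption made for this sketch).
    """
    return {level: [1.0 if v == level else 0.0 for v in values]
            for level in levels[1:]}


def interact(col_a, col_b):
    """Element-wise product of two equally long columns."""
    return [a * b for a, b in zip(col_a, col_b)]


# Hypothetical toy records, not taken from the NIDS data.
age = [25.0, 40.0, 60.0, 33.0]
bmi = [22.0, 31.0, 27.5, 24.0]
race = ["African", "Coloured", "White", "African"]
gender = ["Male", "Female", "Male", "Male"]

race_dummies = dummy_code(race, ["African", "Coloured", "White"])
gender_dummies = dummy_code(gender, ["Female", "Male"])

# Continuous x continuous: a single product column.
bmi_x_age = interact(bmi, age)

# Categorical x continuous: one product column per non-reference level.
age_x_race = {level: interact(age, col) for level, col in race_dummies.items()}

# Categorical x categorical: one column per pair of non-reference levels.
male_x_race = {level: interact(gender_dummies["Male"], col)
               for level, col in race_dummies.items()}

print(bmi_x_age)            # [550.0, 1240.0, 1650.0, 792.0]
print(age_x_race["White"])  # [0.0, 0.0, 60.0, 0.0]
print(male_x_race["White"]) # [0.0, 0.0, 1.0, 0.0]
```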

Lasso
Lasso, the least absolute shrinkage and selection operator, has emerged as a critical tool for variable selection. It is convenient to apply the lasso when estimating quantile regression models so as to improve prediction accuracy by eliminating irrelevant variables 24 . The lasso includes an $\ell_1$-penalty term that constrains the size of the estimated model coefficients, forcing the model to have fewer parameters. The lasso coefficient estimates solve the following problem 25 :

$$\min_{\beta_0, \beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$$

subject to the following constraint:

$$\sum_{j=1}^{p} |\beta_j| \leq t,$$

where $t$ has to be greater than zero. $t$ is the tuning parameter that controls the amount of shrinkage.
Group-lasso is an extension of lasso that performs variable selection on non-overlapping groups of variables and sets groups of coefficients to zero 26 .
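The difference between the lasso and the group-lasso can be seen in their shrinkage operators. In the penalised (Lagrangian) form, and for an orthonormal design, the lasso solution soft-thresholds each least-squares coefficient individually, whereas the group-lasso shrinks a whole block of coefficients by its Euclidean norm, so a group is kept or dropped as a unit. The sketch below is a minimal illustration under those assumptions; `lam` is a hypothetical penalty weight (taken to include any group-size factor) that corresponds to, but is not equal to, the bound in the constrained form.

```python
import math


def soft_threshold(z: float, lam: float) -> float:
    """Lasso shrinkage of one coefficient: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def group_soft_threshold(group, lam: float):
    """Group-lasso shrinkage: the whole block is scaled, or set to zero, together."""
    norm = math.sqrt(sum(z * z for z in group))
    if norm <= lam:
        return [0.0] * len(group)
    scale = 1.0 - lam / norm
    return [scale * z for z in group]


# Hypothetical unpenalised coefficients for one group (e.g. dummies of one factor).
coefs = [3.0, 0.5]

# Element-wise lasso: the small coefficient is dropped on its own.
print([soft_threshold(z, 1.0) for z in coefs])  # [2.0, 0.0]

# Group-lasso: both coefficients survive (shrunk), because the group norm exceeds lam.
print(group_soft_threshold(coefs, 1.0))

# A weak group is removed entirely.
print(group_soft_threshold([0.3, 0.4], 1.0))    # [0.0, 0.0]
```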

Data Analysis
Descriptive statistics were produced using IBM SPSS version 27. Frequencies of demographic and lifestyle characteristics of participants and summary statistics for continuous variables such as SBP, DBP, BMI and age were produced. To explore possible interactions amongst the risk factors of SBP and DBP, the Least Absolute Shrinkage and Selection Operator (Lasso) with the pairwise hierarchical interactions technique was applied using the R packages hierNet 27 and glmnet 28 . The classical quantile regression model was fitted using the quantreg R package 20 .

Ethical Consideration
The South African National Income Dynamics Survey was conducted after ethical approval was granted by the University of Cape Town, Faculty of Commerce Ethics Committee. Informed consent was also obtained from each study participant.

Results
This section presents the empirical results of the study. These results are presented in the form of tables and figures, together with their interpretation. The study results in Table 1 show that 8 616 (40.7%) of the respondents were males and 12 564 (59.3%) were females. Most of the participants were African, 16 999 (80.3%), and the smallest group was Asian/Indian, 338 (1.6%). In regard to the age distribution, 7 658 (36.2%) were between 18 and 29 years, followed by the 50 years and above age group with 5 896 (27.8%). The smallest age group, 3 192 (15.1%), was aged between 40 and 49 years. Respondents were asked to indicate the number of times in a week they are likely to suffer from depression: 12 152 (57.4%) respondents revealed that they rarely suffer from depression and 2 757 (13%) indicated that they are likely to be affected by depression between 3 and 7 days a week. The study also considered employment status as a possible risk factor of raised blood pressure. It can be seen from the results in Table 4 that 14 408 (68%) of the study participants were not employed whilst 6 772 (32%) were employed. Table 2 shows that 3 320 (15.7%) of the total respondents had high SBP (more than 140 mmHg) and 3 765 (17.8%) participants had abnormal DBP (more than 90 mmHg). Finally, 5 100 (24.1%) study participants were overweight (25-29.9 kg/m²) and 6 161 (29.1%) were obese (30 kg/m² and above).

Pairwise Interactions for SBP
This section presents the main effects and interactions retrieved for the SBP model. The analysis was conducted using 8 explanatory variables, which consist of 2 continuous and 6 categorical variables. Categorical variables were entered in the model as per their respective levels using dummy coding. The results obtained illustrate that 14 main effects and 21 interactions were deduced for the SBP model. The model satisfies the concept of strong hierarchy, as the main effects of all 21 interactions are present. These results indicate that systolic blood pressure can be predicted by the 14 main effect variables and the 21 interactions detected. Applying the group-lasso technique is necessary to predict SBP more accurately by eliminating uninformative variables. Results of the lasso model selection for SBP after fitting the 14 main effect variables and 21 interactions are summarised in Table 4. Nine non-zero coefficients representing 2 main effects and 7 interactions were extracted in the sparse matrix as possible strong predictors of systolic blood pressure. Figure 1 illustrates the cross-validation curve with dotted lines and error bars. The left vertical line in the plot shows the value at which the minimal mean squared error is achieved and the right vertical line shows the most regularized model whose mean squared error is within one standard error of the minimum. It is evident from Figure 1 that the errors increase substantially when the number of variables decreases, but they remain constant between 9 and 34 variables.
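The two vertical lines of a cross-validation plot correspond to two choices of the tuning parameter, which cv.glmnet reports as lambda.min and lambda.1se. The rule behind them can be sketched as follows; the function name and the cross-validation summaries are hypothetical values chosen for illustration, not results from this study.

```python
def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation summaries.

    lambda_min minimises the mean cross-validated error; lambda_1se is the
    largest (most regularised) lambda whose mean error lies within one
    standard error of that minimum.
    """
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[i_min] + cv_se[i_min]
    lambda_1se = max(lam for lam, err in zip(lambdas, cv_mean) if err <= threshold)
    return lambdas[i_min], lambda_1se


# Hypothetical cross-validation output, for illustration only.
lambdas = [0.01, 0.05, 0.1, 0.5, 1.0, 2.0]
cv_mean = [402.0, 400.0, 401.0, 403.5, 410.0, 450.0]
cv_se = [4.0, 4.0, 4.0, 4.0, 4.0, 4.0]

print(select_lambdas(lambdas, cv_mean, cv_se))  # (0.05, 0.5)
```

Choosing the more regularised value trades a statistically negligible increase in error for a sparser model, which matches the flat region of the curve described above.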

Lasso Model Selection for SBP
Figure 1: Cross Validation Plot for SBP Model

These results suggest that the model has optimally chosen 9 variables to be the possible best predictors of SBP, confirming the findings of the sparse matrix represented in Table 5. Table 5 presents the upper classical quantile regression estimated coefficients for the risk factors of SBP. Only the upper quantiles (75% or 95%) were estimated in order to examine how blood pressure risk factors affect individuals most at risk for hypertension. It can be seen from Table 5 that, in all upper quantiles ($\tau = 0.75$ and $\tau = 0.95$), age and race coloured had positive statistically significant effects on SBP. The interactions between BMI and age, BMI and gender male, age and exercises never, and gender male and race African also presented statistically significant relations across all upper quantiles. The interactions between BMI and cigarette consumption and between race coloured and cigarette consumption did not present statistically significant coefficients at either upper quantile. The interaction between race coloured and depression for some or little of the time presented a significant effect at the 75th quantile but not at the 95th quantile.
Similarly to SBP, the pairwise interactions for DBP were estimated using 8 explanatory variables, which consist of 2 continuous and 6 categorical variables. Categorical variables were also treated in the model as per their respective levels. It can be deduced from the analysis that 15 main effects and 29 interactions obeying the strong hierarchy concept were extracted for the DBP model. It is ideal to fit a group-lasso model on the variables extracted so as to eliminate irrelevant variables when predicting DBP. It can be seen from Table 6 that, after fitting the group-lasso model, 4 main effects and 8 interactions were extracted from the sparse matrix as possible strong predictors of DBP. Figure 1 illustrates that the mean squared errors substantially increase when the number of variables decreases, but they remain constant between 12 and 43 variables. These results imply that the model has optimally chosen 12 variables to be the possible strong predictors of DBP, confirming the findings of the sparse matrix represented in Table 8. Table 7 illustrates the upper classical quantile regression estimated coefficients for the risk factors of DBP. BMI, age, race coloured, exercises never, and the interactions between BMI and gender male and between age and race white presented statistically significant effects on DBP across all higher quantiles. The interaction effects between BMI and age, BMI and race coloured, and BMI and employment status did not present any statistically significant relations with DBP. The interactions between BMI and cigarette consumption, gender male and employment status, and exercises never and cigarette consumption displayed statistically significant associations with DBP only at the 75th quantile.

Discussion
This study revealed statistically significant risk factors of hypertension based on the classical quantile regression models estimated. Quantile regression was more helpful in this study because it appropriately captured the effects of the observed risk factors on the upper quantiles of both SBP and DBP.
The study results illustrated that age had positive statistically significant estimated coefficients for both SBP and DBP. The magnitude of the association increased from the 75th quantile to the 95th quantile.
These findings suggest that the prevalence of hypertension increases with age. The interaction between BMI and age had positive statistically significant effects on SBP only across the upper quantiles. These results imply that an increase in both BMI and age is likely to influence the occurrence of raised blood pressure. The present findings seem to be consistent with other research which found that hypertension increases with age, possibly because age is mostly associated with structural changes in the arteries and especially with large artery stiffness 29 .
The interaction between BMI and gender male presented positive significant relations with SBP and DBP at both higher quantiles ($\tau = 0.75$ and $\tau = 0.95$). These findings indicate that among males, an increase in BMI is associated with an increase in SBP. BMI as a main effect presented a positive statistically significant impact on DBP only. These findings imply that an increase in BMI is related to an increase in the prevalence of elevated blood pressure. These results are consistent with previous studies which suggest that men are likely to be more hypertensive than women 9 and that BMI is significantly associated with hypertension, with individuals who are overweight or obese at high risk of developing high blood pressure 30 .
Exercises never was found to be positively significant for DBP only, at both higher quantiles. This indicates that South African individuals who do not exercise are vulnerable to hypertension. Positive estimated coefficients for the interaction between age and exercises never were statistically significant for SBP only, across both quantiles. This finding implies that among South Africans who do not take part in exercise, every additional year of age is associated with an increase in SBP. These results are in line with other studies which revealed that high blood pressure is most common in individuals with a sedentary lifestyle 29 .
Race coloured had a positive effect on both BP measures across all the upper quantiles. These results suggest that the prevalence of raised blood pressure is likely to be higher among coloured people than among other racial groups. Also, a negative statistically significant effect on DBP only was found for the interaction between age and race white, suggesting that among South Africans who are not white, an increase in age is likely to influence the occurrence of raised blood pressure. In regard to racial differences in hypertension, this finding is coherent with prior studies reporting that blacks develop hypertension at an earlier age than whites 31 .
Interaction effects on DBP only, between gender male and employment status and between exercises never and cigarette consumption, were statistically significant at the 75th quantile only. These findings imply that employed males are prone to suffer from high blood pressure, possibly due to work pressure and stress. The interaction between exercises never and cigarette consumption suggests that individuals who smoke and do not take part in physical exercise are prone to hypertension. In regard to cigarette consumption, these results are similar to past studies which also revealed that cigarette smoking is modestly associated with an increased risk of developing hypertension 32 .
A positive statistically significant interaction effect on SBP between gender male and race African was found across both upper quantiles. This indicates that the occurrence of high blood pressure is likely to increase among African males. This finding is consistent with other studies, which noted that a high prevalence of raised blood pressure is experienced more among black males 31 .
The interaction between race coloured and depression for some or little of the time was found to be positively significant for SBP only at the 75th quantile. This outcome indicates that coloured individuals who sometimes suffer from depression are more likely to suffer from hypertension. Similarly, previous studies have suggested that depression increases the risk of suffering from uncontrolled hypertension 33 .
Other risk factors extracted after fitting the SBP and DBP group-lasso models were not statistically significant in the classical quantile regression models, as indicated in Tables 6 and 9 respectively. This may be attributed to the very small number of participants with such lifestyle characteristics in this study.

Conclusion
This study presented an application of the hierarchical group-lasso approach to detect possible interactions between the risk factors of SBP and DBP and to perform variable selection whilst obeying the concept of strong hierarchy. Also, classical quantile regression analysis was conducted in order to estimate the influence of potential risk factors on the upper quantiles (75% or 95%) of the conditional distribution of hypertension.
The results derived from the group-lasso interaction model were considered to be conclusive given their ability to capture linear and non-linear effects while performing variable selection.
The application of these techniques identified some important variables as risk factors of hypertension in South Africa. The evidence of this study suggests that it is ideal to consider interactions amongst risk factors when modelling hypertension, and possibly other diseases, instead of only considering main effect variables. Surveys of this nature should be administered regularly across South Africa so as to continuously monitor and manage the risk factors of hypertension.