Adapted Factor-Type Imputation Strategies

The present paper provides practitioners with alternative improved Factor-Type (F-T) estimators of the population mean in the presence of item non-response. The proposed estimators are shown to be more efficient than the four existing estimators, which are themselves more efficient than the usual ratio and mean estimators. Optimum conditions for minimum mean squared error are obtained for the new estimators. Empirical comparisons based on three different data sets establish that the proposed estimators record the least mean squared error, and hence a substantial gain in Percentage Relative Efficiency (P.R.E.), over these five contemporary estimators.


Introduction
Surveys are often accompanied by incomplete response or unavailable items. When faced with an incomplete dataset, analysis restricted to complete records may lead to biased inference [1]. The alternative is to construct estimates for the missing observations. Estimation of individual missing items based on the survey response is called imputation. Missing patterns are classified as missing at random (MAR) and observed at random (OAR) [2]. The data are MAR if the probability of the observed missingness pattern, given the observed and unobserved data, does not depend on the values of the unobserved data. In other words, the data are missing only due to chance factors. The data are OAR if, for every possible value of the missing data, the probability of the observed missingness pattern, given the observed and the unobserved data, does not depend on the values of the observed data. The combination of MAR and OAR is called missing completely at random (MCAR), which means that the propensity to provide information is constant for all subjects. In this paper, we implicitly assume that the missing values are MCAR.
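The MCAR assumption can be illustrated with a short simulation. The following is a minimal sketch, not part of the original study: the function name `mcar_mask`, the sample size and the response probability are hypothetical choices made only for illustration. The point is that the response indicator is drawn without looking at either the study variable or the auxiliary variable.

```python
import random


def mcar_mask(n_units, response_prob, seed=0):
    """Return a list of booleans, True where a unit responds.

    Under MCAR every unit has the same response propensity, so the
    indicator is generated independently of both Y and X.
    """
    rng = random.Random(seed)
    return [rng.random() < response_prob for _ in range(n_units)]


# Hypothetical example: 1000 sampled units, constant 65% response propensity.
mask = mcar_mask(n_units=1000, response_prob=0.65)
responding = sum(mask)          # number of responding units, r
non_responding = len(mask) - responding
```

Because the mask never consults the data values, any dependence of missingness on Y or X (the MAR-only or OAR-only cases) would require a different generator.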
Various techniques for mean estimation under non-response have been considered by several researchers [3][4][5], among others. Some well-known imputation methods in the literature are deductive imputation, mean imputation overall (MO), random imputation overall (RO), mean imputation within classes (MC), hot deck, cold deck and so on [6].
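To make two of the classical methods concrete, the sketch below implements overall mean imputation and the textbook ratio method of imputation on a toy sample. It is an illustration under stated assumptions, not the estimators proposed in this paper: the function names and the toy data are hypothetical, and missing values of Y are represented by `None`.

```python
from statistics import mean


def impute_mean(y, responded):
    """Replace each missing y by the mean of the responding units."""
    y_r = mean(yi for yi, ok in zip(y, responded) if ok)
    return [yi if ok else y_r for yi, ok in zip(y, responded)]


def impute_ratio(y, x, responded):
    """Replace each missing y_i by b * x_i, with b = (mean of y over
    respondents) / (mean of x over respondents)."""
    y_r = mean(yi for yi, ok in zip(y, responded) if ok)
    x_r = mean(xi for xi, ok in zip(x, responded) if ok)
    b = y_r / x_r
    return [yi if ok else b * xi for yi, xi, ok in zip(y, x, responded)]


# Toy sample of n = 4 units; the last two did not report y.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, None, None]
responded = [True, True, False, False]

mean_completed = impute_mean(y, responded)        # [2.0, 4.0, 3.0, 3.0]
ratio_completed = impute_ratio(y, x, responded)   # [2.0, 4.0, 6.0, 8.0]
```

Averaging the ratio-completed data reproduces the familiar point estimator, the respondent mean of y multiplied by the ratio of the sample mean of x to the respondent mean of x (here 3 × 2.5 / 1.5 = 5).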
Compromised imputation [7], imputation based on power transformation [8], modifications of the ratio and regression methods of imputation [9], exponential ratio type imputation based on an auxiliary variable [10], the Factor-Type estimator adapted as a tool of imputation [11], and imputation based on a modified Walsh estimator [12] have been considered in the literature so far. The present work adapts the strategy of [12], based on the Factor-Type technique, to provide better and more efficient means of imputing missing survey data. In the present paper, three separate strategies are proposed to impute missing data under (i) low non-response: a few missing data under item non-response, the most common missing data situation encountered in practice, and (ii) high non-response: large chunks of missing data under a variable, as when a region is inaccessible for data collection or the information is invasive for a particular section of the target group. The present theoretical contribution proposes estimators with higher precision and substantially reduced M.S.E. as compared to the five imputation strategies currently in use. Numerical results based on three different types of populations, A (moderate sized), B (small sized) and C (large sized), illustrate the superiority of the proposed estimators where the variable to be imputed is highly positively correlated with the auxiliary variable or covariate.
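A sample of size $n$ is drawn by Simple Random Sampling without Replacement from a finite population to estimate the population mean $\bar{Y}$. The $r$ responding units among the $n$ sampled units form a set $R$, and the non-responding units form the set $R^c$. The variable $Y$ is of main interest and $X$ is an auxiliary variable highly correlated with $Y$. For every unit $i \in R$, the value $y_i$ is observed. For units $i \in R^c$, the $y_i$ values are missing and imputed values are derived. The $i$-th value $x_i$ of the auxiliary variate $X$ is used as a source of imputation for missing data when $i \in R^c$. For sample $S$, the data [...] Assuming MCAR for non-response with $r$ and $n$ known, we obtain: [...] The M.S.E. of the estimator $t_5$ is given by [...]. The minimum M.S.E. of $t_6$ for the optimum value of $t$ is similar to that of the usual ratio type estimator $t_2$, and therefore the expressions of bias and M.S.E. are the same and [...].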

Proposed Methods of Imputation
Under the proposed strategies for imputing missing observations, the data after imputation take the following form: [...] where [...]. The minimum M.S.E. at [...].
(ii) The estimator [...] with the conditional bias [...]. The minimum M.S.E. at [...].
(iii) The estimator [...] has the conditional bias [...] and M.S.E. [...]. The minimum M.S.E. at [...].
Substituting the value of [...] and using the concept of large sample approximation, we get [...] and higher order leads to equations (3), (7) and (11). Now, taking expectation of both sides of equations (3), (7) and (11), we get [...]. Simplifying the above three expressions, equations (4), (8) and (12) are obtained. Squaring both sides of equations (3), (7) and (11), and neglecting terms of $e$'s having power greater than two, we have [...]. Now taking expectation on both sides in all the above expressions, we get [...] and $e_3$ leads to (5), (9) and (13).
where [...]. Equation (15) is a polynomial of degree four, and the pair $(V, f)$ can be treated as known. The four possible roots (some of which may be imaginary) are those for which the optimum level of M.S.E. attains its minimum. The following algorithm yields bias control at the optimum level of M.S.E.

STEP I:
 so as to achieve the solution quickly.

Remark 3:
The quantity $V$ is stable over a moderate length of time; it would either be known initially or could be guessed on the basis of past data [13]. Also, equation (15) involves the unknown parameter only up to the fourth power, while $V$ and $f$ are known in advance.
Therefore, one can solve equation (15) in order to obtain optimal values in the suggested class.
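Solving a degree-four polynomial numerically and discarding the imaginary roots can be sketched as follows. This is only an illustration: the coefficients of equation (15) depend on $V$ and $f$ and are not reproduced here, so the quartic used below is a hypothetical stand-in with known roots. The root finder is the Durand-Kerner iteration, chosen so that the sketch needs no external library; the function names and tolerances are assumptions.

```python
def poly_roots(coeffs, iterations=500):
    """Approximate all complex roots of a polynomial by the
    Durand-Kerner iteration.

    coeffs lists the coefficients from the highest degree down and
    coeffs[0] must be non-zero.
    """
    lead = coeffs[0]
    monic = [c / lead for c in coeffs]
    degree = len(monic) - 1
    # Standard choice of distinct complex starting points.
    roots = [(0.4 + 0.9j) ** i for i in range(degree)]
    for _ in range(iterations):
        updated = []
        for i, z in enumerate(roots):
            value = 0j
            for c in monic:              # Horner evaluation of p(z)
                value = value * z + c
            denom = 1 + 0j
            for j, w in enumerate(roots):
                if j != i:
                    denom *= z - w
            updated.append(z - value / denom)
        roots = updated
    return roots


def real_roots(coeffs, tol=1e-8):
    """Keep only the roots whose imaginary part is negligible."""
    return sorted(z.real for z in poly_roots(coeffs) if abs(z.imag) < tol)


# Hypothetical quartic (t - 1)(t - 2)(t^2 + 1) = t^4 - 3t^3 + 3t^2 - 3t + 2,
# which has two real roots and two imaginary ones.
candidates = real_roots([1, -3, 3, -3, 2])
```

In practice each real candidate would then be substituted into the bias and M.S.E. expressions to choose among them, as the algorithm in the text describes.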

Analytical comparison among the proposed estimators
Using equations (9)-(11), the following identities are obtained: [...] Therefore, the proposed estimation strategy $y_{FT3}$ is preferable over the other two proposed methods.

Analytical comparison between proposed estimators and [12] strategy
which is always true.
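It thus emerges that the proposed estimators $y_{FT1}$ and $y_{FT3}$ are better than the [12] imputation method, and the estimators [...] $t_6$ are equally efficient. It is thus established that $y_{FT3}$ is the best estimator among the existing and proposed estimators discussed above.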

Almost Unbiased Imputation Methods
In terms of expressions (3), (4) and (5), the bias of $y_{FTj}$; $j = 1, 2, 3$ could be made zero up to the first order of approximation. This provides the following three equations: [...] Using data from [5] in the above equations, we obtain that either [...], where the proposed estimators are almost unbiased. Equation (20) is a polynomial equation in [...], with values [...] 9902 [...],
while the rest of the roots appear to be imaginary, which renders the imputed estimator almost unbiased up to the first order of approximation.

Empirical Study
To compare the effect of the missing data mechanism under the proposed and the existing imputation strategies, the following three populations are considered:
Population A: (Source: [14]). X represents the number of students and Y the number of teachers.
Population C: (Source: [16]). Data collected by a market research company consist of N = 2376 points of sale for which the sale area Y (in square meters) and the number of employees X are surveyed.
Using the whole data set, the following statistics are obtained: [...], where $f$ denotes the finite sampling fraction, ranging from 5% to 25%. Further, $r$ responding units are subsampled from the sample of $n$ units, and response rates ranging from 5% to 95% are considered.
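The two-stage selection described above (a sample of $n$ units at a given sampling fraction, then $r$ respondents subsampled from it) can be sketched as below. This is a minimal illustration under assumptions: the function name, the rounding rule for $n$ and $r$, and the use of unit indices in place of the actual survey records are all hypothetical; only N = 2376 is taken from Population C.

```python
import random


def draw_sample_and_respondents(population_size, sampling_fraction,
                                response_rate, seed=0):
    """Draw n = f * N units without replacement, then subsample the
    r = response_rate * n responding units from that sample."""
    rng = random.Random(seed)
    n = max(1, round(sampling_fraction * population_size))
    sample = rng.sample(range(population_size), n)     # SRSWOR of size n
    r = max(1, round(response_rate * n))
    respondents = rng.sample(sample, r)                # responding set R
    return sample, respondents


# Population C has N = 2376 units; f = 5% and a 65% response rate are
# one of the combinations considered in the study.
sample, respondents = draw_sample_and_respondents(2376, 0.05, 0.65)
non_respondents = sorted(set(sample) - set(respondents))   # set R^c
```

Repeating the call over the grid of sampling fractions (5% to 25%) and response rates (5% to 95%) gives the design points at which the estimators are compared.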
Percentage Relative Efficiency (P.R.E.) of the proposed imputation methods with respect to the mean, ratio, compromised, [4] and [12] imputation methods, using equation (21), are reported in Tables 3-5 for the three data sets respectively.
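The P.R.E. computation can be illustrated with a small Monte Carlo sketch. It assumes the conventional definition, P.R.E. = 100 × M.S.E.(existing) / M.S.E.(proposed), which is taken here to correspond to equation (21); that equation is not reproduced in this text. The population below is synthetic, and the comparison shown is the classical ratio method against the mean method, not the proposed F-T estimators; all names and parameter values are hypothetical.

```python
import random
from statistics import mean


def percent_relative_efficiency(mse_existing, mse_proposed):
    """Conventional P.R.E.: values above 100 favour the second estimator."""
    return 100.0 * mse_existing / mse_proposed


def simulate_mse(reps=500, N=500, n=100, r=30, seed=1):
    """Empirical M.S.E. of the mean and ratio imputation estimators
    over repeated SRSWOR samples from a synthetic population."""
    rng = random.Random(seed)
    # Synthetic population with y strongly and positively related to x.
    x = [rng.uniform(10, 50) for _ in range(N)]
    y = [2.0 * xi + rng.gauss(0, 1) for xi in x]
    true_mean = mean(y)
    se_mean = 0.0
    se_ratio = 0.0
    for _ in range(reps):
        sample = rng.sample(range(N), n)
        resp = rng.sample(sample, r)
        y_r = mean(y[i] for i in resp)
        x_r = mean(x[i] for i in resp)
        x_n = mean(x[i] for i in sample)
        se_mean += (y_r - true_mean) ** 2                 # mean imputation
        se_ratio += (y_r * x_n / x_r - true_mean) ** 2    # ratio imputation
    return se_mean / reps, se_ratio / reps


mse_mean, mse_ratio = simulate_mse()
pre_ratio_vs_mean = percent_relative_efficiency(mse_mean, mse_ratio)
```

With a strongly correlated auxiliary variable the ratio method should show a P.R.E. well above 100 relative to the mean method, which is the kind of comparison tabulated for the proposed estimators.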

Discussion
Based on the empirical investigation summarized in Table 3, we conclude that the proposed estimator $y_{FT1}$ performs better than the prevalent contemporary estimators when the response rate is higher than 50%, and continues to gain in efficiency as the response rate increases. At a 65% response rate, the relative efficiency gain is observed as 238.84%, 161.55%, 160.16%, 122.97% and 160.16%, which grows further to 358.45%, 291.65%, 290.46%, 235.31% and 290.45% at an 80% response rate, with respect to the mean, ratio, compromised, [4] and [12] methods of imputation respectively. The proposed estimator $y_{FT1}$ is therefore more suitable for imputing data in real survey situations where the response rate is 65% or higher, which is the most common data collection scenario in real sample surveys observed so far.
As the response rate decreases, the proposed estimator $y_{FT2}$ performs better. For example, if the response rate is 5%, then a gain of 753.88%, 111.52%, 100.04%, 679.55% and 100% (i.e. equal) with respect to the mean, ratio, compromised, [4] and [12] methods of imputation respectively is expected from the proposed method $y_{FT2}$. At a response rate of 35%, the corresponding relative efficiency gains are 251.57%, 102.67%, 100.01% (i.e. equal), 131.17% and 100.00% (i.e. equal) respectively. Hence, the proposed estimator $y_{FT2}$ is suited for imputing missing data in real survey situations where higher fractions of non-respondents occur.
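Relative Efficiency of the proposed estimator $y_{FT3}$ increases as the response rate in the survey increases (similar to the behavior exhibited by the proposed estimator $y_{FT1}$), except under the mean method of imputation, where the relative efficiency of the proposed estimator $y_{FT3}$ is substantially high at 1120.14%. The median gain in efficiency for $y_{FT3}$ in comparison to the prevalent ratio, compromised, [4] and [12] methods of imputation is observed to be 606.25%, 597.09%, 527.06% and 597.03% respectively at a 50% response rate.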
Similarly, based on the empirical summary in Table 4, we conclude that the proposed estimator $y_{FT1}$ is more efficient than the other five estimators considered when the response rate is higher than 50%, and continues to gain in efficiency as the response rate increases. At a 65% response rate, the relative efficiency gain is observed as 267.36%, 199.40%, 172.50%, 172.50% and 172.50%, which grows substantially to 1414.43%, 1362.27%, 1341.61%, 1356.75% and 1341.61% at a 95% response rate, with respect to the mean, ratio, compromised, [4] and [12] methods of imputation respectively.
As the response rate decreases, the proposed estimator $y_{FT2}$ performs better. For example, if the response rate is 5%, then a gain of 1517.01%, 502.00%, 100.00%, 1308.16% and 100% (i.e. equal) is achieved by the proposed method $y_{FT2}$ with respect to the mean, ratio, compromised, [4] and [12] methods of imputation respectively. At a response rate of 35%, the corresponding relative efficiency gains are 284.79%, 152.42%, 100.00% (i.e. equal), 139.71% and 100.00% (i.e. equal) respectively. Hence, the proposed estimator $y_{FT2}$ is suited for imputing missing data in real survey situations where higher fractions of non-respondents occur.
Relative Efficiency of the proposed estimator $y_{FT3}$ increases as the response rate in the survey increases (similar to the behavior exhibited by the proposed estimator $y_{FT1}$), except under the mean method of imputation, where the relative efficiency of the proposed estimator $y_{FT3}$ is substantially high at 5204.25%. The median gain in efficiency in comparison to the prevalent ratio, compromised, [4] and [12] methods of imputation for the proposed estimator is observed at 3329.28%, 2586.69%, 2729.63% and 2586.69% respectively at a 50% response rate.
Similarly, from Table 5, the proposed estimator $y_{FT1}$ is seen to perform better than the prevalent contemporary estimators when the response rate is higher than 50%, and the gain in efficiency continues as the response rate increases. At a 65% response rate, the relative efficiency gain is observed as 207.03%, 210.00%, 146.36%, 152.41% and 146.36%, which grows further to 430.07%, 430.97%, 411.78%, 427.36% and 411.78% at a 95% response rate, with respect to the mean, ratio, compromised, [4] and [12] methods of imputation respectively. The proposed estimator $y_{FT1}$ is therefore more suitable for imputing data in real survey situations where the response rate is 65% or higher, which is the most common data collection scenario in real sample surveys observed so far.
As the response rate decreases, the proposed estimator $y_{FT2}$ performs better. For example, if the response rate is 5%, then a gain of 430.50%, 454.02%, 100.00%, 372.64% and 100% (i.e. equal) is achieved with respect to the mean, ratio, compromised, [4] and [12] methods of imputation respectively. At a response rate of 35%, the corresponding relative efficiency gains are 215.46%, 221.11%, 100.00% (i.e. equal), 109.81% and 100.00% (i.e. equal) respectively. Hence, the proposed estimator $y_{FT2}$ is suited for imputing missing data in real survey situations where higher fractions of non-respondents occur.
Relative Efficiency of the proposed estimator $y_{FT3}$ increases as the response rate in the survey increases (similar to the behavior exhibited by the proposed estimator $y_{FT1}$), except under the mean method of imputation, where the relative efficiency of the proposed estimator $y_{FT3}$ is substantially high at 526.32%. The median gain in efficiency in comparison to the prevalent ratio, compromised, [4] and [12] methods of imputation for the proposed estimator is observed at 537.01%, 307.69%, 307.72% and 307.69% respectively at a 50% response rate.

Conclusion
Thus, the present work provides three more efficient alternative imputation strategies, for situations involving a higher fraction of non-respondents ($y_{FT2}$) as well as for situations which involve smaller data loss ($y_{FT1}$ and $y_{FT3}$) on the study characteristic in bivariate sample data. The new estimators are formulated by transforming Singh's Walsh-type estimator to Factor-Type estimators. The proposed estimator $y_{FT3}$ shows the highest improvement in terms of relative efficiency among the three proposed estimators. All three proposed alternative estimators are found, theoretically as well as empirically, to have higher P.R.E. and are therefore regarded as superior to the existing mean, ratio, compromised, [4] and [12] methods of imputation. The present paper is therefore an important contribution for practitioners in the area of missing data analysis, as it offers estimators that improve on the existing ones for imputing lost or missing data.




Table 1. Some well-known imputation methods.

Table 2. Some special transforms of [...]

Table 3. P.R.E. of the Proposed Imputation Methods w.r.t. the known Estimators for 5%-25% Sampled for Population A.

Table 5. P.R.E. of the Proposed Imputation Methods w.r.t. the known Estimators for 5%-25% Sampled for Population C.