Imputation techniques for non-ordered categorical missing data

Karangwa, Innocent

dc.contributor.advisor	Kotze, Danelle
dc.contributor.advisor	Blignaut, Renette
dc.contributor.author	Karangwa, Innocent
dc.date.accessioned	2016-06-06T12:28:17Z
dc.date.available	2016-06-06T12:28:17Z
dc.date.issued	2016
dc.identifier.uri	http://hdl.handle.net/11394/5061
dc.description	Philosophiae Doctor - PhD	en_US
dc.description.abstract	Missing data are common in survey data sets. Enrolled subjects often do not have data recorded for all variables of interest. The inappropriate handling of missing data may lead to bias in the estimates and incorrect inferences. Therefore, special attention is needed when analysing incomplete data. Multivariate normal imputation (MVNI) and multiple imputation by chained equations (MICE) have emerged as the best techniques to impute, or fill in, missing data. The former assumes a normal distribution of the variables in the imputation model, but can also handle missing data whose distributions are not normal. The latter fills in missing values taking into account the distributional form of the variables to be imputed. The aim of this study was to determine the performance of these methods when data are missing at random (MAR) or completely at random (MCAR) on unordered or nominal categorical variables treated as predictors or response variables in the regression models. Both dichotomous and polytomous variables were considered in the analysis. The baseline data set used was the 2007 Demographic and Health Survey (DHS) from the Democratic Republic of Congo. The analysis model of interest was the logistic regression model of the woman’s contraceptive method use status on her marital status, controlling or not for other covariates (continuous, nominal and ordinal). Based on the data set with missing values, data sets with missing at random and missing completely at random observations on either the covariates or response variables measured on a nominal scale were first simulated, and then used for imputation purposes. Under the MVNI method, unordered categorical variables were first dichotomised, and then K − 1 (where K is the number of levels of the categorical variable of interest) dichotomised variables were included in the imputation model, leaving the other category as a reference.
These variables were imputed as continuous variables using a linear regression model. Imputation with MICE considered the distributional form of each variable to be imputed. That is, imputations were drawn using binary and multinomial logistic regressions for dichotomous and polytomous variables respectively. The performance of these methods was evaluated in terms of bias and standard errors in regression coefficients that were estimated to determine the association between the woman’s contraceptive method use status and her marital status, controlling or not for other types of variables. The analysis was first done assuming that the sample was not weighted; then the sample weight was taken into account to assess whether the sample design would affect the performance of the multiple imputation methods of interest, namely MVNI and MICE. As expected, the results showed that for all the models, MVNI and MICE produced less biased estimates and smaller standard errors than the case deletion (CD) method, which discards items with missing values from the analysis. Moreover, it was found that when data were missing (MCAR or MAR) on the nominal variables that were treated as predictors in the regression model, MVNI reduced bias in the regression coefficients and standard errors compared to MICE, for both unweighted and weighted data sets. On the other hand, the results indicated that MICE outperformed MVNI when data were missing on the response variables, either binary or polytomous. Furthermore, it was noted that the sample design (sample weights), the rates of missingness and the missing data mechanisms (MCAR or MAR) did not affect the behaviour of the multiple imputation methods that were considered in this study. Thus, based on these results, it can be concluded that when missing values are present on the outcome variables measured on a nominal scale in regression models, the distributional form of the variable with missing values should be taken into account.
When these variables are used as predictors (with missing observations), the parametric imputation approach (MVNI) would be a better option than MICE.	en_US
dc.language.iso	en	en_US
dc.publisher	University of the Western Cape	en_US
dc.subject	Missing data	en_US
dc.subject	Multiple imputation	en_US
dc.subject	Multiple imputation by chained equations	en_US
dc.subject	Multivariate normal imputation	en_US
dc.title	Imputation techniques for non-ordered categorical missing data	en_US
dc.rights.holder	University of the Western Cape	en_US
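
As a rough illustration of two steps the abstract describes (dichotomising a nominal variable into K − 1 indicators before MVNI, and simulating MCAR or MAR missingness), the following Python sketch uses only the standard library. The function names, the marital status levels, the ages and the deletion rates are all hypothetical and are not taken from the thesis or the DHS data.

```python
import random


def dummy_code(values, reference):
    """Return K - 1 indicator columns for a nominal variable.

    The reference level gets no column; missing entries (None) stay missing.
    """
    levels = sorted({v for v in values if v is not None and v != reference})
    return {
        level: [None if v is None else int(v == level) for v in values]
        for level in levels
    }


def make_mcar(values, rate, rng):
    """MCAR: every value is deleted with the same probability."""
    return [None if rng.random() < rate else v for v in values]


def make_mar(values, covariate, low_rate, high_rate, threshold, rng):
    """MAR: the deletion probability depends only on an observed covariate."""
    return [
        None if rng.random() < (high_rate if c >= threshold else low_rate) else v
        for v, c in zip(values, covariate)
    ]


# Hypothetical toy data, for illustration only.
rng = random.Random(0)
marital = ["married", "single", "widowed", "married", "single", "married"]
age = [34, 19, 52, 41, 23, 29]

dummies = dummy_code(marital, reference="married")
marital_mcar = make_mcar(marital, rate=0.3, rng=rng)
marital_mar = make_mar(marital, age, low_rate=0.1, high_rate=0.5,
                       threshold=30, rng=rng)
```

Under MVNI the indicator columns in `dummies` would then be imputed as if they were continuous, whereas a MICE-style approach would model `marital` directly with a multinomial logistic regression. Neither imputation step is implemented in this sketch.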

Files in this item

Name: Karangwa_i_phd_ns_2016.pdf
Size: 21.76Mb
Format: PDF
Description: PhD


This item appears in the following Collection(s)

Philosophiae Doctor - PhD (Statistics and Population Studies)
