Statistical Assessment Of Imputation Algorithms For Estimation Of Missing Values In Cross Sectional Data

OSCAR GYIMAH 119 PAGES (28947 WORDS) Statistics Thesis

ABSTRACT

The validity and quality of a data analysis rely largely on the accuracy and completeness of the data matrix. Missing values are an unavoidable problem in almost every research study and, if not handled properly, may lead to biased and misleading conclusions. This study sought to
investigate the efficacy and accuracy of convergence of five imputation algorithms: expectation maximization (EM), multiple imputation by chained equations (MICE), k-nearest neighbors (KNN), mean substitution (MS) and regression substitution (RS), in estimating and replacing missing values in the cross-sectional World Population Data Sheet under the missing completely at random (MCAR) and missing at random (MAR) assumptions. This
thesis used Little's test to assess whether a given data matrix with missing values is MCAR or MAR. A multiple linear regression model was first fitted to the complete World Population Data Sheet data; missing values were then artificially introduced into the complete data at rates of 5%, 10%, 20%, 30% and 40% under the two missing-data mechanisms (MCAR and MAR). The imputation
algorithms used to handle the missing data were assessed and compared using the average coefficient difference (ACD) of the multiple linear regression (MLR) model, the mean absolute difference (MAD) and the coefficient of determination (R²).
The study suggests that when data in the cross-sectional World Population Data Sheet are missing completely at random (MCAR) and normally distributed, regression substitution is the best approach. The MICE algorithm was found to be comparatively the best method for replacing missing values under the MAR assumption.
Since this thesis concentrates mainly on missing data imputation in a cross-sectional dataset, it is recommended that future studies consider categorical and longitudinal data.
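
The evaluation procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and several things in it are assumptions: the data are synthetic rather than the World Population Data Sheet, only the MCAR mechanism and two of the five algorithms (MS and RS) are shown, and the formulas used for ACD (mean absolute difference between the regression coefficients from the complete and imputed data) and MAD (mean absolute difference between imputed and true values) are plausible readings, since the abstract does not define them. All function names are hypothetical.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns (coefficients, R^2)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, r2

def make_mcar_mask(n, rate, rng):
    """Mark round(rate * n) rows as missing, chosen uniformly at random (MCAR)."""
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(round(rate * n)), replace=False)] = True
    return mask

def mean_substitution(X, mask, col):
    """Replace missing entries of one column with the mean of its observed entries."""
    out = X.copy()
    out[mask, col] = X[~mask, col].mean()
    return out

def regression_substitution(X, mask, col):
    """Predict missing entries from the other predictors, fitted on observed rows only."""
    others = np.delete(X, col, axis=1)
    beta, _ = fit_mlr(others[~mask], X[~mask, col])
    out = X.copy()
    out[mask, col] = np.column_stack([np.ones(mask.sum()), others[mask]]) @ beta
    return out

def evaluate(rates=(0.05, 0.10, 0.20, 0.30, 0.40), n=500, seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic stand-in for the complete data (assumption, not the thesis data).
    X = rng.normal(size=(n, 3))
    X[:, 0] += 0.8 * X[:, 1]  # correlate predictors so regression has signal
    y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)
    beta_full, _ = fit_mlr(X, y)  # reference model on the complete data

    results = []
    for rate in rates:
        mask = make_mcar_mask(n, rate, rng)  # missingness in column 0 only
        for name, impute in (("MS", mean_substitution),
                             ("RS", regression_substitution)):
            X_imp = impute(X, mask, 0)
            beta_imp, r2 = fit_mlr(X_imp, y)
            results.append({
                "rate": rate,
                "method": name,
                # Assumed definitions of the comparison metrics:
                "ACD": float(np.mean(np.abs(beta_imp - beta_full))),
                "MAD": float(np.mean(np.abs(X_imp[mask, 0] - X[mask, 0]))),
                "R2": float(r2),
            })
    return results

if __name__ == "__main__":
    for row in evaluate():
        print(row)
```

Under this reading, smaller ACD and MAD values and an R² closer to that of the complete-data model indicate a better imputation. A MAR variant would make the probability of a value being missing depend on another observed column instead of choosing rows uniformly.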