Handling missing data

We know that different types of people tend to drop out of our studies at different times, depending on their individual circumstances and characteristics. To support researchers in producing robust analyses, we have developed comprehensive advice on how to deal with missing data. The approaches we recommend capitalise on the rich data that cohort members provided over the years before their non-response. These include well-known methods such as multiple imputation, inverse probability weighting, and full information maximum likelihood.

The methods we recommend all rely on the assumption that the data are missing at random (MAR), which means that systematic differences between the missing values and the observed values can be explained by observed data, in this case the rich information available in the cohorts. We have implemented a systematic, data-driven approach to choosing which variables within the studies best predict dropout, using multivariable regression analyses and machine learning algorithms for variable selection.

This approach is described in Mostafa et al. [1].
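As a rough illustration of what the MAR assumption means in practice, the sketch below simulates a cohort in which dropout depends only on an early-life measure observed for everyone, then compares a complete-case mean with an inverse-probability-weighted mean. Everything here is hypothetical: the data are simulated, and the names `early_score`, `later_outcome`, `fit_logistic` and `ipw_mean` are ours for illustration, not taken from NCDS or from the approach in Mostafa et al. A real analysis would use many predictors of non-response and an established statistics package.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson.

    X must already contain an intercept column; y is a 0/1 array.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        hess = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def ipw_mean(outcome, responded, predictors):
    """Mean of `outcome` among respondents, weighted by 1 / P(response)."""
    X = np.column_stack([np.ones(len(responded)), predictors])
    beta = fit_logistic(X, responded.astype(float))
    p_respond = 1.0 / (1.0 + np.exp(-X @ beta))
    weights = 1.0 / p_respond[responded]
    return np.sum(weights * outcome[responded]) / np.sum(weights)


# Simulated cohort (assumption: purely illustrative, not real study data).
rng = np.random.default_rng(0)
n = 20_000
early_score = rng.normal(size=n)  # observed for every cohort member
later_outcome = 2.0 + 1.5 * early_score + rng.normal(size=n)

# MAR by construction: response depends only on the observed early_score.
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * early_score)))
responded = rng.random(n) < p_true

full_mean = later_outcome.mean()                       # unknown in real data
complete_case_mean = later_outcome[responded].mean()   # ignores dropout
weighted_mean = ipw_mean(later_outcome, responded, early_score)

print(f"full sample:   {full_mean:.3f}")
print(f"complete case: {complete_case_mean:.3f}")
print(f"IPW estimate:  {weighted_mean:.3f}")
```

In this simulation, respondents have higher early scores on average, so the complete-case mean overstates the full-sample mean, while the weighted mean should sit much closer to it. The correction works only because dropout was simulated to depend on a variable that is observed for everyone, which is exactly what MAR requires.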

In this work, we show that the methods we recommend can restore the composition of the National Child Development Study (NCDS) samples at ages 50 and 55 so that they are representative of the study’s target population, as judged against external benchmarks and against a number of characteristics captured in the original birth sample. For example, we were able to replicate the known population distribution of educational attainment (see figure 1) and marital status at age 50 based on the ONS Annual Population and Labour Force Surveys. We also replicate the original distribution of paternal social class observed at the birth survey (see figure 2), and the distribution of cognitive ability at age 7.

There could still be other variables in NCDS for which we wouldn’t be able to restore representativeness to the target population, but our findings indicate that using principled methods for handling missing data has strong potential to reduce bias arising from missing data and restore sample representativeness.

 

Find out more

More guidance on how users can adopt these methods for handling missing data in their own analyses is available for NCDS users, in our NCDS Missing Data User Guide.

We’ll be producing similar resources for researchers using data from our other studies in the future, including cohort-specific publications and missing data user guides. In the meantime, users of data from other cohorts may find the NCDS-specific material of interest and use, since the same principles can be applied in other settings.

We also offer training on handling missing data. Please keep an eye on our events page for details of future training, or get in touch with us at ioe.clsevents@ucl.ac.uk.

 


 

Figure 1. Percentage of those with degree or equivalent at age 50 in the Annual Population Survey and NCDS before and after adjustment for missing data.

APS GB: Annual Population Survey – Born in Great Britain in 1958 (derived by the Office for National Statistics)

APS All: Annual Population Survey – Born in Great Britain or elsewhere in 1958 (derived by the Office for National Statistics)

NCDS50 MI: Estimate after multiple imputation using predictors of educational attainment at age 50 and predictors of non-response at age 50 as auxiliary variables.

 


 

Figure 2. Social class of mother’s husband at birth before and after adjustment for missing data.

The imputation phase of MI included predictors of response at age 55 and social class at birth only for cohort members who participated at age 55.

 


 

[1] Mostafa, T., Narayanan, M., Pongiglione, B., Dodgeon, B., Goodman, A., Silverwood, R.J., and Ploubidis, G.B. (2020) Improving the plausibility of the missing at random assumption in the 1958 British birth cohort: A pragmatic data driven approach, CLS Working Paper 2020/6. London: UCL Centre for Longitudinal Studies.

Contact us

Centre for Longitudinal Studies
UCL Social Research Institute

20 Bedford Way
London WC1H 0AL

Email: clsfeedback@ucl.ac.uk