Handling missing data

We know different types of people tend to drop out of our studies at different times, depending on their individual circumstances and characteristics.

To support researchers in producing robust analysis, we have developed comprehensive advice on how to deal with missing data. Our handling missing data webinars are a good place to start.

Watch again: Handling missing data webinar – with theory and demo (2023)

Watch again: Handling missing data in the 1970 British Cohort Study webinar (2024)

Our approach

The approaches we recommend to researchers capitalise on the rich data cohort members provided over the years before their non-response. These include well-known methods such as multiple imputation, inverse probability weighting, and full information maximum likelihood.

The methods we recommend all rely on the assumption that the data are missing at random (MAR), which means that systematic differences between the missing values and the observed values can be explained by the rich information available in the cohorts. We have implemented a systematic data-driven approach to choosing which variables within the studies best predict drop out, using multivariable regression analyses and machine learning algorithms for variable selection.

This approach is described in Mostafa et al. [1].

In this work, we show that the methods we recommend are able to restore the composition of the National Child Development Study (NCDS) samples at age 50 and age 55 to be representative of the study’s target population, using external benchmarks, and according to a number of characteristics captured within the original birth sample. For example, we were able to replicate the known population distribution of educational attainment (see figure 1) and marital status at age 50 based on the ONS Annual Population and Labour Force Surveys. We also replicate the original distribution of paternal social class observed at the birth survey, and the distribution of cognitive ability at age 7 (see figure 2).

There could still be other variables in NCDS for which we wouldn’t be able to restore representativeness to the target population, but our findings indicate that using principled methods for handling missing data has strong potential to reduce bias arising from missing data and restore sample representativeness.
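As a rough illustration of why the MAR assumption matters, the sketch below simulates a cohort in which drop-out depends only on a fully observed early-life characteristic, then compares a complete-case estimate with an inverse probability weighted one. It is a toy example under stated assumptions, not the procedure used in the CLS studies: the data are simulated, and the names x, y, responded and p_response are hypothetical. It uses a single binary predictor of response, whereas real applications would draw on many predictors.

```python
import random

random.seed(0)

# Simulated cohort (entirely hypothetical data, for illustration only).
# x: early-life characteristic, observed for everyone.
# y: binary outcome measured at a later sweep.
# responded: whether the cohort member took part in the later sweep.
n = 20000
cohort = []
for _ in range(n):
    x = 1 if random.random() < 0.5 else 0
    y = 1 if random.random() < (0.6 if x else 0.2) else 0
    # Response depends on x only, so the data are MAR given x.
    responded = random.random() < (0.9 if x else 0.5)
    cohort.append((x, y, responded))

# Benchmark: the mean of y if nobody had dropped out.
true_mean = sum(y for _, y, _ in cohort) / n

# Complete-case estimate: ignores drop-out entirely.
respondents = [(x, y) for x, y, r in cohort if r]
complete_case_mean = sum(y for _, y in respondents) / len(respondents)

# Estimate the probability of response within each level of x.
p_response = {}
for level in (0, 1):
    group = [r for x, _, r in cohort if x == level]
    p_response[level] = sum(group) / len(group)

# Weight each respondent by the inverse of their response probability,
# so that under-represented groups count for more.
weights = [1 / p_response[x] for x, _ in respondents]
ipw_mean = sum(w * y for w, (_, y) in zip(weights, respondents)) / sum(weights)

print(f"true mean:          {true_mean:.3f}")
print(f"complete-case mean: {complete_case_mean:.3f}")
print(f"IPW mean:           {ipw_mean:.3f}")
```

In this setup the complete-case mean overstates the outcome because respondents are disproportionately drawn from the group with x = 1, while the weighted mean sits close to the benchmark. If response also depended on y itself, after accounting for x, the data would no longer be MAR and the weighting would not fully remove the bias.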

Training and guidance

Download the Handling missing data in the CLS cohort studies user guide for detailed guidance on how to handle missing data in your own research, including a worked example.

Training

We also offer training on handling missing data. Please keep an eye on our events page for details of future training, or get in touch with us at ioe.clsevents@ucl.ac.uk.

Figure 1. Percentage of those with degree or equivalent at age 50 in the Annual Population Survey and NCDS before and after adjustment for missing data.

APS GB: Annual Population Survey – Born in Great Britain in 1958 (derived by the Office for National Statistics)

APS All: Annual Population Survey – Born in Great Britain or elsewhere in 1958 (derived by the Office for National Statistics)

NCDS50 MI: Estimate after multiple imputation using predictors of educational attainment at age 50 and predictors of non-response at age 50 as auxiliary variables.

Figure 2. Social class of mother’s husband at birth before and after adjustment for missing data.

The imputation phase of MI included predictors of response at age 55 and social class at birth only for cohort members who participated at age 55.

[1]

Mostafa, T., Narayanan, M., Pongiglione, B., Dodgeon, B., Goodman, A., Silverwood, R.J., and Ploubidis, G.B. (2021) Missing at random assumption made more plausible: evidence from the 1958 British birth cohort, Journal of Clinical Epidemiology.

Mostafa, T., Narayanan, M., Pongiglione, B., Dodgeon, B., Goodman, A., Silverwood, R.J., and Ploubidis, G.B. (2020) Improving the plausibility of the missing at random assumption in the 1958 British birth cohort: A pragmatic data driven approach, CLS Working Paper 2020/6. London: UCL Centre for Longitudinal Studies.
