Handling missing data

Background

This page gives an overview of the handling of missing data in cohort studies.

We know different types of people tend to drop out of longitudinal studies over time, depending on their individual circumstances and characteristics.

To help researchers using CLS cohort data deal with this common problem, we have developed comprehensive advice on how to handle missing data and reduce bias.

Missing data training and guidance

User guide (2024)

For more detailed guidance on handling missing data in your own research, download our Handling missing data in the CLS cohort studies user guide.

Webinars

Watch: Handling missing data webinar (2023)

 

Watch: Handling missing data in the 1970 British Cohort Study webinar (2024)

CLS missing data training

We regularly run training events on handling missing data, for both new and experienced data users.

Sign up to our mailing list to be the first to hear about upcoming webinars, or get in touch with us via email: ioe.clsevents@ucl.ac.uk.

Catch up with past events

Watch previous webinars on our Training and Support page or subscribe to our YouTube channel.

About missing data

Missing data occur in longitudinal cohort studies for two reasons:

  1. Cohort members do not participate in a wave of data collection at all. This is known as “wave non-response”. If they never return to the study, it is known as “attrition”. Reasons include that a cohort member has died or emigrated, cannot be traced, or chooses not to participate.
  2. Cohort members participate in a wave of data collection, but do not answer certain questions or do not take part in certain elements. This is known as “item non-response”. It happens, for example, when a cohort member is unable or chooses not to answer a particular question in a questionnaire.
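The distinction between wave non-response and attrition can be sketched in a few lines of Python. This is only an illustration: the function name `classify_history` and the representation of a cohort member's participation as a list of booleans (one per wave) are assumptions made for the example, not part of any CLS dataset. Attrition can only be judged relative to the waves collected so far, so a member labelled as lost here could still return at a later wave.

```python
from typing import List


def classify_history(participation: List[bool]) -> List[str]:
    """Label each wave for one cohort member.

    `participation` is a hypothetical list with one entry per wave:
    True if the member took part, False if they did not.
    """
    labels = []
    for wave, took_part in enumerate(participation):
        if took_part:
            labels.append("responded")
        elif any(participation[wave + 1:]):
            # Missed this wave but took part again later.
            labels.append("wave non-response")
        else:
            # Missed this wave and every later wave observed so far.
            labels.append("attrition")
    return labels


# A member who misses wave 2, returns at wave 3, then drops out.
print(classify_history([True, False, True, False, False]))
```

Item non-response would be recorded separately, at the level of individual questions within a wave in which the member did take part.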

What is the impact of missing data?

Missing data mean that when we are conducting an analysis, the sample of cohort members with complete data will be reduced. This reduction in sample size will reduce the statistical power of the analysis, meaning that we are less likely to be able to draw definitive conclusions.

We also know that the probability of response is often patterned by cohort members’ individual characteristics and circumstances. This means that the analysis sample of cohort members with complete data may not be representative of the cohort overall, which may lead to bias in the analysis. This can jeopardise the validity of the findings.
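A small simulation can make this concrete. The sketch below is purely illustrative: the characteristic, the outcome values and the response rates (40% versus 90%) are invented for the example and do not describe any CLS cohort. It compares the mean outcome in a full simulated cohort with the mean among respondents only.

```python
import random
import statistics


def simulate_complete_case_bias(n=20000, seed=0):
    """Return (full_mean, complete_case_mean) for one simulated cohort."""
    rng = random.Random(seed)
    all_outcomes = []
    observed_outcomes = []
    for _ in range(n):
        # Hypothetical binary characteristic affecting both outcome and response.
        disadvantaged = rng.random() < 0.5
        outcome = rng.gauss(40.0 if disadvantaged else 60.0, 10.0)
        # Response is patterned by the characteristic (assumed rates).
        p_respond = 0.4 if disadvantaged else 0.9
        all_outcomes.append(outcome)
        if rng.random() < p_respond:
            observed_outcomes.append(outcome)
    return statistics.fmean(all_outcomes), statistics.fmean(observed_outcomes)


full_mean, complete_case_mean = simulate_complete_case_bias()
print(f"Mean in the full cohort:     {full_mean:.1f}")
print(f"Mean among respondents only: {complete_case_mean:.1f}")
```

Under these assumed values the full-cohort mean is about 50, while the respondent-only mean is pulled upwards (to roughly 54 in expectation) because the group with the higher outcome is over-represented among respondents.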

CLS approach to missing data

Our methods

At CLS, we have developed an approach to deal with missing data and reduce bias. It builds on the rich data cohort members have provided to us over the years they have participated in our studies.

We suggest the use of well-known methods such as:

  • multiple imputation
  • inverse probability weighting
  • full information maximum likelihood.
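As an illustration of one of these methods, the following is a minimal sketch of inverse probability weighting, not the CLS implementation. It assumes a single observed grouping variable and estimates each group's response probability as its observed response rate; in practice response probabilities are usually estimated from a model with many predictors. The function name `ipw_mean` and the toy records are hypothetical.

```python
from collections import defaultdict


def ipw_mean(records):
    """Inverse-probability-weighted mean of an outcome.

    `records` is a list of (group, outcome) pairs, with outcome set to
    None when it is missing. Each respondent is weighted by
    1 / (estimated response probability in their group).
    """
    totals = defaultdict(int)
    responders = defaultdict(int)
    for group, outcome in records:
        totals[group] += 1
        if outcome is not None:
            responders[group] += 1

    weighted_sum = 0.0
    weight_total = 0.0
    for group, outcome in records:
        if outcome is None:
            continue
        # Inverse of the group's observed response rate.
        weight = totals[group] / responders[group]
        weighted_sum += weight * outcome
        weight_total += weight
    return weighted_sum / weight_total


# Toy data: group "a" responds 1 time in 4, group "b" always responds.
records = [("a", 10.0), ("a", None), ("a", None), ("a", None),
           ("b", 20.0), ("b", 20.0), ("b", 20.0), ("b", 20.0)]
print(ipw_mean(records))
```

In this toy example the unweighted mean among respondents is 18, whereas the weighted mean is 15, because the single respondent from group "a" also stands in for the three group members who did not respond.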

Missing at random

These methods rely on the assumption that the data are missing at random (MAR). MAR implies that systematic differences between the missing values and the observed values can be explained by observed data.
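One common way of writing the MAR assumption formally is given below. The notation is an assumption introduced for illustration: R denotes the indicator of which values are observed, and the data are split into an observed part and a missing part.

```latex
% MAR: given the observed data, the probability of the missingness
% pattern does not depend on the values that are missing.
P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R \mid Y_{\mathrm{obs}})
```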

We can also include additional information to make the MAR assumption more plausible. This can include using “auxiliary variables” (variables not in the analysis model) in multiple imputation.

The most useful auxiliary variables are those which are associated with both the probability of the data being missing and the underlying values of the variables subject to missingness.

The latter of these needs to be considered on an analysis-specific basis depending on which variables are being analysed. But since the majority of non-response in the CLS cohorts is at the wave rather than the item level, variables associated with missingness (ie wave non-response) can be considered in a more general way.

Study response rates

Find out more about NCDS response and missingness.

Contact us

Centre for Longitudinal Studies
UCL Social Research Institute

20 Bedford Way
London WC1H 0AL

Email: clsdata@ucl.ac.uk
