This page gives an overview of the handling of missing data in cohort studies.
We know different types of people tend to drop out of longitudinal studies over time, depending on their individual circumstances and characteristics.
To help researchers using CLS cohort data deal with this common problem, we have developed comprehensive advice on how to handle missing data and reduce bias.
For more detailed guidance on handling missing data in your own research, download our Handling missing data in the CLS cohort studies user guide.
We regularly run training events in handling missing data, for both new and experienced data users.
Sign up to our mailing list to be the first to hear about upcoming webinars or get in touch with us via email: ioe.clsevents@ucl.ac.uk.
Watch previous webinars on our Training and Support page or subscribe to our YouTube channel.
Missing data occur in longitudinal cohort studies for two reasons:
Missing data mean that when we are conducting an analysis, the sample of cohort members with complete data will be reduced. This reduction in sample size will reduce the statistical power of the analysis, meaning that we are less likely to be able to draw definitive conclusions.
We also know that the probability of response is often patterned by cohort members’ individual characteristics and circumstances. This means that the analysis sample of cohort members with complete data may not be representative of the cohort overall, which may lead to bias in the analysis. This can jeopardise the validity of the findings.
At CLS, we have developed an approach to deal with missing data and reduce bias. It builds on the rich data cohort members have provided to us over the years they have participated in our studies.
We suggest the use of well-known methods such as:
These methods rely on the assumption that the data are missing at random (MAR). MAR implies that systematic differences between the missing values and the observed values can be explained by observed data.
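To make the MAR idea concrete, here is a minimal simulation sketch in Python. It is not taken from the CLS guidance: the variable names, response probabilities and the simple stratum-based adjustment are all illustrative assumptions. The outcome `y` is missing more often for one group than the other, but within each group the missingness is unrelated to `y`, so the systematic difference between respondents and non-respondents is fully explained by the observed characteristic `x`.

```python
import random
import statistics


def simulate_mar(n=20000, seed=0):
    """Simulate a fully observed binary characteristic x and an outcome y
    whose probability of being observed depends only on x (MAR)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0      # always observed
        y = rng.gauss(10 + 5 * x, 2)            # outcome differs by group
        p_respond = 0.9 if x == 1 else 0.4      # hypothetical response rates
        observed = rng.random() < p_respond     # depends on x only, not on y
        rows.append((x, y, observed))
    return rows


def complete_case_mean(rows):
    """Mean of y among respondents only, ignoring the missingness pattern."""
    return statistics.fmean(y for _, y, obs in rows if obs)


def stratified_mean(rows):
    """Estimate the mean of y within each level of x from respondents,
    then weight each level by its share of the full sample."""
    total = len(rows)
    estimate = 0.0
    for level in (0, 1):
        stratum = [r for r in rows if r[0] == level]
        observed_y = [y for _, y, obs in stratum if obs]
        estimate += (len(stratum) / total) * statistics.fmean(observed_y)
    return estimate


rows = simulate_mar()
true_mean = statistics.fmean(y for _, y, _ in rows)  # known only in a simulation
cc_mean = complete_case_mean(rows)
adj_mean = stratified_mean(rows)

print(f"true mean:          {true_mean:.2f}")
print(f"complete-case mean: {cc_mean:.2f}")
print(f"adjusted mean:      {adj_mean:.2f}")
```

In this sketch the complete-case mean is pulled towards the group that responds more often, while the estimate that uses the observed characteristic sits close to the full-sample mean. In real data the true mean is unknown and the MAR assumption cannot be verified directly.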
We can also include additional information to make the MAR assumption more plausible. This can include using “auxiliary variables” (variables not in the analysis model) in multiple imputation.
The most useful auxiliary variables are those which are associated with both the probability of the data being missing and the underlying values of the variables subject to missingness.
The second of these associations, with the underlying values, needs to be considered on an analysis-specific basis, depending on which variables are being analysed. But since the majority of non-response in the CLS cohorts is at the wave rather than the item level, variables associated with missingness (i.e. wave non-response) can be considered in a more general way.
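One informal way to screen candidate auxiliary variables is to check both associations described above. The Python sketch below is only an illustration on simulated data: the candidate names `aux_related` and `aux_unrelated`, the response model and the use of simple correlations are assumptions, not the CLS procedure.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def screen(candidate, outcome, responded):
    """Return (association with response, association with the outcome
    among respondents) for one candidate auxiliary variable."""
    r_response = pearson(candidate, responded)
    pairs = [(c, y) for c, y, r in zip(candidate, outcome, responded) if r == 1]
    r_outcome = pearson([c for c, _ in pairs], [y for _, y in pairs])
    return r_response, r_outcome


# Simulated data: every value here is hypothetical.
rng = random.Random(1)
aux_related, aux_unrelated, outcome, responded = [], [], [], []
for _ in range(5000):
    a = rng.gauss(0, 1)                 # drives both the outcome and response
    b = rng.gauss(0, 1)                 # pure noise
    y = 2 * a + rng.gauss(0, 1)
    p_respond = 1 / (1 + math.exp(-(0.5 + a)))
    aux_related.append(a)
    aux_unrelated.append(b)
    outcome.append(y)
    responded.append(1 if rng.random() < p_respond else 0)

related_scores = screen(aux_related, outcome, responded)
unrelated_scores = screen(aux_unrelated, outcome, responded)

print("aux_related   (response, outcome):", related_scores)
print("aux_unrelated (response, outcome):", unrelated_scores)
```

A candidate that scores highly on both associations, like `aux_related` here, is the kind of variable that would be worth adding to an imputation model. One that is unrelated to both adds little. In practice the association with the outcome can only be assessed among respondents, as the sketch does.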
Find out more about NCDS response and missingness.