To coincide with the release of the National Child Development Study’s Age 11 essays, CLS hosted a special tutorial on automated content analysis to help researchers make the most of this new data. The session covered the fundamentals of using the Differential Language Analysis ToolKit (DLATK) and was led by H. Andrew Schwartz (Stony Brook University).
Date: 15 January 2018
Time: 10:30 - 16:00
Qualitative data, such as essays and free response questions in surveys, are rich sources of psychological, social and behavioural information. Yet such information has traditionally been impossible to exploit at a large scale. Recent advances in computational linguistics and machine learning have produced automatic content analysis tools, which can now be applied in a wide range of settings, including the open responses collected longitudinally within a large national birth cohort study.
In a new project funded by the Economic and Social Research Council, we are applying such tools to newly transcribed essays written in 1969 by cohort members of the National Child Development Study (NCDS), when they were age 11, in response to the prompt “Imagine you are now 25 years old…”. The responses provide a largely untapped source of psychological and behavioural information that can be linked longitudinally to outcomes for the same individuals.
A new dataset containing the fully transcribed text for 10,500 of these essays will be released by the UK Data Service in mid-February 2018, and will be available for researchers worldwide to download and analyse.
To enable researchers to make the most of this new data release, the Centre for Longitudinal Studies offered the exciting opportunity to attend a specialised tutorial on automated content analysis, provided by H. Andrew Schwartz, faculty of the Computer Science Department and Center for Computational Social Science at Stony Brook University, New York.
The Differential Language Analysis ToolKit
DLATK (Differential Language Analysis ToolKit) is an end-to-end language analysis package, specifically suited for social media and social scientific research applications. It has been used for research published in over 40 peer-reviewed papers across psychology, computer science, public health, medicine, and political science. Although the heart of DLATK is a Python library, it is typically used through a versatile command-line interface (requiring no programming).
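To give a flavour of the kind of analysis involved, the sketch below correlates each word's relative frequency in a set of texts with a numeric outcome, which is the basic idea behind differential language analysis. It is a minimal illustration using only the Python standard library and does not use DLATK or reproduce its interface. The function names, the toy essays and the outcome scores are all hypothetical, invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequencies(text):
    """Return each word's share of all tokens in one text."""
    tokens = tokenize(text)
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def word_outcome_correlations(essays, outcomes, min_docs=2):
    """Correlate each word's relative frequency with the outcome.

    Words appearing in fewer than `min_docs` essays are skipped,
    since correlations for very rare words are unreliable.
    """
    freqs = [relative_frequencies(essay) for essay in essays]
    doc_counts = Counter(word for f in freqs for word in f)
    results = {}
    for word, n_docs in doc_counts.items():
        if n_docs < min_docs:
            continue
        usage = [f.get(word, 0.0) for f in freqs]
        results[word] = pearson(usage, outcomes)
    return results


if __name__ == "__main__":
    # Hypothetical toy data: three short "essays" and an invented outcome score.
    essays = ["good good", "good bad", "bad bad"]
    outcomes = [3.0, 2.0, 1.0]
    for word, r in sorted(word_outcome_correlations(essays, outcomes).items()):
        print(f"{word}: r = {r:+.2f}")
```

A real analysis would involve thousands of texts, correction for multiple comparisons and control variables, which is the sort of work a dedicated toolkit is designed to handle.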
This tutorial covered the fundamentals of automated content analysis using DLATK.
H. Andrew Schwartz is part of the faculty of the Computer Science Department and Center for Computational Social Science at Stony Brook University, New York. He was previously Lead Research Scientist for the interdisciplinary “World Well-Being Project” at the University of Pennsylvania where he created the Differential Language Analysis ToolKit.
Differential Language Analysis ToolKit: http://dlatk.wwbp.org/
Schwartz, H. A., Giorgi, S., Sap, M., Crutchley, P., Ungar, L., & Eichstaedt, J. (2017). DLATK: Differential Language Analysis ToolKit. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 55-60).
Kern, M. L., Park, G., Eichstaedt, J. C., Schwartz, H. A., Sap, M., Smith, L. K., & Ungar, L. H. (2016). Gaining insights from social media language: Methodologies and challenges.
Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267-297.
Schwartz, H. A., & Ungar, L. H. (2015). Data-driven content analysis of social media: A systematic overview of automated methods. The ANNALS of the American Academy of Political and Social Science, 659(1), 78-94.