Can 'machine learning' improve our understanding of non-response in Understanding Society and means of tackling it?
Day: Thu 4 Jul
Panel studies face the challenge of non-response: keeping respondents in the panel over many years. Strategies for handling non-response include (a) changes to data collection (different modes, financial incentives) and (b) methods of data analysis (e.g. weighting, imputation). Statistical models of non-response may be used to create ‘weighting classes’ or to estimate response probabilities from which weights are derived, but such models are not necessarily optimised for prediction, whereas ‘machine learning’ (ML) methods are designed for it.<br />ML methods are proving superior to existing regression-based models of non-response in Understanding Society data for each pair of waves from w1/w2 to w7/w8. A reasonable logistic regression model achieved a prediction mean squared error (MSE) of 0.182 for wave 2 attrition (a model using just the constant had an MSE of 0.191), but a random forest (RF) model using an identical set of independent variables achieved 0.125. For waves 7–8, the MSE for the logit was 0.122 compared with 0.075 for the random forest. The implication is that non-response weights (based on the inverse probability of response) would be more ‘accurate’ if based on the RF predictions rather than the logit predictions. Other ML algorithms are also being investigated for analysing attrition.<br />The ML methods also hold the prospect of identifying key variables affecting attrition – which past research (e.g. Kern at JSM 2018) suggests is likely to include paradata relating to past response. Position in household and month of issue appear more important in the ML models than in the regressions.<br />Research funding from [identifiable] is gratefully acknowledged.
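The comparison described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the real analysis uses Understanding Society wave pairs, and the variable names and model settings here are assumptions. The logit and RF are fitted to the same predictors, their held-out prediction MSE (Brier score) is compared, and inverse-probability-of-response weights are derived from the predicted probabilities.

```python
# Hypothetical sketch: logit vs random forest for predicting wave-on-wave
# response, and inverse-probability non-response weights.
# Synthetic data stands in for the Understanding Society predictors
# (e.g. position in household, month of issue, past-response paradata).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 5))  # illustrative independent variables
# A non-linear, interactive response process, so the RF has something
# to gain over the additive logit.
p_respond = 1 / (1 + np.exp(-(0.5 * X[:, 0] + np.sin(2 * X[:, 1]) + X[:, 2] * X[:, 3])))
y = rng.binomial(1, p_respond)  # 1 = responded at next wave, 0 = attrited

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# Prediction mean squared error (Brier score) on held-out cases,
# analogous to the wave-pair MSEs quoted in the abstract.
mse_logit = brier_score_loss(y_te, logit.predict_proba(X_te)[:, 1])
mse_rf = brier_score_loss(y_te, rf.predict_proba(X_te)[:, 1])

# Non-response weights: inverse predicted probability of response,
# computed for respondents only; clipping avoids extreme weights.
p_hat = rf.predict_proba(X_te)[:, 1].clip(0.01, 0.99)
weights = 1.0 / p_hat[y_te == 1]

print(f"MSE logit={mse_logit:.3f}  MSE RF={mse_rf:.3f}")
```

The clipping threshold and forest size are arbitrary choices for the sketch; in practice weight trimming and model tuning would be decided against the survey's own diagnostics.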