Hi there,
I apologise for the long-winded question but I think the detail is necessary to gain any useful insight.
The Data
I am working on the analysis for a randomised controlled trial. At baseline, 250 participants were randomised at the individual level into treatment and control groups. Randomisation appears to have been successful: only 3% of 150 baseline characteristics differed significantly between treatment and control. Attrition has since reduced the dataset to 149 observations. In the remaining sample, 4% of the baseline characteristics are significantly different. I have created attrition probability weights based on these baseline characteristics.
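To make the weighting idea concrete, here is a minimal Python sketch (not my actual Stata code). It assumes a deliberate simplification: the retention probability is estimated as the observed retention rate within cells of a single baseline characteristic, whereas my real weights are based on many baseline characteristics. The function name `attrition_weights` and the toy data are hypothetical.

```python
from collections import defaultdict

def attrition_weights(cells, retained):
    """Return 1 / P(retained | cell) for retained units and None for dropouts.

    cells:    baseline-cell label per participant (hypothetical, e.g. gender)
    retained: True if the participant is still in the follow-up sample
    """
    n_total = defaultdict(int)
    n_kept = defaultdict(int)
    for c, r in zip(cells, retained):
        n_total[c] += 1
        n_kept[c] += int(r)

    weights = []
    for c, r in zip(cells, retained):
        if r:
            # Inverse of the cell's retention rate (n_kept / n_total).
            weights.append(n_total[c] / n_kept[c])
        else:
            weights.append(None)  # dropouts are not in the estimation sample
    return weights

# Toy data: 3 of 4 "F" participants retained, 2 of 4 "M" participants retained.
cells = ["F", "F", "F", "F", "M", "M", "M", "M"]
retained = [True, True, True, False, True, False, True, False]
w = attrition_weights(cells, retained)
```

With weights built this way, the retained participants' weights sum to the original sample size, so the retained sample is reweighted to resemble the baseline sample on the cell variable.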
Methodology until Now
For most of the data I am working with, information was gathered at the individual level using tests and surveys of the participants themselves. The outcomes are sometimes binary and sometimes continuous. I have therefore carried out the analysis in Stata 12 using permutation hypothesis testing, with inverse probability weighting to account for attrition. I have used gender as a stratification variable because it differs significantly between groups both at baseline and in the estimation sample.
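In case it helps to see the procedure spelled out, the following Python sketch shows the general shape of what I am doing. It is not the Stata code itself. It assumes the test statistic is a weighted difference in means and that treatment labels are shuffled only within strata, with outcomes and weights staying attached to each participant. All names and the toy data are hypothetical.

```python
import random

def weighted_mean_diff(y, t, w):
    """Weighted mean of y among treated (t == 1) minus controls (t == 0)."""
    num1 = sum(wi * yi for yi, ti, wi in zip(y, t, w) if ti == 1)
    den1 = sum(wi for ti, wi in zip(t, w) if ti == 1)
    num0 = sum(wi * yi for yi, ti, wi in zip(y, t, w) if ti == 0)
    den0 = sum(wi for ti, wi in zip(t, w) if ti == 0)
    return num1 / den1 - num0 / den0

def stratified_permutation_test(y, t, w, strata, reps=2000, seed=12345):
    """Two-sided permutation p-value, shuffling labels within each stratum."""
    rng = random.Random(seed)
    observed = weighted_mean_diff(y, t, w)

    # Group row positions by stratum so labels never cross strata.
    by_stratum = {}
    for i, s in enumerate(strata):
        by_stratum.setdefault(s, []).append(i)

    extreme = 0
    for _ in range(reps):
        perm = list(t)
        for idx in by_stratum.values():
            labels = [t[i] for i in idx]
            rng.shuffle(labels)
            for i, lab in zip(idx, labels):
                perm[i] = lab
        if abs(weighted_mean_diff(y, perm, w)) >= abs(observed) - 1e-12:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return observed, (extreme + 1) / (reps + 1)

# Toy data: two gender strata, unit weights, a large treatment effect.
y = [10, 11, 12, 0, 1, 2, 13, 14, 15, 3, 4, 5]
t = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
w = [1.0] * 12
strata = ["F"] * 6 + ["M"] * 6
obs, p = stratified_permutation_test(y, t, w, strata)
```

Note that this sketch treats participants as exchangeable within a stratum, which is exactly the assumption that the clustering described below calls into question.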
Partially Clustered Data
Some data is gathered from teacher-provided responses about students. For 100 participants, several responses came from the same teacher: any given teacher provided between 2 and 10 responses at a given time about different participants in their class. There are 22 of these classes, and they vary in size. The other 50 participants have responses from teachers who reported on only one student. As such, the data is partially clustered. I checked the intraclass correlation coefficient (ICC) across a selection of variables, using only those students in class groups. In some cases it was as high as 0.3, so I do not think I can ignore clustering. Some classes contain a mixture of treatment and control participants, and some contain only one group or the other. We had no hand in the clustering and randomisation was not based on the clusters, though there is a chance that treatment affected which classes participants ended up in.
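For reference, the ICC check can be sketched as follows. This is a minimal Python illustration rather than the Stata code I ran, and it assumes the one-way ANOVA estimator adjusted for unequal class sizes. The function name `anova_icc` and the toy data are hypothetical.

```python
def anova_icc(groups):
    """One-way ANOVA estimate of the ICC for clusters of unequal size.

    groups: one list of outcome values per class (classes with >= 2 students)
    """
    k = len(groups)                     # number of classes
    n = sum(len(g) for g in groups)     # total number of students
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Between-class and within-class sums of squares.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    msb = ssb / (k - 1)
    msw = ssw / (n - k)

    # Adjusted average class size, which equals the common size when balanced.
    n0 = (n - sum(len(g) ** 2 for g in groups) / n) / (k - 1)
    return (msb - msw) / (msb + (n0 - 1) * msw)

# Toy data: three classes whose means differ much more than their members do.
icc = anova_icc([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

Singletons cannot contribute to the within-class variance, which is why I restricted this check to students in actual class groups.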
Possible Methods
- Ignore clustering: This seems unfeasible given the intraclass correlation for those in classes.
- Create clusters for remaining ‘singletons’. I did this. This raises the number of clusters to 30. Clustering in this format is used for the Stata options I have examined.
- Stata options that I have run:
  - Permutation-based hypothesis testing using cluster as a strata control variable. Inverse probability weighted.
  - Linear regression or logit, depending on the dependent variable. Ignoring clusters. Inverse probability weighted.
  - vce(cluster): From my reading, cluster-robust standard errors rely on the number of clusters going to infinity, so they are not reliable at 30 clusters.
  - xtmixed, after running xtreg, re and xtreg, fe. Unweighted. I compared the random and fixed effects options for the clusters using the Hausman test, which indicated the use of random effects. However, I believe the model will violate the random effects assumption, as I do not believe treatment status and cluster are completely independent. On reading about the use of "fe" and "re", I saw that cluster-robust standard errors are used when weighting with these methods. As stated above, this seems to be a poor option for the data I have.
  - Controlling for cluster as a categorical variable, for example: reg y treatment i.class. In other Stata forums I have seen this suggested as a method when dealing with EU data due to the relatively small number of clusters. However, given my sample size, the degrees of freedom are greatly reduced with this method.
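To illustrate my concern with the last option, here is a small Python sketch of what I understand a regression of the outcome on treatment plus class dummies to estimate for the treatment coefficient: the within-class (demeaned) estimator. This is my own unweighted illustration and not Stata output. The function name `within_class_effect` and the toy data are hypothetical.

```python
from collections import defaultdict

def within_class_effect(y, t, cls):
    """Treatment coefficient after demeaning y and t within each class.

    Equivalent to the coefficient on t in an unweighted OLS regression of
    y on t plus a full set of class dummies.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # sum of y, sum of t, count
    for yi, ti, ci in zip(y, t, cls):
        s = totals[ci]
        s[0] += yi
        s[1] += ti
        s[2] += 1

    num = 0.0
    den = 0.0
    for yi, ti, ci in zip(y, t, cls):
        sum_y, sum_t, n = totals[ci]
        dy = yi - sum_y / n
        dt = ti - sum_t / n   # zero for everyone in a single-arm class
        num += dt * dy
        den += dt * dt
    if den == 0.0:
        raise ValueError("no class contains both treatment and control")
    return num / den

# Classes A and B are mixed; class C contains treated participants only.
y = [5, 3, 10, 8, 100, 101]
t = [1, 0, 1, 0, 1, 1]
cls = ["A", "A", "B", "B", "C", "C"]
effect = within_class_effect(y, t, cls)
```

In this toy example the estimate is driven entirely by the mixed classes A and B. Class C, and likewise any singleton cluster, contributes nothing, which compounds the loss of degrees of freedom mentioned above.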
Questions
- Is there a legitimate method I can use to incorporate clusters (either using the ‘full set’ from the other clusters I have created or, preferably, using only the clusters that actually exist) into permutation testing such that the method remains robust? Given the relatively small sample size and for continuity within the analysis, this would be ideal.
- Is it necessary to control for differences at baseline, given that the randomisation seems to have been successful?
- I also need to look at interactions for a sub-group analysis, so my search for an appropriate non-permutation-based method must continue as well.
- Is there a Stata command that can be used to model partially clustered data? The paper by Baldwin, Bauer, Stice and Rohde (2014) discusses adjusting multi-level models in SAS, but I can't find anything of the sort in the Stata documentation.
- Is there some other method within Stata that I have not considered that I can use to account for clustering of this sort?