Hi there, I was hoping you could help me estimate the sample size using either the -power- or -ciwidth- command.
I have pilot data (N=17) consisting of analyte concentrations from matched samples collected via two different methods, methodA and methodB. It contains three variables: id, methodA, and methodB. I've fitted a simple linear regression model to map the methodB measurements onto the methodA scale; the results are as follows:
Code:
. regress methodA methodB

      Source |       SS           df       MS      Number of obs   =        17
-------------+----------------------------------   F(1, 15)        =   2849.23
       Model |  12.3342823         1  12.3342823   Prob > F        =    0.0000
    Residual |  .064934729        15  .004328982   R-squared       =    0.9948
-------------+----------------------------------   Adj R-squared   =    0.9944
       Total |   12.399217        16  .774951065   Root MSE        =    .06579

------------------------------------------------------------------------------
     methodA | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     methodB |   1.000668   .0187468    53.38   0.000     .9607106    1.040626
       _cons |   .1832411   .0526242     3.48   0.003     .0710754    .2954069
------------------------------------------------------------------------------
I want to refine the model using an additional, external dataset of matched samples, which also contains data on the age of the sample. I have no indication of whether age will have an effect, but I want to include both age and the interaction term (methodB*age) in the model to find out. The final model I generate will be applied to a much larger dataset of unmatched samples. The pilot dataset is not representative of the distribution of sample age in the additional training dataset or in the wider, unmatched dataset, and I do not yet have access to the additional dataset.
As well as serving as training data to refine the model and allow the inclusion of the age and interaction predictors, I want to use this additional dataset to validate the model; I plan to perform k-fold cross-validation, along the lines of the sketch below.
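For context, the refit and cross-validation I have in mind for the additional dataset would look roughly like the following. The factor-variable interaction is how I intend to specify methodB*age; the seed, the fold assignment, and k = 5 are placeholders rather than final choices.
Code:
* rough sketch of 5-fold cross-validation for the refined model; the seed,
* fold assignment, and k = 5 are placeholders, not final choices
set seed 12345
generate double u = runiform()
sort u
generate byte fold = mod(_n, 5) + 1          // assign observations to 5 folds
generate double sqerr = .
forvalues k = 1/5 {
    quietly regress methodA c.methodB##c.age if fold != `k'
    quietly predict double yhat if fold == `k'
    quietly replace sqerr = (methodA - yhat)^2 if fold == `k'
    drop yhat
}
quietly summarize sqerr
display "Cross-validated RMSE: " sqrt(r(mean))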
I want to estimate how many samples to include in order to: 1) ensure I have enough power to detect significant relationships between methodA and age / the interaction term; and 2) limit the prediction error to X, where X is a defined value (which I am also unsure how to set).
I read that for a training dataset, sample size should be based on the effect sizes of the predictors (which I have calculated for methodB), whereas for a test dataset, sample size should be based on the magnitude of the prediction error we are willing to detect and the variance of the prediction errors (which I can estimate for methodB). I would be grateful for any general advice on whether this sounds correct. Also, as I only have preliminary data for methodB and not for age or their interaction, how best should I estimate the total effect size, especially given the magnitude of methodB's effect (Cohen's f2; see the code below, and the incremental-f2 sketch that follows it)?
Code:
. ** estimate effect size for methodB
. local r2 : di e(r2)
. local f2methodB = `r2'/(1-`r2')
. di "Cohen's f2: `f2methodB'"
Cohen's f2: 189.9490166125627
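For the added terms, I assume the relevant quantity is the incremental Cohen's f2, (R2_full - R2_reduced)/(1 - R2_full). With a hypothetical full-model R-squared (the 0.9958 below is made up, since I have no pilot estimate for age or the interaction), the calculation would be:
Code:
* incremental f2 for adding age and methodB#age; 0.9958 is a hypothetical
* full-model R-squared, 0.9948 is the reduced (methodB-only) pilot value
local r2full = 0.9958
local r2red  = 0.9948
local f2added = (`r2full' - `r2red')/(1 - `r2full')
di "incremental Cohen's f2: `f2added'"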
I originally set out using the -power- command, as below. However, I was unsure how to specify the number of predictors or the effect size, and how to incorporate age or the interaction term into the estimate of total effect size. I also wondered whether I should instead set the effect size to the smallest estimated effect size among the predictors, e.g. Cohen's f2 = 0.02, to ensure a sufficient sample size to detect an effect that small.
Code:
. power rsq 0.9948, ntested(3)

Performing iteration ...

Estimated sample size for multiple linear regression
F test for R2 testing all coefficients
H0: R2_T = 0  versus  Ha: R2_T != 0

Study parameters:

        alpha =    0.0500
        power =    0.8000
        delta =  191.3077
         R2_T =    0.9948
      ntested =         3

Estimated sample size:

            N =         6
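If I have read the -power rsq- documentation correctly, it can also take full- and reduced-model R-squared values to size an F test of a subset of coefficients, which seems closer to what I need for age and the interaction. Would something like the following be the right approach? (The full-model R-squared is again a made-up placeholder, and I am assuming the full-model value is listed first.)
Code:
* hypothetical full-model R2 (methodB, age, methodB#age) versus the pilot
* reduced-model R2 (methodB only); 2 tested covariates, 1 control covariate
power rsq 0.9958 0.9948, ntested(2) ncontrol(1)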
Then I decided -ciwidth- may be a better option, as it may be more appropriate for sample size determination for model validation. I wrote the following code, but wondered whether there is a way of specifying a regression with multiple predictors, as with the -power- command?
Code:
. quietly regress methodA methodB

. local r = sqrt(e(r2))    // correlation coefficient between methodA and methodB

. su methodA

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     methodA |         17    2.859964     .880313   1.539736   4.555959

. local asd = r(sd)

. su methodB

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     methodB |         17    2.674935    .8774185   1.396721   4.384808

. local bsd = r(sd)

. ciwidth pairedmeans, sd1(`asd') sd2(`bsd') corr(`r') probwidth(0.95)
>     width(0.1)
Performing iteration ...

Estimated sample size for a paired-means-difference CI
Student's t two-sided CI

Study parameters:

        level =   95.0000          sd1 =    0.8803
     Pr_width =    0.9500          sd2 =    0.8774
        width =    0.1000         corr =    0.9974
                                  sd_d =    0.0637

Estimated sample size:

            N =        14
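Alternatively, for the prediction-error goal I wondered whether I could treat the prediction errors as a single sample whose standard deviation is roughly the pilot root MSE, and ask -ciwidth onemean- for the N that bounds the CI width. The width of 0.05 below is just a placeholder for whatever X turns out to be.
Code:
* sd() set to the pilot root MSE; width(0.05) is a placeholder for X
ciwidth onemean, sd(.06579) width(0.05) probwidth(0.95)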
Finally, I wondered whether either command has other options I'm unaware of that would let me make even more use of my pilot data?
Thank you so much in advance, any advice is greatly appreciated!