There seems to be quite a lot of debate over this issue, but I thought I'd try to get some comments specific to my situation. I posted a similar question on StackOverflow (http://stackoverflow.com/questions/3...ssion-in-stata) but was advised to come to Statalist.
I'm using Stata 12 on a Mac. I have a dataset that is collected from subjects every day (excluding weekends). Each response is binary, so on each day there are positive (1) and negative (0) responses. My workflow usually involves data manipulation in Python and Pandas, followed by an export to CSV ready to import into Stata. Before creating the CSV I calculated the probability of success, the odds and the ln odds on each date. The data then look like this:
Code:
date resp freq total prob odds lnodds
2015-01-02 0 14 16 0.125 0.1428571 -1.94591
2015-01-02 1 2 16 0.125 0.1428571 -1.94591
2015-01-05 0 14 15 0.0666667 0.0714286 -2.639057
2015-01-05 1 1 15 0.0666667 0.0714286 -2.639057
The whole dataset covers about 18 months.
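As a rough illustration, the per-date quantities in the listing above can be reproduced with a short Python sketch. This is not the actual Pandas code, just standard-library arithmetic, and the function name `daily_summary` is made up for the example.

```python
import math

def daily_summary(successes, total):
    """Return (prob, odds, lnodds) of success for one day's counts."""
    prob = successes / total
    odds = prob / (1 - prob)   # fails if every response that day is a success
    lnodds = math.log(odds)    # fails if there are no successes that day
    return prob, odds, lnodds

# Counts taken from the two dates shown in the listing above.
for day, successes, total in [("2015-01-02", 2, 16), ("2015-01-05", 1, 15)]:
    prob, odds, lnodds = daily_summary(successes, total)
    print(day, round(prob, 7), round(odds, 7), round(lnodds, 6))
```

Note that a day with zero successes (or all successes) has no finite ln odds, so such days would need separate handling.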
The data shows a clear annual seasonality and so I calculated sin(2*pi*date/365) and cos(2*pi*date/365) variables and ran the following command:
Code:
logit resp c.date c.sin c.cos [fw=freq]
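The seasonal terms used in the model above can be sketched in Python as follows. It assumes that `date` is a Stata daily date (days since 1 January 1960), which the post does not state explicitly; the function name `seasonal_terms` is hypothetical.

```python
import math
from datetime import date

STATA_EPOCH = date(1960, 1, 1)  # assumption: date is stored as a Stata daily date

def seasonal_terms(d, period=365):
    """Return (t, sin term, cos term) for a calendar date d."""
    t = (d - STATA_EPOCH).days          # numeric date, in days
    angle = 2 * math.pi * t / period    # one full cycle every `period` days
    return t, math.sin(angle), math.cos(angle)

t, s, c = seasonal_terms(date(2015, 1, 2))
print(t, s, c)
```

With a period of exactly 365 days the phase drifts by one day each leap year, which is small over an 18-month span.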
After the logistic regression, I calculated the linear prediction and then calculated the raw residuals as the ln odds minus the linear prediction.
I plotted the linear prediction against date and overlaid the ln odds for each day. The variables might need a bit of tweaking, but in general the prediction seemed to fit the data pretty well. The raw residuals also appeared to be approximately normally distributed.
[Figure: linear prediction against date, with the daily ln odds overlaid]
To me, it seems reasonable that I should be able to calculate a prediction interval (not a confidence interval) around the linear prediction to estimate the ln odds of success on any given date (and hence convert it to a probability). However, Stata doesn't seem to allow this. The discussions I've read suggest that this is because any individual outcome can only be binary, 0 or 1, which is undoubtedly true. But the logit link function allows us to convert binary outcomes to odds in the first place so that we can model the data with logistic regression. Why can't the same reasoning be applied to the post-estimation situation?
I've attempted to calculate a prediction interval (PI) by calculating the standard deviation (SD) of the raw residuals and plotting the linear prediction ± 1.96*SD. This produces a plot where the ln odds of success falls outside the PI on only a small number of occasions (as expected). However, the resulting PI has a constant width over time, even though the variation in the first half of the data is greater than in the second half, which is a bit of an issue.
[Figure: linear prediction ± 1.96*SD against date, with the daily ln odds overlaid]
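The band described above can be sketched in Python to show why its width is constant. The numbers are hypothetical values for illustration only, not taken from the real dataset, and `constant_band` is a made-up name.

```python
import statistics

def constant_band(lnodds, xb, z=1.96):
    """Return (lower, upper) limits of xb +/- z * SD(raw residuals)."""
    resid = [obs - fit for obs, fit in zip(lnodds, xb)]  # raw residuals
    sd = statistics.stdev(resid)                         # one SD for the whole series
    return [(fit - z * sd, fit + z * sd) for fit in xb]

# Hypothetical values for illustration only (not from the real dataset).
lnodds = [-1.95, -2.64, -2.10, -2.40]
xb = [-2.05, -2.45, -2.20, -2.30]

band = constant_band(lnodds, xb)
widths = [upper - lower for lower, upper in band]
print(widths)  # every width is 2 * 1.96 * SD, up to floating-point rounding
```

Because a single SD is computed over all dates, the band cannot widen where the residuals are more variable, which is the behaviour seen in the plot.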
So, my three related questions are:
1. Why is this wrong?
2. Why doesn't Stata allow me to calculate PIs automatically?
3. How can I improve this plot?