addressing left censoring in survival analysis with enter() option?

December 8, 2016, 8:38 am

Dear all,

I am working with a single-record-per-id, single-spell survival dataset with late entry, and I am trying to do everything I need within this framework using built-in Stata commands: the standard hazard and survival function graphs by group, as well as (semi-)parametric regressions.

Quickly, the setting: let's call a person going to a certain store a relationship. I want to model the survival of these relationships in the last 12 quarters prior to the specific store's closing. I am not worried about right censoring, because the spell is either completed when the relationship terminates before the store closes for good, or the spell lasts until t=12. There is nothing after the "right end" of my study time.

My question relates to the "left end" of my sample, i.e. the relationships that existed prior to the last 12 quarters of the store's existence and survived into my analysis time. I am in the fortunate situation of having a very large dataset that goes back all the way, so I know when every relationship started.

My question is how to properly take this information into account to not have a problem arising from left censoring.

Can this be accomplished by simply typing

stset timevar, failure(D_CENSORED==1) origin(time 0) enter(time relastartdate)

where D_CENSORED is an indicator variable equal to 1 when the spell ends before the store closes, and relastartdate gives the start quarter of the relationship relative to the beginning of analysis time (i.e. it is negative for otherwise left-censored observations)? I tried this with a dataset I created just for this purpose, but including the enter() option does not seem to alter the Kaplan-Meier survival function or anything else.

Any help is appreciated!
Swati

/edit: I also tried stsetting the dataset with positive values for relastartdate, because I had realized that timevar should be positive, but it still had no effect. Is this because all observations before a subject becomes at risk are ignored in the analysis? If so, how can I address the left censoring without dropping these observations or integrating them out?
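
One possible setup for delayed entry is sketched below. It is only a sketch, assuming that timevar is the exit quarter and relastartdate the start quarter, both measured relative to the start of the 12-quarter window (quarter 0). The idea is to put the origin at the start of each relationship and let enter() mark when observation begins, rather than the other way around.

```
* Sketch only: assumes timevar and relastartdate are both measured
* relative to the start of the 12-quarter window (quarter 0).
* origin() makes analysis time the age of the relationship;
* enter(time 0) delays entry until the window opens.
stset timevar, failure(D_CENSORED==1) origin(time relastartdate) enter(time 0)
sts graph
```

With this setup a relationship that began before quarter 0 would contribute to the risk set only from the age it had reached when the window opened.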

↧

Heckman probit error

December 8, 2016, 8:49 am

Hello.

When I run the following commands, an error message appears:
. gen capphi=norm(p1)
. gen invmills=phi/capphi

How can one fix this? Stata didn't seem to recognize the norm() function.
Need this for some research in social studies.
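
For reference, current Stata names the standard normal CDF normal() and the density normalden(). A minimal sketch of the inverse Mills ratio with those functions follows, assuming p1 holds the linear prediction from the probit (the snippet above does not show how phi was created, so it is generated here as well).

```
* p1 is assumed to hold the probit linear prediction (xb).
generate double phi      = normalden(p1)   // standard normal density
generate double capphi   = normal(p1)      // standard normal CDF
generate double invmills = phi/capphi
```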

Thank you for answers.

↧

Supercolumn position in table command

December 8, 2016, 9:00 am

Dear All,

I need a table with the same layout as shown below, but with the supercolumn totals placed first (before "cheap") in the order of supercolumns.
Just in case, I am looking for a solution which would utilize the existing options of the -table- command, not totally rewriting this command (unless a drop-in replacement is already available).

Thank you, Sergiy Radyakin

Code:

--------------------------------------------------------------------------
Repair    |                  Price category and Car type                  
Record    | ------ cheap -----    ---- expensive ---    ------ Total -----
1978      | Domestic   Foreign    Domestic   Foreign    Domestic   Foreign
----------+---------------------------------------------------------------
        1 |                        4,564.5               4,564.5          
        2 |    3,667               6,296.3               5,967.6          
        3 |    3,515     3,895     6,993.6   5,295.5     6,607.1   4,828.7
        4 |    3,829     3,995     6,138.1   6,544.8     5,881.6   6,261.4
        5 |    3,984     3,773       4,425   7,012.6     4,204.5   6,292.7
--------------------------------------------------------------------------

Obtained with the following code:

Code:

version 13.0
clear all
sysuse auto
recode price (1/4000=1 "cheap") (4001/16000=2 "expensive"), generate(pcateg)
label variable pcateg "Price category"
table rep78 foreign pcateg, c(mean price) scol

↧

Converting text values to numeric values in Stata dataset

December 8, 2016, 9:33 am

Good afternoon! I'm very new to Stata and am having difficulty converting a text value to a numeric one. I have a data set with hospital account types (22 account types in a variable called "accnt_type"), each 4 to 5 characters long. For example, "O ER" and "I INP" are two of the account types. I used encode accnt_type, gen(visit_type) to convert the string variable to a numeric one, creating a new variable called visit_type.

Now, I'd like to replace "O ER" and "O OB" values with a 2, the "O OUT" value with a 1, and all of the other visit types with a 0 - where the 0, 1, and 2 appear in the dataset. However, I can't determine how to do this. Can you provide some guidance on this, please? Thank you.
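
A minimal sketch of one way to build such a variable, working from the original string variable rather than the encoded one (encode assigns codes in alphabetical order, so the numeric codes would otherwise have to be looked up). The name visit_cat is hypothetical.

```
* Sketch: recode directly from the string variable.
* visit_cat is a hypothetical name for the new 0/1/2 variable.
generate byte visit_cat = 0
replace visit_cat = 2 if inlist(accnt_type, "O ER", "O OB")
replace visit_cat = 1 if accnt_type == "O OUT"
tabulate accnt_type visit_cat, missing
```

Note that observations with an empty accnt_type would also end up as 0 under this sketch.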

↧

revisiting options for the missing F statistic

December 8, 2016, 9:52 am

Hello All,
We conducted a study in 7 clinics and are now analysing our outcome. Of course, we are adjusting for clustering (see model below). The problem is that the F statistic is missing. We checked, and we do not have any clusters with only 1 observation. The problem appears to be that we have too many variables in the model: the statistic is reported only when the model has 5 variables. This makes it difficult to test the full hypothesis. I've read this post (http://www.stata.com/statalist/archi.../msg00646.html) but am still at a loss on how to handle this problem. A few questions:
1) One statistician advised that the results are still accurate despite the missing test statistic. Is this the case? Are our results valid even if the statistic is missing?
2) Because we have so few clusters, we considered creating covariates for the clinics and placing them in the model, recognizing that without a cluster option we sacrifice control for unobserved cluster-level effects. Is this an acceptable approach?
3) What other approach can we try if the answers to 1 and 2 are no?
Thanks so much.

. xi: svy, subpop(female): poisson qualrate age sex orient setstdvpc coresrhcaremed1 confid checkin clinicexp providertx learnpreg learnstd , irr
(running poisson on estimation sample)

Survey: Poisson regression

Number of strata = 1 Number of obs = 384
Number of PSUs = 7 Population size = 385
Subpop. no. of obs = 329
Subpop. size = 329
Design df = 6
F( 6, 1) = .
Prob > F = .

------------------------------------------------------------------------------
             |             Linearized
    qualrate |        IRR   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.004954   .0183052     0.27   0.795     .9611461    1.050758
         sex |   1.124346    .027146     4.85   0.003     1.059846    1.192771
      orient |    .911727   .0885023    -0.95   0.378     .7189676    1.156166
   setstdvpc |   1.321357   .0861916     4.27   0.005     1.126424    1.550024
coresrhcar~1 |   .9860874   .0675292    -0.20   0.845     .8339518    1.165977
      confid |   .8414634   .0865075    -1.68   0.144     .6543126    1.082144
     checkin |   .7279026   .1518476    -1.52   0.179     .4369057    1.212715
   clinicexp |   1.423182   .1143372     4.39   0.005     1.169191    1.732348
  providertx |   2.140839   .2321125     7.02   0.000     1.641976    2.791266
   learnpreg |   1.116886   .0333451     3.70   0.010     1.038202    1.201533
    learnstd |   1.093678   .0962715     1.02   0.348     .8817527     1.35654
------------------------------------------------------------------------------

↧

Line plot with event markers

December 8, 2016, 10:22 am

I have daily stock price data for many different companies.
For each company, there is at least one "event" date that I am interested in.
I would like to create a line plot for a given company (group_id), that has a marker only on the event date(s).

So far the only solutions I found were:

I defined a binary event variable (equal to 1 for an event, and 0 otherwise) and used it as a weight, which leads to a huge circle around the event datapoint. It is almost impossible to identify the exact point it is circling.
Code:
```
scatter price date [w=event] if group_id == 1, msymbol(Oh) connect(l)
```
I defined the event variable to equal "event" for an event, and "" otherwise, and used it as a label, which is also hard to pinpoint when looking at over a year of data.
Code:
```
scatter price date if group_id == 1, mlabel(event) msymbol(none) connect(l)
```

Optimally, I would like a small dot/circle on event datapoints.
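
One way this could be done is to overlay two plots: a line for all dates and a scatter restricted to the event dates. A minimal sketch, assuming the binary event variable described above (1 on event dates, 0 otherwise):

```
twoway (line price date if group_id == 1, sort) ///
       (scatter price date if group_id == 1 & event == 1, msymbol(O) msize(small)), ///
       legend(order(1 "Price" 2 "Event date"))
```

The marker symbol and size are arbitrary choices here; msymbol(Oh) would give a hollow circle instead.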

Please let me know if this is feasible.

Thank you so much!

↧

Confidence interval won't compute in cross-classified model

December 8, 2016, 10:26 am

I'm running a cross-classified growth model. The goal is to measure reading growth from kindergarten to third grade, and determine which of three reading programs is linked to the highest growth. Time is measured by grade level, and students are cross-classified across campuses, as there is some degree of annual student mobility.

I've run a very basic model, which I pasted below (not even key independent variables included yet). I'm stumped by the random effects parameters, specifically that the confidence interval didn't compute when assessing between-student variation in reading growth across grade levels.

Has anyone come across a similar error? If so, what was the cause of the error? I've started re-checking my code and variable compilation in the meantime.

Thank you,
Sandra

. mixed reading gradelvl ///
> || _all: R.campus_n ///
> || id: gradelvl, stddev mle

Performing EM optimization:

Performing gradient-based optimization:

Iteration 0: log likelihood = -86014.482
Iteration 1: log likelihood = -85897.491
Iteration 2: log likelihood = -85891.93
Iteration 3: log likelihood = -85891.917
Iteration 4: log likelihood = -85891.917

Computing standard errors:

Mixed-effects ML regression Number of obs = 17,075

-------------------------------------------------------------
                |     No. of       Observations per Group
 Group Variable |     Groups    Minimum    Average    Maximum
----------------+--------------------------------------------
           _all |          1     17,075   17,075.0     17,075
             id |      4,733          2        3.6          4
-------------------------------------------------------------

Wald chi2(1) = 21749.74
Log likelihood = -85891.917 Prob > chi2 = 0.0000

------------------------------------------------------------------------------
     reading |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    gradelvl |   33.50561   .2271907   147.48   0.000     33.06032    33.95089
       _cons |   534.3791   1.147548   465.67   0.000       532.13    536.6283
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
_all: Identity               |
              sd(R.campus_n) |   11.42497   .8793064      9.825248    13.28515
-----------------------------+------------------------------------------------
id: Independent              |
                sd(gradelvl) |   4.95e-10   1.86e-07             0           .
                   sd(_cons) |   26.40677   .3940714      25.64559    27.19054
-----------------------------+------------------------------------------------
                sd(Residual) |   30.74959   .1972111      30.36548    31.13856
------------------------------------------------------------------------------
LR test vs. linear model: chi2(3) = 4062.35 Prob > chi2 = 0.0000

↧

Combining the data from two variables (in the same data set).

December 8, 2016, 10:49 am

Hi everyone! How are you guys doing?

I'm fairly new to Stata and today I come here with a very simple question.

How can I combine the data from two variables (in the same data set) into one variable? I want to combine the variables in such a way that the data from one variable replace the missing values of the other.

Here is an example of what I am looking for:

var1	var2	newvar
.	3	3
.	7	7
4	.	4
.	.	.

Any thoughts on how to do this? I looked for an answer exhaustively but couldn't find a solution.
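
For the example above, a minimal sketch assuming var1 and var2 are both numeric:

```
* Take var1 where it is nonmissing, otherwise fall back to var2.
generate newvar = cond(missing(var1), var2, var1)
* An equivalent alternative: egen newvar = rowfirst(var1 var2)
```

When both variables are missing, newvar is missing as well, matching the last row of the example.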

Thank you in advance.

↧

'Balancing property' issue with PSCORE

December 8, 2016, 11:55 am

I am trying to do propensity score matching between my sample and population based on some characteristics. Here is my Stata syntax and I get the following error:

pscore smpl_or_pop rural_urban_codes ownership teach_status mhsmemb bedcode, pscore(myscore)

Error : "Variable rural_urban_codes is not balanced in block 6

The balancing property is not satisfied

Try a different specification of the propensity score"

Could the experts please suggest what I can do to avoid this error? Thank you!

↧

Using a formatted date string in a macro

December 8, 2016, 12:05 pm

I have a dataset that is tsset, and I would like to use the formatted (string) value of the date variable, from one particular observation, in generating a totally new variable name. For example, if the 6th observation is 1987q3, I would like a new variable named var1987q3. Here is what I have tried

Code:

webuse gdpoil

gen var`=tq(qdate[6])' =5     // macro is blank
gen var`tq(qdate[6])' =5      // macro is blank

levelsof qdate if _n==6, local(a)
gen var`a' = 5     // kind of works but macro resolves to the numeric date value, not the formatted date string


gen mystr = string(qdate, "%tq")
levelsof mystr if _n==6, local(a)
gen var`a' = 5     // error message "1987q3 invalid name"
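
A sketch of one approach that may do what is wanted: evaluate string() on the single observation and store the result in a local macro. The macro name suffix is hypothetical. The last attempt above probably fails because levelsof wraps string values in quotes, which then end up inside the variable name.

```
webuse gdpoil, clear
* string() with a date format returns the formatted text, e.g. 1987q3
local suffix = string(qdate[6], "%tq")
generate var`suffix' = 5
```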

Thank you for your help.

↧

Direct standardization: Conf. Interval of Stratum

December 8, 2016, 6:55 pm

Hello dear all,

I have a question.
age_cat has three categories, so there are three strata.
However, it seems that the confidence interval is computed over all age categories together.
Is there any way to get a separate confidence interval for each stratum?

use http://www.stata-press.com/data/r14/1962, clear
save "C:\Stata14\working\1962.dta" // my working folder "C:\Stata14\working"
use http://www.stata-press.com/data/r13/mortality, clear
dstdize deaths pop age_cat, by(nation) using(1962)

↧

Multiple Imputation with Complex Survey Data that employs Balanced Repeated Replicate (BRR) Weights

December 8, 2016, 8:26 pm

I am currently working with the High School Longitudinal Study (HSLS) from the National Center for Education Statistics (NCES). Since I am working with the public-use data at this time, I am using the provided BRR (balanced repeated replicate) weights with the survey (svy) commands in Stata. This helps me correctly adjust the standard errors to account for selection and clustering, so that I can more accurately perform regression analyses after propensity score matching and weighting. However, this complex survey has some missing values that I would like to account for, and the BRR option of the svy commands does not allow me to use multiple imputation (MI) estimates. Has anyone come across this problem before? If so, have you found any other mechanisms for imputing missing data in complex surveys that employ BRR weights?

↧

Panel SUR Analysis

December 9, 2016, 12:02 am

I am interested in analysing panel data using a SUR model. Is it possible to include one or two dummies in such an analysis?
Please help me with this problem.

↧

graph for each group

December 9, 2016, 1:40 am

Hi,

I am using the code below to make plots per group:

twoway scatter rev quality, by( MARQUE )

This gives me the graph I need, but with all brand groups in one frame (picture below). How can I get separate outputs?

[Image not preserved: combined by-group scatter plot]
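
One way to get a separate graph per brand is to loop over the levels of the grouping variable. This is only a sketch, assuming MARQUE is numeric (possibly with value labels); the graph names g1, g2, ... are arbitrary.

```
* Assumes MARQUE is numeric; for a string variable the if condition
* would need quotes around the value.
levelsof MARQUE, local(brands)
local i = 0
foreach b of local brands {
    local ++i
    twoway scatter rev quality if MARQUE == `b', ///
        title("MARQUE = `b'") name(g`i', replace)
}
```

Each graph stays in memory under its own name and could then be exported with graph export.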

↧

Methods Difference in Difference

December 9, 2016, 2:22 am

Dear all,

For my research I need to analyse a dataset with the characteristics below, and I am struggling to find the right method to use.

I have a dataset of US firms' M&A activity, which I will test on an output characteristic (no stock price, stock value). I want to know whether firms that made an acquisition (the treatment) and acquired cross-border (another treatment, or a different sample group?) had more or less of the desired output characteristic than firms that made an acquisition but acquired domestically.

I hope that someone knows how to proceed. It seems to me that this should be done with a DiD OLS regression, but I am struggling with the inputs for it.

Kind regards,

↧

Starting event study with government bonds yields

December 9, 2016, 2:41 am

Hello everyone,

I'm new to the forum and to Stata, and I'm really grateful for your time and help. I've wandered the forum in search of a hint, but it seems no one has had the same kind of starting problem I have. For my Master's thesis, I would like to conduct an event study on the effects of 5 ECB announcements on the yields of European government bonds. I have 17 countries and their bond yields for 2-year and 10-year maturities, and my dataset runs from 2014 to August 2016. For each country, I'd like to calculate the 1-day variation over the 30 days prior to the announcement, and then calculate the standard deviation of those 30 yield variations in order to run the t-test afterwards.
My questions are: how do I calculate the variation, and how do I isolate the 30 yield variations prior to the announcement? I've tried to adapt the Princeton procedure, but I can't figure out how to do it with my data.
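
The two steps described above might be sketched as follows for a single announcement. All variable names are hypothetical: country (numeric id), date (daily date), yield10 (10-year yield) and ann_date (the announcement date, constant within country). The sketch also assumes one row per trading day and that the announcement date itself appears in the data.

```
* 1-day variation: difference from the previous trading day
bysort country (date): generate double dy = yield10 - yield10[_n-1]
* Row counter per country, and the row on which the announcement falls
by country: generate long row = _n
by country: egen long ann_row = max(cond(date == ann_date, row, .))
* Flag the 30 trading days before the announcement
generate byte pre30 = inrange(row - ann_row, -30, -1)
* Standard deviation of the 30 pre-announcement variations
by country: egen double sd_dy = sd(cond(pre30, dy, .))
```

With five announcements the same steps would have to be repeated per event, for example inside a loop over the announcement dates.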

Thanks a lot for your answers.

Best regards

↧

Setting up a survival analysis data set

December 9, 2016, 3:43 am

I am trying to redo my discrete-time hazard analysis, accounting for repeated events and left truncation (delayed entry). But I just cannot quite get it right, either by hand or with commands like -expand- or -stset-. Can you help me? I use Stata 14.

I have panel data in long-format, person-period-format. People are dropping in and out of employment and I am interested in these transitions, right now mostly out of employment into re-employment. Regarding the left-truncation: For some individuals (not for all) I have a variable telling me since when they are unemployed.

Ultimately, I want to do a multilevel discrete-time hazard analysis to account for repeated events (some individuals become re-employed up to four times in my data).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long pid int syear byte employed int q111 float(event entry)
4801 1984 0 .a 0 0
4901 1984 0 .a 0 0
4901 1985 0  . 0 0
4901 1986 0  . 0 0
4901 1987 0  . 0 0
4901 1988 0  . 0 0
4901 1989 0  . 0 0
4901 1990 0  . 0 0
4901 1991 0  . 0 0
4901 1992 0  . 0 0
4901 1993 0  . 0 0
4901 1994 0  . 0 0
4901 1995 0  . 0 0
4901 1996 0  . 0 0
4901 1997 0  . 0 0
4901 1998 0  . 0 0
4901 1999 0  . 0 0
4901 2000 0  . 0 0
4901 2001 0  . 0 0
4901 2002 0  . 0 0
4901 2003 0  . 0 0
4901 2004 0  . 0 0
4901 2005 0  . 0 0
4901 2006 0  . 0 0
4901 2007 0  . 0 0
4901 2008 0  . 0 0
4901 2009 0  . 0 0
4901 2010 0  . 0 0
4901 2011 0  . 0 0
4901 2012 0  . 0 0
4901 2013 0  . 0 0
5001 1984 1 .a 0 0
5101 1984 0 . 0 0
5101 1985 0  1981 0 0
5101 1986 0  . 0 0
5101 1987 0  . 0 0
5101 1988 0  . 0 0
5101 1989 0  . 0 0
5101 1990 0  . 0 0
5101 1991 0  . 0 0
5101 1992 0  . 0 0
5101 1993 0  . 0 0
5101 1994 0  . 0 0
5201 1984 0 . 0 0
5201 1985 0  . 0 0
5201 1986 0  1984 0 1
5201 1987 1  . 1 0
5201 1988 0  . 0 1
5201 1989 1  . 1 0
5201 1990 1  . 0 0
5201 1991 1  . 0 0
5201 1992 0  . 0 1
5201 1993 1  . 1 0
5201 1994 1  . 0 0
5201 1995 1  . 0 0
5201 1996 1  . 0 0
5201 1997 1  . 0 0
5201 1998 1  . 0 0
5201 1999 0  . 0 1
5201 2000 0  . 0 0
5201 2001 0  . 0 0
5201 2002 0  . 0 0
5201 2003 0  . 0 0
5201 2004 0  1999 0 0
5201 2005 0  . 0 0
5201 2006 0  . 0 0
5201 2007 0  . 0 0
5201 2008 0  . 0 0
5201 2009 0  . 0 0
5201 2010 0  . 0 0
end
label values employed employed
label def employed 0 "unemployed", modify
label def employed 1 "employed", modify

At first, I tried to -expand- the data, calculating how many person-years of unemployment are missing prior to the first observation and then expanding by that number. I was hoping to thereby create a new duration variable that includes the left-truncated person-years, as was often suggested by Phil Enders.

Code:

bysort pid: gen newvar = syear[1]-q111 if employed[1] == 0
expand newvar
bysort pid: gen time = _n

This is one case showing why this does not work:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long pid int syear byte employed int q111 float(event entry time newvar)
3097001 2000 0    . 0 0 16  .
3097001 2001 0 1989 0 0 12 11
3097001 2001 0 1989 0 0  2 11
3097001 2001 0 1989 0 0  5 11
3097001 2001 0 1989 0 0 20 11
3097001 2001 0 1989 0 0  4 11
3097001 2001 0 1989 0 0 17 11
3097001 2001 0 1989 0 0 18 11
3097001 2001 0 1989 0 0 14 11
3097001 2001 0 1989 0 0 19 11
3097001 2001 0 1989 0 0  9 11
3097001 2001 0 1989 0 0  1 11
3097001 2002 0    . 0 0 11  .
3097001 2003 0    . 0 0 15  .
3097001 2004 0    . 0 0 13  .
3097001 2005 0    . 0 0  8  .
3097001 2006 0    . 0 0  7  .
3097001 2007 0    . 0 0 10  .
3097001 2008 0    . 0 0  6  .
3097001 2009 0    . 0 0  3  .
end
label values employed employed
label def employed 0 "unemployed", modify

Problem 1) Often (see case 5101 above) the report on when the last unemployment spell began appears only in the second observation period. -expand- then expands the second observation period and not the first. I first wanted to put the reported beginning of the spell into the first person-year of every person, however:

Sometimes (as in case 5201), this report is nonsense, because I already know when the spell started (it started within the observation period).

Problem 2) I have repeated events and do not know how to start a new duration variable after every event and new entry (so that after somebody becomes employed and then unemployed again, the duration starts at 0 again). Is there any command for this? I looked into -stset-, hoping it would do something like this, but couldn't find anything.
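
For Problem 2, a spell counter and a within-spell duration can be built with by-groups. A minimal sketch using the variables from the example data; spell and dur are hypothetical names, and gaps in syear or pre-sample start dates from q111 are not handled here.

```
* A new spell starts whenever employment status changes within a person.
* On each person's first row employed[_n-1] is missing, so spell starts at 1.
bysort pid (syear): generate spell = sum(employed != employed[_n-1])
* Years elapsed within the current spell, restarting at 1 after each change.
bysort pid spell (syear): generate dur = _n
```

Subtracting 1 from dur would give a duration that restarts at 0 instead.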

Problem 3) I want to delete everybody, who is left-truncated but for whom I cannot tell when this spell started. If I get the -expand- command right, I would do this:

Code:

drop if employed[1]==0 & time == 1

but that would probably only drop the very first observation, not the whole unemployment spell until the next entry into employment and then re-employment.

↧

Data management: network setup

December 9, 2016, 3:43 am

Dear all,
I am preparing data for a network meta-analysis. My data are below, but
I get an error message and don't know how to solve this problem. Please help me.
Thanks very much.

Code:

    study     d      n   trt  
        1     4     87     1  
        1    11     55     3  
        2     6     92     1  
        2     4     42     2  
        3    15    141     3  
        3    12    180     1  
        3     6     44     2  
        3    19    226     1  
        4    16    111     2  
        4    17    468     3  
        4    33   1260     1  
        4    33   1260     1  
        5   246   2871     1  
        5    11    172     2  
        5   113   1000     3  
        5    78   2446     1  
        6    54   1808     1  
        6    11    147     2  
        7     7     80     2  
        7     3     68     1  
        7     1     53     1  
        7     6     65     3  
        8     2     15     3  
        8     3     15     1  
        9    13    209     1  
        9    12    106     3  

. network setup d n , studyvar(study) trtvar(trt)
variables study trt do not uniquely identify the observations
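
The message suggests that some study and treatment combinations occur more than once. In the data listed above, for instance, study 3 has two rows with trt 1, and study 4 has two identical rows. A quick way to find all such rows:

```
duplicates report study trt
duplicates tag study trt, generate(dup)
list study trt d n if dup > 0, sepby(study)
```

Here dup is a hypothetical variable name; it counts how many other rows share the same study and trt.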

sugi

↧

Mult Valued Inverse Probability Weighted Regression Adjustment

December 9, 2016, 4:05 am

I got an overlap violation error message when I ran the regression. I then used osample(newvar) to identify the offending observations, but I cannot find any information on where to go from there. Can anybody help, please?

↧

is there a command that returns maximum and minimum length of a string variable?

December 9, 2016, 5:34 am

I'm looking for a Stata command that stores the maximum and minimum length of a string variable - as a shortcut to the long way, which would go by:
gen vnew = strlen(v1)
egen smax = max(vnew)
etc.

I want to write syntax that looks like

local smax = r(maxlength)
local smin = r(minlength)

I need these locals as an indicator of correct or incorrect string variables, as part of a testing routine for a host of files.
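
Short of a dedicated command, the long way can be kept compact with a temporary variable and summarize. A minimal sketch, assuming the string variable is called v1 as above:

```
* No r(maxlength) result is assumed to exist; lens is a temporary variable.
tempvar lens
generate long `lens' = strlen(v1)
summarize `lens', meanonly
local smax = r(max)
local smin = r(min)
display "min length = `smin', max length = `smax'"
```

The temporary variable is dropped automatically when the do-file or program ends.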

My search for such a command turned up nothing. Does someone out there know better?

↧