Multilevel survival analysis

July 14, 2019, 12:52 pm

Hello everyone,

I'm trying to run a multilevel survival analysis and have to specify the distribution model. Online, I found that Weibull can be used. Are there any other models? How can I test which one fits better?
Additionally, when I check for random slopes, Stata sometimes gives the error message 'Discontinuous region, cannot calculate an improvement'. My outcome is binary. Is there anything I can do to prevent this?

Thank you,
Elvire Landstra
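
One way to compare distributions, as a minimal sketch: it assumes Stata 14 or later and the built-in mestreg command (the post does not say which command is in use). The names time, event, x1 and group are hypothetical. Each candidate distribution is fitted with a random intercept and the fits are compared by AIC and BIC rather than by a formal test.

```stata
* Hypothetical names: time (analysis time), event (1 = failure),
* x1 (a covariate), group (the level-2 identifier).
stset time, failure(event)

* Fit the same random-intercept model under each distribution mestreg offers
foreach d in exponential weibull lognormal loglogistic gamma {
    quietly mestreg x1 || group:, distribution(`d')
    estimates store m_`d'
}

* Lower AIC/BIC suggests a better trade-off between fit and complexity
estimates stats _all
```

Some of these fits may fail to converge on real data, in which case the loop would need a capture prefix.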

↧

Sort frequency table with asdoc

July 14, 2019, 1:14 pm

Hi,

I wonder whether it is possible to sort a frequency table with asdoc. For example:

Code:

sysuse auto.dta, clear

tab rep78, sort


     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |         30       43.48       43.48
          4 |         18       26.09       69.57
          5 |         11       15.94       85.51
          2 |          8       11.59       97.10
          1 |          2        2.90      100.00
------------+-----------------------------------
      Total |         69      100.00




asdoc tab rep78

asdoc tab rep78, sort

option sort not allowed

↧

Fixed Effects

July 14, 2019, 1:20 pm

Dear Statalist community,

I have a panel dataset at the individual level. The individuals live in a city x at time t. I have variables including wages, ages, and other individual characteristics.
I would like to estimate a model at the individual level with the wage of individual 'i' at time 't' in city 'x' as the dependent variable. I want to add time-city (t*c) fixed effects, save them from this regression, and use them in a second-stage regression.

Attached is the sample data
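
A minimal sketch of one possible first stage, assuming hypothetical variable names wage, age, city and year (the attached data are not shown here). It absorbs the city-by-year cells with areg and recovers the estimated cell effects with predict.

```stata
* Hypothetical names: wage, age, city, year
* One identifier per city-year cell
egen long citytime = group(city year)

* First stage: individual-level wage equation with city-year fixed effects
areg wage age, absorb(citytime)

* d = the estimated coefficient on each absorbed city-year cell
predict double fe_citytime, d
```

For the second stage the data would typically be reduced to one row per city-year cell before regressing fe_citytime on city-level variables.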

↧

Windowing behavior in v16.0

July 14, 2019, 1:46 pm

When attempting to close all tabs with multiple tabs open in the Do-File Editor, v16.0 (running on a Mac) appears to require the closing of each tab one by one. In v15.1, the user instead received a popup warning:

Code:

Are you sure you want to close this window?

23 documents are open in this window. Do you want to close the window anyway?

Clicking "Close Window" in v15.1 closed all tabs at once.

For someone (like me) who most typically uses the Do-File editor as a "scratchpad" in developing programs, this v16.0 behavior is unwelcome.

I've tried different combinations of the Tabs options, but to no avail. It's also possible that I'm missing an option that would override this behavior, and if so would appreciate being pointed in the right direction.

If there's no way around this I'll post the issue on the v17 (or v16.1) Wishlist. Thanks in advance.

↧

Omitted estimate but obtain test results - probit with perfect separation

July 14, 2019, 4:09 pm

Hi! This is my first post (I'm a new Stata user), so please be patient =)
I'm running a probit model with random effects where one of the variables I'm interested in is omitted because of perfect separation. However, I'm also running some tests that involve this variable, and I obtain an output. What is the reason for this? Why can I get a test result but not obtain an estimate?
This is the code I'm running:

Code:

    xtset SubjectID Period 
    xtprobit DecisionsPG ib2.Treatment#ib0.recip
    test (2.Treatment#1.recip = 12.Treatment#1.recip) (2.Treatment#0.recip = 12.Treatment#0.recip), mtest(holm)

Thanks in advance for your help!

↧

Failure Curve

July 14, 2019, 6:56 pm

Hi,

I am trying to generate a Kaplan-Meier graph from my data set, looking at recurrence after surgery (with or without steroid administration). I just want to generate two KM curves showing the failure curves of those who were given steroids and those who were not.

My variables are the following:
steroids (steroids administered) = 1 or 0
txfail (treatment failure)=1 or 0
duration = months to failure
fu = total follow up time

I get multiple errors when I use the commands stset duration, id(steroid) failure(txfail) or stset fu, id(steroid) failure(duration), and the graph is correspondingly incorrect. Can anyone advise?

stset fu, id(steroid) failure(duration)

id: steroid
failure event: duration != 0 & duration < .
obs. time interval: (fu[_n-1], fu]
exit on or before: failure

------------------------------------------------------------------------------
93 total observations
32 event time missing (fu>=.) PROBABLE ERROR
19 multiple records at same instant PROBABLE ERROR
(fu[_n-1]==fu)
38 observations begin on or after (first) failure
------------------------------------------------------------------------------
4 observations remaining, representing
2 subjects
2 failures in single-failure-per-subject data
21 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 11
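
A minimal sketch of a setup that may be closer to what is wanted. It assumes each row is one patient, that duration is the time to failure for patients with txfail == 1, and that fu is the censoring time for everyone else. The variable time is introduced here for illustration. The id() option identifies subjects, not treatment groups, so it is dropped, and failure() takes the event indicator rather than a time variable.

```stata
* Assumption: one row per patient.
* Analysis time = months to failure if failed, else total follow-up.
generate double time = cond(txfail == 1, duration, fu)

* Declare survival data: time variable first, event indicator in failure()
stset time, failure(txfail)

* Kaplan-Meier failure curves (1 - survival), one per steroid group
sts graph, by(steroids) failure
```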

↧

Hausman test with omitted variables - still valid?

July 14, 2019, 7:40 pm

Hi Statalist

In the help files for -hausman- we see an example (eg1 on p.894, Stata 16) in which all the regressors vary with time.

My question is whether the Hausman test is still valid when there are time-invariant regressors that the FE model subsequently removes.

In the simulated data below, the unobserved time-invariant component is correlated with regressor x1. As such, it would suggest that we should apply an FE model instead of an RE model.

However, x1 is time-invariant and the resultant Hausman test supports the use of RE. Does this mean that in order for the Hausman test to be valid (ignoring homoskedasticity for the time being), both sets of RE and FE equations must contain the same set of regressors?

Thanks.

Code:

*****************
clear
set seed 111
set obs 1000
*****************
generate id     = _n
generate year   = 2000
generate x1     = runiform()> .5

generate nu     = rnormal()
generate alpha = x1 + nu
*****************
expand 5
bysort id:  replace year = year + _n
*****************
generate x2    = rbeta(2,3)
generate u     = rnormal()
*****************
generate y  = (3) + (1) * x1 + (1) * x2 + alpha + u 
// the unobserved time-invariant component alpha is correlated with regressor x1,
// so FE would be more appropriate

xtset id year

xtsum // to check that x1 is time-invariant whereas x2 varies with time

quietly xtreg y x1 x2, re
estimates store RE

quietly xtreg y x1 x2, fe
estimates store FE

hausman FE RE, sigmamore

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |       FE           RE         Difference          S.E.
-------------+----------------------------------------------------------------
          x2 |    .9144598     .9098057        .0046541         .016269
------------------------------------------------------------------------------
                           b = consistent under Ho and Ha; obtained from xtreg
            B = inconsistent under Ha, efficient under Ho; obtained from xtreg

    Test:  Ho:  difference in coefficients not systematic

                  chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =        0.08
                Prob>chi2 =      0.7748

↧

Counting distinct values across two variables

July 15, 2019, 1:21 am

I have a dataset of tournament chess games. I want to calculate the total number of players in each tournament ('event'), as well as the number of female players. eid is the event id, whitepid is the ID of the player playing with white, and blackpid is the ID of the player playing with black. The complication is that in a tournament, a player may play zero, one or more than one game with each colour. I am not interested in the number of distinct players who played with a given colour in the tournament, but rather the total number. But adding the number of distinct players by colour will double-count. Can anyone help? I have attached some sample data from the first event (eid == 1).

(PS, First post, sorry for etiquette violations.)

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(eid whitepid blackpid) float(whitefemale blackfemale)
1 2253  505 0 0
1 2352 2224 0 0
1 2325 1380 1 0
1 2199 2247 0 0
1 2361 2319 0 0
1  460  536 0 0
1 2304 2198 0 0
1 2342 2240 0 0
1 2328 2198 0 0
1 2276 2199 0 0
1 2285 2324 1 0
1 2240 2216 0 0
1 2374 2253 0 0
1 2323 2328 0 0
1 1271 2276 0 0
1 2346 2374 0 0
1 2331 2209 0 0
1  888 2216 0 0
1 2321  536 0 0
1 2355 2346 0 0
1 2198 2326 0 0
1 2215 2346 0 0
1 2358 2240 0 0
1 2351 2355 0 0
1 2247 2327 0 0
1 2361 2306 0 0
1 2320 2374 0 0
1 2093 1380 0 0
1 2319 2304 0 0
1 2216 2093 0 0
1 2336 2204 0 0
1 2215 2093 0 0
1 2351  460 0 0
1 2336 2331 0 0
1 1271 2358 0 0
1 2355 1271 0 0
1 2321 2204 0 0
1 2338 2299 0 0
1 2240 2224 0 0
1 2360  739 1 0
1 2276 2322 0 0
1 2197 1271 0 0
1 2198  460 0 0
1 2328 2327 0 0
1 2352 2324 0 0
1 2198 2240 0 0
1 2306 2360 0 1
1 2245 2304 0 0
1 2320 2304 0 0
1 2323 2351 0 0
1 2247 2342 0 0
1 2245 2276 0 0
1  508 2330 0 0
1 2304  536 0 0
1 1380  505 0 0
1 2224 2199 0 0
1 2240  449 0 0
1 2245  444 0 0
1 2253 2360 0 1
1 2299 2374 0 0
1 2323  888 0 0
1 2321 2320 0 0
1 2326 2197 0 0
1  739  536 0 0
1 2247 2332 0 1
1 2332 2093 1 0
1 2354 2347 0 0
1 2199 2304 0 0
1 2199 2285 0 1
1 2355 2245 0 0
1 2327 2323 0 0
1 2299 2220 0 0
1 2224 2321 0 0
1 2354  536 0 0
1 2325 2215 1 0
1 2199 2355 0 0
1 2199 2322 0 0
1  739 2328 0 0
1 2220 2352 0 0
1  505  888 0 0
1 1380 2320 0 0
1 2220 2198 0 0
1 2220 2346 0 0
1 2331 2323 0 0
1 2352 1380 0 0
1 2209 2328 0 0
1 2374 2276 0 0
1 2215 2209 0 0
1  460 2247 0 0
1 2216 2285 0 1
1 2328 2336 0 0
1 2093 2326 0 0
1 2209 1380 0 0
1  460 2320 0 0
1  460 2361 0 0
1 2347 2306 0 0
1 2204 2347 0 0
1 2247 2354 0 0
1 2197 2215 0 0
1 2326 2347 0 0
end
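
One possible approach, sketched with the example data above: stack the white and black player IDs into a single column, keep one row per player per event, and then count. It assumes a player's female indicator is the same in every game. The names game, colour, nplayers and nfemale are introduced here for illustration.

```stata
preserve

* Give both colours a common stub so that reshape can stack them
rename (whitepid whitefemale) (pid1 female1)
rename (blackpid blackfemale) (pid2 female2)
generate long game = _n

* Two rows per game: colour 1 = white, colour 2 = black
reshape long pid female, i(game) j(colour)

* Keep each player once per event, whatever colours they played
bysort eid pid: keep if _n == 1

* Count distinct players and distinct female players per event
collapse (count) nplayers = pid (sum) nfemale = female, by(eid)
list

restore
```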

↧

Error message when trying to run MI

July 15, 2019, 2:33 am

Hi All

I don't have too much experience running multiple imputation, but I've been running the following imputation on the same dataset for a while now and I never received this error message before. In fact, I've imputed the same dataset with the same set of variables, and it's worked smoothly. So, no idea why I received the error message, and what the possible reason could be. Any ideas?

Code:

mi set mlong
mi register imputed socialc42 employ42 partner42 home42 voting42 edu42 child42 health42 illness42 smoke42 alcol42 fsocial zrutter5all zmal9all zrutter5xall zmal5all ///
lifesatis42 bmi42 malcont42 wdmental42 sex86
mi impute chained (logit) socialc42 employ42 partner42 home42 voting42 edu42 child42 health42 illness42 smoke42 alcol42 ///
(mlogit) fsocial (regress) zrutter5all zmal9all zrutter5xall zmal5all lifesatis42 bmi42 malcont42 wdmental42 = sex86, add(50) rseed (53421) savetrace(trace1,replace) force

Error message:

Performing chained iterations ...
mi impute: VCE is not positive definite
The posterior distribution from which mi impute drew the imputations for zrutter5xall is not proper when the VCE estimate from the observed data is not positive definite. This may happen, for example, when the number of parameters exceeds the number of observations. Choose an alternate imputation model.
error occurred during imputation of socialc42 employ42 partner42 home42 voting42 edu42 child42 health42 illness42 smoke42
alcol42 fsocial zrutter5all zmal9all zrutter5xall zmal5all lifesatis42 bmi42 malcont42 wdmental42 on m = 1
r(498);

Thanks!
/Amal

↧

Using the foreach command

July 15, 2019, 3:06 am

Hello,

I am using Stata 15 and a wide-format dataset.

My aim is to create a new variable (my data is confidential but as an example I will call it "newv") which I want to have a value of 1 or 0.

The value should be 1 if any of the values in a long list of other alphanumeric variables in the dataset contain a certain alphanumeric code.

As an example of the list of variables I will use "toronto" "paris" "rome" "madrid" here.

varname: toronto paris rome madrid
value: A1 b2 C3 D4

The list of variables in reality is very long and involves a lot of lines of code, so I am trying to use the foreach command.

The code I am using is :

gen newv=0
for each v of varlist toronto paris rome madrid {
replace newv=1 if strpos (v,"A1")>0
}

However, when I do this Stata says "{ required"

I thought I had put the braces in the correct place so I'm not sure where I am going wrong?

Any advice would be much appreciated,
Karyn
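
A minimal corrected sketch, using the example variable names from the post. Two things differ from the code above: foreach is a single word, and inside the loop the loop variable has to be referenced as a local macro, written with a left quote and an apostrophe around the name.

```stata
generate byte newv = 0

* foreach is one word; `v' expands to each variable name in turn
foreach v of varlist toronto paris rome madrid {
    replace newv = 1 if strpos(`v', "A1") > 0
}
```

Note that strpos() is case-sensitive, so "a1" would not match "A1".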

↧

New version of xtdcce2 and xtcd2. New package xtcse2

July 15, 2019, 3:35 am

Thanks to Kit Baum, new versions of xtdcce2 and xtcd2 are available on SSC. In addition, a new program called xtcse2, which estimates the exponent of cross-sectional dependence, is available on SSC as well (see more details below). Updates and further documentation can be accessed on my GitHub page too.

The new versions of xtdcce2 (now version 2.0, formerly called 1.35beta) and xtcd2 include the following updates:

xtdcce2

speed improvements, old method still available via option "blockdiaguse"
improved checks for collinearities and inverting rank deficient matrices
R2 calculation for mean group and pooled regressions following Holly et al. (2010)
Newey-West type standard errors for pooled regressions (Pesaran 2006)
Fixed-T standard errors for pooled regressions following Westerlund et al. (2019)
Estimation of exponent of cross-sectional dependence (see below)
several bugfixes (using jackknife option and if statement, use of binary variables, calculation of R2, if statement on panel ids)

xtcd2

uses predict , e if xtreg is used.
improved detection of unbalanced panels and missing observations

xtcse2
xtcse2 estimates the exponent of cross-sectional dependence in a panel with a large number of observations over time (T) and cross-sectional units (N). The estimation method follows Bailey, Kapetanios, Pesaran (2016, Journal of Applied Econometrics). xtcse2 estimates the strength of the cross-sectional dependence factor for a residual or for one or more variables. It outputs the point estimate, the standard error and the confidence interval. It is intended to support the decision whether to include cross-sectional averages when using xtdcce2, and it accompanies xtcd2 in testing for weak cross-sectional dependence. By default it uses xtcd2 to test for weak cross-sectional dependence.

For further explanation of the command and a full list of updates please see the help file and my GitHub page. The packages can be obtained directly from Stata using either SSC or GitHub (only Stata >14):

Code:

net from https://janditzen.github.io/xtdcce2/

Any comments, bugs or suggestions are of course welcome, either here or by mail.

↧

generate variable: birth order of children (ages 0 to 17) in household

July 15, 2019, 4:05 am

Hello

I am using the South African NIDS (National Income Dynamic Survey) wave 1 to wave 5. I have successfully merged the waves to create a panel dataset.
I am trying to create a variable that identifies the order of birth between siblings within the household.
I have successfully created a variable that sums the total number of children in the household, but my Stata knowledge is insufficient for creating a variable that ranks the children according to age. I do understand that it would require the 'egen: rank' command.
The desired result is a variable that could label the eldest, youngest etc.

I only have variables for hhid, pid, date of birth, and relation to the household head.

Could anyone help in creating such a variable?

Thank you
Sophie Gebers
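
A minimal sketch of one way to do this without egen, assuming one row per person per wave and hypothetical variables wave, age (in years) and dob (a numeric Stata date). Sorting within household by date of birth and numbering the rows gives the order among the children present in that wave.

```stata
* Hypothetical names: wave, age, dob (numeric date); hhid and pid as in the post
generate byte child = inrange(age, 0, 17)

* Within each wave and household, order children from eldest to youngest;
* pid breaks ties (for example, twins) in a reproducible way
bysort wave hhid child (dob pid): generate birthorder = _n if child == 1
```

This ranks only children observed in the household, so siblings who have moved out or are over 17 are not counted.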

↧

Create mean variables of a subgroup for the whole group

July 15, 2019, 4:38 am

Dear all,

I have data on educational attainment per year and state, and I want to compute the percentage of people who are 25 or older and have at least a bachelor's degree in a given state in a given year. So far I have tried the following:

Code:

gen percedu = 0
replace percedu = 1 if educ>=110
egen meanpercedu = wtmean(percedu) if age>=25, weight(asecwt) by (statefip year)

The issue is that for all observations in a given state and given year whose age is less than 25, I have a missing value. What I want is to have the same value instead of missing values for all. Is there a way to compute this?

Thank you in advance.

Best regards,
Alexandros Achilleos
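
One possible fix, as a minimal sketch that builds on the code above: because meanpercedu is constant within a state-year wherever it is non-missing, taking its maximum within the state-year copies that value to every observation, including those under 25. The name meanpercedu_all is introduced here for illustration.

```stata
* max() ignores missing values, so the single state-year value
* computed on age >= 25 is spread to all rows of that state-year
egen meanpercedu_all = max(meanpercedu), by(statefip year)
```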

↧

cox regression with shared frailty or vce cluster

July 15, 2019, 4:54 am

Hello,

In running a Cox regression model (stcox), I would like to look at within-group correlation. I was wondering what the difference is between using vce(cluster x) and shared(x).

The Stata manual says:
"One solution would be to fit a standard Cox model, adjusting the standard errors of the estimated hazard ratios to account for the possible correlation by specifying vce(cluster patient).
We could instead model the correlation by assuming that the correlation is the result of a latent patient-level effect, or frailty. That is, rather than fitting a standard model and specifying vce(cluster patient), we could fit a frailty model by specifying shared(patient)".

According to this either method is acceptable. But, what are the differences? Apart from the theta you get when using shared frailty, are there any other advantages to using shared rather than vce cluster?

Thank you for any assistance/ clarification!
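
For reference, the two specifications can be sketched side by side, assuming the data are already stset and using hypothetical names treat (a covariate) and patient (the grouping variable). The first leaves the coefficient estimates of the standard Cox model unchanged and adjusts only the standard errors. The second adds a gamma-distributed random effect per group, so its hazard ratios are conditional on the frailty.

```stata
* Standard Cox model, standard errors robust to within-patient correlation
stcox treat, vce(cluster patient)

* Shared-frailty Cox model: latent patient-level effect, reports theta
stcox treat, shared(patient)
```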

↧

Generating a new variable as a result of comparing the mean of a variable (t-test) in two different times

July 15, 2019, 4:56 am

Hi,

I have three variables here in my example, as described below.
1) EXP: a time dummy equal to 1 if the year is from 2013 to 2015 and 0 otherwise (2016 to 2018).
2) Abs_DACC: a variable representing a company's earnings management.
3) firmid: the company's identity number.

I want to generate a new dummy variable equal to 1 if the average value of Abs_DACC is significantly higher (for example, by a t-test) in the period Depost == 1 than in the period Depost == 0, and 0 otherwise.

Below is the example from my sample.

I would be grateful if one could guide me in generating my new dummy variable.

Best regards,
Mahmoud

example

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str14 firmid float(Abs_DACC EXP)
"SE0000101362"  .0043683364 1
"SE0000101362"    .04054667 1
"SE0000101362"   .016035259 1
"SE0000101362"  .0021590118 0
"SE0000101362"  .0004436661 0
"SE0000101362"   .025840644 0
"SE0000103699"   .007250534 1
"SE0000103699" .00024572795 1
"SE0000103699"   .011364168 1
"SE0000103699"   .015171934 0
"SE0000103699"   .005706637 0
"SE0000103699"   .015331745 0
"SE0000103814"   .007773208 1
"SE0000103814"   .020582644 1
"SE0000103814"  .0010111552 1
"SE0000103814"    .05279258 0
"SE0000103814"   .022170475 0
"SE0000103814"   .006775594 0
"SE0000105199"   .010294355 1
"SE0000105199"   .020721633 0
"SE0000105199"    .03075206 0
"SE0000105199"     .0771719 0
"SE0000105264"    .05795591 1
"SE0000105264"    .19086185 1
"SE0000107724"     .0830186 1
"SE0000107724"    .01778689 0
"SE0000107724"   .003614869 0
"SE0000107724"   .070517756 0
"SE0000108227"    .03772058 1
"SE0000108227"    .02138011 1
"SE0000108227"   .013683626 1
"SE0000108227"   .005922483 0
"SE0000108227"    .05475985 0
"SE0000108227"    .00685365 0
"SE0000108656"   .032462087 1
"SE0000108656"    .03074585 1
end

------------------ copy up to and including the previous line ------------------

Listed 36 out of 781 observations
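
A minimal sketch of one possible approach, assuming the comparison is made firm by firm and that EXP is the period dummy referred to as Depost above. The name higher and the 5% level are choices made here for illustration. With only a few observations per firm and period, such tests have very little power.

```stata
generate byte higher = 0

levelsof firmid, local(firms)
foreach f of local firms {
    * capture skips firms where the test cannot be computed
    capture ttest Abs_DACC if firmid == "`f'", by(EXP)
    if _rc == 0 {
        * ttest reports diff = mean(EXP == 0) - mean(EXP == 1);
        * r(p_l) is the one-sided p-value for Ha: diff < 0,
        * that is, for a higher mean when EXP == 1
        replace higher = 1 if firmid == "`f'" & r(p_l) < 0.05
    }
}
```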

↧

Coefficient interpretation in log-level regression- Actual impact of independent variable on dependent variable.

July 15, 2019, 5:05 am

Hi all,

I have a regression where the dependent variable is in log form. I was wondering if we can find out the actual impact of the independent variable on the natural form of the dependent variable? Or do we have to rely on percentage change? Here is an example:

The coefficient of the variable I'm interested in is -0.299. First, I take the exponential:

Code:

. di exp(0.299)-1
.34850962

This means one unit increase in the independent variable decreases the dependent variable by 34%, right?

But can we say what the actual numerical change is? If the mean value of the natural form of the dependent variable is 1,000,000, can we say that it leads to a 340,000 decrease?

Thanks!
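
A short worked check, as a sketch: because the coefficient is negative, the sign has to be kept inside the exponential. With the sign kept, the implied proportional change is a decrease of roughly 26%, not 34%.

```stata
* Proportional change in y for a one-unit increase in x: exp(b) - 1
display exp(-0.299) - 1

* Rough change in levels, evaluated at y = 1,000,000
display 1000000 * (exp(-0.299) - 1)
```

The first line gives about -0.258 and the second about -258,000. The level change is only an approximation at that particular value of y, since a log model implies a different absolute change at every level of the dependent variable.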

↧

Compare percentage across different tables

July 15, 2019, 6:03 am

Hey

I want to compare percentages across tables. For example, I want to test whether investor_A invests more in the seed stage of a startup than investor_B. I could do it with dummies for the funding type and a t-test, but is there a better way in Stata?

Code:

. tab fundingtype if investor_A == 1
 
            Funding Type |      Freq.     Percent  
-------------------------+-----------------------------------
                    Seed |      1,230       65.15      
                Series A |        461       24.42      
                Series B |        197       10.43  
-------------------------+-----------------------------------
                   Total |      1,888      100.00
 
 
. tab fundingtype if investor_B == 1
 
            Funding Type |      Freq.     Percent  
-------------------------+-----------------------------------
                    Seed |      1,030       62.43      
                Series A |        300       18.18      
                Series B |        320       19.39  
-------------------------+-----------------------------------
                   Total |      1,650      100.00

Kind regards
Alex
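
One option, sketched with the counts from the two tables above, is the immediate form of the two-sample test of proportions. It assumes the two sets of investments are independent samples, which would not hold if both investors funded the same startups.

```stata
* prtesti #obs1 #successes1 #obs2 #successes2, count
* Seed deals: 1,230 of 1,888 for investor A and 1,030 of 1,650 for investor B
prtesti 1888 1230 1650 1030, count
```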

↧

alternative to interaction term in regression

July 15, 2019, 6:10 am

Hi all,

I am examining the interaction effect between X and Y (the outcome variable is Z) using a fixed-effects model. However, I have reason to suspect that simply adding an interaction term may not capture how the effect of X on Z varies across the levels of Y. So I am wondering whether there are any statistical methods (and Stata packages) that can determine whether there are cutoff points along the range of Y (for instance, at percentile i of Y) such that the effect of X on Z differs significantly below and above the cutoff.

Thank you very much!

↧

Condition based on part of a string

July 15, 2019, 11:02 am

Hi,
I want to execute some commands only on those observations that contain a specific phrase as part of their value in a string variable. What should be the condition?
P.S. the phrase isn't in English (Hebrew letters).
Thanks ahead,
Ben
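
A minimal sketch, assuming Stata 14 or later (which handles Unicode strings) and a hypothetical string variable named strvar. The Hebrew phrase is only a placeholder. ustrpos() returns the position of the phrase within the string, or 0 if it is absent, so the condition is true for observations that contain it.

```stata
* Hypothetical names: strvar (the string variable), has_phrase (new flag)
generate byte has_phrase = ustrpos(strvar, "שלום") > 0

* Use the same condition directly in an if qualifier
list strvar if ustrpos(strvar, "שלום") > 0
```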

↧

Get levels value from differenced forecast

July 15, 2019, 11:28 am

Hi,

I am trying to extract levels data from a model I used to forecast which was specified in first differences due to non-stationarity of the data.
I need values in levels because I want to calculate model comparison statistics such as Theil's U and use the Diebold-Mariano and Harvey-Leybourne-Newbold tests.

I have seen online that it is possible to "inverse difference" data in R and Python, and EViews for that matter. However I couldn't find a similar command in Stata.

Could anyone help me?

Thank you
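
A minimal sketch of undoing the differencing by hand, assuming hypothetical names y (the observed series in levels), dhat (the forecast of the first difference), t (the time variable) and a forecast origin t0. A one-step-ahead level forecast adds the forecast change to the previous observed level. A multi-step forecast accumulates the forecast changes from the last observed level.

```stata
* Hypothetical names: y, dhat, t; t0 is an assumed forecast origin
tsset t

* One-step-ahead: previous actual level plus forecast change
generate double yhat_1step = L.y + dhat

* Multi-step: last observed level plus the running sum of forecast changes
scalar t0 = 100
summarize y if t == t0, meanonly
generate double yhat_dyn = r(mean) + sum(dhat) if t > t0
```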

↧