Dropping observations if missing certain dates

August 12, 2019, 9:44 am

I have incomplete data and would like to drop any observations for which I do not have data in a particular variable during a given time period. My best guess of how to do this would be by generating a tag with bysort and then dropping everything without that tag. For example,

Code:

clear all
ssc install dataex
input long id float year float month float day float var1
1 2017 7 1 125
1 2017 8 1 200
1 2017 9 1 108
1 2018 4 1 20
1 2018 5 1 50
1 2018 7 1 73
1 2018 8 1 18
1 2018 9 1 20
2 2018 7 1 32
2 2018 8 1 29
2 2018 9 1 18
2 2018 4 1 103
2 2018 5 1 24
2 2018 7 1 .
2 2018 8 1 .
2 2018 9 1 .
3 2017 7 1 20
3 2017 8 1 .
3 2017 9 1 .
3 2018 7 1 73
3 2018 8 1 18
3 2018 9 1 20
end

If I want to keep only the ids that have no missing data in var1 for July, August and September (months 7, 8 and 9) of 2018, the final output ought to be

Code:

clear all
ssc install dataex
input long id float year float month float day float var1
1 2017 7 1 125
1 2017 8 1 200
1 2017 9 1 108
1 2018 4 1 20
1 2018 5 1 50
1 2018 7 1 73
1 2018 8 1 18
1 2018 9 1 20
3 2017 7 1 20
3 2017 8 1 .
3 2017 9 1 .
3 2018 7 1 73
3 2018 8 1 18
3 2018 9 1 20
end

Thank you very much for your help.
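
The tag-then-drop idea described in the question could be sketched as follows. This is a minimal sketch on a cut-down copy of the example data, not a definitive answer: the helper variables `bad` and `anybad` are names invented here, and it assumes "missing" means a missing value of var1 in an existing row (it does not check whether a month is absent from the data altogether).

```stata
clear
input long id float(year month day var1)
1 2018 7 1 73
1 2018 8 1 18
1 2018 9 1 20
2 2018 4 1 103
2 2018 7 1 .
2 2018 8 1 .
3 2017 8 1 .
3 2018 7 1 73
3 2018 8 1 18
3 2018 9 1 20
end

* Flag rows inside the target window (July-September 2018) with missing var1
gen byte bad = missing(var1) & year == 2018 & inrange(month, 7, 9)

* Spread the flag to every row of the same id
bysort id: egen byte anybad = max(bad)

* Drop all rows of any flagged id, then remove the helper variables
drop if anybad
drop bad anybad
list, sepby(id)
```

On this toy data id 2 is removed entirely, while id 3 is kept because its missing value falls in 2017, outside the window.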

↧

Probit formula interpretation

August 12, 2019, 10:18 am

Dear community!

I am running an ordered probit regression where my DV ranges from 0-1; IV(1) 0-10, IV(2) 0-10, IV(3) 1-4, IV(5) 1-5 (all levels of this last variable will be included in the regression as i.var5).
I am struggling with how to write this model as a formula on paper. Could you help me with that, please?

P.S. Is it acceptable for an IV to range from 1 instead of from 0?
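
One common textbook way to write an ordered probit is sketched below. The notation is an assumption made here, not something taken from the post: a latent outcome $y_i^{*}$, cutpoints $\kappa_j$ with $\kappa_0 = -\infty$ and $\kappa_J = +\infty$, slopes $\beta_k$ for the continuous IVs, and coefficients $\gamma_m$ on indicator variables for the non-base levels of the factor variable.

```latex
\begin{align}
y_i^{*} &= \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i,
  \qquad \varepsilon_i \sim N(0, 1), \\
\mathbf{x}_i'\boldsymbol{\beta}
  &= \beta_1 \mathrm{IV1}_i + \beta_2 \mathrm{IV2}_i + \beta_3 \mathrm{IV3}_i
   + \sum_{m=2}^{5} \gamma_m \, \mathbf{1}[\mathrm{var5}_i = m], \\
\Pr(y_i = j \mid \mathbf{x}_i)
  &= \Phi\!\left(\kappa_j - \mathbf{x}_i'\boldsymbol{\beta}\right)
   - \Phi\!\left(\kappa_{j-1} - \mathbf{x}_i'\boldsymbol{\beta}\right),
  \qquad j = 1, \dots, J.
\end{align}
```

With only two outcome categories ($J = 2$) this collapses to a binary probit, $\Pr(y_i = 2 \mid \mathbf{x}_i) = \Phi(\mathbf{x}_i'\boldsymbol{\beta} - \kappa_1)$, where $-\kappa_1$ plays the role of the constant.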

↧

Adding suffix to all variables after a given variable

August 12, 2019, 11:26 am

I am running the same code on different datasets. They all contain four initial variables: year, country, object, period, and then a list of city codes. Each file contains a different number of city codes. I would like to add a suffix to each city code, but want to have code that is general enough that it could apply to any of my files. The ideal scenario is one where all variables after the variable 'period' would contain a suffix.

How can I specify that all variables, after a given variable (in this case 'period'), should be renamed to contain the suffix "c_"?

Thanks
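
One way to pick up "everything stored after period" without knowing the city codes in advance is to work from the variable order. The sketch below assumes a toy dataset with two made-up city-code variables (`c101`, `c202`); only `period` and the four leading variable names come from the post.

```stata
* Toy dataset: four leading variables followed by hypothetical city codes
clear
set obs 1
foreach v in year country object period c101 c202 {
    gen `v' = 1
}

* All variable names in dataset order
unab all : _all

* Position of -period- in that list, and the total number of variables
local pos : list posof "period" in all
local n : word count `all'

* Rename every variable stored after -period-
forvalues i = `=`pos' + 1'/`n' {
    local v : word `i' of `all'
    rename `v' `v'c_
}
describe, simple
```

As written, "c_" is appended to the end of each name; if it is meant as a prefix instead, the rename line would become rename `v' c_`v'. The loop does nothing if period is the last variable.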

↧

Error in the manova test

August 12, 2019, 11:28 am

Hi all,

Any time I use the simulate command in Stata 16, it returns "Error in the manova test" but then proceeds to run the simulation. Does anyone know what's driving this error message?

Thank you!

↧

Optimizing space and make panel data

August 12, 2019, 11:48 am

I recently imported WHO expenditure data. The Excel file has strange data positioning (year and variable names are both stored in columns), so I couldn't import it in panel form. I used Excel to filter the unique variable names and then used the following command to generate variables:

gen che_gdp = H if E == "che_gdp"

H contains the values, whereas E contains the variable names. I did this for all variables.
Previously, data was structured like this:

Country	Variable Name	Year	Value
CountryA	var1	2000	32
CountryA	var1	2001	45
CountryA	var2	2000	68
CountryA	var2	2001	38
CountryB	var1	2000	78
CountryB	var1	2001	69
CountryB	var2	2000	23
CountryB	var2	2001	11

Now, after running the command mentioned above and dropping the variable name and value columns, the data is structured like this:

Country	Year	var1	var2
CountryA	2000	32
CountryA	2001	45
CountryA	2000		68
CountryA	2001		38
CountryB	2000	78
CountryB	2001	69
CountryB	2000		23
CountryB	2001		11

Please note that this data covers 17 years, 191 countries and more than 100 variables.

Now if I run xtset country year, I get an error that there are multiple year values. How can I get rid of the repeated country-year rows that hold only missing values? For example, country A now has two missing values for var1 and two for var2. However, I cannot simply drop any of these rows, because each one still holds a value for the other variable.

Thanks
Atif
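
A possible way to squeeze the repeated country-year rows into one is to collapse on the first non-missing value of each variable. This is a sketch on a toy copy of the layout shown above; the variable names `country`, `year`, `var1` and `var2` are assumptions, and it presumes each variable has at most one non-missing value per country-year.

```stata
* Toy version of the split layout (names are hypothetical)
clear
input str8 country int year float(var1 var2)
"CountryA" 2000 32 .
"CountryA" 2001 45 .
"CountryA" 2000 . 68
"CountryA" 2001 . 38
"CountryB" 2000 78 .
"CountryB" 2001 69 .
"CountryB" 2000 . 23
"CountryB" 2001 . 11
end

* Every variable except the two identifiers
ds country year, not

* Keep the first non-missing value of each variable per country-year
collapse (firstnm) `r(varlist)', by(country year)

* xtset needs a numeric panel identifier
encode country, gen(country_id)
xtset country_id year
list, sepby(country)
```

An alternative that avoids creating each variable by hand would be to start from the original long layout and run something like reshape wide value, i(country year) j(varname) string, where `value` and `varname` stand for the value and variable-name columns.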

↧

-putexcel- inserting rows within the 'results' block from within stata

August 12, 2019, 12:02 pm

I want to output to Excel to organize my mixlogit results. The code I have works well to write the mixed logit results to Excel, but I do not know how to tell Stata to insert a blank row in the middle of the 'matlist results' output without overwriting the results themselves. (I need the blank row so that I can take the negative sum to get the third attribute level: this is effects-coded data.) Any ideas on how to do this? My code below works well to export to Excel, but not to insert rows within the results:

** Exporting results **
mixlogit choice, group(resptask) id(respid) rand($varse) nrep(200)
matlist r(table)
matrix results = r(table)
matrix results = results[1..6,1...]'
matlist results
putexcel set RegressionDCEDMD.xlsx, sheet(Mixlogit effects) replace
putexcel A12 = matrix(results), names nformat(number_d2) hcenter
putexcel A1=(e(title)) ///
E1 = "Results last updated" G1 = "`c(current_date)'" ///
A2=(e(cmdline)) ///
A4=("Obs") B4=(e(N)) ///
A5= ("Clusters") B5=(e(N_clust)) ///
A6=("Wald chi2(df)") B6=(e(chi2)) ///
A7= ("DF") B7=(e(df_m)) ///
A8=("Prob>chi2") B8=(e(p)) ///
A9 =("Pseudo R2") B9= (e(r2_p)) ///
A10 = ("Log pseudolikelihood") B10= (e(ll))
vce
return list
matrix list r(V)
putexcel A37 = ("Covariance matrix of coefficients") ///
C36 = matrix(r(V))

Thank you!!
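
Since putexcel writes a matrix starting at whatever cell it is given, one way to leave a gap is to split the matrix and write the two pieces to different start cells. The sketch below uses a small made-up matrix in place of the mixlogit results; the file name `example.xlsx`, the split point `k` and the names `top` and `bottom` are all assumptions for illustration.

```stata
* Stand-in for the results matrix (4 coefficients, 2 statistics)
matrix results = (1, 2 \ 3, 4 \ 5, 6 \ 7, 8)
matrix rownames results = a1 a2 b1 b2
matrix colnames results = b se

* Split after row k so a blank row can sit between the two pieces
local k = 2
local n = rowsof(results)
matrix top    = results[1..`k', 1...]
matrix bottom = results[`=`k' + 1'..`n', 1...]

putexcel set example.xlsx, sheet(demo) replace

* Header row lands in row 12, data rows in 13 to 12 + k
putexcel A12 = matrix(top), names

* Row 13 + k stays empty; the second piece starts one row further down
local r2 = 12 + `k' + 2
putexcel A`r2' = matrix(bottom), rownames
```

With more than one gap the same pattern repeats, shifting each later piece down by one extra row.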

↧

Using weights on categorical variables in bar graphs

August 12, 2019, 12:17 pm

Hi. I apologize in advance if this is a very easy question - I've read the help pages to no avail. I am trying to find a way to do a double bar graph that incorporates weights of sample data. I looked into the weight options but it seems like they are only for yvars and not categorical variables?
For instance, say I have the following set of data:

Obs #	Sex	Weight	Industry
1	Male	4	A
2	Female	1	A
3	Male	3	B
4	Female	2	B

and want to produce a double bar graph like this (image not preserved):

How would I go about 1) incorporating the weights and 2) putting the bars side-by-side even if they need different "if commands" - note: I only want the results for two industries out of 100, so I cannot use something like over(Industry) or sort.

For 1), I have tried graph bar (percent) pw=weight if Industry=="A", over(Sex). But this doesn't seem to take weights into account. I cannot use weights as over_subopts either.
For 2), I am clueless as to which command I should use... Should I generate two separate variables?

Thank you!
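
One possible workaround, sketched here rather than offered as the definitive approach, is to plot the sum of the weights instead of a count: the sum of weights within a cell is the weighted count, and graph bar's percentages option with asyvars turns those sums into shares within each industry. The data below are the four observations from the post; restricting to two industries is done with inlist() in the if condition.

```stata
clear
input byte obs str6 Sex byte weight str1 Industry
1 "Male"   4 "A"
2 "Female" 1 "A"
3 "Male"   3 "B"
4 "Female" 2 "B"
end

* Sum of weights per Sex within Industry, shown as percentages,
* with the Sex bars side by side inside each industry
graph bar (sum) weight if inlist(Industry, "A", "B"), ///
    over(Sex) over(Industry) asyvars percentages     ///
    ytitle("Weighted percent")
```

The bars should correspond to the weighted shares, for example 4/(4+1) = 80% male in industry A.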

↧

xtabond2 with T>N

August 12, 2019, 12:48 pm

I want to see how various healthcare variables impact population health. I have panel data on 10 jurisdictions over 38 years (N=10, T=38). My conceptual model is a dynamic panel data model, as health one year leads directly to health the next.

My key explanatory variables (healthcare capital stock, number of healthcare workers, and drug expenditures) are likely endogenous.

A colleague recommended using system GMM as a way to both represent the dynamic panel model and deal with endogeneity.

I ran the following command:

Code:

xtabond2 ln_trt_mort L.ln_trt_mort ln_total_capital_r ln_prov_drugs_r ///
ln_health_workers ln_population i.year, ///
gmm(L.ln_trt_mort ln_total_capital_r ln_prov_drugs_r ln_health_workers, ///
collapse) iv(i.year ln_population) twostep robust small nodiffsargan

I got the following results:

Code:

Dynamic panel-data estimation, two-step system GMM
------------------------------------------------------------------------------
Group variable: prov_num                        Number of obs      =       300
Time variable : year                            Number of groups   =        10
Number of instruments = 182                     Obs per group: min =        30
F(49, 9)      =  26726.95                                      avg =     30.00
Prob > F      =     0.000                                      max =        30
------------------------------------------------------------------------------------
                   |              Corrected
       ln_trt_mort |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
       ln_trt_mort |
               L1. |          0  (omitted)
                   |
ln_total_capital_r |  -.1949693    .033019    -5.90   0.000    -.2696634   -.1202752
   ln_prov_drugs_r |   .5323356   .0313076    17.00   0.000     .4615128    .6031584
 ln_health_workers |          0  (omitted)
     ln_population |          0  (omitted)
                   |
              year |
             1975  |          0  (empty)
             1976  |          0  (omitted)
             1977  |          0  (omitted)
             1978  |          0  (omitted)
             1979  |          0  (omitted)
             1980  |          0  (omitted)
             1981  |          0  (omitted)
             1982  |          0  (omitted)
             1983  |          0  (omitted)
             1984  |          0  (omitted)
             1985  |          0  (omitted)
             1986  |          0  (omitted)
             1987  |          0  (omitted)
             1988  |   .3323847   .0520803     6.38   0.000     .2145708    .4501987
             1989  |          0  (omitted)
             1990  |          0  (omitted)
             1991  |          0  (omitted)
             1992  |   .1061141   .0508995     2.08   0.067    -.0090286    .2212569
             1993  |          0  (omitted)
             1994  |          0  (omitted)
             1995  |   .0717928   .0410346     1.75   0.114    -.0210338    .1646195
             1996  |          0  (omitted)
             1997  |  -.0076909   .0741135    -0.10   0.920    -.1753472    .1599654
             1998  |          0  (omitted)
             1999  |   .0157125    .028684     0.55   0.597    -.0491752    .0806002
             2000  |  -.0480081   .0404153    -1.19   0.265    -.1394338    .0434176
             2001  |          0  (omitted)
             2002  |          0  (omitted)
             2003  |          0  (omitted)
             2004  |          0  (omitted)
             2005  |          0  (omitted)
             2006  |          0  (omitted)
             2007  |   -.125124   .0278338    -4.50   0.001    -.1880883   -.0621596
             2008  |          0  (omitted)
             2009  |          0  (omitted)
             2010  |          0  (omitted)
             2011  |          0  (omitted)
             2012  |  -.0886373   .0164099    -5.40   0.000    -.1257592   -.0515154
             2013  |          0  (omitted)
             2014  |          0  (omitted)
             2015  |          0  (omitted)
             2016  |          0  (omitted)
             2017  |          0  (omitted)
             2018  |          0  (omitted)
                   |
             _cons |          0  (omitted)
------------------------------------------------------------------------------------
Instruments for first differences equation
  Standard
    D.(1975b.year 1976.year 1977.year 1978.year 1979.year 1980.year 1981.year
    1982.year 1983.year 1984.year 1985.year 1986.year 1987.year 1988.year
    1989.year 1990.year 1991.year 1992.year 1993.year 1994.year 1995.year
    1996.year 1997.year 1998.year 1999.year 2000.year 2001.year 2002.year
    2003.year 2004.year 2005.year 2006.year 2007.year 2008.year 2009.year
    2010.year 2011.year 2012.year 2013.year 2014.year 2015.year 2016.year
    2017.year 2018.year ln_population)
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    L(1/43).(L.ln_trt_mort ln_total_capital_r ln_prov_drugs_r
    ln_health_workers) collapsed
Instruments for levels equation
  Standard
    1975b.year 1976.year 1977.year 1978.year 1979.year 1980.year 1981.year
    1982.year 1983.year 1984.year 1985.year 1986.year 1987.year 1988.year
    1989.year 1990.year 1991.year 1992.year 1993.year 1994.year 1995.year
    1996.year 1997.year 1998.year 1999.year 2000.year 2001.year 2002.year
    2003.year 2004.year 2005.year 2006.year 2007.year 2008.year 2009.year
    2010.year 2011.year 2012.year 2013.year 2014.year 2015.year 2016.year
    2017.year 2018.year ln_population
    _cons
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    D.(L.ln_trt_mort ln_total_capital_r ln_prov_drugs_r ln_health_workers)
    collapsed
------------------------------------------------------------------------------
Arellano-Bond test for AR(1) in first differences: z =  -2.60  Pr > z =  0.009
Arellano-Bond test for AR(2) in first differences: z =   0.18  Pr > z =  0.858
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(132)  = 203.88  Prob > chi2 =  0.000
  (Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(132)  =   0.00  Prob > chi2 =  1.000
  (Robust, but weakened by many instruments.)

Obviously, I have way too many instruments, so the Hansen test result is not reliable.

I next tried to impose lag limits, although I wasn't super comfortable with this as it was not directly recommended in the GMM literature I read.

Code:

 xtabond2 ln_trt_mort L.ln_trt_mort ln_total_capital_r ln_prov_drugs_r ///
ln_health_workers ln_population, ///
gmm(L.ln_trt_mort ln_total_capital_r ln_prov_drugs_r ln_health_workers, ///
collapse eq(level) laglimits (0 1)) iv(ln_population, equation(level)) ///
 robust small nodiffsargan twostep

I get the following results:

Code:

Dynamic panel-data estimation, one-step system GMM
------------------------------------------------------------------------------
Group variable: prov_num                        Number of obs      =       300
Time variable : year                            Number of groups   =        10
Number of instruments = 10                      Obs per group: min =        30
F(5, 9)       =   1041.73                                      avg =     30.00
Prob > F      =     0.000                                      max =        30
------------------------------------------------------------------------------------
                   |               Robust
       ln_trt_mort |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
       ln_trt_mort |
               L1. |    .506323   .3194124     1.59   0.147    -.2162381    1.228884
                   |
ln_total_capital_r |  -.0885838   .0328479    -2.70   0.025     -.162891   -.0142767
   ln_prov_drugs_r |  -.1461226   .1173413    -1.25   0.244    -.4115671    .1193218
 ln_health_workers |    .170425   .2320239     0.73   0.481    -.3544496    .6952997
     ln_population |   .5489733      .4645     1.18   0.268    -.5017986    1.599745
             _cons |  -1.348713   3.142279    -0.43   0.678    -8.457042    5.759616
------------------------------------------------------------------------------------
Instruments for levels equation
  Standard
    ln_population
    _cons
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    DL(0/1).(L.ln_trt_mort ln_total_capital_r ln_prov_drugs_r
    ln_health_workers) collapsed
------------------------------------------------------------------------------
Arellano-Bond test for AR(1) in first differences: z =  -1.79  Pr > z =  0.074
Arellano-Bond test for AR(2) in first differences: z =   1.30  Pr > z =  0.194
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(4)    =  23.67  Prob > chi2 =  0.000
  (Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(4)    =   7.00  Prob > chi2 =  0.136
  (Robust, but weakened by many instruments.)

My results now seem okay, but only one of my variables is significant and the number of instruments equals the number of groups, which is not what is recommended. Furthermore, I cannot seem to add any controls or do any extended versions of this model without the Hansen test result breaking down. Should I trust these results?

Are my problems a function of the fact that T>N? I tried running some regressions on a model of just the 9 most recent years so that N>T, but that didn't seem to work either.

↧

Calculating and plotting non-response rate of a variable against percentiles of another variable

August 12, 2019, 12:56 pm

I have two variables: income and scores.

I want to see whether lower scores go with a higher non-response rate for the income variable than higher scores do.

To do so, I want to calculate the non-response rate for the income variable within each quintile of the score variable. Then I would also like to plot the results.

How can I do so?
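
A minimal sketch of the quintile approach, assuming non-response shows up as a missing value of income. The data here are simulated purely for illustration, and `score_q` and `nonresp` are names made up for this example.

```stata
* Simulated data for illustration only
clear
set seed 12345
set obs 500
gen scores = rnormal(50, 10)
gen income = exp(rnormal(10, 1))
replace income = . if runiform() < 0.4 - 0.004 * scores

* Quintiles of the score variable
xtile score_q = scores, nq(5)

* 0/1 indicator; its mean within a group is the non-response rate
gen byte nonresp = missing(income)

tabstat nonresp, by(score_q) statistics(mean n)
graph bar (mean) nonresp, over(score_q) ytitle("Income non-response rate")
```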

↧

Discrepancy between kmatch and summarize

August 12, 2019, 1:54 pm

I'm running Stata/MP 14.2 under Windows 7. I have a question about different results between summarize and kmatch (http://ideas.repec.org/c/boc/bocode/s458346.html). kmatch gives me the following output:

Code:

. kmatch ps anesthesia encounter_age bsa2 qrs2 pulgrade pul_stenosis rightventsize lvefecho2 /*
> */ (lvef2=encounter_age bsa2 qrs2 pulgrade pul_stenosis rightventsize lvefecho2), /*
> */ comsup kernel(biweight) pscmd(logit) att bwidth(cv) /*
> */ generate(_KM_treat _KM_nc _KM_nm _KM_mw) po
(12 observations with PS outside common support)
(computing bandwidth ................ done)

Propensity-score kernel matching                Number of obs     =        117
                                                Kernel            =   biweight
Treatment   : anesthesia = 1
Covariates  : encounter_age bsa2 qrs2 pulgrade pul_stenosis rightventsize lvefecho2
PS model    : logit (pr)
RA equations: lvef2 = encounter_age bsa2 qrs2 pulgrade pul_stenosis rightventsize lvefecho2 _cons

Matching statistics
------------------------------------------------------------------------------------------
           |             Matched             |             Controls            | Bandwidth
           |       Yes         No      Total |      Used     Unused      Total |         
-----------+---------------------------------+---------------------------------+----------
   Treated |        27          5         32 |        78          7         85 |    .07729
------------------------------------------------------------------------------------------

Treatment-effects estimation
------------------------------------------------------------------------------
       lvef2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ATT |  -6.285085   1.763609    -3.56   0.001    -9.778135   -2.792035
          Y1 |   50.03333   1.300894    38.46   0.000     47.45675    52.60992
          Y0 |   56.31842    1.16635    48.29   0.000     54.00832    58.62852
------------------------------------------------------------------------------

I'm trying to reproduce the results with summarize. This works for the treatment group because the weights are 1 (at least that's the way I read the kmatch documentation). However, for the controls I get this with summarize:

Code:

. summarize lvef2 if anesthesia==0 & _KM_nm~=0 & _KM_nm~=. [aw=_KM_mw]

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
       lvef2 |      78          27    56.22177    5.17997         42         69

The number of obs and the sum of the weights are both correct. Can the discrepancy in the mean between kmatch and summarize be attributed to rounding error?

↧

dataset is showing

August 13, 2019, 12:43 am

Hello,

Let me first describe what I am working with and the steps I took.

One dataset has one row per company per year (1992, 1993, 1994, 1995, and so on), with financial information about the companies.

The other dataset is about CEO financials, and there the years repeat (1992, 1992, 1992, 1993, 1993, 1993, 1993, and so on) because companies have different numbers of CEOs per year.

I used a 1:m merge, and it looks like it merged. I then looked at CEO gender, which is the main part of my study, and deleted the rows that did not match.

I checked whether it removed rows of information, but it kept the information from dataset 1 and copied the company financials onto every CEO row for that year. Like this (merged):

Year....CEO FIN....Company FIN

X1......X1-Male............X1

X1......X1-Male............X1

X1......X1-Female............X1

Now it is of course counting every company-year multiple times, so the sample size is misleading: it reports something like 293,123 only because the years are counted more than once.

Is there a way to collapse the rows when using the data in calculations? Secondly, I thought of keeping only one row per company-year: if there is a female CEO that year I'll use the X1-Female row (summarizing if there is more than one), and otherwise the X1-Male row. But I am rather clueless about how to perform such a task.
Does anyone know how to do this, or whether it is even possible? Thanks.
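
The "one row per company-year, preferring a female CEO" rule could be sketched like this. The variable names (`company`, `year`, `gender`, `ceo_pay`, `firm_assets`) and the values are invented for the example.

```stata
* Toy merged data: several CEO rows per company-year (names are hypothetical)
clear
input str4 company int year str6 gender float(ceo_pay firm_assets)
"X1" 1992 "Male"   100 5000
"X1" 1992 "Male"   120 5000
"X1" 1992 "Female" 150 5000
"X1" 1993 "Male"   110 5200
end

* 1 for rows with a female CEO, 0 otherwise
gen byte female = gender == "Female"

* Within each company-year, sort female rows last and keep the last row:
* that is a female row whenever one exists, otherwise a male row
bysort company year (female): keep if _n == _N
list
```

When a company-year has several rows of the kept gender, which of them survives is arbitrary here; averaging them first (for example with collapse) would be one way to "summarize" as the post suggests.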

↧

Literature for a larger data set advantages

August 13, 2019, 1:06 am

Hello

I am doing a panel analysis over eight years at the provincial level, which unfortunately gives me only 192 observations. I think it would have been better to do the analysis at the community level, but no community-level data is available at the moment. I would therefore like to discuss this limitation in my paper.

My question is how I can argue that my research would be better with community-level data, which would also give me a much larger dataset for the eight years. Unfortunately, I can't find any literature on this. Do you know where I can find literature on this, or perhaps the page in the Wooldridge book that explains why an investigation with a larger dataset would be better and what advantages a larger dataset would bring?

Thank you very much for your help.

↧

Analysing duplicates

August 13, 2019, 1:45 am

Hi guys,

I need to analyse duplicates. I've got different newspaper articles. They all have a story_id. These articles mention different EU and US companies. First I need to analyse how many companies are mentioned in one article. For that I used:

Code:

duplicates tag rp_story_id, gen(dup_storyid)

Second, I need to analyse how many US companies (country_code=="US") and non-US companies are mentioned each year.

Example:

company	country_code	story_id	headline	year
VW	DE	NDJHAODUW	Earnings announcement 3. Qu	2003
BMW	DE	NDJHAODUW	Earnings announcement 3. Qu	2003
GM	US	NDJHAODUW	Earnings announcement 3. Qu	2003
VW	DE	SODOEIKDIDI	Earnings announcement 1. Qu	2004
GM	US	SODOEIKDIDI	Earnings announcement 1. Qu	2004

Code:

duplicates tag rp_story_id, gen(dup_storyid)
gen continent=0
replace continent=1 if country_code!="US"
tab dup_storyid continent

Any suggestions how I could continue?
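
One possible continuation is sketched below on the example rows (without the headline). Note that duplicates tag stores the number of surplus copies, so the number of companies per article is that value plus one; counting rows per story directly avoids the offset. The names `n_companies`, `non_us` and `first_mention` are invented here, and the sketch assumes each company appears at most once per article.

```stata
clear
input str3 company str2 country_code str11 story_id int year
"VW"  "DE" "NDJHAODUW"   2003
"BMW" "DE" "NDJHAODUW"   2003
"GM"  "US" "NDJHAODUW"   2003
"VW"  "DE" "SODOEIKDIDI" 2004
"GM"  "US" "SODOEIKDIDI" 2004
end

* Companies mentioned per article
bysort story_id: gen n_companies = _N

* 1 = non-US company, 0 = US company
gen byte non_us = country_code != "US"

* Mentions per year, US versus non-US
tabulate year non_us

* Distinct companies per year: count each company once per year
egen byte first_mention = tag(company year)
tabulate year non_us if first_mention
```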

↧

Confusion between out of sample predict and forecast

August 13, 2019, 3:35 am

I have read the forecast manual, but I can't understand why using forecast produces different results than just predicting on the year variable on in-sample observations. Forecasting "now-casts" using actual values even if I don't check that box, while predict using in-sample observations would plug in the predictor values and output prediction using the estimated equation.

In other words, when we use forecast, in what way is the model applied to past observations? It looks as if it isn't: it "predicts" the actual value whether or not that option is checked. Since I wish to see whether my model predicts the "past" somewhat reasonably, this is of no use to me.

A simple example to replicate:

Code:

clear all
webuse sunspot
tsappend, add(20)

* Predict with regular linear model
reg spot time
estimate store lin
predict spot_hat1

tsline spot spot_hat1, name(predict)

* Using stata's forecast
forecast create forecast1
forecast estimates lin
forecast solve, suffix(_hat2) periods(20) static

tsline spot spot_hat2, name(forecast)

graph combine predict forecast

Note that the "prediction" in forecast is just the actual values and it completely overlays the actual values.

↧

Panel Data model: Diagnostic testing

August 13, 2019, 4:01 am

Hi Stata community. I really need help. I have a long panel with T=17 and N=8 (countries). These are the steps I applied:

Code:

xtreg y x1 x2, fe
est store fe
xtreg y x1 x2, re
est store re
hausman fe re, sigmamore

The test was significant, which meant the FE model is applicable.

However, since the results were not consistent with the literature, I carried out the following diagnostic tests:
xtreg y x1 x2, fe
xttest2            // cross-sectional dependence
xttest3            // heteroscedasticity
xtserial y x1 x2   // serial correlation

The model had all 3 problems.

To address these problems I applied the xtgls command, i.e.,
xtgls y x1 x2, panels(hetero) corr(ar1)

Could someone please tell me if this is correct? And how do I proceed from here?
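
For comparison, panels(hetero) allows for heteroskedasticity across panels but not for cross-sectional correlation. Two official commands that also allow for cross-sectional correlation are sketched below on the Grunfeld example data (10 firms, 20 years), since the poster's data are not available. This is only an illustration of syntax under that substitution, not a recommendation for the poster's model, and neither command includes fixed effects unless panel dummies are added.

```stata
webuse grunfeld, clear
xtset company year

* FGLS with heteroskedastic, cross-sectionally correlated errors and a
* common AR(1); panels(correlated) needs a balanced panel with T >= N
xtgls invest mvalue kstock, panels(correlated) corr(ar1)

* Alternative: Prais-Winsten estimates with panel-corrected standard errors
xtpcse invest mvalue kstock, correlation(ar1)
```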

↧

Loop regexm to search for multiple strings stored as observations

August 13, 2019, 4:05 am

Dear Statalists,

I have successfully used regexm to search for one string (a diagnosis code, for example "I33.0") among many observations in a variable (called "diagnosis").
I created a new variable (endocarditis) that is given the value 1 if there was a match.

Code:
replace endocarditis = regexm(diagnosis, "I33.0")

Now I want to do the exact same thing, but instead of searching for one string, I want to search for many strings ("I33.0", "I33.9", "B37.6").
If there is a match on any of these diagnosis codes, I want the new variable "endocarditis" to be given the value 1.

The strings I want to search for are stored in a variable ("X") in the same dataset.

I would very much appreciate help with looping regexm so that it searches the variable "diagnosis" for each observation in "X".

I want the data to look as follows:

X	diagnosis	endocarditis
I33.0	A367 B12.3 B37.6	1
B37.6	C45.6	0
I33.9	S536 F6349	0

Thankful for any help on this matter
Niko Vähäsarja
Karolinska Institutet
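
A loop over the distinct values of X might look like the sketch below, run on the three example rows. It swaps regexm for strpos, which does a literal substring search; in a regular expression the dot in "I33.0" matches any character, so regexm could also match strings such as "I3300".

```stata
clear
input str5 X str20 diagnosis
"I33.0" "A367 B12.3 B37.6"
"B37.6" "C45.6"
"I33.9" "S536 F6349"
end

* Distinct codes stored in X, collected in a local macro
levelsof X, local(codes)

* Flag a row if its diagnosis string contains any of the codes
gen byte endocarditis = 0
foreach c of local codes {
    replace endocarditis = 1 if strpos(diagnosis, "`c'") > 0
}
list
```

If the list of codes is very long, the local macro could hit length limits, in which case looping over observations of X would be an alternative.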

↧

Speed of Stata

August 13, 2019, 4:23 am

Dear All,

I am using Stata 15.1. I noticed something that I cannot explain. I am using the command threshold to run an estimation on a dataset containing 90,000 observations.

I noticed that if I run the estimation while my laptop is charging, it takes much more time than when it is not.

Is this a common issue? If so, why?

Thanks for your help.

Dario

↧

Strata option in Bootstrap command

August 13, 2019, 4:48 am

Hello everybody,
For my PhD thesis I am developing a DiD model that captures the effects of a particular price policy on online platforms. I would like to compute bootstrapped standard errors allowing for a cluster structure, through vce(bootstrap).
Among the command options I find strata(varlist). Having gone through the available resources, I have not reached a clear conclusion about how this option works and how it differs from cluster(varlist).
Could someone help me?
Thanks
Best regards
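
The two options control different parts of the resampling: strata() draws the bootstrap sample separately within each stratum, while cluster() draws whole clusters rather than single observations. A small sketch on shipped example datasets, chosen here only for illustration and unrelated to the DiD model in the post:

```stata
sysuse auto, clear
* strata(): resample within foreign and domestic cars separately, so each
* replication keeps the original number of cars in each group
bootstrap _b, reps(200) seed(12345) strata(foreign): regress price mpg weight

webuse nlswork, clear
* cluster(): resample whole persons, so all rows of a sampled idcode
* enter the replication together
bootstrap _b, reps(50) seed(12345) cluster(idcode): regress ln_wage tenure age
```

The two options can be combined when clusters are nested within strata.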

↧

Probit or logit?

August 13, 2019, 4:55 am

Dear community,
Is there any difference between using probit and logit regression when I have a binary DV?

↧

reshape errors

August 13, 2019, 5:28 am

I am having issues reshaping the sample data below. I want wgt to be in long format. Any help/advice will be appreciated.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long id int(wgt_1 wgt_2 wgt_3 wgt_4 time) float(trt kill grav)
2049 35  .  .  . 10 0 3 35
2050 25 25 20 20 10 0 1 23
2051 25 25  . 40 10 1 1 30
2055  . 30  .  . 10 0 1  .
2060  . 40  .  . 10 0 3 40
2067  . 25  . 20  7 1 1 25
2071  . 30  . 20 10 1 1 25
2073 35 35  . 20  7 0 3 30
2075  . 50  . 30 10 0 3 40
2076 40 20  .  .  7 0 1 30
2077  .  . 35  .  6 1 1  .
2083  . 25 20 25  6 0 1 25
2083  . 30  . 25  7 0 1 27
2083  . 55  . 35  7 0 .  .
2083  . 40  . 25  7 0 3 32
2083 30 20 30 20 10 0 1 25
2101 20 20 20 20  6 1 0 20
2110  . 55 30  .  7 0 3 42
2110 35 40 30 25  8 0 3 35
2110  . 35  . 30  9 0 3 32
2113  .  .  . 30 10 0 1 30
2118  . 40  . 30 10 0 3 35
2119 40 45  . 25 10 1 3 37
2120 25 25 20 20  7 0 0 23
2120  . 20  . 15 10 0 1 18
2126 55  .  .  .  8 1 3 55
2130  . 30  .  .  7 1 1 30
2130  . 35  .  . 10 1 3 35
2135  . 60  .  .  8 0 .  .
2136 25 20 20 20  6 0 0 22
2136  . 45  .  .  9 0 .  .
2140 30 30 20 25 10 1 1 27
2142  . 30  . 30 10 0 1 30
2154  . 20  . 25  6 1 1 22
2154 25 20 20 25  7 1 1 22
2154  . 15 30 30 10 1 1 25
2155 25 20 20 30  8 1 1 20
2156  . 40  . 30 10 1 3 35
2158  . 40  . 30  7 0 3 35
2158 65 45 40 45  9 0 3 50
2164 30 30 30 25  8 0 1 30
2164 40 30 20 20 10 0 1 30
2169  . 40 35  .  6 1 3 38
2169 40 35 30 30  7 1 3 35
2169  . 30  . 30  9 1 1 30
2170 15 15  . 15  7 0 0 15
2170  . 15 15 15 10 0 0 15
2171 60  . 40 25  6 1 3 42
2171  . 35  . 30  7 1 1 32
2171 40 40  . 30  8 1 3 37
end
label values time time1
label def time1 6 "1 yr", modify
label def time1 7 "2 yr", modify
label def time1 8 "3 yr", modify
label def time1 9 "4 yr", modify
label def time1 10 "5 yr", modify
label values trt trtlb
label def trtlb 0 "active", modify
label def trtlb 1 "placebo", modify

I get an error that says "variable deb contains all missing values" when I try

Code:

reshape long wgt, i(id) j(deb)
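
Two things in the example data likely matter here: the variables are named wgt_1 to wgt_4, so the stub that reshape needs is wgt_ (with the underscore), and id does not identify rows uniquely (id 2083 even repeats within time 7). A minimal sketch on three rows of the data, using a row counter `obs` invented here as the i() variable:

```stata
clear
input long id int(wgt_1 wgt_2 wgt_3 wgt_4 time)
2049 35  .  .  . 10
2083  . 30  . 25  7
2083  . 55  . 35  7
end

* A unique row identifier, because neither id nor id-time is unique
gen long obs = _n

* The stub includes the underscore so that j picks up the numbers 1 to 4
reshape long wgt_, i(obs) j(deb)
rename wgt_ wgt
list, sepby(obs)
```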

↧