
Multiple imputation iteration backed up

Hi folks,

I ran multiple imputation on longitudinal data in wide form (m=20). At m=4 I saw the following message, and the imputation could not proceed:

Running mlogit on data from iteration 8, m=4:

Iteration 0: log likelihood = -10833.986
Iteration 1: log likelihood = -9869.2422
Iteration 2: log likelihood = -9714.8179
Iteration 3: log likelihood = -9714.8179 (backed up)
Iteration 4: log likelihood = -9714.8179 (backed up)
Iteration 5: log likelihood = -9714.8179 (backed up)
Iteration 6: log likelihood = -9714.8179 (backed up)
Iteration 7: log likelihood = -9714.8179 (backed up)
Iteration 8: log likelihood = -9714.8179 (backed up)
Iteration 9: log likelihood = -9714.8179 (backed up)
Iteration 10: log likelihood = -9714.8179 (backed up)
Iteration 11: log likelihood = -9714.8179 (backed up)

...

My syntax is:
local lg="chronic* prevcare* rhukou* phi* sickinj*"
local rg="lghhincg* oop_cpi* denc* econ* health* soc* trans* comm* edc*"
local ml="educ_r* mcare* emp_rs* shitype*"
mi impute chained (logit) `lg' (regress) `rg' (mlogit) `ml' ///
= age* sex hhsizeimp* geography stratum, add(20) rseed(1635) augment noisily


Does anyone know why it succeeded for m=1/3 but failed at m=4? Can I tell from the saved log which variable the failing estimation occurred on? Is there a solution?
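A hedged diagnostic sketch (not a confirmed fix): a long run of "(backed up)" iterations often points to sparse outcome categories in one of the mlogit variables, which tabulating each of them on the observed data may reveal:

Code:
* hedged sketch: check the mlogit-imputed variables for sparse
* outcome categories (variable lists taken from the command above)
foreach v of varlist educ_r* mcare* emp_rs* shitype* {
    display as text _n "=> `v'"
    tabulate `v', missing
}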
Thank you!

Best,

Rodrigo

System GMM with mi estimate

Hi,
does anyone know which command to use to implement system GMM on a multiply imputed dataset? I understand that xtabond2 is commonly used for system GMM, but I could not find any documentation on its compatibility with "mi estimate:".

The regression that I am trying to replicate is Table A7, page 16 at this link: https://static-content.springer.com/...MOESM1_ESM.pdf

The multiple imputation dataset used is Solt's SWIID v5 available at: https://dataverse.harvard.edu/datase...l:1902.1/11992
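One hedged possibility (untested here, and it is worth verifying that xtabond2 posts results in a form mi estimate can combine) is mi estimate's cmdok option, which allows estimation commands that are not officially mi-supported:

Code:
* hedged sketch: cmdok lets mi estimate run a command it does not
* officially support; y, x, and the instrument choices are placeholders
mi estimate, cmdok: xtabond2 y L.y x, gmm(L.y) iv(x) twostep robust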

thank you!

Margins after Ivregress with binary endogenous variable

Dear Statalisters,

I am using the ivregress command with a binary endogenous variable. The code is given below.

ivregress 2sls AgeatMar medu42 medu43 medu44 sibshipsize GenroleN (Compl_cl8 = Fac_School) , first

where AgeatMar is continuous and Compl_cl8 is the binary endogenous variable.

I would like to calculate the predicted mean for Compl_cl8. For that, I have used margins:

margins, dydx(Compl_cl8) atmeans

I am afraid I am using the wrong margins command. Please suggest Stata code for calculating the predicted probability for a binary endogenous regressor (after ivregress).
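A hedged note: dydx() reports a marginal effect rather than a predicted mean. Since ivregress 2sls is linear, one sketch of predicted means of the outcome at each value of the binary regressor is:

Code:
* hedged sketch: predicted mean of AgeatMar at Compl_cl8 = 0 and 1,
* holding the other covariates at their means
margins, at(Compl_cl8=(0 1)) atmeans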

Any further discussion on this thread will be appreciated.

New package, sdist, available on SSC

Thanks to Kit Baum, a new package is available through SSC that simulates the central limit theorem: sdist.

A sound understanding of the central limit theorem is crucial for comprehending parametric inferential statistics. Despite this, undergraduate and graduate students alike often struggle with grasping how the theorem actually works and why researchers rely on its properties to draw inferences from a single unbiased random sample. This package,
sdist, offers a tool for teaching and learning the central limit theorem via easy-to-generate simulations. Specifically, sdist can be used to simulate the central limit theorem by (1) generating a matrix of randomly-generated normal or non-normal variables, (2) plotting the associated empirical sampling distribution of sample means, (3) comparing the true sampling distribution standard deviation to the standard error from the first randomly-generated sample, and (4) automatically producing a side-by-side comparison of the two distributions.

The package can be obtained using the following:

Code:
ssc install sdist
The code is purposefully kept simple to promote student use and experimentation. For example, if the student wishes to simulate the central limit theorem by comparing the standard deviation from an empirical sampling distribution from 500 random samples following a uniform distribution to the standard error estimate from one of these random samples, the student would type the following:

Code:
sdist, samples(500) obs(500) type(uniform)
where obs(500) indicates that the student wants 500 observations in each of the 500 samples.

When executed here, the code produced the following simple output:

Code:
                                
          ------------------
                      sd/se    
          ------------------
          sig_Xb       .013
          se_Xb        .013
          abs(diff)    0
          ------------------
                  
The difference between sig_Xb and se_Xb is 0. The larger this difference, the more poorly the standard error from the single X variable approximates the standard deviation of the sampling distribution. This may be due to one of two things: a small number of samples and/or a small sample size.


sdist also produced the following graph, with the empirically-generated sampling distribution and its associated parameters in the top panel and the variable distribution from one of the random samples and its parameters in the bottom panel.
[graph omitted: empirical sampling distribution in the top panel; one random sample's variable distribution in the bottom panel]

The point, of course, is to illustrate that the standard error estimate from the bottom panel is incredibly close in value to the observed standard deviation of the sampling distribution--despite the fact that the variable is not normally distributed. (In this example, the standard deviation of the sampling distribution and the standard error are reported as being exactly the same; however, as shown in the package help file, the student can use the round() option to report more precise estimates.)

Uniform, normal, and Poisson distributions are currently available. Any results can be reproduced by using set seed before executing the sdist command.
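For example, a reproducible run might look like the following (the seed value is arbitrary, and type(poisson) is assumed from the list of available distributions):

Code:
set seed 12345
sdist, samples(500) obs(500) type(poisson)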

Best,

Marshall

How can I replace missing values in a cluster with the unique non-missing values within that cluster, given unique orgnr in my data set?

I work with a panel data set covering 10 years that includes loan portfolios from several banks to customers, each identified by an organization number (variable name: customerorgid). The data set also contains industry codes (variable name: bransjek_07) for the customers, but for some reason approximately 300,000 observations have missing industry codes. For instance, there are several cases in which a customer has industry codes for only a few years, with the codes missing in the remaining years. There are also many customers with no industry codes at all.

My goal is to measure banks' loans by industry rather than by customer (firm). Hence, if I drop 300,000 loans (out of approximately 1,300,000), this will produce extreme omitted-variable bias. So my question is: how can I replace the missing industry codes within a cluster with the unique non-missing values (the industry codes that actually are non-missing for that specific firm), given the firms' organization numbers in my data set? Additionally, is there an efficient workaround for firms whose industry codes are all missing?

I tried the following command in stata:
xfill bransjek_07, i(customerorgid)

but I got the following error message:
bransjek_07 is not constant within customerorgid -> fill not performed
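A hedged sketch of one alternative (egen's mode() works for numeric or string variables; minmode breaks ties arbitrarily, so it is worth checking first whether any firm really has conflicting non-missing codes):

Code:
* hedged sketch: fill missing industry codes within each firm with
* the firm's modal non-missing code; firms with no non-missing code
* at all remain missing
bysort customerorgid: egen bransjek_mode = mode(bransjek_07), minmode
replace bransjek_07 = bransjek_mode if missing(bransjek_07)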

Any help will be appreciated!

Sigve



ivprobit taking too much time

Hi everyone,

I am analyzing data through instrumental variable analysis.

I applied ivprobit (because my dependent variable is a dummy). However, Stata is taking a very long time to analyze my data.

Following is my regression command:

ivprobit y size roa_w mtb_w lev_w loss big4 age inst_own num_analyst i.year i.ind_49 (x = idio_mean), first


This is what Stata is showing now (screenshot attached).
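A hedged aside, not a diagnosis: with many year and industry dummies, ivprobit's default maximum-likelihood estimator can be very slow, and Newey's two-step estimator (the twostep option) is often much faster:

Code:
* hedged sketch: Newey's two-step estimator instead of the default MLE
ivprobit y size roa_w mtb_w lev_w loss big4 age inst_own num_analyst ///
    i.year i.ind_49 (x = idio_mean), twostep first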

Computing accurately exp(x)-1

Stata and Mata have no expm1(x) function, i.e., a function that computes exp(x)-1.
Computing exp(x)-1 directly suffers from cancellation error for small values of x and will be inaccurate in that case (see in particular this paper: https://cran.r-project.org/web/packa...1mexp-note.pdf).
Such a function is particularly useful for computing 1-exp(-x) accurately, which has the same form as the invcloglog function (1-exp(-exp(x))): we can compute 1-exp(-x) = -expm1(-x).
It also seems that the invcloglog implementation in Stata and Mata does not use such a function. The Mata code for invcloglog(x) basically returns 1-exp(-exp(x)) for all values; only for non-missing values of x > 500 is the result set to 1.

Note that expm1() is part of the C and C++ standard libraries, as is log1p().

I have tried to implement the function myself in Mata, but I am in doubt whether I am doing it right. Basically, my approach is to use a second-order Taylor expansion for values whose absolute value is at most 1e-5. I am not sure whether computing more terms would actually help.

Code:
real matrix expm1(real matrix x)
{
    real matrix index

    // flag arguments small enough for exp(x)-1 to lose accuracy
    index = abs(x) :<= 1e-5

    // direct formula for larger |x|, 2nd-order Taylor expansion otherwise
    return(!index:*(exp(x):-1) :+ index:*(x + 0.5*x:^2))
}

I have read elsewhere that one can use the identity exp(x)-1 = 2*exp(x/2)*sinh(x/2) for values of x between epsilon()/2 and log(2) (where epsilon() is the machine precision), and that for values below epsilon()/2 one can simply take expm1(x) = x.
But I don't know which approach is right, if either.
I am no expert in this matter, and I would very much like to benefit from your insights on this issue.
Does someone know which approximation I should use (as well as the thresholds), or could you give some advice on how to compute this function?
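For what it's worth, here is a minimal Mata sketch of the sinh-based variant (the identity exp(x)-1 = 2*exp(x/2)*sinh(x/2) is exact in real arithmetic; the threshold below is illustrative rather than vetted):

Code:
real matrix expm1_sinh(real matrix x)
{
    real matrix tiny

    // below epsilon(1)/2, exp(x)-1 equals x to machine precision
    tiny = abs(x) :< epsilon(1)/2

    // elsewhere 2*exp(x/2)*sinh(x/2) avoids subtractive cancellation
    return(tiny:*x :+ !tiny:*(2*exp(x:/2):*sinh(x:/2)))
}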
Best
Christophe

Dropping duplicates for an egen variable

Hi, I am working with the following data
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int year str28 country str43 admin1 str26 admin2 int(admin1_riots admin2_riots)
2011 "Algeria" "Bechar"         "Abadla"                1  1
2015 "Algeria" "Tizi Ouzou"     "Abi Youcef"           16  1
2015 "Algeria" "Bejaia"         "Adekar"               10  1
2016 "Algeria" "Bejaia"         "Adekar"               18  1
2015 "Algeria" "Adrar"          "Adrar"                 2  2
2015 "Algeria" "Adrar"          "Adrar"                 2  2
2016 "Algeria" "Adrar"          "Adrar"                 9  8
2016 "Algeria" "Adrar"          "Adrar"                 9  8
2016 "Algeria" "Adrar"          "Adrar"                 9  8
2016 "Algeria" "Adrar"          "Adrar"                 9  8
2016 "Algeria" "Adrar"          "Adrar"                 9  8
2016 "Algeria" "Adrar"          "Adrar"                 9  8
2016 "Algeria" "Adrar"          "Adrar"                 9  8
2016 "Algeria" "Adrar"          "Adrar"                 9  8
2010 "Algeria" "Laghouat"       "Aflou"                 1  1
2012 "Algeria" "Bouira"         "Aghbalou"              3  1
2009 "Algeria" "Bouira"         "Ahnif"                 1  1
2016 "Algeria" "Setif"          "Ain Arnat"             5  1
2015 "Algeria" "Bouira"         "Ain Bessem"           13  1
2010 "Algeria" "Ain Defla"      "Ain Defla"             1  1
2013 "Algeria" "Ain Defla"      "Ain Defla"             2  2
2013 "Algeria" "Ain Defla"      "Ain Defla"             2  2
2004 "Algeria" "Annaba"         "Ain El Berda"          1  1
2014 "Algeria" "Oum el Bouaghi" "Ain El Fakroun"        2  1
2011 "Algeria" "Tiaret"         "Ain El Hadid"          1  1
2012 "Algeria" "Tiaret"         "Ain El Hadid"          6  1
2016 "Algeria" "Tiaret"         "Ain El Hadid"         12  1
2016 "Algeria" "Mascara"        "Ain Fares"             9  1
2003 "Algeria" "Oum el Bouaghi" "Ain MLila"             1  1

end
and I am trying to drop duplicate observations of admin2 within the same year. For example, I would like only one observation for Adrar in 2016, with an admin2_riots value of 8. But when I try to create a variable to indicate duplicates and then delete the duplicate observations, all of my values for admin2 turn to 1 because I used egen to create the indicator. Is there a proper way to store these values so they are not altered when I delete the duplicate observations?
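A hedged sketch of one approach, which works because (as in the example) the riot counts are already constant within each country-admin1-admin2-year cell, so nothing needs to be recomputed:

Code:
* hedged sketch: keep one observation per admin2-year cell; force is
* needed only if some other variable differs across the duplicates
duplicates drop country admin1 admin2 year, force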


Thank you.

Different (independent) variables but the same sample

Hello.
Please excuse my English; I am not a native speaker. I have a question and hope somebody can help me. I have a sample of 234 observations (people), and I want to analyse five different ordinal variables in this group. This is different from analysing the same variable at five different times, or the outcomes of five different instruments measuring the same variable; for those cases I can use the Friedman test. But in the first situation, can I use the Kruskal-Wallis test (since I want to analyse outcomes of different variables)?

Thank you so much in advance.

xtoverid operator not allowed r(101); error

Hi there, I am running the following IV estimation:

xtivreg netreturn sin (religiositymean sinreligiositymean= statefav sinstatefav) beta lmarketcap lpb bev lgdp spread inflationrate open law year1 year2 year3 year4 year5 year6 year7 year8 year9 year10 country1 country2 country3 country4 country5 country6 country7 country8 , re vce(cluster company)

I am trying to test the relevance and exogeneity of the instruments. However, when I run xtoverid or xtoverid, cluster(company), I get:

o. operator not allowed
r(101);


What am I doing wrong? Any help would be much appreciated.

flagging hospital transfers

Hi there
In my data, each observation is a hospital admission. I want to flag hospital transfers, which I define as any admission whose admission date is <=1 day after the discharge date of a previous admission for the same person.
Below is a brief illustration - the variable "transfer" is the variable I would like to create.
If possible, I would like to avoid converting my dataset from 'long' to 'wide', as I already have lots of variables and some people have a very large number of admissions.
Any advice much appreciated!
Thanks


[screenshot of example admissions data, including the desired "transfer" variable, omitted]
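A hedged sketch, assuming a person identifier id and numeric date variables admit_date and disch_date (all three names are placeholders):

Code:
* hedged sketch: within person, order admissions by admission date
* and flag any admission starting <= 1 day after the prior discharge
bysort id (admit_date): gen byte transfer = _n > 1 & ///
    admit_date - disch_date[_n-1] <= 1
* if stays can overlap, compare against a running maximum of
* disch_date rather than the immediately preceding row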

Weighted averages

Hi everyone,

I have panel data on orders for various products (A, B, C, etc.). The data are organized in the following format.
Product   Year   Share   Price   Quantity
A         2003     1      10       1000
A         2005     1      12       4000
A         2007     1       7        200
B         2005     1       8        300
B         2005     2       9        100
B         2007     1       3        200
B         2007     2       6         50
B         2007     3       7         50
B         2009     1      72       1000
B         2011     1      12        400
C         2003     1       5       5000
C         2005     1       7        200
C         2005     2      12        150
C         2009     1       2       1900
In some years, the orders were split between manufacturers, as indicated by the "share" variable. In 2005, for example, 300 units of product B were purchased at a price of $8 per unit from one company, and 100 units of the same product were purchased from another company at a price of $9 per unit. Orders can be split among as many as seven companies, so the share column can take on values of 1 to 7.

For split orders, I’d like to calculate a weighted average. For product B in 2005, for instance, it'd be ((300*8) + (100*9)) / (300 + 100) = 8.25

The result would be a dataset like this one.
Product   Year   Share   Price   Quantity
A         2003     1      10       1000
A         2005     1      12       4000
A         2007     1       7        200
B         2005            8.25      400
B         2007            4.17      300
B         2009     1      72       1000
B         2011     1      12        400
C         2003     1       5       5000
C         2005            9.14      350
C         2009     1       2       1900
What would be the simplest way to accomplish this?
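A hedged sketch of one route, using the variable names from the listing (this collapses the split rows into one aggregated row per product-year):

Code:
* hedged sketch: quantity-weighted mean price and total quantity
* within each product-year, keeping a single row per cell
bysort Product Year: egen double sum_pq = total(Price*Quantity)
bysort Product Year: egen double sum_q  = total(Quantity)
gen double wprice = sum_pq/sum_q
bysort Product Year: keep if _n == 1
keep Product Year wprice sum_q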

Thanks for your help.

math

tabmult - categorical variables show both col percent and freq?

I am trying to produce a single table of multiple two-way tabulations using tabmult from SSC in Stata 14.

I am using only categorical variables, no continuous variables. I have successfully produced xml tables with either column percentages or frequencies/counts, but I would like to show both in the table.

I have used the following command, as described in the tabmult help file:

tabmult, cat(varY1) by(varX1 varX2) statc(count) col save(Table1.xml) sheet(VAR1) replace


I currently have results with column percentages (shown below) but would like to add cell frequencies as well, preferably above or below the column percentages. Is this possible using tabmult with only categorical variables? If not, is there another package that does the same thing? I have tried baseline, and it does not produce exactly what I am looking for. tabmult works great, except for this one issue.
                      varX1            varX2
                      1      2         1      2      3
varY1
Always                 2.99   0.49     0.40   0.68   1.92
Most Of The Time      19.16   5.39    10.76   3.04  20.33
About Half The Time   29.04  15.69    19.52  13.85  30.77
Some Of The Time      39.22  51.47    44.22  45.95  41.48
Never                  9.58  26.96    25.10  36.49   5.49
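As a hedged point of comparison (not a tabmult option): Stata's built-in tabulate can display frequencies and column percentages together in a single two-way table, though without tabmult's multi-variable layout or XML export:

Code:
* built-in two-way table with both cell counts and column percentages
tabulate varY1 varX1, column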


Thank you,
Lisa Bryant

Instrumental Variables and Gologit2

Hi,

I am having trouble finding the Stata command that would let me fit a generalized ordered logit with an instrumental-variables technique. Does anyone know how to do this?

Thanks!

Margins after ivprobit

Hi,

When I use the margins, dydx(explanatory variable) command after an ivprobit estimation, the marginal effect reported is the same as the coefficient in my ivprobit regression. Does anybody know why this is the case?
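A hedged guess, worth checking against the ivprobit postestimation help: the default prediction after ivprobit is the linear index, for which dydx() simply reproduces the coefficient; requesting the probability scale changes the result:

Code:
* hedged sketch: marginal effects on Pr(y = 1) instead of the default
* linear prediction; x stands in for the explanatory variable
margins, dydx(x) predict(pr)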

Thanks!

table out

Hi all,

I would like to show the following in a table, where y is a continuous variable and regions and year are discrete variables:

table regions year, c(mean y)

Which Stata command should I use to create a publication-quality table? Thanks a lot.
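One hedged possibility is the community-contributed tabout from SSC, which can export a two-way summary table of means to a delimited (or LaTeX) file; the options below are a sketch, so check the tabout help for the exact syntax:

Code:
* hedged sketch: export the mean of y by regions and year
ssc install tabout
tabout regions year using table1.txt, sum cells(mean y) replace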

Panel Data Parametric Survival Regression

I would greatly appreciate it if you could let me know how to choose among different parametric distributions (gamma, Weibull, lognormal, loglogistic, etc.) for panel (time-series cross-sectional) survival analysis or discrete-time survival analysis in Stata 14.
Also, how does one decide between a proportional hazards and an accelerated failure time specification?


I read these materials, but they are about continuous-time survival analysis:

http://spia.uga.edu/faculty_pages/rb...rdOneNotes.pdf
http://spia.uga.edu/faculty_pages/rb...rdTwoNotes.pdf
http://spia.uga.edu/faculty_pages/rb...ThreeNotes.pdf

Then I tried to compute the LR test explained on page 22 of the second note in order to obtain a p-value, but I am not sure what to do. Besides, I could only test the PH assumption for the Cox model, which is not a panel-data model.

What's more, I couldn't do what is instructed on pages 24 and 25 of the Oxford second note. In fact, when I use the predict command, it gives me an array of continuous values even though my dependent variable is discrete.

The following table is based on my estimations, which are attached in conjunction with my data.

Survival Distribution                  AIC        BIC        Log-Likelihood   df
Exponential Proportional Hazard        433.663    471.1031   -209.83151        7
Exponential Accelerated Failure Time   433.663    471.1031   -209.83151        7
Lognormal Proportional Hazard          377.6502   420.4389   -180.82508        8
Loglogistic Proportional Hazard        377.874    420.6627   -180.93701        8
Gamma Proportional Hazard              cannot compute an improvement -- discontinuous region encountered
Weibull Proportional Hazard            cannot compute an improvement -- discontinuous region encountered
Weibull Accelerated Failure Time       205.8869   248.6756   -94.943472        8
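For the distribution comparison itself, a hedged sketch of the usual loop (it assumes the data are already stset; x1 and x2 are placeholder covariates, and distribution(gamma) is streg's generalized gamma):

Code:
* hedged sketch: fit each candidate distribution and report AIC/BIC;
* capture noisily keeps the loop alive through convergence failures
foreach d in exponential weibull lognormal loglogistic gamma {
    display as text _n "--- distribution: `d' ---"
    capture noisily streg x1 x2, distribution(`d')
    if !_rc estat ic
}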

Best regards,

Regression model for E coli concentration

Hi,

I am having trouble determining the best way to model my E. coli data, as there are a lot of zeros in the data set. I have been considering a zero-inflated model, but I have not been able to find out whether it allows for random effects (I have three that I need to account for: Barn, Visit, Floor).

Does anyone know the best way to model this type of data?
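A hedged aside rather than an answer: if the zeros can be absorbed by overdispersion instead of a separate inflation process, Stata 14's menbreg fits a mixed-effects negative binomial with nested random intercepts (all variable names below are placeholders, and the nesting of Barn/Visit/Floor is assumed):

Code:
* hedged sketch: mixed-effects negative binomial with random
* intercepts for barn, visit within barn, and floor within visit
menbreg ecoli_count treatment || barn: || visit: || floor: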

Thanks,
Chelsea

how to display (tab) values only if y=0 or y=1 in all 13 waves

Hi,

I have a panel dataset with 13 waves, where the identifier is hhid (household ID). My dependent variable is saving1 (binary, which takes either 1 or 0).

When I run a Hausman test for xtlogit RE vs FE, I get the error message:
model fitted on these data fails to meet the asymptotic assumptions of the Hausman test;


I believe this is because when I run the xtlogit, fe model, I get this
note: multiple positive outcomes within groups encountered.
note: 60 groups (767 obs) dropped because of all positive or all negative outcomes.

Anyway, I wanted to show that running an FE model would drop many observations for which saving1 is time-invariant.
E.g. as can be seen in the screenshot below, for hhid=6, saving1 varies over time, whereas for hhid=21, saving1=0 for all 13 waves.

[screenshot of the panel omitted]
So I wanted to display
Code:
tab hhid saving1
but only for households where saving1==0 in all 13 waves or saving1==1 in all 13 waves.

Something like the following (although I know my attempt is incorrect):
Code:
tab hhid year if saving1==0 or saving1==1 for nyear==13
Would you be able to suggest the code for this please?
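A hedged sketch of one way to do this (it assumes one observation per household per wave):

Code:
* hedged sketch: flag households whose saving1 never varies, then
* tabulate only those observed in all 13 waves
bysort hhid (saving1): gen byte allsame = saving1[1] == saving1[_N]
bysort hhid: gen int nwaves = _N
tab hhid saving1 if allsame & nwaves == 13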