Channel: Statalist
Viewing all 72758 articles

Combine Regression Coefficients

Hello,

I am using lincom in Stata/SE 15 to compute standard errors, t-statistics, p-values, and confidence intervals for linear combinations of coefficients from regress. Rather than typing a lincom command for each combination, I would like to run a single command for all the combinations I want and output one table. Is this possible, or do I have to use a different command?

Example:

Code:
regress deg4 age score exper female
lincom _b[_cons] + _b[age]
lincom _b[_cons] + _b[score]
lincom _b[_cons] + _b[exper]
lincom _b[_cons] + _b[female]

Code:
deg4        Coef.   Std. Err.      t   P>|t|   [95% Conf. Interval]
 (1)     .2301388   .0197375  11.66   0.000    .1914445    .2688331
 (2)
 (3)
 (4)
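A loop over the coefficient names would at least avoid the retyping; this is only a sketch (it produces the same output as above, one mini-table per call). I believe the community-contributed lincomest (SSC) can store each combination as an estimation result, which estimates table could then combine into one table.

```stata
* run after: regress deg4 age score exper female
* one lincom call per slope coefficient, without retyping each line
foreach v in age score exper female {
    lincom _b[_cons] + _b[`v']
}
```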
Thank you,

How to generate a variable that is the sum of another variable's observations subject to a year and id

I want to generate a variable that is the sum of "trade-share of GDP" in a given year for a given country. I have data for 4 years, and I have a variable that is the "trade value" for each country pair.
Code:
year importer exporter tradevalue GDPimport
1970   CAN      USA        2356      65748
1970   CAN      AUS         567      65748
1970   CAN      GBR         789      65748
1971   IND      CHN         125      46783
1971   IND      JAP        1256      46783
1971   JAP      BEN          34      67890
1972   JAP      BRZ        1345      67890
1972   EGY      JOR         234       3456
1972   EGY      POL          78       3456
Ignoring that I need to convert tradevalue and GDPimport to the same dollar amount (so both in billions of $), I want to create a column/variable that is the sum of the tradevalue for a country in a given year. So the new column would be like this:
Code:
year importer exporter tradevalue GDPimport tradesum
1970   CAN      USA        2356      65748      3712
1970   CAN      AUS         567      65748      3712
1970   CAN      GBR         789      65748      3712
1971   IND      CHN         125      46783      1381
1971   IND      JAP        1256      46783      1381
1971   JAP      BEN          34      67890      1379
1972   JAP      BRZ        1345      67890      1379
1972   CAN      USA        2689      70019      3367
1972   CAN      AUS         678      70019      3367
So tradesum is 3712 (2356 + 567 + 789) for CAN in 1970, but 3367 (2689 + 678) in 1972. Basically, I want to sum tradevalue by year and importer, so that the sum repeats for each importer within a year. As you can see, I have CAN twice here, once for 1970 and once for 1972. In my real data, every country in the world appears as both importer and exporter, so I do not want to sum tradevalue by importer alone, as that would pool all four years. And since I have every country in the world, I cannot select each country by hand.

I'm hoping to take that Tradesum value and then just divide by the GDPimport to get the trade share of GDP. But first I need to figure out how to get that tradesum variable to exist. Any help would be greatly appreciated.
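For what it's worth, a minimal sketch of the two steps with egen's total() function, using the variable names from the example above:

```stata
* sum tradevalue within each year-importer cell; the total is
* repeated on every row of that cell
bysort year importer: egen tradesum = total(tradevalue)

* then the trade share of GDP
gen tradeshare = tradesum / GDPimport
```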

Cross validation after logistic regression and validation of optimal cut points

Dear all,

I have fitted a logistic regression and have been asked to cross-validate my results.
I have used the community-contributed cvauroc command. My independent variable of interest is compromisedlung (p < 0.001 in the regression).

[screenshot of the cvauroc output attached]


How should I interpret the output? Is the cross-validated mean AUC of 0.84 reason enough to state that the model is accurate and reasonably validated?

Furthermore, I have been asked to provide an optimal cutpoint for compromisedlung, which I have done using the community-contributed cutpt command:

[screenshot of the cutpt output attached]

However, I am required to set the cutpoint to achieve high sensitivity or high specificity and validate them by using cross-validation.
Can you help me better understand the steps involved in this? How should I cross-validate the cut-points?
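For context, here is a sketch of the kind of call I mean. The variable names (died, age, compromisedlung) are hypothetical, and kfold() and seed() are the options as I understand the SSC package's syntax:

```stata
* ssc install cvauroc
* 10-fold cross-validated AUC for a fitted logit model
cvauroc died age compromisedlung, kfold(10) seed(1234)
```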

Thank you very much for your help. You are helping COVID-19 research.

"Received fatal alert: handshake_failure" when downloading Census (QWI) data.

I'm trying to download a series of .csv files from the Census LEHD website (a certain configuration of the Quarterly Workforce Indicators data, one file for each state). When I use the "copy" command, I receive the error "Received fatal alert: handshake_failure".

[screenshot of the copy command and error attached]

This is not using the Census API. Is that necessary?

Getting "varlist required" error while using xtgcause command for granger causality

Hello,

I am using the community-contributed xtgcause command for Granger-causality analysis in Stata 14. The data are quarterly series of housing prices and GDP growth.

Here's the error:

Code:
xtgcause gdp_value hp_value
varlist required
r(100);
The data looks like this.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str3 code str7 time float(hp_value gdp_value panel)
"AUS" "2005-Q1"  73.20925   .783395  1
"AUS" "2005-Q2"  72.95418   .449001  2
"AUS" "2005-Q3" 72.254326  1.067343  3
"AUS" "2005-Q4"  72.73719    .84764  4
"AUS" "2006-Q1"  73.43183   .204879  5
"AUS" "2006-Q2"  74.83635   .167939  6
"AUS" "2006-Q3" 75.763535  1.498158  7
"AUS" "2006-Q4"  76.36989  1.279062  8
"AUS" "2007-Q1"   77.1149  1.489092  9
"AUS" "2007-Q2"  79.13862   .586955 10
"AUS" "2007-Q3"  81.87685  1.116082 11
"AUS" "2007-Q4"  83.83504   .500482 12
"AUS" "2008-Q1"  84.03899  1.131335 13
"AUS" "2008-Q2"  82.22787   .222517 14
"AUS" "2008-Q3"  80.14417     .7786 15
"AUS" "2008-Q4"  78.08295  -.483688 16
"AUS" "2009-Q1"  77.83062  1.033833 17
"AUS" "2009-Q2"  80.32821   .559652 18
"AUS" "2009-Q3"  83.68297    .30364 19
"AUS" "2009-Q4"  87.15096   .737188 20
"AUS" "2010-Q1"  89.63873   .503527 21
"AUS" "2010-Q2"  90.46313   .517838 22
"AUS" "2010-Q3"   89.8545   .694337 23
"AUS" "2010-Q4"  89.58639  1.021125 24
"AUS" "2011-Q1"  88.17492  -.269426 25
"AUS" "2011-Q2"  86.57712  1.173372 26
"AUS" "2011-Q3"  85.21406  1.347792 27
"AUS" "2011-Q4"  83.61575  1.145107 28
"AUS" "2012-Q1"  84.19519   .935578 29
"AUS" "2012-Q2"  83.33581   .753941 30
"AUS" "2012-Q3"  83.04298   .577621 31
"AUS" "2012-Q4"  83.82952   .558495 32
"AUS" "2013-Q1"   84.4818   .271678 33
"AUS" "2013-Q2"   85.8001   .455408 34
"AUS" "2013-Q3"  87.75211   .792315 35
"AUS" "2013-Q4"  90.07372   .819718 36
"AUS" "2014-Q1"  91.41034   .695223 37
"AUS" "2014-Q2"  92.29663   .545999 38
"AUS" "2014-Q3"  93.83645    .47311 39
"AUS" "2014-Q4"  94.82317   .372507 40
"AUS" "2015-Q1"   96.5474   .856947 41
"AUS" "2015-Q2"   99.9376   .135397 42
"AUS" "2015-Q3" 102.16533  1.070304 43
"AUS" "2015-Q4" 101.34966   .562297 44
"AUS" "2016-Q1"  101.9707   .899407 45
"AUS" "2016-Q2"  103.1011    .71166 46
"AUS" "2016-Q3" 104.77054   .159082 47
"AUS" "2016-Q4" 108.22515   .984697 48
"AUS" "2017-Q1" 110.95603   .345305 49
"AUS" "2017-Q2" 112.13645   .622296 50
"AUS" "2017-Q3" 112.16082   1.01736 51
"AUS" "2017-Q4" 111.96382   .490343 52
"AUS" "2018-Q1" 111.37794   .910043 53
"AUS" "2018-Q2" 109.88402   .732109 54
"AUS" "2018-Q3" 108.05421   .349175 55
"AUS" "2018-Q4" 104.35603   .162793 56
"AUS" "2019-Q1" 101.60428   .495032 57
"AUS" "2019-Q2"  99.93758   .602246 58
"AUS" "2019-Q3" 102.12965   .551297 59
"AUS" "2019-Q4"  105.0248   .526511 60
"CAN" "2005-Q1" 64.094246   .346067  1
"CAN" "2005-Q2"  65.19584   .725523  2
"CAN" "2005-Q3"  65.84021  1.211375  3
"CAN" "2005-Q4" 67.317825   .988734  4
"CAN" "2006-Q1"  69.01177   .810995  5
"CAN" "2006-Q2"  71.05362   .051646  6
"CAN" "2006-Q3" 73.843666   .279245  7
"CAN" "2006-Q4"   75.6291   .398598  8
"CAN" "2007-Q1"  76.30847   .639627  9
"CAN" "2007-Q2"  78.72769   .968907 10
"CAN" "2007-Q3"  80.86643   .419855 11
"CAN" "2007-Q4"  82.21097   .113617 12
"CAN" "2008-Q1"  83.24121   .076641 13
"CAN" "2008-Q2"   83.4369   .361672 14
"CAN" "2008-Q3"  82.41688   .820044 15
"CAN" "2008-Q4"  81.50007 -1.159518 16
"CAN" "2009-Q1"  79.49334 -2.259686 17
"CAN" "2009-Q2"  78.38065 -1.091259 18
"CAN" "2009-Q3"  80.15189   .449939 19
"CAN" "2009-Q4"  82.80745  1.166157 20
"CAN" "2010-Q1"  85.57713  1.206341 21
"CAN" "2010-Q2"  86.85273   .523261 22
"CAN" "2010-Q3"  86.64377   .708695 23
"CAN" "2010-Q4"  85.75758  1.117505 24
"CAN" "2011-Q1"  87.04622   .755347 25
"CAN" "2011-Q2"  88.05898   .194807 26
"CAN" "2011-Q3"  89.41224   1.37964 27
"CAN" "2011-Q4"  89.92641   .787513 28
"CAN" "2012-Q1"  90.75403   .064087 29
"CAN" "2012-Q2"  91.75568   .326164 30
"CAN" "2012-Q3"  92.13129   .136217 31
"CAN" "2012-Q4"  92.14071   .206781 32
"CAN" "2013-Q1"  91.94321   .897005 33
"CAN" "2013-Q2"  92.51186   .580357 34
"CAN" "2013-Q3"  92.88799   .814973 35
"CAN" "2013-Q4"  93.92451   1.05056 36
"CAN" "2014-Q1"  95.01175   .163448 37
"CAN" "2014-Q2"  95.32622   .913329 38
"CAN" "2014-Q3"  95.96335   .959051 39
"CAN" "2014-Q4"  97.15623   .691976 40
end
First, I declare the panel with xtset panel, and then I run xtgcause gdp_value hp_value, but it produces the error above.

What could be the problem?

I have a balanced panel.
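One thing worth checking: in the excerpt, panel runs 1..60 within each country, so it indexes time, not the panel unit, and xtset panel declares the data incorrectly. A sketch of a declaration built from the variables shown (this assumes quarterly() parses the "2005-Q1" strings; adjust the mask if not):

```stata
encode code, gen(id)               // numeric panel identifier from the country code
gen qdate = quarterly(time, "YQ")  // convert "2005-Q1" etc. to a quarterly date
format qdate %tq
xtset id qdate
xtgcause gdp_value hp_value
```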

Calculating F-stat when Fewer Clusters than Regressors

I am trying to evaluate whether the first stage of my 2SLS IV regression suggests a strong enough instrument. However, my regression output does not report an F-statistic. Here is an example of the code I am trying to run:

Code:
reghdfe x z sex age christian state_trends_2-state_trends_32, cluster(state) absorb(state year_birth)
With reghdfe absorbing my fixed effects, I think I should not have to worry about singletons. But I include state time trends, so if I include any other regressors, like my instrument z and demographic characteristics, the number of regressors will exceed the number of clusters, and Stata will not calculate an F-stat. My understanding is that this is because I do not have enough degrees of freedom to get an F-stat on the whole regression. The number of clusters is a bit small, 32, but I have more than 2,500 observations.

My question is then whether it makes sense to calculate an F-test on a subset of regressors. And if so, how would I objectively select the regressors used to evaluate the strength of my instrument?

I have looked through a number of earlier posts about missing F-tests:
https://www.stata.com/statalist/arch.../msg00583.html
https://www.statalist.org/forums/for...o-f-test-value

But I have yet to find a post or reference that explains whether it is okay to use "test" on a subset of regressors to check the instrument.
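On the narrower mechanical question, an F-test on just the excluded instrument can be obtained with test after the regression; whether that is the statistic one should report is the econometric question I am asking about. A sketch, reusing the command above:

```stata
reghdfe x z sex age christian state_trends_2-state_trends_32, ///
    cluster(state) absorb(state year_birth)
test z   // F-test that the coefficient on the instrument is zero
```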

In case you are not familiar with reghdfe: http://scorreia.com/software/reghdfe/

Axis for each category using xtline

Hi everyone,

Hope all of you are safe.

I have panel data for 5 countries. I used xtline to draw a graph. However, I want the x-axis and y-axis to be visible for each of the groups (countries). I do not want to use the overlay option, because I want to show the countries' graphs separately. Does anyone know how I can do that?
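A sketch of one way to do this with the by-graph options, using a built-in panel dataset; as I understand it, rescale forces each subgraph to carry its own axis scales and labels:

```stata
* example data: Grunfeld investment panel, first 5 companies
webuse grunfeld, clear
xtset company year
xtline invest if company <= 5, byopts(rescale)
```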

Thanks!

aligning contents in tables with multiple columns

Hi,

I am creating a regression output with esttab using the wide option. I have three models, so aside from the variable-names column I have six other columns (two per model, holding the coefficients and standard errors). I export with a .tex extension for LaTeX, but in the resulting table the first model's coefficient column is aligned left (which I like) and the standard errors of all three models are perfectly aligned, while the coefficients of the second and third models are not. I am using the following line in Stata:

Code:
esttab est1 est2 est3 using "regressions2.tex", wide indicate() se s() b(3) alignment(D{.}{.}{-1}(lr)) starlevels(* 0.1 ** 0.05 *** 0.01) title(Models) nonumbers mtitles("Model 1" "Model 2" "Model 3") replace

The LaTeX output I get is shown in the attached image. I also wonder how to center the number of observations and the headings (Model 1, Model 2, and Model 3).

Thank you very much,
Juan

[screenshot of the LaTeX table attached]

The way that "eteffects" attempts to identify the selection bias

Dear Statalist,


While going through the PDF manual entry for eteffects, I found some skipped, but I think critical, steps in two places. I would really appreciate it if someone could help me understand the reasoning by completing the missing steps.



#1 It is not obvious to me that, under (3) and (5),

E[\epsilon_ij | E(t|z_i) + v_i] = E[\epsilon_ij | v_i].

Although \epsilon_ij and v_i are each mean independent of z_i, that need not carry over to the expectation above, which conditions on v_i and z_i simultaneously.

Also, regarding E[\epsilon_ij | v_i] = v_i*beta_2j: I think this linearity is assumed rather than something that can be derived from the existing assumptions.

#2 In equation (7), given E[\epsilon_ij | v_i] = v_i*beta_2j, I am curious how to derive E[\epsilon_ij | x_i, v_i, t_i=j] = v_i*beta_2j, as this additionally conditions on x_i and t_i.


Kind regards,
Yugen

Coefplot: Define range for rconnected CI plots

Hello,

I am trying to plot the coefficients of a regression with multiple treated groups, differentiated by intensity of treatment (I use deciles of a continuous variable). I run a diff-in-diff regression with the 10 deciles (I omit the middle decile) and then the coefplot command to plot the coefficients. I also plot the 95% confidence interval in the same graph.

I would like to include the omitted decile interaction in the coefplot, but I don't want the confidence-interval lines to connect to that point. That is, I would like the confidence-interval lines (the black dashed lines in the image) to appear only between 1 and 4 and between 6 and 10, while the rconnected plot for the coefficients (the solid blue line) is defined over the entire interval. Is there a way to define the range in which the CI lines are visible? An example of what I am trying to achieve is Figure 2 on page 167 of Deschênes and Greenstone (2011).

Please see example code and graph below. Any help would be much appreciated!

Best regards,
Simon

Code:
sysuse nlsw88, clear
* ssc install coefplot

egen dexp = xtile(ttl_exp), nq(10) // Create deciles of experience

reg wage ib5.dexp##i.collgrad

#delimit;
local lab 1/10;
coefplot, drop(_cons 0.collgrad 1.collgrad *.dexp) levels(95)
vertical yline(0) recast(connected) ciopts(recast(rconnected) lpattern(dash) color(black))
legend(order(2 "Coefficient" 1 "95% CI") row(1)) xlabel(`lab')
xtitle("Experience Decile Interacted with College Degree") title("Wage")
xsize(4.6) baselevels
;
#delimit cr
[coefplot graph attached]

Delete observations under conditions

Hi, I have a question that I hope someone can answer. I am still pretty new to Stata, so excuse me if I am completely lost.

I work with a dataset I created myself: a survey.

I have 538 observations in my dataset.
I am researching Muslims in Denmark and have therefore measured the degree of identification with the social group "Muslim minorities" on three variables.
Respondents answered from "strongly agree" to "strongly disagree" on a scale of 1-5.

I would now like to keep only the observations that responded "strongly agree" or "partially agree" on a minimum of two of the three variables.

I don't know if that is possible in Stata at all. I have searched and read a lot, but cannot find a command that solves my problem.
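A sketch with egen's anycount() function, assuming the three items are named q1, q2, q3 (hypothetical names) and coded 1 = strongly agree, 2 = partially agree; adjust names and codes to the actual data:

```stata
* count, for each respondent, how many of the three items equal 1 or 2
egen n_agree = anycount(q1 q2 q3), values(1 2)

* keep respondents who agreed on at least two of the three items
keep if n_agree >= 2
```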

Error importing csv into Stata

I am having trouble importing my .csv into Stata. When I imported it, the following error message appeared:


Code:
Note: Unmatched quote while processing row 128; this can be due to a formatting problem in the file or because a quoted data element spans
multiple lines. You should carefully inspect your data after importing. Consider using option bindquote(strict) if quoted data spans
multiple lines or option bindquote(nobind) if quotes are not used for binding data.
I have tried to export the data here using dataex, but I get this error:

Code:
dataex
input statement exceeds linesize limit. Try specifying fewer variables
r(1000);
I have attached the excel spreadsheet as well. Any help is much appreciated.
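Following the note's own suggestion, a sketch of the two retries it proposes (filename hypothetical):

```stata
* if a quoted field spans multiple lines:
import delimited using "mydata.csv", clear bindquote(strict)

* or, if quotes in the file are not meant to bind fields:
import delimited using "mydata.csv", clear bindquote(nobind)
```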

multiplying two dummies to create interactive dummy

Hi, I have been thinking about two questions.

#1. Suppose you have two dummy variables, sex and age, and you want to create an interaction dummy. I tried two ways:
1. xi: i.sex*i.age
2. generating sexdummy*agedummy by hand
Both seem to give the same result.
Is it always OK to use the second way, simply multiplying two dummies to create an interaction dummy?

#2. If age is instead a continuous number (say 1-100) and I still need an interaction term, can I just multiply the two variables?

Thanks.
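Factor-variable notation sidesteps both questions, since Stata builds the interaction (dummy × dummy or dummy × continuous) itself; a sketch with hypothetical variables y, sex, age:

```stata
* two dummies: ## includes both main effects plus the interaction
regress y i.sex##i.age

* dummy interacted with a continuous variable (c. marks age as continuous)
regress y i.sex##c.age
```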

Subanalysis

Hello,

I have a few questions regarding interaction terms.

1. I have my final logistic regression model below:

Code:
Logistic regression                             Number of obs =        189
                                                LR chi2(5)    =      29.56
                                                Prob > chi2   =     0.0000
Log likelihood = -102.55384                     Pseudo R2     =     0.1260

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .9383918   .0332708    -1.79   0.073     .8753963     1.00592
         lwd |    2.54087   .9815007     2.41   0.016     1.191724    5.417377
       smoke |   1.585579   .5509269     1.33   0.185      .802469    3.132905
          ht |   4.017858    2.57528     2.17   0.030     1.143957     14.1117
         ptd |   4.322268    1.93573     3.27   0.001     1.796804    10.39735
       _cons |   .8722474   .7224864    -0.17   0.869     .1720227    4.422763
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.


2. Now I want to add an interaction term between smoke and lwd:

Code:
logistic low age smoke##lwd ht ptd

Code:
Logistic regression                             Number of obs =        189
                                                LR chi2(6)    =      32.20
                                                Prob > chi2   =     0.0000
Log likelihood = -101.23846                     Pseudo R2     =     0.1372

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .9433269   .0337749    -1.63   0.103     .8793985    1.011903
             |
       smoke |
     smoker  |   2.263877   .9304785     1.99   0.047     1.011579    5.066477
       1.lwd |    4.67193    2.53338     2.84   0.004     1.614089    13.52275
             |
   smoke#lwd |
   smoker#1  |   .2850451   .2207858    -1.62   0.105     .0624602     1.30084
             |
          ht |   3.719531   2.426126     2.01   0.044     1.035805    13.35668
         ptd |   4.002717   1.800033     3.08   0.002     1.657933    9.663683
       _cons |   .6733606   .5741697    -0.46   0.643     .1266002    3.581468
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

3. From my first model, can I report that lwd (OR 2.54), ht (OR 4.02), and ptd (OR 4.32) are risk factors for low birth weight, and from my second model explain only the interaction between smoke and lwd (OR 0.285)? Or should the second model be my final model, with all results reported from it? If so, how should I interpret smoker, 1.lwd, and smoke#lwd together? In the first model smoke was not a significant risk factor, but in the second model it is.

4. How do I decide which interaction terms to include in a multivariable logistic regression model? Which analysis should I run to choose interaction terms?
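On question 4, one common approach is a likelihood-ratio test of the model with the interaction against the model without it; a sketch using the variables from the output above:

```stata
* fit the model without, then with, the interaction, and compare
logistic low age i.smoke i.lwd ht ptd
estimates store base
logistic low age i.smoke##i.lwd ht ptd
estimates store inter
lrtest base inter   // small p-value suggests the interaction improves fit
```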

Thank you for your time and help in advance.

Venn diagram

Hi, I am trying to draw a Venn diagram with 4 variables, but the community-contributed pvenn command allows 3 variables at most. I also tried the venndiag command, but Stata crashes. I tried twoway as well, but could not get it to work.
Do you have any advice?


Thanks for your help.

Vittorio

How to Handle Heterogeneity in Panel Data using Systems GMM approach

Dear Stata Users,

I am currently researching foreign direct investment (FDI) in Africa using the system GMM method. My measure of FDI is the FDI stock into Africa. My preliminary analysis shows panel heterogeneity in the data. I had read in Hansen (1982), Newey & West (1987), and White (2000) that GMM can deal with unobserved panel heterogeneity. I admit this was a bit of a stretch, as I simply assumed the heterogeneity away. My mentor insists GMM cannot solve the heterogeneity problem in panel data. I removed very small countries from my sample, yet the differences persist.

All my estimated parameters have acceptable signs and the model performs well (judging from the Sargan/Hansen tests).

Unfortunately, I haven't come across any work that has dealt with this specific issue or an appropriate justification that can rescue me. I will greatly appreciate the assistance.

Kind regards

How to model "Spatial Variability" or "Choice of Location"

I have survey data on 10,000 delivery persons. I have the number of deliveries they made in 49 neighborhoods (49 columns, plus one for "other neighborhoods") and one column for their earnings per hour (EPH). I am trying to show that a person's EPH depends on their choice of neighborhoods to work in.

For initial exploration, I divided the 10,000 persons into 10 groups (by EPH, low to high) and plotted the fraction of deliveries they made in different neighborhoods. The graph makes clear that higher-earning groups pick the rich neighborhoods more often. (I arbitrarily picked one neighborhood I know to be rich and one I know to be poor. I used the "fraction of deliveries made in neighborhood X" rather than the total, because delivery persons differ in years of experience.)

My question is: how do I statistically model the variation in EPH due to the choice of neighborhoods? What kind of regression model should I be studying? How do I deal with the fact that neighborhoods are spatially correlated?

Additionally,
1) I have data from the delivery company giving the true EPH per neighborhood, aggregated over ALL deliveries by ALL delivery persons: 125 rows for 125 neighborhoods, with two columns, EPH and total deliveries. How do I incorporate this information into my modelling?
2) I have the Well-Known Text (WKT) representation of the neighborhoods.

Thanks!

Question about diff-in-diff with multiple control groups and one treatment group

Hello!

I am running a Difference in Difference (DD) regression to see whether the introduction of a policy affected school enrolment for a particular social group (S1) as compared to three other social groups (S2, S3, S4) and I have data from before and after the policy was implemented.

The command for the regression is:
Code:
reg school hh_S2 hh_S3 hh_S4 post post_X_S2 post_X_S3 post_X_S4

where hh_S2 = 1 if the household belongs to social group S2 and 0 otherwise (similarly for hh_S3 and hh_S4),
and post = 1 for years after the policy and 0 before.

I am most interested in the effect of the policy on S1 school enrolment, because that is my treatment group. My question is: since I have multiple control groups rather than one treatment and one control group, what is the interpretation of the constant term and of the post coefficient? Is the coefficient on post the effect of the policy on school enrolment for social group S1, and the constant the expected school enrolment for S1 before the policy?

Thank you

Parametric survival analysis


Good morning. I am doing parametric survival analysis with streg. After streg, I obtained the graph below using stcurve. What should I do if I want the exact hazard ratio at a specific point in time, as shown in the graph?

[stcurve graph attached]

How to do the Reality Check and SPA test in Stata?

Dear Statalist,

I would like to know whether there are any community-contributed Stata commands for performing the Reality Check (White 2000) and the Superior Predictive Ability test (Hansen 2001).
So far I have not found anything.

Here is the link to White's paper: White, H. (2000). A reality check for data snooping. Econometrica, 68(5), 1097-1126. http://www.ssc.wisc.edu/~bhansen/718/White2000.pdf

Here is Hansen's paper: Hansen, P. R. (2001). An unbiased and powerful test for superior predictive ability (No. 2001-06). http://www-siepr.stanford.edu/workp/swp05003.pdf

Thanks a lot in advance for your help!
It is really appreciated!
Sincerely, Ning

