Channel: Statalist
Viewing all 72843 articles

Plot (kernel) density estimates as areas

This is a brief puff for an idea that has become standard in some quarters, but seems to deserve a bigger push until everyone who might care knows about it. Here is a reproducible example, which as always is indicative, not definitive.

Code:
sysuse auto, clear

gen where = _n + 4 in 1/45

local choices kernel(biweight) bw(5) at(where)

kdensity mpg if foreign, `choices' gen(x1 d1)

kdensity mpg if !foreign, `choices' gen(x0 d0)

gen rug1 = -0.004
gen rug0 = -0.008

twoway area d1 d0 where, xtitle("`: var label mpg'") color(orange%40 blue%40) ///
|| scatter rug1 mpg if foreign, ms(|) mc(orange) msize(medlarge) ///
|| scatter rug0 mpg if !foreign, ms(|) mc(blue) msize(medlarge) ///
legend(order(1 "Foreign" 2 "Domestic") pos(1) ring(0) col(1)) ///
ytitle(Probability density) yla(, ang(h)) xla(10(10)40)


Kernel density estimates are plotted by default in Stata as lines, meaning curves. It is elementary (meaning, fundamental) that area under the curve has an interpretation as probability.

Often area-based graphs say in a complicated way what could be said much more simply. Bad examples include bars with arbitrary bases that could just be replaced by point symbols for the values in question, or bars that start at zero, when not being zero is banal or irrelevant.

However, area graphs can be helpful when comparing two or more distributions. (Histograms work that way.) But then transparency becomes vital to see overlap clearly.

You can do something like this directly with kdensity or twoway density with the option recast(area). There is no special rationale for coding as above, although the default of truncating the density at the observed extremes can be unfortunate, so I typically work a little harder at setting up a wider grid on which to calculate estimates.
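By way of illustration, a minimal sketch of setting up such a wider grid (the names lo, hi, where2 and the widening by 10 are arbitrary choices, not a recommendation):

```stata
* sketch: extend the evaluation grid well beyond the observed extremes
summarize mpg, meanonly
local lo = r(min) - 10
local hi = r(max) + 10
gen where2 = `lo' + (_n - 1) * (`hi' - `lo') / 50 in 1/51
kdensity mpg if foreign, kernel(biweight) bw(5) at(where2) gen(x2 d2)
```

The density then tails off smoothly to (near) zero at both ends rather than being cut at the sample minimum and maximum.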

The immediate inspiration for this came from an excellent book by Claus Wilke. This is a link to a review I wrote with several detailed comments: https://www.amazon.com/gp/customer-reviews/R22MWD7RJ6QAFP

VECM with k variables and r cointegration relations

Hi,
If I'm testing for a cointegration relationship among, say, 7 variables and the Johansen test finds 3 cointegrating relations, how can I tell between which variables these long-run relationships hold? How should I interpret ce1, ce2 and ce3 in the output for each of the 7 variables?

Number of lags in Fisher-type Dickey-Fuller test (panel data)

Hello,

I am wondering how to determine the number of lags I should use when conducting the Fisher-type Dickey-Fuller test to test panel data for stationarity.

I have panel data on the quarterly reported costs ("Actual") of 289 projects ("ID") from 2013 to 2019. I declared my dataset as panel data using -xtset-. My research goal is to examine whether the quarterly costs can be predicted at the project level. My panel data are unbalanced, i.e., not all projects last 7 years. Also, I have some gaps, as not all projects reported consistently on a quarterly basis. Please find an example of the dataset below.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float ID byte QrtInt int YearValue float Date double Actual
2 1 2013 212           265.2611
2 2 2013 213 334.21302000000003
2 3 2013 214           671.4628
2 4 2013 215  929.6637700000001
2 1 2014 216          457.79724
2 2 2014 217 465.83238000000006
2 3 2014 218          443.06652
2 4 2014 219          170.27506
2 1 2015 220          188.16879
2 2 2015 221          272.98868
2 3 2015 222          245.32497
2 4 2015 223  439.7882000000001
end
format %tq Date
Because my panel data are unbalanced, I should use the Fisher-type Dickey-Fuller test (-xtunitroot fisher, dfuller-) to test for stationarity, right?

If yes, how would I determine the appropriate number of lags for the test? I am currently using an autoregressive AR(4) model as a prediction model, i.e., I use the costs of t-1 to t-4 for each project as the independent variables to predict the costs in the quarter t0. Does that mean that I need to also use 4 lags in the Fisher-type test to test my data for stationarity?

I decided to use 4 lags in the AR model after I plotted the partial autocorrelation function (PACF) of the time-series data (see corrgram below). However, I used the aggregated quarterly costs of all projects and declared my data as a time series (-tsset-). Is there also a way to test for partial autocorrelations within each project when using panel data (-xtset-)?

Code:
                                          -1       0       1 -1       0       1
 LAG       AC       PAC      Q     Prob>Q  [Autocorrelation]  [Partial Autocor]
-------------------------------------------------------------------------------
1        0.4765   0.5448   7.0652  0.0079          |---               |----    
2        0.4386   0.4124   13.279  0.0013          |---               |---     
3        0.3219   0.1706   16.762  0.0008          |--                |-       
4        0.6617   0.9623   32.089  0.0000          |-----             |------- 
5        0.1644  -0.6763   33.076  0.0000          |-            -----|        
6        0.1583  -0.3777   34.033  0.0000          |-              ---|        
7        0.0774   0.5882   34.273  0.0000          |                  |----    
8        0.3275  -0.1387   38.777  0.0000          |--               -|        
9       -0.0681   0.2185   38.982  0.0000          |                  |-       
10      -0.0709   0.0915   39.217  0.0000          |                  |        
11      -0.1254   0.0174   39.993  0.0000         -|                  |        
12       0.0551  -0.0434   40.153  0.0001          |                  |
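For reference, with 4 lags the test call would look something like the sketch below. Whether 4 lags are appropriate for the panel test, and how -xtunitroot- copes with the gaps in these data, are separate questions:

```stata
* Fisher-type (Dickey-Fuller) panel unit-root test with 4 lags
xtset ID Date
xtunitroot fisher Actual, dfuller lags(4)
```

The lags() option is required for the Fisher-type test, so some choice has to be defended either way.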
Thank you very much for your help.

Best regards,
Tobias

Oaxaca command

I have Stata/IC 16.1 and when I try to run the following code:

oaxaca logwage gradeat potentialexp7991 potentialexp_sq7991 sex i.year i.state, by(black) noisily

I get this:
command oaxaca is unrecognized
r(199);

Can I download the command somehow, or what can I do?
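oaxaca is a community-contributed command (by Ben Jann), not part of official Stata, which is why Stata does not recognise it out of the box. It can be installed from SSC:

```stata
* install the community-contributed -oaxaca- command from SSC
ssc install oaxaca
help oaxaca
```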

How to interpret a t-test?

Hi, I'm new to this forum and, essentially, not that good at statistics, but I need to complete a project on a multiple regression model for an exam. I have a dataset where the dependent variable is a discrete one, "quality of wine", and 10 independent variables. I ran t-tests for all the predictors and I don't know how to interpret them. The central problem is the hypothesized mean: I really don't know what value I should choose. So I really need help! I attached results from three t-tests on three of the independent variables.
PS. I really need some explanation, but I haven't studied advanced statistics, so please keep it as clear as possible! Thank you.

using Stata frame to store coefficients in a loop

Dear Stata users,

I would like to know if it is possible to use the new Stata -frame- commands to store the coefficients of a regression within a loop.
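It is. One way, as a sketch (the frame name results, the depvar/b_weight columns, and the auto-data models are all arbitrary illustrations), is to create a frame holding the results and -frame post- a row to it on each pass through the loop:

```stata
sysuse auto, clear

* a frame to accumulate one row per model: outcome name and slope on weight
frame create results str32 depvar double b_weight

foreach y in price mpg {
    regress `y' weight
    frame post results ("`y'") (_b[weight])
}

* the collected coefficients now live in their own frame
frame results: list
```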

Any help or guidance will be much appreciated.

All the very best,

JP

Homoskedasticity



I have created a residual-vs-fitted-value plot in order to check the assumption of homoskedasticity of my residuals. However, I don't know how to interpret the result, since my dependent variable is discrete. Could the graph be read as a random cloud?

Which statistical analysis test should be used?

Hello,

I'm looking into factors associated with depression (yes or no) among the elderly population in two different sites (A and B) and comparing those factors between sites A and B.

Any suggestions appreciated.

Set to missing if has non-empty value label

Hi

I am processing a large number of variables from different sources, too many to deal with by hand. For continuous/count variables, different values are used for different flavours of missing. For example, -9=refused, -8=inapplicable or 99=refused. Values used to indicate why a variable is missing in one variable may be genuine observations in another.

What they all have in common is that they are well labelled: only the values indicating the cause of missingness are labelled.

How can I set all observations with a non-empty value label in a variable to missing?
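One sketch, if I recall the extended macro functions correctly: for each variable, fetch its attached value label, then blank out any observed value that carries a non-empty label. The placeholder varlist_of_interest is hypothetical; the strict option makes the label lookup return an empty string for unlabelled values rather than the value itself.

```stata
* sketch: set labelled values to missing, variable by variable
foreach v of varlist varlist_of_interest {
    local lblname : value label `v'
    if "`lblname'" != "" {
        levelsof `v', local(vals)
        foreach x of local vals {
            local lbl : label `lblname' `x', strict
            if `"`lbl'"' != "" replace `v' = . if `v' == `x'
        }
    }
}
```

Using the various extended missing values (.a, .b, ...) instead of plain . would preserve the distinction between the different flavours of missing.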

With best wishes and thanks,

Jane

Appending and merging in a single loop?

Hi, I am stuck making a loop in which I can both append and merge.

I have the following files which are all 250 MB in size each with identical variable names/type.

1) I have named them as "1.dta", "2.dta", "3.dta" .... "784.dta".

2) I have another file "PEad_ret_test2.dta"

File "1.dta" has only one identifier variable, named "localid", which has a value of "1"; file "2.dta" has only one identifier variable, named "localid", which has a value of "2"; and so on. Furthermore, the other variable on the basis of which I will be merging is "date".

File "PEad_ret_test2.dta" has all the identifiers i.e., localid from 1-784.


3) I want to merge all the files in (1) with (2) and get a final file "company_N&S.dta"

Now what I want to do is make a loop in which, each time, Stata picks up a file like "1.dta", drops certain data, merges it with "PEad_ret_test2.dta", and then saves the result to "company_N&S.dta". Then it should move on to "2.dta", merge it with "PEad_ret_test2.dta", and so on.

I also want to keep a limited number of variables in the final file in order to keep the final size of the file small.

I have tried to write the following code for just "1.dta" and "2.dta" in order to test my loop.

Code:
clear all
cd "D:\test"
foreach num of numlist 1/2 {
use `num'.dta          // this step will use "1.dta" and then "2.dta" in this case.
keep if datatype_1 ==1 
drop date                // dropping this because this is the incorrect date
clonevar date = date_stata // this is the correct date
merge m:m localid date using "D:\test\PEad_ret_test2.dta", force
keep ticker dayofweek folder_number assetcode localid date_stata date
keep if _merge==3
save company_N&S.dta, replace
}
The problem with this code is that it overwrites the previous file.

For example, by the end of this loop, I only have "2.dta" merged with "PEad_ret_test2.dta". The contents of "1.dta" were no longer there after the merge.

Is there any way to solve this issue?
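One common pattern is to accumulate the merged pieces in a tempfile and append as you go, saving the final file only once, after the loop. A sketch of that idea is below; note that it swaps the m:m merge (which almost never does what one expects) for 1:m with keep(match), on the assumption that localid and date uniquely identify rows in each numbered file. That assumption should be checked against the real data, and the merge direction adjusted if it does not hold.

```stata
clear all
cd "D:\test"
tempfile building
save `building', emptyok replace       // start with an empty accumulator
foreach num of numlist 1/2 {
    use `num'.dta, clear
    keep if datatype_1 == 1
    drop date                           // incorrect date
    clonevar date = date_stata          // correct date
    merge 1:m localid date using "D:\test\PEad_ret_test2.dta", keep(match) nogenerate
    keep ticker dayofweek folder_number assetcode localid date_stata date
    append using `building'             // add the previous pieces back in
    save `building', replace
}
use `building', clear
save "company_N&S.dta", replace
```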

Thanks

Difference of proportions using svy in Stata 15

I have data from a national survey of Mexico, from which I want to compare the prevalence of obesity under two different criteria: the WHO criterion and a national one. The sample design was probabilistic, multistage, stratified and clustered, so I am using the svy machinery.
I would like to test whether there is a difference in proportions between the WHO and the national criteria. What command should I use?

svyset [pw= pondef],psu(code_upm) strata(est_var) singleunit(centered)
gen bmi = weight / height^2          // BMI is weight over height squared
gen bmi_cat_who = bmi
gen bmi_cat_nat = bmi
recode bmi_cat_who (min/24.999=0) (25/29.999=1) (30/max=2)
recode bmi_cat_nat (min/22.999=0) (23/24.999=1) (25/max=2)
*** in both cases 0=normal, 1=overweight and 2=obese
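Since every person is classified under both criteria, one way to compare the two prevalences is to estimate both proportions jointly with svy and test the difference with lincom. A sketch, assuming binary obesity indicators derived from the categorical variables (names obese_who and obese_nat are arbitrary):

```stata
* hypothetical obesity indicators under each criterion
gen byte obese_who = (bmi_cat_who == 2) if !missing(bmi_cat_who)
gen byte obese_nat = (bmi_cat_nat == 2) if !missing(bmi_cat_nat)

svy: mean obese_who obese_nat
lincom _b[obese_who] - _b[obese_nat]   // design-based test of the difference
```

Because the two proportions come from the same respondents, estimating them jointly lets lincom account for their covariance.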

3-way multi-level interactions... random intercept or random slopes?

Dear Statalist,

I have data for individuals (level 1) nested within countries (level 2). I'm interested in estimating how country-level (L21) characteristics moderate the interaction effects between two individual level continuous covariates (L11 x L12). What's the best way to model this situation? I find the discussions divided on incorporating random slopes. If so, I'm unclear what random slopes I should incorporate. Which of the following specifications would be the best?

mixed Y L11##L12##L21 || country:, robust

mixed Y L11##L12##L21 || country: L21, robust

mixed Y L11##L12##L21 || country: L11 L12, robust

mixed Y L11##L12##L21 || country: L11#L12, robust (which is not a possible syntax in stata 15 that I'm using)

I'm interested in making inferences of the form: the interaction effect increases by x units as L21 increases by one unit. I'm also not interested in the effect of L21 on its own, only as a moderator. So I'm looking for the simplest way to understand these moderating effects.

I would appreciate your answers. Thanks!


Interaction Terms and Random Slopes

I'm running a mixed-effects logistic regression (melogit) in Stata 16.0. The model comprises a series of fixed effects (including one cross-product interaction term [categorical by categorical]), a random intercept (geographical site), and a random slope, which is intended to be the same categorical interaction term as in the fixed portion. I'm having difficulty using categorical variables as random slopes with melogit, though I believe I could do so with the older "xi:" prefix, with hand-made dummy variables, or with meglm. My question, however, is: can I run a cross-product interaction term (i.categorical1##i.categorical2) as a random slope in Stata, so that the code runs as melogit y i.x1##i.x2 x3...xn || site: i.x1#i.x2? Or is the only way to convert this into a single new variable?
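If the interaction term itself is rejected in the random-effects equation, one workaround, purely a sketch and not necessarily the best specification for these data, is to collapse the two factors into a single categorical variable and use that as the random slope:

```stata
* hypothetical names; x1x2 enumerates the cells of the x1-by-x2 cross
egen x1x2 = group(x1 x2), label
melogit y i.x1##i.x2 x3 || site: i.x1x2, covariance(independent)
```

The covariance() choice matters here: with several indicator slopes per site, an unstructured covariance may be hard to estimate, so independent is a common starting point.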

Stratification for multiple linear regression

Hello,

I am running a multiple linear regression analysis to examine associations between blood glucose (continuous) and saturated fats (meeting recommendations: yes/no [binary]). As shown in the code below, we are adjusting for a number of variables (condensed for the purposes here).

We would like to stratify by age (continuous) and sex (binary). I tried using egen mystrata = group(age sex); however, I get an error that the option mystrata is not allowed when I run the command.

Code:
regress blood_glucose i.SF_energy age i.sex i.ethnicity i.education_group i.smoke i.bmi_gp, mystrata
cap drop myResiduals
predict myResiduals, r
sktest myResiduals
I was hoping to receive guidance on:
1. the code to correctly stratify for age and sex, and
2. whether I would need to remove "age i.sex" from the above code when doing so
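If "stratify" means fitting the model separately within subgroups, one sketch is below. The cut of age at 50 is entirely arbitrary and only for illustration; note that age and i.sex are dropped from the covariate list precisely because the models are now fit within strata of those variables (which answers question 2, on this reading).

```stata
* separate models by stratum (age dichotomised at an arbitrary 50)
gen byte age_gp = (age >= 50) if !missing(age)
foreach a in 0 1 {
    foreach s in 0 1 {
        display _newline "=== age_gp = `a', sex = `s' ==="
        regress blood_glucose i.SF_energy i.ethnicity i.education_group ///
            i.smoke i.bmi_gp if age_gp == `a' & sex == `s'
    }
}
```

If instead the goal is only to adjust for age and sex, the original single regression with age and i.sex among the covariates already does that, and no strata are needed.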

Thank you!

Creating a comparative box plot for linear regression output

Dear Statalist,

I'm hoping to get some advice as to how to draw a boxplot, comparing point-estimates and confidence outputs for regression models I'm running.

Below is an example of my regression code:

Code:
regress total_f1_new i.vi_cat_better gender edu_2cat _independent_home htn diabetes smoking alcohol
i.vi_cat_better ("visual impairment category, in the better eye") is recorded as either none/moderate/severe/blind, and is the independent variable of interest.
Output has options 1 (moderate), 2(severe) or 3(blind), compared to reference 0(no visual impairment).

I'm wondering whether it's possible to draw a plot with a point estimate and 95% CI for each of the 3 visual impairment categories, and one for the reference (in this case the intercept, with its 95% CI)?
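What is described here is usually drawn not as a box plot proper but with the community-contributed coefplot (Ben Jann, from SSC), which plots point estimates with confidence intervals directly from stored results. A sketch:

```stata
ssc install coefplot
regress total_f1_new i.vi_cat_better gender edu_2cat _independent_home ///
    htn diabetes smoking alcohol
* one marker + CI per visual impairment category; baselevels shows the
* reference category at zero rather than omitting it
coefplot, keep(*.vi_cat_better) baselevels xline(0) ///
    ciopts(recast(rcap))
```

Note that with baselevels the reference category is displayed at 0 by construction; the intercept itself, if wanted instead, can be kept with keep(_cons).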

If anyone has any advice, it would be greatly appreciated. Thanks for your consideration.

William

[HELP] Can I manipulate data in this way?

Hello Statalist,


So I am trying to get the total monthly children enrollment (variable: wic_children) for the states listed below (Washington and California). Any idea how I can go about getting this information? Would I need to create a new variable? Any help will be greatly appreciated!

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str21 state_name float year double(month date wic_children)
"California" 2015 12 671    251
"California" 2016  4 675  54109
"California" 2017  9 692   1166
"California" 2018 10 705  37881
"California" 2019  2 709  17373
"California" 2020  .   .      .
"Washington" 2015  2 661    308
"Washington" 2016 11 682     50
"Washington" 2017 12 695    250
"Washington" 2018 11 706  21055
"Washington" 2019  3 710   2403
"Washington" 2020  .   .      .
"California" 2015 11 670  55803
"California" 2015 12 671 267653
"California" 2015 12 671    751
"California" 2015  5 664  13512
"California" 2015  3 662    781
"California" 2015  1 660  15407
"California" 2015  3 662    396
"California" 2015  6 665  22067
"California" 2015  6 665  14876
"California" 2015 11 670   5998
"California" 2015 12 671    388
"California" 2015 12 671    333
"California" 2015 12 671  12881
"California" 2015 12 671   2415
"California" 2015  1 660   1410
"California" 2015  4 663   2463
"California" 2015  2 661   7437
"California" 2015  5 664  20035
"California" 2015 11 670  22862
"California" 2015  3 662  37012
"California" 2015  1 660  25310
"California" 2015  1 660   3052
"California" 2015  7 666   2192
"California" 2015  4 663  22384
"California" 2015  4 663   2446
"California" 2015  8   .      .
"California" 2015  7 666   1794
"California" 2015 12 671  57852
"California" 2015 10 669   2563
"California" 2015  3 662  21920
"California" 2015  6 665    788
"California" 2015  3 662    781
"California" 2015  4 663   1775
"California" 2015  8 667  59371
"California" 2015  2 661   2928
"California" 2015  1 660   7523
"California" 2015 10 669   2563
"California" 2015 10 669   1392
"California" 2015  8 667   6325
"California" 2015  6 665  24415
"California" 2015  8 667   3502
"California" 2015  6 665   7192
"California" 2015  9 668   1007
"California" 2015  3 662  13430
"California" 2015  8 667    279
"California" 2015  7 666   1414
"California" 2015  8 667    854
"California" 2015  4 663   9141
"California" 2015  4 663   7346
"California" 2015  7 666  15033
"California" 2015 12 671    957
"California" 2015  3 662   9002
"California" 2015  7 666    796
"California" 2015  3 662   2484
"California" 2015  7 666   2354
"California" 2015  8 667    750
"California" 2015  1 660  13983
"California" 2015  9 668    175
"California" 2015  7 666    280
"California" 2015  9 668  21742
"California" 2015  5 664  59217
"California" 2015  1 660  15605
"California" 2015  4 663  20163
"California" 2015  2 661  59846
"California" 2015  4 663   7346
"California" 2015  8 667  69862
"California" 2015 11 670   2457
"California" 2015 10 669   2773
"California" 2015  2 661  36381
"California" 2015  4 663    750
"California" 2015  1 660   6483
"California" 2015  7 666  14947
"California" 2015  7 666    143
"California" 2015  4 663  25031
"California" 2015  5 664   7210
"California" 2015 12 671  14009
"California" 2015 12 671    751
"California" 2015 11 670    760
"California" 2015  9 668  15029
"California" 2015  4 663    419
"California" 2015  8 667   2890
"California" 2015 11 670  55803
"California" 2015  7 666    164
"California" 2015 12 671   3927
"California" 2015  3 662   6479
"California" 2015  4 663    806
"California" 2015 11 670   1628
"California" 2015  3 662  11700
end
format %tmMCY date
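Monthly totals by state can be had with collapse, which sums within groups of state and month. Since collapse replaces the data in memory, a preserve/restore pair keeps the original intact; a sketch:

```stata
preserve
* total enrollment per state per month
collapse (sum) wic_children, by(state_name date)
format %tmMCY date
list, sepby(state_name)
restore
```

If the totals are wanted as a new variable alongside the original rows instead, egen total_month = total(wic_children), by(state_name date) does that without collapsing.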

Adjusted incidence rate using margins after Poisson regression

Hi all,

In my project, I would like to calculate the adjusted (for age, education, etc) incidence rates of the outcome for smoking and non-smoking groups and the difference in adjusted incidence rates between two groups. I have individual-level data (rather than aggregated data like dollhill3).

I learnt a lot from a discussion about the calculation using Poisson regression at https://www.statalist.org/forums/for...-after-poisson

Two approaches were mentioned in that thread:
A:
Code:
webuse dollhill3, clear
poisson deaths i.smokes agecat, irr exp(pyears)
margins smokes, predict(ir)
B:
Code:
webuse dollhill3, clear
poisson deaths i.smokes agecat, irr exp(pyears)
egen sm_spec_pyears = mean(pyears), by(smokes)
margins, over(smokes) exp(predict(n)/sm_spec_pyears)
I still have a few questions about the methods:

Q1. The two approaches give quite different results (although both could be called an 'adjusted incidence rate'). In your opinion, which approach is more commonly used, or more practical, in epidemiological studies?

Q2. No matter which approach I use, I need to describe the methods. Are the following descriptions appropriate and sufficient for an epidemiology journal?
For approach A
Adjusted incidence rates for smokers and non-smokers were calculated from the observed covariate data using Poisson regression, setting all participants to be smokers or non-smokers, respectively.

For approach B
Adjusted incidence rates for smokers and non-smokers were calculated by dividing the total predicted number of outcome events by the total person-years in the smoking or non-smoking group, respectively, after fitting a Poisson regression.
Q3. Can I use the -pwcompare- option in the -margins- command to get the 'incidence rate difference' between the two groups and claim this is an 'adjusted incidence rate difference'?
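For approach A, the pairwise contrast can be requested directly within margins; a sketch using the same example data:

```stata
webuse dollhill3, clear
poisson deaths i.smokes agecat, irr exp(pyears)
* pairwise difference in adjusted incidence rates between smoking groups
margins smokes, predict(ir) pwcompare(effects)
```

The pwcompare(effects) suboption reports the difference with its standard error, test statistic, and confidence interval.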

Thank you very much.

Logistic regression difficulties

I am using Stata 15.1 and am struggling to conduct a logistic regression.

I am using financial data that has 1912 observations and in my regression I am using 34 independent variables (14 numeric and 20 dummies). Upon importing the data I used destring for the numerical independent variables and subsequently generated my binary dummies from categorical variables.

I have no issues setting up the regression, but when I come to use logistic, Stata continually rejects the command, reporting insufficient or no observations.

I looked at past online queries about this, and the replies always seemed to revolve around destring / encode, but neither works in this situation. Hoping someone has the answer?
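A common cause of "no observations" here is listwise deletion: with 34 covariates, a single missing value in any one of them (for instance from a destring that left non-numeric entries missing) drops the whole observation. A quick diagnostic sketch, with hypothetical variable names y and x1-x34 standing in for the real ones:

```stata
* where are the missings, variable by variable?
misstable summarize y x1-x34

* how many complete cases are actually available to -logistic-?
egen nmiss = rowmiss(y x1-x34)
count if nmiss == 0
```

If that count is zero or tiny, the next step is to find which variables are doing the damage (misstable patterns is useful) rather than to change the regression command.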

Many thanks in advance,
Caleb

SYNTAX: missing as default value

Dear All!

How do I specify a missing value as the default value for an optional numeric option in the syntax declaration?

For example, consider the following code:
Code:
clear all

program define listsome
   version 16.0
   syntax varname, [maxval(real .)]
   
   list `varlist' if `varlist'<`maxval'   
end

program define listsome2
   version 16.0
   syntax varname, [maxval(real -999999)]
   if (`maxval'==-999999) local maxval=.
   
   list `varlist' if `varlist'<`maxval'   
end

sysuse auto
listsome2 price   // works, but clumsy implementation
listsome price    // doesn't work => Error 197: Invalid syntax

// end of file
I don't see a reason why a missing value may not be declared as a default value. I understand that the behavior corresponds to the documentation, as outlined in the section "option descriptor optional real value" of the help for syntax. But I don't understand why this behavior was chosen, not permitting missing values as defaults. (The same applies to .a, .b, ..., .z, i.e. all the missing values.)

Either code works if the missing value is passed as the option's value:
Code:
listsome price, maxval(.)
listsome2 price, maxval(.)
So, is there a more comfortable implementation of listsome that avoids a special value like the one shown in listsome2?
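One tidier route, if I remember the syntax descriptors correctly, is to declare the option as a numlist with the missingokay suboption; an omitted option then leaves the local empty, and the default of missing is set in one line:

```stata
program define listsome3
    version 16.0
    syntax varname, [maxval(numlist max=1 missingokay)]
    if "`maxval'" == "" local maxval .

    list `varlist' if `varlist' < `maxval'
end
```

This accepts listsome3 price, listsome3 price, maxval(5000), and listsome3 price, maxval(.) alike, without a sentinel like -999999.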

Thank you, Sergiy

i.variable

Hi, could you please help me understand when I should use i. before a variable and when not?
e.g: reg health_expenditures i.round age_hh age_sp educ_hh educ_sp i.female_hh i.indigenous hhsize i.dirtfloor i.bathroom land hospital_distance if enrolled ==1
Why do some variables have i. and some don't? Thanks.
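In short, i. marks a variable as categorical, so the regression gets one indicator per category rather than a single slope that treats the codes as a measured quantity. Variables like age or hhsize are genuinely numeric, so they carry no prefix; variables like round or dirtfloor are categories. A small illustration with the auto data:

```stata
sysuse auto, clear
* treats repair record 1-5 as one continuous slope (rarely what you want)
regress price rep78
* one indicator per repair-record category, relative to the base level
regress price i.rep78
```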

