Channel: Statalist
Viewing all 73182 articles

Weighted average and percentiles of subpopulations

Hey Statalists,

Thank you for looking at my posts and providing such a great resource through your contributions to this forum (it has helped a lot!). I am pretty new to Stata and am having trouble calculating the weighted mean and percentiles for subpopulations in my dataset.

About my data:
Census data from 1970 - 2015

What I would like to do:
Calculate the weighted mean, p10, p50, and p90 of "wages1999", using "newwt" as weights, for each industry and year. So that in the end I will have, e.g., the 10th percentile of wages1999 for industry X in year Y.


Where I am having trouble:
1. egen does not seem to allow weights.
2. pctile does not seem to allow by().
3. collapse does not seem to allow a variety of statistical functions, e.g. collapse (mean) wages1999 (p10) wages1999 (p50) wages1999 (p90) wages1999 (p99) wages1999 [aw = newwt], by(year ind1990code)


A random sample of 10 observations from my dataset with the relevant variables:
Code:
input int year float(ind1990code wages1999 newwt)
1970  60     12031 194000
1970 251     13847 261000
1970 682      2497  36375
1980 712 25118.775 204000
1980 732 36616.727 204000
1980 831  7217.775  81600
1990 831     38976 274176
2000 162     26900 125280
2000 351     57000 361080
2015 842     55200 118320
end
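For what it's worth, -collapse- can combine several statistics in one call as long as each statistic is given its own target name when the same source variable is reused. A sketch using the variables above (untested against the full dataset):

```stata
* each statistic of wages1999 gets a distinct new variable name
collapse (mean) mean_w=wages1999 (p10) p10_w=wages1999 ///
    (p50) p50_w=wages1999 (p90) p90_w=wages1999         ///
    (p99) p99_w=wages1999 [aw=newwt], by(year ind1990code)
```

Each row of the result would then hold the weighted mean and percentiles for one industry-year cell.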
Thank you!

Statistical Power estimation for fixed effects panel model?

This is not Stata specific and may not be an appropriate question; if so, moderators please delete. I've been asked to provide power estimates in support of a grant application. There will be 28 days of data on n subjects. There are 3 predictors, and a pure within-subject design is appropriate. Complicating this is that an X1 by X2 by X3 interaction effect is hypothesized. At this point I don't even have guesstimates of effect sizes, but even if I did, I don't really know where to start in terms of estimating sample size requirements. I'd be most appreciative if anyone could offer suggestions on how to approach this. Thanks much.

Orthogonalized Impulse Response Functions

Hi,

After running the var command, I would like to produce orthogonalized impulse response functions. However, using the "irf graph/table oirf" command shows the response (I think) to a 1 standard deviation innovation in the structural shock. Is there a command to show the orthogonalized IRF to a 1 unit innovation, or a simple workaround?

Thanks

Stata data types

The Stata manual declares the range for each data storage type in the entry on datatypes; for example, the range for byte is [-127, 100]. The interval is asymmetric because the largest values in the range are reserved for missing and extended missing values.

So the total number of values that can be represented by the byte type is: 127 (negatives) + 1 (zero) + 100 (positives) + 1 (mv) + 26 (extended mv) = 255.

But one byte can hold 256 different values! Upon careful inspection we can see that the value -128 (which in theory can be represented by a signed byte) is blacklisted. We can't create a byte variable with such a value in Stata:

Code:
clear
set obs 1
generate byte x=-128
results in a missing value in variable x.

The problem is that Stata doesn't do the same validation when opening datasets, and happily accepts value -128. This results in a rather strange behavior later.

Code:
clear
use "http://www.radyakin.org/statalist/2017/1371324-stata-data-types.dta"
clonevar V1=V0
summarize
Produces the following output:
Code:
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          V0 |          1        -128           .       -128       -128
          V1 |          0
while identical stats are expected for clones and originals.
  1. What is the purpose of blacklisting values like -128, -32768, etc. for their exact value types? For example, C# fits -128 nicely into a signed byte type (sbyte).
  2. Would it be possible to add validation to prohibit such values in input files, given that these values are subsequently incorrectly processed in Stata (as demonstrated by the clonevar/summarize example above)?
Thank you, Sergiy Radyakin

Expand observation

Dear Statalist,
I have a dataset with a partially panel, partially cross-section structure. One of my variables is the year in which the respondent started working, while another is the year in which the interview that generated the data was taken. I'd like to expand the number of observations, so as to have a "copy" of each observation for every year he or she worked, with a different year value but identical in every other respect. For instance, my first observation was interviewed in 2008 and declared having started to work in 2000. I'd like to have 8 observations for him, all the same, with different year values (2008, 2007, 2006, ...). Is it possible? Thank you so much to anyone for all the help.

Cmp simultaneous regressions: invalid syntax error.

I am trying to figure out how the cmp program exactly works.

However, I do not seem to manage to obtain the basic model, and I don't know where it goes wrong.

My model is:
  1. irrigation_dummy = croptype_dummy + ps1 + ps2+ps3 + ps4
  2. croptype_dummy = irrigation_dummy + ps1 + ps2+ps3 + ps4
irrigation_dummy has 2 categories (probit) and croptype_dummy has 6 categories (mprobit). Both dependent variables are therefore discrete!

I thought the Stata code should be this:
cmp (irrigation_dummy =croptype_dummy# ps1 ps2 ps3 ps4) (croptype_dummy = irrigation_dummy# ps1 ps2 ps3 ps4), indicators($cmp_probit $cmp_mprobit) qui tech(dfp)
But it does not work. The error I get is the one below. I am quite sure that the spelling of my variables is correct, so I do not know what I am doing wrong.

Equation croptype_dummy not found.
invalid syntax

error . . . . . . . . . . . . . . . . . . . . . . . . Return code 111
__________ not found;
no variables defined;
The variable does not exist. You may have mistyped the
variable's name.
variables out of order;
You specified a varlist containing varname1-varname2, yet
varname1 occurs after varname2. Reverse the order of the
variables if you did not make some other typographical error.
Remember, varname1-varname2 is taken by Stata to mean varname1,
varname2, and all the variables in dataset order in between.
Type describe to see the order of the variables in your dataset.
__________ not found in using data;
You specified a varlist with merge, but the variables on which
you wish to merge are not found in the using dataset, so the
merge is not possible.
__________ ambiguous abbreviation;
You typed an ambiguous abbreviation for a variable in your data.
The abbreviation could refer to more than one variable. Use a
nonambiguous abbreviation, or if you intend all the variables
implied by the ambiguous abbreviation, append a `*' to the end
of the abbreviation.

Replacing missing values in baseline with other specified observations

Hi,

So I've been provided with this dataset to analyse, and there are two observations for each time point, depending on the derivation used (or lack thereof), as below:

[screenshot of the data was attached here]


The baseline values with aOC dtype are empty, while those with empty dtype are present. This is the case for hundreds of different IDs. How can I replace the missing baseline values with the values present in the 'missing' dtype cells, without having to copy and paste hundreds of times? I only want to replace the baseline values and leave the other time points as is.
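Without seeing the actual data, one sketch under assumed variable names id, timept, dtype (either "" or "aOC"), and value; within each id/time point the empty-dtype row sorts first, so value[1] comes from that row:

```stata
* assumed names: id, timept (contains "baseline"), dtype ("" or "aOC"), value
* sort empty dtype first within each id/time point, then copy its value
* into the missing baseline rows only
bysort id timept (dtype): replace value = value[1] ///
    if missing(value) & timept == "baseline"
```

Only the baseline rows are touched; other time points keep whatever they had.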

I hope that made sense. Thanks in advance!

Tobit model and fixed effects

Dear Statalists,

I intend to estimate a SEM using panel data. For one equation the dependent variable is right-censored, so I wanted to estimate that equation using Tobit. The problem is that Tobit (xttobit) does not allow fixed effects, so the Hausman (FE vs. RE) test for panel data is not possible.
Does anybody have a hint on how to proceed? Thank you.

Meta analysis (corrected for sample size)

Dear Stata users,
I am doing a meta-analysis.
I have 1 study with a very narrow interval (larger sample size and negative effect)
and 4 studies with wider confidence intervals (small sample sizes and positive effects).
In Stata, does the meta-analysis random-effects model correct for sample size?
If not, is there a way I can correct for sample size?
Any suggestions for dealing with this situation would be welcome.



Thanks
sugan

Comtrade API in Stata

Hello
I aim to get trade data from the Comtrade site using the API from within a do-file. I found this page http://statadaily.com/2015/09/08/un-...-api-in-stata/ which describes two possible ways. Through the link, for the example given there:
import delimited http://comtrade.un.org/api/get?r=608...=TOTAL&fmt=csv
Alternatively, by a program. However, I got an error for both of these.

file http://comtrade.un.org/api/get? not found
server says file temporarily redirected to https://comtrade.un.org/api/get?
could not open url
r(603);

I have Stata/MP 13.0.
It would be great if someone could please look into this.

xtmixed and mean comparisons

Dear StataListers,

I have been asked to analyse data from an RCT study with 2 study arms (control vs. intervention), where measures were collected at 3 time points: baseline (prior to randomisation), 12 weeks, and 52 weeks follow-up. The intervention types are CBT vs. waiting list and the outcome is well-being.

The main aim of the study is to assess whether the changes observed at 12 weeks are sustained at 52 weeks. Hence, I am planning to use baseline scores as a confounder. I also plan to use multilevel modelling on patients as the design is repeated measures; hence, I reshaped the data to long format accordingly. I would like to make sure my approach is correct and so is my interpretation.

This is the command I am implementing:

xtmixed wellbing b.arm##i.time wellbaseline || BlindPIN: , var

Does the interaction term answer my question?



Code:
        WELL |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         arm |
     CONTROL |   3.709868   1.095133     3.39   0.001     1.563446    5.85629
      2.time |  -.1553125   1.005324    -0.15   0.877    -2.125711   1.815086
             |
    arm#time |
   CONTROL#2 |  -3.413797   1.422091    -2.40   0.016    -6.201044  -.6265495
             |
    baseline |    .427302    .095777     4.46   0.000     .2395824   .6150216
       _cons |   8.055717   2.607366     3.09   0.002     2.945374   13.16606

I noticed using the -test-command as below I obtain a similar answer.

test (0.arm#1.time-0.arm#2.time)-(1.arm#1.time-1.arm#2.time)=0


( 1) [WELL]0b.arm#1b.time - [WELL]0b.arm#2o.time - [WELL]1o.arm#1b.time +
[WELL]1.arm#2.time = 0

chi2( 1) = 5.76
Prob > chi2 = 0.0164

I would welcome suggestions on how to best address my research question and references I should be reading.

Thanks in advance!

603 error - help text

r(603) is documented "... file __________ could not be opened;
This file, although found, failed to open properly. This
error is unlikely to occur. You will have to review your
operating system's manual to determine why it occurred."

I found one discussion in the forum suggesting a timing problem in the interface for this error. My experience was much simpler: I had opened the XLS file in another application to edit the data before reading it into Stata. Although the file had been updated, the other application still had a lock on it. This does not seem to me "unlikely to occur". Close the other application and repeat the Stata operation to overwrite it.

forvalues loop problem

Dear stata users,

I have the following code

Code:
gen mvq=.
egen yearmonth = group(year month)
sum yearmonth
scalar maxy =r(max)
local k=maxy
egen nobs3 =count(mv), by(yearmonth)

forvalues i=1(1) `k' {
    qui xtile xmv=mv if yearmonth==`i' , nq(5) 
    qui replace mvq=xmv if yearmonth==`i'
    qui drop xmv
}

where "mv" is the variable for which I create 5 quantile categories and k=433 (the "yearmonth" variable I loop over takes 433 distinct values).
I face a problem: some yearmonth values have no observations. For instance, `i'=50 and `i'=51 are missing, so whenever the program reaches the 49th execution I get this message:
"nquantiles() must be less than or equal to number of observations plus one",
which clearly refers to the number of quantiles: since there are no observations for that yearmonth, there are also no quantiles in the 50th execution.
So:
1. Is there a way to skip the missing values in the loop so that it continues to, say, the 52nd execution?
2. Is there a way to account for the fact that there might be fewer than 6 observations, so as to skip the xtile command and leave mvq=. ?

I tried
Code:
qui xtile xmv=mv if yearmonth==`i' & nobs3 >=6, nq(5)
but it doesn't work.
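One way to sketch both ideas at once: count the usable observations first, and only call -xtile- when there are enough (here, at least 6 for 5 quantiles); otherwise mvq simply stays missing for that yearmonth:

```stata
forvalues i = 1(1)`k' {
    * how many non-missing mv values does this yearmonth have?
    quietly count if yearmonth == `i' & !missing(mv)
    if r(N) >= 6 {
        quietly xtile xmv = mv if yearmonth == `i', nq(5)
        quietly replace mvq = xmv if yearmonth == `i'
        quietly drop xmv
    }
    * empty or too-small yearmonths are skipped; mvq remains .
}
```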

Thank you,
Stylia

Specifying reference category for fixed effects

Hi Statalist,

I apologize in advance if this question has been covered before on another thread. I tried searching for a similar thread, but no luck.

I'm using fixed effects as part of a number of different regression specifications such as OLS, probit, logit, etc. I have a number of variables that I am controlling for using fixed effects. For example, I have a variable called "industry" for which there are observations such as "Life Sciences," "Software," etc.

When I run a regression and use fixed effects (i.industry), Stata automatically chooses a category (such as "Life Sciences") to drop from the regression as the reference category. Is there a way to specify which category is used as the reference category?
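There is: the ib#. factor-variable operator (or -fvset base-) selects the base level. A sketch, with hypothetical outcome y, covariate x, and an assumed level number 3 for the desired base category:

```stata
* make level 3 of industry the reference category for this model
regress y x ib3.industry

* or set the base once, so plain i.industry uses it in every command
fvset base 3 industry
regress y x i.industry
```

The same ib#. syntax works in probit, logit, and the other estimation commands that accept factor variables.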

Thanks for the help.

Spencer

Create dataset

Dear all,
my apologies if the question is quite simple, but I am not used to creating a dataset and I need your help. I have the values below and I need to obtain a distribution. I have an income range of 0-1 where the total income is negative and there are 283,468 taxpayers. When I copy and paste the values into Stata, they are all strings. How can I create my dataset and obtain a normal distribution? I hope someone can help me.
Thank you very much
Code:
Range income     Total income       N. taxpayers
0-1              -4 835 516 000          283 468
1-10000          14 002 838 000        2 506 533
10000-20000      71 204 885 000        4 749 939
20000-30000     119 519 432 000        4 784 796
30000-40000     142 509 645 000        4 102 418
40000-50000     129 949 670 000        2 906 925
50000-100000    383 600 971 000        5 641 489
100000-500000   264 531 927 000        1 674 865
>500000          72 251 237 000           56 397
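Once the pasted columns are in Stata as strings, the embedded spaces used as digit separators are what block conversion to numbers. A sketch, assuming hypothetical variable names total_income and n_taxpayers:

```stata
* strings like "14 002 838 000" become numeric once the spaces are dropped;
* the leading minus sign on the first row is handled automatically
destring total_income n_taxpayers, replace ignore(" ")
```

The income range column would need separate handling (e.g. splitting "20000-30000" into lower and upper bounds) before any distributional analysis.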

Oaxaca, normalize

Hi,

I am currently running a wage regression for males and females, and I use the oaxaca command to decompose the gender wage gap into an explained and an unexplained component.
As I include several categorical variables in my regression, I use the "normalize" option of the oaxaca command.
Now, when trying to interpret the effects of these categorical variables on the explained and unexplained gap, I am not quite sure whether I understand correctly how the normalize option works.

For instance, if I have a positive coefficient on a (not normalized) variable in the male regression and a lower positive coefficient in the female regression, this adds to the unexplained gap.
If this is the case for a variable that was included in the "normalize" option, does it mean the variable will probably still add to the unexplained component, but could also reduce it if using other base categories on average produces a negative effect on the unexplained gap?

I hope you understand my question, thanks for your help!

Ally

xtreg, fe -- equivalence using mixed?

Hi:

What would be the equivalent of a fixed effect (within estimator) regression in terms of the mixed command?

More specifically, if I have:
xtset id
xtreg y x t , fe

where x is a continuous, time-varying predictor, and t is the wave number.

What would be the equivalent of this using 'mixed'?

I'm studying both multilevel modeling and fixed effect models (within the econometrics tradition), and I'm trying to examine the differences/similarities.
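One hedged way to see the connection: the within estimator is numerically identical to least squares with a full set of id dummies (LSDV), so including i.id as fixed covariates reproduces the xtreg, fe coefficients, whereas a random intercept in -mixed- gives the RE model instead. A sketch with the names from the post:

```stata
xtreg y x t, fe        // within (fixed-effects) estimator
regress y x t i.id     // LSDV: identical coefficients on x and t
mixed y x t || id:     // random-intercept model: this is RE/GLS, not FE
```

With many panel units, the LSDV route may be slow or infeasible in -mixed-, which is one practical reason the two traditions use different estimators.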

Thanks.
Edgar K.

Equation level goodness of fit

Hello! I have just conducted an equation-level goodness-of-fit analysis in Stata, with the command estat eqgof, for a multiple-group SEM. I have no idea how to interpret it; I mean, what needs to be interpreted, and how? Could you please advise accordingly? The Stata PDF documentation does not seem clear about this. How could these statistics be interpreted? Thank you



estat eqgof /*Equation-level goodness of fit*/
/*
Group #1 (treatment=1; N=209)
-----------------------------------------------------------------------------
| Variance |
depvars | fitted predicted residual | R-squared mc mc2
-------------+---------------------------------+------------------------------
observed | |
sanctions | 1.731114 .1850652 1.546049 | .1069053 .3269637 .1069053
-------------+---------------------------------+------------------------------
overall | | .1069053
------------------------------------------------------------------------------
mc = correlation between depvar and its prediction
mc2 = mc^2 is the Bentler-Raykov squared multiple correlation coefficient

Group #2 (treatment=2; N=201)
------------------------------------------------------------------------------
| Variance |
depvars | fitted predicted residual | R-squared mc mc2
-------------+---------------------------------+------------------------------
observed | |
sanctions | 1.485331 .1830023 1.302329 | .1232064 .3510077 .1232064
-------------+---------------------------------+------------------------------
overall | | .1232064
------------------------------------------------------------------------------
mc = correlation between depvar and its prediction
mc2 = mc^2 is the Bentler-Raykov squared multiple correlation coefficient

Group #3 (treatment=3; N=201)
------------------------------------------------------------------------------
| Variance |
depvars | fitted predicted residual | R-squared mc mc2
-------------+---------------------------------+------------------------------
observed | |
sanctions | 1.349877 .1418881 1.207989 | .1051119 .3242096 .1051119
-------------+---------------------------------+------------------------------
overall | | .1051119
------------------------------------------------------------------------------
mc = correlation between depvar and its prediction
mc2 = mc^2 is the Bentler-Raykov squared multiple correlation coefficient
*/

CSV with variable names and loop to keep variables with these names

Hi everyone,

My issue is the following:
I have a dataset from which I want to keep specific variables, and the names of the variables I want to keep are stored in a CSV file. Do you know how I could create a loop over this CSV file of variable names and use Stata's keep command, so as to automate the procedure and avoid having to write "keep variable1 variable2 ..."?
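One sketch, assuming the hypothetical file varnames.csv holds one variable name per row under a header column called varname:

```stata
* read the list of names, collect them into a local macro,
* then restore the data and keep only those variables
preserve
import delimited using "varnames.csv", varnames(1) clear
local keepvars
forvalues i = 1/`=_N' {
    local keepvars `keepvars' `=varname[`i']'
}
restore
keep `keepvars'
```

If the CSV has no header row, drop the varnames(1) option and use the default column name v1 instead of varname.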

Thank you very much for your help,

Kind regards,

Alexandre

why does -tpm- give me an output and -churdle- fails to do so when they should both report same results?

Dear all,

I am trying to fit a double-hurdle model in Stata 14 using the new churdle command. Essentially I have data with a heavy proportion of zeros, so I want to model the first step as a probit and the second as a linear regression with a truncated distribution.

What I do not understand is that when I use the churdle command I get the following error message: "initial values not feasible"
However, when I run the command using the tpm command (which should give the same), I do get an output and no error message.
Here are the 2 commands:

Code:
churdle linear sales_kg l_har_crop l_price $HH $farm i.group transport i.region $sell_maize $prod_maize if zaocode==11 & time==1, ///
select($HH $farm $exclusion l_har_crop l_price i.group i.region) ll(0) vce(robust)
Code:
tpm (first: sales_kg = $HH $farm $exclusion l_har_crop l_price i.group i.region) (second: sales_kg= l_har_crop l_price $HH $farm i.group transport i.region $sell_maize $prod_maize) if zaocode==11 & time==1, f(probit) s(regress) vce(robust)
From what I understand, the tpm command does what the churdle command should do, i.e.: " The first part models the probability that depvar>0 using a binary choice model (logit or probit). The second part models the distribution of depvar | depvar>0 using linear (regress)" (quote from tpm help file).

Does anyone know why, although theoretically I should get the same with both commands, only the tpm gives me an output?

Many thanks in advance

Basile