How to match and merge data for a family from different subfiles in the CHNS database?

November 18, 2016, 6:32 pm

I'm using the CHNS (China Health and Nutrition Survey) data for an empirical analysis: http://www.cpc.unc.edu/projects/china/data
I need to use five modules: ID, matching, individual data, parents' data and family data.

Each person is identified at three levels, commid + hhid + line (community ID, household ID and member ID, respectively), which are included in each subfile. These subfiles also give information on parents and children:

m12bith gives the mother's hhid + line and the child's hhid + line, gender and date of birth, but I don't know how to distinguish mothers from children, or how to match each mother with her own children and their dates of birth.
m12relatives gives a5 (relationship to head of household) and whether the person took part in the survey in each survey year (1989, 1991, 1993, ..., 2011), expressed by the binary variable "surveyed" (1 = yes, 0 = no); another column, wave, gives the survey year.
m12rst gives a5b (father's line number), a5d (mother's line number), and whether the record comes from the child or the adult survey.

So the first question is how to match children with their own parents, and how to match each mother with her own children and their dates of birth, using these subfiles.
2.

m12emw (ever-married women survey) gives s47a (total number of children given birth to) and s47a (number of children who died).
m12wed gives s216 & s218 (the child's number of brothers and sisters) and s220 & s222 (the husband's number of brothers and sisters).
pexa (physical exam) gives everyone's height, for both adults and children.

So the problem is how to work out the number of children each family has from these files and those from question 1.
And then, how do I merge these data into a panel data set that includes the member details of each family (father, mother, children, number of children, height of each person, and date of birth) and whether each member took part in a given survey year?
3.
There is also education, income and subsidy information. How do I merge these with the basic information above using the key commid + hhid + line?
I don't know whether I have expressed these questions clearly, but I am really stuck. Any help, or at least a hint, would be welcome; both Stata code and pointers to threads would be really helpful. Thanks very much!
PS: these are the subfiles we need
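
The merge on commid + hhid + line described above can be sketched as follows. This is only a minimal illustration on made-up toy data, not the actual CHNS layout: it assumes both files also carry a wave variable and hold one row per person and wave, and "height" and "father_height" are hypothetical variable names. The same pattern with a5d in place of a5b would attach the mother's details.

```stata
* Toy stand-in for m12rst: parents' line numbers (all values are made up).
clear
input long commid long hhid byte line int wave byte a5b byte a5d
1 101 1 1989 . .
1 101 2 1989 . .
1 101 3 1989 1 2
end
tempfile rst
save `rst'

* Toy stand-in for pexa; "height" is a hypothetical variable name.
clear
input long commid long hhid byte line int wave double height
1 101 1 1989 170
1 101 2 1989 160
1 101 3 1989 110
end

* Both toy files hold one row per person and wave, so merge 1:1 on the full key.
merge 1:1 commid hhid line wave using `rst', nogenerate

* Look up the father's height by using his line number (a5b) as the key.
preserve
keep commid hhid line wave height
rename (line height) (a5b father_height)
tempfile fathers
save `fathers'
restore
merge m:1 commid hhid a5b wave using `fathers', keep(master match) nogenerate
list
```

Rows with a missing a5b simply stay unmatched, so parents and children without a recorded father keep a missing father_height.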

↧

Find matches

November 19, 2016, 12:59 am

Hi all,

I have already searched the forum but did not find a satisfying answer. I have the following problem: I have two groups of firms and I want to test the difference in means for a certain variable. As a first, simple step I just do an unmatched test. In a second step, however, I want to check whether the groups differ when I match the firms on some variables.

My question is whether propensity score matching makes sense, as the variable of interest might affect whether a company is in one of the groups. All other papers I have seen that use PSM do so to investigate the ATE, but as far as I understand this effect is only valid if the outcome variable is observed after the treatment.

Would rangejoin be a reasonable approach?

Thanks in advance,
Felix

↧

AIC and BIC model selection

November 19, 2016, 1:14 am

Do we use BIC, instead of AIC, when AIC is unstable?

I ran -varsoc x, maxlags(5)-, and both AIC and BIC showed 5 lags.
However, when I increase maxlags(), the BIC does not change and still shows 5 lags, while the AIC changes with each maxlags() choice. There seems to be a pattern: AIC picks n-3 lags, where n is the maxlags() value. So if I run -varsoc x, maxlags(30)-, AIC chooses 27 lags.

Does this make sense? Should I trust BIC and use 5 lags, as it consistently picks 5 lags regardless of the maxlags() I use?

Thanks

↧

Portfolio construction

November 19, 2016, 1:20 am

Hi there,

I am relatively new to Stata; I have done OLS, time series and panel data work in it for various undergraduate projects. I am currently working on my dissertation, for which I need to do the following:

Construct a portfolio, and the returns of that portfolio, from several stocks of specific Asian countries. For instance, I need a portfolio of Indian stocks (listed on the BSE and NSE), which I will then break down into a portfolio of Indian beverage companies. I think once I know how to construct a portfolio I can apply it to sectors/countries.

I want it to be market cap weighted. I was told in passing that I would need to write my own program for this using loops.

Is this correct? If not, any help with how to get going on it would be greatly appreciated.

Many thanks
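
A market-cap-weighted portfolio return can be sketched without any loops, using by-group totals. The example below is a minimal illustration on made-up data; the variable names ticker, date, ret and mcap are assumptions, and in practice the weights are often based on the previous period's market cap rather than the current one.

```stata
* Toy data: two hypothetical stocks over two dates (all values are made up).
clear
input str4 ticker int date double ret double mcap
"AAA" 1 0.10 100
"BBB" 1 0.00 300
"AAA" 2 0.02 200
"BBB" 2 0.04 200
end

* Weight each stock by its share of total market cap on that date.
bysort date: egen double total_mcap = total(mcap)
gen double weight = mcap / total_mcap

* Portfolio return on a date = sum over stocks of weight * return.
gen double wret = weight * ret
collapse (sum) port_ret = wret, by(date)
list
```

A country or sector portfolio would follow from the same steps after restricting the sample first, for example with keep if on a (hypothetical) country or sector variable.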

↧

command used in "Bank regulatory capital and liquidity: Evidence from US and European publicly traded banks" paper

November 19, 2016, 1:33 am

Hello,

Could anyone help me identify which command is used in this paper, please?
Especially when they say "...we use GMM method...." and " ....we replace bank-level variables with their one-year lagged value...", and they say that they use one-year lagged instruments.
Is it the gmm command or the xtivreg command, given that it is panel data with fixed effects?

This is the link to the paper: https://halshs.archives-ouvertes.fr/...87426/document

Thank you
Lina

↧

Compare the regression slopes of two different predictors in the same regression model

November 19, 2016, 2:17 am

Dear all,

With a logistic regression, I am trying to compare the coefficients of two different predictors of the same dependent variable, in order to see which one is more important/salient for predicting the DV.

I'm not sure whether the -lincom- command is appropriate in this context. For example,

Code:

lincom _b[x2] - _b[x1]

Could you give me some advice? Thx!
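
As a runnable illustration of the -lincom- idea above, here is a minimal sketch on Stata's bundled auto data; mpg and weight merely stand in for x1 and x2. Note that the difference of raw coefficients depends on the units of the predictors, so this tests equality of the two coefficients rather than which predictor is "more important".

```stata
* Bundled example data; mpg and weight stand in for the two predictors.
sysuse auto, clear
logit foreign mpg weight

* Difference between the two coefficients, with its standard error and CI.
lincom _b[mpg] - _b[weight]

* Equivalent Wald test of the null that the two coefficients are equal.
test mpg = weight
```

If the predictors are on different scales, standardizing them before fitting the model is one common way to make such a comparison more meaningful.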

↧

A second question about: how do I select a subset of observations using a complicated criterion?

November 19, 2016, 7:20 pm

Hello,

I also wish to work with a subset of observations. I believe this subset is defined by a particular set of criteria that I haven't figured out how to properly ask Stata to apply. Mainly, I am trying to find the average time between lagged variables across multiple subjects (22 id numbers), multiple days (14 days), and multiple time periods that have been randomly generated (eight per day; 112 total per subject, 285 for the entire data set).

I have tried to conceptualize this, and think the following might better describe my coding question:

If "Day1" "Prompt1" "Var1" is equal to or greater than 0, and if "Day1" "Prompt2" "Var2" is equal to or greater than 0, then select the "Timestamp" associated with both variables and give me the average time between the Day1 Prompt1 Timestamp and the Day1 Prompt2 Timestamp.

I will want this for every day (1-14) and every participant (1-22).

I have read the forum and the answers from Nicholas J. Cox in the thread with the same title, but am hoping for a bit more direction.

In gratitude,
Jenn
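
One possible reading of the rule above can be sketched as follows. This is only an illustration under strong assumptions: the data are taken to be in long form with one row per id, day and prompt, the variable names id, day, prompt, var1, var2 and timestamp are hypothetical, and the timestamp is assumed to be numeric (here, minutes). The missing-value checks matter because Stata treats missing as larger than any number, so var1 >= 0 alone would also be true for missing values.

```stata
* Toy data: one subject, two days, two prompts per day (values are made up).
clear
input byte id byte day byte prompt double var1 double var2 double timestamp
1 1 1 3 . 600
1 1 2 . 2 660
1 2 1 1 . 600
1 2 2 . . 700
end

* Time elapsed since the previous prompt on the same day, kept only when
* var1 at the earlier prompt and var2 at the later prompt are both >= 0.
bysort id day (prompt): gen double gap = timestamp - timestamp[_n-1] ///
    if prompt == prompt[_n-1] + 1                                    ///
     & var1[_n-1] >= 0 & !missing(var1[_n-1])                        ///
     & var2 >= 0 & !missing(var2)

* Average qualifying gap for each subject.
egen double mean_gap = mean(gap), by(id)
list
```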

↧

Difference in Difference Possible with Variables Split between Pre/Post?

November 19, 2016, 8:05 pm

Stata community,

Forgive my newness to the site and to Stata (first post!), but I have an urgent question that has been keeping me up!

I'm conducting a DiD on a dataset from JPAL (https://dataverse.harvard.edu/datase...l:1902.1/11389) that studies the effect of microfinance on education expenditure. I'm looking at the effect of access to microloans on individuals in randomized treatment families in Hyderabad, India, between 2007 and 2009. The study that used these data put considerable effort into cleaning them and getting the data collection right, so my concerns do not lie there, just in manipulating the data for a DiD.

In the traditional DiD fashion, I am trying to create the treatment, post, and treatment*post variables and then run the "diff" command; however, my dataset organizes everything by pre and post. Every variable has a before version (educ_exp_mo_pc_1, visityear_1, etc.) and an after version (educ_exp_mo_pc_2, visityear_2, etc.). The result is that I cannot create a time variable that activates the characteristics I need. See picture here to get a general idea:

So here are my questions boiled down:
1. How do I do a DiD with variables split between pre and post? Is there a function that activates certain variables when I'm focusing on the post period (the idea would be: if visityear_2==2009, then consider only variables ending with _2)? Is there a new variable I could generate to accomplish this?
2. Would a manual approach work?

I did ("sum educ_exp_mo_pc_2 if treatment" - "sum educ_exp_mo_pc_1 if treatment") - ("sum educ_exp_mo_pc_2 if !treatment" - "sum educ_exp_mo_pc_1 if !treatment"), but I'm aware this only captures averages and not interactions. Suggestions?

I would appreciate any and all help/code/suggestions!

-Christian
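
One common way to handle pre/post variables stored side by side is to reshape the data from wide to long, which creates the time variable directly. The sketch below uses made-up toy values; the identifier hhid is an assumption, while educ_exp_mo_pc_1, educ_exp_mo_pc_2 and treatment follow the names in the post.

```stata
* Toy wide data: one row per household (all values are made up).
clear
input int hhid byte treatment double educ_exp_mo_pc_1 double educ_exp_mo_pc_2
1 1 10 16
2 1 12 17
3 1 11 18
4 0 11 13
5 0  9 12
6 0 10 11
end

* Wide to long: one row per household and period (1 = pre, 2 = post).
reshape long educ_exp_mo_pc_, i(hhid) j(period)
rename educ_exp_mo_pc_ educ_exp
gen byte post = period == 2

* The coefficient on 1.treatment#1.post is the DiD estimate.
regress educ_exp i.treatment##i.post, vce(cluster hhid)
```

Further stubs such as visityear_ can be listed in the same reshape long call. In this toy example the interaction coefficient equals the manual difference of mean changes, (17 - 11) - (12 - 10) = 4, with the regression adding standard errors on top.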

↧

Household size

November 19, 2016, 10:19 pm

I have 8968 households in my survey data. Some data sets are at the individual level and others are at the household level. I want to transform my data sets to the household level. After merging, my data look like the following:

Household_members	Slno
1	1
1	2
1	3
2	1
2	2
2	3
2	4
2	5
2	6
3	1
3	2
3	3
3	4
3	5

After merging, the data contain 39697 observations. Please advise me on how to generate household size from the above table.
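
A minimal sketch using the table above, assuming that Household_members is the household identifier and Slno is the member's serial number within the household: the household size is then the number of rows in each by-group.

```stata
* The table from the post, entered as toy data.
clear
input int Household_members byte Slno
1 1
1 2
1 3
2 1
2 2
2 3
2 4
2 5
2 6
3 1
3 2
3 3
3 4
3 5
end

* _N inside a by-group is the number of rows, i.e. members, in the household.
bysort Household_members: gen int hhsize = _N

* Keep one row per household to move to the household level.
by Household_members: keep if _n == 1
list Household_members hhsize
```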

↧

How do I Choose Only Respondents that are At least 75% participating?

November 19, 2016, 11:11 pm

This is my first time here so I will try to be as clear as possible.

We have a survey that has 12 parts to a single question coded as: q11_1, q11_2, etc.

I have mass recoded each part into five categories of:

0 = none
1 = 1-17
2 = 18-34
3 = 35-52
4 = 53+

Now, since each part has this same categorical coding, I realized that it's possible for the same person to have put "0" (none) for many of the parts. So how do I filter the data, or create an index I guess, that ONLY includes people who did NOT have "0" for at least 75% of the questions?

So basically an index of only those respondents who had 4 or fewer "0" (none) responses?

Thanks in advance, and if this is a common question, my apologies, I wouldn't even know what to search for to have found the answer myself.
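
Counting the zeros across q11_1 to q11_12 for each respondent can be sketched with egen's anycount() function. The toy data below are made up, and the cutoff of 4 follows the post; note that a strict 75% of 12 parts would mean at most 3 zeros, so the threshold may need adjusting. Missing answers are not counted as zeros here and would need a separate decision.

```stata
* Toy data: three respondents who answer 2 on every part (values are made up).
clear
set obs 3
forvalues j = 1/12 {
    gen byte q11_`j' = 2
}
* Respondent 2 answers "none" on 4 parts, respondent 3 on 5 parts.
forvalues j = 1/4 {
    replace q11_`j' = 0 in 2
}
forvalues j = 1/5 {
    replace q11_`j' = 0 in 3
}

* Count how many of the 12 parts equal 0 for each respondent.
egen byte n_zero = anycount(q11_1-q11_12), values(0)

* Flag respondents with 4 or fewer "none" answers.
gen byte in_sample = n_zero <= 4
list n_zero in_sample
```

Later commands can then be restricted with if in_sample.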

↧

white methodology heteroskedasticity correction stata

November 19, 2016, 11:44 pm

Hello,

How can I get corrected t-statistics using White's method in Stata for panel data with fixed effects (I use the xtivreg2 command)? I would also like to know whether I can calculate them manually; could I get the formula, please?
Thank you

Lina
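
For reference, the textbook form of White's heteroskedasticity-robust variance for a fixed-effects regression estimated by OLS on within-demeaned data can be written as below. This is a sketch under stated assumptions, not the exact formula any particular command implements: the tildes denote demeaned regressors, the residuals are those of the fitted model, software usually applies an additional small-sample scaling, and for an IV estimator the regressors in the sandwich are replaced by their first-stage fitted values. The user-written xtivreg2 documents a robust option; its help file is the place to check the exact variant.

```latex
\widehat{\operatorname{Var}}(\hat\beta)
  = (\tilde X'\tilde X)^{-1}
    \Bigl(\sum_{i}\sum_{t} \hat e_{it}^{\,2}\, \tilde x_{it}\,\tilde x_{it}'\Bigr)
    (\tilde X'\tilde X)^{-1},
\qquad
t_j = \frac{\hat\beta_j}{\sqrt{\bigl[\widehat{\operatorname{Var}}(\hat\beta)\bigr]_{jj}}}
```

The corrected t-statistic for coefficient j is thus the estimate divided by the square root of the j-th diagonal element of this matrix.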

↧

"Instruction referenced memory that could not be written. Stata MP2 crashes

November 19, 2016, 11:59 pm

Hi there

I'm running the -traj- command in Stata 14 MP2 on a Windows box, and I can successfully run the command for 2-group, 3-group and 4-group models, but the (interesting) 5-, 6-, 7- and 8-group models are crashing. I have successfully run the code on a sample of 300,000 obs, but now want to run it on 2.5m obs.

I get an error:

The instruction at 0xec01790c referenced memory at 0x0000000. The memory could not be written. Click on OK to terminate the program.

and then Stata 14 MP2 crashes.

Can anybody suggest ways to resolve this error? Is it a case of needing more RAM to allow the swap space to write the trajectories, or is there something odd that I need to reconfigure? I have reduced the file by deleting redundant fields and using -compress-.

Interestingly, I get this error regardless of whether the program runs on Stata 14 MP installed on a Windows 2012 server or on a Windows 7 machine. There is 32 GB of RAM, and nothing else of note is running. Files are run from the local drive rather than across a network.
Many thanks
Dan

↧

How do I stop a simulation program from looping forever?

November 20, 2016, 12:35 am

I have set simulationReps to 10 and -obs- to 100 just to test whether the code works, but it never stops looping.
Is there a general solution to the problem, such as a specific command that could display the answer instead of my manually counting the result of each simulation?

Thanks
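
Since the original code is not shown, the cause of the endless loop cannot be diagnosed here. As a general pattern, though, Stata's -simulate- command runs a program a fixed number of times and collects the returned results in a data set, so no manual counting is needed. The sketch below is a hypothetical example: the program name onesim and the statistic it returns are assumptions, not the poster's code.

```stata
* One replication: draw 100 observations and return their mean.
capture program drop onesim
program define onesim, rclass
    drop _all
    set obs 100
    gen double x = rnormal()
    summarize x
    return scalar mean = r(mean)
end

* Run exactly 10 replications; results replace the data in memory.
set seed 12345
simulate mean = r(mean), reps(10) nodots: onesim

* One row per replication, ready to summarize.
summarize mean
```

Because reps() fixes the number of replications, the run ends after 10 calls as long as each call to the program itself terminates.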

↧

Interpretation of errors of RE xtlogit

November 20, 2016, 2:03 am

Dear all,

I ran the command -xtlogit depvar indepvar i.time i.p, re vce(robust)- on a panel data set of 500 observations. However, Stata produced the following results:

note: 0.place != 0 predicts failure perfectly
0.place dropped and 13 obs not used

note: 20.time omitted because of collinearity
note: 4.place omitted because of collinearity

Calculating robust standard errors:
calculation of robust standard errors failed

I understand that this error arises because the data are not sufficient to estimate the model, or because the model itself is too complicated.

But I want to ask about the meaning of the four notes:

note: 0.place != 0 predicts failure perfectly
0.place dropped and 13 obs not used

note: 20.time omitted because of collinearity
note: 4.place omitted because of collinearity

The variable place is categorical, with values 0, 1, 2, 3 and 4. The variable time is a time dummy running from 1 to 20.

Thank you very much.

↧

Concatenating Strings when collapsing data

November 20, 2016, 2:14 am

I would like to collapse variables by an ID. For numeric variables everything is working fine. For string variables I realized that there is no option to concatenate strings. So I am looking for some kind of work around (possibly including bysort and egen) before collapsing variables.

This is what my data look like:

ID	Text
1	AD AR
1	BD KL AD
2	AD SJ
2	FD WE RS

And I would like to have:

ID	Text	Concat
1	AD AR	AD AR BD KL AD
1	BD KL AD	AD AR BD KL AD
2	AD SJ	AD SJ FD WE RS
2	FD WE RS	AD SJ FD WE RS

So that when collapsing by ID, I can simply use the first concat value.

Even better would be if repeating values in Text would not be added to Concat, so that for ID 1 Concat would be "AD AR BD KL" and not "AD AR BD KL AD".

How do I do this in Stata?

Many thanks,
Milan
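
The running concatenation described above can be sketched with by-group subscripting. This is a minimal illustration on the example data from the post; the helper variables order, nwords and Unique are hypothetical names, and words are assumed to be separated by single spaces.

```stata
* Example data from the post.
clear
input byte ID str20 Text
1 "AD AR"
1 "BD KL AD"
2 "AD SJ"
2 "FD WE RS"
end

* Remember the original row order so the result is reproducible.
gen long order = _n

* Build a running string within ID, then copy the full string to every row.
bysort ID (order): gen Concat = Text if _n == 1
by ID: replace Concat = Concat[_n-1] + " " + Text if _n > 1
by ID: replace Concat = Concat[_N]

* Optional: keep only the first occurrence of each word.
gen Unique = ""
gen long nwords = wordcount(Concat)
summarize nwords, meanonly
forvalues k = 1/`r(max)' {
    replace Unique = Unique + " " + word(Concat, `k')                ///
        if `k' <= nwords                                             ///
         & strpos(" " + Unique + " ", " " + word(Concat, `k') + " ") == 0
}
replace Unique = trim(Unique)
list ID Text Concat Unique
```

After this, a collapse by ID with (first) Concat or (first) Unique would carry the concatenated string into the collapsed data.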

↧

Augmenting a model with country and time dummies

November 20, 2016, 2:22 am

Hello everyone!

This question doesn't pertain to Stata. It is rather related to econometrics.

In my thesis, I am studying the impact of various explanatory variables on profit made by banks. 59 Banks have been picked from 12 developing countries. Time period is 7 years.

I am facing some problems in interpreting my results.

1. Some of my variables are significant in the original model (which only includes explanatory variables and doesn't include dummies) but when I modify the model by including country dummies, these variables turn insignificant. What could be the reason behind this?

2. This point deals with a situation that is exactly the reverse of point 1. Some of my variables are insignificant in the original model, but when I include country dummies they turn significant. I don't understand why.

3. In some instances, a highly significant variable just loses significance upon inclusion of country dummies (i.e. it doesn't turn insignificant, just becomes less significant). What could be a general explanation for situations when a variable becomes less significant upon inclusion of country dummies?

4. In some situations, inclusion of country dummies doesn't change the significance status. A significant variable remains significant, but when I further modify the model by adding time dummies, the variable turns insignificant.

Can someone provide an answer to each of the above points?

↧

Estimating the value of a variable using its probability distribution

November 20, 2016, 3:16 am

Hi, all!

I have one question regarding Stata and the estimation of an unknown variable.

Is there any way of estimating the value of a variable in Stata when you know its probability distribution?
In this case I have a variable which is distributed on the interval (1, 2), with a 25% probability of 2, and uniformly distributed otherwise.

Thanks!
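
If "estimating the value" is read as finding the expected value, which is an assumption about what is being asked, the distribution above can be handled either analytically or by simulation. The sketch below draws from the mixture (a point mass at 2 with probability 0.25, uniform on (1, 2) otherwise) and compares the simulated mean with the analytic value 0.25*2 + 0.75*1.5.

```stata
* Simulate draws: 2 with probability 0.25, otherwise uniform on (1, 2).
clear
set seed 2016
set obs 100000
gen double x = cond(runiform() < 0.25, 2, 1 + runiform())

* Simulated mean of the variable.
summarize x

* Analytic expected value for comparison: 0.25*2 + 0.75*1.5.
display 0.25*2 + 0.75*1.5
```

The two runiform() calls are independent draws, so the choice of component and the uniform value do not depend on each other.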

↧

Pantest2

November 20, 2016, 3:38 am

Hello!

I am working on panel data -- xtreg, fe -- and I am looking for diagnostic tests for the regression.

I ran across the 'pantest2' command, and I need help interpreting its output. What do the values mean? Thanks!

Test for serial correlation in residuals
Null hypothesis is either that rho=0 if residuals are AR(1)
or that lamda=0 if residuals are MA(1)
LM= 83
which is asy. distributed as chisq(1) under null, so:
Probability of value greater than LM is 8.205e-20
LM5= 9.1104336
which is asy. distributed as N(0,1) under null, so:
Probability of value greater than abs(LM5) is 0

Test for significance of fixed effects
F= 30.925901
Probability>F= 1.388e-37

Test for normality of residuals

                    Skewness/Kurtosis tests for Normality
                                                          ------ joint ------
    Variable |        Obs  Pr(Skewness)  Pr(Kurtosis) adj chi2(2)   Prob>chi2
-------------+---------------------------------------------------------------
    __00000B |        166     1.0000        0.0000        46.72        0.0000

Yours,

Wenny

↧

How to do pscore(propensity score matching) by industry with loop

November 20, 2016, 3:58 am

Dear all,

I'm trying to do propensity score matching with pscore. I have data on about 1 million firms that need to be matched. These firms are in 36 industries, and I want to match the control group within each industry. My code is as follows:

use firms.dta
global vars lnl lntfp export valueadd

forvalues i = 1/36 {
    * estimate the propensity score within industry `i'
    qui pscore treat $vars if industry == `i', pscore(mypscore) blockid(myblock) logit
}

Then I get the error message "mypscore already defined". I know that this is because, by the time industry equals 2, mypscore already exists in the data set. But I have no idea how to write the loop.

Could anyone give me suggestions? Thanks a lot!

Cheers

Owen
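
One way around the "already defined" error is to let pscore write to temporary variable names on each pass and copy the results into a single pair of variables. The sketch below is untested and rests on assumptions: it reuses the pscore call from the post as-is (pscore is a user-written command, so its behavior should be checked against its help file), and ps_tmp and blk_tmp are hypothetical names. Note also that global macro names are case-sensitive, so $Vars and $vars are different macros.

```stata
use firms.dta, clear
global vars lnl lntfp export valueadd

* Containers that will hold the scores and blocks for all industries.
gen double mypscore = .
gen double myblock = .

forvalues i = 1/36 {
    * Remove leftovers from the previous pass, if any.
    capture drop ps_tmp blk_tmp
    quietly pscore treat $vars if industry == `i', ///
        pscore(ps_tmp) blockid(blk_tmp) logit
    * Copy this industry's results into the common variables.
    quietly replace mypscore = ps_tmp if industry == `i'
    quietly replace myblock = blk_tmp if industry == `i'
}
capture drop ps_tmp blk_tmp
```

Block numbers restart within each industry under this approach, so any later step that uses myblock would need to combine it with industry.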

↧
