Channel: Statalist

Dropping largest difference time variable

I have some long time series data with a time variable, year. However, there are a lot of duplicate years. I first considered removing the older duplicates, assuming the newer estimates were more accurate. That is to say, if the value for 1951 was estimated twice, the second estimate is presumably more accurate, so we drop the old one.

However, this caused some problems. The way I am considering now is to keep the years which feature the smallest change. For example, let's say we only have one 1950 value at .4. We have two values for 1951, .6 and .7. Since .6 is a smaller change from 1950, we keep the .6 observation and drop the .7 one. This seems to get more complicated when the year you are examining has alternate observations for the prior year. That is to say, what if 1950 also has three different estimates? Maybe there is some algorithm to run. Does anyone know of any techniques for this? Thanks.

Here is some example data
1950 0.4438
1951 0.2746
1952 0.215
1953 0.9189
1954 0.7192
1955 0.7332
1956 0.6545
1957 0.2492
1957 0.3382
1958 0.6456
1958 0.1853
1950 0.4664
1951 0.3202
1952 0.2473
1953 0.9355
1954 0.4428
1955 0.0049
1956 0.9164
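One greedy pass consistent with the rule described above: walk the years in order and, within each year, keep the observation whose value is closest to the previously kept value. This is only a sketch under assumed variable names year and value; ties within a year are broken arbitrarily by value, and the first year's pick is arbitrary.

```stata
* greedy "smallest change" de-duplication (assumed variable names: year, value)
gen byte keepme = 0
gen double dist = .
local last = .
quietly levelsof year, local(years)
foreach y of local years {
    * distance to the previously kept value (0 for the first year)
    quietly replace dist = cond(`last' < ., abs(value - `last'), 0) if year == `y'
    sort year dist value
    * mark the single closest observation in this year
    quietly by year: replace keepme = 1 if _n == 1 & year == `y'
    quietly summarize value if keepme & year == `y'
    local last = r(mean)
}
keep if keepme
drop dist keepme
```

Note that a greedy pass need not be globally optimal when several consecutive years each have multiple candidates; a dynamic-programming search over candidate chains would be, but it is rarely worth the effort for data like this.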

Creating new variables with total() and nvals(), hoping to avoid creating missing values with "if"

Hello,

I've been creating new variables in my dataset of panel data to show total days of follow-up by month of follow-up. I used the following code to do this successfully:
by month: egen followup= total(days)
I also needed to know the number of facilities monitored, and added this variable using the nvals() function:
by month: egen facilities=nvals(facility)

My question arose when I was trying to add further variables displaying only information regarding intervention and controls, but to all observations in the month (rather than interventions showing missing in the control column and vice versa). I initially tried the following code, which yielded missing values for controls:
by month: egen followup_int= total(days) if interv_site==1
by month: egen facilities_int=nvals(facility) if interv_site==1

I discovered a useful workaround for the total() function in Stata's FAQ pages:
by month: egen followup_int= total(days*(interv_site==1))

I'm wondering (a) whether there's another way to tell Stata to apply the total (found using the if qualifier) to all observations by month without resorting to the trick above, and (b) whether there is a way to make Stata similarly fill all cells in the new column with the number of intervention facilities monitored by month.
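One pattern consistent with the egen workflow above (a sketch using the post's own variable names): compute the conditional count with if, then spread it to every observation in the month with max().

```stata
by month: egen tmp = nvals(facility) if interv_site == 1
by month: egen facilities_int = max(tmp)   // copies the count onto the control rows too
drop tmp
```

The same max() spreading trick fills out any column that was computed on a subset to the whole by-group.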

I'm sure there are multiple ways of working around this, but thought I would ask the experts in case there's a simple command or option I'm not aware of. Thank you!

Julia

Count a dummy variable for each group

Hey everybody,

I want to count how many times a dummy variable equals one when an identifier takes a specific value.
Ultimately I want to compute the percentage of observations for which the dummy equals one within each identifier value.

I tried that code:
Code:
gen freq = _N if id==1
egen dummy  = count(mpg) if mpg==1 & id==1
egen cdummy = sum(dummy) if id==1
gen share = dummy/freq if id==1
It seems to work out.


But it would be even better if I can compute the percentage share for different values of id so that I don't have to repeat the command for each identifier.
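A by-group sketch of the same computation for every identifier value at once, assuming the dummy is named d and the identifier is id (substitute your own names):

```stata
bysort id: egen n_ones = total(d == 1)   // times the dummy equals one within each id
bysort id: gen share = n_ones / _N       // share within each id value
```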

Any suggestions?

Many thanks in advance.

Bene

OLS Regression

Hi experts,
I divided my data into 5 quintiles from low to high book leverage.
Now I have to run an OLS regression to test whether the amount of sales in quintile 1 is significantly different than in quintile 5.
I know this can be done with a t-test, but how do you set it up as an OLS?
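One standard setup (a sketch; quintile and sales are assumed names): restrict the sample to quintiles 1 and 5 and regress sales on an indicator for quintile 5. The t-statistic on the indicator reproduces the pooled-variance two-sample t-test.

```stata
gen byte top = (quintile == 5)
regress sales top if inlist(quintile, 1, 5)
* robust SEs give an unequal-variance comparison instead
regress sales top if inlist(quintile, 1, 5), vce(robust)
```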

Testing for collinearity and multiple collinearity in a conditional logistic regression model with imputed data

Dear Stata forum,

I have imputed a data set consisting of continuous and binary variables, and I am building a conditional logistic regression model with independent variables associated with the recurrence of TB infection (recurrence being my dependent variable). I believe some variables are highly correlated, e.g. interruption of drug treatment and reaction to medication. When I search online for methods to detect collinearity and multicollinearity, papers suggest the VIF, the condition index, and/or an unexpected direction of association between the outcome and explanatory variables as an important sign (http://www.nature.com/bdj/journal/v199/n7/full/4812743a.html). Using the last recommendation I believe I have detected collinearity, but I cannot use the VIF or the condition index with multiply imputed data. Is there a better approach for assessing a conditional logistic regression model for collinear variables when working with multiply imputed data?
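As a first informal diagnostic under multiple imputation, one can at least inspect pairwise correlations within a single completed dataset (a sketch; the variable names below are placeholders for the suspect pair):

```stata
mi xeq 1: correlate interrupt_treat med_reaction
```

This does not pool across imputations, but a large correlation in any completed dataset is already a warning sign.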

Many thanks for your help

Creating panel dataset

Hi everyone,

I am trying to reorganize an administrative dataset into a panel form.

Basically, I need to transform a dataset structured like this

Country Citizenship 1999 2000 2001 2002
XXXX A 15 16 17 18
XXXX B 16 17 18 19
XXXX C 17 18 19 20
YYY A 20 19 18 17
YYY B 19 18 16 15
YYY C 18 17 16 15

into this
Country Year A B C
XXXX 1999 15 16 17
XXXX 2000 16 17 18
XXXX 2001 17 18 19
XXXX 2002 18 19 20
YYY 1999 20 19 18
YYY 2000 19 18 17
YYY 2001 18 16 16
YYY 2002 17 15 15
Problem is, the dataset is quite big (there are 200 different citizenships, 88 countries, and 18 years), so I don't think my life span would be enough to do it manually.
I also tried the reshape command, but it is not working at all (no doubt I used it incorrectly).
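A double-reshape sketch, assuming the year columns are actually named y1999-y2002 (Stata variable names cannot begin with a digit, so some such prefix must exist in the real data):

```stata
reshape long y, i(Country Citizenship) j(Year)
reshape wide y, i(Country Year) j(Citizenship) string
rename y* *        // yA yB yC -> A B C
```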

Any advice?

Thank you in advance.
V.

FGLS for heteroskedasticity (WLS) with weights

Hi, a question more of process than actually having a problem.

So consider that we want to use the FGLS estimator to model heteroskedasticity. As Cameron and Trivedi (2010) show, this can be done with weighted least squares using aweight. The process is normally the following: estimate the homoskedastic model, predict the residuals, square them, estimate the variance model using the squared residuals as the dependent variable, predict the variance, and use 1/variance as the aweight. So far so good.
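That recipe, written out as a sketch with placeholder names y, x1, x2:

```stata
regress y x1 x2
predict uhat, residuals
gen uhat2 = uhat^2
regress uhat2 x1 x2                    // variance model
predict varhat                         // fitted variance
regress y x1 x2 [aweight = 1/varhat]   // FGLS step
```

(A common refinement regresses ln(uhat^2) instead and weights by 1/exp(fitted), which guarantees a positive fitted variance.)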

I was wondering: what happens if we want to use survey weights in our estimation? For example, we may be using survey data that provides weights and we want to use them. If we do all the intermediate estimations using the survey weights, do we also need to include them, in addition to the FGLS weighting, in the final step? If so, how? My thinking is that since the intermediate steps already used the survey weights, the variance prediction is affected by them, and thus we may not need to accommodate the survey weights again in the final estimation.

Help and thoughts would be appreciated, thanks!

Reference:
Cameron, A. Colin and Pravin K. Trivedi. 2010. Microeconometrics Using Stata. Revised ed. College Station, TX USA: Stata Press.

unstring/separate date variable with inconsistent format

Dear Statalist community,

The dates contained in one string variable have inconsistent formats, exemplified as follows
date
1961-1965
del 5-Ene-1930 - 8-Dec-1950
5-Ene-1931
I wanted to unstring this variable, but I cannot because it contains non-numeric characters.

I proceeded by eliminating "del" in Excel (I would appreciate ideas on how to do this directly in Stata) and then by using
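The Excel step can be done directly in Stata with subinstr():

```stata
replace date = subinstr(date, "del ", "", .)   // strips every occurrence of "del "
```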
Code:
 split date, p(" - ") 
which resulted in two variables like this

date date1 date2
1961-1965 1961 1965
del 5-Ene-1930 - 8-Dec-1950 5-Ene-1930 8-Dec-1950
5-Ene-1931 5-Ene-1931
I wanted to convert (especially the last row) to a Stata-readable date format, but given that date1 & date2 contain different formats of dates, it was not possible.

I proceeded by
Code:
split date1, p("-")
which resulted in three variables; for observations 2 and 3, the contents were the components of date1.

I then
Code:
egen new_date = concat (date1)
, and obtained:

new_date
1961
5Ene1930
5Ene1931
Now, given that "Ene" stands for "January" in Spanish, I am afraid I am still not able to achieve my goal of unstringing the "date" variable.
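One sketch: translate the Spanish month abbreviations with subinstr() before parsing (only the abbreviations that differ from English need mapping, e.g. Ene, Abr, Ago, Dic):

```stata
replace new_date = subinstr(new_date, "Ene", "Jan", .)
replace new_date = subinstr(new_date, "Abr", "Apr", .)
replace new_date = subinstr(new_date, "Ago", "Aug", .)
replace new_date = subinstr(new_date, "Dic", "Dec", .)
gen date_num = daily(new_date, "DMY")
format date_num %td
```

Bare years such as "1961" will still come out missing under the "DMY" mask and need a rule of their own.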

I would appreciate suggestions in this regard.

Problem with 95% CI from mean command

Hi all,

Apologies if this has been discussed before, but I couldn't find anything on it.

I'm using Stata 14.1, and having an issue with the command 'mean'. Specifically, the 95% confidence intervals it generates don't appear to be 95%. As an example, when I run mean on a variable in my dataset, it generates the following output:

. mean day

Mean estimation Number of obs = 212

--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
day | 1659.731 326.08 1016.939 2302.523
--------------------------------------------------------------


However, when I manually calculate the 95% confidence interval based on the formula:

x̄ ± z(0.05/2) × S.E.

I get a confidence interval of 1020.626 - 2298.836. Similar, but not exactly the same.

Working backwards from the CI generated by Stata, it looks like it's using a critical value slightly larger than 1.96, and the exact value varies depending on the variable used.

Not a major issue; I would just like to understand why it is doing this (assuming it's deliberate).
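For reference, the gap is exactly what one would see if mean used the t distribution with n − 1 degrees of freedom rather than the normal:

```stata
display invnormal(0.975)       // 1.96, the critical value used in the manual calculation
display invttail(211, 0.025)   // approximately 1.9713, which reproduces the interval above
```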

converting string to numeric values

I'm a new user (sorry!) and have followed all of the commands in books and online that I can find to convert my data from a string (e.g., 22-Nov-15 00:00:00) to numeric. I created a new variable and specified the expression date(Date, "DMYhms"). The new variable is created, but all of its values are missing. Any suggestions? Thanks!
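When the string carries a time component ("00:00:00"), one sketch is to parse it with clock() stored as a double rather than date() (the 2030 pivot for the two-digit years is an assumption):

```stata
gen double evtime = clock(Date, "DMYhms", 2030)
format evtime %tc
gen evdate = dofc(evtime)    // extract the daily date if that is all that is needed
format evdate %td
```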

[Problem] How to impose symmetry and homogeneity restrictions

Hello,
my question may seem very easy, but I don't know how to do it. How do I impose symmetry and homogeneity restrictions on a cost function in Stata, given only total cost and the input prices?
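For the homogeneity part in a single-equation log cost function, one sketch is a linear constraint forcing the price coefficients to sum to one (lntc, lny, lnp1-lnp3 are placeholder names):

```stata
constraint define 1 lnp1 + lnp2 + lnp3 = 1
cnsreg lntc lny lnp1 lnp2 lnp3, constraints(1)
```

Symmetry restrictions equate coefficients across equations, so they typically require estimating the cost-share equations as a system (e.g. with sureg or reg3 and cross-equation constraints) rather than a single equation.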

Thanks in advance.

Propensity score matching - with MICE, 1:k

Dear all,

I am fairly new to Stata. I work with the latest version of Stata.

I work on the propensity score matching analysis. My data set had missing values and MICE was performed. Using imputed models I obtained average ps-scores for my observations.
To obtain matches I use code "psmatch2 treat, pscore(average) n(2) cal(0.2)". And here I face a problem.

Namely, I would like to compare the baseline characteristics of matched patients in order to see whether some characteristics still differ significantly between groups after matching. But I can't find a good way to do this, as the matching was performed with replacement (I have no problem with 1:1 matching without replacement). When comparing patients I would like to obtain p-values and descriptive statistics (e.g. mean age of patients in the treatment and control groups, etc.).

regards
Natalia

Estimating elasticities using time series.

Hi Stata experts and enthusiasts.

I have a problem when trying to estimate the different elasticities in a time series dataset.
I am using the margins command like this.

xtreg z y x (i.Year), r
margins, eyex(y) at(x = (-0.35(0.05)2)) noatlegend
marginsplot, noci xlabel(-0.35(0.05)2)

My problem is that this command seems to evaluate all the observations of Y in the entire dataset at a fixed value of X.
I want corresponding observations of Y and X, i.e. all the values of Y given a value of X.

Would it be better to use percentiles? In that case I would want to specify the percentiles by my time variable, so that the data are sorted by time rather than from small to large for each variable.

//Erik

Drop observations by group size

I have the following data:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input matcheducage
 1
 2
 3
 4
 5
 6
 7
 7
 7
 7
 7
 8
 9
 9
10
11
12
12
13
14
end
I would like to drop the groups from my dataset which have only one observation, i.e. in this case groups 1,2,3,4,5,6,8 and so on.

Would you have any idea on how I could code this?
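A by-group sketch of the drop:

```stata
bysort matcheducage: drop if _N == 1   // within bysort, _N is the size of each group
```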

Thank you very much in advance for your suggestions!

Best,

Max

Postestimation for non-parametric ARFIMA models

Dear statalist-users,

I am running non-parametric arfima estimation using gphudak, roblpr and modlpr. I want to conduct different postestimations, e.g. use predict for in-sample forecasting. However, I get the following error messages:
- for gphudak: variable __000006 not found
r(111);

- for roblpr: variable __00000P not found
r(111);

- for modlpr: variable __00000E not found
r(111);

Is there any way to use predict for in-sample as well as out-of-sample (option dynamic) forecasting with gphudak, roblpr and modlpr?

Thank you very much in advance.

Kind regards,
Volker

Using tsappend with seqdate

Dear statalist-users,

I have a time series and I want to conduct ML arfima estimation. However, the variable of interest has missing observations for several dates. Accordingly, I create a new "date" variable using Professor Baum's routine (http://www.stata-journal.com/sjpdf.h...lenum=dm0028):

quietly generate byte notmiss=variableofinterest<.a
quietly generate seqdate=cond(notmiss, sum(notmiss),.)
tsset seqdate

I then run arfima for the first half of my observations; let's say, I have 1,000 observations: arfima variableofinterest if tin(1, 500)

The trouble starts now: I want to conduct dynamic (out-of-sample) forecasts. Therefore, I need to append 500 observations before I conduct the dynamic forecast. However, tsappend does not do its job as seqdate is not a date-variable as such:

tsappend, add(500)
the time variable may not be missing
r(198);

Any Idea what I can do about it?
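One workaround consistent with that error message (a sketch): rows where the variable of interest is missing have a missing seqdate and carry no information for the tsset series, so drop them before appending:

```stata
drop if missing(seqdate)
tsset seqdate
tsappend, add(500)
```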

Thank you very much in advance.

Kind regards,
Volker

Creating a dummy variable in a panel

Dear Stata Users,

I need to create a dummy variable recording whether, the last time I observed a given event, it had a certain property. To be more precise, I have a typical panel dataset with the following structure:
ID Time sold_drugs group
1 1 0 0
1 2 1 0
1 3 1 0
1 4 . .
2 1 1 1
2 2 0 0
2 3 1 0
2 4 0 0
And my dummy should equal 1 if, the last time subject i sold drugs, he was also part of a group (last time: sold_drugs==1 & group==1). Therefore, the data with the dummy should look like this:
ID Time sold_drugs group Dummy
1 1 0 0 0
1 2 1 0 0
1 3 1 0 0
1 4 . . .
2 1 1 1 0
2 2 0 0 1
2 3 1 0 1
2 4 0 0 0
I want the dummy to have the value of zero for the first observation of subject i so I can still use that observation in my regressions. In other words, I want the dummy to be missing only if that particular observation is missing. Any advice on how to do this?
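A carry-forward sketch matching the table above (using the post's variable names): within ID, take the group status of the most recent sale, or carry the previous value of the dummy when the previous period was not a sale:

```stata
bysort ID (Time): gen byte Dummy = 0 if _n == 1 & !missing(sold_drugs)
by ID: replace Dummy = cond(sold_drugs[_n-1] == 1, group[_n-1], Dummy[_n-1]) if _n > 1
replace Dummy = . if missing(sold_drugs)
```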

Thanks,
Diego

Problems with regression coefficient comparison (clustered standard errors)

Dear Statalist,

I have a large panel dataset with which I perform a regression analysis. Afterwards, I compare two regression coefficients within one estimation. The dataset includes 50 countries and 40 quarter-years. I use the "areg2gen" command to absorb one fixed effect (i.e. 2000 country-quarter-years) and cluster the standard errors in two dimensions (i.e. quarter-year and country). Unfortunately, whenever I try to perform an F-test on two specific regression coefficients, Stata reports:

Code:
 Constraint 1 dropped

       F(  0,  1590) =       .
            Prob > F =         .
I am wondering why this is the case. When I cluster in only one dimension [with the "areg" command and vce(cluster Country) or vce(cluster QuarterYear)], everything works just fine.

I will gladly provide more information on my data if required. Thanks for your help!

Delete observations

Dear all,
I have a data set (irrc_2) with observations regarding companies’ directors, sorted by CUSIP.


The data are from 2007-2014 (Variable: DataYear)


I want to keep observations only for companies (per CUSIP) for which I have data from 2007 for that specific company. Any suggestions?


For example: there is a company (CUSIP: 1) for which I have data only for the years 2013 & 2014; how can I tell Stata to delete all the observations for that specific company?
Unfortunately there aren't empty cells in DataYear for the missing years (2007-2012); otherwise it would be a very easy task…
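A sketch of the usual tag-and-keep pattern:

```stata
bysort CUSIP: egen byte has2007 = max(DataYear == 2007)
keep if has2007        // drops every company never observed in 2007
drop has2007
```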
Thank you in advance.

I use StataMP 13.1 in Windows 10.

Problems with zscore standardizing

Hello everyone,

I used the search option and did not find anything similar to my problem. If you do have in mind a discussion about this topic, please refer me to it.

In order to display my findings in a way that is easy to read, I would like to standardize the variables. It is my understanding that this process does not change the significance level of the variables. Unfortunately, I was not able to achieve this.

Some details about my data that might be of help to figure out where my mistake is:

I have a panel dataset and I estimate an RE xtlogit model with a binary dependent variable (dep). Further, there are two explanatory variables: one is continuous between 0 and 1 (cont01) and the other is a dummy variable (dum1). I use both of them with a one-period lag, and I use an interaction term between the two. Additionally there are five control variables: two are dummy variables (cd1 cd2), one is categorical (c3) and two take positive integer values (e.g. age) (c4 c5).

Here is what I did:

1. I did my usual regression model: xtlogit dep c.l.cont01##c.l.dum1 cd1 cd2 c3 c4 c5, re

2. I used the zscore command for cont01, c3, c4, and c5 without any additional options: zscore cont01 c3 c4 c5

Stata replied:
z_cont01 created with 47 missing values
z_c3 created with 0 missing values
z_c4 created with 0 missing values
z_c5 created with 0 missing values

These are the same numbers of missing values present in the data before.

3. I did the same regession as in (1) but with the variables obtained from zscore: xtlogit dep c.l.z_cont01##c.l.dum1 cd1 cd2 z_c3 z_c4 z_c5, re

I get the same results in terms of significance for all the variables except l.dum1. Its p-value was 0.076 before and is now 0.818. The significance of the interaction term and of l.cont01 remains the same.

Where is my mistake? How can I get standardized results?

Thanks to everybody for taking the time. Every thought or comment is appreciated.

Best,
Keith