Channel: Statalist

Find first stage F-stats under xtivreg with factor variables (so no xtoverid)

It seems to be well documented (here: https://www.statalist.org/forums/for...port-fvvarlist or here: https://www.stata.com/statalist/arch.../msg00707.html) that xtoverid does not work when factor variables are included in a regression using xtivreg.

I am using factor variables in an xtivreg regression, and I would like to know the first-stage F-statistic for my excluded instruments. Is there any way to do this without using xtoverid?

If there is no post-estimation command that works to do this, I can of course separately run what I think is the 1st stage, and test my excluded variables myself. From page 20 of the manual (https://www.stata.com/manuals/xtxtivreg.pdf) it looks like I would first (a) remove all fixed effects using xtreg, then (b) run a 2SLS regression of my 1st stage using ivreg or ivreg2. Does anyone know if this is indeed the best manual approximation of the first stage of xtivreg?

no space for concat

Hi,
I want to concatenate the following variables of the dataset. The outcome is not what I want to get: I want only one space between the numbers, but there is a wide space. How can I remove the extra spacing?

----------------------- copy starting from the next line -----------------------
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1(dev_active_proj_sm_1 dev_active_proj_sm_2 dev_active_proj_sm_3 dev_active_proj_sm_4 dev_active_proj_sm_5)
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  "2" ""  "" "" 
""  ""  ""  "" "" 
""  ""  "3" "" "" 
"1" ""  ""  "" "" 
"1" ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
"1" ""  ""  "" "" 
"1" ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  "2" ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  "2" ""  "" "" 
""  "2" ""  "" "" 
""  "2" ""  "" "" 
"1" ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  "2" ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  "2" ""  "" "" 
""  "2" ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
"1" ""  ""  "" "" 
""  "2" ""  "" "" 
""  "2" ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
"1" ""  ""  "" "" 
""  ""  ""  "" "" 
"1" ""  ""  "" "5"
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  "2" ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  "2" ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  "2" ""  "" "" 
""  ""  ""  "" "" 
"1" ""  ""  "" "" 
""  "2" ""  "" "" 
""  ""  ""  "" "" 
"1" ""  ""  "" "" 
""  "2" ""  "" "" 
"1" ""  ""  "" "" 
"1" ""  ""  "" "" 
""  ""  ""  "" "" 
"1" "2" ""  "" "" 
""  "2" ""  "" "" 
""  ""  ""  "" "" 
""  "2" ""  "" "" 
""  ""  ""  "" "" 
""  "2" ""  "" "" 
""  ""  "3" "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
"1" ""  ""  "" "" 
"1" ""  ""  "" "" 
""  ""  ""  "" "" 
"1" ""  ""  "" "" 
"1" ""  ""  "" "" 
"1" ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  "3" "" "" 
""  ""  ""  "" "" 
""  "2" ""  "" "" 
""  "2" ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
""  ""  ""  "" "" 
"1" ""  ""  "" "" 
end
------------------ copy up to and including the previous line ------------------

Listed 100 out of 1298 observations
Use the count() option to list more


My code:

Code:
egen concat_var = concat(dev_active_proj_sm_0-dev_active_proj_sm_5), p(" ")
The output looks like the following:
----------------------- copy starting from the next line -----------------------
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 concat_var
"0"     
"0"     
"0"     
"2"     
"0"     
"3"     
"1"     
"1"     
"0"     
"0"     
"0"     
"1"     
"1"     
""      
"0"     
"2"     
"0"     
"0"     
"2"     
"2"     
"2"     
"1"     
""      
"0"     
"0"     
"0"     
"0"     
"0"     
"2"     
"0"     
"0"     
"2"     
"2"     
"0"     
"0"     
"0"     
"0"     
"1"     
"2"     
"2"     
"0"     
"0"     
"0"     
"1"     
"0"     
"1    5"
"0"     
"0"     
"0"     
"0"     
"0"     
"0"     
"2"     
"0"     
"0"     
"2"     
"0"     
"0"     
"0"     
"0"     
"0"     
"0"     
"2"     
"0"     
"1"     
"2"     
"0"     
"1"     
"2"     
"1"     
"1"     
"0"     
"1 2"   
"2"     
"0"     
"2"     
"0"     
"2"     
"3"     
"0"     
"0"     
"1"     
"1"     
"0"     
"1"     
"1"     
"1"     
"0"     
"0"     
"0"     
"3"     
"0"     
"2"     
"2"     
"0"     
"0"     
"0"     
"0"     
"0"     
"1"     
end
------------------ copy up to and including the previous line ------------------

You will see in the output the value "1    5": there is more than one space between the two numbers, but I really want only one space, i.e. "1 5". Please guide me on how to proceed.
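A minimal sketch of one possible fix (not from the thread; it assumes the wide gaps come from the empty cells being joined with the separator, and uses the variable range shown in the data example): concatenate as before, then collapse runs of internal blanks with itrim() and strip leading/trailing blanks with trim().

```stata
* hypothetical fix: collapse the repeated separators left by empty cells
egen concat_var = concat(dev_active_proj_sm_1-dev_active_proj_sm_5), p(" ")
replace concat_var = itrim(trim(concat_var))
* a row holding "1" "" "" "" "5" should now yield "1 5" instead of "1    5"
```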

Cronbach's alpha for respective countries

Hello,
I'm trying to construct a table of how countries score on an index of mine. I can find the mean for each country (tab country, sum(index)) and the overall Cronbach's alpha (alpha item*, casewise).
But how can I estimate the Cronbach's alpha of my index for each country?
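One possible sketch (not a tested answer; it assumes country is a numeric or encoded variable and the items are named item*): since alpha does not take a by: prefix, loop over the country values with levelsof and restrict each run with if.

```stata
* compute Cronbach's alpha separately within each country (a sketch)
levelsof country, local(countries)
foreach c of local countries {
    display as text _newline "Country `c':"
    alpha item* if country == `c', casewise
}
```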

Best regards

New version of xcontract on SSC

Thanks as always to Kit Baum, a new version of the xcontract package is now available for download from SSC. In Stata, use the ssc command to do this, or adoupdate if you already have an old version of xcontract.

The xcontract package is described as below on my website. The new version is updated to Stata Version 16, and has a new frame() option, allowing the user to save the output dataset (or resultsset) in a data frame.

Users of older versions of Stata can still download versions of xcontract in Stata 10 or Stata 8 from my website by typing, in Stata,

net from http://www.rogernewsonresources.org.uk/

and selecting a Stata version at or below their own in which to download xcontract.

Best wishes

Roger

------------------------------------------------------------------------------------------
package xcontract from http://www.rogernewsonresources.org.uk/stata16
------------------------------------------------------------------------------------------

TITLE
xcontract: Create dataset of variable combinations with frequencies and percents

DESCRIPTION/AUTHOR(S)
xcontract is an extended version of contract. It creates an output
data set with 1 observation per combination of values of the
variables in varlist and data on the frequencies and percents of
those combinations of values in the existing data set, and,
optionally, the cumulative frequencies and percents of those
combinations. If the by() option is used, then the output data set
has one observation per combination of values of the varlist
variables per by-group, and percents are calculated within each
by-group. The output data set created by xcontract may be listed to
the Stata log, or saved to a data frame, or saved to a disk file, or
written to the memory (overwriting any pre-existing data set).

Author: Roger Newson
Distribution-Date: 26december2019
Stata-Version: 16

INSTALLATION FILES (click here to install)
xcontract.ado
xcontract.sthlp
------------------------------------------------------------------------------------------
(click here to return to the previous screen)

New version of xsvmat on SSC

Thanks once again to Kit Baum, a new version of the xsvmat package is now available for download from SSC. In Stata, use the ssc command to do this, or adoupdate if you already have an old version of xsvmat.

The xsvmat package is described as below on my website. The new version has been updated to Stata Version 16, and has a frame() option to allow the user to save the output dataset (or resultsset) in a data frame.

Users of older versions of Stata can still download old versions of xsvmat, compatible with their Stata versions, by typing, in Stata,

net from http://www.rogernewsonresources.org.uk/

and selecting a Stata version in which to download xsvmat.

---------------------------------------------------------------------------
package xsvmat from http://www.rogernewsonresources.org.uk/stata16
---------------------------------------------------------------------------

TITLE
xsvmat: Convert a matrix to variables in an output dataset

DESCRIPTION/AUTHOR(S)
xsvmat is an extended version of svmat. It creates an output
dataset (or resultsset), with one observation per row of either an
existing matrix or the result of a matrix expression, and data on
the values of the column entries in that row, and, optionally,
extra variables specified by the user. The output dataset created
by xsvmat may be listed to the Stata log, or saved to a data
frame, or saved to a disk file, or written to the memory
(overwriting any pre-existing dataset).

Author: Roger Newson
Distribution-Date: 28december2013
Stata-Version: 16

INSTALLATION FILES (click here to install)
xsvmat.ado
xsvmat.sthlp
---------------------------------------------------------------------------
(click here to return to the previous screen)

Using loop for regression and saving results as a new variable

Happy new year to Statalist!


I'm stuck with an old problem (using loop for regression). The variables are smoking (DV), country, education and income. The goal is to regress 'smoking' on 'education' and 'income' for all seven countries, and to save the ORs for each country. I tried the following code, which is not producing anything:

Code:
local saving n.dta

 foreach n in ` country' {
statsby _b, by(country): logit smoker i.education i.income
  eststo
}

Thanks in advance for your insights. Here is the dataex:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(country education income smoker)
4 0 3 1
2 0 3 1
4 0 3 1
5 0 3 1
4 0 3 1
3 0 3 1
7 0 3 1
3 1 4 1
3 0 4 1
5 0 4 1
3 0 4 1
2 0 4 1
4 0 4 1
5 0 2 1
4 0 2 1
7 0 3 1
1 1 3 1
7 0 2 1
2 0 2 1
6 0 2 0
1 2 2 1
7 0 2 1
3 0 2 1
3 0 2 1
6 0 3 1
2 0 3 1
5 0 2 1
1 0 2 1
5 0 2 1
5 0 2 0
7 0 2 1
7 0 2 1
1 0 2 .
7 0 2 1
1 0 2 .
3 0 2 1
4 0 2 1
4 0 2 1
2 0 2 .
1 0 2 .
4 0 2 1
6 0 3 1
6 0 3 1
6 0 3 1
1 0 3 .
2 0 3 1
7 0 2 0
3 0 2 1
1 0 2 1
1 0 2 .
1 0 3 .
3 0 4 1
7 0 4 1
1 0 4 .
2 0 4 1
2 0 1 1
1 0 3 1
1 0 3 1
3 0 1 1
3 0 2 1
2 0 3 1
1 0 3 1
1 0 3 .
7 0 2 1
3 0 2 1
4 0 3 1
3 0 3 1
2 0 3 1
2 0 3 1
3 0 3 1
2 0 3 .
1 0 3 .
5 0 1 1
2 0 1 1
5 0 2 1
2 0 2 1
7 0 1 1
2 0 1 1
1 0 1 1
7 0 1 1
1 0 1 .
1 1 1 .
3 0 1 1
3 0 1 1
4 0 1 1
4 0 1 1
2 0 1 1
1 0 1 1
3 0 1 1
5 0 1 1
1 0 1 1
5 0 1 1
3 2 2 1
5 0 1 1
4 0 1 1
5 0 1 1
2 0 1 .
1 0 1 .
2 0 3 1
2 0 3 1
end
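A possible simplification (a sketch, untested): the foreach wrapper is not needed, because statsby already runs the command once per by()-group; as written, ` country' is an undefined local, so the loop body never executes at all. Saving per-country coefficients to a file could look roughly like this (the filename country_logits is made up for illustration):

```stata
* one logit per country; _b (the log-odds coefficients) is saved by default
statsby _b, by(country) saving(country_logits, replace): ///
    logit smoker i.education i.income
use country_logits, clear
* each coefficient is now a variable in this dataset;
* odds ratios are exp() of the corresponding log-odds
```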



A quick question about weird question marks in a string variable

Hello, I am using Stata 15.1. I have a string variable, student-teacher ratio (see data below). For some responses it shows the sign �� rather than a regular character; I think it should be a colon. Does anyone know how I can change �� into a colon, or make it display as it should rather than as ��? Thanks!
----------------------- copy starting from the next line -----------------------
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 plc0107
"12:1"     
"16��1"    
"6.6:1"    
"3:1"      
"16.3:1"   
"10.7��1"  
"1173��93" 
"11��1"    
"23��1"    
"16.15��1" 
"1��14"    
"16.15��1" 
"15��1"    
"12��1"    
"1��13"    
"17.7:1"   
"16.3:1"   
"14.5��1"  
"1��19.6"  
"100��9"   
"9��1"     
"09��1"    
"12.5:1"   
"12��1"    
"1��10.4"  
"17.7:1"   
"15��1"    
"11:1"     
"11.4:1"   
"16.3:1"   
"9��1"     
"1��11"    
"18:1"     
"1��15"    
""         
""         
"6:5"      
"12:1"     
"11:1"     
"10.7��1"  
"14��1"    
"16.3:1"   
"12:1"     
"16.3:1"   
"1��14"    
"12��1"    
"10��1"    
"10��1"    
"16��1"    
"17:1"     
"6.6:1"    
"100��5.38"
"6.6:1"    
"12.8��1"  
"16.3:1"   
"7:1"      
"15.6��1"  
"11:1"     
"20��1"    
""         
"12.5:1"   
"5.5��1"   
"1��19.6"  
"18:1"     
"7:1"      
"1��15"    
"1��10.4"  
"11:1"     
"11��1"    
"15:1"     
""         
"12.5:1"   
"11��1"    
"16.8:1"   
"14.5��1"  
"1��11"    
"22��1"    
"10.7��1"  
"17:1"     
"21��1"    
"6.6:1"    
"7:1"      
"17:1"     
"7��1"     
"14��1"    
"1007��60" 
"12.5:1"   
"6:5"      
"18��1"    
"16.15��1" 
"11.6��1"  
"17:1"     
"3:1"      
"1��10.4"  
"13��1"    
"16��1"    
"7��1"     
"11:1"     
"14.5��1"  
"1��11.3"  
end
------------------ copy up to and including the previous line ------------------
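One hedged sketch of a workaround: since the garbled bytes may not paste reliably into a do-file, replace any run of characters that is neither a digit nor a decimal point with a colon. This assumes the only legitimate non-numeric character in plc0107 is the ratio separator.

```stata
* replace every run of non-numeric characters with ":";
* existing ":" separators are rewritten to ":" and so are unaffected
replace plc0107 = ustrregexra(plc0107, "[^0-9.]+", ":")
* e.g. "16��1" should become "16:1"; "12:1" and "" are left unchanged
```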

type mismatch error when using "replace" function

Hi everyone,

I am using Stata version 14.0 and am hoping to replace missing observations using:
"replace state = . if state == -3 Missing geocode".
Unfortunately, I keep receiving the error "type mismatch". The variable I am working with is categorical, and I need to ensure that I am not analyzing any data that is missing. Thank you!
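A type mismatch usually means the comparison mixes string and numeric types. A hedged sketch of how this could be diagnosed and fixed (variable name as in the post; which branch applies depends on what describe reports):

```stata
describe state                      // check whether state is string or numeric
* if state is a string holding the literal text, match it as a string:
replace state = "" if state == "-3 Missing geocode"
* if instead state is numeric and "Missing geocode" is just the value label for -3:
replace state = . if state == -3
```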

Replacing melogit with meqrlogit?

I'm using the melogit command for a multilevel mixed-effects regression. When I run melogit health SEX1 AGE1 Left Right || countryID: I obtain the error message "initial values not feasible", r(1400).
Then I decided to use meqrlogit, and it ran well.
I'm working in Stata 13.
I applied the following suggestion that I saw in this group:

Code:
logit health SEX1 AGE1 Left Right
matrix b = e(b)
melogit health SEX1 AGE1 Left Right, from(b, skip) || countryID:
and I obtained the same error message "initial values not feasible".

Can I substitute melogit with meqrlogit?

Thanks in advance!

ESTIMATION RESULT

melogit health SEX1 AGE1 Left Right || countryID:

Fitting fixed-effects model:

Iteration 0: log likelihood = -16686.915
Iteration 1: log likelihood = -16664.002
Iteration 2: log likelihood = -16663.98
Iteration 3: log likelihood = -16663.98

Refining starting values:

Grid node 0: log likelihood = -16120.406

Fitting full model:

initial values not feasible
r(1400);

end of do-file

r(1400);


meqrlogit health SEX1 AGE1 Left Right || countryID:

Refining starting values:

Iteration 0: log likelihood = -15229.903 (not concave)
Iteration 1: log likelihood = -15188.46
Iteration 2: log likelihood = -15172.542

Performing gradient-based optimization:

Iteration 0: log likelihood = -15172.542
Iteration 1: log likelihood = -15168.007
Iteration 2: log likelihood = -15167.513
Iteration 3: log likelihood = -15167.511
Iteration 4: log likelihood = -15167.511

Mixed-effects logistic regression Number of obs = 32187
Group variable: countryID Number of groups = 19

Obs per group: min = 1002
avg = 1694.1
max = 2852

Integration points = 7 Wald chi2(4) = 41.07
Log likelihood = -15167.511 Prob > chi2 = 0.0000

------------------------------------------------------------------------------
health | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
SEX1 | .0739644 .030306 2.44 0.015 .0145657 .1333632
AGE1 | -.0005852 .0008761 -0.67 0.504 -.0023023 .0011318
Left | .1671068 .040944 4.08 0.000 .0868581 .2473556
Right | -.1103563 .0390669 -2.82 0.005 -.1869259 -.0337866
_cons | 1.46544 .1851746 7.91 0.000 1.102504 1.828375
------------------------------------------------------------------------------

------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
countryID: Identity |
var(_cons) | .6384834 .2098664 .3352461 1.216005
------------------------------------------------------------------------------
LR test vs. logistic regression: chibar2(01) = 2992.94 Prob>=chibar2 = 0.0000




Running a correlation for a subset of observations

Hi all,
Please consider the following example:
Code:
clear

.  input str4 id byte str3 shock byte str3 acc byte str3 netw

            id      shock        acc       netw
  1. 1001 yes yes no
  2. 1002 yes no no
  3. 1003 yes no yes
  4. 1004 yes yes yes
  5. 1005 no yes yes
  6. 1006 no no yes
  7. end

. list

     +---------------------------+
     |   id   shock   acc   netw |
     |---------------------------|
  1. | 1001     yes   yes     no |
  2. | 1002     yes    no     no |
  3. | 1003     yes    no    yes |
  4. | 1004     yes   yes    yes |
  5. | 1005      no   yes    yes |
     |---------------------------|
  6. | 1006      no    no    yes |
     +---------------------------+
I want to find Spearman's rho between the variables "acc" and "netw" for observations having the value "yes" for "shock". I can find the coefficient for all 6 observations taken together, but not for the subset with "shock" equal to "yes".
Any help would be greatly appreciated.
Thank you.
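A minimal sketch (assuming acc, netw, and shock stay as the string variables shown): spearman needs numeric variables, so encode the yes/no strings first, then restrict the sample with an if qualifier.

```stata
* spearman requires numeric variables; encode the yes/no strings first
encode acc,  generate(acc_n)
encode netw, generate(netw_n)
spearman acc_n netw_n if shock == "yes"
```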

How can I keep original number of observations in linear regression

Hello,
I would like to know one thing. I am running a linear regression in Stata 16. When I add an independent variable, the total number of observations drops. Which command should I use to maintain the original number of observations? Thank you.

What do the numbers in the box mean in the output of the estimates of SEM builder

I used the SEM Builder to test a model, but I don't understand what the numbers in the boxes are after estimation. For example, the .48 and 2.4 in the box of x1, as well as the -.27 in the box of x3. Does anyone know what they are? I checked the coefficients and other numbers in the output; the numbers in the boxes don't seem to match any of them.

Superimposing observed and predicted Weibull survivor function graphs

Hi
I was wondering if there is a simple way to superimpose observed and fitted Weibull survivor functions onto the one graph. I am trying to reproduce Figure 10.1 in David Collett's book "Modelling survival data in medical research" 3rd Edition. This is a study of survival in breast cancer patients with 2 types of stain on their histological specimens.
This is the dataset:
Code:
clear
input float(stain time status)
1  23 1
1  47 1
1  69 1
1  70 0
1  71 0
1 100 0
1 101 0
1 148 1
1 181 1
1 198 0
1 208 0
1 212 0
1 224 0
2   5 1
2   8 1
2  10 1
2  13 1
2  18 1
2  24 1
2  26 1
2  26 1
2  31 1
2  35 1
2  40 1
2  41 1
2  48 1
2  50 1
2  59 1
2  61 1
2  68 1
2  71 1
2  76 0
2 105 0
2 107 0
2 109 0
2 113 1
2 116 0
2 118 1
2 143 1
2 154 0
2 162 0
2 188 0
2 212 0
2 217 0
2 225 0
end
And this the code for the two graphs I want superimposed.
Code:
stset time, failure(status==1)
streg i.stain, distribution(weibull)
sts graph, by(stain)
stcurve, survival at1( stain=1 ) at2( stain=2 )
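One possible way to overlay them (a hedged sketch, not from Collett): recover the Weibull parameters from streg's PH parameterization, S(t) = exp(-λ t^p) with λ = exp(xb), and draw the fitted curves via function plots inside sts graph's addplot() option. The coefficient names below (_b[/ln_p], _b[2.stain]) are assumptions about where streg stores its estimates; verify them with ereturn list before relying on this.

```stata
stset time, failure(status==1)
streg i.stain, distribution(weibull)
* assumed coefficient locations -- check against -ereturn list- / -matrix list e(b)-
local p  = exp(_b[/ln_p])                    // Weibull shape parameter
local l1 = exp(_b[_cons])                    // scale for stain == 1
local l2 = exp(_b[_cons] + _b[2.stain])      // scale for stain == 2
* Kaplan-Meier estimates with the fitted Weibull survivor functions on top
sts graph, by(stain) ///
    addplot(function exp(-`l1'*x^`p'), range(0 225) || ///
            function exp(-`l2'*x^`p'), range(0 225))
```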
Thanks and regards

Chris

Analysis of rotating panel design data in Stata

Hello everyone,
I have a 9-year rotating panel design in which every year one-ninth of the sample is replaced. Each sample is representative of the population in and of itself. For panel analysis of this dataset in Stata, should I simply treat it as an unbalanced panel (since not all units will have observations in every year) and trust the resulting estimates, especially since the data missing in each year are not missing for systematic reasons? Is there any other way to handle rotating panel data or unbalanced panels in Stata?

Looping through multiple strings for comparison

Dear all,

I would like to compare educational institutions of two workers and code a binary variable that indicates whether these two workers have attended the same university (regardless of the type of degree, time etc.).

The structure of the dataset is shown below. There are pairs of a worker (worker_id) and coworker (coworker_id) and we have the names of the institutions he/she attended for their Bachelors, Masters etc. degree. The variables are just an excerpt, in the full dataset there is a range of further variables such as for MBA university, PhD university etc. which are not displayed here for brevity. The names of the institutions are standardized (i.e. there are no different spellings for the same institution), therefore I would like to do an exact comparison rather than similarity scoring.

The goal is to compare all the education variables of a worker with those of the coworker; if they attended the same university at some point, the (new, to be generated) binary variable "same_university" takes the value 1. For example, in line 3 below, worker_id 44 attended the University of Puget Sound and coworker_id 1 did as well, so "same_university" would take the value 1. Similarly, in line 5, both workers attended the University of Puget Sound (it is irrelevant that they obtained different types of degrees or may not have been there at the same time).

How would you recommend to code this?

Thanks a lot for your help again!


Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(worker_id coworker_id) str33 worker_ba_uni_std str26 worker_ma_uni_std str25 coworker_ba_uni_std byte coworker_ma_uni_std
72 1 "hobart and william smith colleges" "dartmouth college"          "university of puget sound" .
27 1 "stanford university"               "stanford university"        "university of puget sound" .
44 1 "university of puget sound"          ""                           "university of puget sound" .
28 1 "city university of new york"       ""                           "university of puget sound" .
 5 1 "indian institute of technology"    "university of puget sound" "university of puget sound" .
17 1 "stanford university"               ""                           "university of puget sound" .
15 1 "stanford university"               ""                           "university of puget sound" .
72 1 "hobart and william smith colleges" "dartmouth college"          "university of puget sound" .
19 1 "dartmouth college"                 ""                           "university of puget sound" .
end
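One way to sketch this (hedged; it assumes all the education variables follow the worker_*_uni_std / coworker_*_uni_std naming and hold strings — in the example, coworker_ma_uni_std is numeric only because it is entirely missing, so it is converted first):

```stata
* make the all-missing numeric column a string so comparisons are type-safe
tostring coworker_ma_uni_std, replace force
gen byte same_university = 0
* compare every worker institution against every coworker institution;
* the != "" guard stops empty cells from matching each other
foreach w of varlist worker_*_uni_std {
    foreach c of varlist coworker_*_uni_std {
        replace same_university = 1 if `w' == `c' & `w' != ""
    }
}
```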

zero-inflated (poisson) mediators

Hi, I am running a mediation in which the mediator is zero-inflated Poisson and the outcome variable is binary. The output summary is attached ('zip1' and 'zip2'). I am a bit confused about the two classes in the output. Here is my understanding; please correct me if I make any misinterpretation.

variables: Predictor = disc_pfreq (count) ; mediator = disc_sfreq (zero-inflated poisson); outcome = letterv_occur (binary)

total n = 13740

From the output, there are 2 classes. Class 1 includes those whose observed mediator count is 0 (i.e.,13559 cases). Class 2 includes all obs cases (i.e., 13740 cases).
However, the zip model entails 2 parts: obs count = 0 vs. obs count > 0 (please see 'zip model' attached to the msg).
Does class 1 from the output match the first equation of the model (obs count = 0)?
Does class 2 from the output match the second equation of the model (obs count >0)? If so, then the obs number for class 2 should be 13740-13559 cases rather than 13740 cases.

In short, I am not sure how to interpret the zip (mediator) model in this case and how it is related to the outcome model.

So far, I can only find one paper on zip mediators by Zhigang Li et al. (2019). https://arxiv.org/abs/1906.09175v1

Any comment or help is highly appreciated!!!

egen newvar=group(varlist) does not generate a variable based on all unique combinations in the varlist

I have been using egen newvar=group(varname1 varname2) for a while, with the understanding that newvar would be a new variable containing all unique combinations of varname1 and varname2. Since I want to collapse my data to 1 observation per unique combination of varname1 and varname2, but I also need to have these variables (as well as newvar) in the final dataset, I decided to test the following code (a somewhat pathological example, I agree, but I regularly work with datasets with tens of millions of observations) to see if it was faster to put varname1 and varname2 in the by() option or in the variable list. Remember, there should be no variation in varname1 and varname2 within values of newvar, so the means of varname1 and varname2 that come out of collapse are just their values for each specific value of newvar, and both approaches should give the same thing. However, I got an unfortunate surprise: egen group does not appear to generate its new variable based on all unique combinations in the varlist. Here is the code I tested:

Code:
clear
set obs 20000000
gen indid=floor(uniform()*10000)
gen firmid=floor(uniform()*10000)
sort indid firmid
egen matchid=group(indid firmid)
drawnorm x y
preserve
timer on 1
sort matchid
collapse x y indid firmid, by(matchid)
summarize
timer off 1
restore
preserve
timer on 2
sort matchid indid firmid
collapse x y, by(matchid indid firmid)
summarize
timer off 2
restore
timer list 1
timer list 2
The first summarize gives
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
matchid | 17,451,332 8738687 5060250 1 1.81e+07
x | 17,451,332 .0003232 .9687208 -5.402949 5.76085
y | 17,451,332 .000109 .9687222 -5.663941 5.742223
indid | 17,451,332 4820.521 2791.404 0 9999
firmid | 17,451,332 4998.954 2886.053 0 9999


and the second gives
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
indid | 18,125,449 4999.293 2886.385 0 9999
firmid | 18,125,449 4999.083 2886.161 0 9999
matchid | 18,125,449 9062725 5232367 1 1.81e+07
x | 18,125,449 .0002974 .9751242 -5.402949 5.76085
y | 18,125,449 .0001436 .9751012 -5.663941 5.742223


Note the different number of observations; since matchid was defined by combinations of indid and firmid, there should be no variation in these variables within matchid and adding them to the by() option should not change the number of observations or anything else in the descriptive statistics (this holds, by the way, when I choose 1000 different values for indid and firmid, instead of 10 000).

To follow up on this, I checked to see that matchid was indeed not getting unique combinations with this code:

Code:
format matchid %9.0f
tempfile hold
sort matchid
save `hold'
collapse (sd) sdi=indid sdf=firmid, by (matchid)
merge matchid using `hold'
gsort -sdi -sdf matchid
order matchid indid firmid sdi sdf
list in 1/10
and Stata returns:
+----------------------------------------------------------------------------------+
| matchid indid firmid sdi sdf x y _merge |
|----------------------------------------------------------------------------------|
1. | 17283644 9534 9999 .5773503 5772.637 .1314288 .5337735 3 |
2. | 17283644 9535 0 .5773503 5772.637 -1.074206 1.676847 3 |
3. | 17283644 9535 1 .5773503 5772.637 -.2784413 -2.624709 3 |
4. | 17998432 9929 9999 .5773503 5772.637 .654556 .3150313 3 |
5. | 17998432 9930 1 .5773503 5772.637 -1.504835 .3309812 3 |
|----------------------------------------------------------------------------------|
6. | 17998432 9930 0 .5773503 5772.637 -.0689836 -1.354481 3 |
7. | 16986156 9370 9998 .5773503 5772.06 -.1597866 .414003 3 |
8. | 16986156 9371 0 .5773503 5772.06 .0601182 -.0195616 3 |
9. | 16986156 9370 9998 .5773503 5772.06 -.1587963 1.086275 3 |
10. | 16986156 9371 1 .5773503 5772.06 .5464685 .4928921 3 |
+----------------------------------------------------------------------------------+


where you can see that the same matchid has multiple indid-firmid combinations associated with it.

Clearly, I seem to have misunderstood what egen group does. Is this a commonly known issue? If so, is there a better way to generate a variable that refers to unique combinations of variables in the varlist?
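For what it's worth, a hedged guess at the mechanism, with a sketch of a workaround: egen creates a float variable by default, and a float cannot represent integers exactly above 2^24 = 16,777,216. With roughly 18.1 million groups (note the max of 1.81e+07 in the output above), distinct group numbers round to the same float value, which would produce exactly these symptoms. Forcing a long (or double) storage type avoids the rounding:

```stata
* store the group index in a long so ~18 million distinct values survive intact
egen long matchid = group(indid firmid)
```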

Thank you all for your time.

P.S. Here are the timers, if you are curious:
. timer list 1
1: 128.52 / 4 = 32.1290

. timer list 2
2: 217.01 / 3 = 72.3383




Missing deciles with Portfolio Sort

Hi,

I have the following problem. When using the fastxtile command to sort my values into deciles, one decile group is missing. After some testing I found that for a certain month in my observation period I get the following percentile breakpoints after using pctile with option nq(10).
pt_sent_10
-0.08
0
0
0.0377778
0.08
0.14
0.19
0.28
0.38
It seems like the breakpoints for two adjacent classes are the same. Does that mean that I don't have enough variation in my values?
Do I have to use fewer groups, or is there another way?

Best regards
Nic

Select US listed companies only (Compustat)

Hi all,

I would like to select only the firms (gvkey) in my dataset that are listed in the US (US-listed companies, i.e. registered with the SEC). Does anyone know a unique identifier (from Compustat) to do this? Maybe based on this site: http://www.crsp.com/products/documen...ata-industrial. I would like to hear from you.

Roy

Annualized Sharpe ratios and Volatilities

Good morning,

I have monthly excess returns of two portfolios, P1 and P8, over 29 years. For my master's thesis I have to state the volatilities and Sharpe ratios each as one number, and these two numbers should be annualized. So far I have tried different approaches to calculate the needed results, but I'm not sure which is the formally correct code.

Here is an example of the code (for portfolio P1) that I think is correct, plus the data from 1990 and 1991. I know the code looks kind of sloppy, but I will put the calculations into loops later on.
Code:
egen Sd_P1= sd(P1), by(Jahr)
gen Sd_Annual= Sd_P1*sqrt(12)

gen SR= P1/Sd_P1
gen SR_annual= SR*sqrt(12)
egen SR_final= mean(SR_annual)

//To get my SD for P1
gen SD_final= Sd_Annual*sqrt(29)
  
 * Example generated by -dataex-. To install: ssc install dataex
clear
input float(date Jahr Monat P1 P8)
360 1990  1  -.0019377362   -.003668688
361 1990  2   .0008372813    .001459499
362 1990  3 -.00022618668   .0019396635
363 1990  4  -.0003067676   -.002378969
364 1990  5   .0004768702    .005065726
365 1990  6   .0006901055 -.00016673017
366 1990  7  -.0012268435  -.0008285865
367 1990  8  -.0027135995   -.005536701
368 1990  9  -.0011730944   -.004750926
369 1990 10  -.0017754886   .0002686236
370 1990 11   .0014749074    .005864478
371 1990 12   .0004333424   .0011803983
372 1991  1   .0016687294    .005055216
373 1991  2   .0041755983    .005791987
374 1991  3   .0020613207   .0019960906
375 1991  4    .002639579  .00008909535
376 1991  5  .00054460054    .002296731
377 1991  6  .00011390844  -.0020861127
378 1991  7 .000027179554   .0012150153
379 1991  8   .0017399964   .0015752286
380 1991  9 -.00013826239   .0004809036
381 1991 10  -.0008881435   .0013854794
382 1991 11 -.00016042936   -.002432231
383 1991 12    .000421264    .004964425
end
format %tm date
I think I am using the wrong scaling somewhere, even though the results look kind of reasonable.