Channel: Statalist

Test to compare Positive and Negative Predictive Value?

Hi all,

I wonder whether it is possible to use a statistical test to compare positive and negative predictive values with Stata 14?

Thanks in advance.
Regards
Rodrigo.

Changing a discrete variable into a dummy and losing significance

Dear all,

I am writing an essay using an unbalanced panel dataset. Right now I am using a Linear Probability Model with time and company fixed effects.
My research question is whether the number of women in a department increases the probability that the head of the department is female as well.

I'm using the discrete count of women in the department as the main explanatory variable. I ran the regression in Stata with the xtreg command; the coefficient of my main x-variable was positive and significant at the 5% level.

Now I want to transform my discrete variable into a dummy variable which equals 1 if four or more women work in the department.
When I run the new regression with the same control variables as before, the coefficient of the dummy variable becomes insignificant. It is just above 0.

How can this be, given that the discrete variable was statistically significant? How can a change from 0 to 1 woman, for example, have an impact on the probability that Y=1, but not a change from 0 to 4 women?

I really appreciate your help. Thank you.

Merged file size is ~5X the source: Mechanism behind Merge

Hello,

This is my first post on Statalist and I have tried my best to follow posting advice in the FAQ. Kindly excuse any mistakes.

I am using Stata/IC 14 for Unix (Linux 64-bit x86-64) on a remote high performance computing setup to perform some basic data manipulations on large source files. My question relates to the resulting file size from a merge operation.

My in-memory (master) data set has 18,865 observations and a file size of approximately 4.1MB.
Code:
 obs:        18,865                          
 vars:            24                          4 Mar 2020 14:55
 size:     4,093,705                          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
key             str18   %18s                  
run_yr          float   %9.0g                 Running year of alliance
firm1_gvkey     long    %12.0g                Final GVKey for P1
firm2_gvkey     long    %12.0g                Final GVKey for P2
gvkeypaired     float   %9.0g                
ann_date        float   %td                   Announcement date of the alliance
firm1_permco    long    %12.0g                P1 Permco from CCM
firm2_permco    long    %12.0g                P2 Permco from CCM
n_firm1         float   %9.0g                 Number of alliances of firm1 in the running year
flipped         byte    %8.0g                 0 if A-B, 1 if B-A in a year
firm1_parent    str30   %30s                  P1 Ultimate Parent Name
firm2_parent    str30   %30s                  P2 Ultimate Parent Name
industry        str14   %14s                  Industry
firm1_name      str30   %30s                  Participant 1 in Venture / Alliance (Short Name)
firm2_name      str30   %30s                  Participant 2 in Venture / Alliance (Short Name)
firm1_sic       int     %8.0g                 P1 Ultimate Parent Primary SIC Code
firm2_sic       int     %8.0g                 P2 Ultimate Parent Primary SIC Code
count_allyear   float   %9.0g                 Count of alliance year
group           float   %9.0g                 ID variable to identify year-focal firm combination
run_yr_enddate  float   %td                   End date of running year of alliance
id1             float   %9.0g                 group(run_yr)
id2             float   %9.0g                 group(run_yr firm1_gvkey)
gfreq           float   %9.0g                
numid           float   %9.0g                
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: run_yr  firm2_gvkey  gvkeypaired

The 'using' data file has 633,476,799 observations and a file size of approximately 21GB.
Code:
 obs:   633,476,799                          
 vars:             6                          20 Feb 2020 23:41
 size: 20,904,734,367                          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
year            int     %8.0g                
gvkey1          long    %12.0g                
gvkey2          long    %12.0g                
score           float   %9.0g                
ball            byte    %8.0g                
key             str18   %18s                  
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:
I perform a merge operation on these data
Code:
merge m:1 key using "/home/1996_2017.dta"
The resulting data set is massive in terms of file size (approximately 147GB)!
Code:
  obs:   633,483,072                          
 vars:            30                          4 Mar 2020 15:44
 size: 147,601,555,776                          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
key             str18   %18s                  
run_yr          float   %9.0g                 Running year of alliance
firm1_gvkey     long    %12.0g                Final GVKey for P1
firm2_gvkey     long    %12.0g                Final GVKey for P2
gvkeypaired     float   %9.0g                
ann_date        float   %td                   Announcement date of the alliance
firm1_permco    long    %12.0g                P1 Permco from CCM
firm2_permco    long    %12.0g                P2 Permco from CCM
n_firm1         float   %9.0g                 Number of alliances of firm1 in the running year
flipped         byte    %8.0g                 0 if A-B, 1 if B-A in a year
firm1_parent    str30   %30s                  P1 Ultimate Parent Name
firm2_parent    str30   %30s                  P2 Ultimate Parent Name
industry        str14   %14s                  Industry
firm1_name      str30   %30s                  Participant 1 in Venture / Alliance (Short Name)
firm2_name      str30   %30s                  Participant 2 in Venture / Alliance (Short Name)
firm1_sic       int     %8.0g                 P1 Ultimate Parent Primary SIC Code
firm2_sic       int     %8.0g                 P2 Ultimate Parent Primary SIC Code
count_allyear   float   %9.0g                 Count of alliance year
group           float   %9.0g                 ID variable to identify year-focal firm combination
run_yr_enddate  float   %td                   End date of running year of alliance
id1             float   %9.0g                 group(run_yr)
id2             float   %9.0g                 group(run_yr firm1_gvkey)
gfreq           float   %9.0g                
numid           float   %9.0g                
year            int     %8.0g                
gvkey1          long    %12.0g                
gvkey2          long    %12.0g                
score           float   %9.0g                
ball            byte    %8.0g                
_merge          byte    %23.0g     _merge    
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:
This is problematic even on machines with 64 CPU cores and 512GB of RAM. Simple operations like filtering and summarizing take a long time to show results, which is understandable. What I do not understand is why, even though all input and output files are in .dta format, adding 6,273 observations and 24 variables to the larger ('using') data set leads to such a large increase in file size.

I have tried searching the internet to figure out what merge does but could not find anything substantial. All I could find and understand is that file formats are optimized for reading, writing, etc. by each software package (Reference: https://nelsonareal.net/blog/2017/11...ile_sizes.html). Can an expert here explain what is happening with the merge operation in Stata in general, and in my case in particular? Thank you!
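For what it's worth, a back-of-the-envelope check using the storage widths shown in the describe listings above (my own arithmetic, not an official account of the .dta format): each observation occupies roughly the sum of its variables' storage widths, so a merge that attaches the master's wide string variables to every one of 633 million rows makes the file much larger.
Code:
display 18 + 8*4 + 1 + 4*30 + 14 + 2*2 + 7*4   // master row width: 217 bytes
display %15.0fc 217 * 18865                    // 4,093,705 = the master file size
display 2 + 3*4 + 1 + 18                       // using row width: 33 bytes
display %15.0fc 33 * 633476799                 // 20,904,734,367 = the using file size
display 217 + 33 - 18 + 1                      // merged row: key stored once, plus byte _merge = 233
display %15.0fc 233 * 633483072                // 147,601,555,776 = the merged file size
So the growth factor is simply 233/33, about 7: nearly every one of the 633 million rows keeps its 33 bytes and gains roughly 200 bytes of (mostly str30) variables from the master.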

generating dummy variables for consecutive years

Hi everyone!

I am quite new to Stata and I probably have a simple question, but I am not able to figure it out on my own.

I have an unbalanced panel dataset with N individuals, each with a unique identifier (id). I have several survey years (year) and an indicator (work) for whether an individual had a paid job in the relevant year (yes=1, no=0).

Now I would like to create several dummy variables (basically one dummy for every year) to see if the individual worked over several consecutive years, e.g. work1- work4.

Meaning that if individual 1 worked in 2001, the dummy work1 should take the value 1. If individual 1 also worked one year later (2002), the dummy work2 should also equal 1. If individual 1 did not work in 2003, work1-work4 should all equal 0 for that year.

What I mean should look like this:
id   year   work   work1   work2   work3   work4
 1   2001      1       1       0       0       0
 1   2002      1       0       1       0       0
 1   2003      0       0       0       0       0
 2   2002      1       1       0       0       0
 2   2003      1       0       1       0       0
 2   2004      0       0       0       0       0
 3   2008      1       1       0       0       0
 3   2009      1       0       1       0       0
 3   2010      1       0       0       1       0
 4   2005      1       1       0       0       0
 4   2006      1       0       1       0       0
 4   2007      1       0       0       1       0
 4   2008      1       0       0       0       1

I tried it with the following code, but unfortunately my idea does not work.

sort persnr syear
gen work1=1 if (work == 1 & work[_n-1] == 0 & id == id)
replace work1=0 if missing(work1)

gen work2=1 if (work == 1 & work[_n-1] == 1 & id == id[_n-1])
replace work2=0 if missing(work2)
Hope someone can help. Thanks a lot!
Chris
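For reference, one sketch that would reproduce the example table above by counting the length of each run of consecutive working years within id (untested, and it assumes work is never missing):
Code:
bysort id (year): gen runlen = cond(work == 0, 0, ///
    cond(_n == 1 | year != year[_n-1] + 1, 1, runlen[_n-1] + 1))
forvalues k = 1/4 {
    gen work`k' = (runlen == `k')
}
The inner cond() restarts the counter at 1 whenever the previous calendar year is absent for that individual, and runlen[_n-1] is already computed by the time -generate- reaches observation _n.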

set seed

Hi All,
As I understand it, set seed makes randomization produce the same result when run again and again. But I get a different result every time I run my randomization.
Please suggest how to fix this so that I obtain the same random result on every run.
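A minimal sketch of the usual pattern (hypothetical example): set seed must be run in the same session, immediately before the randomization, on every run; setting it once in an earlier session, or after the draws, has no effect.
Code:
clear
set obs 5
set seed 12345       // same seed set right before the draw ...
gen u = runiform()   // ... so u holds the same values on every run
list u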

Drawing random sample from a large data set for each observation

Hi, I have a large dataset on course enrollment. Individual students take courses in different semesters; observations are unique at the individual-semester-coursenum level. Individuals also have different graduation years ("cohort"). I would like to choose, for each individual, a random sample of individuals in their cohort, whose size ("total") differs across individuals. The best way I can think of is to loop through the individual observations, use the randomtag command, and create a unique identifier for each random group (possibly the unique identifier of the student). For example, I could use the following commands:

preserve
keep id cohort total
duplicates drop            /* we now have one observation per individual */
sort id
local N = _N
set seed 1357
gen randomgroup = .        /* create once, outside the loop */
forvalues i = 1/`N' {
    local id        = id[`i']
    local year      = cohort[`i']
    local groupsize = total[`i']
    randomtag if cohort == `year', count(`groupsize') g(selected)
    replace randomgroup = `id' if selected == 1
    drop selected          /* so randomtag can recreate it next iteration */
}
sort id
save randomgroups.dta
restore
sort id
merge id using randomgroups.dta

I'm wondering if there is a faster way to do this, rather than looping over individual observations to generate random samples one at a time. Thank you for your suggestions.

Comparing regression coefficients across two regression models

Hi everyone,
I would like to compare the regression coefficients across two regression models. When I use the svy prefix, the usual test command does not work. However, without the svy prefix, the test command below works.


svy: regress overallsatis i.BQ4 i.BQ5 BQ1 BQ2 i.Race i.BQ6 i.BQ8 i.BQ7 ADLs i.BQ10 Help if NH_RCF==0, eform(exp(Coef.))
eststo NH
est store NH_RCF0

svy: regress overallsatis i.BQ4 i.BQ5 BQ1 BQ2 i.Race i.BQ6 i.BQ8 i.BQ7 ADLs i.BQ10 Help if NH_RCF==1, eform(exp(Coef.))
*vif
eststo RCF
est store NH_RCF1
esttab NH RCF, r2 ar2 aic l title(Satisfaction NH vs. RCF)
suest NH_RCF0 NH_RCF1


test [NH_RCF0_mean]2.BQ4=[NH_RCF1_mean]2.BQ4
Error: equation [NH_RCF0_mean] not found

PPML with fixed effects (exporter + importer) for gravity equation - How to get the values of fixed effects for the dropped variables?

Hi everyone,

I want to estimate a gravity model for 2009 while accounting for multilateral resistance (cross-section, 84 countries, hence 84*83 observations). I therefore want to add exporter and importer fixed effects. This will not permit including traditional unilateral variables (GDP, landlocked, RTA...). I have two questions:

1- Should I use -ppml- while generating dummies for exporters and importers, OR -xtpoisson- with the fe option?

2- Since I want to evaluate the effects of UNILATERAL variables, I follow Melitz (2005) and regress the exporter/importer fixed effects (which become the dependent variable) on UNILATERAL variables.
My question: since Stata drops one dummy to prevent collinearity, and I want all fixed effects as observations for the NEW dependent variable, how do I get the values of all country (exporter/importer) fixed effects?
Do the values of the exporter/importer dummies reported in the results correspond to the needed country fixed effects? If yes, I want to make sure that for the NEW variable (exporter OR importer country fixed effect), I should add the value of ZERO for the dropped dummy to obtain 84 observations.

Thank you very much!

Drawing random sample for each observation where parameters vary for each observation

Hi, I have a large dataset on course enrollment. Individual students take courses in different semesters; observations are unique at the individual-semester-coursenum level. Individuals also have different graduation years ("cohort"). I would like to select, for each individual, a random sample of individuals in their cohort, whose size ("total") differs across individuals. The best way I can think of is to loop through the individual observations, use the randomtag command, and create a unique identifier for each random group (possibly the unique identifier of the student). For example, the following code works:

preserve
keep id cohort
duplicates drop            /* we now have one observation per individual */
egen total = count(id), by(cohort)
local N = _N
set seed 1357
g randomgroup = .
sort id
forvalues i = 1/`N' {
    global id = id[`i']
    global year = cohort[`i']
    global groupsize = total[`i']
    randomtag if cohort == $year, count($groupsize) g(selected$id)
    replace randomgroup = $id*selected$id if selected$id == 1
}
sort id
save randomgroups.dta
restore
sort id
merge id using randomgroups.dta

I'm wondering if there is a faster way to do this, rather than looping over individual observations to generate random samples one at a time. Thank you for your suggestions.

Creating binary variable with multiply conditions using for loop

I want to create a binary variable using a foreach loop. The setup is that my binary variable, say "a", should take the value 1 if any one of the variables b, c, or d takes the value 5. I tried doing it using a local macro holding b, c, d, but found that Stata only reads variable b and not the others. I know this can easily be done with generate and replace commands, but I am new to writing loops and want to learn this approach.
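For reference, a sketch of the loop approach described above (variable names b, c, d as in the post; untested):
Code:
gen a = 0
foreach v of varlist b c d {
    replace a = 1 if `v' == 5    // `v' expands to b, then c, then d in turn
}
Equivalently, without a loop: gen a = inlist(5, b, c, d).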

Series 0 not found error

I'm encountering an incredibly mysterious error, and I was hoping anybody here would have some insight. Code like this has worked fine for me in the past and continues to work in other settings, but the problem has now developed on a particular server. Minimal example below:

set obs 10

gen x = _n
gen y = _n +1
gen cond = x<6

graph twoway (line x y if cond==0) (line x y if cond==1)

throws:
"series 0 not found"

To reproduce it, you need to overlay two plots, and the two plots need different if conditions. The error persists even after restarting Stata and running 'clear all' and 'clear ado'. It's probably something about my account on that machine, but I don't know what more I can do to clear out the gunk.

String Variable Merge

Hi, I am fairly new to Stata and am having difficulty creating a string variable identifying year and quarter, e.g. 2012Q1, in order to merge two data sets together.

Currently I have gotten as far as getting the month and year as variables using:

gen month1 = substr(date, 3, 3)
gen year1 = substr(date, 6, 4)
destring year1, replace
gen month3 = .
replace month3 = 1 if month1 == "jan"
replace month3 = 2 if month1 == "feb"
replace month3 = 3 if month1 == "mar"
replace month3 = 4 if month1 == "apr"
replace month3 = 5 if month1 == "may"
replace month3 = 6 if month1 == "jun"
replace month3 = 7 if month1 == "jul"
replace month3 = 8 if month1 == "aug"
replace month3 = 9 if month1 == "sep"
replace month3 = 10 if month1 == "oct"
replace month3 = 11 if month1 == "nov"
replace month3 = 12 if month1 == "dec"

I now need to create a quarter variable as a string, e.g. Q1 for month3 = 1 | 2 | 3, and then combine it with the year to create a variable matching the one in the other data set so I can merge. The switching between string and numeric is causing some difficulties and I am not sure where to go from here.

thanks in advance for any advice
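A sketch of one way to finish this, assuming month3 and year1 as built above (untested):
Code:
gen qtr = ceil(month3/3)                         // months 1-3 -> 1, 4-6 -> 2, ...
gen yearqtr = string(year1) + "Q" + string(qtr)  // e.g. "2012Q1"
If both data sets can be converted, a numeric quarterly date (yq() with a %tq format) is usually easier to merge on than a string.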

label define but it's not 1-to-1

I don't think I can state my question clearly in words, so let me show my data.

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str31 q14_marriage
"Single; never married"          
"Single; never married"          
"Single; never married"          
"Married or domestic partnership"
"Married or domestic partnership"
"Divorced"                       
"Married or domestic partnership"
"Married or domestic partnership"
"Married or domestic partnership"
"Married or domestic partnership"
end
For this string variable, which has multiple categories, I want them coded as 0,1,1,0,0 (my reasoning is whether people are in a relationship or not; I'm open to different opinions):

Divorced
Engaged; to be married
Married or domestic partnership
Separated
Single; never married

In my case, three of the string values should map to 0 and two should map to 1. How can I do this?
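One sketch under the grouping described above (exact string matching, so watch for stray spaces or capitalization differences; untested):
Code:
gen in_relationship = inlist(q14_marriage, ///
    "Engaged; to be married", ///
    "Married or domestic partnership")
inlist() returns 1 when q14_marriage equals either listed value and 0 otherwise, which yields the 0/1 coding for all five categories.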

Svy: Subpop number observations dropping people

Using NHANES, my subpopulation has 39,313 people, but when I run a simple svy, subpop(if subpop==1): mean ridageyr, the number of subpopulation observations is 37,425 instead of 39,313. I have no missing data on the survey design variables (MEC16YR, sdmvstra, sdmvpsu) or on ridageyr.

. svyset [pweight=MEC16YR], strata(sdmvstra) psu(sdmvpsu) vce(linearized) singleunit(centered)

pweight: MEC16YR
VCE: linearized
Single unit: centered
Strata 1: sdmvstra
SU 1: sdmvpsu
FPC 1: <zero>

. svy, subpop(if subpop==1): mean ridageyr
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =   118          Number of obs   =        82,091
Number of PSUs   =   241          Population size = 1,415,698,832
                                  Subpop. no. obs =        37,425
                                  Subpop. size    = 946,631,143.08
                                  Design df       =           123

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    ridageyr |   47.36543   .2322556      46.90569    47.82516
--------------------------------------------------------------

Without weights or survey design variables, you can see there is no missing for age:

. mean ridageyr if subpop==1

Mean estimation                   Number of obs   =        39,313

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    ridageyr |   50.83229   .0932921      50.64944    51.01515
--------------------------------------------------------------

Seemingly unrelated regressions and instrumental variables

Hi all,
I have a question regarding SUR and IV.
Currently, I have the following two regressions (different dependent variables but the same independent variables); see the attached image.

I am planning to use 'b_{i, t}' as an instrumental variable for 'a_{i, t}',
and I want to test whether beta_1 is equal to beta_2.
I searched for SUR and found that the sureg command is what I was looking for, but it seems to have no option for instrumental variables.
Is there another way to do this?

Thanks!

Understanding manual mechanics of -predict- command after two-factor confirmatory factor analysis

Background:
I am working on a longitudinal study where we use a measure (the MOS-HIV: 35 items that are ordinal categorical variables ranging over 1-3, 1-5, and 1-6). To get the final scores (mental health and physical health summary scores), we can use the scoring coefficients from the patient population of the original validation study. Given that our sample is quite different (the original patient population was North Americans with HIV; our sample is East Africans with HIV), we would like to use our own scoring coefficients, derived from our sample, across the measurement times (5 visits total).

To get to this point, for the baseline data we ran a two-factor confirmatory factor analysis with a varimax rotation and developed the two summary scores with no problem, then compared them to the primary method of scoring (using the Roche patient population as described in the first paragraph) as an extra check that the two summary scores from each scoring method were highly correlated. We detected no issues. For subsequent data, I used matrix2dta to store the scoring coefficients from baseline, transformed it, and had no issue merging it into the database with all subsequent (non-baseline) data.

Actions thus far:
I wanted to test how the summary scores are made (which I created using -predict- for the baseline) by replicating the baseline scores manually, calculating the predicted scores from the values and the scoring coefficients. I wanted to do this to ensure I would calculate the scores correctly at the subsequent visits. I did this only with Factor 1: if I can match the "manual" scores with the predicted scores for one of the two outcomes at one visit, I presumed I could replicate the calculation. I standardized the 35 variables per the helpfile for factor postestimation (page 351), which says: "The table with scoring coefficients informs us that the factor is obtained as a weighted sum of standardized versions of headroom, rear seat, and trunk with weights 0.28, 0.27, and 0.46."

For my first calculation, I multiplied the rotated scoring coefficients by the standardized values of the 35 corresponding variables from the baseline data only, and then summed the products. The results were not equal to the predicted scores, nor were they correlated with them (I tested the correlation in case it was the same score on a different scale).

For my second calculation, I multiplied the unrotated scoring coefficients by the standardized values in the same way and summed the products. These were also not equal to the predicted scores, nor correlated with them.

For further attempts, I repeated the above steps after specifying the Bartlett scoring method for the predict command (instead of the default regression method). These also didn't work ("work" meaning match the summary scores Stata produced via the predict command).

The bottom-line question is: what are the "under the hood" calculations of the -predict- command, so that I can replicate them in my longitudinal study using the scoring coefficients from the baseline visit at subsequent visits?

This is my first post and I read instructions and tried to be as detailed as possible. Please let me know if I can provide more information.

label variable question and substr

Hi, I have several variables and their names are: q11r1_facebook q11r2_instagram q11r3_twitter q11r4_snapchat q11r5_pinterest q11r6_tiktok q11r7_linkedin q11r8_strava

I want to do
Code:
label variable q11r1_facebook "facebook"
(The reason is that when I form a correlation table or regression table, I can then read the variable labels.)

So I want to do:
Code:
foreach v of varlist q11r1_facebook-q11r8_strava {
    label variable `v' "???"
}
I guess it should use substr. Can anyone help me?
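A sketch along the substr lines suggested above, taking everything after the underscore as the label (untested):
Code:
foreach v of varlist q11r1_facebook-q11r8_strava {
    local lbl = substr("`v'", strpos("`v'", "_") + 1, .)
    label variable `v' "`lbl'"
}
substr() with . as the length runs to the end of the string, so q11r1_facebook gets the label "facebook".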

GMM and stacking moment conditions

Dear Statalisters,

I have never used GMM in Stata and couldn't find a related issue posted on Statalist. I would appreciate any guidance on the following matter.

I have the following OLS models:

Model 1) Stock_Return_t = Alpha_t + b1*MarketReturn_t + b2*HTM_t + b3*BTM_t + e_1

Model 2) Alpha_t = a + c1*Dummy1_t + c2*Dummy2_t + c3*Size_t + e_2

My question is, how can I stack the moment conditions and estimate the parameters in model 2 in GMM?

Trevor

How to retrieve data from World Bank database in stata

Hi all,

I've been using the wbopendata Stata command to retrieve indicators from the World Bank database, but I've noticed that after some time they update the indicators and move the old ones to an archive. When this happens, wbopendata displays error 23. Does anyone know a way of accessing, from Stata, the information that has been moved to the WB archive?
Thanks!

Binary Variable

Hello guys,

I have recently been reading a paper by Berk, Jonathan B. and van Binsbergen, Jules (2016), "Assessing Asset Pricing Models Using Revealed Preference". In their paper they run the following regression:

sign(F) = β0 + β1*sign(ALPHA) + ε

, where sign(F) and sign(ALPHA) take on values in {−1, 1}.

I cannot figure out how to run such a regression in Stata. Does anyone know how to do so?

Thank you
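A sketch of how this might be set up (F and ALPHA are hypothetical variable names standing for the fund flow and alpha estimates; untested):
Code:
gen signF = cond(F > 0, 1, -1)       // force values into {-1, 1}
gen signA = cond(ALPHA > 0, 1, -1)
regress signF signA
Stata's sign() function returns -1, 0, or 1, so cond() is used here to keep the values in {-1, 1} as in the paper.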