Unicode analyze error - too big file

Hi guys! I am new to Stata and I have a problem analyzing and translating .dta file encodings with the unicode commands. After typing "unicode analyze mydata.dta" in Stata/SE 15.1, I got "1 file ignored (too big or not Stata)". The file is not larger than 2 GB. Do you have any idea how to solve this? Thanks a lot!
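For what it's worth, a minimal diagnostic sketch (the path is hypothetical): as I understand it, unicode analyze only examines files in the current working directory, so it is worth confirming the directory and that Stata can read the file at all.

Code:
cd "C:/mydata"               // hypothetical path; unicode analyze scans the working directory
use mydata.dta, clear        // confirms the file is a readable Stata dataset
describe, short
unicode analyze mydata.dta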

How to do calculations between different data types (double, long, float)?

When I add two numbers, the result is rounded to something nonsensical.

My actual data type for pid/hid is long. The MWE below uses float, but it illustrates the same problem.

Code:
clear
input   pid hid
        513 60313    
        513 60313
        513 94    
        513 94    
        514 167    
        514 175
        515 175    
        516 175            
end
I would like to multiply the first identifier by a round number above the maximum of the second, then add the second, in order to create a unique new identifier. But I cannot figure out what data format I should use.

Code:
gen pid_float = float(pid)
gen pid2 = pid_float*1000000
format pid2 %15.0f

gen hid_float = float(hid)
gen pid_hid = pid2+hid_float
format pid_hid %15.0f
list pid hid pid_float pid2 hid_float pid_hid

     +-----------------------------------------------------------+
     | pid     hid   pid_fl~t        pid2   hid_fl~t     pid_hid |
     |-----------------------------------------------------------|
  1. | 513   60313        513   513000000      60313   513060320 |
  2. | 513   60313        513   513000000      60313   513060320 |
  3. | 513      94        513   513000000         94   513000096 |
  4. | 513      94        513   513000000         94   513000096 |
  5. | 514     167        514   514000000        167   514000160 |
     |-----------------------------------------------------------|
  6. | 514     175        514   514000000        175   514000160 |
  7. | 515     175        515   515000000        175   515000160 |
  8. | 516     175        516   516000000        175   516000160 |
     +-----------------------------------------------------------+

How can I handle these numbers exactly? Why is hid 60313 rounded so that pid_hid ends in 60320? Thank you very much.
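A hedged sketch of a fix: a float carries only about 7 significant digits, so a 9-digit value like 513060313 cannot be stored exactly and is rounded to the nearest representable float (513060320). Storing the result as double, or building a group id, avoids this:

Code:
gen double pid_hid2 = pid*1000000 + hid   // double holds about 15-16 significant digits
format pid_hid2 %15.0f

egen long pair_id = group(pid hid)        // alternatively, a compact unique id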

Panel Data Analysis (Run desired regression)

Dear Specialists,

First, thanks for reading.

I have a panel data set. I treated Region as the id variable and Year as the time variable. I would like to run a regression of the form

GVA_rt = β0 + β1 * WageGap_rt + β2 * WageGap_r,t-1 + Region_r + ε_rt

Which code should I use to estimate this model?
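Not knowing the exact variable names, a hedged sketch of one way to fit this, treating Region as a fixed effect (assuming Region is numeric; encode it first if it is a string):

Code:
xtset Region Year
xtreg GVA WageGap L.WageGap, fe     // Region effects via the within estimator
* equivalently, with explicit region dummies:
regress GVA WageGap L.WageGap i.Region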

Thanks again for reading.
Have a good day.

Best Regards
Kangchen Sun


Multiple imputation for correlated exposure variables

Hello. I am trying to perform multiple imputation on my dataset using mi impute chained.
The dataset has three exposure variables: smoking at time 1, smoking at time 2, and smoking at time 3. I generated a fourth exposure variable, smoking at any time. All exposure variables are binary (0 vs 1).
Code:
gen smoking_any=0 if smoking1==0 & smoking2==0 & smoking3==0
replace smoking_any=1 if smoking1==1 | smoking2==1 | smoking3==1
smoking1   smoking2   smoking3   smoking_any
       1          0          .             1
       0          0          .             .
       0          1          .             1
       .          .          1             1
       0          0          0             0
       0          0          0             0
       0          0          0             0
       .          0          0             .
       0          0          .             .
My question is how to impute "smoking_any".
When I include all four exposure variables in the imputation model, it always says ".... predicts data perfectly" or "convergence not achieved".

If I impute "smoking_any" separately from the other three variables, it looks like the prevalence of smoking at any time would be overestimated.

Can I use a passive imputation approach after I impute smoking1, smoking2, and smoking3? I have read that "this method is actually a misspecification of your imputation model and will lead to biased parameter estimates in your analytic model".
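For concreteness, a hedged sketch of the passive route being asked about (age and sex are placeholder covariates; the augment suboption is one way to handle perfect-prediction failures):

Code:
mi set mlong
mi register imputed smoking1 smoking2 smoking3
mi impute chained (logit, augment) smoking1 smoking2 smoking3 = age sex, add(20)
mi passive: generate smoking_any_p = max(smoking1, smoking2, smoking3)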

Thank you very much.

Table 1 help or test for significant difference

Hello, I have a data set in which I am trying to compare longitudinal data from successive pregnancies.

Each variable is coded for pregnancy 1 or pregnancy 2. With the exception of race/ethnicity, all variables differ between pregnancy 1 and pregnancy 2.

I am trying to set up a Table 1 comparing continuous, categorical, and binary variables (education, annual income, mean age, hypertension status, etc.) at the time of either pregnancy 1 or pregnancy 2.

Previously I used code for a Table 1, but that won't work here because I am not stratifying by a binary outcome but rather by pregnancy 1 vs. pregnancy 2.

I am currently using the tab or codebook command to compare each variable, e.g.:

tab edu_1
tab edu_2

tab obese_1
tab obese_2


This gives me column percentages, but no p-value comparing the two. Is there a command for this? My Table 1 command did it automatically. I hope this makes sense. Thanks!
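Since both pregnancies are observed for the same women, paired tests are one hedged way to get the p-values (the variable names below follow the _1/_2 pattern above and are illustrative):

Code:
ttest age_1 == age_2            // paired t-test for a continuous variable
mcc obese_1 obese_2             // McNemar's test for a paired binary variable
signrank income_1 = income_2    // Wilcoxon signed-rank test for a skewed variable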

question about xtabond2 output

I am running the syntax below:

Code:
xtabond2 generalcrime l.generalcrime proactivity feb mar apr may jun jul aug sep oct nov dec, gmm(proactivity generalcrime, lag(2 4)) iv(feb mar apr may jun jul aug sep oct nov dec) nolevel twostep robust

Code:

Dynamic panel-data estimation, two-step difference GMM
------------------------------------------------------------------------------
Group variable: id                              Number of obs      =     26150
Time variable : week                            Number of groups   =       523
Number of instruments = 305                     Obs per group: min =        50
Wald chi2(13) =    127.65                                      avg =     50.00
Prob > chi2   =     0.000                                      max =        50
------------------------------------------------------------------------------
             |              Corrected
generalcrime |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
generalcrime |
         L1. |   .0084245    .007643     1.10   0.270    -.0065556    .0234045
             |
 proactivity |   .0918542   .0309783     2.97   0.003     .0311378    .1525705
         feb |  -.3357907   .2197143    -1.53   0.126    -.7664227    .0948414
         mar |  -.1039943   .2222107    -0.47   0.640    -.5395193    .3315306
         apr |  -.1241196   .2364084    -0.53   0.600    -.5874715    .3392323
         may |   .3990998   .2487457     1.60   0.109    -.0884328    .8866324
         jun |   .6014529   .2578449     2.33   0.020     .0960862     1.10682
         jul |   .6369464   .2641964     2.41   0.016     .1191309    1.154762
         aug |   .3615521   .2706158     1.34   0.182    -.1688452    .8919494
         sep |  -.0556289   .2523996    -0.22   0.826    -.5503231    .4390653
         oct |  -.2701021   .2545519    -1.06   0.289    -.7690146    .2288105
         nov |  -.7050279   .2654339    -2.66   0.008    -1.225269    -.184787
         dec |   -.536009   .2914149    -1.84   0.066    -1.107172    .0351537
------------------------------------------------------------------------------
Instruments for first differences equation
  Standard
    D.(feb mar apr may jun jul aug sep oct nov dec)
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    L(2/4).(proactivity generalcrime)
------------------------------------------------------------------------------
Arellano-Bond test for AR(1) in first differences: z =  -2.96  Pr > z =  0.003
Arellano-Bond test for AR(2) in first differences: z =  -1.01  Pr > z =  0.312
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(292)  = 558.93  Prob > chi2 =  0.000
  (Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(292)  = 289.84  Prob > chi2 =  0.525
  (Robust, but weakened by many instruments.)

Difference-in-Hansen tests of exogeneity of instrument subsets:
  iv(feb mar apr may jun jul aug sep oct nov dec)
    Hansen test excluding group:     chi2(281)  = 280.43  Prob > chi2 =  0.498
    Difference (null H = exogenous): chi2(11)   =   9.40  Prob > chi2 =  0.585
However, when I change the main independent variable from proactivity to l.proactivity, the direction of the effect changes. What does this mean? Should I use proactivity in levels if I'm interested in the total effect of proactivity on crime?

Code:
. xtabond2 generalcrime l.generalcrime l.proactivity feb mar apr may jun jul aug sep oct nov dec, gmm(proactivity generalcrime, lag(2 4)) iv(feb mar apr may jun jul aug sep oct nov dec) nolevel twostep robust
Favoring space over speed. To switch, type or click on mata: mata set matafavor speed, perm.

Dynamic panel-data estimation, two-step difference GMM
------------------------------------------------------------------------------
Group variable: id                              Number of obs      =     26150
Time variable : week                            Number of groups   =       523
Number of instruments = 305                     Obs per group: min =        50
Wald chi2(13) =    121.25                                      avg =     50.00
Prob > chi2   =     0.000                                      max =        50
------------------------------------------------------------------------------
             |              Corrected
generalcrime |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
generalcrime |
         L1. |   .0035523   .0066324     0.54   0.592    -.0094469    .0165516
             |
 proactivity |
         L1. |  -.0302871   .0110876    -2.73   0.006    -.0520183   -.0085559
             |
         feb |  -.3070974   .2207157    -1.39   0.164    -.7396923    .1254975
         mar |  -.0884815   .2224854    -0.40   0.691    -.5245449    .3475818
         apr |  -.1001165   .2368536    -0.42   0.673     -.564341    .3641081
         may |   .3628104    .256478     1.41   0.157    -.1398773    .8654981
         jun |   .5504192   .2667749     2.06   0.039     .0275501    1.073288
         jul |   .5248709   .2752155     1.91   0.057    -.0145415    1.064283
         aug |   .2397443   .2874894     0.83   0.404    -.3237246    .8032131
         sep |  -.2070373   .2700674    -0.77   0.443    -.7363596     .322285
         oct |   -.446874   .2807123    -1.59   0.111    -.9970599     .103312
         nov |   -.907495   .2868705    -3.16   0.002    -1.469751   -.3452392
         dec |   -.714167   .3153677    -2.26   0.024    -1.332276   -.0960577
------------------------------------------------------------------------------
Instruments for first differences equation
  Standard
    D.(feb mar apr may jun jul aug sep oct nov dec)
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    L(2/4).(proactivity generalcrime)
------------------------------------------------------------------------------
Arellano-Bond test for AR(1) in first differences: z =  -2.96  Pr > z =  0.003
Arellano-Bond test for AR(2) in first differences: z =  -1.34  Pr > z =  0.181
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(292)  = 562.59  Prob > chi2 =  0.000
  (Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(292)  = 292.47  Prob > chi2 =  0.481
  (Robust, but weakened by many instruments.)

Difference-in-Hansen tests of exogeneity of instrument subsets:
  iv(feb mar apr may jun jul aug sep oct nov dec)
    Hansen test excluding group:     chi2(281)  = 282.66  Prob > chi2 =  0.461
    Difference (null H = exogenous): chi2(11)   =   9.81  Prob > chi2 =  0.548
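One hedged way to probe this, sketched below as a hypothetical variant rather than a recommendation: include the contemporaneous and lagged terms together, so the total short-run effect is the sum of the two coefficients.

Code:
xtabond2 generalcrime l.generalcrime proactivity l.proactivity ///
    feb mar apr may jun jul aug sep oct nov dec, ///
    gmm(proactivity generalcrime, lag(2 4)) ///
    iv(feb mar apr may jun jul aug sep oct nov dec) nolevel twostep robust
lincom proactivity + L.proactivity    // total short-run effect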



Graph data onto a map with spmap

Hi everyone,

I kindly ask for your help. I am running the following:

Code:
shp2dta using /Users/jennyperez/personal/STUDY/maping/e.g./sudamerica_adm2.shp, database(sudamdb) coordinates(sudamcoord) genid(id)

and this is what happens:

file sudamcoord.dta could not be opened

I tried some of the suggestions I found in the forum, but nothing worked. If someone can help me identify the problem, I would really appreciate it.
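A hedged guess at a fix: that error usually means Stata cannot write the output file in the current working directory, or the file already exists. Changing to a directory where you have write permission and adding the replace option may help:

Code:
cd "/Users/jennyperez/personal/STUDY/maping/e.g."
shp2dta using sudamerica_adm2.shp, database(sudamdb) ///
    coordinates(sudamcoord) genid(id) replace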

Best,
Jenny

fuzzy match date of birth w/ typos in date of birth

I would like to create an id variable that uniquely identifies people within a dataset that looks like this:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float cardholderid str9 firstname str6 lastname str4(fsx lsx) str9 dob
12345 "kit" "fugate" "K300" "F230" "13dec1997"
12345 "kit" "fugate" "K300" "F230" "13dec1997"
60000 "kit" "fugate" "K300" "F230" "13nov1997"
23456 "ben" "smith"  "B500" "S530" "27jan2004"
23456 "ben" "smith"  "B500" "S530" "27jan2004"
70000 "ben" "smith"  "B500" "S530" "26jan2004"
end
cardholderid is a variable in the original data that is supposed to uniquely identify people, but it does not. I have improved the data by running code like group_id cardholderid, matchby(fsx lsx dob) to cut down on repeat cardholderids issued to people who have typos in their names but the same date of birth.

The issue is that there is also a fair number of date-of-birth typos; I've made the mock dataset show that clearly -- you can easily tell that cardholderid 12345 and 60000 identify the same person, who has a relatively unique name, but the date of birth is off by exactly a month. The dob typos are all over the place: sometimes the date is off by a few days (e.g., the same uncommon name in two entries, one with dob 13nov1980 and one with dob 19nov1980), sometimes the month is off by one (as with kit fugate above), and sometimes the year is off by up to five years for the same uncommon name.

How can I de-duplicate my dataset to create one id for someone like kit fugate, born 13 dec/nov 1997? Some of the typos are more obvious than others -- based on how bad the dob typos are (date off by a day versus off by 5 years) and how uncommon the name is (if two John Smiths share a birth month+year but differ on the day, we would be less certain they are really the same person than two Rashida Tlaibs who share a birth month+year but are off by a day) -- so it would be great to have some way to build a similarity score and check manually, as opposed to doing something like group_id cardholderid, matchby(fsx lsx dob_month dob_year).
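As a hedged starting point for such a score (a rough sketch, not a full solution): convert dob to a Stata date and compare its day, month, and year within each name-soundex group, then review the near-misses manually.

Code:
gen ddob = daily(dob, "DMY")
format ddob %td
* crude 0-3 similarity score against the previous record in the same soundex group
bysort fsx lsx (ddob): gen score = (day(ddob)==day(ddob[_n-1])) ///
    + (month(ddob)==month(ddob[_n-1])) + (year(ddob)==year(ddob[_n-1])) if _n > 1
list cardholderid firstname lastname dob score, sepby(fsx lsx)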

Thanks in advance!


Handle data with blanks and zeros with PPML

Dear all,
I am trying to estimate a gravity model with PPML.
I have a question; it may be simple, or the question itself may be ill-posed, but I would like to ask.
In my export data I have many blanks because there is no trade between some countries in some years.
My question is: shall I leave the blanks in the export data and run the model, or shall I replace the blanks with zeros and then run the model?
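A hedged sketch of the zero-trade route, on the assumption that a blank truly means zero trade rather than unrecorded data (ppmlhdfe is user-written, and the covariates and fixed effects below are placeholders):

Code:
replace exports = 0 if missing(exports)   // only if blank really means no trade
* ssc install ppmlhdfe
ppmlhdfe exports lndist contig, absorb(i.exporter#i.year i.importer#i.year)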
Best wishes to all of you.
CAbreo

figure for standardised diff

Hi folks,

I have access to Stata 14. I want to create something like the following figure. What is the best way to go about accomplishing it in Stata?

[Figure omitted: plot of standardized differences. Source: Austin PC 2009.]
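Not knowing your data, a hedged sketch of one way to draw this kind of dot plot, assuming you have already computed the standardized differences into variables smd_unmatched and smd_matched with one row per covariate (all names hypothetical):

Code:
graph dot smd_unmatched smd_matched, over(covariate) ///
    legend(label(1 "Unmatched") label(2 "Matched")) ///
    ytitle("Standardized difference") yline(0.1)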

Thanks.

Identifying patterns in survey responses

I have survey data with IDs and responses (0, 25, 50, 75, 100); they are typical 5-point Likert-scale questions. I need to see whether there are patterns where a participant's choices never changed (they selected only one point).
For instance, in the example below, ID 9 selected 50 throughout, while ID 1 selected all 0s.
Thank you,

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int id byte(question1 question2 question3 question4 question5 question6 question7 question8 question9 question10)
 1   0   0   0   0   0   0   0   0   0   0
 2  50  50   0   0  50   0 100   0   0   0
 3  50  50   0   0  50   0 100   0   0   0
 4  50  50   0   0  50   0  50   0   0   0
 5  50  50   0  50  50   0  50  50  50  50
 6  75 100  50 100 100  50 100   0 100 100
 7  75  75  50  50 100  50 100  50  50 100
 8  25  50   0   0  50   0   0   0   0  50
 9  50  50  50  50  50  50  50  50  50  50
10  25  75   0   0  50   0  50   0   0  50
11  50 100   0 100 100 100 100 100 100 100
12  25 100  50 100 100 100 100 100 100 100
13  25  25   0   0 100 100 100 100  50  50
14  75  50  50 100 100  50 100  50 100 100
15 100 100 100 100 100 100 100 100 100 100
16  75  75 100 100 100 100 100 100 100 100
17  50  25  50 100 100  50 100 100 100  50
18  50  50   0   0  50  50  50   0   0   0
19  25  75   0  50   0   0   0  50  50   0
20  25  50   0   0   0   0   0   0   0   0
end
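A minimal sketch of one approach: the row standard deviation is zero exactly when a respondent gave the same answer to every question.

Code:
egen sd_row = rowsd(question1-question10)
gen byte straightliner = sd_row == 0   // 1 if the respondent never varied
list id if straightliner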

Econometrics

I'm having issues with Stata commands. I'm brand new to Stata and still trying to find my way around the command syntax; if anybody can help or give any sort of advice on commands, I would really appreciate it.
Thanks

Simple equation by variable

I am attempting to create a new variable as
newvar = var1 - lagvar1 + var2.
However, given the rudimentary way my lag variable was built, it cuts across different companies (var3), so I also want to restrict this calculation to ensure that both var1 and lagvar1 correspond to the same company (var3).
Thus far I am drawing a blank; any help would be much appreciated.
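A hedged sketch, assuming var3 is a numeric company identifier and a time variable exists (timevar is a placeholder): declaring the panel lets the L. operator stay within each company automatically.

Code:
xtset var3 timevar
gen newvar = var1 - L.var1 + var2   // L.var1 is missing at each company's first period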

Varsoc command

Hi guys,

I'm new here and I'm having some issues with the varsoc command. When I run it, Stata shows no results at lag 4 (except for FPE), and I don't know why. Should I consider this model correct?

Code:
varsoc lnBio lnHydro lnNRE lnConsPC lnPIBPC lnBBR

Selection-order criteria
Sample: 1989 - 2015                                      Number of obs = 27

+--------------------------------------------------------------------------------+
| lag |      LL       LR    df      p        FPE        AIC       HQIC       SBIC |
|-----+--------------------------------------------------------------------------|
|   0 |  105.478                           2.5e-11   -7.36877   -7.28315  -7.08081 |
|   1 |  249.296   287.64   36  0.000      9.3e-15   -15.3553   -14.7559  -13.3395 |
|   2 |  292.499   86.406   36  0.000      8.5e-15   -15.8888   -14.7757  -12.1453 |
|   3 |  372.749   160.5*   36  0.000      1.5e-15   -19.1666*  -17.5397* -13.6953* |
|   4 |        .        .   36      .    -1.0e-78*          .          .         . |
+--------------------------------------------------------------------------------+
Endogenous:  lnBio lnHydro lnNRE lnConsPC lnPIBPC lnBBR
Exogenous:   _cons
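A hedged guess at what is happening: with only 27 observations and 6 endogenous variables, a VAR(4) leaves essentially no degrees of freedom, so the lag-4 row cannot be computed and its FPE degenerates. Restricting the lag search is one sanity check:

Code:
varsoc lnBio lnHydro lnNRE lnConsPC lnPIBPC lnBBR, maxlag(3)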

Thanks

Python modules not found even after specifying python_userpath

I am having some trouble getting the new Python integration to work in Stata 16. Specifically, some of my modules are not found even after I specify the python executable file and set the python userpath.

I am on a linux machine on which I do not have administrative privileges but do have substantial freedom to keep files and install software locally in my home directory. The Stata license is system-wide but my Python 3.7 installation is local (via miniconda3). There are other Python versions installed in system locations but I do not want to use them, as I would like to be able to manage my own Python packages which are not all available on the wider system.

I have instructed Stata to find my local Python executable by setting python_exec, and have also set the python_userpath variable to the site-packages/ directory of my local Python installation, with the prepend option. The directory structure is such that, within site-packages/, there are many directories corresponding to package names, and these contain the .py files for the installed packages. It is a standard miniconda3 structure. Running
Code:
python query
confirms that the two variables have been set and that the correct Python system has been identified with its corresponding library.
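For reference, a hedged sketch of the configuration described (the paths are hypothetical):

Code:
set python_exec "/home/me/miniconda3/bin/python3", permanently
set python_userpath "/home/me/miniconda3/lib/python3.7/site-packages", prepend permanently
python query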

Some packages are found; for example,
Code:
python where numpy
displays the correct location for numpy. However,
Code:
python where pandas
results in error r(601), "Python module pandas not found," even though pandas is installed and its file structure is extremely similar to that of numpy (both relevant files are named __init__.py and are in an eponymous directory within site-packages/).

I tried adding site-packages/pandas/ to the library path, but with no success. I also tried creating a symbolic link to pandas/__init__.py with the name pandas.py in the site-packages/ directory, hoping that Stata would be looking for a file with that name in that location, but pandas is still not found. I'm not sure what might be causing this and I feel I have exhausted the available documentation. Please advise.

Simpson's paradox

Hi! Is there a way to visualize Simpson's paradox, and how do I use entropyetc? It is not really clear to me yet. I am testing the effect of the change in news risk on weekly (log) returns. Please find my results below.
perccomb3 is the change in news risk, for which I created dummy variables based on different amounts of news risk. I also included the dummy multiplied by the change in news risk in the regression, and these interaction terms show a positive tendency (while the scatter plot only shows a downward slope).
[Image omitted: regression results]
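On the visualization question, a hedged sketch of the classic Simpson's-paradox picture (the variable names are placeholders): plot the pooled fit against the within-group fits and see whether the slopes disagree.

Code:
twoway (scatter ret dnewsrisk) (lfit ret dnewsrisk) ///
    (lfit ret dnewsrisk if lowrisk==1) (lfit ret dnewsrisk if lowrisk==0), ///
    legend(order(2 "Pooled fit" 3 "Low-risk group" 4 "High-risk group"))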

Replace value in many observations and variables

Dear all,
I have a dataset of around 50 variables and 200 observations. I would like to replace the word 'quantity' with the value 0 wherever it appears in any of the variables.
For example, in the following data, I would like to replace the word 'quantity' with the value 0 in variables 1 and 2.
Observation    Var 1       Var 2
          1    400         200
          2    quantity    800
          3    300         quantity
          4    quantity    1000
The question is: what command can I use to do this across all 50 variables and 200 observations in a simple way?
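A hedged sketch (this assumes the variables are currently string, which they must be for the word 'quantity' to appear, and that they are named var1-var50; adjust the varlist to match your data):

Code:
foreach v of varlist var1-var50 {
    replace `v' = "0" if `v' == "quantity"
}
destring var1-var50, replace    // convert to numeric once only digits remain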
Thank you very much!

Generating a dummy variable conditional on other dummy variables

Hi everyone,

I have two dummy variables, holder67 and holder30, which equal 1 if the value of another variable, COND, is at or above 0.67 and 0.30, respectively.

Now I am looking to generate a new dummy variable, "holder30_67", which will equal 1 if holder30==1 but holder67==0, to capture a low-to-moderate degree of overconfidence.

What would be the code to generate such a dummy variable from the values of the other dummy variables?
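A minimal sketch; the if qualifier keeps the new dummy missing whenever either input is missing:

Code:
gen byte holder30_67 = (holder30==1 & holder67==0) if !missing(holder30, holder67)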

Thanks in advance!

Finding R squared and Adjusted R squared values on xtreg

$
0
0
Hi Everyone,

I am conducting research on panel data, declared with the xtset command in Stata.

Code:
xtset company_id fiscalyear
       panel variable:  company_id (unbalanced)
        time variable:  fiscalyear, 1992 to 2018, but with gaps
                delta:  1 unit

Now when I run xtreg, I am having trouble finding the R2 and adjusted R2 figures that regular regression output reports. My xtreg output follows:

Code:
xtreg PMP holder30 holder67 age sex

Random-effects GLS regression                   Number of obs     =    25,025
Group variable: company_id                      Number of groups  =     2,202

R-sq:                                           Obs per group:
     within  = 0.0010                                         min =         1
     between = 0.0114                                         avg =      11.4
     overall = 0.0018                                         max =        26

                                                Wald chi2(4)      =     43.75
corr(u_i, X) = 0 (assumed)                      Prob > chi2       =    0.0000

------------------------------------------------------------------------------
         PMP |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    holder30 |   .0156001   .0156502     1.00   0.319    -.0150738    .0462739
    holder67 |   .0460581   .0133362     3.45   0.001     .0199197    .0721966
         age |  -.0021195   .0006194    -3.42   0.001    -.0033336   -.0009054
         sex |   .0428783   .0295993     1.45   0.147    -.0151353    .1008919
       _cons |   .0325781   .0445875     0.73   0.465    -.0548119     .119968
-------------+----------------------------------------------------------------
     sigma_u |  .08218976
     sigma_e |  .68331747
         rho |  .01426107   (fraction of variance due to u_i)
------------------------------------------------------------------------------

Is there any command or anything I need to do to find the R-squared values?
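For what it's worth, a hedged note: xtreg reports within/between/overall R-squared rather than a single adjusted R-squared. If you want the conventional R-squared and adjusted R-squared from a fixed-effects fit, areg stores both:

Code:
areg PMP holder30 holder67 age sex, absorb(company_id)
display "R-sq = " e(r2) ",  adj. R-sq = " e(r2_a)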

Thanks in advance!

Random forests and neural nets in Stata (with Python integration)

Hey all,

I wrote a couple of Stata packages to do regression or classification with random forests and neural networks (specifically, multi-layer perceptrons) in Stata 16. These programs are basically wrappers for methods in the popular Python library scikit-learn. The packages automatically load the required Stata variables into Python, apply scikit-learn methods to the data, and return predictions and other information to Stata's interface. This is essentially an expanded version of the example .ado file provided in Stata's release notes for the new Stata Function Interface.

I split these into two separate packages:
1. pyforest.ado - regression and classification with random forests
2. pymlp.ado - regression and classification with multi-layer perceptrons

The syntax for specifying optional arguments is nearly identical to the syntax used in scikit-learn. This means that the scikit-learn documentation is also a readable reference for using these packages. Of course, both of these packages also contain built-in Stata help files.

You can read a bit more about these packages and install them with instructions on GitHub:
https://github.com/mdroste/stata-pyforest
https://github.com/mdroste/stata-pymlp

I am still actively developing both of these packages, and I plan to submit them to SSC very soon. I am sure there are some bugs that will need to be fixed before then, since I put both of them together over the last two days or so. There's a whole bunch of stuff that I think should be added, but since both seem to be very much usable right now, I figured it's worth posting what I have for now.

If you have any issues with these packages, definitely let me know either on this thread or on Github.

I hope this is useful!

Mike