Hi guys! I am new to Stata and I have a problem with analyzing and translating .dta file encodings with unicode. After typing "unicode analyze mydata.dta" in Stata/SE 15.1, I got the message "1 file ignored (too big or not Stata)". The file is not larger than 2 GB. Do you have any idea how to solve this? Thanks a lot!
↧
Unicode analyze error - too big file
↧
How to do calculations between different data types (double, long, float)?
When I add two numeric variables, the result is rounded in a way that makes no sense to me.
My actual storage type for pid/hid is "long". The MWE below uses float, but it illustrates the same problem.
I would like to multiply the first identifier by the maximum of the second and add the second to it, in order to create a unique new identifier. But I cannot figure out what storage type I should use.
Code:
clear
input pid hid
513 60313
513 60313
513    94
513    94
514   167
514   175
515   175
516   175
end
Code:
gen pid_float = float(pid)
gen pid2 = pid_float*1000000
format pid2 %15.0f
gen hid_float = float(hid)
gen pid_hid = pid2 + hid_float
format pid_hid %15.0f
list pid hid pid_float pid2 hid_float pid_hid

     +-----------------------------------------------------------+
     | pid     hid   pid_fl~t        pid2   hid_fl~t     pid_hid |
     |-----------------------------------------------------------|
  1. | 513   60313        513   513000000      60313   513060320 |
  2. | 513   60313        513   513000000      60313   513060320 |
  3. | 513      94        513   513000000         94   513000096 |
  4. | 513      94        513   513000000         94   513000096 |
  5. | 514     167        514   514000000        167   514000160 |
     |-----------------------------------------------------------|
  6. | 514     175        514   514000000        175   514000160 |
  7. | 515     175        515   515000000        175   515000160 |
  8. | 516     175        516   516000000        175   516000160 |
     +-----------------------------------------------------------+
How can I handle those numbers as normal numbers? Why is hid 60313 rounded to 60320? Thank you very much.
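For reference, a minimal sketch of one way around this (not from the original post): floats carry only about 7 significant digits, so a number like 513060313 cannot be stored exactly. Generating the combined identifier as a double, or building a group id with egen, keeps full precision. Variable names follow the MWE above.
Code:
* Hedged sketch: doubles hold integers exactly up to 2^53, so the sum is exact.
gen double pid_hid2 = pid*1000000 + hid
format pid_hid2 %15.0f
* Alternatively, a compact unique identifier without any arithmetic:
egen long pair_id = group(pid hid)
list pid hid pid_hid2 pair_id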
↧
↧
Panel Data Analysis (Run desired regression)
Dear Specialist
First of all, thanks for reading.
I have a panel dataset. I treated Region as the panel (id) variable and Year as the time variable. I would like to run a regression of the following form:
GVA_rt = β0 + β1*WageGap_rt + β2*WageGap_r,t-1 + Region_r + ε_rt
where r indexes regions and t indexes years.
Which code should I use to estimate this model?
Thanks again for reading.
Have a good day.
Best Regards
Kangchen Sun
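For reference, a minimal sketch of how such a model could be estimated, assuming the variables are named Region, Year, GVA, and WageGap (adjust to the actual names); it is not meant as the definitive specification.
Code:
* Hedged sketch: declare the panel, then include the contemporaneous and
* one-period-lagged wage gap with region fixed effects.
* (encode Region first if it is stored as a string)
xtset Region Year
xtreg GVA WageGap L.WageGap, fe
* Equivalent with explicit region dummies:
* regress GVA WageGap L.WageGap i.Region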
↧
Multiple imputation for correlated exposure variables
Hello. I am trying to perform multiple imputation in my dataset using mi impute chained.
The dataset has three exposure variables, e.g. smoking at time 1, smoking at time 2, and smoking at time 3. I generated a 4th exposure variable: smoking at any time. All exposure variables are binary (0 vs 1).
My question is how to impute "smoking_any".
Code:
gen smoking_any = 0 if smoking1==0 & smoking2==0 & smoking3==0
replace smoking_any = 1 if smoking1==1 | smoking2==1 | smoking3==1
smoking1 | smoking2 | smoking3 | smoking_any |
1 | 0 | . | 1 |
0 | 0 | . | . |
0 | 1 | . | 1 |
. | . | 1 | 1 |
0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 |
. | 0 | 0 | . |
0 | 0 | . | . |
When I include all four exposure variables in the imputation model, it always says ".... predicts data perfectly" or "convergence not achieved".
If I impute "smoking_any" separately from the other three variables, it looks like the prevalence of smoking at any time would be overestimated.
Can I use a passive imputation approach after I impute smoking1, smoking2, and smoking3? But it is said that "this method is actually a misspecification of your imputation model and will lead to biased parameter estimates in your analytic model".
Thank you very much.
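For reference, a minimal sketch of the passive route mentioned above; the covariates age and sex are placeholders that are not in the original post, and whether passive derivation is appropriate here is exactly the question being asked.
Code:
* Hedged sketch: impute the three time-specific indicators, then derive
* smoking_any from the completed data with -mi passive-.
mi set wide
mi register imputed smoking1 smoking2 smoking3
mi impute chained (logit) smoking1 smoking2 smoking3 = age sex, add(20) rseed(12345)
mi passive: generate smoking_any = max(smoking1, smoking2, smoking3)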
↧
Table 1 help or test for significant difference
Hello, I have a dataset in which I am trying to compare longitudinal data from successive pregnancies.
Each variable I have is coded for pregnancy 1 or pregnancy 2. With the exception of race/ethnicity, all variables differ between pregnancy 1 and pregnancy 2.
I am trying to set up a Table 1 comparing continuous, categorical, and binary variables (education, annual income, mean age, hypertension status, etc.) at the time points of either pregnancy 1 or pregnancy 2.
Previously I used a command for Table 1, but that won't work here because I am not stratifying by a binary outcome but rather by pregnancy 1 vs pregnancy 2.
I am currently using the tab or codebook commands to compare each variable, e.g.:
tab edu_1
tab edu_2
tab obese_1
tab obese_2
This gives me the column percentages but no p-value comparing the two. Is there a command for this? My Table 1 command did it automatically. I hope this makes sense. Thanks!
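A minimal sketch of paired comparisons that could feed such a table. It assumes the wide naming pattern shown above (e.g. age_1/age_2) and an id variable for each woman, both of which are assumptions; the statistical choices are illustrative, not prescriptive.
Code:
* Hedged sketch: because both pregnancies come from the same women, paired
* tests are the natural choice rather than independent-group tests.
* Continuous variable (e.g. maternal age) across the two pregnancies:
ttest age_1 == age_2
* Binary variable (e.g. obesity): McNemar's test for paired proportions
mcc obese_1 obese_2
* For a quick cross-tabulation that ignores the pairing, reshape long first:
* reshape long edu_ obese_, i(id) j(pregnancy)
* tab edu_ pregnancy, col chi2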
↧
↧
question about xtabond2 output
I am running the syntax below.
However, when I change the main IV from proactivity to L.proactivity, the direction of the effect changes. What does this mean? Should I use proactivity in levels if I'm interested in the total effect proactivity has on crime?
Code:
xtabond2 generalcrime l.generalcrime proactivity feb mar apr may jun jul aug sep oct nov dec, ///
    gmm(proactivity generalcrime, lag(2 4)) ///
    iv(feb mar apr may jun jul aug sep oct nov dec) nolevel twostep robust
Code:
Dynamic panel-data estimation, two-step difference GMM
------------------------------------------------------------------------------
Group variable: id                               Number of obs      =     26150
Time variable : week                             Number of groups   =       523
Number of instruments = 305                      Obs per group: min =        50
Wald chi2(13) = 127.65                                          avg =     50.00
Prob > chi2   = 0.000                                           max =        50
------------------------------------------------------------------------------
             |              Corrected
generalcrime |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
generalcrime |
         L1. |   .0084245    .007643     1.10   0.270    -.0065556    .0234045
             |
 proactivity |   .0918542   .0309783     2.97   0.003     .0311378    .1525705
         feb |  -.3357907   .2197143    -1.53   0.126    -.7664227    .0948414
         mar |  -.1039943   .2222107    -0.47   0.640    -.5395193    .3315306
         apr |  -.1241196   .2364084    -0.53   0.600    -.5874715    .3392323
         may |   .3990998   .2487457     1.60   0.109    -.0884328    .8866324
         jun |   .6014529   .2578449     2.33   0.020     .0960862     1.10682
         jul |   .6369464   .2641964     2.41   0.016     .1191309    1.154762
         aug |   .3615521   .2706158     1.34   0.182    -.1688452    .8919494
         sep |  -.0556289   .2523996    -0.22   0.826    -.5503231    .4390653
         oct |  -.2701021   .2545519    -1.06   0.289    -.7690146    .2288105
         nov |  -.7050279   .2654339    -2.66   0.008    -1.225269     -.184787
         dec |   -.536009   .2914149    -1.84   0.066    -1.107172    .0351537
------------------------------------------------------------------------------
Instruments for first differences equation
  Standard
    D.(feb mar apr may jun jul aug sep oct nov dec)
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    L(2/4).(proactivity generalcrime)
------------------------------------------------------------------------------
Arellano-Bond test for AR(1) in first differences: z =  -2.96  Pr > z =  0.003
Arellano-Bond test for AR(2) in first differences: z =  -1.01  Pr > z =  0.312
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(292)  = 558.93  Prob > chi2 =  0.000
  (Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(292)  = 289.84  Prob > chi2 =  0.525
  (Robust, but weakened by many instruments.)
Difference-in-Hansen tests of exogeneity of instrument subsets:
  iv(feb mar apr may jun jul aug sep oct nov dec)
    Hansen test excluding group:     chi2(281)  = 280.43  Prob > chi2 =  0.498
    Difference (null H = exogenous): chi2(11)   =   9.40  Prob > chi2 =  0.585
Code:
. xtabond2 generalcrime l.generalcrime l.proactivity feb mar apr may jun jul aug sep oct nov dec, gmm(proactivity generalcrime, lag(2 4))
> iv(feb mar apr may jun jul aug sep oct nov dec) nolevel twostep robust
Favoring space over speed. To switch, type or click on mata: mata set matafavor speed, perm.

Dynamic panel-data estimation, two-step difference GMM
------------------------------------------------------------------------------
Group variable: id                               Number of obs      =     26150
Time variable : week                             Number of groups   =       523
Number of instruments = 305                      Obs per group: min =        50
Wald chi2(13) = 121.25                                          avg =     50.00
Prob > chi2   = 0.000                                           max =        50
------------------------------------------------------------------------------
             |              Corrected
generalcrime |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
generalcrime |
         L1. |   .0035523   .0066324     0.54   0.592    -.0094469    .0165516
             |
 proactivity |
         L1. |  -.0302871   .0110876    -2.73   0.006    -.0520183   -.0085559
             |
         feb |  -.3070974   .2207157    -1.39   0.164    -.7396923    .1254975
         mar |  -.0884815   .2224854    -0.40   0.691    -.5245449    .3475818
         apr |  -.1001165   .2368536    -0.42   0.673     -.564341    .3641081
         may |   .3628104    .256478     1.41   0.157    -.1398773    .8654981
         jun |   .5504192   .2667749     2.06   0.039     .0275501    1.073288
         jul |   .5248709   .2752155     1.91   0.057    -.0145415    1.064283
         aug |   .2397443   .2874894     0.83   0.404    -.3237246    .8032131
         sep |  -.2070373   .2700674    -0.77   0.443    -.7363596     .322285
         oct |   -.446874   .2807123    -1.59   0.111    -.9970599     .103312
         nov |   -.907495   .2868705    -3.16   0.002    -1.469751   -.3452392
         dec |   -.714167   .3153677    -2.26   0.024    -1.332276   -.0960577
------------------------------------------------------------------------------
Instruments for first differences equation
  Standard
    D.(feb mar apr may jun jul aug sep oct nov dec)
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    L(2/4).(proactivity generalcrime)
------------------------------------------------------------------------------
Arellano-Bond test for AR(1) in first differences: z =  -2.96  Pr > z =  0.003
Arellano-Bond test for AR(2) in first differences: z =  -1.34  Pr > z =  0.181
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(292)  = 562.59  Prob > chi2 =  0.000
  (Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(292)  = 292.47  Prob > chi2 =  0.481
  (Robust, but weakened by many instruments.)
Difference-in-Hansen tests of exogeneity of instrument subsets:
  iv(feb mar apr may jun jul aug sep oct nov dec)
    Hansen test excluding group:     chi2(281)  = 282.66  Prob > chi2 =  0.461
    Difference (null H = exogenous): chi2(11)   =   9.81  Prob > chi2 =  0.548
↧
Graph data onto a map with spmap
Hi everyone,
I kindly ask for your help. I am running the following:
Code:
shp2dta using /Users/jennyperez/personal/STUDY/maping/e.g./sudamerica_adm2.shp, database(sudamdb) coordinates(sudamcoord) genid(id)
and this is what happens:
file sudamcoord.dta could not be opened
I tried some of the suggestions that I found in the forum, but nothing worked. If someone can help me identify what the problem is, I would really appreciate it.
Best,
Jenny
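For what it's worth, a hedged guess at a sketch: "file ... could not be opened" often means the output .dta cannot be written where Stata is currently pointed (or a file of that name already exists or is locked), so changing to a writable working directory and adding the replace option is one thing to try. The path below simply follows the post.
Code:
* Hedged sketch: write the output files into a known, writable folder.
cd "/Users/jennyperez/personal/STUDY/maping"
shp2dta using "e.g./sudamerica_adm2.shp", database(sudamdb) coordinates(sudamcoord) genid(id) replace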
↧
fuzzy match date of birth w/ typos in date of birth
I would like to create an id variable that uniquely identifies people within a dataset that looks like this:
cardholderid is a variable in the original data that is supposed to uniquely identify people, but it does not. I have improved the data by running code like group_id cardholderid, matchby(fsx lsx dob) to cut down on repeat cardholderids issued to people who have typos in their names but the same date of birth.
The issue is that there are also a fair number of date-of-birth typos; I've made the mock dataset below show that clearly -- you can easily tell that cardholderid 12345 and 60000 identify the same person, who has a relatively unique name, but the date of birth is off by exactly a month. The dob typos are all over the place: sometimes the date is off by a few days (e.g., the same uncommon name with two entries, one with dob 13nov1980 and one with dob 19nov1980), sometimes the month is off by one (as with kit fugate below), and sometimes the year is off by as much as five years for the same uncommon name.
How can I de-duplicate my dataset to create one id for someone like kit fugate, born 13 dec/nov 1997? Some of the typos are more obvious than others -- based on how bad the dob typo is (date off by a day versus off by 5 years) and how uncommon the name is (if there are two John Smiths who share a birth month+year but have a different day, we would be less certain these are really the same person than we would be for two Rashida Tlaibs who share a birth month+year but are off by a day) -- so it would be great to have some way to compute a similarity score and check matches manually, as opposed to doing something like group_id cardholderid, matchby(fsx lsx dob_month dob_year).
Thanks in advance!
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float cardholderid str9 firstname str6 lastname str4(fsx lsx) str9 dob
12345 "kit" "fugate" "K300" "F230" "13dec1997"
12345 "kit" "fugate" "K300" "F230" "13dec1997"
60000 "kit" "fugate" "K300" "F230" "13nov1997"
23456 "ben" "smith"  "B500" "S530" "27jan2004"
23456 "ben" "smith"  "B500" "S530" "27jan2004"
70000 "ben" "smith"  "B500" "S530" "26jan2004"
end
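A minimal sketch of one way to surface candidate pairs for manual review, using only the variables in the dataex above; the 40-day window is an arbitrary illustrative threshold, not a recommendation.
Code:
* Hedged sketch: convert dob to a Stata date, then within groups sharing the
* same soundex codes flag records whose birth dates are close together.
gen ddob = date(dob, "DMY")
format ddob %td
bysort fsx lsx (ddob): gen close_prev = (ddob - ddob[_n-1] <= 40) if _n > 1
list cardholderid firstname lastname dob if close_prev == 1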
↧
Handle data with blanks and zeros with PPML
Dear all.
I am trying to run a gravity model with PPML.
I have a question. It may be simple, or the question itself may be ill-posed, but I would like to ask it.
In my export data I have many blanks because there is no trade between some countries in some years.
My question is: should I leave the blanks in the export data and run the model, or should I replace the blanks with zeros and then run the model?
Best wishes to all of you.
CAbreo
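For reference, a hedged sketch of the usual practice in the PPML gravity literature: to the extent that a blank really means zero trade rather than unreported data, it is treated as a genuine zero flow so those observations enter the estimation. The variable names below are placeholders, not from the original post.
Code:
* Hedged sketch (placeholder variable names):
replace exports = 0 if missing(exports)
* ppml is the user-written command by Santos Silva & Tenreyro (ssc install ppml):
ppml exports lngdp_o lngdp_d lndist contiguity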
↧
↧
figure for standardised diff
Hi folks,
I have access to Stata 14. I want to create something like the following figure. What is the best way to go about accomplishing it in Stata?
[Figure omitted: standardized-difference plot]
Source: Austin PC 2009.
Thanks.
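A minimal sketch of one way to draw such a plot in Stata 14, assuming the standardized differences have already been computed; the covariates and values below are made up purely for illustration.
Code:
* Hedged sketch with fabricated illustrative numbers:
clear
input str10 covariate stddiff_unmatched stddiff_matched
"age"      0.45 0.08
"sex"      0.30 0.05
"diabetes" 0.22 0.03
end
encode covariate, gen(covid)
twoway (scatter covid stddiff_unmatched, msymbol(O))               ///
       (scatter covid stddiff_matched,  msymbol(T)),               ///
       ylabel(1/3, valuelabel angle(0)) xline(0.1, lpattern(dash)) ///
       xtitle("Standardized difference") ytitle("")                ///
       legend(order(1 "Unmatched" 2 "Matched"))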
↧
Identifying patterns in survey responses
I have survey data with IDs and responses (0, 25, 50, 75, 100). They are typical 5-point Likert-scale questions. I need to see if there are patterns where participants' choices didn't change (they selected only one point throughout).
For instance, in the example below, ID 9 selected 50 for every question while ID 1 selected 0 for every question.
Thank you,
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int id byte(question1 question2 question3 question4 question5 question6 question7 question8 question9 question10)
 1   0   0   0   0   0   0   0   0   0   0
 2  50  50   0   0  50   0 100   0   0   0
 3  50  50   0   0  50   0 100   0   0   0
 4  50  50   0   0  50   0  50   0   0   0
 5  50  50   0  50  50   0  50  50  50  50
 6  75 100  50 100 100  50 100   0 100 100
 7  75  75  50  50 100  50 100  50  50 100
 8  25  50   0   0  50   0   0   0   0  50
 9  50  50  50  50  50  50  50  50  50  50
10  25  75   0   0  50   0  50   0   0  50
11  50 100   0 100 100 100 100 100 100 100
12  25 100  50 100 100 100 100 100 100 100
13  25  25   0   0 100 100 100 100  50  50
14  75  50  50 100 100  50 100  50 100 100
15 100 100 100 100 100 100 100 100 100 100
16  75  75 100 100 100 100 100 100 100 100
17  50  25  50 100 100  50 100 100 100  50
18  50  50   0   0  50  50  50   0   0   0
19  25  75   0  50   0   0   0  50  50   0
20  25  50   0   0   0   0   0   0   0   0
end
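A minimal sketch of one way to flag respondents whose answers never vary across the ten items, using the variables in the dataex above.
Code:
* Hedged sketch: a respondent who picked a single point has equal row min and max.
egen row_min = rowmin(question1-question10)
egen row_max = rowmax(question1-question10)
gen straightliner = (row_min == row_max)
list id row_min if straightliner   // ids 1, 9, and 15 in the example data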
↧
Econometrics
I'm having issues with Stata commands. I'm super new to Stata and still trying to figure out my way with commands; if anybody can help or give any sort of command advice, I would really appreciate it.
Thanks
↧
Simple equation by variable
So I am attempting to create a new variable in the following way:
newvar = var1 - lagvar1 + var2. However, given the rudimentary nature of my lag variable, it cuts across different companies (var3), so I also want to restrict this calculation and ensure that both var1 and lagvar1 correspond to the same company (var3).
Thus far I am drawing a blank; any help would be much appreciated.
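A minimal sketch under stated assumptions: if there is a time variable (called year below, which the post does not name), declaring the panel lets Stata's lag operator respect company boundaries, so no hand-built lag variable is needed.
Code:
* Hedged sketch ("year" is an assumed time variable):
egen company_id = group(var3)
xtset company_id year
gen newvar = var1 - L.var1 + var2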
↧
↧
Varsoc command
Hi guys,
I'm new here and I'm having some issues with the varsoc command. When I run the command, Stata shows no results for lag 4 (except for FPE), and I don't know why. Should I consider this model correct?
varsoc lnBio lnHydro lnNRE lnConsPC lnPIBPC lnBBR
Selection-order criteria
Sample: 1989 - 2015 Number of obs = 27
+---------------------------------------------------------------------------+
|lag | LL LR df p FPE AIC HQIC SBIC |
|----+----------------------------------------------------------------------|
| 0 | 105.478 2.5e-11 -7.36877 -7.28315 -7.08081 |
| 1 | 249.296 287.64 36 0.000 9.3e-15 -15.3553 -14.7559 -13.3395 |
| 2 | 292.499 86.406 36 0.000 8.5e-15 -15.8888 -14.7757 -12.1453 |
| 3 | 372.749 160.5* 36 0.000 1.5e-15 -19.1666* -17.5397* -13.6953* |
| 4 | . . 36 . -1.0e-78* . . . |
+---------------------------------------------------------------------------+
Endogenous: lnBio lnHydro lnNRE lnConsPC lnPIBPC lnBBR
Exogenous: _cons
Thanks
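For reference, a hedged note: with only 27 usable observations and 6 endogenous variables, a lag-4 VAR has 25 parameters per equation, which is at the limit of what this sample can support, so the likelihood and most criteria at lag 4 cannot be computed reliably. Capping the preselection at a lag length the data can handle avoids the blank row.
Code:
* Hedged sketch: restrict the candidate lag length.
varsoc lnBio lnHydro lnNRE lnConsPC lnPIBPC lnBBR, maxlag(3)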
↧
Python modules not found even after specifying python_userpath
I am having some trouble getting the new Python integration to work in Stata 16. Specifically, some of my modules are not found even after I specify the python executable file and set the python userpath.
I am on a linux machine on which I do not have administrative privileges but do have substantial freedom to keep files and install software locally in my home directory. The Stata license is system-wide but my Python 3.7 installation is local (via miniconda3). There are other Python versions installed in system locations but I do not want to use them, as I would like to be able to manage my own Python packages which are not all available on the wider system.
I have instructed Stata to find my local Python executable by setting python_exec, and have also set the python_userpath variable to the site-packages/ directory of my local Python installation, with the prepend option. The directory structure is such that, within site-packages/, there are many directories corresponding with package names and these contain the .py files for the installed packages. It is a standard miniconda3 structure. Running
Code:
python query
confirms that the two variables have been set and that the correct Python system has been identified with its corresponding library.
Some packages are found; for example,
Code:
python where numpy
displays the correct location for numpy. However,
Code:
python where pandas
results in error r(601), "Python module pandas not found," even though pandas is installed and its file structure is extremely similar to that of numpy (both relevant files are named __init__.py and are in an eponymous directory within site-packages/).
I tried adding site-packages/pandas/ to the library path, but with no success. I also tried creating a symbolic link to pandas/__init__.py with the name pandas.py in the site-packages/ directory, hoping that Stata would be looking for a file with that name in that location, but pandas is still not found. I'm not sure what might be causing this and I feel I have exhausted the available documentation. Please advise.
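For reference, a minimal sketch of the configuration steps described above; the paths are placeholders for a typical miniconda3 layout, not the poster's actual directories.
Code:
* Hedged sketch (placeholder paths):
set python_exec "/home/username/miniconda3/bin/python3", permanently
set python_userpath "/home/username/miniconda3/lib/python3.7/site-packages", prepend permanently
python query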
↧
Simpson's paradox
Hi! Is there a way to visualize Simpson's paradox, and how do I use entropyetc? It is not really clear to me yet. I am testing the effect of the change in news risk on weekly (log) returns. Please find my results below:
perccomb3 is the change in news risk, for which I created dummy variables based on different amounts of news risk. I also included the dummy multiplied by the change in news risk in the regression, and these interaction terms show a positive tendency (while the scatter plot only shows a downward-sloping relationship).
[Attachment omitted: regression results]
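A minimal sketch of one common way to visualize Simpson's paradox: overlay the pooled fit with group-specific fits. The names returns and group are placeholders for the weekly log return and the dummy defining the subsamples; perccomb3 follows the post.
Code:
* Hedged sketch (placeholder names for the outcome and the grouping dummy):
twoway (scatter returns perccomb3, msize(tiny))       ///
       (lfit returns perccomb3, lcolor(black))        ///
       (lfit returns perccomb3 if group == 0)         ///
       (lfit returns perccomb3 if group == 1),        ///
       legend(order(2 "Pooled fit" 3 "Group 0" 4 "Group 1"))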
↧
Replace value in many observations and variables
Dear all
I have a database of around 50 variables and 200 observations. I would like to replace the word 'quantity' with the value 0 wherever it appears, in any of the variables.
For example, in the following data, I would like to replace the word 'quantity' with the value 0 in variables 1 and 2.
The question is: what command can I use to do this, in a simple way, across all 50 variables and 200 observations?
Observation | Var 1 | Var 2 |
1 | 400 | 200 |
2 | quantity | 800 |
3 | 300 | quantity |
4 | quantity | 1000 |
Thank you very much!
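A minimal sketch, assuming the affected columns were read in as strings because of the word 'quantity'; var1-var50 is a placeholder varlist for the 50 variables.
Code:
* Hedged sketch: replace the word with "0" in every string variable, then
* convert those columns back to numeric.
foreach v of varlist var1-var50 {
    capture confirm string variable `v'
    if !_rc {
        replace `v' = "0" if `v' == "quantity"
        destring `v', replace
    }
}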
↧
↧
Generating dummy variable conditional to other dummy variable figures
Hi everyone,
I have two dummy variables, holder67 and holder30, which are equal to 1 if the value of another variable, COND, is at or above 0.67 and 0.30, respectively.
Now I am looking to generate a new dummy variable, "holder30-holder67", which will equal 1 if holder30=1 but holder67=0, to capture a low-to-moderate degree of overconfidence.
What would be the code to generate such a dummy variable based on the values of the other dummy variables?
Thanks in advance!
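A minimal sketch; since a hyphen cannot appear in a Stata variable name, the new dummy is called holder30_67 here.
Code:
* Hedged sketch: 1 when holder30==1 and holder67==0, 0 otherwise,
* left missing if either input dummy is missing.
gen holder30_67 = (holder30 == 1 & holder67 == 0) if !missing(holder30, holder67)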
↧
Finding R squared and Adjusted R squared values on xtreg
Hi Everyone,
I am conducting research on panel data, set up with the xtset command in Stata.
xtset company_id fiscalyear
panel variable: company_id (unbalanced)
time variable: fiscalyear, 1992 to 2018, but with gaps
delta: 1 unit
Now when I run xtreg to get the regression results, I am having trouble finding the R-squared and adjusted R-squared figures that regular regression output reports. My xtreg output is the following:
xtreg PMP holder30 holder67 age sex
Random-effects GLS regression Number of obs = 25,025
Group variable: company_id Number of groups = 2,202
R-sq: Obs per group:
within = 0.0010 min = 1
between = 0.0114 avg = 11.4
overall = 0.0018 max = 26
Wald chi2(4) = 43.75
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
------------------------------------------------------------------------------
PMP | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
holder30 | .0156001 .0156502 1.00 0.319 -.0150738 .0462739
holder67 | .0460581 .0133362 3.45 0.001 .0199197 .0721966
age | -.0021195 .0006194 -3.42 0.001 -.0033336 -.0009054
sex | .0428783 .0295993 1.45 0.147 -.0151353 .1008919
_cons | .0325781 .0445875 0.73 0.465 -.0548119 .119968
-------------+----------------------------------------------------------------
sigma_u | .08218976
sigma_e | .68331747
rho | .01426107 (fraction of variance due to u_i)
------------------------------------------------------------------------------
Is there any command or something I need to do to find the R-squared values?
Thanks in advance!
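For reference, a hedged note: xtreg reports within, between, and overall R-squared in the header shown above rather than a single conventional R-squared. If an adjusted R-squared is specifically needed, one option, which switches to a fixed-effects rather than random-effects specification, is areg.
Code:
* Hedged sketch: areg reports R-squared and Adj R-squared directly.
areg PMP holder30 holder67 age sex, absorb(company_id)
display e(r2)    // R-squared
display e(r2_a)  // adjusted R-squared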
↧
Random forests and neural nets in Stata (with Python integration)
Hey all,
I wrote a couple Stata packages to do regression or classification with random forests and neural networks (specifically, multi-layer perceptrons) in Stata 16. These programs are basically wrappers for methods in the popular Python library scikit-learn. The packages will automatically load the required Stata variables into Python, use some scikit-learn methods on the data, and return predictions and other information to Stata's interface. This is essentially an expanded version of the example .ado file provided in Stata's release notes for the new Stata Function Interface.
I split these into two separate packages:
1. pyforest.ado - regression and classification with random forests
2. pymlp.ado - regression and classification with multi-layer perceptrons
The syntax for specifying optional arguments is nearly identical to the syntax used in scikit-learn. This means that the scikit-learn documentation is also a readable reference for using these packages. Of course, both of these packages also contain built-in Stata help files.
You can read a bit more about these packages and install them with instructions on GitHub:
https://github.com/mdroste/stata-pyforest
https://github.com/mdroste/stata-pymlp
I am still actively developing both of these packages, and I plan to submit them to SSC very soon. I am sure there are some bugs that will need to be fixed before then, since I put both of them together over the last two days or so. There's a whole bunch of stuff that I think should be added, but since both seem to be very much usable right now, I figured it's worth posting what I have for now.
If you have any issues with these packages, definitely let me know either on this thread or on Github.
I hope this is useful!
Mike
↧