When I use sum (summarize) on three variables, the output reports their means. If I want the quantiles at 0.95, 0.75, 0.5, 0.25, and 0.05 listed after the mean, can I use sum, or do I need another command?
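One possible approach, sketched on Stata's bundled auto data (price, mpg, and weight are stand-ins for the three variables in the question): summarize reports percentiles only with its detail option and in a fixed layout, whereas tabstat lets you pick the statistics and their order.

```stata
* Assumption: the shipped auto dataset stands in for the poster's data
sysuse auto, clear

* Mean first, then the 95th, 75th, 50th, 25th, and 5th percentiles,
* one row per variable
tabstat price mpg weight, statistics(mean p95 p75 p50 p25 p5) columns(statistics)
```

Adding the save option to tabstat leaves the table in r(StatTotal) if the numbers are needed afterwards.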
↧
summarize command
↧
Generate
Hello guys, I am really new to Stata, and I have a question. I have a huge database, and I want to generate a new variable from three existing variables (emphysema, COPD, and chronic bronchitis). This new variable will help me get the prevalence of chronic lung disease. However, people could respond yes to all three conditions. I need to generate this new variable without double or triple counting those who responded yes to more than one of the three.
Thank you in advance.
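A minimal sketch of one way to do this, assuming the three items are coded 1 = yes and 0 = no. The variable names emphysema, copd, bronchitis, and cld are hypothetical, as is the toy data. Because the new variable is a single indicator per person, someone who says yes to all three is still counted once.

```stata
* Hypothetical toy data: 1 = yes, 0 = no, . = not answered
clear
input byte(emphysema copd bronchitis)
1 1 1
0 1 0
0 0 0
. 0 0
. . .
end

* 1 if any of the three is yes, 0 otherwise
generate byte cld = (emphysema == 1 | copd == 1 | bronchitis == 1)

* No yes and at least one unanswered item: status unknown, so set to missing
replace cld = . if cld == 0 & missing(emphysema, copd, bronchitis)

* The share with cld == 1 is the prevalence among those with known status
tabulate cld
```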
↧
assumption of normality
Hello,
I would like to know whether using the xtreg command with panel data requires the assumption of normality of the sample.
Thank you
Erasmo
↧
Creating binary variable defined on existing variable
Hi,
So, this is probably really simple but I can't seem to figure it out: I'm trying to create a binary variable. The binary variable will be based on another variable, a variable with values between 0-4 (this is social scientific experimental data so the data points are individual answers). I want everyone who has a 2 or higher to be a "1" in the binary variable, and everyone below 2 should be 0. How do I do this?
Thanks!
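A short sketch of the usual idiom, with hypothetical names (score for the 0-4 variable, high for the new indicator). The if qualifier matters because Stata treats missing values as larger than any number, so score >= 2 on its own would code missing answers as 1.

```stata
* Hypothetical toy data: score takes values 0-4, with one missing answer
clear
input byte score
0
1
2
3
4
.
end

* 1 if score is 2 or higher, 0 if below 2, missing if score is missing
generate byte high = (score >= 2) if !missing(score)

* Cross-check the recoding
tabulate score high, missing
```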
So, this is probably really simple but I can't seem to figure it out: I'm trying to create a binary variable. The binary variable will be based on another variable, a variable with values between 0-4 (this is social scientific experimental data so the data points are individual answers). I want everyone who has a 2 or higher to be a "1" in the binary variable, and everyone below 2 should be 0. How do I do this?
Thanks!
↧
Seeking right method to compute confidence interval
Hi,
I have some anthropometric measurements up to 12 months of age.
I would like to compute the difference in mean growth from 0 to 6 months in Phase 1 and Phase 2, separately for males and females, with a confidence interval.
Can anyone guide me on how to proceed to get the desired result?
clear
input str6 study double month float gender long counts double avg_weight
"Phase1" 0 1 195 2.7541025641025643
"Phase1" 0 2 185 2.716378378378378
"Phase1" 6 1 195 5.945502645502645
"Phase1" 6 2 185 5.48494318181818
"Phase2" 0 1 220 2.5724409090909104
"Phase2" 0 2 144 2.5227708333333334
"Phase2" 6 1 260 5.716719230769237
"Phase2" 6 2 167 5.325449101796409
end
label values gender s1q118
label def s1q118 1 "Male", modify
label def s1q118 2 "Female", modify
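As a partial sketch only: with the group means shown in the post, the point estimate of mean growth can be obtained by reshaping to wide form and subtracting. A confidence interval would additionally need standard deviations or the individual-level measurements, which the posted data do not include. The variable name growth is an assumption.

```stata
* Data copied from the post (group means and counts only)
clear
input str6 study double month float gender long counts double avg_weight
"Phase1" 0 1 195 2.7541025641025643
"Phase1" 0 2 185 2.716378378378378
"Phase1" 6 1 195 5.945502645502645
"Phase1" 6 2 185 5.48494318181818
"Phase2" 0 1 220 2.5724409090909104
"Phase2" 0 2 144 2.5227708333333334
"Phase2" 6 1 260 5.716719230769237
"Phase2" 6 2 167 5.325449101796409
end

* One row per study and gender, with month 0 and month 6 side by side
reshape wide avg_weight counts, i(study gender) j(month)

* Point estimate of mean growth from 0 to 6 months
generate double growth = avg_weight6 - avg_weight0
sort study gender
list study gender growth, sepby(study)
```

Note that in Phase 2 the counts differ between month 0 and month 6, so this is a difference of two group means rather than a mean of within-child changes.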
↧
Regression Discontinuity Graph
Hi,
I am working in Stata 15 and am trying to use twoway lpoly to produce a regression discontinuity graph. My running variable is age and the outcome variable is trust level, which is a categorical variable. I set a cutoff of 33 years of age. When I run both the lpoly and rdplot commands, the resulting graphs seem to have scale problems, especially on the x axis. Could you please help me fix this and get a better graph?
Here are the two commands I am using for the twoway lpoly graph and rdplot, respectively:
twoway lpoly Trust1 agefromthresh if agefromthresh<0 || lpoly Trust1 agefromthresh if agefromthresh>=0
rdplot Trust1 agefromthresh
I have attached the two graphs which I get from running the commands above.
↧
Replacing missing values of a given variables a large merged dataset
Hi all,
I have data that were collected in a household survey about individual children (age, sex, malaria, etc.) and household characteristics (household size (hh_size), water source (WAT_SO), access to a sanitary facility, etc.). The household data are the same for all children in a given household, which is identified by the question number (qno), but they are often entered for only one child, creating missing values for the other children in the same household. I need to fill in the missing values, but in the merged dataset a specific household can only be identified by the combination of the survey (season_year), the livelihood (livelihood), the region (region), the district (district), the cluster (cluster), and qno. Getting to the household questionnaire therefore means sorting by season_year, then livelihood, then region, then district, then cluster, then qno. I looked at similar posts but did not find one where the sorting has to go through several layers and for multiple variables. I tried to attach a sample of my dataset as a .dta file, but the file was rejected, and I am not sure how to share it otherwise.
Thanks for your help
Mona
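A minimal sketch of one way to fill the gaps, assuming the household variables are numeric and take a single value per household. The toy data are hypothetical; only the variable names come from the post. The idea is to let bysort define the household by all six identifiers and, within each household, sort the variable being filled so that its non-missing value comes first.

```stata
* Hypothetical toy data using the variable names from the post
clear
input season_year livelihood region district cluster qno hh_size WAT_SO
1 1 1 1 1 10 5 2
1 1 1 1 1 10 . .
1 1 1 1 1 10 . .
1 1 1 1 1 11 . 3
1 1 1 1 1 11 4 .
2 1 1 1 1 10 7 1
2 1 1 1 1 10 . .
end

* The identifiers that together define one household
local hh season_year livelihood region district cluster qno

* Numeric missing sorts last, so the first row within a household holds
* the recorded value; copy it to the rows where the value is missing
foreach v of varlist hh_size WAT_SO {
    bysort `hh' (`v'): replace `v' = `v'[1] if missing(`v')
}
```

String variables would need a different sort, because empty strings sort first rather than last.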
I have data that was collected in a household survey about individual children (age, Sex, Malaria e.t.c) and household characteristics (household size (hh_size), water source(WAT_SO) and access to sanitary facility e.t.c). The household data is the same for all children in a given household and is uniquely identified by the question number (qno) but is often entered for one child creating missing values for the other children in the same household. I need to fill in the missing values but have to identify the specific household in a merged data set depending on the survey (season_year), the livelihood (livelihood), the region (region), the district (district) and cluster (cluster) where it is found. Getting to the household questionnaire would be sorting by season_ year then livelihood then region then district then cluster to qno. I looked at similar posts/questions but I did not find one where the sorting has to go through several layers and for multiple variables. I tried to attach a sample of my dataset as a dta but the file was rejected - not sure how to do this otherwise.
Thanks for your help
Mona
↧
Generate different possible combinations
Hi,
I would like to generate multiple groups of observations for each different combination of a group of numbers. For example:
var1 group order
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
and would like to get the following:
var1 group order
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 1 7
1 2 1
2 2 2
3 2 3
4 2 4
5 2 5
7 2 6
6 2 7
1 3 1
2 3 2
3 3 3
7 3 4
4 3 5
5 3 6
6 3 7
etc.
for all possible orderings (permutations) of the numbers 1 to 7, so a total of 7! = 5040 different groups. Any help would be much appreciated.
Thank you
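One way this could be sketched is with Mata's cvpermutesetup() and cvpermute(), which step through all permutations of a column vector. The block below stacks the 5040 permutations into a three-column matrix and stores it as the variables var1, group, and order. The group numbering follows the order in which cvpermute() returns the permutations, which need not match the numbering in the example above.

```stata
clear
mata:
info = cvpermutesetup((1::7))      // prepare permutations of the column 1..7
P = J(5040*7, 3, .)                // columns: var1, group, order
g = 0
while ((p = cvpermute(info)) != J(0, 1, .)) {
    g = g + 1
    // rows 7*(g-1)+1 to 7*g hold permutation number g
    P[|7*(g-1)+1, 1 \ 7*g, 3|] = (p, J(7, 1, g), (1::7))
}
st_addobs(rows(P))
idx = st_addvar("double", ("var1", "group", "order"))
st_store(., idx, P)
end

* Inspect the first two groups
list in 1/14, sepby(group)
```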
↧
How to obtain different font sizes in the same line of a graph title?
Dear Statalisters,
My aim is to produce a graph title with different font sizes, something like this:
Age=65 years and eGFR=30 mL/min per 1.73m2
because the unit for eGFR is so long that it takes half the space if I don't reduce its fontsize. The problem is that I cannot split the title between the number and the unit measure, it would be very awkward to read.
I guess the solution would be to use some SMCL trick, but I haven't found anything helpful in the Stata manual for my problem
Thank you for your attention
Dino
↧
Percentage of total by category
Dear Statalists,
I have a simple problem that I cannot seem to figure out. I have three variables: X, Y and Z, where X is a dummy variable (0-1), Y is a categorical variable with four values and Z is a variable capturing time (years).
I would like to visualise, out of the total of X for each value of Z (year), how much goes to each category in Y.
In other words, I would like to plot the results of the following tabulation for each of the values in Z:
Code:
tab Y X if Z==`i', col
An example:
Y     | X=0  | X=1  |
1     | 30   | 10   |
2     | 30   | 20   |
3     | 5    | 40   |
4     | 35   | 30   |
Total | 100% | 100% |
I have tried with catplot and with collapsing, but all I can get to is the results of the following tabulation:
Code:
tab Y X if Z==1, row
This seems relatively easy so I might be missing something really basic.
Many thanks in advance.
Guillem
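A possible sketch, on simulated data, of computing the column percentages by hand and plotting them: contract reduces the data to one row per Z, X, Y cell with its frequency, and the percentage is then taken within each Z and X combination. The simulated values and the helper names n_zx and pct are assumptions.

```stata
* Simulated stand-in data: Z = year, X = dummy, Y = four categories
clear
set seed 12345
set obs 600
generate Z = 2000 + mod(_n, 3)
generate X = runiform() < 0.5
generate Y = 1 + floor(4*runiform())

* One row per (Z, X, Y) cell; _freq holds the cell count
contract Z X Y

* Column percentage: share of each Y category within a given Z and X
bysort Z X: egen n_zx = total(_freq)
generate pct = 100 * _freq / n_zx

* One panel per year, bars for each Y category within X = 0 and X = 1
graph bar (asis) pct, over(Y) over(X) by(Z) ytitle("Percent within X")
```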
↧
New version of somersd on SSC
Thanks as always to Kit Baum, a new version of the somersd package is now available to download from SSC. In Stata, use the ssc command to do this, or adoupdate if you already have an old version of somersd.
The somersd package is described as below on my website, and computes confidence intervals for a range of rank statistics, with the options of clustering and/or sampling-probability weighting. The new version contains an improved version of the Mata function tidottree(), which uses a search tree algorithm to compute jackknife pseudovalues for these rank statistics. The new tidottree() uses the quadsum() function to improve precision when adding tiny sums of weights to huge sums of weights, which can lead to loss of precision in really big datasets.
Best wishes
Roger
-------------------------------------------------------------------------------------
package somersd from http://www.rogernewsonresources.org.uk/stata12
-------------------------------------------------------------------------------------
TITLE
somersd: Kendall's tau-a, Somers' D and percentile slopes
DESCRIPTION/AUTHOR(S)
The somersd package contains the programs somersd, censlope and cendif,
which calculate confidence intervals for a range of parameters behind
rank or "nonparametric" statistics. somersd calculates confidence
intervals for generalized Kendall's tau-a or Somers' D parameters,
and stores the estimates and their covariance matrix as estimation results.
It can be used on left-censored, right-censored, clustered and/or
stratified data. censlope is an extended version of somersd, which also
calculates confidence limits for the generalized Theil-Sen median slopes
(or other percentile slopes) corresponding to the version of Somers' D
or Kendall's tau-a estimated. cendif is an easy-to-use program to
calculate confidence intervals for Hodges-Lehmann median differences
(or other percentile differences) between two groups. The somersd package
can be used to calculate confidence intervals for a wide range of
rank-based parameters, which are special cases of Kendall's tau-a,
Somers' D or percentile slopes. These parameters include differences
between proportions, Harrell's c index, areas under receiver operating
characteristic (ROC) curves, differences between Harrell's c indices or
ROC areas, Gini coefficients, population attributable risks, median
differences, ratios, slopes and per-unit ratios, and the parameters
behind the sign test and the Wilcoxon-Mann-Whitney or Breslow-Gehan
ranksum tests. Full documentation of the programs (including methods and
formulas) can be found in the manual files somersd.pdf, censlope.pdf and
cendif.pdf, which can be viewed using the Adobe Acrobat Reader.
Author: Roger Newson
Distribution-date: 16september2018
Stata-version: 12.1
INSTALLATION FILES
cendif.ado
censlope.ado
somers_p.ado
somersd.ado
_bcsf_bisect.mata
_bcsf_bracketing.mata
_bcsf_regula.mata
_bcsf_ridders.mata
_blncdtree.mata
_somdtransf.mata
_u2jackpseud.mata
_v2jackpseud.mata
blncdtree.mata
tidot.mata
tidottree.mata
lsomersd.mlib
cendif.sthlp
censlope.sthlp
censlope_iteration.sthlp
mf_bcsf_bracketing.sthlp
mf_blncdtree.sthlp
mf_somdtransf.sthlp
mf_u2jackpseud.sthlp
somersd.sthlp
somersd_mata.sthlp
ANCILLARY FILES
cendif.pdf
censlope.pdf
somersd.pdf
-------------------------------------------------------------------------------------
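As a concrete illustration of the download step mentioned in the announcement, these are the standard commands for fetching or refreshing a package from SSC. Both need an internet connection.

```stata
* First-time installation from SSC
ssc install somersd

* If an older version is already installed, check for and apply updates
adoupdate somersd, update
```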
↧
multivariate binary data on stata
Hi, I'm completely new to Stata. I have a binary dataset in which the dependent variable is intimate partner violence (1 if yes, 0 otherwise) and the independent variables are education (no education, primary, secondary, tertiary), employment status (1 if employed, 0 if not), and type of employment (seasonal, occasional, permanent). I wish to run a PCA and some descriptive analysis such as bar graphs. What are the commands?
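For the descriptive part, a rough sketch on simulated data follows. The variable names ipv and educ and the simulated values are hypothetical. Stata's principal components command is pca followed by a variable list; whether PCA is a sensible tool for a handful of binary indicators is a separate question that the sketch does not address.

```stata
* Simulated stand-in data with hypothetical variable names
clear
set seed 42
set obs 200
generate byte ipv  = runiform() < 0.3
generate byte educ = 1 + floor(4*runiform())
label define educ_lbl 1 "None" 2 "Primary" 3 "Secondary" 4 "Tertiary"
label values educ educ_lbl

* Row percentages of IPV within each education level
tabulate educ ipv, row

* Bar graph: the mean of a 0/1 variable is the proportion answering yes
graph bar (mean) ipv, over(educ) ytitle("Proportion reporting IPV")
```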
↧
how to hide zero percent in blabel option of graph bar
Dear Stata Users,
I have a question about the option for labeling bars. When I draw bar charts of grouped data (percentages of different activities for three business types), I want to add a label to each bar. However, there are some zero percentages in my data, and the -blabel()- option displays them just like non-zero values. My question is how to hide the zero percentages in the corresponding bars.
Code:
graph bar v2 v3 v4 v5, over(v1) stack blabel(bar, format(%9.1f) posi(center)) nofill legend(row(1) order(1 "None" 2 "One" 3 "Two" 4 "Three"))
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str9 v1 float(v2 v3 v4 v5)
"business1" 31.47 68.53    0   0
"business2" 67.33 29.53 3.13   0
"business3" 78.93   7.4 13.2 .47
end
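One workaround that may be worth trying, offered as an assumption rather than documented behavior: recode the zeros to missing before graphing, on the expectation that graph bar draws neither a bar nor a label for a missing value. It is worth checking the result on your own Stata version, and worth working on a copy of the data, since mvdecode changes the variables in place.

```stata
* Data from the post's -dataex- example
clear
input str9 v1 float(v2 v3 v4 v5)
"business1" 31.47 68.53    0   0
"business2" 67.33 29.53 3.13   0
"business3" 78.93   7.4 13.2 .47
end

* Assumption: bars with a missing value get no label, so turn 0 into missing
mvdecode v2-v5, mv(0)

* Same graph command as in the post
graph bar v2 v3 v4 v5, over(v1) stack blabel(bar, format(%9.1f) posi(center)) nofill legend(row(1) order(1 "None" 2 "One" 3 "Two" 4 "Three"))
```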
↧
Package control: Packages over-writing other packages
Hi there
I have a more general question that I haven't been able to find the answer to. How does one specify which command should be used if multiple commands from different user-written packages share the same name?
This problem occurs when I try to run rdplot from the rdrobust package. Stata then runs rdplot as programmed in the rdplot package.
Code:
. which rdplot
C:\Program Files (x86)\Stata15\ado\base\r\rdplot.ado
*!version 7.5.1 2018-07-05
How do I make rdrobust the default package?
↧
Fixed effects and dynamic ordinary least squares
Dear all,
I am currently working on a research topic related to the environmental effects of disaggregated renewable energy.
I am estimating some regression models with the objective of evaluating the effect of renewable energy diffusion on GHG emissions.
The model is:
Yit = B0 + B1 RES + B2 FOSSIL + uit
where RES = share of renewable energy production, FOSSIL = share of fossil energy production, and Y = GHG emissions (E).
I would like to know what the difference is between fixed effects and DOLS (dynamic ordinary least squares).
I have 21 years and 28 countries.
I would very much appreciate some thoughts on this problem.
Thanks in advance!
Matteo
↧
Issue with syntax
Dear Statalist,
I am starting to use panel data for the first time and I want to create a simple variable that computes the gap between two groups (high- and low-educated people) in the time they spend on activity X. The structure of my data is as follows:
Code:
ID        wave  time      moth_degree
61105964  2     13.25     1
61105964  3     16.375    1
61105964  4     33.71667  1
61105964  5     7.758338  1
61105964  6     5.833334  1
61105966  2     3.5       1
61105966  4     7         1
61105966  5     16.33334  1
61105966  6     10.20834  1
What I would like to have is a variable that calculates the gap in time spent (time) between those children (ID) whose mother has a degree (moth_degree==1) and those whose mother does not (moth_degree==0). I know it is an easy task, but after trying many options I could not find the way!
Many many thanks!
Best
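A minimal sketch under one reading of the question, namely that the gap is wanted separately for each wave. The toy data and the helper names mean_deg, mean_nodeg, and gap are hypothetical. The cond() expression blanks out the other group, so each egen call averages one group only but writes the result to every row of the wave.

```stata
* Hypothetical toy data with the post's variable names; both groups present
clear
input long ID byte wave double time byte moth_degree
1 2 13 1
1 3 16 1
2 2  3 1
2 3  7 1
3 2  5 0
3 3  4 0
4 2  7 0
4 3  6 0
end

* Mean time per wave within each group, written to every row of that wave
bysort wave: egen mean_deg   = mean(cond(moth_degree == 1, time, .))
bysort wave: egen mean_nodeg = mean(cond(moth_degree == 0, time, .))

* Gap: degree minus no degree, one value per wave
generate gap = mean_deg - mean_nodeg
```

For a single overall gap, dropping the bysort wave: prefix gives the same calculation across all waves pooled.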
↧
Time dummies and time trend simultaneously
Hello everyone,
I would like to ask whether it makes sense to include time dummies and a time trend in the same model specification.
I am running a panel data regression with macroeconomic variables, so I would like to include time dummies. However, some variables show a clear linear trend in the long term, so I would also like to include a trend to de-trend my data. My data therefore include something like this:
Panel identifier | year | trend |
1 | 2001 | 1 |
1 | 2002 | 2 |
1 | 2003 | 3 |
1 | 2004 | 4 |
1 | 2005 | 5 |
2 | 2001 | 1 |
2 | 2002 | 2 |
and also the time dummies, which take the value 1 or 0 as usual. I do not know whether the model is sensible if I include both (time dummies and a time trend at the same time).
Thanks in advance.
Regards
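A small simulated sketch of what happens mechanically: a linear trend that is common to all panels is an exact linear function of the year, so it is perfectly collinear with a full set of year dummies, and Stata will omit one of the terms. The data below are made up, with the same year and trend layout as in the post.

```stata
* Simulated panel: 10 units, years 2001-2005, trend = 1..5 as in the post
clear
set seed 2018
set obs 50
generate id = ceil(_n/5)
bysort id: generate year = 2000 + _n
generate trend = year - 2000
generate y = rnormal()
xtset id year

* Year dummies plus a common trend: expect a note that one term is omitted
xtreg y i.year trend, fe
```

The year dummies already absorb any common time pattern, linear or not, so the common trend adds nothing once they are included.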
↧
Histogram discrete is not discrete
This is driving me bananas.
I have a set of discrete data spanning 1-30.
I want a histogram with bins from 1-30.
When I type:
histogram hday if hday<=30, discrete
I get a histogram that lumps together the values of 1 (the min) and 2.
No amount of tampering with the starting values, the bin width, the number of bins, etc. has been able to solve this problem for me.
What is broken under the hood?
Thanks
↧
Different confidence intervals for linear regression
I have run a linear regression, which reports a 95% CI. I would like to calculate the 80% and 99% CIs for β1, but the only command I can find gives a general CI for the X variable. Is there a command or menu option to calculate them? Is there also a way to show a picture of the probability distribution for each calculation?
Thanks
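For the interval part of the question, a brief sketch using the bundled auto data as a stand-in (price and mpg are placeholders for the poster's variables): regress accepts a level() option, and typing regress with no variables replays the last fit, so the confidence level can be changed without re-estimating. The plotting part of the question is not covered here.

```stata
* Stand-in data and model
sysuse auto, clear

* Fit once, reporting 80% confidence intervals
regress price mpg, level(80)

* Replay the same results with 99% confidence intervals
regress, level(99)
```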
↧
Generate mean of a variable for each level of another variable
Hi there,
I want to create several variables that store the mean of other "mother" variables (trunk and displacement) for different values of an index variable (rep78). Then I want to estimate the difference between means, also for the different levels of the index variable. I used the following code:
Code:
sysuse auto, clear
drop if missing(rep78)
levelsof rep78, local(levels)
foreach l of local levels {
    summarize trunk displacement if rep78 == `l'
    egen disp_`l' = mean(displacement) if rep78 == `l'
    egen trunk_`l' = mean(trunk) if rep78 == `l'
    gen dif_`l' = disp_`l' - trunk_`l'
    di dif_`l'
}
As you can see, the output displays the summaries of trunk and displacement for the different values of rep78. But it only displays the difference in means for rep78=3. Why?
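A likely explanation, with a sketch of one alternative: display followed by a variable name shows that variable's value in the first observation only. Each dif variable is missing outside its own rep78 level, so a number appears only for the level that the first observation belongs to, which would be rep78 == 3 if that is the first car's value, consistent with the output described. Taking the means from r(mean) after summarize avoids depending on any particular observation. The local names m_disp and m_trunk are assumptions.

```stata
sysuse auto, clear
drop if missing(rep78)
levelsof rep78, local(levels)
foreach l of local levels {
    * Group means as scalars, independent of which observation comes first
    quietly summarize displacement if rep78 == `l', meanonly
    local m_disp = r(mean)
    quietly summarize trunk if rep78 == `l', meanonly
    local m_trunk = r(mean)

    * Store the difference for this level and display it
    generate dif_`l' = `m_disp' - `m_trunk' if rep78 == `l'
    display "rep78 = `l': difference in means = " `m_disp' - `m_trunk'
}
```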
↧