Channel: Statalist

Wildcards with strpos/regexr

Could someone give me an example of using a wildcard with strpos() and regexr()? For example, I want to scan a string variable (with multiple words) called meds for "nu seal", "nu-seal" and "nuseal" (or variants thereof) and replace them with "aspirin", e.g.

replace meds=regexr(meds,"nu[ -]seal", "aspirin") works for "nu-seal" and "nu seal", but doesn't catch "nuseal". I know I could write another line, but there are instances where I'd like to incorporate a wildcard into a single regex command. The same holds for strpos(). Any pointers would be very much appreciated!
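One possible pattern (a sketch): Stata's regular expressions support the ? quantifier, which matches zero or one occurrence of the preceding character class, so the space/hyphen separator can be made optional. Note that regexr() replaces only the first match in each string:

* sketch: make the separator optional with ?
replace meds = regexr(meds, "nu[ -]?seal", "aspirin")

strpos() itself does a literal substring search and has no wildcard; regexm() is the pattern-matching counterpart for testing whether a match exists.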

Kruskal–Wallis one-way analysis of variance

Hello everyone,
I am bothered by a question regarding group differences in my data set. I have a dataset with variables (ordinal, dummy, and interval) from 10 different communities. I want to run multiple regressions with the overall sample. However, I also want to check whether some of the central constructs of the analysis vary between the communities. Since comparing the 10 communities with each other in a descriptive way is a lot of work, I'd like to run a test that indicates whether the variance of a construct can be partially explained by the group differences (i.e. belonging to the different communities). I gathered from the literature that this is usually done via a one-way ANOVA. As most of my data are non-normally distributed, I was wondering whether the Kruskal–Wallis one-way analysis of variance would be the right test for me?
I used the following command:
kwallis var, by(communities)

Can anyone tell me whether I am on the right track?

Many thanks in advance!!

Extremes

Hi everyone

I have a big dataset with 280 variables (I made a global varlist with all of them). I need to identify whether there are strange extreme values so I can drop them.

I have read about it and found Nick Cox's -extremes- command, but it gives me strange output. I don't know whether I'm not using it correctly or whether it does not work when it encounters non-numeric variables.

Find attached a picture of the output
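A minimal sketch, assuming the issue is that the global contains string variables (-extremes- expects numeric variables); allvars is a placeholder for the name of the poster's global:

* restrict the list to numeric variables before calling -extremes- (sketch)
ds $allvars, has(type numeric)
extremes `r(varlist)'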

Thanks,

Francesc

Codetest

Hey,

the following commands:

by InsurantNumber: replace _id

bysort produces different results every time

Hello,

I have a strange problem. I am using the following code to count the unique visits of an insurant/patient (via DispensingDate). InsurantNumber is unique to a patient and to a doctor, meaning one patient attending two doctors will have two InsurantNumbers.

sort InsurantNumber DispensingDate
by InsurantNumber: gen id_ = _n == 1
* together, these flag the first record of each distinct DispensingDate within InsurantNumber
by InsurantNumber: replace id_ = id_[_n] + 1 if _n > 1 & DispensingDate != DispensingDate[_n-1]
After that I want to generate a new variable which contains the number of unique visits by doctor. I do that via

sort doctor
by doctor: egen visits = total(id_)
Now I have run this a couple of times and I get slightly different results almost every time. I tried various ways of doing this (bysort and egen in one command; a foreach loop for a separate calculation per doctor) and always get different results per doctor (+/- 4), while the total number of visits stays the same.
Perhaps an important note: if I run the same command multiple times in a row, the calculations are equal, but if I, for example, run my whole do-file another time, the results vary. I even traced this down to the point where I ran a random command between every execution of the same counting procedure, just out of curiosity, and as I feared the results varied again.
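For reference, a sketch of a sort-order-independent way to tag and count distinct insurant/date combinations per doctor; this is an assumption about the intended logic, not a diagnosis of the original problem:

* tag exactly one observation per InsurantNumber/DispensingDate combination
egen byte first_visit = tag(InsurantNumber DispensingDate)
* number of distinct visits per doctor
bysort doctor: egen visits2 = total(first_visit)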

I simply don't get it.

I hope someone can help me.

Regards,

Karl

DHS WEIGHT

Hello, I need your suggestions, please. I want to use DHS data to analyse the effect of the consumption of a product on the probability of working. Unfortunately, I have only 100 people who consume the product. I thought I could use the sampling weight to make my 100 people representative of the country's population. However, when I use tab var [aw=hv005], I get about 300,000,000 people who consume the product, which does not make sense since the real population of the country is about 11,000,000. When I follow the recommendation in the DHS materials, gen weight=hv005/1000000 followed by tab var [aw=weight], I get fewer than 100 people who consume the product. So I would like to know how I can use sampling weights in my descriptive statistics so that my few observations are representative of the population. Thanks.
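A minimal sketch of the standard DHS rescaling, assuming the usual household-recode variable names (hv005 weight, hv021 PSU, hv022 strata) and hypothetical analysis variables consumes_product and works; note that with aweights the weighted frequencies are rescaled to the sample size, not to the national population, so population totals cannot be read directly off a tab:

* hv005 is stored multiplied by 1,000,000 in DHS recode files
gen wt = hv005/1000000
svyset hv021 [pweight=wt], strata(hv022)
* weighted cross-tabulation (analysis variable names are hypothetical)
svy: tab consumes_product works, count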

Cluster analysis with mixed data

Dear Statalisters,


I want to do (hierarchical) cluster analysis with a data set that contains one continuous variable (age) and data from a few multiple response questions (with dummies for each option of the particular question).
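For context, a minimal sketch of the basic Gower setup in Stata, assuming hypothetical variable names and a modest sample size (the dissimilarity matrix is N x N): build the Gower matrix with matrix dissimilarity, then run hierarchical clustering on it with clustermat:

* Gower dissimilarities over one continuous and several binary variables (sketch)
matrix dissimilarity G = age resp_opt1 resp_opt2 resp_opt3, Gower
* average-linkage hierarchical clustering on the stored matrix
clustermat averagelinkage G, name(gower_cl) add
cluster dendrogram gower_cl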


I have three questions:


1. Is it possible to weight variables when using the Gower dissimilarity coefficient for mixed data? Without explicit weighting, the multiple response questions are weighted by their number of options (which I don't want).

2. What should I do with multiple response questions whose data look like those of a single response question? For example: denomination. Most people have only one denomination. The normal matching coefficient used in the Gower measure would overestimate the similarity in this case.

3. Is there any research on, or practical guides for, how to do cluster analysis with multiple response data?


Thanks!
Tim

Merge loop over different variables

Hi everyone!

I am trying to merge two datasets (household surveys), but I would like to create something like a loop so that if the first group of variables doesn't produce a match, Stata tries merging the two datasets on another group of variables. The first group of variables is the ideal one, so the 'second merge' should only be tried if the first one didn't work; i.e., the second merge should never overwrite the first one.

My original merge (using Stata 12) was done as follows:
use "${cleandatadir}HH_SEC_A.dta", clear
merge m:1 region_code district_code ward_code village_code cluster hhid using "${tempdatadir}weight", gen(m)
Analysing some of the non-merged observations I've noticed that some households may be the same based on head of household's name. For instance, look at obs No. 16 and obs No. 21. They do not have the same hhid but could be the same.

----------------------- copy starting from the next line -----------------------

* Example generated by -dataex-. To install: ssc install dataex
clear
input float region_code int district_code long(ward_code village_code) str36 headname float hhid byte m
12 1204 1204201 120420101 "Ningile Abraham Kasuka"          3011 1
12 1204 1204201 120420101 "Ephes Nelson Lwesya"             3012 1
12 1204 1204201 120420101 "Frenk Simon Ndile"               3014 1
12 1204 1204201 120420101 "Eneli Kalago Molokulima"         3015 1
12 1204 1204201 120420101 "Johon Mwasyana Mwakasege"        3016 1
12 1204 1204201 120420101 "Godfrey Obeti Mwaisoloka"        3017 1
12 1204 1204201 120420101 "Hussen Katolika Kabeta"          3018 1
12 1204 1204201 120420101 "Anosisye Simoni Mwaibale"        3019 1
12 1204 1204201 120420101 "Suma Mbembela Ambangile"         3020 1
12 1204 1204201 120420101 "Ambangile Elija Bugara"          3021 1
12 1204 1204201 120420101 "Robath Mwakililo Mbukwa"         3022 1
12 1204 1204201 120420101 "Pius Simion Ndile"               3023 1
12 1204 1204201 120420101 "Agnes Nasoni Kasigila"           3024 1
12 1204 1204201 120420101 "Bahati Musafiri Seme"            3027 1
12 1204 1204201 120420101 "Maritini Kajuni Mbisa"           3028 1
12 1204 1204201 120420101 "Timothy Abrahim Seme"            3029 1
12 1204 1204201 120420101 "Grolia Saimoni Sankulu"          3048 2
12 1204 1204201 120420101 "Grece Fredy Mbwaga"              3037 2
12 1204 1204201 120420101 "Bahati Musafiri Seme"            3046 2
12 1204 1204201 120420101 "Daily Adamson Seme"              3031 2
12 1204 1204201 120420101 "Timothy Abrahim Seme"            3054 2
12 1204 1204201 120420101 "  "                              3059 2
12 1204 1204201 120420101 "  "                              3057 2
12 1204 1204201 120420101 "Anodi Syusi Mbahiba"             3050 2
12 1204 1204201 120420101 "  "                              3060 2
12 1204 1204201 120420101 "Imanueli Bughali Elija"          3045 2
12 1204 1204201 120420101 "Stanton William Mwandemele"      3055 2
12 1204 1204201 120420101 "Osyana Patrick Mwaipopo"         3044 2
12 1204 1204201 120420101 "Adam Syelwike Swila"             3043 2
12 1204 1204201 120420101 "Maritini Kajuni Mbisa"           3047 2
12 1204 1204201 120420101 "Daines Batoni Kafyanya"          3049 2
12 1204 1204201 120420101 "  "                              3061 2
12 1204 1204201 120420101 "  "                              3058 2
12 1204 1204201 120420101 "  "                              3063 2
12 1204 1204201 120420101 "Tupyelesyege Kalile Bugali"      3035 2
12 1204 1204201 120420101 "Genesa Astolile Timoti"          3053 2
12 1204 1204201 120420101 "Abusolom Asajenye Kajange"       3056 2
12 1204 1204201 120420101 "Anosisye Mwanganyanga Mwanyanya" 3051 2
12 1204 1204201 120420101 "Pius Simion Ndile"               3052 2
12 1204 1204201 120420101 "  "                              3062 2
12 1204 1204201 120420101 "  "                              3064 2
end
label values m _merge
label def _merge 1 "master only (1)", modify
label def _merge 2 "using only (2)", modify
------------------ copy up to and including the previous line ------------------
Is there any way I could tell Stata to try to do the first merge with the variables

region_code district_code ward_code village_code cluster hhid

but, if that doesn't produce a match, to try

region_code district_code ward_code village_code cluster headname
I am thinking of something like a loop, but I'm clueless, so your help would be more than appreciated.
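A sketch of one possible two-pass approach rather than a loop, assuming the weight file contains headname and accepting that blank or duplicated names cannot be matched reliably; because the second pass uses merge's update option, values matched in the first pass are never overwritten:

* prepare a fallback copy of the weight file that is unique on the name key
* (duplicates drop ..., force keeps an arbitrary record among exact key duplicates)
use "${tempdatadir}weight", clear
drop if trim(headname) == ""
duplicates drop region_code district_code ward_code village_code cluster headname, force
tempfile weight_byname
save `weight_byname'

* pass 1: merge on the preferred key
use "${cleandatadir}HH_SEC_A.dta", clear
merge m:1 region_code district_code ward_code village_code cluster hhid using "${tempdatadir}weight", gen(m1)

* pass 2: retry on the name-based key; -update- only fills values that are
* still missing, so first-pass matches are left untouched
merge m:1 region_code district_code ward_code village_code cluster headname using `weight_byname', gen(m2) update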

Thanks in advance!!
Mariana

Consistency between appearance of graphs in window and when exported

Dear Statalist,

I would like to know how to maximize the similarity in appearance of graphs as they appear in the graph window and as they appear when exported, in particular to PDF.

Font sizes, etc., are slightly different between the graph window and a subsequently generated PDF. Usually this is not a problem, but occasionally it has been frustrating to find, for example, graph titles escaping the margins of the PDF.

I have experimented a bit with graph set window, in particular changing the fontfaces to match those of eps (see graph set for these options), and this seems to improve the match, but I'm not sure whether this is the best way to go about this, if it will hold up generally, or if there are possible negative unintended consequences.
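A small sketch of that approach, assuming the aim is simply to make the on-screen and exported renderings use the same font; the fontface chosen is purely illustrative:

* align the window (screen) and eps fontfaces, then check a long title
graph set window fontface "Times New Roman"
graph set eps fontface "Times New Roman"
sysuse auto, clear
scatter price mpg, title("A deliberately long title to check the page margins")
graph export "check_margins.pdf", replace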

Many thanks,

BL

PS - Stata 13.1 SE on W10-64bit
PPS - If possible, I would like to avoid calling an external program like epstopdf since, when last I checked about a year ago, calling external programs was difficult-to-impossible when running Stata in batch mode on Windows.

tabulate with weighted data

Hello, I have a simple question. I have a dataset that is weighted according to a specific variable. Let's say that when I do a simple tab, I write svy: tab X Y. Now I would like to tabulate two variables according to a third variable. Usually, on unweighted data, I use the command bysort Z: tab X Y. This doesn't work with weighted data or with the svy: prefix.

Does anyone have a simple solution to this really simple problem?
Many thanks
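A sketch of one possible approach, assuming the data are already -svyset- and Z is a categorical variable; subpop() restricts estimation to each level of Z while keeping the survey variance calculations valid:

levelsof Z, local(zlevels)
foreach z of local zlevels {
    display as text "----- Z == `z' -----"
    svy, subpop(if Z == `z'): tab X Y
}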

Poisson inverse cumulative distribution function that returns k instead of m?

Hello,

Stata's invpoisson() function "returns the Poisson mean such that the cumulative Poisson distribution evaluated at k is p: if poisson(m,k) = p, then invpoisson(k,p) = m," according to the user guide. Is there an inverse Poisson CDF function that returns k for a given (m,p)? I understand the difficulty in inverting the Poisson CDF, but R has this function (qpois), so I thought it might be possible in Stata.
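A minimal sketch of a brute-force alternative, using the fact that poisson(m,k) is the CDF: search upward for the smallest k whose CDF reaches p (the values of m and p below are purely illustrative):

local m = 7.5
local p = 0.95
local k = 0
* smallest k such that Pr(X <= k) >= p for X ~ Poisson(m)
while poisson(`m', `k') < `p' {
    local k = `k' + 1
}
display "quantile k = `k'"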

Thanks for your time,

Bob

Fixed Effects in Stata

Dear All,

I have a panel data with missing observations. I run the following regression:

reg y x

and

tsset id time

xtreg y x

The number of observations is the same in both regressions. My question is: how are missing observations treated in Stata? Shouldn't groups with fewer than two observations be dropped in the panel analysis?
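As a quick check, a sketch for counting how many complete (y, x) observations each panel contributes; id and time follow the -tsset- above, and the new variable names are illustrative:

* number of non-missing values of y and x in each row
egen byte complete = rownonmiss(y x)
* usable observations per panel
bysort id (time): egen Tobs = total(complete == 2)
* distribution of panel sizes, counted once per panel
bysort id: gen byte firstobs = _n == 1
tab Tobs if firstobs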

Thanks!

Synth: Pre/post MSPE

Dear all

I have a problem calculating the post-event MSPE in the placebo analysis.

I am a student at the Norwegian School of Economics and I am writing my master's thesis using the synthetic control method, testing the effect of partial privatization. I am running the synthetic control method in Stata/IC 14 using the code provided by the authors (website: http://web.stanford.edu/~jhain/synthpage.html) and the -synth- help file in Stata. I am, however, unable to calculate the post/pre MSPE ratio.

The -synth- help file provides code for the pre-event MSPE, but I do not know how to modify it to calculate the post-event MSPE.

tempname resmat
forvalues i = 1/4 {
    synth cigsale retprice cigsale(1988) cigsale(1980) cigsale(1975), trunit(`i') trperiod(1989) xperiod(1980(1)1988)
    * collect the pre-treatment RMSPE returned by -synth-
    matrix `resmat' = nullmat(`resmat') \ e(RMSPE)
    local names `"`names' `"`i'"'"'
}
mat colnames `resmat' = "RMSPE"
mat rownames `resmat' = `names'
matlist `resmat', row("Treated Unit")
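A heavily hedged sketch of one way the post-period MSPE could be computed inside the same loop, assuming -synth- stores the treated and synthetic outcome paths in e(Y_treated) and e(Y_synthetic) with one row per year of the results period, and assuming here that the last 12 rows are the post-1988 years; the row indexing would need to be checked against the actual matrices:

* sketch, to be placed after the -synth- call inside the loop
matrix YT = e(Y_treated)
matrix YS = e(Y_synthetic)
local T = rowsof(YT)
local sse = 0
forvalues r = `=`T'-11'/`T' {
    local sse = `sse' + (YT[`r',1] - YS[`r',1])^2
}
local post_mspe = `sse'/12
display "post-period MSPE: " `post_mspe'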


I do not expect anything, but I hope it is possible to get some advice and help with Stata.


I am referring to the final test in the paper:

"One final way to evaluate the California gap relative to the gaps obtained from the
placebo runs is to look at the distribution of the ratios of post/pre-Proposition 99 MSPE.
The main advantage of looking at ratios is that it obviates choosing a cuto® for the exclusion
of ill-fitting placebo runs. Figure 8 displays the distribution of the post/pre-Proposition 99
ratios of the MSPE for California and all 38 control states. The ratio for California clearly
stands out in the ¯gure: post-Proposition 99 MSPE is about 130 times the MSPE for the
pre-Proposition 99 period. No control state achieves such a large ratio. If one were to assign
the intervention at random in the data, the probability of obtaining a post/pre-Proposition
99 MSPE ratio as large as California's is 1=39 = 0:026."

Best regards, Thomas Kringlebu

Choice of residuals to use in Two Stage Residual Inclusion strategy with Poisson first stage regression

Dear Stata-listers,




I write to ask your opinion regarding which type of residual to use in a 2SRI strategy with a first stage Poisson regression.

I want to estimate a linear model where there could be an endogeneity problem due to unobserved self-selection.
For this reason, I aim to use the two-stage residual inclusion (2SRI) method to control for the possible endogeneity.

Since the endogeneity affects a set of dummies that can be interpreted as choices made by the firms in the panel, a non-linear first-stage regression is needed to estimate a choice model.
For this purpose, I estimate a Poisson choice model, given that I use semi-aggregated data.

My question is: which residuals should be used?
Do you agree with the use of Pearson residuals (i.e. [realized # of outcomes - predicted # of outcomes] / [(predicted # of outcomes)^0.5]) in this case?
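For reference, a minimal sketch of how the Pearson residuals in that formula could be generated after a Poisson first stage; y, x, and z stand in for the first-stage variables and are hypothetical names:

poisson y x z
* predicted number of outcomes
predict mu, n
* Pearson residual: (realized - predicted) / sqrt(predicted)
gen pearson_res = (y - mu) / sqrt(mu)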

Thanks in advance for your help.

Best,
Giuseppe


Plotting predicted number of events from negative binomial panel regression

Dear Statalist members,
I'm trying to plot the number of predicted events per month over time for different age groups, based on a negative binomial regression model. My general coding procedure, after setting the time and id-variables with -xtset- is as follows:
xtnbreg count c.month##agegroups, fe irr
margins i.agegroups, at(month=(1(1)36)) predict(nu0)
marginsplot
I've previously calculated and plotted similar linear (-xtreg-) and Poisson (-xtpoisson-) regression models, which both estimate the expected number of events to be around ~5 per month. Since, from my understanding, the main difference between Poisson and negative binomial models is that the latter better accounts for overdispersion (which my data suffer from), I wouldn't expect the plotted trend lines of the two to differ much. The incidence rate ratios also match quite closely between the two models. In this case, however, while the Poisson predicted number of events hovers around 5-6 per month when plotted, the negative binomial model seems to hover around just 1 event per month. Further, the trend line is almost flat for the negative binomial plot, while the Poisson trend lines increase much more meaningfully, despite very similar incidence rate ratios in the regression tables.

I can't tell if my code for plotting the negative binomial model is just wrong (the approach is a copy of my -xtpoisson- plotting code), or if it's my understanding of negative binomial regression models that is lacking. Can someone please help me shed some light on this?

As a side question, I can't seem to produce margins for my -xtnbreg- model if I rely on random effects instead of fixed effects. I've since read suggestions that fixed/random effects for -xtnbreg- do not quite operate the way I would expect from -xtpoisson-. From my own experience, the fixed-effects negative binomial regression model suddenly doesn't drop time-invariant predictors. How can this happen, and what is the difference when using fixed/random effects in -xtnbreg-?

I've tried to make my question somewhat general, but please let me know if you need me to provide further details or clarify the question further.
Thanks in advance!

Is there a combination of xtscc and xtregar?

Dear Statalist,

I need to run a regression on a panel dataset (N = 19 countries, T = 32 years, balanced). My problem is that the data suffer from contemporaneous correlation as well as a unit root in the dependent variable. As there is no unit root in the first difference of the dependent variable, I think it follows an AR(1) process. The code below shows the relevant tests.

As far as I know, I should correct for contemporaneous correlation by employing Driscoll and Kraay standard errors via -xtscc- (see Hoechle 2007) and for the AR(1) by using -xtregar-. However, I don't think I can use both at the same time. Is there a way to combine them? Or an alternative that deals with both problems?

Additional tests with non-problematic results: a Breusch-Pagan LM test (-xttest0-) points toward an RE model instead of OLS, a Hausman test (-hausman-) suggests that country fixed effects are not necessary, and -collin- shows no problematic multicollinearity (all VIF < 2).
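One heavily hedged possibility for the mechanics (not a recommendation on the econometrics): estimate the model in first differences, which removes the unit root, and keep Driscoll-Kraay standard errors via -xtscc-, whose lag() option lets the standard errors allow for serial correlation up to the chosen lag:

* sketch: first-differenced model with Driscoll-Kraay standard errors
xtset cid year
xtscc D.Fsocx D.gdp D.d_unemp D.veto_schmidt D.proz_cm_sd_ks D.singleparty D.Maastricht, lag(2)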

. xtset cid year
       panel variable:  cid (strongly balanced)
        time variable:  year, 1980 to 2011
                delta:  1 unit

.
. xtreg Fsocx gdp d_unemp veto_schmidt proz_cm_sd_ks singleparty Maastricht

Random-effects GLS regression                   Number of obs     =        570
Group variable: cid                             Number of groups  =         19

R-sq:                                           Obs per group:
     within  = 0.8604                                         min =         30
     between = 0.2363                                         avg =       30.0
     overall = 0.5218                                         max =         30

                                                Wald chi2(6)      =    3330.26
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000

-------------------------------------------------------------------------------
        Fsocx |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
          gdp |   .2122995   .0066548    31.90   0.000     .1992562    .2253427
      d_unemp |   152.3585   19.43612     7.84   0.000     114.2644    190.4526
 veto_schmidt |  -75.18298    125.168    -0.60   0.548    -320.5077    170.1417
proz_cm_sd_ks |   .7689358   .6727737     1.14   0.253    -.5496764    2.087548
  singleparty |  -74.74662   72.97877    -1.02   0.306    -217.7824    68.28915
   Maastricht |   43.17916   6.421921     6.72   0.000     30.59243     55.7659
        _cons |   62.52319   687.0926     0.09   0.927    -1284.154      1409.2
--------------+----------------------------------------------------------------
      sigma_u |  1084.6002
      sigma_e |  497.43015
          rho |  .82621314   (fraction of variance due to u_i)
-------------------------------------------------------------------------------

. xtcsd, pesaran abs //-> significant -> contemporary correlation
 
 
Pesaran's test of cross sectional independence =    12.577, Pr = 0.0000
 
Average absolute value of the off-diagonal elements =     0.427

.
.
. xtunitroot fisher socx, dfuller l(1) //adding trend and/or demean option does not influence the result substantially

Fisher-type unit-root test for socx
Based on augmented Dickey-Fuller tests
--------------------------------------
Ho: All panels contain unit roots           Number of panels  =     19
Ha: At least one panel is stationary        Number of periods =     32

AR parameter: Panel-specific                Asymptotics: T -> Infinity
Panel means:  Included
Time trend:   Not included
Drift term:   Not included                  ADF regressions: 1 lag
------------------------------------------------------------------------------
                                  Statistic      p-value
------------------------------------------------------------------------------
 Inverse chi-squared(38)   P         9.1607       1.0000
 Inverse normal            Z         5.8960       1.0000
 Inverse logit t(99)       L*        6.2693       1.0000
 Modified inv. chi-squared Pm       -3.3081       0.9995
------------------------------------------------------------------------------
 P statistic requires number of panels to be finite.
 Other statistics are suitable for finite or infinite number of panels.
------------------------------------------------------------------------------

. gen diffsocx=D.socx
(19 missing values generated)

. xtunitroot fisher diffsocx, dfuller l(1) //-> unit root not present in first difference -> ar(1)
(19 missing values generated)

Fisher-type unit-root test for diffsocx
Based on augmented Dickey-Fuller tests
---------------------------------------
Ho: All panels contain unit roots           Number of panels  =     19
Ha: At least one panel is stationary        Number of periods =     31

AR parameter: Panel-specific                Asymptotics: T -> Infinity
Panel means:  Included
Time trend:   Not included
Drift term:   Not included                  ADF regressions: 1 lag
------------------------------------------------------------------------------
                                  Statistic      p-value
------------------------------------------------------------------------------
 Inverse chi-squared(38)   P       197.4740       0.0000
 Inverse normal            Z       -10.5401       0.0000
 Inverse logit t(99)       L*      -12.4628       0.0000
 Modified inv. chi-squared Pm       18.2929       0.0000
------------------------------------------------------------------------------
 P statistic requires number of panels to be finite.
 Other statistics are suitable for finite or infinite number of panels.
------------------------------------------------------------------------------

.
end of do-file

"added_text_option" in scatter not working

Hi,
I use Stata 12 on a Mac and have issues with the added-text option (text()) in a scatter graph.
This is my code:
scatter c diff_PL_125 [w=population], msymbol(Oh) xline(0) text(120.78 -36.21 "China rural") text(84.24 -36.43 "India rural")

Stata plots the graph, but omits the specified text.
I already stripped it down to basics and removed the msymbol specification and the weight, but this didn't change anything.

This exact code worked around two years ago on Windows. Unfortunately, I don't have a Windows PC at hand now, so I really don't know what's changed.
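As a self-contained check of whether text() works at all on this installation, a minimal sketch with the auto data (the coordinates are arbitrary but lie inside the plot region):

sysuse auto, clear
scatter price mpg, text(12000 30 "test label")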

Has anyone encountered a similar issue before?

I very much appreciate any comments and ideas!

Caroline

String variables functions

Hello everyone, I have a dataset with a string variable like this:

name

Duomo (1)
Brera (2)
Giardini Porta Venezia (3)
Guastalla (4)
Vigentina (5)
Ticinese (6)
Magenta - S. Vittore (7)

I want to create two new variables: in the first one I want to have only the name, and in the second one the value between parentheses.

Which command do I have to use? Could someone help me?
Thank you in advance.
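A minimal sketch using Stata's regular expression functions; the new variable names are illustrative:

* text before the parentheses, with surrounding blanks trimmed
gen area_name = trim(regexr(name, "\([0-9]+\)", ""))
* numeric value inside the parentheses
gen area_code = real(regexs(1)) if regexm(name, "\(([0-9]+)\)")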

Second Generation Panel Data estimation

Hey Stata Experts,
I am trying to estimate a panel data model using a second-generation panel data technique; here is the command I am running:

xtpmg d.pcgdp d.pcoda d.pcfdi d.inv d.cor d.tr d.le, lr(l.pcgdp pcfdi pcoda cor inv tr le) ec(ec) replace mg full

but Stata does not produce estimates and instead reports "maximum number of iterations exceeded".
Please let me know whether there is a problem with the syntax or something else.

problem in reshaping data

Hello!

I have panel data in long format on mergers and acquisitions: acquirers (uniquely identified by their CUSIP, "acusip") that made several acquisitions over time. Targets are also identified by their CUSIP number ("tcusip"). I ranked the targets of each bidder according to the effective date of the deal ("datee") with egen rank=rank(datee), by(acusip), so that for each acquirer I know its first, second, etc. target.

Then I wanted to reshape the data into wide format with reshape wide tcusip, i(acusip) j(rank), but I got the following error:

"rank not unique within acusip;
there are multiple observations at the same rank within acusip."

Apparently, some deals have the exact same date, so when ranked they were given the average rank (for example, the 3rd and 4th targets were acquired on the same date, so they both have a rank of 3.5).

Please, help me out of this situation! Thanks heaps!
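A minimal sketch of one way to obtain strictly integer, unique ranks, assuming ties on the same date can be broken arbitrarily (here by tcusip) and that other deal-level variables are either included in the reshape or dropped:

* integer, unique rank of each target within acquirer
bysort acusip (datee tcusip): gen rank2 = _n
* keep only what is needed for the wide layout (sketch)
keep acusip tcusip rank2
reshape wide tcusip, i(acusip) j(rank2)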

