Channel: Statalist

How to change a numeric variable to a string variable? tostring and decode don't work.

Dear all,
I am trying to deal with the following problem:
My dataset contains a numeric variable stored as a double. This variable must be converted to string type for my next step.
I tried the tostring command to convert it to a string variable, but it failed. recast cannot change the variable from double to float/int/long/byte. I then used decode to generate a new variable based on this numeric variable, but the new variable has no values.
What can I do to deal with this problem? Please help. Thank you all.
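For reference: decode only fills in values that carry value labels, which would explain the empty result, and tostring refuses to convert when its default display format would lose precision. A minimal sketch of the usual fix, assuming the variable is named dblvar (a hypothetical name):

Code:
* a wider format preserves the full double precision;
* add the -force- option only if tostring still refuses
tostring dblvar, generate(strvar) format(%20.0g)

* or, equivalently, via the string() function
generate strvar2 = string(dblvar, "%20.0g")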

Duo Liu

Histogram, wrong y-axis

Hi guys,

I created a histogram, but the y-axis is wrong. I need the assets on the x-axis and the number of companies on the y-axis. WC02999 is the assets variable. Any suggestions on what I need to change?

Code:
gen asset= 50 if (WC02999>=0 & WC02999<=50000000)
replace asset= 100 if (WC02999>50000000 & WC02999<=100000000)
replace asset= 150 if (WC02999>100000000 & WC02999<=150000000)
replace asset= 200 if (WC02999>150000000 & WC02999<=200000000)
replace asset= 250 if (WC02999>200000000 & WC02999<=250000000)
replace asset= 300 if (WC02999>250000000 & WC02999<=300000000)
replace asset= 350 if (WC02999>300000000 & WC02999<=350000000)
replace asset= 400 if (WC02999>350000000 & WC02999<=400000000)
replace asset= 450 if (WC02999>400000000 & WC02999<=450000000)
replace asset= 500 if (WC02999>450000000 & WC02999<=500000000)
replace asset= 550 if (WC02999>500000000 & WC02999<=550000000)
replace asset= 600 if (WC02999>550000000 & WC02999<=600000000)
replace asset= 650 if (WC02999>600000000 & WC02999<=650000000)
replace asset= 700 if (WC02999>650000000 & WC02999<=700000000)
replace asset= 750 if (WC02999>700000000 & WC02999<=750000000)
replace asset= 800 if (WC02999>750000000 & WC02999<=800000000)
replace asset= 850 if (WC02999>800000000 & WC02999<=850000000)
replace asset= 900 if (WC02999>850000000 & WC02999<=900000000)
tabulate asset
hist asset, frequency
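A hedged alternative that avoids hand-building the bins: histogram can bin WC02999 directly, so the x-axis shows assets and the y-axis counts companies. A sketch assuming 50-million-wide bins, matching the code above:

Code:
* bin assets directly into 50,000,000-wide bars;
* -frequency- puts the number of companies on the y-axis
histogram WC02999 if inrange(WC02999, 0, 900000000), width(50000000) ///
    frequency xtitle("Total assets (WC02999)") ytitle("Number of companies")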

Generating a categorical variable

Hi statalist,

I am trying to generate a categorical variable using seven dummy variables that identify whether a woman experienced a certain type of domestic violence (i.e., kicked, pushed, etc.). I want to generate a categorical variable ranging from 1 to 7 such that women who experienced only 1 of these types of domestic violence would be coded 1, women who experienced any 2 of these types would be coded 2, women who experienced any 3 would be coded 3, and so on. Is there a quick way of generating this variable?

The seven variables are: dvppush dvpslap dvppunch dvpchoke dvpkick dvpmsever dvplsever

Here is my data
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input int year byte(dvppush dvpslap dvppunch dvpchoke dvpkick dvpmsever dvplsever)
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 1 1 0 0 0 0 1
2005 1 0 0 0 0 0 1
2005 0 0 0 0 0 0 0
2005 1 1 1 0 0 0 1
2005 0 1 0 0 0 0 1
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 1 0 0 0 0 1
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 1 1 0 0 0 0 1
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 1 1 0 0 0 0 1
2005 1 0 0 0 0 0 1
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 1 1 1 0 1 0 1
2005 0 1 0 0 0 0 1
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 1 1 0 1 0 1
2005 1 0 0 0 0 0 1
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 1 1 0 0 0 0 1
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 1 0 0 0 0 0 1
2005 0 0 0 0 0 0 0
2005 1 1 0 0 0 0 1
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 1 1 0 0 0 0 1
2005 0 1 0 0 0 0 1
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 1 0 0 0 0 0 1
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 1 1 1 0 1 0 1
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 1 1 1 0 1 1 1
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 1 0 0 0 0 1
2005 0 0 0 0 0 0 0
2005 1 1 1 0 0 0 1
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 1 1 0 0 0 0 1
2005 0 1 0 0 0 0 1
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 1 1 0 0 0 0 1
2005 0 0 0 0 0 0 0
2005 1 1 0 0 0 0 1
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 1 1 1 0 0 0 1
2005 1 1 1 0 0 0 1
2005 1 1 0 0 0 0 1
2005 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0
2005 0 1 0 0 0 0 1
2005 0 0 0 0 0 0 0
end
label values year YEAR
label def YEAR 2005 "2005", modify
label values dvppush DVPPUSH
label def DVPPUSH 0 "no", modify
label def DVPPUSH 1 "yes", modify
label values dvpslap DVPSLAP
label def DVPSLAP 0 "no", modify
label def DVPSLAP 1 "yes", modify
label values dvppunch DVPPUNCH
label def DVPPUNCH 0 "no", modify
label def DVPPUNCH 1 "yes", modify
label values dvpchoke DVPCHOKE
label def DVPCHOKE 0 "no", modify
label values dvpkick DVPKICK
label def DVPKICK 0 "no", modify
label def DVPKICK 1 "yes", modify
label values dvpmsever DVPMSEVER
label def DVPMSEVER 0 "no", modify
label def DVPMSEVER 1 "yes", modify
label values dvplsever DVPLSEVER
label def DVPLSEVER 0 "no", modify
label def DVPLSEVER 1 "yes", modify
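Since each type is a 0/1 dummy, the count of types experienced is simply the row sum of the seven variables. A minimal sketch:

Code:
* rowtotal treats missing as zero; women reporting no violence get 0
egen dvcount = rowtotal(dvppush dvpslap dvppunch dvpchoke dvpkick dvpmsever dvplsever)
tabulate dvcount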

Recoding variable by quartiles

Dear community,
I am conducting a survey analysis in which I have a question measured on an 11-point scale (0-10). I would like to reduce it to a variable with 4 categories by splitting the distribution into four quarters of 25% each. How could I do this?
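A minimal sketch, assuming the question variable is named score (a hypothetical name). Note that with only 11 discrete values, ties mean the groups can only approximate 25% each:

Code:
xtile score4 = score, nq(4)
tabulate score score4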

Kind regards,
John

Is there an nbreg alternative for ppmlhdfe? (presence of overdispersion)

Hi,
I'm running a regression on count data with 4 fixed effects.
Because I include 4 fixed effects, it is not possible to use a simple poisson or nbreg (I set matsize at the maximum and set emptycells drop). I used the reghdfe package by Sergio Correia, which includes ppmlhdfe, a Poisson regression with multiple fixed effects. However, I think my count data are overdispersed and I should use nbreg, but there is no such thing in the reghdfe package.

1) Is there a test to run after ppmlhdfe to check for overdispersion?
2) or is there any alternative to run a nbreg with multiple fixed effects?

I ran the following code with results:
Code:
.  ppmlhdfe teamsize internetdummy invt_network_size invt_pat_count invt_career_age mobile_invt, vce(robust) absorb(cbsacode appyear uspc invt_id) d
(dropped 99460 observations that are either singletons or separated by a fixed effect)
note: 1 variable omitted because of collinearity: invt_career_age
Iteration 1:   deviance = 1.454e+05                  itol = 1.0e-04  subiters = 30   min(eta) = -1.28   [p  ]
Iteration 2:   deviance = 1.407e+05  eps = 3.39e-02  itol = 1.0e-04  subiters = 19   min(eta) = -1.94   [   ]
Iteration 3:   deviance = 1.406e+05  eps = 1.91e-04  itol = 1.0e-04  subiters = 10   min(eta) = -2.03   [   ]
Iteration 4:   deviance = 1.406e+05  eps = 1.43e-07  itol = 1.0e-04  subiters = 3    min(eta) = -2.03   [   ]
Iteration 5:   deviance = 1.406e+05  eps = 1.72e-07  itol = 1.0e-08  subiters = 62   min(eta) = -2.02   [ s ]
Iteration 6:   deviance = 1.406e+05  eps = 7.80e-11  itol = 1.0e-08  subiters = 95   min(eta) = -2.02   [ps ]
Iteration 7:   deviance = 1.406e+05  eps = 2.10e-14  itol = 1.0e-10  subiters = 116  min(eta) = -2.02   [pso]
Iteration 8:   deviance = 1.406e+05  eps = 2.39e-14  itol = 1.0e-10  subiters = 117  min(eta) = -2.02   [pso]
------------------------------------------------------------------------------------------------------------
(legend: p: exact partial-out   s: exact solver   o: epsilon below tolerance)
Converged in 8 iterations and 452 HDFE sub-iterations (tol = 1.0e-08)

HDFE PPML regression                              No. of obs      =    362,605
Absorbing 4 HDFE groups                           Residual df     =    280,588
                                                  Wald chi2(4)    =    1354.34
Deviance             =   140641.425               Prob > chi2     =     0.0000
Log pseudolikelihood = -564388.4936               Pseudo R2       =     0.1827
-----------------------------------------------------------------------------------
                  |               Robust
         teamsize |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
    internetdummy |  -.0025327   .0053141    -0.48   0.634    -.0129481    .0078827
invt_network_size |   .0122162   .0003325    36.74   0.000     .0115645     .012868
   invt_pat_count |   -.001211   .0000895   -13.53   0.000    -.0013864   -.0010355
  invt_career_age |          0  (omitted)
      mobile_invt |  -.0057639   .0086128    -0.67   0.503    -.0226447     .011117
            _cons |   .9506205   .0060948   155.97   0.000     .9386748    .9625661
-----------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
    cbsacode |       481           0         481     |
     appyear |         6           1           5     |
        uspc |       412           1         411    ?|
     invt_id |     81188          72       81116    ?|
-----------------------------------------------------+
? = number of redundant parameters may be higher
I could use a simple poisson regression with only 2 fixed effects, but that would affect the robustness.
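Two hedged notes. First, a Poisson pseudo-likelihood estimator stays consistent under overdispersion as long as the conditional mean is correctly specified, and vce(robust) already makes the inference robust to it, so overdispersion alone may not force a switch to nbreg. Second, a rough, informal check (not a formal test) can be built from Pearson residuals, assuming ppmlhdfe's predict supports the mu option as poisson's does:

Code:
* squared Pearson residuals; a mean well above 1 suggests overdispersion
predict muhat, mu
generate pres2 = (teamsize - muhat)^2 / muhat
summarize pres2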

Any suggestions?

Thanks
Ludo

Errors when interpolating panel data using mipolate

Hello, Statalist,

Long-time listener, first-time caller, here.

I have panel data at the U.S. county x year level. I have observations of several variables (time-varying county characteristics, namely the percent of the population in a particular age bucket, e.g., percent of the population aged 5-14) by decade (1940, 1950, 1960, 1970, and 1980) because they are from the Decennial Census. I would like to interpolate values for the years between decades, within county. I have successfully generated linear interpolations using ipolate. However, I would like to try other functional forms, so I would like to interpolate using mipolate and its pchip, spline, and cubic options. Almost every one of these options presents a different problem.

Pchip gives me this error:
Code:
. mipolate percent5_14 year, by(fcounty1) gen(percent5_14_i_pchip) pchip
           pchipslopes():  3201  vector required
                 pchip():     -  function returned error
            pchipolate():     -  function returned error
                 <istmt>:     -  function returned error

When I run cubic, it interpolates for some decades, but not all decades. See example scatterplot for Connecticut counties. (A separate, less important problem: the way that the green points are connected with a line seems to be screwy.)

Code:
. mipolate percent5_14 year, by(fcounty1) gen(percent5_14_i_cubic) cubic
(194535 missing values generated)


. twoway connected percent5_14_i_cubic year if statefip==9, ms(+) sort || ///
    scatter percent5_14 year if statefip==9, ///
    legend(order(1 "guessed" 2 "known")) xtitle("") yla(, ang(h)) ///
    ytitle("Percent of Pop., Age 5-14") name(cubicp, replace)

[attached: scatterplot of cubic-interpolated vs. known values for Connecticut counties]

Spline seems to run successfully:

Code:
mipolate percent5_14 year, by(fcounty1) gen(percent5_14_i_spline) spline


twoway connected percent5_14_i_spline year if statefip==9, ms(+) sort || scatter percent5_14 year if statefip==9, ///
legend(order(1 "guessed" 2 "known"))  xtitle("") yla(, ang(h)) ytitle("Percent of Pop., Age 5-14") name( spline, replace)
graph export "/home/nwb/hcproductivity/13_2/percent_5_14_i_spline CT.png", as(png) replace
[attached: scatterplot of spline-interpolated vs. known values for Connecticut counties]

I am running Stata 14.2 MP on a Linux server.

I would greatly appreciate any advice and help to get me out of this jam!

Thanks,
Nate


Code:
foreach i in percent5_14 percent1524 percent2534 percent3544 percent4554 percent5564 percent6574 percent75 {
    bysort fcounty1: ipolate `i' year, gen(i`i')
}


xtset fcounty1 year
format year %ty

sort fcounty1 year
mipolate percent5_14 year, by(fcounty1) gen(percent5_14_i_spline) spline
mipolate percent5_14 year, by(fcounty1) gen(percent5_14_i_pchip) pchip
mipolate percent5_14 year, by(fcounty1) gen(percent5_14_i_cubic) cubic

set scheme s1color

twoway connected percent5_14_i_spline year if statefip==9, ms(+) sort || scatter percent5_14 year if statefip==9, ///
legend(order(1 "guessed" 2 "known"))  xtitle("") yla(, ang(h)) ytitle("Percent of Pop., Age 5-14") name( spline, replace)
graph export "/home/nwb/hcproductivity/13_2/percent_5_14_i_spline CT.png", as(png) replace

twoway connected percent5_14_i_pchip year if statefip==9, ms(+) sort || scatter percent5_14 year if statefip==9, ///
legend(order(1 "guessed" 2 "known"))  xtitle("") yla(, ang(h)) ytitle("Percent of Pop., Age 5-14") name( pchip, replace)
graph export "/home/nwb/hcproductivity/13_2/percent_5_14_i_pchip CT.png", as(png) replace

twoway connected percent5_14_i_cubic year if statefip==9, ms(+) sort || scatter percent5_14 year if statefip==9, ///
legend(order(1 "guessed" 2 "known"))  xtitle("") yla(, ang(h)) ytitle("Percent of Pop., Age 5-14")  name( cubicp, replace)
graph export "/home/nwb/hcproductivity/13_2/percent_5_14_i_cubic CT.png", as(png) replace
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input double fcounty1 float(year percent5_14 statefip)
   0    .         . .
1000    .         . .
1001 1927         . 1
1001 1928         . 1
1001 1929         . 1
1001 1930         . 1
1001 1931         . 1
1001 1932         . 1
1001 1933         . 1
1001 1934         . 1
1001 1935         . 1
1001 1936         . 1
1001 1937         . 1
1001 1938         . 1
1001 1939         . 1
1001 1940 .23730753 1
1001 1941         . 1
1001 1942         . 1
1001 1943         . 1
1001 1944         . 1
1001 1945         . 1
1001 1946         . 1
1001 1947         . 1
1001 1948         . 1
1001 1949         . 1
1001 1950  .2252282 1
1001 1951         . 1
1001 1952         . 1
1001 1953         . 1
1001 1954         . 1
1001 1955         . 1
1001 1956         . 1
1001 1957         . 1
1001 1958         . 1
1001 1959         . 1
1001 1960  .2381664 1
1001 1961         . 1
1001 1962         . 1
1001 1963         . 1
1001 1964         . 1
1001 1965         . 1
1001 1966         . 1
1001 1967         . 1
1001 1968         . 1
1001 1969         . 1
1001 1970 .25179887 1
1001 1971         . 1
1001 1972         . 1
1001 1973         . 1
1001 1974         . 1
1001 1975         . 1
1001 1976         . 1
1001 1977         . 1
1001 1978         . 1
1001 1979         . 1
1001 1980 .19021048 1
1001 1981         . 1
1001 1982         . 1
1001 1983         . 1
1001 1984         . 1
1001 1985         . 1
1001 1986         . 1
1001 1987         . 1
1001 1988         . 1
1001 1989         . 1
1001 1990         . 1
1001 1991         . 1
1001 1992         . 1
1001 1993         . 1
1001 1994         . 1
1001 1995         . 1
1001 1996         . 1
1001 1997         . 1
1001 1998         . 1
1001 1999         . 1
1001 2000         . 1
1001 2001         . 1
1001 2002         . 1
1001 2003         . 1
1001 2004         . 1
1001 2005         . 1
1001 2006         . 1
1001 2007         . 1
1003 1927         . 1
1003 1928         . 1
1003 1929         . 1
1003 1930         . 1
1003 1931         . 1
1003 1932         . 1
1003 1933         . 1
1003 1934         . 1
1003 1935         . 1
1003 1936         . 1
1003 1937         . 1
1003 1938         . 1
1003 1939         . 1
1003 1940  .2143299 1
1003 1941         . 1
1003 1942         . 1
1003 1943         . 1
end
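A hedged guess about the pchip error: the dataex excerpt shows rows (fcounty1 0 and 1000) whose year, the x-variable, is missing, and by() groups consisting only of such rows may be what trips pchipslopes() in Mata. A sketch worth trying before digging deeper:

Code:
* drop (or exclude with -if-) observations with a missing x-variable
drop if missing(year)
mipolate percent5_14 year, by(fcounty1) gen(percent5_14_i_pchip) pchip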

Analysing duplicates

Hi guys,

I need to analyse duplicates. I've got different newspaper articles, each with a story_id. These articles mention different EU and US companies. First I need to work out how many companies are mentioned in each article. For that I used:
Code:
duplicates report rp_story_id
Second I need to analyse how many US and Non-US companies (country_code=="US") are mentioned each year.

Example:

company   country_code   story_id      headline                      year
VW        DE             NDJHAODUW     Earnings announcement 3. Qu   2003
BMW       DE             NDJHAODUW     Earnings announcement 3. Qu   2003
GM        US             NDJHAODUW     Earnings announcement 3. Qu   2003
VW        DE             SODOEIKDIDI   Earnings announcement 1. Qu   2004
GM        US             SODOEIKDIDI   Earnings announcement 1. Qu   2004
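duplicates report only summarises the counts; for analysis it may be handier to store them. A minimal sketch, using the variable names from the example:

Code:
* number of companies mentioned in each article
bysort rp_story_id: gen n_companies = _N

* mentions of US vs. non-US companies per year
generate byte us = country_code == "US"
tabulate year us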



Thanks for your help!

Binary probit or categorical probit

Dear community,
In the first model I used binary dependent and independent variables (see the attached pictures); in the second model I used dependent and independent variables measured in 4 categories. The binary model seems better in terms of log likelihood (log likelihood for the binary model = -12509.156; for the categorical model = -26665.522) and effect significance; however, I am still unsure about which model is better to choose.

Could you advise me on that please?

[attached: probit output for both models]

Help Panel Var..can't get over this!!

Hello guys,

I don't know if someone will be so kind as to help me with this, but I have been trying to get to a point for days. I need to compute the IRFs for this dataset, and every time Stata gives me an error message: "equation not found", "type mismatch", "derivative already defined". I don't know how I can do it.

Can someone help me o give me some hint?

I really need your help.

Test overidentification when instrumenting xtivreg2 with predicted values of xtpoisson

Dear all

I have two questions, one stata question and another econometric question.

First the stata question:

I am performing the following model as suggested by Professor Jeff Wooldridge in another thread which is available here (as well as in chapter 19 of Econometric analysis of cross section and panel data) :

Code:
xtset municipality quarter

xtpoisson preg_rate c.z1##c.z1 c.z2##c.z2  x1 x2 x3 xi, fe vce(robust)
predict yhat

xtivreg2 log_lbw x1 x2 x3 xi   (preg_rate=yhat), fe vce(robust) endog(preg_rate)
where z1 and z2 are my instruments; x1-xi my control variables; preg_rate is the pregnancy rate per municipality (pregnancies per 1000 fertile women) and is treated as a count; and log_lbw is the percentage of low birth weight newborns per municipality [log_lbw=log(1+lbw_rate)]. Both preg_rate and lbw_rate have many zeros (due to small municipalities and the low occurrence of low weight births) and are thus treated as counts, following the methodology proposed by Lindo et al. (2017) which deals with abortion rates.


I would like to ask how could I perform a test of overidentification manually, given that xtivreg2 only detects one instrument (yhat) and thus assumes the model is perfectly identified, when in reality, there are 2 instruments: z1 and z2.
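One hedged possibility for the Stata question: run the linear IV directly with z1 and z2 (and their squares, mirroring the first stage) as excluded instruments. With more instruments than endogenous regressors, xtivreg2 reports the Hansen J overidentification statistic. This gives up the efficiency motivation for the fitted-value procedure, so it is a specification check rather than a replacement:

Code:
generate z1sq = z1^2
generate z2sq = z2^2
xtivreg2 log_lbw x1 x2 x3 xi (preg_rate = z1 z2 z1sq z2sq), fe robust endog(preg_rate)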

Regarding the econometric question

I would also like to treat the variable lbw_rate as a count and perform an IV Poisson regression. Nevertheless, as Professor Joao Santos Silva has mentioned often in this forum and elsewhere, an IV Poisson model with fixed effects may suffer from inconsistency caused by the incidental parameters problem associated with fixed effects.

For that reason I employ xtreg with the log transformed variable (also helps to read the results in percentage terms).

My question is whether this is a legitimate way to get around the incidental parameters problem of xtivpoisson with fixed effects.

Thank you very much for your time and consideration.





Notes:
xtivreg2 is a user-written command by Mark E Schaffer.

I have quarterly data on most of the 274 municipalities from 2007 to 2014 (unbalanced)
or
I have quarterly data on all 274 municipalities from 2010 to 2014 (balanced)


References:
Cunningham, S., Lindo, J. M., Myers, C. K., & Schlosser, A. (2017). How far is too far? New evidence on abortion clinic closures, access, and abortions. NBER Working Paper, (w23366).
Wooldridge, J. M. (2002). Econometric analysis of cross section and panel data. MIT press. pp: 623-625

Add histogram of values distribution when graphing conditional marginal effects?

Hi all,

I am learning about graphing the marginal effects of a variable conditional on the values of a second variable. I see how to use the margins command and then graph those results with confidence intervals, which gives me a sense of whether the marginal effects of one variable in the interaction are significant over the value range of the other variable in the interaction. But, my conditioning variable has a range of 0 to 85, with a mean of less than 3. I would like to show a histogram of that distribution overlaid on top of the marginal effects graph. I see that in R, there is a package called "interplot" that makes it easy. Does anyone know how to do the same thing in Stata? I'm on version 15 if that matters.

Here is what I have been doing:

Code:
logit dependentvariable variableA variableB c.variableA#c.variableB covariateC covariateD, cluster(yadayada)
margins, dydx(variableA) at(variableB=(0(5)85)) vsquish
marginsplot, yline(0)
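marginsplot has an addplot() option that can overlay additional twoway plots, which is roughly what interplot does in R. A hedged sketch: put the histogram of variableB on a second y-axis so its scale doesn't swamp the effects line (the below suboption, if your marginsplot supports it, draws the bars behind the line):

Code:
marginsplot, yline(0) addplot(histogram variableB, percent yaxis(2) below)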

Thanks so much for any tips.

How to get a graph line with multiple variables in the x-axis?

Hi everyone,

I have a tricky question regarding the line graph with multiple variables.

My problem is that I want to get multiple variables on the x-axis, along with two distinct lines across them for a categorical variable.

I have read through the graphics PDF manual and previous posts about multiple variables, but I cannot seem to find a way to get multiple variables on a single x-axis. I am attaching a picture describing what I want to get. Blocks numbered from 1 to 8 are individual variables in the picture, meaning that there are 32 different variables on the x-axis.

[attached: sketch of the desired graph, with 8 numbered blocks on the x-axis]
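One hedged approach, since Stata's line plots want a single x-variable: reshape the block variables long so the block number itself becomes the x-axis. A sketch with hypothetical names (b1-b8 for the block variables, id for the observation identifier, group for the two-level categorical variable):

Code:
reshape long b, i(id) j(block)
collapse (mean) b, by(group block)
twoway (connected b block if group==1, sort) ///
       (connected b block if group==2, sort), ///
    xlabel(1(1)8) legend(order(1 "Group 1" 2 "Group 2"))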



Thank you in advance,
Ezgi

Looping and importing excel sheets

Hi all,

I am importing 12 worksheets from Excel. These worksheets have a common name, running from II1 to II12. However, my code doesn't work.

Can someone please tell me where I am going wrong?

Code:
* fixed: -foreach i =1/12- is not valid syntax (use -forvalues-),
* the doubled backslash and doubled quote broke the file paths,
* and -rename y II`i- was missing the underscore and closing quote
forvalues i = 1/12 {
    import excel "C:\Users\rk0022\Dropbox\II.xls", sheet("II`i'") firstrow clear case(lower)
    reshape long y_, i(country) j(year)
    rename y_ II`i'
    save "C:\Users\rk0022\Dropbox\II`i'.dta", replace
}
Thank you.

Ritika

Linearity assumption in Logistic

Hi statalist.
I wonder, when running a logistic regression, do I have to check the linearity assumption between a continuous independent variable and the logit of the dependent variable in the univariable model or in the multivariable model?
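Either way, one common check is the Box-Tidwell-style approach: add the term x*ln(x) to the model and test its coefficient. A minimal sketch with hypothetical names (y, x, covar1, covar2; requires x > 0):

Code:
generate x_lnx = x*ln(x)
logit y x x_lnx covar1 covar2
* a significant coefficient on x_lnx suggests the linearity
* assumption fails for x on the logit scale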

net install lpdensity, error r(677)

Hi,

I use rddensity in Stata and already have rddensity installed on my computer, but then an error was reported:

plotting feature requires command lpdensity, install with
net install lpdensity, from(https://sites.google.com/site/nppack...pdensity/stata) replace
r(111);

When I tried the above code to install this package, another error was reported:

net install lpdensity, from(https://sites.google.com/site/nppack...pdensity/stata) replace
remote connection failed -- see help r(677) for troubleshooting
https://sites.google.com/site/nppack...density/stata/ either
1) is not a valid URL, or
2) could not be contacted, or
3) is not a Stata download site (has no stata.toc file).

current site is still http://fmwww.bc.edu/repec/bocode/r/
r(677);

I already tried to tell Stata the name and port of the HTTP proxy server using set httpproxyhost, set httpproxyport, and set httpproxy on; however, after all this the same problem was still reported.

Then I went to the above link (https://sites.google.com/site/nppack...pdensity/stata), which works totally fine for me. I tried to download each of the files, but then Chrome said that the site cannot be reached. I was also thinking of installing the package manually; however, if I am not able to download all these files, I do not know how to do this.
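Two hedged ideas. The package may also be mirrored on SSC, which avoids the Google Sites URL entirely; and if the files can be fetched by some other means, net install accepts a local folder containing lpdensity.pkg and stata.toc (the path below is hypothetical):

Code:
* worth a try first: install from SSC
ssc install lpdensity, replace

* or install from a local copy of the package files
net install lpdensity, from("C:/downloads/lpdensity") replace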

I am not sure if anything wrong with the settings of my computer...

Any help would be greatly appreciated!

Creating a variable: TRUE or FALSE based on a date falling within two other dates for each participant, of which there are multiple ID values

Hello, my apologies for not using dataex. I am using healthcare data so I have created an example below.
I am looking to create a variable that will satisfy this equality:

visit_date (for visit_type==1) <= visit_date (for visit_type==2) <= visit_date (for visit_type==1) + 365 days


here is the structure I have:

study_id   unique_id   visit_type   visit_date
       1      165423            1   01/01/2000
       1      164651            2   06/07/2000
       2      949628            1   03/05/2001
       2      489461            2   04/05/2002
       3      984665            1   02/20/2002
       3      894861            2   01/06/2003
       4      894156            1   10/10/2002
       4      876464            3   10/02/2003
       4      786386            2   11/05/2003

// all date values are of type double and format %tc
// unique_id is unique to every occurrence of "activity" i.e. unique in the entirety of the dataset
// I have given an example where I have more than one value for what is supposed to be the follow-up measurement but the actual measurement was erroneous in some way (e.g. three study_id values, and visit_type==3 instead of 1 or 2).

study_id   unique_id   visit_type   visit_date   within_window
       1      165423            1   01/01/2000               .
       1      164651            2   06/07/2000               1
       2      949628            1   03/05/2001               .
       2      489461            2   04/05/2002               0
       3      984665            1   02/20/2002               .
       3      894861            2   01/06/2003               1
       4      894156            1   10/10/2002               .
       4      876464            3   10/02/2003               .
       4      786386            2   11/05/2003               0

// you can see that I desire within_window==. if visit_type!=2
// I don't think -reshape wide- will help because as it stands there are > 100 variables and > 10,000 observations in the dataset

I have tried something very simple and non-elegant:

Code:
     
* visit_date for visit_type==1
gen double FirstVisitDate=visit_date if visit_type==1
  format FirstVisitDate %tc
     
* visit_date for visit_type==2
gen double FollowUpVisitDate=visit_date if visit_type==2
  format FollowUpVisitDate %tc

* visit_datefor visit_type==1 + 365 days
gen double FirstVisitDate_plus365=visit_date if visit_type==1
  format FirstVisitDate_plus365 %tc
  replace FirstVisitDate_plus365=FirstVisitDate_plus365+3.1536*10^10
  // 3.1536*10^10 = 1 year in milliseconds (non-leap since %tc)

* var returned within the time-window
gen within_window=.                                               
  replace within_window=1 if FirstVisitDate < FollowUpVisitDate < FirstVisitDate_plus365          
  replace within_window=0 if missing(FollowUpVisitDate) | FollowUpVisitDate > FirstVisitDate_plus365
This solution seems somewhat bizarre though.
I end up with the dataset looking something like this:

study_id   unique_id   visit_type   visit_date   FirstVisitDate   FollowUpVisitDate   FirstVisitDate_plus365   within_window
       1      165423            1   01/01/2000       01/01/2000                   .               01/01/2001               0
       1      164651            2   06/07/2000                .          06/07/2000                        .               1

Clearly the value within_window==1 is correct; however, I am confused about how Stata is reading this, given that the observations are on different lines. Would Stata not be evaluating the calculation of within_window as:


Code:
FirstVisitDate==01/01/2000 < FollowUpVisitDate==. < FirstVisitDate_plus365==01/01/2001

FirstVisitDate==. < FollowUpVisitDate==06/09/2000 < FirstVisitDate_plus365==.
I have checked multiple values manually which appear to be correct but there are too many to sort through by simply observing value by value etc.

Questions:
  1. Can someone offer a more elegant solution, point out my errors, or suggest a better way to quality-check this one? (A sketch follows below.)
  2. Also, I would like within_window==. for the observations with visit_type==1, but I suppose that when I aggregate I can keep only values of 0 or 1 where visit_type==2 if my data stay in the form above.
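A hedged sketch of a by-group approach, using the variable names from the post: spread each study_id's type-1 date to all of its rows, then test the type-2 rows against the window. Note that the chained comparison in the original code does not do what it looks like: Stata evaluates a < b < c as (a < b) < c, so the window must be written as two conditions or with inrange().

Code:
* the type-1 date for each study_id (missing if the first-sorted
* visit_type is not 1)
bysort study_id (visit_type): gen double first_date = ///
    cond(visit_type[1]==1, visit_date[1], .)
format first_date %tc

* 365*86400000 ms = 365 days under %tc; within_window stays missing
* for visit_type!=2, as desired
gen within_window = inrange(visit_date, first_date, first_date + 365*86400000) ///
    if visit_type==2 & !missing(first_date)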

Thanks for the help as always.

Stacked/Bar Graphs for Multiple Categorical Variables

Hello,

I am examining socio-demographic differences in attitudes towards FGC practice. Here is an example of my data:

Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(FGC_GoodPractice Gender DOB RegionBirth)
0 2 3 1
0 2 1 1
0 1 3 1
0 1 3 1
0 1 3 1
0 2 1 1
0 1 1 1
0 1 1 1
0 2 1 1
0 1 3 1
end
label values FGC_GoodPractice labels20
label def labels20 0 "No", modify
label values Gender labels1
label def labels1 1 "Women", modify
label def labels1 2 "Men", modify
label values DOB labels0
label def labels0 1 "< 25", modify
label def labels0 3 "45-64", modify
label values RegionBirth labels2
label def labels2 1 "West Africa", modify



The following code gave me the relevant crosstabulations, shown below:


Code:
tabout Gender DOB RegionBirth FGC_GoodPractice using FGC_1.txt, ///
c(row ci) sum f(3) ///
style(tab) stats(chi2) font(bold) npos(col) cisep(-)


  
                             No     Yes   Don't Know      N
                              %       %            %    No.     %
Respondent Sex
  Women                    77.1    20.8          2.2    371   100
  Men                      66.2    32.4          1.4    145   100
Respondent Date of Birth
  < 25                     75.8    23.1          1.1    264   100
  45-64                    72.5    24.6          2.8    211   100
  > 64                     67.6    29.7          2.7     37   100
African Region of Birth
  West Africa              73.0    25.1          1.9    482   100
  East Africa              88.9     7.4          3.7     27   100
  South Africa             80.0    20.0          0.0      5   100
  North Africa            100.0     0.0          0.0      2   100
I want to convert the results to the bar graph below, produced using Excel (I would actually prefer a stacked bar).



[attached: Excel bar chart of the crosstabulation above]


graph hbar and catplot can produce a graph of one variable at a time, but how do I get a single graph showing the distribution of about 6 variables by the DV (approval of FGC practice)?
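A hedged sketch for one demographic variable, using contract to compute the row percentages and graph hbar with asyvars stack for the stacked bars; repeating the block per variable and assembling the pieces with graph combine is one way to get all six into a single figure:

Code:
preserve
contract Gender FGC_GoodPractice
bysort Gender: egen tot = total(_freq)
generate pct = 100*_freq/tot
graph hbar (asis) pct, over(FGC_GoodPractice) over(Gender) ///
    asyvars stack ytitle("Percent") name(bysex, replace)
restore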

Thanks in advance for your assistance.

best wishes - cY

Interaction without main effects-should it be significant too?

Should an interaction without main effects be significant if the interaction term is significant in the model with main effects? I have a continuous variable and a dummy variable that takes the value of 1 if the continuous variable is above a threshold and 0 otherwise. I include the interaction between these two (corr is 0.5, VIF is less than 1.5), which is statistically significant. However, without the main effects, the interaction term is not significant. Do I have a problem here?

change in sign and significance of linear term after adding quadratic and cubic terms

Hi all,

I need your comments on a change in the sign and significance of a linear term after adding quadratic and cubic terms. Please see the details below:

I am using panel data with 681 firms and 9 years each.

DV: In_roa (firm performance)
IV: ff_d (dummy variable, whether firm is family firm or not, if family then 1)
IV: stdfo_own (shares held by an investment company in a firm; values range from 0 to 100; moreover, I standardized this variable by egen stdvar=std(var) in order to get standardized coefficients) ----- quadratic and cubic terms of this variable are also used later.

rest four variables are control variables

DV and all control variables are log transformed

I also added interaction terms for both IVs.

When I ran xtreg, fe with only the linear terms and their respective interaction, the results were the following:

Code:
. xtreg ln_roa ln_age ln_assets ln_leverage ln_ebitda stdfo_own i.ff_d c.stdfo_own#i.ff_d i.year, fe

Fixed-effects (within) regression               Number of obs     =      2,619
Group variable: sr                              Number of groups  =        435

R-sq:                                           Obs per group:
     within  = 0.3449                                         min =          1
     between = 0.0702                                         avg =        6.0
     overall = 0.0961                                         max =          9

                                                F(15,2169)        =      76.14
corr(u_i, Xb)  = -0.9166                        Prob > F          =     0.0000

----------------------------------------------------------------------------------
          ln_roa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
          ln_age |  -.2425169   .1151951    -2.11   0.035    -.4684213   -.0166125
       ln_assets |   .0968887   .0733825     1.32   0.187    -.0470188    .2407961
     ln_leverage |   -.090931   .0229502    -3.96   0.000    -.1359377   -.0459243
       ln_ebitda |   1.143277   .0353174    32.37   0.000     1.074017    1.212536
       stdfo_own |   .1853135   .0686506     2.70   0.007     .0506857    .3199414
          1.ff_d |   1.106262   .4312164     2.57   0.010     .2606212    1.951902
                 |
ff_d#c.stdfo_own |
              1  |  -.3014559   .1265948    -2.38   0.017    -.5497158   -.0531961
                 |
            year |
           2008  |  -.1011256   .0543446    -1.86   0.063    -.2076986    .0054474
           2009  |  -.1452684   .0588673    -2.47   0.014    -.2607106   -.0298262
           2010  |  -.1071921   .0573892    -1.87   0.062    -.2197355    .0053514
           2011  |  -.0159812   .0603661    -0.26   0.791    -.1343627    .1024002
           2012  |   -.021513   .0640318    -0.34   0.737    -.1470832    .1040571
           2013  |  -.0861119   .0674674    -1.28   0.202    -.2184194    .0461957
           2014  |  -.0374172   .0699141    -0.54   0.593    -.1745228    .0996884
           2015  |  -.0255822   .0741488    -0.35   0.730    -.1709924     .119828
                 |
           _cons |   1.423105   .9439093     1.51   0.132    -.4279559    3.274166
-----------------+----------------------------------------------------------------
         sigma_u |  2.2670211
         sigma_e |  .64303764
             rho |  .92553463   (fraction of variance due to u_i)
----------------------------------------------------------------------------------
F test that all u_i=0: F(434, 2169) = 10.38                  Prob > F = 0.0000
But when I add the quadratic and cubic terms with their respective interaction terms, the sign of the main linear effect of stdfo_own changes, and the main variable and its interaction term also become insignificant. Please see below:

Code:
. xtreg ln_roa ln_age ln_assets ln_leverage ln_ebitda stdfo_own i.ff_d c.stdfo_own#i.ff_d
>     c.stdfo_own#c.stdfo_own c.stdfo_own#c.stdfo_own#c.stdfo_own
>     i.ff_d#c.stdfo_own#c.stdfo_own i.ff_d#c.stdfo_own#c.stdfo_own#c.stdfo_own i.year, fe

Fixed-effects (within) regression               Number of obs     =      2,619
Group variable: sr                              Number of groups  =        435

R-sq:                                           Obs per group:
     within  = 0.3525                                         min =          1
     between = 0.0702                                         avg =        6.0
     overall = 0.0972                                         max =          9

                                                F(19,2165)        =      62.04
corr(u_i, Xb)  = -0.9157                        Prob > F          =     0.0000

----------------------------------------------------------------------------------------------------------
                                  ln_roa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------------------------------------+----------------------------------------------------------------
                                  ln_age |  -.2487596    .114643    -2.17   0.030    -.4735815   -.0239377
                               ln_assets |   .1082457   .0732253     1.48   0.139    -.0353535    .2518448
                             ln_leverage |  -.0875726   .0228485    -3.83   0.000    -.1323799   -.0427653
                               ln_ebitda |   1.147255   .0351986    32.59   0.000     1.078229    1.216282
                               stdfo_own |  -.4299763   .2723444    -1.58   0.115    -.9640601    .1041074
                                  1.ff_d |   1.385639   .5345586     2.59   0.010     .3373372    2.433941
                                         |
                        ff_d#c.stdfo_own |
                                      1  |   .1154103   .4098834     0.28   0.778    -.6883959    .9192164
                                         |
                 c.stdfo_own#c.stdfo_own |   1.829437   .3914211     4.67   0.000     1.061837    2.597038
                                         |
     c.stdfo_own#c.stdfo_own#c.stdfo_own |   -.548248   .1106208    -4.96   0.000    -.7651821    -.331314
                                         |
            ff_d#c.stdfo_own#c.stdfo_own |
                                      1  |   -1.62318   .5700911    -2.85   0.004    -2.741162   -.5051965
                                         |
ff_d#c.stdfo_own#c.stdfo_own#c.stdfo_own |
                                      1  |   .5078677   .1566739     3.24   0.001     .2006207    .8151146
                                         |
                                    year |
                                   2008  |  -.0976755   .0541075    -1.81   0.071    -.2037836    .0084325
                                   2009  |  -.1358211   .0586226    -2.32   0.021    -.2507835   -.0208587
                                   2010  |  -.0975943   .0571462    -1.71   0.088    -.2096614    .0144728
                                   2011  |    .000361   .0602345     0.01   0.995    -.1177626    .1184846
                                   2012  |  -.0069673   .0638666    -0.11   0.913    -.1322135     .118279
                                   2013  |   -.076881   .0672857    -1.14   0.253    -.2088323    .0550702
                                   2014  |  -.0218242   .0697872    -0.31   0.755    -.1586811    .1150327
                                   2015  |  -.0140085   .0739652    -0.19   0.850    -.1590588    .1310417
                                         |
                                   _cons |   .8776022   .9563418     0.92   0.359    -.9978418    2.753046
-----------------------------------------+----------------------------------------------------------------
                                 sigma_u |  2.2607953
                                 sigma_e |  .63990051
                                     rho |  .92582914   (fraction of variance due to u_i)
----------------------------------------------------------------------------------------------------------
F test that all u_i=0: F(434, 2165) = 10.46                  Prob > F = 0.0000
It would be great if anyone could explain why this change happens and what it means.
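One point that may help frame the answer: once the quadratic and cubic terms are in the model, the coefficient on stdfo_own is only the slope at stdfo_own = 0, so its sign and significance are not directly comparable with the linear-only model. A hedged sketch of how to see the effect over the whole range instead (the at() range is hypothetical; pick one matching the standardized variable's support):

Code:
margins ff_d, dydx(stdfo_own) at(stdfo_own = (-1(0.5)3))
marginsplot, yline(0)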
regards,

Npregress slow with large data-sets, small samples

Hi there, Stata brethren.

Recently I have been trying to use the new nonparametric regression feature in Stata 16, npregress series, on different subsamples of my data. I found it to be slow. After digging in, I think I've discovered a strange behavior, where npregress becomes much slower when you increase the size of the data-set in memory, without changing the size of the sample in the estimation.

Consider the below example.
Toy example
Code:
clear
set obs 100000
gen x1 = runiform()
gen x2 = runiform()
gen y = cos(x1)*sin(x2) + x1^2 + 1/3*runiform()
npregress series y x1 x2 if _n < 1001, polynomial
This takes my computer about 60 seconds to run. Now I use the exact same sample, but drop the unused observations.

Code:
drop if _n >=1001
npregress series y x1 x2 if _n < 1001, polynomial
This takes about 2 seconds. This was not the expected behavior, because if I run a similar experiment with regress instead of npregress, the speeds are roughly the same.

Can someone explain why this is happening? Is npregress utilizing the unsampled data somehow? I was hoping to be able to repeatedly run npregress on subsamples of my data in order to construct non-parametric predictions without needing to repeatedly shuffle the data in memory (which will also take a long time, given that I am using a moderately large data-set).
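A hedged workaround sketch in the meantime: preserve/keep/restore trims the data to the estimation sample before each call without permanently reshuffling what is in memory, which sidesteps whatever work npregress does over the full dataset:

Code:
preserve
keep if _n < 1001
npregress series y x1 x2, polynomial
restore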

Best,
Rustin