Creating table where avg return for each portfolio in each year is given

October 30, 2019, 7:33 am

Dear Stata users,

I am quite new to Stata and struggling with some code.

I want to find out if there is a correlation between a company's goodwill and its returns. I calculated the monthly returns and goodwill-to-asset ratios for 250 firms over a period of 13 years and set them up as panel data in Stata.
Then I made different portfolios for each month by using the following code: bys date: egen portfolios = xtile(gwa), nq(5)
This is a sample of my data:

name gwa return marketcap companyname year month date x portfolios
NOVOZYMES A/S - GOODWILL - GROSS .029167 .091569 .004189 155 2006 1 2006m1 1 2
MERCK KGAA - GOODWILL - GROSS .10902 .062225 .002097 143 2006 1 2006m1 1 3
NATURGY ENERGY GROUP - GOODWILL - GROSS .033331 .027543 .001873 148 2006 1 2006m1 1 2
DIAGEO PLC - GOODWILL - GROSS .012174 .04012 .00355 71 2006 1 2006m1 1 1
LANXESS AG - GOODWILL - GROSS .004368 .059602 .000501 135 2006 1 2006m1 1 1
SVENSKA CELLULOSA - GOODWILL - GROSS .127964 .122239 .011656 213 2006 1 2006m1 1 3
CAPGEMINI SE - GOODWILL - GROSS .23913 .017238 .000955 42 2006 1 2006m1 1 4
HEINEKEN HOLDING - GOODWILL - GROSS .176639 .060422 .001051 97 2006 1 2006m1 1 4
ALFA LAVAL AB - GOODWILL - GROSS .235412 .153414 .004812 11 2006 1 2006m1 1 4
ROYAL UNIBREW A/S - GOODWILL - GROSS .094737 .135891 .0006 182 2006 1 2006m1 1 3
GIVAUDAN SA - GOODWILL - GROSS .228314 .0049 .00111 93 2006 1 2006m1 1 4
TELEFONAKTIEBOLAGET - GOODWILL - GROSS .033887 .018313 .061239 221 2006 1 2006m1 1 2
SPECTRIS PLC - GOODWILL - GROSS .401544 .053161 .000137 207 2006 1 2006m1 1 5
NOKIA OYJ - GOODWILL - GROSS .024395 .021263 .008561 151 2006 1 2006m1 1 2
KBC GROUP NV - GOODWILL - GROSS .005253 .02719 .004501 125 2006 1 2006m1 1 1
NORDEA BANK ABP - GOODWILL - GROSS .006011 .005975 .038167 153 2006 1 2006m1 1 1
FIAT CHRYSLER - GOODWILL - GROSS .062612 .013937 .001863 86 2006 1 2006m1 1 3
RTL GROUP SA - GOODWILL - GROSS .628545 .059643 .001789 184 2006 1 2006m1 1 5

I want to create a table with the 5 portfolios as column headings and the years as rows, so that I can see the average return of each portfolio in each year, as done in Fama and French.

Kind regards
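
A possible way to build such a table is sketched below. It assumes the variable names from the post (year, portfolios, return) and uses made-up toy data in place of the real panel, so the numbers mean nothing. Option 1 prints a two-way table of means; Option 2 produces the same layout as a dataset with one row per year and one column per portfolio.

```stata
* Toy stand-in for the poster's data (values are invented for illustration)
clear
set seed 12345
set obs 600
generate year       = 2006 + mod(_n, 3)
generate portfolios = 1 + mod(_n, 5)
generate return     = rnormal(0.01, 0.05)

* Option 1: years as rows, portfolios as columns, cells hold mean returns
tabulate year portfolios, summarize(return) means

* Option 2: the same layout as a dataset (return1 ... return5)
collapse (mean) return, by(year portfolios)
reshape wide return, i(year) j(portfolios)
list, noobs
```

Note that both options average the monthly returns within each year and portfolio with equal weights; a value-weighted average would need a weight such as marketcap.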

↧

Referencing outside data set to generate new variables

October 30, 2019, 8:13 am

I've generated a data set (call this percentile.dta) that connects test scores to percentiles depending on the year and test type. I'm left with 5 variables: score, test1_percentile, test2_percentile, test3_percentile, test4_percentile.

I then have a different data set (call this score.dta) that has variables like test1_score, test2_score, test3_score, and test4_score for 400+ individual IDs, and I want to generate test1_percentile, test2_percentile, test3_percentile, test4_percentile for each ID. Each ID doesn't necessarily have values for all 4 test scores.

Is there a way to match test scores from score.dta to their respective percentiles in percentile.dta without merging the data? In a way, percentile.dta acts as the code to generate percentile variables.

↧

Can stata do reduced rank regression analysis?

October 30, 2019, 8:16 am

Hello everybody,
Does anyone have experience running reduced rank regressions in Stata? A colleague of mine will have to use this as the main type of analysis and has so far only found solutions for SAS.
Thank you very much for your help and answers!
Julia

↧

Help on creating a new variable

October 30, 2019, 8:32 am

'Create a variable which gives you the change in price from the 1st to 2nd period of a combination'

Can anyone help? Message me if you would like more background. I would really appreciate it.

↧

Panel structure with repeated time values: problem with xtset

October 30, 2019, 8:51 am

Dear All,

I am working with a dataset of tweets by politicians. For each tweet, some dummy variables (such as "inequality" and "migration" in the example below) are equal to 1 if that particular tweet deals with a specific topic, e.g. inequality and migration respectively. My final objective is to set up a dynamic panel model to study the relationship between politicians talking about particular subjects and their shares in present and past opinion polls (variable "share").

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str22 name_polit str7 yearmo double share float(inequality migration) str33 tweet_id
"Politician X" "2018m2"               13.1 1 0 "797153663818320_802734809926872"
"Politician X" "2018m2"               13.1 0 0 "797153663818320_805419726325047"
"Politician X" "2018m2"               13.1 0 1 "797153663818320_801139120086441"
"Politician X" "2018m3" 18.566666666666666 0 0 "797153663818320_806538146213205"
"Politician X" "2018m3" 18.566666666666666 1 0 "797153663818320_809722572561429"
"Politician X" "2018m3" 18.566666666666666 0 0 "797153663818320_806901342843552"
"Politician X" "2018m3" 18.566666666666666 0 1 "797153663818320_808973869302966"
"Politician X" "2018m4"              20.45 0 0 "797153663818320_828926897307663"
"Politician X" "2018m4"              20.45 1 0 "797153663818320_823786454488374"
"Politician X" "2018m4"              20.45 0 0 "797153663818320_819100928290260"
"Politician X" "2018m4"              20.45 0 1 "797153663818320_819338594933160"
"Politician X" "2018m5" 22.640000000000004 1 0 "797153663818320_835309080002778"
"Politician X" "2018m5" 22.640000000000004 0 1 "797153663818320_832123746987978"
"Politician X" "2018m5" 22.640000000000004 0 0 "797153663818320_839839332883086"

end

My problem is that I have monthly shares, but within a month the same politician tweeted several times. Hence I cannot xtset using "name_polit" as id and "yearmo" as time. I'm stuck on how to organize the panel: if I want to know whether a politician talked about migration because his or her previous month's share went down, I need to preserve the monthly structure, but repeated time values within a panel violate the panel structure. I cannot use "tweet_id" either, because it uniquely identifies each single tweet and therefore is not repeated over time.

I would really appreciate your help on this.

Many thanks!
Giovanni
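
One option, if the analysis can work at the politician-month level, is to collapse the tweets to one row per politician and month before calling xtset. The sketch below uses a shortened, made-up version of the example data; the names mdate, id, and L_share are hypothetical, and counting tweets per topic with (sum) is only one of several reasonable aggregations.

```stata
* Shortened toy version of the example data (values are illustrative)
clear
input str22 name_polit str7 yearmo double share float(inequality migration)
"Politician X" "2018m2" 13.1  1 0
"Politician X" "2018m2" 13.1  0 1
"Politician X" "2018m3" 18.57 0 0
"Politician X" "2018m3" 18.57 1 0
"Politician X" "2018m4" 20.45 0 1
end

* Turn the string month into a Stata monthly date and the name into a numeric id
generate mdate = monthly(yearmo, "YM")
format mdate %tm
encode name_polit, generate(id)

* One row per politician-month: count tweets per topic, keep the monthly share
collapse (sum) inequality migration (first) share, by(id mdate)

* Time values are now unique within panel, so xtset and lags work
xtset id mdate
generate L_share = L.share
list, noobs
```

If the tweet-level detail matters, an alternative is to keep the tweet-level file and merge the lagged monthly share back onto it, but then the data are no longer a panel in the xtset sense.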

↧

Handling time and date

October 30, 2019, 8:58 am

Dear all,
I'm very new to Stata; I'm more used to SPSS. I need to define missing values for a time variable, but no matter what I try it doesn't work, and I can't seem to find answers in any material.
This is how I defined the variable (this part worked out well):
gen double SDf= SEf - SOf
format %tcHH:MM SDf

(this did not)
*Dealing with suspicious values
recode SDf (tc(31dec1959 00:00:00)/tc(31dec1959 03:27:00) = tc(31dec1959 00:00:00)) (tc(31dec1959 14:00:00)/max = tc(31dec1959 00:00:00)), copyrest
mvdecode SDf , mv(tc(31dec1959 00:00:00))

To cut a long story short, I need to set values up to 3:27 a.m. to missing, and values from 2 p.m. onwards to missing as well. Since I defined the SDf variable as "tc", I thought I could use tc() in the same way when recoding it and then marking the result as missing. However, I'm just getting "unknown el tc in rule".
Please, can anyone help me with how to handle this? My life very much depends on it...
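
The error message suggests that recode rules do not accept tc() literals. A sketch of an alternative is to use replace with an if condition, where tc() is allowed. This assumes SDf holds a duration in milliseconds, so that 03:27 corresponds to tc(01jan1960 03:27:00); the three toy values below are invented.

```stata
* Toy durations in milliseconds: 02:00, 08:00 and 15:00 (illustrative only)
clear
set obs 3
generate double SDf = 2*3600000 in 1
replace SDf = 8*3600000 in 2
replace SDf = 15*3600000 in 3
format SDf %tcHH:MM

* Set suspicious values to missing directly, without recode or mvdecode
replace SDf = . if SDf <= tc(01jan1960 03:27:00)
replace SDf = . if SDf >= tc(01jan1960 14:00:00) & !missing(SDf)
list
```

With these toy values only the 08:00 observation survives. If SDf is stored on a different scale, the two cutoffs would need adjusting.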

↧

Access 95%CI after using svy: mean variables, over(group)

October 30, 2019, 9:07 am

Hi,

I'm working with a dataset that uses some fancy survey sampling design, which I think I've finally understood. After using the svy-commands, and so on, I want to find the means and 95% CI of the variable by age. The command finds the mean, standard error, and 95% CI, but I have no idea how to access the confidence interval, even after a lot of googling!

Please see the MWE below, where a dataset is included at the end.

Code:

svyset cluster [pweight= wgt], strata(stratum)
svy: mean giver

Note that the svy: mean giver command gives me the mean, standard error and the 95% CI. I also want to store the 95% CI in a local or scalar. What is the name of the matrix that the CI is stored in?

Code:

scalar mean = e(b)[1,1] //works
di mean
scalar CI_lowerbound = e(ci)[1,1] // does not work, in what matrix are the lower and upper bounds of the ci stored?

Where are the 95% CI stored?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float wgt byte(stratum cluster) float giver
26.538713 1 2 0
.14287305 2 1 0
13.541413 2 1 0
1.5445018 1 2 0
1.2428784 1 3 0
.14605893 1 3 0
  41.5022 1 2 0
1.3487246 1 2 0
1.3300264 1 3 0
2.2500246 2 2 1
 10.70082 2 2 0
 26.03309 1 3 0
2.2902308 1 3 0
1.3894604 1 2 0
end
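
One place to look is r(table), which estimation commands in recent Stata versions leave behind with rows named b, se, ll, ul, and so on. The sketch below assumes that svy: mean returns r(table) in your version; the matrix name T and the scalar CI_upperbound are hypothetical.

```stata
clear
input float wgt byte(stratum cluster) float giver
26.538713 1 2 0
.14287305 2 1 0
13.541413 2 1 0
1.5445018 1 2 0
1.2428784 1 3 0
.14605893 1 3 0
  41.5022 1 2 0
1.3487246 1 2 0
1.3300264 1 3 0
2.2500246 2 2 1
 10.70082 2 2 0
 26.03309 1 3 0
2.2902308 1 3 0
1.3894604 1 2 0
end

svyset cluster [pweight = wgt], strata(stratum)
svy: mean giver

* Copy the results table right away, before another command overwrites r()
matrix T = r(table)
matrix list T

* Look the rows up by name rather than by position
scalar CI_lowerbound = T[rownumb(T, "ll"), 1]
scalar CI_upperbound = T[rownumb(T, "ul"), 1]
display CI_lowerbound " " CI_upperbound
```

With over(group), the same matrix should have one column per group, so the column index selects the group.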

↧

Reshaping cross sections to panel data

October 30, 2019, 9:15 am

Dear Ms./Mr.,

I urgently need your help converting the following data set, which covers 1980-1999:

X_Country1 Y_Country1 X_Country2 Y_Country2 X_Country3 Y_Country3

I have been asked to pool it and run a regression on it in Stata. Should I add a year variable? If yes, how? And how can I transform the data from this shape into a pooled shape so that I can run the regression?

Thank you in advance for your support

Best,
Rania
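
A sketch of one way to pool such data with reshape long is shown below. It assumes one row per year in the original file, in order from 1980 to 1999, and uses random toy values; the names year, country, x, and y are hypothetical.

```stata
* Toy wide data: one row per year, one X and one Y column per country
clear
set seed 1980
set obs 20
generate year = 1979 + _n
forvalues c = 1/3 {
    generate X_Country`c' = rnormal()
    generate Y_Country`c' = rnormal()
}

* Stack the countries: one row per year-country pair
reshape long X_Country Y_Country, i(year) j(country)
rename (X_Country Y_Country) (x y)

* Pooled regression, then the panel declaration for panel estimators
regress y x
xtset country year
```

If the real file has no year variable, generate year = 1979 + _n only works when the rows are already sorted by year.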

↧

Tweaking 1st stage of 3SLS

October 30, 2019, 9:26 am

I am trying to estimate following two simultaneous equations using 3SLS:
y1 = x1 x2
y2 = y1 y1*x1 x1 x3 (where y1*x1 is an interaction term; it is the variable of interest)

My problem is the 1st-stage results for the 1st equation, because 3SLS regresses y1 on x1, x2, x3 and y1*x1 (it takes all RHS variables from both equations). This of course totally messes up the results of the first stage (as the interaction term absorbs much of the variance in the dependent variable) and destroys the second-stage results.

As a check, I tried running this instead:
y1 = x1 x2
y2 = y1 x1 x3
Here the results are good (y1 enters the way I expect). But my model focuses on the interaction term in that second regression, so this is not good enough. I also get good results if I run the regression above as 2SLS (because the 1st equation becomes the 1st-stage equation and is not augmented with additional variables from the second one). But, again, I have good reason to believe that y2 also affects y1, so 2SLS is not good enough.

Questions: how can I run the 3SLS above, without having the interaction term y1*x1 entering the 1st equation? Or can I replicate the 3SLS by somehow doing all the steps "manually" (presumably with residuals from simple OLS regressions) and having full control over 1st stage?

↧

program end leads to stata breaking?

October 30, 2019, 10:50 am

Hi Statalist,

I am having difficulty fixing this particular (probably syntactic?) error and would appreciate another pair of eyes looking at it.

I want to define a program inside a forvalues loop. I've written a simpler version of what I want here:

Code:

forvalues i = 1 / 12 {
    di "`i'"

    cap prog drop aaa
    prog def aaa
        di "hello `i'"
    end 
}

When I run this (Stata/MP 15.1 on Windows 10), I receive the following error message:

Code:

--Break--
r(1);

end of do-file

--Break--
r(1);

Oddly, the `i' that should display at the beginning of the loop never displays itself, unless I remove the end command like this:

Code:

forvalues i = 1 / 12 {
    di "`i'"

    cap prog drop aaa
    prog def aaa
        di "hello `i'"
   * end 
}

In this latter case, `i' displays for the first loop before I get the error message you would expect, telling me I never finished defining my program:

Code:

1
  1. 
unexpected end of file
r(612);

end of do-file

r(612);

Thank you for the help!

Julian
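
The behavior described suggests that an end line inside the braces of a loop is not read as the end of the program definition. One workaround, assuming the program body does not need to differ between iterations, is to define the program once outside the loop and pass the loop index as an argument. A program cannot see the caller's local macro `i' in any case, so it has to be passed in.

```stata
* Define the program once, outside any loop
capture program drop aaa
program define aaa
    args i
    display "hello `i'"
end

* Call it from inside the loop, handing over the loop index
forvalues i = 1/12 {
    display "`i'"
    aaa `i'
}
```

If the body really must change per iteration, writing each definition to a temporary do-file and running it from the loop is another route, though it is usually more awkward than an argument.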

↧

Is margins the correct approach?

October 30, 2019, 12:08 pm

Hi all,

A colleague and I are trying to determine whether an analysis we'd like to run can be done with the margins command. They write:

"Back in my graduate school days, I *think* I remember running a bunch of equations around women’s and men’s salary gaps, and basically looking at what we would expect a woman to earn if they had men’s slopes (at their own means) and what we’d expect men to earn if they had women’s slopes (at their own means) – OR something like that. The point was to show how much the gap would close if women got the same mileage out of their experiences as men got."

To me, this sounds close to calculating marginal effects, but I could be wrong here. We are also thrown off by the Stata documentation for margins, which states:

"after a linear regression fit on males and females, the marginal mean (margin of mean) for males is the predicted mean of the dependent variable, where every observation is treated as if it represents a male; thus those observations that in fact do represent males are included, as well as those observations that represent females. The marginal mean for female would be similarly obtained by treating all observations as if they represented females."

The documentation seems to indicate that the margins command would not be used in the case my colleague described. As such, I am hoping for more clarity on the math behind the margins command, and insight into whether or not it is the best way to approach our case.

Thank you!

↧

How to check if I have duplicates without perfectly duplicated obs

October 30, 2019, 12:19 pm

Dear All,

I have a dataset in long format with repeated IDs, as there are observations for different dates. I would like to check whether there are differences between the variables for each ID and date (as my title says: duplicates without perfectly duplicated obs).

Code:

  
clear
input id str10 date char1 char2 char3 char4
1 1/1/2009 4 5 4 0
1 2/2/2009 4 5 5 10
1 3/3/2009 4 5 4 10
1 4/4/2009 4 5 4 10
1 5/5/2009 4 5 4 10
end

I have tried to use duplicates, but I have more than 40 variables, so it is not practical to use.

I have also tried to use collapse (I checked the help in Stata), but I don't know if it's the correct way to go.

Code:

ds date, not
collapse (lastnm) `r(varlist)', by(date)

Thank you for your help in advance.
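
If the goal is to find which variables change across dates within the same ID (one reading of the question), a loop can flag them without listing 40 variables by hand. In the sketch below the diff_ prefix and the variable first are hypothetical names, and the data are the example from the post with the date read as a string.

```stata
clear
input id str10 date char1 char2 char3 char4
1 1/1/2009 4 5 4 0
1 2/2/2009 4 5 5 10
1 3/3/2009 4 5 4 10
1 4/4/2009 4 5 4 10
1 5/5/2009 4 5 4 10
end

* For each variable: sort within id, then compare the smallest and largest value
foreach v of varlist char1-char4 {
    bysort id (`v'): generate byte diff_`v' = `v'[1] != `v'[_N]
}

* One summary row per id: 1 means the variable varies within that id
egen byte first = tag(id)
list id diff_* if first, noobs
```

Missing values sort last, so an ID with some missing and some nonmissing values for a variable is flagged as varying as well.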

↧

Adding a plot of means to a twoway graph.

October 30, 2019, 12:39 pm

Dear All,

I want to plot a twoway graph that includes both a linear fit with confidence intervals and the annual means (connected with a line).

My data is an annual panel:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(oxy_mme_pc year)
  .014084662 2006
  .014647138 2007
    .0873531 2008
   .10673542 2009
   .08485814 2010
           0 2011
           0 2012
           . 2013
           . 2014
           . 2015
           . 2016
   .04371211 2006
    .0459794 2007
  .064410366 2008
   .07556807 2009
   .05561658 2010
.00008645455 2011
           0 2012
           . 2013
           . 2014
           . 2015
           . 2016
  .008630031 2006
  .022363244 2007
   .05436933 2008
   .05925127 2009
    .0424795 2010
           0 2011
           0 2012
           . 2013
           . 2014
           . 2015
           . 2016
   .05500945 2006
    .0762932 2007
    .1241695 2008
    .1195685 2009
   .07976383 2010
           0 2011
           0 2012
           . 2013
           . 2014
           . 2015
           . 2016
   .02577892 2006
   .03027122 2007
   .04367403 2008
   .04493835 2009
  .035172656 2010
           0 2011
           0 2012
           . 2013
           . 2014
           . 2015
           . 2016
           0 2006
.00012212787 2007
.00024554916 2008
  .003843192 2009
 .0032100165 2010
           0 2011
           0 2012
           . 2013
           . 2014
           . 2015
           . 2016
  .007248664 2006
  .014635595 2007
   .03227554 2008
   .03791877 2009
  .025497457 2010
           0 2011
           0 2012
           . 2013
           . 2014
           . 2015
           . 2016
   .04249525 2006
  .070987284 2007
   .11885042 2008
   .13005434 2009
   .10445116 2010
           0 2011
           0 2012
           . 2013
           . 2014
           . 2015
           . 2016
   .04897014 2006
   .09370453 2007
   .15168366 2008
   .14806564 2009
    .0993745 2010
           0 2011
           0 2012
           . 2013
           . 2014
           . 2015
           . 2016
   .02212559 2006
end

I plotted the linear fit with the CI lines using the following:

Code:

twoway (lfitci oxy_mme_pc year if year<2011) ///
    (lfitci oxy_mme_pc year if year>2009, lpattern(dash)), xline(2010) ///
    xtitle(Year) xmtick(2006(1)2012) ///
    legend(/*ring(0) pos(5)*/ col(1) order(2 "Original before 2010, fit" 4 "Original after 2010, fit")) legend(size(vsmall)) ///
    xlabel(2006 "2006" 2007 "2007" 2008 "2008" 2009 "2009" 2010 "2010" 2011 "2011" 2012 "2012", angle(45)) name(a)

How can I add the annual means of the variable oxy_mme_pc, connected with a line? Thank you for your help.

Sincerely,
Sumedha.
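
One approach is to compute the annual means as a new variable and overlay them as a third plot. The sketch below uses random toy data in place of the panel; mean_oxy and first are hypothetical names, and the legend and axis options from the original command are left out for brevity.

```stata
* Toy annual panel: 10 units observed 2006-2012 (values are invented)
clear
set seed 2019
set obs 70
generate year = 2006 + mod(_n, 7)
generate oxy_mme_pc = runiform() / 10

* Annual mean, repeated on every row of that year
egen mean_oxy = mean(oxy_mme_pc), by(year)

* Tag one row per year so each mean is drawn only once
egen byte first = tag(year)

twoway (lfitci oxy_mme_pc year if year < 2011) ///
       (lfitci oxy_mme_pc year if year > 2009, lpattern(dash)) ///
       (connected mean_oxy year if first, sort), ///
       xline(2010) xtitle(Year)
```

Adding a plot changes the legend keys, so the numbers in legend(order()) may need adjusting once the third plot is in place.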

↧

margins command

October 30, 2019, 12:43 pm

Hi,

I am fitting a probit model because school enrollment is a binary variable. Height-for-age is a continuous variable.

schoolenroll=male+heightforagezscore+(heightforage *male)

I want the effect of height on school enrollment for girls versus boys.
I can't figure out the margins command that would give me direct estimates.

Please help, thanks
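
A sketch of one way to get these estimates is to fit the model with factor-variable notation, so that margins knows about the interaction, and then ask for the marginal effect of height separately by sex. The variable names follow the post, with male assumed to be coded 0/1; the data below are simulated and purely illustrative.

```stata
* Simulated stand-in data (not the poster's)
clear
set seed 42
set obs 500
generate byte male = runiform() < 0.5
generate heightforagezscore = rnormal()
generate byte schoolenroll = rnormal() + 0.3*heightforagezscore + 0.2*male > 0

* i. marks male as categorical, c. marks height as continuous, ## adds the interaction
probit schoolenroll i.male##c.heightforagezscore

* Average marginal effect of height on Pr(enrolled), separately for girls and boys
margins male, dydx(heightforagezscore)

* Difference between the two marginal effects (boys minus girls)
margins r.male, dydx(heightforagezscore)
```

Creating the interaction by hand as a separate variable would hide it from margins, which is why the ## notation is used here.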

↧

Shading Graphs - Area Plot Tapered

October 30, 2019, 1:02 pm

Hello,

I am trying to plot a trend line with recessionary periods shaded in grey. Prior posts on this forum address the methods of shading graphs (e.g., https://www.stata.com/statalist/arch.../msg00121.html). These discussions were helpful in creating a graph that nearly does what I want. The problem is that the grey shaded areas are wider on the bottom than on the top. Does anyone know how to make the shaded area have vertical lines perpendicular to the x-axis?

Here is a snippet of data that shows the issue.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(time recession trend)
184 0  5.275247
185 0  5.435284
186 0  5.782407
187 0  5.680537
188 0  5.860635
189 0  5.696178
190 0  5.665268
191 7  5.749439
192 7   5.69439
193 7  5.597833
194 7  5.342285
195 7  5.042449
196 7 4.5402293
197 7 4.2877626
198 0  3.875503
199 0  3.930877
200 0  4.123271
201 0  4.307055
202 0  4.420278
203 0 4.5420003
end
format %tq time

This is the command that generates the graph with the tapered areas.

Code:

twoway  (area recession time, color(gs12)) (line trend time, xlab(184(4)203,labsize(small) angle(45)) yscale(range(0 7)) yla(0(1)7) legend(order(2 1) label(1 "Recession") label(2 "Trend")))

In case it is helpful, I am using Stata/SE 15.1. Thanks.
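
The taper most likely arises because an area plot connects consecutive points, so the shading ramps from 0 up to 7 between one quarter and the next. One alternative, offered as a sketch rather than a confirmed fix, is to draw the recession indicator as bars one quarter wide, which gives vertical edges. The data below are a made-up stand-in shaped like the example.

```stata
* Toy quarterly series with a recession from quarter 191 through 197
clear
set obs 20
generate time = 183 + _n
format %tq time
generate recession = cond(inrange(time, 191, 197), 7, 0)
generate trend = 5 + sin(_n/3)

* Bars of width 1 touch each other and have vertical sides
twoway (bar recession time, barwidth(1) color(gs12)) ///
       (line trend time), ///
       yscale(range(0 7)) ylabel(0(1)7) ///
       legend(order(2 "Trend" 1 "Recession"))
```

Each bar is centered on its quarter, so the shaded block extends half a quarter beyond the first and last recession observations.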

↧

Fuzzy merge between two datasets, using jarowinkler distance

October 30, 2019, 1:18 pm

Hi,

I have one dataset that contains information about the full name (two names, two surnames) of 200.000 persons, approximately. This (string) information is not of high quality, in the sense that there are a lot of names that are misspelled or incomplete.

What I am trying to do is basically a merge between this dataset and another that contains the correct full names of 13.000.000 individuals (including the 200.000 from the previous database). Since the names are misspelled or incomplete, a typical merge is not really a good solution, so I am going for a fuzzy merge.

I have been trying with the reclink command, but it takes forever to run and the results I get do not make much sense (score=1.0 for the two string variables selected, with actual "values" that are completely different).

I would like to do the "fuzzy merge" using the Jaro-Winkler distance measure, but I am struggling with this. In the perfect scenario, I would be able to get more than one "merge candidate" for the 200.000 observations.

Just in case, I am currently using Stata 14.0.

Thanks in advance,
Isidora.

↧

Specification curve

October 30, 2019, 2:10 pm

FYI: Hans Sievertsen has released a demo program on GitHub that creates a specification curve. In the example below, the graph displays 72 robustness checks.

https://github.com/hhsievertsen/speccurve

[Attached image: specification curve graph]

↧

Estimate of mean squared error

October 30, 2019, 2:53 pm

Hello,

Is there a Stata command that returns the mean squared error from a mixed-effects linear regression model? I want to estimate this quantity for a simulation analysis that I am doing.

All best,
Jack

↧

Batistical Analysis

October 30, 2019, 3:07 pm

Happy Halloween

Code:

cap preserve
drop _all
set seed 7
set obs 15
g y=uniform()
g x=uniform()
g m="🦇"
tw sc y x, graphr(c(purple)) plotr(c(black)) msy(i) mla(m) mlabp(0) mlabs(*15) ysc(r(-.2 1.2)) xsc(r(-.2 1.2) noli) yla("",nog) xla("") yti("y",s(*1.5) orient(hor)) xti("x",s(*1.5)) sch(sj)
restore

↧

Weighting by how often something shows up

October 30, 2019, 3:52 pm

I am trying to weight the points of a twoway scatter by how frequently they show up.
For instance:
scatter var1 var2

Say I have many observations at the point (1,1) on my scatter plot, but not that many at (0,0). I want to draw large hollow circles, and they should be much bigger at (1,1) since there are many more observations at that point. Can you help?
Thanks!
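
One way to do this is to count the observations at each point and pass the count to scatter as a weight, which scales the marker size. The sketch below uses simulated data; n and first are hypothetical names.

```stata
* Simulated data where most observations sit at (1,1)
clear
set seed 1
set obs 200
generate var2 = runiform() < 0.8
generate var1 = var2
replace var1 = 1 - var1 in 1/10

* Number of observations sharing each (var1, var2) point
bysort var1 var2: generate n = _N

* Plot each distinct point once, with a hollow circle sized by its count
egen byte first = tag(var1 var2)
scatter var1 var2 if first [fweight = n], msymbol(Oh)
```

Marker sizes are scaled relative to each other within the plot, so the circles show relative rather than absolute frequency.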

↧