Hello Statalist,
1. Problem description
I have a data set that is produced by a data base in the format as in the table below, although this example is simplified. The data data set is outputted in a
dataex script below the table. The data set is a result of merging two data sets,
data1 and
data2. In the example below the variable names show which variable comes from which data set. The two data sets are merged on
country.
After I have this data I want to multiply the value in
data2_varK with the value of the variable listed in
data1_varY. So for country
Aland I want to multiply
data1_varY with
data2_varK (
22 * .25), but for
Eland I wan to multiply
data1_varZ with
data2_varK (
5 * .5). While most countries will only have the one value I want to use in
data1_varX,
data1_varY and
data1_varZ, there are cases, like
Cland in the example, that has multiple values.
We want to do this multiple times and sometimes it is not multiplication. It could also be generate a dummy if value in
data1_varX,
data1_varY or
data1_varZ is higher than a cut-off value that also coming from
data2.
country |
data1_varX |
data1_varY |
data1_varZ |
data2_varJ |
data2_varK |
Aland |
|
22 |
|
data1_varY |
.25 |
Bland |
|
34 |
|
data1_varY |
.75 |
Cland |
15 |
42 |
|
data1_varX |
.6 |
Dland |
24 |
|
|
data1_varX |
.85 |
Eland |
|
|
34 |
data1_varZ |
.75 |
Fland |
|
|
5 |
data1_varZ |
.5 |
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str5 country byte(data1_varX data1_varY data1_varZ) str10 data2_varJ double(data2_varK result)
"Aland" . 22 . "data1_varY" .25 5.5
"Bland" . 34 . "data1_varY" .75 25.5
"Cland" 15 42 . "data1_varX" .6 9
"Dland" 24 . . "data1_varX" .85 20.4
"Eland" . . 34 "data1_varZ" .75 25.5
"Fland" . . 5 "data1_varZ" .5 2.5
end
2. What we've tried so far
Test 2a;
Code:
gen result2a = data2_varK * `=data2_varJ[_n]'
We have tried something like this, but
`=data2_varJ[_n]' takes the value of
data2_varJ from the first row for all rows. So for all rows the code evaluates to this since
data2_varJ in the first row is
"data1_varY".
Code:
gen result2a = data2_varK * data1_varY
Test 2b;
Code:
bys data2_varJ : gen result2b = data2_varK * `=data2_varJ[_n]'
This has the same result as test 2a.
`=data2_varJ[_n]' is evaluated into
data1_varY for all rows, like this:
Code:
bys data2_varJ : gen result2b = data2_varK * data1_varY
Test 2c;
Code:
bys data2_varJ : gen result2c = data2_varK * `=data2_varJ[1]'
This gave the same result as in 2b.
3. What works but we are concerned will be too slow
Code:
gen results3a = .
forvalue obs = 1/`=_N' {
replace results3a = data2_varK * `=data2_varJ[`obs']' if _n == `obs'
}
We are so far only piloting this, so this might not be too slow, but we are concerned that it will be. We also would love to be able to solve this on one line as then we can plug it in to the data base so it happens on the fly. Nevertheless, it is good to have this loop-option as an option of last resort in case someone else have a better suggestion.
Thanks,
Kristoffer