First of all, my solution works, but it is slow. Maybe I should just be content that it works, but I can imagine an increasingly complicated dataset where my solution would be painfully slow! Is Stata efficient with foreach loops? Is there a more efficient data structure? In R I would use matrices.
I am using the public-use CDC multiple cause mortality data (http://www.nber.org/data/vital-stati...-of-death.html). The files are very large.
The multiple causes of death are coded (as ICD-10 codes) in 20 variables named record_1 to record_20. I would like to select only observations that have certain multiple causes of death: at least one code from list A AND at least one code from list B. levelsof shows hundreds (thousands?) of distinct values in the record_* variables, but I am only interested in 3 codes for list A and 16 for list B. The record_* variables are strings and can in principle take the same values, but not every value is observed in every variable.
So I loop over the 3 codes from list A for each of the 20 record_* variables and keep only the observations that match. This first step reduces the number of observations from 2,394,871 to ~10,000. Then I loop over the 16 codes from list B in each of the 20 record_* variables; I don't care which variable they show up in or in what order. Finally, I keep all observations that meet the second criterion.
Code:
capture drop OD
gen OD = .

// List A: flag observations with any of the three T-codes in any record_* variable
foreach var_A of varlist record_* {
    foreach MCD_A in T402 T403 T404 {
        replace OD = 1 if `var_A' == "`MCD_A'"
    }
}
keep if OD == 1

// List B: among the remaining observations, flag any of the X/Y codes
foreach var_B of varlist record_* {
    foreach MCD_B in X40 X41 X42 X43 X44 X60 X61 X62 X63 X64 X85 Y10 Y11 Y12 Y13 Y14 {
        replace OD = 2 if `var_B' == "`MCD_B'"
    }
}
keep if OD == 2
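One variant that might cut the number of passes, as a sketch only: replace the inner loop over codes with inlist(), so each record_* variable is scanned once for list A (20 replace passes over the full file instead of 60). The toy data, the three-variable layout, and the names hasA and hasB are hypothetical stand-ins, not the real file. inlist() limits how many string arguments it takes (10 in total, as I read the function help), so list B is split across two calls.

```stata
// Hypothetical toy data: 3 record_* variables standing in for record_1-record_20
clear
input str4 record_1 str4 record_2 str4 record_3
"T402" "X42"  ""
"T403" "I250" ""
"I250" "X42"  ""
"J189" "T404" "Y12"
end

// Stage 1: one pass per variable for list A, then shrink the data
gen byte hasA = 0
foreach v of varlist record_* {
    replace hasA = 1 if inlist(`v', "T402", "T403", "T404")
}
keep if hasA

// Stage 2: list B has 16 codes, so it is split across two inlist() calls
gen byte hasB = 0
foreach v of varlist record_* {
    replace hasB = 1 if inlist(`v', "X40", "X41", "X42", "X43", "X44", "X60", "X61", "X62")
    replace hasB = 1 if inlist(`v', "X63", "X64", "X85", "Y10", "Y11", "Y12", "Y13", "Y14")
}
keep if hasB
list record_*
```

On the toy data this should keep the first and fourth rows. I have not timed it against the nested loops on the real files, so any speed gain is untested.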
The nested foreach loops work (I based them on Nicholas Cox's excellent for-loop "fortitude" tutorial), but they seem inefficient, especially because I then repeat this for each of the yearly files from 1999 to 2014. Any thoughts? Posting sample data would be unwieldy; I looked for a relevant toy dataset but couldn't really find anything helpful.
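For the yearly files, one possible pattern is to filter each year separately and stack only the small filtered results. This is a rough sketch: the file names toy_mort1999.dta and toy_mort2000.dta, the two-variable layout, and the variable names hasA, hasB, and year are all made up for illustration. The real loop would run over 1999/2014 and the actual file names. It needs to be run as a single do-file because it uses a tempfile.

```stata
// Hypothetical toy yearly files standing in for the real 1999-2014 files
clear
input str4 record_1 str4 record_2
"T402" "X42"
"I250" "X42"
end
save "toy_mort1999.dta", replace

clear
input str4 record_1 str4 record_2
"T404" "Y12"
"T403" "I250"
end
save "toy_mort2000.dta", replace

// Start with an empty dataset to collect the filtered observations
tempfile stacked
clear
save `stacked', emptyok

forvalues y = 1999/2000 {
    use "toy_mort`y'.dta", clear
    gen byte hasA = 0
    gen byte hasB = 0
    foreach v of varlist record_* {
        replace hasA = 1 if inlist(`v', "T402", "T403", "T404")
        replace hasB = 1 if inlist(`v', "X40", "X41", "X42", "X43", "X44", "X60", "X61", "X62")
        replace hasB = 1 if inlist(`v', "X63", "X64", "X85", "Y10", "Y11", "Y12", "Y13", "Y14")
    }
    keep if hasA & hasB
    gen int year = `y'          // remember which file each row came from
    append using `stacked'
    save `stacked', replace
}

use `stacked', clear
sort year
list year record_*
```

Only the filtered rows are ever appended, so the stacked file should stay small even though each yearly file is large.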
Thanks!