I have hierarchical data with 3 levels (cases nested in groups, and groups nested in geographic spaces). The data are in wide format, where each row is a level 1 case. There are about 2 million level 1 units. At present I am trying to get a simple count of the number of level 1 units within each level 2 unit. The level 2 units are identified by a string variable, and many entries are duplicates that result from data entry errors (for example, misspellings). Using egen's tag function, I know there are about 19,000 unique entries (again, many of these are actually duplicates with slightly different spellings, and that is the issue I am attempting to address).
In going through and hand coding these 19,000 level 2 entries, it is helpful to see which entries have 1 or 2 cases associated with them and which have many (say, 1,000 or more); the former are likely to be typographical mistakes, while the latter are valid names of level 2 units.
Again using egen, I can group these units...
Of course, I could simply do a count by each group...
but with 19,000 or more level 2 units this is too onerous.
I need a variable that contains the count of level 1 units for each unique level 2 entry. I tried the following, but it does not seem to work. This is the issue.
Ideally, I could then list each level 2 entry with the number of level 1 cases associated with it, put that in Excel, and clean it up by grouping duplicate entries under a unified code/name. That is the hope, anyway.
I know there is a simple way to do this.
Code:
egen grouptag=tag(level2var)
Code:
egen groups=group(level2var)
Code:
bysort groups:count
Code:
bysort groups: egen casecount=count
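One likely reason the line above fails is that egen's count() function expects an expression in parentheses. Below is a minimal sketch of two ways to build the per-group count. It assumes the level 2 string variable is called level2var, as above, and uses a small made-up dataset (the place names are purely illustrative) so it can be run on its own.

Code:
```stata
* Toy data standing in for the real file (values are hypothetical)
clear
input str20 level2var
"Springfield"
"Springfield"
"Springfield"
"Springfeild"
"Shelbyville"
"Shelbyville"
end

* Option 1: within a by-group, _N is the number of observations in that group
bysort level2var: gen long casecount = _N

* Option 2: egen's count() needs an argument; count(1) counts every row in the group
egen long casecount2 = count(1), by(level2var)

* The two variables should agree on every row
assert casecount == casecount2
```

Either variable can then be listed alongside the tag variable. Grouping directly on the string variable should give the same counts as grouping on the numeric id from egen's group() function, since both identify the same sets of rows.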
Code:
list level2name casecount if grouptag, clean noobs
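For the Excel step, one possible route is contract, which reduces the data in memory to one row per distinct value together with its frequency, followed by export excel. This is only a sketch under assumptions: the variable name level2var, the toy values, and the file name level2_counts.xlsx are all placeholders, not taken from the real data.

Code:
```stata
* Toy data standing in for the real file (values are hypothetical)
clear
input str20 level2var
"Springfield"
"Springfield"
"Springfield"
"Springfeild"
"Shelbyville"
"Shelbyville"
end

* contract replaces the data in memory, so keep a copy to restore afterwards
preserve

* One row per distinct level2var, with its frequency stored in casecount
contract level2var, freq(casecount)

* Rarest entries first, since those are the likely misspellings
sort casecount level2var
list, clean noobs

* Write the table to a workbook (hypothetical file name)
export excel using "level2_counts.xlsx", firstrow(variables) replace

* Bring back the original case-level data
restore
```

The resulting sheet has one line per level 2 spelling, which can be recoded by hand and then merged back onto the case-level data by level2var.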