Hi everyone,
I am trying to clean a dataset with the following variables:
      HID  visitno  day  mon  year
  1.  601        1    2   12  2012
  2.  601        2   12   12  2012
  3.  601        3   17   12  2012
  4.  601        4   22   12  2012
  5.  601        5   27   12  2012
  6.  601        6    1   12  2012
  7.  602        1   30   12  2012
  8.  602        2   10   12  2012
  9.  602        3   15   12  2012
 10.  602        4   20   12  2012
 11.  602        5   25   12  2012
 12.  602        6   30   12  2012
 13.  603        1   16    2  2013
 14.  603        2   21    2  2013
 15.  603        3   26    2  2013
 16.  603        4    3    3  2013
 17.  603        5    8    8  2013
 18.  603        6   13    3  2013
 19.  604        1   16    5  2013
 20.  604        2   11    5  2013
 21.  604        3   26    5  2013
 22.  604        4    3    6  2013
 23.  604        5    8    6  2013
 24.  604        6   13    6  2013
Interviewers visit each household on 6 occasions to conduct a survey. HID is the household identifier, visitno is the visit number, and the interview date is split into three variables (day, mon and year). For example, line 10 shows that household 602's 4th interview (visitno=4) was conducted on 20 December 2012. In general, the time between two consecutive interviews is 5 or 10 days, but there can be exceptions.
From the sample of the dataset above, you can see that there are two types of problems:
1) There is a problem with the day entered (line 20 for household 604, where the day should have been 21 instead of 11 if we assume that the number of days between the interviews here is 5);
2) There is a problem with the month entered (line 7, where the month should have been 11 instead of 12, or line 17, where the month should have been 3 instead of 8).
I was wondering if there is a fast way to identify the data entry errors and to correct the days and months without changing them manually. In my dataset, I have more than 130,000 observations and 6,000 data entry errors.
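To make the kind of check I have in mind concrete, here is a rough sketch in Python (standard library only, not the software I actually work in). It builds a date from day, mon and year, then flags any visit whose date is not between 1 and 10 days after the household's previous visit. The function name flag_suspect_visits, the parameter max_gap and the 10-day threshold are my own assumptions for illustration. It only identifies suspect gaps; it does not decide which of the two dates is wrong, and it does not correct anything.

```python
from datetime import date

# Sample rows copied from the listing above: (HID, visitno, day, mon, year)
rows = [
    (601, 1, 2, 12, 2012), (601, 2, 12, 12, 2012), (601, 3, 17, 12, 2012),
    (601, 4, 22, 12, 2012), (601, 5, 27, 12, 2012), (601, 6, 1, 12, 2012),
    (602, 1, 30, 12, 2012), (602, 2, 10, 12, 2012), (602, 3, 15, 12, 2012),
    (602, 4, 20, 12, 2012), (602, 5, 25, 12, 2012), (602, 6, 30, 12, 2012),
    (603, 1, 16, 2, 2013), (603, 2, 21, 2, 2013), (603, 3, 26, 2, 2013),
    (603, 4, 3, 3, 2013), (603, 5, 8, 8, 2013), (603, 6, 13, 3, 2013),
    (604, 1, 16, 5, 2013), (604, 2, 11, 5, 2013), (604, 3, 26, 5, 2013),
    (604, 4, 3, 6, 2013), (604, 5, 8, 6, 2013), (604, 6, 13, 6, 2013),
]


def flag_suspect_visits(rows, max_gap=10):
    """Return (HID, visitno, gap_in_days) for every visit whose date is
    not between 1 and max_gap days after the same household's previous visit.

    max_gap=10 is an assumed threshold based on the usual 5- or 10-day
    spacing; legitimate exceptions will be flagged too.
    """
    flagged = []
    previous_date = {}  # HID -> date of that household's previous visit
    # Sort by household, then visit number, so gaps are between consecutive visits.
    for hid, visitno, day, mon, year in sorted(rows, key=lambda r: (r[0], r[1])):
        current = date(year, mon, day)  # raises ValueError for impossible dates
        if hid in previous_date:
            gap = (current - previous_date[hid]).days
            if gap <= 0 or gap > max_gap:
                flagged.append((hid, visitno, gap))
        previous_date[hid] = current
    return flagged


for hid, visitno, gap in flag_suspect_visits(rows):
    print(f"HID {hid}, visit {visitno}: {gap} days since previous visit")
```

One limitation: when the first visit is the wrong one (household 602, line 7), the flag lands on the second visit, because the gap is only measured from visit 2 onwards. So a flagged row means "this date or the one before it is suspect", and the correction itself would still need a rule.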
Thank you
Nicolas