Channel: Statalist

How can I clean the date variable in my data?

Hi everyone,

I am trying to clean a dataset with the following variables:

      HID  visitno  day  mon  year
  1.  601        1    2   12  2012
  2.  601        2   12   12  2012
  3.  601        3   17   12  2012
  4.  601        4   22   12  2012
  5.  601        5   27   12  2012
  6.  601        6    1   12  2012
  7.  602        1   30   12  2012
  8.  602        2   10   12  2012
  9.  602        3   15   12  2012
 10.  602        4   20   12  2012
 11.  602        5   25   12  2012
 12.  602        6   30   12  2012
 13.  603        1   16    2  2013
 14.  603        2   21    2  2013
 15.  603        3   26    2  2013
 16.  603        4    3    3  2013
 17.  603        5    8    8  2013
 18.  603        6   13    3  2013
 19.  604        1   16    5  2013
 20.  604        2   11    5  2013
 21.  604        3   26    5  2013
 22.  604        4    3    6  2013
 23.  604        5    8    6  2013
 24.  604        6   13    6  2013

Interviewers visit each household on 6 occasions to conduct a survey. HID is the household identifier, visitno is the visit number, and the interview date is split into three variables (day, mon and year). For example, on line 10, household 602's 4th interview (visitno=4) was conducted on 20 December 2012. In general, the time between two consecutive interviews is 5 or 10 days, but there can be exceptions.

From the sample above, you can see two types of problems:
1) The day was entered incorrectly (line 20 for household 604, where the day should have been 21 instead of 11, assuming the interviews here are 5 days apart);
2) The month was entered incorrectly (line 7, where the month should have been 11 instead of 12, or line 17, where the month should have been 3 instead of 8).
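
Both error types show up as an unusual gap between consecutive visits of the same household, so one way to find them is to build a date from day, mon and year and compare each visit with the previous one. The sketch below is written in Python rather than Stata, purely to illustrate the logic. The names `EXPECTED_GAPS` and `suspicious_gaps` are hypothetical, and the 5-or-10-day rule is an assumption taken from the description above, not a guarantee about the full dataset.

```python
from datetime import date

# Sample rows copied from the listing above: (HID, visitno, day, mon, year).
rows = [
    (602, 1, 30, 12, 2012),
    (602, 2, 10, 12, 2012),
    (602, 3, 15, 12, 2012),
    (602, 4, 20, 12, 2012),
    (602, 5, 25, 12, 2012),
    (602, 6, 30, 12, 2012),
    (604, 1, 16, 5, 2013),
    (604, 2, 11, 5, 2013),
    (604, 3, 26, 5, 2013),
    (604, 4, 3, 6, 2013),
    (604, 5, 8, 6, 2013),
    (604, 6, 13, 6, 2013),
]

# Assumption: the usual spacing between visits; genuine exceptions exist.
EXPECTED_GAPS = {5, 10}


def suspicious_gaps(rows):
    """Return (HID, visitno, gap_in_days) for every visit whose gap from the
    previous visit of the same household is not one of the expected gaps."""
    flagged = []
    previous = {}  # HID -> date of that household's previous visit
    for hid, visitno, day, mon, year in sorted(rows, key=lambda r: (r[0], r[1])):
        current = date(year, mon, day)  # assumes a valid calendar date
        if hid in previous:
            gap = (current - previous[hid]).days
            if gap not in EXPECTED_GAPS:
                flagged.append((hid, visitno, gap))
        previous[hid] = current
    return flagged


for hid, visitno, gap in suspicious_gaps(rows):
    print(hid, visitno, gap)
```

On these sample rows the sketch flags visit 2 of household 602 (gap of -20 days) and visits 2, 3 and 4 of household 604 (gaps of -5, 15 and 8 days). A flagged gap only says that one of the two dates involved is suspect: the error on line 7 surfaces at visit 2, and the 8-day gap at visit 4 of household 604 may simply be one of the legitimate exceptions.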

I was wondering if there is a fast way to identify the data entry errors and to correct the days and months without changing them manually. In my dataset, I have more than 130,000 observations and 6,000 data entry errors.
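
For the correction step, one hedged possibility is to take a suspect row and try changing only its month or only its day until it sits 5 or 10 days from its neighbouring visits. The Python sketch below illustrates that idea. `make_date` and `candidate_fixes` are hypothetical helper names, the neighbouring dates are assumed to be correct, and the year is never changed, so errors across a year boundary are not covered.

```python
from datetime import date

# Assumption: the usual spacing between visits; genuine exceptions exist.
EXPECTED_GAPS = {5, 10}


def make_date(day, mon, year):
    """Return a date, or None if the combination is not a real calendar date."""
    try:
        return date(year, mon, day)
    except ValueError:
        return None


def candidate_fixes(suspect, before=None, after=None):
    """List single-field corrections (month only, then day only) of `suspect`
    that place it 5 or 10 days from each neighbouring visit that is given.

    Every argument is a (day, mon, year) tuple. `before` or `after` may be
    None for the first or last visit; given neighbours must be valid dates."""
    day, mon, year = suspect
    trials = [(day, m, year) for m in range(1, 13) if m != mon]
    trials += [(d, mon, year) for d in range(1, 32) if d != day]

    fixes = []
    for trial in trials:
        current = make_date(*trial)
        if current is None:
            continue  # e.g. 30 February
        ok_before = before is None or (current - make_date(*before)).days in EXPECTED_GAPS
        ok_after = after is None or (make_date(*after) - current).days in EXPECTED_GAPS
        if ok_before and ok_after:
            fixes.append(trial)
    return fixes


# Line 20: household 604, visit 2, between visits 1 and 3.
print(candidate_fixes((11, 5, 2013), before=(16, 5, 2013), after=(26, 5, 2013)))
# Line 17: household 603, visit 5, between visits 4 and 6.
print(candidate_fixes((8, 8, 2013), before=(3, 3, 2013), after=(13, 3, 2013)))
# Line 7: household 602, visit 1, which only has a following visit.
print(candidate_fixes((30, 12, 2012), after=(10, 12, 2012)))
```

For lines 20 and 17 this yields a single candidate each, (21, 5, 2013) and (8, 3, 2013). For line 7 it yields two, (30, 11, 2012) and (5, 12, 2012), so a first visit with only one neighbour cannot be corrected automatically without an extra rule or a manual check.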

Thank you

Nicolas
