Hi, I am working with a data and time variable.
The variable has DMY for some observations, DMY hms , for some other observations, and then it has DMYhms and some”000” characteristics for the remaining observation.
I give the -dataex- below to show you what it looks like.
. dataex ID date_time
----------------------- copy starting from the next line -----------------------
Code:
* Example generated by -dataex-. For more info, type help dataex clear input float ID str29 date_time 1 "07.12.2010" 2 "04.09.2010" 3 "08.10.2010" 4 "16.01.2010" 5 "07.02.2013 09:08:20" 6 "26.02.2013 16:22:20" 9 "06.05.2013 09:28:26" 8 "23.05.2018 12.24.00,443000000" 9 "24.10.2018 11.32.17,751000000" 10 "29.01.2019 15.10.00,453000000" end
Listed 10 out of 10 observations
I have a large dataset of more than 120,000. Therefore it is very difficult to identify which observation has which of the three formats., or if there is another format inside.
- I would appreciate knowing if there is any STATA command that helps to identify the observations are with each of the characteristics.
- The other issue is when I change this string variable to numeric format using the clock function, I get missing values for those observations with “DMY” only.
I used the following command.
gen date_time_N = clock(date_time, "DMY hms#")
format date_time_N %tc
Below I show you the data for the two variables “date_time” and “date_time_N”. In my numeric variable “date_time_N “, all the observations with only “DMY” are missing.
. dataex ID date_time date_time_N
copy starting from the next line ------- ---------------
Code:
* Example generated by -dataex-. For more info, type help dataex clear input float ID str29 date_time float date_time_N 1 "07.12.2010" . 2 "04.09.2010" . 3 "08.10.2010" . 4 "16.01.2010" . 5 "07.02.2013 09:08:20" 1.6758473e+12 6 "26.02.2013 16:22:20" 1.677515e+12 9 "06.05.2013 09:28:26" 1.6834517e+12 8 "23.05.2018 12.24.00,443000000" 1.8426975e+12 9 "24.10.2018 11.32.17,751000000" 1.856e+12 10 "29.01.2019 15.10.00,453000000" 1.864394e+12 end format %tc date_time_N
Listed 10 out of 10 observations
One solution I used is to generate another numeric variable with the following command
gen date_time_N2 = clock(date_time, "DMY")
format date_time_N2 %tc
The second numeric viable “ date_time_N2 “ has non-missing only for the observation with “DMY”. Below I show with -dataex- what I “date_time_N2” looks like.
. dataex ID date_time date_time_N2
copy starting from the next line ------- ---------------
Code:
* Example generated by -dataex-. For more info, type help dataex clear input float ID str29 date_time float date_time_N2 1 "07.12.2010" 1.6072992e+12 2 "04.09.2010" 1.5991776e+12 3 "08.10.2010" 1.6021152e+12 4 "16.01.2010" 1.579219e+12 5 "07.02.2013 09:08:20" . 6 "26.02.2013 16:22:20" . 9 "06.05.2013 09:28:26" . 8 "23.05.2018 12.24.00,443000000" . 9 "24.10.2018 11.32.17,751000000" . 10 "29.01.2019 15.10.00,453000000" . end format %tc date_time_N2
Listed 10 out of 10 observations
Then, I combine the two by creating another viable, which is date_time_all, as follows.
gen date_time_N_all = date_time_N
replace date_time_N_all = date_time_N2 if date_time_N_all==. & date_time_N2!=.
. dataex ID date_time date_time_N2 date_time_N_all
----------------------- copy starting from the next line -----------------------
Code:
* Example generated by -dataex-. For more info, type help dataex clear input float ID str29 date_time float(date_time_N2 date_time_N_all) 1 "07.12.2010" 1.6072992e+12 1.6072992e+12 2 "04.09.2010" 1.5991776e+12 1.5991776e+12 3 "08.10.2010" 1.6021152e+12 1.6021152e+12 4 "16.01.2010" 1.579219e+12 1.579219e+12 5 "07.02.2013 09:08:20" . 1.6758473e+12 6 "26.02.2013 16:22:20" . 1.677515e+12 9 "06.05.2013 09:28:26" . 1.6834517e+12 8 "23.05.2018 12.24.00,443000000" . 1.8426975e+12 9 "24.10.2018 11.32.17,751000000" . 1.856e+12 10 "29.01.2019 15.10.00,453000000" . 1.864394e+12 end format %tc date_time_N2 format %tc date_time_N_all
Listed 10 out of 10 observations
I see the solution I used a bit with a longer step, and I am unsure how precise it will be. I would like to have any other recommendations. Or if there is a danger of using my solution, it would be nice to know. Thank you very much in advance.