Quantcast
Channel: Statalist
Viewing all articles
Browse latest Browse all 72776

Identifying different characteristics in a string data and time variable and changes it to numeric

$
0
0

Hi, I am working with a data and time variable.







The variable has DMY for some observations, DMY hms , for some other observations, and then it has DMYhms and some”000” characteristics for the remaining observation.

I give the -dataex- below to show you what it looks like.


. dataex ID date_time

----------------------- copy starting from the next line -----------------------
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input float ID str29 date_time
 1 "07.12.2010"                   
 2 "04.09.2010"                   
 3 "08.10.2010"                   
 4 "16.01.2010"                   
 5 "07.02.2013 09:08:20"          
 6 "26.02.2013 16:22:20"          
 9 "06.05.2013 09:28:26"          
 8 "23.05.2018 12.24.00,443000000"
 9 "24.10.2018 11.32.17,751000000"
10 "29.01.2019 15.10.00,453000000"
end
------------------ copy up to and including the previous line ------------------

Listed 10 out of 10 observations





I have a large dataset of more than 120,000. Therefore it is very difficult to identify which observation has which of the three formats., or if there is another format inside.

  1. I would appreciate knowing if there is any STATA command that helps to identify the observations are with each of the characteristics.
  1. The other issue is when I change this string variable to numeric format using the clock function, I get missing values for those observations with “DMY” only.



I used the following command.




gen date_time_N = clock(date_time, "DMY hms#")

format date_time_N %tc




Below I show you the data for the two variables “date_time” and “date_time_N”. In my numeric variable “date_time_N “, all the observations with only “DMY” are missing.


. dataex ID date_time date_time_N

copy starting from the next line ------- ---------------
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input float ID str29 date_time float date_time_N
1 "07.12.2010"                                .
2 "04.09.2010"                                .
3 "08.10.2010"                                .
4 "16.01.2010"                                .
5 "07.02.2013 09:08:20"           1.6758473e+12
6 "26.02.2013 16:22:20"            1.677515e+12
9 "06.05.2013 09:28:26"           1.6834517e+12
8 "23.05.2018 12.24.00,443000000" 1.8426975e+12
9 "24.10.2018 11.32.17,751000000"     1.856e+12
10 "29.01.2019 15.10.00,453000000"  1.864394e+12
end
format %tc date_time_N
copy up to and including the previous line -- ---------------

Listed 10 out of 10 observations



One solution I used is to generate another numeric variable with the following command

gen date_time_N2 = clock(date_time, "DMY")

format date_time_N2 %tc

The second numeric viable “ date_time_N2 “ has non-missing only for the observation with “DMY”. Below I show with -dataex- what I “date_time_N2” looks like.




. dataex ID date_time date_time_N2

copy starting from the next line ------- ---------------
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input float ID str29 date_time float date_time_N2
1 "07.12.2010"                    1.6072992e+12
2 "04.09.2010"                    1.5991776e+12
3 "08.10.2010"                    1.6021152e+12
4 "16.01.2010"                     1.579219e+12
5 "07.02.2013 09:08:20"                       .
6 "26.02.2013 16:22:20"                       .
9 "06.05.2013 09:28:26"                       .
8 "23.05.2018 12.24.00,443000000"             .
9 "24.10.2018 11.32.17,751000000"             .
10 "29.01.2019 15.10.00,453000000"             .
end
format %tc date_time_N2
copy up to and including the previous line -- ---------------

Listed 10 out of 10 observations



Then, I combine the two by creating another viable, which is date_time_all, as follows.




gen date_time_N_all = date_time_N




replace date_time_N_all = date_time_N2 if date_time_N_all==. & date_time_N2!=.



. dataex ID date_time date_time_N2 date_time_N_all

----------------------- copy starting from the next line -----------------------
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input float ID str29 date_time float(date_time_N2 date_time_N_all)
 1 "07.12.2010"                    1.6072992e+12 1.6072992e+12
 2 "04.09.2010"                    1.5991776e+12 1.5991776e+12
 3 "08.10.2010"                    1.6021152e+12 1.6021152e+12
 4 "16.01.2010"                     1.579219e+12  1.579219e+12
 5 "07.02.2013 09:08:20"                       . 1.6758473e+12
 6 "26.02.2013 16:22:20"                       .  1.677515e+12
 9 "06.05.2013 09:28:26"                       . 1.6834517e+12
 8 "23.05.2018 12.24.00,443000000"             . 1.8426975e+12
 9 "24.10.2018 11.32.17,751000000"             .     1.856e+12
10 "29.01.2019 15.10.00,453000000"             .  1.864394e+12
end
format %tc date_time_N2
format %tc date_time_N_all
------------------ copy up to and including the previous line ------------------

Listed 10 out of 10 observations






I see the solution I used a bit with a longer step, and I am unsure how precise it will be. I would like to have any other recommendations. Or if there is a danger of using my solution, it would be nice to know. Thank you very much in advance.



Viewing all articles
Browse latest Browse all 72776

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>