I have some long-run time series data with a time variable, year. However, there are a lot of duplicate observations. My first idea was to drop the older duplicates, on the assumption that the newer estimate is more accurate. That is to say, if the value for 1951 was estimated twice, the second estimate is more accurate, so we drop the old one.
However, this caused some problems. The approach I am considering now is to keep, for each duplicated year, the observation with the smallest change from the previous year. For example, suppose we have only one 1950 value, .4, and two values for 1951, .6 and .7. Since .6 is the smaller change from 1950, we keep the .6 observation and drop the .7 one. This gets more complicated when the previous year also has multiple observations. That is to say, what if 1950 itself has three different estimates? Maybe there is some standard algorithm for this. Does anyone know of any techniques? Thanks.
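To make the rule concrete, here is a toy sketch of the "smallest change" idea (the helper name `dedupe_greedy` is mine, not anything standard; it assumes the data are sorted by year, so duplicate years sit next to each other):

```python
def dedupe_greedy(series):
    """series: list of (year, value) pairs sorted by year,
    possibly with duplicate years. Keeps, among duplicates,
    the value closest to the previous year's kept value."""
    kept = []
    for year, value in series:
        if kept and kept[-1][0] == year:
            # Duplicate year: compare against the previous year's value.
            prev = kept[-2][1] if len(kept) > 1 else None
            if prev is not None and abs(value - prev) < abs(kept[-1][1] - prev):
                kept[-1] = (year, value)
        else:
            kept.append((year, value))
    return kept

# The example from above: one 1950 value, two 1951 candidates.
print(dedupe_greedy([(1950, 0.4), (1951, 0.6), (1951, 0.7)]))
# → [(1950, 0.4), (1951, 0.6)]
```

This handles the easy case, but as noted it breaks down when the previous year is itself ambiguous, since the "previous kept value" is then not well defined.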
Here is some example data
year | value |
1950 | 0.4438 |
1951 | 0.2746 |
1952 | 0.215 |
1953 | 0.9189 |
1954 | 0.7192 |
1955 | 0.7332 |
1956 | 0.6545 |
1957 | 0.2492 |
1957 | 0.3382 |
1958 | 0.6456 |
1958 | 0.1853 |
1950 | 0.4664 |
1951 | 0.3202 |
1952 | 0.2473 |
1953 | 0.9355 |
1954 | 0.4428 |
1955 | 0.0049 |
1956 | 0.9164 |
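One way to handle the general case, where several years (not just one) have multiple estimates, is to treat it as a shortest-path problem: pick one value per year so that the total year-to-year change is as small as possible. A minimal dynamic-programming sketch of that idea (Viterbi-style, not from the original post), run on the example data above:

```python
from collections import defaultdict

# Example data from the post, including the duplicate years.
rows = [
    (1950, 0.4438), (1951, 0.2746), (1952, 0.215),  (1953, 0.9189),
    (1954, 0.7192), (1955, 0.7332), (1956, 0.6545), (1957, 0.2492),
    (1957, 0.3382), (1958, 0.6456), (1958, 0.1853), (1950, 0.4664),
    (1951, 0.3202), (1952, 0.2473), (1953, 0.9355), (1954, 0.4428),
    (1955, 0.0049), (1956, 0.9164),
]

# Group the candidate values by year.
candidates = defaultdict(list)
for year, value in rows:
    candidates[year].append(value)
years = sorted(candidates)

# cost[v] = smallest total absolute year-to-year change over any choice
# of values for the years processed so far, ending at value v.
cost = {v: 0.0 for v in candidates[years[0]]}
back = {}  # back[(year, v)] = value chosen for the previous year
for prev_year, year in zip(years, years[1:]):
    new_cost = {}
    for v in candidates[year]:
        best = min(cost, key=lambda p: cost[p] + abs(v - p))
        new_cost[v] = cost[best] + abs(v - best)
        back[(year, v)] = best
    cost = new_cost

# Trace back the cheapest path to recover one value per year.
chosen = {years[-1]: min(cost, key=cost.get)}
for prev_year, year in reversed(list(zip(years, years[1:]))):
    chosen[prev_year] = back[(year, chosen[year])]

result = [(y, chosen[y]) for y in years]
for y, v in result:
    print(y, v)
```

This reduces to the simple "smallest change" rule when only one year is duplicated, but it also resolves the case where the prior year has several estimates, since it optimizes over all combinations at once. The cost is roughly (number of years) × (duplicates per year)², which is cheap for data like this.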