I have two sources of traffic logs, my_source1 and my_source2, that record approximately the same data with a few important differences.
I need to dedup the data in this way:
source=my_source* | dedup _time, ip, page
But with the following important difference:
If events with the same ip and page occur within 2 seconds of each other, consider them duplicates, but keep only the event from my_source2, even if the my_source1 event occurred earlier.
What's the most efficient way to accomplish that?
Note: the system generates up to 100,000 events per hour.
I would suggest using the transaction command if the data volume is not too high. Its biggest advantage is that it aggregates similar events from the distinct sources into one transaction while providing a "duration" field based on the _time difference between the matched events.
Using eval's mvindex() you can then keep only the first or last event from each transaction.
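A rough, untested sketch of that approach (maxspan=2s approximates the 2-second window; mvlist=true preserves the per-event values of source in order, so mvfind/mvindex can pick out the my_source2 entry, falling back to index 0 when no my_source2 event is in the transaction):

source=my_source* | transaction ip, page maxspan=2s mvlist=true | eval keep_idx=coalesce(mvfind(source, "my_source2"), 0) | eval source=mvindex(source, keep_idx) | table _time, duration, ip, page, source

Note that maxspan limits the total span of the transaction, which matches the requirement here only if duplicates always arrive in pairs; chains of events each within 2 seconds of the next could exceed it.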