I have two sources of traffic logs, my_source1 and my_source2, that record approximately the same data with a few important differences.
I need to dedup the data in this way:
source=my_source* | dedup _time, ip, page
But with the following important difference:
If events with the same ip and page occur within 2 seconds of each other, consider them duplicates, but keep only the event from my_source2, even if the my_source1 event occurred earlier.
What's the most efficient way to accomplish that?
Note: the system generates up to 100,000 events per hour.
I would suggest using the transaction command if the data volume is not too high. Its biggest advantage is that it aggregates similar events from the distinct sources into one transaction while providing a "duration" field based on the _time difference between the matched events.
Using eval's mvindex() you can then keep only the first or last event from each transaction.
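A rough, untested sketch of that approach (maxspan=2s approximates the 2-second window; mvlist=true preserves the per-event values of source in order, so mvfind/mvindex can pick out the my_source2 entry, falling back to index 0 when no my_source2 event is in the transaction):

source=my_source* | transaction ip, page maxspan=2s mvlist=true | eval keep_idx=coalesce(mvfind(source, "my_source2"), 0) | eval source=mvindex(source, keep_idx) | table _time, duration, ip, page, source

Note that maxspan limits the total span of the transaction, which matches the requirement here only if duplicates always arrive in pairs; chains of events each within 2 seconds of the next could exceed it.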