I have multiple custom data sources, mainly scripts, which send events to Splunk on some schedule/recurrence.
I can distinguish every execution of these sources by either a timestamp or a custom ID, which is incremented with every execution and captured in every event. The events always have a proper host field, which also contributes to the "unique key" of an event, together with the unique ID mentioned above. The hosts are attributed with custom fields; these are the third part of what could be used as a unique key. They are present in the events as long as they apply to a given host, and are no longer present when they don't apply.
An example what I mean (every line is a separate event):
(Because of the _time field, these would obviously appear in Splunk in reverse order.)
I want to deduplicate such events so that I always keep the data only from the truly last execution of a script. From the above example, I want to have only
If I were to use
| dedup hostID, attributeID, customid
it would yield:
- host1, attribute2, customid1
- host2, attribute1, customid2
- host1, attribute1, customid2
The solution my team came up with is:
<base search> | eventstats max(customid) as max_customid by hostID | search customid=max_customid
This pretty much does the thing, but I feel it is really not efficient. What would be the right approach to do this?
One given host has multiple events (with multiple attributes) from the same execution of the script.
A more detailed example: let's say I have these events:
I want to keep the below events:
This is the reason I can't use stats first()
Let's baseline. These stats pairs look similar: first/last, earliest/latest, min/max. The last pair is, I think, obvious, but the first pair is not the same as the second pair, which is what many people assume at first. If your events have not been re-sorted, they should (and this is a big "should", because sometimes Splunk fails to do this and doesn't always generate a warning) come back to you sorted newest to oldest, with the newest on top. In such a case, first does the same thing as latest. Let that sink in: first DOES NOT do the same thing as earliest; it does the OPPOSITE. That is because what first actually does is walk backwards through your events from the top (which by default should be the "latest" event) and grab the "first" one that it sees.
OK, so for your case, simply sort your events the way that you desire (you can have multiple layers of sorting by using more than one field argument) and then use first or dedup.
Pro tip: be sure that you use sort 0, not just sort — plain sort truncates the result set at 10,000 events.
How about you just do dedup on host??
Have you tried the "first" function with the stats command: <base search> | eval myKey=attributeID.customID | stats first(myKey) by hostID
Unfortunately that's not what I need; please see my update on the original post above.
<base search> | eval myKey=hostID.attributeID.customID | dedup myKey
It should do what you want. dedup keeps the newest event that matches the combined key.
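One caveat not raised in the thread: concatenating fields with . and no delimiter can produce colliding keys (e.g. hostID=a with attributeID=bc and hostID=ab with attributeID=c both yield "abc"), so inserting a separator character that cannot appear in the field values is safer:

```
<base search>
| eval myKey=hostID."|".attributeID."|".customID
| dedup myKey
```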