Hi all,
I know that the "dedup" command keeps the most recent event for each field value. However, I'm currently in a situation where I want to use dedup to keep only the oldest events from my data (example below). I found the following thread, which is identical to my question, but the proposed solution (sorting on +_time) does not seem to work for me.
What I specifically have are a bunch of client requests to a web server. Each event has an associated req_time and a session_id; many transactions can share the same session_id. What I want to do is call '...|dedup session_id' and have only the OLDEST transaction from each individual session_id be returned, rather than the NEWEST.
Any suggestions on how to accomplish this?
I think you will find the sortby parameter to do this for you.
YourSearch | dedup session_id sortby +_time
Check out the docs for more ways you can tweak dedup:
http://www.splunk.com/base/Documentation/latest/SearchReference/Dedup
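If you want the ordering to be explicit, one option is to sort ascending on _time before calling dedup, since dedup keeps the first event it sees for each field value. Here is a minimal sketch using synthetic events from makeresults. The field n and the session values "A" and "B" are made up for illustration and are not from the original search.

```
| makeresults count=4
| streamstats count AS n
| eval session_id=if(n<=2, "A", "B")
| eval _time=_time + n*60
| sort 0 +_time
| dedup session_id
| table _time session_id n
```

This should leave only n=1 (session A) and n=3 (session B), the oldest event in each session. In a real search, you would replace the first four lines with your base search and keep the sort and dedup.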
Maybe the correct approach is:
Your_search | reverse | dedup ...
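A sketch of why reverse can help, assuming the events reach dedup in the default newest-first order: reversing puts the oldest events first, so dedup keeps those. The synthetic events below are built newest-first to mimic that order. The field n and the session values "A" and "B" are invented for the example.

```
| makeresults count=4
| streamstats count AS n
| eval session_id=if(n<=2, "A", "B")
| eval _time=_time - n*60
| reverse
| dedup session_id
| table _time session_id n
```

This should keep n=4 (session B) and n=2 (session A), which are the oldest events in their sessions. Note that this only works if nothing earlier in the search has reordered the events. If something has, an explicit sort on _time is the safer choice.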
Indeed it does! Thanks for the help David, and for confirming that I'm not going crazy.
Fortunately, if you need to grab the newest events after running a concurrency (or either way want to wrest control of your search's fate out from the hands of concurrency), you can work around this by creating another time field. I was able to do:
MySearch | eval MyTime = _time | concurrency duration=duration output=concurrentevents | dedup MyField sortby -MyTime
That ran without the same issue. Likewise, +MyTime works.
Does that get you where you need to be?
Hi David,
I am in a similar situation: I need to retrieve the results for the latest time instead of the old events.
I ran this search:
index=x | eval sorttime=strptime('_time',"%m/%d/%Y %H:%M:%S%p")| sort -sorttime |dedup hostname compName +_time keepempty=true | xyseries hostname compName status
This should retrieve the latest week's results, but instead it is showing the old week's data.
I just tried that, and can definitely confirm what you found. If you toss a concurrency before the dedup, it does return the same results as if you had done a sortby +_time. You should be able to override this by doing a sortby -_time, but that search failed for me ("job ... is a zombie and is no longer with us"). This appears to be a bug, where concurrency is doing some sort of work on _time, and breaking dedup.
Thanks for the reply, David.
I mentioned that I tried this solution in my earlier question. For some reason, it did not work yesterday and only the oldest events were removed. However, it is working this morning to my pleasant surprise.
Any idea as to why that happened?
EDIT: Answered my own question, but I'm still mystified by it. The query which successfully returned the oldest events included some concurrency information that I had been playing around with.
... | eval timeout=1599 | ... | concurrency duration=timeout | dedup session_id
The above works. I have no idea why.