Solved: Using dedup to keep the oldest events

acdevlin · ‎07-26-2011

Hi all,

I know that the "dedup" command returns the most recent values in time. However, I'm currently in a situation where I want to use dedup to only keep the oldest events from my data (example below). I found the following thread which is identical to my question, but the proposed solution (sorting on +_time) does not seem to work for me.

What I specifically have are a bunch of client requests to a web server. Each event has an associated req_time and a session_id; many transactions can share the same session_id. What I want to do is call '...|dedup session_id' and have only the OLDEST transaction from each individual session_id be returned, rather than the NEWEST.

Any suggestions on how to accomplish this?

David · ‎07-26-2011

I think you will find the sortby parameter to do this for you.

YourSearch | dedup session_id sortby +_time

Check out the docs for more ways you can tweak dedup:

http://www.splunk.com/base/Documentation/latest/SearchReference/Dedup


fli · ‎04-04-2017

Maybe the correct approach is:

Your_search | reverse | dedup ...
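
For anyone wondering why reordering before dedup changes which event survives, here is a rough Python sketch of the idea. It is only a model, not Splunk's implementation: it assumes dedup keeps the first event it sees for each key value, and that a historical search hands events over newest first. The `dedup` function and the sample events are made up for illustration.

```python
def dedup(events, key):
    """Keep the first event seen for each value of `key`, preserving input order."""
    seen = set()
    kept = []
    for event in events:
        value = event[key]
        if value not in seen:
            seen.add(value)
            kept.append(event)
    return kept


# Hypothetical sample data, ordered newest first (assumed default search order).
events = [
    {"_time": 400, "session_id": "a"},
    {"_time": 300, "session_id": "b"},
    {"_time": 200, "session_id": "a"},
    {"_time": 100, "session_id": "b"},
]

# Plain dedup on newest-first input keeps the newest event per session.
newest = dedup(events, "session_id")

# Reversing first (like "| reverse | dedup session_id") keeps the oldest.
oldest_via_reverse = dedup(list(reversed(events)), "session_id")

# Sorting ascending by time first (like "sortby +_time") gives the same result.
oldest_via_sort = dedup(sorted(events, key=lambda e: e["_time"]), "session_id")
```

In this model the two approaches agree because both put the oldest event for each session_id ahead of the others before duplicates are dropped.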

acdevlin · ‎07-27-2011

Indeed it does! Thanks for the help David, and for confirming that I'm not going crazy.

David · ‎07-27-2011

Fortunately, if you need to grab the newest events after running a concurrency (or otherwise want to wrest control of your search's fate from the hands of concurrency), you can work around this by creating another time field. I was able to run:

MySearch | eval MyTime = _time | concurrency duration=duration output=concurrentevents | dedup MyField sortby -MyTime

That ran without the same issue. Likewise, +MyTime works.

Does that get you where you need to be?

rashi83 · ‎06-04-2019

Hi David,

I am in a similar situation: I need to retrieve results for the latest time instead of old events.
I ran this search:
index=x | eval sorttime=strptime('_time',"%m/%d/%Y %H:%M:%S%p")| sort -sorttime |dedup hostname compName +_time keepempty=true | xyseries hostname compName status

This should retrieve the latest week's results, but instead it's showing the old week's data.

David · ‎07-27-2011

I just tried that, and can definitely confirm what you found. If you toss a concurrency before the dedup, it does return the same results as if you had done a sortby +_time. You should be able to override this by doing a sortby -_time, but that search failed for me ("job ... is a zombie and is no longer with us"). This appears to be a bug, where concurrency is doing some sort of work on _time, and breaking dedup.

acdevlin · ‎07-27-2011

Thanks for the reply, David.

I mentioned that I tried this solution in my earlier question. For some reason, it did not work yesterday and only the oldest events were removed. However, it is working this morning to my pleasant surprise.

Any idea as to why that happened?

EDIT: Answered my own question, but I'm still mystified by it. The query which successfully returned the oldest events included some concurrency information that I had been playing around with.

... | eval timeout=1599 | ... | concurrency duration=timeout | dedup session_id

The above works. I have no idea why.
