Tried using the already answered question on splunk answer on the same topic they say do it using lookup or sub search. like this
Error on using delete on stream stats:-
index=idx1 sourcetype=csv|streamstats count by _raw _time |where cnt>1|delete
Both the below methods are giving me wrong output i have many duplicates but below commands only showing me a few.
Subsearch method:-
index=idx1 sourcetype=csv| eval eid=_cd|search[ |streamstats count as cnt by _raw _time | where cnt>1| field eid ]|delete
Lookup method:-
index=idx1 sourcetype=csv|streamstats count by _raw _time |where cnt>1| eid=_cd | stats count by eid| fields eid| outputcsv del_id.csv
index=id1 sourcetype=csv|eval eid=_cd | search [ del_id.csv ] | delete
This will give you the data you need:
index=foo sourcetype=bar | stats count by _time _raw | where count > 1
For large amounts of data this might take a while.
To actually delete, you cannot use nonstreaming commands - you'd need to write the _bkt
and _cd
fields into a lookup, and then search like this:
index=foo sourcetype=bar [inputlookup that_lookup] | delete
Make sure to test this on unimportant data first!
Streamstats is centralized streaming command that gets executed on the search head... the delete has to run on the indexers though, they're the ones storing and therefore deleting data. As a result, you can only run distributed streaming commands before a delete.
See http://docs.splunk.com/Documentation/Splunk/6.4.3/SearchReference/Commandsbytype for reference.
As for your second comment, the stats
only produces three fields - _time _raw count
. There's no _bkt
and _cd
field present. One approach to copy them over might work something like this (untested):
index=foo sourcetype=bar | stats count list(_cd) as _cd list(_bkt) as _bkt by _time _raw | where count > 1 | eval _cd = mvindex(_cd, 1, -1) | eval _bkt = mvindex(_bkt, 1, -1) | eval zip = mvzip(_cd, _bkt, "##breaker##") | mvexpand zip | eval zip = split(zip, "##breaker##") | eval _cd = mvindex(zip, 0) | eval _bkt = mvindex(zip, 1) | fields - zip count | outputlookup ...
The list()
tell the stats
to keep a list of the two fields identifying an event, and the first two mvindex()
should keep all but the first element in the list. The whole mvzip|mvexpand|split|mvindex
part should put every event into one row.
Again, important reminder: Don't run this without first extensively testing on unimportant data!
We are trying to put the event Ids of duplicate records in a csv file. However when we run the below command it gives no results.
index=idx1 sourcetype=csv|stats count by _raw _time |where cnt>1|eval eid=_cd*|stats count by eid|fields - count|outputlookup filename.csv*
But when we run it without the part in bold, it gives the correct count.
So we are not able to figure out what is wrong in the later part of the query highlighted in bold.
@martin_mueller
Sir I will try your answer .
Can you please help me understand why i am not getting the correct answer using the streamstats and count>1 .As i don't have much understanding how streamstats works can be a good learning.
I'd be careful about that. What is delete
supposed to do after a reporting command such as stats
? Delete both inputs to the stats
? Delete only one? Which one?
ohk, as the question was - "I want to delete duplicates from the splunk index which have same _raw and same _time", and, so, after finding "the data you need" thru your first query, i thought he should use |delete command.
now only the thought came to me - why delete after stats.
thanks a lot.. good learning!
adding the |delete at the end -
index=idx1 sourcetype=csv|stats count by _raw _time |where count > 1|delete