Solved: find the duplicate files from particular source in...

snehalk · ‎09-22-2016

Hello All,

I need to find from particular source how many we have duplicate files in last 7 days.

I have used this method to indexed duplicate files in Splunk.

here the definition of duplicate file is first line and last line of file is matches to second file first and last line then its called as duplicate.

I can able to achieve the duplicate files if will matches only first line or last line as below

index=main sourcetype=sampledata Header* | eventstats count by _raw | where count>1 | table source, _raw

In this query its giving me the result of files which has the same header. and in below query am getting the result of files where "Trailer" 0r last line of the file is common

index=main sourcetype=sampledata TRL* | eventstats count by _raw | where count>1 | table source, _raw

So here we need to compare two query and pick the result where we have header and trailer common.

Can any one please help me on this.

Thanks in Adavance

sideview · ‎09-22-2016

This will do it more efficiently, and will even work if there ever are more than one match for Header* or TRL* in a given file (fairly easy to imagine that this could happen sometimes).

index=main sourcetype=sampledata (Header* OR TRL*) | stats earliest(_raw) as first latest(_raw) as last by source | stats dc(source) as fileCount values(source) as files by first last | sort - fileCount | where fileCount>1

Breaking it down - we get the events off disk, and we try and only get Header* OR TRL* to avoid getting intermediate events that are of no use to us. The parens are unnecessary here, but I often like them for clarity.

The stats command will take the earliest line and the latest line for each source. NOTE - if events near the start and/or end end up with the same _time value, you'll have a problem here, and we'll need to some additional matching on the Header* and TRL*. Here I've assumed all events near start and end get a different timestamp and there's no ambiguity.

The next stats command now just counts up the number of source (files) it's seen per row dc(source) as fileCount, the actual values of the paths values(source) as files, and it does it for every unique combination of first and last by first last

the rest is just sorting and filtering to the ones that are actually duplicates.

View solution in original post

sideview · ‎09-22-2016

This will do it more efficiently, and will even work if there ever are more than one match for Header* or TRL* in a given file (fairly easy to imagine that this could happen sometimes).

index=main sourcetype=sampledata (Header* OR TRL*) | stats earliest(_raw) as first latest(_raw) as last by source | stats dc(source) as fileCount values(source) as files by first last | sort - fileCount | where fileCount>1

Breaking it down - we get the events off disk, and we try and only get Header* OR TRL* to avoid getting intermediate events that are of no use to us. The parens are unnecessary here, but I often like them for clarity.

The stats command will take the earliest line and the latest line for each source. NOTE - if events near the start and/or end end up with the same _time value, you'll have a problem here, and we'll need to some additional matching on the Header* and TRL*. Here I've assumed all events near start and end get a different timestamp and there's no ambiguity.

The next stats command now just counts up the number of source (files) it's seen per row dc(source) as fileCount, the actual values of the paths values(source) as files, and it does it for every unique combination of first and last by first last

the rest is just sorting and filtering to the ones that are actually duplicates.

snehalk · ‎09-23-2016

Thank you, its working for me. and good explanation as well. once again thank you so much!!

woodcock · ‎09-22-2016

Like this:

index=main sourcetype=sampledata Header* OR TRL*
| eval MyType=if(searchmatch(Header*), "HDR", "TRL")
| stats first(HDR) AS HDR first(TRL) AS TRL BY source
| stats values(source) count BY HDR TRL
| search count > 1

haley_swarnapat · ‎09-22-2016

Try this:

index=main sourcetype=sampledata "TRL*" OR "Header*"
| EVAL mytype=CASE(MATCH(_raw,"TRL"), "TRAILER",MATCH( _raw,"Header"), "HEADER")
| chart count by source, mytype
| SEARCH TRAILER>1 AND HEADER>1

It should show you every source with duplicated "TRL" and "Header"

snehalk · ‎09-22-2016

Hello Haley,

Its not displaying any result.. but i remove the eval command then events are coming

haley_swarnapat · ‎09-22-2016

Oh, it seems that you need to convert your _raw to upper case like this:

index=main sourcetype=sampledata "TRL*" OR "Header*"
| EVAL mytype=CASE(MATCH(upper(_raw),"TRL"), "TRAILER",MATCH( upper(_raw),"HEADER"), "HEADER")
| stats count by source, mytype, _raw | WHERE count>1
| CHART first(_raw) by source, mytype

You should be able to see the duplicated _raw in header and trailer for each source

inventsekar · ‎09-22-2016

Please check this one -

 index=main sourcetype=sampledata Header* | eventstats count by _raw | where count>1 | table source, _raw
| append [index=main sourcetype=sampledata TRL* | eventstats count by _raw | where count>1 | table source, _raw]

Thanks and regards,
Sekar

snehalk · ‎09-22-2016

Hello Sekar,

Thanks for response, i have updated my queries, please check and let me know on this.

snehalk · ‎09-22-2016

Hello Sekar,

Here its resulting header matched files, ( eg: its has file which has header same but different trailer)

find the duplicate files from particular source in splunk search query

Introducing the 2024 SplunkTrust!

Introducing the 2024 Splunk MVPs!

Splunk Custom Visualizations App End of Life