Splunk Search

find the duplicate files from particular source in splunk search query

snehalk
Communicator

Hello All,

I need to find from particular source how many we have duplicate files in last 7 days.

I have used this method to indexed duplicate files in Splunk.

here the definition of duplicate file is first line and last line of file is matches to second file first and last line then its called as duplicate.

I can able to achieve the duplicate files if will matches only first line or last line as below

index=main sourcetype=sampledata Header* | eventstats count by _raw | where count>1 | table source, _raw

In this query its giving me the result of files which has the same header. and in below query am getting the result of files where "Trailer" 0r last line of the file is common

index=main sourcetype=sampledata TRL* | eventstats count by _raw | where count>1 | table source, _raw 

So here we need to compare two query and pick the result where we have header and trailer common.

Can any one please help me on this.

Thanks in Adavance

1 Solution

sideview
SplunkTrust
SplunkTrust

This will do it more efficiently, and will even work if there ever are more than one match for Header* or TRL* in a given file (fairly easy to imagine that this could happen sometimes).

index=main sourcetype=sampledata (Header* OR TRL*) | stats earliest(_raw) as first latest(_raw) as last by source | stats dc(source) as fileCount values(source) as files by first last | sort - fileCount | where fileCount>1

Breaking it down - we get the events off disk, and we try and only get Header* OR TRL* to avoid getting intermediate events that are of no use to us. The parens are unnecessary here, but I often like them for clarity.

The stats command will take the earliest line and the latest line for each source. NOTE - if events near the start and/or end end up with the same _time value, you'll have a problem here, and we'll need to some additional matching on the Header* and TRL*. Here I've assumed all events near start and end get a different timestamp and there's no ambiguity.

The next stats command now just counts up the number of source (files) it's seen per row dc(source) as fileCount, the actual values of the paths values(source) as files, and it does it for every unique combination of first and last by first last

the rest is just sorting and filtering to the ones that are actually duplicates.

View solution in original post

sideview
SplunkTrust
SplunkTrust

This will do it more efficiently, and will even work if there ever are more than one match for Header* or TRL* in a given file (fairly easy to imagine that this could happen sometimes).

index=main sourcetype=sampledata (Header* OR TRL*) | stats earliest(_raw) as first latest(_raw) as last by source | stats dc(source) as fileCount values(source) as files by first last | sort - fileCount | where fileCount>1

Breaking it down - we get the events off disk, and we try and only get Header* OR TRL* to avoid getting intermediate events that are of no use to us. The parens are unnecessary here, but I often like them for clarity.

The stats command will take the earliest line and the latest line for each source. NOTE - if events near the start and/or end end up with the same _time value, you'll have a problem here, and we'll need to some additional matching on the Header* and TRL*. Here I've assumed all events near start and end get a different timestamp and there's no ambiguity.

The next stats command now just counts up the number of source (files) it's seen per row dc(source) as fileCount, the actual values of the paths values(source) as files, and it does it for every unique combination of first and last by first last

the rest is just sorting and filtering to the ones that are actually duplicates.

snehalk
Communicator

Thank you, its working for me. and good explanation as well. once again thank you so much!!

0 Karma

woodcock
Esteemed Legend

Like this:

index=main sourcetype=sampledata Header* OR TRL*
| eval MyType=if(searchmatch(Header*), "HDR", "TRL")
| stats first(HDR) AS HDR first(TRL) AS TRL BY source
| stats values(source) count BY HDR TRL
| search count > 1

haley_swarnapat
Path Finder

Try this:

index=main sourcetype=sampledata "TRL*" OR "Header*"
| EVAL mytype=CASE(MATCH(_raw,"TRL"), "TRAILER",MATCH( _raw,"Header"), "HEADER")
| chart count by source, mytype
| SEARCH TRAILER>1 AND HEADER>1

It should show you every source with duplicated "TRL" and "Header"

snehalk
Communicator

Hello Haley,

Its not displaying any result.. but i remove the eval command then events are coming

0 Karma

haley_swarnapat
Path Finder

Oh, it seems that you need to convert your _raw to upper case like this:

index=main sourcetype=sampledata "TRL*" OR "Header*"
| EVAL mytype=CASE(MATCH(upper(_raw),"TRL"), "TRAILER",MATCH( upper(_raw),"HEADER"), "HEADER")
| stats count by source, mytype, _raw | WHERE count>1
| CHART first(_raw) by source, mytype

You should be able to see the duplicated _raw in header and trailer for each source

0 Karma

inventsekar
SplunkTrust
SplunkTrust

Please check this one -

 index=main sourcetype=sampledata Header* | eventstats count by _raw | where count>1 | table source, _raw
| append [index=main sourcetype=sampledata TRL* | eventstats count by _raw | where count>1 | table source, _raw]

Thanks and regards,
Sekar

snehalk
Communicator

Hello Sekar,

Thanks for response, i have updated my queries, please check and let me know on this.

0 Karma

snehalk
Communicator

Hello Sekar,

Here its resulting header matched files, ( eg: its has file which has header same but different trailer)

0 Karma
Get Updates on the Splunk Community!

Introducing the 2024 SplunkTrust!

Hello, Splunk Community! We are beyond thrilled to announce our newest group of SplunkTrust members!  The ...

Introducing the 2024 Splunk MVPs!

We are excited to announce the 2024 cohort of the Splunk MVP program. Splunk MVPs are passionate members of ...

Splunk Custom Visualizations App End of Life

The Splunk Custom Visualizations apps End of Life for SimpleXML will reach end of support on Dec 21, 2024, ...