Splunk's mechanism to detect files with the same content

ziegfried
Influencer

The documentation says that Splunk creates a CRC hash of the first and last 256 bytes of a file to detect whether the file's content has already been processed (e.g. log file rotation). Is this true? Recent observations make me believe that only the first 256 bytes and the file size are relevant. How exactly does this similar-file detection work?

What are the options to override or tune this behavior other than crcSalt=<SOURCE>? Is there a way to increase this 256-byte window (e.g. let Splunk use the first 512 bytes to detect similar files)?

EDIT:

Here is an example, to illustrate what I mean:

The first 256 bytes of every file in the directory are the same:

sp@locutus:test_input$ for f in $(ls -1 .); do echo "head -c 256 $f | md5 = $(head -c 256 $f | md5)"; done
head -c 256 timings1_0.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_1.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_2.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_3.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_4.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings1_5.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings2_0.csv | md5 = e665ba09f505913aa5fe05d603fde49a
head -c 256 timings2_1.csv | md5 = e665ba09f505913aa5fe05d603fde49a
...

Last 256 bytes are different:

sp@locutus:test_input$ for f in $(ls -1 .); do echo "tail -c 256 $f | md5 = $(tail -c 256 $f | md5)"; done
tail -c 256 timings1_0.csv | md5 = de07cfe6f9b7209cbfdc3c63b5e45f66
tail -c 256 timings1_1.csv | md5 = b17470e217afcb23017596a569ce759a
tail -c 256 timings1_2.csv | md5 = 3aa94dfeb5014537e33bdd67ab7d16d0
tail -c 256 timings1_3.csv | md5 = 290d8c33f80a79a83bd02d10417ee8af
tail -c 256 timings1_4.csv | md5 = 292a292f17b01a4d4483712b70eddc68
tail -c 256 timings1_5.csv | md5 = 102566f80f0fb29a1ed8d5db5b26cce6
tail -c 256 timings2_0.csv | md5 = 61caa775c378b1c8887f2a442b546758
tail -c 256 timings2_1.csv | md5 = fd097acdbbb32391a4e0d9bccc37bc68
...

Filesize is different as well:

sp@locutus:test_input$ for f in $(ls -1 .); do echo "du -h $f $(du -h $f)"; done
du -h timings1_0.csv 2,3M   timings1_0.csv
du -h timings1_1.csv 8,6M   timings1_1.csv
du -h timings1_2.csv 3,4M   timings1_2.csv
du -h timings1_3.csv 3,1M   timings1_3.csv
du -h timings1_4.csv 2,8M   timings1_4.csv
du -h timings1_5.csv 2,8M   timings1_5.csv
du -h timings2_0.csv 2,3M   timings2_0.csv
du -h timings2_1.csv 7,3M   timings2_1.csv
...

Added the directory to Splunk (these files haven't been indexed on this instance before), into an empty index "test":

sp@locutus:test_input$ splunk add monitor . -index test -sourcetype splunk_dup_test
Your session is invalid.  Please login.
Splunk username: admin
Password: 
Added monitor of '/Users/sp/temp/test_input'.

Waited a fair amount of time (Splunk finished indexing):

splunk search "index=test | stats count by source"

                 source                  count
---------------------------------------- -----
/Users/sp/temp/test_input/timings1_0.csv 11662

(Only 1 file got indexed)

gkanapathy
Splunk Employee

It uses the first 256 bytes and the last 256 bytes by default. There are two other available methods. Adding crcSalt=<SOURCE> simply adds the file path to the hash, so if the file path is invariant, this doesn't actually change things.
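To illustrate the point about crcSalt, here is a minimal shell sketch. It is not Splunk's actual algorithm: `cksum` merely stands in for the CRC, and the function name `salted_head_sum` is hypothetical. It checksums the first 256 bytes of a file, optionally followed by a salt string such as the file path.

```shell
#!/bin/sh
# Illustration only: this is NOT Splunk's real implementation.
# cksum (a POSIX CRC utility) stands in for whatever CRC Splunk computes.

# salted_head_sum FILE [SALT]
# Prints a checksum of the first 256 bytes of FILE, followed by SALT if given.
salted_head_sum() {
    file=$1
    salt=${2-}
    # Concatenate the head of the file and the salt, then checksum the result.
    # cksum prints "<crc> <byte count>"; keep only the CRC field.
    { head -c 256 "$file"; printf '%s' "$salt"; } | cksum | cut -d ' ' -f 1
}
```

Calling `salted_head_sum timings1_0.csv` and `salted_head_sum timings1_1.csv` on files with identical first 256 bytes gives the same value. Passing each file's own path as the salt, for example `salted_head_sum timings1_0.csv "$PWD/timings1_0.csv"`, gives different values, while a file that keeps the same path keeps the same value.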

You can use the CHECK_METHOD parameter in props.conf to select one of the other methods. You would most likely specify this in a [source::] stanza on the forwarder. From props.conf.spec:

CHECK_METHOD = endpoint_md5 | entire_md5 | modtime
* Set to 'endpoint_md5' to have Splunk checksum of the first and last 256 bytes of a file.  When matches are found, Splunk lists the file as already indexed and indexes only new data, or ignores it if there is no new data.
* Set this to "entire_md5" to use the checksum of the entire file.
* Alternatively, set this to "modtime" to check only the modification time of the file.
* Settings other than endpoint_md5 will cause splunk to index the entire file for each detected change.
* Defaults to endpoint_md5.
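
As a sketch of what such a stanza might look like for the test directory in the question: the directory comes from the question, but the `*.csv` source pattern and the choice of `entire_md5` are illustrative assumptions, so adjust both to your own input.

```
# props.conf on the forwarder (illustrative; the source pattern is an assumption)
[source::/Users/sp/temp/test_input/*.csv]
CHECK_METHOD = entire_md5
```

Keep in mind the note in the spec excerpt: with any setting other than endpoint_md5, the entire file is indexed again for each detected change.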

yannK
Splunk Employee

Since 5.0 you can also set the initCrcLength parameter in inputs.conf (default is 256):
http://docs.splunk.com/Documentation/Splunk/latest/Admin/Inputsconf
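
To pick a value for initCrcLength, it helps to know how many leading bytes two files actually share. A minimal shell sketch, assuming a `cmp` that supports `-l` (GNU and BSD both do); the function name `first_diff_offset` is hypothetical:

```shell
#!/bin/sh
# first_diff_offset FILE_A FILE_B
# Prints the 1-based offset of the first byte that differs between the files.
# Prints nothing if no differing byte is found (the files are identical, or
# one file is a prefix of the other).
first_diff_offset() {
    # cmp -l lists every differing byte as "<offset> <octal a> <octal b>";
    # only the offset on the first line is needed.
    cmp -l "$1" "$2" 2>/dev/null | head -n 1 | awk '{print $1}'
}
```

For example, `first_diff_offset timings1_0.csv timings1_1.csv` prints the position where the two files first diverge. An initCrcLength larger than that offset, set on the monitor stanza in inputs.conf, would then cover the first differing byte.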

ziegfried
Influencer

So, if Splunk uses the first and last 256 bytes of the files, why isn't it indexing more than one file from the example (added to the question)? I've also added CHECK_METHOD to props.conf, but it didn't make a difference.
