Splunk v4.3.1 on SLES 11.1
The standard whitelist for the /var/log data source will produce duplicate indexing, because by default SLES rotates the messages file out to another file, "messages-YYYYMMDD", and bzips that file on the second rotation (i.e. delaycompress in logrotate).
So in my case I think changing the whitelist to use ^messages$ would be better, and possibly changing some of the others, like .log and log$.
Splunk should not produce duplicate results because of file rotation. In the situation you reference, the CRC will match and the rotated file will be ignored. You shouldn't need to edit the whitelist in the situation you've mentioned.
Details can be found here:
The monitoring processor picks up new files and reads the first and last 256 bytes of the file. This data is hashed into a begin and end cyclic redundancy check (CRC). Splunk checks new CRCs against a database that contains all the CRCs of files Splunk has seen before. The location Splunk last read in the file, known as the file's seekPtr, is also stored.

There are three possible outcomes of a CRC check:

1. There is no begin and end CRC matching this file in the database. This indicates a new file. Splunk will pick it up and consume its data from the start of the file. Splunk updates the database with the new CRCs and seekPtrs as the file is being consumed.

2. The begin CRC and the end CRC are both present, but the size of the file is larger than the seekPtr Splunk stored. This means that, while Splunk has seen the file before, there has been data added to it since it was last read. Splunk opens the file, seeks to the previous end of the file, and starts reading from there. In this way, Splunk will only grab the new data and not anything it has read before.

3. The begin CRC is present, but the end CRC does not match. This means that Splunk has previously read the file but that some of the material that it read has since changed. In this case, Splunk must re-read the whole file.

Important: Since the CRC start check is run against only the first 256 bytes of the file, it is possible for non-duplicate files to have duplicate start CRCs, particularly if the files are ones with identical headers. To handle such situations, you can use the crcSalt attribute when configuring the file in inputs.conf, as described here. The crcSalt attribute ensures that each file has a unique CRC. You do not want to use this attribute with rolling log files, however, because it defeats Splunk's ability to recognize rolling logs and will cause Splunk to re-index the data.
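To make the three outcomes concrete, here is a rough Python sketch of the check as described above. It is not Splunk's actual implementation. The names `check_file`, `mark_consumed`, and `Record` are made up for illustration, `zlib.crc32` stands in for whatever hash Splunk really uses, and taking the end CRC over the 256 bytes just before the stored seekPtr is my assumption about how a grown file can still match. The extra "unchanged" result is also my addition; it covers the rotated-file case, where the content is identical and nothing needs to be read.

```python
import zlib
from dataclasses import dataclass

WINDOW = 256  # bytes hashed at each end of the file, per the description above


@dataclass
class Record:
    end_crc: int   # CRC of the last WINDOW bytes that were read
    seek_ptr: int  # offset where reading stopped


def check_file(data: bytes, db: dict) -> tuple:
    """Classify file contents against db; return (outcome, offset to read from)."""
    rec = db.get(zlib.crc32(data[:WINDOW]))
    if rec is None:
        return "new", 0                      # outcome 1: never seen, read it all
    # Assumption: the end CRC covers the WINDOW bytes just before seek_ptr.
    tail = data[max(0, rec.seek_ptr - WINDOW):rec.seek_ptr]
    if len(data) < rec.seek_ptr or zlib.crc32(tail) != rec.end_crc:
        return "changed", 0                  # outcome 3: re-read the whole file
    if len(data) > rec.seek_ptr:
        return "appended", rec.seek_ptr      # outcome 2: read only the new data
    return "unchanged", rec.seek_ptr         # same content, e.g. a renamed file


def mark_consumed(data: bytes, db: dict) -> None:
    """Record that data has been read through to its end."""
    db[zlib.crc32(data[:WINDOW])] = Record(zlib.crc32(data[-WINDOW:]), len(data))
```

In this sketch, once the contents of messages have been passed to `mark_consumed`, a file with the same bytes under a new name such as messages-YYYYMMDD comes back as "unchanged", which is why the rename by itself should not cause re-indexing. A compressed copy has different bytes, so it would come back as "new".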
I think it would be better to blacklist bzip and gzip files (`\.bz2$|\.gz$|\.gzip$` or similar). If you only whitelist the first file, it's possible to miss a message that was sent to the file handle after the file was rotated/renamed the first time. You could also whitelist the date-formatted file, but because the rotation names might be more varied, I think it's easier to blacklist the files that delaycompress has compressed.
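If you want to sanity-check a blacklist pattern before putting it in inputs.conf, a quick Python test like the one below can help. It only exercises the regex itself. Python's `re` is a stand-in for Splunk's regex engine, the file names are hypothetical examples, and the pattern assumes escaped dots and a `.bz2` suffix (the usual bzip2 extension).

```python
import re

# Assumed pattern: dots escaped, ".bz2" as the bzip2 suffix.
BLACKLIST = re.compile(r"\.(bz2|gz|gzip)$")


def monitored(paths):
    """Return the paths that the blacklist pattern does not exclude."""
    return [p for p in paths if not BLACKLIST.search(p)]


# Hypothetical file names following the rotation scheme described above.
candidates = [
    "/var/log/messages",
    "/var/log/messages-20120301",
    "/var/log/messages-20120229.bz2",
    "/var/log/mail.log.1.gz",
]
print(monitored(candidates))
```

With this pattern the live file and the freshly rotated, still uncompressed file stay monitored, and only the compressed copies are excluded.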