Reingestion of changes

kbaden
Explorer

So I've been unable to understand how Splunk handles log ingestion via a folder monitor when it comes to a file that has already been ingested but has changed since.

A basic example is a security log: Splunk identifies it, ingests and indexes it, etc.
A new entry is added to that security log.

What happens at that point?

Does it reingest the log, duplicating the old data?
Does it not reingest since it has already done so once?
Is it ridiculously smart and just reingests the new data?

Thanks for the help!

Kane

1 Solution

acharlieh
Influencer

Typically, log files are written to in an appending manner. So by default, Splunk keeps track of details about each file it monitors (including observed size, modtime, bytes read, and checksums) in a data structure known as "the fishbucket." So in your scenario, Splunk has indexed the file previously, a new entry is added to the end, and when Splunk reads the file again it only sends the new entry to be indexed.
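
For reference, a minimal monitor stanza in inputs.conf looks something like this (the path, index, and sourcetype here are placeholder values, not anything from your setup):

[monitor:///var/log/secure]
# placeholder path; Splunk tails this file and tracks its read position in the fishbucket
index = security
sourcetype = linux_secure
disabled = false

With this in place, the tailing processor consults the fishbucket on each pass and only forwards bytes it hasn't already indexed.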

Entries in the fishbucket are keyed by a checksum of the beginning of the file (so that when log files are rolled, you don't wind up with duplication just because a log file now has a different file name). You can run into duplicate indexing if your log rolling also compresses the files and you haven't set up Splunk to ignore the compressed files in your monitor stanza (since the checksum of the compressed bytes won't match the checksum of the uncompressed bytes).
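
For example, if rolled logs get compressed into .gz files in the same directory, a blacklist on the monitor stanza tells Splunk to skip them (the directory below is hypothetical; blacklist is a regex matched against the full file path):

[monitor:///var/log/myapp]
# skip compressed rolled copies so they aren't indexed a second time
blacklist = \.(gz|bz2|zip)$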

But there are a lot of settings in inputs.conf and props.conf to control this behavior. In fact, in cases where the file you're monitoring is not a log file, and you actually want to reindex the whole file whenever it changes, you can configure Splunk via props.conf to check only the modtime or a checksum of the entire file and resend everything.
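
As a sketch, assuming a small reference file at a made-up path that gets rewritten in place, a props.conf stanza like this switches that source from the default endpoint_md5 check to a checksum of the entire file:

[source::/opt/data/reference.csv]
# entire_md5 reindexes the whole file whenever its contents change;
# modtime does the same based on the modification time alone
CHECK_METHOD = entire_md5

Keep in mind that reindexing resends the whole file, so earlier copies of the events stay in the index alongside the new ones.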


kbaden
Explorer

Amazing.

Appreciate your help mate.

-Kane
