Getting Data In

Reingestion of changes

kbaden
Explorer

So I've been unable to understand how Splunk's log ingestion works with a folder monitor input when a file has already been ingested but has been changed since.

A basic example is a security log: Splunk identifies it, ingests and indexes it, and so on.
Then a new entry is added to that security log.

What happens at that point?

Does it reingest the log, duplicating the old data?
Does it not reingest since it has already done so once?
Is it ridiculously smart and just reingests the new data?

Thanks for the help!

Kane


acharlieh
Influencer

Typically, log files are written to in an appending manner. So by default, Splunk keeps track of several aspects of each file it monitors (including observed size, modtime, bytes read, and checksums) in a data structure known as "the fishbucket." In your scenario, Splunk has indexed the file previously; when a new entry is added to the end and Splunk reads the file again, it only sends the new entry to be indexed.
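
If you're curious what the fishbucket currently knows about a particular file, there's an internal btprobe utility you can run from the Splunk CLI. A minimal sketch, assuming a default *nix install under /opt/splunk and a monitored file at /var/log/secure (both paths are just placeholders for your environment); btprobe is unsupported, so treat it as read-only diagnostics:

# Dump the fishbucket record (checksums, seek pointer, modtime) for one file.
# --validate recomputes the CRC from the file on disk and compares it.
/opt/splunk/bin/splunk cmd btprobe -d /opt/splunk/var/lib/splunk/fishbucket/splunk_private_db --file /var/log/secure --validate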

Entries in the fishbucket are keyed by a checksum of the beginning of the file (so that when log files are rolled, you don't wind up with duplication just because a log file now has a different file name). You can run into duplicate indexing if your rolling also compresses files and you haven't set up your monitor stanza to ignore the compressed log files (since a checksum of compressed bytes won't usually match a checksum of the uncompressed bytes).
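
For instance, a monitor stanza in inputs.conf along these lines (the path and extension list are just illustrative) keeps Splunk from reading the compressed copies that rotation produces:

[monitor:///var/log/myapp]
# blacklist is a regex matched against the full file path; this skips the
# compressed copies created by log rotation so they aren't indexed twice.
blacklist = \.(gz|bz2|zip)$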

But there are a lot of settings in inputs.conf and props.conf to control this behavior. In fact, in cases where the file you're monitoring is not a log file and you actually want to reindex the whole file whenever it changes, you can use a props setting to have Splunk check only the modtime or a checksum of the entire file and resend everything.
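
A minimal props.conf sketch of that whole-file case, with a made-up source path; CHECK_METHOD defaults to endpoint_md5 (a hash of the head and tail of the file), and changing it alters when Splunk decides a file needs reindexing:

[source::/opt/data/state_report.csv]
# modtime: reindex the whole file whenever its modification time changes.
# entire_md5 would instead hash the full contents and reindex on any change.
CHECK_METHOD = modtime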


kbaden
Explorer

Amazing.

Appreciate your help mate.

-Kane
