How can I avoid duplicates when indexing logfiles on a cluster filesystem?

cfrantsen
Explorer

I have a couple of clusters whose logfiles reside on a shared cluster filesystem that all hosts in the cluster log to. The clustered applications can execute on any node but will always log to the same directory and file.

How can I prevent multiple cluster hosts running Splunk from indexing the same logfile? The filenames and paths of the logfiles do not contain any useful information, such as a hostname, that could help me narrow the input stanza.

1 Solution

gkanapathy
Splunk Employee

I suppose if it's a shared filesystem, you could have a single Splunk instance monitor the entire log filesystem and forward everything. If looking at the filesystem gives no way to tell which node a file came from, I'm not sure what else there is to do. If you want to spread the forwarding load, you could partition the set of files among different Splunk instances, e.g., one node reads with a whitelist of /var/log/[a-m]* and another reads /var/log/[n-z]*. The nodes wouldn't necessarily be reading the files they wrote, but I don't know if that matters.
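
A minimal sketch of that partitioning as inputs.conf monitor stanzas, assuming the logs sit directly under /var/log as in the example above. The whitelist setting takes a regular expression matched against the full file path rather than a shell glob, so the [a-m]* pattern is rewritten as a regex here. The exact expressions are illustrative and should be checked against your own filenames.

```ini
# inputs.conf on forwarder A (illustrative path; adjust to your shared filesystem)
[monitor:///var/log]
# whitelist is a regex on the full path: filenames starting with a-m
whitelist = ^/var/log/[a-m][^/]*$

# inputs.conf on forwarder B
[monitor:///var/log]
# filenames starting with n-z
whitelist = ^/var/log/[n-z][^/]*$
```

With these particular patterns, files whose names start with an uppercase letter or a digit match neither stanza, and files in subdirectories are skipped, so the ranges would need widening if such files exist.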

maverick
Splunk Employee

Another way would be to forward all events to be indexed at all times, whether or not you get duplicates, and then use the dedup command on the entire raw event (| dedup _raw) to filter out duplicates on the fly at search time.

Obviously, this is not as elegant, but at least you never risk missing events under unexpected conditions, assuming that is the more important requirement.
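
As a rough sketch, a search using this approach might look like the following. The index and sourcetype names are placeholders, not something from this thread. Copies of an event indexed from different nodes share the same _raw text, so dedup keeps only one of them.

```spl
index=your_index sourcetype=your_sourcetype
| dedup _raw
```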

cfrantsen
Explorer

Speed and space are probably not an issue, as the log volume would be quite small. What kind of search could I use to find only the duplicates (something like the opposite of dedup)?
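
One possible sketch of such a search, assuming duplicates are exact copies of the same raw event with the same timestamp: group events by _time and _raw and keep only the groups that occur more than once. The index and sourcetype names and the copies and hosts field names are placeholders chosen for illustration.

```spl
index=your_index sourcetype=your_sourcetype
| stats count AS copies, values(host) AS hosts BY _time, _raw
| where copies > 1
```

This only lists the duplicated events and the hosts that sent them; it does not remove anything from the index.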

dwaddle
SplunkTrust

I do not think you would be happy with 'delete': it is not very fast, and the space in the index would not be reclaimed.

cfrantsen
Explorer

I was thinking: would it be possible to let all nodes index the logfiles in question and then delete all duplicates on the central Splunk server using a scheduled search of some kind? I haven't been able to find a search command that lists all duplicate events so that I could feed them to "| delete". Does something like this exist?

cfrantsen
Explorer

Sounds like a workable solution. I would have to run n+1 Splunk forwarders, as I still need to index local files and run scripts on each node, but that's a minor problem.

dwaddle
SplunkTrust

Put the forwarder's Splunk installation on the shared filesystem and run it as a cluster resource as well. The forwarder's internal index keeps track of which logfiles have been forwarded and how far into each one it has read -- allowing it to resume intelligently after a cluster event.

cfrantsen
Explorer

The problem with having only one node monitor everything is that we would need to reconfigure the forwarder on another node if the first one is down for whatever reason (maintenance, crash, etc.). These are all high-availability clusters designed to be able to lose one or more nodes without requiring reconfiguration.

It would have been great if one Splunk instance could let other instances know that certain files are already being indexed (perhaps using some sort of lockfile).
