Getting Data In

Dealing with a few million little log files

ecnausysadm
Explorer

We're looking to use Splunk to index our application logs, but the application creates a separate log file for each process it spawns; for this month so far we have approx 720,000 separate log files totalling 32GB.
Ultimately we'd like to roll this out across multiple instances of the application, which means over 2 million application log files a month. We are going to filter the incoming data to reduce the amount indexed, but we still need to read the full set of sources to get the bits we want.

To add to this we have system logs (e.g. apache, sendmail/postfix, ftp daemon) that all go into a syslog-ng instance. These are organised by server, then year, then month, with a file for each day. I have heard that we are better off putting all of these into one big file and rolling it daily, rather than our current file-per-day method.

I notice we can use ignoreOlderThan, which will be useful once we have the initial set of data indexed, but I was wondering if anyone has similar experience and can share ideas on how best to handle this sort of input optimally?

Cheers,
Mark

amrit
Splunk Employee

Mark, an exact recommendation depends on a few more details about your setup. It's recommended to avoid monitoring many hundreds of thousands of files in a single Splunk instance, as the current implementation can be heavy on memory usage - although if you're on a very beefy server, it's possible you won't care much about that. Scaling up to two million actively monitored files in a single instance is an untested scenario, so hopefully your data is arranged in such a way that we can ingest it in batches.

Before we get to that, you're correct that consolidating your syslog-ng files is a smart move. In general, monitoring fewer files is a good thing. However, it's important to remember that you should not lump different types of log data into a single file. Keep apache, sendmail, etc. in separate files, but you can certainly combine the streams coming from different hosts running the same application type. This will allow you to continue using sourcetypes effectively.
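
For illustration, here's a minimal sketch of what the Splunk side might look like once the streams are consolidated into one file per application type - the paths, sourcetype names, and inputs.conf location are assumptions for your environment, not exact values:

# Hypothetical sketch: each consolidated file keeps its own sourcetype.
cat >> "$SPLUNK_HOME/etc/system/local/inputs.conf" <<'EOF'
[monitor:///var/log/central/apache.log]
sourcetype = access_combined

[monitor:///var/log/central/mail.log]
sourcetype = sendmail_syslog
EOF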

For the main issue with millions of log files, some questions come to mind:

Growth is expected to be upwards of 2 million files per month - are we only going to be monitoring new files, or do you also want to index the 2M files from April, the 2M from March, and so on?

What is the topology here? What can we expect in terms of data transport from the original log source to the Splunk indexer? Are these 2M files distributed amongst various server instances, with a Universal Forwarder on each server? Or are they instead collected centrally, with the logs to be indexed over NFS? If it's the latter, we may want to use the [DESTRUCTIVE] "sinkhole" input method and copy the logs over in batches.

Is there a directory hierarchy for the 2M files that will allow us to efficiently blacklist known old data? "ignoreOlderThan" certainly helps speed up file tracking, but there is still the startup-time cost of gathering each file's metadata, and each blacklisted file is still tracked, meaning memory is still consumed. Blacklisting an entire subdirectory is much more efficient, as we simply avoid recursing into the directory.
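
To make that last point concrete, here is a rough sketch of a monitor stanza combining "ignoreOlderThan" with a directory blacklist - the 14-day cutoff and the archive path are assumptions based on your description:

# Hypothetical sketch: monitor the live logs, skip anything old,
# and never recurse into the archive tree at all.
cat >> "$SPLUNK_HOME/etc/system/local/inputs.conf" <<'EOF'
[monitor:///path/to/app/instance1/logs]
ignoreOlderThan = 14d
blacklist = /archive/
EOF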

The above details are important for any large-ish deployment - once we have a better picture of your scenario, we can provide a more concrete list of steps to get your data flowing. And if you're experimenting on your own, you may want to have a look at this script to get an idea of what the Tailing Processor is doing at any given moment: http://blogs.splunk.com/2011/01/02/did-i-miss-christmas-2/

Amrit

amrit
Splunk Employee

At this point the monthly bulk "mv $CURRENTLOGS $OLDLOGS/something" job can be eliminated, as the new cron job would simply move $LOGFILE to the appropriate month's directory based on its metadata.
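
As a rough sketch, the per-file body of that cron job (the find loop itself is in the next comment, with $OLDLOGS as defined below) could bucket each file by its modification month - the directory naming is just an example:

# Hypothetical per-file step: derive the month from the file's mtime
# and file it under the matching archive subdirectory.
month=$(date -r "$f" +%Y-%m)   # e.g. 2011-05
mkdir -p "$OLDLOGS/$month"
mv "$f" "$OLDLOGS/$month/"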

Let me know whether you agree with the concerns stated above and whether a briefly delayed solution works for you. If not, we can always discuss other ideas, but I believe the above is the simplest solution.

// And now, to go file a bug on max comment length here...

amrit
Splunk Employee

One way to make this work very well is if you can live with a 1-2 minute indexing delay (think: a realtime data stream running a minute or two behind). If so, you can use the [DESTRUCTIVE] "sinkhole" input that exists on all Splunk instances. The idea is to set up a cron job that uses "find" to locate all logs older than a couple of minutes, copy each one into the sinkhole directory, and move the original to the archive directory. As soon as a sinkholed file is read, it is deleted. This keeps the number of monitored files (in the sinkhole) low, reducing memory usage.
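
A minimal cron-able sketch of that idea, assuming the CURRENTLOGS/OLDLOGS paths from the comment below and Splunk's default spool directory as the sinkhole - adjust all three for your install:

#!/bin/sh
CURRENTLOGS=/path/to/app/instance1/logs/live
OLDLOGS=/path/to/app/instance1/logs/archive
SINKHOLE=/opt/splunk/var/spool/splunk    # default sinkhole location

# run every minute from cron: handle files untouched for 2+ minutes
find "$CURRENTLOGS" -type f -mmin +2 | while IFS= read -r f; do
  cp "$f" "$SINKHOLE/"   # Splunk indexes this copy, then deletes it
  mv "$f" "$OLDLOGS/"    # the original is preserved in the archive
done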

amrit
Splunk Employee

network hiccups causing delays in sending files, indexers backing up due to a large burst of data, just plain having a network & indexer setup that is regularly outpaced by log file growth, etc.

So, monitoring the CURRENTLOGS directory in your scenario may not turn out to be entirely reliable. More typical logging strategies create fewer, larger files that are rotated individually instead of as an entire directory, making this problem a bit easier to avoid.

amrit
Splunk Employee

If there are many processes running every second and each process is creating one or more files per run, your file rotation strategy will create a race that could cause some data to be missed. When it comes time to "mv $CURRENTLOGS $OLDLOGS/something" and "mkdir $CURRENTLOGS", there's no guarantee that Splunk has indexed all of the logs being moved - for example, 100 processes could spin up in the seconds before the "mv" and the files they created may be moved out of the way before Splunk has a chance to see them. There are many reasons for the forwarder to fall behind like this:

amrit
Splunk Employee

Sorry for the delay in responding. We should be able to manage the number of tracked files (and thus file-scanning performance and memory usage), although a straightforward setup with this rate of file creation has some caveats, so we may resort to a couple of tricks. I'll explain my concerns below, and you can consider their validity.

Let's say the path setup is:
CURRENTLOGS=/path/to/app/instance1/logs/live
OLDLOGS=/path/to/app/instance1/logs/archive

ecnausysadm
Explorer

There is only a single instance that these small files will be forwarded from, using a Universal Forwarder; we are also using a forwarder for the syslog-ng data. It'll all be over gigabit Ethernet.

Sorry for the split posts; there is a character limit in the comments.

Thanks,
Mark

ecnausysadm
Explorer

There are subdirectories for each application instance, but all the files are placed in the one directory under that instance path, so we'd end up with this structure:
- /path/to/app/instance1/logs/
- /path/to/app/instance2/logs/
I'm not sure what we can do in regards to moving log files around to different directories, as the application isn't too flexible in that regard, which is one reason we are looking at Splunk.

ecnausysadm
Explorer

Hi Amrit,
We only want to index new files, which is generally anything < 2 weeks old. Once indexed we don't really worry about them other than for historical access, and from the application's point of view they do not change. The application logs are also archived off at the start of each new month - basically it's a "mv $CURRENTLOGS $OLDLOGS && mkdir $CURRENTLOGS", and then the application puts all the logs for the current month in the same directory path as last month's logs, which is now empty.
