I've been evaluating Splunk against a custom application consisting of a cluster of Tomcat instances running two separate applications (which partially share classes) plus some front-end and back-end Apache instances. I imported a month's worth of log data (all at once) as a test, and have been playing around with it.
Firstly, it seems that Splunk reduces the data to about half its original size. If possible I'd like the indexing to be even more efficient, as a lot of the log data is duplicated.
Looking at how Splunk has processed the incoming data from the Tomcat application (log4j), it seems to have parsed only the timestamp and nothing further (it's possible I'm missing something), so a log line is pretty much treated as an opaque string. I later used field extraction (from the search) to pull out fields such as the log level, the actual Java class, etc., but the concept is still a bit foreign to me (despite reading through a lot of documentation).
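To make that concrete, the log lines look roughly like this (names and values are made up):

    2014-03-12 10:15:32,123 INFO  [http-8080-1] com.example.app.OrderService - Order processed

and the kind of search-time extraction I mean is along these lines (the field names log_level, thread and java_class are just my own choices):

    sourcetype=tomcat_app | rex "^\S+\s+\S+\s+(?<log_level>\w+)\s+\[(?<thread>[^\]]+)\]\s+(?<java_class>\S+)"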
If I keep importing more data, will Splunk automatically apply the extracted fields to new events with the same sourcetype?
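In other words, as I understand the docs, saving the extraction amounts to a props.conf entry along these lines (sourcetype name made up), which would then apply at search time to every event of that sourcetype:

    [tomcat_app]
    EXTRACT-log4j = ^\S+\s+\S+\s+(?<log_level>\w+)\s+\[(?<thread>[^\]]+)\]\s+(?<java_class>\S+)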
If I specified the field extractions so that they happen at index time instead, would that reduce the size of the indexes stored on disk?
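If I've read the documentation correctly, index-time extraction would mean something like the following in props.conf and transforms.conf (again, stanza and field names are made up), with the fields written into the index alongside the raw event:

    # props.conf
    [tomcat_app]
    TRANSFORMS-log4j = tomcat_log4j_fields

    # transforms.conf
    [tomcat_log4j_fields]
    REGEX = ^\S+\s+\S+\s+(\w+)\s+\[([^\]]+)\]\s+(\S+)
    FORMAT = log_level::$1 thread::$2 java_class::$3
    WRITE_META = true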
This last point in particular confuses me, as the documentation mentions that field extraction at index time can actually increase the size of the indexes. If I have 500 Java classes producing a million log lines a day, wouldn't separating the class name out of the bulk of the log line during indexing actually reduce the index size (especially if the alternative is a saved search producing a dashboard out of the same data anyway)?