Reduce the size of an index by eliminating parts of the index

nicholasgrabows
Path Finder

We have a fairly large indexing cluster. We tend to get about a 2x compression rate on our raw data. In other words, we can store about 2 GB of raw data, indexes included, on 1 GB of disk. We were hoping to get much better compression rates. Given that our data is text, we believe 5 to 10x is possible. It was suggested to us that the reason we are only getting 2x is the amount of indexes created. Moreover, it was suggested that we could reduce the number of indexes if we knew for a fact that some portion of the index is never used in searching. For example, if we have a URL like "http://www.someurl/path1/path2/path3/path4" in our data, the indexing algorithm automatically stores each path in a separate index, and combinations of the paths as well. So we would end up storing path1, path2, path1/path2, etc. in indexes. Yet if we know our logs are well defined, then folks won't be searching for these types of things; rather, they will search for URL=path1... Hence we could, if Splunk lets us, significantly reduce the size of the index. Is this possible? If so, is there any documentation on it?

Ayn
Legend

I believe what you're talking about is Splunk's segmentation. You can tweak this to your liking in props.conf, but as you have probably guessed, optimizing segmentation settings for storage efficiency will have an impact on performance. More information is available in (among others) the following links:

http://docs.splunk.com/Documentation/Splunk/5.0/Data/Abouteventsegmentation
http://wiki.splunk.com/Community:SplunkTuningFactors
http://docs.splunk.com/Documentation/Splunk/5.0/Data/Setthesegmentationforeventdata
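
As a rough illustration of the props.conf route, here is a minimal sketch rather than a tested configuration. The sourcetype name `my_weblogs` is hypothetical, and `inner` is assumed to be one of the segmenters defined in segmenters.conf on your version; check the docs above before relying on either.

```ini
# props.conf on the instance that parses the data
# "my_weblogs" is a hypothetical sourcetype name; substitute your own.
[my_weblogs]
# Index-time segmentation for this sourcetype.
# Assumption: a segmenter named "inner" exists in segmenters.conf
# and indexes only the smallest segments, not the combined ones.
SEGMENTATION = inner
```

A change like this only affects data indexed after it is made, and searches that depend on the larger segments may behave differently, so it is worth trying on a test index first.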

nicholasgrabows
Path Finder

I guess I could say the same thing about common nomenclature... At any rate, is there a name for the indexes inside a Splunk index? If there is, I'm happy to use it... Thanks for the answers below. I will investigate and get back to you.

sowings
Splunk Employee

The Splunk concept you're after is segmentation. That documentation link will explain the general case, how Splunk does it, and lead you to other articles to adjust its configuration.

Note: I could make a case for searching for a given token in the path, like the subdirectory "path3" as given above.

sowings
Splunk Employee

I should note that this is typically not necessary, as Splunk gets pretty good compression of data even when you factor in the size of the index files. I've seen data with low entropy compress at 19:1. YMMV.

nicholasgrabows
Path Finder

Ayn, this is what I was looking for. Thanks so much.

Ayn
Legend

To avoid confusion, it's a really good idea to use Splunk's definition of index around here.

nicholasgrabows
Path Finder

The logs are broken by timestamp. Here is an example event:
Date=11-10-2012 00:00:00, URL="http://www.someurl/path1/path2/path3/path4"
The question is how I can reduce the size/number of indexes created for this sourcetype. I'm using "index" here in the common sense (i.e. an index on a relational database table); when Splunk uses the word "index", it means a set of files, some of which are raw data files and some of which are index files.

bmacias84
Champion

I am kinda at a loss. Your logs are not broken by timestamp, but rather by path? So each segment of the path is broken out by a transform into separate events?
