We have a virtualization index with no hot/warm/cold retention restrictions currently. After about four months we're indexing an average of 16 GB per day, with 1.8 TB (compressed) on disk and searchable. I'm proposing that we set a hard cap on this, as I don't believe keeping all of the raw data around is useful.
I'm looking to leverage summary indexing: summarize the data (for example, average CPU/memory usage), write it to a summary index, and then drop the source data rather than keeping it around long-term. I can see how to create a saved search that writes to a summary index over short timespans (e.g., the last hour), but how would I run this retroactively, in chunks, over the existing 1.8 TB so that trends can still be seen? If I need to clarify the question, let me know.
Thanks
1 - Create your summary searches (search index A, use the si- commands such as sitimechart or sistats to optimize the results, then save the results to index B).
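For example, a summary-generating search might look like the sketch below. The index, sourcetype, and field names (`cpu_usage_pct`, `mem_usage_pct`) are placeholders; substitute whatever your virtualization data actually uses:

```
index=virt_index sourcetype=vmware:perf
| sitimechart span=1h avg(cpu_usage_pct) avg(mem_usage_pct)
```

Save this as a scheduled search with summary indexing enabled, targeting index B. At report time you would then search index B with the matching non-si command, e.g. `index=B source="virt_summary_hourly" | timechart span=1h avg(cpu_usage_pct)`.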
see http://docs.splunk.com/Documentation/Splunk/latest/Knowledge/Usesummaryindexing
2 - Test it on new data, verify that you can retrieve the events from the summary index, and be happy; then schedule it.
3 - Run the backfill script for the time range prior to the first scheduled run of the summary search, and wait for the jobs to complete.
see the backfill script http://docs.splunk.com/Documentation/Splunk/latest/Knowledge/Managesummaryindexgapsandoverlaps
Backfilling 1.8 TB can take some time, so if you have many cores, you can ask the script to spawn several parallel jobs (e.g., 8) to complete faster.
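A sketch of a backfill invocation using `fill_summary_index.py` (the saved-search name, time range, and credentials below are placeholders to adapt):

```
$SPLUNK_HOME/bin/splunk cmd python $SPLUNK_HOME/bin/fill_summary_index.py \
    -app search -name "virt_summary_hourly" \
    -et -4mon -lt now \
    -j 8 -dedup true \
    -auth admin:changeme
```

Here `-j 8` runs up to 8 concurrent backfill jobs, and `-dedup true` skips time ranges that already have summary data, so the script is safe to re-run if it is interrupted partway through.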
4 - Profit: change the retention on index A, as you should no longer need the original raw data.