We are currently performing a POC using Splunk 4.1.3 to index Blue Coat proxy data. Our test Splunk license is for 200 GB a day, and I have indexed around 800 GB of raw data on our POC server.
The Splunk documentation at http://www.splunk.com/base/Documentation/latest/Installation/HowHowmuchspaceyouwillneed gives some commands to run in order to get an idea of how much storage you will need. I ran the commands and see the following numbers (I am only using the default DB):
[root@ssdatacrusher2 db]# du -shc hot_v*/rawdata
3.2G    hot_v1_22/rawdata
31M     hot_v1_25/rawdata
3.9G    hot_v1_35/rawdata
2.7G    hot_v1_38/rawdata
9.8G    total
[root@ssdatacrusher2 db]# du -ch hot_v*
8.0G    hot_v1_22
0       hot_v1_22.sentinel
62M     hot_v1_25
0       hot_v1_25.sentinel
9.5G    hot_v1_35
0       hot_v1_35.sentinel
7.3G    hot_v1_38
0       hot_v1_38.sentinel
However, I ran du -h on the $SPLUNK_HOME/var directory and am seeing around 400 GB of disk usage:
[root@ssdatacrusher2 splunk]# pwd
I am seeing most of the 400 GB of disk usage in directories that end with rawdata.
When the data is rolled from hot to warm, will these directories be deleted and cause my disk usage to go down, or will it stay around the same since I will still be indexing more raw data?
No, this data will not be deleted, and it is not meant to be deleted (unless it ages out by policy). The commands listed are not giving you the actual size of the total data; they are intended to give you some idea of the ratio of raw data size vs. stored disk size.
I would also say that the commands in the documentation are not correct or useful the way you have used them. I'd pretty much ignore them, as the results they gave you above don't say much.
The best thing for you is to just find out how much space is taken up by /opt/splunk/var/lib/splunk and compare that to the raw amount of data you have indexed. If you started with an empty index and the 800GB of raw is the only thing put into that index, then that will give you an indication of the size ratio you can expect. My guess is that it's pretty close to 400GB/800GB = 50%, i.e., the index will need about 50% of the raw size, assuming the data sample is representative of all your data.
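If you want to check that ratio directly, something like this works (a minimal sketch; the index path is the default for a Linux install and the 800 GB figure is the raw volume mentioned above, so adjust both for your environment):

du -sh /opt/splunk/var/lib/splunk      # total on-disk size of everything Splunk has stored in its indexes
echo "scale=2; 400/800" | bc           # on-disk size divided by raw size, e.g. ~.50 here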
Following on from gkanapathy -
When Splunk moves data from the Hot DB to the Warm DB, nothing is deleted - it is simply moved
When Splunk moves data from the Warm DB to the Cold DB, nothing is deleted - it is simply moved
When Splunk "retires" data from the Cold DB, it will be deleted unless you have configured a
coldToFrozenScript in indexes.conf. This is done as part of a larger exercise to configure your data retirement policy, click here to learn more about this subject.
A follow-on page in that documentation explains how to set up your coldToFrozenScript, along with some other options you may want to consider.
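As an illustration only (the index name "bluecoat" and the script path are made-up examples, not taken from your setup), the relevant indexes.conf settings look roughly like this:

[bluecoat]
homePath   = $SPLUNK_DB/bluecoat/db
coldPath   = $SPLUNK_DB/bluecoat/colddb
thawedPath = $SPLUNK_DB/bluecoat/thaweddb
# freeze (retire) a bucket once all of its events are older than ~6 months
frozenTimePeriodInSecs = 15552000
# archive frozen buckets with your own script instead of letting Splunk delete them
coldToFrozenScript = /opt/splunk/bin/myColdToFrozen.sh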
The main thing to understand here is the set of states your data moves through as Splunk indexes it and as it ages. You can't keep data in the warm DB forever unless you have a lot of space or are indexing very little, so you need to consider how much space you want to use and how often you want to access the data. If you usually run searches over data from the last week or two, then that's all you really need to keep in the hot and warm DBs, and you can move your cold DB off to a cheap NFS location somewhere. Searching data on an NFS location is slower than searching local disk, so if you're going to be regularly searching over data from the last 3-6 months and speed is important to you, then you will want to size your Splunk server accordingly and give it a lot of local storage.
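Building on the example stanza above (again, the paths and values are illustrative assumptions, not recommendations), the two settings that matter for that layout are:

coldPath = /mnt/nfs_archive/splunk/bluecoat/colddb   # cold buckets live on the cheaper NFS mount
maxWarmDBCount = 300                                 # how many warm buckets to keep on local disk before the oldest roll to cold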
According to http://www.splunk.com/base/Documentation/4.1.4/Admin/Backupindexeddata, it looks like hot buckets are renamed into warm buckets. Warm to cold is renamed or moved, depending on whether they are on the same filesystem.
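You can see that rename in the bucket directory names themselves (a hypothetical listing; the epoch timestamps and IDs are made up to show the pattern):

ls $SPLUNK_HOME/var/lib/splunk/defaultdb/db
db_1281234567_1280123456_22    # former hot_v1_22, renamed db_<newest>_<oldest>_<id> when it rolled to warm
hot_v1_38                      # a bucket that is still hot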