When Hunk archives data from a Splunk bucket to HDFS or S3, what exactly is it archiving? The entire bucket? Or just the rawdata file? Is there a formula we can use to calculate the amount of storage we would need in HDFS/S3 based on our bucket sizes and retention periods?
Just the raw data file. Sadly, there's not a search you can do to get a picture of how much space the raw data uses. I ended up whipping up a shell script to pull that data off indexers directly.
I've asked Splunk for an enhancement to make this more visible.
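For reference, here is a minimal sketch of that kind of script (not the exact one I used): it sums the compressed raw data (journal.gz) across all buckets of one index on an indexer. It assumes the default layout $SPLUNK_DB/<index>/{db,colddb}/<bucket>/rawdata/journal.gz; adjust paths and the index name for your deployment.

    #!/bin/sh
    # Sketch: total size of the compressed raw data (journal.gz) for one
    # index on this indexer. Assumes the default bucket layout under
    # $SPLUNK_DB; "defaultdb" is just the fallback index name.
    SPLUNK_DB=${SPLUNK_DB:-/opt/splunk/var/lib/splunk}
    INDEX=${1:-defaultdb}

    find "$SPLUNK_DB/$INDEX" -name journal.gz -exec du -k {} + |
        awk '{total += $1} END {printf "journal.gz total: %.1f MB\n", total/1024}'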
Some numbers from a test I ran with a JSON file, Hunkdata.json:
Before Splunk indexing: 671 MB
After Splunk indexing (raw data + index data): 463 MB, about 70% of the original file
After archiving into HDFS (raw data + a few metadata files): 157 MB, about 33% of the size on the Splunk indexer, or roughly 23% of the original file
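In case anyone wants to reproduce the local measurements, a quick sketch (the index name hunktest is a placeholder; warm buckets are named db_*, hot buckets hot_*):

    # Original file, before indexing
    du -sm Hunkdata.json

    # Total bucket size after indexing (raw data + index data)
    du -csm "$SPLUNK_DB"/hunktest/db/db_* | tail -1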
If you use the Hadoop Connect app, you might be able to get a picture of how much space the raw data uses. Hadoop Connect includes the hdfs search command, so you can use | hdfs lsr to calculate the space the files are consuming in HDFS.
The last example in this blog post might give you a guideline on how to create such a search: http://blogs.splunk.com/2012/12/20/connecting-splunk-and-hadoop/
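Outside of Splunk, the stock Hadoop CLI can answer the same question directly. A sketch, assuming the archive lives under /archive/myindex (that path is hypothetical):

    # Total space used by the archived buckets in HDFS
    # (-s summarizes the directory, -h prints human-readable sizes)
    hdfs dfs -du -s -h /archive/myindex

    # Older Hadoop 1.x releases use the deprecated spelling:
    hadoop fs -dus /archive/myindex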
Thanks! Is it stored compressed?
Yup, it's stored compressed on the indexer and I'm 99% sure it stays compressed over in HDFS.
Yes, it's still compressed. Note that journal.gz is not just raw data: it's the journal of everything that gets written to the bucket, so it also contains metadata (though not the lexicon) and is sufficient to rebuild the entire bucket. We archive some of the .dat files out of the bucket as well. As a rule of thumb, we'll copy about 30-40% of the size of the bucket to HDFS.
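That rule of thumb is enough for the sizing formula asked about in the question: planned HDFS/S3 storage ≈ total size of the buckets within your archive retention window × 0.30 to 0.40. A back-of-the-envelope sketch, where the bucket total is a made-up example:

    # Hypothetical capacity estimate from the 30-40% rule of thumb.
    # BUCKET_MB: total size of the buckets you plan to archive, in MB.
    BUCKET_MB=500000   # e.g. 500 GB of buckets in the retention window

    echo "low estimate:  $((BUCKET_MB * 30 / 100)) MB"
    echo "high estimate: $((BUCKET_MB * 40 / 100)) MB"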
Great, so basically it is the same data that would be archived by a cold2frozen script? And is it cluster aware, i.e., a cluster of indexers won't archive multiple copies of the same buckets to S3?
Well, the cold2frozen script has the option of doing whatever it wants with the bucket. And yes, it is cluster aware. That's a big portion of the investment in this feature. BTW, this stuff ain't simple :).
Right, I was comparing the two only in terms of the files you would typically archive with a cold2frozen script versus the files archived by Hunk. Thanks for the info!