Hello guys.... I have a task to investigate why indexes roll off data before their retention age. From my findings, the number of warm buckets was exceeded. Here's what the index configuration looks like. How can I fix this?
[wall]
repFactor = auto
coldPath = volume:cold/customer/wall/colddb
homePath = volume:hot_warm/customer/wall/db
thawedPath = /splunk/data/cold/customer/wall/thaweddb
frozenTimePeriodInSecs = 34186680
maxHotBuckets = 10
maxTotalDataSizeMB = 400000
Data rolls off for a few reasons.
Data arrives in Splunk and then needs to move through the bucket lifecycle (hot/warm > cold > frozen); otherwise data builds up and you run out of space.
1. Warm buckets roll to cold when either the homePath size or the maxWarmDBCount limit is reached.
2. Cold buckets are frozen (deleted by default) when either the frozenTimePeriodInSecs or the maxTotalDataSizeMB limit is reached.
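In indexes.conf terms, those two triggers correspond to settings like these (the values below are illustrative, not recommendations):

```
[wall]
# warm -> cold: when the warm bucket count exceeds this, the oldest warm buckets roll to cold
maxWarmDBCount = 300
# cold -> frozen (deleted by default): the bucket's newest event is older than this
frozenTimePeriodInSecs = 34186680
# cold -> frozen also fires when total index size (hot+warm+cold) exceeds this
maxTotalDataSizeMB = 400000
```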
This search may help show why buckets are moving – see the event_message field:
index=_internal sourcetype=splunkd component=BucketMover
| fields _time, bucket, candidate, component, event_message, from, frozenTimePeriodInSecs, host, idx, latest, log_level, now, reason, splunk_server, to
| fieldformat "now"=strftime('now', "%Y/%m/%d %H:%M:%S")
| fieldformat "latest"=strftime('latest', "%Y/%m/%d %H:%M:%S")
| eval retention_days = frozenTimePeriodInSecs / 86400
| table _time, component, bucket, from, to, candidate, event_message, frozenTimePeriodInSecs, retention_days, host, idx, now, latest, reason, splunk_server, log_level
You apply disk constraints per index via indexes.conf by configuring the various options:
Settings:
frozenTimePeriodInSecs = retention period in seconds; a bucket is frozen (deleted by default, optionally archived) once its newest event is older than this
maxTotalDataSizeMB = limits the overall size of the index (hot + warm + cold); when exceeded, the oldest bucket rolls to frozen
maxWarmDBCount = maximum number of warm buckets; when exceeded, the oldest warm buckets roll to cold
maxHotBuckets = maximum number of actively written (open) hot buckets; when exceeded, the oldest hot bucket rolls to warm
maxHotSpanSecs = upper bound on the timespan of events a hot bucket may contain before it rolls to warm
maxDataSize = maximum size a hot bucket can reach before splunkd rolls it to warm
homePath.maxDataSizeMB = limits the hot/warm portion of an individual index
coldPath.maxDataSizeMB = limits the cold portion of an individual index
maxVolumeDataSizeMB = limits the total size of all indexes residing on a volume (set on the volume stanza)
See the indexes.conf for details
https://docs.splunk.com/Documentation/Splunk/9.2.1/Admin/Indexesconf
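Putting several of these together, here is a sketch of how the constraints might be laid out in indexes.conf (paths and values are examples only, not recommendations):

```
# Volume-level cap shared by every index whose paths use this volume
[volume:hot_warm]
path = /splunk/data/hot
maxVolumeDataSizeMB = 2000000

[wall]
homePath = volume:hot_warm/customer/wall/db
coldPath = volume:cold/customer/wall/colddb
thawedPath = /splunk/data/cold/customer/wall/thaweddb
# ~396 days retention; buckets whose newest event is older are frozen
frozenTimePeriodInSecs = 34186680
# Index-level cap across hot+warm+cold; oldest bucket freezes when exceeded
maxTotalDataSizeMB = 400000
# Per-tier caps within this index
homePath.maxDataSizeMB = 300000
coldPath.maxDataSizeMB = 100000
```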
A few additional remarks to an otherwise quite good explanation.
1. Cold buckets are rolled to frozen. By default that means they just get deleted, but they might instead be archived to yet another storage location or processed by an external script (for example, compressed and encrypted for long-term storage). But yes, by default "freezing" equals deleting the bucket.
2. As you mentioned when listing the parameters affecting the bucket lifecycle, there are also limits on volume size.
So buckets might roll from cold to frozen if any of these conditions is met:
1) The bucket is older than the retention limit (the _newest_ event in the bucket is _older_ than the limit - in other words - whole bucket contains only events older than the retention limit).
2) The index has grown beyond its size limit.
3) The volume has grown beyond the size limit.
Obviously, condition 3) can be met only if your index directories are defined using volumes.
For condition 2), Splunk freezes the oldest bucket in the index (again, the one whose newest event is oldest), but for condition 3), Splunk freezes the oldest bucket from any of the indexes on that volume.
You can find the actual reason for freezing a bucket by searching your _internal index for AsyncFreezer.
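For example, something along these lines (the exact fields extracted from AsyncFreezer events may vary by Splunk version):

```
index=_internal sourcetype=splunkd component=AsyncFreezer
| table _time, host, idx, event_message
| sort - _time
```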
Typically, if just one index freezes before reaching its retention period, you'd expect that index to be running out of space; if buckets from many indexes get prematurely frozen, it might be a volume-size issue. That said, you can also see a volume size limit affecting just one index if your indexes have significantly differing retention periods, so that one of them contains much older events than the others.
From your answer, maxTotalDataSizeMB is the same size as diskSize. That's the reason it's rolling off.
I still don't know which of the parameters to tune to get it fixed, though.
@deepakc Is there a formula I can use to determine the right diskSize, maxTotalDataSize, maxWarmDBCount?
I think that will help me set the right values for these parameters.
There is no one general formula to find such things. And there cannot be because it depends on your needs.
It's like asking "I'm gonna start a company, how big a warehouse should I rent?". It depends. You might not even need to rent any warehouse if you're just gonna do accounting or IT consulting.
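That said, if you just need a starting point, a common back-of-the-envelope estimate is daily ingest × retention days × a compression factor (indexed data, rawdata plus index files together, often lands on disk at very roughly half the raw size, but measure your own ratio with dbinspect). A hypothetical sketch, not an official Splunk formula:

```python
def estimate_index_size_mb(daily_ingest_gb: float,
                           retention_days: float,
                           compression: float = 0.5) -> float:
    """Rough disk space (MB) needed to hold retention_days of data.

    compression is the assumed on-disk-to-raw ratio; 0.5 is only a
    common rule of thumb -- measure your own environment with dbinspect.
    """
    return daily_ingest_gb * 1024 * retention_days * compression


# Example: 10 GB/day ingest with this thread's retention of 34186680 s (~396 days)
needed_mb = estimate_index_size_mb(10, 34186680 / 86400)
print(f"~{needed_mb:,.0f} MB needed vs maxTotalDataSizeMB = 400,000")
```

If the estimate comes out far above your maxTotalDataSizeMB (as it does here for 10 GB/day), the size cap will freeze buckets long before frozenTimePeriodInSecs ever applies.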
Thank you so much @deepakc Your answer has been very helpful!
I'll check them out, thanks! @isoutamo