I'm trying to find a way to programmatically get the average size of data flowing into each index on a daily basis so we can set indexes.conf max data retention based on how many days we want to hold for each index. There seem to be a variety of ways to get the size of indexes per day, none of which seem to match. There are a lot of posts on this over the last decade, but reading all of them has just left me more confused.
Ideally I can figure out the average amount of MB required on disk (across all indexers in a site) per index. Then I can define how many days I want to retain on a per index basis and set homePath.maxDataSizeMB to avg_per_day_size_mb * number_of_days_to_retain. Ignore cold/frozen for the moment and assume that all hot/warm is going to the same path (e.g. no smartstore). How can I do this?
The basic ways I've seen:
1. Use dbinspect. I can filter by state = "hot" or "warm" as well as filter by bucketId to throw away replicated copies of the bucket (either due to replication factor or site replication factor). Given that dbinspect shows you a snapshot in time of buckets, I don't think it accurately shows me the amount of new data coming into the index per day.
2. Use license usage log. When I average this metric out to a per day per index value, it feels like it is fairly close to the amount of data I'm storing across splunk (looking at du values on linux hosts). However, it doesn't count all indexes, notable events, internal logs, etc, which I need to account for.
3. Estimate raw data size for messages (e.g. len(_raw)) over a short period of time and then use those values with tstats over a longer period of time to estimate size. This feels like it will give the closest value, however it feels like a really dirty way to get the data.
4. Use metrics log to look at throughput values. Reading documentation, this just seems to be plain wrong as it stores only samples and doesn't show size of the message after it comes out of the various parsing queues.
What are folks doing to reliably size their indexes.conf based on # of days they want to retain per index?
... View more