Early on in our Splunk deployment we set ANNOTATE_PUNCT to false on our indexers, both to save space and for performance tuning. Fast forward a few years and our indexing setup now has some spare resources, so we could turn it back on.
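For context, the setting lives in props.conf; a minimal sketch of turning it back on, assuming we re-enable it globally rather than per-sourcetype:

[default]
ANNOTATE_PUNCT = true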
I'm wondering two things:
1. Is the punct field actually useful for anything?
2. How much extra storage would enabling it consume?
As for number 1... we use the punct field to find anomalies in data. If 99% of your events look like this: _______/______(_)______
and 1% look like this: ??????++++____+(*)___+_
then perhaps you've got 1% bad data. A sketch of that kind of search follows below.
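A minimal example of hunting for those outliers (the index name is a placeholder):

index=your_index
| rare limit=10 punct

This surfaces the least common punctuation patterns, which are usually the malformed events.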
As for number 2... I'd say do this math:
The key punct= is 6 characters; maybe think of it as 8 characters if it's stored quoted in the index (I'm not sure).
Then the value it equals is always 1 or more characters, so at minimum you're looking at 7-9 characters per event. Let's round up for ease of math and call it at least 10 characters (though I think 20-40 would be a better estimate). So:

((10-40 characters * number of events) * (0.15 * RF)) + ((10-40 characters * number of events) * (0.35 * SF)) = total bytes used

Fill in your number of events and your replication & search factors.
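A quick worked example with made-up numbers: 1 billion events averaging 20 characters of punct each, with RF=2 and SF=3:

20 bytes * 1,000,000,000 events = roughly 20GB of raw punct data
20GB * (0.15 * 2) = 6GB for the replicated copies
20GB * (0.35 * 3) = 21GB for the searchable copies
Total: roughly 27GB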
Here's a search that will do the math for you, if you enable punct:
| tstats count where index=* OR index=_* by punct index
| eval bytes=len(punct)*count
| eval replicationFactor=2
| eval searchFactor=3
| stats sum(eval(bytes/1024/1024/1024)) as GB count by index replicationFactor searchFactor
| eval Estimated_GB_used = (0.15*replicationFactor*GB) + (0.35*searchFactor*GB)
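One caveat: that search only counts the punct values themselves, not the 6-8 characters of the punct= key per event from the estimate above, so treat the result as a floor. If you want to fold the key overhead in, swap the second line for something like this (the 6 is the key length assumed above):

| eval bytes=(len(punct)+6)*count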
Good call!
Hey, I edited the answer; please review.
Why the 0.15 and 0.35 multipliers for the replication and search factors?
See this: http://docs.splunk.com/Documentation/Splunk/6.2.0/Indexer/Systemrequirements#Storage_considerations
Replicated buckets are just raw data and are not searchable. Searchable buckets add overhead for metadata and field values on top of the raw data. The replicated (rawdata-only) copies come to about 15% of the original data size, and the index files for searchable copies add about 35% of the original data size.
"Here are two examples of estimating cluster storage requirements, both assuming 100GB of incoming syslog data, resulting in 15GB for each set of rawdata and 35GB for each set of index files:
3 peer nodes, with replication factor = 3; search factor = 2: This requires a total of 115GB across all peer nodes (averaging 38GB/peer), calculated as follows:
Total rawdata = (15GB * 3) = 45GB.
Total index files = (35GB * 2) = 70GB.
5 peer nodes, with replication factor = 5; search factor = 3: This requires a total of 180GB across all peer nodes (averaging 36GB/peer), calculated as follows:
Total rawdata = (15GB * 5) = 75GB.
Total index files = (35GB * 3) = 105GB."
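In other words, the general formula behind those examples (my restatement, not the doc's wording) is:

total storage ≈ incoming data size * ((0.15 * RF) + (0.35 * SF))

which checks out against both: 100GB * (0.15*3 + 0.35*2) = 100GB * 1.15 = 115GB, and 100GB * (0.15*5 + 0.35*3) = 100GB * 1.80 = 180GB.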
Ah right! We run with 1/1 for replication and search factors, so I had always just used 0.5 (0.15 + 0.35). Useful to know!