Early on in our Splunk deployment we set ANNOTATE_PUNCT to false on our indexers, both to save space and for performance tuning. Fast forward a few years and our indexing setup now has some spare resources, so we could turn it back on.
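For context, the setting lives in props.conf; a minimal sketch of turning it back on, assuming we re-enable it globally rather than per-sourcetype:

[default]
ANNOTATE_PUNCT = true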
I'm wondering two things:
1. Is the punct field actually useful for anything?
2. How much extra storage would enabling it consume?
As for number 1... we use the punct field to find anomalies in data. If 99% of your events look like this: _______/______(_)______
and 1% look like this: ??????++++____+(*)___+_
then perhaps you've got 1% bad data. A sketch of that kind of search follows below.
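A minimal example of hunting for those outliers (the index name is a placeholder):

index=your_index
| rare limit=10 punct

This surfaces the least common punctuation patterns, which are usually the malformed events.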
As for number 2... I'd say do this math:
The key punct= is 6 characters; maybe think of it as 8 characters if it's stored quoted in the index (I'm not sure).
Then the value it equals is always 1 or more characters, so at minimum you're looking at 7-9 characters per event. Let's round up for ease of math and call it at least 10 characters (though I think 20-40 would be a better estimate). So:

((10-40 characters * number of events) * (0.15 * RF)) + ((10-40 characters * number of events) * (0.35 * SF)) = total bytes used

Fill in your number of events and your replication & search factors.
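A quick worked example with made-up numbers: 1 billion events averaging 20 characters of punct each, with RF=2 and SF=3:

20 bytes * 1,000,000,000 events = roughly 20GB of raw punct data
20GB * (0.15 * 2) = 6GB for the replicated copies
20GB * (0.35 * 3) = 21GB for the searchable copies
Total: roughly 27GB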
Here's a search that will do the math for you, if you enable punct:
| tstats count where index=* OR index=_* by punct index
| eval bytes=len(punct)*count
| eval replicationFactor=2
| eval searchFactor=3
| stats sum(eval(bytes/1024/1024/1024)) as GB count by index replicationFactor searchFactor
| eval Estimated_GB_used = (0.15*replicationFactor*GB) + (0.35*searchFactor*GB)
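One caveat: that search only counts the punct values themselves, not the 6-8 characters of the punct= key per event from the estimate above, so treat the result as a floor. If you want to fold the key overhead in, swap the second line for something like this (the 6 is the key length assumed above):

| eval bytes=(len(punct)+6)*count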
Good call!
Hey, I edited the answer; please review.
Why the 0.15 and 0.35 multipliers for the replication and search factors?
See this: http://docs.splunk.com/Documentation/Splunk/6.2.0/Indexer/Systemrequirements#Storage_considerations
Replicated buckets are just raw data and are not searchable. Searchable buckets add overhead for metadata and field values on top of the raw data. The replicated (rawdata-only) copies come to about 15% of the original data size, and the index files for searchable copies add about 35% of the original data size.
"Here are two examples of estimating cluster storage requirements, both assuming 100GB of incoming syslog data, resulting in 15GB for each set of rawdata and 35GB for each set of index files:
3 peer nodes, with replication factor = 3; search factor = 2: This requires a total of 115GB across all peer nodes (averaging 38GB/peer), calculated as follows:
Total rawdata = (15GB * 3) = 45GB.
Total index files = (35GB * 2) = 70GB.
5 peer nodes, with replication factor = 5; search factor = 3: This requires a total of 180GB across all peer nodes (averaging 36GB/peer), calculated as follows:
Total rawdata = (15GB * 5) = 75GB.
Total index files = (35GB * 3) = 105GB."
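In other words, the general formula behind those examples (my restatement, not the doc's wording) is:

total storage ≈ incoming data size * ((0.15 * RF) + (0.35 * SF))

which checks out against both: 100GB * (0.15*3 + 0.35*2) = 100GB * 1.15 = 115GB, and 100GB * (0.15*5 + 0.35*3) = 100GB * 1.80 = 180GB.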
Ah right! We run with 1/1 for replication and search factors, so I had always just used 0.5 (0.15 + 0.35). Useful to know!