Getting Data In

Punct... good god y'all - what is it good for?

jplumsdaine22
Influencer

Early on in our Splunk deployment we set ANNOTATE_PUNCT to false on our indexers, both to save space and for performance tuning. Fast forward a few years and our indexing setup now has some spare resources, so we could turn it back on.

I'm wondering two things:

  1. Am I missing out on anything amazing by not having the punct field indexed?
  2. Is there a way to estimate the space and performance impacts if I re-enable ANNOTATE_PUNCT?
1 Solution

jkat54
SplunkTrust

As for number 1: we use the punct field to find anomalies in data. If 99% of your events look like _______/______(_)______ and 1% look like ??????++++____+(*)___+_, then perhaps you've got 1% bad data.
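For instance, something like this will surface the least common punct patterns (a sketch; your_index is a placeholder, so swap in your own index and time range):

| tstats count where index=your_index by punct
| sort count
| head 20

Anything with a tiny count relative to the rest of the index is worth a closer look.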

As for number 2: I'd say do this math:

The key punct= is 6 characters; call it 8 if it's stored quoted in the index (I'm not sure it is). The value it equals is always 1 or more characters, so at a minimum you're looking at 7-9 characters per event. Let's round up for ease of math and say at least 10 characters per event (though I think 20-40 would be a better estimate). So:

((10-40 characters * number of events) * (0.15 * RF)) + ((10-40 characters * number of events) * (0.35 * SF)) = total number of bytes used

Fill in your number of events and your replication and search factors.
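To make that concrete, here's a rough worked example (the event count and average length are made-up numbers): 1 billion events at an average of 25 characters of punct each is about 25GB of raw punct data (1 character ≈ 1 byte). With RF=2 and SF=3, that works out to (25GB * 0.15 * 2) + (25GB * 0.35 * 3) = 7.5GB + 26.25GB ≈ 34GB.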

Here's a search that will do the math for you, if you enable punct:

| tstats count where index=* OR index=_* by punct index 
| eval bytes=len(punct)*count 
| eval replicationFactor=2
| eval searchFactor=3
| stats sum(eval(bytes/1024/1024/1024)) as GB count by index replicationFactor searchFactor
| eval Estimated_GB_used = (0.15*replicationFactor*GB) + (0.35*searchFactor*GB)
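Note that the replicationFactor and searchFactor evals are just hard-coded constants so they show up in the output; set them to whatever your cluster actually runs (the example assumes RF=2 and SF=3).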

jplumsdaine22
Influencer

Good call!

jkat54
SplunkTrust

Hey, I edited the answer; please review.

jplumsdaine22
Influencer

Why the 0.15 and 0.35 for the replication and search factors?

jkat54
SplunkTrust

See this: http://docs.splunk.com/Documentation/Splunk/6.2.0/Indexer/Systemrequirements#Storage_considerations

Replicated buckets are just raw data and are not searchable; searchable buckets are raw data plus additional overhead for metadata and field values. Replicated buckets are about 15% of the original data size, and searchable buckets are about 35% of the original data size.

"Here are two examples of estimating cluster storage requirements, both assuming 100GB of incoming syslog data, resulting in 15GB for each set of rawdata and 35GB for each set of index files:

3 peer nodes, with replication factor = 3; search factor = 2: This requires a total of 115GB across all peer nodes (averaging 38GB/peer), calculated as follows:

  Total rawdata = (15GB * 3) = 45GB.
  Total index files = (35GB * 2) = 70GB.

5 peer nodes, with replication factor = 5; search factor = 3: This requires a total of 180GB across all peer nodes (averaging 36GB/peer), calculated as follows:

  Total rawdata = (15GB * 5) = 75GB.
  Total index files = (35GB * 3) = 105GB."

jplumsdaine22
Influencer

Ah right! We run with 1/1 for replication and search factors, so I had always used just 0.5 (0.15 + 0.35). Useful to know!
