Monitoring Splunk

What does this message mean regarding the health status of Splunkd?

julian0125
Explorer

Hello splunkers,

I need your help. I have a health alert about buckets on my Splunk environment.

This is the message that I have:

"The percentage of small of buckets created (50) over the last hour is very high and exceeded the yellow thresholds (30) for index=_internal, and possibly more indexes, on this indexer"

What does it mean, and how can I fix it?

jacobpevans
Motivator

I've been going crazy with this error, so I did a write-up here which has a query to identify the indexes that are being flagged.

Query

index=_internal sourcetype=splunkd component=HotBucketRoller "finished moving hot to warm"
 | eval bucketSizeMB = round(size / 1024 / 1024, 2)
 | table _time splunk_server idx bid bucketSizeMB
 | rename idx as index
 | join type=left index 
     [ | rest /services/data/indexes count=0
       | rename title as index
       | eval maxDataSize = case (maxDataSize == "auto",             750,
                                  maxDataSize == "auto_high_volume", 10000,
                                  true(),                            maxDataSize)
       | table  index updated currentDBSizeMB homePath.maxDataSizeMB maxDataSize maxHotBuckets maxWarmDBCount ]
 | eval bucketSizePercent = round(100*(bucketSizeMB/maxDataSize))
 | eval isSmallBucket     = if (bucketSizePercent < 10, 1, 0)
 | stats sum(isSmallBucket) as num_small_buckets
         count              as num_total_buckets
         by index splunk_server
 | eval  percentSmallBuckets = round(100*(num_small_buckets/num_total_buckets))
 | sort  - percentSmallBuckets
 | eval isViolation = if (percentSmallBuckets > 30, "Yes", "No")

From there, and stealing from @DMohn, plug your index into this query:

index=abc
| eval latency=_indextime-_time
| stats min(latency),
        max(latency),
        avg(latency),
        median(latency)
    by index sourcetype

Hopefully, one or more of the sourcetypes sticks out to you. Add that sourcetype to the query to narrow down by host (or, if the problem is universal to all hosts, you now know which sourcetype to investigate). In our case, a few heavy forwarders (e.g. search heads) do not have all of the necessary sourcetypes defined.

index=abc sourcetype=def
| eval latency=_indextime-_time
| stats min(latency),
        max(latency),
        avg(latency),
        median(latency)
    by index sourcetype host

Good luck!

Cheers,
Jacob

If you feel this response answered your question, please do not forget to mark it as such. If it did not, but you do have the answer, feel free to answer your own post and accept that as the answer.

Teja78
New Member

How do I identify which index is having this error?

pkellyz
Explorer

@jacobpevans Apologies if I come across as dense, but I think I'm missing something. I'm very new to this.

I ran the first query to identify the indexes causing the alert. There were a few.
I selected one of the indexes and ran the 2nd search to identify the sourcetypes.

I plugged one of the sourcetypes into the final search to find the hosts.

The search returns 2 hosts but there are three hosts with that app deployed from the deployment server. But no logs from the 3rd server... which I guess means it could still be a timestamp extraction issue, right?

All of the logs with high latency are from one of two sources. But again, no logs from the 2nd source being monitored in that app.

I had a sysadmin check the time on one of the affected hosts and the time matched the current time.

I reran the 3rd search starting with Last 4 Hours and adding 4 hour increments until I got up to last 24 hours.

It looks like the high latency only occurred between 20 and 24 hours ago. 0 - 20 hours ago latency was super duper low.

Is it possible there's something happening on that host that could cause a delay in sending logs to Splunk?

Can Splunk indexers become overloaded with logs being received at the same time and take a while to index them all?

The highest latency as of right now is _time 22:00 02-09-2020 and _indextime of 17:35 02-10-2020. _time and the timestamp in the raw log match.

EDIT: I did some more searching on Google and here within Splunk Answers and remembered we have the monitoring console. I checked Indexing Performance: Instance and Indexing Performance: Advanced.

Within Advanced I found CPU usage and it never gets very much above 10%. Is there something else I should check to eliminate Splunk/the indexers as the cause before asking another team to investigate why their host is delaying the sending of logs?
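
One way to check whether the indexing queues were backing up during that window is a search along these lines (a sketch assuming the standard metrics.log queue fields current_size_kb and max_size_kb; a queue that sits near 100% or logs blocked=true would point at the indexers rather than the host):

index=_internal source=*metrics.log* sourcetype=splunkd group=queue name=indexqueue
| eval fill_pct = round(100 * current_size_kb / max_size_kb, 1)
| timechart span=10m max(fill_pct) by host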

DMohn
Motivator

This is most likely related to an issue with the event times - either your timestamp extraction is not working properly, your server times are way off, or your applications are logging the wrong time.

Try investigating this, as Splunk will create new buckets when the events coming into an index fall outside of a certain time range. That is what then causes the error.
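
One way to eyeball this is with dbinspect, which lists each bucket's time span and size - small buckets whose events cover an unusually wide or scattered time range usually point at a timestamp problem. A rough sketch (swap in the index named in the health message):

| dbinspect index=_internal
| eval spanHours    = round((endEpoch - startEpoch) / 3600, 1)
| eval earliestTime = strftime(startEpoch, "%F %T")
| eval latestTime   = strftime(endEpoch, "%F %T")
| table bucketId state splunk_server earliestTime latestTime spanHours eventCount sizeOnDiskMB
| sort + sizeOnDiskMB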

julian0125
Explorer

Is there a way to fix that issue? It may be an index configuration issue.

SinghK
Builder

The add-on that gets the data to that index on the 3rd host needs a props.conf correction: add a TZ = <correct time zone> setting there, and that should fix the time.
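
For reference, a minimal props.conf sketch of what that could look like. The sourcetype name and time zone below are placeholders - use the sourcetype you found with the searches above, the zone the host actually logs in, and put the stanza wherever that data is parsed (typically the heavy forwarder or indexer):

# props.conf - hypothetical sourcetype name, replace with the real one
[def]
# Time zone of the host writing the logs (placeholder value)
TZ = America/New_York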

DMohn
Motivator

You can narrow down the issue by checking the index latency, which can give an indication of where event timestamps are off...

index=* index!=_*
| eval latency=_indextime-_time
| stats min(latency),
        max(latency),
        avg(latency),
        median(latency)
    by index