Monitoring Splunk

Troubleshooting Splunk when the index queue fill ratio has reached 100%.

arjitg
Explorer

Hi All, 

We have a Monitoring Console, and after a recent release we observed that the aggregator queue, typing queue and index queue fill ratios all reached 100%. I checked the indexer performance dashboards in the Monitoring Console and wasn't able to find any relevant error that might have caused it. The data ingestion rate in the licensing console looked the same as it does every day. Can someone please point me to the right steps to troubleshoot this? Thanks.


isoutamo
SplunkTrust

Hi

You said “due to a recent release”. What is this? A new Splunk version, a new software release of your business app, or something else?
r. Ismo


gcusello
SplunkTrust

Hi @arjitg,

there are many factors to investigate that could cause this issue:

  • you have slow storage (Splunk requires at least 800 IOPS): this is the most common cause,
  • you don't have sufficient resources (CPUs) for the log volume you have to index: this is a frequent cause,
  • you have too many regexes running in the typing queue: this shouldn't be the cause here, because your index queue also reached 100%.

Check your storage and CPU resources.
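If you want to see which queue saturates and when, you could chart the queue fill ratios from the indexers' internal metrics. This is only a sketch, assuming the default metrics.log fields (current_size_kb and max_size_kb) and that index=_internal from the indexers is searchable from your Monitoring Console:

    index=_internal source=*metrics.log* group=queue
        (name=parsingqueue OR name=aggqueue OR name=typingqueue OR name=indexqueue)
    | eval fill_pct=round(current_size_kb / max_size_kb * 100, 1)
    | timechart span=5m perc90(fill_pct) by name

The furthest-downstream queue that stays full usually points at the bottleneck: for example, if the typing queue is full but the index queue is not, the regexes in the typing pipeline are the likely suspect.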

Ciao.

Giuseppe


arjitg
Explorer

Hi @gcusello,

I checked the data ingested via licensing, so there isn't any additional influx of data into Splunk. Could insufficient resources (CPUs) for the log volume still be the issue then? Also, you mentioned IOPS: there is no increase in data and it's not a new setup, so how could IOPS be impacted? Is there a way we can check more details about what could have impacted the index queue?

[screenshot: arjitg_1-1712904421951.png]

 

Thanks.  


PickleRick
SplunkTrust

OK. You're _not_ waiting on the indexing queue, so it doesn't seem to be an issue of backpressure from the disk being unable to keep up with the rate of incoming events, or from any other configured outputs.

I'd start by checking the OS-level metrics (CPU usage, RAM). If _nothing_ else changed "outside" (the amount of events, their ingestion rate throughout the day - not only the summarized license consumption over the whole day - and the composition of the ingest stream split among different sources, sourcetypes and so on), something must have changed within your infrastructure. There are no miracles 😉
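If you want to compare the ingest profile before and after the release, one rough starting point (again assuming index=_internal from the indexers is searchable) is the per-sourcetype throughput metrics, for example:

    index=_internal source=*metrics.log* group=per_sourcetype_thruput
    | timechart span=15m sum(kb) by series useother=f limit=20

Run it over a window spanning the release and look for a sourcetype whose volume or rate profile changed.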

Is this a bare-metal installation or a VM? There could be issues with oversubscribed resources if it's a VM, or even with the environment temperature in your DC causing your CPUs to get throttled (yes, I've seen such things).

But if the behaviour changed, something must have changed. Question is what.


gcusello
SplunkTrust

Hi @arjitg,

in my opinion the issue is with the CPUs you are using:

How many CPUs do you have in your Indexer?

How many logs are you ingesting?

What kind of storage are you using?

You can measure the IOPS of your storage using a tool such as Bonnie++, remembering that Splunk requires at least 800 IOPS.
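For example (just a sketch - the user, directory and size are placeholders for your environment), you could run it against the filesystem that holds your hot/warm buckets, with a test size larger than the server's RAM so the filesystem cache doesn't skew the result:

    # run as root; -u drops privileges to the splunk user, -n 0 skips the file-creation tests
    bonnie++ -u splunk -d /opt/splunk/var/lib/splunk -s 64g -n 0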

In addition, from your screenshot I see that your index queue is at 0%, so the issue is probably in the typing pipeline.

Do you have many Add-ons that parse your data?

Only if you have performant storage (at least 800 IOPS) can you use two parallel pipelines, which consume more resources; for more info see https://docs.splunk.com/Documentation/Splunk/9.2.1/Indexer/Pipelinesets#Configure_the_number_of_pipe... .
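For reference, that setting goes in server.conf on the indexer; the value below is only an example, and raising it makes sense only when you have spare CPU and IOPS:

    # server.conf on the indexer (restart required)
    [general]
    parallelIngestionPipelines = 2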

Ciao.

Giuseppe
