
Monitoring indexers through health checks / heartbeats?

rturk
Builder

Greetings Splunkers,

I've seen the "autoLB" mode of load balancing mentioned in a lot of the online documentation, but I haven't found anywhere that goes into the details of how it actually works, specifically how an indexer is monitored by the forwarder and considered a valid recipient of raw data (e.g. Splunk-specific health checks/heartbeats).

My reasoning is that I'd like to monitor the Splunk servers from a monitoring platform to see whether they are in an UP/DOWN state. While it's possible to script something up, if there's a method the forwarder itself uses, it would be best to use the same method so the check reflects actual impact.

A similar question was asked here (http://splunk-base.splunk.com/answers/8720/best-practice-for-monitoring-indexer-health) but received no answers.

Can anyone please shed some light on this, or point me towards documentation that details it? (I did look, but the technical details were very much on the light side.)

Regards,

RT 🙂

1 Solution

mw
Splunk Employee

Have you looked at the Deployment Monitor app? It does some health-related monitoring. Basically, health generally comes down to looking at the metrics to see whether any data has been indexed recently. I would imagine that forwarders simply try to connect to the indexer and move on to the next one if they don't get the proper ACK. I doubt you want to replicate that, but looking at the indexer's stats would seem to be a reasonable way to determine this.

Here's a boiled down version of what the Deployment Monitor is using to establish some level of monitoring of indexers:

index="_internal" source="*metrics.log" group=per_index_thruput series!="_*" | stats max(_time) as _time sum(kb) as kb by splunk_server | eval status = if(kb==0, "idle", if(parseQ_percentage>50, "overloaded", if(indexQ_percentage>50, "overloaded", "normal")))


Simeon
Splunk Employee

You could easily run a CLI or API search that checks event counts from each Splunk server. Obviously, there will be a penalty for running the search. Similarly, you could check known sources across all servers and report on their event counts by "splunk_server". For example, to check whether all servers are indexing, I could run this search:

index=_internal source=*metrics.log earliest=-2m | stats count by splunk_server
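
If you'd rather drive that same search from a monitoring script instead of the UI or CLI, splunkd's REST API works too. Here's a minimal sketch in Python using the requests library, assuming a Splunk version whose export endpoint supports output_mode=json; the host, port, and credentials are placeholders for your environment:

    # Hypothetical health-check script: runs the search above through
    # splunkd's REST API and prints the per-indexer event counts.
    import json

    import requests

    SPLUNKD = "https://search-head.example.com:8089"  # placeholder host/port
    AUTH = ("admin", "changeme")                      # placeholder credentials

    search = (
        "search index=_internal source=*metrics.log earliest=-2m "
        "| stats count by splunk_server"
    )

    # The export endpoint streams one JSON object per result line.
    resp = requests.post(
        SPLUNKD + "/services/search/jobs/export",
        auth=AUTH,
        data={"search": search, "output_mode": "json"},
        verify=False,  # only if splunkd uses a self-signed certificate
    )
    resp.raise_for_status()

    for line in resp.text.splitlines():
        if line.strip():
            result = json.loads(line).get("result")
            if result:
                print(result["splunk_server"], result["count"])

Any server missing from the output (or reporting a count of 0) hasn't indexed its own internal logs in the last two minutes, which is a reasonable proxy for "down".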

Regarding auto load balancing, it works as follows. Assume I have 5 indexers, call them i1, i2, i3, i4, and i5. If my forwarder is running autoLB, it will do the following:

  • Randomly select an indexer from the pool i1-i5. In this case, let's assume we select i1.
  • Send data to i1 for the autoLB period.
  • Randomly select an indexer from the pool i2-i5 (the indexer we previously sent data to is removed from the pool). In this case, let's assume we select i2.
  • If a connection is established, send data to i2 for the autoLB period.
  • If we cannot establish a connection, select another indexer from the pool i3-i5 (i2 is removed because we can't send data there, and i1 is also removed since we last sent data there). Let's assume we select i5.
  • Send data to i5 for the autoLB period.
  • Randomly select an indexer from the pool i1-i4. Wash, rinse, repeat.

Notice that Splunk removes the previously used indexer from the pool, and if the subsequent indexer fails, it is also removed from the pool until a connection succeeds (see the sketch below).
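
To make the selection loop concrete, here's a rough model of it in Python. This is only an illustration of the behavior described above, not Splunk's actual implementation; the 30-second period mirrors the default autoLBFrequency setting in outputs.conf, and can_connect() is a stand-in for the real TCP connection attempt:

    # Illustrative model of the autoLB rotation described above.
    import random
    import time

    INDEXERS = ["i1", "i2", "i3", "i4", "i5"]
    AUTOLB_PERIOD = 30  # seconds; the default autoLBFrequency

    def can_connect(indexer):
        """Stand-in for a TCP connect to the indexer's receiving port."""
        return True

    def run_autolb():
        last = None
        while True:
            # The indexer we just sent to is excluded from the pool.
            pool = [i for i in INDEXERS if i != last]
            while pool:
                choice = random.choice(pool)
                if can_connect(choice):
                    print("sending to %s for %ds" % (choice, AUTOLB_PERIOD))
                    time.sleep(AUTOLB_PERIOD)
                    last = choice
                    break
                # Unreachable indexers also drop out until one succeeds.
                pool.remove(choice)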


dwaddle
SplunkTrust
SplunkTrust

I don't know exactly how autoLB decides if an indexer is "good" or not, but a new feature of 4.2 is "Indexer Acknowledgement" -- http://www.splunk.com/base/Documentation/latest/Deploy/Protectagainstlossofin-flightdata
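
For completeness, acknowledgement is turned on per output group in the forwarder's outputs.conf. A minimal sketch, with a hypothetical group name and indexer addresses (useACK requires 4.2 or later):

    [tcpout:my_indexers]
    server = i1.example.com:9997, i2.example.com:9997
    autoLB = true
    useACK = true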

You could always configure Nagios or its peers to connect to your splunkd/splunkweb on their various ports. Splunkd's management port speaks HTTP(S), as does Splunkweb. The problem with connecting to the receiving port that forwarders use is that (a) you can't speak the protocol and (b) simply connecting doesn't mean all is well.

This is one of those cases where I feel a Nagios passive check is of value. You could do something as simple as a CLI search every 2-3 minutes on each indexer. You would want something to validate that "recent" data is showing up for that indexer (use splunk_server=xxxx), and if not, the passive check reports back to Nagios that all is not well. The nice thing about passive checks is that if they don't update within a timeframe, Nagios can be set up to assume they're down. A sketch of this follows.
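
As an example of what that passive check could look like, here's a rough sketch in Python: it runs a CLI search for recent data from one indexer and writes a PROCESS_SERVICE_CHECK_RESULT line to Nagios's external command file. The indexer name, service description, credentials, and command-file path are all placeholders, and it assumes the splunk CLI is on the PATH:

    # Hypothetical passive check for a single indexer.
    import subprocess
    import time

    INDEXER = "i1.example.com"                        # placeholder indexer
    CMD_FILE = "/usr/local/nagios/var/rw/nagios.cmd"  # default Nagios path

    # Count events this server indexed from its own logs in the last 5 min.
    search = ("index=_internal source=*metrics.log splunk_server=%s "
              "earliest=-5m | stats count" % INDEXER)
    out = subprocess.run(
        ["splunk", "search", search, "-auth", "admin:changeme"],
        capture_output=True, text=True,
    ).stdout

    # Crude parse: any non-zero number in the output means data is flowing.
    healthy = any(tok.isdigit() and int(tok) > 0 for tok in out.split())
    code, msg = (0, "OK - recent data") if healthy \
        else (2, "CRITICAL - no recent data")

    # PROCESS_SERVICE_CHECK_RESULT is the standard passive-result command.
    with open(CMD_FILE, "w") as f:
        f.write("[%d] PROCESS_SERVICE_CHECK_RESULT;%s;Splunk Indexing;%d;%s\n"
                % (int(time.time()), INDEXER, code, msg))

Run something like this from cron every 2-3 minutes per indexer; if the script itself stops reporting, Nagios's freshness checking can flag the service as stale.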
