
Splunk App for Windows Infrastructure -- gaps in data from Perfmon

alekksi
Communicator

Hi all,

I've set up an app that uses the Splunk_TA_Windows perfmon counters to get finer granularity on a small set of servers than the default 10-second interval in that app.

In our test environment we had no issues with this, but now that it has moved to PROD we are getting gaps in our data. Instead of an interval of 1 producing per-second stats, certain counters are only returned every 2 or 3 seconds. This makes graphing the data impossible, because some seconds are missing fields that are required in the calculations.
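To see how often and where a counter skips seconds, a small check over exported timestamps can help. This is a minimal sketch, assuming you have already pulled the `_time` values for a single counter/instance out of Splunk and converted them to Python `datetime` objects; `find_gaps` and the 1.5-second tolerance are illustrative choices, not part of the Windows app.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, max_gap=timedelta(seconds=1.5)):
    """Return (previous, current) pairs of consecutive samples that are
    further apart than max_gap (1.5 s tolerates jitter around a 1 s poll)."""
    ordered = sorted(timestamps)
    return [(prev, cur) for prev, cur in zip(ordered, ordered[1:])
            if cur - prev > max_gap]

# Hypothetical sample times for one counter, e.g. "% User Time" on cpu0.
start = datetime(2014, 1, 1, 12, 0, 0)
samples = [start + timedelta(seconds=s) for s in (0, 1, 2, 5, 6, 8)]
for prev, cur in find_gaps(samples):
    print(prev.time(), "->", cur.time(), f"({(cur - prev).seconds} s)")
# 12:00:02 -> 12:00:05 (3 s)
# 12:00:06 -> 12:00:08 (2 s)
```

Running this per counter shows whether the missing seconds are spread evenly or cluster around busy periods.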

Is there a way to either:

1) Change how events are structured in the Windows app, so that every counter for a specific object and instance is on one line (e.g. cpu0 would have per-second %user, %privileged, %interrupt and %idle; cpu1 would have the same, etc.)?

2) Ensure that the data is sent to Splunk consistently?
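Option 1 amounts to pivoting one-counter-per-event records into one row per second and instance, then keeping only the rows that have every field the calculation needs. The sketch below illustrates that reshaping outside Splunk; it assumes each event has already been parsed into a dict with `time`, `instance`, `counter` and `Value` keys (names modelled on the one-counter-per-event layout, but treat them as assumptions), and `pivot`/`complete_rows` are hypothetical helpers.

```python
from collections import defaultdict

def pivot(events):
    """Merge one-counter-per-event records into one row per (time, instance)."""
    rows = defaultdict(dict)
    for event in events:
        rows[(event["time"], event["instance"])][event["counter"]] = event["Value"]
    return dict(rows)

def complete_rows(rows, required):
    """Drop rows missing any counter the downstream calculation needs."""
    return {key: vals for key, vals in rows.items()
            if all(name in vals for name in required)}

REQUIRED = ["% Idle Time", "% Interrupt Time", "% Privileged Time", "% User Time"]

events = [
    {"time": 0, "instance": "0", "counter": "% Idle Time", "Value": 90.0},
    {"time": 0, "instance": "0", "counter": "% Interrupt Time", "Value": 1.0},
    {"time": 0, "instance": "0", "counter": "% Privileged Time", "Value": 4.0},
    {"time": 0, "instance": "0", "counter": "% User Time", "Value": 5.0},
    # At time 1 only one counter arrived, mimicking the gaps seen in PROD.
    {"time": 1, "instance": "0", "counter": "% Idle Time", "Value": 88.0},
]
rows = pivot(events)
print(len(rows), len(complete_rows(rows, REQUIRED)))  # 2 1
```

Dropping incomplete rows keeps the graphs correct but hides the gaps; whether that is acceptable depends on the reporting requirement.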

Currently the config looks a bit like this (just CPU shown for brevity):

[perfmon://Processor]
interval = 1
object = Processor
counters = % Idle Time; % Interrupt Time; % Privileged Time; % User Time
instances = *
disabled = 0
index = os

Any help would be appreciated!!

Cheers,
Alex


ltrand
Contributor

The symptoms you are experiencing are probably caused by race conditions: the system doesn't answer the poll in time because it is busy with other work.

I would start by asking why you need per-second resolution. Would a 30-second resolution be acceptable, so that you can absorb race conditions and other hold times when resources are not available? We currently monitor performance metrics at 5-minute intervals unless we are actively troubleshooting, because per-second resolution puts a heavy burden on the system.
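To put rough numbers on that burden, the event count scales with the number of polls per day times counters times instances. This back-of-the-envelope sketch assumes one event per counter per instance on every poll, and a hypothetical host with 8 logical CPUs plus `_Total` (9 instances) running the question's 4-counter Processor stanza; real volumes depend on your hardware and configuration.

```python
SECONDS_PER_DAY = 86_400

def events_per_day(interval_s, counters, instances):
    """Events per day for one perfmon stanza, assuming one event per
    counter per instance on every poll."""
    return (SECONDS_PER_DAY // interval_s) * counters * instances

# Hypothetical host: 8 logical CPUs + _Total = 9 instances, 4 counters.
for interval in (1, 30, 300):
    print(f"{interval:>3} s -> {events_per_day(interval, 4, 9):,} events/day")
#   1 s -> 3,110,400 events/day
#  30 s -> 103,680 events/day
# 300 s -> 10,368 events/day
```

Going from 1-second to 30-second polling cuts this stanza's volume by a factor of 30, before any other objects are added.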

Also keep in mind that Splunk really isn't the best tool for real-time monitoring. It can do it, but dedicated real-time tools will do it better. Even if you get the data in, you won't be able to view it in real time, because the searches take longer to run than the data set takes to refresh.


ltrand
Contributor

I would investigate a good real-time monitoring solution and use Splunk for your logs rather than for the real-time monitoring. Microsoft SCOM, Cacti (goes well with Nagios), or a number of other alternatives are going to be better for real-time monitoring.

If you want the data in Splunk, you could write a local script that polls every second and writes the results to a CSV file, then have Splunk monitor that file. That way, if a polling period is missed, the data is still there.
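A minimal Python sketch of that approach is below. It assumes a hypothetical `read_counters()` callable that returns one poll's values as `{counter_name: value}` (on Windows this might wrap a perf-counter library such as pywin32's `win32pdh`, which is not shown). Each row is timestamped locally and flushed immediately. Counters that a poll fails to return become empty cells instead of silently vanishing. The `clock`/`sleep` parameters exist only so the loop can be tested without waiting.

```python
import csv
import os
import time
from datetime import datetime, timezone

def poll_to_csv(path, read_counters, fieldnames, samples, interval=1.0,
                clock=time.monotonic, sleep=time.sleep):
    """Append one timestamped row per poll to a CSV file.

    read_counters is a hypothetical callable returning {counter: value} for
    a single poll; counters it fails to return are written as empty cells.
    """
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["timestamp", *fieldnames],
                                extrasaction="ignore")
        if write_header:
            writer.writeheader()
        next_tick = clock()
        for _ in range(samples):
            row = {"timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds")}
            row.update(read_counters())
            writer.writerow(row)
            f.flush()  # hand each row to the OS promptly so a file monitor sees it
            next_tick += interval
            delay = next_tick - clock()
            if delay > 0:
                sleep(delay)
            else:
                next_tick = clock()  # poll overran: resync rather than burst to catch up
```

The resulting CSV would then be picked up by an ordinary Splunk file monitor input. Scheduling against a monotonic clock, rather than sleeping a fixed second after each poll, keeps the interval from drifting as poll times vary.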

As a personal aside, I'm curious which regulatory body requires per-second performance monitoring. Sounds pretty interesting!


alekksi
Communicator

My requirement was per-second granularity, and this has a lot to do with regulatory requirements. We already have Nagios, which does 3 to 5 minute charting of some statistics, but it isn't particularly useful beyond alerting the support team.

While we aren't doing much streaming of real-time charts, we still use per-second granularity for historical investigation. We are looking to correlate past events with high accuracy.
