Splunk On Splunk performance measurements for universal forwarders are showing constant CPU percentage

laserval
Communicator

I have a bunch of Universal Forwarders running on 64bit linux systems, that are forwarding the data from TA-SoS to an indexer running on Windows.

The "ps" output for my forwarders shows almost constant values, and the same CPU percentage across different machines (with varying numbers of cores).

When running

index=sos sourcetype="ps" | multikv | where COMMAND=="splunkd" | timechart range(pctCPU) by host

I get flat lines for all forwarders. My indexer is running the Windows ps script, and I get varying measurements for that. If I go back to when a forwarder started, the CPU% shows a peak but then settles to the constant value. Currently the forwarders claim to use a constant 0.4% CPU over several days.

I also have scripts running that use top to monitor all applications on the machines, including splunkd, and they are showing variations of CPU% between 0.1-1.5 or so depending on traffic.

Why am I not getting correct CPU% measurements from TA-sos?

edit:
I read up on ps and what it does, and there seems to be a difference in how it and top work:
http://unix.stackexchange.com/questions/58539/top-and-ps-not-showing-the-same-cpu-result

Essentially, ps only reports CPU usage averaged over the lifetime of the process, while top samples usage over an interval. Perhaps forwarders simply vary too little in CPU usage for the lifetime value to change? This makes me wonder how useful it is for detecting spikes in CPU usage on forwarders.
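
To make the arithmetic concrete, here is a minimal Python sketch of the two ways of computing CPU%. The numbers are made up for illustration (a hypothetical forwarder that has been up for 3 days at 0.4% and then spends 60 seconds fully busy on one core), and the function names are mine, not anything from ps, top, or TA-sos:

```python
def lifetime_pct(cpu_seconds, elapsed_seconds):
    """CPU% as a lifetime average: total CPU time over total process lifetime."""
    return 100.0 * cpu_seconds / elapsed_seconds


def interval_pct(cpu_before, cpu_after, interval_seconds):
    """CPU% over a single sampling interval, similar in spirit to top."""
    return 100.0 * (cpu_after - cpu_before) / interval_seconds


# Hypothetical forwarder: up for 3 days, averaging 0.4% of one core.
uptime = 3 * 24 * 3600        # 259200 seconds
cpu_used = 0.004 * uptime     # about 1036.8 seconds of CPU time

# Assumed spike: 60 seconds fully busy on one core.
spike = 60
cpu_after = cpu_used + spike
uptime_after = uptime + spike

print(f"lifetime before spike: {lifetime_pct(cpu_used, uptime):.1f}%")
print(f"lifetime after spike:  {lifetime_pct(cpu_after, uptime_after):.1f}%")
print(f"interval during spike: {interval_pct(cpu_used, cpu_after, spike):.1f}%")
```

With these assumed numbers the lifetime figure prints 0.4% both before and after the spike, while the interval figure prints 100.0%. The longer the process has been running, the less any single burst can move the lifetime average.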

1 Solution

hexx
Splunk Employee

I think you nailed it with your latest edit. From the man page of /usr/bin/ps:

CPU usage is currently expressed as the percentage of time spent running **during the entire lifetime of a process**.

From the man page of /usr/bin/top:

k: %CPU -- CPU usage

The task’s share of the elapsed CPU time **since the last screen update**, expressed as a percentage of total CPU time.

hexx
Splunk Employee

ps_sos.ps1 fetches per-process CPU usage from WMI:

$pctCPU = get-wmiobject Win32_PerfFormattedData_PerfProc_Process -Filter "IDProcess = $myPID" | select -expand PercentProcessorTime

I believe that this yields usage over the sample period (5s by default), which makes spikiness a lot more noticeable of course.

laserval
Communicator

Yes, but good to get confirmation from someone else as well.

I wonder if the Windows ps_sos script handles this the same way.

I would probably prefer that the script used top, but perhaps there are portability or other reasons for the choice of data source. Measuring CPU usage isn't straightforward, I guess.
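
For what it's worth, a sampling measurement on Linux does not strictly need top. Below is a rough Python sketch, not how TA-sos works, that reads the cumulative CPU ticks for a process from /proc/<pid>/stat twice and turns the difference into a percentage of one core. The function names and the 5 second default are my own assumptions; the field positions (utime and stime as fields 14 and 15) come from the proc(5) man page:

```python
import os
import time


def cpu_ticks_from_stat(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the last ')' before tokenizing.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the state (field 3 in proc(5)); utime and stime are
    # fields 14 and 15, which makes them indexes 11 and 12 here.
    return int(fields[11]) + int(fields[12])


def pct_from_ticks(ticks_before, ticks_after, elapsed_seconds, ticks_per_second):
    """Convert a tick delta over an interval into a percentage of one core."""
    cpu_seconds = (ticks_after - ticks_before) / ticks_per_second
    return 100.0 * cpu_seconds / elapsed_seconds


def sample_cpu_pct(pid, interval_seconds=5.0):
    """Sample CPU% for one process over an interval (Linux only)."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    path = f"/proc/{pid}/stat"
    with open(path) as f:
        before = cpu_ticks_from_stat(f.read())
    started = time.monotonic()
    time.sleep(interval_seconds)
    with open(path) as f:
        after = cpu_ticks_from_stat(f.read())
    elapsed = time.monotonic() - started
    return pct_from_ticks(before, after, elapsed, ticks_per_second)
```

Calling `sample_cpu_pct(os.getpid())` would report the usage of the Python process itself over the next 5 seconds. Whether something like this is portable enough across the Unix flavors the app supports is a separate question, which may be one reason for the ps-based approach.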
