The output from both the script and the command the script runs is correct:
server# sar -P ALL 1 1 | awk 'BEGIN {print "CPU pctUser pctNice pctSystem pctIowait pctIdle"} /Average|Linux|^$|%/ {next} (NR==1) {next} {cpu=$3; pctUser=$4; pctNice=$5; pctSystem=$6; pctIowait=$7; pctIdle=$NF} {printf "%-3s %9s %9s %9s %9s %9s\n", cpu, pctUser, pctNice, pctSystem, pctIowait, pctIdle}' header="CPU pctUser pctNice pctSystem pctIowait pctIdle"
CPU pctUser pctNice pctSystem pctIowait pctIdle
all 6.31 0.00 0.06 0.00 93.63
0 0.99 0.00 0.00 0.00 99.01
1 100.00 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00 100.00
3 0.00 0.00 0.00 0.00 100.00
4 0.00 0.00 0.00 0.00 100.00
5 0.00 0.00 0.00 0.00 100.00
6 0.00 0.00 0.00 0.00 100.00
7 0.00 0.00 0.00 0.00 100.00
8 0.99 0.00 0.00 0.00 99.01
9 0.99 0.00 0.00 0.00 99.01
10 0.00 0.00 0.00 0.00 100.00
11 0.00 0.00 0.00 0.00 100.00
12 0.00 0.00 1.00 0.00 99.00
13 0.00 0.00 0.00 0.00 100.00
14 0.00 0.00 0.00 0.00 100.00
15 0.00 0.00 0.00 0.00 100.00
However, the indexed result seems off. The number of fields is correct and the field headers are correct, but the values in the CPU field are wrong: they look like they belong in the pctUser column (compare the first three rows with those above). I modified the script so that the command would also be indexed.
**Note that the header below is not included in the indexed source; I added it for clarity.**
CPU pctUser pctNice pctSystem pctIowait pctIdle
6.31 0.00 0.00 0.00 0.00 93.69
1.00 0.00 0.00 0.00 0.00 99.00
100.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 100.00
0.00 0.00 0.00 0.00 0.00 100.00
0.00 0.00 0.00 0.00 0.00 100.00
0.00 0.00 0.00 0.00 0.00 100.00
0.00 0.00 0.00 0.00 0.00 100.00
0.00 0.00 0.00 0.00 0.00 100.00
1.00 0.00 0.00 0.00 0.00 99.00
0.00 0.00 0.00 0.00 0.00 100.00
0.00 0.00 0.00 0.00 0.00 100.00
0.00 0.00 0.00 0.00 0.00 100.00
0.00 0.00 0.00 0.00 0.00 100.00
0.00 0.00 0.00 0.00 0.00 100.00
0.00 0.00 0.00 0.00 0.00 100.00
0.00 0.00 0.00 0.00 0.00 100.00
Cmd = [sar -P ALL 1 1]; | awk 'BEGIN {print "CPU pctUser pctNice pctSystem pctIowait pctIdle"} /Average|Linux|^$|%/ {next} (NR==1) {next} {cpu=$3; pctUser=$4; pctNice=$5; pctSystem=$6; pctIowait=$7; pctIdle=$NF} {printf "%-3s %9s %9s %9s %9s %9s\n", cpu, pctUser, pctNice, pctSystem, pctIowait, pctIdle}' header="CPU pctUser pctNice pctSystem pctIowait pctIdle"
What do you think is going on here?
This is now listed as a known issue for the TA, with a workaround:
http://docs.splunk.com/Documentation/UnixAddOn/5.2.4/User/Releasenotes
I know this is an ancient question, but here is the issue.
When you run the script locally, it takes your locale into account, and sar outputs something like this:
08:28:37 PM CPU %user %nice %system %iowait %steal %idle
08:28:38 PM all 11.14 0.00 0.70 0.00 0.00 88.15
08:28:38 PM 0 98.99 0.00 1.01 0.00 0.00 0.00
When Splunk runs it, sar uses a different time format, most likely that of the POSIX locale, so the output above becomes:
20:28:37 CPU %user %nice %system %iowait %steal %idle
20:28:38 all 11.14 0.00 0.70 0.00 0.00 88.15
20:28:38 0 98.99 0.00 1.01 0.00 0.00 0.00
So when the script feeds that output to awk, every value is off by one field because the AM/PM designation is missing.
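The shift can be reproduced without sar by feeding awk one data line in each time format. This is a minimal sketch using the sample lines quoted above; the variable names are made up for illustration.

```shell
# One sar data line per time format, copied from the samples above.
with_ampm='08:28:38 PM all 11.14 0.00 0.70 0.00 0.00 88.15'
without_ampm='20:28:38 all 11.14 0.00 0.70 0.00 0.00 88.15'

# The script's FORMAT reads the CPU name from $3 and pctUser from $4.
cpu_12h=$(printf '%s\n' "$with_ampm" | awk '{print $3, $4}')
cpu_24h=$(printf '%s\n' "$without_ampm" | awk '{print $3, $4}')

echo "12-hour clock: $cpu_12h"   # all 11.14
echo "24-hour clock: $cpu_24h"   # 11.14 0.00
```

With the 24-hour line, $3 already holds the %user value, which matches the indexed events in the question.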
The fix:
Either modify the environment of the user that splunkd runs as and add LC_TIME=en_US (or some other locale that adds AM/PM),
or
add LC_TIME=en_US to the last line of the script, before $CMD:
LC_TIME=en_US $CMD | tee $TEE_DEST | $AWK "$HEADERIZE $FILTER $FORMAT $PRINTF" header="$HEADER"
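If changing the locale is not an option, another approach is to count fields from the end of the line, so that the presence or absence of AM/PM no longer matters. This is only a sketch: it assumes sar prints exactly the columns shown above (CPU, %user, %nice, %system, %iowait, %steal, %idle), parse_line is a hypothetical helper, and the PRINTF string is copied from the debug output in the question. If your sysstat version prints more or fewer columns, the offsets need adjusting.

```shell
# Offsets are counted back from the last field (%idle), so a leading
# AM/PM column does not shift them. Assumes %steal is the second-to-last column.
FORMAT='{cpu=$(NF-6); pctUser=$(NF-5); pctNice=$(NF-4); pctSystem=$(NF-3); pctIowait=$(NF-2); pctIdle=$NF}'
PRINTF='{printf "%-3s %9s %9s %9s %9s %9s\n", cpu, pctUser, pctNice, pctSystem, pctIowait, pctIdle}'

# Hypothetical helper: run one sar data line through the awk program.
parse_line() {
    printf '%s\n' "$1" | awk "$FORMAT $PRINTF"
}

out_12h=$(parse_line '08:28:38 PM 0 98.99 0.00 1.01 0.00 0.00 0.00')
out_24h=$(parse_line '20:28:38 0 98.99 0.00 1.01 0.00 0.00 0.00')

printf '%s\n%s\n' "$out_12h" "$out_24h"   # both lines should be identical
```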
I found a solution for my Ubuntu installation!
However, I did not find the cause 😞
Here it goes...
Everything looks fine as long as the data is collected by the sar command, but for some unknown reason sar fails after some time and the mpstat command is used instead, which is the fallback in the cpu.sh script.
The problem, however, is that the output of 'sar -P ALL 1 1' and 'mpstat -P ALL 1 1' on my Ubuntu and Debian installations isn't what the cpu.sh script expects.
So my solution was to never use the sar command and always use mpstat instead, and to change the FORMAT part to suit the actual output:
--
if [ "x$KERNEL" = "xLinux" ] ; then
    queryHaveCommand sar
    FOUND_SAR=$?
    queryHaveCommand mpstat
    FOUND_MPSTAT=$?
    # if [ $FOUND_SAR -eq 0 ] ; then
    # CMD='sar -P ALL 1 1'
    # FORMAT='{cpu=$3; pctUser=$4; pctNice=$5; pctSystem=$6; pctIowait=$7; pctIdle=$NF}'
    # FORMAT='{cpu=$NF-7; pctUser=$NF-6; pctNice=$NF-5; pctSystem=$NF-4; pctIowait=$NF-1; pctIdle=$NF}'
    if [ $FOUND_MPSTAT -eq 0 ] ; then
        CMD='mpstat -P ALL 1 1'
        # FORMAT='{cpu=$(NF-9); pctUser=$(NF-8); pctNice=$(NF-7); pctSystem=$(NF-6); pctIowait=$(NF-5); pctIdle=$(NF-1)}'
        FORMAT='{cpu=$(NF-10); pctUser=$(NF-9); pctNice=$(NF-8); pctSystem=$(NF-7); pctIowait=$(NF-6); pctIdle=$(NF-1)}'
    else
        failLackMultipleCommands sar mpstat
    fi
    FILTER='/Average|Linux|^$|%/ {next} (NR==1) {next}'
elif [ "x$KERNEL" = "xSunOS" ] ; then
--
Edit:
I discovered that the data retrieved wasn't correct, so I had to edit the FORMAT line again. These work for 'mpstat -P ALL 1 1'.
For my Ubuntu 14.04:
FORMAT='{cpu=$(NF-10); pctUser=$(NF-9); pctNice=$(NF-8); pctSystem=$(NF-7); pctIowait=$(NF-6); pctIdle=$(NF)}'
and for my Debian 7.8:
FORMAT='{cpu=$(NF-9); pctUser=$(NF-8); pctNice=$(NF-7); pctSystem=$(NF-6); pctIowait=$(NF-5); pctIdle=$(NF)}'
I believe you have to make your own format line that fits your distribution/version.
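Instead of hand-tuning offsets per distribution, the field positions could be read from the header row that sar and mpstat print before the data. The following is a rough sketch, not what the TA does: parse_by_header is a hypothetical function, the sample input is the sar debug output posted elsewhere in this thread, and the %usr and %sys fallbacks are my assumption about the column names some mpstat versions use.

```shell
parse_by_header() {
    awk '
        # Header row (not the Average one): record the position of each column name.
        /%idle/ && !/^Average/ {
            for (i = 1; i <= NF; i++) col[$i] = i
            usr_i = ("%user" in col) ? col["%user"] : col["%usr"]
            sys_i = ("%system" in col) ? col["%system"] : col["%sys"]
            next
        }
        # Skip the banner line, blank lines and the Average block.
        /^Linux|^Average|^$/ { next }
        # Data rows: look each value up by the position learned from the header.
        ("CPU" in col) {
            printf "%-3s %9s %9s %9s %9s %9s\n", $(col["CPU"]), $usr_i, $(col["%nice"]), $sys_i, $(col["%iowait"]), $(col["%idle"])
        }
    '
}

sample='Linux 3.13.0-57-generic (host2.domain.de) 07/23/2015 x86_64 (1 CPU)
05:03:52 AM CPU %user %nice %system %iowait %steal %idle
05:03:53 AM all 7.58 0.00 0.88 41.79 0.00 49.75
05:03:53 AM 0 3.12 0.00 1.04 10.42 0.00 85.42
Average: CPU %user %nice %system %iowait %steal %idle
Average: all 7.58 0.00 0.88 41.79 0.00 49.75
Average: 0 3.12 0.00 1.04 10.42 0.00 85.42'

parsed=$(printf '%s\n' "$sample" | parse_by_header)
printf '%s\n' "$parsed"
```

Because the header and the data rows carry the same leading time columns, this approach is also indifferent to whether AM/PM is printed.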
I have the same problem, and I have a Splunk ticket open on it, but as yet no solution. To complicate matters further, a restart does work ... but if the restart is initiated via a deployment-server reload, the problem isn't corrected. If a restart is issued 'outside' of Splunk, via 'sudo -u splunk_user $SPLUNK_HOME/bin/splunk restart' or 'sudo /sbin/service splunk_service_name restart', then the problem does clear up (until it resurfaces again).
But, if splunkd itself initiates the restart ... nothing changes.
I have the exact same problem. However, when I restart Splunk, the CPU numbering works correctly for a while, and then the problem reappears.
Restart Splunk, and everything is OK again.
I have the very same problem and I can't get it fixed.
From my debugging, it seems to be a problem with awk behaving differently between versions:
Buggy host:
./cpu.sh --debug
CPU pctUser pctNice pctSystem pctIowait pctIdle
1.00 0.00 0.00 0.00 0.00 99.00
1.00 0.00 0.00 0.00 0.00 99.00
Linux 3.16.0-4-686-pae (myhost.domain.de) 07/23/15 i686 (1 CPU)
04:58:15 CPU %user %nice %system %iowait %steal %idle
04:58:16 all 1.00 0.00 0.00 0.00 0.00 99.00
04:58:16 0 1.00 0.00 0.00 0.00 0.00 99.00
Average: CPU %user %nice %system %iowait %steal %idle
Average: all 1.00 0.00 0.00 0.00 0.00 99.00
Average: 0 1.00 0.00 0.00 0.00 0.00 99.00
Cmd = [sar -P ALL 1 1]; | awk 'BEGIN {print "CPU pctUser pctNice pctSystem pctIowait pctIdle"} /Average|Linux|^$|%/ {next} (NR==1) {next} {cpu=$3; pctUser=$4; pctNice=$5; pctSystem=$6; pctIowait=$7; pctIdle=$NF} {printf "%-3s %9s %9s %9s %9s %9s\n", cpu, pctUser, pctNice, pctSystem, pctIowait, pctIdle}' header="CPU pctUser pctNice pctSystem pctIowait pctIdle"
Running it manually in the shell works, though:
myhost:/opt/splunkforwarder/etc/apps/Splunk_TA_nix/bin# sar -P ALL 1 1 | awk 'BEGIN {print "CPU pctUser pctNice pctSystem pctIowait pctIdle"} /Average|Linux|^$|%/ {next} (NR==1) {next} {cpu=$3; pctUser=$4; pctNice=$5; pctSystem=$6; pctIowait=$7; pctIdle=$NF} {printf "%-3s %9s %9s %9s %9s %9s", cpu, pctUser, pctNice, pctSystem, pctIowait, pctIdle}' header="CPU pctUser pctNice pctSystem pctIowait pctIdle"
CPU pctUser pctNice pctSystem pctIowait pctIdle
all 0.00 0.00 1.00 0.00 99.000 0.00 0.00 1.00 0.00 99.00
myhost:/opt/splunkforwarder/etc/apps/Splunk_TA_nix/bin# awk -W version
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan
compiled limits:
max NF 32767
sprintf buffer 1020
Working host:
[root|/opt/splunk/etc/apps/Splunk_TA_nix/bin] ./cpu.sh --debug
CPU pctUser pctNice pctSystem pctIowait pctIdle
all 7.58 0.00 0.88 41.79 49.75
0 3.12 0.00 1.04 10.42 85.42
[root|/opt/splunk/etc/apps/Splunk_TA_nix/bin] cat debug--cpu.sh--Thu_Jul_23_05-03-52_CEST_2015
Linux 3.13.0-57-generic (host2.domain.de) 07/23/2015 x86_64 (1 CPU)
05:03:52 AM CPU %user %nice %system %iowait %steal %idle
05:03:53 AM all 7.58 0.00 0.88 41.79 0.00 49.75
05:03:53 AM 0 3.12 0.00 1.04 10.42 0.00 85.42
Average: CPU %user %nice %system %iowait %steal %idle
Average: all 7.58 0.00 0.88 41.79 0.00 49.75
Average: 0 3.12 0.00 1.04 10.42 0.00 85.42
Cmd = [sar -P ALL 1 1]; | awk 'BEGIN {print "CPU pctUser pctNice pctSystem pctIowait pctIdle"} /Average|Linux|^$|%/ {next} (NR==1) {next} {cpu=$3; pctUser=$4; pctNice=$5; pctSystem=$6; pctIowait=$7; pctIdle=$NF} {printf "%-3s %9s %9s %9s %9s %9s\n", cpu, pctUser, pctNice, pctSystem, pctIowait, pctIdle}' header="CPU pctUser pctNice pctSystem pctIowait pctIdle"
Correct output from the shell script, but incorrect from the command line:
[root|/opt/splunk/etc/apps/Splunk_TA_nix/bin] sar -P ALL 1 1 | awk 'BEGIN {print "CPU pctUser pctNice pctSystem pctIowait pctIdle"} /Average|Linux|^$|%/ {next} (NR==1) {next} {cpu=$3; pctUser=$4; pctNice=$5; pctSystem=$6; pctIowait=$7; pctIdle=$NF} {printf "%-3s %9s %9s %9s %9s %9s", cpu, pctUser, pctNice, pctSystem, pctIowait, pctIdle}' header="CPU pctUser pctNice pctSystem pctIowait pctIdle"
CPU pctUser pctNice pctSystem pctIowait pctIdle
all 2.52 0.00 0.38 0.50 96.600 1.04 0.00 1.04 2.08 95.831 0.98 0.00 1.96 1.96 94.122
[root|/opt/splunk/etc/apps/Splunk_TA_nix/bin] awk --version [bernd]
GNU Awk 4.0.1
Copyright (C) 1989, 1991-2012 Free Software Foundation.
Any ideas on how to fix this?
I'm going to try to guess at an answer to this so that it finally goes away from my list of unanswered questions... It could be a permissions or environment issue caused by running the command as whatever user Splunk is running as.
Example 1: root user > sh > sar > awk > splunkd
Example 2: Splunk user > sh > sar > awk > splunkd
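One way to test the environment theory is to log, from inside the script, the locale variables that decide the time format, then compare a manual run with a run started by splunkd. A minimal sketch; show_time_locale is a hypothetical helper and not part of the TA, and where stderr ends up depends on how the script is launched.

```shell
# Hypothetical helper: print the variables that control time formatting.
show_time_locale() {
    # LC_ALL overrides LC_TIME, which overrides LANG.
    # An empty value is reported the same way as an unset one.
    echo "LC_ALL=${LC_ALL:-<unset>} LC_TIME=${LC_TIME:-<unset>} LANG=${LANG:-<unset>}"
}

# Write to stderr so the line does not end up in the data that awk parses.
show_time_locale >&2
```

If the two runs print different values, that would support the idea that the user or environment, rather than awk itself, explains the differing output.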
I've got the same problem. Found that restarting the forwarder tends to help, but not fully resolve the issue. For example, this morning we found a system doing this. Restarted the forwarder, and now about 50% of the events are right, and 50% are wrong...