Solved: NMON Performance Monitor for Unix and Linux System...

davebo1896 · ‎07-29-2016

We have successfully implemented TA-nmon on a few AIX servers which are sending data fine. However, we have a few servers that send data for a short while, then stop. The nmon_helper is working consistently, there are nmon files being created, they just stop getting processed after some time, typically after a few iterations of nmon_helper. I'm really at a loss at this point, and I suspect it may be a problem with the unarchive_cmd. What steps can I take to further troubleshoot this?

guilmxm · ‎08-03-2016

Hello,

Following our exchanges and a detailed analysis, it looks like on these few AIX servers the Archive processor is not correctly called by the TailReader processor, for a reason we unfortunately could not identify.

This issue is visible in the splunkd log messages referring to the TA-nmon activity. In normal circumstances you should see the following sequence of activity:

The TailReader processor identifies the nmon file and waits for the file to stop changing:

INFO  TailReader - Archive file='.../xxxx.nmon' updated less than 10000ms ago, will not read it until it stops changing.

The TailReader processor informs that the file will be read:

INFO  TailReader - Archive file='xxxx.nmon' has stopped changing, will read it now.

The Archive processor manages the nmon file:

INFO  ArchiveProcessor - Handling file=xxxx.nmon
INFO  ArchiveProcessor - reading path=xxxx.nmon (seek=xxxxx len=xxxxx )
INFO  ArchiveProcessor - Finished processing file 'xxxx.nmon', removing from stats

In the present issue, only the TailReader activity was visible; the Archive processor was never called.

As a resolution, I have created a new alternative technical add-on called "TA-nmon_selfmode", currently available on GitHub:
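
To spot this pattern without reading the log by eye, something like the following could be used. It is only a minimal sketch in Python, assuming the splunkd.log messages have exactly the form quoted above; the function name `find_unhandled` is hypothetical and is not part of TA-nmon or Splunk. It counts, per nmon file, how many times TailReader announced a read that was never followed by an ArchiveProcessor "Handling file" message.

```python
import os
import re
from collections import Counter

# Patterns derived from the log lines quoted in this thread (assumed stable).
READY_RE = re.compile(r"TailReader - Archive file='([^']+)' has stopped changing")
HANDLED_RE = re.compile(r"ArchiveProcessor - Handling file='?([^'\s]+)'?")


def find_unhandled(lines):
    """Return {nmon file name: number of announced reads never handled}."""
    pending = Counter()
    for line in lines:
        ready = READY_RE.search(line)
        if ready:
            # TailReader logs a path, ArchiveProcessor may log another form,
            # so both sides are compared by base name only.
            pending[os.path.basename(ready.group(1))] += 1
            continue
        handled = HANDLED_RE.search(line)
        if handled:
            name = os.path.basename(handled.group(1))
            if pending[name] > 0:
                pending[name] -= 1
    return {name: count for name, count in pending.items() if count > 0}


if __name__ == "__main__":
    import sys

    # Usage: python check_archive.py /path/to/splunkd.log
    with open(sys.argv[1], errors="replace") as log:
        for name, count in sorted(find_unhandled(log).items()):
            print(f"{name}: {count} read(s) announced but never handled")
```

A file that keeps appearing in this report on an affected server, and never on a healthy one, would match the behaviour described here. File names containing spaces are not handled by this sketch.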

https://github.com/guilhemmarchand/TA-nmon_selfmode

This technical add-on is the same as the standard TA-nmon provided with the core application, except that it does not use the unarchive_cmd feature, which is at the root of this issue.

The TA-nmon_selfmode implements a replacement input script, "nmon_manage.sh", that is scheduled and executed by Splunk.
This script searches for nmon files to manage based on their modification time (mtime).
When nmon files need to be managed, their content is transparently sent to the nmon2csv parsers.
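
The general idea of mtime-based selection can be illustrated as follows. This is not the actual nmon_manage.sh (which is a shell script in the repository above); it is a rough Python sketch under stated assumptions: the directory, the `last_run_epoch` value and the parser command are all hypothetical parameters, and the parser is assumed to read the nmon content on standard input.

```python
import glob
import os
import subprocess


def files_changed_since(directory, last_run_epoch):
    """List *.nmon files in `directory` modified after `last_run_epoch`."""
    changed = []
    for path in glob.glob(os.path.join(directory, "*.nmon")):
        # Select on modification time (mtime), as described above.
        if os.path.getmtime(path) > last_run_epoch:
            changed.append(path)
    return sorted(changed)


def send_to_parser(path, parser_cmd):
    """Stream the file content to `parser_cmd` on stdin; return its exit code.

    `parser_cmd` is a hypothetical argument list, for example the path to
    a parser wrapper; it is assumed to read from standard input.
    """
    with open(path, "rb") as handle:
        return subprocess.run(parser_cmd, stdin=handle, check=False).returncode
```

A scheduled caller would also need to remember the time of its previous run (for example in a small state file) and pass it as `last_run_epoch`; that part is left out of the sketch.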

There is no loss of features and no change in processing cost; only the way Splunk monitors nmon files differs.

The TA-nmon_selfmode will be included in the next release of the Nmon Performance Monitor core application as an alternative to the standard TA-nmon for any server that runs into this kind of issue.
The online documentation will soon be updated to reflect this new feature and these changes: http://nmonsplunk.wikidot.com/

Guilhem

guilmxm · ‎07-29-2016

Hi !

Well, I would suspect an issue related to system limits that may be set too low on the troubled servers.

Under AIX, system limits are very restrictive (number of open files, memory limits...) and this is often a root cause of issues when running Splunk on AIX.
So I would recommend comparing the limit settings between good and bad servers to see if there is any difference.
Try setting these limits to higher or unlimited levels.

Something else to check: look carefully at splunkd.log, since hitting these limits should generate warning or error messages.

Finally, you may also need to compare OS levels and Splunk versions; a particular OS / Splunk version combination could be causing trouble.

Something that could be related: are both the good and bad UFs running as root, or as a standard UNIX user?

In the worst case, a simple workaround is to schedule a script that runs the nmon2csv.sh parser against the available NMON files, but there should be no reason to have to do this.

Don't hesitate to revert; if you like, you can contact me by mail directly (on the app page, "contact developer").

Guilhem

davebo1896 · ‎08-01-2016

I compared a working system with a non-working system.

ulimits are the same
There are no messages about limits in the splunkd.log

OS is 7.1.3.16 on both, splunk UF is 6.4.2 on both

UF is running as the "splunk" user, not root.

What do you mean by "Don't hesitate to revert" - should I try an older version of the TA-nmon ?

guilmxm · ‎08-01-2016

Hi,

I mean, don't hesitate to inform me 😉

OK, as I said earlier, please check the system limits on the troubled servers.

Check this:

http://docs.splunk.com/Documentation/Splunk/latest/Troubleshooting/ulimitErrors

Verify the limits set on your servers.

davebo1896 · ‎08-01-2016

ulimits are the same on good and bad:
+++ ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) unlimited
pipe size (512 bytes, -p) 64
stack size (kbytes, -s) hard
cpu time (seconds, -t) unlimited
max user processes (-u) 3072
virtual memory (kbytes, -v) unlimited
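
For anyone repeating this comparison across several servers, the check can be scripted. The following is a minimal Python sketch, assuming `ulimit -a` output in the layout pasted above (name, flag in parentheses, value as the last word on the line); the function names `parse_ulimit` and `diff_limits` are made up for the illustration.

```python
def parse_ulimit(output):
    """Parse `ulimit -a` text into {limit name with flag: value}."""
    limits = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip anything that is not a limit line, e.g. "+++ ulimit -a".
        if "(" not in line:
            continue
        # The value is the last whitespace-separated word on the line.
        name, _, value = line.rpartition(" ")
        limits[name.strip()] = value
    return limits


def diff_limits(good_output, bad_output):
    """Return {limit: (good value, bad value)} for limits that differ."""
    good = parse_ulimit(good_output)
    bad = parse_ulimit(bad_output)
    return {
        key: (good.get(key), bad.get(key))
        for key in sorted(set(good) | set(bad))
        if good.get(key) != bad.get(key)
    }
```

An empty result from `diff_limits` corresponds to the situation reported here: identical limits on the working and non-working servers.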

guilmxm · ‎08-01-2016

Ok.
Can we exchange by mail ?
You can contact me through Splunk base, from the app main page.

I think we need to verify that nmon files are being correctly generated (I think you said they are), and verify that Splunk is calling the unarchive command every time the nmon file is updated (you can see it in splunkd.log).

Can you check the troubleshooting guide:

http://nmonsplunk.wikidot.com/documentation:userguide:troubleshoot:troubleguide

NMON Performance Monitor for Unix and Linux Systems: How to troubleshoot why AIX NMON stops processing data?
