Nagios Linux Performance Graphs

jbaileyicw · ‎10-21-2011

I'm not getting anything for Memory Usage. What should the plugin output look like? I think that's my problem.

mkeys · ‎12-08-2011

Luke,

You 'da man! Both pnp4nagios and your app are working in conjunction now. The linux performance graphs I spoke of earlier are still empty but I'm sure it's something simple I'm missing.

On a related note, I noticed a new bug while re-checking everything. When I click "Livestatus Dashboard", the top line (host up-down-unreachable) populates correctly but the next line (services ok-warning-critical-unknown) shows 0s for everything. If I click the Livestatus Dashboard tab again to reload the dashboard, it then populates the total service numbers (1092) correctly. I'm not sure if it's just a delay in mk or what.

lukeh · ‎12-07-2011

Hi Matt 🙂

The nagios.log file contains alerts and notifications etc but the performance data is logged to a separate file, either:

"service-perfdata" to be ingested into splunk with a sourcetype of "nagiosserviceperf"
 or
"splunk-nagios-perfdata" to be ingested into splunk with a sourcetype of "nagiosperfdata"

Use the latter if using pnp4nagios 🙂

You'll need to update the pnp4nagios script to output the performance data to an additional log file and ingest the new file into splunk with a new sourcetype. This way the performance data log format does not need to change and your rrd graphs will continue to work.

1/ Update the 'Bulk Mode' section within "process_perfdata.pl" as follows:

        print_log( "reading $pdfile for bulk update", 2 );
        open (SPLUNK, '>>/opt/nagios/var/splunk-nagios-perfdata');
        open( PDFILE, "< $pdfile" );
        my $count = 0;
        while (<PDFILE>) {
            $count++;
            print_log( "Processing Line $count", 2 );
            my @LINE = split(/\t/);
            %ENV = ();    # cleaning ENV
            foreach my $k (@LINE) {
                $k =~ /([A-Z 0-9_]+)::(.*)$/;
                $ENV{ 'NAGIOS_' . $1 } = $2 if ($2);
            }
            print SPLUNK "$_\n";
            if ( $ENV{NAGIOS_SERVICEPERFDATA} || $ENV{NAGIOS_HOSTPERFDATA} ) {
                parse_env();
                process_perfdata();
            }
            else {
                print_log( "No Perfdata. Skipping line $count", 2 );
            }
        }

        print_log( "$count Lines processed", 1 );

        if ( unlink("$pdfile") == 1 ) {
            print_log( "$pdfile deleted", 1 );
        }
        else {
            print_log( "Could not delete $pdfile:$!", 1 );
        }

    }
    else {
        print_log( "ERROR: File $opt_b not found", 1 );
    }
close (SPLUNK);
}

Note: only the following three new lines should be added to your existing script:

open (SPLUNK, '>>/opt/nagios/var/splunk-nagios-perfdata');
print SPLUNK "$_\n";
close (SPLUNK);

Replace /opt/nagios with the relevant path for your installation 🙂

2/ Update "$SPLUNK_HOME/etc/apps/SplunkForNagios/default/props.conf" with the following new sourcetype:

[nagiosperfdata]
EXTRACT-datatype = DATATYPE::(?P<datatype>[^\t]*)
EXTRACT-src_host = HOSTNAME::(?P<src_host>[^\t]*)
EXTRACT-name = SERVICEDESC::(?P<name>[^\t]*)
EXTRACT-result = SERVICEPERFDATA::(?P<result>[^\t]*)
EXTRACT-process = SERVICECHECKCOMMAND::(?P<process>[^\t]*)
EXTRACT-hoststate = HOSTSTATE::(?P<hoststate>[^\t]*)
EXTRACT-hoststatetype = HOSTSTATETYPE::(?P<hoststatetype>[^\t]*)
EXTRACT-state = SERVICESTATE::(?P<state>[^\t]*)
EXTRACT-statetype = SERVICESTATETYPE::(?P<statetype>\w+)
SHOULD_LINEMERGE = false
TIME_PREFIX = TIMET::

3/ Add the new file "splunk-nagios-perfdata" to be ingested into splunk with a sourcetype of "nagiosperfdata"

4/ Update the dashboards in "$SPLUNK_HOME/etc/apps/SplunkForNagios/default/data/ui/views" and change any occurrence of sourcetype="nagiosserviceperf" to sourcetype="nagiosperfdata"
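
One way to sanity-check the extractions from step 2 outside Splunk is to run the same patterns against a sample line. The sketch below uses Python, whose `re` module accepts the same `(?P<name>...)` named-group syntax. The sample line is made up for illustration: it assumes the tab-separated `KEY::value` layout that the patterns target and was not captured from a real server. The `procs=167;250;400;0;` value and the `check_procs` command name are likewise placeholders.

```python
import re

# A subset of the patterns from the [nagiosperfdata] stanza, copied as-is.
EXTRACTIONS = {
    "src_host": r"HOSTNAME::(?P<src_host>[^\t]*)",
    "name": r"SERVICEDESC::(?P<name>[^\t]*)",
    "result": r"SERVICEPERFDATA::(?P<result>[^\t]*)",
    "state": r"SERVICESTATE::(?P<state>[^\t]*)",
    "statetype": r"SERVICESTATETYPE::(?P<statetype>\w+)",
}

# Hypothetical sample line (assumption): tab-separated KEY::value pairs.
SAMPLE = "\t".join([
    "DATATYPE::SERVICEPERFDATA",
    "TIMET::1323234000",
    "HOSTNAME::somehost",
    "SERVICEDESC::Total Processes",
    "SERVICEPERFDATA::procs=167;250;400;0;",
    "SERVICECHECKCOMMAND::check_procs",
    "SERVICESTATE::OK",
    "SERVICESTATETYPE::HARD",
])


def extract_fields(line):
    """Apply each pattern to the line and collect the named groups that match."""
    fields = {}
    for field, pattern in EXTRACTIONS.items():
        match = re.search(pattern, line)
        if match:
            fields[field] = match.group(field)
    return fields


for field, value in extract_fields(SAMPLE).items():
    print(f"{field} = {value}")
```

If a field is missing from the printed output, the corresponding key is probably absent from the line or spelled differently in your perfdata template.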

All the best,

Luke 🙂

P.S. The 'CURRENT SERVICE STATE' events are logged to nagios.log at midnight every day. As they are logged only once per day, they cannot be used for creating performance graphs, hence the requirement to ingest the performance data from the specific log file.
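
The midnight timing can be checked from the bracketed epoch timestamp at the start of a nagios.log line. The sketch below converts the timestamp from a line quoted elsewhere in this thread. The thread does not state the server's timezone, so the UTC-5 offset is an assumption used only to show that the value lands on a local midnight.

```python
from datetime import datetime, timedelta, timezone


def nagios_log_time(line, utc_offset_hours=0):
    """Return the datetime for the leading [epoch] of a nagios.log line."""
    epoch = int(line[1:line.index("]")])
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch, tz)


line = ("[1323234000] CURRENT SERVICE STATE: somehost;Zombie Processes;"
        "OK;HARD;1;PROCS OK: 0 processes with STATE = Z")

print(nagios_log_time(line).isoformat())      # 2011-12-07T05:00:00+00:00
print(nagios_log_time(line, -5).isoformat())  # 2011-12-07T00:00:00-05:00 (assumed offset)
```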

mkeys · ‎12-07-2011

Luke,

Thanks for the pointers, at least I'm looking in the right spot now lol. I'm using the official plugins for the most part. Let's use Zombie Processes for an example. In the NagiosLinuxPerformanceGraphs.xml I've got:

        <module name="HiddenPostProcess" layoutPanel="panel_row4_col2" group="Zombie Processes" autoRun="False">
          <param name="search">timechart span=5m last(Zombies) as Zombies</param>
          <param name="groupLabel">Zombie Processes</param>

The check is outputting the following (snip from /usr/local/nagios/var/nagios.log) :

[1323234000] CURRENT SERVICE STATE: somehost;Zombie Processes;OK;HARD;1;PROCS OK: 0 processes with STATE = Z

From your reply I'm thinking I need to change last(Zombies) but I don't know what it should be. Since there's usually 0 of them, this may be a bad example. 🙂 Moving on to total processes the xml has:

        <module name="HiddenPostProcess" layoutPanel="panel_row4_col1" group="Total Processes" autoRun="False">
          <param name="search">timechart span=5m avg(Processes) as Processes</param>
          <param name="groupLabel">Total Processes</param>

The check outputs :

[1323234000] CURRENT SERVICE STATE: pvirtuadb;Total Processes;OK;HARD;1;PROCS OK: 167 processes
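
A quick way to see whether an extraction pattern would pull a number out of this output is to try it outside Splunk. The sketch below is only an illustration: the group name `Processes` is assumed to mirror the field used in `avg(Processes)`, and both patterns are modelled on a props.conf-style rule, not taken from a verified configuration. It also shows that a variant ending in a literal double quote does not match this unquoted line. Whether that explains the empty graphs is not established here.

```python
import re

# Assumption: the group name "Processes" mirrors the field used by avg(Processes).
UNQUOTED = re.compile(r"PROCS \w+: (?P<Processes>\d+) \w+")
# Same pattern, but requiring a literal double quote right after the last word.
QUOTED = re.compile(r'PROCS \w+: (?P<Processes>\d+) \w+"')


def extract_processes(pattern, text):
    """Return the captured process count as an int, or None if nothing matched."""
    match = pattern.search(text)
    return int(match.group("Processes")) if match else None


log_line = ("[1323234000] CURRENT SERVICE STATE: pvirtuadb;Total Processes;"
            "OK;HARD;1;PROCS OK: 167 processes")

print(extract_processes(UNQUOTED, log_line))  # 167
print(extract_processes(QUOTED, log_line))    # None
```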

Thanks again,
Matt

lukeh · ‎12-06-2011

Hi Matt 🙂

I have updated the doco because it was too vague and misleading, here are the new instructions:
Using your favourite xml editor, update these dashboards with the relevant label/key names for the specific performance data that are in use in your nagios environment. eg. change avg(CpuSystem) to avg(cpu_system) if your performance data for CPU Usage is labelled differently.
Dashboard location: $SPLUNK_HOME/etc/apps/SplunkForNagios/default/data/ui/views

ie. you don't have to change the service_description for any of your nagios checks, you just need to update the dashboards in Splunk for Nagios with your specific performance data label/key names.
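
If there are many dashboards to update, the rename can be scripted instead of done by hand in an editor. The following is a minimal sketch, assuming the dashboards are plain `.xml` files in a single directory and that a literal text replacement is sufficient. The function name `rename_in_dashboards` is hypothetical, and it is worth backing up the views directory before running anything like this.

```python
from pathlib import Path


def rename_in_dashboards(views_dir, old, new):
    """Replace `old` with `new` in every .xml file directly under views_dir.

    Returns the names of the files that were changed.
    """
    changed = []
    for path in sorted(Path(views_dir).glob("*.xml")):
        text = path.read_text(encoding="utf-8")
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed.append(path.name)
    return changed
```

For example, `rename_in_dashboards(views_dir, "avg(CpuSystem)", "avg(cpu_system)")` would apply the CPU label change mentioned above, where `views_dir` is the dashboard location under your own `$SPLUNK_HOME`. The same helper would also cover a sourcetype change such as `sourcetype="nagiosserviceperf"` to `sourcetype="nagiosperfdata"`.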

Feel free to provide examples of your nagios performance data if you would like further assistance and I can recommend the relevant updates to make to your dashboards.

All the best,

Luke 🙂

mkeys · ‎12-06-2011

Luke,

I'm having the same issue with CPU Usage, Network Utilization, Zombie Processes, Total Processes, Swap Usage, and Memory Free/total in the "Nagios Linux Performance Graphs" (I haven't started the Windows side yet). You suggest above to "rename the "name" value in the dashboard to the relevant name of your nagios plugin". Would this be in $SPLUNK_HOME/etc/apps/SplunkForNagios/default/props.conf? My props.conf for CPU and Zombie processes for example has:

[nagiosserviceperf]

EXTRACT-Processes = PROCS \w+: (?P<Processes>\d+) \w+\"

EXTRACT-Zombies = PROCS \w+: (?P<Zombies>\d+) \w+ with STATE = Z

But our Nagios service_description for these are "Total Processes" and "Zombie Processes". I read in a separate answer that you mentioned future releases will be CIM compliant. Are there existing standards for these names? If so and I change the service_description on the Nagios side to these CIM compliant names, will it break the historical data from that point on?

Thanks,
Matt

jbaileyicw · ‎10-24-2011

Hi Luke,

I am using the Nagios Linux Performance Graphs dashboard, and will check the plugin for its perfdata output and see if it matches yours. Thanks.

lukeh · ‎10-24-2011

Please ensure that you rename the "name" value in the dashboard to the relevant name of your nagios plugin, ie. it should be the same as your service_description in nagios.

Example plugin output from "check_mem.pl" is as follows:

TOTAL=12167908KB;;;; USED=2661344KB;;;; FREE=9506564KB;;;; CACHES=9207732KB;;;;
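
To compare your own plugin's output against this example, it can help to split the perfdata string into its labels and values. The sketch below assumes the usual `label=value[unit];warn;crit;min;max` layout with space-separated items and ignores the threshold fields. Quoted labels that contain spaces are not handled. The names `PERF_ITEM` and `parse_perfdata` are made up for this illustration.

```python
import re

# One perfdata item: label=value[unit], optionally followed by ;warn;crit;min;max
PERF_ITEM = re.compile(
    r"(?P<label>[^=\s]+)=(?P<value>-?\d+(?:\.\d+)?)(?P<unit>[^;\s]*)"
)


def parse_perfdata(perfdata):
    """Map each label to a (value, unit) pair; thresholds are ignored."""
    return {
        m["label"]: (float(m["value"]), m["unit"])
        for m in PERF_ITEM.finditer(perfdata)
    }


sample = "TOTAL=12167908KB;;;; USED=2661344KB;;;; FREE=9506564KB;;;; CACHES=9207732KB;;;;"
metrics = parse_perfdata(sample)
print(sorted(metrics))   # ['CACHES', 'FREE', 'TOTAL', 'USED']
print(metrics["USED"])   # (2661344.0, 'KB')
```

If the labels printed for your plugin differ from TOTAL, USED, FREE and CACHES, the dashboard searches would need to reference your labels instead.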

Note: the Splunk for Nagios compatible Memory plugin is available here:

check_mem.pl: http://exchange.nagios.org/directory/Plugins/System-Metrics/Memory/check_mem-2Epl/details

All the best,

Luke 🙂
