We faced similar requirements to send alerts via SNMP traps when certain errors were logged. Initially we tried a scheduled search that executed Splunk's SNMP script if any events were found. We encountered the issues you described.
Our solution was to write a Python script that searched Splunk (using the CLI), analyzed the search results, and sent the SNMP traps. Splunk runs our script as a configured scripted input; the script logs its activity to standard output, which Splunk indexes so we can later show that we really did send the SNMP traps.
Like you, we observed some latency in Splunk's indexing of application log files. We also noted that Splunk's scripted inputs do not run at intervals exact to the second.
To solve these timing issues, we explicitly scoped the time range for each search. The end time for a search was the current time less the latency. The start time was usually the end time of the previous search (stored in a file between runs), but it was never earlier than a maximum search span (typically an hour) before the end time, and never later than the end time. The time range was formatted for Splunk searches.
endTime = int(time.time ()) - indexLatencySeconds
startTime = min (endTime, max (previousEndTime, endTime - maxSpanSeconds))
return "starttimeu=%d endtimeu=%d" % (startTime, endTime)
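For anyone rebuilding this, the window calculation and the "stored in a file between runs" part might look roughly like the sketch below. It assumes Python 3; the function names, the state-file path, and the write-then-rename step are illustrative choices, not our original code.

```python
import os


def load_previous_end(state_path, default):
    """Return the end time saved by the previous run, or default if none."""
    try:
        with open(state_path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        # Missing or unreadable state file: fall back to the default.
        return default


def save_end(state_path, end_time):
    """Persist this run's end time so the next run can resume from it."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(str(end_time))
    os.replace(tmp_path, state_path)  # rename so a crash never leaves a partial file


def time_range(now, previous_end, latency_seconds, max_span_seconds):
    """Compute (start, end) in epoch seconds for the next search window."""
    end = int(now) - latency_seconds
    # Resume from the previous end, but never look back further than the
    # maximum span and never start after the end.
    start = min(end, max(previous_end, end - max_span_seconds))
    return start, end
```

A run would call `load_previous_end`, then `time_range(time.time(), ...)`, and call `save_end` only after the search and alerts succeed, so a failed run is retried over the same window (capped by the maximum span).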
The Python script used the Splunk CLI to run a search. The query formatted the results as pipe-delimited fields. Here is a typical query; our script adds the time constraints to the beginning.
MessageType="FATAL" | fields + host,problem | stats count(problem) as count by host,problem | strcat host "|" count "|" problem formatted | fields + formatted
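Putting the time constraints in front of the query and quoting the result for the shell could be sketched as follows. This assumes Python 3 and that the `splunk` executable is on the PATH; the function names are made up for illustration, and any authentication or output options your installation needs are left out.

```python
import shlex

# The query from the post, split across lines only for readability.
QUERY = ('MessageType="FATAL" | fields + host,problem '
         '| stats count(problem) as count by host,problem '
         '| strcat host "|" count "|" problem formatted '
         '| fields + formatted')


def build_search(start_time, end_time, query=QUERY):
    """Prefix the query with explicit epoch-time bounds."""
    return "starttimeu=%d endtimeu=%d %s" % (start_time, end_time, query)


def build_command(search, splunk_bin="splunk"):
    """Return a shell command line with the search string safely quoted."""
    # splunk_bin is an assumption; use the full path of your installation.
    return "%s search %s" % (shlex.quote(splunk_bin), shlex.quote(search))
```

Quoting matters here because the query contains both double quotes and pipe characters, which the shell would otherwise interpret itself.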
Python runs the complete Splunk search command and collects its outputs as lists of search results and errors:
from subprocess import Popen, PIPE
proc = Popen (command, shell = True, stdin = PIPE, stdout = PIPE, stderr = PIPE, close_fds = True)
(childStdin, childStdout, childStderr) = (proc.stdin, proc.stdout, proc.stderr)
childStdin.close ()
searchResults = childStdout.readlines ()
childStdout.close ()
errorResults = childStderr.readlines()
childStderr.close ()
return (searchResults, errorResults)
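One caveat with reading stdout to the end before touching stderr: if the child process writes enough to stderr to fill the pipe, both sides can block. A variant that avoids this, assuming Python 3.5 or later, is to let `subprocess.run` drain both pipes. The function name and the timeout value below are illustrative, and unlike `readlines ()` this version returns lines without their trailing newlines.

```python
import subprocess


def run_command(args, timeout_seconds=300):
    """Run a command and return (stdout_lines, stderr_lines, return_code)."""
    proc = subprocess.run(
        args,                        # a list of arguments, so no shell is needed
        stdin=subprocess.DEVNULL,    # the child gets no input
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,     # decode output to str
        timeout=timeout_seconds,     # raises TimeoutExpired if the search hangs
    )
    return proc.stdout.splitlines(), proc.stderr.splitlines(), proc.returncode
```

With a list of arguments such as `["splunk", "search", search_string]` there is no shell quoting to worry about, and the return code gives one more thing to log when a search fails.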
The first two lines of the search results contain headers. The remaining lines can easily be transformed into a list of events, each a list of fields:
return [line.rstrip ().split ('|') for line in searchResults [2:-1]]
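A slightly more defensive version of the same parsing step is sketched below, assuming Python 3 and the `host|count|problem` layout produced by the query above. It skips blank or malformed lines instead of slicing off the last line, and it splits at most twice so that a problem message containing a pipe stays in one piece. The function name and the sample lines in the usage note are invented for illustration.

```python
def parse_results(search_results, header_lines=2):
    """Turn pipe-delimited result lines into (host, count, problem) tuples."""
    events = []
    for line in search_results[header_lines:]:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines such as trailing output
        fields = line.split("|", 2)  # the problem text may itself contain "|"
        if len(fields) != 3:
            continue  # not a host|count|problem row
        host, count, problem = fields
        try:
            events.append((host, int(count), problem))
        except ValueError:
            continue  # count was not a number; ignore the row
    return events
```

For example, given two header lines followed by `db02|1|Fatal: something failed`, the function returns `[("db02", 1, "Fatal: something failed")]`.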
In our script, we send an SNMP alert for each resulting event. You could send an email alert instead.
Splunk indexes the script output, which contains log messages. Early on, the team receiving the SNMP traps claimed that we weren't sending them. We added logging throughout our scripts to aid testing and to record their activity in production. Here is an example from the code that sends the SNMP trap:
self.logger.info ('SNMP trap: problemHost=%s count=%s searchName="%s" problem="%s"', host, count, searchName, problem)
The logger adds a timestamp, severity level, etc. in the spirit of Log4j. Here is an example log line generated by the logging statement above:
<UOW-U-WSE-MEL-APPFATAL-FE9301> 2010-05-27 15:42:57,981 [main] INFO SnmpTrapDetails - SNMP trap: problemHost=db02 count=1 searchName="Application Error" problem="Fatal: Attempting to action a nonexistent TCA user PROV100AA to/from ROLE NEEDSBUILDER"
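A logger producing lines of roughly this shape can be configured with Python's standard `logging` module, as in the sketch below (Python 3 assumed). The function names are illustrative, the leading tag in angle brackets from our log line is not reproduced, and the thread name will appear as `MainThread` rather than `main`.

```python
import logging
import sys


def make_logger(name, stream=None):
    """Create a logger that writes Log4j-style lines to the given stream."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s [%(threadName)s] %(levelname)s %(name)s - %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


def log_trap(logger, host, count, search_name, problem):
    """Record one sent trap as key=value pairs that are easy to search later."""
    logger.info('SNMP trap: problemHost=%s count=%s searchName="%s" problem="%s"',
                host, count, search_name, problem)
```

Writing to standard output is what lets Splunk index the lines when the script runs as a scripted input, and the key=value layout makes it straightforward to search for a particular host or problem afterwards.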