Hello,
We're load balancing Exchange servers behind our F5 LTMs. From time to time the Exchange services cycle for maintenance, and the F5 LTM marks the pool down when that happens, which generates an alert we don't care about. My basic objective is to alert only if the pool has gone down (one syslog message) and has not come back up (another syslog message) within 5 minutes. More than 5 minutes means something is really broken; less than 5 minutes is probably nothing to worry about in our environment.
Sample Logs I Don't Want to Alert On (Downtime <5min)
Jan 28 06:40:03 example-ltm Jan 28 06:40:03 slot2/example-ltm notice mcpd[4907]: 01070638:5: Pool /Common/Exchange_ad_pool member /Common/exampleServer:4169 monitor status down. [ /Common/Exchange_ad_http_monitor: down ] [ was up for 125hrs:23mins:32sec ]
Jan 28 06:40:22 example-ltm Jan 28 06:40:22 slot2/example-ltm notice mcpd[4907]: 01070727:5: Pool /Common/Exchange_ad_pool member /Common/exampleServer:4169 monitor status up. [ /Common/Exchange_ad_http_monitor: up ] [ was down for 0hr:0min:19sec ]
Sample Logs I Do Want to Alert On (Downtime >5min)
Jan 28 08:32:23 example-ltm Jan 28 08:32:23 slot2/example-ltm notice mcpd[4907]: 01070638:5: Pool /Common/Exchange_ad_pool member /Common/exampleServer:4169 monitor status down. [ /Common/Exchange_ad_http_monitor: down ] [ was node down for 0hr:0min:14sec ]
Jan 28 08:36:03 example-ltm Jan 28 08:50:03 slot2/example-ltm notice mcpd[4907]: 01070727:5: Pool /Common/Exchange_ad_pool member /Common/exampleServer:4169 monitor status up. [ /Common/Exchange_ad_http_monitor: up ] [ was down for 0hr:3mins:40sec ]
It's important to note that I don't want to wait for the "up" syslog message. In other words, if I see a "down" message but no "up" message within 5 minutes, alert me. Since we have many pools, members, and LTMs (hosts), I'm using the transaction command below to group events by those fields. I want the transaction to start where the member went down (I've already done field extractions for the up/down status) and complete when it comes back up. The thing is, I don't know how to complete it, or how to get alerted if there is no "up" message after 5 minutes. For what it's worth, here's my search:
Splunk Sample Search (Incorrect)
tag=ltm "monitor status" NOT ("monitor status node" OR "Node \/Common*") | transaction f5_pool_name host f5_pool_member startswith(f5_member_status=down) endswith(f5_monitor_status=up) maxspan=5min keeporphans=true
I just want to be alerted if there is no "up" in the transaction within 5 minutes.
Try this:
tag=ltm "monitor status" NOT ("monitor status node" OR "Node \/Common*") | streamstats current=f last(_time) AS prevTime BY f5_pool_name host f5_pool_member | dedup f5_pool_name host f5_pool_member | eval spanSeconds=_time-prevTime | where (f5_member_status="down" AND (now()-_time) >=300) OR (f5_member_status="up" AND spanSeconds>=300)
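If streamstats gives you trouble, another way to get at the same thing is to keep only the most recent status per pool member and check how old it is. This is only a minimal sketch, not a tested alert. It assumes `f5_member_status` is the extracted field holding up/down (the field names are taken from your search), and `last_time` and `last_status` are just names I made up for illustration.

```
tag=ltm "monitor status" NOT ("monitor status node" OR "Node \/Common*")
| stats latest(_time) AS last_time latest(f5_member_status) AS last_status BY host f5_pool_name f5_pool_member
| where last_status="down" AND (now() - last_time) >= 300
```

The stats line reduces each host/pool/member combination to its newest event, and the where line keeps only members whose newest event is a "down" that is at least 300 seconds old, so it does not need to wait for an "up" message. It would probably need to run on a schedule every few minutes over a window comfortably longer than 5 minutes, alerting when the result count is greater than zero. It also only behaves as described if every matching event actually has `f5_member_status` populated.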
I'm running Splunk 6.2.1, which may be why, but I'm getting the following error for the search:
Error in 'streamstats' command: The argument 'prev(_time)' is invalid.
I had a typo; try it now.
No syntax issue now, but the search comes back instantly with no results, even over a 30-day period, when it usually takes a while to run.
Perhaps a field name has a typo or something. Strip off the commands after each pipe, one at a time starting from the end, until you get sensible data, then work back the other way to find and fix the typo. The logic all looks good to me.
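One quick way to check the field names is to count how many events actually carry each field. This is just a sketch under the assumption that the field names are the ones from the searches in this thread; the output names (`total_events`, `with_member_status`, and so on) are made up for illustration.

```
tag=ltm "monitor status" NOT ("monitor status node" OR "Node \/Common*")
| stats count AS total_events count(f5_member_status) AS with_member_status count(f5_monitor_status) AS with_monitor_status count(f5_pool_name) AS with_pool_name count(f5_pool_member) AS with_pool_member
```

If one of the counts comes back as 0 while `total_events` does not, that field is probably misnamed or not being extracted. Note that your original search used both `f5_member_status` and `f5_monitor_status`, so it is worth confirming which of the two really exists.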
Did you get this to work?
Any luck with this alert? We are trying to do the same thing, but we're getting tripped up on the search.
Thanks
None yet. I tried working with support on this a while back, but I was on an older version of Splunk and can't say for sure whether it was a bug. For whatever reason I just couldn't make the command work; it was probably my search string. If you ever get it to work reliably, let me know! I've had to move on to other things, so I never got back to troubleshooting this.