I am looking to create an alert that triggers in real time when an ESXi device logs a lost storage redundancy event that is not followed by a restored event within an hour. In the example below, storage redundancy was lost for the RTP1-VIF19-7072-LUN005 datastore but was restored after five minutes. I want Splunk to generate an alert if redundancy is not restored within an hour for any device.
Example:
Problem event:
2015-04-15T15:30:53+00:00 rtp1-vif064-17 [scsiCorrelator] 2403930999128us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.60001440000000107072444926f563d3 degraded. Path vmhba2:C0:T2:L2 is down. Affected datastores: "RTP1-VIF19-7072-LUN005".
Restore event:
2015-04-15T15:35:53+00:00 rtp1-vif064-17 [411C1B70 info 'Vimsvc.ha-eventmgr'] Event 1294 : Path redundancy to storage device naa.60001440000000107072444926f563d3 (Datastores: RTP1-VIF19-7072-LUN005) restored. Path vmhba2:C0:T2:L2 is active again.
"esx.problem.storage.redundancy.degraded" OR ("Path redundancy to storage device" AND "restored") earliest=-62m
| rex "(?i)datastores:\s*\"?(?<datastore>[^\"\)]+)"
| transaction datastore startswith="esx.problem.storage.redundancy.degraded"
endswith="restored" keepevicted=1
| search "esx.problem.storage.redundancy.degraded"
| where duration>3600 OR eventcount=1
| eval problemStartTime = _time
| eval problemDuration = if(eventcount==1,now()-problemStartTime,duration)
| where problemDuration > 3600
| eval problemStartTime = strftime(problemStartTime,"%x %X")
| eval problemDuration = tostring(problemDuration,"duration")
| table datastore problemStartTime problemDuration
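Schedule the alert to run once every minute and set the trigger condition to "number of results is greater than zero." Because the search looks back 62 minutes and uses keepevicted=1, a degraded event with no matching restored event is kept as a single-event transaction and is reported once it is more than an hour old.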
Also, you can probably make the initial search more efficient by adding terms that narrow it, such as the index or sourcetype that holds your ESXi logs.
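If you would rather define the schedule in a configuration file than in the UI, the scheduling described above could look roughly like the savedsearches.conf stanza below. This is only a sketch: the stanza name is made up for illustration, and the search line is a placeholder for the search shown above, which already sets its own time window with earliest=-62m.

```ini
# Hypothetical stanza name; pick any name you like.
[ESXi storage redundancy not restored]
# Placeholder: paste the full search from above here, on one line.
search = <search from above>
# Run on a schedule, once every minute.
enableSched = 1
cron_schedule = * * * * *
# Trigger when the number of results is greater than zero.
counttype = number of events
relation = greater than
quantity = 0
# Show triggered alerts in the alert manager.
alert.track = 1
```

With the 62-minute lookback, an unresolved problem only matches for about two consecutive runs (while it is between 60 and 62 minutes old), so widen the window if you want the alert to keep firing until redundancy is restored.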