I am trying to write a Splunk query so that if a host is down for more than 5 minutes, I get an alert message.
Log Example -
8:01:00 Host1 is OFFLINE
8:03:00 Host2 is ONLINE
8:04:00 Host1 is OFFLINE
8:10:00 Host2 is ONLINE
Example 1 - Host1 was down for 2 minutes. I should not get an alert.
Example 2 - Host2 was down for more than 5 minutes. I should get an alert in this case as soon as the 5-minute threshold is hit in real time, or at least have a query that checks whether the host was down for more than 5 minutes and when it came back up.
How can I create a Splunk query for this using TRANSACTION, or whichever approach is best?
Are your "host" values correct (do not seem to match your description)?
Woodcock,
Host names are "Host1" and "Host2". Apologies for the typo.
Below is the text in the logs:
"Set Vendor Status OFFLINE "
Vendor - 3rd party which we are accessing to retrieve information.
For the latter case (check after it is back up):
| makeresults
| eval raw="8:01:00 Host1 is OFFLINE::8:03:00 Host1 is ONLINE::8:04:00 Host2 is OFFLINE::8:10:00 Host2 is ONLINE"
| makemv raw delim="::"
| mvexpand raw
| rename raw AS _raw
| eval _time=strptime(_raw, "%H:%M:%S")
| rex "\S+\s+(?<host>\S+)"
| sort 0 - _time
| rename COMMENT AS "Everything above generates sample event data; everything below is your solution"
| streamstats count(eval(searchmatch("ONLINE"))) AS sessionID BY host
| eventstats range(_time) AS TimeOffline BY host sessionID
| where TimeOffline > 5*60
Give this a try (assuming the fields hostname and state are extracted; if not, add | rex "^\S+\s+(?<hostname>\S+)\s+is\s+(?<state>\S+)" after the base search):
your base search
| fields _time hostname state
| sort 0 hostname _time
| streamstats current=f window=1 values(_time) as prev_time values(state) as prev_state by hostname
| eval Duration=_time-prev_time
| where (prev_state="OFFLINE" AND state="ONLINE" AND Duration>300)
Thank you somesoni2 for the details. I tried the query below:
index=qa host="q1*" "Configured host status *" sourcetype=jboss_server_log | rex "^\S+\s+(?<hostname>\S+)\s+is\s+(?<state>\S+)" | sort 0 hostname _time | streamstats current=f window=1 values(_time) as prev_time values(state) as prev_state by hostname
Up to this point I get all the logs for that host. Once I add the condition "| where (prev_state="OFFLINE" AND state="ONLINE" AND Duration>300)", I get no results. I also changed the Duration condition to "< 300" to see whether it works for any duration.
In case the system never went offline at all, I also tried a generic query to check whether it did, and I can find events when I run the one below.
index=qa host="q1*" "Configured q1 status *"
Below are the search results I am getting:
8:04:39.633 AM Configured q1 status to OFFLINE
8:05:09.714 AM Configured q1 back status to ONLINE
Can you manually validate the results before the where clause (add a table command, | table _time prev_time hostname state prev_state) and see if those are correct? My answer assumed your logs are in the format <timestamp> <hostname> is <state>. If they are not in that format, the rex command will fail to extract the fields, and the commands that depend on them will fail too.
In the case below, I only see the time when it went OFFLINE (_time); hostname, prev_time, and prev_state are not populated.
Also, I forgot to mention that I am checking whether the vendor host is OFFLINE or ONLINE, and for more than 5 minutes, as logged on the qa1 host.
index=qa host="q1*" "*Configured vendor host to OFFLINE *" | rex "^\S+\s+(?<hostname>\S+)\s+is\s+(?<state>\S+)" | sort 0 hostname _time | streamstats current=f window=1 values(_time) as prev_time values(state) as prev_state by hostname | eval Duration=_time-prev_time | table _time prev_time hostname state prev_state
Splunk log -
8:04:39.633 AM Configured vendor host1 status to OFFLINE
8:05:09.714 AM Configured vendor host1 status back to ONLINE
Thank you once again for the details.
See if this gets you started. It fetches the most recent entry for each host, throws away those that are ONLINE, then filters out those that went OFFLINE in the last 5 minutes. Set the alert to run every 5 minutes and trigger if the result count is not zero.
index=foo | dedup host | search "OFFLINE" | where (now()-_time)>300
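The same idea can be adapted to the vendor status lines quoted earlier in the thread, where the host of interest is named inside the message rather than in Splunk's host field. This is only a rough sketch: the rex pattern is inferred from the two sample lines ("Configured vendor host1 status to OFFLINE" and "... status back to ONLINE"), the field names vendor, state, last_change, and OfflineFor are made up for illustration, and the 24-hour lookback is an arbitrary choice.

```
index=qa host="q1*" "Configured vendor" earliest=-24h
| rex "Configured vendor (?<vendor>\S+) status (?:back )?to (?<state>ONLINE|OFFLINE)"
| rename COMMENT AS "Keep only the most recent status change per vendor host"
| stats latest(_time) AS last_change latest(state) AS state BY vendor
| rename COMMENT AS "Still OFFLINE, and has been for more than 5 minutes"
| where state="OFFLINE" AND now()-last_change > 300
| eval OfflineFor=tostring(now()-last_change, "duration")
```

Scheduled every 5 minutes with a trigger on a non-zero result count, this should fire while a vendor host is still down, not after it recovers.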
Thank you rich, but with this query I believe I would only know when it went OFFLINE. I want to know both when it went OFFLINE and when it came back ONLINE, along with the duration. I am trying to show it in a table, something like below:
Host | Went OFFLINE | Back ONLINE | Duration |
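One possible way to get a table like that is to pair each ONLINE event with the OFFLINE event just before it for the same vendor host, reusing the streamstats approach suggested earlier in the thread. This is a sketch under assumptions, not a tested solution: the rex pattern is inferred from the two vendor log lines posted above, the host2 sample events are invented so that one outage exceeds 5 minutes, and the field names vendor, went_offline, and prev_state are hypothetical.

```
| makeresults
| eval raw="08:04:39 Configured vendor host1 status to OFFLINE::08:05:09 Configured vendor host1 status back to ONLINE::08:20:00 Configured vendor host2 status to OFFLINE::08:31:30 Configured vendor host2 status back to ONLINE"
| makemv delim="::" raw
| mvexpand raw
| rename raw AS _raw
| eval _time=strptime(substr(_raw, 1, 8), "%H:%M:%S")
| rename COMMENT AS "Everything above generates sample events; replace it with your base search"
| rex "Configured vendor (?<vendor>\S+) status (?:back )?to (?<state>ONLINE|OFFLINE)"
| sort 0 vendor _time
| rename COMMENT AS "Carry the previous event's time and state forward, per vendor host"
| streamstats current=f window=1 last(_time) AS went_offline last(state) AS prev_state BY vendor
| where prev_state="OFFLINE" AND state="ONLINE"
| eval Duration=_time-went_offline
| where Duration>300
| eval "Went OFFLINE"=strftime(went_offline, "%H:%M:%S"), "Back ONLINE"=strftime(_time, "%H:%M:%S"), Duration=tostring(Duration, "duration")
| table vendor "Went OFFLINE" "Back ONLINE" Duration
```

With this sample data, host1's 30-second outage should be filtered out and only the host2 outage should remain. Note that this only reports outages that have already ended, so it does not replace an alert for a host that is still down.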