Alerting

With a set of events, continuously collect some based on content and alert once there's been a 5-minute gap; alert on others immediately

cdhippen

We have software restarts that can either be forced manually or be triggered by the system. A forced restart produces an event like this:

2019-08-18 23:15:21 restartBy= restartUser=.......... restartReason=................

Followed shortly after by something like:

2019-08-18 23:19:46,222 INFO (i/o) [....] Version Information: 1.1.11

The build version information always shows up after a restart, but the first event only shows up if the restart was manual. We want to alert on manual restarts immediately, but for system restarts we want to collect them until 10 minutes have passed since the last system restart. That is, we want a single alert saying that there were N system restarts, along with the time of the first restart and the time of the last restart in that group.

The problem so far is twofold. If the restarts keep occurring for longer than the search window, the earlier ones won't show up in the collected alert. And if there are two separate groups of restarts within the time range, I end up counting the restarts from both groups together. This is what I've got so far, but it's throwing me for a loop and I'm having trouble finishing it out.

| inputlookup partial_day_core_restarts.csv
| search alerted="false"
| eval new="false" 
| append 
    [| search 
        <base search>
    | eval build=coalesce(build, signature) //  for the build version, sometimes the field is reported as signature
    | rex "restartReason\=(?<restartReason>.*)" 
    | rex "service\/(?<core>.*)" 
    | rex "sudo: (?<restartUser>.*) :" 
    | lookup workspace workspaceGuid output currentCustomerGuid as customerGuid 
    | lookup customer-dc5prod customerGuid output name as customerName 
    | eventstats values(eval(if(isnotnull(restartReason), restartReason, null()))) as restartReason values(eval(if(isnotnull(restartUser), restartUser, null()))) as restartUser by core 
    | eval restart=core + ":::::" + restartReason + ":::::" + restartUser 
    | eval restart=coalesce(restart, core + ":::::System Restart:::::System Restart") 
    | mvexpand restart 
    | eval build1=build 
    | eventstats latest(eval(if(isnotnull(build), _time, null()))) as restartTime by restart 
    | eventstats values(eval(if(_time>restartTime-2000, workspaceGuid, null()))) as workspaces values(eval(if(_time>restartTime-2000, customerName, null()))) as customers latest(build) as build by restart 
    | eval customers=mvjoin(customers, "::"), workspaces=mvjoin(workspaces, "::")
    | fillnull customers workspaces value="None Active" 
    | table workspaces customers restartTime build build1 core restart 
    | where isnotnull(build1) 
    | eval restartReason=mvindex(split(restart, ":::::"), 1), restartUser=mvindex(split(restart, ":::::"), 2) 
    | eval new="true"]
| stats values(new) as new by build core customers restartReason restartTime restartUser workspaces
| eventstats max(restartTime) as restartTime1 by restartReason
| eval alert=case(restartTime1>now()-300 AND restartReason="System Restart", "false", mvcount(new)>1, "false", match(new, "false"), "false", match(new, "true"), "true")
| eval new="false"
| outputlookup partial_day_core_restarts.csv
| eval alert=case(restartTime1>now()-300 AND restartReason="System Restart", "false", mvcount(new)>1, "false", match(new, "false"), "false", match(new, "true"), "true")
| search alert="true"
| stats values(core) as core values(customers) as customers values(workspaces) as workspaces by restartTime1 restartUser restartReason build
| eval core=if(mvcount(core)>5, tostring(mvcount(core)) + " cores were restarted", core), customers=replace(customers, "::", ", ")
| convert ctime(restartTime1) as restartTime timeformat="%Y-%m-%d %H:%M:%S"
| eval customers=replace(replace(mvjoin(customers, ", "), "None Active, ", ""), ", None Active", ""), workspaces=replace(replace(mvjoin(workspaces, ", "), "None Active, ", ""), ", None Active", ""), core=mvjoin(core, ", ")
| fillnull customers workspaces value="None Active"
| eval throttle=md5(restartTime.restartUser.restartReason.build)

I'm totally open to new, simpler ways of doing this as well.
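To make the grouping requirement concrete, here is a minimal, untested sketch of one simpler direction: number each burst of restarts by the gap between consecutive events, then report only bursts that have been quiet for the full gap. It covers only the system-restart side and assumes every "Version Information" event returned by the search is a system restart, so manual restarts would still need to be filtered out and alerted separately. The field names prev_time, new_group, group_id, restarts, first_restart and last_restart are hypothetical, and 600 is the 10-minute gap in seconds (300 if the 5-minute gap is the one wanted).

```
<base search> "Version Information"
| sort 0 _time
| streamstats current=f last(_time) as prev_time
| eval new_group=if(isnull(prev_time) OR _time-prev_time>600, 1, 0)
| streamstats sum(new_group) as group_id
| stats count as restarts min(_time) as first_restart max(_time) as last_restart by group_id
| where now()-last_restart>=600
| convert ctime(first_restart) ctime(last_restart)
```

Sorting ascending lets the first streamstats carry the previous event's time forward, and the running sum of new_group gives each burst its own group_id, so two separate groups in the same time range are counted separately. This does not solve the search-window problem on its own: the window still has to reach back to the start of the burst, or earlier restarts have to be carried over in a lookup as in the search above. Adding "by core" to both streamstats commands and to stats would presumably group per core.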
