Scenario:
We have a data source that we wish to analyze: hourly host activity events. An endpoint agent installed on a user's host monitors for specific events and reports these events to the central server (aka manager/collector). The central server then sends the events to Splunk for ingest.

We found that a distinct count of specific action events per hour per host is very interesting to us. If the hourly count per host/user is greater than the "normal behavior" average, we want to be alerted. We define normal behavior as the "90 day average of distinct hourly counts per host/user". We define an outlier/alert as an hourly distinct count more than 2 standard deviations above the 90 day hourly average. For instance, if the 90 day hourly average is 2 events for a host, then 10 events in a single hour for that host would fire an alert.

We tried many different methods and found some anomalies. One issue is the events' arrival time in Splunk. Specifically, the data does not always arrive in Splunk at a consistent interval. The endpoint agent may be delayed in processing or sending the data to the central server if the network connection is lost or the host was suspended/shut down shortly after the events of interest occurred. We have accepted this issue as it's very infrequent.

Methodology:
Our analysis runs in multiple phases.

Phase 1 > prepare the data and output to KVstore lookup
We run a query to prime the historic data.

index=foo earliest=-90d@h latest=-1h@h foo_event=* host=*
| timechart span=1h dc(foo_event) as Foo_Count by host limit=0
| untable _time host Foo_Count
| outputlookup 90d-Foo_Count

Then we modify and save the query to append the new data. We use -2h@h and -1h@h to mitigate lagging events. This report runs first, every hour at minute=0.

index=foo earliest=-2h@h latest=-1h@h foo_event=* host=*
| timechart span=1h dc(foo_event) as Foo_Count by host limit=0
| untable _time host Foo_Count
| outputlookup 90d-Foo_Count append=t

Phase 2 > calculate the upperBound for each user
This report runs second, every hour at minute=15. We add additional statistics for investigation purposes.

| inputlookup 90d-Foo_Count
| timechart span=1h values(Foo_Count) as Foo_Count by host limit=0
| untable _time host Foo_Count
| stats min(Foo_Count) as Mini max(Foo_Count) as Maxi mean(Foo_Count) as Averg stdev(Foo_Count) as sdev median(Foo_Count) as Med mode(Foo_Count) as Mod range(Foo_Count) as Rng by host
| eval upperBound=(Averg+sdev*exact(2))
| outputlookup Foo_Count-upperBound

Phase 3 > trim the oldest data to maintain a 90d@h interval
This report runs third, every hour at minute=30.

| inputlookup 90d-Foo_Count
| eval trim_time = relative_time(now(),"-90d@h")
| where _time>trim_time
| convert ctime(trim_time)
| outputlookup 90d-Foo_Count

Phase 4 > detect outliers
This alert runs fourth (last), every hour at minute=45.

index=foo earliest=-1h@h latest=@h foo_event=* host=*
| stats dc(foo_event) as Foo_Count by host
| lookup Foo_Count-upperBound host output upperBound
| eval isOutlier=if('Foo_Count' > upperBound, 1, 0)

This method is successful at alerting on outliers. Regarding event lag, we monitor it and keep track of how significant it is.

Originally, we tried using the MLTK with a DensityFunction and partial fit; however, we have approximately 65 million data points, which causes issues with the Smart Outlier Detection assistant.

The question is whether anyone has a different or more efficient way to do this?

Thank you for your time!
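
P.S. For anyone who wants to check the threshold rule from Phases 2 and 4 without access to our data, here is a minimal sketch on synthetic values. The host name "example_host", the five sample hourly counts, and the Current_Count field are made up purely for illustration; only the mean + 2 × stdev rule comes from the searches above.

```spl
| makeresults count=5
| streamstats count as n
| eval host="example_host", Foo_Count=case(n=1,1, n=2,2, n=3,2, n=4,3, n=5,2)
| stats avg(Foo_Count) as Averg stdev(Foo_Count) as sdev by host
| eval upperBound=Averg+2*sdev
| eval Current_Count=10
| eval isOutlier=if(Current_Count>upperBound, 1, 0)
```

With these sample counts the average is 2 and stdev (the sample standard deviation) is roughly 0.71, so upperBound comes out at roughly 3.41 and an hourly count of 10 is flagged, which matches the example given in the scenario.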