Splunk Search

Finding outlier requestor IPs to a particular host over a timespan in an access log

alucas_1stop
New Member

I spent about 5 minutes trying to figure out how to even title this question.

It's much easier to explain with an example, so please feel free to edit the title.

We have an access log of format:

ClientIP Hostname URI StatusCode

Now, I am trying to identify the set of ClientIPs that make an unusually large number of requests per Hostname over a specified timespan (for example, per minute) - say, a request count per Hostname that is 2 standard deviations higher than the average over that timespan.

The reason for doing this instead of using a fixed count/threshold:

  • we have many hostnames and they all have varying access per minute profiles
  • the access profiles change over time
  • it's cooler to use slightly more advanced stats 😉
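As an illustration of the statistical rule being asked for (not Splunk itself), here is a minimal Python sketch with invented per-minute counts: flag any ClientIP whose request count for a hostname exceeds mean + 2·stdev across all clients of that hostname.

```python
from statistics import mean, stdev

# Invented per-minute request counts per ClientIP for one Hostname.
cpm = {
    "10.0.0.1": 38, "10.0.0.2": 40, "10.0.0.3": 41,
    "10.0.0.4": 39, "10.0.0.5": 42, "10.0.0.6": 40,
    "10.0.0.7": 37, "10.0.0.8": 43, "10.0.0.9": 40,
    "10.0.0.10": 900,  # the anomalous client
}

counts = list(cpm.values())
# Flag anything more than 2 sample standard deviations above the mean.
threshold = mean(counts) + 2 * stdev(counts)

outliers = [ip for ip, c in cpm.items() if c > threshold]
```

One caveat with this rule: a single extreme client also inflates the stdev itself, which raises the threshold for everyone on that hostname, so very small client populations can let an outlier slip under it.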

For those interested, here is how the current "preset threshold" is implemented:

index=accesslog | stats count as CPM by ClientIP Hostname | search (Hostname="*.domain.com" CPM>800) OR (Hostname="this.domain.com" CPM>350) OR (Hostname="that.domain.com" CPM>300) OR (...) OR ...
0 Karma

prelert
Path Finder

The easiest solution is to leverage the Prelert Anomaly Detective App. You can easily determine which ClientIPs are making an abnormally different number of requests than other ClientIPs:

index=accesslog | prelertautodetect count over ClientIP

If you want to segment it by host as well:

index=accesslog | prelertautodetect count by Hostname over ClientIP

Here's a slightly different example of finding a ClientIP requesting an abnormally different number of pages than other ClientIPs, but in this case segmented by status code:

http://www.prelert.com/images/screenshots/count_by_status_over_clientip.png

This can also easily be run on an ongoing basis as a regularly scheduled search, so that it runs continuously every X minutes.

0 Karma

alucas_1stop
New Member

After a bit of trial and error I figured it out.

index=accesslog                   earliest=-7d@m-5m latest=-7d@m
| append [ search index=accesslog earliest=-14d@m-5m latest=-14d@m ]
| append [ search index=accesslog earliest=-21d@m-5m latest=-21d@m ]
| bucket _time span=1m 
| stats count AS LastCPM by ClientIP Hostname date_mday
| stats avg(LastCPM) as LastAvg, stdev(LastCPM) as LastStdev by Hostname
| join type=outer Hostname [ search index=accesslog earliest=-5m@m latest=now@m
| bucket _time span=1m 
| stats count AS NowCPM by ClientIP Hostname date_mday
| stats avg(NowCPM) as NowAvg by Hostname ]
| where NowAvg > LastAvg+LastStdev*2

The output will be something like this:

Hostname                LastAvg     LastStdev   NowAvg   
host3.domain.com        25.370370   32.720253   26.600000
host55.domain.com       10.610169   14.518736   13.900000

The logic: look at the average and stdev of events (connections) per client, per host, per minute over the historical windows, and compare that with the current average.
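That comparison can be sketched in Python, with invented numbers and simplified to a single Hostname (the real search also groups by ClientIP):

```python
from statistics import mean, stdev

# Invented per-minute counts for one Hostname, from three historical
# 5-minute windows (the same clock minutes 1, 2 and 3 weeks ago).
history = [
    24, 26, 22, 30, 25,   # earliest=-7d@m-5m  latest=-7d@m
    23, 27, 24, 29, 26,   # earliest=-14d@m-5m latest=-14d@m
    25, 24, 28, 27, 23,   # earliest=-21d@m-5m latest=-21d@m
]
last_avg = mean(history)
last_stdev = stdev(history)

# Invented per-minute counts for the current 5-minute window.
now = [70, 75, 68, 80, 72]
now_avg = mean(now)

# Equivalent of: | where NowAvg > LastAvg + LastStdev*2
is_anomalous = now_avg > last_avg + 2 * last_stdev
```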

You can add more appends to cover additional time ranges. Every company/site/webservice has a unique access profile. Some will have similar stats at specific times of the day or week (as in our case); others will have similar stats every day regardless of weekday/weekend. Hence you can change the append time frames to yesterday, the day before, etc., instead of going a week further back each time.

Now, there are 2 things I'm not sure about. Do I need to add _time to the stats count lines? I think that would only be needed if you wanted to compare non-equal timespans (e.g. >5 minutes in the top lines and exactly the last 5 minutes under the join).
The other thing: I wanted to "extract" the individual events from the resulting stats tables (after the where pipe), but I could not find a way to do that. Are the underlying logs that produced the stats lost?

0 Karma

alucas_1stop
New Member

Also, a thing of note: my solution above only works if the number of requests from several ClientIPs, or from a single ClientIP, pushes NowAvg high enough to be "caught". The good thing is that it will catch both possibilities; the bad thing is that it will also flag some legitimate situations, like testers making more connections during tests or a newly added external monitoring solution.
The solution is efficient and can be used in "almost" real-time reports/alerts if you don't specify large timespans. For a 5-minute timespan, as in the example above, parsing takes longer than the actual search and stats.

0 Karma

lguinn2
Legend

For a normal distribution, we can use the stats function p97 to approximate a threshold two standard deviations above the mean.

Here is how I would write the search:

index=accesslog
| bucket _time span=1m
| stats count as CPM by ClientIP Hostname _time
| eventstats p97(CPM) as threshold by Hostname _time
| where CPM > threshold

I am not quite sure what you are trying to compare, but start with a short time period, leave off the last line, and I think you will see how it works.

(More info on std dev - look at the graph in this Wikipedia article and you can see that 2 standard deviations would include all but approximately the top 2.2% of values.)
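That approximation can be checked with Python's standard library (statistics.NormalDist, Python 3.8+): on a normal distribution, a value two standard deviations above the mean sits at roughly the 97.7th percentile, so p97 is a close, slightly lower stand-in.

```python
from statistics import NormalDist

# Fraction of a standard normal distribution at or below mean + 2*sigma.
fraction_below = NormalDist(mu=0.0, sigma=1.0).cdf(2.0)
fraction_above = 1 - fraction_below  # roughly 0.023, i.e. the top ~2.3%
```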

0 Karma

lguinn2
Legend

@alucas_1stop - thanks for the explanation - I still think my idea will work. Let me know if it doesn't

0 Karma

alucas_1stop
New Member

It runs fine and is quite efficient (from a performance point of view) for short timespans. However, it returns too many rows. I've tried changing to p98() and still get quite a few. I believe it's because of the large variation in request counts at different times of the day. I think comparing the current timespan (e.g. the last 5 minutes) against the same hour:minute timespan over the last 4 weeks would produce better results, e.g. avg(today 15:00-15:05 per IP per host per minute) > avg(last 4 weeks 15:00-15:05 per IP per host per minute). Is that possible with Splunk?

0 Karma

alucas_1stop
New Member

One way would be to look at it purely from a time perspective: then it would be X per hostname standing out, followed by looking at the events manually (they could be from the same IP or multiple IPs). Another way (preferred, I think) is to look at it from the client IP point of view (Y requests per minute per client IP per hostname).

0 Karma

alucas_1stop
New Member

I will reiterate: the idea is to catch anomalies, but only higher-than-usual numbers of requests.
The idea is: each hostname gets X requests per minute on average, and Y requests per minute per unique(!) client IP. We want to be able to see client IPs requesting considerably more than the others (anomalies, right?). This is a good way to identify attacks, overly frequent health/monitoring checks (possibly misconfiguration), infinite loops querying sites/webservices internally or externally (possibly poor code); the list goes on.

0 Karma

sowings
Splunk Employee
Splunk Employee

But if you're trying to calculate "higher than two stdev" over the time frame, you need to have some other sample time frame against which to figure out the mean / stdev.

Would you say that your question could be phrased as "Search for sudden uptick in activity from a <client host>?"

0 Karma

alucas_1stop
New Member

Yes, I was actually trying to use last week's data to calculate the average and stdev and compare against the last 5 minutes. Although a better way would be to use an equivalent timespan (e.g. 5 minutes) from the same time range over each of the last 4+ weeks - it is actually computationally faster and more indicative for web traffic stats. I couldn't even get the first scenario going using some of the example solutions in the links above - they are either wrong or have hidden syntax errors.

0 Karma

alucas_1stop
New Member

@felipetesta Actually, the timeframe is specified in the search (right side) or in the alert/report settings; it's not explicit in the search itself, so the values for CPM are per the specified timeframe (in our case 5 min). The bucket command is nice, but I think what I'm after is more advanced, like looking at the average of the previous week's stats per Hostname and comparing it with now.

0 Karma

felipetesta
Path Finder

@alucas_1stop I am not sure where the problem is. If you need to split by time, which is not shown in your example, how about "| bucket _time span=1m | stats count as CPM by ClientIP Hostname _time | search ..."

I also tried playing with the anomaly-finding commands but did not get a meaningful result.

0 Karma

s2_splunk
Splunk Employee
Splunk Employee

Could you make it one field, if the other Q contains what you need?
index=accesslog | eval ipHost = ClientIP+":"+Hostname | stats count as CPM by ipHost | ....

0 Karma

alucas_1stop
New Member

I tried to follow this: http://answers.splunk.com/answers/58750/how-do-you-monitoralert-for-spikes-of-negative-events but it doesn't really work, as that example uses one field and here we are looking at a table (ClientIP vs Hostname).

0 Karma