Splunk Search

how to calculate/monitor the error rate (in minutes or hours), using HTTP response codes

kingsizebk
Path Finder

I have the working search below, which calculates and monitors a web site's performance (using the average and standard deviation of the round-trip request/response time) per timeframe (the timeframe is chosen from the standard TimePicker pulldown), using a log entry that we call "Latency". "rttc" is a field extraction in props.conf, Latency:(\s+\d+){11}\s+(?<rttc>\d+), which contains the total round-trip time, and "latencyURI" is a similar field extraction that provides the URI of the request:

log-entry.message="Latency:*" | stats latest(rttc) as latestRTTC avg(rttc) as avgRTTC stdev(rttc) as stdRTTC by latencyURI | eval avgRTTC=round(avgRTTC,0) | eval stdRTTC=round(stdRTTC,0) | eval lowerLimit=latestRTTC-stdRTTC | eval upperLimit=latestRTTC+stdRTTC | eval rangeColor=case(((latestRTTC>=lowerLimit) AND (latestRTTC<=upperLimit)),"green",((rttc<lowerLimit) OR (rttc>upperLimit)),"red") | table latencyURI latestRTTC avgRTTC stdRTTC lowerLimit upperLimit rangeColor

I would like to do something similar to calculate/monitor the rate of errors (per timeframe, as above), using the HTTP response codes (any response code other than 200), but I run into trouble as soon as I need to calculate the rate of non-200 HTTP response codes per hour:

responseCode!="200" | bucket span=1h _time | stats avg(count(responseCode)) as avgCodesPerHour stdev(count(responseCode)) as stdCodesPerHour by _time

The above search does not produce any errors but it also does not seem to calculate the avgCodesPerHour or the stdCodesPerHour. Can anyone suggest what the problem is, how to fix the problem or a different way of approaching this?

(This is running on Splunk 5.0.2 which is running on RHEL 5.5)

1 Solution

stefano_guidoba
Communicator

Hi Kingsizebk,

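The problem with your search is that you are using stats ... by _time. After bucketing, that leaves a single count per time bucket, so there is nothing to average within each group.
If you want a daily/hourly rate, first calculate your occurrences per minute, then reaggregate on the hour. You could try: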

responseCode!="200" earliest=-24h@h latest=@h | stats count by date_hour date_minute | stats avg(count) as avgErrsByHour stdev(count) as stdErrsByHour by date_hour

Same if you want it by day:

responseCode!="200" earliest=-30d@d latest=@d | stats count by date_wday date_hour | stats avg(count) as avgErrsByDay stdev(count) as stdErrsByDay by date_wday

Regards,
Stefano
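
The two-stage idea in this answer (count per time bucket first, then take the average and standard deviation over those counts) can be sketched outside Splunk. The Python below is only an illustration under stated assumptions: the function name, the (timestamp, response code) input shape and the sample events are hypothetical, not anything Splunk provides. As with stats count by, a bucket with no errors simply does not appear.

```python
from collections import Counter
from datetime import datetime
from statistics import mean, stdev


def hourly_error_stats(events):
    """events: iterable of (timestamp, response_code) pairs (assumed shape)."""
    # Stage 1: count non-200 responses per one-hour bucket.
    per_hour = Counter(
        ts.replace(minute=0, second=0, microsecond=0)
        for ts, code in events
        if code != "200"
    )
    counts = list(per_hour.values())
    if not counts:
        return per_hour, 0.0, 0.0
    # Stage 2: aggregate over the per-bucket counts.
    avg = mean(counts)
    # The sample standard deviation needs at least two buckets.
    std = stdev(counts) if len(counts) > 1 else 0.0
    return per_hour, avg, std


# Made-up sample data: 2 errors in hour 0, 3 in hour 1, 4 in hour 2.
sample_events = (
    [(datetime(2012, 2, 25, 0, m), "500") for m in (5, 40)]
    + [(datetime(2012, 2, 25, 1, m), "404") for m in (10, 30, 50)]
    + [(datetime(2012, 2, 25, 1, 20), "200")]  # ignored: not an error
    + [(datetime(2012, 2, 25, 2, m), "503") for m in (1, 2, 3, 4)]
)

per_hour, avg, std = hourly_error_stats(sample_events)
```

The first stage corresponds to the first stats count by ..., and the second stage to the following stats avg(count) ... stdev(count).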

kingsizebk
Path Finder

Thx for the answer, it works for calculating the rate per day/hour/min/sec and the stdev. Here is my final search:

responseCode!="200" | stats count by date_hour date_minute responseCode | stats last(count) as lastHourErrCount stdev(count) as stdHourlyErrCount by responseCode | eval stdHourlyErrCount=round(stdHourlyErrCount,0) | eval errLowLim=lastHourErrCount-stdHourlyErrCount | eval errUppLim=lastHourErrCount+stdHourlyErrCount | eval rangeColor=case(((lastHourErrCount>=errLowLim) AND (lastHourErrCount<=errUppLim)),"green",((lastHourErrCount<errLowLim) OR (lastHourErrCount>errUppLim)),"red")
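
One thing that may be worth double-checking in the search above: errLowLim and errUppLim are derived from lastHourErrCount itself, so whenever the standard deviation is non-negative the "green" case matches. The Python sketch below illustrates the difference between centring the band on the latest value and centring it on an average. The function and parameter names are hypothetical, and the numbers are made up for illustration.

```python
def range_color(latest, center, spread):
    """Return "green" if latest lies within center +/- spread, else "red"."""
    lower, upper = center - spread, center + spread
    return "green" if lower <= latest <= upper else "red"


latest_count = 7   # made-up latest error count
avg_count = 3      # made-up historical average
std_count = 1      # made-up historical standard deviation

# Band centred on the latest value itself: the check can never fail.
self_centred = range_color(latest_count, latest_count, std_count)

# Band centred on the historical average: an unusual count is flagged.
avg_centred = range_color(latest_count, avg_count, std_count)
```

In the search this would correspond to also computing avg(count) in the second stats and basing the limits on that value, if that is the intended behaviour.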


kingsizebk
Path Finder

The avg and stdev of the error counts per hour. For instance:

date & hour = "2/25/2012 00:00:00 - 00:59:59" and count=2
date & hour = "2/25/2012 01:00:00 - 01:59:59" and count=3
date & hour = "2/25/2012 02:00:00 - 02:59:59" and count=4

The avg of the above is 3 and the stdev is 1 (i.e. statistically, everything is normal).

This measurement would trigger an alert because the count falls outside avg ± stdev:

date & hour = "2/25/2012 03:00:00 - 03:59:59" and count=7

hopefully you understand now.
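
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the helper name violates and the multiplier k are assumptions made for illustration, and stdev here is the sample standard deviation, which is what gives 1 for the counts 2, 3 and 4.

```python
from statistics import mean, stdev

# Hourly error counts from the example above.
history = [2, 3, 4]

avg = mean(history)   # 3
std = stdev(history)  # sample standard deviation: 1.0


def violates(count, avg, std, k=1):
    """True if count lies more than k standard deviations from the average."""
    return abs(count - avg) > k * std


alert_for_7 = violates(7, avg, std)  # the 03:00 hour from the example
alert_for_4 = violates(4, avg, std)  # a count still inside the band
```

With these numbers the 03:00 count of 7 is four standard deviations above the average, so it is flagged, while a count of 4 is not.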


Ayn
Legend

Yeah but the average and standard deviation of what? You get 1 value for count per hour. The "average" of that would obviously be exactly the same as the count.


kingsizebk
Path Finder

I need the count of errors per hour (the same can be done on a per-minute or per-day basis), so I can calculate the average and standard deviation. The average and standard deviation will then be used to determine whether a given minute/hour/day saw a statistically significant increase or decrease in the number of errors. That info can then be used to trigger alerts accordingly.


Ayn
Legend

I don't really get what the expected results are? You're putting events in buckets of 1 hour and the "average" count of errors per hour would be...what, as opposed to just the count of errors? Same question with stdev.
