I'm working with some HTTP access logs that include a status code. Most responses are successful, naturally. I would like to see a chart, over time, that plots the percentage of requests that are errors, so that daily cycles in the number of hits don't hide whether a growing proportion of the responses are errors.
There doesn't seem to be this "percentage of whole" function in stats / chart / timechart. What can I do?
I basically need the "percentage" column from top, over time.
This is a search that seems to be the shortest possible way to do this. The idea is:
... | bucket _time bins=100 | eventstats count as total by _time | stats count first(total) as total by _time, http_response | eval percent=(count/total)*100 | search NOT http_response="2*" | timechart first(percent) by http_response
This will give you a graph of the percentage of all events for each type of error over time. The percent field is on a 0-100 scale, so for a properly functioning system the Y axis will likely only span about 0 to 1 (that is, under 1 percent).
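If a single overall error rate is enough, rather than one series per status code, a sketch along these lines may be simpler. It assumes the same http_response field as above, treats any status of 400 or higher as an error, and uses span=1h as an arbitrary bucket size; all three are assumptions to adjust for your data.

```
... | timechart span=1h count as total count(eval(tonumber(http_response)>=400)) as errors
| eval percent=round(100*errors/total, 2)
| fields _time, percent
```

Buckets with no events have total=0, so percent comes out null there and the chart simply shows a gap.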
I have perhaps a better solution for those who seek to get a percent success broken down by some other field over time.
This gives percent success over time by a field "url" in some http logs. Just configure the span in the bucket command to control the time split, and add "%H:%M:%S" to the time format if you need hours/minutes/seconds.
index=my_http sourcetype=http_logs http_status_code IN (2*, 3*, 5*)
| bucket _time span=1d
| eval success=case(match(http_status_code, "^2"), 1, match(http_status_code, "^3"), 1, match(http_status_code, "^5"), 0)
| eventstats count as total, sum(success) as successes by url, _time
| eval pct=round((successes/total)*100,2)
| eval timestring=strftime(_time, "%m-%d-%y")
| chart first(pct) by url, timestring
Perhaps this wasn't available in earlier versions, but with the latest Splunk you can change the chart's stack mode to 100% stacked. Here's what it generates in XML:
<option name="charting.chart.stackMode">stacked100</option>
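The stack mode only changes how the chart is drawn, so the search behind it can stay a plain count per status code. A minimal sketch, assuming the http_response field from the question and an arbitrary one-hour span:

```
... | timechart span=1h count by http_response
```

Rendered as a stacked column or area chart with the option above, each time bucket fills the full height and every status code shows as its share of that bucket.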
Awesome!
Make sure to replace the HTML emphasis (<em> and </em>) with asterisks (*).
You're a lifesaver!