I'm collecting a lot of data from a large number of machines using the Linux and Unix TA (but that's mostly irrelevant to this question, other than as an example).
I would like Splunk to answer questions like "How much of the time does the CPU load on a host exceed 90%?"
I'm accomplishing something similar with this search (although this is event-correlated, not time-correlated):
sourcetype=cpu | multikv fields pctIdle | eval Percent_CPU_Load = 100 - pctIdle | stats count(eval(Percent_CPU_Load<90)) AS below, count(eval(Percent_CPU_Load>=90)) AS over by host | eval all=below+over | eval TimeOverloaded=tostring(round(over/all*100, 2))+"%" | table host, TimeOverloaded
This, however, seems like a very tedious way to get at this information. It feels like there should be a simple search command to answer these kinds of questions, like stats, chart, etc., but I can't find one. All data in Splunk is time-correlated, so this should certainly be possible.
If a command like this already exists, I apologize. If not, I would like to request it as a feature - although I'm at a loss as to what such a command should be named 🙂
A search command like this would be very useful when calculating, e.g., SLA fulfillment.
You can certainly slim down that query:
sourcetype=cpu | multikv fields pctIdle | stats count AS all, count(eval(pctIdle<=10)) AS over by host | eval TimeOverloaded=tostring(round(over/all*100, 2))+"%" | table host, TimeOverloaded
You could use streamstats to add the next event's timestamp to each event, calculate the difference, use that as the duration for which the event is valid, and thereby get an approximation of the time during which your CPU load was greater than 90%.
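A sketch of that approach might look like this (assuming events arrive in Splunk's default newest-first order, so the previous row seen by streamstats is the chronologically next event; the first event per host gets a null duration, which stats ignores):

```
sourcetype=cpu | multikv fields pctIdle
| streamstats current=f window=1 global=f last(_time) AS next_time by host
| eval duration = next_time - _time
| stats sum(duration) AS total, sum(eval(if(pctIdle<=10, duration, 0))) AS overloaded by host
| eval TimeOverloaded = tostring(round(overloaded/total*100, 2))+"%"
| table host, TimeOverloaded
```

This weights each sample by how long it was "in effect" instead of counting all events equally, which is what makes the result time-correlated rather than event-correlated.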
Not having a dedicated command for this special case makes it a bit harder to build, but having powerful generic commands is what makes it possible in the first place. Imagine how many highly specific commands there would have to be to cover every possible eventuality.
You are right... my query is rather long 🙂
I still think Splunk could use a more dedicated command to accomplish this more generally.
There's an important difference, though: this search counts the percentage of logged events during which the host(s) are overloaded, not the percentage of time during which they are overloaded.
A dedicated command could, e.g., take into account how time slots with missing data should be handled. Using addinfo to get the search start and end times and then calculating how many data points I should have makes the search even more complicated 🙂
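For illustration, here is roughly what that looks like today (assuming a hypothetical 60-second polling interval; `expected` and `missing` are field names made up for this example):

```
sourcetype=cpu | multikv fields pctIdle
| stats count AS all, count(eval(pctIdle<=10)) AS over by host
| addinfo
| eval expected = round((info_max_time - info_min_time) / 60)
| eval missing = expected - all
| eval TimeOverloaded = tostring(round(over/all*100, 2))+"%"
| table host, TimeOverloaded, missing
```

addinfo adds the search boundaries as info_min_time and info_max_time, so dividing the span by the polling interval gives the number of data points the search should have seen per host.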