Getting Proper Averages from Summary Index

deastman · ‎10-26-2017

First, as an example, I thought the question and responses in this Splunk Answers post were excellent. I borrowed the formatting idea from the OP and hope it helps: https://answers.splunk.com/answers/48641/summary-index-noob-question.html

First, the summary search:
- search name = "Summary CPU Usage".
- search = "sourcetype="Perfmon:CPU" counter="% Processor Time" instance="_Total" | sitimechart span=5m limit=0 avg(Value) by host".
- start time = "-20m@m" finish time = "-5m@m".
- scheduled to run every 5 minutes.
- alert condition = always.
- alert mode = once per search.
- summary indexing = enabled.
- summary index = "Performance_Summary".
- added fields: "report" = "cpu_usage".

Report search: index=Performance_Summary report="cpu_usage" | timechart span=15m count by host

But this returns so many statistics that the graph is unusable. Also, splitting by host as shown above returns only the name of my search head, not each individual node. I understand that I would need to use orig_host instead, but why is that? And is there a way to change it, since users may not know when they need to do that with summary data?

Thanks!
Dustin

DalJeanis · ‎10-26-2017

Let's start with the host question.

Your underlying query is this:

sourcetype="Perfmon:CPU" counter="% Processor Time" instance="_Total"
| sitimechart span=5m limit=0 avg(Value) by host

The values for host that will be set in the summary index will be the host field that was in the Perfmon:CPU records.

If that data only tracks your search heads, then that is the only thing in your summary index at the moment. To me, that seems unlikely, unless your search heads are set up for performance monitoring and the rest of your hosts are not.

More likely, your search heads are simply the busiest, so their records are the ones that get prioritized by the timechart command.

To validate this, pick a couple of hosts that are not search heads and do this...

 index=Performance_Summary report="cpu_usage" 
    host="myfirsthost" OR host="mysecondhost"
 | timechart span=15m count by host

Assuming that shows good data, then we can ignore your orig_host question and move on to the big question. If not, then we need to backtrack and figure out what is going on with your system monitoring data.
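
If that check comes back empty, one way to see where the original host names ended up is to count summary events by both fields. This is only a sketch: it assumes the summary events carry an orig_host field, as mentioned in the question, and "(none)" is just an illustrative label for events that lack it.

```spl
index=Performance_Summary report="cpu_usage"
| fillnull value="(none)" orig_host
| stats count by host, orig_host
```

If every row shows the search head under host and the real servers under orig_host, the report search would need to split by orig_host instead.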

The big question

What are your users doing with the data?

If they are trying to find busy servers, then maybe you need to segment the data a little more.

To make the best data visualization, you always have to assume the role of the person who you are making it for.

If I'm trying to find out which servers are being pounded, then maybe I want to see only servers that have more than 75% CPU.
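
As a rough sketch of that first case, something like the following lists only the hosts averaging above 75% CPU over the selected time range. It runs against the raw Perfmon data from the question rather than the summary index, and the avg_cpu field name and the 75 threshold are illustrative choices.

```spl
sourcetype="Perfmon:CPU" counter="% Processor Time" instance="_Total"
| stats avg(Value) as avg_cpu by host
| where avg_cpu > 75
| sort - avg_cpu
```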

If I'm trying to find out how my overall processes are running, maybe I want to see a summary of how many servers are running at each 10% increment (therefore ten lines). Or maybe I want <25% blue, 25-50% green, 50-75% yellow, 75-90% orange, 90%+ red.
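
The banded view could be sketched like this, again against the raw data and with illustrative names (avg_cpu, band). Each host gets a 5-minute average, each average is assigned to a band, and the chart counts hosts per band, giving five lines instead of one per host.

```spl
sourcetype="Perfmon:CPU" counter="% Processor Time" instance="_Total"
| bin _time span=5m
| stats avg(Value) as avg_cpu by _time, host
| eval band=case(avg_cpu<25, "<25%", avg_cpu<50, "25-50%", avg_cpu<75, "50-75%", avg_cpu<90, "75-90%", true(), "90%+")
| timechart span=5m count by band
```

The colors themselves would be set in the chart formatting, not in the search.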

The key is to always ask why anyone needs to look at the graph in the first place, what's the most important thing they need to know, and what's the next thing they are going to want to do with what they learn.

Once you identify that, then you can work out the data viz that allows them to do their job most easily.

deastman · ‎10-26-2017

Per feedback from my end user in this case: "I would be interested in having an average of CPU and memory in use every five minutes and every hour." I asked for further clarification, and the user simply wants an average of CPU utilization per host over a 5-minute window or over a 1-hour window.

I hope this helps clarify the use case.
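
For reference, a minimal sketch of a report search for this use case, assuming the summary search above is populating Performance_Summary as described. Since the summary was built with sitimechart ... avg(Value) by host, the report would normally use timechart with the same function and split field, rather than count, so that the averages are recomputed from the summarized data:

```spl
index=Performance_Summary report="cpu_usage"
| timechart span=1h limit=0 avg(Value) by host
```

Changing span=1h to span=5m would give the five-minute view. If the original host names turn out to be stored in orig_host, as raised in the question, the split field may need to be orig_host instead. The summary search only covers Perfmon:CPU, so memory would need its own summary search.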
