I am trying to use HiddenSearch and HiddenPostProcess in a few places to re-use the same result set, based on the documentation here: http://docs.splunk.com/Documentation/Splunk/latest/Developer/PostProcess
I'm running into a serious problem, however: the results appear to be silently truncated. I'm looking for 24 hours of results but getting only about 4.5 hours' worth. Is there a way to a) determine whether the HiddenPostProcess module really is discarding results, and b) increase the limit if so?
Here's the source I am using:
<module name="HiddenSearch" layoutPanel="panel_row2_col1" autoRun="True">
  <param name="search"><![CDATA[
    index=foo sourcetype=foo_bar | rex field=_raw "host=\"(?<realhost>[^\"]+)\"" | fields _time, severity, program, message, realhost
  ]]></param>
  <param name="earliest">-24h</param>
  <module name="HiddenPostProcess" layoutPanel="panel_row2_col1_grp1">
    <param name="search"><![CDATA[
      search severity<4 | timechart span=5m count by severity
    ]]></param>
    <module name="HiddenChartFormatter">
      <param name="chart">area</param>
      <param name="primaryAxisTitle.text">time</param>
      <param name="secondaryAxisTitle.text">error count</param>
      <param name="legend.placement">none</param>
      <module name="JSChart">
        <param name="width">100%</param>
        <param name="height">300px</param>
      </module>
    </module>
  </module>
</module>
One of the big pitfalls of using postprocess searches is having a base search that returns pure, untransformed events. If you do this you will hit limits at the search API level, where the API caps how many raw events are stored; the assumption is that anyone doing large-scale reporting will have used transforming commands in the base search. Very often, though, people leave the base search as an events search and do all of their aggregation in the postprocess. This not only bumps up against those limits but also dramatically reduces the efficiency of the search.
What I recommend is having a base search of:
index=foo sourcetype=foo_bar | rex field=_raw "host=\"(?<realhost>[^\"]+)\"" | bin _time span=5min | stats count by _time severity program message realhost
and then a postprocess search of
search severity<4 | timechart span=5m sum(count) as count by severity
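Folded back into the module XML, that would look something like the sketch below (reusing the panel names and module nesting from your original; the chart formatting modules are unchanged):

```xml
<module name="HiddenSearch" layoutPanel="panel_row2_col1" autoRun="True">
  <param name="search"><![CDATA[
    index=foo sourcetype=foo_bar
    | rex field=_raw "host=\"(?<realhost>[^\"]+)\""
    | bin _time span=5min
    | stats count by _time severity program message realhost
  ]]></param>
  <param name="earliest">-24h</param>
  <module name="HiddenPostProcess" layoutPanel="panel_row2_col1_grp1">
    <param name="search"><![CDATA[
      search severity<4 | timechart span=5m sum(count) as count by severity
    ]]></param>
    <!-- HiddenChartFormatter and JSChart modules exactly as in the original -->
  </module>
</module>
```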
The differences between this and what you have are as follows:
a) Instead of using fields in the base search, I use stats count by _time foo bar baz. Note that this means the number of rows in the base search will be dramatically lower.
b) The postprocess search looks very similar except for the sum(count) as count part, which we need because the base search now contains partially aggregated rows. For example, if the base search emits two rows in the same 5-minute bucket for the same severity, with count=10 and count=7, sum(count) correctly yields 17, whereas a plain count would yield 2.
I know that for the purposes of asking the question you've probably omitted details about a few other postprocess searches hanging off this base search. However, there's always a way to fold in more dimensions with another "baz" in stats count by foo bar baz, or another stat, as in stats count sum(KB) by foo bar baz.
Also, if you go back to the postProcess documentation, or better, look at the UI Examples app (which has a bunch of postProcess examples) or at Sideview Utils (which has as good or better docs on the same topic), you'll see a fair amount of time spent explaining this pitfall. So you don't have to believe me; you can read it there. 😃
And lastly, I know that bumping up eventCount might theoretically work, but don't do that. That limit is set the way it is for good reasons, and taking the leash off can lead to nasty problems later.
Even if the number of rows ends up being fairly high, and even though you'll have to use stats sum(count) as count by message | sort - count instead of the simpler top message, you'll be in a much better place if the base search is aggregated with a stats count by.
As for installing Sideview Utils, note that any admin user can install the app by going to Manager > Apps > Install app from file.
Indeed, I am doing a lot more than just a count on those fields -- I am running rare against message (heavily normalized, with 75k errors there are probably 10k unique strings), top against a few of the fields, count by severity, etc. I hope to improve this dashboard further by making use of summary indexes (currently entirely nonfunctional in this app for some reason).
I haven't had a chance to check out Sideview Utils, but it comes highly recommended. I don't have the ability to install any files on the server, which limits me to whatever the Splunk web UI offers. Maybe in the future.
Ah ha, found it. HiddenSearch has a default max event count of 10000. Not sure how to get Splunk to report when that count is hit, but at least a fix is possible:
<param name="eventCount">10000000000000</param>
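For anyone else hitting this: the param goes directly inside the HiddenSearch module, alongside the search and earliest params. A sketch based on the module above (a more modest value than the one I pasted is probably wise, and per the comments below, whether the param is honored may depend on your Splunk version):

```xml
<module name="HiddenSearch" layoutPanel="panel_row2_col1" autoRun="True">
  <param name="search"><![CDATA[
    index=foo sourcetype=foo_bar | rex field=_raw "host=\"(?<realhost>[^\"]+)\"" | fields _time, severity, program, message, realhost
  ]]></param>
  <param name="earliest">-24h</param>
  <!-- raise the default 10000-event cap for this base search -->
  <param name="eventCount">100000</param>
  <!-- HiddenPostProcess modules as before -->
</module>
```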
I think it got updated to
This did work in Splunk 4, but does not work in Splunk 5.0.1.
It's not discarding them; it's just too much data to display on the chart.
If you try to chart
index=foo sourcetype=foo_bar | rex field=_raw "host=\"(?<realhost>[^\"]+)\"" | fields _time, severity, program, message, realhost | search severity<4 | timechart span=5m count by severity
Do you get the 'too much data' warning?
No, no warnings with that search (on flashtimeline). This is a pretty small index, FWIW -- 24 hours of results returns only about 74k events. Most of the timechart stuff I do works with much larger data sets.
Is there a way to tell how much is too much to show in a chart? A way to get Splunk to show warnings, etc, on the dashboard?