Splunk Search

pushing search results through parallel pipelines

andersmholmgren
Explorer

I have a set of related metrics I need to produce over a set of data.

The initial part of the search looks something like:

index=foo | bucket _time span=1h | stats count, median(duration), max(duration) by _time, blah

Then I want to roll that up in different ways to produce the metrics. For example:

1/ to get the peak count (and the hour it occurred):

| sort count desc | head 1

2/ to get the average median duration:

| stats avg("median(duration)")

etc

This works if I do separate full queries for each metric, but it is very inefficient as it repeats the main search each time.

I would like to reuse the result set of the main search. I can sort of get this to work by using appendpipe

i.e.
index=foo | bucket _time span=1h | stats count, median(duration), max(duration) by _time, blah
| appendpipe [sort count desc | head 1]
| appendpipe [stats avg("median(duration)")]

but of course it puts these rows at the end of the main result set. I can filter the results at the end to drop the rows from the initial search, but this is ugly and inefficient.
And it breaks down when I introduce my next requirement: I need to repeat this process over different timespans, i.e. produce the same metrics rolled up for the last year, month and day. Given that the month is within the last year, and so is the day, I wanted to have one year's worth of results that I then segment and crunch metrics from (e.g. | where _time >= relative_time(now(), "@mon") and then crunch the metrics for the last month). Repeat for year and day.
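
To make that workaround concrete, the tag-and-filter version I have in mind looks roughly like this (the med_dur, metric and value names are just placeholders I've made up):

index=foo | bucket _time span=1h
| stats count, median(duration) as med_dur, max(duration) as max_dur by _time, blah
| appendpipe [sort count desc | head 1 | eval metric="peak_count"]
| appendpipe [where isnull(metric) | stats avg(med_dur) as value | eval metric="avg_median"]
| where isnotnull(metric)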

Is there another way to do this? i.e. pass one result set through multiple pipes and aggregate the results?


andersmholmgren
Explorer

Thanks Bob.

I followed the instructions and the resulting performance is acceptable. I had planned to set up summary indexing later, but I was getting surprisingly bad performance without it when I had only 14 test events in the system! I figured that even with summary indexing (on the real data set) the input data size from the summary index would be way over 14 events, so I would hit the same problem later.

It turned out, though, that the response time for the query against a summary-indexed form of the real data set was not much longer than for the query against my raw data set of 14 events.

So I take it then that there is no way to pipe a result set through multiple parallel paths and then aggregate the results?

i.e. instead of something like

index=foo <processing for metric1> | ... | append [search index=foo <processing for metricN>]

you can't instead do something like

index=foo <processing for initial result set> | set union [<processing for metric1>] ... [<processing for metricN>]

i.e. where each of the subsearches in 'set' works off the initial result set (as with appendpipe) rather than from scratch?
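
(For what it's worth, the distinction I'm drawing, in pseudocode:

index=foo <initial result set> | append [search index=foo <processing for metricN>]
index=foo <initial result set> | appendpipe [<processing for metricN>]

where append kicks off a whole new search against the index, while appendpipe runs its subpipeline over the results already in the pipeline. What I'm after is the appendpipe behaviour, but with the subsearch outputs returned on their own rather than appended to the initial rows.)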


BobM
Builder

The first thing I would recommend is summary indexing. If you are running any search over a year's worth of data, it will be slow.

I find that running 365*24 one-hour backfill runs is slower than 365 one-day runs, so first of all create a summary search that runs every day at midnight for the previous day:

index=foo | bucket _time span=1h | sistats count, median(duration), max(duration) by _time, blah
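
If it helps, the daily schedule for that summary search could look roughly like this in savedsearches.conf (a sketch only; substitute your own stanza name and summary index):

[MYSEARCH]
search = index=foo | bucket _time span=1h | sistats count, median(duration), max(duration) by _time, blah
enableSched = 1
cron_schedule = 0 0 * * *
dispatch.earliest_time = -1d@d
dispatch.latest_time = @d
action.summary_index = 1
action.summary_index._name = summary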

Then run the backfill script. This will take some time but will only ever need to be run once:

./splunk cmd python fill_summary_index.py -app YOURAPP -name "*" -et -y@mon -lt @d

Then edit the summary search to run at five past every hour over the previous hour, and re-run the backfill script with -et @d -lt -1s -dedup true.
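
The hourly version of that schedule would then be along these lines (again just a sketch):

cron_schedule = 5 * * * *
dispatch.earliest_time = -1h@h
dispatch.latest_time = @h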

You should then have a summary index full of data, so amend your searches to start with:

index=summary search_name=MYSEARCH | stats count, median(duration), max(duration) by _time, blah
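
For example, the "peak hour" metric from the question could then be rolled up over the last month along these lines (a sketch, reusing the field names from the searches above; adjust the time range as needed):

index=summary search_name=MYSEARCH earliest=-1mon
| stats count, median(duration), max(duration) by _time, blah
| sort count desc
| head 1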

Next, if you are presenting this on a dashboard and running searches over a shorter time range, I recommend you look at post-processing: http://docs.splunk.com/Documentation/Splunk/latest/Developer/FormSearchPostProcess
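
In more recent Simple XML, the same idea is expressed with a base search and post-process searches, roughly like this (ids, labels and panel layout are illustrative only):

<dashboard>
  <label>Duration metrics</label>
  <search id="base">
    <query>index=summary search_name=MYSEARCH | stats count, median(duration), max(duration) by _time, blah</query>
    <earliest>-1mon</earliest>
    <latest>now</latest>
  </search>
  <row>
    <panel>
      <table>
        <title>Peak hour</title>
        <search base="base">
          <query>sort count desc | head 1</query>
        </search>
      </table>
    </panel>
  </row>
</dashboard>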
