Solved: Search with several joins taking too long

minura · ‎11-24-2017

I am trying to create a dashboard of CPU and Memory usage from some log files. Because of the way the data is inputted into Splunk, I first need to do joins over several metrics for a single time in order to calculate a field for that time. I repeat this multiple times, once for each metric I want to calculate for each server. Each of these calculated metrics is plotted as a line on a chart. The query I have takes too long and does not scale (I need to plot more metrics and over a longer time range).

Currently:
Plotting 2 lines takes 4 seconds
Plotting 4 lines takes 8 seconds
Plotting 6 lines takes 30 seconds
Plotting 8 lines takes 10 minutes

In the job inspector, the only thing that takes a substantial amount of time is the join so my guess is that this is what causes the problem. For plotting 6 lines:
-dispatch.evaluate (10sec)
-dispatch.evaluate.join (10sec)

Is the way I am using the join incorrect? For the task I want to do, the time for the query to run should ideally be linear with the number of lines plotted whereas this is clearly not the case here. What can I do to make this run faster?

For context, the index stores about 600,000 events in total but only about 1% of the events are useful in my query.

This is my query:

index=logs "Cpu-user" Server=unixvs54 
| rename Value AS "Cpu-user-54" 
| table _time "Cpu-user-54" 
| join type=left _time 
    [ search index=logs  "Cpu-kernel" Server=unixvs54 
    | rename Value AS "Cpu-kernel-54" 
    | table _time "Cpu-kernel-54"] 
| join type=left _time 
    [ search index=logs  "Cpu-iowait" Server=unixvs54 
    | rename Value AS "Cpu-iowait-54" 
    | table _time "Cpu-iowait-54"] 
| eval "CPU-54 Use %"='Cpu-user-54'+'Cpu-kernel-54'+'Cpu-iowait-54' 
| join type=left _time 
    [ search index=logs  "Memory-phys mem" Server=unixvs54 
    | rename Value AS "Memory Total-54" 
    | table _time "Memory Total-54"] 
| join type=left _time 
    [ search index=logs  "Memory-free mem" Server=unixvs54 
    | rename Value AS "Memory Free-54" 
    | table _time "Memory Free-54"]
| eval "Memory-54 Use %"=100*(1-('Memory Free-54'/'Memory Total-54'))

|join type=left _time
    [search index=logs  "Cpu-user" Server=unixvs56 
    | rename Value AS "Cpu-user-56" 
    | table _time "Cpu-user-56"]
| join type=left _time 
    [ search index=logs  "Cpu-kernel" Server=unixvs56 
    | rename Value AS "Cpu-kernel-56" 
    | table _time "Cpu-kernel-56"] 
| join type=left _time 
    [ search index=logs  "Cpu-iowait" Server=unixvs56 
    | rename Value AS "Cpu-iowait-56" 
    | table _time "Cpu-iowait-56"] 
| eval "CPU-56 Use %"='Cpu-user-56'+'Cpu-kernel-56'+'Cpu-iowait-56' 
| join type=left _time 
    [ search index=logs  "Memory-phys mem" Server=unixvs56 
    | rename Value AS "Memory Total-56" 
    | table _time "Memory Total-56"] 
| join type=left _time 
    [ search index=logs  "Memory-free mem" Server=unixvs56 
    | rename Value AS "Memory Free-56" 
    | table _time "Memory Free-56"]
| eval "Memory-56 Use %"=100*(1-('Memory Free-56'/'Memory Total-56'))

|join type=left _time
    [search index=logs  "Cpu-user" Server=unixvs58 
    | rename Value AS "Cpu-user-58" 
    | table _time "Cpu-user-58"] 
| join type=left _time 
    [ search index=logs  "Cpu-kernel" Server=unixvs58 
    | rename Value AS "Cpu-kernel-58" 
    | table _time "Cpu-kernel-58"] 
| join type=left _time 
    [ search index=logs  "Cpu-iowait" Server=unixvs58 
    | rename Value AS "Cpu-iowait-58" 
    | table _time "Cpu-iowait-58"] 
| eval "CPU-58 Use %"='Cpu-user-58'+'Cpu-kernel-58'+'Cpu-iowait-58' 
| join type=left _time 
    [ search index=logs "Memory-phys mem" Server=unixvs58 
    | rename Value AS "Memory Total-58" 
    | table _time "Memory Total-58"] 
| join type=left _time 
    [ search index=logs "Memory-free mem" Server=unixvs58 
    | rename Value AS "Memory Free-58" 
    | table _time "Memory Free-58"]
| eval "Memory-58 Use %"=100*(1-('Memory Free-58'/'Memory Total-58'))

| table _time "CPU-54 Use %" "Memory-54 Use %" "CPU-56 Use %" "Memory-56 Use %" "CPU-58 Use %" "Memory-58 Use %"

This is what I'm trying to do (in case someone knows of a better query to do this)

Additional image for comment below (can't upload images in comment).

martin_mueller · ‎11-25-2017

That's about the least splunky way to solve this. Without knowing your data this might need tiny changes, but here's roughly what you need:

index=logs (Server=unixvs54 OR Server=unixvs56 OR Server=unixvs58)
  ("Cpu-user" OR "Cpu-kernel" OR "Cpu-iowait" OR "Memory-phys mem" OR "Memory-free mem")
| eval cpu_user = case(searchmatch("Cpu-user"), Value), eval cpu_kernel = case(searchmatch("Cpu-kernel"), Value), 
  eval cpu_iowait = case(searchmatch("Cpu-iowait"), Value), eval mem_total = case(searchmatch("Memory-phys mem"), Value), 
  eval mem_free = case(searchmatch("Memory-free mem"), Value)
| bin span=1m _time
| stats avg(cpu_user) as cpu_user avg(cpu_kernel) as cpu_kernel avg(cpu_iowait) as cpu_iowait avg(mem_total) as mem_total avg(mem_free) as mem_free by _time Server
| eval cpu_perc = cpu_user + cpu_kernel + cpu_iowait, mem_perc = 100 * (1 - (mem_free / mem_total))
| timechart avg(cpu_perc) as "CPU Use %" avg(mem_perc) as "Memory Use %" by Server

View solution in original post

martin_mueller · ‎11-25-2017

That's about the least splunky way to solve this. Without knowing your data this might need tiny changes, but here's roughly what you need:

index=logs (Server=unixvs54 OR Server=unixvs56 OR Server=unixvs58)
  ("Cpu-user" OR "Cpu-kernel" OR "Cpu-iowait" OR "Memory-phys mem" OR "Memory-free mem")
| eval cpu_user = case(searchmatch("Cpu-user"), Value), eval cpu_kernel = case(searchmatch("Cpu-kernel"), Value), 
  eval cpu_iowait = case(searchmatch("Cpu-iowait"), Value), eval mem_total = case(searchmatch("Memory-phys mem"), Value), 
  eval mem_free = case(searchmatch("Memory-free mem"), Value)
| bin span=1m _time
| stats avg(cpu_user) as cpu_user avg(cpu_kernel) as cpu_kernel avg(cpu_iowait) as cpu_iowait avg(mem_total) as mem_total avg(mem_free) as mem_free by _time Server
| eval cpu_perc = cpu_user + cpu_kernel + cpu_iowait, mem_perc = 100 * (1 - (mem_free / mem_total))
| timechart avg(cpu_perc) as "CPU Use %" avg(mem_perc) as "Memory Use %" by Server

martin_mueller · ‎12-01-2017

bin takes a field, in this case the time stamp, and converts it into discrete bins, in this case whole minutes.
transaction merges events together based on one or more fields.

Did you run my search?

Looking at your data screenshot, it would be much easier if you extracted fields for each type of measurement, e.g. if you extracted the number after Cpu-idle as a field called cpu_idle. Then you can use these fields in a timechart directly without all this hassle.

minura · ‎12-01-2017

Yes, I used your search. I'll also try extracting the fields the way you said. Thank you so much for your help!

martin_mueller · ‎11-30-2017

Why are you using transaction?
The stats and timechart in my search are split by Server, so no averages are computed across multiple servers.

If you'd post some of your data it'd be much easier to help you.

minura · ‎12-01-2017

I missed the by server in your query last time. Using bin the query works for almost 2 and a half weeks which is enough for me. Why is there this difference between using transaction and bin?

I've attached an image of a part of the raw data at the bottom of my original question above. There is a file like this generated every 15 minutes. Each row becomes an event in Splunk so there are a lot of events in total. I do not have permission to change the format of this data.
I have also attached images of properly generated rows in the Statistics tab (bottom most left image) and the empty rows (bottom most right) which are generated after a certain number of rows. (This only occurs after two weeks using bin so no longer is a problem. But it would be helpful to know why this happens).

minura · ‎11-30-2017

Thanks! Your query improves the search time by almost a factor of 10.

This query works perfectly when I set the time range to 1 day or less. However, running it on anything longer produces empty rows in the Statistics tab after about 1.5 days from the start time. The query I am using has a change from using bin to transaction because I need to group by both time and Server since I don't want to average the metrics for all servers.

Below is my exact query. (Note: The logs I get are every 15 minutes, so using span=15m ensures that the fields are single valued.) Do you know how I can fix the problem of empty rows?

index="logs" (Server=unixvs54 OR Server=unixvs56 OR Server=unixvs58 OR Server=unixvs60)
| eval cpu_user=case(searchmatch("Cpu-user"), Value), cpu_kernel=case(searchmatch("Cpu-kernel"), Value), cpu_iowait=case(searchmatch("Cpu-iowait"), Value), total_mem=case(searchmatch("Memory-phys mem"), Value), free_mem=case(searchmatch("Memory-free mem"), Value), disk_available=case(searchmatch("fs-available") AND match('Info',"^regex_search$"), Value), disk_used=case(searchmatch("fs-used") AND match('Info',"^regex_search$"), Value)
| transaction _time Server maxevents=-1
| eval total_cpu_%=cpu_user+cpu_kernel+cpu_iowait, mem_used_%=100*(1-free_mem/total_mem), disk_use_%=100*'disk_used'/('disk_used'+'disk_available')
| timechart span=15m max(total_cpu_%) max(mem_used_%) max(disk_use_%) by Server

minura · ‎11-30-2017

This problem seems to have to do with a limit on the number of events Splunk is processing. I filtered for only the required events at the start and now I get nearly 4 days worth of data but it would be better to get up to a week.

index="logs" (Server=unixvs54 OR Server=unixvs56 OR Server=unixvs58 OR Server=unixvs60) AND (Metric="Cpu-user" OR Metric="Cpu-kernel" OR Metric="Cpu-iowait" OR Metric="Memory-phys mem" OR Metric="Memory-free mem" OR Metric="fs-available" OR Metric="fs-used")
    | eval cpu_user=case(searchmatch("Cpu-user"), Value), cpu_kernel=case(searchmatch("Cpu-kernel"), Value), cpu_iowait=case(searchmatch("Cpu-iowait"), Value), total_mem=case(searchmatch("Memory-phys mem"), Value), free_mem=case(searchmatch("Memory-free mem"), Value), disk_available=case(searchmatch("fs-available") AND match('Info',"^regex_search$"), Value), disk_used=case(searchmatch("fs-used") AND match('Info',"^regex_search$"), Value) 
    | transaction _time Server maxevents=-1 
    | eval total_cpu_%=cpu_user+cpu_kernel+cpu_iowait, mem_used_%=100*(1-free_mem/total_mem), disk_use_%=100*'disk_used'/('disk_used'+'disk_available') 
    | table _time Server total_cpu_% mem_used_% disk_use_% 
    | timechart span=15m max(total_cpu_%) max(mem_used_%) max(disk_use_%) by Server

(Sorry about the block of code, can't seem to get syntax highlighting working.)

Search with several joins taking too long

Welcome to the Splunk Community!

Tech Talk | Elevating Digital Service Excellence: The Synergy of Splunk RUM & APM

Adoption of RUM and APM at Splunk