Hello,
Since we upgraded from Splunk 5 to Splunk 6.2.4, some of our searches run 10 to 20 times slower than before. It happens where we use a complex macro with a lot of case branches. It seems, Splunk now spends more time on case branches that aren't evaluated.
So the performance of
index=something sourcetype=sourcetype1 | eval mode=case(
sourcetype=="sourcetype1", some string logic,
true(), "undefined")
is very different from
index=something sourcetype=sourcetype1 | eval mode=case(
sourcetype=="sourcetype1", some string logic,
sourcetype=="sourcetype2", some string logic,
sourcetype=="sourcetype3", some string logic,
sourcetype=="sourcetype4", some string logic,
sourcetype=="sourcetype5", some string logic,
...
true(), "undefined")
This kind of logic is now much much slower on Splunk 6 for us. Is this expected, or is there something we could tune? We would need to redo a lot of summary index definitions that use a macro with this kind of search, where we gather information from various sourcetype in a single summary index.
Hi,
I actually was mistaken, and the increase in runtime was not due to number of case branches. It was due to a specific case, and it boiled down to:
... ,like(http_request,"%/jobs%") AND action=“run"
Splunk hates me for using the fieldname "action" (which should be created by auto field extraction). I had to rewrite this as
..., like(http_request,"%/jobs%") AND like(http_request,"%action=run%")
It took me quite a while to get there. And I have no clue why this fixed it. Is "action" some kind of special field now in Splunk 6? Anyway, removing relying on "action" as fieldname brought back performance to normal level. Using the case statement in original form drops performance by 10 to 20 times even if the case branch is not used. Really strange.
What is the difference in the normalized search
between the two?
This is not really fitting the title of my original question as I mistakenly blamed case for the performance loss, but here is a little test for me. I stripped down a sample search and the macro to do something minimal, albeit not very meaningful.
Original:
index=lbsp sourcetype=lbsp_regeo_access | eval mode=case(like(http_request,"%/jobs%") AND action="run","jobs-submit",true(),"other") | stats count by mode
Normalized:
litsearch index=lbsp sourcetype=lbsp_regeo_access | eval mode=case(like(http_request,"%/jobs%") AND action="run","jobs-submit",true(),"other") | addinfo type=count label=prereport_events | fields keepcolorder=t "mode" "prestats_reserved_*" "psrsvd_*" | prestats count by mode
-> This search has completed and has returned 1 result by scanning 4,460,539 events in 141.771 seconds.
Modified not to use action=
index=lbsp sourcetype=lbsp_regeo_access | eval mode=case(like(http_request,"%/jobs%") AND like(http_request,"%action=run%"),"jobs-submit",true(),"other") | stats count by mode
Normalized:
litsearch index=lbsp sourcetype=lbsp_regeo_access | eval mode=case(like(http_request,"%/jobs%") AND like(http_request,"%action=run%"),"jobs-submit",true(),"other") | addinfo type=count label=prereport_events | fields keepcolorder=t "mode" "prestats_reserved_*" "psrsvd_*" | prestats count by mode
-> This search has completed and has returned 1 result by scanning 4,460,539 events in 8.684 seconds.
Oh, I just tried anf changed action="run" to bla="run"
index=lbsp sourcetype=lbsp_regeo_access | eval mode=case(like(http_request,"%/jobs%") AND bla="run","jobs-submit",true(),"other") | stats count by mode
-> This search has completed and has returned 1 result by scanning 4,460,539 events in 6.381 seconds.
None of the events on that sourcetype have action=... or bla=... in any request. So it's really the use of "action" that makes it slow.
Make this even more basic and let it run vs _internal:
index=_internal sourcetype=splunkd_access | eval foo=if(isnull(action),"foo","bar") | stats count by foo
-> This search has completed and has returned 1 result by scanning 9,280,747 events in 162.836 seconds.
index=_internal sourcetype=splunkd_access | eval foo=if(isnull(bla),"foo","bar") | stats count by foo
This search has completed and has returned 1 result by scanning 9,280,747 events in 4.126 seconds.
Lessons learnt: avoid the usage of a field named "action" I guess. And I didn't see that in Splunk 5.
You should edit the subject line of your question (look for the gear icon on top) and change it to an appropriate string and then convert the previous comment to an Answer and then Accept it.
You can use the job inspector
to examine the normalized search
and this will show you the intense optimizations that Splunk is trying to do on your search in v6.2.4. Hopefully you still have an instance of v5 (or can stand one up) to see how it was working in the earlier release and then compare the two. You might be AMAZED at the CRAZY stuff they do with reverse lookups, etc. If the slowdown is indeed as a result of bad optimization (which would not at all surprise me), you can disable the optimization by piping your first search to a second partly duplicated search like this:
index=something sourcetype=sourcetype1 | search sourcetype=sourcetype1 | eval mode=case(
sourcetype=="sourcetype1", some string logic,
sourcetype=="sourcetype2", some string logic,
sourcetype=="sourcetype3", some string logic,
sourcetype=="sourcetype4", some string logic,
sourcetype=="sourcetype5", some string logic,
...
true(), "undefined")
Again, if you compare the normalized search
of this version of your search to the other version, it will probably not contain the crazy optimizations and will definitely run differently (hopefully more quickly, like it did before).
For more information, check out this question:
http://answers.splunk.com/answers/256125/why-does-splunk-add-junk-to-the-front-of-a-search.html