Splunk's response time is quite slow when I use the lookup script presented below. The web service the script calls is fast most of the time, under 1ms per request. What should I look at to improve the performance in Splunk?
This is what the job inspector says; "dispatch.preview" takes most of the time:
Job inspector Results:
Duration (seconds) Component Invocations Input count Output count
0.003 command.addinfo 3 9,352 9,352
0.003 command.fields 3 9,352 9,352
0.005 command.lookup 1 3,240 3,240
0.07 command.prestats 3 9,352 3,240
0.327 command.remotetl 3 9,352 9,352
0.899 command.search 3 - 9,352
0.569 command.search.kv 1 - -
0.243 command.search.typer 3 9,352 9,352
0.067 command.search.rawdata 1 - -
0.009 command.search.calcfields 1 9,352 9,352
0.006 command.search.filter 1 - -
0.006 command.search.index 3 - -
0.002 command.search.tags 3 9,352 9,352
0.001 command.search.fieldalias 1 9,352 9,352
0.001 command.search.lookups 1 9,352 9,352
0.136 command.stats 1 3,240 3,240
0.048 dispatch.createProviderQueue 1 - -
0.169 dispatch.evaluate 1 - -
0.091 dispatch.evaluate.lookup 1 - -
0.077 dispatch.evaluate.search 1 - -
0.001 dispatch.evaluate.stats 1 - -
1.293 dispatch.fetch 4 - -
**461.84 dispatch.preview 1 - -**
0.013 dispatch.process_remote_timeline 1 110,422 9,352
1.292 dispatch.stream.local 3 - -
0.022 dispatch.timeline 4 - -
Splunk Query:
earliest=-1d@d latest=@d index="si_usage" | stats sum(count) by entity | lookup entityid_lookup entity as entity OUTPUT entity_title as entity_title
Script:
import urllib, urllib2, sys, csv, time, logging
from lxml import etree

def lookup(entity):
    try:
        # URL-encode the entity so special characters don't break the query string
        url = ("http://fqdn/rex/select/?q=" + urllib.quote(entity) +
               "&version=2.2&start=0&rows=10&indent=on&fl=title&application=agent.007")
        f = urllib2.urlopen(url)
        doc = etree.XML(f.read())
        # xpath() returns a list of text nodes; take the first title, if any
        titles = doc.xpath("//str[@name='title']/text()")
        if not titles:
            return "Entity Title Not Found"
        return titles[0]
    except Exception:
        return ""

def main():
    logging.basicConfig(filename='lookup.log', level=logging.INFO,
                        format='%(asctime)s %(message)s')
    start_time = time.time()
    logging.info('Script started')
    entity = sys.argv[1]
    entity_title = sys.argv[2]
    r = csv.reader(sys.stdin)
    w = None
    first = True
    for line in r:
        if first:
            header = line
            if entity not in header or entity_title not in header:
                print "entity and entity_title fields must exist in CSV data"
                sys.exit(0)
            csv.writer(sys.stdout).writerow(header)
            w = csv.DictWriter(sys.stdout, header)
            first = False
            continue
        # map the row onto the header, padding short rows with empty strings
        result = {}
        for i in range(len(header)):
            result[header[i]] = line[i] if i < len(line) else ''
        if result[entity] and result[entity_title]:
            w.writerow(result)
        elif result[entity]:
            result[entity_title] = lookup(result[entity])
            w.writerow(result)
    logging.info('Script finished in %s seconds', time.time() - start_time)

if __name__ == '__main__':
    main()
I am skeptical of the performance of your scripted lookup. I don't think you could set up a TCP session (including DNS lookup time), make the service call, and get a response back in 1ms. Ayn's advice to use the search job inspector is good. In conjunction with it, you might (as a test) try running your search as follows:
earliest=-1d@d latest=@d index="si_usage"
| stats sum(count) by entity
| head 1
| lookup entityid_lookup entity as entity OUTPUT entity_title as entity_title
This checks what the minimum cost of the lookup is. Then increase the number of items being looked up and see how your runtime scales.
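If you want to take Splunk out of the equation entirely, something like this (a rough sketch of mine; entity_lookup.py and the sample entity IDs are placeholder names) feeds the script an N-row CSV on stdin the same way Splunk invokes it, so you can time the script alone:

# sketch: pipe an N-row CSV (header + one row per event) into the lookup
# script exactly the way Splunk does, and time the whole batch
import subprocess, sys, time

n = int(sys.argv[1]) if len(sys.argv) > 1 else 10
rows = "entity,entity_title\n" + "".join("id%d,\n" % i for i in range(n))

start = time.time()
p = subprocess.Popen(["python", "entity_lookup.py", "entity", "entity_title"],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = p.communicate(rows)
print "%d rows in %.3f sec" % (n, time.time() - start)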
If I run the script by itself, the average response time is always around 50ms. If I use it as a lookup script, the response time is around 77ms, so Splunk is adding ~27ms. I need the response time to be ~60ms. Results using head:
head 1:
2012-08-28 17:23:09 Script finished in 0.047 sec
head 2:
2012-08-28 17:23:47 Script finished in 0.084 sec
head 3:
2012-08-28 17:24:07 Script finished in 0.122 sec
head 4:
2012-08-28 17:24:30 Script finished in 0.185 sec
head 5:
2012-08-28 17:33:13 Script finished in 0.392 sec
head 10:
2012-08-28 17:56:11 Script finished in 0.815 sec
head 100:
2012-08-28 17:54:12 Script finished in 7.470 sec
To determine the cost of the script, you need to record the time at the beginning and the end of the script for each run, and log all of these times. Once you have measured the cumulative time across every invocation of the script from a single search run, you'll have a better idea whether the cost is elsewhere.
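As a rough sketch (the wrapper function is my addition, reusing lookup() and the imports from the script above), you could also time each web service call separately from the CSV handling, so the two costs show up separately in the log:

# sketch: per-call timing wrapper around lookup() from the script above,
# so each web-service round trip is logged apart from total script runtime
import time, logging

def timed_lookup(entity):
    t0 = time.time()
    title = lookup(entity)  # lookup() as defined in the script above
    logging.info('lookup(%r) took %.4f sec', entity, time.time() - t0)
    return title

Then call timed_lookup(result[entity]) in main() instead of lookup(result[entity]).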
I would be skeptical too; I was not explicit enough. Let me elaborate based on your recommendation:
Job Inspector reported:
This search has completed and has returned 1 result by scanning 9,352 events in 2.044 seconds.
Of those 2.044 seconds, dispatch.preview took 1.275 seconds.
The response time of the server that receives the REST request was QTime=1ms. The DNS lookup time to resolve the FQDN was:
real 0m0.005s
user 0m0.000s
sys 0m0.005s
Even adding QTime and the DNS lookup together, we are still under 10ms. Why is dispatch.preview taking so much time?
What do you suggest?
Thanks,
Lp
Splunkweb includes an excellent utility for debugging your searches in terms of exactly what in the search took how much time. It's available in the Jobs list under the "Inspect" link.
I'm as skeptical as dwaddle is regarding the 1ms response time, so I'm thinking your lookup needs some optimization before things will speed up.
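One optimization worth trying, sketched below using the same placeholder host and query string as your script: reuse a single HTTP connection for all lookups instead of letting urllib2.urlopen() set up a new TCP session (and possibly a DNS lookup) per request.

# sketch: a drop-in replacement for lookup() that keeps one HTTP
# connection open across all requests in a run
import httplib, urllib
from lxml import etree

conn = httplib.HTTPConnection("fqdn")  # same placeholder host as in the script

def lookup(entity):
    try:
        conn.request("GET", "/rex/select/?q=" + urllib.quote(entity) +
                     "&version=2.2&start=0&rows=10&indent=on&fl=title&application=agent.007")
        titles = etree.XML(conn.getresponse().read()).xpath("//str[@name='title']/text()")
        return titles[0] if titles else "Entity Title Not Found"
    except Exception:
        # a production version would reconnect here if the server closed the connection
        return ""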
dispatch.preview is the offender. According to the Splunk documentation, it is the time spent generating preview results.
The job inspector reported the following result set for a one-day sample:
This search has completed and has returned 3,240 results by scanning 9,352 events in 466.573 seconds.
Of those 466.573 seconds, dispatch.preview took 465.081 seconds.
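To rule preview generation out, I may try dispatching the same search as a blocking one-shot job over the REST API, which, as far as I understand, returns only final results and so skips preview generation (host, port, and credentials below are placeholders):

# sketch: run the search with exec_mode=oneshot via the Splunk REST API,
# so only final results are produced and no previews are requested
import urllib, urllib2, base64

search = ('search earliest=-1d@d latest=@d index="si_usage" '
          '| stats sum(count) by entity '
          '| lookup entityid_lookup entity as entity OUTPUT entity_title as entity_title')
req = urllib2.Request("https://localhost:8089/services/search/jobs",
                      urllib.urlencode({"search": search, "exec_mode": "oneshot"}))
req.add_header("Authorization", "Basic " + base64.b64encode("admin:changeme"))
print urllib2.urlopen(req).read()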