Splunk Search

How does data feed into external lookup scripts? Can I multithread the API calls in my script to speed it up?

thisissplunk
Builder

I have an external lookup script that takes in a username from our Splunk events then uses an API call to go and grab the user's phone number from an external source. This can take forever when you want to run it against 1,000 unique usernames. In fact it takes about 15 minutes for that amount.

Assuming my API calls per minute are effectively unlimited, do lookups work in a way where I can send out multiple calls at once to greatly reduce the time involved? I'm hoping to chunk the usernames up and do at least 5 at a time instead of one.

This would be awesome if so!

0 Karma
1 Solution

thisissplunk
Builder

I've found there is nothing stopping you from using multithreading in a lookup script. One thing to note is how Splunk actually feeds the script: it streams or chunks results in via stdin rather than handing over every result at once. This is important because the restart between chunks prevents queues and threads from getting out of control, as Splunk seems to reset variables/threads/queues/objects each time it chunks data in. I'm guessing it's actually starting the process again, but I'm not sure.

I implemented exactly what is on the Python Queue documentation page. It reduced 2,000 URL API calls from 33 minutes to 30 seconds.

One final note on tuning the script for Splunk: using queues almost doubles the time it takes to complete the job. That's still much faster than single-threaded, but if you remove the queue entirely and just create a thread for each line from stdin, it's much faster (target=your_fun_function instead of the queue worker). Safer? Not so sure, as queues are thread-safe and take the guesswork out of multithreading for you (locks, mutexes, etc.). Also, it's only slower in Splunk, not when I test with cat. This leads me to believe the way Splunk feeds data to the script is different than a normal pipe (again, chunking it in and maybe restarting the process fresh).
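A minimal sketch of that queue-worker pattern, assuming a hypothetical fetch_phone_number() in place of the real API call and username/phone as the lookup fields (Splunk sends both column names in the CSV header, so the output column is already in reader.fieldnames):

    import csv
    import sys
    import threading
    from queue import Queue

    NUM_WORKERS = 5
    q = Queue()
    results = {}                      # username -> phone number
    results_lock = threading.Lock()

    def fetch_phone_number(username):
        # Hypothetical stand-in for the real API call.
        return "555-0100"

    def queue_worker():
        # Pop usernames off the queue and do the real work on each one.
        while True:
            username = q.get()
            phone = fetch_phone_number(username)
            with results_lock:
                results[username] = phone
            q.task_done()

    reader = csv.DictReader(sys.stdin)
    rows = list(reader)

    for _ in range(NUM_WORKERS):
        threading.Thread(target=queue_worker, daemon=True).start()

    for row in rows:
        q.put(row["username"])
    q.join()                          # block until every username is looked up

    writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in rows:
        row["phone"] = results.get(row["username"], "")
        writer.writerow(row)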

0 Karma

thisissplunk
Builder

Found my answer: yes, there is nothing keeping you from multithreading in a Python external lookup. It's still just a normal script. A general caveat for external lookups is that the script reads CSV from stdin and writes CSV to stdout, which also means you can test it outside Splunk like so:

cat testdata | ./awesomescript.py test_column new_column
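For reference, a bare-bones single-threaded skeleton of that stdin/stdout contract (lookup() is a hypothetical stand-in for the real per-row work; your testdata file needs a CSV header containing both column names, matching what Splunk sends):

    import csv
    import sys

    def lookup(value):
        # Hypothetical stand-in for the real per-row work.
        return value.upper()

    # Splunk passes the input and output column names as arguments,
    # e.g. ./awesomescript.py test_column new_column
    input_field, output_field = sys.argv[1], sys.argv[2]

    reader = csv.DictReader(sys.stdin)
    writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row[output_field] = lookup(row[input_field])
        writer.writerow(row)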

For the actual multithreading, I implemented something almost identical to the example at the bottom of the Python Queue documentation. The basic idea:

  1. Start x threads targeting the queue worker (popper) function
  2. Inside the queue worker function, call the function that does the true work. The queue worker then waits for the queue to fill up so it can pop items off and send them to the fun function.
  3. Send your data to the queue so that the worker can grab it (the for loop and q.put() command). Could be as simple as:

    for line in sys.stdin:
        q.put(line)

I did notice much worse performance using a queue when in Splunk itself compared to locally testing it with cat. Removing the queue and creating an unsafe set of threads made it sometimes twice as fast in Splunk. This might be somewhat acceptable depending on what you're doing and how often your threads might step on each other or error out.
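A minimal sketch of that queue-less, thread-per-row variant (fetch_phone_number() and the username/phone field names are again hypothetical; note there is no bound on how many threads this spawns, so it only makes sense if the API truly has no rate limit):

    import csv
    import sys
    import threading

    def fetch_phone_number(username):
        # Hypothetical stand-in for the real API call.
        return "555-0100"

    def fill_row(row):
        # Each thread writes only to its own row dict, so no lock is used.
        row["phone"] = fetch_phone_number(row["username"])

    reader = csv.DictReader(sys.stdin)
    rows = list(reader)

    # One thread per row, no queue in between.
    threads = [threading.Thread(target=fill_row, args=(row,)) for row in rows]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(rows)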

Final caveat: I heard from someone once that lookups technically take in stdin in chunks. This makes sense because Splunk can never know everything it will pass to the lookup beforehand, if I understand it correctly. If it waited for all of the results, it might be trying to pass 20,000,000 events at once!

0 Karma

lcrielaa
Communicator

The only thing I can find that would help in some way is the following from transforms.conf:

batch_index_query = <bool>
* For large file based lookups, this determines whether queries can be grouped to improve 
  search performance.
* Default is unspecified here, but defaults to true (at global level in limits.conf)

allow_caching = <bool>
* Allow output from lookup scripts to be cached
* Default is true

The first one only works for file-based lookups, so since you're using an external script it won't help you much, and the second one doesn't do anything for new requests. You'll either have to implement the logic in the external lookup script itself or change your approach altogether.
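For context, both settings live in the lookup's transforms.conf stanza, something like the sketch below (the stanza name and field names are hypothetical; the script name comes from the test command earlier in the thread):

    [phone_lookup]
    external_cmd = awesomescript.py username phone
    fields_list = username, phone
    allow_caching = true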

How often do people's phone numbers change? Can you request the whole list and use that as a file-based lookup in Splunk, perhaps updating the full list once a day? Even with a million-line CSV as a file-based lookup, Splunk will still enrich the data without breaking a sweat.
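A minimal sketch of that daily-refresh approach, assuming a hypothetical directory API, response shape, and lookup file path; you'd schedule it once a day with cron or a Splunk scripted input:

    import csv
    import json
    import urllib.request

    API_URL = "https://directory.example.com/api/users"   # hypothetical endpoint
    LOOKUP_PATH = "/opt/splunk/etc/apps/search/lookups/phone_numbers.csv"

    # Assumes the API returns [{"username": ..., "phone": ...}, ...]
    with urllib.request.urlopen(API_URL) as resp:
        users = json.load(resp)

    with open(LOOKUP_PATH, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["username", "phone"])
        writer.writeheader()
        for user in users:
            writer.writerow({"username": user["username"], "phone": user["phone"]})

Once the CSV is defined as a file-based lookup, enriching events at search time is a plain | lookup phone_numbers username.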

0 Karma

thisissplunk
Builder

Hi, thanks for the reply.

When you say "implement the logic in the external lookup script itself...", are you referring to the caching or the multithreading aspect? I'm trying to grasp whether threading is even possible at this point. I think I heard that data is chunked into the lookup script, which makes me think there might be a possibility of multithreading the chunks.

As for the phone numbers changing, this was just an example; the real data will be changing all the time. Using the example: imagine that people update their phone numbers once a day and we want to keep track of 10,000 people's numbers.

0 Karma