Splunk Search

How to optimize a search for a non-prefixed wildcard (field=*suffix)?

arkadyz1
Builder

I have data containing a field with a lot of values, almost every one of which is duplicated - a barcode, scanned in more than one place. Besides being part of a long field (a kind of combined tag) like <fixed-length prefix><%03d length><barcode of length-3>, that barcode does appear on its own in a couple of places in the same event, which hopefully means I can search on it quickly enough - though I'm not sure what format of search string would make that fastest. An event might contain several such barcodes, and the "main" one is the one from the combined tag I mentioned. I can't predict whether that barcode will appear as barcode1, barcode2 or barcode3 (the actual field names are different, but you get the idea), so even that is not straightforward.

However, there is a bigger problem: one of the search terms our application must support is a track ID, which is a trailing part of the barcode, extracted by some special rules (which I can easily put into an eval command, if necessary) and never appearing directly in the raw event. For example, we might have a barcode 310200549315, which will appear in barcode2 (barcode1 and barcode3 will have different values and be of no interest to our application) and be part of combined_tag=PRD015310200549315. The track ID, as a suffix of the barcode, might have a value of 49315, which is always only part of a token inside the event.
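
To make that concrete (the real rules are more involved; assume here, purely for the example, that the track ID is just the last five digits of the barcode), the search-time derivation would be something like:

...
| rex field=combined_tag "^(?<tag_prefix>[A-Z]+)(?<tag_len>\d{3})(?<tag_barcode>\d+)$"
| eval track_id=substr(tag_barcode, -5)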

Even if I succeed in creating a calculated field track_id, searching on it will be excruciatingly slow: a field computed at search time can't use the index, so Splunk has to scan and evaluate every event in the time range. And searching on *49315 (continuing my previous example) will be no better, since a leading wildcard defeats the indexed-token lookup as well.

My question is: how would you attack this problem? I'm ready to create an index-time track_id field, but can I put an EVAL somewhere in props/transforms to achieve that? Alternatively, is there a way to optimize the search for such situations?

1 Solution

gabriel_vasseur
Contributor

The index-time extraction is definitely an option, but that will take up a lot of resources and my understanding is it's not recommended unless it's a very important use case for you.

An in-between solution would be to create a simple data model, extracting exactly the fields you need, and accelerating it. If you mostly only do searches for recentish times (say in the last week or the last month) that will have a smaller impact (you can choose how long you accelerate the data model for: 1d, 7d, 30d, 3 months, 12 months, all time). The advantage is that you can wipe it out at any point and start over, since the acceleration is independent from the indexed raw data.
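
Once the data model is accelerated, | tstats searches run against the summary files rather than the raw events, so the leading-wildcard problem disappears: track_id is just an ordinary field in the summary. As a sketch (the data model and field names here are invented):

| tstats count from datamodel=Barcodes where Barcodes.track_id="49315" by Barcodes.barcode

Provided the time range falls within the acceleration window, that should come back quickly even over large volumes of data.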


arkadyz1
Builder

Thanks, Gabriel!
Your answer comes at the perfect time, as we are starting to drift towards data models and pivot tables vs. ad-hoc searches behind each dashboard.

As for field extractions - those I got just fine, having long experience with regexes in Perl and then Python. It's the lack of the actual term in the raw event that worried me. Accelerating the data model should speed that up - and yes, our searches mostly cover recent time. In fact, we can probably draw a line between recent and "stale" dashboards with our customers, and let them know that "stale" ones will have significantly higher search times.

Edit: Moved my response under your answer.


gabriel_vasseur
Contributor

Glad I could help! For tidiness (and karma!) I have copy-pasted my comment as an answer so that you can accept it. 🙂


gabriel_vasseur
Contributor


Whatever solution you choose, you'll have to get the field extractions working. We can all help with that, but we need at least an example of one whole raw event and a breakdown of the fields you want to extract, with the values expected for that particular example. Feel free to change names and values to protect the innocent, but keep the pattern the same.

woodcock
Esteemed Legend

If this is an important use case, then by all means do as much index-time extraction as you can for your key fields. You will immediately know the system impact, and if it is reasonable, you are good to go. If not, just ask for more budget for more indexers!
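
For reference, a minimal sketch of what that could look like (the sourcetype name, the regex, and the "last five digits" rule are all assumptions to adjust to your data):

# transforms.conf
[extract_track_id]
REGEX = combined_tag=\S+?(\d{5})\b
FORMAT = track_id::$1
WRITE_META = true

# props.conf
[your_sourcetype]
TRANSFORMS-trackid = extract_track_id

# fields.conf (tells the search head that track_id is an indexed field)
[track_id]
INDEXED = true

With that in place, track_id=49315 becomes a direct indexed-term lookup. Note it only applies to data indexed after the change.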


twinspop
Influencer

Sample logs would help 🙂


somesoni2
Revered Legend

And some sample searches that you tried. Splunk recommends search-time field extractions (extracted/calculated fields) over index-time field extractions, due to the overhead that index-time extraction adds to indexing and storage.
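
For completeness, the search-time version lives entirely in props.conf and costs nothing at index time (the sourcetype name and regex are placeholders; calculated fields are evaluated after field extractions, so EVAL-track_id can reference the extracted field):

# props.conf (search time)
[your_sourcetype]
EXTRACT-combined = combined_tag=(?<tag_prefix>[A-Z]+)(?<tag_len>\d{3})(?<tag_barcode>\d+)
EVAL-track_id = substr(tag_barcode, -5)

The trade-off is that a search on track_id=49315 still has to scan every event in the time range, which is exactly the slowness the question is about - hence the accelerated data model as the middle ground.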
