
Is an index-time extraction right for my situation?

walkeran
Explorer

I know this has been asked many times, and answered in splunkbase and in the documentation -- yet here I am, not sure if an index-time extraction would be right for our situation.

A little background

We have quite a few different web applications whose logs have an identical format, but the specific application cannot be determined by the actual event data -- except by the source. For example, the logs for app foo in the production environment are in /path/to/apps/foo/prod/log/*, and the logs for app bar in the beta environment are in /path/to/apps/bar/beta/log/*.

When searching Splunk, we almost always want to restrict the results to a specific application and environment. Over the course of a day there can be several million events in this index, and the app being searched for may account for only 100 or so of them. Up to this point, we've been including something like source="*/app/env/*" in each search, which works, and is quite fast, but is a bit cumbersome. Sometimes we're searching over a few apps and would like to have the app name and environment pulled out into a field. In this case we've used rex to our advantage, but again, it's a bit cumbersome (albeit fast) to have to add a rex to every search that does this.

What I tried

After absorbing the index-time vs search-time docs, and after reading quite a few questions regarding this subject, I came up with a search-time extraction:
(in transforms.conf on the indexer)

[inhouse_app]
REGEX = /path/to/apps/(?<app>[^/]*)/(?<env>[^/]*)/.*
SOURCE_KEY = source
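
To make the extraction concrete, here is a minimal Python sketch of what that regex pulls out of a source path. It is only an illustration of the pattern, not how Splunk runs it; the file name access.log and the helper extract_app_env are made up for the example, and Python spells named groups (?P<name>...) where the stanza uses (?<name>...).

```python
import re

# Same pattern as the stanza, rewritten with Python's named-group syntax.
SOURCE_RE = re.compile(r"/path/to/apps/(?P<app>[^/]*)/(?P<env>[^/]*)/.*")

# Hypothetical file names under the two directories mentioned in the question.
sources = [
    "/path/to/apps/foo/prod/log/access.log",
    "/path/to/apps/bar/beta/log/access.log",
]


def extract_app_env(source):
    """Return {'app': ..., 'env': ...} for a matching source path, else None."""
    match = SOURCE_RE.search(source)
    return match.groupdict() if match else None


for source in sources:
    print(source, "->", extract_app_env(source))
```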

Perfect! It works like a charm! So, we changed a bunch of our searches to use app=appname env=environment instead of source="*/appname/environment/*", and very quickly found out that our search performance had degraded to the point where it was almost unusable in quite a few instances. For example:

Search: index=myapps source="*/app1/prod/*"

This search has completed and has returned 14 results by scanning 14 events in 0.193 seconds.

Search: index=myapps app=app1 env=prod

This search has completed and has returned 14 results by scanning 198,502 events in 91.228 seconds.

[Screenshot: search inspector "execution cost" breakdown]
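
The gap between the two searches (14 events scanned versus 198,502) can be pictured with a toy model. The Python sketch below is an assumption-laden illustration, not a description of Splunk internals: it compares extracting app/env from every event at query time against looking the pair up in a table built once at ingest. All function names and the synthetic event list are hypothetical; only the event counts are borrowed from the searches above.

```python
import re
from collections import defaultdict

SOURCE_RE = re.compile(r"/path/to/apps/([^/]+)/([^/]+)/")


def make_events(n_noise, n_target):
    """Build (source, raw) pairs: many events for other apps, a few for app1/prod."""
    events = [(f"/path/to/apps/app{2 + i % 50}/prod/log/app.log", f"event {i}")
              for i in range(n_noise)]
    events += [("/path/to/apps/app1/prod/log/app.log", f"target {i}")
               for i in range(n_target)]
    return events


def search_time_filter(events, app, env):
    """Extract app/env from every event at query time, then filter."""
    scanned, hits = 0, []
    for source, raw in events:
        scanned += 1
        m = SOURCE_RE.search(source)
        if m and m.groups() == (app, env):
            hits.append(raw)
    return hits, scanned


def build_index(events):
    """Extract app/env once, at ingest, and record which events carry each pair."""
    index = defaultdict(list)
    for position, (source, _raw) in enumerate(events):
        m = SOURCE_RE.search(source)
        if m:
            index[m.groups()].append(position)
    return index


def indexed_lookup(events, index, app, env):
    """Touch only the events the index points at."""
    positions = index.get((app, env), [])
    return [events[p][1] for p in positions], len(positions)


events = make_events(198_488, 14)  # 198,502 events in total
hits, scanned = search_time_filter(events, "app1", "prod")
print("query-time extraction:", len(hits), "results,", scanned, "events scanned")

index = build_index(events)
hits, touched = indexed_lookup(events, index, "app1", "prod")
print("ingest-time lookup:", len(hits), "results,", touched, "events touched")
```

The regex runs the same number of times in both variants; the difference is that the second one pays that cost once, at ingest, instead of on every search.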

So, maybe that wasn't the way to go 😞 I also tried using tags, creating a separate tag for each source...but that's even more cumbersome. It works, but we have an ever-changing set of apps, and continuously messing around with the tags isn't something I necessarily want to do.

So...


While I continue to try to talk myself out of using an index-time extraction, it keeps seeming to me like the way to go. Thoughts?

(If there's any crucial information that I left out, feel free to ask for more -- I'd be more than happy to help you help me 😄 )

The Solution

Since the general consensus was that it would be acceptable to extract these fields at index-time, that's just what I did. I'm creating completely new fields (not overwriting any of the default ones, like sourcetype), and it is working like a charm. For posterity, here was how I accomplished this:

fields.conf

[app]
INDEXED = true
INDEXED_VALUE = false

[env]
INDEXED = true
INDEXED_VALUE = false

transforms.conf

[app_env]
SOURCE_KEY = MetaData:Source
REGEX = /path/to/apps/([^/]+)/([^/]+)/
WRITE_META = true
FORMAT = app::"$1" env::"$2"

props.conf

[my_sourcetype]
# ... other sourcetype related stuff
TRANSFORMS-appenv = app_env
1 Solution

alacercogitatus
SplunkTrust

I think in this case, index-time extraction is exactly what you need. Since you already have the source, but no defined sourcetype, I would write the sourcetype using transforms.conf.

transforms.conf

[inhouse_app]
REGEX = /path/to/apps/([^/]*)/([^/]*)/.*
# At index time the source path is read from the MetaData:Source key
SOURCE_KEY = MetaData:Source
WRITE_META = true
FORMAT = app::$1 env::$2
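
As a rough illustration of how the capture groups feed the FORMAT line, the Python sketch below expands an equivalent template against a sample path. It only mimics the substitution. The helper render_meta and the file name app.log are hypothetical, and Python's template syntax uses \1 and \2 where Splunk's FORMAT uses $1 and $2.

```python
import re

# Same pattern as the stanza above, with numbered capture groups.
SOURCE_RE = re.compile(r"/path/to/apps/([^/]*)/([^/]*)/.*")

# Python writes \1 and \2 where the Splunk FORMAT line writes $1 and $2.
FORMAT_TEMPLATE = r"app::\1 env::\2"


def render_meta(source):
    """Return the key::value string for a source path, or None if it does not match."""
    match = SOURCE_RE.search(source)
    return match.expand(FORMAT_TEMPLATE) if match else None


print(render_meta("/path/to/apps/bar/beta/log/app.log"))  # app::bar env::beta
```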

Then your searches look like this: index=myapps app=app1 OR env=beta

You could change the format to whatever you want based on your REGEX and FORMAT directives. I think the benefits of index-time are justified here, and the "extra" processing offsets the search-time cost.


walkeran
Explorer

Drainy: I'm not sure I follow, because if I were to remove that entire search-time regex, then I can't search on those fields to actually perform the experiment. Or, am I missing something really obvious about what you are proposing I try... 🙂


Drainy
Champion

If you disable that one search time regex you created, does the search time improve?

Drainy
Champion

Have you looked at the search inspector to see where the time delay is introduced? It's possible it has just highlighted an under-spec'ed machine.

If it's a field you are likely to search on in almost every search then yes, it's worth adding it. There are trade-offs in the form of increased bucket sizes plus the increased time to search indexed fields, but in this case it should help. However... if it struggled that badly to apply the regex at search time, then you need to consider how it would cope when applying this at index time, hence I come back to my above point about the spec of the machine 🙂

Also, perhaps try using this regex instead:

/path/to/apps/(?<app>[^/]+)/(?<env>[^/]+)
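
For anyone comparing the two patterns, here is a small Python sketch of how they differ on edge cases. The sample paths and the helper extract are made up for illustration, and the named groups use Python's (?P<name>...) spelling.

```python
import re

# Original pattern: '*' allows empty segments and a trailing '/.*' is required.
STAR = re.compile(r"/path/to/apps/(?P<app>[^/]*)/(?P<env>[^/]*)/.*")
# Suggested pattern: '+' requires at least one character per segment.
PLUS = re.compile(r"/path/to/apps/(?P<app>[^/]+)/(?P<env>[^/]+)")


def extract(pattern, source):
    """Return the named groups for a matching source path, else None."""
    m = pattern.search(source)
    return m.groupdict() if m else None


normal = "/path/to/apps/foo/prod/log/app.log"
empty_app = "/path/to/apps//prod/log/app.log"  # hypothetical malformed path
no_tail = "/path/to/apps/foo/prod"             # nothing after the env segment

for label, source in [("normal", normal), ("empty app", empty_app), ("no tail", no_tail)]:
    print(label, "| star:", extract(STAR, source), "| plus:", extract(PLUS, source))
```

On a well-formed path both patterns yield the same fields. The '+' version refuses an empty app segment and does not need anything after the environment directory.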

Drainy
Champion

Sorry, what I meant was: if you remove the whole stanza you created to pull out app/environment, does it improve the search time? It could be another search-time extraction that's going wonky and taking up all the time.


walkeran
Explorer

I tried removing that bit of the regex (I assume you mean just the last extraneous ".*"), and it helped a little, but not a whole lot.

This server is kind of a do-it-all box as far as Splunk is concerned, being both the search head and the only indexer. There's not a whole lot of searching going on, and it's indexing about 1.5GB/day, almost all of which is in the index that this question is regarding.


Drainy
Champion

It basically says it's all on search-time extractions, but this could mean any of your search-time extractions. That's why I wonder if it's worth trying again after you remove that one regex. How much data is it indexing per day? Are all searches run on this machine? If it's all fairly low then yeah, that should be a fine spec 🙂


walkeran
Explorer

I have a hard time believing that the machine is under-spec'd, but I could be wrong. It's a 12-core 2.8GHz, with far too much RAM, and lotsa spindles (with not much else going on).

I tried with the other regex you provided, and it shaved a little time off (thank you :P), but not of the magnitude that I was lookin' for.

I checked the search inspector out, and about all I was able to determine was that I don't really know how to read it 😄 I did edit my question, and I included a screenshot of the "execution cost" breakdown.


walkeran
Explorer

I did leave a bit out of the above explanation, and we are actually already using the sourcetype to differentiate between a couple of formats. Sourcetype is also being used for a couple of LINE_BREAKER and EXTRACT options in props.conf, so I don't think re-using that is going to work for us. I am however moving forward with implementing the two index-time extractions. I'll comment back when I know how things are going!
