I know this has been asked many times, and answered in splunkbase and in the documentation -- yet here I am, not sure if an index-time extraction would be right for our situation.
We have quite a few different web applications whose logs have an identical format, but the specific application cannot be determined by the actual event data -- except by the source. For example, the logs for app foo in the production environment are in /path/to/apps/foo/prod/log/*, and the logs for app bar in the beta environment are in /path/to/apps/bar/beta/log/*.
When searching Splunk, we almost always want to restrict the results to a specific application and environment. Over the course of a day, there can be multiple millions of events in this index, and the app being searched for may account for only 100 or so of them. Up to this point, we've been including something like source="*/app/env/*" in each search, which works, and is quite fast, but is a bit cumbersome. Sometimes we're searching over a few apps, and would like to have the app name and environment pulled out into fields -- in this case we've used rex to our advantage, but again, it's a bit cumbersome (albeit fast) to have to add a rex to every search that does this.
After absorbing the index-time vs search-time docs, and after reading quite a few questions regarding this subject, I came up with a search-time extraction:
(in transforms.conf on the indexer)
[inhouse_app]
REGEX = /path/to/apps/(?<app>[^/]*)/(?<env>[^/]*)/.*
SOURCE_KEY = source
Perfect! It works like a charm! So, we changed a bunch of our searches to use app=appname env=environment instead of source="*/appname/environment/*", and very quickly found out that our search performance had degraded to the point where it was almost unusable in quite a few instances. For example:
Search: index=myapps source="*/app1/prod/*"
This search has completed and has returned 14 results by scanning 14 events in 0.193 seconds.
Search: index=myapps app=app1 env=prod
This search has completed and has returned 14 results by scanning 198,502 events in 91.228 seconds.
So, maybe that wasn't the way to go 😞 I also tried using tags, creating a separate tag for each source...but that's even more cumbersome. It works, but we have an ever-changing set of apps, and continuously messing around with the tags isn't something I necessarily want to do.
(If there's any crucial information that I left out, feel free to ask for more -- I'd be more than happy to help you help me 😄 )
Since the general consensus was that it would be acceptable to extract these fields at index-time, that's just what I did. I'm creating completely new fields (not overwriting any of the default ones, like sourcetype), and it is working like a charm. For posterity, here is how I accomplished this:
fields.conf
[app]
INDEXED = true
INDEXED_VALUE = false
[env]
INDEXED = true
INDEXED_VALUE = false
transforms.conf
[app_env]
SOURCE_KEY = MetaData:Source
REGEX = /path/to/apps/([^/]+)/([^/]+)/
WRITE_META = true
FORMAT = app::"$1" env::"$2"
props.conf
[my_sourcetype]
# ... other sourcetype related stuff
TRANSFORMS-appenv = app_env
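For anyone who wants to sanity-check the pattern before touching the indexer, here is a rough Python sketch of what the REGEX and FORMAT lines above should produce for a given source path. It only imitates the capture-and-substitute step with Python's re module; it is not how Splunk itself runs the transform, and the file name access.log and the helper name format_meta are made up for illustration.

```python
import re

# Same pattern as the REGEX line in the [app_env] stanza.
APP_ENV = re.compile(r"/path/to/apps/([^/]+)/([^/]+)/")


def format_meta(source):
    """Imitate FORMAT = app::"$1" env::"$2" for one source path.

    Returns None when the path does not match, in which case no
    fields would be written.
    """
    m = APP_ENV.search(source)
    if m is None:
        return None
    return 'app::"{}" env::"{}"'.format(m.group(1), m.group(2))


# access.log is a hypothetical file name; the directories come from the question.
print(format_meta("/path/to/apps/foo/prod/log/access.log"))
print(format_meta("/path/to/apps/bar/beta/log/access.log"))
print(format_meta("/var/log/messages"))
```

The first two calls should give the foo/prod and bar/beta pairs, and the last one should give None because the path is outside /path/to/apps/.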
I think in this case, index-time extraction is exactly what you need. Since you already have the source, but no defined sourcetype, I would write the sourcetype using transforms.conf.
transforms.conf
[inhouse_app]
REGEX = /path/to/apps/([^/]*)/([^/]*)/.*
SOURCE_KEY = MetaData:Source
WRITE_META = true
FORMAT = app::$1 env::$2
Then your searches look like this: index=myapps app=app1 OR env=beta
You could change the format to whatever you want based on your REGEX and FORMAT directives. I think the benefits of index-time are justified here, and the "extra" processing offsets the search-time cost.
Drainy: I'm not sure I follow, because if I were to remove that entire search-time regex, then I can't search on those fields to actually perform the experiment. Or, am I missing something really obvious about what you are proposing I try... 🙂
If you disable that one search time regex you created, does the search time improve?
Have you looked at the search inspector to see where the time delay is introduced? It's possible it has just highlighted an under-spec'ed machine.
If it's a field you are likely to search on in almost every search then yes, it's worth adding it. There are trade-offs in the form of increased bucket sizes plus the increased time to search indexed fields, but in this case it should help. However... if it struggled that badly to apply the regex at search time then you need to consider how it would cope when applying this at index time, hence I come back to my above point about the spec of the machine 🙂
Also, perhaps try using this regex instead;
/path/to/apps/(?<app>[^/]+)/(?<env>[^/]+)
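To see what the change from * to + buys, here is a small Python sketch comparing the two patterns. Python's re module spells named groups (?P<name>...) rather than (?<name>...), so the patterns are adapted accordingly; the sample paths and the helper name extract are made up for illustration.

```python
import re

# Original pattern from the question: empty segments allowed, trailing ".*".
STAR = re.compile(r"/path/to/apps/(?P<app>[^/]*)/(?P<env>[^/]*)/.*")
# Suggested pattern: each segment needs at least one character, no trailing ".*".
PLUS = re.compile(r"/path/to/apps/(?P<app>[^/]+)/(?P<env>[^/]+)")


def extract(pattern, source):
    """Return the named groups as a dict, or None if there is no match."""
    m = pattern.search(source)
    return m.groupdict() if m else None


normal = "/path/to/apps/foo/prod/log/access.log"  # hypothetical file name
odd = "/path/to/apps//prod/log/access.log"        # hypothetical path with an empty app segment

print(extract(STAR, normal))
print(extract(PLUS, normal))
print(extract(STAR, odd))
print(extract(PLUS, odd))
```

Both patterns agree on well-formed paths. The + version refuses an empty app or env segment and does not have to consume the rest of the path.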
Sorry, what I meant was: if you remove the whole stanza you created to pull out app/environment, does it improve the search time? It could be another search-time extraction that's going wonky and taking up all the time.
I tried removing that bit of the regex (I assume, you mean just the last extraneous ".*"), and it helped a little, but not a whole lot.
This server is kind of a do-it-all box as far as Splunk is concerned, being both the search head and the only indexer. There's not a whole lot of searching going on, and it's indexing about 1.5GB/day, almost all of which is in the index that this question is regarding.
It basically says it's all on search-time extractions, but this could mean any of your search-time extractions. That's why I wonder if it's worth trying again with that one regex removed. How much data is it indexing per day? Are all searches run on this machine? If it's all fairly low then yeah, that should be a fine spec 🙂
I have a hard time believing that the machine is under-spec'd, but I could be wrong. It's a 12-core 2.8GHz, with far too much RAM, and lotsa spindles (with not much else going on).
I tried with the other regex you provided, and it shaved a little time off (thank you :P), but not on the magnitude that I was lookin' for.
I checked the search inspector out, and about all I was able to determine was that I don't really know how to read it 😄 I did edit my question, and I included a screenshot of the "execution cost" breakdown.
I did leave a bit out of the above explanation, and we are actually already using the sourcetype to differentiate between a couple of formats. Sourcetype is also being used for a couple of LINE_BREAKER and EXTRACT options in props.conf, so I don't think re-using that is going to work for us. I am however moving forward with implementing the two index-time extractions. I'll comment back when I know how things are going!