All,
I have a JSON log coming in from Akamai. 99% of searches against this data use the field "cliIP":"1.2.3.4". Mind you, it's a dump from a cloud service, so there is no host field right now.
Given that, it stands to reason that we should give that field some sort of priority in the index. My understanding is that an index-time extraction is a solution for this?
1) Thoughts on that?
2) How would I build an index-time extraction against JSON? I'm worried there is some special option I'll miss.
For JSON, I'd recommend enabling INDEXED_EXTRACTIONS=json in props.conf, giving you automatic index-time fields.
http://docs.splunk.com/Documentation/Splunk/6.4.0/Data/Extractfieldsfromfileswithstructureddata
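A minimal sketch of what that stanza could look like. The sourcetype name akamai:json is only a placeholder for whatever sourcetype your input assigns, and where the file has to live (forwarder, indexer, search head) depends on how the data arrives, so check the documentation linked above for your setup.

```
# props.conf
# "akamai:json" is a hypothetical sourcetype name; use your own.
[akamai:json]
# Parse each event as JSON at index time and store its fields as indexed fields
INDEXED_EXTRACTIONS = json
# Turn off automatic search-time key/value extraction for this sourcetype,
# so the same JSON fields are not extracted a second time at search time
KV_MODE = none
```

The KV_MODE = none line is a search-time setting, so it would normally go on the search head.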
Sorry, I am not following the documentation very well. Does this turn every field into an index-time extraction?
In an automated way, yes.
Wouldn't making every field an index-time extraction be a really big hit to performance?
The indexing performance hit is not that bad. After all, it's only one (complicated) extraction running, not hundreds for every imaginable field in your data.
There will be some space consumed of course, how much depends on your data. Based on my limited use, it's not too bad. Search-time speed certainly makes up for this - you can skip building an accelerated data model for many use cases, for example.
@martin_mueller,
I plan to run a process on a remote computer that:
I'm already doing this successfully on a small scale (with a few events). Currently, I'm using KV_MODE=json to perform search-time extraction, but I think that you're recommending specifying INDEXED_EXTRACTIONS=json instead, to perform index-time extraction, right?
I'm very curious about this: this choice has been on my mind, too. I'd be very interested to hear more from anyone in a similar situation. I'm concerned not just about index size, but also about whether, with the extra processing introduced at index time by index-time extraction, I'll need to load balance across more indexers to handle the incoming (TCP) stream of thousands of events.
Indexers are typically busier running searches than indexing, so a little more indexing load that potentially takes a lot off the search load can actually save net indexer capacity.
How this tradeoff turns out depends on your environment, search load, and data. I'd just run it and watch what happens.
I wouldn't suggest turning on indexed extractions in production without testing the effect on index size and real-world performance. In a test with a JSON audit log that I had available (containing 10 fields), indexed extractions doubled the storage used, and I imagine it could cost much more. It may not be worth the extra storage, because searches on IP addresses should be fairly efficient to begin with.
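If indexing every field turns out to be too expensive, one middle ground might be to index only cliIP with a classic index-time transform. The following is a rough sketch, not something I have tested against Akamai data: the sourcetype name, the transform name, and the regex are all assumptions based on the "cliIP":"1.2.3.4" format mentioned in the question.

```
# props.conf
# "akamai:json" is a hypothetical sourcetype name; use your own.
[akamai:json]
TRANSFORMS-cliip = akamai_cliip_indexed

# transforms.conf
# Transform name is a placeholder; the regex assumes "cliIP":"<value>" in the raw event.
[akamai_cliip_indexed]
REGEX = "cliIP":"([^"]+)"
FORMAT = cliIP::$1
WRITE_META = true

# fields.conf (search head)
# Tells the search side that cliIP is an indexed field
[cliIP]
INDEXED = true
```

This only adds one indexed field per event, so the storage cost should be much smaller than indexing all fields, but the same advice applies: test it on a sample before production.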