What are the possible gains from an index-time extraction of a large JSON log?

daniel333 · ‎04-21-2016

All,

I have a JSON log coming in from Akamai. 99% of searches against this data use the field "cliIP":"1.2.3.4". Mind you, it's a dump from a cloud service, so there is no host field right now.

Given that, it stands to reason that we should give that field some sort of priority in the index. My understanding is that an index-time extraction is a solution for this?
1) Thoughts on that?
2) How would I build an index-time extraction against JSON? I'm worried there is some special option I'll miss.

martin_mueller · ‎04-21-2016

For JSON, I'd recommend setting INDEXED_EXTRACTIONS=json in props.conf, which gives you automatic index-time fields.

http://docs.splunk.com/Documentation/Splunk/6.4.0/Data/Extractfieldsfromfileswithstructureddata
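
As a rough illustration of the setting named above, a props.conf stanza might look like the following. The sourcetype name akamai:json is a placeholder, not something from this thread, and KV_MODE = none is an added assumption: it is commonly set alongside indexed extractions so the same JSON fields are not extracted a second time at search time.

```ini
# props.conf (sketch; the sourcetype name is hypothetical)
[akamai:json]
# Parse each event as JSON at index time and store its fields in the index
INDEXED_EXTRACTIONS = json
# Assumption: turn off search-time JSON extraction to avoid duplicate field values
KV_MODE = none
```

Where this stanza has to live (forwarder, indexer, search head) depends on the deployment, so check that against the linked documentation before relying on it.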

daniel333 · ‎04-25-2016

Sorry, I am not following the documentation very well. Does this turn every value into an index-time extraction?

martin_mueller · ‎04-25-2016

In an automated way, yes.

daniel333 · ‎04-25-2016

Wouldn't making every field an index-time extraction be a really big hit to performance?

martin_mueller · ‎04-26-2016

The indexing performance hit is not that bad. After all, it's only one (complicated) extraction running, not hundreds for every imaginable field in your data.

There will be some space consumed, of course; how much depends on your data. Based on my limited use, it's not too bad. Search-time speed certainly makes up for this: you can skip building an accelerated data model for many use cases, for example.

Graham_Hanningt · ‎05-10-2016

@martin_mueller,

I plan to run a process on a remote computer that:

1) Parses a proprietary-format binary log file containing thousands of events
2) Converts those events into JSON
3) Sends them via TCP to Splunk

I'm already doing this successfully on a small scale (with a few events). Currently, I'm using KV_MODE=json to perform search-time extraction, but I think that you're recommending specifying INDEXED_EXTRACTIONS=json instead, to perform index-time extraction, right?

I'm very curious about this: this choice has been on my mind, too. I'd be very interested to hear more from anyone in a similar situation. I'm concerned not just about index size, but also about whether, with the extra processing introduced at index time by index-time extraction, I'll need to load balance across more indexers to handle the incoming (TCP) stream of thousands of events.
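
For reference, the search-time setup described in this post could be sketched roughly as below. The TCP port 5514, the sourcetype name myapp:json, and the index name are all hypothetical; only KV_MODE=json comes from the post itself.

```ini
# inputs.conf (sketch; port, sourcetype and index are placeholders)
[tcp://5514]
sourcetype = myapp:json
index = main

# props.conf (sketch)
[myapp:json]
# Extract JSON fields at search time; nothing extra is written to the index
KV_MODE = json
```

Moving to index-time extraction would mean replacing the KV_MODE line with INDEXED_EXTRACTIONS = json for the same sourcetype. Whether that behaves the same on a raw TCP input as it does on monitored files is an assumption worth verifying on a test instance first.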

martin_mueller · ‎05-10-2016

Indexers are typically busier running searches than indexing, so a little more indexing load that potentially takes a lot off the search load can actually save net indexer capacity.

How this tradeoff turns out depends on your environment, search load, and data. I'd just run it and watch what happens.

jtacy · ‎04-26-2016

I wouldn't suggest turning on indexed extractions in production without testing the effect it has on index size and real-world performance. Testing with a JSON audit log that I had available (containing 10 fields), indexed extractions doubled the storage, and I imagine it could cost much more. It may not be worth the extra storage because searches on IP addresses should be fairly efficient to begin with.
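
If the storage cost of indexing every field is the main worry, a narrower option is to index only the one field that searches actually use. The following is a minimal sketch of a single custom index-time field for cliIP, not something proposed in this thread: the sourcetype and transform names are hypothetical, and the regex assumes the value always appears as "cliIP":"..." with no whitespace around the colon.

```ini
# transforms.conf (sketch; the stanza name is hypothetical)
[akamai_cliip_indexed]
# Assumes the raw event contains "cliIP":"<value>" exactly in this form
REGEX = "cliIP":"([^"]+)"
# Write the captured value as an indexed field named cliIP
FORMAT = cliIP::$1
WRITE_META = true

# props.conf (sketch; the sourcetype name is hypothetical)
[akamai:json]
TRANSFORMS-cliip = akamai_cliip_indexed

# fields.conf
[cliIP]
# Tell search that cliIP is an indexed field
INDEXED = true
```

This keeps the other JSON fields as search-time extractions, so the index grows by one field per event rather than by every key in the payload. It only affects data indexed after the change, and it should be tested against real events before production use.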

