Splunk Search

What are the possible gains from an index-time extraction of a large JSON log?

daniel333
Builder

All,

I have a JSON log coming in from Akamai. 99% of searches against this data use the cliIP field, for example "cliIP":"1.2.3.4". Mind you, it's a dump from a cloud service, so there is no host field right now.

Given that, it stands to reason that we should give that field some sort of priority in the index. My understanding is that an index-time extraction is a solution for this?
1) Thoughts on that?
2) How would I build an index-time extraction against JSON? I'm worried there is some special option I'll miss.

martin_mueller
SplunkTrust

For JSON, I'd recommend enabling INDEXED_EXTRACTIONS=json in props.conf, which gives you automatic index-time fields.

http://docs.splunk.com/Documentation/Splunk/6.4.0/Data/Extractfieldsfromfileswithstructureddata
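
As a rough sketch of what that could look like, assuming the Akamai data arrives under a sourcetype called akamai:json (a placeholder name, not something from this thread):

```ini
# props.conf on the instance that parses the data
# [akamai:json] is a placeholder sourcetype name
[akamai:json]
INDEXED_EXTRACTIONS = json

# props.conf on the search head
# turn off search-time JSON extraction so fields are not extracted twice
[akamai:json]
KV_MODE = none
```

With the field indexed, a search term written as cliIP::1.2.3.4 matches the indexed field directly.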

daniel333
Builder

Sorry, I am not following the documentation very well. Does this turn every value into an index-time extraction?

martin_mueller
SplunkTrust

In an automated way, yes.

daniel333
Builder

Wouldn't making every field an index-time extraction be a really big performance hit?

martin_mueller
SplunkTrust

The indexing performance hit is not that bad; after all, it's only one (complicated) extraction running, not hundreds for every imaginable field in your data.

There will be some space consumed, of course; how much depends on your data. Based on my limited use, it's not too bad. Search-time speed certainly makes up for this: you can skip building an accelerated data model for many use cases, for example.

Graham_Hanningt
Builder

@martin_mueller,

I plan to run a process on a remote computer that:

  1. Parses a proprietary-format binary log file containing thousands of events
  2. Converts those events into JSON
  3. Sends them via TCP to Splunk

I'm already doing this successfully on a small scale (with a few events). Currently, I'm using KV_MODE=json to perform search-time extraction, but I think that you're recommending specifying INDEXED_EXTRACTIONS=json instead, to perform index-time extraction, right?
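
To make the choice concrete, here is how I understand the two options in props.conf. This is only a sketch: myapp:json is a made-up sourcetype name, and I'd want to verify in a test instance that the index-time variant behaves as expected for a TCP input before relying on it.

```ini
# Option A: search-time extraction (what I use today)
# [myapp:json] is a hypothetical sourcetype name
[myapp:json]
KV_MODE = json

# Option B: index-time extraction (use instead of Option A, not alongside it)
# [myapp:json]
# INDEXED_EXTRACTIONS = json
# KV_MODE = none
```

Only one of the two stanzas would be active at a time; Option B is commented out here so the fragment stays a valid single stanza.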

I'm very curious about this: this choice has been on my mind, too. I'd be very interested to hear more from anyone in a similar situation. I'm concerned not just about index size, but also about whether, with the extra processing introduced at index time by index-time extraction, I'll need to load balance across more indexers to handle the incoming (TCP) stream of thousands of events.

martin_mueller
SplunkTrust

Indexers are typically busier running searches than indexing, so a little more indexing load that potentially takes a lot off the search load can actually save net indexer capacity.

How this tradeoff turns out depends on your environment, search load, and data. I'd just run it and watch what happens.

jtacy
Builder

I wouldn't suggest turning on indexed extractions in production without testing the effect on index size and real-world performance. When I tested with a JSON audit log that I had available (containing 10 fields), indexed extractions doubled the storage used, and I imagine the cost could be much higher for other data. It may not be worth the extra storage, because searches on IP addresses should be fairly efficient to begin with.
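
If only the one field matters, a narrower alternative to INDEXED_EXTRACTIONS would be a single index-time transform for cliIP, which should limit the storage overhead to that field. The sketch below is untested against the Akamai data; the sourcetype akamai:json and the stanza name akamai_cliip_indexed are placeholders I made up, and the regex assumes the field appears exactly as "cliIP":"1.2.3.4" in the raw event.

```ini
# transforms.conf
# pull the cliIP value out of the raw JSON and write it as an indexed field
[akamai_cliip_indexed]
REGEX = "cliIP":"([^"]+)"
FORMAT = cliIP::$1
WRITE_META = true

# props.conf
# attach the transform to the (placeholder) sourcetype
[akamai:json]
TRANSFORMS-cliip = akamai_cliip_indexed

# fields.conf
# tell the search side that cliIP is an indexed field
[cliIP]
INDEXED = true
```

As with indexed extractions, I'd compare index size and search times on a test index before rolling this out.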
