Getting Data In

Ignoring massive amounts of data at index time

msarro
Builder

Hey everyone. We are working on taking in large amounts of CSV data. Each line of the CSV is a single event, and each line comprises about 270 fields. Currently only about 40 of those fields are useful. Right now our props.conf and transforms.conf are set to index each field with a specific field name.

How would we best proceed to strip out the fields that we don't need so they don't get indexed? It would be a substantial cost savings for us. I'd prefer not to write the world's nastiest regex, but if I have to I will.


dwaddle
SplunkTrust

If you are looking to strip "columns" out of the CSV data at index time, about the only way you'd be able to do it is with a SEDCMD. I'd wager this would be a nontrivial regex to write.
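For reference, SEDCMD lives in props.conf under the sourcetype (or source) stanza. A minimal sketch, assuming a hypothetical sourcetype called csv_feed and that the unwanted columns sit in one contiguous positional block; with ~270 real columns, quoted fields, or embedded commas, the pattern would need to be far more careful:

    # Hypothetical props.conf sketch: keep the first 3 columns, drop the next 10.
    # Adjust the counts to match your actual column positions.
    [csv_feed]
    SEDCMD-dropcols = s/^(([^,]*,){3})([^,]*,){10}/\1/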

Maybe you could use a scripted input to read the CSV file and feed it through (say) Python's csv module to emit only the fields of interest? I think the biggest issue here would be keeping track of how much of the CSV file you have previously read/transformed/sent to Splunk. This could be easy, or it could require you to re-implement much of the tailing processor's functionality around file rotations and such.
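A rough sketch of what such a scripted input might look like, assuming hypothetical paths, column names, and a simple row-count offset file (none of these come from the thread, and file rotation handling is deliberately left out, since that is the hard part mentioned above):

    #!/usr/bin/env python
    # Sketch of a scripted input that emits only the wanted CSV columns to stdout,
    # where Splunk picks them up. Paths and column names are placeholders.
    import csv
    import sys

    SOURCE = "/var/data/feed.csv"            # hypothetical source file
    STATE = "/var/data/feed.csv.offset"      # remembers rows already sent
    KEEP = ["timestamp", "src_ip", "dest_ip", "bytes"]  # the ~40 useful columns

    def read_offset():
        try:
            with open(STATE) as f:
                return int(f.read().strip())
        except (IOError, ValueError):
            return 0

    def write_offset(offset):
        with open(STATE, "w") as f:
            f.write(str(offset))

    def main():
        offset = read_offset()
        with open(SOURCE) as f:
            reader = csv.DictReader(f)
            # Skip rows already emitted on earlier runs.
            for _ in range(offset):
                next(reader, None)
            writer = csv.DictWriter(sys.stdout, fieldnames=KEEP)
            if offset == 0:
                writer.writeheader()
            sent = 0
            for row in reader:
                writer.writerow({k: row.get(k, "") for k in KEEP})
                sent += 1
        write_offset(offset + sent)

    if __name__ == "__main__":
        main()

This keeps only the fields you care about before the data ever reaches the indexer, at the cost of owning the "how far have I read" bookkeeping yourself.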


msarro
Builder

Hm, so it does look like it will be a massive regex. What sort of overhead would a regex of that size incur? Sadly, one issue we'd run into is that our source file moves multiple times.
