Hey everyone. We are working on ingesting large amounts of CSV data. Each line of the CSV is a single event, and each line comprises about 270 fields. Currently only about 40 of those fields are useful. Right now our props.conf and transforms.conf are set up to index each field with a specific field name.
How would we best proceed to strip out the fields we don't need so they don't get indexed? It would be a substantial cost savings for us. I'd prefer not to write the world's nastiest regex, but if I have to I will.
If you are looking to strip "columns" out of the CSV data at index time, about the only way you'd be able to do it is with a SEDCMD in props.conf. I'd wager this would be a nontrivial regex to write.
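As a rough sketch of what that SEDCMD might look like: if (and only if) the 40 useful fields happened to be the first 40 columns, you could keep them and drop the rest with a positional regex. The sourcetype name here is a placeholder, and the assumption that the wanted columns are contiguous at the front almost certainly doesn't hold for real data, which is why this gets nasty fast:

```
# props.conf -- illustrative only; [my_csv_sourcetype] and the
# "first 40 columns are the keepers" layout are assumptions
[my_csv_sourcetype]
# Capture the first 40 comma-separated fields, discard the remaining ~230
SEDCMD-dropcols = s/^(([^,]*,){39}[^,]*),.*$/\1/
```

If the wanted columns are scattered across the line, you'd need one capture group per kept column and the regex grows accordingly. Note also that a plain `[^,]*` approach breaks on quoted fields containing embedded commas.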
Maybe you could use a scripted input to read the CSV file and feed it through (say) Python's csv module, emitting only the fields of interest? I think the biggest issue here would be keeping track of how much of the CSV file you have previously read/transformed/sent to Splunk. This could be easy, or it could require you to re-implement much of the tailing processor's functionality around file rotations and such.
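The column-filtering part of that scripted input is the easy half; here's a minimal sketch using the stdlib csv module. The column names (`timestamp`, `host`, `status`) are placeholders, not the real field names from the 270-column feed, and this deliberately ignores the hard part (tracking read offsets across file rotations):

```python
import csv
import io

# Placeholder list of the ~40 columns worth keeping (names are assumptions)
KEEP = ["timestamp", "host", "status"]

def filter_columns(reader, writer, keep):
    """Copy CSV rows from `reader` to `writer`, keeping only `keep` columns.

    DictReader takes column names from the header row; extrasaction="ignore"
    silently drops the other ~230 fields instead of raising.
    """
    rows = csv.DictReader(reader)
    out = csv.DictWriter(writer, fieldnames=keep,
                         extrasaction="ignore", lineterminator="\n")
    out.writeheader()
    for row in rows:
        out.writerow({k: row.get(k, "") for k in keep})

# Illustration with in-memory data; a real scripted input would read the
# source file and write to stdout for Splunk to consume
src = io.StringIO("timestamp,host,status,extra\n2023-01-01,web1,200,junk\n")
dst = io.StringIO()
filter_columns(src, dst, KEEP)
print(dst.getvalue())  # timestamp,host,status / 2023-01-01,web1,200
```

Because the csv module handles quoting and embedded commas for you, this sidesteps the quoted-field problem a hand-rolled regex would have.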
Hm, so it does look like it will be a massive regex. What sort of overhead would a regex of that size incur at index time? Sadly, one issue we'd run into with the scripted-input approach is that our source file moves multiple times.