Getting Data In

What is the most efficient way to send a large number of raw CSV files to Splunk?

ilaila
New Member

I have a network share with a huge number of directories and .csv files. Files are constantly being added and periodically removed for archiving. When I created a Directory data source for this share, I found that Splunk opens a lot of file handles as it watches the files and monitors them for changes. After a restart, it also takes a very long time for monitoring to resume; I was told this is because Splunk needs to "catch up" on any changes it missed by scanning the entire share.

So I tried using Splunk's HEC to send my .csv files. If I send each row of the CSV as an individual event (sourcetype=_json), the overhead of repeating the CSV headers for every row quickly adds up: these CSVs are often very large, have more than 40 headers, and the headers are not very static. That unnecessarily burns through the license limit, although it does work.
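
For reference, the per-row approach looks roughly like this (a minimal sketch in Python; the host, token, and path are placeholders, not my real values):

    # Sketch: send each CSV row to HEC as its own JSON event.
    # HEC_URL and HEC_TOKEN are placeholders.
    import csv
    import json
    import requests

    HEC_URL = "https://splunk.example.com:8088/services/collector/event"
    HEC_TOKEN = "00000000-0000-0000-0000-000000000000"

    def send_rows(path):
        headers = {"Authorization": "Splunk " + HEC_TOKEN}
        with open(path, newline="") as f:
            # DictReader repeats the header names in every event,
            # which is exactly the overhead I'm describing.
            for row in csv.DictReader(f):
                payload = {"sourcetype": "_json", "event": row}
                requests.post(HEC_URL, headers=headers,
                              data=json.dumps(payload), verify=False)

Batching several events into one POST cuts down on request overhead, but as far as I can tell the repeated header names still count against the license either way.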

If I send the raw string content of the CSV file as a single event (sourcetype=csv), Splunk doesn't interpret it correctly (it doesn't detect the fields). I'm not even certain this is supposed to work in the first place.
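
The single-event attempt goes through the raw endpoint instead, roughly like this (same placeholder host and token as above):

    # Sketch: post the whole file as one event via the raw HEC endpoint.
    import requests

    HEC_RAW_URL = "https://splunk.example.com:8088/services/collector/raw"
    HEC_TOKEN = "00000000-0000-0000-0000-000000000000"  # placeholder

    def send_file(path):
        headers = {"Authorization": "Splunk " + HEC_TOKEN}
        with open(path, "rb") as f:
            requests.post(HEC_RAW_URL, params={"sourcetype": "csv"},
                          headers=headers, data=f.read(), verify=False)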

So both the Directory data source and HEC seem inefficient for my scenario. Are there any other options I can try (preferably out of the box, or an official app)? Or perhaps tweaks to the above methods (preferably not undocumented settings)?


tiagofbmm
Influencer

What is the problem with putting one or more monitor stanzas over that directory and its files?


ilaila
New Member

Do you mean a stanza for each folder? The folders that get added have random GUID names, and they get removed regularly as well.


tiagofbmm
Influencer

No, just put a monitor on that folder and Splunk will read everything in that directory substructure by default, so you don't have to worry about it.

By the way, the option is called recursive and it defaults to true.
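
As a sketch, the inputs.conf stanza would look something like this (the path, whitelist, and index are examples, adjust them to your share):

    [monitor:///mnt/network_share]
    # recursive defaults to true; shown here only for clarity
    recursive = true
    # only pick up .csv files
    whitelist = \.csv$
    sourcetype = csv
    index = main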

I believe this solves your issue. Let me know


ilaila
New Member

Unless I'm missing something, that doesn't sound any different from my existing set-up.

The problem with that is that the monitor is too aggressive with the I/O operations it performs on the share, plus the "catch up" time I mentioned when the instance is rebooted.


tiagofbmm
Influencer

Yes, you are correct: the monitor stanza I'm talking about is what you called a Directory data source.

Monitor is the correct way to ingest files when you actually have the ability to do it. I'm not sure what you mean by "aggressive", though. Is it taking too long to index the data?

The "catch up" time is normal, especially if you have very big files and a large number of them.

Maybe you need a second pipeline in that Universal Forwarder: https://docs.splunk.com/Documentation/Forwarder/7.0.2/Forwarder/Configureaforwardertohandlemultiplep...
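
As a sketch (assuming a 7.x forwarder, check the docs for your version), that is a server.conf change on the forwarder:

    [general]
    # run two ingestion pipeline sets instead of one
    parallelIngestionPipelines = 2

Each pipeline set gets its own queues and processors, so a slow monitor input is less likely to hold everything else up.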


tiagofbmm
Influencer

Please let me know if the answer was useful for you. If it was, accept it and upvote it. If not, give us more input so we can help you further.
