Splunk Search

How to extract and report on field values from a large CSV file in a directory on my Splunk server that is updated daily?

apnetmedic
Explorer

I have a bit of a non-traditional application, but one which Splunk is pretty good at 95% of:

There's a big file (call it bigReport.csv), updated daily by a business intelligence system and deposited in a directory on my Splunk server. Let's say it's 25,000 entries showing status of orders. On any given day, lines may be added to the end, changed (order status updated, ship dates changed), or deleted (order is complete and falls out of the search criteria of the BI report). The file has the same name every day.

The mission is to take the data in this file, extract some values from fields, do some lookups against other reports (which have similar ingest problems), and produce some summary data.

I had started by doing a watched directory for this, and ingesting the file. In order to do that, I had to set props.conf CHECK_METHOD = modtime, in case the beginning and the end of the file stayed the same and the CRC's wouldn't show changes.

Pulling only the latest set of data is a challenge too. The file gets updated at approximately the same time every day, but not exactly. So doing earliest=-1d@d might not work, depending on what time of day you access a report.

I came up with:

source=bigReport.csv earliest=-2d | eventstats max(_time) as LatestTime 
| where _time > LatestTime-30 | rest-of-search

This is pretty expensive, though. Especially when multiple files are involved in the same way, requiring subsearches to be performed in the same way.

So the bottom line is.. is there a better way to do this? |inputcsv seems tempting, though that has its own issues in terms of data access (those CSVs are readable by any user with search access who can find the file name)

0 Karma
1 Solution

somesoni2
SplunkTrust
SplunkTrust

For data ingested in Splunk, try this (I would always suggest to provide the index and sourcetype name as well in all your queries)

index=yourindex sourcetype=yoursourcetype source=bigReport.csv [| tstats max(_time) as earliest WHERE index=yourindex sourcetype=yoursourcetype source=bigReport.csv earliest=-2d@d | eval earliest=relative_time(earliest,"@d") ] | rest of the search

The subsearch will get you the date, as earliest, of the latest report. Also, it uses tstats command and works on metadata so is much faster.

You can also explore option of lookup table files OR KV store for this requirement as well.

View solution in original post

somesoni2
SplunkTrust
SplunkTrust

For data ingested in Splunk, try this (I would always suggest to provide the index and sourcetype name as well in all your queries)

index=yourindex sourcetype=yoursourcetype source=bigReport.csv [| tstats max(_time) as earliest WHERE index=yourindex sourcetype=yoursourcetype source=bigReport.csv earliest=-2d@d | eval earliest=relative_time(earliest,"@d") ] | rest of the search

The subsearch will get you the date, as earliest, of the latest report. Also, it uses tstats command and works on metadata so is much faster.

You can also explore option of lookup table files OR KV store for this requirement as well.

apnetmedic
Explorer

I like it! I left out index and sourcetype for brevity in the original question. I've been having other ingest problems with a file this large, which has made me question the whole method. Sounds like I can still make it work.

0 Karma
Get Updates on the Splunk Community!

Index This | I am a number, but when you add ‘G’ to me, I go away. What number am I?

March 2024 Edition Hayyy Splunk Education Enthusiasts and the Eternally Curious!  We’re back with another ...

What’s New in Splunk App for PCI Compliance 5.3.1?

The Splunk App for PCI Compliance allows customers to extend the power of their existing Splunk solution with ...

Extending Observability Content to Splunk Cloud

Register to join us !   In this Extending Observability Content to Splunk Cloud Tech Talk, you'll see how to ...