Getting Data In

Only the latest results from a constantly human-updated CSV?

edenzler
Path Finder

I have a use case where a CSV in a shared location is updated daily by project manager(s). I'm building a bar chart from the data showing each date's offset from today (I've got the visual working). Essentially, I'm only interested in the absolute most recent state of the CSV: if something is added, a field is changed, or a row is deleted, I just want the current state of the file.

Here's the current state of the search:

[the indexed csv]
| eval from=now()
| eval int=strptime(Initializing,"%m/%d/%Y")
| eval plan=strptime(Planning,"%m/%d/%Y")
| eval exec=strptime(Execution,"%m/%d/%Y")
| eval close=strptime(Closing,"%m/%d/%Y")
| eval iDays=(int - from) / 86400
| eval pDays=(plan - from) / 86400
| eval eDays=(exec - from) / 86400
| eval cDays=(close - from) / 86400
| rename iDays AS "1-Initializing", pDays AS "2-Planning", eDays AS "3-Execution", cDays AS "4-Closing"
| table Project "4-Closing" "3-Execution" "2-Planning" "1-Initializing"
| dedup Project

Here is the format of the CSV:

Project Start Initializing Planning Execution Closing

1 9/15/2012 10/1/2012 10/5/2012 6/30/2013 7/15/2013
2 11/20/2012 11/30/2012 12/31/2012 4/30/2013 5/15/2013
3 3/6/2013 3/15/2013 4/30/2013 6/30/2013 7/15/2013
4 3/6/2013 3/15/2013 4/30/2013 6/30/2013 7/15/2013
5 12/1/2012 12/3/2012 3/31/2013 5/31/2013 6/15/2013
6 3/1/2013 3/15/2013 4/30/2013 6/30/2013 7/15/2013
7 8/1/2011 8/15/2011 12/31/2012 5/31/2013 6/15/2013
8 3/1/2013 3/5/2013 6/30/2013 7/31/2013 8/15/2013
9 9/1/2012 10/1/2012 2/28/2013 4/7/2013 4/15/2013
10 10/1/2011 10/15/2011 9/1/2012 7/15/2013 8/30/2013
11 8/1/2011 8/15/2011 12/31/2012 5/31/2013 6/15/2013
12 3/1/2013 3/5/2013 6/30/2013 7/31/2013 8/15/2013

Thanks in advance...


apnetmedic
Explorer

I solve this by doing two things:

  1. In props.conf:

       [source::/my/watched/path/*]
       CHECK_METHOD = modtime

     My events don't have timestamps, so _time is now set to the last time the file was touched.

  2. In my searches:

       <base search> | eventstats max(_time) as LatestTime by source | eval ThreshTime = LatestTime - 2 | where _time > ThreshTime | <now the rest of my search>

Find the latest _time, subtract a few seconds to account for the time it might take to ingest (you might need more headroom if the data comes through a forwarder), and ignore events that aren't in this window; a worked sketch for this thread's file follows below. I combine this with a time range spec that goes back a few days so that the initial search doesn't have to work as hard.
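A minimal sketch of that pattern applied to this thread's file (the source name, the 7-day window, and the 2-second fudge factor are assumptions to adjust for your environment):

    source=*mycsv.csv earliest=-7d
    | eventstats max(_time) AS LatestTime by source
    | where _time > LatestTime - 2
    | table Project Start Initializing Planning Execution Closing

Everything after the where clause then operates only on the rows from the most recent indexing of the file.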

I have to think there's a better way to do this, but I have not found it. I too am looking to present a dashboard based on a file that's refreshed daily, but the update happens at varying times of day, or sometimes not at all, so just using "earliest=-1d@d" doesn't work: if the boss logs in before the refresh happens, the dashboard would be blank. "earliest=-24h" might catch two days' worth of data if run at the wrong time, so that's out too.

I'm not super happy with it because it makes all my searches more expensive, but it gets the job done.

I went down the path of trying a scheduled search that takes my code from above and reverses the sense of it:

    ... eval ThreshTime = LatestTime - 2 | where _time < ThreshTime | delete

But this fails because delete can't be used after a non-streaming command like eventstats.
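One untested sketch of a workaround: compute the threshold in a subsearch and return it as a latest= time bound, so that delete sits on a plain streaming pipeline with no eventstats in front of it (the 2-second offset is carried over from above, and running delete requires the can_delete role):

    source=*mycsv.csv
        [ search source=*mycsv.csv
          | stats max(_time) AS MaxTime
          | eval latest = floor(MaxTime) - 2
          | return latest ]
    | delete

The subsearch's return emits latest=<epoch>, which the outer search treats as a time bound, so only events older than the newest batch are selected for deletion.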

Any other thoughts on this I'd love to hear.


yannK
Splunk Employee

If your CSV had a timestamp on each line, you could have done

    source=*mycsv.csv | stats latest(Start) AS Start latest(Initializing) AS Initializing etc.

but it seems that your dates are not chronological, so you have to sort somehow.
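Spelled out for this CSV's columns, that first idea might look like the sketch below, grouping by Project so each project keeps its own latest values (it still presumes a trustworthy per-event timestamp, which is exactly what is in question here):

    source=*mycsv.csv
    | stats latest(Start) AS Start latest(Initializing) AS Initializing latest(Planning) AS Planning latest(Execution) AS Execution latest(Closing) AS Closing by Project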

Does the field "Project" always increment? Here is a way to find the highest project number in a subsearch and use it as a filter:

    source=*mycsv.csv [ search source=*mycsv.csv | stats max(Project) AS Project | table Project ]


yannK
Splunk Employee

If the file does not contain a timestamp, and Splunk assigned the current modification time to each event detected (DATETIME_CONFIG = CURRENT in the sourcetype definition), then you could assume that the last events will have the most recent timestamp in Splunk, and use

    source=*mycsv.csv | stats latest(Project) etc.

The risk is that if the sourcetype and timestamp are not well defined, Splunk will pick an arbitrary date from the line (and you already have five dates per event).
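A props.conf sketch along those lines (the source path is hypothetical; DATETIME_CONFIG = CURRENT stamps every event with the time it was indexed instead of letting Splunk parse one of the five dates on the line, and CHECK_METHOD = modtime re-indexes the file whenever it is touched):

    [source::/shared/location/mycsv.csv]
    DATETIME_CONFIG = CURRENT
    CHECK_METHOD = modtime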


edenzler
Path Finder

I added a Ver column to the CSV and used

    source=*mycsv.csv [ search source=*mycsv.csv | stats max(Ver) AS Ver | table Ver ]

and this works when Ver is incremented after the data in the CSV has been modified. It will work for small CSV files, but it's not optimal for large ones, and it requires manipulating the CSV before it is indexed.

Any other ideas? 🙂


edenzler
Path Finder

Thanks yannK - no, the "Project" field does not actually increment; I sanitized it to numbers for the post. They are really all meaningful project names. I'm perplexed that there's no means to read the file's date/time stamp.
