Getting Data In

Data sampling from a large log file

nowplaying
Explorer

We have an application log that generates event timings. This log far exceeds our Splunk license if we consumed it for the entire day. A small segment of data from this log is all we really need on an hourly basis. I was wondering if there's a way to configure a forwarder to take a sampling of data from this log at a specific interval?


Simeon
Splunk Employee

We always advise users not to cut out data, as in almost all cases users will want that data at some point. However, let us assume you really do have a chatty system that you absolutely cannot control because it is set up in DEBUG mode...

You can leverage conditional routing to implement programmatic polling. For example, if you know that your data consistently shows up every second of every minute, you can use a regex that sends unwanted data to the nullQueue. Specifically, I could tell Splunk to index only events that occur every other second. To do this, my regex would ONLY look for events that have a timestamp with a character class that includes odd numbers. I believe the following regex would match timestamps whose seconds value is odd (create one that applies to your specific events):

\d\d:\d\d:\d[13579]

In theory, the aforementioned regex would index events with timestamps like 12:00:01, 12:00:03, 12:00:05, 12:00:07, 12:00:09, 12:34:01, etc.
It would NOT index (instead sending to the nullQueue) events with a timestamp of 12:00:00, 12:00:02, 12:00:04, 12:00:06, 12:00:08.
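As a sketch of how that routing might be wired up (the source path and stanza names below are placeholders, not from the original post; the regex is the one above), you would pair a catch-all nullQueue transform with a keep transform in props.conf and transforms.conf:

```ini
# props.conf -- applied on the indexer or heavy forwarder
# (the source path is a placeholder; substitute your actual log path)
[source::/var/log/myapp/timings.log]
TRANSFORMS-sample = discard_all, keep_odd_seconds

# transforms.conf
# First route every event to the nullQueue...
[discard_all]
REGEX = .
DEST_KEY = queue
FORMAT = nullQueue

# ...then route events whose seconds digit is odd back to the
# indexQueue. Transforms run in order, so the later match wins.
[keep_odd_seconds]
REGEX = \d\d:\d\d:\d[13579]
DEST_KEY = queue
FORMAT = indexQueue
```

Note that queue-routing transforms like these take effect at parse time, so they must live on an indexer or heavy forwarder, not a universal forwarder.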

nowplaying
Explorer

That's perfect. After I saw jbsplunk's post I figured I would regex based on time, but wasn't sure where to start. \d\d:[12345]0:\d\d should give me enough data to work with: a 1-minute data sample every 10 minutes.
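Dropped into the nullQueue-routing pattern from the docs, that regex would serve as the "keep" rule (stanza names here are hypothetical):

```ini
# transforms.conf -- keep only events whose minute ends in 0,
# i.e. minutes 10, 20, 30, 40, 50; everything else is discarded.
# Note: as written, the character class never matches minute :00.
[discard_all]
REGEX = .
DEST_KEY = queue
FORMAT = nullQueue

[keep_tenth_minute]
REGEX = \d\d:[12345]0:\d\d
DEST_KEY = queue
FORMAT = indexQueue
```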


gkanapathy
Splunk Employee

I would caution that this could introduce systematic biases in your data. For example, an event that went off every minute at :00 seconds would always get counted, and thus may be disproportionately represented in your sample.


jbsplunk
Splunk Employee

Nice, I never thought about using routing in that manner, but that is pretty slick.


jbsplunk
Splunk Employee

Not really... Once you start monitoring a file, Splunk is going to pick up where it left off the last time it read that file, regardless of the settings you put in place for that particular input.

However, there is a solution for exactly the use case you're mentioning. You can route the data you aren't interested in seeing into the nullQueue, and data routed to that queue will not count against your license the way data in the indexQueue does.

http://www.splunk.com/base/Documentation/latest/Deploy/Routeandfilterdatad

You want to follow the example in 'Filter event data and send to queues', specifically the use case 'Keep specific events and discard the rest'.

I hope this helps! Good luck!
