Getting Data In

duplicate events: unarchive_cmd gets passed whole file, not just the delta since the last change

stu2
Explorer

The docs make it look like CHECK_METHOD = endpoint_md5 in props.conf should tell Splunk to only send deltas. But anytime the source file changes, the whole file gets reprocessed, and Splunk creates duplicates of any previously indexed data.

Is this related to priority? If I don't set priority in props.conf (see below), my unarchive_cmd doesn't get run. If I DO set it as below, it runs, BUT I get duplicates.

Here's my inputs.conf

[monitor:///Users/stu/projects/splunk/compressed/txnsummaries.txt]
sourcetype = txn_summaries_st

props.conf

[source::/Users/stu/projects/splunk/compressed/txnsummaries.txt]
invalid_cause = archive
unarchive_cmd = python /Users/stu/projects/splunk/scripts/timings_filter.py
sourcetype = txn_summaries_st
NO_BINARY_CHECK = true
priority = 10
CHECK_METHOD = endpoint_md5

unarchive_cmd is otherwise doing what I'm looking for. I'm taking a single event containing many batched/compressed transaction timing records and breaking each record up into its own event. Splunk is correctly seeing the events I'm sending. However, whenever the source file changes I get duplicates of previously indexed events.
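For context, the filter just follows the unarchive_cmd contract: Splunk pipes the whole file to the command's stdin and indexes whatever comes out on stdout, one line per event. A stripped-down sketch of that shape (not the real timings_filter.py, and the decode itself is omitted):

import sys

def expand(line):
    # the real script unpacks one batched log line into individual
    # txn records; the decode is left out of this sketch
    yield line

for line in sys.stdin:
    for record in expand(line.rstrip("\n")):
        print(record)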

Hoping this isn't a limitation of unarchive_cmd.

Any ideas?

Thanks in advance


jrodman
Splunk Employee

I'm not really sure what you're doing here with unarchive_cmd, but it was built to handle things like gzip files. Those can't be processed without reading the entire file.

Whether Splunk can track content and pick up only the new records from the output of custom unarchive commands is a question I don't know the answer to. If it is possible, it certainly isn't documented functionality.

The usual tools to build an input like this are scripted or modular inputs, but that puts the bookmark-tracking problem squarely on the input script.
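If you do go that route, here is a minimal sketch of how a scripted input might carry its own bookmark, assuming a checkpoint file that remembers the last byte offset read (the checkpoint location is a placeholder, not anything from your setup):

import os
import sys

LOG_PATH = "/Users/stu/projects/splunk/compressed/txnsummaries.txt"
CHECKPOINT = "/tmp/txnsummaries.offset"   # hypothetical bookmark location

def read_offset():
    try:
        with open(CHECKPOINT) as f:
            return int(f.read().strip() or 0)
    except (IOError, ValueError):
        return 0

def write_offset(offset):
    with open(CHECKPOINT, "w") as f:
        f.write(str(offset))

offset = read_offset()
if os.path.getsize(LOG_PATH) < offset:
    offset = 0          # the log rolled; start from the top

with open(LOG_PATH) as f:
    f.seek(offset)
    while True:
        pos = f.tell()
        line = f.readline()
        if not line or not line.endswith("\n"):
            # EOF or a partially written line; resume here next run
            write_offset(pos)
            break
        # decode/expand the line here, then emit one event per record
        sys.stdout.write(line)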

The easiest solution is to just process your compressed data into uncompressed logs ahead of time.


stu2
Explorer

I agree, it would be far easier just to emit the logs with all contents uncompressed. But for performance reasons that's not a great option for our production environment. We periodically emit, from a background thread, a single log event with a field containing a gzip'd and base64-encoded collection of JSON records. Output looks something like this:

2014-10-04 11:37:27 [pool-2-thread-1] INFO :  TxnSummaryLogWriter.logTiming JSON_GZIP_TXN_SUMMARIES  [[[H4sIAAAAAAAAAOWYS09jRxBG/wq6a9eon9XV3hGwNBPNAGKMFCkaoeruaoQCJjImioT47ynCLHPhLry5zs6Pbln36NTj8/Ow+3vz+HR/z9tbeRyWvz8P......yG+8XHwAA]]] 

As these events are written to the log, we'd like to pick up just the deltas and unpack the contents to create one event per txn that gets sent to Splunk. I have a simple Python script that does this; I just can't seem to figure out how to plug into Splunk's ingest pipeline so it runs only on deltas.
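For reference, the unpack step itself is roughly this (a simplified sketch of what the script does; the [[[ ... ]]] wrapper handling and the assumption that the payload is a JSON array are inferred from the sample line above, not copied from the real timings_filter.py):

import base64
import gzip
import io
import json
import re

PAYLOAD = re.compile(r"\[\[\[(.+?)\]\]\]")

def expand(line):
    # pull the base64 payload out of the [[[ ... ]]] wrapper
    match = PAYLOAD.search(line)
    if not match:
        return
    raw = base64.b64decode(match.group(1))
    # gunzip, then treat the result as a JSON array of txn records
    text = gzip.GzipFile(fileobj=io.BytesIO(raw)).read().decode("utf-8")
    for record in json.loads(text):
        yield json.dumps(record)   # one event per txn

Dropping something like this into the stdin/stdout filter above (or into a scripted input) produces one event per transaction; the open question is still who keeps track of the deltas.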

Scripted inputs would work, but I don't want to have to worry about identifying deltas in my script - Splunk's great at that. And waiting until the log rolls over means an unwanted delay in getting the info into Splunk.
