What's the trick to get unarchive_cmd to work for ...

Lowell · ‎07-04-2014

I've been attempting to get the unarchive_cmd to work with no success. I decided to abandon my ultimate goal for the purpose of troubleshooting and what I'm finding is that I can't get even the most simple use-case to work.

My simplified use case is simple adding support for an "imaginary" gzip-2 format. (Which is really just gzip, with ".gz2" as the extension instead of ".gz"). So I copied Splunk's default props.conf settings for handling "gzip" files and just basically add a "2" to the end of everything. My configuratoin looks like so:

props.conf:

[preprocess-gzip2]
invalid_cause = archive
is_valid = False
LEARN_MODEL = false

[source::....gz2(.\d+)?]
unarchive_cmd = gzip -cd -
sourcetype = preprocess-gzip2
NO_BINARY_CHECK = true

inputs.conf:

[monitor:///tmp/indexme]

I've tried adding unarchive_cmd to the [preprocess-gzip2] with no luck. I also added priority = 90 to [source::....gz2(.\d+)?]. Again no luck. I also tried adding sourcetype=preprocess-gzip2 to the monitor stanza -- no change.

The behavior I'm seeing is this:

Any normal gzip file ending with .gz is indexed successfully.
Any gzip file ending with .gz2 fails to be indexed at all. No meaningful error (or debug) messages explaining why.

I cranked up a few log sources for debugging purposes. Perhaps someone can shed some light on what I'm doing wrong. Again, my long-term goal is to implement some pre-processing command other than gzip -cd - but I'm just trying to boil this down to the most simple case to get things started.

Splunkd log - First attempt - Loading a .gz2 file [FAILS]

07-04-2014 15:34:21.631 +0000 INFO  TailingProcessor - Archive file='/tmp/indexme/test1-custom-handler-gz2.gz2' updated less than 10000ms ago, will not read it until it stops changing.
07-04-2014 15:34:31.632 +0000 INFO  TailingProcessor - Archive file='/tmp/indexme/test1-custom-handler-gz2.gz2' has stopped changing, will read it now.
07-04-2014 15:34:31.632 +0000 INFO  ArchiveProcessor - handling file=/tmp/indexme/test1-custom-handler-gz2.gz2
07-04-2014 15:34:31.633 +0000 INFO  ArchiveProcessor - reading path=/tmp/indexme/test1-custom-handler-gz2.gz2 (seek=0 len=1337)
07-04-2014 15:34:31.633 +0000 DEBUG ArchiveContext - /tmp/indexme/test1-custom-handler-gz2.gz2 is NOT an archive file.
07-04-2014 15:34:31.719 +0000 DEBUG ArchiveProcessor -   Attempting to load indexed extractions config from conf=/tmp/indexme/test1-custom-handler-gz2.gz2||preprocess-gzip2| ...
07-04-2014 15:34:31.720 +0000 DEBUG ArchiveContext - /tmp/indexme/test1-custom-handler-gz2.gz2 is NOT an archive file.
07-04-2014 15:34:31.729 +0000 DEBUG ArchiveProcessor -   Basing small file initCrc on path=/tmp/indexme/test1-custom-handler-gz2.
07-04-2014 15:34:31.729 +0000 DEBUG ArchiveProcessor -   This archive stream's initcrc=0x2e6fa8b616493a17.
07-04-2014 15:34:31.729 +0000 DEBUG ArchiveProcessor -   Found no initcrc match for this stream, will re-read entire file.
07-04-2014 15:34:31.729 +0000 DEBUG ArchiveContext - Early termination because write has false notification
07-04-2014 15:34:31.729 +0000 DEBUG ArchiveContext - /tmp/indexme/test1-custom-handler-gz2.gz2 is NOT an archive file.
07-04-2014 15:34:31.747 +0000 DEBUG ArchiveProcessor - fname=/tmp/indexme/test1-custom-handler-gz2.gz2 fishstate=key=0xb5635587184bcff4 sptr=1337 scrc=0x65e7d9f9bdd1b805 fnamecrc=0xba14100c169eff7e modtime=0
07-04-2014 15:34:31.747 +0000 INFO  ArchiveProcessor - Finished processing file '/tmp/indexme/test1-custom-handler-gz2.gz2', removing from stats

Splunkd log - Second attempt - Loading a typical .gz file [WORKS]

07-04-2014 15:36:18.722 +0000 INFO  TailingProcessor - Archive file='/tmp/indexme/test2-standard-gz-file.gz' updated less than 10000ms ago, will not read it until it stops changing.
07-04-2014 15:36:28.724 +0000 INFO  TailingProcessor - Archive file='/tmp/indexme/test2-standard-gz-file.gz' has stopped changing, will read it now.
07-04-2014 15:36:28.725 +0000 INFO  ArchiveProcessor - handling file=/tmp/indexme/test2-standard-gz-file.gz
07-04-2014 15:36:28.725 +0000 INFO  ArchiveProcessor - reading path=/tmp/indexme/test2-standard-gz-file.gz (seek=0 len=1344)
07-04-2014 15:36:28.725 +0000 DEBUG ArchiveContext - /tmp/indexme/test2-standard-gz-file.gz is NOT an archive file.
07-04-2014 15:36:28.858 +0000 DEBUG ArchiveProcessor -   Attempting to load indexed extractions config from conf=/tmp/indexme/test2-standard-gz-file.gz||test2-standard-gz-file-too_small| ...
07-04-2014 15:36:28.858 +0000 DEBUG ArchiveContext - /tmp/indexme/test2-standard-gz-file.gz is NOT an archive file.
07-04-2014 15:36:28.866 +0000 DEBUG ArchiveProcessor -   Basing small file initCrc on path=/tmp/indexme/test2-standard-gz-file.
07-04-2014 15:36:28.866 +0000 DEBUG ArchiveProcessor -   This archive stream's initcrc=0xcea312efbb705310.
07-04-2014 15:36:28.866 +0000 DEBUG ArchiveProcessor -   Found no initcrc match for this stream, will re-read entire file.
07-04-2014 15:36:28.866 +0000 DEBUG ArchiveContext - Early termination because write has false notification
07-04-2014 15:36:28.867 +0000 DEBUG ArchiveContext - /tmp/indexme/test2-standard-gz-file.gz is NOT an archive file.
07-04-2014 15:36:28.878 +0000 INFO  databasePartitionPolicy - idx=main Creating hot bucket=hot_v1_3, given event timestamped=1404488177
07-04-2014 15:36:28.879 +0000 INFO  DatabaseDirectoryManager - Writing a bucket manifest in hotWarmPath='/opt/splunk/var/lib/splunk/defaultdb/db'.  Reason='Adding bucket, bid=main~3~BCB6BEA7-CE1C-486C-BA49-07E40BAB59B1'
07-04-2014 15:36:28.880 +0000 DEBUG ArchiveProcessor - fname=/tmp/indexme/test2-standard-gz-file.gz fishstate=key=0x72a9c05c9c36e767 sptr=1344 scrc=0xb4910955c707a2a2 fnamecrc=0x84c51a934e8fdaab modtime=0
07-04-2014 15:36:28.880 +0000 INFO  ArchiveProcessor - Finished processing file '/tmp/indexme/test2-standard-gz-file.gz', removing from stats

Final thoughts

It appears as though the ArchiveContext <FILE> is NOT an archive file is bogus message since it appears for all archive files; the ones that work and the ones that don't.

The content of the two test files were created using a simple shell command for testing only.

( echo "`date` This is a test event ... ;  dd if=/dev/urandom of=/dev/stdout count=1 | xxd ) | gzip -c - > /tmp/indexme/<SOME_FILE_NAME_ENDING_IN_GZ_OR_GZ2>

I'm testing this on Splunk 6.1.1 (splunk-6.1.1-207789.x86_64 rpm) running on CentOS 6.4 using an AWS t1.micro instance.

Update:

I changed my extension from .gz2 to .xy (randomly chosen) just to make sure I wasn't some hitting the wrong [source:*] stanza. No change.

I added the following to my local props.conf just to prove to myself that Splunk is actually calling the unarchive_cmd


[source::....(?>!tar.)gz(.\d+)?]
unarchive_cmd = gzip -cd - | sed -re 's/secret/HIDDEN/g'

And this does work. I also tried this with an invalid command, and a typo in the sed command and all of those issues were logged in splunkd.log.

07-04-2014 16:26:15.102 +0000 ERROR ArchiveContext - From archive='/tmp/indexme/test5-standard-gz-file-with-wacky-archivecommandoverride.gz':  /bin/sh: NOT_A_VALID_COMMAND: command not found
07-04-2014 16:27:11.112 +0000 ERROR ArchiveContext - From archive='/tmp/indexme/test5-standard-gz-file-with-wacky-archivecommandoverride.gz':  sed: -e expression #1, char 9: unknown command: `*'

I feel like this should work without me having arbitrarily add a .gz to the end of my file names.

rkirkw · ‎09-06-2014

I was able to make it work with these files:

I am converting binary data to ascii data with a perl script as the unarchive_cmd

inputs.conf

[monitor:///var/log/huawei/*.cdr]
disabled = false
followTail = 0
host = huawei_switch
index = huawei
sourcetype = huawei_cdr

props.conf

[source::/var/log/huawei/*.cdr]
invalid_cause = archive
unarchive_cmd = /usr/local/bin/huawei2text.pl
sourcetype = huawei_cdr
NO_BINARY_CHECK=true

Caveat 1:

The props.conf file would not work if I used sourcetype [huawei_cdr] instead of the [source::/var/log/huawei/*.cdr] listed above.

Caveat 2:

It would not work with a "known" file type such as .dat or .txt
I found this out by running splunk in debug mode and it set the priority for it's understanding of a .dat file higher than what I was telling it in my props and inputs files. I changed the extention to a .cdr file and it was able to execute the unarchive_cmd and index the binary files properly.

Hope this helps!

What's the trick to get unarchive_cmd to work for a custom archive format?

.conf24 | Registration Open!

ICYMI - Check out the latest releases of Splunk Edge Processor

Introducing the 2024 SplunkTrust!