All Apps and Add-ons

Splunk/Hunk snappy orc files: no field extraction in Fast Mode

burwell
SplunkTrust

Basic problem: in Smart Mode my fields are not extracted; everything works in Verbose Mode. Time-range searching also works, so I know the way I specify the time field is correct.

Search that fails: index=foo | stats count by hii (or any field that isn't partitioned)

I have looked at the previous questions on Hunk extractions and smart mode (e.g. https://answers.splunk.com/answers/147879/why-hunks-field-extractor-behaves-differently-in-smart-mod...) but I cannot get mine to work.

  • We are using log files generated by Spark: they are Snappy-compressed, with names ending in ... snappy.orc
  • There is no metastore, so I provide a fake database and table to make Splunk happy
  • I specify the exact fields and their types
  • I tried making all (or some of) the fields required, per Leon B's posts, but that didn't help
  • I have the snappy jar in THIRD_PARTY_JARS, and Splunk is able to decompress the ORC files

indexes.conf

vix.input.1.splitter.hive.fileformat = orc
vix.input.1.splitter.hive.columnnames  = cqtq, ttms, chi, crc, pssc, psql, cqhm, cquc, caun, phr, psct, cquuc, cqtr, cqssl, cqssr, pitag, sstc, psqql, ttsfb,ttrq, cqbl, pttsfb, tfstoc, sscl, UA, tsso, sscc, phi, chp, Carpcqh, sssc, cqssv, cqssc, hii
vix.input.1.splitter.hive.columntypes = string:int:string:string:int:bigint:string:string:string:string:string,string,int:int:int:string:int:int:int:int:bigint:int:bigint:bigint:string,int:int:string:int:string:string:string:string:string
vix.input.1.required_fields           = cqtq,ttms,UA,hii
# Completely made up values to satisfy Splunk                                                                                                                      
vix.input.1.splitter.hive.tablename  = transfered
vix.input.1.splitter.hive.dbname     = default
  • In my provider I have vix.splunk.search.splitter = HiveSplitGenerator
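As a sanity check on a config like the one above (this is a standalone sketch, not anything Splunk runs): columnnames are comma-separated while columntypes are colon-separated, and the two lists must line up one-to-one, so any stray comma in the types string silently shifts every field after it.

```python
# Hypothetical validator for the vix.input.1.splitter.hive.* settings.
# Shortened lists for illustration; the real config has 30+ columns.
columnnames = "cqtq, ttms, chi, crc"
columntypes = "string:int:string:string"

names = [n.strip() for n in columnnames.split(",")]
types = [t.strip() for t in columntypes.split(":")]

# A comma used where a colon was intended makes these lengths differ.
assert len(names) == len(types), (
    f"{len(names)} names but {len(types)} types -- "
    "check for commas used where colons were intended"
)
for name, typ in zip(names, types):
    print(f"{name}: {typ}")
```

Running this against the columntypes string above would fail the assertion, which is exactly the misalignment diagnosed in the accepted answer.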

props.conf

[source::/projects/flickr/flopsa/ycpi_spark/orc/...]
priority          = 202
sourcetype        = foo                                                                                                                                     
NO_BINARY_CHECK   = true

[foo]
NO_BINARY_CHECK = 1
SHOULD_LINEMERGE  = false
TIME_PREFIX       = cqtq\":
TIME_FORMAT       = %s.%3N

(Note: I also tried the following two settings; each gets the time search working, but the fields are still not extracted.)

eval-_time=strptime('cqtq',"%s.%3N")                                                                                                                              
EXTRACT-_time=strptime('cqtq',"%s.%3N")  
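For reference, the TIME_FORMAT "%s.%3N" means epoch seconds, a dot, then three digits of milliseconds. A minimal Python sketch of the same parse, using a made-up sample value:

```python
from datetime import datetime, timezone

# "1457107200.123" is a hypothetical cqtq value:
# 10-digit epoch seconds, a period, then 3 millisecond digits.
raw = "1457107200.123"
ts = datetime.fromtimestamp(float(raw), tz=timezone.utc)
print(ts.isoformat())  # 2016-03-04T16:00:00.123000+00:00
```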
1 Solution

burwell
SplunkTrust

I got help from Splunk (thanks, Raanan!); here is my solution so others can benefit.

My indexes.conf

  1. In my long columntypes list I had some commas instead of colons as separators, so the columnnames did not align with the columntypes (above, where I have string, int, it should be string:int). What I learned from Raanan: pull out just the first few columns, and once those work, add the remaining fields.

  2. I shouldn't need to specify a dummy database and table name. We are filing a bug report.

  3. In the index (not the provider) I specified the following. This way you don't need several different providers; you can reuse one:

    vix.input.1.splitter.hive.fileformat = orc
    vix.input.1.splitter = HiveSplitGenerator
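Raanan's incremental debugging approach from item 1 can be sketched as follows (prefix_conf is a hypothetical helper, not a Splunk feature): start with the first few columns, confirm extraction works, then grow the list.

```python
# Shortened, illustrative column lists.
names = ["cqtq", "ttms", "chi", "crc", "pssc"]
types = ["string", "int", "string", "string", "int"]

def prefix_conf(n):
    """Emit the two vix settings limited to the first n columns,
    using the correct separators: commas for names, colons for types."""
    return (
        "vix.input.1.splitter.hive.columnnames = " + ",".join(names[:n]),
        "vix.input.1.splitter.hive.columntypes = " + ":".join(types[:n]),
    )

for line in prefix_conf(3):
    print(line)
```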

So altogether, this is what worked (I shortened the columnnames list and the types to make things clearer):

[foo]
vix.provider                      = bt
vix.input.1.path                  = /my/path/...
vix.input.1.accept                = \.orc$
vix.input.1.ignore                = .+SUCCESS
vix.input.1.et.regex              = /my/path/regex...
vix.input.1.et.format             = yyyyMMddHH
vix.input.1.et.offset             = 0
vix.input.1.lt.regex              = /my/path/regex...
vix.input.1.lt.format             = yyyyMMddHH
vix.input.1.lt.offset             = 3600
vix.input.1.splitter.hive.fileformat = orc
vix.input.1.splitter                  = HiveSplitGenerator
vix.input.1.required_fields           = cqtq,b
vix.input.1.splitter.hive.columnnames = cqtq,b,c,d,e,f,g,h,i 
vix.input.1.splitter.hive.columntypes = string:int:string:string:int:bigint:string:string:string.. etc
# Completely made up values to satisfy Splunk bug                                                                                       
vix.input.1.splitter.hive.tablename  = default
vix.input.1.splitter.hive.dbname     = default

Props.conf
To get search by time working properly, I used the following in my props.conf. My time field is called cqtq: a 10-digit Unix timestamp, followed by a period and 3 digits, at the beginning of each record.

eval-_time                = strptime('cqtq',"%s.%3N")

