Splunk Hadoop Connect - unable to read snappy compressed data

splunkears
Path Finder

Does Hadoop Connect support indexing snappy-compressed files on HDFS?
All it needs is to use -text (rather than -cat) when reading the file for indexing. Without this, Splunk ends up indexing the raw compressed bytes as garbage.

Any insights?

1 Solution

splunkears
Path Finder

It looks like the current Splunk Hadoop Connect does not support Snappy.
Here is the relevant code from Splunk Hadoop Connect:

splunk-HadoopConnect-master/bin/hdfs.py
...

def process_file_uri(hdfs_uri):
    hj = HadoopCliJob(HadoopEnvManager.getEnv(hdfs_uri))
    # a hack: I couldn't get the data translator to work with .gz files;
    # so, we rely on the hj.text() to do the gunzip'ing for us
    translator = None
    if hdfs_uri.endswith(".gz"):
        hj.text(hdfs_uri)          # <<======== invokes the -text command of hadoop FsShell
        translator = FileObjTranslator(hdfs_uri, hj.process.stdout)
    else:
        hj.cat(hdfs_uri)
        translator = get_data_translator(hdfs_uri, hj.process.stdout)

    cur_src = ""
    buf = translator.read()
    bytes_read = len(buf)
    ...

The fix is to extend the .gz condition so that URIs ending in ".snappy" are also read via -text, which lets FsShell decode snappy files before they are indexed.
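The extension check itself can be sketched like this (a minimal, self-contained illustration; the helper name fs_shell_subcommand is hypothetical, not part of the app):

```python
# Hypothetical sketch: pick the Hadoop FsShell subcommand by file extension,
# so codecs that FsShell can decode (gzip, snappy, ...) are read via -text
# instead of raw -cat.

TEXT_EXTENSIONS = (".gz", ".snappy")  # extend for other FsShell-decodable codecs

def fs_shell_subcommand(hdfs_uri):
    """Return 'text' for extensions FsShell decodes, else 'cat'."""
    # str.endswith accepts a tuple of suffixes, so one check covers both
    return "text" if hdfs_uri.endswith(TEXT_EXTENSIONS) else "cat"
```

In hdfs.py this would amount to changing the condition to hdfs_uri.endswith((".gz", ".snappy")) and keeping the hj.text() branch unchanged.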


Ledion_Bitincka
Splunk Employee

Thanks for pointing this out - I've filed a requirement for us to address this during our next revision of the app.


splunkears
Path Finder

Typo in iii) above:
iii) How to flush current index and re-index HDFS files from UI?

Thanks.


splunkears
Path Finder

Thanks for considering the request.
Please consider the following:
i) When HDFS files are indexed, provide a way to specify the timestamp column. (Compare this with uploading a single file via Splunk Web, which lets you specify and verify the timestamp column so that per-day and per-hour indexing is accurate.)
ii) I also noticed that line breaking goes wrong when Hadoop Connect reads snappy files, so I had to add a special sourcetype stanza to introduce line breaking.
iii) From the UI, how to flush the current index and re-index HDFS files?
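For reference, the line-breaking workaround mentioned in ii) is typically a props.conf sourcetype stanza; the sourcetype name and break pattern below are hypothetical and need adjusting to the actual data:

```
# props.conf on the indexer/forwarder handling the Hadoop Connect input
[hdfs_snappy_logs]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
```

SHOULD_LINEMERGE = false with an explicit LINE_BREAKER forces event boundaries at newlines instead of relying on timestamp-based merging.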
