Splunk Hadoop Connect - unable to read snappy compressed data

splunkears
Path Finder

Does Hadoop Connect support indexing snappy-compressed files on HDFS?
All it needs is to use -text (rather than -cat) when reading the file for indexing. Without this, Splunk ends up indexing the raw compressed bytes as garbage.

Any insights?

1 Solution

splunkears
Path Finder

It looks like the current Splunk Hadoop Connect does not support Snappy.
Here is the relevant code from Splunk Hadoop Connect:

splunk-HadoopConnect-master/bin/hdfs.py
...

def process_file_uri(hdfs_uri):
    hj = HadoopCliJob(HadoopEnvManager.getEnv(hdfs_uri))
    # a hack: I couldn't get the data translator to work with .gz files;
    # so, we rely on the hj.text() to do the gunzip'ing for us
    translator = None
    if hdfs_uri.endswith(".gz"):
        hj.text(hdfs_uri)          # <<======== invokes the -text command of hadoop FsShell
        translator = FileObjTranslator(hdfs_uri, hj.process.stdout)
    else:
        hj.cat(hdfs_uri)
        translator = get_data_translator(hdfs_uri, hj.process.stdout)

    cur_src = ""
    buf = translator.read()
    bytes_read = len(buf)
    ...

The fix is to extend the .gz condition so that URIs ending in ".snappy" are also read via -text, which lets FsShell decode snappy files before they are indexed.
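The extension check itself can be sketched like this (a minimal, self-contained illustration; the helper name fs_shell_subcommand is hypothetical, not part of the app):

```python
# Hypothetical sketch: pick the Hadoop FsShell subcommand by file extension,
# so codecs that FsShell can decode (gzip, snappy, ...) are read via -text
# instead of raw -cat.

TEXT_EXTENSIONS = (".gz", ".snappy")  # extend for other FsShell-decodable codecs

def fs_shell_subcommand(hdfs_uri):
    """Return 'text' for extensions FsShell decodes, else 'cat'."""
    # str.endswith accepts a tuple of suffixes, so one check covers both
    return "text" if hdfs_uri.endswith(TEXT_EXTENSIONS) else "cat"
```

In hdfs.py this would amount to changing the condition to hdfs_uri.endswith((".gz", ".snappy")) and keeping the hj.text() branch unchanged.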


Ledion_Bitincka
Splunk Employee

Thanks for pointing this out - I've filed a requirement for us to address this during our next revision of the app.


splunkears
Path Finder

Typo in iii) above:
iii) How to flush current index and re-index HDFS files from UI?

Thanks.


splunkears
Path Finder

Thanks for considering the request.
Please consider the following:
i) When HDFS files are indexed, provide a way to specify the timestamp column. (Compare this with uploading a single file via Splunk Web, which lets you specify and verify the timestamp column so that per-day and per-hour indexing is accurate.)
ii) I also noticed that line breaking goes wrong when Hadoop Connect reads snappy files, so I had to add a special sourcetype stanza to introduce line breaking.
iii) From the UI, how to flush the current index and re-index HDFS files?
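For reference, the line-breaking workaround mentioned in ii) is typically a props.conf sourcetype stanza; the sourcetype name and break pattern below are hypothetical and need adjusting to the actual data:

```
# props.conf on the indexer/forwarder handling the Hadoop Connect input
[hdfs_snappy_logs]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
```

SHOULD_LINEMERGE = false with an explicit LINE_BREAKER forces event boundaries at newlines instead of relying on timestamp-based merging.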
