I'm running Cloudera CDH 5.0.2 and Hunk 6.2, I believe (or whatever the latest version of each is). I was trying to turn on Snappy compression, which I enabled on the Cloudera side, but there are several compression settings that should be pushed down at the job level. Here are the configs in my Virtual Index Provider — please verify they are correct. So far I see no performance benefit.
vix.mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec
vix.mapreduce.output.fileoutputformat.compress.type = BLOCK
vix.mapreduce.output.fileoutputformat.compress = true
Are these correct, and if so, why am I seeing no performance difference? I am processing netflow data; each file is about 300 MB and covers 15 minutes of netflow. Using a date range and verbose mode, it takes about 10 minutes to process 94 files x 300 MB per file. Note the netflow data is not compressed in HDFS. If the files need to be compressed on HDFS, I assume LZO or Snappy is the right choice?
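For reference, I'm also considering the map-output compression settings below. This is just a sketch based on the standard Hadoop MR2 property names (mapreduce.map.output.compress and mapreduce.map.output.compress.codec) with Hunk's vix. prefix added — I'm not certain Hunk passes these through the same way:

vix.mapreduce.map.output.compress = true
vix.mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec

My understanding is that the fileoutputformat settings above only compress job output, not the reading of input splits, which may be why I see no change when the input files themselves are uncompressed.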
Thanks.