April 18, 2011

Hadoop Intermediate Data Compression

To enable compression of intermediate data (the map output), set the corresponding properties in mapred-site.xml.
<!-- mapred-site.xml -->
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

Setting up LZO compression is a bit trickier. First of all, the LZO package has to be installed on all nodes. I built this package, following the instructions here.

If the build fails with an error such as "BUILD FAILED make sure $JAVA_HOME set correctly.", take a look here.

In the end, this is what my config files look like:
<!-- mapred-site.xml -->
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

<!-- core-site.xml -->
<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

To compress the final output data, the Job object has to be configured to write compressed output before the job is executed.
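
As a rough sketch, final output compression can probably also be switched on through configuration instead of code. The property names below are an assumption on my part: as far as I know they are the keys that FileOutputFormat.setCompressOutput(job, true) and FileOutputFormat.setOutputCompressorClass(job, ...) write in Hadoop releases that use the mapreduce.* naming, while older releases use mapred.output.compress and mapred.output.compression.codec instead. GzipCodec is only an example choice of codec.
<!-- job configuration or mapred-site.xml (assumed property names, check them against your Hadoop version) -->
<property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

Values set on the Job object in code should take precedence over these site-wide defaults, unless a property is marked final in the site file.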