October 31, 2010

Why can’t Hadoop always read .gz compressed input files properly?

Hadoop is supposed to work happily with the .gz input file format by default. [1] So I ran my MR job with gzip-compressed input files and boom! It didn’t work. Whenever there was an empty line in the input, Hadoop got stuck there and didn’t read or recognize the rest of the file (basically readLine returned a zero-length string even though there was more data). I spent hours trying to figure out the problem. Everything looked fine: the .gz files weren’t corrupted or anything, and my code ran fine with the decompressed input.

In the end I realized that if I decompressed the .gz input files and re-compressed them, their size dropped by half! It seems Hadoop has problems with certain variants of .gz compression. I suspect my input files were compressed on a Windows machine, and it looks like some compression applications end up producing a type of .gz file that is incompatible with Hadoop.
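If you want to check whether your own files have this problem, a quick sanity check (just my own approach, with a made-up file name) is to re-compress a copy and compare the sizes, since the problematic files came out roughly twice as large for me:

f="part-00000.gz"                         # hypothetical input file name
ls -l "$f"                                # size of the original .gz
zcat "$f" | gzip > /tmp/recompressed.gz   # decompress and re-compress a copy
ls -l /tmp/recompressed.gz                # a much smaller result suggests the incompatible variant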

Long story short: if Hadoop won’t process your .gz compressed input files, try decompressing and re-compressing them with gzip on a Linux machine.
gunzip filename.gz
gzip filename



Here is a script that decompresses and re-compresses all the files under a given directory one by one (in case you have a huge archive!):


#!/bin/bash

# Re-compress every .gz file under the given directory, one at a time.
dir="aaa"
for f in "$dir"/*.gz; do
        gunzip "$f"         # decompress the original file
        gzip "${f%.gz}"     # re-compress it with local gzip
done
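
If your files live in nested subdirectories (the loop above only covers a flat directory), a find-based variant along these lines should also work; the directory name is just a placeholder:

#!/bin/bash

# Recursively re-compress every .gz file under $dir (placeholder name).
dir="aaa"
find "$dir" -type f -name '*.gz' | while IFS= read -r f; do
        gunzip "$f"
        gzip "${f%.gz}"
done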



[1] Tom White, Hadoop: The Definitive Guide