March 07, 2011

Hadoop - Java Heap Space Error

"Error: Java Heap space" means I'm trying to allocate more memory then available in the system.
how to go around? (1) better configuration (2) look for unnecessarily allocated objects
Configuration

mapred.map.child.java.opts: heap size for map tasks
mapred.reduce.child.java.opts: heap size for reduce tasks

mapred.tasktracker.map.tasks.maximum: maximum number of map tasks that can run simultaneously per node
mapred.tasktracker.reduce.tasks.maximum: maximum number of reduce tasks that can run simultaneously per node

Make sure ((num_of_maps * map_heap_size) + (num_of_reducers * reduce_heap_size)) is not larger than the memory available in the system. The maximum numbers of mappers and reducers can also be tuned by looking at available system resources.
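
As a sanity check, here is a minimal sketch of that budget in Java; the heap sizes and slot counts are assumptions for an 8 GB worker node, not recommendations:

import org.apache.hadoop.conf.Configuration;

public class HeapBudget {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Per-task heap sizes (assumed values, tune for your cluster).
    conf.set("mapred.map.child.java.opts", "-Xmx512m");
    conf.set("mapred.reduce.child.java.opts", "-Xmx1024m");

    // Slot counts are TaskTracker settings (mapred-site.xml); plain
    // ints here just to make the arithmetic explicit.
    int maxMaps = 4;    // mapred.tasktracker.map.tasks.maximum
    int maxReduces = 2; // mapred.tasktracker.reduce.tasks.maximum

    // (4 * 512) + (2 * 1024) = 4096 MB of task heap in the worst case,
    // leaving room for the DataNode, TaskTracker and OS on an 8 GB node.
    int totalMb = (maxMaps * 512) + (maxReduces * 1024);
    System.out.println("Worst-case task heap: " + totalMb + " MB");
  }
}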

io.sort.factor: max number of streams to merge at once while sorting. Used on both the map and reduce sides.

io.sort.mb: map-side memory buffer size used while sorting
mapred.job.shuffle.input.buffer.percent: reduce-side buffer - the percentage of the maximum heap size allocated for storing map outputs during the shuffle
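
For job-level tuning, a sketch like this sets those knobs; the values are assumptions, and remember that io.sort.mb is carved out of the map task heap configured above:

import org.apache.hadoop.conf.Configuration;

public class SortShuffleTuning {
  public static Configuration tunedConf() {
    Configuration conf = new Configuration();
    // Map side: a bigger sort buffer means fewer spills to disk, but
    // it must fit well below the map heap (-Xmx) to avoid heap errors.
    conf.setInt("io.sort.mb", 100);
    conf.setInt("io.sort.factor", 50);
    // Reduce side: fraction of the reduce heap used to hold map
    // outputs during the shuffle.
    conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
    return conf;
  }
}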

NOTE: Using fs.inmemory.size.mb is a very bad idea!
Unnecessary memory allocation

Simply look for the new keyword and make sure there is no unnecessary allocation. A very common tip is to use the set() method of Writable objects rather than allocating a new object in every map or reduce call.
Here is a simple count example to show the trick:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Nested inside the job's driver class, as is typical.
public static class UrlReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  // Allocated once per reducer instance and reused for every key.
  private final IntWritable sumw = new IntWritable();
  private int sum;

  @Override
  public void reduce(Text key, Iterable<IntWritable> vals, Context context)
      throws IOException, InterruptedException {
    sum = 0;
    for (IntWritable val : vals) {
      sum += val.get();
    }
    sumw.set(sum);            // reuse the existing object instead of new IntWritable(sum)
    context.write(key, sumw);
  }
}
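
The point is that sumw is allocated once per reducer instance; without the reuse, every reduce() call would create a fresh IntWritable, and on jobs with millions of keys that garbage adds up fast.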

Note: there are a couple more tips here for resolving common errors in Hadoop.