April 03, 2011

java.io.EOFException with Hadoop

My code runs smoothly with a smaller dataset, but whenever I run it against a larger one it fails with a java.io.EOFException, and I've been trying to figure out why.

11/03/31 01:13:55 INFO mapreduce.Job: 
  Task Id: attempt_201103301621_0025_m_000634_0, Status : FAILED
java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:197)
 at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:68)
 at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:106)
 at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1999)
 at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2131)
 ...
 ...
 ...
 at org.apache.hadoop.mapred.MapTask$
  NewTrackingRecordReader.nextKeyValue(MapTask.java:465)
 at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
 at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:90)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
 at org.apache.hadoop.mapreduce.lib.input.DelegatingMapper.run
  (Delegatin

So, an EOFException means something is wrong with your input files. If a file is not written and closed correctly, this exception is thrown when it is read back: the reader thinks there is more to read, but fewer bytes are left than it expected.
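
For example, when a SequenceFile is written by hand, skipping close() can leave a truncated file that only blows up once the larger job reads it. Here is a minimal sketch of writing one safely (the path and key/value types are just placeholders, not taken from the failing job):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("input/part-00000");   // placeholder path

    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
      writer.append(new Text("some-key"), new IntWritable(42));
      // ... more append() calls ...
    } finally {
      // close() flushes buffered data to the file system; a writer that is
      // never closed is the classic source of a truncated SequenceFile.
      if (writer != null) {
        writer.close();
      }
    }
  }
}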
To solve the problem, dig into the input files and make sure they are created carefully, without any corruption, and that every writer that produces them gets closed. If MultipleOutputs is used to prepare the input files, make sure it is closed at the end as well!
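
If the job that prepares those files uses MultipleOutputs, the close belongs in cleanup(), because MultipleOutputs keeps its own record writers open for the whole task. A rough sketch with the new mapreduce API (the class name and the "seq" named output are made up for illustration):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PrepareInputReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private MultipleOutputs<Text, IntWritable> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, IntWritable>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    // "seq" must be registered with MultipleOutputs.addNamedOutput() in the driver
    mos.write("seq", key, new IntWritable(sum));
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Without this close(), files written through MultipleOutputs may be
    // left incomplete and fail later with EOFException when read.
    mos.close();
  }
}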