Here is an example usage of the distributed cache. While working on a web-graph problem, I replace URLs with unique IDs. If I have the URL-to-ID mapping in memory, I can easily replace each URL with its corresponding ID. Here is the sample usage:
public static class ReplacementMapper extends Mapper<Text, Text, Text, Text> {

    private HashMap<String, String> idmap;

    @Override
    public void setup(Context context) {
        loadIdUrlMapping(context);
    }

    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        ....
    }
The id-url mapping is loaded at the beginning of each map task. The example below simply reads the file out of HDFS and stores the data in a HashMap for quick access. Here is the function:
    private void loadIdUrlMapping(Context context) {
        BufferedReader br = null;
        try {
            FileSystem fs = FileSystem.get(context.getConfiguration());
            Path path = new Path(cacheFileLocation);
            br = new BufferedReader(new InputStreamReader(fs.open(path)));
            this.idmap = new HashMap<String, String>();
            String line;
            while ((line = br.readLine()) != null) {
                String[] arr = line.split("\t");
                if (arr.length == 2)
                    idmap.put(arr[1], arr[0]); // key: url, value: id
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
            System.out.println("read from distributed cache: file not found!");
        } catch (IOException e) {
            e.printStackTrace();
            System.out.println("read from distributed cache: IO exception!");
        } finally {
            if (br != null) {
                try { br.close(); } catch (IOException e) { e.printStackTrace(); }
            }
        }
    }
}
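The parsing step itself has no Hadoop dependency, so it can be sketched and sanity-checked in plain Java. The class and method names below (IdUrlMapLoader, parse) are hypothetical, chosen only for this illustration; the split-on-tab and put(arr[1], arr[0]) logic mirrors the function above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;

public class IdUrlMapLoader {

    // Parse tab-separated "id<TAB>url" lines into a url -> id map,
    // skipping any line that does not have exactly two fields.
    public static HashMap<String, String> parse(BufferedReader br) throws IOException {
        HashMap<String, String> idmap = new HashMap<String, String>();
        String line;
        while ((line = br.readLine()) != null) {
            String[] arr = line.split("\t");
            if (arr.length == 2)
                idmap.put(arr[1], arr[0]); // key: url, value: id
        }
        return idmap;
    }

    public static void main(String[] args) throws IOException {
        // Simulate the cached mapping file with an in-memory string.
        String sample = "1\thttp://a.example\n2\thttp://b.example\nmalformed line\n";
        HashMap<String, String> map = parse(new BufferedReader(new StringReader(sample)));
        System.out.println(map.get("http://a.example")); // prints 1
        System.out.println(map.size());                  // prints 2 (malformed line skipped)
    }
}
```

In the real mapper the BufferedReader would wrap an FSDataInputStream from HDFS instead of a StringReader, but the lookup map it produces is identical.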
This is one way of accessing shared data among Hadoop nodes. Another way is to access it through the local file system. Here is a great article about how to distribute cached data automatically among nodes and access it later.