December 03, 2010

make sure the start-dfs.sh script is being executed on the master !

everything looks fine when start-all.sh is executed, but no NameNode process appears in the jps output. why ? also, when I check the logs I see the following networking exceptions:

ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException: Problem binding to hostname/ipaddress:host : Cannot assign requested address
        at org.apache.hadoop.ipc.Server.bind(Server.java:190)
        at org.apache.hadoop.ipc.Server$Listener.(Server.java:253)
        at org.apache.hadoop.ipc.Server.(Server.java:1026)
        at org.apache.hadoop.ipc.RPC$Server.(RPC.java:488)

when I try to do an ls on DFS, I get the following:
ipc.Client: Retrying connect to server:  hostname/ipaddress:host
...
...
Bad connection to FS. command aborted

I spent a lot of time trying to figure out the "networking" problem: checked whether the port was already in use, looked for an IPv4/IPv6 conflict, etc ..

in the end, I realized that I was running the start-all.sh script on a random node. When it is executed on the master node, everything works just fine ! simple fix ..

November 24, 2010

hadoop - wrong key class exception

Usually this happens because of a mismatch between the Map or Reduce class signature and the job's configuration settings.

But also be careful about the combiner ! Check whether you are using the same class as both reducer and combiner. If the reducer's input key-value types are not the same as its output key-value types, then it cannot be used as a combiner - because the combiner's output becomes the reducer's input !

here is an example:
reducer input key val : < IntWritable, IntWritable >
reducer output key val: < Text, Text >

if this reducer is used as the combiner, then the combiner will output <Text, Text> and the reducer will receive <Text, Text> as input - and boom - wrong key class exception !
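
To make this concrete, here is a minimal sketch (class and method names are made up, not taken from any real job) of a reducer with the signature above and the driver line that would trigger the exception if it were also registered as the combiner:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class CombinerMismatchExample {

    // input key-value: <IntWritable, IntWritable>, output key-value: <Text, Text>
    public static class CountReducer
            extends Reducer<IntWritable, IntWritable, Text, Text> {
        @Override
        protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(new Text(key.toString()), new Text(String.valueOf(sum)));
        }
    }

    public static void configure(Job job) {
        job.setReducerClass(CountReducer.class);
        // this compiles, but at runtime the combiner emits <Text, Text>,
        // which the reducer cannot accept as <IntWritable, IntWritable> input:
        // job.setCombinerClass(CountReducer.class);   // -> wrong key class exception
    }
}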

November 23, 2010

java.lang.InstantiationException hadoop

java.lang.InstantiationException definition:
Thrown when an application tries to create an instance of a class using the newInstance method in class Class, but the specified class object cannot be instantiated because it is an interface or is an abstract class.

I got this exception after setting the input format to FileInputFormat -
FileInputFormat is an abstract class !
job.setInputFormatClass(FileInputFormat.class)

The default is TextInputFormat, and it can be used instead:
job.setInputFormatClass(TextInputFormat.class)
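
For reference, here is a minimal sketch of the relevant driver lines (the path and class names are made up): a concrete TextInputFormat works both when it is set directly and when inputs are registered through MultipleInputs, which is where the stack trace below ends up.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatSetup {
    public static void configure(Job job) {
        // concrete class: OK. Passing FileInputFormat.class here fails at
        // job submission because the abstract class cannot be instantiated.
        job.setInputFormatClass(TextInputFormat.class);

        // the same rule applies when inputs are added through MultipleInputs
        MultipleInputs.addInputPath(job, new Path("/input/links"), TextInputFormat.class);
    }
}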

exception:
Exception in thread "main" java.lang.RuntimeException: java.lang.InstantiationException
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:123)
        at org.apache.hadoop.mapreduce.lib.input.MultipleInputs.getInputFormatMap(MultipleInputs.java:109)
        at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:58)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:401)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:418)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:338)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:960)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:976)
        at nyu.cs.webgraph.LinkGraphUrlIdReplacement.phase1(LinkGraphUrlIdReplacement.java:326)
        at nyu.cs.webgraph.LinkGraphUrlIdReplacement.main(LinkGraphUrlIdReplacement.java:351)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
Caused by: java.lang.InstantiationException
        at sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:48)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:121)
        ... 14 more

November 20, 2010

Finding a needle in Haystack: Facebook's photo storage

A summary of the Haystack project, which Facebook started using for storing pictures.

Facebook's total image workload:
  • 260 billion images (~20 petabytes) of data
  • every week, 1 billion new photos (~60 terabytes) are uploaded

Main characteristics of Facebook images:
  • read often
  • written once
  • no modification
  • rarely deleted

Traditional file systems are not fast enough for this workload (too many disk accesses per read), and external CDNs won't be enough in the near future due to the increasing load - especially for the long tail. As a solution, Haystack is designed to provide:
  1. High throughput and low latency:
    • keeps metadata in main memory - at most one disk access per read
  2. Fault tolerance:
    • replicas are kept in different geographical regions
  3. Cost effectiveness and simplicity:
    • compared to an NFS-based NAS appliance:
    • each usable terabyte costs ~28% less
    • ~4% more reads per second

Design Prior to Haystack



What was learned from the NFS-based design
  • more than 10 disk operations to read an image
  • if directory sizes are reduced, 3 disk operations to fetch an image
  • caching file names for likely next requests - new kernel function open_by_file_handle
Takeaways from the previous design
  • Focusing only on caching has limited impact on reducing disk operations for the long tail
  • CDNs are not effective for the long tail
  • Would a GoogleFS-like system be useful ?
  • Lack of a correct RAM/disk ratio in the current system
Haystack Solution:
  • use XFS (an extent-based file system)
    • reduce metadata size per picture so all metadata can fit into RAM
    • store multiple photos per file
    • this gives a very good price/performance point - better than buying more NAS appliances
    • holding all regular-size metadata in RAM would be far too expensive
  • design your own CDN (Haystack Cache)
    • uses a distributed hash table
    • if the requested photo cannot be found in the cache, it is fetched from the Haystack store
    • store multiple photos per file


DESIGN DETAILS
needs to be updated ..

D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel. Finding a needle in Haystack: Facebook’s photo storage. In OSDI ’10

November 03, 2010

Using Distributed Cache in Hadoop

The distributed cache allows static data to be shared among all nodes. In order to use this functionality, the data location should be set before the MR job starts.

Here is an example usage of the distributed cache. While working on a web-graph problem, I replace URLs with unique IDs. If I have the URL-ID mapping in memory, I can easily replace URLs with their corresponding IDs. So here is the sample usage:

public static class ReplacementMapper extends Mapper<Text, Text, Text, Text> {

    private HashMap<String, String> idmap;

    @Override
    public void setup(Context context) {
        // load the id-url mapping once, at the start of each map task
        loadIdUrlMapping(context);
    }

    @Override
    public void map(Text key, Text value, Context context) throws InterruptedException {
        ....
    }

The id-url mapping is loaded at the beginning of each map task. The example below simply reads the file out of HDFS and stores the data in a hashmap for quick access. Here is the function:
    private void loadIdUrlMapping(Context context) {

        FSDataInputStream in = null;
        BufferedReader br = null;
        try {
            // cacheFileLocation: HDFS path of the id-url mapping file (defined elsewhere)
            FileSystem fs = FileSystem.get(context.getConfiguration());
            Path path = new Path(cacheFileLocation);
            in = fs.open(path);
            br = new BufferedReader(new InputStreamReader(in));
        } catch (FileNotFoundException e1) {
            e1.printStackTrace();
            System.out.println("read from distributed cache: file not found!");
            return;
        } catch (IOException e1) {
            e1.printStackTrace();
            System.out.println("read from distributed cache: IO exception!");
            return;
        }
        try {
            this.idmap = new HashMap<String, String>();
            String line = "";
            while ((line = br.readLine()) != null) {
                // each line holds two tab-separated fields
                String[] arr = line.split("\t");
                if (arr.length == 2)
                    idmap.put(arr[1], arr[0]);
            }
            in.close();
        } catch (IOException e1) {
            e1.printStackTrace();
            System.out.println("read from distributed cache: read length and instances");
        }
    }
}

This is one way of accessing shared data among Hadoop nodes. Another way of accessing it is through the local file system. Here is a great article about how to distribute cache data automatically among the nodes and access it later.
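
For completeness, here is a minimal sketch of that second route, assuming the 0.20-era org.apache.hadoop.filecache.DistributedCache API (the HDFS path is made up): the file is registered in the driver before the job starts, and the framework copies it to each node's local disk for the tasks to read.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class CacheFileExample {

    // driver side: register the HDFS file before the job is submitted
    public static void registerCacheFile(Configuration conf) {
        DistributedCache.addCacheFile(URI.create("/user/yasemin/id-url-mapping.txt"), conf);
    }

    // task side (e.g. in Mapper.setup): the framework has already copied the
    // file to the node's local disk, so it can be read like a local file
    public static BufferedReader openLocalCopy(Configuration conf) throws IOException {
        Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
        // assuming only one cache file was registered
        return new BufferedReader(new FileReader(localFiles[0].toString()));
    }
}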

November 02, 2010

How to manipulate (copy/move/rename etc..) the most recent n files under a directory

In order to list the most recent n files under the current directory:

ls --sort=time -r | tail -n

n represents the number of files, so it should be replaced with a number; eg: 5

in order to move, copy, delete or do something else with the result, this line can be fed into the "cp", "mv" or "rm" commands. However, the format is important: the line should be wrapped in quotes - not the single quotes next to Enter on your keyboard ( ' ), but the backquotes under the Esc key ( ` ).
So here is the command line for moving the most recent n files from one directory to another:

mv `ls --sort=time -r | tail -n` /home/yasemin/hebelek/

October 31, 2010

Why can't Hadoop always read .gz compressed input files properly ?

Hadoop is supposed to work happily with the .gz input file format by default. [1] So I ran my MR job with gz compressed input files and boom! it didn't work.. whenever there is an empty line in the input, Hadoop gets stuck there and doesn't read or recognize the rest of the file (basically readLine returns a 0-length string even though there is data). I spent hours figuring out the problem. Everything was looking great: the .gz files weren't corrupted or anything, and my code runs fine with the decompressed input... In the end I realized that if I decompress the .gz input files and re-compress them again, the size drops by half ! It seems Hadoop has problems with different flavors of .gz compression. I suspect my input files were compressed on a Windows machine, and it looks like some compression applications end up producing a type of .gz file that is incompatible with Hadoop.

Long story short: if Hadoop doesn't process your .gz compressed input files, try decompressing and re-compressing them with gzip on a Linux machine.
gunzip filename.gz
gzip filename



here is a script that unzips and re-zips all the files under a given directory one by one (in case you have a huge archive !)


#!/bin/bash

# unzip and re-zip every .gz file under $dir, one at a time
dir="aaa"
for f in "$dir"/*.gz; do
        gunzip "$f"
        gzip "${f%.gz}"
done



[1] Tom White, Hadoop: The Definitive Guide

October 13, 2010

eclipse Galileo CDT plugin "No Repository found" error

It seems people have faced similar problems with Eclipse versions earlier than 3.5. However, I had the same problem with the 3.5 (Galileo) version. There is an easy workaround:

- just download the CDT archive and install it manually.
* Here is the page where you can find eclipse-cdt.
* Then go to Help -> Install New Software and install it from the archive.

September 21, 2010

directory names that include spaces in Linux

I have both Ubuntu and Windows installed on my computer. When I'm on Ubuntu and want to reach some data from Windows, directory names with spaces cause problems..
A pretty easy way to get around this is using single quotes around the directory name in the terminal :-)

August 23, 2010

Inequalities with aws_sdb

I'm using aws_sdb_proxy for accessing SDB. Since it uses ActiveResource (a REST approach), querying with equalities is fairly simple. All the equalities can be passed as a hash via the :params parameter.

Balloon.find(:all, :params => {:color=> "green", :size => "5inch"})

which resolves to the GET request below:

http://localhost:8080/Baloons.xml?color=green&size=small

In order to support all the other SQL-like queries, sdb_proxy allows users to pass a hard-coded string value. Because ActiveResource is used as the underlying communication mechanism, the :params value has to be a hash even though all we need to pass is a string. aws_proxy gets around this problem by only looking at the first key of the hash in the case of a hard-coded query string.

Balloon.find(:all, :params => {" ['size' > '5inch'] sort 'size' " => nil})

and this resolves to the GET request below:

http://localhost:8080/Baloons/query.xml?:static_query

Note1: I wrote the query as static_query because I don't have my setup on this machine, but it will add the corresponding URL-encoded values for spaces, quotes and brackets.. In the end it will be something like http://localhost:8080/Baloons/query.xml?size**5inch**sort****size**
Note2: sdb_proxy uses site_prefix for the resource, so the object name (Balloon in this case) is not the domain name. This is a whole different point to discuss; I wrote this entry according to my altered version of sdb. If you are using sdb_proxy as-is, then your HTTP requests will be similar to the following:
http://localhost:8080/:domain_name/baloons/query.xml?:static_query
http://localhost:8080/:domain_name/baloons.xml?color=green&size=small

Links for SDB & Rails

Rails helper

Rails MVC

SDB

SDB-proxy

August 12, 2010

Synergy -sharing keyboard & mouse

Here are the instructions I followed.
Two things I had problems with:
- Reverse DNS (make sure your hostname and IP address resolve to each other correctly)
Here is how to check whether your DNS resolves forward and backward correctly:
>> dig +short hostname
>> dig +short -x your_ip_address
- Make sure the hostname you put in the server configuration matches the "Screen name" in your client.

Then you are good to go! :-)

July 26, 2010

Prevent connection drops on an ssh connection

Here are two ways to stop connection drops.

on the server, modify the /etc/ssh/sshd_config file.
add the following:

ClientAliveInterval 30
ClientAliveCountMax 5

and restart sshd:

/etc/init.d/ssh restart

OR

on the client, modify /etc/ssh/ssh_config and add:

ServerAliveInterval 15
ServerAliveCountMax 3

July 13, 2010

colorful ls

I was looking for a way to distinguish files and folders. Colors are turned off in the GNOME terminal I'm using.
And I found a more generic way to override a command's behavior.
Here is how to make your ls colorful:
alias 'ls=ls --color=auto'

July 06, 2010

Can't enable wireless on Linux

I kept getting an error saying "wireless disabled" on my Dell XPS 1330 laptop.
Then I realized it is hardware-disabled...

#rfkill list
>> 3: phy0: Wireless LAN
>> Soft blocked: no
>> Hard blocked: yes

The way to solve this problem is:
#sudo rmmod dell_laptop
and to enable it back:
#sudo modprobe dell_laptop

how to change command line editor

here is how to change the command-line editor in Linux.
set -o vi
set -o emacs

click for more detailed guidance on shell options.

March 26, 2010

How to Increase Heap Space of a Project in Eclipse

Right-click the source code of the class where your main function is. Then select Run As and Run Configurations. Choose the "Arguments" tab and specify how much memory you want to allocate in the VM arguments section, eg: -Xmx128m
here is a screenshot:



Also, you can find descriptions of other parameters on this webpage.