<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-3215128726026366458</id><updated>2012-01-25T11:02:36.296-08:00</updated><category term='inequalities'/><category term='hadoop exception'/><category term='fsck'/><category term='nonRelational DB'/><category term='SQL'/><category term='s3'/><category term='distributed cache'/><category term='HyperTable'/><category term='ec2'/><category term='random_create'/><category term='how to'/><category term='gzip'/><category term='sdb'/><category term='cap theorem'/><category term='java.io.IOException'/><category term='lzo'/><category term='hadoop'/><category term='compression'/><category term='heap space'/><category term='MultipleOutputs'/><category term='MS SQL Server 2008'/><category term='numOfMappers'/><category term='grep'/><category term='HBase'/><category term='lsdsir'/><category term='hg'/><category term='syntaxhighlighter'/><category term='aws'/><category term='shuffleError'/><category term='ulimit'/><category term='sort'/><category term='Cassandra'/><category term='linux'/><category term='java.io.EOFException'/><category term='java'/><category term='rename'/><category term='format'/><category term='bash'/><category term='NoSQL'/><category term='hdfs'/><category term='Key-Value Store'/><category term='bricked kf'/><category term='terminal'/><category term='aws_sdb_proxy'/><category term='hadoop perfomance'/><category term='uuidtools'/><category term='DiskErrorException'/><category term='secondarySort'/><category term='log'/><category term='splitsize'/><category term='kindle fire'/><category term='Scalable Manipulation of Archival Web Graphs'/><category term='namenode'/><category 
term='Partitioner'/><category term='mercurial'/><category term='error'/><title type='text'>yasemin's notes</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://yaseminavcular.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>53</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-1307397953913781158</id><published>2012-01-23T23:01:00.000-08:00</published><updated>2012-01-23T23:17:04.098-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bricked kf'/><category scheme='http://www.blogger.com/atom/ns#' term='kindle fire'/><title type='text'>How to get a bricked kindle fire back to live</title><content type='html'>I've rooted my kindle fire and then started playing with /system/build.prop file to get any app I wanted from android market. However I ended up with a bricked KF - it was stuck at the kindle fire logo.&lt;br /&gt;&lt;br /&gt;I thought "luckily I've backed up the build.prop file, so I can just copy it back". But the process I went through wasn't that easy.. 
It took me a while to figure out how to revert the build.prop file.&lt;br /&gt;&lt;br /&gt;Here are the steps I followed to bring the KF back to life:&lt;br /&gt;&lt;br /&gt;1. First, download the latest &lt;a href="http://developer.android.com/sdk/index.html" target="_blank"&gt;android-sdk tools&lt;/a&gt;. They come with ./adb and ./fastboot, which will be your main tools for accessing the KF.&lt;br /&gt;&lt;br /&gt;&lt;pre class="brush: bash"&gt;# see all options&lt;br /&gt;./adb --help &amp;nbsp;&lt;br /&gt;&lt;br /&gt;# Common commands&lt;br /&gt;./adb kill-server # stops the adb server&lt;br /&gt;./adb devices # lists connected devices&lt;br /&gt;./adb shell &amp;nbsp;# opens a Linux shell on the KF&lt;br /&gt;./adb push from_file_dir_in_pc to_file_dir_in_kf # copies a file from the PC to the KF&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;2. Next, make sure the KF is recognized. &amp;nbsp;Open the KF shell and type su. If the prompt changes to #, great, you have root permissions; move on to step 4. If there is an error, e.g. a segmentation fault, go to step 3.&lt;br /&gt;&lt;br /&gt;3. Download &lt;a href="https://docs.google.com/leaf?id=0B0WXkM8-Uhf9N2NlNzA4OTYtYzFhOC00ZDExLThhOTQtMjBlM2VlMjAyNjg1&amp;hl=en_US"&gt;fastboot&lt;/a&gt;, copy it to the KF and execute it. Then reboot and open the KF shell again. This time you should have root permissions; try su. Refer to &lt;a href="http://forum.xda-developers.com/showthread.php?t=1414832"&gt;this&lt;/a&gt; post for further details.&lt;br /&gt;&lt;pre class="brush: bash"&gt;$ ./adb push fbmode /data/local/tmp&lt;br /&gt;$ ./adb shell chmod 755 /data/local/tmp/fbmode&lt;br /&gt;$ ./adb shell /data/local/tmp/fbmode&lt;br /&gt;$ ./adb reboot&lt;br /&gt;$ ./adb shell&lt;br /&gt;  $ su&lt;/pre&gt;4. At this point, you should have root permissions and can go ahead and revert the changes in build.prop. Copy the stock build.prop into your work directory if it is not already there. 
Download &lt;a href="https://docs.google.com/leaf?id=0B0WXkM8-Uhf9MzZkYzMyN2UtMzZhMS00Y2NlLTkyZTItNjVjYzNkYTA2NmNm&amp;hl=en_US"&gt;here&lt;/a&gt;.&lt;pre class="brush: bash"&gt;$ ./adb push build.prop /system/&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;5. Next, follow instructions &lt;a href="http://forum.xda-developers.com/showthread.php?t=1372207&amp;amp;page=3"&gt;here&lt;/a&gt;. In my case, ./fastboot command never recognized the device and I ended up in &lt;a href="http://attachments.xda-developers.com/attachment.php?attachmentid=831362&amp;amp;stc=1&amp;amp;d=1325009550" target="_blank"&gt;cwm-based recovery screen&lt;/a&gt;. Since KF has only one button, I was only able to pick the first choice in each screen. At the beginning, this option is "install update" and in the next screen it is a "NO". So "install update" option looks for an update.zip file under /sdcard/ and then&amp;nbsp;just installs if the file is there - even though you have to pick "NO" in the confirmation screen.&lt;br /&gt;So to do this, first download the &lt;a href="https://docs.google.com/open?id=0B0WXkM8-Uhf9ZmE3Y2M3MmQtZTQwNS00ZTdhLWJiYzktYzRiMjQzNmMzOWFl" target="_blank"&gt;update.zip&lt;/a&gt; file here. Then just click on the power button on KF twice to install. 
Once installation is completed, reboot the KF.&lt;br /&gt;&lt;pre class="brush: bash"&gt;&lt;br /&gt;# Copy update.zip to KF &lt;br /&gt;$ ./adb push update.zip /sdcard/&lt;br /&gt;# Press the power button twice to install.&lt;br /&gt;# Once installed, reboot the KF.&lt;br /&gt;$ ./adb reboot&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;And it's done, the KF is back!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-1307397953913781158?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/1307397953913781158'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/1307397953913781158'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2012/01/how-to-get-bricked-kindle-fire-back-to.html' title='How to get a bricked kindle fire back to live'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-96465281422939246</id><published>2011-11-28T22:06:00.001-08:00</published><updated>2012-01-23T23:01:52.994-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='kindle fire'/><title type='text'>How to get android app store apps on kindle fire</title><content type='html'>I was not happy to realize that the Kindle Fire does not allow access to the Android Market. 
This is probably a well thought out marketing strategy by Amazon, but if you're tired of being redirected to the "can not find app" message and don't want to root your Kindle, keep reading :)&lt;br /&gt;&lt;br /&gt;The default Kindle Fire browser (Silk) redirects all Android Market links to the Amazon app store. After googling a little, getting another browser and using it to access the Android Market seemed like a smart approach. I tried&amp;nbsp;&lt;a href="http://www.dolphin-browser.com/"&gt;dolphin&lt;/a&gt;&amp;nbsp;-&amp;nbsp;but it failed too, still redirecting to the Amazon app store (btw - give dolphin a try if you haven't yet, it is a pretty cool browser).&amp;nbsp;So I was left with the last option: sideloading apps through third parties such as&amp;nbsp;&lt;a href="http://www.freewarelovers.com/android/apps"&gt;freewarelovers.com&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href="http://www.getjar.com/"&gt;getjar.com&lt;/a&gt;. Just search for the app online, download the apk file, and click on it to install. 
Before installing, make sure&amp;nbsp;"allow applications from unknown sources" option is enabled through Settings-&amp;gt;Device.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-96465281422939246?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/96465281422939246'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/96465281422939246'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/11/how-to-get-android-app-store-on-kindle.html' title='How to get android app store apps on kindle fire'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-8841707044096079689</id><published>2011-11-08T20:18:00.000-08:00</published><updated>2011-11-12T14:19:58.191-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Scalable Manipulation of Archival Web Graphs'/><category scheme='http://www.blogger.com/atom/ns#' term='lsdsir'/><title type='text'>Scalable Manipulation of Archival Web Graphs</title><content type='html'>I was working on archival web graphs till last June. Actually, most of the Hadoop related posts on this blog are things I learned while working on this project.&lt;br /&gt;&lt;br /&gt;We looked at the problem of processing large-scale archival web graphs and generating a simple representation for the raw input graph data. This representation should allow the end user to be able to analyze &amp;amp; query the graph efficiently. 
Also, the representation should be flexible enough that it can be&amp;nbsp;loaded&amp;nbsp;into a database or processed using distributed computing frameworks, e.g. Hadoop.&lt;br /&gt;To achieve these goals, we developed a workflow for archival graph processing within Hadoop. The project is still ongoing, and its current status appeared at the LSDS-IR '11 workshop at the CIKM conference last month. I'd like to share the &lt;a href="http://cis.poly.edu/~yavcular/webgraphs-lsdsir11.pdf"&gt;paper&lt;/a&gt; for those who are interested in further details. The abstract follows:&lt;br /&gt;&lt;blockquote class="tr_bq"&gt;&lt;div class="p1"&gt;&lt;i&gt;In this paper, we study efficient ways to construct, represent and analyze large-scale archival web graphs. We first discuss details of the distributed graph construction algorithm implemented in MapReduce and the design of a space-efficient layered graph representation. While designing this representation, we consider both offline and online algorithms for the graph analysis. The offline algorithms, such as PageRank, can use MapReduce and similar large-scale, distributed frameworks for computation. On the other side, online algorithms can be implemented by tapping into a scalable repository (similar to DEC’s Connectivity Server or Scalable Hyperlink Store by Najork), in order to perform the computations. Moreover, we also consider updating the graph representation with the most recent information available and propose an efficient way to perform updates using MapReduce. We survey various storage options and outline essential API calls for the archival web graph specific real-time access repository. 
Finally, we conclude with a discussion of ideas for interesting archival web graph analysis that can lead us to discover novel patterns for designing state-of-art compression techniques.&lt;/i&gt;&lt;/div&gt;&lt;/blockquote&gt;Also, the source code for the "graph construction algorithm" is open sourced at &lt;a href="https://github.com/yavcular/WebGraphConstruction"&gt;GitHub&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-8841707044096079689?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/8841707044096079689'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/8841707044096079689'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/11/scalable-manipulation-of-archival-web.html' title='Scalable Manipulation of Archival Web Graphs'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-2973698743521286662</id><published>2011-10-09T22:21:00.000-07:00</published><updated>2011-10-09T22:28:58.784-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ec2'/><category scheme='http://www.blogger.com/atom/ns#' term='aws'/><title type='text'>Setting up EC2 Command Line Tools (API Tools)</title><content type='html'>Download the latest version of the EC2 API tools from the AWS &lt;a href="http://aws.amazon.com/developertools/351"&gt;website&lt;/a&gt; and unzip it. All the tools are under the bin directory. Set $EC2_HOME to the unzipped directory; it's also better to add the bin directory to your $PATH. 
Here is how: &lt;br /&gt;&lt;pre class="brush: bash"&gt;export EC2_HOME=/Users/yasemin/My-Apps/ec2-api-tools&lt;br /&gt;export PATH=$PATH:$EC2_HOME/bin&lt;br /&gt;&lt;/pre&gt;Next, set up the credentials for the AWS account. Create &amp;amp; download a private key and the corresponding certificate from the account's "security credentials" &lt;a href="https://aws-portal.amazon.com/gp/aws/developer/account/index.html?ie=UTF8&amp;amp;action=access-key"&gt;page&lt;/a&gt; and point the EC2 CLI to these files. Here is how:&lt;br /&gt;&lt;pre class="brush: bash"&gt;export EC2_PRIVATE_KEY=/Users/yasemin/My-Apps/ec2-api-tools/credentilas/pk-HKZYKTAIG2ECMXYIBH3HXV4ZBZQ55CLO.pem &lt;br /&gt;export EC2_CERT=/Users/yasemin/My-Apps/ec2-api-tools/credentilas/cert-HKZYKTAIG2ECMXYIBH3HXV4ZBZQ55CLO.pem  &lt;br /&gt;&lt;/pre&gt;That is all - you should be good to go!&lt;br /&gt;Notes:&lt;br /&gt;- If the export commands are appended to the ~/.bashrc file, they will be executed automatically with every new bash session - which is nice to have.&lt;br /&gt;- For detailed setup instructions, please see the &lt;a href="http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html?SettingUp_CommandLine.html"&gt;official docs&lt;/a&gt;.&lt;br /&gt;- Make sure $JAVA_HOME is also set. Use '$ which java' to find out where java currently lives. 
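&lt;br /&gt;A quick way to confirm the setup is to check that each variable named above is set. The sketch below is only an illustration: check_ec2_env is a hypothetical helper name, not part of the EC2 tools, and it only checks that the variables are non-empty, not that the files they point to exist.&lt;br /&gt;

```shell
# Hypothetical helper (not part of the EC2 API tools): list every
# variable from this post that is unset or empty in the current shell.
check_ec2_env() {
    missing=0
    for name in EC2_HOME EC2_PRIVATE_KEY EC2_CERT JAVA_HOME; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    # Exit status 0 means everything is set, 1 means something is missing.
    return "$missing"
}
```

&lt;br /&gt;Running check_ec2_env in a new terminal prints nothing and returns 0 when all four variables are set.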
&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-2973698743521286662?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/2973698743521286662'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/2973698743521286662'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/10/setting-up-ec2-command-line-tools.html' title='Setting up EC2 Command Line Tools (API Tools)'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-6004824783123440362</id><published>2011-09-08T19:06:00.000-07:00</published><updated>2011-11-29T15:14:04.963-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='s3'/><category scheme='http://www.blogger.com/atom/ns#' term='aws'/><title type='text'>S3 Access</title><content type='html'>Two ways to access your S3 buckets: &lt;a href="http://jets3t.s3.amazonaws.com/downloads.html"&gt;jets3t&lt;/a&gt; and &lt;a href="http://s3tools.org/s3cmd"&gt;s3cmd&lt;/a&gt;. JetS3t provides a UI and s3cmd is command-line only.&lt;br /&gt;Here is how to get a file using s3cmd.&lt;br /&gt;First configure the account: &lt;br /&gt;&lt;pre class="brush: bash"&gt;s3cmd --configure&lt;br /&gt;&lt;/pre&gt;Then, you can use s3cmd to access your buckets. 
For example, to download a folder from S3:&lt;br /&gt;&lt;pre class="brush: bash"&gt;s3cmd get --recursive s3://bucket_name/object_name to_local_file&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-6004824783123440362?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/6004824783123440362'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/6004824783123440362'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/09/s3-access.html' title='S3 Access'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-3873720578861193045</id><published>2011-09-08T14:25:00.000-07:00</published><updated>2011-09-08T18:58:47.734-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='terminal'/><category scheme='http://www.blogger.com/atom/ns#' term='SQL'/><title type='text'>Mysql command line basics</title><content type='html'>&lt;pre class="brush: bash"&gt;show databases;&lt;br /&gt;use db_name; &lt;br /&gt;show tables; &lt;br /&gt;describe table_name;&lt;br /&gt;&lt;/pre&gt;These will give you enough info to run your real SQL command.&lt;br /&gt;The other way is to use a UI app, e.g. 
&lt;a href="http://dev.mysql.com/doc/query-browser/en/"&gt;MySQL Query Browser&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-3873720578861193045?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/3873720578861193045'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/3873720578861193045'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/09/mysql-command-line-usage.html' title='Mysql command line basics'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-8942209965419207621</id><published>2011-09-02T14:01:00.000-07:00</published><updated>2011-09-09T23:57:53.864-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bash'/><title type='text'>bash alias and colorful ls</title><content type='html'>Everything below should go in ~/.bashrc&lt;pre class="brush: bash"&gt;&lt;br /&gt;alias la='ls -la'&lt;br /&gt;alias gopen='gnome-open'&lt;br /&gt;alias gterminal='gnome-terminal'&lt;br /&gt;alias ls='ls --color'&lt;br /&gt;&lt;br /&gt;PS1="\`if [ \$? = 0 ]; then echo \e[33\;40m\\\^\\\_\\\^\e[0m; else echo \e[36\;40m\\\-\e[0m\\\_\e[36\;40m\\\-\e[0m; fi\` \w&gt; "&lt;br /&gt;&lt;/pre&gt;NOTE: Make sure the aliases come after PS1. PS1 sets the custom prompt. 
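&lt;br /&gt;The same idea can be written more readably with PROMPT_COMMAND, which bash runs before it draws each prompt. This is a colorless sketch under that assumption, not a drop-in replacement for the PS1 line above; prompt_face and set_prompt are hypothetical names.&lt;br /&gt;

```shell
# Map an exit status to the face shown in the prompt.
prompt_face() {
    if [ "$1" -eq 0 ]; then
        printf '%s' '^_^'
    else
        printf '%s' '-_-'
    fi
}

# Save $? first: it still holds the status of the command the user just ran.
set_prompt() {
    last_status=$?
    PS1="$(prompt_face "$last_status") \\w> "
}

# bash calls set_prompt before printing every prompt.
PROMPT_COMMAND=set_prompt
```

&lt;br /&gt;Keeping the face logic in its own function makes it easy to add the color escapes back later without touching the rest of the prompt.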
&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-8942209965419207621?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/8942209965419207621'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/8942209965419207621'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/09/bash-alias-and-colorful-ls.html' title='bash alias and colorful ls'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-1103125070423931880</id><published>2011-06-20T19:03:00.000-07:00</published><updated>2011-08-12T17:13:19.723-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='how to'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>How to pass job specific parameters in Hadoop</title><content type='html'>Say there is a parameter your mapper or reducer needs, and it is desirable to get this parameter from the user at the beginning of the job submission. 
Here is how to use "Configuration" to let the user set the parameter:&lt;br /&gt;&lt;br /&gt;&lt;pre class="brush: java"&gt;public class GenericReplace {&lt;br /&gt;&lt;br /&gt;   // name under which the parameter is stored in the job Configuration&lt;br /&gt;   public static final String IS_KEY_FIRST = "IsKeyFirstInMapFile";&lt;br /&gt;&lt;br /&gt;   // the four type parameters are assumed here&lt;br /&gt;   public static class GenerateLinks extends Mapper&lt;Text, Text, Text, Text&gt; {&lt;br /&gt;&lt;br /&gt;      public void map(Text key, Text value, Context context)  {&lt;br /&gt;         // read the parameter back inside the task, defaulting to true&lt;br /&gt;         if (context.getConfiguration().getBoolean(IS_KEY_FIRST, true)) {&lt;br /&gt;              //do this .. &lt;br /&gt;         }&lt;br /&gt;         else{&lt;br /&gt;              //do that .. &lt;br /&gt;         }&lt;br /&gt;      }&lt;br /&gt;   }&lt;br /&gt;&lt;br /&gt;   public static void main(String[] args) throws Exception {&lt;br /&gt;&lt;br /&gt;   Configuration conf = new Configuration();&lt;br /&gt;   GenericReplace.graphPath = args[0];&lt;br /&gt;   GenericReplace.outputPath = args[1];&lt;br /&gt;   // parseBoolean parses the argument itself; Boolean.getBoolean would look up a system property&lt;br /&gt;   conf.setBoolean(IS_KEY_FIRST , Boolean.parseBoolean(args[3]));&lt;br /&gt;   Job job = Job.getInstance(new Cluster(conf), conf);&lt;br /&gt;   ...
&lt;br /&gt;   }&lt;br /&gt;}&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-1103125070423931880?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/1103125070423931880'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/1103125070423931880'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/06/how-to-pass-job-specific-parameters-in.html' title='How to pass job specific parameters in Hadoop'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-4880146729767472247</id><published>2011-06-20T10:10:00.000-07:00</published><updated>2011-07-03T23:56:40.534-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hdfs'/><category scheme='http://www.blogger.com/atom/ns#' term='how to'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Ways to write &amp; read HDFS files</title><content type='html'>- Output Stream &lt;br /&gt;&lt;pre class="brush: java"&gt;FSDataOutputStream dos = fs.create(new Path("/user/tmp"), true); &lt;br /&gt;dos.writeInt(counter); &lt;br /&gt;dos.close();&lt;/pre&gt;&lt;br /&gt;- Buffered Writer/Reader&lt;br /&gt;&lt;pre class="brush: java"&gt;//Writer&lt;br /&gt;BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(fs.create(new Path("/user/tmp"), true)));&lt;br /&gt;bw.write(counter.toString());&lt;br /&gt;bw.close();&lt;br /&gt;&lt;br /&gt;//Reader&lt;br /&gt;DataInputStream d = 
new DataInputStream(fs.open(new Path(inFile)));&lt;br /&gt;BufferedReader reader = new BufferedReader(new InputStreamReader(d));&lt;br /&gt;while ((line = reader.readLine()) != null){&lt;br /&gt;...&lt;br /&gt;}&lt;br /&gt;reader.close();&lt;br /&gt;  &lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;- SequenceFile Reader and Writer (I think most preferable way for Hadoop jobs):&lt;br /&gt;&lt;pre class="brush: java"&gt;//writer&lt;br /&gt;SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, new Path(pathForCounters, context.getTaskAttemptID().toString()), Text.class, Text.class);&lt;br /&gt;   writer.append(new Text(firtUrl.toString()+"__"+ context.getTaskAttemptID().getTaskID().toString()), new Text(counter+""));&lt;br /&gt;   writer.close(); &lt;br /&gt;&lt;br /&gt;//reader&lt;br /&gt;SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(makeUUrlFileOffsetsPathName(FileInputFormat.getInputPaths(context)[0].toString())),  conf);&lt;br /&gt;   while (reader.next(key, val)){&lt;br /&gt;    offsets.put(key.toString(), Integer.parseInt(val.toString()));&lt;br /&gt;   }&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-4880146729767472247?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/4880146729767472247'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/4880146729767472247'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/06/ways-to-write-read-hdfs-files.html' title='Ways to write &amp; read HDFS files'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' 
height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-5326007764271492494</id><published>2011-06-19T13:29:00.001-07:00</published><updated>2011-06-21T00:25:11.456-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='how to'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Hadoop - how to get job output path using context</title><content type='html'>&lt;pre class="brush: java"&gt;FileOutputFormat.getOutputPath(context)&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-5326007764271492494?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/5326007764271492494'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/5326007764271492494'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/06/hadoop-how-to-get-output-file-path.html' title='Hadoop - how to get job output path using context'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-4855139865599685674</id><published>2011-06-14T14:11:00.000-07:00</published><updated>2011-06-21T00:25:38.045-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='how to'/><category scheme='http://www.blogger.com/atom/ns#' term='Partitioner'/><category scheme='http://www.blogger.com/atom/ns#' term='sort'/><category scheme='http://www.blogger.com/atom/ns#' 
term='secondarySort'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>How to ensure each key in Reducer has sorted iterator ?</title><content type='html'>Three properties control how values are partitioned and sorted for the reducer's consumption. As mentioned at &lt;a href="http://www.riccomini.name/Topics/DistributedComputing/Hadoop/SortByValue/"&gt;riccomini's blog&lt;/a&gt;, &lt;a href="http://markmail.org/message/7gonm3kiasyh2xnf#query:setOutputKeyComparatorClass+page:3+mid:esn3lgzyx3ag26cy+state:results"&gt;Owen O'Malley&lt;/a&gt; explains them with a very simple &amp; nice example. By default, intermediate pairs are partitioned using the key. To change this behavior, a custom &lt;code&gt; Partitioner &lt;/code&gt; can be defined. &lt;br /&gt;&lt;br /&gt;Once we ensure that pairs belonging to the same partition are sent to the same reducer, two functions take care of the ordering and grouping of keys in each partition/reducer.  &lt;br /&gt;&lt;code&gt;setOutputKeyComparatorClass&lt;/code&gt; defines the sort order of the keys and &lt;code&gt;setOutputValueGroupingComparator&lt;/code&gt; defines the groups, i.e. which pairs are grouped together and processed in a single reduce call. The order of values in the reducer's iterator can be set using a combination of these two. 
&lt;br /&gt;&lt;br /&gt;&lt;pre class="brush: java"&gt;public static class RemoveIdentifierAndPartition extends Partitioner&lt; Text, Writable &gt; {&lt;br /&gt;&lt;br /&gt;  @Override&lt;br /&gt;  public int getPartition(Text key, Writable value, int numReduceTasks) {&lt;br /&gt;   return (removeKeyIdentifier(key.toString()).hashCode() &amp; Integer.MAX_VALUE) % numReduceTasks;&lt;br /&gt;  }&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; public static final class SortReducerByValuesValueGroupingComparator implements RawComparator&lt; Text &gt;  {&lt;br /&gt;     private static Text.Comparator NODE_COMPARATOR = new Text.Comparator();&lt;br /&gt;&lt;br /&gt;     @Override&lt;br /&gt;     public int compare(Text e1, Text e2) {&lt;br /&gt;         return e1.compareTo(e2);&lt;br /&gt;     }&lt;br /&gt;&lt;br /&gt;     @Override&lt;br /&gt;     public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {&lt;br /&gt;&lt;br /&gt;         // skip last 2 bytes ( space 1 / space 2)&lt;br /&gt;         &lt;br /&gt;         int skip = 2;&lt;br /&gt;         int stringsize1 = 0;&lt;br /&gt;         int stringsize2 = 0;&lt;br /&gt;&lt;br /&gt;         // compare the byte array of Node first&lt;br /&gt;         return NODE_COMPARATOR.compare(b1, s1 , l1-skip, b2, s2 , l2-skip);&lt;br /&gt;     }&lt;br /&gt; }&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-4855139865599685674?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/4855139865599685674'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/4855139865599685674'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/06/how-to-ensure-each-key-in-reducer-has.html' title='How to ensure each 
key in Reducer has sorted iterator ?'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-4251061688085361472</id><published>2011-06-14T13:31:00.000-07:00</published><updated>2011-06-21T00:26:03.497-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='splitsize'/><category scheme='http://www.blogger.com/atom/ns#' term='how to'/><category scheme='http://www.blogger.com/atom/ns#' term='numOfMappers'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>How to set number of Maps with Hadoop</title><content type='html'>Setting the number of map tasks is not as simple as setting the number of reduce tasks. The user cannot explicitly give a fixed number; FileInputFormat decides how to split the input files using several parameters. &lt;br /&gt;&lt;br /&gt;The first one is &lt;code&gt;isSplitable&lt;/code&gt;, which determines whether a file is splittable or not. &lt;br /&gt;The next three variables, &lt;code&gt;mapred.min.split.size&lt;/code&gt;, &lt;code&gt;mapred.max.split.size&lt;/code&gt; and &lt;code&gt;dfs.block.size&lt;/code&gt;, determine the actual split size used if the input is splittable.&amp;nbsp;By default, the min split size is 0, the max split size is &lt;code&gt;Long.MAX_VALUE&lt;/code&gt; and the block size is 64MB. For the actual split size, minSplitSize sets the lower bound, and the smaller of blockSize and maxSplitSize sets the upper bound. Here is the function that calculates it:&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;max(minsplitsize, min(maxsplitsize, blocksize))&lt;/code&gt;&lt;/blockquote&gt;Note: compressed input files (e.g. 
gzip) are not splittable, although patches &lt;a href="https://issues.apache.org/jira/browse/HADOOP-7076"&gt;*&lt;/a&gt; &lt;a href="https://issues.apache.org/jira/browse/MAPREDUCE-491"&gt;*&lt;/a&gt; are available.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-4251061688085361472?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/4251061688085361472'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/4251061688085361472'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/06/how-to-set-number-of-maps-with-hadoop.html' title='How to set number of Maps with Hadoop'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-2025961872925404921</id><published>2011-04-24T12:16:00.000-07:00</published><updated>2011-05-01T22:43:32.198-07:00</updated><title type='text'>Java BitSet Size</title><content type='html'>Java &lt;a href="http://download.oracle.com/javase/1.4.2/docs/api/java/util/BitSet.html"&gt;BitSet&lt;/a&gt; can be described as a bit array: it holds a certain number of bits.&lt;br /&gt;We are used to a constructor with a size argument returning an object of exactly that size. However, the &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;BitSet(int nbits)&amp;nbsp;&lt;/span&gt;constructor does not work this way. 
Here is the description from JavaDocs: &lt;br /&gt;&lt;blockquote&gt;Creates a bit set whose initial size is large enough to explicitly represent bits with indices in the range 0 through nbits-1.&lt;/blockquote&gt;Indeed, the size of the object is equal to or bigger than the specified value. &lt;br /&gt;&lt;pre class="brush: java"&gt;BitSet set = new BitSet(1);&lt;br /&gt;System.out.println(set.size()); //64&lt;br /&gt;set = new BitSet(10);&lt;br /&gt;System.out.println(set.size()); //64&lt;br /&gt;set = new BitSet(65);&lt;br /&gt;System.out.println(set.size()); //128&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;BitSet stores its bits in 64-bit words, so the constructor rounds the size up to the next multiple of 64.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-2025961872925404921?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/2025961872925404921'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/2025961872925404921'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/04/java-bitset-size.html' title='Java BitSet Size'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-7399712688500992983</id><published>2011-04-21T18:13:00.000-07:00</published><updated>2011-04-21T18:18:58.521-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hadoop exception'/><category scheme='http://www.blogger.com/atom/ns#' term='format'/><category scheme='http://www.blogger.com/atom/ns#' 
term='java.io.IOException'/><category scheme='http://www.blogger.com/atom/ns#' term='namenode'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Hadoop - Incompatible namespaceIDs Error</title><content type='html'>After formatting the namenode, restarting Hadoop fails - more specifically, the datanodes do not start and report an Incompatible namespaceIDs error. &lt;br /&gt;&lt;pre class="brush : bash;"&gt;bin/hadoop namenode -format&lt;br /&gt;..&lt;br /&gt;bin/start-dfs.sh&lt;br /&gt;..&lt;br /&gt;ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:&lt;br /&gt;  java.io.IOException: Incompatible namespaceIDs in /hadoop21/hdfs/datadir: &lt;br /&gt;  namenode namespaceID = 515704843; datanode namespaceID = 572408927&lt;br /&gt;  ..&lt;br /&gt;&lt;/pre&gt;Why? - the datanodes still have the old namespaceID after the namenode is formatted. &lt;br /&gt;Solution? - editing the &amp;lt;hdfs-data-path&amp;gt;/datadir/current/VERSION file and replacing the datanode's namespaceID with the namenode's new one (which is 515704843 in this example) solves the problem. Make sure to change it on every datanode in the cluster. &lt;br /&gt;&lt;br /&gt;WARNING: most probably you will lose the data in HDFS. Even though it is not deleted, it is not accessible with the new namespaceID.&lt;br /&gt;&lt;br /&gt;To avoid such an annoying case, be careful before formatting. 
Take a look at &lt;a href="http://wiki.apache.org/hadoop/GettingStartedWithHadoop"&gt;this&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-7399712688500992983?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/7399712688500992983'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/7399712688500992983'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/04/hadoop-incompatible-namespaceids-error.html' title='Hadoop - Incompatible namespaceIDs Error'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-1139938305967256331</id><published>2011-04-19T13:08:00.000-07:00</published><updated>2011-06-21T00:27:08.354-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MultipleOutputs'/><category scheme='http://www.blogger.com/atom/ns#' term='how to'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Hadoop MultipleOutputs</title><content type='html'>I want to generate several types of output files. For example, I have a huge link graph which includes timestamp and outlink information, and I want to put these two kinds of information into separate files. 
Here is how to use MultipleOutputs for this purpose:&lt;br /&gt;&lt;pre class="brush: java;"&gt;public static class FinalLayersReducer extends Reducer&amp;lt;IntWritable, Text, WritableComparable,Writable&amp;gt; &lt;br /&gt;{&lt;br /&gt;     // writer for the named outputs; must be closed in cleanup()&lt;br /&gt;     private MultipleOutputs&amp;lt;WritableComparable, Writable&amp;gt; mos;&lt;br /&gt;&lt;br /&gt;     public void setup(Context context) &lt;br /&gt;     {&lt;br /&gt;          mos = new MultipleOutputs&amp;lt;WritableComparable, Writable&amp;gt;(context);&lt;br /&gt;     }&lt;br /&gt;  &lt;br /&gt;     public void reduce(IntWritable key, Iterable&amp;lt;Text&amp;gt; values, Context context) throws IOException, InterruptedException {&lt;br /&gt;          for ( Text val : values) {&lt;br /&gt;          // some sort of a computation ..&lt;br /&gt;          }&lt;br /&gt;          // outlink_text and timestamp_text are produced by the computation above&lt;br /&gt;          mos.write("outlink", key, outlink_text);&lt;br /&gt;          mos.write("timestamp", key, timestamp_text);&lt;br /&gt;     }&lt;br /&gt;&lt;br /&gt;     protected void cleanup(Context context) throws IOException, InterruptedException {&lt;br /&gt;          mos.close();&lt;br /&gt;     }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;public static void main(String[] args) throws Exception {&lt;br /&gt;&lt;br /&gt;     Job job = new Job(conf, "prepare final layer files");&lt;br /&gt;     &lt;br /&gt;     // other job settings ..&lt;br /&gt;&lt;br /&gt;     MultipleOutputs.addNamedOutput(job, "outlink", TextOutputFormat.class , IntWritable.class, Text.class);&lt;br /&gt;     MultipleOutputs.addNamedOutput(job, "timestamp", TextOutputFormat.class , IntWritable.class, Text.class);&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;If you face zero-sized output files, lines in the 2 separate outputs that do not match when they are supposed to, or output files that cannot be unzipped, these are signs that you forgot to close() the MultipleOutputs object at the end - in the cleanup() function.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-1139938305967256331?l=yaseminavcular.blogspot.com' alt='' 
/&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/1139938305967256331'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/1139938305967256331'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/04/hadoop-multipleoutputformat.html' title='Hadoop MultipleOutputs'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-3332897446544973196</id><published>2011-04-18T20:49:00.000-07:00</published><updated>2011-06-21T00:27:44.991-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gzip'/><category scheme='http://www.blogger.com/atom/ns#' term='lzo'/><category scheme='http://www.blogger.com/atom/ns#' term='how to'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><category scheme='http://www.blogger.com/atom/ns#' term='compression'/><title type='text'>Hadoop Intermediate Data Compression</title><content type='html'>To enable intermediate data compression, set the corresponding variables in mapred-site.xml.&lt;br /&gt;&lt;pre class="brush: xml;"&gt;&amp;lt;!-- mapred-site.xml --&amp;gt;   &lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;    &amp;lt;name&amp;gt;mapreduce.map.output.compress&amp;lt;/name&amp;gt; &lt;br /&gt;    &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt; &lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;    &amp;lt;name&amp;gt;mapreduce.map.output.compress.codec&amp;lt;/name&amp;gt;&lt;br /&gt;    &amp;lt;value&amp;gt;org.apache.hadoop.io.compress.GzipCodec&amp;lt;/value&amp;gt;&lt;br 
/&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Setting up LZO compression is a bit tricky. First of all, the LZO package should be installed on all nodes. I built &lt;a href="https://github.com/toddlipcon/hadoop-lzo-packager"&gt;this&lt;/a&gt; package and followed the instructions &lt;a href="https://github.com/kevinweil/hadoop-lzo"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;If you have difficulty while building,&amp;nbsp;e.g. "&lt;span class="Apple-style-span" style="font-family: Consolas, 'Bitstream Vera Sans Mono', 'Courier New', Courier, monospace; font-size: 13px; line-height: 14px; white-space: pre;"&gt;BUILD FAILED make sure $JAVA_HOME set correctly." - &lt;/span&gt;then take a look&amp;nbsp;&lt;a href="http://yaseminavcular.blogspot.com/2011/04/lzo-build-problem.html"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;In the end, this is what my config files look like: &lt;br /&gt;&lt;pre class="brush: xml;"&gt;&amp;lt;!-- mapred-site.xml --&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;    &amp;lt;name&amp;gt;mapreduce.map.output.compress&amp;lt;/name&amp;gt; &lt;br /&gt;    &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt; &lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;    &amp;lt;name&amp;gt;mapreduce.map.output.compress.codec&amp;lt;/name&amp;gt;&lt;br /&gt;    &amp;lt;value&amp;gt;com.hadoop.compression.lzo.LzoCodec&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&lt;br /&gt;&amp;lt;!-- core-site.xml --&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;    &amp;lt;name&amp;gt;io.compression.codecs&amp;lt;/name&amp;gt;&lt;br /&gt;    &amp;lt;value&amp;gt;org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;    
&amp;lt;name&amp;gt;io.compression.codec.lzo.class&amp;lt;/name&amp;gt;&lt;br /&gt;    &amp;lt;value&amp;gt;com.hadoop.compression.lzo.LzoCodec&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;a href="http://yaseminavcular.blogspot.com/2011/04/compressing-hadoop-output-file.html"&gt; To compress final output data&lt;/a&gt;, the Job object should be configured to write compressed output before it is executed.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-3332897446544973196?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/3332897446544973196'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/3332897446544973196'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/04/hadoop-intermediate-data-compression.html' title='Hadoop Intermediate Data Compression'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-5241730863387844747</id><published>2011-04-18T20:29:00.000-07:00</published><updated>2011-06-21T10:34:20.525-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gzip'/><category scheme='http://www.blogger.com/atom/ns#' term='lzo'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><category scheme='http://www.blogger.com/atom/ns#' term='compression'/><title type='text'>Compressing Hadoop Output using Gzip and Lzo</title><content type='html'>In most cases, writing output files in 
compressed format is faster, since less data is written. For the overall computation to be faster, the compression algorithm must perform well, so that time is saved despite the extra compression overhead.&lt;br /&gt;&lt;br /&gt;To compress regular output formats with Gzip, use:&lt;br /&gt;&lt;pre class="brush: java;"&gt;job.setOutputFormatClass(TextOutputFormat.class);&lt;br /&gt;TextOutputFormat.setCompressOutput(job, true);&lt;br /&gt;TextOutputFormat.setOutputCompressorClass(job, GzipCodec.class);&lt;br /&gt;...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;For Lzo output compression, download this &lt;a href="https://github.com/kevinweil/hadoop-lzo"&gt;package&lt;/a&gt; by &lt;a href="http://www.twitter.com/kevinweil"&gt;@kevinweil&lt;/a&gt;. Then the following should work: &lt;br /&gt;&lt;pre class="brush: java;"&gt;job.setOutputFormatClass(TextOutputFormat.class);&lt;br /&gt;TextOutputFormat.setCompressOutput(job, true);&lt;br /&gt;TextOutputFormat.setOutputCompressorClass(job, LzoCodec.class);&lt;br /&gt;...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;In terms of space efficiency, Gzip compresses better. However, in terms of time, Lzo is much faster. Also, it is possible to split Lzo files, whereas splittable Gzip is not available.  &lt;br /&gt;Keep in mind that these two techniques will only compress the final outputs of a Hadoop job. 
To compress intermediate data, the parameters in mapred-site.xml should be &lt;a href="http://yaseminavcular.blogspot.com/2011/04/hadoop-intermediate-data-compression.html"&gt;configured&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-5241730863387844747?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/5241730863387844747'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/5241730863387844747'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/04/compressing-hadoop-output-file.html' title='Compressing Hadoop Output using Gzip and Lzo'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-3320579274343249624</id><published>2011-04-11T11:02:00.000-07:00</published><updated>2011-04-11T11:03:03.733-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='lzo'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><category scheme='http://www.blogger.com/atom/ns#' term='compression'/><title type='text'>LZO build problem</title><content type='html'>While I was trying to build the LZO library for Hadoop, the build failed with a "make sure $JAVA_HOME set correctly" message. Here is the full error log: &lt;br /&gt;&lt;pre class="brush: xml;"&gt;....     &lt;br /&gt;   [exec] checking jni.h usability... no&lt;br /&gt;   [exec] checking jni.h presence... 
no&lt;br /&gt;   [exec] checking for jni.h... no&lt;br /&gt;   [exec] configure: error: Native java headers not found. &lt;br /&gt;   Is $JAVA_HOME set correctly?&lt;br /&gt;BUILD FAILED make sure $JAVA_HOME set correctly. &lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This means the build is using an incorrect Java installation. To make sure JAVA_HOME points to the correct one, use &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;apt-file search&lt;/span&gt;, which looks up the packages that provide a given file. &lt;br /&gt;&lt;pre class="brush: bash;"&gt;apt-file search jni.h&lt;br /&gt;&lt;/pre&gt;And then set JAVA_HOME accordingly.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-3320579274343249624?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/3320579274343249624'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/3320579274343249624'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/04/lzo-build-problem.html' title='LZO build problem'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-2458581569890301713</id><published>2011-04-11T10:41:00.000-07:00</published><updated>2011-06-15T19:54:02.602-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hadoop exception'/><category scheme='http://www.blogger.com/atom/ns#' term='ulimit'/><category scheme='http://www.blogger.com/atom/ns#' term='linux'/><category 
scheme='http://www.blogger.com/atom/ns#' term='shuffleError'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Common Hadoop HDFS exceptions with large files</title><content type='html'>With big data in HDFS come many disk problems. First of all, make sure there is at least ~20-30% free space on each node. Here are two other problems I faced recently:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;all datanodes are bad&lt;/span&gt;&lt;br /&gt;This error can be caused by too many open files. The limit is 1024 by default. To increase it, use&lt;br /&gt;&lt;pre class="brush: bash;"&gt;ulimit -n newsize&lt;/pre&gt;For more information &lt;a href="http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/"&gt; click! &lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;error in shuffle in fetcher#k&amp;nbsp;&lt;/span&gt;&lt;br /&gt;This is another problem - here is the full error log:&lt;br /&gt;&lt;pre class="brush: bash;"&gt;2011-04-11 05:59:45,744 WARN org.apache.hadoop.mapred.Child: &lt;br /&gt;Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: &lt;br /&gt;error in shuffle in fetcher#2&lt;br /&gt; at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)&lt;br /&gt; at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)&lt;br /&gt; at org.apache.hadoop.mapred.Child$4.run(Child.java:217)&lt;br /&gt; at java.security.AccessController.doPrivileged(Native Method)&lt;br /&gt; at javax.security.auth.Subject.doAs(Subject.java:416)&lt;br /&gt; at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)&lt;br /&gt; at org.apache.hadoop.mapred.Child.main(Child.java:211)&lt;br /&gt;Caused by: java.lang.OutOfMemoryError: Java heap space&lt;br /&gt; at 
org.apache.hadoop.io.BoundedByteArrayOutputStream.&amp;lt;init&amp;gt;(BoundedByteArrayOutputStream.java:58)&lt;br /&gt; at org.apache.hadoop.io.BoundedByteArrayOutputStream.&amp;lt;init&amp;gt;(BoundedByteArrayOutputStream.java:45)&lt;br /&gt; at org.apache.hadoop.mapreduce.task.reduce.MapOutput.&amp;lt;init&amp;gt;(MapOutput.java:104)&lt;br /&gt; at org.apache.hadoop.mapreduce.task.reduce.MergeManager.unconditionalReserve(MergeManager.java:267)&lt;br /&gt; at org.apache.hadoop.mapreduce.task.reduce.MergeManager.reserve(MergeManager.java:257)&lt;br /&gt; at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:305)&lt;br /&gt; at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:251)&lt;br /&gt; at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:149)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;One way to work around this problem is to make sure there are not too many map tasks for small input files. If possible, you can &lt;code&gt;cat&lt;/code&gt; the input files manually to create bigger chunks, or push Hadoop to combine multiple tiny input files for a single mapper. For more details, take a look &lt;a href="http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html"&gt;here&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;Also, in the Hadoop discussion groups it is mentioned that the default value of the dfs.datanode.max.xcievers parameter, the upper bound on the number of files an HDFS DataNode can serve, is too low and causes &lt;code&gt;ShuffleError&lt;/code&gt;. 
In hdfs-site.xml, I set this value to 2048 and that worked in my case.&lt;br /&gt;&lt;pre class="brush: xml;"&gt;&amp;lt;property&amp;gt;&lt;br /&gt;        &amp;lt;name&amp;gt;dfs.datanode.max.xcievers&amp;lt;/name&amp;gt;&lt;br /&gt;        &amp;lt;value&amp;gt;2048&amp;lt;/value&amp;gt;&lt;br /&gt;  &amp;lt;/property&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Update: The default value of dfs.datanode.max.xcievers was updated in this &lt;a href="https://issues.apache.org/jira/browse/HDFS-1861"&gt;JIRA&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-2458581569890301713?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/2458581569890301713'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/2458581569890301713'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/04/common-hadoop-hdfs-exceptions-with.html' title='Common Hadoop HDFS exceptions with large files'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-555541559540442831</id><published>2011-04-04T14:44:00.000-07:00</published><updated>2011-06-21T00:28:43.832-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='how to'/><category scheme='http://www.blogger.com/atom/ns#' term='syntaxhighlighter'/><title type='text'>how to remove unnecessary scrollbar from syntaxhighlighter code</title><content type='html'>I've been looking for a way to remove the annoying scroll bars in syntaxhighlighted code. 
They were visible in Chrome but not in FF. I came across the solution &lt;a href="https://bitbucket.org/alexg/syntaxhighlighter/issue/177/superfluous-vertical-scrollbars-in-chrome"&gt;here&lt;/a&gt;. Just add the following code snippet at the end of the &amp;lt;head&amp;gt; section.&lt;br /&gt;&lt;br /&gt;&lt;pre class="brush: xml;"&gt;&amp;lt;style type="text/css"&amp;gt;&lt;br /&gt;.syntaxhighlighter { overflow-y: hidden !important; }&lt;br /&gt;&amp;lt;/style&amp;gt;  &lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-555541559540442831?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/555541559540442831'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/555541559540442831'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/04/remove-unnecessary-scrollbar-from.html' title='how to remove unnecessary scrollbar from syntaxhighlighter code'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-4786119836547383724</id><published>2011-04-03T21:05:00.000-07:00</published><updated>2011-06-20T20:34:37.500-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hadoop exception'/><category scheme='http://www.blogger.com/atom/ns#' term='hdfs'/><category scheme='http://www.blogger.com/atom/ns#' term='java.io.EOFException'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>java.io.EOFException with Hadoop</title><content 
type='html'>My code runs smoothly with a smaller dataset; however, whenever I run it with a larger one, it fails with java.io.EOFException. I've been trying to figure out the problem.&lt;br /&gt;&lt;br /&gt;&lt;pre class="brush: bash;"&gt;11/03/31 01:13:55 INFO mapreduce.Job: &lt;br /&gt;  Task Id: attempt_201103301621_0025_m_000634_0, Status : FAILED&lt;br /&gt;java.io.EOFException&lt;br /&gt; at java.io.DataInputStream.readFully(DataInputStream.java:197)&lt;br /&gt; at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:68)&lt;br /&gt; at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:106)&lt;br /&gt; at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1999)&lt;br /&gt; at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2131)&lt;br /&gt; ...&lt;br /&gt; ...&lt;br /&gt; ...&lt;br /&gt; at org.apache.hadoop.mapred.MapTask$&lt;br /&gt;  NewTrackingRecordReader.nextKeyValue(MapTask.java:465)&lt;br /&gt; at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)&lt;br /&gt; at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:90)&lt;br /&gt; at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)&lt;br /&gt; at org.apache.hadoop.mapreduce.lib.input.DelegatingMapper.run&lt;br /&gt;  (Delegatin&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;So, EOFException means something is wrong with your input files. If files are not written and closed correctly, this exception is thrown - the file system thinks there is more to read, but the number of bytes left is actually less than expected.&lt;br /&gt;To solve the problem, dig into the input files and make sure they are created carefully without any corruption. 
Also if &lt;a href="http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html"&gt;MultipleOutputs&lt;/a&gt; is used to prepare input files, make sure it is also closed at the end!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-4786119836547383724?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/4786119836547383724'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/4786119836547383724'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/04/javaioeofexception-with-hadoop.html' title='java.io.EOFException with Hadoop'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-2768959363776459673</id><published>2011-03-29T11:35:00.000-07:00</published><updated>2011-04-10T18:08:34.408-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hdfs'/><category scheme='http://www.blogger.com/atom/ns#' term='fsck'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>WARNING : There are about 1 missing blocks. 
Please check the log or run fsck.</title><content type='html'>&lt;pre class="brush: bash;"&gt;hadoop fsck /&lt;br /&gt;hadoop fsck -delete / &lt;br /&gt;hadoop fsck -move / &lt;/pre&gt;The -move option moves corrupted files under /lost+found.&lt;br /&gt;The -delete option deletes all corrupted files.&lt;br /&gt;For more options: &lt;a href="http://developer.yahoo.com/hadoop/tutorial/module2.html"&gt;http://developer.yahoo.com/hadoop/tutorial/module2.html&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-2768959363776459673?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/2768959363776459673'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/2768959363776459673'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/03/warning-there-are-about-1-missing.html' title='WARNING : There are about 1 missing blocks. 
Please check the log or run fsck.'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-5864370173618021447</id><published>2011-03-26T09:44:00.000-07:00</published><updated>2011-04-08T13:51:33.051-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Cassandra'/><category scheme='http://www.blogger.com/atom/ns#' term='HyperTable'/><category scheme='http://www.blogger.com/atom/ns#' term='cap theorem'/><category scheme='http://www.blogger.com/atom/ns#' term='HBase'/><title type='text'>Usage of CAP Theorem in Today's Distributed Storage Systems</title><content type='html'>&lt;a href="http://en.wikipedia.org/wiki/CAP_theorem"&gt;CAP Theorem&lt;/a&gt; (Eric Brewer) states that a distributed system can provide at most two of three properties - Consistency, Availability, and Partition Tolerance. &lt;br /&gt;&lt;br /&gt;Partition Tolerance is a &lt;i&gt;must&lt;/i&gt; in real world systems since machines fail all the time. Therefore, a distributed system has to pick either Availability or Consistency as the second property. Consistency means "always return the correct value" -  &lt;i&gt;eg: the latest written one&lt;/i&gt;. Availability means "always accept requests" -  &lt;i&gt;eg: read &amp; write&lt;/i&gt;. &lt;br /&gt;Picking one of these does not always mean losing the other totally. If an application benefits more from availability,  &lt;i&gt;eg: shopping cart&lt;/i&gt;, it is better to prioritize availability and resolve consistency issues later (eventually consistent). 
On the other hand, if an application requires consistency,  &lt;i&gt;eg: checkout or backend systems which don't require an instant response to the end user&lt;/i&gt;, it is better to prioritize consistency and give up availability.&lt;br /&gt;&lt;br /&gt;Here are some examples:&lt;br /&gt;- Amazon's Dynamo, LinkedIn's Project Voldemort and Facebook's Cassandra provide high availability, but eventual consistency.  &lt;br /&gt;- On the other side, Google's Bigtable provides strong consistency and gives up high availability. Hypertable and HBase use the Bigtable approach.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-5864370173618021447?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/5864370173618021447'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/5864370173618021447'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/03/usage-of-cap-theorem-in-todays.html' title='Usage of CAP Theorem in Today&apos;s Distributed Storage Systems'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-246044218007356356</id><published>2011-03-26T09:30:00.000-07:00</published><updated>2011-05-05T15:27:45.657-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Key-Value Store'/><category scheme='http://www.blogger.com/atom/ns#' term='NoSQL'/><category scheme='http://www.blogger.com/atom/ns#' term='nonRelational DB'/><title type='text'>Non-Relational 
Databases</title><content type='html'>Non-Relational DBs&lt;br /&gt;&lt;br /&gt;From a data design point of view, there are broadly two design approaches: &lt;br /&gt;- Google's Bigtable-like designs, which are column based, eg: Hypertable and HBase   &lt;br /&gt;- A simpler key/value storage using distributed hash tables (DHTs), eg: &lt;a href="http://project-voldemort.com/"&gt;Project Voldemort&lt;/a&gt;, &lt;a href="http://www.mongodb.org/"&gt;MongoDB&lt;/a&gt;, &lt;a href="http://couchdb.apache.org/"&gt;CouchDB&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Common specs of both designs: &lt;br /&gt;&lt;br /&gt;- no predefined schema, only domains are set (broadly, these are like RDBMS tables)&lt;br /&gt;- entries in domains have a key, and keys have a set of attributes &lt;br /&gt;- no predefined rules for these attributes &amp; no explicit definition of domains&lt;br /&gt;- scalable - nodes can be added and removed easily, able to handle heavy workloads&lt;br /&gt;&lt;br /&gt;The Bigtable paper by Google is summarized &lt;a href="http://the-paper-trail.org/blog/?p=86"&gt;here&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-246044218007356356?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/246044218007356356'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/246044218007356356'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/03/relational-vs-non-relational-databases.html' title='Non-Relational Databases'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-4715321267981504673</id><published>2011-03-25T06:56:00.000-07:00</published><updated>2011-04-08T13:56:19.423-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mercurial'/><category scheme='http://www.blogger.com/atom/ns#' term='rename'/><category scheme='http://www.blogger.com/atom/ns#' term='hg'/><title type='text'>hg - detecting renamed files</title><content type='html'>&lt;pre class="brush: bash;"&gt;hg addremove directory/file --similarity 90&lt;/pre&gt;The number is the required similarity as a percentage (0-100). If you only want to detect files that were renamed without any change, use 100.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-4715321267981504673?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/4715321267981504673'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/4715321267981504673'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/03/hg-detecting-renamed-files.html' title='hg - detecting renamed files'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-5845493345971004537</id><published>2011-03-24T22:00:00.000-07:00</published><updated>2011-04-11T11:09:26.324-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hadoop exception'/><category scheme='http://www.blogger.com/atom/ns#' term='hdfs'/><category 
scheme='http://www.blogger.com/atom/ns#' term='DiskErrorException'/><title type='text'>Hadoop - out of disk space</title><content type='html'>Facing weird errors like &lt;br /&gt;&lt;blockquote&gt;&lt;i&gt;Exception running child : org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out&lt;/i&gt; &lt;/blockquote&gt;&lt;br /&gt;simply means some of the nodes ran out of disk space. To check HDFS status and available storage on each node: &lt;a href="http://master-urls:50070/dfsnodelist.jsp?whatNodes=LIVE"&gt;http://master-urls:50070/dfsnodelist.jsp?whatNodes=LIVE&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-5845493345971004537?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/5845493345971004537'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/5845493345971004537'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/03/hadoop-out-of-disk-space.html' title='Hadoop - out of disk space'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-7707419999662290545</id><published>2011-03-09T09:17:00.000-08:00</published><updated>2011-04-08T14:00:53.497-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gzip'/><category scheme='http://www.blogger.com/atom/ns#' term='grep'/><title type='text'>Grep for zipped file</title><content type='html'>&lt;pre class="brush: bash;"&gt;zgrep 
pattern filename.gz&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-7707419999662290545?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/7707419999662290545'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/7707419999662290545'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/03/grep-for-zipped-file.html' title='Grep for zipped file'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-4467189575212332442</id><published>2011-03-09T08:01:00.000-08:00</published><updated>2011-06-21T00:23:20.865-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Hadoop - MapReduce without reducer to avoid sorting</title><content type='html'>An MR job can be defined with no reducer. In this case, all the mappers write their outputs directly under the specified job output directory. 
So there will be&lt;b&gt; no sorting&lt;/b&gt; and &lt;b&gt;no partitioning&lt;/b&gt;.&lt;br /&gt;Just set the number of reduce tasks to 0.&lt;br /&gt;&lt;pre class = "brush: java;"&gt;job.setNumReduceTasks(0);&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-4467189575212332442?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/4467189575212332442'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/4467189575212332442'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/03/hadoop-mapreduce-without-reducer.html' title='Hadoop - MapReduce without reducer to avoid sorting'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-7663430987734376456</id><published>2011-03-07T22:14:00.000-08:00</published><updated>2011-04-18T20:31:21.803-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hadoop perfomance'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>many small input files</title><content type='html'>&lt;script src="http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js" type="text/javascript"&gt;&lt;/script&gt;&lt;script src="http://alexgorbatchev.com/pub/sh/current/scripts/shAutoloader.js" type="text/javascript"&gt;&lt;/script&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;If the MR input directory consists of many small files (a couple of MBs each), then there will be a separate map task for 
each, and these map tasks will probably last only 1-2 secs. It kills the performance: so much scheduling and/or initialization overhead ..&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;As advised&amp;nbsp;&lt;/span&gt;&lt;a href="http://www.cloudera.com/blog/2009/02/the-small-files-problem/"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;, it is better to combine input files - so there will be fewer map tasks, each with a larger piece of data to process.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Here is how to combine input files.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;(1)&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt; Pick a regular record reader class, like LineRecordReader.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;(2)&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;b&gt; &lt;/b&gt;Define your own record reader class for multi file inputs using an instance of&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt; &lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;(1)&lt;/span&gt;&lt;/i&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt; &lt;/span&gt;&lt;/b&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;(3)&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt; Define your own input format class which extends CombineFileInputFormat and 
returns&amp;nbsp;&lt;/span&gt;&lt;i&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;(2)&lt;/span&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;The trick is that a regular RecordReader uses a FileSplit instance, whereas a record reader to be used with CombineFileInputFormat should use a CombineFileSplit. Here is the code:&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;The name of the regular record reader class I use in this example is&amp;nbsp;&lt;/span&gt;&lt;i&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;RawWebGraphRecordReader&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;. Its basic idea is similar to LineRecordReader.&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;Below is the code for the multi file record reader - &lt;/span&gt;&lt;i&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;MultiFileRawWebGraphRecordReader&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt; and the input format - &lt;/span&gt;&lt;i&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;RawGraphInputFormat&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;br /&gt;&lt;pre class="brush: java; ruler: false; first-line: 10; gutter: false; bloggerMode: true "&gt;public class MultiFileRawWebGraphRecordReader extends &lt;br /&gt;                                       RecordReader &amp;lt; Text, Text &amp;gt; {&lt;br /&gt; private static final Log LOG = LogFactory.getLog(MultiFileRawWebGraphRecordReader.class);&lt;br /&gt;&lt;br /&gt; private CombineFileSplit split;&lt;br /&gt; private TaskAttemptContext context;&lt;br /&gt; private int index;&lt;br /&gt; private 
RecordReader&amp;lt; Text, Text &amp;gt; rr;&lt;br /&gt;&lt;br /&gt; public MultiFileRawWebGraphRecordReader(CombineFileSplit split,&lt;br /&gt;                                         TaskAttemptContext context, &lt;br /&gt;                                         Integer index) throws IOException {&lt;br /&gt;  this.split = split;&lt;br /&gt;  this.context = context;&lt;br /&gt;  this.index = index;&lt;br /&gt;  rr = new RawWebGraphRecordReader();&lt;br /&gt; }&lt;br /&gt; &lt;br /&gt; &lt;br /&gt; public void initialize(InputSplit genericSplit, TaskAttemptContext context)&lt;br /&gt; throws IOException, InterruptedException {&lt;br /&gt;&lt;br /&gt;  this.split = (CombineFileSplit) genericSplit;&lt;br /&gt;  this.context = context;&lt;br /&gt;&lt;br /&gt;  if (null == rr) {&lt;br /&gt;   rr = new RawWebGraphRecordReader();&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  FileSplit fileSplit = new FileSplit(this.split.getPath(index), &lt;br /&gt;                                      this.split.getOffset(index), &lt;br /&gt;                                      this.split.getLength(index), &lt;br /&gt;                                      this.split.getLocations());&lt;br /&gt;  this.rr.initialize(fileSplit, this.context);&lt;br /&gt; }&lt;br /&gt; &lt;br /&gt; public boolean nextKeyValue() throws IOException, InterruptedException {&lt;br /&gt;  return rr.nextKeyValue();&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; @Override&lt;br /&gt; public Text getCurrentKey() throws IOException, InterruptedException {&lt;br /&gt;  return rr.getCurrentKey();&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; @Override&lt;br /&gt; public Text getCurrentValue() throws IOException, InterruptedException {&lt;br /&gt;  return rr.getCurrentValue();&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; /**&lt;br /&gt;  * Get the progress within the split&lt;br /&gt;  * @throws InterruptedException &lt;br /&gt;  * @throws IOException &lt;br /&gt;  */&lt;br /&gt; @Override&lt;br /&gt; public float getProgress() throws 
IOException, InterruptedException {&lt;br /&gt;  return rr.getProgress();&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; public synchronized void close() throws IOException {&lt;br /&gt;  if (rr != null) {&lt;br /&gt;   rr.close();&lt;br /&gt;   rr = null;&lt;br /&gt;  }&lt;br /&gt; }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;pre class="brush: java; ruler: false; first-line: 10; gutter: false; bloggerMode: true "&gt;public static class RawGraphInputFormat extends &lt;br /&gt;                                 CombineFileInputFormat&amp;lt; Text, Text &amp;gt; {&lt;br /&gt;&lt;br /&gt;  @Override&lt;br /&gt;  public RecordReader&amp;lt; Text, Text &amp;gt; &lt;br /&gt;                     createRecordReader(InputSplit split, &lt;br /&gt;                                        TaskAttemptContext context)throws IOException {&lt;br /&gt;   return new CombineFileRecordReader&amp;lt; Text, Text &amp;gt;( &lt;br /&gt;                                 (CombineFileSplit) split, &lt;br /&gt;                                  context, &lt;br /&gt;                                  MultiFileRawWebGraphRecordReader.class);&lt;br /&gt;  }&lt;br /&gt;  &lt;br /&gt;  @Override&lt;br /&gt;  protected boolean isSplitable(JobContext context, Path file) {&lt;br /&gt;   CompressionCodec codec = new CompressionCodecFactory(    &lt;br /&gt;                                context.getConfiguration()).getCodec(file);&lt;br /&gt;   return codec == null;&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt; }&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-7663430987734376456?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/7663430987734376456'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/7663430987734376456'/><link 
rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html' title='many small input files'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-4534316957695248962</id><published>2011-03-07T15:02:00.000-08:00</published><updated>2011-06-20T23:42:27.700-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='heap space'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop exception'/><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Hadoop - Java Heap Space Error</title><content type='html'>&lt;i&gt;"Error: Java Heap space"&lt;/i&gt;&amp;nbsp;means I'm trying to allocate more memory than is available.&lt;br /&gt;How to get around it? (1) better configuration (2) look for unnecessarily allocated objects&lt;br /&gt;&lt;div&gt;&lt;b&gt;Configuration&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;mapred.map.child.java.opts&lt;/code&gt; : heap size for map tasks&lt;br /&gt;&lt;code&gt;mapred.reduce.child.java.opts&lt;/code&gt;: heap size for reduce tasks&lt;br /&gt;&lt;br /&gt;&lt;code&gt;mapred.tasktracker.map.tasks.maximum&lt;/code&gt;: max number of map tasks that can run simultaneously per node&lt;br /&gt;&lt;code&gt;mapred.tasktracker.reduce.tasks.maximum&lt;/code&gt;: max number of reduce tasks that can run simultaneously per node &lt;br /&gt;&lt;br /&gt;Make sure ((num_of_maps * map_heap_size) + (num_of_reducers * reduce_heap_size)) is not larger than the memory available in the system. 
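&lt;br /&gt;&lt;br /&gt;A rough worked example of this check, using made-up numbers (a hypothetical node with 16 GB of RAM, 8 map slots and 4 reduce slots; none of these values come from a real cluster). Heap is only part of each task JVM's footprint, so treat the total as a lower bound:&lt;br /&gt;&lt;pre class="brush: bash;"&gt;# hypothetical per-node settings, for illustration only&lt;br /&gt;# mapred.tasktracker.map.tasks.maximum    = 8&lt;br /&gt;# mapred.tasktracker.reduce.tasks.maximum = 4&lt;br /&gt;# mapred.map.child.java.opts              = -Xmx1024m&lt;br /&gt;# mapred.reduce.child.java.opts           = -Xmx1536m&lt;br /&gt;#&lt;br /&gt;# (8 * 1024 MB) + (4 * 1536 MB) = 8192 MB + 6144 MB = 14336 MB (14 GB)&lt;br /&gt;# 16 GB - 14 GB leaves about 2 GB for the OS, DataNode and TaskTracker&lt;br /&gt;&lt;/pre&gt;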
The max number of mappers &amp; reducers can also be tuned by looking at available system resources.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;io.sort.factor&lt;/code&gt;: max # of streams to merge at once for sorting. Used both in map and reduce. &lt;br /&gt;&lt;br /&gt;&lt;code&gt;io.sort.mb&lt;/code&gt;: map side memory buffer size used while sorting &lt;br /&gt;&lt;code&gt;mapred.job.shuffle.input.buffer.percent&lt;/code&gt;: Reduce side buffer related - The percentage of memory to be allocated from the maximum heap size for storing map outputs during the shuffle&lt;br /&gt;&lt;br /&gt;NOTE: Using &lt;code&gt;fs.inmemory.size.mb&lt;/code&gt; is a very bad idea!&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Unnecessary memory allocation&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Simply look for the &lt;i&gt;new &lt;/i&gt;keyword and make sure there is no&amp;nbsp;unnecessary&amp;nbsp;allocation. A very common tip is using the&lt;i&gt; set()&lt;/i&gt; method of &lt;i&gt;Writable&lt;/i&gt;&amp;nbsp;objects rather than allocating a new object at every map or reduce call.&lt;br /&gt;Here is a simple count example to show the trick:&lt;br /&gt;&lt;br /&gt;&lt;pre class="brush:java"&gt;public static class UrlReducer&lt;br /&gt;    extends Reducer&amp;lt;Text, IntWritable, Text, IntWritable&amp;gt; {&lt;br /&gt;  // allocated once and reused for every key&lt;br /&gt;  IntWritable sumw = new IntWritable();&lt;br /&gt;  int sum;&lt;br /&gt;&lt;br /&gt;  @Override&lt;br /&gt;  public void reduce(Text key, Iterable&amp;lt;IntWritable&amp;gt; vals, Context context)&lt;br /&gt;      throws IOException, InterruptedException {&lt;br /&gt;    sum=0;&lt;br /&gt;    for (IntWritable val : vals) {&lt;br /&gt;      sum += val.get();&lt;br /&gt;    }&lt;br /&gt;    sumw.set(sum);&lt;br /&gt;    context.write(key, sumw);&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;i&gt;&lt;b&gt;note:&amp;nbsp;&lt;/b&gt;&lt;/i&gt;There are a couple more tips&amp;nbsp;&lt;a href="http://int%20sum%20%3D%200/;"&gt;here&lt;/a&gt;&amp;nbsp;for resolving common errors in Hadoop.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/3215128726026366458-4534316957695248962?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/4534316957695248962'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/4534316957695248962'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/03/hadoop-java-heap-space-error.html' title='Hadoop - Java Heap Space Error'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-2805899995289253716</id><published>2011-03-01T07:32:00.000-08:00</published><updated>2011-04-08T14:14:11.126-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='mercurial'/><category scheme='http://www.blogger.com/atom/ns#' term='hg'/><title type='text'>hg - untracking without deleting</title><content type='html'>With hg rm, &lt;b&gt;-A&lt;/b&gt; is shortcut for &lt;b&gt;--after&lt;/b&gt;&amp;nbsp;&amp;nbsp;--&amp;gt; &amp;nbsp;marks as removed only the files that are already deleted from disk&lt;br /&gt;&lt;b&gt;-f &lt;/b&gt;is shortcut for &lt;b&gt;--force&lt;/b&gt;, rm -f &amp;nbsp; &amp;nbsp; --&amp;gt; forces removal even if the file is added or modified&lt;br /&gt;&lt;b&gt;-Af&lt;/b&gt;&amp;nbsp;surprisingly&amp;nbsp;&amp;nbsp;untracks files without deleting them from the working directory.&lt;br /&gt;&lt;b&gt;forget&lt;/b&gt;&amp;nbsp;does the same as rm -Af&lt;br /&gt;&lt;blockquote&gt;hg rm -Af&lt;br /&gt;hg forget&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/3215128726026366458-2805899995289253716?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/2805899995289253716'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/2805899995289253716'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/03/hg-untracking-files-without-deleting.html' title='hg - untracking without deleting'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-5898654480381616989</id><published>2011-02-24T07:58:00.000-08:00</published><updated>2011-02-24T07:58:47.648-08:00</updated><title type='text'>Java Profiler</title><content type='html'>HProf is a simple CPU and heap profiling tool. Hadoop's task profiling also uses it. 
Here is the&amp;nbsp;&lt;a href="http://java.sun.com/developer/technicalArticles/Programming/HPROF.html"&gt;link&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-5898654480381616989?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/5898654480381616989'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/5898654480381616989'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/02/java-profiler.html' title='Java Profiler'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-3347277702072534280</id><published>2011-02-15T12:46:00.000-08:00</published><updated>2011-02-15T12:51:00.393-08:00</updated><title type='text'>JVM Garbage Collector (Tuning)</title><content type='html'>&lt;code&gt;java.lang.OutOfMemoryError: GC overhead limit exceeded &lt;span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"&gt;means the garbage collector is taking too much time ( &amp;gt; 98% ) while freeing very little memory ( &amp;lt; 2% of the heap ). I got this exception while working on a large dataset and holding ~1G of data in memory. One way to get over this exception is to use a specific JVM garbage collector called the Concurrent Collector. It avoids pausing the executing program for a long time because of GC. It also takes advantage of multiple CPUs available in the environment. 
It can be enabled via&lt;/span&gt;&lt;/code&gt;&amp;nbsp;&lt;code&gt;-XX:+UseConcMarkSweepGC &lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Tip: For RMI apps - unnecessary RMI garbage collection can be avoided by tuning its execution frequency. By default it runs every 60,000 msec.&lt;br /&gt;&lt;br /&gt;&lt;code&gt; -Dsun.rmi.dgc.client.gcInterval=3600000 &lt;/code&gt;&lt;br /&gt;&lt;code&gt; -Dsun.rmi.dgc.server.gcInterval=3600000 &lt;/code&gt;&lt;br /&gt;&lt;br /&gt;References&lt;br /&gt;- official reference: &lt;a href="http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#par_gc.oom"&gt;[1]&lt;/a&gt;&lt;br /&gt;- nice summary: &amp;nbsp;&lt;a href="http://www.petefreitag.com/articles/gctuning/"&gt;[2]&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-3347277702072534280?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/3347277702072534280'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/3347277702072534280'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2011/02/jvm-garbage-collector-tuning.html' title='JVM Garbage Collector (Tuning)'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-8335140525299424869</id><published>2010-12-03T14:44:00.000-08:00</published><updated>2011-04-11T16:24:27.240-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hadoop exception'/><category scheme='http://www.blogger.com/atom/ns#' 
term='hadoop'/><title type='text'>make sure start-dfs.sh script is being executed on the master !</title><content type='html'>Everything looks fine when start-all.sh is executed, but no NameNode process appears in the jps results. Why? Also, when I check the logs I see the following&amp;nbsp;networking&amp;nbsp;exceptions:&lt;br /&gt;&lt;br /&gt;ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException: Problem binding to hostname/ipaddress:host : Cannot assign requested address&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.ipc.Server.bind(Server.java:190)&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.ipc.Server$Listener.&amp;lt;init&amp;gt;(Server.java:253)&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.ipc.Server.&amp;lt;init&amp;gt;(Server.java:1026)&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;at org.apache.hadoop.ipc.RPC$Server.&amp;lt;init&amp;gt;(RPC.java:488)&lt;br /&gt;&lt;br /&gt;When I try to do an ls on dfs, I get the following:&lt;br /&gt;ipc.Client: Retrying connect to server:&amp;nbsp;&amp;nbsp;hostname/ipaddress:host&lt;br /&gt;...&lt;br /&gt;...&lt;br /&gt;Bad connection to FS. command aborted&lt;br /&gt;&lt;br /&gt;I spent lots of time trying to figure out the "networking" problem: checked whether the port was already in use, IPv4/IPv6 conflicts,&amp;nbsp;etc ..&lt;br /&gt;&lt;br /&gt;In the end, I realized that I was running the start-all.sh script on a random node. When it is executed on the master node, it works just fine! 
simple fix ..&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-8335140525299424869?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/8335140525299424869'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/8335140525299424869'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/12/start-your-hadoop-cluster-on-master.html' title='make sure start-dfs.sh script is being executed on the master !'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-3763871507580287197</id><published>2010-11-24T09:20:00.000-08:00</published><updated>2011-04-10T18:07:23.631-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hadoop exception'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>hadoop -  wrong key class exception</title><content type='html'>Usually happens because of mismatch between Map or Reduce class signature and configuration settings. &lt;br /&gt;&lt;br /&gt;But also be careful about the combiner ! Check if you are using the same class as reducer and combiner. 
If the reducer's input key-value types are not the same as its output key-value types, then it cannot be used as a combiner, because the combiner's output becomes the input on the reducer side!&lt;br /&gt;&lt;br /&gt;Here is an example:&lt;br /&gt;reducer input key val : &amp;lt; IntWritable, IntWritable &amp;gt;&lt;br /&gt;reducer output key val: &amp;lt; Text, Text &amp;gt;&lt;br /&gt;&lt;br /&gt;If this reducer is used as the combiner, then the combiner will output &amp;lt;Text, Text&amp;gt; and the reducer will receive &amp;lt;Text, Text&amp;gt; as input instead of the &amp;lt;IntWritable, IntWritable&amp;gt; it expects - and boom - wrong key class exception !&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-3763871507580287197?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/3763871507580287197'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/3763871507580287197'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/11/hadoop-wrong-key-class-exception.html' title='hadoop -  wrong key class exception'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-8790968202867795504</id><published>2010-11-23T17:20:00.000-08:00</published><updated>2011-04-11T11:08:46.598-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hadoop exception'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>java.lang.InstantiationException hadoop</title><content type='html'>&lt;a 
href="http://download.oracle.com/javase/1.4.2/docs/api/java/lang/InstantiationException.html"&gt;java.lang.InstantiationException&lt;/a&gt; definition:&lt;br /&gt;&lt;blockquote&gt;&lt;i&gt;Thrown when an application tries to create an instance of a class using the newInstance method in class Class, but the specified class object cannot be instantiated because it is an interface or is an abstract class.&lt;/i&gt;&lt;/blockquote&gt;&lt;br /&gt;I get this exception for  setting input reader to &lt;i&gt;FileInputFormat&lt;/i&gt;&lt;br /&gt;FileInputFormat is an abstract class ! &lt;br /&gt;job.setInputFormatClass(&lt;strike&gt;FileInputFormat.class&lt;/strike&gt;)&lt;br /&gt;&lt;br /&gt;Default is &lt;b&gt;TextInputFormat&lt;/b&gt; and it can be used instead.. &lt;br /&gt;job.setInputFormatClass(TextInputFormat.class)&lt;br /&gt;&lt;br /&gt;exception: &lt;br /&gt;&lt;pre class="brush: js"&gt;Exception in thread "main" java.lang.RuntimeException: java.lang.InstantiationException&lt;br /&gt;        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:123)&lt;br /&gt;        at org.apache.hadoop.mapreduce.lib.input.MultipleInputs.getInputFormatMap(MultipleInputs.java:109)&lt;br /&gt;        at org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:58)&lt;br /&gt;        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:401)&lt;br /&gt;        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:418)&lt;br /&gt;        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:338)&lt;br /&gt;        at org.apache.hadoop.mapreduce.Job.submit(Job.java:960)&lt;br /&gt;        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:976)&lt;br /&gt;        at nyu.cs.webgraph.LinkGraphUrlIdReplacement.phase1(LinkGraphUrlIdReplacement.java:326)&lt;br /&gt;        at 
nyu.cs.webgraph.LinkGraphUrlIdReplacement.main(LinkGraphUrlIdReplacement.java:351)&lt;br /&gt;        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)&lt;br /&gt;        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)&lt;br /&gt;        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)&lt;br /&gt;        at java.lang.reflect.Method.invoke(Method.java:616)&lt;br /&gt;        at org.apache.hadoop.util.RunJar.main(RunJar.java:192)&lt;br /&gt;Caused by: java.lang.InstantiationException&lt;br /&gt;        at sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:48)&lt;br /&gt;        at java.lang.reflect.Constructor.newInstance(Constructor.java:532)&lt;br /&gt;        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:121)&lt;br /&gt;        ... 14 more&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-8790968202867795504?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/8790968202867795504'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/8790968202867795504'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/11/javalanginstantiationexception-hadoop.html' title='java.lang.InstantiationException hadoop'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-6469577749292228219</id><published>2010-11-20T19:00:00.000-08:00</published><updated>2010-12-12T09:30:28.848-08:00</updated><title type='text'>Finding a needle in Haystack: Facebook's photo storage</title><content type='html'>Summary of the idea behind Haystack, the project Facebook started using to store pictures&lt;br /&gt;&lt;br /&gt;Total image workload Facebook has:&lt;br /&gt;&lt;ul&gt;&lt;li&gt; 260 billion images  (~20 petabytes) of data&lt;/li&gt;&lt;li&gt; every week 1 billion (~60 terabytes) new photos are uploaded&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Main characteristics of Facebook images:&lt;br /&gt;&lt;ul&gt;&lt;li&gt; read often&lt;/li&gt;&lt;li&gt; written once&lt;/li&gt;&lt;li&gt; no modification&lt;/li&gt;&lt;li&gt; rarely deleted&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Traditional file systems are not fast enough for this workload (too many disk accesses per read), and an external CDN won't be enough in the near future due to the&amp;nbsp;increasing&amp;nbsp;workload, especially for the long tail. 
As a solution, Haystack is designed to  provide;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;High throughput low latency:&lt;br /&gt;&lt;ul&gt;&lt;li&gt; keeps metadata in main memory -at most one disk access per read&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Fault tolerance&lt;/li&gt;&lt;ul&gt;&lt;li&gt;replicas are in different geographical regions&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Cost effective and simple&lt;/li&gt;&lt;ul&gt;&lt;li&gt;comparison to NFS based NAS appliance&lt;/li&gt;&lt;li&gt;each usable terabyte costs ~28% less&lt;/li&gt;&lt;li&gt;~4% more reads per sec&lt;/li&gt;&lt;/ul&gt;&lt;/ol&gt;&lt;br /&gt;Design Previous to Haystack&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/_ANHBPZ8iLCU/TOiHJIidyiI/AAAAAAAAAqg/0Udd6BvOtHc/s1600/Screenshot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="250" src="http://1.bp.blogspot.com/_ANHBPZ8iLCU/TOiHJIidyiI/AAAAAAAAAqg/0Udd6BvOtHc/s320/Screenshot.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/_ANHBPZ8iLCU/TOiHnRK2z6I/AAAAAAAAAqo/kIh_4j-HDGI/s1600/Screenshot-2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://2.bp.blogspot.com/_ANHBPZ8iLCU/TOiHnRK2z6I/AAAAAAAAAqo/kIh_4j-HDGI/s320/Screenshot-2.png" width="318" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;What is learned from NFS-based Design&lt;br /&gt;&lt;ul&gt;&lt;li&gt; more than 10 disk operation to read an image &lt;/li&gt;&lt;li&gt; if directory size is reduced, 3 disk operation to fetch an image&lt;/li&gt;&lt;li&gt; 
caching file names for likely next requests - new kernel function open_by_file_handle &lt;/li&gt;&lt;/ul&gt;Takeaways from the previous design&lt;br /&gt;&lt;ul&gt;&lt;li&gt; Focusing only on caching has limited impact on reducing disk operations for the long tail &lt;/li&gt;&lt;li&gt; CDNs are not effective for the long tail&lt;/li&gt;&lt;li&gt; Would a GoogleFS-like system be useful ? &lt;/li&gt;&lt;li&gt; Lack of correct RAM/disk ratio in the current system &lt;/li&gt;&lt;/ul&gt;Haystack Solution:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;use XFS (extent-based file system)&lt;/li&gt;&lt;ul&gt;&lt;li&gt; reduce metadata size per picture so all metadata can fit into RAM &lt;/li&gt;&lt;li&gt; store multiple photos per file&lt;/li&gt;&lt;li&gt; this gives a very good price/performance point, better than buying more NAS appliances&lt;/li&gt;&lt;li&gt; holding all regular size metadata in RAM would be way too expensive&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;design your own CDN (Haystack Cache)&lt;/li&gt;&lt;ul&gt;&lt;li&gt; uses a distributed hash table: &amp;lt;key, val&amp;gt; = &amp;lt;photo_id, location_in_cache&amp;gt; &lt;/li&gt;&lt;li&gt; if the requested photo cannot be found in the cache, it is fetched from the Haystack store &lt;/li&gt;&lt;li&gt; store multiple photos per file&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/_ANHBPZ8iLCU/TOiJqYutaVI/AAAAAAAAAqw/KVcO5iCuoFM/s1600/Screenshot-4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="318" src="http://1.bp.blogspot.com/_ANHBPZ8iLCU/TOiJqYutaVI/AAAAAAAAAqw/KVcO5iCuoFM/s320/Screenshot-4.png" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;DESIGN DETAILS&lt;br /&gt;needs to be updated .. &lt;br /&gt;&lt;br /&gt;&lt;i&gt;D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel. 
Finding a needle in Haystack: Facebook’s photo storage. In OSDI ’10&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-6469577749292228219?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/6469577749292228219'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/6469577749292228219'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/11/finding-needle-in-haystack-facebooks.html' title='Finding a needle in Haystack: Facebook&apos;s photo storage'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_ANHBPZ8iLCU/TOiHJIidyiI/AAAAAAAAAqg/0Udd6BvOtHc/s72-c/Screenshot.png' height='72' width='72'/></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-5978858286450521036</id><published>2010-11-03T16:38:00.000-07:00</published><updated>2011-06-19T15:28:33.392-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='distributed cache'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Using Distributed Cache in Hadoop</title><content type='html'>Distributed cache allows to share static data among all nodes. In order use this functionality, the data location should be set before MR job starts. &lt;br /&gt;&lt;br /&gt;Here is an example usage for distributed cache. While working on web-graph problem I replace URLs with unique id's. 
If I have the url-id mapping in memory, I can easily replace URLs with their corresponding ids. So here is the sample usage: &lt;br /&gt;&lt;br /&gt;&lt;pre class="brush: java"&gt;public static class ReplacementMapper extends Mapper&amp;lt;Text, Text, Text, Text&amp;gt; {&lt;br /&gt;&lt;br /&gt;    private HashMap&amp;lt;String, String&amp;gt; idmap;&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public void setup(Context context) {&lt;br /&gt;     loadIdUrlMapping(context);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public void map(Text key, Text value, Context context) throws InterruptedException {&lt;br /&gt;        ....&lt;br /&gt;    }&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The id-url mapping is loaded at the beginning of each Map task. The example below simply reads the file out of HDFS and stores the data in a hashmap for quick access. Here is the function:&lt;br /&gt;&lt;pre class="brush: java"&gt;private void loadIdUrlMapping(Context context) {&lt;br /&gt;   &lt;br /&gt; FSDataInputStream in = null;&lt;br /&gt; BufferedReader br = null;&lt;br /&gt; try {&lt;br /&gt;  FileSystem fs = FileSystem.get(context.getConfiguration());&lt;br /&gt;  Path path = new Path(cacheFileLocation);&lt;br /&gt;  in = fs.open(path);&lt;br /&gt;  br  = new BufferedReader(new InputStreamReader(in));&lt;br /&gt; } catch (FileNotFoundException e1) {&lt;br /&gt;  e1.printStackTrace();&lt;br /&gt;  System.out.println("read from distributed cache: file not found!");&lt;br /&gt; } catch (IOException e1) {&lt;br /&gt;  e1.printStackTrace();&lt;br /&gt;  System.out.println("read from distributed cache: IO exception!");&lt;br /&gt; }&lt;br /&gt; try {&lt;br /&gt;  this.idmap = new HashMap&amp;lt;String, String&amp;gt;();&lt;br /&gt;  String line = "";&lt;br /&gt;  while ( (line = br.readLine() )!= null) {&lt;br /&gt;   String[] arr = line.split("\t");&lt;br /&gt;   if (arr.length == 2)&lt;br /&gt;    idmap.put(arr[1], arr[0]);&lt;br /&gt;  }&lt;br /&gt;  in.close();&lt;br 
/&gt; } catch (IOException e1) {&lt;br /&gt;  e1.printStackTrace();&lt;br /&gt;  System.out.println("read from distributed cache: read length and instances");&lt;br /&gt; }&lt;br /&gt;   }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This is one way of accessing shared data among hadoop nodes. Other way of accessing it is through local file system. &lt;a href="http://developer.yahoo.com/hadoop/tutorial/module5.html"&gt;Here&lt;/a&gt; is a great article about how to throw cache data automatically among nodes and access it later.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-5978858286450521036?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/5978858286450521036'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/5978858286450521036'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/11/using-distributed-cache-in-hadoop.html' title='Using Distributed Cache in Hadoop'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-7455555622779381844</id><published>2010-11-02T10:35:00.000-07:00</published><updated>2011-04-10T18:04:20.857-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='linux'/><category scheme='http://www.blogger.com/atom/ns#' term='bash'/><title type='text'>How to manipulate (copy/move/rename etc..) 
most recent n files under a directory</title><content type='html'>To list the most recent n files under the current directory: &lt;br /&gt;&lt;code&gt;&lt;br /&gt;ls --sort=time -r | tail -n 5&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Here 5 is n, the number of files; replace it with the number you need.&lt;br /&gt;&lt;br /&gt;To move, copy, delete or do something else with the result, this line can be fed into the "cp", "mv" or "rm" commands. However, the format is important. The line should be enclosed in backquotes, not single quotes: not the ones near the enter key on your keyboard (&lt;b&gt; ' &lt;/b&gt;), but the ones under the esc key (&lt;b&gt; ` &lt;/b&gt;). Note that this breaks on file names that contain spaces. &lt;br /&gt;So here is the command line for moving the most recent n files from one directory to another:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;mv `ls --sort=time -r | tail -n 5` /home/yasemin/hebelek/&lt;br /&gt;&lt;/code&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-7455555622779381844?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/7455555622779381844'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/7455555622779381844'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/11/how-to-manipulate-copymoverename-etc.html' title='How to manipulate (copy/move/rename etc..) 
most recent n files under a directory'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-2566913558911295182</id><published>2010-10-31T09:10:00.001-07:00</published><updated>2011-04-10T18:05:12.058-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gzip'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><category scheme='http://www.blogger.com/atom/ns#' term='compression'/><title type='text'>Why Hadoop can’t always read properly .gz compressed input files ?</title><content type='html'>Hadoop is supposed to work happily with the .gz input file format by default. &lt;a href="http://books.google.com/books?id=bKPEwR-Pt6EC&amp;amp;q"&gt;[1]&lt;/a&gt; So I ran my MR job with gz compressed input files and boom! It didn’t work. Whenever there is an empty line in the input, Hadoop gets stuck there and doesn't read or recognize the rest of the file (basically readline returns a 0 length string even though there is data). I spent hours trying to figure out the problem. Everything looked fine, the .gz files weren't corrupted or anything, and my code ran fine with the decompressed input... In the end I realized that if I decompress the .gz input files and re-compress them, the size reduces by half! It seems that Hadoop has problems with some variants of .gz compression. I suspect my input files were compressed on a Windows machine, and it looks like some compression applications end up producing a type of .gz file which is incompatible with Hadoop. &lt;br /&gt;&lt;br /&gt;Long story short: if Hadoop doesn’t process your .gz compressed input files, try to decompress and re-compress them with gzip on a Linux machine. 
&lt;br /&gt;&lt;code&gt;gunzip filename.gz&lt;br /&gt;gzip filename&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;here is the script that unzips and zips all the files under given directory one by one (in case you have a huge archive !)&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre class="brush: js"&gt;#!/bin/bash&lt;br /&gt;&lt;br /&gt;dir="aaa"&lt;br /&gt;for f in $( ls $dir &amp;nbsp;); do&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;eval gunzip "$dir/$f"&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;eval gzip "$dir/${f%.gz}"&lt;br /&gt;done&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://books.google.com/books?id=bKPEwR-Pt6EC&amp;amp;q"&gt;[1] Tom White, Hadoop: The Definitive Guide&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-2566913558911295182?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/2566913558911295182'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/2566913558911295182'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/10/why-hadoop-cant-always-read-properly-gz.html' title='Why Hadoop can’t always read properly .gz compressed input files ?'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-8781987249368313938</id><published>2010-10-13T15:06:00.000-07:00</published><updated>2010-10-13T15:27:09.959-07:00</updated><title type='text'>eclipse Galileo CDT plugin "No Repository found" error</title><content type='html'>Seems like people faced similar problems with eclipse versions older than 3.5. However, I had the same problem with the 3.5 Galileo version. There is an easy work-around &lt;a href="http://www.eclipse.org/forums/index.php?t=msg&amp;amp;th=168917&amp;amp;"&gt;solution&lt;/a&gt;: &lt;br /&gt;&lt;br /&gt; - just download the cdt archive and install it manually. &lt;br /&gt; * &lt;a href="http://www.eclipse.org/downloads/download.php?file=/tools/cdt/releases/galileo/dist/cdt-master-6.0.2.zip"&gt;Here&lt;/a&gt; is the page where you can find eclipse-cdt. &lt;br /&gt;    * Then go to Help -&amp;gt; Install software and install it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-8781987249368313938?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/8781987249368313938'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/8781987249368313938'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/10/eclipse-galileo-cdt-plugin-no.html' title='eclipse Galileo CDT plugin &quot;No Repository found&quot; error'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-8224578221158361992</id><published>2010-09-21T10:08:00.000-07:00</published><updated>2010-09-21T10:11:36.349-07:00</updated><title type='text'>directory name include space in Linux</title><content type='html'>I have both Ubuntu and Windows installed on my computer. When I'm on Ubuntu and want to reach some data from Windows, directory names with spaces causes problem.. &lt;br /&gt;Pretty pretty easy way to go around this is using single quotes around the directory in terminal :-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-8224578221158361992?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/8224578221158361992'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/8224578221158361992'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/09/directory-name-include-space-in-linux.html' title='directory name include space in Linux'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-752974906792533574</id><published>2010-08-23T18:41:00.000-07:00</published><updated>2011-04-08T14:12:28.935-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sdb'/><category scheme='http://www.blogger.com/atom/ns#' term='inequalities'/><category scheme='http://www.blogger.com/atom/ns#' term='aws_sdb_proxy'/><title type='text'>Inequalities 
with aws_sdb</title><content type='html'>I'm using &lt;a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1242"&gt;aws_sdb_proxy &lt;/a&gt;for accessing SDB. Since it is using ActiveResource (REST approach), querying with equalities is fairly simple. All the equalities can be passed as a hash with the :params parameter. &lt;br /&gt;&lt;br /&gt;Balloon.find(:all, :params =&amp;gt; {:color=&amp;gt; "green", :size =&amp;gt; "5inch"})&lt;br /&gt;&lt;br /&gt;which resolves to the GET request below:&lt;br /&gt;&lt;br /&gt;http://localhost:8080/Baloons.xml?color=green&amp;amp;size=5inch&lt;br /&gt;&lt;br /&gt;In order to support all the other sql-like queries, sdb_proxy allows users to pass a hard-coded string value. Because ActiveResource is used as the underlying communication mechanism, the :params value has to be a hash even though all we need to pass is a string. aws_sdb_proxy works around this problem by only looking at the first key of the hash in the case of a hard-coded query string. &lt;br /&gt;&lt;br /&gt;Balloon.find(:all, :params =&amp;gt; {" ['size' &amp;gt; '5inch'] sort 'size' " =&amp;gt; nil})&lt;br /&gt;&lt;br /&gt;and this will resolve to the GET request below:&lt;br /&gt;&lt;br /&gt;http://localhost:8080/Baloons/query.xml?:static_query&lt;br /&gt;&lt;br /&gt;Note1: I wrote the query as static_query because I don't have my setup on this machine, but it will add the corresponding URL-encoded values for spaces, quotes and brackets. In the end it will be something like http://localhost:8080/Baloons/query.xml?size**5inch**sort****size**&lt;br /&gt;Note2: sdb_proxy uses site_prefix for the resource, so the object name (Baloon in this case) is not the domain name. This is a whole different point to discuss; I wrote this entry according to my altered version of sdb. 
If you are using sdb_proxy as it is, then your http requests will be similar to following:&lt;br /&gt;http://localhost:8080/:domain_name/baloons/query.xml?:static_query&lt;br /&gt;http://localhost:8080/:domain_name/baloons.xml?color=green&amp;amp;size=small&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-752974906792533574?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/752974906792533574'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/752974906792533574'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/08/inequalities-with-awssdb.html' title='Inequalities with aws_sdb'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-2917005535842617260</id><published>2010-08-23T14:04:00.000-07:00</published><updated>2010-08-23T19:23:25.721-07:00</updated><title type='text'>Links for SDB &amp; Rails</title><content type='html'>&lt;a href="http://paulsturgess.co.uk/articles/show/49-using-helper-methods-in-ruby-on-rails"&gt;Rails helper&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.devarticles.com/c/a/Ruby-on-Rails/Understanding-Action-Views-in-Ruby-on-Rails/3/"&gt;Rails MVC&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://inside.glnetworks.de/2008/01/20/bridging-rails-to-amazon-simpledb-using-activeresource/"&gt;SDB&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a 
href="http://inside.glnetworks.de/2008/01/20/bridging-rails-to-amazon-simpledb-using-activeresource/"&gt;SDB-proxy&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-2917005535842617260?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/2917005535842617260'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/2917005535842617260'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/08/links-for-sdb-rails.html' title='Links for SDB &amp; Rails'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-3574471222529220956</id><published>2010-08-12T06:13:00.000-07:00</published><updated>2010-08-12T08:47:09.245-07:00</updated><title type='text'>Synergy -sharing keyboard &amp; mouse</title><content type='html'>&lt;a href="http://synergy2.sourceforge.net/running.html"&gt;Here&lt;/a&gt; are the instructions I followed. &lt;br /&gt;Two things I had problems with: &lt;br /&gt; - Reverse DNS (make sure your hostname and IP address resolve to each other correctly)&lt;br /&gt;     Here is how to check that your DNS resolves forward and backward correctly:&lt;br /&gt;     &gt;&gt; dig +short hostname&lt;br /&gt;     &gt;&gt; dig +short -x your_ip_address&lt;br /&gt; - Make sure the hostname you put in the server configuration matches the "Screen name" in your client. &lt;br /&gt;&lt;br /&gt;Then you are good to go! 
:-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-3574471222529220956?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/3574471222529220956'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/3574471222529220956'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/08/synergy-sharing-keyboard-mouse.html' title='Synergy -sharing keyboard &amp; mouse'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-1535284083239169439</id><published>2010-07-30T08:47:00.000-07:00</published><updated>2011-04-10T18:07:05.052-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='error'/><category scheme='http://www.blogger.com/atom/ns#' term='random_create'/><category scheme='http://www.blogger.com/atom/ns#' term='uuidtools'/><title type='text'>uuidtools</title><content type='html'>Two ways to call the random_create method with uuidtools:&lt;br /&gt;&lt;br /&gt;UUIDTools::UUID.random_create&lt;br /&gt;UUID.random_create&lt;br /&gt;&lt;br /&gt;Use one or the other if you get one of the following errors:&lt;br /&gt;&lt;br /&gt;NoMethodError: undefined method `random_create' for UUID:Class &lt;br /&gt;NameError: uninitialized constant UUIDTools&lt;br /&gt;&lt;br /&gt;* Don't forget to require 'uuidtools' in both cases.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/3215128726026366458-1535284083239169439?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/1535284083239169439'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/1535284083239169439'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/07/uuidtools.html' title='uuidtools'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-8777969605153730352</id><published>2010-07-26T12:24:00.000-07:00</published><updated>2010-07-26T12:29:31.266-07:00</updated><title type='text'>Prevent connection drops on a ssh connection</title><content type='html'>&lt;a href="http://tinyurl.com/2f9zqdv"&gt;Here&lt;/a&gt; are two ways to stop connection drops. &lt;br /&gt;&lt;br /&gt;On the server, modify the /etc/ssh/sshd_config file. 
&lt;br /&gt;Add the following: &lt;br /&gt;&lt;br /&gt;ClientAliveInterval 30&lt;br /&gt;ClientAliveCountMax 5&lt;br /&gt;&lt;br /&gt;and restart sshd:&lt;br /&gt;&lt;br /&gt;/etc/init.d/ssh restart&lt;br /&gt;&lt;br /&gt;OR&lt;br /&gt;&lt;br /&gt;On the client, modify /etc/ssh/ssh_config and add:&lt;br /&gt;&lt;br /&gt;ServerAliveInterval 15&lt;br /&gt;ServerAliveCountMax 3&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-8777969605153730352?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/8777969605153730352'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/8777969605153730352'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/07/prevent-connection-failure-to-remote.html' title='Prevent connection drops on a ssh connection'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-7821000178891055185</id><published>2010-07-13T11:02:00.000-07:00</published><updated>2010-07-13T11:07:43.378-07:00</updated><title type='text'>colorful ls</title><content type='html'>I was looking for a way to tell files and folders apart. The ls output is not colored in the gnome konsole I'm using. &lt;br /&gt;And I found a more generic way to override a command's behavior. 
&lt;br /&gt;Here is how to make your ls colorful: &lt;br /&gt;alias ls='ls --color=auto'&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-7821000178891055185?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/7821000178891055185'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/7821000178891055185'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/07/colorful-ls.html' title='colorful ls'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-5916723770764395150</id><published>2010-07-06T19:08:00.002-07:00</published><updated>2010-07-06T20:18:49.646-07:00</updated><title type='text'>Can't enable wireless on Linux</title><content type='html'>I kept getting an error saying "wireless disabled" on my Dell XPS 1330 laptop. &lt;br /&gt;Then I realized it was hardware disabled... 
&lt;br /&gt;&lt;br /&gt;#rfkill list &lt;br /&gt;&gt;&gt; 3: phy0: Wireless LAN&lt;br /&gt;&gt;&gt; Soft blocked: no&lt;br /&gt;&gt;&gt; Hard blocked: yes&lt;br /&gt;&lt;br /&gt;The way to solve this problem is to unload the dell_laptop module:&lt;br /&gt;#sudo rmmod dell_laptop&lt;br /&gt;--to load it back:&lt;br /&gt;#sudo modprobe dell_laptop&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-5916723770764395150?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/5916723770764395150'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/5916723770764395150'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/07/cant-enable-wireless-on-linux.html' title='Can&apos;t enable wireless on Linux'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-2111866368046219531</id><published>2010-07-06T10:57:00.000-07:00</published><updated>2010-07-06T11:57:59.287-07:00</updated><title type='text'>how to change command line editor</title><content type='html'>Here is how to change the command-line editing mode of the shell in Linux. 
&lt;br /&gt;set -o vi &lt;br /&gt;set -o emacs &lt;br /&gt;&lt;br /&gt;&lt;a href="http://tinyurl.com/2veyho3"&gt;Click&lt;/a&gt; for more detailed guidance on shell options.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-2111866368046219531?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/2111866368046219531'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/2111866368046219531'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/07/how-to-change-command-line-editor.html' title='how to change command line editor'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-1550949497741437909</id><published>2010-03-26T13:11:00.001-07:00</published><updated>2010-03-26T13:23:07.399-07:00</updated><title type='text'>How to Increase Heap Space of a Project in Eclipse</title><content type='html'>Right-click the source file of the class that contains your main method. Then select Run As and Run Configurations. Choose the "Arguments" tab and specify how much memory you want to allocate in the VM arguments box. 
e.g. -Xmx128m &lt;br /&gt;Here is a screenshot:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_ANHBPZ8iLCU/S60WLmS7-GI/AAAAAAAAAjY/dZnUQ3VUSAc/s1600/EclipseHeapSpace.JPG"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 302px;" src="http://1.bp.blogspot.com/_ANHBPZ8iLCU/S60WLmS7-GI/AAAAAAAAAjY/dZnUQ3VUSAc/s320/EclipseHeapSpace.JPG" border="0" alt="" id="BLOGGER_PHOTO_ID_5453039112329885794" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Also, you can find descriptions of other parameters on &lt;a href="http://www.caucho.com/resin-3.0/performance/jvm-tuning.xtp" target="_blank"&gt;this&lt;/a&gt; webpage.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-1550949497741437909?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/1550949497741437909'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/1550949497741437909'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2010/03/how-to-increase-heap-space-of-project.html' title='How to Increase Heap Space of a Project in Eclipse'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_ANHBPZ8iLCU/S60WLmS7-GI/AAAAAAAAAjY/dZnUQ3VUSAc/s72-c/EclipseHeapSpace.JPG' height='72' 
width='72'/></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-6797879343028434317</id><published>2009-10-28T20:16:00.000-07:00</published><updated>2009-10-28T20:24:10.845-07:00</updated><title type='text'>How to re-activate hibernate for Windows Vista</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_ANHBPZ8iLCU/SukKvGHrSRI/AAAAAAAAAdQ/_7oAIj_BjXU/s1600-h/hibernate.png"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 60px; height: 58px;" src="http://3.bp.blogspot.com/_ANHBPZ8iLCU/SukKvGHrSRI/AAAAAAAAAdQ/_7oAIj_BjXU/s320/hibernate.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5397857432593189138" /&gt;&lt;/a&gt;The hibernate option had been missing on my laptop for a while, and I found &lt;a href="http://www.askvg.com/how-to-re-enable-missing-hibernate-option-in-windows-vista/"&gt;this webpage&lt;/a&gt; very useful. &lt;br /&gt;Basically, it says to run the Command Prompt as administrator&lt;br /&gt;and then type the following command (the second line is the short form of the first):&lt;br /&gt;powercfg /hibernate on&lt;br /&gt;powercfg -h on&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-6797879343028434317?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/6797879343028434317'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/6797879343028434317'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2009/10/how-to-re-activate-hiberbate-for.html' title='How to re-activate hibernate for Windows Vista'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_ANHBPZ8iLCU/SukKvGHrSRI/AAAAAAAAAdQ/_7oAIj_BjXU/s72-c/hibernate.png' height='72' width='72'/></entry><entry><id>tag:blogger.com,1999:blog-3215128726026366458.post-3634034768149725033</id><published>2009-10-21T17:52:00.000-07:00</published><updated>2011-04-23T15:28:45.167-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MS SQL Server 2008'/><category scheme='http://www.blogger.com/atom/ns#' term='log'/><category scheme='http://www.blogger.com/atom/ns#' term='SQL'/><title type='text'>How to delete .log file in SQL Server 2008</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_ANHBPZ8iLCU/St_VEoHY3aI/AAAAAAAAAb4/Egdn8XCjafo/s1600-h/check-and-cross-icons.jpg"&gt;&lt;img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 100px; height: 106px;" src="http://4.bp.blogspot.com/_ANHBPZ8iLCU/St_VEoHY3aI/AAAAAAAAAb4/Egdn8XCjafo/s320/check-and-cross-icons.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5395265154077744546" /&gt;&lt;/a&gt;I've been struggling to delete a 30GB log file for my 8GB of data. &lt;br /&gt;I figured out that the easiest way to do this is to detach the database and then attach it again without the log file. However, this simple approach didn't work when I tried it with right-click Detach and right-click Attach. 
&lt;br /&gt;So rather than using the interface, try the following.&lt;br /&gt;Detach the database with these commands:&lt;br /&gt;&lt;font face="Courier" color="#0000FF"&gt;&lt;br /&gt;USE master;&lt;br /&gt;go&lt;/font&gt;&lt;br /&gt;&lt;font face="Courier" color="#800517"&gt;SP_DETACH_DB&lt;/font&gt; &lt;font face="Courier" color="#FF0000"&gt;'dbname'&lt;/font&gt;;&lt;br /&gt;&lt;font face="Courier" color="#0000FF"&gt;go&lt;br /&gt;&lt;/font&gt;&lt;br /&gt;Then go to the directory where the database resides. The log file is usually named dbname_log and lives under C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\DATA. Delete the log file from this directory. &lt;br /&gt;Then attach the database with the following commands:&lt;br /&gt;&lt;font face="Courier" color="#0000FF"&gt;&lt;br /&gt;USE master;&lt;br /&gt;go&lt;br /&gt;&lt;font face="Courier" color="#800517"&gt;SP_ATTACH_DB&lt;/font&gt; &lt;font face="Courier" color="#FF0000"&gt;'dbname'&lt;/font&gt;, &lt;font face="Courier" color="#FF0000"&gt;'C:\Program Files\....\dbname.mdf'&lt;/font&gt;&lt;br /&gt;go&lt;br /&gt;&lt;/font&gt;&lt;br /&gt;This attaches the database with a new log file. &lt;br /&gt;And with that, you are rid of the huge log file. 
&lt;br /&gt;Don't forget to limit auto-growth if you don't want to face the same problem later.&lt;br /&gt;&lt;br /&gt;Reference: &lt;a href="http://www.sqlhacks.com/index.php/Administration/Shrink-Log"&gt;sqlhacks&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3215128726026366458-3634034768149725033?l=yaseminavcular.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/3634034768149725033'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3215128726026366458/posts/default/3634034768149725033'/><link rel='alternate' type='text/html' href='http://yaseminavcular.blogspot.com/2009/10/how-to-delete-log-file-in-sql-server.html' title='How to delete .log file in SQL Server 2008'/><author><name>Yasemin Avcular</name><uri>http://www.blogger.com/profile/04323522593265451492</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_ANHBPZ8iLCU/St_VEoHY3aI/AAAAAAAAAb4/Egdn8XCjafo/s72-c/check-and-cross-icons.jpg' height='72' width='72'/></entry></feed>
