Writing Ruby Map-Reduce programs for Hadoop
May 13th, 2008 | by Rajagopal
I wanted to make some crawlers that I wrote in Ruby run faster. I thought of running them on Hadoop and started trying out Hadoop Streaming for the purpose. Here is a short how-to, since others might be looking for one.
Hadoop Streaming lets you write the map and reduce code in any language and run it on a Hadoop grid easily.
I assume that you have installed Hadoop and that it already runs the sample Java programs that ship with the package. If not, take a look at the official quickstart tutorial. It also explains how to use the hdfs commands to copy input files to HDFS, read the output, and so on.
All you need to understand to write a Hadoop Streaming program is:
- A mapper always reads records and outputs key-value pairs. The record reader could be anything, but the default record reader splits the input by lines. Your mapper should read each record and output a key-value pair. You may use any delimiter to separate the key from the value, but Hadoop uses the tab character by default; if that works for you, use the same.
- A reducer always gets key-value pairs as input and gives key-value pairs as output.
- With Hadoop Streaming, you write your mapper so that it reads records from STDIN and writes key-value pairs to STDOUT, and your reducer so that it reads key-value pairs from STDIN and writes key-value pairs to STDOUT.
- If you plan to use a combiner, it works just like your reducer: it reads key-value pairs from STDIN and writes key-value pairs to STDOUT.
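To make the contract in the list above concrete, here is a minimal in-process sketch of the flow. The method names (map_lines, shuffle, reduce_lines) are made up for illustration and are not part of Hadoop; the sort step only imitates the fact that Hadoop sorts mapper output by key before the reducer sees it.

```ruby
# Hypothetical helpers that imitate the three stages of a streaming job.

# Mapper side: turn raw text lines into "key<TAB>value" lines.
def map_lines(lines)
  lines.flat_map do |line|
    line.split.map { |word| "#{word}\t1" }
  end
end

# Framework side: stand-in for Hadoop sorting mapper output by key.
def shuffle(pairs)
  pairs.sort
end

# Reducer side: read "key<TAB>value" lines and emit "key<TAB>total" lines.
def reduce_lines(pairs)
  totals = Hash.new(0)
  pairs.each do |pair|
    key, value = pair.chomp.split("\t")
    totals[key] += value.to_i
  end
  totals.map { |key, total| "#{key}\t#{total}" }
end

input = ["to be or not", "to be"]
result = reduce_lines(shuffle(map_lines(input)))
puts result
```

Every stage takes lines of text and returns lines of text, which is why the same logic can later be split into separate scripts that talk over STDIN and STDOUT.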
Let's get our hands dirty.
mapper.rb
#!/usr/bin/env ruby
# Count the words seen by this mapper, then emit "word<TAB>count" lines.
wordcount = Hash.new
STDIN.each_line do |line|
  line.split.each do |word|
    wordcount[word] = wordcount[word].to_i + 1
  end
end

wordcount.each_pair do |word, count|
  puts "#{word}\t#{count}"
end
reducer.rb
#!/usr/bin/env ruby
# Sum the counts for each word from "word<TAB>count" input lines.
wordcount = Hash.new
STDIN.each_line do |line|
  keyval = line.split("\t")
  wordcount[keyval[0]] = wordcount[keyval[0]].to_i + keyval[1].to_i
end

wordcount.each_pair do |word, count|
  puts "#{word}\t#{count}"
end
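Since Hadoop hands the reducer its input sorted by key, a reducer does not strictly need to keep every word in a hash. The following is only a sketch of that alternative, not the reducer used in this post; the method name reduce_sorted is made up, and StringIO objects stand in for STDIN and STDOUT so it can be tried locally.

```ruby
require "stringio"

# Sum counts over runs of identical keys. Assumes the input is sorted by key.
def reduce_sorted(input, output)
  current = nil
  total = 0
  input.each_line do |line|
    key, value = line.chomp.split("\t")
    if key != current
      # A new key starts: flush the total of the previous key, if any.
      output.puts "#{current}\t#{total}" unless current.nil?
      current = key
      total = 0
    end
    total += value.to_i
  end
  # Flush the last key.
  output.puts "#{current}\t#{total}" unless current.nil?
end

# Local check with in-memory data in place of STDIN and STDOUT.
sample = StringIO.new("be\t1\nbe\t2\nto\t3\n")
out = StringIO.new
reduce_sorted(sample, out)
print out.string
```

In a real job you would call reduce_sorted(STDIN, STDOUT). Memory use then stays constant no matter how many distinct words reach the reducer.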
Now that you have a mapper and a reducer, run your job as shown below.
$HADOOP_HOME/bin/hadoop \
jar $HADOOP_HOME/contrib/streaming/hadoop-0.16.4-streaming.jar \
-mapper mapper.rb \
-reducer reducer.rb \
-input input/* \
-output wc-output \
-file $PATH_TO_YOUR_FOLDER/mapper.rb \
-file $PATH_TO_YOUR_FOLDER/reducer.rb
I had set $HADOOP_HOME to the root of the extracted Hadoop package, and $PATH_TO_YOUR_FOLDER is where the mapper.rb and reducer.rb files live. In the example above, I gave only the mapper and reducer file names, without the complete path, and supplied the complete paths through the -file options. The -file option packs the file given as its argument into the job jar that is sent to all nodes in the grid. If the files already existed on every node at a particular path, you could instead give the full path directly to the -mapper and -reducer options. Still, using -file does no harm unless the files are big enough to make the job jar unnecessarily huge.
Happy Rubying and Hadooping!

One Response to “Writing Ruby Map-Reduce programs for Hadoop”
By Daniel Haran on Aug 6, 2008
In the reducer, keyval = line.split("\t") could be simplified to:
key, val = line.split("\t")
This line is suspect:
wordcount[keyval[0]] = wordcount[keyval[0]].to_i+keyval[1].to_i
wordcount[keyval[0]].to_i is probably 0, so the result is still correct; however it’s likely adding some unnecessary time to execution.
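One way to apply the suggestion above is sketched below. It assumes a hash created with Hash.new(0), so missing words default to 0 and the extra to_i on the stored value goes away. The method name sum_counts and the sample lines are made up for illustration.

```ruby
# Reducer loop using destructuring and a Hash that defaults to 0.
def sum_counts(lines)
  wordcount = Hash.new(0)
  lines.each do |line|
    key, val = line.chomp.split("\t")
    wordcount[key] += val.to_i
  end
  wordcount
end

# Sample input in the "word<TAB>count" form the mapper emits.
totals = sum_counts(["ruby\t2\n", "hadoop\t1\n", "ruby\t3\n"])
totals.each_pair { |word, count| puts "#{word}\t#{count}" }
```

In the actual reducer, STDIN would take the place of the sample array, since STDIN.each behaves like STDIN.each_line.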