
How to chain multiple MapReduce jobs in Hadoop?


When running MapReduce jobs, it is common for the overall work to consist of several MapReduce steps, where the output of one reduce phase becomes the input of the next map phase:

Map1 -> Reduce1 -> Map2 -> Reduce2 -> Map3…

While searching for an answer for my own MapReduce job, I stumbled upon several cool ways to achieve this. Here are some of them:

Using JobClient.runJob() to chain jobs:

http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining

You can easily chain jobs together in this fashion by writing multiple driver methods, one for each job. Call the first driver method, which uses JobClient.runJob() to run the job and wait for it to complete. When that job has completed, call the next driver method, which creates a new JobConf object referring to different Mapper and Reducer instances, and so on. The first job in the chain should write its output to a path that is then used as the input path for the second job. This process can be repeated for as many jobs as are necessary to arrive at a complete solution to the problem.

Method 1:

  • First, create the JobConf object "job1" for the first job and set all the parameters with "input" as the input directory and "temp" as the output directory. Execute this job: JobClient.runJob(job1).
  • Immediately below it, create the JobConf object "job2" for the second job and set all the parameters with "temp" as the input directory and "output" as the output directory. Finally, execute the second job (see the sketch below): JobClient.runJob(job2).
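
Here is a minimal sketch of Method 1 against the classic org.apache.hadoop.mapred API. IdentityMapper and IdentityReducer stand in for real job logic, and the "input", "temp", and "output" paths match the bullets above.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainedDriver {
    public static void main(String[] args) throws Exception {
        // Job 1: reads "input", writes intermediate results to "temp".
        JobConf job1 = new JobConf(ChainedDriver.class);
        job1.setJobName("step-1");
        job1.setMapperClass(IdentityMapper.class);
        job1.setReducerClass(IdentityReducer.class);
        job1.setOutputKeyClass(LongWritable.class);
        job1.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job1, new Path("input"));
        FileOutputFormat.setOutputPath(job1, new Path("temp"));
        JobClient.runJob(job1); // blocks until job 1 completes

        // Job 2: reads job 1's output from "temp", writes final results to "output".
        JobConf job2 = new JobConf(ChainedDriver.class);
        job2.setJobName("step-2");
        job2.setMapperClass(IdentityMapper.class);
        job2.setReducerClass(IdentityReducer.class);
        job2.setOutputKeyClass(LongWritable.class);
        job2.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job2, new Path("temp"));
        FileOutputFormat.setOutputPath(job2, new Path("output"));
        JobClient.runJob(job2); // runs only after job 1 has finished
    }
}

Because JobClient.runJob() blocks, the second job is submitted only after the first has finished, so "temp" is guaranteed to be completely written before job 2 reads it.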

Method 2:

  • Create two JobConf objects and set all the parameters in them just as in Method 1, except that you don't call JobClient.runJob().
  • Then create two Job objects with the JobConfs as parameters: Job job1 = new Job(jobconf1); Job job2 = new Job(jobconf2);
  • Using a JobControl object, specify the job dependencies and then run the jobs (see the sketch below): JobControl jbcntrl = new JobControl("jbcntrl"); jbcntrl.addJob(job1); jbcntrl.addJob(job2); job2.addDependingJob(job1); jbcntrl.run();
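
A minimal sketch of Method 2, assuming the classic org.apache.hadoop.mapred.jobcontrol API; jobconf1 and jobconf2 are the two fully configured JobConf objects from the first bullet.

import java.io.IOException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class JobControlDriver {
    public static void runChain(JobConf jobconf1, JobConf jobconf2) throws IOException {
        // Wrap each JobConf in a jobcontrol Job.
        Job job1 = new Job(jobconf1);
        Job job2 = new Job(jobconf2);
        job2.addDependingJob(job1); // job2 starts only after job1 succeeds

        JobControl jbcntrl = new JobControl("jbcntrl");
        jbcntrl.addJob(job1);
        jbcntrl.addJob(job2);

        // Note: on some Hadoop versions run() keeps looping even after all
        // jobs finish; see the comments at the end of this post for a
        // thread-based workaround that polls allFinished() and calls stop().
        jbcntrl.run();
    }
}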

Using Oozie, the Hadoop Workflow Service, described below:

https://issues.apache.org/jira/secure/attachment/12400686/hws-v1_0_2009FEB22.pdf

3.1.5 Fork and Join Control Nodes

A fork node splits one path of execution into multiple concurrent paths of execution. A join node waits until every concurrent execution path of a previous fork node arrives at it. Fork and join nodes must be used in pairs. The join node assumes all concurrent execution paths are children of the same fork node.

The name attribute in the fork node is the name of the workflow fork node. The to attribute in the transition elements in the fork node indicates the name of the workflow node that will be part of the concurrent execution. The name attribute in the join node is the name of the workflow join node. The to attribute in the transition element in the join node indicates the name of the workflow node that will be executed after all concurrent execution paths of the corresponding fork arrive at the join node.

Example:

<hadoop-workflow name="sample-wf">
  <fork name="forking">
    <transition to="firstparalleljob"/>
    <transition to="secondparalleljob"/>
  </fork>

  <hadoop name="firstparalleljob">
    <job-xml>job1.xml</job-xml>
    <transition name="OK" to="joining"/>
    <transition name="ERROR" to="fail"/>
  </hadoop>

  <hadoop name="secondparalleljob">
    <job-xml>job2.xml</job-xml>
    <transition name="OK" to="joining"/>
    <transition name="ERROR" to="fail"/>
  </hadoop>

  <join name="joining">
    <transition to="nextaction"/>
  </join>
</hadoop-workflow>

Using Cascading Library (GPL):

Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault-tolerant data processing workflows on an Apache Hadoop cluster, all without having to 'think' in MapReduce.

http://www.cascading.org/
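
A minimal sketch against the Cascading 1.x API: a single pipe assembly with two grouping steps, which Cascading's planner compiles into a chain of MapReduce jobs on its own. The paths and field names here are placeholders.

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class CascadingChain {
    public static void main(String[] args) {
        Tap source = new Hfs(new TextLine(), "input");
        Tap sink = new Hfs(new TextLine(), "output", true); // true = replace existing output

        // Each GroupBy forces a reduce phase; the planner inserts the
        // intermediate map/reduce boundaries (and temp paths) for us.
        Pipe assembly = new Pipe("chain");
        assembly = new GroupBy(assembly, new Fields("line"));
        assembly = new GroupBy(assembly, new Fields("line"));

        Flow flow = new FlowConnector(new Properties()).connect(source, sink, assembly);
        flow.complete(); // runs every generated MapReduce job, in dependency order
    }
}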

Using Apache Mahout Recommender Job Sample:

The Apache Mahout project has a sample called RecommenderJob.java which chains together multiple MapReduce jobs. You can find the sample here:

http://search-lucene.com/c/Mahout:/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java%7C%7CRecommenderJob

Using a simple Java library named "Hadoop-orchestration" on GitHub:

This library enables execution of multiple Hadoop jobs as a workflow. The job configuration and the workflow defining inter-job dependencies are specified in a JSON file. Everything is externally configurable and does not require any change to an existing MapReduce implementation to make it part of a workflow. Details can be found here; source code and a jar are available on GitHub.

http://pkghosh.wordpress.com/2011/05/22/hadoop-orchestration/

  1. May 23, 2012 at 5:48 am

    Hi Avkash,
    Great article!
    Is there a similar tutorial on how to create multiple jobs using C#, along the lines of http://social.technet.microsoft.com/wiki/contents/articles/7258.walkthrough-creating-and-using-c-mapper-and-reducer-hadoop-streaming.aspx

    My question: suppose there is logic that chains, say, three MapReduce jobs (written in C# or Java) and executes them on hadooponazure. If I click Delete Job during the first job's execution, what should ideally happen?

    Should the first job be deleted while the rest still execute because of the chaining, or should it not be deleted at all because there are active threads running?

    Any views on this will be greatly appreciated.

  2. hadoop-fan
    November 5, 2012 at 11:39 pm

    Good summary. Thanks.

    One problem I found with Method 2 is that JobControl.run() itself never returns, even after all jobs are finished. I had to use another thread to monitor whether all jobs in the JobControl are finished and then call JobControl.stop(), so that JobControl.run() can exit.
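
    A minimal sketch of that workaround, assuming the classic org.apache.hadoop.mapred.jobcontrol API (the 500 ms polling interval is arbitrary):

    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    public class JobControlRunner {
        // Drive the JobControl from a daemon thread and poll until every
        // job has finished, then stop the control loop so run() can exit.
        public static void runAndWait(JobControl control) throws InterruptedException {
            Thread runner = new Thread(control, "jobcontrol-runner");
            runner.setDaemon(true);
            runner.start();
            while (!control.allFinished()) {
                Thread.sleep(500);
            }
            control.stop();
        }
    }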

    • vic4ever
      April 2, 2013 at 1:23 pm

      I encountered the same problem. Can you show me how you solved it? Thanks.

  3. raghavan S
    January 17, 2013 at 9:40 am

    I think you meant the ControlledJob class instead of the Job class. The Job class does not have any method like addDependingJob.

  4. Dave holmes
    February 18, 2014 at 1:27 am

    Thanks for the great resources. You helped me with a school project.

