Archive for the ‘MapReduce’ Category

12 ways to troubleshoot a failed MapReduce task


MapReduce task failures in a Hadoop cluster can be hard to troubleshoot. The following suggestions help narrow down the problem:

1. Integrate additional configuration:

  • keep.task.files.pattern: This setting selects, by task name pattern, the MapReduce tasks whose files are kept whether the task fails or succeeds.
  • keep.failed.task.files: This setting tells the node that ran the task to keep the files of failed tasks on the machine. The files are stored at {HADOOP_LOG_DIR}/local/taskTracker/taskid/.

2. Launching IsolationRunner:

   Go to the folder above and launch IsolationRunner (org.apache.hadoop.mapred.IsolationRunner) as shown below. It executes the failed task in a child JVM with the same input, which you can then debug.

    • $ hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
    • job.xml is generated by the MapReduce framework when a job is submitted; it holds the job's JobConf settings.

3. Configure Log4J to include additional logging from specific classes. For example, use the setting below to get more log output from the TaskTracker:

  • log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG

4. Make sure the following logging-related Hadoop settings are configured in the cluster:

HADOOP_LOG_DIR: Directory for log files
HADOOP_PID_DIR: Directory to store the PID for the servers
HADOOP_ROOT_LOGGER: Logging configuration for hadoop.root.logger. Default: “INFO,console”
HADOOP_SECURITY_LOGGER: Logging configuration for hadoop.security.logger. Default: “INFO,NullAppender”
HDFS_AUDIT_LOGGER: Logging configuration for hdfs.audit.logger. Default: “INFO,NullAppender”

5. Create a small 1-node cluster and run the same job on a small data set, looking for a simple execution pattern. This is a good way to check for issues related to task-specific execution.

6. You can temporarily run the JobTracker in local mode, even on a large cluster, to test code execution.

  • mapred.job.tracker = local
  • fs.default.name = local
  • Note: These settings can also be placed in hadoop-site.xml, in which case they stay in effect until the XML file is changed. Set on an individual job, they apply to that job only.

7. Send a kill signal to the running Java process

  • “kill -QUIT <java_process_pid>” prints the call stacks, threads, locks and deadlock details on the process's stdout.
  • Java 1.5 and later ship jps (lists Java processes) and jstack (prints the call stacks of a Java process) to troubleshoot the problem further.

8. With Hadoop Pipes (the C++ library that supports MapReduce programs written in C++), set the following configuration to keep the files of the troubled task on the node:

  • hadoop.pipes.command-file.keep = true

9. Using a debug script to troubleshoot the problem is also an option with Hadoop 0.15 and above.

  • Upload the script file to HDFS
  • Set the configuration as
    • mapred.cache.files = full_path#script_file_name (multiple script files can be listed, separated by commas)
    • mapred.create.symlink = yes (the script file must be symlinked)
  • The same can be done programmatically:
    • jobConf.setMapDebugScript("./debugscript");
    • DistributedCache.createSymlink(jobConf);
    • DistributedCache.addCacheFile(new URI("/debug/scripts/script#debugscript"), jobConf);

10. Enable Job Profiling

  • You can sample Map and Reduce tasks with the built-in Java profiler
  • Use the following configuration
    • mapred.task.profile = true
    • mapred.task.profile.maps = 0-2
    • mapred.task.profile.reduces = 0-2
  • The same effect is achieved programmatically:
    • JobConf.setProfileEnabled(boolean)
    • JobConf.setProfileTaskRange(boolean,String)
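
To make the range syntax concrete, here is a minimal Python sketch of how a value such as 0-2 selects the tasks to profile. It is an illustration only, not Hadoop's own parser: the helper names parse_task_range and is_profiled are hypothetical, and the sketch assumes closed ranges and single numbers separated by commas.

```python
def parse_task_range(spec):
    """Expand a range string such as "0-2" or "0-2,5" into sorted task indexes.

    Hypothetical helper: handles only closed ranges ("low-high") and
    single numbers, separated by commas.
    """
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            # Both ends of the range are inclusive.
            selected.update(range(int(low), int(high) + 1))
        else:
            selected.add(int(part))
    return sorted(selected)


def is_profiled(task_index, spec):
    """Return True when the task index falls inside the configured range."""
    return task_index in parse_task_range(spec)


if __name__ == "__main__":
    print(parse_task_range("0-2"))   # the first three tasks
    print(is_profiled(3, "0-2"))     # the fourth task is not sampled
```

With the configuration above, only the first three map tasks and the first three reduce tasks are sampled, which keeps the profiling overhead small.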

11. Using MapReduce Tool Interface

  • The Tool interface in Hadoop adds support for handling generic command-line options
  • Here is the list of Hadoop command-line options
    • -conf <configuration file>
    • -D <property=value>
    • -fs <local|namenode:port>
    • -jt <local|jobtracker:port>
  • The same can be achieved in code by implementing the Tool interface and launching the job through ToolRunner.
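
As a rough illustration of what handling these options involves, the Python sketch below separates the four generic options from the job's own arguments. It does not reproduce Hadoop's parser; the function name split_generic_options and the returned dictionary layout are assumptions made for this example.

```python
def split_generic_options(argv):
    """Separate -conf/-D/-fs/-jt options from the remaining job arguments.

    Hypothetical helper: returns (options, remaining_args).
    """
    options = {"conf": [], "properties": {}, "fs": None, "jt": None}
    remaining = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in ("-conf", "-D", "-fs", "-jt") and i + 1 < len(argv):
            value = argv[i + 1]
            if arg == "-conf":
                options["conf"].append(value)
            elif arg == "-D":
                # -D takes a single property=value pair.
                key, _, val = value.partition("=")
                options["properties"][key] = val
            elif arg == "-fs":
                options["fs"] = value
            else:
                options["jt"] = value
            i += 2
        else:
            # Anything else belongs to the job itself (input path, output path, ...).
            remaining.append(arg)
            i += 1
    return options, remaining


if __name__ == "__main__":
    opts, rest = split_generic_options(
        ["-D", "mapred.job.tracker=local", "-fs", "local", "input", "output"]
    )
    print(opts["properties"], opts["fs"], rest)
```

The point of the design is that the job code only ever sees the remaining arguments, while the generic options end up in the job configuration.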

12. Enable code-level logging by using the Reporter class.

  • To display debug information, use the Reporter reporter parameter of the map and reduce methods. Reporter.setStatus(String status) changes the displayed status of the task, which is visible on the JobTracker web page.

Categories: Hadoop, HDInsight, MapReduce

HDInsight (Hadoop on Azure) Demo: Submit MapReduce job, process result from Pig and filter final results in Hive


In this demo we submit a WordCount MapReduce job to an HDInsight cluster, process the results in Pig, and then filter the results in Hive by storing them as structured data in a table.

Step 1: Submitting the WordCount MapReduce job to a 4-node HDInsight cluster:

c:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd jar c:\apps\Jobs\templates\635000448534317551.hadoop-examples.jar wordcount /user/admin/DaVinci.txt /user/admin/outcount

The results are stored at /user/admin/outcount

Verify the results in the Interactive Shell:

js> #ls /user/admin/outcount
Found 2 items
-rwxrwxrwx 1 admin supergroup 0 2013-03-28 05:22 /user/admin/outcount/_SUCCESS
-rwxrwxrwx 1 admin supergroup 337623 2013-03-28 05:22 /user/admin/outcount/part-r-00000

Step 2: Loading the /user/admin/outcount/part-r-00000 results in Pig:

First we load the flat text file with the schema (words, wordCount):

Grunt> mydata = load '/user/admin/outcount/part-r-00000' using PigStorage('\t') as (words:chararray, wordCount:int);

Grunt>first10 = LIMIT mydata 10;

Grunt>dump first10;

Note: The output shows words with frequency 1. We need to sort the results in descending order to get the words with the highest frequency.

Grunt>mydatadsc = order mydata by wordCount DESC;

Grunt>first10 = LIMIT mydatadsc 10;

Grunt>dump first10;
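
For readers who want to see what the ORDER ... DESC and LIMIT steps compute, here is a small Python sketch of the same logic applied to tab-separated lines in the WordCount output format. The function name top_words and the sample rows are made up for illustration; they are not taken from DaVinci.txt.

```python
def top_words(lines, n=10):
    """Return the n (word, count) pairs with the highest counts.

    Each input line is expected to look like "word<TAB>count",
    the layout of the part-r-00000 file.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, count = line.split("\t")
        rows.append((word, int(count)))
    # Equivalent of: order mydata by wordCount DESC
    rows.sort(key=lambda row: row[1], reverse=True)
    # Equivalent of: LIMIT mydatadsc n
    return rows[:n]


if __name__ == "__main__":
    sample = ["the\t5\n", "vinci\t2\n", "a\t9\n", "once\t1\n"]  # hypothetical rows
    print(top_words(sample, 2))
```

Without the sort, taking the first rows just returns whatever comes first in the file, which is why the first dump showed low-frequency words.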

Now we have the result we expected. Let's store the results in a file on HDFS, using a comma as the field separator so that it matches the Hive table created in Step 3:

Grunt> store first10 into '/user/admin/myresults10' using PigStorage(',');

Step 3: Filtering the Pig results into a Hive table:

First we create a table in Hive using the same format (words and wordscount separated by a comma):

hive> create table wordslist10(words string, wordscount int) row format delimited fields terminated by ',' lines terminated by '\n';

Once the table is created, we load the file stored by Pig, '/user/admin/myresults10/part-r-00000', into the wordslist10 table we just created:

hive> load data inpath '/user/admin/myresults10/part-r-00000' overwrite into table wordslist10;

That’s all. You can now see the results in the table:

hive> select * from wordslist10;

KeyWords: Apache Hadoop, MapReduce, Pig, Hive, HDInsight, BigData

Understanding the HDFS cluster FSImage and looking at its contents

March 20, 2013

FSImage in HDFS Cluster:

  • The HDFS namespace is stored by the NameNode.
  • The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage.
  • The FsImage is stored as a file in the NameNode’s local file system too.
  • The NameNode keeps an image of the entire file system namespace and file Blockmap in memory.
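  • The NameNode reads the FsImage at startup only and keeps it in memory for as long as it runs
    • This makes the namespace size dependent on the NameNode’s RAM
    • Compare “one large file” with “1,000,000 small files”: every file and block adds an entry to the in-memory namespace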

Step 1: First save the fsimage to the local file system (Note: You cannot save the fsimage to HDFS.)

Step 2: Convert the saved image to a human-readable file using the oiv (Offline Image Viewer) option

Step 3: Now you can open the file in your favorite editor, or use head or tail for a quick peek.

Cloudera Impala Hands-on Video

March 6, 2013

Cloudera Impala raises the bar for query performance while retaining a familiar user experience. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in (near) real time.

Learn more: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/

Download the Impala virtual Machine: https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Impala+Demo+VM

Hortonworks Data Platform (HDP) Sandbox with Hadoop 1.1.2.21 Walkthrough

February 26, 2013

I recently downloaded the Hortonworks Data Platform (HDP) Sandbox and decided to give it a quick run. As advertised, it takes only 15 minutes to get familiar with key Hadoop components such as Pig, HBase, Hive, Oozie and a few others. The Sandbox comes as a pre-configured, pre-installed VMware- or VirtualBox-compatible VMDK running CentOS 6.2 Linux, which you can download and open easily with either of these free applications. This 15-minute walkthrough checks out the key Hadoop components in the HDP Sandbox.

If you are new to Apache Hadoop, this is the easiest way to familiarize yourself with the amazing world of Hadoop.

HDP Sandbox Instructions: http://hortonworks.com/blog/hortonworks-sandbox-the-fastest-on-ramp-to-apache-hadoop/

HDP Download Links: http://hortonworks.com/products/sandbox-instructions/

VMDK Username and Password: root/hadoop

Categories: Hadoop, MapReduce, Big Data

Hadoop Adventures with Microsoft HDInsight

November 3, 2012

 

What is HDInsight?

  • HDInsight is the product name for Microsoft’s installation of Hadoop and for the Hadoop on Azure service. HDInsight is Microsoft’s 100% Apache-compatible Hadoop distribution, supported by Microsoft. HDInsight, available both on Windows Server and as a Windows Azure service, empowers organizations with new insights on previously untouched unstructured data, while connecting to the most widely used Business Intelligence (BI) tools on the planet.
  • http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data.aspx

It is available in two modes:

  • HDInsight as a cloud service: the cloud version running on Windows Azure
  • HDInsight as a local cluster: a downloadable version that runs locally on Windows Server and desktop

In this article we will see how to use HDInsight on a local machine.

 

Where to get it?

What the Windows installer brings to your machine:

After the installation completes, the following applications are installed:

  1. Microsoft HDInsight Community Technology Preview Version 1.0.0.0
  2. Hortonworks Data Platform 1.0.1 Developer Preview Version 1.0.1
  3. If you do not change the installed components, Python 2.7.3150 is also installed
  4. The Java and C++ runtimes are also installed on the machine as required

 

By default, Hadoop is installed at C:\Hadoop.

Once the installer completes, the following shortcuts are set up on your machine:

  1. Hadoop Command Line
  2. Microsoft HDInsight Dashboard
  3. Hadoop MapReduce Status
  4. Hadoop Name Node Status

 

If you launch the “Hadoop Command Line” shortcut you will see the following list of commands:

  • namenode -format: format the DFS filesystem
  • secondarynamenode: run the DFS secondary namenode
  • namenode: run the DFS namenode
  • datanode: run a DFS datanode
  • dfsadmin: run a DFS admin client
  • mradmin: run a Map-Reduce admin client
  • fsck: run a DFS filesystem checking utility
  • fs: run a generic filesystem user client
  • balancer: run a cluster balancing utility
  • fetchdt: fetch a delegation token from the NameNode
  • jobtracker: run the MapReduce job Tracker node
  • pipes: run a Pipes job
  • tasktracker: run a MapReduce task Tracker node
  • historyserver: run job history servers as a standalone daemon
  • job: manipulate MapReduce jobs
  • queue: get information regarding JobQueues
  • version: print the version
  • jar <jar>: run a jar file
  • distcp <srcurl> <desturl>: copy file or directories recursively
  • archive -archiveName NAME <src>* <dest>: create a hadoop archive
  • daemonlog: get/set the log level for each daemon
  • or CLASSNAME: run the class named CLASSNAME

Most commands print help when invoked without parameters.

Try checking the Version as below:

c:\Hadoop\hadoop-1.1.0-SNAPSHOT>hadoop version

Hadoop 1.1.0-SNAPSHOT

Subversion on branch -r

Compiled by jenkins on Wed Oct 17 22:28:56 PDT 2012

From source with checksum 80f5614dfb0743b569344f051a07b37d

Launching the “Microsoft HDInsight Dashboard” shortcut opens the dashboard running locally.

Launching the “Hadoop MapReduce Status” and “Hadoop Name Node Status” shortcuts opens the status pages for MapReduce and for the NameNode.

So, as these status pages show, you do have a Hadoop cluster running on your local machine.

Play with it a little more; my next article will have more information on this.

Have fun with Hadoop!!

Categories: Big Data, Hadoop, MapReduce

MapReduce in Cloud


When you look to the cloud for MapReduce to process a large amount of data, I think this is what you are looking for:

  1. A collection of machines that are Hadoop/MapReduce ready and instantly available.
  2. You don’t want to build Hadoop (HDFS/MapReduce) instances from scratch: several IaaS services will give you hundreds of machines in the cloud, but building a Hadoop cluster on them yourself is a nightmare.
  3. You just need to hook up your data and push MapReduce jobs immediately.
  4. Being in the cloud means you want to harvest the power of thousands of machines “instantly” and pay only for the CPU hours you consume.

Here are a few options available now, which I tried before writing this:

Apache Hadoop on Windows Azure:
Microsoft also has Hadoop/MapReduce running on Windows Azure, but it is under a limited CTP. You can provide your information and request CTP access at the link below:
https://www.hadooponazure.com/

The Developer Preview for the Apache Hadoop-based Services for Windows Azure is available by invitation.

Amazon: Elastic Map Reduce
Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
http://aws.amazon.com/elasticmapreduce/

Google BigQuery:
You can also try Google BigQuery, for which you first have to move your data to Google’s proprietary storage and then run BigQuery on it. Remember that BigQuery is based on Dremel, which is similar to MapReduce but faster thanks to column-based processing.
Google BigQuery is invitation only, but you can request access:
https://developers.google.com/bigquery/

Mortar Data:
Another option is Mortar Data, which uses Python and Pig intelligently to make it easy to write jobs and visualize the results. I found it very interesting; please have a look:
http://mortardata.com/#!/how_it_works

Categories: Hadoop, MapReduce

Programmatically retrieving Task ID and Unique Reducer ID in MapReduce


For each Mapper and Reducer you can get both the task attempt ID and the task ID. This can be done in the setup method of your map or reduce class, using the Context object. Each reducer is also assigned a unique reduce ID, which is available inside the Reducer class's setup method, and you can get this ID as well.

There are multiple ways to get this information:

1. Using the JobConf class.

  • JobConf.get("mapred.task.id") returns most of the information about the Map or Reduce task, including the attempt ID.

2. Using the Context class, as below:

  • Task attempt ID: context.getTaskAttemptID()
  • Reducer task ID: context.getTaskAttemptID().getTaskID()
  • Reducer number: context.getTaskAttemptID().getTaskID().getId()
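
The three values are nested inside one another, which is easiest to see by taking an attempt ID string apart. The Python sketch below does that for the Hadoop 1.x string layout attempt_<cluster timestamp>_<job number>_<m|r>_<task number>_<attempt number>. The sample ID and the names AttemptInfo and parse_attempt_id are hypothetical, chosen only for this illustration.

```python
from collections import namedtuple

# Hypothetical container for the pieces of a task attempt ID.
AttemptInfo = namedtuple(
    "AttemptInfo", "job_id task_id task_type task_number attempt_number"
)


def parse_attempt_id(attempt_id):
    """Split an ID such as attempt_201303280522_0001_r_000003_0 into its parts."""
    prefix, cluster_ts, job_seq, task_type, task_num, attempt_num = attempt_id.split("_")
    if prefix != "attempt" or task_type not in ("m", "r"):
        raise ValueError("not a task attempt ID: %r" % attempt_id)
    return AttemptInfo(
        job_id="job_%s_%s" % (cluster_ts, job_seq),
        # The task ID is the attempt ID without the trailing attempt number.
        task_id="task_%s_%s_%s_%s" % (cluster_ts, job_seq, task_type, task_num),
        task_type="map" if task_type == "m" else "reduce",
        # The numeric part is what getTaskID().getId() corresponds to.
        task_number=int(task_num),
        attempt_number=int(attempt_num),
    )


if __name__ == "__main__":
    info = parse_attempt_id("attempt_201303280522_0001_r_000003_0")  # made-up ID
    print(info.task_id, info.task_number, info.attempt_number)
```

The reducer number stays the same across retries of the same task, while the attempt number changes, so the reducer number is the safer choice for naming per-reducer output.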

Keyword: Hadoop, Map/Reduce, Jobs Performance, Mapper, Reducer

Resource Allocation Model in MapReduce 2.0


What was available in previous MapReduce:

  • Each node in the cluster was statically assigned the capability of running a predefined number of Map slots and a predefined number of Reduce slots.
  • The slots could not be shared between Maps and Reduces. This static allocation of slots wasn’t optimal since slot requirements vary during the MR job life cycle
  • In general there is a demand for Map slots when the job starts, as opposed to the need for Reduce slots towards the end

Key drawback in previous MapReduce:

  • In a real cluster, where jobs are randomly submitted and each has its own Map/Reduce slots requirement, having an optimal utilization of the cluster was hard, if not impossible.

What is new in MapReduce 2.0:

  • The resource allocation model in Hadoop 0.23 addresses the key drawback above by providing more flexible resource modeling.
  • Resources are requested in the form of containers, where each container has a number of non-static attributes.
  • At the time of writing this blog, the only supported attribute was memory (RAM). However, the model is generic and there is an intention to add more attributes in future releases (e.g. CPU and network bandwidth).
  • In this new resource management model, only a minimum and a maximum are defined for each attribute, and Application Masters (AMs) can request containers with attribute values that are multiples of these minimums.
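
A short Python sketch can make the "multiples of the minimum" rule concrete for the memory attribute. It is a simplified model, not the scheduler's code: the function name normalize_memory_request, the 1024 MB minimum and the 8192 MB maximum are assumptions for illustration, and rejecting requests that round above the maximum is a choice made here for simplicity.

```python
def normalize_memory_request(requested_mb, minimum_mb, maximum_mb):
    """Round a memory request up to the next multiple of the minimum.

    Simplified model: requests that round to more than the maximum are
    rejected with ValueError (an assumption of this sketch).
    """
    if requested_mb <= 0 or minimum_mb <= 0:
        raise ValueError("memory values must be positive")
    # Ceiling division: how many minimum-sized units cover the request.
    units = -(-requested_mb // minimum_mb)
    granted_mb = units * minimum_mb
    if granted_mb > maximum_mb:
        raise ValueError("request exceeds the configured maximum")
    return granted_mb


if __name__ == "__main__":
    # Hypothetical limits: 1024 MB minimum, 8192 MB maximum.
    print(normalize_memory_request(1500, 1024, 8192))
    print(normalize_memory_request(1024, 1024, 8192))
```

Unlike fixed Map and Reduce slots, the same pool of memory can back a few large containers or many small ones, which is what makes the model more flexible.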

Credit: http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/
