Archive
12 ways to troubleshoot a failed MapReduce task
Sometimes it is hard to troubleshoot MapReduce task failure in a Hadoop Cluster. The following suggestions does help to troubleshoot the problem:
1. Integrate Additional Configuration:
- keep.task.files.pattern: This settings will specify a MapReduce task to keep stored by name, in both failed and success condition.
- keep.failed.task.files: This settings inform the data node to keep failed task files stored on the machine. The files are stored at {HADOOP_LOG_DIR}
/local/taskTracker/taskid/.
2. Launching IsolationRunner:
Visit this folder and launch IsolationRunner (org.apache.hadoop.mapred.IsolationRunner) as shown below which will execute failed task in a child JVM, using the same input which you can debug.
- $hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
- job.xml is generated by the MapReduce components when a job is submitted, MapReduce create the JobConf
3. Configure Log4J to include additional logging from specific classes, for example use configure below to get more log from TaskTracker
- log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG
4. Be sure to have the following log specific Hadoop configuration is set in the cluster:
| HADOOP_LOG_DIR | Directory for log files |
| HADOOP_PID_DIR | Directory to store the PID for the servers |
| HADOOP_ROOT_LOGGER | Logging configuration for hadoop.root.logger. default: “INFO,console” |
| HADOOP_SECURITY_LOGGER | Logging configuration for hadoop.security.logger. default: “INFO,NullAppender“ |
| HDFS_AUDIT_LOGGER | Logging configuration for hdfs.audit.logger. default: “INFO,NullAppender“ |
5. Create a small 1 node cluster an using small size of data set, run the same job and look for simple execution pattern. This is good step to check the issues related with task specific execution.
6. User can temporarily run JobTracker in local mode even with a large cluster to test code execution.
- mapred.job.tracker = local
- fs.default.name = local
- Note: Above settings can be included in hadoop-site.xml as well however those settings will see as long as xml is not changed. Above settings will be specific to current running job only.
7. Sending kill signal to running Java process
- Use “kill -QUIT #java_process_pid” will print Call Stack, threads, lock and deadlocks details on stdout.
- Java 1.5 have jps (List Java Process) and jstack (List Java Process Call Stack) to further troubleshoot the problem
8. Hadoop Pipes (C++ Library to support MapReduce program written in C++) set the following configuration to keep the troubled task files saved in the data node.
- hadoop.pipes.command-file.keep = true
9. Using script file to troubleshot the problem is also in option with Hadoop 0.15 and above.
- Upload script file to HDFS
- Set the configuration as
- mapred.cache.files = full_path#script_file_name (Multiple script files can be added using ‘ , ‘ comma)
- mapred.create.symlink = yes (Script file must be symlinked)
- Same above can be done programmatically as below:
- jobConf.setMapDebugScript(“./debugscript”);
- DistributedCache.createSymlink(jobConf);
- DistributedCache.addCacheFile(“/debug/scripts/script#debugscript”);
10. Enable Job Profiling
- User can sample Map and Reduce task through built in Java Profiler
- Use the following configuration
- mapred.task.profile = TRUE
- mapred.task.profile.maps = 0-2
- mapred.task.profile.reduces = 0-2
- Same effect is achieve programmatically as below:
- JobConf.setProfileEnabled(boolean)
- JobConf.setProfileTaskRange(boolean,String)
11. Using MapReduce Tool Interface
- Tool Interface in Hadoop supports functionality to handle Command line options
- Here is a list of Hadoop command line options
- -conf <configuration file>
- -D <property=value>
- -fs <local|namenode:port>
- -jt <local|jobtracker:port>
- Same can be achieved through code as describe here.
12. Enable code level logging by using Reporter class.
- To display debug information, use Reported reported parameter with map and reduce interfaces. Method Reporter.setStatus(String status) changes the displayed status of the map task and visable on the jobtracker web page.
Source:
- http://wiki.apache.org/hadoop/HowToDebugMapReducePrograms
- https://ccp.cloudera.com/display/DOC/Hadoop+Tutorial
HDInsight (Hadoop on Azure) Demo: Submit MapReduce job, process result from Pig and filter final results in Hive
In this demo we will submit a WordCount map reduce job to HDInsight cluster and process the results in Pig and then filter the results in Hive by storing structured results into a table.
Step 1: Submitting WordCount MapReduce Job to 4 node HDInsight cluster:
c:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd jar c:\apps\Jobs\templates\635000448534317551.hadoop-examples.jar wordcount /user/admin/DaVinci.txt /user/admin/outcount
The results are stored @ /user/admin/outcount
Verify the results at Interactive Shell:
Step 2: loading /user/admin/outcount/part-r-00000 results in the Pig:
First we are storing the flat text file data as words, wordCount format as below:
Grunt>mydata = load ‘/user/admin/output/part-r-00000′ using PigStorage(‘\t’) as (words:chararray, wordCount:int);
Grunt>first10 = LIMIT mydata 10;
Grunt>dump first10;

Note: This shows results for the words with frequency 1. We need to reorder to results on descending order to get words with top frequency.
Grunt>mydatadsc = order mydata by wordCount DESC;
Grunt>first10 = LIMIT mydatadsc 10;
Grunt>dump first10;
Now we have got the result as expected. Lets stored the results into a file at HDFS.
Grunt>Store first10 into ’/user/avkash/myresults10‘ ;
Step 3: Filtering Pig Results in to Hive Table:
First we will create a table in Hive using the same format (words and wordcount separated by comma)
hive> create table wordslist10(words string, wordscount int) row format delimited fields terminated by ‘,’ lines terminated by ‘\n’;
Now once table is created we will load the hive store file ’/user/admin/myresults10/part-r-00000′ into wordslist10 table we just created:
hive> load data inpath ‘/user/admin/myresults10/part-r-00000′ overwrite into table wordslist10;
That’s all as you can see the results now in table:
hive> select * from wordslist10;
KeyWords: Apache Hadoop, MapReduce, Pig, Hive, HDInsight, BigData
Understanding HDFS cluster FSImage image and looking its contents
- The HDFS namespace is stored by the NameNode.
- The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage.
- The FsImage is stored as a file in the NameNode’s local file system too.
- The NameNode keeps an image of the entire file system namespace and file Blockmap in memory.
- Reads at startup only and keeps it during the life of start
- RAM Dependency
- What about “One Large File” or “1,000,000 small files”
Step 1: First save the fsimage to local file system (Note: You cannot save the fsimage to HDFS.)
Step 2: Converting saved image to human readable file using –oiv option
Step 3: Now you can open the file in your favorite editor or you can try $head or $tail quickly to peek into.
Cloudera Impala Hands-on Video
Cloudera Impala raises the bar for query performance while retaining a familiar user experience. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in (near) real time.
Learn more: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
Download the Impala virtual Machine: https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Impala+Demo+VM
Hortonworks Data Platform (HDP) Sandbox with Hadoop 1.1.2.21 Walkthrough
I recently downloaded and decided to give a quick run to Hortonworks Data Platform (HDP) Sandbox. As suggested it takes only 15 minutes for anyone to get familiar about key Hadoop components i.e. Pig, HBase, Hive, Oozie and few others. The Sandbox comes with pre-configured and pre-installed VMWare or VirtualBox compatable VMDK running CentOS 6.2 Linux, which user can download and open very easily with any of above free application. This 15 minutes walk through checks out key Hadoop components in HDP sandbox.
If you are new to Apache Hadoop, this is the easiest method you have to get familiarize yourself with Amazing world of Hadoop.
HDP Sandbox Instructions: http://hortonworks.com/blog/hortonworks-sandbox-the-fastest-on-ramp-to-apache-hadoop/
HDP Download Links: http://hortonworks.com/products/sandbox-instructions/
VMDK Username and Password: root/hadoop
Hadoop Adventures with Microsoft HDInsight
What is HDInsight?
- Hdinsight is the product name for Microsoft installation of Hadoop and Hadoop on azure service. HDInsight is Microsoft’s 100% Apache compatible Hadoop distribution, supported by Microsoft. HDInsight, available both on Windows Server or as an Windows Azure service, empowers organizations with new insights on previously untouched unstructured data, while connecting to the most widely used Business Intelligence (BI) tools on the planet.
- http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data.aspx
It is available in two mode:
- HDInsight as Cloud Service: Cloud Version running on Windows Azure
- HDInsight as Local Cluster: A downloadable version to runs locally on Windows Server and Desktop
In this article we will see how to use HDInsight on local machine.
Where to get it?
- You can download HDInsight Preview version from the link below:
- http://www.microsoft.com/web/handlers/webpi.ashx?command=GetInstallerRedirect&appid=HDINSIGHT-PREVIEW&mode=new
What does Windows installer brings to your machine:
After the installation is completed you will see the following applications are installed:
- Microsoft HDInsight Community Technology Preview Version 1.0.0.0
- Hortonwoks Data Platform 1.0.1 Developer Preview Version 1.0.1
- If you do not change the installed component, Python 2.7.3150 is also installed
- Java and C++ runtime is also installed as required in the machine
By default the Hadoop is installed at C:\Hadoop as below:
Once installer is completed you will see the following shortcuts are setup in your machine:
Here is the list of shortcuts:
- Hadoop Command Line
- Microsoft HDInsight Dashboard
- Hadoop MapReduce Status
- Hadoop Name Node Status
If you launch the “Hadoop command Line” you will see the list of commands as below:
| · namenode -format format the DFS filesystem
· secondarynamenode run the DFS secondary namenode · namenode run the DFS namenode · datanode run a DFS datanode · dfsadmin run a DFS admin client · mradmin run a Map-Reduce admin client · fsck run a DFS filesystem checking utility · fs run a generic filesystem user client · balancer run a cluster balancing utility · fetchdt fetch a delegation token from the NameNode · jobtracker run the MapReduce job Tracker node · pipes run a Pipes job · tasktracker run a MapReduce task Tracker node · historyserver run job history servers as a standalone daemon · job manipulate MapReduce jobs · queue get information regarding JobQueues · version print the version · jar <jar> run a jar file · distcp <srcurl> <desturl> copy file or directories recursively · archive -archiveName NAME <src>* <dest> create a hadoop archive · daemonlog get/set the log level for each daemon · or · CLASSNAME run the class named CLASSNAME Most commands print help when invoked w/o parameters. |
Try checking the Version as below:
| c:\Hadoop\hadoop-1.1.0-SNAPSHOT>hadoop version
Hadoop 1.1.0-SNAPSHOT Subversion on branch -r Compiled by jenkins on Wed Oct 17 22:28:56 PDT 2012 From source with checksum 80f5614dfb0743b569344f051a07b37d |
Now if you Launch “Microsoft HDInsight Dashboard” shortcut you will see the dashboard running locally as below:
Launching “Hadoop MapReduce Status” shortcut will give you the following info:
And Launching “Hadoop Name Node Status” shortcut you will see the following:
So as you can see above, you do have Hadoop Cluster running on your local machine.
Play with it a little more and my next article is coming with more info on this regard.
Have fun with Hadoop!!
MapReduce in Cloud
When someone is looking at cloud to find MapReduce to process your large amount of data, I think this is what you are looking for:
- A collection of machines which are Hadoop/MapReduce ready and instant available
- You just don’t want to build Hadoop(HDFS/MapReduce) instances from scratch because there are several IaaS service available give you hundreds of machines in cloud however building a Hadoop cluster will be nightmare.
- It means you just need to hook your data and push MapReduce jobs immediately
- Being in cloud, means you just want to harvest the power of thousands of machines available in cloud “instantly” and want to pay the cost of CPU usage per hour you will consume.
Here are a few options which are available now, which I tried before writing here:
Apache Hadoop on Windows Azure:
Microsoft also has Hadoop/MapReduce running on Windows Azure but it is under limited CTP, however you can provide your information and request for CTP access at link below:
https://www.hadooponazure.com/
The Developer Preview for the Apache Hadoop- based Services for Windows Azure is available by invitation.
Amazon: Elastic Map Reduce
Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
http://aws.amazon.com/elasticmapreduce/
Google Big Query:
Besides that you can also try Google BigQuery in which you will have to move your data to Google propitiatory Storage first and then run BigQuery on it. Remember BigQuery is based on Dremel which is similar to MapReduce however faster due to column based search processing.
Google BigQuery is invitation only however you sure can request for access:
https://developers.google.com/bigquery/
Mortar Data:
There is another option is to use Mortar Data, as they have used python and pig, intelligently to write jobs easily and visualize the results. I found it very interesting, please have a look:
http://mortardata.com/#!/how_it_works
MapReduce in Cloud
When someone is looking at cloud to find MapReduce to process your large amount of data, I think this is what you are looking for:
- A collection of machines which are Hadoop/MapReduce ready and instant available
- You just don’t want to build Hadoop(HDFS/MapReduce) instances from scratch because there are several IaaS service available give you hundreds of machines in cloud however building a Hadoop cluster will be nightmare.
- It means you just need to hook your data and push MapReduce jobs immediately
- Being in cloud, means you just want to harvest the power of thousands of machines available in cloud “instantly” and want to pay the cost of CPU usage per hour you will consume.
Here are a few options which are available now, which I tried before writing here:
Apache Hadoop on Windows Azure:
Microsoft also has Hadoop/MapReduce running on Windows Azure but it is under limited CTP, however you can provide your information and request for CTP access at link below:
https://www.hadooponazure.com/
The Developer Preview for the Apache Hadoop- based Services for Windows Azure is available by invitation.
Amazon: Elastic Map Reduce
Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
http://aws.amazon.com/elasticmapreduce/
Google Big Query:
Besides that you can also try Google BigQuery in which you will have to move your data to Google propitiatory Storage first and then run BigQuery on it. Remember BigQuery is based on Dremel which is similar to MapReduce however faster due to column based search processing.
Google BigQuery is invitation only however you sure can request for access:
https://developers.google.com/bigquery/
Mortar Data:
There is another option is to use Mortar Data, as they have used python and pig, intelligently to write jobs easily and visualize the results. I found it very interesting, please have a look:
http://mortardata.com/#!/how_it_works
Programmatically retrieving Task ID and Unique Reducer ID in MapReduce
For each Mapper and Reducer you can get Task attempt id and Task ID both. This can be done when you set up your map using the Context object. You may also know that the when setting a Reducer an unique reduce ID is used inside reducer class setup method. You can get this ID as well.
There are multiple ways you can get this info:
1. Using JobConf Class.
- JobConf.get(“mapred.task.id”) will provide most of the info related with Map and Reduce task along with attempt id.
2. You can use Context Class and use as below:
- To get task attempt ID – context.getTaskAttemptID()
- Reducer Task ID – Context.getTaskAttemptID().getTaskID()
- Reducer Number – Context.getTaskAttemptID().getTaskID().getId()
Keyword: Hadoop, Map/Reduce, Jobs Performance, Mapper, Reducer
Resource Allocation Model in MapReduce 2.0
What was available in previous MapReduce:
- Each node in the cluster was statically assigned the capability of running a predefined number of Map slots and a predefined number of Reduce slots.
- The slots could not be shared between Maps and Reduces. This static allocation of slots wasn’t optimal since slot requirements vary during the MR job life cycle
- In general there is a demand for Map slots when the job starts, as opposed to the need for Reduce slots towards the end
Key drawback in previous MapReduce:
- In a real cluster, where jobs are randomly submitted and each has its own Map/Reduce slots requirement, having an optimal utilization of the cluster was hard, if not impossible.
What is new in MapReduce 2.0:
- The resource allocation model in Hadoop 0.23 addresses above (Key drawback) deficiency by providing a more flexible resource modeling.
- Resources are requested in the form of containers, where each container has a number of non-static attributes.
- At the time of writing this blog, the only supported attribute was memory (RAM). However, the model is generic and there is intention to add more attributes in future releases (e.g. CPU and network bandwidth).
- In this new Resource Management model, only a minimum and a maximum for each attribute are defined, and Application Master (AMs) can request containers with attribute values as multiples of these minimums.
Credit: http://www.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/




