Archive for December, 2011

Hadoop WordCount Compilation errors related with OutputCollector, setInputPath, setOutputPath

December 31, 2011 3 comments

If you have tried the Hadoop WordCount sample job available in many older tutorials, you may have hit compilation errors like those below:

Older Code:

package org.myorg;
import java.io.Exception;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {
    public static class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(MapClass.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setInputPath(new Path(args[1]));
        conf.setOutputPath(new Path(args[2]));
        JobClient.runJob(conf);
    }
}

When we compile this code, we hit the following errors:

WordCount.java:2: error: cannot find symbol
import java.io.Exception;
^
symbol: class Exception
location: package java.io
WordCount.java:14: error: cannot find symbol
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

^
symbol: class IOException
location: class MapClass
WordCount.java:25: error: cannot find symbol
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

^
symbol: class IOException
location: class Reduce
WordCount.java:44: error: cannot find symbol
conf.setInputPath(new Path(args[1]));
^
symbol: method setInputPath(Path)
location: variable conf of type JobConf
WordCount.java:45: error: cannot find symbol
conf.setOutputPath(new Path(args[2]));
^
symbol: method setOutputPath(Path)
location: variable conf of type JobConf
Note: WordCount.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
5 errors

This is because the old sample code is based on an older Hadoop distribution. The problem occurs when you use a 0.20.x or newer Hadoop distribution. In my case, I was using 0.20.203.1, as below:

C:\Azure\Java>C:\Apps\java\openjdk7\bin\javac -classpath c:\Apps\dist\hadoop-core-0.20.203.1-SNAPSHOT.jar -d . WordCount.java

To solve this problem, you need to change your code as below:

package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvkashWordCount {
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        // Note: the new API passes an Iterable (not an Iterator) of values;
        // with the wrong parameter type this method would not override the
        // base class's reduce and the job would emit unaggregated output.
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(AvkashWordCount.class);
        job.setJobName("avkashwordcountjob");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(AvkashWordCount.Map.class);
        job.setCombinerClass(AvkashWordCount.Reduce.class);
        job.setReducerClass(AvkashWordCount.Reduce.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

The code above has been tested with Hadoop 0.20.x and later distributions.
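If you just want to sanity-check the word-count logic itself, you can do it without a cluster. The following plain-Java sketch (my own illustration, not part of the Hadoop API) applies the same tokenize-then-sum counting that the mapper and reducer perform, all in one in-memory map:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Standalone sketch of the word-count logic from the MapReduce job above,
// runnable without Hadoop.
public class WordCountLogic {

    // Tokenize the input (as the mapper does) and sum a count of 1 per
    // occurrence of each word (as the combiner/reducer does).
    public static Map<String, Integer> wordCount(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            Integer current = counts.get(word);
            counts.put(word, current == null ? 1 : current + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordCount("the quick fox jumps over the lazy fox");
        System.out.println(counts.get("the")); // prints 2
        System.out.println(counts.get("fox")); // prints 2
        System.out.println(counts.get("jumps")); // prints 1
    }
}
```

Hadoop distributes exactly this computation: the mapper emits each token with a count of 1, and the reducer sums the counts for each distinct key.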

 


Categories: Hadoop

Hadoop and Big Data Resources

December 31, 2011 Leave a comment

6 reasons why 2012 could be the year of Hadoop
http://gigaom.com/cloud/six-reasons-why-2012-could-be-the-year-of-hadoop/

Defining Hadoop: the Players, Technologies and Challenges of 2011
http://pro.gigaom.com/2011/03/defining-hadoop-the-players-technologies-and-challenges-of-2011/

The Hadoop project includes these subprojects: Hadoop Common, HDFS (the Hadoop Distributed File System), and Hadoop MapReduce.

Other Hadoop-related projects at Apache include:

  • Avro™: A data serialization system.
  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Chukwa™: A data collection system for managing large distributed systems.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  • Mahout™: A scalable machine learning and data mining library.
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • ZooKeeper™: A high-performance coordination service for distributed applications.
Categories: Hadoop

Apache Hadoop on Windows Azure Blog Series

December 31, 2011 Leave a comment

For those who are looking to run Hadoop on Windows Azure, the CTP is already out, and you can register yourself to be in line to use the Hadoop on Azure CTP. More information in this regard is available here:

For new Hadoop users, I have created a few blog entries to explain various aspects of Hadoop on Windows Azure, as below:

Apache Hadoop on Windows Azure Part 1- Creating a new Windows Azure Cluster for Hadoop Job

http://blogs.msdn.com/b/avkashchauhan/archive/2011/12/28/apache-hadoop-on-windows-azure-part-1-creating-a-new-windows-azure-cluster-for-hadoop-job.aspx

Apache Hadoop on Windows Azure Part 2 – Creating a Pi Estimator Hadoop Job

http://blogs.msdn.com/b/avkashchauhan/archive/2011/12/29/apache-hadoop-on-windows-azure-part-2-creating-a-pi-estimator-hadoop-job.aspx

Apache Hadoop on Windows Azure Part 3 – Creating a Word Count Hadoop Job with a few twists

http://blogs.msdn.com/b/avkashchauhan/archive/2011/12/29/apache-hadoop-on-windows-azure-part-3-creating-a-word-count-hadoop-job-with-a-few-twists.aspx

Apache Hadoop on Windows Azure Part 4- Remote Login to Hadoop node for MapReduce Job and HDFS administration

http://blogs.msdn.com/b/avkashchauhan/archive/2011/12/29/apache-hadoop-on-windows-azure-part-4-remote-login-to-hadoop-node-for-mapreduce-job-and-hdfs-administration.aspx

Apache Hadoop on Windows Azure Part 5 – Running 10GB Sort Hadoop Job with Teragen, TeraSort and TeraValidate Options

http://blogs.msdn.com/b/avkashchauhan/archive/2011/12/30/apache-hadoop-on-windows-azure-part-5-running-10gb-sort-hadoop-job-with-teragen-terasort-and-teravalidate-options.aspx

Apache Hadoop on Windows Azure Part 6 – Running 10GB Sort Hadoop Job with TeraSort Option and understanding MapReduce Job administration

http://blogs.msdn.com/b/avkashchauhan/archive/2011/12/30/apache-hadoop-on-windows-azure-part-6-running-10gb-sort-hadoop-job-with-terasort-option-and-understanding-mapreduce-job-administration.aspx

Categories: Hadoop on Azure

Resources for using Hadoop with different programming technologies and environments:

December 16, 2011 Leave a comment

Writing a hadoop mapreduce program in python

http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

What is functional Programming (2008)

http://www.fincher.org/tips/General/SoftwareEngineering/FunctionalProgramming.shtml

Functional Programming in C# 3.0: How Map/Reduce/Filter can Rock your World

http://www.25hoursaday.com/weblog/2008/06/16/FunctionalProgrammingInC30HowMapReduceFilterCanRockYourWorld.aspx

Writing a Hadoop MapReduce Program in PHP

http://www.lunchpauze.com/2007/10/writing-hadoop-mapreduce-program-in-php.html

Programming with Hadoop – A Hands on Introduction

http://ebiquity.umbc.edu/event/html/id/408/Programming-with-Hadoop-A-Hands-On-Introduction

Hadoop Example program on a single node cluster using Hadoop Streaming API:

http://pages.cs.brandeis.edu/~cs147a/lab/hadoop-example/

An Introduction to Parallel Programming with MapReduce

http://cnx.org/content/m20644/latest/

The elephant in the room … Hadoop and BigData!

http://mikethetechie.com/post/6822576191/the-elephant-in-the-room-hadoop-and-bigdata

Writing your first map-reduce program on hadoop (Java, 2010)

http://jayant7k.blogspot.com/2010/06/writing-your-first-map-reduce-program.html

Hadoop Streaming Made Simple using Joins and Keys with Python

http://allthingshadoop.com/2010/12/16/simple-hadoop-streaming-tutorial-using-joins-and-keys-with-python/

Categories: Hadoop

Scientific Computing in Cloud using MATLAB and Windows Azure

December 14, 2011 Leave a comment

There are a couple of things you could do when planning to run MATLAB on Windows Azure. Here I will provide a few resources to get you started. I am also working on a sample based on the Windows Azure SDK 1.6 and the MATLAB Runtime 7.1 installer, which I will release shortly.

Understanding Windows Azure HPC Scheduler SDK and how it works in general with Windows Azure:

Getting Started with Application Deployment with the Windows Azure HPC Scheduler (Document Walkthrough)

Windows Azure HPC Scheduler code sample: Overview (Video instructions – part 1)

Watch Video about running Windows Azure HPC Scheduler code sample: Publish the service (Video instructions – Part 2)

Step-by-step hands-on training guide to use Windows HPC with burst to Windows Azure:

Learn more about Message Passing Interface (MPI): MPI is a platform-independent standard for messaging between HPC nodes. Microsoft MPI (MS MPI) is the MPI implementation used for MPI applications executed by Windows HPC Server 2008 R2 SP2. Integration of Windows HPC Server 2008 R2 SP2 with Windows Azure supports running MPI applications on Windows Azure nodes.

You can use Windows Azure HPC Scheduler (follow link below for more info)

Using MATLAB with Windows Azure HPC Scheduler SDK:

In the MATLAB Parallel Computing Toolbox, you can find MATLAB's MPI implementation, which is based on MPICH2.

The Windows Azure HPC Scheduler allows spawning worker nodes on which MPI jobs can be run. With a local head node, and for compute-bound workloads, you could:

  • have MATLAB cloud-burst to Azure via MPI
  • use a local non-MATLAB WS2008R2 master node and run MPI jobs using common numeric libraries

Installing MATLAB Runtime with your Windows Azure application:

To install the MCR (MATLAB Compiler Runtime) in Windows Azure, you could do the following:

  1. Create a startup task to download the MCR zip and then install it.
    1. Learn more about startup tasks here
    2. You can use the Azure Bootstrapper application to download and install it very easily
  2. If you are using a worker role, you can set a specific application as the role entry point.

Other Useful Resources:

Some case studies from Curtin University, Australia, using MATLAB and Windows Azure:

A number of Microsoft Research projects, especially ModisAzure, use MATLAB on Azure, which you can find at the link below:

A presentation by Techila at Microsoft TechDays 2011 showed MATLAB code running in Windows Azure.

Categories: Computing and Cloud

If you think "Cloud Computing" and "Cloud Services" are hype, think again!!

December 2, 2011 Leave a comment

For the last couple of years, cloud services and cloud computing have been the highlight of technology columns, and everyone who considers themselves tech-savvy claims to understand these two terms very well. From CTOs and CIOs all the way down to IT staff, everyone is being thrown lots and lots of details about cloud services. Terms like SaaS, PaaS, IaaS, private cloud, and public cloud are the headlines of tech magazines and internet columns. Amazon, Microsoft, Google, IBM, HP, Rackspace, Dropbox, and many more companies are firing on all cylinders to compete with and outsmart each other; however, there are still skeptics who believe that it is all hype and there is nothing serious behind cloud services or cloud computing.

For those who think this is all hype, let me start by mentioning some big names and what they are doing in the cloud services arena:

It is true that every company in cloud services is trying to define cloud computing and cloud services in its own way. Still, there are clear frontrunners: many players are working hard to catch up with them, while those with the resources and capacity to do something new are trying to do both. Amazon has a clear lead in the overall cloud services business, but companies like Microsoft and Google are head to head with it in certain areas and trying to innovate in others. Rackspace, VMware, Red Hat, Heroku, and AppFog are also a few big and established companies in cloud services.

Here is the very positive market outlook for cloud services:

 

IDC has rated the five-year annual growth outlook for cloud services at 26%, some six times the rate of traditional IT and forecast to be worth almost $45 billion by 2013 when it will account for 10% of all IT revenue. Of the $27 billion in net new IT revenue in 2013, 27% will come from IT cloud services, it said. Gartner estimated that the cloud market would reach $150 billion in 2013, a figure that includes a much broader set of revenue streams than IDC and includes the likes of cloud-based advertising, e-commerce, human resources and payments processing.

Here is the truth about cloud services:

Cloud computing is where infrastructure and services merge into a kind of utility that is consumed by users over the internet, without any consideration of where and how it originated. Consumers simply utilize what is provided and pay a fee to use it. Regular consumption of these utilities keeps the price flat and saves a lot of money in the longer term. Consumers don't need to know how the service is created or what it takes to run it; they just use it as required, similar to electricity in our homes and businesses.

Cloud services are built on top of a cloud computing framework; however, depending on how consumers want to consume these services, there are various types of cloud services available to choose from.

When you decide to choose a cloud service vendor for yourself or your company, you can get services that offload just your infrastructure, or you can go to the extreme and use a vendor who provides the full software so that you don't need to own anything. Cloud services follow the same layered concept, in which you partially or fully depend on your vendor. Currently, cloud services are categorized into three types:

1. Infrastructure as a Service (IaaS)

2. Platform as a Service (PaaS)

3. Software as a Service (SaaS)

Learn more about cloud services here:

http://cloudcelebrity.wordpress.com/2011/11/22/introduction-to-cloud-services-iaas-paas-saas/


Please send your comments to me, and I would be glad to follow up!!

Categories: Cloud Services