Archive

Archive for the ‘Development’ Category

What to do when compiling Hadoop branch 1.2.x returns java.io.IOException: Cannot run program “autoreconf”

August 12, 2013 1 comment

Compiling Hadoop branch 1.2.x code in OSX returned exception as below:

create-native-configure:

BUILD FAILED
/Users/avkash/work/hadoop/branch-1.2/build.xml:634: Execute failed: java.io.IOException: Cannot run program “autoreconf” (in directory “/Users/avkash/work/hadoop/branch-1.2/src/native”): error=2, No such file or directory
at java.lang.ProcessBuilder.processException(ProcessBuilder.java:478)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:457)
at java.lang.Runtime.exec(Runtime.java:593)
at org.apache.tools.ant.taskdefs.Execute$Java13CommandLauncher.exec(Execute.java:862)
at org.apache.tools.ant.taskdefs.Execute.launch(Execute.java:481)
at org.apache.tools.ant.taskdefs.Execute.execute(Execute.java:495)
at org.apache.tools.ant.taskdefs.ExecTask.runExecute(ExecTask.java:631)
at org.apache.tools.ant.taskdefs.ExecTask.runExec(ExecTask.java:672)
at org.apache.tools.ant.taskdefs.ExecTask.execute(ExecTask.java:498)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:348)
at org.apache.tools.ant.Target.execute(Target.java:390)
at org.apache.tools.ant.Target.performTasks(Target.java:411)
at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1399)
at org.apache.tools.ant.Project.executeTarget(Project.java:1368)
at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
at org.apache.tools.ant.Project.executeTargets(Project.java:1251)
at org.apache.tools.ant.Main.runBuild(Main.java:809)
at org.apache.tools.ant.Main.startAnt(Main.java:217)
at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
at java.lang.ProcessImpl.start(ProcessImpl.java:91)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
… 23 more

Looking src/native there is no “autoconf” shipped with the source so to solve this problem the best option is to install

└─[1] <git:(master✗✈)> brew install automake
==> Installing automake dependency: autoconf
==> Downloading http://ftpmirror.gnu.org/autoconf/autoconf-2.69.tar.gz
######################################################################## 100.0%
==> ./configure –prefix=/usr/local/Cellar/autoconf/2.69
==> make install
🍺 /usr/local/Cellar/autoconf/2.69: 69 files, 2.0M, built in 21 seconds
==> Installing automake
==> Downloading http://ftpmirror.gnu.org/automake/automake-1.14.tar.gz
######################################################################## 100.0%
==> ./configure –prefix=/usr/local/Cellar/automake/1.14
==> make install
🍺 /usr/local/Cellar/automake/1.14: 127 files, 2.5M, built in 7 seconds

Thats all!!

After than just run the following commands to make sure the same exception does not come back:

$ant create-native-configure

Categories: Development, Hadoop, Scripting

Customized bash command prompt with line separator and other goodies


I wanted to have a fancy looking and very useful terminal windows with customize command prompt so after digging I build something as below for me:

Image

So what it have:

  • Line Separator including current time at the end of the terminal
  • History counter along with current command counter
  • Logged user @ Hostname
  • Current working folder $

Here is what I have done. First created a file call .avkashbash_profile at my $HOME folder as below:

fill=”-”
reset_style=’\[33[00m\]‘
status_style=$reset_style’\[33[0;32m\]‘ # gray color; use 0;37m for lighter color
prompt_style=$reset_style
command_style=$reset_style’\[33[0;32m\]‘ # bold black
# Prompt variable:
PS1=”$status_style”‘$fill[\T]\n’”$prompt_style”‘${debian_chroot:+($debian_chroot)}\e[0;31m\e[47m[\!:\#]\e[
0m-\e[1;33m[\u@\h]\e[0m\n[\w]\e[0;32m$ '"$command_style "
# Reset color for command output
# (this one is invoked every time before a command is executed):
function prompt_command {
# create a $fill of all screen width minus the time string and a space:
let fillsize=${COLUMNS}-11
fill=""
while [ "$fillsize" -gt "0" ]
do
fill=”-${fill}” # fill with underscores to work on
let fillsize=${fillsize}-1
done
}
PROMPT_COMMAND=prompt_command

To make the setting permanent, just add the following code in .bash_profile first:

if [ -f "$HOME/.avkashbash_ps1" ]; then
. “$HOME/.avkashbash_ps1″
fi

And then run the following command to set it:

$sources .bash_profile

Or if you don’t want to make it permanent, just add the following code to .bashrc first:

if [ -f "$HOME/.avkashbash_ps1" ]; then
. “$HOME/.avkashbash_ps1″
fi

And then run the following command to set it:

$sources .bashrc

Thats all.

Thanks for the guys at here and here!!

Brew on Mac – Just 3 steps and you are ready


Step 1:

MachineHead:docs avkash$ ruby -e “$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)”
==> This script will install:
/usr/local/bin/brew
/usr/local/Library/…
/usr/local/share/man/man1/brew.1
==> The following directories will be made group writable:
/usr/local/.
/usr/local/bin
/usr/local/etc
/usr/local/lib
/usr/local/share
/usr/local/share/man
/usr/local/share/man/man1
/usr/local/share/info
==> The following directories will have their group set to admin:
/usr/local/.
/usr/local/bin
/usr/local/etc
/usr/local/lib
/usr/local/share
/usr/local/share/man
/usr/local/share/man/man1
/usr/local/share/info

Press ENTER to continue or any other key to abort
==> /usr/bin/sudo /bin/chmod g+rwx /usr/local/. /usr/local/bin /usr/local/etc /usr/local/lib /usr/local/share /usr/local/share/man /usr/local/share/man/man1 /usr/local/share/info
Password:
==> /usr/bin/sudo /usr/bin/chgrp admin /usr/local/. /usr/local/bin /usr/local/etc /usr/local/lib /usr/local/share /usr/local/share/man /usr/local/share/man/man1 /usr/local/share/info
==> Downloading and Installing Homebrew…
remote: Counting objects: 121792, done.
remote: Compressing objects: 100% (59383/59383), done.
remote: Total 121792 (delta 85372), reused 95223 (delta 61439)
Receiving objects: 100% (121792/121792), 19.29 MiB | 283 KiB/s, done.
Resolving deltas: 100% (85372/85372), done.
From https://github.com/mxcl/homebrew
* [new branch] master -> origin/master
HEAD is now at c45c77d audit: don’t complain about bottle versions.
==> Installation successful!
You should run `brew doctor’ *before* you install anything.
Now type: brew help

Step 2: 

MachineHead:docs avkash$ brew doctor
Your system is ready to brew.

Step 3:

MachineHead:docs avkash$ brew help
Example usage:
brew [info | home | options ] [FORMULA...]
brew install FORMULA…
brew uninstall FORMULA…
brew search [foo]
brew list [FORMULA...]
brew update
brew upgrade [FORMULA...]

Troubleshooting:
brew doctor
brew install -vd FORMULA
brew [--env | --config]

Brewing:
brew create [URL [--no-fetch]]
brew edit [FORMULA...]
open https://github.com/mxcl/homebrew/wiki/Formula-Cookbook

Further help:
man brew
brew home

Thats all!!

Categories: Development, Hadoop

Apache Weave: Big Data Application runtime and development framework by Continuuity


Continuuity decided to build Weave and be part of the journey to take Apache YARN to the next level of usability and functionality. Continuuity has been using Weave extensively to support their  products and  seen the benefit and power of Apache YARN and Weave combined.  Continuuity decided to share Weave under the Apache 2.0 license in an effort to collaborate with members of the community, broaden the set of applications and patterns that Weave supports, and further the overall adoption of Apache YARN.

Weave is NOT a replacement for Apache YARN.  It is instead a value-added framework that operates on top of Apache YARN.

What is Weave:  Weave is a simple set of libraries that allows you to easily manage distributed applications through an abstraction layer built on Apache YARN. Weave allows you to use YARN’s distributed capabilities with a programming model that is similar to running threads.

Features of Weave:
- Simple API for specifying, running and managing application lifecycle
- An easy way to communicate with an application or parts of an application
- A generic Application Master to better support simple applications
- Simplified archive management and local file transport
- Improved control over application logs, metrics and errors
- Discovery service
- And many more…

Weave Source code is available on github at http://github.com/continuuity/weave under the Apache 2.0 License.

Learn more at http://www.continuuity.com/.

Keyword: Hadoop, Yarn, MapReduce, Big Data

Processing unstructured content from a URL in R

February 12, 2013 Leave a comment

R has a built in function name readLines() which read a local file or an URL to read content line by line.

For example my blog URL is https://cloudcelebrity.wordpress.com so lets read it:

> myblog <- readLines(“https://cloudcelebrity.wordpress.com&#8221;)
Warning message:
In readLines(“https://cloudcelebrity.wordpress.com&#8221;) :
incomplete final line found on ‘https://cloudcelebrity.wordpress.com&#8217;

> length(myblog)
[1] 1380

As you can see above there is a warning message even when the myblog does have all the lines in it. To disable this warning message we can use “warn=FALSE” as below:

> myblog <- readLines(“https://cloudcelebrity.wordpress.com&#8221;, warn=FALSE)

> length(myblog)
[1] 1380

And above there are no warning.. if I would want to print special lines based on line number I can just call the

> myblog[100]
[1] “<script type=’text/javascript’>/*<![CDATA[*/if(typeof(addLoadEvent)!='undefined'){addLoadEvent(function(){if(top==self){i=document.createElement('img');i.src='http://botd.wordpress.com/botd.gif?blogid=29647980&postid=0&lang=1&date=1359531235&ip=98.203.199.229&url=https://cloudcelebrity.wordpress.com/&loc='+document.location;i.style.width='0px';i.style.height='0px';i.style.overflow='hidden';document.body.appendChild(i);}});}/*]]>*/</script>”

Lets get the summary:

> summary(myblog)
Length Class Mode
1380 character character

To read only limited lines in the same URL , I can also pass the total line limit as below:

> myblog <- readLines(“https://cloudcelebrity.wordpress.com&#8221;, 100, warn=FALSE)
> summary(myblog)
Length Class Mode
100 character character

After I read  all the lines in my blog, lets perform some specific search operation in the content:

Searching all lines with Hadoop or hadoop in it: 

To search all the lines which have Hadoop or hadoop in it we can run grep command to find all the line numbers as below:

> hd <- grep(“[hH]adoop”, myblog)

Lets print hd to see all the line numbers:
> hd
[1] 706 803 804 807 811 812 814 819 822 823 826 827 830 834
[15] 837 863 869 871 872 875 899 911 912 921 923 925 927 931
[29] 934 1000 1010 1011 1080 1278 1281

To print all the lines with Hadoop or hadoop in it we can just use:

> myblog[hd]
[1] “<p>A: ACID – Atomicity, Consistency, Isolation and Durability <br />B: Big Data – Volume, Velocity, Variety <br />C: Columnar (or Column-Oriented) Database <br />D: Data Warehousing – Relevant and very useful <br />E: ETL – Extract, transform and load <br />F: Flume – A framework for populating Hadoop with data <br />G: Geospatial Analysis – A picture worth 1,000 words or more <br />H: Hadoop, HDFS, HBASE – Do you really want to know? <br />I:  In-Memory Database – A new definition of superfast access <br />J: Java – Hadoop gave biggest push in last years to stay in enterprise market <br />K: Kafka – High-throughput, distributed messaging system originally developed at LinkedIn <br />L: Latency – Low Latency and High Latency <br />M: Map/Reduce – MapReduce <br />N:  NoSQL Databases – No SQL Database or Not Only SQL <br />O: Oozie – Open-source workflow engine managing Hadoop job processing <br />P: Pig – Platform for analyzing huge data sets <br />Q: Quantitative Data Analysis <br />R: Relational Database – Still relevant and will be for some time <br />S: Sharding (Database Partitioning)  and Sqoop (SQL Database to Hadoop) <br />T: Text Analysis – Larger the information, more needed analysis <br />U: Unstructured Data – Growing faster than speed of thoughts <br />V: Visualization – Important to keep the information relevant <br />W: Whirr – Big Data Cloud Services i.e. Hadoop distributions by cloud vendors <br />X:  XML – Still eXtensible and no Introduction needed <br />Y: Yottabyte – Equal to 1,000 exabytes, 1 million petabytes and 1 billion terabytes <br />Z: Zookeeper – Help managing Hadoop nodes across a distributed network </p>”
[2] “\t\t\t<div class=\”post-382 post type-post status-publish format-standard hentry category-big-data category-data-analysis category-hadoop tag-hadoop-bigdata-disney-data-platform\” id=\”post-382\”>”
[3] “\t\t\t<h2><a class=\”title\” href=\”https://cloudcelebrity.wordpress.com/2012/11/13/how-hadoop-is-shaping-up-at-disney-world/\” rel=\”bookmark\”>How Hadoop is shaping up at Disney&nbsp;World?</a></h2>”
[4] “\t\t\t\t<span class=\”author\”><a href=\”https://cloudcelebrity.wordpress.com/author/cloudcelebrity/\” title=\”Posts by cloudcelebrity\” rel=\”author\”>cloudcelebrity</a></span>\t\t\t\t\t\t\t\t<span class=\”comments\”><a href=\”https://cloudcelebrity.wordpress.com/2012/11/13/how-hadoop-is-shaping-up-at-disney-world/#respond\” title=\”Comment on How Hadoop is shaping up at Disney&nbsp;World?\”>Leave a comment</a></span>”
[5] “\t\t\t\t<div class=\”pd-rating\” id=\”pd_rating_holder_5386869_post_382\” data-settings=\”{&quot;id&quot;:5386869,&quot;item_id&quot;:&quot;_post_382&quot;,&quot;settings&quot;:&quot;{\\&quot;id\\&quot;:5386869,\\&quot;unique_id\\&quot;:\\&quot;wp-post-382\\&quot;,\\&quot;title\\&quot;:\\&quot;How%20Hadoop%20is%20shaping%20up%20at%20Disney%26nbsp%3BWorld%3F\\&quot;,\\&quot;permalink\\&quot;:\\&quot;http:\\\\\\/\\\\\\/cloudcelebrity.wordpress.com\\\\\\/2012\\\\\\/11\\\\\\/13\\\\\\/how-hadoop-is-shaping-up-at-disney-world\\\\\\/\\&quot;,\\&quot;item_id\\&quot;:\\&quot;_post_382\\&quot;}&quot;}\”></div><br/><p> </p>”

……………….

………………

[34] “PDRTJS_settings_5386869_post_412={\”id\”:5386869,\”unique_id\”:\”wp-post-412\”,\”title\”:\”Merging%20two%20data%20set%20in%20R%20based%20on%20one%20common%26nbsp%3Bcolumn\”,\”permalink\”:\”http:\\/\\/cloudcelebrity.wordpress.com\\/2013\\/01\\/30\\/merging-two-data-set-in-r-based-on-one-common-column\\/\”,\”item_id\”:\”_post_412\”}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_412 == ‘undefined’ ){PDRTJS_5386869_post_412 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_412 );}}PDRTJS_settings_5386869_post_409={\”id\”:5386869,\”unique_id\”:\”wp-post-409\”,\”title\”:\”Working%20with%20dataset%20in%20R%20and%20using%20subset%20to%20work%20on%26nbsp%3Bdataset\”,\”permalink\”:\”http:\\/\\/cloudcelebrity.wordpress.com\\/2013\\/01\\/30\\/working-with-dataset-in-r-and-using-subset-to-work-on-dataset\\/\”,\”item_id\”:\”_post_409\”}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_409 == ‘undefined’ ){PDRTJS_5386869_post_409 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_409 );}}PDRTJS_settings_5386869_post_398={\”id\”:5386869,\”unique_id\”:\”wp-post-398\”,\”title\”:\”Listing%20base%20datasets%20in%20R%20and%20loading%20as%20Data%26nbsp%3BFrame\”,\”permalink\”:\”http:\\/\\/cloudcelebrity.wordpress.com\\/2013\\/01\\/19\\/listing-base-datasets-in-r-and-loading-as-data-frame\\/\”,\”item_id\”:\”_post_398\”}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_398 == ‘undefined’ ){PDRTJS_5386869_post_398 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_398 );}}PDRTJS_settings_5386869_post_397={\”id\”:5386869,\”unique_id\”:\”wp-post-397\”,\”title\”:\”ABC%20of%20Data%26nbsp%3BScience\”,\”permalink\”:\”http:\\/\\/cloudcelebrity.wordpress.com\\/2013\\/01\\/01\\/abc-of-data-science\\/\”,\”item_id\”:\”_post_397\”}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_397 == ‘undefined’ ){PDRTJS_5386869_post_397 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_397 );}}PDRTJS_settings_5386869_post_390={\”id\”:5386869,\”unique_id\”:\”wp-post-390\”,\”title\”:\”R%20Programming%20Language%20%28Installation%20and%20configuration%20on%26nbsp%3BWindows%29\”,\”permalink\”:\”http:\\/\\/cloudcelebrity.wordpress.com\\/2012\\/12\\/18\\/r-programming-language-installation-and-configuration-on-windows\\/\”,\”item_id\”:\”_post_390\”}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_390 == ‘undefined’ ){PDRTJS_5386869_post_390 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_390 );}}PDRTJS_settings_5386869_post_382={\”id\”:5386869,\”unique_id\”:\”wp-post-382\”,\”title\”:\”How%20Hadoop%20is%20shaping%20up%20at%20Disney%26nbsp%3BWorld%3F\”,\”permalink\”:\”http:\\/\\/cloudcelebrity.wordpress.com\\/2012\\/11\\/13\\/how-hadoop-is-shaping-up-at-disney-world\\/\”,\”item_id\”:\”_post_382\”}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_382 == ‘undefined’ ){PDRTJS_5386869_post_382 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_382 );}}PDRTJS_settings_5386869_post_376={\”id\”:5386869,\”unique_id\”:\”wp-post-376\”,\”title\”:\”Hadoop%20Adventures%20with%20Microsoft%26nbsp%3BHDInsight\”,\”permalink\”:\”http:\\/\\/cloudcelebrity.wordpress.com\\/2012\\/11\\/03\\/hadoop-adventures-with-microsoft-hdinsight\\/\”,\”item_id\”:\”_post_376\”}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_376 == ‘undefined’ ){PDRTJS_5386869_post_376 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_376 );}}”
[35] “\t\tWPCOM_sharing_counts = {\”http:\\/\\/cloudcelebrity.wordpress.com\\/2013\\/01\\/30\\/merging-two-data-set-in-r-based-on-one-common-column\\/\”:412,\”http:\\/\\/cloudcelebrity.wordpress.com\\/2013\\/01\\/30\\/working-with-dataset-in-r-and-using-subset-to-work-on-dataset\\/\”:409,\”http:\\/\\/cloudcelebrity.wordpress.com\\/2013\\/01\\/19\\/listing-base-datasets-in-r-and-loading-as-data-frame\\/\”:398,\”http:\\/\\/cloudcelebrity.wordpress.com\\/2013\\/01\\/01\\/abc-of-data-science\\/\”:397,\”http:\\/\\/cloudcelebrity.wordpress.com\\/2012\\/12\\/18\\/r-programming-language-installation-and-configuration-on-windows\\/\”:390,\”http:\\/\\/cloudcelebrity.wordpress.com\\/2012\\/11\\/13\\/how-hadoop-is-shaping-up-at-disney-world\\/\”:382,\”http:\\/\\/cloudcelebrity.wordpress.com\\/2012\\/11\\/03\\/hadoop-adventures-with-microsoft-hdinsight\\/\”:376}\t</script>”

Above I have just removed the lines in middle to show the result snippet.

In the above If I try to collect lines between 553 .. 648, there is list of all dataset in R, so to collect I can do the following:

> myLines <- myblog[553:648]
> summary(myLines)
Length Class Mode
96 character character

Note: Above mylines character list has total 110 lines so you can try printing and see what you get.

Create a list of available dataset from above myLines vector: 

The pattern in myLines is as below:

[1] “AirPassengers Monthly Airline Passenger Numbers 1949-1960″
[2] “BJsales Sales Data with Leading Indicator”
[3] “BOD Biochemical Oxygen Demand”
[4] “CO2 Carbon Dioxide Uptake in Grass Plants”

……….

………

[92] “treering Yearly Treering Data, -6000-1979″
[93] “trees Girth, Height and Volume for Black Cherry Trees”
[94] “uspop Populations Recorded by the US Census”
[95] “volcano Topographic Information on Auckland’s Maunga”
[96] ” Whau Volcano”

so the first word is dataset name and after the space is the dataset information. To get the dataset name only lets use sub function as below:

> dsName <- sub(” .*”, “”, myLines)
> dsName
[1] “AirPassengers” “BJsales” “BOD”
[4] “CO2″ “ChickWeight” “DNase”
[7] “EuStockMarkets” “” “Formaldehyde”
[10] “HairEyeColor” “Harman23.cor” “Harman74.cor”
[13] “Indometh” “InsectSprays” “JohnsonJohnson”
[16] “LakeHuron” “LifeCycleSavings” “Loblolly”
[19] “Nile” “Orange” “OrchardSprays”
[22] “PlantGrowth” “Puromycin” “Theoph”
[25] “Titanic” “ToothGrowth” “”
[28] “UCBAdmissions” “UKDriverDeaths” “UKLungDeaths”
[31] “UKgas” “USAccDeaths” “USArrests”
[34] “USJudgeRatings” “” “USPersonalExpenditure”
[37] “VADeaths” “WWWusage” “WorldPhones”
[40] “ability.cov” “airmiles” “”
[43] “airquality” “anscombe” “”
[46] “attenu” “attitude” “austres”
[49] “” “beavers” “cars”
[52] “chickwts” “co2″ “crimtab”
[55] “datasets-package” “discoveries” “esoph”
[58] “euro” “eurodist” “faithful”
[61] “freeny” “infert” “”
[64] “iris” “islands” “lh”
[67] “longley” “lynx” “morley”
[70] “mtcars” “nhtemp” “nottem”
[73] “” “occupationalStatus” “precip”
[76] “presidents” “pressure” “”
[79] “quakes” “randu” “”
[82] “rivers” “rock” “sleep”
[85] “stackloss” “state” “sunspot.month”
[88] “sunspot.year” “sunspots” “swiss”
[91] “” “treering” “trees”
[94] “uspop” “volcano” “”

Next work item:  mylines does has a few empty item so we can clean the array.

 

Note: Readline in R is used to prompt user to input something in console.

 

Categories: Data Analysis, Development, R

Merging two data set in R based on one common column

January 30, 2013 Leave a comment

Let’s create a new dataset using mtcars dataset and only mpg and hp column:

> cars.mpg <- subset(mtcars, select = c(mpg, hp))
 
> cars.mpg
                     mpg  hp
Mazda RX4           21.0 110
Mazda RX4 Wag       21.0 110
Datsun 710          22.8  93
Hornet 4 Drive      21.4 110
Hornet Sportabout   18.7 175
Valiant             18.1 105
Duster 360          14.3 245
Merc 240D           24.4  62
Merc 230            22.8  95
Merc 280            19.2 123
Merc 280C           17.8 123
Merc 450SE          16.4 180
Merc 450SL          17.3 180
Merc 450SLC         15.2 180
…………..

Let’s create another dataset using mtcars dataset and only hp and cyl column:

> cars.cyl <- subset(mtcars, select = c(hp,cyl))
> cars.cyl
                     hp cyl
Mazda RX4           110   6
Mazda RX4 Wag       110   6
Datsun 710           93   4
Hornet 4 Drive      110   6
Hornet Sportabout   175   8
Valiant             105   6
Duster 360          245   8
Merc 240D            62   4
Merc 230             95   4
Merc 280            123   6
Merc 280C           123   6
Merc 450SE          180   8
Merc 450SL          180   8
Merc 450SLC         180   8
…………………..

Now we can merge both dataset based on common column hp as below:

 
> merge.ds <- merge(cars.mpg, cars.cyl, by="hp")
> merge.ds
    hp  mpg cyl
1   52 30.4   4
2   62 24.4   4
3   65 33.9   4
4   66 32.4   4
5   66 32.4   4
6   66 27.3   4
7   66 27.3   4
8   91 26.0   4
9   93 22.8   4
10  95 22.8   4
11  97 21.5   4
12 105 18.1   6
13 109 21.4   4
14 110 21.0   6
15 110 21.0   6
16 110 21.0   6
17 110 21.0   6
18 110 21.0   6
19 110 21.0   6
20 110 21.4   6
21 110 21.4   6
22 110 21.4   6
23 113 30.4   4
24 123 17.8   6
25 123 17.8   6
26 123 19.2   6
27 123 19.2   6
28 150 15.2   8
29 150 15.2   8
30 150 15.5   8
31 150 15.5   8
32 175 18.7   8
33 175 18.7   6
34 175 18.7   8
35 175 19.7   8
36 175 19.7   6
37 175 19.7   8
38 175 19.2   8
39 175 19.2   6
40 175 19.2   8
41 180 16.4   8
42 180 16.4   8
43 180 16.4   8
44 180 17.3   8
45 180 17.3   8
46 180 17.3   8
47 180 15.2   8
48 180 15.2   8
49 180 15.2   8
50 205 10.4   8
51 215 10.4   8
52 230 14.7   8
53 245 14.3   8
54 245 14.3   8
55 245 13.3   8
56 245 13.3   8
57 264 15.8   8
58 335 15.0   8

Why you see total 58 merged rows when there were only 32 rows in original data sets? 

This is because "merge is a generic function whose principal method is for data frames: the default method coerces its arguments to data frames and calls the "data.frame" method."
Learn more by calling
?merge
Categories: Data Analysis, Development, R Tags: ,

Working with dataset in R and using subset to work on dataset

January 30, 2013 Leave a comment

Let’s load built in data set

>Library(datasets)

Let’s access one of the dataset name mtcars:

 >mtcars

 

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb

Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4

Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4

Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1

Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1

Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2

Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1

Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4

Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2

Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2

Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4

Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3

Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3

Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3

Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4

Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4

Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4

Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1

Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2

Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1

Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2

AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2

Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4

Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2

Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1

Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2

Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4

Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6

Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

So what is this data set? Running str(_dataset_name) will tell you more about it:

> str(mtcars)
'data.frame':  32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Let’s see the top of the data frame using head

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

You can also see the tail as well:

> tail(mtcars)
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

So how many rows are in this dataset?

> nrow(mtcars)
[1] 32

So how many columns are in this dataset?

> ncol(mtcars)
[1] 11

Ok can we get the name of column headers? Sure

> names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

So let’s see the summary for our dataset:

> summary(mtcars)

      mpg             cyl             disp             hp             drat      
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930  
       wt             qsec             vs               am        
 Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :3.325   Median :17.71   Median :0.0000   Median :0.0000  
 Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062  
 3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000  
      gear            carb      
 Min.   :3.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:2.000  
 Median :4.000   Median :2.000  
 Mean   :3.688   Mean   :2.812  
 3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :5.000   Max.   :8.000

As you can see above the summary give you the mean, media, max and other values for each column.

Using Subset with Dataset in R

How to list data frame with a few columns? Sure we will be using subset() for this:

Subset(dataframe, condition, format)

For example the following subset will show all the rows in mtcrs but only mpg column:

> subset(mtcars, select = c(mpg))
                     mpg
Mazda RX4           21.0
Mazda RX4 Wag       21.0
Datsun 710          22.8
Hornet 4 Drive      21.4
Hornet Sportabout   18.7
Valiant             18.1
Duster 360          14.3
Merc 240D           24.4
Merc 230            22.8
Merc 280            19.2
Merc 280C           17.8
Merc 450SE          16.4
Merc 450SL          17.3
Merc 450SLC         15.2
Cadillac Fleetwood  10.4
Lincoln Continental 10.4
Chrysler Imperial   14.7
Fiat 128            32.4
Honda Civic         30.4
Toyota Corolla      33.9
Toyota Corona       21.5
Dodge Challenger    15.5
AMC Javelin         15.2
Camaro Z28          13.3
Pontiac Firebird    19.2
Fiat X1-9           27.3
Porsche 914-2       26.0
Lotus Europa        30.4
Ford Pantera L      15.8
Ferrari Dino        19.7
Maserati Bora       15.0
Volvo 142E          21.4

Now what if we want to see the list of cars which have mpg above 20  along with mpg and hp column:

> subset(mtcars, mpg > 20, select = c(mpg, hp))
                mpg  hp
Mazda RX4      21.0 110
Mazda RX4 Wag  21.0 110
Datsun 710     22.8  93
Hornet 4 Drive 21.4 110
Merc 240D      24.4  62
Merc 230       22.8  95
Fiat 128       32.4  66
Honda Civic    30.4  52
Toyota Corolla 33.9  65
Toyota Corona  21.5  97
Fiat X1-9      27.3  66
Porsche 914-2  26.0  91
Lotus Europa   30.4 113
Volvo 142E     21.4 109

We can also call subset within subset.

For example we need to get a list of cars with HP above 100 and mpg above 20.

> subset(subset(mtcars, mpg > 20, select = c(mpg, hp)),hp >= 100)
                mpg  hp
Mazda RX4      21.0 110
Mazda RX4 Wag  21.0 110
Hornet 4 Drive 21.4 110
Lotus Europa   30.4 113
Volvo 142E     21.4 109

Note: following will not work as condition will not match and you will see a full list of cars with mpg and hp column
> subset(mtcars, mpg > 20 && hp >= 100, select = c(mpg, hp))

You can also use subset to show specific columns
> subset(mtcars[1:3])
                     mpg cyl  disp
Mazda RX4           21.0   6 160.0
Mazda RX4 Wag       21.0   6 160.0
Datsun 710          22.8   4 108.0
Hornet 4 Drive      21.4   6 258.0
Hornet Sportabout   18.7   8 360.0

…..

Or you can defined the columns name as below:

> subset(mtcars[c("mpg", “cyl”, "hp")])



Categories: Data Analysis, Development, R

Listing base datasets in R and loading as Data Frame

January 19, 2013 Leave a comment

> library(help=”datasets”)

 

Description:

Package:       datasets
Version:       2.15.2
Priority:      base
Title:         The R Datasets Package
Author:        R Core Team and contributors worldwide
Maintainer:    R Core Team <R-core@r-project.org>
Description:   Base R datasets
License:       Part of R 2.15.2
Built:         R 2.15.2; ; 2012-10-26 16:15:19 UTC; unix

Index:

AirPassengers           Monthly Airline Passenger Numbers 1949-1960
BJsales                 Sales Data with Leading Indicator
BOD                     Biochemical Oxygen Demand
CO2                     Carbon Dioxide Uptake in Grass Plants
ChickWeight             Weight versus age of chicks on different diets
DNase                   Elisa assay of DNase
EuStockMarkets          Daily Closing Prices of Major European Stock
                        Indices, 1991-1998
Formaldehyde            Determination of Formaldehyde
HairEyeColor            Hair and Eye Color of Statistics Students
Harman23.cor            Harman Example 2.3
Harman74.cor            Harman Example 7.4
Indometh                Pharmacokinetics of Indomethacin
InsectSprays            Effectiveness of Insect Sprays
JohnsonJohnson          Quarterly Earnings per Johnson & Johnson Share
LakeHuron               Level of Lake Huron 1875-1972
LifeCycleSavings        Intercountry Life-Cycle Savings Data
Loblolly                Growth of Loblolly pine trees
Nile                    Flow of the River Nile
Orange                  Growth of Orange Trees
OrchardSprays           Potency of Orchard Sprays
PlantGrowth             Results from an Experiment on Plant Growth
Puromycin               Reaction Velocity of an Enzymatic Reaction
Theoph                  Pharmacokinetics of Theophylline
Titanic                 Survival of passengers on the Titanic
ToothGrowth             The Effect of Vitamin C on Tooth Growth in
                        Guinea Pigs
UCBAdmissions           Student Admissions at UC Berkeley
UKDriverDeaths          Road Casualties in Great Britain 1969-84
UKLungDeaths            Monthly Deaths from Lung Diseases in the UK
UKgas                   UK Quarterly Gas Consumption
USAccDeaths             Accidental Deaths in the US 1973-1978
USArrests               Violent Crime Rates by US State
USJudgeRatings          Lawyers' Ratings of State Judges in the US
                        Superior Court
USPersonalExpenditure   Personal Expenditure Data
VADeaths                Death Rates in Virginia (1940)
WWWusage                Internet Usage per Minute
WorldPhones             The World's Telephones
ability.cov             Ability and Intelligence Tests
airmiles                Passenger Miles on Commercial US Airlines,
                        1937-1960
airquality              New York Air Quality Measurements
anscombe                Anscombe's Quartet of 'Identical' Simple Linear
                        Regressions
attenu                  The Joyner-Boore Attenuation Data
attitude                The Chatterjee-Price Attitude Data
austres                 Quarterly Time Series of the Number of
                        Australian Residents
beavers                 Body Temperature Series of Two Beavers
cars                    Speed and Stopping Distances of Cars
chickwts                Chicken Weights by Feed Type
co2                     Mauna Loa Atmospheric CO2 Concentration
crimtab                 Student's 3000 Criminals Data
datasets-package        The R Datasets Package
discoveries             Yearly Numbers of Important Discoveries
esoph                   Smoking, Alcohol and (O)esophageal Cancer
euro                    Conversion Rates of Euro Currencies
eurodist                Distances Between European Cities
faithful                Old Faithful Geyser Data
freeny                  Freeny's Revenue Data
infert                  Infertility after Spontaneous and Induced
                        Abortion
iris                    Edgar Anderson's Iris Data
islands                 Areas of the World's Major Landmasses
lh                      Luteinizing Hormone in Blood Samples
longley                 Longley's Economic Regression Data
lynx                    Annual Canadian Lynx trappings 1821-1934
morley                  Michelson Speed of Light Data
mtcars                  Motor Trend Car Road Tests
nhtemp                  Average Yearly Temperatures in New Haven
nottem                  Average Monthly Temperatures at Nottingham,
                        1920-1939
occupationalStatus      Occupational Status of Fathers and their Sons
precip                  Annual Precipitation in US Cities
presidents              Quarterly Approval Ratings of US Presidents
pressure                Vapor Pressure of Mercury as a Function of
                        Temperature
quakes                  Locations of Earthquakes off Fiji
randu                   Random Numbers from Congruential Generator
                        RANDU
rivers                  Lengths of Major North American Rivers
rock                    Measurements on Petroleum Rock Samples
sleep                   Student's Sleep Data
stackloss               Brownlee's Stack Loss Plant Data
state                   US State Facts and Figures
sunspot.month           Monthly Sunspot Data, 1749-1997
sunspot.year            Yearly Sunspot Data, 1700-1988
sunspots                Monthly Sunspot Numbers, 1749-1983
swiss                   Swiss Fertility and Socioeconomic Indicators
                        (1888) Data
treering                Yearly Treering Data, -6000-1979
trees                   Girth, Height and Volume for Black Cherry Trees
uspop                   Populations Recorded by the US Census
volcano                 Topographic Information on Auckland's Maunga
                        Whau Volcano
warpbreaks              The Number of Breaks in Yarn during Weaving
women                   Average Heights and Weights for American Women

Learn More about Dataset:

> str(iris)

‘data.frame’: 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 …
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 …
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 …
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 …
$ Species : Factor w/ 3 levels “setosa”,”versicolor”,..: 1 1 1 1 1 1 1 1 1 1 …

Note: Str will tell you more about dataset i.e. total columns (variables) and total rows (objects)

> summary(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500

Note: Summary will tell you about statistical  details related to dataset i.e. mean, median etc. Both str and summary will help you to learn more about dataset.


Loading a dataset into Data FRame

> dd <- data.frame(iris)

As you can see below that DD is now a data frame,in which the contents are loaded from iris dataset

> str(dd)
‘data.frame’: 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 …
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 …
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 …
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 …
$ Species : Factor w/ 3 levels “setosa”,”versicolor”,..: 1 1 1 1 1 1 1 1 1 1 …

If you are using RStudio, you can use view(dataset_name) to view the dataset in GUI as below:

>data(volcano)

> view(volcano)

Screen Shot 2013-01-19 at 1.00.19 PM

Categories: R Tags:

Top most algorithms used in Data Mining


I am trying to compile a comprehensive  list of  Data Mining Algorithm and while trying to do so I found a top 10 list can be created by several ways.

Based on a Scientific research paper here is top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006 and  these top 10 algorithms are among the most influential data mining algorithms in the research community

  1. C4.5
  2. k-Means
  3. SVM
  4. Apriori
  5. EM
  6. PageRank
  7. AdaBoost
  8. kNN
  9. Naive Bayes
  10. CART

Public Voting:

  1. Decision Trees/Rules
  2. Regression
  3. Clustering
  4. Statistics (descriptive)
  5. Visualization
  6. Time series/Sequence analysis
  7. Support Vector (SVM)
  8. Association rules
  9. Ensemble methods
  10. Text Mining
  11. Neural Nets
  12. Boosting
  13. Bayesian
  14. Bagging
  15. Factor Analysis
  16. Anomaly/Deviation detection
  17. Social Network Analysis
  18. Survival Analysis
  19. Genetic algorithms
  20. Uplift modeling

Based on voting done by “Mahout user mailing list” here is the list:

  1. Matrix factorization (SVD)
  2. k-means
  3. Naive Bayes
  4. Dirichlet Process Clustering
  5. Matrix Factorization
  6. Frequent Pattern Matching
  7. LDA
  8. Expectation Maximization
  9. SVM
  10. Decision Trees
  11. Logistics Regression
  12. Random Forest

Resources: 

Cloud Fair 2012 Presentation on Apache Hadoop, Windows Azure and Open Source

April 25, 2012 2 comments

Recently I was given an opportunity to talk about “Apache Hadoop on Windows Azure” and “Open Source & Cloud Computing” at Cloud Fair 2012 Seattle.  If you wish to get the presentation, please follow the info below:

Presentation Topic: Apache Hadoop on Windows Azure

Since Microsoft’s adoption of open source Apache Hadoop in its cloud offering and CTP release “Apache Hadoop on Windows Azure”, the Hadoop is most exciting and adopted technology to analyze large amount of data in a very simple ways. The service offered as an elastic service for both on premise and public clouds based on Microsoft and Apache Hadoop technologies. Scale invariant insight, information processing, and analytics are now available to all participants in the Microsoft ecosystem – cloud and enterprise. Best of all, these capabilities have been bridged into the vibrant domains of Office BI and Collaboration, Data Warehousing, and Visualization/Reporting.

You will learn how to unlock business insights from all your structured and unstructured data, including large volumes of data not previously activated, with Microsoft’s Big Data solution. I will explain using live demonstration, how you can enterprise class Hadoop based solutions designed by Microsoft, on both Windows Server and Windows Azure. This talk is developer oriented and tutorial includes installation, configuration and simplified Big Data analysis with JavaScript.

Please download the Presentation from the link below:
http://www.papershare.com/paper/Apache-Hadoop-on-Windows-Azure

Presentation Title: Open Source and Cloud Computing

Targeting the Open Source developer community will unlock significant growth opportunities for any cloud service vendor however for an open source developer; it is very hard to decide which cloud option to choose. Cost reduction is one of the main value-prop of both Cloud services and Open Source tools. So combining both of them is a significant reduction in cost, however, other hidden and unknown facts, may appear later. A significant portion of the development community uses LAMP stack and when you add Java, Java Script, Ruby, CGI, Python, Node.JS, *SQL, *DB, and Hadoop to list, you have a majority of the application currently thriving to move to cloud.
This interactive session targets to open source developers, suggesting what cloud services could be best option and what they should really look for, in a cloud platform. Attendees will learn about cloud services both big and small, those have successfully adopted Open Source applications support in their services offering. Attendees can use the information to understand what technical limitations exist along with how to overcome these hurdles.

Please download the Presentation from the link below:
http://www.papershare.com/paper/Open-Source-and-Cloud-Computing

Follow

Get every new post delivered to your Inbox.

Join 37 other followers

%d bloggers like this: