Big Data 1B dollars Club – Top 20 Players


Here is a list of the top players in the Big Data world that influence billion-dollar (or larger) Big Data projects, directly or indirectly (in no particular order):

  1. Microsoft
  2. Google
  3. Amazon
  4. IBM
  5. HP
  6. Oracle
  7. VMWare
  8. Teradata
  9. EMC
  10. Facebook
  11. GE
  12. Intel
  13. Cloudera
  14. SAS
  15. 10gen
  16. SAP
  17. Hortonworks
  18. MapR
  19. Palantir
  20. Splunk

The list is based on each company's involvement in Big Data, directly or indirectly, whether or not it ships a dedicated Big Data product. All of the companies above are involved in Big Data projects worth a billion dollars or more…

What to do when compiling Hadoop branch 1.2.x returns java.io.IOException: Cannot run program “autoreconf”


Compiling the Hadoop branch 1.2.x code on OS X failed with the exception below:

create-native-configure:

BUILD FAILED
/Users/avkash/work/hadoop/branch-1.2/build.xml:634: Execute failed: java.io.IOException: Cannot run program "autoreconf" (in directory "/Users/avkash/work/hadoop/branch-1.2/src/native"): error=2, No such file or directory
at java.lang.ProcessBuilder.processException(ProcessBuilder.java:478)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:457)
at java.lang.Runtime.exec(Runtime.java:593)
at org.apache.tools.ant.taskdefs.Execute$Java13CommandLauncher.exec(Execute.java:862)
at org.apache.tools.ant.taskdefs.Execute.launch(Execute.java:481)
at org.apache.tools.ant.taskdefs.Execute.execute(Execute.java:495)
at org.apache.tools.ant.taskdefs.ExecTask.runExecute(ExecTask.java:631)
at org.apache.tools.ant.taskdefs.ExecTask.runExec(ExecTask.java:672)
at org.apache.tools.ant.taskdefs.ExecTask.execute(ExecTask.java:498)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:348)
at org.apache.tools.ant.Target.execute(Target.java:390)
at org.apache.tools.ant.Target.performTasks(Target.java:411)
at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1399)
at org.apache.tools.ant.Project.executeTarget(Project.java:1368)
at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
at org.apache.tools.ant.Project.executeTargets(Project.java:1251)
at org.apache.tools.ant.Main.runBuild(Main.java:809)
at org.apache.tools.ant.Main.startAnt(Main.java:217)
at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280)
at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
at java.lang.ProcessImpl.start(ProcessImpl.java:91)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
… 23 more

Looking into src/native, there is no autoconf tooling shipped with the source, so the best way to solve this problem is to install automake (which pulls in autoconf as a dependency) via Homebrew:

└─[1] <git:(master✗✈)> brew install automake
==> Installing automake dependency: autoconf
==> Downloading http://ftpmirror.gnu.org/autoconf/autoconf-2.69.tar.gz
######################################################################## 100.0%
==> ./configure --prefix=/usr/local/Cellar/autoconf/2.69
==> make install
🍺 /usr/local/Cellar/autoconf/2.69: 69 files, 2.0M, built in 21 seconds
==> Installing automake
==> Downloading http://ftpmirror.gnu.org/automake/automake-1.14.tar.gz
######################################################################## 100.0%
==> ./configure --prefix=/usr/local/Cellar/automake/1.14
==> make install
🍺 /usr/local/Cellar/automake/1.14: 127 files, 2.5M, built in 7 seconds

That's all!

After that, just run the following command to make sure the same exception does not come back:

$ ant create-native-configure

Customized bash command prompt with line separator and other goodies


I wanted a fancy-looking and very useful terminal window with a customized command prompt, so after some digging I built the following for myself:

[Screenshot: the customized terminal prompt]

So what does it have?

  • A line separator with the current time at its end
  • History counter along with the current command counter
  • Logged-in user @ hostname
  • Current working folder followed by $

Here is what I have done. First, I created a file called .avkashbash_ps1 in my $HOME folder with the following contents:

fill="-"
reset_style='\[\033[00m\]'
status_style=$reset_style'\[\033[0;32m\]'  # green; use 0;37m for a lighter gray
prompt_style=$reset_style
command_style=$reset_style'\[\033[0;32m\]' # green
# Prompt variable:
PS1="$status_style"'$fill[\T]\n'"$prompt_style"'${debian_chroot:+($debian_chroot)}\e[0;31m\e[47m[\!:\#]\e[0m-\e[1;33m[\u@\h]\e[0m\n[\w]\e[0;32m$ '"$command_style "
# Reset color for command output
# (this one is invoked every time before a command is executed):
function prompt_command {
  # build a $fill string of dashes spanning the screen width minus the time string and a space:
  let fillsize=${COLUMNS}-11
  fill=""
  while [ "$fillsize" -gt "0" ]
  do
    fill="-${fill}" # prepend another dash
    let fillsize=${fillsize}-1
  done
}
PROMPT_COMMAND=prompt_command

To make the setting permanent, just add the following code to .bash_profile first:

if [ -f "$HOME/.avkashbash_ps1" ]; then
  . "$HOME/.avkashbash_ps1"
fi

And then run the following command to set it:

$ source ~/.bash_profile

Or if you don’t want to make it permanent, just add the following code to .bashrc first:

if [ -f "$HOME/.avkashbash_ps1" ]; then
  . "$HOME/.avkashbash_ps1"
fi

And then run the following command to set it:

$ source ~/.bashrc

That's all.

Thanks to the guys here and here!!

Brew on Mac – Just 3 steps and you are ready


Step 1:

MachineHead:docs avkash$ ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"
==> This script will install:
/usr/local/bin/brew
/usr/local/Library/…
/usr/local/share/man/man1/brew.1
==> The following directories will be made group writable:
/usr/local/.
/usr/local/bin
/usr/local/etc
/usr/local/lib
/usr/local/share
/usr/local/share/man
/usr/local/share/man/man1
/usr/local/share/info
==> The following directories will have their group set to admin:
/usr/local/.
/usr/local/bin
/usr/local/etc
/usr/local/lib
/usr/local/share
/usr/local/share/man
/usr/local/share/man/man1
/usr/local/share/info

Press ENTER to continue or any other key to abort
==> /usr/bin/sudo /bin/chmod g+rwx /usr/local/. /usr/local/bin /usr/local/etc /usr/local/lib /usr/local/share /usr/local/share/man /usr/local/share/man/man1 /usr/local/share/info
Password:
==> /usr/bin/sudo /usr/bin/chgrp admin /usr/local/. /usr/local/bin /usr/local/etc /usr/local/lib /usr/local/share /usr/local/share/man /usr/local/share/man/man1 /usr/local/share/info
==> Downloading and Installing Homebrew…
remote: Counting objects: 121792, done.
remote: Compressing objects: 100% (59383/59383), done.
remote: Total 121792 (delta 85372), reused 95223 (delta 61439)
Receiving objects: 100% (121792/121792), 19.29 MiB | 283 KiB/s, done.
Resolving deltas: 100% (85372/85372), done.
From https://github.com/mxcl/homebrew
* [new branch] master -> origin/master
HEAD is now at c45c77d audit: don’t complain about bottle versions.
==> Installation successful!
You should run `brew doctor’ *before* you install anything.
Now type: brew help

Step 2: 

MachineHead:docs avkash$ brew doctor
Your system is ready to brew.

Step 3:

MachineHead:docs avkash$ brew help
Example usage:
brew [info | home | options ] [FORMULA…]
brew install FORMULA…
brew uninstall FORMULA…
brew search [foo]
brew list [FORMULA…]
brew update
brew upgrade [FORMULA…]

Troubleshooting:
brew doctor
brew install -vd FORMULA
brew [--env | --config]

Brewing:
brew create [URL [--no-fetch]]
brew edit [FORMULA…]
open https://github.com/mxcl/homebrew/wiki/Formula-Cookbook

Further help:
man brew
brew home

That's all!

Apache Weave: Big Data Application runtime and development framework by Continuuity


Continuuity decided to build Weave and to be part of the journey to take Apache YARN to the next level of usability and functionality. Continuuity has been using Weave extensively to support its products and has seen the benefit and power of Apache YARN and Weave combined. Continuuity decided to share Weave under the Apache 2.0 license in an effort to collaborate with members of the community, broaden the set of applications and patterns that Weave supports, and further the overall adoption of Apache YARN.

Weave is NOT a replacement for Apache YARN.  It is instead a value-added framework that operates on top of Apache YARN.

What is Weave:  Weave is a simple set of libraries that allows you to easily manage distributed applications through an abstraction layer built on Apache YARN. Weave allows you to use YARN’s distributed capabilities with a programming model that is similar to running threads.

Features of Weave:
– Simple API for specifying, running and managing application lifecycle
– An easy way to communicate with an application or parts of an application
– A generic Application Master to better support simple applications
– Simplified archive management and local file transport
– Improved control over application logs, metrics and errors
– Discovery service
– And many more…

The Weave source code is available on GitHub at http://github.com/continuuity/weave under the Apache 2.0 license.

Learn more at http://www.continuuity.com/.

Keywords: Hadoop, YARN, MapReduce, Big Data

Processing unstructured content from a URL in R


R has a built-in function named readLines() that reads a local file or a URL line by line.

For example, my blog URL is https://cloudcelebrity.wordpress.com, so let's read it:

> myblog <- readLines("https://cloudcelebrity.wordpress.com")
Warning message:
In readLines("https://cloudcelebrity.wordpress.com") :
  incomplete final line found on 'https://cloudcelebrity.wordpress.com'

> length(myblog)
[1] 1380

As you can see above, there is a warning message even though myblog does contain all the lines. To disable this warning we can pass warn=FALSE as below:

> myblog <- readLines("https://cloudcelebrity.wordpress.com", warn=FALSE)

> length(myblog)
[1] 1380

And above there is no warning. If I want to print a specific line by its line number, I can just index the vector:

> myblog[100]
[1] “<script type=’text/javascript’>/*<![CDATA[*/if(typeof(addLoadEvent)!=’undefined’){addLoadEvent(function(){if(top==self){i=document.createElement(‘img’);i.src=’http://botd.wordpress.com/botd.gif?blogid=29647980&postid=0&lang=1&date=1359531235&ip=98.203.199.229&url=https://cloudcelebrity.wordpress.com/&loc=’+document.location;i.style.width=’0px&#8217;;i.style.height=’0px’;i.style.overflow=’hidden’;document.body.appendChild(i);}});}/*]]>*/</script>”

Let's get the summary:

> summary(myblog)
Length Class Mode
1380 character character

To read only a limited number of lines from the same URL, I can also pass the line limit as the second argument (n) as below:

> myblog <- readLines("https://cloudcelebrity.wordpress.com", 100, warn=FALSE)
> summary(myblog)
Length Class Mode
100 character character

After reading all the lines of my blog, let's perform some specific search operations on the content:

Searching all lines with Hadoop or hadoop in it: 

To find all the lines that contain Hadoop or hadoop, we can run grep to get all the matching line numbers as below:

> hd <- grep("[hH]adoop", myblog)

Let's print hd to see all the line numbers:
> hd
[1] 706 803 804 807 811 812 814 819 822 823 826 827 830 834
[15] 837 863 869 871 872 875 899 911 912 921 923 925 927 931
[29] 934 1000 1010 1011 1080 1278 1281

To print all the lines with Hadoop or hadoop in it we can just use:

> myblog[hd]
[1] “<p>A: ACID – Atomicity, Consistency, Isolation and Durability <br />B: Big Data – Volume, Velocity, Variety <br />C: Columnar (or Column-Oriented) Database <br />D: Data Warehousing – Relevant and very useful <br />E: ETL – Extract, transform and load <br />F: Flume – A framework for populating Hadoop with data <br />G: Geospatial Analysis – A picture worth 1,000 words or more <br />H: Hadoop, HDFS, HBASE – Do you really want to know? <br />I:  In-Memory Database – A new definition of superfast access <br />J: Java – Hadoop gave biggest push in last years to stay in enterprise market <br />K: Kafka – High-throughput, distributed messaging system originally developed at LinkedIn <br />L: Latency – Low Latency and High Latency <br />M: Map/Reduce – MapReduce <br />N:  NoSQL Databases – No SQL Database or Not Only SQL <br />O: Oozie – Open-source workflow engine managing Hadoop job processing <br />P: Pig – Platform for analyzing huge data sets <br />Q: Quantitative Data Analysis <br />R: Relational Database – Still relevant and will be for some time <br />S: Sharding (Database Partitioning)  and Sqoop (SQL Database to Hadoop) <br />T: Text Analysis – Larger the information, more needed analysis <br />U: Unstructured Data – Growing faster than speed of thoughts <br />V: Visualization – Important to keep the information relevant <br />W: Whirr – Big Data Cloud Services i.e. Hadoop distributions by cloud vendors <br />X:  XML – Still eXtensible and no Introduction needed <br />Y: Yottabyte – Equal to 1,000 exabytes, 1 million petabytes and 1 billion terabytes <br />Z: Zookeeper – Help managing Hadoop nodes across a distributed network </p>”
[2] “\t\t\t<div class=\”post-382 post type-post status-publish format-standard hentry category-big-data category-data-analysis category-hadoop tag-hadoop-bigdata-disney-data-platform\” id=\”post-382\”>”
[3] “\t\t\t<h2><a class=\”title\” href=\”https://cloudcelebrity.wordpress.com/2012/11/13/how-hadoop-is-shaping-up-at-disney-world/\” rel=\”bookmark\”>How Hadoop is shaping up at Disney&nbsp;World?</a></h2>”
[4] “\t\t\t\t<span class=\”author\”><a href=\”https://cloudcelebrity.wordpress.com/author/cloudcelebrity/\” title=\”Posts by cloudcelebrity\” rel=\”author\”>cloudcelebrity</a></span>\t\t\t\t\t\t\t\t<span class=\”comments\”><a href=\”https://cloudcelebrity.wordpress.com/2012/11/13/how-hadoop-is-shaping-up-at-disney-world/#respond\” title=\”Comment on How Hadoop is shaping up at Disney&nbsp;World?\”>Leave a comment</a></span>”
[5] “\t\t\t\t<div class=\”pd-rating\” id=\”pd_rating_holder_5386869_post_382\” data-settings=\”{&quot;id&quot;:5386869,&quot;item_id&quot;:&quot;_post_382&quot;,&quot;settings&quot;:&quot;{\\&quot;id\\&quot;:5386869,\\&quot;unique_id\\&quot;:\\&quot;wp-post-382\\&quot;,\\&quot;title\\&quot;:\\&quot;How%20Hadoop%20is%20shaping%20up%20at%20Disney%26nbsp%3BWorld%3F\\&quot;,\\&quot;permalink\\&quot;:\\&quot;http:\\\\\\/\\\\\\/cloudcelebrity.wordpress.com\\\\\\/2012\\\\\\/11\\\\\\/13\\\\\\/how-hadoop-is-shaping-up-at-disney-world\\\\\\/\\&quot;,\\&quot;item_id\\&quot;:\\&quot;_post_382\\&quot;}&quot;}\”></div><br/><p> </p>”

……………….

………………

[34] “PDRTJS_settings_5386869_post_412={\”id\”:5386869,\”unique_id\”:\”wp-post-412\”,\”title\”:\”Merging%20two%20data%20set%20in%20R%20based%20on%20one%20common%26nbsp%3Bcolumn\”,\”permalink\”:\”http:\\/\\/cloudcelebrity.wordpress.com\\/2013\\/01\\/30\\/merging-two-data-set-in-r-based-on-one-common-column\\/\”,\”item_id\”:\”_post_412\”}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_412 == ‘undefined’ ){PDRTJS_5386869_post_412 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_412 );}}PDRTJS_settings_5386869_post_409={\”id\”:5386869,\”unique_id\”:\”wp-post-409\”,\”title\”:\”Working%20with%20dataset%20in%20R%20and%20using%20subset%20to%20work%20on%26nbsp%3Bdataset\”,\”permalink\”:\”http:\\/\\/cloudcelebrity.wordpress.com\\/2013\\/01\\/30\\/working-with-dataset-in-r-and-using-subset-to-work-on-dataset\\/\”,\”item_id\”:\”_post_409\”}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_409 == ‘undefined’ ){PDRTJS_5386869_post_409 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_409 );}}PDRTJS_settings_5386869_post_398={\”id\”:5386869,\”unique_id\”:\”wp-post-398\”,\”title\”:\”Listing%20base%20datasets%20in%20R%20and%20loading%20as%20Data%26nbsp%3BFrame\”,\”permalink\”:\”http:\\/\\/cloudcelebrity.wordpress.com\\/2013\\/01\\/19\\/listing-base-datasets-in-r-and-loading-as-data-frame\\/\”,\”item_id\”:\”_post_398\”}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_398 == ‘undefined’ ){PDRTJS_5386869_post_398 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_398 );}}PDRTJS_settings_5386869_post_397={\”id\”:5386869,\”unique_id\”:\”wp-post-397\”,\”title\”:\”ABC%20of%20Data%26nbsp%3BScience\”,\”permalink\”:\”http:\\/\\/cloudcelebrity.wordpress.com\\/2013\\/01\\/01\\/abc-of-data-science\\/\”,\”item_id\”:\”_post_397\”}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_397 == ‘undefined’ ){PDRTJS_5386869_post_397 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_397 );}}PDRTJS_settings_5386869_post_390={\”id\”:5386869,\”unique_id\”:\”wp-post-390\”,\”title\”:\”R%20Programming%20Language%20%28Installation%20and%20configuration%20on%26nbsp%3BWindows%29\”,\”permalink\”:\”http:\\/\\/cloudcelebrity.wordpress.com\\/2012\\/12\\/18\\/r-programming-language-installation-and-configuration-on-windows\\/\”,\”item_id\”:\”_post_390\”}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_390 == ‘undefined’ ){PDRTJS_5386869_post_390 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_390 );}}PDRTJS_settings_5386869_post_382={\”id\”:5386869,\”unique_id\”:\”wp-post-382\”,\”title\”:\”How%20Hadoop%20is%20shaping%20up%20at%20Disney%26nbsp%3BWorld%3F\”,\”permalink\”:\”http:\\/\\/cloudcelebrity.wordpress.com\\/2012\\/11\\/13\\/how-hadoop-is-shaping-up-at-disney-world\\/\”,\”item_id\”:\”_post_382\”}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_382 == ‘undefined’ ){PDRTJS_5386869_post_382 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_382 );}}PDRTJS_settings_5386869_post_376={\”id\”:5386869,\”unique_id\”:\”wp-post-376\”,\”title\”:\”Hadoop%20Adventures%20with%20Microsoft%26nbsp%3BHDInsight\”,\”permalink\”:\”http:\\/\\/cloudcelebrity.wordpress.com\\/2012\\/11\\/03\\/hadoop-adventures-with-microsoft-hdinsight\\/\”,\”item_id\”:\”_post_376\”}; if ( typeof PDRTJS_RATING !== ‘undefined’ ){if ( typeof PDRTJS_5386869_post_376 == ‘undefined’ ){PDRTJS_5386869_post_376 = new PDRTJS_RATING( PDRTJS_settings_5386869_post_376 );}}”
[35] “\t\tWPCOM_sharing_counts = {\”http:\\/\\/cloudcelebrity.wordpress.com\\/2013\\/01\\/30\\/merging-two-data-set-in-r-based-on-one-common-column\\/\”:412,\”http:\\/\\/cloudcelebrity.wordpress.com\\/2013\\/01\\/30\\/working-with-dataset-in-r-and-using-subset-to-work-on-dataset\\/\”:409,\”http:\\/\\/cloudcelebrity.wordpress.com\\/2013\\/01\\/19\\/listing-base-datasets-in-r-and-loading-as-data-frame\\/\”:398,\”http:\\/\\/cloudcelebrity.wordpress.com\\/2013\\/01\\/01\\/abc-of-data-science\\/\”:397,\”http:\\/\\/cloudcelebrity.wordpress.com\\/2012\\/12\\/18\\/r-programming-language-installation-and-configuration-on-windows\\/\”:390,\”http:\\/\\/cloudcelebrity.wordpress.com\\/2012\\/11\\/13\\/how-hadoop-is-shaping-up-at-disney-world\\/\”:382,\”http:\\/\\/cloudcelebrity.wordpress.com\\/2012\\/11\\/03\\/hadoop-adventures-with-microsoft-hdinsight\\/\”:376}\t</script>”

Above I have removed the lines in the middle and show only a snippet of the result.
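As a side note (my addition, not part of the original run above), grep() can also return the matching lines themselves instead of their indices by passing value=TRUE, which saves the extra indexing step:

> hd.lines <- grep("[hH]adoop", myblog, value=TRUE)  # matching lines, not indices
> length(hd.lines)                                   # same count as length(hd)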

In the content above (using the full read of the page, not the 100-line subset), lines 553 through 648 contain the list of R's built-in datasets, so to collect them I can do the following:

> myLines <- myblog[553:648]
> summary(myLines)
Length Class Mode
96 character character

Note: The myLines character vector above has 96 entries (lines 553 through 648), so you can try printing it to see what you get.

Create a list of available dataset from above myLines vector: 

The pattern in myLines is as below:

[1] "AirPassengers Monthly Airline Passenger Numbers 1949-1960"
[2] "BJsales Sales Data with Leading Indicator"
[3] "BOD Biochemical Oxygen Demand"
[4] "CO2 Carbon Dioxide Uptake in Grass Plants"

……….

………

[92] "treering Yearly Treering Data, -6000-1979"
[93] "trees Girth, Height and Volume for Black Cherry Trees"
[94] "uspop Populations Recorded by the US Census"
[95] "volcano Topographic Information on Auckland's Maunga"
[96] " Whau Volcano"

So the first word is the dataset name, and everything after the space is the dataset description. To extract only the dataset name, let's use the sub() function as below:

> dsName <- sub(" .*", "", myLines)
> dsName
[1] “AirPassengers” “BJsales” “BOD”
[4] “CO2” “ChickWeight” “DNase”
[7] “EuStockMarkets” “” “Formaldehyde”
[10] “HairEyeColor” “Harman23.cor” “Harman74.cor”
[13] “Indometh” “InsectSprays” “JohnsonJohnson”
[16] “LakeHuron” “LifeCycleSavings” “Loblolly”
[19] “Nile” “Orange” “OrchardSprays”
[22] “PlantGrowth” “Puromycin” “Theoph”
[25] “Titanic” “ToothGrowth” “”
[28] “UCBAdmissions” “UKDriverDeaths” “UKLungDeaths”
[31] “UKgas” “USAccDeaths” “USArrests”
[34] “USJudgeRatings” “” “USPersonalExpenditure”
[37] “VADeaths” “WWWusage” “WorldPhones”
[40] “ability.cov” “airmiles” “”
[43] “airquality” “anscombe” “”
[46] “attenu” “attitude” “austres”
[49] “” “beavers” “cars”
[52] “chickwts” “co2” “crimtab”
[55] “datasets-package” “discoveries” “esoph”
[58] “euro” “eurodist” “faithful”
[61] “freeny” “infert” “”
[64] “iris” “islands” “lh”
[67] “longley” “lynx” “morley”
[70] “mtcars” “nhtemp” “nottem”
[73] “” “occupationalStatus” “precip”
[76] “presidents” “pressure” “”
[79] “quakes” “randu” “”
[82] “rivers” “rock” “sleep”
[85] “stackloss” “state” “sunspot.month”
[88] “sunspot.year” “sunspots” “swiss”
[91] “” “treering” “trees”
[94] “uspop” “volcano” “”

Next work item: dsName has a few empty entries (they come from wrapped continuation lines such as [96] above, which start with a space), so we can clean up the vector.
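A minimal sketch of that cleanup (my addition, not in the original post): empty strings can be dropped with simple logical indexing, for example:

> dsName.clean <- dsName[dsName != ""]   # keep only the non-empty dataset names
> length(dsName.clean)                   # fewer entries than length(dsName)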

 

Note: readline() in R (not to be confused with readLines()) is used to prompt the user for input in the console.
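For example (a trivial illustration added here, not from the original post):

> n <- readline("Enter a number: ")   # waits for input typed in the console
> n                                   # whatever was typed, returned as a character string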

 

Merging two data set in R based on one common column


Let's create a new dataset from the mtcars dataset using only the mpg and hp columns:

> cars.mpg <- subset(mtcars, select = c(mpg, hp))
 
> cars.mpg
                     mpg  hp
Mazda RX4           21.0 110
Mazda RX4 Wag       21.0 110
Datsun 710          22.8  93
Hornet 4 Drive      21.4 110
Hornet Sportabout   18.7 175
Valiant             18.1 105
Duster 360          14.3 245
Merc 240D           24.4  62
Merc 230            22.8  95
Merc 280            19.2 123
Merc 280C           17.8 123
Merc 450SE          16.4 180
Merc 450SL          17.3 180
Merc 450SLC         15.2 180
…………..

Let's create another dataset from mtcars using only the hp and cyl columns:

> cars.cyl <- subset(mtcars, select = c(hp,cyl))
> cars.cyl
                     hp cyl
Mazda RX4           110   6
Mazda RX4 Wag       110   6
Datsun 710           93   4
Hornet 4 Drive      110   6
Hornet Sportabout   175   8
Valiant             105   6
Duster 360          245   8
Merc 240D            62   4
Merc 230             95   4
Merc 280            123   6
Merc 280C           123   6
Merc 450SE          180   8
Merc 450SL          180   8
Merc 450SLC         180   8
…………………..

Now we can merge both datasets on the common column hp as below:

 
> merge.ds <- merge(cars.mpg, cars.cyl, by="hp")
> merge.ds
    hp  mpg cyl
1   52 30.4   4
2   62 24.4   4
3   65 33.9   4
4   66 32.4   4
5   66 32.4   4
6   66 27.3   4
7   66 27.3   4
8   91 26.0   4
9   93 22.8   4
10  95 22.8   4
11  97 21.5   4
12 105 18.1   6
13 109 21.4   4
14 110 21.0   6
15 110 21.0   6
16 110 21.0   6
17 110 21.0   6
18 110 21.0   6
19 110 21.0   6
20 110 21.4   6
21 110 21.4   6
22 110 21.4   6
23 113 30.4   4
24 123 17.8   6
25 123 17.8   6
26 123 19.2   6
27 123 19.2   6
28 150 15.2   8
29 150 15.2   8
30 150 15.5   8
31 150 15.5   8
32 175 18.7   8
33 175 18.7   6
34 175 18.7   8
35 175 19.7   8
36 175 19.7   6
37 175 19.7   8
38 175 19.2   8
39 175 19.2   6
40 175 19.2   8
41 180 16.4   8
42 180 16.4   8
43 180 16.4   8
44 180 17.3   8
45 180 17.3   8
46 180 17.3   8
47 180 15.2   8
48 180 15.2   8
49 180 15.2   8
50 205 10.4   8
51 215 10.4   8
52 230 14.7   8
53 245 14.3   8
54 245 14.3   8
55 245 13.3   8
56 245 13.3   8
57 264 15.8   8
58 335 15.0   8

Why do you see 58 merged rows when there were only 32 rows in each of the original data sets?

This is because hp is not a unique key: several cars share the same hp value, and when merge() finds m matching rows in the first data frame and n matching rows in the second for the same hp value, it returns all m × n combinations (a many-to-many join). As the R documentation puts it, "merge is a generic function whose principal method is for data frames: the default method coerces its arguments to data frames and calls the 'data.frame' method."
Learn more by calling
?merge
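A quick way to see where the extra rows come from (a small check added here, not in the original post) is to count how many cars share each hp value; an hp value that appears k times in each subset contributes k x k rows to the merge:

> counts <- table(mtcars$hp)   # how many cars share each hp value
> counts["110"]                # hp 110 appears 3 times, so it alone yields 3 x 3 = 9 merged rows
110
  3
> sum(counts^2)                # total rows produced by merging the two subsets on hp
[1] 58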