R Refresher for Beginners


RStudio Environment

R Location (OSX)

$ ls -l /Library/Frameworks/R.framework/Versions

# Get R version

version


Environment

getwd()

setwd("/Users/avkashchauhan/work/global")

getwd()

dir()

#Getting Help

help(getwd)

#Reading a File

help(read.csv)

filename <- "test.csv"

filex <- read.csv(filename, header = TRUE, sep = ",")

filex

summary(filex)

filex$id

filex$name

filex$age

filex$zip

names(filex)

attributes(filex)
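If you don't have a test.csv handy, here is a minimal self-contained sketch: it writes a small hypothetical file with the same columns used above (id, name, age, zip) to a temporary path, then reads it back the same way. The sample rows are made up for illustration.

```r
# write a tiny CSV matching the columns used above, then read it back
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id   = 1:3,
                     name = c("Ann", "Bob", "Cat"),
                     age  = c(34, 28, 45),
                     zip  = c("98101", "98052", "98004")),
          tmp, row.names = FALSE)
filex <- read.csv(tmp, header = TRUE, sep = ",")
names(filex)     # "id" "name" "age" "zip"
nrow(filex)      # 3
```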

# Listing All Vars

ls()

# ls() - list of all variables

# DataTypes & number Assignment

asc <- c(1,2,3,4,5,6,7,8,9,10)

# What is c? c is "combine"

asc[2]

asc[5]

asc[5:6]

asc[1:9]

View(asc)

a <- 10

a

a[1]

a[3]

help(sqrt)

a <- sqrt(10)

a

a <- sqrt(10*a)

a

asc

mean(asc)

median(asc)

help(var)

typeof(asc)

typeof(a)

# String data type

a <- c("this", "is", "so", "fun")

a

a[1]

typeof(a)
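A quick sketch summarizing what typeof() reports for the kinds of values used so far (the 1L line is extra, showing R's integer suffix):

```r
typeof(c(1, 2, 3))        # "double"  (plain numbers are stored as doubles)
typeof(sqrt(10))          # "double"
typeof(c("this", "is"))   # "character"
typeof(c(TRUE, FALSE))    # "logical"
typeof(1L)                # "integer" (the L suffix forces integer storage)
```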

#Understanding c or combine

a <- 10

> a

[1] 10

> a[1]

[1] 10

> a[2]

[1] NA

> a <- c(10)

> a

[1] 10

> a[2]

[1] NA

# DATAFRAME

# creating a data frame

a <- c(1,2,3,4,5,6,7,8,9,10)

b <- c(10,20,30,40,50,60,70,80,90,100)

ab <- data.frame(first=a, second=b)

ab

ab[1]

ab[1][1]

ab[1][2]   # <- error: undefined columns selected

ab[2]

ab[2][1]

ab[2][2]   # <- error: undefined columns selected

ab$first

ab$second

ab$second[1]

ab$second[3]

ab$first[10]

View(ab)
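The two indexing lines flagged as errors above fail because ab[1] is itself a one-column data frame, so ab[1][2] asks it for a second column that does not exist. A small sketch of the forms that do reach individual cells:

```r
ab <- data.frame(first = 1:10, second = seq(10, 100, by = 10))
# row/column indexing reaches a single cell
ab[1, 2]       # row 1, column 2 -> 10
# [[ ]] extracts a column as a plain vector, which can then be indexed
ab[[1]][2]     # column 1, element 2 -> 2
# $ does the same by name
ab$second[3]   # -> 30
```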

#Logical

a <- c(TRUE)

a

typeof(a)

a <- c(FALSE)

a

typeof(a)

#Conditions in R

a <- c(TRUE)

if(!a) a <- c(FALSE)

a   # <- still TRUE

if(a) a <- c(FALSE)

a   # <- FALSE now

a <- c(TRUE,FALSE)

a

a[1]

a[2]

if (a[1]) a[2] <- TRUE

a
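Note that if() tests only a single value; for element-wise tests over a whole vector, R provides the vectorized ifelse() (not covered above):

```r
# ifelse(test, yes, no) applies the test to every element at once
x <- c(1, 5, 10)
ifelse(x > 4, "big", "small")   # "small" "big" "big"
```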


Factor in R: A factor is a vector whose elements can take on one of a specific set of values. For example, "Sex" will usually take on only the values "M" or "F," whereas "Name" will generally have many possibilities. The set of values that the elements of a factor can take is called its levels.

a <- factor(c("Male", "Female", "Female", "Male", "Male"))

a

a <- factor(c("A","A","B","A","B","B","C","A","C"))

a

Tables (one-way and two-way)

a <- factor(c("Male", "Female", "Female", "Male", "Male"))

a

mytable <- table(a)

a

mytable

summary(a)

attributes(a)
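The heading above mentions two-way tables as well; here is a sketch using the same sex factor plus a hypothetical second factor (dept is made up for illustration):

```r
# two-way table: cross-tabulate two factors of the same length
sex  <- factor(c("Male", "Female", "Female", "Male", "Male"))
dept <- factor(c("A", "A", "B", "B", "A"))   # hypothetical second factor
table(sex, dept)
#         dept
# sex      A B
#   Female 1 1
#   Male   2 1
```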

# Data type checks in R

#Example #1

a <- c(1,2,4)

is.numeric(a)

is.factor(a)

#Example #2

b <- factor(c("M", "F"))

b

is.factor(b)

is.numeric(b)

Graph Plotting in R

Using Library ggplot2

#installing ggplot2

install.packages("ggplot2")


also installing the dependencies 'colorspace', 'Rcpp', 'stringr', 'RColorBrewer', 'dichromat', 'munsell', 'labeling', 'plyr', 'digest', 'gtable', 'reshape2', 'scales', 'proto'

Using ggplot2 Library

 

library(ggplot2)

# detach(package:ggplot2)   # run this later to unload the package; detaching
#                           # now would break the calls below

head(diamonds)

View(diamonds)

qplot(clarity, data=diamonds, fill=cut, geom="bar")


qplot(clarity, data=diamonds, geom="bar", fill=cut, position="stack")

qplot(clarity, data=diamonds, geom="freqpoly", group=cut, colour=cut, position="identity")


qplot(carat, data=diamonds, geom="histogram", binwidth=0.1)

qplot(carat, data=diamonds, geom="histogram", binwidth=0.01)


Graph Source: http://www.ceb-institute.org/bbs/wp-content/uploads/2011/09/handout_ggplot2.pdf

Keywords: R, Analysis, ggplot2

Mounting a new Partition in Linux (CentOS/RH) virtual image by adding new virtual storage


Sometimes you may need to add a new virtual storage disk to an existing virtual image. Adding the disk itself is handled by your virtualization system, and depending on your virtualization technology (Hyper-V, Xen, VMware, etc.) it is easy to attach a new virtual storage disk to your VM; however, making that disk usable from the guest OS takes a little extra work. Here are the instructions for adding a new 100 GB virtual storage partition to a CentOS Linux machine:

Checking the disks with valid files system on Linux OS:
[root@liunxbox64 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_centos-lv_root
35G 3.2G 31G 10% /
tmpfs 935M 0 935M 0% /dev/shm
/dev/xvda1 477M 75M 377M 17% /boot

Note: you don't see any mounted partition as /dev/xvdb* yet.

Listing disk partition in Linux:
[root@liunxbox64 ~]# cat /proc/partitions
major minor #blocks name

202 0 41943040 xvda
202 1 512000 xvda1
202 2 41430016 xvda2
202 16 104857600 xvdb   <-- this disk is visible to Linux but not yet ready to use
253 0 37330944 dm-0
253 1 4096000 dm-1

Above you can see that the xvdb disk is available, so we will make this partition part of the Linux file system.

Creating partition with the fdisk command:
[root@liunxbox64 ~]# fdisk /dev/xvdb
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0xd288a1a2.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.

Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
switch off the mode (command 'c') and change display units to
sectors (command 'u').

Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-13054, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-13054, default 13054):
Using default value 13054

Command (m for help): p

Disk /dev/xvdb: 107.4 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xd288a1a2

Device Boot Start End Blocks Id System
/dev/xvdb1 1 13054 104856223+ 83 Linux

Command (m for help): x

Expert command (m for help): b
Partition number (1-4): 1
New beginning of data (1-209712509, default 63):
Using default value 63

Expert command (m for help): p

Disk /dev/xvdb: 255 heads, 63 sectors, 13054 cylinders

Nr AF Hd Sec Cyl Hd Sec Cyl Start Size ID
1 00 1 1 0 254 63 1023 63 209712447 83
2 00 0 0 0 0 0 0 0 0 00
3 00 0 0 0 0 0 0 0 0 00
4 00 0 0 0 0 0 0 0 0 00

Expert command (m for help): r

Command (m for help): p

Disk /dev/xvdb: 107.4 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xd288a1a2

Device Boot Start End Blocks Id System
/dev/xvdb1 1 13054 104856223+ 83 Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.

Making sure the partition is created:

[root@liunxbox64 ~]# fdisk -l /dev/xvdb

Disk /dev/xvdb: 107.4 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xd288a1a2

Device Boot Start End Blocks Id System
/dev/xvdb1 1 13054 104856223+ 83 Linux
[root@liunxbox64 ~]# fdisk -l /dev/xvdb1

Disk /dev/xvdb1: 107.4 GB, 107372772864 bytes
255 heads, 63 sectors/track, 13053 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Setting the partition's file system type to ext4:
[root@liunxbox64 ~]# mkfs.ext4 /dev/xvdb1

mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
6553600 inodes, 26214055 blocks
1310702 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
800 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 39 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

Editing the /etc/fstab file to add the newly created ext4 partition:
Finding partition UUID:

[root@liunxbox64 ~]# blkid /dev/xvdb1
/dev/xvdb1: UUID="269fe527-501d-4598-bc52-e0b862af249d" TYPE="ext4"

Now edit /etc/fstab to add an entry with the newly created partition's UUID, as below:

$ vi /etc/fstab
UUID=269fe527-501d-4598-bc52-e0b862af249d /hadoop2 ext4 errors=remount-ro 0 1

Mounting partition:
[root@liunxbox64 ~]# mkdir /hadoop2

[root@liunxbox64 ~]# mount /hadoop2

[root@liunxbox64 ~]# df -h /hadoop2
Filesystem Size Used Avail Use% Mounted on
/dev/xvdb1 99G 60M 94G 1% /hadoop2
Verifying the partition is listed in Linux:
[root@liunxbox64 ~]# df -h

Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_centos-lv_root
35G 3.2G 31G 10% /
tmpfs 935M 0 935M 0% /dev/shm
/dev/xvda1 477M 75M 377M 17% /boot
/dev/xvdb1 99G 60M 94G 1% /hadoop2   <-- the new partition is now listed

Enterprise Hadoop solutions distributed by key Hadoop vendors


Let's start with the Cloudera Enterprise Data Hub:

[Image: Cloudera Enterprise Data Hub]

Here is the offering from Hortonworks:

[Image: Hortonworks Enterprise Hadoop]

And this is how MapR packages Enterprise Hadoop:

[Image: MapR Enterprise Hadoop]

And finally, the Pivotal Enterprise Hadoop offering:

[Image: Pivotal Enterprise Hadoop]

Keywords: Apache Hadoop, Cloudera, Hortonworks, Pivotal, MapR, Big Data

Error with git as “git-sh-setup: No such file or directory” with OSX Yosemite and oh-my-zsh (Z-Shell)


I recently received the following error while pushing/pulling code to/from git:

$ git pull
/Applications/Xcode.app/Contents/Developer/usr/libexec/git-core/git-pull: line 11: git-sh-setup: No such file or directory

This happened after I updated my MacBook to OSX Yosemite. I use Zsh (oh-my-zsh) as my favorite shell and iTerm2 as my favorite terminal.

After looking around, I found that the problem is that Zsh does not invoke /usr/bin/login when opening the command window, and does not clear the environment variables on close.

The potential solutions:

1. Edit the command used when opening a new shell (preferred, as it keeps your encoding and theme intact):

  • Open iTerm2 Preferences > Profiles > Default > Command, and set Command to: /bin/bash -c /bin/zsh

2. You can also edit the same command to use your login:

  • Open iTerm2 Preferences > Profiles > Default > Command, and set Command to: /usr/bin/login -f <your user name>

More info is available at Stack Overflow.

Unknown Entity exception with Java Hibernate


If you hit an "Unknown entity" exception with Java Hibernate, as below:

org.hibernate.MappingException: Unknown entity: com.myapplication.mymodel.modelname.ModalClass

It means the class has not been added to the Hibernate configuration. To solve it, simply add the class to your hibernate.cfg.xml as below:

  • <session-factory>
    • ……
    • <mapping class="com.myapplication.mymodel.modelname.ModalClass" />
  • </session-factory>

You can find more details at StackOverflow.

Keywords: Java, Hibernate, Class, Unknown Entity, Exception

Data360 Conference: Introduction to Big Data, Hadoop and Big Data Analytics (Presentation Slides)


Yesterday I participated in the Data360 conference and gave an introductory presentation about Big Data, Hadoop, and Big Data Analytics. It was a great way to connect with the community and share information.

[Image: Data360 conference]

The full presentation slides are on SlideShare; you can get them directly from the link below:

http://www.slideshare.net/Avkashslide/data-360-conference-introduction-to-big-data-hadoop-and-big-data-analytics

Keywords: Hadoop, Big Data, Analytics

Open Source Distributed Analytics Engine with SQL interface and OLAP on Hadoop by eBay – Kylin


What is Kylin?

  • Kylin is an open source distributed analytics engine from eBay, with a SQL interface and multi-dimensional analysis (OLAP) support for extremely large datasets on Hadoop.

[Image: Kylin]

Key Features:

  • Extremely Fast OLAP Engine at Scale:
    • Kylin is designed to reduce query latency on Hadoop for datasets of 10+ billion rows
  • ANSI-SQL Interface on Hadoop:
    • Kylin offers ANSI-SQL on Hadoop and supports most ANSI-SQL query functions
  • Interactive Query Capability:
    • Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset
  • MOLAP Cube:
    • Users can define a data model and pre-build it in Kylin with 10+ billion raw data records
  • Seamless Integration with BI Tools:
    • Kylin currently offers integration capability with BI Tools like Tableau.
  • Other Highlights:
    • Job Management and Monitoring
    • Compression and Encoding Support
    • Incremental Refresh of Cubes
    • Leverage HBase Coprocessor for query latency
    • Approximate Query Capability for distinct Count (HyperLogLog)
    • Easy Web interface to manage, build, monitor and query cubes
    • Security capability to set ACL at Cube/Project Level
    • Support LDAP Integration

Keywords: Kylin, Big Data, Hadoop, Jobs, OLAP, SQL, Query