Amazon EC2 Security Group (Firewall) settings for Hadoop Cluster
When you set up a Hadoop cluster on Amazon EC2, you need to configure the security group (firewall) so that you can reach the cluster directly. The following settings apply to the Cloudera CDH4 Hadoop distribution on EC2:
Port 22 for SSH, ports 7180 and 7182 for Cloudera Manager, port 7432 for PostgreSQL, port 8888 for Hue, and ports 50000-50100 for the Hadoop JobTracker (JT) and HDFS.
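The port list can be written down as data before it is applied to a security group. The following is a minimal sketch, not a script from the original setup: it reads "7180/82" as ports 7180 and 7182, and it builds rules in the dictionary shape that boto3 accepts as the IpPermissions argument of authorize_security_group_ingress. The CIDR 203.0.113.0/24 and the names CDH4_PORTS and build_ingress_rules are placeholders chosen for illustration.

```python
# Ports from the list above; the labels are informal descriptions.
CDH4_PORTS = [
    (22, 22, "SSH"),
    (7180, 7180, "Cloudera Manager"),
    (7182, 7182, "Cloudera Manager"),
    (7432, 7432, "PostgreSQL"),
    (8888, 8888, "Hue"),
    (50000, 50100, "Hadoop JobTracker and HDFS"),
]


def build_ingress_rules(cidr):
    """Return one TCP ingress rule per port range, restricted to `cidr`."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": first,
            "ToPort": last,
            "IpRanges": [{"CidrIp": cidr, "Description": label}],
        }
        for first, last, label in CDH4_PORTS
    ]


if __name__ == "__main__":
    # 203.0.113.0/24 is a documentation-only range; use your own network here.
    for rule in build_ingress_rules("203.0.113.0/24"):
        label = rule["IpRanges"][0]["Description"]
        print(rule["FromPort"], rule["ToPort"], label)
```

Restricting the rules to a known address range rather than 0.0.0.0/0 keeps the management ports off the open internet.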
Getting Hadoop Configuration Parameters directly at Hadoop Command Prompt
Hadoop users can run the “hdfs getconf -confKey KEY_NAME” command to print the value the cluster has configured for any configuration key.
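If you want to read configuration values from a script, the same command can be wrapped in a few lines of Python. This is only a sketch: the helper names getconf_command and get_conf_value are made up for illustration, dfs.replication is just an example key, and get_conf_value assumes the hdfs command is on the PATH.

```python
import subprocess


def getconf_command(key):
    """Build the argument list for `hdfs getconf -confKey <key>`."""
    if not key or any(ch.isspace() for ch in key):
        raise ValueError("configuration key must be a single non-empty token")
    return ["hdfs", "getconf", "-confKey", key]


def get_conf_value(key):
    """Run the command and return its output without surrounding whitespace."""
    result = subprocess.run(
        getconf_command(key), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Only prints the command line; call get_conf_value() on a cluster node.
    print(" ".join(getconf_command("dfs.replication")))
```

On a machine with a Hadoop client installed, get_conf_value("dfs.replication") would return whatever replication factor that cluster is configured with.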
Constructing blocks and file system relationship in HDFS
I am using a 3-node Hadoop cluster running Windows Azure HDInsight for the testing.
In Hadoop we can use the fsck utility to diagnose the health of the HDFS file system: it finds missing files or blocks and checks them for integrity.
Let’s run fsck on the root of the file system:
c:\apps\dist\hadoop-1.1.0-SNAPSHOT>hadoop fsck /
FSCK started by avkash from /10.114.132.17 for path / at Thu Mar 07 05:27:39 GMT 2013
……….Status: HEALTHY
Total size: 552335333 B
Total dirs: 21
Total files: 10
Total blocks (validated): 12 (avg. block size 46027944 B)
Minimally replicated blocks: 12 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 3
FSCK ended at Thu Mar 07 05:27:39 GMT 2013 in 8 milliseconds
The filesystem under path '/' is HEALTHY
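When the report has to be checked by a script rather than by eye, the numeric summary lines can be collected into a dictionary. The sketch below is an illustration only (parse_fsck_summary is a made-up helper, not part of Hadoop) and assumes the "Name: value" layout shown in the output above.

```python
def parse_fsck_summary(report):
    """Collect the numeric 'Name: value' lines of an fsck report into a dict."""
    summary = {}
    for line in report.splitlines():
        name, sep, rest = line.partition(":")
        tokens = rest.split()
        if not sep or not tokens:
            continue
        token = tokens[0]
        try:
            value = int(token)
        except ValueError:
            try:
                value = float(token)
            except ValueError:
                # Skips lines such as "Status: HEALTHY" and the timestamp lines.
                continue
        summary[name.strip()] = value
    return summary


if __name__ == "__main__":
    sample = (
        "Total size: 552335333 B\n"
        "Total dirs: 21\n"
        "Total files: 10\n"
        "Total blocks (validated): 12 (avg. block size 46027944 B)\n"
        "Corrupt blocks: 0\n"
    )
    for name, value in parse_fsck_summary(sample).items():
        print(name, value)
```

A monitoring job could then, for example, alert whenever the "Corrupt blocks" entry is not zero.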
Now let’s list the root (/) recursively to verify the files and directories:
c:\apps\dist\hadoop-1.1.0-SNAPSHOT>hadoop fs -lsr /
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /example
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /example/apps
-rw-r--r-- 3 avkash supergroup 4608 2013-03-04 21:16 /example/apps/cat.exe
-rw-r--r-- 3 avkash supergroup 5120 2013-03-04 21:16 /example/apps/wc.exe
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /example/data
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /example/data/gutenberg
-rw-r--r-- 3 avkash supergroup 1395667 2013-03-04 21:16 /example/data/gutenberg/davinci.txt
-rw-r--r-- 3 avkash supergroup 674762 2013-03-04 21:16 /example/data/gutenberg/outlineofscience.txt
-rw-r--r-- 3 avkash supergroup 1573044 2013-03-04 21:16 /example/data/gutenberg/ulysses.txt
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:15 /hdfs
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:15 /hdfs/tmp
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:15 /hdfs/tmp/mapred
drwx------ - avkash supergroup 0 2013-03-04 21:15 /hdfs/tmp/mapred/system
-rw------- 3 avkash supergroup 4 2013-03-04 21:15 /hdfs/tmp/mapred/system/jobtracker.info
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /hive
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /hive/warehouse
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /hive/warehouse/hivesampletable
-rw-r--r-- 3 avkash supergroup 5015508 2013-03-04 21:16 /hive/warehouse/hivesampletable/HiveSampleData.txt
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /tmp
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /tmp/hive-avkash
drwxrwxrwx - SYSTEM supergroup 0 2013-03-04 21:15 /uploads
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /user
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /user/SYSTEM
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /user/SYSTEM/graph
-rw-r--r-- 3 avkash supergroup 80 2013-03-04 21:16 /user/SYSTEM/graph/catepillar_star.edge
drwxr-xr-x - avkash supergroup 0 2013-03-04 21:16 /user/SYSTEM/query
-rw-r--r-- 3 avkash supergroup 12 2013-03-04 21:16 /user/SYSTEM/query/catepillar_star_rwr.query
drwxr-xr-x - avkash supergroup 0 2013-03-05 07:37 /user/avkash
drwxr-xr-x - avkash supergroup 0 2013-03-04 23:00 /user/avkash/.Trash
-rw-r--r-- 3 avkash supergroup 543666528 2013-03-05 07:37 /user/avkash/data_w3c_large.txt
The listing matches the fsck summary: 21 directories (counting the root itself) and 10 files. Now we can dig further and see how the 12 blocks in HDFS map to individual files:
c:\apps\dist\hadoop-1.1.0-SNAPSHOT>hadoop fsck / -files -blocks -racks
FSCK started by avkash from /10.114.132.17 for path / at Thu Mar 07 05:35:44 GMT 2013
/
/example
/example/apps
/example/apps/cat.exe 4608 bytes, 1 block(s): OK
0. blk_9084981204553714951_1008 len=4608 repl=3 [/fd0/ud0/10.114.236.28:50010, /fd0/ud2/10.114.236.42:50010, /fd1/ud1/10.114.228.35:50010]
/example/apps/wc.exe 5120 bytes, 1 block(s): OK
0. blk_-7951603158243426483_1009 len=5120 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]
/example/data
/example/data/gutenberg
/example/data/gutenberg/davinci.txt 1395667 bytes, 1 block(s): OK
0. blk_3859330889089858864_1005 len=1395667 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]
/example/data/gutenberg/outlineofscience.txt 674762 bytes, 1 block(s): OK
0. blk_-3790696559021810548_1006 len=674762 repl=3 [/fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010, /fd1/ud1/10.114.228.35:50010]
/example/data/gutenberg/ulysses.txt 1573044 bytes, 1 block(s): OK
0. blk_-8671592324971725227_1007 len=1573044 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]
/hdfs
/hdfs/tmp
/hdfs/tmp/mapred
/hdfs/tmp/mapred/system
/hdfs/tmp/mapred/system/jobtracker.info 4 bytes, 1 block(s): OK
0. blk_5997185491433558819_1003 len=4 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]
/hive
/hive/warehouse
/hive/warehouse/hivesampletable
/hive/warehouse/hivesampletable/HiveSampleData.txt 5015508 bytes, 1 block(s): OK
0. blk_44873054283747216_1004 len=5015508 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]
/tmp
/tmp/hive-avkash
/uploads
/user
/user/SYSTEM
/user/SYSTEM/graph
/user/SYSTEM/graph/catepillar_star.edge 80 bytes, 1 block(s): OK
0. blk_-6715685143024983574_1010 len=80 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud2/10.114.236.42:50010, /fd0/ud0/10.114.236.28:50010]
/user/SYSTEM/query
/user/SYSTEM/query/catepillar_star_rwr.query 12 bytes, 1 block(s): OK
0. blk_8102317020509190444_1011 len=12 repl=3 [/fd0/ud0/10.114.236.28:50010, /fd0/ud2/10.114.236.42:50010, /fd1/ud1/10.114.228.35:50010]
/user/avkash
/user/avkash/.Trash
/user/avkash/data_w3c_large.txt 543666528 bytes, 3 block(s): OK
0. blk_2005027737969478969_1012 len=268435456 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud0/10.114.236.28:50010, /fd0/ud2/10.114.236.42:50010]
1. blk_1970119524179712436_1012 len=268435456 repl=3 [/fd1/ud1/10.114.228.35:50010, /fd0/ud0/10.114.236.28:50010, /fd0/ud2/10.114.236.42:50010]
2. blk_6223000007391223944_1012 len=6795616 repl=3 [/fd0/ud0/10.114.236.28:50010, /fd0/ud2/10.114.236.42:50010, /fd1/ud1/10.114.228.35:50010]
Status: HEALTHY
Total size: 552335333 B
Total dirs: 21
Total files: 10
Total blocks (validated): 12 (avg. block size 46027944 B)
Minimally replicated blocks: 12 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 3
FSCK ended at Thu Mar 07 05:35:44 GMT 2013 in 10 milliseconds
The filesystem under path '/' is HEALTHY
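The output shows where all 12 blocks are stored. Nine files occupy one block each, and the large file /user/avkash/data_w3c_large.txt is split into three blocks: two full blocks of 268435456 bytes plus a remainder of 6795616 bytes. Every block is replicated on all three data nodes.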
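The block counts can be reproduced from the file sizes alone. The sketch below is a quick check rather than anything Hadoop runs: it assumes a block size of 268435456 bytes (256 MiB), inferred from the len= value of the full blocks in the output, and split_into_blocks is a helper name made up for illustration.

```python
BLOCK_SIZE = 268435456  # 256 MiB, inferred from the len= of the full blocks

# File sizes in bytes, copied from the fsck output above.
FILE_SIZES = {
    "/example/apps/cat.exe": 4608,
    "/example/apps/wc.exe": 5120,
    "/example/data/gutenberg/davinci.txt": 1395667,
    "/example/data/gutenberg/outlineofscience.txt": 674762,
    "/example/data/gutenberg/ulysses.txt": 1573044,
    "/hdfs/tmp/mapred/system/jobtracker.info": 4,
    "/hive/warehouse/hivesampletable/HiveSampleData.txt": 5015508,
    "/user/SYSTEM/graph/catepillar_star.edge": 80,
    "/user/SYSTEM/query/catepillar_star_rwr.query": 12,
    "/user/avkash/data_w3c_large.txt": 543666528,
}


def split_into_blocks(size, block_size=BLOCK_SIZE):
    """Return the lengths of the blocks a file of `size` bytes occupies."""
    full, remainder = divmod(size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])


if __name__ == "__main__":
    blocks = [b for size in FILE_SIZES.values() for b in split_into_blocks(size)]
    print("total size:", sum(FILE_SIZES.values()))
    print("total blocks:", len(blocks))
    print("avg block size:", sum(blocks) // len(blocks))
    print("large file:", split_into_blocks(FILE_SIZES["/user/avkash/data_w3c_large.txt"]))
```

The totals agree with the fsck summary: 552335333 bytes in 12 blocks, for an average block size of 46027944 bytes.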
Keywords: FSCK, Hadoop, HDFS, Blocks, File System, Replications, HDInsight
Windows Azure HDInsight on Windows 8 (Single Node Hadoop Cluster) Walkthrough
Windows Azure HDInsight is Microsoft’s response to the Big Data movement. Microsoft’s end-to-end roadmap for Big Data embraces Apache Hadoop™ by distributing enterprise-class, Hadoop-based solutions on both Windows Server and Windows Azure.
In this video you will learn the details of running HDInsight on Windows 8 as a single-node Hadoop cluster. It shows examples of C#- and Java-based MapReduce jobs, along with an Apache Pig example.
Microsoft’s roadmap for Big Data: http://www.microsoft.com/bigdata/
Apache Hadoop on Windows Azure: http://hadooponazure.com
Windows Azure HDInsight Introduction and Installation Video
Windows Azure HDInsight is Microsoft’s response to the Big Data movement. Microsoft’s end-to-end roadmap for Big Data embraces Apache Hadoop™ by distributing enterprise-class, Hadoop-based solutions on both Windows Server and Windows Azure.
In this video you will learn how to install HDInsight on a Windows 8 machine as a single-node Hadoop cluster.
Microsoft’s roadmap for Big Data: http://www.microsoft.com/bigdata/
Apache Hadoop on Windows Azure: http://hadooponazure.com
Contact me: @avkashchauhan
Windows Azure HDInsight – Installation Walkthrough
The CTP version of HDInsight for Windows Server and Windows clients is available to download from here.
When you install HDInsight through WebPI, the following components are installed on your Windows machine:
Once installation is complete, you can launch the Hadoop console to verify the installation and check the Hadoop version with the “hadoop version” command, as below:
You can also check System > Services to verify that all HDInsight-specific services are running as expected:
Hortonworks Data Platform (HDP) Sandbox with Hadoop 1.1.2.21 Walkthrough
I recently downloaded the Hortonworks Data Platform (HDP) Sandbox and decided to give it a quick run. As suggested, it takes only about 15 minutes to get familiar with key Hadoop components such as Pig, HBase, Hive, Oozie, and a few others. The Sandbox ships as a pre-configured, pre-installed VMware- or VirtualBox-compatible VMDK running CentOS 6.2 Linux, which you can download and open easily with either of these free applications. This 15-minute walkthrough checks out the key Hadoop components in the HDP Sandbox.
If you are new to Apache Hadoop, this is the easiest way to familiarize yourself with the amazing world of Hadoop.
HDP Sandbox Instructions: http://hortonworks.com/blog/hortonworks-sandbox-the-fastest-on-ramp-to-apache-hadoop/
HDP Download Links: http://hortonworks.com/products/sandbox-instructions/
VMDK Username and Password: root/hadoop
MapReduce in Cloud
If you are looking to the cloud for MapReduce to process a large amount of data, I think this is what you want:
- A collection of machines that are Hadoop/MapReduce ready and instantly available
- You don’t want to build Hadoop (HDFS/MapReduce) instances from scratch: several IaaS services will give you hundreds of machines in the cloud, but building a Hadoop cluster on them yourself would be a nightmare.
- You just want to hook up your data and push MapReduce jobs immediately
- Being in the cloud, you want to harness the power of thousands of machines “instantly” and pay only for the CPU hours you consume.
Here are a few options that are available now, all of which I tried before writing this:
Apache Hadoop on Windows Azure:
Microsoft also has Hadoop/MapReduce running on Windows Azure. It is under a limited CTP, but you can provide your information and request CTP access at the link below:
https://www.hadooponazure.com/
The Developer Preview for the Apache Hadoop-based Services for Windows Azure is available by invitation.
Amazon: Elastic Map Reduce
Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
http://aws.amazon.com/elasticmapreduce/
Google BigQuery:
Besides these, you can also try Google BigQuery, in which you first move your data to Google’s proprietary storage and then run BigQuery on it. Remember that BigQuery is based on Dremel, which is similar to MapReduce but faster due to column-based processing.
Google BigQuery is invitation-only, but you can request access:
https://developers.google.com/bigquery/
Mortar Data:
Another option is Mortar Data, which combines Python and Pig intelligently so you can write jobs easily and visualize the results. I found it very interesting; please have a look:
http://mortardata.com/#!/how_it_works
Big Data at Astronomical Scale: HDF and HUDF
Scientists in general, and astronomers in particular, have been at the forefront when it comes to dealing with large amounts of data. These days, the “Big Data” community, as it is known, includes almost every scientific endeavor — and even you.
In fact, Big Data is not just about extremely large collections of information hidden in databases inside archives like the Barbara A. Mikulski Archive for Space Telescopes. Big Data includes the hidden data you carry with you all the time in now-ubiquitous smart phones: calendars, photographs, SMS messages, usage information and records of our current and past locations. As we live our lives, we leave behind us a “data exhaust” that tells something about ourselves.

…..
In late 1995, the Hubble Space Telescope took hundreds of exposures of a seemingly empty patch of sky near the constellation of Ursa Major (the Big Dipper). The Hubble Deep Field (HDF), as it is known, uncovered a mystifying collection of about 3,000 galaxies at various stages of their evolution. Most of the galaxies were faint, and from them we began to learn a story about our Universe that had not been told before.
……
So was the HDF unique? Were we just lucky to observe a crowded but faint patch of sky? To address this question, and determine if indeed the HDF was a “lucky shot,” in 2004 Hubble took a million-second-long exposure in a similarly “empty” patch of sky: The Hubble Ultra Deep Field (HUDF). The result was even more breathtaking. Containing an estimated 10,000 galaxies, the HUDF revealed glimpses of the first galaxies as they emerge from the so-called “dark ages” — the time shortly after the Big Bang when the first stars reheated the cold, dark universe. As with the HDF, the HUDF data was made immediately available to the community, and has spawned hundreds of publications and several follow-up observations.
Read Full Article at: http://hubblesite.org/blog/2012/04/data-exhaust/
Open Source system for data mining – RapidMiner
RapidMiner is unquestionably the world-leading open-source system for data mining. It is available as a stand-alone application for data analysis and as a data mining engine for integration into your own products. Thousands of applications of RapidMiner in more than 40 countries give their users a competitive edge.
- Data Integration, Analytical ETL, Data Analysis, and Reporting in one single suite
- Powerful but intuitive graphical user interface for the design of analysis processes
- Repositories for process, data and meta data handling
- Only solution with meta data transformation: forget trial and error and inspect results already during design time
- Only solution which supports on-the-fly error recognition and quick fixes
- Complete and flexible: Hundreds of data loading, data transformation, data modeling, and data visualization methods







