Archive
Handling Apache Hive Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Recently I hit the following error when starting Apache Hive on my server:
ubuntu@HIVE_SERVER:~$ hive
log4j:ERROR Could not find value for key log4j.appender.DEBUG
log4j:ERROR Could not instantiate appender named "DEBUG".
log4j:ERROR Could not find value for key log4j.appender.WARN
log4j:ERROR Could not instantiate appender named "WARN".
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /var/log/hive/hive.log (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
at java.io.FileOutputStream.<init>(FileOutputStream.java:136)
at org.apache.log4j.FileAppender.setFile(FileAppender.java:294)
at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:165)
at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:223)
at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:172)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:104)
at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:842)
at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:768)
at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:648)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:514)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:580)
at org.apache.log4j.PropertyConfigurator.configure(PropertyConfigurator.java:415)
at org.apache.hadoop.hive.common.LogUtils.initHiveLog4j(LogUtils.java:52)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:629)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
Hive history file=/tmp/ubuntu/hive_job_log_ubuntu_201306111657_1917799460.txt
hive> show tables;
FAILED: Error in metadata: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
hive> quit;
Potential Root Cause:
This happened mainly because I had a previous Hive installation on the machine and had uninstalled it. Not all of the files were removed, the /var/log/hive/hive.log file was missing, and most of the settings were messed up.
Even after I created an empty /var/log/hive/hive.log (touch /var/log/hive/hive.log followed by chmod 777 /var/log/hive/hive.log), the problem persisted.
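The FileNotFoundException in the log comes from log4j failing to open its log file, which usually means the parent directory is missing or not writable by the user running hive. Here is a minimal sketch of that check; the function name check_log_path is my own and not part of Hive. It covers only the logging error, not the metastore failure, which in my case had other causes.

```shell
# check_log_path FILE: report whether FILE's parent directory exists and
# whether the current user could create or append to FILE.
check_log_path() {
  log_file=$1
  log_dir=$(dirname "$log_file")
  if [ ! -d "$log_dir" ]; then
    echo "missing directory: $log_dir"
    return 1
  fi
  if [ -e "$log_file" ] && [ ! -w "$log_file" ]; then
    echo "not writable: $log_file"
    return 1
  fi
  if [ ! -e "$log_file" ] && [ ! -w "$log_dir" ]; then
    echo "cannot create files in: $log_dir"
    return 1
  fi
  echo "ok: $log_file"
}
```

Usage: check_log_path /var/log/hive/hive.log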
Removing Hive:
ubuntu@HIVE_SERVER:~$ sudo apt-get remove hive
Reading package lists… Done
Building dependency tree
Reading state information… Done
The following packages were automatically installed and are no longer required:
libopts25 libcap2 hbase ntp
Use ‘apt-get autoremove’ to remove them.
The following packages will be REMOVED:
hive
0 upgraded, 0 newly installed, 1 to remove and 26 not upgraded.
After this operation, 175 MB disk space will be freed.
Do you want to continue [Y/n]? Y
(Reading database … 57309 files and directories currently installed.)
Removing hive …
Processing triggers for man-db …
Checking what else is left after removal:
ubuntu@HIVE_SERVER:~$ sudo find / -name "hive"
/etc/hive
/usr/lib/hive
/var/lib/hive
Deleting the residual Hive files listed above:
ubuntu@HIVE_SERVER:~$ sudo rm -rf /etc/hive
ubuntu@HIVE_SERVER:~$ sudo rm -rf /usr/lib/hive
ubuntu@HIVE_SERVER:~$ sudo rm -rf /var/lib/hive
Verifying that no Hive files remain on my machine:
ubuntu@HIVE_SERVER:~$ sudo find / -name "hive"
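The leftover check can be wrapped in a small helper so it is easy to repeat before and after a reinstall. This is only a sketch: list_leftovers is a hypothetical name, and skipping /proc is my own choice to keep the output free of noise. On Debian and Ubuntu, apt-get purge also removes a package's configuration files, although directories created at runtime may still be left behind.

```shell
# list_leftovers NAME [ROOT]: print every path named NAME under ROOT
# (default /), skipping /proc. Run it with sudo to see all directories.
list_leftovers() {
  find "${2:-/}" -path /proc -prune -o -name "$1" -print 2>/dev/null
}
```

Usage: list_leftovers hive
An empty result means nothing with that name is left.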
Installing Hive (from the Cloudera distribution):
ubuntu@HIVE_SERVER:~$ sudo apt-get install hive
Reading package lists… Done
Building dependency tree
Reading state information… Done
The following NEW packages will be installed:
hive
0 upgraded, 1 newly installed, 0 to remove and 26 not upgraded.
Need to get 0 B/32.4 MB of archives.
After this operation, 175 MB of additional disk space will be used.
Selecting previously unselected package hive.
(Reading database … 50478 files and directories currently installed.)
Unpacking hive (from …/hive_0.10.0+78-1.cdh4.2.1.p0.8~precise-cdh4.2.1_all.deb) …
Processing triggers for man-db …
Setting up hive (0.10.0+78-1.cdh4.2.1.p0.8~precise-cdh4.2.1) …
update-alternatives: using /etc/hive/conf.dist to provide /etc/hive/conf (hive-conf) in auto mode.
Listing the Hive files on my machine:
ubuntu@HIVE_SERVER:~$ sudo find / -name "hive"
/etc/hive
/run/hive
/usr/lib/hive
/usr/lib/hive/lib/php/packages/serde/org/apache/hadoop/hive
/usr/lib/hive/docs/api/org/apache/hadoop/hive
/usr/lib/hive/docs/api/org/apache/hive
/usr/lib/hive/bin/hive
/usr/share/doc/hive
/usr/share/doc/hive/examples/test-plugin/src/org/apache/hive
/usr/bin/hive
/var/lib/hive
/var/log/hive
Testing Hive:
ubuntu@HIVE_SERVER:~$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.2.1.jar!/hive-log4j.properties
Hive history file=/tmp/ubuntu/hive_job_log_ubuntu_201306111709_559685140.txt
hive> show tables;
OK
Time taken: 14.431 seconds
hive> quit;
It's working!
Handling Permission denied error on HDFS
When you try creating a folder or a file on HDFS you may hit the following error:
ubuntu@HADOOP_CLUSTER:~$ hdfs dfs -mkdir /abc
mkdir: Permission denied: user=ubuntu, access=WRITE, inode="/user":hdfs:hadoop:drwxr-xr-x
This particular problem can happen because the currently logged-in user may not be part of the hadoop group.
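The error message itself tells you who was denied and by what: the user, the requested access, and then the inode path followed by its owner, group, and mode. Below is a small sketch that splits such a message into labelled fields; parse_denied is a made-up helper name, and the pattern only assumes the message layout shown above.

```shell
# parse_denied MESSAGE: split an HDFS "Permission denied" message into
# user, access, path, owner, group and mode.
parse_denied() {
  printf '%s\n' "$1" | sed -n \
    's/.*user=\([^,]*\), access=\([^,]*\), inode="\([^"]*\)":\([^:]*\):\([^:]*\):\(.*\)$/user=\1 access=\2 path=\3 owner=\4 group=\5 mode=\6/p'
}
```

The last field is the mode, which decides whether the owner, the group, or others get write access to that path.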
So let's find out who the currently logged-in user is:
ubuntu@HADOOP_CLUSTER:~$ whoami
ubuntu
Now we can find out the list of groups the user “ubuntu” is associated with:
ubuntu@HADOOP_CLUSTER:~$ groups ubuntu
ubuntu : ubuntu adm dialout cdrom floppy audio dip video plugdev netdev admin
As we can see above, the user ubuntu is not part of the hadoop group, so we will go ahead and add this user to the hadoop group:
ubuntu@HADOOP_CLUSTER:~$ sudo adduser ubuntu hadoop
Adding user `ubuntu' to group `hadoop' ...
Adding user ubuntu to group hadoop
Done.
Now we can try the exact same command we tried before:
ubuntu@HADOOP_CLUSTER:~$ hdfs dfs -mkdir /abc
The command is successful, and if we list the files and folders on HDFS we can find our folder as below:
ubuntu@HADOOP_CLUSTER:~$ hdfs dfs -ls /
Found 5 items
drwxr-xr-x - ubuntu hadoop 0 2013-06-05 05:58 /abc
-rw-r--r-- 1 ubuntu hadoop 28865 2013-06-04 16:14 /history.log
drwxr-xr-x - mapred hadoop 0 2013-06-04 15:56 /home
drwxrwxrwt - mapred hadoop 0 2013-06-05 05:48 /tmp
drwxr-xr-x - hdfs hadoop 0 2013-06-05 05:53 /user
To make sure that user “ubuntu” is part of “hadoop” group we can verify as below:
ubuntu@HADOOP_CLUSTER:~$ groups ubuntu
ubuntu : ubuntu adm dialout cdrom floppy audio dip video plugdev netdev admin hadoop
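The same membership check can be scripted instead of read by eye. A minimal sketch, where in_group is a hypothetical helper built on the standard id command. Keep in mind that a new group membership normally applies only to new login sessions, and that in a default setup HDFS typically resolves a user's groups on the NameNode host, so the membership has to exist there.

```shell
# in_group USER GROUP: succeed if USER is a member of GROUP on this host.
in_group() {
  id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF "$2"
}
```

Usage: in_group ubuntu hadoop && echo "ubuntu is in hadoop"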
Keywords: Hadoop, HDFS, Error
Handling Cloudera Hadoop Cluster from command line
If you installed Hadoop from the Cloudera distribution without Cloudera Manager, you have to manage your cluster from the console, and things are not easy. Here is some important information for working with Cloudera Hadoop from the console:
Where the hadoop binary is located:
ubuntu@HADOOP_CLUSTER:~$ which hadoop
- /usr/bin/hadoop
Files located at /usr/lib/hadoop/
drwxr-xr-x 2 root root 4096 May 22 21:00 bin
drwxr-xr-x 2 root root 4096 May 23 00:25 client
drwxr-xr-x 2 root root 4096 May 23 00:25 client-0.20
drwxr-xr-x 2 root root 4096 May 22 21:00 cloudera
drwxr-xr-x 2 root root 4096 May 22 21:00 etc
-rw-r--r-- 1 root root 16678 Apr 22 17:38 hadoop-annotations-2.0.0-cdh4.2.1.jar
lrwxrwxrwx 1 root root 37 Apr 22 17:38 hadoop-annotations.jar -> hadoop-annotations-2.0.0-cdh4.2.1.jar
-rw-r--r-- 1 root root 46858 Apr 22 17:38 hadoop-auth-2.0.0-cdh4.2.1.jar
lrwxrwxrwx 1 root root 30 Apr 22 17:38 hadoop-auth.jar -> hadoop-auth-2.0.0-cdh4.2.1.jar
-rw-r--r-- 1 root root 2267883 Apr 22 17:38 hadoop-common-2.0.0-cdh4.2.1.jar
-rw-r--r-- 1 root root 1213897 Apr 22 17:38 hadoop-common-2.0.0-cdh4.2.1-tests.jar
lrwxrwxrwx 1 root root 32 Apr 22 17:38 hadoop-common.jar -> hadoop-common-2.0.0-cdh4.2.1.jar
drwxr-xr-x 3 root root 4096 May 22 21:00 lib
drwxr-xr-x 2 root root 4096 May 23 00:25 libexec
drwxr-xr-x 2 root root 4096 May 22 21:00 sbin
Hadoop cluster specific XML configuration files are stored here:
lrwxrwxrwx 1 root root 16 Apr 22 17:38 hadoop -> /etc/hadoop/conf
ubuntu@HADOOP_CLUSTER:~$ ls -l /usr/lib/hadoop/etc/hadoop
lrwxrwxrwx 1 root root 16 Apr 22 17:38 /usr/lib/hadoop/etc/hadoop -> /etc/hadoop/conf
ubuntu@HADOOP_CLUSTER:~$ ls -l /etc/hadoop/conf
lrwxrwxrwx 1 root root 29 May 22 21:00 /etc/hadoop/conf -> /etc/alternatives/hadoop-conf
ubuntu@HADOOP_CLUSTER:~$ ls -l /etc/alternatives/hadoop-conf
lrwxrwxrwx 1 root root 23 May 22 22:02 /etc/alternatives/hadoop-conf -> /etc/hadoop/conf.avkash
ubuntu@HADOOP_CLUSTER:~$ ls -l /etc/hadoop/conf.avkash/
- core-site.xml
- hadoop-metrics.properties
- hadoop-metrics2.properties
- hdfs-site.xml
- log4j.properties
- mapred-site.xml
- slaves
- ssl-client.xml.example
- ssl-server.xml.example
- yarn-env.sh
- yarn-site.xml
Note: Alternatively, you can search for the Hadoop configuration files directly:
- ubuntu@ec2-54-214-67-144:~$ sudo find / -name "hdfs*.xml"
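The configuration directory above is reached through a chain of symlinks that I followed by hand with ls -l. The sketch below prints each hop of such a chain; show_link_chain is my own helper name, and it does not guard against circular links.

```shell
# show_link_chain PATH: print each hop of a symlink chain, one per line,
# ending with the final non-symlink target.
show_link_chain() {
  p=$1
  while [ -L "$p" ]; do
    target=$(readlink "$p")
    echo "$p -> $target"
    case $target in
      /*) p=$target ;;                 # absolute target
      *)  p=$(dirname "$p")/$target ;; # target relative to the link's directory
    esac
  done
  echo "$p"
}
```

Usage: show_link_chain /usr/lib/hadoop/etc/hadoop
With GNU coreutils, readlink -f prints only the final target.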
Hadoop cluster specific scripts are located here:
- Hadoop
- /usr/lib/hadoop/libexec/hadoop-config.sh
- /usr/lib/hadoop/libexec/hadoop-layout.sh
- /usr/lib/hadoop/sbin/hadoop-daemon.sh
- /usr/lib/hadoop/sbin/hadoop-daemons.sh
- MapReduce
- /usr/lib/hadoop-0.20-mapreduce/bin/hadoop-daemon.sh
- /usr/lib/hadoop-0.20-mapreduce/bin/hadoop-config.sh
- /usr/lib/hadoop-0.20-mapreduce/bin/hadoop-daemons.sh
The scripts to start, stop, or restart the Hadoop services are located here:
- Hadoop Namenode and Job Tracker
- /etc/init.d/hadoop-0.20-mapreduce-jobtracker
- /etc/init.d/hadoop-hdfs-namenode
- Hadoop Datanode and TaskTracker
- /etc/init.d/hadoop-hdfs-datanode
- /etc/init.d/hadoop-0.20-mapreduce-tasktracker
If you decide to stop or start the Hadoop services manually, you can do the following:
- Stop Services:
- sudo /etc/init.d/hadoop-hdfs-namenode stop
- sudo /etc/init.d/hadoop-hdfs-datanode stop
- sudo /etc/init.d/hadoop-0.20-mapreduce-jobtracker stop
- sudo /etc/init.d/hadoop-0.20-mapreduce-tasktracker stop
- Start Services
- sudo /etc/init.d/hadoop-hdfs-namenode start
- sudo /etc/init.d/hadoop-hdfs-datanode start
- sudo /etc/init.d/hadoop-0.20-mapreduce-jobtracker start
- sudo /etc/init.d/hadoop-0.20-mapreduce-tasktracker start
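The four commands can be wrapped in one loop. This is a sketch for a node that runs all four daemons; hadoop_services and the RUN variable are names I made up, and by default the function only prints the commands so you can review them first.

```shell
# hadoop_services ACTION: print the init-script commands for ACTION
# (start, stop, ...) in the order listed above.
# Set RUN=1 in the environment to execute them instead of printing.
hadoop_services() {
  action=$1
  for svc in hadoop-hdfs-namenode hadoop-hdfs-datanode \
             hadoop-0.20-mapreduce-jobtracker hadoop-0.20-mapreduce-tasktracker; do
    if [ "${RUN:-0}" = 1 ]; then
      sudo "/etc/init.d/$svc" "$action"
    else
      echo "sudo /etc/init.d/$svc $action"
    fi
  done
}
```

Usage: hadoop_services stop (dry run), then RUN=1 hadoop_services stop to actually stop the services.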
Running hdfs commands as the hdfs user:
- sudo -u hdfs hdfs dfs -mkdir /tmp
- sudo -u hdfs hdfs dfs -chmod -R 1777 /tmp
- sudo -u hdfs hdfs dfs -mkdir -p /var/lib/hadoop-hdfs/cache/
- hdfs dfs -ls /
Running Hadoop example jobs from console:
- ubuntu@HADOOP_CLUSTER:~$ hdfs dfs -copyFromLocal history.log /
- ubuntu@HADOOP_CLUSTER:~$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar wordcount /history.log /home/ubuntu/results
- 13/06/04 16:14:34 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
- 13/06/04 16:14:35 INFO input.FileInputFormat: Total input paths to process : 1
- 13/06/04 16:14:35 INFO mapred.JobClient: Running job: job_201306041556_0005
- 13/06/04 16:14:36 INFO mapred.JobClient: map 0% reduce 0%
The following messages mean that HDFS is running but the JobTracker is not:
- 13/06/04 15:48:48 INFO ipc.Client: Retrying connect to server: HADOOP_CLUSTER.us-west-2.compute.amazonaws.com/10.254.42.72:8021. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
- 13/06/04 15:48:49 INFO ipc.Client: Retrying connect to server: HADOOP_CLUSTER.us-west-2.compute.amazonaws.com/10.254.42.72:8021. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
- 13/06/04 15:48:50 INFO ipc.Client: Retrying connect to server: HADOOP_CLUSTER.us-west-2.compute.amazonaws.com/10.254.42.72:8021. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
- 13/06/04 15:48:51 INFO ipc.Client: Retrying connect to server: HADOOP_CLUSTER.us-west-2.compute.amazonaws.com/10.254.42.72:8021. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
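Before resubmitting the job, it can help to check whether anything is listening on the JobTracker port, which is 8021 in the retry messages above. The sketch below is one way to do that; port_open is a hypothetical helper, and it assumes bash with /dev/tcp support and the coreutils timeout command are available.

```shell
# port_open HOST PORT: succeed if a TCP connection to HOST:PORT can be
# opened within 3 seconds. Uses bash's /dev/tcp redirection.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

Usage: port_open 10.254.42.72 8021 || echo "JobTracker port is not reachable"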
Keywords: Hadoop, MapReduce, Cloudera, Services
Platfora BI Software
Apache Weave: Big Data Application runtime and development framework by Continuuity
Continuuity decided to build Weave and be part of the journey to take Apache YARN to the next level of usability and functionality. Continuuity has been using Weave extensively to support their products and seen the benefit and power of Apache YARN and Weave combined. Continuuity decided to share Weave under the Apache 2.0 license in an effort to collaborate with members of the community, broaden the set of applications and patterns that Weave supports, and further the overall adoption of Apache YARN.
Weave is NOT a replacement for Apache YARN. It is instead a value-added framework that operates on top of Apache YARN.
What is Weave: Weave is a simple set of libraries that allows you to easily manage distributed applications through an abstraction layer built on Apache YARN. Weave allows you to use YARN’s distributed capabilities with a programming model that is similar to running threads.
Features of Weave:
- Simple API for specifying, running and managing application lifecycle
- An easy way to communicate with an application or parts of an application
- A generic Application Master to better support simple applications
- Simplified archive management and local file transport
- Improved control over application logs, metrics and errors
- Discovery service
- And many more…
Weave source code is available on GitHub at http://github.com/continuuity/weave under the Apache 2.0 License.
Learn more at http://www.continuuity.com/.
Keywords: Hadoop, YARN, MapReduce, Big Data
Top movies visualizations using Platfora
Here are some cool visualizations of the IMDB movie dataset (60,000 records from 1893-2004) using Platfora…
Top Criteria: 7.5+ rating and 50,000+ votes.
Timeline: From 1893 – 2004
Top movies of all time:
Top “R” Rated movies:
Top PG-13 movies of all time:
Top Comedy Movies:
Keywords: Hadoop, BigData, Data Visualization
IMDB Movie Dataset Visualization using Platfora
Here are some cool visualizations of the IMDB movie dataset (60,000 records from 1893-2004) using Platfora…
Top movies based on 8.0 rating and highest voting:
Total yearly budget and movie production between 1893-2004
Total movies produced from 1893-2004
Total yearly budget for movies production from 1893-2004
Total movies produced based on MPAA Ratings between 1893-2004
Total movies based on MPAA rating between 1893-2004
Fact: In 1942, a total of 100 animations (of all lengths) were produced, compared to only 94 in 2003.
All-time total of movies rated 9+ (the data for the year 2005 is incomplete):
Getting Hadoop Configuration Parameters directly at Hadoop Command Prompt
Hadoop users can run the "hdfs getconf -confKey KEY_NAME" command to get the value the cluster has configured for any configuration key.
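As a small illustration, the command can be wrapped so that several keys are printed in one go. The helper names get_conf and show_conf are my own, and the key names in the usage line are assumptions based on Hadoop 2 style property names; older releases use different names for some of them.

```shell
# get_conf KEY: print the configured value of KEY via `hdfs getconf`.
get_conf() {
  if ! command -v hdfs >/dev/null 2>&1; then
    echo "hdfs command not found on PATH" >&2
    return 127
  fi
  hdfs getconf -confKey "$1"
}

# show_conf KEY...: print "KEY = value" for each key given.
show_conf() {
  for key in "$@"; do
    printf '%s = %s\n' "$key" "$(get_conf "$key")"
  done
}
```

Usage: show_conf fs.defaultFS dfs.replication dfs.blocksize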
HDInsight (Hadoop on Azure) Demo: Submit MapReduce job, process result from Pig and filter final results in Hive
In this demo we will submit a WordCount MapReduce job to an HDInsight cluster, process the results in Pig, and then filter the results in Hive by storing the structured results in a table.
Step 1: Submitting WordCount MapReduce Job to 4 node HDInsight cluster:
c:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd jar c:\apps\Jobs\templates\635000448534317551.hadoop-examples.jar wordcount /user/admin/DaVinci.txt /user/admin/outcount
The results are stored in /user/admin/outcount
Verify the results at Interactive Shell:
Step 2: Loading the /user/admin/outcount/part-r-00000 results in Pig:
First we load the flat text file data in (words, wordCount) format as below:
Grunt>mydata = load '/user/admin/outcount/part-r-00000' using PigStorage('\t') as (words:chararray, wordCount:int);
Grunt>first10 = LIMIT mydata 10;
Grunt>dump first10;

Note: This shows results for the words with frequency 1. We need to reorder the results in descending order to get the words with the highest frequency.
Grunt>mydatadsc = order mydata by wordCount DESC;
Grunt>first10 = LIMIT mydatadsc 10;
Grunt>dump first10;
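To sanity-check what the ORDER ... DESC and LIMIT steps should return, the same top-N selection can be reproduced on a local copy of the tab-separated output file, on any machine with a POSIX shell. This is only a cross-check sketch; top_words is a hypothetical helper and the local file name in the usage line is an assumption.

```shell
# top_words FILE [N]: print the N lines (default 10) with the highest
# count, where each line is "word<TAB>count".
top_words() {
  tab=$(printf '\t')
  sort -t "$tab" -k2,2nr "$1" | head -n "${2:-10}"
}
```

Usage: top_words part-r-00000 10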
Now we have the result as expected. Let's store the results in a file on HDFS, comma-separated so that they match the Hive table created in the next step:
Grunt>store first10 into '/user/admin/myresults10' using PigStorage(',');
Step 3: Filtering Pig Results into a Hive Table:
First we will create a table in Hive using the same format (words and wordscount separated by a comma):
hive> create table wordslist10(words string, wordscount int) row format delimited fields terminated by ',' lines terminated by '\n';
Once the table is created, we load the file stored by Pig, '/user/admin/myresults10/part-r-00000', into the wordslist10 table we just created:
hive> load data inpath '/user/admin/myresults10/part-r-00000' overwrite into table wordslist10;
That's all. You can now see the results in the table:
hive> select * from wordslist10;
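If the select returns NULL in the wordscount column, the usual suspect is a delimiter mismatch between the stored file and the table definition, since Hive applies the row format only when reading. The sketch below checks a local copy of the file for lines that are not of the form word,integer; check_two_fields is my own helper name, and it assumes the words themselves contain no commas.

```shell
# check_two_fields FILE: print every line that is not "word,<integer>"
# and return non-zero if any such line was found.
check_two_fields() {
  awk -F',' 'NF != 2 || $2 !~ /^[0-9]+$/ { bad++; print "line " NR ": " $0 }
             END { exit (bad > 0 ? 1 : 0) }' "$1"
}
```

Usage: check_two_fields part-r-00000 && echo "file matches the table layout"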
KeyWords: Apache Hadoop, MapReduce, Pig, Hive, HDInsight, BigData
Understanding HBase tables and HDFS file structure in Hadoop
Learn more on HBase here: http://hbase.apache.org/book.html
Let's create an HBase table first and add some data to it.
[cloudera@localhost ~]$ hbase shell
13/03/27 00:04:31 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.2-cdh4.2.0, rUnknown, Fri Feb 15 11:51:18 PST 2013
hbase(main):001:0> create 'students', 'name'
0 row(s) in 2.5020 seconds
=> Hbase::Table - students
hbase(main):002:0> list 'students'
TABLE
students
1 row(s) in 0.0540 seconds
=> ["students"]
hbase(main):003:0> put 'students', 'row1', 'name:id1','John'
0 row(s) in 0.0400 seconds
hbase(main):004:0> put 'students', 'row2', 'name:id2','Jim'
0 row(s) in 0.0070 seconds
hbase(main):005:0> put 'students', 'row3', 'name:id3','Will'
0 row(s) in 0.0070 seconds
hbase(main):006:0> put 'students', 'row4', 'name:id4','Henry'
0 row(s) in 0.0040 seconds
hbase(main):007:0> put 'students', 'row5', 'name:id5','Ken'
0 row(s) in 0.0440 seconds
hbase(main):008:0> scan 'students'
ROW COLUMN+CELL
row1 column=name:id1, timestamp=1364357135479, value=John
row2 column=name:id2, timestamp=1364357147587, value=Jim
row3 column=name:id3, timestamp=1364357161684, value=Will
row4 column=name:id4, timestamp=1364357173959, value=Henry
row5 column=name:id5, timestamp=1364357189836, value=Ken
5 row(s) in 0.0450 seconds
As you can see above, we have the table students, and the list command returns only that one table name.
In my Hadoop cluster, HBase is configured to use the /hbase folder, so now let's check the disk utilization of the /hbase folder:
[cloudera@localhost ~]$ hdfs dfs -du /hbase
2868 /hbase/-ROOT-
245 /hbase/.META.
0 /hbase/.archive
0 /hbase/.corrupt
2424 /hbase/.logs
0 /hbase/.oldlogs
0 /hbase/.tmp
38 /hbase/hbase.id
3 /hbase/hbase.version
928 /hbase/students
Above, students is a user table, whereas -ROOT- and .META. are HBase catalog tables, in which HBase keeps its catalog of the user tables. To understand each table's structure, let's run the describe command:
hbase(main):010:0> describe '-ROOT-'
DESCRIPTION ENABLED
{NAME => '-ROOT-', IS_ROOT => 'true', IS_META => 't true
rue’, FAMILIES => [{NAME => 'info', DATA_BLOCK_ENCO
DING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_
SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '1
0', TTL => '2147483647', MIN_VERSIONS => '0', KEEP_
DELETED_CELLS => 'false', BLOCKSIZE => '8192', ENCO
DE_ON_DISK => 'true', IN_MEMORY => 'true', BLOCKCAC
HE => 'true'}]}
1 row(s) in 0.0700 seconds
hbase(main):011:0> describe '.META.'
DESCRIPTION ENABLED
{NAME => '.META.', IS_META => 'true', FAMILIES => [ true
{NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLO
OMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPR
ESSION => 'NONE', VERSIONS => '10', TTL => '2147483
647', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'f
alse', BLOCKSIZE => '8192', ENCODE_ON_DISK => 'true
', IN_MEMORY => 'true', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0470 seconds
hbase(main):012:0> describe 'students'
DESCRIPTION ENABLED
{NAME => 'students', FAMILIES => [{NAME => 'name', true
DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE
', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPR
ESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147
483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE =
> '65536', IN_MEMORY => 'false', ENCODE_ON_DISK =>
'true', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0270 seconds
Now we can check the file structure for the user table 'students' as below:
[cloudera@localhost ~]$ hdfs dfs -du /hbase/students
697 /hbase/students/.tableinfo.0000000001
0 /hbase/students/.tmp
231 /hbase/students/b2cd87df288adbb7e9ff2423ca532e14
We can also check the structure of the HBase system tables:
[cloudera@localhost ~]$ hdfs dfs -du /hbase/-ROOT-
727 /hbase/-ROOT-/.tableinfo.0000000001
0 /hbase/-ROOT-/.tmp
2141 /hbase/-ROOT-/70236052
[cloudera@localhost ~]$ hdfs dfs -du /hbase/.META.
245 /hbase/.META./1028785192
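The per-entry sizes add up to the totals reported one level higher: 697 + 0 + 231 = 928 for /hbase/students, and 727 + 0 + 2141 = 2868 for /hbase/-ROOT-. A tiny sketch that does this addition on du output read from standard input; sum_du is a hypothetical helper name, and it assumes the size is the first column, as in the listings above.

```shell
# sum_du: read `hdfs dfs -du` output on stdin and print the total of
# the first (size) column.
sum_du() {
  awk '{ total += $1 } END { print total + 0 }'
}
```

Usage: hdfs dfs -du /hbase/students | sum_du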
Now if we dig further into the file structure of the user table students, we can learn about its regioninfo as below:
[cloudera@localhost ~]$ hdfs dfs -ls /hbase/students
Found 3 items
-rw-r--r-- 1 hbase supergroup 697 2013-03-27 00:04 /hbase/students/.tableinfo.0000000001
drwxr-xr-x - hbase supergroup 0 2013-03-27 00:04 /hbase/students/.tmp
drwxr-xr-x - hbase supergroup 0 2013-03-27 00:04 /hbase/students/b2cd87df288adbb7e9ff2423ca532e14
[cloudera@localhost ~]$ hdfs dfs -ls /hbase/students/.tmp
Now here we can see the regioninfo details about the table 'students':
[cloudera@localhost ~]$ hdfs dfs -ls /hbase/students/b2cd87df288adbb7e9ff2423ca532e14
Found 2 items
-rw-r--r-- 1 hbase supergroup 231 2013-03-27 00:04 /hbase/students/b2cd87df288adbb7e9ff2423ca532e14/.regioninfo
drwxr-xr-x - hbase supergroup 0 2013-03-27 00:04 /hbase/students/b2cd87df288adbb7e9ff2423ca532e14/name
[cloudera@localhost ~]$ hdfs dfs -ls /hbase/students/b2cd87df288adbb7e9ff2423ca532e14/name
[cloudera@localhost ~]$ hdfs dfs -cat /hbase/students/b2cd87df288adbb7e9ff2423ca532e14/.regioninfo
=�&:9students,,1364357097018.b2cd87df288adbb7e9ff2423ca532e14students�}�
{NAME => 'students,,1364357097018.b2cd87df288adbb7e9ff2423ca532e14.', STARTKEY => '', ENDKEY => '', ENCODED => b2cd87df288adbb7e9ff2423ca532e14,}
This is the way we can understand more about HBase user table details in HDFS.
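The region name in the .regioninfo output packs several pieces together: the table name, the start key, a region id that appears to be a creation timestamp in milliseconds, and the encoded name that is also the region's directory name under /hbase/students. The sketch below splits a name of that shape using only shell parameter expansion; parse_region_name is my own helper and it assumes the table name contains no comma.

```shell
# parse_region_name NAME: split "table,startkey,regionid.encoded." into
# its four parts.
parse_region_name() {
  name=${1%.}              # drop the trailing dot
  encoded=${name##*.}      # text after the last dot
  rest=${name%.*}          # table,startkey,regionid
  region_id=${rest##*,}    # text after the last comma
  rest=${rest%,*}          # table,startkey
  table=${rest%%,*}        # text before the first comma
  start_key=${rest#*,}     # everything after the first comma
  echo "table=$table start_key=$start_key region_id=$region_id encoded=$encoded"
}
```

Usage: parse_region_name 'students,,1364357097018.b2cd87df288adbb7e9ff2423ca532e14.'
The empty start_key matches the STARTKEY shown above, which is what you would expect for a table with a single region.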
Keywords: Hadoop, HBase, Regions, RegionServer, Catalog, Cloudera, HDFS

















