Datawarehouse with Hadoop+Hbase+Hive+SpringBatch – Part 2
The SVN codebase for this article is here.
In continuation of Part 1, this section covers:
- Setup of Hadoop, HBase, and Hive on a new Ubuntu VM
- Run Hadoop, HBase, and Hive as services
- Set up the Spring Batch project and run the tests
- Some useful commands/tips
To begin with, let me say that the choice of Hive here was not to use it as a JDBC equivalent; it was to understand how to use Hive as a powerful data warehouse analytics engine.
Setup of Hadoop, HBase, and Hive on a new Ubuntu VM
Download the latest Hadoop, HBase, and Hive from the Apache websites. You can also go to the Cloudera website, get the Cloudera Ubuntu VM, and use apt-get to install hadoop, hbase, and hive. That did not work for me, but if you are adventurous you can try it. You can also try MapR's VMs. Both Cloudera and MapR have good documentation and tutorials.
Unzip the files in the home directory, then edit the .profile file and add the bin directories to the path as below:
export HADOOP_HOME=<HADOOP HOME>
export HBASE_HOME=<HBASE HOME>
export HIVE_HOME=<HIVE HOME>
export PATH=$PATH:$HADOOP_HOME/bin:$HBASE_HOME/bin:$HIVE_HOME/bin

sudo mkdir -p /app/hadoop/tmp
sudo chown <login user>:<machine name> /app/hadoop/tmp
hadoop namenode -format
Set HADOOP_HOME, HBASE_HOME and HIVE_HOME environment variables.
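As a quick sanity check (a minimal sketch; it only assumes the three tarballs were unpacked at the locations exported above), confirm the binaries are on the path:

source ~/.profile
echo $HADOOP_HOME $HBASE_HOME $HIVE_HOME
hadoop version    # should print the Hadoop release you downloaded
hbase version     # should print the HBase release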
Run ifconfig and get the IP address; it will be something like 192.168.45.129.
Go to the /etc/hosts file and add an entry like:
192.168.45.129 <machine name>
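If you prefer to do this from the shell, the same entry can be appended like this (the IP address is the example value from above; substitute your own from ifconfig):

echo "192.168.45.129 <machine name>" | sudo tee -a /etc/hosts
ping -c 1 <machine name>    # verify the name now resolves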
Run Hadoop, HBase, and Hive as services
Go to the Hadoop root folder and run the command below:
start-all.sh
Open a browser and access http://localhost:50070/; it will open the Hadoop admin console. If there are issues, execute the command below and check for exceptions:
tail -f $HADOOP_HOME/logs/hadoop-<login username>-namenode-<machine name>.log
Hadoop runs on port 54310 by default.
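A simple way to confirm the daemons actually came up is jps, which ships with the JDK:

jps
# Expect NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker among
# the running Java processes; if one is missing, check its log under $HADOOP_HOME/logs.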
Go to the HBase root folder and run the commands below:
start-hbase.sh
tail -f $HBASE_HOME/logs/hbase-<login username>-master-<machine name>.log
See if there are any errors. HBase runs on port 60000 by default.
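You can also confirm the master is reachable directly from the HBase shell (a quick check, nothing project-specific):

echo "status" | hbase shell    # prints the number of live region servers
echo "list" | hbase shell      # lists existing tables; both should return without errors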
Go to the Hive root folder and run the command below:
hive --service hiveserver -hiveconf hbase.master=localhost:60000 -hiveconf mapred.job.tracker=local
Notice that by passing the HBase reference we have integrated Hive with HBase. Hive's default port is 10000. Now run Hive as a command-line client as follows:
hive -h localhost
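If the client cannot connect, the Thrift service may simply still be starting; a quick check with standard Ubuntu tooling is:

netstat -nlt | grep 10000    # the Hive server should be listening on its default port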
Create the seed tables as below:
CREATE TABLE weblogs(key int, client_ip string, day string, month string, year string, hour string, minute string, second string, user string, loc string)
row format delimited fields terminated by '\t';

CREATE TABLE hbase_weblogs_1(key int, client_ip string, day string, month string, year string, hour string, minute string, second string, user string, loc string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, cf1:client_ip, cf2:day, cf3:month, cf4:year, cf5:hour, cf6:minute, cf7:second, cf8:user, cf9:loc")
TBLPROPERTIES ("hbase.table.name" = "hbase_weblog");

LOAD DATA LOCAL INPATH '/home/hduser/batch-wordcount/weblogs_parse1.txt' OVERWRITE INTO TABLE weblogs;

INSERT OVERWRITE TABLE hbase_weblogs_1 SELECT * FROM weblogs;
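To sanity-check the load you can query both sides. The Hive commands below assume the CLI's -h and -e options can be combined on this release; if not, run the same SELECTs from the interactive prompt instead:

hive -h localhost -e "SELECT COUNT(*) FROM weblogs;"
hive -h localhost -e "SELECT * FROM hbase_weblogs_1 LIMIT 5;"
echo "scan 'hbase_weblog', {LIMIT => 5}" | hbase shell    # inspect the underlying HBase table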
Set up the Spring Batch project and run the tests
To set up this project, get the latest code from the SVN repository mentioned at the beginning. Download Gradle and set up its path in .profile. Now run the command below to load the data:
gradle -Dtest=org.springframework.data.hadoop.samples.DataloadWorkflowTests test
Run the JUnit test below to get the analysis data:
gradle -Dtest=org.springframework.data.hadoop.samples.AnalyzeWorkflowTests test
The hadoopVersion is 1.0.2. The build.gradle file looks as below:
repositories {
    // Public Spring artefacts
    mavenCentral()
    maven { url "http://repo.springsource.org/libs-release" }
    maven { url "http://repo.springsource.org/libs-milestone" }
    maven { url "http://repo.springsource.org/libs-snapshot" }
    maven { url "http://www.datanucleus.org/downloads/maven2/" }
    maven { url "http://oss.sonatype.org/content/repositories/snapshots" }
    maven { url "http://people.apache.org/~rawson/repo" }
    maven { url "https://repository.cloudera.com/artifactory/cloudera-repos/" }
}

dependencies {
    compile ("org.springframework.data:spring-data-hadoop:$version") {
        exclude group: 'org.apache.thrift', module: 'thrift'
    }
    compile "org.apache.hadoop:hadoop-examples:$hadoopVersion"
    compile "org.springframework.batch:spring-batch-core:$springBatchVersion"
    // update the version that comes with Batch
    compile "org.springframework:spring-tx:$springVersion"
    compile "org.apache.hive:hive-service:0.9.0"
    compile "org.apache.hive:hive-builtins:0.9.0"
    compile "org.apache.thrift:libthrift:0.8.0"
    runtime "org.codehaus.groovy:groovy:$groovyVersion"
    // see HADOOP-7461
    runtime "org.codehaus.jackson:jackson-mapper-asl:$jacksonVersion"
    testCompile "junit:junit:$junitVersion"
    testCompile "org.springframework:spring-test:$springVersion"
}
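The build script references several version properties that are not shown here; they live in the project's gradle.properties in SVN. A hypothetical file would look like the one below (all values except hadoopVersion are illustrative placeholders and must match what the SVN project actually declares):

version=1.0.0.M2
hadoopVersion=1.0.2
springVersion=3.1.2.RELEASE
springBatchVersion=2.1.9.RELEASE
groovyVersion=1.8.5
jacksonVersion=1.8.8
junitVersion=4.10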
The Spring Data Hadoop configuration looks as below:
<configuration>
    <!-- The value after the question mark is the default value if another value for hd.fs is not provided -->
    fs.default.name=${hd.fs:hdfs://localhost:9000}
    mapred.job.tracker=local
</configuration>

<hive-client host="localhost" port="10000" />
The Spring Batch job looks as below:
<batch:job id="job1">
    <batch:step id="import">
        <batch:tasklet ref="hive-script"/>
    </batch:step>
</batch:job>
The Spring Data Hive script for loading the data is as below:
<hive-tasklet id="hive-script">
    <script>
        LOAD DATA LOCAL INPATH '/home/hduser/batch-analysis/weblogs_parse.txt' OVERWRITE INTO TABLE weblogs;
        INSERT OVERWRITE TABLE hbase_weblogs_1 SELECT * FROM weblogs;
    </script>
</hive-tasklet>
The Spring Data Hive script for analyzing the data is as below:
<hive-tasklet id="hive-script">
    <script>
        SELECT client_ip, count(user) FROM hbase_weblogs_1 GROUP BY client_ip;
    </script>
</hive-tasklet>
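The same aggregation can be run ad hoc from the command line to eyeball the result (again assuming the CLI's -h and -e flags can be combined; otherwise use the interactive prompt):

hive -h localhost -e "SELECT client_ip, count(user) FROM hbase_weblogs_1 GROUP BY client_ip;"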
Some useful commands/tips
For querying the Hadoop DFS you can use the usual file-based Unix-style commands, like:
hadoop dfs -ls /
hadoop dfs -mkdir /hbase
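A few more dfs equivalents of everyday Unix file commands (the sample file path is the one used earlier in this article):

hadoop dfs -put /home/hduser/batch-wordcount/weblogs_parse1.txt /tmp/
hadoop dfs -cat /tmp/weblogs_parse1.txt
hadoop dfs -rmr /tmp/weblogs_parse1.txt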
If Hadoop has entered safe mode and is not starting up, you can execute the command below:
hadoop dfsadmin -safemode leave
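Before forcing safe mode off, it is worth checking the current state and the datanode report:

hadoop dfsadmin -safemode get    # prints whether safe mode is ON or OFF
hadoop dfsadmin -report          # capacity and datanode status summary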
If you want to find errors in the Hadoop filesystem, you can execute the command below:
hadoop fsck /
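For more detail on which files and blocks are affected, fsck accepts a few extra flags:

hadoop fsck / -files -blocks -locations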