Friday, June 18, 2010

Installing Hive on Linux

Overview:

Hive is a query-based data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, querying and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data, and it also provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis which may not be supported by the built-in capabilities of the language.
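For a feel of what Hive QL looks like, here is a small hypothetical query (the table and column names are made up purely for illustration):

hive> SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page;

Behind the scenes Hive compiles this into one or more map/reduce jobs, but it reads just like SQL.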

Pre-requisites:

1. Java 1.6

2. Hadoop 0.17.x to 0.20.x.

3. Ant 1.8.1

4. Subversion (SVN)
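Before continuing, you can quickly confirm the pre-requisites from the shell (the exact version strings will of course depend on your installation):

java -version
ant -version
svn --version
bin/hadoop version    # run from your Hadoop install folder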

Note:

In my case, I need to install and configure Hadoop 0.20.0.

Installation:

Download

 
svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive

Using svn, you can check out the Hive framework from trunk with the command above.

  1. Navigate to the hive folder in the command shell with the following command:
 
       cd hive
 
      2. Build the Hive package using Ant (make sure you have Ant installed):

      ant package
 
3. Once Ant has built the jar files, you will find a new folder structure inside the hive folder; navigate to the folder below:
 
      cd build/dist
 
4. Make sure everything was built correctly by Ant by listing the folder:
 
 ls
 
The output will be a listing like this:
 
 README.txt
 bin/ (all the shell scripts)
 lib/ (required jar files)
 conf/ (configuration files)
 examples/ (sample input and query files)
 
 
 

Execution:

Note: Before running Hive, make sure you have installed and configured Hadoop 0.20.0, or any other version as per your requirement.

Hive runs on top of Hadoop, so you need to make the Hadoop installation path (which you have already downloaded and configured) visible to Hive. You can do this in two ways:

1. You can export the path for a single session with the following bash command:

export HADOOP_HOME=hadoop-install-dir

For me it is

export HADOOP_HOME=/home/hadoop/arun/hadoop-0.20.0

Note:

Make sure the HADOOP_HOME variable is assigned properly with this command:

echo $HADOOP_HOME

2. You can export the path permanently by adding it to the .bash_profile file, a shell script that is sourced automatically when you log in:

vi ~/.bash_profile

For me it is (I am logged in as root):

vi /root/.bash_profile

Open the file with the vi editor and add the following lines.

Assigning the variable:

HADOOP_HOME=/home/hadoop/arun/hadoop-0.20.0

Make it visible to all processes:

export HADOOP_HOME
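After saving the file, the variable will be picked up on your next login; to apply it in the current shell you can source the profile and verify it (paths as in my root example above):

source /root/.bash_profile
echo $HADOOP_HOME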

Note:

In addition, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them chmod g+w in HDFS before a table can be created in Hive.

The following commands will do the above. Navigate to the Hadoop folder path (home folder); for me it is

"/home/hadoop/arun/hadoop-0.20.0 "

and type the following commands; they will create the folder structure and set the permissions:

[root@master hadoop-0.20.0]# bin/hadoop dfs -mkdir /tmp
[root@master hadoop-0.20.0]# bin/hadoop dfs -mkdir /user/hive/warehouse
[root@master hadoop-0.20.0]# bin/hadoop dfs -chmod g+w /tmp
[root@master hadoop-0.20.0]# bin/hadoop dfs -chmod g+w /user/hive/warehouse
 
 
 
Basic Execution:
 
  Now navigate back to the hive folder you downloaded and built, move to the Ant build output folder “build/dist”, and type bin/hive; this will open the Hive query shell in the same terminal:
        
    cd /hive_home_folder/        (for me: cd /downloads/hive/)
    cd build/dist
    bin/hive
    
  For me it is 
[root@master dist]# bin/hive
Hive history file=/tmp/root/hive_job_log_root_201006181053_1733690517.txt
hive>
 
 
 Testing/Checking:
    
      Creating a Hive table:
 
     hive> CREATE TABLE jak (id INT, friends STRING); 
 
  Output: 
 
          hive> CREATE TABLE jak (id INT, friends STRING);
          OK
          Time taken: 0.476 seconds
    hive>
 
    The following will show the created Hive tables:
 
         hive> SHOW TABLES;
 
 
 
 
Output: 
 
hive> SHOW TABLES;
OK
jak
lists
Time taken: 0.215 seconds
hive>
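As an extra sanity check you can load some data and query it back. The sketch below assumes the sample file examples/files/kv1.txt that ships with the Hive build (an int and a string per line, which matches the jak table created above) and that you started bin/hive from the build/dist folder:

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE jak;
hive> SELECT * FROM jak LIMIT 5;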
 
 
  

Configuration:

- Hive's default configuration is stored in conf/hive-default.xml

- Configuration variables can be changed by (re-)defining them in conf/hive-site.xml

- The log4j configuration is stored in conf/hive-log4j.properties

- Hive configuration is an overlay on top of Hadoop, meaning the Hadoop configuration variables are inherited by default.

- Hive configuration can be manipulated by:

  - editing hive-site.xml and defining any desired variables (including Hadoop variables) in it

  - using the set command from the CLI (see below)

  - invoking Hive using the syntax:

    $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2

    (this sets the variables x1 and x2 to y1 and y2 respectively)
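For example, from the CLI you can inspect a variable and then override it with the set command (mapred.reduce.tasks is used here only as an illustration), or pass the same override on the command line when starting Hive:

hive> set mapred.reduce.tasks;
hive> set mapred.reduce.tasks=2;

bin/hive -hiveconf mapred.reduce.tasks=2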

 

Error logs:

Hive uses log4j for logging. By default, logs are not emitted to the console by the CLI; they are stored in the file /tmp/{user.name}/hive.log

If you wish, the logs can be emitted to the console by adding the argument shown below:

bin/hive -hiveconf hive.root.logger=INFO,console

Note that setting hive.root.logger via the 'set' command does not change logging properties since they are determined at initialization time.

Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to hive-dev@hadoop.apache.org.
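For example, if you are logged in as root you can watch the log file mentioned above while a query runs:

tail -f /tmp/root/hive.log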

Installation errors and solutions:

Note:

During the installation/setup of Hive, most of the issues you will face are with downloading the software required to build the jar files using Ant.

Error 1:

If you try to build the package using Ant as below

[root@master hive]# ant package

it may produce an error like the one below:

ivy-retrieve-hadoop-source:
[ivy:retrieve] :: Ivy 2.0.0-rc2 - 20081028224207 :: http://ant.apache.org/ivy/ ::
:: loading settings :: file = /master/hadoop/hive/ivy/ivysettings.xml

BUILD FAILED

/test/hive/build.xml:160: The following error occurred while executing this line:

/test/hive/build.xml:103: The following error occurred while executing this line:

/test/hive/shims/build.xml:56: The following error occurred while executing this line:

/test/hive/build-common.xml:177: impossible to resolve dependencies:

resolve failed - see output for details

Solution:

This is a file download error: the download path is not correct, so I downloaded the Hadoop 0.20.0 core file manually and placed it in the Ant/Ivy cache path as below

    cd ~/.ant/cache/hadoop/core/sources
    wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.0/hadoop-0.20.0.tar.gz
    svn update
 
After doing this, you need to change a variable from true to false in /hive/ivy/ivysettings.xml.
 
 
For me, the Hive home path is /hive, so:
 
   cd /hive/ivy
   vi ivysettings.xml
 
I changed the following variable from true to false (basically we tell Ant/Ivy not to download the file again and to use the existing file for further processing), as follows:
 
checkmodified="true" to checkmodified="false"
 
 
From:
        checkmodified="true" changingPattern=".*SNAPSHOT"/>
TO:
        checkmodified="false" changingPattern=".*SNAPSHOT"/>
 
 
Now clean and rebuild the Hive package for the specific version of Hadoop; for me it is 0.20.0:
 
      ant -Dhadoop.version=0.x.x clean package
 
For me
 

ant -Dhadoop.version=0.20.0 clean package

It will work fine…

Error 2:

After installing Hive, when you run a query it may show an error like

“Missing Hive Execution Jar: hive/lib/hive-exec-*.jar”

Solution:

If we do the installation using the above method (manually downloading and saving the file in the Ant path), sometimes the jar files created by Ant are saved in a different location.

We need to find those files and copy them to the correct path.

The original/correct path is hive_home_folder/build/dist

For me it is cd /hive/build/dist
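A rough way to locate the misplaced jars and copy them back is sketched below; the source path is whatever find reports on your machine, and the execution jar is normally looked up under the lib folder of build/dist:

find / -name "hive-exec-*.jar" 2>/dev/null
cp /path/reported/by/find/hive-exec-*.jar /hive/build/dist/lib/    # replace with the path that find printed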

Note:

If you want to access Hive globally, you need to set the HIVE_HOME path; refer to the path-setting process above.

It will work fine…

Error 3:

After installing Hive, when you run a query it may show an error saying it is unable to create the JVM because the memory HEAPSIZE is too high.

Solution:

To fix that, we need to change the heap size in the following file

hive/build/dist/bin/ext/util/execHiveCmd.sh

to 64, or as per your requirement:

# increase the threshold for large queries

HADOOP_HEAPSIZE=64

 

It will work fine…