Hive编程(二)【基础操作】，hive编程基础操作

来源： javaer 分享于 2020-03-20 点击 13807 次点评：235

Hive编程(二)【基础操作】，hive编程基础操作

2.1 安装预先配置好的虚拟机

2.2 安装详细步骤

2.2.1 安装Java

Hive依赖于Hadoop,而Hadoop依赖于Java

linux系统中Java安装

$ /usr/java/latest/bin/java -version
java version "1.6.0_23"
Java(TM) SE Runtime Environment (build 1.6.0_23-b05)
Java HotSpot(TM) 64-Bit Server VM (build 19.0-b09, mixed mode)
$ sudo echo "export JAVA_HOME=/usr/java/latest" > /etc/profile.d/java.sh
$ <span class="hljs-built_in">sudo</span> <span class="hljs-built_in">echo</span> <span class="hljs-string">"PATH=<span class="hljs-variable">$PATH:$JAVA_HOME/bin" >> /etc/profile.d/java.sh
$ . /etc/profile
$ <span class="hljs-built_in">echo</span> <span class="hljs-variable">$JAVA_HOME
/usr/java/latest

设置Java环境变量,在/etc/profile中添加如下内容:

$ sudo vim /etc/profile
$ export JAVA_HOME=/usr/java/latest
$ <span class="hljs-keyword">export</span> PATH=<span class="hljs-variable">$PATH:$JAVA_HOME/bin

Mac OS X系统中Java安装

编辑$HOME/.bashrc文件

jdk 1.6

$ export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
$ <span class="hljs-keyword">export</span> PATH=<span class="hljs-variable">$PATH:$JAVA_HOME/bin

jdk 1.7

$ export JAVA_HOME=/Library/Java/JavaVirtualMachines/1.7.0.jdk/Contents/Home
$ <span class="hljs-keyword">export</span> PATH=<span class="hljs-variable">$PATH:$JAVA_HOME/bin

2.2.2 安装Hadoop

2.2.3 本地模式、伪分布式模式和分布式模式

Hadoop默认使用本地模式。在该模式下使用的是本地文件系统.在本地模式下,当执行Hadoop job时,Map task和Reduce task在同一个jvm进程中运行.

真实的集群环境是分布式模式.所有没有完整URL指定的路径默认都是分布式文件系统(通常是HDFS)，且由Jobtrack来管理job,不同的task在不同的进程中执行。

设置hive以本地模式运行(即使当前用户是在分布式模式或伪分布式模式下执行也使用这种模式)

set hive.exec.mode.local.auto=true

若想默认使用这个配置,可以将这个命令添加到$HOME/.hiverc文件中

2.2.4 测试Hadoop

在Hive中使用Hadoop提供的dfs命令查看本地文件系统

$ hadoop dfs -ls /
Found 26 items
drwxrwxrwx - root root 24576 2012-06-03 14:28 /tmp
drwxr-xr-x - root root 4096 2012-01-25 22:43 /opt
drwx------ - root root 16384 2010-12-30 14:56 /lost+found
drwxr-xr-x - root root 0 2012-05-11 16:44 /selinux
dr-xr-x--- - root root 4096 2012-05-23 22:32 /root

当频繁使用hadoop dfs命令时，最好为这个命令定义一个别名

alias hdfs="hadoop dfs"

对于非常大的文件,可以结合linux shell中的管道。如:可以将-cat命令的输出通过管道传递给shell中的more、head或者tail命令。例如:

hadoop dfs -cat wc-out/* | more

2.2.5安装Hive

Hive需要使用HADOOP_HOME这个环境变量

Installing Hive from a Stable Release

$ tar -xzvf hive-x.y.z.tar.gz
$ cd hive-x.y.z
$ export HIVE_HOME={{pwd}} 
$ export PATH=$HIVE_HOME/bin:$PATH

Building Hive from Source

Compile Hive on Hadoop 0.23

  $ svn co http://svn.apache.org/repos/asf/hive/trunk hive
  $ cd hive
  $ mvn clean install -Phadoop-2,dist
  $ cd packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin
  $ ls
  LICENSE
  NOTICE
  README.txt
  RELEASE_NOTES.txt
  bin/ (all the shell scripts)
  lib/ (required jar files)
  conf/ (configuration files)
  examples/ (sample input and query files)
  hcatalog / (hcatalog installation)
  scripts / (upgrade scripts for hive-metastore)

Compile Hive on Hadoop 0.20

$ mvn clean install -Phadoop-1,dist

Compile Hive Prior to 0.13 on Hadoop 0.20

$ svn co http://svn.apache.org/repos/asf/hive/trunk hive
  $ cd hive
  $ ant clean package
  $ cd build/dist
  # ls
  LICENSE
  NOTICE
  README.txt
  RELEASE_NOTES.txt
  bin/ (all the shell scripts)
  lib/ (required jar files)
  conf/ (configuration files)
  examples/ (sample input and query files)
  hcatalog / (hcatalog installation)
  scripts / (upgrade scripts for hive-metastore)

Compile Hive Prior to 0.13 on Hadoop 23

$ ant clean package -Dhadoop.version=0.23.3 -Dhadoop-0.23.version=0.23.3 -Dhadoop.mr.rev=23
$ ant clean package -Dhadoop.version=2.0.0-alpha -Dhadoop-0.23.version=2.0.0-alpha -Dhadoop.mr.rev=23

Running Hive

Hive uses Hadoop, so:

you must have Hadoop in your path OR

export HADOOP_HOME=<hadoop-install-dir>

In addition, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them chmod g+w in HDFS before you can create a table in Hive.

Commands to perform this setup:

  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse

You may find it useful, though it's not necessary, to set HIVE_HOME:

  $ export HIVE_HOME=<hive-install-dir>

To use the Hive command line interface (CLI) from the shell:

  $ $HIVE_HOME/bin/hive

2.3 Hive内部是什么

$HIVE_HOME/lib 目录下包含有众多的JAR文件，每个JAR文件都实现了Hive功能中的某个特定的部分
$HIVE_HOME/bin 目录下包含可以执行Hive各种服务的可执行文件。如:CLI(Hive命令行)，提供交互式的接口
Thrift 提供可远程访问其他进程的功能，也提供JDBC和ODBC访问Hive的功能
metastoreservice(元数据服务) Hive使用这个服务来存储表模式信息和其他元数据信息。通常情况下使用关系型数据库来存储这些信息。
HWI Hive提供一个简单的网页界面，提供远程访问Hive的功能
conf 目录下存放Hive的配置文件

2.4 启动Hive

使用$HIVE_HOME/bin/hive启动Hive的CLI服务。

在Hive中是不区分大小写的。

2.5 配置Hadoop环境

2.5.1 本地模式配置

本地模式下，所有的任务进程在同一个JVM上。

Hive在本地模式下运行时,默认的数据存储位置在file:///user/hive/warehouse目录下

本地模式下hive-site.xml文件内容

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/home/me/hive/warehouse</value>
    <description>
      Local or HDFS directory where Hive keeps table contents.
    </description>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
    <description>
      Use false if a production metastore server is used.
    </description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/home/me/hive/metastore_db;create=true</value>
    <description>
      The JDBC connection URL.
    </description>
  </property>
</configuration>

2.5.2 分布式模式和伪分布式模式配置

可以使用下面的命令可以自定义数据仓库的位置

set hive.metastore.warehouse.dir=/user/myname/hive/warehouse;

可以将这条语句写入到$HOME/.hiverc文件中,这样每次启动Hive都会执行。

2.5.3 使用JDBC连接元数据

Metastore database configuration in hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://db1.mydomain.pvt/hive_db?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>database_user</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>database_pass</value>
  </property>
</configuration>

记得需要将mysql的驱动放到$HIVE_HOME/lib目录下。

2.6 Hive命令

命令选项

$ bin/hive --help
Usage ./hive <parameters> --service serviceName <service parameters>
Service List: cli help hiveserver hwi jar lineage metastore rcfilecat
Parameters parsed:
--auxpath : Auxiliary jars
--config : Hive configuration directory
--service : Starts specific service/component. cli is default
Parameters used:
HADOOP_HOME : Hadoop install directory
HIVE_OPT : Hive options
For help on a particular service:
./hive --service serviceName --help
Debug help: ./hive --debug --help

可以通过--service name服务名称来启动某个服务.

cli 命令行界面,定义表、执行查询等
hiveserver Hive Server监听来自于其他进程的Thrift连接的一个守护进程
hwi 一个可执行查询语句和其他命令的简单web界面
jar Hadoop jar命令的扩展，可以执行Hive需要的执行环境
metastore Hive元数据服务
rcfilecat 可打印RCFile格式

--auxpath 允许指定以冒号分隔的附属JAR包 --config 允许用户覆盖$HIVE_HOME/conf中的默认配置

2.7 Hive命令行界面

2.7.1 CLI选项

$ hive --help --service cli 

usage: hive 

-d,--define <key=value> Variable substitution to apply to hive 

commands. e.g. -d A=B or --define A=B 

-e <quoted-query-string> SQL from command line 

-f <filename> SQL from files 

-H,--help Print help information 

-h <hostname> connecting to Hive Server on remote host 

--hiveconf <property=value> Use value for given property 

--hivevar <key=value> Variable substitution to apply to hive 

commands. e.g. --hivevar A=B 

-i <filename> Initialization SQL file 

-p <port> connecting to Hive Server on port number 

-S,--silent Silent mode in interactive shell 

-v,--verbose Verbose mode (echo executed SQL to the 

console)

2.7.2 变量和属性

--define key=value等价于--hivevar key=value.

当使用该功能时,Hive会将这些键值对放到hivevar命名空间。

Hive中变量和属性命名空间

hivevar 读/写 (v0.8.0及以后版本) 用户自定义变量.
hiveconf 读/写 Hive相关配置属性.
system 读/写 Java定义的配置属性.
env 只读 Shell环境变量(e.g.bash).

Hive变量内部是以Java字符串方式存储的。

在CLI中，可以使用SET命令显示或修改变量的值.如:

$ hive 

hive> set env:HOME; 

env:HOME=/home/thisuser

hive> set;
… lots of output including these variables:
hive.stats.retries.wait=3000
env:TERM=xterm
system:user.timezone=America/New_York
…

hive> set -v;
… even more output!…

如果不加-v标记，set命令会打印出命名空间hivevar、hiveconf、system和env中所有的变量.使用-v标记，还会打印出Hadoop中所定义的属性

命令set还可以用于给变量赋新值.如:

$ hive –define foo=bar 

hive> set foo; 

foo=bar;

hive> set hivevar:foo;
hivevar:foo=bar;

hive> set hivevar:foo=bar2;

hive> set foo;
foo=bar2

hive> set hivevar:foo;
hivevar:foo=bar2

前缀hivevar:是可选的。--hivevar标记和--define标记是相等的。

hive> create table toss1(i int, ${hivevar:foo} string);

hive> describe toss1; 

i int 

bar2 string

hive> create table toss2(i2 int, ${foo} string);

hive> describe toss2; 

i2 int 

bar2 string

hive> drop table toss1; 

hive> drop table toss2;

使用hive.cli.print.current.db属性开启在hive命令行中显示当前数据库名

$ hive --hiveconf hive.cli.print.current.db=true 

hive (default)> set hive.cli.print.current.db; 

hive.cli.print.current.db=true

hive (default)> set hiveconf:hive.cli.print.current.db; 

hiveconf:hive.cli.print.current.db=true

hive (default)> set hiveconf:hive.cli.print.current.db=false;

hive> set hiveconf:hive.cli.print.current.db=true;

hive (default)> ...

默认数据库名为default.hive.cli.print.current.db属性的值默认为false。

Hive中system命名空间，对这个属性具有可读写权限

hive> set system:user.name; 

system:user.name=myusername

hive> set system:user.name=yourusername;

hive> set system:user.name;
system:user.name=yourusername

Hive中env命名空间，只具有可读权限

hive> set env:HOME; 

env:HOME=/home/yourusername

hive> set env:HOME; 

env:* variables can not be set.

和hivevar变量不同,必须使用system:或env:前缀来指定系统属性和环境变量。

env命名空间可以向Hive中传递变量。如:

$ YEAR=2012 hive -e &quot;SELECT * FROM mytable WHERE year = ${env:YEAR}";

2.7.3 Hive中“一次使用”命令

Hive中可以接受-e这种命令参数形式。执行一个或多个查询。(多个查询之间用分号分隔)。执行结束后CLI立即退出。

$ hive -e "SELECT * FROM mytable LIMIT 3"; 

OK 

name1 10 

name2 20 

name3 30 

Time taken: 4.955 seconds 

$

增加-S选项开启静默模式，在输出结果中去掉"OK"和"Time taken"等信息.

$ hive -S -e "select * FROM mytable LIMIT 3" > /tmp/myquery 

$ cat /tmp/myquery 

name1 10 

name2 20 

name3 30

Hive默认将输出写到标准输出流中(控制台或终端)，上面的例子将输出重定向到本地文件系统中，而不是HDFS中。

模糊取某个属性名hive -S -e "set" | grep warehouse 查询数据仓库的路径

$ hive -S -e "set" | grep warehouse 

hive.metastore.warehouse.dir=/user/hive/warehouse 

hive.warehouse.subdir.inherit.perms=false

2.7.4 从文件中执行Hive查询

hive -f /user/scott/hive.hql

或者在hive shell中使用source命令来执行脚本文件

hive > source /user/scott/hive.hql

2.7.5 hiverc文件

Hive启动后默认会在当前用户home目录下寻找名为.hiverc的文件,且自动执行文件中的命令

一个典型的.hiverc文件中的内容为

ADD JAR /user/scott/hive-examples.jar;   #增加一个jar文件 

set hive.cli.print.current.db=true;     #显示当前所在工作数据库 

set hive.exec.model.local.auto=true;    #设置以本地文件模式运行(当hadoop是以分布式模式或者伪分布式模式执行时的话,就在本地执行,可以加快小数据集的查询速度)

注:每行语句结尾的分号不可省略

2.7.6 使用Hive CLI的更多介绍

自动补全功能

2.7.7 查看操作命令历史

Hive将最近使用的10000行命令保存在$HOME/.hivehistory文件中.

2.7.8 执行shell命令

不需要退出Hive CLI就可以执行简单的shell命令。只要在命令前加上!且以;结尾即可.

hive > !pwd;  #查看当前目录

注:hive shell中使用shell命令时,不能使用需用户进行输入交互的命令,且不支持shell的管道功能和文件名自动补全功能。

2.7.9 在Hive内部使用Hadoop的dfs命令

hive > dfs -ls /;

2.7.10 Hive脚本中注释

使用以--开头的字符表示注释。该注释只能放在hiveql脚本中若直接在hive shell中使用的话会报错

-- Copyright (c) 2012 Megacorp, LLC. 

-- This is the best Hive script evar!! 

SELECT * FROM massive_table;

2.7.11 显示字段名称

hive > set hive.cli.print.header=true;

同理可以将该语句添加到$home/.hiverc文件中

Hive编程(二)【基础操作】，hive编程基础操作