Hive in Practice, Part 3

I. Summary of Hive data loading methods

1. Loading from the local Linux file system and from HDFS

load data [local] inpath 'path' [overwrite] into table tblName

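(1) The local keyword

With local: the data is loaded from a path on the local Linux file system. The path can be absolute or relative, and the file is copied into the table's directory.

Example: load data local inpath '/usr/local/dir/t1_data' into table t1

Without local: the data is moved from the given HDFS path into the directory of table tblName, which is equivalent to hadoop fs -mv hdfs_uri_1 hdfs_uri_2

Example: load data inpath '/dir/t1_data' into table t1 (once the load finishes, t1_data has been moved into the table's Hive directory and no longer exists at the source path)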

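(2) The overwrite keyword

With overwrite: the existing data in the table is replaced, and only the newly loaded data is kept.

Example: load data local inpath '/usr/local/d2/t2' overwrite into table t2;

Without overwrite: the new data file is added to the table's directory alongside the existing files.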
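The effect of the two keywords on files can be imitated on an ordinary file system. The following is a minimal sketch, not Hive code: load_data is a hypothetical helper, the "table directory" is a temporary local folder, and file name clashes are ignored.

```python
import shutil
import tempfile
from pathlib import Path


def load_data(src: Path, table_dir: Path, local: bool, overwrite: bool) -> None:
    """Mimic the file-level effect of LOAD DATA on a table directory (hypothetical helper)."""
    table_dir.mkdir(parents=True, exist_ok=True)
    if overwrite:
        # OVERWRITE: remove the files already in the table directory
        for old in table_dir.iterdir():
            old.unlink()
    dest = table_dir / src.name
    if local:
        # LOCAL: the source file is copied, so it stays where it was
        shutil.copy(src, dest)
    else:
        # no LOCAL: the source file is moved, like hadoop fs -mv
        shutil.move(str(src), str(dest))


root = Path(tempfile.mkdtemp())
table_dir = root / "warehouse" / "t1"

# Load 1: local, no overwrite. The source file is kept.
first = root / "t1_data"
first.write_text("1 lily1\n")
load_data(first, table_dir, local=True, overwrite=False)
local_source_kept = first.exists()

# Load 2: not local, no overwrite. The file is appended and the source disappears.
second = root / "more_data"
second.write_text("2 lily2\n")
load_data(second, table_dir, local=False, overwrite=False)
moved_source_gone = not second.exists()
files_after_append = sorted(p.name for p in table_dir.iterdir())

# Load 3: overwrite. Only the newly loaded file remains.
third = root / "new_data"
third.write_text("3 lily3\n")
load_data(third, table_dir, local=True, overwrite=True)
files_after_overwrite = sorted(p.name for p in table_dir.iterdir())

print(local_source_kept, moved_source_gone)
print(files_after_append)
print(files_after_overwrite)
```

The three loads correspond to the three cases described above: copy, move, and replace.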

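2. Loading data from another table

Syntax:

insert {into | overwrite} table t1 select columns… from t2;

Examples:

(1) insert into table t1 select id from t2;

This works much like SQL in a traditional relational database: the selected rows are appended to t1.

(2) insert overwrite table t1 select id from t2;

overwrite replaces the data already in t1.

(3) Inserting data into a partitioned table

Create a partitioned table: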

create table t1(

p_id int,

p_name string

)

partitioned by (p_province string,p_city string)

row format delimited

fields terminated by ' '

The local Linux file /usr/local/d2/t1 contains the following:

[root@hadoop d2]# more t1

1 lily1

2 lily2

3 lily3

Load the data into one partition with Hive:

load data local inpath '/usr/local/d2/t1' into table t1 partition

(p_province='heilongjiang',p_city='haerbin')

The session looks like this:

hive> load data local inpath '/usr/local/d2/t1' into table t1 partition (p_province='heilongjiang',p_city='haerbin');

Loading data to table db2.t1 partition (p_province=heilongjiang, p_city=haerbin)

Partition db2.t1{p_province=heilongjiang, p_city=haerbin} stats: [numFiles=1, numRows=0, totalSize=24, rawDataSize=0]

Time taken: 1.232 seconds

hive> select * from t1;

1 lily1 heilongjiang haerbin

2 lily2 heilongjiang haerbin

3 lily3 heilongjiang haerbin

Time taken: 0.603 seconds, Fetched: 3 row(s)

hive>

Now suppose the data file /usr/local/d2/t1 on Linux is changed to contain:

[root@hadoop d2]# more t1

4 lily4

5 lily5

6 lily6

Then run the load again, this time into a different partition:

hive> load data local inpath '/usr/local/d2/t1' into table t1 partition (p_province='zhejiang',p_city='hangzhou');

hive> select * from t1;

1 lily1 heilongjiang haerbin

2 lily2 heilongjiang haerbin

3 lily3 heilongjiang haerbin

4 lily4 zhejiang hangzhou

5 lily5 zhejiang hangzhou

6 lily6 zhejiang hangzhou

(4) Using dynamic partitioning

Create table t2:

create table t2(

p_id int,

p_name string

)

partitioned by (p_province string,p_city string)

row format delimited

fields terminated by ' '

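Alternatively, a table with exactly the same structure as t1 can be created more simply with create table t2 like t1;

When a table has many partitions, inserting into them one at a time, as in the following statement, becomes tedious: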

hive> insert overwrite table t2 partition(p_province='heilongjiang',p_city='haerbin') select p_id,p_name from t1 where p_province='heilongjiang' ;

hive> select * from t2;

1 lily1 heilongjiang haerbin

2 lily2 heilongjiang haerbin

3 lily3 heilongjiang haerbin

Time taken: 0.106 seconds, Fetched: 3 row(s)

hive>

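So we enable dynamic partitioning:

-- enable dynamic partitioning
set hive.exec.dynamic.partition=true;

(The current value can be checked with: set hive.exec.dynamic.partition;)

-- nonstrict mode: every partition column may be dynamic
set hive.exec.dynamic.partition.mode=nonstrict;

In strict mode, at least one partition column must be static, and static partition columns must come before the dynamic ones.

-- maximum number of dynamic partitions each mapper or reducer node may create
set hive.exec.max.dynamic.partitions.pernode=10000;

-- maximum total number of dynamic partitions one statement may create; the statement fails if it would create more
set hive.exec.max.dynamic.partitions=100000;

-- maximum number of files one job may create
set hive.exec.max.created.files=150000;

-- limits how many files can be open at once; this is an HDFS DataNode property (named dfs.datanode.max.transfer.threads in newer Hadoop releases) and is normally configured in hdfs-site.xml
set dfs.datanode.max.xcievers=8192;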

hive> insert overwrite table t2 partition(p_province,p_city) select p_id,p_name,p_province,p_city from t1;

hive> select * from t2;

1 lily1 heilongjiang haerbin

2 lily2 heilongjiang haerbin

3 lily3 heilongjiang haerbin

4 lily4 zhejiang hangzhou

5 lily5 zhejiang hangzhou

6 lily6 zhejiang hangzhou
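In a dynamic-partition insert, the trailing columns of the select decide which partition each row lands in, and each partition is stored as a col=value subdirectory of the table. The sketch below imitates that routing in plain Python, together with the strict-mode rule. All three function names are hypothetical helpers for illustration, not part of Hive.

```python
from collections import defaultdict


def partition_path(table, spec):
    """Build one partition directory, e.g. t2/p_province=zhejiang/p_city=hangzhou."""
    return "/".join([table] + [f"{col}={val}" for col, val in spec])


def route_rows(table, rows, partition_cols):
    """Group rows by their trailing partition values, as a dynamic-partition insert does."""
    n = len(partition_cols)
    buckets = defaultdict(list)
    for row in rows:
        data, part_values = row[:-n], row[-n:]
        path = partition_path(table, list(zip(partition_cols, part_values)))
        buckets[path].append(data)
    return dict(buckets)


def allowed_in_strict_mode(spec):
    """spec maps each partition column, in order, to a value (static) or None (dynamic)."""
    flags = [value is not None for value in spec.values()]
    has_static = any(flags)
    # every static column must come before the first dynamic one
    static_first = flags == sorted(flags, reverse=True)
    return has_static and static_first


# Rows in the column order of: select p_id, p_name, p_province, p_city from t1
rows = [
    (1, "lily1", "heilongjiang", "haerbin"),
    (2, "lily2", "heilongjiang", "haerbin"),
    (4, "lily4", "zhejiang", "hangzhou"),
]
layout = route_rows("t2", rows, ["p_province", "p_city"])
for path, data in layout.items():
    print(path, data)

print(allowed_in_strict_mode({"p_province": "zhejiang", "p_city": None}))
print(allowed_in_strict_mode({"p_province": None, "p_city": None}))
```

The second call to allowed_in_strict_mode describes partition(p_province,p_city) with no static value, which is why the nonstrict setting is needed for the insert above.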

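3. import and export

export writes a table's data, together with its metadata, to an HDFS directory. Note that the target directory must be empty beforehand. For example:

hive> export table t2 to '/d2/t2';

import loads the data from such an HDFS directory into a table; if the table does not exist yet, Hive creates it for you. For example:

hive> import table t2_bak2 from '/d2/t2';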

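II. Sentiment analysis case studies

Case 1: Sentiment analysis of e-commerce product reviews

Case description

An e-commerce platform wants to analyze user reviews of its products to understand how users feel about them, and to use that to improve product operations and service. The review data contains the review text, user ID, product ID, review time, and so on. Hive is used to run sentiment analysis on these reviews and classify each one as positive, negative, or neutral.

Data model (table creation)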

CREATE TABLE e_commerce_comments (

comment_id INT,

user_id INT,

product_id INT,

comment_text STRING,

comment_time TIMESTAMP

)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

STORED AS TEXTFILE;

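comment_id	user_id	product_id	comment_text	comment_time
1	101	5001	The camera on this phone is fantastic and the battery life is great!	2024-01-01 10:00:00
2	102	5001	The signal is terrible and it keeps dropping off the network; very disappointing!	2024-01-02 14:30:00
3	103	5002	The clothes are well made and true to size; very satisfied!	2024-01-03 09:15:00
4	104	5002	The color is nothing like the pictures and the real item looks awful; not recommended!	2024-01-04 16:20:00
5	105	5003	Excellent value for money; well worth buying!	2024-01-05 11:45:00
6	106	5003	Shoddy workmanship with loose threads everywhere; not worth the price!	2024-01-06 15:30:00
7	107	5004	Very convenient to use and simple to operate!	2024-01-07 08:00:00
8	108	5004	It crashes all the time; a terrible experience!	2024-01-08 13:10:00
9	109	5005	It tastes great and my child loves it!	2024-01-09 17:25:00
10	110	5005	The texture is awful; one bite and I did not want any more!	2024-01-10 12:50:00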
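Because the table is declared with FIELDS TERMINATED BY ',', each line of the data file is cut at every ASCII comma, with no notion of quoting. The sketch below imitates that with a naive split to show why the review text itself must not contain the delimiter. parse_row is a hypothetical helper, and its handling of extra or missing fields is a simplification chosen for this sketch.

```python
COLUMNS = ["comment_id", "user_id", "product_id", "comment_text", "comment_time"]


def parse_row(line, delimiter=","):
    """Split one text line on the field delimiter and pair the pieces with the column names."""
    fields = line.rstrip("\n").split(delimiter)
    # In this sketch extra fields are dropped and missing ones become None
    fields = fields[:len(COLUMNS)] + [None] * (len(COLUMNS) - len(fields))
    return dict(zip(COLUMNS, fields))


# A row whose text contains no ASCII comma splits cleanly
good = parse_row("5,105,5003,Excellent value for money; well worth buying!,2024-01-05 11:45:00")

# An ASCII comma inside the text shifts everything after it by one column
bad = parse_row("5,105,5003,Excellent value, well worth buying!,2024-01-05 11:45:00")

print(good["comment_text"], "|", good["comment_time"])
print(bad["comment_text"], "|", bad["comment_time"])
```

If free text may contain commas, a delimiter that cannot occur in the text (for example a tab) is the safer choice for the table definition.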

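SQL implementation

-- Ship the sentiment script to the cluster. Python scripts run through TRANSFORM streaming; CREATE TEMPORARY FUNCTION only accepts a Java class.
-- Assumption: sentiment_analysis.py reads tab-separated rows from stdin and writes each row back with a sentiment label appended (for example by calling an external NLP library).

ADD FILE /path/to/sentiment_analysis.py;

-- Run the sentiment analysis

SELECT

TRANSFORM (comment_id, user_id, product_id, comment_text, comment_time)

USING 'python sentiment_analysis.py'

AS (comment_id, user_id, product_id, comment_text, comment_time, sentiment_result)

FROM

e_commerce_comments;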
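One way to run a Python script from Hive is the TRANSFORM ... USING clause, which streams each row to the script as a tab-separated line on stdin and reads tab-separated lines back from stdout. Below is a minimal sketch of what such a sentiment script could look like. The word lists are a toy, hypothetical lexicon standing in for a real NLP library, and classify and run are illustrative names, not an existing API.

```python
import io

# Hypothetical toy lexicons; a real script would call an NLP library instead
POSITIVE = {"great", "fantastic", "satisfied", "excellent", "loves", "convenient"}
NEGATIVE = {"terrible", "awful", "disappointing", "shoddy", "crashes"}


def classify(text):
    """Label a text as positive, negative, or neutral by counting lexicon hits."""
    words = {w.strip(".,;!?").lower() for w in text.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def run(stream_in, stream_out, text_index=3):
    """Read tab-separated rows, append a sentiment label, and write tab-separated rows."""
    for line in stream_in:
        fields = line.rstrip("\n").split("\t")
        fields.append(classify(fields[text_index]))
        stream_out.write("\t".join(fields) + "\n")


# Local demonstration with in-memory streams. As a Hive streaming script the
# entry point would instead be: run(sys.stdin, sys.stdout)
sample = (
    "1\t101\t5001\tThe camera on this phone is fantastic and the battery life is great!\t2024-01-01 10:00:00\n"
    "8\t108\t5004\tIt crashes all the time; a terrible experience!\t2024-01-08 13:10:00\n"
)
out = io.StringIO()
run(io.StringIO(sample), out)
result_lines = out.getvalue().splitlines()
for row in result_lines:
    print(row)
```

The text_index default of 3 matches the position of comment_text when the five columns of e_commerce_comments are passed in table order.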

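Case 2: Sentiment analysis of public opinion on social media

Case description

A brand wants to analyze social media posts about itself to understand public sentiment toward the brand, so that it can adjust its marketing strategy and PR plans in time. The social media data contains the post ID, publisher ID, post time, post content, and so on. Hive is used to run sentiment analysis on the post content and track the brand's reputation on social media.

Data model (table creation)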

CREATE TABLE social_media_posts (

post_id INT,

publisher_id INT,

post_time TIMESTAMP,

post_content STRING

)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

STORED AS TEXTFILE;

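post_id	publisher_id	post_time	post_content
1001	2001	2024-02-01 08:30:00	This brand's new product is so cool and full of design flair!
1002	2002	2024-02-02 12:15:00	The quality keeps getting worse; the last thing I bought broke within days!
1003	2003	2024-02-03 15:40:00	I have always loved this brand; a loyal fan!
1004	2004	2024-02-04 09:20:00	The service attitude is terrible and after-sales support just ignores you!
1005	2005	2024-02-05 14:05:00	Excellent value for money; I have recommended it to my friends!
1006	2006	2024-02-06 17:30:00	The ads are far too frequent; so annoying!
1007	2007	2024-02-07 10:10:00	Products are updated quickly and keep up with the trends!
1008	2008	2024-02-08 13:55:00	Far too expensive; I cannot afford it!
1009	2009	2024-02-09 16:25:00	The brand events are fun and really engaging!
1010	2010	2024-02-10 07:45:00	A terrible experience; I will not buy again!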

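SQL implementation

-- Ship the sentiment script to the cluster. Python scripts run through TRANSFORM streaming; CREATE TEMPORARY FUNCTION only accepts a Java class.
-- Assumption: sentiment_analysis.py reads tab-separated rows from stdin and writes each row back with a sentiment label appended.

ADD FILE /path/to/sentiment_analysis.py;

-- Run the sentiment analysis

SELECT

TRANSFORM (post_id, publisher_id, post_time, post_content)

USING 'python sentiment_analysis.py'

AS (post_id, publisher_id, post_time, post_content, sentiment_result)

FROM

social_media_posts;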
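To get from per-post labels to a picture of the brand's overall reputation, the labels still have to be summarized. The sketch below shows the arithmetic in Python, assuming the labels have already been produced. sentiment_share is a hypothetical helper and the label list is made-up illustration data, not the result of running any model on the posts above.

```python
from collections import Counter


def sentiment_share(labels):
    """Return the fraction of posts carrying each sentiment label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Made-up labels for ten posts, for illustration only
labels = ["positive", "negative"] * 5
share = sentiment_share(labels)
print(share)
```

Inside Hive the same summary would normally be computed with a GROUP BY over the sentiment_result column.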

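Case 3: Sentiment analysis of movie reviews

Case description

A movie platform wants to analyze user reviews of movies to understand what audiences like and how they rate different films, as a basis for movie recommendation and marketing. The review data contains the review ID, user ID, movie ID, review time, review text, and so on. Hive is used to run sentiment analysis on the review text and uncover the audience's sentiment.

Data model (table creation)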

CREATE TABLE movie_comments (

comment_id INT,

user_id INT,

movie_id INT,

comment_time TIMESTAMP,

comment_text STRING

)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY ','

STORED AS TEXTFILE;

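comment_id	user_id	movie_id	comment_time	comment_text
2001	3001	6001	2024-03-01 11:00:00	The plot is tight and the acting is explosive; superb!
2002	3002	6001	2024-03-02 13:30:00	The special effects look fake and the plot drags; disappointing!
2003	3003	6002	2024-03-03 09:45:00	Beautiful visuals and a perfectly matched score; highly recommended!
2004	3004	6002	2024-03-04 15:20:00	I had no idea what it was about; a waste of time!
2005	3005	6003	2024-03-05 12:15:00	A film with real depth that left me with a lot to think about!
2006	3006	6003	2024-03-06 14:40:00	The pacing is too slow; it nearly put me to sleep!
2007	3007	6004	2024-03-07 10:30:00	The comedy is spot on; I laughed until my stomach hurt!
2008	3008	6004	2024-03-08 16:05:00	The jokes are awkward and not funny!
2009	3009	6005	2024-03-09 13:25:00	The sci-fi scenes are stunning; worth watching!
2010	3010	6005	2024-03-10 17:50:00	Too many plot holes; not recommended!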