hive优化（HiveSQL优化）

1、目的是提升执行效率

hive优化（HiveSQL优化）

2、多方法解决问题

优化一：列裁剪优化和分区裁剪优化

列裁剪就是在分区的时候只读取需要的列。如果列很多或者数据量很大的时候，如果select所有的列或者没有指定分区，导致全列或者全表扫描查询的效率很低。Hive列裁剪的优化配置为hive.optimize.cp=True

分区裁剪就是在查询时只读取需要的分区。hive分区中的配置为hive.optimize.pruner=True

优化二：排序技巧：sort by 代替order by

c采用order by 会导致所有map端的数据都会进入一个reducer中，在数据量特别大的时候，有很大的可能会计算不完。如果使用sort by 那会就会根据实际情况启动多个reducer进行排列，并且保证每个reducer内局部有序。为了控制map端的数据合理地分配到reducer端，还有配合distribute by 一同使用，如果不加distribute by 的话，那么map端就会随机分配给reducer.

l例如：未优化的写法

select a,b from table where xxx order by a,b limit 20;

优化的写法：

select a,b from table where xxx distribute by user_id sort by a limit 20;

优化三：去重方法--用group by 去代替 distinct

例如：未优化写法

select distinct user_name from user_trace where dt >'0';

优化写法：

select user_name from user_trace where dt >'0' group by user_name;

--2018年购买又退款的用户

未优化的写法：

select a.user_name from (select distinct user_name from user_trace where year(dt)=2018) a join (select distinct user_name from user_refund where year(dt)=2018) b on a.user_name=b.user_name

优化写法：

select a.user_name from (select user_name from user_trace where year(dt)=2018 group by user_name) a join (select user_name from user_refund where year(dt)=2017) on a.user_name=b.user_name

优化方法四：聚合技巧--grouping sets、cube、rollup

未优化：

年龄分布

select sex,count(distinct user_name) from user_infomation group by sex;

同理城市、等级等优化前也是这种写法

性别、城市、等级用户分布(优化写法)

select sex,city, level from user_infomation group by sex,city,level grouping sets(sex,city,level);

用户的性别和性别的城市分布

select sex，city count(distinct user_name) from user_information group by sex,city grouping sets(sex,(sex, city));

cube:根据group by维度的所有组合进行聚合

年龄、城市、等级的各种组合的用户分布

select sex,city,level from user_information group by sex,city,level grouping sets(sex,city,level ,(sex,city),(sex,level),(city,level))

优化写法

select sex,city,level from user_information group by sex,city,level with cube;

rollup以最左侧的维度为主，进行层级聚合，是cube的子集

未优化写法：计算每月的支付金额以及每年的支付金额

select a.dt,sum(a.year_month),sum(a.month_amount) from (select substr(dt,1,4) as dt,sum(pay_amount) as year_amount ,0 as month_amout from user_trade where dt >'0' group by substr(dt,1,4) union all select substr(dt,1,7) as dt ,sum(pay_amount) month_amount , 0 as year_amount from user_trade where dt >'0' group by substr(dt,1,7) a group by a.dt;

优化写法：

select year(dt) as year,month(dt) as month,sum(pay_amount) payamount from user_trade where dt>'0' group by year(dt),month(dt) with rollup;

--在2017年和2018年购买的用户

select a.user_name from ((select distinct user_name from user_trade where year(dt)=2017) a join (select distinct user_name from user_trade where year(dt)=2018) b on a.user_name=b.user_name) c;

优化写法一：

select a.user_name from (select user_name,count(distinct year(dt)) year_num from user_trade where year(dt) in (2017,2018) group by user_name) a where a.year_num=2;

优化写法二：

select a.user_name from (select user_name,count(distinct year(dt)) as year_num from user_trade where year(dt) in (2017,2018) group by user_name having count(distinct year(dt)=2) as a) group by user_name;

优化六：union all可以开启并发执行

Hive中互相没有依赖关系的job之间是可以并行运行的，典型的是在查询的时候，多个子查询·union all。在集群资源相对充足的前提下，可以开启并行·执行 set hive.exec.parallel=True;

未优化：每个用户的支付金额和退款金额汇总

select a.user_name,sum(a.pay_amounts) ,sum(a.refund_amounts) from (select user_name,sum(pay_amount) as pay_amounts ,0 as refund_amount from user_trade where dt >'0' group by user_name union all select user_name,sum(refund_amount) as refund_amounts, 0 as pay_amounts from user_refund where dt>'0' group by user_name ) a group by a.user_name

优化七：

1、小表在前，大表在后

2、使用相同的连接键。当对3及以上的表进行连接的时候，如果出现的on连接的键相同，那么只会产生一个mapreducer job.

3、尽量的过滤数据。尽量减少各个阶段的数据量，该加分区的就加分区，只选择使用到的字段

优化八：

1、查询分区表的时候，不限定分区

2、两表join时没有on字段，产生笛卡尔积

3、要order by 来排序但是没有指定limit

要开始严格模式需要将hive.mapred.mode=strict

正文

hive优化（HiveSQL优化）

相关阅读

酷跑游戏大全,一直死然后重来的跑酷游戏

新版赵信,端游赵信技能加点顺序

wow 采矿攻略,WOW中采矿

网游发卡平台,淘手游的673严选靠谱吗

发表评论取消回复

还没有评论，来说两句吧...

目录[+]