1、目的是提升执行效率
2、多方法解决问题
优化一:列裁剪优化和分区裁剪优化
列裁剪就是在分区的时候只读取需要的列。如果列很多或者数据量很大的时候,如果select所有的列或者没有指定分区,导致全列或者全表扫描查询的效率很低。Hive列裁剪的优化配置为hive.optimize.cp=True
分区裁剪就是在查询时只读取需要的分区。hive分区中的配置为hive.optimize.pruner=True
优化二:排序技巧:sort by 代替order by
c采用order by 会导致所有map端的数据都会进入一个reducer中,在数据量特别大的时候,有很大的可能会计算不完。如果使用sort by 那会就会根据实际情况启动多个reducer进行排列,并且保证每个reducer内局部有序。为了控制map端的数据合理地分配到reducer端,还有配合distribute by 一同使用,如果不加distribute by 的话,那么map端就会随机分配给reducer.
l例如:未优化的写法
select a,b from table where xxx order by a,b limit 20;
优化的写法:
select a,b from table where xxx distribute by user_id sort by a limit 20;
优化三: 去重方法--用group by 去代替 distinct
例如:未优化写法
select distinct user_name from user_trace where dt >'0';
优化写法:
select user_name from user_trace where dt >'0' group by user_name;
--2018年购买又退款的用户
未优化的写法:
select a.user_name from (select distinct user_name from user_trace where year(dt)=2018) a join (select distinct user_name from user_refund where year(dt)=2018) b on a.user_name=b.user_name
优化写法:
select a.user_name from (select user_name from user_trace where year(dt)=2018 group by user_name) a join (select user_name from user_refund where year(dt)=2017) on a.user_name=b.user_name
优化方法四:聚合技巧--grouping sets、cube、rollup
未优化:
年龄分布
select sex,count(distinct user_name) from user_infomation group by sex;
同理城市、等级等优化前也是这种写法
性别、城市、等级用户分布(优化写法)
select sex,city, level from user_infomation group by sex,city,level grouping sets(sex,city,level);
用户的性别和性别的城市分布
select sex,city count(distinct user_name) from user_information group by sex,city grouping sets(sex,(sex, city));
cube:根据group by维度的所有组合进行聚合
年龄、城市、等级的各种组合的用户分布
select sex,city,level from user_information group by sex,city,level grouping sets(sex,city,level ,(sex,city),(sex,level),(city,level))
优化写法
select sex,city,level from user_information group by sex,city,level with cube;
rollup以最左侧的维度为主,进行层级聚合,是cube的子集
未优化写法:计算每月的支付金额以及每年的支付金额
select a.dt,sum(a.year_month),sum(a.month_amount) from (select substr(dt,1,4) as dt,sum(pay_amount) as year_amount ,0 as month_amout from user_trade where dt >'0' group by substr(dt,1,4) union all select substr(dt,1,7) as dt ,sum(pay_amount) month_amount , 0 as year_amount from user_trade where dt >'0' group by substr(dt,1,7) a group by a.dt;
优化写法:
select year(dt) as year,month(dt) as month,sum(pay_amount) payamount from user_trade where dt>'0' group by year(dt),month(dt) with rollup;
--在2017年和2018年购买的用户
select a.user_name from ((select distinct user_name from user_trade where year(dt)=2017) a join (select distinct user_name from user_trade where year(dt)=2018) b on a.user_name=b.user_name) c;
优化写法一:
select a.user_name from (select user_name,count(distinct year(dt)) year_num from user_trade where year(dt) in (2017,2018) group by user_name) a where a.year_num=2;
优化写法二:
select a.user_name from (select user_name,count(distinct year(dt)) as year_num from user_trade where year(dt) in (2017,2018) group by user_name having count(distinct year(dt)=2) as a) group by user_name;
优化六:union all可以开启并发执行
Hive中互相没有依赖关系的job之间是可以并行运行的,典型的是在查询的时候,多个子查询·union all。在集群资源相对充足的前提下,可以开启并行·执行 set hive.exec.parallel=True;
未优化:每个用户的支付金额和退款金额汇总
select a.user_name,sum(a.pay_amounts) ,sum(a.refund_amounts) from (select user_name,sum(pay_amount) as pay_amounts ,0 as refund_amount from user_trade where dt >'0' group by user_name union all select user_name,sum(refund_amount) as refund_amounts, 0 as pay_amounts from user_refund where dt>'0' group by user_name ) a group by a.user_name
优化七:
1、小表在前,大表在后
2、使用相同的连接键。当对3及以上的表进行连接的时候,如果出现的on连接的键相同,那么只会产生一个mapreducer job.
3、尽量的过滤数据。尽量减少各个阶段的数据量,该加分区的就加分区,只选择使用到的字段
优化八:
1、查询分区表的时候,不限定分区
2、两表join时没有on字段,产生笛卡尔积
3、要order by 来排序但是没有指定limit
要开始严格模式需要将hive.mapred.mode=strict
还没有评论,来说两句吧...