Usage of distinct in Hive



In Hive, distinct removes duplicate rows; in some cases it is functionally equivalent to group by.

Let's test some of distinct's behavior. First, create a test table:

create table test.trip_tmp(
  id int,
  user_id int,
  salesman_id int,
  huose_id int
);

Insert some sample data:

insert into test.trip_tmp values(1, 2, 3, 3);
insert into test.trip_tmp values(1, 2, 3, 3);
insert into test.trip_tmp values(2, 2, 3, 3);
insert into test.trip_tmp values(3, 2, 3, 3);
insert into test.trip_tmp values(4, 2, 5, 3);
insert into test.trip_tmp values(6, 3, 3, 3);
insert into test.trip_tmp values(5, 4, 2, 3);
insert into test.trip_tmp values(5, 2, 3, 3);
insert into test.trip_tmp values(6, 2, 5, 3);
insert into test.trip_tmp values(5, 2, 3, 3);
insert into test.trip_tmp values(5, 2, 5, 3);

View all rows in the table:

select * from test.trip_tmp;
OK
1	2	3	3
1	2	3	3
5	2	5	3
2	2	3	3
3	2	3	3
4	2	5	3
6	3	3	3
5	4	2	3
5	2	3	3
6	2	5	3
5	2	3	3
Time taken: 0.277 seconds, Fetched: 11 row(s)

Deduplicate over all columns of the table:

select distinct id, user_id, salesman_id, huose_id from test.trip_tmp;
OK
1	2	3	3
2	2	3	3
3	2	3	3
4	2	5	3
5	2	3	3
5	2	5	3
5	4	2	3
6	2	5	3
6	3	3	3
Time taken: 13.142 seconds, Fetched: 9 row(s)

This removes every row that is a duplicate across all of the columns listed after distinct.
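The same deduplication can be expressed with group by over all columns. As a minimal sketch of that equivalence, the following script uses Python's stdlib sqlite3 module as a stand-in for Hive (an assumption for testability; the DISTINCT and GROUP BY semantics shown are standard SQL shared by both engines):

```python
import sqlite3

# In-memory SQLite database standing in for Hive; table and data mirror the article.
conn = sqlite3.connect(":memory:")
conn.execute("create table trip_tmp (id int, user_id int, salesman_id int, huose_id int)")
rows = [(1, 2, 3, 3), (1, 2, 3, 3), (2, 2, 3, 3), (3, 2, 3, 3), (4, 2, 5, 3),
        (6, 3, 3, 3), (5, 4, 2, 3), (5, 2, 3, 3), (6, 2, 5, 3), (5, 2, 3, 3),
        (5, 2, 5, 3)]
conn.executemany("insert into trip_tmp values (?,?,?,?)", rows)

# distinct over all columns...
distinct_rows = conn.execute(
    "select distinct id, user_id, salesman_id, huose_id from trip_tmp order by 1,2,3,4"
).fetchall()
# ...versus group by over the same columns.
group_by_rows = conn.execute(
    "select id, user_id, salesman_id, huose_id from trip_tmp "
    "group by id, user_id, salesman_id, huose_id order by 1,2,3,4"
).fetchall()

print(len(distinct_rows))              # 9 distinct rows out of the 11 inserted
print(distinct_rows == group_by_rows)  # True: both queries deduplicate identically
```

Both queries return the same 9 deduplicated rows, matching the Hive output above.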

In Hive, distinct must come immediately after select. You cannot put a column name before distinct; doing so raises a parse error:

select huose_id, distinct id, user_id, salesman_id from test.trip_tmp;
NoViableAltException(96@[80:1: selectItem : ( ( tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
	at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA13.specialStateTransition(HiveParser_SelectClauseParser.java:4625)
	at org.antlr.runtime.DFA.predict(DFA.java:80)
	at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:1616)
	...
FAILED: ParseException line 1:17 cannot recognize input near 'distinct' 'id' ',' in selection target

distinct can also be applied to a parenthesized column list, which Hive treats as a single struct column. The result is much the same as listing all the columns individually:

select distinct (id, user_id, huose_id), salesman_id from test.trip_tmp;
OK
{"col1":1,"col2":2,"col3":3}	3
{"col1":2,"col2":2,"col3":3}	3
{"col1":3,"col2":2,"col3":3}	3
{"col1":4,"col2":2,"col3":3}	5
{"col1":5,"col2":2,"col3":3}	3
{"col1":5,"col2":2,"col3":3}	5
{"col1":5,"col2":4,"col3":3}	2
{"col1":6,"col2":2,"col3":3}	5
{"col1":6,"col2":3,"col3":3}	3
Time taken: 9.201 seconds, Fetched: 9 row(s)
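To see why this form is "much the same" as deduplicating over all columns, note that the struct plus the extra column together still contain all four values, so the deduplication key is the same. A plain-Python sketch of that semantics (not Hive itself) using set deduplication:

```python
# Rows as (id, user_id, salesman_id, huose_id), mirroring the article's data.
rows = [(1, 2, 3, 3), (1, 2, 3, 3), (2, 2, 3, 3), (3, 2, 3, 3), (4, 2, 5, 3),
        (6, 3, 3, 3), (5, 4, 2, 3), (5, 2, 3, 3), (6, 2, 5, 3), (5, 2, 3, 3),
        (5, 2, 5, 3)]

# distinct (id, user_id, huose_id), salesman_id -> dedup on (struct, column)
struct_dedup = {((r[0], r[1], r[3]), r[2]) for r in rows}
# distinct over all four columns -> dedup on the flat row
flat_dedup = set(rows)

print(len(struct_dedup), len(flat_dedup))  # 9 9: both forms keep the same rows
```

Either way the key covers every column, so both forms return 9 rows, as in the Hive output above.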

distinct cannot be used alongside an aggregate function in the same select list; doing so raises an error:

select distinct id, user_id, salesman_id, count(huose_id) from test.trip_tmp;
FAILED: SemanticException [Error 10128]: Line 1:42 Not yet supported place for UDAF 'count'

However, distinct can be used inside an aggregate function:

select count(distinct id) from test.trip_tmp;
OK
6
Time taken: 4.775 seconds, Fetched: 1 row(s)

Finally, prefer group by over distinct whenever you can: group by generally performs better, and the difference is clearly noticeable on large datasets.
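A common group-by rewrite along these lines is to deduplicate in a subquery and then count the result, which in Hive lets the dedup step spread across reducers instead of funneling through a single count(distinct). The equivalence of the two queries can be checked with the stdlib sqlite3 module (again a stand-in for Hive; the SQL shown is standard):

```python
import sqlite3

# In-memory SQLite database standing in for Hive, loaded with the article's rows.
conn = sqlite3.connect(":memory:")
conn.execute("create table trip_tmp (id int, user_id int, salesman_id int, huose_id int)")
conn.executemany("insert into trip_tmp values (?,?,?,?)",
                 [(1, 2, 3, 3), (1, 2, 3, 3), (2, 2, 3, 3), (3, 2, 3, 3),
                  (4, 2, 5, 3), (6, 3, 3, 3), (5, 4, 2, 3), (5, 2, 3, 3),
                  (6, 2, 5, 3), (5, 2, 3, 3), (5, 2, 5, 3)])

# count(distinct id): single aggregation over the whole table.
(n_distinct,) = conn.execute("select count(distinct id) from trip_tmp").fetchone()

# group-by rewrite: dedup ids in a subquery, then count the deduplicated rows.
(n_group_by,) = conn.execute(
    "select count(*) from (select id from trip_tmp group by id) t"
).fetchone()

print(n_distinct, n_group_by)  # 6 6: same answer as the Hive query above
```

Both queries return 6, matching the count(distinct id) result shown earlier.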


Publisher: 全栈程序员-站长. When reprinting, please cite the source: https://javaforall.net/217114.html
