Usage of distinct in Hive



In Hive, distinct removes duplicate rows; in some cases it is functionally equivalent to group by.

Let's test some of distinct's behavior. First, create a test table:

create table test.trip_tmp(
  id int,
  user_id int,
  salesman_id int,
  huose_id int
);

Insert some sample data:

insert into test.trip_tmp values(1, 2, 3, 3);
insert into test.trip_tmp values(1, 2, 3, 3);
insert into test.trip_tmp values(2, 2, 3, 3);
insert into test.trip_tmp values(3, 2, 3, 3);
insert into test.trip_tmp values(4, 2, 5, 3);
insert into test.trip_tmp values(6, 3, 3, 3);
insert into test.trip_tmp values(5, 4, 2, 3);
insert into test.trip_tmp values(5, 2, 3, 3);
insert into test.trip_tmp values(6, 2, 5, 3);
insert into test.trip_tmp values(5, 2, 3, 3);
insert into test.trip_tmp values(5, 2, 5, 3);

View all rows in the table:

select * from test.trip_tmp;
OK
1	2	3	3
1	2	3	3
5	2	5	3
2	2	3	3
3	2	3	3
4	2	5	3
6	3	3	3
5	4	2	3
5	2	3	3
6	2	5	3
5	2	3	3
Time taken: 0.277 seconds, Fetched: 11 row(s)

Deduplicate over all columns of the table:

select distinct id, user_id, salesman_id, huose_id from test.trip_tmp;
OK
1	2	3	3
2	2	3	3
3	2	3	3
4	2	5	3
5	2	3	3
5	2	5	3
5	4	2	3
6	2	5	3
6	3	3	3
Time taken: 13.142 seconds, Fetched: 9 row(s)

This removes every row that is a duplicate across all of the columns listed after distinct.
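The same deduplication can be expressed with group by over all columns. As a minimal sketch of that equivalence, the following script uses Python's stdlib sqlite3 module as a stand-in for Hive (an assumption for testability; the DISTINCT and GROUP BY semantics shown are standard SQL shared by both engines):

```python
import sqlite3

# In-memory SQLite database standing in for Hive; table and data mirror the article.
conn = sqlite3.connect(":memory:")
conn.execute("create table trip_tmp (id int, user_id int, salesman_id int, huose_id int)")
rows = [(1, 2, 3, 3), (1, 2, 3, 3), (2, 2, 3, 3), (3, 2, 3, 3), (4, 2, 5, 3),
        (6, 3, 3, 3), (5, 4, 2, 3), (5, 2, 3, 3), (6, 2, 5, 3), (5, 2, 3, 3),
        (5, 2, 5, 3)]
conn.executemany("insert into trip_tmp values (?,?,?,?)", rows)

# distinct over all columns...
distinct_rows = conn.execute(
    "select distinct id, user_id, salesman_id, huose_id from trip_tmp order by 1,2,3,4"
).fetchall()
# ...versus group by over the same columns.
group_by_rows = conn.execute(
    "select id, user_id, salesman_id, huose_id from trip_tmp "
    "group by id, user_id, salesman_id, huose_id order by 1,2,3,4"
).fetchall()

print(len(distinct_rows))              # 9 distinct rows out of the 11 inserted
print(distinct_rows == group_by_rows)  # True: both queries deduplicate identically
```

Both queries return the same 9 deduplicated rows, matching the Hive output above.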

In Hive, distinct must come immediately after select. You cannot put a column name before distinct; doing so raises a parse error:

select huose_id, distinct id, user_id, salesman_id from test.trip_tmp;
NoViableAltException(96@[80:1: selectItem : ( ( tableAllColumns )=> tableAllColumns -> ^( TOK_SELEXPR tableAllColumns ) | ( expression ( ( ( KW_AS )? identifier ) | ( KW_AS LPAREN identifier ( COMMA identifier )* RPAREN ) )? ) -> ^( TOK_SELEXPR expression ( identifier )* ) );])
	at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser$DFA13.specialStateTransition(HiveParser_SelectClauseParser.java:4625)
	at org.antlr.runtime.DFA.predict(DFA.java:80)
	at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:1616)
	...
FAILED: ParseException line 1:17 cannot recognize input near 'distinct' 'id' ',' in selection target

distinct can also be applied to a parenthesized column list, which Hive treats as a single struct column. The result is much the same as listing all the columns individually:

select distinct (id, user_id, huose_id), salesman_id from test.trip_tmp;
OK
{"col1":1,"col2":2,"col3":3}	3
{"col1":2,"col2":2,"col3":3}	3
{"col1":3,"col2":2,"col3":3}	3
{"col1":4,"col2":2,"col3":3}	5
{"col1":5,"col2":2,"col3":3}	3
{"col1":5,"col2":2,"col3":3}	5
{"col1":5,"col2":4,"col3":3}	2
{"col1":6,"col2":2,"col3":3}	5
{"col1":6,"col2":3,"col3":3}	3
Time taken: 9.201 seconds, Fetched: 9 row(s)
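To see why this form is "much the same" as deduplicating over all columns, note that the struct plus the extra column together still contain all four values, so the deduplication key is the same. A plain-Python sketch of that semantics (not Hive itself) using set deduplication:

```python
# Rows as (id, user_id, salesman_id, huose_id), mirroring the article's data.
rows = [(1, 2, 3, 3), (1, 2, 3, 3), (2, 2, 3, 3), (3, 2, 3, 3), (4, 2, 5, 3),
        (6, 3, 3, 3), (5, 4, 2, 3), (5, 2, 3, 3), (6, 2, 5, 3), (5, 2, 3, 3),
        (5, 2, 5, 3)]

# distinct (id, user_id, huose_id), salesman_id -> dedup on (struct, column)
struct_dedup = {((r[0], r[1], r[3]), r[2]) for r in rows}
# distinct over all four columns -> dedup on the flat row
flat_dedup = set(rows)

print(len(struct_dedup), len(flat_dedup))  # 9 9: both forms keep the same rows
```

Either way the key covers every column, so both forms return 9 rows, as in the Hive output above.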

distinct cannot be used alongside an aggregate function in the same select list; doing so raises an error:

select distinct id, user_id, salesman_id, count(huose_id) from test.trip_tmp;
FAILED: SemanticException [Error 10128]: Line 1:42 Not yet supported place for UDAF 'count'

However, distinct can be used inside an aggregate function:

select count(distinct id) from test.trip_tmp;
OK
6
Time taken: 4.775 seconds, Fetched: 1 row(s)

Finally, prefer group by over distinct whenever you can: group by generally performs better, and the difference is clearly noticeable on large datasets.
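A common group-by rewrite along these lines is to deduplicate in a subquery and then count the result, which in Hive lets the dedup step spread across reducers instead of funneling through a single count(distinct). The equivalence of the two queries can be checked with the stdlib sqlite3 module (again a stand-in for Hive; the SQL shown is standard):

```python
import sqlite3

# In-memory SQLite database standing in for Hive, loaded with the article's rows.
conn = sqlite3.connect(":memory:")
conn.execute("create table trip_tmp (id int, user_id int, salesman_id int, huose_id int)")
conn.executemany("insert into trip_tmp values (?,?,?,?)",
                 [(1, 2, 3, 3), (1, 2, 3, 3), (2, 2, 3, 3), (3, 2, 3, 3),
                  (4, 2, 5, 3), (6, 3, 3, 3), (5, 4, 2, 3), (5, 2, 3, 3),
                  (6, 2, 5, 3), (5, 2, 3, 3), (5, 2, 5, 3)])

# count(distinct id): single aggregation over the whole table.
(n_distinct,) = conn.execute("select count(distinct id) from trip_tmp").fetchone()

# group-by rewrite: dedup ids in a subquery, then count the deduplicated rows.
(n_group_by,) = conn.execute(
    "select count(*) from (select id from trip_tmp group by id) t"
).fetchone()

print(n_distinct, n_group_by)  # 6 6: same answer as the Hive query above
```

Both queries return 6, matching the count(distinct id) result shown earlier.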


Publisher: 全栈程序员-站长. When reprinting, please cite the source: https://javaforall.net/217114.html
