not in vs left join is null
admin
2024-01-29 03:31:23
0

比如需要验证一系列值,是否在另一个表中出现过,有两种方法
not in 和 left join is null
差值
在一个表中出现,但是在另一个表中没出现。对应的业务可以是新客判断。在历史中未出现过,则认为是新客。

========

you are not alone ,any question(problem) you meet ,someone maybe meet before。
i copied from internet
here it is :

How to create fast database queries

NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: SQL Server
with 26 comments

This series of articles is inspired by multiple questions asked by the site visitors and Stack Overflow users, including Tony, Philip, Rexem and others.

Which method (NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL) is best to select values present in one table but missing in another one?

This:

SELECT l.*
FROM t_left l
LEFT JOIN
t_right r
ON r.value = l.value
WHERE r.value IS NULL
, this:

SELECT l.*
FROM t_left l
WHERE l.value NOT IN
(
SELECT value
FROM t_right r
)
or this:

SELECT l.*
FROM t_left l
WHERE NOT EXISTS
(
SELECT NULL
FROM t_right r
WHERE r.value = l.value
)
Differences between the methods
These methods are quite different.

First of all, LEFT JOIN / IS NULL and NOT EXISTS are semantically equivalent, while NOT IN is not. These method differ in how they handle NULL values in t_right

LEFT JOIN is guaranteed to return every row from t_left, and then filtering is applied to the values returned from t_right. If for some row in t_left there is no corresponding row in t_right (which means no row with that exact value is present in t_right), the row from t_left will be returned once, and the NULL values will be substituted instead of t_right’s actual values.

Since NULL values can never satisfy an equality JOIN condition, the NULL values returned by the query are guaranteed to be substituted by the LEFT JOIN, not fetched out of the actual t_right’s row. This means that LEFT JOIN / IS NULL is guaranteed to return at most one row from t_left, and these row’s value is not equal to one of those in t_right.

The same holds for NOT EXISTS. Since it’s a predicate, not a JOIN condition, the rows from t_left can only be returned at most once too. EXISTS always returns TRUE or FALSE and it will return TRUE as soon as it finds only a single matching row in t_right, or FALSE, if it find none.

NOT EXISTS, therefore, will return TRUE only if no row satisfying the equality condition is found in t_right (same as for LEFT JOIN / IS NULL).

Note that NULL values do not safisfy the equality conditions, so both LEFT JOIN / IS NULL and NOT EXISTS will always return rows from t_left that have value set to NULL, even is there are rows with value IS NULL in t_right.

NOT IN, however, behaves differently.

IN predicate (unlike EXISTS) is trivalent, i. e. it can return TRUE, FALSE or NULL:

TRUE is returned when the non-NULL value in question is found in the list
FALSE is returned when the non-NULL value is not found in the list and the list does not contain NULL values
NULL is returned when the value is NULL, or the non-NULL value is not found in the list and the list contains at least one NULL value
IN predicate does not give a definitive answer to whether or not the expression is contained in the list as long as there are NULL values on either side of the expression, returning NULL instead.

This of course makes no difference when using the positive form of NULL: predicates returning NULL are filtered out by the WHERE clause as well as those returning FALSE.

However, NOT IN is different, since negation of NULL is NULL as well.

That’s why NOT IN condition will never hold for any list with a NULL value in it.

If a row is found in the list, IN will return TRUE and NOT IN, therefore, will return FALSE
If a row is not found in the list, IN will return NULL, and NOT IN on its turn will also return NULL
Both conditions will of course be filtered out by the WHERE clause.

Let’s illustrate it with two simple queries that compare (1, NULL) in t_left with (2, NULL) in t_right:

WITH t_left AS
(
SELECT 1 AS value
UNION ALL
SELECT NULL
),
t_right AS
(
SELECT 2 AS value
UNION ALL
SELECT NULL
)
SELECT l.*
FROM t_left l
WHERE NOT EXISTS
(
SELECT NULL
FROM t_right r
WHERE r.value = l.value
)

value
1
NULL
2 rows fetched in 0.0001s (0.0006s)
This query, using NOT EXISTS, returns both values from t_left, since neither of them is equal to any of the values from t_right.

WITH t_left AS
(
SELECT 1 AS value
UNION ALL
SELECT NULL
),
t_right AS
(
SELECT 2 AS value
UNION ALL
SELECT NULL
)
SELECT l.*
FROM t_left l
WHERE l.value NOT IN
(
SELECT value
FROM t_right
)

value
0 rows fetched in 0.0001s (0.0005s)
This query, on the other hand, returns nothing. Since there is a NULL in t_right, NOT IN returns NULL rather than TRUE if the value is not found among the defined values. Just in case.

IN (and NOT IN) are too chicken to say something definite about lists with NULL unless they are completely sure that the value is there.

However, if the values in both tables are non-nullable, NULL, all three method describe above are semantically identical.

Efficiency comparison
Let’s see how efficient are these methods.

To do that, we will create two sample tables:

Table creation details

Table t_left contains 100,000 rows with 10,000 distinct values.

Table t_right contains 1,000,000 rows with 10,000 distinct values.

There are 10 rows in t_left with values not present in t_right.

Let’s run the queries against these tables.

NOT IN
SELECT l.id, l.value
FROM [20090915_anti].t_left l
WHERE l.value NOT IN
(
SELECT value
FROM [20090915_anti].t_right r
)
View query results, details and execution plan

As we can see, this query uses Merge Anti Semi Join which is extremely efficient if there is a cheap way to obtain two ordered resultsets (like in example above). Since value is indexed in both tables, the indexes serve as such resulsets.

Merge Join means that the server iterates both resultsets from lower values to higher ones, keeping a pointer to the current value and advancing it in both resultsets.

Anti Semi Join above means that as soon as the engine meets a match in t_right it just skips all matching values in both t_left and t_right. Since values from t_right are pregrouped using Stream Aggregate (making the right resultset 100 times as small), the values are only skipped in t_left (10 at once).

The whole query takes as little as 0.271 s.

NOT EXISTS
SELECT l.id, l.value
FROM [20090915_anti].t_left l
WHERE NOT EXISTS
(
SELECT NULL
FROM [20090915_anti].t_right r
WHERE r.value = l.value
)
View query results, details and execution plan

Exactly same plan and exactly same execution time as above.

In SQL Server, NOT IN and NOT EXISTS are complete synonyms in terms of the query plans and execution times (as long as both columns are NOT NULL).

LEFT JOIN / IS NULL
SELECT l.id, l.value
FROM [20090915_anti].t_left l
LEFT JOIN
[20090915_anti].t_right r
ON r.value = l.value
WHERE r.value IS NULL
View query results, details and execution plan

Here, the results are the same but performance details are very different.

SQL Server’s optimizer cannot discern an ANTI JOIN in a LEFT JOIN / IS NULL construct.

That’s why it just build the complete resultset (as with a common LEFT JOIN) and filters out the matching values.

Since we have lots of values to filter in this case (almost 10,000,000), it’s a hard job to filter such a tremendous lot of values. This operation is performed using quite an efficient Hash Match which can be and is parallelized, but filtering the values out still takes the most time.

That’s why the LEFT JOIN / IS NULL query takes 810 ms, or 3 times as much as the NOT EXISTS / NOT IN query.

Summary
In SQL Server, NOT EXISTS and NOT IN predicates are the best way to search for missing values, as long as both columns in question are NOT NULL. They produce the safe efficient plans with some kind of an Anti Join.

LEFT JOIN / IS NULL is less efficient, since it makes no attempt to skip the already matched values in the right table, returning all results and filtering them out instead.

=============
hive sql

order by

sort by

distribute by

cluster by

Hive vs Impala – SQL War in the Hadoop Ecosystem
Apache Hive is an effective standard for SQL-in-Hadoop. Hive is a front end for parsing SQL statements, generating logical plans, optimizing logical plans, translating them into physical plans which are executed by MapReduce jobs. Apache Hive is designed for the data warehouse system to ease the processing of adhoc queries on massive data sets stored in HDFS and ease data aggregations.

Impala is an open source SQL query engine developed after Google Dremel. Cloudera Impala is an SQL engine for processing the data stored in HBase and HDFS. Impala uses Hive megastore and can query the Hive tables directly. Unlike Hive, Impala does not translate the queries into MapReduce jobs but executes them natively.

However, both Apache Hive and Cloudera Impala support the common standard HiveQL.

相关内容

热门资讯

宁波一游乐园内装饰树起火,工作... 7月6日,多位网友在社交平台发视频称,浙江宁波罗蒙环球乐园一棵圣诞树突然起火。 现场视频显示,一处游...
快乐不放假 公园玩一夏 全市4... 深圳新闻网2025年7月7日讯(深圳商报记者 文灿)7月5日,深圳公园“快乐不放假,公园玩一夏”夏日...
为了帮女儿照顾孩子,我和儿媳妇... 为了帮女儿照顾孩子,我和儿媳妇闹翻了01 “带着你的外孙走。” 一声尖锐的叫喊声过后,传来门被重...
孩子课上听不懂老师讲的怎么办? 孩子课上听不懂老师讲的怎么办?到外面培训机构培训一下,比如新东方教育之类的孩子刚上一年级,课上听不懂...
安徽一爸爸火车上睡着,儿子走丢... 安徽一爸爸火车上睡着,儿子走丢了都不知道,爸爸带娃到底有多靠谱?实在是太不靠谱了,这样的行为实在是太...
大家支持小学生向老师举报、告状... 大家支持小学生向老师举报、告状、告密吗?我个人不支持这样的做法,这样有个弊端,可能把孩子引向一个背后...
麻烦各位推荐几本好看的穿越小说 麻烦各位推荐几本好看的穿越小说绾青丝 独步天下 醉玲珑 丑颜倾城芊泽花,醉玲珑,跨过千年来爱你,暴...
星际战甲怎么一键分解所有重复m... 星际战甲怎么一键分解所有重复mod对着mod,按鼠标中键,即可全选该mod。按重复数量排序,即可方便...
电视剧《五星大饭店》有续集吗?... 电视剧《五星大饭店》有续集吗?是什么结局啊?没有续集。话说没有拍续集的打算。一般现在看的续集都是网友...
高中会考没过就一定没高中毕业证... 高中会考没过就一定没高中毕业证吗?高中会考没过就一定没高中毕业证吗不一定那,没过也可以领的当然不是,...
为什么我的仙剑奇侠传四玩道慕容... 为什么我的仙剑奇侠传四玩道慕容紫英那就卡掉,不能继续啊?我仙四一到紫英那放动画那就卡住,然后关闭了啊...
YES!光之美少女5GOGO有... YES!光之美少女5GOGO有哪些剧场版有YES,光之美少女剧场版1、光之美少女剧场版2、光之美少女...
住家保姆的服务内容有哪些 住家保姆的服务内容有哪些这个要跟保姆自己谈了要看你给的价格了
钓鱼坠子的制作方法有什么? 钓鱼坠子的制作方法有什么?进行海竿抛钩时,由于水底不平或水草、树根过多,挂钩失坠时有发生。因此在垂钓...
问下,图里的那个动漫人物是出自... 问下,图里的那个动漫人物是出自哪个动漫薄樱鬼斋藤一出自动漫《薄樱鬼》人物名为:斋藤一这是原图我觉得像...
死神黑崎一护是不是灭却师?第二... 死神黑崎一护是不是灭却师?第二部什么时候出?一护妈妈是灭却师,爸爸是死神,所以算个混血儿→ →也就是...
点到为止的意思 点到为止的意思点到为止:说话时只是轻轻触及话题的边缘,而不必深入谈论,但是已经让对方明了意图即可到了...
《余生请多指教》热播,顾医生身... 《余生请多指教》热播,顾医生身上有哪些特质?特别温柔,特别专一,会安慰人,会讲大道理,特别暖心。顾医...
浩态狂香昔未逢,红灯烁烁绿盘龙... 浩态狂香昔未逢,红灯烁烁绿盘龙,觉来独对情惊恐,身在仙宫第九重。韩愈的哪一首诗。名字啊什麼?顺便求诗...