SQL Left Join,Hash Join,Merge Join对比

本文深入探讨了SQL中的三种主要连接方式:LEFT JOIN, Hash JOIN和Merge JOIN。LEFT JOIN允许从左表获取所有记录,并包含右表匹配的记录;Hash JOIN通过创建哈希表来匹配两个表的数据;而Merge JOIN则依赖于排序和合并过程,适合处理已排序的数据。理解这些连接方法的原理有助于优化SQL查询性能。" 80458075,7720008,今日头条推荐算法详解:从原理到实践,"['推荐系统', '深度学习', '文本分析', '用户行为', '内容过滤']

ref:http://support.microsoft.com/kb/197297/en-us

INF: Comparison of Join Techniques

Article ID:197297
Last Review:April 21, 1999
Revision:1.0
This article was previously published under Q197297

SUMMARY

Microsoft SQL Server version 7.0 includes some new join techniques, such as the hash join and sort-merge join, to supplement nested-loop joins. This article describes these new techniques and compares and contrasts them.

Back to the top

MORE INFORMATION

Microsoft SQL Server version 6.5 and earlier had only one technique for joining records: the nested loops join. SQL Server 7.0 implements hash and sort-based algorithms. The decision by the query engine on whether to use hash versus sort depends on relative sizes of the inputs, whether they are sorted or not, whether the hash table will fit entirely in memory, and so on. The optimizer will always choose the best join plan to execute your query, and that may be any one of the three options. The difference between hash joins and merge joins is often small, but there are certain situations where one is better than the other. There are considerable differences between these two join types and the nested loops join, as shown below. The example in this article is provided only as an illustration. The table at the end of this article is an illustration of some decisions that may affect the join type used. It is generally recommended that you not force join types, overriding the optimizer.

Back to the top

Nested Loop Joins

In a nested loops join, one of the tables is scanned; for every row in that table, a lookup is performed on the second table and the matching rows are retrieved. We will call these tables outer and inner input. The basic algorithm is to scan the inner input once for each value of outer input, looking for a match. Depending on the presence of indexes on the inner input the lookup can run quite efficiently. Indexed nested loops join is obviously more efficient than non-indexed nested loops. Also, if the tables have a 1-1 relationship, the scan of the inner table can stop as soon as the row is found that matches the row in the outer table.

Any query can be processed using this methodology including joins on inequality conditions.

Nested loops example:
SELECT * FROM reserves AS r1 JOIN sailors AS s1 ON r1.sid=s1.sid option
(loop join)
Reserves has 529 pages and 100000 rows
Sailors has 195 pages 40000 rows

Table 'sailors'. Scan count 100000, logical reads 207596, physical reads 3, read-ahead reads 0.
Table 'reserves'. Scan count 1, logical reads 529, physical reads 0, read- ahead reads 0.
|--Nested Loops(Inner Join)
       |--Sort(ORDER BY:([r1].[sid] ASC))
       |    |--Table Scan(OBJECT:([Northwind].[dbo].[reserves] AS [r1]))
       |--Clustered Index Seek(OBJECT:([Northwind].[dbo].[sailors].[I2] AS
[s1]), SEEK:([s1].[sid]=[r1].[sid]) ORDERED)

Back to the top

Hash Joins

In a hash match, a hash table is created in memory for one of the inputs (build input) using a repeatable randomizing function and this table is searched for matching values from the other input (probe input). The hash table performs like an index. The data from the build input is stored in memory. If the build input does not fit in memory it is partitioned recursively until it fits in memory. The probe input is then hashed and used to search for matches. Unlike with nested loops, the presence or absence of indexes is not particularly a concern in this case. Hash joins are CPU-intensive in comparison to nested loops and are affected by available memory. Hash joins are better when there is a significant difference in the sizes of the tables being joined.

In a hash join the build input determines the recursion depth because it can stop partitioning when the input fits in memory. A single hash join can perform both grouping and join at the same time when the grouping attribute is also the join attribute. The result of this join is not in any particular order. Inequality conditions cannot be satisfied by this type of join.
|--Hash Match(Inner Join, HASH:([s1].[sid])=([r1].[sid]),
RESIDUAL:([s1].[sid]=[r1].[sid]))
       |--Clustered Index Scan(OBJECT:([Northwind].[dbo].[sailors].[I2] AS [s1]))
       |--Table Scan(OBJECT:([Northwind].[dbo].[reserves] AS [r1]))
Table 'reserves'. Scan count 1, logical reads 529, physical reads 0, read- ahead reads 0.
Table 'sailors'. Scan count 1, logical reads 208, physical reads 0, read- ahead reads 0.

Back to the top

Sort-Merge Join

In a sort-merge, sorting the larger input dominates a large portion of the total cost. In a merge join, the inputs are sorted on the join column and the sorted pair is merged into one, eliminating the column values that do not match from the sorted result set.

The algorithm first examines the tuple of one input, scans the second sorted input until a row with equal or larger join value is found. Match up tuple in R1 with all tuples in R2 with matching join value. Repeat for all tuples in R1.

Merge join requires that both inputs be in memory (as opposed to a hash join, where only the smaller input is in memory). The merge join is obviously more efficient if the inputs are already sorted. Also the result of the join is sorted on the join attribute. Inequality conditions cannot be satisfied by this type of join.
|--Merge Join(Inner Join, MANY-TO-MANY MERGE:([r1].[sid])=([s1].[sid]),
RESIDUAL:([s1].[sid]=[r1].[sid]))
       |--Sort(ORDER BY:([r1].[sid] ASC))
       |    |--Table Scan(OBJECT:([Northwind].[dbo].[reserves] AS [r1]))
       |--Clustered Index Scan(OBJECT:([Northwind].[dbo].[sailors].[I2] AS s1]), ORDERED)

Table 'sailors'. Scan count 1, logical reads 208, physical reads 0, read- ahead reads 0.
Table 'reserves'. Scan count 1, logical reads 529, physical reads 0, read- ahead reads 0.

By changing the size of the input such that the relative sizes of the inputs vary, (in our example reserves had only 500 rows and sailors 50,000 rows) merge join for the same query as above took much longer than hash as can be seen below:

Merge join:
SQL Server Execution Times:
CPU time = 1081 ms, elapsed time = 2211 ms.

Hash join:
SQL Server Execution Times:
CPU time = 181 ms, elapsed time = 479 ms.

If an index exists on the join key, the data does not need to be sorted and therefore a merge join will be the best choice.
==========================================================================
Property                 |  Nested loop   | Hash or merge
=========================|=============================================
Only limited memory      | Better         |Worse,
                                          |memory is needed for sorting
                                          |or building hash table

Need first row fast      | Better         |Worse,
                                          |need to sort or hash
                                          |before a row is returned

Selecting only a few     | Better         |Worse
rows from inner table    |                |

Parallelism              |Works best      |Will work

Index available on inner |Works best      |Will work

Without one equality     |Works           |Needs at least 1 equality

2 Very large inputs      |Worse           |Not much better,
                                          |hash and sort will have
                                          |almost equal cost

One very small           |Worse           |Hash better than merge
and large input          |                |

Back to the top

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值