Vous êtes sur la page 1sur 16

Speeding Up Queries with Semi-Joins and Anti-Joins: How Oracle Evaluates EXISTS, NOT EXISTS, IN, and NOT IN

by Roger Schrag Database Specialists, Inc. About Database Specialists, Inc.


Database Specialists, Inc. provides remote DBA services and onsite database support for your mission critical Oracle systems. Since 1995, we have been providing Oracle database consulting in Solaris, HP-UX, Linux, AIX, and Windows environments. We are DBAs, speakers, educators, and authors. Our team is continually recognized by Oracle, at national conferences and by leading trade publications. Learn more about our remote DBA, database tuning, and consulting services. Or, call us at 415-344-0500 or 888-648-0500.

Introduction Optimizing SQL usually gives the most significant results when DBAs are called upon to make the system run faster. Using tools like Statspack or Enterprise Manager, it is often easy to find the slow SQL. But how do you make the queries run faster? That is the challenge! In this paper we will discuss the semi-join and the anti-join, two powerful SQL constructs Oracle offers for use in your quest for faster queries. In particular, we will define these two terms, discuss when and why you might want to use the [NOT] EXISTS or [NOT] IN constructs, and demonstrate how you can use optimizer hints and make minor query changes in order to enable Oracle to use some very powerful and efficient access paths. For certain classes of queries, these features can dramatically reduce logical reads, physical reads, CPU time, and elapsed time. But beware! There are some obscure (and not always easy to find) requirements that must be met in order for Oracle to deploy the semi- and anti-join access paths. If you fail to dot an I or cross a T, you could be banging your head against the wall for hours. In this paper we will look at SQL from a real application and demonstrate the before and after versions of queries that ran orders of magnitude faster once semi-joins and anti-joins were implemented correctly. Before we get started, I want to point out that Oracle Corporation has improved its optimization techniques significantly over the years, specifically in the area of semi-joins and anti-joins. This paper was written using Oracle 9i release 9.2.0.4, and all examples were validated in such an environment. If you are using a newer or older release of Oracle, it is quite possible that you will see different results. Please do not take anything you read in this paperor in any paper for that matterfor granted. Validate every hypothesis in your environment before relying upon it.
Semi-Join Defined

A semi-join between two tables returns rows from the first table where one or more matches are found in the second table. The difference between a semi-join and a conventional join is that rows in the first table will be returned at most once. Even if the second table contains two matches for a row in the first table, only one copy of the row will be returned. Semi-joins are written using the EXISTS or IN constructs. Suppose you have the DEPT and EMP tables in the SCOTT schema and you want a list of departments with at least one employee. You could write the query with a conventional join:
SELECT FROM WHERE ORDER BY D.deptno, D.dname dept D, emp E E.deptno = D.deptno D.deptno;

Unfortunately, if a department has 400 employees then that department will appear in the query output 400 times. You could eliminate the duplicate rows by using the DISTINCT keyword, but you would be making Oracle do more work than necessary. Really what you want to do is specify a semi-join between the DEPT and EMP tables instead of a conventional join:
SELECT FROM WHERE D.deptno, D.dname dept D EXISTS

( SELECT 1 FROM emp E WHERE E.deptno = D.deptno ) ORDER BY D.deptno;

The above query will list the departments that have at least one employee. Whether a department has one employee or 100, the department will appear only once in the query output. Moreover, Oracle will move on to the next department as soon as it finds the first employee in a department, instead of finding all of the employees in each department.
Anti-Join Defined

An anti-join between two tables returns rows from the first table where no matches are found in the second table. An anti-join is essentially the opposite of a semi-join: While a semi-join returns one copy of each row in the first table for which at least one match is found, an anti-join returns one copy of each row in the first table for which no match is found. Anti-joins are written using the NOT EXISTS or NOT IN constructs. These two constructs differ in how they handle nullsa subtle but very important distinction which we will discuss later. Suppose you want a list of empty departmentsdepartments with no employees. You could write a query that finds all departments and subtracts off the department of each employee:
SELECT FROM MINUS SELECT FROM WHERE ORDER BY D1.deptno, D1.dname dept D1 D2.deptno, D2.dname dept D2, emp E2 D2.deptno = E2.deptno 1;

The above query will give the desired results, but it might be clearer to write the query using an anti-join:
SELECT FROM WHERE D.deptno, D.dname dept D NOT EXISTS ( SELECT 1 FROM emp E WHERE E.deptno = D.deptno ) ORDER BY D.deptno;

The above query is more efficient because Oracle can employ an anti-join access path. The difference in efficiency here is akin to the difference between a nested loops join and a hash join when you are joining every row in one table to another. The EXISTS and IN Constructs In this section we will discuss how Oracle evaluates EXISTS and IN clauses, prerequisites for using Oracles semi-join access paths, and hints that influence semi-join query optimization. Then well look at a few sample queries where semi-join access paths can be used to make the query more efficient.
How Oracle Evaluates EXISTS and IN Clauses

The SQL language as implemented in Oracle does not include a syntax for explicitly performing semi-joins, so you use an EXISTS or IN clause with a subquery to indicate to Oracle that a semi-join should be performed instead of a conventional join. Oracle will typically transform the subquery into a join wherever possible. An interesting characteristic of Oracles query optimizer is that (according to Metalink document 144967.1), in Oracle 8i and 9i the decision of whether or not to transform a query by merging its subqueries into the main body is a heuristic one. Basically, Oracle will merge subqueries if it can, regardless of impact on cost. The assumption is that merging is always a good thing. In fact, Oracle performs query transformations (merging subqueries into the main body) before making a list of possible execution plans and evaluating their cost. This bit of trivia will come into play when we discuss the second example in this section. The Oracle Performance Tuning Guide in the Oracle 8i documentation set explains that EXISTS and IN are processed differently by the query optimizer. The manual explains a rule of thumb for when to use each, and provides an example to demonstrate. The Oracle 9i Performance Tuning Guide provides the same rule of thumb with a slightly more polished example. The Oracle 10g Performance Tuning Guide contains substantially the same text as the Oracle 9i manual on this subject. The rule of thumb goes like this: If the main body of your query is highly selective, then an EXISTS clause might be more appropriate to semi-join to the target table. However, if the main body of your query is not so selective and the subquery (the target of the semi-join) is more selective, then an IN clause might be more appropriate. The example in the Oracle 8i Performance Tuning Guide shows a query where the main body is not selective, while the target of the semi-join is highly selective. The example demonstrates that Oracle chooses a poor execution plan when an EXISTS clause is used, and a much more efficient execution plan when an IN clause is used instead. I was easily able to reproduce this example in an Oracle 8i environment with the same results as shown in the Performance Tuning Guide. However, when I reproduced this same example in an Oracle 9i environment, Oracle chose an identical (and efficient) execution plan for both versions of the query. Furthermore, the example in the Oracle 9i Performance Tuning Guide gave an identical (and efficient) execution plan for both versions of the query in my Oracle 9i environment. These tests echo my previous experiencethat Oracle 9i is smarter than Oracle 8i about EXISTS and IN clauses, and you dont typically need to worry about when to use EXISTS versus IN. The upshot is that if you are writing queries for an Oracle 8i (or earlier) system, then you should think about the selectivity of the predicates in the main query versus the subquery and choose whether to use an EXISTS or an IN clause wisely. If you are using Oracle 9i or later, it doesnt hurt to follow this rule of thumb, but you may not need to worry as much.
Prerequisites and Hints that Impact Semi-Joins

There are a few situations in which Oracle cannot use a semi-join access path. If the query contains a DISTINCT operation (explicitly or implicitly through a set operator such as UNION) then Oracle cannot transform EXISTS or IN clauses into something that could use a semi-join access path. The same is true if the EXISTS or IN clause is part of an OR branch. (Semi-join access paths can be used in queries that contain ORs in the WHERE clause, just as long as the EXISTS or IN is not part of the OR.) Assuming that a semi-join access path is possible, Oracle chooses the semi-join algorithm to use while evaluating execution plans. In Oracle 9i a semi-join can be performed using the nested loops, hash join, or merge join algorithms. As with a conventional join, Oracle 9i will choose the join algorithm with the lowest cost. In Oracle 8i a semi-join can be performed using the hash join or merge join algorithms, but Oracle 8i will not choose the join algorithm with the lowest cost. Instead, the always_semi_join instance parameter dictates whether or not an Oracle 8i database should perform semi-joins and, if so, which algorithm to use. The always_semi_join instance parameter is obsolete in Oracle 9i. If a semi-join access path is not possible (because of a DISTINCT operation or an OR predicate), or if instance parameters specify a semi-join should not be used, then Oracle will choose an alternate data access path. This usually amounts to performing a conventional join and sorting out the duplicates, or scanning the first table and using the second table as a filter. Oracle provides the NL_SJ, HASH_SJ, and MERGE_SJ hints in order for you to manipulate the semi-join process if you need to. The hint is applied to the subquery of the EXISTS or IN clause, not the main body of the query itself. In my experience Oracle does not need you to provide a hint in order to consider the efficient semi-join access paths. However, in Oracle 9i or 10g, where you have a wider choice of semi-join algorithms, you might want to use hints if Oracle chooses a hash semi-join when a nested loops semi-join would in fact be better, or vice versa.

Semi-Join Example #1

Suppose you want a list of your gold-status customers who have placed an order within the last three days. You might start with a query that looks like:
SELECT FROM WHERE AND AND ORDER BY DISTINCT C.short_name, C.customer_id customers C, orders O C.customer_type = 'Gold' O.customer_id = C.customer_id O.order_date > SYSDATE - 3 C.short_name;

The execution plan (from a TKPROF report) for this query is:
Rows ------2 20 10 2990 Row Source Operation --------------------------------------------------SORT UNIQUE (cr=33608 r=30076 w=0 time=6704029 us) HASH JOIN (cr=33608 r=30076 w=0 time=6703101 us) TABLE ACCESS FULL CUSTOMERS (cr=38 r=36 w=0 time=31718 us) TABLE ACCESS FULL ORDERS (cr=33570 r=30040 w=0 time=6646420 us)

You can see that Oracle did a (conventional) hash join between the CUSTOMERS and ORDERS tables, and then performed a sort for uniqueness to filter out the duplicate records. This sort was necessary because the query is supposed to show a customer only once, even if they have more than one order placed within the last three days. (A sort to satisfy the ORDER BY clause is also necessary, but Oracle tackles this at the same time as the sort for uniqueness and thus it does not appear as a separate step in the execution plan.) Using what we have discussed so far, you might want to rewrite the query using a semi-join:
SELECT FROM WHERE AND C.short_name, C.customer_id customers C C.customer_type = 'Gold' EXISTS ( SELECT 1 FROM orders O WHERE O.customer_id = C.customer_id AND O.order_date > SYSDATE - 3 ) ORDER BY C.short_name;

The execution plan for this query becomes:


Rows ------2 2 10 2990 Row Source Operation --------------------------------------------------SORT ORDER BY (cr=33608 r=29921 w=0 time=6422770 us) HASH JOIN SEMI (cr=33608 r=29921 w=0 time=6422538 us) TABLE ACCESS FULL CUSTOMERS (cr=38 r=0 w=0 time=61290 us) TABLE ACCESS FULL ORDERS (cr=33570 r=29921 w=0 time=6345754 us)

The SEMI that you see on the second line indicates that a semi-join access path is being used instead of a conventional join. A semi-join is more efficient here for two reasons: (1) The search through the ORDERS table for a given customer can stop as soon as the first qualifying order is found, and (2) There is no need to sort the results to remove duplicate customer records in the case

where a customer had multiple qualifying orders. (You can see that the sort for uniqueness, which took 928 microseconds, has been replaced by a sort to satisfy the ORDER BY clause, which took only 232 microseconds.) Rewriting the query to use a semi-join access path has perhaps been a novel experience, but, so far, it hasnt done us a substantial amount of good. We brought the elapsed time down about 4%, but that doesnt seem too significant, and it could just be the result of chance variation in our test environment. But suppose you are aware of the fact that very few of your customers are gold-status. Perhaps a nested loops semi-join would be more efficient. You can add an NL_SJ hint and see the results for yourself:
SELECT FROM WHERE AND C.short_name, C.customer_id customers C C.customer_type = 'Gold' EXISTS ( SELECT /*+ NL_SJ */ 1 FROM orders O WHERE O.customer_id = C.customer_id AND O.order_date > SYSDATE - 3 ) ORDER BY C.short_name; Rows ------2 2 10 2 780 Row Source Operation --------------------------------------------------SORT ORDER BY (cr=833 r=725 w=0 time=358431 us) NESTED LOOPS SEMI (cr=833 r=725 w=0 time=358232 us) TABLE ACCESS FULL CUSTOMERS (cr=38 r=0 w=0 time=2210 us) TABLE ACCESS BY INDEX ROWID ORDERS (cr=795 r=725 w=0 time=355822 us) INDEX RANGE SCAN ORDERS_N1 (cr=15 r=13 w=0 time=5601 us)(object id 34176)

Indeed, the nested loops semi-join brought the consistent reads down to 833 from 33608 and reduced the execution time by over 90%. So it appears that Oracle chose a hash semi-join when a nested loops semi-join was more efficient. Once we get Oracle to see the light with the NL_SJ hint, we end up with a much more efficient query than we started with. We achieved this performance gain by leveraging Oracles nested loops semi-join access path.
Semi-Join Example #2

I recently worked on a project where applying the concepts presented in this section brought the execution time for a key query in my clients application down from 46 minutes to 1.1 seconds. A screen in the application allowed a user to enter a project owner and up to five participants, and a list of assignments involving all of the specified people would be displayed. The users found that the response time was tolerable if they only selected one participant, but it went downhill from there as they added more participants. The original version of the query looked like this:
SELECT FROM WHERE AND AND AND AND AND AND AND AND DISTINCT A.name, A.code, A.description, A.item_id, A.assignment_id, FI.string0, FI.string1 relationships R, assignments A, format_items FI, relationships R1, relationships R2, relationships R3, relationships R4, relationships R5 R.user_id = 134546 R.account_id = 134545 R.type_code = 0 A.item_id = R.item_id FI.item_id = A.item_id R1.item_id = A.item_id AND R1.status = 5 AND R1.user_id = 137279 R2.item_id = A.item_id AND R2.status = 5 AND R2.user_id = 134555 R3.item_id = A.item_id AND R3.status = 5 AND R3.user_id = 134546 R4.item_id = A.item_id AND R4.status = 5 AND R4.user_id = 137355

AND R5.item_id = A.item_id AND R5.status = 5 AND R5.user_id = 134556 ORDER BY A.name ASC;

(I have changed the names of tables and columns, stripped away extraneous columns and predicates, and eliminated bind variables in order to both protect my clients confidentiality and make the example easier to understand.) The execution plan looked like this:
Rows Row Source Operation ------- --------------------------------------------------642 SORT UNIQUE (cr=23520269 r=34 w=0 time=2759937104 us) 64339104 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=23520269 r=34 w=0 time=1838051782 us) 95184881 NESTED LOOPS (cr=7842642 r=23 w=0 time=907238095 us) 2710288 NESTED LOOPS (cr=2266544 r=23 w=0 time=103840003 us) 317688 NESTED LOOPS (cr=484734 r=11 w=0 time=23494451 us) 50952 NESTED LOOPS (cr=43280 r=10 w=0 time=2688237 us) 4146 NESTED LOOPS (cr=19016 r=3 w=0 time=988374 us) 1831 NESTED LOOPS (cr=13353 r=0 w=0 time=608296 us) 4121 HASH JOIN (cr=7395 r=0 w=0 time=399488 us) 2046 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=7211 r=0 w=0 time=299181 us) 17788 INDEX RANGE SCAN RELATIONSHIPS_N3 (cr=71 r=0 w=0 time=81158 us)(object id 34528) 3634 TABLE ACCESS FULL ASSIGNMENTS (cr=184 r=0 w=0 time=25536 us) 1831 TABLE ACCESS BY INDEX ROWID FORMAT_ITEMS (cr=5958 r=0 w=0 time=163252 us) 1831 INDEX RANGE SCAN FORMAT_ITEMS_N1 (cr=4127 r=0 w=0 time=115113 us)(object id 34316) 4146 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=5663 r=3 w=0 time=349554 us) 4264 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=3678 r=0 w=0 time=224390 us)(object id 34345) 50952 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=24264 r=7 w=0 time=1538051 us) 70976 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=8428 r=0 w=0 time=630831 us)(object id 34345) 317688 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=441454 r=1 w=0 time=19584539 us) 1631032 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=108302 r=0 w=0 time=7213989 us)(object id 34345) 2710288 TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=1781810 r=12 w=0 time=72015925 us) 4237104 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=643298 r=1 w=0 time=29018764 us)(object id 34345) 92474592 INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=5576098 r=0 w=0 time=503165776 us)(object id 34345)

It is no wonder that this query took so long when five participants were selected. With each join to the RELATIONSHIPS table, the Cartesian product gets more and more out of hand. By the final sort to eliminate the duplicates, we have over 64 million rows to reckon with. This query struck me right away as a good candidate for a nested loops semi-join. We want a list of assignments that meet certain criteria. Whether a specific user has one qualifying relationship record or 10, it makes no difference. We are only concerned with the existence of at least one qualifying relationship record for each of the specified users. Rewriting the query with EXISTS clauses to semi-join to the RELATIONSHIPS table five times is pretty easy. The challenge here is that the query does have a DISTINCT operator, and this defeats Oracles semi-join access paths. Not to worryan inline view with a NO_MERGE hint saved the day. The revised query looked like:
SELECT FROM /*+ NO_MERGE (M) */ DISTINCT M.name, M.code, M.description, M.item_id, M.assignment_id, M.string0, M.string1 ( SELECT A.name, A.code, A.description, A.item_id, A.assignment_id, FI.string0, FI.string1 FROM relationships R, assignments A, format_items FI WHERE R.user_id = 134546 AND R.account_id = 134545 AND R.type_code = 0 AND A.item_id = R.item_id

AND AND

AND

AND

AND

AND

) M ORDER BY M.name ASC;

FI.item_id = A.item_id EXISTS (SELECT 1 FROM relationships R1 WHERE R1.item_id = A.item_id AND AND R1.user_id = 137279) EXISTS (SELECT 1 FROM relationships R2 WHERE R2.item_id = A.item_id AND AND R2.user_id = 134555) EXISTS (SELECT 1 FROM relationships R3 WHERE R3.item_id = A.item_id AND AND R3.user_id = 134546) EXISTS (SELECT 1 FROM relationships R4 WHERE R4.item_id = A.item_id AND AND R4.user_id = 137355) EXISTS (SELECT 1 FROM relationships R5 WHERE R5.item_id = A.item_id AND AND R5.user_id = 134556)

R1.status = 5

R2.status = 5

R3.status = 5

R4.status = 5

R5.status = 5

The execution plan was as follows:


Rows ------642 1300 1300 1314 1314 1314 1322 1831 4121 2046 17788 3634 1831 1831 1322 2472 1314 1582 1314 11011 1314 2234 1300 1300 Row Source Operation --------------------------------------------------SORT UNIQUE (cr=36315 r=89 w=0 time=1107054 us) VIEW (cr=36315 r=89 w=0 time=1085116 us) NESTED LOOPS SEMI (cr=36315 r=89 w=0 time=1082232 us) NESTED LOOPS SEMI (cr=32385 r=89 w=0 time=1002330 us) NESTED LOOPS SEMI (cr=28261 r=89 w=0 time=904654 us) NESTED LOOPS SEMI (cr=22822 r=89 w=0 time=737705 us) NESTED LOOPS SEMI (cr=18730 r=89 w=0 time=651196 us) NESTED LOOPS (cr=13353 r=89 w=0 time=530670 us) HASH JOIN (cr=7395 r=89 w=0 time=347584 us) TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=7211 r=0 w=0 time=186820 us) INDEX RANGE SCAN RELATIONSHIPS_N3 (cr=71 r=0 w=0 time=43770 us)(object id 34528) TABLE ACCESS FULL ASSIGNMENTS (cr=184 r=89 w=0 time=91899 us) TABLE ACCESS BY INDEX ROWID FORMAT_ITEMS (cr=5958 r=0 w=0 time=141416 us) INDEX RANGE SCAN FORMAT_ITEMS_N1 (cr=4127 r=0 w=0 time=100207 us)(object id 34316) TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=5377 r=0 w=0 time=92046 us) INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=3664 r=0 w=0 time=61077 us)(object id 34345) TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=4092 r=0 w=0 time=63478 us) INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2647 r=0 w=0 time=40433 us)(object id 34345) TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=5439 r=0 w=0 time=133620 us) INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2639 r=0 w=0 time=65312 us)(object id 34345) TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=4124 r=0 w=0 time=75657 us) INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2630 r=0 w=0 time=49373 us)(object id 34345) TABLE ACCESS BY INDEX ROWID RELATIONSHIPS (cr=3930 r=0 w=0 time=58651 us) INDEX RANGE SCAN RELATIONSHIPS_N2 (cr=2630 r=0 w=0 time=38832 us)(object id 34345)

The execution plan starts out the same as before, with a hash join between RELATIONSHIPS and ASSIGNMENTS followed by a nested loops join to FORMAT_ITEMS. At this point there are 1831 rows in the candidate result set. Whereas the original query does five conventional joins to the RELATIONSHIPS table and the 1831 rows balloon to over 64 million, the revised query does semijoins instead and whittles the candidate rows down to 1300. Joining 1300 rows is a lot faster than joining 64 million rows, so you can see how the semi-join access path really speeds up this query. The NO_MERGE hint and the inline view (aliased as M) causes Oracle to separate the DISTINCT operation from the bulk of the query. This is how we get around the fact that a DISTINCT operation defeats semi-join access paths. In the second version of the query, Oracle executes the query as if the DISTINCT keyword were not there. Then, as a final step, it applies the DISTINCT operation to the interim result set that it has stored in memory (the output of the VIEW operator). If you are wondering why the NO_MERGE hint is necessary, recall what we discussed earlier about the heuristic Oracle uses when deciding whether or not to merge a subquery into the main body of a query. Oracle assumes that merging is a good thing and will do it whenever possible. Thus without the NO_MERGE hint Oracle would merge the inline view into the DISTINCT operation (eliminating the inline view), and the semi-join access path would be defeated by the presence of the DISTINCT operation. This example demonstrates the value of the semi-join access paths. Your gains may not be as spectacular as we have seen here, but hopefully you can see from this discussion how proper use of semijoins can make a certain class of queries much more efficient. This example also underscores the importance of knowing how to properly use semi-joins. If you did not remember that semi-join access paths are defeated by the DISTINCT operation, you might have changed the example query to use EXISTS clauses and wondered why the query still ran slowly. The NOT EXISTS and NOT IN Constructs In this section we will discuss how Oracle evaluates NOT EXISTS and NOT IN clauses, prerequisites for using Oracles anti-join access paths, and hints that influence anti-join query optimization. Then well look at a few examples of how anti-join access paths can be used to make some queries more efficient.
How Oracle Evaluates NOT EXISTS and NOT IN Clauses

NOT EXISTS and NOT IN clauses are used to retrieve rows from a table for which no match is found in another table. In the early days, Oracle would use the subquery of a NOT EXISTS or NOT IN clause as a filter. For each candidate row retrieved by the main body of the query, Oracle would execute the subquery. If the subquery retrieved one or more rows, then the candidate row would be discarded. Conceptually this approach seems similar to a nested loops anti-join. Somewhere around Oracle 7.3, the merge anti-join and hash anti-join access paths were added. These enabled Oracle to leverage the merge and hash join algorithms when performing anti-joins. However, Oracle would only use these algorithms if the always_anti_join instance parameter was set appropriately or if a hint was added to the query. In Oracle 8i this still appears to be true. Oracle 9i, however, does away with the always_anti_join instance parameter and instead the optimizer evaluates the costs of the different algorithms and chooses the one with the lowest cost. In an earlier section of this paper we discussed when the EXISTS clause should be used versus when the IN clause should be used. The two are functionally equivalent, and actively choosing one over the other is usually done in the hopes that this will help the optimizer choose a better execution plan. When talking about NOT EXISTS and NOT IN, however, it is important to note that these two constructs are not functionally equivalentthere is a significant difference between the two in how null values are handled. If the subquery of a NOT IN clause returns at least one null value, then the NOT IN predicate evaluates to false. The NOT EXISTS construct, meanwhile, concerns itself only with whether a row is returned or not, and thus does not do anything differently if a null value is returned. This may seem a bit abstract, so lets clarify the point with an example. Suppose you want a list of empty departmentsdepartments that have no employees. You could write this query with a NOT EXISTS clause or a NOT IN clause. We saw the NOT EXISTS case earlier:
SELECT FROM D.deptno, D.dname dept D

WHERE

NOT EXISTS ( SELECT 1 FROM emp E WHERE E.deptno = D.deptno ) ORDER BY D.deptno;

Written with a NOT IN clause, the query becomes:


SELECT FROM WHERE D.deptno, D.dname dept D D.deptno NOT IN ( SELECT E.deptno FROM emp E ) ORDER BY D.deptno;

The queries look like they would be functionally equivalent, but if you are using the EMP and DEPT tables from the SCOTT schema of an Oracle 9i database, then the two queries are in fact different. Insert a row into the EMP table with a null DEPTNO, run the two versions of the query, and you will see the difference. The NOT EXISTS version will show you that department 40 has no employees. However, the NOT IN version of the query will say there are no empty departments. Why? Because the subquery of the NOT IN clause retrieves a row with a null value, causing the NOT IN clause to always evaluate to false. As strange as this NOT IN nuance may seem, it is fully documented in Metalink document 28934.1. The difference between NOT EXISTS and NOT IN with respect to handling of nulls is an important onenot just from the standpoint of differing functionality, but also from a performance perspective as well. To evaluate a NOT IN subquery that is capable of returning nulls, Oracle cannot scan an index because an implicit NVL() function must be applied and this would defeat the index. (Null values arent stored in single-column non-unique indexes, anyway.) Well see the impact of this implicit NVL() function in the examples later in this section. If the subquery of a NOT IN clause is physically incapable of returning a null value (because the columns involved have NOT NULL constraints or because predicates have been added to the WHERE clause requiring non-null values), then the query with the NOT IN clause will return the same results as a query with an equivalent NOT EXISTS clauseand indexes wont be defeated in the process. If we wanted to write the query that shows empty departments with a NOT IN clause that gave the same results as the NOT EXISTS version, we would need to add a predicate to the subquery that rules out null values. The resulting query is:
SELECT FROM WHERE D.deptno, D.dname dept D D.deptno NOT IN ( SELECT E.deptno FROM emp E WHERE E.deptno IS NOT NULL ) ORDER BY D.deptno;

To choose whether to use NOT EXISTS or NOT IN in a query, therefore, you must first consider how you want null values to be handled. If you need the presence of a null value in the subquery to cause the expression to always evaluate to false, then you need to use the NOT IN construct. Otherwise you have a choice between NOT EXISTS and NOT IN (with the extra WHERE predicate if the subquery would otherwise be capable of returning a null value).

If you dont need the special null value semantics of NOT IN, then your choice of whether to use NOT EXISTS or NOT IN hinges on what sort of access path you think will be best for the query. In a conventional join, the nested loops algorithm is desirable when the predicates on the first table are very selective and the join columns in the second table are selectively indexed. The merge and hash join algorithms, meanwhile, are more desirable when predicates are not very selective or the join columns in the second table are not selectively indexed. A similar rule of thumb applies to anti-joins. That is to say, merge and hash anti-joins are most valuable in situations where the first table has a lot of candidate rows or the join columns are not selectively indexed. If you think your query is going to be better off with a merge or hash anti-join than a nested loops anti-join or a filter operation, then you might want to code your query with a NOT IN clause instead of a NOT EXISTS clause. Whenever the Oracle documentation discusses anti-join access paths, it always seems to do so in the context of the NOT IN construct with no mention of NOT EXISTS. While sometimes Oracle 9i will choose an anti-join access path for a query with a NOT EXISTS clause, it does seem that Oracle 9i might be more likely to do so if NOT IN is used instead of NOT EXISTS. In Oracle 8i, I believe the query optimizer will never choose an anti-join access path for a query with a NOT EXISTS clause (even if always_anti_join is set and a hint appears in the subquery). Meanwhile, if you dont think your query will benefit from a merge or hash anti-join, then you might be best off with the good old filter technique instead of the nested loops anti-join access path. It appears that, at least in Oracle 9i, the nested loops anti-join might not be as efficient as it could be. In the example of the query for empty departments that we have been discussing, the filter technique will rule out a department as soon as the first employee in that department is located. The nested loops anti-join, meanwhile, appears to locate all employees in the department before ruling it out. This is unnecessary work. Whether a department has one employee or 100, it does not qualify as an empty department. So Oracle should cut to the chase and move on to the next department as soon as the first employee is found. The filter technique does just that, but the nested loops anti-join (at least in the cases I have seen) does not. The second example in this section will demonstrate this. The upshot is that you must first consider how null values should be handled when deciding whether to use NOT EXISTS or NOT IN. If you need the special semantics provided by NOT IN, then your decision has been made. Otherwise, you should next consider whether or not the query might benefit from a merge or hash anti-join. If so, then you probably ought to choose NOT IN. If you decide to go with NOT IN but do not want the expression to evaluate to false if a null value is found, then make sure the subquery cannot return a null value. If there is no chance that the query will benefit from a merge or hash anti-join and the special semantics of NOT IN are not desired, then you might want to select the NOT EXISTS construct so that there is a better chance Oracle will perform an efficient filter instead of an inefficient nested loops anti-join.
Prerequisites and Hints that Impact Anti-Joins

The key prerequisite for the anti-join access paths is that the subquery of a NOT IN clause cannot be capable of returning a null value. As mentioned before, this means that either the columns being selected must have NOT NULL constraints or there must be predicates in the WHERE clause of the subquery to ensure there are no nulls. Even if every row in the table has a non-null value in a nullable column, you must still add the extra predicate to the WHERE clause or else Oracle will refuse to use the merge and hash anti-join access paths. Assuming that an anti-join access path is possible, Oracle chooses the anti-join algorithm to use while evaluating execution plans. In Oracle 9i an anti-join can be performed using the nested loops, merge, or hash join algorithms. As with a conventional join, Oracle will choose the join algorithm with the lowest cost, or it may choose to use the filter technique instead of an anti-join altogether. In Oracle 8i an anti-join can be performed using the merge or hash join algorithms, but as with semi-joins, Oracle 8i will not choose the join algorithm with the lowest cost. Instead, the always_anti_join instance parameter dictates which approach Oracle 8i should use. (Oracle 8i does not implement the nested loops anti-join access path, so if the always_anti_join parameter is set to nested_loops, the filter technique is used instead.) The always_anti_join instance parameter is obsolete in Oracle 9i. Oracle provides the NL_AJ, MERGE_AJ, and HASH_AJ hints in order for you to manipulate the anti-join process if you need to. As with the semi-join hints, the hint is applied to the subquery and not the main body of the query itself.
Anti-Join Example #1

Suppose you want a list of customers who have not placed an order within the last ten days. You might start with a query that looks like:
SELECT C.short_name, C.customer_id

FROM WHERE

customers C NOT EXISTS ( SELECT 1 FROM orders O WHERE O.customer_id = C.customer_id AND O.order_date > SYSDATE - 10 ) ORDER BY C.short_name;

The execution plan looks like:


Rows ------11 11 1000 989 989 100000 5338680 Row Source Operation --------------------------------------------------SORT ORDER BY (cr=18707 r=301 w=0 time=22491917 us) FILTER (cr=18707 r=301 w=0 time=22491555 us) TABLE ACCESS FULL CUSTOMERS (cr=38 r=36 w=0 time=15277 us) VIEW (cr=18669 r=265 w=0 time=22365647 us) HASH JOIN (cr=18669 r=265 w=0 time=22347234 us) INDEX RANGE SCAN ORDERS_N1_CUST_ID (cr=2207 r=208 w=0 time=277584 us)(object id 34555) INDEX RANGE SCAN ORDERS_N2_ORD_DATE (cr=16462 r=57 w=0 time=9036516 us)(object id 34556)

You can see that the query took over 22 seconds to run and performed 18,707 logical reads. Oracle did not use an anti-join access path in this execution plan, instead opting to read every row in the CUSTOMERS table and use the subquery as a filter by executing it once for each customer. Although Oracle found a clever way to hash join two indexes on the ORDERS table and thereby make the subquery very efficient, the query overall is still resource intensive because the subquery must execute 1000 timesonce for each customer. This query seems like a good candidate for an anti-join access path because executing the subquery once and performing a merge or hash anti-join might be more efficient than performing the subquery once for each customer as a filter operation. We could rewrite the query using a NOT IN clause instead of NOT EXISTS to see if Oracle chooses an anti-join access path:
SELECT FROM WHERE C.short_name, C.customer_id customers C C.customer_id NOT IN ( SELECT O.customer_id FROM orders O WHERE O.order_date > SYSDATE - 10 ) ORDER BY C.short_name;

Unfortunately, the results are quite disappointingelapsed time shot up to 695 seconds! The execution plan looks like:
Rows ------11 11 1000 989 5359590 Row Source Operation --------------------------------------------------SORT ORDER BY (cr=5360749 r=4870724 w=0 time=695232973 us) FILTER (cr=5360749 r=4870724 w=0 time=695232475 us) TABLE ACCESS FULL CUSTOMERS (cr=38 r=129 w=0 time=61614 us) TABLE ACCESS BY INDEX ROWID ORDERS (cr=5360711 r=4870595 w=0 time=695088325 us) INDEX RANGE SCAN ORDERS_N2_ORD_DATE (cr=16520 r=0 w=0 time=22299946 us)(object id 34556)

You can see that Oracle did not choose an anti-join access path. Instead, Oracle is still reading every row from the CUSTOMERS table and executing the subquery once for each of the 1000 customers. Moreover, Oracle is no longer hash joining the two ORDERS indexes to make the subquery very efficient. For each customer, Oracle is now grabbing all orders from the last 10 days and

sifting through them to find those belonging to the customer currently being processed. It turns out that the CUSTOMER_ID column in the ORDERS table is nullable, and so Oracle conceptually rewrites the query in order to implement the special null value semantics of NOT IN. The rewritten query looks like:
SELECT FROM WHERE C.short_name, C.customer_id customers C NOT EXISTS ( SELECT 1 FROM orders O WHERE O.order_date > SYSDATE 10 AND NVL (O.customer_id, C.customer_id) = C.customer_id ) ORDER BY C.short_name;

You can see how the implicit NVL() function that Oracle inserts during the transformation would be necessary in order to cause the predicate to evaluate to false if a null value is returned. You can also see how this defeats the index on the CUSTOMER_ID column of the ORDERS table. You might wonder why the CUSTOMER_ID column in the ORDERS table can be null. Lets suppose the developers tell you that this handles a special case and you should not worry about it. The developers want the query to ignore nulls. Thus you can add an extra predicate to the WHERE clause of the subquery in order to meet the prerequisite for anti-join access paths. The resulting query is:
SELECT FROM WHERE C.short_name, C.customer_id customers C C.customer_id NOT IN ( SELECT O.customer_id FROM orders O WHERE O.order_date > SYSDATE - 10 AND O.customer_id IS NOT NULL ) ORDER BY C.short_name;

The execution plan becomes:


Rows ------11 11 1000 20910 20910 20910 100000 Row Source Operation --------------------------------------------------SORT ORDER BY (cr=311 r=132 w=98 time=1464035 us) HASH JOIN ANTI (cr=311 r=132 w=98 time=1463770 us) TABLE ACCESS FULL CUSTOMERS (cr=38 r=34 w=0 time=37976 us) VIEW (cr=273 r=98 w=98 time=1318222 us) HASH JOIN (cr=273 r=98 w=98 time=1172207 us) INDEX RANGE SCAN ORDERS_N2_ORD_DATE (cr=58 r=0 w=0 time=43870 us)(object id 34556) INDEX FAST FULL SCAN ORDERS_N1_CUST_ID (cr=215 r=0 w=0 time=196953 us)(object id 34555)

You can see that Oracle now chooses to perform a hash anti-join. The CUSTOMERS table gets a full table scan as before, but the critical difference is that now the subquery executes just once. The two indexes on the ORDERS table are hash joined once in order to develop a list of order dates for each customer, and this is hash anti-joined to the CUSTOMERS table in order to yield the list of customers with no orders in the last 10 days. The result is a query that completes in 1.46 seconds and performs just 311 logical reads. In this example we did not need to use hints to get Oracle to find the most efficient execution plan. All we had to do was write the query in a way that made it eligible for the anti-join access paths. (This entailed using NOT IN instead of NOT EXISTS, and adding a predicate to the subquerys WHERE clause to make it impossible for the subquery to return a null value.)

Anti-Join Example #2

Along the lines of the previous example, suppose you needed to find out if a specific customer has not placed an order within the last 10 days. Your query might look like:
SELECT FROM WHERE AND C.short_name, C.customer_id customers C C.customer_id = 22 NOT EXISTS ( SELECT 1 FROM orders O WHERE O.order_date > SYSDATE - 10 AND O.customer_id = C.customer_id ) ORDER BY C.short_name;

The execution plan looks like:


Rows ------0 0 1 2 Row Source Operation --------------------------------------------------TABLE ACCESS BY INDEX ROWID CUSTOMERS (cr=6 r=0 w=0 time=341 us) INDEX UNIQUE SCAN CUSTOMERS_PK (cr=6 r=0 w=0 time=333 us)(object id 34552) TABLE ACCESS BY INDEX ROWID ORDERS (cr=4 r=0 w=0 time=170 us) INDEX RANGE SCAN ORDERS_N1_CUST_ID (cr=2 r=0 w=0 time=65 us)(object id 34555)

As you can see, Oracle uses a filter operation instead of an anti-join. This is not surprising, because we used the NOT EXISTS construct instead of NOT IN. The query performed 6 logical reads and completed in 341 microseconds. Customer 22 has 110 orders in the system. By looking at the row counts in the execution plan, you can see that Oracle stopped the filter process as soon as it found the first order placed by the customer within the last 10 days. There is no need to look at all of the orders placed by this customer because weve already found one that excludes the customer from the result set. This is an example of a query with an anti-join where the main query is extremely selectivewe are only looking at one customer record. This begs the question: Would this be a good opportunity to use the nested loops anti-join access path? It is hard to imagine an execution plan for this query that is more efficient than what we have just seen, but lets try. The query rewritten with a NOT IN clause and an NL_AJ hint looks like:
SELECT FROM WHERE AND C.short_name, C.customer_id customers C C.customer_id = 22 C.customer_id NOT IN ( SELECT /*+ NL_AJ */ O.customer_id FROM orders O WHERE O.order_date > SYSDATE - 10 AND O.customer_id IS NOT NULL ) ORDER BY C.short_name;

The execution plan is:


Rows ------0 Row Source Operation --------------------------------------------------SORT ORDER BY (cr=115 r=86 w=0 time=16824 us)

0 1 1 10 110

NESTED LOOPS ANTI (cr=115 r=86 w=0 time=16724 us) TABLE ACCESS BY INDEX ROWID CUSTOMERS (cr=3 r=2 w=0 time=467 us) INDEX UNIQUE SCAN CUSTOMERS_PK (cr=2 r=2 w=0 time=429 us)(object id34552) TABLE ACCESS BY INDEX ROWID ORDERS (cr=112 r=84 w=0 time=16213 us) INDEX RANGE SCAN ORDERS_N1_CUST_ID (cr=2 r=0 w=0 time=308 us)(object id 34555)

You can see that Oracle obeyed our hint and used the nested loops anti-join access path. Unfortunately, the number of logical reads went up from 6 to 115 and the elapsed time went up by a factor of 50 to 16824 microseconds. Why? The row counts in the execution plan tell the story. Oracle looked at all 110 orders from the customer and found that 10 were placed less than 10 days ago. Oracle did a lot of unnecessary work. In this regard the filter access path seems much smarter than the nested loops anti-join access path. This behavior was seen on an Oracle 9.2.0.4 database. Lets hope Oracle fixes this in a future release. Until then, I cant think of a case where the nested loops anti-join access path would be useful. By the way, it turns out that Oracle would have chosen the nested loops anti-join access path even without the NL_AJ hint. So it turns out that this query was more efficient when written with a NOT EXISTS than a NOT IN. Again, hopefully this is just an anomaly that will be fixed in a future release of Oracle.
Anti-Join Example #3

I recently used the concepts presented in this section to bring a clients query down from 33 minutes (where one CPU on the server was pegged) to less than four seconds. It turned out to be another case of the implicit NVL() function in a NOT IN clause defeating an index. This time it was a query that showed a count of calls to the customer service center placed by users who did not belong to corporate customers. The original query looked like this:
SELECT COUNT(*) FROM calls C WHERE C.requestor_user_id NOT IN ( SELECT CM.member_user_id FROM corp_members CM );

(Once again names have been changed to protect the innocent.) The execution plan looked like this:
Rows ------1 0 184965 61032 Row Source Operation --------------------------------------------------SORT AGGREGATE (cr=12784272 r=1864678 w=0 time=1978321835 us) FILTER (cr=12784272 r=1864678 w=0 time=1978321817 us) TABLE ACCESS FULL CALLS (cr=3588 r=1370 w=0 time=979769 us) TABLE ACCESS FULL CORP_MEMBERS (cr=12780684 r=1863308 w=0 time=2948544268 us)

It appears that Oracle is reading rows from the CALLS table one at a time and, for each row, is scanning the CORP_MEMBERS table until it finds a row with a matching MEMBER_USER_ID. Since there are over 180,000 calls in the system, it takes over 180,000 partial scans of the CORP_MEMBERS table to complete this query. (The CORP_MEMBERS table has 584 blocks below its high water mark.) A quick check of the CORP_MEMBERS table shows that the MEMBER_USER_ID column is indexed. However, the column is also nullable and so Oracle cannot use the index because of the implicit NVL() function associated with the NOT IN clause. Since the client saw records in the CORP_MEMBERS table with a null MEMBER_USER_ID as an unrelated special case, they wanted the query to ignore such records. So my first step was to rewrite the query with the extra WHERE predicate in the subquery that makes the NOT IN clause ignore nulls:
SELECT COUNT(*)

FROM WHERE

calls C C.requestor_user_id NOT IN ( SELECT CM.member_user_id FROM corp_members CM WHERE CM.member_user_id IS NOT NULL );

The resulting execution plan was:


Rows ------1 0 184965 81945 Row Source Operation --------------------------------------------------SORT AGGREGATE (cr=3790 r=3906 w=420 time=5450615 us) HASH JOIN ANTI (cr=3790 r=3906 w=420 time=5450591 us) TABLE ACCESS FULL CALLS (cr=3588 r=3480 w=0 time=646554 us) INDEX FAST FULL SCAN CORP_MEMBERS_USER_ID (cr=202 r=6 w=0 time=113178 us)(object id 34580)

Now not only did Oracle use the index on the MEMBER_USER_ID column when accessing the CORP_MEMBERS table, but more importantly Oracle also switched to a hash anti-join access path. This means that Oracle only had to access the CORP_MEMBERS table once instead of more than 180,000 times. The execution time dropped by 99% to 5.45 seconds. The users were thrilled with the improvement and the problem was considered resolved. But to better understand Oracles data access paths and join methods, I benchmarked two more versions of the query. First I added a MERGE_AJ hint to cause a merge anti-join instead of a hash anti-join. This version of the query performed the same number of logical reads as the hash anti-join but used a little more CPU time. In this situation it looked like the merge anti-join used up more CPU time on sorting than did the hash anti-join. I also tried rewriting the query with a NOT EXISTS clause to see how it would perform using a filter operation instead of an anti-join access path. This version of the query looked like:
SELECT COUNT(*) FROM calls C WHERE NOT EXISTS ( SELECT 1 FROM corp_members CM WHERE CM.member_user_id = C.requestor_user_id );

The execution plan was:


Rows ------1 0 184965 61032 Row Source Operation --------------------------------------------------SORT AGGREGATE (cr=125652 r=3489 w=0 time=3895569 us) FILTER (cr=125652 r=3489 w=0 time=3895547 us) TABLE ACCESS FULL CALLS (cr=3588 r=3489 w=0 time=906699 us) INDEX RANGE SCAN CORP_MEMBERS_USER_ID (cr=122064 r=0 w=0 time=1842121 us)(object id 34580)

This version of the query performed a lot more logical reads than the hash anti-join version, but actually used less CPU time. This query appeared to hammer on the blocks of the CORP_MEMBERS_USER_ID index very heavily. Most likely the whole index segment was sitting in the buffer cache and thus the logical reads were relatively inexpensive. Meanwhile, the large sort required by a hash anti-join was avoided and the result was a net reduction in CPU time.

These examples demonstrate the value of the anti-join access paths and the significance of the way NOT IN treats null values. Again, your gains may not be as spectacular as those shown here, but hopefully you can see from this discussion how the proper use of anti-joins and NOT EXISTS and NOT IN clauses can make a certain class of queries much more efficient. These examples also demonstrate the importance of the difference between NOT EXISTS and NOT IN, and the apparent behavior (although not fully documented as far as I can tell) that Oracle typically will not apply anti-join access paths to a NOT EXISTS clause. Conclusion The semi-join and anti-join are two special ways of joining data from multiple tables in a query. Although SQL does not have a special symbol or keyword to designate a semi- or anti-join, we use the EXISTS, IN, NOT EXISTS, and NOT IN constructs to indicate when one of these joins is desired. Over the years Oracle has gotten better at optimizing semi-joins and anti-joins, and new data access paths make it possible to write very efficient queries that use these two types of joins. However, there are certain prerequisites to be met before Oracle can use these access paths, and sometimes hints or specific coding styles are necessary in order for the query optimizer to find the best execution plan possible. Understanding how semi-joins and anti-joins workand how Oracle implements the relevant data access pathswill enable you to write very efficient SQL and dramatically speed up an entire class of queries. This in turn will make your employer (or client) very happy. About the Author Roger Schrag, OCP, has been an Oracle DBA and application architect for over 15 years. He began his career at Oracle Corporation on the Oracle Financials development team. In 1995, he founded Database Specialists, Inc., a boutique consulting firm specializing in Oracle database technology, remote administration, and performance tuning. Since that time, Roger has focused his expertise in the area of performance optimization. Roger is a frequent speaker at Oracle OpenWorld and the International Oracle Users Group (IOUG) Live conferences, where he has frequently been voted in the Top 10% of speakers. Roger has been an Oracle Masters Class instructor for the IOUG and is the current president of the Northern California Oracle Users Group (NoCOUG). He can be reached at rschrag@dbspecialists.com.
Still Looking for Help on this Subject?
Get a Consultation We would be happy to talk with you about our services and how our senior-level database team might help you. Call Database Specialists at 415-344-0500 or 888-648-0500 or fill out a free consultation request form. Complimentary Newsletter If you'd like to receive our complimentary monthly newsletter with database tips and new white paper announcements, sign up for The Specialist.

Copyright 2005 Database Specialists, Inc. http://www.dbspecialists.com

Vous aimerez peut-être aussi