Home > Oracle, SQL > SQL 101: Which Query is Better?

SQL 101: Which Query is Better?


I was asked the following question (paraphrased) during a job interview:

In general, there are lots of ways of expressing a particular query requirement in SQL with implications for query performance. For example, which departments have employees with salaries greater than a certain cutoff? Here are two ways to express this query requirement in SQL. The first uses a conventional correlated subquery while the second uses ANSI join syntax. Which is better? From a theoretical perspective? From a practical perspective? In certain situations? Since I can only pick one, which one should I pick?

VARIABLE salary_cutoff NUMBER

-- Correlated subquery

SELECT d.department_name
FROM departments d
WHERE EXISTS (
  SELECT *
  FROM employees e
  WHERE salary > :salary_cutoff
  AND department_id = d.department_id
);

-- ANSI join syntax

SELECT d.department_name
FROM (
  SELECT DISTINCT department_id
  FROM employees
  WHERE salary > :salary_cutoff
) e
JOIN departments d ON e.department_id = d.department_id;

My interviewer’s opinion was that the best choice depends on the salary cutoff. He argued that if the salary cutoff is high, few employees will qualify and therefore the Employees table should be used as the driving table; that is, the second version of the query (the ANSI Join version) is the best choice when the salary cutoff is high. Conversely, if the salary cutoff is low, many employees will qualify and therefore it would be unadvisable to use the Employees table as the driving table; that is, the first version of the query (the Correlated Subquery version) is the best choice when the salary cutoff is low.

The above reasoning assumes that there is only one choice of driving table for each version of the query. This assumption is based on an intuitive interpretation of each version. An intuitive interpretation of the Correlated Subquery version is that the Departments table is the driving table because the first words of the query are “SELECT d.department_name FROM departments.” Similarly, an intuitive interpretation of the ANSI Join version is that the Employees table is the driving table because it is the first table encountered in the query. However—as the following demonstrations prove—the choice of driving table is not constrained. Hints are used in the demonstrations to alter the driving table at will. This effectively proves that the Oracle query optimizer can choose either table as the driving table. The demonstrations used Oracle Database 11g Release 2.

The first demonstration uses the LEADING and HASH_SJ hints in order to make the Departments table the driving table for the Correlated Subquery version of the query. Note that Oracle uses a Semijoin method (Hash Join Semi) to join the Departments table and the Employees table.

exec :salary_cutoff := 0;

SELECT
  /*+ QB_NAME(main) LEADING(d@main) */
  d.department_name
FROM hr.departments d
WHERE EXISTS (
  SELECT /*+ QB_NAME(sub) HASH_SJ */ *
  FROM hr.employees e
  WHERE e.salary > :salary_cutoff
  AND e.department_id = d.department_id
);

select * from table(dbms_xplan.display_cursor());

----------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |     7 (100)|          |
|*  1 |  HASH JOIN SEMI    |             |    10 |   230 |     7  (15)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| DEPARTMENTS |    27 |   432 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| EMPLOYEES   |   107 |   749 |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------------

The second demonstration uses the LEADING and USE_HASH hints in order to make the Employees table the driving table for the Correlated Subquery version of the query. Note that Oracle uses a regular join method (Hash Join) in this case.

exec :salary_cutoff := 0;

SELECT
  /*+ QB_NAME(main) LEADING(e@sub) USE_HASH(d@main) */
  d.department_name
FROM hr.departments d
WHERE EXISTS (
  SELECT /*+ QB_NAME(sub) */ *
  FROM hr.employees e
  WHERE e.salary > :salary_cutoff
  AND e.department_id = d.department_id
);

select * from table(dbms_xplan.display_cursor());

-----------------------------------------------------------------------------------
| Id  | Operation           | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |             |       |       |     8 (100)|          |
|*  1 |  HASH JOIN          |             |    11 |   253 |     8  (25)| 00:00:01 |
|   2 |   SORT UNIQUE       |             |   107 |   749 |     3   (0)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| EMPLOYEES   |   107 |   749 |     3   (0)| 00:00:01 |
|   4 |   TABLE ACCESS FULL | DEPARTMENTS |    27 |   432 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------------

The third demonstration also uses the LEADING and USE_HASH hints in order to make the Employees table the driving table for the ANSI Join version of the query. The query plan is almost identical to the previous query plan, the only difference being the use of the Hash Unique method instead of the Sort Unique method for sorting the contents of the Employees table. Note that the Oracle query optimizer’s estimate (107) of the number of the rows remaining after the Sort Unique operation in the previous query plan is obviously incorrect because it does not differ from the estimate of the number of rows (107) in the Employees table. The optimizer makes a better estimate (11) of the number of rows remaining after the Hash Unique operation in the following query plan.

exec :salary_cutoff := 0;

SELECT /*+ LEADING(e d) USE_HASH(d) */ d.department_name
FROM (
  SELECT DISTINCT department_id
  FROM hr.employees
  WHERE salary > :salary_cutoff
) e
JOIN hr.departments d ON e.department_id = d.department_id;

select * from table(dbms_xplan.display_cursor());

------------------------------------------------------------------------------------
| Id  | Operation            | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |             |       |       |     8 (100)|          |
|*  1 |  HASH JOIN           |             |    11 |   319 |     8  (25)| 00:00:01 |
|   2 |   VIEW               |             |    11 |   143 |     4  (25)| 00:00:01 |
|   3 |    HASH UNIQUE       |             |    11 |    77 |     4  (25)| 00:00:01 |
|*  4 |     TABLE ACCESS FULL| EMPLOYEES   |   107 |   749 |     3   (0)| 00:00:01 |
|   5 |   TABLE ACCESS FULL  | DEPARTMENTS |    27 |   432 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------

The fourth demonstration also uses the LEADING and USE_HASH hints in order to make the Departments table the driving table for the ANSI Join version of the query. This query plan differs very significantly from the query plan in the first demonstration because it requires a sort operation and a regular join instead of a Semijoin.

exec :salary_cutoff := 0;

SELECT /*+ LEADING(d e) USE_HASH(e) */ d.department_name
FROM (
  SELECT DISTINCT department_id
  FROM hr.employees
  WHERE salary > :salary_cutoff
) e
JOIN hr.departments d ON e.department_id = d.department_id;

select * from table(dbms_xplan.display_cursor());

------------------------------------------------------------------------------------
| Id  | Operation            | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |             |       |       |     8 (100)|          |
|*  1 |  HASH JOIN           |             |    11 |   319 |     8  (25)| 00:00:01 |
|   2 |   TABLE ACCESS FULL  | DEPARTMENTS |    27 |   432 |     3   (0)| 00:00:01 |
|   3 |   VIEW               |             |    11 |   143 |     4  (25)| 00:00:01 |
|   4 |    HASH UNIQUE       |             |    11 |    77 |     4  (25)| 00:00:01 |
|*  5 |     TABLE ACCESS FULL| EMPLOYEES   |   107 |   749 |     3   (0)| 00:00:01 |
------------------------------------------------------------------------------------

The above demonstrations prove that there is no constraint on the choice of driving table in either version of the query. However, the ANSI Join version requires an unnecessary sort operation if the Department table is chosen as the driving table. Query optimization is a very complex problem and it appears that Oracle is not currently capable of utilizing the Semijoin method to process the ANSI Join version of the query. This significantly reduces the number of options considered by the query optimizer when processing the ANSI Join version. For example, the following query plan can be used with the Correlated Subquery version of the query but not with the ANSI Join version because it uses another variant of the Semijoin method (Nested Loops Semi).

exec :salary_cutoff := 0;

SELECT
  /*+ QB_NAME(main) LEADING(d@main) USE_NL(e@sub) */
  d.department_name
FROM hr.departments d
WHERE EXISTS (
  SELECT /*+ QB_NAME(sub) */ *
  FROM hr.employees e
  WHERE e.salary > :salary_cutoff
  AND e.department_id = d.department_id
);

select * from table(dbms_xplan.display_cursor());

--------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                   |       |       |    14 (100)|          |
|   1 |  NESTED LOOPS SEMI           |                   |    10 |   230 |    14   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL          | DEPARTMENTS       |    27 |   432 |     3   (0)| 00:00:01 |
|*  3 |   TABLE ACCESS BY INDEX ROWID| EMPLOYEES         |    41 |   287 |     1   (0)| 00:00:01 |
|*  4 |    INDEX RANGE SCAN          | EMP_DEPARTMENT_IX |     1 |       |     0   (0)|          |
--------------------------------------------------------------------------------------------------

The above demonstrations indicate that the Correlated Subquery version of the query is the better choice.

P.S. Note that the Correlated Subquery version of the query is based on Relational Calculus while the ANSI Join version is based on Relational Algebra. The founder of relational database theory, Edgar Codd, predicted that queries based on Relational Calculus would be easier to optimize. In his words: “The relational calculus permits a user to request the data he desires by its properties. This is an ideal starting point for search optimization. The algebra, on the other hand, requires the user to formulate a sequence of algebraic operations that will generate the desired data from the data base relations. For queries other than very simple ones, the properties of the desired data tend to get hidden in the particular operation sequence (one of many possible ones) which the user selects. Therefore, starting from an algebraic source language, one has the choice of locally optimizing the execution of each operation (a very limited form of optimization) or tackling the difficult problem of analyzing sequences of such operations to discover the intended defining properties of the desired data.” (Relational Completeness of Database Sublanguages, 1972).

About these ads
Categories: Oracle, SQL
  1. July 26, 2010 at 12:37 pm

    Hi Iggy,

    >The first demonstration uses the LEADING and HASH_SJ hints in order to make the Departments table the driving table for the Correlated Subquery version of the query.
    HASH_SJ hint is no longer documented. To do the same thing in a documented way, you can just instruct Oracle to use hash join for joining employees:

    use_hash(@sub e)
    

    – which is, BTW, correct (documented) way to specify query block for a hint.

  1. July 30, 2010 at 3:49 pm
  2. October 29, 2010 at 3:48 pm

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 760 other followers

%d bloggers like this: