Home > Oracle, SQL > SQL 101: Which Query is Better?—Part II

SQL 101: Which Query is Better?—Part II


In a previous post, I reviewed alternative formulations of the query “which departments have employees with salaries greater than a certain cutoff?”

VARIABLE salary_cutoff NUMBER

-- Correlated subquery

SELECT d.department_name
FROM departments d
WHERE EXISTS (
  SELECT *
  FROM employees e
  WHERE salary > :salary_cutoff
  AND department_id = d.department_id
);

-- the JOIN keyword

SELECT d.department_name
FROM (
  SELECT DISTINCT department_id
  FROM employees
  WHERE salary > :salary_cutoff
) e
JOIN departments d ON e.department_id = d.department_id;

I concluded that the “relational calculus” formulation using a “correlated subquery” was better because it gave more options to the query optimizer. In fact, the founder of relational database theory, Edgar (Ted) Codd, predicted that queries based on relational calculus would be easier to optimize. In Relational Completeness of Data Base Sublanguages, he says:

The relational calculus permits a user to request the data he desires by its properties. This is an ideal starting point for search optimization. The algebra, on the other hand, requires the user to formulate a sequence of algebraic operations that will generate the desired data from the data base relations. For queries other than very simple ones, the properties of the desired data tend to get hidden in the particular operation sequence (one of many possible ones) which the user selects. Therefore, starting from an algebraic source language, one has the choice of locally optimizing the execution of each operation (a very limited form of optimization) or tackling the difficult problem of analyzing sequences of such operations to discover the intended defining properties of the desired data.

This leads us to ask what exactly constitutes a relational calculus formulation. We can find the following definition on page 24 of the referenced paper by Codd:

Let the given “alpha expression” be z = {t : w}, where

  1. t = (t1, t2, … tk) is the target list;
  2. w = U1 AND U2 AND … Up AND V is the qualification (a range-separable well-formed formula);
  3. there are q >= 0 bound variables in V;
  4. all p of the variables free in w occur in t.

Codd wrote his paper before the SQL language was created and therefore he used the language of mathematics. An “alpha expression” is simply what we now call a SELECT statement, (t1, t2, … tk) is the select-list in the statement, U1 AND U2 AND … Up is the from-list (each U refers to one table and there are p of them), V is the where-clause, “free variables” are tables listed in the from-list, and “bound variables” are those that are listed in the from-list of a correlated subquery contained in the statement.

Since all p free variables must occur in (t1, t2, … tk), the key point in Codd’s definition is that the only tables listed in the from-list should be those that are referenced in the select-list.

Let’s look at some examples; we’ll use the HR Sample Schema provided by Oracle. The first example is of managers who have an employee with a higher salary. The formulation below does not meet the requirements of the above definition because the variable “emp” is not referenced in the select-list. Note that the employees table is used twice in the from-list but each reference has been given a separate alias; for the purposes of the query, the two references are distinct tables.

SELECT DISTINCT
  mgr.employee_id AS mgr_id,
  mgr.salary AS mgr_salary
FROM
  employees emp,
  employees mgr
WHERE
  emp.manager_id = mgr.employee_id
AND
  emp.salary > mgr.salary;

The following is a more modern formulation of the above query and uses the JOIN keyword. However, it is only cosmetically different from the above query and it too does not meet the requirements of Codd’s definition.

SELECT DISTINCT
  mgr.employee_id AS mgr_id,
  mgr.salary AS mgr_salary
FROM
  employees emp
  JOIN employees mgr ON emp.manager_id = mgr.employee_id
WHERE
  emp.salary > mgr.salary;

Next, we have a formulation that does meet the requirements of Codd’s definition. The only variable in the from-list is “mgr” which is the only variable referenced in the select-list.

SELECT
  mgr.employee_id AS mgr_id,
  mgr.salary AS mgr_salary
FROM
  employees mgr
WHERE EXISTS (
  SELECT
    *
  FROM
    employees emp
  WHERE
    emp.manager_id = mgr.employee_id
  AND
    emp.salary > mgr.salary
);

The next example is of employees who have a higher salary than their respective managers. The formulation below does not meet the requirements of Codd’s definition because the variable “emp” is not referenced in the select-list.

SELECT
  emp.employee_id AS emp_id,
  emp.salary AS emp_salary
FROM
  employees emp,
  employees mgr
WHERE
  emp.manager_id = mgr.employee_id
AND
  emp.salary > mgr.salary;

The next formulation of the above query uses the JOIN keyword but it too does not meet the requirements of Codd’s definition.

SELECT
  emp.employee_id AS emp_id,
  emp.salary AS emp_salary
FROM
  employees emp
  JOIN employees mgr ON emp.manager_id = mgr.employee_id
WHERE
  emp.salary > mgr.salary;

Finally, we have a formulation that does meet the requirements of Codd’s definition. The only variable in the from-list is “mgr” which is the only variable referenced in the select-list.

SELECT
  emp.employee_id AS emp_id,
  emp.salary AS emp_salary
FROM
  employees emp
WHERE EXISTS (
  SELECT
    *
  FROM
    employees mgr
  WHERE
    emp.manager_id = mgr.employee_id
  AND
    emp.salary > mgr.salary
);

Our third example requires information about employees as well as managers. It lists the identification numbers and salaries of employees who have a higher salary than their respective managers together with the identification numbers and salaries of the managers. The formulation below meets the requirements of Codd’s definition.

SELECT
  emp.employee_id AS emp_id,
  emp.salary AS emp_salary,
  mgr.employee_id AS mgr_id,
  mgr.salary AS mgr_salary
FROM
  employees emp,
  employees mgr
WHERE
  emp.manager_id = mgr.employee_id
AND
  emp.salary > mgr.salary;

The formulation below uses the modern JOIN keyword. Even though references to algebraic operations such as JOIN, INTERSECT, MINUS, and UNION are not part of the language of relational calculus, I have to admit that it is only cosmetically different than the formulation above and does meet Codd’s requirements in spirit if not in letter.

SELECT
  emp.employee_id AS emp_id,
  emp.salary AS emp_salary,
  mgr.employee_id AS mgr_id,
  mgr.salary AS mgr_salary
FROM
  employees emp
  JOIN employees mgr ON emp.manager_id = mgr.employee_id
AND
  emp.salary > mgr.salary;

Which Query is Better?—Part I
Sequel or Ess-Cue-El?

About these ads
Categories: Oracle, SQL

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 743 other followers

%d bloggers like this: