SQL 101: Which Query is Better?—Part II
In a previous post, I reviewed alternative formulations of the query “which departments have employees with salaries greater than a certain cutoff?”
VARIABLE salary_cutoff NUMBER -- Correlated subquery SELECT d.department_name FROM departments d WHERE EXISTS ( SELECT * FROM employees e WHERE salary > :salary_cutoff AND department_id = d.department_id ); -- the JOIN keyword SELECT d.department_name FROM ( SELECT DISTINCT department_id FROM employees WHERE salary > :salary_cutoff ) e JOIN departments d ON e.department_id = d.department_id;
I concluded that the “relational calculus” formulation using a “correlated subquery” was better because it gave more options to the query optimizer. In fact, the founder of relational database theory, Edgar (Ted) Codd, predicted that queries based on relational calculus would be easier to optimize. In Relational Completeness of Data Base Sublanguages, he says:
The relational calculus permits a user to request the data he desires by its properties. This is an ideal starting point for search optimization. The algebra, on the other hand, requires the user to formulate a sequence of algebraic operations that will generate the desired data from the data base relations. For queries other than very simple ones, the properties of the desired data tend to get hidden in the particular operation sequence (one of many possible ones) which the user selects. Therefore, starting from an algebraic source language, one has the choice of locally optimizing the execution of each operation (a very limited form of optimization) or tackling the difficult problem of analyzing sequences of such operations to discover the intended defining properties of the desired data.
This leads us to ask what exactly constitutes a relational calculus formulation. We can find the following definition on page 24 of the referenced paper by Codd:
Let the given “alpha expression” be z = {t : w}, where
- t = (t1, t2, … tk) is the target list;
- w = U1 AND U2 AND … Up AND V is the qualification (a range-separable well-formed formula);
- there are q >= 0 bound variables in V;
- all p of the variables free in w occur in t.
Codd wrote his paper before the SQL language was created and therefore he used the language of mathematics. An “alpha expression” is simply what we now call a SELECT statement, (t1, t2, … tk) is the select-list in the statement, U1 AND U2 AND … Up is the from-list (each U refers to one table and there are p of them), V is the where-clause, “free variables” are tables listed in the from-list, and “bound variables” are those that are listed in the from-list of a correlated subquery contained in the statement.
Since all p free variables must occur in (t1, t2, … tk), the key point in Codd’s definition is that the only tables listed in the from-list should be those that are referenced in the select-list.
Let’s look at some examples; we’ll use the HR Sample Schema provided by Oracle. The first example is of managers who have an employee with a higher salary. The formulation below does not meet the requirements of the above definition because the variable “emp” is not referenced in the select-list. Note that the employees table is used twice in the from-list but each reference has been given a separate alias; for the purposes of the query, the two references are distinct tables.
SELECT DISTINCT mgr.employee_id AS mgr_id, mgr.salary AS mgr_salary FROM employees emp, employees mgr WHERE emp.manager_id = mgr.employee_id AND emp.salary > mgr.salary;
The following is a more modern formulation of the above query and uses the JOIN keyword. However, it is only cosmetically different from the above query and it too does not meet the requirements of Codd’s definition.
SELECT DISTINCT mgr.employee_id AS mgr_id, mgr.salary AS mgr_salary FROM employees emp JOIN employees mgr ON emp.manager_id = mgr.employee_id WHERE emp.salary > mgr.salary;
Next, we have a formulation that does meet the requirements of Codd’s definition. The only variable in the from-list is “mgr” which is the only variable referenced in the select-list.
SELECT
mgr.employee_id AS mgr_id,
mgr.salary AS mgr_salary
FROM
employees mgr
WHERE EXISTS (
SELECT
*
FROM
employees emp
WHERE
emp.manager_id = mgr.employee_id
AND
emp.salary > mgr.salary
);
The next example is of employees who have a higher salary than their respective managers. The formulation below does not meet the requirements of Codd’s definition because the variable “emp” is not referenced in the select-list.
SELECT emp.employee_id AS emp_id, emp.salary AS emp_salary FROM employees emp, employees mgr WHERE emp.manager_id = mgr.employee_id AND emp.salary > mgr.salary;
The next formulation of the above query uses the JOIN keyword but it too does not meet the requirements of Codd’s definition.
SELECT emp.employee_id AS emp_id, emp.salary AS emp_salary FROM employees emp JOIN employees mgr ON emp.manager_id = mgr.employee_id WHERE emp.salary > mgr.salary;
Finally, we have a formulation that does meet the requirements of Codd’s definition. The only variable in the from-list is “mgr” which is the only variable referenced in the select-list.
SELECT
emp.employee_id AS emp_id,
emp.salary AS emp_salary
FROM
employees emp
WHERE EXISTS (
SELECT
*
FROM
employees mgr
WHERE
emp.manager_id = mgr.employee_id
AND
emp.salary > mgr.salary
);
Our third example requires information about employees as well as managers. It lists the identification numbers and salaries of employees who have a higher salary than their respective managers together with the identification numbers and salaries of the managers. The formulation below meets the requirements of Codd’s definition.
SELECT emp.employee_id AS emp_id, emp.salary AS emp_salary, mgr.employee_id AS mgr_id, mgr.salary AS mgr_salary FROM employees emp, employees mgr WHERE emp.manager_id = mgr.employee_id AND emp.salary > mgr.salary;
The formulation below uses the modern JOIN keyword. Even though references to algebraic operations such as JOIN, INTERSECT, MINUS, and UNION are not part of the language of relational calculus, I have to admit that it is only cosmetically different than the formulation above and does meet Codd’s requirements in spirit if not in letter.
SELECT emp.employee_id AS emp_id, emp.salary AS emp_salary, mgr.employee_id AS mgr_id, mgr.salary AS mgr_salary FROM employees emp JOIN employees mgr ON emp.manager_id = mgr.employee_id AND emp.salary > mgr.salary;
-
October 29, 2010 at 3:52 pm | #1Log Buffer #204, A Carnival of the Vanities for DBAs | The Pythian Blog


Recent Comments