Part 1: The Hitchhiker’s Guide to SQL: Lateral derived tables and other alternatives to GROUP BY
“In many of the more relaxed civilizations on the Outer Eastern Rim of the Galaxy, the Hitchhiker’s Guide has already supplanted the great Encyclopaedia Galactica as the standard repository of all knowledge and wisdom, for though it has many omissions and contains much that is apocryphal, or at least wildly inaccurate, it scores over the older, more pedestrian work in two important respects. First, it is slightly cheaper; and secondly it has the words DON’T PANIC inscribed in large friendly letters on its cover.”—The Hitchhiker’s Guide to the Galaxy by Douglas Adams
The father of relational theory, Dr. Edgar (Ted) Codd posed the following question as an exercise in his book The Relational Model for Database Management: Version 2:
“Consider the language SQL. It contains a feature called GROUP BY. Treat SQL, with this feature dropped as SQLX. Are there any queries expressible in SQL that are not expressible in SQLX? If your answer is yes, supply two examples. If the answer is no, supply a proof.”
DON’T PANIC. The answer is no. Here is a proof by example. Suppose that we have the following query requirement in the HR sample schema: find the sum of employee salaries for each department at location 1700 with more than 5 employees. Four different solutions are shown below, beginning with the traditional GROUP BY solution and followed by window functions, scalar subqueries, and lateral derived tables. Note that event 22829 needs to be set before using lateral derived tables; thanks to Jonathan Lewis for that tip. I conducted my tests using Oracle Database 11g Release 2 and the Developer Days Virtual Machine.
Traditional solution using GROUP BY:
SELECT /* GROUP BY */ d.department_name, SUM(e.salary) AS sum FROM departments d INNER JOIN employees e ON (e.department_id = d.department_id) WHERE d.location_id = 1700 GROUP BY d.department_name HAVING COUNT(*) > 5 ORDER BY d.department_name;
Solution using window functions:
SELECT DISTINCT /* Window functions */ department_name, sum FROM ( SELECT d.department_name, SUM(e.salary) OVER (PARTITION BY d.department_name) AS sum, COUNT(*) OVER (PARTITION BY d.department_name) AS count FROM departments d INNER JOIN employees e ON (e.department_id = d.department_id) WHERE d.location_id = 1700 ) WHERE COUNT > 5 ORDER BY department_name;
Solution using scalar subqueries:
SELECT /* Scalar subqueries */ d.department_name, ( SELECT SUM(salary) FROM employees WHERE department_id = d.department_id ) AS sum FROM departments d WHERE d.location_id = 1700 AND (SELECT COUNT(*) FROM employees WHERE department_id = d.department_id) > 5 ORDER BY d.department_name;
Unsupported solution using lateral derived tables:
ALTER session SET events '22829 trace name context forever'; SELECT /* Lateral derived tables */ d.department_name, l.sum FROM departments d, LATERAL ( SELECT SUM(e.salary) AS sum, COUNT(*) AS count FROM employees e WHERE e.department_id = d.department_id ) l WHERE d.location_id = 1700 AND l.count > 5 ORDER BY department_name;
Addendum: A reader asked what point I am trying to make and if I am suggesting that we stop using the GROUP BY clause. In a perfect world, we could use any of the above variations and it wouldn’t matter because all of them would have the same execution plan for any particular data distribution and hence would always have the same performance. In an imperfect world, we have to choose what works best in our case whether that be GROUP BY or something else. Read Day 4: The Twelve Days of SQL: The way you write your query matters.