Home > Oracle, SQL > Comparison Between Window Functions and the GROUP BY clause

Comparison Between Window Functions and the GROUP BY clause


“Window functions” were introduced by Oracle in Oracle Database 8i and incorporated into the SQL standard beginning with SQL:2003. According to the SQL standard, “a window function is a function whose result for a given row is derived from the window frame of that row.” In this short posting, I discuss their similarity with the old-school GROUP BY clause. I won’t discuss all the capabilities and features of window functions; more information can be found in Chapter 21 (SQL for Analysis and Reporting) of the Oracle Database 11g Release 2 Data Warehousing Guide.

Let’s suppose that we need a departmentwise summary of employee salaries. We can use the old-school GROUP BY clause as follows:

SELECT
  department_id,
  sum(salary) AS total_salary
FROM employees
GROUP BY department_id

DEPARTMENT_ID TOTAL_SALARY
------------- ------------
          100        51608
           30        24900
                      7000
           20        19000
           70        10000
           90        58000
          110        20308
           50       156400
           40         6500
           80       304500
           10         4400
           60        28800

12 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1192169904

--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |    11 |    77 |     4  (25)| 00:00:01 |
|   1 |  HASH GROUP BY     |           |    11 |    77 |     4  (25)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| EMPLOYEES |   107 |   749 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          7  consistent gets
          0  physical reads
          0  redo size
        784  bytes sent via SQL*Net to client
        523  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         12  rows processed

Next, let’s try to produce the same result using window functions. Once again, a window function is “a function whose result for a given row is derived from the window frame of that row.” The PARTITION BY clause is used to define the “window frame” for a row.

SELECT
  department_id,
  sum(salary) OVER (PARTITION BY department_id) AS total_salary
FROM employees

DEPARTMENT_ID TOTAL_SALARY
------------- ------------
           10         4400
           20        19000
           20        19000
           30        24900
           30        24900
           30        24900
           30        24900
           30        24900
           30        24900
           40         6500
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           50       156400
           60        28800
           60        28800
           60        28800
           60        28800
           60        28800
           70        10000
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           80       304500
           90        58000
           90        58000
           90        58000
          100        51608
          100        51608
          100        51608
          100        51608
          100        51608
          100        51608
          110        20308
          110        20308
                      7000

107 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1919783947

--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |   107 |   749 |     4  (25)| 00:00:01 |
|   1 |  WINDOW SORT       |           |   107 |   749 |     4  (25)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| EMPLOYEES |   107 |   749 |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          7  consistent gets
          0  physical reads
          0  redo size
       2826  bytes sent via SQL*Net to client
        600  bytes received via SQL*Net from client
          9  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
        107  rows processed

These are not quite the results we expected. We obtained as many rows as there are in the Employees table because—unlike the old-school GROUP BY clause—window functions have no grouping effect. To produce the desired summary, we need to throw in the DISTINCT clause because SQL does not eliminate duplicates by default; this is one of the criticisms of SQL by relational purists. To be precise, SELECT is a shorthand notation for the “bag operator” SELECT ALL whereas SELECT DISTINCT is a true relational operator. Also, note that the query execution plan below is not the same as in the GROUP BY case. Window functions require a WINDOW SORT operation whereas GROUP BY queries use the efficient HASH GROUP BY operation.

SELECT DISTINCT
  department_id,
  sum(salary) OVER (PARTITION BY department_id) AS total_salary
FROM employees

DEPARTMENT_ID TOTAL_SALARY
------------- ------------
           30        24900
           60        28800
           90        58000
           10         4400
          100        51608
           20        19000
           50       156400
           40         6500
           80       304500
           70        10000
          110        20308
                      7000

12 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2449754683

---------------------------------------------------------------------------------
| Id  | Operation           | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |           |   107 |   749 |     5  (40)| 00:00:01 |
|   1 |  HASH UNIQUE        |           |   107 |   749 |     5  (40)| 00:00:01 |
|   2 |   WINDOW SORT       |           |   107 |   749 |     5  (40)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| EMPLOYEES |   107 |   749 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          7  consistent gets
          0  physical reads
          0  redo size
        784  bytes sent via SQL*Net to client
        523  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
         12  rows processed

Old-school SQL also included the HAVING clause to limit the number of groups in the summary. For example, let’s suppose that we are only interested in departments whose total salary expense is greater than $100,000. We could do this with the old-style HAVING clause as follows:

SELECT
  department_id,
  sum(salary) AS total_salary
FROM employees
GROUP BY department_id
HAVING sum(salary) > 100000

DEPARTMENT_ID TOTAL_SALARY
------------- ------------
           50       156400
           80       304500

Execution Plan
----------------------------------------------------------
Plan hash value: 244580604

---------------------------------------------------------------------------------
| Id  | Operation           | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |           |     7 |    49 |     4  (25)| 00:00:01 |
|*  1 |  FILTER             |           |       |       |            |          |
|   2 |   HASH GROUP BY     |           |     7 |    49 |     4  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| EMPLOYEES |   107 |   749 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(SUM("SALARY")>100000)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          7  consistent gets
          0  physical reads
          0  redo size
        667  bytes sent via SQL*Net to client
        523  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          2  rows processed

Here’s how we could produce the same effect with analytic functions.

SELECT *
FROM
(
  SELECT DISTINCT
    department_id,
    sum(salary) OVER (PARTITION BY department_id) AS total_salary
  FROM employees
)
WHERE total_salary > 100000

DEPARTMENT_ID TOTAL_SALARY
------------- ------------
           50       156400
           80       304500

Execution Plan
----------------------------------------------------------
Plan hash value: 3855082855

----------------------------------------------------------------------------------
| Id  | Operation            | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |           |   107 |  2782 |     5  (40)| 00:00:01 |
|*  1 |  VIEW                |           |   107 |  2782 |     5  (40)| 00:00:01 |
|   2 |   HASH UNIQUE        |           |   107 |   749 |     5  (40)| 00:00:01 |
|   3 |    WINDOW SORT       |           |   107 |   749 |     5  (40)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| EMPLOYEES |   107 |   749 |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("TOTAL_SALARY">100000)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          7  consistent gets
          0  physical reads
          0  redo size
        667  bytes sent via SQL*Net to client
        523  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          2  rows processed

The above examples demonstrate that window functions can produce the same effect as the old-school GROUP BY and HAVING clauses. However, window functions require a WINDOW SORT operation whereas GROUP BY queries use the efficient HASH GROUP BY operation.

P.S. The examples in this posting were tested in the HR sample schema.

Categories: Oracle, SQL
  1. chris stephens
    March 20, 2010 at 7:45 am

    Analytics can provide tremendous performance benefits when used in proper situations but this post just proves that you must understand the performance implications of you syntax choices. Just because ‘analytics rock’ doesn’t mean they are a magic bullet.

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: