Home > DBA, Oracle, SQL > Multidimensional Clustering (MDC) in Oracle Database

Multidimensional Clustering (MDC) in Oracle Database


Contrary to popular belief, Oracle Database does provide Multidimensional Clustering (MDC) and has done so for decades. Consider the following “fact” table with three independent dimensions—year, region, and color—and some straightforward aggregations.

CREATE TABLE mdc_tab
(
  pk integer PRIMARY KEY,
  year integer NOT NULL,
  region integer NOT NULL,
  color integer NOT NULL,
  sales number NOT NULL
);

SELECT sum(sales) FROM mdc_tab WHERE year=10 and region=10 and color=10;
SELECT sum(sales) FROM mdc_tab WHERE year=10 and region=10;
SELECT sum(sales) FROM mdc_tab WHERE color=10;

The goal of MDC is to tightly cluster all rows with the same values of the dimension columns. Suppose that each of the dimensions in the above example can have ten different values. Imagine a 10x10x10 cube made up of 1x1x1 cells. All the rows in a cell should have the same values of all three dimension columns. All ten cells in any row or column of cells should have the same values for exactly two of the dimension columns. All one hundred cells in any horizontal or vertical slice of cells should have the same values of just one of the dimension columns.

Picture of an almost solved Rubik's Cube by Wikipedia Contributor Kleiner

Picture of an almost solved Rubik’s Cube by Wikipedia Contributor Kleiner

A cell is a collection of database blocks tracked by a “block index.” (A “block index” is also called a “sparse index” because it does not contain any entry for each row in the table; instead it contains one row for every block and is therefore much smaller than a regular index.) When a query visits a block it will only find rows that satisfy the query criteria. Thus, the query only has to visit the minimum possible number of blocks.

Here’s a demonstration of MDC in action. First we create a table cluster. (One use of a table cluster is to co-locate rows from multiple tables that have the same value of the cluster key; Oracle uses such clusters in the data dictionary and in its TPC-C benchmarks. However, a table cluster can also provide the multidimensional clustering we are looking for.) We then add a table to the cluster and insert 500,000 rows into it. Each dimension can have ten possible values.

-- Create a multidimensional cluster
CREATE CLUSTER mdc
(
  year integer,
  region integer,
  color integer
);

-- create the required "block index"
CREATE INDEX mdc_idx ON CLUSTER mdc;

-- Add the fact table to the cluster
CREATE TABLE mdc_tab
(
  pk integer PRIMARY KEY,
  year integer NOT NULL,
  region integer NOT NULL,
  color integer NOT NULL,
  sales number(8,2) NOT NULL
)
CLUSTER MDC (year, region, color);

-- Insert 500,000 rows into the fact table
CREATE TABLE temp AS
SELECT
  level AS pk,
  dbms_random.value*10 AS year,
  dbms_random.value*10 AS region,
  dbms_random.value*10 AS color,
  dbms_random.value*10000 AS sales
FROM dual
CONNECT BY level <=500000;

INSERT INTO mdc_tab(pk, year, region, color, sales)
SELECT * FROM temp;
COMMIT;

-- Collect optimizer statistics
EXEC dbms_stats.gather_table_stats('IGGY', 'MDC_TAB');

The following query plans show that Oracle efficiently uses the block index to visit the minimum number of blocks. If all three dimensions are specified, Oracle finds all the 68 rows it needs in just one block of the table. I wonder if you can beat that with a couple of million dollars worth of hardware from Oracle (Exadata) or Netezza or Teradata or ParAccel🙂

 SQL> SELECT sum(sales) FROM mdc_tab WHERE year=10 and region=10 and color=10;

SUM(SALES)
----------
 345738.12

Plan hash value: 2635193114

---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name    | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |      1 |        |       |     2 (100)|          |      1 |00:00:00.01 |       3 |
|   1 |  SORT AGGREGATE       |         |      1 |      1 |    14 |            |          |      1 |00:00:00.01 |       3 |
|   2 |   TABLE ACCESS CLUSTER| MDC_TAB |      1 |    376 |  5264 |     2   (0)| 00:00:01 |     68 |00:00:00.01 |       3 |
|*  3 |    INDEX UNIQUE SCAN  | MDC_IDX |      1 |      1 |       |     1   (0)| 00:00:01 |      1 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("YEAR"=10 AND "REGION"=10 AND "COLOR"=10)

SQL> SELECT sum(sales) FROM mdc_tab WHERE region=10 and color=10;

SUM(SALES)
----------
6737404.59

Plan hash value: 1321893723

---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name    | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |      1 |        |       |    13 (100)|          |      1 |00:00:00.01 |      17 |
|   1 |  SORT AGGREGATE       |         |      1 |      1 |    11 |            |          |      1 |00:00:00.01 |      17 |
|   2 |   TABLE ACCESS CLUSTER| MDC_TAB |      1 |   4132 | 45452 |    13   (0)| 00:00:01 |   1290 |00:00:00.01 |      17 |
|*  3 |    INDEX SKIP SCAN    | MDC_IDX |      1 |   4132 |       |     6   (0)| 00:00:01 |     11 |00:00:00.01 |       6 |
---------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("REGION"=10 AND "COLOR"=10)
       filter(("REGION"=10 AND "COLOR"=10))

SQL> SELECT sum(sales) FROM mdc_tab WHERE color=10;

SUM(SALES)
----------
 126282401

Plan hash value: 1321893723

---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name    | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |      1 |        |       |    83 (100)|          |      1 |00:00:00.11 |     127 |
|   1 |  SORT AGGREGATE       |         |      1 |      1 |     8 |            |          |      1 |00:00:00.11 |     127 |
|   2 |   TABLE ACCESS CLUSTER| MDC_TAB |      1 |  45455 |   355K|    83   (0)| 00:00:01 |  25224 |00:00:00.06 |     127 |
|*  3 |    INDEX SKIP SCAN    | MDC_IDX |      1 |  45455 |       |    10   (0)| 00:00:01 |    121 |00:00:00.01 |       6 |
---------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("COLOR"=10)
       filter("COLOR"=10)

Now let’s load the same data into a traditional heap table and create a traditional B-tree index.

-- Create a traditional heap table
CREATE TABLE heap
(
  pk integer PRIMARY KEY,
  year integer NOT NULL,
  region integer NOT NULL,
  color integer NOT NULL,
  sales number NOT NULL
);

-- Insert 500,000 rows into the heap table
INSERT INTO heap(pk, year, region, color, sales)
SELECT * FROM temp;
COMMIT;

-- Create a traditional B-tree index on the heap table
CREATE INDEX btree ON HEAP(year, region, color);

-- Collect optimizer statistics
EXEC dbms_stats.gather_table_stats('IGGY', 'HEAP');

Oracle now has to do a lot more work. When all three dimensions are specified, Oracle now has to visit 67 blocks in order to retrieve the same 68 rows as in the clustered case.

SQL> SELECT sum(sales) FROM heap WHERE year=10 and region=10 and color=10;

SUM(SALES)
----------
345738.126

Plan hash value: 1886783043

-----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers | Reads  |
-----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |      1 |        |       |   353 (100)|          |      1 |00:00:00.36 |      70 |     45 |
|   1 |  SORT AGGREGATE              |       |      1 |      1 |    30 |            |          |      1 |00:00:00.36 |      70 |     45 |
|   2 |   TABLE ACCESS BY INDEX ROWID| HEAP  |      1 |    376 | 11280 |   353   (0)| 00:00:05 |     68 |00:00:00.36 |      70 |     45 |
|*  3 |    INDEX RANGE SCAN          | BTREE |      1 |    376 |       |     4   (0)| 00:00:01 |     68 |00:00:00.02 |       3 |      1 |
-----------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("YEAR"=10 AND "REGION"=10 AND "COLOR"=10)

SQL> SELECT sum(sales) FROM heap WHERE region=10 and color=10;

SUM(SALES)
----------
6737404.83

Plan hash value: 608968902

------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |       |   758 (100)|          |      1 |00:00:01.08 |    2767 |   2764 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |    27 |            |          |      1 |00:00:01.08 |    2767 |   2764 |
|*  2 |   TABLE ACCESS FULL| HEAP |      1 |   4132 |   108K|   758   (2)| 00:00:10 |   1290 |00:00:01.08 |    2767 |   2764 |
------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("REGION"=10 AND "COLOR"=10))

SQL> SELECT sum(sales) FROM heap WHERE color=10;

SUM(SALES)
----------
 126282401

Plan hash value: 608968902

------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |       |   757 (100)|          |      1 |00:00:01.61 |    2767 |   2764 |
|   1 |  SORT AGGREGATE    |      |      1 |      1 |    24 |            |          |      1 |00:00:01.61 |    2767 |   2764 |
|*  2 |   TABLE ACCESS FULL| HEAP |      1 |  45455 |  1065K|   757   (1)| 00:00:10 |  25224 |00:00:01.54 |    2767 |   2764 |
------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("COLOR"=10)
Categories: DBA, Oracle, SQL
  1. Bhanu
    April 21, 2011 at 6:49 pm

    Thanks, nice article.

  2. Edward Hayrabedian
    January 11, 2013 at 6:35 am

    At first sight, the idea of storing together the related fact table rows looks ok. Unfortunatelly, the DWH star queries have to join a number of dimension tables with the fact table and filter by some of their attributes. Also, the monster dimensions would make the index access path heavy (while searching the cluster index).

    • Iggy Fernandez
      August 28, 2013 at 8:21 am

      Edward Hayrabedian :

      At first sight, the idea of storing together the related fact table rows looks ok. Unfortunatelly, the DWH star queries have to join a number of dimension tables with the fact table and filter by some of their attributes. Also, the monster dimensions would make the index access path heavy (while searching the cluster index).

      I think that clusters would be perfect for large fact tables. A cluster index is a block index not a regular index and, hence, its size is only a small fraction of the size of a traditional index that needs to store the values of the indexed columns and ROWIDs. Alternatively, you could dispense with an index by using indexed clusters instead of hashed clusters. For even more performance, you can use bitmap indexes and bitmap join indexes in conjunction with clusters and take advantage of the star transformation. Partitioned clusters have already been tried out by Oracle in TPC-C benchmarks (https://iggyfernandez.wordpress.com/2011/05/10/major-new-undocumented-partitioning-feature-in-oracle-database-11g-release-2/) so I have no doubt that they will eventually find their way into the main product.

      • Edward Hayrabedian
        August 29, 2013 at 10:07 am

        Iggy, the issues I have mentioned are not easily visualized in such a test case.

        1) The first issue is related to the idea that the fact table stores the surrogate dimension keys, so we must join the fact table with all participating (in the star query) dimension tables. It is just not possible to access only the clustered fact table. The dimensions’ attributes are located in a phisically different place, i.e. in the dimension tables. So, the very convinient query agains a clustered table with where clause refering a cluster key columns, will be replaced by a join query, which will have filtering columns from a dimension tables, not the cluster ones.

        2) Yes, the cluster index access will be faster than the b*tree access (due to jump from row level to block level), but it is still an index access, which will result in a single block i/o, which will be slower than multiblock i/o (for big resultsets).
        This architecture will depend on the same kind of limitation (the index access path). Although, it will be faster it will still depend on the extent of which the query uses the index.

        3) The same limitation is valid for a bitmap index access. Yes, I can agree that the “star transformation” is a smart move, but at the end, the effect of this feature depends on the volume of the final resultset. I.e. lets assume that because of the smart bitmap and/or operations we decrease elapsed time of the 1-st phase of the star transformation to zero .. but, the second phase, must take the rowid-s and go through a single block access in order to take the rest of the data. i.e. the respose time will greatly depend on the volume of the final resultset…

        I would be really happy to understand in details how the clustered bitmap index fit into this paradigma.. anyway, thanks for your time and energy into this topic !!

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: