Home > DBA, NoCOUG, NoSQL, Oracle, SQL > Relational joins are expensive by definition (NOT), therefore abandon the relational model (NOT)

Relational joins are expensive by definition (NOT), therefore abandon the relational model (NOT)


I have on good authority from a NoSQL aficionado that “the only requirement from NoSQL is to be non-relational, and therefore without joins.” Obviously if relational joins are computationally expensive and NoSQL is a viable alternative then NoSQL wins. But are relational joins expensive by definition?

Suppose that we have the following tables; the example is from the 1970 paper A Relational Model of Data for Large Shared Data Banks by the founder of the relational movement Dr. Edgar (Ted) Codd which was reprinted in the 100th issue of the NoCOUG Journal.

SQL> describe employees;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 EMPLOYEE#                                 NOT NULL NUMBER(38)
 NAME                                               VARCHAR2(16)
 BIRTH_DATE                                         DATE

SQL> describe children;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 EMPLOYEE#                                 NOT NULL NUMBER(38)
 CHILD_NAME                                NOT NULL VARCHAR2(16)
 BIRTH_DATE                                         DATE

SQL> describe job_history;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 EMPLOYEE#                                 NOT NULL NUMBER(38)
 JOB_DATE                                  NOT NULL DATE
 TITLE                                              VARCHAR2(16)

SQL> describe salary_history
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 EMPLOYEE#                                 NOT NULL NUMBER(38)
 JOB_DATE                                  NOT NULL DATE
 SALARY_DATE                               NOT NULL DATE
 SALARY                                             NUMBER

Suppose that we have the following data. Employee 1 (Ignatius) has two children (Iniga and Inigo) and has held two jobs (Programmer and Database Admin). The salary history for each job is provided.

SQL> SELECT * FROM employees;

 EMPLOYEE# NAME             BIRTH_DAT
---------- ---------------- ---------
         1 IGNATIUS         01-JAN-70

SQL> SELECT * FROM children;

 EMPLOYEE# CHILD_NAME       BIRTH_DAT
---------- ---------------- ---------
         1 INIGA            01-JAN-01
         1 INIGO            01-JAN-01

SQL> SELECT * FROM job_history;

 EMPLOYEE# JOB_DATE  TITLE
---------- --------- ----------------
         1 01-JAN-91 PROGRAMMER
         1 01-JAN-92 DATABASE ADMIN

SQL> SELECT * FROM salary_history;

 EMPLOYEE# JOB_DATE  SALARY_DA     SALARY
---------- --------- --------- ----------
         1 01-JAN-91 01-FEB-91       1000
         1 01-JAN-91 01-MAR-91       1000
         1 01-JAN-92 01-FEB-92       2000
         1 01-JAN-92 01-MAR-92       2000

Suppose that we need to create a “shopping cart” with all the information about employee 1. Here is an XML version of the shopping cart.

<?xml version="1.0" encoding="US-ASCII"?>
<employee>
  <name>IGNATIUS</name>
  <birth_date>1970-01-01</birth_date>
  <children>
    <child>
      <child_name>INIGA</child_name>
      <birth_date>2001-01-01</birth_date>
    </child>
    <child>
      <child_name>INIGO</child_name>
      <birth_date>2001-01-01</birth_date>
    </child>
  </children>
  <job_history>
    <job>
      <job_date>1991-01-01</job_date>
      <title>PROGRAMMER</title>
      <salary_history>
        <salary>
          <salary_date>1991-02-01</salary_date>
          <salary>1000</salary>
        </salary>
        <salary>
          <salary_date>1991-03-01</salary_date>
          <salary>1000</salary>
        </salary>
      </salary_history>
    </job>
    <job>
      <job_date>1992-01-01</job_date>
      <title>DATABASE ADMIN</title>
      <salary_history>
        <salary>
          <salary_date>1992-02-01</salary_date>
          <salary>2000</salary>
        </salary>
        <salary>
          <salary_date>1992-03-01</salary_date>
          <salary>2000</salary>
        </salary>
      </salary_history>
    </job>
  </job_history>
</employee>

The above XML document was produced by the following query. Tim Hall has an excellent tutorial on the SQL/XML functions used in the query.

SELECT
  XMLROOT (
    XMLELEMENT ("employee",
      XMLFOREST (
        e.name AS "name",
        e.birth_date AS "birth_date",
        (
          SELECT
            XMLAGG (
              XMLELEMENT ("child",
                XMLFOREST (
                  c.child_name AS "child_name",
                  c.birth_date AS "birth_date"
                )
              )
            )
          FROM children c
          WHERE c.employee# = e.employee#
        ) AS "children",
        (
          SELECT
            XMLAGG (
              XMLELEMENT ("job",
                XMLFOREST (
                  j.job_date AS "job_date",
                  j.title AS "title",
                  (
                    SELECT
                      XMLAGG (
                        XMLELEMENT ("salary",
                          XMLFOREST (
                            s.salary_date AS "salary_date",
                            s.salary AS "salary"
                          )
                        )
                      )
                    FROM salary_history s
                    WHERE s.employee# = j.employee#
                    AND s.job_date = j.job_date
                  ) AS "salary_history"
                )
              )
            )
          FROM job_history j
          WHERE j.employee# = e.employee#
        ) AS "job_history"
      )
    )
  ) AS details
FROM employees e
WHERE employee# = 1;

Eight separate physical storage containers are [obviously] required in the above example—one for each of the four tables and one for each of the four indexes that are [obviously] necessary for efficient retrieval of a single employee’s information. Therefore, a minimum of eight data blocks will have to be retrieved in order to construct the shopping cart—many more if the tables contain real volumes of data. If it takes 5 milliseconds on average to retrieve a block from disk, the above query may need a minimum of 40 milliseconds which may be unacceptable in some demanding scenarios. The natural temptation therefore is to abandon the relational model as Amazon did in the case of Dynamo, the NoSQL product that started it all.

However, the conclusion that eight storage containers are needed for our example is a misinterpretation of Dr. Codd’s vision. Dr. Codd did not dictate how data should be stored.

  • He did not dictate that each stored table occupy one physical file.
  • He did not dictate that data be stored in row-major order.
  • He did not dictate that stored tables have only one storage representation each.
  • He did not dictate that data be stored in normalized form only.
  • He did not dictate that a single data block only contain data from a single table.
  • He did not dictate that data not be stored in a compact form such as a shopping cart.

In other words, shopping carts would be an acceptable storage representation for stored tables. It was not necessary for Amazon to abandon the relational model when creating Dynamo. Amazon mistook implementation deficiencies (deficiencies of existing implementations) for intrinsic deficiencies and threw out the baby along with the bathwater.

P.S. Oracle Database has always allowed us to store data from multiple tables in a form that closely resembles a shopping cart. Each block in a multi-table cluster can store data from multiple tables. Every single Oracle Database on the planet contains multi-table clusters because they are used internally by Oracle Database to turbocharge the performance of the data dictionary. On page 379 of Effective Oracle by Design, Tom Kyte quotes Steve Adams as saying: “If a schema has no IOTs or clusters, that is a good indication that no thought has been given to the matter of optimizing data access.” Let me show how multi-table clusters can be used to reduce the number of blocks needed to construct a shopping cart to just one by using a hash cluster; all data for employee 1 will be stored in a single data block and no indexes will be necessary.

CREATE CLUSTER employees (employee# INTEGER) hashkeys 1000;

CREATE TABLE employees (employee# INTEGER NOT NULL, name VARCHAR2(16), birth_date DATE)
  CLUSTER employees (employee#);

CREATE TABLE job_history (employee# INTEGER NOT NULL, job_date DATE NOT NULL, title VARCHAR2(16))
  CLUSTER employees (employee#);

CREATE TABLE salary_history (employee# INTEGER NOT NULL, job_date DATE NOT NULL, salary_date DATE NOT NULL, salary NUMBER)
  CLUSTER employees (employee#);

CREATE TABLE children (employee# INTEGER NOT NULL, child_name VARCHAR2(16) NOT NULL, birth_date DATE)
  CLUSTER employees (employee#);

INSERT INTO employees VALUES (1, 'IGNATIUS', '01-JAN-1970');

INSERT INTO children VALUES (1, 'INIGA', '01-JAN-2001');
INSERT INTO children VALUES (1, 'INIGO', '01-JAN-2001');

INSERT INTO job_history VALUES (1, '01-JAN-1991', 'PROGRAMMER');
INSERT INTO job_history VALUES (1, '01-JAN-1992', 'DATABASE ADMIN');

INSERT INTO salary_history VALUES (1, '01-JAN-1991', '1-FEB-1991', 1000);
INSERT INTO salary_history VALUES (1, '01-JAN-1991', '1-MAR-1991', 1000);
INSERT INTO salary_history VALUES (1, '01-JAN-1992', '1-FEB-1992', 2000);
INSERT INTO salary_history VALUES (1, '01-JAN-1992', '1-MAR-1992', 2000);

Let’s confirm that all the data for employee 1 is stored in the same block.

SQL> SELECT DISTINCT DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) AS block_number FROM employees where employee# = 1;

BLOCK_NUMBER
------------
       29025

SQL> SELECT DISTINCT DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) AS block_number FROM children where employee# = 1;

BLOCK_NUMBER
------------
       29025

SQL> SELECT DISTINCT DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) AS block_number FROM job_history where employee# = 1;

BLOCK_NUMBER
------------
       29025

SQL> SELECT DISTINCT DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) AS block_number FROM salary_history where employee# = 1;

BLOCK_NUMBER
------------
       29025

Therefore, no matter how many queries are needed or how complex they are, only one block of data needs to be retrieved into memory in order to answer any and all questions about employee 1. No indexes are necessary if we are retrieving information about only one employee because we are using hash clusters. However we are not prevented from creating indexes as usual so it would still be possible to efficiently answer questions involving more than one employee such as: “Find all employees who have held the post of database administrator.” In other words, we get to eat our cake and have it too.

P.S. When I interviewed Baron Schwartz, chief performance architect of Percona, for the 101th issue of the NoCOUG Journal, I asked him “What do you think of the NoSQL movement? Is it a threat to MySQL?” Here was his reply:

I think we have all learned a lot from it in the last few years, but I think that some of the strongest proponents have actually misunderstood what the real problems are. There was a lot of discussion that went something like “SQL is not scalable,” when in fact that’s not true at all. Current implementations of SQL databases indeed have some scalability limitations. That is an implementation problem, not a problem with the model. The baby got thrown out with the bathwater. I believe that some of the emerging products, such as Clustrix, are going to open a lot of people’s eyes to what is possible with relational technology. That said, many of the approaches taken by nonrelational databases are worth careful consideration, too. In the end, though, I don’t think that we need 50 or 100 alternative ways to access a database. We might need more than one, but I would like some of the non-SQL databases to standardize a bit.

Click here to read the full interview with Baron.

Some of the emerging products, such as Clustrix, are going to open a lot of people’s eyes to what is possible with relational technology.

Why SQL Loses and NoSQL Wins

Multidimensional Clustering (MDC) in Oracle Database: Can Exadata Beat This?

Major New Undocumented Partitioning Feature in Oracle Database 11g Release 2

Day 9: The Twelve Days of SQL: Physical database design matters

Lesson 2 of 40: Physical Database Design for Oracle Databases: Non-Necessity?

Categories: DBA, NoCOUG, NoSQL, Oracle, SQL
  1. April 8, 2012 at 8:56 am

    Nice Demonstration.. Thanks for sharing this.

    Mostly people use heap and consider as just a repository of storing data.

  2. Iggy Fernandez
    April 8, 2012 at 12:17 pm

    Taral :

    Mostly people use heap and consider as just a repository of storing data.

    Tom Kyte has a simple philosophy: “You can treat Oracle as a black box and shove data inside of it, or you can learn how Oracle works and exploit it as a powerful computing environment.”

  3. April 19, 2012 at 1:21 pm
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: