Archive for the ‘Physical Database Design’ Category

SQL v/s NoSQL: Amazon v/s eBay and the false premise of NoSQL

October 9, 2013 4 comments

Update: The recording is now available at http://www.confio.com/webinars/nosql-big-data/. The slide deck is at https://iggyfernandez.files.wordpress.com/2013/10/nosql-and-big-data-for-oracle-dbas-oct-2013.pdf.

In my webinar for Confio on October 10, I will explain that the deficiencies of relational technology are actually a result of deliberate choices made by the relational movement in its early years. The relational camp needs to revisit these choices if it wants to compete with NoSQL and Big Data technologies in the areas of performance, scalability, and availability. Previous versions of this presentation have been delivered at Great Lakes Oracle Conference, NoCOUG, and OakTableWorld. New material is constantly being added to the presentation based on attendee feedback and additional research. Attendees seem most interested in learning that eBay had the same performance, scalability, and availability requirements as Amazon but stuck with Oracle and SQL.

Register at http://marketo.confio.com/NoSQLBigDataforOracle_RegPage.html.

  1. The origins of NoSQL
    • Amazon’s requirements
    • Amazon’s solution
    • Amazon v/s eBay
  2. The false premise of NoSQL
    • Zeroth normal form
    • The problem with flat tables
    • Eliminating the join penalty
    • Dr. Codd on eventual consistency
  3. The NoSQL landscape
    • Key-value databases
    • Document databases
    • Column-family databases
    • Graph databases
    • Map/Reduce
  4. What makes relational technology so sacred anyway?
  5. Mistakes of the relational camp
    • De-emphasizing physical database design
    • Discarding nested relations
    • Favoring relational calculus over relational algebra
    • Equating the normalized set with the stored set
    • Marrying relational theory to ACID DBMS
    • Ignoring SQL-92 CREATE ASSERTION
  6. Going one better than Hadoop
    • Preprocessor feature of external tables
    • Partition pruning in partition views
    • Parallel UNION ALL in Oracle Database 12c
  7. Bonus slides
    • NewSQL
    • NoSQL Buyer’s Guide

The Mistakes of the Relational Camp: Mistake #1: The de-emphasis of physical database design

It is unlikely that application developers will develop highly performant and scalable applications if they don’t have access to performance metrics. EM Express in Oracle Database 12c has taken a step in the right direction by making it possible to give developers access to some performance metrics. From the interview with ACE Director Kyle Hailey in Oracle Magazine:

Oracle Database 12c introduces an administration console called Oracle Enterprise Manager Database Express, a “light” version of Oracle Enterprise Manager that developers can access through a browser. “Now DBAs can give the developers read-only access to a simplified database management console so they can see the impact of their code,” says Hailey.

This is a big change from the traditional setup, where the effect of code was seen only in the DBA’s world. “The DBA has this privileged access to see what’s happening with the database,” Hailey says. “He can see load go up and see who caused it, but the poor developer writes some code, runs some code, and maybe sees some text on a screen—but there’s no visual impact. The developer doesn’t know what’s going on in the database, and that’s not fair. The DBA comes and complains that the developers are making a mess, and the developer says, ‘How am I supposed to know?’ With [Oracle Enterprise Manager Database Express], developers will be able to see the effect of their code and, if there’s a problem, shut it down before the DBA comes calling.”

Instructions for giving EM Express access to non-administrative users are at http://docs.oracle.com/cd/E16655_01/server.121/e17643/em_manage.htm#BABHCDGA.
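
As a rough sketch of what that involves (dev_user is a placeholder account, and EM_EXPRESS_BASIC and EM_EXPRESS_ALL are the roles documented for Oracle Database 12c; check the linked page for your release):

-- Hedged sketch: read-only EM Express access for a hypothetical developer account
GRANT CREATE SESSION TO dev_user;
GRANT em_express_basic TO dev_user;

-- EM_EXPRESS_ALL would grant full administrative use of the console instead
-- GRANT em_express_all TO dev_user;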


The Mistakes of the Relational Camp: Mistake #1: The de-emphasis of physical database design

September 9, 2013 3 comments

See also: No! to SQL and No! to NoSQL

The inventor of the relational model, Dr. Edgar “Ted” Codd, believed that the suppression of physical database design details was the chief advantage of the relational model. He made the case in the very first sentence of the very first paper on the relational model, saying, “Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).” (“A Relational Model of Data for Large Shared Data Banks,” reprinted with permission in the 100th issue of the NoCOUG Journal.)

How likely is it that application developers will develop highly performant and scalable applications if they are shielded from the internal representation of data? The de-emphasis of physical database design was the biggest mistake of the relational camp and provided the opening for NoSQL and Big Data technologies to proliferate.

A case in point is that the language SQL, which is universally used by application developers, was not created with them in mind. As explained by the creators of SQL (originally called SEQUEL) in their 1974 paper, there is “a large class of users who, while they are not computer specialists, would be willing to learn to interact with a computer in a reasonably high-level, non-procedural query language. Examples of such users are accountants, engineers, architects, and urban planners [emphasis added]. It is for this class of users that SEQUEL is intended. For this reason, SEQUEL emphasizes simple data structures and operations [emphasis added].” (http://faculty.cs.tamu.edu/yurttas/PL/DBL/docs/sequel-1974.pdf)

If you were the manager of a bookstore, how would you stock the shelves? Would you stand at the door and fling books onto any shelf that had some free space, perhaps recording their locations in a notebook for future reference? Of course not! And would you scatter related books all over the bookstore? Of course not! Then why do we store rows of data in random fashion? The default Oracle table storage structure is the unorganized heap, and it is chosen 99.9% of the time.
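
To make the contrast concrete, here is a hedged sketch; the bookstore-flavored table and column names are invented. The first statement accepts the default heap organization, while the second asks Oracle to keep the rows physically ordered by primary key in an index-organized table.

-- Default: an unorganized heap; rows go wherever free space is found
CREATE TABLE book_heap
  (
    isbn    VARCHAR2(13) NOT NULL,
    title   VARCHAR2(80),
    subject VARCHAR2(30),
    CONSTRAINT book_heap_pk PRIMARY KEY (isbn)
  );

-- Alternative: an index-organized table; rows are stored in primary-key order
CREATE TABLE book_iot
  (
    isbn    VARCHAR2(13) NOT NULL,
    title   VARCHAR2(80),
    subject VARCHAR2(30),
    CONSTRAINT book_iot_pk PRIMARY KEY (isbn)
  )
  ORGANIZATION INDEX;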

The de-emphasis of physical database design was an epic failure in the long run. Esther Dyson referred to the “join penalty” when she complained that “Using tables to store objects is like driving your car home and then disassembling it to put it in the garage. It can be assembled again in the morning, but one eventually asks whether this is the most efficient way to park a car.” [1]

It doesn’t have to be that way. Oracle Database has always provided a way to cluster rows of data from one or more tables using single-table or multi-table clusters in hashed or indexed flavors and thus to completely avoid the join penalty that Esther Dyson complained about. However, they must be the longest and best kept secret of Oracle Database—as suggested by their near-zero adoption rate—and have not been emulated by any other DBMS vendor. You can read more about them at https://iggyfernandez.wordpress.com/2013/07/28/no-to-sql-and-no-to-nosql/.

It doesn’t have to be that way. But it is.

1. Esther Dyson was the editor of a newsletter called Release 1.0. I’ve been unable to find the statement in the Release 1.0 archives at http://www.sbw.org/release1.0/ so I don’t really know the true source or author of the statement. However, the statement is popularly attributed to Esther Dyson and claimed to have been published in the Release 1.0 newsletter. I found a claim that the statement is found in the September 1988 issue but that didn’t pan out.

See also: No! to SQL and No! to NoSQL

No! to SQL and No! to NoSQL

July 28, 2013 5 comments

See also: What’s so sacred about relational anyway?

The opinions of relational theoreticians C. J. Date and Hugh Darwen about SQL are well known, but they have not previously commented on the NoSQL phenomenon, and so their interview in the latest issue of the NoCOUG Journal—the official publication of the Northern California Oracle Users Group—makes for interesting reading. In my opinion, their comments are right on the mark even though Date admits that he knows almost nothing about NoSQL products.

In discussing Amazon Dynamo—the forerunner of the NoSQL movement—and those that came after it, Date and Darwen make these astute observations:

“Developers tend to be more concerned with convenience in database definition and updating than with the ease of deriving useful and reliable information from the database.”—Darwen

 “If there’s a suggestion that Amazon’s various disaster scenarios, regarding tornados and the rest, are somehow more of a problem for relational systems than they are for nonrelational ones, then of course I reject that suggestion 100 percent.”—Date

“Those who disparage relational are almost invariably very far from being properly informed and almost invariably equate ‘relational’ with [current implementations].”—Darwen

Dynamo Assumptions and Requirements

Date and Darwen’s remarks are spot on. Here is the Dynamo use case from the 2007 ACM paper by Amazon: “Customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados. Therefore, the service responsible for managing shopping carts requires that it can always write to and read from its data store, and that its data needs to be available across multiple data centers. … There are many services on Amazon’s platform that only need primary-key access to a data store. For many services, such as those that provide best seller lists, shopping carts, customer preferences, session management, sales rank, and product catalog, the common pattern of using a relational database would lead to inefficiencies and limit scale and availability. Dynamo provides a simple primary-key only interface to meet the requirements of these applications. … Experience at Amazon has shown that data stores that provide ACID guarantees tend to have poor availability. … Dynamo targets applications that operate with weaker consistency (the “C” in ACID) if this results in high availability.”

To paraphrase, Amazon’s goals were extreme performance, extreme scalability, and extreme availability, and it concluded that the only way to achieve its ends was to discard the relational model.

Compartmentation

Amazon started off on the right track. Its first innovation was to break up the traditional monolithic enterprise-wide database service into simpler component services such as the best seller list service, the shopping cart service, the customer preferences service, the sales rank service, and the product catalog service. This avoids a single point of failure.

Amazon’s first DBA, Jeremiah Wilton, explained Amazon’s approach in his answer to the question “Is 24×7 a myth?” in an interview published in the NoCOUG Journal in 2007. He said, “The best availability in the industry comes from application software that is predicated upon a surprising assumption: The databases upon which the software relies will inevitably fail. The better the software’s ability to continue operating in such a situation, the higher the overall service’s availability will be. But isn’t Oracle unbreakable? At the database level, regardless of the measures taken to improve availability, outages will occur from time to time. An outage may be from a required upgrade or a bug. Knowing this, if you engineer application software to handle this eventuality, then a database outage will have less or no impact on end users. In summary, there are many ways to improve a single database’s availability. But the highest availability comes from thoughtful engineering of the entire application architecture.”

As an example, the shopping cart service should not be affected if the checkout service is having hiccups.

There’s no conflict with the relational model there, even if the shopping cart data is not in the same database as the product catalog and occasional inconsistencies result. Occasionally, an item that is present in a shopping cart may go out of stock. Occasionally, an item that is present in a shopping cart may be repriced. The problems can be resolved when the customer decides to check out, if not earlier. As an Amazon customer, I occasionally leave items in my shopping cart but don’t complete a purchase. When I resume shopping, I sometimes get a notification that an item in my shopping cart is no longer in stock or has been repriced.

In any case, current relational implementations do not allow referential integrity constraints to span databases. See, for example, http://docs.oracle.com/cd/E11882_01/server.112/e25494/ds_appdev003.htm.

Extreme Scalability

Focusing next on extreme scalability, Amazon saw that the solution was “sharding” or horizontal partitioning of a hierarchical schema amongst shared-nothing database servers. The simple hierarchical schemas that resulted from compartmentation were very shardable. No conflict with the relational model there.

Extreme Availability

Continuing this line of thought, Amazon saw that the key to extreme availability was data replication. Multiple copies of the shopping cart are allowed to exist and, if one of the replicas becomes unresponsive, the data can be served by one of the other replicas. The technique used by Dynamo has a close parallel in the well-known technique of “multimaster replication.” However, because of network latencies, the copies may occasionally get out of sync and the customer may occasionally encounter a stale version of the shopping cart. Once again, this can be appropriately handled by the application tier; the node that falls behind can catch up eventually or inconsistencies can be detected and resolved at an opportune time such as checkout. This technique is called “eventual consistency.”

The inventor of relational theory, Dr. Edgar “Ted” Codd, himself blessed eventual consistency in the very first paper on relational theory, “A Relational Model of Data for Large Shared Data Banks,” which was reprinted with permission in the 100th issue of the NoCOUG Journal. In that paper, Dr. Codd said:

“There are, of course, several possible ways in which a system can detect inconsistencies and respond to them. In one approach the system checks for possible inconsistency whenever an insertion, deletion, or key update occurs. Naturally, such checking will slow these operations down. [emphasis added] If an inconsistency has been generated, details are logged internally, and if it is not remedied within some reasonable time interval, either the user or someone responsible for the security and integrity of the data is notified. Another approach is to conduct consistency checking as a batch operation once a day or less frequently.”

In other words, the inventor of relational theory did not see a conflict between his relational model and the “eventual consistency” that is one of the hallmarks of the NoSQL products of today. Amazon imagined a conflict because it quite understandably conflated the relational model with the ACID guarantees of database management systems. However, ACID has nothing to do with the relational model per se (though relational theory does come in very handy in defining consistency constraints); pre-relational database management systems such as IMS provided ACID guarantees and so did post-relational object-oriented database management systems.

The tradeoff between consistency and performance is as important in the wired world of today as it was in Codd’s world. We cannot cast stones at Dynamo for the infraction of not guaranteeing the synchronization of replicated data, because violations of the consistency requirement are commonplace in the relational camp. The replication technique used by Dynamo is similar to the well-known technique of “multimaster replication.” Application developers are warned about the negative impact of integrity constraints. [1] [2] [3] [4] And, most importantly, no DBMS that aspires to the relational moniker currently implements the SQL-92 “CREATE ASSERTION” feature that is necessary in order to provide the consistency guarantee. For a detailed analysis of this anomaly, refer to Toon Koppelaars’s article “CREATE ASSERTION: The Impossible Dream” in the August 2013 issue of the NoCOUG Journal.
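
For readers who have never seen the feature, here is a sketch of what a SQL-92 assertion might look like; the business rule is invented, the part and commit tables are borrowed from the database design lesson later in this archive, and no mainstream DBMS, Oracle included, will actually accept the statement.

-- Hypothetical SQL-92 assertion (not accepted by Oracle or other mainstream DBMSs):
-- the total quantity of a part committed to projects must never exceed the quantity on hand
CREATE ASSERTION committed_within_stock
CHECK
(
  NOT EXISTS
    (
      SELECT *
      FROM part p
      WHERE p.quantity_on_hand <
        (SELECT SUM(c.quantity_committed) FROM commit c WHERE c.part# = p.part#)
    )
);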

Extreme Performance

The final hurdle was extreme performance and that’s where Amazon went astray. Amazon believed that the relational model makes data retrieval and updates inefficient. This objection to the relational model is colorfully summarized by the following statement attributed to Esther Dyson, the editor of the Release 1.0 newsletter, “Using tables to store objects is like driving your car home and then disassembling it to put it in the garage. It can be assembled again in the morning, but one eventually asks whether this is the most efficient way to park a car.” The statement dates back to 1988 and was much quoted when object-oriented databases were in vogue. [5]

If the shopping cart is an “object,” then surely disassembling it for storage makes data retrieval and updates inefficient? The belief stems from an unfounded assumption that has found its way into every relational implementation to date—that a “table” should map to physical storage. In reality, the relational model is a logical model; it does not concern itself with storage details at all. It would be perfectly legitimate to store the shopping cart in a physical form that resembled a shopping cart while still offering a relational model of the data complete with SQL. In other words, the physical representation could be optimized for the most important use case—retrieving the entire shopping-cart object using its key—without affecting the relational model of the data. It would also be perfectly legitimate to provide a non-relational API for the important use cases. Dr. Codd himself gave conditional blessing to such non-relational APIs in his 1985 Computerworld article “Is Your DBMS Really Relational?” where he says “If a relational system has a low-level (single-record-at-a-time) language, that low level [should not] be used to subvert or bypass the integrity rules and constraints expressed in the higher level relational language (multiple-records-at-a-time).”

Have I mentioned that Dr. Codd invented relational theory?

Zero-th Normal Form

In fact, the key-value approach used by Dynamo and those that came after it is exactly equivalent to “zero-th” normal form in relational terminology. [6] In his 1970 paper, Codd says: “Nonatomic values can be discussed within the relational framework. Thus, some domains may have relations as elements. These relations may, in turn, be defined on nonsimple domains, and so on. For example, one of the domains on which the relation employee is defined might be salary history. An element of the salary history domain is a binary relation defined on the domain date and the domain salary. The salary history domain is the set of all such binary relations. At any instant of time there are as many instances of the salary history relation in the data bank as there are employees. In contrast, there is only one instance of the employee relation.” In relational terminology, a relation with non-simple domains is said to be in “zero-th” normal form or unnormalized. Codd suggested that unnormalized relations should be normalized for ease of use. Here is the unnormalized employee relation from Codd’s paper.

employee (
  employee#,
  name,
  birthdate,
  jobhistory (jobdate, title, salaryhistory (salarydate, salary)),
  children (childname, birthyear)
)

The above unnormalized relation can be decomposed into four normalized relations as follows.

employee' (employee#, name, birthdate)
jobhistory' (employee#, jobdate, title)
salaryhistory' (employee#, jobdate, salarydate, salary)
children' (employee#, childname, birthyear)

However, this is not to suggest that these normalized relations must necessarily be mapped to individual buckets of physical storage. Codd differentiated between the stored set, the named set, and the expressible set. In the above example, we have five relations but, if we preferred it, the unnormalized employee relation could be the only member of the stored set. Alternatively, if we preferred it, all five relations could be part of the stored set; that is, we could legitimately store redundant representations of the data. However, the common belief blessed by current practice is that the normalized relations should be the only members of the stored set.
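
In current Oracle terms, the closest everyday approximation to a redundant member of the stored set is a materialized view. The following hedged sketch stores a pre-joined employee/job-history representation alongside the normalized tables; the table definitions are the ones created in the appendix at the end of this post, and the view name is made up.

-- Hedged sketch: a redundant, pre-joined representation stored alongside the normalized tables
CREATE MATERIALIZED VIEW employee_job_history_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT e.employee#,
  e.name,
  e.birth_date,
  jh.job_date,
  jh.title
FROM employees e
INNER JOIN job_history jh
ON (jh.employee# = e.employee#);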

Even if the stored set contains only normalized relations, they need not map to different buckets of physical storage. Oracle is unique amongst RDBMS vendors in providing a convenient construct called the “table cluster” which is suitable for hierarchical schemas. In Codd’s example, employee# would be the cluster key, and rows corresponding to the same cluster key from all four tables could be stored in the same physical block on disk. If the cluster was a “hash cluster,” no indexes would be required to retrieve records corresponding to the same cluster key from all four tables.

Conclusion

The conclusion is that Amazon had an opportunity to kick the relational model up a notch but did not rise to the occasion. Amazon could have eaten its cake—extreme performance, extreme scalability, and extreme availability for important use cases such as shopping carts—and had it too—the relational model with all its wonderful declarative power. As Hugh Darwen said in the NoCOUG Journal interview: “Developers tend to be more concerned with convenience in database definition and updating than with the ease of deriving useful and reliable information from the database. … Those who disparage relational are almost invariably very far from being properly informed and almost invariably equate ‘relational’ with [current implementations].” I’ll leave you with that thought.

See also: What’s so sacred about relational anyway?

1. “Using primary and foreign keys can impact performance. Avoid using them when possible.” (http://docs.oracle.com/cd/E17904_01/core.1111/e10108/adapters.htm#BABCCCIH)

2. “For performance reasons, the Oracle BPEL Process Manager, Oracle Mediator, human workflow, Oracle B2B, SOA Infrastructure, and Oracle BPM Suite schemas have no foreign key constraints to enforce integrity.” (http://docs.oracle.com/cd/E23943_01/admin.1111/e10226/soaadmin_partition.htm#CJHCJIJI)

3. “For database independence, applications typically do not store the primary key-foreign key relationships in the database itself; rather, the relationships are enforced in the application.” (http://docs.oracle.com/cd/E25178_01/fusionapps.1111/e14496/securing.htm#CHDDGFHH)

4. “The ETL process commonly verifies that certain constraints are true. For example, it can validate all of the foreign keys in the data coming into the fact table. This means that you can trust it to provide clean data, instead of implementing constraints in the data warehouse.” (http://docs.oracle.com/cd/E24693_01/server.11203/e16579/constra.htm#i1006300)

5. I’ve been unable to find the statement in the Release 1.0 archives at http://www.sbw.org/release1.0/ so I don’t really know the true source or author of the statement. However, the statement is popularly attributed to Esther Dyson and claimed to have been published in the Release 1.0 newsletter. I found a claim that the statement is found in the September 1988 issue but that didn’t pan out.

6. Chris Date is a strong proponent of “relation-valued attributes” (RVAs) and argues that relations with RVAs are as “normal” as those without. See “What First Normal Form Really Means” in Date on Database: Writings 2000-2006 (Apress, 2006).

Appendix: Table Clusters in Oracle Database

Here’s a demonstration of using Oracle table clusters to store records from four tables in the same block and retrieving all the components of the “employee cart” without using indexes. First we create four normalized tables and prove that all the records of a single employee including job history, salary history, and children are stored in a single database block so that there is never any join-penalty when assembling employee data. Then we create an object-relational view that assembles employee information into a single unnormalized structure and show how to insert into this view using an “INSTEAD OF” trigger.

The following demonstration was performed using a pre-built developer VM for Oracle VM VirtualBox. The version of Oracle Database is 11.2.0.2.

SQL*Plus: Release 11.2.0.2.0 Production on Sun Jul 28 19:44:23 2013

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

First, we create a table cluster and add four tables to the cluster.

SQL> CREATE CLUSTER employees (employee# INTEGER) hashkeys 1000;

Cluster created.

SQL> CREATE TABLE employees
  2  (
  3    employee# INTEGER NOT NULL,
  4    name VARCHAR2(16),
  5    birth_date DATE,
  6    CONSTRAINT employees_pk PRIMARY KEY (employee#)
  7  )
  8  CLUSTER employees (employee#);

Table created.

SQL> CREATE TABLE job_history
  2  (
  3    employee# INTEGER NOT NULL,
  4    job_date DATE NOT NULL,
  5    title VARCHAR2(16),
  6    CONSTRAINT job_history_pk PRIMARY KEY (employee#, job_date),
  7    CONSTRAINT job_history_fk1 FOREIGN KEY (employee#) REFERENCES employees
  8  )
  9  CLUSTER employees (employee#);

Table created.

SQL> CREATE TABLE salary_history
  2  (
  3    employee# INTEGER NOT NULL,
  4    job_date DATE NOT NULL,
  5    salary_date DATE NOT NULL,
  6    salary NUMBER,
  7    CONSTRAINT salary_history_pk PRIMARY KEY (employee#, job_date, salary_date),
  8    CONSTRAINT salary_history_fk1 FOREIGN KEY (employee#) REFERENCES employees,
  9    CONSTRAINT salary_history_fk2 FOREIGN KEY (employee#, job_date) REFERENCES job_history
 10  )
 11  CLUSTER employees (employee#);

Table created.

SQL> CREATE TABLE children
  2  (
  3    employee# INTEGER NOT NULL,
  4    child_name VARCHAR2(16) NOT NULL,
  5    birth_date DATE,
  6    CONSTRAINT children_pk PRIMARY KEY (employee#, child_name),
  7    CONSTRAINT children_fk1 FOREIGN KEY (employee#) REFERENCES employees
  8  )
  9  CLUSTER employees (employee#);

Table created.

Then we insert data into all four tables. We find that all the records have been stored in the same database block even though they belong to different tables. Therefore the join-penalty has been eliminated.

SQL> INSERT INTO employees VALUES (1, 'IGNATIUS', '01-JAN-1970');

1 row created.

SQL> INSERT INTO children VALUES (1, 'INIGA', '01-JAN-2001');

1 row created.

SQL> INSERT INTO children VALUES (1, 'INIGO', '01-JAN-2002');

1 row created.

SQL> INSERT INTO job_history VALUES (1, '01-JAN-1991', 'PROGRAMMER');

1 row created.

SQL> INSERT INTO job_history VALUES (1, '01-JAN-1992', 'DATABASE ADMIN');

1 row created.

SQL> INSERT INTO salary_history VALUES (1, '01-JAN-1991', '1-FEB-1991', 1000);

1 row created.

SQL> INSERT INTO salary_history VALUES (1, '01-JAN-1991', '1-MAR-1991', 1000);

1 row created.

SQL> INSERT INTO salary_history VALUES (1, '01-JAN-1992', '1-FEB-1992', 2000);

1 row created.

SQL> INSERT INTO salary_history VALUES (1, '01-JAN-1992', '1-MAR-1992', 2000);

1 row created.

SQL> SELECT DISTINCT DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) AS block_number FROM employees where employee# = 1;

BLOCK_NUMBER
------------
       22881

SQL> SELECT DISTINCT DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) AS block_number FROM children where employee# = 1;

BLOCK_NUMBER
------------
       22881

SQL> SELECT DISTINCT DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) AS block_number FROM job_history where employee# = 1;

BLOCK_NUMBER
------------
       22881

SQL> SELECT DISTINCT DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) AS block_number FROM salary_history where employee# = 1;

BLOCK_NUMBER
------------
       22881

Next we create an object-relational view that presents each employee as an object.

SQL> CREATE OR REPLACE TYPE children_rec AS OBJECT (child_name VARCHAR2(16), birth_date DATE)
  2  /

Type created.

SQL> CREATE OR REPLACE TYPE children_tab AS TABLE OF children_rec
  2  /

Type created.

SQL> CREATE OR REPLACE TYPE salary_history_rec AS OBJECT (salary_date DATE, salary NUMBER)
  2  /

Type created.

SQL> CREATE OR REPLACE TYPE salary_history_tab AS TABLE OF salary_history_rec
  2  /

Type created.

SQL> CREATE OR REPLACE TYPE job_history_rec AS OBJECT (job_date DATE, title VARCHAR2(16), salary_history SALARY_HISTORY_TAB)
  2  /

Type created.

SQL> CREATE OR REPLACE TYPE job_history_tab AS TABLE of job_history_rec
  2  /

Type created.

SQL> create or replace view employees_view as
  2  SELECT
  3    employee#,
  4    name,
  5    birth_date,
  6    CAST
  7    (
  8      MULTISET
  9      (
 10        SELECT
 11          child_name,
 12          birth_date
 13        FROM children
 14        WHERE employee#=e.employee#
 15      )
 16      AS children_tab
 17    ) children,
 18    CAST
 19    (
 20      MULTISET
 21      (
 22        SELECT
 23          job_date,
 24          title,
 25          CAST
 26          (
 27            MULTISET
 28            (
 29              SELECT salary_date, salary
 30              FROM salary_history
 31              WHERE employee#=e.employee#
 32              AND job_date=jh.job_date
 33            )
 34            AS salary_history_tab
 35          ) salary_history
 36        FROM job_history jh
 37        WHERE employee#=e.employee#
 38      )
 39      AS job_history_tab
 40    ) job_history
 41  FROM employees e;

View created.

Let’s retrieve one employee object and look at the query execution plan. No indexes are used in retrieving records from each of the four tables. The cost of the plan is just 1. This is the minimum achievable cost, indicating that there is no join-penalty.

SQL> alter session set "_rowsource_execution_statistics"=true;

Session altered.

SQL> SELECT * FROM employees_view WHERE employee# = 1;

 EMPLOYEE# NAME             BIRTH_DAT
---------- ---------------- ---------
CHILDREN(CHILD_NAME, BIRTH_DATE)
------------------------------------------------------------------------------------------------------------------------------------
JOB_HISTORY(JOB_DATE, TITLE, SALARY_HISTORY(SALARY_DATE, SALARY))
------------------------------------------------------------------------------------------------------------------------------------
         1 IGNATIUS         01-JAN-70
CHILDREN_TAB(CHILDREN_REC('INIGA', '01-JAN-01'), CHILDREN_REC('INIGO', '01-JAN-02'))
JOB_HISTORY_TAB(JOB_HISTORY_REC('01-JAN-91', 'PROGRAMMER', SALARY_HISTORY_TAB(SALARY_HISTORY_REC('01-FEB-91', 1000), SALARY_HISTORY_
REC('01-MAR-91', 1000))), JOB_HISTORY_REC('01-JAN-92', 'DATABASE ADMIN', SALARY_HISTORY_TAB(SALARY_HISTORY_REC('01-FEB-92', 2000), S
ALARY_HISTORY_REC('01-MAR-92', 2000))))

SQL> SELECT * FROM TABLE(dbms_xplan.display_cursor(null, null, 'TYPICAL IOSTATS LAST'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  aaxmaqz947aa0, child number 0
-------------------------------------
SELECT * FROM employees_view WHERE employee# = 1

Plan hash value: 2117652374

--------------------------------------------------------------------------------------------------------------
| Id  | Operation         | Name           | Starts | E-Rows |E-Bytes| Cost  | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                |      1 |        |       |     1 |      1 |00:00:00.01 |       1 |
|*  1 |  TABLE ACCESS HASH| CHILDREN       |      1 |      1 |    32 |       |      2 |00:00:00.01 |       1 |
|*  2 |  TABLE ACCESS HASH| SALARY_HISTORY |      2 |      1 |    44 |       |      4 |00:00:00.01 |       3 |
|*  3 |  TABLE ACCESS HASH| JOB_HISTORY    |      1 |      1 |    32 |       |      2 |00:00:00.01 |       1 |
|*  4 |  TABLE ACCESS HASH| EMPLOYEES      |      1 |    845 | 27040 |       |      1 |00:00:00.01 |       1 |
--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("EMPLOYEE#"=:B1)
   2 - access("EMPLOYEE#"=:B1)
       filter("JOB_DATE"=:B1)
   3 - access("EMPLOYEE#"=:B1)
   4 - access("EMPLOYEE#"=1)

Note
-----
   - cpu costing is off (consider enabling it)
   - dynamic sampling used for this statement (level=2)

30 rows selected.

Next, let’s create an “INSTEAD OF” trigger so that we can insert into the view directly; that is, use a single INSERT statement instead of multiple INSERT statements. The trigger will do all the heavy lifting for us.

SQL> CREATE OR REPLACE TRIGGER employees_view_insert
  2  INSTEAD OF INSERT ON employees_view
  3  REFERENCING NEW AS n
  4  FOR EACH ROW
  5  DECLARE
  6    i NUMBER;
  7  BEGIN
  8    INSERT INTO employees
  9    VALUES
 10    (
 11      :n.employee#,
 12      :n.name,
 13      :n.birth_date
 14    );
 15
 16    FOR i IN :n.children.FIRST .. :n.children.LAST
 17    LOOP
 18      INSERT INTO children
 19      VALUES
 20      (
 21        :n.employee#,
 22        :n.children(i).child_name,
 23        :n.children(i).birth_date
 24      );
 25    END LOOP;
 26
 27    FOR i IN :n.job_history.FIRST .. :n.job_history.LAST
 28    LOOP
 29      INSERT INTO job_history VALUES
 30      (
 31        :n.employee#,
 32        :n.job_history(i).job_date,
 33        :n.job_history(i).title
 34      );
 35      FOR j IN :n.job_history(i).salary_history.FIRST .. :n.job_history(i).salary_history.LAST
 36      LOOP
 37        INSERT INTO salary_history
 38        VALUES
 39        (
 40          :n.employee#,
 41          :n.job_history(i).job_date,
 42          :n.job_history(i).salary_history(j).salary_date,
 43          :n.job_history(i).salary_history(j).salary
 44         );
 45      END LOOP;
 46    END LOOP;
 47  END;
 48  /

Trigger created.

Finally, let’s insert an employee object directly into the view and confirm that we can read it back.

SQL> INSERT INTO employees_view
  2  VALUES
  3  (
  4    2,
  5    'YGNACIO',
  6    '01-JAN-70',
  7    CHILDREN_TAB
  8    (
  9      CHILDREN_REC('INIGA', '01-JAN-01'),
 10      CHILDREN_REC('INIGO', '01-JAN-02')
 11    ),
 12    JOB_HISTORY_TAB
 13    (
 14      JOB_HISTORY_REC
 15      (
 16        '01-JAN-91',
 17        'PROGRAMMER',
 18        SALARY_HISTORY_TAB
 19        (
 20          SALARY_HISTORY_REC('01-FEB-91', 1000),
 21          SALARY_HISTORY_REC('01-MAR-91', 1000)
 22        )
 23      ),
 24      JOB_HISTORY_REC
 25      (
 26        '01-JAN-92',
 27        'DATABASE ADMIN',
 28        SALARY_HISTORY_TAB
 29        (
 30          SALARY_HISTORY_REC('01-FEB-92', 2000),
 31          SALARY_HISTORY_REC('01-MAR-92', 2000)
 32        )
 33      )
 34    )
 35  );

1 row created.

SQL> SELECT * FROM employees_view WHERE employee# = 2;

 EMPLOYEE# NAME             BIRTH_DAT
---------- ---------------- ---------
CHILDREN(CHILD_NAME, BIRTH_DATE)
------------------------------------------------------------------------------------------------------------------------------------
JOB_HISTORY(JOB_DATE, TITLE, SALARY_HISTORY(SALARY_DATE, SALARY))
------------------------------------------------------------------------------------------------------------------------------------
         2 YGNACIO          01-JAN-70
CHILDREN_TAB(CHILDREN_REC('INIGA', '01-JAN-01'), CHILDREN_REC('INIGO', '01-JAN-02'))
JOB_HISTORY_TAB(JOB_HISTORY_REC('01-JAN-91', 'PROGRAMMER', SALARY_HISTORY_TAB(SALARY_HISTORY_REC('01-FEB-91', 1000), SALARY_HISTORY_
REC('01-MAR-91', 1000))), JOB_HISTORY_REC('01-JAN-92', 'DATABASE ADMIN', SALARY_HISTORY_TAB(SALARY_HISTORY_REC('01-FEB-92', 2000), S
ALARY_HISTORY_REC('01-MAR-92', 2000))))

Three-Star Oracle Indexes and Three-Star Oracle Conferences

October 13, 2011 2 comments

Dear NoCOUG Members and Friends,

Oracle indexes can be given one, two, or three stars. A three-star index is the best possible index for a SQL query.

  • The first star is given if the index entries relevant to the query are next to each other or at least as close to each other as possible. This minimizes the thickness of the index slice that must be scanned.
  • The second star is given if the index rows are in the right order for the query. This eliminates sorting.
  • The third star is given if the index rows contain all the columns referred to by the query. This eliminates table access: the access path is index-only.

You can read more about three-star indexes in Relational Database Index Design by Lahdenmaki and Leach, which is #1 on the top-ten list of Oracle books recommended by Oracle performance expert Jonathan Lewis in his interview titled “Everybody Loves Jonathan,” printed in the November 2008 issue of the NoCOUG Journal.
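
To make the three stars concrete, consider a hedged example; the emp table, its columns, and the index name are invented for illustration. Suppose the query lists the names of the employees in one department, ordered by last name.

-- Query to be supported:
--   SELECT last_name, first_name FROM emp WHERE dept_id = 42 ORDER BY last_name;
--
-- A candidate three-star index:
--   Star 1: dept_id leads, so the matching entries form one thin slice of the index
--   Star 2: last_name comes next, so the rows come back already sorted
--   Star 3: first_name is included, so the query can be answered from the index alone
CREATE INDEX emp_dept_name_idx ON emp (dept_id, last_name, first_name);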

Similarly, Oracle conferences can be given one, two, or three stars. 🙂 A three-star conference is the best possible conference you can imagine.

  • The first star is given for great speakers.
  • The second star is given for a great location.
  • The third star is given for a great occasion.

NoCOUG Conference #100 Sponsored by Quest Software—Simplicity at Work on Wednesday, November 9 is a three-star conference.

  • It will feature some of the best speakers in the world including Oracle ACE Directors Steven Feuerstein, Craig Shallahamer, and Alex Gorbachev.
  • It will be held at the greatest location in the world for an Oracle conference—the Computer History Museum in Mountain View. The museum features wonderful computing artifacts including Hollerith’s 1890 Census Tabulator—the product that made IBM famous—and a Babbage Difference Engine—one of only two that have ever been constructed.
  • It is a history-making occasion. NoCOUG is celebrating its 25th anniversary in 2011 and the upcoming conference is NoCOUG’s 100th quarterly conference.

Conference 100 will be our biggest, baddest conference ever! Don’t you want to be there? We’ve opened registration early because we expect so many attendees. Please review the agenda and RSVP right away.

Kindest regards,

Dave Abercrombie, Journal Editor
Eric Hutchinson, Webmaster
Gwen Shapira, Conference Director
Hanan Hit, Vice President
Iggy Fernandez, President
Jen Hong, Board Member at Large
Naren Nagtode, Secretary and Treasurer
Omar Anwar, Vendor Coordinator
Randy Samberg, Training Coordinator
Scott Alexander, Speaker Coordinator

Lesson 2 of 40: Physical Database Design for Oracle Databases: Non-Necessity?

September 13, 2011 5 comments

In most people’s vocabularies, design means veneer. It’s interior decorating. It’s the fabric of the curtains and the sofa. But to me, nothing could be further from the meaning of design. Design is the fundamental soul of a man-made creation that ends up expressing itself in successive outer layers of the product or service.—Steve Jobs, “Apple’s One-Dollar-a-Year Man.” Fortune Magazine. 24 Jan 2000.

Physical database design improvements are a cost-effective solution to a performance and scalability challenge.

Or are they?

Economic principles can equally well be used to make the case for faster hardware. In his book, Oracle on VMware, Dr. Bert Scalzo says:

Person hours cost so much more now than computer hardware even with inexpensive offshore outsourcing. It is now considered a sound business decision these days to throw cheap hardware at problems. It is at least, if not more, cost effective than having the staff [sic] tuned and optimized for the same net effect.

Besides, a failed tuning and optimization effort leaves you exactly where you started. At least the hardware upgrade approach results in a faster/better server experiencing the same problem that may still have future value to the business once the fundamental problem is eventually corrected. And, if nothing else, the hardware can be depreciated, whereas the time spent tuning is always just a cost taken off the bottom line. So, with such cheap hardware, it might be a wiser business bet to throw hardware at some solutions sooner than was done in the past. One might go so far as to make an economic principle claim that the opportunity cost of tuning is foregoing cheap upgrades that might fix the issue and also possess intrinsic value. Stated this way, it is a safe bet that is where the business people would vote to spend.

When I interviewed Jonathan Lewis for the November 2008 issue of the NoCOUG Journal, I asked him for a second opinion. He homed in on a key element of Dr. Scalzo’s argument.

I think the key sentence in Dr. Scalzo’s case is “So, with such cheap hardware, it might be a wiser business bet to throw hardware at some solutions sooner than was done in the past.” And in this sentence, the most important word is “sooner.”

Throwing hardware at a problem has always been an option—and sometimes the correct option—but 20 years ago hardware was much more expensive than it is now, took up more space, used more power, and generated more heat so, in the standard analysis of cost/benefit/risk, correcting the bad code was usually seen to be more cost effective than using more hardware to live with bad code.

Nothing has really changed in the analysis—except the pricing. If you know that more hardware will give you the scalability you need for as long as you need; the installation process is fast enough and safe enough; and the alternative is going to take too long, need too much testing, or introduce too much risk, then the hardware is the right choice.

It has to be said, though, that the analysis is not always carried out correctly. Issues of scalability can be deceptive—especially when you are talking about highly concurrent systems, and installing new hardware isn’t always risk-free or fast. The main drawback to the hardware solution (if the analysis isn’t done properly) is that you can still end up having to fix the code, except the fix is now an emergency that appears unexpectedly at some future time.

When I interviewed Dr. Scalzo for the next issue of the Journal, I asked him to clarify his remarks. He focused on the cost of tuning:

No—Oracle is a great database with extensive flexibility. Instead, it’s the cost of tuning (whether for database, operating system, and/or application) that has become somewhat cost ineffective. Let me give an example. A decade ago, tuning the SGA memory allocation and usage was critical—and it still is today. But with dirt-cheap memory, servers that can accommodate tons of memory, and Oracle automatic memory management—just buy more memory. It’s often far cheaper to add 16 GB RAM than pay two days’ expert database consultation fees. Yes, there are some scenarios where simply adding hardware might make a database bottleneck situation worse. But the odds and cost are now such that it’s a worthwhile first step—since you can always tune if that fails. I’m just preaching to try cheap and easy solutions first.
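
In Oracle Database 11g terms, the “just buy more memory” prescription roughly corresponds to enabling automatic memory management and raising the target, along the lines of the following hedged sketch; the 16 GB figure is only an example, and MEMORY_MAX_TARGET must be at least as large as MEMORY_TARGET.

-- Hedged sketch: let Oracle manage the SGA and PGA out of one larger memory target
ALTER SYSTEM SET memory_max_target = 16G SCOPE = SPFILE;
ALTER SYSTEM SET memory_target     = 16G SCOPE = SPFILE;
-- An instance restart is required for the new memory_max_target to take effect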

In the next issue of the Journal, I interviewed Karen Morton and discussed the same topic with her. She emphasized the need for diagnosis before attempting any fix:

If you have performance problems that have been proven, through careful analysis, to have their root cause in substandard hardware, then the solution may be to upgrade. The investment made to make sure personnel are adequately trained to properly diagnose and repair performance problems can benefit your company in the long run much more so than taking what appears to be a quick fix by throwing hardware at the problem in the short run.

I think that when personnel are properly trained and knowledgeable, they will not only be able to quickly, efficiently, and cost effectively find and fix problems when they arise, but they will not make the mistakes that cause the problems in the first place.

I believe the truth is that if hardware isn’t your real problem, you’ll eventually have no other choice than to address the actual root cause, because hardware upgrades can only be a Band-aid on the wound for so long. I’ve seen too many people try this approach only to pay hundreds of thousands of dollars for hardware they don’t really even need and still have performance problems. When they finally do address the real root cause and fix it, they often find that they could have bypassed the entire expenditure on hardware and saved more time and money if they’d just done it right in the first place.

To summarize, while physical database design improvements may be the elegant and efficient solution to a performance and scalability challenge, more powerful hardware can sometimes be the simplest, quickest, and most cost-effective solution. For example, GridIron Systems claims that its “TurboCharger” caching solution “boosts Oracle database performance by providing up to 1000x faster access to the data blocks containing database tables and indexes.” However, diligence and diagnosis are necessary because a hardware upgrade may not improve the performance of your task and can sometimes make it worse. As Cary Millsap points out in Why “System” is a Four-Letter Word, “performance of your task gets worse if the “improvement” intensifies competition for the resource that is the bottleneck for your task.”

April Fools’ Day Joke

I started writing my book on April 1, 2011, when enthusiasm for Oracle Exadata Database Machine was at its height. The strengths of Oracle Exadata Database Machine are Smart Scans, Storage Indexes, and Hybrid Columnar Compression, which dramatically reduce the cost of full table scans. However, it was clear to me that powerful hardware did not eliminate the need for physical database design. I also noted that Oracle’s most recent attempt at the TPC-H benchmark did not use Exadata. I wrote the following tongue-in-cheek posting in honor of the day.

SECRET OF ORACLE DATABASE PERFORMANCE FOUND IN 1897 SEARS ROEBUCK CATALOG

Redwood Shores, April 1, 2011 – In between mouthfuls of raw tofu slices delicately dressed with chili oil and shredded seaweed, Oracle CEO Larry Ellison related his game-changing epiphany about database performance.

“My personal ivory buyer was showing me a picture of ivory chopsticks in the 1897 Sears Roebuck catalog when I noticed the following words at the back of the catalog: IF YOU DON’T FIND IT IN THE INDEX, LOOK VERY CAREFULLY THROUGH THE ENTIRE CATALOG.

That’s when the truth hit me like a ton of bricks. Physical database design is hard. There are few good books about it. Why should our loyal customers spend a lot of money on database consultants when they can spend an equal amount of money on Oracle software and hardware instead? We could tell ’em to stop worrying about indexes, clustering, partitioning, and materialization and sell ’em a big honkin’ piece of hardware that can simply scan entire tables every time.”

Instructions in the 1897 Sears Roebuck catalog

Mr. Ellison went on: “It’s not that we haven’t tried the traditional route. Consider Multidimensional Clustering for example. We’ve offered it from day one but the only places it is ever used are the Oracle data dictionary and our own TPC-C benchmarks. It’s time to admit that our customers are right and to give them what they want, not what they need.”

When asked how he came up with the “Smart Scan” moniker for what was formerly called a full table scan, Mr. Ellison was equally candid. “When other companies sell fish, they bill it as dead fish. We call it sushi.”

When told that a certain database maverick named Iggy Fernandez had started writing a book with the ridiculously long title of “Expert Oracle Indexing, Clustering, Partitioning, and Materialization for Performance and Scalability in Oracle Database 11g Release 2” that he hopes to release in time for OpenWorld 2011, Mr. Ellison let out a derisive snort and predicted that Mr. Fernandez was unlikely to get rich from it. When contacted for comment, Mr. Fernandez reluctantly agreed with Mr. Ellison’s prediction and quoted American poet laureate Robert Frost: “Two roads diverged in a wood, and I—I took the one less traveled by.”

Load-and-Go

I myself have encountered at least one database application that did not use any indexes whatsoever but performed adequately even though every query required full table scans. In fact, a database vendor named ParAccel recently performed the 1 TB TPC-H data warehouse benchmark without any indexes at all; it calls this approach “Load-and-Go.” You may wonder how a database can perform joins—inner, outer, semi, and anti joins—without indexes. Oracle Database has two methods at its disposal—the “hash join” and the “sort merge join.” Please refer to the links for more information.
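
As a hedged illustration, the following statements join the project and commit tables defined in the database design lesson later in this archive; the hints merely ask the optimizer to use the named join methods, and no index is involved.

-- Hash join: build an in-memory hash table on one input and probe it with the other
SELECT /*+ USE_HASH(cmt) */
  prj.project#,
  cmt.part#,
  cmt.quantity_committed
FROM project prj
INNER JOIN commit cmt
ON (cmt.project# = prj.project#);

-- Sort merge join: sort both inputs on the join key and merge them
SELECT /*+ USE_MERGE(cmt) */
  prj.project#,
  cmt.part#,
  cmt.quantity_committed
FROM project prj
INNER JOIN commit cmt
ON (cmt.project# = prj.project#);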

Summary

  • A hardware upgrade can sometimes be the simplest, quickest, and most cost-effective solution to a performance and scalability problem.
  • Performance of your task gets worse if a hardware upgrade intensifies competition for the resource that is the bottleneck for your task.
  • Oracle Database can use the “hash join” and “sort merge join” techniques to perform joins of all kinds without the use of indexes.

Further Reading

Exercises

  • Use Statspack or AWR to investigate whether your database application will benefit from faster I/O (a starting-point query appears after this list).
  • Evaluate SSD products from vendors such as Kaminario, Fusion-io, Texas Memory Systems, and GridIron Systems.
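
As a starting point for the first exercise, the following hedged query against v$system_event is a rough substitute for reading the wait-event sections of a Statspack or AWR report; it shows how much time the instance has spent on user I/O waits since startup.

-- Hedged sketch: instance-wide time spent on user I/O waits since startup
SELECT event,
  total_waits,
  ROUND(time_waited_micro / 1000000) AS seconds_waited
FROM v$system_event
WHERE wait_class = 'User I/O'
ORDER BY time_waited_micro DESC;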

Previous Lesson

Lesson 1 of 40: Physical Database Design for Oracle Databases: Necessity and Definition

September 4, 2011 3 comments

Now that my publisher and I have amicably parted ways, I am free to self-publish my work on physical database design. Some explanation is called for. I did not want to rewrite the excellent sections on physical database design that can be found in the free online guides provided by Oracle Corporation; instead I wanted to direct the reader to the online guides for supplemental information. This would have meant that my book would be far shorter and take far longer to write than the contract called for. Now that the publishing rights have reverted to me, I plan on serializing my work here in the form of short lessons with links to the relevant sections of the online Oracle guides for supplemental information. Comments and criticism will always be welcome; they will help me improve the work, which I hope to eventually publish in book form under the title Physical Database Design for Oracle Databases: Indexes, Clusters, Partitions, and Materialized Views.

In the first lesson you will learn the necessity of Physical Database Design and find out what is involved.

The year was … well, let’s just say it was a long time ago. My friends Andrea, Mabel, and Sameer and I had been assigned to work on a big software development project. We were unfamiliar with the technologies that would be used—VMS, ACMS, TDMS, and Rdb—but training was promised. The very first thing the instructor said was “First you have to insert your definitions into the CDD,” and he walked to the chalkboard and wrote the commands that we needed for the purpose. Needless to say, we were quite flustered because we had no idea what those “definitions” might be or what a “CDD” was and how it fit into the big picture. We were being taught practice before the theory.

“Those who are in love with practice without knowledge are like the sailor who gets into a ship without rudder or compass and who never can be certain whither he is going. Practice must always be founded on sound theory.” (The Discourse on Painting by Leonardo da Vinci)

You’ll appreciate that I don’t begin by saying “First you have to create your indexes” and giving you the commands that you need for the purpose. Instead I’ll first put physical database design into perspective, explain its purpose, and provide a definition.

Definitions of Logical and Physical Database Design

Database design can be divided into a logical phase (logical database design) and a physical phase (physical database design). Logical database design is the process of designing normalized database tables. (Lightstone et al. 2007) Consider the following example taken from a paper by Edgar Codd, the inventor of relational database theory:

Suppose the data bank contains information about parts and projects. For each part, the part number, part name, part description, quantity-on-hand, and quantity-on-order are recorded. For each project, the project number, project name, and project description are recorded. Whenever a project makes use of a certain part, the quantity of that part committed to the given project is also recorded.

You could conceivably construct the following solution consisting of just one table:

CREATE TABLE part_project_commit
  (
    -- part information
    part#               NUMBER(6) NOT NULL,
    part_name           VARCHAR2(16) NOT NULL,
    part_description    VARCHAR2(64) NOT NULL,
    quantity_on_hand    NUMBER(18,6) NOT NULL,
    quantity_on_order   NUMBER(18,6) NOT NULL,
    -- project information
    project#            NUMBER(6) NOT NULL,
    project_name        VARCHAR2(16) NOT NULL,
    project_description VARCHAR2(64) NOT NULL,
    -- quantity of part committed to project
    quantity_committed  NUMBER(18,6) NOT NULL,
    CONSTRAINT part_project_commit_pk PRIMARY KEY (part#, project#)
  );

The above solution is obviously problematic; for example, multiple rows may have to be updated whenever the quantity_on_hand of a part changes. Through a process of data modeling and database normalization, a better solution involving three tables can be constructed as follows:

CREATE TABLE part
  (
    part#             NUMBER(6) NOT NULL,
    part_name         VARCHAR2(16) NOT NULL,
    part_description  VARCHAR2(64) NOT NULL,
    quantity_on_hand  NUMBER(18,6) NOT NULL,
    quantity_on_order NUMBER(18,6) NOT NULL,
    CONSTRAINT part_pk PRIMARY KEY (part#)
  );

CREATE TABLE project
  (
    project#            NUMBER(6) NOT NULL,
    project_name        VARCHAR2(16) NOT NULL,
    project_description VARCHAR2(64) NOT NULL,
    CONSTRAINT project_pk PRIMARY KEY (project#)
  );

CREATE TABLE commit
  (
    part#              NUMBER(6) NOT NULL,
    project#           NUMBER(6) NOT NULL,
    quantity_committed NUMBER(18,6) NOT NULL,
    CONSTRAINT commit_pk PRIMARY KEY (part#, project#),
    CONSTRAINT commit_part_fk FOREIGN KEY (part#) REFERENCES part,
    CONSTRAINT commit_proj_fk FOREIGN KEY (project#) REFERENCES project
  );

The above CREATE TABLE commands can be used to construct tables in your Oracle database. This concludes the process of logical database design as we defined it. However, the work of database design is not complete; we still have to consider the question of performance. The default organization of an Oracle database table is an unordered heap; in other words, an Oracle database table starts out as an unorganized collection of records. The SQL queries that retrieve data from these tables will need the help of indexes and other mechanisms to do so efficiently. Queries are transformed by the query optimizer into sequences of database operations such as restriction, projection, inner join, outer join, semi-join, anti-join, aggregation, and sorting. These operations are illustrated in the following examples; some experience with SQL is assumed.

Our first example illustrates the restriction and projection operations. The restriction operation eliminates rows from the target of the query while the projection operation eliminates columns.

-- Illustration of restriction and projection
-- List part names for certain parts

SELECT prt.part#,
  prt.part_name
FROM part prt
WHERE part# BETWEEN 10000 AND 20000
ORDER BY prt.part#;

Our second example illustrates the inner join operation. It is used to associate rows from two tables.

-- Illustration of inner join
-- List each project and the parts it uses
-- Projects that use no parts are not listed

SELECT prj.project#,
  prj.project_name,
  cmt.part#,
  cmt.quantity_committed
FROM project prj
INNER JOIN commit cmt
ON (cmt.project# = prj.project#)
ORDER BY prj.project#,
  cmt.part#;

Our third example illustrates the left outer join operation. Like the inner join operation, it is also used to associate rows from two tables. However, rows from the left table that have no corresponding rows in the right table are also included in the result. The right outer join operation is similar to the left outer join operation; rows from the right table that have no corresponding rows in the left table are included in the result. The full outer join operation includes rows from both the left and right tables even if they have no corresponding rows in the right and left tables respectively.

-- Illustration of left outer join
-- List each part and the projects in which it is used
-- Parts that are not used in any project are also listed

SELECT prt.part#,
  prt.part_name,
  cmt.project#,
  cmt.quantity_committed
FROM part prt
LEFT OUTER JOIN commit cmt
ON (cmt.part# = prt.part#)
WHERE prt.part# BETWEEN 10000 AND 20000
ORDER BY prt.part#,
  cmt.project#;

Our fourth example illustrates a check for existence in a correlated subquery. The corresponding database operation is called a semi-join.

-- Illustration of semi-join
-- List parts that are used in at least one project

SELECT prt.part#,
  prt.part_name,
  prt.part_description,
  prt.quantity_on_hand,
  prt.quantity_on_order
FROM part prt
WHERE EXISTS
  (
  -- Correlated subquery
  SELECT * FROM COMMIT cmt WHERE cmt.part#=prt.part#
  )
ORDER BY prt.part#;

Our fifth example illustrates a check for non-existence in a correlated subquery. The corresponding database operation is called an anti-join.

-- Illustration of anti-join
-- List parts that are not used in any projects

SELECT prt.part#,
  prt.part_name,
  prt.part_description,
  prt.quantity_on_hand,
  prt.quantity_on_order
FROM part prt
WHERE NOT EXISTS
  (
  -- Correlated subquery
  SELECT * FROM COMMIT cmt WHERE cmt.part#=prt.part#
  )
ORDER BY prt.part#;

Our sixth example illustrates aggregation.

-- Illustration of aggregation
-- List the total committed quantity for each part

SELECT cmt.part#,
  SUM(cmt.quantity_committed) AS tot_quantity_committed
FROM commit cmt
GROUP BY cmt.part#
ORDER BY cmt.part#;

Obviously, SQL queries can be much more complicated than the above examples; queries that are thousands of lines long are not unheard of. Queries can include any number of tables and can build upon intermediate queries, as in the following example, which lists projects that use more than 50% of the committed usage of any part.

-- A complex query
-- List projects which use more than 50% of the committed usage of any part

SELECT cmt.project#,
  cmt.part#,
  cmt.quantity_committed,
  cmt.quantity_committed/tot.tot_quantity_committed * 100 AS pct_usage
FROM COMMIT cmt
INNER JOIN
  (
  -- Intermediate query
  -- List the total committed quantity for each part
  SELECT cmt.part#,
    SUM(cmt.quantity_committed) AS tot_quantity_committed
  FROM COMMIT cmt
  GROUP BY cmt.part#
  ) tot
ON (cmt.part# = tot.part#)
WHERE cmt.quantity_committed/tot.tot_quantity_committed * 100 > 50
ORDER BY cmt.project#,
  cmt.part#;

Three broad categories of performance mechanisms are available to improve the performance of database operations. Indexes help in restriction, projection, joins of all types, and sorting. Partitions and clusters can be used to co-locate data with similar characteristics in order to limit the amount of physical I/O. Finally, materialized views can be used to perform joins and aggregations ahead of time.
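
A hedged sketch of one representative from each category follows, built on the part, project, and commit tables above; the object names are illustrative, and the partitioned table is shown as a separate table purely to keep the example short.

-- An index to support joining from project to commit on project#
-- (the primary key of commit already provides an index that leads with part#)
CREATE INDEX commit_project_idx ON commit (project#);

-- A hash-partitioned variant of the commit table that co-locates rows by part#
CREATE TABLE commit_hashed
  (
    part#              NUMBER(6) NOT NULL,
    project#           NUMBER(6) NOT NULL,
    quantity_committed NUMBER(18,6) NOT NULL,
    CONSTRAINT commit_hashed_pk PRIMARY KEY (part#, project#)
  )
  PARTITION BY HASH (part#) PARTITIONS 16;

-- A materialized view that performs the aggregation ahead of time
CREATE MATERIALIZED VIEW part_commit_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
AS
SELECT cmt.part#,
  SUM(cmt.quantity_committed) AS tot_quantity_committed
FROM commit cmt
GROUP BY cmt.part#;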

Physical database design is the process of designing indexes, partitions, clusters, and materialized views. (Lightstone et al. 2007) Its purpose is to improve the efficiency of database operations such as restriction, projection, joining, aggregation, and sorting.

This concludes the first lesson.

Summary

  • Database design can be divided into a logical phase (logical database design) and a physical phase (physical database design).
  • Logical database design is the process of designing normalized database tables. (Lightstone et al. 2007)
  • Physical database design is the process of designing indexes, partitions, clusters, and materialized views. (Lightstone et al. 2007)
  • The purpose of physical database design is to improve the efficiency of database operations such as restriction, projection, joining, aggregation, and sorting.

Further Reading

Exercises

  • Perform the database design exercise in SQL Developer Data Modeler User’s Guide Release 3.0.
  • Discuss whether physical database design is required for small databases that completely fit into memory.
  • Discuss whether database normalization affects query performance.
  • Create the Part, Project, and Commit tables using the CREATE TABLE statements provided in the lesson. Identify the indexes that were automatically created by Oracle even though we did not explicitly ask for them to be created (a starting-point query appears after this list). Justify why Oracle decides to create these indexes automatically.
  • Use the UNION clause instead of the OUTER JOIN clause to list each part and the projects in which it is used, including the parts which are not used in any projects. Use the INNER JOIN clause instead of a correlated subquery to list parts that are used in at least one project. Use the MINUS clause instead of a correlated subquery to list parts that are not used in any project. Use analytic functions instead of the GROUP BY clause to list projects that account for more than 50% of the committed usage of any part. Discuss the performance impacts of rewriting queries and whether the physical database design is influenced by how queries are written.
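
For the exercise above on automatically created indexes, a starting point is to ask the data dictionary which indexes exist on the three tables; the query below is a hedged sketch and assumes the tables were created in your own schema.

-- Hedged sketch: list the indexes present on the three tables after running the CREATE TABLE statements
-- (Oracle silently creates an index to enforce each PRIMARY KEY constraint unless a suitable one already exists)
SELECT table_name,
  index_name,
  uniqueness
FROM user_indexes
WHERE table_name IN ('PART', 'PROJECT', 'COMMIT')
ORDER BY table_name, index_name;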

Next Lesson
