Archive

Archive for the ‘SQL’ Category

Results of the NoCOUG SQL Mini-Challenge

October 28, 2014 2 comments

As published in the November 2014 issue of the NoCOUG Journal

The inventor of the relational model, Dr. Edgar Codd, was of the opinion that “[r]equesting data by its properties is far more natural than devising a particular algorithm or sequence of operations for its retrieval. Thus, a calculus-oriented language provides a good target language for a more user-oriented source language” (“Relational Completeness of Data Base Sublanguages”). Therefore, with the exception of the union operation, the original version of SQL was based on relational calculus, although, over time, other elements of relational algebra like difference (minus), intersection, and outer join crept in.

Testing of existence in a set using subqueries is the hallmark of relational calculus. The following example has nested subqueries. It lists the locations containing a department that either contains an employee named Steven King or an employee who holds the title of President or an employee who has previously held the title of President.

SELECT l.location_id, l.city
FROM locations l
WHERE EXISTS (
  SELECT * FROM departments d, employees e, jobs j
  WHERE d.location_id = l.location_id
  AND e.department_id = d.department_id
  AND j.job_id = e.job_id
  AND (
    (e.first_name = 'Steven' AND e.last_name = 'King')
    OR j.job_title  = 'President'
    OR EXISTS (
      SELECT * FROM job_history jh, jobs j2
      WHERE jh.employee_id = e.employee_id
      AND j2.job_id = jh.job_id
      AND j2.job_title = 'President'
    )
  )
)

The challenge was to rewrite the above query without subqueries and in the most efficient way possible, as measured by the number of consistent gets—the Buffers column in the query execution plan—when executed in the HR sample schema. The competitors published their submissions on the NoCOUG blog nocoug.wordpress.com. Oracle ACE Directors Kyle Hailey and Tim Gorman helped judge the competition.

Backgrounder

The secret sauce of relational database management systems is composed of what Dr. Edgar Codd referred to as “relational calculus” and “relational algebra.” He was a mathematician by training so he instinctively used complex-sounding mathematical terms that normal folk don’t use a lot. Relational calculus has nothing to do with college calculus; a relational calculus expression is simply an English-like non-procedural specification of the user’s query requirements. Relational algebra, on the other hand, is a step-by-step procedural recipe that details how to meet the user’s requirements. The whole reason why there is so much interest in EXPLAIN PLAN is that it documents the sequence of relational algebra operations that Oracle Database used at runtime to execute any particular query; that is, it documents the query execution plan used by Oracle Database.

More precisely, relational algebra is a collection of operations that can be used to combine tables. Just as you can combine numbers using the operations of addition, subtraction, multiplication, and division, you can combine tables using operations like “restriction,” “projection,” “union,” “difference,” and “cartesian join.” These five operations are provably sufficient to implement any requirement expressed in relational calculus. They are also primitive; that is, none of them is a combination of the others. Composite relational operations can be devised by combining these five operations; examples include “inner join,” “intersection,” “outer join,” “semi-join,” “anti-join,” and “division.” For example, inner join is obviously a combination of cartesian join and restriction. Another example is A INTERSECTION B = A MINUS (A MINUS B).

In “Relational Completeness of Data Base Sublanguages” Codd showed how to mechanically convert a relational calculus expression into a relational algebra expression. Refer to http://goo.gl/SHcWLE for a worked example. In the case of the challenge query, a little thought—or a lot of thought—will convince you that the EXISTS clauses can be eliminated by the simple expedient of selecting distinct values of location_id and city from the join of all the tables in the original query, as shown below. Note that the jobs table has to be joined twice because it occurs twice in the challenge query.

SELECT DISTINCT l.location_id, l.city
FROM locations l, departments d, employees e, jobs j, job_history jh, jobs j2
WHERE EXISTS (
  SELECT * FROM departments d, employees e, jobs j
  WHERE d.location_id = l.location_id
  AND e.department_id = d.department_id
  AND j.job_id = e.job_id
  AND (
    (e.first_name = 'Steven' AND e.last_name = 'King')
    OR j.job_title  = 'President'
    OR EXISTS (
      SELECT * FROM job_history jh, jobs j2
      WHERE jh.employee_id = e.employee_id
      AND j2.job_id = jh.job_id
      AND j2.job_title = 'President'
    )
  )
)

However, such a mechanical conversion from relational calculus to relational algebra may not be your best bet. In general, there is more than one way to write a SQL query, and the query optimizer may choose a different query execution plan for each version. In other words, functionally equivalent SQL queries may not be equally efficient in practice. This is the biggest unsolved problem of the relational era.

Judging Criteria

The criteria used by the judges were correctness and efficiency. Efficiency was measured by the number of consistent gets, that is, the Buffers column in the query execution plan. The GATHER_PLAN_STATISTICS hint instructs Oracle Database to keep track of “rowsource execution statistics,” which can then be printed out by DBMS_XPLAN.DISPLAY_CURSOR. Judging correctness is a much harder task, because no tools are available for determining the equivalence of two queries. In fact, the reason why the query optimizer may choose different query execution plans for two functionally equivalent queries is that it cannot tell if they are equivalent. What is needed is a systematic method of converting a SQL query into a “canonical” version such that all functionally equivalent queries would be converted into the same canonical version. In the absence of automated tools, we have to resort to thinking very hard and writing test cases.

We Have a Winner!

Kim Berg Hansen from Denmark wins the mini-challenge. Not only did his solution survive the test cases written by the judges but he reduced the number of consistent gets to the theoretical minimum of 1 by using a UNION ALL materialized view. The first branch of the UNION ALL finds employees named Steven King or who hold the title of President, while the second branch finds employees who previously held the title of President. Materialized views can be indexed just like tables, so Kim created an index consisting only of the location_id and city columns. He also took pains to ensure that the materialized view was fast-refreshable on commit. Therefore, his solution continues to return correct results even after the tables are updated.

CREATE MATERIALIZED VIEW LOG ON employees WITH ROWID, COMMIT SCN
   (employee_id, job_id, first_name, last_name, department_id)
   INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW LOG ON departments WITH ROWID, COMMIT SCN
   (department_id, location_id)
   INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW LOG ON locations WITH ROWID, COMMIT SCN
   (location_id, city)
   INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW LOG ON jobs WITH ROWID, COMMIT SCN
   (job_id, job_title)
   INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW LOG ON job_history WITH ROWID, COMMIT SCN
   (job_id, employee_id)
   INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW employees_mv
BUILD IMMEDIATE
REFRESH FAST ON COMMIT
ENABLE QUERY REWRITE
AS
SELECT 'CURRENT' record_type, l.location_id, l.city, j.rowid j_rowid, NULL jh_rowid, e.rowid e_rowid, d.rowid d_rowid, l.rowid l_rowid
FROM locations l, departments d, employees e, jobs j
WHERE (
  d.location_id = l.location_id
  AND e.department_id = d.department_id
  AND j.job_id = e.job_id
  AND (
    (e.first_name = 'Steven' AND e.last_name = 'King')
    OR j.job_title  = 'President'
  )
)
UNION ALL
SELECT 'HISTORY' record_type, l.location_id, l.city, j2.rowid j_rowid, jh.rowid jh_rowid, e.rowid e_rowid, d.rowid d_rowid, l.rowid l_rowid
FROM locations l, departments d, employees e, job_history jh, jobs j2
WHERE (
  d.location_id = l.location_id
  AND e.department_id = d.department_id
  AND (
      jh.employee_id = e.employee_id
      AND j2.job_id = jh.job_id
      AND j2.job_title = 'President'
  )
);

CREATE INDEX employees_mv_ix on employees_mv (
   location_id, city
);

Kim’s solution to the mini-challenge mimics the definition of the materialized view in the hope that “query rewrite” will kick in. Because Oracle Database does not index null values, the redundant clause “WHERE location_id IS NOT NULL” is added to ensure that the query optimizer uses the index.

SELECT /*+ GATHER_PLAN_STATISTICS */ DISTINCT location_id, city
FROM (
  SELECT l.location_id, l.city
  FROM locations l, departments d, employees e, jobs j
  WHERE (
    d.location_id = l.location_id
    AND e.department_id = d.department_id
    AND j.job_id = e.job_id
    AND (
      (e.first_name = 'Steven' AND e.last_name = 'King')
      OR j.job_title  = 'President'
    )
  )
  UNION ALL
  SELECT l.location_id, l.city
  FROM locations l, departments d, employees e,
    job_history jh, jobs j2
  WHERE (
    d.location_id = l.location_id
    AND e.department_id = d.department_id
    AND (
        jh.employee_id = e.employee_id
        AND j2.job_id = jh.job_id
        AND j2.job_title = 'President'
    )
  )
)
WHERE location_id IS NOT NULL;

On checking the query execution plan, we find that the data was indeed obtained from the index on the materialized view. Only one consistent get operation was needed.

SELECT * FROM
TABLE(dbms_xplan.display_cursor(format => 'TYPICAL IOSTATS LAST'));

Plan hash value: 4093995400

--------------------------------------------------------------------------
| Id  | Operation          | Name            | Starts | A-Rows | Buffers |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                 |      1 |      1 |       1 |
|   1 |  SORT UNIQUE NOSORT|                 |      1 |      1 |       1 |
|*  2 |   INDEX FULL SCAN  | EMPLOYEES_MV_IX |      1 |      1 |       1 |
--------------------------------------------------------------------------

Kim wins a $75 Amazon gift certificate for his most excellent effort. The entries of the other competitors have been published on the NoCOUG blog nocoug.wordpress.com.

Categories: NoCOUG, Oracle, SQL

The Hitchhiker’s Guide to the EXPLAIN PLAN: The story so far (Part 11–22)

September 27, 2014 Leave a comment

On the Toad World site, I’m writing a series of blog posts and Wiki articles on the subject of EXPLAIN PLAN. I’m using EXPLAIN PLAN as a motif to teach not just SQL tuning but also relational theory, logical database design, and physical database design. In a year’s time, I hope to have enough material for a self-published book.

Part 11—Abandon All Hope: Query plans can and will change dramatically because SQL is a non-procedural language. At best, reviewing a query plan can only establish that the query plan is reasonably good. But there is no guarantee that the plan will be consistently used. In addition to goodness of query plans, we need robustness (applicability to a wide range of bind variables) and stability (the guarantee that the plan of choice will be used) but they entail a lot of work. There is hope only if we are willing to do the work that is entailed. There is no hope if you are unwilling to take the time to understand the principles of robustness and stability and do what it takes to achieve them. There is no Easy Button like in Staples stores.

Part 12—Throw Away that Execution Plan: Instead of focusing on EXPLAIN PLAN at the outset, focus on logical database and physical database design, give the optimizer the information it needs to do a good job, and write well-structured SQL statements that the optimizer can easily understand.

Part 13—The Great Pretender: An enduring Oracle Database myth is that EXPLAIN PLAN and AUTOTRACE show the execution plan. This may have been true in early versions of Oracle but no longer. As Tom Kyte said: “It ain’t so much the things we don’t know that get us into trouble. It’s the things you know that just ain’t so or just ain’t so anymore or just ain’t always so.”

Part 14—Damn the Cost, Full Speed Ahead: The cost displayed in query plans is a misleading and useless piece of information. It is an estimate that is more likely to be wrong than correct. By default, it is computed using particular values of bind variables and therefore does not apply to subsequent executions with different values. And, it is measured in units of SREADTIM, not clock seconds.

Reviewing Execution Plan History: When a SQL statement is performing poorly, the conventional wisdom is to examine its query execution plan. If a query is executed more than once, its query plan can be different the second time around for a variety of reasons. Bind variable peeking and adaptive query optimization are only two of the reasons. Other reasons include changes in statistics, cursor_sharing=similar in Oracle Database 10g, adaptive cursor sharing in Oracle Database 11g Release 1, and cardinality feedback in Oracle Database 11g Release 2. At any given point in time, different users may be using different execution plans for the same SQL query. All this means that a query does not have a single execution plan but an execution plan history. Therefore EXPLAIN PLAN is practically useless; at best it is a teaching tool. One must look at the history of query execution plans for a particular SQL statement, not the output of EXPLAIN PLAN.

Part 15—Oracle-in-the-box: The OTN Developer Day VM is the greatest thing since sliced bread and I use it in all my experiments and demonstrations. It uses Linux as the operating system and is pre-loaded with a fully-functional Oracle 12c database with all the options. I don’t have to go through any installation headaches, my laptop is not messed up, and—best of all—I can take a snapshot of a known good state and revert to it at will. And, of course, you don’t need any licenses because “all [Oracle] software downloads are free, and most come with a Developer License that allows you to use full versions of the products at no charge while developing and prototyping your applications, or for strictly self-educational purposes.

Part 16—Practice Exam I: The deep-left tree clearly illustrates the fundamental strategy by which a SQL query is converted into a series of join operations: first a “driving” table is selected from amongst the tables involved in the query and then the remaining tables are joined to it using the nested-loop method.

Part 17—Practice Exam II: The opposite of the deep-left tree is the deep-right tree. Each qualifying row in the driving table is checked against again in-memory hash-lookup tables.

Part 18—Practice Exam III: Every time the optimizer chooses to join two data sets using the hash-join method, it has to decide which of the two data sets will be the in-memory lookup table. The smaller of the two data sets is the logical choice. The lookup table is always placed on the left hand side of the join node. This can result in zig-zag trees.

Part 19—Practice Exam IV: Instead of evaluating all join possibilities by brute force, the optimizer adopts reasonable strategies in the hope that they will lead to reasonably efficient query plans. The oldest strategy in the book is to pick a “driving” table and then keep picking another table with which to join until all tables have been joined. However, this depends on the ability to perform “view merging” if the query contains views and “subquery unnesting” if the query contains subqueries. If a view cannot be merged or a subquery cannot be nested, then the optimizer is forced to generate a “bushy” tree.

Reviewing Query Execution Plans: You should stop reviewing query execution plans of queries that you have not yet executed. You must first execute your query and then use DBMS_XPLAN.DISPLAY_CURSOR to review the query execution plan that was used by the Oracle engine. Here DISPLAY_CURSOR has some obvious deficiencies. It does not indicate the order in which the lines are actually executed nor the parent of each line. It does not indicate which lines are the most expensive in terms of logical reads and phusical reads. The extra information we need can be obtained from the V$SQL_PLAN_STATISTICS_ALL view.

Part 20—The Final Exam! (Part I): According to Ralph Kimball, there is only one basic evaluation approach—along with a “bail out” option—in a star schema: evaluate the constraints on the dimension tables, prepare a list of composite keys of the fact table, and use a composite index on the fact table to fetch the data from the fact table. When you notice that the list of composite candidate keys is too long, you bail out to a full-table scan in which you look at every fact record without the help of indexes. For the final exam, we will exercise Kimball’s theories using the Sales History (SH) sample schema provided by Oracle. The Sales “fact table” and its five dimensions—Products, Customers, Times, Channels, and Promotions—form the classic star pattern that Kimball refers to.

Part 21—The Final Exam! (Part II): In addition to the basic evaluation approach of Kimball and the bail-out option, Oracle Database offers a lightning-fast option which uses “bitmap” indexes. A bitmap index is composed of multiple “bitmaps” instead of primary key and ROWID pairs like conventional indexes. Each bitmap in a bitmap index is an array of bits (0s and 1s)  and tracks exactly one of the values that occurs in the indexed column. The 1s in a bitmap indicate that the corresponding records in the table have the value being tracked by the bitmap while the 0s indicate that the corresponding records do not have that value. Bitmaps are highly effective in solving “COLUMN IN (LIST)” and “COLUMN_1 IN (LIST_1) AND COLUMN_2 IN (LIST_2).”

Part 22—The Final Exam! (Part III We Don’t Use Databases; We Don’t Use Indexes): Back in the early days of the relational era, the creator of relational theory, Dr. Edward Codd married relational theory with transactional database management systems (a.k.a. ACID DBMS) and the Relational Database Management System (RDBMS) was born. He authored two influential ComputerWorld articles—“Is your DBMS really relational?” (October 14, 1985) and “Does your DBMS run by the rules?” (October 21, 1985)—that set the direction of the relational movement for the next quarter century. Today, the full declarative power of “data base sublanguages” (the term used by Dr. Codd) such as Structured Query Language (SQL) is only available within the confines of a transactional database management system. However “external tables” are the half-way house between jail and freedom.

Categories: DBA, EXPLAIN PLAN, Oracle, SQL

The Hitchhiker’s Guide to the EXPLAIN PLAN: The story so far (Part 1–10)

June 28, 2014 2 comments

On the Toad World site, I’m writing a series of blog posts and articles on the subject of EXPLAIN PLAN. I’m using EXPLAIN PLAN as a motif to teach not just SQL tuning but also relational theory, logical database design, and physical database design. In a year’s time, I hope to have enough material for a self-published book.

Part 1—DON’T PANIC: Even experienced application developers may not understand EXPLAIN PLAN output. As the great Renaissance artist Leonardo da Vinci said in his discourse on painting: “Those who are in love with practice without science are like the sailor who gets into a ship without rudder or compass, who is never certain where he is going. Practice must always be built on sound theory … The painter who copies by practice and judgement of eye, without rules, is like a mirror which imitates within itself all the things placed before it without any understanding of them.”

Part 2—A Long Time Ago: When magnetic disk drives first became a reality in the 1960s, a software engineer at General Electric, Charles Bachman, invented the first database management system (DBMS). He conceived of the application developer as a navigator, navigating through records of different types using indexes and pointers. For his tremendous accomplishment, he received the ACM Turing Award, the equivalent of the Nobel prize in the computer field. We call his system the “network model” because of its emphasis on the pathways between individual records.

Part 3—The Impossible Dream: To make application development easier, an IBM researcher named Edgar Codd suggested that application developers should not have to concern themselves with indexes and pointers. Instead, they should use non-procedural languages and leave the choice of access paths to the DBMS. Dr. Codd used the mathematical term “relation” (table) for a set of records of a single type and called his theory the “relational model.” In contrast, Bachman had never once used the words “table” or “relation.” Dr. Codd also received the ACM Turing Award.

Part 4—Secret Sauce: The secret sauces of the relation model are “relational algebra” and “relational calculus.” Relational algebra is a collection of operations—such as selection, projection, union, difference, and join—that can be used to produce new tables from old while relational calculus is an English-like non-procedural language for specifying the characteristics of the desired information. A relational calculus expression has to be converted into an equivalent sequence of relational algebra operations (a query execution plan) but this is the job of the DBMS not the application developer.

Introduction to Relational Algebra and Relational Calculus: Just as you can combine numbers using the operations of addition, subtraction, multiplication, and division, you can combine tables using operations like “selection,” “projection,” “union,” “difference,” and “join.” However, Codd was of the opinion that “Requesting data by its properties is far more natural than devising a particular algorithm or sequence of operations for its retrieval. Thus, a calculus-oriented language provides a good target language for a more user-oriented source language.” With the exception of the union operation, the original version of SQL was based on relational calculus though, over time, other elements of relational algebra like difference (minus), intersection, and outer join crept in.

Equivalence of Relational Algebra and Relational Calculus: Dr. Codd showed how to systematically convert a relational calculus expression into an equivalent relational algebra expression. A given collection of relational algebra operations is “complete” if the operations in the collection can be used to translate all relational calculus expressions into equivalent algebra expressions.

Part 5—SQL Sucks: In practice, we don’t use relational algebra or relational calculus but an English-like query language called SQL. As with relational calculus expressions, a SQL statement must be converted into an equivalent sequence of relational algebra operations (a query execution plan) by the DBMS. SQL is a heavily redundant language offering multiple ways of posing the same query. Unfortunately, and for no fault of the application developer, semantically equivalent but syntactically different SQL statements typically end up with different execution plans that are not equally efficient.

Part 6—Trees Rule: The DBMS converts your SQL statement into an equivalent sequence of relational algebra operations (the query execution plan). EXPLAIN PLAN output is simply a listing of that query execution plan. The Oracle documentation incorrectly states that “The execution order in EXPLAIN PLAN output begins with the line that is the furthest indented to the right.” In reality, the EXPLAIN PLAN is a “tree” structure.

Part 7—Don’t pre-order your EXPLAIN PLAN: An EXPLAIN PLAN is a “tree” structure corresponding to a relational algebra expression. It is printed in “pre-order” sequence (visit the root of the tree, then traverse each subtree—if any—in pre-order sequence) but should be read in “post-order” sequence (first traverse each subtree—if any—in post-order sequence, then only visit the root of the tree).

Part 8—Tree Menagerie: There are four varieties of EXPLAIN PLAN trees: deep-left trees, deep-right trees, zigzag trees, and bushy trees. Deep left trees are very common because the optimizer typically picks a “driving” table and then joins tables to it one by one. Deep-right trees are useful in data warehouses for joining large fact tables to small dimension tables using hash joins. Hash tables are best constructed from the smaller of the inputs so Oracle will switch the order of the inputs when necessary; this results in zigzag trees. The optimizer does not generally consider bushy trees because they increase the search space beyond its capabilities. However, it is forced to use a bushy tree when faced with an unmergeable view.

Part 9—A Forest Hymn: It is popularly believed that the number of join orderings of N tables is FACTORIAL(N) = N * (N – 1) * (N – 2) * … * 3 * 2 * 1 because FACTORIAL(N) is the number of possible permutations of N objects. FACTORIAL(N) is actually the number of deep-left trees; it omits all the other possibilities. The actual number of join orderings is much larger.

Part 10—Mystery Tree: EXPLAIN PLAN output can sometimes be very confusing. In the EXPLAIN PLAN output that we obtained for the relational calculus solution of our first teaching example “employees who have worked in all accounting job classifications,” some operations seem to be located in the wrong nodes of the tree. The mystery can be solved by referring to the “predicate information” section of the EXPLAIN PLAN output and inserting additional nodes into the tree.

Categories: DBA, EXPLAIN PLAN, Oracle, SQL