Archive
Top 12 reasons why you should NOT attend the next NoCOUG conference
#12 All NoCOUG emails automatically go to your spam folder, including this one. You rely on Outlook for career guidance.
#11 They won’t send a stretch limousine to pick you up and take you back.
#10 They talked up SQL for 25 years but now, they’re all, like, “No SQL.” I mean, really!
#9 You’re wayyyyy too busy working to learn anything new. (A very good problem to have!)
#8 Your head is exploding with knowledge already. (An even better problem to have!)
#7 It’s always the same people there, like Iggy and Kamran. (We totally understand but we can’t tell Iggy and Kamran to stop coming, can we?)
#6 You were there the day NoCOUG webmaster Eric Hutchinson sang the theme song from Cheers “Sometimes you want to go where everybody knows your name.” You don’t ever want to hear Eric sing again. Ever!
#5 The food is just too good. You eat to live, not live to eat. (Good for you!)
#4 You don’t appreciate being bribed with free raffle prizes like iPads, Oracle Press teddy bears, and Oracle Press books. (The world needs more upright and honest people!)
#3 You’ve been going for 25 years already; it’s time for a change. You’re going to AARP meetings now (American Association of Retired Persons).
#2 You’ve finally converted your company to Excel spreadsheets. So much cheaper and easier to use!
But the #1 reason not to attend the NoCOUG conference tomorrow is:
#1 You thought that NoCOUG was the North Carolina Oracle Users Group on the East coast!
We don’t use databases; we don't use indexes
Reblogged from So Many Oracle Manuals, So Little Time:
Whenever salespeople phone Mogens Norgaard, he puts them off by saying that he just doesn’t use the products that they are calling about.
When the office furniture company phones, he says “We don’t use office furniture.” When the newspaper company phones, he says “We don’t read newspapers.” When the girl scouts phone, he probably says “We don’t eat cookies.”
Once he got a phone call from the…
We don’t use databases; we don’t use indexes
Whenever salespeople phone Mogens Norgaard, he puts them off by saying that he just doesn’t use the products that they are calling about.
When the office furniture company phones, he says “We don’t use office furniture.” When the newspaper company phones, he says “We don’t read newspapers.” When the girl scouts phone, he probably says “We don’t eat cookies.”
Once he got a phone call from the phone company.
You can only imagine how that conversation went. Read the whole story at http://wedonotuse.com/stories-and-answers.aspx.
I wonder what Mogens would say if a database vendor phoned. I can imagine him saying “We don’t use databases. We don’t use indexes. We store all our data in compressed text files. Each compressed text file contains one year of data for one location. There is a separate subdirectory for each year. We have a terabyte of data going back to 1901 so we currently have 113 subdirectories. The performance is just fine, thank you.”
On second thoughts, that’s just too far-fetched.
You see, back in the early days of the relational era, the creator of relational theory, Dr. Edward Codd married relational theory with transactional database management systems (a.k.a. ACID DBMS) and the Relational Database Management System (RDBMS) was born. He authored two influential ComputerWorld articles—“Is your DBMS really relational?” (October 14, 1985) and “Does your DBMS run by the rules?” (October 21, 1985)—that set the direction of the relational movement for the next quarter century. Today, the full declarative power of “data base sublanguages” (the term coined by Dr. Codd) such as Structured Query Language (SQL) is only available within the confines of a transactional database management system.
Today, the full declarative power of “data base sublanguages” (the term coined by Dr. Codd) such as Structured Query Language (SQL) is only available within the confines of a transactional database management system.
But it shoudn’t have to be that way.
Consider the running example of “big data” used in Hadoop: The Definitive Guide. The National Climatic Data Center publishes hourly climatic data such as temperature and pressure from more than 10,000 recording stations all over the world. Data from 1901 onwards is available in text files. Each line of text contains the station code, the timestamp, and a number of climatic readings. The format is documented at ftp://ftp.ncdc.noaa.gov/pub/data/noaa/ish-format-document.pdf. The files are organized into subdirectories, one subdirectory for each year. Each subdirectory contains one file from each recording station that was in operation during that year. The individual files are compressed using gzip. All the files can be downloaded from ftp://ftp.ncdc.noaa.gov/pub/data/noaa/.
You might have already guessed where I am going with this.
Conceptually the above terabyte-sized data set is a single table. But it should not be necessary to uncompress and load this huge quantity of structured non-transactional data into a transactional database management system in order to query it. The choice of physical representation conserves storage space and is a technical detail that is irrelevant to the logical presentation of the data set as a single table; it is a technical detail that users don’t care about. As Dr. Codd said in the opening sentence of his 1970 paper A Relational Model of Data For Large Shared Data Banks (faithfully reproduced in the 100th issue of the NoCOUG Journal), “future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).”
Users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).
Why shouldn’t we be able to query the above data set using good old SQL?
Well you can do just that with the Oracle query engine and you don’t have to load it into an Oracle database first. You can even take advantage of partitioning and parallelism. You can also write queries that mix and match data from the database and the filesystem.
The following demonstrations were performed using a pre-Built developer VM for Oracle VM VirtualBox. The version of Oracle Database is 11.2.0.2.
SQL*Plus: Release 11.2.0.2.0 Production on Tue Jan 22 23:51:29 2013 Copyright (c) 1982, 2010, Oracle. All rights reserved. Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - Production With the Partitioning, OLAP, Data Mining and Real Application Testing options
In the demonstrations, we only consider the years from 1901 to 1904. Here is the directory structure.
/home/oracle/app/oracle/admin/orcl/dpdump/noaa /home/oracle/app/oracle/admin/orcl/dpdump/noaa/1901 /home/oracle/app/oracle/admin/orcl/dpdump/noaa/1904 /home/oracle/app/oracle/admin/orcl/dpdump/noaa/1902 /home/oracle/app/oracle/admin/orcl/dpdump/noaa/1903
We first need to create “directories” and define an “external table.” The definition of this external table specifies a preprocessing script which is the secret sauce that makes it possible for the query engine to traverse the subdirectories and uncompress the data.
create or replace directory data_pump_dir
as '/home/oracle/app/oracle/admin/orcl/dpdump;
create or replace directory noaa_dir
as '/home/oracle/app/oracle/admin/orcl/dpdump/noaa;
create or replace directory noaa_1901_dir
as '/home/oracle/app/oracle/admin/orcl/dpdump/noaa_1901';
create or replace directory noaa_1902_dir
as '/home/oracle/app/oracle/admin/orcl/dpdump/noaa_1902';
create or replace directory noaa_1903_dir
as '/home/oracle/app/oracle/admin/orcl/dpdump/noaa_1903';
create or replace directory noaa_1904_dir
as '/home/oracle/app/oracle/admin/orcl/dpdump/noaa_1904';
grant all on directory data_pump_dir to public;
grant all on directory noaa_dir to public;
grant all on directory noaa_1901_dir to public;
grant all on directory noaa_1902_dir to public;
grant all on directory noaa_1903_dir to public;
grant all on directory noaa_1904_dir to public;
create table temperatures
(
station_code char(6),
datetime char(12),
temperature char(5)
)
organization external
(
type oracle_loader
default directory data_pump_dir
access parameters
(
records delimited by newline
preprocessor data_pump_dir:'uncompress.sh'
fields
(
station_code position(1:6) char(4),
datetime position(7:18) char(12),
temperature position(19:23) char(5)
)
)
location ('noaa')
);
Here’s the tiny preprocessing script that makes it possible for Oracle to traverse the subdirectories and uncompress the data. It recursively traverses the file system beginning with the location specified by the query engine; that is, the location specified in the table definition. It uncompresses all zipped files it finds and sends the output to the “cut” utility which cuts out only those column positions that we care about and writes what’s left to standard output, not to the filesystem. The table definition specifies its location as data_pump_dir.
#!/bin/sh
/usr/bin/find $1 -name "*.gz" -exec /bin/zcat {} \; | /usr/bin/cut -c5-10,16-27,88-92
All the capabilities of SQL—including analytic functions and pivoting—can now be exploited as shown in the following example. For each month in the year 1901, we list the top three recording stations in terms of average monthly temperature.
select * from
(
select
month,
station_code,
dense_rank() over (partition by month order by average) as rank
from
(
select
substr(datetime,1,4)||'/'||substr(datetime,5,2) as month,
station_code,
avg(temperature) as average
from temperatures
where datetime >= '1901' and datetime < '1902'
group by
substr(datetime,1,4)||'/'||substr(datetime,5,2),
station_code
)
)
pivot(max(station_code) for rank in (1, 2, 3))
order by month;
MONTH 1 2 3
------------------------- ------ ------ ------
1901/01 227070 029600 029720
1901/02 227070 029070 029600
1901/03 227070 029070 029600
1901/04 029070 029500 029810
1901/05 029070 029500 029810
1901/06 029070 029810 029500
1901/07 029070 029500 227070
1901/08 227070 029070 029600
1901/09 029070 227070 029600
1901/10 227070 029600 029070
1901/11 227070 029600 029720
1901/12 227070 029600 029070
It’s an epiphany, that’s what it is.
We can also use “partition views” and take advantage of “partition pruning.” For those who don’t remember, partition views are a really old feature that predates “real” partitioning in Oracle 8.0 and above. Partition views continue to work just fine today, even in Oracle Database 11g Release 2.
Let’s create a separate table definition for each year and then use a partition view to tie the tables together.
create or replace view temperatures_v as
select * from temperatures_1901
where datetime >= '190101010000' and datetime < '190201010000'
union all
select * from temperatures_1902
where datetime >= '190201010000' and datetime < '190301010000'
union all
select * from temperatures_1903
where datetime >= '190301010000' and datetime < '190401010000'
union all
select * from temperatures_1904
where datetime >= '190401010000' and datetime < '190501010000';
create table temperatures_1901
(
station_code char(6),
datetime char(12),
temperature char(5)
)
organization external
(
type oracle_loader
default directory noaa_dir
access parameters
(
records delimited by newline
preprocessor data_pump_dir:'uncompress.sh'
fields
(
station_code position(1:6) char(4),
datetime position(7:18) char(12),
temperature position(19:23) char(5)
)
)
location ('1901')
);
-- the remaining table definitions are not shown for brevity
When we specify only a portion of the temperatures_v view, the query plan confirms that the unneeded branches of the view are filtered out by the query optimizer.
select count(*) from temperatures_v
where datetime >= '1901' and datetime < '1902';
Plan hash value: 907705830
-------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time | A-Rows | A-Time | Buffers | Reads |
-------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 29 (100)| | 1 |00:00:02.59 | 426 | 57 |
| 1 | SORT AGGREGATE | | 1 | 1 | 14 | | | 1 |00:00:02.59 | 426 | 57 |
| 2 | VIEW | TEMPERATURES_V | 1 | 23 | 322 | 29 (0)| 00:00:01 | 6565 |00:00:03.52 | 426 | 57 |
| 3 | UNION-ALL | | 1 | | | | | 6565 |00:00:03.28 | 426 | 57 |
|* 4 | EXTERNAL TABLE ACCESS FULL | TEMPERATURES_1901 | 1 | 20 | 280 | 29 (0)| 00:00:01 | 6565 |00:00:03.00 | 426 | 57 |
|* 5 | FILTER | | 1 | | | | | 0 |00:00:00.01 | 0 | 0 |
|* 6 | EXTERNAL TABLE ACCESS FULL| TEMPERATURES_1902 | 0 | 20 | 280 | 29 (0)| 00:00:01 | 0 |00:00:00.01 | 0 | 0 |
|* 7 | FILTER | | 1 | | | | | 0 |00:00:00.01 | 0 | 0 |
|* 8 | EXTERNAL TABLE ACCESS FULL| TEMPERATURES_1903 | 0 | 20 | 280 | 29 (0)| 00:00:01 | 0 |00:00:00.01 | 0 | 0 |
|* 9 | FILTER | | 1 | | | | | 0 |00:00:00.01 | 0 | 0 |
|* 10 | EXTERNAL TABLE ACCESS FULL| TEMPERATURES_1904 | 0 | 20 | 280 | 29 (0)| 00:00:01 | 0 |00:00:00.01 | 0 | 0 |
-------------------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - filter(("DATETIME">='190101010000' AND "DATETIME"<'1902'))
5 - filter(NULL IS NOT NULL)
6 - filter(("DATETIME">='190201010000' AND "DATETIME"<'1902'))
7 - filter(NULL IS NOT NULL)
8 - filter(("DATETIME">='190301010000' AND "DATETIME"<'1902'))
9 - filter(NULL IS NOT NULL)
10 - filter(("DATETIME">='190401010000' AND "DATETIME"<'1902'))
Finally, let’s check whether query execution can be parallelized. And so it can.
select /*+ parallel(temperatures_v 4) */ count(*) from temperatures_v;
Plan hash value: 2698603534
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time | TQ |IN-OUT| PQ Distrib | A-Rows | A-Time | Buffers | Reads |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 32 (100)| | | | | 1 |00:00:04.97 | 304 | 0 |
| 1 | SORT AGGREGATE | | 1 | 1 | | | | | | | 1 |00:00:04.97 | 304 | 0 |
| 2 | PX COORDINATOR | | 1 | | | | | | | | 4 |00:00:00.54 | 304 | 0 |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 0 | 1 | | | | Q1,00 | P->S | QC (RAND) | 0 |00:00:00.01 | 0 | 0 |
| 4 | SORT AGGREGATE | | 4 | 1 | | | | Q1,00 | PCWP | | 4 |00:00:04.58 | 561 | 8 |
| 5 | VIEW | TEMPERATURES_V | 4 | 80 | | 32 (0)| 00:00:01 | Q1,00 | PCWP | | 26266 |00:00:04.11 | 561 | 8 |
| 6 | UNION-ALL | | 4 | | | | | Q1,00 | PCWP | | 26266 |00:00:03.12 | 561 | 8 |
| 7 | PX BLOCK ITERATOR | | 4 | 20 | 280 | 8 (0)| 00:00:01 | Q1,00 | PCWC | | 6565 |00:00:02.54 | 264 | 8 |
|* 8 | EXTERNAL TABLE ACCESS FULL| TEMPERATURES_1901 | 1 | 20 | 280 | 8 (0)| 00:00:01 | Q1,00 | PCWP | | 6565 |00:00:02.24 | 264 | 8 |
| 9 | PX BLOCK ITERATOR | | 4 | 20 | 280 | 8 (0)| 00:00:01 | Q1,00 | PCWC | | 6565 |00:00:01.86 | 99 | 0 |
|* 10 | EXTERNAL TABLE ACCESS FULL| TEMPERATURES_1902 | 1 | 20 | 280 | 8 (0)| 00:00:01 | Q1,00 | PCWP | | 6565 |00:00:01.74 | 99 | 0 |
| 11 | PX BLOCK ITERATOR | | 4 | 20 | 280 | 8 (0)| 00:00:01 | Q1,00 | PCWC | | 6554 |00:00:01.84 | 99 | 0 |
|* 12 | EXTERNAL TABLE ACCESS FULL| TEMPERATURES_1903 | 1 | 20 | 280 | 8 (0)| 00:00:01 | Q1,00 | PCWP | | 6554 |00:00:01.73 | 99 | 0 |
| 13 | PX BLOCK ITERATOR | | 4 | 20 | 280 | 8 (0)| 00:00:01 | Q1,00 | PCWC | | 6582 |00:00:00.87 | 99 | 0 |
|* 14 | EXTERNAL TABLE ACCESS FULL| TEMPERATURES_1904 | 1 | 20 | 280 | 8 (0)| 00:00:01 | Q1,00 | PCWP | | 6582 |00:00:00.75 | 99 | 0 |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
8 - filter(("DATETIME">='190101010000' AND "DATETIME"<'190201010000'))
10 - filter(("DATETIME">='190201010000' AND "DATETIME"<'190301010000'))
12 - filter(("DATETIME">='190301010000' AND "DATETIME"<'190401010000'))
14 - filter(("DATETIME">='190401010000' AND "DATETIME"<'190501010000'))
I predict that the time is soon coming when the marriage of relational theory and transactional database management systems will be dissolved. We will be free to store structured non-transactional data outside a transactional database management system while continuing to exploit the entire universe of indexing, partitioning, and clustering techniques as well as the full power of relational languages, not only SQL.
Over to you, Mogens.
The time is soon coming when the marriage of relational theory and transactional database management systems will be dissolved. We will be free to store structured non-transactional data outside a transactional database management system while continuing to exploit the entire universe of indexing, partitioning, and clustering techniques as well as the full power of relational languages, not only SQL.





Recent Comments