The Twelve Days of NoSQL: Day Eleven: Mistakes of the relational camp
On the eleventh day of Christmas, my true love gave to me
Eleven pipers piping.
Over a lifespan of four and a half decades, the relational camp made a series of strategic mistakes that made NoSQL possible. The mistakes started very early. The biggest mistake is enshrined in the first sentence of the first paper on relational theory by none other than its inventor, Dr. Edgar Codd: “Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).” (A Relational Model of Data for Large Shared Data Banks)
How likely is it that application developers will develop scalable high-performance applications if they are shielded from the internal representation of data? SQL was never intended for serious application development. As explained by the creators of SQL in their 1974 paper, there is “a large class of users who, while they are not computer specialists, would be willing to learn to interact with a computer in a reasonably high-level, non-procedural query language. Examples of such users are accountants, engineers, architects, and urban planners [emphasis added]. It is for this class of users that SEQUEL is intended. For this reason, [SQL] emphasizes simple data structures and operations [emphasis added].” (http://faculty.cs.tamu.edu/yurttas/PL/DBL/docs/sequel-1974.pdf).
A case in point: If you were the manager of a bookstore, how would you stock the shelves? Would you stand at the door and fling books onto any shelf that had some free space, perhaps recording their locations in a notebook for future reference. Of course not! And would you scatter related books all over the bookstore? Of course not! Then why do we store rows of data in random fashion? The default Oracle table storage structure is the unorganized heap and yet it is chosen 99.9% of the time.
Productivity and ease-of-use were the goals of the relational model, not performance. In Normalized Data Base Structure: A Brief Tutorial (1971), Dr. Codd said “What is less understandable is the trend toward more and more complexity in the data structures with which application programmers and terminal users directly interact. Surely, in the choice of logical data structures that a system is to support, there is one consideration of absolutely paramount importance – and that is the convenience of the majority of users. … To make formatted data bases readily accessible to users (especially casual users) who have little or no training in programming we must provide the simplest possible data structures and almost natural language. … What could be a simpler, more universally needed, and more universally understood data structure than a table?”
Dr. Codd emphasized the productivity benefits of the relational model in his acceptance speech for the 1981 Turing Award: “It is well known that the growth in demands from end users for new applications is outstripping the capability of data processing departments to implement the corresponding application programs. There are two complementary approaches to attacking this problem (and both approaches are needed): one is to put end users into direct touch with the information stored in computers; the other is to increase the productivity of data processing professionals in the development of application programs. It is less well known that a single technology, relational database management, provides a practical foundation to both approaches.”
The emphasis on productivity and ease of use at the expense of performance was the biggest mistake of the relational camp.
Mistake #2: Forbidding nested relations
Dr. Codd made a second serious mistake in his 1970 paper by forbidding nested relations in the interests of productivity and ease-of-use. He incorrectly argued (in an unpublished version of that paper) that nested relations would mean that “The second-order predicate calculus (rather than first-order) is needed because the domains on which relations are defined may themselves have relations as elements.” Not anticipating markup languages like XML, he also argued that “The simplicity of the array representation which becomes feasible when all relations are cast in normal form is not only an advantage for storage purposes but also for communication of bulk data between systems which use widely different representations of the data.” This caused the detractors of the relational model to observe that “Using tables to store objects is like driving your car home and then disassembling it to put it in the garage. It can be assembled again in the morning, but one eventually asks whether this is the most efficient way to park a car.” (incorrectly attributed to Esther Dyson, the editor of Release 1.0).
Even though he made a serious mistake by forbidding nested relations, Dr. Codd never said that
- data should be stored in normalized form only;
- each stored table should occupy a separate storage bucket;
- a single data block should only contain data from a single table;
- data should be stored in row-major order; and
- stored tables have only one storage representation each.
Oracle Database has limited support for nested tables but they are not stored inline, they are “collections” not real tables, and you cannot define primary-key or foreign-key constraints on them.
Mistake #3: Favoring relational calculus over relational algebra
At the heart of the relational model are the operations of the “relational algebra”: equi-join, theta-join, outer-join, semi-join, anti-join, union, difference, intersection, etc. In the interests of productivity and ease-of-use however, Dr. Codd favored “relational calculus” over relational algebra. In Relational Completeness of Data Base Sublanguages, he proved the equivalence of relational calculus and relational algebra. In the interests of productivity and ease-of-use, Codd then made the decision for you and me: “Clearly, the majority of users should not have to learn either the relational calculus or algebra in order to interact with data bases. However, requesting data by its properties is far more natural than devising a particular algorithm or sequence of operations for its retrieval. Thus, a calculus-oriented language provides a good target language for a more user-oriented source language.” SQL was therefore based on relational calculus instead of relational algebra and it became the job of the database management system to convert SQL statements into equivalent sequences of relational algebra operations. This is not a trivial problem. It is an extremely hard problem. But the relational camp is adamant that the optimizer always knows best (it doesn’t).
Mistake #4: Equating the normalized set with the stored set
Equating the normalized set with the stored set ensures that mainstream database management systems will be permanently handicapped in the performance race. For example, a nested relation can be normalized into “first normal form” (no nested relations). As another example, if B and C are colunm-subsets of A and A = B EQUIJOIN C, then we can perform a “lossless decomposition” of A into B and C. However, the relational model does not require that each normalized relation be mapped to a separate bucket of physical storage. Dr. Codd differentiated between the “stored” set, the “named” set, and the “expressible” set but mainstream database management systems conflate the stored set with the normalized set. Mainstream database management systems only allow integrity constraints to be specified on the stored set and only permit DML operations on the stored set. If B and C are colunm-subsets of A and A = B EQUIJOIN C, then it ought to be possible to store A only while defining primary and foreign key constraints on B and C and allowing DML operations on B and C but this is not permitted by mainstream database management systems.
Mistake #5: Incomplete implementation
Strange as it may sound, the relational model has not yet been fully implemented by mainstream database management systems. You’ve probably heard of Codd’s “twelve rules” published in a 1985 Computerworld article. More than a quarter-century later, mainstream database management systems still do not follow all the twelve rules. DML operations are still not permitted on views (Rule 6 and Rule 7). Arbitrary integrity constraints still cannot be declared and enforced (Rule 10). Integrity constraints still cannot span separate databases (Rule 11). If would be hard to abandon a database management systems that had these capabilities. It is easier to abandon a database management system that does not have these capabilities.
Mistake #6: The marriage of relational and transactional
ACID transactions nothing to do with the relational model per se although relational theory does come in very handy in defining consistency constraints. (See What’s so sacred about relational anyway?) Pre-relational database management systems such as IMS provided ACID guarantees and so did post-relational object-oriented database management systems. We should be able to store non-transactional data outside a transactional database management system while continuing to exploit the entire spectrum of indexing, partitioning, and clustering techniques. See We don’t use databases; we don’t use indexes.