The Twelve Days of NoSQL: Day Seven: Schemaless Design
On the seventh day of Christmas, my true love gave to me
Seven swans a-swimming.
As we discussed on Day One, NoSQL consists of “disruptive innovations” that are gaining steam and moving upmarket. So far, we have discussed functional segmentation (the pivotal innovation), sharding, asynchronous replication, eventual consistency (resulting from lack of distributed transactions across functional segments and from asynchronous replication), and blobs.
The final innovation of the NoSQL camp is “schemaless design.” In database management systems of the NoSQL kind, data is stored in “blobs” and documents the database management system does not police their structure. In mainstream database management systems on the other hand, doctrinal purity requires that the schema be designed before data is inserted. Let’s do a thought experiment.
Suppose that we don’t have a schema and let’s suppose that the following facts are known.
- Iggy Fernandez is an employee with EMPLOYEE_ID=1 and SALARY=$1000.
- Mogens Norgaard is a commissioned employee with EMPLOYEE_ID=2, SALARY=€1000, and COMMISSION_PCT=25.
- Morten Egan is a commissioned employee with EMPLOYEE_ID=3, SALARY=€1000, and unknown COMMISSION_PCT.
Could we ask the following questions and expect to receive correct answers?
- Question: What is the salary of Iggy Fernandez?
- Correct answer: $1000.
- Question: What is the commission percentage of Iggy Fernandez?
- Correct answer: Invalid question.
- Question: What is the commission percentage of Mogens Norgaard?
- Correct answer: 25%
- Question: What is the commission percentage of Morten Egan?
- Correct answer: Unknown.
If we humans can process the above data and correctly answer the above questions, then surely we can program computers to do so.
The above data could be modeled with the following three relations. It is certainly disruptive to suggest that this be done on the fly by the database management system but not outside the realm of possibility.
EMPLOYEES EMPLOYEE_ID NOT NULL NUMBER(6) EMPLOYEE_NAME VARCHAR2(128) UNCOMMISSIONED_EMPLOYEES EMPLOYEE_ID NOT NULL NUMBER(6) SALARY NUMBER(8,2) COMMISSIONED_EMPLOYEES EMPLOYEE_ID NOT NULL NUMBER(6) SALARY NUMBER(8,2) COMMISSION_PCT NUMBER(2,2)
A NoSQL company called Hadapt has already stepped forward with such a feature:
“While it is true that SQL requires a schema, it is entirely untrue that the user has to define this schema in advance before query processing. There are many data sets out there, including JSON, XML, and generic key-value data sets that are self-describing — each value is associated with some key that describes what entity attribute this value is associated with [emphasis added]. If these data sets are stored in Hadoop, there is no reason why Hadoop cannot automatically generate a virtual schema against which SQL queries can be issued. And if this is true, users should not be forced to define a schema before using a SQL-on-Hadoop solution — they should be able to effortlessly issue SQL against a schema that was automatically generated for them when data was loaded into Hadoop.
Thanks to the hard work of many people at Hadapt from several different groups, including the science team who developed an initial design of the feature, the engineering team who continued to refine the design and integrate it into Hadapt’s SQL-on-Hadoop solution, and the customer solutions team who worked with early customers to test and collect feedback on the functionality of this feature, this feature is now available in Hadapt.” (http://hadapt.com/blog/2013/10/28/all-sql-on-hadoop-solutions-are-missing-the-point-of-hadoop/)
This is not really new ground. Oracle Database provides the ability to convert XML documents into relational tables (http://docs.oracle.com/cd/E11882_01/appdev.112/e23094/xdb01int.htm#ADXDB0120) though it ought to be possible to view XML data as tables while physically storing it in XML format in order to benefit certain use cases. It should also be possible to redundantly store data in both XML and relational formats in order to benefit other use cases.
In “Extending the Database Relational Model to Capture More Meaning,” Dr. Codd explains how a “formatted database” is created from a collection of facts:
“Suppose we think of a database initially as a set of formulas in first-order predicate logic. Further, each formula has no free variables and is in as atomic a form as possible (e.g, A & B would be replaced by the component formulas A, B). Now suppose that most of the formulas are simple assertions of the form Pab…z (where P is a predicate and a, b, … , z are constants), and that the number of distinct predicates in the database is few compared with the number of simple assertions. Such a database is usually called formatted, because the major part of it lends itself to rather regular structuring. One obvious way is to factor out the predicate common to a set of simple assertions and then treat the set as an instance of an n-ary relation and the predicate as the name of the relation.”
In other words, a collection of facts can always be organized into relations if necessary.