On the tenth day of Christmas, my true love gave to me
Ten lords a-leaping.
The topic of Big Data is often brought up in NoSQL discussions so let’s give it a nod. In 1998, Sergey Brin and Larry Page invented the PageRank algorithm for ranking web pages (The Anatomy of a Large-Scale Hypertextual Web Search Engine by Brin and Page) and founded Google. The PageRank algorithm required very large matrix-vector multiplications (Mining of Massive Datasets Ch. 5 by Rajaraman and Ullman) so the MapReduce technique was invented to handle such large computations (MapReduce: Simplified Data Processing on Large Clusters). Smart people then realized that the MapReduce technique could be used for other classes of problems and an open-source project called Hadoop was created to popularize the MapReduce technique (The history of Hadoop: From 4 nodes to the future of data). Other smart people realized that MapReduce could handle the operations of relational algebra such as join, anti-join, semi-join, union, difference, and intersection (Mining of Massive Datasets Ch. 2 by Rajaraman and Ullman) and began looking at the possibility of processing large volumes of business data (a.k.a. “Big Data”) better and cheaper than mainstream database management systems. Initially programmers had to write Java code for the “mappers” and “reducers” used by MapReduce. However, smart people soon realized that SQL queries could be automatically translated into the necessary Java code and “SQL-on-Hadoop” was born. Big Data thus became about processing large volumes of business data with SQL but better and cheaper than mainstream database management systems. However, the smart people have now realized that MapReduce is not the best solution for low-latency queries (Facebook open sources its SQL-on-Hadoop engine, and the web rejoices). Big Data has finally become about processing large volumes of business data with SQL but better and cheaper than mainstream database management systems and with or without MapReduce.
That’s the fast-moving story of Big Data in a nutshell.
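To make the relational-algebra claim concrete, here is a minimal sketch of how a natural join can be phrased as a map step, a shuffle, and a reduce step. It is a single-process simulation in plain Java, not the Hadoop API. The class name, the method names, and the sample relations R(a, b) and S(b, c) are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process simulation of a reduce-side join; this is NOT the Hadoop API.
// All class, method, and relation names are assumptions made for illustration.
class MapReduceJoinSketch {

    // Map phase: tag the tuple with its relation name and emit it under the join key.
    static void map(String relation, String[] tuple, int keyIndex, int valueIndex,
                    Map<String, List<String[]>> shuffle) {
        shuffle.computeIfAbsent(tuple[keyIndex], k -> new ArrayList<>())
               .add(new String[] {relation, tuple[valueIndex]});
    }

    // Reduce phase: for one join key, pair every value from R with every value from S.
    static List<String> reduce(String key, List<String[]> tagged) {
        List<String> rValues = new ArrayList<>();
        List<String> sValues = new ArrayList<>();
        for (String[] t : tagged) {
            if (t[0].equals("R")) {
                rValues.add(t[1]);
            } else {
                sValues.add(t[1]);
            }
        }
        List<String> out = new ArrayList<>();
        for (String a : rValues) {
            for (String c : sValues) {
                out.add(a + "," + key + "," + c);
            }
        }
        return out;
    }

    // Natural join of R(a, b) and S(b, c) on the shared attribute b.
    static List<String> join(List<String[]> r, List<String[]> s) {
        // The sorted map stands in for the shuffle/sort step between map and reduce.
        Map<String, List<String[]>> shuffle = new TreeMap<>();
        for (String[] t : r) {
            map("R", t, 1, 0, shuffle);
        }
        for (String[] t : s) {
            map("S", t, 0, 1, shuffle);
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> e : shuffle.entrySet()) {
            result.addAll(reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> employees = Arrays.asList(
            new String[] {"alice", "d1"},
            new String[] {"bob", "d2"},
            new String[] {"carol", "d3"});
        List<String[]> departments = Arrays.asList(
            new String[] {"d1", "sales"},
            new String[] {"d2", "support"});
        // prints [alice,d1,sales, bob,d2,support]
        System.out.println(join(employees, departments));
    }
}
```

In a real cluster the mappers and reducers would run on different machines and the shuffle would move data across the network; the sketch only shows why grouping by the join key is enough to compute the join.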
On the ninth day of Christmas, my true love gave to me
Nine ladies dancing.
NoSQL databases can be classified into the following categories:
- Key-value stores: The archetype is Amazon Dynamo of which DynamoDB is the commercial successor. Key-value stores basically allow applications to “put” and “get” values but each product has differentiators. For example, DynamoDB supports “tables” (namespaces) while Oracle NoSQL Database offers “major” and “minor” key paths.
- Column-family stores: Column-family stores allow data associated with a single key to be spread over multiple storage nodes. Each storage node only stores a subset of the data associated with the key; hence the name “column-family.” A key is therefore composed of a “row key” and a “column key” in a manner analogous to the major and minor key paths of Oracle NoSQL Database.
- Graph databases: Graph databases are non-relational databases that use graph concepts such as nodes and edges to solve certain classes of problems, for example, finding the shortest route between two towns on a map. The concepts of functional segmentation, sharding, replication, eventual consistency, and schemaless design do not apply to graph databases so I will not discuss graph databases.
NoSQL products are numerous and rapidly evolving. There is a crying need for a continuously updated encyclopedia of NoSQL products but none exists. There is a crying need for an independent benchmarking organization but none exists. My best advice is to do a proof of concept (POC) as well as a PSR (performance, scalability, and reliability) test before committing to a NoSQL product. Back in the day, in 1985 to be precise, Dr. Codd had words of advice for those who were debating between the new relational products and the established pre-relational products of his day. The advice is as solid today as it was in Dr. Codd’s day.
“Any buyer confronted with the decision of which DBMS to acquire should weigh three factors heavily.
The first factor is the buyer’s performance requirements, often expressed in terms of the number of transactions that must be executed per second. The average complexity of each transaction is also an important consideration. Only if the performance requirements are extremely severe should buyers rule out present relational DBMS products on this basis. Even then buyers should design performance tests of their own, rather than rely on vendor-designed tests or vendor-declared strategies. [emphasis added]
The second factor is reduced costs for developing new databases and new application programs …
The third factor is protecting future investments in application programs by acquiring a DBMS with a solid theoretical foundation …
In every case, a relational DBMS wins on factors two and three. In many cases, it can win on factor one also—in spite of all the myths about performance.”
—An Evaluation Scheme for Database Management Systems that are claimed to be Relational
On the eighth day of Christmas, my true love gave to me
Eight maids a-milking.
In May 2011, Oracle Corporation published a scathing indictment of NoSQL, the last words being “Go for the tried and true path. Don’t be risking your data on NoSQL databases.” Just a few months later, however, in September of that year, Oracle Corporation released Oracle NoSQL Database. Oracle removed the NoSQL criticism from its website, but since information published on the internet is immortal, archived copies can easily be found if you know what you are looking for. In the white paper that accompanied the release of Oracle NoSQL Database, Oracle Corporation claimed that the demands of certain applications could not be met by mainstream database management systems:
“The Oracle NoSQL Database, with its “No Single Point of Failure” architecture, is the right solution when data access is “simple” in nature and application demands exceed the volume or latency capability of traditional data management solutions. For example, click-stream data from high volume web sites, high-throughput event processing and social networking communications all represent application domains that produce extraordinary volumes of simple keyed data. Monitoring online retail behavior, accessing customer profiles, pulling up appropriate customer ads and storing and forwarding real-time communication are examples of domains requiring the ultimate in low-latency access. Highly distributed applications such as real-time sensor aggregation and scalable authentication also represent domains well-suited to Oracle NoSQL Database.”
Oracle NoSQL Database has two features that distinguish it from other key-value stores.
- A key is the concatenation of a “major key path” and a “minor key path.” All records with the same “major key path” will be colocated on the same storage node.
- Oracle NoSQL provides transactional support for modifying multiple records with the same major key path.
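The two features above can be pictured with a toy model. The sketch below is not the Oracle NoSQL Database API. It is a hypothetical in-memory store in which the class name, the hash-based placement rule, and the node count are my own assumptions. It shows only the idea that the major key path alone decides where a record lives, so a multi-record update under one major key path touches a single node.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of major/minor key paths; this is NOT the Oracle NoSQL Database API.
// The class, the hash-based placement, and the node count are illustrative assumptions.
class MajorMinorKeySketch {
    // Each "storage node" maps a major key path to its minor-key records.
    private final List<Map<String, Map<String, String>>> nodes = new ArrayList<>();

    MajorMinorKeySketch(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new TreeMap<>());
        }
    }

    // Only the major key path picks the node, so records that share it are colocated.
    int nodeFor(String majorPath) {
        return Math.floorMod(majorPath.hashCode(), nodes.size());
    }

    void put(String majorPath, String minorPath, String value) {
        nodes.get(nodeFor(majorPath))
             .computeIfAbsent(majorPath, k -> new TreeMap<>())
             .put(minorPath, value);
    }

    String get(String majorPath, String minorPath) {
        return nodes.get(nodeFor(majorPath))
                    .getOrDefault(majorPath, Collections.emptyMap())
                    .get(minorPath);
    }

    // Writes several minor keys under one major key path. Every write lands on the
    // same node, which is what lets a real store apply such a batch as one local
    // transaction; this toy version does not implement rollback or locking.
    void putAll(String majorPath, Map<String, String> minorToValue) {
        Map<String, String> records = nodes.get(nodeFor(majorPath))
                                           .computeIfAbsent(majorPath, k -> new TreeMap<>());
        records.putAll(minorToValue);
    }

    public static void main(String[] args) {
        MajorMinorKeySketch store = new MajorMinorKeySketch(3);
        Map<String, String> profile = new TreeMap<>();
        profile.put("email", "jane@example.com");
        profile.put("phone", "555-0100");
        store.putAll("users/jane", profile);
        System.out.println(store.get("users/jane", "email"));
    }
}
```

The design point is that colocation comes first and transactions follow from it: because no second node is involved, no distributed coordination is needed for the batch.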
Here are some resources to get you started with Oracle NoSQL Database:
- The white paper on Oracle NoSQL Database v2.0, an updated version of the original September 2011 paper.
- Download the community edition of Oracle NoSQL Database v2.0 from http://download.oracle.com/otn-pub/otn_software/nosql-database/kv-ce-2.0.26.zip. The prerequisite is JDK 1.6 or higher which you can download from http://www.oracle.com/technetwork/java/javase/downloads/index.html.
- Installation instructions and a five-minute quickstart for Oracle NoSQL Database are at http://docs.oracle.com/cd/NOSQL/html/quickstart.html. It includes a “Hello World” teaching example illustrating the “put” and “get” function calls which are the basic operations in key-value stores.
    final String keyString = "Hello";
    final String valueString = "Big Data World!";
    store.put(Key.createKey(keyString), Value.createValue(valueString.getBytes()));
    final ValueVersion valueVersion = store.get(Key.createKey(keyString));
    System.out.println(keyString + " " + new String(valueVersion.getValue().getValue()));
- Oracle NoSQL Database and Oracle Relational Database – A Perfect Fit, a presentation by Dave Rubin, Director of NoSQL Database development at Oracle Corporation.
- Data Management in Oracle NoSQL Database, a presentation by Anuj Sahni, Principal Product Manager at Oracle Corporation. Also a hands-on database administration workshop.
- The Oracle NoSQL Database resource page on the Oracle Corporation website.