
Archive for the ‘Hadoop’ Category

The Twelve Days of NoSQL: Day Ten: Big Data in a Nutshell

January 4, 2014

On the tenth day of Christmas, my true love gave to me
Ten lords a-leaping.

(Yesterday: NoSQL Taxonomy) (Tomorrow: Mistakes of the relational camp)

The topic of Big Data often comes up in NoSQL discussions, so let’s give it a nod. In 1998, Sergey Brin and Larry Page invented the PageRank algorithm for ranking web pages (“The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Brin and Page) and founded Google. The PageRank algorithm requires very large matrix-vector multiplications (Mining of Massive Datasets Ch. 5 by Rajaraman and Ullman), so the MapReduce technique was invented to handle such large computations (“MapReduce: Simplified Data Processing on Large Clusters”).

Smart people then realized that the MapReduce technique could be used for other classes of problems, and an open-source project called Hadoop was created to popularize it (“The history of Hadoop: From 4 nodes to the future of data”). Other smart people realized that MapReduce could handle the operations of relational algebra, such as join, anti-join, semi-join, union, difference, and intersection (Mining of Massive Datasets Ch. 2 by Rajaraman and Ullman), and began looking at the possibility of processing large volumes of business data (a.k.a. “Big Data”) better and cheaper than mainstream database management systems could. Initially, programmers had to write Java code for the “mappers” and “reducers” used by MapReduce, but smart people soon realized that SQL queries could be automatically translated into the necessary Java code, and “SQL-on-Hadoop” was born. Big Data thus became about processing large volumes of business data with SQL, better and cheaper than mainstream database management systems. However, the smart people have now realized that MapReduce is not the best solution for low-latency queries (“Facebook open sources its SQL-on-Hadoop engine, and the web rejoices”), so Big Data has finally become about processing large volumes of business data with SQL, better and cheaper than mainstream database management systems, with or without MapReduce.
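To make “SQL-on-Hadoop” concrete, here is a minimal sketch of the canonical word-count computation written as a Hive-style SQL query instead of hand-written mapper and reducer code; the DOCUMENTS table and its LINE column are hypothetical stand-ins for files in HDFS.

  -- A hedged sketch in the Hive dialect of SQL. The engine translates
  -- this query into map tasks (split each line into words) and reduce
  -- tasks (count the occurrences of each word) behind the scenes.
  SELECT word, COUNT(*) AS occurrences
  FROM (SELECT explode(split(line, '\\s+')) AS word
        FROM documents) w
  GROUP BY word
  ORDER BY occurrences DESC;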

That’s the fast-moving story of Big Data in a nutshell.

Also see: The Twelve Days of SQL: Day Ten: Sometimes the optimizer needs a hint

The Twelve Days of NoSQL: Day Seven: Schemaless Design

December 31, 2013

On the seventh day of Christmas, my true love gave to me
Seven swans a-swimming.

(Yesterday: The False Premise of NoSQL) (Tomorrow: Oracle NoSQL Database)

As we discussed on Day One, NoSQL consists of “disruptive innovations” that are gaining steam and moving upmarket. So far, we have discussed functional segmentation (the pivotal innovation), sharding, asynchronous replication, eventual consistency (resulting from lack of distributed transactions across functional segments and from asynchronous replication), and blobs.

The final innovation of the NoSQL camp is “schemaless design.” In database management systems of the NoSQL kind, data is stored in “blobs” or documents, and the database management system does not police their structure. In mainstream database management systems, on the other hand, doctrinal purity requires that the schema be designed before data is inserted. Let’s do a thought experiment.

Suppose that we don’t have a schema and that the following facts are known.

  • Iggy Fernandez is an employee with EMPLOYEE_ID=1 and SALARY=$1000.
  • Mogens Norgaard is a commissioned employee with EMPLOYEE_ID=2, SALARY=€1000, and COMMISSION_PCT=25.
  • Morten Egan is a commissioned employee with EMPLOYEE_ID=3, SALARY=€1000, and unknown COMMISSION_PCT.

Could we ask the following questions and expect to receive correct answers?

  • Question: What is the salary of Iggy Fernandez?
  • Correct answer: $1000.
  • Question: What is the commission percentage of Iggy Fernandez?
  • Correct answer: Invalid question.
  • Question: What is the commission percentage of Mogens Norgaard?
  • Correct answer: 25%.
  • Question: What is the commission percentage of Morten Egan?
  • Correct answer: Unknown.

If we humans can process the above data and correctly answer the above questions, then surely we can program computers to do so.

The above data could be modeled with the following three relations. It is certainly disruptive to suggest that the database management system create such relations on the fly, but it is not outside the realm of possibility.

EMPLOYEES
  EMPLOYEE_ID NOT NULL NUMBER(6)
  EMPLOYEE_NAME VARCHAR2(128)

UNCOMMISSIONED_EMPLOYEES
  EMPLOYEE_ID NOT NULL NUMBER(6)
  SALARY NUMBER(8,2)

COMMISSIONED_EMPLOYEES
  EMPLOYEE_ID NOT NULL NUMBER(6)
  SALARY NUMBER(8,2)
  COMMISSION_PCT NUMBER(2,2)
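As a sketch of how a database management system could answer the questions correctly from these relations, consider the following queries; the distinction between an empty result and a NULL is what separates “invalid question” from “unknown.”

  -- What is the commission percentage of Iggy Fernandez (EMPLOYEE_ID 1)?
  -- He is not a commissioned employee, so the query returns no rows:
  -- the question is invalid.
  SELECT commission_pct FROM commissioned_employees WHERE employee_id = 1;

  -- What is the commission percentage of Morten Egan (EMPLOYEE_ID 3)?
  -- He is a commissioned employee with an unrecorded percentage, so the
  -- query returns one row containing NULL: the answer is unknown.
  SELECT commission_pct FROM commissioned_employees WHERE employee_id = 3;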

A NoSQL company called Hadapt has already stepped forward with such a feature:

“While it is true that SQL requires a schema, it is entirely untrue that the user has to define this schema in advance before query processing. There are many data sets out there, including JSON, XML, and generic key-value data sets that are self-describing — each value is associated with some key that describes what entity attribute this value is associated with [emphasis added]. If these data sets are stored in Hadoop, there is no reason why Hadoop cannot automatically generate a virtual schema against which SQL queries can be issued. And if this is true, users should not be forced to define a schema before using a SQL-on-Hadoop solution — they should be able to effortlessly issue SQL against a schema that was automatically generated for them when data was loaded into Hadoop.

Thanks to the hard work of many people at Hadapt from several different groups, including the science team who developed an initial design of the feature, the engineering team who continued to refine the design and integrate it into Hadapt’s SQL-on-Hadoop solution, and the customer solutions team who worked with early customers to test and collect feedback on the functionality of this feature, this feature is now available in Hadapt.” (http://hadapt.com/blog/2013/10/28/all-sql-on-hadoop-solutions-are-missing-the-point-of-hadoop/)

This is not really new ground. Oracle Database provides the ability to convert XML documents into relational tables (http://docs.oracle.com/cd/E11882_01/appdev.112/e23094/xdb01int.htm#ADXDB0120), though it ought to be possible to view XML data as tables while physically storing it in XML format in order to benefit certain use cases. It should also be possible to redundantly store data in both XML and relational formats in order to benefit other use cases.
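As a rough sketch of the “view XML data as tables” idea, Oracle’s XMLTable function can expose XML documents through a relational projection; the XML_EMPLOYEES table and the document shape below are hypothetical.

  -- A hedged sketch: XML documents stored as XMLType, exposed as rows.
  CREATE TABLE xml_employees (doc XMLTYPE);

  SELECT x.employee_id, x.salary, x.commission_pct
  FROM   xml_employees e,
         XMLTABLE('/employee' PASSING e.doc
                  COLUMNS employee_id    NUMBER(6)   PATH '@id',
                          salary         NUMBER(8,2) PATH 'salary',
                          commission_pct NUMBER(2,2) PATH 'commission_pct') x;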

In “Extending the Database Relational Model to Capture More Meaning,” Dr. Codd explains how a “formatted database” is created from a collection of facts:

“Suppose we think of a database initially as a set of formulas in first-order predicate logic. Further, each formula has no free variables and is in as atomic a form as possible (e.g., A & B would be replaced by the component formulas A, B). Now suppose that most of the formulas are simple assertions of the form Pab…z (where P is a predicate and a, b, … , z are constants), and that the number of distinct predicates in the database is few compared with the number of simple assertions. Such a database is usually called formatted, because the major part of it lends itself to rather regular structuring. One obvious way is to factor out the predicate common to a set of simple assertions and then treat the set as an instance of an n-ary relation and the predicate as the name of the relation.”

In other words, a collection of facts can always be organized into relations if necessary. For example, the simple assertions Salary(1, 1000) and Salary(2, 1000) share the predicate Salary, which can be factored out as the name of a binary relation SALARY containing the tuples (1, 1000) and (2, 1000).

Also see: The Twelve Days of SQL: Day Seven: EXPLAIN PLAN lies

SQL v/s NoSQL: Amazon v/s eBay and the false premise of NoSQL

October 13, 2013

Here is the poll data from the Confio-sponsored webinar “NoSQL and Big Data for the Oracle DBA” and my answers to the questions asked in the chat. The recording of the webinar is now available at http://www.confio.com/webinars/nosql-big-data/. The slide deck is at https://iggyfernandez.files.wordpress.com/2013/10/nosql-and-big-data-for-oracle-dbas-oct-2013.pdf.

Are NoSQL products and technologies being deployed at your organization?
Total responses: 145 of 303 (48%)

  • Yes: 39 (27%)
  • No: 106 (73%)

Are Big Data products and technologies being deployed at your organization?
Total responses: 138 of 303 (46%)

  • Yes: 52 (38%)
  • No: 86 (62%)

Is there any merit to the claim that NoSQL technology beats relational technology in performance, scalability, and availability?
Total responses: 143 of 303 (47%)

  • No merit whatsoever: 5 (3%)
  • Some merit: 38 (27%)
  • A lot of merit: 6 (4%)
  • Don’t have enough information to judge: 94 (66%)

SQL v/s NoSQL

Q. Where would I use a NoSQL database v/s Cloudera Hadoop? (G. B.)

A. You would use a NoSQL database where you are dealing with simple schemas such as in the Amazon examples. You would use Cloudera Hadoop when you want to process large amounts of filesystem data using the parallel capabilities of the Map/Reduce algorithm.

Q. I’m a little confused on the objective of the webinar. Is it to indicate that a NoSQL solution is always unnecessary and one can use Oracle? Or that there are situations where a NoSQL data store is relevant and one can still use Oracle? Or something else? (R. B.)

A. There are certainly situations where NoSQL technologies excel, specifically situations that benefit from sharding and replication. However, the eBay example proves that it is also possible to design a modern e-commerce platform without completely abandoning relational technology. The unstated objective of the presentation was to prove that Amazon missed the opportunity to take relational technology to the next level. Amazon believed that relational technology requires a join penalty but it was wrong. See my post https://iggyfernandez.wordpress.com/2013/07/28/no-to-sql-and-no-to-nosql/.

Q. Are you implying that NoSQL is a solution for everything relational or that it has areas where it excels?  What are those areas for NoSQL vs. Relational? (C. R.)

A. Answered above.

Q. Are you also going to discuss the cost of Oracle v/s NoSQL? (D. T.)

A. Many NoSQL technologies are open-source. However, there is an argument to be made that you get what you pay for. Oracle Database is a feature-rich and mature product.

Q. If I understood your explanations, you mean that Oracle, through the 12c version, is now able to create data or reorganize any existing ones the NoSQL way and then can still use the powerful SQL language which does not require high technical skills? (J. M. A.)

A. Clusters have been available in Oracle Database for an extremely long time. For example, refer to the Oracle7 Server Concepts Manual at http://docs.oracle.com/cd/A57673_01/DOC/server/doc/SCN73/ch5.htm#toc052. Object-relational capabilities were introduced in Oracle Database 8.0.

Q. Is there a benefit for eBay to transfer its data into NoSQL clusters given it is already using Oracle the SQL way? (J. M. A.)

A. The eBay example proves that it is also possible to design a modern e-commerce platform without completely abandoning relational technology.

Clusters

Q. Why do you think people are not using clustered tables? Are there any downsides with it? (A. K. E.)

A. They are not used because most application developers and database administrators have never heard of them, even though Oracle itself uses them for data dictionary tables such as TAB$ and COL$ and uses single-table hash clusters in its TPC-C benchmarks.

Clusters cannot be partitioned, but you could use partition views to emulate partitioning, as in the example at https://iggyfernandez.wordpress.com/2013/01/22/we-dont-use-databases-we-dont-use-indexes/. Note that Oracle used an undocumented patch to partition hash clusters in a recent TPC-C benchmark; see https://iggyfernandez.wordpress.com/2011/05/10/major-new-undocumented-partitioning-feature-in-oracle-database-11g-release-2/.
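Here is a minimal sketch of a partition view; the yearly SALES tables are hypothetical. The CHECK constraints are what allow the optimizer to prune branches of the UNION ALL that cannot satisfy a query predicate.

  CREATE TABLE sales_2012 (
    sale_year NUMBER(4) CHECK (sale_year = 2012),
    amount    NUMBER(8,2)
  );

  CREATE TABLE sales_2013 (
    sale_year NUMBER(4) CHECK (sale_year = 2013),
    amount    NUMBER(8,2)
  );

  CREATE VIEW sales AS
    SELECT * FROM sales_2012
    UNION ALL
    SELECT * FROM sales_2013;

  -- A query for one year touches only the matching branch.
  SELECT SUM(amount) FROM sales WHERE sale_year = 2013;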

One issue with hash clusters is the potential for hash collisions and block chains. In the TPC-C benchmarks, Oracle pre-allocates space for all the expected rows and uses the “HASH IS” clause to prevent hash collisions. An alternative is to use indexed clusters.
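For illustration, here is a hedged sketch of a single-table hash cluster in the spirit of the TPC-C approach; the HASHKEYS and SIZE values are guesses that would need to be derived from the expected row counts.

  -- HASH IS employee_id makes the key itself the hash value, so
  -- distinct keys cannot collide; space is pre-allocated via HASHKEYS.
  CREATE CLUSTER employees_cluster (employee_id NUMBER(6))
    SIZE 256
    SINGLE TABLE
    HASHKEYS 100000
    HASH IS employee_id;

  CREATE TABLE employees (
    employee_id   NUMBER(6) PRIMARY KEY,
    employee_name VARCHAR2(128),
    salary        NUMBER(8,2)
  ) CLUSTER employees_cluster (employee_id);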

Q. What if you have large tables in terms of rows. How will table clusters perform? (T. D.)

A. Clusters cannot be partitioned, but you could use partition views to emulate partitioning, as in the example at https://iggyfernandez.wordpress.com/2013/01/22/we-dont-use-databases-we-dont-use-indexes/. Note that parallel UNION ALL is only available in Oracle Database 12c. Also note that Oracle used an undocumented patch to partition hash clusters in a recent TPC-C benchmark (see https://iggyfernandez.wordpress.com/2011/05/10/major-new-undocumented-partitioning-feature-in-oracle-database-11g-release-2/), so it is not unreasonable to hope that the feature will be implemented someday.

Q. How will table clusters perform for large amounts of data? (M. R.)

A. Answered above.

Q. Is there a multi-block read penalty for blocks that are read from the clustered tables in your example when you want to report across all employees? (K. H.)

A. Yes, there is a penalty. But NoSQL optimizes for a specific use case and we did the same. It is good practice to optimize for the most important use case. In the example, we optimized for the use case of retrieving all data for a single employee.

Q. How about the employee object-relational view? Doesn’t it use join operations in the background at the time of select? (K. N.)

A. Yes, it does. But the query execution plan shows that there is no join penalty. When retrieving one row from the view, the estimated cost is 1 and the actual number of blocks touched is also 1.
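For the record, here is a hedged reconstruction of the idea rather than the exact webinar example: when both tables live in the same hash cluster keyed on EMPLOYEE_ID, the rows for one employee share a block, so the join inside the view adds no extra block visits.

  CREATE CLUSTER employee_data_cluster (employee_id NUMBER(6))
    SIZE 8192
    HASHKEYS 100000;

  CREATE TABLE emp_master (
    employee_id   NUMBER(6) PRIMARY KEY,
    employee_name VARCHAR2(128)
  ) CLUSTER employee_data_cluster (employee_id);

  CREATE TABLE emp_payroll (
    employee_id NUMBER(6) REFERENCES emp_master,
    salary      NUMBER(8,2)
  ) CLUSTER employee_data_cluster (employee_id);

  CREATE VIEW employee_details AS
    SELECT m.employee_id, m.employee_name, p.salary
    FROM   emp_master m JOIN emp_payroll p ON p.employee_id = m.employee_id;

  -- Retrieving one employee through the view touches the single block
  -- that holds that employee's rows from both tables.
  SELECT * FROM employee_details WHERE employee_id = 1;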

Big Data

Q. You mentioned writing Java code [for Hadoop]. How about using C++, Python, etc.? (K. Z.)

A. See http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/.

Q. How does what you present compare to Aster SQL/MapReduce? (L. L.)

A. The detailed explanation is in Aster Data’s paper “SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions” available at http://pdf.aminer.org/000/225/039/a_practical_approach_to_static_node_positioning.pdf. You might also want to read Oracle’s paper “In-Database Map-Reduce” available at http://www.oracle.com/technetwork/database/bi-datawarehousing/twp-indatabase-mapreduce-128831.pdf.

Q. What is the best database to load Twitter and Facebook data and analyze? (G.B.)

A. Facebook cannot provide public access to user posts because of privacy restrictions but Twitter provides public access to its tweet stream. The choice of technology for analyzing the tweet stream depends on the kind of analysis. Last year, Twitter engineers made presentations on the Twitter architecture to students at the University of California at Berkeley. For their projects, the students analyzed the Twitter data using various technologies. The architecture presentations by the Twitter engineers and the project presentations by the students are available at http://blogs.ischool.berkeley.edu/i290-abdt-s12/. The “map of a tweet” is at http://www.scribd.com/doc/30146338/map-of-a-tweet.

Predictions

Q. What skills can an Oracle DBA take to use in Big Data world? (G. B.)

A. In the Big Data world, you’re probably going to need Linux, SQL, and programming skills. You can leverage your previous experience as an administrator or an application developer.

Q. Assuming NoSQL takes over, what do you think will be the roles of Database Administrators? (K. I.)

A. I don’t foresee NoSQL taking over. But I see innovation continuing in the relational and non-relational spaces. I see relational and non-relational systems co-existing. I see some companies winning and some companies losing. I see both the relational and non-relational camps adopting each other’s best ideas. However, the tasks will remain the same; that is, installing, configuring, upgrading, monitoring, tuning, programming, etc.

Q. With Big Data and the Cloud coming into the picture, will roles like Oracle DBA and SQL Server DBA be replaced by Big Data DBA and Cloud DBA roles? (V. V.)

A. They won’t be replaced because Oracle Database and SQL Server are not going away.

Q. If NoSQL people are reinventing SQL in one way or another, then what is the future of SQL? (J. G.)

A. Relational algebra is the right tool for a lot of tasks, so the future of SQL is assured.

SQL v/s NoSQL: Amazon v/s eBay and the false premise of NoSQL

October 9, 2013

Update: The recording is now available at http://www.confio.com/webinars/nosql-big-data/. The slide deck is at https://iggyfernandez.files.wordpress.com/2013/10/nosql-and-big-data-for-oracle-dbas-oct-2013.pdf.

In my webinar for Confio on October 10, I will explain that the deficiencies of relational technology are actually a result of deliberate choices made by the relational movement in its early years. The relational camp needs to revisit these choices if it wants to compete with NoSQL and Big Data technologies in the areas of performance, scalability, and availability. Previous versions of this presentation have been delivered at Great Lakes Oracle Conference, NoCOUG, and OakTableWorld. New material is constantly being added to the presentation based on attendee feedback and additional research. Attendees seem most interested in learning that eBay had the same performance, scalability, and availability requirements as Amazon but stuck with Oracle and SQL.

Register at http://marketo.confio.com/NoSQLBigDataforOracle_RegPage.html.

  1. The origins of NoSQL
    • Amazon’s requirements
    • Amazon’s solution
    • Amazon v/s eBay
  2. The false premise of NoSQL
    • Zeroth normal form
    • The problem with flat tables
    • Eliminating the join penalty
    • Dr. Codd on eventual consistency
  3. The NoSQL landscape
    • Key-value databases
    • Document databases
    • Column-family databases
    • Graph databases
    • Map/Reduce
  4. What makes relational technology so sacred anyway?
  5. Mistakes of the relational camp
    • De-emphasizing physical database design
    • Discarding nested relations
    • Favoring relational calculus over relational algebra
    • Equating the normalized set with the stored set
    • Marrying relational theory to ACID DBMS
    • Ignoring SQL-92 CREATE ASSERTION
  6. Going one better than Hadoop
    • Preprocessor feature of external tables
    • Partition pruning in partition views
    • Parallel UNION ALL in Oracle Database 12c
  7. Bonus slides
    • NewSQL
    • NoSQL Buyer’s Guide

The Mistakes of the Relational Camp: Mistake #1: The de-emphasis of physical database design

It is unlikely that application developers will develop highly performant and scalable applications if they don’t have access to performance metrics. EM Express in Oracle Database 12c has taken a step in the right direction by making it possible to give developers access to some performance metrics. From the interview with ACE Director Kyle Hailey in Oracle Magazine:

Oracle Database 12c introduces an administration console called Oracle Enterprise Manager Database Express, a “light” version of Oracle Enterprise Manager that developers can access through a browser. “Now DBAs can give the developers read-only access to a simplified database management console so they can see the impact of their code,” says Hailey.

This is a big change from the traditional setup, where the effect of code was seen only in the DBA’s world. “The DBA has this privileged access to see what’s happening with the database,” Hailey says. “He can see load go up and see who caused it, but the poor developer writes some code, runs some code, and maybe sees some text on a screen—but there’s no visual impact. The developer doesn’t know what’s going on in the database, and that’s not fair. The DBA comes and complains that the developers are making a mess, and the developer says, ‘How am I supposed to know?’ With [Oracle Enterprise Manager Database Express], developers will be able to see the effect of their code and, if there’s a problem, shut it down before the DBA comes calling.”

Instructions for giving EM Express access to non-administrative users are at http://docs.oracle.com/cd/E16655_01/server.121/e17643/em_manage.htm#BABHCDGA.
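As a hedged sketch, the grants involved look something like this in Oracle Database 12c; the DEV1 account name is hypothetical. The EM_EXPRESS_BASIC role confers read-only access to the console, while EM_EXPRESS_ALL would confer full administrative access.

  -- Give a developer read-only access to EM Express (12c).
  CREATE USER dev1 IDENTIFIED BY "ChangeMe#2014";
  GRANT CREATE SESSION TO dev1;
  GRANT EM_EXPRESS_BASIC TO dev1;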


The Mistakes of the Relational Camp: Mistake #1: The de-emphasis of physical database design

September 9, 2013

See also: No! to SQL and No! to NoSQL

The inventor of the relational model, Dr. Edgar “Ted” Codd, believed that the suppression of physical database design details was the chief advantage of the relational model. He made the case in the very first sentence of the very first paper on the relational model: “Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).” (“A Relational Model of Data for Large Shared Data Banks,” reprinted with permission in the 100th issue of the NoCOUG Journal.)

How likely is it that application developers will develop highly performant and scalable applications if they are shielded from the internal representation of data? The de-emphasis of physical database design was the biggest mistake of the relational camp and provided the opening for NoSQL and Big Data technologies to proliferate.

A case in point is that the language SQL, which is universally used by application developers, was not created with them in mind. As explained by the creators of SQL (originally called SEQUEL) in their 1974 paper, there is “a large class of users who, while they are not computer specialists, would be willing to learn to interact with a computer in a reasonably high-level, non-procedural query language. Examples of such users are accountants, engineers, architects, and urban planners [emphasis added]. It is for this class of users that SEQUEL is intended. For this reason, SEQUEL emphasizes simple data structures and operations [emphasis added].” (http://faculty.cs.tamu.edu/yurttas/PL/DBL/docs/sequel-1974.pdf)

If you were the manager of a bookstore, how would you stock the shelves? Would you stand at the door and fling books onto any shelf that had some free space, perhaps recording their locations in a notebook for future reference? Of course not! And would you scatter related books all over the bookstore? Of course not! Then why do we store rows of data in random fashion? The default Oracle table storage structure is the unorganized heap, and it is chosen 99.9% of the time.

The de-emphasis of physical database design was an epic failure in the long run. Esther Dyson referred to the “join penalty” when she complained that “Using tables to store objects is like driving your car home and then disassembling it to put it in the garage. It can be assembled again in the morning, but one eventually asks whether this is the most efficient way to park a car.” [1]

It doesn’t have to be that way. Oracle Database has always provided a way to cluster rows of data from one or more tables using single-table or multi-table clusters, in hashed or indexed flavors, and thus to completely avoid the join penalty that Esther Dyson complained about. However, judging by their near-zero adoption rate, clusters must be the best-kept secret of Oracle Database, and they have not been emulated by any other DBMS vendor. You can read more about them at https://iggyfernandez.wordpress.com/2013/07/28/no-to-sql-and-no-to-nosql/.
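For completeness, here is a minimal sketch of the indexed flavor of a multi-table cluster, using illustrative DEPT and EMP tables; rows that share a DEPTNO are stored together, so joining a department to its employees revisits the same blocks.

  CREATE CLUSTER dept_cluster (deptno NUMBER(2)) SIZE 1024;
  CREATE INDEX dept_cluster_idx ON CLUSTER dept_cluster;

  CREATE TABLE dept (
    deptno NUMBER(2) PRIMARY KEY,
    dname  VARCHAR2(30)
  ) CLUSTER dept_cluster (deptno);

  CREATE TABLE emp (
    empno  NUMBER(6) PRIMARY KEY,
    ename  VARCHAR2(30),
    deptno NUMBER(2) REFERENCES dept
  ) CLUSTER dept_cluster (deptno);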

It doesn’t have to be that way. But it is.

1. Esther Dyson was the editor of a newsletter called Release 1.0. I’ve been unable to find the statement in the Release 1.0 archives at http://www.sbw.org/release1.0/, so I don’t really know the true source or author of the statement. However, the statement is popularly attributed to Esther Dyson and claimed to have been published in the Release 1.0 newsletter. I found a claim that the statement is in the September 1988 issue, but that didn’t pan out.

See also: No! to SQL and No! to NoSQL

What’s your take on RDBMS and NoSQL?

April 9, 2013

My take is that application developers have belatedly but correctly concluded that an RDBMS is not the best tool for every application. For example, relational algebra, relational calculus, and SQL are not the best tools for graph problems. As another example, weblogs are non-transactional and don’t benefit from the ACID properties of the RDBMS. Amazon created the Dynamo key-value store for a highly specific use case. From the Dynamo white paper: “Customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados. … There are many services on Amazon’s platform that only need primary-key access to a data store. … Simple read and write operations to a data item that is uniquely identified by a key. Data is stored as binary objects (i.e., blobs) identified by unique keys. No operations span multiple data items and there is no need for relational schema. … The operation environment is assumed to be non-hostile and there are no security related requirements such as authentication and authorization.”
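As an aside, the access pattern that the Dynamo paper describes maps onto a trivially simple relational schema; here is a hedged sketch with hypothetical names.

  -- Blobs identified by unique keys; no operation spans multiple items.
  CREATE TABLE shopping_carts (
    cart_key  VARCHAR2(255) PRIMARY KEY,
    cart_blob BLOB
  );

  -- Simple reads and writes of a single data item by primary key:
  --   SELECT cart_blob FROM shopping_carts WHERE cart_key = :k;
  --   UPDATE shopping_carts SET cart_blob = :b WHERE cart_key = :k;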

What’s your take?

P.S. I wasn’t always of this opinion because, until very recently, I had not studied NoSQL technologies. But my favorite quote is the one on consistency from Emerson’s essay on self-reliance: “The other terror that scares us from self-trust is our consistency; a reverence for our past act or word … bring the past for judgment into the thousand-eyed present, and live ever in a new day. …  A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines. … Speak what you think now in hard words, and to-morrow speak what to-morrow thinks in hard words again, though it contradict every thing you said to-day.” (http://www.emersoncentral.com/selfreliance.htm)

However, I don’t give a free pass to the NoSQL camp. I believe that most of the problems that the NoSQL camp is trying to solve, with the sole exception of graph problems, could have been solved within the framework of relational theory. Since the relational camp couldn’t (or wouldn’t) solve the problems, the NoSQL camp came up with their own solutions and threw out the baby (relational theory) along with the bathwater (the perceived deficiencies of relational theory). My topic for the Great Lakes Oracle Conference is therefore “Soul-searching for the relational camp: Why NoSQL and Big Data have momentum.” (https://www.neooug.org/gloc/accepted-presentations.aspx)
