Results of the First International NoCOUG SQL Challenge
The following report was published in the August 2009 issue of the NoCOUG Journal. (View PDF)
The First International NoCOUG SQL Challenge was a great success; nine solutions were found by participants in seven countries and three continents. Alberto Dell’Era wins the contest for his wonderful solution using Discrete Fourier Transforms; the runner-up is André Araujo from Australia, who used binary arithmetic and common table expressions in his solution. The August Order of the Wooden Pretzel (Footnote 1) will be bestowed on Alberto but the real prize is six books of his choice from the Apress catalog. André will receive a prize of six e-books of his choice. Thanks to Chen Shapira for publicizing the event in her blog, Dan Tow for helping to judge the contest, and Apress for donating the books.
An ancient 20-sided die (icosahedron) was discovered in the secret chamber of mystery at Hogwash School of Es-Cue-El. A mysterious symbol was inscribed on each face of the die. The great Wizard of Odds discovered that each symbol represents a number. The great wizard discovered that the die was biased: that is, it was more probable that certain numbers would be displayed than others if the die were used in a game of chance. The great wizard recorded this information in tabular fashion as described below.
The great wizard then invited all practitioners of the ancient arts of Es-Cue-El to create an Es-Cue-El spell that displays the probabilities of obtaining various sums when the die is thrown N times in succession in a game of chance, N being a substitution variable or bind variable. The rules of the competition can be found in the May 2009 issue of the NoCOUG Journal and on the Web at www.nocoug.org/SQLchallenge/FirstSQLchallenge.pdf.
The modus operandi was for each participant to post their solutions on their blog or website; you can find them all with a simple Google search. The first solution—by Laurent Schneider from Switzerland—used the CONNECT BY clause to join the table to itself N times and the SYS_CONNECT_BY_PATH and XMLQUERY functions to perform the necessary additions and multiplications. The number of records generated by CONNECT BY grows exponentially and hurts performance.
The second solution—by Craig Martin from the USA—used the CONNECT BY clause to join the table to itself N times and logarithms to perform the necessary additions and multiplications. The number of records grows exponentially in this case too.
The fourth solution—by Vadim Tropashko from the USA—used recursive common table expressions to generate records. The number of records grows exponentially in this case too. Recursive common table expressions are available in Microsoft SQL Server and DB2 but are not presently available in Oracle 11gR1. Rumor has that they will be available in Oracle Database 11gR2.
The fifth and sixth solutions—by Alberto Dell’Era from Italy—used advanced mathematical techniques such as convolutions, Discrete Fourier Transforms, and Fast Fourier Transforms. The Fast Fourier Transform method is an efficient way of calculating Discrete Fourier Transforms and was implemented using the Model clause.
The eighth solution—by a blogger named Cd-MaN from Romania—used pipelined table functions and recursion. The solution was demonstrated in a Postgres database but can easily be adapted for use in an Oracle Database. The use of recursion means that this is not a pure SQL solution.
Dan Tow authored the following statement on behalf of the judging committee:
To begin, we’d like to congratulate the contestants on finding so many different and clever ways to solve this problem, and on making the problem of picking a winner so difficult for us, personally. Each of the solutions had major advantages to recommend it. We were also very pleasantly surprised at the global extent of the entries, with entries from seven nations and three continents!
The criteria for the judging were stated in advance at www.nocoug.org/SQLchallenge/FirstSQLchallenge.pdf. The main criteria that separated the top scores from the rest, given that they were all quite good as technical solutions from one perspective or another, were the inclusion of commentary and test results, which were minimal or altogether lacking from most entries. (Hey, we understand that you’re busy, so we’re not surprised to see these missing or minimal, but they were important to getting a win here!)
There were elegant solutions using the Model clause, including one by Alberto Dell’Era that implemented a solution using Fast Fourier Transforms that was technically amazing, well-documented and tested, and scaled better than any other solution, with order N * Log(N) scaling, almost linear up to enormous numbers of throws of the dice. However, these used “iterate” loops that we believe violated the contest requirement that “Solutions that use procedural loops to multiply probabilities are not eligible” stated at the top of the “Judges’ Statement.” (We know that there are wonderful, super-efficient ways to solve this problem in procedural code like C, but the point behind that unbendable requirement was to get contestants “thinking in SQL,” doing the job in a set-wise manner, not just finding ways to bend SQL into doing what we’d be better off doing in C, procedurally.) The “procedural loop” component in these solutions was really minimal and easy to miss, even, in a casual examination of the code, but we think we have to stick with the pre-stated rules here and disqualify those solutions, even while we admire them.
Scaling almost as well (at order N * N), and also very well documented and tested, both from the perspective of performance and functionality, was the amazing Fourier-Transform-based solution, www.adellera.it/investigations/nocoug_challenge/index.html also by Alberto Dell’Era from Italy, that we think has to be declared the winner here, with no procedural loops and good-to-excellent scores in every stated judging criteria.
The runner-up choice is difficult, too, but we’d probably have to go with André Araujo of Australia, www.pythian.com/news/2385/nocoug-sql-challenge-entry. Of all the solutions, his probably best combined straightforward (very clever, but still straightforward to the reader!), portable SQL that could be easily understood and maintained by a developer without an advanced degree in mathematics, with fairly scalable and well-tested SQL that ran well up to quite high numbers of throws of the die (“N”). It is true that the SQL had a hard-coded limit of N = 511 (a limit that André documented well, to the credit of the solution), and that functional limit lost a few points, but we should keep in mind that this high value of N is one that most of the implementations (other than Alberto Dell’Era’s) would never reach in our lifetime, anyway, owing to their comparative lack of scalability—being logically correct at high N is worth nothing if the program never finishes! If we had to actually maintain one of these in a production environment, and we didn’t anticipate needing results at very high values of N, we’d probably go with André’s solution, just because we’d be frightened of long-term maintenance on the high mathematics of Alberto Dell’Era’s brilliant but more complex and technically harder-to-follow solution.
Analysis of the Winning Solution
Alberto published two solutions. The first solution uses a mathematical tool called Discrete Fourier Transforms used in Digital Signal Processing. This solution will work with most databases—including SQL Server and DB2—with a little tweaking. The second solution provided by Alberto used Oracle’s proprietary Model clause to implement an efficient technique called Fast Fourier Transforms for computing Discrete Fourier Transforms.
Alberto recognized that the contents of the die table define a mathematical function and that the process of joining the table with itself and grouping the results is the so-called convolution of this function with itself. Throwing the die N times is therefore equivalent to performing N – 1 convolutions. For example, for N = 3, we have to perform two convolutions. This is best expressed using common table expressions as follows. Notice that the definition of the second convolution references the first convolution.
WITH first_convolution AS ( SELECT face_value, SUM (probability) AS probability FROM ( SELECT d1.face_value + d2.face_value AS face_value, d1.probability * d2.probability AS probability FROM die d1 CROSS JOIN die d2 ) GROUP BY face_value ), second_convolution AS ( SELECT face_value, SUM (probability) AS probability FROM ( SELECT d1.face_value + d2.face_value AS face_value, d1.probability * d2.probability AS probability FROM first_convolution d1 CROSS JOIN die d2 ) GROUP BY face_value ) SELECT face_value, probability FROM second_convolution ORDER BY face_value;
The most common solution of the problem requires an N-way cross join. Convolutions have obvious advantages over an N-way cross join because they keep the size of intermediate results in check. The question is how to compute the required N – 1 convolutions with a single SQL statement if the value of N is not known in advance. One solution is to recursively invoke a table function as was done by the Romanian contestant; that is, we have to resort to procedural programming. Alberto was able to avoid procedural programming using a Fourier transform; that is, a certain function whose definition is derived from the original function. The Fourier transform has the interesting property that the transform of the convolution of functions is the simple product of the individual transforms. Therefore, the Fourier transform of the N-way convolution of our function with itself is the Nth power of the Fourier transform of the function. The Fourier transform is straightforward to compute and—with a little mathematical trick involving a conversion from the Cartesian coordinate system to the polar coordinate system—the Nth power of the Fourier transform is also straightforward to compute. At this point, Alberto has the Fourier transform of the N-way convolution. To obtain the result that he really needs, all that is left for him to do is to compute the Inverse Fourier Transform of the Fourier transform. An explanation of Fourier Transforms can be found on the Web; efficient C-language implementations can also be readily found.
For readability and maintainability, Alberto uses a sequence of common table expressions. The following is a simplified version of his solution; the full solution handles more general cases and uses some tricks to reduce the number of scientific computations. Hints to guide the optimizer and improve efficiency are included in the version shown below.
First Alberto creates a one-column table of sequence numbers using the CONNECT BY method; this table is used several times in the rest of the solution. The number of elements in the table is one more than the product of the number of sides of the die and the number of throws.
sequence AS ( SELECT /*+ NO_MERGE */ LEVEL - 1 AS n FROM dual CONNECT BY LEVEL <= (:N * :sides + 1) )
Alberto then constructs a discrete function whose domain is the numbers in the sequence table. The function is called discrete because it is only defined for certain discrete values—not for all values in a range as in the case of a continuous function. Whenever possible, the function uses the values in the die table. If a value is not found, the value of the function is set to zero. This is done using an outer join.
function AS ( SELECT /*+ NO_MERGE */ n, COALESCE(probability, 0) AS x FROM sequence LEFT OUTER JOIN die ON (n = face_value) )
Alberto then computes the Fourier transform of the discrete function. There’s some advanced math going on here but it’s easy to take in small doses; you’ll recognize Pi as the well-known mathematical constant. A cross join with the Sequence table is required to calculate sums.
transform AS ( SELECT /*+ NO_MERGE LEADING(function) */ sequence.n, SUM(x * COS(-2 * :pi * sequence.n * function.n / (:N * :sides + 1))) AS x, SUM(x * SIN(-2 * :pi * sequence.n * function.n / (:N * :sides + 1))) AS y FROM function CROSS JOIN sequence GROUP BY sequence.n )
In another little dose of math, Alberto switches to polar form for ease of further computation.
polar AS ( SELECT /*+ NO_MERGE */ n, SQRT((x * x) + (y * y)) AS r, CASE WHEN ABS(y) < 0.000001 AND ABS(x) < 0.000001 THEN 0 ELSE ATAN2(y, x) END AS theta FROM transform )
Computing the Nth power of the Fourier transform is then very easy.
power AS ( SELECT /*+ NO_MERGE */ n, POWER(r, :N) AS r, theta * :N AS theta FROM polar )
Alberto has no more use for the polar form, so he converts back to Cartesian form.
cartesian AS ( SELECT /*+ NO_MERGE */ n, r * COS(theta) AS x, r * SIN(theta) AS y FROM power )
So far, Alberto has calculated the Fourier transform of the discrete function and computed its Nth power. As explained earlier, this is the Fourier transform of the N-way convolution of the discrete function. To obtain the convolution itself, Alberto computes the Inverse Fourier Transform using another cross join with the Sequence table.
convolution AS ( SELECT /*+ NO_MERGE LEADING(cartesian) */ sequence.n, SUM ( x * COS(+2 * :pi * cartesian.n * sequence.n / (:N * :sides + 1)) - y * SIN(+2 * :pi * cartesian.n * sequence.n / (:N * :sides + 1)) ) / (:N * :sides + 1) AS x FROM cartesian CROSS JOIN sequence GROUP BY sequence.n )
Finally, Alberto can display the results. The result has more than 30 digits of decimal precision; only 30 are displayed in the interests of accuracy.
SELECT n AS face_value, ROUND(x, 30) AS probability FROM convolution WHERE n >= :N ORDER BY n
As you can see, Alberto’s solution used advanced mathematical techniques, but it is not very long and the use of common table expressions makes it quite readable. We have a winner!
Footnote 1: Steven Feuerstein was asked the following question in an interview published in the May 2006 issue of the NoCOUG Journal: “SQL is a set-oriented non-procedural language; i.e., it works on sets and does not specify access paths. PL/SQL on the other hand is a record-oriented procedural language, as is very clear from the name. What is the place of a record-oriented procedural language in the relational world?” Steven replied: “Its place is proven: SQL is not a complete language. Some people can perform seeming miracles with straight SQL, but the statements can end up looking like pretzels created by someone who is experimenting with hallucinogens. We need more than SQL to build our applications, whether it is the implementation of business rules or application logic. PL/SQL remains the fastest and easiest way to access and manipulate data in an Oracle RDBMS, and I am certain it is going to stay that way for decades.”
Download the complete SQL script (simplified version of Alberto Dell’Era’s solution)