Here's a challenge for you: write me an SQL query that generates prime numbers. Note that I said "generates"; storing a zillion primes in a table and selecting from it is not what I'm looking for here.
"Why?" you ask. "What possible reason could there be for wanting a query that emits primes?". Well, you might want to use it to give every row in a table a prime numbered primary key. Then you could record references to multiple rows by storing the product of their keys. You might want to implement some SQL that does a checksum reliant on primes. But the reason I want to do it is that it teaches another way of thinking about programming. In universities, they would probably use a language like Prolog for this. SQL is a bit more widespread though, and it's also a bit more misunderstood; my real hope you'll come away from this article with a greater appreciation for what you're asking of your database server when you fire off those queries.
So where do we start? Well, let's think about the way you'd implement this in C. Most programmers would immediately think of the Sieve of Eratosthenes, a very simple algorithm with the honour of being named after my pet crocodile. It works like this: you setup a bitset with all values initialised to true. Then, you go along and turn off every 2nd value, then every 3rd, then every 4th and so on up to the square root of the largest number you're interested in. Once this is complete, the prime numbers are the indices of the set with a value of true.
The more observant will quickly see an easy optimisation to be had in this algorithm: there is no need to flag every 4th bit after you've done every 2nd, because in flagging every 2nd bit, you'll have already covered every 4th. In fact, more generally, you can use the values you already know to be prime to skip processing of all multiples of that prime. Sneaky.
So let's think a little about what we've done here: we've started with the assumption that every number is prime and then built a machine that can identify all composite (non-prime) numbers, and used that machine to remove all the composite numbers from our set. What we're left with then, must be primes.
Eratosthenes was indeed clever for suggesting this algorithm, and it is far more efficient than the SQL solution I'm about to present. The reason that the SQL solution is interesting is that Eratosthenes' solution defined the solution in terms of a process, whilst the SQL solution defines the solution in terms of the result, albeit in a highly round-about manner.
Now let's think about the basic structure of an SQL query. We have a few basic tools: we have sources of data (tables), ways to connect those sources of data (joins) and filters on any intermediate or final result (conditions). That is to say that your basic query has a structure something like this:
SELECT * FROM t1 INNER JOIN t2 ON t1.foreign_id = t2.id WHERE t2.field < 1
Alright, so how can we use this to get primes? Well, if your database provided an IsPrime function, and you had a table of some set of integers, you could do this:
SELECT n FROM numbers WHERE IsPrime(n)
except that no database I know of provides IsPrime, and no-one really wants to store all the integers in a table.
Ok, so that's no good, what *can* we do? Well, there's two problems there: the lack of an IsPrime function, and the lack of a set of integers. Well, the IsPrime part is more interesting but less fun, so I'm going to start off by dealing with the problem of not having a cheap, reliable source of integers first.
Let's start at the very beginning. If we need all the natural numbers up to the number 1, we could implement it like this:
SELECT 1
And we're done! I know I know, it's not quite the revelation you were hoping for, stay with me. What if we now want all the natural numbers up to the number 2? Easy! SQL's UNION set operator provides us with a scalable solution for this one as well:
SELECT 1 UNION SELECT 2
"Good work, James" I hear you saying. "Take the rest of the day off!". Relax! Things can only get better, right?
What if we need to get to seven? That's starting to sound like a lot of typing, and I've got RSI. I chose 7 because it so happens that 7 fits into three bits. If we had a way to enumerate all combinations of bits up to 3, we'd be in business.
But how could we possibly do that? Isn't that exactly the same problem as what we started with? It is indeed, but this one is a little more tractable, because look above: we managed to write an SQL query that could produce two values, what if we wrote one that generated 0 and 1 as output:
SELECT 0 UNION SELECT 1
There, I just made a bit. Alright, that's great and all, but now we need to turn my bits into numbers. No problem, if you don't specify any JOIN criteria, between tables, SQL will just enumerate all possible combinations in depth-first search order. So here's what we do:
SELECT (bit1.n * 4 + bit2.n * 2 + bit3.n) AS n FROM
(SELECT 0 AS n UNION SELECT 1 AS n) AS bit1,
(SELECT 0 AS n UNION SELECT 1 AS n) AS bit2,
(SELECT 0 AS n UNION SELECT 1 AS n) AS bit3
So we took a simple mechanism that generates 1s and 0s, exploited SQL's cartesian nature and multiplied the 1s and 0s by powers of 2 to produce the integers that we're interested in. Cute trick, huh? The really cute thing is that you can make a view from that for say, 4 bits, and then use the same trick to build another view which uses the 4 bit view 4 times to get 16 bits
worth of integers, or more if you need it.
Alright, so let's say we've made such a view (or you can continue to type out the query manually as a sub-query if you're in the mood) and called it v_numbers, how do we go from there to primes? Well, what's a prime? Recall the definition that the Sieve of Eratosthenes exploited: a prime is a non-composite. Ok, well then all we need to do is generate all the composite numbers and find the numbers not in that set. The first part is easy: we learnt above how to generate all combinations from a set, so all we'd need to do here is pair-wise multiplications of all those combinations. Like this:
SELECT n1.n * n2.n
FROM v_numbers AS n1, v_numbers AS n2
WHERE n1.n > 1;
We require that n1 be greater than 1 because 1 is a unit; if we did not include this restriction, well, we would conclude that there are no prime numbers, since in multiplying everything by 1, we would have selected every integer.
We can actually make this a little more efficient if we add in the requirement from before that we only need to compute products of numbers up to the square root of the largest number in the set:
SELECT n1.n * n2.n
FROM v_numbers AS n1, v_numbers AS n2
WHERE n1.n > 1
AND n1.n <= CEIL(SQRT(MAX(SELECT n FROM v_numbers AS n3)))
AND n2.n <= CEIL(SQRT(MAX(SELECT n FROM v_numbers AS n4)));
It's up to your SQL implementation as to whether that ends up being better or worse; it may choose to re-evaluate the MAX expression for each value. If it does, you can simply hard-code the largest value to work-around the limitation.
Again, let's put that into a view (because typing is hard) called v_composites. Selecting our primes then should be a walk in the park, it's just all the numbers not in that set. So, here it is:
SELECT v_numbers.n
FROM v_numbers
WHERE v_numbers.n NOT IN (SELECT v_composites.n FROM v_composites)
AND n > 1;
The n > 1 clause is again to exclude 1 from the set. It's a unit, and its exclusion here is no more elegant than its exclusion by definition by number theorists.
The most important thing to take away from this is to understand that when you invoke an SQL query, you really are asking the database to consider all possible answers and filter out the ones you want; it's a massive depth-first-search implementation. That's important for a few reasons, but notably: it's extremely powerful, extremely slow and extremely inappropriate for almost all problems you're likely to face. Your database server will thank you if you keep those things in mind.