Using a ‘regular’ expression to test primality is a bad idea

This entry was inspired by jawil’s blog entry on clever but useless tricks of regular expressions. I will introduce (or rather reintroduce) the trick, state the reason why the trick is slow (from a theoretical point of view), test that out on a real machine and finally summarise the lessons learnt.

Updated on 3 June 2017: Added content on greediness.

汉语使用者可以阅读上面的链接提到的博文，我是其中的一个评论者。此外，奇技淫巧一词的意思就是“过分高明而无用的技巧”，表达的感情色彩类似雕虫小技而不是鬼斧神工、巧夺天工，不过最近被误用罢了。

The trick

There is a clever but useless trick (奇技淫巧 in Chinese) to test non-primality with a ‘regular expression’. To clarify, it is not a regular expression in the sense for which computer scientists use this term. Regular expressions only recognise regular languages, and what you will see is not a regular language (simple proof by pumping lemma, left as an exercise for the readers).

To begin with, we have to encode the number in its unary form, i.e., the encoding is $f:\mathbb{N}\mapsto{\left\{1\right\}}^{\star},n\mapsto 1^n.$

We now want a regular expression (practically used one) to match all non-prime numbers, these includes 0, 1 and any composite number. To detect 0 and 1, it suffices to use ^1?$. To detect a composite number, use ^(11+)\1+$. The clever part is the usage of backward reference. This regular expression will match one factor of $n$ in the first captured group and repeat that group one or more times. Therefore, the regular expression is ^1?$|^(11+)\1+$.

Why the trick is useless

The trick is unnecessarily slow. Its running time is $\Omega\left(N\right)$ , which is much slower than the naïve $\mathcal{O}\left(\sqrt{N}\right)$ algorithm (if we count computing modulo as $\mathcal{O}\left(1\right)$ ), not mentioning that $\mathrm{PRIME}$ is in $\mathrm{P}$ .

Benchmarking that

When I first mentioned that the trick is slow, jawil, the author of the original blog entry, disagreed. He also benchmarked his implementations of both algorithms. The result was surprising — the regular expression version is faster! How could that be possible?

Trial 1

Let’s take a look at how he benchmarked the code (relevant part only, translated into English):

I am benchmarking the code in NodeJS v6.10.2. First, let’s take an example with a small number, 19.

console.time('Not using regex')
function isPrime(num){
    if(typeof num !== "number" || !Number.isInteger(num)){　　　　　　
        return false
    }
    if(num == 2){
        return true
    }else if(num % 2 == 0){
        return false
    }
    let squareRoot = Math.sqrt(num)
    for(let i = 3; i <= squareRoot; i += 2) {
      if (num % i === 0) {
         return false
      }
    }
    return true
}
console.log(isPrime(19))
console.timeEnd('Not using regex')

console.time('Using regex')
function isPrime(num) {
return !/^1?$|^(11+?)\1+$/.test(Array(num+1).join('1'))
}
console.log(isPrime(19))
console.timeEnd('Using regex')

The result is

Not using regex: 3.809ms
Using regex: 0.259ms

If we change to a larger number within the memory limits, say 123456, we have

Not using regex: 6.261ms
Using regex: 1.842ms

So regular expression is actually faster.

The problems

I pointed out that there were three major problems:

That the first-time JIT compilation is also timed;
That the benchmarking was one-shot, instead of average on multiple runs;
That the data were poorly chosen — 123456 has an extremely small prime factor.

It is also worth noting that the second problem could be a hard one to resolve. If you run it multiple times, perhaps the compiler finds it to be pure, it could optimise it away.

Trial 2: another lesson

I wrote my version of benchmarking code and used NodeJS 6.10.3.

The code written by me in this blog entry is licensed under MIT license with an additional requirement: when distributing and redistributing the code, you must also include a link to this blog entry. And recursive redistributors must also obey this rule.

function isPrimeSqrtN(x)
{
    if (x < 2) return false;
    for (var i = 2; i * i <= x; ++i)
        if (x % i == 0)
            return false;
    return true;
}
function isPrimeRegex(x)
{
    return !/^1?$|^(11+?)\1+$/.test(Array(x+1).join('1'));
}

isPrimeSqrtN(9);
isPrimeSqrtN(7);
isPrimeRegex(9);
isPrimeRegex(7);

console.time('isPrimeSqrtN(19)');
console.log(isPrimeSqrtN(19));
console.timeEnd('isPrimeSqrtN(19)');
console.time('isPrimeRegex(19)');
console.log(isPrimeRegex(19));
console.timeEnd('isPrimeRegex(19)');

console.time('isPrimeSqrtN(123456)');
console.log(isPrimeSqrtN(123456));
console.timeEnd('isPrimeSqrtN(123456)');
console.time('isPrimeRegex(123456)');
console.log(isPrimeRegex(123456));
console.timeEnd('isPrimeRegex(123456)');

I did not resolve any the problems in this version. It seems (from the third trial) that the first four calls intended to force the JIT did not work. I soon realised it was because the results were not used at all and could have been optimised away. But this remains a mystery or unverified. I doubt if NodeJS 6.10.3 were clever enough to detect pureness of isPrimeXxx.

Trial 3

I made several modifications to Trial 2:

I printed the return values of the first four calls to standard error stream;
I printed other return values to standard error stream in favour of copying the output directly from standard output with PowerShell | Set-Clipboard piping;
I added another two calls with parameter 347*347, where 347 is prime.

The code is collapsed for brevity. Show the code.

The code can be collapsed for brevity. Hide the code.

function isPrimeSqrtN(x)
{
    if (x < 2) return false;
    for (var i = 2; i * i <= x; ++i)
        if (x % i == 0)
            return false;
    return true;
}
function isPrimeRegex(x)
{
    return !/^1?$|^(11+?)\1+$/.test(Array(x+1).join('1'));
}

console.error(isPrimeSqrtN(9));
console.error(isPrimeSqrtN(7));
console.error(isPrimeRegex(9));
console.error(isPrimeRegex(7));

console.time('isPrimeSqrtN(19)');
console.error(isPrimeSqrtN(19));
console.timeEnd('isPrimeSqrtN(19)');
console.time('isPrimeRegex(19)');
console.error(isPrimeRegex(19));
console.timeEnd('isPrimeRegex(19)');

console.time('isPrimeSqrtN(123456)');
console.error(isPrimeSqrtN(123456));
console.timeEnd('isPrimeSqrtN(123456)');
console.time('isPrimeRegex(123456)');
console.error(isPrimeRegex(123456));
console.timeEnd('isPrimeRegex(123456)');

console.time('isPrimeSqrtN(347*347)');
console.error(isPrimeSqrtN(347*347));
console.timeEnd('isPrimeSqrtN(347*347)');
console.time('isPrimeRegex(347*347)');
console.error(isPrimeRegex(347*347));
console.timeEnd('isPrimeRegex(347*347)');

Now the results looked sane to me:

isPrimeSqrtN(19): 0.154ms
isPrimeRegex(19): 0.161ms
isPrimeSqrtN(123456): 0.460ms
isPrimeRegex(123456): 2.868ms
isPrimeSqrtN(347*347): 0.581ms
isPrimeRegex(347*347): 30.239ms

I also did multiple runs (each time in a new process) and the order of their performance was consistent across runs.

Lessons learnt

Trust theory but do not hesitate to verify it with experiments;
When benchmarking, time solely the interval the logic runs;
- I’ve learnt a lot from Algorithm Engineering course during my exchange to Aarhus Universitet;
Consider different kinds of input data;
- This is also relevant to String Algorithm course in Aarhus, during the exam of which the professors asked me about the best/worst test cases of pattern matching algorithms.

Updated on 3 June 2017: greediness

I just realised that what I wrote in the text and what I wrote in the code was quite different!

The regular expression I wrote in the text was ^1?$|^(11+)\1+$ but what I did in the code was ^1?$|^(11+?)\1+$. Note the non-greedy specifier ? after 11+!

Of course, of course, they match the same set of strings since they have ^ and $ around them. But the way they work are quite different. The greedy one captures the largest non-trivial factor of a composite number in the first captured group while the other the smallest, which is never worse than the greedy one.

Consider two cases:

The number is prime: In this case, they do not differ since they both test every number from $2$ to $x-1$ , just in different order.
The number is composite: The non-greedy always wins. Let $p$ be the smallest prime factor of $x$ . The non-greedy one makes $p-1$ tries before it stops while the greedy one has to make $x\left({1-p^{-1}}\right)$ tries. Since $p\leq\sqrt{x}$ , the non-greedy one makes $\mathcal{O}\left({\sqrt{x}}\right)$ tries while the other $\Omega\left(x\right)$ ones.

The practical running time is strange, though:

isPrimeSqrtN(19): 0.106ms
isPrimeRegex(19): 0.097ms
isPrimeRegexGreedy(19): 0.095ms

isPrimeSqrtN(123456): 0.075ms
isPrimeRegex(123456): 4.205ms
isPrimeRegexGreedy(123456): 1.930ms

isPrimeSqrtN(347*347): 2.578ms
isPrimeRegex(347*347): 47.566ms
isPrimeRegexGreedy(347*347): 2081.716ms

Note that for 123456, the greedy one is even faster! And the order is preserved across runs. How could this be?

I developed another theory in the regular expression engine. Perhaps the engine can predict whether the match could succeed with length, so the first 61727 tries failed fast. And the performance difference is observed by the \1+ part. It could be the case that confirming 61727 more copies of 11 is slower than confirming one more copy of 11…1 (61728 ones).

I therefore predict that the greedy version works better if and only if the composite number is even — if it is odd, the fail-fast would not work for $\frac{x}{2}\sim\frac{x}{3}$ because the engine has to first confirm a copy of the first captured group. This time the result somehow coincides with the prediction.

The code is collapsed for brevity. Show the code.

The code can be collapsed for brevity. Hide the code.

function isPrimeSqrtN(x)
{
    if (x < 2) return false;
    for (var i = 2; i * i <= x; ++i)
        if (x % i == 0)
            return false;
    return true;
}
function isPrimeRegex(x)
{
    return !/^1?$|^(11+?)\1+$/.test(Array(x+1).join('1'));
}
function isPrimeRegexGreedy(x)
{
    return !/^1?$|^(11+)\1+$/.test(Array(x+1).join('1'));
}

console.error(isPrimeSqrtN(9));
console.error(isPrimeSqrtN(7));
console.error(isPrimeRegex(9));
console.error(isPrimeRegex(7));
console.error(isPrimeRegexGreedy(9));
console.error(isPrimeRegexGreedy(7));

[50021, 50023, 50033, 50047].forEach(function (value)
{
    [1, 2, 3, 5, 7].forEach(function (factor)
    {
        var testee = factor * value;
        console.log(factor + ' * ' + value + ' = ' + testee);
        console.time('isPrimeRegex');
        console.error(isPrimeRegex(testee));
        console.timeEnd('isPrimeRegex');
        console.time('isPrimeRegexGreedy');
        console.error(isPrimeRegexGreedy(testee));
        console.timeEnd('isPrimeRegexGreedy');
        console.log('');
    })
});

The test result is collapsed for brevity. Show the test result.

The test result can be collapsed for brevity. Hide the test result.

1 * 50021 = 50021
isPrimeRegex: 356.197ms
isPrimeRegexGreedy: 342.034ms

2 * 50021 = 100042
isPrimeRegex: 1.372ms
isPrimeRegexGreedy: 0.787ms

3 * 50021 = 150063
isPrimeRegex: 2.426ms
isPrimeRegexGreedy: 689.920ms

5 * 50021 = 250105
isPrimeRegex: 6.049ms
isPrimeRegexGreedy: 3928.162ms

7 * 50021 = 350147
isPrimeRegex: 5.987ms
isPrimeRegexGreedy: 9954.669ms

1 * 50023 = 50023
isPrimeRegex: 354.442ms
isPrimeRegexGreedy: 342.425ms

2 * 50023 = 100046
isPrimeRegex: 0.599ms
isPrimeRegexGreedy: 0.784ms

3 * 50023 = 150069
isPrimeRegex: 4.631ms
isPrimeRegexGreedy: 650.475ms

5 * 50023 = 250115
isPrimeRegex: 4.680ms
isPrimeRegexGreedy: 3947.305ms

7 * 50023 = 350161
isPrimeRegex: 8.555ms
isPrimeRegexGreedy: 10198.142ms

1 * 50033 = 50033
isPrimeRegex: 357.004ms
isPrimeRegexGreedy: 348.866ms

2 * 50033 = 100066
isPrimeRegex: 2.997ms
isPrimeRegexGreedy: 1.635ms

3 * 50033 = 150099
isPrimeRegex: 2.043ms
isPrimeRegexGreedy: 682.865ms

5 * 50033 = 250165
isPrimeRegex: 3.745ms
isPrimeRegexGreedy: 4187.814ms

7 * 50033 = 350231
isPrimeRegex: 5.976ms
isPrimeRegexGreedy: 10713.553ms

1 * 50047 = 50047
isPrimeRegex: 372.650ms
isPrimeRegexGreedy: 350.183ms

2 * 50047 = 100094
isPrimeRegex: 1.632ms
isPrimeRegexGreedy: 1.216ms

3 * 50047 = 150141
isPrimeRegex: 2.468ms
isPrimeRegexGreedy: 681.795ms

5 * 50047 = 250235
isPrimeRegex: 4.254ms
isPrimeRegexGreedy: 4137.351ms

7 * 50047 = 350329
isPrimeRegex: 8.131ms
isPrimeRegexGreedy: 10960.352ms

But anyway, who wants to use such code in practice?