Benchmarking: Testing a string against a single composite Regular Expressions, or an array of Regexps

Posted: December 21st, 2009 | By | Filed under: Development | Tags: , | 2 Comments »

Here.

Today I needed to recode a small part of  client side script to test against a regular expression instead testing it begins with a specific value (using substring).

The current code used an array of expressions the string can begin with, and tested a string by looping and testing for each substring.

Changing the code had me to face a question that haunts me for a long time: which  is more efficient – testing against a single regular expression, composed from different supplied expressions, or testing against an array of those expressions.  By testing i mean to check if at least one of the expression tests positive for the string (i.e. someRegexp.test(“Tested string”) ).

My Boss, in his infinite wisdom, said that regular expressions, being compiled to a NFA (or DFA),  should theoretically be more efficient. Me, in my blissful ignorance of CS, decided to benchmark it.

You can test it yourself, using the cycles of your own browser. This was all hacked late at night, so forgive me for any mistakes and let me know about them.

The benchmark page  is pretty simple, lets you specify list of expressions to test with, list of strings to test, number of times to iterate, and than prints out the result for compilation (not really interesting for most uses) and match testing.

Further Explanations

The benchmark takes a list of regular expressions, and  tests them against the following methods:

  1. An array of compiled regular expressions
    • Each expressions is compiled and pushed to a compiled RegExps array (benchmarked seperately)
    • The tested string is tested against each expression iteratively, and breaks when there is a positive test (i.e. only if no matches are found, all regular expressions will be tested)
  2. A single composite regular expression, made of the given expression
    • A string is constructed for a regular expression with OR between non capturing groups for each expression in the form (?: expr1)|(?: expr2)…|(?: exprN) .
      (note: I’m not entirely sure this covers all the cases of sub-expressions, and  there might be some grouping/escaping constructs that will not work)
    • The string is compiled to a regular expression
    • The tested string is tested against the compiled regular expression.

The Code

Looking at the code might  help clear up things. Note that the code uses global variables for sheer lazyness, and provided just so copy/pasting this would work


var regexpArraySrc = [".*bla", "\dbloop", "sdl[k|j]sfl"];
 var regexpArray = [];
 var singleRegexp = new RegExp();

Array:

Compilation:


function compileRegExpArray() {
 var arrCompiled = [];
 for (var i=0; i < regexpArraySrc.length; i++) {
 re = new RegExp();
 re.compile(regexpArraySrc[i]);
 arrCompiled.push(re);
 }

return arrCompiled;
 }

Testing:


function testArray(stringToTest) {

for (var regexpIndex=0; regexpIndex < regexpArray.length; regexpIndex++) {
 if (regexpArray[regexpIndex].test(stringToTest)) {

return true;
 }
 }

return false;
 }

Single Regexp:

Compilation:


function compileSingleRegexp() {
 singleRegexpStr = "(?:" + regexpArraySrc.join(")|(?:") + ")";
 re = new RegExp();
 re.compile(singleRegexpStr);
 return re;
 }

Testing:


function testSingle(stringToTest) {
 return singleRegexp.test(stringToTest);
 }

Some post Benchmark thoughts

Well, as always, the results depends.

The Array method has the advantage of stopping evaluation explicitly when a match is found. This is useful when your testing mostly for positive matches, and you can estimate what expressions to check first. I found this method faster in the following example:

Expressions: \.gif$  ,   \.bmp$,   ^.*b.*\.jpg$

Strings to test: sababa.jpg, no  sense, bkldjhd.tiff, bla.gif, bla2.gif, bla.bmp

In most cases however, especially when the expressions weren’t complex, the single regular expression  proved to be faster (on Fire Fox 3.5).

The most important thing i found is that the efficiency is much more important in crafting a single expression in the list. For example: ^.*jpg$ is 4 times worse than jpg$ , though for testing they are practically the same (at least for a single line of input).

facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

Leave a Reply