CoffeeScript Regex Golf

Posted: January 19th, 2014 | By | Filed under: Development, Fun | Tags: , , , | No Comments »

Like most geeks, i’ve was recently introduced to the new activity of Regex Golf.  What it means, is trying to find the shortest regular expression for all strings in a given set, that doesn’t match any strings in a different given set.
For example, for matching ‘one’, ‘two’ and ‘three’, but not matching ‘four’, ‘five’ and ‘six’, we can come up with the the regexp /t|ne/

XKCD had a brilliant comic strip about the subject, which led the great Peter Norvig to write about creating a solver for finding a short regex golf solution.

It got me interested enough to want improve a little on the solution, and I thought “why not try out a new language?” so i’ve decided to port Norvig’s python code to Coffeescript, and play around with both the language and the solution to the problem.

You can have a look and try what I came up with at https://github.com/amitayd/regexp-golf-coffeescript.

It even runs in the browser, in a ugly page I hacked (be careful, it can be slow for big sets):

I will describe some of my experiences in porting the code to CoffeeScript and extending the solution below.

Porting to CoffeeScript

CoffeeScript seemed easy enough to pick up just by browsing the language reference, and control-F-ing and web searching for things I thought should be possible.
I decided to write to code first to run on node.js, using the command line coffee runner and REPL. Installing it was as easy as

sudo npm install -g coffee-script

Since Norvig’s code was so clean and self dependent, I didn’t need any external dependencies, but had to write some data structures and functions that are part of the Python runtime, namely a Set class with basic operations (intersection, union, substraction, etc),  and a max function that takes an external ranking function (here).

The porting was pretty much straight-forward, other than some…

CoffeeScript pains…

Since the original code relies heavily on python’s nifty list comprehensions, I was very happy to find out something similar exists in Coffee. I would even dare to say this can be the winning point over JavaScript. However, there are many things about it that made comprehensions painful to use with:

  • Multiple collections comprehensions, such as in “[x,y] for x in xs for y in ys” will result in something like:
    [ [ [ 'X1', 'Y1' ],
    [ 'X2', 'Y1' ] ],
    [ [ 'X1', 'Y2' ],
    [ 'X2', 'Y2' ] ] ]
    Bahhh. It turns out  CoffeeScript doesn’t flat-maps each collection, but creates nested arrays instead. This forced me to either wrap the comprehension in a flatten() call, or abandon them and use for loops that append to an array. So long “everything is an expression” and immutable variables.
  • Different iterator syntax for objects and arrays: “for key of obj” vs “for item in arr”. Yes, I understand it’s a javascript issue, but it’s still annoying as hell to find out you used the wrong keyword and  iterated over an empty collection because of that.
  • No default iterators for custom object: I couldn’t just do something like “for item in mySet”, I had to do something like “for item of mySet.set”. So long abstraction…
  • Mutliline comprehension don’t work. Or work strangely, not sure yet. If nested comprehension aren’t workable, this is not that important anyway.

Having to use a temporary array for something that was a clean list comprehension in the original code, made a little tear roll down my left cheek.
Other than those issues, which for me rendered one of CoffeeScript’s greatest selling points not so attractive, there were some other minor issues I’ve encountered:

  • Strange operator precedence: not operator has high priority than of operator:
    item for item in ["one", "two"] when not item of {one:1}
    Will return []. Made me paranoid about not using explicit parenthesis.
  • in operator does not work for strings
  • Different syntax for if as an expression (if something then 5 else 3). Maybe it has something to do with the optional parenthesis parsing.
  • Ranges return the upper-bound element too, so [0..length]  returns an extra element (length+1 elements). Since arrays (and strings) are zero-based, it led to some confusion.
  • No sugar syntax for default parameters

I should probably stop whining and open bugs for those.

Things I enjoyed (mostly compared to JavaScript)

  • More things are an expression (especially enjoyed if..then instead of ternary operator)
  • Shorter function decleration: (x) -> x*x
  • Ruby-style string interpolation: “Hello #{user}!”
  • Multiline strings
  • Slicing syntax for arrays and strings: x[1..3]
  • The Existential Operator (I mostly enjoy its name): age ? “no age supplied”
  • Classes construct (for straight-forward objects)

I’m sure there are other nice features which I didn’t get to use.

Overall, a nice language, but it still feels very confined to JavaScript, and offers (sometimes really cool) syntactic sugar over it. I’m still not convinced I would want it components that I will have to debug heavily in the browser, but it might be a good candidate for logic and algorithms that are thoroughly tested, and do not need a browser environment.

Improvements to the Regex Golf solver

I would suggest first reading Peter Norvig’s original annotated code.
I can proudly say I managed to find a 49 characters solution instead of the 54 characters found by in the original post: /[gikuj]..n|a.[alt]|[pivo].l|i..o|[jocy]e|sh|di|oo/ - thank god hours of brain and CPU power weren’t wasted! How?

Grouping to Character classes:

For example, if we have a regexp am|bm|cm|sfsf we can have more efficient expression [abc]m|sfsf.
Since this expression is logically an OR of two expressions, we know it matches some winners, and doesn’t match any losers.
This is done by:

  1. Creating a list of blanked letter, and the possible letters in the blank from the generated parts. For example, from ‘ab’, ‘ac’, ‘ad’, ‘bd’ we create:
    ‘a_’: ['b', 'c', 'd']
    ‘_d’:  ['a', 'b']
  2. For each of the generated entries, calculate the (no repeats, order insignificant) permutations bigger than one, and return them as new parts:
    ‘a_’: ['b', 'c', 'd']  ->
    ‘bc’, ‘bd’, ‘cd’, ‘bcd’ ->
    ‘a[bc]‘, ‘a[cd]‘, ‘a[cd]‘, ‘a[bcd]‘

We then add the generated parts to the pool of parts were going to use in the main find algorithm. See getGroupedParts() and getPermutations()

Build covers from multiple branches of the best candidates

This means that for each iteration when constructing a cover, we find the best n parts from the pool, and then recursively construct the rest of the cover for the remaining pool and winners. See the relevant code.

This results in numBranches ^ depth possible covers, and of course, makes the solution that much more costly.

I let the number of branches be determined by a setting (global or branches per depth), but would love it if someone would come up with a clever way to optimize the number of branches (keep in mind that the depth is not known at the beginning of the recursion).

Using random element in the ranking function…

Well, not really an optimization, but if we run many iterations, and the ranking function returns similar or equal values for different parts, it helps adding some randomness. The randomness factor can be controlled (or disabled by setting to zero) from the settings

That’s it

This was a fun learning experience for me, keep in mind that that’s what it is – a toy code.  I tried keeping it simple for writing and reading, not necessarily adopting best practices for a professional project.  It is not well tested, documented, organized, or even written. Just me having fun with a new language and a new problem. I hope the code and my experiences are still of some value and fun to those of you who’ve read through all this.

facebooktwittergoogle_plusredditpinterestlinkedinmailby feather

Leave a Reply