In class today, I sat through my second lecture on the power behind Regular
Expressions with derivatives. The professor live coded a Regular Expression
engine in Python during class that has all of the regular language functionality
and more than the ones you will find in perl, ruby, python, java, boost, etc.
Why would language
implementers not use the more powerful way? Because long ago it was
thought that Brzozowski’s derivative method was too costly so everyone used
Thompson’s method which is fast for most operations but suffers from possible
exponential blowup with some operations.
With derivatives we get those operations back (Intersection, Difference,
Complement) and significantly decrease the complexity of implementing regular
expressions.
I will implement all operations for regular languages in Python, Javascript and
Ruby for fun and posterity.
Warning
Keep in mind, none of these are optimized and there are some types
of regular expressions that will cause exponential time to compute, but these
are easily fixed with some simple bail-out rules. I have not implemented any
of those fixes here.
Algorithm
The algorithm is a simple two step process:
Take the derivative with respect to each character you are matching in order
Does the final language accept (which is the empty-string language)
A derivative of a language
In formal theory, a language is a set of characters, what we’d call a set of
strings like . The derivative of that
language in terms of the character b is
.
Nullability
A nullable language is one that no longer accepts any input, in other words does
the language not accept ? Such a derivative of a language
looks like , where z is not in the language
and thus is null.
Notation
We define two things
A function which returns if the argument accepts the
empty-string and when it does not
The derivative of a regular expression re with respect to a character as
.
Operations of Regular Languages
Matches
Here we, will define our base RegEx object. A character matches if the
derivative of the language with respect to that character exists. This is
inherently a recursive definition, and you can tell in the matches that so
long as we haven’t traversed the entire string to match we will continue
deriving and checking matches.
This, along with the empty-string language, are what we find at the depths of
our recursive match calls. If at any point in the matching process does a match
fail, the bottom of the recursion will contain Empty.
A primitive is a single character language, like {‘c’}. You can think of a
string as being one or more Primitive languages, e.g. 'cat' = 'c' 'a' 't'
all concatenated together.
So, the derivative of a primitive is
and the derivative of the re with respect to c is if c is the
same as the parameter c’ and if they are not equal.
Choice is where you have two languages and matching either one is fine, e.g.
‘foo’ or ‘bar’ would match ‘foo’ and ‘bar’. This is also known as the Union
operation in set theory.
The derivitive of the union of two languages is the union of their respective
derivatives. The same goes for the derivative of the re of those languages.
Repetition is zero or more repetitions of the language, e.g. ‘foo’ would match
‘foofoofoo’ and ‘’ since it is matching 3 repetitions and 0 respectively.
The derivative of a repetition is and the derivative with of the
re is the derivative of the re concatenated with re*.
Like I explained earlier, strings longer than 1 character can be thought of as
concatenations of Primitive languages, e.g. ‘foo’ is the same as ‘f’ ‘o’ ‘o’.
The derivative of a sequence of languages is the sequence of the derivative of
those languages. And the derivative of the re is the Choice of the sequence
of first derivative with the second language, and the derivative of the second.
This is one of those operations you don’t get with classic regular expression
engines in perl, python, ruby, etc. This is because the way they are written
make the nfa to dfa conversion blow up exponentially in most cases. With
derivatives, we get them cheaply.
The derivative of the intersection of languages is the intersection of their
derivatives. This is the same with respect to the re.
This is another operation prohibitively expensive in classic implementations
that we get cheaply with derivatives. The difference of two languages would be
where all the strings accepted by A minus the strings accepted by language B.
The derivative of the difference of two languages is the difference of their
derivatives.
This is another example of an operation we get cheaply with derivatives, which
we don’t get at all in classic implementations. The complement of a language is
where we want to match all strings that are not accepted.
The derivative of the complement of a language is the compliment of the
derivative of the language. Same for the re.