# Introduction

In class today, I sat through my second lecture on the power behind Regular Expressions with derivatives. The professor live coded a Regular Expression engine in Python during class that has all of the regular language functionality and more than the ones you will find in perl, ruby, python, java, boost, etc.

Why would language implementers not use the more powerful way? Because long ago it was thought that Brzozowski’s derivative method was too costly so everyone used Thompson’s method which is fast for most operations but suffers from possible exponential blowup with some operations.

With derivatives we get those operations back (Intersection, Difference, Complement) and significantly decrease the complexity of implementing regular expressions.

I will implement all operations for regular languages in Python, Javascript and Ruby for fun and posterity.

# Warning

Keep in mind, none of these are optimized and there are some types of regular expressions that will cause exponential time to compute, but these are easily fixed with some simple bail-out rules. I have not implemented any of those fixes here.

# Algorithm

The algorithm is a simple two step process:

• Take the derivative with respect to each character you are matching in order
• Does the final language accept $\epsilon$ (which is the empty-string language)

## A derivative of a language

In formal theory, a language is a set of characters, what we’d call a set of strings like $\{\text{foo},\text{bar},\text{baz}\}$. The derivative of that language in terms of the character b is $D_b\{\text{foo},\text{bar},\text{baz}\}=\{\text{ar},\text{az}\}$.

## Nullability

A nullable language is one that no longer accepts any input, in other words does the language not accept $\epsilon$? Such a derivative of a language looks like $D_z\{\text{foo},\text{bar}\}$, where z is not in the language and thus is null.

## Notation

We define two things

• A function $\delta$ which returns $\epsilon$ if the argument accepts the empty-string and $\emptyset$ when it does not
• The derivative of a regular expression re with respect to a character as $D_{char}(re)$.

# Operations of Regular Languages

## Matches

Here we, will define our base RegEx object. A character matches if the derivative of the language with respect to that character exists. This is inherently a recursive definition, and you can tell in the matches that so long as we haven’t traversed the entire string to match we will continue deriving and checking matches.

## Empty: Empty-Set

This, along with the empty-string language, are what we find at the depths of our recursive match calls. If at any point in the matching process does a match fail, the bottom of the recursion will contain Empty.

## Blank: Empty String Language

If the string is accepted (meaning it matched all the way), then it accepts $\epsilon$ which is represented by the empty-string language of Blank.

The derivative of $\epsilon$ is $\epsilon$ and is $\emptyset$ with respect to any character.

## Primitive: Single Character Language

A primitive is a single character language, like {‘c’}. You can think of a string as being one or more Primitive languages, e.g. 'cat' = 'c' 'a' 't' all concatenated together.

So, the derivative of a primitive is $\emptyset$ and the derivative of the re with respect to c is $\epsilon$ if c is the same as the parameter c’ and $\emptyset$ if they are not equal.

## Choice: ORing Languages

Choice is where you have two languages and matching either one is fine, e.g. ‘foo’ or ‘bar’ would match ‘foo’ and ‘bar’. This is also known as the Union operation in set theory.

The derivitive of the union of two languages is the union of their respective derivatives. The same goes for the derivative of the re of those languages.

## Repetition: Languages*

Repetition is zero or more repetitions of the language, e.g. ‘foo’ would match ‘foofoofoo’ and ‘’ since it is matching 3 repetitions and 0 respectively.

The derivative of a repetition is $\epsilon$ and the derivative with of the re is the derivative of the re concatenated with re*.

## Sequence: Concatenation of Languages

Like I explained earlier, strings longer than 1 character can be thought of as concatenations of Primitive languages, e.g. ‘foo’ is the same as ‘f’ ‘o’ ‘o’.

The derivative of a sequence of languages is the sequence of the derivative of those languages. And the derivative of the re is the Choice of the sequence of first derivative with the second language, and the derivative of the second.

## Intersection: Languages with equal strings

This is one of those operations you don’t get with classic regular expression engines in perl, python, ruby, etc. This is because the way they are written make the nfa to dfa conversion blow up exponentially in most cases. With derivatives, we get them cheaply.

The derivative of the intersection of languages is the intersection of their derivatives. This is the same with respect to the re.

## Difference: Language A - Language B

This is another operation prohibitively expensive in classic implementations that we get cheaply with derivatives. The difference of two languages would be where all the strings accepted by A minus the strings accepted by language B.

The derivative of the difference of two languages is the difference of their derivatives.

## Complement: Not Language you are looking for

This is another example of an operation we get cheaply with derivatives, which we don’t get at all in classic implementations. The complement of a language is where we want to match all strings that are not accepted.

The derivative of the complement of a language is the compliment of the derivative of the language. Same for the re.