I’m happy to announce the release of Pumping, a library to leverage the OCaml type system to recognize regular languages.

The goal of type systems is to check that some expression respect some properties. More powerful type systems allow to enforce more complex properties. Three years ago, the OCamlPro team proposed a generic method to check properties with the OCaml type system. More recently, jacques Garrigue showed that arbitrary Turing machines can be encoded in GADTs. These methods allow to check very complex properties but are quite inconvenient to use. The Pumping library allows to easily check properties that covers regular languages by leveraging a well known tool: regular expressions.

Regular expressions in types

While Tyre’s main idea was to improve regular expressions with types, Pumping aims to do the converse: enrich the OCaml type system with regular expressions. In order to do so, Pumping introduces a new ppx annotation, re (also available as pumping.re), that allows writing regular expressions in types:

type t = [%re? Star (`a | `b)]

We can then check if a word is in this regular language. Words here are sequence of polymorphic variants terminated by `End. Note that Pumping does not have any runtime cost (and does not have any runtime at all)! All the work is done by the type system, at compile time.

let x : t = `a (`a (`b `End))

Types produced by pumping are structural: we don’t even need a type declaration!

let x : [%re? `a, (`b | `c)] = `a (`c `End)

Most of the usual regular expression operators are available, such as star and ranges.

let x
 : [%re? Star ('a'..'z'), (`Foo | `Bar)]
 = `f (`o (`o (`Bar `End)))

Intersection is also available:

(`B `End : [%re? (`A | `B) :: (`B | `C)])

Convenient notation for regular expressions

Unfortunately, PPXs are limited by the OCaml syntax. The usual syntax for regular expressions is short, convenient and known by most people, but not easily encoded in the OCaml syntax. In order to be able to express regular properties in an even terser manner, pumping also allows to use the POSIX notation for regular expressions:

let x : [%re"foo|bar"] = `foo `End

Most of the POSIX syntaxes are supported, including charsets, ranges and bounded repetition:

let x : [%re"[a-z]{2,6}"] = `p (`o (`t (`a (`t (`o `End)))))

A new operator, &, is introduced for the intersection:

let x : [%re"[ab]&[bc]"] = `b `End

Equivalence of regular expressions

Since types defined by pumping are structural, it is possible to use the typechecker to test the equivalence of regular expressions:

let f : [%re"aa*a"] -> [%re"a{2,}"] = fun x -> x

We can also leverage subtyping to test the inclusion of regular expressions:

let f x = (x : [%re"a*"] :> [%re"[ab]*"])

Implementation

The implementation is a fairly simple adaption of Brzozowski derivatives. In particular, it follows the technique described in the article “Regular-expression derivatives reexamined”. Recursive types with (polymorphic) variants naturally describes a graph, thanks to unification type variables and the as construction, allowing to easily encode finite state automatons.

Indeed, [%re"(OC|oc|c)(a)*ml"] is simply:

[   `oc of 'v2
  | `c of 'v2
  | `OC of [ `ml of [ `End ] as 'v3  | `a of 'v2 ] as 'v2
] as 'v1

The typechecker then does a very good job at simplifying the types. Unfortunately, it also unfolds the sub-parts of non-recursive types, which sometimes causes an exponential blowup.

Going further

The expressivity of regular languages is well known, but limited. It is unclear how to handle grouping, backreferences and lookaround constructions, but there are features that could be introduced:

  • In the style of PERL named regular expressions, it would be useful to introduce annotation on types, in order to describe named mutually dependent regular expressions:
    type%re t1 = [%re? Star `Foo, t2]
    and t2 = [%re? (`Bar | `Baz | t2)]
    
  • Deterministic Pushdown automatons are fairly easy to encode (using a “counter” type, as proposed in OCamlpro’s blogpost), but a new convenient syntax would need to be proposed.

With this set of features, Pumping should be almost sufficient to entirely encode tyxml’s static checking of HTML.

The (only ?) drawback of this approach is that error messages leave a bit to be desired.

Pumping should be available in OPAM soon. You can find examples in the test file and discuss it on reddit. Happy pumping!