5.12 Performance optimizations

In the past, the relatively poor performance of parser combinator libraries has often been cited as the primary impediment to their more widespread adoption. For this reason optimal performance stood front and center as a design goal during the development of FParsec and a lot of effort has been spent on optimizing parsing speed. As a result, FParsec has become so fast that parsers implemented with FParsec often significantly outperform parsers created by parser generator tools like fslex & fsyacc.

In general, a parser implemented in FParsec can get close to the performance of a hand‐optimized recursive‐descent parser written in C#. Due to the multi‐layered architecture of the FParsec API, you always have the option to fall back to the lower‐level API should a particular parser component implemented with the high‐level API turn out to be too slow. Hence, if you choose FParsec for implementing your parsers, you don’t have to worry that performance will become a reason for switching away from FParsec.

5.12.1 Performance guidelines

If you strive for optimal performance in your parser applications, try to adhere to the following guidelines:

Avoid backtracking

Try to avoid backtracking where possible. Sometimes it’s already enough to factor out a common prefix from a parser expression to avoid backtracking, e.g. by transforming (prefix >>? p1) <|> (prefix >>? p2) to prefix >>. (p1 <|> p2). Some simple backtracking can also be avoided by parsing whitespace as trailing whitespace instead of leading whitespace.

If you’re designing a programming or markup language, you should try to minimize the need for backtracking, both to simplify parsing and to avoid exponential worst‐case behaviour.

Prefer specialized parsers

FParsec provides a number of specialized parsers and combinators for various purposes. Using more specialized primitives instead of reimplementing them with generic combinators will often safe you time and improve parsing speed.

In particular:

  • Prefer the skip... variants of parsers and combinators if you don’t need the parser results.
  • Parse whitespace with the built‐in whitespace parsers.
  • Parse numbers with the built‐in number parsers.
  • Prefer to parse strings with the many[1]Satisfy[2][L] parsers.
  • Consider parsing unicode identifiers with the identifier parser.
Construct parsers once

Constructing a parser can be relatively expensive in comparison to a single invocation of the parser. Hence, if you repeatedly apply the same parser, you should make sure that you construct the parser only once, either by preconstructing it at the beginning or by lazily constructing the parser and then caching it.

Usually the place where parsers get inadvertently constructed more than once is inside closures.

For example, if you have a local function like

fun stream ->
    let reply = (parser1 >>. parser2) stream
    if reply.Status = Ok then // ...
    else // ...

you should avoid the repeated construction of parser1 >>. parser2 every time the closure is called by moving the construction outside of the closure, as in

let parser = parser1 >>. parser2
fun stream ->
    let reply = parser stream
    if reply.Status = Ok then //...
    else // ...

Also, you shouldn’t wrap a parser expression inside a function just to avoid F#’s value restriction if you can achieve the same goal with a type annotation. For example, you should not try to fix the compiler error in the first example of the tutorial chapter on F#’s value restriction by replacing

let p = pstring "test"

with

let p stream = pstring "test" stream
Avoid parse {...} expressions
Avoid regex parsers

The regex parser parses a string by applying a .NET regular expression to the input. Since .NET regular expressions are relatively slow, you should reserve the use of the regex parser for patterns that you can’t easily express with other FParsec parsers and combinators.

Consider optimizing large choice parsers

Formal grammars for programming languages or DSLs often have one or two grammar rules at their core that essentially just enumerate a long list of possible ways to form a statement or expression in that language. A straightforward FParsec implementation of such a grammar rule typically uses the choice combinator to combine a list of parsers for all the alternatives.

Usually such an implementation with a large choice‐based parser will do just fine. However, if parsing performance is critical for your application, replacing a large choice parser with a custom‐made combinator can be an optimization with a high benefit‐cost ratio. The next section explains this optimization in more detail.

5.12.2 Low‐level parser implementations

FParsec’s high‐level API consists of its built‐in parsers and combinators in the Primitives and CharParsers module. The high‐level API allows you to easily construct parsers in a concise and rather declarative way. Usually you will author most of your parsers using the high‐level API, because that’s the most productive way to do it.

However, sometimes you might find that a specific piece of parser functionality is a bit inconvenient to express through the high‐level API or that the high‐level implementation isn’t as fast as you had hoped for. In those situations it’s a great advantage that FParsec allows you to drop down to the low‐level API, so that you can implement your own special‐purpose parser and combinator primitives.

We have already covered the basics of the low‐level API in the chapters on the internals of a simple parser function and applying parsers in sequence. In this section we will discuss some examples that demonstrate how you can use low‐level parser implementations for optimization purposes.

One example of a parser implemented using the low‐level API is contained in the samples folder of the FParsec distribution in samples/FSharpParsingSample/FParsecVersion/parser.fs. It is a parser for an identifier string that is not identical with a keyword.

The low‐level implementation uses another parser, identifierString, to parse an identifier string and then backtracks when the parsed string is a keyword:

let identifier : Parser<string, unit> =
    let expectedIdentifier = expected "identifier"
    fun stream ->
        let state = stream.State
        let reply = identifierString stream
        if reply.Status <> Ok || not (isKeyword reply.Result) then reply
        else // result is keyword, so backtrack to before the string
            stream.BacktrackTo(state)
            Reply(Error, expectedIdentifier)

The same parser could also be implemented with the high‐level API:

let identifier =
    attempt (identifierString
             >>= fun str ->
                     if not (isKeyword str) then preturn str
                     else pzero) <?> "identifier"

The high‐level version is a bit more concise, but whether it is also easier to understand is debatable. The low‐level version seems at least a bit more self‐explanatory and hence is probably more accessible to new FParsec users. Since the low‐level implementation is also significantly faster than the high‐level one, this is a good example for a parser that can be improved through a low‐level implementation.

If you wanted to optimize the performance of the identifier parser even more, you could replace the identifierString parser invocation with direct calls to CharStream methods. However, whether the potential performance gain would be worth the loss in code modularity and maintainability is questionable. A more promising optimization often is to integrate the identifier parser into a higher‐level choice‐based parser, like it is done below in the last example of this section.

choice parsers with long list of argument parsers are performance‐wise one of the weakest spots of FParsec’s high‐level API. As we noted in the previous section, formal grammars for programming languages or DSLs often have one or two grammar rules at their core that essentially just enumerate a long list of possible ways to form a statement or expression in that language. A straightforward implementation of such a grammar rule using the choice combinator yields only sub‐optimal performance, since the choice parser has no knowledge about its argument parsers and has to try one parser after another.

This makes large choice‐based parsers an excellent optimization opportunity. With your knowledge about the parser grammar you can often narrow down the set of possible parsers just by peeking at the following one or two chars in the input. Having identified the set of possible parsers (often only consisting of one parser), you can then considerably speed up the dispatch to the right subparser.

For example, take a look at the JSON‐value parser from the tutorial:

choice [jobject
        jlist
        jstring
        jnumber
        jtrue
        jfalse
        jnull]

If you look at the definitions for the argument parsers, you’ll see that in almost all cases one can decide which parser should handle the input just based on the next char in the input. Hence, we could replace the choice‐based parser with the following low‐level implementation:

let error = expected "JSON value"
fun (stream: CharStream<_>) ->
    match stream.Peek() with
    | '{' -> jobject stream
    | '[' -> jlist stream
    | '"' -> jstring stream
    | 't' when stream.Skip("true")  -> Reply(JBool true)
    | 'f' when stream.Skip("false") -> Reply(JBool false)
    | 'n' when stream.Skip("null")  -> Reply(JNull)
    | _ ->
        let stateTag = stream.StateTag
        let mutable reply = jnumber stream
        if reply.Status = Error && stateTag = stream.StateTag then
           reply.Error <- error
        reply

A drawback of such a low‐level implementation is that you have to be a bit careful not to overlook any of the possible grammar cases. This is why we applied the jnumber parser in the “catch‐all” case, so that we don’t depend on the precise grammar rules for numbers.

You also need to consider how the low‐level implementation affects error messages. When a choice parser fails, it will generate an error message with the error messages from all the argument parsers it tried. This gives a human reader usually enough context to understand the error. For a low‐level implementation it can take a little more effort to ensure that the error messages for every case contain enough information about the grammar context. For example, in our implementation above we had to replace the default error message by jnumber with a custom one, so that the error message generated by the catch‐all case doesn’t create the impression that a JSON value can only be a number.

By now it is probably obvious that a low‐level parser implementation can actually be quite simple to write, but that it also comes at a certain cost in terms of code modularity and maintainability. Having the option of a low‐level implementation can certainly be what saves a project in certain situations and should give you some peace of mind with regard to parser performance, but generally you should only consider it as a backup option for those cases where you really need it.

The following example shows again how you can replace a choice‐based parser with a low‐level implementation, this time with a grammar that is a bit more representative of a typical programming language:

type Expr = Number float
          | LetBinding ...
          | IfThenElse ...
          | ...

type UserState = //...

type Parser<'result> = Parser<'result, UserState>

type Keyword = None = 0
             | If   = 1
             | Let  = 2
             // ...

let stringToKeyword = createStaticStringMapping
                          Keyword.None
                          ["if", Keyword.If
                           "let", Keyword.Let
                           // ...
                          ]

let str s = pstring s

let identifierString : Parser<string> = // ...

let identifierRest (id: string) : Parser<Expr> = ...

let number : Parser<Expr> = // ... (parser for floating-point number)

let ifThenElseRest   : Parser<Expr> = // ...
let letBindingRest   : Parser<Expr> = // ...
let exprInParensRest : Parser<Expr> = // ...

// The parser after this comment is a replacement for
//     let identifierStringButNoKeyword =
//         (* implementation like identifier parser in the first example above *)
//
//     let identifier   : Parser<Expr> = identifierStringButNoKeyword
//                                       >>= identifierRest
//
//     let ifThenElse   : Parser<Expr> = str "if"  >>. ifThenElseRest
//     let letBinding   : Parser<Expr> = str "let" >>. letBindingRest
//     let exprInParens : Parser<Expr> = str "("   >>. exprInParensRest
//
//     let expr = choice [identifierStringNoKeyword
//                        number
//                        ifThenElse
//                        exprInParens
//                        // ...
//                       ]
//
let expr : Parser<Expr> =
  fun stream ->
    let stateTag = stream.StateTag
    let reply = identifierString stream
    if reply.Status = Ok then
      match stringToKeyword reply.Result with
      | Keyword.None -> identifierRest reply.Result stream
      | Keyword.If   -> ifThenElseRest stream
      | Keyword.Let  -> letBindingRest stream
      // ...
    elif reply.Status = Error && stateTag = stream.StateTag then // no identifier
      match stream.Peek() with
      | '(' -> stream.Skip(); exprInParensRest stream
      | c when isDigit c -> number stream
      // ...
    else // error within identifier string
      Reply(reply.Status, reply.Error)