5.12 Performance optimizations
In the past, the relatively poor performance of parser combinator libraries has often been cited as the primary impediment to their more widespread adoption. For this reason optimal performance stood front and center as a design goal during the development of FParsec and a lot of effort has been spent on optimizing parsing speed. As a result, FParsec has become so fast that parsers implemented with FParsec often significantly outperform parsers created by parser generator tools like fslex & fsyacc.
In general, a parser implemented in FParsec can get close to the performance of a hand‐optimized recursive‐descent parser written in C#. Due to the multi‐layered architecture of the FParsec API, you always have the option to fall back to the lower‐level API should a particular parser component implemented with the high‐level API turn out to be too slow. Hence, if you choose FParsec for implementing your parsers, you don’t have to worry that performance will become a reason for switching away from FParsec.
5.12.1 Performance guidelines
If you strive for optimal performance in your parser applications, try to adhere to the following guidelines:
- Avoid backtracking
-
Try to avoid backtracking where possible. Sometimes it’s already enough to factor out a common prefix from a parser expression to avoid backtracking, e.g. by transforming
(prefix >>? p1) <|> (prefix >>? p2)
toprefix >>. (p1 <|> p2)
. Some simple backtracking can also be avoided by parsing whitespace as trailing whitespace instead of leading whitespace.If you’re designing a programming or markup language, you should try to minimize the need for backtracking, both to simplify parsing and to avoid exponential worst‐case behaviour.
- Prefer specialized parsers
-
FParsec provides a number of specialized parsers and combinators for various purposes. Using more specialized primitives instead of reimplementing them with generic combinators will often safe you time and improve parsing speed.
In particular:
-
Prefer the
skip...
variants of parsers and combinators if you don’t need the parser results. - Parse whitespace with the built‐in whitespace parsers.
- Parse numbers with the built‐in number parsers.
-
Prefer to parse strings with the
many[1]Satisfy[2][L]
parsers. -
Consider parsing unicode identifiers with the
identifier
parser.
-
Prefer the
- Construct parsers once
-
Constructing a parser can be relatively expensive in comparison to a single invocation of the parser. Hence, if you repeatedly apply the same parser, you should make sure that you construct the parser only once, either by preconstructing it at the beginning or by lazily constructing the parser and then caching it.
Usually the place where parsers get inadvertently constructed more than once is inside closures.
For example, if you have a local function like
fun stream -> let reply = (parser1 >>. parser2) stream if reply.Status = Ok then // ... else // ...
you should avoid the repeated construction of
parser1 >>. parser2
every time the closure is called by moving the construction outside of the closure, as inlet parser = parser1 >>. parser2 fun stream -> let reply = parser stream if reply.Status = Ok then //... else // ...
Also, you shouldn’t wrap a parser expression inside a function just to avoid F#’s value restriction if you can achieve the same goal with a type annotation. For example, you should not try to fix the compiler error in the first example of the tutorial chapter on F#’s value restriction by replacing
let p = pstring "test"
with
let p stream = pstring "test" stream
-
Avoid
parse {...}
expressions -
Avoid
regex
parsers -
Consider optimizing large
choice
parsers -
Formal grammars for programming languages or DSLs often have one or two grammar rules at their core that essentially just enumerate a long list of possible ways to form a statement or expression in that language. A straightforward FParsec implementation of such a grammar rule typically uses the
choice
combinator to combine a list of parsers for all the alternatives.Usually such an implementation with a large
choice
‐based parser will do just fine. However, if parsing performance is critical for your application, replacing a largechoice
parser with a custom‐made combinator can be an optimization with a high benefit‐cost ratio. The next section explains this optimization in more detail.
5.12.2 Low‐level parser implementations
FParsec’s high‐level API consists of its built‐in parsers and combinators in the Primitives
and CharParsers
module. The high‐level API allows you to easily construct
parsers in a concise and rather declarative way. Usually you will author most of your parsers using the high‐level API, because that’s the most
productive way to do it.
However, sometimes you might find that a specific piece of parser functionality is a bit inconvenient to express through the high‐level API or that the high‐level implementation isn’t as fast as you had hoped for. In those situations it’s a great advantage that FParsec allows you to drop down to the low‐level API, so that you can implement your own special‐purpose parser and combinator primitives.
We have already covered the basics of the low‐level API in the chapters on the internals of a simple parser function and applying parsers in sequence. In this section we will discuss some examples that demonstrate how you can use low‐level parser implementations for optimization purposes.
One example of a parser implemented using the low‐level API is contained in the samples folder of the FParsec distribution in samples/FSharpParsingSample/FParsecVersion/parser.fs. It is a parser for an identifier string that is not identical with a keyword.
The low‐level implementation uses another parser, identifierString
, to parse an identifier
string and then backtracks when the parsed string is a keyword:
let identifier : Parser<string, unit> = let expectedIdentifier = expected "identifier" fun stream -> let state = stream.State let reply = identifierString stream if reply.Status <> Ok || not (isKeyword reply.Result) then reply else // result is keyword, so backtrack to before the string stream.BacktrackTo(state) Reply(Error, expectedIdentifier)
The same parser could also be implemented with the high‐level API:
let identifier = attempt (identifierString >>= fun str -> if not (isKeyword str) then preturn str else pzero) <?> "identifier"
The high‐level version is a bit more concise, but whether it is also easier to understand is debatable. The low‐level version seems at least a bit more self‐explanatory and hence is probably more accessible to new FParsec users. Since the low‐level implementation is also significantly faster than the high‐level one, this is a good example for a parser that can be improved through a low‐level implementation.
If you wanted to optimize the performance of the identifier parser even more, you could replace the identifierString
parser invocation with direct calls to CharStream
methods. However, whether the potential performance
gain would be worth the loss in code modularity and maintainability is questionable. A more promising optimization often is to integrate the
identifier parser into a higher‐level choice
‐based parser, like it is done below in the last example of this section.
choice
parsers with long list of
argument parsers are performance‐wise one of the weakest spots of FParsec’s high‐level API. As we noted in the previous section, formal grammars
for programming languages or DSLs often have one or two grammar rules at their core that essentially just enumerate a long list of possible ways
to form a statement or expression in that language. A straightforward implementation of such a grammar rule using the choice
combinator yields only sub‐optimal performance,
since the choice
parser has no
knowledge about its argument parsers and has to try one parser after another.
This makes large choice
‐based
parsers an excellent optimization opportunity. With your knowledge about the parser grammar you can often narrow down the set of possible
parsers just by peeking at the following one or two chars in the input. Having identified the set of possible parsers (often only consisting of
one parser), you can then considerably speed up the dispatch to the right subparser.
For example, take a look at the JSON‐value parser from the tutorial:
choice [jobject jlist jstring jnumber jtrue jfalse jnull]
If you look at the definitions for the argument parsers, you’ll see that in almost all cases one can decide which parser should handle the input
just based on the next char in the input. Hence, we could replace the choice
‐based parser with the following low‐level
implementation:
let error = expected "JSON value" fun (stream: CharStream<_>) -> match stream.Peek() with | '{' -> jobject stream | '[' -> jlist stream | '"' -> jstring stream | 't' when stream.Skip("true") -> Reply(JBool true) | 'f' when stream.Skip("false") -> Reply(JBool false) | 'n' when stream.Skip("null") -> Reply(JNull) | _ -> let stateTag = stream.StateTag let mutable reply = jnumber stream if reply.Status = Error && stateTag = stream.StateTag then reply.Error <- error reply
A drawback of such a low‐level implementation is that you have to be a bit careful not to overlook any of the possible grammar cases. This is
why we applied the jnumber
parser in the “catch‐all” case, so that we don’t
depend on the precise grammar rules for numbers.
You also need to consider how the low‐level implementation affects error messages. When a choice
parser fails, it will generate an error message
with the error messages from all the argument parsers it tried. This gives a human reader usually enough context to understand the error. For a
low‐level implementation it can take a little more effort to ensure that the error messages for every case contain enough information about the
grammar context. For example, in our implementation above we had to replace the default error message by jnumber
with a custom one, so that the error message generated by the catch‐all case doesn’t create the impression that
a JSON value can only be a number.
By now it is probably obvious that a low‐level parser implementation can actually be quite simple to write, but that it also comes at a certain cost in terms of code modularity and maintainability. Having the option of a low‐level implementation can certainly be what saves a project in certain situations and should give you some peace of mind with regard to parser performance, but generally you should only consider it as a backup option for those cases where you really need it.
The following example shows again how you can replace a choice
‐based parser with a low‐level implementation, this time with a grammar that is a bit more representative of
a typical programming language:
type Expr = Number float | LetBinding ... | IfThenElse ... | ... type UserState = //... type Parser<'result> = Parser<'result, UserState> type Keyword = None = 0 | If = 1 | Let = 2 // ... let stringToKeyword = createStaticStringMapping Keyword.None ["if", Keyword.If "let", Keyword.Let // ... ] let str s = pstring s let identifierString : Parser<string> = // ... let identifierRest (id: string) : Parser<Expr> = ... let number : Parser<Expr> = // ... (parser for floating-point number) let ifThenElseRest : Parser<Expr> = // ... let letBindingRest : Parser<Expr> = // ... let exprInParensRest : Parser<Expr> = // ... // The parser after this comment is a replacement for // let identifierStringButNoKeyword = // (* implementation like identifier parser in the first example above *) // // let identifier : Parser<Expr> = identifierStringButNoKeyword // >>= identifierRest // // let ifThenElse : Parser<Expr> = str "if" >>. ifThenElseRest // let letBinding : Parser<Expr> = str "let" >>. letBindingRest // let exprInParens : Parser<Expr> = str "(" >>. exprInParensRest // // let expr = choice [identifierStringNoKeyword // number // ifThenElse // exprInParens // // ... // ] // let expr : Parser<Expr> = fun stream -> let stateTag = stream.StateTag let reply = identifierString stream if reply.Status = Ok then match stringToKeyword reply.Result with | Keyword.None -> identifierRest reply.Result stream | Keyword.If -> ifThenElseRest stream | Keyword.Let -> letBindingRest stream // ... elif reply.Status = Error && stateTag = stream.StateTag then // no identifier match stream.Peek() with | '(' -> stream.Skip(); exprInParensRest stream | c when isDigit c -> number stream // ... else // error within identifier string Reply(reply.Status, reply.Error)