5.13 Tips and tricks

5.13.1 Parallel parsing

If your parser grammar is suitable for parallel parsing, parallelizing the parser has the potential to dramatically accelerate parsing on multi‐core machines. In the following we will shortly discuss requirements and strategies for parallelizing an FParsec parser.

For a parser grammar to be well suited for parallel parsing, the grammar and the typical input must satisfy the following two criteria:

  • Parts of the input must be independently parseable, i.e. parts must be parseable without knowlege about the other parts.
  • These parts must be large enough and easily enough identifiable within the total input.

Often, the easiest and most beneficial way to parallelize the parsing stage of an application is to parse multiple input files in parallel. In the simplest case you have multiple independent “compilation units” that can be parsed in parallel. This works even for C/C++, where a badly designed preprocesser generally makes efficient parsing quite hard. In many programming languages and markup languages you can also parse in parallel files that are “included”, “opened” or “imported” within source files. However, this usually only works if the language allows such includes only at well‐defined points in the grammar. In languages like C/C++, where the unstructured text content of other files can be included at essentially arbitrary positions in the source, parsing the included files in parallel is generally quite hard. (In C/C++ it’s even hard to avoid parsing the same file multiple times when it is included multiple times).

If you’re dealing with large input files or very slow parsers, it might also be worth trying to parse multiple sections within a single file in parallel. For this to be efficient there must be a fast way to find the start and end points of such sections. For example, if you are parsing a large serialized data structure, the format might allow you to easily skip over segments within the file, so that you can chop up the input into multiple independent parts that can be parsed in parallel. Another example could be a programming languages whose grammar makes it easy to skip over a complete class or function definition, e.g. by finding the closing brace or by interpreting the indentation. In this case it might be worth not to parse the definitions directly when they are encountered, but instead to skip over them, push their text content into a queue and then to process that queue in parallel.

Here are some tips for parallel parsing with FParsec:

  • All FParsec parsers are thread‐safe and can be safely applied concurrently to different CharStream instances, as long as you don’t introduce mutable shared state yourself.
  • CharStream instances are not thread‐safe and a single instance must not be accessed concurrently.
  • However, you can call the CreateSubstream method to create a substream for a CharStream. A CharStream and its substreams can be safely accessed concurrently.
  • If you want to parse multiple files in parallel, you should also create the CharStream instances in parallel, because the CharStream constructors that accept file paths or binary streams perform I/O operations that benefit from parallelization.
  • If you parallelize your parser, consider introducing an option for switching off parallel execution, since debugging a multi‐threaded parser is harder than debugging a single‐threaded one.

5.13.2 Dispatching parsers through a dictionary

A technique that is often useful for making a parser modular and easily extensible is to store Parser functions in dictionaries and then to delegate parsing to one of the Parser functions in the dictionary based on the input.

For example, a parser for a markup language could be implemented by defining a generic tag parser that delegates the parsing of the tagged content to a specific parser for the respective tag name. The following code shows how this could be done:

open FParsec
open System.Collections.Generic

// For simplicity we don't define a full-blown markup language here,
// just a parser for two simple non-recursive "tags" in square brackets.
// The chapter on "parsing with user state" contains a slightly more developed
// sample for a markup language, though without a dictionary-based tag parser.

type Tag = Bold of string
         | Url of string * string

// We store the tag parser dictionary in the user state, so that we can
// concurrently parse multiple input streams with the same parser instance
// but differerent tag dictionaries.

type TagParserMap = Dictionary<string,Parser<Tag,UserState>>

and UserState = {
        TagParsers: TagParserMap
     }

let defaultTagParsers = TagParserMap()

let isTagNameChar1 = fun c -> isLetter c || c = '_'
let isTagNameChar = fun c -> isTagNameChar1 c || isDigit c
let expectedTag = expected "tag starting with '['"

let tag : Parser<Tag, UserState> =
  fun stream ->
    if stream.Skip('[') then
        let name = stream.ReadCharsOrNewlinesWhile(isTagNameChar1, isTagNameChar,
                                                   false)
        if name.Length <> 0 then
            let mutable p = Unchecked.defaultof<_>
            if stream.UserState.TagParsers.TryGetValue(name, &p) then p stream
            else
                stream.Skip(-name.Length)
                Reply(Error, messageError ("unknown tag name '" + name + "'"))
        else Reply(Error, expected "tag name")
    else Reply(Error, expectedTag)

let str s = pstring s
let ws = spaces
let text = manySatisfy (function '['|']' -> false | _ -> true)

defaultTagParsers.Add("b", str "]" >>. text .>> str "[/b]" |>> Bold)

defaultTagParsers.Add("url",      (str "=" >>. manySatisfy ((<>)']') .>> str "]")
                             .>>. (text .>> str "[/url]")
                             |>> Url)

let parseTagString str =
    runParserOnString tag {TagParsers = TagParserMap(defaultTagParsers)} "" str

> parseTagString "[b]bold text[/b]";;
val it : ParserResult<Tag,UserState> = Success: Bold "bold text"

> parseTagString "[url=http://tryfsharp.org]try F#[/url]";;
val it : ParserResult<Tag,UserState> =
  Success: Url ("http://tryfsharp.org","try F#")

> parseTagString "[bold]test[/bold]";;
val it : ParserResult<Tag,UserState> = Failure:
Error in Ln: 1 Col: 2
[bold]test[/bold]
 ^
unknown tag name 'bold'

5.13.3 Memoizing parsers

If your parser implementation backtracks a lot when parsing typical inputs and as a result repeatedly applies some Parser functions at the same input position, it can be beneficial to memoize these Parser functions, i.e. cache their results for each input position.

In the extreme case, memoization can mean the difference between linear and exponential execution times. In practice, FParsec is typically used for formal grammars that hardly require any extensive backtracking, so that memoization would usually only have a negative affect on performance.

In situation where you really do need to memoize parsers, you can work with a generic memoize combinator like the one in the following example:

open FParsec
open System.Collections.Generic

// We need a place to store the cached parser results. Since we want parser
// instances to be able to concurrently access different caches for different
// input streams, we will use a user state variable for this purpose. Since we
// don't want the backtracking to undo changes to the cache,  we will use a
// mutable dictionary for this purpose.

type UserState = {
        MemoCache: Dictionary<MemoKey, obj>
        // ...
    }

// An entry in the MemoCache must be uniquely identified by its MemoKey. In this
// example the MemoKey includes the stream index value and a reference to the
// memoized parser instance. Should the result of a memoized Parser function in
// your implementation also depend on the UserState value, you will have to
// extend the MemoKey with a UserState member. Similarly, if you want to cache
// results for more than one stream in the MemoCache, you'll have to extend the
// MemoKey with an identifier for the stream.

and [<CustomEquality; NoComparison>]
    MemoKey = struct
        new (parser: obj, stream: CharStream) =
            {Parser = parser; Index = stream.Index}

        val Parser: obj
        val Index: int64

        interface System.IEquatable<MemoKey> with
            member t.Equals(other: MemoKey) =
                t.Index = other.Index && t.Parser = other.Parser

        override t.Equals(otherObj: obj) =
            match otherObj with
            | :? MemoKey as other ->
                t.Index = other.Index && t.Parser = other.Parser
            | _ -> false

        override t.GetHashCode() = int32 t.Index
end

/// Returns a memoized version of the argument parser
let memoize (p: Parser<'a,UserState>) : Parser<'a,UserState> =
    fun stream ->
        let key = MemoKey(p, stream)
        let memoCache = stream.UserState.MemoCache
        let mutable boxedReply = null
        if memoCache.TryGetValue(key, &boxedReply) then
            boxedReply :?> Reply<'a>
        else
            let reply = p stream
            memoCache.Add(key, box reply)
            reply

5.13.4 Parsing F# infix operators

F# supports user‐definable infix operators whose precedence and associativity depend on the first chars of the operator name. For example, the F# spec states that operators that start with * are left‐associative, while operators that start with ** are right associative and have a higher precedence, so that 1*2*.3**4**.5 is parsed as ((1*2)*.(3**(4 **.5))).

Since the precedence and associativity rules are fixed, you can parse F# expressions with a static operator precedence grammar, i.e. without having to reconfigure the parser when a new operator is defined in the parsed source code. However, it’s probably not immediately obvious how to do this with FParsec’s OperatorPrecedenceParser class (OPP), since the OPP normally expects all possible operators to be (individually) specified before they are used.

The trick to supporting whole classes of operator names without having to reconfigure the OPP at run‐time is to shift part of the operator parsing to the after‐string‐parser, like in the following example:

open FParsec

type Expr = InfixOpExpr of string * Expr * Expr
          | Number of int

let ws  = spaces  // whitespace parser

let isSymbolicOperatorChar = isAnyOf "!%&*+-./<=>@^|~?"
let remainingOpChars_ws = manySatisfy isSymbolicOperatorChar .>> ws

let opp = new OperatorPrecedenceParser<Expr, string, unit>()
opp.TermParser <- pint32 .>> ws |>> Number

// a helper function for adding infix operators to opp
let addSymbolicInfixOperators prefix precedence associativity =
    let op = InfixOperator(prefix, remainingOpChars_ws,
                           precedence, associativity, (),
                           fun remOpChars expr1 expr2 ->
                               InfixOpExpr(prefix + remOpChars, expr1, expr2))
    opp.AddOperator(op)

// the operator definitions:
addSymbolicInfixOperators "*"  10 Associativity.Left
addSymbolicInfixOperators "**" 20 Associativity.Right
// ...

> run opp.ExpressionParser "1*2*.3**4**.5";;
val it : ParserResult<Expr,unit> = Success
InfixOpExpr
  ("*.", InfixOpExpr ("*", Number 1, Number 2),
         InfixOpExpr ("**", Number 3, InfixOpExpr ("**.", Number 4, Number 5)))

If you use the after‐string‐parser in this manner for operators that can lead to operator conflicts in the input, e.g. non‐associative operators, then you also need to replace the default OperatorConflictErrorFormatter, since otherwise the default formatter may print truncated operator names:

addSymbolicInfixOperators "<"  1 Associativity.None
> run opp.ExpressionParser "1 <= 2 <=. 3";;
val it : ParserResult<Expr,unit> = Failure:
Error in Ln: 1 Col: 9
1 <= 2 <=. 3
        ^
The infix operator '<' (precedence: 1, non-associative) conflicts with the
infix operator '<' (precedence: 1, non-associative) on the same line at column 3.

An error formatter that prints the full operator names could look like the following:

opp.OperatorConflictErrorFormatter <-
  fun (pos1, op1, afterString1) (pos2, op2, afterString2) ->
    let msg = sprintf "The operator '%s' conflicts with the previous operator '%s' at %A."
                       (op2.String + afterString2)
                       (op1.String + afterString1) pos1
    messageError msg
> run opp.ExpressionParser "1 <= 2 <=. 3";;
val it : ParserResult<Expr,unit> = Failure:
Error in Ln: 1 Col: 9
1 <= 2 <=. 3
        ^
The operator '<=.' conflicts with the previous operator '<=' at (Ln: 1, Col: 3).