5.13 Tips and tricks
5.13.1 Parallel parsing
If your parser grammar is suitable for parallel parsing, parallelizing the parser has the potential to dramatically accelerate parsing on multi‐core machines. In the following we will briefly discuss requirements and strategies for parallelizing an FParsec parser.
For a parser grammar to be well suited for parallel parsing, the grammar and the typical input must satisfy the following two criteria:
- Parts of the input must be independently parseable, i.e. parts must be parseable without knowledge about the other parts.
- These parts must be large enough and easy enough to identify within the total input.
Often, the easiest and most beneficial way to parallelize the parsing stage of an application is to parse multiple input files in parallel. In the simplest case you have multiple independent “compilation units” that can be parsed in parallel. This works even for C/C++, where a badly designed preprocessor generally makes efficient parsing quite hard. In many programming and markup languages you can also parse, in parallel, files that are “included”, “opened” or “imported” within source files. However, this usually only works if the language allows such includes only at well‐defined points in the grammar. In languages like C/C++, where the unstructured text content of other files can be included at essentially arbitrary positions in the source, parsing the included files in parallel is generally quite hard. (In C/C++ it’s even hard to avoid parsing the same file multiple times when it is included multiple times.)
If you’re dealing with large input files or very slow parsers, it might also be worth trying to parse multiple sections within a single file in parallel. For this to be efficient there must be a fast way to find the start and end points of such sections. For example, if you are parsing a large serialized data structure, the format might allow you to easily skip over segments within the file, so that you can chop up the input into multiple independent parts that can be parsed in parallel. Another example could be a programming language whose grammar makes it easy to skip over a complete class or function definition, e.g. by finding the closing brace or by interpreting the indentation. In this case it might be worthwhile not to parse the definitions directly when they are encountered, but instead to skip over them, push their text content into a queue and then process that queue in parallel.
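The following is a rough sketch of this “skip now, parse later” idea; it is not taken from the FParsec distribution. It assumes a hypothetical language whose top‐level definitions are brace‐delimited blocks and whose braces never occur inside strings or comments; Definition, parseDefinition, blockText and parseBlocksInParallel are illustrative placeholders. Instead of an explicit queue, the sketch simply collects the block texts in a list and parses them with Array.Parallel.map.

open FParsec

// placeholder AST type and definition parser; substitute your own
type Definition = Definition of string

let parseDefinition : Parser<Definition, unit> =
    manySatisfy (fun _ -> true) |>> Definition  // placeholder

// returns the text of a balanced brace-delimited block, including the braces
let blockText : Parser<string, unit> =
    let block, blockRef = createParserForwardedToRef()
    blockRef.Value <-
        pstring "{" .>>. manyStrings (block <|> many1Satisfy (isNoneOf "{}")) .>>. pstring "}"
        |>> fun ((openBrace, body), closeBrace) -> openBrace + body + closeBrace
    block

// first pass: cheaply collect the block texts without parsing their contents
let blockTexts = spaces >>. many (blockText .>> spaces) .>> eof

// second pass: parse the collected block texts in parallel
let parseBlocksInParallel (input: string) =
    match run blockTexts input with
    | Success (texts, _, _) ->
        texts |> List.toArray |> Array.Parallel.map (run parseDefinition)
    | Failure (msg, _, _)  -> failwith msg  // or report the error properly

In a real implementation you would also want to record the start position of every block, so that error messages from the second pass can be mapped back to locations in the original file.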
Here are some tips for parallel parsing with FParsec:
- All FParsec parsers are thread‐safe and can be safely applied concurrently to different CharStream instances, as long as you don’t introduce mutable shared state yourself.
- CharStream instances are not thread‐safe, and a single instance must not be accessed concurrently.
- However, you can call the CreateSubstream method to create a substream for a CharStream. A CharStream and its substreams can be safely accessed concurrently.
- If you want to parse multiple files in parallel, you should also create the CharStream instances in parallel, because the CharStream constructors that accept file paths or binary streams perform I/O operations that benefit from parallelization (see the sketch after this list).
- If you parallelize your parser, consider introducing an option for switching off parallel execution, since debugging a multi‐threaded parser is harder than debugging a single‐threaded one.
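For illustration, here is a minimal sketch of parallel file parsing. The top‐level parser document and its result type Doc are hypothetical placeholders for your own grammar and AST; only runParserOnFile and Array.Parallel.map are assumed from FParsec and FSharp.Core.

open FParsec

// placeholder AST type and top-level parser; substitute your own
type Doc = Doc of string list

let document : Parser<Doc, unit> =
    manyTill (restOfLine true) eof |>> Doc  // placeholder grammar

let parseFile path =
    // runParserOnFile constructs the CharStream for the file itself, so calling
    // it from parallel tasks also parallelizes the I/O performed by the
    // CharStream constructors.
    runParserOnFile document () path System.Text.Encoding.UTF8

// parses all files concurrently; the results come back in input order
let parseFilesInParallel (paths: string[]) =
    paths |> Array.Parallel.map parseFile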
5.13.2 Dispatching parsers through a dictionary
A technique that is often useful for making a parser modular and easily extensible is to store Parser functions in dictionaries and then to delegate parsing to one of the Parser functions in the dictionary based on the input.
For example, a parser for a markup language could be implemented by defining a generic tag parser that delegates the parsing of the tagged content to a specific parser for the respective tag name. The following code shows how this could be done:
open FParsec
open System.Collections.Generic

// For simplicity we don't define a full-blown markup language here,
// just a parser for two simple non-recursive "tags" in square brackets.
// The chapter on "parsing with user state" contains a slightly more developed
// sample for a markup language, though without a dictionary-based tag parser.

type Tag = Bold of string
         | Url of string * string

// We store the tag parser dictionary in the user state, so that we can
// concurrently parse multiple input streams with the same parser instance
// but different tag dictionaries.
type TagParserMap = Dictionary<string,Parser<Tag,UserState>>
and UserState = { TagParsers: TagParserMap }

let defaultTagParsers = TagParserMap()

let isTagNameChar1 = fun c -> isLetter c || c = '_'
let isTagNameChar  = fun c -> isTagNameChar1 c || isDigit c
let expectedTag = expected "tag starting with '['"

let tag : Parser<Tag, UserState> =
    fun stream ->
        if stream.Skip('[') then
            let name = stream.ReadCharsOrNewlinesWhile(isTagNameChar1, isTagNameChar, false)
            if name.Length <> 0 then
                let mutable p = Unchecked.defaultof<_>
                if stream.UserState.TagParsers.TryGetValue(name, &p) then p stream
                else
                    stream.Skip(-name.Length)
                    Reply(Error, messageError ("unknown tag name '" + name + "'"))
            else Reply(Error, expected "tag name")
        else Reply(Error, expectedTag)

let str s = pstring s
let ws = spaces

let text = manySatisfy (function '['|']' -> false | _ -> true)

defaultTagParsers.Add("b", str "]" >>. text .>> str "[/b]" |>> Bold)

defaultTagParsers.Add("url", (str "=" >>. manySatisfy ((<>)']') .>> str "]")
                             .>>. (text .>> str "[/url]") |>> Url)

let parseTagString str =
    runParserOnString tag {TagParsers = TagParserMap(defaultTagParsers)} "" str
> parseTagString "[b]bold text[/b]";;
val it : ParserResult<Tag,UserState> = Success: Bold "bold text"

> parseTagString "[url=http://tryfsharp.org]try F#[/url]";;
val it : ParserResult<Tag,UserState> =
  Success: Url ("http://tryfsharp.org","try F#")

> parseTagString "[bold]test[/bold]";;
val it : ParserResult<Tag,UserState> =
  Failure:
Error in Ln: 1 Col: 2
[bold]test[/bold]
 ^
unknown tag name 'bold'
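Because the dictionary travels in the user state, each run (and hence each CharStream) can use its own set of tag parsers without touching the code above. The following helper, parseTagStringWith, is a hypothetical addition to the sample that registers extra tag parsers on top of the defaults before running the tag parser:

// hypothetical helper, building on the definitions above
let parseTagStringWith (extraTags: seq<string * Parser<Tag, UserState>>) str =
    let tagParsers = TagParserMap(defaultTagParsers)
    for name, p in extraTags do
        tagParsers.[name] <- p
    runParserOnString tag {TagParsers = tagParsers} "" str

// e.g. registering a hypothetical [mono]...[/mono] tag (reusing the Bold case
// just for brevity):
// parseTagStringWith ["mono", str "]" >>. text .>> str "[/mono]" |>> Bold]
//                    "[mono]monospaced[/mono]"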
5.13.3 Memoizing parsers
If your parser implementation backtracks a lot when parsing typical inputs and as a result repeatedly applies some Parser functions at the same input position, it can be beneficial to memoize these Parser functions, i.e. cache their results for each input position.
In the extreme case, memoization can mean the difference between linear and exponential execution times. In practice, however, FParsec is typically used for formal grammars that hardly require any extensive backtracking, so memoization would usually only have a negative effect on performance.
In situations where you really do need to memoize parsers, you can work with a generic memoize combinator like the one in the following example:
open FParsec
open System.Collections.Generic

// We need a place to store the cached parser results. Since we want parser
// instances to be able to concurrently access different caches for different
// input streams, we will use a user state variable for this purpose. Since we
// don't want the backtracking to undo changes to the cache, we will use a
// mutable dictionary for this purpose.
type UserState = {
    MemoCache: Dictionary<MemoKey, obj>
    // ...
}

// An entry in the MemoCache must be uniquely identified by its MemoKey. In this
// example the MemoKey includes the stream index value and a reference to the
// memoized parser instance. Should the result of a memoized Parser function in
// your implementation also depend on the UserState value, you will have to
// extend the MemoKey with a UserState member. Similarly, if you want to cache
// results for more than one stream in the MemoCache, you'll have to extend the
// MemoKey with an identifier for the stream.
and [<CustomEquality; NoComparison>] MemoKey = struct
        new (parser: obj, stream: CharStream) =
            {Parser = parser; Index = stream.Index}

        val Parser: obj
        val Index: int64

        interface System.IEquatable<MemoKey> with
            member t.Equals(other: MemoKey) =
                t.Index = other.Index && t.Parser = other.Parser

        override t.Equals(otherObj: obj) =
            match otherObj with
            | :? MemoKey as other -> t.Index = other.Index && t.Parser = other.Parser
            | _ -> false

        override t.GetHashCode() = int32 t.Index
    end

/// Returns a memoized version of the argument parser
let memoize (p: Parser<'a,UserState>) : Parser<'a,UserState> =
    fun stream ->
        let key = MemoKey(p, stream)
        let memoCache = stream.UserState.MemoCache
        let mutable boxedReply = null
        if memoCache.TryGetValue(key, &boxedReply) then
            boxedReply :?> Reply<'a>
        else
            let reply = p stream
            memoCache.Add(key, box reply)
            reply
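As a small usage sketch (not part of the example above): since each input stream needs its own cache, the user state has to be initialized with a fresh dictionary for every parser run. Assuming UserState has no further members besides MemoCache, a hypothetical run helper could look like this:

// hypothetical helper, building on the definitions above:
// runs a parser with a fresh, empty memo cache
let runWithMemoCache (p: Parser<'a, UserState>) (input: string) =
    runParserOnString p {MemoCache = Dictionary<MemoKey, obj>()} "" input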
5.13.4 Parsing F# infix operators
F# supports user‐definable infix operators whose precedence and associativity depend on the first chars of the operator name. For example, the F# spec states that operators that start with * are left‐associative, while operators that start with ** are right‐associative and have a higher precedence, so that 1*2*.3**4**.5 is parsed as ((1*2)*.(3**(4**.5))).
Since the precedence and associativity rules are fixed, you can parse F# expressions with a static operator precedence grammar, i.e. without having to reconfigure the parser when a new operator is defined in the parsed source code. However, it’s probably not immediately obvious how to do this with FParsec’s OperatorPrecedenceParser class (OPP), since the OPP normally expects all possible operators to be (individually) specified before they are used.
The trick to supporting whole classes of operator names without having to reconfigure the OPP at run‐time is to shift part of the operator parsing to the after‐string‐parser, like in the following example:
open FParsec

type Expr = InfixOpExpr of string * Expr * Expr
          | Number of int

let ws = spaces // whitespace parser

let isSymbolicOperatorChar = isAnyOf "!%&*+-./<=>@^|~?"
let remainingOpChars_ws = manySatisfy isSymbolicOperatorChar .>> ws

let opp = new OperatorPrecedenceParser<Expr, string, unit>()
opp.TermParser <- pint32 .>> ws |>> Number

// a helper function for adding infix operators to opp
let addSymbolicInfixOperators prefix precedence associativity =
    let op = InfixOperator(prefix, remainingOpChars_ws, precedence, associativity, (),
                           fun remOpChars expr1 expr2 ->
                               InfixOpExpr(prefix + remOpChars, expr1, expr2))
    opp.AddOperator(op)

// the operator definitions:
addSymbolicInfixOperators "*"  10 Associativity.Left
addSymbolicInfixOperators "**" 20 Associativity.Right
// ...
> run opp.ExpressionParser "1*2*.3**4**.5";;
val it : ParserResult<Expr,unit> =
  Success: InfixOpExpr
    ("*.", InfixOpExpr ("*", Number 1, Number 2),
     InfixOpExpr ("**", Number 3, InfixOpExpr ("**.", Number 4, Number 5)))
If you use the after‐string‐parser in this manner for operators that can lead to operator conflicts in the input, e.g. non‐associative operators, then you also need to replace the default OperatorConflictErrorFormatter, since otherwise the default formatter may print truncated operator names:
addSymbolicInfixOperators "<" 1 Associativity.None
> run opp.ExpressionParser "1 <= 2 <=. 3";;
val it : ParserResult<Expr,unit> =
  Failure:
Error in Ln: 1 Col: 9
1 <= 2 <=. 3
        ^
The infix operator '<' (precedence: 1, non-associative) conflicts with the infix operator '<' (precedence: 1, non-associative) on the same line at column 3.
An error formatter that prints the full operator names could look like the following:
opp.OperatorConflictErrorFormatter <-
    fun (pos1, op1, afterString1) (pos2, op2, afterString2) ->
        let msg = sprintf "The operator '%s' conflicts with the previous operator '%s' at %A."
                          (op2.String + afterString2) (op1.String + afterString1) pos1
        messageError msg
> run opp.ExpressionParser "1 <= 2 <=. 3";;
val it : ParserResult<Expr,unit> =
  Failure:
Error in Ln: 1 Col: 9
1 <= 2 <=. 3
        ^
The operator '<=.' conflicts with the previous operator '<=' at (Ln: 1, Col: 3).