5.3 Internals of a simple Parser function

In the beginning of this user’s guide we noted that asciiLower “parses” a lower case ASCII char and that skipString “skips” over a string, but we haven’t yet explained what it actually means for a Parser function to “parse” a letter or “skip” a string. That’s what we will do in this chapter. To explain how Parser functions work, we will discuss the implementation of a simple string parser. This also gives us the opportunity to explain some important details about the Reply and CharStream types.

5.3.1 The code

The parser whose implementation we will discuss in this chapter is

val stringReturn: string -> 'a -> Parser<'a,'u>

Like skipString str the parser stringReturn str result skips over the string str, but it returns result as part of its reply value, instead of the unit value () that skipString str returns. This makes stringReturn a bit more general than skipString. Indeed, the two library parsers pstring and skipString are actually implemented with the help of stringReturn. For example, skipString str is defined as stringReturn str ().

A simplified version[1] of the actual implementation of stringReturn in the library is

let stringReturn str result : Parser<_,_> =                                // 1
    checkStringContainsNoNewlineChar str "pstring/skipString/stringReturn" // 2
    let error = expectedString str                                         // 3
    fun stream ->                                                          // 4
        if stream.Skip(str) then                                           // 5
            Reply(result)                                                  // 6
        else                                                               // 7
            Reply(Error, error)                                            // 8

Let’s start with the general structure of this implementation: We define a function stringReturn with two parameters that returns a function closure. The type annotation : Parser<_,_> on line 1 fixes the type of the returned function closure to Parser<'a,'u> and in particular constrains the type of its argument to CharStream<'u>. Remember, the type Parser<'a,'u> is simply an abbreviation for CharStream<'u> -> Reply<'a>, where 'a represents the result type and 'u the user state type.

Implementing our parameterized parser as a function returning a parser closure allows us to factor out common setup work that only needs to be done once for every parser.[2] In this case we only need to check once (line 2) whether the string contains a newline char, i.e. '\r' or '\n', (we’ll explain below why this is necessary) and in line 3 we preconstruct the error message that is later used whenever the parser is applied and doesn’t find str in the input (we’ll write more about error messages in later chapters).

The actual parsing logic is completely straightforward: On line 5 the parser calls the CharStream’s Skip method with the argument str. If the next chars in the stream match str, Skip advances the stream’s position by the length of the passed string and returns true; otherwise, it doesn’t change the position of the stream and returns false. Thus, if the string is skipped, the parser returns with a Reply value containing the result (line 6). Otherwise, it returns a Reply with the preconstructed error message (line 8).

5.3.2 The Reply type

This is a good time to discuss the Reply type in a little more detail.

type Reply<'TResult> = struct
  new: 'TResult -> Reply<'TResult>
  new: ReplyStatus * ErrorMessageList -> Reply<'TResult>
  new: ReplyStatus * 'TResult * ErrorMessageList -> Reply<'TResult>

  val mutable Status: ReplyStatus
  /// If Status <> Ok then the Result value is undefined and may be null.
  val mutable Result: 'TResult
  val mutable Error: ErrorMessageList
end

Similar to a tuple, a Reply can be seen as an aggregate of it three fields: Status, Result and Error. The Status field contains a ReplyStatus enum value indicating whether the parser succeeded (Ok) or failed (Error or FatalError). By returning a FatalError instead of an Error a parser can signal that no error recovery should be tried (except through backtracking mechanisms, which we explain later). If the Status is Ok, the Result field contains the parser result; otherwise, its value is undefined (and null). The Error field holds a list of error messages in the form of an ErrorMessageList value. An empty ErrorMessageList is represented as a null value.

The 1‐argument constructor we use in line 6 sets the status to Ok and the result value to result. The 2‐argument constructor we use in line 8 sets the status to Error and the error message to error. The Reply type also defines a 3‐argument constructor, which simply sets the fields to the respective argument values. The default valuetype constructor with 0 arguments initializes the Reply value to Reply(Error, null).

The error messages returned in the Reply value implicitly refer to the current stream position. Since the ErrorMessage values stored in the ErrorMessageList do not themselves contain an error position, they can only be interpreted together with the position of the CharStream as it is when the parser returns.

5.3.3 The parser state and the line and column count

Usually one CharStream<'u> instance is created per input file and all parser functions involved in parsing elements of the same file are passed the same CharStream instance. Since calling the methods of a CharStream may change its state, parser functions have to be careful about when and how they change the CharStream state, because it obviously may affect all parsers subsequently called.

In the example above, stringReturn only advances the stream position when it succeeds. This makes it an atomic string parser, because it does not consume input if only the beginning of the argument string matches the input. Whether or not a parser consumes input before it fails has important implications for the error handling, as we will discuss later in this user’s guide.

Except for the freely customizable UserState, all the mutable state information in the CharStream<'u> instance pertains to the location of the next char in the text stream. The most important element of the state is the char Index, which uniquely identifies the UTF‐16 char in the stream. In addition to the index of the next char, the CharStream also keeps track of char’s Line number and the index of the first char in the line, the LineBegin. By combining the Index and LineBegin we can calculate a Column. The CharStream’s Name serves as a description or identifier for the stream.

Only the char index is strictly necessary for the core stream functionality. We also store the other pieces of state information in a CharStream<'u> instance because having all parser state information in one place reduces complexity and allows us to expose a more convenient API to Parser functions.

Note

The CharStream<'u>.State property returns a snapshot of all the mutable state components in the form of a CharStreamState<'u> value.

The state information that is exposed through the CharStream<'u>.State property is all the state that is tracked by FParsec parsers, which is why we also refer to it as the parser state.[3]

Ideally, the CharStream class would keep track of the column and line count in a completely automated fashion. Ideally, the CharStream class would give the user a way to freely specify the recognized set of newline character sequences and all CharStream methods then would automatically detect such newlines in the input. Unfortunately, such a configuration option would be difficult to implement efficiently and would likely have a severe impact on performance (at least in comparison to the hard‐coded alternative, and with the current language and compiler support).

Since the CharStream can’t provide automatic support for all possible notions of a newline, it exposes two sets of methods in its interface. One set provides the basic stream operations, such as skipping a certain number of UTF‐16 chars or matching a string with the stream content. These methods come without any automatic newline detection, but they offer optimal performance and give the user complete freedom to manually register any kind of newline. The other set of methods provides some frequently needed higher‐level text operations, such as skipping over a sequence of whitespace chars or reading a sequence of chars satisfying a given predicate function. These other methods automatically detect any of the 3 standard newline char sequences "\n", "\r\n" and "\r", because that’s the notion of a newline used by most text applications. In combination both sets of methods cover the needs of a majority of text parsers in a convenient and efficient manner.

Note

Maybe you wonder why we don’t just leave the line and column count completely to the user instead of complicating the CharStream API. The reason we keep track of a line count in the CharStream class is that most non‐trivial text‐parsing applications require a line count for error reporting purposes. Implementing it at a relatively low API level brings significant performance advantages and relieves higher‐level API users from constantly having to code around the special case of newline chars.

If you have a look at the reference documentation for CharStream, you’ll see that the CharStream methods that automatically detect newlines are easily discernible by their name. The Skip method we used in the above example does not belong to these methods, which is why we have to make sure in line 2 that the string doen’t contain any newlines. In practice one hardly ever uses a parser like stringReturn with a string containing a newline, hence lifting this restriction wouldn’t be worth the effort, especially since simple workarounds are available.[4]

Footnotes:
[1] The library version is a bit more complicated because it contains optimized paths for argument strings with only 1 or 2 chars.
[2] Even parsers without a parameter, like e.g. asciiLower, are actually compiled as properties returning a new function object every time they are called. This is because the user state type variable makes asciiLower generic, while function values can only have a non‐generic type.
[3] Strictly speaking, a CharStream<'a> instance has a little more publically observable mutable state than the one that is also exposed through the State property. For example, the MinRegexSpace configuration parameter is not tracked in the State parameter. Another example is the value of the IndexOfLastCharPlus1 property which changes once the last char of the stream is detected. However, there shouldn’t be a reason that a parser needs to restore the old values of these properties upon backtracking, so we just treat these properties as constant and ignore them when we discuss the mutable CharStream state.
[4] For example, stringReturn "str1\nstr2" result can be replaced with attempt (skipString "str1" >>. newline >>. stringReturn "str2" result).