5.3 Internals of a simple Parser
function
In the beginning of this user’s guide we noted that asciiLower
“parses” a lower case ASCII char and that skipString
“skips” over a string, but we haven’t yet explained what it actually means for a Parser
function to “parse”
a letter or “skip” a string. That’s what we will do in this chapter. To explain how Parser
functions work, we will discuss the implementation
of a simple string parser. This also gives us the opportunity to explain some important details about the Reply
and CharStream
types.
5.3.1 The code
The parser whose implementation we will discuss in this chapter is
val stringReturn: string -> 'a -> Parser<'a,'u>
Like skipString str
the parser stringReturn str result
skips over the string str
, but it returns result
as part of its reply value, instead of the unit
value ()
that skipString str
returns. This makes stringReturn
a bit more general than skipString
. Indeed, the two library parsers pstring
and skipString
are actually implemented with the help of stringReturn
. For example, skipString str
is defined as stringReturn str ()
.
A simplified version[1] of the actual implementation of stringReturn
in the library is
let stringReturn str result : Parser<_,_> = // 1 checkStringContainsNoNewlineChar str "pstring/skipString/stringReturn" // 2 let error = expectedString str // 3 fun stream -> // 4 if stream.Skip(str) then // 5 Reply(result) // 6 else // 7 Reply(Error, error) // 8
Let’s start with the general structure of this implementation: We define a function stringReturn
with two parameters that returns a function closure. The type annotation : Parser<_,_>
on line 1 fixes
the type of the returned function closure to Parser<'a,'u>
and in particular constrains the type of its argument to CharStream<'u>
. Remember, the type Parser<'a,'u>
is simply an abbreviation for CharStream<'u> -> Reply<'a>
, where 'a
represents the result type and 'u
the
user state type.
Implementing our parameterized parser as a function returning a parser closure allows us to factor out common setup work that only needs to be
done once for every parser.[2] In this case we only need to
check once (line 2) whether the string contains a newline char, i.e. '\r'
or '\n'
, (we’ll explain below why this is necessary) and in line 3 we preconstruct the error message that is later used whenever the parser is applied and doesn’t find str
in the input (we’ll write more about error messages in later chapters).
The actual parsing logic is completely straightforward: On line 5 the parser calls the CharStream’s Skip
method with the argument
str
. If the next chars in the stream match str
, Skip
advances the stream’s position by the length of the passed string and returns true
; otherwise, it doesn’t change the position of the stream and returns false
. Thus, if the string is skipped, the parser returns with a Reply
value containing the result (line 6). Otherwise,
it returns a Reply
with the preconstructed error
message (line 8).
5.3.2 The Reply
type
This is a good time to discuss the Reply
type in a
little more detail.
type Reply<'TResult> = struct new: 'TResult -> Reply<'TResult> new: ReplyStatus * ErrorMessageList -> Reply<'TResult> new: ReplyStatus * 'TResult * ErrorMessageList -> Reply<'TResult> val mutable Status: ReplyStatus /// If Status <> Ok then the Result value is undefined and may be null. val mutable Result: 'TResult val mutable Error: ErrorMessageList end
Similar to a tuple, a Reply
can be seen as an
aggregate of it three fields: Status
,
Result
and Error
. The Status
field contains a ReplyStatus
enum value indicating whether the parser succeeded (Ok
) or failed (Error
or FatalError
). By returning a FatalError
instead of an Error
a parser can signal that no error recovery should be
tried (except through backtracking mechanisms, which we explain later). If the Status
is Ok
, the Result
field contains the parser result; otherwise, its value
is undefined (and null
). The Error
field holds a list of error messages in the form of
an ErrorMessageList
value. An empty ErrorMessageList
is represented as a null
value.
The 1‐argument constructor we use in line 6 sets the status to Ok
and the result value to result
. The 2‐argument constructor we use in line 8 sets the status to Error
and the error message to error
. The Reply
type also
defines a 3‐argument constructor, which simply sets the fields to the respective argument values. The default valuetype constructor with 0
arguments initializes the Reply
value to Reply(Error, null)
.
The error messages returned in the Reply
value
implicitly refer to the current stream position. Since the ErrorMessage
values stored in the ErrorMessageList
do not themselves contain an
error position, they can only be interpreted together with the position of the CharStream
as it is when the parser returns.
5.3.3 The parser state and the line and column count
Usually one CharStream<'u>
instance is created per input file and all parser functions
involved in parsing elements of the same file are passed the same CharStream
instance. Since calling the methods of a CharStream
may change its state, parser functions have to be
careful about when and how they change the CharStream
state, because it obviously may affect all parsers subsequently called.
In the example above, stringReturn
only advances the stream position when it succeeds. This makes it an atomic string parser,
because it does not consume input if only the beginning of the argument string matches the input. Whether or not a parser consumes input before
it fails has important implications for the error handling, as we will discuss later in this user’s guide.
Except for the freely customizable UserState
, all the mutable state information in the CharStream<'u>
instance pertains to the location of the next char in the text stream. The most
important element of the state is the char Index
, which uniquely identifies the UTF‐16 char in the stream. In addition to the index of the next char, the
CharStream
also keeps track of char’s
Line
number and the index
of the first char in the line, the LineBegin
. By combining the Index
and LineBegin
we can calculate a Column
. The CharStream
’s Name
serves as a description or identifier for
the stream.
Only the char index is strictly necessary for the core stream functionality. We also store the other pieces of state information in a CharStream<'u>
instance because having all parser state information in one place reduces complexity
and allows us to expose a more convenient API to Parser
functions.
The CharStream<'u>.State
property returns a snapshot of all
the mutable state components in the form of a CharStreamState<'u>
value.
The state information that is exposed through the CharStream<'u>.State
property is
all the state that is tracked by FParsec
parsers, which is why we also refer to
it as the parser state.[3]
Ideally, the CharStream
class would
keep track of the column and line count in a completely automated fashion. Ideally, the CharStream
class would give the user a way to freely specify
the recognized set of newline character sequences and all CharStream
methods then would automatically detect such newlines in the input. Unfortunately, such a configuration
option would be difficult to implement efficiently and would likely have a severe impact on performance (at least in comparison to the
hard‐coded alternative, and with the current language and compiler support).
Since the CharStream
can’t provide
automatic support for all possible notions of a newline, it exposes two sets of methods in its interface. One set provides the basic stream
operations, such as skipping a certain number of UTF‐16 chars or matching a string with the stream content. These methods come without any
automatic newline detection, but they offer optimal performance and give the user complete freedom to manually register any kind of newline. The
other set of methods provides some frequently needed higher‐level text operations, such as skipping over a sequence of whitespace chars or
reading a sequence of chars satisfying a given predicate function. These other methods automatically detect any of the 3 standard newline char
sequences "\n"
,
"\r\n"
and "\r"
, because that’s the notion of a newline used by most text applications. In combination both sets of methods
cover the needs of a majority of text parsers in a convenient and efficient manner.
Maybe you wonder why we don’t just leave the line and column count completely to the user instead of complicating the CharStream
API. The reason we keep track of a line count in
the CharStream
class is that most
non‐trivial text‐parsing applications require a line count for error reporting purposes. Implementing it at a relatively low API level brings
significant performance advantages and relieves higher‐level API users from constantly having to code around the special case of newline
chars.
If you have a look at the reference documentation for CharStream
, you’ll see that the CharStream
methods that automatically detect newlines are easily discernible by their name. The Skip
method we used in the
above example does not belong to these methods, which is why we have to make sure in line 2 that the string
doen’t contain any newlines. In practice one hardly ever uses a parser like stringReturn
with a string containing a newline,
hence lifting this restriction wouldn’t be worth the effort, especially since simple workarounds are available.[4]
[1] | The library version is a bit more complicated because it contains optimized paths for argument strings with only 1 or 2 chars. |
---|---|
[2] |
Even parsers without a parameter, like e.g. asciiLower , are actually compiled as properties returning a new function object every time they are called. This is
because the user state type variable makes asciiLower generic, while function values can only have a non‐generic type.
|
[3] |
Strictly speaking, a CharStream<'a> instance has a little more publically observable mutable
state than the one that is also exposed through the State property. For example, the MinRegexSpace configuration parameter is
not tracked in the State parameter. Another example is the value of the IndexOfLastCharPlus1 property
which changes once the last char of the stream is detected. However, there shouldn’t be a reason that a parser needs to restore the old values
of these properties upon backtracking, so we just treat these properties as constant and ignore them when we discuss the mutable CharStream state.
|
[4] |
For example, stringReturn "str1\nstr2" result
can be replaced with attempt (skipString "str1" >>. newline >>. stringReturn "str2" result) .
|