5.9 Parsing with user state

Each CharStream<'u> holds a value of the freely definable user state type 'u. In previous chapters we just ignored the user state and always assumed 'u to be unit. In this section we finally get to discuss the purpose of the user state and how you can use it in your parsers.

5.9.1 Overview

The user state allows you to introduce additional variables into the state tracked by FParsec parsers. It has the following two important properties:

  • The user state is stored in the CharStream<'u> instance and hence associated with the input. It is not shared globally and not associated with particular parser instances. The same parser instances can be concurrently applied to different CharStream<'u> instances with different user state instances.
  • The user state is tracked by FParsec parsers together with the input stream position. This means in particular that a parser restores the previous user state value when it backtracks.
Important

If you want changes to the user state to be undone during backtracking, you must change the user state by assigning a new value to the user state, not by mutating an existing user state value.

With the help of the user state you can implement context sensitive parsers, i.e. parsers whose behaviour not only depends on the immediate input but also on the context of the input. In general this works as follows:

  1. You establish a context by defining variables in the user state.
  2. You update the context depending on the input by letting parsers update the user state.
  3. You parse input depending on the context by making the parser behaviour dependent on the user state variables.

The user state is exposed through the UserState property of the CharStream<'u>. You can implement parsers using the low‐level API that directly access this property, or you can use the following parser primitives from the CharParsers module:

The next section contains an example employing updateUserState to change the user state and userStateSatisfies to check for parser preconditions.

5.9.2 Recursive grammars with nesting restrictions

An important area of application for context sensitive parsers are recursive grammars where certain grammar elements cannot nest within others or where grammar elements need to be parsed differently depending on the nesting context.

Consider for example a textual markup languages like HTML. Many such markup languages support various “inline tags” to annotate text in a paragraph. Usually these inline tags can nest arbitrarily, except for a few tags with special restrictions. One of these restrictions often is that hyperlinks must not contain hyperlinks, even though they can contain any other inline content. Other restrictions may apply to elements allowed in superscript text or footnotes. A convenient way to enforce such restrictions during parsing is to introduce variables into the user state that keep track of the nesting context. The following example demonstrates this approach.[1]

The following parser for a tiny markup‐language employs the user state

  1. to ensure that nested hyperlinks are not accepted and
  2. to parse potentially nested quotations between matching pairs of '\'' or '\"' chars.
open FParsec

type Element = Text of string
             | Bold of Element list
             | Italic of Element list
             | Url of string * Element list
             | Quote of char * Element list

type UserState =
    {InLink: bool
     QuoteStack: char list}
    with
       static member Default = {InLink = false; QuoteStack = []}

let ws    = spaces
let ws1   = spaces1
let str s = pstring s

let elements, elementsR = createParserForwardedToRef()

let text = many1Satisfy (isNoneOf "<>'\"\\") |>> Text
let escape = str "\\" >>. (anyChar |>> (string >> Text))

let quote (q: char) =
  let pq = str (string q)

  let pushQuote =
      updateUserState (fun us -> {us with QuoteStack = q::us.QuoteStack})

  let popQuote =
      updateUserState (fun us -> {us with QuoteStack = List.tail us.QuoteStack})

  let isNotInQuote =
      userStateSatisfies (fun us -> match us.QuoteStack with
                                    | c::_ when c = q -> false
                                    | _ -> true)

  isNotInQuote >>. between pq pq
                           (between pushQuote popQuote
                                    (elements |>> fun ps -> Quote(q, ps)))

// helper functions for defining tags

let tagOpenBegin tag =
    str ("<" + tag)
    >>? nextCharSatisfiesNot isLetter // make sure tag name is complete
    <?> "<" + tag + "> tag"

let tagOpen tag = tagOpenBegin tag >>. str ">"
let tagClose tag = str ("</" + tag + ">")

let tag t p f =
    between (tagOpen t) (tagClose t)
            (p |>> f)

let attributeValue =
    ws >>. str "=" >>. ws
    >>. between (str "\"") (str "\"")
                (manySatisfy (isNoneOf "\n\""))

let attribute s = str s >>. attributeValue

let nonNestedTag tag pAttributesAndClose pBody f
                 isInTag setInTag setNotInTag =
    tagOpenBegin tag
    >>. ((fun stream ->
            if not (isInTag stream.UserState) then
                stream.UserState <- setInTag stream.UserState
                Reply(())
            else // generate error at start of tag
                stream.Skip(-tag.Length - 1)
                Reply(FatalError,
                      messageError ("Nested <" + tag + "> tags are not allowed.")))
         >>. pipe2 pAttributesAndClose pBody f
             .>> (tagClose tag >>. updateUserState setNotInTag))

// the tags

let bold   = tag "b" elements Bold
let italic = tag "i" elements Italic

let url = nonNestedTag "a" (ws >>. attribute "href" .>> (ws >>. str ">"))
                       elements
                       (fun url phrases -> Url(url, phrases))
                       (fun us -> us.InLink)
                       (fun us -> {us with InLink = true})
                       (fun us -> {us with InLink = false})


let element = choice [text
                      escape
                      quote '\''
                      quote '\"'
                      bold
                      italic
                      url]

do elementsR:= many element

let document = elements .>> eof

> runParserOnString document UserState.Default ""
    "A \"'text' with 'nested \"<b>quotes</b>\"'.\"";;
val it : ParserResult<Element list,UserState> = Success:
[Text "A ";
 Quote
   ('"',
    [Quote ('\'',[Text "text"]); Text " with ";
     Quote ('\'',[Text "nested "; Quote ('"',[Bold [Text "quotes"]])]); Text "."])]

> runParserOnString document UserState.Default ""
    @"<b>Text <i></i>with</b> <a href=""url"">'link' but no \<blink\></a>";;
val it : ParserResult<Element list,UserState> = Success:
[Bold [Text "Text "; Italic []; Text "with"]; Text " ";
 Url("url",
     [Quote ('\'',[Text "link"]); Text " but no "; Text "<"; Text "blink";
      Text ">"])]

> runParserOnString document UserState.Default ""
    "<a href=\"url\"><a href=\"nested\">test</a></a>";;
val it : ParserResult<Element list,UserState> = Failure:
Error in Ln: 1 Col: 15
<a href="url"><a href="nested">test</a></a>
              ^
Nested <a> tags are not allowed.

5.9.3 Parameterizing a parser through the user state

The user state is also a good place to store parser configuration data that is specific to a “parser job”. For example, a compiler that processes multiple compilation units could put configuration data that is specific to the compilation unit, e.g. include paths, into the user state and then parse different compilation units with the same Parser instance, like in the following code:

type CompilationUnitAST = (* ... *)

type UserState = {
    IncludePaths = string list
    (* ... *)
}

let parser : Parser<CompilationUnitAST, UserState> = (* ... *)

let parseCompilationUnit file encoding includePaths (* ... *) =
    let initialUserState = {IncludePaths = includePaths; (* ... *)}
    runParserOnFile parser initialUserState file encoding
Footnotes:
[1] An alternative way to handle such restrictions at the parser level would be to define separate instances of the parser for each possible combination of restrictions, e.g. separate parsers for inline elements at the top level, for inline elements within hyperlinks, for elements within hyperlinks within superscript text and so on. However, with an increasing number of restrictions this approach quickly falls victim to the combinatorial explosion caused by the recursive nature of the involved parsers.