5.5 Parsing sequences
In the previous chapter we discussed various ways to sequentially apply two or more parsers. In this section we will explain how to repeatedly apply the same parser in order to parse a sequence with an arbitrary number of elements.
5.5.1 The many
parser
In regular expressions and many grammar formalisms a Kleene Star marks a parser rule
that can be repeatedly applied. For example, number*
could represent
a sequence of zero or more numbers.
In FParsec the many
combinator takes
the place of the Kleene Star:
val many: Parser<'a,'u> -> Parser<'a list, 'u>
With many
the number example could
be translated into the following FParsec code:
let ws = spaces let number = pint32 .>> ws
> run (many number) "1 2 3 4";; val it : ParserResult<int32 list,unit> = Success: [1; 2; 3; 4]
The parser many p
repeatedly applies the parser p
until p
fails, i.e. it “greedily” parses as many occurrences of p
as possible. The results of p
are returned as a list in the order
of occurrence.
At the end of a sequence parsed with many p
the argument parser p
must fail
without consuming input (or changing the parser state in any other way). When p
fails after
consuming input, many p
fails with the error returned by p
.
The following example illustrates this behaviour:
The many numberInBrackets
parser successfully parses the first two numbers in this test run:
> run (many numberInBrackets .>> str "(c)") "[1] [2] (c)";; val it : ParserResult<int32 list,unit> = Success: [1; 2]
However, the same parser fails while trying to parse the 3rd number in this test run:
> run (many numberInBrackets >>. str "[c]") "[1] [2] [c]";; val it : ParserResult<string,unit> = Failure: Error in Ln: 1 Col: 10 [1] [2] [c] ^ Expecting: integer number (32-bit, signed)
The many
parser failed here because
the numberInBrackets
parser failed after consuming input. In the chapter on looking ahead and backtracking we’ll come back to this example and discuss how you can modify the
numberInBrackets
parser such that it fails without consuming input if an opening bracket is
not followed by a number.[1]
Since many p
continues until p
fails, you have to be a little careful not to supply an argument parser
p
that can succeed without consuming input. The following example shows what happens if you
accidentally supply such an argument parser:
> run (many (many digit .>> ws)) "123 456";; System.InvalidOperationException: (Ln: 1, Col: 8): The combinator 'many' was applied to a parser that succeeds without consuming input and without changing the parser state in any other way. (If no exception had been raised, the combinator likely would have entered an infinite loop.) (... stack trace ...) Stopped due to error
The problem here is that many digit .>> ws
will succeed
without changing the parser state if it can’t parse any digits or trailing whitespace. Thus, if the combined parser hadn’t have thrown an
exception, it would have entered an infinite loop at the end of the input.
We can easily avoid the error in the last example by requiring the inner parser to consume at least one digit. Instead of many digit
, which succeeds with an empty list if can’t parse
any digits, we can use many1 digit
, which fails if it can’t parse at least one digit:
> run (many (many1 digit .>> ws)) "123 456";; val it : ParserResult<char list list,unit> = Success: [['1'; '2'; '3']; ['4'; '5'; '6']]
Before we continue, we should point out that an example like many1 digit
is somewhat
artificial, because you hardly ever want to parse digit chars into a list. If you want to parse numbers, one of the number parsers is usually the best way forward. If you actually need the individual
chars, you normally need them as a string, not as a list.
If you want to parse a sequence of chars, you should generally prefer one of the specialized string parsers.
5.5.2 sepBy
and sepEndBy
Often the elements of a sequence are separated by some separator. A convenient way to parse such sequences are the sepBy
and sepEndBy
combinators.
sepBy p sep
parses a sequence of p
separated by sep
and returns the results in a list. sepEndBy
parses a sequence of p
separated and optionally ended by sep
.
With these combinators you could for example define the following two parsers for a semicolon‐separated list of numbers in brackets:
let str s = pstring s let sepList = between (str "[") (str "]") (sepBy pint32 (str ";")) let sepEndList = between (str "[") (str "]") (sepEndBy pint32 (str ";"))
The sepList
parser only accepts lists where the semicolons only occur between two numbers:
> run sepList "[]";; val it : ParserResult<int32 list,unit> = Success: [] > run sepList "[1;2;3]";; val it : ParserResult<int32 list,unit> = Success: [1; 2; 3] > run sepList "[1;2;3;]";; val it : ParserResult<int32 list,unit> = Failure: Error in Ln: 1 Col: 8 [1;2;3;] ^ Expecting: integer number (32-bit, signed)
The sepEndList
parser also accepts a terminating semicolon:
> run sepEndList "[1;2;3]";; val it : ParserResult<int32 list,unit> = Success: [1; 2; 3] > run sepEndList "[1;2;3;]";; val it : ParserResult<int32 list,unit> = Success: [1; 2; 3]
Like for the many
combinator, there
are also variants of the sepBy
and
sepEndBy
parsers that require at
least one element in the sequence and/or skip over a sequence without returning the results. Have a look at the parser overview.
5.5.3 Parsing a sequence without creating a list
If you want to parse a sequence and you don’t need the results as an F# list, you can avoid the allocation of a temporary list by defing a
custom sequence parser using the inline helper methods Inline.Many
and Inline.SepBy
.
For example, if you wanted to define a variant of many
that parses the elements directly into a ResizeArray
, i.e.
a System.Collections.Generic.List
, you could use the following definition:
let manyRA p = // the compiler expands the call to Inline.Many to an optimized sequence parser Inline.Many(elementParser = p, stateFromFirstElement = (fun x0 -> let ra = ResizeArray<_>() ra.Add(x0) ra), foldState = (fun ra x -> ra.Add(x); ra), resultFromState = (fun ra -> ra), resultForEmptySequence = (fun () -> ResizeArray<_>()))
A test run:
> run (manyRA (pint32 .>> spaces)) "1 2 3";; val it : ParserResult<System.Collections.Generic.List<int32>,unit> = Success: seq [1; 2; 3]
The reference documentation for the Inline
class contains some more examples.
[1] |
|
---|