# A brief excursion into Monadic Parsing

📅May 25, 2021 | ☕️10 minutes read

After the final exams, I have finally managed to find some time for digging deeper into the interesting world of monadic parsing. Although parsing is often treated as a solved problem, the concept of monadic parsing is still eye-opening. Briefly, it turns out that we can think of parsing as a monadic computation and a parser as a cleverly typed higher order function. These properties allow us to build abstraction by simply composing functions which is a common pattern in functional programming. Basically, we can build stronger and more complex parsers as compositions of simpler, easy to understand parsers.

Haskell’s extremely powerful type system enables us to write parsers for context-free grammars which look essentially the same as grammar productions, which looks quite cool.

To illustrate the main ideas, consider the subset of classic expression grammar in the EBNF form:

<expr> ::= <term> "+" <expr>
| <term> "-" <expr>
| <term>

<term> ::= <factor> "*" <term>
| <factor> "\"" <term>
| <factor>

<factor> ::= "(" <expr> ")"
| <number>

With monadic parsing, we can write the expression evaluator comprising of lexer, parser and evaluator in just a few lines of quite declarative code.

expr :: Parser Double
expr = term chainl1 addop

term :: Parser Double
term = factor chainl1 mulop

addop :: Parser (Double -> Double -> Double)
where add = do { symbol "+"; return (+) }
sub = do { symbol "-"; return (-) }

mulop :: Parser (Double -> Double -> Double)
mulop = mul <|> div
where mul = do { symbol "*"; return (*) }
div = do { symbol "/"; return (/) }

factor :: Parser Double
factor = parensExpr <|> number
where
parensExpr = do {
symbol "(";
x <- expr;
symbol ")";
return x
}

The goal of this post is to give a basic development of tools which enable us to write such a program. Apart from that, we will see how to apply tools to some classic parsing problems.

The post will consist of three parts:

1. Monadic parsing & Parser Combinators - A brief explanation of concepts of functor, applicative and monad and how they fit together with parsers. Finally we develop some basic, but surprisingly powerful parser combinators.
2. Writing a JSON parser - We will apply the abstract definition of a monad of parsers to build a JSON parser.
3. Syntax-Directed Translation of Arithmetic Expression - We will implement a syntax-directed translator (expression evaluator driven by grammar) using the theory and combinators developed above.

## Monadic parsing & Parser Combinators

### A definition of a Parser

Essentially, a parser is a function that takes a string and returns some form of structured data. Most frequently, the output is a parse tree, but in principle it can be any type.

In Haskell, we can encode the essence of a parser of arbitrary type in a very convenient definition.

newtype Parser a = Parser (String -> [(a,String)])

Firstly, it is worth noting that our Parser is a function wrapped inside a Parser type and the reasons for that will soon become obvious. Another thing is the fact that the output is a list. The reason for that is pure convenience. We can also introduce a convention that an empty list means the parsing error and a nonempty list the parsing success. Since the parser returns a list, we can build parsers for ambigious grammars with arbitrary many results being returned. (The question is when do we want this?) Finally, apart from returning the result of type $a$, we also return the unparsed suffix of the string.

Now, we define a few trivial parsers:

A parser that always fails:

zero :: Parser a
zero = Parser (const [])

An ‘unwrapping’ function to apply a parser to a string:

parse :: Parser a -> String -> [(a, String)]
parse (Parser p) = p

### Monadic nature of parsing problem

Before going deeper, I would like to state a very nonrealistic assumption. We will not deal with error-recovery and error messaging in this post. We assume that the parsing world is great and that we always either correctly parse everything we want to parse or we fail without explanation.

Quite commonly, we would like to apply some postprocessing of raw output produced by a parser. Say our parser consumes a sequence of digits comprising an integer. We would like to convert that sequence into the actual integer for later usage. It is convenient to equip our parser type with a general method of doing such postprocessing operation. In other words, we will equip a parser with an operation which takes some mapping function, extracts the value parsed by our parser, applies the mapping to the extracted value and returns a new parser.

In the language of functional programming, we recognise our parser as a $\texttt{Functor}$ and our mapping operation above as an implementation of $\texttt{fmap}$.

instance Functor Parser where
fmap g p = Parser (\cs -> case parse p cs of
[]        -> []
[(a,cs')] -> [(g a, cs')])

We can observe that our parser mapping operation obeys the following laws:

identity mapping returns the same parser as the original parser

fmap id = id

and the following composition rule holds

fmap (g . f) = fmap g . fmap f

So far, we have been concerned only with the ‘unwrapping’ of parser values. But to get to this stage, we need to wrap something first. Now we define a building block parser which does not consume any input, and succeeds with the given value. We will call it a $\texttt{pure}$ parser.

  pure a = Parser (\cs -> [(a, cs)])

Sometimes, we want to build the mapping functions mentioned in the first paragraph in the running time. For example, when arithmetic expression parser encounters a ”+” or ”-” it will perform different operations. Instead of manually building parsers which behave in this way, we give a general definition defining how to compose a $\texttt{Parser}$ whose value is a mapping with a $\texttt{Parser}$ which contains a value to be mapped over. It turns out that this use case corresponds to the definition of $\texttt{Applicative}$ operator $<*>$. With a bit of paperwork, we see that our Parser is actually an $\texttt{Applicative Functor}$.

instance Applicative Parser where
pure a = Parser (\cs -> [(a, cs)])
pg <*> pa = Parser (\cs -> case parse pg cs of
[] -> []
[(g, cs')] -> parse (fmap g pa) cs'
)

In order to parse unambigious grammars, we sometimes want to try parsing one production and try another if the first one fails. Instead of manually handling those cases which can be quite complicating depending on the number of cases, we define what it means to ‘deterministically’ select parser at runtime. By deterministically select, we mean trying to parse using the parse using the first given parser. In case of succesfull parse, we will return the result computed from the first parser. But in case of unsuccessfull parse, we return the result obtained by running the second parser on the original input. In other words, we perform a backtracking. So although conceptually nice, we need to be careful when using it. We do this by providing a definition of $\texttt{Alternative}$ operator $<|>$.

instance Alternative Parser where
empty = zero
p <|> q = Parser (\cs -> case parse p cs of
[] -> parse q cs
[(a, cs')] -> [(a, cs')]
)

When parsing, we often want to sequence parsers in a way such that we firstly apply some parser and depending on the outcome of parsing operation we select a parser to be applied on the unconsumed suffix of the first applied parser. After a bit of thinking, we realise we want an implementation of a famous $\texttt{bind}$ operator, known as $\texttt{>>=}$. The $\texttt{bind}$ operator takes a parser to be applied first and a function which returns another parser depending on the value parsed by the first one. In case the first parser fails, we do not have a value to pass to the function, so we fail without an explanation, according to our initial assumption. With a bit of paperwork and checking some laws on behaviour of our operator, we see that our Parser is actually a $\texttt{Monad}$. By explicitely defining it as such, we can use a bunch of syntax sugar and functionality provided by Haskell language. For example, we would like to use a powerful $\texttt{do}$ construct and the only way to this is by notifying Haskell that the Parser type is monadic.

instance Monad Parser where
return = pure
p >>= f = Parser (\cs -> concat [
parse (f a) cs' | (a, cs') <- parse p cs
])

### Parser Combinators

Now when we have our abstract machinery set up, we can start defining some useful parsers. A parser combinator is a parser used to combine or compose other parsers.

Let us start modestly, by defining a parser which consumes a single character and returns the rest of the input. In case of there is no input, our parser fails. Otherwise, it consumes a single character and returns the rest of the input.

item :: Parser Char
item = Parser (\case
"" -> []
(c:cs) -> [(c,cs)]
)

Now let us define a parser which succeeds only if some given predicate passes when applied to the next character. So we consume a single character if it matches the predicate and fail otherwise.

sat :: (Char -> Bool) -> Parser Char
sat p = do { c <- item; if p c then return c else zero }

Observe that even on this stage, we compose the trivial parser $\texttt{item}$ to build a more interesting one.

Now, we can provide a generic definition of a parser consuming a specific character.

char :: Char -> Parser Char
char c = sat (c ==)

Quite commonly, we want to parse some reserved keywords. For such purposes, we define a parser consuming a specific string, recursively in terms of character parser above.

string :: String -> Parser String
string [] = return []
string (c:cs) = do {
pc <- char c;
prest <- string cs;
return (pc : prest)
}

By inspecting the structure of the parser above, we notice a classic foldr recursion pattern. So we can abstract the recursion in the definition above using the $\texttt{foldr}$ operator.

string :: String -> Parser String
string = foldr (\c acc -> do {
pc <- char c;
pr <- acc;
return (pc:pr)
}
) (return "")

Now we are ready to define a few more interesting combinators. In parsing, it is quite common to parse zero or more occurences of some production, following each other in sequential fashion.

For purposes of this, we define two combinators. Let $\texttt{many0 p}$ be a parser parsing zero or more occurences of parser $\texttt{p}$. Let $\texttt{many1 p}$ be a parser parsing one or more occurences of parser $\texttt{p}$. Now we define both of them in mutually recursive way.

many0 :: Parser a -> Parser [a]
many0 p = many1 p <|> return []

many1 :: Parser a -> Parser [a]
many1 p = do {
a <- p;
as <- many0 p;
return (a : as)
}

Now using $\texttt{many0}$ we can define a very useful combinator consuming arbitrary many spaces.

spaces :: Parser String
spaces = many0 (sat isSpace)

Often, we want to support lexemes being separated by arbitary many spaces. Usually, this sort of issue is handled in lexical analysis stage. But using $\texttt{spaces}$ combinator, becomes painless task.

token :: Parser a -> Parser a
token p = do { spaces; a <- p; spaces; return a }

Now we define a very useful combinator parsing exactly the given string, being separated from other lexemes by arbitrary many spaces.

symbol :: String -> Parser String
symbol symb = token (string symb)

Quite often, we want to parse zero or more occurences of some production separated by some other production. A common example is parsing arguments in a function call or parsing the initialization of an array. For such purposes, it is convenient to introduce a few new combinators, $\texttt{sepBy0}$ and $\texttt{sepBy1}$.

So we define $\texttt{sepBy0 p sep}$ to be a parser parsing zero or more occurences of parser $\texttt{p}$ separated by whatever is parsed by a parser $\texttt{sep}$. Similarly, we define $\texttt{sepBy1 p sep}$ to be a parser parsing one or more occurences of parser $\texttt{p}$ separated by whatever is parsed by a parser $\texttt{sep}$. Now, similarly to $\texttt{many0}$ and $\texttt{many1}$ combinators, we define $\texttt{sepBy0}$ in terms of $\texttt{sepBy1}$.

sepBy0 :: Parser a -> Parser b -> Parser [a]
p sepBy0 sep = p sepBy1 sep <|> return []

sepBy1 :: Parser a -> Parser b -> Parser [a]
p sepBy1 sep = do {
a <- p;
as <- many0 (do { sep; p });
return (a:as)
}

Notice how we are gradually building more and more powerful parsers using the function composition as a main building block of abstraction. Observe $sepBy1$ is implemented using $\texttt{many0}$.

Another useful combinator is $\texttt{takeUntil stop}$ consuming characters until the next character is exactly $\texttt{stop}$ or there is no more input. To define this combinator, we define auxiliary $\texttt{look}$ lookahead combinator which returns either the next character or $\texttt{Nothing}$ if there is no more input.

look :: Parser (Maybe Char)
look = Parser (\case
[] -> [(Nothing, [])]
(c:cs') -> [(Just c, c:cs')]
)

Now we define $\texttt{takeUntil}$ recursively.

takeUntil :: Char -> Parser [Char]
takeUntil stop = consumeRest "" stop
where
consumeRest acc stop = do {
l <- look;
if l == Just stop then return [] else do {
c <- item;
cs <- consumeRest (acc ++ [c]) stop;
return (c:cs)
}
}

## Parsing JSON

Now, we have a bunch parser combinators ready for cool applications. Let us start with parsing (simplified) version of JSON.

Consider the following grammar in the EBNF form:

<json> ::= <jsonArray>
| <jsonObject>
| <string>
| <double>
| <integer>
| true
| false
| "{" "}"
| null

<jsonArray> ::= "[" <json> ["," <json>]* "]"
| "[" "]"

<jsonObject> ::= "{" <jsonProperty> ["," <jsonProperty>]* "}"
| "{" "}"

<jsonProperty> ::= <string> ":" <json>

<double>  ::= <integer>"."<digit>*
<integer> ::= [+|-] 0 | [1 |... | 9] <digit>*
<digit>   ::= 0 | 1 | ... | 9

We will start by defining $\texttt{Json}$ algebraic data type in Haskell. We will use that type as a result of a successfull parse.

data Json = JsonNull
| JsonBool Bool
| JsonString String
| JsonInteger Int
| JsonDouble Double
| JsonArray [Json]
| JsonObject [JsonProperty]
deriving (Eq, Show)

type JsonProperty = (String, Json)

Observe how Haskell’s algebraic data types resemble the grammar structure.

We start by defining a root parser corresponding to the root production of grammar above, entirely using $<|>$ operator.

json :: Parser Json
json =
jsonObject  <|>
jsonArray   <|>
jsonString  <|>
jsonDouble  <|>
jsonInteger <|>
jsonBool    <|>
jsonNull

Now we define a few straightforward parser combinators.

jsonNull :: Parser Json
jsonNull = do { symbol "null"; return JsonNull }

jsonBool :: Parser Json
jsonBool = true <|> false
where true = do { symbol "true"; return (JsonBool True) }
false = do { symbol "false"; return (JsonBool False) }
}

Now we apply $\texttt{takeUntil}$ combinator to painlessly parse strings.

jsonString :: Parser Json
jsonString = do {
symbol "\"";
str <- takeUntil '\"';
symbol "\"";
return (JsonString str) 

To parse numbers, we need to be a bit more careful and look at a few cases.

digit :: Parser Int
digit = do {
d <- sat isDigit;
return (digitToInt d)
}

integer :: Parser Int
integer = do {
spaces;
s <- sign;
d <- digitToInt <$> sat isDigit; if d == 0 then return 0 else do { ds <- many0 digit; return (s * asInt (d:ds)); } } where sign :: Parser Int sign = do { symbol "+"; return 1; } <|> do { symbol "-"; return (-1); } <|> do { return 1; } asInt ds = sum [ d * (10^p) | (d, p) <- zip (reverse ds) [0..] ] Now using these combinators, we assemble combinators for parsing JSON integers or doubles. jsonInteger :: Parser Json jsonInteger = JsonInteger <$> integer

jsonDouble :: Parser Json
jsonDouble = do {
wholePart <- fmap fromIntegral integer;
char '.';
fractionalPart <- fmap asFracPt (many digit);
return (JsonDouble (wholePart + fractionalPart))
}
where asFracPt ds = sum [
fromIntegral d * (10 ** (-p))
| (d, p) <- zip ds [1..]
]

Now we employ $\texttt{sepBy0}$ to parse JSON arrays.

jsonArray :: Parser Json
jsonArray = do {
symbol "[";
xs <- json sepBy0 char ',';
symbol "]";
return (JsonArray xs)
}

Finally, we define the object parser.

jsonObject :: Parser Json
jsonObject = do {
symbol "{";
props <- jsonProperty sepBy0 symbol ",";
symbol "}";
return (JsonObject props)
}
where
jsonProperty :: Parser JsonProperty
jsonProperty = do {
name <- fmap (\(JsonString s) -> s) jsonString;
symbol ":";
value <- json;
return (name, value)
}

Now let us see how parser behaves.

{
"users": [
{
"id": 0,
"work": "Unilogic",
"dob": "1978",
"city": "Boston",
"optedin": true,
"referedBy" : []
},
{
"id": 1,
"name": "Leanne Brier",
"work": "Connic",
"email": "leanne.brier@connic.org",
"dob": "1987",
"city": "Toronto",
"optedin": false,
"referedBy" : [
{
"id": 0,
"work": "Unilogic",
"dob": "1978",
"city": "Boston",
"optedin": true,
"referedBy" : []
}
]
}
],
"images": [
"img0.png",
"img1.png",
"img2.png"
],
"price": "$59,395" } The output is: [(JsonObject [("users",JsonArray [JsonObject [("id",JsonInteger 0),("name",JsonString "Adam Carter"),("work",JsonString "Unilogic"),("email",JsonString "adam.carter@unilogic.com"),("dob",JsonString "1978"),("address",JsonString "83 Warner Street"),("city",JsonString "Boston"),("optedin",JsonBool True),("referedBy",JsonArray [])],JsonObject [("id",JsonInteger 1),("name",JsonString "Leanne Brier"),("work",JsonString "Connic"),("email",JsonString "leanne.brier@connic.org"),("dob",JsonString "1987"),("address",JsonString "9 Coleman Avenue"),("city",JsonString "Toronto"),("optedin",JsonBool False),("referedBy",JsonArray [JsonObject [("id",JsonInteger 0),("name",JsonString "Adam Carter"),("work",JsonString "Unilogic"),("email",JsonString "adam.carter@unilogic.com"),("dob",JsonString "1978"),("address",JsonString "83 Warner Street"),("city",JsonString "Boston"),("optedin",JsonBool True),("referedBy",JsonArray [])]])]]),("images",JsonArray [JsonString "img0.png",JsonString "img1.png",JsonString "img2.png"]),("price",JsonString "$59,395")],"")]

Yes 💪..

## Syntax Directed Translation of Arithmetic Expressions

According to Wikipedia, syntax directed translation is a compiling implementation method in which the source language translation is completely driven by a parser. The classic example is a calculator. We explore how to implement syntax directed translation entirely using monadic parsing. Consider the following grammar:

<expr> ::= <term> "+" <expr>
| <term> "-" <expr>
| <term>

<term> ::= <factor> "*" <term>
| <factor> "\"" <term>
| <factor>

<factor> ::=  "-" "(" <expr> ")"
| "(" <expr> ")"
| <number>

The first thing we need to consider is associativity of operators. We have that all operators except $-$ are left associative. The operator $-$ can be either left or right associative. However, in the context of root production $\texttt{}$, $-$ is left associative.

Consider the evaluation of $1 + 2 + 3$. We want to parse as $((1 + 2) + 3)$ and evaluate as $(1 + 2) = 3, 3 + 3 = 6$. To implement this more generally, in the spirit of monadic parsing, we want sort of ‘accummulation’ combinator, parsing and evaluating the leftmost expression, computing the result and feeding it to the next subexpression being parsed and evaluated. Essentially, we want something like $\texttt{foldl}$, but operating on expressions while parsing. The solution is the $\texttt{chainl1}$ combinator:

chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
p chainl1 op = do { x <- p; rest x }
where rest x = do {
f <- op;
y <- p;
rest (f x y)
} <|> return x

We go into this a bit deeper. So, the $\texttt{chainl1}$ combinator takes a parser parsing an expression of some type. The next argument is a Parser ‘parsing’ the ‘folding’ operation to be applied to the accumulator and the result of parsed expression. The ‘folding’ logic is in the expression for $\texttt{rest}$. The subcombinator $\texttt{rest}$ is defined recursively, taking the accummulated result so far and parsing the next expression. Finally, it computes the current value by applying the accummulation function $\texttt{f}$ taken from the parser $\texttt{op}$ to the previous accumulator value and currently parsed value. This value is used as accummulator for the next ‘iteration’ of $\texttt{rest}$. We proceed until exhaustion.

We start by reusing number parsers from the JSON section and wrapping those into a parser parsing $\texttt{Doubles}$.

number :: Parser Double
number = withDecimalPt <|> withoutDecimalPt
where
withoutDecimalPt = fromIntegral <$> integer withDecimalPt = do { wholePart <- withoutDecimalPt; char '.'; fractionalPart <- fmap asFracPt (many digit); return (wholePart + fractionalPart) } asFracPt ds = sum [ fromIntegral d * (10 ** (-p))| (d, p) <- zip ds [1..] ] Now we define evaluator for factors. To evaluate a factor, there are three cases. Either we have an unary minus followed by a factor, or a parenthesised expression or simply a number. If we parse a number, the evaluation is simply a number. If we parse a parenthesised expression, we evaluate the internal expression. If we parse an unary minus followed by a factor, we evaluate the factor and negate it. The following evaluator encodes this logic. factor :: Parser Double factor = negativeFactor <|> parensExpr <|> number where negativeFactor = do { symbol "-"; negate <$> factor }
parensExpr = do {
symbol "(";
x <- expr;
symbol ")";
return x
}

Now we define two abstract parsers, parsing the operations they mean. For example, when we encounter a $+$, we consume it and return the addition function as a parse result.

addop :: Parser (Double -> Double -> Double)
where add = do { symbol "+"; return (+)}
sub = do { symbol "-"; return (-)}

mulop :: Parser (Double -> Double -> Double)
mulop = mul <|> div
where mul = do { symbol "*"; return (*)}
div = do { symbol "/"; return (/)}

Finally, we $\texttt{chainl1}$ the operation parsers over terms and factors, as suggested by the grammar.

expr :: Parser Double
expr = term chainl1 addop

term :: Parser Double
term = factor chainl1 mulop

The final result is:

expr :: Parser Double
expr = term chainl1 addop

term :: Parser Double
term = factor chainl1 mulop

addop :: Parser (Double -> Double -> Double)
where add = do { symbol "+"; return (+)}
sub = do { symbol "-"; return (-)}

mulop :: Parser (Double -> Double -> Double)
mulop = mul <|> div
where mul = do { symbol "*"; return (*)}
div = do { symbol "/"; return (/)}

factor :: Parser Double
factor = negativeFactor <|> parensExpr <|> number
where
negativeFactor = do { symbol "-"; negate <$> factor } parensExpr = do { symbol "("; x <- expr; symbol ")"; return x } digit :: Parser Int digit = do { d <- sat isDigit; return (digitToInt d) } integer :: Parser Int integer = do { spaces; d <- digitToInt <$> sat isDigit;
if d == 0
then
return 0
else
do {
ds <- many0 digit;
return (asInt (d:ds));
}
}
where asInt ds = sum [
d * (10^p) | (d, p) <- zip (reverse ds) [0..]
]

number :: Parser Double
number = withDecimalPt <|> withoutDecimalPt
where
withoutDecimalPt = fromIntegral <\$> integer
withDecimalPt = do {
wholePart <- withDecimalPt;
char '.';
fractionalPart <- fmap asFracPt (many digit);
return (wholePart + fractionalPart)
}
asFracPt ds = sum [
fromIntegral d * (10 ** (-p))| (d, p) <- zip ds [1..]
]

We try it on a few tricky expressions.

apply expr "89.8+((9*3)+8)+(9*2)+1"

gives

[(143.8,"")]
apply expr "- (3 + (4 / 2)) * (- (4 + 2*4 - 1))"

gives

[(55.0,"")]
apply expr "- - ( 2 + 3 )"

gives

[(5.0,"")]

Finally, it is worth comparing this implementation to the classical approach involving lexing, parsing and evaluating stages.

## A few conclusion remarks

To conclude, I would like to point out that it is not necessary, nor probably recommended to roll your own parser combinators to build a parser. There are various, very powerful parser combinator libraries waiting to be used. In case of Haskell, the most famous one is Parsec.

I think key advantage of monadic parsing combinators is that they are often straightforward to construct and they usually directly follow the grammar productions being parsed which results in a quite readable and maintainable parser.

However, it is worth noting that in this post we did not even think about the error logging and error recovery. It turns out those are non trivial to implement in a parser combinator library. Conventional wisdom says that monadic parser combinators tend to generate less accurate errors, especially when some operators are used in a ‘wrong way’. One of such operators is $\texttt{try}$ operator in $\texttt{Parsec}$ composed with $<|>$ which we defined today. You can read more about it here.

Performance issues are discussed in Optimizing Parser Combinators paper.

You can find the code for this post here, located in folder $\texttt{monadic-parsing}$. The instructions to run this are in the repository readme.

## References

Hi👋! I am a final year Computer Science and Mathematics student at 🎓University of Edinburgh. This is a place where I (will hopefully) write about topics I find interesting. I plan to write about my personal projects and learning experiences, as well as some random tech topics.

Built with a lot of