Wring simple parser with Megaparsec in Haskell
There goes around opinion that pure functional languages are rock solid and well suited for critical systems. For example Facebook uses it in anti-spam filters, serval financial companies for derivative modelling and there is also some documented usage in compilers.
I tiptoed in Haskell long time ago, but didn't really get it. This time, my particular usecase was that I wanted to have parser for toy language with minimal effort. Parser combinators like Parsec or Megaparsec are known for purely declarative approach to modelling grammars.
After 2 weeks of playing with the language I must say that there is something strangely addictive in writing pure functional code. Reading it is hard, writing it even harder, but when it starts to work there is a lot of satisfaction. I don't know maybe I wasn't feeling confident about it before, but I finally started to like it.
In this short writing we are going to write simple application that reads in json, validates it and then pretty prints to the console. I used Megaparsec because its newer and has better error messages, but 99% could be written in parsec as well. On the fly I will showcase some Haskell idioms I learned on the way.
Normal workflow
In haskell every function is a pure function, which means it's idepotent. Side effects like IO are nicely contained in scope in monads. What you get is that because there is no mutable state all the work is really composition of functions. It's a little bit intellectually demanding, but once you wrote it right, it's hard to see it failing.
Most of the testing happens in REPL. Once you set up Stack or Cabal project you write functions in library module and the load it in REPL and test the functions.
stack ghci
ghci>
From there you can use various tools but most helpful is probably checking types of the functions.
ghci> import Data.List
ghci> :t concat
concat :: Foldable t => t [a] -> [a]
Then you can test the function directly in REPL
ghci> concat ["Hello", " ", "World"]
"Hello World"
There are 2 things at play here in this example:
- haskell has parameteric polymorphism, which means you can have some abstract type a (like generics)
- string is actually list of char [Char]
If you are stubborn you can of course write tests in HUnit, but since you can test it in REPL I don't think anyone writes them. But several libraries exist like HUnit.
Representing data
Haskell has no oop concepts so everthing is represented as Algebraic Data Types (or Generalized Algebraic Data Types for more power). ADTs have two flavors:
- Sum types A or B
- Product types tuples (A, B)
This is how I represented internal JSON structure
data VNumber = VInt Integer | VDouble Double
deriving (Show, Eq)
data VJson =
VNumber VNumber
| VString String
| VBool Bool
| VNull
| VArray [VJson]
| VObject [(String, VJson)]
deriving (Show, Eq)
VJSON is sum type of couple of variants. It think it's self explanatory, except the deriving part. Since we are building on top of built in types like Integer, Double and String, typeclass Show (printing to console) and Eq can be autogenerated. Typeclasses for now are like constraints on wildcard generic types the expose some behaviour.
Parser
All parsing happens inside monad Parsec.
Parsec e s a
where e - custom exception,
s - stream type, can be String, Text, or ByteString,
a returned structure.
so for this simple parser I created type alias
type Parser = Parsec Void String
You may notice the last type on the rhs is not defined so it "floats" to the definition on the left.
The flow is that you compose the smaller parsers into bigger parsers up to the top and the run topmmost one with function parse
ghci> :t parse
parse
:: Parsec e s a -> String -> s -> Either (ParseErrorBundle s e) a
So it takes our parser, debugging message, text to parse and returns either Left error or Right result.
For now let's see how the main function looks like:
run :: IO ()
run = do
args <- getArgs
case args of
[filePath] -> do
content <- readFile filePath
let parseResult = parse value "" content
case parseResult of
Left err -> putStr $ errorBundlePretty err
Right v -> putStr $ prettyJson v 0
_ -> putStr "Usage: program <filepath> \n"
Pattern matching is easy to guess but for people unfamiliar with monads I will draw an analogy for do notation in Java. So you have a method to that for each element returns a stream
Stream<Integer> foo(Integer i) {
return Stream.of(i, i+1);
}
And then you have a stream of 2 elements.
Stream nums = Stream.of(1,3)
And finally you may recall that there is a method flatMap, so combining everthing together
Stream.of(1,3).flatMap(this::foo) // == Stream.of(1,2,3,4)
So this is much like working with monads. The foo method works on extracted value from the previous monad and returns back new monad. This particular monad has intrisinc behavior so it flattens the nested streams but the behavior itself could be anyting you want it to be. You can chain multiple foos and in the end you will get back monad.
So do notation is much alike that it allows to chain previous monadic function to the next.
Parser rules
Let's start by parsing trivial cases of true, false and null.
literal :: Parser VJson
literal = do
lit <- choice [C.string "true", C.string "false", C.string "null"]
<?> "only 'true', 'false' and 'null' literals are valid"
_ <- notFollowedBy C.alphaNumChar <?> "only 'true', 'false' and 'null' literals are valid"
if lit == "true" then
return $ VBool True
else if lit == "false" then
return $ VBool False
else
return VNull
From the first parts:
C.string "true"
returns a parser that takes up "true" from the input or fails if not met.
choice [string "a", string "b"]
returns composite parser that if string "a" fails then follows to the next parser up to exhaustion.
Then we save result in variable.
<?> allows for customizing failure message if the whole rule fails. It's like returning parser monad that always
fails with this message.
Finally we make sure that literals have no trailers which allows us to fail exactly in this place not in some time later.
One of the monad methods is return which just wraps a value in the monad
return $ VBool False
The $ means evaluate what is on the right side and apply it to left side. It's needed here because evaluation happens from left to right.
Next
stringLiteral :: Parser VJson
stringLiteral = do
_ <- C.char '"'
content <- many (chunk "\\\"" <|> (:[]) <$> anySingleBut '"')
_ <- C.char '"'
return $ VString (concat content)
First match '"' single character then take zero or more parts that
chunk "\\\"" <|> (:[]) <$> anySingleBut '"'
Matches 2 chars \ and " if that fails try parsing any char that is not ".
One word of explanation for <$>. This operator takes a function (conversion to list or appending empty list afterwards)
and applies it inside the monad resulting in new monad with the result from function.
The most hard is parsing number
zero :: Parser String
zero = C.string "0"
digitFromOne :: Parser Char
digitFromOne = oneOf ['1'..'9']
wholePart :: Parser String
wholePart = choice [
zero,
((++) . (: []) <$> digitFromOne) <*> many C.digitChar]
fractPart :: Parser String
fractPart = do
dot <- C.char '.'
fract <- some C.digitChar
return $ dot : fract
expPart :: Parser String
expPart = do
e <- oneOf ['E', 'e']
sign <- option "" ((:[]) <$> oneOf ['+', '-'])
num <- some C.digitChar
return $ e : (sign ++ num)
number :: Parser VJson
number = do
sign <- option "" ((:[]) <$> oneOf ['+', '-'])
wh <- wholePart <?> "whole part of number"
fract <- option "" fractPart
e <- option "" expPart
if null fract && null e then
return $ VNumber (VInt $ read (sign ++ wh))
else
return $ VNumber (VDouble $ read (sign ++ wh ++ fract ++ e))
Explanations:
((++) . (: []) <$> digitFromOne) <*> many C.digitChar]
First convert char to String (with single char) inside monad. Then take this result and concat it with string
parsed by many.
ghci> :t option
option :: GHC.Base.Alternative m => a -> m a -> m a
So option takes a default value, and a monad return new monad with possibly default applied.
So for example what this does
e <- option "" expPart
is that if expPart matches it returns string otherwise empty string "" is assigned.
Then array value
array :: Parser VJson
array = do
_ <- C.char '['
arr <- value `sepBy` C.char ','
_ <- C.char ']'
return $ VArray arr
And object
keyValuePair :: Parser (String, VJson)
keyValuePair = do
C.space
VString key <- stringLiteral
C.space
_ <- C.char ':'
v <- value
return (key, v)
object :: Parser VJson
object = do
_ <- C.char '{'
kvs <- keyValuePair `sepBy` C.char ','
_ <- C.char '}'
return $ VObject kvs
Now that we have all the rules it is time to combine them
value :: Parser VJson
value = do
C.space
v <- try literal <|> try stringLiteral <|> try number <|> try array <|> object
C.space
return v
try tries parsing rule without consuming the input, note the last alternative is without try
to guarantee progress. It is a nice trick to consume spaces before and after proper value.
Megaparsec include test method parseTest. Let's try it:
ghci> parseTest value "{\"a\" : 1, \"b\" : [1.0, 2.0]}"
VObject [("a",VNumber (VInt 1)),("b",VArray [VNumber (VDouble 1.0),VNumber (VDouble 2.0)])]
Printing
Some branches are easy but some are complicated:
prettyJson :: VJson -> Int -> String
prettyJson VNull _ = "null"
prettyJson (VBool True) _ = "true"
prettyJson (VBool False) _ = "false"
prettyJson (VNumber (VInt i)) _ = show i
prettyJson (VString s) _ = "\"" ++ s ++ "\""
prettyJson (VNumber (VDouble d)) _ = show d
prettyJson (VArray arr) ind = "[" ++ printElems arr (ind+2) ++ "\n" ++ indent ind "]\n"
where
printElems (a:as) i = foldl
(\acc x -> acc ++ ",\n" ++ indent i "" ++ prettyJson x i) ("\n" ++ indent i "" ++ prettyJson a i) as
printElems [] _ = ""
indent i s = replicate i ' ' ++ s
prettyJson (VObject arr) ind = "{" ++ printElems arr (ind+2) ++ "\n" ++ indent ind "}\n"
where
printElems ((k, v):as) i = foldl
(\acc (k2, v2) -> acc ++ ",\n" ++ indent i "" ++ "\"" ++ k2 ++ "\" : " ++ prettyJson v2 i)
("\n" ++ indent i "" ++ "\"" ++ k ++ "\" : " ++ prettyJson v i)
as
printElems [] _ = ""
Let's only concentrate on VArray branch
prettyJson (VArray arr) ind = "[" ++ printElems arr (ind+2) ++ "\n" ++ indent ind "]\n"
where
printElems (a:as) i = foldl
(\acc x -> acc ++ ",\n" ++ indent i "" ++ prettyJson x i) ("\n" ++ indent i "" ++ prettyJson a i) as
printElems [] _ = ""
indent i s = replicate i ' ' ++ s
printElems is a helper method that prints elements each in new and indented line. The whole trick was that to not include
"," after last expression. So this is how i did it. First match a list with first element and the rest. Then use foldl to reduce
array of json values to one string. The starting value is the matched first element a and it's different. Then reduction function takes
accumulator and appends next element with "," prepended. And for empty array just return empty string. indent is a helper method that prefixes
the give string with number of spaces.
Let's test:
ghci> putStr $ prettyJson ( VArray [(VNumber (VInt 1)), (VNumber (VInt 2))]) 0
[
1,
2
]
ghci>
Conclusion
That's it for this simple tutorial. Here is a list of nice readings
- Megaparsec intermediate tutorial
- What I whish I knew learning Haskell
- GHC reading list | advanced
- Write you a haskell
I had a lot of fun and eager for new challenges. If I have some spare time maybe I will write how to write System F typechecker in Haskell.
Thanks for now.
PS Here is the code.