Wring simple parser with Megaparsec in Haskell

March 3, 2025 · 11 min read

There goes around opinion that pure functional languages are rock solid and well suited for critical systems. For example Facebook uses it in anti-spam filters, serval financial companies for derivative modelling and there is also some documented usage in compilers.

I tiptoed in Haskell long time ago, but didn't really get it. This time, my particular usecase was that I wanted to have parser for toy language with minimal effort. Parser combinators like Parsec or Megaparsec are known for purely declarative approach to modelling grammars.

After 2 weeks of playing with the language I must say that there is something strangely addictive in writing pure functional code. Reading it is hard, writing it even harder, but when it starts to work there is a lot of satisfaction. I don't know maybe I wasn't feeling confident about it before, but I finally started to like it.

In this short writing we are going to write simple application that reads in json, validates it and then pretty prints to the console. I used Megaparsec because its newer and has better error messages, but 99% could be written in parsec as well. On the fly I will showcase some Haskell idioms I learned on the way.

Normal workflow

In haskell every function is a pure function, which means it's idepotent. Side effects like IO are nicely contained in scope in monads. What you get is that because there is no mutable state all the work is really composition of functions. It's a little bit intellectually demanding, but once you wrote it right, it's hard to see it failing.

Most of the testing happens in REPL. Once you set up Stack or Cabal project you write functions in library module and the load it in REPL and test the functions.

stack ghci
ghci>

From there you can use various tools but most helpful is probably checking types of the functions.

ghci> import Data.List
ghci> :t concat
concat :: Foldable t => t [a] -> [a]

Then you can test the function directly in REPL

ghci> concat ["Hello", " ", "World"]
"Hello World"

There are 2 things at play here in this example:

haskell has parameteric polymorphism, which means you can have some abstract type a (like generics)
string is actually list of char [Char]

If you are stubborn you can of course write tests in HUnit, but since you can test it in REPL I don't think anyone writes them. But several libraries exist like HUnit.

Representing data

Haskell has no oop concepts so everthing is represented as Algebraic Data Types (or Generalized Algebraic Data Types for more power). ADTs have two flavors:

Sum types A or B
Product types tuples (A, B)

This is how I represented internal JSON structure

data VNumber = VInt Integer |  VDouble Double
    deriving (Show, Eq)

data VJson =
    VNumber VNumber
    | VString String
    | VBool Bool
    | VNull
    | VArray [VJson]
    | VObject [(String, VJson)]
    deriving (Show, Eq)

VJSON is sum type of couple of variants. It think it's self explanatory, except the deriving part. Since we are building on top of built in types like Integer, Double and String, typeclass Show (printing to console) and Eq can be autogenerated. Typeclasses for now are like constraints on wildcard generic types the expose some behaviour.

Parser

All parsing happens inside monad Parsec.

Parsec e s a 

where e - custom exception,

s - stream type, can be String, Text, or ByteString,

a returned structure.

so for this simple parser I created type alias

type Parser = Parsec Void String

You may notice the last type on the rhs is not defined so it "floats" to the definition on the left.

The flow is that you compose the smaller parsers into bigger parsers up to the top and the run topmmost one with function parse

ghci> :t parse
parse
  :: Parsec e s a -> String -> s -> Either (ParseErrorBundle s e) a

So it takes our parser, debugging message, text to parse and returns either Left error or Right result.

For now let's see how the main function looks like:

run :: IO ()
run  = do 
    args <- getArgs
    case args of 
        [filePath] -> do 
            content <- readFile filePath
            let parseResult = parse value "" content 
            case parseResult of 
                Left err -> putStr $ errorBundlePretty err 
                Right v -> putStr $ prettyJson v 0
        _ -> putStr "Usage: program <filepath> \n"

Pattern matching is easy to guess but for people unfamiliar with monads I will draw an analogy for do notation in Java. So you have a method to that for each element returns a stream

Stream<Integer> foo(Integer i) {
    return Stream.of(i, i+1);
}

And then you have a stream of 2 elements.

Stream nums = Stream.of(1,3)

And finally you may recall that there is a method flatMap, so combining everthing together

Stream.of(1,3).flatMap(this::foo) // == Stream.of(1,2,3,4)

So this is much like working with monads. The foo method works on extracted value from the previous monad and returns back new monad. This particular monad has intrisinc behavior so it flattens the nested streams but the behavior itself could be anyting you want it to be. You can chain multiple foos and in the end you will get back monad.

So do notation is much alike that it allows to chain previous monadic function to the next.

Parser rules

Let's start by parsing trivial cases of true, false and null.

literal :: Parser VJson
literal = do
    lit <- choice [C.string "true", C.string "false", C.string "null"]
         <?> "only 'true', 'false' and 'null' literals are valid"
    _ <- notFollowedBy C.alphaNumChar <?> "only 'true', 'false' and 'null' literals are valid"
    if lit == "true" then
        return $ VBool True
    else if lit == "false" then
        return $ VBool False
    else
        return VNull

From the first parts:

C.string "true"

returns a parser that takes up "true" from the input or fails if not met.

choice [string "a", string "b"]

returns composite parser that if string "a" fails then follows to the next parser up to exhaustion.

Then we save result in variable.

<?> allows for customizing failure message if the whole rule fails. It's like returning parser monad that always fails with this message.

Finally we make sure that literals have no trailers which allows us to fail exactly in this place not in some time later.

One of the monad methods is return which just wraps a value in the monad

return $ VBool False

The $ means evaluate what is on the right side and apply it to left side. It's needed here because evaluation happens from left to right.

stringLiteral :: Parser VJson
stringLiteral = do
    _ <- C.char '"'
    content <- many (chunk "\\\"" <|> (:[]) <$> anySingleBut '"')
    _ <- C.char '"'
    return $ VString (concat content)

First match '"' single character then take zero or more parts that

chunk "\\\"" <|> (:[]) <$> anySingleBut '"'

Matches 2 chars \ and " if that fails try parsing any char that is not ".

One word of explanation for <$>. This operator takes a function (conversion to list or appending empty list afterwards) and applies it inside the monad resulting in new monad with the result from function.

The most hard is parsing number

zero :: Parser String
zero = C.string "0"

digitFromOne :: Parser Char
digitFromOne = oneOf ['1'..'9']

wholePart :: Parser String
wholePart = choice [
    zero,
     ((++) . (: []) <$> digitFromOne) <*> many C.digitChar]

fractPart :: Parser String
fractPart  = do
    dot <- C.char '.'
    fract <- some C.digitChar
    return $ dot : fract

expPart  :: Parser String
expPart = do
    e <- oneOf ['E', 'e']
    sign <- option "" ((:[]) <$> oneOf ['+', '-'])
    num <- some C.digitChar
    return $ e : (sign ++ num)



number :: Parser VJson
number = do
    sign <- option "" ((:[]) <$> oneOf ['+', '-'])
    wh <- wholePart <?> "whole part of number"
    fract <- option "" fractPart
    e <- option "" expPart
    if null fract  && null e then
        return $ VNumber (VInt $ read (sign ++ wh))
    else
        return $ VNumber (VDouble $ read (sign ++ wh ++ fract ++ e))

Explanations:

((++) . (: []) <$> digitFromOne) <*> many C.digitChar]

First convert char to String (with single char) inside monad. Then take this result and concat it with string parsed by many.

ghci> :t option
option :: GHC.Base.Alternative m => a -> m a -> m a

So option takes a default value, and a monad return new monad with possibly default applied.

So for example what this does

e <- option "" expPart

is that if expPart matches it returns string otherwise empty string "" is assigned.

Then array value

array :: Parser VJson
array = do
    _ <- C.char '['
    arr <- value `sepBy` C.char ','
    _ <- C.char ']'
    return $ VArray arr

And object

keyValuePair :: Parser (String, VJson)
keyValuePair = do
    C.space
    VString key <- stringLiteral
    C.space
    _ <- C.char ':'
    v <- value
    return (key, v)


object :: Parser VJson
object = do
    _ <- C.char '{'
    kvs <- keyValuePair `sepBy` C.char ','
    _ <- C.char '}'
    return $ VObject kvs

Now that we have all the rules it is time to combine them

value :: Parser VJson
value = do
    C.space
    v <- try literal <|> try stringLiteral <|> try number <|> try array <|> object
    C.space
    return v

try tries parsing rule without consuming the input, note the last alternative is without try to guarantee progress. It is a nice trick to consume spaces before and after proper value.

Megaparsec include test method parseTest. Let's try it:

ghci> parseTest value "{\"a\" : 1, \"b\" : [1.0, 2.0]}"
VObject [("a",VNumber (VInt 1)),("b",VArray [VNumber (VDouble 1.0),VNumber (VDouble 2.0)])]

Printing

Some branches are easy but some are complicated:

prettyJson :: VJson -> Int -> String
prettyJson VNull _ = "null"
prettyJson (VBool True) _ = "true"
prettyJson (VBool False) _ = "false"
prettyJson (VNumber (VInt i)) _ = show i
prettyJson (VString s) _ = "\"" ++ s ++ "\""
prettyJson (VNumber (VDouble d)) _ = show d
prettyJson (VArray arr) ind = "[" ++ printElems arr (ind+2) ++ "\n" ++ indent ind "]\n"
    where 
        printElems (a:as) i = foldl 
            (\acc x -> acc ++ ",\n" ++ indent i "" ++ prettyJson x i) ("\n" ++ indent i "" ++ prettyJson a i) as
        printElems [] _ = ""
        indent i s = replicate i ' ' ++ s
prettyJson (VObject arr) ind = "{" ++ printElems arr (ind+2) ++ "\n" ++ indent ind "}\n"
    where 
        printElems ((k, v):as) i = foldl 
            (\acc (k2, v2) -> acc ++ ",\n" ++ indent i "" ++ "\"" ++ k2 ++ "\" : " ++ prettyJson v2 i) 
            ("\n" ++ indent i "" ++ "\"" ++ k ++ "\" : " ++ prettyJson v i) 
            as
        printElems [] _ = ""

Let's only concentrate on VArray branch

prettyJson (VArray arr) ind = "[" ++ printElems arr (ind+2) ++ "\n" ++ indent ind "]\n"
    where 
        printElems (a:as) i = foldl 
            (\acc x -> acc ++ ",\n" ++ indent i "" ++ prettyJson x i) ("\n" ++ indent i "" ++ prettyJson a i) as
        printElems [] _ = ""
        indent i s = replicate i ' ' ++ s

printElems is a helper method that prints elements each in new and indented line. The whole trick was that to not include "," after last expression. So this is how i did it. First match a list with first element and the rest. Then use foldl to reduce array of json values to one string. The starting value is the matched first element a and it's different. Then reduction function takes accumulator and appends next element with "," prepended. And for empty array just return empty string. indent is a helper method that prefixes the give string with number of spaces.

Let's test:

ghci> putStr $ prettyJson ( VArray [(VNumber (VInt 1)), (VNumber (VInt 2))]) 0
[
  1,
  2
]
ghci> 

Conclusion

That's it for this simple tutorial. Here is a list of nice readings

I had a lot of fun and eager for new challenges. If I have some spare time maybe I will write how to write System F typechecker in Haskell.

Thanks for now.

PS Here is the code.

Normal workflow​

Representing data​

Parser​

Parser rules​

Printing​

Conclusion​