Introduction

JavaParsec enables parsing in a pure functional way via combinator pattern. It is written in Java 21 and makes use of sealed classes, pattern matching and records. All operations are type-safe and there are no casting or null exceptions.

.1. Motivation

Creating a parser for new language or DSL is a big commitment. The amount of work is sometimes prohibitive for starting small and just experimenting with ideas. On the other hand generating full blown parser from ANTLR makes the iteration loops longer plus you have to learn a new library.

Pure functional languages are attractive for writing parsers, because of:

  • REPL and quick iteration loops - code consists of small or medium-sized functions that are directly testable in eval loop. You don’t even have to write tests since you are already testing it in REPL.

  • composability without side effects - because there is no mutable state the work boils down to wiring big functions from smaller functions. Once done - it is done, hardly ever the code needs runtime debugging, which especially for complex parsers is desired property.

  • delegating correctness checking to type checker - the more checking is delegated to types, the less runtime debugging, plus your IDE can more help you

JavaParsec brings this coding experience to Java land. It is a toolkit of parser combinators - parser objects both primitive and composite that match given pattern in a greedy manner. Every parser is composable with another and together they form yet another composite parser.

There are various way that the library can be used, but it was designed primarily for quick experimentation and fast debuging. The coding experience is very much like scripting. This is why each parser is separated and complete on its own and ready to use. You can test each part separately on whatever input you like.

For usage there a several recommended options:

  • scratch files in IDE

  • JBang with VSCodium editing

  • Jupyter notebook with Java

  • java 24 implicit class main functions

.2. Quick Examples

For a real life examples refer to examples folder in repository.

Parsing ISO Date

public void main() {

    var year = times(digit(), 4).str()
            .map(Integer::valueOf);

    var month = times(digit(), 2).str()
            .map(Integer::valueOf)
            .failIf(i -> i < 1 || i > 12, "wrong month");

    var day = times(digit(), 2).str()
            .map(Integer::valueOf)
            .failIf(i -> i > 31, "wrong day");

    var date = seq(
            year,
            anyOf('-'),
            month,
            anyOf('-'),
            day
    ).map(t -> LocalDate.of(t.one(), t.three(), t.five()));


    println(date.parse("2024-03-23"));
}

Parsing integer literal

public void main(String[] args) {
   var zero = c("0").map(Integer::valueOf);

   var nonZero = seq(
    nonZeroDigit(),
    many(digit()).str()
   ).str().map(Integer::valueOf);

   var zeroNotFollowed = zero.dropRight(not(some(digit())));

   var integer = any(zeroNotFollowed, nonZero);

   assert integer.parse("0").ok() == 0;
   assert integer.parse("123").ok() == 123;
   assert !integer.parse("001").isOk();
}

.3. License

This code is available under Apache License Version 2.0

Installation and use

Add dependency to your project

<repositories>
    <repository>
      <id>github</id>
      <name>Github</name>
      <url>https://maven.pkg.github.com</url>
    </repository>
</repositories>
<dependency>
  <groupId>org.jparsec</groupId>
  <artifactId>java-parsec</artifactId>
  <version>1.1.1</version>
</dependency>

Run with JBang

Download JBang

Create example file:

//JAVA 24
//PREVIEW
//REPOS mavencentral,github=https://maven.pkg.github.com
//DEPS org.jparsec:java-parsec:1.1.1

import static org.jparsec.Api.*;

import org.jparsec.containers.Context;
import org.jparsec.containers.Ok;


public void main(String[] args) {
    var aOrB = anyOf('a', 'b');

    var r = aOrB.parse("b");
    if (r instanceof Ok(Character c, Context ctx)) {
        System.out.println(c);
    } else {
        System.exit(1);
    }

}

Install VSCodium

Install JBang extension.

Edit and run the file.

Run in java notebook

Install jupyter notebook

sudo apt-get install jupyter-notebook

Install Java kernel

jbang install-kernel@jupyter-java

Run notebook

jupyter notebook

Select Java kernel, create new notebook

Import dependency with maven (notebook)

%mavenRepo snapshots https://maven.pkg.github.com
%maven org.jparsec:JavaParsec:1.1.1

Usage tutorial

Writing simplified Markdown parser

.1. Intro

In this section we will parse subset of Markdown syntax with library constructs. To make this part shorter this won’t be the full Markdown sytax. Markdown handles content as a list of blocks. It can be either a heading title, paragraph or a block quote. They are separated by a blank line - a line that contains only whitespaces.

MarkdownDocument  ::=  Block*
Block             ::=  HeadingBlock
                   |  Paragraph
                   |  BlockQuote
                   |  BlankLine

Heading block starts with multiplicity of '#' contains alphanumeric characters and is ended by a newline.

ATXHeading      ::=  "#"{1, 6} [Space] Text LineBreak

On the other hand paragraph allows single line breaks. It will allow for formatting text with common markdown punctuation, like strong, italics and code.

Paragraph       ::=  (InlineContent (Space | LineBreak)*)+

Block quote consists of text lines prefixed by '>'

BlockQuote      ::=  (">" [Space] Text LineBreak)+

And finally the blank line.

BlankLine       ::=  (Space | Tab)* LineBreak

Inline content in this example appears only in paragraph. It is a couple of variant either normal text or formatted.

InlineContent    ::=  InlineElement*
InlineElement    ::=  Text
                   |  Emphasis
                   |  Strong
                   |  InlineCode
                   |  Link

Emphasis ::= `"*" Text "*"
Strong = "**" Text "**"
InlineCode  = "`" Text "`"
Link   =  "[" Text "]" "(" Text ")"

.2. Parsing inline elements

Text elements are plain characters without the puctuation of markdown. noneOf matches any of the characters that is not specified. Then we can combine it with higher level parser combinator - some. some() takes at least one or more occurences of given inner parser. Since we fed it with single char parser we get list of chars. A convenience method str() will join them into string.

public void main(String[] args) {
    var textSingle = noneOf('*', '`', '[', ']', '(', ')', '>', '\n');

    var text = some(textSingle).str();

    var r1 = text.parse("this some text");

    out.println(r1.ok());

}

Combinators are greedy that they will take up input until it is valid. So in the case of encountering markdown punctuation, text parser will stop there.

public void main(String[] args) {
    var textSingle = noneOf('*', '`', '[', ']', '(', ')', '>', '\n');

    var text = some(textSingle).str();

    var r1 = text.parse("this some *text*");

    if (r1 instanceof Ok(String s, Context ctx)) {
        assert r1.ok().equals("this some ");
        assert ctx.index == ctx.content.indexOf("*");

    } else {
        throw new ParseException("unexpected");
    }

}

some() succeeded because we have at least one character matching, but the text got parsed up to the first occurence of '*'.

To make fast assertions inline we can use convenience methods on the parser itself. Take a look at:

public void main(String[] args) {
    var textSingle = noneOf('*', '`', '[', ']', '(', ')', '>', '\n');

    var text = some(textSingle).str();

    text.assertEquals(
        Context.of("this some *text*"), "this some ");

}

Before we introduce next rule it will be good to define conatiners for the results. If you had a chance to meet Algebraic Data Types then sealed interfaces and records are the way to achieve this result.

sealed interface Inline {
    record Text(String value) implements Inline{};
    record Emphasis(String value) implements Inline{};
    record Strong(String value) implements Inline{};
    record Code(String value) implements Inline{};
    record Link(String text, String url) implements Inline{};
}

Here we will write emphasis parser. It is enclosed in '*' characters and inside there is normal text. We can model one rule occuring after another with seq() combinator that takes up to 7 rules. What is important that it is totally typesafe thanks to generics.

var emphasis = seq(c('*'), text, c('*'))
    .map(Ops::takeMiddle).map(Inline.Text::new);

out.println(emphasis.parse("*emphasis*").ok());

Result

Text[value=emphasis]

Quick assertion that formatted text will not pass.

emphasis.assertFails("*emphasis `code`*");

Text gets only parsed up to '’, but then we have to end the rule with '*' which is not matched.

The other rules are handled in similar fashion.

public void main(String[] args) {
    var textSingle = noneOf('*', '`', '[', ']', '(', ')', '>', '\n');
    var text = some(textSingle).str();

    var emphasis = seq(c('*'), text, c('*'))
        .map(Ops::takeMiddle).map(Inline.Emphasis::new);

    emphasis.assertFails("* emphasis `code`*");
    emphasis.assertEquals(Context.of("*text*"),  new Inline.Emphasis("text"));

    var strong = seq(c("**"), text, c("**"))
        .map(Ops::takeMiddle).map(Inline.Strong::new);

    strong.assertFails("** strong *nested* **");
    strong.assertEquals(Context.of("**strong**"), new Inline.Strong("strong"));

    var codeText = some(noneOf('`', '\n')).str();
    var code = seq(c('`'), codeText, c('`'))
        .map(Ops::takeMiddle).map(Inline.Code::new);
    code.assertEquals(Context.of("`code ()*[]`"), new Inline.Code("code ()*[]"));

    var link = seq(c('['), text, c(']'), c('('), text, c(')'))
        .map( tuple6 -> new Inline.Link(tuple6.two(), tuple6.five()));
    link.assertEquals(Context.of("[desc](https://link)"), new Inline.Link("desc", "https://link"));

    var inline = choice(link, strong, code, emphasis, text)
        .map(ch5 ->
            switch (ch5) {
                case One(Inline.Link l) -> (Inline) l;
                case Two(Inline.Strong s) -> (Inline) s;
                case Three(Inline.Code c) -> (Inline) c;
                case Four(Inline.Emphasis e) -> (Inline) e;
                case Five(String t) -> (Inline) new Inline.Text(t);
            }
        );

    var r2 = some(inline).parse("text with **strong** and `code`");
    out.println(r2.ok());

}

The result:

[Text[value=text with ], Strong[value=strong], Text[value= and ], Code[value=code]]

.3. Parsing blocks

The most complex case is paragraph because it has to incorporate stylized text. For headers and links we will just parse simple text. sepBy() parser will intercalate first parser with the second parser. The separator can be just anything, here we will just use newline or multiplicity of spaces.

var paragraph = sepBy(inline, (c('\n').or(some(c(' ')))));
var r3 = paragraph.parse("""
        normal text   *emphasis text*
        **strong text**
        `code text`
        """);
assert r3.isOk();

Block quote is a possibly multiline text each line starting with '>'. Here we will join on newlines and we will merge strings with space as separator.

 var blockQuote = sepBy(
    seq(c('>'), text)
        .map(Ops::takeSecond),
    c('\n')).map(list -> String.join(" ", list));

var r4 = blockQuote.parse("""
        first line
        second line
        """);
assert r4.ok().equals("first line second line");

We can separate whole blocks by a blank line.

var blankLine = seq(many(c(' ')), some(c('\n')));

var r5 = sepBy(paragraph, blankLine).parse("""
        paragraph 1

        paragraph 2
        """);
assert r5.ok().size() == 2;

Heading is a simple text prefixed by number of '#' s. We can use times() matcher which asserts that the number of inner rule’s occurrences is exactly in the range.

var heading = seq(
        times(c('#'), 1, 6)
            .map(List::size),
        many(c(' ')),
        text
).map(p -> new Block.HeadingBlock(p.one(), p.three()));

heading.assertEquals(Context.of("## heading 2"), new Block.HeadingBlock(2, "heading 2"));

Quick note on mapping method. Sequences return tuple objects that are again up to 7. They are typesafe with the help of generics.

And now we are ready to combine everything together:

public void main(String[] args) {
    var textSingle = noneOf('*', '#', '`', '[', ']', '(', ')', '>', '\n');
    var text = some(textSingle).str();

    var emphasis = seq(c('*'), text, c('*'))
        .map(Ops::takeMiddle).map(Inline.Emphasis::new);

    emphasis.assertFails("* emphasis `code`*");
    emphasis.assertEquals(Context.of("*text*"),  new Inline.Emphasis("text"));

    var strong = seq(c("**"), text, c("**"))
        .map(Ops::takeMiddle).map(Inline.Strong::new);

    strong.assertFails("** strong *nested* **");
    strong.assertEquals(Context.of("**strong**"), new Inline.Strong("strong"));

    var codeText = some(noneOf('`', '\n')).str();
    var code = seq(c('`'), codeText, c('`'))
        .map(Ops::takeMiddle).map(Inline.Code::new);
    code.assertEquals(Context.of("`code ()*[]`"), new Inline.Code("code ()*[]"));

    var link = seq(c('['), text, c(']'), c('('), text, c(')'))
        .map( tuple6 -> new Inline.Link(tuple6.two(), tuple6.five()));
    link.assertEquals(Context.of("[desc](https://link)"), new Inline.Link("desc", "https://link"));

    var inline = choice(link, strong, code, emphasis, text)
        .map(ch5 ->
            switch (ch5) {
                case One(Inline.Link l) -> (Inline) l;
                case Two(Inline.Strong s) -> (Inline) s;
                case Three(Inline.Code c) -> (Inline) c;
                case Four(Inline.Emphasis e) -> (Inline) e;
                case Five(String t) -> (Inline) new Inline.Text(t);
            }
        );


    var paragraph = sepBy(inline, (c('\n').or(many(c(' ')))))
         .map(Block.Paragraph::new);
    var r3 = paragraph.parse("""
            normal text   *emphasis text*
            **strong text**
            `code text`
            """);
    assert r3.isOk();

    var blockQuote = sepBy(
        seq(c('>'), text)
            .map(Ops::takeSecond),
        c('\n')).map(list -> String.join(" ", list))
        .map(Block.BlockQoute::new);

    var r4 = blockQuote.parse("""
            first line
            second line
            """);
    assert r4.ok().equals(new Block.BlockQoute("first line second line"));

    var blankLine = seq(many(c(' ')), many(c('\n')));

    var r5 = sepBy(paragraph, blankLine).parse("""
            paragraph 1

            paragraph 2
            """);
    assert r5.ok().size() == 2;

    var heading = seq(
            times(c('#'), 1, 6)
                .map(List::size),
            many(c(' ')),
            text
    ).map(p -> new Block.Heading(p.one(), p.three()));

    heading.assertEquals(Context.of("## heading 2"), new Block.Heading(2, "heading 2"));

    var block = choice(blockQuote, paragraph, heading)
            .map(t3 -> switch(t3) {
                case One(Block.BlockQoute q) -> (Block) q;
                case Two(Block.Paragraph p) -> (Block) p;
                case Three(Block.Heading h) -> (Block) h;
            });

    var markdown = sepBy(block, blankLine);

    var test = """
            This is a **strong** paragraph.
            With `code`

            # Heading 1

            > Some interesting
            > quote
            """;

    out.println(markdown.parse(test).ok());
}

The result

[Paragraph[elements=[Text[value=This is a ], Strong[value=strong], Text[value=paragraph.], Text[value=With ], Code[value=code]]], Heading[nestLevel=1, text=Heading 1], BlockQoute[text= Some interesting  quote]]

Note that we are doing it in iterative way. We started with bottom-most rule and built up to more complex ones. The experience was much like working in REPL. This way you can speed up the process significantly.

Handling errors

Whenever possible the library tries to return first encountered bottom most error. But there are some optimistic matchers like:

that always succeed. This complicates the development but it is not impossible to get these errors. About that in a second.

var identifier = seq(nonZeroDigit(), many(digit())).str();

var result = identifier.parse("01");

out.println(result.error());
expected non zero digit

Sometimes you may want to return more helpful, domain specific error messages. For example instead of error on char level you would like to say that the literal is invalid.

var identifier = seq(nonZeroDigit(), many(digit())).str();
identifier = identifier.setErrorMessage("wrong literal");

var result = identifier.parse("01");

out.println(result.error());
wrong literal

You can do that with setErrorMessage() method on the matcher. It has the tradeoff that the topmost customized message overrides all error messages on the bottom. So it should be used sparingly and only for non-related rules.

There are various methods on the result to inspect the error further.

errorTrace() allows to investige the whole "stack" from top to bottom when the matcher fails.

out.println(result.errorTrace());
Line: 0, Column: 0 => error in mapping
 +- Line: 0, Column: 0 => error in sequence
  +- Line: 0, Column: 0 => expected non zero digit

Unfortunately it won’t print the stack for optimistic rules like many() opt() and sepBy(). But errors are still logged and you can view all the logs with verboseErrors(). Just take into account that some errors are completely normal during the parsing.

var as = many(c('a')).str();

var result = as.parse("aab");
out.println(result.verboseErrors());
Line: 0, Column: 2 => expected 'a'

For obtaining compiler like messages with line number, column and content context you can use errorPrettyPrint().

var as = seq(c("aaa"), c('\n'), c("aaa"));

var result = as.parse("""
aaa
baa
""");
out.println(result.errorPrettyPrint());
aaa
>>baa
--------
Line: 1, Column: 0
expected "aaa"

Mapping parser results

Matchers resemble monads in a way that they "wrap" the returned results. So if there is a Matcher<Integer> on success it will return ParseResult<Integer> which you can get with ok() method.

map() concerns itself only with success results. It is applied on success and runs the provided function during parse time. On the method invocation if you pass mapper returnin U you will get Matcher<U> as a result.

var num = some(digit()).str()
        .map(s -> Integer.valueOf(s));


num.assertEquals(
        Context.of("123"),
        Integer.valueOf(123));

mapOrError() is more generic and allows to act on MatchResult itself. This may be required if you want to map Ok result to Err of ignore Err for some reason and return default.

var str = some(digit()).str()
        .mapOrError(res -> switch (res) {
            case Ok(String s, Context ctx) -> {
                if (s.equals("hello world")) {
                    yield new Err<>("wrong string", ctx);
                } else {
                    yield new Ok<>(s, ctx);
                }
            }
            case Err e -> e;

        });


   assert str.parse("hello world").error().equals("wrong string");

In Ops you will find convenence methods that implement common mapping tasks.

Working with indentation and recursion

Indentation inside the library is handled with a smart trick. Whenever indent() takes a rule it establishes a scope with indentation incremented that is passed around with Context. Multiple nested indents are supported. nl() on the other hand is a newline, but it is indentation sensitive. It not only takes newline, but also trims input from indentation pattern times the indentation level. Here is the example:

 Recursive<Element> elementOrContainer = recursive();
   var text = some(alphaNum()).str();
   var element = text
        .map(s -> (Element) new Element.Simple(s));
   var container = seq(
        text,
        c(":"),
        indent(
            some(
                seq(
                    nl(),
                    c("- "),
                    elementOrContainer
                ).map(Ops::takeLast)
            ), "  "
        )
    ).map(v -> (Element) new Element.Container(v.one(), v.three()));

    elementOrContainer.set(any(container, element));

    out.println(elementOrContainer.parse(
"""
container1:
  - element1
  - element2
"""
    ).ok());
Container[key=container1, children=[Simple[value=element1], Simple[value=element2]]]

There is a happens-before problem for recursive rules. elementOrContainer needs to be defined before all the other rules, but for itself needs other rules definitions. To resolve this we used recursive() combinator which acts as forward declaration, and is updated later on.

Api

All JavaDocs are present here

You may take a look on all combinators

Contribution

There are several ways to help.

The library will be only as good as the much of the exposure to the community. If you would like to play around with the library and fill in some (inevitable) bugs, feel free to submit issues on the repository.

.1. Github

There is also interesting topic of how bad is the performance. Some benchmarks are needed. If I have time I will write them, but maybe you will be faster so you can submit PR.

There is also potential for writing more specialized combinators, if you have an idea contact me by email, it is under my github account.

Finally the most interesting topic - generating the imperative parser in Java. The outline would be to add definition tree generation in the rules, with mapping functions disassembled and pasted to the resulting parser. Rules on chars would be coalesced to regexes/tokenization. If you would like to work on this topic feel free, even better contact me beforehand.

If you found JavaParsec at least interesting, would you consider giving it a star?