Introduction to Jaxon: a JSON parser with streaming support for Elixir

Jaxon is a JSON parser that supports partial parsing and streaming of big documents using JSON path expressions.

The library was released a few months ago, and I have been running it in production since. Now that the latest release is out, here is a guide on how to use it, some use cases, and an overview of how it all works internally, which might be interesting to some folks!

So how is Jaxon different from all the other libraries out there?

A few things are different, but the biggest one is that Jaxon implements a resumable parser that lets you feed the JSON data in chunks, instead of expecting the whole document at once in a single string. This allows you to do some interesting things.
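
To make the idea of a resumable parser more concrete, here is a minimal sketch in plain Elixir. It is not how Jaxon is implemented: ChunkedNumbers is a hypothetical module that only handles well-formed, comma-separated integers. What it does share with a resumable JSON parser is the core trick of carrying the unfinished tail of one chunk over to the next.

```elixir
defmodule ChunkedNumbers do
  # Hypothetical illustration, not part of Jaxon. Assumes well-formed input:
  # integers separated by single commas, with no spaces or trailing comma.

  # Feed one chunk plus the leftover from the previous call.
  # Returns {complete_integers, new_leftover}.
  def feed(chunk, leftover \\ "") do
    parts = String.split(leftover <> chunk, ",")
    # The last part may be cut off mid-number, so keep it for the next call.
    {complete, [rest]} = Enum.split(parts, -1)
    {Enum.map(complete, &String.to_integer/1), rest}
  end

  # Lazily turn a stream of chunks into a stream of integers.
  def stream(chunks) do
    chunks
    # A sentinel marks the end of input so the last leftover can be flushed.
    |> Stream.concat([:eof])
    |> Stream.transform("", fn
      :eof, "" -> {[], ""}
      :eof, leftover -> {[String.to_integer(leftover)], ""}
      chunk, leftover when is_binary(chunk) -> feed(chunk, leftover)
    end)
  end
end

# The numbers 12, 34 and 56 are split across chunk boundaries.
["12,3", "4,5", "6"] |> ChunkedNumbers.stream() |> Enum.to_list()
```

Because the state between calls is just the leftover binary, the caller never has to hold more than one chunk plus one unfinished value in memory.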

Loading the whole document at once is usually not an issue when you are using JSON to encode and decode simple HTTP API requests, because most requests do not exceed a few KiB. But at Sqlify, because of the nature of the product, we needed to parse multi-GB files without holding the whole file in memory, and it had to be really fast.

A typical solution to this problem is to ask users to provide one JSON document per line, separated by newlines, like this:

{"hello":"world"}
{"hello":"world"}
{"hello":"world"}

Each line is then decoded on its own:

Enum.map(lines, fn line ->
  case decode(line) do
    {:ok, term} -> ...
    {:error, term} -> ...
  end
end)

The only problem with this is that I didn't want to require everyone using my service to provide files in this format. I wanted a solution that would just parse an incoming stream of JSON, no matter how it is laid out, to support as many user-provided files as possible.

This is how Jaxon parses and streams values using JSON path expressions:

[
  ~s({"jaxon":"rocks"),
  ~s(,"array":[1,2]})
] 
|> Stream.cycle()
|> Stream.take(5_000)
|> Jaxon.Stream.query([:root, "array", :all])
|> Enum.reduce(&(&1 + &2))
# 7500 (5,000 chunks form 2,500 documents, each contributing 1 + 2)

As you can see, the JSON document is split in half. If this were a stream over a network or a file stream, the document would be received in a similar fashion.

[:root, "array", :all] is the decoded version of the JSON path expression $.array[*]. For convenience, you can decode and encode paths using the helper Jaxon.Path module:

iex> Jaxon.Path.decode("$.array[*]")
{:ok, [:root, "array", :all]}

iex> Jaxon.Path.encode([:root, "array", 1])
{:ok, "$.array[1]"}
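
For readers curious about what decoding a path involves, here is a minimal sketch. MiniPath is a hypothetical module, not Jaxon.Path, and it only understands the root symbol, dotted keys, numeric indexes and the wildcard. It returns the same list shape shown above.

```elixir
defmodule MiniPath do
  # Hypothetical sketch, not Jaxon.Path. Supports only `$`, `.key`,
  # `[n]` and `[*]`; anything else is rejected.

  def decode("$" <> rest) do
    matches = Regex.scan(~r/\.(\w+)|\[(\d+)\]|\[(\*)\]/, rest)

    # If the matched pieces do not add up to the whole input, some part of
    # the path was not understood, so report an error instead of guessing.
    if Enum.map_join(matches, &hd/1) == rest do
      {:ok, [:root | Enum.map(matches, &segment/1)]}
    else
      {:error, :invalid_path}
    end
  end

  def decode(_other), do: {:error, :invalid_path}

  # Regex.scan fills earlier non-participating groups with "" and drops
  # trailing ones, so the list length tells us which alternative matched.
  defp segment([_, key]), do: key
  defp segment([_, "", index]), do: String.to_integer(index)
  defp segment([_, "", "", "*"]), do: :all
end

MiniPath.decode("$.array[*]")
```

Keeping paths as plain lists of keys, indexes and atoms means they can be built or pattern matched directly in Elixir, with the string form as an optional convenience.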

Jaxon still supports decoding simple strings:

iex> Jaxon.decode!(~s({"jaxon":"rocks","array":[1,2]}))
%{"array" => [1, 2], "jaxon" => "rocks"}

How is it implemented?

Jaxon separates the concepts of parsing and decoding, allowing decoupled implementations for each one. We call parsing the process of taking a string representation of JSON and converting it into a list of tokens, for example:

iex> Jaxon.Parser.parse(~s({"key":true}))
[:start_object, {:string, "key"}, :colon, {:boolean, true}, :end_object]

Even though the parser is able to tokenize the JSON document, it doesn't understand JSON per se. It only produces a list of tokens without checking that they form a valid document, so this works too:

iex> Jaxon.Parser.parse(~s(true false 0, "hello"))
[{:boolean, true}, {:boolean, false}, {:integer, 0}, :comma, {:string, "hello"}]

Then the decoder takes a list of tokens and reduces it into a final Elixir term. This is where the correctness of the token list is checked.

iex> Jaxon.Decoder.events_to_term([
...> :start_object,
...> {:string, "key"}, :colon, {:boolean, true},
...> :end_object
...> ])
{:ok, %{"key" => true}}

And error handling is done like this:

iex> {:error, error} = Jaxon.Decoder.events_to_term([:start_object, :end_stream])
{:error, %Jaxon.ParseError{expected: [:key], message: nil, unexpected: :end_stream}}
iex> Jaxon.ParseError.message(error)
"Unexpected end of stream, expected a key instead."
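
To illustrate the decoding step, here is a minimal sketch of a decoder that walks a token list and builds an Elixir term. MiniDecoder is hypothetical and much simpler than Jaxon.Decoder: its error values are plain tuples rather than Jaxon.ParseError structs, and the :start_array and :end_array token names are assumed by analogy with the object tokens shown above.

```elixir
defmodule MiniDecoder do
  # Hypothetical sketch, not Jaxon.Decoder. :start_array and :end_array are
  # assumed token names; errors are simplified to {:unexpected, token}.

  def to_term(tokens) do
    case value(tokens) do
      {:ok, term, []} -> {:ok, term}
      {:ok, _term, [extra | _]} -> {:error, {:unexpected, extra}}
      {:error, reason} -> {:error, reason}
    end
  end

  # Scalars carry their value inside the token.
  defp value([{type, v} | rest]) when type in [:string, :boolean, :integer], do: {:ok, v, rest}
  defp value([:start_array, :end_array | rest]), do: {:ok, [], rest}
  defp value([:start_array | rest]), do: items(rest, [])
  defp value([:start_object, :end_object | rest]), do: {:ok, %{}, rest}
  defp value([:start_object | rest]), do: pairs(rest, %{})
  defp value([token | _]), do: {:error, {:unexpected, token}}
  defp value([]), do: {:error, {:unexpected, :end_stream}}

  # Array items: a value, then either a comma or the closing token.
  defp items(tokens, acc) do
    with {:ok, item, rest} <- value(tokens) do
      case rest do
        [:comma | more] -> items(more, [item | acc])
        [:end_array | more] -> {:ok, Enum.reverse([item | acc]), more}
        [token | _] -> {:error, {:unexpected, token}}
        [] -> {:error, {:unexpected, :end_stream}}
      end
    end
  end

  # Object members: string key, colon, value, then comma or closing token.
  defp pairs([{:string, key}, :colon | tokens], acc) do
    with {:ok, val, rest} <- value(tokens) do
      acc = Map.put(acc, key, val)

      case rest do
        [:comma | more] -> pairs(more, acc)
        [:end_object | more] -> {:ok, acc, more}
        [token | _] -> {:error, {:unexpected, token}}
        [] -> {:error, {:unexpected, :end_stream}}
      end
    end
  end

  defp pairs([token | _], _acc), do: {:error, {:unexpected, token}}
  defp pairs([], _acc), do: {:error, {:unexpected, :end_stream}}
end

MiniDecoder.to_term([:start_object, {:string, "key"}, :colon, {:boolean, true}, :end_object])
```

A token list such as [{:boolean, true}, {:boolean, false}] tokenizes without complaint, but this sketch rejects it at the decoding stage because a second value follows the first.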

The main bottleneck when processing big JSON documents is normally the parsing step, which is why the default parser is written in C and implemented as an Erlang NIF.

Separating decoding and parsing also allows us to keep the parser very simple and reduce its responsibilities to a minimum. This way we are able to write a simple parser in C, with very fast NIF calls (<1ms) that don't mess with the Erlang scheduler.

Benchmarks

Because the parser is written in C, you can imagine it's pretty fast. Here are the latest benchmarks:

https://github.com/boudra/jaxon/blob/master/BENCHMARKS.md

You can run the benchmarks on your machine by cloning the repo and running:

mix bench.decode

Credit to Michał Muskała for the benchmarking code.

What's next?

Encoder

At the moment Jaxon can only decode, so an encoder is a must if this library is to be used in production.

Native Elixir parser

Unfortunately, only the NIF parser implementation is available at the moment, but it's already possible to write a native Elixir one by implementing the Parser behaviour.
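
As a rough idea of what tokenizing in pure Elixir could look like, here is a small sketch built on binary pattern matching. MiniTokenizer is hypothetical: it does not implement Jaxon's Parser behaviour (whose callbacks are not covered in this post), it is not resumable, and it only supports punctuation, true/false, non-negative integers and strings without escape sequences. The :start_array and :end_array token names are assumptions.

```elixir
defmodule MiniTokenizer do
  # Hypothetical sketch; it does not implement Jaxon's Parser behaviour.
  # Covers a small subset of JSON and expects the whole input at once.

  def tokens(binary), do: scan(binary, [])

  defp scan(<<>>, acc), do: {:ok, Enum.reverse(acc)}
  # Whitespace between tokens is skipped.
  defp scan(<<c, rest::binary>>, acc) when c in [?\s, ?\n, ?\t, ?\r], do: scan(rest, acc)
  defp scan("{" <> rest, acc), do: scan(rest, [:start_object | acc])
  defp scan("}" <> rest, acc), do: scan(rest, [:end_object | acc])
  defp scan("[" <> rest, acc), do: scan(rest, [:start_array | acc])
  defp scan("]" <> rest, acc), do: scan(rest, [:end_array | acc])
  defp scan(":" <> rest, acc), do: scan(rest, [:colon | acc])
  defp scan("," <> rest, acc), do: scan(rest, [:comma | acc])
  defp scan("true" <> rest, acc), do: scan(rest, [{:boolean, true} | acc])
  defp scan("false" <> rest, acc), do: scan(rest, [{:boolean, false} | acc])

  defp scan("\"" <> rest, acc) do
    # Everything up to the next double quote is the string body.
    case String.split(rest, "\"", parts: 2) do
      [body, remaining] -> scan(remaining, [{:string, body} | acc])
      [_unterminated] -> {:error, :unterminated_string}
    end
  end

  defp scan(<<c, _::binary>> = binary, acc) when c in ?0..?9 do
    # Integer.parse/1 consumes the leading digits and returns the remainder.
    {int, rest} = Integer.parse(binary)
    scan(rest, [{:integer, int} | acc])
  end

  defp scan(<<c, _::binary>>, _acc), do: {:error, {:unexpected_byte, c}}
end

MiniTokenizer.tokens(~s({"key":true}))
```

Like the parser described earlier, this tokenizer does not check that the tokens form a valid document. That job stays with the decoder.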

Extend JSON path support

Support for advanced path expressions like:

$.array[:5] # get all items until the 5th index
$.array[0,1] # get items 0 and 1
$.array[?(@.price > 10)] # get objects where price is greater than 10
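
To pin down what these expressions are meant to select, here is a sketch of the equivalent operations on an already decoded list. PathSemantics and its functions are hypothetical helpers, not Jaxon functions, and the sketch assumes that [:5] excludes index 5, as slices usually do in JSON path implementations.

```elixir
defmodule PathSemantics do
  # Hypothetical helpers, not Jaxon functions. Each one mirrors one of the
  # expressions above using plain Enum calls on a decoded array.

  # $.array[:5] -> the items at indexes 0 through 4 (end index assumed exclusive)
  def up_to(items, stop), do: Enum.take(items, stop)

  # $.array[0,1] -> the items at the listed indexes
  def at_indexes(items, indexes), do: Enum.map(indexes, &Enum.at(items, &1))

  # $.array[?(@.price > 10)] -> the objects whose price is greater than the limit
  def price_above(items, limit), do: Enum.filter(items, &(&1["price"] > limit))
end

items = [%{"price" => 5}, %{"price" => 12}, %{"price" => 30}]
PathSemantics.price_above(items, 10)
```

In a streaming setting the same selections would have to be made while the tokens arrive, without first building the whole list, which is what makes these features more work than the list version suggests.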

Feel free to check out the issues list on GitHub. Help would be greatly appreciated!
