In the age of Single-Page Applications and REST APIs (JSON over HTTP), the cool kids use JSON as the data serialization format. In other words, to transfer data over the wire, they convert in-memory data structures into a series of bytes which are, nominally, valid JSON documents.

Most of the time, REST responses have a predefined schema for each endpoint and will always respond with exactly that schema. For example: GET /api/v1/mountains?page=2 will respond with the following JSON:

{
  "total_pages": 11,
  "mountains": [
    {"name": "Kočna", "elevation_m": 2540, "popularity": {"percent": 98.6, "rank": 62}},
    {"name": "Mrzla Gora", "elevation_m": 2203, "popularity": {"percent": 97.2, "rank": 90}},
    {"name": "Mali Grintavec", "elevation_m": 1813, "popularity": {"percent": 93.0, "rank": 250}}
  ],
  "timestamp": 1740928578
}

For this example, the schema (or the type) of the response would be:

type Response = {
  total_pages = int,
  mountains = [
    {
      name = text,
      elevation_m = int,
      popularity = {
        percent = float,
        rank = int,
      },
    }
  ],
  timestamp = int,
}

The server and the client both assume that the response has this schema and if it does not, the whole request can be considered invalid.

Now, this raises the question: why do we bother with property names at all?

If each mountain object always has a name, followed by elevation and popularity, why would we not send it as an array? That would allow us to omit property names, making the response much shorter, without losing any data.

{
  "total_pages": 11,
  "mountains": [
    ["Kočna", 2540, {"percent": 98.6, "rank": 62}],
    ["Mrzla Gora", 2203, {"percent": 97.2, "rank": 90}],
    ["Mali Grintavec", 1813, {"percent": 93.0, "rank": 250}]
  ],
  "timestamp": 1740928578
}

And shorter it is, by a lot!

I’ve applied this rule to each of the mountains, but I could apply it to their popularity scores and the top-level object as well.

[
  11,
  [
    ["Kočna", 2540, [98.6, 62]],
    ["Mrzla Gora", 2203, [97.2, 90]],
    ["Mali Grintavec", 1813, [93.0, 250]]
  ],
  1740928578
]

It’s now arrays, all the way down.
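In code, this positional encoding is just a schema-driven pack/unpack pair. A minimal Python sketch (the function names are mine, not part of any library), relying on the schema to fix the field order as name, elevation_m, then popularity:

```python
# Pack a mountain object into positional arrays, and back.  The order of the
# fields is fixed by the schema: name, elevation_m, popularity.
def pack_mountain(m):
    return [m["name"], m["elevation_m"],
            [m["popularity"]["percent"], m["popularity"]["rank"]]]

def unpack_mountain(row):
    name, elevation_m, (percent, rank) = row
    return {"name": name, "elevation_m": elevation_m,
            "popularity": {"percent": percent, "rank": rank}}

kocna = {"name": "Kočna", "elevation_m": 2540,
         "popularity": {"percent": 98.6, "rank": 62}}
assert pack_mountain(kocna) == ["Kočna", 2540, [98.6, 62]]
assert unpack_mountain(pack_mountain(kocna)) == kocna
```

As long as both sides agree on the schema, the round trip loses nothing.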

If we are very pedantic, we can try to save more bytes by flattening popularity into the mountain array. It will always have exactly two items anyway, so why would a nested array matter?

[
  11,
  [
    ["Kočna", 2540, 98.6, 62],
    ["Mrzla Gora", 2203, 97.2, 90],
    ["Mali Grintavec", 1813, 93.0, 250]
  ],
  1740928578
]

Can we go further? Can we prune the message down to exactly just the information that is needed?

Well, yes! We can flatten each mountain into the mountains array. Because we know every mountain contains exactly 4 items, the whole list collapses into a single flat array.

[
  11,
  [
    "Kočna", 2540, 98.6, 62,
    "Mrzla Gora", 2203, 97.2, 90,
    "Mali Grintavec", 1813, 93.0, 250
  ],
  1740928578
]

Now we will take a step that might seem mad, but trust me, I have a plan. We will flatten the mountains array (the one that can contain a variable number of items) into the parent array. Doing it in place would break our top-level object, though.

We still need a way of accessing the timestamp, which comes after this array. And because the number of array items is not statically known, we need to move them elsewhere. So we place the contents of mountains at the end and replace them with just their count. This way, timestamp can still be accessed as response[2].

[
  11,
  3,
  1740928578,
  "Kočna", 2540, 98.6, 62,
  "Mrzla Gora", 2203, 97.2, 90,
  "Mali Grintavec", 1813, 93.0, 250
]
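The move-to-the-end trick can be sketched as a small encoder (a Python sketch of my own, not any library's API): fixed-position header slots first, variable-length rows appended at the end.

```python
# Flatten the whole response into a single array: three fixed-position header
# slots (total_pages, mountain count, timestamp) first, then the
# variable-length mountain rows appended at the end.
def encode_flat(response):
    flat = [response["total_pages"],
            len(response["mountains"]),
            response["timestamp"]]
    for m in response["mountains"]:
        flat += [m["name"], m["elevation_m"],
                 m["popularity"]["percent"], m["popularity"]["rank"]]
    return flat

response = {
    "total_pages": 11,
    "mountains": [
        {"name": "Kočna", "elevation_m": 2540,
         "popularity": {"percent": 98.6, "rank": 62}},
    ],
    "timestamp": 1740928578,
}
flat = encode_flat(response)
assert flat == [11, 1, 1740928578, "Kočna", 2540, 98.6, 62]
assert flat[2] == 1740928578   # timestamp stays at a fixed index
```

No matter how many mountains the page contains, the header fields keep their positions.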

As I planned, we end up with a single array. If we are willing to go beyond JSON, we can now use a plain comma-separated list:

11, 3, 1740928578,
"Kočna", 2540, 98.6, 62,
"Mrzla Gora", 2203, 97.2, 90,
"Mali Grintavec", 1813, 93.0, 250

I could write all items on one line but for the sake of readability, I won’t.

When it comes to custom data formats, a big advantage is the ability to read a part of the data without parsing the whole file. We have this here, in a way:

  • to read the timestamp, we can scan the file from the start and read only the third element,

  • to read the popularity rank of the second mountain, we read the element at position 3 + (1 * 4) + 3. In this formula, the numbers are:

    • 3 elements of the top-level object,
    • 1 is the index of the second mountain,
    • 4 is the number of mountain properties,
    • 3 is the index of the popularity.rank.
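The index arithmetic above can be checked directly (a Python sketch; the constant names are mine):

```python
# Partial "reads" against the flat array: the offsets come from the schema
# alone, so we never rebuild the nested objects.
HEADER = 3    # total_pages, mountain count, timestamp
FIELDS = 4    # name, elevation_m, percent, rank per mountain
RANK = 3      # index of popularity.rank within a mountain

flat = [11, 3, 1740928578,
        "Kočna", 2540, 98.6, 62,
        "Mrzla Gora", 2203, 97.2, 90,
        "Mali Grintavec", 1813, 93.0, 250]

assert flat[2] == 1740928578                    # the timestamp
assert flat[HEADER + 1 * FIELDS + RANK] == 90   # rank of the second mountain
```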

Scanning for the third element still requires reading the part of the file before our element. If we wanted to avoid that, we could define how many bytes would be taken by each item, save this information in the schema and add padding to ensure that fields that are too short don’t mess up the position of the following fields.

I’ll use:

  • 4 digits for most integers (total_pages, count of mountains, elevation_m, rank),
  • 4 bytes for floats (percent),
  • 14 bytes for text,
  • 10 bytes for the timestamp,
  • spaces and new lines to make things readable (they could be omitted in a real implementation),
  • the characters _ and 0 for padding,
  • no commas or quotes (because they are not needed anymore).

0011 0003 1740928578
Kočna_________ 2540 98.6 0062
Mrzla Gora____ 2203 97.2 0090
Mali Grintavec 1813 93.0 0250

Now, a partial read is super easy (the positions below assume the readability whitespace is omitted):

  • to read the timestamp, we start reading at position 8 and read 10 bytes,

  • to read the popularity rank of the second mountain, we start at position 18 + (1 * 26) + 22 and read 4 bytes. In this formula, the numbers are:

    • 18 bytes of the top-level object,
    • 1 is the index of the second mountain,
    • 26 is the number of bytes of one mountain object,
    • 22 is the number of bytes of the fields before popularity.rank.
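The offset-based reads can be sketched like this (offsets in characters, whitespace omitted; again, a strict implementation would count UTF-8 bytes):

```python
# Partial reads by schema-derived offset arithmetic alone.
HEADER = 18        # 4 + 4 + 10
MOUNTAIN = 26      # 14 + 4 + 4 + 4
BEFORE_RANK = 22   # name (14) + elevation_m (4) + percent (4)

buf = ("0011" "0003" "1740928578"
       "Kočna_________" "2540" "98.6" "0062"
       "Mrzla Gora____" "2203" "97.2" "0090"
       "Mali Grintavec" "1813" "93.0" "0250")

def read_timestamp(buf):
    return int(buf[8:8 + 10])

def read_rank(buf, i):
    start = HEADER + i * MOUNTAIN + BEFORE_RANK
    return int(buf[start:start + 4])

assert read_timestamp(buf) == 1740928578
assert read_rank(buf, 1) == 90
```

Nothing before the requested field has to be parsed, only skipped.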

What happens when a mountain has a rank of 10000? What happens when a mountain has a name longer than 14 bytes? Well, our naive format breaks. But there are ways around that. For ranks, we could use 8 bytes from the start and be safe. For names, we could use a trick similar to the one we used when flattening the array: store the length in bytes and a pointer to the start of the contents, then append the contents to the back of the message.
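Here is a hypothetical sketch of that (length, pointer) scheme, reduced to just the names (my own layout, lengths measured in characters for simplicity): each record gets a fixed 8-character slot, 4 digits of length plus 4 digits of offset, and the actual characters live in a heap appended after all the slots.

```python
# Variable-length strings via fixed-size (length, offset) slots plus a
# string heap appended after the slot table.
def encode_names(names):
    slots_len = len(names) * 8
    heap, slots = "", []
    for name in names:
        slots.append(f"{len(name):04d}{slots_len + len(heap):04d}")
        heap += name
    return "".join(slots) + heap

def read_name(buf, i):
    slot = buf[i * 8:(i + 1) * 8]
    length, offset = int(slot[:4]), int(slot[4:])
    return buf[offset:offset + length]

buf = encode_names(["Kočna", "Mrzla Gora", "Mali Grintavec"])
assert read_name(buf, 1) == "Mrzla Gora"
```

The records stay fixed-width, so offset arithmetic still works, and only the final lookup chases a pointer.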


Let’s stop here and admire our creation. An unholy mess of data, with its meaning hidden in the indexes of one glorious byte array.

We have split the 297 bytes of JSON into 96 bytes of data (67% reduction!) and the following schema definition:

type Response = {
  total_pages = int of 4 bytes,
  mountains = [
    {
      name = text of 14 bytes,
      elevation_m = int of 4 bytes,
      popularity = {
        percent = float of 4 bytes,
        rank = int of 4 bytes
      }
    }
  ],
  timestamp = int of 10 bytes,
}

… and we can now perform partial reads of our data in O(1).

However, the data itself is no longer human-readable at all. And producing and consuming this format is much, much harder than a JSON.stringify call used to be.

Which means our work here is not yet done.


This is the first post in a series about data formats.

I plan to write about how serialization frameworks can split verbose messages into types and data, without sacrificing the ease of use within the programming language.

Furthermore, such division allows the type information to “flow” from data sources to their consumers while reducing the amount of bytes transferred and allowing certain performance improvements (such as partial reads).