RION - A Fast, Compact, Versatile Data Format

In this article, learn all about RION, a new, fast, compact, versatile data format optimized especially for working with tabular data.

Jakob Jenkov

Feb. 13, 20 · Tutorial

Raw Internet Object Notation (RION) is a fast, compact, versatile data format. Yes, I know what you are thinking: "Yet another data format." How is this different from CSV, XML, JSON, YAML, ProtoBuf, MessagePack, CBOR, Amazon’s ION, Apache Avro or ASN.1? Hang on, and I’ll explain that throughout this article, but first I will have to give a bit of background information on RION.

A Data Format for Efficient Exchange and Storage of Data

RION is developed by Nanosai, a distributed systems R&D company that I co-founded, as an "open standard", meaning everyone is welcome to use it. RION was originally designed as a data format for efficient data exchange. However, we have since expanded the target use case to include efficient storage of structured data too.

We believe the use cases are closely related enough that this expansion makes sense. The main difference between these two use cases is that a message sent over the network tends to have a fixed size (once sent, at least), whereas a file may reasonably be expected to grow over time.

We are actually using RION both as message encoding for a network protocol and as a record format in a data streaming storage engine, so we have pressure-tested RION for both data exchange and data storage.

A Versatile Data Format

Right from the start, we wanted to make RION as versatile as possible, meaning it should be able to encode a wide variety of structured data. The hope was that with a more versatile data format, developers would not have to switch between different data formats as often. The more often we would just be able to default to RION, the better.

Obviously, no data format is perfect for every type of data. For some use cases, such as audio, video, or strongly formatted documents, domain-specific encodings like MP3, MP4, or PDF might be more appropriate. To accommodate such situations, RION is designed to be able to embed binary data along with other structured data (e.g., metadata).

RION is also designed to enable you to extend it with custom data types for situations where the built-in data types are insufficient.

Supported Data Structures

To be truly versatile, RION had to be able to represent a wide variety of data structures. Currently, RION enables you to represent:

Binary data.
Typed data fields (boolean, int, float, text, datetime).
Single fields.
Unbounded streams of fields.
Bounded lists of fields (arrays).
Tabular data (like a CSV file where column names are only included once).
Objects and maps (key-value pairs).
Object graphs (objects with nested objects).

These data structures can be combined to create more advanced structures. For instance, you could have a table with other tables nested inside of it, or a table with object graphs nested inside, which again could have tables inside, etc.

Field Types

RION-encoded data consists of one or more RION fields. Each RION field has a type. RION currently contains the following field types:

Bytes.
Boolean.
Int-Pos.
Int-Neg.
Float (32 or 64 bit).
UTF-8.
UTF-8 Short.
UTC.
(Reference).
Array(*).
Table.
Object.
Key.
Key Short.
Extended.

Let me elaborate a bit on these field types.

The Bytes field is intended for “unstructured” binary data. For instance, if you need to embed an audio or video file in RION (or any other kind of file or binary data), you can embed it inside a Bytes field. This allows for the efficient transfer of binary data.

Boolean fields can represent the values true and false.

The Int-Pos and Int-Neg fields represent positive and negative integers. Positive integers are encoded so they only include significant bytes. Thus, an Int-Pos field containing the number 127 can be represented using 2 bytes, 1024 using 3 bytes, etc. Negative numbers, on the other hand, are a bit more challenging.

For instance, a negative integer stored as a 32-bit two’s complement value requires 4 bytes because all bytes are significant. To encode negative numbers more efficiently, we have created a negative integer field that contains the absolute (positive) value of the negative integer, minus 1. For example, -1 is stored as 0 and -256 as 255. This allows us to use the same efficient “significant byte” encoding as we do with positive integers.
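The integer scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the RION Ops implementation: the function names are hypothetical, the big-endian byte order and the choice to store zero in one byte are assumptions, and the sketch produces only the value bytes (the extra byte that identifies the field is left out).

```python
from typing import Tuple


def significant_bytes(value: int) -> bytes:
    """Big-endian bytes of a non-negative integer, without leading zero bytes."""
    if value < 0:
        raise ValueError("expected a non-negative integer")
    # Assumption for this sketch: zero still occupies one value byte.
    length = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(length, "big")


def encode_int(value: int) -> Tuple[str, bytes]:
    """Return the field kind and the value bytes for an integer."""
    if value >= 0:
        return ("Int-Pos", significant_bytes(value))
    # Negative numbers are stored as (absolute value - 1).
    return ("Int-Neg", significant_bytes(-value - 1))


def decode_int(kind: str, payload: bytes) -> int:
    """Reverse encode_int."""
    magnitude = int.from_bytes(payload, "big")
    return magnitude if kind == "Int-Pos" else -(magnitude + 1)


print(encode_int(127))   # ('Int-Pos', b'\x7f')
print(encode_int(-256))  # ('Int-Neg', b'\xff')
```

With this scheme, -256 needs a single value byte, whereas its 32-bit two’s complement form needs all four.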

Floats can be either 32-bit or 64-bit floating point numbers.

UTF-8 and UTF-8 Short are for textual data stored in UTF-8 format. The UTF-8 Short field can encode text of 15 bytes or less, using 1 byte less than the UTF-8 field. In data with many text fields, that 1 byte saved per field can add up to a significant amount.
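A writer has to pick between the two text fields based on the encoded size, not the character count. Here is a minimal sketch of that decision; the function name `choose_text_field` and the constant `SHORT_LIMIT` are hypothetical and not part of any RION toolkit.

```python
SHORT_LIMIT = 15  # largest UTF-8 Short payload in bytes, per the description above


def choose_text_field(text: str) -> str:
    """Pick the field type from the size of the UTF-8 encoded text."""
    payload = text.encode("utf-8")
    return "UTF-8 Short" if len(payload) <= SHORT_LIMIT else "UTF-8"


print(choose_text_field("RION"))   # UTF-8 Short
print(choose_text_field("é" * 8))  # UTF-8 (8 characters, but 16 bytes)
```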

UTC is for storing date and time in UTC. Exchanging date-time information over the network is a common use case, so we felt RION should support that. To avoid time zone confusion, we decided to “force” date-time fields to be represented as UTC time.
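In practice, this means a sender normalizes local times to UTC before writing them. The sketch below shows that step with Python’s standard library only; the helper name `to_utc` is hypothetical, and how the resulting value is laid out in a RION UTC field is not shown here.

```python
from datetime import datetime, timedelta, timezone


def to_utc(moment: datetime) -> datetime:
    """Convert a time-zone-aware datetime to UTC; reject naive values."""
    if moment.tzinfo is None:
        raise ValueError("a naive datetime has no time zone to convert from")
    return moment.astimezone(timezone.utc)


# 09:30 in a UTC+1 zone is 08:30 in UTC.
local = datetime(2020, 2, 13, 9, 30, tzinfo=timezone(timedelta(hours=1)))
print(to_utc(local).isoformat())  # 2020-02-13T08:30:00+00:00
```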

The Reference field is only in the idea phase so far. It is intended to represent a “back reference” to a RION field found earlier in the RION data. This can be used to represent cyclic object graphs. It could also be used to avoid duplicating redundant information in, e.g., an RDBMS result set or a microservice query response. We might add other fields for representing copies of redundant data in the future (e.g., a Copy field).

The Array field is used to represent arrays (lists) of RION fields. Thus, a RION Array can contain other RION fields nested inside it. The Array field is thus a composite field. Please note that we have realized that you can represent an array as a table with a single column, so we might actually remove the Array field and just have the Table field for arrays and tabular data.

The Table field is used to represent tabular data of columns and rows, like a CSV file or the result of an SQL query against a relational database. To encode tabular data efficiently, a Table only contains the column names (Key fields) for the rows once. After the column names follow the rows of column values. This is similar to a CSV file, where the first line is the column headers, and subsequent lines are the column values for each row. A Table with a single column can be used to represent an Array, so we might remove the Array field as mentioned earlier. Tables can contain other RION fields nested inside them. Thus, you can use Tables with nested Tables to represent tree structures more efficiently too.
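To make the “column names only once” idea concrete, here is a small sketch that reshapes a list of records into a header plus a flat run of values, and back again. It works on plain Python lists rather than on RION bytes, and the function names are hypothetical; it only illustrates the shape of the data, not the binary layout of a Table field.

```python
def rows_to_table(rows):
    """Return (column_names, flat_values): names once, then values row by row."""
    if not rows:
        return [], []
    columns = list(rows[0].keys())
    values = []
    for row in rows:
        if list(row.keys()) != columns:
            raise ValueError("all rows must have the same columns in the same order")
        values.extend(row[column] for column in columns)
    return columns, values


def table_to_rows(columns, values):
    """Rebuild the list of records from the header and the flat values."""
    if not columns:
        return []
    width = len(columns)
    return [dict(zip(columns, values[i:i + width]))
            for i in range(0, len(values), width)]


rows = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
columns, values = rows_to_table(rows)
print(columns)  # ['id', 'name']
print(values)   # [1, 'Ann', 2, 'Bob']
```

Encoded as a list of objects, the keys `id` and `name` would be repeated for every row; in the tabular shape they appear once, no matter how many rows follow.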

The Object field is used to represent objects or maps (dictionaries) of key-value pairs. Typically key-value pairs will be encoded as a Key field followed by some other field, but you can leave out the keys if you want (or the values too - if that makes sense in your use case). You can nest other RION fields inside an Object including Array or Table fields to represent the object graph you need. So far you can only represent acyclic object graphs, but once we finish the specification of the Reference field you can represent cyclic object graphs too.

The Extended field type is intended to enable you to specify your own field types, so you can embed other types of data than possible with the core RION field types.

Compactness

To be efficient to exchange and store, compactness was essential for RION. Thus, we have gone to great lengths to make the RION encoding as compact as possible. Sometimes, we have had to make some compromises in order to achieve other design goals, like high read speed, but for the most part RION is reasonably compact.

Speed

Another goal we had for RION was to make it fast to read and write. Whenever a tradeoff between read speed and write speed had to be made, we favored read speed, since we expect data to be read more often than written. For instance, a RION file might be written once and read many times after that. The same is true for network messages. They are written once, and then read one or more times in transit and during processing.

RION uses a compact binary encoding. That by itself makes it faster to read and write than textual encodings like XML, JSON, and YAML, and on par with MessagePack, CBOR, and Amazon’s ION.

Furthermore, RION is designed to be used directly in its binary form. When reading RION directly in its binary form rather than first deserializing it to Java objects, we have seen speedups of 10x for simple use cases. The speedup could be greater or smaller for more advanced use cases.

Additionally, RION is designed to allow for partial parsability and arbitrary hierarchical navigation. Often, a service may return more data than a given client needs in a particular situation. Rather than having to parse all the returned data, RION enables you to efficiently navigate through the parts you don’t need to get to the parts you need. You would do so by navigating through the RION data in its binary form and parsing out the fields you need, skipping over the rest.
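As a rough illustration of skipping over fields without parsing them, the sketch below uses a deliberately simplified stand-in layout: one type byte, one length byte, then the value. This is not the actual RION byte layout, and the type codes and function names are hypothetical. The point is only that a length prefix lets a reader jump past a field it does not need.

```python
TYPE_BYTES = 0x01  # hypothetical type codes, for this sketch only
TYPE_UTF8 = 0x02


def pack_field(field_type: int, value: bytes) -> bytes:
    """Prefix a value with a type byte and a one-byte length (max 255 bytes)."""
    if len(value) > 255:
        raise ValueError("this simplified layout only supports values up to 255 bytes")
    return bytes([field_type, len(value)]) + value


def read_field_at(buffer: bytes, index: int):
    """Return (type, value) of the field at `index`, skipping earlier fields unparsed."""
    offset = 0
    for _ in range(index):
        if offset + 2 > len(buffer):
            raise IndexError("field index out of range")
        # Jump over the 2-byte header and the value without looking at the value.
        offset += 2 + buffer[offset + 1]
    if offset + 2 > len(buffer):
        raise IndexError("field index out of range")
    field_type = buffer[offset]
    length = buffer[offset + 1]
    return field_type, buffer[offset + 2:offset + 2 + length]


# A large binary field followed by a small text field.
message = pack_field(TYPE_BYTES, b"\x00" * 200) + pack_field(TYPE_UTF8, b"hello")
print(read_field_at(message, 1))  # (2, b'hello')
```

Reading the second field touches only the two header bytes of the first one; the 200 value bytes are never inspected.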

Binary Encoding

To be able to achieve both compactness and speed, RION uses a binary encoding. A binary encoding enables a more compact encoding of numbers, dates, and binary data than a textual encoding does.

One of the common objections to binary encodings is that they are hard to read for humans during development, debugging, monitoring, etc. To address this concern, we are currently working on a textual encoding of RION (so far called TION), so you can convert RION to TION and back for easy reading and editing in a text editor. TION is not 100% ready, but I expect it will be sometime during 2020. We have also implemented a RION to “formatted hexadecimal notation” converter that makes it possible to inspect raw byte values in a text editor.

Self-Describing

RION uses a self-describing encoding, meaning you do not need a schema to make sense of a block of RION data. Having a data format be self-describing makes it easier to work with, as you can navigate through data to see its structure without knowing its schema. This also makes it easier to route a message for intermediate nodes that do not know the schema of the message.

Even if a data format is self-describing, it can still make sense to combine it with a schema. This is the case with XML + XML Schema and JSON + Swagger/RAML. The schema can provide additional restrictions on what values are allowed for a given field, what fields to expect, etc. RION does not have any schema mechanism at this point, but it is under consideration.

More Design Goals

To keep this introductory article on RION short, I have left out some of the “less significant” design goals for RION. You can see the full list of design goals for RION here:

http://tutorials.jenkov.com/rion/rion-design-goals.html

RION vs. Other Data Formats

In this section, I will try to give you a quick overview of how RION is different from many of the other popular data formats in use today. Please keep in mind though, that the full overview requires deep knowledge of data formats. Therefore, there is a limit to the depth of detail I can get into in this article.

First of all, being a binary data format sets RION apart from CSV, XML, JSON, and YAML. Being binary means that RION is typically more compact than these formats and faster to read and write. On average, RION is about 10 to 33% more compact than JSON, and if used for tabular data, the difference can exceed 50%. This compactness difference also translates into similar read/write speed differences.

Textual data formats do have the advantage of being easier to read and edit in a text editor, but we intend to address that by having a textual representation of RION (TION), which you can convert to and from RION. This should decrease the human visibility/editability problem.

RION can contain multiple fields at the root level of a file. That sets RION apart from XML and JSON, which can only contain a single element at the root level of a file. This makes RION easier to use for stream data structures like log files and data files that are appended to continuously.

RION uses a self-describing binary encoding that is similar to that of MessagePack, CBOR, and Amazon’s ION. However, RION is different in a few, subtle, but significant ways. First of all, RION is the only one of these data formats to specify an efficient encoding of tabular data. Second, RION is slightly easier to navigate arbitrarily in its binary form when navigating composite data structures, like object graphs. Third, RION will soon be able to represent cyclic object graphs. Neither MessagePack, CBOR, nor Amazon’s ION can currently do that.

Other than that, RION is about as compact as MessagePack, CBOR, and Amazon’s ION, and will be approximately as fast to read and write, with tabular data being the exception where RION excels.

ProtoBuf, ASN.1, and Avro all use data encodings that require a schema to be able to parse them. In other words, they are not self-describing. A non-self-describing data format can be made slightly more compact than a self-describing data format in some situations, but the difference is small. Data formats requiring a schema can be a bit more cumbersome to work with though, so it is a tradeoff.

For a more detailed overview, you can see our own comparison page, which we update from time to time. There are some data formats missing, like Amazon’s ION, YAML and XML, but we will add them eventually:

http://tutorials.jenkov.com/rion/rion-vs-other-formats.html

Performance-wise, our measurements have shown RION to be able to match the speeds of MessagePack and CBOR, while getting close to that of ProtoBuf. All our benchmarks were based on serializing and deserializing objects, though. If you work with RION directly in its binary form, and/or only parse parts of it, you can greatly increase speed. Our benchmarks are a bit old, so we will need to re-do them soon. But here they are, so far:

http://tutorials.jenkov.com/rion/rion-performance-benchmarks.html

Summary and Further Details

All in all, we believe RION is one of the best all-round data formats available today. It offers fast, efficient binary encoding and a versatile and flexible set of field types, and it is designed to be either read and written directly in its binary form (for speed) or serialized to and deserialized from objects. In terms of speed, it will match most popular data formats and can even exceed them with tabular data, or if reading RION directly in its binary form.

We (Nanosai) have spent quite some time analyzing RION and comparing it to other data formats, and we believe the current encoding provides a good base from which to expand. We still have a few corners to attend to (like References), but we expect to address most of those in 2020. We will post again as RION becomes more and more complete.

We are using RION as message encoding for our network protocol IAP which is an alternative to HTTP for efficient and flexible network communication at the application layer. IAP is still in progress, but the basics are already well defined.

We are also using RION as record format in Stream Ops for Java - our embeddable data streaming engine. Stream Ops is like a “Kafka light”. Stream Ops is at proof of concept stage at this point, but we expect to improve it throughout 2020.

Because Stream Ops uses RION we can achieve some pretty good record processing throughput speeds. Right now we have been able to squeeze it to 19.5 million records per second (small records) on a developer laptop. We hope to see much higher numbers on data center grade hardware. Long term we are actually shooting for 1 billion records per second, as you can read in our 1BRS challenge.

Currently, our open source toolkits for RION are implemented in Java, but once they stabilize, we plan to expand to other languages too. Probably the performance languages first, like C#, C/C++, and then perhaps Python, since it is used a lot for data science. But that is still to be decided.

So, if you are in the market for a fast, compact, versatile all-round data format, whether you are doing high-performance microservices, data science, or event-driven architecture with Kafka, Pulsar, Hazelcast, etc., it might be worthwhile for you to give RION a look.

More Information About RION

Raw Internet Object Notation (RION) is our current working title for this data format. We started out with the name ION, but a year later Amazon released a data format named ION which they had been using internally, so we switched to RION. We might change the name again in the future, but the data format itself is reasonably stable by now.

For those interested in all the details of RION, check out the RION tutorial:

http://tutorials.jenkov.com/rion/index.html

We are also developing an open source toolkit for working with RION in Java called RION Ops for Java. You can find RION Ops for Java here:

https://github.com/nanosai/rion-ops-java

Opinions expressed by DZone contributors are their own.
