Looking Beyond JSON
In a world where even services are becoming micro, I thought of writing a micro story.
The trigger point for this article is the unquestionably dominant usage of JSON. You are looked upon with suspicion if you propose even evaluating an alternative. I have been amazed by the faith we have put in JSON across many architectural solutions.
There are definitely valid reasons for that faith, but I feel we should always know our reasons for it. In this series, I have tried to identify the reasons for JSON's emergence and the reasons for alternatives.
Beyond JSON — Flashback
Before the need for and emergence of distributed systems, we only interacted with APIs in the same memory space (the genuinely monolithic apps, not what today's vocabulary labels monolithic to justify and define microservices). With the advent of distributed applications, we started delving into mechanisms for performing remote calls. The first thing was to identify the protocol, and HTTP over TCP/IP became the immediate choice for obvious reasons. HTTP, as an application-layer protocol, solved the purpose of connecting two applications over a network. Still, something was missing. It's as if two people each had a mobile phone in hand but still needed a common language to talk to each other.
This led to the emergence of RPC (Remote Procedure Call) around 1981, when we can see traces of specifications emerging. It is based on extending conventional local procedure calling so that the called procedure need not exist in the same address space as the calling procedure. This clearly enabled distributed application development. But RPC had its own negatives: marshaling and unmarshaling of data depended on the language, OS, etc., which led to divergent implementations, and it was a costly operation as well.
This led to new IDL-based specifications emerging by 1991 in the form of CORBA 1.0.
So specific languages started coming up with their own enhanced RPC implementations, like RMI (Java) and DCOM (Microsoft). But this posed a fundamental problem of technology lock-in for solution providers, so it was not a very long-lived solution. It resulted in CORBA 2.x, with CORBA adopting the advancements of native implementations like RMI and DCOM.
Part of the world was busy building object brokers to make distributed computing more technology-independent, while processing power was seeing exponential growth alongside the spread of distributed systems.
Beyond JSON — The Emergence
In the late nineties, there was a greater push to find the right technology for distributed systems. The biggest challenge was data exchange between systems. Meanwhile, something was happening in parallel: the emergence of the WWW.
In 1989, Tim Berners-Lee wrote a memo proposing an Internet-based hypertext system. Berners-Lee specified HTML and wrote the browser and server software in late 1990. Browsers became the interpreters for HTML, and as early as 1990 we saw the success of one of the first truly distributed systems based on runtime interpretation. Soon people started analyzing the possibility of HTML-like markup as a data exchange format. This led to the emergence of XML, which was about to change the way we think about distributed systems.
Everyone was sure about the success and there was a need to standardize the specifications for cross-platform and application data exchange. XML 1.0 specifications came around 1998. The design goals of XML emphasize simplicity, generality, and usability across the Internet. It is a textual data format with strong support via Unicode for different human languages.
A flood of application programming interfaces (APIs) for parsing XML appeared in various languages, which helped XML gain acceptance and adoption as the de facto data exchange format in the late 1990s and early 2000s.
XML became the desired format not only for complex data structures in web services but also an increasingly popular format for documents (RSS, ATOM, SOAP, SVG, and XHTML). What we usually ignore is the fact that XML is more than a data format: XML is a language. Below are the features of XML as a language that make it highly usable in most scenarios:
1. XPath

XPath (XML Path Language) is a query language for selecting nodes from an XML document. In addition, XPath may be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document. The World Wide Web Consortium (W3C) defined XPath.
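As a minimal sketch of XPath-style node selection, Python's standard library supports a limited subset of XPath expressions through `ElementTree` (full XPath 1.0 would need a third-party library such as lxml; the catalog document here is invented for illustration):

```python
import xml.etree.ElementTree as ET

# A small invented XML document to query.
doc = ET.fromstring("""
<catalog>
  <book genre="tech"><title>Learning XML</title></book>
  <book genre="fiction"><title>Dune</title></book>
</catalog>
""")

# Select all <title> nodes anywhere under the root.
titles = [t.text for t in doc.findall(".//title")]

# Select <book> nodes filtered by an attribute value.
tech = doc.findall(".//book[@genre='tech']")

print(titles)     # ['Learning XML', 'Dune']
print(len(tech))  # 1
```

The same expressions, evaluated by a full XPath engine, work against any conforming XML document regardless of the producing platform.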
2. Attributes and Namespaces
3. XML Schema
An XML schema is a language for expressing constraints about XML documents. There are several schema languages in widespread use, but the main ones are Document Type Definitions (DTDs), RELAX NG, Schematron, and W3C XSD (XML Schema Definition).
4. XSL

The term Extensible Stylesheet Language (XSL) refers to a family of languages used to transform and render XML documents.
As our software never stops evolving, something interesting was happening on the sidelines. Browsers existed to interpret HTML and present data to the end user. Netscape took the lead in 1994 with a browser that displayed static pages. They soon realized the power of the web, and there was a desire in the web development scene to remove the limitation of static pages, so in 1995 Netscape decided to add a scripting language to make pages dynamic.
Microsoft, which ruled the desktop computing world, came up with its own browser, Internet Explorer, and its own scripting language, JScript, in 1996. Microsoft's browser market share was about 95% by the early 2000s.
During the period of Internet Explorer dominance in the early 2000s, client-side scripting was stagnant. This started to change in 2004, when Netscape's successor, Mozilla, released the Firefox browser. Firefox was well received, taking significant market share from Internet Explorer. However, in 2005 Mozilla and Adobe tried standardizing ActionScript 3 as the basis of the ES4 implementation, but for lack of support from Microsoft it did not go anywhere.
Wait… Where is our hero JSON in the story?
It's time to introduce him…
The early 2000s was the time when XML ruled the world of data exchange. XML was the first platform-neutral data exchange mechanism adopted across the world. With the emergence of SOA, everyone needed a platform-neutral mechanism for services to interact. Before that, we still had services (server and client components) built with technologies like EJB, RMI, and COM. However, these were not platform-neutral data exchange mechanisms. The first ray of hope came with the SOAP protocol, with XML used for data exchange. Being platform-neutral was such a big advantage that as soon as libraries for marshaling and unmarshaling XML appeared, it captured the distributed middleware space.
From around 2001, the IT world was looking for a solution to exchange data between the browser and the application. Crockford and Morningstar decided they could abuse an HTML frame to send themselves data: they could point a frame at a URL that would return an HTML document carrying the payload. In effect, they were building AJAX applications well before the term 'AJAX' had been coined.
People say that XML is bulkier; I believe it is the same argument as 'convention over configuration'. If you follow convention strictly, you do not need bulky configuration. Another argument we always hear is that the various specifications and sub-languages built on XML, like RSS, SOAP, and SVG, make it confusing. Well, JSON is so simple and lean that it can only be JSON. And, thanks to the lessons learned from the state of XML, we never tried having variations of JSON; it just remains plain, simple JSON.
I somehow feel that two historical events made the space and set the platform for JSON to launch itself. First, the XHTML effort stalled in the early 2000s. Had it succeeded, it would have established XML further as the standard for data exchange, and parsing would have become far less error-prone, which was a major pain point for application developers. Second, and most importantly, the lack of support for ActionScript. ActionScript was a solution for a rich client using AJAX (conceptually) and performing binary data exchange. By 2005, the world had learned the advantages of open standards and the disadvantages of proprietary solutions. The stage was set for JSON to emerge.
Beyond JSON — The Dominance
By 2010, there were many JSON parsing libraries available, and JSON kept gaining popularity among data exchange formats, surpassing XML in popularity by around 2012.
Below are reasons that helped the cause.
Positives of JSON over XML
1. Easy to understand.
2. Simple structure — just key-value pairs and arrays.
3. Less verbose.
4. A wide range of libraries is available to parse it.
5. Faster to parse.
6. A really good data format for representing objects.
7. Most people (not just architects) know about it.
Negatives of XML
1. XML is large — it may carry a lot of meta-information along with the data, increasing the size of the overall structure and making it slower to transmit and parse.
2. Comparatively difficult to read.
3. Even empty information has a cost.
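The verbosity argument can be made concrete with a rough illustration (not a benchmark): the same invented record encoded as compact JSON and as an equivalent hand-written XML document, compared by byte size. Field names appear once per key in JSON but twice per element (opening and closing tag) in XML.

```python
import json

# An invented record used purely for the size comparison.
record = {"id": 42, "name": "Ada", "tags": ["math", "computing"]}

# Compact JSON encoding (no whitespace).
as_json = json.dumps(record, separators=(",", ":"))

# An equivalent hand-written XML document for the same data;
# every field name is repeated as an opening and a closing tag.
as_xml = (
    "<record><id>42</id><name>Ada</name>"
    "<tags><tag>math</tag><tag>computing</tag></tags></record>"
)

print(len(as_json), len(as_xml))  # JSON is noticeably smaller
```

The exact ratio depends on the data shape: deeply nested structures with short values amplify XML's tag overhead, while large text payloads shrink the relative difference.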
Still, just as food for thought, there is a reason XML remains the choice in certain areas beyond data exchange:
1. Application configuration.
2. Build configuration.
3. Data transformation.
4. System interoperability.
Ever wondered why configuration files never changed to JSON format?
So the question arises, is there any need to change from the data exchange perspective? Or more appropriately, is it already changing? The simple answer is, yes. Let’s see next why, what, and how.
Beyond JSON — Alternates
When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes (for example, a JSON document). Since a pointer would not make sense to any other process, this sequence-of-bytes representation looks quite different from the data structures that are normally used in memory.
Thus, we need some kind of translation between the two representations. The translation from the in-memory representation to a byte sequence is called encoding (also known as serialization or marshaling), and the reverse is called decoding (parsing, deserialization, unmarshalling).
Most programming languages have built-in libraries for encoding in-memory objects (Java: java.io.Serializable, Ruby: Marshal, etc.). But the issue with them is that the encoding is language-specific. It's generally a bad idea to use your language's built-in encoding for anything other than very transient purposes.
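Python's analog of java.io.Serializable and Ruby's Marshal is the pickle module; a minimal sketch shows both its convenience and its limitation — the resulting bytes are Python-specific, which is exactly why built-in encodings are a poor fit for cross-language exchange:

```python
import pickle

# An invented in-memory object, including a tuple that
# plain JSON could not represent directly.
state = {"cursor": (3, 14), "history": ["open", "edit"]}

blob = pickle.dumps(state)     # language-specific byte sequence
restored = pickle.loads(blob)  # only Python can read this back

assert restored == state
```

A Java or Go service receiving `blob` over the network would have no standard way to decode it, whereas a JSON or Protocol Buffers payload would be readable from any language with the right library.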
Moving to standardized encodings that can be written and read by many programming languages, JSON and XML are the obvious contenders. JSON is less verbose than XML, but both still use a lot of space compared to binary formats. This observation led to the development of binary encodings for JSON (MessagePack, BSON, BJSON, etc.). For data that is used only internally within your organization, you could choose a format that is more compact or faster to parse.
In 2005, a new variant of SOA was coined: the microservice. By 2012, people were adopting and experimenting with it, and by 2015 the architecture style had gained momentum. This is the biggest change that has prompted people to look for alternatives to JSON. The reasons for the alternatives are:
1. Every microservice should be capable of being deployed in a separate memory space.
2. The whole application is divided into small functions interacting with each other over a network.
3. We need a more efficient way to exchange data, as JSON still has a high overhead for encoding and decoding.
Everyone looked back to binary data exchange to leverage its benefits in speed, memory footprint, storage size, etc. The difference from the old days was that the community wanted a binary format that multiple languages could exchange, to avoid technology lock-in. Let's first look at a few of the binary data serialization formats.
1. Apache Thrift
Thrift is an interface definition language that is used to define and create services for numerous languages. It is used as a remote procedure call (RPC) framework and was developed at Facebook. Thrift's goal is 'to enable efficient and reliable communication across programming languages'. Solving many aspects of cross-platform services, it generates RPC code for clients and servers, providing a compact, deterministic, and versionable interchange protocol. Thrift is based on an RPC-style architecture with a binary data exchange format. So Thrift is a complete package, combining a web service architecture shift to RPC with the advantage of binary encoding (TBinaryProtocol and TCompactProtocol).
2. Protocol Buffers
Protocol Buffers is an encoding format from Google. Both Protocol Buffers and Thrift came about at around the same time and, not surprisingly, are very similar. Protocol Buffers (which has only one binary encoding format) does the bit packing slightly differently but is otherwise very similar to Thrift's CompactProtocol.
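The bit packing mentioned above centers on variable-length integers (varints), used by Protocol Buffers and, in a similar form, by Thrift's compact protocol. Each byte carries 7 bits of the integer, and the high bit signals whether more bytes follow. A minimal sketch:

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf-style varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F  # low 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set: more follows
        else:
            out.append(byte)         # final byte: high bit clear
            return bytes(out)

def decode_varint(data: bytes) -> int:
    """Decode a varint back into an integer."""
    n = 0
    for shift, byte in enumerate(data):
        n |= (byte & 0x7F) << (7 * shift)
    return n

# 300 fits in two bytes instead of a fixed-width four or eight.
encoded = encode_varint(300)
print(encoded.hex())  # ac02
assert decode_varint(encoded) == 300
```

Small numbers, which dominate most real payloads, take a single byte; only large values pay for extra bytes. (Real Protocol Buffers additionally zigzag-encodes signed integers, which this sketch omits.)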
3. Apache Avro
Avro is a row-oriented remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and protocols and serializes data in a compact binary format. Avro is one of the most compact binary encoding formats because encoding simply consists of values concatenated together. A string is just a length prefix followed by UTF-8 bytes, but there’s nothing in the encoded data that tells you that it is a string. To parse the binary data, you go through the fields in the order that they appear in the schema and use the schema to tell you the datatype of each field.
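The schema-dependence described above can be sketched for the string case: a string is just a length prefix followed by UTF-8 bytes, with no type tag in the data itself, so the reader must already know from the schema that a string comes next. (Real Avro encodes the length as a zigzag varint; a single plain byte is used here to keep the sketch minimal, so it only handles strings shorter than 256 bytes.)

```python
def encode_string(s: str) -> bytes:
    """Length prefix + UTF-8 bytes, no type tag (Avro-style)."""
    body = s.encode("utf-8")
    return bytes([len(body)]) + body  # assumes len(body) < 256

def decode_string(data: bytes) -> str:
    """Decode only works because the schema says 'a string is next'."""
    length = data[0]
    return data[1 : 1 + length].decode("utf-8")

encoded = encode_string("avro")
print(encoded)  # b'\x04avro'
assert decode_string(encoded) == "avro"
```

This is why Avro payloads are so compact, and also why reader and writer must agree on (or negotiate) the schema before any data can be interpreted.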
Apache Avro is used in Apache Kafka and Apache Hadoop. If you see, both systems are heavy traffic and heavy volume systems.
4. BSON

BSON is a computer data interchange format used mainly as a data storage and network transfer format in the MongoDB database. It is a binary form for representing simple data structures and associative arrays (called objects or documents in MongoDB). BSON has a huge number of implementations. Compared to JSON, BSON is designed to be efficient both in storage space and scan speed. Its key advantage is traversability, which makes it suitable for storage purposes but comes at the cost of over-the-wire encoding size.
5. MessagePack

MessagePack is a compact binary representation of JSON. Compared to BSON, MessagePack is more space-efficient. BSON is designed for fast in-memory manipulation, whereas MessagePack is designed for efficient transmission over the wire. The Protocol Buffers format aims to be compact and is on par with MessagePack. However, while JSON and MessagePack aim to serialize arbitrary data structures with type tags, Protocol Buffers requires a schema to define the data types.
So the conclusion of this long story is that we have gone through an evolution, making the choices that made the best sense for our solution architectures. In the past few years, with the evolution of deployment architecture (cloud and containerization), application architecture (microservices), and API design (REST to RPC), we need to think through the choice of data interchange encoding rather than being over-obsessed with JSON.
Opinions expressed by DZone contributors are their own.