Exploring the Fundamentals of Binary Serialized Data Structures
In this article, we'll study how to read and parse binary serialized data structures so that data stored in uncommon or legacy formats can be used effectively and efficiently.
I usually solve problems by letting them devour me.
There are a great many binary formats that data might live in. Everything very popular has grown good open-source libraries, but you may encounter some legacy or in-house format for which this is not true. Good general advice is that unless there is an ongoing and/or performance-sensitive need for processing an unusual format, try to leverage existing parsers. Custom formats can be tricky, and if one is uncommon, it is as likely as not also to be under-documented.
This article is an excerpt from the book, Cleaning Data for Effective Data Science—a comprehensive guide for data scientists to master effective data cleaning tools and techniques in a language-agnostic manner.
If an existing tool is available only in a language you do not wish to use for your main data science work, see whether it can nonetheless be leveraged simply as a means to export to a more easily accessed format. A fire-and-forget tool might be all you need, even if it is one that runs recurringly, but asynchronously with the actual data processing you need to perform.
For this article section, let us assume that the optimistic situation is not realized, and we have nothing beyond some bytes on disk, and some possibly flawed documentation to work with. Writing the custom code is much more the job of a systems engineer than a data scientist, but we data scientists need to be polymaths, and we should not be daunted by writing a little bit of systems code.
Here, we look at a simple and straightforward binary format. Moreover, this is a real-world data format for which we do not actually need a custom parser. Having an actual well-tested, performant, and bullet-proof parser to compare our toy code with is a good way to make sure we do the right thing. Specifically, we will read data stored in the NumPy NPY format, which is documented as follows (abridged):
- The first 6 bytes are a magic string: exactly `\x93NUMPY`.
- The next 1 byte is an unsigned byte: the major version number of the file format, e.g. `\x01`.
- The next 1 byte is an unsigned byte: the minor version number of the file format, e.g. `\x00`.
- The next 2 bytes form a little-endian unsigned short int: the length of the header data HEADER_LEN.
- The next HEADER_LEN bytes are an ASCII string that contains a Python literal expression of a dictionary.
- Following the header comes the array data.
First, we read in some binary data with the standard reader, using Python and NumPy, to understand what type of object we are trying to reconstruct. It turns out that the serialization was of a 3-dimensional array of 64-bit floating-point values. A small size was chosen for this section, but of course, real-world data will generally be much larger:
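As a minimal sketch, we can generate such a file ourselves and inspect it with the standard reader (the filename `example.npy`, the array shape, and its values are invented here purely for illustration):

```python
import numpy as np

# Create a small 3-D array of 64-bit floats and serialize it in NPY format.
# The shape and values are arbitrary, chosen only for illustration.
arr = np.arange(12, dtype=np.float64).reshape(1, 2, 6)
np.save('example.npy', arr)

# The standard reader tells us what we will try to reconstruct by hand.
loaded = np.load('example.npy')
print(loaded.dtype, loaded.shape)   # float64 (1, 2, 6)
```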
Visually examining the bytes is a good way to have a better feel for what is going on with the data. NumPy is, of course, a clearly and correctly documented project, but for some hypothetical format, this is an opportunity to potentially identify problems with the documentation not matching the actual bytes. More subtle issues may arise in the more detailed parsing; for example, the meaning of bytes in a particular location can be contingent on flags occurring elsewhere. Data science is, in surprisingly large part, a matter of eyeballing data:
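One simple way to eyeball the raw bytes is to read them directly and print their repr; the header region of an NPY file is mostly printable ASCII (the sample file generated below with `np.save` is an invented stand-in for whatever file you are examining):

```python
import numpy as np

# Generate a sample NPY file to examine (illustrative only).
np.save('example.npy', np.arange(12, dtype=np.float64).reshape(1, 2, 6))

# Read the raw bytes; the magic string and ASCII header are visible up front.
with open('example.npy', 'rb') as f:
    raw = f.read()
print(raw[:64])
```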
As a first step, let us make sure the file really does match the type we expect in having the correct "magic string." Many kinds of files are identified by a characteristic and distinctive first few bytes. In fact, the common utility on Unix-like systems, `file`, uses exactly this knowledge via a database describing many file types. For a hypothetical rare file type (i.e. not NumPy), this utility may not know about the format; nonetheless, the file might still have such a header:
With that, let us open a file handle for the file and proceed with trying to parse it according to its specification. For this, in Python, we will simply open the file in bytes mode, so as not to convert to text, and read successive segments of the file to verify or process them. This format can be processed strictly sequentially, but in other cases it might be necessary to seek to particular byte positions within the file. The Python `struct` module will allow us to parse basic numeric types from bytestrings, and the `ast` module will let us create Python data structures from raw strings without the security risk that `eval()` carries:
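A sketch of this step, again against an invented sample file; `struct` and `ast` are imported here for the parsing steps that follow:

```python
import ast     # for safely evaluating the header dictionary later
import struct  # for unpacking binary numeric types later
import numpy as np

# Generate a sample NPY file (illustrative only).
np.save('example.npy', np.arange(12, dtype=np.float64).reshape(1, 2, 6))

# Open in bytes mode and consume the file strictly sequentially.
fh = open('example.npy', 'rb')
assert fh.read(6) == b'\x93NUMPY'   # magic string
major = fh.read(1)[0]               # unsigned byte: major format version
minor = fh.read(1)[0]               # unsigned byte: minor format version
print(major, minor)
fh.close()
```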
Next, we need to determine how long the header is, and then read it in. The header is always ASCII in NPY version 1, but may be UTF-8 in version 3. Since ASCII is a subset of UTF-8, decoding does no harm even if we do not check the version:
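In code, this amounts to unpacking a little-endian unsigned short (`'<H'` in `struct` notation) and then evaluating the header's dictionary literal with `ast.literal_eval` (the sample file is invented for illustration):

```python
import ast
import struct
import numpy as np

# Generate a sample NPY file (illustrative only).
np.save('example.npy', np.arange(12, dtype=np.float64).reshape(1, 2, 6))

with open('example.npy', 'rb') as fh:
    fh.read(6)                                    # skip magic string
    fh.read(2)                                    # skip version bytes
    # '<H' is a little-endian unsigned short, per the NPY spec.
    header_len, = struct.unpack('<H', fh.read(2))
    # The header is a Python dict literal, padded with spaces and a newline;
    # literal_eval parses it safely, unlike eval().
    header = ast.literal_eval(fh.read(header_len).decode('utf-8'))

print(header)
```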
While this dictionary stored in the header gives a nice description of the dtype, value order, and the shape, the convention NumPy uses for value types differs from the one used in the `struct` module. We can define a (partial) mapping to obtain the correct spelling of the data type for the reader. We define this mapping only for some data types encoded as little-endian; the big-endian versions would simply use a greater-than sign instead. The key `'fortran_order'` indicates whether the fastest or slowest varying dimension is contiguous in memory. Most systems use "C order" instead.
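Such a partial mapping from NumPy `descr` spellings to `struct` format codes might look like the following; the selection of types here is illustrative, not exhaustive:

```python
import struct

# Map NumPy 'descr' strings to struct format codes (little-endian only;
# big-endian variants would use '>' instead of '<').
dtype_map = {
    '<f8': '<d',   # 64-bit float
    '<f4': '<f',   # 32-bit float
    '<i8': '<q',   # 64-bit signed int
    '<i4': '<i',   # 32-bit signed int
    '<u1': '<B',   # unsigned byte
}

# Sanity check: the struct sizes match the NumPy itemsizes.
print(struct.calcsize(dtype_map['<f8']))   # 8
```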
We are not aiming for high efficiency here, but to minimize code. Therefore, I will expediently read the actual data into a simple list of values first, and then later convert that to a NumPy array:
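A minimal sketch of that expedient approach, using `struct.iter_unpack` to pull one value at a time from the remaining bytes (the sample file and the single-entry `dtype_map` are invented for illustration):

```python
import ast
import struct
import numpy as np

# Generate a sample NPY file (illustrative only).
np.save('example.npy', np.arange(12, dtype=np.float64).reshape(1, 2, 6))

dtype_map = {'<f8': '<d'}   # partial mapping, as discussed above

with open('example.npy', 'rb') as fh:
    fh.read(8)                                    # magic + version bytes
    header_len, = struct.unpack('<H', fh.read(2))
    header = ast.literal_eval(fh.read(header_len).decode('utf-8'))
    fmt = dtype_map[header['descr']]
    # Inefficient but simple: collect every value into a plain Python list.
    values = [v for (v,) in struct.iter_unpack(fmt, fh.read())]

print(values[:4])   # [0.0, 1.0, 2.0, 3.0]
```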
Let us now convert the raw values into an actual NumPy array of appropriate shape and dtype. We will also look for whether to use Fortran- or C-order in memory:
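Putting the steps together, a toy end-to-end parser might look as follows; comparing its output against `np.load` verifies that our hand-rolled reader does the right thing (the sample file is invented for illustration):

```python
import ast
import struct
import numpy as np

# Generate a sample NPY file (illustrative only).
np.save('example.npy', np.arange(12, dtype=np.float64).reshape(1, 2, 6))

dtype_map = {'<f8': '<d'}   # partial mapping, as discussed above

with open('example.npy', 'rb') as fh:
    assert fh.read(6) == b'\x93NUMPY'             # magic string
    fh.read(2)                                    # version bytes
    header_len, = struct.unpack('<H', fh.read(2))
    header = ast.literal_eval(fh.read(header_len).decode('utf-8'))
    fmt = dtype_map[header['descr']]
    values = [v for (v,) in struct.iter_unpack(fmt, fh.read())]

# 'F' makes the fastest-varying dimension contiguous; 'C' the slowest.
order = 'F' if header['fortran_order'] else 'C'
arr = np.array(values, dtype=header['descr']).reshape(header['shape'], order=order)

# The reference reader confirms our toy parser round-trips correctly.
assert np.array_equal(arr, np.load('example.npy'))
```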
Just as binary data can be oddball, so can text.
In this article, we dived into the details of binary serialization: when to prefer existing libraries, how to work with raw bytes and the struct module, and how data is laid out at byte offsets within a file.
David Mertz, Ph.D. is the founder of KDM Training, a partnership dedicated to educating developers and data scientists in machine learning and scientific computing. He created a data science training program for Anaconda Inc. and was a senior trainer for them. With the advent of deep neural networks, he has turned to training our robot overlords as well.
He previously worked for 8 years with D. E. Shaw Research and was also a Director of the Python Software Foundation for 6 years. David remains co-chair of its Trademarks Committee and Scientific Python Working Group. His columns, Charming Python and XML Matters, were once the most widely read articles in the Python world.
Opinions expressed by DZone contributors are their own.