
So Much Data, So Many Formats: A Conversion Service, Part 1




Data is a core resource for many activities. One important challenge in handling data is storing it in the right way. We need to choose a format that makes it easy to solve the problem at hand. When multiple problems are solved using the same data, the same data may have to be provided in different formats. Different actors, whether people or programs, could use the same piece of data, and they might prefer or need different formats. In that case, we need to convert the data between them.

For example, you may have a program that generates some data in an XML file, but you need to process that same data using an API that expects JSON. Or you need to share some data your department produces with other departments. The problem is that your colleagues from Department A want the data in a certain format, while the ones in Department B insist you provide the data in another format.

So you need to convert between different formats. It is an issue that is common and usually solvable, but it is a repetitive task with a few subtleties to pay attention to.

As engineers, we know that all boring, error-prone tasks are just begging for a smart solution. Let's try to provide one.

Handling Data in Multiple Formats

Converting between different formats can be tricky for a few reasons:

  • Parsing a new format can be hard if there is no ready-to-use library.
  • If you want to support several formats, the number of possible conversions explodes rapidly. For example, if you support four formats, you could have 12 different conversions to handle (one for each ordered pair of distinct formats). If you then add just one new format, you now have 20 different conversions to support.
  • Different formats could handle the same type of data in slightly different ways. For instance, a string is always between double quotes (") in JSON format, but it may or may not be between double quotes in CSV format. In different formats, the rule for escaping characters could be similar but slightly different. Getting all these details exactly right requires a significant amount of work.
  • Not all formats are compatible with each other. For example, you can convert CSV data to JSON, but vice versa is not always possible. This is the case because CSV files are intended to represent a list of homogeneous elements. JSON, instead, could contain other data structures, like collections of heterogeneous elements.

These are the problems. Now, imagine that you can create a generic service that can convert between different formats. This approach gets you a few advantages over doing an ad-hoc conversion every time.

Once you create a service to convert between different formats you have an assembly line that permits you to move from one format to another, abstracting the simpler but tricky details. You could also configure this assembly line of yours to do all sorts of useful things, such as:

  • Enforcing specific rules in the output produced. As an example, you may want to write all numbers with a specific number of decimal digits.
  • Combining data that is spread across multiple files in format A into a single file in format B, or vice versa.
  • Applying filtering rules to the data (e.g., considering only certain rows in a CSV file for the conversion to JSON).

In this article, we are going to create only a simple version of this service, but this can give you an idea of what is possible.

The Design

We will create a simple REST API that will be used to receive files and convert them to a specified format. To limit the number of conversions we need to handle, we are going to rely on an internal intermediate representation. In this way, we will need to write conversions only to and from the intermediate representation. So, for instance, we are not going to convert directly from CSV to JSON. Instead, we are going to convert from CSV to our internal data structure and from our internal data structure to JSON.

This approach allows more flexibility because, to implement a new format, we only need to interact with our intermediate representation and not any specific format.
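The arithmetic behind this choice can be made explicit. The following is a minimal sketch, not part of the service itself; the class and method names are hypothetical and only illustrate how the number of converters grows in the two designs.

```csharp
using System;

// Hypothetical helper, used only to illustrate the counting argument.
public static class ConversionCount
{
    // Direct approach: one converter for every ordered pair of distinct formats.
    public static int Direct(int formats) => formats * (formats - 1);

    // Intermediate representation: one reader and one writer per format.
    public static int ViaIntermediate(int formats) => 2 * formats;
}
```

With four formats, the direct approach needs 12 converters while the intermediate representation needs 8. With five formats the numbers are 20 and 10, and the gap keeps widening as formats are added.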

We are not going to do it in this tutorial, but another benefit is that, by using an intermediate representation, we could easily merge different data sources in one output file.

So, the workflow is as follows:

  • The set of input files (typically just one) gets converted into the generic data structure.
  • The generic data structure is converted to the requested output format (e.g., JSON), producing one or more output files (typically one).
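The two steps above could be sketched roughly as follows. This is only an illustration of the flow under simplifying assumptions: every name in it (IFormatReader, IFormatWriter, ConversionPipeline, and the toy "lines" and "pipes" formats) is hypothetical, and a plain list of strings stands in for the real intermediate data structure.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical interfaces: one reader and one writer per supported format.
public interface IFormatReader { IList<string> Read(string input); }
public interface IFormatWriter { string Write(IList<string> items); }

// Toy input format: one item per line, blank lines ignored.
public class LinesReader : IFormatReader
{
    public IList<string> Read(string input) =>
        input.Split('\n').Select(s => s.Trim()).Where(s => s.Length > 0).ToList();
}

// Toy output format: items separated by a pipe character.
public class PipesWriter : IFormatWriter
{
    public string Write(IList<string> items) => string.Join("|", items);
}

public class ConversionPipeline
{
    private readonly Dictionary<string, IFormatReader> _readers = new Dictionary<string, IFormatReader>();
    private readonly Dictionary<string, IFormatWriter> _writers = new Dictionary<string, IFormatWriter>();

    public void RegisterReader(string format, IFormatReader reader) => _readers[format] = reader;
    public void RegisterWriter(string format, IFormatWriter writer) => _writers[format] = writer;

    public string Convert(string from, string to, string input)
    {
        // Step 1: input format -> intermediate representation.
        IList<string> intermediate = _readers[from].Read(input);
        // Step 2: intermediate representation -> output format.
        return _writers[to].Write(intermediate);
    }
}
```

Supporting a new format in a design like this means registering one more reader and one more writer, without touching the existing ones.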

Our internal data format is quite simple. At its base, there is DataItem, which represents a generic element of our data. The following classes build on it:

  • DataArray represents a list of values (e.g., ["one", "two"] but also [ {"name": "john"}, {"name": "mike"} ]).
  • DataObject represents a series of pairs of a field name and a value (e.g., {"name": "john"} but also {"names": ["mike", "pike", "kite"]}).
  • DataValue contains an individual value (e.g., 5, "john"). These will be booleans, numbers, and strings.

So, the complex classes, DataArray and DataObject, can contain other elements and essentially allow a tree-like organization.

A Note on a Physical Intermediate Format

Now, you may ask yourself, it is one thing to create a custom representation in memory, but should we also create a custom format? Well, we do not need it in our tutorial. However, if we were building a production system, depending on the requirements, we may want to create a custom format, i.e., a specially designed way to efficiently store the data for our purposes.

This physical intermediate format could be useful if, for some reason, the conversion cannot be implemented in a single process that shares memory. For example, if we want different executables or different web services to perform the parsing and the serialization of the desired output, then we need to make these components communicate. In this case, they may need a physical intermediate format.

You may ask: isn't XML perfectly fine for representing arbitrary data structures? Well, it is true that for the cases considered in a simple tutorial like this one, JSON or XML would have worked fine as an intermediate format. However, they might not be suitable or optimal for representing all the formats and features of our hypothetical service.

Different formats are designed for different things: the same image might be represented in different formats, but the resulting file will have different characteristics (e.g., a JPEG of a photograph will usually be smaller than a PNG of the same image). By designing our custom representation, we can better control the process, avoid any quirks of a specific existing format, and save any data in an optimal way for our service.

For example, we can have a format designed to easily handle transformations on data (e.g., by storing the different operations made on the data). Designing a custom format does not necessarily mean messing with bytes: OpenDocument is a bunch of compressed files with data stored in XML files with specific attributes and values.

And that is it for the design considerations. Let's see some code.

Setting Up the Project

We are going to create a new ASP.NET Web API Project using the command line dotnet program.

dotnet new webapi

Then we are going to add the necessary packages to deal with JSON and parsing.

dotnet add package Newtonsoft.Json
dotnet add package Antlr4.Runtime.Standard
dotnet add package Microsoft.CodeAnalysis.CSharp.Scripting

Of course, we are going to use ANTLR to parse the files. Since we are using Visual Studio Code, we also set up the awesome Visual Studio Code extension to automatically generate an ANTLR parser whenever we save our grammar. Just put these values in your settings for the project.

{
    "antlr4.generation": {        
        "language": "CSharp",
        "listeners": false,
        "visitors": false,
        "outputDir": "../",
        "package": "ParsingServices.Parsers"
    }
}

If you use another editor, you need to give the ANTLR tool the right options:

  • We create the parser inside the namespace (package in Java terminology) ParsingServices.Parsers
  • We generate C# code.
  • We create neither listeners nor visitors.
  • We generate the parsers in the directory above the grammar.

If you do not know how to use ANTLR, we have written plenty of tutorials: you can read Getting started with ANTLR in C# if you want a short introduction, or our ANTLR Mega Tutorial if you want to know a lot.

These options mean that our grammars will be inside an antlr/grammars folder, while the parsers will be generated inside the antlr folder. This makes for a clean structure, separating grammars from generated code.

Grammars

Speaking of ANTLR grammars: we have two, one for JSON and the other for CSV. Both are taken from the repository of ANTLR grammars, but we have modified them for clarity and consistency.

grammar CSV;

csv     : hdr row+ ;
hdr     : row ;

row     : field (',' field)* '\r'? '\n' ;

field   : TEXT
        | STRING
        |
        ;

TEXT    : ~[,\n\r"]+ ;
STRING  : '"' ('""'|~'"')* '"' ; // quote-quote is an escaped quote
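The STRING rule matches the whole quoted field, including the surrounding quotes and any doubled quotes inside it. Code that consumes the token therefore has to undo that quoting at some point. Here is a minimal sketch of how that could look; CsvField.Unquote is a hypothetical helper, not something taken from the project.

```csharp
using System;

// Hypothetical helper showing how the text of a STRING token could be cleaned up.
public static class CsvField
{
    public static string Unquote(string field)
    {
        // Only fields wrapped in double quotes use the quote-quote escape.
        if (field.Length >= 2 && field[0] == '"' && field[field.Length - 1] == '"')
        {
            string inner = field.Substring(1, field.Length - 2);
            // A doubled quote inside a quoted field stands for one literal quote.
            return inner.Replace("\"\"", "\"");
        }

        // TEXT tokens carry no quoting, so they are returned unchanged.
        return field;
    }
}
```

For example, the field "say ""hi""" (quotes included) becomes say "hi", while an unquoted field such as plain is left as it is.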

The JSON grammar is also shaped to facilitate our job in the rest of the program: we created a distinct case to differentiate simple values (e.g., "number" : 5) from complex values (e.g., "numbers" : [5, 3] or "number" : { "value": 1, "text": "one" }).

grammar JSON;

json                    : complex_value ;

obj                     : '{' pair (',' pair)* '}'
                        | '{' '}'
                        ;

pair                    : STRING ':' (value | complex_value) ;

array                   : '[' composite_value (',' composite_value)* ']'
                        | '[' ']'
                        ;

composite_value         : value
                        | complex_value
                        ;

value                   : TRUE
                        | FALSE
                        | NULL
                        | STRING
                        | NUMBER
                        ;

complex_value           : obj
                        | array
                        ;

TRUE                    : 'true' ;
FALSE                   : 'false' ;
NULL                    : 'null' ;

STRING                  : '"' (ESC | SAFECODEPOINT)* '"' ;

fragment ESC            : '\\' (["\\/bfnrt] | UNICODE) ;

fragment UNICODE        : 'u' HEX HEX HEX HEX ;

fragment HEX            : [0-9a-fA-F] ;

fragment SAFECODEPOINT  : ~ ["\\\u0000-\u001F] ;

NUMBER                  : '-'? INT ('.' [0-9] +)? EXP? ;

fragment INT            : '0' | [1-9] [0-9]* ; // no leading zeros

fragment EXP            : [Ee] [+\-]? INT ; // \- since - means "range" inside [...]

WS                      : [ \t\n\r] + -> skip ;

The Data Classes

Before seeing how the conversion is implemented, let's take a look at the Data* classes that form the structure of our data format. We already explained their general design before, so here we are mostly looking at the code.

public class DataItem { }

public class DataValue : DataItem
{        
    public string Text { get; set; } = "";   
    public ValueFormat Format { get; set; } = ValueFormat.NoValue;     
}

public class DataField : DataItem
{
    public string Text { get; set; } = ""; 
    public DataItem Value { get; set; } = null;
}

public class DataObject : DataItem
{     
    public IList<DataField> Fields { get; set; } = new List<DataField>();        

    [..]
}

public class DataArray : DataItem
{
    public IList<DataItem> Values { get; set; } = new List<DataItem>();

    [..]
}

We removed the methods because they are not needed to understand how the classes are connected. As you can see, the classes are quite intuitive and probably look the way you expected them to.
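To see how these classes fit together as a tree, here is a small sketch that builds the structure for {"names": ["mike", "pike", "kite"]} by hand. The class definitions are minimal copies of the ones above, without methods and without the Format property, so that the example stands on its own. DataTreeExample and its two methods are hypothetical and exist only for this illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Minimal copies of the article's classes (methods and Format omitted).
public class DataItem { }
public class DataValue : DataItem { public string Text { get; set; } = ""; }
public class DataField : DataItem
{
    public string Text { get; set; } = "";
    public DataItem Value { get; set; } = null;
}
public class DataObject : DataItem
{
    public IList<DataField> Fields { get; set; } = new List<DataField>();
}
public class DataArray : DataItem
{
    public IList<DataItem> Values { get; set; } = new List<DataItem>();
}

// Hypothetical example class, not part of the project.
public static class DataTreeExample
{
    // Builds the tree for {"names": ["mike", "pike", "kite"]}.
    public static DataObject BuildNames()
    {
        var names = new DataArray();
        foreach (var name in new[] { "mike", "pike", "kite" })
            names.Values.Add(new DataValue { Text = name });

        var root = new DataObject();
        root.Fields.Add(new DataField { Text = "names", Value = names });
        return root;
    }

    // Counts the leaf values by walking the tree recursively.
    public static int CountValues(DataItem item)
    {
        switch (item)
        {
            case DataValue _:
                return 1;
            case DataField field:
                return CountValues(field.Value);
            case DataArray array:
                return array.Values.Sum(v => CountValues(v));
            case DataObject obj:
                return obj.Fields.Sum(f => CountValues(f));
            default:
                return 0; // null or a bare DataItem
        }
    }
}
```

The recursive walk works the same way however deeply arrays and objects are nested, which is the practical benefit of the common DataItem base class.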

public enum ValueFormat
{
    Bool,
    Integer,
    Numeric,
    String,
    NoValue
}

ValueFormat is an enum that is supposed to represent the actual type of the data. That is because we treat every value as a string to simplify the input and output phase since a string can accept any type of input. But we know that, in actuality, there are different types of data. So, we try to understand the different formats here.

public class DataValue : DataItem
{        
    private string _text = "";
    public string Text
    {
        get {
            return _text;
        }
        set {
            Format = DataValue.DetermineFormat(value);

            _text = DataValue.PrepareValue(Format, value);                
        }
    }
    public ValueFormat Format { get; private set; } = ValueFormat.String;     

    // Tries the most specific types first and falls back to String.
    private static ValueFormat DetermineFormat(string text)
    {
        // Ignore surrounding whitespace and double quotes before testing the content
        text = text.Trim().Trim('"');

        // Note: these TryParse overloads use the current culture
        if (int.TryParse(text, out int intNum))
            return ValueFormat.Integer;

        if (double.TryParse(text, out double realNum))
            return ValueFormat.Numeric;

        // Accepts "true" and "false" in any casing
        if (bool.TryParse(text, out bool boolean))
            return ValueFormat.Bool;

        // Nothing else matched, so we treat the value as a string
        return ValueFormat.String;
    }
}

To understand the real type of the data that we are managing, we try to parse each value until we find a match. If there is no match for any type, it means that we have a string. We need to find out the real type because each type can be represented differently in a specific format. For instance, in JSON a number can be written without the enclosing double quotes while a string always needs them. So, this information will be used when we output the data in a specific format.
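As a rough illustration of how the detected type could drive the output, the sketch below writes a single value as JSON. It is an assumption about one possible approach, not the code used later in this series: JsonValueWriter is a hypothetical name, mapping NoValue to null is our own choice, and the escaping covers only backslashes and double quotes. The enum is repeated so that the example is self-contained.

```csharp
using System;

// Same members as the article's enum, repeated to keep the sketch self-contained.
public enum ValueFormat { Bool, Integer, Numeric, String, NoValue }

// Hypothetical helper, not part of the project.
public static class JsonValueWriter
{
    public static string ToJson(string text, ValueFormat format)
    {
        switch (format)
        {
            case ValueFormat.Integer:
            case ValueFormat.Numeric:
                // Numbers are written without quotes.
                return text;
            case ValueFormat.Bool:
                // bool.TryParse accepts "True", but JSON requires lowercase.
                return text.ToLowerInvariant();
            case ValueFormat.NoValue:
                // Assumption: a missing value becomes a JSON null.
                return "null";
            default:
                // Strings are quoted; only \ and " are escaped in this sketch.
                return "\"" + text.Replace("\\", "\\\\").Replace("\"", "\\\"") + "\"";
        }
    }
}
```

A complete writer would also have to escape control characters and make sure that numeric text uses a dot as the decimal separator, since JSON accepts nothing else.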

We solve the issue that different formats may represent the same data differently in the input phase, when we convert the original format into our own intermediate one. For instance, a string is always between double quotes (") in a JSON format, but it may or may not be between double quotes in a CSV format. We have to clean all the input data in order to have a standard representation in our custom data format.

This part would be the ideal location to perform any standard editing of the data, like ensuring that all numbers use one specific decimal separator. Since we want to keep things simple we just trim the string of any whitespace.

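// The format parameter matches the call in the Text setter.
// It is unused for now, but it is the natural hook for type-specific cleanup.
private static string PrepareValue(ValueFormat format, string text)
{
    text = text.Trim();

    return text;
}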
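If we did want to enforce a single decimal separator at this stage, a sketch could look like the following. It is not part of the project: NumberNormalizer is a hypothetical helper, and it assumes that the caller knows which decimal separator the source data uses.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the project.
public static class NumberNormalizer
{
    // Rewrites a number that uses the given decimal separator so that it uses a dot.
    // Text that is not a number is only trimmed.
    public static string Normalize(string text, char decimalSeparator)
    {
        var source = new NumberFormatInfo
        {
            NumberDecimalSeparator = decimalSeparator.ToString()
        };

        string trimmed = text.Trim();

        // NumberStyles.Float allows a sign, a decimal separator, and an exponent,
        // but no thousands separators.
        if (double.TryParse(trimmed, NumberStyles.Float, source, out double number))
            return number.ToString("R", CultureInfo.InvariantCulture);

        return trimmed;
    }
}
```

Going through double is the simplest option, but it can change how a value is written (for instance, trailing zeros are dropped), so a service that must preserve the original digits would need a text-based approach instead.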

That's all for Part 1. Tune back in tomorrow when we'll go over using a convert controller, setting up a simple pipeline, and more!


Published at DZone with permission of
