Getting Started with Apache Avro: Part 1

The previous post introduced the basics of Avro. In this post we will use Avro to serialize and deserialize data.

There are three ways in which we can use Avro for serialization/deserialization:

  1. Using Avro command line tools.
  2. Using Avro Java API without code generation.
  3. Using Avro Java API with code generation.

Sample Data

We will use the following sample data (StudentActivity.json):

{"id":"A91D021BA58444B29D4D42CA5E39F7BF","student_id":100,"university_id":908,"course_details":{"course_id":100,"enroll_date":"2012-02-13 00:00:00.000000000","verb":"completed","result_score":0.9}}
{"id":"502A77CC99B241CB94CA356F5218F1A9","student_id":101,"university_id":112,"course_details":{"course_id":233,"enroll_date":"2011-06-08 00:00:00.000000000","verb":"started","result_score":0.65}}
{"id":"5D04CD5ABF014D6EBA237766F9B470DE","student_id":102,"university_id":340,"course_details":{"course_id":339,"enroll_date":"2012-03-06 00:00:00.000000000","verb":"started","result_score":0.57}}

Note that the JSON records are nested: each record carries a course_details object with its own fields.
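To make the nesting concrete, here is a minimal Python sketch (standard library only) that reads data laid out as one JSON object per line, as in the sample above, and reaches into the nested course_details object. The helper name `parse_records` is an illustrative assumption and is not part of Avro.

```python
import json

# One record copied from the sample data above.
SAMPLE_LINE = (
    '{"id":"A91D021BA58444B29D4D42CA5E39F7BF","student_id":100,'
    '"university_id":908,"course_details":{"course_id":100,'
    '"enroll_date":"2012-02-13 00:00:00.000000000","verb":"completed",'
    '"result_score":0.9}}'
)


def parse_records(text):
    """Parse newline-delimited JSON: one record per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_records(SAMPLE_LINE)
first = records[0]
# course_details is itself an object, so its fields need a second lookup.
print(first["student_id"], first["course_details"]["verb"])
```

The same function could be applied to the whole file with `parse_records(open("StudentActivity.json").read())`, assuming the file sits in the working directory.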


Defining a Schema

Avro schemas are defined using JSON. The Avro schema for our sample data is defined below (StudentActivity.avsc):

{
    "namespace": "com.rishav.avro",
    "type": "record",
    "name": "StudentActivity",
    "fields": [
        {
            "name": "id",
            "type": "string"
        },
        {
            "name": "student_id",
            "type": "int"
        },
        {
            "name": "university_id",
            "type": "int"
        },
        {
            "name": "course_details",
            "type": {
                "name": "Activity",
                "type": "record",
                "fields": [
                    {
                        "name": "course_id",
                        "type": "int"
                    },
                    {
                        "name": "enroll_date",
                        "type": "string"
                    },
                    {
                        "name": "verb",
                        "type": "string"
                    },
                    {
                        "name": "result_score",
                        "type": "double"
                    }
                ]
            }
        }
    ]
}
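As a rough illustration of how the schema lines up with the sample data, the following Python sketch walks a parsed record and the schema side by side. It is a hand-written check that covers only the record, string, int, and double types used here. It is not how the Avro library validates data, and the names `conforms` and `PRIMITIVE_CHECKS` are assumptions made for this example.

```python
import json


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    # Avro's int is a 32-bit signed integer.
    return (isinstance(value, int) and not isinstance(value, bool)
            and -2**31 <= value < 2**31)


def _is_double(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


# Only the primitive types that appear in StudentActivity.avsc.
PRIMITIVE_CHECKS = {
    "string": lambda value: isinstance(value, str),
    "int": _is_int,
    "double": _is_double,
}


def conforms(schema, value):
    """Return True if value matches schema (records and three primitives only)."""
    if isinstance(schema, str):
        return PRIMITIVE_CHECKS[schema](value)
    if schema["type"] == "record":
        if not isinstance(value, dict):
            return False
        names = {field["name"] for field in schema["fields"]}
        return set(value) == names and all(
            conforms(field["type"], value[field["name"]])
            for field in schema["fields"]
        )
    # Handles {"type": "int"}-style wrappers around a primitive name.
    return conforms(schema["type"], value)


# The schema above, written compactly.
SCHEMA = json.loads("""
{"namespace": "com.rishav.avro", "type": "record", "name": "StudentActivity",
 "fields": [
  {"name": "id", "type": "string"},
  {"name": "student_id", "type": "int"},
  {"name": "university_id", "type": "int"},
  {"name": "course_details", "type": {"name": "Activity", "type": "record",
    "fields": [
      {"name": "course_id", "type": "int"},
      {"name": "enroll_date", "type": "string"},
      {"name": "verb", "type": "string"},
      {"name": "result_score", "type": "double"}]}}]}
""")

record = {
    "id": "A91D021BA58444B29D4D42CA5E39F7BF",
    "student_id": 100,
    "university_id": 908,
    "course_details": {
        "course_id": 100,
        "enroll_date": "2012-02-13 00:00:00.000000000",
        "verb": "completed",
        "result_score": 0.9,
    },
}

print(conforms(SCHEMA, record))
```

Note that the nested course_details field uses an inline record definition as its type, which is why the check recurses.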


1. Serialization/Deserialization using Avro command line tools

Avro provides a jar file named avro-tools-<version>.jar, which bundles many command line tools, as listed below:

$ java -jar avro-tools-1.7.5.jar 
Version 1.7.5 of Apache Avro
Copyright 2010 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

C JSON parsing provided by Jansson and
written by Petri Lehtinen. The original software is
available from http://www.digip.org/jansson/.

----------------

Available tools:
          cat  extracts samples from files
      compile  Generates Java code for the given schema.
       concat  Concatenates avro files without re-compressing.
   fragtojson  Renders a binary-encoded Avro datum as JSON.
     fromjson  Reads JSON records and writes an Avro data file.
     fromtext  Imports a text file into an avro data file.
      getmeta  Prints out the metadata of an Avro data file.
    getschema  Prints out schema of an Avro data file.
          idl  Generates a JSON schema from an Avro IDL file
 idl2schemata  Extract JSON schemata of the types from an Avro IDL file
       induce  Induce schema/protocol from Java class/interface via reflection.
   jsontofrag  Renders a JSON-encoded Avro datum as binary.
       random  Creates a file with randomly generated instances of a schema.
      recodec  Alters the codec of a data file.
  rpcprotocol  Output the protocol of a RPC service
   rpcreceive  Opens an RPC Server and listens for one message.
      rpcsend  Sends a single RPC message.
       tether  Run a tethered mapreduce job.
       tojson  Dumps an Avro data file as JSON, one record per line.
       totext  Converts an Avro data file to a text file.
     totrevni  Converts an Avro data file to a Trevni file.
  trevni_meta  Dumps a Trevni file's metadata as JSON.
trevni_random  Create a Trevni file filled with random instances of a schema.
trevni_tojson  Dumps a Trevni file as JSON.

To convert the JSON sample data to Avro binary format, use the "fromjson" tool; to get JSON data back from Avro files, use the "tojson" tool.

Command for Serializing JSON

Without any compression

java -jar avro-tools-1.7.5.jar fromjson --schema-file StudentActivity.avsc StudentActivity.json > StudentActivity.avro

With snappy compression

java -jar avro-tools-1.7.5.jar fromjson --codec snappy --schema-file StudentActivity.avsc StudentActivity.json > StudentActivity.snappy.avro

Command for Deserializing to JSON

The same command is used for deserializing both compressed and uncompressed data, because the codec is recorded in the data file itself:

java -jar avro-tools-1.7.5.jar tojson StudentActivity.avro
java -jar avro-tools-1.7.5.jar tojson StudentActivity.snappy.avro

Since an Avro data file also contains the schema, we can retrieve it using this command:

java -jar avro-tools-1.7.5.jar getschema StudentActivity.avro
java -jar avro-tools-1.7.5.jar getschema StudentActivity.snappy.avro
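To see where the embedded schema lives, here is a minimal Python sketch that reads the header of an Avro data file directly, without any Avro library. It assumes the object container file layout described in the Avro specification: the four magic bytes `Obj` plus `0x01`, followed by a metadata map whose keys include `avro.schema`, with lengths and counts stored as zigzag variable-length longs. The function names are illustrative assumptions, and this is a sketch of the idea rather than a replacement for getschema.

```python
import json

MAGIC = b"Obj\x01"


def read_long(stream):
    """Decode one Avro long: base-128 varint (low bits first), then zigzag."""
    shift = 0
    raw = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of data while reading a long")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    # Zigzag decoding maps 0, 1, 2, 3, ... back to 0, -1, 1, -2, ...
    return (raw >> 1) ^ -(raw & 1)


def read_bytes(stream):
    """Read a length-prefixed byte sequence (Avro bytes and string encoding)."""
    length = read_long(stream)
    return stream.read(length)


def read_header_metadata(stream):
    """Return the header metadata map of an Avro data file as {str: bytes}."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break  # a zero count terminates the map
        if count < 0:
            # A negative count is followed by the block size in bytes.
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def read_embedded_schema(path):
    """Parse the schema JSON stored under the avro.schema metadata key."""
    with open(path, "rb") as stream:
        meta = read_header_metadata(stream)
    return json.loads(meta["avro.schema"].decode("utf-8"))
```

Under these assumptions, calling `read_embedded_schema("StudentActivity.avro")` on a file produced by the fromjson command should return the same schema that getschema prints, and the metadata map may also carry an `avro.codec` entry naming the compression codec.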

In our next post we will use the Avro Java API for serialization/deserialization.


