
Twitter Data Analysis: Optimizing Insertion Throughput With Batching


GRAKN.AI is the database for AI. It is a distributed knowledge base designed specifically to handle complex data in a knowledge-oriented system — a task for which traditional database technologies are not the best fit.

To ensure that their internal knowledge is the most up-to-date and relevant, AI systems are always hungry for newly updated data. Working seamlessly with streaming data is therefore useful for building knowledge-oriented systems. In this blog post, we will look at how to stream public tweets into Grakn's distributed knowledge base.

Continuing Where We Left Off

This post is the last one of a three-part series covering how we can leverage GRAKN.AI for performing analysis on Twitter data.

Here, we will continue our work from Part 2 and look at how we can optimize the throughput even further with batching.

Before we delve into what batching is, let's get a brief recap of what we previously covered in this Twitter data series. Here's a rundown of the first two posts:

Part 1: Using GRAKN.AI to Stream Twitter Data. In this post, we mainly look at two things: how to model and define a schema and how to insert the actual data into the knowledge base.

Part 2: Performing Aggregate Query on Twitter Data. Here, we continue further by looking at how we can perform aggregate queries in order to obtain meaningful information from a dataset.

If you haven't already, I recommend that you check out these posts. They cover basic concepts of working with GRAKN.AI and Twitter data and will serve as the basis for the remainder of this post.

So, What Is Batching?

Batching is a technique that improves the throughput of data processing. It works by grouping units of work into batches so that the fixed overhead of each operation (the "plumbing work") is paid once per batch rather than once per item.

To make this clearer, let's walk through a concrete example involving HTTP calls.

Imagine we are trying to add new items into a database via HTTP calls. If we're trying to add ten items, it makes sense to send them in a batch of 10 items through a single HTTP call rather than doing one call per item.

This is because the plumbing cost, in this case the cost of initiating an HTTP connection, is high enough that paying it once per item would reduce the throughput by a significant margin.

That is how batching fundamentally works, and it is an important technique that can be applied to many different problems. In this post, we're going to look at performing batch insertion in order to improve the throughput of Twitter data ingestion.
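The HTTP example above can be sketched in a few lines of Java. This is only an illustration: CountingEndpoint is a hypothetical stand-in that counts how often it is called instead of opening real connections, and BatchingDemo, partition, sendOneByOne, and sendBatched are names invented for this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a remote endpoint: it only counts calls.
class CountingEndpoint {
    private int calls = 0;

    // Each invocation models one HTTP request, whatever the payload size.
    void post(List<String> items) {
        calls++;
    }

    int calls() {
        return calls;
    }
}

class BatchingDemo {
    // Split items into consecutive chunks of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    // One request per item: the connection cost is paid for every item.
    static int sendOneByOne(List<String> items) {
        CountingEndpoint endpoint = new CountingEndpoint();
        for (String item : items) {
            endpoint.post(Collections.singletonList(item));
        }
        return endpoint.calls();
    }

    // One request per batch: the connection cost is paid once per batch.
    static int sendBatched(List<String> items, int batchSize) {
        CountingEndpoint endpoint = new CountingEndpoint();
        for (List<String> batch : partition(items, batchSize)) {
            endpoint.post(batch);
        }
        return endpoint.calls();
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            items.add("item-" + i);
        }
        System.out.println("one by one: " + sendOneByOne(items) + " call(s)");    // 10 call(s)
        System.out.println("batched:    " + sendBatched(items, 10) + " call(s)"); // 1 call(s)
    }
}
```

With ten items and a batch size of 10, the batched variant makes a single call where the one-by-one variant makes ten. The items sent are identical; only the number of times the fixed cost is paid differs.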

Enabling Batch Insertion

Fortunately, Grakn has batching support already built in.

Let's update GraknTweetOntologyHelper::withGraknGraph() to expose a GraknTxType parameter. This way, we can choose whether to use WRITE or BATCH depending on the circumstances.

public static void withGraknGraph(GraknSession session, GraknTxType type, Consumer<GraknGraph> fn) {
  GraknGraph graphWriter = session.open(type);
  fn.accept(graphWriter);
  graphWriter.commit();
}

Next, go to the main method to update our schema creation and data insertion to use the appropriate GraknTxType.

An important thing to note: BATCH can only be used for data insertion. Schema creation must always be done with WRITE.

public static void main(String[] args) {
  try (GraknSession session = Grakn.session(graphImplementation, keyspace)) {
    withGraknGraph(session, GraknTxType.WRITE, graknGraph -> initTweetOntology(graknGraph)); // initialize schema
    listenToTwitterStreamAsync(consumerKey, consumerSecret, accessToken, accessTokenSecret, (screenName, tweet) -> {
      withGraknGraph(session, GraknTxType.BATCH, graknGraph -> {
        insertUserTweet(graknGraph, screenName, tweet);
        Stream<Map.Entry<String, Long>> result = calculateTweetCountPerUser(graknGraph); // query
        prettyPrintQueryResult(result); // display
      });
    });
  }
}
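The main method above still opens one transaction per incoming tweet. One possible further step, which the sample project does not take, is to collect tweets in memory and hand them over in groups. The sketch below shows such a buffer in plain Java. BatchBuffer is a hypothetical helper, not part of Grakn, and the demo uses strings in place of tweets. In the application, the flush callback is where one might call withGraknGraph with GraknTxType.BATCH and insert every buffered tweet in a single transaction, though that wiring is an assumption and is untested here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of Grakn or of the sample project.
// Collects items and passes them to a callback in groups of `capacity`.
class BatchBuffer<T> {
    private final int capacity;
    private final Consumer<List<T>> onFlush;
    private final List<T> pending = new ArrayList<>();

    BatchBuffer(int capacity, Consumer<List<T>> onFlush) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.onFlush = onFlush;
    }

    // synchronized because a streaming callback may run on another thread.
    synchronized void add(T item) {
        pending.add(item);
        if (pending.size() >= capacity) {
            flush();
        }
    }

    // Hands over whatever is pending; call this on shutdown as well,
    // so that a partially filled batch is not lost.
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = new ArrayList<>(pending);
        pending.clear();
        onFlush.accept(batch);
    }
}

class BatchBufferDemo {
    public static void main(String[] args) {
        List<Integer> flushedSizes = new ArrayList<>();
        // In the Twitter app, this callback could open one BATCH transaction
        // and insert every tweet in `batch` (assumption, not shown here).
        BatchBuffer<String> buffer =
            new BatchBuffer<>(3, batch -> flushedSizes.add(batch.size()));
        for (int i = 1; i <= 7; i++) {
            buffer.add("tweet-" + i);
        }
        buffer.flush(); // hand over the final, partially filled batch
        System.out.println(flushedSizes); // prints [3, 3, 1]
    }
}
```

The trade-off is latency: a tweet sits in the buffer until the batch fills up or flush() is called, so the per-user counts printed by the app would lag behind the stream by up to one batch.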

Running the Application

Let's build and run the application with:

$ mvn package
$ java -jar target/twitterexample-1.0-SNAPSHOT.jar

You will see a list of users along with the number of times they have tweeted since we started the application:

------
-- user <user-1> tweeted 2 time(s).
-- user <user-2> tweeted 1 time(s).
-- user <user-3> tweeted 1 time(s).
-- user <user-n> tweeted 1 time(s).
------

But it's exactly the same as what we had built earlier, in Part 2!

So What Has Changed?

Well, what we've done is an optimization step for achieving higher throughput. While the external behavior of our app doesn't change, it is now able to receive data faster.

How much faster, exactly? Well, we're curious, too! To be frank, we don't yet have the number at hand. We're still working on a benchmark, which will be published very soon.
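Until those numbers are available, a rough measurement can be taken locally. The sketch below is not the benchmark mentioned above; ThroughputMeter and itemsPerSecond are names invented for this illustration, and the workload in main is a placeholder. To compare WRITE against BATCH, one would replace the placeholder with the actual insertion call and run it once per transaction type.

```java
import java.util.function.IntConsumer;

// Hypothetical timing helper for a rough, single-run throughput estimate.
class ThroughputMeter {
    // Runs `work` once per item index and returns items processed per second.
    static double itemsPerSecond(int items, IntConsumer work) {
        long start = System.nanoTime();
        for (int i = 0; i < items; i++) {
            work.accept(i);
        }
        // Guard against a zero elapsed time on very fast runs.
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        return items / (elapsedNanos / 1_000_000_000.0);
    }

    public static void main(String[] args) {
        long[] sink = new long[1]; // keeps the placeholder work observable
        // Placeholder workload; swap in the real per-item insertion here.
        double rate = itemsPerSecond(100_000, i -> sink[0] += i);
        System.out.printf("%.0f items/s%n", rate);
    }
}
```

A single run like this is noisy. JVM warm-up, network latency, and the state of the knowledge base all affect the figure, so it is best treated as an indication, not as a benchmark result.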

In big data, batching is an extremely valuable technique which should be considered at various stages of the processing pipeline in order to boost performance.

Specifically for our app, introducing batching makes a lot of sense — we want to be able to receive as much data as possible in the shortest amount of time.

Conclusion

This post concludes the Twitter Data Analysis series! Over the last few weeks, we've looked at how we can develop a very simple application with Grakn using the Java programming language.

We've chosen to work with Twitter data so that developers of any level can dive straight into Grakn.

In other words, working with Grakn is easy and we want you to know it!

We've looked at how to define a schema, how to insert data, and how to perform aggregate queries in order to get meaningful information out of it. We've also looked at batching, which is an essential technique for working with data at scale.

You should now have enough knowledge to develop your first application with Grakn. But there's more, and we encourage you to check out our docs for comprehensive information on working with the schema and the query language, also known as Graql.

Have a look at part one and part two on the Grakn blog, in case you missed them. Also, don't forget, the sample project is always available for you to download and play with.


Topics:
big data, data analytics, twitter, optimization, batching, insertion, tutorial

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
