Glue and Big Data: Getting Started, Part 1

Where to start?

Glue is split into three parts:
  • glue-rest - the workflow engine that executes your jobs
  • gluecron - the cron/data-driven daemon that launches workflows based on cron schedules or data in HDFS
  • glue-ui - a simple UI that gives you insight into the running workflows and their output


Initial Requirements

As with all things Hadoop-related, it's best to use Linux. Technically, Glue does not require a Linux machine because it runs on the JVM, but even for trying out the examples it's best to create a Linux VM (Ubuntu or CentOS) using VirtualBox or another VM app.

Java 6+

Nothing more is required; Glue is packaged with its own libraries.
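
As a quick way to check the Java 6+ requirement, here is a minimal sketch. The helper name java_major is hypothetical and not part of Glue; it only parses a version string such as the one reported by java -version, and it assumes the usual "1.x" naming for older releases.

```shell
# Hypothetical helper (not part of Glue): extract the major Java version
# from a version string. Older releases use the "1.x" scheme
# (1.6.0_45 means Java 6); newer ones start with the major number (11.0.2).
java_major() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # drop the leading "1." of the old scheme
  esac
  # keep everything before the first ".", "_" or "-"
  echo "${v%%[._-]*}"
}

# Possible usage, assuming java -version prints a line containing
# version "<string>" on stderr:
#   ver=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
#   [ "$(java_major "$ver")" -ge 6 ] && echo "Java is recent enough for Glue"
```

The parsing is kept separate from the call to java so it can be tried on any version string without a JVM installed.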

Install Glue Rest

OK, I could have chosen a better name, but the naming sort of stuck ever since Glue was created. This is a simple step: download the rpm from https://sourceforge.net/projects/glueworkflows/files
If you're using Ubuntu, use sudo alien 'rpm' to convert it to a deb.

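To install, type one of the following, where 'rpm' or 'deb' stands for the package file (rpm on CentOS, dpkg on Ubuntu):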

sudo rpm -i 'rpm'
sudo dpkg -i 'deb'
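
The choice between the two commands above can be sketched as a small function that picks the installer from the file extension. The function name install_cmd_for and the file names in the usage lines are hypothetical. The function only prints the command so you can review it before running anything with sudo.

```shell
# Hypothetical helper: print the install command matching a package file,
# based on its extension. It prints the command instead of running it.
install_cmd_for() {
  case "$1" in
    *.rpm) echo "sudo rpm -i $1" ;;
    *.deb) echo "sudo dpkg -i $1" ;;
    *)     echo "unsupported package: $1" >&2; return 1 ;;
  esac
}

# Possible usage (file names are placeholders for the downloaded package):
#   install_cmd_for glue.rpm   # on CentOS
#   install_cmd_for glue.deb   # on Ubuntu, after converting with alien
```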

The package installs to /opt/glue, and you can start it using either of:

service glue-server start
/etc/init.d/glue-server start
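
Which of the two forms works depends on the system. A minimal sketch, assuming the init script is installed under the name shown above, that prefers the service command when it exists and otherwise falls back to the init script path (the function name start_cmd_for is hypothetical):

```shell
# Hypothetical helper: print the start command for an init-style service,
# preferring the "service" wrapper when it is available on this system.
start_cmd_for() {
  name="$1"
  if command -v service >/dev/null 2>&1; then
    echo "service $name start"
  else
    echo "/etc/init.d/$name start"
  fi
}

# Possible usage:
#   start_cmd_for glue-server
#   start_cmd_for gluecron
```

On most systems the printed command will need to be run as root, for example with sudo.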

Install Glue Cron

Download the gluecron rpm from https://sourceforge.net/projects/glueworkflows/file
Again, for a deb, use sudo alien 'rpm'.

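To install, type one of the following, where 'rpm' or 'deb' stands for the package file (rpm on CentOS, dpkg on Ubuntu):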

sudo rpm -i 'rpm'
sudo dpkg -i 'deb'

The package installs to /opt/gluecron, and you can start it using either of:

service gluecron start
/etc/init.d/gluecron start

Don't worry if gluecron or glue gives you errors on startup at the moment.
We'll need to configure them first.
That is the aim of part 2 (coming soon).

To explore more please go to: http://gerritjvv.github.io/glue/documentation.html
