
How to Set Up MapR-DB to Elasticsearch Replication


Learn how to use the MapR Gateway replication feature for full-text data search, visualization, and more. There's no better way to learn how to do something than to do it, after all!


The automatic replication of MapR-DB data to Elasticsearch is useful in many environments. Here are some great use cases for taking advantage of this feature:

  1. Full-text search of data in MapR-DB
  2. Geospatial searches for location data (think mobile user data here)
  3. Kibana visualization of the data, especially useful for time series data like sensor data or performance/network metrics
  4. ES as a secondary index for a MapR-DB table (this won’t be needed as of MapR 6.0, when JSON DB tables will support secondary indexes)
  5. Change data capture (arguably)

The MapR Gateway replication feature makes it possible to get the data into Elasticsearch 2.2 without any code!

Let’s learn how we can do it using the latest MapR Sandbox version 5.2.1 (available for free). There is no better way to learn than to do, after all!

By and large, the MapR documentation of this feature is sufficient for an experienced MapR admin to get the replication working. However, the documentation isn’t task-focused. What I contribute in this post is a start-to-finish how-to that uses sample data and covers the whole process step by step.

Limitations

Some important notes about limitations of the MapR-DB to Elasticsearch replication:

  1. Only Elasticsearch 2.2 is supported
    • Later versions of ES will not work
  2. No support for JSON DB tables

Sample Dataset

For this tutorial, we’ll use pump sensor data that is used in other training materials and blogs such as Real-Time Streaming With Kafka and HBase by Carol MacDonald. I have modified it a bit to add an ID column.

[mapr@maprdemo ~]$ head -n 3 /mapr/demo.mapr.com/user/mapr/sensordata.csv
1,COHUTTA,3/10/14,1:02,9.67,1.731,882,0.52,87,1.79
2,COHUTTA,3/10/14,1:03,10.47,1.732,882,1.7,92,0.66
3,COHUTTA,3/10/14,1:05,9.56,1.734,883,1.35,99,0.68

The data columns include a date, a time, and some metrics related to sensor readings from a pump such as those used in the oil industry (psi, flow, etc.). There are 47,899 rows in this dataset. While this is tiny by the standards of production MapR-DB, it’s more than enough to demonstrate the technology working on the sandbox.

Get the data here.

Send the data to the sandbox using the following command (while the sandbox is running, of course!):

$> scp -P 2222 sensordata.csv mapr@localhost:

You will be prompted for the password; it’s “mapr”.

Alternatively, you can wget the data directly from within the sandbox. Just copy the dataset’s URL and paste it after wget while logged into the sandbox:

$> wget <URL to dataset>

Remember that to log into the sandbox from your favorite shell, just type:

$> ssh -p 2222 mapr@localhost

Finally, copy the data into MapR-FS via the NFS mount. This ensures the command to import data into MapR-DB will run as-is:

$> cp sensordata.csv /mapr/demo.mapr.com/user/mapr
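
If the copy worked, the file should be visible both through the NFS mount and via the Hadoop command line. A quick sanity check (assuming you kept the file name sensordata.csv):

$> ls -l /mapr/demo.mapr.com/user/mapr/sensordata.csv
$> wc -l /mapr/demo.mapr.com/user/mapr/sensordata.csv   # should report 47899 lines
$> hadoop fs -ls /user/mapr/sensordata.csv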

MapR-DB Replication Using the MapR Gateway Service

MapR-DB is a NoSQL database that follows in the footsteps of Google BigTable. More specifically, it started as a reimplementation of Apache HBase designed from the ground up to take advantage of the advanced inner workings of the killer distributed platform known as the MapR Converged Data Platform. It also now has native JSON support to more easily handle hierarchical, nested, and evolving data formats.

At its core, the MapR-DB replication feature was designed to let a MapR-DB table be replicated automatically to a MapR-DB instance running on another cluster. One primary use case is for a global enterprise to improve access speed and get multi-region HA automatically, with a guarantee of data consistency. The feature can get really fancy with bi-directional replication, where applications can read and write to/from either replica and still know both are always kept up to date.

More info can be found here and here.

Setup Guide

Getting set up is a pretty short process.

Choices in Solution Design

If you just want to try this feature out, then the MapR Sandbox is a great way to get started quickly. I’ll make sure to cover that in this guide.

For those who may want to use this feature on a production cluster though, there are a couple of configurations to ponder:

  • Co-locate the ES cluster with the MapR cluster
  • Use an external ES cluster

Unsurprisingly, if you have plenty of hardware servers, an external ES cluster is the preferred solution: it isolates services, reduces failure impact, and reserves the cluster’s resources for actual big data processing.

While putting the ES cluster on separate nodes is the recommended solution for a production cluster, it is also possible to colocate part or all of an ES cluster with MapR nodes. Keep in mind that memory resources taken by ES are not available to the cluster.

For sizing of the ES cluster, the main factors are storage needs and incoming data throughput. The more data, the more nodes will be needed. The sizing issue is well explained in the MapR documentation.

Preparation

Install ES (single node or cluster mode).

  1. Install Elasticsearch and run it (as root):
    $> wget https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/rpm/elasticsearch/2.2.0/elasticsearch-2.2.0.rpm
    $> rpm -i elasticsearch-2.2.0.rpm
    $> service elasticsearch start
  2. Check installation:
    $> curl localhost:9200
    {
    "name" : "D'Spayre",
    "cluster_name" : "elasticsearch",
    "version" : {
    "number" : "2.2.0",
    "build_hash" : "8ff36d139e16f8720f2947ef62c8167a888992fe",
    "build_timestamp" : "2016-01-27T13:32:39Z",
    "build_snapshot" : false,
    "lucene_version" : "5.4.1"
    },
    "tagline" : "You Know, for Search"
    }
    Note: It's installed in /usr/share/elasticsearch/ and runs as user "elasticsearch" when installing with rpm.
  3. Update the Elasticsearch config, then restart ES so the changes take effect:
    $> vi /etc/elasticsearch/elasticsearch.yml
    cluster.name: mapr-elastic
    network.host: 10.0.2.15 # <- IP of the sandbox, see hostname -i
    $> service elasticsearch restart
  4. Verify config:
    $> curl maprdemo:9200
    {
    "name" : "Angelo Unuscione",
    "cluster_name" : "mapr-elastic",
    "version" : {
    "number" : "2.2.0",
    "build_hash" : "8ff36d139e16f8720f2947ef62c8167a888992fe",
    "build_timestamp" : "2016-01-27T13:32:39Z",
    "build_snapshot" : false,
    "lucene_version" : "5.4.1"
    },
    "tagline" : "You Know, for Search"
    }

Optional: Add port forwarding to access ES from your host. In VirtualBox, I added TCP port 9200 to the list of Port Forwarding Rules.

[Screenshot: VirtualBox port forwarding rules]
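
With the forwarding rule in place (assuming it maps host port 9200 to guest port 9200), you can check ES from your host machine as well:

$> curl localhost:9200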

Just keep in mind the hostname of the ES instances, and remember that the only supported ES version is 2.2. This is important; otherwise, there is a good chance the replication will fail.

Create a MapR-DB Table

There are a variety of ways to create a MapR-DB table. We’ll use the command line, but it’s equally possible (and very easy!) to use MCS to do it visually.

$> maprcli table create -path /user/mapr/pumps
$> maprcli table cf create -path /user/mapr/pumps -cfname data

That’s it! Inserting the data will create the columns automatically. We don’t need to worry about data types, as MapR-DB only stores bytes and it’s up to the application to convert the data to/from bytes. This is a common pattern for NoSQL databases.
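
If you want a quick sanity check that the table and its column family exist, maprcli can list them (the -json flag just makes the output easier to read):

$> maprcli table cf list -path /user/mapr/pumps
$> maprcli table info -path /user/mapr/pumps -json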

Add Data to the MapR-DB Table

We’re using HBase’s ImportTsv functionality to import the CSV-formatted dataset directly into the MapR-DB table we just created. Again, no code required. So much for “Hadoop is difficult,” right?

$> hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=,  -Dimporttsv.columns="HBASE_ROW_KEY,data:resid,data:date,data:time,data:hz,data:disp,data:flow,data:sedppm,data:psi,data:ch1ppm" /user/mapr/pumps /user/mapr/sensordata.csv

This launches a YARN MapReduce application to bulk import the data, meaning it will scale to CSV files of any size, from megabytes to petabytes. The main point to check here is that the data: prefix refers to the column family. Adjust the columns to match your own use case, potentially spanning several column families; it will work just fine!

Note: The column names are meaningful. They must match up with the Elasticsearch type mappings we define later on. This is important to get everything working!
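
Once the ImportTsv job finishes, you can spot-check the imported rows from the HBase shell (MapR-DB binary tables are addressed by their path):

$> hbase shell
hbase> scan '/user/mapr/pumps', {LIMIT => 2}
hbase> count '/user/mapr/pumps'   # should report 47899 rows
hbase> exit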

Install MapR Gateway Service

First, install the mapr-gateway package on one or more nodes. On a production cluster, it’s always recommended to have at least two gateways to enable high availability. The number of nodes running the gateway should be based on the network bandwidth requirement as well as cluster hardware and available resources.

To install the package, log in as root (su root after logging in as mapr, or just log in as root directly; the password is also "mapr"). Then, install the package using yum:

$> yum install -y mapr-gateway

After installing the package, still as root, configure the system again:

$> /opt/mapr/server/configure.sh -R
$> service mapr-warden restart

The details are all available on the MapR documentation site.
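
To confirm the gateway came up after the restart, check the services on the node. The first command below is standard maprcli; the second should also list local gateways on 5.2, but treat it as an assumption if your version differs:

$> maprcli service list -node maprdemo | grep gateway
$> maprcli cluster gateway local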

Register Elasticsearch

Next, we need to register ES with the MapR cluster. This basically means copying over some libraries for the gateway to use. An ES cluster only needs to be registered once per MapR cluster and can be reused to replicate many tables to different indexes/types.

We will also need to run the following command as root.

To do this, run the script /opt/mapr/bin/register-elasticsearch.

Parameters:

  • -c: A tag that will be used as the target of the replica setup command. The recommended name is the ES cluster name, but it could be anything. It will be the name used in the replication command, so remember it!
  • -r: The hostname of a node running Elasticsearch, from which the ES files are copied (here maprdemo, since ES runs on the sandbox itself).
  • -t: Use the transport client. This is the only client supported by MapR 5.2 and is required in conjunction with the -r parameter.
  • -e: The directory where ES is installed. Note that if ES was installed via the RPM/Deb package, this parameter is not necessary.
  • -y: Do not prompt for values. If you’re following the steps here, it’s safe to use.

Using the sandbox, this command (run as root) registers ES with the MapR cluster:

$> /opt/mapr/bin/register-elasticsearch -c elastic -r maprdemo -t -y  
Copying ES files from maprdemo to /tmp/es_register_mapr...
The authenticity of host 'maprdemo (10.0.2.15)' can't be established.
RSA key fingerprint is 6a:24:76:81:7d:53:ab:4d:3e:b5:29:0a:cb:ab:dd:9a.
Are you sure you want to continue connecting (yes/no)? yes
Registering ES cluster elastic on local MapR cluster.
Your ES cluster elastic has been successfully registered on the local MapR cluster.

Doing this as root on a fresh sandbox, expect only the “Are you sure you want to continue connecting?” prompt; answer yes, of course. If you run the command as user mapr instead, it will not work when Elasticsearch was installed from the RPM, because the script needs access to the elasticsearch.yml file, which the RPM puts in the /etc/elasticsearch folder.

In practice, this adds some shared libraries and other required files to MapR-FS under the folder /mapr/demo.mapr.com/opt/external/elasticsearch. You can verify that the clusters subfolder contains "elastic".

To verify ES is registered properly, you can then enter this command (notice the -l parameter):

$> /opt/mapr/bin/register-elasticsearch -l
Found 1 items
drwxr-xr-x   - mapr mapr          3 2016-10-27 21:28 /opt/external/elasticsearch/clusters/elastic

We are now done with registering the Elasticsearch cluster with the MapR cluster. This only needs to be done once for each Elasticsearch cluster, regardless of how many tables replicate to ES.

Add Elasticsearch Mappings

This part is critical and a source of most issues. Get the mappings wrong and the replication will fail.

$> curl -X PUT maprdemo:9200/pumps/ -d '
{
    "mappings" : {
        "pumpsdata" : {
            "properties" : {
                "pumpsdata" : {
                    "dynamic" : "true",
                    "properties" : {
                        "resid"  : {"type":"string"},
                        "date"   : {"type":"date", "format":"MM/dd/yy"},
                        "time"   : {"type":"string"},
                        "hz"     : {"type":"double"},
                        "disp"   : {"type":"double"},
                        "flow"   : {"type":"double"},
                        "sedppm" : {"type":"double"},
                        "psi"    : {"type":"double"},
                        "ch1ppm" : {"type":"double"}
                    }
                }
            }
        }
    }
}'

The mappings are really critical because MapR DB binary tables, just like HBase and many NoSQL databases, have no information about the data. They only store bytes. As such, the replication gateway needs to convert the data from bytes into whatever is in the mapping for the type defined in Elasticsearch. If a conversion fails, it throws an exception and the replication fails.

Check this page in the MapR documentation to validate that your mapping is indeed correct given your data.

Personally, the date/timestamp types caused me a lot of grief. It was fiddly to get it working properly until I got the hang of it. The mappings above are tested and work.
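
Before wiring up the replication, it doesn’t hurt to confirm that ES stored the mapping exactly as intended:

$> curl 'maprdemo:9200/pumps/_mapping?pretty'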

Set Up Replication

We are finally there! Time to start the actual replication. Related documentation is found here in MapRDocs.

This is done using the maprcli utility as user "mapr":

$> maprcli table replica elasticsearch autosetup -path /user/mapr/pumps -target elastic -index pumps -type pumpsdata

Once this command is run, MapR will launch a MapReduce job to do an initial bulk replication of the data currently stored in the MapR-DB table. This could take a long time if the table already holds a lot of data. With our very small test data (47,899 rows), it should take less than one minute, mostly because of the startup cost of a MapReduce job. You can see it running by opening the ResourceManager UI (http://maprdemo:8088/cluster/apps).

If you’re planning to use replication from the start, it’s probably a good idea to set it up when the table holds just a bit of data so that the initial bulk load runs quickly. While it’s possible to enable replication on an empty table, I wouldn’t recommend it, since there is no way to make sure the replication is set up properly until data is added, which could be in production. I prefer to detect errors and fix issues as early as possible.

From there on out, as data is added to the MapR-DB table, the data will be automatically replicated to ES by the gateway. It’s magic!

Verifying the Replication

In MCS, we should now be able to see that the replication has indeed been successful.

In Elasticsearch, we can also make sure that the document count matches the number of rows we have replicated so far:

$> curl maprdemo:9200/pumps/pumpsdata/_count
{"count":47899,"_shards":{"total":5,"successful":5,"failed":0}}

[Screenshot: MCS Replicas tab for the /user/mapr/pumps table]

Above is a screenshot of the /user/mapr/pumps table’s Replicas tab in MCS, which clearly shows that Elasticsearch replication is on.

The first load is a bulk load, and all subsequent inserts/updates are added to ES as they are added to MapR-DB in a streaming fashion.
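
To see the streaming part in action, insert a single extra row from the HBase shell and then look it up in ES a few seconds later. This is a minimal sketch: the row key 50000 is made up, and it assumes the gateway uses the MapR-DB row key as the Elasticsearch document ID (which is what I observed):

$> hbase shell
hbase> put '/user/mapr/pumps', '50000', 'data:resid', 'COHUTTA'
hbase> put '/user/mapr/pumps', '50000', 'data:date', '3/10/14'
hbase> put '/user/mapr/pumps', '50000', 'data:psi', '92'
hbase> exit

$> curl 'maprdemo:9200/pumps/pumpsdata/50000?pretty'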

Potential Issues

Some sources of issues to be careful about:

  1. Make sure the user running the replication command has POSIX permissions to the MapR-DB table. In our case, we’re creating it with user ‘mapr’ and running the command as ‘mapr,’ so that’s OK. Permissions in MapR matter.
  2. Double check that your index is created and the mappings are well matched to the data. If you’re using our test data and mappings though, it should be smooth sailing!
  3. Finally, ensure that the input data is stored as UTF-8 text, as it is in this example. The gateway decodes the bytes stored in MapR-DB as UTF-8 strings, so if the values had been written as raw binary (for example, serialized numbers) rather than text, the decoded output would be gibberish and ES would complain. UTF-8 is the default text encoding on all modern systems, so plain text data should be fine, but it’s something to keep in mind.

If the job fails, edit Elasticsearch’s logging.yml file to set the logging level to DEBUG (with the RPM install used here, it lives in /etc/elasticsearch). Tailing the Elasticsearch log (under /var/log/elasticsearch for an RPM install) will give the most information about conversion errors.
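
The MapR side can be inspected too. Listing the table’s replicas with maprcli shows the replication state and statistics, including the Elasticsearch replica we set up (field names vary a bit across versions):

$> maprcli table replica list -path /user/mapr/pumps -json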

Wrap Up

Replication to Elasticsearch can be a very useful feature, with a lot of great use cases as described above. It’s pretty easy to set up and will work reliably in the background to keep your data synchronized. Hopefully, this walkthrough will encourage more MapR users to experiment with the feature and take advantage of it on their production clusters.


