High-Performance Computing Clusters (HPCC) and Cassandra on OS X
Our new parent company, LexisNexis, has one of the world's largest public records databases:

"...our comprehensive collection of more than 46 billion records from more than 10,000 diverse sources—including public, private, regulated, and derived data. You get comprehensive information on approximately 269 million individuals and 277 million unique businesses."
http://www.lexisnexis.com/en-us/products/public-records.page

And they've been managing, analyzing and searching this database for decades.  Over that time period, they've built up quite an assortment of "Big Data" technologies.  Collectively, LexisNexis refers to those technologies as their High-Performance Computing Cluster (HPCC) platform.
http://hpccsystems.com/Why-HPCC/How-it-works

HPCC is entirely open source:
https://github.com/hpcc-systems/HPCC-Platform

Naturally, we are working through the marriage of HPCC with our real-time data management and analytics stack.  The potential is really exciting.  Specifically, HPCC has sophisticated machine learning and statistics libraries, and a query engine (Roxie) capable of serving up those statistics.
http://hpccsystems.com/ml

Lo and behold, HPCC can use Cassandra as a backend storage mechanism! (FTW!)

The HPCC platform isn't technically supported on a Mac, but here is what I did to get it running:

HPCC Install

  • Install the dependencies using Homebrew:
  
brew install icu4c
brew install boost
brew install libarchive
brew install bison27
brew install openldap
brew install nodejs
  • Make a build directory, and run cmake from there:
  
export CC=/usr/bin/clang
export CXX=/usr/bin/clang++
cmake ../ \
  -DICU_LIBRARIES=/usr/local/opt/icu4c/lib/libicuuc.dylib \
  -DICU_INCLUDE_DIR=/usr/local/opt/icu4c/include \
  -DLIBARCHIVE_INCLUDE_DIR=/usr/local/opt/libarchive/include \
  -DLIBARCHIVE_LIBRARIES=/usr/local/opt/libarchive/lib/libarchive.dylib \
  -DBOOST_REGEX_LIBRARIES=/usr/local/opt/boost/lib \
  -DBOOST_REGEX_INCLUDE_DIR=/usr/local/opt/boost/include \
  -DUSE_OPENLDAP=true \
  -DOPENLDAP_INCLUDE_DIR=/usr/local/opt/openldap/include \
  -DOPENLDAP_LIBRARIES=/usr/local/opt/openldap/lib/libldap_r.dylib \
  -DCLIENTTOOLS_ONLY=false \
  -DPLATFORM=true
  • Then, compile and install with sudo make install
  • After that, you'll need to muck with the permissions a bit.
  • Ordinarily, you would run hpcc-init to get the system configured, but that script fails on OS X, so I used Linux to generate working config files and posted those to a repository here:
  • Clone that repository and replace /var/lib/HPCCSystems with the contents of var_lib_hpccsystems.zip:
  
sudo rm -fr /var/lib/HPCCSystems
sudo unzip var_lib_hpccsystems.zip -d /var/lib
chmod -R a+rwx /var/lib/HPCCSystems
  • Then, from the directory containing the xml files in this repository, you can run:
      daserver (Runs the Dali server, which is the persistence mechanism for HPCC)
      esp (Runs the ESP server, which is the web services and UI layer for HPCC)
      eclccserver (Runs the ECL compile server, which compiles the ECL down to C++ and then a dynamic library)
      roxie (Runs the Roxie server, which is capable of responding to queries)
  • Kick off each of those, then go to http://localhost:8010 in a browser. You are ready to run some ECL!
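Each of those daemons can take a few seconds to come up. As a small convenience, here is a port-probe sketch (plain Python standard library; the host, port, and timeout values are just assumptions, not anything HPCC ships) you can use to wait for the ESP UI before opening the browser:

```python
import socket
import time

def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll host:port until something accepts a TCP connection, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # If the connect succeeds, a server is listening on that port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # nothing listening yet; retry
    return False

# Example: ESP serves the HPCC web UI on 8010 by default
# wait_for_port("localhost", 8010, timeout=60)
```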

Running ECL

Much as Hadoop has Pig, HPCC uses a DSL called ECL (Enterprise Control Language).  More information on ECL can be found here:
http://hpccsystems.com/download/docs/learning-ecl
  • As a simple smoke test, go into your HPCC-Platform repository and then into ./testing/regress/ecl.
  • Then, run the following:
  
ecl run hello.ecl --target roxie --server=localhost:8010
  • You should see the following:
  
<dataset name="Result 1"> 
<row><result_1>Hello world</result_1></row> 
</dataset>
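Roxie returns results as XML, so if you want to script against a query rather than eyeball the browser, the dataset is easy to pull apart with a standard library parser. A sketch in Python, using the hello-world output shown above as the payload:

```python
import xml.etree.ElementTree as ET

# The <dataset> block returned by the hello.ecl run above
payload = """<dataset name="Result 1">
<row><result_1>Hello world</result_1></row>
</dataset>"""

root = ET.fromstring(payload)
# One dict per <row>, keyed by the result column names
rows = [{cell.tag: cell.text for cell in row} for row in root.findall("row")]
print(rows)  # [{'result_1': 'Hello world'}]
```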

Cassandra Plugin

With HPCC up and running, we are ready to have some fun with Cassandra.  HPCC supports plugins, which reside in /opt/HPCC/plugins.  In my case, I had to copy those libraries into /opt/HPCCSystems/lib to get HPCC to recognize them.

Go back to the testing/regress/ecl directory and have a look at cassandra-simple.ecl. A snippet is shown below:

-------------------------

childrec := RECORD
string name,
integer4 value { default(99999) },
boolean boolval { default(true) },
real8 r8 {default(99.99)},
real4 r4 {default(999.99)},
DATA d {default (D'999999')},
DECIMAL10_2 ddd {default(9.99)},
UTF8 u1 {default(U'9999 ß')},
UNICODE u2 {default(U'9999 ßßßß')},
STRING a,
SET OF STRING set1,
SET OF INTEGER4 list1,
LINKCOUNTED DICTIONARY(maprec) map1{linkcounted};
END;

init := DATASET([{'name1', 1, true, 1.2, 3.4, D'aa55aa55', 1234567.89, U'Straße', U'Straße','Ascii',['one','two','two','three'],[5,4,4,3],[{'a'=>'apple'},{'b'=>'banana'}]},
{'name2', 2, false, 5.6, 7.8, D'00', -1234567.89, U'là', U'là','Ascii', [],[],[]}], childrec);

load(dataset(childrec) values) := EMBED(cassandra : user('boneill'),keyspace('test'),batch('unlogged'))
INSERT INTO tbl1 (name, value, boolval, r8, r4,d,ddd,u1,u2,a,set1,list1,map1) values (?,?,?,?,?,?,?,?,?,?,?,?,?);
ENDEMBED;

--------------------

In this example, we define childrec as a RECORD with a set of fields. We then create a DATASET of type childrec. Then we define a method that takes a dataset of type childrec and runs the Cassandra insert command for each of the records in the dataset. 
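The EMBED block behaves like a prepared statement executed once per record in the dataset. If that pattern is unfamiliar, here is the same idea sketched against SQLite (Python standard library, purely as an analogy -- the real plugin speaks CQL to Cassandra, and the column list here is trimmed to a few of childrec's fields for brevity):

```python
import sqlite3

# Stand-in for test.tbl1, trimmed to a handful of childrec's fields
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl1 (name TEXT, value INTEGER, boolval BOOLEAN, a TEXT)")

# The dataset: one tuple per childrec row, like init above
records = [
    ("name1", 1, True, "Ascii"),
    ("name2", 2, False, "Ascii"),
]

# One parameterized INSERT, executed per record -- the shape of the EMBED block
conn.executemany(
    "INSERT INTO tbl1 (name, value, boolval, a) VALUES (?, ?, ?, ?)", records
)
conn.commit()

print(conn.execute("SELECT count(*) FROM tbl1").fetchone()[0])  # 2
```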

Start up Cassandra locally: download Cassandra, unzip it, and run bin/cassandra -f (the -f keeps it in the foreground).

Once Cassandra is up, simply run the ECL like you did the hello program.

ecl run cassandra-simple.ecl --target roxie --server=localhost:8010

You can then go over to cqlsh and validate that all of the data made it into Cassandra: 

➜ cassandra bin/cqlsh
Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.7 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh> select * from test.tbl1 limit 5;

name | a | boolval | d | ddd | list1 | map1 | r4 | r8 | set1 | u1 | u2 | value
-----------+---+---------+----------------+------+-------+------+--------+--------+------+--------+-----------+-------
name1575 | | True | 0x393939393939 | 9.99 | null | null | 1576.6 | 1575 | null | 9999 ß | 9999 ßßßß | 1575
name3859 | | True | 0x393939393939 | 9.99 | null | null | 3862.9 | 3859 | null | 9999 ß | 9999 ßßßß | 3859
name11043 | | True | 0x393939393939 | 9.99 | null | null | 11054 | 11043 | null | 9999 ß | 9999 ßßßß | 11043
name3215 | | True | 0x393939393939 | 9.99 | null | null | 3218.2 | 3215 | null | 9999 ß | 9999 ßßßß | 3215
name7608 | | False | 0x393939393939 | 9.99 | null | null | 7615.6 | 7608.1 | null | 9999 ß | 9999 ßßßß | 7608
OK -- that should give you a little taste of ECL and HPCC. It is a powerful platform.
As always, let me know if you run into any trouble.
Published at DZone with permission of Brian O' Neill, DZone MVB. See the original article here.
