BigMemory 4.0 technical overview
For those out there working with huge data sets (from gigabytes to terabytes), BigMemory is a product worth considering: it allows your Java application to handle large volumes of data in memory, which means fast access to your data.
Version 4.0 was released a few days ago, so let's take a technical look at what's inside.
1) “Fast data access at terabyte scale”
There's an interesting whitepaper that was published last year, reporting the results of scaling a test application. The size of the data set started at 2 GB and went up to 1.8 TB. Throughput stayed within roughly 10% of the mean, with no garbage-collection-induced latency spikes. In short, reads and writes to BigMemory run at about the same speed whether you store gigabytes or terabytes of data.
The architecture is as follows:
Each instance of your application uses Ehcache to cache the hot set of data in heap. You then define in Ehcache the size of the BigMemory store you want. This part is called the off-heap store, as it is memory that is not managed by the garbage collector (and therefore not heap).
You can replicate your cache/BigMemory store among multiple instances of your application by connecting them to the Terracotta server. Please note that the Terracotta server can also use BigMemory to store more data.
In terms of development, if you're familiar with Ehcache you won't be lost.
Creating a BigMemory instance will look like this:
import net.sf.ehcache.*;
import net.sf.ehcache.config.*;

Configuration cfg = new Configuration()
    .terracotta(new TerracottaClientConfiguration().url("localhost:9510"))
    .cache(new CacheConfiguration().name("myDataTableExample")
        .maxBytesLocalHeap(1, MemoryUnit.GIGABYTES)
        .maxBytesLocalOffHeap(4, MemoryUnit.GIGABYTES)
        .terracotta(new TerracottaConfiguration())
    );
CacheManager manager = CacheManager.newInstance(cfg);
Ehcache myDataTableExample = manager.getEhcache("myDataTableExample");

String key = "some key";
SomeCustomEntity value = ....
SomeCustomEntity newValue = ....

myDataTableExample.put(new Element(key, value));
myDataTableExample.replace(new Element(key, newValue));
value = (SomeCustomEntity) myDataTableExample.get(key).getObjectValue();
myDataTableExample.remove(key);

manager.shutdown();
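One practical note: the off-heap store is allocated from direct (NIO) memory outside the Java heap, so the JVM typically has to be allowed to allocate at least that much direct memory. As a rough example (the jar name is just a placeholder), for the 4 GB off-heap store configured above you would start the JVM with something like:

java -XX:MaxDirectMemorySize=4g -jar myapp.jar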
2) “Monitoring”
This refers to the monitoring tool called the TMC (Terracotta Management Console).
After installing the BigMemory archive, you will be able to start it from the command line with one of these scripts, depending on the version you're using:
tools/management-console/bin/start-tmc.sh
or
management-console/bin/start-tmc.sh
Then the monitoring application can be accessed in your browser at
http://localhost:9889/tmc
In this example, the store is clustered (i.e. shared among different JVMs) through the Terracotta server, and the monitoring console gets its information from the Terracotta server.
If, on the other hand, you're in standalone mode (the store running in a single JVM), you're not using the Terracotta server, so you'll need to indicate in your configuration that you intend to publish your data to the monitoring console:
ManagementRESTServiceConfiguration restCfg = new ManagementRESTServiceConfiguration();
restCfg.setSecurityServiceLocation("http://localhost:9889/tmc/api/assertIdentity");

Configuration cfg = new Configuration()
    .managementRESTService(restCfg)
    .cache(new CacheConfiguration().name("myDataTableExample")
        .maxBytesLocalHeap(1, MemoryUnit.GIGABYTES)
        .maxBytesLocalOffHeap(4, MemoryUnit.GIGABYTES)
    );
3) “Fast restart for disaster recovery”
Now things get even more interesting: not only is all the data in memory, it can also be persisted to disk, so after a crash or a restart of the application the data is loaded back into memory.
There are four options, but I'll focus on the two most important: LOCALRESTARTABLE and DISTRIBUTED.
a) LOCALRESTARTABLE is used for a standalone configuration, persisting the data to disk:
Configuration cfg = new Configuration()
    .diskStore(new DiskStoreConfiguration().path("/mydisk/mystore/"))
    .cache(new CacheConfiguration().name("myPersistentDataTableExample")
        .maxBytesLocalHeap(1, MemoryUnit.GIGABYTES)
        .maxBytesLocalOffHeap(4, MemoryUnit.GIGABYTES)
        .persistence(new PersistenceConfiguration().strategy(PersistenceConfiguration.Strategy.LOCALRESTARTABLE))
    );
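To illustrate the restart behaviour, here is a minimal sketch (reusing the key from the earlier example): after the JVM comes back up, re-creating the CacheManager with the same restartable configuration is enough for the previously persisted entries to be available again.

// After a crash or restart, rebuild the CacheManager with the same restartable configuration.
CacheManager manager = CacheManager.newInstance(cfg);
Ehcache table = manager.getEhcache("myPersistentDataTableExample");

// Entries written before the restart are reloaded from /mydisk/mystore/.
Element recovered = table.get("some key");
if (recovered != null) {
    SomeCustomEntity value = (SomeCustomEntity) recovered.getObjectValue();
}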
b) DISTRIBUTED is used for a clustered configuration, persisting the data to the Terracotta server, which is then responsible for persisting it to disk:
Configuration cfg = new Configuration()
    .diskStore(new DiskStoreConfiguration().path("/mydisk/mystore/"))
    .terracotta(new TerracottaClientConfiguration().url("localhost:9510"))
    .cache(new CacheConfiguration().name("myPersistentDataTableExample")
        .maxBytesLocalHeap(1, MemoryUnit.GIGABYTES)
        .maxBytesLocalOffHeap(4, MemoryUnit.GIGABYTES)
        .terracotta(new TerracottaConfiguration())
        .persistence(new PersistenceConfiguration().strategy(PersistenceConfiguration.Strategy.DISTRIBUTED))
    );
And add the following to the Terracotta server config (tc-config.xml):
<restartable enabled="true"/>
<offheap>
  <enabled>true</enabled>
  <maxDataSize>....</maxDataSize>
</offheap>
4) “Search support”
As we've seen, with BigMemory we have a <key, value> in-memory data store that can be clustered and persisted to disk.
Since the goal is to keep as much data as possible in memory while still providing convenient access to it, we also get query support, similar to what you would find in a database.
It would take a whole post to describe all the search functionality, so here is a simple example:
Say we put instances of a Person class in memory, with these two fields:
private String familyName;
private Integer age;
Finding every person whose family name starts with 'a' would be:
import java.util.List;

import net.sf.ehcache.search.*;
import net.sf.ehcache.search.expression.*;

Query query = myPersistentDataTableExample.createQuery()
    .addCriteria(new ILike("familyName", "a*"))
    .includeValues()
    .addOrderBy(new Attribute<String>("familyName"), Direction.ASCENDING);

Results results = query.execute();
List<Result> all = results.all();
for (Result result : all) {
    Person person = (Person) result.getValue();
}
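Note that for these criteria to work, the cache has to be declared searchable with the corresponding attributes. Here is a minimal sketch of how that could look programmatically, assuming Person exposes getFamilyName() and getAge() getters (the exact builder methods may differ slightly depending on your Ehcache version):

import net.sf.ehcache.config.*;

// Declare the cache as searchable, extracting two attributes from the stored value.
// This is an illustrative sketch, not the only way to configure search.
Searchable searchable = new Searchable();
searchable.addSearchAttribute(new SearchAttribute().name("familyName").expression("value.getFamilyName()"));
searchable.addSearchAttribute(new SearchAttribute().name("age").expression("value.getAge()"));

CacheConfiguration cacheCfg = new CacheConfiguration().name("myPersistentDataTableExample")
    .maxBytesLocalHeap(1, MemoryUnit.GIGABYTES)
    .maxBytesLocalOffHeap(4, MemoryUnit.GIGABYTES);
cacheCfg.addSearchable(searchable);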
5) “Hadoop ready”
This is another interesting piece of functionality.
In case you're not familiar with Hadoop, it is a system that lets you process big data sets.
It is very powerful because you can use commodity servers and Hadoop takes care of distributing the work among them.
Hadoop has its own file system, named HDFS, and you first need to import your data into HDFS before it can be processed.
However, you can't get the results in real time, since you need to wait for the whole job to finish before you can access the data on HDFS.
There are some solutions emerging, like Cloudera's Impala real-time queries.
BigMemory tackles the problem by providing a Hadoop connector. With this connector, Hadoop can send the data to BigMemory as soon as it is processed, making it available in memory in real time.
In Hadoop, the OutputFormat and RecordWriter interfaces define how the output of map/reduce jobs is handled.
For example, by default, TextOutputFormat writes the data out to a file on HDFS.
Instead, you can use EhcacheOutputFormat, a custom OutputFormat implementation that writes data to BigMemory.
If we take the famous Hadoop WordCount example:
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}
It turns into:
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(EhcacheElementWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(EhcacheOutputFormat.class);

    // The input is still read from HDFS; only the output goes to BigMemory,
    // so no FileOutputFormat output path is needed.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));

    JobClient.runJob(conf);
}
You also need to define the ehcache.xml configuration (following the Ehcache format).
Then, in your application, you will be able to initialize it as a standard Ehcache cache and read the data as soon as it is in memory.
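As a rough illustration of the consuming side (the cache name "wordcount" and the configuration file path are assumptions for this sketch, since they depend on your ehcache.xml):

import net.sf.ehcache.*;

// Initialize a standard CacheManager from the same ehcache.xml used by the Hadoop job.
CacheManager manager = CacheManager.newInstance("ehcache.xml");
Ehcache wordCounts = manager.getEhcache("wordcount"); // hypothetical cache name

// Read a result as soon as the job has written it to BigMemory.
Element element = wordCounts.get("hello");
if (element != null) {
    System.out.println("count for 'hello': " + element.getObjectValue());
}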
6) More…
There is other stuff under the hood.
A Toolkit library is available for clustered objects (like collections, locks, synchronization…).
The security functionality has been improved. As in version 3.7, SSL-based communication is available for a secure setup. Additionally, you can leverage an LDAP or Active Directory server to store credentials.
And Java 7 is also supported…
So if you're interested in the big data world, you should definitely check it out here.
Don't hesitate to leave comments if you need more details!