DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest Data Topics

article thumbnail
Spring Caching Abstraction and Google Guava Cache
Spring provides a great out of the box support for caching expensive method calls. The caching abstraction is covered in a great detail here. My objective here is to cover one of the newer cache implementations that Spring now provides with 4.0+ version of the framework - using Google Guava Cache In brief, consider a service which has a few slow methods: public class DummyBookService implements BookService { @Override public Book loadBook(String isbn) { // Slow method 1. } @Override public List loadBookByAuthor(String author) { // Slow method 2 } } With Spring Caching abstraction, repeated calls with the same parameter can be sped up by an annotation on the method along these lines - here the result of loadBook is being cached in to a "book" cache and listing of books cached into another "books" cache: public class DummyBookService implements BookService { @Override @Cacheable("book") public Book loadBook(String isbn) { // slow response time.. } @Override @Cacheable("books") public List loadBookByAuthor(String author) { // Slow listing } } Now, Caching abstraction support requires a CacheManager to be available which is responsible for managing the underlying caches to store the cached results, with the new Guava Cache support the CacheManager is along these lines: @Bean public CacheManager cacheManager() { return new GuavaCacheManager("books", "book"); } Google Guava Cache provides a rich API to be able to pre-load the cache, set eviction duration based on last access or created time, set the size of the cache etc, if the cache is to be customized then a guava CacheBuilder can be passed to the CacheManager for this customization: @Bean public CacheManager cacheManager() { GuavaCacheManager guavaCacheManager = new GuavaCacheManager(); guavaCacheManager.setCacheBuilder(CacheBuilder.newBuilder().expireAfterAccess(30, TimeUnit.MINUTES)); return guavaCacheManager; } This works well if all the caches have a similar configuration, what if the caches need to be configured differently - for eg. in the sample above, I may want the "book" cache to never expire but the "books" cache to have an expiration of 30 mins, then the GuavaCacheManager abstraction does not work well, instead a better solution is actually to use a SimpleCacheManager which provides a more direct way to get to the cache and can be configured this way: @Bean public CacheManager cacheManager() { SimpleCacheManager simpleCacheManager = new SimpleCacheManager(); GuavaCache cache1 = new GuavaCache("book", CacheBuilder.newBuilder().build()); GuavaCache cache2 = new GuavaCache("books", CacheBuilder.newBuilder() .expireAfterAccess(30, TimeUnit.MINUTES) .build()); simpleCacheManager.setCaches(Arrays.asList(cache1, cache2)); return simpleCacheManager; } This approach works very nicely, if required certain caches can be configured to be backed by a different caching engines itself, say a simple hashmap, some by Guava or EhCache some by distributed caches like Gemfire.
November 3, 2014
by Biju Kunjummen
· 60,076 Views · 8 Likes
article thumbnail
BigList: a Scalable High-Performance List for Java
As memory gets cheaper and cheaper, our applications can keep more data readily available in main memory, or even all as in case of in-memory databases. To make real use of the growing heap memory, appropriate data structures must be used. Interesting enough, there seem to be no specialized implementations for lists - by far the most used collection. This article introduces BigList, a list designed for handling large collections where large means that all data still fit completely in the heap memory. The article will show the special requirements for handling large collections, how BigList is implemented and how it compares to other list implementations. 1. Requirements What are the special requirements we need to handle large collections efficiently? Memory: Sparing use of memory: The list should need little memory for its own implementation so memory can be used for storing application data. Specialized versions for primitives: It must be possible to store common primitives like ints in a memory saving way. Avoid copying large data blocks: If the list grows or shrinks, only a small part of the data must be copied around, as this operation becomes expensive and needs the same amount of memory again. Data sharing: copying collections is a frequent operation which should be efficiently possible even if the collection is large. An efficient implementation requires some sort of data sharing as copying all elements is per se a costly operation. Performance: Good performance for normal operations like reading, storing, adding or removing single elements. Great performance for bulk operations like adding or removing multiple elements. Predictable overhead of operations, so similar operations should need a similar amount of time without excessive worst case scenarios. If an implementation does not offer these features, some operations will not only be slow for really large collections, but will becomse just not feasible because memory or CPU usage will be too exhaustive. Introduction to BigList BigList is a member of the Brownies Collections library which also includes GapList, the fastest list implementation known. GapList is a drop-in replacement for ArrayList, LinkedList, or ArrayDequeue and offers fast access by index and fast insertion/removal at the beginning and at the end at the same time. GapList however has not been designed to cope with large collections, so adding or removing elements can make it necessary to copy a lot of elements around which will lead to performance problems. Also copying a large collection becomes an expensive operation, both in term of time and memory consumption. It will simply not be possible to make a copy of a large collections if not the same amount of memory is available a second time. And this is a common operation as you often want to return a copy of an internal list through your API which has no reference the original list. BigList addresses both problems. The first problem is solved by storing the collection elements in fixed size blocks. Add or remove operations are then implemented to use only data from one block. The copying problem is solved by maintaining a reference count on the fixed size blocks which allows to implement a copy-on-write approach. For efficient access to the fixed size blocks, they are maintained in a specialized tree structure. 2. BigList Details Each BigList instance stores the following information: Elements are stored in in fixed-size blocks A single block is implemented as GapList with a reference count for sharing All blocks are maintained in a tree for fast access Access information for the current block is cached for better performance The following illustration shows these details for two instances of BigList which share one block. 2.1 Use of Blocks Elements are stored in in fixed-size blocks with a default block size of 1000. Where this default may look pretty small, it is most of the time a good choice because it guarantees that write operation only need to move few elements. Read operations will profit from locality of reference by using the currently cached block to be fast. It is however possible to specify the block size for each created BigList instance. All blocks except the first are allocated with this fixed size and will not grow or shrink. The first block will grow to the specified block size to save memory for small lists. If a block has reached its maximum size and more data must be stored within, the block needs to be split up in two blocks before more elements can be stored. If elements are added to the head or tail of the list, the block will only be filled up to a threshold of 95%. This allows inserts into the block without the immediate need for split operations. To save memory, blocks are also merged. This happens automatically if two adjacent blocks are both filled less than 35% after a remove operation. 2.2 Locality of Reference For each operation on BigList, the affected block must be determined first. As predicted by locality of reference, most of the time the affected block will be the same as for the last operation. The implementation of BigList has therefore been designed to profit from locality of reference which makes common operations like iterating over a list very efficient. Instead of always traversing the block tree to determine the block needed for an operation, lower and upper index of the last used block are cached. So if the next operation happens near to the previous one, the same block can be used again without need to traverse the tree. 2.3 Reference Counting To support a copy-on-write approach, BigList stores a reference count for each fixed size blocks indicating whether this block is private or shared. Initially all lists are private having a reference count of 0, so that modification are allowed. If a list is copied, the reference count is incremented which prohibits further modifications. Before a modification then can be made, the block must be copied decrementing the block's reference count and setting the reference count of the copy to 0. The reference count of a block is then decremented by the finalizer of BigList. 3. Benchmarks To prove the excellence of BigList in time and memory consumption, we compare it with some other List implementations. And here are the nominees: Type Library Description BigList brownie-collections List optimized for storing large number of elements. Elements stored in fixed size blocks which are maintained in a tree. GapList brownie-collections Fastest list implementation known. Fast access by index and fast insertion/removal at end and at beginning. ArrayList JDK Maintains elements in a single array. Fast access by index, fast insertion/removal at end, but slow at beginning. LinkedList JDK Elements stored using a linked list. Slow access by index. Memory overhead for each element stored. TreeList commons-collections Elements stored in a tree. All operations are not really fast, but there are no very slow operations. Memory overhead for each element stored. FastTable javolution Elements stored in a "fractal"-like data structure. Good performance and use of memory. However no bulk operations and collection does not shrink. 3.1 Handling Objects In the first part of the benchmark, we compare memory consumption and performance of the different list implementations. Let's first have a look at the memory consumption. The following table shows the bytes used to hold a list with 1'000'000 null elements: BigList GapList ArrayList LinkedList TreeList FastTable 32 bit 4'298'466 4'021'296 4'861'992 8'000'028 18'000'028 4'142'892 64 bit 8'544'254 8'042'552 9'723'964 16'000'044 26'000'044 8'222'988 We can see that BigList, GapList, ArrayList, and FastTable only add small overhead to the stored elements, where as Linkedlist needs twice the memory and TreeList even more. Now to the performance. Here are the results of 9 benchmarks which have been run for each of the 6 candidates with JDK 8 in a 32 bit Windows environment and a list of 1'000'000 elements: The result table can be read as follows: the fastest candidate for each test has a relative performance indicator of 1 the value for the other candidates indicate how many times they have been slower, so a factor of 3 means that this implementation was 3 times slower than the best one The different factor are colored like this: 
factor 1: green (best)
 factor <5: blue (good) 
factor <25: yellow (moderate) 
factor >25: red (poor) If we look the benchmark result, we can see that the performance of BigList is best for all expect two benchmarks. The only moderate result is produces in getting elements in a totally random order. This could be expected as there is no locality of reference which can be exploited, so for each access, the block tree must be traversed to find the correct block. Luckily this is a rare use case in real applications. And the benchmark "Get local" shows that performance is back to good as soon as elements next to each other must be retrieved - as it is the case if we iterate over a range. 3.2 Handling Primitives In the second part of the benchmark, we want see how big the savings are if we use a data structure specialized for storing primitives compared to strong wrapped objects. For this reason, we compare IntBigList and BigList. The following table shows memory needed to store 1'000'000 integer values: BigList IntBigList 32 bit 16'298'454 4'534'840 64 bit 28'544'234 4'570'432 Obviously it is easy to save a lot of memory. In a 32 bit environment, IntBigList just needs 25% percent of memory, in a 64 bit environment only 14%! These figures become plausible if you recall that a simple object needs 8 bytes in a 32 bit, but already 16 bytes in a 64 bit environment, where as a primitive integer value always only needs 4 bytes. The measurable performance gain is not so impressive, it is something below 10% for simple get operations and something above 10% for add and remove operations. These numbers show that the JVM is impressively fast in creating wrapper objects and boxing and unboxing primitive values. We must however also consider that each created object will need to be garbage collected once and therefore adds to the total load of the JVM. 4. Summary BigList is a scalable high-performance list for storing large collections. Its design guarantees that all operations will be predictable and efficient both in term of performance and memory consumption, even copying large collections is tremendous fast. Benchmarks haven proven this and shown that BigList outperform other known list implementations. The library also offers specialized implementations for primitive types like IntBigList which save much memory and provide superior performance. BigList for handling objects and the specializations for handling primitives are part of the Brownies Collections library and can be downloaded from http://www.magicwerk.org/collections.
November 3, 2014
by Thomas Mauch
· 32,940 Views · 10 Likes
article thumbnail
Building a REST API with JAXB, Spring Boot and Spring Data
if someone asked you to develop a rest api on the jvm, which frameworks would you use? i was recently tasked with such a project. my client asked me to implement a rest api to ingest requests from a 3rd party. the project entailed consuming xml requests, storing the data in a database, then exposing the data to internal application with a json endpoint. finally, it would allow taking in a json request and turning it into an xml request back to the 3rd party. with the recent release of apache camel 2.14 and my success using it , i started by copying my apache camel / cxf / spring boot project and trimming it down to the bare essentials. i whipped together a simple hello world service using camel and spring mvc. i also integrated swagger into both. both implementations were pretty easy to create ( sample code ), but i decided to use spring mvc. my reasons were simple: its rest support was more mature, i knew it well, and spring mvc test makes it easy to test apis. camel's swagger support without web.xml as part of the aforementioned spike, i learned out how to configure camel's rest and swagger support using spring's javaconfig and no web.xml. i made this into a sample project and put it on github as camel-rest-swagger . this article shows how i built a rest api with java 8, spring boot/mvc, jaxb and spring data (jpa and rest components). i stumbled a few times while developing this project, but figured out how to get over all the hurdles. i hope this helps the team that's now maintaining this project (my last day was friday) and those that are trying to do something similar. xml to java with jaxb the data we needed to ingest from a 3rd party was based on the ncpdp standards. as a member, we were able to download a number of xsd files, put them in our project and generate java classes to handle the incoming/outgoing requests. i used the maven-jaxb2-plugin to generate the java classes. org.jvnet.jaxb2.maven2 maven-jaxb2-plugin 0.8.3 generate -xtostring -xequals -xhashcode -xcopyable org.jvnet.jaxb2_commons jaxb2-basics 0.6.4 src/main/resources/schemas/ncpdp the first error i ran into was about a property already being defined. [info] --- maven-jaxb2-plugin:0.8.3:generate (default) @ spring-app --- [error] error while parsing schema(s).location [ file:/users/mraible/dev/spring-app/src/main/resources/schemas/ncpdp/structures.xsd{1811,48}]. com.sun.istack.saxparseexception2; systemid: file:/users/mraible/dev/spring-app/src/main/resources/schemas/ncpdp/structures.xsd; linenumber: 1811; columnnumber: 48; property "multipletimingmodifierandtimingandduration" is already defined. use to resolve this conflict. at com.sun.tools.xjc.errorreceiver.error(errorreceiver.java:86) i was able to workaround this by upgrading to maven-jaxb2-plugin version 0.9.1. i created a controller and stubbed out a response with hard-coded data. i confirmed the incoming xml-to-java marshalling worked by testing with a sample request provided by our 3rd party customer. i started with a curl command, because it was easy to use and could be run by anyone with the file and curl installed. curl -x post -h 'accept: application/xml' -h 'content-type: application/xml' \ --data-binary @sample-request.xml http://localhost:8080/api/message -v this is when i ran into another stumbling block: the response wasn't getting marshalled back to xml correctly. after some research, i found out this was caused by the lack of @xmlrootelement annotations on my generated classes. i posted a question to stack overflow titled returning jaxb-generated elements from spring boot controller . after banging my head against the wall for a couple days, i figured out the solution . i created a bindings.xjb file in the same directory as my schemas. this causes jaxb to generate @xmlrootelement on classes. to add namespaces prefixes to the returned xml, i had to modify the maven-jaxb2-plugin to add a couple arguments. -extension -xnamespace-prefix and add a dependency: org.jvnet.jaxb2_commons jaxb2-namespace-prefix 1.1 then i modified bindings.xjb to include the package and prefix settings. i also moved into a global setting. i eventually had to add prefixes for all schemas and their packages. i learned how to add prefixes from the namespace-prefix plugins page . finally, i customized the code-generation process to generate joda-time's datetime instead of the default xmlgregoriancalendar . this involved a couple custom xmladapters and a couple additional lines in bindings.xjb . you can see the adapters and bindings.xjb with all necessary prefixes in this gist . nicolas fränkel's customize your jaxb bindings was a great resource for making all this work. i wrote a test to prove that the ingest api worked as desired. @runwith(springjunit4classrunner.class) @springapplicationconfiguration(classes = application.class) @webappconfiguration @dirtiescontext(classmode = dirtiescontext.classmode.after_class) public class initiaterequestcontrollertest { @inject private initiaterequestcontroller controller; private mockmvc mockmvc; @before public void setup() { mockitoannotations.initmocks(this); this.mockmvc = mockmvcbuilders.standalonesetup(controller).build(); } @test public void testgetnotallowedonmessagesapi() throws exception { mockmvc.perform(get("/api/initiate") .accept(mediatype.application_xml)) .andexpect(status().ismethodnotallowed()); } @test public void testpostpainitiationrequest() throws exception { string request = new scanner(new classpathresource("sample-request.xml").getfile()).usedelimiter("\\z").next(); mockmvc.perform(post("/api/initiate") .accept(mediatype.application_xml) .contenttype(mediatype.application_xml) .content(request)) .andexpect(status().isok()) .andexpect(content().contenttype(mediatype.application_xml)) .andexpect(xpath("/message/header/to").string("3rdparty")) .andexpect(xpath("/message/header/sendersoftware/sendersoftwaredeveloper").string("hid")) .andexpect(xpath("/message/body/status/code").string("010")); } } spring data for jpa and rest with jaxb out of the way, i turned to creating an internal api that could be used by another application. spring data was fresh in my mind after reading about it last summer. i created classes for entities i wanted to persist, using lombok's @data to reduce boilerplate. i read the accessing data with jpa guide, created a couple repositories and wrote some tests to prove they worked. i ran into an issue trying to persist joda's datetime and found jadira provided a solution. i added its usertype.core as a dependency to my pom.xml: org.jadira.usertype usertype.core 3.2.0.ga ... and annotated datetime variables accordingly. @column(name = "last_modified", nullable = false) @type(type="org.jadira.usertype.dateandtime.joda.persistentdatetime") private datetime lastmodified; with jpa working, i turned to exposing rest endpoints. i used accessing jpa data with rest as a guide and was looking at json in my browser in a matter of minutes. i was surprised to see a "profile" service listed next to mine, and posted a question to the spring boot team. oliver gierke provided an excellent answer . swagger spring mvc's integration for swagger has greatly improved since i last wrote about it . now you can enable it with a @enableswagger annotation. below is the swaggerconfig class i used to configure swagger and read properties from application.yml . @configuration @enableswagger public class swaggerconfig implements environmentaware { public static final string default_include_pattern = "/api/.*"; private relaxedpropertyresolver propertyresolver; @override public void setenvironment(environment environment) { this.propertyresolver = new relaxedpropertyresolver(environment, "swagger."); } /** * swagger spring mvc configuration */ @bean public swaggerspringmvcplugin swaggerspringmvcplugin(springswaggerconfig springswaggerconfig) { return new swaggerspringmvcplugin(springswaggerconfig) .apiinfo(apiinfo()) .genericmodelsubstitutes(responseentity.class) .includepatterns(default_include_pattern); } /** * api info as it appears on the swagger-ui page */ private apiinfo apiinfo() { return new apiinfo( propertyresolver.getproperty("title"), propertyresolver.getproperty("description"), propertyresolver.getproperty("termsofserviceurl"), propertyresolver.getproperty("contact"), propertyresolver.getproperty("license"), propertyresolver.getproperty("licenseurl")); } } after getting swagger to work, i discovered that endpoints published with @repositoryrestresource aren't picked up by swagger. there is an open issue for spring data support in the swagger-springmvc project. liquibase integration i configured this project to use h2 in development and postgresql in production. i used spring profiles to do this and copied xml/yaml (for maven and application*.yml files) from a previously created jhipster project. next, i needed to create a database. i decided to use liquibase to create tables, rather than hibernate's schema-export. i chose liquibase over flyway based of discussions in the jhipster project . to use liquibase with spring boot is dead simple: add the following dependency to pom.xml, then place changelog files in src/main/resources/db/changelog . org.liquibase liquibase-core i started by using hibernate's schema-export and changing hibernate.ddl-auto to "create-drop" in application-dev.yml . i also commented out the liquibase-core dependency. then i setup a postgresql database and started the app with "mvn spring-boot:run -pprod". i generated the liquibase changelog from an existing schema using the following command (after downloading and installing liquibase). liquibase --driver=org.postgresql.driver --classpath="/users/mraible/.m2/repository/org/postgresql/postgresql/9.3-1102-jdbc41/postgresql-9.3-1102-jdbc41.jar:/users/mraible/snakeyaml-1.11.jar" --changelogfile=/users/mraible/dev/spring-app/src/main/resources/db/changelog/db.changelog-02.yaml --url="jdbc:postgresql://localhost:5432/mydb" --username=user --password=pass generatechangelog i did find one bug - the generatechangelog command generates too many constraints in version 3.2.2 . i was able to fix this by manually editing the generated yaml file. tip: if you want to drop all tables in your database to verify liquibase creation is working in postgesql, run the following commands: psql -d mydb drop schema public cascade; create schema public; after writing minimal code for spring data and configuring liquibase to create tables/relationships, i relaxed a bit, documented how everything worked and added a loggingfilter . the loggingfilter was handy for viewing api requests and responses. @bean public filterregistrationbean loggingfilter() { loggingfilter filter = new loggingfilter(); filterregistrationbean registrationbean = new filterregistrationbean(); registrationbean.setfilter(filter); registrationbean.seturlpatterns(arrays.aslist("/api/*")); return registrationbean; } accessing api with resttemplate the final step i needed to do was figure out how to access my new and fancy api with resttemplate . at first, i thought it would be easy. then i realized that spring data produces a hal -compliant api, so its content is embedded inside an "_embedded" json key. after much trial and error, i discovered i needed to create a resttemplate with hal and joda-time awareness. @bean public resttemplate resttemplate() { objectmapper mapper = new objectmapper(); mapper.configure(deserializationfeature.fail_on_unknown_properties, false); mapper.registermodule(new jackson2halmodule()); mapper.registermodule(new jodamodule()); mappingjackson2httpmessageconverter converter = new mappingjackson2httpmessageconverter(); converter.setsupportedmediatypes(mediatype.parsemediatypes("application/hal+json")); converter.setobjectmapper(mapper); stringhttpmessageconverter stringconverter = new stringhttpmessageconverter(); stringconverter.setsupportedmediatypes(mediatype.parsemediatypes("application/xml")); list> converters = new arraylist<>(); converters.add(converter); converters.add(stringconverter); return new resttemplate(converters); } the jodamodule was provided by the following dependency: com.fasterxml.jackson.datatype jackson-datatype-joda with the configuration complete, i was able to write a messagesapiitest integration test that posts a request and retrieves it using the api. the api was secured using basic authentication, so it took me a bit to figure out how to make that work with resttemplate. willie wheeler's basic authentication with spring resttemplate was a big help. @runwith(springjunit4classrunner.class) @contextconfiguration(classes = integrationtestconfig.class) public class messagesapiitest { private final static log log = logfactory.getlog(messagesapiitest.class); @value("http://${app.host}/api/initiate") private string initiateapi; @value("http://${app.host}/api/messages") private string messagesapi; @value("${app.host}") private string host; @inject private resttemplate resttemplate; @before public void setup() throws exception { string request = new scanner(new classpathresource("sample-request.xml").getfile()).usedelimiter("\\z").next(); responseentity response = resttemplate.exchange(gettesturl(initiateapi), httpmethod.post, getbasicauthheaders(request), org.ncpdp.schema.transport.message.class, collections.emptymap()); assertequals(httpstatus.ok, response.getstatuscode()); } @test public void testgetmessages() { httpentity request = getbasicauthheaders(null); responseentity> result = resttemplate.exchange(gettesturl(messagesapi), httpmethod.get, request, new parameterizedtypereference>() {}); httpstatus status = result.getstatuscode(); collection messages = result.getbody().getcontent(); log.debug("messages found: " + messages.size()); assertequals(httpstatus.ok, status); for (message message : messages) { log.debug("message.id: " + message.getid()); log.debug("message.datecreated: " + message.getdatecreated()); } } private httpentity getbasicauthheaders(string body) { string plaincreds = "user:pass"; byte[] plaincredsbytes = plaincreds.getbytes(); byte[] base64credsbytes = base64.encodebase64(plaincredsbytes); string base64creds = new string(base64credsbytes); httpheaders headers = new httpheaders(); headers.add("authorization", "basic " + base64creds); headers.add("content-type", "application/xml"); if (body == null) { return new httpentity<>(headers); } else { return new httpentity<>(body, headers); } } } to get spring data to populate the message id, i created a custom restconfig class to expose it. i learned how to do this from tommy ziegler . /** * used to expose ids for resources. */ @configuration public class restconfig extends repositoryrestmvcconfiguration { @override protected void configurerepositoryrestconfiguration(repositoryrestconfiguration config) { config.exposeidsfor(message.class); config.setbaseuri("/api"); } } summary this article explains how i built a rest api using jaxb, spring boot, spring data and liquibase. it was relatively easy to build, but required some tricks to access it with spring's resttemplate. figuring out how to customize jaxb's code generation was also essential to make things work. i started developing the project with spring boot 1.1.7, but upgraded to 1.2.0.m2 after i found it supported log4j2 and configuring spring data rest's base uri in application.yml. when i handed the project off to my client last week, it was using 1.2.0.build-snapshot because of a bug when running in tomcat . this was an enjoyable project to work on. i especially liked how easy spring data makes it to expose jpa entities in an api. spring boot made things easy to configure once again and liquibase seems like a nice tool for database migrations. if someone asked me to develop a rest api on the jvm, which frameworks would i use? spring boot, spring data, jackson, joda-time, lombok and liquibase. these frameworks worked really well for me on this particular project.
October 30, 2014
by Matt Raible
· 64,279 Views
article thumbnail
Sharding Pitfalls Part III: Chunk Balancing and Collection Limits
In Parts 1 and 2 we have covered a number of common issues people run into when managing a sharded MongoDB cluster. In this final post of the series we will cover a subtle, but important distinction in terms of balancing a sharded cluster as well as an interesting limitation that can be worked around relatively easily, but is nonetheless surprising when it comes up. 6. Chunk balancing != data balancing != traffic balancing The balancer in a sharded cluster cares about just one thing: Are chunks for a given collection evenly balanced across all shards? If they are not, then it will take steps to rectify that imbalance. This all sounds perfectly logical, and even with extra complexity like tagging involved the logic is pretty straight forward. If we assume that all chunks are equal, then we can rest assured that our data is being evenly balanced across all the shards in our cluster and rest easy at night. Although that is sometimes, perhaps even frequently, the case it is not always true - chunks are not always equal. There can be massive “jumbo” chunks that exceed the maximum chunk size (64MiB), completely empty chunks and everything in between. Let’s use an example from our first pitfall, the monotonically increasing shard key. For our example, we have picked just such a key to shard on (date), and up until this point we have had just one shard and had not sharded the collection. We are about to add a second shard to our cluster and so we enable sharding on the collection and do the necessary admin work to add the new shard into the cluster. Once the collection is enabled for sharding, the first shard contains all the newly minted chunks. Let’s represent them in a simplified table of 10 chunks. This is not representative of a real data set, but it will do for illustrative purposes: Table 1 - Initial Chunk Layout Now we add our second shard. The balancer will kick in and attempt to distribute the chunks evenly. It will do this by moving the lowest range chunks to the new shard until the counts are identical. Once it is finished balancing, our table now looks like this: Table 2 - Balanced Chunk Layout That looks pretty good at the moment, but lets imagine that more recent chunks are more likely to have more activity (updates say) than older chunks. Adding the traffic share estimates for each chunk shows that shard1 is taking far more traffic (72%) than shard2 (28%) despite the chunks seeming balanced overall based on the approximate size. Hence, chunk balancing is not equal to traffic balancing. Using that same example, let’s add another wrinkle - periodic deletion of old data. Every 3 months we run a job to delete any data older than 12 months. Let’s look at the impact of that on our table after we run it for the first time (assuming the first run happens on July 1st 2015). Table 3 - Post-Delete Chunk Layout The distribution of data is now completely skewed toward shard1 - shard2 is in fact empty! However, the balancer is completely unaware of this imbalance - the chunk count has remained the same the entire time, and as far as it is concerned the system is in a steady state. With no data on shard2, our traffic imbalance as seen above will be even worse, and we have essentially negated the benefit of having a second shard for this collection. Possible Mitigation Strategies If data and traffic balance are important, select an appropriate shard key Move chunks manually to address the imbalances - swap “hot” chunks for “cool” chunks, empty chunks for larger chunks 7. Waiting too long to shard a collection (collection too large) This is not very common, but when it falls on your shoulders, it can be quite challenging to solve. There is a maximum data size for a collection when when it is initially split which is a function of the chunk size and data size as noted on the limits page. If your collection contains less than 256GiB of data, then there will be no issue. If the collection size exceeds 256GiB but is less than 400GiB, then MongoDB may be able to do an initial split without any special measures being taken. Otherwise, with larger initial data sizes and the default settings, the initial split will fail. It is worth noting that once split the collection may grow as needed and without any real limitations as long as you can continue to add shards as data size grows. Possible Mitigation Strategies Since the limit is dictated by the chunk size and the data size, and assuming there is not much to be done about the data size, then the remaining variable is the chunk size. This is adjustable (default is 64MiB) and can be raised in order to let a large collection split initially and then reduced once that has been completed. The required chunk size increase will depend on the actual data size. However, this is relatively easy to work out - simply divide your data size by 256GB and then multiply that figure by 64MiB (and round up if it is not a nice even number). As an example, let’s consider a 4TiB collection: 4TiB divided by 256GiB = 16 64MiB x 16 = 1024MiB Hence, set the max chunk size to 1024MiB, then perform the initial sharding of the collection, and then finally reduce the chunk size back to 64MiB using the same procedure. . Thanks for reading through the Sharding Pitfall series! If you want to learn more about managing MongoDB deployments at scale, sign up for my online education course, MongoDB Advanced Deployment and Operations. Planning for scale? No problem: MongoDB is here to help. Get a preview of what it’s like to work with MongoDB’s Technical Services Team. Give us some details on your deployment and we can set you up with an expert who can provide detailed guidance on all aspects of scaling with MongoDB, based on our experience with hundreds of deployments.
October 27, 2014
by Francesca Krihely
· 4,276 Views
article thumbnail
AngularJS: Different Ways of Using Array Filters
In this article, we learn how to utilize AngularJS's array filter features in a variety of use cases. Read on to get started.
October 24, 2014
by Siva Prasad Reddy Katamreddy
· 315,775 Views · 5 Likes
article thumbnail
Understanding Information Retrieval by Using Apache Lucene and Tika - Part 1
introduction in this tutorial, the apache lucene and apache tika frameworks will be explained through their core concepts (e.g. parsing, mime detection, content analysis, indexing, scoring, boosting) via illustrative examples that should be applicable to not only seasoned software developers but to beginners to content analysis and programming as well. we assume you have a working knowledge of the java™ programming language and plenty of content to analyze. throughout this tutorial, you will learn: how to use apache tika's api and its most relevant functions how to develop code with apache lucene api and its most important modules how to integrate apache lucene and apache tika in order to build your own piece of software that stores and retrieves information efficiently. (project code is available for download) what are lucene and tika? according to apache lucene's site, apache lucene represents an open source java library for indexing and searching from within large collections of documents. the index size represents roughly 20-30% the size of text indexed and the search algorithms provide features like: ranked searching - best results returned first many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more. in this tutorial we will demonstrate only phrase queries. fielded search (e.g. title, author, contents) sorting by any field flexible faceting, highlighting, joins and result grouping pluggable ranking models, including the vector space model and okapi bm25 but lucene's main purpose is to deal directly with text and we want to manipulate documents, who have various formats and encoding. for parsing document content and their properties the apache tika library it is necessary. apache tika is a library that provides a flexible and robust set of interfaces that can be used in any context where metadata analyzis and structured text extraction is needed. the key component of apache tika is the parser (org.apache.tika.parser.parser ) interface because it hides the complexity of different file formats while providing a simple and powerful mechanism to extract structured text content and metadata from all sorts of documents. criterias for tika parsing design streamed parsing the interface should require neither the client application nor the parser implementation to keep the full document content in memory or spooled to disk. this allows even huge documents to be parsed without excessive resource requirements. structured content a parser implementation should be able to include structural information (headings, links, etc.) in the extracted content. a client application can use this information for example to better judge the relevance of different parts of the parsed document. input metadata a client application should be able to include metadata like the file name or declared content type with the document to be parsed. the parser implementation can use this information to better guide the parsing process. output metadata a parser implementation should be able to return document metadata in addition to document content. many document formats contain metadata like the name of the author that may be useful to client applications. context sensitivity while the default settings and behaviour of tika parsers should work well for most use cases, there are still situations where more fine-grained control over the parsing process is desirable. it should be easy to inject such context-specific information to the parsing process without breaking the layers of abstraction. requirements maven 2.0 or higher java 1.6 se or higher lesson 1: automate metadata extraction from any file type our premisses are the following: we have a collection of documents stored on disk/database and we would like to index them; these documents can be word documents, pdfs, htmls, plain text files etc. as we are developers, we would like to write reusable code that extracts file properties regarding format (metadata) and file content. apache tika has a mimetype repository and a set of schemes (any combination of mime magic, url patterns, xml root characters, or file extensions) to determine if a particular file, url, or piece of content matches one of its known types. if the content does match, tika has detected its mimetype and can proceed to select the appropriate parser. in the sample code, the file type detection and its parsing is being covered inside the class com.retriever.lucene.index.indexcreator , method indexfile. listing 1.1 analyzing a file with tika public static documentwithabstract indexfile(analyzer analyzer, file file) throws ioexception { metadata metadata = new metadata(); contenthandler handler = new bodycontenthandler(10 * 1024 * 1024); parsecontext context = new parsecontext(); parser parser = new autodetectparser(); inputstream stream = new fileinputstream(file); //open stream try { parser.parse(stream, handler, metadata, context); //parse the stream } catch (tikaexception e) { e.printstacktrace(); } catch (saxexception e) { e.printstacktrace(); } finally { stream.close(); //close the stream } //more code here } the above code displays how a file it is being parsed using org.apache.tika.parser.autodetectparser; this kind of implementation was chosen because we would like to achieve parsing documents disregarding their format. also, for handling the content the org.apache.tika.sax.bodycontenthandler wasconstructed with a writelimit given as parameter ( 10*1024*1024); this type of constructor creates a content handler that writes xhtml body character events to an internal string buffer and in case of documents with large content is less likely to throw a saxexception (thrown when the default write limit is reached). as a result of our parsing we have obtained a metadata object that we can now use to detect file properties (title or any other header specific to a document format). metadata processing can be done as described below ( com.retriever.lucene.index.indexcreator , method indexfiledescriptors) : listing 1.2 processing metadata private static document indexfiledescriptors(string filename, metadata metadata) { document doc = new document(); //store file name in a separate textfield doc.add(new textfield(isearchconstants.field_file, filename, store.yes)); for (string key : metadata.names()) { string name = key.tolowercase(); string value = metadata.get(key); if (stringutils.isblank(value)) { continue; } if ("keywords".equalsignorecase(key)) { for (string keyword : value.split(",?(\\s+)")) { doc.add(new textfield(name, keyword, store.yes)); } } else if (isearchconstants.field_title.equalsignorecase(key)) { doc.add(new textfield(name, value, store.yes)); } else { doc.add(new textfield(name, filename, store.no)); } } in the method presented above we store the file name in a separate field and also the document's title ( a document can have a title different from its file name); we are not interested in storing other informations.
October 22, 2014
by Ana-Maria Mihalceanu
· 18,745 Views · 4 Likes
article thumbnail
Measuring String Concatenation in Logging
introduction i had an interesting conversation today about the cost of using string concatenation in log statements. in particular, debug log statements often need to print out parameters or the current value of variables, so they need to build the log message dynamically. so you wind up with something like this: logger.debug("parameters: " + param1 + "; " + param2); the issue arises when debug logging is turned off. inside the logger.debug() statement a flag is checked and the method returns immediately; this is generally pretty fast. but the string concatenation had to occur to build the parameter prior to calling the method, so you still pay its cost. since debug tends to be turned off in production, this is the time when this difference matters. for this reason, we have pretty much all been trained to do this: if (logger.isdebugenabled()) { logger.debug("parameters: " + param1 + "; " + param2); } the discussion was about how much difference this “good practice” makes. caliper this kind of question is perfect for a micro benchmark. my own favorite tool for this purpose is caliper . caliper runs small snippets of code enough times to average out variations. it passes in a number of repetitions, which it calculates in order to make sure that the whole method takes long enough to measure given the resolution of the system clock. caliper also detects garbage collection and hotspot compiling that might impact the accuracy of the tests. caliper uploads results to a google app engine application. its sign-in supports google logins and issues an api key that can be used to organize results and list them. a typical timing methods looks like this: public string timemultstringnocheck(long reps) { for (int i = 0; i < reps; i++) { logger.debug(strings[0] + " " + strings[1] + " " + strings[2] + " " + strings[3] + " " + strings[4]); } return strings[0]; } the return string is not used; it is included in the method solely to ensure that java does not optimize away the method. similarly, the content of the variables used should be randomly generated to avoid compile-time optimization. the full example is available in one of my github repositories , in the org.anvard.introtojava.log package. results the outcome is pretty interesting. string concatenation creates a pretty significant penalty, around two orders of magnitude for our example that concatenates five strings. interesting is that even in the case where we do not use string concatenation (i.e. the simplestring methods), the penalty is around 4x. this is probably the time spent pushing the string parameter onto the stack. the examples with doubles, using string.format() , is even more extreme, four orders of magnitude. the elapsed time here about 4ms, large enough that if the log statement were in a commonly used method, the performance hit would be noticeable. the final method, multstringparams , uses a feature that is available in the slf4j api. it works similarly to string.format() , but in a simple token replace fashion. most importantly, it does not perform the token replace unless the logging level is enabled. this makes this form just as fast as the “check” forms, but in a more compact form. of course, this only works if no special formatting is needed of the log string, or if the formatting can be shifted to a method such as tostring() . what is especially surprising is that this method did not show a penalty in building the object array necessary to pass the parameters into the method. this may have been optimized out by the java runtime since there was no chance of the parameters being used. conclusion the practice of checking whether a logging level is enabled before building the log statement is certainly worthwhile and should be something teams check during peer review.
October 20, 2014
by Alan Hohn
· 10,511 Views
article thumbnail
Functional Programming with Java 8 Functions
Learn how to use lambda expressions and anonymous functions in Java 8.
October 20, 2014
by Edwin Dalorzo
· 330,112 Views · 25 Likes
article thumbnail
Unlocking and Erasing FLASH with Segger J-Link
when using a bootloader (see “ serial bootloader for the freedom board with processor expert “), then i usually protect the bootloader flash areas, so it does not get accidentally erased by the application ;-). when programming my boards with the p&e multilink, then the p&e firmware will automatically unlock and erase the chip. that’s not the same if working with the segger j-link, as it requires extra steps. protected flash pages with processor expert failed programming with protected flash if i try to re-program the protected bootloader with segger j-link (both in codewarrior and eclipse/kds with gdb), then the download silently fails. the effect is that somehow the application on the board does not match what it should be. looking at the console view, it shows that erase has failed (but no real error reported) :-(: jlink: failed to erase sectors 0 @ address 0x00000000 (algo135: flash protection violation. flash is write-protected.) j-link failed to erase in codewarrior the gnu arm eclipse segger integration with gdb (e.g. kinetis design studio) is not better: no error sign, the only thing is a hidden error in the console log of the jlinkgdbservercl: error: failed to erase sectors 0 @ address 0x00000000 (algo135: flash protection violation. flash is write-protected.) error algo135 flash protection violation about failed flash programming what i need is to unprotect the memory and then erase it. erasing the segger j-link features a very fast programming. part of that speed is that the segger firmware checks each flash page if it really needs to be programmed, and only then it erases and reprogrammed that page. so downloading twice the same application actually will not touch the flash memory at all. additionally, it does not do a complete erase of the device: it only programs the pages i’m using in my application. the advantage of that is first speed. and it does not erase the application data i’m using in non-volatile memory (see “ configuration data: using the internal flash instead of an external eeprom “). however, sometimes i really need to clear all my data in flash too, and then i need to erase all my flash pages on the device. segger has product named ‘j-flash’ which is used to flash and erase devices outside of an ide. there is a free-of-charge ‘lite’ version available for download from segger. this utility is not intended to be used for production. with this utility i have a gui to erase and program my device. j-flash lite but j-flash lite cannot unlock my locked flash pages :-(. if my device is not locked, i can use the codewarrior ‘flash file to target’ (see “ flashing with a button (and a magic wand) “) to erase the device: erasing device with flash file to target again, this does not work if the device is locked. codewarrior has another feature called ‘target task’ which can be used to erase/unsecure (if your device is supported), see “ device is secure? “. so i need to use a different tool to unlock and unprotect my device: the j-link commander . unlocking and erasing with j-link commander to unlock the device, segger has a utility named ‘j-link commander’, available from http://www.segger.com/jlink-software.html . the binary is ‘jlink.exe’ on windows and is a command line utility. to unlock the device use unlock kinetis unlocking device but it seems that i need to do an unlock, followed by an erase to make it permanent. to erase the device, i can use the same command line utility. but i need to specify the device name first, and then i can erase it (example for the kl25z): device mkl25z128xxx4 unlock kinetis erase :!: i need to do the erase operation right after the unlock. a) set device b) unlock and c) erase, otherwise it will fail? unlocking and erasing with j-link commander summary in order to re-program the protected flash sectors with segger j-link, i need first to unlock and mass erase the device. for this, there is the j-link commander utility which has a command line interface to unprotect and erase the device. for erasing only, the j-flash (and lite) is a very useful tool, especially to get a ‘clean’ device memory. to me, the segger way and tools are very powerful. in this case, things are very flexible, but not that obvious. so i hope this post can help others to get his device unlocked and erased. happy erasing :-)
October 17, 2014
by Erich Styger
· 20,333 Views
article thumbnail
MySQL Replication: 'Got fatal error 1236' Causes and Cures
Originally Written by Muhammad Irfan MySQL replication is a core process for maintaining multiple copies of data – and replication is a very important aspect in database administration. In order to synchronize data between master and slaves you need to make sure that data transfers smoothly, and to do so you need to act promptly regarding replication errors to continue data synchronization. Here on the Percona Support team, we often help customers with replication broken-related issues. In this post I’ll highlight the top most critical replication error code 1236 along with the causes and cure. MySQL replication error “Got fatal error 1236” can be triggered by multiple reasons and I will try to cover all of them. Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: ‘log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master; the first event ‘binlog.000201′ at 5480571 This is a typical error on the slave(s) server. It reflects the problem around max_allowed_packet size. max_allowed_packet refers to single SQL statement sent to the MySQL server as binary log event from master to slave. This error usually occurs when you have a different size of max_allowed_packet on the master and slave (i.e. master max_allowed_packet size is greater then slave server). When the MySQL master server tries to send a bigger packet than defined on the slave server, the slave server then fails to accept it and hence the error. In order to alleviate this issue please make sure to have the same value for max_allowed_packet on both slave and master. You can read more about max_allowed_packet here. This error usually occurs when updating a huge number of rows on the master and it doesn’t fit into the value of slave max_allowed_packet size because slave max_allowed_packet size is lower then the master. This usually happens with queries “LOAD DATA INFILE” or “INSERT .. SELECT” queries. As per my experience, this can also be caused by application logic that can generate a huge INSERT with junk data. Take into account, that one new variable introduced in MySQL 5.6.6 and later slave_max_allowed_packet_size which controls the maximum packet size for the replication threads. It overrides the max_allowed_packet variable on slave and it’s default value is 1 GB. In this post, “max_allowed_packet and binary log corruption in MySQL,”my colleague Miguel Angel Nieto explains this error in detail. Got fatal error 1236 from master when reading data from binary log: ‘Could not find first log file name in binary log index file’ This error occurs when the slave server required binary log for replication no longer exists on the master database server. In one of the scenarios for this, your slave server is stopped for some reason for a few hours/days and when you resume replication on the slave it fails with above error. When you investigate you will find that the master server is no longer requesting binary logs which the slave server needs to pull in order to synchronize data. Possible reasons for this include the master server expired binary logs via system variable expire_logs_days – or someone manually deleted binary logs from master via PURGE BINARY LOGS command or via ‘rm -f’ command or may be you have some cronjob which archives older binary logs to claim disk space, etc. So, make sure you always have the required binary logs exists on the master server and you can update your procedures to keep binary logs that the slave server requires by monitoring the “Relay_master_log_file” variable from SHOW SLAVE STATUS output. Moreover, if you have set expire_log_days in my.cnf old binlogs expire automatically and are removed. This means when MySQL opens a new binlog file, it checks the older binlogs, and purges any that are older than the value of expire_logs_days (in days). Percona Server added a feature to expire logs based on total number of files used instead of the age of the binlog files. So in that configuration, if you get a spike of traffic, it could cause binlogs to disappear sooner than you expect. For more information check Restricting the number of binlog files. In order to resolve this problem, the only clean solution I can think of is to re-create the slave server from a master server backup or from other slave in replication topology. – Got fatal error 1236 from master when reading data from binary log: ‘binlog truncated in the middle of event; consider out of disk space on master; the first event ‘mysql-bin.000525′ at 175770780, the last event read from ‘/data/mysql/repl/mysql-bin.000525′ at 175770780, the last byte read from ‘/data/mysql/repl/mysql-bin.000525′ at 175771648.’ Usually, this caused by sync_binlog <>1 on the master server which means binary log events may not be synchronized on the disk. There might be a committed SQL statement or row change (depending on your replication format) on the master that did not make it to the slave because the event is truncated. The solution would be to move the slave thread to the next available binary log and initialize slave thread with the first available position on binary log as below: mysql>CHANGE MASTERTOMASTER_LOG_FILE='mysql-bin.000526',MASTER_LOG_POS=4; – [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: ‘Client requested master to start replication from impossible position; the first event ‘mysql-bin.010711′ at 55212580, the last event read from ‘/var/lib/mysql/log/mysql-bin.000711′ at 4, the last byte read from ‘/var/lib/mysql/log/mysql-bin.010711′ at 4.’, Error_code: 1236 I foresee master server crashed or rebooted and hence binary log events not synchronized on disk. This usually happens when sync_binlog != 1 on the master. You can investigate it as inspecting binary log contents as below: $mysqlbinlog--base64-output=decode-rows--verbose--verbose--start-position=55212580mysql-bin.010711 You will find this is the last position of binary log and end of binary log file. This issue can usually be fixed by moving the slave to the next binary log. In this case it would be: mysql>CHANGE MASTER TOMASTER_LOG_FILE='mysql-bin.000712',MASTER_LOG_POS=4; This will resume replication. To avoid corrupted binlogs on the master, enabling sync_binlog=1 on master helps in most cases. sync_binlog=1 will synchronize the binary log to disk after every commit. sync_binlog makes MySQL perform on fsync on the binary log in addition to the fsync by InnoDB. As a reminder, it has some cost impact as it will synchronize the write-to-binary log on disk after every commit. On the other hand, sync_binlog=1 overhead can be very minimal or negligible if the disk subsystem is SSD along with battery-backed cache (BBU). You can read more about this here in the manual. sync_binlog is a dynamic option that you can enable on the fly. Here’s how: mysql-master>SET GLOBAL sync_binlog=1; To make the change persistent across reboot, you can add this parameter in my.cnf. As a side note, along with replication fixes, it is always a better option to make sure your replica is in the master and to validate data between master/slaves. Fortunately, Percona Toolkit has tools for this purpose: pt-table-checksum & pt-table-sync. Before checking for replication consistency, be sure to check the replication environment and then, later, to sync any differences.
October 15, 2014
by Peter Zaitsev
· 40,819 Views
article thumbnail
R: Filtering data frames by column type ('x' must be numeric)
I’ve been working through the exercises from An Introduction to Statistical Learning and one of them required you to create a pair wise correlation matrix of variables in a data frame. The exercise uses the ‘Carseats’ data set which can be imported like so: > install.packages("ISLR") > library(ISLR) > head(Carseats) Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US 1 9.50 138 73 11 276 120 Bad 42 17 Yes Yes 2 11.22 111 48 16 260 83 Good 65 10 Yes Yes 3 10.06 113 35 10 269 80 Medium 59 12 Yes Yes 4 7.40 117 100 4 466 97 Medium 55 14 Yes Yes 5 4.15 141 64 3 340 128 Bad 38 13 Yes No 6 10.81 124 113 13 501 72 Bad 78 16 No Yes filter the categorical variables from a data frame and If we try to run the ‘cor‘ function on the data frame we’ll get the following error: > cor(Carseats) Error in cor(Carseats) : 'x' must be numeric As the error message suggests, we can’t pass non numeric variables to this function so we need to remove the categorical variables from our data frame. But first we need to work out which columns those are: > sapply(Carseats, class) Sales CompPrice Income Advertising Population Price ShelveLoc Age Education "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "factor" "numeric" "numeric" Urban US "factor" "factor" We can see a few columns of type ‘factor’ and luckily for us there’s a function which will help us identify those more easily: > sapply(Carseats, is.factor) Sales CompPrice Income Advertising Population Price ShelveLoc Age Education FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE Urban US TRUE TRUE Now we can remove those columns from our data frame and create the correlation matrix: > cor(Carseats[sapply(Carseats, function(x) !is.factor(x))]) Sales CompPrice Income Advertising Population Price Age Education Sales 1.00000000 0.06407873 0.151950979 0.269506781 0.050470984 -0.44495073 -0.231815440 -0.051955242 CompPrice 0.06407873 1.00000000 -0.080653423 -0.024198788 -0.094706516 0.58484777 -0.100238817 0.025197050 Income 0.15195098 -0.08065342 1.000000000 0.058994706 -0.007876994 -0.05669820 -0.004670094 -0.056855422 Advertising 0.26950678 -0.02419879 0.058994706 1.000000000 0.265652145 0.04453687 -0.004557497 -0.033594307 Population 0.05047098 -0.09470652 -0.007876994 0.265652145 1.000000000 -0.01214362 -0.042663355 -0.106378231 Price -0.44495073 0.58484777 -0.056698202 0.044536874 -0.012143620 1.00000000 -0.102176839 0.011746599 Age -0.23181544 -0.10023882 -0.004670094 -0.004557497 -0.042663355 -0.10217684 1.000000000 0.006488032 Education -0.05195524 0.02519705 -0.056855422 -0.033594307 -0.106378231 0.01174660 0.006488032 1.000000000 Be Sociable, Share!
October 8, 2014
by Mark Needham
· 29,047 Views
article thumbnail
Using Groovy To Import XML Into MongoDB
This year I’ve been demonstrating how easy it is to create modern web apps using AngularJS, Java and MongoDB. I also use Groovy during this demo to do the sorts of things Groovy is really good at - writing descriptive tests, and creating scripts. Due to the time pressures in the demo, I never really get a chance to go into the details of the script I use, so the aim of this long-overdue blog post is to go over this Groovy script in a bit more detail. Firstly I want to clarify that this is not my original work - I stoleborrowed most of the ideas for the demo from my colleague Ross Lawley. In this blog post he goes into detail of how he built up an application that finds the most popular pub names in the UK. There’s asection in there where he talks about downloading the open street map data and using python to convert the XML into something more MongoDB-friendly - it’s this process that I basically stole, re-worked for coffee shops, and re-wrote for the JVM. I’m assuming if you’ve worked with Java for any period of time, there has come a moment where you needed to use it to parse XML. Since my demo is supposed to be all about how easy it is to work with Java, I didnot want to do this. When I wrote the demo I wasn’t really all that familiar with Groovy, but what I did know was that it has built in support for parsing and manipulating XML, which is exactly what I wanted to do. In addition, creating Maps (the data structures, not the geographical ones) with Groovy is really easy, and this is effectively what we need to insert into MongoDB. Goal Of The Script Parse an XML file containing open street map data of all coffee shops. Extract latitude and longitude XML attributes and transform intoMongoDB GeoJSON. Perform some basic validation on the coffee shop data from the XML. Insert into MongoDB. Make sure MongoDB knows this contains query-able geolocation data. The script is PopulateDatabase.groovy, that link will take you to the version I presented at JavaOne: Firstly, We Need Data I used the same service Ross used in his blog post to obtain the XML file containing “all” coffee shops around the world. Now, the open street map data is somewhat… raw and unstructured (which is why MongoDB is such a great tool for storing it), so I’m not sure I really have all the coffee shops, but I obtained enough data for an interesting demo using http://www.overpass-api.de/api/xapi?*[amenity=cafe][cuisine=coffee_shop] The resulting XML file is in the github project, but if you try this yourself you might (in fact, probably will) get different results. Each XML record looks something like: Each coffee shop has a unique identifier and a latitude and longitude as attributes of a node element. Within this node is a series of tag elements, all with k and v attributes. Each coffee shop has a varying number of these attributes, and they are not consistent from shop to shop (other than amenity and cuisine which we used to select this data). Initialisation Before doing anything else we want to prepare the database. The assumption of this script is that either the collection we want to store the coffee shops in is empty, or full of stale data. So we’re going to use the MongoDB Java Driver to get the collection that we’re interested in, and then drop it. There’s two interesting things to note here: This Groovy script is simply using the basic Java driver. Groovy can talk quite happily to vanilla Java, it doesn’t need to use a Groovy library. There are Groovy-specific libraries for talking to MongoDB (e.g. the MongoDB GORM Plugin), but the Java driver works perfectly well. You don’t need to create databases or collections (collections are a bit like tables, but less structured) explicitly in MongoDB. You simply use the database and collection you’re interested in, and if it doesn’t already exist, the server will create them for you. In this example, we’re just using the default constructor for theMongoClient, the class that represents the connection to the database server(s). This default is localhost:27017, which is where I happen to be running the database. However you can specify your own address and port - for more details on this see Getting Started With MongoDB and Java. Turn The XML Into Something MongoDB-Shaped So next we’re going to use Groovy’s XmlSlurper to read the open street map XML data that we talked about earlier. To iterate over every node we use: xmlSlurper.node.each. For those of you who are new to Groovy or new to Java 8, you might notice this is using a closure to define the behaviour to apply for every “node” element in the XML. Create GeoJSON Since MongoDB documents are effectively just maps of key-value pairs, we’re going to create a Map coffeeShop that contains the document structure that represents the coffee shop that we want to save into the database. Firstly, we initialise this map with the attributes of the node. Remember these attributes are something like: We’re going to save the ID as a value for a new field calledopenStreetMapId. We need to do something a bit more complicated with the latitude and longitude, since we need to store them as GeoJSON, which looks something like: { 'location' : { 'coordinates': [, ], 'type' : 'Point' } } In lines 12-14 you can see that we create a Map that looks like the GeoJSON, pulling the lat and lon attributes into the appropriate places. Insert Remaining Fields Now for every tag element in the XML, we get the k attribute and check if it’s a valid field name for MongoDB (it won’t let us insert fields with a dot in, and we don’t want to override our carefully constructed locationfield). If so we simply add this key as the field and its the matching vattribute as the value into the map. This effectively copies theOpenStreetMap key/value data into key/value pairs in the MongoDB document so we don’t lose any data, but we also don’t do anything particularly interesting to transform it. Save Into MongoDB Finally, once we’ve created a simple coffeeShop Map representing the document we want to save into MongoDB, we insert it into MongoDB if the map has a field called name. We could have checked this when we were reading the XML and putting it into the map, but it’s actually much easier just to use the pretty Groovy syntax to check for a key called namein coffeeShop. When we want to insert the Map we need to turn this into aBasicDBObject, the Java Driver’s document type, but this is easily done by calling the constructor that takes a Map. Alternatively, there’s a Groovy syntax which would effectively do the same thing, which you might prefer: collection.insert(coffeeShop as BasicDBObject) Tell MongoDB That We Want To Perform Geo Queries On This Data Because we’re going to do a nearSphere query on this data, we need to add a “2dsphere” index on our location field. We created the locationfield as GeoJSON, so all we need to do is call createIndex for this field. Conclusion So that’s it! Groovy is a nice tool for this sort of script-y thing - not only is it a scripting language, but its built-in support for XML, really nice Map syntax and support for closures makes it the perfect tool for iterating over XML data and transforming it into something that can be inserted into a MongoDB collection.
October 8, 2014
by Trisha Gee
· 10,276 Views
article thumbnail
The No Fluff Introduction to Big Data
big data traditionally has referred to a collection of data too massive to be handled efficiently by traditional database tools and methods. this original definition has expanded over the years to identify tools (big data tools) that tackle extremely large datasets (nosql databases, mapreduce, hadoop, newsql, etc.), and to describe the industry challenge posed by having data harvesting abilities that far outstrip the ability to process, interpret, and act on that data. technologists knew that those huge batches of user data and other data types were full of insights that could be extracted by analyzing the data in large aggregates. they just didn’t have any cheap, simple technology for organizing and querying these large batches of raw, unstructured data. the term quickly became a buzzword for every sort of data processing product’s marketing team. big data became a catchall term for anything that handled non-trivial sizes of data. sean owen, a data scientist at cloudera, has suggested that big data is a stage where individual data points are irrelevant and only aggregate analysis matters [1]. but this is true for a 400 person survey as well, and most people wouldn’t consider that very big. the key part missing from that definition is the transformation of unstructured data batches into structured datasets. it doesn’t matter if the database is relational or non-relational. big data is not defined by a number of terabytes, it’s rooted in the push to discoverhidden insights in data that companies used to disregard or throw away. due to the obstacles presented by large scale data management, the goal for developers and data scientists is two-fold: first, systems must be created to handle large scale data, and two, business intelligence and insights should be acquired from analysis of the data. acquiring the tools and methods to meet these goals is a major focus in the data science industry, but it’s a landscape where needs and goals are still shifting. what are the characteristics of big data? tech companies are constantly amassing data from a variety of digital sources that is almost without end—everything from email addresses to digital images, mp3s, social media communication, server traffic logs, purchase history, and demographics. and it’s not just the data itself, but data about the data (metadata). it is a barrage of information on every level. what is it that makes this mountain of data big data? one of the most helpful models for understanding the nature of big data is “the three vs:” volume, velocity, and variety. data volume volumeis the sheer size of the data being collected. there was a point in not-so-distant history where managing gigabytes of data was considered a serious task—now we have web giants like google and facebook handling petabytes of information about users’ digital activities. the size of the data is often seen as the first challenge of characterizing big data storage, but even beyond that is the capability of programs to provide architecture that can not only store but query these massive datasets. one of the most popular models for big data architecture comes from google’s mapreduce concept, which was the basis for apache hadoop, a popular data management solution. data velocity velocityis a problem that flows naturally from the volume characteristics of big data. data velocity is the speed at which data is flowing into a business’ infrastructure and the ability of software solutions to receive and ingest that data quickly. certain types of high-velocity data, such as streaming data, needs to be moved into storage and processed on the fly. this is often referred to as complex event processing (cep). the ability to intercept and analyze data that has a lifespan of milliseconds is a widely sought after. this kind of quick-fire data processing has long been the cornerstone of digital financial transactions, but it is also being used to track live consumer behavior or to bring instant updates to social media feeds. data variety variety refers to the source and type of data that is being collected. this data could be anything from raw image data to sensor readings, audio recordings, social media communication, and metadata. the challenge of data variety is being able to take raw, unstructured data and organize it so that an application can use it. this kind of structure can be achieved through architectural models that traditionally favor relational databases—but there is often a need to tidy up this data before it will even be useful to store in a raw form. sometimes a better option is to use a schema-less, non-relational database. how do you manage big data? the three vs is a great model for getting an initial understanding of what makes big data a challenge for businesses. however, big data is not just about the data itself, but the way that it is handled. a popular way of thinking about these challenges is to look at how a business stores, processes, and accesses their data. · store: can you store the vast amounts of data being collected? · process: can you organize, clean, and analyze the data collected? · access: can you search and query this data in an organized manner? the store, process, and access model is useful for two reasons: it reminds businesses that big data is largely about managing data, and it demonstrates the problem of scale within big data management. “big” is relative. the data batches that challenge some companies could be moved through a single google datacenter in under a minute. the only question a company needs to ask itself is how it will store and access increasingly massive amounts of data for its particular use case. there are several high level approaches that companies have turned to in the last few years. the traditional approach the traditional method for handling most data is to use relational databases. data warehouses are then used to integrate and analyze data from many sources. these databases are structured according to the concept of “early structure binding”—essentially, the database has predetermined “questions” that can be asked based on a schema. relational databases are highly functional, and the goal with this type of data processing is for the database to be fully transactional. although relational databases are the most common persistence type by a large margin (see key findings pg. 4-5), a growing number of use cases are not well-suited for relational schema. relational architectures tend to have difficulty when dealing with the velocity and variety of big data, since their structure is very rigid. when you perform functions such as join on many large data sets, the volume can be a problem as well. instead, businesses are looking to non-relational databases, or a mixture of both types, to meet data demand. the newer approach - mapreduce, hadoop, and nosql databases in the early 2000s, web giant google released two helpful web technologies: google file system (gfs) and mapreduce. both were new and unique approaches to the growing problem of big data, but mapreduce was chief among them, especially when it comes to its role as a major influencer of later solution models. mapreduce is a programming paradigm that allows for low cost data analysis and clustered scale-out processing. mapreduce became the primary architectural influence for the next big thing in big data: the creation of the big data management infrastructure known as hadoop. hadoop’s open source ecosystem and ease of use for handling large-scale data processing operations have secured a large part of the big data marketplace. besides hadoop, there was a host of non-relational (nosql) databases that emerged around 2009 to meet a different set of demands for processing big data. whereas hadoop is used for its massive scalability and parallel processing, nosql databases are especially useful for handling data stored within large multi-structured datasets. this kind of discrete data handling is not traditionally seen as a strong point of relational databases, but it’s also not the same kind of data operations that hadoop is running. the solution for many businesses ends up being a combination of these approaches to data management. finding hidden data insights once you get beyond storage and management, you still have the enormous task of creating actionable business intelligence (bi) from the datasets you’ve collected. this problem of processing and analyzing data is maybe one of the trickiest in the data management lifecycle. the best options for data analytics will favor an approach that is predictive and adaptable to changing data streams. the thing is, there’s so many types of analytic models and different ways of providing infrastructure for this process. your analytics solution should scale, but to what degree? scalability can be an enormous pain in your analytical neck, due to the problem of decreasing performance returns when scaling out an algorithm. ultimately, analytics tools rely on a great deal of reasoning and analysis to extract data patterns and data insights, but this capacity means nothing for a business if they can’t then create actionable intelligence. part of this problem is that many businesses have the infrastructure to accommodate big data, but they aren’t asking questions about what problems they’re going to solve with the data. implementing a big data-ready infrastructure before knowing what questions you want to ask is like putting the cart before the horse. but even if we do know the questions we want to ask, data analysis can always reveal many correlations with no clear causes. as organizations get better at processing and analyzing big data, the next major hurdle will be pinpointing the causes behind the trends by asking the right questions and embracing the complexity of our answers. [1] http://www.quora.com/what-is-big-data 2014 guide to big data this guide explores the meaning of big data, how businesses use it, and uncovers new tools and techniques for the future of big data. this guide includes: detailed profiles on 43 big data vendor solutions in-depth articles written by industry experts results from our survey of 850 it professionals "finding the database for your use case" download now
September 25, 2014
by Benjamin Ball
· 10,641 Views · 1 Like
article thumbnail
How to Trace Transactions Across Every Layer of Your Distributed Software Stack
APM solutions give you great visibility into any code you have control over; however, today’s systems are largely a combination of code you write along with off-the-shelf components, sitting on top of VMs/containers, and cloud-based services. Thus, full system-wide visibility requires an ability to look into your APM tool as well as log data produced from the components that you may not be able to instrument. This post offers an outline of how APM solutions work and how you can combine them with your system logs to finally get an end-to-end and top-to-bottom view of your system behavior and performance. How APM Tools Work – Hello APM APM tools give you insight deep into your code and often work using cool techniques like dynamic instrumentation. Dynamic instrumentation essentially allows you to instrument your apps on the fly without any need to modify your application source code. Such techniques have become widely been supported by mainstream programming languages to make it possible for even mere mortals to build their own APM tools. For example, since Java version 5, any Java applications can be instrumented using java.lang.instrument, which allows for the instrumentation of any programs running on the JVM through modification of the byte code of methods. It works by letting you alter the corresponding byte code of a class when it is being loaded, such that you can introduce monitoring capabilities such as execution profiling or event tracing. There’e a great beginner tutorial here by Julien Paoletti on how to write your first APM tool in java. It essentially shows you how you can intercept classes at class load time and then inject code into methods of your choice to record how long it takes for given methods to execute. While building a full APM solution is not for the faint hearted, you can easily begin to build your first ‘Hello APM’ tool, and play around with JVM internals following Julien’s post above. Transaction Tracing For those interested in moving beyond simply recording method execution time, you can begin to trace full transactions using some simple techniques. To do so, you essentially need a unique identifier to be passed along to any methods executed in that transaction. Continuing on from our hello world profiler above, you could do this by injecting a unique ID into the thread at any entry point in the system (e.g. new incoming requests). Java provides ThreadLocal storage that allows you to do just this. Using ThreadLocal you can embed a unique ID that gets recorded as each method executes. Reconstructing a Transaction On every invocation of a method along the transaction data is logged. An example of what might be logged by an APM tool is as follows: unique transaction id sequence number call depth method details performance data You can then easily piece together full transaction traces by ordering all method calls by sequence number. Further analysis can be applied to this information for a number of purposes. For example, by analysing the transactions, developers can easily construct design diagrams that can help quickly deduce overall system structure. Relationships between system components can help understand interdependencies enabling developers to anticipate potential conflicts and to debug problems as well as allowing them to reason about their system design (which in turn can have a major impact of system performance). Tracing Transactions Across the Network The real challenge with transaction tracing can come when you are dealing with distributed components. In such scenarios you need to be able to trace transactions across the network. One approach here is to piggy-back the necessary correlation data (unique transaction ID) onto the request from a client to the remote server. RPC (remote procedure call) systems generally employ a standard mechanism, known as ‘stubs and skeletons‘, to hide the complexities of the network from the client making any remote calls. Stubs and skeletons work as follows: The stub masks the low level networking issues from the client and forwards the request on to a server side proxy object (the skeleton). The skeleton masks the low level networking issues from the distributed component. It also delegates the remote request to the distributed component. The distributed component then handles the request and returns control to the skeleton, which in turn returns control to the stub. The stub, in turn, hands back to the client. One approach to the issue of tracing transactions across the network can be achieved by taking advantage of the stubs and skeletons model. Essentially the stub and skeletons can be modified such that the unique transaction ID piggy backs on the communication and is sent as part of the request to the stub and response from the skeleton. The implementation may differ from platform to platform, but the principles can generally be applied. For example, Remote Method Invocation is used for distributed communication on java platforms and details on how this can be achieved for RMI can be found in one of my older research papers here. RMI with Custom Stub Wrapper and Server Side Interception Point Going Beyond APM The above transaction tracing will give you visibility at a method call level across your distributed application. However sometimes external factors outside your application code (server resource, SAAS components your app communicates with, network speed etc.) will have an impact on your overall system performance. One way of enhancing the information provided by your APM solution is to collect and analyze your log data. Logs provide a very flexible way of gathering information on your system behavior without any requirement for deep instrumentation and any of the complex techniques described above. Furthermore you may not be able to instrument every software component or cloud service that makes up your overall system – yet almost all of these will produce valuable log data containing system usage and performance information. In such scenarios, combining APM and log data will give you the complete picture. Below are some tips that will allow you to map logs to APM transactions or to enhance them with data from additional components such as OS, middleware or network level components: Logging the Transaction ID: Any log data produced by your apps can be easily mapped to transactions produced by your APM tool by logging the transaction ID used to trace the transaction. Client Side Logging & Logging User/Session/Account ID: Logging other unique identifiers such as session ID, user ID or account ID, can also help you assist with tracing transactions across log events, where the transaction ID used by the APM tool is unavailable. This can be particularly useful if you are logging events from the client side as well as from back end components where you want to be able to view the sequence of log events related to a give user action or session for example. Same Time Frame For System Logs: Where unique identifiers have not been logged as part of your log events, viewing logs within the same time frame window as your APM tool will help you narrow down related log events and will give you a view into system behavior during the transaction time frame. Correlating with Other KPIs: Logs will contain key performance and resource usage metrics that can be rolled up into trend lines and charts. Correlating APM transaction traces with performance metrics and server resource usage information can help with investigation can result in quicker root cause than investigating transaction traces in isolation. Build It Yourself? Naturally anybody in their right mind would not actually go about building their own APM solution, it’s almost as hair-brained as rolling your own logging solution The good news is you don’t have to do either – simply take advantage of the new Logentries and New Relic integration such that you can trace transactions from end to end and from top to bottom of your entire distributed software stack.
September 24, 2014
by Trevor Parsons
· 6,832 Views · 1 Like
article thumbnail
How to Resolve Maven's ''Failure to Transfer'' Error
Learn how to resolve the ''failure to transfer'' error encountered in Maven in this quick tutorial.
September 24, 2014
by Jose Roy Javelosa
· 129,904 Views · 3 Likes
article thumbnail
Java - Four Security Vulnerabilities Related Coding Practices to Avoid
This article represents top 4 security vulnerabilities related coding practice to avoid while you are programming with Java language. Recently, I came across few Java projects where these instances were found. Please feel free to comment/suggest if I missed to mention one or more important points. Also, sorry for the typos. Following are the key points described later in this article: Executing a dynamically generated SQL statement Directly writing an Http Parameter to Servlet output Creating an SQL PreparedStatement from dynamic string Array is stored directly Executing a Dynamically Generated SQL Statement This is most common of all. One can find mention of this vulenrability at several places. As a matter of fact, many developers are also aware of this vulnerability, although this is a different thing they end up making mistakes once in a while. In several DAO classes, the instances such as following code were found which could lead to SQL injection attacks. StringBuilder query = new StringBuilder(); query.append( "select * from user u where u.name in (" + namesString + ")" ); try { Connection connection = getConnection(); Statement statement = connection.createStatement(); resultSet = statement.executeQuery(query.toString()); } Instead of above query, one could as well make use of prepared statement such as that demonstrated in the code below. It not only makes code less vulnerable to SQL injection attacks but also makes it more efficient. StringBuilder query = new StringBuilder(); query.append( "select * from user u where u.name in (?)" ); try { Connection connection = getConnection(); PreparedStatement statement = connection.prepareCall(query.toString()); statement.setString( 1, namesString ); resultSet = statement.execute(); } Directly writing an Http Parameter to Servlet Output In Servlet classes, I found instances where the Http request parameter was written as it is, to the output stream, without any validation checks. Following code demonstrate the same: public void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { String content = request.getParameter("some_param"); // // .... some code goes here // response.getWriter().print(content); } Note that above code does not persist anything. Code like above may lead to what is called reflected (or non-persistent) cross site scripting (XSS) vulnerability. Reflected XSS occur when an attacker injects browser executable code within a single HTTP response. As it goes by definition (being non-persistent), the injected attack does not get stored within the application; it manifests only users who open a maliciously crafted link or third-party web page. The attack string is included as part of the crafted URI or HTTP parameters, improperly processed by the application, and returned to the victim. You could read greater details on following OWASP page on reflect XSS Creating an SQL PreparedStatement from Dynamic Query String What it essentially means is the fact that although PreparedStatement was used, but the query was generated as a string buffer and not in the way recommended for prepared statement (parametrized). If unchecked, tainted data from a user would create a String where SQL injection could make it behave in unexpected and undesirable manner. One should rather make the query statement parametrized and, use the PreparedStatement appropriately. Take a look at following code to identify the vulnerable code. StringBuilder query = new StringBuilder(); query.append( "select * from user u where u.name in (" + namesString + ")" ); try { Connection connection = getConnection(); PreparedStatement statement = connection.prepareStatement(query.toString()); resultSet = statement.executeQuery(); } Array is Stored Directly Instances of this vulnerability, Array is stored directly, could help the attacker change the objects stored in array outside of program, and the program behave in inconsistent manner as the reference to the array passed to method is held by the caller/invoker. The solution is to make a copy within the object when it gets passed. In this manner, a subsequent modification of the collection won’t affect the array stored within the object. You could read the details on following stackoverflow page. Following code represents the vulnerability: // Note that values is a String array in the code below. // public void setValues(String[] somevalues) { this.values = somevalues; }
September 19, 2014
by Ajitesh Kumar
· 19,406 Views · 1 Like
article thumbnail
A Closer Look at the MySQL ibdata1 Disk Space Issue and Big Tables
A recurring customer issue seen by the Percona Support team involves how to make the ibdata1 file “shrink” within MySQL. I'll show you how to handle big tables.
September 16, 2014
by Peter Zaitsev
· 7,777 Views
article thumbnail
The Near Future of IoT
[This article was written by Sean Lorenz.] Pundits within the technology sphere have been calling 2014 the year of the Internet of Things (IoT). The market revenue potentials are forecasted into the trillions and it’s a Fortune 500 land grab with major companies moving quickly to stake their claims [1]. If this sounds a bit like pages from an American Wild West history book, a frontier analogy isn’t too far off. This is an exciting turning point in technology that—thanks to advances in plummeting sensor costs, wireless communication, and chip size reduction—will soon make today’s futuristic IoT concepts seem humorous in retrospect. While it’s difficult to see where the market is going, given the exponential rate of change in IoT technology, I have noticed several key trends emerging. As a fellow IoT prospector on the frontier, this is my account of the most evident trends as well as some educated predictions for the future. Trends 1. Business Value Over Technology Focus Like any promising new technology still in its infancy stage, the true innovation stems from tech-savvy researchers and tinkerers that build fascinating devices that sometimes have no consumer base–I’m looking at you, robotics market. We have all heard about the smart toothbrushes and smart egg trays coming to market and thought: “Interesting! I wouldn’t buy one, but… sure!” Perhaps the biggest trend is a shift from thinking, “let’s build it because we can” to “what business problem are we solving here?” IoT developers are getting wise to this mentality and building user-focused MVPs (Minimum Viable Products) that will begin hitting the market in late 2014 and early 2015. 2. Keeping It Real At my company, Xively, we often get asked what are the real use cases for the IoT. Many times our customers walk in the door with a vague idea of how connecting their product or service to the Internet would be potentially interesting, but need a little help with seeing how an IoT-enabled product can transform their business—internally and externally. The reason for this is that most of the exciting, transformative elements happen under the hood. Right now, the true “wow” moments in the industry are far from sexy: energy savings in enterprise complexes, CRM & ERP integration, service and support, supply chain efficiencies, product part failure and alert, and so on… you get the idea. Smart homes that respond to our every whim are really great ideas, but these products aren’t integral to our lives yet. Large manufacturing companies and enterprises are using the Internet of Things to manage internal operations and efficiency while also engaging their customers more fully with new IoT data sources aggregated in existing services like Salesforce1 or SAP. 3. Publish-Subscribe The IoT protocol wars are heating up, but allegiances aside, publish-subscribe messaging is what the bulk of implemented models use for connecting devices to the cloud. Pub-sub protocols such as MQTT, CoAP, and AMQP are attractive for connected product development thanks to their ease of scalability and many-to-one/one-to-many possibilities. Given the massive variance of the IoT market, there is bound to be more than one protocol that wins in the end; yet before we get to that point, there are plenty of bugs and vulnerabilities to patch across all of the thriving, open IoT protocols out there. 4. Security Panic! Hacked refrigerators, big box stores, and security cameras… oh my! There has been no shortage of concern for privacy, security, and compliance in the Internet of Things space. Like any news story, some of this attention is warranted and some overblown. Just like your pre-IoT old-fashioned Internet, creating specific application keys and advanced permissioning systems for hardware connecting to the cloud is essential. The amount of nodes at the edge connecting to services across the Internet will be far larger than anything we see now, but IoT platforms are already addressing these complex device lifecycle management issues that are crucial for protecting personal and enterprise information in a connected world. Near-Term Predictions Now lets hop in the DeLorean and look into the future. Rather than focus on five, ten, or twenty years into the future, let’s focus only on the next few years. Why? As I mentioned in the beginning, the IoT landscape changes on a day-to-day basis, so even a prediction looking forward six-months from now can be unreliable. This list contains no self-driving cars or sentient AIs. Instead, it makes some pretty sure bets for what to expect over the horizon. 1. A Household Name Usually the second question after “what’s your name?” at a dinner party is the inevitable “so what do you do?” Mentioning the Internet of Things to non-techies still draws blank stares and looks of confusion. Those looks are justified given the not-so-great marketing name of IoT and the myriad definitions trying to explain what it actually is. Whether it’s called the Internet of Things, Internet of Everything, or just the good ol’ Internet, the concept of connecting any and everything to the Internet will begin to make sense for everyday consumers. 2. Consumers Slow to Adopt Many IoT products are still just toys in many people’s minds. Startups are building products that address problems which most consumers don’t see as a problem yet. This isn’t to say the consumer IoT market will evaporate. It just means we need to get smarter about what customers actually want from smart devices. Today’s wearable products remind me of the Newton—Apple’s infamous PDA. The problem wasn’t the idea, but rather the timing. The Apple Newton seemed clunky, not very powerful, and low on the usability scale. Years later, the iPhone and iPad came along with a set of features and a form factor that customers were looking for. The same feels true of wearables right now—they may need a few more years to incubate before the general public gives two thumbs up. Other consumer IoT markets such as the smart home or driverless cars seem to be in the same situation as the wearables market, but this is changing quickly with major players like Apple and Google moving into these arenas. For example, in the home automation space, frameworks like Apple HomeKit will be essential for unifying disparate protocols and clouds into one application that can handle various products’ data, automating much of the technology and pushing it into the background. I am sure there is a brilliant developer learning Swift and building the first killer smart home app as we speak. 3. Analytics and Automation This prediction probably comes as no surprise, but it is worth stating. Most companies willing to foray into the IoT unknown are, for now, happy with connecting their devices to an external application or cloud service. Having a place to send the data is usually the first step in constructing an IoT system. So what do you do with all this data once you have it? Reporting tools for IoT are just starting to become available, but this is just the tip of the iceberg. The real magic lies in the ability to use exploratory and predictive algorithms to make actionable intelligence a reality. These insights are beneficial to both businesses understanding their customers and to the customers themselves. One could imagine closing the feedback loop between sensor, cloud, and actuator by adding some beautiful supervised machine learning code into the cloud platform at some point in the chain. There are currently a handful of analytics startups focusing on IoT specifically, but this market is about to explode from both platform and application perspectives. 4. IoT Startups Galore For any developers out there interested in the IoT with a real customer pain that needs solving, now is the time to get coding and building that pitch deck. With hardware back en vogue, venture capital funding of IoT-centric companies ison the rise [2]. Having been to a number of IoT events, the amount of enthusiasm by VC and angel investors is palpable. There’s a definite need for developers with great, connected product and service ideas; so, if you haven’t already, I strongly suggest putting on your favorite prospecting gear and exploring the untamed wild west of the Internet of Things. [1] https://internetofeverything.cisco.com/sites/default/files/docs/en/ioe_public_sector_vas_white%20paper_121913final.pdf [2] http://www.cbinsights.com/blog/internet-of-things-investing-snapshot 2014 Guide to Internet of Things The 2014 Guide to Internet of Things covers 39 different IoT SDKs, developer programs, and hardware options, plus: Key findings from our survey of over 2,000 developers "How to IoT Your Life: The Complete Shopping List" "The Scale of IoT" Infographic Glossary of common IoT terms Four in-depth articles from industry experts DOWNLOAD NOW
August 28, 2014
by Benjamin Ball
· 10,499 Views
article thumbnail
Deserializing Json to a Java Object Using Google’s Gson Library
javascript object notation (json) is fast becoming the de facto standard or format for transferring, sharing and passing around data. be it on the web, rest service, a remote procedure call or even an ajax request. json is light weight with little memory footprint when compared to an xml. the content of a json string in its raw form when observed looks gibberish. to make the content usable it needs to be deserialized or converted to a useable form usually a java object (pojo) or an array or list of objects depending on the json content. a typical json string is as shown below {"city":"jos","country":"nigeria","housenumber":"13","lga":"jos south", "state":"plateau","streetname":"jonah jann","village":"bukuru","ward":"1"} there are a lot of frameworks for deserializing json to a java object such as json-rpc , gson , flexjson and a whole lots of other open source libraries. of all the libraries mentioned i would in this blog post demonstrate how to use google-gson library to deserialize a json string to a java object. you can download the gson library from https://code.google.com/p/google-gson/ . to have the json string deserialized, a java object must be created that has the same fields names with the fields in the json string. there is a website that provides a service for viewing the content of a json string in a tree like manner. http://jsonviewer.stack.hu paste the json string in the text tab and view the fields and the content from the viewer tab i would deserialize a json string that contains address details to an address pojo, the address object follows the structure as seen from the json tree view above. public class address{ private string city; private string country; private string housenumber; private string lga; private string state; private string streetname; private string village; private string ward; public string getcity() { return city; } public void setcity(string city) { this.city = city; } public string getcountry() { return country; } public void setcountry(string country) { this.country = country; } public string gethousenumber() { return housenumber; } public void sethousenumber(string housenumber) { this.housenumber = housenumber; } public string getlga() { return lga; } public void setlga(string lga) { this.lga = lga; } public string getstate() { return state; } public void setstate(string state) { this.state = state; } public string getstreetname() { return streetname; } public void setstreetname(string streetname) { this.streetname = streetname; } public string getvillage() { return village; } public void setvillage(string village) { this.village = village; } public string getward() { return ward; } public void setward(string ward) { this.ward = ward; } @override public string tostring() { return "address [city=" + city + ", country=" + country + ", housenumber=" + housenumber + ", lga=" + lga + ", state=" + state + ", streetname=" + streetname + ", village=" + village + ", ward=" + ward + "]"; } } to perform the deserialization with gson is easy, create pojo classes to hold your data, import the packages com.google.gson.gson and com.google.gson.gsonbuilder, to your project. then create and instance of the gson class and then perform the deserialization as shown below. gson gson = new gsonbuilder().create(); address address=gson.fromjson(json, address.class); voila, you have your json deserialized! the source code listing is below. package jsondeserializer import com.google.gson.gson; import com.google.gson.gsonbuilder; public class tester { public static void main(string[] args) { string json ="{\"city\":\"jos\",\"country\":\"nigeria\",\"housenumber\":\"13\",\"lga\":\"jos south\",\n" + "\"state\":\"plateau\",\"streetname\":\"jonah jann\",\"village\":\"bukuru\",\"ward\":\"1\"}"; gson gson = new gsonbuilder().create(); address address=gson.fromjson(json, address.class); system.out.println(address.tostring()); } }
August 27, 2014
by Ayobami Adewole
· 90,829 Views
article thumbnail
Microservices and PaaS (Part II)
[This article was written by John Wetherill.] This is a continuation of the Microservices and PaaS - Part I blog post I wrote last week, which was an attempt to distil the wealth of information presented at the microservices meetup hosted by Cisco, with Adrian Cockcroft and others presenting. Part I provided a brief background on microservices, with a summary of some lessons learned by microservices pioneers. In this installment I will cover a number of practices related to microservices that were discussed during the meetup. A followup article will dive into the advantages that Platform as a Service brings to microservice development. Microservices Practices I'm calling these "Microservice Practices," not "Microservices Best Practices" because microservices-based architectures are still evolving, with new practices, techniques, tools, and patterns emerging constantly. At the meetup a number of practices were highlighted that Netflix and other microservices pioneers have spearheaded in their efforts to adopt a microservices mentality across their organizations. Break Things Deliberately According to Netflix: "We have found that the best defense against major unexpected failures is to fail often." Netflix has brought us "Chaos Monkey" which is a powerful tool the sole purpose of which is to break things, often and randomly. They use this tool continuously on their production systems to bring down essential services, to ensure that doing so doesn't disrupt the user experience or their overall service. It's much better to deliberately break the system in the middle of the morning when all teams are assembled and sufficient caffeine has been consumed, than to be informed of a breakage by a page at 3am. No Manual "Anything" In a world where microservices come and go, grow and shrink, and migrate around racks and data centers in seconds - there's absolutely no room for manual intervention. All aspects of deployment, monitoring, testing, and recovery must be fully automated. For example, monitoring a service should occur instantly and automatically by virtue of it being deployed, not requiring a separate manual step. Similarly failure discovery and rerouting to old code, as described in Part I of this blog, must be fully automated, no human intervention required. Respect Human Attention Span Speaking of humans, a typical human's attention span, say when filling out a shopping cart, is around 10 seconds. If a failure occurs when deploying an updated shopping cart microservice, it's important that the time between the failure, reporting, and rerouting to existing, working code is kept under around this 10 second range. Obviously this shouldn't happen too often, but the occasional 10 second gap in response will probably not lose the customer. A five minute, or 5 hour lag, resulting from manual intervention and rollback, will. Denormalize like Crazy Refactor database schemas, and de-normalize everything, to allow complete separation and partitioning of data. That is, do not use underlying tables that serve multiple microservices. There should be no sharing of underlying tables that span multiple microservices, and no sharing of data. Instead, if several services need access to the same data, it should be shared via a service API (such as a published REST or a message service interface). Polyglot Persistence Each microservice can have its own persistence layer. Gone are the days of a single monolithic database instance that's shared across all parts of an application. Databases are getting cheaper and easier. As an example, Neo4J allows you to embed an industry-strength self-contained graph database in your microservice at the cost of a few megabytes in a jarfile, with startup time on the order of milliseconds. That's essentially free. Even better, any PaaS worth its salt will provide multiple database services that can be spawned and accessed at the drop of a hat. With technology like this at our disposal, it makes sense to use the persistence layer that fits, both to the problem being solved, and to the expertise - and passions - of the team that's solving the problem. Avoid Trunk Conflicts The old mindset had all code for a large project contained in a single source repository. This can be slightly easier to setup and manage, but it ties the microservices together and makes it much more difficult to evolve them independently. Instead each microservice should have its own scm repository so it can truly be updated and enhanced independent of other services. One Service, One Manifest Each microservice must have its own manifest and dependencies, instead of maintaining a global dependency list for all services. This allows, for example, one microservice to depend on Spring v3.2, while another can require Spring 4.1. The dependencies for one microservice can change over time with no effect on the dependencies of other microservices. Contain Everything All microservices should run in a container, such as Tomcat, Docker, or in whatever container system is provided by the PaaS (you are running a PaaS aren't you?). Do not run microservices on bare metal, or directly on a VM. Containerization brings countless advantages, particularly a consistent, isolated runtime environment that can easily migrate around the datacenter or around the globe. With Docker and other modern containerization approaches, there is very little overhead in running in a container, and considerable upside. No State Do not build stateful services. Instead, maintain state in a dedicated persistence service, or elsewhere. This is a well-known practice brought to us by the cloud. When an application instance maintains state, it can't easily be moved, scaling is more complex, and it's more likely to cause problems when it fails. This practice applies even more to microservices which in general should be light-weight, instantly replaceable on failure, and should be able to hop around data-centers. Don't Name your Chickens People who raise chickens soon learn that naming chickens is a bad idea: after naming a chicken you get attached to it, at least the kids do, and it can be uncomfortable to have to explain at the dinner table that the chicken pot pie is really "Molly." Instead, number your chickens, so you can say "that was chicken #38" or even better, "that was chicken 586ec9bd." Makes for a much more enjoyable meal. The same can be said of computer systems. Do not name systems after planets, or animals, or philosophers, or prisons, as was common practice in the UNIX world for decades. Instead, assign them guid's, and don't attach any sort of significance to them, like assigning them specific roles or purposes. Systems should be commodities, like McDonalds Franchises. Each McDonalds is eerily similar, with the advantage that if one shuts down you can just walk an extra few blocks and be served the exact same burger at the same price in the same amount of time. Create and Curate Access Libraries Microservices are accessed by externally published APIs or protocols. This allows the microservice implementation to completely change with no effect on its consumers, as long as the API remains constant. But just publishing an API is not enough. The microservice provider should also be responsible for building and stewarding client libraries used to access the service. If this is not done, the construction of these libraries will be left to third parties, and will likely result in fragmentation where various implementations might have slight differences, or implementors may incorrectly interpret the spec and introduce inconsistencies which then stick. Optimize the Interaction One downside of a microservices architecture is the "fanout" problem where a single request to the overall application results in 10 or 20 requests bubbling throughout the various microservices the application relies on. This dramatic increase in network traffic calls for more optimal communication between microservices. Instead of transmitting the standard text/html REST content type, consider using something like Google Protocol Buffers, Simple Binary Encoding, or Apache Thrift, to decrease the size of the payload and optimize the inter-microservice communications. Release the Monkeys Netflix has released what they call the "Simian Army," a suite of tools including Chaos Monkey, mentioned above, whose purpose is to help an organization build resilient, scalable, fault-tolerant software. The suite includes such tools as Janitor Monkey, to reclaim unused resources, Security Monkey which looks for security vulnerabilities, Latency Monkey, which induces artificial delays in the REST layer to scare out latency issues, and many more. As Phil described last week in his blog Devops: Tools vs. Culture, most organizations don't have the resources or luxury of being able to build their own toolsets when evolving to a microservices and devops culture. Instead they must leverage existing tools, and fortunately lots of tools are constantly appearing. It's worth spending the effort searching and researching these tools, and incorporating them into your overall development process when they make sense. To be continued... Again I originally intended to cover last week's microservices meetup in a single blog post, which then expanded to two. I have yet to address the power of PaaS in microservices architectures, and I'm out of space already. So I will continue this Microservices and PaaS theme next week, finally getting into PaaS, and discuss how Platform as a Service can significantly streamline the microservices development process.
August 26, 2014
by John Wetherill
· 9,623 Views
  • Previous
  • ...
  • 411
  • 412
  • 413
  • 414
  • 415
  • 416
  • 417
  • 418
  • 419
  • 420
  • ...
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×