Data Resources

The Latest Data Topics

A hashmap which maintains its keys as regular expressions. Any input string matching an expression retrieves the same value. Internally it maintains two maps: one from regex to value, and another from matched input to regex. Whenever a new input is passed to 'get', there is an O(n) search through the compiled regexes (which have been 'put' as keys) to find a match; the first regex that matches wins. Inputs that have been matched before get constant-time lookup through the two maps.

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.WeakHashMap;
    import java.util.regex.Pattern;

    public class RegexHashMap<V> implements Map<String, V> {

        private class PatternMatcher {
            private final String regex;
            private final Pattern compiled;

            PatternMatcher(String name) {
                regex = name;
                compiled = Pattern.compile(regex);
            }

            // Records the input -> regex association on a successful full match
            boolean matched(String string) {
                if (compiled.matcher(string).matches()) {
                    ref.put(string, regex);
                    return true;
                }
                return false;
            }
        }

        /** Map of input to pattern */
        private final Map<String, String> ref;
        /** Map of pattern to value */
        private final Map<String, V> map;
        /** Compiled patterns */
        private final List<PatternMatcher> matchers;

        @Override
        public String toString() {
            return "RegexHashMap [ref=" + ref + ", map=" + map + "]";
        }

        public RegexHashMap() {
            ref = new WeakHashMap<String, String>();
            map = new HashMap<String, V>();
            matchers = new ArrayList<PatternMatcher>();
        }

        /**
         * Returns the value to which the specified key pattern is mapped,
         * or null if this map contains no mapping for the key pattern
         */
        @Override
        public V get(Object weakKey) {
            if (!ref.containsKey(weakKey)) {
                for (PatternMatcher matcher : matchers) {
                    if (matcher.matched((String) weakKey)) {
                        break;
                    }
                }
            }
            if (ref.containsKey(weakKey)) {
                return map.get(ref.get(weakKey));
            }
            return null;
        }

        /** Associates a specified regular expression to a particular value */
        @Override
        public V put(String key, V value) {
            V v = map.put(key, value);
            if (v == null) {
                matchers.add(new PatternMatcher(key));
            }
            return v;
        }

        /** Removes the regular expression key */
        @Override
        public V remove(Object key) {
            V v = map.remove(key);
            if (v != null) {
                for (Iterator<PatternMatcher> iter = matchers.iterator(); iter.hasNext();) {
                    PatternMatcher matcher = iter.next();
                    if (matcher.regex.equals(key)) {
                        iter.remove();
                        break;
                    }
                }
                for (Iterator<Entry<String, String>> iter = ref.entrySet().iterator(); iter.hasNext();) {
                    Entry<String, String> entry = iter.next();
                    if (entry.getValue().equals(key)) {
                        iter.remove();
                    }
                }
            }
            return v;
        }

        /** Set view of the regular expression keys and their values */
        @Override
        public Set<Entry<String, V>> entrySet() {
            return map.entrySet();
        }

        @Override
        public void putAll(Map<? extends String, ? extends V> m) {
            for (Entry<? extends String, ? extends V> entry : m.entrySet()) {
                put(entry.getKey(), entry.getValue());
            }
        }

        @Override
        public int size() {
            return map.size();
        }

        @Override
        public boolean isEmpty() {
            return map.isEmpty();
        }

        /** Returns true if this map contains a mapping for the specified regular expression key. */
        @Override
        public boolean containsKey(Object key) {
            return map.containsKey(key);
        }

        /** Returns true if this map contains a mapping for the specified regular expression matched pattern. */
        public boolean containsKeyPattern(Object key) {
            return ref.containsKey(key);
        }

        @Override
        public boolean containsValue(Object value) {
            return map.containsValue(value);
        }

        @Override
        public void clear() {
            map.clear();
            matchers.clear();
            ref.clear();
        }

        /** Returns a Set view of the regular expression keys contained in this map. */
        @Override
        public Set<String> keySet() {
            return map.keySet();
        }

        /**
         * Returns a Set view of the regex matched patterns contained in this map.
         * The set is backed by the map, so changes to the map are reflected in the set, and vice-versa.
         */
        public Set<String> keySetPattern() {
            return ref.keySet();
        }

        @Override
        public Collection<V> values() {
            return map.values();
        }

        /** Produces a map of patterns to values, based on the regex put in this map */
        public Map<String, V> transform(List<String> patterns) {
            for (String pattern : patterns) {
                get(pattern);
            }
            Map<String, V> transformed = new HashMap<String, V>();
            for (Entry<String, String> entry : ref.entrySet()) {
                transformed.put(entry.getKey(), map.get(entry.getValue()));
            }
            return transformed;
        }

        public static void main(String... strings) {
            RegexHashMap<String> rh = new RegexHashMap<String>();
            rh.put("[o|O][s|S][e|E].?[1|2]", "This is a regex match");
            rh.put("account", "This is a direct match");
            System.out.println(rh);
            System.out.println("get:ose-1 -> " + rh.get("ose-1"));
            System.out.println("get:OSE2 -> " + rh.get("OSE2"));
            System.out.println("get:OSE112 -> " + rh.get("OSE112"));
            System.out.println("get:ose-2 -> " + rh.get("ose-2"));
            System.out.println("get:account -> " + rh.get("account"));
            System.out.println(rh);
        }
    }
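One consequence of the design above is that when several registered expressions match the same input, the one registered first is used, because the scan stops at the first match. The following is a minimal sketch of that behavior only; `FirstMatchLookup` and its field names are hypothetical, and it uses plain `HashMap`s rather than the article's `WeakHashMap`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical, stripped-down illustration of the two-level lookup; not the article's class.
class FirstMatchLookup {
    private final List<Pattern> patterns = new ArrayList<>();     // kept in registration order
    private final Map<String, String> values = new HashMap<>();   // regex -> value
    private final Map<String, String> resolved = new HashMap<>(); // input -> regex (memo)

    void put(String regex, String value) {
        // Compile only when the regex is new, mirroring the article's put()
        if (values.put(regex, value) == null) {
            patterns.add(Pattern.compile(regex));
        }
    }

    String get(String input) {
        String regex = resolved.get(input);
        if (regex == null) {
            // O(n) scan over the compiled patterns; the first full match wins
            for (Pattern p : patterns) {
                if (p.matcher(input).matches()) {
                    regex = p.pattern();
                    resolved.put(input, regex);
                    break;
                }
            }
        }
        return regex == null ? null : values.get(regex);
    }

    public static void main(String[] args) {
        FirstMatchLookup lookup = new FirstMatchLookup();
        lookup.put("ose-\\d", "specific");
        lookup.put("ose.*", "general");
        System.out.println(lookup.get("ose-1"));   // both match; "specific" was registered first
        System.out.println(lookup.get("ose-xyz")); // only the second pattern matches
        System.out.println(lookup.get("other"));   // no pattern matches, so null
    }
}
```

If overlapping expressions are expected, registering the most specific ones first keeps lookups predictable.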

November 7, 2014

by Sutanu Dalui

· 23,826 Views

Sketching API Connections

Daniel Bryant, Simon and I recently had a discussion about how to represent system communication with external APIs. The requirement for integration with external APIs is now extremely common, but it's not immediately obvious how to show them clearly in architectural diagrams.

How to represent an external system?

The first thing we discussed was what symbol to use for a system supplying an API. Traditionally, UML has used the actor (stick man) symbol to represent a "user or any other system that interacts with the subject" (UML Superstructure Specification, v2.1.2), so a system providing an API may be drawn as a stick actor. I've found that this symbol tends to confuse those who aren't well versed in UML, as most people assume that the actor symbol always represents a *person* rather than a system. Sometimes the actor is stereotyped to make this more obvious. However, the symbol is very powerful and tends to overpower the stereotype. Therefore I prefer to use a stereotyped box for an external system supplying an API.

Let's compare two context diagrams using boxes vs. stick actors. In which diagram is it more obvious what are systems and what are people?

Note that ArchiMate has a specific symbol for an application service that can be used to represent an API (application service notation from The Open Group's ArchiMate 2.1 specification).

An API or the system that supplies it?

Whatever symbol we choose, what we've done is to show the *system* rather than the actual API. The API is a definition of a service provided by the system in question. How should we provide more details about the API? There are a number of ways we could do this, but my preference is to give details of the API on the connector (the line connecting two elements/boxes). In C4, the guidelines for a container diagram include listing protocol information on the connector, and an API can be viewed as the layer above the protocol.

Multiple APIs per external system

Many API providers supply multiple services/APIs. (I'm not referring to different operations within an API but to multiple sets of operations in different APIs, which may even use different underlying protocols.) For example, a financial marketplace may have APIs that do the following:

- Allow a bulk, batch download of static data (such as details of companies listed on a stock market) via XML over HTTP.
- Supply real-time, low-latency updates of market prices via bespoke messages over UDP.
- Allow entry of trades via industry-standard FpML over a queuing system.
- Supply a bulk, batch download of trades for end-of-day reconciliation via FpML over HTTP.

Two of the services use the same protocol (XML over HTTP, since FpML is an XML format) but have very different content and use. One of the APIs constantly supplies information after user subscription (market data), and the trade entry service involves the user supplying all the information with no acknowledgment (although it should reconcile at end of day).

There are multiple ways of showing this. We could:

- Have a single service element, list the APIs on it and have all components linking to it.
- Show each service/API as a separate box and connect the components that use the individual service to the relevant box.
- Show a single service element with multiple connections. Each connection is labeled and represents an API.
- Use a port and connector style notation to represent each API from the service provider. Provide a key for the ports.
- Use a UML-style 'cup and ball' notation to define interfaces and their usage.

Some examples follow.

A single service element and simple description

In this diagram the containers state what they are using but give no information about how to use the APIs. We don't know whether it is a single API (with different operations), nor anything about the mechanisms used to transport the data. This isn't very useful for anyone implementing a solution or resolving operational issues.

Single service box with descriptive connectors

In this diagram there is a single service box with descriptive connectors. It shows all the information, so it is much more useful as a diagnostic or implementation tool. However, it does look quite crowded.

Services/APIs shown as separate boxes

Here the external system has its services/APIs shown as separate boxes. This contains all the information but might be mistaken for a definition of the internal structure of the external system. We want to show the services it provides, but we know nothing about its internal structure.

Using ports to represent APIs

In this diagram the services/APIs are shown as 'ports' on the external system, and the details have been moved into a separate key/table. This is less likely to be mistaken as showing any internal structure of the external service. (Note that I could also have shown outgoing ports from the brokerage system.)

UML interfaces

This final diagram uses a UML-style interface provider and requirer. It is a clean diagram but requires the reader to know what the cup and ball mean (although I could have explained this in the key).

Conclusion

Any of these solutions could be appropriate, depending on the complexity of the API set you are trying to represent. I'd suggest starting with a simple representation (i.e. fully labeled connections) and moving to a more complex one if needed, but remember to use a key to explain any elements you use!

November 7, 2014

by Robert Annett

· 8,170 Views · 1 Like

Using REST with the CQRS Pattern to Blend NoSQL & SQL Data

REST Easy with SQL/NoSQL Integration and CQRS Pattern Implementation

New demands are being put on IT organizations every day to deliver agile, high-performance, integrated mobile and web applications. Meanwhile, the technology landscape grows more complex with the advent of new technologies like REST, NoSQL and the cloud, while existing technologies like SOAP and SQL still rule everyday work. Rather than taking a religious side in the debate, we can let NoSQL co-exist with SQL in this ‘polyglot’ world of data storage and formats. However, this integration also adds another layer of complexity, both in architecture and in implementation. This document offers a guide on how some of the relatively newer technologies like REST can help bridge the gap between SQL and NoSQL, using the example of a well-known pattern called CQRS.

This document is organized as follows:

- Introduction to SQL development process
- NoSQL
- Do I have to choose between SQL and NoSQL?
- CQRS Pattern
- How to implement CQRS pattern using REST services

Introduction to SQL development process

Developers have been using SQL databases for decades to build and deliver enterprise business applications. The process of creating tables, attributes, and relationships is second nature for most developers. Data architects think in terms of tables and columns and navigate relationships for data. The basic work of delivery and transformation takes place at the web server level, which means the server developer is reading and ‘binding’ to the tables and mapping attributes to a REST response. The application development lifecycle meant changes to the database schema first, followed by the bindings, then the internal schema mapping, then the SOAP or JSON services, and eventually the client code. This all costs the project time and money. It also means that the ‘code’ (pick your language here) and the business logic would need to be modified to handle the changes to the model.
NoSQL

NoSQL is gaining supporters among many SQL shops for various reasons, including:

- Low cost
- Ability to handle unstructured data
- Scalability
- Performance

The first thing database folks notice is that there is no schema. These document-style storage engines can handle huge volumes of structured, semi-structured, and unstructured data. The very nature of schema-less documents allows changes to a document structure without having to go through the formal change management process (or data architect). The other major difference is that NoSQL (no schema) also means no joins or relationships. The document itself contains the embedded information by design, so an order entry would contain the customer with all the orders and the line items for each order in a single document. There are many different NoSQL vendors (popular NoSQL databases include MongoDB and Cassandra) whose products are being used for BI and analytics (read-only) purposes. We are also seeing many customers starting to use NoSQL for auditing, logging, and archival transactions.

Do I have to choose between SQL and NoSQL?

The purpose of this article is not to get into the religious debate about whether to use SQL or NoSQL. The bottom line is that both have their place and are suited to certain types of data: SQL for structured data and NoSQL for unstructured data. So why not have the capability to mix and match this data depending on the application? This can be done by creating a single REST API across both SQL and NoSQL databases. Why a single REST API? The answer is simple: the new agile and mobile world demands this ‘mashup’ of data in a document-style JSON response.

CQRS (Command Query Responsibility Segregation) Pattern

There are many design patterns for delivering high-performance RESTful services, but the one that stands out was described in an article written by Martin Fowler, one of the software industry's veterans.
He described the pattern called CQRS, which is even more relevant today in a ‘polyglot’ world of servers, data, services, and connections.

“We may want to look at the information in a different way to the record store, perhaps collapsing multiple records into one, or forming virtual records by combining information for different places. On the update side we may find validation rules that only allow certain combinations of data to be stored, or may even infer data to be stored that’s different from that we provide.” (Martin Fowler, 2011)

In this design pattern, the REST API read requests (GET) return documents from multiple sources (e.g. mashups). In the update process, the data is subject to business logic derivations, validations, event processing, and database transactions. This data may then be pushed back into the NoSQL store using asynchronous events. With the widespread adoption of NoSQL databases like MongoDB and other schema-less, high-capacity data stores, most developers are challenged with providing security, business logic, event handling, and integration with other systems. MongoDB, one of the popular NoSQL databases, shares many concepts with SQL databases. However, the MongoDB query language itself is very different from the SQL we all know.

How to implement CQRS pattern using a RESTful Architecture

A REST server should meet certain requirements to support the CQRS pattern. The server should run on-premise or in the cloud and appear to the mobile and web developer as an HTTP endpoint.
The server architecture should implement the following:

- Connections and mapping necessary for SQL and NoSQL connectivity, and the API services needed to handle GET, PUT, POST, and DELETE requests and return REST responses
- Security
- Business logic

Connections and Mapping

There are two main approaches to creating REST servers and APIs for SQL and NoSQL databases:

- Open source frameworks like Apache Tomcat and Spring/Hibernate
- Commercial frameworks like Espresso Logic

Open Source Frameworks

Using various open source frameworks and tools like Tomcat, Spring/Hibernate, Node.js, JDBC and MongoDB drivers, a REST server can be created, but we would still be left with the following tasks:

- Create and map the necessary SQL objects
- Create a REST server container and configurations
- Create Jersey/Jackson classes and annotations
- Create and define the REST API for tables, views, and procedures
- Hand-write validation, event and business logic
- Handle persistence, optimistic locking, transaction paging
- Add identity management and security by roles

Then we have to start down the same path for MongoDB: write code to connect, select, and return data in JSON, and then create the REST calls to merge these two different document styles into a single RESTful endpoint. This is a lot of work for a development team to manage and control. Frankly, it is pretty boring and repetitive, and it is better done by a well-designed framework.

Commercial Frameworks

Many commercial frameworks may take care of this complexity without the need for extensive programming. Here is an example from Espresso Logic and how it handles this complexity with a point-and-click interface:

- Running the REST server in the cloud or on-premise
- Connections to external SQL databases
- Object mapping to tables, views, and procedures
- Automatic creation of RESTful endpoints from the model
- Reactive business rules and a rich event model
- Integrated role-based security and authentication services
- Point-and-click document API creation for SQL and MongoDB endpoints

In the example, the editor shows a SQL resource (customersTransactions) joined with archived details from MongoDB (archivedTransactions). The MongoDB document for each customer may include transaction details, check images, customer service notes and other relevant account information. This new mashup becomes a single REST call that can be published to mobile and web application developers.

Security

Security is an important part of building and delivering RESTful services. It can be broken down into two parts: authentication and access control.

Authentication

Before allowing anyone access to corporate data, you want to use the existing corporate identity management (some call this authentication services) to capture and validate the user. This identity management service is based on existing corporate standards such as LDAP, Windows AD, or a SQL database.

Role-based Access Control

Each user may be assigned one or more corporate roles, and these roles are then assigned specific access privileges to each resource (e.g. READ, INSERT, UPDATE, and DELETE). Role-based access should also be able to restrict permissions to specific rows and columns of the API (e.g. sales reps can see only their own orders, or a manager can see and change the salaries in their department but cannot change their own). This restriction should be applied regardless of how or where the API is used or called. Remember, the SQL database already provides some level of security and access control, which must be considered when designing and delivering new front-end services to internal and external users.

Business Logic for REST

When data is updated through a REST server, several things need to happen. First, authentication and access control should determine whether this is a valid request and whether the user has rights to the endpoint. In addition, the server may need to de-alias REST attributes back to the actual SQL column names.
In a full-featured business logic server, there should be a series of events and business rules to perform various calculations and validations and to fire other events on dependent tables. Finally, the entire multi-table update is written back to the SQL database in a single transaction. Updates are then sent asynchronously to MongoDB as part of the commit event (after the SQL transaction has completed).

Conclusion

In the real world of API services, more complex document-style RESTful services are now a requirement. That is, the ability to create ‘mashups’ of data from multiple tables, NoSQL collections, and other external systems is a large part of this new design pattern. In addition, the ability to alias attribute names and formats from these source fields has become critical for partner and customer systems. Using REST with the CQRS pattern to blend MongoDB and SQL seamlessly with your existing data will become a major part of your future mobile strategy. To implement these REST services, one can either use open source tools and spend a lot of time, or select the right commercial framework. This framework should support cloud or on-premise connectivity, security, and API integration, as well as business logic. This will make the design and delivery of new application services more rapid and agile in the heterogeneous world of information.
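The write path described above (validate, commit to the system of record, then update the read store asynchronously after the commit) can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: `CqrsSketch`, `deposit` and `findDocument` are hypothetical names, in-memory maps stand in for the SQL database and for MongoDB, and no real driver or framework API is used.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical names throughout; in-memory maps stand in for the SQL and NoSQL stores.
class CqrsSketch {
    // Write model: the system of record (the SQL database in the article)
    private final Map<String, Long> accountBalances = new ConcurrentHashMap<>();
    // Read model: denormalized JSON documents (MongoDB in the article)
    private final Map<String, String> accountDocuments = new ConcurrentHashMap<>();

    // Command side: validate, commit, then project to the read model asynchronously
    CompletableFuture<Void> deposit(String accountId, long amountCents) {
        if (amountCents <= 0) {
            throw new IllegalArgumentException("deposit must be positive");
        }
        long newBalance = accountBalances.merge(accountId, amountCents, Long::sum);
        return CompletableFuture.runAsync(() ->
                accountDocuments.put(accountId,
                        "{\"account\":\"" + accountId + "\",\"balanceCents\":" + newBalance + "}"));
    }

    // Query side: reads are served from the read model only
    String findDocument(String accountId) {
        return accountDocuments.get(accountId);
    }

    public static void main(String[] args) {
        CqrsSketch service = new CqrsSketch();
        // join() is used here only so the demo output is deterministic
        service.deposit("acc-1", 2_500).join();
        System.out.println(service.findDocument("acc-1"));
    }
}
```

Because the projection runs after the commit, a reader may briefly see a stale document. The sketch also makes no attempt to order concurrent projections for the same account, which a real implementation would have to address.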

November 4, 2014

by Val Huber

· 16,260 Views

Spring Caching Abstraction and Google Guava Cache

Spring provides great out-of-the-box support for caching expensive method calls. The caching abstraction is covered in great detail elsewhere. My objective here is to cover one of the newer cache implementations that Spring provides from version 4.0 of the framework: Google Guava Cache.

In brief, consider a service which has a few slow methods:

    public class DummyBookService implements BookService {

        @Override
        public Book loadBook(String isbn) {
            // Slow method 1
        }

        @Override
        public List<Book> loadBookByAuthor(String author) {
            // Slow method 2
        }
    }

With the Spring caching abstraction, repeated calls with the same parameter can be sped up by an annotation on the method along these lines. Here the result of loadBook is cached in a "book" cache and the listing of books in another "books" cache:

    public class DummyBookService implements BookService {

        @Override
        @Cacheable("book")
        public Book loadBook(String isbn) {
            // Slow response time
        }

        @Override
        @Cacheable("books")
        public List<Book> loadBookByAuthor(String author) {
            // Slow listing
        }
    }

The caching abstraction requires a CacheManager to be available, which is responsible for managing the underlying caches that store the cached results. With the new Guava Cache support, the CacheManager looks like this:

    @Bean
    public CacheManager cacheManager() {
        return new GuavaCacheManager("books", "book");
    }

Google Guava Cache provides a rich API to pre-load the cache, set the eviction duration based on last access or creation time, set the size of the cache, and so on. If the cache is to be customized, a Guava CacheBuilder can be passed to the CacheManager:

    @Bean
    public CacheManager cacheManager() {
        GuavaCacheManager guavaCacheManager = new GuavaCacheManager();
        guavaCacheManager.setCacheBuilder(
                CacheBuilder.newBuilder().expireAfterAccess(30, TimeUnit.MINUTES));
        return guavaCacheManager;
    }

This works well if all the caches have a similar configuration. But what if the caches need to be configured differently? For example, in the sample above I may want the "book" cache to never expire but the "books" cache to expire after 30 minutes. Here the GuavaCacheManager abstraction does not work well. A better solution is to use a SimpleCacheManager, which provides a more direct way to get to each cache and can be configured this way:

    @Bean
    public CacheManager cacheManager() {
        SimpleCacheManager simpleCacheManager = new SimpleCacheManager();
        GuavaCache cache1 = new GuavaCache("book", CacheBuilder.newBuilder().build());
        GuavaCache cache2 = new GuavaCache("books", CacheBuilder.newBuilder()
                .expireAfterAccess(30, TimeUnit.MINUTES)
                .build());
        simpleCacheManager.setCaches(Arrays.asList(cache1, cache2));
        return simpleCacheManager;
    }

This approach works very nicely. If required, individual caches can even be backed by different caching engines: some by a simple hashmap, some by Guava or EhCache, and some by distributed caches like Gemfire.
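To make the per-cache configuration concrete without depending on Spring or Guava, here is a minimal plain-Java sketch of two named caches with different expiry policies. It is not how Spring or Guava implement caching: `ExpiringCache`, `CacheConfigDemo` and the injected clock are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical stand-in for one named cache; NOT Spring's or Guava's implementation.
class ExpiringCache<K, V> {
    private static final class Slot<T> {
        final T value;
        long lastAccessMillis;

        Slot(T value, long lastAccessMillis) {
            this.value = value;
            this.lastAccessMillis = lastAccessMillis;
        }
    }

    private final Map<K, Slot<V>> slots = new HashMap<>();
    private final long expireAfterAccessMillis; // <= 0 means entries never expire
    private final LongSupplier clock;           // injected so the demo can control time

    ExpiringCache(long expireAfterAccessMillis, LongSupplier clock) {
        this.expireAfterAccessMillis = expireAfterAccessMillis;
        this.clock = clock;
    }

    // Returns the cached value, or runs the slow loader and remembers its result
    synchronized V get(K key, Function<K, V> loader) {
        long now = clock.getAsLong();
        Slot<V> slot = slots.get(key);
        boolean expired = slot != null
                && expireAfterAccessMillis > 0
                && now - slot.lastAccessMillis >= expireAfterAccessMillis;
        if (slot == null || expired) {
            slot = new Slot<>(loader.apply(key), now);
            slots.put(key, slot);
        }
        slot.lastAccessMillis = now;
        return slot.value;
    }
}

class CacheConfigDemo {
    public static void main(String[] args) {
        long[] nowMillis = {0};
        int[] loads = {0};
        Function<String, String> slowLookup = key -> key + " (load #" + (++loads[0]) + ")";

        // "book" never expires; "books" expires 30 minutes after the last access
        ExpiringCache<String, String> book = new ExpiringCache<>(0, () -> nowMillis[0]);
        ExpiringCache<String, String> books = new ExpiringCache<>(30 * 60_000L, () -> nowMillis[0]);

        book.get("isbn-1", slowLookup);
        books.get("author-1", slowLookup);
        nowMillis[0] = 31 * 60_000L; // pretend 31 minutes have passed
        System.out.println(book.get("isbn-1", slowLookup));    // still served from the cache
        System.out.println(books.get("author-1", slowLookup)); // expired, so loaded again
    }
}
```

In a real Spring application the expiry logic belongs to the cache engine; the point here is only that each named cache carries its own policy.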

November 3, 2014

by Biju Kunjummen

· 60,150 Views · 8 Likes

BigList: a Scalable High-Performance List for Java

As memory gets cheaper and cheaper, our applications can keep more data readily available in main memory, or even all of it, as in the case of in-memory databases. To make real use of the growing heap memory, appropriate data structures must be used. Interestingly enough, there seem to be no specialized implementations for lists, by far the most used collection. This article introduces BigList, a list designed for handling large collections, where large means that all data still fits completely in heap memory. The article shows the special requirements for handling large collections, how BigList is implemented and how it compares to other list implementations.

1. Requirements

What are the special requirements we need to meet to handle large collections efficiently?

Memory:

- Sparing use of memory: The list should need little memory for its own implementation, so memory can be used for storing application data.
- Specialized versions for primitives: It must be possible to store common primitives like ints in a memory-saving way.
- Avoid copying large data blocks: If the list grows or shrinks, only a small part of the data should be copied around, as copying becomes expensive and needs the same amount of memory again.
- Data sharing: Copying collections is a frequent operation which should be efficient even if the collection is large. An efficient implementation requires some sort of data sharing, as copying all elements is per se a costly operation.

Performance:

- Good performance for normal operations like reading, storing, adding or removing single elements.
- Great performance for bulk operations like adding or removing multiple elements.
- Predictable overhead of operations, so similar operations should need a similar amount of time without excessive worst-case scenarios.

If an implementation does not offer these features, some operations will not only be slow for really large collections, but will become simply infeasible because memory or CPU usage will be too high.
Introduction to BigList

BigList is a member of the Brownies Collections library, which also includes GapList, the fastest list implementation known. GapList is a drop-in replacement for ArrayList, LinkedList, or ArrayDeque and offers fast access by index and fast insertion/removal at both the beginning and the end. GapList, however, has not been designed to cope with large collections, so adding or removing elements can make it necessary to copy a lot of elements around, which will lead to performance problems. Copying a large collection also becomes an expensive operation, both in terms of time and memory consumption. It will simply not be possible to make a copy of a large collection if the same amount of memory is not available a second time. And this is a common operation, as you often want to return through your API a copy of an internal list which has no reference to the original list.

BigList addresses both problems. The first problem is solved by storing the collection elements in fixed-size blocks. Add or remove operations are then implemented to use only data from one block. The copying problem is solved by maintaining a reference count on the fixed-size blocks, which makes a copy-on-write approach possible. For efficient access to the fixed-size blocks, they are maintained in a specialized tree structure.

2. BigList Details

Each BigList instance stores the following information:

- Elements are stored in fixed-size blocks
- A single block is implemented as a GapList with a reference count for sharing
- All blocks are maintained in a tree for fast access
- Access information for the current block is cached for better performance

The following illustration shows these details for two instances of BigList which share one block.

2.1 Use of Blocks

Elements are stored in fixed-size blocks with a default block size of 1000.
While this default may look pretty small, it is a good choice most of the time because it guarantees that write operations only need to move a few elements. Read operations profit from locality of reference by using the currently cached block. It is, however, possible to specify the block size for each created BigList instance. All blocks except the first are allocated with this fixed size and will not grow or shrink. The first block starts small and grows up to the specified block size, to save memory for small lists. If a block has reached its maximum size and more data must be stored in it, the block needs to be split into two blocks before more elements can be stored. If elements are added to the head or tail of the list, the block will only be filled up to a threshold of 95%. This allows inserts into the block without the immediate need for split operations. To save memory, blocks are also merged. This happens automatically if two adjacent blocks are both filled less than 35% after a remove operation.

2.2 Locality of Reference

For each operation on BigList, the affected block must be determined first. As predicted by locality of reference, most of the time the affected block will be the same as for the last operation. The implementation of BigList has therefore been designed to profit from locality of reference, which makes common operations like iterating over a list very efficient. Instead of always traversing the block tree to determine the block needed for an operation, the lower and upper index of the last used block are cached. So if the next operation happens near the previous one, the same block can be used again without traversing the tree.

2.3 Reference Counting

To support a copy-on-write approach, BigList stores a reference count for each fixed-size block indicating whether the block is private or shared. Initially all blocks are private, with a reference count of 0, so that modifications are allowed.
If a list is copied, the reference count of its blocks is incremented, which prohibits further modifications. Before a modification can then be made, the block must be copied, decrementing the original block's reference count and setting the reference count of the copy to 0. The reference count of a block is also decremented by the finalizer of BigList.

3. Benchmarks

To prove the excellence of BigList in time and memory consumption, we compare it with some other List implementations. And here are the nominees:

Type | Library | Description
BigList | brownie-collections | List optimized for storing a large number of elements. Elements are stored in fixed-size blocks which are maintained in a tree.
GapList | brownie-collections | Fastest list implementation known. Fast access by index and fast insertion/removal at end and at beginning.
ArrayList | JDK | Maintains elements in a single array. Fast access by index, fast insertion/removal at end, but slow at beginning.
LinkedList | JDK | Elements stored using a linked list. Slow access by index. Memory overhead for each element stored.
TreeList | commons-collections | Elements stored in a tree. No operation is really fast, but there are no very slow operations either. Memory overhead for each element stored.
FastTable | javolution | Elements stored in a "fractal"-like data structure. Good performance and use of memory. However, no bulk operations, and the collection does not shrink.

3.1 Handling Objects

In the first part of the benchmark, we compare memory consumption and performance of the different list implementations. Let's first have a look at the memory consumption.
The following table shows the bytes used to hold a list with 1'000'000 null elements:

         BigList     GapList     ArrayList   LinkedList   TreeList     FastTable
32 bit   4'298'466   4'021'296   4'861'992   8'000'028    18'000'028   4'142'892
64 bit   8'544'254   8'042'552   9'723'964   16'000'044   26'000'044   8'222'988

We can see that BigList, GapList, ArrayList, and FastTable add only a small overhead to the stored elements, whereas LinkedList needs twice the memory and TreeList even more.

Now to the performance. Here are the results of 9 benchmarks which have been run for each of the 6 candidates with JDK 8 in a 32 bit Windows environment and a list of 1'000'000 elements.

The result table can be read as follows:

- The fastest candidate for each test has a relative performance indicator of 1.
- The values for the other candidates indicate how many times slower they were, so a factor of 3 means that this implementation was 3 times slower than the best one.

The different factors are colored like this:

- factor 1: green (best)
- factor <5: blue (good)
- factor <25: yellow (moderate)
- factor >25: red (poor)

If we look at the benchmark results, we can see that the performance of BigList is best for all except two benchmarks. The only moderate result is produced when getting elements in a totally random order. This could be expected, as there is no locality of reference which can be exploited, so for each access the block tree must be traversed to find the correct block. Luckily this is a rare use case in real applications. And the benchmark "Get local" shows that performance is back to good as soon as elements next to each other must be retrieved, as is the case if we iterate over a range.

3.2 Handling Primitives

In the second part of the benchmark, we want to see how big the savings are if we use a data structure specialized for storing primitives compared to storing wrapped objects. For this reason, we compare IntBigList and BigList.
The following table shows the memory needed to store 1'000'000 integer values:

         BigList      IntBigList
32 bit   16'298'454   4'534'840
64 bit   28'544'234   4'570'432

Obviously it is easy to save a lot of memory. In a 32 bit environment, IntBigList needs only about 28% of the memory, in a 64 bit environment only about 16%! These figures become plausible if you recall that a simple object needs 8 bytes in a 32 bit environment, but already 16 bytes in a 64 bit environment, whereas a primitive integer value always needs only 4 bytes.

The measurable performance gain is not so impressive: it is something below 10% for simple get operations and something above 10% for add and remove operations. These numbers show that the JVM is impressively fast in creating wrapper objects and in boxing and unboxing primitive values. We must however also consider that each created object will need to be garbage collected once and therefore adds to the total load of the JVM.

4. Summary

BigList is a scalable high-performance list for storing large collections. Its design guarantees that all operations will be predictable and efficient both in terms of performance and memory consumption; even copying large collections is tremendously fast. Benchmarks have proven this and shown that BigList outperforms other known list implementations. The library also offers specialized implementations for primitive types like IntBigList which save much memory and provide superior performance.

BigList for handling objects and the specializations for handling primitives are part of the Brownies Collections library and can be downloaded from http://www.magicwerk.org/collections.
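The copy-on-write sharing of blocks described before the benchmarks can be sketched in a few lines. This is only a minimal illustration and not the actual BigList code: the class names Block and TinyList are hypothetical, a single int[] block stands in for the tree of blocks, and the decrement done by the finalizer is left out.

```java
public class CopyOnWriteBlockSketch {

    /** A block of elements plus a count of additional lists sharing it (hypothetical class). */
    static final class Block {
        final int[] values;
        int refCount; // 0 means the block is referenced by exactly one list

        Block(int[] values) {
            this.values = values;
        }
    }

    /** A toy list with a single block; the real BigList keeps many blocks in a tree. */
    static final class TinyList {
        private Block block;

        TinyList(int... values) {
            this.block = new Block(values.clone());
        }

        private TinyList(Block shared) {
            this.block = shared;
        }

        /** Copying only shares the block and increments its reference count. */
        TinyList copy() {
            block.refCount++;
            return new TinyList(block);
        }

        int get(int index) {
            return block.values[index];
        }

        /** A shared block is copied before the write, so the other list is unaffected. */
        void set(int index, int value) {
            if (block.refCount > 0) {
                block.refCount--;                         // give up our share of the old block
                block = new Block(block.values.clone());  // the copy starts with refCount 0
            }
            block.values[index] = value;
        }
    }

    public static void main(String[] args) {
        TinyList a = new TinyList(1, 2, 3);
        TinyList b = a.copy();   // no element data is copied here
        b.set(0, 42);            // this write triggers the copy of the shared block
        System.out.println(a.get(0) + " " + b.get(0)); // prints "1 42"
    }
}
```

After the write, both lists own their block exclusively again, so later modifications of either list happen in place.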

November 3, 2014

by Thomas Mauch

· 33,019 Views · 10 Likes

Building a REST API with JAXB, Spring Boot and Spring Data

If someone asked you to develop a REST API on the JVM, which frameworks would you use? I was recently tasked with such a project. My client asked me to implement a REST API to ingest requests from a 3rd party. The project entailed consuming XML requests, storing the data in a database, then exposing the data to internal applications with a JSON endpoint. Finally, it would allow taking in a JSON request and turning it into an XML request back to the 3rd party.

With the recent release of Apache Camel 2.14 and my success using it, I started by copying my Apache Camel / CXF / Spring Boot project and trimming it down to the bare essentials. I whipped together a simple Hello World service using Camel and Spring MVC. I also integrated Swagger into both. Both implementations were pretty easy to create (sample code), but I decided to use Spring MVC. My reasons were simple: its REST support was more mature, I knew it well, and Spring MVC Test makes it easy to test APIs.

Camel's Swagger support without web.xml

As part of the aforementioned spike, I learned how to configure Camel's REST and Swagger support using Spring's JavaConfig and no web.xml. I made this into a sample project and put it on GitHub as camel-rest-swagger.

This article shows how I built a REST API with Java 8, Spring Boot/MVC, JAXB and Spring Data (JPA and REST components). I stumbled a few times while developing this project, but figured out how to get over all the hurdles. I hope this helps the team that's now maintaining this project (my last day was Friday) and those that are trying to do something similar.

XML to Java with JAXB

The data we needed to ingest from a 3rd party was based on the NCPDP standards. As a member, we were able to download a number of XSD files, put them in our project and generate Java classes to handle the incoming/outgoing requests. I used the maven-jaxb2-plugin to generate the Java classes.
    <plugin>
        <groupId>org.jvnet.jaxb2.maven2</groupId>
        <artifactId>maven-jaxb2-plugin</artifactId>
        <version>0.8.3</version>
        <executions>
            <execution>
                <goals>
                    <goal>generate</goal>
                </goals>
            </execution>
        </executions>
        <configuration>
            <args>
                <arg>-XtoString</arg>
                <arg>-Xequals</arg>
                <arg>-XhashCode</arg>
                <arg>-Xcopyable</arg>
            </args>
            <plugins>
                <plugin>
                    <groupId>org.jvnet.jaxb2_commons</groupId>
                    <artifactId>jaxb2-basics</artifactId>
                    <version>0.6.4</version>
                </plugin>
            </plugins>
            <schemaDirectory>src/main/resources/schemas/ncpdp</schemaDirectory>
        </configuration>
    </plugin>

The first error I ran into was about a property already being defined.

    [INFO] --- maven-jaxb2-plugin:0.8.3:generate (default) @ spring-app ---
    [ERROR] Error while parsing schema(s).Location [ file:/Users/mraible/dev/spring-app/src/main/resources/schemas/ncpdp/structures.xsd{1811,48}].
    com.sun.istack.SAXParseException2; systemId: file:/Users/mraible/dev/spring-app/src/main/resources/schemas/ncpdp/structures.xsd;
        lineNumber: 1811; columnNumber: 48; Property "MultipleTimingModifierAndTimingAndDuration" is already defined.
        Use <jaxb:property> to resolve this conflict.
        at com.sun.tools.xjc.ErrorReceiver.error(ErrorReceiver.java:86)

I was able to work around this by upgrading to maven-jaxb2-plugin version 0.9.1. I created a controller and stubbed out a response with hard-coded data. I confirmed the incoming XML-to-Java marshalling worked by testing with a sample request provided by our 3rd party customer. I started with a curl command, because it was easy to use and could be run by anyone with the file and curl installed.

    curl -X POST -H 'Accept: application/xml' -H 'Content-Type: application/xml' \
        --data-binary @sample-request.xml http://localhost:8080/api/message -v

This is when I ran into another stumbling block: the response wasn't getting marshalled back to XML correctly. After some research, I found out this was caused by the lack of @XmlRootElement annotations on my generated classes. I posted a question to Stack Overflow titled "Returning JAXB-generated elements from Spring Boot Controller". After banging my head against the wall for a couple of days, I figured out the solution. I created a bindings.xjb file in the same directory as my schemas. This causes JAXB to generate @XmlRootElement on classes.
To add namespace prefixes to the returned XML, I had to modify the maven-jaxb2-plugin configuration to add a couple of arguments.

    <arg>-extension</arg>
    <arg>-Xnamespace-prefix</arg>

And add a dependency:

    <dependency>
        <groupId>org.jvnet.jaxb2_commons</groupId>
        <artifactId>jaxb2-namespace-prefix</artifactId>
        <version>1.1</version>
    </dependency>

Then I modified bindings.xjb to include the package and prefix settings. I also moved into a global setting. I eventually had to add prefixes for all schemas and their packages. I learned how to add prefixes from the namespace-prefix plugin's page.

Finally, I customized the code-generation process to generate Joda-Time's DateTime instead of the default XMLGregorianCalendar. This involved a couple of custom XmlAdapters and a couple of additional lines in bindings.xjb. You can see the adapters and bindings.xjb with all necessary prefixes in this gist. Nicolas Fränkel's "Customize your JAXB bindings" was a great resource for making all this work.

I wrote a test to prove that the ingest API worked as desired.

    @RunWith(SpringJUnit4ClassRunner.class)
    @SpringApplicationConfiguration(classes = Application.class)
    @WebAppConfiguration
    @DirtiesContext(classMode = DirtiesContext.ClassMode.AFTER_CLASS)
    public class InitiateRequestControllerTest {

        @Inject
        private InitiateRequestController controller;

        private MockMvc mockMvc;

        @Before
        public void setup() {
            MockitoAnnotations.initMocks(this);
            this.mockMvc = MockMvcBuilders.standaloneSetup(controller).build();
        }

        @Test
        public void testGetNotAllowedOnMessagesApi() throws Exception {
            mockMvc.perform(get("/api/initiate")
                    .accept(MediaType.APPLICATION_XML))
                    .andExpect(status().isMethodNotAllowed());
        }

        @Test
        public void testPostPaInitiationRequest() throws Exception {
            // read the whole sample file into a single string
            String request = new Scanner(new ClassPathResource("sample-request.xml").getFile()).useDelimiter("\\z").next();

            mockMvc.perform(post("/api/initiate")
                    .accept(MediaType.APPLICATION_XML)
                    .contentType(MediaType.APPLICATION_XML)
                    .content(request))
                    .andExpect(status().isOk())
                    .andExpect(content().contentType(MediaType.APPLICATION_XML))
                    .andExpect(xpath("/message/header/to").string("3rdparty"))
                    .andExpect(xpath("/message/header/sendersoftware/sendersoftwaredeveloper").string("hid"))
                    .andExpect(xpath("/message/body/status/code").string("010"));
        }
    }

Spring Data for JPA and REST

With JAXB out of the way, I turned to creating an internal API that could be used by another application. Spring Data was fresh in my mind after reading about it last summer. I created classes for entities I wanted to persist, using Lombok's @Data to reduce boilerplate. I read the "Accessing Data with JPA" guide, created a couple of repositories and wrote some tests to prove they worked. I ran into an issue trying to persist Joda's DateTime and found Jadira provided a solution. I added its usertype.core as a dependency to my pom.xml:

    <dependency>
        <groupId>org.jadira.usertype</groupId>
        <artifactId>usertype.core</artifactId>
        <version>3.2.0.GA</version>
    </dependency>

... and annotated DateTime variables accordingly.

    @Column(name = "last_modified", nullable = false)
    @Type(type="org.jadira.usertype.dateandtime.joda.PersistentDateTime")
    private DateTime lastModified;

With JPA working, I turned to exposing REST endpoints. I used "Accessing JPA Data with REST" as a guide and was looking at JSON in my browser in a matter of minutes. I was surprised to see a "profile" service listed next to mine, and posted a question to the Spring Boot team. Oliver Gierke provided an excellent answer.

Swagger

Spring MVC's integration for Swagger has greatly improved since I last wrote about it. Now you can enable it with an @EnableSwagger annotation. Below is the SwaggerConfig class I used to configure Swagger and read properties from application.yml.
    @Configuration
    @EnableSwagger
    public class SwaggerConfig implements EnvironmentAware {
        public static final String DEFAULT_INCLUDE_PATTERN = "/api/.*";

        private RelaxedPropertyResolver propertyResolver;

        @Override
        public void setEnvironment(Environment environment) {
            this.propertyResolver = new RelaxedPropertyResolver(environment, "swagger.");
        }

        /**
         * Swagger Spring MVC configuration
         */
        @Bean
        public SwaggerSpringMvcPlugin swaggerSpringMvcPlugin(SpringSwaggerConfig springSwaggerConfig) {
            return new SwaggerSpringMvcPlugin(springSwaggerConfig)
                    .apiInfo(apiInfo())
                    .genericModelSubstitutes(ResponseEntity.class)
                    .includePatterns(DEFAULT_INCLUDE_PATTERN);
        }

        /**
         * API Info as it appears on the swagger-ui page
         */
        private ApiInfo apiInfo() {
            return new ApiInfo(
                    propertyResolver.getProperty("title"),
                    propertyResolver.getProperty("description"),
                    propertyResolver.getProperty("termsOfServiceUrl"),
                    propertyResolver.getProperty("contact"),
                    propertyResolver.getProperty("license"),
                    propertyResolver.getProperty("licenseUrl"));
        }
    }

After getting Swagger to work, I discovered that endpoints published with @RepositoryRestResource aren't picked up by Swagger. There is an open issue for Spring Data support in the swagger-springmvc project.

Liquibase Integration

I configured this project to use H2 in development and PostgreSQL in production. I used Spring profiles to do this and copied XML/YAML (for Maven and application*.yml files) from a previously created JHipster project.

Next, I needed to create a database. I decided to use Liquibase to create tables, rather than Hibernate's schema-export. I chose Liquibase over Flyway based on discussions in the JHipster project. Using Liquibase with Spring Boot is dead simple: add the following dependency to pom.xml, then place changelog files in src/main/resources/db/changelog.

    <dependency>
        <groupId>org.liquibase</groupId>
        <artifactId>liquibase-core</artifactId>
    </dependency>

I started by using Hibernate's schema-export and changing hibernate.ddl-auto to "create-drop" in application-dev.yml.
I also commented out the liquibase-core dependency. Then I set up a PostgreSQL database and started the app with "mvn spring-boot:run -Pprod".

I generated the Liquibase changelog from an existing schema using the following command (after downloading and installing Liquibase).

    liquibase --driver=org.postgresql.Driver --classpath="/Users/mraible/.m2/repository/org/postgresql/postgresql/9.3-1102-jdbc41/postgresql-9.3-1102-jdbc41.jar:/Users/mraible/snakeyaml-1.11.jar" --changeLogFile=/Users/mraible/dev/spring-app/src/main/resources/db/changelog/db.changelog-02.yaml --url="jdbc:postgresql://localhost:5432/mydb" --username=user --password=pass generateChangeLog

I did find one bug - the generateChangeLog command generates too many constraints in version 3.2.2. I was able to fix this by manually editing the generated YAML file.

Tip: If you want to drop all tables in your database to verify Liquibase creation is working in PostgreSQL, run the following commands:

    psql -d mydb
    drop schema public cascade;
    create schema public;

After writing minimal code for Spring Data and configuring Liquibase to create tables/relationships, I relaxed a bit, documented how everything worked and added a LoggingFilter. The LoggingFilter was handy for viewing API requests and responses.

    @Bean
    public FilterRegistrationBean loggingFilter() {
        LoggingFilter filter = new LoggingFilter();
        FilterRegistrationBean registrationBean = new FilterRegistrationBean();
        registrationBean.setFilter(filter);
        registrationBean.setUrlPatterns(Arrays.asList("/api/*"));
        return registrationBean;
    }

Accessing API with RestTemplate

The final step was to figure out how to access my new and fancy API with RestTemplate. At first, I thought it would be easy. Then I realized that Spring Data produces a HAL-compliant API, so its content is embedded inside an "_embedded" JSON key. After much trial and error, I discovered I needed to create a RestTemplate with HAL and Joda-Time awareness.
    @Bean
    public RestTemplate restTemplate() {
        ObjectMapper mapper = new ObjectMapper();
        mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
        mapper.registerModule(new Jackson2HalModule());
        mapper.registerModule(new JodaModule());

        // JSON converter that understands HAL and Joda-Time
        MappingJackson2HttpMessageConverter converter = new MappingJackson2HttpMessageConverter();
        converter.setSupportedMediaTypes(MediaType.parseMediaTypes("application/hal+json"));
        converter.setObjectMapper(mapper);

        // plain string converter for the XML requests
        StringHttpMessageConverter stringConverter = new StringHttpMessageConverter();
        stringConverter.setSupportedMediaTypes(MediaType.parseMediaTypes("application/xml"));

        List<HttpMessageConverter<?>> converters = new ArrayList<>();
        converters.add(converter);
        converters.add(stringConverter);

        return new RestTemplate(converters);
    }

The JodaModule was provided by the following dependency:

    <dependency>
        <groupId>com.fasterxml.jackson.datatype</groupId>
        <artifactId>jackson-datatype-joda</artifactId>
    </dependency>

With the configuration complete, I was able to write a MessagesApiITest integration test that posts a request and retrieves it using the API. The API was secured using basic authentication, so it took me a bit to figure out how to make that work with RestTemplate. Willie Wheeler's "Basic Authentication With Spring RestTemplate" was a big help.
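The header itself is simple to build: Basic authentication is just "user:password" encoded as Base64 and prefixed with "Basic ". Here is a minimal sketch of that step on its own, assuming Java 8's java.util.Base64 instead of the commons-codec Base64 class used in my test; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthHeaderSketch {

    /** Builds the value of the Authorization header for HTTP Basic authentication. */
    static String basicAuthHeader(String user, String password) {
        String plainCreds = user + ":" + password;
        String base64Creds = Base64.getEncoder()
                .encodeToString(plainCreds.getBytes(StandardCharsets.UTF_8));
        return "Basic " + base64Creds;
    }

    public static void main(String[] args) {
        // prints "Basic dXNlcjpwYXNz"
        System.out.println(basicAuthHeader("user", "pass"));
    }
}
```

The resulting string is what goes into the "Authorization" header of the HttpEntity passed to RestTemplate.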
    @RunWith(SpringJUnit4ClassRunner.class)
    @ContextConfiguration(classes = IntegrationTestConfig.class)
    public class MessagesApiITest {
        private final static Log log = LogFactory.getLog(MessagesApiITest.class);

        @Value("http://${app.host}/api/initiate")
        private String initiateApi;

        @Value("http://${app.host}/api/messages")
        private String messagesApi;

        @Value("${app.host}")
        private String host;

        @Inject
        private RestTemplate restTemplate;

        @Before
        public void setup() throws Exception {
            String request = new Scanner(new ClassPathResource("sample-request.xml").getFile()).useDelimiter("\\z").next();

            ResponseEntity<org.ncpdp.schema.transport.Message> response = restTemplate.exchange(getTestUrl(initiateApi),
                    HttpMethod.POST, getBasicAuthHeaders(request), org.ncpdp.schema.transport.Message.class,
                    Collections.emptyMap());
            assertEquals(HttpStatus.OK, response.getStatusCode());
        }

        @Test
        public void testGetMessages() {
            HttpEntity<String> request = getBasicAuthHeaders(null);
            ResponseEntity<PagedResources<Message>> result = restTemplate.exchange(getTestUrl(messagesApi),
                    HttpMethod.GET, request, new ParameterizedTypeReference<PagedResources<Message>>() {});
            HttpStatus status = result.getStatusCode();
            Collection<Message> messages = result.getBody().getContent();

            log.debug("messages found: " + messages.size());
            assertEquals(HttpStatus.OK, status);
            for (Message message : messages) {
                log.debug("message.id: " + message.getId());
                log.debug("message.datecreated: " + message.getDateCreated());
            }
        }

        private HttpEntity<String> getBasicAuthHeaders(String body) {
            String plainCreds = "user:pass";
            byte[] plainCredsBytes = plainCreds.getBytes();
            byte[] base64CredsBytes = Base64.encodeBase64(plainCredsBytes);
            String base64Creds = new String(base64CredsBytes);

            HttpHeaders headers = new HttpHeaders();
            headers.add("Authorization", "Basic " + base64Creds);
            headers.add("Content-Type", "application/xml");

            if (body == null) {
                return new HttpEntity<>(headers);
            } else {
                return new HttpEntity<>(body, headers);
            }
        }
    }

To get Spring Data to populate the message id, I created a custom RestConfig class to expose it.
I learned how to do this from Tommy Ziegler.

    /**
     * Used to expose ids for resources.
     */
    @Configuration
    public class RestConfig extends RepositoryRestMvcConfiguration {

        @Override
        protected void configureRepositoryRestConfiguration(RepositoryRestConfiguration config) {
            config.exposeIdsFor(Message.class);
            config.setBaseUri("/api");
        }
    }

Summary

This article explains how I built a REST API using JAXB, Spring Boot, Spring Data and Liquibase. It was relatively easy to build, but required some tricks to access it with Spring's RestTemplate. Figuring out how to customize JAXB's code generation was also essential to make things work.

I started developing the project with Spring Boot 1.1.7, but upgraded to 1.2.0.M2 after I found it supported Log4j2 and configuring Spring Data REST's base URI in application.yml. When I handed the project off to my client last week, it was using 1.2.0.BUILD-SNAPSHOT because of a bug when running in Tomcat.

This was an enjoyable project to work on. I especially liked how easy Spring Data makes it to expose JPA entities in an API. Spring Boot made things easy to configure once again and Liquibase seems like a nice tool for database migrations.

If someone asked me to develop a REST API on the JVM, which frameworks would I use? Spring Boot, Spring Data, Jackson, Joda-Time, Lombok and Liquibase. These frameworks worked really well for me on this particular project.

October 30, 2014

by Matt Raible

· 64,337 Views

Sharding Pitfalls Part III: Chunk Balancing and Collection Limits

In Parts 1 and 2 we have covered a number of common issues people run into when managing a sharded MongoDB cluster. In this final post of the series we will cover a subtle but important distinction in terms of balancing a sharded cluster, as well as an interesting limitation that can be worked around relatively easily, but is nonetheless surprising when it comes up.

6. Chunk balancing != data balancing != traffic balancing

The balancer in a sharded cluster cares about just one thing: Are chunks for a given collection evenly balanced across all shards? If they are not, then it will take steps to rectify that imbalance. This all sounds perfectly logical, and even with extra complexity like tagging involved the logic is pretty straightforward. If we assume that all chunks are equal, then we can rest assured that our data is being evenly balanced across all the shards in our cluster and rest easy at night.

Although that is sometimes, perhaps even frequently, the case, it is not always true - chunks are not always equal. There can be massive “jumbo” chunks that exceed the maximum chunk size (64MiB), completely empty chunks and everything in between. Let’s use an example from our first pitfall, the monotonically increasing shard key.

For our example, we have picked just such a key to shard on (date), and up until this point we have had just one shard and had not sharded the collection. We are about to add a second shard to our cluster, and so we enable sharding on the collection and do the necessary admin work to add the new shard into the cluster. Once the collection is enabled for sharding, the first shard contains all the newly minted chunks. Let’s represent them in a simplified table of 10 chunks. This is not representative of a real data set, but it will do for illustrative purposes:

Table 1 - Initial Chunk Layout

Now we add our second shard. The balancer will kick in and attempt to distribute the chunks evenly.
It will do this by moving the lowest range chunks to the new shard until the counts are identical. Once it is finished balancing, our table now looks like this:

Table 2 - Balanced Chunk Layout

That looks pretty good at the moment, but let’s imagine that more recent chunks are more likely to have more activity (updates, say) than older chunks. Adding the traffic share estimates for each chunk shows that shard1 is taking far more traffic (72%) than shard2 (28%), despite the chunks seeming balanced overall based on the approximate size. Hence, chunk balancing is not equal to traffic balancing.

Using that same example, let’s add another wrinkle - periodic deletion of old data. Every 3 months we run a job to delete any data older than 12 months. Let’s look at the impact of that on our table after we run it for the first time (assuming the first run happens on July 1st 2015).

Table 3 - Post-Delete Chunk Layout

The distribution of data is now completely skewed toward shard1 - shard2 is in fact empty! However, the balancer is completely unaware of this imbalance - the chunk count has remained the same the entire time, and as far as it is concerned the system is in a steady state. With no data on shard2, our traffic imbalance as seen above will be even worse, and we have essentially negated the benefit of having a second shard for this collection.

Possible Mitigation Strategies

- If data and traffic balance are important, select an appropriate shard key
- Move chunks manually to address the imbalances - swap “hot” chunks for “cool” chunks, empty chunks for larger chunks

7. Waiting too long to shard a collection (collection too large)

This is not very common, but when it falls on your shoulders, it can be quite challenging to solve. There is a maximum data size for a collection when it is initially split, which is a function of the chunk size and data size as noted on the limits page. If your collection contains less than 256GiB of data, then there will be no issue.
If the collection size exceeds 256GiB but is less than 400GiB, then MongoDB may be able to do an initial split without any special measures being taken. Otherwise, with larger initial data sizes and the default settings, the initial split will fail. It is worth noting that, once split, the collection may grow as needed and without any real limitations as long as you can continue to add shards as data size grows.

Possible Mitigation Strategies

Since the limit is dictated by the chunk size and the data size, and assuming there is not much to be done about the data size, the remaining variable is the chunk size. This is adjustable (default is 64MiB) and can be raised in order to let a large collection split initially and then reduced once that has been completed. The required chunk size increase will depend on the actual data size. However, this is relatively easy to work out - simply divide your data size by 256GiB and then multiply that figure by 64MiB (and round up if it is not a nice even number). As an example, let’s consider a 4TiB collection:

4TiB divided by 256GiB = 16
64MiB x 16 = 1024MiB

Hence, set the max chunk size to 1024MiB, then perform the initial sharding of the collection, and then finally reduce the chunk size back to 64MiB using the same procedure.

Thanks for reading through the Sharding Pitfall series! If you want to learn more about managing MongoDB deployments at scale, sign up for my online education course, MongoDB Advanced Deployment and Operations.

Planning for scale? No problem: MongoDB is here to help. Get a preview of what it’s like to work with MongoDB’s Technical Services Team. Give us some details on your deployment and we can set you up with an expert who can provide detailed guidance on all aspects of scaling with MongoDB, based on our experience with hundreds of deployments.
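The chunk size arithmetic from pitfall 7 can also be written down as a small helper. This is only a sketch of the rule of thumb given above (data size divided by 256GiB, rounded up, times 64MiB); the class and method names are invented for the example and are not part of any MongoDB API.

```java
public class InitialChunkSizeSketch {
    static final long MIB = 1024L * 1024L;
    static final long GIB = 1024L * MIB;
    static final long TIB = 1024L * GIB;

    static final long DEFAULT_CHUNK_SIZE_MIB = 64;      // default max chunk size
    static final long SPLIT_LIMIT_BYTES = 256L * GIB;   // data size that splits without special measures

    /** Returns the chunk size in MiB to use for the initial split of a collection of the given size. */
    static long requiredChunkSizeMiB(long dataSizeBytes) {
        // integer ceiling of dataSize / 256GiB, i.e. "round up if it is not a nice even number"
        long factor = (dataSizeBytes + SPLIT_LIMIT_BYTES - 1) / SPLIT_LIMIT_BYTES;
        return Math.max(1, factor) * DEFAULT_CHUNK_SIZE_MIB;
    }

    public static void main(String[] args) {
        System.out.println(requiredChunkSizeMiB(4 * TIB));    // prints 1024
        System.out.println(requiredChunkSizeMiB(100 * GIB));  // prints 64 (default is enough)
    }
}
```

Once the initial split has completed, the chunk size would be set back to the 64MiB default as described above.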

October 27, 2014

by Francesca Krihely

· 4,306 Views

AngularJS: Different Ways of Using Array Filters

In this article, we learn how to utilize AngularJS's array filter features in a variety of use cases. Read on to get started.

October 24, 2014

by Siva Prasad Reddy Katamreddy

· 315,829 Views · 5 Likes

Understanding Information Retrieval by Using Apache Lucene and Tika - Part 1

Introduction

In this tutorial, the Apache Lucene and Apache Tika frameworks will be explained through their core concepts (e.g. parsing, MIME detection, content analysis, indexing, scoring, boosting) via illustrative examples that should be applicable not only to seasoned software developers but also to beginners in content analysis and programming. We assume you have a working knowledge of the Java™ programming language and plenty of content to analyze.

Throughout this tutorial, you will learn:

- how to use Apache Tika's API and its most relevant functions
- how to develop code with the Apache Lucene API and its most important modules
- how to integrate Apache Lucene and Apache Tika in order to build your own piece of software that stores and retrieves information efficiently (project code is available for download)

What are Lucene and Tika?

According to Apache Lucene's site, Apache Lucene is an open source Java library for indexing and searching within large collections of documents. The index size is roughly 20-30% of the size of the text indexed, and the search algorithms provide features like:

- ranked searching - best results returned first
- many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more. In this tutorial we will demonstrate only phrase queries.
- fielded search (e.g. title, author, contents)
- sorting by any field
- flexible faceting, highlighting, joins and result grouping
- pluggable ranking models, including the Vector Space Model and Okapi BM25

But Lucene's main purpose is to deal directly with text, and we want to manipulate documents, which have various formats and encodings. For parsing document content and properties, the Apache Tika library is necessary. Apache Tika is a library that provides a flexible and robust set of interfaces that can be used in any context where metadata analysis and structured text extraction are needed.
The key component of Apache Tika is the Parser (org.apache.tika.parser.Parser) interface, because it hides the complexity of different file formats while providing a simple and powerful mechanism to extract structured text content and metadata from all sorts of documents.

Criteria for Tika parsing design

- Streamed parsing: The interface should require neither the client application nor the parser implementation to keep the full document content in memory or spooled to disk. This allows even huge documents to be parsed without excessive resource requirements.
- Structured content: A parser implementation should be able to include structural information (headings, links, etc.) in the extracted content. A client application can use this information, for example, to better judge the relevance of different parts of the parsed document.
- Input metadata: A client application should be able to include metadata like the file name or declared content type with the document to be parsed. The parser implementation can use this information to better guide the parsing process.
- Output metadata: A parser implementation should be able to return document metadata in addition to document content. Many document formats contain metadata, like the name of the author, that may be useful to client applications.
- Context sensitivity: While the default settings and behaviour of Tika parsers should work well for most use cases, there are still situations where more fine-grained control over the parsing process is desirable. It should be easy to inject such context-specific information into the parsing process without breaking the layers of abstraction.

Requirements

- Maven 2.0 or higher
- Java 1.6 SE or higher

Lesson 1: Automate metadata extraction from any file type

Our premises are the following: we have a collection of documents stored on disk or in a database and we would like to index them; these documents can be Word documents, PDFs, HTML files, plain text files etc.
As we are developers, we would like to write reusable code that extracts file properties regarding format (metadata) and file content. Apache Tika has a mimetype repository and a set of schemes (any combination of MIME magic, URL patterns, XML root characters, or file extensions) to determine if a particular file, URL, or piece of content matches one of its known types. If the content does match, Tika has detected its mimetype and can proceed to select the appropriate parser.

In the sample code, file type detection and parsing are covered inside the class com.retriever.lucene.index.IndexCreator, method indexFile.

Listing 1.1 Analyzing a file with Tika

    public static DocumentWithAbstract indexFile(Analyzer analyzer, File file) throws IOException {
        Metadata metadata = new Metadata();
        // content handler with a write limit of 10 MB instead of the default
        ContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
        ParseContext context = new ParseContext();
        Parser parser = new AutoDetectParser();
        InputStream stream = new FileInputStream(file); // open stream
        try {
            parser.parse(stream, handler, metadata, context); // parse the stream
        } catch (TikaException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } finally {
            stream.close(); // close the stream
        }
        // more code here
    }

The code above shows how a file is parsed using org.apache.tika.parser.AutoDetectParser; this implementation was chosen because we would like to parse documents regardless of their format. Also, for handling the content, the org.apache.tika.sax.BodyContentHandler was constructed with a writeLimit given as parameter (10*1024*1024); this constructor creates a content handler that writes XHTML body character events to an internal string buffer, and for documents with large content it is less likely to throw a SAXException (thrown when the default write limit is reached).
As a result of our parsing we have obtained a Metadata object that we can now use to detect file properties (title or any other header specific to a document format). Metadata processing can be done as described below (com.retriever.lucene.index.IndexCreator, method indexFileDescriptors):

Listing 1.2 Processing metadata

    private static Document indexFileDescriptors(String fileName, Metadata metadata) {
        Document doc = new Document();
        // store file name in a separate TextField
        doc.add(new TextField(ISearchConstants.FIELD_FILE, fileName, Store.YES));

        for (String key : metadata.names()) {
            String name = key.toLowerCase();
            String value = metadata.get(key);

            if (StringUtils.isBlank(value)) {
                continue;
            }

            if ("keywords".equalsIgnoreCase(key)) {
                // one field per keyword
                for (String keyword : value.split(",?(\\s+)")) {
                    doc.add(new TextField(name, keyword, Store.YES));
                }
            } else if (ISearchConstants.FIELD_TITLE.equalsIgnoreCase(key)) {
                doc.add(new TextField(name, value, Store.YES));
            } else {
                doc.add(new TextField(name, fileName, Store.NO));
            }
        }
        return doc;
    }

In the method presented above we store the file name in a separate field and also the document's title (a document can have a title different from its file name); we are not interested in storing other information.
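The regular expression used to split the keywords in Listing 1.2 can be tried out in isolation. The following is a minimal sketch that uses only the JDK; the class name and the sample values are made up for illustration. It shows that the pattern splits on whitespace with an optional preceding comma, so a comma without a following space does not separate keywords.

```java
import java.util.Arrays;
import java.util.List;

public class KeywordSplitSketch {

    /** Splits a keywords metadata value with the same pattern as in Listing 1.2. */
    static List<String> splitKeywords(String value) {
        return Arrays.asList(value.split(",?(\\s+)"));
    }

    public static void main(String[] args) {
        // prints [lucene, tika, search]
        System.out.println(splitKeywords("lucene, tika  search"));
        // prints [lucene,tika] - a single keyword, because there is no whitespace
        System.out.println(splitKeywords("lucene,tika"));
    }
}
```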

October 22, 2014

by Ana-Maria Mihalceanu

· 18,804 Views · 4 Likes

Measuring String Concatenation in Logging

Introduction

I had an interesting conversation today about the cost of using string concatenation in log statements. In particular, debug log statements often need to print out parameters or the current value of variables, so they need to build the log message dynamically. So you wind up with something like this:

    logger.debug("parameters: " + param1 + "; " + param2);

The issue arises when debug logging is turned off. Inside the logger.debug() statement a flag is checked and the method returns immediately; this is generally pretty fast. But the string concatenation had to occur to build the parameter prior to calling the method, so you still pay its cost. Since debug tends to be turned off in production, this is the time when this difference matters.

For this reason, we have pretty much all been trained to do this:

    if (logger.isDebugEnabled()) {
        logger.debug("parameters: " + param1 + "; " + param2);
    }

The discussion was about how much difference this “good practice” makes.

Caliper

This kind of question is perfect for a micro benchmark. My own favorite tool for this purpose is Caliper. Caliper runs small snippets of code enough times to average out variations. It passes in a number of repetitions, which it calculates in order to make sure that the whole method takes long enough to measure given the resolution of the system clock. Caliper also detects garbage collection and HotSpot compiling that might impact the accuracy of the tests.

Caliper uploads results to a Google App Engine application. Its sign-in supports Google logins and issues an API key that can be used to organize results and list them.

A typical timing method looks like this:

    public String timeMultStringNoCheck(long reps) {
        for (int i = 0; i < reps; i++) {
            logger.debug(strings[0] + " " + strings[1] + " " + strings[2] + " " + strings[3] + " " + strings[4]);
        }
        return strings[0];
    }

The return string is not used; it is included in the method solely to ensure that Java does not optimize away the method.
similarly, the content of the variables used should be randomly generated to avoid compile-time optimization. the full example is available in one of my github repositories , in the org.anvard.introtojava.log package. results the outcome is pretty interesting. string concatenation creates a pretty significant penalty, around two orders of magnitude for our example that concatenates five strings. interesting is that even in the case where we do not use string concatenation (i.e. the simplestring methods), the penalty is around 4x. this is probably the time spent pushing the string parameter onto the stack. the examples with doubles, using string.format() , is even more extreme, four orders of magnitude. the elapsed time here about 4ms, large enough that if the log statement were in a commonly used method, the performance hit would be noticeable. the final method, multstringparams , uses a feature that is available in the slf4j api. it works similarly to string.format() , but in a simple token replace fashion. most importantly, it does not perform the token replace unless the logging level is enabled. this makes this form just as fast as the “check” forms, but in a more compact form. of course, this only works if no special formatting is needed of the log string, or if the formatting can be shifted to a method such as tostring() . what is especially surprising is that this method did not show a penalty in building the object array necessary to pass the parameters into the method. this may have been optimized out by the java runtime since there was no chance of the parameters being used. conclusion the practice of checking whether a logging level is enabled before building the log statement is certainly worthwhile and should be something teams check during peer review.
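The idea of deferring message construction until the level is known to be enabled can be tried out without any third-party library. The sketch below is not the article's benchmark and does not use SLF4J; it swaps in java.util.logging, whose Logger.fine(Supplier<String>) overload (Java 8 and later) only invokes the supplier when the level is enabled. The class name LazyLogDemo, the logger name, and the counter are illustrative assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical demo class: counts how often the log message is actually built.
class LazyLogDemo {

    /** Logs one FINE (debug-level) message via a Supplier and returns the number of times it was built. */
    static int messageBuilds(boolean debugEnabled) {
        Logger logger = Logger.getLogger("lazy.log.demo"); // assumed logger name
        logger.setUseParentHandlers(false);                // keep the demo quiet on the console
        logger.setLevel(debugEnabled ? Level.FINE : Level.INFO);

        AtomicInteger builds = new AtomicInteger();
        String param1 = "alpha";
        String param2 = "beta";

        // The concatenation lives inside the supplier, so it runs only if the logger asks for it.
        Supplier<String> message = () -> {
            builds.incrementAndGet();
            return "parameters: " + param1 + "; " + param2;
        };

        logger.fine(message);
        return builds.get();
    }

    public static void main(String[] args) {
        System.out.println("builds with debug off: " + messageBuilds(false));
        System.out.println("builds with debug on:  " + messageBuilds(true));
    }
}
```

With this form the call site stays as compact as the parameterized SLF4J call, and the cost of building the string is only paid when debug output is wanted. It says nothing about absolute timings; those would still need a benchmark such as the Caliper one described above.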

October 20, 2014

by Alan Hohn

· 10,555 Views

Functional Programming with Java 8 Functions

Learn how to use lambda expressions and anonymous functions in Java 8.

October 20, 2014

by Edwin Dalorzo

· 330,214 Views · 25 Likes

Unlocking and Erasing FLASH with Segger J-Link

When using a bootloader (see “Serial Bootloader for the Freedom Board with Processor Expert”), I usually protect the bootloader flash areas so they do not get accidentally erased by the application ;-). When programming my boards with the P&E Multilink, the P&E firmware automatically unlocks and erases the chip. That’s not the case when working with the Segger J-Link, which requires extra steps.

[Figure: Protected flash pages with Processor Expert]

Failed Programming with Protected Flash

If I try to re-program the protected bootloader with the Segger J-Link (both in CodeWarrior and in Eclipse/KDS with GDB), the download silently fails. The effect is that the application on the board does not match what it should be. The Console view shows that the erase has failed (but no real error is reported) :-(:

    jlink: failed to erase sectors 0 @ address 0x00000000 (algo135: flash protection violation. flash is write-protected.)

[Figure: J-Link failed to erase in CodeWarrior]

The GNU ARM Eclipse Segger integration with GDB (e.g. Kinetis Design Studio) is not better: there is no error sign, and the only hint is an error hidden in the console log of JLinkGDBServerCL:

    error: failed to erase sectors 0 @ address 0x00000000 (algo135: flash protection violation. flash is write-protected.)

[Figure: Error Algo135 flash protection violation about failed flash programming]

What I need is to unprotect the memory and then erase it.

Erasing

The Segger J-Link features very fast programming. Part of that speed comes from the Segger firmware checking whether each flash page really needs to be programmed, and only then erasing and reprogramming that page. So downloading the same application twice will not touch the flash memory at all. Additionally, it does not do a complete erase of the device: it only programs the pages I’m using in my application. The first advantage of that is speed. The second is that it does not erase the application data I’m keeping in non-volatile memory (see “Configuration Data: Using the Internal FLASH instead of an external EEPROM”).

However, sometimes I really need to clear all my data in flash too, and then I need to erase all the flash pages on the device. Segger has a product named ‘J-Flash’ which is used to flash and erase devices outside of an IDE. There is a free-of-charge ‘Lite’ version available for download from Segger. This utility is not intended to be used for production. With it I have a GUI to erase and program my device.

[Figure: J-Flash Lite]

But J-Flash Lite cannot unlock my locked flash pages :-(.

If my device is not locked, I can use the CodeWarrior ‘Flash File to Target’ (see “Flashing with a Button (and a Magic Wand)”) to erase the device:

[Figure: Erasing device with Flash File to Target]

Again, this does not work if the device is locked. CodeWarrior has another feature called ‘Target Task’ which can be used to erase/unsecure (if your device is supported), see “Device is secure?”.

So I need to use a different tool to unlock and unprotect my device: the J-Link Commander.

Unlocking and Erasing with J-Link Commander

To unlock the device, Segger has a utility named ‘J-Link Commander’, available from http://www.segger.com/jlink-software.html . The binary is ‘jlink.exe’ on Windows and is a command line utility. To unlock the device, use:

    unlock kinetis

[Figure: Unlocking device]

But it seems that I need to do an unlock followed by an erase to make it permanent. To erase the device, I can use the same command line utility. I need to specify the device name first, and then I can erase it (example for the KL25Z):

    device MKL25Z128xxx4
    unlock kinetis
    erase

Note: I need to do the erase operation right after the unlock. The sequence is a) set device, b) unlock, and c) erase; otherwise it seems to fail.

[Figure: Unlocking and erasing with J-Link Commander]

Summary

In order to re-program the protected flash sectors with the Segger J-Link, I first need to unlock and mass erase the device. For this, there is the J-Link Commander utility, which has a command line interface to unprotect and erase the device. For erasing only, J-Flash (and J-Flash Lite) is a very useful tool, especially to get a ‘clean’ device memory.

To me, the Segger way and tools are very powerful. In this case, things are very flexible, but not that obvious. So I hope this post can help others get their devices unlocked and erased.

Happy Erasing :-)

October 17, 2014

by Erich Styger

· 20,439 Views

MySQL Replication: 'Got fatal error 1236' Causes and Cures

Originally Written by Muhammad Irfan

MySQL replication is a core process for maintaining multiple copies of data, and replication is a very important aspect of database administration. In order to synchronize data between master and slaves you need to make sure that data transfers smoothly, and to do so you need to act promptly on replication errors so that data synchronization can continue. Here on the Percona Support team, we often help customers with broken replication. In this post I’ll highlight the most critical replication error code, 1236, along with its causes and cures. MySQL replication error “Got fatal error 1236” can be triggered by multiple reasons and I will try to cover all of them.

– Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: ‘log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master; the first event ‘binlog.000201’ at 5480571’

This is a typical error on the slave server(s). It reflects a problem around the max_allowed_packet size. max_allowed_packet limits the size of a single SQL statement sent to the MySQL server, and also of a binary log event sent from master to slave. This error usually occurs when you have a different size of max_allowed_packet on the master and slave (i.e. the master max_allowed_packet size is greater than the slave's). When the MySQL master server tries to send a bigger packet than is allowed on the slave server, the slave server fails to accept it, hence the error. In order to alleviate this issue please make sure to have the same value for max_allowed_packet on both slave and master. You can read more about max_allowed_packet here.

This error usually occurs when updating a huge number of rows on the master, producing an event that doesn’t fit into the slave's max_allowed_packet because that value is lower than on the master. This usually happens with “LOAD DATA INFILE” or “INSERT .. SELECT” queries. In my experience, this can also be caused by application logic that generates a huge INSERT with junk data. Take into account that MySQL 5.6.6 and later have a variable, slave_max_allowed_packet, which controls the maximum packet size for the replication threads. It overrides the max_allowed_packet variable on the slave and its default value is 1 GB. In the post “max_allowed_packet and binary log corruption in MySQL,” my colleague Miguel Angel Nieto explains this error in detail.

– Got fatal error 1236 from master when reading data from binary log: ‘Could not find first log file name in binary log index file’

This error occurs when the binary log that the slave server requires for replication no longer exists on the master database server. In one scenario, your slave server is stopped for some reason for a few hours or days, and when you resume replication on the slave it fails with the above error. When you investigate you will find that the master server no longer has the binary logs which the slave server needs to pull in order to synchronize data. Possible reasons for this include the master server expiring binary logs via the system variable expire_logs_days, someone manually deleting binary logs from the master via the PURGE BINARY LOGS command or via ‘rm -f’, or a cron job which archives older binary logs to reclaim disk space. So, make sure the required binary logs always exist on the master server; you can update your procedures to keep the binary logs that the slave server requires by monitoring the “Relay_Master_Log_File” variable from SHOW SLAVE STATUS output. Moreover, if you have set expire_logs_days in my.cnf, old binlogs expire automatically and are removed. This means that when MySQL opens a new binlog file, it checks the older binlogs and purges any that are older than the value of expire_logs_days (in days). Percona Server added a feature to expire logs based on the total number of files used instead of the age of the binlog files. In that configuration, a spike of traffic could cause binlogs to disappear sooner than you expect. For more information check Restricting the number of binlog files.

In order to resolve this problem, the only clean solution I can think of is to re-create the slave server from a master server backup or from another slave in the replication topology.

– Got fatal error 1236 from master when reading data from binary log: ‘binlog truncated in the middle of event; consider out of disk space on master; the first event ‘mysql-bin.000525’ at 175770780, the last event read from ‘/data/mysql/repl/mysql-bin.000525’ at 175770780, the last byte read from ‘/data/mysql/repl/mysql-bin.000525’ at 175771648.’

Usually, this is caused by sync_binlog <> 1 on the master server, which means binary log events may not be synchronized to disk. There might be a committed SQL statement or row change (depending on your replication format) on the master that did not make it to the slave because the event is truncated. The solution would be to move the slave thread to the next available binary log and initialize the slave thread with the first available position in that binary log, as below:

    mysql> CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000526', MASTER_LOG_POS=4;

– [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: ‘Client requested master to start replication from impossible position; the first event ‘mysql-bin.010711’ at 55212580, the last event read from ‘/var/lib/mysql/log/mysql-bin.000711’ at 4, the last byte read from ‘/var/lib/mysql/log/mysql-bin.010711’ at 4.’, Error_code: 1236

I suspect the master server crashed or rebooted and hence binary log events were not synchronized to disk. This usually happens when sync_binlog != 1 on the master. You can investigate by inspecting the binary log contents as below:

    $ mysqlbinlog --base64-output=decode-rows --verbose --verbose --start-position=55212580 mysql-bin.010711

You will find that this is the last position of the binary log and the end of the binary log file. This issue can usually be fixed by moving the slave to the next binary log. In this case it would be:

    mysql> CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000712', MASTER_LOG_POS=4;

This will resume replication.

To avoid corrupted binlogs on the master, enabling sync_binlog=1 on the master helps in most cases. sync_binlog=1 will synchronize the binary log to disk after every commit. sync_binlog makes MySQL perform an fsync on the binary log in addition to the fsync by InnoDB. As a reminder, it has some cost, as it will synchronize the binary log write to disk after every commit. On the other hand, the sync_binlog=1 overhead can be minimal or negligible if the disk subsystem is SSD along with battery-backed cache (BBU). You can read more about this here in the manual.

sync_binlog is a dynamic option that you can enable on the fly. Here’s how:

    mysql-master> SET GLOBAL sync_binlog=1;

To make the change persistent across reboots, you can add this parameter in my.cnf.

As a side note, along with replication fixes, it is always a good idea to make sure your replica is in sync with the master and to validate data between master and slaves. Fortunately, Percona Toolkit has tools for this purpose: pt-table-checksum and pt-table-sync. Before checking for replication consistency, be sure to check the replication environment and then, later, to sync any differences.
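The “move the slave to the next binary log” fix above involves a small piece of arithmetic on the binlog file name: keep the base name, add one to the numeric suffix, keep the zero padding, and start at position 4. The following Java sketch only illustrates that step; the class BinlogAdvance and its method names are hypothetical and not part of MySQL or Percona Toolkit. It assumes the next file actually exists on the master, which should be confirmed (for example with SHOW BINARY LOGS) before running the generated statement.

```java
// Hypothetical helper: derives the next binary log name and the matching CHANGE MASTER TO statement.
class BinlogAdvance {

    /** Returns the binary log name that follows the given one, e.g. mysql-bin.000525 -> mysql-bin.000526. */
    static String nextBinlog(String current) {
        int dot = current.lastIndexOf('.');
        if (dot < 0 || dot == current.length() - 1) {
            throw new IllegalArgumentException("expected <base>.<number>, got: " + current);
        }
        String base = current.substring(0, dot);
        String digits = current.substring(dot + 1);
        long next = Long.parseLong(digits) + 1;
        // Preserve the zero padding width of the original suffix.
        return base + "." + String.format("%0" + digits.length() + "d", next);
    }

    /** Builds the statement that points the slave at the start (position 4) of the next binary log. */
    static String changeMasterStatement(String currentBinlog) {
        return "CHANGE MASTER TO MASTER_LOG_FILE='" + nextBinlog(currentBinlog) + "', MASTER_LOG_POS=4;";
    }

    public static void main(String[] args) {
        System.out.println(changeMasterStatement("mysql-bin.000525"));
    }
}
```

Generating the statement rather than executing it keeps the sketch free of any database driver and leaves the decision to run it with the administrator.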

October 15, 2014

by Peter Zaitsev

· 40,869 Views

R: Filtering data frames by column type ('x' must be numeric)

I’ve been working through the exercises from An Introduction to Statistical Learning and one of them required you to create a pairwise correlation matrix of variables in a data frame. The exercise uses the ‘Carseats’ data set which can be imported like so:

    > install.packages("ISLR")
    > library(ISLR)
    > head(Carseats)
      Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban  US
    1  9.50       138     73          11        276   120       Bad  42        17   Yes Yes
    2 11.22       111     48          16        260    83      Good  65        10   Yes Yes
    3 10.06       113     35          10        269    80    Medium  59        12   Yes Yes
    4  7.40       117    100           4        466    97    Medium  55        14   Yes Yes
    5  4.15       141     64           3        340   128       Bad  38        13   Yes  No
    6 10.81       124    113          13        501    72       Bad  78        16    No Yes

If we try to run the ‘cor’ function on the data frame we’ll get the following error:

    > cor(Carseats)
    Error in cor(Carseats) : 'x' must be numeric

As the error message suggests, we can’t pass non-numeric variables to this function, so we need to remove the categorical variables from our data frame. But first we need to work out which columns those are:

    > sapply(Carseats, class)
          Sales   CompPrice      Income Advertising  Population       Price   ShelveLoc         Age   Education
      "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"    "factor"   "numeric"   "numeric"
          Urban          US
       "factor"    "factor"

We can see a few columns of type ‘factor’ and luckily for us there’s a function which will help us identify those more easily:

    > sapply(Carseats, is.factor)
          Sales   CompPrice      Income Advertising  Population       Price   ShelveLoc         Age   Education
          FALSE       FALSE       FALSE       FALSE       FALSE       FALSE        TRUE       FALSE       FALSE
          Urban          US
           TRUE        TRUE

Now we can remove those columns from our data frame and create the correlation matrix:

    > cor(Carseats[sapply(Carseats, function(x) !is.factor(x))])
                      Sales   CompPrice       Income  Advertising   Population       Price          Age    Education
    Sales        1.00000000  0.06407873  0.151950979  0.269506781  0.050470984 -0.44495073 -0.231815440 -0.051955242
    CompPrice    0.06407873  1.00000000 -0.080653423 -0.024198788 -0.094706516  0.58484777 -0.100238817  0.025197050
    Income       0.15195098 -0.08065342  1.000000000  0.058994706 -0.007876994 -0.05669820 -0.004670094 -0.056855422
    Advertising  0.26950678 -0.02419879  0.058994706  1.000000000  0.265652145  0.04453687 -0.004557497 -0.033594307
    Population   0.05047098 -0.09470652 -0.007876994  0.265652145  1.000000000 -0.01214362 -0.042663355 -0.106378231
    Price       -0.44495073  0.58484777 -0.056698202  0.044536874 -0.012143620  1.00000000 -0.102176839  0.011746599
    Age         -0.23181544 -0.10023882 -0.004670094 -0.004557497 -0.042663355 -0.10217684  1.000000000  0.006488032
    Education   -0.05195524  0.02519705 -0.056855422 -0.033594307 -0.106378231  0.01174660  0.006488032  1.000000000

October 8, 2014

by Mark Needham

· 29,101 Views

Using Groovy To Import XML Into MongoDB

This year I’ve been demonstrating how easy it is to create modern web apps using AngularJS, Java and MongoDB. I also use Groovy during this demo to do the sorts of things Groovy is really good at: writing descriptive tests, and creating scripts. Due to the time pressures in the demo, I never really get a chance to go into the details of the script I use, so the aim of this long-overdue blog post is to go over this Groovy script in a bit more detail.

Firstly I want to clarify that this is not my original work - I borrowed most of the ideas for the demo from my colleague Ross Lawley. In this blog post he goes into detail on how he built up an application that finds the most popular pub names in the UK. There’s a section in there where he talks about downloading the open street map data and using Python to convert the XML into something more MongoDB-friendly - it’s this process that I basically stole, re-worked for coffee shops, and re-wrote for the JVM.

I’m assuming if you’ve worked with Java for any period of time, there has come a moment where you needed to use it to parse XML. Since my demo is supposed to be all about how easy it is to work with Java, I did not want to do this. When I wrote the demo I wasn’t really all that familiar with Groovy, but what I did know was that it has built-in support for parsing and manipulating XML, which is exactly what I wanted to do. In addition, creating Maps (the data structures, not the geographical ones) with Groovy is really easy, and this is effectively what we need to insert into MongoDB.

Goal Of The Script

· Parse an XML file containing open street map data of all coffee shops.
· Extract latitude and longitude XML attributes and transform into MongoDB GeoJSON.
· Perform some basic validation on the coffee shop data from the XML.
· Insert into MongoDB.
· Make sure MongoDB knows this contains query-able geolocation data.

The script is PopulateDatabase.groovy; that link will take you to the version I presented at JavaOne.

Firstly, We Need Data

I used the same service Ross used in his blog post to obtain the XML file containing “all” coffee shops around the world. Now, the open street map data is somewhat… raw and unstructured (which is why MongoDB is such a great tool for storing it), so I’m not sure I really have all the coffee shops, but I obtained enough data for an interesting demo using

    http://www.overpass-api.de/api/xapi?*[amenity=cafe][cuisine=coffee_shop]

The resulting XML file is in the GitHub project, but if you try this yourself you might (in fact, probably will) get different results.

In each XML record, a coffee shop has a unique identifier and a latitude and longitude as attributes of a node element. Within this node is a series of tag elements, all with k and v attributes. Each coffee shop has a varying number of these tags, and they are not consistent from shop to shop (other than amenity and cuisine, which we used to select this data).

Initialisation

Before doing anything else we want to prepare the database. The assumption of this script is that the collection we want to store the coffee shops in is either empty or full of stale data. So we’re going to use the MongoDB Java Driver to get the collection that we’re interested in, and then drop it.

There are two interesting things to note here:

· This Groovy script is simply using the basic Java driver. Groovy can talk quite happily to vanilla Java; it doesn’t need to use a Groovy library. There are Groovy-specific libraries for talking to MongoDB (e.g. the MongoDB GORM Plugin), but the Java driver works perfectly well.
· You don’t need to create databases or collections (collections are a bit like tables, but less structured) explicitly in MongoDB. You simply use the database and collection you’re interested in, and if they don’t already exist, the server will create them for you.

In this example, we’re just using the default constructor for the MongoClient, the class that represents the connection to the database server(s). This default is localhost:27017, which is where I happen to be running the database. However you can specify your own address and port - for more details on this see Getting Started With MongoDB and Java.

Turn The XML Into Something MongoDB-Shaped

So next we’re going to use Groovy’s XmlSlurper to read the open street map XML data that we talked about earlier. To iterate over every node we use: xmlSlurper.node.each. For those of you who are new to Groovy or new to Java 8, you might notice this is using a closure to define the behaviour to apply for every “node” element in the XML.

Create GeoJSON

Since MongoDB documents are effectively just maps of key-value pairs, we’re going to create a Map coffeeShop that contains the document structure that represents the coffee shop that we want to save into the database. Firstly, we initialise this map with the attributes of the node. Remember these attributes are the ID, the latitude and the longitude.

We’re going to save the ID as a value for a new field called openStreetMapId. We need to do something a bit more complicated with the latitude and longitude, since we need to store them as GeoJSON, which looks something like:

    { 'location' : { 'coordinates': [<longitude>, <latitude>], 'type' : 'Point' } }

In lines 12-14 of the script you can see that we create a Map that looks like the GeoJSON, pulling the lat and lon attributes into the appropriate places.

Insert Remaining Fields

Now for every tag element in the XML, we get the k attribute and check if it’s a valid field name for MongoDB (it won’t let us insert fields with a dot in, and we don’t want to override our carefully constructed location field). If so we simply add this key as the field and the matching v attribute as the value into the map. This effectively copies the OpenStreetMap key/value data into key/value pairs in the MongoDB document so we don’t lose any data, but we also don’t do anything particularly interesting to transform it.

Save Into MongoDB

Finally, once we’ve created a simple coffeeShop Map representing the document we want to save into MongoDB, we insert it into MongoDB if the map has a field called name. We could have checked this when we were reading the XML and putting it into the map, but it’s actually much easier just to use the pretty Groovy syntax to check for a key called name in coffeeShop.

When we want to insert the Map we need to turn it into a BasicDBObject, the Java Driver’s document type, but this is easily done by calling the constructor that takes a Map. Alternatively, there’s a Groovy syntax which would effectively do the same thing, which you might prefer:

    collection.insert(coffeeShop as BasicDBObject)

Tell MongoDB That We Want To Perform Geo Queries On This Data

Because we’re going to do a nearSphere query on this data, we need to add a “2dsphere” index on our location field. We created the location field as GeoJSON, so all we need to do is call createIndex for this field.

Conclusion

So that’s it! Groovy is a nice tool for this sort of script-y thing - not only is it a scripting language, but its built-in support for XML, really nice Map syntax and support for closures make it the perfect tool for iterating over XML data and transforming it into something that can be inserted into a MongoDB collection.
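The document shape described above (an openStreetMapId field, a GeoJSON location, the remaining tags copied across, and a check for a name before inserting) can be sketched in plain Java without the MongoDB driver. This is not the author's PopulateDatabase.groovy; the class CoffeeShopDocument, its method names, and the sample values are assumptions made for illustration. GeoJSON points list longitude first, then latitude.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds the map that would later be wrapped in a driver document type.
class CoffeeShopDocument {

    /** Builds a MongoDB-shaped map from a node's id/lat/lon attributes and its k/v tags. */
    static Map<String, Object> toDocument(String id, double lat, double lon, Map<String, String> tags) {
        Map<String, Object> location = new LinkedHashMap<>();
        location.put("coordinates", Arrays.asList(lon, lat)); // GeoJSON order: longitude, latitude
        location.put("type", "Point");

        Map<String, Object> shop = new LinkedHashMap<>();
        shop.put("openStreetMapId", id);
        shop.put("location", location);

        for (Map.Entry<String, String> tag : tags.entrySet()) {
            String key = tag.getKey();
            // Skip keys containing a dot and never overwrite the constructed location field.
            if (!key.contains(".") && !key.equals("location")) {
                shop.put(key, tag.getValue());
            }
        }
        return shop;
    }

    /** Mirrors the validation step: only shops with a name are worth inserting. */
    static boolean shouldInsert(Map<String, Object> shop) {
        return shop.containsKey("name");
    }

    public static void main(String[] args) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("amenity", "cafe");
        tags.put("name", "Example Coffee"); // made-up sample data
        Map<String, Object> shop = toDocument("12345", 51.5, -0.1, tags);
        System.out.println(shop + " insert? " + shouldInsert(shop));
    }
}
```

Compared with the Groovy version the Java code is noticeably wordier, which is the point the article makes about Groovy's Map syntax; the resulting structure is the same kind of nested map either way.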

October 8, 2014

by Trisha Gee

· 10,302 Views

The No Fluff Introduction to Big Data

big data traditionally has referred to a collection of data too massive to be handled efficiently by traditional database tools and methods. this original definition has expanded over the years to identify tools (big data tools) that tackle extremely large datasets (nosql databases, mapreduce, hadoop, newsql, etc.), and to describe the industry challenge posed by having data harvesting abilities that far outstrip the ability to process, interpret, and act on that data. technologists knew that those huge batches of user data and other data types were full of insights that could be extracted by analyzing the data in large aggregates. they just didn’t have any cheap, simple technology for organizing and querying these large batches of raw, unstructured data. the term quickly became a buzzword for every sort of data processing product’s marketing team. big data became a catchall term for anything that handled non-trivial sizes of data. sean owen, a data scientist at cloudera, has suggested that big data is a stage where individual data points are irrelevant and only aggregate analysis matters [1]. but this is true for a 400 person survey as well, and most people wouldn’t consider that very big. the key part missing from that definition is the transformation of unstructured data batches into structured datasets. it doesn’t matter if the database is relational or non-relational. big data is not defined by a number of terabytes, it’s rooted in the push to discoverhidden insights in data that companies used to disregard or throw away. due to the obstacles presented by large scale data management, the goal for developers and data scientists is two-fold: first, systems must be created to handle large scale data, and two, business intelligence and insights should be acquired from analysis of the data. acquiring the tools and methods to meet these goals is a major focus in the data science industry, but it’s a landscape where needs and goals are still shifting. 
what are the characteristics of big data? tech companies are constantly amassing data from a variety of digital sources that is almost without end—everything from email addresses to digital images, mp3s, social media communication, server traffic logs, purchase history, and demographics. and it’s not just the data itself, but data about the data (metadata). it is a barrage of information on every level. what is it that makes this mountain of data big data? one of the most helpful models for understanding the nature of big data is “the three vs:” volume, velocity, and variety. data volume volumeis the sheer size of the data being collected. there was a point in not-so-distant history where managing gigabytes of data was considered a serious task—now we have web giants like google and facebook handling petabytes of information about users’ digital activities. the size of the data is often seen as the first challenge of characterizing big data storage, but even beyond that is the capability of programs to provide architecture that can not only store but query these massive datasets. one of the most popular models for big data architecture comes from google’s mapreduce concept, which was the basis for apache hadoop, a popular data management solution. data velocity velocityis a problem that flows naturally from the volume characteristics of big data. data velocity is the speed at which data is flowing into a business’ infrastructure and the ability of software solutions to receive and ingest that data quickly. certain types of high-velocity data, such as streaming data, needs to be moved into storage and processed on the fly. this is often referred to as complex event processing (cep). the ability to intercept and analyze data that has a lifespan of milliseconds is a widely sought after. 
this kind of quick-fire data processing has long been the cornerstone of digital financial transactions, but it is also being used to track live consumer behavior or to bring instant updates to social media feeds. data variety variety refers to the source and type of data that is being collected. this data could be anything from raw image data to sensor readings, audio recordings, social media communication, and metadata. the challenge of data variety is being able to take raw, unstructured data and organize it so that an application can use it. this kind of structure can be achieved through architectural models that traditionally favor relational databases—but there is often a need to tidy up this data before it will even be useful to store in a raw form. sometimes a better option is to use a schema-less, non-relational database. how do you manage big data? the three vs is a great model for getting an initial understanding of what makes big data a challenge for businesses. however, big data is not just about the data itself, but the way that it is handled. a popular way of thinking about these challenges is to look at how a business stores, processes, and accesses their data. · store: can you store the vast amounts of data being collected? · process: can you organize, clean, and analyze the data collected? · access: can you search and query this data in an organized manner? the store, process, and access model is useful for two reasons: it reminds businesses that big data is largely about managing data, and it demonstrates the problem of scale within big data management. “big” is relative. the data batches that challenge some companies could be moved through a single google datacenter in under a minute. the only question a company needs to ask itself is how it will store and access increasingly massive amounts of data for its particular use case. there are several high level approaches that companies have turned to in the last few years. 
the traditional approach the traditional method for handling most data is to use relational databases. data warehouses are then used to integrate and analyze data from many sources. these databases are structured according to the concept of “early structure binding”—essentially, the database has predetermined “questions” that can be asked based on a schema. relational databases are highly functional, and the goal with this type of data processing is for the database to be fully transactional. although relational databases are the most common persistence type by a large margin (see key findings pg. 4-5), a growing number of use cases are not well-suited for relational schema. relational architectures tend to have difficulty when dealing with the velocity and variety of big data, since their structure is very rigid. when you perform functions such as join on many large data sets, the volume can be a problem as well. instead, businesses are looking to non-relational databases, or a mixture of both types, to meet data demand. the newer approach - mapreduce, hadoop, and nosql databases in the early 2000s, web giant google released two helpful web technologies: google file system (gfs) and mapreduce. both were new and unique approaches to the growing problem of big data, but mapreduce was chief among them, especially when it comes to its role as a major influencer of later solution models. mapreduce is a programming paradigm that allows for low cost data analysis and clustered scale-out processing. mapreduce became the primary architectural influence for the next big thing in big data: the creation of the big data management infrastructure known as hadoop. hadoop’s open source ecosystem and ease of use for handling large-scale data processing operations have secured a large part of the big data marketplace. besides hadoop, there was a host of non-relational (nosql) databases that emerged around 2009 to meet a different set of demands for processing big data. 
whereas hadoop is used for its massive scalability and parallel processing, nosql databases are especially useful for handling data stored within large multi-structured datasets. this kind of discrete data handling is not traditionally seen as a strong point of relational databases, but it’s also not the same kind of data operations that hadoop is running. the solution for many businesses ends up being a combination of these approaches to data management. finding hidden data insights once you get beyond storage and management, you still have the enormous task of creating actionable business intelligence (bi) from the datasets you’ve collected. this problem of processing and analyzing data is maybe one of the trickiest in the data management lifecycle. the best options for data analytics will favor an approach that is predictive and adaptable to changing data streams. the thing is, there’s so many types of analytic models and different ways of providing infrastructure for this process. your analytics solution should scale, but to what degree? scalability can be an enormous pain in your analytical neck, due to the problem of decreasing performance returns when scaling out an algorithm. ultimately, analytics tools rely on a great deal of reasoning and analysis to extract data patterns and data insights, but this capacity means nothing for a business if they can’t then create actionable intelligence. part of this problem is that many businesses have the infrastructure to accommodate big data, but they aren’t asking questions about what problems they’re going to solve with the data. implementing a big data-ready infrastructure before knowing what questions you want to ask is like putting the cart before the horse. but even if we do know the questions we want to ask, data analysis can always reveal many correlations with no clear causes. 
As organizations get better at processing and analyzing big data, the next major hurdle will be pinpointing the causes behind the trends by asking the right questions and embracing the complexity of our answers.

[1] http://www.quora.com/what-is-big-data

2014 Guide to Big Data

This guide explores the meaning of big data and how businesses use it, and uncovers new tools and techniques for the future of big data. It includes detailed profiles of 43 big data vendor solutions, in-depth articles written by industry experts, results from our survey of 850 IT professionals, and "Finding the Database for Your Use Case".

September 25, 2014

by Benjamin Ball

· 10,676 Views · 1 Like

How to Trace Transactions Across Every Layer of Your Distributed Software Stack

APM solutions give you great visibility into any code you have control over; however, today’s systems are largely a combination of code you write and off-the-shelf components, sitting on top of VMs/containers and cloud-based services. Thus, full system-wide visibility requires the ability to look into your APM tool as well as the log data produced by the components that you may not be able to instrument. This post outlines how APM solutions work and how you can combine them with your system logs to finally get an end-to-end and top-to-bottom view of your system behavior and performance.

How APM Tools Work: Hello APM

APM tools give you insight deep into your code and often work using cool techniques like dynamic instrumentation. Dynamic instrumentation essentially allows you to instrument your apps on the fly without any need to modify your application source code. Such techniques have become widely supported by mainstream programming languages, making it possible for even mere mortals to build their own APM tools. For example, since Java version 5, any Java application can be instrumented using java.lang.instrument, which allows for the instrumentation of any program running on the JVM through modification of the byte code of methods. It works by letting you alter the byte code of a class when it is being loaded, so that you can introduce monitoring capabilities such as execution profiling or event tracing. There’s a great beginner tutorial here by Julien Paoletti on how to write your first APM tool in Java. It essentially shows you how to intercept classes at class load time and then inject code into methods of your choice to record how long given methods take to execute. While building a full APM solution is not for the faint-hearted, you can easily begin to build your first ‘Hello APM’ tool and play around with JVM internals by following Julien’s post above.
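As a rough illustration of the class-load interception described above, here is a minimal sketch built on java.lang.instrument. The class name HelloApmAgent, the default package prefix com/example/, and the choice to only record class names are all assumptions made for this example. It does not rewrite any byte code; a real tool would do that inside transform, typically with a byte code library. When packaged as a real agent, the class would normally be declared public in its own file and named in the jar manifest.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical agent class; the name and the package filter are assumptions.
class HelloApmAgent implements ClassFileTransformer {
    // Internal (slash-separated) names of the matching classes seen at load time.
    final List<String> seen = new CopyOnWriteArrayList<>();
    private final String prefix;

    HelloApmAgent(String prefix) {
        this.prefix = prefix;
    }

    // Entry point the JVM invokes when started with -javaagent:<agent jar>
    // and the jar manifest names this class as Premain-Class.
    public static void premain(String agentArgs, Instrumentation inst) {
        String prefix = (agentArgs == null) ? "com/example/" : agentArgs;
        inst.addTransformer(new HelloApmAgent(prefix));
    }

    @Override
    public byte[] transform(ClassLoader loader, String className,
                            Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain,
                            byte[] classfileBuffer) {
        if (className != null && className.startsWith(prefix)) {
            seen.add(className);
            // A real tool would rewrite classfileBuffer here to add timing code.
        }
        return null; // null tells the JVM to keep the original byte code
    }
}
```

Returning null keeps the sketch harmless: the transformer observes class loading without changing behavior, which is a reasonable first step before injecting any timing code.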
Transaction Tracing

For those interested in moving beyond simply recording method execution time, you can begin to trace full transactions using some simple techniques. To do so, you essentially need a unique identifier to be passed along to any methods executed in that transaction. Continuing on from our hello world profiler above, you could do this by injecting a unique ID into the thread at any entry point in the system (e.g. new incoming requests). Java provides ThreadLocal storage that allows you to do just this. Using ThreadLocal you can embed a unique ID that gets recorded as each method executes.

Reconstructing a Transaction

On every invocation of a method along the transaction, data is logged. An example of what might be logged by an APM tool is as follows: unique transaction ID, sequence number, call depth, method details, and performance data. You can then easily piece together full transaction traces by ordering all method calls by sequence number. Further analysis can be applied to this information for a number of purposes. For example, by analysing the transactions, developers can easily construct design diagrams that help them quickly deduce the overall system structure. Understanding the relationships between system components reveals interdependencies, enabling developers to anticipate potential conflicts, to debug problems, and to reason about their system design (which in turn can have a major impact on system performance).

Tracing Transactions Across the Network

The real challenge with transaction tracing can come when you are dealing with distributed components. In such scenarios you need to be able to trace transactions across the network. One approach here is to piggy-back the necessary correlation data (unique transaction ID) onto the request from a client to the remote server.
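The ThreadLocal technique and the sequence-number ordering from the Transaction Tracing and Reconstructing a Transaction sections above can be sketched as follows. All names here (Tracer, TraceRecord, begin, traced, reconstruct) are hypothetical and not taken from any APM product; the in-memory list stands in for whatever log an actual tool would write to.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;

// One logged method call: the fields mirror the list in the text above.
class TraceRecord {
    final String txId;
    final int sequence;
    final int depth;
    final String method;
    final long nanos;

    TraceRecord(String txId, int sequence, int depth, String method, long nanos) {
        this.txId = txId;
        this.sequence = sequence;
        this.depth = depth;
        this.method = method;
        this.nanos = nanos;
    }
}

// Hypothetical tracer; a sketch of the idea, not a real APM API.
class Tracer {
    private static final class Context {
        final String txId;
        int nextSequence;
        int depth;
        Context(String txId) { this.txId = txId; }
    }

    private static final ThreadLocal<Context> CURRENT = new ThreadLocal<>();
    private static final List<TraceRecord> LOG =
            Collections.synchronizedList(new ArrayList<>());

    // Call at an entry point (e.g. a new incoming request).
    static String begin() {
        String txId = UUID.randomUUID().toString();
        CURRENT.set(new Context(txId));
        return txId;
    }

    static void end() {
        CURRENT.remove();
    }

    // Wraps a method body; the sequence number is taken on entry, the record written on exit.
    static void traced(String method, Runnable body) {
        Context ctx = CURRENT.get();
        if (ctx == null) { body.run(); return; }
        int sequence = ctx.nextSequence++;
        int depth = ctx.depth++;
        long start = System.nanoTime();
        try {
            body.run();
        } finally {
            ctx.depth--;
            LOG.add(new TraceRecord(ctx.txId, sequence, depth, method,
                    System.nanoTime() - start));
        }
    }

    // Records are logged on exit, so inner calls arrive first; sorting restores call order.
    static List<TraceRecord> reconstruct(String txId) {
        List<TraceRecord> trace = new ArrayList<>();
        synchronized (LOG) {
            for (TraceRecord r : LOG) {
                if (r.txId.equals(txId)) trace.add(r);
            }
        }
        trace.sort(Comparator.comparingInt(r -> r.sequence));
        return trace;
    }
}
```

Because each record is written when its method returns, the raw log lists callees before callers; ordering by sequence number is what turns it back into a readable call tree, with the depth field giving the indentation.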
RPC (remote procedure call) systems generally employ a standard mechanism, known as ‘stubs and skeletons’, to hide the complexities of the network from the client making any remote calls. Stubs and skeletons work as follows: The stub masks the low-level networking issues from the client and forwards the request on to a server-side proxy object (the skeleton). The skeleton masks the low-level networking issues from the distributed component. It also delegates the remote request to the distributed component. The distributed component then handles the request and returns control to the skeleton, which in turn returns control to the stub. The stub, in turn, hands back to the client. Transactions can be traced across the network by taking advantage of this stubs and skeletons model. Essentially, the stubs and skeletons can be modified so that the unique transaction ID piggy-backs on the communication and travels with each request and response exchanged between the stub and the skeleton. The implementation may differ from platform to platform, but the principles can generally be applied. For example, Remote Method Invocation (RMI) is used for distributed communication on Java platforms, and details on how this can be achieved for RMI can be found in one of my older research papers here.

Figure: RMI with Custom Stub Wrapper and Server Side Interception Point

Going Beyond APM

The above transaction tracing will give you visibility at a method call level across your distributed application. However, sometimes factors outside your application code (server resources, SaaS components your app communicates with, network speed, etc.) will have an impact on your overall system performance. One way of enhancing the information provided by your APM solution is to collect and analyze your log data.
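Returning to the stubs and skeletons model described above, the piggy-backing idea can be sketched without any real RPC framework. This is not how RMI itself is implemented and not the approach from the referenced paper; it is a simplified in-process illustration in which TracedRequest, TxContext, TracingStub, and TracingSkeleton are hypothetical names and a plain Function stands in for the network.

```java
import java.util.function.Function;

// Hypothetical envelope: the payload plus the piggy-backed transaction ID.
final class TracedRequest {
    final String txId;
    final String payload;

    TracedRequest(String txId, String payload) {
        this.txId = txId;
        this.payload = payload;
    }
}

// Per-thread transaction ID, as in the ThreadLocal approach described earlier.
final class TxContext {
    static final ThreadLocal<String> TX_ID = new ThreadLocal<>();

    private TxContext() { }
}

// Client-side stub wrapper: attaches the caller's transaction ID to each request.
final class TracingStub {
    private final Function<TracedRequest, String> transport; // stands in for the network

    TracingStub(Function<TracedRequest, String> transport) {
        this.transport = transport;
    }

    String call(String payload) {
        return transport.apply(new TracedRequest(TxContext.TX_ID.get(), payload));
    }
}

// Server-side interception point: restores the ID before delegating to the component.
final class TracingSkeleton {
    private final Function<String, String> component;

    TracingSkeleton(Function<String, String> component) {
        this.component = component;
    }

    String handle(TracedRequest request) {
        String previous = TxContext.TX_ID.get();
        TxContext.TX_ID.set(request.txId);
        try {
            return component.apply(request.payload);
        } finally {
            // Do not leak the ID into unrelated work on a pooled server thread.
            if (previous == null) {
                TxContext.TX_ID.remove();
            } else {
                TxContext.TX_ID.set(previous);
            }
        }
    }
}
```

The component never sees the envelope; it reads the transaction ID from its own thread, so server-side trace records can carry the same ID as the client-side ones.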
Logs provide a very flexible way of gathering information on your system behavior without any requirement for deep instrumentation or any of the complex techniques described above. Furthermore, you may not be able to instrument every software component or cloud service that makes up your overall system, yet almost all of these will produce valuable log data containing system usage and performance information. In such scenarios, combining APM and log data will give you the complete picture. Below are some tips that will allow you to map logs to APM transactions or to enhance them with data from additional components such as OS, middleware, or network-level components:

Logging the Transaction ID: Any log data produced by your apps can be easily mapped to transactions produced by your APM tool by logging the transaction ID used to trace the transaction.

Client Side Logging & Logging User/Session/Account ID: Logging other unique identifiers, such as session ID, user ID, or account ID, can also help with tracing transactions across log events where the transaction ID used by the APM tool is unavailable. This can be particularly useful if you are logging events from the client side as well as from back-end components and want to be able to view the sequence of log events related to a given user action or session, for example.

Same Time Frame For System Logs: Where unique identifiers have not been logged as part of your log events, viewing logs within the same time frame window as your APM tool will help you narrow down related log events and will give you a view into system behavior during the transaction time frame.

Correlating with Other KPIs: Logs will contain key performance and resource usage metrics that can be rolled up into trend lines and charts.
Correlating APM transaction traces with performance metrics and server resource usage information can lead to a quicker root cause diagnosis than investigating transaction traces in isolation.

Build It Yourself?

Naturally, nobody in their right mind would actually go about building their own APM solution; it’s almost as harebrained as rolling your own logging solution. The good news is you don’t have to do either: simply take advantage of the new Logentries and New Relic integration so that you can trace transactions from end to end and from top to bottom of your entire distributed software stack.
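To illustrate the "Logging the Transaction ID" tip above, here is a minimal sketch using java.util.logging. The names LogTxContext and TxIdFormatter and the output layout are assumptions for this example, not part of any vendor integration. It also assumes the handler formats records on the same thread that logged them, which holds for the standard synchronous handlers.

```java
import java.util.logging.Formatter;
import java.util.logging.LogRecord;

// Hypothetical holder for the current transaction ID (set at request entry).
final class LogTxContext {
    static final ThreadLocal<String> TX_ID = new ThreadLocal<>();

    private LogTxContext() { }
}

// Formatter that stamps every log line with the transaction ID of the logging thread.
class TxIdFormatter extends Formatter {
    @Override
    public String format(LogRecord record) {
        String txId = LogTxContext.TX_ID.get();
        return String.format("%s txId=%s %s%n",
                record.getLevel(),
                txId == null ? "-" : txId, // "-" marks lines logged outside a transaction
                formatMessage(record));
    }
}
```

A handler would pick this up through its setFormatter method, for example on a ConsoleHandler attached to the application's logger, after which log events can be filtered by the same ID that appears in the transaction trace.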

September 24, 2014

by Trevor Parsons

· 6,871 Views · 1 Like

How to Resolve Maven's ''Failure to Transfer'' Error

Learn how to resolve the ''failure to transfer'' error encountered in Maven in this quick tutorial.

September 24, 2014

by Jose Roy Javelosa

· 130,005 Views · 3 Likes

Java - Four Security Vulnerabilities Related Coding Practices to Avoid

This article presents the top four coding practices that lead to security vulnerabilities and that you should avoid when programming in Java. Recently, I came across a few Java projects where these instances were found. Please feel free to comment or suggest additions if I missed one or more important points. Also, sorry for the typos. The following key points are described in this article:

Executing a dynamically generated SQL statement
Directly writing an HTTP parameter to Servlet output
Creating an SQL PreparedStatement from a dynamic string
Array is stored directly

Executing a Dynamically Generated SQL Statement

This is the most common of all. One can find mention of this vulnerability in several places. As a matter of fact, many developers are aware of this vulnerability, yet they still end up making the mistake once in a while. In several DAO classes, instances such as the following code were found, which could lead to SQL injection attacks.

    // Vulnerable: namesString is concatenated straight into the SQL text.
    StringBuilder query = new StringBuilder();
    query.append("select * from user u where u.name in (" + namesString + ")");
    try (Connection connection = getConnection();
         Statement statement = connection.createStatement();
         ResultSet resultSet = statement.executeQuery(query.toString())) {
        // ... read the rows ...
    }

Instead of the above query, one could make use of a prepared statement, as demonstrated in the code below. It not only makes the code less vulnerable to SQL injection attacks but can also make it more efficient.

    // Safer: the value is bound as a parameter instead of being concatenated.
    // Note: a single ? binds exactly one value; a list of names needs one ? per name.
    String query = "select * from user u where u.name in (?)";
    try (Connection connection = getConnection();
         PreparedStatement statement = connection.prepareStatement(query)) {
        statement.setString(1, namesString);
        try (ResultSet resultSet = statement.executeQuery()) {
            // ... read the rows ...
        }
    }

Directly Writing an HTTP Parameter to Servlet Output

In Servlet classes, I found instances where the HTTP request parameter was written as is to the output stream, without any validation checks.
The following code demonstrates this:

    public void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        String content = request.getParameter("some_param");
        //
        // .... some code goes here
        //
        // Vulnerable: the raw parameter is echoed back without validation or encoding.
        response.getWriter().print(content);
    }

Note that the above code does not persist anything. Code like this may lead to what is called a reflected (or non-persistent) cross-site scripting (XSS) vulnerability. Reflected XSS occurs when an attacker injects browser-executable code within a single HTTP response. By definition (being non-persistent), the injected attack is not stored within the application; it affects only users who open a maliciously crafted link or third-party web page. The attack string is included as part of the crafted URI or HTTP parameters, improperly processed by the application, and returned to the victim. You can read more details on the OWASP page on reflected XSS.

Creating an SQL PreparedStatement from a Dynamic Query String

What this essentially means is that although a PreparedStatement was used, the query was built as a string rather than in the parametrized way recommended for prepared statements. If unchecked, tainted data from a user would create a String where SQL injection could make it behave in an unexpected and undesirable manner. One should instead parametrize the query and use the PreparedStatement appropriately. Take a look at the following code to identify the vulnerable part.
    // Vulnerable: a PreparedStatement is used, but the SQL text is still built by concatenation.
    StringBuilder query = new StringBuilder();
    query.append("select * from user u where u.name in (" + namesString + ")");
    try (Connection connection = getConnection();
         PreparedStatement statement = connection.prepareStatement(query.toString());
         ResultSet resultSet = statement.executeQuery()) {
        // ... read the rows ...
    }

Array Is Stored Directly

With this vulnerability, an attacker can change the objects stored in the array from outside the object, so that the program behaves inconsistently, because the caller still holds a reference to the array that was passed to the method. The solution is to make a copy within the object when the array is passed in. That way, a subsequent modification of the caller's array won’t affect the array stored within the object. You can read the details on the related Stack Overflow page. The following code shows the vulnerability:

    // Note that values is a String array in the code below.
    //
    public void setValues(String[] somevalues) {
        this.values = somevalues;
    }
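The remedies suggested in this article can be sketched together in one small class. Everything here is illustrative: the class name SafeCodingSketch and the helpers placeholders and escapeHtml are hypothetical, and the escaping helper covers only the five characters most often involved in HTML injection. In production code an established encoding library would normally be the better choice.

```java
import java.util.Collections;

// Hypothetical class collecting the three fixes discussed above.
final class SafeCodingSketch {
    private String[] values = new String[0];

    // Defensive copy: later changes to the caller's array do not reach this object.
    public void setValues(String[] someValues) {
        this.values = (someValues == null) ? new String[0] : someValues.clone();
    }

    // Copy on the way out as well, so callers cannot modify the internal array.
    public String[] getValues() {
        return values.clone();
    }

    // Builds "?, ?, ?" so an IN clause gets one bound parameter per value.
    static String placeholders(int count) {
        if (count < 1) {
            throw new IllegalArgumentException("count must be at least 1");
        }
        return String.join(", ", Collections.nCopies(count, "?"));
    }

    // Minimal HTML escaping before echoing a request parameter into a page body.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

With placeholders, the query text becomes "select * from user u where u.name in (" + placeholders(names.size()) + ")", where names is an assumed list of values, and each name is then bound with setString, so user input never becomes part of the SQL text.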

September 19, 2014

by Ajitesh Kumar

· 19,476 Views · 1 Like

A Closer Look at the MySQL ibdata1 Disk Space Issue and Big Tables

A recurring customer issue seen by the Percona Support team involves how to make the ibdata1 file “shrink” within MySQL. I'll show you how to handle big tables.

September 16, 2014

by Peter Zaitsev

· 7,869 Views