Data Engineering Resources

The Latest Data Engineering Topics

Using REST with the CQRS Pattern to Blend NoSQL & SQL Data

REST Easy with SQL/NoSQL Integration and CQRS Pattern implementation New demands are being put on IT organizations everyday to deliver agile, high-performance, integrated mobile and web applications. In the meantime, the technology landscape is getting complex everyday with the advent of new technologies like REST, NoSQL, Cloud while existing technologies like SOAP and SQL still rule everyday work. Rather than taking religious side of the debate, NoSQL can successfully co-exist with SQL in this ‘polyglot’ of data storage and formats. However, this integration also adds another layer of complexity both in architecture and implementation. This document offers a guide on how some of the relatively newer technologies like REST can help bridge the gap between SQL and NoSQL with an example of a well known pattern called CQRS. This document is organized as follows: Introduction to SQL development process NoSQL Do I have to choose between SQL and NoSQL? CQRS Pattern How to implement CQRS pattern using REST services Introduction to SQL development process Developers have been using SQL Databases for decades to build and deliver enterprise business applications. The process of creating tables, attributes,and relationships is second nature for most developers. Data architects think in terms of tables and columns and navigate relationships for data. The basic concepts of delivery and transformation takes place at the web server level which means the server developer is reading and ‘binding’ to the tables and mapping attributes to a REST response. Application development lifecycle meant changes to the database schema first, followed by the bindings, then internal schema mapping, and finally the SOAP or JSON services, and eventually the client code. This all costs the project time and money. It also means that the ‘code’ (pick your language here) and the business logic would also need to be modified to handle the changes to the model. NoSQL NoSQL is gaining supporters among many SQL shops for various reasons including: Low cost Ability to handle unstructured dataa Scalability Performance The first thing database folks notice is that there is no schema. These document style storage engines can handle huge volumes of structured, semi-structured, and unstructured data. The very nature of schema-less documents allows change to a document structure without having to go through the formal change management process (or data architect). The other major difference is that NoSQL (no-schema) also means no joins or relationships. The document itself contains the embedded information by design. So an order entry would contain the customer with all the orders and line items for each order in a single document. There are many different NoSQL vendors (popular NoSQL databases include MongoDB, Casandra) that are being used for BI and Analytics (read-only) purposes. We are also seeing many customers starting to use NoSQL for auditing, logging, and archival transactions. Do I have to choose between SQL and NoSQL? The purpose of this article is to not get into the religious debate about whether to use SQL or NoSQL. Bottom line is both have their place and are suited for certain type of data – SQL for structured data and NoSQL for unstructured data. So why not have the capability to mix and match this data depending on the application. This can be done by creating a single REST API across both SQL and NoSQL databases. Why a single REST API? The answer is simple – the new agile and mobile world demands this ‘mashup’ of data into a document style JSON response. CQRS (Command Query Responsibility Segmentation) Pattern There are many design patterns for delivery of high performance RESTful services but the one that stands out was described in an article written by Martin Fowler, one of the software industry veterans. He described the pattern called CQRS that is more relevant today in a ‘polyglot’ of servers, data, services, and connections. “We may want to look at the information in a different way to the record store, perhaps collapsing multiple records into one, or forming virtual records by combining information for different places. On the update side we may find validation rules that only allow certain combinations of data to be stored, or may even infer data to be stored that’s different from that we provide.” – Martin Fowler 2011 In this design pattern, the REST API requests (GET) return documents from multiple sources (e.g. mashups). In the update process, the data is subject to business logic derivations, validations, event processing, and database transactions. This data may then be pushed back into the NoSQL using asynchronous events. With the wide-spread adoption of NoSQL databases like MongoDB and schema-less, high capacity data store; most developers are challenged with providing security, business logic, event handling, and integration to other systems. MongoDB; one the popular NoSQL databases and SQL databases share many similar concepts. However the MongoDB programming language itself is very different from the SQL we all know. How to implement CQRS pattern using a RESTFul Architecture A REST server should meet certain requirements to support the CQRS pattern. The server should run on-premise or in the cloud and appears to the mobile and web developer as an HTTP endpoint. The server architecture should implement the following: Connections and Mapping necessary for SQL and NoSQL connectivity and API services needed to create and return GET, PUT, POST, and DELETE REST responses Security Business Logic Connections and Mapping There are two main approaches to creating REST Servers and APIs for SQL and NoSQL databases: Open source frameworks like Apache Tomcat, Spring/Hibernate Commercial framework like Espresso Logic Open source Frameworks Using various open source frameworks like Tomcat, Spring/Hibernate, Node.js, JDBC and MongoDB drivers, a REST server can be created, but we would still be left with the following tasks: Creation and mapping of the necessary SQL objects Create a REST server container and configurations Create Jersey/Jackson classes and annotations Create and define REST API for tables, views, and procedures Hand write validation, event and business logic Handle persistence, optimistic locking, transaction paging Adding identity management and security by roles Now we can start down the same path to connect to MongoDB and write code to connect, select, and return data in JSON and then create the REST calls to merge these two different document styles into a single RESTful endpoint. This is a lot of work for a development team to manage and control and frankly pretty boring and repetitive and is better done by a well designed framework Commercial Frameworks Many commercial frameworks may take care of this complexity without the need to do extensive programming. Here is an example from Espresso Logic and how it handles this complexity with a point and click interface: Running REST server in the cloud or on-premise Connections to external SQL databases Object mapping to tables, views, and procedures Automatic creation of RESTful endpoints from model Reactive business rules and rich event model Integrated role-based security and authentication services. Point-and-click document API creation for SQL and MongoDB endpoints In the example below, the editor shows an SQL (customersTransactions) joined with archived details from MongoDB (archivedTransactions). The MongoDB document for each customer may include transaction details, check images, customer service notes and other relevant account information. This new mashup becomes a single REST call that can be published to mobile and web application developer. Security Security is an important part of building and delivery of RESTful services which can be broken down into two parts; authentication and access control. Authentication Before allowing anyone access to corporate data you want to use the existing corporate identity management (some call this authentication services) to capture and validate the user. This identity management service is based on using existing corporate standards such as LDAP, Windows AD, SQL Database. Role-based Access Control Each user may be assigned one or more corporate roles and these roles are then assigned specific access privileges to each resource (e.g. READ, INSERT, UPDATE, and DELETE). Role-based access should also be able to restrict permissions to specific rows and columns of the API (e.g. only sales reps can see their own orders or a manager can see and change his department salaries but cannot change his own). This restriction should be applied regardless of how or where the API is used or called. Remember, the SQL database already provides some level of security and access which must be considered when designing and delivering new front-end services to internal and external users. Business Logic for REST When data is updated to a REST Server several things need to happen. First, the authentication and access control should determine if this is a valid request and if the user has rights to the endpoint. In addition, the server may need to de-alias REST attributes back to the actual SQL column names. In a full featured business logic server, there should be a series of events and business rules to perform various calculations, validations, and fire other events on dependent tables. Finally, the entire multi-table transaction is written back to the SQL database in a single transaction. Updates are then sent asynchronously to MongoDB as part of the commit event (after the SQL transaction has completed). Conclusion In the real-world of API services, the demand for more complex document style RESTful services is a requirement. That is, the ability to create ‘mashups’ of data from multiple tables, NoSQL collections, and other external systems is a large part of this new design pattern. In addition, the ability to alias attribute names and formats from these source fields has become critical for partners and customers systems. Using REST with the CQRS pattern to blend MongoDB and SQL seamlessly to your existing data will become a major part of your future mobile strategy. To implement these REST services, one can use open source tools and spend a lot of time or select a right commercial framework. This framework should support cloud or on-premise connectivity, security, API integration, as well as business logic. This will make the design and delivery of new application services more rapid and agile in the heterogeneous world of information.

November 4, 2014

by Val Huber

CORE

· 16,218 Views

Configuring an OpenStack VM with Multiple Network Cards

[This article was written by Barak Merimovich.] We have discussed OpenStack networking extensively in previous posts. In this post, I’d like to dive into a more advanced OpenStack networking scenario. Many cloud images are not configured to automatically bring up all network cards that are available. They will usually only have a single network card configured. To correctly set up a host in the cloud with multiple network cards, log on to the machine and bring up the additional interfaces. echo $'auto eth1\niface eth1 inet dhcp' | sudo tee /etc/network/interfaces.d/eth1.cfg > /dev/null sudo ifup eth1 Networks in the cloud A complex network architecture is a mainstay of modern IaaS clouds. Understanding how to configure your cloud-based networks, and hosts, is critical to getting your application working in the cloud. This is especially true with Cloudify, the open source cloud orchestration platform I work on. The cloud, like the world, used to be flat It was not that long a time ago that most IaaS providers only supported flat networks – all of your hosts were in one large network. Separation between services running in the cloud was enforced in software or with firewalls/security-groups. But technically, all of the hosts were connected to the same network and visible to each other. The flat network model is simple, and therefore easy to reason and understand. It was a good choice for the early days of the IaaS cloud and no doubt helped with getting applications into the cloud in the first place. It was one of the things that made EC2 so easy to use for anyone just starting out with the ‘cloud’. This model is in fact still available on Amazon Web Services under the title ‘EC2-Classic’. And for many applications, a flat network is good enough. But as cloud adoption increases, more complex applications are moving into the clouds, and issues like network separation, security, SLA and broadcast domains make more complex networks models a must. Software Defined Networks (SDN) fill that gap. They are now a staple of most major IaaS clouds. AWS has AWS-VPC, OpenStack has the Neutron project and there are many other implementations. Working with SDN requires knowing a bit more about how information moves around between your cloud resources. In this post I am going to discuss how to set up a host in the cloud so it will play nice with complex networks. I’ll be using OpenStack, but the concepts are similar for other cloud infrastructures. Openstack configuration I am going to start with an empty tenant, only the public network is available. First, lets set up out networks and router: neutron router-create demo-router neutron net-create demo-network-1 neutron net-create demo-network-2 neutron subnet-create --name demo-subnet-1 demo-network-1 10.0.0.0/24 neutron subnet-create --name demo-subnet-2 demo-network-2 10.0.1.0/24 neutron router-interface-add demo-router demo-subnet-1 neutron router-interface-add demo-router demo-subnet-2 neutron router-gateway-set demo-router public Note the network IDs: neutron net-list | id | name | subnets | | 2c33efe2-6204-4125-9716-3bc525630016 | demo-network-1 | 928dafa0-83ef-459c-b20d-71d8ea596fa2 10.0.0.0/24 | | aa30627e-c181-4a4b-89bf-5dd7c26c244e | demo-network-2 | 26d573f7-7953-4a54-825b-ed7bbc0661c7 10.0.1.0/24 | | e502de8d-929a-4ee0-bd18-efa297875cf6 | public | d40dab51-a729-452c-9ee6-b9ad08d10808 | We’ll start with a standard Ubuntu cloud image: glance image-create --name "Ubuntu 12.04 Standard" --location "http://uec-images.ubuntu.com/precise/current/precise-server-cloudimg-amd64-disk1.img" --disk-format qcow2 --container-format bare Create the keypair and security group: nova keypair-add demo-keypair > demo-keypair.pem chmod 400 demo-keypair.pem nova secgroup-create demo-security-group "Security group for demo" nova secgroup-add-rule demo-security-group tcp 22 22 0.0.0.0/0 Let’s spin up an instance connected to both our networks: nova boot -flavor m1.small --image "Ubuntu 12.04 Standard" --nic net-id=2c33efe2-6204-4125-9716-3bc525630016 --nic net-id=aa30627e-c181-4a4b-89bf-5dd7c26c244e --security-groups demo-security-group --key-name demo-keypair demo-vm And set up floating IPs for the first network: nova list | ID | Name | Status | Task State | Power State | Networks | 2b17588b-8980-4489-9a04-6539a159dc3c | demo-vm | ACTIVE | None | Running | demo-network-1=10.0.0.2; demo-network-2=10.0.1.2 | neutron floatingip-create public neutron floatingip-list | id | fixed_ip_address | floating_ip_address | port_id | | 49c8b05e-bb8f-4b07-80ed-3155ab6ffc09 | | 192.168.15.42 | | neutron port-list | id | name | mac_address | fixed_ips | | 1ccfd334-7328-4b22-b93e-24a0888276ab | | fa:16:3e:14:39:39 | {"subnet_id": "94598487-c1fc-4f55-ac1f-ef2545d5cfeb", "ip_address": "10.0.1.3"} | | a482c4f6-fa74-476e-b1ce-cd8dd0c70815 | | fa:16:3e:18:92:79 | {"subnet_id": "94598487-c1fc-4f55-ac1f-ef2545d5cfeb", "ip_address": "10.0.1.2"} | | b23d7836-30c5-4bff-b873-15c87ba051f6 | | fa:16:3e:3a:28:40 | {"subnet_id": "dec6ec74-cfa9-4a08-8792-54900631b98e", "ip_address": "10.0.0.3"} | | d421b447-2adf-406f-876b-142238683344 | | fa:16:3e:9d:fc:7f | {"subnet_id": "dec6ec74-cfa9-4a08-8792-54900631b98e", "ip_address": "10.0.0.2"} | | dcf8696b-cc80-4b48-b09c-61c0f8ab02ac | | fa:16:3e:5b:39:fb | {"subnet_id": "94598487-c1fc-4f55-ac1f-ef2545d5cfeb", "ip_address": "10.0.1.1"} | | f6a1666e-495a-4d3f-afa3-754b3cb3cfc0 | | fa:16:3e:8a:1b:fb | {"subnet_id": "dec6ec74-cfa9-4a08-8792-54900631b98e", "ip_address": "10.0.0.1"} | neutron floatingip-associate 49c8b05e-bb8f-4b07-80ed-3155ab6ffc09 d421b447-2adf-406f-876b-142238683344 Note how we matched the VM’s IP to its port, and associated the floating IP to the port. I wish there was an easier way to do this from the CLI… If everything worked correctly, you should have the following setup: Let’s make sure ssh works correctly: ssh -i demo-keypair.pem [email protected] hostname demo-vm Cool, ssh works. Now, we should have two network cards, right? ssh -i demo-keypair.pem [email protected] hostname demo-vm Cool, ssh works. Now, we should have two network cards, right? ssh -i demo-keypair.pem [email protected] ifconfig eth0 Link encap:Ethernet HWaddr fa:16:3e:5f:a2:5f inet addr:10.0.0.4 Bcast:10.0.0.255 Mask:255.255.255.0 inet6 addr: fe80::f816:3eff:fe5f:a25f/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:230 errors:0 dropped:0 overruns:0 frame:0 TX packets:224 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:46297 (46.2 KB) TX bytes:31130 (31.1 KB) lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) Huh?! The VM only has one working network interface! Where is my second NIC? Was there a configuration problem with the OpenStack network setup? The answer is here: ssh -i demo-keypair.pem [email protected] ifconfig -a eth0 Link encap:Ethernet HWaddr fa:16:3e:5f:a2:5f inet addr:10.0.0.4 Bcast:10.0.0.255 Mask:255.255.255.0 inet6 addr: fe80::f816:3eff:fe5f:a25f/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:324 errors:0 dropped:0 overruns:0 frame:0 TX packets:332 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:69973 (69.9 KB) TX bytes:47218 (47.2 KB) eth1 Link encap:Ethernet HWaddr fa:16:3e:29:6d:22 BROADCAST MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:16436 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:0 (0.0 B) TX bytes:0 (0.0 B) The second NIC exists, but is not running. The issue is not with the OpenStack network configuration – it’s with the image. The image itself should be configured to work correctly with multiple NICs. All we have to do is bring up the NIC. So we ssh into the instance: ssh -i demo-keypair.pem [email protected] And run the following commands: echo $'auto eth1\niface eth1 inet dhcp' | sudo tee /etc/network/interfaces.d/eth1.cfg > /dev/null sudo ifup eth1 The second NIC should now be running: ifconfig eth1 eth1 Link encap:Ethernet HWaddr fa:16:3e:18:92:79 inet addr:10.0.1.2 Bcast:10.0.1.255 Mask:255.255.255.0 inet6 addr: fe80::f816:3eff:fe18:9279/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:81 errors:0 dropped:0 overruns:0 frame:0 TX packets:45 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:15376 (15.3 KB) TX bytes:3960 (3.9 KB) And there you go – your VM can access both networks. This issue can make life complicated when setting up a complex, or even a not very complex, application. When will this issue hurt you? Well, imagine a scenario where you have a web server and a database server. The web server is connected to both Network1 and Network2, and the database server is only connected to Network2. Network1 is connected to the external world over a router, and Network 2 is completely internal, adding another layer of security to the critical database server. So what happens if the web server only has one network card? If only the NIC for Network1 is up, the web server can’t access the database. If only the NIC for Network2 is up, the web server can’t be reached from the external world. Even worse, if this web server is accessed via a floating IP, this IP will also not work, so you won’t be able to access the web server and fix the issue. Tricky. In conclusion The above commands will bring up your additional network card. You will of-course need to repeat this process for each additional network card, and for each VM. You can use a start-up script (a.k.a. user-data script) or system service to run these commands, but there are better ways. I’ll discuss how to automate the network setup in a follow-up post. This was originally posted at Barak's blog Head in the Clouds, find it here.

November 4, 2014

by Sharone Zitzman

· 14,716 Views

Spring Caching Abstraction and Google Guava Cache

Spring provides a great out of the box support for caching expensive method calls. The caching abstraction is covered in a great detail here. My objective here is to cover one of the newer cache implementations that Spring now provides with 4.0+ version of the framework - using Google Guava Cache In brief, consider a service which has a few slow methods: public class DummyBookService implements BookService { @Override public Book loadBook(String isbn) { // Slow method 1. } @Override public List loadBookByAuthor(String author) { // Slow method 2 } } With Spring Caching abstraction, repeated calls with the same parameter can be sped up by an annotation on the method along these lines - here the result of loadBook is being cached in to a "book" cache and listing of books cached into another "books" cache: public class DummyBookService implements BookService { @Override @Cacheable("book") public Book loadBook(String isbn) { // slow response time.. } @Override @Cacheable("books") public List loadBookByAuthor(String author) { // Slow listing } } Now, Caching abstraction support requires a CacheManager to be available which is responsible for managing the underlying caches to store the cached results, with the new Guava Cache support the CacheManager is along these lines: @Bean public CacheManager cacheManager() { return new GuavaCacheManager("books", "book"); } Google Guava Cache provides a rich API to be able to pre-load the cache, set eviction duration based on last access or created time, set the size of the cache etc, if the cache is to be customized then a guava CacheBuilder can be passed to the CacheManager for this customization: @Bean public CacheManager cacheManager() { GuavaCacheManager guavaCacheManager = new GuavaCacheManager(); guavaCacheManager.setCacheBuilder(CacheBuilder.newBuilder().expireAfterAccess(30, TimeUnit.MINUTES)); return guavaCacheManager; } This works well if all the caches have a similar configuration, what if the caches need to be configured differently - for eg. in the sample above, I may want the "book" cache to never expire but the "books" cache to have an expiration of 30 mins, then the GuavaCacheManager abstraction does not work well, instead a better solution is actually to use a SimpleCacheManager which provides a more direct way to get to the cache and can be configured this way: @Bean public CacheManager cacheManager() { SimpleCacheManager simpleCacheManager = new SimpleCacheManager(); GuavaCache cache1 = new GuavaCache("book", CacheBuilder.newBuilder().build()); GuavaCache cache2 = new GuavaCache("books", CacheBuilder.newBuilder() .expireAfterAccess(30, TimeUnit.MINUTES) .build()); simpleCacheManager.setCaches(Arrays.asList(cache1, cache2)); return simpleCacheManager; } This approach works very nicely, if required certain caches can be configured to be backed by a different caching engines itself, say a simple hashmap, some by Guava or EhCache some by distributed caches like Gemfire.

November 3, 2014

by Biju Kunjummen

· 60,096 Views · 8 Likes

BigList: a Scalable High-Performance List for Java

As memory gets cheaper and cheaper, our applications can keep more data readily available in main memory, or even all as in case of in-memory databases. To make real use of the growing heap memory, appropriate data structures must be used. Interesting enough, there seem to be no specialized implementations for lists - by far the most used collection. This article introduces BigList, a list designed for handling large collections where large means that all data still fit completely in the heap memory. The article will show the special requirements for handling large collections, how BigList is implemented and how it compares to other list implementations. 1. Requirements What are the special requirements we need to handle large collections efficiently? Memory: Sparing use of memory: The list should need little memory for its own implementation so memory can be used for storing application data. Specialized versions for primitives: It must be possible to store common primitives like ints in a memory saving way. Avoid copying large data blocks: If the list grows or shrinks, only a small part of the data must be copied around, as this operation becomes expensive and needs the same amount of memory again. Data sharing: copying collections is a frequent operation which should be efficiently possible even if the collection is large. An efficient implementation requires some sort of data sharing as copying all elements is per se a costly operation. Performance: Good performance for normal operations like reading, storing, adding or removing single elements. Great performance for bulk operations like adding or removing multiple elements. Predictable overhead of operations, so similar operations should need a similar amount of time without excessive worst case scenarios. If an implementation does not offer these features, some operations will not only be slow for really large collections, but will becomse just not feasible because memory or CPU usage will be too exhaustive. Introduction to BigList BigList is a member of the Brownies Collections library which also includes GapList, the fastest list implementation known. GapList is a drop-in replacement for ArrayList, LinkedList, or ArrayDequeue and offers fast access by index and fast insertion/removal at the beginning and at the end at the same time. GapList however has not been designed to cope with large collections, so adding or removing elements can make it necessary to copy a lot of elements around which will lead to performance problems. Also copying a large collection becomes an expensive operation, both in term of time and memory consumption. It will simply not be possible to make a copy of a large collections if not the same amount of memory is available a second time. And this is a common operation as you often want to return a copy of an internal list through your API which has no reference the original list. BigList addresses both problems. The first problem is solved by storing the collection elements in fixed size blocks. Add or remove operations are then implemented to use only data from one block. The copying problem is solved by maintaining a reference count on the fixed size blocks which allows to implement a copy-on-write approach. For efficient access to the fixed size blocks, they are maintained in a specialized tree structure. 2. BigList Details Each BigList instance stores the following information: Elements are stored in in fixed-size blocks A single block is implemented as GapList with a reference count for sharing All blocks are maintained in a tree for fast access Access information for the current block is cached for better performance The following illustration shows these details for two instances of BigList which share one block. 2.1 Use of Blocks Elements are stored in in fixed-size blocks with a default block size of 1000. Where this default may look pretty small, it is most of the time a good choice because it guarantees that write operation only need to move few elements. Read operations will profit from locality of reference by using the currently cached block to be fast. It is however possible to specify the block size for each created BigList instance. All blocks except the first are allocated with this fixed size and will not grow or shrink. The first block will grow to the specified block size to save memory for small lists. If a block has reached its maximum size and more data must be stored within, the block needs to be split up in two blocks before more elements can be stored. If elements are added to the head or tail of the list, the block will only be filled up to a threshold of 95%. This allows inserts into the block without the immediate need for split operations. To save memory, blocks are also merged. This happens automatically if two adjacent blocks are both filled less than 35% after a remove operation. 2.2 Locality of Reference For each operation on BigList, the affected block must be determined first. As predicted by locality of reference, most of the time the affected block will be the same as for the last operation. The implementation of BigList has therefore been designed to profit from locality of reference which makes common operations like iterating over a list very efficient. Instead of always traversing the block tree to determine the block needed for an operation, lower and upper index of the last used block are cached. So if the next operation happens near to the previous one, the same block can be used again without need to traverse the tree. 2.3 Reference Counting To support a copy-on-write approach, BigList stores a reference count for each fixed size blocks indicating whether this block is private or shared. Initially all lists are private having a reference count of 0, so that modification are allowed. If a list is copied, the reference count is incremented which prohibits further modifications. Before a modification then can be made, the block must be copied decrementing the block's reference count and setting the reference count of the copy to 0. The reference count of a block is then decremented by the finalizer of BigList. 3. Benchmarks To prove the excellence of BigList in time and memory consumption, we compare it with some other List implementations. And here are the nominees: Type Library Description BigList brownie-collections List optimized for storing large number of elements. Elements stored in fixed size blocks which are maintained in a tree. GapList brownie-collections Fastest list implementation known. Fast access by index and fast insertion/removal at end and at beginning. ArrayList JDK Maintains elements in a single array. Fast access by index, fast insertion/removal at end, but slow at beginning. LinkedList JDK Elements stored using a linked list. Slow access by index. Memory overhead for each element stored. TreeList commons-collections Elements stored in a tree. All operations are not really fast, but there are no very slow operations. Memory overhead for each element stored. FastTable javolution Elements stored in a "fractal"-like data structure. Good performance and use of memory. However no bulk operations and collection does not shrink. 3.1 Handling Objects In the first part of the benchmark, we compare memory consumption and performance of the different list implementations. Let's first have a look at the memory consumption. The following table shows the bytes used to hold a list with 1'000'000 null elements: BigList GapList ArrayList LinkedList TreeList FastTable 32 bit 4'298'466 4'021'296 4'861'992 8'000'028 18'000'028 4'142'892 64 bit 8'544'254 8'042'552 9'723'964 16'000'044 26'000'044 8'222'988 We can see that BigList, GapList, ArrayList, and FastTable only add small overhead to the stored elements, where as Linkedlist needs twice the memory and TreeList even more. Now to the performance. Here are the results of 9 benchmarks which have been run for each of the 6 candidates with JDK 8 in a 32 bit Windows environment and a list of 1'000'000 elements: The result table can be read as follows: the fastest candidate for each test has a relative performance indicator of 1 the value for the other candidates indicate how many times they have been slower, so a factor of 3 means that this implementation was 3 times slower than the best one The different factor are colored like this:  factor 1: green (best)  factor <5: blue (good)  factor <25: yellow (moderate)  factor >25: red (poor) If we look the benchmark result, we can see that the performance of BigList is best for all expect two benchmarks. The only moderate result is produces in getting elements in a totally random order. This could be expected as there is no locality of reference which can be exploited, so for each access, the block tree must be traversed to find the correct block. Luckily this is a rare use case in real applications. And the benchmark "Get local" shows that performance is back to good as soon as elements next to each other must be retrieved - as it is the case if we iterate over a range. 3.2 Handling Primitives In the second part of the benchmark, we want see how big the savings are if we use a data structure specialized for storing primitives compared to strong wrapped objects. For this reason, we compare IntBigList and BigList. The following table shows memory needed to store 1'000'000 integer values: BigList IntBigList 32 bit 16'298'454 4'534'840 64 bit 28'544'234 4'570'432 Obviously it is easy to save a lot of memory. In a 32 bit environment, IntBigList just needs 25% percent of memory, in a 64 bit environment only 14%! These figures become plausible if you recall that a simple object needs 8 bytes in a 32 bit, but already 16 bytes in a 64 bit environment, where as a primitive integer value always only needs 4 bytes. The measurable performance gain is not so impressive, it is something below 10% for simple get operations and something above 10% for add and remove operations. These numbers show that the JVM is impressively fast in creating wrapper objects and boxing and unboxing primitive values. We must however also consider that each created object will need to be garbage collected once and therefore adds to the total load of the JVM. 4. Summary BigList is a scalable high-performance list for storing large collections. Its design guarantees that all operations will be predictable and efficient both in term of performance and memory consumption, even copying large collections is tremendous fast. Benchmarks haven proven this and shown that BigList outperform other known list implementations. The library also offers specialized implementations for primitive types like IntBigList which save much memory and provide superior performance. BigList for handling objects and the specializations for handling primitives are part of the Brownies Collections library and can be downloaded from http://www.magicwerk.org/collections.

November 3, 2014

by Thomas Mauch

· 32,962 Views · 10 Likes

Building a REST API with JAXB, Spring Boot and Spring Data

if someone asked you to develop a rest api on the jvm, which frameworks would you use? i was recently tasked with such a project. my client asked me to implement a rest api to ingest requests from a 3rd party. the project entailed consuming xml requests, storing the data in a database, then exposing the data to internal application with a json endpoint. finally, it would allow taking in a json request and turning it into an xml request back to the 3rd party. with the recent release of apache camel 2.14 and my success using it , i started by copying my apache camel / cxf / spring boot project and trimming it down to the bare essentials. i whipped together a simple hello world service using camel and spring mvc. i also integrated swagger into both. both implementations were pretty easy to create ( sample code ), but i decided to use spring mvc. my reasons were simple: its rest support was more mature, i knew it well, and spring mvc test makes it easy to test apis. camel's swagger support without web.xml as part of the aforementioned spike, i learned out how to configure camel's rest and swagger support using spring's javaconfig and no web.xml. i made this into a sample project and put it on github as camel-rest-swagger . this article shows how i built a rest api with java 8, spring boot/mvc, jaxb and spring data (jpa and rest components). i stumbled a few times while developing this project, but figured out how to get over all the hurdles. i hope this helps the team that's now maintaining this project (my last day was friday) and those that are trying to do something similar. xml to java with jaxb the data we needed to ingest from a 3rd party was based on the ncpdp standards. as a member, we were able to download a number of xsd files, put them in our project and generate java classes to handle the incoming/outgoing requests. i used the maven-jaxb2-plugin to generate the java classes. org.jvnet.jaxb2.maven2 maven-jaxb2-plugin 0.8.3 generate -xtostring -xequals -xhashcode -xcopyable org.jvnet.jaxb2_commons jaxb2-basics 0.6.4 src/main/resources/schemas/ncpdp the first error i ran into was about a property already being defined. [info] --- maven-jaxb2-plugin:0.8.3:generate (default) @ spring-app --- [error] error while parsing schema(s).location [ file:/users/mraible/dev/spring-app/src/main/resources/schemas/ncpdp/structures.xsd{1811,48}]. com.sun.istack.saxparseexception2; systemid: file:/users/mraible/dev/spring-app/src/main/resources/schemas/ncpdp/structures.xsd; linenumber: 1811; columnnumber: 48; property "multipletimingmodifierandtimingandduration" is already defined. use to resolve this conflict. at com.sun.tools.xjc.errorreceiver.error(errorreceiver.java:86) i was able to workaround this by upgrading to maven-jaxb2-plugin version 0.9.1. i created a controller and stubbed out a response with hard-coded data. i confirmed the incoming xml-to-java marshalling worked by testing with a sample request provided by our 3rd party customer. i started with a curl command, because it was easy to use and could be run by anyone with the file and curl installed. curl -x post -h 'accept: application/xml' -h 'content-type: application/xml' \ --data-binary @sample-request.xml http://localhost:8080/api/message -v this is when i ran into another stumbling block: the response wasn't getting marshalled back to xml correctly. after some research, i found out this was caused by the lack of @xmlrootelement annotations on my generated classes. i posted a question to stack overflow titled returning jaxb-generated elements from spring boot controller . after banging my head against the wall for a couple days, i figured out the solution . i created a bindings.xjb file in the same directory as my schemas. this causes jaxb to generate @xmlrootelement on classes. to add namespaces prefixes to the returned xml, i had to modify the maven-jaxb2-plugin to add a couple arguments. -extension -xnamespace-prefix and add a dependency: org.jvnet.jaxb2_commons jaxb2-namespace-prefix 1.1 then i modified bindings.xjb to include the package and prefix settings. i also moved into a global setting. i eventually had to add prefixes for all schemas and their packages. i learned how to add prefixes from the namespace-prefix plugins page . finally, i customized the code-generation process to generate joda-time's datetime instead of the default xmlgregoriancalendar . this involved a couple custom xmladapters and a couple additional lines in bindings.xjb . you can see the adapters and bindings.xjb with all necessary prefixes in this gist . nicolas fränkel's customize your jaxb bindings was a great resource for making all this work. i wrote a test to prove that the ingest api worked as desired. @runwith(springjunit4classrunner.class) @springapplicationconfiguration(classes = application.class) @webappconfiguration @dirtiescontext(classmode = dirtiescontext.classmode.after_class) public class initiaterequestcontrollertest { @inject private initiaterequestcontroller controller; private mockmvc mockmvc; @before public void setup() { mockitoannotations.initmocks(this); this.mockmvc = mockmvcbuilders.standalonesetup(controller).build(); } @test public void testgetnotallowedonmessagesapi() throws exception { mockmvc.perform(get("/api/initiate") .accept(mediatype.application_xml)) .andexpect(status().ismethodnotallowed()); } @test public void testpostpainitiationrequest() throws exception { string request = new scanner(new classpathresource("sample-request.xml").getfile()).usedelimiter("\\z").next(); mockmvc.perform(post("/api/initiate") .accept(mediatype.application_xml) .contenttype(mediatype.application_xml) .content(request)) .andexpect(status().isok()) .andexpect(content().contenttype(mediatype.application_xml)) .andexpect(xpath("/message/header/to").string("3rdparty")) .andexpect(xpath("/message/header/sendersoftware/sendersoftwaredeveloper").string("hid")) .andexpect(xpath("/message/body/status/code").string("010")); } } spring data for jpa and rest with jaxb out of the way, i turned to creating an internal api that could be used by another application. spring data was fresh in my mind after reading about it last summer. i created classes for entities i wanted to persist, using lombok's @data to reduce boilerplate. i read the accessing data with jpa guide, created a couple repositories and wrote some tests to prove they worked. i ran into an issue trying to persist joda's datetime and found jadira provided a solution. i added its usertype.core as a dependency to my pom.xml: org.jadira.usertype usertype.core 3.2.0.ga ... and annotated datetime variables accordingly. @column(name = "last_modified", nullable = false) @type(type="org.jadira.usertype.dateandtime.joda.persistentdatetime") private datetime lastmodified; with jpa working, i turned to exposing rest endpoints. i used accessing jpa data with rest as a guide and was looking at json in my browser in a matter of minutes. i was surprised to see a "profile" service listed next to mine, and posted a question to the spring boot team. oliver gierke provided an excellent answer . swagger spring mvc's integration for swagger has greatly improved since i last wrote about it . now you can enable it with a @enableswagger annotation. below is the swaggerconfig class i used to configure swagger and read properties from application.yml . @configuration @enableswagger public class swaggerconfig implements environmentaware { public static final string default_include_pattern = "/api/.*"; private relaxedpropertyresolver propertyresolver; @override public void setenvironment(environment environment) { this.propertyresolver = new relaxedpropertyresolver(environment, "swagger."); } /** * swagger spring mvc configuration */ @bean public swaggerspringmvcplugin swaggerspringmvcplugin(springswaggerconfig springswaggerconfig) { return new swaggerspringmvcplugin(springswaggerconfig) .apiinfo(apiinfo()) .genericmodelsubstitutes(responseentity.class) .includepatterns(default_include_pattern); } /** * api info as it appears on the swagger-ui page */ private apiinfo apiinfo() { return new apiinfo( propertyresolver.getproperty("title"), propertyresolver.getproperty("description"), propertyresolver.getproperty("termsofserviceurl"), propertyresolver.getproperty("contact"), propertyresolver.getproperty("license"), propertyresolver.getproperty("licenseurl")); } } after getting swagger to work, i discovered that endpoints published with @repositoryrestresource aren't picked up by swagger. there is an open issue for spring data support in the swagger-springmvc project. liquibase integration i configured this project to use h2 in development and postgresql in production. i used spring profiles to do this and copied xml/yaml (for maven and application*.yml files) from a previously created jhipster project. next, i needed to create a database. i decided to use liquibase to create tables, rather than hibernate's schema-export. i chose liquibase over flyway based of discussions in the jhipster project . to use liquibase with spring boot is dead simple: add the following dependency to pom.xml, then place changelog files in src/main/resources/db/changelog . org.liquibase liquibase-core i started by using hibernate's schema-export and changing hibernate.ddl-auto to "create-drop" in application-dev.yml . i also commented out the liquibase-core dependency. then i setup a postgresql database and started the app with "mvn spring-boot:run -pprod". i generated the liquibase changelog from an existing schema using the following command (after downloading and installing liquibase). liquibase --driver=org.postgresql.driver --classpath="/users/mraible/.m2/repository/org/postgresql/postgresql/9.3-1102-jdbc41/postgresql-9.3-1102-jdbc41.jar:/users/mraible/snakeyaml-1.11.jar" --changelogfile=/users/mraible/dev/spring-app/src/main/resources/db/changelog/db.changelog-02.yaml --url="jdbc:postgresql://localhost:5432/mydb" --username=user --password=pass generatechangelog i did find one bug - the generatechangelog command generates too many constraints in version 3.2.2 . i was able to fix this by manually editing the generated yaml file. tip: if you want to drop all tables in your database to verify liquibase creation is working in postgesql, run the following commands: psql -d mydb drop schema public cascade; create schema public; after writing minimal code for spring data and configuring liquibase to create tables/relationships, i relaxed a bit, documented how everything worked and added a loggingfilter . the loggingfilter was handy for viewing api requests and responses. @bean public filterregistrationbean loggingfilter() { loggingfilter filter = new loggingfilter(); filterregistrationbean registrationbean = new filterregistrationbean(); registrationbean.setfilter(filter); registrationbean.seturlpatterns(arrays.aslist("/api/*")); return registrationbean; } accessing api with resttemplate the final step i needed to do was figure out how to access my new and fancy api with resttemplate . at first, i thought it would be easy. then i realized that spring data produces a hal -compliant api, so its content is embedded inside an "_embedded" json key. after much trial and error, i discovered i needed to create a resttemplate with hal and joda-time awareness. @bean public resttemplate resttemplate() { objectmapper mapper = new objectmapper(); mapper.configure(deserializationfeature.fail_on_unknown_properties, false); mapper.registermodule(new jackson2halmodule()); mapper.registermodule(new jodamodule()); mappingjackson2httpmessageconverter converter = new mappingjackson2httpmessageconverter(); converter.setsupportedmediatypes(mediatype.parsemediatypes("application/hal+json")); converter.setobjectmapper(mapper); stringhttpmessageconverter stringconverter = new stringhttpmessageconverter(); stringconverter.setsupportedmediatypes(mediatype.parsemediatypes("application/xml")); list> converters = new arraylist<>(); converters.add(converter); converters.add(stringconverter); return new resttemplate(converters); } the jodamodule was provided by the following dependency: com.fasterxml.jackson.datatype jackson-datatype-joda with the configuration complete, i was able to write a messagesapiitest integration test that posts a request and retrieves it using the api. the api was secured using basic authentication, so it took me a bit to figure out how to make that work with resttemplate. willie wheeler's basic authentication with spring resttemplate was a big help. @runwith(springjunit4classrunner.class) @contextconfiguration(classes = integrationtestconfig.class) public class messagesapiitest { private final static log log = logfactory.getlog(messagesapiitest.class); @value("http://${app.host}/api/initiate") private string initiateapi; @value("http://${app.host}/api/messages") private string messagesapi; @value("${app.host}") private string host; @inject private resttemplate resttemplate; @before public void setup() throws exception { string request = new scanner(new classpathresource("sample-request.xml").getfile()).usedelimiter("\\z").next(); responseentity response = resttemplate.exchange(gettesturl(initiateapi), httpmethod.post, getbasicauthheaders(request), org.ncpdp.schema.transport.message.class, collections.emptymap()); assertequals(httpstatus.ok, response.getstatuscode()); } @test public void testgetmessages() { httpentity request = getbasicauthheaders(null); responseentity> result = resttemplate.exchange(gettesturl(messagesapi), httpmethod.get, request, new parameterizedtypereference>() {}); httpstatus status = result.getstatuscode(); collection messages = result.getbody().getcontent(); log.debug("messages found: " + messages.size()); assertequals(httpstatus.ok, status); for (message message : messages) { log.debug("message.id: " + message.getid()); log.debug("message.datecreated: " + message.getdatecreated()); } } private httpentity getbasicauthheaders(string body) { string plaincreds = "user:pass"; byte[] plaincredsbytes = plaincreds.getbytes(); byte[] base64credsbytes = base64.encodebase64(plaincredsbytes); string base64creds = new string(base64credsbytes); httpheaders headers = new httpheaders(); headers.add("authorization", "basic " + base64creds); headers.add("content-type", "application/xml"); if (body == null) { return new httpentity<>(headers); } else { return new httpentity<>(body, headers); } } } to get spring data to populate the message id, i created a custom restconfig class to expose it. i learned how to do this from tommy ziegler . /** * used to expose ids for resources. */ @configuration public class restconfig extends repositoryrestmvcconfiguration { @override protected void configurerepositoryrestconfiguration(repositoryrestconfiguration config) { config.exposeidsfor(message.class); config.setbaseuri("/api"); } } summary this article explains how i built a rest api using jaxb, spring boot, spring data and liquibase. it was relatively easy to build, but required some tricks to access it with spring's resttemplate. figuring out how to customize jaxb's code generation was also essential to make things work. i started developing the project with spring boot 1.1.7, but upgraded to 1.2.0.m2 after i found it supported log4j2 and configuring spring data rest's base uri in application.yml. when i handed the project off to my client last week, it was using 1.2.0.build-snapshot because of a bug when running in tomcat . this was an enjoyable project to work on. i especially liked how easy spring data makes it to expose jpa entities in an api. spring boot made things easy to configure once again and liquibase seems like a nice tool for database migrations. if someone asked me to develop a rest api on the jvm, which frameworks would i use? spring boot, spring data, jackson, joda-time, lombok and liquibase. these frameworks worked really well for me on this particular project.

October 30, 2014

by Matt Raible

· 64,305 Views

Sharding Pitfalls Part III: Chunk Balancing and Collection Limits

In Parts 1 and 2 we have covered a number of common issues people run into when managing a sharded MongoDB cluster. In this final post of the series we will cover a subtle, but important distinction in terms of balancing a sharded cluster as well as an interesting limitation that can be worked around relatively easily, but is nonetheless surprising when it comes up. 6. Chunk balancing != data balancing != traffic balancing The balancer in a sharded cluster cares about just one thing: Are chunks for a given collection evenly balanced across all shards? If they are not, then it will take steps to rectify that imbalance. This all sounds perfectly logical, and even with extra complexity like tagging involved the logic is pretty straight forward. If we assume that all chunks are equal, then we can rest assured that our data is being evenly balanced across all the shards in our cluster and rest easy at night. Although that is sometimes, perhaps even frequently, the case it is not always true - chunks are not always equal. There can be massive “jumbo” chunks that exceed the maximum chunk size (64MiB), completely empty chunks and everything in between. Let’s use an example from our first pitfall, the monotonically increasing shard key. For our example, we have picked just such a key to shard on (date), and up until this point we have had just one shard and had not sharded the collection. We are about to add a second shard to our cluster and so we enable sharding on the collection and do the necessary admin work to add the new shard into the cluster. Once the collection is enabled for sharding, the first shard contains all the newly minted chunks. Let’s represent them in a simplified table of 10 chunks. This is not representative of a real data set, but it will do for illustrative purposes: Table 1 - Initial Chunk Layout Now we add our second shard. The balancer will kick in and attempt to distribute the chunks evenly. It will do this by moving the lowest range chunks to the new shard until the counts are identical. Once it is finished balancing, our table now looks like this: Table 2 - Balanced Chunk Layout That looks pretty good at the moment, but lets imagine that more recent chunks are more likely to have more activity (updates say) than older chunks. Adding the traffic share estimates for each chunk shows that shard1 is taking far more traffic (72%) than shard2 (28%) despite the chunks seeming balanced overall based on the approximate size. Hence, chunk balancing is not equal to traffic balancing. Using that same example, let’s add another wrinkle - periodic deletion of old data. Every 3 months we run a job to delete any data older than 12 months. Let’s look at the impact of that on our table after we run it for the first time (assuming the first run happens on July 1st 2015). Table 3 - Post-Delete Chunk Layout The distribution of data is now completely skewed toward shard1 - shard2 is in fact empty! However, the balancer is completely unaware of this imbalance - the chunk count has remained the same the entire time, and as far as it is concerned the system is in a steady state. With no data on shard2, our traffic imbalance as seen above will be even worse, and we have essentially negated the benefit of having a second shard for this collection. Possible Mitigation Strategies If data and traffic balance are important, select an appropriate shard key Move chunks manually to address the imbalances - swap “hot” chunks for “cool” chunks, empty chunks for larger chunks 7. Waiting too long to shard a collection (collection too large) This is not very common, but when it falls on your shoulders, it can be quite challenging to solve. There is a maximum data size for a collection when when it is initially split which is a function of the chunk size and data size as noted on the limits page. If your collection contains less than 256GiB of data, then there will be no issue. If the collection size exceeds 256GiB but is less than 400GiB, then MongoDB may be able to do an initial split without any special measures being taken. Otherwise, with larger initial data sizes and the default settings, the initial split will fail. It is worth noting that once split the collection may grow as needed and without any real limitations as long as you can continue to add shards as data size grows. Possible Mitigation Strategies Since the limit is dictated by the chunk size and the data size, and assuming there is not much to be done about the data size, then the remaining variable is the chunk size. This is adjustable (default is 64MiB) and can be raised in order to let a large collection split initially and then reduced once that has been completed. The required chunk size increase will depend on the actual data size. However, this is relatively easy to work out - simply divide your data size by 256GB and then multiply that figure by 64MiB (and round up if it is not a nice even number). As an example, let’s consider a 4TiB collection: 4TiB divided by 256GiB = 16 64MiB x 16 = 1024MiB Hence, set the max chunk size to 1024MiB, then perform the initial sharding of the collection, and then finally reduce the chunk size back to 64MiB using the same procedure. . Thanks for reading through the Sharding Pitfall series! If you want to learn more about managing MongoDB deployments at scale, sign up for my online education course, MongoDB Advanced Deployment and Operations. Planning for scale? No problem: MongoDB is here to help. Get a preview of what it’s like to work with MongoDB’s Technical Services Team. Give us some details on your deployment and we can set you up with an expert who can provide detailed guidance on all aspects of scaling with MongoDB, based on our experience with hundreds of deployments.

October 27, 2014

by Francesca Krihely

· 4,294 Views

How to Avoid Hash Collisions When Using MySQL’s CRC32 Function

Originally Written by Arunjith Aravindan Percona Toolkit’s pt-table-checksum performs an online replication consistency check by executing checksum queries on the master, which produces different results on replicas that are inconsistent with the master – and the tool pt-table-sync synchronizes data efficiently between MySQL tables. The tools by default use the CRC32. Other good choices include MD5 and SHA1. If you have installed the FNV_64 user-defined function, pt-table-sync will detect it and prefer to use it, because it is much faster than the built-ins. You can also use MURMUR_HASH if you’ve installed that user-defined function. Both of these are distributed with Maatkit. For details please see the tool’s documentation. Below are test cases similar to what you might have encountered. By using the table checksum we can confirm that the two tables are identical and useful to verify a slave server is in sync with its master. The following test cases with pt-table-checksum and pt-table-sync will help you use the tools more accurately. For example, in a master-slave setup we have a table with a primary key on column “a” and a unique key on column “b”. Here the master and slave tables are not in sync and the tables are having two identical values and two distinct values. The pt-table-checksum tool should be able to identify the difference between master and slave and the pt-table-sync in this case should sync the tables with two REPLACE queries. +-----+-----+ +-----+-----+ | a | b | | a | b | +-----+-----+ +-----+-----+ | 2 | 1 | | 2 | 1 | | 1 | 2 | | 1 | 2 | | 4 | 3 | | 3 | 3 | | 3 | 4 | | 4 | 4 | +-----+-----+ +-----+-----+ Case 1: Non-cryptographic Hash function (CRC32) and the Hash collision. The tables in the source and target have two different columns and in general way of thinking the tools should identify the difference. But the below scenarios explain how the tools can be wrongly used and how to avoid them – and make things more consistent and reliable when using the tools in your production. The tools by default use the CRC32 checksums and it is prone to hash collisions. In the below case the non-cryptographic function (CRC32) is not able to identify the two distinct values as the function generates the same value even we are having the distinct values in the tables. CREATE TABLE `t1` ( `a` int(11) NOT NULL, `b` int(11) NOT NULL, PRIMARY KEY (`a`), UNIQUE KEY `b` (`b`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; Master Slave +-----+-----+ +-----+-----+ | a | b | | a | b | +-----+-----+ +-----+-----+ | 2 | 1 | | 2 | 1 | | 1 | 2 | | 1 | 2 | | 4 | 3 | | 3 | 3 | | 3 | 4 | | 4 | 4 | +-----+-----+ +-----+-----+ Master: [root@localhost mysql]# pt-table-checksum --replicate=percona.checksum --create-replicate-table --databases=db1 --tables=t1 localhost --user=root --password=*** --no-check-binlog-format TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE 09-17T00:59:45 0 0 4 1 0 1.081 db1.t1 Slave: [root@localhost bin]# ./pt-table-sync --print --execute --replicate=percona.checksum --tables db1.t1 --user=root --password=*** --verbose --sync-to-master 192.**.**.** # Syncing via replication h=192.**.**.**,p=...,u=root # DELETE REPLACE INSERT UPDATE ALGORITHM START END EXIT DATABASE.TABLE Narrowed down to BIT_XOR: Master: mysql> SELECT BIT_XOR(CAST(CRC32(CONCAT_WS('#', `a`, `b`)) AS UNSIGNED)) FROM `db1`.`t1`; +------------------------------------------------------------+ | BIT_XOR(CAST(CRC32(CONCAT_WS('#', `a`, `b`)) AS UNSIGNED)) | +------------------------------------------------------------+ | 6581445 | +------------------------------------------------------------+ 1 row in set (0.00 sec) Slave: mysql> SELECT BIT_XOR(CAST(CRC32(CONCAT_WS('#', `a`, `b`)) AS UNSIGNED)) FROM `db1`.`t1`; +------------------------------------------------------------+ | BIT_XOR(CAST(CRC32(CONCAT_WS('#', `a`, `b`)) AS UNSIGNED)) | +------------------------------------------------------------+ | 6581445 | +------------------------------------------------------------+ 1 row in set (0.16 sec) Case 2: As the tools are not able to identify the difference, let us add a new row to the slave and check if the tools are able to identify the distinct values. So I am adding a new row (5,5) to the slave. mysql> insert into db1.t1 values(5,5); Query OK, 1 row affected (0.05 sec) Master Slave +-----+-----+ +-----+-----+ | a | b | | a | b | +-----+-----+ +-----+-----+ | 2 | 1 | | 2 | 1 | | 1 | 2 | | 1 | 2 | | 4 | 3 | | 3 | 3 | | 3 | 4 | | 4 | 4 | +-----+-----+ | 5 | 5 | +-----+-----+ [root@localhost mysql]# pt-table-checksum --replicate=percona.checksum --create-replicate-table --databases=db1 --tables=t1 localhost --user=root --password=*** --no-check-binlog-format TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE 09-17T01:01:13 0 1 4 1 0 1.054 db1.t1 [root@localhost bin]# ./pt-table-sync --print --execute --replicate=percona.checksum --tables db1.t1 --user=root --password=*** --verbose --sync-to-master 192.**.**.** # Syncing via replication h=192.**.**.**,p=...,u=root # DELETE REPLACE INSERT UPDATE ALGORITHM START END EXIT DATABASE.TABLE DELETE FROM `db1`.`t1` WHERE `a`='5' LIMIT 1 /*percona-toolkit src_db:db1 src_tbl:t1 src_dsn:P=3306,h=192.**.**.**. 10,p=...,u=root dst_db:db1 dst_tbl:t1 dst_dsn:h=192.**.**.**,p=...,u=root lock:1 transaction:1 changing_src:percona.checksum replicate:percona.checksum bidirectional:0 pid:5205 user:root host:localhost.localdomain*/; REPLACE INTO `db1`.`t1`(`a`, `b`) VALUES ('3', '4') /*percona-toolkit src_db:db1 src_tbl:t1 src_dsn:P=3306,h=192.**.**.**, p=...,u=root dst_db:db1 dst_tbl:t1 dst_dsn:h=192.**.**.**,p=...,u=root lock:1 transaction:1 changing_src:percona.checksum replicate:percona.checksum bidirectional:0 pid:5205 user:root host:localhost.localdomain*/; REPLACE INTO `db1`.`t1`(`a`, `b`) VALUES ('4', '3') /*percona-toolkit src_db:db1 src_tbl:t1 src_dsn:P=3306,h=192.**.**.**, p=...,u=root dst_db:db1 dst_tbl:t1 dst_dsn:h=192.**.**.**,p=...,u=root lock:1 transaction:1 changing_src:percona.checksum replicate:percona.checksum bidirectional:0 pid:5205 user:root host:localhost.localdomain*/; # 1 2 0 0 Chunk 01:01:43 01:01:43 2 db1.t1 Well, apparently the tools are now able to identify the newly added row in the slave and the two other rows having the difference. Case 3: Advantage of Cryptographic Hash functions (Ex: Secure MD5) As such let us make the tables as in the case1 and ask the tools to use the cryptographic (secure MD5) hash functions instead the usual non-cryptographic function. The default CRC32 function provides no security due to their simple mathematical structure and too prone to hash collisions but the MD5 provides better level of integrity. So let us try with the –function=md5 and see the result. Master Slave +-----+-----+ +-----+-----+ | a | b | | a | b | +-----+-----+ +-----+-----+ | 2 | 1 | | 2 | 1 | | 1 | 2 | | 1 | 2 | | 4 | 3 | | 3 | 3 | | 3 | 4 | | 4 | 4 | +-----+-----+ +-----+-----+ Narrowed down to BIT_XOR: Master: mysql> SELECT 'test', 't2', '1', NULL, NULL, NULL, COUNT(*) AS cnt, COALESCE(LOWER(CONCAT(LPAD(CONV(BIT_XOR(CAST(CONV(SUBSTRING (@crc, 1, 16), 16, 10) AS UNSIGNED)), 10, 16), 16, '0'), LPAD(CONV(BIT_XOR(CAST(CONV(SUBSTRING(@crc := md5(CONCAT_WS('#', `a`, `b`)) , 17, 16), 16, 10) AS UNSIGNED)), 10, 16), 16, '0'))), 0) AS crc FROM `db1`.`t1`; +------+----+---+------+------+------+-----+----------------------------------+ | test | t2 | 1 | NULL | NULL | NULL | cnt | crc | +------+----+---+------+------+------+-----+----------------------------------+ | test | t2 | 1 | NULL | NULL | NULL | 4 | 000000000000000063f65b71e539df48 | +------+----+---+------+------+------+-----+----------------------------------+ 1 row in set (0.00 sec) Slave: mysql> SELECT 'test', 't2', '1', NULL, NULL, NULL, COUNT(*) AS cnt, COALESCE(LOWER(CONCAT(LPAD(CONV(BIT_XOR(CAST(CONV(SUBSTRING (@crc, 1, 16), 16, 10) AS UNSIGNED)), 10, 16), 16, '0'), LPAD(CONV(BIT_XOR(CAST(CONV(SUBSTRING(@crc := md5(CONCAT_WS('#', `a`, `b`)) , 17, 16), 16, 10) AS UNSIGNED)), 10, 16), 16, '0'))), 0) AS crc FROM `db1`.`t1`; +------+----+---+------+------+------+-----+----------------------------------+ | test | t2 | 1 | NULL | NULL | NULL | cnt | crc | +------+----+---+------+------+------+-----+----------------------------------+ | test | t2 | 1 | NULL | NULL | NULL | 4 | 0000000000000000df024e1a4a32c31f | +------+----+---+------+------+------+-----+----------------------------------+ 1 row in set (0.00 sec) [root@localhost mysql]# pt-table-checksum --replicate=percona.checksum --create-replicate-table --function=md5 --databases=db1 --tables=t1 localhost --user=root --password=*** --no-check-binlog-format TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE 09-23T23:57:52 0 1 12 1 0 0.292 db1.t1 [root@localhost bin]# ./pt-table-sync --print --execute --replicate=percona.checksum --tables db1.t1 --user=root --password=amma --verbose --function=md5 --sync-to-master 192.***.***.*** # Syncing via replication h=192.168.56.102,p=...,u=root # DELETE REPLACE INSERT UPDATE ALGORITHM START END EXIT DATABASE.TABLE REPLACE INTO `db1`.`t1`(`a`, `b`) VALUES ('3', '4') /*percona-toolkit src_db:db1 src_tbl:t1 src_dsn:P=3306,h=192.168.56.101,p=..., u=root dst_db:db1 dst_tbl:t1 dst_dsn:h=192.***.***.***,p=...,u=root lock:1 transaction:1 changing_src:percona.checksum replicate:percona.checksum bidirectional:0 pid:5608 user:root host:localhost.localdomain*/; REPLACE INTO `db1`.`t1`(`a`, `b`) VALUES ('4', '3') /*percona-toolkit src_db:db1 src_tbl:t1 src_dsn:P=3306,h=192.168.56.101,p=..., u=root dst_db:db1 dst_tbl:t1 dst_dsn:h=192.***.**.***,p=...,u=root lock:1 transaction:1 changing_src:percona.checksum replicate:percona.checksum bidirectional:0 pid:5608 user:root host:localhost.localdomain*/; # 0 2 0 0 Chunk 04:46:04 04:46:04 2 db1.t1 Master Slave +-----+-----+ +-----+-----+ | a | b | | a | b | +-----+-----+ +-----+-----+ | 2 | 1 | | 2 | 1 | | 1 | 2 | | 1 | 2 | | 4 | 3 | | 4 | 3 | | 3 | 4 | | 3 | 4 | +-----+-----+ +-----+-----+ The MD5 did the trick and solved the problem. See the BIT_XOR result for the MD5 given above and the function is able to identify the distinct values in the tables and resulted with the different crc values. The MD5 (Message-Digest algorithm 5) is a well-known cryptographic hash function with a 128-bit resulting hash value. MD5 is widely used in security-related applications, and is also frequently used to check the integrity but MD5() and SHA1() are very CPU-intensive with slower checksumming if chunk-time is included.

October 24, 2014

by Peter Zaitsev

· 7,096 Views

AngularJS: Different Ways of Using Array Filters

In this article, we learn how to utilize AngularJS's array filter features in a variety of use cases. Read on to get started.

October 24, 2014

by Siva Prasad Reddy Katamreddy

· 315,797 Views · 5 Likes

Understanding Information Retrieval by Using Apache Lucene and Tika - Part 1

introduction in this tutorial, the apache lucene and apache tika frameworks will be explained through their core concepts (e.g. parsing, mime detection, content analysis, indexing, scoring, boosting) via illustrative examples that should be applicable to not only seasoned software developers but to beginners to content analysis and programming as well. we assume you have a working knowledge of the java™ programming language and plenty of content to analyze. throughout this tutorial, you will learn: how to use apache tika's api and its most relevant functions how to develop code with apache lucene api and its most important modules how to integrate apache lucene and apache tika in order to build your own piece of software that stores and retrieves information efficiently. (project code is available for download) what are lucene and tika? according to apache lucene's site, apache lucene represents an open source java library for indexing and searching from within large collections of documents. the index size represents roughly 20-30% the size of text indexed and the search algorithms provide features like: ranked searching - best results returned first many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more. in this tutorial we will demonstrate only phrase queries. fielded search (e.g. title, author, contents) sorting by any field flexible faceting, highlighting, joins and result grouping pluggable ranking models, including the vector space model and okapi bm25 but lucene's main purpose is to deal directly with text and we want to manipulate documents, who have various formats and encoding. for parsing document content and their properties the apache tika library it is necessary. apache tika is a library that provides a flexible and robust set of interfaces that can be used in any context where metadata analyzis and structured text extraction is needed. the key component of apache tika is the parser (org.apache.tika.parser.parser ) interface because it hides the complexity of different file formats while providing a simple and powerful mechanism to extract structured text content and metadata from all sorts of documents. criterias for tika parsing design streamed parsing the interface should require neither the client application nor the parser implementation to keep the full document content in memory or spooled to disk. this allows even huge documents to be parsed without excessive resource requirements. structured content a parser implementation should be able to include structural information (headings, links, etc.) in the extracted content. a client application can use this information for example to better judge the relevance of different parts of the parsed document. input metadata a client application should be able to include metadata like the file name or declared content type with the document to be parsed. the parser implementation can use this information to better guide the parsing process. output metadata a parser implementation should be able to return document metadata in addition to document content. many document formats contain metadata like the name of the author that may be useful to client applications. context sensitivity while the default settings and behaviour of tika parsers should work well for most use cases, there are still situations where more fine-grained control over the parsing process is desirable. it should be easy to inject such context-specific information to the parsing process without breaking the layers of abstraction. requirements maven 2.0 or higher java 1.6 se or higher lesson 1: automate metadata extraction from any file type our premisses are the following: we have a collection of documents stored on disk/database and we would like to index them; these documents can be word documents, pdfs, htmls, plain text files etc. as we are developers, we would like to write reusable code that extracts file properties regarding format (metadata) and file content. apache tika has a mimetype repository and a set of schemes (any combination of mime magic, url patterns, xml root characters, or file extensions) to determine if a particular file, url, or piece of content matches one of its known types. if the content does match, tika has detected its mimetype and can proceed to select the appropriate parser. in the sample code, the file type detection and its parsing is being covered inside the class com.retriever.lucene.index.indexcreator , method indexfile. listing 1.1 analyzing a file with tika public static documentwithabstract indexfile(analyzer analyzer, file file) throws ioexception { metadata metadata = new metadata(); contenthandler handler = new bodycontenthandler(10 * 1024 * 1024); parsecontext context = new parsecontext(); parser parser = new autodetectparser(); inputstream stream = new fileinputstream(file); //open stream try { parser.parse(stream, handler, metadata, context); //parse the stream } catch (tikaexception e) { e.printstacktrace(); } catch (saxexception e) { e.printstacktrace(); } finally { stream.close(); //close the stream } //more code here } the above code displays how a file it is being parsed using org.apache.tika.parser.autodetectparser; this kind of implementation was chosen because we would like to achieve parsing documents disregarding their format. also, for handling the content the org.apache.tika.sax.bodycontenthandler wasconstructed with a writelimit given as parameter ( 10*1024*1024); this type of constructor creates a content handler that writes xhtml body character events to an internal string buffer and in case of documents with large content is less likely to throw a saxexception (thrown when the default write limit is reached). as a result of our parsing we have obtained a metadata object that we can now use to detect file properties (title or any other header specific to a document format). metadata processing can be done as described below ( com.retriever.lucene.index.indexcreator , method indexfiledescriptors) : listing 1.2 processing metadata private static document indexfiledescriptors(string filename, metadata metadata) { document doc = new document(); //store file name in a separate textfield doc.add(new textfield(isearchconstants.field_file, filename, store.yes)); for (string key : metadata.names()) { string name = key.tolowercase(); string value = metadata.get(key); if (stringutils.isblank(value)) { continue; } if ("keywords".equalsignorecase(key)) { for (string keyword : value.split(",?(\\s+)")) { doc.add(new textfield(name, keyword, store.yes)); } } else if (isearchconstants.field_title.equalsignorecase(key)) { doc.add(new textfield(name, value, store.yes)); } else { doc.add(new textfield(name, filename, store.no)); } } in the method presented above we store the file name in a separate field and also the document's title ( a document can have a title different from its file name); we are not interested in storing other informations.

October 22, 2014

by Ana-Maria Mihalceanu

· 18,762 Views · 4 Likes

Sharding Pitfalls Part II: Running a Sharded Cluster

By Adam Comerford, Senior Solutions Engineer In Part I we discussed important considerations when picking a shard key. In this post we will go through some recommendations when running a sharded cluster at scale. Scalability is one of the core benefits of sharding in MongoDB but this can give you a false sense of security; even with that flexibility, you still have to make smart decisions about how and when you deploy resources. In this post, we will cover a couple of common mistakes that people tend to make when it comes to running a sharded cluster. 3. Waiting too long to add a new shard (overloaded) You sharded your database and scaled horizontally for a reason, perhaps it was to add more memory or disk capacity. Whatever the reason, if your application usage grows over time so (generally) does your database utilization. Eventually, your current sharded cluster will pass a certain point, let’s call it 80% utilized (as a nice round estimate), such that it becomes problematic to add another shard. Why? Well, adding a new shard to a cluster is not free, and it is not instantaneous. It consumes resources and (initially) accepts very little traffic. Essentially, at the start of its existence, a newly added shard costs you capacity instead of adding capacity. The length of time it will stay in this state will depend on the balancer and how long it takes for a significant portion of “busy/active” chunks to move onto the new shard. It can often be easier to visualize this process, so let’s make up some hypothetical numbers and set the bar relatively low. Our imaginary existing cluster will be a set of 2 shards, with 2000 chunks (500 considered “active”) and to that we need to add a 3rd shard. This 3rd shard will eventually store one third of the active chunks (and total chunks). The question is, when does this shard stop adding overhead overall and instead become an asset? In reality, this will vary from cluster to cluster and have a lot of dependencies and variables - in other words you need to have good metrics about your cluster, particularly your load bottleneck. Therefore we will once again use our imaginations and go with a relatively low bar: when 5% of active chunks—that is, those chunks seeing most traffic—have migrated to the new shard, you should expect a net gain in performance. In our imaginary system we have evaluated our load levels, the expected impact of migrations and have determine that once that 5% threshold of active chunks has been migrated to the new shard it can be considered a net gain for the overall system. Once all chunks have been balanced, then the migration overhead disappears, but initially this will be an expected trade off. This chart shows how long it would take for new shards to reach net positive contribution in your cluster (the dotted line implies net gain): In this fabricated example, it takes almost 2 hours for the new shard to attain a viable level of active chunks and be considered a net gain for the overall system. Although these numbers are fictional, these numbers are based on setups we have seen in real systems with moderate load. From there it is relatively easy to imagine this set of migrations taking even longer on an overloaded set of shards, and taking far longer for our newly added shard to cross the threshold and become a net gain. As such it is best to be proactive and add capacity before it becomes a necessity. Possible Mitigation Strategies Manual balancing of targeted “hot” chunks (chunk that is being accessed more than others) to move activity to the new shard more quickly Add the shard at low traffic time so that there is less competition for resources Disable balancing on some collections, prioritise balancing busy collections first 4. Under-provisioning Config Servers Provisioning enough resources without being wasteful is always tricky, and all the more so in a complicated distributed system like a MongoDB sharded cluster. Everyone wants to use their hardware, virtual instances, virtual machines, containers and the like in the most efficient way possible, and get the best bang for their buck. Hence it is only natural to take a look at the various pieces of a distributed cluster and look for lower utilized pieces that could be put on less expensive resources. The most common pitfall here with MongoDB are the config servers, which are often neglected when stress testing a cluster. In testing environments and smaller deployments (unless specific measures are taken to stress them) they are relatively lightly loaded and usually identified as candidates for lesser instances/hardware. The problem is that these are critical pieces of infrastructure. They may not be heavily loaded all the time, but when they do see load and struggle to service requests, that can impact all queries (reads, writes, authentication) and add latency to all requests made of the cluster in question. In particular, the first config server in the list supplied to your mongos processes is vital. This is the config server that all mongos processes will default to read from when fetching or refreshing their view of the data distribution in your cluster. Similarly, this is the server that will be hit when attempting to authenticate a user. If it is under-provisioned and cannot service queries, or if it has problems with networking (packet loss, congestion), then the effects will be significant. Possible Mitigation Strategies Ensure the config servers are load tested, slightly over-provisioned (the first config server in particular) If using virtual machines or cloud based instances, investigate increasing available resources Turning off the balancer, disabling chunk splitting will reduce the chances of high read traffic to the config servers (no migrations, no meta data refresh) but this is only a temporary fix unless you have a perfect write distribution and may not eliminate issues completely. 5. Using the count() command on sharded collections This pitfall is very common, and it seems to hit somewhat randomly in terms of how long someone has been running a sharded environment. At some point, a question will arise along the lines of: “How are we tracking/verifying/checking how many documents we have in each collection on each shard, how balanced are they and do they agree with ?” Hopefully no one is actually constructing questions this way in your organization, but you get the basic idea. The most obvious way to do a quick check on this type of thing is to count the documents and see if the numbers make sense and/or agree with counts elsewhere. That thinking naturally leads people to the count command and they proceed to use it to gather figures for their documents and collections. Unfortunately, on a busy, mature sharded cluster, the results will very rarely be what is expected. The reason for this is that the count command as implemented today has several optimizations in place to make it faster to run in general and those speed optimizations essentially bypass a key piece of the sharding functionality needed to return accurate results in this case. This is a known bug and is being tracked in SERVER-3645, but does not stop people from consistently hitting this issue. The nature of the issue means that count will report documents in the results that it should not, for example: Documents that are being deleted as part of a chunk migrations Documents that have been left behind from previous chunk migrations (also known as orphans) Documents currently being copied as part of an in-flight chunk migration A regular query (rather than a count) will have its results filtered by the respective primary and not suffer from the same problem. Hence, if you were to manually count the results from a query client-side you would get an accurate result. This quirk of sharded environments will eventually be fixed, but for now it will inevitably crop up from time to time in all active sharded clusters used by a large team. Possible Mitigation Strategies Do counts on the client side, or use targeted, range based queries (with a primary read preference) to count instead Use cleanUpOrphaned and disable the balancer (make sure it has finished current round) when performing counts across the cluster If you want tolearn more about managing MongoDB deployments at scale, sign up for my online education course, MongoDB Advanced Deployment and Operations. Planning for scale? No problem: MongoDB is here to help. Get a preview of what it’s like to work with MongoDB’s Technical Services Team. Give us some details on your deployment and we can set you up with an expert who can provide detailed guidance on all aspects of scaling with MongoDB, based on our experience with hundreds of deployments.

October 21, 2014

by Francesca Krihely

· 4,734 Views

Measuring String Concatenation in Logging

introduction i had an interesting conversation today about the cost of using string concatenation in log statements. in particular, debug log statements often need to print out parameters or the current value of variables, so they need to build the log message dynamically. so you wind up with something like this: logger.debug("parameters: " + param1 + "; " + param2); the issue arises when debug logging is turned off. inside the logger.debug() statement a flag is checked and the method returns immediately; this is generally pretty fast. but the string concatenation had to occur to build the parameter prior to calling the method, so you still pay its cost. since debug tends to be turned off in production, this is the time when this difference matters. for this reason, we have pretty much all been trained to do this: if (logger.isdebugenabled()) { logger.debug("parameters: " + param1 + "; " + param2); } the discussion was about how much difference this “good practice” makes. caliper this kind of question is perfect for a micro benchmark. my own favorite tool for this purpose is caliper . caliper runs small snippets of code enough times to average out variations. it passes in a number of repetitions, which it calculates in order to make sure that the whole method takes long enough to measure given the resolution of the system clock. caliper also detects garbage collection and hotspot compiling that might impact the accuracy of the tests. caliper uploads results to a google app engine application. its sign-in supports google logins and issues an api key that can be used to organize results and list them. a typical timing methods looks like this: public string timemultstringnocheck(long reps) { for (int i = 0; i < reps; i++) { logger.debug(strings[0] + " " + strings[1] + " " + strings[2] + " " + strings[3] + " " + strings[4]); } return strings[0]; } the return string is not used; it is included in the method solely to ensure that java does not optimize away the method. similarly, the content of the variables used should be randomly generated to avoid compile-time optimization. the full example is available in one of my github repositories , in the org.anvard.introtojava.log package. results the outcome is pretty interesting. string concatenation creates a pretty significant penalty, around two orders of magnitude for our example that concatenates five strings. interesting is that even in the case where we do not use string concatenation (i.e. the simplestring methods), the penalty is around 4x. this is probably the time spent pushing the string parameter onto the stack. the examples with doubles, using string.format() , is even more extreme, four orders of magnitude. the elapsed time here about 4ms, large enough that if the log statement were in a commonly used method, the performance hit would be noticeable. the final method, multstringparams , uses a feature that is available in the slf4j api. it works similarly to string.format() , but in a simple token replace fashion. most importantly, it does not perform the token replace unless the logging level is enabled. this makes this form just as fast as the “check” forms, but in a more compact form. of course, this only works if no special formatting is needed of the log string, or if the formatting can be shifted to a method such as tostring() . what is especially surprising is that this method did not show a penalty in building the object array necessary to pass the parameters into the method. this may have been optimized out by the java runtime since there was no chance of the parameters being used. conclusion the practice of checking whether a logging level is enabled before building the log statement is certainly worthwhile and should be something teams check during peer review.

October 20, 2014

by Alan Hohn

· 10,532 Views

Functional Programming with Java 8 Functions

Learn how to use lambda expressions and anonymous functions in Java 8.

October 20, 2014

by Edwin Dalorzo

· 330,158 Views · 25 Likes

Unlocking and Erasing FLASH with Segger J-Link

when using a bootloader (see “ serial bootloader for the freedom board with processor expert “), then i usually protect the bootloader flash areas, so it does not get accidentally erased by the application ;-). when programming my boards with the p&e multilink, then the p&e firmware will automatically unlock and erase the chip. that’s not the same if working with the segger j-link, as it requires extra steps. protected flash pages with processor expert failed programming with protected flash if i try to re-program the protected bootloader with segger j-link (both in codewarrior and eclipse/kds with gdb), then the download silently fails. the effect is that somehow the application on the board does not match what it should be. looking at the console view, it shows that erase has failed (but no real error reported) :-(: jlink: failed to erase sectors 0 @ address 0x00000000 (algo135: flash protection violation. flash is write-protected.) j-link failed to erase in codewarrior the gnu arm eclipse segger integration with gdb (e.g. kinetis design studio) is not better: no error sign, the only thing is a hidden error in the console log of the jlinkgdbservercl: error: failed to erase sectors 0 @ address 0x00000000 (algo135: flash protection violation. flash is write-protected.) error algo135 flash protection violation about failed flash programming what i need is to unprotect the memory and then erase it. erasing the segger j-link features a very fast programming. part of that speed is that the segger firmware checks each flash page if it really needs to be programmed, and only then it erases and reprogrammed that page. so downloading twice the same application actually will not touch the flash memory at all. additionally, it does not do a complete erase of the device: it only programs the pages i’m using in my application. the advantage of that is first speed. and it does not erase the application data i’m using in non-volatile memory (see “ configuration data: using the internal flash instead of an external eeprom “). however, sometimes i really need to clear all my data in flash too, and then i need to erase all my flash pages on the device. segger has product named ‘j-flash’ which is used to flash and erase devices outside of an ide. there is a free-of-charge ‘lite’ version available for download from segger. this utility is not intended to be used for production. with this utility i have a gui to erase and program my device. j-flash lite but j-flash lite cannot unlock my locked flash pages :-(. if my device is not locked, i can use the codewarrior ‘flash file to target’ (see “ flashing with a button (and a magic wand) “) to erase the device: erasing device with flash file to target again, this does not work if the device is locked. codewarrior has another feature called ‘target task’ which can be used to erase/unsecure (if your device is supported), see “ device is secure? “. so i need to use a different tool to unlock and unprotect my device: the j-link commander . unlocking and erasing with j-link commander to unlock the device, segger has a utility named ‘j-link commander’, available from http://www.segger.com/jlink-software.html . the binary is ‘jlink.exe’ on windows and is a command line utility. to unlock the device use unlock kinetis unlocking device but it seems that i need to do an unlock, followed by an erase to make it permanent. to erase the device, i can use the same command line utility. but i need to specify the device name first, and then i can erase it (example for the kl25z): device mkl25z128xxx4 unlock kinetis erase :!: i need to do the erase operation right after the unlock. a) set device b) unlock and c) erase, otherwise it will fail? unlocking and erasing with j-link commander summary in order to re-program the protected flash sectors with segger j-link, i need first to unlock and mass erase the device. for this, there is the j-link commander utility which has a command line interface to unprotect and erase the device. for erasing only, the j-flash (and lite) is a very useful tool, especially to get a ‘clean’ device memory. to me, the segger way and tools are very powerful. in this case, things are very flexible, but not that obvious. so i hope this post can help others to get his device unlocked and erased. happy erasing :-)

October 17, 2014

by Erich Styger

· 20,383 Views

AppDynamics VS New Relic – Which Tool is Right For You? The Complete Guide

New Relic VS AppDynamics: All the performance features, integrations, installation procedures and pricing plans side by side to help you decide which tool to use When thinking about performance, AppDynamics and New Relic are the main modern tools that come to mind. Both spawned from the same company, Wily Technology, who also dealt with performance monitoring and was acquired by CA back in 2006 - making way to new technology. New Relic is an anagram of Lew Cirne, its founder and CEO. AppDynamics was founded by Jyoti Bansal, who was a Lead Software Architect at the same Wily Technology, which was also founded by Lew. The main goal of this guide is to help you understand the similarities and differences between the two, so you can decide which one fits your company’s needs. Table of Contents What is APM anyhow? Supported Languages and Environments Features - Backend Monitoring - Fronted & Mobile Monitoring How to Solve the Errors You Find Installation Dashboard and Usage Integrations and Plugins Pricing Conclusion 1. What is APM anyhow? It’s the only buzzword you’ll read in this article, promise. Well, maybe also DevOps, but that’s it. So application Performance Management has been around for a while, though it seems like many developers are not comfortable with it yet. APM provides us with analytics around our application’s performance - at the core this means timing how long it takes to execute different areas in the code and complete transactions - this is done either by instrumenting the code, monitoring logs, or including network / hardware metrics. On top of this basic concept, many different implementations exist - but there are a basic truths we can agree on: A modern solution should monitor production environments, so its overhead (in terms of CPU and throughput) becomes very important. Also, it should display what the web/mobile end users are experiencing, which was not part of traditional APMs. What was once considered a luxury is becoming commonplace: Rapid new deployments in production mean more chances to introduce errors to your systems architecture, slow it down, and maybe even crash it. Let’s see what AppDynamics and New Relic have in store for us. 2. Supported Environments AppDynamics: Java, Scala, .NET, PHP, Node.js, iOS and Android; including your favorite flavour of database and cloud platform. New Relic: Java, Scala, .NET, PHP, Node.js, Ruby and Python. Supported databases, cloud platforms and other plug-ins are available here. We’ll dig in deeper with extensions later on. On the user monitoring front, iOS, Android, and JavaScript support is included with both tools. Bottom line: Main difference here is New Relic’s Ruby and Python support, and different levels of support for various platforms. 3. Features Both New Relic and AppDynamics can be broken down into 6 different products, all reporting to a main dashboard interface. Let’s split these to backend, mobile and frontend to do a quick runthrough over the main offerings. Backend Monitoring The bread and butter of performance management - reporting stats, graphs and insights of your applications performance under the hood. AppDynamics and NewRelic each offer 4 approaches here: Application Performance Management High level metrics with drill downs to code level data about how your application is performing. Must have metrics include transaction response time, error rate, throughput (Requests per Minute) on NewRelic and load (calls/min) on AppDynamics. AppDynamics dashboard on the left, New Relic on the Right You’ve probably noticed the main screen at AppDynamics include a map of the services the application is using with their call loads and health index while NewRelic displays a response time graph. This might be a way to signal each tool’s monitoring priorities and AppDynamics inclination to larger enterprises. Anyhow, enough with this dev tool psychology, but it’s worth noting that a similar map is also available on New Relic: New Relic’s application map One of the thorny issues here is alerting and reporting, with so many metrics and moving parts, it’s hard to identify which matters most. Is it a low error rate? Responsiveness? Throughput? AppDynamics and New Relic each took a different approach to distill these metrics into performance indicators. New Relic is using the Apdex score index, which uses a user defined response time threshold T to imply end-user satisfaction. Simply put, they require you to manually set the threshold. Here’s an example for the way this score is calculated: Calculating Apdex, now sum this over all requests for a given time and you’ll get the score AppDynamics on the other hand, doesn’t believe in Apdex (as they explained in an article called ”Apdex is Fatally Flawed”). They’ve come up with a solution of their own that automatically creates a dynamic baseline for the apps performance which varies by time. For example, the definition of a slow transaction might vary under low and high loads on the system. Bottom line: We’re seeing that AppDynamics puts its priority on visualizing the stack from end to end, while NewRelic is focused on bottom line response times. We’ve also seen the difference in alerting with Apdex and a dynamic baselines. Server Monitoring Another monitoring capability offered by both tools focuses on the hardware your servers run on: specs, CPU usage, memory utilization, disk I/O and network IO. AppDynamics on the left, New Relic on the right with some sample data In this category, AppDynamics offers a few more features than New Relic, mostly around memory: heap size & utilization, garbage collection stats divided by gens and memory leak detection. AppDynamics Server Monitoring - Memory features Bottom line: AppDynamics provides deeper insights into garbage collection and memory leak detection beyond the standard metrics. Database Monitoring Moving on to other components in your stack, the first thing that comes to mind is the database. Here we have a greater distinction with a richer AppDynamics dashboard looking into things like resource consumption, wait states, user sessions, specific query calls and more. On New Relic’s end the situation is a bit different with the Database dashboard as part of the basic APM product. Both tools have specific database monitoring metrics available through plugins to view data from external services (we’ll talk a bit more about integrations later). Either way, both native and external feature sets here might be different depending on the database you use. AppDynamics on the left with an Oracle DB, New Relic on the right with MySQL plugin Bottom line: Beyond the shared database metrics that go a bit deeper with AppDynamics, it’s worth looking into the features available for your specific database within each tool. Insights and Analytics This one is a wildcard, going beyond traditional APM and opening up to business intelligence metrics. Since both New Relic and AppDynamics already have access to the messages that go through your application, they’ve built this opt-in additional database to store your stats and enable you to query them. AppDynamics on the left, New Relic on the right with some sample data Bottom line: If you don’t have a solution that already allows you to process such queries, it might be about time to get one. Frontend & Mobile Monitoring Switching seats from the backend, lets take a quick look at what we’re getting on the Real-User Monitoring front. Both AppDynamics and New Relic have a product targeting browsers and a product targeting mobile with iOS & Android support. On Mobile, the flagship features include insights on slowdowns and crashes, that are filtered through geographic regions, devices, operating systems and operator networks: AppDynamics on the left, New Relic on the right - Mobile Real-User Monitoring With end-user browser analytics, it feels like having the visibility you have on your browsers load times through Chrome dev tools available on the actual users of your app: AppDynamics on the left, New Relic on the right - Browser Real-User Monitroing Bottom line: We’re seeing again how New Relic’s focus is on response time bottom lines while AppDynamics emphasizes the global picture. 4. How to Solve the errors you find To go beyond the reporting and alerting of errors by AppDynamics and New Relic, many of our users add Takipi to their toolbox. This allows them not only to monitor server slowdowns and errors via New Relic or AppDynamics, but also to solve them using Takipi. Whenever a new exception is thrown or a log error occurs - Takipi captures it and shows you the variable state which caused it, across methods and machines. Takipi will overlay this over the actual code which executed at the moment of error – so you can analyze the exception as if you were there when it happened. The dashboard links each error to a recorded instance of all involved code when the bug happened, and includes the variable values that caused it: Takipi plays well with AppDynamics, and it also has a New Relic plug-in that displays an exception and log error dashboard: Takipi for New Relic Bottom line: It’s one thing to identify what’s stopping you down, but solving it, is a whole different issue. Java or Scala developers? Whether you're using an APM tool or not, try Takipi. 5. Dashboard and Usage To get a better feel of each tool’s user experience and way of solving problems, I think it’s probably best to browse through a video. But before that, it’s worth noting that AppDynamics uses a dashboard based on Flash, yes… Flash, turns out it’s still out there. This felt like a drawback but its probably the best Flash application I’ve seen out there: NewRelic at TypeSafe, a webinar that gives an overview of New Relic for Play (it’s a bit long but gives a nice overview if you browse through): Bottom line: I still can’t believe the Flash didn’t scare me off completely. And it was ok actually. Both tools provide a nice experience, but still feel a bit cluttered. 6. Installation SaaS/On-Premise: AppDynamics offers a few modes of operation - SaaS, on-premises and a hybrid approach, each with its installation instructions. New Relic is only available through SaaS. Agents: Monitoring your application becomes available through attaching language specific agents to your server. For example, with Java there are 2 possible ways to instrument your code with agents, either by using a Java agent or a native agent. New Relic and AppDynamics use a Java agent to collect the performance data they’re reporting. To gather the low-level data required not only to point to an error but to help solve it, Takipi uses a native agent. Code and configuration changes: On the Real-User Monitoring front, project and configurations changes including introducing a few dependencies would be needed if you’d like to add monitoring capabilities to your web or mobile app. This includes adding JS agents to your website and native mobile agents to your mobile application. Alerting: AppDynamics computes your response time thresholds by itself and might take some time to learn your system. New Relic relies on custom thresholds defined by you for its Apdex index. Bottom line: If you require an on-premise version, the answer is clearly AppDynamics. Otherwise, ease of installation is pretty much the same - mind the alerting though. 7. Integrations and Plugins Branching out, AppDynamics and New Relic offer integrations and plug-ins to hundreds of services. Let’s start with NewRelic, we’ve already mentioned the Platform program earlier: a plug in platform with 116 (Last time I checked) plugins to services like Hadoop, RabbitMQ and Redis, that stream metrics of their data so you can view in on New Relic. On the integrations side of the table, there’s Connect, with 53 integrations with tools like Jira, HipChat, Takipi and pagerduty. AppDynamics Exchange offers 100 plugins and is also an open platform for developers to build plugins. Bottom line: New Relic has richer integrations that feel friendlier, but any way you go it’s an individual decision to see where and how your tools of choice integrate better. 8. Pricing Both tools have a free lite version with limited features across all products, including a 24hr data retention with pro trials of 14-30 days. Pricing with AppDynamics pro programs is more individual, you’ll have to contact sales to get a customized plan based on the number of agents you need. Mobile monitoring is a bit different when each agent is priced per 5000 Monthly Active Users. With New Relic, pro account pricing starts with with $199 per month per host ($149 on a yearly plan), this includes APM, Servers, Platfrom and Browser basics. Mobile monitoring costs $49 per month ($29 on a yearly plan) with 1 week of data retention. The Insights product start from $250 per month for up to 75 million events. Bottom line: New Relic’s pricing caters more to startups and small-medium business while AppDynamics focus is on customizing solutions to enterprises. With that said, each tool ventures off to the other’s natural playground and this distinction today is not that clear as it was. Conclusion AppDynamics and New Relic are top of the line APM tools, each traditionally targeted a different type of developer, from enterprises to startups. But as both are stepping forward to their IPOs and after experiencing huge growth the lines are getting blurred. The choice is not clear, but you could not go wrong – On premise = AppDynamics, otherwise, it’s an individual call depends on which better fits your stack (and which of all these features are you actually thinking you’re going to use). Originally posted on Takipi's blog

October 16, 2014

by Chen Harel

· 12,540 Views

MySQL Replication: 'Got fatal error 1236' Causes and Cures

Originally Written by Muhammad Irfan MySQL replication is a core process for maintaining multiple copies of data – and replication is a very important aspect in database administration. In order to synchronize data between master and slaves you need to make sure that data transfers smoothly, and to do so you need to act promptly regarding replication errors to continue data synchronization. Here on the Percona Support team, we often help customers with replication broken-related issues. In this post I’ll highlight the top most critical replication error code 1236 along with the causes and cure. MySQL replication error “Got fatal error 1236” can be triggered by multiple reasons and I will try to cover all of them. Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: ‘log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master; the first event ‘binlog.000201′ at 5480571 This is a typical error on the slave(s) server. It reflects the problem around max_allowed_packet size. max_allowed_packet refers to single SQL statement sent to the MySQL server as binary log event from master to slave. This error usually occurs when you have a different size of max_allowed_packet on the master and slave (i.e. master max_allowed_packet size is greater then slave server). When the MySQL master server tries to send a bigger packet than defined on the slave server, the slave server then fails to accept it and hence the error. In order to alleviate this issue please make sure to have the same value for max_allowed_packet on both slave and master. You can read more about max_allowed_packet here. This error usually occurs when updating a huge number of rows on the master and it doesn’t fit into the value of slave max_allowed_packet size because slave max_allowed_packet size is lower then the master. This usually happens with queries “LOAD DATA INFILE” or “INSERT .. SELECT” queries. As per my experience, this can also be caused by application logic that can generate a huge INSERT with junk data. Take into account, that one new variable introduced in MySQL 5.6.6 and later slave_max_allowed_packet_size which controls the maximum packet size for the replication threads. It overrides the max_allowed_packet variable on slave and it’s default value is 1 GB. In this post, “max_allowed_packet and binary log corruption in MySQL,”my colleague Miguel Angel Nieto explains this error in detail. Got fatal error 1236 from master when reading data from binary log: ‘Could not find first log file name in binary log index file’ This error occurs when the slave server required binary log for replication no longer exists on the master database server. In one of the scenarios for this, your slave server is stopped for some reason for a few hours/days and when you resume replication on the slave it fails with above error. When you investigate you will find that the master server is no longer requesting binary logs which the slave server needs to pull in order to synchronize data. Possible reasons for this include the master server expired binary logs via system variable expire_logs_days – or someone manually deleted binary logs from master via PURGE BINARY LOGS command or via ‘rm -f’ command or may be you have some cronjob which archives older binary logs to claim disk space, etc. So, make sure you always have the required binary logs exists on the master server and you can update your procedures to keep binary logs that the slave server requires by monitoring the “Relay_master_log_file” variable from SHOW SLAVE STATUS output. Moreover, if you have set expire_log_days in my.cnf old binlogs expire automatically and are removed. This means when MySQL opens a new binlog file, it checks the older binlogs, and purges any that are older than the value of expire_logs_days (in days). Percona Server added a feature to expire logs based on total number of files used instead of the age of the binlog files. So in that configuration, if you get a spike of traffic, it could cause binlogs to disappear sooner than you expect. For more information check Restricting the number of binlog files. In order to resolve this problem, the only clean solution I can think of is to re-create the slave server from a master server backup or from other slave in replication topology. – Got fatal error 1236 from master when reading data from binary log: ‘binlog truncated in the middle of event; consider out of disk space on master; the first event ‘mysql-bin.000525′ at 175770780, the last event read from ‘/data/mysql/repl/mysql-bin.000525′ at 175770780, the last byte read from ‘/data/mysql/repl/mysql-bin.000525′ at 175771648.’ Usually, this caused by sync_binlog <>1 on the master server which means binary log events may not be synchronized on the disk. There might be a committed SQL statement or row change (depending on your replication format) on the master that did not make it to the slave because the event is truncated. The solution would be to move the slave thread to the next available binary log and initialize slave thread with the first available position on binary log as below: mysql>CHANGE MASTERTOMASTER_LOG_FILE='mysql-bin.000526',MASTER_LOG_POS=4; – [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: ‘Client requested master to start replication from impossible position; the first event ‘mysql-bin.010711′ at 55212580, the last event read from ‘/var/lib/mysql/log/mysql-bin.000711′ at 4, the last byte read from ‘/var/lib/mysql/log/mysql-bin.010711′ at 4.’, Error_code: 1236 I foresee master server crashed or rebooted and hence binary log events not synchronized on disk. This usually happens when sync_binlog != 1 on the master. You can investigate it as inspecting binary log contents as below: $mysqlbinlog--base64-output=decode-rows--verbose--verbose--start-position=55212580mysql-bin.010711 You will find this is the last position of binary log and end of binary log file. This issue can usually be fixed by moving the slave to the next binary log. In this case it would be: mysql>CHANGE MASTER TOMASTER_LOG_FILE='mysql-bin.000712',MASTER_LOG_POS=4; This will resume replication. To avoid corrupted binlogs on the master, enabling sync_binlog=1 on master helps in most cases. sync_binlog=1 will synchronize the binary log to disk after every commit. sync_binlog makes MySQL perform on fsync on the binary log in addition to the fsync by InnoDB. As a reminder, it has some cost impact as it will synchronize the write-to-binary log on disk after every commit. On the other hand, sync_binlog=1 overhead can be very minimal or negligible if the disk subsystem is SSD along with battery-backed cache (BBU). You can read more about this here in the manual. sync_binlog is a dynamic option that you can enable on the fly. Here’s how: mysql-master>SET GLOBAL sync_binlog=1; To make the change persistent across reboot, you can add this parameter in my.cnf. As a side note, along with replication fixes, it is always a better option to make sure your replica is in the master and to validate data between master/slaves. Fortunately, Percona Toolkit has tools for this purpose: pt-table-checksum & pt-table-sync. Before checking for replication consistency, be sure to check the replication environment and then, later, to sync any differences.

October 15, 2014

by Peter Zaitsev

· 40,837 Views

JSR 199 - Compiler API

JSR 199 provides the compiler API to compile the Java code inside another Java program. The following are the important classes and interfaces provided for facilitating the compilation from a Java program. JavaFileObject - Represents a compilation unit, typically a class source. SimpleJavaFileObject - Implementation of the methods defined in JavaFileObject DiagnosticCollector - Collects the compilation errors, warning into a list of Diagnostic type Diagnostic - Reports the type of the problem and details like line number, character, error reason etc. JavaFileManager - To work on the Java source and class files. JavaCompiler - The compiler instance for compiling the compilation unit. CompilationTask - A sub interface of JavaCompiler which helps to compile and return the status with diagnostic when used call method on it. Where to start To compile a Java code, we need the Java source. The source can be a physical file on the disk or a string inside the program. Using the source, we need create an instance type of JavaFileObject. Using String literal Create a class which implements JavaFileObject, here i am using SimpleJavaFileObject. We need create the path URI of the class file package com.test; import java.io.IOException; import java.net.URI; import javax.tools.SimpleJavaFileObject; public class SampleSource extends SimpleJavaFileObject { private String source; protected SampleSource(String name, String code) { super(URI.create("string:///" +name.replaceAll("\\.", "/") + Kind.SOURCE.extension), Kind.SOURCE); this.source = code ; } @Override public CharSequence getCharContent(boolean ignoreEncodingErrors) throws IOException { return source ; } } Now, create the instance of JavaFileObject and from those, create the Compilation Unit (A collection of JavaFileObject) String str = "package com.test;" + "\n" + "public class Test {" + "\npublic static void test() {" + "\nSystem.out.println(\"Comiler API Test\")-;" + "" + "\n}" + "\n}"; SimpleJavaFileObject fileObject = new SampleSource("com.test.Test", str); JavaFileObject javaFileObjects[] = new JavaFileObject[] { fileObject }; Iterable compilationUnits = Arrays .asList(javaFileObjects); From File System If the source is from physical location. Then create like this. File []files = new File[]{file1, file2, file3, file4} ; Iterable units = fileManager.getJavaFileObjectsFromFiles(Arrays.asList(files)); Create a JavaFileManger We will see, how to create a fileManger now. JavaFileManager fileManager = compiler.getStandardFileManager( diagnostics, Locale.getDefault(), Charset.defaultCharset()); To get the FileManger, we need diagnostic - A DiagnosticCollector of JavaFileObject locale - The locale of the compilation charset - The charset to be used. Compiler Get the compiler instance using ToolProvider. Finally, create the CompilationTask from the compiler instance using diagnostics, file manager and compilation units (Optionally writer and compilation options). JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); CompilationTask task = compiler.getTask(null, fileManager, diagnostics, compilationOptionss, null, compilationUnits); The argument required to get the CompilationTask are out - A writer which writes the output of the compiler. Defaults to System.err if null listener - A diagnostic listener, the errors or warning can be accessed using. options - Compiler options (Ex : -d, like we give in command line using javac ) classes - Name of the classes to be processed compilationUnits - List of compilation units Compile Finally, call the method to compile. This method to be called only once otherwise it throws IllegalStateException on multiple calls. Once compiled, returns true for successful compilation otherwise false. We need to look the diagnosticCollector to get the error/warning details. boolean status = task.call(); All together Putting all together. public static void main(String[] args) { String str = "package com.test;" + "\n" + "public class Test {" + "\npublic static void test() {" + "\nSystem.out.println(\"Comiler API Test\")-;" + "" + "\n}" + "\n}"; SimpleJavaFileObject fileObject = new SampleSource("com.test.Test", str); JavaFileObject javaFileObjects[] = new JavaFileObject[] { fileObject }; Iterable compilationUnits = Arrays .asList(javaFileObjects); Iterable compilationOptionss = Arrays.asList(new String[] { "-d", "classes" }); DiagnosticCollector diagnostics = new DiagnosticCollector(); JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); JavaFileManager fileManager = compiler.getStandardFileManager( diagnostics, Locale.getDefault(), Charset.defaultCharset()); CompilationTask task = compiler.getTask(null, fileManager, diagnostics, compilationOptionss, null, compilationUnits); boolean status = task.call(); if(!status) { System.out.println("Found errors in compilation"); int errors = 1; for(Diagnostic diagnostic : diagnostics.getDiagnostics()) { printError(errors, diagnostic); errors++; } } else System.out.println("Compilation sucessfull"); try { fileManager.close(); } catch (IOException e){} } public static void printError(int number,Diagnostic diagnostic) { System.out.println(); System.out.print(diagnostic.getKind()+" : "+number+" Type : "+diagnostic.getMessage(Locale.getDefault())); System.out.print(" at column : "+diagnostic.getColumnNumber()); System.out.println(" Line number : "+diagnostic.getLineNumber()); System.out.println("Source : "+diagnostic.getSource()); } Output Output with an error will be (because of an hyphen in System.out.println in main method of Test) Found errors in compilation ERROR : 1 Type : illegal start of expression at column : 40 Line number : 4 Source : com.test.SampleSource[string:///com/test/Test.java] ERROR : 2 Type : not a statement at column : 39 Line number : 4 Source : com.test.SampleSource[string:///com/test/Test.java] To read more about JSR 199, follow the official link. Happy Learning!!!! Read more articles at blog

October 15, 2014

by Veeresham Kardas

· 6,501 Views

jQuery Mobile Tutorial: User Registration, Login and Logout Screens for the Meeting Room Booking App

in this jquery mobile tutorial we will create the screens that will handle user registration, login and logout in a real-world meeting room booking application. this article is part of a series of mobile application development tutorials that i have been publishing on my blog jorgeramon.me. if you are new to this series, i recommend that you read its first part, as well as this mobile ui patterns article where i provide a flowchart describing the user registration, login and logout screens in a mobile application. we will use this chart as a guide for this article. here’s a screenshot: in this part of the tutorial we will only create the static html for the screens. in future articles we will implement the programming logic that makes the pages work. the first step we are going to take is to set up a jquery mobile project for the app. how to set up a jquery mobile project while you can use mobile sdks such as kendo ui mobile and intel xdk to create, debug and deploy jquery mobile apps, in this tutorial i will show you how to create a simple jquery mobile project without using the facilities provided by those sdks. i think that it’s important to understand how you can create this type of project from scratch, and how the different pieces in the project work together in an app. the project’s directories and files we need to pick a directory in our development workstations where we will place the project’s files. in my case i named that directory “apps”. in that directory, we will create a root directory for the application, which we will name “conf-rooms”. make sure that this directory is set up so it can be accessed from your local web server. under “conf-rooms” we will create a “css” directory, where we will place the css assets of the project; and an “img” directory for the images that we will use. at the same level of the “apps” directory, we will create a “lib” directory. this is where we will place the jquery mobile and any other libraries that our application will use. you also need to set up this directory so it can be accessed from your local web server. on my workstation the directories look as depicted below: now is a good time to download the jquery mobile and jquery libraries from their respective websites, and place them in the “jqm” and “jquery” directories, all under the “lib” directory. this is how the files look on my workstation: how jquery mobile works a short overview of jquery mobile for those who aren’t very familiar with it yet. as its documentation clearly explains, jquery mobile is a unified user interface system with the following characteristics: it works seamlessly across all popular mobile device platforms. it uses jquery and jquery ui as its foundations. it has a lightweight codebase built on progressive enhancement. it has a flexible and easily themeable design. an attribute that differentiates jquery mobile from other frameworks is that it targets a wide variety of mobile browsers. the reason this coverage is possible has to do with the way jquery mobile works. jquery mobile works by applying css and javascript enhancements to html pages built with clean, semantic html. the usage of semantic html ensures compatibility with most web-enabled devices. the techniques applied by the framework to an html page, transform the semantic page into a rich and interactive experience. we call these changes progressive enhancements, as they are applied progressively to the page, taking advantage of the capabilities of the browser on the web-enabled device. the enhancements result in pages that provide a great user experience on the latest mobile browsers and degrade gracefully on less capable browsers, without losing their intrinsic functionality. in addition, the framework provides support for screen readers and other assistive technologies through a tight integration with the web accessibility initiative – accessible rich internet applications suite (wai-aria) technical specification. creating the landing screen the first application that we will create is the landing screen. this screen will come up when users launch our application. as reflected in the flowchart at the beginning of this article, the landing screen is the door to all the areas of the app, and it requires that users log in before navigating any further. in the “wireframes for signing in and signing up” section of the third part of this tutorial we created the following mockup for this screen. let’s create an empty index.html file in the “conf-rooms” directory, and add a jquery mobile page template to the file as follows: book it welcome! existing users sign in don't have an account? sign up before we step through this code i want you to check out this file in a mobile browser or simulator. the result should look like this screenshot: what do you think? back in the index.html file, in the head section we have references to the jquery and jquery mobile libraries. double check that yours are pointing to the correct directories in your workstation. how to use a custom theme in jquery mobile now i want to direct your attention to the following lines: these lines mean that we are using a custom jquery mobile theme that resides in the “conf-room1.min.css” file. i created this file using jquery mobile’s theme roller . we will use this theme to give our app a look different than the standard jquery mobile themes. you can download the theme using this link . after downloading the zipped theme files, we will go to the “css” directory, create a “css/themes/1″ directory and place the unzipped theme files there. when done, the “css” directory should look like this: in the head section of the index.html file we also have this code: app.css is the css file where we will place any additional custom styles that we will use in the app. for the moment, we will add the following code to the “app.css” file: /* change html headers color */ h1,h2,h3,h4,h5 { color:#0071bc; } h2.mc-text-danger, h3.mc-text-danger { color:red; } /* change border radius of icon buttons */ .ui-btn-icon-notext.ui-corner-all { -webkit-border-radius: .3125em; border-radius: .3125em; } /* change color of jquery mobile page headers */ .ui-title { color:#fff; } /* center-aligned text */ .mc-text-center { text-align:center; } /* top margin for some elements */ .mc-top-margin-1-5 { margin-top:1.5em; } these are just a few cosmetic changes that will enhance the look of the app. notice that i prefixed non-jquerymobile classes with the characters “mc-” to avoid potential collisions with jquerymobile’s classes. the remaining lines in the head section of index.html are the references to the jquery and jquery mobile libraries. as i suggested earlier, make sure that yours are pointing to the correct directories in your project. let’s move on to the body section of the “index.html file”. there you will find the standard jquery mobile page template with a header and main divs: book it welcome! existing users sign in don't have an account? sign up we have decorated the “header” div with the data-theme=”c” attribute, which gives it the nice purple background color that we defined in the custom theme: in the “main” div we are using a couple of links to the sign in and sign up screens respectively. the links point to the sign-in.html and sing-up.html files that we will create in a few minutes. these links are decorated with the jquery mobile ui-btn, ui-btn-b and ui-corner-all classes, which make them look like buttons: this is all we need to do in the lading page for the moment. let’s move on to the log in screen. creating a log in screen with jquery mobile here’s the log in screen’s mockup that we built in the third part of this tutorial : we will use the log in screen to capture the user’s credentials and validate them against the application’s user accounts database. if validation succeeds, we will direct users to a “main menu” scree that we will create in an upcoming tutorial. let’s create an empty sign-in.html file in the project’s directory. in the file, we will write the following code: book it sign in email address password remember me submit can't access your account? login failed did you enter the right credentials? ok if you open this sign-in.html with a mobile browser or emulator, you will see something like this: the head section of the html document is similar to the index.html file we created a few minutes ago, with the exception of the document’s title. no need to explain much there. in the “main” section of the jquery mobile page that wee added to the file, we dropped a few controls that will allow us to capture the user’s email and password, along with a checkbox that will let us know when the user wants the app to remember their credentials: sign in email address password remember me submit can't access your account? we also added a link that will allow users to initiate the password reset process if they have problems logging in. you will also notice that the “submit” link points to the “dlg-invalid-credentials” anchor defined in the same jquery mobile page. this link is decorated with the data-rel=”popup”, data-transition=”pop” and data-position-to=”window” attributes. when we do this, we are telling jquery mobile to open the link to the element with id=”dlg-invalid-credentials” as a popup dialog, using a “pop” transition, and center the element relative to the document’s window. here’s the html for the popup: login failed did you enter the right credentials? ok notice that the “dlg-invalid-credentials” div is decorated with the data-rel=”popup” attribute, signaling to jquery mobile to apply popup styles to this div. if you click or tap the “submit” button, you will see the “invalid credentials” popup: one last thing on this screen. we have the popup linked directly to the “submit” button for testing purposes. in an upcoming part of this tutorial we will add programming logic that will activate the popup only when the login fails. for the moment we are only concerned with creating the html code for the pages and making sure that the jquery mobile enhancements work on them. creating an account locked screen with jquery mobile many business applications use an account locked feature as a measure to increase the app’s security. we will use this feature in our app, and this means that we need to create an account locked screen. the purpose of the screen is to notify the user that their account is locked. we will define under which conditions an account will be locked through programming logic that we will add in an upcoming chapter of this tutorial. let’s create an empty account-locked.html file, and drop the following code in it: back app name your account is locked please contact the helpdesk to resolve this issue. the file should look like the screenshot below when viewed with a mobile browser or emulator: the html code for this screen is very similar to that in the prior two screens, with the exception of one element that i want you to pay attention to: back app name the header section of the jquery mobile page we just created contains a link to the sign-in.html file. when we decorate it with the ui-btn-left, ui-btn, ui-btn-icon-notext, ui-corner-all and ui-icon-back classes, we are giving the link the appearance of a toolbar button, just like this: the data-rel=”back” attribute causes any taps on the anchor to mimic a “back button”, going back one history entry and ignoring the anchor’s default href. you can read more about navigation and linking on jquery mobile by visiting jquery mobile’s navigation documentation. you should also visit the jquery mobile buttons guide to learn about how to create buttons. creating a sign up page with jquery mobile when users tap on the landing screen’s “sign up” button, they will open the sign up screen. this is where we will capture the user’s personal information so we can create an account for them. remember that our mockup of the sign up screen looks like this: let’s create an empty sign-up.html file and add the following code to it: book it sign up first name last name email address password confirm password submit almost done... confirm your email address we sent you an email with instructions on how to confirm your email address. please check your inbox and follow the instructions in the email. ok the file should look as depicted below when you open it with a mobile browser or emulator: the only difference with the mockup is that we are using a “submit” button at the bottom of the screen, instead of a “done” button in the toolbar. when you examine the html code, you will find that the “submit” button is wired to the “dlg-sign-up-sent” popup: almost done... confirm your email address we sent you an email with instructions on how to confirm your email address. please check your inbox and follow the instructions in the email. ok if you tap on the button, the popup will become visible: we will use this popup to notify the users that we have sent them a message asking them to confirm their email address. the message will contain a link to a webpage where users will need to re-enter the email address used to create their account in the app. with this step we are trying to make sure that it was is a human with a valid email inbox who created the account. back in the popup’s html code, notice how the “ok” button links back to the sign in screen. you should be able to confirm that this link works when you tap the button. creating a password reset screen with jquery mobile the app’s landing screen has a “can’t access your account?” link that helps user initiate the password reset workflow of the app. the first step of this workflow is to present the “begin password reset” screen to the user. we will use this screen to capture the user’s email address. if we find this email address in the user accounts database of the app, we will email the user a provisional password. next, we will activate the “end password reset” screen, where the user will need to enter the provisional password and a new password of their choosing. the picture below illustrates this process. let’s create an empty begin-password-reset.html file in the project’s directory. we will write the following code int the file: book it password reset enter your email address submit password reset check your inbox we sent you an email with instructions on how to reset your password. please check your inbox and follow the instructions in the email. ok this is how the screen should look when viewed on a mobile browser or emulator: there is nothing new in the html code of this screen. we wired the “submit” button so when a user taps it, the embedded “dlg-pwd-reset-sent” popup will become active: we did this for testing purposes. remember that we will add the programming logic that activates these popups in upcoming chapters of this tutorial. when a user taps the popup’s “ok” button, the application will navigate to the “end password reset” screen, which we will create next. the end password reset screen this is the screen where the user will enter the provisional password we sent them via email, along with a new password of their choosing. to create this screen we will add an empty “end-password-reset.html” file to the project. here’s the code that goes in the file: book it reset password provisional password new password confirm new password submit done your password was changed. ok the screen should look like the picture below when viewed with a mobile browser or emulator: we wired the “submit” button so it activates the embedded “dlg-pwd-changed” popup: this popup simply tells the user that their password was changed. tapping the “ok” button will make the app navigate back to the sign in screen, where the user can sign in with the new password. summary and next steps this concludes our first phase of work on the user registration, login and logout screens of the jquery mobile version of the app. i will emphasize again that in this phase we are not adding programming logic to the screens. we are simply creating a jquery mobile page for each screen and making sure that the visual elements within the screens adhere to the mockups that we created in previous chapters of this tutorial, as well as to the ui patterns flowchart that i mentioned at the beginning of this article. while we’ve made significant progress with the app at this point, it’s fair to say that we are just getting started. we are still missing the programming for this article’s screens, as well as the jquery mobile pages and programming for the screens that will allow users to browse and reserve meeting rooms, which is why we created the app in the first place. in the next chapter of this tutorial we will get started with the programming of the user profile screens we just created. don’t forget to sign up for my mailing list so you can be among the first to know when i publish the next update. stay tuned don’t miss any articles. get new articles and updates sent free to your inbox.

October 13, 2014

by Jorge Ramon

· 78,838 Views · 4 Likes

Neo4j: COLLECTing Multiple Values (Too Many Parameters for Function ‘Collect’)

One of my favourite functions in Neo4j’s cypher query language is COLLECT which allows us to group items into an array for later consumption. However, I’ve noticed that people sometimes have trouble working out how to collect multiple items with COLLECT and struggle to find a way to do so. Consider the following data set: create (p:Person {name: "Mark"}) create (e1:Event {name: "Event1", timestamp: 1234}) create (e2:Event {name: "Event2", timestamp: 4567}) create (p)-[:EVENT]->(e1) create (p)-[:EVENT]->(e2) If we wanted to return each person along with a collection of the event names they’d participated in we could write the following: $ MATCH (p:Person)-[:EVENT]->(e) > RETURN p, COLLECT(e.name); +--------------------------------------------+ | p | COLLECT(e.name) | +--------------------------------------------+ | Node[0]{name:"Mark"} | ["Event1","Event2"] | +--------------------------------------------+ 1 row That works nicely, but what about if we want to collect the event name and the timestamp but don’t want to return the entire event node? An approach I’ve seen a few people try during workshops is the following: MATCH (p:Person)-[:EVENT]->(e) RETURN p, COLLECT(e.name, e.timestamp) Unfortunately this doesn’t compile: SyntaxException: Too many parameters for function 'collect' (line 2, column 11) "RETURN p, COLLECT(e.name, e.timestamp)" ^ As the error message suggests, the COLLECT function only takes one argument so we need to find another way to solve our problem. One way is to put the two values into a literal array which will result in an array of arrays as our return result: $ MATCH (p:Person)-[:EVENT]->(e) > RETURN p, COLLECT([e.name, e.timestamp]); +----------------------------------------------------------+ | p | COLLECT([e.name, e.timestamp]) | +----------------------------------------------------------+ | Node[0]{name:"Mark"} | [["Event1",1234],["Event2",4567]] | +----------------------------------------------------------+ 1 row The annoying thing about this approach is that as you add more items you’ll forget in which position you’ve put each bit of data so I think a preferable approach is to collect a map of items instead: $ MATCH (p:Person)-[:EVENT]->(e) > RETURN p, COLLECT({eventName: e.name, eventTimestamp: e.timestamp}); +--------------------------------------------------------------------------------------------------------------------------+ | p | COLLECT({eventName: e.name, eventTimestamp: e.timestamp}) | +--------------------------------------------------------------------------------------------------------------------------+ | Node[0]{name:"Mark"} | [{eventName -> "Event1", eventTimestamp -> 1234},{eventName -> "Event2", eventTimestamp -> 4567}] | +--------------------------------------------------------------------------------------------------------------------------+ 1 row During the Clojure Neo4j Hackathon that we ran earlier this week this proved to be a particularly pleasing approach as we could easily destructure the collection of maps in our Clojure code.

October 13, 2014

by Mark Needham

· 10,375 Views

Do it in Java 8: The State Monad

In a previous article (Do it in Java 8: Automatic memoization), I wrote about memoization and said that memoization is about handling state between function calls, although the value returned by the function does not change from one call to another as long as the argument is the same. I showed how this could be done automatically. There are however other cases when handling state between function calls is necessary but cannot be done this simple way. One is handling state between recursive calls. In such a situation, the function is called only once from the user point of view, but it is in fact called several times since it calls itself recursively. Most of these functions will not benefit internally from memoization. For example, the factorial function may be implemented recursively, and for one call to f(n), there will be n calls to the function, but not twice with the same value. So this function would not benefit from internal memoization. On the contrary, the Fibonacci function, if implemented according to its recursive definition, will call itself recursively a huge number of times will the same arguments. Here is the standard definition of the Fibonacci function: f(n) = f(n – 1) + f(n – 2) This definition has a major problem: calculating f(n) implies evaluating n² times the function with different values. The result is that, in practice, it is impossible to use this definition for n > 50 because the time of execution increases exponentially. But since we are calculating more than n values, it is obvious that some values are calculated several times. So this definition would be a good candidate for internal memoization. Warning: Java is not a true recursive language, so using any recursive method will eventually blow the stack, unless we use TCO (Tail Call Optimization) as described in my previous article (Do it in Java 8: recursive and corecursive Fibonacci). However, TCO can't be applied to the original Fibonacci definition because it is not tail recursive. So we will write an example working for n limited to a few thousands. The naive implementation Here is how we might implement the original definition: static BigInteger fib(BigInteger n) { return n.equals(BigInteger.ZERO) || n.equals(BigInteger.ONE) ? n : fib(n.subtract(BigInteger.ONE)).add(fib(n.subtract(BigInteger.ONE.add(BigInteger.ONE)))); } Do not try this implementation for values greater that 50. On my machine, fib(50) takes 44 minutes to return! Basic memoized version To avoid computing several times the same values, we can use a map were computed values are stored. This way, for each value, we first look into the map to see if it has already been computed. If it is present, we just retrieve it. Otherwise, we compute it, store it into the map and return it. For this, we will use a special class called Memo. This is a normal HashMap that mimic the interface of functional map, which means that insertion of a new (key, value) pair returns a map, and looking up a key returns an Optional: public class Memo extends HashMap { public Optional retrieve(BigInteger key) { return Optional.ofNullable(super.get(key)); } public Memo addEntry(BigInteger key, BigInteger value) { super.put(key, value); return this; } } Note that this is not a true functional (immutable and persistent) map, but this is not a problem since it will not be shared. We will also need a Tuple class that we define as: public class Tuple { public final A _1; public final B _2; public Tuple(A a, B b) { _1 = a; _2 = b; } } With this class, we can write the following basic implementation: static BigInteger fibMemo1(BigInteger n) { return fibMemo(n, new Memo().addEntry(BigInteger.ZERO, BigInteger.ZERO) .addEntry(BigInteger.ONE, BigInteger.ONE))._1; } static Tuple fibMemo(BigInteger n, Memo memo) { return memo.retrieve(n).map(x -> new Tuple<>(x, memo)).orElseGet(() -> { BigInteger x = fibMemo(n.subtract(BigInteger.ONE), memo)._1 .add(fibMemo(n.subtract(BigInteger.ONE).subtract(BigInteger.ONE), memo)._1); return new Tuple<>(x, memo.addEntry(n, x)); }); } This implementation works fine, provided values of n are not too big. For f(50), it returns in less that 1 millisecond, which should be compared to the 44 minutes of the naive version. Remark that we do not have to test for terminal values (f(0) and f(1). These values are simply inserted into the map at start. So, is there something we should make better? The main problem is that we have to handle the passing of the Memo map by hand. The signature of the fibMemo method is no longer fibMemo(BigInteger n), but fibMemo(BigInteger n, Memo memo). Could we simplify this? We might think about using automatic memoization as described in my previous article (Do it in Java 8: Automatic memoization ). However, this will not work: static Function fib = new Function() { @Override public BigInteger apply(BigInteger n) { return n.equals(BigInteger.ZERO) || n.equals(BigInteger.ONE) ? n : this.apply(n.subtract(BigInteger.ONE)).add(this.apply(n.subtract(BigInteger.ONE.add(BigInteger.ONE)))); } }; static Function fibm = Memoizer.memoize(fib); Beside the fact that we could not use lambdas because reference to this is not allowed, which may be worked around by using the original anonymous class syntax, the recursive call is made to the non memoized function, so it does not make things better. Using the State monad In this example, each computed value is strictly evaluated by the fibMemo method, and this is what makes the memo parameter necessary. Instead of a method returning a value, what we would need is a method returning a function that could be evaluated latter. This function would take a Memo as parameter, and this Memo instance would be necessary only at evaluation time. This is what the State monad will do. Java 8 does not provide the state monad, so we have to create it, but it is very simple. However, we first need an implementation of a list that is more functional that what Java offers. In a real case, we would use a true immutable and persistent List. The one I have written is about 1 000 lines, so I can't show it here. Instead, we will use a dummy functional list, backed by a java.util.ArrayList. Although this is less elegant, it does the same job: public class List { private java.util.List list = new ArrayList<>(); public static List empty() { return new List(); } @SafeVarargs public static List apply(T... ta) { List result = new List<>(); for (T t : ta) result.list.add(t); return result; } public List cons(T t) { List result = new List<>(); result.list.add(t); result.list.addAll(list); return result; } public U foldRight(U seed, Function> f) { U result = seed; for (int i = list.size() - 1; i >= 0; i--) { result = f.apply(list.get(i)).apply(result); } return result; } public List map(Function f) { List result = new List<>(); for (T t : list) { result.list.add(f.apply(t)); } return result; } public List filter(Function f) { List result = new List<>(); for (T t : list) { if (f.apply(t)) { result.list.add(t); } } return result; } public Optional findFirst() { return list.size() == 0 ? Optional.empty() : Optional.of(list.get(0)); } @Override public String toString() { StringBuilder s = new StringBuilder("["); for (T t : list) { s.append(t).append(", "); } return s.append("NIL]").toString(); } } The implementation is not functional, but the interface is! And although there are lots of missing capabilities, we have all we need. Now, we can write the state monad. It is often called simply State but I prefer to call it StateMonad in order to avoid confusion between the state and the monad: public class StateMonad { public final Function> runState; public StateMonad(Function> runState) { this.runState = runState; } public static StateMonad unit(A a) { return new StateMonad<>(s -> new StateTuple<>(a, s)); } public static StateMonad get() { return new StateMonad<>(s -> new StateTuple<>(s, s)); } public static StateMonad getState(Function f) { return new StateMonad<>(s -> new StateTuple<>(f.apply(s), s)); } public static StateMonad transition(Function f) { return new StateMonad<>(s -> new StateTuple<>(Nothing.instance, f.apply(s))); } public static StateMonad transition(Function f, A value) { return new StateMonad<>(s -> new StateTuple<>(value, f.apply(s))); } public static StateMonad> compose(List> fs) { return fs.foldRight(StateMonad.unit(List.empty()), f -> acc -> f.map2(acc, a -> b -> b.cons(a))); } public StateMonad flatMap(Function> f) { return new StateMonad<>(s -> { StateTuple temp = runState.apply(s); return f.apply(temp.value).runState.apply(temp.state); }); } public StateMonad map(Function f) { return flatMap(a -> StateMonad.unit(f.apply(a))); } public StateMonad map2(StateMonad sb, Function> f) { return flatMap(a -> sb.map(b -> f.apply(a).apply(b))); } public A eval(S s) { return runState.apply(s).value; } } This class is parameterized by two types: the value type A and the state type S. In our case, A will be BigInteger and S will be Memo. This class holds a function from a state to a tuple (value, state). This function is hold in the runState field. This is similar to the value hold in the Optional monad. To make it a monad, this class needs a unit method and a flatMap method. The unit method takes a value as parameter and returns a StateMonad. It could be implemented as a constructor. Here, it is a factory method. The flatMap method takes a function from A (a value) to StateMonad and return a new StateMonad. (In our case, A is the same as B.) The new type contains the new value and the new state that result from the application of the function. All other methods are convenience methods: map allows to bind a function from A to B instead of a function from A to StateMonad. It is implemented in terms of flatMap and unit. eval allows easy retrieval of the value hold by the StateMonad. getState allows creating a StateMonad from a function S -> A. transition takes a function from state to state and a value and returns a new StateMonad holding the value and the state resulting from the application of the function. In other words, it allows changing the state without changing the value. There is also another transition method taking only a function and returning a StateMonad. Nothing is a special class: public final class Nothing { public static final Nothing instance = new Nothing(); private Nothing() {} } This class could be replaced by Void, to mean that we do not care about the type. However, Void is not supposed to be instantiated, and the only reference of type Void is normally null. The problem is that null does not carry its type. We could instantiate a Void instance through introspection: Constructor constructor; constructor = Void.class.getDeclaredConstructor(); constructor.setAccessible(true); Void nothing = constructor.newInstance(); but this is really ugly, so we create a Nothing type with a single instance of it. This does the trick, although to be complete, Nothing should be able to replace any type (like null), which does not seem to be possible in Java. Using the StateMonad class, we can rewrite our program: static BigInteger fibMemo2(BigInteger n) { return fibMemo(n).eval(new Memo().addEntry(BigInteger.ZERO, BigInteger.ZERO).addEntry(BigInteger.ONE, BigInteger.ONE)); } static StateMonad fibMemo(BigInteger n) { return StateMonad.getState((Memo m) -> m.retrieve(n)) .flatMap(u -> u.map(StateMonad:: unit).orElse(fibMemo(n.subtract(BigInteger.ONE)) .flatMap(x -> fibMemo(n.subtract(BigInteger.ONE).subtract(BigInteger.ONE)) .map(x::add) .flatMap(z -> StateMonad.transition((Memo m) -> m.addEntry(n, z), z))))); } Now, the fibMemo method only takes a BigInteger as its parameter and returns a StateMonad, which means that when this method returns, nothing has been evaluated yet. The Memo doesn't even exist! To get the result, we may call the eval method, passing it the Memo instance. If you find this code difficult to understand, here is an exploded commented version using longer identifiers: static StateMonad fibMemo(BigInteger n) { /* * Create a function of type Memo -> Optional with a closure * over the n parameter. */ Function> retrieveValueFromMapIfPresent = (Memo memoizationMap) -> memoizationMap.retrieve(n); /* * Create a state from this function. */ StateMonad> initialState = StateMonad.getState(retrieveValueFromMapIfPresent); /* * Create a function for converting the value (BigInteger) into a State * Monad instance. This function will be bound to the Optional resulting * from the lookup into the map to give the result if the value was found. */ Function> createStateFromValue = StateMonad:: unit; /* * The value computation proper. This can't be easily decomposed because it * make heavy use of closures. It first calls recursively fibMemo(n - 1), * producing a StateMonad. It then flatMaps it to a new * recursive call to fibMemo(n - 2) (actually fibMemo(n - 1 - 1)) and get a * new StateMonad which is mapped to BigInteger addition * with the preceding value (x). Then it flatMaps it again with the function * y -> StateMonad.transition((Memo m) -> m.addEntry(n, z), z) which adds * the two values and returns a new StateMonad with the computed value added * to the map. */ StateMonad computedValue = fibMemo(n.subtract(BigInteger.ONE)) .flatMap(x -> fibMemo(n.subtract(BigInteger.ONE).subtract(BigInteger.ONE)) .map(x::add) .flatMap(z -> StateMonad.transition((Memo m) -> m.addEntry(n, z), z))); /* * Create a function taking an Optional as its parameter and * returning a state. This is the main function that returns the value in * the Optional if it is present and compute it and put it into the map * before returning it otherwise. */ Function, StateMonad> computeFiboValueIfAbsentFromMap = u -> u.map(createStateFromValue).orElse(computedValue); /* * Bind the computeFiboValueIfAbsentFromMap function to the initial State * and return the result. */ return initialState.flatMap(computeFiboValueIfAbsentFromMap); } The most important part is the following: StateMonad computedValue = fibMemo_(n.subtract(BigInteger.ONE)) .flatMap(x -> fibMemo_(n.subtract(BigInteger.ONE).subtract(BigInteger.ONE)) .map(x::add) .flatMap(z -> StateMonad.transition((Memo m) -> m.addEntry(n, z), z))); This kind of code is essential to functional programming, although it is sometimes replaced in other languages with “for comprehensions”. As Java 8 does not have for comprehensions we have to use this form. At this point, we have seen that using the state monad allows abstracting the handling of state. This technique can be used every time you have to handle state. More uses of the state monad The state monad may be used for many other cases were state must be maintained in a functional way. Most programs based upon maintaining state use a concept known as a State Machine. A state machine is defined by an initial state and a series of inputs. Each input submitted to the state machine will produce a new state by applying one of several possible transitions based upon a list of conditions concerning both the input and the actual state. If we take the example of a bank account, the initial state would be the initial balance of the account. Possible transition would be deposit(amount) and withdraw(amount). The conditions would be true for deposit and balance >= amount for withdraw. Given the state monad that we have implemented above, we could write a parameterized state machine: public class StateMachine { Function> function; public StateMachine(List, Transition>> transitions) { function = i -> StateMonad.transition(m -> Optional.of(new StateTuple<>(i, m)).flatMap((StateTuple t) -> transitions.filter((Tuple, Transition> x) -> x._1.test(t)).findFirst().map((Tuple, Transition> y) -> y._2.apply(t))).get()); } public StateMonad process(List inputs) { List> a = inputs.map(function); StateMonad> b = StateMonad.compose(a); return b.flatMap(x -> StateMonad.get()); } } This machine uses a bunch of helper classes. First, the inputs are represented by an interface: public interface Input { boolean isDeposit(); boolean isWithdraw(); int getAmount(); } There are two instances of inputs: public class Deposit implements Input { private final int amount; public Deposit(int amount) { super(); this.amount = amount; } @Override public boolean isDeposit() { return true; } @Override public boolean isWithdraw() { return false; } @Override public int getAmount() { return this.amount; } } public class Withdraw implements Input { private final int amount; public Withdraw(int amount) { super(); this.amount = amount; } @Override public boolean isDeposit() { return false; } @Override public boolean isWithdraw() { return true; } @Override public int getAmount() { return this.amount; } } Then come two functional interfaces for conditions and transitions: public interface Condition extends Predicate> {} public interface Transition extends Function, S> {} These act as type aliases in order to simplify the code. We could have used the predicate and the function directly. In the same manner, we use a StateTuple class instead of a normal tuple: public class StateTuple { public final A value; public final S state; public StateTuple(A a, S s) { value = Objects.requireNonNull(a); state = Objects.requireNonNull(s); } } This is exactly the same as an ordinary tuple with named members instead of numbered ones. Numbered members allows using the same class everywhere, but a specific class like this one make the code easier to read as we will see. The last utility class is Outcome, which represent the result returned by the state machine: public class Outcome { public final Integer account; public final List> operations; public Outcome(Integer account, List> operations) { super(); this.account = account; this.operations = operations; } public String toString() { return "(" + account.toString() + "," + operations.toString() + ")"; } } This again could be replaced with a Tuple>>, but using named parameters make the code easier to read. (In some functional languages, we could use type aliases for this.) Here, we use an Either class, which is another kind of monad that Java does not offer. I will not show the complete class, but only the parts that are useful for this example: public interface Either { boolean isLeft(); boolean isRight(); A getLeft(); B getRight(); static Either right(B value) { return new Right<>(value); } static Either left(A value) { return new Left<>(value); } public class Left implements Either { private final A left; private Left(A left) { super(); this.left = left; } @Override public boolean isLeft() { return true; } @Override public boolean isRight() { return false; } @Override public A getLeft() { return this.left; } @Override public B getRight() { throw new IllegalStateException("getRight() called on Left value"); } @Override public String toString() { return left.toString(); } } public class Right implements Either { private final B right; private Right(B right) { super(); this.right = right; } @Override public boolean isLeft() { return false; } @Override public boolean isRight() { return true; } @Override public A getLeft() { throw new IllegalStateException("getLeft() called on Right value"); } @Override public B getRight() { return this.right; } @Override public String toString() { return right.toString(); } } } This implementation is missing a flatMap method, but we will not need it. The Either class is somewhat like the Optional Java class in that it may be used to represent the result of an evaluation that may return a value or something else like an exception, an error message or whatever. What is important is that it can hold one of two things of different types. We now have all we need to use our state machine: public class Account { public static StateMachine createMachine() { Condition predicate1 = t -> t.value.isDeposit(); Transition transition1 = t -> new Outcome(t.state.account + t.value.getAmount(), t.state.operations.cons(Either.right(t.value.getAmount()))); Condition predicate2 = t -> t.value.isWithdraw() && t.state.account >= t.value.getAmount(); Transition transition2 = t -> new Outcome(t.state.account - t.value.getAmount(), t.state.operations.cons(Either.right(- t.value.getAmount()))); Condition predicate3 = t -> true; Transition transition3 = t -> new Outcome(t.state.account, t.state.operations.cons(Either.left(new IllegalStateException(String.format("Can't withdraw %s because balance is only %s", t.value.getAmount(), t.state.account))))); List, Transition>> transitions = List.apply( new Tuple<>(predicate1, transition1), new Tuple<>(predicate2, transition2), new Tuple<>(predicate3, transition3)); return new StateMachine<>(transitions); } } This could not be simpler. We just define each possible condition and the corresponding transition, and then build a list of tuples (Condition, Transition) that is used to instantiate the state machine. There are however to rules that must be enforced: Conditions must be put in the right order, with the more specific first and the more general last. We must be careful to be sure to match all possible cases. Otherwise, we will get an exception. At this stage, nothing has been evaluated. We did not even use the initial state! To run the state machine, we must create a list of inputs and feed it in the machine, for example: List inputs = List.apply( new Deposit(100), new Withdraw(50), new Withdraw(150), new Deposit(200), new Withdraw(150)); StateMonad = Account.createMachine().process(inputs); Again, nothing has been evaluated yet. To get the result, we just evaluate the result, using an initial state: Outcome outcome = state.eval(new Outcome(0, List.empty())) If we run the program with the list above, and call toString() on the resulting outcome (we can't do more useful things since the Either class is so minimal!) we get the following result: // // (100,[-150, 200, java.lang.IllegalStateException: Can't withdraw 150 because balance is only 50, -50, 100, NIL]) This is a tuple of the resulting balance for the account (100) and the list of operations that have been carried on. We can see that successful operations are represented by a signed integer, and failed operations are represented by an error message. This of course is a very minimal example, and as usual, one may think it would be much easier to do it the imperative way. However, think of a more complex example, like a text parser. All there is to do to adapt the state machine is to define the state representation (the Outcome class), define the possible inputs and create the list of (Condition,Transition). Going the functional way does not make the whole thing simpler. However, it allows abstracting the implementation of the state machine from the requirements. The only thing we have to do to create a new state machine is to write the new requirements!

October 9, 2014

by Pierre-Yves Saumont

· 25,943 Views · 7 Likes

R: Filtering data frames by column type ('x' must be numeric)

I’ve been working through the exercises from An Introduction to Statistical Learning and one of them required you to create a pair wise correlation matrix of variables in a data frame. The exercise uses the ‘Carseats’ data set which can be imported like so: > install.packages("ISLR") > library(ISLR) > head(Carseats) Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US 1 9.50 138 73 11 276 120 Bad 42 17 Yes Yes 2 11.22 111 48 16 260 83 Good 65 10 Yes Yes 3 10.06 113 35 10 269 80 Medium 59 12 Yes Yes 4 7.40 117 100 4 466 97 Medium 55 14 Yes Yes 5 4.15 141 64 3 340 128 Bad 38 13 Yes No 6 10.81 124 113 13 501 72 Bad 78 16 No Yes filter the categorical variables from a data frame and If we try to run the ‘cor‘ function on the data frame we’ll get the following error: > cor(Carseats) Error in cor(Carseats) : 'x' must be numeric As the error message suggests, we can’t pass non numeric variables to this function so we need to remove the categorical variables from our data frame. But first we need to work out which columns those are: > sapply(Carseats, class) Sales CompPrice Income Advertising Population Price ShelveLoc Age Education "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "factor" "numeric" "numeric" Urban US "factor" "factor" We can see a few columns of type ‘factor’ and luckily for us there’s a function which will help us identify those more easily: > sapply(Carseats, is.factor) Sales CompPrice Income Advertising Population Price ShelveLoc Age Education FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE Urban US TRUE TRUE Now we can remove those columns from our data frame and create the correlation matrix: > cor(Carseats[sapply(Carseats, function(x) !is.factor(x))]) Sales CompPrice Income Advertising Population Price Age Education Sales 1.00000000 0.06407873 0.151950979 0.269506781 0.050470984 -0.44495073 -0.231815440 -0.051955242 CompPrice 0.06407873 1.00000000 -0.080653423 -0.024198788 -0.094706516 0.58484777 -0.100238817 0.025197050 Income 0.15195098 -0.08065342 1.000000000 0.058994706 -0.007876994 -0.05669820 -0.004670094 -0.056855422 Advertising 0.26950678 -0.02419879 0.058994706 1.000000000 0.265652145 0.04453687 -0.004557497 -0.033594307 Population 0.05047098 -0.09470652 -0.007876994 0.265652145 1.000000000 -0.01214362 -0.042663355 -0.106378231 Price -0.44495073 0.58484777 -0.056698202 0.044536874 -0.012143620 1.00000000 -0.102176839 0.011746599 Age -0.23181544 -0.10023882 -0.004670094 -0.004557497 -0.042663355 -0.10217684 1.000000000 0.006488032 Education -0.05195524 0.02519705 -0.056855422 -0.033594307 -0.106378231 0.01174660 0.006488032 1.000000000 Be Sociable, Share!

October 8, 2014

by Mark Needham

· 29,065 Views