Data Engineering Resources

The Latest Data Engineering Topics

Algorithm of the Week: Quicksort - Three-way vs. Dual-pivot

It’s no news that quicksort is considered one of the most important algorithms of the century and that it is the de facto system sort for many languages, including the Arrays.sort in Java. So, what’s new about quicksort? Well, nothing except that I just now figured out (two damn years after the release of Java 7) that the quicksort implementation of Arrays.sort has been replaced with a variant called dual-pivot quicksort. This thread is not only awesome for this reason but also how humble Jon Bentley and Joshua Bloch really are. What did I do then? Just like everybody else, I wanted to implement it and do some benchmarking against some 10 million integers (random and duplicate). Oddly enough, I found the following results: Random Data Basic quicksort: 1222 ms Three-way quicksort: 1295 ms (seriously!) Dual-pivot quicksort: 1066 ms Duplicate Data Basic quicksort: 378 ms Three-way quicksort: 15 ms Dual-pivot quicksort: six ms Stupid Question One I am afraid that I am missing something in the implementation of the three-way partition. Across several runs against random inputs (of 10 million numbers), I could see that the single pivot always performs better (although the difference is less than 100 milliseconds for 10 million numbers). I understand that the whole purpose of making the three-way quicksort the default quicksort until now is that it does not give 0(n2) performance on duplicate keys, which is very evident when I run it against duplicate inputs. But is it true that, for the sake of handling duplicate data, a small penalty is taken by the three-way quicksort? Or is my implementation bad? Stupid Question Two My dual-pivot implementation (link below) does not handle duplicates well. It takes forever (0(n2)) to execute. Is there a good way to avoid this? Referring to the Arrays.sort implementation, I figured out that ascending sequences and duplicates are eliminated well before the actual sorting is done. So, as a dirty fix, if the pivots are equal I fast-forward the lowerIndex until it is different than pivot2. Is this a fair implementation? else if (pivot1==pivot2){ while (pivot1==pivot2 && lowIndex=j) break; exchange(input, i, j); } exchange(input, pivotIndex, j); return j; } } Three-way package basics.sorting.quick; import static basics.shuffle.KnuthShuffle.shuffle; import static basics.sorting.utils.SortUtils.exchange; import static basics.sorting.utils.SortUtils.less; public class QuickSort3Way { public void sort (int[] input){ //input=shuffle(input); sort (input, 0, input.length-1); } public void sort(int[] input, int lowIndex, int highIndex) { if (highIndex<=lowIndex) return; int lt=lowIndex; int gt=highIndex; int i=lowIndex+1; int pivotIndex=lowIndex; int pivotValue=input[pivotIndex]; while (i<=gt){ if (less(input[i],pivotValue)){ exchange(input, i++, lt++); } else if (less (pivotValue, input[i])){ exchange(input, i, gt--); } else{ i++; } } sort (input, lowIndex, lt-1); sort (input, gt+1, highIndex); } } Dual-pivot package basics.sorting.quick; import static basics.shuffle.KnuthShuffle.shuffle; import static basics.sorting.utils.SortUtils.exchange; import static basics.sorting.utils.SortUtils.less; public class QuickSortDualPivot { public void sort (int[] input){ //input=shuffle(input); sort (input, 0, input.length-1); } private void sort(int[] input, int lowIndex, int highIndex) { if (highIndex<=lowIndex) return; int pivot1=input[lowIndex]; int pivot2=input[highIndex]; if (pivot1>pivot2){ exchange(input, lowIndex, highIndex); pivot1=input[lowIndex]; pivot2=input[highIndex]; //sort(input, lowIndex, highIndex); } else if (pivot1==pivot2){ while (pivot1==pivot2 && lowIndex

August 14, 2013

by Arun Manivannan

· 46,428 Views

Resource Pooling, Virtualization, Fabric, and the Cloud

One of the five essential attributes of cloud computing (see The 5-3-2 Principle of Cloud Computing) is resource pooling, which is an important differentiator separating the thought process of traditional IT from that of a service-based, cloud computing approach. Resource pooling in the context of cloud computing and from a service provider’s viewpoint denotes a set of strategies and a methodical way of managing resources. For a user, resource pooling institutes an abstraction for presenting and consuming resources in a consistent and transparent fashion. This article presents these key concepts derived from resource pooling: Resource Pools Virtualization in the Context of Cloud Computing Standardization, Automation, and Optimization Fabric Cloud Closing Thoughts Resource Pools Ultimately, data center resources can be logically placed into three categories. They are: compute, networks, and storage. For many, this grouping may appear trivial. It is, however, a foundation upon which some cloud computing methodologies are developed, products designed, and solutions formulated. Compute This is a collection of all CPU capabilities. Essentially all data center servers, either for supporting or actually running a workload, are all part of this compute group. Compute pool represents the total capacity for executing code and running instances. The process to construct a compute pool is to first inventory all servers and identify virtualization candidates followed by implementing server virtualization. It is never too early to introduce a system management solution to facilitate the processes, which in my view is a strategic investment and a critical component for all cloud initiatives. Networks The physical and logical artifacts putting in place to connect resources, segment, and isolate resources from layer three and below, etc., are gathered in the network pool. Networking enables resources becoming visible and hence possibly manageable. In the age of instant gratification, networks and mobility are redefining the security and system administration boundaries, and play a direct and impactful role in user productivity and customer satisfaction. Networking in cloud computing is more than just remote access, but empowerment for a user to self-serve and consume resources anytime, anywhere, with any device. BYOD and consumerization of IT are various expressions of these concepts. Storage This has long been a very specialized and sometimes mysterious part of IT. An enterprise storage solution is frequently characterized as a high-cost item with a significant financial and contractual commitment, specialized hardware, proprietary API and software, a dependency on direct vendor support, etc. In cloud computing, storage has become even more noticeable since the ability to grow and shrink based on demands, i.e. elasticity, demands an enterprise-level, massive, reliable, and resilient storage solution at a global scale. While enterprise IT is consolidating resources and transforming the existing establishment into a cloud computing environment, how to leverage existing storage devices from various vendors and integrate them with the next generation storage solutions is among the highest priorities for modernizing a data center. Virtualization in the Context of Cloud Computing In the last decade, virtualization has proved its value and accelerated the realization of cloud computing. Then, virtualization was mainly server virtualization, which in an over-simplified statement means hosting multiple server instances with the same hardware while each instance runs transparently and in insolation, as if each consumes the entire hardware and is the only instance running. Much of the customer expectations, business needs, and methodologies has since evolved. Now, we should validate virtualization in the context of cloud computing to fully address the innovations rapidly changing how IT conducts business and delivers services. As discussed below, in the context of cloud computing, consumable resources are delivered in some virtualized form. Various virtualization layers collectively construct and form the so-called fabric. Server Virtualization The concept of server virtualization remains: running multiple server instances with the same hardware while each instance runs transparently and in isolation, as if each instance is the only instance running and consuming the entire server hardware. In addition to virtualizing and consolidating servers, server virtualization also signifies the practices of standardizing server deployment switching away from physical boxes to VMs. Server virtualization is for packaging, delivering, and consuming a compute pool. There are a few important considerations of virtualizing servers. IT needs the ability to identify and manage bare metal such that the entire resource life-cycle management from commencing to decommissioning can be standardized and automated. To fundamentally reduce the support and training cost while increasing productivity, a consistent platform with tools applicable across physical, virtual, on-premises, and off-premises deployments is essential. The last thing IT wants is one set of tools for physical resources and another for those virtualized, one set of tools for on-premises deployment and another for those deployed to a service provider, and one set of tools for development and another for deploying applications. The requirement is one methodology for all, one skill set for all, and one set of tools for all. This advantage is obvious when developing applications and deploying Windows Server 2012 R2 on premises or off premises to Windows Azure. The Active Directory security model can work across sites, System Center can manage resources deployed off premises to Windows Azure, and Visual Studio can publish applications across platforms. Windows infrastructure architecture, security, and deployment models are all directly applicable. Network Virtualization The similar idea of server virtualization applies here. Network virtualization is the ability to run multiple networks on the same network device while each network runs transparently and in isolation, as if each network is the only network running and consuming the entire network hardware. Conceptually, since each network instance is running in isolation, one tenant’s 192.168.x network is not aware of another tenant’s identical192.168.x network running with the same network device. Network virtualization provides the translation between physical network characteristics and the representation of and a resource identity in a virtualized network. Consequently, above the network virtualization layer, various tenants while running in isolation can have identical network configurations. A great example of network virtualization is Windows Azure virtual networking. At any given time, there can be multiple Windows Azure subscribers all allocating the same 192.168.x address space with an identical subnet scheme (192.168.1.x/16) for deploying VMs. Those VMs belonging to one subscriber will however not be aware of or visible to those deployed by others, despite the fact that the network configuration, IP scheme, and IP address assignments may all be identical. Network virtualization in Windows Azure isolates on subscriber from the others such that each subscriber operates as if the subscription is the only one employing a 192.168.x address space. Storage Virtualization I believe this is where the next wave of drastic cost reduction of IT post-server virtualization happens. Historically, storage has been a high cost item in any IT budget in each and every aspects including hardware, software, staffing, maintenance, SLA, etc. Since the introduction of Windows Server 2012, there is a clear direction where storage virtualization is built into OS and becoming a commodity. New capabilities like Storage Pool, Hyper-V over SMB, Scale-Out Fire Share, etc., are now part of Windows Server OS and are making storage virtualization part of server administration routines and easily manageable with tools and utilities like PowerShell, which is familiar to many IT professionals. The concept of storage virtualization remains consistent with the idea of logically separating a computing object from its hardware, in this case the storage capacity. Storage virtualization is the ability to integrate multiple and heterogeneous storage devices, aggregate the storage capacities, and present/manage as one logical storage device with a continuous storage space. JBOD is a technology to realize this concept. Standardization, Automation and Optimization Each of the three resource pools has an abstraction to logically present itself with characteristics and work patterns. A compute pool is a collection of physical (virtualization and infrastructure) hosts and VMs. A virtualization host hosts VMs that run workloads deployed by service owners and consumed by authorized users. A network pool encompasses network resources including physical devices, logical switches, address spaces, and site configurations. Network virtualization as enabled/defined in configurations can identify and translate a logical/virtual IP address into a physical one, such that tenants with the same network hardware can implement an identical network scheme without a concern. A storage pool is based on storage virtualization which is a concept of presenting an aggregated storage capacity as one continuous storage space as if provided from one logical storage device. In other words, the three resource pools are wrapped with server virtualization, network virtualization, and storage virtualization, respectively. Each virtualization presents a set of methodologies on which work patterns are derived and common practices are developed. These virtualization layers provides opportunities to standardize, automate, and optimize deployments and considerably facilitates the adoption of cloud computing. Standardization Virtualizing resources decouples the dependency between instances and the underlying hardware. This offers an opportunity to simplify and standardize the logical representation of a resource. For instance, a VM is defined and deployed with a VM template that provides a level of consistency with a standardized configuration. Automation Once VM characteristics are identified and standardized, we can now generate an instance by providing only instance-based information or information that depends on run-time, such as the VM machine name, which must be validated at run-time to prevent duplicated names. This requirement for providing only minimal information at deployment can be significantly simplify and streamline operations for automation. And with automation, resources can then be deployed, instantiated, relocated, taken off-line, brought back online, or removed rapidly and automatically based on set criteria. Standardization and automation are essential mechanisms so that workload can be scaled on demand, i.e., become elastic. Optimization Standardization provides a set of common criteria. Automation executes operations based on set criteria with volumes, consistency, and expediency. With standardization and automation, instances can be instantiated with consistency, efficiency, and predictability. In other words, resources can be operated in bulk with consistency and predictability. The next logical step is then to optimize the usage based on SLA. The presented progression is what resource pooling and virtualizations can provide and facilitate. These methodologies are now built into products and solutions. Windows Server 2012 R2 and System Center 2012 and later integrate server virtualization, network virtualization, and storage virtualization into one consistent solution platform with standardization, automation, and optimization for building and managing clouds. Fabric This is a significant abstraction in cloud computing. Fabric implies accessibility and discoverability, and denotes the ability to discover, identify, and manage a resource. Conceptually, fabric is an umbrella term encompassing all the underlying infrastructure supporting a cloud computing environment. At the same time, a fabric controller represents the system management solution which manages, i.e. owns, fabric. In cloud architecture, fabric consists of the three resource pools: compute, networks, and storage. Compute provides the computing capabilities, executes code, and runs instances. Networks glues the resources based on requirements. Storage is where VMs, configurations, data, and resources are kept. Fabric shields the physical complexities of the three resource pools presented with server virtualization, network virtualization, and storage virtualization. All operations are eventually directed by the fabric controller of a data center. Above fabric, there are logical views of consumable resources including VMs, virtual networks, and logical storage drives. By deploying VMs, configuring virtual networks, or acquiring storage, a user consumes resources. Under fabric, there are virtualization and infrastructure hosts, Active Directory, DNS, clusters, load balancers, address pools, network sites, library shares, storage arrays, topology, racks, cables, etc., all under the fabric controller’s command to collectively present and support fabric. For a service provider, building a cloud computing environment is essentially establishing a fabric controller and constructing fabric. Namely, instituting a comprehensive management solution, building the three resource pools, and integrating server virtualization, network virtualization, and storage virtualization to form fabric. From a user’s point of view, how and where a resource is physically provided is not a concern, but the accessibility, readiness, scalability, and fulfillment of SLA are. Cloud This is a well-defined term and we should not be confused with it. (see NIST SP 800-145 and the 5-3-2 Principle of Cloud Computing) We need to be very clear on: what a cloud must exhibit (the five essential attributes), how to consume it (with SaaS, PaaS, or IaaS), and the model a service is deployed in (like private cloud, public cloud, and hybrid cloud). Cloud is a concept, a state, a set of capabilities such that a business can be delivered as a service, i.e. available on demand. The architecture of a cloud computing environment is presented with three resource pools: compute, networks, and storage. Each is an abstraction provided by a virtualization layer. Server virtualization presents a compute pool with VMs that supply the computing, i.e. CPUs, and power to execute code and run instances. Network virtualization offers a network pool and is the mechanism that allows multiple tenants with identical network configurations on the same virtualization host while connecting, segmenting, isolating network traffic with virtual NICs, logical switches, address space, network sites, IP pools, etc. Storage virtualization provides a logical storage device with the capacity to appear continuous and aggregated with a pool of storage devices behind the scene. The three resource pools together constitute the fabric (of a cloud) while the three virtualization layers collectively form the abstraction, such that while the underlying physical infrastructure may be intricate, the user experience above fabric remains logical and consistent. Deploying a VM, configuring a virtual network, or acquiring storage is transparent with virtualization regardless of where the VM actually resides, how the virtual network is physically wired, or what devices in the aggregate the requested storage is provided with. Closing Thoughts Cloud is a very consumer-focused approach. It is about a customer’s ability and control based on SLA in getting resources when needed and with scale, and equally important releasing resources when no longer required. It is not about products and technologies. It is about servicing, consuming, and strengthening the bottom line.

August 12, 2013

by Yung Chou

· 10,420 Views

neo4j: Extracting a subgraph as an adjacency matrix and calculating eigenvector centrality with JBLAS

Earlier in the week I wrote a blog post showing how to calculate the eigenvector centrality of an adjacency matrix using JBLAS and the next step was to work out the eigenvector centrality of a neo4j sub graph. There were 3 steps involved in doing this: Export the neo4j sub graph as an adjacency matrix Run JBLAS over it to get eigenvector centrality scores for each node Write those scores back into neo4j I decided to make use of the Paul Revere data set from Kieran Healy’s blog post which consists of people and groups that they had membership of. The script to import the data is on my fork of the revere repository. Having imported the data the next step was to write a cypher query which would give me the people in anadjacency matrix with the number in each column/row intersection showing how many common groups that pair of people had. I thought it’d be easier to build this query incrementally so I started out writing a query which would return one row of the adjacency matrix: MATCH p1:Person, p2:Person WHERE p1.name = "Paul Revere" WITH p1, p2 MATCH p = p1-[?:MEMBER_OF]->()<-[?:MEMBER_OF]-p2 WITH p1.name AS p1, p2.name AS p2, COUNT(p) AS links ORDER BY p2 RETURN p1, COLLECT(links) AS row Here we start with Paul Revere and then find the relationships between him and every other person by way of a common group membership. We use an optional relationship since we need to include a value in each column/row of our adjacency matrix we need to return a 0 value for anyone he doesn’t intersect with. If we run that query we get back the following: +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | p1 | row | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | "Paul Revere" | [2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,3,1,1,1,1,1,1,3,3,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,3,2,1,1,2,1,2,1,1,1,1,1,0,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,2,1,1,1,1,1,1,2,1,3,1,3,2,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,0,1,0,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,1,1,1,1,1,2,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,3,1,1,2,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,1,2,1,1,1,1,1,1,1,1,3,1,1,1,1,3,1,1,1,1,0,1,2,1,1,1,1,1,1,1] | +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ As it turns outs we’ve only got to remove the WHERE clause and order everybody and we’ve get the adjacency matrix for everyone: MATCH p1:Person, p2:Person WITH p1, p2 MATCH p = p1-[?:MEMBER_OF]->()<-[?:MEMBER_OF]-p2 WITH p1.name AS p1, p2.name AS p2, COUNT(p) AS links ORDER BY p2 RETURN p1, COLLECT(links) AS row ORDER BY p1 +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | p1 | row | +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | "Abiel Ruddock" | [0,1,1,1,0,1,0,1,0,0,1,1,1,0,1,2,0,1,0,1,1,1,2,2,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,1,1,0,0,2,2,0,0,1,1,2,1,1,1,0,1,0,1,1,0,0,2,1,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,1,1,0,1,1,1,1,1,1,1,1,1,0,2,1,2,1,0,0,0,0,1,1,0,1,0,0,1,0,2,0,0,1,0,0,0,1,0,0,2,0,1,0,1,1,1,0,0,1,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,1,0,1,1,0,1,1,1,2,0,0,1,1,0,0,2,0,1,2,1,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,1,2,1,0,1,1,1,1,1,0,0,1,1,0,0,0,0,1,0,1,1,0,0,1,0,0,2,1,0,0,1,1,1,1,0,1,0,0,0,1,0,1,0,1,1,0,0,1,0,1,0,1,0,0,1,0,2,1,1,0,0,2,0,1,0,0,0,0,1,0,1,0,1,0,1,0] | | "Abraham Hunt" | [1,0,1,1,0,1,0,0,0,0,0,1,0,0,0,1,0,1,0,1,1,0,1,1,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,0,1,1,1,1,1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,1,1,0,1,0,1,1,1,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,1,0] | ... +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 254 rows 9897 ms The next step was to wire up the query results with the JBLAS code that I wrote in the previous post. I ended up with the following: public class Neo4jAdjacencyMatrixSpike { public static void main(String[] args) throws SQLException { ClientResponse response = client() .resource("http://localhost:7474/db/data/cypher") .entity(queryAsJson(), MediaType.APPLICATION_JSON) .accept(MediaType.APPLICATION_JSON) .post(ClientResponse.class); JsonNode result = response.getEntity(JsonNode.class); ArrayNode rows = (ArrayNode) result.get("data"); List principalEigenvector = JBLASSpike.getPrincipalEigenvector(new DoubleMatrix(asMatrix(rows))); List people = asPeople(rows); updatePeopleWithEigenvector(people, principalEigenvector); System.out.println(sort(people).take(10)); } private static double[][] asMatrix(ArrayNode rows) { double[][] matrix = new double[rows.size()][254]; int rowCount = 0; for (JsonNode row : rows) { ArrayNode matrixRow = (ArrayNode) row.get(2); double[] rowInMatrix = new double[254]; matrix[rowCount] = rowInMatrix; int columnCount = 0; for (JsonNode jsonNode : matrixRow) { matrix[rowCount][columnCount] = jsonNode.asInt(); columnCount++; } rowCount++; } return matrix; } // rest cut for brevity } Here we are taking the query and then converting it into an array of arrays before passing it to our JBLAS code to calculate the principal eigenvector. We then return the top 10 people: Person{name='William Cooper', eigenvector=0.172604992239612, nodeId=68}, Person{name='Nathaniel Barber', eigenvector=0.17260499223961198, nodeId=18}, Person{name='John Hoffins', eigenvector=0.17260499223961195, nodeId=118}, Person{name='Paul Revere', eigenvector=0.17171142003936804, nodeId=207}, Person{name='Caleb Davis', eigenvector=0.16383970722169897, nodeId=71}, Person{name='Caleb Hopkins', eigenvector=0.16383970722169897, nodeId=121}, Person{name='Henry Bass', eigenvector=0.16383970722169897, nodeId=21}, Person{name='Thomas Chase', eigenvector=0.16383970722169897, nodeId=54}, Person{name='William Greenleaf', eigenvector=0.16383970722169897, nodeId=104}, Person{name='Edward Proctor', eigenvector=0.15600043886738055, nodeId=201} I get back the same 10 people as Kieran Healy although they have different eigenvector values. As far as I understand the absolute value doesn’t matter, what’s more important is the relative score to other people so I think we’re ok. The final step was to write these eigenvector values back into neo4j which we can do with the following code: private static void updateNeo4jWithEigenvectors(List people) { for (Person person : people) { ObjectNode request = JsonNodeFactory.instance.objectNode(); request.put("query", "START p = node({nodeId}) SET p.eigenvectorCentrality={value}"); ObjectNode params = JsonNodeFactory.instance.objectNode(); params.put("nodeId", person.nodeId); params.put("value", person.eigenvector); request.put("params", params); client() .resource("http://localhost:7474/db/data/cypher") .entity(request, MediaType.APPLICATION_JSON) .accept(MediaType.APPLICATION_JSON) .post(ClientResponse.class); } } Now we might use that eigenvector centrality value in other queries, such as one to show who the most central/potentially influential people are in each group: MATCH g:Group<-[:MEMBER_OF]-p WITH g.name AS group, p.name AS personName, p.eigenvectorCentrality as eigen ORDER BY eigen DESC WITH group, COLLECT(personName) AS people RETURN group, HEAD(people) + [HEAD(TAIL(people))] + [HEAD(TAIL(TAIL(people)))] AS mostCentral +--------------------------------------------------------------------------+ | group | mostCentral | +--------------------------------------------------------------------------+ | "StAndrewsLodge" | ["Paul Revere","Joseph Warren","Thomas Urann"] | | "BostonCommittee" | ["William Cooper","Nathaniel Barber","John Hoffins"] | | "LoyalNine" | ["Caleb Hopkins","William Greenleaf","Caleb Davis"] | | "LondonEnemies" | ["William Cooper","Nathaniel Barber","John Hoffins"] | | "LongRoomClub" | ["Paul Revere","John Hancock","Benjamin Clarke"] | | "NorthCaucus" | ["William Cooper","Nathaniel Barber","John Hoffins"] | | "TeaParty" | ["William Cooper","Nathaniel Barber","John Hoffins"] | +--------------------------------------------------------------------------+ 7 rows 280 ms Our top ten feature frequently although it’s interesting that only one of them is in the ‘LongRoomClub’ group which perhaps indicates that people in that group are less likely to be members of the other ones. I’d be interested if anyone can think of other potential uses for eigenvector centrality once we’ve got it back in the graph. All the code described in this post is on github if you want to take it for a spin.

August 12, 2013

by Mark Needham

· 5,671 Views

A Prodecedural City in 100 Lines of Three.js

The above skyline is from "City," a simple flight-simulator by Ricardo "Mr. Doob" Cabello. "City" is a demo of the capabilities of WebGL, and is written in an impressive 100 lines of JavaScript using Three.js. In his blog post "How to Do a Procedural City in 100 Lines," Jerome Etienne walks you through the process of recreating Cabello's "City." The secret lies in creating 20,000 cubes that are given random sizes and positions and merging them together to create a city. Let's hope that this algorithm is never used for actual city-planning, though, because the buildings can randomly intersect each other and there are no logical spaces for streets! Check out Etienne's blog post here and watch the screencast introduction here:

August 9, 2013

by Allen Coin

· 8,849 Views

In-Process Caching vs. Distributed Caching

In this post I will compare the pros and cons of using an in-process cache versus a distributed cache and when you should use either one. First, let's see what each one is. As the name suggests, an in-process cache is an object cache built within the same address space as your application. The Google Guava Library provides a simple in-process cache API that is a good example. On the other hand, a distributed cache is external to your application and quite possibly deployed on multiple nodes forming a large logical cache. Memcached is a popular distributed cache. Ehcache from Terracotta is a product that can be configured to function either way. Following are some considerations that should be kept in mind when making a decision. Considerations In-Process Cache Distributed Cache Comments Consistency While using an in-process cache, your cache elements are local to a single instance of your application. Many medium-to-large applications, however, will not have a single application instance as they will most likely be load-balanced. In such a setting, you will end up with as many caches as your application instances, each having a different state resulting in inconsistency. State may however be eventually consistent as cached items time-out or are evicted from all cache instances. Distributed caches, although deployed on a cluster of multiple nodes, offer a single logical view (and state) of the cache. In most cases, an object stored in a distributed cache cluster will reside on a single node in a distributed cache cluster. By means of a hashing algorithm, the cache engine can always determine on which node a particular key-value resides. Since there is always a single state of the cache cluster, it is never inconsistent. If you are caching immutable objects, consistency ceases to be an issue. In such a case, an in-process cache is a better choice as many overheads typically associated with external distributed caches are simply not there. If your application is deployed on multiple nodes, you cache mutable objects and you want your reads to always be consistent rather than eventually consistent, a distributed cache is the way to go. Overheads This dated but very descriptive article describes how an in-process cache can negatively effect performance of an application with an embedded cache primarily due to garbage collection overheads. Your results however are heavily dependent on factors such as the size of the cache and how quickly objects are being evicted and timed-out. A distributed cache will have two major overheads that will make it slower than an in-process cache (but better than not caching at all): network latency and object serialization As described earlier, if you are looking for an always-consistent global cache state in a multi-node deployment, a distributed cache is what you are looking for (at the cost of performance that you may get from a local in-process cache). Reliability An in-process cache makes use of the same heap space as your program so one has to be careful when determining the upper limits of memory usage for the cache. If your program runs out of memory there is no easy way to recover from it. A distributed cache runs as an independent processes across multiple nodes and therefore failure of a single node does not result in a complete failure of the cache. As a result of a node failure, items that are no longer cached will make their way into surviving nodes on the next cache miss. Also in the case of distributed caches, the worst consequence of a complete cache failure should be degraded performance of the application as opposed to complete system failure. An in-process cache seems like a better option for a small and predictable number of frequently accessed, preferably immutable objects. For large, unpredictable volumes, you are better off with a distributed cache. Recommendation For a small, predictable number of preferably immutable objects that have to be read multiple times, an in-process cache is a good solution because it will perform better than a distributed cache. However, for cases in which the number of objects that can be or should be cached is unpredictable and large, and consistency of reads is a must-have, a distributed cache is perhaps a better solution even though it may not bring the same performance benefits as an in-process cache. It goes without saying that your application can use both schemes for different types of objects depending on what suits the scenario best.

August 8, 2013

by Faheem Sohail

· 107,365 Views · 8 Likes

EclipseLink MOXy and the Java API for JSON Processing - Object Model APIs

The Java API for JSON Processing (JSR-353) is the Java standard for producing and consuming JSON which was introduced as part of Java EE 7. JSR-353 includes object (DOM like) and stream (StAX like) APIs. In this post I will demonstrate the initial JSR-353 support we have added to MOXy's JSON binding in EclipseLink 2.6. You can now use MOXy to marshal to: javax.json.JsonArrayBuilder javax.json.JsonObjectBuilder And unmarshal from: javax.json.JsonStructure javax.json.JsonObject javax.json.JsonArray You can try this out today using a nightly build of EclipseLink 2.6.0: http://www.eclipse.org/eclipselink/downloads/nightly.php The JSR-353 reference implementation is available here: https://java.net/projects/jsonp/downloads/download/ri/javax.json-ri-1.0.zip Java Model Below is the simple customer model that we will use for this post. Note for this example we are only using the standard JAXB (JSR-222) annotations. Customer package blog.jsonp.moxy; import java.util.*; import javax.xml.bind.annotation.*; @XmlType(propOrder={"id", "firstName", "lastName", "phoneNumbers"}) public class Customer { private int id; private String firstName; private String lastName; private List phoneNumbers = new ArrayList(); public int getId() { return id; } public void setId(int id) { this.id = id; } public String getFirstName() { return firstName; } public void setFirstName(String firstName) { this.firstName = firstName; } @XmlElement(nillable=true) public String getLastName() { return lastName; } public void setLastName(String lastName) { this.lastName = lastName; } @XmlElement public List getPhoneNumbers() { return phoneNumbers; } } PhoneNumber package blog.jsonp.moxy; import javax.xml.bind.annotation.*; @XmlAccessorType(XmlAccessType.FIELD) public class PhoneNumber { private String type; private String number; public String getType() { return type; } public void setType(String type) { this.type = type; } public String getNumber() { return number; } public void setNumber(String number) { this.number = number; } } jaxb.properties To specify MOXy as your JAXB provider you need to include a file called jaxb.properties in the same package as your domain model with the following entry (see: Specifying EclipseLink MOXy as your JAXB Provider) javax.xml.bind.context.factory=org.eclipse.persistence.jaxb.JAXBContextFactory Marshal Demo In the demo code below we will use a combination of JSR-353 and MOXy APIs to produce JSON. JSR-353's JsonObjectBuilder and JsonArrayBuilder are used to produces instances of JsonObject and JsonArray. We can use MOXy to marshal to these builders by wrapping them in instances of MOXy's JsonObjectBuilderResult and JsonArrayBuilderResult. package blog.jsonp.moxy; import java.util.*; import javax.json.*; import javax.json.stream.JsonGenerator; import javax.xml.bind.*; import org.eclipse.persistence.jaxb.JAXBContextProperties; import org.eclipse.persistence.oxm.json.*; public class MarshalDemo { public static void main(String[] args) throws Exception { // Create the EclipseLink JAXB (MOXy) Marshaller Map jaxbProperties = new HashMap(2); jaxbProperties.put(JAXBContextProperties.MEDIA_TYPE, "application/json"); jaxbProperties.put(JAXBContextProperties.JSON_INCLUDE_ROOT, false); JAXBContext jc = JAXBContext.newInstance(new Class[] {Customer.class}, jaxbProperties); Marshaller marshaller = jc.createMarshaller(); // Create the JsonArrayBuilder JsonArrayBuilder customersArrayBuilder = Json.createArrayBuilder(); // Build the First Customer Customer customer = new Customer(); customer.setId(1); customer.setFirstName("Jane"); customer.setLastName(null); PhoneNumber phoneNumber = new PhoneNumber(); phoneNumber.setType("cell"); phoneNumber.setNumber("555-1111"); customer.getPhoneNumbers().add(phoneNumber); // Marshal the First Customer Object into the JsonArray JsonArrayBuilderResult result = new JsonArrayBuilderResult(customersArrayBuilder); marshaller.marshal(customer, result); // Build List of PhoneNumer Objects for Second Customer List phoneNumbers = new ArrayList(2); PhoneNumber workPhone = new PhoneNumber(); workPhone.setType("work"); workPhone.setNumber("555-2222"); phoneNumbers.add(workPhone); PhoneNumber homePhone = new PhoneNumber(); homePhone.setType("home"); homePhone.setNumber("555-3333"); phoneNumbers.add(homePhone); // Marshal the List of PhoneNumber Objects JsonArrayBuilderResult arrayBuilderResult = new JsonArrayBuilderResult(); marshaller.marshal(phoneNumbers, arrayBuilderResult); customersArrayBuilder // Use JSR-353 APIs for Second Customer's Data .add(Json.createObjectBuilder() .add("id", 2) .add("firstName", "Bob") .addNull("lastName") // Included Marshalled PhoneNumber Objects .add("phoneNumbers", arrayBuilderResult.getJsonArrayBuilder()) ) .build(); // Write JSON to System.out Map jsonProperties = new HashMap(1); jsonProperties.put(JsonGenerator.PRETTY_PRINTING, true); JsonWriterFactory writerFactory = Json.createWriterFactory(jsonProperties); JsonWriter writer = writerFactory.createWriter(System.out); writer.writeArray(customersArrayBuilder.build()); writer.close(); } } Highlighted lines: 36, 37, 38, 54, 55, 64 Output Below is the output from running the marshal demo (MarshalDemo). The highlighted portions (lines 2-12 and 18-25) correspond to the portions that were populated from our Java model. [ { "id":1, "firstName":"Jane", "lastName":null, "phoneNumbers":[ { "type":"cell", "number":"555-1111" } ] }, { "id":2, "firstName":"Bob", "lastName":null, "phoneNumbers":[ { "type":"work", "number":"555-2222" }, { "type":"home", "number":"555-3333" } ] } ] Highlighted lines: 2-12, 18-25 Unmarshal Demo MOXy enables you to unmarshal from a JSR-353 JsonStructure (JsonObject or JsonArray). To do this simply wrap the JsonStructure in an instance of MOXy's JsonStructureSource and use one of the unmarshal operations that takes an instance of Source. package blog.jsonp.moxy; import java.io.FileInputStream; import java.util.*; import javax.json.*; import javax.xml.bind.*; import org.eclipse.persistence.jaxb.JAXBContextProperties; import org.eclipse.persistence.oxm.json.JsonStructureSource; public class UnmarshalDemo { public static void main(String[] args) throws Exception { try (FileInputStream is = new FileInputStream("src/blog/jsonp/moxy/input.json")) { // Create the EclipseLink JAXB (MOXy) Unmarshaller Map jaxbProperties = new HashMap(2); jaxbProperties.put(JAXBContextProperties.MEDIA_TYPE, "application/json"); jaxbProperties.put(JAXBContextProperties.JSON_INCLUDE_ROOT, false); JAXBContext jc = JAXBContext.newInstance(new Class[] {Customer.class}, jaxbProperties); Unmarshaller unmarshaller = jc.createUnmarshaller(); // Parse the JSON JsonReader jsonReader = Json.createReader(is); // Unmarshal Root Level JsonArray JsonArray customersArray = jsonReader.readArray(); JsonStructureSource arraySource = new JsonStructureSource(customersArray); List customers = (List) unmarshaller.unmarshal(arraySource, Customer.class) .getValue(); for(Customer customer : customers) { System.out.println(customer.getFirstName()); } // Unmarshal Nested JsonObject JsonObject customerObject = customersArray.getJsonObject(1); JsonStructureSource objectSource = new JsonStructureSource(customerObject); Customer customer = unmarshaller.unmarshal(objectSource, Customer.class) .getValue(); for(PhoneNumber phoneNumber : customer.getPhoneNumbers()) { System.out.println(phoneNumber.getNumber()); } } } } Highlighted lines: 27-30, 37-39 Input (input.json) The following JSON input will be converted to a JsonArray using a JsonReader. [ { "id":1, "firstName":"Jane", "lastName":null, "phoneNumbers":[ { "type":"cell", "number":"555-1111" } ] }, { "id":2, "firstName":"Bob", "lastName":null, "phoneNumbers":[ { "type":"work", "number":"555-2222" }, { "type":"home", "number":"555-3333" } ] } ] Highlighted lines: 4, 15, 20, 24 Output Below is the output from running the unmarshal demo (UnmarshalDemo). Jane Bob 555-2222 555-3333

August 7, 2013

by Blaise Doughan

· 13,768 Views

NoSQL with JPA

EclipseLink, reference implementation of JPA, has JPA support for NoSQL databases (MongoDB and Oracle NoSQL) as of the version 2.4. In this tutorial we will discuss the use of MongoDB database with the JPA support of EclipseLink. The transaction previously done using the console and native java driver will be done in a web application with the help of EclipseLink. Tools and technologies used in the sample application are as follows: MongoDB version 2.4.1 MongoDB Java Driver version 2.11.1 JSF version 2.2 PrimeFaces version 3.5 EclipseLink version 2.4 Jetty 7.x Maven Plugin JDK version 1.7 Maven 3.0.4 Project Dependencies org.glassfish javax.faces 2.2.0-SNAPSHOT org.primefaces primefaces 3.5 org.primefaces.themes bootstrap 1.0.10 org.eclipse.persistence org.eclipse.persistence.jpa 2.4.0-SNAPSHOT org.eclipse.persistence org.eclipse.persistence.nosql 2.4.0-SNAPSHOT jboss jboss-j2ee 4.2.2.GA org.mongodb mongo-java-driver 2.11.1 commons-fileupload commons-fileupload 1.3 Entity Class @Entity @NoSql(dataFormat=DataFormatType.MAPPED) public class Article implements Serializable { public Article() { } @Id @GeneratedValue @Field(name="_id") private String id; @ElementCollection private List categoryLists = new ArrayList(); @Basic private String title; @Basic private String content; @Basic @Temporal(javax.persistence.TemporalType.DATE) private Date date; @Basic private String author; @ElementCollection private List tagLists = new ArrayList(); @NoSQL notation sets the data format and type and maps the NoSQL data. Because of using MongoDB in our sample application and documents in MongoDB stored in BSON format, MAP is used as data type. @ElementCollection notation maps the embedded collection into the parent document. Because more than one category and tag associated with an article would be a matter in our sample application, we map them as an element collection. Embedded Objects @Embeddable @NoSql(dataFormat=DataFormatType.MAPPED) public class Categories implements Serializable { @Basic private String category; @Embeddable @NoSql(dataFormat=DataFormatType.MAPPED) public class Tags implements Serializable { @Basic private String tag; We see @Embeddable notation at the top of the Categories and Tags’ class unlike Article entity class. The documents stored in the parent document are mapped with this notation. Please note that embedded objects do not need unique field. persistence.xml com.kodcu.entity.Article com.com.kodcu.entity.Categories com.kodcu.entity.Tags CRUD Operations index.xhtml MyBean.java public void saveArticle() { em.getTransaction().begin(); if(null == article.getId()) em.persist(article); else em.merge(article); em.getTransaction().commit(); } public void removeArticle() { em.getTransaction().begin(); em.remove(selectArticle); em.getTransaction().commit(); } 6. Demo Application Real content above and the demo application, can be accessed at NoSQL with JPA

August 6, 2013

by Hüseyin Akdoğan

CORE

· 31,647 Views · 1 Like

Getting started with CQEngine: LINQ for Java, Only Faster

CQEngine or collection query engine is a library that allows you to build indices over java collections and query them for objects using exposed properties. It offers similar capability to LINQ in .net but is thought to be faster because it builds indices over collections before querying them and uses set theory instead of iterations. In this post we will see how to query a simple collection of objects, in our example, a collection of users of a hypothetical system, using CQEngine. In a subsequent post, we will also see how iteratively searching a collection compares to querying via CQEngine. The first step is to get the CQEengine jar file. Download the jar from the CQEngine website or if you are using Maven, add the following dependency. com.googlecode.cqengine cqengine 1.0.3 Next, lets create the Class whose object we will be searching for: package co.syntx.examples.cqengine; import com.googlecode.cqengine.attribute.Attribute; import com.googlecode.cqengine.attribute.SimpleAttribute; public class User { private String username; private String password; private String fullname; private Role role; public User(String username, String password, String fullname, Role role) { super(); this.username = username; this.password = password; this.fullname = fullname; this.role = role; } public static final Attribute FULL_NAME = new SimpleAttribute("fullname") { public String getValue(User user) { return user.fullname; } }; public static final Attribute USERNAME = new SimpleAttribute("username") { public String getValue(User user) { return user.username; } }; public String getUsername() { return username; } public void setUsername(String username) { this.username = username; } public String getPassword() { return password; } public void setPassword(String password) { this.password = password; } public String getFullname() { return fullname; } public void setFullname(String fullname) { this.fullname = fullname; } public Role getRole() { return role; } public void setRole(Role role) { this.role = role; } } Next, we write a class, to perform our searches. I will go function by function. 1. Function to Build a Test Indexed Collection: In the following function, we build an indexed collection, define indices on attributes, and populate this collection with a certain number of objects. In actual usage, your collection will probably be filled with objects being returned from the DB, read from a file or other similar scenarios. public void buildIndexedCollection(int size) throws Exception { indexedUsers = CQEngine.newInstance(); indexedUsers.addIndex(HashIndex.onAttribute(User.FULL_NAME)); indexedUsers.addIndex(SuffixTreeIndex.onAttribute(User.FULL_NAME)); for (int i = 0; i < size; i++) { String username = RandomStringGenerator.generateRandomString(8,RandomStringGenerator.Mode.ALPHANUMERIC); String password = RandomStringGenerator.generateRandomString(8,RandomStringGenerator.Mode.ALPHANUMERIC); String fullname = RandomStringGenerator.generateRandomString(5,RandomStringGenerator.Mode.ALPHA) + " " + RandomStringGenerator.generateRandomString(5,RandomStringGenerator.Mode.ALPHA); Role role = new Role(); role.setName("admin"); indexedUsers.add(new User(username, password, fullname, role)); } } In line 3 we are initializing a new Indexed Collection, a reference of which is stored in the class variable indexedUsers. The reference is of type IndexedCollection In lines 4 and 5, we define two indices i) a Hash Index suitable for equal style queries. ii) a Suffix Index suitable for ends with style queried. For the purpose of this example, we are building indices only on the the Full name field. In line 8, we use a random string generator to populate dummy objects. In line 14 we add our object to our indexed collection. 2. Function to Perform Indexed Search for Exact Matches: In this function, we are querying for names that exactly match a given name. The equal function takes in the attribute upon which to perform the query and the value to search. The method equal is statically imported via a import static com.googlecode.cqengine.query.QueryFactory.*; In the example below, we are looping the results returned by the retrieve method and not doing anything with it. In your case, you may choose to return the Iterator returned by retrieve. public void indexedSearchForEquals(String fullname) throws Exception { Query query = equal(User.FULL_NAME, fullname); for (User user : indexedUsers.retrieve(query)) { // System.out.println(user.getFullname()); } } 3. Function to Perform Indexed Search for Ends With Matches: In this function, we are querying for names that end with a certain suffix. public void indexedSearchForEndsWith(String endswith) throws Exception { Query query1 = endsWith(User.FULL_NAME, endswith); for (User user : indexedUsers.retrieve(query1)) { // System.out.println(user.getFullname()); } } 4. Function to Perform Indexed Search for Equals or Ends With Matches: This function is a combination of both queries mentioned below and has an or relation between them. public void indexedSearchForEqualOrEndsWith(String equals, String ends) throws Exception { Query query = or(equal(User.FULL_NAME, equals),endsWith(User.FULL_NAME, ends)); for (User user : indexedUsers.retrieve(query)) { // System.out.println(user.getFullname()); } } 5. Putting it together: All the functions above belong to a class called CQEngineTest. We create a new object, build a test collection, then search for either exact matches, strings that end with a certain suffix or either. CQEngineTest test = new CQEngineTest(); test.buildIndexedCollection(size); test.indexedSearchForEqualOrEndsWith("test", "test"); In this example, we have used a Hash Index and a Suffix Tree Index. There are many other types of indices that you can choose depending on the type of query that you want to perform. A list of these indices and when to use them can be found on the cqengine project page. Also, apart from equal or endswith operations, there are others that you would typically expect to find. In a subsequent post we will also see how an indexed search compares with a typical iterative search in terms of times.

August 5, 2013

by Faheem Sohail

· 28,075 Views · 1 Like

JPA Searching Using Lucene - A Working Example with Spring and DBUnit

Working Example on Github There's a small, self contained mavenised example project over on Github to accompany this post - check it out here:https://github.com/adrianmilne/jpa-lucene-spring-demo Running the Demo See the README file over on GitHub for details of running the demo. Essentially - it's just running the Unit Tests, with the usual maven build and test results output to the console - example below. This is the result of running the DBUnit test, which inserts Book data into the HSQL database using JPA, and then uses Lucene to query the data, testing that the expected Books are returned (i.e. only those int he SCI-FI category, containing the word 'Space', and ensuring that any with 'Space' in the title appear before those with 'Space' only in the description. The Book Entity Our simple example stores Books. The Book entity class below is a standard JPA Entity with a few additional annotations to identify it to Lucene: @Indexed - this identifies that the class will be added to the Lucene index. You can define a specific index by adding the 'index' attribute to the annotation. We're just choosing the simplest, minimal configuration for this example. In addition to this - you also need to specify which properties on the entity are to be indexed, and how they are to be indexed. For our example we are again going for the default option by just adding an @Field annotation with no extra parameters. We are adding one other annotation to the 'title' field - @Boost - this is just telling Lucene to give more weight to search term matches that appear in this field (than the same term appearing in the description field). This example is purposefully kept minimal in terms of the ins-and-outs of Lucene (I may cover that in a later post) - we're really just concentrating on the integration with JPA and Spring for now. package com.cor.demo.jpa.entity; import javax.persistence.Entity; import javax.persistence.EnumType; import javax.persistence.Enumerated; import javax.persistence.GeneratedValue; import javax.persistence.Id; import javax.persistence.Lob; import org.hibernate.search.annotations.Boost; import org.hibernate.search.annotations.Field; import org.hibernate.search.annotations.Indexed; /** * Book JPA Entity. */ @Entity @Indexed public class Book { @Id @GeneratedValue private Long id; @Field @Boost(value = 1.5f) private String title; @Field @Lob private String description; @Field @Enumerated(EnumType.STRING) private BookCategory category; public Book(){ } public Book(String title, BookCategory category, String description){ this.title = title; this.category = category; this.description = description; } public Long getId() { return id; } public void setId(Long id) { this.id = id; } public String getTitle() { return title; } public void setTitle(String title) { this.title = title; } public BookCategory getCategory() { return category; } public void setCategory(BookCategory category) { this.category = category; } public String getDescription() { return description; } public void setDescription(String description) { this.description = description; } @Override public String toString() { return "Book [id=" + id + ", title=" + title + ", description=" + description + ", category=" + category + "]"; } } The Book Manager The BookManager class acts as a simple service layer for the Book operations - used for adding books and searching books. As you can see, the JPA database resources are autowired in by Spring from the application-context.xml. We are just using an in-memory hsql database in this example. package com.cor.demo.jpa.manager; import java.util.List; import javax.persistence.EntityManager; import javax.persistence.PersistenceContext; import javax.persistence.PersistenceContextType; import javax.persistence.Query; import org.hibernate.search.jpa.FullTextEntityManager; import org.hibernate.search.jpa.Search; import org.hibernate.search.query.dsl.QueryBuilder; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.springframework.context.annotation.Scope; import org.springframework.stereotype.Component; import org.springframework.transaction.annotation.Transactional; import com.cor.demo.jpa.entity.Book; import com.cor.demo.jpa.entity.BookCategory; /** * Manager for persisting and searching on Books. Uses JPA and Lucene. */ @Component @Scope(value = "singleton") public class BookManager { /** Logger. */ private static Logger LOG = LoggerFactory.getLogger(BookManager.class); /** JPA Persistence Unit. */ @PersistenceContext(type = PersistenceContextType.EXTENDED, name = "booksPU") private EntityManager em; /** Hibernate Full Text Entity Manager. */ private FullTextEntityManager ftem; /** * Method to manually update the Full Text Index. This is not required if inserting entities * using this Manager as they will automatically be indexed. Useful though if you need to index * data inserted using a different method (e.g. pre-existing data, or test data inserted via * scripts or DbUnit). */ public void updateFullTextIndex() throws Exception { LOG.info("Updating Index"); getFullTextEntityManager().createIndexer().startAndWait(); } /** * Add a Book to the Database. */ @Transactional public Book addBook(Book book) { LOG.info("Adding Book : " + book); em.persist(book); return book; } /** * Delete All Books. */ @SuppressWarnings("unchecked") @Transactional public void deleteAllBooks() { LOG.info("Delete All Books"); Query allBooks = em.createQuery("select b from Book b"); List books = allBooks.getResultList(); // We need to delete individually (rather than a bulk delete) to ensure they are removed // from the Lucene index correctly for (Book b : books) { em.remove(b); } } @SuppressWarnings("unchecked") @Transactional public void listAllBooks() { LOG.info("List All Books"); LOG.info("------------------------------------------"); Query allBooks = em.createQuery("select b from Book b"); List books = allBooks.getResultList(); for (Book b : books) { LOG.info(b.toString()); getFullTextEntityManager().index(b); } } /** * Search for a Book. */ @SuppressWarnings("unchecked") @Transactional public List search(BookCategory category, String searchString) { LOG.info("------------------------------------------"); LOG.info("Searching Books in category '" + category + "' for phrase '" + searchString + "'"); // Create a Query Builder QueryBuilder qb = getFullTextEntityManager().getSearchFactory().buildQueryBuilder().forEntity(Book.class).get(); // Create a Lucene Full Text Query org.apache.lucene.search.Query luceneQuery = qb.bool() .must(qb.keyword().onFields("title", "description").matching(searchString).createQuery()) .must(qb.keyword().onField("category").matching(category).createQuery()).createQuery(); Query fullTextQuery = getFullTextEntityManager().createFullTextQuery(luceneQuery, Book.class); // Run Query and print out results to console List result = (List) fullTextQuery.getResultList(); // Log the Results LOG.info("Found Matching Books :" + result.size()); for (Book b : result) { LOG.info(" - " + b); } return result; } /** * Convenience method to get Full Test Entity Manager. Protected scope to assist mocking in Unit * Tests. * @return Full Text Entity Manager. */ protected FullTextEntityManager getFullTextEntityManager() { if (ftem == null) { ftem = Search.getFullTextEntityManager(em); } return ftem; } /** * Get the JPA Entity Manager (required for the DBUnit Tests). * @return Entity manager */ protected EntityManager getEntityManager() { return em; } /** * Sets the JPA Entity Manager (required to assist with mocking in Unit Test) * @param em EntityManager */ protected void setEntityManager(EntityManager em) { this.em = em; } } application-context.xml This is the Spring configuration file. You can see in the JPA Entity Manager configuration the key for 'hibernate.search.default.indexBase' is added to the jpaPropertyMap to tell Lucene where to create the index. We have also externalised the database login credentials to a properties file (as you may wish to change these for different environments), for example by updating the propertyConfigurer to look for and use a different external properties if it finds one on the file system). classpath:/system.properties Testing Using DBUnit In the project is an example of using DBUnit with Spring to test adding and searching against the database using DBUnit to populate the database with test data, exercise the Book Manager search operations and then clean the database down. This is a great way to test database functionality and can be easily integrated into maven and continuous build environments. Because DBUnit bypasses the standard JPA insertion calls - the data does not get automatically added to the Lucene index. We have a method exposed on the service interface to update the Full Text index 'updateFullTextIndex()' - calling this causes Lucene to update the index with the current data in the database. This can be useful when you are adding search to pre-populated databases to index the existing content. package com.cor.demo.jpa.manager; import java.io.InputStream; import java.util.List; import org.dbunit.DBTestCase; import org.dbunit.database.DatabaseConnection; import org.dbunit.database.IDatabaseConnection; import org.dbunit.dataset.IDataSet; import org.dbunit.dataset.xml.FlatXmlDataSetBuilder; import org.dbunit.operation.DatabaseOperation; import org.hibernate.impl.SessionImpl; import org.junit.After; import org.junit.Before; import org.junit.Test; import org.junit.runner.RunWith; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.springframework.beans.factory.annotation.Autowired; import org.springframework.test.context.ContextConfiguration; import org.springframework.test.context.junit4.SpringJUnit4ClassRunner; import com.cor.demo.jpa.entity.Book; import com.cor.demo.jpa.entity.BookCategory; /** * DBUnit Test - loads data defined in 'test-data-set.xml' into the database to run tests against the * BookManager. More thorough (and ultimately easier in this context) than using mocks. */ @RunWith(SpringJUnit4ClassRunner.class) @ContextConfiguration(locations = { "classpath:/application-context.xml" }) public class BookManagerDBUnitTest extends DBTestCase { /** Logger. */ private static Logger LOG = LoggerFactory.getLogger(BookManagerDBUnitTest.class); /** Book Manager Under Test. */ @Autowired private BookManager bookManager; @Before public void setup() throws Exception { DatabaseOperation.CLEAN_INSERT.execute(getDatabaseConnection(), getDataSet()); } @After public void tearDown() { deleteBooks(); } @Override protected IDataSet getDataSet() throws Exception { InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("test-data-set.xml"); FlatXmlDataSetBuilder builder = new FlatXmlDataSetBuilder(); return builder.build(inputStream); } /** * Get the underlying database connection from the JPA Entity Manager (DBUnit needs this connection). * @return Database Connection * @throws Exception */ private IDatabaseConnection getDatabaseConnection() throws Exception { return new DatabaseConnection(((SessionImpl) (bookManager.getEntityManager().getDelegate())).connection()); } /** * Tests the expected results for searching for 'Space' in SCF-FI books. */ @Test public void testSciFiBookSearch() throws Exception { bookManager.listAllBooks(); bookManager.updateFullTextIndex(); List results = bookManager.search(BookCategory.SCIFI, "Space"); assertEquals("Expected 2 results for SCI FI search for 'Space'", 2, results.size()); assertEquals("Expected 1st result to be '2001: A Space Oddysey'", "2001: A Space Oddysey", results.get(0).getTitle()); assertEquals("Expected 2nd result to be 'Apollo 13'", "Apollo 13", results.get(1).getTitle()); } private void deleteBooks() { LOG.info("Deleting Books...-"); bookManager.deleteAllBooks(); } } The source data for the test is defined in an xml file.

August 5, 2013

by Adrian Milne

· 31,307 Views

What Is NoSQL?

Dan McCreary and Ann Kelly, authors of 'Making Sense of NoSQL,' discuss the business drivers and motivations that make NoSQL so popular to organizations today.

August 1, 2013

by Eric Gregory

· 21,269 Views · 4 Likes

Jersey Client: Testing External Calls

Jim and I have been doing a bit of work over the last week which involved calling neo4j’s HA status URI to check whether or not an instance was a master/slave and we’ve been using jersey-client. The code looked roughly like this: class Neo4jInstance { private Client httpClient; private URI hostname; public Neo4jInstance(Client httpClient, URI hostname) { this.httpClient = httpClient; this.hostname = hostname; } public Boolean isSlave() { String slaveURI = hostname.toString() + ":7474/db/manage/server/ha/slave"; ClientResponse response = httpClient.resource(slaveURI).accept(TEXT_PLAIN).get(ClientResponse.class); return Boolean.parseBoolean(response.getEntity(String.class)); } } While writing some tests against this code we wanted to stub out the actual calls to the HA slave URI so we could simulate both conditions and a brief search suggested that mockito was the way to go. We ended up with a test that looked like this: @Test public void shouldIndicateInstanceIsSlave() { Client client = mock( Client.class ); WebResource webResource = mock( WebResource.class ); WebResource.Builder builder = mock( WebResource.Builder.class ); ClientResponse clientResponse = mock( ClientResponse.class ); when( builder.get( ClientResponse.class ) ).thenReturn( clientResponse ); when( clientResponse.getEntity( String.class ) ).thenReturn( "true" ); when( webResource.accept( anyString() ) ).thenReturn( builder ); when( client.resource( anyString() ) ).thenReturn( webResource ); Boolean isSlave = new Neo4jInstance(client, URI.create("http://localhost")).isSlave(); assertTrue(isSlave); } which is pretty gnarly but does the job. I thought there must be a better way so I continued searching and eventually came across this post on the mailing list which suggested creating a custom ClientHandler and stubbing out requests/responses there. I had a go at doing that and wrapped it with a little DSL that only covers our very specific use case: private static ClientBuilder client() { return new ClientBuilder(); } static class ClientBuilder { private String uri; private int statusCode; private String content; public ClientBuilder requestFor(String uri) { this.uri = uri; return this; } public ClientBuilder returns(int statusCode) { this.statusCode = statusCode; return this; } public Client create() { return new Client() { public ClientResponse handle(ClientRequest request) throws ClientHandlerException { if (request.getURI().toString().equals(uri)) { InBoundHeaders headers = new InBoundHeaders(); headers.put("Content-Type", asList("text/plain")); return createDummyResponse(headers); } throw new RuntimeException("No stub defined for " + request.getURI()); } }; } private ClientResponse createDummyResponse(InBoundHeaders headers) { return new ClientResponse(statusCode, headers, new ByteArrayInputStream(content.getBytes()), messageBodyWorkers()); } private MessageBodyWorkers messageBodyWorkers() { return new MessageBodyWorkers() { public Map> getReaders(MediaType mediaType) { return null; } public Map> getWriters(MediaType mediaType) { return null; } public String readersToString(Map> mediaTypeListMap) { return null; } public String writersToString(Map> mediaTypeListMap) { return null; } public MessageBodyReader getMessageBodyReader(Class tClass, Type type, Annotation[] annotations, MediaType mediaType) { return (MessageBodyReader) new StringProvider(); } public MessageBodyWriter getMessageBodyWriter(Class tClass, Type type, Annotation[] annotations, MediaType mediaType) { return null; } public List getMessageBodyWriterMediaTypes(Class tClass, Type type, Annotation[] annotations) { return null; } public MediaType getMessageBodyWriterMediaType(Class tClass, Type type, Annotation[] annotations, List mediaTypes) { return null; } }; } public ClientBuilder content(String content) { this.content = content; return this; } } If we change our test to use this code it now looks like this: @Test public void shouldIndicateInstanceIsSlave() { Client client = client().requestFor("http://localhost:7474/db/manage/server/ha/slave"). returns(200). content("true"). create(); Boolean isSlave = new Neo4jInstance(client, URI.create("http://localhost")).isSlave(); assertTrue(isSlave); } Is there a better way? In Ruby I’ve used WebMock to achieve this and Ashok pointed me towards WebStub which looks nice except I’d need to pass in the hostname + port rather than constructing that in the code.

August 1, 2013

by Mark Needham

· 10,799 Views

AWS: Attaching an EBS volume on an EC2 instance and making it available for use

I recently wanted to attach an EBS volume to an existing EC2 instance that I had running and since it was for a one off tasks (famous last words) I decided to configure it manually. I created the EBS volume through the AWS console and one thing that initially caught me out is that the EC2 instance and EBS volume need to be in the same region and zone. Therefore if I create my EC2 instance in ‘eu-west-1b’ then I need to create my EBS volume in ‘eu-west-1b’ as well otherwise I won’t be able to attach it to that instance. I attached the device as /dev/sdf although the UI gives the following warning: Linux Devices: /dev/sdf through /dev/sdp Note: Newer linux kernels may rename your devices to /dev/xvdf through /dev/xvdp internally, even when the device name entered here (and shown in the details) is /dev/sdf through /dev/sdp. After attaching the EBS volume to the EC2 instance my next step was to SSH onto my EC2 instance and make the EBS volume available. The first step is to create a file system on the volume: $ sudo mkfs -t ext3 /dev/sdf mke2fs 1.42 (29-Nov-2011) Could not stat /dev/sdf --- No such file or directory The device apparently does not exist; did you specify it correctly? It turns out that warning was handy and the device has in fact been renamed. We can confirm this by callingfdisk: $ sudo fdisk -l Disk /dev/xvda1: 8589 MB, 8589934592 bytes 255 heads, 63 sectors/track, 1044 cylinders, total 16777216 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x00000000 Disk /dev/xvda1 doesn't contain a valid partition table Disk /dev/xvdf: 53.7 GB, 53687091200 bytes 255 heads, 63 sectors/track, 6527 cylinders, total 104857600 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x00000000 Disk /dev/xvdf doesn't contain a valid partition table /dev/xvdf is the one we’re interested in so I re-ran the previous command: $ sudo mkfs -t ext3 /dev/xvdf mke2fs 1.42 (29-Nov-2011) Filesystem label= OS type: Linux Block size=4096 (log=2) Fragment size=4096 (log=2) Stride=0 blocks, Stripe width=0 blocks 3276800 inodes, 13107200 blocks 655360 blocks (5.00%) reserved for the super user First data block=0 Maximum filesystem blocks=4294967296 400 block groups 32768 blocks per group, 32768 fragments per group 8192 inodes per group Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424 Allocating group tables: done Writing inode tables: done Creating journal (32768 blocks): done Writing superblocks and filesystem accounting information: done Once I’d done that I needed to create a mount point for the volume and I thought the best place was probably a directory under /mnt: $ sudo mkdir /mnt/ebs The final step is to mount the volume: $ sudo mount /dev/xvdf /mnt/ebs And if we run df we can see that it’s ready to go: $ df -h Filesystem Size Used Avail Use% Mounted on /dev/xvda1 7.9G 883M 6.7G 12% / udev 288M 8.0K 288M 1% /dev tmpfs 119M 164K 118M 1% /run none 5.0M 0 5.0M 0% /run/lock none 296M 0 296M 0% /run/shm /dev/xvdf 50G 180M 47G 1% /mnt/ebs

July 31, 2013

by Mark Needham

· 11,953 Views

OLAP Operation in R

OLAP (Online Analytical Processing) is a very common way to analyze raw transaction data by aggregating along different combinations of dimensions. This is a well-established field in Business Intelligence / Reporting. In this post, I will highlight the key ideas in OLAP operation and illustrate how to do this in R. Facts and Dimensions The core part of OLAP is a so-called "multi-dimensional data model", which contains two types of tables; "Fact" table and "Dimension" table A Fact table contains records each describe an instance of a transaction. Each transaction records contains categorical attributes (which describes contextual aspects of the transaction, such as space, time, user) as well as numeric attributes (called "measures" which describes quantitative aspects of the transaction, such as no of items sold, dollar amount). A Dimension table contain records that further elaborates the contextual attributes, such as user profile data, location details ... etc. In a typical setting of Multi-dimensional model ... Each fact table contains foreign keys that references the primary key of multiple dimension tables. In the most simple form, it is called a STAR schema. Dimension tables can contain foreign keys that references other dimensional tables. This provides a sophisticated detail breakdown of the contextual aspects. This is also called a SNOWFLAKE schema. Also this is not a hard rule, Fact table tends to be independent of other Fact table and usually doesn't contain reference pointer among each other. However, different Fact table usually share the same set of dimension tables. This is also called GALAXY schema. But it is a hard rule that Dimension table NEVER points / references Fact table A simple STAR schema is shown in following diagram. Each dimension can also be hierarchical so that the analysis can be done at different degree of granularity. For example, the time dimension can be broken down into days, weeks, months, quarter and annual; Similarly, location dimension can be broken down into countries, states, cities ... etc. Here we first create a sales fact table that records each sales transaction. # Setup the dimension tables state_table <- data.frame(key=c("CA", "NY", "WA", "ON", "QU"), name=c("California", "new York", "Washington", "Ontario", "Quebec"), country=c("USA", "USA", "USA", "Canada", "Canada")) month_table <- data.frame(key=1:12, desc=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), quarter=c("Q1","Q1","Q1","Q2","Q2","Q2","Q3","Q3","Q3","Q4","Q4","Q4")) prod_table <- data.frame(key=c("Printer", "Tablet", "Laptop"), price=c(225, 570, 1120)) # Function to generate the Sales table gen_sales <- function(no_of_recs) { # Generate transaction data randomly loc <- sample(state_table$key, no_of_recs, replace=T, prob=c(2,2,1,1,1)) time_month <- sample(month_table$key, no_of_recs, replace=T) time_year <- sample(c(2012, 2013), no_of_recs, replace=T) prod <- sample(prod_table$key, no_of_recs, replace=T, prob=c(1, 3, 2)) unit <- sample(c(1,2), no_of_recs, replace=T, prob=c(10, 3)) amount <- unit*prod_table[prod,]$price sales <- data.frame(month=time_month, year=time_year, loc=loc, prod=prod, unit=unit, amount=amount) # Sort the records by time order sales <- sales[order(sales$year, sales$month),] row.names(sales) <- NULL return(sales) } # Now create the sales fact table sales_fact <- gen_sales(500) # Look at a few records head(sales_fact) month year loc prod unit amount 1 1 2012 NY Laptop 1 225 2 1 2012 CA Laptop 2 450 3 1 2012 ON Tablet 2 2240 4 1 2012 NY Tablet 1 1120 5 1 2012 NY Tablet 2 2240 6 1 2012 CA Laptop 1 225 Multi-dimensional Cube Now, we turn this fact table into a hypercube with multiple dimensions. Each cell in the cube represents an aggregate value for a unique combination of each dimension. # Build up a cube revenue_cube <- tapply(sales_fact$amount, sales_fact[,c("prod", "month", "year", "loc")], FUN=function(x){return(sum(x))}) # Showing the cells of the cude revenue_cube , , year = 2012, loc = CA month prod 1 2 3 4 5 6 7 8 9 10 11 12 Laptop 1350 225 900 675 675 NA 675 1350 NA 1575 900 1350 Printer NA 2280 NA NA 1140 570 570 570 NA 570 1710 NA Tablet 2240 4480 12320 3360 2240 4480 3360 3360 5600 2240 2240 3360 , , year = 2013, loc = CA month prod 1 2 3 4 5 6 7 8 9 10 11 12 Laptop 225 225 450 675 225 900 900 450 675 225 675 1125 Printer NA 1140 NA 1140 570 NA NA 570 NA 1140 1710 1710 Tablet 3360 3360 1120 4480 2240 1120 7840 3360 3360 1120 5600 4480 , , year = 2012, loc = NY month prod 1 2 3 4 5 6 7 8 9 10 11 12 Laptop 450 450 NA NA 675 450 675 NA 225 225 NA 450 Printer NA 2280 NA 2850 570 NA NA 1710 1140 NA 570 NA Tablet 3360 13440 2240 2240 2240 5600 5600 3360 4480 3360 4480 3360 , , year = 2013, loc = NY ..... dimnames(revenue_cube) $prod [1] "Laptop" "Printer" "Tablet" $month [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" $year [1] "2012" "2013" $loc [1] "CA" "NY" "ON" "QU" "WA" OLAP Operations Here are some common operations of OLAP Slice Dice Rollup Drilldown Pivot "Slice" is about fixing certain dimensions to analyze the remaining dimensions. For example, we can focus in the sales happening in "2012", "Jan", or we can focus in the sales happening in "2012", "Jan", "Tablet". # Slice # cube data in Jan, 2012 revenue_cube[, "1", "2012",] loc prod CA NY ON QU WA Laptop 1350 450 NA 225 225 Printer NA NA NA 1140 NA Tablet 2240 3360 5600 1120 2240 # cube data in Jan, 2012 revenue_cube["Tablet", "1", "2012",] CA NY ON QU WA 2240 3360 5600 1120 2240 "Dice" is about limited each dimension to a certain range of values, while keeping the number of dimensions the same in the resulting cube. For example, we can focus in sales happening in [Jan/ Feb/Mar, Laptop/Tablet, CA/NY]. revenue_cube[c("Tablet","Laptop"), c("1","2","3"), , c("CA","NY")] , , year = 2012, loc = CA month prod 1 2 3 Tablet 2240 4480 12320 Laptop 1350 225 900 , , year = 2013, loc = CA month prod 1 2 3 Tablet 3360 3360 1120 Laptop 225 225 450 , , year = 2012, loc = NY month prod 1 2 3 Tablet 3360 13440 2240 Laptop 450 450 NA , , year = 2013, loc = NY month prod 1 2 3 Tablet 3360 4480 6720 Laptop 450 NA 225 "Rollup" is about applying an aggregation function to collapse a number of dimensions. For example, we want to focus in the annual revenue for each product and collapse the location dimension (ie: we don't care where we sold our product). apply(revenue_cube, c("year", "prod"), FUN=function(x) {return(sum(x, na.rm=TRUE))}) prod year Laptop Printer Tablet 2012 22275 31350 179200 2013 25200 33060 166880 "Drilldown" is the reverse of "rollup" and applying an aggregation function to a finer level of granularity. For example, we want to focus in the annual and monthly revenue for each product and collapse the location dimension (ie: we don't care where we sold our product). apply(revenue_cube, c("year", "month", "prod"), FUN=function(x) {return(sum(x, na.rm=TRUE))}) , , prod = Laptop month year 1 2 3 4 5 6 7 8 9 10 11 12 2012 2250 2475 1575 1575 2250 1800 1575 1800 900 2250 1350 2475 2013 2250 900 1575 1575 2250 2475 2025 1800 2025 2250 3825 2250 , , prod = Printer month year 1 2 3 4 5 6 7 8 9 10 11 12 2012 1140 5700 570 3990 4560 2850 1140 2850 2850 1710 3420 570 2013 1140 4560 3420 4560 2850 1140 570 3420 1140 3420 3990 2850 , , prod = Tablet month year 1 2 3 4 5 6 7 8 9 10 11 12 2012 14560 23520 17920 12320 10080 14560 13440 15680 25760 12320 11200 7840 2013 8960 11200 10080 7840 14560 10080 29120 15680 15680 8960 12320 22400 "Pivot" is about analyzing the combination of a pair of selected dimensions. For example, we want to analyze the revenue by year and month. Or we want to analyze the revenue by product and location. apply(revenue_cube, c("year", "month"), FUN=function(x) {return(sum(x, na.rm=TRUE))}) month year 1 2 3 4 5 6 7 8 9 10 11 12 2012 17950 31695 20065 17885 16890 19210 16155 20330 29510 16280 15970 10885 2013 12350 16660 15075 13975 19660 13695 31715 20900 18845 14630 20135 27500 apply(revenue_cube, c("prod", "loc"), FUN=function(x) {return(sum(x, na.rm=TRUE))}) loc prod CA NY ON QU WA Laptop 16425 9450 7650 7425 6525 Printer 15390 19950 7980 10830 10260 Tablet 90720 117600 45920 34720 57120 I hope you can get a taste of the richness of data processing model in R. However, since R is doing all the processing in RAM. This requires your data to be small enough so it can fit into the local memory in a single machine.

July 30, 2013

by Ricky Ho

· 17,953 Views · 3 Likes

JMS vs RabbitMQ

Definition : JMS : Java Message Service is an API that is part of Java EE for sending messages between two or more clients. There are many JMS providers such as OpenMQ (glassfish’s default), HornetQ(Jboss), and ActiveMQ. RabbitMQ: is an open source message broker software which uses the AMQP standard and is written by Erlang. Messaging Model: JMS supports two models: one to one and publish/subscriber. RabbitMQ supports the AMQP model which has 4 models : direct, fanout, topic, headers. Data types: JMS supports 5 different data types but RabbitMQ supports only the binary data type. Workflow strategy: In AMQP, producers send to the exchange then the queue, but in JMS, producers send to the queue or topic directly. Technology compatibility: JMS is specific for java users only, but RabbitMQ supports many technologies. Performance: If you would like to know more about their performance, this benchmark is a good place to start, but look for others as well.

July 30, 2013

by Saeid Siavashi

· 51,746 Views · 16 Likes

Why String is Immutable in Java

this is an old yet still popular question. there are multiple reasons that string is designed to be immutable in java. a good answer depends on good understanding of memory, synchronization, data structures, etc. in the following, i will summarize some answers. 1. requirement of string pool string pool (string intern pool) is a special storage area in java heap. when a string is created and if the string already exists in the pool, the reference of the existing string will be returned, instead of creating a new object and returning its reference. the following code will create only one string object in the heap. string string1 = "abcd"; string string2 = "abcd"; here is how it looks: if string is not immutable, changing the string with one reference will lead to the wrong value for the other references. 2. allow string to cache its hashcode the hashcode of string is frequently used in java. for example, in a hashmap. being immutable guarantees that hashcode will always the same, so that it can be cashed without worrying the changes.that means, there is no need to calculate hashcode every time it is used. this is more efficient. 3. security string is widely used as parameter for many java classes, e.g. network connection, opening files, etc. were string not immutable, a connection or file would be changed and lead to serious security threat. the method thought it was connecting to one machine, but was not. mutable strings could cause security problem in reflection too, as the parameters are strings. here is a code example: boolean connect(string s){ if (!issecure(s)) { throw new securityexception(); } //here will cause problem, if s is changed before this by using other references. causeproblem(s); } in summary, the reasons include design, efficiency, and security. actually, this is also true for many other “why” questions in a java interview.

July 29, 2013

by Ryan Wang

· 217,351 Views · 9 Likes

Stepping through LMDB: Making Everything Easier

okay, i know that i have been critical about the lmdb codebase so far. but one thing that i really want to point out for it is that it was pretty easy to actually get things working on windows. it wasn’t smooth, in the sense that i had to muck around with the source a bit (hack endianess, remove a bunch of unix specific header files, etc). but that took less than an hour, and it was pretty much it. since i am by no means an experienced c developer, i consider this a major win. compare that to leveldb, which flat out won’t run on windows no matter how much time i spent trying, and it is a pleasure . also, stepping through the code i am starting to get a sense of how it works that is much different than the one i had when i just read the code. it is like one of those 3d images, you suddenly see something. the first thing that became obvious is that i totally missed the significance of the lock file. lmdb actually create two files: lock.mdb data.mdb lock.mdb is used to synchronized data between different readers. it seems to mostly be there if you want to have multiple writers using different processes. that is a very interesting model for an embedded database, i’ve to admit. not something that i think other embedded databases are offering. in order to do that, it create two named mutexes (one for read and one for write). a side note on windows support: lmdb supports windows, but it is very much a 2nd class citizen. you can see it in things like path not found error turning into a no such process error (because it try to use getlasterror() codes as c codes), or when it doesn’t create a directory even though not creating it would fail. i am currently debugging through the code and fixing such issues as i go along (but no, i am doing heavy handed magic fixes, just to get past this stage to the next one, otherwise i would have sent a pull request). here is one such example. here is the original code: but readfile in win32 will return false if the file is empty, so you actually need to write something like this to make the code work: past that hurdle, i think that i get a lot more about what is going on with the way lmdb works than before. let us start with the way data.mdb works. it is important to note that for pretty much everything in lmdb we use the system page size. by default, that is 4kb. the data file starts with 2 pages allocated. those page contain the following information: looking back at how couchdb did things, i am pretty sure that those two pages are going to be pretty important . i am guess that they would always contain the root of the data in the file. there is also the last transaction on them, which is what i imagine determine how something gets committed. i don’t know yet, as i said, guessing based on how couchdb works. i’ll continue this review in another time. next time, transactions…

July 27, 2013

by Oren Eini

· 9,495 Views

Displaying and Searching std::map Contents in WinDbg

This time we’re up for a bigger challenge. We want to automatically display and possibly search and filter std::map objects in WinDbg. The script for std::vectors was relatively easy because of the flat structure of the data in a vector; maps are more complex beasts. Specifically, an map in the Visual C++ STL is implemented as a red-black tree. Each tree node has three important pointers: _Left, _Right, and _Parent. Additionally, each node has a _Myval field that contains the std::pair with the key and value represented by the node. Iterating a tree structure requires recursion, and WinDbg scripts don’t have any syntax to define functions. However, we can invoke a script recursively – a script is allowed to contain the $$>a< command that invokes it again with a different set of arguments. The path to the script is also readily available in ${$arg0}. Before I show you the script, there’s just one little challenge I had to deal with. When you call a script recursively, the values of the pseudo-registers (like $t0) will be clobbered by the recursive invocation. I was on the verge of allocating memory dynamically or calling into a shell process to store and load variables, when I stumbled upon the .push and .pop commands, which store the register context and load it, respectively. These are a must for recursive WinDbg scripts. OK, so suppose you want to display values from an std::map where the key is less than or equal to 2. Here we go: 0:000> $$>a< traverse_map.script my_map -c ".block { .if (@@(@$t9.first) <= 2) { .echo ----; ?? @$t9.second } }" size = 10 ---- struct point +0x000 x : 0n1 +0x004 y : 0n2 +0x008 data : extra_data ---- struct point +0x000 x : 0n0 +0x004 y : 0n1 +0x008 data : extra_data ---- struct point +0x000 x : 0n2 +0x004 y : 0n3 +0x008 data : extra_data For each pair (stored in the $t9 pseudo-register), the block checks if the first component is less than or equal to 2, and if it is, outputs the second component. Next, here’s the script. Note it’s considerably more complex that what we had to with vectors, because it essentially invokes itself with a different set of parameters and then repeats recursively. .if ($sicmp("${$arg1}", "-n") == 0) { .if (@@(@$t0->_Isnil) == 0) { .if (@$t2 == 1) { .printf /D "%p\n", @$t0, @$t0 .printf "key = " ?? @$t0->_Myval.first .printf "value = " ?? @$t0->_Myval.second } .else { r? $t9 = @$t0->_Myval command } } $$ Recurse into _Left, _Right unless they point to the root of the tree .if (@@(@$t0->_Left) != @@(@$t1)) { .push /r /q r? $t0 = @$t0->_Left $$>a< ${$arg0} -n .pop /r /q } .if (@@(@$t0->_Right) != @@(@$t1)) { .push /r /q r? $t0 = @$t0->_Right $$>a< ${$arg0} -n .pop /r /q } } .else { r? $t0 = ${$arg1} .if (${/d:$arg2}) { .if ($sicmp("${$arg2}", "-c") == 0) { r $t2 = 0 aS ${/v:command} "${$arg3}" } } .else { r $t2 = 1 aS ${/v:command} " " } .printf "size = %d\n", @@(@$t0._Mysize) r? $t0 = @$t0._Myhead->_Parent r? $t1 = @$t0->_Parent $$>a< ${$arg0} -n ad command } Of particular note are the aS command which configures an alias that is then used by the recursive invocation to invoke a command block for each of the map’s elements; the $sicmp function which compares strings; and the .printf /D function, which outputs a chunk of DML. Finally, the recursion terminates when _Left or _Right are equal to the root of the tree (that’s just how the tree is implemented in this case).

July 26, 2013

by Sasha Goldshtein

· 4,917 Views

Using Morphia to Map Java Objects in MongoDB

MongoDB is an open source document-oriented NoSQL database system which stores data as JSON-like documents with dynamic schemas. As it doesn't store data in tables as is done in the usual relational database setup, it doesn't map well to the JPA way of storing data. Morphia is an open source lightweight type-safe library designed to bridge the gap between the MongoDB Java driver and domain objects. It can be an alternative to SpringData if you're not using the Spring Framework to interact with MongoDB. This post will cover the basics of persisting and querying entities along the lines of JPA by using Morphia and a MongoDB database instance. There are four POJOs this example will be using. First we have BaseEntity which is an abstract class containing the Id and Version fields: package com.city81.mongodb.morphia.entity; import org.bson.types.ObjectId; import com.google.code.morphia.annotations.Id; import com.google.code.morphia.annotations.Property; import com.google.code.morphia.annotations.Version; public abstract class BaseEntity { @Id @Property("id") protected ObjectId id; @Version @Property("version") private Long version; public BaseEntity() { super(); } public ObjectId getId() { return id; } public void setId(ObjectId id) { this.id = id; } public Long getVersion() { return version; } public void setVersion(Long version) { this.version = version; } } Whereas JPA would use @Column to rename the attribute, Morphia uses @Property. Another difference is that @Property needs to be on the variable whereas @Column can be on the variable or the get method. The main entity we want to persist is the Customer class: package com.city81.mongodb.morphia.entity; import java.util.List; import com.google.code.morphia.annotations.Embedded; import com.google.code.morphia.annotations.Entity; @Entity public class Customer extends BaseEntity { private String name; private List accounts; @Embedded private Address address; public String getName() { return name; } public void setName(String name) { this.name = name; } public List getAccounts() { return accounts; } public void setAccounts(List accounts) { this.accounts = accounts; } public Address getAddress() { return address; } public void setAddress(Address address) { this.address = address; } } As with JPA, the POJO is annotated with @Entity. The class also shows an example of @Embedded: The Address class is also annotated with @Embedded as shown below: package com.city81.mongodb.morphia.entity; import com.google.code.morphia.annotations.Embedded; @Embedded public class Address { private String number; private String street; private String town; private String postcode; public String getNumber() { return number; } public void setNumber(String number) { this.number = number; } public String getStreet() { return street; } public void setStreet(String street) { this.street = street; } public String getTown() { return town; } public void setTown(String town) { this.town = town; } public String getPostcode() { return postcode; } public void setPostcode(String postcode) { this.postcode = postcode; } } Finally, we have the Account class of which the customer class has a collection of: package com.city81.mongodb.morphia.entity; import com.google.code.morphia.annotations.Entity; @Entity public class Account extends BaseEntity { private String name; public String getName() { return name; } public void setName(String name) { this.name = name; } } The above show only a small subset of what annotations can be applied to domain classes. More can be found at http://code.google.com/p/morphia/wiki/AllAnnotations The Example class shown below goes through the steps involved in connecting to the MongoDB instance, populating the entities, persisting them and then retrieving them: package com.city81.mongodb.morphia; import java.net.UnknownHostException; import java.util.ArrayList; import java.util.List; import com.city81.mongodb.morphia.entity.Account; import com.city81.mongodb.morphia.entity.Address; import com.city81.mongodb.morphia.entity.Customer; import com.google.code.morphia.Datastore; import com.google.code.morphia.Key; import com.google.code.morphia.Morphia; import com.mongodb.Mongo; import com.mongodb.MongoException; /** * A MongoDB and Morphia Example * */ public class Example { public static void main( String[] args ) throws UnknownHostException, MongoException { String dbName = new String("bank"); Mongo mongo = new Mongo(); Morphia morphia = new Morphia(); Datastore datastore = morphia.createDatastore(mongo, dbName); morphia.mapPackage("com.city81.mongodb.morphia.entity"); Address address = new Address(); address.setNumber("81"); address.setStreet("Mongo Street"); address.setTown("City"); address.setPostcode("CT81 1DB"); Account account = new Account(); account.setName("Personal Account"); List accounts = new ArrayList(); accounts.add(account); Customer customer = new Customer(); customer.setAddress(address); customer.setName("Mr Bank Customer"); customer.setAccounts(accounts); Key savedCustomer = datastore.save(customer); System.out.println(savedCustomer.getId()); } Executing the first few lines will result in the creation of a Datastore. This interface will provide the ability to get, delete and save objects in the 'bank' MongoDB instance. The mapPackage method call on the morphia object determines what objects are mapped by that instance of Morphia. In this case all those in the package supplied. Other alternatives exist to map classes, including the method map which takes a single class (this method can be chained as the returning object is the morphia object), or passing a Set of classes to the Morphia constructor. After creating instances of the entities, they can be saved by calling save on the datastore instance and can be found using the primary key via the get method. The output from the Example class would look something like the below: 11-Jul-2012 13:20:06 com.google.code.morphia.logging.MorphiaLoggerFactory chooseLoggerFactory INFO: LoggerImplFactory set to com.google.code.morphia.logging.jdk.JDKLoggerFactory 4ffd6f7662109325c6eea24f Mr Bank Customer There are many other methods on the Datastore interface and they can be found along with the other Javadocs at http://morphia.googlecode.com/svn/site/morphia/apidocs/index.html An alternative to using the Datastore directly is to use the built in DAO support. This can be done by extending the BasicDAO class as shown below for the Customer entity: package com.city81.mongodb.morphia.dao; import com.city81.mongodb.morphia.entity.Customer; import com.google.code.morphia.Morphia; import com.google.code.morphia.dao.BasicDAO; import com.mongodb.Mongo; public class CustomerDAO extends BasicDAO { public CustomerDAO(Morphia morphia, Mongo mongo, String dbName) { super(mongo, morphia, dbName); } } To then make use of this, the Example class can be changed (and enhanced to show a query and a delete): ... CustomerDAO customerDAO = new CustomerDAO(morphia, mongo, dbName); customerDAO.save(customer); Query query = datastore.createQuery(Customer.class); query.and( query.criteria("accounts.name").equal("Personal Account"), query.criteria("address.number").equal("81"), query.criteria("name").contains("Bank") ); QueryResults retrievedCustomers = customerDAO.find(query); for (Customer retrievedCustomer : retrievedCustomers) { System.out.println(retrievedCustomer.getName()); System.out.println(retrievedCustomer.getAddress().getPostcode()); System.out.println(retrievedCustomer.getAccounts().get(0).getName()); customerDAO.delete(retrievedCustomer); } ... With the output from running the above shown below: 11-Jul-2012 13:30:46 com.google.code.morphia.logging.MorphiaLoggerFactory chooseLoggerFactory INFO: LoggerImplFactory set to com.google.code.morphia.logging.jdk.JDKLoggerFactory Mr Bank Customer CT81 1DB Personal Account This post only covers a few brief basics of Morphia but shows how it can help bridge the gap between JPA and NoSQL.

July 25, 2013

by Geraint Jones

· 76,146 Views

Jersey: Listing all Resources, Paths, Verbs to Build an Entry Point/Index for an API

I’ve been playing around with Jersey over the past couple of days and one thing I wanted to do was create an entry point or index which listed all my resources, the available paths and the verbs they accepted. Guido Simone explained a neat way of finding the paths and verbs for a specific resource using Jersey’sIntrospectionModeller: AbstractResource resource = IntrospectionModeller.createResource(JacksonResource.class); System.out.println("Path is " + resource.getPath().getValue()); String uriPrefix = resource.getPath().getValue(); for (AbstractSubResourceMethod srm :resource.getSubResourceMethods()) { String uri = uriPrefix + "/" + srm.getPath().getValue(); System.out.println(srm.getHttpMethod() + " at the path " + uri + " return " + srm.getReturnType().getName()); } If we run that against j4-minimal‘s JacksonResource class we get the following output: Path is /jackson GET at the path /jackson/{who} return com.g414.j4.minimal.JacksonResource$Greeting GET at the path /jackson/awesome/{who} return javax.ws.rs.core.Response That’s pretty neat but I didn’t want to have to manually list all my resources since I’ve already done that using Guice . I needed a way to programatically get hold of them and I partially found the way to do this from this postwhich suggests using Application.getSingletons(). I actually ended up using Application.getClasses() and I ended up with ResourceListingResource: @Path("/") public class ResourceListingResource { @GET @Produces(MediaType.APPLICATION_JSON) public Response showAll( @Context Application application, @Context HttpServletRequest request) { String basePath = request.getRequestURL().toString(); ObjectNode root = JsonNodeFactory.instance.objectNode(); ArrayNode resources = JsonNodeFactory.instance.arrayNode(); root.put( "resources", resources ); for ( Class aClass : application.getClasses() ) { if ( isAnnotatedResourceClass( aClass ) ) { AbstractResource resource = IntrospectionModeller.createResource( aClass ); ObjectNode resourceNode = JsonNodeFactory.instance.objectNode(); String uriPrefix = resource.getPath().getValue(); for ( AbstractSubResourceMethod srm : resource.getSubResourceMethods() ) { String uri = uriPrefix + "/" + srm.getPath().getValue(); addTo( resourceNode, uri, srm, joinUri(basePath, uri) ); } for ( AbstractResourceMethod srm : resource.getResourceMethods() ) { addTo( resourceNode, uriPrefix, srm, joinUri( basePath, uriPrefix ) ); } resources.add( resourceNode ); } } return Response.ok().entity( root ).build(); } private void addTo( ObjectNode resourceNode, String uriPrefix, AbstractResourceMethod srm, String path ) { if ( resourceNode.get( uriPrefix ) == null ) { ObjectNode inner = JsonNodeFactory.instance.objectNode(); inner.put("path", path); inner.put("verbs", JsonNodeFactory.instance.arrayNode()); resourceNode.put( uriPrefix, inner ); } ((ArrayNode) resourceNode.get( uriPrefix ).get("verbs")).add( srm.getHttpMethod() ); } private boolean isAnnotatedResourceClass( Class rc ) { if ( rc.isAnnotationPresent( Path.class ) ) { return true; } for ( Class i : rc.getInterfaces() ) { if ( i.isAnnotationPresent( Path.class ) ) { return true; } } return false; } } The only change I’ve made from Guido Simone’s solution is that I also call resource.getResourceMethods()because resource.getSubResourceMethods() only returns methods which have a @Path annotation. Since we’ll sometimes define our path at the class level and then define different verbs that operate on that resource it misses some methods out. If we run a cURL command (piped through python to make it look nice) against the root we get the following output: $ curl http://localhost:8080/ -w "\n" 2>/dev/null | python -mjson.tool { "resources": [ { "/bench": { "path": "http://localhost:8080/bench", "verbs": [ "GET", "POST", "PUT", "DELETE" ] } }, { "/sample/{who}": { "path": "http://localhost:8080/sample/{who}", "verbs": [ "GET" ] } }, { "/jackson/awesome/{who}": { "path": "http://localhost:8080/jackson/awesome/{who}", "verbs": [ "GET" ] }, "/jackson/{who}": { "path": "http://localhost:8080/jackson/{who}", "verbs": [ "GET" ] } }, { "/": { "path": "http://localhost:8080/", "verbs": [ "GET" ] } } ] }

July 24, 2013

by Mark Needham

· 20,979 Views · 2 Likes

Getting started with JPA and Mule

Working with JPA managed entities in Mule applications can be difficult. Since the JPA session is not propagated between message processors, transformers are typically needed to produce an entity from a message’s payload, pass it to a component for processing, then serialize it back to an un-proxied representation for further processing. Transactions have been complicated too. Its difficult to coordinate a transaction between multiple components that are operating with JPA entity payloads. Finally the lack of support for JPA queries makes it difficult to load objects without working with raw SQL and the JDBC transport. Mule Support for JPA Entities The JPA module aims to simplify working with JPA managed entities with Mule. It provides message processors that map to an EntityManager’s methods. The message processors participate in Mule transactions, making it easy to structure JPA transactions within Mule flows. The JPA module also provides a @PersistenceContext implementation. This allows Mule components to participate in JPA transactions. Installing the JPA Module To install the JPA Module you need to click on “Help” followed by “Install New Software…” from Mule Studio. Select the “MuleStudio Cloud Connectors Update Site” from the “Work With” drop-down list then find the “Mule Java Persistence API Module Mule Extension.” This is illustrated below: Fetching JPA Entities JPA query language or criteria queries can be executed using the “query” MP. Supplying a statement to the query will execute the given query and return the results to the next message processor, as illustrated in the following Gist: The queryParameters-ref defines the parameters. In this case the message’s payload as the parameters to the query. The following query illustrates how a Map payload could be used to populate query parameters: The query processor also supports criteria queries by setting the queryParameters-ref to an instance of a CriteriaQuery, as illustrated in the functional test snippet below. CriteriaBuilder criteriaBuilder = entityManager.getCriteriaBuilder(); CriteriaQuery criteriaQuery = criteriaBuilder.createQuery(Dog.class); Root from = criteriaQuery.from(Dog.class); Predicate condition = criteriaBuilder.equal(from.get("name"), "Cujo"); criteriaQuery.where(condition); runFlowWithPayloadAndExpect("testQuery", expectedResults, criteriaQuery); You can use the ”find” MP to load a single object if you know its ID: Transactions and Entity Operations The default behavior of most JPA providers, like Hibernate, is to provide proxies on entity relationships to avoid loading full object graphs into memory. When these objects are detached from the JPA session, however, attempts to access relations in the object will often fail because the proxied session is no longer available. This complicates using JPA is Mule applications as JPA objects pass between message processors and inbetween flows and the session subsequently becomes unavailable. The JPA module allows you to avoid this by wrapping your operations in a transactional block. Let’s first look at how to persist an object then query it within a transaction. The below assumes the message’s payload is an instance of the Dog domain class. Now let’s see how we can use the merge processor to attach a JPA object to a new session. This can be useful when passing a JPA entity from one flow to another. ....other processing here.... Detaching an entity is just as simple: Component Operations with JPA The real power of using JPA with Mule is allowing your business services to participate in Mule managed JPA transactions. A @PersistenceContext EntityManager reference in your component class will cause Mule to inject a reference to a transactional flow’s current EntityManager for that method, as illustrated in the following class: public class DogServiceImpl { @PersistenceContext EntityManager entityManager; public Dog groom(Dog dog) { return entityManager.merge(dog); } } We can now wire the component up in a flow: Conclusion JPA is an important part of the JEE ecosystem and hopefully this module will simplify your use of JPA managed entities in Mule applications.

July 24, 2013

by John D'Emic

· 8,181 Views