Data Resources

The Latest Data Topics

[This article by Chris Haddad comes to you from the DZone Guide to Cloud Development - 2015 Edition. For more information—including in-depth articles from industry experts, best solutions for PaaS, iPaaS, IaaS, and MBaaS, and more—click the link below to download your free copy of the guide.] Microservices can be understood from two angles. First, the differential: teams that take a microservice design approach divide business solutions into distinct, full-stack business services owned by autonomous teams. Second, the integral: microservice-based applications weave multiple atomic microservices into holistic user experiences. Unfortunately, traditional application delivery models and traditional middleware infrastructure do not address microservice-specific demands for on-demand provisioning, dynamic composition, and service level management. On the other hand, the Platform-as-a-Service (PaaS) model addresses these demands perfectly. Running microservices on a PaaS fabric decreases solution fragility, reduces operational burden, and enhances developer productivity. To understand why, we’ll first review how microservices separate concerns from both business and object-oriented design perspectives. Second, we’ll consider how microservice-based design can complicate deployment as applications scale dynamically. Third, we’ll focus on how a PaaS environment helps to solve many of the problems both addressed and introduced by microservices-based architectures — in other words, why PaaS and microservices are a match made in heaven. Microservices: Separating Concerns By Business Solution A microservice approach decomposes monolithic applications according to the single responsibility pattern. In a microservice solution, each microservice interface delivers discrete business capabilities (e.g. customer profile, product catalogue, inventory, order, billing, fulfillment) within a well-defined, bounded context. The atomic microservice interfaces reside on separate and distinct full-stack application platforms that contain separate database storage, integration flows, and web application hosting. By separating concerns onto separate full-stack platforms and not sharing database instances or web application hosts across services, every team is free to choose different runtime languages and frameworks for its own microservice. Also, every team is free to evolve its data schemas, application frameworks, and business logic without impacting other teams. Because microservices are a relatively new design approach, many development teams may have the misconception that creating a microservice-based solution requires simply deploying small web services in containers. But this doesn’t cut quite deep enough. The correct approach is to evolve your monolithic design by applying service-oriented principles (i.e. encapsulation, loose coupling, separation of concerns) in conjunction with domain-driven design techniques and dynamic runtime application composition. For example, in a typical ecommerce scenario, a development team applies the bounded context pattern and single responsibility pattern to refactor a monolithic application into units distinguished by business capability (see Figure 2). By creating a user experience from loosely coupled services instead of tightly coupled native-language business objects, teams have more independence to develop, evolve, and deploy each business capability separately. Obviously, the microservice design approach works best for (a) greenfield projects or (b) modernization efforts where teams focus on refactoring monolithic application assets. The Microservice Execution Trap Although a microservice approach decouples development dependencies and speeds up development iterations, microservices also create a challenging environment for high-performance scaling and reliable runtime execution. More complex, loosely coupled, and dynamic environments distribute business capabilities over the entire network. Even a task as simple as responding to a single web application page request may spread out across several microservice instances residing on a distributed network topology. Martin Fowler and Stefan Tilkov (both microservice proponents) warn teams that successfully implementing a microservice approach requires choosing platforms that decrease solution fragility and reduce operational burdens. What Platform-as-a-Service Offers Platform-as-a-Service environments reduce microservice operational burdens when infrastructure-as-code and declarative policies are used to eliminate all manual actions and increase runtime quality of service (i.e. reliability, availability, scalability, and performance). The appropriate PaaS environment will automatically deploy, provision, and link full-stack microservices. In a microservice architecture, teams want to rapidly release new versions and perform A/B testing across versions. When teams define instance dependencies, scaling properties, and security policies as PaaS metadata or code scripts, the runtime fabric can reduce manual effort and increase release confidence. With a DevOps- friendly PaaS, the team can experiment with new service versions and safely rollback to a prior stable release if a problem arises. Because microservices are full-stack silos *1* that can be composed of multiple server instances (e.g. web server, database, load balancer, integration server), a PaaS can reduce deployment complexity by automatically spinning up and linking all instances. Linking may require discovering instance locations, dynamically initializing network routes, and auto-configuring connection strings based on service version or tenant. A traditional application will compose business functions and user experience by statically linking class files and shared object libraries. In contrast, microservice- based applications use service composition to connect available microservices endpoints and realize a fully functional application. While many microservice proponents promote microservice-based interactions by “smart endpoints through dumb pipes, ‘ effective service composition requires smart infrastructure building blocks to bootstrap and maintain connections between services and consumers. The right PaaS solves these problems. Infrastructure building blocks will register service endpoint locations, associate metadata and policies, connect clients, circuit break around failures, correlate inter-service calls, and load balance traffic. A microservice-friendly PaaS will provide service registries, metadata services, discovery services, and service virtualization gateways. In the pipe, circuit breakers will automatically route traffic on failover or overload. Smart endpoint code will dynamically connect with microservices based on discovery service responses and negotiated quality of service parameters. Rather than being hard-coded to a specific service hostname and URI, endpoint code will query for microservice location based on security assurances, performance guarantees, traffic load, service version, client tenancy, or business domain. When services are unavailable or underperform, smart endpoints will follow the tolerant reader pattern and gracefully degrade experience or proactively recover. A few recovery options include reading from local caches or circuit tripping to backup service endpoints. In conjunction with smart endpoint actions, a smart PaaS will spin up new microservice endpoints and full-stack instances based on service level management metrics. By following microservice architecture best practices, teams create anti-fragile applications that not only withstand a shock, but also improve performance and quality of service when stressed or experiencing failures. To drive this non-intuitive behavior, the underlying platform environment must be ready to scale, repair, and reconnect services. PaaS service level management components will create more resilient and anti-fragile microservices by monitoring performance, elastically provisioning instances, and dynamically re-routing traffic. Scaling an anti-fragile microservice is more difficult than scaling a web application. The PaaS should distribute microservice instances across multiple availability zones and dynamically adjust traffic to reduce latency and response time. Because transient microservice instances will rapidly start, stop, and change location, the service management layer must be completely automated and integrated with routing services. A PaaS environment will deliver the service level management, dynamic service composition, circuit breakers, and on-demand provisioning functions required to overcome the complexity inherent within a distributed microservice-based application architecture. Running microservices on a PaaS fabric will decrease solution fragility, reduce operational burden, and enhance developer productivity. If you are pursuing a microservice design approach, make sure you choose a microservice- friendly PaaS. DOWNLOAD YOUR FREE COPY TODAY

May 5, 2015

by Chris Haddad

· 12,095 Views · 2 Likes

How To: Neo4j Data Import - Minimal Example

The easiest way to import data from relational or legacy systems, like plain CSV files without headers, into Neo4j with Cypher and a graph model.

April 30, 2015

by Michael Hunger

· 7,787 Views

Introduction to Probabilistic Data Structures

When processing large data sets, we often want to do some simple checks, such as number of unique items, most frequent items, and whether some items exist in the data set. The common approach is to use some kind of deterministic data structure like HashSet or Hashtable for such purposes. But when the data set we are dealing with becomes very large, such data structures are simply not feasible because the data is too big to fit in the memory. It becomes even more difficult for streaming applications which typically require data to be processed in one pass and perform incremental updates. Probabilistic data structures are a group of data structures that are extremely useful for big data and streaming applications. Generally speaking, these data structures use hash functions to randomize and compactly represent a set of items. Collisions are ignored but errors can be well-controlled under certain threshold. Comparing with error-free approaches, these algorithms use much less memory and have constant query time. They usually support union and intersection operations and therefore can be easily parallelized. This article will introduce three commonly used probabilistic data structures: Bloom filter, HyperLogLog, and Count-Min sketch. Membership Query - Bloom filter A Bloom filter is a bit array of m bits initialized to 0. To add an element, feed it to k hash functions to get k array position and set the bits at these positions to 1. To query an element, feed it to k hash functions to obtain k array positions. If any of the bits at these positions is 0, then the element is definitely not in the set. If the bits are all 1, then the element might be in the set. A Bloom filter with 1% false positive rate only requires 9.6 bits per element regardless of the size of the elements. For example, if we have inserted x, y, z into the bloom filter, with k=3 hash functions like the picture above. Each of these three elements has three bits each set to 1 in the bit array. When we look up for w in the set, because one of the bits is not set to 1, the bloom filter will tell us that it is not in the set. Bloom filter has the following properties: False positive is possible when the queried positions are already set to 1. But false negative is impossible. Query time is O(k). Union and intersection of bloom filters with same size and hash functions can be implemented with bitwise OR and AND operations. Cannot remove an element from the set. Bloom filter requires the following inputs: m: size of the bit array n: estimated insertion p: false positive probability The optimum number of hash functions k can be determined using the formula: Given false positive probability p and the estimated number of insertions n, the length of the bit array can be calculated as: The hash functions used for bloom filter should generally be faster than cryptographic hash algorithms with good distribution and collision resistance. Commonly used hash functions for bloom filter include Murmur hash, fnv series of hashes and Jenkins hashes. Murmur hash is the fastest among them. MurmurHash3 is used by Google Guava library's bloom filter implementation. Cardinality - HyperLogLog HyperLogLog is a streaming algorithm used for estimating the number of distinct elements (the cardinality) of very large data sets. HyperLogLog counter can count one billion distinct items with an accuracy of 2% using only 1.5 KB of memory. It is based on the bit pattern observation that for a stream of randomly distributed numbers, if there is a number x with the maximum of leading 0 bits k, the cardinality of the stream is very likely equal to 2^k. For each element si in the stream, hash function h(si) transforms si into string of random bits (0 or 1 with probability of 1/2): The probability P of the bit patterns: 0xxxx... → P = 1/2 01xxx... → P = 1/4 001xx... → P = 1/8 The intuition is that when we are seeing prefix 0k 1..., it's likely there are n ≥ 2k+1 different strings. By keeping track of prefixes 0k 1... that have appeared in the data stream, we can estimate the cardinality to be 2p, where p is the length of the largest prefix. Because the variance is very high when using single counter, in order to get a better estimation, data is split into m sub-streams using the first few bits of the hash. The counters are maintained by m registers each has memory space of multiple of 4 bytes. If the standard deviation for each sub-stream is σ, then the standard deviation for the averaged value is only σ/√m. This is called stochastic averaging. For instance for m=4, The elements are split into m stream using the first 2 bits (00, 01, 10, 11) which are then discarded. Each of the register stores the rest of the hash bits that contains the largest 0k 1 prefix. The values in the m registers are then averaged to obtain the cardinality estimate. HyperLogLog algorithm uses harmonic mean to normalize result. The algorithm also makes adjustment for small and very large values. The resulting error is equal to 1.04/√m. Each of the m registers uses at most log2log2 n + O(1) bits when cardinalities ≤ n need to be estimated. Union of two HyperLogLog counters can be calculated by first taking the maximum value of the two counters for each of the m registers, and then calculate the estimated cardinality. Frequency - Count-Min Sketch Count-Min sketch is a probabilistic sub-linear space streaming algorithm. It is somewhat similar to bloom filter. The main difference is that bloom filter represents a set as a bitmap, while Count-Min sketch represents a multi-set which keeps a frequency distribution summary. The basic data structure is a two dimensional d x w array of counters with d pairwise independent hash functions h1 ... hd of range w. Given parameters (ε,δ), set w = [e/ε], and d = [ln1/δ]. ε is the accuracy we want to have and δ is the certainty with which we reach the accuracy. The two dimensional array consists of wd counts. To increment the counts, calculate the hash positions with the d hash functions and update the counts at those positions. The estimate of the counts for an item is the minimum value of the counts at the array positions determined by the d hash functions. The space used by Count-Min sketch is the array of w*d counters. By choosing appropriate values for d and w, very small error and high probability can be achieved. Example of Count-Min sketch sizes for different error and probability combination: ε 1 - δ w d wd 0.1 0.9 28 3 84 0.1 0.99 28 5 140 0.1 0.999 28 7 196 0.01 0.9 272 3 816 0.01 0.99 272 5 1360 0.01 0.999 272 7 1940 0.001 0.999 2719 7 19033 Count-Min sketch has the following properties: Union can be performed by cell-wise ADD operation O(k) query time Better accuracy for higher frequency items (heavy hitters) Can only cause over-counting but not under-counting Count-Min sketch can be used for querying single item count or "heavy hitters" which can be obtained by keeping a heap structure of all the counts. Summary Probabilistic data structures have many applications in modern web and data applications where the data arrives in a streaming fashion and needs to be processed on the fly using limited memory. Bloom filter, HyperLogLog, and Count-Min sketch are the most commonly used probabilistic data structures. There are a lot of research on various streaming algorithms, synopsis data structures and optimization techniques that are worth investigating and studying. If you haven't tried these data structures, you will be amazed how powerful they can be once you start using them. It may be a little bit intimidating to understand the concept initially, but the implementation is actually quite simple. Google Guava has Bloom filter implementation using murmur hash. Clearspring's Java library stream-lib and Twitter's Scala library Algebird have implementation for all three data structures and other useful data structures that you can play with. I have included the links below. Links http://bigsnarf.wordpress.com/2013/02/08/probabilistic-data-structures-for-data-analytics/ http://en.wikipedia.org/wiki/Bloom_filter http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/ http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf http://www.moneyscience.com/pg/blog/ThePracticalQuant/read/438348/realtime-analytics-hokusai-adds-a-temporal-component-to-countmin-sketch http://people.cs.umass.edu/~mcgregor/711S12/sketches1.pdf https://github.com/addthis/stream-lib https://github.com/twitter/algebird

April 30, 2015

by Yin Niu

· 35,710 Views · 6 Likes

Why Elasticsearch is Suitable for Application Log Analytics

Handling Application Logs Enterprise application development using Web technologies has been around for a long time. In recent years we have seen a sharp increase in the deployment of such applications. This is partly due to the proliferation of ecommerce sites, social media sites, mobile application supporting sites, as well as the desire of enterprises to have their applications available 24x7. In most cases, such applications cater to huge load and are deployed on cloud infrastructure. Monitoring deployed applications is increasingly becoming a crucial task, as deployed applications are bound to fail, irrespective of the robust techniques used during development. Whenever an application fails, the most common resolution method starts by examining the application log. If the application has implemented logging properly, the logs can reveal the cause of application failure. Examination of log files is usually done by viewing the file using tools like vi, less, more, tail or grep. Another method is to download the file to a Windows system and viewing it using an editor like Notepad++. Engineers usually scan the log information to look for clues that point to the reasons for failure. Once the cause of failure is identified, suitable action is taken for restoring the application and/or service. The Key to Application Log Analytics This process, of logging onto a remote system and viewing logs is tedious. Additionally, many of the tools do not provide support to make the task of issue identification any simpler. Even when using tools like grep (if we know the pattern), we still need to view the logs in order to go through other information that has been logged, such as the log information that precedes the failure point. While it has always been possible to develop applications to parse application logs, the recent renewed interest in application log analytics is due to the acceptance of NoSQL-like technologies and the availability of standard tools to parse application logs. Though relational databases (RDBMS) have for many years provided the facility to store structured data, they are not well-suited for handling log data, as in many cases, the structure of the logged information is not the same across the file. This does not fit well in the rigidly defined world of an RDBMS. In comparison, NoSQL allows document flexibility and documents with different schemas can be stored in the same database / index / store. The ability to convert log data into a well-defined structure, as well as the ability to search, are key to implement a modern log analytics solution. In this document, we cover how Elasticsearch. Elasticsearch can store documents, giving us the benefit of structured storage without the overheads of a database system. The Suitability of Elasticsearch In the following subsections, we share our views as to why Elasticsearch is a suitable data store for an application log analytics solution. Elasticsearch is part of a popular trio of tools, commonly known as ELK. Of these, L stands for Logstash, the log parser; E stands for Elasticsearch, the document store; and K stands for Kibana, the visualization tool. Storing Documents Logstash can be used to parse plain text data into structured text. Once data has some structure, it becomes easy to find information by enabling search on it. While parsing application logs is not a challenge, the challenge has been in storing the data and enabling search on it. Most prior solutions have used an RDBMS for storage, but the varying structure and textual nature of application logs makes it difficult to use an RDBMS table structure to store data. RDBMSs are not geared toward ‘search’. They are geared for maintaining a ‘single value of truth’ for the data, defining relations between the data, ensuring their consistency and so on. Search is also not a strong point for RDBMSs as they use exact matches for values, while Elasticsearch supports exact matches as well as partial matches. It also supports document scoring, which attaches a confidence factor to the documents located. Elasticsearch supports documents in JSON format and uses the NoSQL philosophy for document storage. This has the advantage of allowing a flexible schema for the data. Unlike an RDBMS, Elasticsearch is a search engine at heart and hence is built for the same. Though Elasticsearch uses NoSQL for storing documents, it does not provide robust methods to update stored data. Not supporting updates is a serious disadvantage in most cases. In the case of application logs, not supporting updates actually works in favour of Elasticsearch. In case of machine logs, updates are not really required. Application logs are generated from a debugging perspective – having data handy for debugging purposes in the event of application crash or incorrect execution. They usually record important events from application execution and provide additional information to allow application developers to identify the reasons for failure. Additionally, existing information in application logs is rarely, if ever, updated. New information is continually being written to the logs, with no need to refer to old information. This plays to Elasticsearch’s strength, which is able to ingest and index new information very quickly. Search One of the easiest ways of locating information from large volumes of logs is to perform a search. Elasticsearch is well suited not only to handle search, it also supports huge volume of data, using distributed computing (implemented using Shards). While Kibana is one of the commonly used tools to display and visualize information stored in Elasticsearch, it is more suited to display standard charts like bar chart, column chart and pie chart. If the features provided by Kibana are not enough, we can always use Elasticsearch’s REST API support and it’s Query DSL (Domain-Specific Language), to search for required information. The Query DSL and the result of the query are in JSON format. Though this format makes it easy for applications to parse and process, users would need a friendly user interface to interact with the data. Handling Voluminous Data Elasticsearch supports distributed search out of the box – using the concept of ‘shards’. A shard is a single Lucene instance and is managed by Elasticsearch. Two types of shards, namely ‘primary shard’ and ‘replica shard’ are supported. By default, a document is first indexed on the primary shard and then on the replica shards. The number of primary shards can be specified, to cater to the expected volume. By default, Elasticsearch creates five shards for an index. But, once the number of primary shards is decided, it cannot be changed. A replica shards are copies the primary shard. They are used to handle fail-over and the increase performance. While performance across voluminous data can be handled by sharding, it is important to note that shards, once created for an index, cannot be changed. Thus, the sharding strategy of the data has to be decided in advance, after an assessment of the data and an estimation of its growth. In the case of application logs, the sharding strategy can be based on the application name, the business unit ID, the application OD or the application’s geolocation, just to name a few. Analytics By storing data in a structure, analytics can be enabled on the data. Not only can application perform a simple search, it is also possible to restrict the search for specific terms or over a specified time period. Structured storage also makes it easier to develop reports with well-defined visualizations, which in turn makes it easy to understand the current state of applications. It is also possible to perform various analytics operations like time series analysis using the timestamp and identification of patterns from the data using machine learning techniques (assuming, we have the right kind of data in the logs). Though Elasticsearch does not provide built-in support for analytics, applications can benefit from its fast search capability and also from its ability to handle voluminous data sets. In Closing One of the main hurdles for application logs has been the ability to search for information from the huge volume of data. By parsing application log files using Logstash, we can convert a flat file into structured data. Structured data, once stored in Elasticsearch, is easier to search and locate. Visualizations and business logic for generating alerts and tickets is easier to develop on structured data. Elasticsearch, which stores and searches documents, along with its ability to scale over huge volume of data, is a good candidate for inclusion in an application log analytics solution.

April 22, 2015

by Bipin Patwardhan

· 11,737 Views · 2 Likes

Internet of Things MQTT Quality of Service Levels

At Red Hat's Virtual Event, Building Data-driven Solutions for the Internet of Things, Kenneth Peeples spoke about connecting to the IoT with the MQTT protocol.

April 20, 2015

by Kenneth Peeples

· 12,307 Views

Using Apache Kafka for Integration and Data Processing Pipelines with Spring

written by josh long on the spring blog applications generated more and more data than ever before and a huge part of the challenge - before it can even be analyzed - is accommodating the load in the first place. apache’s kafka meets this challenge. it was originally designed by linkedin and subsequently open-sourced in 2011. the project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. the design is heavily influenced by transaction logs. it is a messaging system, similar to traditional messaging systems like rabbitmq, activemq, mqseries, but it’s ideal for log aggregation, persistent messaging, fast (_hundreds_ of megabytes per second!) reads and writes, and can accommodate numerous clients. naturally, this makes it perfect for cloud-scale architectures! kafka powers many large production systems . linkedin uses it for activity data and operational metrics to power the linkedin news feed, and linkedin today, as well as offline analytics going into hadoop. twitter uses it as part of their stream-processing infrastructure. kafka powers online-to-online and online-to-offline messaging at foursquare. it is used to integrate foursquare monitoring and production systems with hadoop-based offline infrastructures. square uses kafka as a bus to move all system events through square’s various data centers. this includes metrics, logs, custom events, and so on. on the consumer side, it outputs into splunk, graphite, or esper-like real-time alerting. netflix uses it for 300-600bn messages per day. it’s also used by airbnb, mozilla, goldman sachs, tumblr, yahoo, paypal, coursera, urban airship, hotels.com, and a seemingly endless list of other big-web stars. clearly, it’s earning its keep in some powerful systems! installing apache kafka there are many different ways to get apache kafka installed. if you’re on osx, and you’re using homebrew, it can be as simple as brew install kafka . you can also download the latest distribution from apache . i downloaded kafka_2.10-0.8.2.1.tgz , unzipped it, and then within you’ll find there’s a distribution of apache zookeeper as well as kafka, so nothing else is required. i installed apache kafka in my $home directory, under another directory, bin , then i created an environment variable, kafka_home , that points to $home/bin/kafka . start apache zookeeper first, specifying where the configuration properties file it requires is: $kafka_home/bin/zookeeper-server-start.sh $kafka_home/config/zookeeper.properties the apache kafka distribution comes with default configuration files for both zookeeper and kafka, which makes getting started easy. you will in more advanced use cases need to customize these files. then start apache kafka. it too requires a configuration file, like this: $kafka_home/bin/kafka-server-start.sh $kafka_home/config/server.properties the server.properties file contains, among other things, default values for where to connect to apache zookeeper ( zookeeper.connect ), how much data should be sent across sockets, how many partitions there are by default, and the broker id ( broker.id - which must be unique across a cluster). there are other scripts in the same directory that can be used to send and receive dummy data, very handy in establishing that everything’s up and running! now that apache kafka is up and running, let’s look at working with apache kafka from our application. some high level concepts.. a kafka broker cluster consists of one or more servers where each may have one or more broker processes running. apache kafka is designed to be highly available; there are no master nodes. all nodes are interchangeable. data is replicated from one node to another to ensure that it is still available in the event of a failure. in kafka, a topic is a category, similar to a jms destination or both an amqp exchange and queue. topics are partitioned, and the choice of which of a topic’s partition a message should be sent to is made by the message producer. each message in the partition is assigned a unique sequenced id, its offset . more partitions allow greater parallelism for consumption, but this will also result in more files across the brokers. producers send messages to apache kafka broker topics and specify the partition to use for every message they produce. message production may be synchronous or asynchronous. producers also specify what sort of replication guarantees they want. consumers listen for messages on topics and process the feed of published messages. as you’d expect if you’ve used other messaging systems, this is usually (and usefully!) asynchronous. like spring xd and numerous other distributed system, apache kafka uses apache zookeeper to coordinate cluster information. apache zookeeper provides a shared hierarchical namespace (called znodes ) that nodes can share to understand cluster topology and availability (yet another reason that spring cloud has forthcoming support for it..). zookeeper is very present in your interactions with apache kafka. apache kafka has, for example, two different apis for acting as a consumer. the higher level api is simpler to get started with and it handles all the nuances of handling partitioning and so on. it will need a reference to a zookeeper instance to keep the coordination state. let’s turn now turn to using apache kafka with spring. using apache kafka with spring integration the recently released apache kafka 1.1 spring integration adapter is very powerful, and provides inbound adapters for working with both the lower level apache kafka api as well as the higher level api. the adapter, currently, is xml-configuration first, though work is already underway on a spring integration java configuration dsl for the adapter and milestones are available. we’ll look at both here, now. to make all these examples work, i added the libs-milestone-local maven repository and used the following dependencies: org.apache.kafka:kafka_2.10:0.8.1.1 org.springframework.boot:spring-boot-starter-integration:1.2.3.release org.springframework.boot:spring-boot-starter:1.2.3.release org.springframework.integration:spring-integration-kafka:1.1.1.release org.springframework.integration:spring-integration-java-dsl:1.1.0.m1 using the spring integration apache kafka with the spring integration xml dsl first, let’s look at how to use the spring integration outbound adapter to send message instances from a spring integration flow to an external apache kafka instance. the example is fairly straightforward: a spring integration channel named inputtokafka acts as a conduit that forwards message messages to the outbound adapter, kafkaoutboundchanneladapter . the adapter itself can take its configuration from the defaults specified in the kafka:producer-context element or it from the adapter-local configuration overrides. there may be one or many configurations in a given kafka:producer-context element. here’s the java code from a spring boot application to trigger message sends using the outbound adapter by sending messages into the incoming inputtokafka messagechannel . package xml; import org.apache.commons.logging.log; import org.apache.commons.logging.logfactory; import org.springframework.beans.factory.annotation.qualifier; import org.springframework.boot.commandlinerunner; import org.springframework.boot.springapplication; import org.springframework.boot.autoconfigure.springbootapplication; import org.springframework.context.annotation.bean; import org.springframework.context.annotation.dependson; import org.springframework.context.annotation.importresource; import org.springframework.integration.config.enableintegration; import org.springframework.messaging.messagechannel; import org.springframework.messaging.support.genericmessage; @springbootapplication @enableintegration @importresource("/xml/outbound-kafka-integration.xml") public class demoapplication { private log log = logfactory.getlog(getclass()); @bean @dependson("kafkaoutboundchanneladapter") commandlinerunner kickoff(@qualifier("inputtokafka") messagechannel in) { return args -> { for (int i = 0; i < 1000; i++) { in.send(new genericmessage<>("#" + i)); log.info("sending message #" + i); } }; } public static void main(string args[]) { springapplication.run(demoapplication.class, args); } } using the new apache kafka spring integration java configuration dsl shortly after the spring integration 1.1 release, spring integration rockstar artem bilan got to work on adding a spring integration java configuration dsl analog and the result is a thing of beauty! it’s not yet ga (you need to add the libs-milestone repository for now), but i encourage you to try it out and kick the tires. it’s working well for me and the spring integration team are always keen on getting early feedback whenever possible! here’s an example that demonstrates both sending messages and consuming them from two different integrationflow s. the producer is similar to the example xml above. new in this example is the polling consumer. it is batch-centric, and will pull down all the messages it sees at a fixed interval. in our code, the message received will be a map that contains as its keys the topic and as its value another map with the partition id and the batch (in this case, of 10 records), of records read. there is a messagelistenercontainer -based alternative that processes messages as they come. package jc; import org.apache.commons.logging.log; import org.apache.commons.logging.logfactory; import org.springframework.beans.factory.annotation.autowired; import org.springframework.beans.factory.annotation.qualifier; import org.springframework.beans.factory.annotation.value; import org.springframework.boot.commandlinerunner; import org.springframework.boot.springapplication; import org.springframework.boot.autoconfigure.springbootapplication; import org.springframework.context.annotation.bean; import org.springframework.context.annotation.configuration; import org.springframework.context.annotation.dependson; import org.springframework.integration.integrationmessageheaderaccessor; import org.springframework.integration.config.enableintegration; import org.springframework.integration.dsl.integrationflow; import org.springframework.integration.dsl.integrationflows; import org.springframework.integration.dsl.sourcepollingchanneladapterspec; import org.springframework.integration.dsl.kafka.kafka; import org.springframework.integration.dsl.kafka.kafkahighlevelconsumermessagesourcespec; import org.springframework.integration.dsl.kafka.kafkaproducermessagehandlerspec; import org.springframework.integration.dsl.support.consumer; import org.springframework.integration.kafka.support.zookeeperconnect; import org.springframework.messaging.messagechannel; import org.springframework.messaging.support.genericmessage; import org.springframework.stereotype.component; import java.util.list; import java.util.map; /** * demonstrates using the spring integration apache kafka java configuration dsl. * thanks to spring integration ninja artem bilan * for getting the java configuration dsl working so quickly! * * @author josh long */ @enableintegration @springbootapplication public class demoapplication { public static final string test_topic_id = "event-stream"; @component public static class kafkaconfig { @value("${kafka.topic:" + test_topic_id + "}") private string topic; @value("${kafka.address:localhost:9092}") private string brokeraddress; @value("${zookeeper.address:localhost:2181}") private string zookeeperaddress; kafkaconfig() { } public kafkaconfig(string t, string b, string zk) { this.topic = t; this.brokeraddress = b; this.zookeeperaddress = zk; } public string gettopic() { return topic; } public string getbrokeraddress() { return brokeraddress; } public string getzookeeperaddress() { return zookeeperaddress; } } @configuration public static class producerconfiguration { @autowired private kafkaconfig kafkaconfig; private static final string outbound_id = "outbound"; private log log = logfactory.getlog(getclass()); @bean @dependson(outbound_id) commandlinerunner kickoff( @qualifier(outbound_id + ".input") messagechannel in) { return args -> { for (int i = 0; i < 1000; i++) { in.send(new genericmessage<>("#" + i)); log.info("sending message #" + i); } }; } @bean(name = outbound_id) integrationflow producer() { log.info("starting producer flow.."); return flowdefinition -> { consumer spec = (kafkaproducermessagehandlerspec.producermetadataspec metadata)-> metadata.async(true) .batchnummessages(10) .valueclasstype(string.class) .valueencoder(string::getbytes); kafkaproducermessagehandlerspec messagehandlerspec = kafka.outboundchanneladapter( props -> props.put("queue.buffering.max.ms", "15000")) .messagekey(m -> m.getheaders().get(integrationmessageheaderaccessor.sequence_number)) .addproducer(this.kafkaconfig.gettopic(), this.kafkaconfig.getbrokeraddress(), spec); flowdefinition .handle(messagehandlerspec); }; } } @configuration public static class consumerconfiguration { @autowired private kafkaconfig kafkaconfig; private log log = logfactory.getlog(getclass()); @bean integrationflow consumer() { log.info("starting consumer.."); kafkahighlevelconsumermessagesourcespec messagesourcespec = kafka.inboundchanneladapter( new zookeeperconnect(this.kafkaconfig.getzookeeperaddress())) .consumerproperties(props -> props.put("auto.offset.reset", "smallest") .put("auto.commit.interval.ms", "100")) .addconsumer("mygroup", metadata -> metadata.consumertimeout(100) .topicstreammap(m -> m.put(this.kafkaconfig.gettopic(), 1)) .maxmessages(10) .valuedecoder(string::new)); consumer endpointconfigurer = e -> e.poller(p -> p.fixeddelay(100)); return integrationflows .from(messagesourcespec, endpointconfigurer) .>>handle((payload, headers) -> { payload.entryset().foreach(e -> log.info(e.getkey() + '=' + e.getvalue())); return null; }) .get(); } } public static void main(string[] args) { springapplication.run(demoapplication.class, args); } } the example makes heavy use of java 8 lambdas. the producer spends a bit of time establishing how many messages will be sent in a single send operation, how keys and values are encoded (kafka only knows about byte[] arrays, after all) and whether messages should be sent synchronously or asynchronously. in the next line, we configure the outbound adapter itself and then define an integrationflow such that all messages get sent out via the kafka outbound adapter. the consumer spends a bit of time establishing which zookeeper instance to connect to, how many messages to receive (10) in a batch, etc. once the message batches are recieved, they’re handed to the handle method where i’ve passed in a lambda that’ll enumerate the payload’s body and print it out. nothing fancy. using apache kafka with spring xd apache kafka is a message bus and it can be very powerful when used as an integration bus. however, it really comes into its own because it’s fast enough and scalable enough that it can be used to route big-data through processing pipelines. and if you’re doing data processing, you really want spring xd ! spring xd makes it dead simple to use apache kafka (as the support is built on the apache kafka spring integration adapter!) in complex stream-processing pipelines. apache kafka is exposed as a spring xd source - where data comes from - and a sink - where data goes to. spring xd exposes a super convenient dsl for creating bash -like pipes-and-filter flows. spring xd is a centralized runtime that manages, scales, and monitors data processing jobs. it builds on top of spring integration, spring batch, spring data and spring for hadoop to be a one-stop data-processing shop. spring xd jobs read data from sources , run them through processing components that may count, filter, enrich or transform the data, and then write them to sinks. spring integration and spring xd ninja marius bogoevici , who did a lot of the recent work in the spring integration and spring xd implementation of apache kafka, put together a really nice example demonstrating how to get a full working spring xd and kafka flow working . the readme walks you through getting apache kafka, spring xd and the requisite topics all setup. the essence, however, is when you use the spring xd shell and the shell dsl to compose a stream. spring xd components are named components that are pre-configured but have lots of parameters that you can override with --.. arguments via the xd shell and dsl. (that dsl, by the way, is written by the amazing andy clement of spring expression language fame!) here’s an example that configures a stream to read data from an apache kafka source and then write the message a component called log , which is a sink. log , in this case, could be syslogd, splunk, hdfs, etc. xd> stream create kafka-source-test --definition "kafka --zkconnect=localhost:2181 --topic=event-stream | log"--deploy and that’s it! naturally, this is just a tase of spring xd, but hopefully you’ll agree the possibilities are tantalizing. deploying a kafka server with lattice and docker it’s easy to get an example kafka installation all setup using lattice , a distributed runtime that supports, among other container formats, the very popular docker image format. there’s a docker image provided by spotify that sets up a collocated zookeeper and kafka image . you can easily deploy this to a lattice cluster, as follows: ltc create --run-as-root m-kafka spotify/kafka from there, you can easily scale the apache kafka instances and even more easily still consume apache kafka from your cloud-based services. next steps you can find the code for this blog on my github account . we’ve only scratched the surface! if you want to learn more (and why wouldn’t you?), then be sure to check out marius bogoevici and dr. mark pollack’s upcoming webinar on reactive data-pipelines using spring xd and apache kafka where they’ll demonstrate how easy it can be to use rxjava, spring xd and apache kafka!

April 18, 2015

by Pieter Humphrey

· 29,141 Views

Meet Molly, the virtual nurse

Telemedicine is a topic that I’ve touched on numerous times on this blog over the past year or so, with a number of platforms emerging to offer patients the opportunity to consult with a medical professional from anywhere in the world. Most of these platforms are clinical based, such as the Babylon service, and therefore provide you with access to doctors when you have something wrong with you. There are however, also a number that are taking a more preventative approach and seek to keep you healthier in the first place. For instance, last autumn I wrote about the Vida platform that is providing coaching and support to ensure you stay out of the emergency room altogether. Of course, all of these services require a human healthcare professional at the other end of your video call to answer your queries for you. Sense.ly are taking another approach by offering an AI based nurse, called Molly, who aims to provide help and support in those periods between appointments with real life professionals. The service is aimed specifically at patients with common medical conditions such as diabetes or heart failure. The patient signs up to the site either direct or via their GP, and the platform then draws up a personalized care plan for them based upon both their medical records and the individual needs of the patient. The patient then follows this prescribed plan (hopefully), with regular check-ins with the virtual nurse via their smartphone or computer to help their progress. Doctors can also access information via the site to see how their patient is getting on, whilst the system will alert them automatically if worrying symptoms begin to emerge. I’ve written previously about the important role the appearance of avatars has on how we engage with them, so Molly has been designed with a friendly face and a softly spoken voice. She interacts with patients using voice recognition technology and can ask relatively simple questions of the patient whilst guiding them through exercise plans and collecting medical data from them. This data can then be analyzed by a doctor (although AI analysis seems inevitable), whilst the patient can also use Molly as a sort of health PA and book appointments with their doctor through her. With voice recognition technology growing at an incredible pace and the AI grunt of services such as Watson increasingly capable of making complex decisions, this seems the beginning of an inevitable trend towards more automated healthcare. Check out the video below that explains more about Molly and the service she offers. Original post

April 15, 2015

by Adi Gaskell

· 2,117 Views

Monitoring rsyslog’s Performance with impstats and Elasticsearch

If you’re using rsyslog for processing lots of logs (and, as we’ve shown before, rsyslog is good at processing lots of logs), you’re probably interested in monitoring it. To do that, you can use impstats, which comes from input module forprocess stats. impstats produces information like: – input stats, like how many events went through each input – queue stats, like the maximum size of a queue – action (output or message modification) stats, like how many events were forwarded by each action – general stats, like CPU time or memory usage In this post, we’ll show you how to send those stats to Elasticsearch (or Logsene — essentially hosted ELK, our log analytics service) that exposes the Elasticsearch API), where you can explore them with a nice UI, like Kibana. For example get the number of logs going through each input/output per hour: More precisely, we’ll look at: – the useful options around impstats – how to use those stats and what they’re about – how to ship stats to Elasticsearch/Logsene by using rsyslog’s Elasticsearch output – how to do this shipping in a fast and reliable way. This will apply to most rsyslog use-cases, not only impstats Configuring impstats Before starting, make sure you have a recent version of rsyslog. You can find the latest version (8.9.0 at the time of this writing), as well as packages for various distributions here. Many distributions still ship ancient versions like 5.x, which probably have impstats, but some of the features (like Elasticsearch output) may not be available. Once you’re there, load the impstats module at the beginning of your config: module( load="impstats" interval="10" # how often to generate stats resetCounters="on" # to get deltas (e.g. # of messages submitted in the last 10 seconds) log.file="/tmp/stats" # file to write those stats to log.syslog="off" # don't send stats through the normal processing pipeline. More on that in a bit At this point, if you restart rsyslog, you should see stats about all the inputs, queues, actions, as well as overall resource usage. For example, the stats below come from an rsyslog that takes messages over TCP and sends them over to Elasticsearch in Logstash-like format: Thu Apr 9 16:45:36 2015: omelasticsearch: origin=omelasticsearch submitted=11000 failed.http=0 failed.httprequests=0 failed.es=0 Thu Apr 9 16:45:36 2015: send-to-es: origin=core.action processed=10405 failed=0 suspended=0 suspended.duration=0 resumed=0 Thu Apr 9 16:45:36 2015: imtcp(13514): origin=imtcp submitted=6618 Thu Apr 9 16:45:36 2015: resource-usage: origin=impstats utime=2109000 stime=2415000 maxrss=53236 minflt=12559 majflt=1 inblock=8 oublock=0 nvcsw=164893 nivcsw=384355 Thu Apr 9 16:45:36 2015: main Q: origin=core.queue size=65095 enqueued=7149 full=0 discarded.full=0 discarded.nf=0 maxqsize=70000 Wait. What are these stats? Here’s my take on each line: 1. omelasticsearch (output module to Elasticsearch) sent 11K logs to ES in the last 10 seconds. There were no connectivity errors, nor any errors thrown by Elasticsearch (like you would get if you tried to index a string in an integer field) 2. the “send-to-es” action (which uses omelasticsearch) took a bit less than 11K logs from the main queue to send them to omelasticsearch. I assume the rest of them were sent before this 10 second window. Not terribly useful, but if there was a connectivity issue with Elasticsearch, you’d see how long this action was suspended 3. the TCP input received 6.6K logs in the last 10 seconds 4. rsyslog used ~2 seconds of user CPU time (utime=2109000 microseconds) and ~2.5s of system time. It used 53MB of RAM at most. You can see what all these abreviations mean by looking at getrusage’s man page 5. the default (main) queue currently buffers 65K messages from the inputs (though it went as high as 70K in the last 10 seconds), 7K of which were taken in the last 10 seconds Shipping Stats to Elasticsearch/Logsene Now that we have these stats, let’s centralize them to Elasticsearch. If you’re using rsyslog to push to Elasticsearch, it’s better to use another cluster or Logsene for stats. Otherwise, when Elasticsearch is in trouble, you won’t be able to see stats which might explain why you’re having trouble in the first place. Either way, we need four things: – produce those stats in JSON, so we can parse them easily – define a template for how JSON documents that we send to Elasticsearch will look like – parse the JSON stats – send those documents to Logsene/Elasticsearch using the defined template Here’s the relevant config snippet for sending to Logsene: module( load="impstats" interval="10" resetCounters="on" format="cee" # produce JSON stats ) module(load="mmjsonparse") module(load="omelasticsearch") #template for building the JSON documents that will land in Logsene/Elasticsearch template(name="stats" type="list") { constant(value="{") property(name="timereported" dateFormat="rfc3339" format="jsonf" outname="@timestamp") # the timestamp constant(value=",") property(name="hostname" format="jsonf" outname="host") # the host generating stats constant(value=",\"source\":\"impstats\",") # we'll hardcode "impstats" as a source property(name="$!all-json" position.from="2") # finally, we'll add all metrics } action( name="parse_impstats" # parse the type="mmjsonparse" # JSON stats ) action( name="impstats_to_es" # name actions so you can see them in impstats messages (instead of the default action 1, 2, etc) type="omelasticsearch" server="logsene-receiver.sematext.com" # host and port for Logsene/Elasticsearch serverport="80" # set serverport="443" and add usehttps="on" for using HTTPS instead of plain HTTP template="stats" # use the template defined earlier searchIndex="LOGSENE_APP_TOKEN_GOES_HERE" searchType="impstats" # we'll use a separate mapping type for stats bulkmode="on" # use Elasticsearch's bulk API action.resumeretrycount="-1" # retry indefinitely on failure ) That’s all you basically need to be sending stats to Logsene/Elasticsearch: load the impstats, mmjsonparse and omelasticsearch modules, define the template, parse stats and send them over. Note that while impstats comes bundled with most rsyslog packages, you need to install rsyslog-mmjsonparse and rsyslog-elasticsearch packages to install the other two plugins. Using a separate ruleset and configuring its queue Before wrapping up, let’s address two potential issues. First is that, by default, impstats will send stats events to the main queue (all input modules do that by default). This will mix stats with other logs, which has a couple of disadvantages: – you need to add a conditional to make sure only impstats events go to the impstats-specific destination – if rsyslog is queueing lots of messages in the main queue, stats can land in Elasticsearch with a delay To avoid these problems, you can bind impstats to a separate ruleset. Let’s call it “monitoring”. rsyslog will then process them separately from the main queue, which is associated to the default ruleset. You can have as many rulesets as you want, and they’re typically used to separate different types of logs. For example to process local logs and remote logs independently. Like the default ruleset which has the main queue, any ruleset can have its own queue (also, each action, no matter the ruleset it’s in, can have its own queue – more info on that here, here and here). Why am I talking about queues? Because if stats are important, you want to make sure you are able to queue them, instead of losing them if Elasticsearch becomes unavailable for a while. By default, the default ruleset comes with an in-memory queue of 10K or so messages. Additional rulesets have no queue by default, but you can add one by specifying queue options (you can find the complete list here). While in-memory queues are fast, they are typically small and you’d lose their contents if you have to shut down or restart rsyslog. In the following config snippet will add a disk-assisted queue to the “monitoring” ruleset. A disk assisted queue will normally be as fast as an in-memory queue, and will spill logs to disk in a performance-friendly way if it’s out of space. You can also make rsyslog save all logs to disk when you shut it down or restart it. module( load="impstats" interval="10" resetCounters="on" format="cee" ruleset="monitoring" # send stats to the monitoring ruleset ) # add here modules and template from the previous snippet ruleset( name="monitoring" # the monitoring ruleset defined earlier queue.type="FixedArray" # we'll have a fixed memory queue for this ruleset queue.highwatermark="50000" # at least until it contains 50K stats messages queue.spoolDirectory="/var/run/rsyslog/queues" # at which point, start writing in-memory messages to disk queue.filename="stats_ruleset" queue.lowwatermark="20000" # until the memory queue goes back to 20K, at which point the memory queue is used again queue.maxdiskspace="100m" # the queue will be full when it occupies 100MB of space on disk queue.size="5000000" # this is the total queue size (shouldn't be reachable) queue.dequeuebatchsize="1000" # how many messages from the queue to process at once (also determines how many messages will be in an ES Bulk) queue.saveonshutdown="on" # save queue contents to disk at shutdown ){ # add here actions from the previous snippet } Summary This was quite a long post, so let me summarize the features of rsyslog we touched on: – impstats is an input module that can generate stats about rsyslog’s inputs, queues and actions, as well as general process statistics – you normally want to write them to a file in a human-readable format for development/debugging or local performance tests – for production, it’s best to write them in JSON, parse them in a separate ruleset and send them to Logsene/Elasticsearch, where you can search and graph them – you can use disk-assisted queues to increase the capacity of an in-memory queue without losing performance under normal conditions. It can also save logs to disk on shutdown to make sure important stats are not lost If you find working with logs and/or Elasticsearch exciting, that’s what we do in lots of our products, consulting andsupport engagements. So if you want to join us, we’re hiring worldwide.

April 14, 2015

by Radu Gheorghe

· 8,155 Views

Using Multiple Grok Statements to Parse a Java Stack Trace

Parse your Java stack trace log information with the Logstash tool.

April 14, 2015

by Bipin Patwardhan

· 78,007 Views · 6 Likes

Adopting Microservices at Netflix: Lessons for Team and Process Design

[this article was written by tony mauro .] in a previous blog post , we shared best practices for designing a microservices architecture, based on adrian cockcroft’s presentation at nginx.conf2014 about his experience as director of web engineering and then cloud architect at netflix . in this follow-up post, we’ll review his recommendations for retooling your development team and processes for a smooth transition to microservices. optimize for speed, not efficiency source: [email protected] the top lesson that cockcroft learned at netflix is that speed wins in the marketplace. if you ask any developer whether a slower development process is better, no one ever says yes. nor do management or customers ever complain that your development cycle is too fast for them. the need for speed doesn’t just apply to tech companies, either: as software becomes increasingly ubiquitous on the internet of things – in cars, appliances, and sensors as well as mobile devices – companies that didn’t used to do software development at all now find that their success depends on being good at it. netflix made an early decision to optimize for speed. this refers specifically to tooling your software development process so that you can react quickly to what your customers want, or even better, can create innovative web experiences that attract customers. speed means learning about your customers and giving them what they want at a faster pace than your competitors. by the time competitors are ready to challenge you in a specific way, you’ve moved on to the next set of improvements. this approach turns the usual paradigm of optimizing for efficiency on its head. efficiency generally means trying to control the overall flow of the development process to eliminate duplication of effort and avoid mistakes, with an eye to keeping costs down. the common result is that you end up focusing on savings instead of looking for opportunities that increase revenue. in cockcroft’s experience, if you say “i’m doing this because it’s more efficient,” the unintended result is that you’re slowing someone else down. this is not an encouragement to be wasteful, but you should optimize for speed first. efficiency becomes secondary as you satisfy the constraint that you’re not slowing things down. the way you grow the business to be more efficient is to go faster. make sure your assumptions are still true many large companies that have enjoyed success in their market (we can call them incumbents ) are finding themselves overtaken by nimbler, usually smaller, organizations ( disruptors ) that react much more quickly to changing consumer behavior. their large size isn’t necessarily the root of the problem – netflix is no longer a small company, for example. as cockcroft sees it, the main cause of difficulty for industry incumbents is that they’re operating under business assumptions that are no longer true. or, as will rogers put it, it’s not what we don’t know that hurts. it’s what we know that ain’t so.” of course, you have to make assumptions as you formulate a business model, and then it makes sense to optimize your business practices around them. the danger comes from sticking with assumptions after they’re no longer true, which means you’re optimizing on the wrong thing. that’s when you become vulnerable to industry disruptors who are making the right assumptions and optimizations for the current business climate. as examples, consider the following assumptions that hold sway at many incumbents. we’ll examine them further in the indicated sections and describe the approach netflix adopted. computing power is expensive. this was true when increasing your computing capacity required capital expenditure on computer hardware. see put your infrastructure in the cloud . process prevents problems. at many companies, the standard response to something going wrong is to add a preventative step to the relevant procedure. see create a high freedom, high responsibility culture with less process . here are some ways to avoid holding onto assumptions that have passed their expiration date: as obvious as it might seem, you need to make your assumptions explicit, then periodically review them to make sure they still hold true. keep aware of technological trends. as an example, the cost of solid state storage drive (ssds) storage continues to go down. it’s still more expensive than regular disks, but the cost difference is becoming small enough that many companies are deciding the superior performance is worth paying a bit more for. [ed: in this entertaining video , fastly founder and ceo artur bergman explains why he believes ssds are always the right choice.] talk to people who aren’t your customers. this is especially necessary for incumbents, who need to make sure that potential new customers are interested in their product. otherwise, they don’t hear about the fact that they’re not being used. as an example, some vendors in the storage space are building hyper-converged systems even as more and more companies are storing their data in the cloud and using open source storage management software. netflix, for example, stores data on amazon web services (aws) servers with ssds and manages it with apache cassandra . a single specialist in java distributed systems is managing the entire configuration without any commercial storage tools or help from engineers specializing in storage, san, or backup. don’t base your future strategy on current it spending, but instead on level of adoption by developers. suppose that your company accounts for nearly all spending in the market for proprietary virtualization software, but then a competitor starts offering an open source-based product at only 1% the cost of yours. if people start choosing it instead of your product, than at the point that your share of total spending is still 90%, your market share has declined to only 10%. if you’re only attending to your revenue, it seems like you’re still in good shape, but 10% of market share can collapse really quickly. put your infrastructure in the cloud source: [email protected] in make sure your assumptions are still true , we mentioned that in the past it was valid to base your business plan on the assumption that computing power was expensive, because it was: the only way to increase your computing capacity was to buy computer hardware. you could then make money by using this expensive resource in the right way to solve customer problems. the advent of cloud computing has pretty much completely invalidated this assumption. it is now possible to buy the amount of capacity you need when you need it, and to pay for only the time you actually use it. the new assumption you need to make is that (virtual) machines are ephemeral. you can create and destroy them at the touch of a button or a call to an api, without any need to negotiate with other departments in your company. one way to think of this change is that the self-service cloud makes formerly impossible things instantaneous. all of netflix’s engineers are in california, but they manage a worldwide infrastructure. the cloud enables them to experiment and determine whether (for example) adding servers in particular location improves performance. suppose they notice problems with video delivery in brazil. they can easily set up 100 cloud server instances in são paulo within a couple hours. if after a week they determine that the difference in delivery speed and reliability isn’t large enought to justify the cost of the additional server instances, they can shut them down just as quickly and easily as they created them. this kind of experiment would be so expensive with a traditional infrastructure that you would never attempt it. you would have to hire an agent in são paulo to coordinate the project, find a data center, satisfy brazilian government regulations, ship machines to brazil, and so on. it would be six months before you could even run the test and find out that increased local capacity didn’t improve your delivery speed. create a high freedom, high responsibility culture with less process in make sure your assumptions are still true , we observed that many companies create rules and processes to prevent problems. when someone makes a mistake, they add a rule to the hr manual that says “well, don’t do that again.” if you read some hr manuals from this perspective, you can extract a historical record of everything that went wrong at the company. when something goes wrong in the development process, the corresponding reaction is to add a new step to the procedure. the major problem with creating process to prevent problems is that over time you build up complex “scar tissue” processes that slow you down. netflix doesn’t have an hr manual. there is a single guideline: “act in netflix’s best interest.” the idea is that if an employee can’t figure out how to interpret the guideline in a given situation, he or she doesn’t have enough judgment to work there. if you don’t trust the judgment of the people on your team, you have to ask why you’re employing them. it’s true that you’ll have to fire people occasionally for violating the guideline. overall, the high level of mutual trust among members of a team, and across the company as a whole, becomes a strong binding force. the following books outline new ways of thinking about process if you’re looking to transform your organization: the goal: a process of ongoing improvement by eliyahu m. goldratt and jeff cox. this book has become a standard management text at business schools since its original publication in 1984. written as a novel about a manager who has only 90 days to improve performance at his factory or have it closed down, it embodies goldratt’s theory of constraints in the context of process control and automation. the phoenix project: a novel about it, devops, and helping your business win by gene kim and kevin behr. as the title indicates, it’s also a novel, about an it manager who has 90 days to save a project that’s late and over budget, or his entire department will be outsourced. he discovers devops as the solution to his problem. replace silos with microservice teams most software development groups are separated into silos, with no overlap of personnel between them. the standard process for a software development project starts with the product manager meeting with the user experience and development groups to discuss ideas for new features. after the idea is implemented in code, the code is passed to the quality assurance (qa) and database administration teams and discussed in more meetings. communication with the system, network, and san administrators is often via tickets. the whole process tends to be slow and loaded with overhead. source: adrian cockcroft some companies try to speed up by creating small “start-up”-style teams that handle the development process from end to end, or sometimes such teams are the result of acquisitions where the acquired company continues to run independently as a separate division. but if the small teams are still doing monolithic delivery, there are usually still handoffs between individuals or groups with responsibility for different functions. the process suffers from the same problems as monolithic delivery in larger companies – it’s simply not very efficient or agile. source: adrian cockcroft conway’s law says that the interface structure of a software system will reflect the social structure of the organization that produced it. so if you want to switch to a microservices architecture, you need to organize your staff into product teams and use devops methodology. there are no longer distinct product managers, ux managers, development managers, and so on, managing downward in their silos. there is a manager for each product feature (implemented as a microservice), who supervises a team that handles all aspects of software development for the microservice, from conception through deployment. the platform team provides infrastructure support that the product teams access via apis. at netflix, the platform team was mostly aws in seattle, with some netflix-managed infrastructure layers built on top. but it doesn’t matter whether your cloud platform is in-house or public; the important thing is that it’s api-driven, self-service, and automatable. source: adrian cockcroft adopt continuous delivery, guided by the ooda loop a siloed team organization is usually paired with monolithic delivery model, in which an integrated, multi-function application is released as a unit (often version-numbered) on a regular schedule. most software development teams use this model initially because it is relatively simple and works well enough with a small number of developers (say, 50 or fewer). however, as the team grows it becomes a real issue when you discover a bug in one developer’s code during qa or production testing and the work of 99 other developers is blocked from release until the bug is fixed. in 2009 netflix adopted a continuous delivery model, which meshes perfectly with a microservices architecture. each microservice represents a single product feature that can be updated independently of the other microservices and on its own schedule. discovering a bug in a microservice has no effect on the release schedule of any other microservice. continuous delivery relies on packaging microservices in standard containers. netflix initially used aws machine images (amis) and it was possible to deploy an update into a test or production environment in about 10 minutes. with docker, that time is reduced even further, to mere seconds in some cases. at netflix, the conceptual framework for continuous development and delivery is an observe-orient-decide-act (ooda) loop . source: adrian cockcroft (http://www.slideshare.net/adrianco) observe refers to examining your current status to look for places where you can innovate. you want your company culture to implicitly authorize anyone who notices an opportunity to start a project to exploit it. for example, you might notice what the diagram calls a “customer pain point”: a lot of people abandoning the registration process on your website when they reach a certain step. you can undertake a project to investigate why and fix the problem. orient refers to analyzing metrics to understand the reasons for the phenomena you’ve observed at the observe point. often this involves analyzing large amounts of unstructured data, such as log files; this is often referred to as big data analysis. the answers you’re looking for are not already in your business intelligence database. you’re examining data that no one has previously looked at and asking questions that haven’t been asked before. decide refers to developing and executing a project plan. company culture is a big factor at this point. as previously discussed, in a high-freedom, high-responsibility culture you don’t need to get management approval before starting to make changes. you share your plan, but you don’t have to ask for permission. act refers to testing your solution and putting it into production. you deploy a microservice that includes your incremental feature to a cloud environment, where it’s automatically put into an ab test to compare it to the previous solution, side by side, for as long as it takes to collect the data that shows whether your approach is better. cooperating microservices aren’t disrupted, and customers don’t see your changes unless they’re selected for the test. if your solution is better, you deploy it into production. it doesn’t have to be a big improvement, either. if the number of clients for your microservice is large enough, then even a fraction of a percent improvement (in response time, say) can be shown to be statistically valid, and the cumulative effect over time of many small changes can be significant. now you’re back at the observe point. you don’t always have to perform all the steps or do them in strict order, either. the important characteristic of the process is that it enables you quickly to determine what your customers want and to create it for them. cockcroft says “it’s hard not to win” if you’re basing your moves on enough data points and your competitors are making guesses that take months to be proven or disproven. the state of art is to circle the loop every one to two weeks, but every microservice team can do it independently. with microservices you can go much faster because you’re not trying to get entire company going around the loop in lockstep. how nginx plus can help at nginx we believe it’s crucial to your future success that you adopt a 4-tier application architecture in which applications are developed and deployed as sets of microservices . we hope the information we’ve shared in this post and its predecessor, adopting microservices at netflix: lessons for architectural design , are helpful as you plan your transition to today’s state-of-the-art architecture for application development. when it’s time to deliver your apps, nginx plus offers an application delivery platform that provides the superior performance, reliability, and scalability your users expect. fully adopting a microservices-based architecture is easier and more likely to succeed when you move to a single software tool for web serving, load balancing, and content caching. nginx plus combines those functions and more in one easy to deploy and manage package. our approach empowers developers to define and control the flawless delivery of their microservices, while respecting the standards and best practices put into place by a platform team. click here to learn more about how nginx plus can help your applications succeed. video recordings fast delivery nginx.conf2014, october 2014 migrating to microservices, part 1 silicon valley microservices meetup, august 2014 migrating to microservices, part 2 silicon valley microservices meetup, august 2014

April 13, 2015

by Patrick Nommensen

· 9,845 Views

Patterns of API Virtualization

[This article was written by Matthew Heusser.] When Christopher Alexander wrote A Pattern Language in 1977, he was looking for a more powerful way to describe how towns and buildings were laid out. These patterns would allow architects, builders and planners to work together, to use the same words, mean the same thing, and create systems that were beautiful and worked, instead of more urban sprawl. Twenty years later, Gamma, Helms, Johnson and Vlissdes took the pattern idea and applied it to object-oriented software, which at the time was struggling to figure out how to create windows-based applications. Today the struggle is figuring out how to break software into small components that can be tested independently, and then having those components interact, typically over internet protocols. Raw SQL commands are giving way to service oriented systems that interact through APIs, sometimes all within one company, sometimes outside with Microsoft, Google, Amazon, or other APIs like a manufacturing company or supplier. While I do not claim to be Christopher Alexander or the Gang of Four, I am seeing some patterns emerge – a set of solutions to a defined problem – and would like to share a few of those today. What do you mean API? Alistair Cockburn’s Hexagonal Architecture (below) presents a way to think about APIs. The application we want to develop is in the middle and has a set of adapters to the external world. Those adapters might be an API we expose, like a ‘search’ interface to an online catalog, or the API’s we call, including the database, an email gateway, or the ‘permissions’ service, to see what types of search results we should show to this user. Cockburn’s Hexagonal Architecture gives us two ways to think about APIs: Our own, and the services we call. (Source: http://alistair.cockburn.us/Hexagonal+architecture) That’s a lot of APIs. Let’s explore about some ways to virtualize these services – and why. Automated Build and Continuous Integration Say, for example, you are working on a piece of software to analyze trending terms on social media – such as a customer complaint that is being liked and tweeted. You want companies to find these problems when they start to trend up, then reach out to the customer and solve it, or, perhaps, reach out to say “thank you” and amplify it. Modern build systems, like Jenkins, TFS, and TeamCity can compile, deploy, and even run the system to check for known scenarios. The trouble is those pesky adapters to external systems, like Twitter and Facebook. The software could do its job, but there is no way to know if the application is correct in its guesses about trends and importance. Getting the data from the providers can turn a quick build into a slow process that uses a lot of network traffic. By recording and storing known answers to predictable requests, then simulating the service and playing back known (“canned”) data, API Virtualization allows build systems to do more, with faster, more predictable results. This does not remove the need for end-to-end testing, but it does allow the team to have more confidence with each build. Performance Testing Your Application Like build/deploy systems, performance testing the application (the inside of the hexagon) with live, external services can cause problems. All that extra traffic can cause problems with the actual company network infrastructure; it could cause bandwidth problems at the point of the ISP. Some 3rd Party APIs charge a micro-fee per transaction, or limit bandwidth. Many of them lack a ‘test’ sandbox to develop in, so performance testing could interact with real, production work. Standing up a virtual server to return pre-planned data means you can performance test your application – not the third party – prevent bandwidth throttles, not step on production data, and avoid paying fees intended for real (production) use that is actually being used to test our environment. Avoid Integration Environment Inconsistency A few years ago I worked at a large organization that was wrapping old code in proxy services, so they could be consumed by other teams. Login, add-to-cart, search catalog, create custom catalog, permissions, all of it was possible to access through API calls, most of it as simple as a web URL that returned some text. The problem was the “System Integration Test” environment, or SIT. Every team tested its services in SIT, which meant about a third of the time, something was broken. After finding a bug in the current build, we would track it back to the catalog service, walk over to that team, bring up the issue, and they would say “thanks, we are testing a new build of catalog.” We expected catalog to work in SIT. Anything else meant a waste of someone’s time. Automated tools reporting false errors were even worse. When teams performance tested their services, everything calling the service got slow, if it worked at all. By virtualizing services we could test our application end-to-end against known data, without the troubles of SIT, or having to build additional expensive test-lab-like copies of production. Best of all, creating the virtual services is a snap – just record the live service with a tool and instruct it to play back similar requests. Flip Integration Tests from Virtual To Real for Final Checking All this API virtualization creates a risk that the team will move from test to production and something will be different between the Virtual API and the live one. If the Virtual API server is just returning the same thing product did when we recorded it and we have automated checks in place, we can change our test server to point to the real service and re-run all the automated checks. As long as the source data hasn’t changed and we are reading, not writing, from production, the checks should all pass. If the production API has changed, we will get failures, and they will be easy enough to fix and retest. Simulate Slow or Unresponsive Service In The Middle Of A Long Running Transaction Sometimes you want to test if a server is overloaded or down. Calling Facebook and asking them to turn off their servers is unlikely to work; even just coordinating with the team down the hall could create a lot of overhead. You also might want to test this often – every day or every hour – and manually pulling a plug or coordinating with the Login team every hour might not be realistic. The trick is to bring the service down once and record the exact behavior of the system, then use a virtual server to simulate that behavior, over and over again, every day. That means you’ll get the exact behavior, not a guess, and know exactly how the application under test can deal with it. Early Development of System against an Undeployed API Sometimes the API you are testing against does not exist, even in test. It’s still possible to create a Virt (virtual API) which returns some roughly equivalent data, and makes it possible to move forward on the core application without introducing new risks. Avoid Configuration and Copying Hassles Many companies use a test system that is a copy of production, and then refresh the system periodically. Sometimes, you want test scenarios that do not exist in production, so you have to create them … and lose them during a refresh. The same problem happens with 3rd party APIs, when, for example, a part is discontinued, and you are testing ordering that part, or the sample person you check for insurance coverage leaves the company. If the request for the part of the coverage goes through an API, you can record known good results that don’t change, even after a database refresh – then leave the real, end-to-end testing for an exploratory step that will be lighter, quicker, more accurate, and have more confidence. A Fistful of Techniques Today we discussed a half-dozen common patterns to API virtualization, mostly around testing systems in isolation that consume data through an API, like a 3rd party or an internal service. These ideas are new, and evolving. What are a few of your favorites?

April 9, 2015

by Denis Goodwin

· 4,159 Views

Adopting Microservices at Netflix: Lessons for Architectural Design

[This article was written by Tony Mauro.] In some recent blog posts, we’ve explained why we believe it’s crucial to adopt a four-tier application architecture in which applications are developed and deployed as sets of microservices. It’s becoming increasingly clear that if you keep using development processes and application architectures that worked just fine ten years ago, you simply can’t move fast enough to capture and hold the interest of mobile users who can choose from an ever-growing number of apps. Switching to a microservices architecture creates exciting opportunities in the marketplace for companies. For system architects and developers, it promises an unprecedented level of control and speed as they deliver innovative new web experiences to customers. But at such a breathless pace, it can feel like there’s not a lot of room for error. In the real world, you can’t stop developing and deploying your apps as you retool the processes for doing so. You know that your future success depends on transitioning to a microservices architecture, but how do you actually do it? Fortunately for us, several early adopters of microservices are now generously sharing their expertise in the spirit of open source, not only in the form of published code but in conference presentations and blog posts. Netflix is a leading example. As the Director of Web Engineering and then Cloud Architect, Adrian Cockcroft oversaw the company’s transition from a traditional development model with 100 engineers producing a monolithic DVD-rental application to a microservices architecture with many small teams responsible for the end-to-end development of hundreds of microservices that work together to stream digital entertainment to millions of Netflix customers every day. Now a Technology Fellow at Battery Ventures, Cockcroft is a prominent evangelist for microservices and cloud-native architectures, and serves on the NGINX Technical Advisory Board. In a two-part series of blog posts, we’ll present top takeaways from two talks that Cockcroft delivered last year, at the first annual NGINX conference in October and at a Silicon Valley Microservices Meetup a couple months earlier. (The complete video recordings are also well worth watching.) This post defines microservices architecture and outlines some best practices for designing one. Adopting Microservices at Netflix: Lessons for Team and Process Design discusses why and how to adopt a new mindset for software development and reorganize your teams around it. What is a Microservices Architecture? Cockcroft defines a microservices architecture as a service-oriented architecture composed of loosely coupled elements that have bounded contexts. Loosely coupled means that you can update the services independently; updating one service doesn’t require changing any other services. If you have a bunch of small, specialized services but still have to update them together, they’re not microservices because they’re not loosely coupled. One kind of coupling that people tend to overlook as they transition to a microservices architecture is database coupling, where all services talk to the same database and updating a service means changing the schema. You need to split the database up and denormalize it. The concept of bounded contexts comes from the book Domain Driven Design by Eric Evans. A microservice with correctly bounded context is self-contained for the purposes of software development. You can understand and update the microservice’s code without knowing anything about the internals of its peers, because the microservices and its peers interact strictly through APIs and so don’t share data structures, database schemata, or other internal representations of objects. If you’ve developed applications for the Internet, you’re already familiar with these concepts, in practice if not by name. Most mobile apps talk to quite a few back-end services, to enable its users to do things like share on Facebook, get directions from Google Maps, and find restaurants on Foursquare, all within the context of the app. If your mobile app were tightly coupled with those services, then before you could release an update you would have to talk to all of their development teams to make sure that your changes aren’t going to break anything. When working with a microservices architecture, you think of other internal development teams like those Internet back ends: as external services that your microservice interacts with through APIs. The commonly understood “contract” between microservices is that their APIs are stable and forward compatible. Just as it’s unacceptable for the Google Maps API to change without warning and in such a way that it breaks its users, your API can evolve but must remain compatible with previous versions. Best Practices for Designing a Microservices Architecture Cockcroft describes his role as Cloud Architect at Netflix not in terms of controlling the architecture, but as discovering and formalizing the architecture that emerged as the Netflix engineers built it. The Netflix development team established several best practices for designing and implementing a microservices architecture. Create a Separate Data Store for Each Microservice Do not use the the same back-end data store across microservices. You want the team for each microservice to choose the database that best suits the service. Moreover, with a single data store it’s too easy for microservices written by different teams to share database structures, perhaps in the name of reducing duplication of work. You end up with the situation where if one team updates a database structure, other services that also use that structure have to be changed too. Breaking apart the data can make data management more complicated, because the separate storage systems can more easily get out sync or become inconsistent, and foreign keys can change unexpectedly. You need to add a tool that performs master data management (MDM) by operating in the background to find and fix inconsistencies. For example, it might examine every database that stores subscriber IDs, to verify that the same IDs exist in all of them (there aren’t missing or extra IDs in any one database). You can write your own tool or buy one. Many commercial relational database management systems (RDBMSs) do these kinds of checks, but they usually impose too many requirements for coupling, and so don’t scale. Keep Code at a Similar Level of Maturity Keep all code in a microservice at a similar level of maturity and stability. In other words, if you need to add or rewrite some of the code in a deployed microservice that’s working well, the best approach is usually to create a new microservice for the new or changed code, leaving the existing microservice in place. [Editor’s note: This is sometimes referred to as the immutable infrastructure principle.] This way you can iteratively deploy and test the new code until it is bug free and maximally efficient, without risking failure or performance degradation in the existing microservice. Once the new microservice is as stable as the original, you can merge them back together if they really perform a single function together, or there are other efficiencies from combining them. However, in Cockcroft’s experience it is much more common to realize you should split up a microservice because it’s gotten too big. Do a Separate Build for Each Microservice Do a separate build for each microservice, so that it can pull in component files from the repository at the revision levels appropriate to it. This sometimes leads to the situation where various microservices pull in a similar set of files, but at different revision levels. That can make it more difficult to clean up your codebase by decommissioning old file versions (because you have to verify more carefully that a revision is no longer being used), but that’s an acceptable trade-off for how easy it is to add new files as you build new microservices. The asymmetry is intentional: you want introducing a new microservice, file, or function easy, not dangerous. Deploy in Containers Deploying microservices in containers is important because it means you just need just one tool to deploy everything. As long as the microservice is in a container, the tool knows how to deploy it. It doesn’t matter what the container is. That said, Docker seems very quickly to have become the de facto standard for containers. Treat Servers as Stateless Treat servers, particularly those that run customer-facing code, as interchangeable members of a group. They all perform the same functions, so you don’t need to be concerned about them individually. Your only concern is that there are enough of them to produce the amount of work you need, and you can use auto scaling to adjust the numbers up and down. If one stops working, it’s automatically replaced by another one. Avoid “snowflake” systems in which you depend on individual servers to perform specialized functions. Cockcroft’s analogy is that you want to think of servers like cattle, not pets. If you have a machine in production that performs a specialized function, and you know it by name, and everyone gets sad when it goes down, it’s a pet. Instead you should think of your servers like a herd of cows. What you care about is how many gallons of milk you get. If one day you notice you’re getting less milk than usual, you find out which cows aren’t producing well and replace them. Netflix Delivery Architecture is Built on nginx Netflix is a longtime nginx user and became the first customer of NGINX, Inc. after it incorporated in 2011. Indeed, Netflix chose nginx as the heart of their delivery infrastructure, the Netflix Open Connect Content Delivery Network (CDN), one of the largest CDNs in the world. With the ability to serve thousands, and sometimes millions, of requests per second, nginx is an optimal solution for high-performance HTTP delivery and enables companies like Netflix to offer high-quality digital experiences to millions of customers every day. Video Recordings Fast Delivery nginx.conf2014, October 2014 Migrating to Microservices, Part 1 Silicon Valley Microservices Meetup, August 2014 Migrating to Microservices, Part 2 Silicon Valley Microservices Meetup, August 2014

April 7, 2015

by Patrick Nommensen

· 33,825 Views · 1 Like

Introduction to Apache Cassandra's Architecture

Some key concepts for Apache's popular Cassandra Architecture include partitioning, replication, consistency, bootstrapping, and write paths.

April 6, 2015

by Akhil Mehra

· 118,191 Views · 38 Likes

Package by Component and Architecturally-aligned Testing

i've seen and had lots of discussion about "package by layer" vs "package by feature" over the past couple of weeks. they both have their benefits but there's a hybrid approach i now use that i call "package by component". to recap... package by layer let's assume that we're building a web application based upon the web-mvc pattern. packaging code by layer is typically the default approach because, after all, that's what the books, tutorials and framework samples tell us to do. here we're organising code by grouping things of the same type. there's one top-level package for controllers, one for services (e.g. "business logic") and one for data access. layers are the primary organisation mechanism for the code. terms such as "separation of concerns" are thrown around to justify this approach and generally layered architectures are thought of as a "good thing". need to switch out the data access mechanism? no problem, everything is in one place. each layer can also be tested in isolation to the others around it, using appropriate mocking techniques, etc. the problem with layered architectures is that they often turn into a big ball of mud because, in java anyway, you need to mark your classes as public for much of this to work. package by feature instead of organising code by horizontal slice, package by feature seeks to do the opposite by organising code by vertical slice. now everything related to a single feature (or feature set) resides in a single place. you can still have a layered architecture, but the layers reside inside the feature packages. in other words, layering is the secondary organisation mechanism. the often cited benefit is that it's "easier to navigate the codebase when you want to make a change to a feature", but this is a minor thing given the power of modern ides. what you can do now though is hide feature specific classes and keep them out of sight from the rest of the codebase. for example, if you need any feature specific view models, you can create these as package-protected classes. the big question though is what happens when that new feature set c needs to access data from features a and b? again, in java, you'll need to start making classes publicly accessible from outside of the packages and the big ball of mud will again emerge. package by layer and package by feature both have their advantages and disadvantages. to quote jason gorman from schools of package architecture - an illustration , which was written seven years ago. to round off, then, i would urge you to be mindful of leaning to far towards either school of package architecture. don't just mindlessly put socks in the sock draw and pants in the pants draw, but don't be 100% driven by package coupling and cohesion to make those decisions, either. the real skill is finding the right balance, and creating packages that make stuff easier to find but are as cohesive and loosely coupled as you can make them at the same time. package by component this is a hybrid approach with increased modularity and an architecturally-evident coding style as the primary goals. the basic premise here is that i want my codebase to be made up of a number of coarse-grained components, with some sort of presentation layer (web ui, desktop ui, api, standalone app, etc) built on top. a "component" in this sense is a combination of the business and data access logic related to a specific thing (e.g. domain concept, bounded context, etc). as i've described before , i give these components a public interface and package-protected implementation details, which includes the data access code. if that new feature set c needs to access data related to a and b, it is forced to go through the public interface of components a and b. no direct access to the data access layer is allowed, and you can enforce this if you use java's access modifiers properly. again, "architectural layering" is a secondary organisation mechanism. for this to work, you have to stop using the public keyword by default . this structure raises some interesting questions about testing, not least about how we mock-out the data access code to create quick-running "unit tests". architecturally-aligned testing the short answer is don't bother, unless you really need to. i've spoken about and written about this before, but architecture and testing are related. instead of the typical testing triangle (lots of "unit" tests, fewer slower running "integration" tests and even fewer slower ui tests), consider this. i'm trying to make a conscious effort to not use the term "unit testing" because everybody has a different view of how big a "unit" is. instead, i've adopted a strategy where some classes can and should be tested in isolation. this includes things like domain classes, utility classes, web controllers (with mocked components), etc. then there are some things that are easiest to test as components, through the public interface. if i have a component that stores data in a mysql database, i want to test everything from the public interface right back to the mysql database. these are typically called "integration tests", but again, this term means different things to different people. of course, treating the component as a black box is easier if i have control over everything it touches. if you have a component that is sending asynchronous messages or using an external, third-party service, you'll probably still need to consider adding dependency injection points (e.g. ports and adapters) to adequately test the component, but this is the exception not the rule. all of this still applies if you are building a microservices style of architecture. you'll probably have some low-level class tests, hopefully a bunch of service tests where you're testing your microservices though their public interface, and some system tests that run scenarios end-to-end. oh, and you can still write all of this in a test-first, tdd style if that's how you work. i'm using this strategy for some systems that i'm building and it seems to work really well. i have a relatively simple, clean and (to be honest) boring codebase with understandable dependencies, minimal test-induced design damage and a manageable quantity of test code. this strategy also bridges the model-code gap , where the resulting code actually reflects the architectural intent. in other words, we often draw "components" on a whiteboard when having architecture discussions, but those components are hard to find in the resulting codebase. packaging code by layer is a major reason why this mismatch between the diagram and the code exists. those of you who are familiar with my c4 model will probably have noticed the use of the terms "class" and "component". this is no coincidence. architecture and testing are more related than perhaps we've admitted in the past. p.s. i'll be speaking about this topic over the next few months at events across europe, the us and (hopefully) australia

April 4, 2015

by Simon Brown

· 11,162 Views

To Shard, or Not to Shard

When I talk with customers about sharding decisions I often start by telling the following true story… A couple of years ago, a customer came to me looking for advice on how to shard his system. He told me he was already convinced he needed to do that since he read that some smart people at MySQL giants like Facebook and Twitter were sharding—so naturally this was something he should be doing, too. I paused for a moment and then I asked him what the size of his database was. “10GB,” he said. I nodded and asked if he handles many queries or if they were very complicated. “No,” he said. “Just a few hundred queries per second, and they have not been loading down the system by more than a few percent.” I asked him whether he was expecting exponential growth in the near future—looking to double every week or something like that. “No, our load and data size grew about 7 percent last year and we expect about the same growth this year and for the foreseeable future.” My recommendation to him was not to waste time and effort on sharding because it is just not needed in his company’s case. Before you decide how to shard, you’d best understand whether or not you really need to shard to begin with. Yes, on the extremely large-scale side of database demands, sharding is the only game in town. And not just for MySQL, but for pretty much any technology out there. Yet thanks to emerging technologies there is an increasing amount of applications that can run databases without sharding. Today we can easily run with terabytes of data per MySQL instance and serve tens of thousands of queries in many OLTP environments. This allows organizations to build very large applications without needing to shard. And keep this in mind: Sharding is a pain under all circumstances. Even if you have sharding provided out of the box by the database system, it is a pain because it introduces more components and complexity. Creating good distributed query execution plans is a very complicated task that needs to take network topology and load into account in addition to the data distribution and load of individual nodes. Before you decide if you need to shard, you should look at alternatives to scale your application. In the MySQL world, the solutions are typically as follows: Alternatives to Sharding Functional Partitioning: In many environments a single MySQL instance becomes a dumping ground for all kinds of databases—you might end up having your main application share a database instance with Drupal, which powers your website, with WordPress, which powers your blog, and with vBulletin, which powers your forums. Splitting those pieces into different database instances is something you should consider before you look into sharding. Custom-made systems will often have many applications using different data sets that can be easily split out. Replication: Many applications are read-heavy, so scaling reads becomes the issue earlier than it does with scaling writes. Replication is a great solution for this. MySQL’s built-in replication is very robust, though due to its asynchronous nature it adds complexity to the application. The developer must decide which of the reads can be done from the replica servers and which can’t, because you must be absolutely certain that you’re reading the most recent, actual data. This is the reason that alternative, synchronous replication technologies for MySQL like Percona XtraDB Cluster, are gaining popularity: They provide single database-like behavior from the cluster in most cases. Caching and Queueing: Caching is a great technology for reducing the amount of reads that hit the database. There are many applications that have reduced read load on the database by 80-95% using this technology. Queueing, in contrast, optimizes writes. It does this by merging multiple write operations together so they hit the database efficiently. Most large-scale applications should rely heavily on both of these technologies. Memcached and Redis are two popular caching technologies in the MySQL space. For queueing, the most popular technologies are ActiveMQ and RabbitMQ [1]. Supplemental Technologies: MySQL is great at many things but not at everything. If you’re looking for high-performance full-text search, consider ElasticSearch, Sphinx, or Lucene. If you’re looking at large-scale data analytics, a Hadoop-based infrastructure or Vertica might work well for you. You should let MySQL handle the things it is good at, and leave the rest to supporting tools. Optimizations to Make Before Sharding Scaling isn’t just about architecture either. You also need to make sure your system is reasonably optimized. Many people decide sharding is inevitable for them even though there are much easier and more cost-effective ways to get the performance and scale they are looking for. All of which, I might add, are also going to be valuable if sharding is indeed eventually needed. Hardware: Are you using the right hardware? I’ve seen many people looking into sharding when in fact simply purchasing decent hardware would solve their problems for years to come. Make sure you have plenty of memory and high-performance flash storage if you’re working with a large database. In many cases it can transform your system so much it will look like magic. MySQL version and Configuration: Use a recent MySQL version. By that I mean the latest GA version (MySQL 5.6 at the time of this article’s publication). Percona Server, which is free, often offers additional performance improvements for demanding workloads. Use the most recent operating system too, especially if you’re using modern hardware. Finally, make sure MySQL is configured properly. The difference in MySQL performance between poorly configured MySQL and well-tuned MySQL can be 10x or more. Schema and Queries: The same application logic can be expressed using a variety of schema and queries. I’ve seen a lot of similar applications approaching things differently, and the difference in the performance between an optimal approach and a poor one (but still used in production) can be 100x or more. Many of the changes can be retrofitted to existing schema—such as minor query changes and changes to the index structure—however, if your schema doesn’t fit your application needs well, then you might be looking at a complete redesign. So it is a good idea to think things through early. When to Shard So when should you start thinking about sharding? Basically, if none of the measures listed above have given you the performance you need, it might be time to consider sharding. Sharding does have the advantage of allowing you to potentially use lower-cost hardware or cheaper cloud instances. Most developers are using agile development methods these days and there is a common term, “Architectural Runway,” which defines how far the application can go with its current architecture. If you’ve already found success using replication in particular, it might be a bad decision to add sharding because it will force your developers to deal with the complexity of sharding and asynchronous replication. However, replication is still typically used to achieve high availability even if you’re already sharding, but in this case it’s not for scaling reads. If you’ve come to the point where you’re sure you need to shard, here are some of the questions you need to ask about how you’ll implement your sharding strategy: Shard Level: At which level should we shard? It does not have to be at the database level. Many applications, SaaS in particular, often “shard” on higher levels, deploying multiple copies of their full stack to offer complete isolation for availability, performance, security etc. In many large scale applications you will see multiple copies of a full stack deployed, each having its own sharded MySQL environment. Shard Key: How do we shard? In many cases the choice depends on whether you’re authenticating for user accounts or your organization, but in other cases it is not so obvious. When making a sharding choice, you need to think about two things: 1) as many data access points as possible should go into a single shard, because cross-shard access is expensive if supported at all, and 2) making sure such sharding does not produce a shard that is too large to handle either in terms of data size or traffic. For example, sharding by country is a poor idea because the requirements to handle Belgium traffic won’t be the same for the United States or China, which require a lot more resources. Shard by Schema or Instance: What is the unit of your shard? The typical choices are MySQL instance or database (schema). I like the shard = database approach, which doesn’t limit you to a single MySQL instance per physical box. That way you do not have to run too many MySQL instances, but you can run more than one if the application works better that way. Shard Unit: If you shard by a single MySQL server, you will run into a problem with high availability very soon. When you have 100 MySQL servers there are roughly 100 more chances for one of them to crash compared with having only one, so ensuring there is a high availability solution becomes critical. Instead of sharing across MySQL servers you will usually be sharding across “Replication Clusters,” such as one MySQL primary node and one or several replica or PXC (Percona XtraDB Cluster) nodes. Shard Technology: What technology can you use to assist you with sharding? Within the MySQL world there is no standard sharding technology as of yet that everyone uses. Most of the large web properties have implemented something in-house for their sharding needs, and some have released their solutions as open source projects. One example is Vitess, contributed by Google, and another is JetPants, contributed by Tumblr. Rolling out your own simple sharding framework might look easy for some developers until you have to deal with operational issues like balancing the shards, resharding, etc., on a large scale. There are a number of purpose-built technologies that can help you with sharding if this doesn’t sound like something your team can manage. Sharding Technologies Here are technologies that you should consider: MySQL Fabric: This is the sharding technology being developed by the MySQL team at Oracle. MySQL Fabric is GA, but its functionality right now is rather limited, especially in terms of their support for multi-sharded queries. Given more time however, it has the potential to become the standard sharding technology for MySQL. Tesora: Tesora has a proxy-based solution for MySQL sharding that became open source some time ago. I would be especially looking at Tesora if you’re also looking at deploying OpenStack, as they’ve invested a lot into the integration. ScaleArc: ScaleArc is a commercial database proxy solution that can do caching, filtering, routing, and sharding. It is a pretty mature solution that handles multiple database technologies and not just MySQL. ScaleBase: ScaleBase is a sharding solution designed specifically for MySQL and the cloud, which similarly to MySQL, operates at the proxy level. There are many technologies in the MySQL space that can help you scale your application without sharding. If you’re going to build the next “Facebook,” however, you will surely need to shard, and there are a number of technologies that can help you do it as painlessly as possible. Large-scale applications on large-scale databases will always introduce complexity, which makes them more complicated to develop against and manage. Success comes with cost. [1] http://dzone.com/research/guide-to-enterprise-integration Peter Zaitsev co-founded Percona in 2006, assuming the role of CEO. Percona helps companies of all sizes maximize their success with MySQL. Peter enjoys mixing business leadership with hands on technical expertise. Peter is also the co-author of O’Reilly’s High Performance MySQL, one of the most popular books on MySQL performance.

April 2, 2015

by Peter Zaitsev

· 21,000 Views · 1 Like

Dismantling invokedynamic

Many Java developers regarded the JDK's version seven release as somewhat a disappointment. On the surface, merely a few language and library extensions made it into the release, namely Project Coin and NIO2. But under the covers, the seventh version of the platform shipped the single biggest extension to the JVM's type system ever introduced after its initial release. Adding the invokedynamic instruction did not only lay the foundation for implementing lambda expressions in Java 8, it also was a game changer for translating dynamic languages into the Java byte code format. While the invokedynamic instruction is an implementation detail for executing a language on the Java virtual machine, understanding the functioning of this instruction gives true insights into the inner workings of executing a Java program. This article gives a beginner's view on what problem the invokedynamic instruction solves and how it solves it. Method handles Method handles are often described as a retrofitted version of Java's reflection API, but this is not what they are meant to represent. While method handles can represent a method, constructor or field, they are not intended to describe properties of these class members. It is for example not possible to directly extract metadata from a method handle such as modifiers or annotation values of the represented method. And while method handles allow for the invocation of a referenced method, their main purpose is to be used together with an invokedynamic call site. For gaining a better understanding of method handles, looking at them as an imperfect replacement for the reflection API is however a reasonable starting point. Method handles cannot be instantiated. Instead, method handles are created by using a designated lookup object. These objects are themselves created by using a factory method that is provided by the MethodHandles class. Whenever this factory is invoked, it first creates a security context which ensures that the resulting lookup object can only locate methods that are also visible to the class from which the factory method was invoked. A lookup object can then be created as follows: class Example { void doSomething() { MethodHandles.Lookup lookup = MethodHandles.lookup(); } private void foo() { /* ... */ } } As argued before, the above lookup object could only be used to locate methods that are also visible to the Example class such asfoo. It would for example be impossible to look up a private method of another class. This is a first major difference to using the reflection API where private methods of outside classes can be located just as any other method and where these methods can even be invoked after marking such a method as accessible. Method handles are therefore sensible of their creation context which is a first major difference to the reflection API. Apart from that, a method handle is more specific than the reflection API by describing a specific type of method rather than representing just any method. In a Java program, a method's type is a composite of both the method's return type and the types of its parameters. For example, the only method of the following Counter class returns an int representing the number of characters of the only String-typed argument: class Counter { static int count(String name) { return name.length(); } } A representation of this method's type can be created by using another factory. This factory is found in the MethodType class which also represents instances of created method types. Using this factory, the method type for Counter::count can be created by handing over the method's return type and its parameter types bundled as an array: MethodType methodType = MethodType.methodType(int.class, new Class[] {String.class}); By using the lookup object that was created before and the above method type, it is now possible to locate a method handle that represents the Counter::count method as depicted in the following code: MethodType methodType = MethodType.methodType(int.class, new Class[] {String.class}); MethodHandles.Lookup lookup = MethodHandles.lookup(); MethodHandle methodHandle = lookup.findStatic(Counter.class, "count", methodType); int count = methodHandle.invokeExact("foo"); assertThat(count, is(3)); At first glance, using a method handle might seem like an overly complex version of using the reflection API. However, keep in mind that the direct invocation of a method using a handle is not the main intent of its use. The main difference of the above example code and of invoking a method via the reflection API is only revealed when looking into the differences of how the Java compiler translates both invocations into Java byte code. When a Java program invokes a method, this method is uniquely identified by its name and by its (non-generic) parameter types and even by its return type. It is for this reason that it is possible to overload methods in Java. And even though the Java programming language does not allow it, the JVM does in theory allow to overload a method by its return type. Following this principle, a reflective method call is executed as a common method call of the Method::invoke method. This method is identified by its two parameters which are of the types Object and Object[]. In addition to this, the method is identified by its Object return type. Because of this signature, all arguments to this method need to always be boxed and enclosed in an array. Similarly, the return value needs to be boxed if it was primitive or null is returned if the method was void. Method handles are the exception to this rule. Instead of invoking a method handle by referring to the signature ofMethodHandle::invokeExact signature which takes an Object[] as its single argument and returns Object, method handles are invoked by using a so-called polymorphic signature. A polymorphic signature is created by the Java compiler dependant on the types of the actual arguments and the expected return type at a call site. For example, when invoking the method handle as above with int count = methodHandle.invokeExact("foo"); the Java compiler translates this invocation as if the invokeExact method was defined to accept a single single argument of typeString and returning an int type. Obviously, such a method does not exist and for (almost) any other method, this would result in a linkage error at runtime. For method handles, the Java Virtual Machine does however recognize this signature to be polymorphic and treats the invocation of the method handle as if the Counter::count method that the handle refers to was inset directly into the call site. Thus, the method can be invoked without the overhead of boxing primitive values or the return type and without placing the argument values inside an array. At the same time, when using the invokeExact invocation, it is guaranteed to the Java virtual machine that the method handle always references a method at runtime that is compatible to the polymorphic signature. For the example, the JVM expected that the referenced method actually accepts a String as its only argument and that it returns a primitive int. If this constraint was not fulfilled, the execution would instead result in a runtime error. However, any other method that accepts a single String and that returns a primitive int could be successfully filled into the method handle's call site to replace Counter::count. In contrast, using the Counter::count method handle at the following three invocations would result in runtime errors, even though the code compiles successfully: int count1 = methodHandle.invokeExact((Object) "foo"); int count2 = (Integer) methodHandle.invokeExact("foo"); methodHandle.invokeExact("foo"); The first statement results in an error because the argument that is handed to the handle is too general. While the JVM expected a String as an argument to the method, the Java compiler suggested that the argument would be an Object type. It is important to understand that the Java compiler took the casting as a hint for creating a different polymorphic signature with anObject type as a single parameter type while the JVM expected a String at runtime. Note that this restriction also holds for handing too specific arguments, for example when casting an argument to an Integer where the method handle required aNumber type as its argument. In the second statement, the Java compiler suggested to the runtime that the handle's method would return an Integer wrapper type instead of the primitive int. And without suggesting a return type at all in the third statement, the Java compiler implicitly translated the invocation into a void method call. Hence, invokeExact really does mean exact. This restriction can sometimes be too harsh. For this reason, instead of requiring an exact invocation, the method handle also allows for a more forgiving invocation where conversions such as type castings and boxings are applied. This sort of invocation can be applied by using the MethodHandle::invoke method. Using this method, the Java compiler still creates a polymorphic signature. This time, the Java virtual machine does however test the actual arguments and the return type for compatibility at run time and converts them by applying boxings or castings, if appropriate. Obviously, these transformations can sometimes add a runtime overhead. Fields, methods and constructors: handles as a unified interface Other than Method instances of the reflection API, method handles can equally reference fields or constructors. The name of theMethodHandle type could therefore be seen as too narrow. Effectively, it does not matter what class member is referenced via a method handle at runtime as long as its MethodType, another type with a misleading name, matches the arguments that are passed at the associated call site. Using the appropriate factories of a MethodHandles.Lookup object, a field can be looked up to represent a getter or a setter. Using getters or setters in this context does not refer to invoking an actual method that follows the Java bean specification. Instead, the field-based method handle directly reads from or writes to the field but in shape of a method call via invoking the method handle. By representing such field access via method handles, field access or method invocations can be used interchangeably. As an example for such interchange, take the following class: class Bean { String value; void print(String x) { System.out.println(x); } } For the above Bean class, the following method handles can be used for either writing a string to the value field or for invoking the print method with the same string as an argument: MethodHandle fieldHandle = lookup.findSetter(Bean.class, "value", String.class); MethodType methodType = MethodType.methodType(void.class, new Class[] {String.class}); MethodHandle methodHandle = lookup.findVirtual(Bean.class, "print", methodType); As long as the method handle call site is handed an instance of Bean together with a String while returning void, both method handles could be used interchangeably as shown here: anyHandle.invokeExact((Bean) mybean, (String) myString); Note that the polymorphic signature of the above call site does not match the method type of the above handle. However, within Java byte code, non-static methods are invoked as if they were static methods with where the this reference is handed as a first, implicit argument. A non-static method's nominal type does therefore diverge from its actual runtime type. Similarly, access to a non-static field requires an instance to be access. Similarly to fields and methods, it is possible to locate and invoke constructors which are considered as methods with a voidreturn value for their nominal type. Furthermore, one can not only invoke a method directly but even invoke a super method as long as this super method is reachable for the class from which the lookup factory was created. In contrast, invoking a super method is not possible at all when relying on the reflection API. If required, it is even possible to return a constant value from a handle. Performance metrics Method handles are often described as being a more performant as the Java reflection API. At least for recent releases of the HotSpot virtual machine, this is not true. The simplest way of proving this is writing an appropriate benchmark. Then again, is not all too simple to write a benchmark for a Java program which is optimized while it is executed. The de facto standard for writing a benchmark has become using JMH, a harness that ships under the OpenJDK umbrella. The full benchmark can be found as a gist in my GitHub profile. In this article, only the most important aspects of this benchmark are covered. From the benchmark, it becomes obvious that reflection is already implemented quite efficiently. Modern JVMs know a concept named inflation where a frequently invoked reflective method call is replaced with runtime generated Java byte code. What remains is the overhead of applying the boxing for passing arguments and receiving a return values. These boxings can sometimes be eliminated by the JVM's Just-in-time compiler but this is not always possible. For this reason, using method handles can be more performant than using the reflection API if method calls involve a significant amount of primitive values. This does however require that the exact method signatures are already known at compile time such that the appropriate polymorphic signature can be created. For most use cases of the reflection API, this guarantee can however not be given because the invoked method's types are not known at compile time. In this case, using method handles does not offer any performance benefits and should not be used to replace it. Creating an invokedynamic call site Normally, invokedynamic call sites are created by the Java compiler only when it needs to translate a lambda expression into byte code. It is worthwhile to note that lambda expressions could have been implemented without invokedynamic call sites altogether, for example by converting them into anonymous inner classes. As a main difference to the suggested approach, using invokedynamic delays the creation of a similar class to runtime. We are looking into class creation in the next section. For now, bear however in mind that invokedynamic does not have anything to do with class creation, it only allows to delay the decision of how to dispatch a method until runtime. For a better understanding of invokedynamic call sites, it helps to create such call sites explicitly in order to look at the mechanic in isolation. To do so, the following example makes use of my code generation framework Byte Buddy which provides explicit byte code generation of invokedynamic call sites without requiring a any knowledge of the byte code format. Any invokedynamic call site eventually yields a MethodHandle that references the method to be invoked. Instead of invoking this method handle manually, it is however up to the Java runtime to do so. Because method handles have become a known concept to the Java virtual machine, these invocations are then optimized similarly to a common method call. Any such method handle is received from a so-called bootstrap method which is nothing more than a plain Java method that fulfills a specific signature. For a trivial example of a bootstrap method, look at the following code: class Bootstrapper { public static CallSite bootstrap(Object... args) throws Throwable { MethodType methodType = MethodType.methodType(int.class, new Class[] {String.class}) MethodHandles.Lookup lookup = MethodHandles.lookup(); MethodHandle methodHandle = lookup.findStatic(Counter.class, "count", methodType); return new ConstantCallSite(methodHandle); } } For now, we do not care much about the arguments of the method. Instead, notice that the method is static what is as a matter of fact a requirement. Within Java byte code, an invokedynamic call site references the full signature of a bootstrap method but not a specific object which could have a state and a life cycle. Once the invokedynamic call site is invoked, control flow is handed to the referenced bootstrap method which is now responsible for identifying a method handle. Once this method handle is returned from the bootstrap method, it is invoked by the Java runtime. As obvious from the above example, a MethodHandle is not returned directly from a bootstrap method. Instead, the handle is wrapped inside of a CallSite object. Whenever a bootstrap method is invoked, the invokedynamic call site is later permanently bound to the CallSite object that is returned from this method. Consequently, a bootstrap method is only invoked a single time for any call site. Thanks to this intermediate CallSite object, it is however possible to exchange the referenced MethodHandle at a later point. For this purpose, the Java class library already offers different implementations of CallSite. We have already seen a ConstantCallSite in the example code above. As the name suggests, a ConstantCallSite always references the same method handle without a possibility of a later exchange. Alternatively, it is however also possible to for example use aMutableCallSite which allows to change the referenced MethodHandle at a later point in time or it is even possible to implement a custom CallSite class. With the above bootstrap method and Byte Buddy, we can now implement a custom invokedynamic instruction. For this, Byte Buddy offers the InvokeDynamic instrumentation that accepts a bootstrap method as its only mandatory argument. Such instrumentations are then fed to Byte Buddy. Assuming the following class: abstract class Example { abstract int method(); } we can use Byte Buddy to subclass Example in order to override method. We are then going to implement this method to contain a single invokedynamic call site. Without any further configuration, Byte Buddy creates a polymorphic signature that resembles the method type of the overridden method. However, for non-static methods, the this reference is set as a first, implicit argument. Assuming that we want to bind the Counter::count method which expects a String as a single argument, we could not bind this handle to Example::method because of this type mismatch. Therefore, we need to create a different call site without the implicit argument but with an String in its place. This can be achieved by using Byte Buddy's domain specific language: Instrumentation invokeDynamic = InvokeDynamic .bootstrap(Bootstrapper.class.getDeclaredMethod(“bootstrap”, Object[].class)) .withoutImplicitArguments() .withValue("foo"); With this instrumentation in place, we can finally extend the Example class and override method to implement the invokedynamic call site as in the following code snippet: Example example = new ByteBuddy() .subclass(Example.class) .method(named(“method”)).intercept(invokeDynamic) .make() .load(Example.class.getClassLoader(), ClassLoadingStrategy.Default.INJECTION) .getLoaded() .newInstance(); int result = example.method(); assertThat(result, is(3)); As obvious from the above assertion, the characters of the "foo" string were counted correctly. By setting appropriate break points in the code, it is further possible to validate that the bootstrap method is called and that control flow further reaches theCounter::count method. So far, we did not gain much from using an invokedynamic call site. The above bootstrap method would always bindCounter::count and can therefore only produce a valid result if the invokedynamic call site really wanted to transform a Stringinto an int. Obviously, bootstrap methods can however be more flexible thanks to the arguments they receive from the invokedynamic call site. Any bootstrap method receives at least three arguments: As a first argument, the bootstrap method receives a MethodHandles.Lookup object. The security context of this object is that of the class that contains the invokedynamic call site that triggered the bootstrapping. As discussed before, this implies that private methods of the defining class could be bound to the invokedynamic call site using this lookup instance. The second argument is a String representing a method name. This string serves as a hint to indicate from the call site which method should be bound to it. Strictly speaking, this argument is not required as it is perfectly legal to bind a method with another name. Byte Buddy simply serves the the name of the overridden method as this argument, if not specified differently. Finally, the MethodType of the method handle that is expected to be returned is served as a third argument. For the example above, we specified explicitly that we expect a String as a single parameter. At the same time, Byte Buddy derived that we require an int as a return value from looking at the overridden method, as we again did not specify any explicit return type. It is up to the implementor of a bootstrap method what exact signature this method should portray as long as it can at least accept these three arguments. If the last parameter of a bootstrap method represents an Object array, this last parameter is treated as a varargs and can therefore accept any excess arguments. This is also the reason why the above example bootstrap method is valid. Additionally, a bootstrap method can receive several arguments from an invokedynamic call site as long as these arguments can be stored in a class's constant pool. For any Java class, a constant pool stores values that are used inside of a class, largely numbers or string values. As of today, such constants can be primitive values of at least 32 bit size, Strings, Classes,MethodHandles and MethodTypes. This allows bootstrap methods to be used more flexible, if locating a suitable method handle requires additional information in form of such arguments. Lambda expressions Whenever the Java compiler translates a lambda expression into byte code, it copies the lambda's body into a private method inside of the class in which the expression is defined. These methods are named lambda$X$Y with X being the name of the method that contains the lambda expression and with Y being a zero-based sequence number. The parameters of such a method are those of the functional interface that the lambda expression implements. Given that the lambda expression makes no use of non-static fields or methods of the enclosing class, the method is also defined to be static. For compensation, the lambda expression is itself substituted by an invokedynamic call site. On its invocation, this call site requests the binding of a factory for an instance of the functional interface. As arguments to this factory, the call site supplies any values of the lambda expression's enclosing method which are used inside of the expression and a reference to the enclosing instance, if required. As a return type, the factory is required to provide an instance of the functional interface. For bootstrapping a call site, any invokedynamic instruction currently delegates to the LambdaMetafactory class which is included in the Java class library. This factory is then responsible for creating a class that implements the functional interface and which invokes the appropriate method that contains the lambda's body which, as described before, is stored in the original class. In the future, this bootstrapping process might however change which is one of the major advantages of using invokedynamic for implementing lambda expressions. If one day, a better suited language feature was available for implementing lambda expressions, the current implementation could simply be swapped out. In order to being able to create a class that implements the functional interface, any call site representing a lambda expression provides additional arguments to the bootstrap method. For the obligatory arguments, it already provides the name of the functional interface's method. Also, it provides a MethodType of the factory method that the bootstrapping is supposed to yield as a result. Additionally, the bootstrap method is supplied another MethodType that describes the signature of the functional interface's method. To that, it receives a MethodHandle referencing the method that contains the lambda's method body. Finally, the call site provides a MethodType of the generic signature of the functional interface's method, i.e. the signature of the method at the call site before type-erasure was applied. When invoked, the bootstrap method looks at these arguments and creates an appropriate implementation of a class that implements the functional interface. This class is created using the ASM library, a low-level byte code parser and writer that has become the de facto standard for direct Java byte code manipulation. Besides implementing the functional interface's method, the bootstrap method also adds an appropriate constructor and a static factory method for creating instances of the class. It is this factory method that is later bound to the invokedyanmic call site. As arguments, the factory receives an instance to the lambda method's enclosing instance, in case it is accessed and also any values that are read from the enclosing method. As an example, consider the following lambda expression: class Foo { int i; void bar(int j) { Consumer consumer = k -> System.out.println(i + j + k); } } In order to be executed, the lambda expression requires access to both the enclosing instance of Foo and to the value j of its enclosing method. Therefore, the desugared version of the above class looks something like the following where the invokedynamic instruction is represented by some pseudo-code: class Foo { int i; void bar(int j) { Consumer consumer = ; } private /* non-static */ void lambda$foo$0(int j, int k) { System.out.println(this.i + j + k); } } In order to being able to invoke lambda$foo$0, both the enclosing Foo instance and the j variable are handed to the factory that is bound by the invokedyanmic instruction. This factory then receives the variables it requires in order to create an instance of the generated class. This generated class would then look something like the following: class Foo$$Lambda$0 implements Consumer { private final Foo _this; private final int j; private Foo$$Lambda$0(Foo _this, int j) { this._this = _this; this.j = j; } private static Consumer get$Lambda(Foo _this, int j) { return new Foo$$Lambda$0(_this, j); } public void accept(Object value) { // type erasure _this.lambda$foo$0(_this, j, (Integer) value); } } Eventually, the factory method of the generated class is bound to the invokedynamic call site via a method handle that is contained by a ConstantCallSite. However, if the lambda expression is fully stateless, i.e. it does not require access to the instance or method in which it is enclosed, the LambdaMetafactory returns a so-called constant method handle that references an eagerly created instance of the generated class. Hence, this instance serves as a singleton to be used for every time that the lambda expression's call site is reached. Obviously, this optimization decision affects your application's memory footprint and is something to keep in mind when writing lambda expressions. Also, no factory method is added to a class of a stateless lambda expression. You might have noticed that the lambda expression's method body is contained in a private method which is now invoked from another class. Normally, this would result in an illegal access error. To overcome this limitation, the generated classes are loaded using so-called anonymous class loading. Anonymous class loading can only be applied when a class is loaded explicitly by handing a byte array. Also, it is not normally possible to apply anonymous class loading in user code as it is hidden away in the internal classes of the Java class library. When a class is loaded using anonymous class loading, it receives a host class of which it inherits its full security context. This involves both method and field access rights and the protection domain such that a lambda expression can also be generated for signed jar files. Using this approch, lambda expression can be considered more secure than anonymous inner classes because private methods are never reachable from outside of a class. Under the covers: lambda forms Lambda forms are an implementation detail of how MethodHandles are executed by the virtual machine. Because of their name, lambda forms are however often confused with lambda expressions. Instead, lambda forms are inspired by lambda calculus and received their name for that reason, not for their actual usage to implement lambda expressions in the OpenJDK. In earlier versions of the OpenJDK 7, method handles could be executed in one of two modes. Method handles were either directly rendered as byte code or they were dispatched using explicit assembly code that was supplied by the Java runtime. The byte code rendering was applied to any method handle that was considered to be fully constant throughout the lifetime of a Java class. If the JVM could however not prove this property, the method handle was instead executed by dispatching it to the supplied assembly code. Unfortunately, because assembly code cannot be optimized by Java's JIT-compiler, this lead to non-constant method handle invocations to "fall off the performance cliff". As this also affected the lazily bound lambda expressions, this was obviously not a satisfactory solution. LambdaForms were introduced to solve this problem. Roughly speaking, lambda forms represent byte code instructions which, as stated before, can be optimized by a JIT-compiler. In the OpenJDK, a MethodHandle's invocation semantics are today represented by a LambdaForm to which the handle carries a reference. With this optimizable intermediate representation, the use of non-constant MethodHandles has become significantly more performant. As a matter of fact, it is even possible to see a byte-code compiled LambdaForm in action. Simply place a break point inside of a bootstrap method or inside of a method that is invoked via a MethodHandle. Once the break point kicks it, the byte code-translated LambdaForms can be found on the call stack. Why this matters for dynamic languages Any language that should be executed on the Java virtual machine needs to be translated to Java byte code. And as the name suggests, Java byte code aligns rather close to the Java programming language. This includes the requirement to define a strict type for any value and before invokedynamic was introduced, a method call required to specify an explicit target class for dispatching a method. Looking at the following JavaScript code, specifying either information is however not possible when translating the method into byte code: 1 2 3 function (foo) { foo.bar(); } Using an invokedynamic call site, it has become possible to delay the identification of the method's dispatcher until runtime and furthermore, to rebind the invocation target, in case that a previous decision needs to be corrected. Before, using the reflection API with all of its performance drawbacks was the only real alternative to implementing a dynamic language. The real profiteer of the invokedynamic instruction are therefore dynamic programming languages. Adding the instruction was a first step away from aligning the byte code format to the Java programming language, making the JVM a powerful runtime even for dynamic languages. And as lambda expressions proved, this stronger focus on hosting dynamic languages on the JVM does neither interfere with evolving the Java language. In contrast, the Java programming languages gained from these efforts.

April 2, 2015

by Rafael Winterhalter

· 13,776 Views · 7 Likes

Spark and ZooKeeper: Fault-Tolerant Job Manager out of the Box

Apache Spark, Solr, and Zookeeper work together to create a fault-tolerant, distributed ETL system that converts RDBMS data into Solr documents.

March 28, 2015

by Konstantin Smirnov

· 12,836 Views

Using Google Protocol Buffers with Spring MVC-based REST Services

Written by Josh Long on the Spring blog This week I’m in São Paulo, Brazil presenting at QCon SP. I had an interesting discussion with someone who loves Spring’s REST stack, but wondered if there was something more efficient than plain-ol’ JSON. Indeed, there is! I often get asked about Spring’s support for high-speed binary based encoding of messages. Spring’s long supported RPC encoding with the likes of Hessian, Burlap, etc., and Spring Framework 4.1 introduced support for Google Protocol Buffers which can be used with REST services as well. From the Google Protocol Buffer website: Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages… Google uses Protocol Buffers extensively in their own, internal, service-centric architecture. A .proto document describes the types (_messages_) to be encoded and contains a definition language that should be familiar to anyone who’s used C structs. In the document, you define types, fields in those types, and their ordering (memory offsets!) in the type relative to each other. The .proto files aren’t implementations - they’re declarative descriptions of messages that may be conveyed over the wire. They can prescribe and validate constraints - the type of a given field, or the cardinatlity of that field - on the messages that are encoded and decoded. You must use the Protobuf compiler to generate the appropriate client for your language of choice. You can use Google Protocol Buffers anyway you like, but in this post we’ll look at using it as a way to encode REST service payloads. This approach is powerful: you can use content-negotiation to serve high speed Protocol Buffer payloads to the clients (in any number of languages) that accept it, and something more conventional like JSON for those that don’t. Protocol Buffer messages offer a number of improvements over typical JSON-encoded messages, particularly in a polyglot system where microservices are implemented in various technologies but need to be able to reason about communication between services in a consistant, long-term manner. Protocol Buffers are several nice features that promote stable APIs: Protocol Buffers offer backward compatibility for free. Each field is numbered in a Protocol Buffer, so you don’t have to change the behavior of the code going forward to maintain backward compatability with older clients. Clients that don’t know about new fields won’t bother trying to parse them. Protocol Buffers provide a natural place to specify validation using the required,optional, and repeated keywords. Each client enforces these constraints in their own way. Protocol Buffers are polyglot, and work with all manner of technologies. In the example code for this blog alone there is a Ruby, Python and Java client for the Java service demonstrated. It’s just a matter of using one of the numerous supported compilers. You might think that you could just use Java’s inbuilt serialization mechanism in a homogeneous service environment but, as the Protocol Buffers team were quick to point out whent hey first introduced the technology, there are some problems even with that. Java language luminary Josh Bloch’s epic tome, Effective Java, on page 213, provides further details. Let’s first look at our .proto document: package demo; option java_package = "demo"; option java_outer_classname = "CustomerProtos"; message Customer { required int32 id = 1; required string firstName = 2; required string lastName = 3; enum EmailType { PRIVATE = 1; PROFESSIONAL = 2; } message EmailAddress { required string email = 1; optional EmailType type = 2 [default = PROFESSIONAL]; } repeated EmailAddress email = 5; } message Organization { required string name = 1; repeated Customer customer = 2; } You then pass this definition to the protoc compiler and specify the output type, like this: protoc -I=$IN_DIR --java_out=$OUT_DIR $IN_DIR/customer.proto Here’s the little Bash script I put together to code-generate my various clients: #!/usr/bin/env bash SRC_DIR=`pwd` DST_DIR=`pwd`/../src/main/ echo source: $SRC_DIR echo destination root: $DST_DIR function ensure_implementations(){ # Ruby and Go aren't natively supported it seems # Java and Python are gem list | grep ruby-protocol-buffers || sudo gem install ruby-protocol-buffers go get -u github.com/golang/protobuf/{proto,protoc-gen-go} } function gen(){ D=$1 echo $D OUT=$DST_DIR/$D mkdir -p $OUT protoc -I=$SRC_DIR --${D}_out=$OUT $SRC_DIR/customer.proto } ensure_implementations gen java gen python gen ruby This will generate the appropriate client classes in the src/main/{java,ruby,python}folders. Let’s first look at the Spring MVC REST service itself. A Spring MVC REST Service In our example, we’ll register an instance of Spring framework 4.1’s org.springframework.http.converter.protobuf.ProtobufHttpMessageConverter. This type is an HttpMessageConverter. HttpMessageConverters encode and decode the requests and responses in REST service calls. They’re usually activated after some sort of content negotiation has occurred: if the client specifies Accept: application/x-protobuf, for example, then our REST service will send back the Protocol Buffer-encoded response. package demo; import org.springframework.beans.factory.annotation.Autowired; import org.springframework.boot.SpringApplication; import org.springframework.boot.autoconfigure.SpringBootApplication; import org.springframework.context.annotation.Bean; import org.springframework.http.converter.protobuf.ProtobufHttpMessageConverter; import org.springframework.web.bind.annotation.PathVariable; import org.springframework.web.bind.annotation.RequestMapping; import org.springframework.web.bind.annotation.RestController; import java.util.Arrays; import java.util.Collection; import java.util.Map; import java.util.concurrent.ConcurrentHashMap; import java.util.stream.Collectors; @SpringBootApplication public class DemoApplication { public static void main(String[] args) { SpringApplication.run(DemoApplication.class, args); } @Bean ProtobufHttpMessageConverter protobufHttpMessageConverter() { return new ProtobufHttpMessageConverter(); } private CustomerProtos.Customer customer(int id, String f, String l, Collection emails) { Collection emailAddresses = emails.stream().map(e -> CustomerProtos.Customer.EmailAddress.newBuilder() .setType(CustomerProtos.Customer.EmailType.PROFESSIONAL) .setEmail(e).build()) .collect(Collectors.toList()); return CustomerProtos.Customer.newBuilder() .setFirstName(f) .setLastName(l) .setId(id) .addAllEmail(emailAddresses) .build(); } @Bean CustomerRepository customerRepository() { Map customers = new ConcurrentHashMap<>(); // populate with some dummy data Arrays.asList( customer(1, "Chris", "Richardson", Arrays.asList("[email protected]")), customer(2, "Josh", "Long", Arrays.asList("[email protected]")), customer(3, "Matt", "Stine", Arrays.asList("[email protected]")), customer(4, "Russ", "Miles", Arrays.asList("[email protected]")) ).forEach(c -> customers.put(c.getId(), c)); // our lambda just gets forwarded to Map#get(Integer) return customers::get; } } interface CustomerRepository { CustomerProtos.Customer findById(int id); } @RestController class CustomerRestController { @Autowired private CustomerRepository customerRepository; @RequestMapping("/customers/{id}") CustomerProtos.Customer customer(@PathVariable Integer id) { return this.customerRepository.findById(id); } } Most of this code is pretty straightforward. It’s a Spring Boot application. Spring Boot automatically registers HttpMessageConverter beans so we need only define the ProtobufHttpMessageConverter bean and it gets configured appropriately. The @Configuration class seeds some dummy date and a mock CustomerRepository object. I won’t reproduce the Java type for our Protocol Buffer, demo/CustomerProtos.java, here as it is code-generated bit twiddling and parsing code; not all that interesting to read. One convenience is that the Java implementation automatically provides builder methods for quickly creating instances of these types in Java. The code-generated types are dumb struct like objects. They’re suitable for use as DTOs, but should not be used as the basis for your API. Do not extend them using Java inheritance to introduce new functionality; it’ll break the implementation and it’s bad OOP practice, anyway. If you want to keep things cleaner, simply wrapt and adapt them as appropriate, perhaps handling conversion from an ORM entity to the Protocol Buffer client type as appropriate in that wrapper. HttpMessageConverters may also be used with Spring’s REST client, the RestTemplate. Here’s the appropriate Java-language unit test: package demo; import org.junit.Test; import org.junit.runner.RunWith; import org.springframework.beans.factory.annotation.Autowired; import org.springframework.boot.test.IntegrationTest; import org.springframework.boot.test.SpringApplicationConfiguration; import org.springframework.context.annotation.Bean; import org.springframework.context.annotation.Configuration; import org.springframework.http.ResponseEntity; import org.springframework.http.converter.protobuf.ProtobufHttpMessageConverter; import org.springframework.test.context.junit4.SpringJUnit4ClassRunner; import org.springframework.test.context.web.WebAppConfiguration; import org.springframework.web.client.RestTemplate; import java.util.Arrays; @RunWith(SpringJUnit4ClassRunner.class) @SpringApplicationConfiguration(classes = DemoApplication.class) @WebAppConfiguration @IntegrationTest public class DemoApplicationTests { @Configuration public static class RestClientConfiguration { @Bean RestTemplate restTemplate(ProtobufHttpMessageConverter hmc) { return new RestTemplate(Arrays.asList(hmc)); } @Bean ProtobufHttpMessageConverter protobufHttpMessageConverter() { return new ProtobufHttpMessageConverter(); } } @Autowired private RestTemplate restTemplate; private int port = 8080; @Test public void contextLoaded() { ResponseEntity customer = restTemplate.getForEntity( "http://127.0.0.1:" + port + "/customers/2", CustomerProtos.Customer.class); System.out.println("customer retrieved: " + customer.toString()); } } Things just work as you’d expect, not only in Java and Spring, but also in Ruby and Python. For completeness, here is a simple client using Ruby (client types omitted): #!/usr/bin/env ruby require './customer.pb' require 'net/http' require 'uri' uri = URI.parse('http://localhost:8080/customers/3') body = Net::HTTP.get(uri) puts Demo::Customer.parse(body) ..and here’s a client in Python (client types omitted): #!/usr/bin/env python import urllib import customer_pb2 if __name__ == '__main__': customer = customer_pb2.Customer() customers_read = urllib.urlopen('http://localhost:8080/customers/1').read() customer.ParseFromString(customers_read) print customer Where to go from Here If you want very high speed message encoding that works with multiple languages, Protocol Buffers are a compelling option. There are other encoding technologies like Avro or Thrift, but none nearly so mature and entrenched as Protocol Buffers. You don’t necessarily need to use Protocol Buffers with REST, either. You could plug it into some sort of RPC service, if that’s your style. There are almost as many client implementations as there are buildpacks for Cloud Foundry - so you could run almost anything on Cloud Foundry and enjoy the same high speed, consistent messaging across all your services! The code for this example is available online, as well, so don’t hesitate to check it out! Also.. Hi gang, in 2015, I’ve been trying to do a random tech-tip style post every week based on things that I see garnering interest in the community, either here or on the Pivotal blog. I use these weekly-_ish_ (OK! OK! - it’s not been easy doing them as regularly as This Week in Spring, but so far I haven’t missed a week! :-) ) posts as a chance to focus not on a specific new release, per se, but on the application of Spring in service to some community use case that might be cross-cutting or just might benefit from having a spotlight shined on it. So far we’ve looked at all manner of things - Vaadin, Activiti, 12-Factor App Style Configuration, Smarter Service to Service Invocations, Couchbase, and much more, etc. - and we’ve got some interesting stuff lined up, too. I wondered what else you want to see talked about, however. If you’ve got some ideas about what you’d like to see covered, or a community post of your own to contribute, reach out to me on Twitter (@starbuxman) or via email (jlong [at] pivotal [dot] io). I remain, as always, at your service.

March 27, 2015

by Pieter Humphrey

· 15,189 Views

Getting Started with Couchbase and Spring Data Couchbase

Written by Josh Long on the Spring blog. This blog was inspired by a talk that Laurent Doguin, a developer advocate over at Couchbase, and I gave at Couchbase Connect last year. Merci Laurent! This is a demo of the Spring Data Couchbase integration. From the project page, Spring Data Couchbase is: The Spring Data Couchbase project provides integration with the Couchbase Server database. Key functional areas of Spring Data Couchbase are a POJO centric model for interacting with Couchbase Buckets and easily writing a Repository style data access layer. What is Couchbase? Couchbase is a distributed data-store that enjoys true horizontal scaling. I like to think of it as a mix of Redis and MongoDB: you work with documents that are accessed through their keys. There are numerous client APIs for all languages. If you’re using Couchbase for your backend and using the JVM, you’ll love Spring Data Couchbase. The bullets on the project home page best enumerate its many features: Spring configuration support using Java based @Configuration classes or an XML namespace for the Couchbase driver. CouchbaseTemplate helper class that increases productivity performing common Couchbase operations. Includes integrated object mapping between documents and POJOs. Exception translation into Spring’s portable Data Access Exception hierarchy. Feature Rich Object Mapping integrated with Spring’s Conversion Service. Annotation based mapping metadata but extensible to support other metadata formats. Automatic implementation of Repository interfaces including support for custom finder methods (backed by Couchbase Views). JMX administration and monitoring Transparent @Cacheable support to cache any objects you need for high performance access. Running Couchbase Use Vagrant to Run Couchbase Locally You will need to have Couchbase installed if you don’t already (naturally). Michael Nitschinger (@daschl, also lead of the Spring Data Couchbase project), blogged about how to get a simple4-node Vagrant cluster up and running here. I’ve reproduced his example here in the vagrantdirectory. To use it, you’ll need to install Virtual Box and Vagrant, of course, but then simply run vagrant up in the vagrant directory. To get the most up-to-date version of this configuration script, I went to Michael’s GitHub vagrants project and found that, beyond this example, there are numerous other Vagrant scripts available. I have a submodule in this code’s project directory that points to that, but be sure to consult that for the latest-and-greatest. To get everything running on my machine, I chose the Ubuntu 12 installation of Couchbase 3.0.2. You can change how many nodes are started by configuring the VAGRANT_NODES environment variable before startup: VAGRANT_NODES=2 vagrant up You’ll need to administer and configure Couchbase on initial setup. Point your browser to the right IP for each node. The rules for determining that IP are well described in the README. The admin interface, in my case, was available at 192.168.105.101:8091 and192.168.105.102:8091. For more on this process, I recommend that you follow theguidelines here for the details. Here’s how I did it. I hit the admin interface on the first node and created a new cluster. I usedadmin for the username and password for the password. On all subsequent management pages, I simply joined the existing cluster by pointing the nodes to 192.168.105.101 and using the aforementioned admin credential. Once you’ve joined all nodes, look for theRebalance button in the Server Nodes panel and trigger a cluster rebalance. If you are done with your Vagrant cluster, you can use the vagrant halt command to shut it down cleanly. Very handy is also vagrant suspend, which will save the state of the nodes instead of shutting them down completely. If you want to administer the Couchbase cluster from the command line there is the handycouchbase-cli. You can simply use the vagrant ssh command to get into each of the nodes (by their node-names: node1, node2, etc..). Once there, you can run cluster configuration commands. For example the server-list command will enumerate cluster nodes. /opt/couchbase/bin/couchbase-cli server-list -c 192.168.56.101-u admin -p password It’s easy to trigger a rebalance using: /opt/couchbase/bin/couchbase-cli rebalance -c 192.168.56.101-u admin -p password Couchbase In the Cloud and on Cloud Foundry Couchbase lends itself to use in the cloud. It’s horizontally scalable (like Gemfire or Cassandra) in that there’s no single point of failure. It does not employ a master-slave or active/passive system. There are a few ways to get it up and running where your applications are running. If you’re running a Cloud Foundry installation, then you can install the the Cumulogic Service Broker which then lets your Cloud Foundry installation talk to the Cumulogic platform which itself can manage Couchbase instances. Service brokers are the bit of integration code that teach Cloud Foundry how to provision, destroy and generally interact with a managed service, like Couchbase, in this case. Using Spring Data Couchbase to Store Facebook Places Let’s look at a simple example that reads data (in this case from the Facebook Places API using Spring Social Facebook’s FacebookTemplate API) and then loads it into the Couchbase server. Get a Facebook Access Token You’ll also need a Facebook access token. The easiest way to do this is to go to the Facebook Developer Portal and create a new application and then get an application ID and an application secret. Take these two values and concatenate them with a pike character (|). Thus, you’ll have something of the form: appID|appSecret. The sample application uses Spring’s Environment mechanism to resolve the facebook.accessToken key. You can provide a value for it in the src/main/resources/application.properties file or using any of the other supported Spring Boot property resolution mechanisms. You could even provide the value as a -D argument: -Dfacebook.accessToken=...|... Telling Spring Data Couchbase About our Cluster Data in Couchbase is stored in buckets. It’s logically the same as a database in a SQL RDBMS. It is typically replicated across nodes and has its own configuration. We’ll be using the defaultbucket, but it’s a snap to create more buckets. Let’s look at the basic configuration required to use Spring Data Couchbase (in this case, in terms of a Spring Boot application): @SpringBootApplication @EnableScheduling @EnableCaching public class Application { @EnableCouchbaseRepositories @Configuration static class CouchbaseConfiguration extends AbstractCouchbaseConfiguration { @Value("${couchbase.cluster.bucket}") private String bucketName; @Value("${couchbase.cluster.password}") private String password; @Value("${couchbase.cluster.ip}") private String ip; @Override protected List bootstrapHosts() { return Arrays.asList(this.ip); } @Override protected String getBucketName() { return this.bucketName; } @Override protected String getBucketPassword() { return this.password; } } // more beans } A Spring Data Couchbase Repository Spring Data provides the notion of repositories - objects that handle typical data-access logic and provide convention-based queries. They can be used to map POJOs to data in the backing data store. Our example simply stores the information on businesses it reads from Facebook’s Places API. To acheive this we’ve created a simple Place entity that Spring Data Couchbase repositories will know how to persist: @Document(expiry = 0) class Place { @Id private String id; @Field private Location location; @Field @NotNull private String name; @Field private String affilitation, category, description, about; @Field private Date insertionDate; // .. getters, constructors, toString, etc } The Place entity references another entity, Location, which is basically the same. In the case of Spring Data Couchbase, repository finder methods map to views - queries written in JavaScript - in a Couchbase server. You’ll need to setup views on the Couchbase servers. Go to any Couchbase server’s admin console and visit the Views screen, then clickCreate Development View and name it place, as our entity will be demo.Place (the development view name is adapted from the entity’s class name by default). We’ll create two views, the generic all, which is required for any Spring Data Couchbase POJO, and the byName view, which will be used to drive the repository’s findByName finder method. This mapping is by convention, though you can override which view is employed with the @View annotation on the finder method’s declaration. First, all: Now, byName: When you’re done, be sure to Publish each view! Now you can use Spring Data repositories as you’d expect. The only thing that’s a bit different about these repositories is that we’re declaring a Spring Data Couchbase Query type for the argument to the findByName finder method, not a String. Using the @Query is straightforward: Query query = new Query(); query.setKey("Philz Coffee"); Collection places = placeRepository.findByName(query); places.forEach(System.out::println); Where to go from Here We’ve only covered some of the basics here. Spring Data Couchbase supports the Java bean validation API, and can be configured to honor validation constraints on its entities. Spring Data Couchbase also provides lower-level access to the CouchbaseClient API, if you want it. Spring Data Couchbase also implements the Spring CacheManager abstraction - you can use@Cacheable and friends with data on service methods and it’ll be transparently persisted to Couchbase for you. The code for this example is in my Github repository, co-developed with my pal Laurent Doguin (@ldoguin) over at Couchbase.

March 24, 2015

by Pieter Humphrey

· 19,321 Views

Walking Recursive Data Structures Using Java 8 Streams

The Streams API is a real gem in Java 8, and I keep finding more or less unexpected uses for them. I recently wrote about using them as ForkJoinPool facade. Here’s another interesting example: Walking recursive data structures. Without much ado, have a look at the code: class Tree { private int value; private List children = new LinkedList<>(); public Tree(int value, List children) { super(); this.value = value; this.children.addAll(children); } public Tree(int value, Tree... children) { this(value, asList(children)); } public int getValue() { return value; } public List getChildren() { return Collections.unmodifiableList(children); } public Stream flattened() { return Stream.concat( Stream.of(this), children.stream().flatMap(Tree::flattened)); } } It’s pretty boring, except for the few highlighted lines. Let’s say we want to be able to find elements matching some criteria in the tree or find particular element. One typical way to do it is a recursive function – but that has some complexity and is likely to need a mutable argument (e.g. a set where you can append matching elements). Another approach is iteration with a stack or a queue. They work fine, but take a few lines of code and aren’t so easy to generalize. Here’s what we can do with this flattened function: // Get all values in the tree: t.flattened().map(Tree::getValue).collect(toList()); // Get even values: t.flattened().map(Tree::getValue).filter(v -> v % 2 == 0).collect(toList()); // Sum of even values: t.flattened().map(Tree::getValue).filter(v -> v % 2 == 0).reduce((a, b) -> a + b); // Does it contain 13? t.flattened().anyMatch(t -> t.getValue() == 13); I think this solution is pretty slick and versatile. One line of code (here split to 3 for readability on blog) is enough to flatten the tree to a straightforward stream that can be searched, filtered and whatnot. It’s not perfect though: It is not lazy and flattened is called for each and every node in the tree every time. It probably could be improved using a Supplier. Anyway, it doesn’t matter for typical, reasonably small trees, especially in a business application on a very tall stack of libraries. But for very large trees, very frequent execution and tight time constraints the overhead might cause some trouble.

March 18, 2015

by Konrad Garus

· 25,162 Views · 1 Like