The Latest Data Engineering Topics

How to Test a REST API With JUnit
Both RESTEasy and Jersey ship with a minimal embedded web server, which lets you start a lightweight server instance directly from a JUnit test and exercise your REST API against it.
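As a rough, framework-agnostic sketch of that idea (not the article's own code), here is a JUnit 4 test that calls a locally running endpoint using only JDK classes; the endpoint URL, the expected body, and the assumption that an embedded RESTEasy/Jersey server was started in a @Before method are illustrative.

```java
import static org.junit.Assert.assertEquals;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

import org.junit.Test;

public class RestApiSmokeTest {

    // Hypothetical endpoint; in practice it would be served by the
    // embedded server started before the test runs.
    private static final String ENDPOINT = "http://localhost:8080/api/ping";

    @Test
    public void pingReturns200AndBody() throws Exception {
        HttpURLConnection connection =
                (HttpURLConnection) new URL(ENDPOINT).openConnection();
        connection.setRequestMethod("GET");

        // Verify the HTTP status first, then the response body.
        assertEquals(200, connection.getResponseCode());

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream()))) {
            assertEquals("pong", reader.readLine());
        }
    }
}
```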
March 13, 2015
by Mark Paluch
· 311,095 Views · 6 Likes
The 100% Utilization Myth
Many organizations in which I've coached are concerned when the people on their teams aren't being "utilized" at 100% of their capacity. If someone is at work for 8 hours per day, minus an hour for lunch and breaks, that's 7 hours of potential capacity. Some organizations are progressive enough to see that the organization's overhead of administrative activities lowers that value to 6 hours per day. By extension, a team's capacity is simply a multiple of the number of team members and the number of hours available to be utilized, i.e. a team of 5 has 30 person-hours per day. So, anything less than 30 person-hours per day spent on tasks means that a team isn't being utilized 100%. Which is bad, apparently.

A Thought Exercise

If you're reading this, you have a computer and it's connected to a network. Your computer might be a small one that makes phone calls or a large one that can also handle the complex scenery of games at 60 frames per second without breaking a sweat. Regardless, it's a computer. Thanks to John von Neumann, almost all computers today have the same basic architecture, the heart of which is the Central Processing Unit or CPU.

When the CPU on your computer, regardless of its size, reaches 100% utilization, what happens? Your computer slows down. Significantly. The CPU is saturated with instructions and does its best to process them, but it has a finite limit on how quickly it can do that. There's a finite limit to the speed at which the instructions and the data they use can be passed around among the internal components of the CPU. That processor just can't go any faster. To a person using the computer, it appears to be locked up. To see this in action, try running a modern game like Faster Than Light on a 1st generation iPad. The older processor can't keep up with the computing demand from a game targeted towards a more powerful CPU.

The communications network to which your computer is connected might be wireless or have a "hard" connection, but behind the scenes the public internet is dominated by TCP/IP and Ethernet. What happens when the network to which your computer is connected is running at 100% capacity? The network slows down. Significantly. The equipment moving the packets of data about will experience numerous collisions and will have to send requests back to your computer to resend data. The equipment will also begin to simply "drop" packets because it can't process them quickly enough. As far as you're concerned, the network has become either very slow or has simply stopped. To see this in real life, look at what happens to a web site when it's the subject of a Denial of Service (DoS) attack, which effectively saturates that site's network with data requests. The site either crashes completely or appears to be unresponsive because it can't handle the traffic.

Knowledge work such as software development requires a huge amount of thinking (processing) and a huge amount of collaboration (communication). The human brain is a processor, and like the von Neumann architecture it has subcomponents (lobes), working memory, etc. What happens when our processor hits 100% utilization? We simply can't process all the inputs fast enough and our ability to process information slows down dramatically. We even reach a state where the decision-making centre in our brain shuts itself down. Coupled with a decreasing ability to process information and make decisions, we experience increasing levels of anxiety.

Now, let's look at teams. They represent a communications network, with the people being the nodes in the network. When the team reaches 100% utilization, the individual processors (the brains of the people!) begin to slow as the ability to process information diminishes and anxiety increases. This has the effect of slowing the team's ability to communicate and operate effectively. When one or multiple people reach the point of shutting down, the network collapses.

Why does anyone think that 100% utilization is a good thing? In manufacturing, 100% utilization of the capacity of your workers is actually desirable. The thinking aspects have already been done in the design and engineering of the item being created, and the assembly process is infinitely repeatable until a new item comes along. This approach to the process was taken from the manufacturing world and applied to the development of software, with the thinking that assembly was done by the programmers after all the design and engineering had taken place. Well, that simply isn't true. The "assembly" of software isn't programming, it's the compilation and packaging into something deployable. Writing software is a confluence of design and engineering that has creative and technical aspects. It's not an assembly line, and probably won't ever be within my lifetime.

As long ago as the late 1950s, management guru Peter Drucker realized that there were people who "think for a living". He coined the term knowledge worker to describe that type of work. Developing software is a classic knowledge worker role and therefore has different rules for productivity. As Drucker said in his 1999 book Management Challenges for the 21st Century, "Productivity of the knowledge worker is not — at least not primarily — a matter of the quantity of output. Quality is at least as important." If we're pushing individuals and teams to 100% capacity, the quality of work and therefore the productivity of the team will be reduced.

In 2001, Tom DeMarco released Slack: Getting Past Burnout, Busywork and the Myth of Total Efficiency, an entire book describing the problems created by trying to achieve 100% utilization. I highly recommend the book, and also recommend you give your manager a copy. :) Here are some selected quotes that are pertinent to reducing the load on our processor and communication network:

- "Very successful companies have never struck me as particularly busy; in fact, they are, as a group, rather laid back. Energy is evident in the workplace, but it's not the energy tinged with fear that comes from being slightly behind on everything."
- "The principal resource needed for invention is slack. When companies can't invent, it's usually because their people are too damn busy."
- "People under time pressure don't think faster." - Tim Lister

How Can We Deal With This?

First and foremost, stop trying to do so damned much! This is truly a slow down to go faster type of solution.

For a Team

If you're an agile team using iterations or sprints, pull less work into a sprint even if your velocity says you should be able to pull more. The extra time can be used to increase the quality of the work that you do complete, which is in itself a motivating factor for knowledge workers. Increased quality means decreased rework, which means the team as a whole delivers faster over the long term. Similarly, the reduced anxiety and stress on the team members increases their ability to think about what they're doing, meaning better and more innovative solutions to problems.

Also, ensure that teams are focusing on activities that contribute directly to their work. Meetings tend to be the worst offenders here, including the meetings defined within agile processes. Some suggestions:

- No meeting can be scheduled unless there is a specific question to be answered or specific information to be passed along.
- Reduce the default meeting length from 1 hour to 30 minutes in whatever calendar tool you use. Even better, reduce it to 20 minutes.
- Choose blocks of time that are deemed meeting-free zones. No one, regardless of their position in the org chart, can schedule a meeting with the team or the team members during that time.
- Minimize meetings where people are on a phone. My observation is that people who call into meetings tend to speak longer than when they are in the room with everyone else. I know that I do this, and I see it often in others.

For an Individual

Learn to say, "No." At the very least, learn to say, "Not yet." It's really difficult to do this, and I speak from personal experience. We want to help others, we want to be team players. However, if we don't say no, we take on too much, our processing and communication ability suffers and we end up disappointing those we were trying to help!

Also, learn to take breaks. Yes, take breaks. The same concepts from Slack apply here, with brief diversions clearing our mind and allowing us to work better afterwards. At the extreme, if you feel like you can't keep your eyes open, take a nap! I highly recommend the Pomodoro Technique as at least a starting point for giving yourself some slack. Step away from your work for just a few minutes and return recharged.

Conclusion

We don't want the CPU in our computer to be working at 100% and we don't want the communications network to which it's connected to be at 100% capacity either. Our brains are processors and teams are a communications network, so why would we want those to be 100% busy all the time? In fact, ensuring that teams and the people on them are always busy is a provably wrong approach for software development. It has its roots in a manufacturing mindset, and developing software is knowledge work. Software development requires substantially different ways to make teams and the individual people on them as productive as possible.

Finally, there is no magic number for what percentage of utilization is best. People are variable, and 90% utilization for a person might be fine on a given day while 30% is all that the same person can handle on the next. Don't aim for a number, aim for an environment where people know that they don't have to appear busy. That will leave plenty of capacity for each person's processor and their team's communications network to run smoothly and deliver real value more effectively.
March 12, 2015
by Dave Rooney
· 22,265 Views · 2 Likes
R/dplyr: Extracting Data Frame Column Value for Filtering With %in%
I’ve been playing around with dplyr over the weekend and wanted to extract the values from a data frame column to use in a later filtering step. I had a data frame:

library(dplyr)
df = data.frame(userId = c(1,2,3,4,5), score = c(2,3,4,5,5))

And wanted to extract the userIds of those people who have a score greater than 3. I started with:

highScoringPeople = df %>% filter(score > 3) %>% select(userId)
> highScoringPeople
  userId
1      3
2      4
3      5

And then filtered the data frame expecting to get back those 3 people:

> df %>% filter(userId %in% highScoringPeople)
[1] userId score
<0 rows> (or 0-length row.names)

No rows! I created a vector with the numbers 3-5 to make sure that worked:

> df %>% filter(userId %in% c(3,4,5))
  userId score
1      3     4
2      4     5
3      5     5

That works as expected, so highScoringPeople obviously isn’t in the right format to facilitate an ‘in lookup’. Let’s explore:

> str(c(3,4,5))
 num [1:3] 3 4 5

> str(highScoringPeople)
'data.frame':   3 obs. of  1 variable:
 $ userId: num  3 4 5

Now it’s even more obvious why it doesn’t work – highScoringPeople is still a data frame when we need it to be a vector/list. One way to fix this is to extract the userIds using the $ syntax instead of the select function:

highScoringPeople = (df %>% filter(score > 3))$userId

> str(highScoringPeople)
 num [1:3] 3 4 5

> df %>% filter(userId %in% highScoringPeople)
  userId score
1      3     4
2      4     5
3      5     5

Or if we want to do the column selection using dplyr we can extract the values for the column like this:

highScoringPeople = (df %>% filter(score > 3) %>% select(userId))[[1]]

> str(highScoringPeople)
 num [1:3] 3 4 5

Not so difficult after all.
March 12, 2015
by Mark Needham
· 14,699 Views
Why I Use OrientDB on Production Applications
Like many other Java developers, when i start a new Java development project that requires a database, i have hopes and dreams of what my database looks like: Java API (of course) Embeddable Pure Java Simple jar file for inclusion in my project Database stored in a directory on disk Faster than a rocket First I’m going to review these points, and then i’m going to talk about the database i chose for my latest project, which is in production now with hundreds of users accessing the web application each month. What I Want from My Database Here’s what i’m looking for in my database. These are the things that literally make me happy and joyous when writing code. Java API I code in Java. It’s natural for me to want to use a modern Java API for my database work. Embeddable My development productivity and programming enjoyment skyrocket when my database is embedded. The database starts and stops with my application. It’s easy to destroy my database and restart from scratch. I can upgrade my database by updating my database jar file. It’s easy to deploy my application into testing and production, because there’s no separate database server to startup and manage. (I know about the issue with clustering and an embedded database, but i’ll get to that.) Pure Java Back when i developed software that would be deployed on all manner of hardware, i was a stickler that all my code be pure Java, so that i could be confident that my code would run wherever customers and users deployed it. In this day of SaaS, i’m less picky. I develop on the Mac. I test and run in production on Linux. Those are the systems i care about, so if my database has some platform-specific code in it to make it run fast and well, i’m fine with that. Just as long as that platform-specific configuration is not exposed to me as the developer. Simple Jar File for Inclusion in My Project I really just want one database jar file to add to my project. And i don’t want that jar file messing with my code or the dependencies i include in my project. If the database uses Guava 1.2, and i’m using Guava 0.8, that can mess me up. I want my database to not interfere with jars that i use by introducing newer or older versions of class files that i already reference in my project’s jars. Database Stored in a Directory on Disk I like to destroy my database by deleting a directory. I like to run multiple, simultaneous databases by configuring each database to use a separate directory. That makes me super productive during development, and it makes it more fun for me to program to a database. Faster Than a Rocket I think that’s just a given. My Latest Project That Needs a Database My latest project is Floify.com. Floify is a Mortgage Borrower Portal, automating the process of collecting mortgage loan documents from borrowers and emailing milestone loan status updates to real estate agents and borrowers. Mortgage loan originators use Floify to automate the labor-intensive parts of their loan processes. The web application receives about 500 unique visitors per month. Floify experienced 28% growth in january 2015. Floify’s vital statistics are: 38,301 loan documents under management 3,619 registered users 3,113 loan packages under management The Database I Chose for My Latest Project When i started Floify, i looked for a database that met all the criteria i’ve described above. I decided against databases that were server-based (Postgres, etc). I decided against databases that weren’t Java-based (MongoDB, etc). 
I decided against databases that didn’t support ACID transactions. I narrowed my choices to OrientDB and Neo4j. It’s been a couple years since that decision process occurred, but i distinctly remember a few reasons why i ultimately chose OrientDB over Neo4j: Performance benchmarks for OrientDB were very impressive. The OrientDB development team was very active. Cost. OrientDB is free. Neo4j cost more than what i was willing to pay or what i could afford. I forget which it was. My Favourite OrientDB Features Here are some of my favourite features in OrientDB. These are not competitive advantages to OrientDB. It’s just some of the things that make me happy when coding against an embeddable database. I can create the database in code. I don’t have to use SQL for querying, but most of the time, i do. I already know SQL, and it’s just easy for me. I use the document database, and it’s very pleasant inserting new documents in Java. I can store multi-megabyte binary objects directly in the database. My database is stored in a directory on disk. When scalability demands it, i can upgrade to a two-server distributed database. I haven’t been there yet. Speed. For me, OrientDB is very fast, and in the few years i’ve been using it, it’s become faster. OrientDB doesn’t come in a single jar file, as would be my ideal. I have to include a few different jars, but that’s an easy tradeoff for me. Future In the future, as Floify’s performance and scalability needs demand it, i’ll investigate a multi-server database configuration on OrientDB. In the meantime, i’m preparing to upgrade to OrientDB 2.0, which was recently released and promises even more speed. Go speed. :-)
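To make the embedded, directory-on-disk workflow described above concrete, here is a minimal sketch using OrientDB's classic document API from the 1.x/2.x era the article refers to. The database path, class name, and fields are illustrative assumptions, and newer OrientDB releases expose a different API, so treat this as a sketch rather than a drop-in example.

```java
import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.record.impl.ODocument;

public class EmbeddedOrientExample {
    public static void main(String[] args) {
        // "plocal" stores the database in a plain directory on disk,
        // so deleting that directory destroys the database.
        ODatabaseDocumentTx db = new ODatabaseDocumentTx("plocal:./databases/demo");
        if (!db.exists()) {
            db.create();
        } else {
            db.open("admin", "admin");
        }
        try {
            // Insert a document, then iterate over documents of that class.
            ODocument doc = new ODocument("Borrower");
            doc.field("name", "Jane Doe");
            doc.field("loanAmount", 250_000);
            doc.save();

            for (ODocument d : db.browseClass("Borrower")) {
                System.out.println(d.field("name") + " -> " + d.field("loanAmount"));
            }
        } finally {
            db.close();
        }
    }
}
```

The appeal the author describes is visible even in this small sketch: no server to start, and the whole database lives in one directory you can delete to start over.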
March 5, 2015
by Dave Sims
· 17,979 Views · 6 Likes
Using MongoDB with Hadoop & Spark: Part 2 - Hive Example
Originally Written by Matt Kalan Welcome to part two of our three-part series on MongoDB and Hadoop. In part one, we introduced Hadoop and how to set it up. In this post, we'll look at a Hive example. Introduction & Setup of Hadoop and MongoDB Hive Example Spark Example & Key Takeaways For more detail on the use case, see the first paragraph of part 1. Summary Use case: aggregating 1 minute intervals of stock prices into 5 minute intervals Input:: 1 minute stock prices intervals in a MongoDB database Simple Analysis: performed in: - Hive - Spark Output: 5 minute stock prices intervals in Hadoop Hive Example I ran the following example from the Hive command line (simply typing the command “hive” with no parameters), not Cloudera’s Hue editor, as that would have needed additional installation steps. I immediately noticed the criticism people have with Hive, that everything is compiled into MapReduce which takes considerable time. I ran most things with just 20 records to make the queries run quickly. This creates the definition of the table in Hive that matches the structure of the data in MongoDB. MongoDB has a dynamic schema for variable data shapes but Hive and SQL need a schema definition. CREATE EXTERNAL TABLE minute_bars ( id STRUCT, Symbol STRING, Timestamp STRING, Day INT, Open DOUBLE, High DOUBLE, Low DOUBLE, Close DOUBLE, Volume INT ) STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler' WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id", "Symbol":"Symbol", "Timestamp":"Timestamp", "Day":"Day", "Open":"Open", "High":"High", "Low":"Low", "Close":"Close", "Volume":"Volume"}') TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/marketdata.minbars'); Recent changes in the Apache Hive repo make the mappings necessary even if you are keeping the field names the same. This should be changed in the MongoDB Hadoop Connector soon if not already by the time you read this. Then I ran the following command to create a Hive table for the 5 minute bars: CREATE TABLE five_minute_bars ( id STRUCT, Symbol STRING, Timestamp STRING, Open DOUBLE, High DOUBLE, Low DOUBLE, Close DOUBLE ); This insert statement uses the SQL windowing functions to group 5 1-minute periods and determine the OHLC for the 5 minutes. There are definitely other ways to do this but here is one I figured out. Grouping in SQL is a little different from grouping in the MongoDB aggregation framework (in which you can pull the first and last of a group easily), so it took me a little while to remember how to do it with a subquery. The subquery takes each group of 5 1-minute records/documents, sorts them by time, and takes the open, high, low, and close price up to that record in each 5-minute period. Then the outside WHERE clause selects the last 1-minute bar in that period (because that row in the subquery has the correct OHLC information for its 5-minute period). I definitely welcome easier queries to understand but you can run the subquery by itself to see what it’s doing too. 
INSERT INTO TABLE five_minute_bars SELECT m.id, m.Symbol, m.OpenTime as Timestamp, m.Open, m.High, m.Low, m.Close FROM (SELECT id, Symbol, FIRST_VALUE(Timestamp) OVER ( PARTITION BY floor(unix_timestamp(Timestamp, 'yyyy-MM-dd HH:mm')/(5*60)) ORDER BY Timestamp) as OpenTime, LAST_VALUE(Timestamp) OVER ( PARTITION BY floor(unix_timestamp(Timestamp, 'yyyy-MM-dd HH:mm')/(5*60)) ORDER BY Timestamp) as CloseTime, FIRST_VALUE(Open) OVER ( PARTITION BY floor(unix_timestamp(Timestamp, 'yyyy-MM-dd HH:mm')/(5*60)) ORDER BY Timestamp) as Open, MAX(High) OVER ( PARTITION BY floor(unix_timestamp(Timestamp, 'yyyy-MM-dd HH:mm')/(5*60)) ORDER BY Timestamp) as High, MIN(Low) OVER ( PARTITION BY floor(unix_timestamp(Timestamp, 'yyyy-MM-dd HH:mm')/(5*60)) ORDER BY Timestamp) as Low, LAST_VALUE(Close) OVER ( PARTITION BY floor(unix_timestamp(Timestamp, 'yyyy-MM-dd HH:mm')/(5*60)) ORDER BY Timestamp) as Close FROM minute_bars) as m WHERE unix_timestamp(m.CloseTime, 'yyyy-MM-dd HH:mm') - unix_timestamp(m.OpenTime, 'yyyy-MM-dd HH:mm') = 60*4; I can definitely see the benefit of being able to use SQL to access data in MongoDB and optionally in other databases and file formats, all with the same commands, while the mapping differences are handled in the table declarations. The downside is that the latency is quite high, but that could be made up some with the ability to scale horizontally across many nodes. I think this is the appeal of Hive for most people - they can scale to very large data volumes using traditional SQL, and latency is not a primary concern. Post #3 in this blog series shows similar examples using Spark. Introduction & Setup of Hadoop and MongoDB Hive Example Spark Example & Key Takeaways To learn more, watch our video on MongoDB and Hadoop. We will take a deep dive into the MongoDB Connector for Hadoop and how it can be applied to enable new business insights. WATCH MONGODB & HADOOP << Read Part 1
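To clarify what the windowed query shown above computes, here is a small plain-Java sketch (not part of the original post) that rolls 1-minute bars up into 5-minute OHLC bars. The Bar fields mirror a subset of the table columns, and the bucketing by minute-index divided by 5 corresponds to the floor(unix_timestamp(...)/(5*60)) partitioning in the SQL; the sample data is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FiveMinuteRollup {

    // Mirrors the relevant columns of the minute_bars table (illustrative subset).
    record Bar(long epochMinute, double open, double high, double low, double close) {}

    static List<Bar> rollUp(List<Bar> minuteBars) {
        // Group 1-minute bars into 5-minute buckets, keyed by epochMinute / 5.
        Map<Long, List<Bar>> buckets = new TreeMap<>();
        for (Bar bar : minuteBars) {
            buckets.computeIfAbsent(bar.epochMinute() / 5, k -> new ArrayList<>()).add(bar);
        }

        List<Bar> result = new ArrayList<>();
        for (List<Bar> bucket : buckets.values()) {
            bucket.sort((a, b) -> Long.compare(a.epochMinute(), b.epochMinute()));
            double open = bucket.get(0).open();                           // FIRST_VALUE(Open)
            double close = bucket.get(bucket.size() - 1).close();         // LAST_VALUE(Close)
            double high = bucket.stream().mapToDouble(Bar::high).max().orElse(open);  // MAX(High)
            double low = bucket.stream().mapToDouble(Bar::low).min().orElse(open);    // MIN(Low)
            result.add(new Bar(bucket.get(0).epochMinute(), open, high, low, close));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Bar> bars = List.of(
                new Bar(0, 10, 11, 9, 10.5), new Bar(1, 10.5, 12, 10, 11),
                new Bar(2, 11, 11.5, 10.5, 11), new Bar(3, 11, 13, 11, 12.5),
                new Bar(4, 12.5, 12.8, 12, 12.6));
        rollUp(bars).forEach(System.out::println);
    }
}
```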
March 2, 2015
by Francesca Krihely
· 10,617 Views
Streaming Big Data: Storm, Spark and Samza
There are a number of distributed computation systems that can process Big Data in real time or near-real time. This article will start with a short description of three Apache frameworks, and attempt to provide a quick, high-level overview of some of their similarities and differences. Apache Storm In Storm, you design a graph of real-time computation called a topology, and feed it to the cluster where the master node will distribute the code among worker nodes to execute it. In a topology, data is passed around between spouts that emit data streams as immutable sets of key-value pairs called tuples, and bolts that transform those streams (count, filter etc.). Bolts themselves can optionally emit data to other bolts down the processing pipeline. Apache Spark Spark Streaming (an extension of the core Spark API) doesn’t process streams one at a time like Storm. Instead, it slices them in small batches of time intervals before processing them. The Spark abstraction for a continuous stream of data is called a DStream (for Discretized Stream). A DStream is a micro-batch of RDDs (Resilient Distributed Datasets). RDDs are distributed collections that can be operated in parallel by arbitrary functions and by transformations over a sliding window of data (windowed computations). Apache Samza Samza ’s approach to streaming is to process messages as they are received, one at a time. Samza’s stream primitive is not a tuple or a Dstream, but a message. Streams are divided into partitions and each partition is an ordered sequence of read-only messages with each message having a unique ID (offset). The system also supports batching, i.e. consuming several messages from the same stream partition in sequence. Samza`s Execution & Streaming modules are both pluggable, although Samza typically relies on Hadoop’s YARN (Yet Another Resource Negotiator) and Apache Kafka. Common Ground All three real-time computation systems are open-source, low-latency, distributed, scalable and fault-tolerant. They all allow you to run your stream processing code through parallel tasks distributed across a cluster of computing machines with fail-over capabilities. They also provide simple APIs to abstract the complexity of the underlying implementations. The three frameworks use different vocabularies for similar concepts: Comparison Matrix A few of the differences are summarized in the table below: There are three general categories of delivery patterns: At-most-once: messages may be lost. This is usually the least desirable outcome. At-least-once: messages may be redelivered (no loss, but duplicates). This is good enough for many use cases. Exactly-once: each message is delivered once and only once (no loss, no duplicates). This is a desirable feature although difficult to guarantee in all cases. Another aspect is state management. There are different strategies to store state. Spark Streaming writes data into the distributed file system (e.g. HDFS). Samza uses an embedded key-value store. With Storm, you’ll have to either roll your own state management at your application layer, or use a higher-level abstraction called Trident. Use Cases All three frameworks are particularly well-suited to efficiently process continuous, massive amounts of real-time data. So which one to use? There are no hard rules, at most a few general guidelines. If you want a high-speed event processing system that allows for incremental computations, Storm would be fine for that. 
If you further need to run distributed computations on demand, while the client is waiting synchronously for the results, you’ll have Distributed RPC (DRPC) out of the box. Last but not least, because Storm uses Apache Thrift, you can write topologies in any programming language. If you need state persistence and/or exactly-once delivery though, you should look at the higher-level Trident API, which also offers micro-batching. A few companies using Storm: Twitter, Yahoo!, Spotify, The Weather Channel...

Speaking of micro-batching, if you must have stateful computations and exactly-once delivery and don’t mind a higher latency, you could consider Spark Streaming, especially if you also plan for graph operations, machine learning or SQL access. The Apache Spark stack lets you combine several libraries with streaming (Spark SQL, MLlib, GraphX) and provides a convenient unifying programming model. In particular, streaming algorithms (e.g. streaming k-means) allow Spark to facilitate decisions in real time. A few companies using Spark: Amazon, Yahoo!, NASA JPL, eBay Inc., Baidu...

If you have a large amount of state to work with (e.g. many gigabytes per partition), Samza co-locates storage and processing on the same machines, allowing it to work efficiently with state that won’t fit in memory. The framework also offers flexibility with its pluggable API: its default execution, messaging and storage engines can each be replaced with your choice of alternatives. Moreover, if you have a number of data processing stages from different teams with different codebases, Samza's fine-grained jobs would be particularly well-suited, since they can be added and removed with minimal ripple effects. A few companies using Samza: LinkedIn, Intuit, Metamarkets, Quantiply, Fortscale...

Conclusion

We only scratched the surface of The Three Apaches. We didn’t cover a number of other features and more subtle differences between these frameworks. Also, it’s important to keep in mind the limits of the above comparisons, as these systems are constantly evolving.
February 28, 2015
by Tony Siciliani
· 32,250 Views · 5 Likes
Standing Up a Local Netflix Eureka
Here I will consider two different ways of standing up a local instance of Netflix Eureka. If you are not familiar with Eureka, it provides a central registry where (micro)services can register themselves, and client applications can use this registry to look up specific instances hosting a service and make the service calls.

Approach 1: Native Eureka Library

The first way is to simply use the archive file generated by the Netflix Eureka build process:

1. Clone the Eureka source repository here: https://github.com/Netflix/eureka
2. Run "./gradlew build" at the root of the repository. This should build cleanly, generating a war file in the eureka-server/build/libs folder.
3. Grab this file, rename it to "eureka.war" and place it in the webapps folder of either Tomcat or Jetty. For this exercise I have used Jetty.
4. Start Jetty. By default Jetty boots up on port 8080, but I wanted to bring it up on port 8761 instead, which you can do with "java -jar start.jar -Djetty.port=8761".

The server should start up cleanly and can be verified at this endpoint: "http://localhost:8761/eureka/v2/apps".

Approach 2: Spring-Cloud-Netflix

Spring-Cloud-Netflix provides a very neat way to bootstrap Eureka. To bring up a Eureka server using Spring-Cloud-Netflix, the approach that I followed was to clone the sample Eureka server application available here: https://github.com/spring-cloud-samples/eureka

1. Clone this repository.
2. From the root of the repository run "mvn spring-boot:run", and that is it!

The server should boot up cleanly and the REST endpoint should come up here: "http://localhost:8761/eureka/apps". As a bonus, Spring-Cloud-Netflix provides a neat UI at the root of the webapp, "http://localhost:8761/", showing the various applications that have registered with Eureka.

Just a few small issues to be aware of: the context URLs are a little different in the two cases ("eureka/v2/apps" vs "eureka/apps"); this can be adjusted in the configuration of the services which register with Eureka.

Conclusion

Your mileage with these approaches may vary. I have found Spring-Cloud-Netflix a little unstable at times, but it has mostly worked out well for me. The documentation at the Spring-Cloud site is also far more exhaustive than the one provided at the Netflix Eureka site.
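For completeness, here is roughly what the Spring Cloud sample in Approach 2 boils down to: a one-class Eureka server. This is a hedged sketch rather than the sample repository's exact code; the class name is an assumption, and it presumes spring-cloud-starter-eureka-server (or its modern equivalent) is on the classpath.

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.netflix.eureka.server.EnableEurekaServer;

@SpringBootApplication
@EnableEurekaServer   // turns this Spring Boot app into a Eureka registry
public class EurekaServerApplication {
    public static void main(String[] args) {
        SpringApplication.run(EurekaServerApplication.class, args);
    }
}
```

Typically the accompanying application.yml sets the server port to 8761 and switches off eureka.client.registerWithEureka and eureka.client.fetchRegistry, so the standalone registry does not try to register with itself.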
February 26, 2015
by Biju Kunjummen
· 13,045 Views
A JAXB Nuance: String Versus Enum from Enumerated Restricted XSD String
Although Java Architecture for XML Binding (JAXB) is fairly easy to use in nominal cases (especially since Java SE 6), it also presents numerous nuances. Some of the common nuances are due to the inability to exactlymatch (bind) XML Schema Definition (XSD) types to Java types. This post looks at one specific example of this that also demonstrates how different XSD constructs that enforce the same XML structure can lead to different Java types when the JAXB compiler generates the Java classes. The next code listing, for Food.xsd, defines a schema for food types. The XSD mandates that valid XML will have a root element called "Food" with three nested elements "Vegetable", "Fruit", and "Dessert". Although the approach used to specify the "Vegetable" and "Dessert" elements is different than the approach used to specify the "Fruit" element, both approaches result in similar "valid XML." The "Vegetable" and "Dessert" elements are declared directly as elements of the prescribed simpleTypes defined later in the XSD. The "Fruit" element is defined via reference (ref=) to another defined element that consists of a simpleType. Food.xsd Although Vegetable and Dessert elements are defined in the schema differently than Fruit, the resulting valid XML is the same. A valid XML file is shown next in the code listing for food1.xml. food1.xml Spinach Watermelon Pie At this point, I'll use a simple Groovy script to validate the above XML against the above XSD. The code for this Groovy XML validation script (validateXmlAgainstXsd.groovy) is shown next. validateXmlAgainstXsd.groovy #!/usr/bin/env groovy // validateXmlAgainstXsd.groovy // // Accepts paths/names of two files. The first is the XML file to be validated // and the second is the XSD against which to validate that XML. if (args.length < 2) { println "USAGE: groovy validateXmlAgainstXsd.groovy " System.exit(-1) } String xml = args[0] String xsd = args[1] import javax.xml.validation.Schema import javax.xml.validation.SchemaFactory import javax.xml.validation.Validator try { SchemaFactory schemaFactory = SchemaFactory.newInstance(javax.xml.XMLConstants.W3C_XML_SCHEMA_NS_URI) Schema schema = schemaFactory.newSchema(new File(xsd)) Validator validator = schema.newValidator() validator.validate(new javax.xml.transform.stream.StreamSource(xml)) } catch (Exception exception) { println "\nERROR: Unable to validate ${xml} against ${xsd} due to '${exception}'\n" System.exit(-1) } println "\nXML file ${xml} validated successfully against ${xsd}.\n" The next screen snapshot demonstrates running the above Groovy XML validation script against food1.xmland Food.xsd. The objective of this post so far has been to show how different approaches in an XSD can lead to the same XML being valid. Although these different XSD approaches prescribe the same valid XML, they lead to different Java class behavior when JAXB is used to generate classes based on the XSD. The next screen snapshot demonstrates running the JDK-provided JAXB xjc compiler against the Food.xsd to generate the Java classes. The output from the JAXB generation shown above indicates that Java classes were created for the "Vegetable" and "Dessert" elements but not for the "Fruit" element. This is because "Vegetable" and "Dessert" were defined differently than "Fruit" in the XSD. The next code listing is for the Food.java class generated by the xjc compiler. 
From this we can see that the generated Food.java class references specific generated Java types for Vegetable and Dessert, but references simply a generic Java String for Fruit. Food.java (generated by JAXB jxc compiler) // // This file was generated by the JavaTM Architecture for XML Binding(JAXB) Reference Implementation, v2.2.8-b130911.1802 // See http://java.sun.com/xml/jaxb // Any modifications to this file will be lost upon recompilation of the source schema. // Generated on: 2015.02.11 at 10:17:32 PM MST // package com.blogspot.marxsoftware.foodxml; import javax.xml.bind.annotation.XmlAccessType; import javax.xml.bind.annotation.XmlAccessorType; import javax.xml.bind.annotation.XmlElement; import javax.xml.bind.annotation.XmlRootElement; import javax.xml.bind.annotation.XmlSchemaType; import javax.xml.bind.annotation.XmlType; /** * Java class for anonymous complex type. * * The following schema fragment specifies the expected content contained within this class. * * * * * * * * * * * * * * * * */ @XmlAccessorType(XmlAccessType.FIELD) @XmlType(name = "", propOrder = { "vegetable", "fruit", "dessert" }) @XmlRootElement(name = "Food") public class Food { @XmlElement(name = "Vegetable", required = true) @XmlSchemaType(name = "string") protected Vegetable vegetable; @XmlElement(name = "Fruit", required = true) protected String fruit; @XmlElement(name = "Dessert", required = true) @XmlSchemaType(name = "string") protected Dessert dessert; /** * Gets the value of the vegetable property. * * @return * possible object is * {@link Vegetable } * */ public Vegetable getVegetable() { return vegetable; } /** * Sets the value of the vegetable property. * * @param value * allowed object is * {@link Vegetable } * */ public void setVegetable(Vegetable value) { this.vegetable = value; } /** * Gets the value of the fruit property. * * @return * possible object is * {@link String } * */ public String getFruit() { return fruit; } /** * Sets the value of the fruit property. * * @param value * allowed object is * {@link String } * */ public void setFruit(String value) { this.fruit = value; } /** * Gets the value of the dessert property. * * @return * possible object is * {@link Dessert } * */ public Dessert getDessert() { return dessert; } /** * Sets the value of the dessert property. * * @param value * allowed object is * {@link Dessert } * */ public void setDessert(Dessert value) { this.dessert = value; } } The advantage of having specific Vegetable and Dessert classes is the additional type safety they bring as compared to a general Java String. Both Vegetable.java and Dessert.java are actually enums because they come from enumerated values in the XSD. The two generated enums are shown in the next two code listings. Vegetable.java (generated with JAXB xjc compiler) // // This file was generated by the JavaTM Architecture for XML Binding(JAXB) Reference Implementation, v2.2.8-b130911.1802 // See http://java.sun.com/xml/jaxb // Any modifications to this file will be lost upon recompilation of the source schema. // Generated on: 2015.02.11 at 10:17:32 PM MST // package com.blogspot.marxsoftware.foodxml; import javax.xml.bind.annotation.XmlEnum; import javax.xml.bind.annotation.XmlEnumValue; import javax.xml.bind.annotation.XmlType; /** * Java class for Vegetable. * * The following schema fragment specifies the expected content contained within this class. 
* * * * * * * * * * * * */ @XmlType(name = "Vegetable") @XmlEnum public enum Vegetable { @XmlEnumValue("Carrot") CARROT("Carrot"), @XmlEnumValue("Squash") SQUASH("Squash"), @XmlEnumValue("Spinach") SPINACH("Spinach"), @XmlEnumValue("Celery") CELERY("Celery"); private final String value; Vegetable(String v) { value = v; } public String value() { return value; } public static Vegetable fromValue(String v) { for (Vegetable c: Vegetable.values()) { if (c.value.equals(v)) { return c; } } throw new IllegalArgumentException(v); } } Dessert.java (generated with JAXB xjc compiler) // // This file was generated by the JavaTM Architecture for XML Binding(JAXB) Reference Implementation, v2.2.8-b130911.1802 // See http://java.sun.com/xml/jaxb // Any modifications to this file will be lost upon recompilation of the source schema. // Generated on: 2015.02.11 at 10:17:32 PM MST // package com.blogspot.marxsoftware.foodxml; import javax.xml.bind.annotation.XmlEnum; import javax.xml.bind.annotation.XmlEnumValue; import javax.xml.bind.annotation.XmlType; /** * Java class for Dessert. * * The following schema fragment specifies the expected content contained within this class. * * * * * * * * * * * */ @XmlType(name = "Dessert") @XmlEnum public enum Dessert { @XmlEnumValue("Pie") PIE("Pie"), @XmlEnumValue("Cake") CAKE("Cake"), @XmlEnumValue("Ice Cream") ICE_CREAM("Ice Cream"); private final String value; Dessert(String v) { value = v; } public String value() { return value; } public static Dessert fromValue(String v) { for (Dessert c: Dessert.values()) { if (c.value.equals(v)) { return c; } } throw new IllegalArgumentException(v); } } Having enums generated for the XML elements ensures that only valid values for those elements can be represented in Java. Conclusion JAXB makes it relatively easy to map Java to XML, but because there is not a one-to-one mapping between Java and XML types, there can be some cases where the generated Java type for a particular XSD prescribed element is not obvious. This post has shown how two different approaches to building an XSD to enforce the same basic XML structure can lead to very different results in the Java classes generated with the JAXB xjccompiler. In the example shown in this post, declaring elements in the XSD directly on simpleTypes restricting XSD's string to a specific set of enumerated values is preferable to declaring elements as references to other elements wrapping a simpleType of restricted string enumerated values because of the type safety that is achieved when enums are generated rather than use of general Java Strings.
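To round out the example, here is a short, hedged sketch of how the generated classes might be used to unmarshal food1.xml and benefit from the enum type safety discussed above. The file name and the generated package are taken from the post; the rest (class name, output format) is illustrative.

```java
import java.io.File;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;

import com.blogspot.marxsoftware.foodxml.Food;
import com.blogspot.marxsoftware.foodxml.Vegetable;

public class FoodUnmarshalDemo {
    public static void main(String[] args) throws Exception {
        JAXBContext context = JAXBContext.newInstance(Food.class);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        Food food = (Food) unmarshaller.unmarshal(new File("food1.xml"));

        // Vegetable and Dessert come back as enums, so invalid values fail fast...
        Vegetable vegetable = food.getVegetable();
        // ...while Fruit is just a String, with no compile-time checking at all.
        String fruit = food.getFruit();

        System.out.println(vegetable + " / " + fruit + " / " + food.getDessert());
    }
}
```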
February 25, 2015
by Dustin Marx
· 21,012 Views · 1 Like
Redirecting All Kinds of stdout in Python
A common task in Python (especially while testing or debugging) is to redirect sys.stdout to a stream or a file while executing some piece of code. However, simply "redirecting stdout" is sometimes not as easy as one would expect; hence the slightly strange title of this post. In particular, things become interesting when you want C code running within your Python process (including, but not limited to, Python modules implemented as C extensions) to also have its stdout redirected according to your wish. This turns out to be tricky and leads us into the interesting world of file descriptors, buffers and system calls. But let's start with the basics. Pure Python The simplest case arises when the underlying Python code writes to stdout, whether by calling print, sys.stdout.write or some equivalent method. If the code you have does all its printing from Python, redirection is very easy. With Python 3.4 we even have a built-in tool in the standard library for this purpose - contextlib.redirect_stdout. Here's how to use it: from contextlib import redirect_stdout f = io.StringIO() with redirect_stdout(f): print('foobar') print(12) print('Got stdout: "{0}"'.format(f.getvalue())) When this code runs, the actual print calls within the with block don't emit anything to the screen, and you'll see their output captured by in the stream f. Incidentally, note how perfect the with statement is for this goal - everything within the block gets redirected; once the block is done, things are cleaned up for you and redirection stops. If you're stuck on an older and uncool Python, prior to 3.4 [1], what then? Well, redirect_stdout is really easy to implement on your own. I'll change its name slightly to avoid confusion: from contextlib import contextmanager @contextmanager def stdout_redirector(stream): old_stdout = sys.stdout sys.stdout = stream try: yield finally: sys.stdout = old_stdout So we're back in the game: f = io.StringIO() with stdout_redirector(f): print('foobar') print(12) print('Got stdout: "{0}"'.format(f.getvalue())) Redirecting C-level streams Now, let's take our shiny redirector for a more challenging ride: import ctypes libc = ctypes.CDLL(None) f = io.StringIO() with stdout_redirector(f): print('foobar') print(12) libc.puts(b'this comes from C') os.system('echo and this is from echo') print('Got stdout: "{0}"'.format(f.getvalue())) I'm using ctypes to directly invoke the C library's puts function [2]. This simulates what happens when C code called from within our Python code prints to stdout - the same would apply to a Python module using a C extension. Another addition is the os.system call to invoke a subprocess that also prints to stdout. What we get from this is: this comes from C and this is from echo Got stdout: "foobar 12 " Err... no good. The prints got redirected as expected, but the output from puts and echo flew right past our redirector and ended up in the terminal without being caught. What gives? To grasp why this didn't work, we have to first understand what sys.stdout actually is in Python. Detour - on file descriptors and streams This section dives into some internals of the operating system, the C library, and Python [3]. If you just want to know how to properly redirect printouts from C in Python, you can safely skip to the next section (though understanding how the redirection works will be difficult). 
Files are opened by the OS, which keeps a system-wide table of open files, some of which may point to the same underlying disk data (two processes can have the same file open at the same time, each reading from a different place, etc.) File descriptors are another abstraction, which is managed per-process. Each process has its own table of open file descriptors that point into the system-wide table. Here's a schematic, taken from The Linux Programming Interface: File descriptors allow sharing open files between processes (for example when creating child processes with fork). They're also useful for redirecting from one entry to another, which is relevant to this post. Suppose that we make file descriptor 5 a copy of file descriptor 4. Then all writes to 5 will behave in the same way as writes to 4. Coupled with the fact that the standard output is just another file descriptor on Unix (usually index 1), you can see where this is going. The full code is given in the next section. File descriptors are not the end of the story, however. You can read and write to them with the read and write system calls, but this is not the way things are typically done. The C runtime library provides a convenient abstraction around file descriptors - streams. These are exposed to the programmer as the opaque FILE structure with a set of functions that act on it (for example fprintf and fgets). FILE is a fairly complex structure, but the most important things to know about it is that it holds a file descriptor to which the actual system calls are directed, and it provides buffering, to ensure that the system call (which is expensive) is not called too often. Suppose you emit stuff to a binary file, a byte or two at a time. Unbuffered writes to the file descriptor with write would be quite expensive because each write invokes a system call. On the other hand, using fwrite is much cheaper because the typicall call to this function just copies your data into its internal buffer and advances a pointer. Only occasionally (depending on the buffer size and flags) will an actual write system call be issued. With this information in hand, it should be easy to understand what stdout actually is for a C program. stdout is a global FILE object kept for us by the C library, and it buffers output to file descriptor number 1. Calls to functions like printf and puts add data into this buffer. fflush forces its flushing to the file descriptor, and so on. But we're talking about Python here, not C. So how does Python translate calls to sys.stdout.write to actual output? Python uses its own abstraction over the underlying file descriptor - a file object. Moreover, in Python 3 this file object is further wrapper in an io.TextIOWrapper, because what we pass to print is a Unicode string, but the underlying write system calls accept binary data, so encoding has to happen en route. The important take-away from this is: Python and a C extension loaded by it (this is similarly relevant to C code invoked via ctypes) run in the same process, and share the underlying file descriptor for standard output. However, while Python has its own high-level wrapper around it - sys.stdout, the C code uses its own FILE object. Therefore, simply replacing sys.stdout cannot, in principle, affect output from C code. To make the replacement deeper, we have to touch something shared by the Python and C runtimes - the file descriptor. 
Redirecting with file descriptor duplication Without further ado, here is an improved stdout_redirector that also redirects output from C code [4]: from contextlib import contextmanager import ctypes import io import os, sys import tempfile libc = ctypes.CDLL(None) c_stdout = ctypes.c_void_p.in_dll(libc, 'stdout') @contextmanager def stdout_redirector(stream): # The original fd stdout points to. Usually 1 on POSIX systems. original_stdout_fd = sys.stdout.fileno() def _redirect_stdout(to_fd): """Redirect stdout to the given file descriptor.""" # Flush the C-level buffer stdout libc.fflush(c_stdout) # Flush and close sys.stdout - also closes the file descriptor (fd) sys.stdout.close() # Make original_stdout_fd point to the same file as to_fd os.dup2(to_fd, original_stdout_fd) # Create a new sys.stdout that points to the redirected fd sys.stdout = io.TextIOWrapper(os.fdopen(original_stdout_fd, 'wb')) # Save a copy of the original stdout fd in saved_stdout_fd saved_stdout_fd = os.dup(original_stdout_fd) try: # Create a temporary file and redirect stdout to it tfile = tempfile.TemporaryFile(mode='w+b') _redirect_stdout(tfile.fileno()) # Yield to caller, then redirect stdout back to the saved fd yield _redirect_stdout(saved_stdout_fd) # Copy contents of temporary file to the given stream tfile.flush() tfile.seek(0, io.SEEK_SET) stream.write(tfile.read()) finally: tfile.close() os.close(saved_stdout_fd) There are a lot of details here (such as managing the temporary file into which output is redirected) that may obscure the key approach: using dup and dup2 to manipulate file descriptors. These functions let us duplicate file descriptors and make any descriptor point at any file. I won't spend more time on them - go ahead and read their documentation, if you're interested. The detour section should provide enough background to understand it. Let's try this: f = io.BytesIO() with stdout_redirector(f): print('foobar') print(12) libc.puts(b'this comes from C') os.system('echo and this is from echo') print('Got stdout: "{0}"'.format(f.getvalue().decode('utf-8'))) Gives us: Got stdout: "and this is from echo this comes from C foobar 12 " Success! A few things to note: The output order may not be what we expected. This is due to buffering. If it's important to preserve order between different kinds of output (i.e. between C and Python), further work is required to disable buffering on all relevant streams. You may wonder why the output of echo was redirected at all? The answer is that file descriptors are inherited by subprocesses. Since we rigged fd 1 to point to our file instead of the standard output prior to forking to echo, this is where its output went. We use a BytesIO here. This is because on the lowest level, the file descriptors are binary. It may be possible to do the decoding when copying from the temporary file into the given stream, but that can hide problems. Python has its in-memory understanding of Unicode, but who knows what is the right encoding for data printed out from underlying C code? This is why this particular redirection approach leaves the decoding to the caller. The above also makes this code specific to Python 3. There's no magic involved, and porting to Python 2 is trivial, but some assumptions made here don't hold (such as sys.stdout being a io.TextIOWrapper). Redirecting the stdout of a child process We've just seen that the file descriptor duplication approach lets us grab the output from child processes as well. 
But it may not always be the most convenient way to achieve this task. In the general case, you typically use the subprocess module to launch child processes, and you may launch several such processes either in a pipe or separately. Some programs will even juggle multiple subprocesses launched this way in different threads. Moreover, while these subprocesses are running you may want to emit something to stdout and you don't want this output to be captured. So, managing the stdout file descriptor in the general case can be messy; it is also unnecessary, because there's a much simpler way. The subprocess module's swiss knife Popen class (which serve as the basis for much of the rest of the module) accepts a stdout parameter, which we can use to ask it to get access to the child's stdout: import subprocess echo_cmd = ['echo', 'this', 'comes', 'from', 'echo'] proc = subprocess.Popen(echo_cmd, stdout=subprocess.PIPE) output = proc.communicate()[0] print('Got stdout:', output) The subprocess.PIPE argument can be used to set up actual child process pipes (a la the shell), but in its simplest incarnation it captures the process's output. If you only launch a single child process at a time and are interested in its output, there's an even simpler way: output = subprocess.check_output(echo_cmd) print('Got stdout:', output) check_output will capture and return the child's standard output to you; it will also raise an exception if the child exist with a non-zero return code. Conclusion I hope I covered most of the common cases where "stdout redirection" is needed in Python. Naturally, all of the same applies to the other standard output stream - stderr. Also, I hope the background on file descriptors was sufficiently clear to explain the redirection code; squeezing this topic in such a short space is challenging. Let me know if any questions remain or if there's something I could have explained better. Finally, while it is conceptually simple, the code for the redirector is quite long; I'll be happy to hear if you find a shorter way to achieve the same effect. [1] Do not despair. As of February 2015, a sizable chunk of the worldwide Python programmers are in the same boat. [2] Note that bytes passed to puts. This being Python 3, we have to be careful since libc doesn't understand Python's unicode strings. [3] The following description focuses on Unix/POSIX systems; also, it's necessarily partial. Large book chapters have been written on this topic - I'm just trying to present some key concepts relevant to stream redirection. [4] The approach taken here is inspired by this Stack Overflow answer.
February 23, 2015
by Eli Bendersky
· 18,753 Views
Sneak Peek into the JCache API (JSR 107)
This post covers the JCache API at a high level and provides a teaser – just enough for you to (hopefully) start itching to use it. It covers: a JCache overview; the JCache API and its implementations; supported Java platforms; a quick look at Oracle Coherence; and some fun stuff – Project Headlands (RESTified JCache by Adam Bien), JCache-related talks at JavaOne 2014, and links to resources for learning more about JCache.

What is JCache?

JCache (JSR 107) is a standard caching API for Java. It provides an API for applications to create and work with an in-memory cache of objects. The benefit is obvious – you do not need to concentrate on the finer details of implementing caching, and your time is better spent on the core business logic of the application.

JCache components

The specification itself is compact and surprisingly intuitive. The API defines the following high-level components (interfaces), among others:

- Caching Provider – used to control Cache Managers and can deal with several of them
- Cache Manager – deals with create, read, and destroy operations on a Cache
- Cache – stores entries (the actual data) and exposes CRUD operations on them
- Entry – an abstraction on top of a key-value pair, akin to a java.util.Map entry

JCache implementations

JCache defines the interfaces, which are implemented by different vendors, a.k.a. providers:

- Oracle Coherence
- Hazelcast
- Infinispan
- Ehcache
- Reference Implementation – intended for reference rather than production use; it is true to the specification, though, and you can rest assured that it passes the TCK.

From the application's point of view, all that is required is for the implementation to be present on the classpath. The API also provides a way to fine-tune provider-specific properties via standard mechanisms. You can track the list of JCache implementations on the JCP website.

```java
public class JCacheUsage {
    public static void main(String[] args) {
        // bootstrap the JCache provider
        CachingProvider jcacheProvider = Caching.getCachingProvider();
        CacheManager jcacheManager = jcacheProvider.getCacheManager();

        // configure the cache
        MutableConfiguration<String, MyPreciousObject> jcacheConfig = new MutableConfiguration<>();
        jcacheConfig.setTypes(String.class, MyPreciousObject.class);

        // create the cache
        Cache<String, MyPreciousObject> cache =
                jcacheManager.createCache("PreciousObjectCache", jcacheConfig);

        // play around
        String key = UUID.randomUUID().toString();
        cache.put(key, new MyPreciousObject());
        MyPreciousObject inserted = cache.get(key);
        cache.remove(key);
        cache.get(key); // returns null since the key no longer exists
    }
}
```

JCache provider detection

Provider detection happens automatically when there is only a single JCache provider on the classpath. You can also choose from the options below:

```java
// JVM-level system property
-Djavax.cache.spi.cachingprovider=org.ehcache.jcache.JCacheCachingProvider

// code-level configuration
System.setProperty("javax.cache.spi.cachingprovider",
        "org.ehcache.jcache.JCacheCachingProvider");

// choose from multiple JCache providers at runtime
CachingProvider ehcacheJCacheProvider =
        Caching.getCachingProvider("org.ehcache.jcache.JCacheCachingProvider");

// which JCache providers do I have on the classpath?
Iterable<CachingProvider> jcacheProviders = Caching.getCachingProviders();
```

Java platform support

- Compliant with Java SE 6 and above.
- Does not define any details in terms of Java EE integration. This does not mean it cannot be used in a Java EE environment – it is just not standardized yet.
- Could not be plugged into Java EE 7 as a tried and tested standard; it is a candidate for Java EE 8.

Project Headlands: Java EE and JCache in tandem

- By none other than Adam Bien himself!
- Java EE 7, Java SE 8, and JCache in action.
- Exposes the JCache API via JAX-RS (REST).
- Uses Hazelcast as the JCache provider.
- Highly recommended!

Oracle Coherence

This post deals with JCache at a high level, but a few lines about Oracle Coherence help put things in perspective:

- Oracle Coherence is part of Oracle's Cloud Application Foundation stack.
- It is primarily an in-memory data grid solution.
- It is geared towards making applications more scalable in general.
- From version 12.1.3 onwards, Oracle Coherence includes an implementation of JCache (more in the next section).

JCache support in Oracle Coherence

Support for JCache means applications can now use a standard API to access the capabilities of Oracle Coherence. Coherence achieves this by providing an abstraction over its existing interfaces (NamedCache, etc.): the application deals with the standard JCache API, and calls to that API are delegated to the existing Coherence core library. It also means you do not need Coherence-specific APIs in your application, resulting in vendor-neutral – and therefore portable – code. How ironic – supporting a standard API keeps your competitors in the hunt. But hey, that is what healthy competition and quality software are all about! That said, Oracle Coherence supports a host of other features in addition to the standard JCache capabilities.

The Oracle Coherence distribution contains all the libraries needed to work with its JCache implementation; the service definition file in coherence-jcache.jar qualifies it as a valid JCache provider. Curious about Oracle Coherence? See the Quick Starter page, the documentation, and the installation guide. For further reading about the Coherence and JCache combination, see the Oracle Coherence documentation.

JCache at JavaOne 2014

A couple of great talks revolved around JCache at JavaOne 2014:

- Come, Code, Cache, Compute! by Steve Millidge
- Using the New JCache by Brian Oliver and Greg Luck

Hope this was fun. Cheers!
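Beyond what the post shows, most real-world caches also need an expiry policy. The sketch below is not from the original post (the cache name and types are made up); it uses the standard javax.cache.expiry API to discard entries one minute after creation:

```java
import javax.cache.Cache;
import javax.cache.CacheManager;
import javax.cache.Caching;
import javax.cache.configuration.MutableConfiguration;
import javax.cache.expiry.CreatedExpiryPolicy;
import javax.cache.expiry.Duration;

public class ExpiringCacheSketch {
    public static void main(String[] args) {
        CacheManager manager = Caching.getCachingProvider().getCacheManager();

        // entries are considered expired one minute after they are created
        MutableConfiguration<String, String> config = new MutableConfiguration<String, String>()
                .setTypes(String.class, String.class)
                .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(Duration.ONE_MINUTE));

        Cache<String, String> tokens = manager.createCache("TokenCache", config);
        tokens.put("abc", "some-value");

        // within the first minute this prints "some-value"; afterwards it prints null
        System.out.println(tokens.get("abc"));
    }
}
```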
February 23, 2015
by Abhishek Gupta DZone Core
· 5,884 Views · 1 Like
article thumbnail
Retry-After HTTP Header in Practice
Retry-After is a lesser-known HTTP response header.
February 20, 2015
by Tomasz Nurkiewicz DZone Core
· 15,860 Views
article thumbnail
Getting Started with Dropwizard: Authentication, Configuration and HTTPS
Basic Authentication is the simplest way to secure access to a resource.
February 10, 2015
by Dmitry Noranovich
· 45,919 Views · 1 Like
article thumbnail
The API Gateway Pattern: Angular JS and Spring Security Part IV
Written by Dave Syer on the Spring blog.

In this article we continue our discussion of how to use Spring Security with AngularJS in a "single page application". Here we show how to build an API Gateway to control authentication and access to the backend resources using Spring Cloud. This is the fourth in a series of articles; you can catch up on the basic building blocks of the application or build it from scratch by reading the first article, or you can just go straight to the source code on GitHub. In the last article we built a simple distributed application that used Spring Session to authenticate the backend resources. In this one we make the UI server into a reverse proxy to the backend resource server, fixing the issues with the last implementation (technical complexity introduced by custom token authentication) and giving us a lot of new options for controlling access from the browser client.

Reminder: if you are working through this article with the sample application, be sure to clear your browser cache of cookies and HTTP Basic credentials. In Chrome the best way to do that for a single server is to open a new incognito window.

Creating an API Gateway

An API Gateway is a single point of entry (and control) for front-end clients, which could be browser based (like the examples in this article) or mobile. The client only has to know the URL of one server, and the backend can be refactored at will with no change, which is a significant advantage. There are other advantages in terms of centralization and control: rate limiting, authentication, auditing and logging. And implementing a simple reverse proxy is really simple with Spring Cloud.

If you were following along in the code, you will know that the application implementation at the end of the last article was a bit complicated, so it's not a great place to iterate away from. There was, however, a halfway point which we could start from more easily, where the backend resource wasn't yet secured with Spring Security. The source code for this is a separate project on GitHub, so we are going to start from there. It has a UI server and a resource server and they are talking to each other. The resource server doesn't have Spring Security yet, so we can get the system working first and then add that layer.

Declarative Reverse Proxy in One Line

To turn it into an API Gateway, the UI server needs one small tweak. Somewhere in the Spring configuration we need to add an @EnableZuulProxy annotation, e.g. in the main (and only) application class:

```java
@SpringBootApplication
@RestController
@EnableZuulProxy
public class UiApplication {
    ...
}
```

and in an external configuration file ("application.yml") we need to map a local resource in the UI server to a remote one:

```yaml
security:
  ...
zuul:
  routes:
    resource:
      path: /resource/**
      url: http://localhost:9000
```

This says "map paths with the pattern /resource/** in this server to the same paths in the remote server at localhost:9000". Simple and yet effective (OK, so it's six lines including the YAML, but you don't always need that)!

All we need to make this work is the right stuff on the classpath. For that purpose we have a few new lines in our Maven POM:

```xml
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.springframework.cloud</groupId>
      <artifactId>spring-cloud-starter-parent</artifactId>
      <version>1.0.0.BUILD-SNAPSHOT</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-zuul</artifactId>
  </dependency>
  ...
</dependencies>
```

Note the use of "spring-cloud-starter-zuul" – it's a starter POM just like the Spring Boot ones, but it governs the dependencies we need for this Zuul proxy. We are also using dependency management (the imported BOM above) because we want to be able to depend on all the versions of transitive dependencies being correct.

Consuming the Proxy in the Client

With those changes in place our application still works, but we haven't actually used the new proxy until we modify the client. Fortunately that's trivial. We just need to go from this implementation of the "home" controller:

```javascript
angular.module('hello', [ 'ngRoute' ])
  ...
  .controller('home', function($scope, $http) {
    $http.get('http://localhost:9000/').success(function(data) {
      $scope.greeting = data;
    })
  });
```

to a local resource:

```javascript
angular.module('hello', [ 'ngRoute' ])
  ...
  .controller('home', function($scope, $http) {
    $http.get('resource/').success(function(data) {
      $scope.greeting = data;
    })
  });
```

Now when we fire up the servers everything is working and the requests are being proxied through the UI (API Gateway) to the resource server.

Further Simplifications

Even better: we don't need the CORS filter any more in the resource server. We threw that one together pretty quickly anyway, and it should have been a red light that we had to do anything as technically focused by hand (especially where it concerns security). Fortunately it is now redundant, so we can just throw it away, and go back to sleeping at night!

Securing the Resource Server

You might remember that in the intermediate state we started from, there is no security in place for the resource server.

Aside: lack of software security might not even be a problem if your network architecture mirrors the application architecture (you can just make the resource server physically inaccessible to anyone but the UI server). As a simple demonstration of that, we can make the resource server accessible only on localhost. Just add this to application.properties in the resource server:

```
server.address: 127.0.0.1
```

Wow, that was easy! Do that with a network address that's only visible in your data center and you have a security solution that works for all resource servers and all user desktops.

Suppose that we decide we do need security at the software level (quite likely for a number of reasons). That's not going to be a problem, because all we need to do is add Spring Security as a dependency (in the resource server POM):

```xml
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-security</artifactId>
</dependency>
```

That's enough to get us a secure resource server, but it won't get us a working application yet, for the same reason that it didn't in Part III: there is no shared authentication state between the two servers.

Sharing Authentication State

We can use the same mechanism to share authentication (and CSRF) state as we did in the last article, i.e. Spring Session. We add the dependency to both servers as before:

```xml
<dependency>
  <groupId>org.springframework.session</groupId>
  <artifactId>spring-session</artifactId>
  <version>1.0.0.RELEASE</version>
</dependency>
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-redis</artifactId>
</dependency>
```

but this time the configuration is much simpler because we can just add the same filter declaration to both. First the UI server (adding @EnableRedisHttpSession):

```java
@SpringBootApplication
@RestController
@EnableZuulProxy
@EnableRedisHttpSession
public class UiApplication {
    ...
}
```

and then the resource server. There are two changes to make: one is adding @EnableRedisHttpSession and a HeaderHttpSessionStrategy bean to the ResourceApplication:

```java
@SpringBootApplication
@RestController
@EnableRedisHttpSession
class ResourceApplication {
    ...

    @Bean
    HeaderHttpSessionStrategy sessionStrategy() {
        return new HeaderHttpSessionStrategy();
    }
}
```

and the other is to explicitly ask for a non-stateless session creation policy in application.properties:

```
security.sessions: NEVER
```

As long as Redis is still running in the background (use the fig.yml if you like to start it), the system will work. Load the home page for the UI at http://localhost:8080, log in, and you will see the message from the backend rendered on the home page.

How Does it Work?

What is going on behind the scenes now? First we can look at the HTTP requests in the UI server (and API Gateway):

```
VERB  PATH                        STATUS  RESPONSE
GET   /                           200     index.html
GET   /css/angular-bootstrap.css  200     Twitter bootstrap CSS
GET   /js/angular-bootstrap.js    200     Bootstrap and Angular JS
GET   /js/hello.js                200     Application logic
GET   /user                       302     Redirect to login page
GET   /login                      200     Whitelabel login page (ignored)
GET   /resource                   302     Redirect to login page
GET   /login                      200     Whitelabel login page (ignored)
GET   /login.html                 200     Angular login form partial
POST  /login                      302     Redirect to home page (ignored)
GET   /user                       200     JSON authenticated user
GET   /resource                   200     (Proxied) JSON greeting
```

That's identical to the sequence at the end of Part II except for the fact that the cookie names are slightly different ("SESSION" instead of "JSESSIONID") because we are using Spring Session. But the architecture is different, and that last request to "/resource" is special because it was proxied to the resource server.

We can see the reverse proxy in action by looking at the "/trace" endpoint in the UI server (from Spring Boot Actuator, which we added with the Spring Cloud dependencies). Go to http://localhost:8080/trace in a browser and scroll to the end (if you don't have one already, get a JSON plugin for your browser to make it nice and readable). You will need to authenticate with HTTP Basic (browser popup), but the same credentials are valid as for your login form. At or near the end you should see a pair of requests something like this:

```json
{
  "timestamp": 1420558194546,
  "info": {
    "method": "GET",
    "path": "/",
    "query": "",
    "remote": true,
    "proxy": "resource",
    "headers": {
      "request": {
        "accept": "application/json, text/plain, */*",
        "x-xsrf-token": "542c7005-309c-4f50-8a1d-d6c74afe8260",
        "cookie": "SESSION=c18846b5-f805-4679-9820-cd13bd83be67; XSRF-TOKEN=542c7005-309c-4f50-8a1d-d6c74afe8260",
        "x-forwarded-prefix": "/resource",
        "x-forwarded-host": "localhost:8080"
      },
      "response": {
        "Content-Type": "application/json;charset=UTF-8",
        "status": "200"
      }
    }
  }
},
{
  "timestamp": 1420558200232,
  "info": {
    "method": "GET",
    "path": "/resource/",
    "headers": {
      "request": {
        "host": "localhost:8080",
        "accept": "application/json, text/plain, */*",
        "x-xsrf-token": "542c7005-309c-4f50-8a1d-d6c74afe8260",
        "cookie": "SESSION=c18846b5-f805-4679-9820-cd13bd83be67; XSRF-TOKEN=542c7005-309c-4f50-8a1d-d6c74afe8260"
      },
      "response": {
        "Content-Type": "application/json;charset=UTF-8",
        "status": "200"
      }
    }
  }
},
```

The second entry there is the request from the client to the gateway on "/resource", and you can see the cookies (added by the browser) and the CSRF header (added by Angular, as discussed in Part II). The first entry has remote: true, which means it's tracing the call to the resource server. You can see it went out to a URI path "/" and, crucially, that the cookies and CSRF headers have been sent too. Without Spring Session these headers would be meaningless to the resource server, but the way we have set it up, it can now use those headers to re-constitute a session with authentication and CSRF token data. So the request is permitted and we are in business!

Conclusion

We covered quite a lot in this article, but we got to a really nice place where there is a minimal amount of boilerplate code in our two servers, they are both nicely secure, and the user experience isn't compromised. That alone would be a reason to use the API Gateway pattern, but really we have only scratched the surface of what it might be used for (Netflix uses it for a lot of things). Read up on Spring Cloud to find out more about how to make it easy to add more features to the gateway. The next article in this series will extend the application architecture a bit by extracting the authentication responsibilities to a separate server (the Single Sign On pattern).
February 9, 2015
by Pieter Humphrey DZone Core
· 16,018 Views
article thumbnail
Introducing the Database Selection Matrix
Originally written by Mat Keep.

For the better part of a generation, the database landscape had changed very little. No one could say "this is not your father's database." Databases had become, in a word, boring. Then a combination of factors catalyzed an era of innovation in database technologies: cheap storage and compute resources, pervasive connectivity, social networks, smartphones, the proliferation of sensors, and open source software. Data volumes grew (and are growing) at exponential rates. Over 80% of today's data no longer fits neatly into the normalized row-and-column table formats of the past. And so developers began engineering solutions to a new set of problems with a very different set of resources and assumptions.

Today these new options include a variety of database architectures built around diverse data models – from key-value to document to wide-column and graph. And of course you still have the option of the venerable relational database.

For the enterprise, these new technologies hold great promise. They open the door to new applications that could not be imagined before, or solve existing problems more efficiently. They attract new technical talent. They facilitate the migration of systems to more cost-effective infrastructure based on commodity hardware and cloud platforms. But at the same time, evaluating these new options requires careful consideration. Selecting the appropriate database for a new project requires evaluation against multiple criteria, including:

- Development considerations: the data model, query functionality, available drivers, and data consistency. These factors dictate the functionality of your application and how quickly you can build it.
- Operational considerations: performance and scalability, high availability, data center awareness, security, management, and backups. Over the application's lifetime, operational costs will contribute a significant percentage to the project's Total Cost of Ownership (TCO), so these factors determine your ability to meet SLAs while minimizing administrative overhead.
- Commercial considerations: licensing, pricing, and support. You need to know that the database you choose is available in a way that is aligned with how you do business.

Each of these considerations needs to be evaluated in the context of specific application requirements, as well as internal technology standards, skills availability, and integration with your existing enterprise architecture. So, where to start?

The Database Selection Matrix is designed to serve as a decision framework for teams responsible for database selection. It was developed in collaboration with several large enterprises who have the choice of running multiple databases in production and who wanted a systematic methodology for database evaluation. Responses to questions in the matrix helped them identify key requirements and guide selection – and it can do the same for you. Let's illustrate how the Database Selection Matrix can be used by working through a practical example.

The Database Selection Matrix in Action!

ACME Retail Corporation runs a large vehicle fleet to distribute produce to its nationwide network of stores. The CEO is intent on improving distribution efficiency, and so tasks her enterprise architects with building a new platform that can utilize the sensor data generated by the company's trucks. By capturing and analyzing this data, the organization believes it can optimize route planning, improve delivery times, cut wastage, and reduce business interruptions caused by breakdowns.

ACME Retail Corp is typical of many enterprises that see the opportunity to unlock new efficiencies by leveraging the "Internet of Things". As Morgan Stanley stated in its "Internet of Things is Now" research, "We do not believe traditional data storage architectures are well-suited to accommodate the volume, velocity, and variety of IoT data". For this reason, enterprises are looking beyond traditional RDBMS technology to the swathe of new database options available to them. Bosch SI did exactly this when it took the decision to use MongoDB to power the Bosch IoT Suite. Of course, MongoDB may not be a perfect fit for every IoT project. There are many choices available – as there are for every new type of project – and the ACME architecture group needs a way to navigate the complex landscape of modern databases. Using the Database Selection Matrix, they have a framework for asking the key questions that will guide their technology decisions. So let's put it into practice.

Development Considerations

In this opening phase, the architects need to evaluate how their shortlisted database options meet the functional requirements of the app being built. This is impacted directly by multiple factors, and these are the questions they will need to ask.

The data model: Will the application need to handle data of varying structure and types? How large can each data type be – is our data made up of simple integers, strings, and timestamps, or can it also include large binary files such as images or videos? Can our data just be represented as a set of opaque values, or does it need to be typed so other applications can make sense of it? Do we know the data structure will remain constant, or will it vary as we introduce new sensor data and as the business updates application requirements? Does the application require its data to be strongly consistent (i.e. read our own writes), or can eventually consistent data be tolerated (and do our developers know how to handle the complexity it introduces)? Do we end up trading performance and availability if we configure the database to only return the freshest data?

The query model: What sort of queries are we going to run against the database? Are they simple key-value lookups that we know in advance, or do we need to execute ad-hoc queries and complex aggregations to support the real-time analytics the business wants to see? Can we run analytics directly against the database, or do we need to replicate data to dedicated search or analytics engines? Will the application be handling geospatial queries and text search? Does the data need to be integrated with our BI and analytics tools – and what about our new Hadoop cluster, or the data warehouse? Which languages will our engineers be using to develop the application, and does the database have drivers available for them?

Operational Considerations

In this second phase, the ACME Retail architects need to evaluate how each database would run in production. No one wants to hand-feed a custom technology, so they need to understand whether the database can meet the availability, scalability, and security needs of the business and interoperate with the existing management frameworks.

Service availability: What is the application's availability SLA? What are our RTO and RPO objectives? Will our operations teams manage failure recovery, or is this something that should be fully automated by the database? What capabilities does the database offer to maintain availability during routine maintenance? Are there tools available to manage this, or do we need to script something ourselves? Are there specific requirements to replicate data between our data centers to support disaster recovery?

Scalability: How do we expect this application to grow? Will the database need to scale beyond the limits of just a few servers? If data is to be distributed across multiple nodes, will it be partitioned in such a way that it is still optimized for the application's query patterns? Do we need to scale this across data centers? Can we write and read data locally to reduce the effects of geographic latency? Can we scale storage capacity and I/O by compressing the data, and are different compression algorithms available to trade off compression ratio against CPU overhead?

Security: What types of data access control do we need? Can we just use authentication controls within the database, or do we need to integrate with our existing LDAP infrastructure? What type of authorization controls are available, and how granular can we get? Do these controls need to extend down to the level of individual attributes within a document? Is encryption needed, and will those pesky compliance officers need to audit every action taken against the database?

Administration: How are we going to run this thing? Does the database provide tools to automate provisioning and upgrades, or do we need to create our own scripts? How about backups – can we get incremental backups? How about point-in-time backups? And then monitoring: we need to know, for example, if disk utilization is peaking above 60% so we can take action before we hit a problem. Can we add these alerts to our existing operational workflow tools? Can we integrate the database's management platform into our own operational tooling so we don't need to leave our single screen?

Commercial Considerations

Once the ACME architects have profiled their technology requirements, they will need to understand how the database is licensed and priced, before legal and procurement come knocking at the door.

Licensing: What license is used, and is it acceptable to our legal team? Are commercial licenses available?

Support: What support options are open to me? Can I get support SLAs from my vendor, even if I use a community version of their product? What SLA can I expect if I do hit an issue? What sort of training is available? Is my only option to send my engineers to public classes, or can we get trained on demand, at our own pace?

What's Next?

The ACME example is designed to illustrate some of the key questions engineering teams need to ask. It is true that the database landscape is more complex than ever, but it needn't be bewildering – the Database Selection Matrix is designed to help you identify and compare what is most critical as you build your next app, so go ahead and download it now.

Looking for additional information about database selection? Learn why organizations choose MongoDB to deliver applications and outcomes that were never previously possible. Download the white paper: THE VALUE OF DATABASE SELECTION.
February 9, 2015
by Francesca Krihely
· 10,968 Views · 1 Like
article thumbnail
Algorithms Explained: Minesweeper
This blog post explains the essential algorithms for the well-known Windows game "Minesweeper."

Game Rules

The board is a two-dimensional space with a predetermined number of mines. Cells have two states: opened and closed.

- If you left-click on a closed cell:
  - If the cell is empty, it is opened. If neighbor cells have mines, the opened cell shows the neighbor mine count; if the neighbor cells have no mines, all neighbor cells are opened automatically.
  - If the cell has a mine, the game ends with FAIL.
- If you right-click on a closed cell, you place a flag on it, which says "I know this cell has a mine".
- If you multi-click (both right and left click) on a cell that is opened and has at least one mine among its neighbors:
  - If the neighbor cells' total flag count equals this cell's count and the predicted mine locations are correct, all closed and unflagged neighbor cells are opened automatically.
  - If the neighbor cells' total flag count equals this cell's count but at least one predicted mine location is wrong, the game ends with FAIL.
- If all cells without mines are opened using left clicks and/or multi-clicks, the game ends with SUCCESS.

Data Structures

We may think of each cell as a UI structure (e.g. a button) with the following attributes:

- colCoord = 0 to colCount
- rowCoord = 0 to rowCount
- isOpened = true / false (default false)
- hasFlag = true / false (default false)
- hasMine = true / false (default false)
- neighborMineCount = 0 to 8 (default 0; total count of mines on neighbor cells)

So we have a two-dimensional "Button[][] cells" data structure to handle game actions.

Algorithms

Before start:

- Assign mines to cells randomly (set hasMine=true).
- Calculate neighborMineCount values for each cell with hasMine=false. (This step could instead be done for each clicked cell while the game is running, but that may be inefficient.)

Note 1: Neighbor cells should be accessed with the coordinates {(colCoord-1, rowCoord-1), (colCoord-1, rowCoord), (colCoord-1, rowCoord+1), (colCoord, rowCoord-1), (colCoord, rowCoord+1), (colCoord+1, rowCoord-1), (colCoord+1, rowCoord), (colCoord+1, rowCoord+1)}. And don't forget that the neighbor cell count may be 3 (for corner cells), 5 (for edge cells), or 8 (for middle cells).

Note 2: It's recommended to handle mouse clicks with "mouse release" actions instead of "mouse pressed/click" actions; otherwise a left or right click may be interpreted as a multi-click, or vice versa.

Right click on a cell:

- If the cell has isOpened=true, do nothing.
- If the cell has isOpened=false, set hasFlag=true and show a flag on the cell.

Left click on a cell:

- If the cell has isOpened=true, do nothing.
- If the cell has isOpened=false:
  - If hasMine=true, game over.
  - If hasMine=false:
    - If neighborMineCount > 0, set isOpened=true and show neighborMineCount on the cell. If all cells with hasMine=false are now opened, end the game with SUCCESS.
    - If neighborMineCount == 0, set isOpened=true and call "Left click on a cell" for all neighbor cells with hasFlag=false and isOpened=false.

Note: The last step may be implemented with a recursive call or by using a stack data structure.

Multi-click (both left and right click) on a cell:

- If the cell has isOpened=false, do nothing.
- If the cell has isOpened=true:
  - If neighborMineCount == 0, do nothing.
  - If neighborMineCount > 0:
    - If neighborMineCount != the count of neighbor cells with hasFlag=true, do nothing.
    - If neighborMineCount == the count of neighbor cells with hasFlag=true:
      - If any flagged neighbor cell does not actually have hasMine=true, game over.
      - If all flagged neighbor cells have hasMine=true (every flag is on a correct cell), call "Left click on a cell" for all neighbor cells with hasFlag=false and isOpened=false.

Note: The last step may be implemented with a recursive call or by using a stack data structure; a minimal sketch follows below.
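The sketch below is not from the original post; it shows one way to implement the recursive open step in Java, assuming a hypothetical Cell class holding the attributes listed under Data Structures (win-checking is omitted to keep it short):

```java
public class MinesweeperSketch {

    // Hypothetical cell model mirroring the attributes described above
    static class Cell {
        boolean isOpened, hasFlag, hasMine;
        int neighborMineCount;
    }

    private final Cell[][] cells;

    MinesweeperSketch(Cell[][] cells) {
        this.cells = cells;
    }

    /** Left click: opens the cell and, if it has no neighboring mines, floods out to its neighbors. */
    void open(int row, int col) {
        if (row < 0 || row >= cells.length || col < 0 || col >= cells[0].length) {
            return; // outside the board
        }
        Cell cell = cells[row][col];
        if (cell.isOpened || cell.hasFlag) {
            return; // already opened or flagged
        }
        cell.isOpened = true;
        if (cell.hasMine) {
            throw new IllegalStateException("Mine hit - game ends with FAIL");
        }
        if (cell.neighborMineCount == 0) {
            // no neighboring mines: open all eight neighbors recursively
            for (int dr = -1; dr <= 1; dr++) {
                for (int dc = -1; dc <= 1; dc++) {
                    if (dr != 0 || dc != 0) {
                        open(row + dr, col + dc);
                    }
                }
            }
        }
    }
}
```

The same flood can be written iteratively with an explicit stack of coordinates if deep recursion on large boards is a concern, which is what the note about a stack data structure refers to.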
February 9, 2015
by Cagdas Basaraner
· 32,723 Views
article thumbnail
NetBeans in the Classroom: MySQL JDBC Connection Pool & JDBC Resource for GlassFish
This tutorial assumes that you have installed the Java EE version of NetBeans 8.0.2. It is further assumed that you have replaced the default GlassFish server instance in NetBeans with a new instance that has its own folder for the domain. This is required for all platforms; see my previous article, "Creating a New Instance of GlassFish in NetBeans IDE".

The other day I presented to my students the steps necessary to make GlassFish responsible for JDBC connections. As I went through the steps I realized that I needed to record them for my students to reference. Here, then, are those steps. Steps 1 through 4 are required; Step 5 is optional.

Step 1a: Manually Adding the MySQL Driver to the GlassFish Domain

Before we even start NetBeans we must do a bit of preliminary work. With the exception of Derby, GlassFish does not include the MySQL driver, or any other driver, in its distribution. Go to the MySQL Connector/J download site at http://dev.mysql.com/downloads/connector/j/ and download the latest version. I recommend downloading the Platform Independent version. If your OS is Windows, download the ZIP archive; otherwise download the TAR archive. You are looking for the driver file named mysql-connector-java-5.1.34-bin.jar in the archive. Copy the driver file to the lib folder in the directory where you placed your domain; on my system the folder is located at C:\Users\Ken\personal_domain\lib. If GlassFish is already running you will have to restart it so that it picks up the new library.

Step 1b: Automatically Adding the MySQL Driver to the GlassFish Domain

NetBeans has a feature that deploys the database driver to the domain's lib folder if that driver is in NetBeans' folder of drivers. On my Windows 8.1 system the MySQL driver can be found in C:\Program Files\NetBeans 8.0.2\ide\modules\ext. Start NetBeans, go to the Services tab, expand Servers, and right mouse click on your GlassFish server. Click on Properties and the Servers dialog will appear. On this dialog you will see a check box labelled Enable JDBC Driver Deployment; by default it is checked. NetBeans determines the driver to copy to GlassFish from the file glassfish-resources.xml that we will create in Step 4 of this tutorial. Without this file, and if you have not copied the driver into GlassFish manually, GlassFish will not be able to connect to the database. Any database code in your web application will not work and all you will likely see are blank pages.

Step 1a or Step 1b?

I recommend Step 1a: manually add the driver. The reason I prefer this approach is that I can be certain the most recent driver is in use. As of this writing NetBeans bundles version 5.1.23 of the connector, but the current version is 5.1.34. If you copy a driver into the lib folder, NetBeans will not replace it with an older driver even if the check box on the Servers dialog is checked; NetBeans does not replace a driver if one is already in place. If you need a driver that NetBeans does not have a copy of, then Step 1a is your only choice.

Step 2: Create a Database Connection in NetBeans

One feature I have always liked in NetBeans is its interface for working with databases. All that is required is that you create a connection to the database. NetBeans also has additional features for managing a MySQL server, but we won't need those. If you have not already started your MySQL DBMS, do that now. I assume that the database you wish to connect to already exists.

Go to the Services tab and right mouse click on New Connection. In the next dialog you must choose the database driver you wish to use. It defaults to Java DB (Embedded); pull down the combo box labeled Driver: and select MySQL (Connector/J driver). Click on Next and you will see the Customize Connection dialog, where you enter the details of the connection. On my system the server is localhost and the database name is Aquarium. Notice the Test Connection button; I have clicked on mine and so I have the message Connection Succeeded. Click on Next. There is nothing to do on the following dialog, so click on Next again. On the last dialog you have the option of assigning a name to the connection. By default it uses the URL, but I prefer a more meaningful name; I have used AquariumMySQL. Click on Finish and the connection will appear under Databases.

If the icon next to AquariumMySQL has what looks like a crack in it, similar to the jdbc:derby connection, then a connection to the database could not be made. Verify that the database is running and is accessible; if it is, delete the connection and start over. Having a connection to the database in NetBeans is invaluable: you can interact with the database directly and issue SQL commands. As a MySQL user this means that I do not need to run the MySQL command line program to interact with the database.

Step 3: Create a Web Application Project in NetBeans

If you have not already done so, create a New Project in NetBeans. I require my students to create a Web Application project in the Maven category. Click on Next. In this dialog you can give the project a name and a location in your file system; the Artifact Id, Group Id, and Version are used by Maven. The final dialog lets you select the application server that your application will use and the version of Java EE that your code must be compliant with. The project is now ready for the next step.

Step 4: Create the GlassFish JDBC Resource

For GlassFish to manage your database connection you need to set up two resources: a JDBC Connection Pool and a JDBC Resource. You can create both in one step by creating a GlassFish JDBC Resource, because the Connection Pool can be created as part of the same operation.

Right mouse click on the project name and select New and then Other…. Scroll down the Categories list and select GlassFish. In the File Types list select JDBC Resource. Click on Next. The next dialog is General Attributes. Click on the radio button for Create New JDBC Connection Pool. In the JNDI Name text field, enter a name that is unique for the project. JNDI names for connection resources always begin with jdbc/ followed by a name that starts with a lower-case letter; I have used jdbc/myAquarium. Do not prefix the name with java:app/ as some tutorials suggest – an upcoming article will explain why. Click on Next.

There is nothing for us to enter on the Properties dialog, so click on Next. On the Choose Database Connection dialog we give our connection pool a name and select the database connection we created in Step 2. Notice that the list of available connections shows the connection URL and not the name you assigned back in Step 2. Click on Next. On the Add Connection Pool Properties dialog you will see the connection URL, the user name, and the password. We do need to make one change: the resource type shows javax.sql.DataSource and we must change it to javax.sql.ConnectionPoolDataSource. Click on Next. There is nothing we need to change on Add Connection Pool Optional Properties, so click on Finish.

A new folder has appeared in the Projects view named Other Sources. It contains a sub-folder named setup, and in this folder is the file glassfish-resources.xml, which holds the JDBC resource and connection pool definitions you just configured.

Optional Step 5: Configure GlassFish with glassfish-resources.xml

The glassfish-resources.xml file, when included in the application's WAR file in the WEB-INF folder, configures the resource and pool for the application when it is deployed in GlassFish; when the application is undeployed, the resource and pool are removed. If you want to set up the resource and pool permanently in GlassFish, follow these steps.

Go to the Services tab, select Servers, and right mouse click on GlassFish. If GlassFish is not running, click on Start. With the server started, click on View Domain Admin Console. Your web browser will open and show you the GlassFish console. If you assigned a user name and password to the server, you will have to enter this information before you see the console. In the Common Tasks tree select Resources, then click on Add Resources. In the Location field click on Choose File and locate your glassfish-resources.xml file; mine is found at D:\NetBeansProjects\GlassFishTutorial\src\main\setup. Click on OK.

The final task in this step is to test whether the connection works. In the Common Tasks tree select Resources, JDBC, JDBC Connection Pools, and then aquariumPool. Click on Ping. The most common reason for the Ping to fail is that the database driver is not in the domain's lib folder; if that happens, go back to Step 1a and manually add the driver. The resources are now also visible in NetBeans. Having the resource and pool added to GlassFish permanently allows other applications to share this same resource and pool. You are now ready to code!
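As a quick illustration of that last line (not part of the original tutorial), here is a minimal servlet sketch that injects the pooled DataSource by the JNDI name created above; the servlet class, URL pattern, and fish table are hypothetical:

```java
import java.io.IOException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.annotation.Resource;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.sql.DataSource;

@WebServlet("/fish")
public class FishServlet extends HttpServlet {

    // GlassFish injects the pooled DataSource registered under the JNDI name from Step 4
    @Resource(lookup = "jdbc/myAquarium")
    private DataSource dataSource;

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        try (Connection conn = dataSource.getConnection();
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM fish")) { // hypothetical table
            rs.next();
            resp.getWriter().println("Rows in the fish table: " + rs.getLong(1));
        } catch (SQLException e) {
            throw new ServletException(e);
        }
    }
}
```

Because the connection comes from the container-managed pool, the application never sees the JDBC URL or credentials; closing the Connection simply returns it to the pool.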
February 9, 2015
by Ken Fogel
· 51,955 Views · 3 Likes
article thumbnail
Microservices: Five Architectural Constraints
Microservices is a new software architecture and delivery paradigm in which applications are composed of several small runtime services. The current mainstream approach to software delivery is to build, integrate, and test entire applications as a monolith. That approach forces any software change, however small, through a full test cycle of the entire application. With Microservices, a software module is delivered as an independent runtime service with a well-defined API. The Microservices approach allows faster delivery of smaller, incremental changes to an application.

There are several tradeoffs to consider with the Microservices architecture. On one hand, the approach builds on several best practices and patterns for software design, architecture, and DevOps-style organization. On the other hand, Microservices requires expertise in distributed programming and can become an operational nightmare without proper tooling in place. There are several good posts that highlight the pros and cons of Microservices; I have listed some in the references section. In the remainder of this post, I will define five architectural constraints (principles that drive desired properties) for the Microservices architectural style. To be a Microservice, a service must be:

- Elastic
- Resilient
- Composable
- Minimal, and
- Complete

Microservice Constraint #1 – Elastic

A microservice must be able to scale, up or down, independently of other services in the same application. This constraint implies that, based on load or other factors, you can fine-tune your application's performance, availability, and resource usage. The constraint can be realized in different ways, but a popular pattern is to architect the system so that you can run multiple stateless instances of each microservice, with a mechanism for service naming, registration, and discovery along with routing and load-balancing of requests.

Microservice Constraint #2 – Resilient

A microservice must fail without impacting other services in the same application. A failure of a single service instance should have minimal impact on the application. A failure of all instances of a microservice should only impact a single application function, and users should be able to continue using the rest of the application without impact. Adrian Cockcroft describes Microservices as a loosely coupled service-oriented architecture with bounded contexts [3]. To be resilient, a service has to be loosely coupled with other services, and a bounded context limits a service's failure domain.

Microservice Constraint #3 – Composable

A microservice must offer an interface that is uniform and designed to support service composition. Microservice APIs should be designed with a common way of identifying, representing, and manipulating resources, and of describing the API schema and supported API operations. The "Uniform Interface" constraint of the REST architectural style describes this in detail. Service composition is a SOA principle with fairly obvious benefits but few guidelines on how it can be achieved. A Microservice interface should be designed to support composition patterns like aggregation and linking, and higher-level functions such as caching, proxies, and gateways. I previously discussed REST constraints and elements in a two-part blog post: REST is not about APIs.

Microservice Constraint #4 – Minimal

A microservice must only contain highly cohesive entities. In software, cohesion is a measure of whether things belong together. A module is said to have high cohesion if all of its objects and functions are focused on the same tasks, and higher cohesion leads to more maintainable software. A Microservice should perform a single business function, which implies that all of its components are highly cohesive. This echoes the Single Responsibility Principle (SRP) of object-oriented design [4].

Microservice Constraint #5 – Complete

A microservice must be functionally complete. Bjarne Stroustrup, the creator of C++, stated that a good interface must be "minimal but complete", i.e. as small as possible, and no smaller. Similarly, a Microservice must offer a complete function, with minimal dependencies (loose coupling) on other services in the application. This is important, as otherwise it becomes impossible to version and upgrade individual services. This constraint is designed to counterbalance the minimal constraint: put together, a microservice must be "minimal but complete."

Conclusions

Designing a Microservices application requires applying several principles, patterns, and best practices of modular design and service-oriented architecture. In this post, I've outlined five architectural constraints which can help guide and retain the key benefits of a Microservices-style architecture. For example, Constraint #1 – Elastic steers implementations towards separating the data tier from the application tier and leads to stateless services. At Nirmata we have built our solution, which makes it easy to deploy and operate microservices applications, using these very same principles. We believe that Microservices-style applications, running in containers, will power the next generation of software innovation. If you are using, or are interested in using, microservices, I would love to hear from you.

Jim Bugwadia
Founder and CEO, Nirmata

For additional content and articles, follow us at @NirmataCloud. If you are in the San Francisco Bay Area, come join our Microservices meetup group.

References

[1] Microservices, Martin Fowler and James Lewis, http://martinfowler.com/articles/microservices.html
[2] Microservices Are Not a Free Lunch!, Benjamin Wootton, http://contino.co.uk/microservices-not-a-free-lunch/
[3] State of the Art in Microservices, Adrian Cockcroft, http://thenewstack.io/dockercon-europe-adrian-cockcroft-on-the-state-of-microservices/
[4] The Principles of Object-Oriented Design, Robert C. Martin, http://butunclebob.com/ArticleS.UncleBob.PrinciplesOfOod
February 5, 2015
by Jim Bugwadia DZone Core
· 12,856 Views · 7 Likes
article thumbnail
Dropwizard vs Spring Boot—A Comparison Matrix
Of late, I have been looking into Microservice containers that are available out there to help speed up development. Although Microservice is a generic term, there is some consensus about what it means, so we may conveniently define a Microservice architecture as an "architectural design pattern, in which complex applications are composed of small, independent processes communicating with each other using language-agnostic APIs. These services are small, highly decoupled and focus on doing a small task."

There are several Microservice containers out there. However, in my experience Dropwizard and Spring Boot have received the most attention and appear to be the most widely used. In my current role I was asked to create a comparison matrix between the two, which follows below.

What is it?
- Dropwizard: Pulls together stable, mature libraries from the Java ecosystem into a simple, light-weight package that lets you focus on getting things done. [more...]
- Spring Boot: Takes an opinionated view of building production-ready Spring applications. Spring Boot favours convention over configuration and is designed to get you up and running as quickly as possible. [more...]

Overview
- Dropwizard: Straddles the line between being a library and a framework. Provides performant, reliable implementations of everything a production-ready web application needs. [more...]
- Spring Boot: Takes an opinionated view of the Spring platform and third-party libraries so you can get started with minimum fuss. Most Spring Boot applications need very little Spring configuration. [more...]

Out-of-the-box features
- Dropwizard: Out-of-the-box support for sophisticated configuration, application metrics, logging, operational tools, and much more, allowing you and your team to ship a production-quality web service in the shortest time possible. [more...]
- Spring Boot: Provides a range of non-functional features that are common to large classes of projects (e.g. embedded servers, security, metrics, health checks, externalized configuration). [more...]

Libraries
- Dropwizard: Core: Jetty, Jersey, Jackson, and Metrics. Others: Guava, Liquibase, and Joda-Time.
- Spring Boot: Spring, JUnit, Logback, Guava. There are several starter POM files covering various use cases, which can be included in the POM to get started.

Dependency injection
- Dropwizard: No built-in dependency injection. Requires a third-party dependency injection framework such as Guice, CDI, or Dagger. [Ref...]
- Spring Boot: Built-in dependency injection provided by the Spring dependency injection container. [Ref...]

Types of services (REST, SOAP, etc.)
- Dropwizard: Has some support for other types of services but is primarily designed for a performant HTTP/REST layer. If you ever need to integrate SOAP, a Dropwizard bundle for building SOAP web services with the JAX-WS API is available, but it is not an official Dropwizard sub-project. [more...]
- Spring Boot: As well as supporting REST, Spring Boot supports other types of services such as JMS, Advanced Message Queuing Protocol, and SOAP-based web services, to name a few. [more...]

Deployment: how does it create the executable JAR?
- Dropwizard: Uses shading to build executable fat JARs, where a shaded JAR packages all classes, from all JARs, into a single "uber JAR". [Ref...]
- Spring Boot: Adopts a different approach and avoids shaded JARs, as it becomes hard to see which libraries you are actually using in your application, and it can be problematic if the same filename appears in multiple shaded JARs. Instead it uses a "nested JAR" approach: rather than unpacking all classes from all JARs into a single "uber JAR", the dependent JARs sit in a lib folder inside the archive and the Spring loader loads them appropriately. [Ref...]

Contract-first web services
- Dropwizard: No built-in support. You would have to use a third-party library (CXF or any other JAX-WS implementation) if you needed a solution for contract-first SOAP-based services.
- Spring Boot: Contract-first services support is available with the help of the spring-boot-starter-ws starter. [Ref...]

Externalised configuration (properties and YAML)
- Dropwizard: Supports both properties and YAML.
- Spring Boot: Supports both properties and YAML.

Concluding Remarks

If you are dealing only with REST microservices, Dropwizard is an excellent choice. Where Spring Boot shines is in the breadth of service types supported – REST, JMS, messaging, and contract-first services – not to mention a fully built-in dependency injection container.

Disclaimer: the matrix is purely based on my personal views and experiences, having tried both frameworks, and is by no means an exhaustive guide. Readers are requested to do their own research before making a strategic decision between these two very formidable frameworks.
February 2, 2015
by Rizwan Ullah
· 73,430 Views · 9 Likes
article thumbnail
How-To: Setup Development Environment for Hadoop MapReduce
This post is intended for folks who are looking for a quick start on developing a basic Hadoop MapReduce application. We will set up a basic MR application for WordCount using Java, Maven, and Eclipse, and run it in local mode, which is easy for debugging at an early stage. It is assumed that JDK 1.6+ is already installed, that Eclipse has the Maven plugin set up, and that downloads from the default Maven repository are not restricted.

Problem statement: count the occurrences of each word appearing in an input file using MapReduce.

Step 1: Adding the Dependency

Create a Maven project in Eclipse and use the following in your pom.xml:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.saurzcode.hadoop</groupId>
  <artifactId>MapReduce</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.2.0</version>
    </dependency>
  </dependencies>
</project>
```

Upon saving, it should download all the dependencies required for running a basic Hadoop MapReduce program.

Step 2: Mapper Program

The map step involves tokenizing the file, traversing the words, and emitting a count of one for each word that is found. Our mapper class should extend Mapper and override its map method. When this method is called, the value parameter contains a chunk of the lines of the file to be processed, and the context parameter is used to emit word instances. In a real-world clustered setup, this code runs on multiple nodes whose output is consumed by a set of reducers for further processing.

```java
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}
```

Step 3: Reducer Program

Our reducer extends the Reducer class and implements the logic to sum up each occurrence of a word token received from the mappers. The output from the reducers goes to the output folder as a text file (by default, or as configured in the driver program for the output format) named part-r-00000, along with a _SUCCESS file.

```java
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text text, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(text, new IntWritable(sum));
    }
}
```

Step 4: Driver Program

Our driver program configures the job by supplying the map and reduce programs we just wrote, along with various input and output parameters.

```java
public class WordCount {

    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {

        Path inputPath = new Path(args[0]);
        Path outputDir = new Path(args[1]);

        // Create configuration
        Configuration conf = new Configuration(true);

        // Create job
        Job job = new Job(conf, "WordCount");
        job.setJarByClass(WordCountMapper.class);

        // Setup MapReduce
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setNumReduceTasks(1);

        // Specify key / value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input
        FileInputFormat.addInputPath(job, inputPath);
        job.setInputFormatClass(TextInputFormat.class);

        // Output
        FileOutputFormat.setOutputPath(job, outputDir);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Delete the output directory if it already exists
        FileSystem hdfs = FileSystem.get(conf);
        if (hdfs.exists(outputDir))
            hdfs.delete(outputDir, true);

        // Execute job
        int code = job.waitForCompletion(true) ? 0 : 1;
        System.exit(code);
    }
}
```

That's it! We are all set to execute our first MapReduce program in Eclipse in local mode. Let's assume there is an input text file called input.txt in the folder input which contains the following text:

```
foo bar is foo count count foo for saurzcode
```

Expected output:

```
foo 3
bar 1
is 1
count 2
for 1
saurzcode 1
```

Let's run this program in Eclipse as a Java application. We need to give the paths to the input file and output folder as arguments. Also note that the output folder must not exist before running the program; otherwise the program will fail.

```
java com.saurzcode.mapreduce.WordCount input/inputfile.txt output
```

If the program runs successfully, emitting a set of log lines while it executes the mappers and reducers, we should see an output folder with the following files:

```
output/
  _SUCCESS
  part-r-00000
```
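One small refinement not covered in the post: because the reduce function is just an associative sum, the same reducer class can also be registered as a combiner, so partial sums are computed on the map side and less intermediate data is shuffled to the reducers. A minimal, optional addition to the driver above:

```java
// Optional: reuse the reducer as a combiner so each mapper emits partial word counts,
// reducing the volume of intermediate data shuffled across the network.
job.setCombinerClass(WordCountReducer.class);
```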
January 30, 2015
by Saurabh Chhajed
· 12,762 Views · 1 Like
article thumbnail
Why Customers Choose Datical DB to Automate Database Deployments
Over the past year, Datical has had amazing success with our flagship product, Datical DB. We've seen multiple visionary, sector-leading companies select Datical DB to drive their application schema changes. Now that the number has grown rapidly over the past year, we can begin to see patterns in why customers choose Datical DB. One of them turns out to be pretty emblematic of our other customers, so let's examine the reasons why they chose to adopt Datical DB.

Customer-Facing Applications Are the Front Door

When your competitor is only a mobile screen swipe away, this Datical customer focuses on brand reputation and customer satisfaction. They know that application uptime and fast delivery are key to customer retention and account expansion. All applications have a database backend, and when something goes wrong with the database, it is not apparent to the consumer; the consumer just knows that the app isn't working. An unresponsive mobile app or website is often enough to make the customer take their business elsewhere. Customer stickiness due to the hassle of changing providers continues to lessen as the cost to change continues to drop. Therefore, immediate and fast access is a must-have for today's companies. Remember: consumer-facing applications don't have business hours. They must ALWAYS be open.

Cross-Team Collaboration

This Datical customer had difficulty determining who made what change to the database; answering which database, and why, was near impossible. All of the changes were stored in a single document. That solution is not multi-tenant, meaning the team had to distribute the document to share information, and there was a single point of failure in the tracking process. Too often, people were required to visit the data center for days at a time during application changes. There was simply no method to notify team members of changes, when they occurred, and what their impact was. This lack of communication led to almost 80% of all database change requests being rejected. There was clearly a need to increase communication of changes and improve the quality of the requested changes.

Increase Staff Productivity

With over 70% of the database team's time spent on managing change driven by application requirements, the database team was taxed in meeting other demands. Work to improve the scalability and reliability of the database servers became a lower priority as the DBAs struggled to keep up with change requests. Furthermore, development teams spent almost 10% of their time managing these change requests, including reworking failed requests. By eliminating a manual, time-consuming process, this customer can now focus resources on other needs such as managing continuity during server failure, better allocation of server resources, and certainly scalability concerns.

Go Faster

With the adoption of IBM UrbanCode Deploy, this customer quickly streamlined their deployments, with everything but the database automated. With all other components automated, the database changes required by application changes were clearly the weak link in the deployment chain. Truly, database automation is the last mile necessary to realize the promise of Agile, Continuous Delivery, and DevOps. Until the customer was able to apply Agile and Continuous Delivery to database changes, the entire application stack was, in effect, still using out-of-date development and deployment methods. To see the complete benefits of Agile and Continuous Delivery, automation throughout the entire stack, including the database, was absolutely necessary.

Leverage Existing Infrastructure and Processes

With out-of-the-box integration with UrbanCode Deploy, Datical DB did not require an independent, separate server. Furthermore, with Datical DB's ability to use the customer's existing source code control repository, the implementation cycle was shortened significantly. Too often, customers are asked by vendors to spend money on more server resources or to make strange, unnatural changes to their network security to support potential solutions. By utilizing existing infrastructure and processes, Datical DB was able to deliver lightning-quick ROI to this customer. We measure ROI in days and weeks, not months and years.

Please join us for a webcast next Wednesday, February 4th, from 12:00 – 1:00 pm EST, as we discuss these customer benefits in detail and show how Datical DB integrates seamlessly with IBM UrbanCode Deploy.
January 30, 2015
by Robert Reeves
· 8,162 Views