Data Engineering Resources

The Latest Data Engineering Topics

In the traditional 3-tier architecture, data processing is performed by the application server where the data itself is stored in the database server. Application server and database server are typically two different machine. Therefore, the processing cycle proceeds as follows Application server send a query to the database server to retrieve the necessary data Application server perform processing on the received data Application server will save the changed data to the database server In the traditional data processing paradigm, we move data to the code. It can be depicted as follows ... Then big data phenomenon arrives. Because the data volume is huge, it cannot be hold by a single database server. Big data is typically partitioned and stored across many physical DB server machines. On the other hand, application servers need to be added to increase the processing power of big data. However, as we increase the number of App servers and DB servers for storing and processing the big data, more data need to be transfer back and forth across the network during the processing cycle, up to a point where the network becomes a major bottleneck. Moving Code to Data To overcome the network bottleneck, we need a new computing paradigm. Instead of moving data to the code, we move the code to the data and perform the processing at where the data is stored. Notice the change of the program structure The program execution starts at a driver, which orchestrate the execution happening remotely across many worker servers within a cluster. Data is no longer transferred to the driver program, the driver program holds a data reference in its variable rather than the data itself. The data reference is basically an id to locate the corresponding data residing in the database server Code is shipped from the program to the database server, where the execution is happening, and data is modified at the database server without leaving the server machine. Finally the program request a save of the modified data. Since the modified data resides in the database server, no data transfer happens over the network. By moving the code to the data, the volume of data transfer over network is significantly reduced. This is an important paradigm shift for big data processing. In the following session, I will use Apache Spark to illustrate how this big data processing paradigm is implemented. RDD Resilient Distributed Dataset (RDD) is how Spark implements the data reference concept. RDD is a logical reference of a dataset which is partitioned across many server machines in the cluster. To make a clear distinction between data reference and data itself, a Spark program is organized as a sequence of execution steps, which can either be a "transformation" or an "action". Programming Model A typical program is organized as follows From an environment variable "context", create some initial data reference RDD objects Transform initial RDD objects to create more RDD objects. Transformation is expressed in terms of functional programming where a code block is shipped from the driver program to multiple remote worker server, which hold a partition of the RDD. Variable appears inside the code block can either be an item of the RDD or a local variable inside the driver program which get serialized over to the worker machine. After the code (and the copy of the serialized variables) is received by the remote worker server, it will be executed there by feeding the items of RDD residing in that partition. Notice that the result of a transformation is a brand new RDD (the original RDD is not mutated) Finally, the RDD object (the data reference) will need to be materialized. This is achieved through an "action", which will dump the RDD into a storage, or return its value data to the driver program. Here is a word count example # Get initial RDD from the context file = spark.textFile("hdfs://...") # Three consecutive transformation of the RDD counts = file.flatMap(lambda line: line.split(" ")) .map(lambda word: (word, 1)) .reduceByKey(lambda a, b: a + b) # Materialize the RDD using an action counts.saveAsTextFile("hdfs://...") When the driver program starts its execution, it builds up a graph where nodes are RDD and edges are transformation steps. However, no execution is happening at the cluster until an action is encountered. At that point, the driver program will ship the execution graph as well as the code block to the cluster, where every worker server will get a copy. The execution graph is a DAG. Each DAG is a atomic unit of execution. Each source node (no incoming edge) is an external data source or driver memory Each intermediate node is a RDD Each sink node (no outgoing edge) is an external data source or driver memory Green edge connecting to RDD represents a transformation. Red edge connecting to a sink node represents an action Data Shuffling Although we ship the code to worker server where the data processing happens, data movement cannot be completely eliminated. For example, if the processing requires data residing in different partitions to be grouped first, then we need to shuffle data among worker server. Spark carefully distinguish "transformation" operation in two types. "Narrow transformation" refers to the processing where the processing logic depends only on data that is already residing in the partition and data shuffling is unnecessary. Examples of narrow transformation includes filter(), sample(), map(), flatMap() .... etc. "Wide transformation" refers to the processing where the processing logic depends on data residing in multiple partitions and therefore data shuffling is needed to bring them together in one place. Example of wide transformation includes groupByKey(), reduceByKey() ... etc. Joining two RDD can also affect the amount of data being shuffled. Spark provides two ways to join data. In a shuffle join implementation, data of two RDD with the same key will be redistributed to the same partition. In other words, each of the items in each RDD will be shuffled across worker servers. Beside shuffle join, Spark provides another alternative call broadcast join. In this case, one of the RDD will be broadcasted and copied over to every partition. Imagine the situation when one of the RDD is significantly smaller relative to the other, then broadcast join will reduce the network traffic because only the small RDD need to be copied to all worker servers while the large RDD doesn't need to be shuffled at all. In some cases, transformation can be re-ordered to reduce the amount of data shuffling. Below is an example of a JOIN between two huge RDDs followed by a filtering. Plan1 is a naive implementation which follows the given order. It first join the two huge RDD and then apply the filter on the join result. This ends up causing a big data shuffling because the two RDD is huge, even though the result after filtering is small. Plan2 offers a smarter way by using the "push-down-predicate" technique where we first apply the filtering in both RDDs before joining them. Since the filtering will reduce the number of items of each RDD significantly, the join processing will be much cheaper. Execution Planning As explain above, data shuffling incur the most significant cost in the overall data processing flow. Spark provides a mechanism that generate an execute plan from the DAG that minimize the amount of data shuffling. Analyze the DAG to determine the order of transformation. Notice that we starts from the action (terminal node) and trace back to all dependent RDDs. To minimize data shuffling, we group the narrow transformation together in a "stage" where all transformation tasks can be performed within the partition and no data shuffling is needed. The transformations becomes tasks that are chained together within a stage Wide transformation sits at the boundary of two stages, which requires data to be shuffled to a different worker server. When a stage finishes its execution, it persist the data into different files (one per partition) of the local disks. Worker nodes of the subsequent stage will come to pickup these files and this is where data shuffling happens Below is an example how the execution planning turns the DAG into an execution plan involving stages and tasks. Reliability and Fault Resiliency Since the DAG defines a deterministic transformation steps between different partitions of data within each RDD RDD, fault recovery is very straightforward. Whenever a worker server crashes during the execution of a stage, another worker server can simply re-execute the stage from the beginning by pulling the input data from its parent stage that has the output data stored in local files. In case the result of the parent stage is not accessible (e.g. the worker server lost the file), the parent stage need to be re-executed as well. Imagine this is a lineage of transformation steps, and any failure of a step will trigger a restart of execution from its last step. Since the DAG itself is an atomic unit of execution, all the RDD values will be forgotten after the DAG finishes its execution. Therefore, after the driver program finishes an action (which execute a DAG to its completion), all the RDD value will be forgotten and if the program access the RDD again in subsequent statement, the RDD needs to be recomputed again from its dependents. To reduce this repetitive processing, Spark provide a caching mechanism to remember RDDs in worker server memory (or local disk). Once the execution planner finds the RDD is already cache in memory, it will use the RDD right away without tracing back to its parent RDDs. This way, we prune the DAG once we reach an RDD that is in the cache. Overall speaking, Apache Spark provides a powerful framework for big data processing. By the caching mechanism that holds previous computation result in memory, Spark out-performs Hadoop significantly because it doesn't need to persist all the data into disk for each round of parallel processing. Although it is still very new, I think Spark will take off as the main stream approach to process big data.

March 13, 2015

by Ricky Ho

· 23,498 Views · 2 Likes

How to Write a "Hello, World!" Microservice

What does implementing microservices mean for a software developer? Especially, for the rookies, greenhorns, and newbs out there? I’m not talking about microservice software architecture here; this is about microservices software development. And not just that, the ultimate implementation goal should be “microservices done right”. For this post, I’ll go with Java. Yes, it’s wordy. Yes, it’s resource intensive (especially when used for the sole purpose of returning a single string). However the concept of classes and objects goes well with my intention of explaining how to do microservices correctly. Plus, it makes sense to use microservices in environments that are heavily biased towards Java. Anyway, please feel free to add your own “Hello, World!” microservice in your favorite language in the comments section below. Hello, monolith! As a prerequisite, you should be familiar with the following piece of code, what it does, and why it has to look the way it does (read this tutorial if you don’t): class Starter { public static void main(String[] args) { System.out.println(“Hello, World!”); } } This is a simple console application that yields the string “Hello, World!” This is not written in the microservice way. This is an example of when not to use the microservices approach: if all you need on your console is a single string, this is all you need. Hello, code duplication! In addition to this console application, I want this string to be available on the web by calling http://localhost:80/helloWorld.servlet from a browser. Here is the required code, implemented as plain HTTP servlet (yes, it’s wordy. Get over it.) class HelloWorldServlet extends HttpServlet { public void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { response.getWriter().println(“Hello, World!”); } } The string “Hello, World!” has to be “implemented” again. Sure, this is no big deal. But this simple string could be so much more. It could be the result of a complex calculation or it could be the result of a time consuming search query. So, just imagine that the string “Hello, World!” is the result of a week’s worth of hard work (If you’re new to programming, it may very well be...). How should you go about making it available to apps and services that you create? Step 1: HelloWorldService.java To save yourself from duplicating a week’s worth of coding, allow me to introduce the HelloWorldService class: class HelloWorldService { public String greet() { return “Hello, World!”; } } You can re-use this fine piece of software craftmanship in all your apps and classes without re-implementing or duplicating code. Here’s our console application again: class Starter { HelloWorldService helloWorldService = new HelloWorldService(); public static void main(String[] args) { String message = helloWorldService.greet(); System.out.println(message); } } The same goes for servlets: class HelloWorldServlet extends HttpServlet { HelloWorldService helloWorldService = new HelloWorldService(); public void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { String message = helloWorldService.greet(); response.getWriter().println(message); } } It also works great for Spring MVC controllers: @Controller class HelloWorldController { HelloWorldService helloWorldService = new HelloWorldService(); @RequestMapping("/helloWorld") public String greet() { String message = helloWorldService.greet(); return message; } } I could go on and show more examples, but I think you get the point (spoiler: it’s the bold lines that matter). Those of you who are familiar with microservices could point out that this may be fine for getting rid of code duplication, but this is no microservice. You’re right, but to get to “microservices done right,” you have to be able to separate you app’s concerns, which is what I did here in the most possible basic way: I separated the app’s frontend concerns from its backend concerns. The frontend is either a console app or a servlet, the backend is HelloWorldService. Serviceward, ho! To go down microservice lane from here, all we have to do is wrap HelloWorldService into some kind of web component that makes it accessible via HTTP, right? Let’s see… First, we could just use our servlet code from above, as it conveniently returns the string as a response to any HTTP request. But we won’t. Why? Because there’s something missing: fault tolerance. What could possibly fail when returning a simple string? That’s not the point. What matters is that the client side (the code that calls HelloWorldService) should be given enough information to effectively react to failures. We face two possible problems: The service as a whole may be unavailable The service may be unable to return a proper response The service is unavailable If a service is unavailable, it’s the client that is responsible for dealing with the situation. Frameworks like unirest.io save you the effort of writing many lines of code when dealing with HTTP requests. Future> future = Unirest.post("HTTP://helloworld.myservices.local/greet") .header("accept", "application/json") .asJsonAsync(new Callback() { public void failed(UnirestException e) { //tell them UI folks that the request went south } public void completed(HttpResponse response) { //extract data from response and fulfill it’s destiny } public void cancelled() { //shot a note to UI dept that the request got cancelled } } ); With this code, the client now knows when the service is not available or has timed out following no response. Wee can easily have an error message displayed in place of the string we expected to receive. Try/catch is probably the right solution here. Invalid responses however pose more of a challenge. The service fails If the service fails, we can just return a string with an appropriate error message. But how can you know if a message is an error message or a correct response? Yes, you can start every error message with [ERROR] or invent another “smart” (read: not-so-smart) workaround, but this won’t be a solution you’ll be proud of. And, there’s always the possibility that even valid responses may begin with ERROR because it’s simply part of the message. I’d go with JSON or XML for wrapping the answer. I prefer JSON because it’s a little less wordy than XML. And I really like using the JSON-HTML tool over at json.bloople.net for visualizing results during development. Of course, you might go for any of the numerous alternatives, like protobuf or a proprietary solution of your own. The main point is that you need to be able to apply structure to responses: { “status”:”ok”, ”message”:”Hello, World!” } By checking the status attribute, you can easily decide whether to handle an error or to display an appropriate message. { “status”:”error”, ”message”:”Invalid input parameter” } The possibilities are endless here. You can add an error code or additional properties. This all boils down to a single important point: apply structure to your responses. Structure, why? Because structure not only helps you keep your code maintainable, it also serves as the foundation of the API of your service. An API definition consists of more than a URL like this: GET HTTP://helloworld.myservices.local/greet API definitions also consist of the response structures that can be expected as a response (you know this already from a few lines back): { “status”:”ok”, ”message”:”Hello, World!” } Most important takeaway Keeping the API specifications of a service’s request and response stable is a key requirement for succeeding with microservices. Conclusion Are you (and your project) ready for microservices? If you read this and kept asking yourself, what good is all the overhead of microservices, then either your project won’t benefit from microservices or you’re just not there yet (for mindset perspective see my previous post about the value of microservices ). If you can’t stop thinking about microservices, then you probably are ready.

March 13, 2015

by Martin Goodwell

· 31,265 Views · 7 Likes

How to Test a REST API With JUnit

RESTEasy (and Jersey as well) contain a minimal web server within their libraries which enables their users to start up a tiny web server.

March 13, 2015

by Mark Paluch

· 311,577 Views · 6 Likes

The 100% Utilization Myth

Many organizations in which I've coached are concerned when the people on their teams aren't being "utilized" at 100% of their capacity. If someone is at work for 8 hours per day, minus an hour for lunch and breaks, that's 7 hours of potential capacity. Some organizations are progressive enough to see that the organization's overhead of administrative activities lowers that value to 6 hours per day. By extension, a team's capacity is simply a multiple of the number of team members and the number of hours available to be utilized, i.e. a team of 5 has 30 person-hours per day. So, anything less than 30 person-hours per day spent on tasks means that a team isn't being utilized 100%. Which is bad, apparently. A Thought Exercise If you're reading this, you have a computer and it's connected to a network. Your computer might be a small one that makes phone calls or a large one that can also handle the complex scenery of games at 60 frames per second without breaking a sweat. Regardless, it's a computer. Thanks to John von Neumann, almost all computers today have the same basic architecture, the heart of which is the Central Processing Unit or CPU. When the CPU on your computer, regardless of its size, reaches 100% utilization, what happens? Your computer slows down. Significantly. The CPU is saturated with instructions and does its best to process them, but it has a finite limit on how quickly it can do that. There's a finite limit to the speed at which the instructions and the data they use can be passed around among the internal components of the CPU. That processor just can't go any faster. To a person using the computer, it appears to be locked up. To see this in action, try running a modern game like Faster Than Light on a 1st generation iPad. The older processor can't keep up with the computing demand from a game targeted towards more a powerful CPU. The communications network to which your computer is connected might be wireless or have a "hard" connection, but behind the scenes the public internet is dominated by TCP/IP and Ethernet. What happens when the network to which your computer is connected is running at 100% capacity? The network slows down. Significantly. The equipment moving the packets of data about will experience numerous collisions and will have to send requests back to your computer to resend data. The equipment will also begin to simply "drop" packets because it can't process them quickly enough. As far as you're concerned the network has become either very slow or has simply stopped. To see this in real life, look at what happens to a web site when it's the subject of a Denial of Service (DoS) attack, which effectively saturates that site's network with data requests. The site either crashes completely or appears to be unresponsive because it can't handle the traffic. Knowledge work such as software development requires a huge amount of thinking (processing) and a huge amount of collaboration (communication). The human brain is a processor, and like the von Neumann architecture has subcomponents (lobes), working memory, etc. What happens when our processor hits 100% utilization? We simply can't process all the inputs fast enough and our ability to process information slows down dramatically. We even reach a state where the decision-making centre in our brain shuts itself down. Coupled with a decreasing ability to process information and make decisions, we experience increasing levels of anxiety. Now, let's look at teams. They represent a communications network, with the people being the nodes in the network. When the team reaches 100% utilization, the individual processors (the brains of the people!) begin to slow as the ability to process information diminishes and anxiety increases. This has the effect of slowing the team's ability to communicate and operate effectively. When one or multiple people reach the point of shutting down, the network collapses. Why does anyone think that 100% utilization is a good thing? In manufacturing, 100% utilization of the capacity of your workers is actually desirable. The thinking aspects have already been done in the design and engineering of the item being created, and the assembly process is infinitely repeatable until a new item comes along. This approach to the process was taken from the manufacturing world and applied to the development of software, with the thinking that assembly was done by the programmers after all the design and engineering had taken place. Well, that simply isn't true. The "assembly" of software isn't programming, it's the compilation and packing into something deployable. Writing software is a confluence of design and engineering that has creative and technical aspects. It's not an assembly line, and probably won't ever be within my lifetime. As long ago as the late 1950's, management guru Peter Drucker realized that there were people who "think for a living". He coined the term knowledge worker to describe that type of work. Developing software is a classic knowledge worker role and therefore has different rules for productivity. As Drucker said in his 1999 book Management Challenges of the 21st Century, Productivity of the knowledge worker is not — at least not primarily — a matter of the quantity of output. Quality is at least as important. If we're pushing individuals and teams to 100% capacity, the quality of work and therefore the productivity of the team will be reduced. In 2001, Tom deMarco released Slack: Getting Past Burnout, Busywork and the Myth of Total Efficiency, an entire book describing the problems created by trying to achieve 100% utilization. I highly recommend the book, and also recommend you give your manager a copy. :) Here are some selected quotes that are pertinent to reducing the load on our processor and communication network: Very successful companies have never struck me as particularly busy; in fact, they are, as a group, rather laid back. Energy is evident in the workplace, but it's not the energy tinged with fear that comes from being slightly behind on everything. The principal resource needed for invention is slack. When companies can't invent, it's usually because their people are too damn busy. People under time pressure don't think faster. - Tim Lister How Can We Deal With This? First and foremost, stop trying to do so damned much! This is truly a slow down to go faster type of solution. For a Team If you're an agile team using iterations or sprints, pull less work into a sprint even if your velocity says you should be able to pull more. The extra time can be used to increase the quality of the work that you do complete, which is in itself a motivating factor for knowledge workers. Increased quality means decreased rework, which means the team as a whole delivers faster over the long term. Similarly, the reduced anxiety and stress on the team members increases their ability to think about what they're doing, meaning better and more innovative solutions to problems. Also, ensure that teams are focusing on activities that contribute directly to their work. Meetings tend to be the worst offenders of this, including the meetings defined within agile processes. Some suggestions: No meeting can be scheduled unless there is a specific question to be answered or specific information to be passed along. Reduce the default meeting length from 1 hour to 30 minutes in whatever calendar tool you use. Even better, reduce it to 20 minutes. Choose blocks of time that are deemed meeting free zones. No one, regardless of their position in the org chart can schedule a meeting with the team or the team members during that time. Minimize meetings where people are on a phone. My observation is that people who call into meetings tend to speak longer than when they are in the room with everyone else. I know that I do this, and I see it often in others. For an Individual Learn to say, "No." At the very least, learn to say, "Not yet." It's really difficult to do this, and I speak from personal experience. We want to help others, we want to be team players. However, if we don't say no, we take on too much, our processing and communication ability suffers and we end up disappointing those we were trying to help! Also, learn to take breaks. Yes, take breaks. The same concepts from Slack apply here, with brief diversions clearing our mind and allowing us to work better afterwards. At the extreme, if you feel like you can't keep your eyes open, take a nap! I highly recommend the Pomodoro Technique as at least a starting point for giving yourself some slack. Step away from your work for just a few minutes and return recharged. Conclusion We don't want the CPU in our computer to be working at 100% and we don't want the communications network to which it's connected to be at 100% capacity either. Our brains are processors and teams are a communications network, so why would we want those to be 100% busy all the time? In fact, ensuring that teams and the people on them are always busy is a provably wrong approach for software development. It has it's roots in a manufacturing mindset and developing software is knowledge work. Software development requires substantially different ways to make teams and the individual people on them as productive as possible. Finally, there is no magic number for what percentage of utilization is best. People are variable and 90% utilization for a person might be fine on a given day while %30 is all that the same person can handle on the next. Don't aim for a number, aim for an environment where people know that they don't have to appear busy. That will leave plenty of capacity for each person's processor and their team's communications network to run smoothly and deliver real value more effectively.

March 12, 2015

by Dave Rooney

· 22,620 Views · 2 Likes

R/dplyr: Extracting Data Frame Column Value for Filtering With %in%

I’ve been playing around with dplyr over the weekend and wanted to extract the values from a data frame column to use in a later filtering step. I had a data frame: library(dplyr) df = data.frame(userId = c(1,2,3,4,5), score = c(2,3,4,5,5)) And wanted to extract the userIds of those people who have a score greater than 3. I started with: highScoringPeople = df %>% filter(score > 3) %>% select(userId) > highScoringPeople userId 1 3 2 4 3 5 And then filtered the data frame expecting to get back those 3 people: > df %>% filter(userId %in% highScoringPeople) [1] userId score <0 rows> (or 0-length row.names) No rows! I created vector with the numbers 3-5 to make sure that worked: > df %>% filter(userId %in% c(3,4,5)) userId score 1 3 4 2 4 5 3 5 5 That works as expected so highScoringPeople obviously isn’t in the right format to facilitate an ‘in lookup’. Let’s explore: > str(c(3,4,5)) num [1:3] 3 4 5 > str(highScoringPeople) 'data.frame': 3 obs. of 1 variable: $ userId: num 3 4 5 Now it’s even more obvious why it doesn’t work – highScoringPeople is still a data frame when we need it to be a vector/list. One way to fix this is to extract the userIds using the $ syntax instead of the select function: highScoringPeople = (df %>% filter(score > 3))$userId > str(highScoringPeople) num [1:3] 3 4 5 > df %>% filter(userId %in% highScoringPeople) userId score 1 3 4 2 4 5 3 5 5 Or if we want to do the column selection using dplyr we can extract the values for the column like this: highScoringPeople = (df %>% filter(score > 3) %>% select(userId))[[1]] > str(highScoringPeople) num [1:3] 3 4 5 Not so difficult after all.

March 12, 2015

by Mark Needham

· 15,189 Views

Why I Use OrientDB on Production Applications

Like many other Java developers, when i start a new Java development project that requires a database, i have hopes and dreams of what my database looks like: Java API (of course) Embeddable Pure Java Simple jar file for inclusion in my project Database stored in a directory on disk Faster than a rocket First I’m going to review these points, and then i’m going to talk about the database i chose for my latest project, which is in production now with hundreds of users accessing the web application each month. What I Want from My Database Here’s what i’m looking for in my database. These are the things that literally make me happy and joyous when writing code. Java API I code in Java. It’s natural for me to want to use a modern Java API for my database work. Embeddable My development productivity and programming enjoyment skyrocket when my database is embedded. The database starts and stops with my application. It’s easy to destroy my database and restart from scratch. I can upgrade my database by updating my database jar file. It’s easy to deploy my application into testing and production, because there’s no separate database server to startup and manage. (I know about the issue with clustering and an embedded database, but i’ll get to that.) Pure Java Back when i developed software that would be deployed on all manner of hardware, i was a stickler that all my code be pure Java, so that i could be confident that my code would run wherever customers and users deployed it. In this day of SaaS, i’m less picky. I develop on the Mac. I test and run in production on Linux. Those are the systems i care about, so if my database has some platform-specific code in it to make it run fast and well, i’m fine with that. Just as long as that platform-specific configuration is not exposed to me as the developer. Simple Jar File for Inclusion in My Project I really just want one database jar file to add to my project. And i don’t want that jar file messing with my code or the dependencies i include in my project. If the database uses Guava 1.2, and i’m using Guava 0.8, that can mess me up. I want my database to not interfere with jars that i use by introducing newer or older versions of class files that i already reference in my project’s jars. Database Stored in a Directory on Disk I like to destroy my database by deleting a directory. I like to run multiple, simultaneous databases by configuring each database to use a separate directory. That makes me super productive during development, and it makes it more fun for me to program to a database. Faster Than a Rocket I think that’s just a given. My Latest Project That Needs a Database My latest project is Floify.com. Floify is a Mortgage Borrower Portal, automating the process of collecting mortgage loan documents from borrowers and emailing milestone loan status updates to real estate agents and borrowers. Mortgage loan originators use Floify to automate the labor-intensive parts of their loan processes. The web application receives about 500 unique visitors per month. Floify experienced 28% growth in january 2015. Floify’s vital statistics are: 38,301 loan documents under management 3,619 registered users 3,113 loan packages under management The Database I Chose for My Latest Project When i started Floify, i looked for a database that met all the criteria i’ve described above. I decided against databases that were server-based (Postgres, etc). I decided against databases that weren’t Java-based (MongoDB, etc). I decided against databases that didn’t support ACID transactions. I narrowed my choices to OrientDB and Neo4j. It’s been a couple years since that decision process occurred, but i distinctly remember a few reasons why i ultimately chose OrientDB over Neo4j: Performance benchmarks for OrientDB were very impressive. The OrientDB development team was very active. Cost. OrientDB is free. Neo4j cost more than what i was willing to pay or what i could afford. I forget which it was. My Favourite OrientDB Features Here are some of my favourite features in OrientDB. These are not competitive advantages to OrientDB. It’s just some of the things that make me happy when coding against an embeddable database. I can create the database in code. I don’t have to use SQL for querying, but most of the time, i do. I already know SQL, and it’s just easy for me. I use the document database, and it’s very pleasant inserting new documents in Java. I can store multi-megabyte binary objects directly in the database. My database is stored in a directory on disk. When scalability demands it, i can upgrade to a two-server distributed database. I haven’t been there yet. Speed. For me, OrientDB is very fast, and in the few years i’ve been using it, it’s become faster. OrientDB doesn’t come in a single jar file, as would be my ideal. I have to include a few different jars, but that’s an easy tradeoff for me. Future In the future, as Floify’s performance and scalability needs demand it, i’ll investigate a multi-server database configuration on OrientDB. In the meantime, i’m preparing to upgrade to OrientDB 2.0, which was recently released and promises even more speed. Go speed. :-)

March 5, 2015

by Dave Sims

· 18,308 Views · 6 Likes

Using MongoDB with Hadoop & Spark: Part 2 - Hive Example

Originally Written by Matt Kalan Welcome to part two of our three-part series on MongoDB and Hadoop. In part one, we introduced Hadoop and how to set it up. In this post, we'll look at a Hive example. Introduction & Setup of Hadoop and MongoDB Hive Example Spark Example & Key Takeaways For more detail on the use case, see the first paragraph of part 1. Summary Use case: aggregating 1 minute intervals of stock prices into 5 minute intervals Input:: 1 minute stock prices intervals in a MongoDB database Simple Analysis: performed in: - Hive - Spark Output: 5 minute stock prices intervals in Hadoop Hive Example I ran the following example from the Hive command line (simply typing the command “hive” with no parameters), not Cloudera’s Hue editor, as that would have needed additional installation steps. I immediately noticed the criticism people have with Hive, that everything is compiled into MapReduce which takes considerable time. I ran most things with just 20 records to make the queries run quickly. This creates the definition of the table in Hive that matches the structure of the data in MongoDB. MongoDB has a dynamic schema for variable data shapes but Hive and SQL need a schema definition. CREATE EXTERNAL TABLE minute_bars ( id STRUCT, Symbol STRING, Timestamp STRING, Day INT, Open DOUBLE, High DOUBLE, Low DOUBLE, Close DOUBLE, Volume INT ) STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler' WITH SERDEPROPERTIES('mongo.columns.mapping'='{"id":"_id", "Symbol":"Symbol", "Timestamp":"Timestamp", "Day":"Day", "Open":"Open", "High":"High", "Low":"Low", "Close":"Close", "Volume":"Volume"}') TBLPROPERTIES('mongo.uri'='mongodb://localhost:27017/marketdata.minbars'); Recent changes in the Apache Hive repo make the mappings necessary even if you are keeping the field names the same. This should be changed in the MongoDB Hadoop Connector soon if not already by the time you read this. Then I ran the following command to create a Hive table for the 5 minute bars: CREATE TABLE five_minute_bars ( id STRUCT, Symbol STRING, Timestamp STRING, Open DOUBLE, High DOUBLE, Low DOUBLE, Close DOUBLE ); This insert statement uses the SQL windowing functions to group 5 1-minute periods and determine the OHLC for the 5 minutes. There are definitely other ways to do this but here is one I figured out. Grouping in SQL is a little different from grouping in the MongoDB aggregation framework (in which you can pull the first and last of a group easily), so it took me a little while to remember how to do it with a subquery. The subquery takes each group of 5 1-minute records/documents, sorts them by time, and takes the open, high, low, and close price up to that record in each 5-minute period. Then the outside WHERE clause selects the last 1-minute bar in that period (because that row in the subquery has the correct OHLC information for its 5-minute period). I definitely welcome easier queries to understand but you can run the subquery by itself to see what it’s doing too. INSERT INTO TABLE five_minute_bars SELECT m.id, m.Symbol, m.OpenTime as Timestamp, m.Open, m.High, m.Low, m.Close FROM (SELECT id, Symbol, FIRST_VALUE(Timestamp) OVER ( PARTITION BY floor(unix_timestamp(Timestamp, 'yyyy-MM-dd HH:mm')/(5*60)) ORDER BY Timestamp) as OpenTime, LAST_VALUE(Timestamp) OVER ( PARTITION BY floor(unix_timestamp(Timestamp, 'yyyy-MM-dd HH:mm')/(5*60)) ORDER BY Timestamp) as CloseTime, FIRST_VALUE(Open) OVER ( PARTITION BY floor(unix_timestamp(Timestamp, 'yyyy-MM-dd HH:mm')/(5*60)) ORDER BY Timestamp) as Open, MAX(High) OVER ( PARTITION BY floor(unix_timestamp(Timestamp, 'yyyy-MM-dd HH:mm')/(5*60)) ORDER BY Timestamp) as High, MIN(Low) OVER ( PARTITION BY floor(unix_timestamp(Timestamp, 'yyyy-MM-dd HH:mm')/(5*60)) ORDER BY Timestamp) as Low, LAST_VALUE(Close) OVER ( PARTITION BY floor(unix_timestamp(Timestamp, 'yyyy-MM-dd HH:mm')/(5*60)) ORDER BY Timestamp) as Close FROM minute_bars) as m WHERE unix_timestamp(m.CloseTime, 'yyyy-MM-dd HH:mm') - unix_timestamp(m.OpenTime, 'yyyy-MM-dd HH:mm') = 60*4; I can definitely see the benefit of being able to use SQL to access data in MongoDB and optionally in other databases and file formats, all with the same commands, while the mapping differences are handled in the table declarations. The downside is that the latency is quite high, but that could be made up some with the ability to scale horizontally across many nodes. I think this is the appeal of Hive for most people - they can scale to very large data volumes using traditional SQL, and latency is not a primary concern. Post #3 in this blog series shows similar examples using Spark. Introduction & Setup of Hadoop and MongoDB Hive Example Spark Example & Key Takeaways To learn more, watch our video on MongoDB and Hadoop. We will take a deep dive into the MongoDB Connector for Hadoop and how it can be applied to enable new business insights. WATCH MONGODB & HADOOP << Read Part 1

March 2, 2015

by Francesca Krihely

· 10,879 Views

Streaming Big Data: Storm, Spark and Samza

There are a number of distributed computation systems that can process Big Data in real time or near-real time. This article will start with a short description of three Apache frameworks, and attempt to provide a quick, high-level overview of some of their similarities and differences. Apache Storm In Storm, you design a graph of real-time computation called a topology, and feed it to the cluster where the master node will distribute the code among worker nodes to execute it. In a topology, data is passed around between spouts that emit data streams as immutable sets of key-value pairs called tuples, and bolts that transform those streams (count, filter etc.). Bolts themselves can optionally emit data to other bolts down the processing pipeline. Apache Spark Spark Streaming (an extension of the core Spark API) doesn’t process streams one at a time like Storm. Instead, it slices them in small batches of time intervals before processing them. The Spark abstraction for a continuous stream of data is called a DStream (for Discretized Stream). A DStream is a micro-batch of RDDs (Resilient Distributed Datasets). RDDs are distributed collections that can be operated in parallel by arbitrary functions and by transformations over a sliding window of data (windowed computations). Apache Samza Samza ’s approach to streaming is to process messages as they are received, one at a time. Samza’s stream primitive is not a tuple or a Dstream, but a message. Streams are divided into partitions and each partition is an ordered sequence of read-only messages with each message having a unique ID (offset). The system also supports batching, i.e. consuming several messages from the same stream partition in sequence. Samza`s Execution & Streaming modules are both pluggable, although Samza typically relies on Hadoop’s YARN (Yet Another Resource Negotiator) and Apache Kafka. Common Ground All three real-time computation systems are open-source, low-latency, distributed, scalable and fault-tolerant. They all allow you to run your stream processing code through parallel tasks distributed across a cluster of computing machines with fail-over capabilities. They also provide simple APIs to abstract the complexity of the underlying implementations. The three frameworks use different vocabularies for similar concepts: Comparison Matrix A few of the differences are summarized in the table below: There are three general categories of delivery patterns: At-most-once: messages may be lost. This is usually the least desirable outcome. At-least-once: messages may be redelivered (no loss, but duplicates). This is good enough for many use cases. Exactly-once: each message is delivered once and only once (no loss, no duplicates). This is a desirable feature although difficult to guarantee in all cases. Another aspect is state management. There are different strategies to store state. Spark Streaming writes data into the distributed file system (e.g. HDFS). Samza uses an embedded key-value store. With Storm, you’ll have to either roll your own state management at your application layer, or use a higher-level abstraction called Trident. Use Cases All three frameworks are particularly well-suited to efficiently process continuous, massive amounts of real-time data. So which one to use? There are no hard rules, at most a few general guidelines. If you want a high-speed event processing system that allows for incremental computations, Storm would be fine for that. If you further need to run distributed computations on demand, while the client is waiting synchronously for the results, you’ll have Distributed RPC (DRPC) out-of-the-box. Last but not least, because Storm uses Apache Thrift, you can write topologies in any programming language. If you need state persistence and/or exactly-once delivery though, you should look at the higher-level Trident API, which also offers micro-batching. A few companies using Storm: Twitter, Yahoo!, Spotify, The Weather Channel... Speaking of micro-batching, if you must have stateful computations, exactly-once delivery and don’t mind a higher latency, you could consider Spark Streaming…specially if you also plan for graph operations, machine learning or SQL access. The Apache Spark stack lets you combine several libraries with streaming (Spark SQL, MLlib, GraphX) and provides a convenient unifying programming model. In particular, streaming algorithms (e.g. streaming k-means) allow Spark to facilitate decisions in real-time. A few companies using Spark: Amazon, Yahoo!, NASA JPL, eBay Inc., Baidu… If you have a large amount of state to work with (e.g. many gigabytes per partition), Samza co-locates storage and processing on the same machines, allowing to work efficiently with state that won’t fit in memory. The framework also offers flexibility with its pluggable API: its default execution, messaging and storage engines can each be replaced with your choice of alternatives. Moreover, if you have a number of data processing stages from different teams with different codebases, Samza ‘s fine-grained jobs would be particularly well-suited, since they can be added/removed with minimal ripple effects. A few companies using Samza: LinkedIn, Intuit, Metamarkets, Quantiply, Fortscale… Conclusion We only scratched the surface of The Three Apaches. We didn’t cover a number of other features and more subtle differences between these frameworks. Also, it’s important to keep in mind the limits of the above comparisons, as these systems are constantly evolving.

February 28, 2015

by Tony Siciliani

· 32,730 Views · 5 Likes

Standing Up a Local Netflix Eureka

Here I will consider two different ways of standing up a local instance of Netflix Eureka. If you are not familiar with Eureka, it provides a central registry where (micro)services can register themselves and client applications can use this registry to look up specific instances hosting a service and to make the service calls. Approach 1: Native Eureka Library The first way is to simply use the archive file generated by the Netflix Eureka build process: 1. Clone the Eureka source repository here: https://github.com/Netflix/eureka 2. Run "./gradlew build" at the root of the repository, this should build cleanly generating a war file in eureka-server/build/libs folder 3. Grab this file, rename it to "eureka.war" and place it in the webapps folder of either tomcat or jetty. For this exercise I have used jetty. 4. Start jetty, by default jetty will boot up at port 8080, however I wanted to instead bring it up at port 8761, so you can start it up this way, "java -jar start.jar -Djetty.port=8761" The server should start up cleanly and can be verified at this endpoint - "http://localhost:8761/eureka/v2/apps" Approach 2: Spring-Cloud-Netflix Spring-Cloud-Netflix provides a very neat way to bootstrap Eureka. To bring up Eureka server using Spring-Cloud-Netflix the approach that I followed was to clone the sample Eureka server application available here: https://github.com/spring-cloud-samples/eureka 1. Clone this repository 2. From the root of the repository run "mvn spring-boot:run", and that is it!. The server should boot up cleanly and the REST endpoint should come up here: "http://localhost:8761/eureka/apps". As a bonus, Spring-Cloud-Netflix provides a neat UI showing the various applications who have registered with Eureka at the root of the webapp at "http://localhost:8761/". Just a few small issues to be aware of, note that the context url's are a little different in the two cases "eureka/v2/apps" vs "eureka/apps", this can be adjusted on the configurations of the services which register with Eureka. Conclusion Your mileage with these approaches may vary. I have found Spring-Cloud-Netflix a little unstable at times but it has mostly worked out well for me. The documentation at the Spring-Cloud site is also far more exhaustive than the one provided at the Netflix Eureka site.

February 26, 2015

by Biju Kunjummen

· 13,308 Views

A JAXB Nuance: String Versus Enum from Enumerated Restricted XSD String

Although Java Architecture for XML Binding (JAXB) is fairly easy to use in nominal cases (especially since Java SE 6), it also presents numerous nuances. Some of the common nuances are due to the inability to exactlymatch (bind) XML Schema Definition (XSD) types to Java types. This post looks at one specific example of this that also demonstrates how different XSD constructs that enforce the same XML structure can lead to different Java types when the JAXB compiler generates the Java classes. The next code listing, for Food.xsd, defines a schema for food types. The XSD mandates that valid XML will have a root element called "Food" with three nested elements "Vegetable", "Fruit", and "Dessert". Although the approach used to specify the "Vegetable" and "Dessert" elements is different than the approach used to specify the "Fruit" element, both approaches result in similar "valid XML." The "Vegetable" and "Dessert" elements are declared directly as elements of the prescribed simpleTypes defined later in the XSD. The "Fruit" element is defined via reference (ref=) to another defined element that consists of a simpleType. Food.xsd Although Vegetable and Dessert elements are defined in the schema differently than Fruit, the resulting valid XML is the same. A valid XML file is shown next in the code listing for food1.xml. food1.xml Spinach Watermelon Pie At this point, I'll use a simple Groovy script to validate the above XML against the above XSD. The code for this Groovy XML validation script (validateXmlAgainstXsd.groovy) is shown next. validateXmlAgainstXsd.groovy #!/usr/bin/env groovy // validateXmlAgainstXsd.groovy // // Accepts paths/names of two files. The first is the XML file to be validated // and the second is the XSD against which to validate that XML. if (args.length < 2) { println "USAGE: groovy validateXmlAgainstXsd.groovy " System.exit(-1) } String xml = args[0] String xsd = args[1] import javax.xml.validation.Schema import javax.xml.validation.SchemaFactory import javax.xml.validation.Validator try { SchemaFactory schemaFactory = SchemaFactory.newInstance(javax.xml.XMLConstants.W3C_XML_SCHEMA_NS_URI) Schema schema = schemaFactory.newSchema(new File(xsd)) Validator validator = schema.newValidator() validator.validate(new javax.xml.transform.stream.StreamSource(xml)) } catch (Exception exception) { println "\nERROR: Unable to validate ${xml} against ${xsd} due to '${exception}'\n" System.exit(-1) } println "\nXML file ${xml} validated successfully against ${xsd}.\n" The next screen snapshot demonstrates running the above Groovy XML validation script against food1.xmland Food.xsd. The objective of this post so far has been to show how different approaches in an XSD can lead to the same XML being valid. Although these different XSD approaches prescribe the same valid XML, they lead to different Java class behavior when JAXB is used to generate classes based on the XSD. The next screen snapshot demonstrates running the JDK-provided JAXB xjc compiler against the Food.xsd to generate the Java classes. The output from the JAXB generation shown above indicates that Java classes were created for the "Vegetable" and "Dessert" elements but not for the "Fruit" element. This is because "Vegetable" and "Dessert" were defined differently than "Fruit" in the XSD. The next code listing is for the Food.java class generated by the xjc compiler. From this we can see that the generated Food.java class references specific generated Java types for Vegetable and Dessert, but references simply a generic Java String for Fruit. Food.java (generated by JAXB jxc compiler) // // This file was generated by the JavaTM Architecture for XML Binding(JAXB) Reference Implementation, v2.2.8-b130911.1802 // See http://java.sun.com/xml/jaxb // Any modifications to this file will be lost upon recompilation of the source schema. // Generated on: 2015.02.11 at 10:17:32 PM MST // package com.blogspot.marxsoftware.foodxml; import javax.xml.bind.annotation.XmlAccessType; import javax.xml.bind.annotation.XmlAccessorType; import javax.xml.bind.annotation.XmlElement; import javax.xml.bind.annotation.XmlRootElement; import javax.xml.bind.annotation.XmlSchemaType; import javax.xml.bind.annotation.XmlType; /** * Java class for anonymous complex type. * * The following schema fragment specifies the expected content contained within this class. * * * * * * * * * * * * * * * * */ @XmlAccessorType(XmlAccessType.FIELD) @XmlType(name = "", propOrder = { "vegetable", "fruit", "dessert" }) @XmlRootElement(name = "Food") public class Food { @XmlElement(name = "Vegetable", required = true) @XmlSchemaType(name = "string") protected Vegetable vegetable; @XmlElement(name = "Fruit", required = true) protected String fruit; @XmlElement(name = "Dessert", required = true) @XmlSchemaType(name = "string") protected Dessert dessert; /** * Gets the value of the vegetable property. * * @return * possible object is * {@link Vegetable } * */ public Vegetable getVegetable() { return vegetable; } /** * Sets the value of the vegetable property. * * @param value * allowed object is * {@link Vegetable } * */ public void setVegetable(Vegetable value) { this.vegetable = value; } /** * Gets the value of the fruit property. * * @return * possible object is * {@link String } * */ public String getFruit() { return fruit; } /** * Sets the value of the fruit property. * * @param value * allowed object is * {@link String } * */ public void setFruit(String value) { this.fruit = value; } /** * Gets the value of the dessert property. * * @return * possible object is * {@link Dessert } * */ public Dessert getDessert() { return dessert; } /** * Sets the value of the dessert property. * * @param value * allowed object is * {@link Dessert } * */ public void setDessert(Dessert value) { this.dessert = value; } } The advantage of having specific Vegetable and Dessert classes is the additional type safety they bring as compared to a general Java String. Both Vegetable.java and Dessert.java are actually enums because they come from enumerated values in the XSD. The two generated enums are shown in the next two code listings. Vegetable.java (generated with JAXB xjc compiler) // // This file was generated by the JavaTM Architecture for XML Binding(JAXB) Reference Implementation, v2.2.8-b130911.1802 // See http://java.sun.com/xml/jaxb // Any modifications to this file will be lost upon recompilation of the source schema. // Generated on: 2015.02.11 at 10:17:32 PM MST // package com.blogspot.marxsoftware.foodxml; import javax.xml.bind.annotation.XmlEnum; import javax.xml.bind.annotation.XmlEnumValue; import javax.xml.bind.annotation.XmlType; /** * Java class for Vegetable. * * The following schema fragment specifies the expected content contained within this class. * * * * * * * * * * * * */ @XmlType(name = "Vegetable") @XmlEnum public enum Vegetable { @XmlEnumValue("Carrot") CARROT("Carrot"), @XmlEnumValue("Squash") SQUASH("Squash"), @XmlEnumValue("Spinach") SPINACH("Spinach"), @XmlEnumValue("Celery") CELERY("Celery"); private final String value; Vegetable(String v) { value = v; } public String value() { return value; } public static Vegetable fromValue(String v) { for (Vegetable c: Vegetable.values()) { if (c.value.equals(v)) { return c; } } throw new IllegalArgumentException(v); } } Dessert.java (generated with JAXB xjc compiler) // // This file was generated by the JavaTM Architecture for XML Binding(JAXB) Reference Implementation, v2.2.8-b130911.1802 // See http://java.sun.com/xml/jaxb // Any modifications to this file will be lost upon recompilation of the source schema. // Generated on: 2015.02.11 at 10:17:32 PM MST // package com.blogspot.marxsoftware.foodxml; import javax.xml.bind.annotation.XmlEnum; import javax.xml.bind.annotation.XmlEnumValue; import javax.xml.bind.annotation.XmlType; /** * Java class for Dessert. * * The following schema fragment specifies the expected content contained within this class. * * * * * * * * * * * */ @XmlType(name = "Dessert") @XmlEnum public enum Dessert { @XmlEnumValue("Pie") PIE("Pie"), @XmlEnumValue("Cake") CAKE("Cake"), @XmlEnumValue("Ice Cream") ICE_CREAM("Ice Cream"); private final String value; Dessert(String v) { value = v; } public String value() { return value; } public static Dessert fromValue(String v) { for (Dessert c: Dessert.values()) { if (c.value.equals(v)) { return c; } } throw new IllegalArgumentException(v); } } Having enums generated for the XML elements ensures that only valid values for those elements can be represented in Java. Conclusion JAXB makes it relatively easy to map Java to XML, but because there is not a one-to-one mapping between Java and XML types, there can be some cases where the generated Java type for a particular XSD prescribed element is not obvious. This post has shown how two different approaches to building an XSD to enforce the same basic XML structure can lead to very different results in the Java classes generated with the JAXB xjccompiler. In the example shown in this post, declaring elements in the XSD directly on simpleTypes restricting XSD's string to a specific set of enumerated values is preferable to declaring elements as references to other elements wrapping a simpleType of restricted string enumerated values because of the type safety that is achieved when enums are generated rather than use of general Java Strings.

February 25, 2015

by Dustin Marx

· 21,459 Views · 1 Like

Redirecting All Kinds of stdout in Python

A common task in Python (especially while testing or debugging) is to redirect sys.stdout to a stream or a file while executing some piece of code. However, simply "redirecting stdout" is sometimes not as easy as one would expect; hence the slightly strange title of this post. In particular, things become interesting when you want C code running within your Python process (including, but not limited to, Python modules implemented as C extensions) to also have its stdout redirected according to your wish. This turns out to be tricky and leads us into the interesting world of file descriptors, buffers and system calls. But let's start with the basics. Pure Python The simplest case arises when the underlying Python code writes to stdout, whether by calling print, sys.stdout.write or some equivalent method. If the code you have does all its printing from Python, redirection is very easy. With Python 3.4 we even have a built-in tool in the standard library for this purpose - contextlib.redirect_stdout. Here's how to use it: from contextlib import redirect_stdout f = io.StringIO() with redirect_stdout(f): print('foobar') print(12) print('Got stdout: "{0}"'.format(f.getvalue())) When this code runs, the actual print calls within the with block don't emit anything to the screen, and you'll see their output captured by in the stream f. Incidentally, note how perfect the with statement is for this goal - everything within the block gets redirected; once the block is done, things are cleaned up for you and redirection stops. If you're stuck on an older and uncool Python, prior to 3.4 [1], what then? Well, redirect_stdout is really easy to implement on your own. I'll change its name slightly to avoid confusion: from contextlib import contextmanager @contextmanager def stdout_redirector(stream): old_stdout = sys.stdout sys.stdout = stream try: yield finally: sys.stdout = old_stdout So we're back in the game: f = io.StringIO() with stdout_redirector(f): print('foobar') print(12) print('Got stdout: "{0}"'.format(f.getvalue())) Redirecting C-level streams Now, let's take our shiny redirector for a more challenging ride: import ctypes libc = ctypes.CDLL(None) f = io.StringIO() with stdout_redirector(f): print('foobar') print(12) libc.puts(b'this comes from C') os.system('echo and this is from echo') print('Got stdout: "{0}"'.format(f.getvalue())) I'm using ctypes to directly invoke the C library's puts function [2]. This simulates what happens when C code called from within our Python code prints to stdout - the same would apply to a Python module using a C extension. Another addition is the os.system call to invoke a subprocess that also prints to stdout. What we get from this is: this comes from C and this is from echo Got stdout: "foobar 12 " Err... no good. The prints got redirected as expected, but the output from puts and echo flew right past our redirector and ended up in the terminal without being caught. What gives? To grasp why this didn't work, we have to first understand what sys.stdout actually is in Python. Detour - on file descriptors and streams This section dives into some internals of the operating system, the C library, and Python [3]. If you just want to know how to properly redirect printouts from C in Python, you can safely skip to the next section (though understanding how the redirection works will be difficult). Files are opened by the OS, which keeps a system-wide table of open files, some of which may point to the same underlying disk data (two processes can have the same file open at the same time, each reading from a different place, etc.) File descriptors are another abstraction, which is managed per-process. Each process has its own table of open file descriptors that point into the system-wide table. Here's a schematic, taken from The Linux Programming Interface: File descriptors allow sharing open files between processes (for example when creating child processes with fork). They're also useful for redirecting from one entry to another, which is relevant to this post. Suppose that we make file descriptor 5 a copy of file descriptor 4. Then all writes to 5 will behave in the same way as writes to 4. Coupled with the fact that the standard output is just another file descriptor on Unix (usually index 1), you can see where this is going. The full code is given in the next section. File descriptors are not the end of the story, however. You can read and write to them with the read and write system calls, but this is not the way things are typically done. The C runtime library provides a convenient abstraction around file descriptors - streams. These are exposed to the programmer as the opaque FILE structure with a set of functions that act on it (for example fprintf and fgets). FILE is a fairly complex structure, but the most important things to know about it is that it holds a file descriptor to which the actual system calls are directed, and it provides buffering, to ensure that the system call (which is expensive) is not called too often. Suppose you emit stuff to a binary file, a byte or two at a time. Unbuffered writes to the file descriptor with write would be quite expensive because each write invokes a system call. On the other hand, using fwrite is much cheaper because the typicall call to this function just copies your data into its internal buffer and advances a pointer. Only occasionally (depending on the buffer size and flags) will an actual write system call be issued. With this information in hand, it should be easy to understand what stdout actually is for a C program. stdout is a global FILE object kept for us by the C library, and it buffers output to file descriptor number 1. Calls to functions like printf and puts add data into this buffer. fflush forces its flushing to the file descriptor, and so on. But we're talking about Python here, not C. So how does Python translate calls to sys.stdout.write to actual output? Python uses its own abstraction over the underlying file descriptor - a file object. Moreover, in Python 3 this file object is further wrapper in an io.TextIOWrapper, because what we pass to print is a Unicode string, but the underlying write system calls accept binary data, so encoding has to happen en route. The important take-away from this is: Python and a C extension loaded by it (this is similarly relevant to C code invoked via ctypes) run in the same process, and share the underlying file descriptor for standard output. However, while Python has its own high-level wrapper around it - sys.stdout, the C code uses its own FILE object. Therefore, simply replacing sys.stdout cannot, in principle, affect output from C code. To make the replacement deeper, we have to touch something shared by the Python and C runtimes - the file descriptor. Redirecting with file descriptor duplication Without further ado, here is an improved stdout_redirector that also redirects output from C code [4]: from contextlib import contextmanager import ctypes import io import os, sys import tempfile libc = ctypes.CDLL(None) c_stdout = ctypes.c_void_p.in_dll(libc, 'stdout') @contextmanager def stdout_redirector(stream): # The original fd stdout points to. Usually 1 on POSIX systems. original_stdout_fd = sys.stdout.fileno() def _redirect_stdout(to_fd): """Redirect stdout to the given file descriptor.""" # Flush the C-level buffer stdout libc.fflush(c_stdout) # Flush and close sys.stdout - also closes the file descriptor (fd) sys.stdout.close() # Make original_stdout_fd point to the same file as to_fd os.dup2(to_fd, original_stdout_fd) # Create a new sys.stdout that points to the redirected fd sys.stdout = io.TextIOWrapper(os.fdopen(original_stdout_fd, 'wb')) # Save a copy of the original stdout fd in saved_stdout_fd saved_stdout_fd = os.dup(original_stdout_fd) try: # Create a temporary file and redirect stdout to it tfile = tempfile.TemporaryFile(mode='w+b') _redirect_stdout(tfile.fileno()) # Yield to caller, then redirect stdout back to the saved fd yield _redirect_stdout(saved_stdout_fd) # Copy contents of temporary file to the given stream tfile.flush() tfile.seek(0, io.SEEK_SET) stream.write(tfile.read()) finally: tfile.close() os.close(saved_stdout_fd) There are a lot of details here (such as managing the temporary file into which output is redirected) that may obscure the key approach: using dup and dup2 to manipulate file descriptors. These functions let us duplicate file descriptors and make any descriptor point at any file. I won't spend more time on them - go ahead and read their documentation, if you're interested. The detour section should provide enough background to understand it. Let's try this: f = io.BytesIO() with stdout_redirector(f): print('foobar') print(12) libc.puts(b'this comes from C') os.system('echo and this is from echo') print('Got stdout: "{0}"'.format(f.getvalue().decode('utf-8'))) Gives us: Got stdout: "and this is from echo this comes from C foobar 12 " Success! A few things to note: The output order may not be what we expected. This is due to buffering. If it's important to preserve order between different kinds of output (i.e. between C and Python), further work is required to disable buffering on all relevant streams. You may wonder why the output of echo was redirected at all? The answer is that file descriptors are inherited by subprocesses. Since we rigged fd 1 to point to our file instead of the standard output prior to forking to echo, this is where its output went. We use a BytesIO here. This is because on the lowest level, the file descriptors are binary. It may be possible to do the decoding when copying from the temporary file into the given stream, but that can hide problems. Python has its in-memory understanding of Unicode, but who knows what is the right encoding for data printed out from underlying C code? This is why this particular redirection approach leaves the decoding to the caller. The above also makes this code specific to Python 3. There's no magic involved, and porting to Python 2 is trivial, but some assumptions made here don't hold (such as sys.stdout being a io.TextIOWrapper). Redirecting the stdout of a child process We've just seen that the file descriptor duplication approach lets us grab the output from child processes as well. But it may not always be the most convenient way to achieve this task. In the general case, you typically use the subprocess module to launch child processes, and you may launch several such processes either in a pipe or separately. Some programs will even juggle multiple subprocesses launched this way in different threads. Moreover, while these subprocesses are running you may want to emit something to stdout and you don't want this output to be captured. So, managing the stdout file descriptor in the general case can be messy; it is also unnecessary, because there's a much simpler way. The subprocess module's swiss knife Popen class (which serve as the basis for much of the rest of the module) accepts a stdout parameter, which we can use to ask it to get access to the child's stdout: import subprocess echo_cmd = ['echo', 'this', 'comes', 'from', 'echo'] proc = subprocess.Popen(echo_cmd, stdout=subprocess.PIPE) output = proc.communicate()[0] print('Got stdout:', output) The subprocess.PIPE argument can be used to set up actual child process pipes (a la the shell), but in its simplest incarnation it captures the process's output. If you only launch a single child process at a time and are interested in its output, there's an even simpler way: output = subprocess.check_output(echo_cmd) print('Got stdout:', output) check_output will capture and return the child's standard output to you; it will also raise an exception if the child exist with a non-zero return code. Conclusion I hope I covered most of the common cases where "stdout redirection" is needed in Python. Naturally, all of the same applies to the other standard output stream - stderr. Also, I hope the background on file descriptors was sufficiently clear to explain the redirection code; squeezing this topic in such a short space is challenging. Let me know if any questions remain or if there's something I could have explained better. Finally, while it is conceptually simple, the code for the redirector is quite long; I'll be happy to hear if you find a shorter way to achieve the same effect. [1] Do not despair. As of February 2015, a sizable chunk of the worldwide Python programmers are in the same boat. [2] Note that bytes passed to puts. This being Python 3, we have to be careful since libc doesn't understand Python's unicode strings. [3] The following description focuses on Unix/POSIX systems; also, it's necessarily partial. Large book chapters have been written on this topic - I'm just trying to present some key concepts relevant to stream redirection. [4] The approach taken here is inspired by this Stack Overflow answer.

February 23, 2015

by Eli Bendersky

· 19,680 Views

Sneak Peek into the JCache API (JSR 107)

This post covers the JCache API at a high level and provides a teaser – just enough for you to (hopefully) start itching about it ;-) In this post …. JCache overview JCache API, implementations Supported (Java) platforms for JCache API Quick look at Oracle Coherence Fun stuff – Project Headlands (RESTified JCache by Adam Bien) , JCache related talks at Java One 2014, links to resources for learning more about JCache What is JCache? JCache (JSR 107) is a standard caching API for Java. It provides an API for applications to be able to create and work with in-memory cache of objects. Benefits are obvious – one does not need to concentrate on the finer details of implementing the Caching and time is better spent on the core business logic of the application. JCache components The specification itself is very compact and surprisingly intuitive. The API defines high level components (interfaces) some of which are listed below Caching Provider – used to control Caching Managers and can deal with several of them, Cache Manager – deals with create, read, destroy operations on a Cache Cache – stores entries (the actual data) and exposes CRUD interfaces to deal with the entries Entry – abstraction on top of a key-value pair akin to a java.util.Map Hierarchy of JCache API components JCache Implementations JCache defines the interfaces which of course are implemented by different vendors a.k.a Providers. Oracle Coherence Hazelcast Infinispan ehcache Reference Implementation – this is more for reference purpose rather than a production quality implementation. It is per the specification though and you can be rest assured of the fact that it does in fact pass the TCK as well From the application point of view, all that’s required is the implementation to be present in the classpath. The API also provides a way to further fine tune the properties specific to your provider via standard mechanisms. You should be able to track the list of JCache reference implementations from the JCP website link public class JCacheUsage{ public static void main(String[] args){ //bootstrap the JCache Provider CachingProvider jcacheProvider = Caching.getCachingProvider(); CacheManager jcacheManager = jcacheProvider.getCacheManager(); //configure cache MutableConfiguration jcacheConfig = new MutableConfiguration<>(); jcacheConfig.setTypes(String.class, MyPreciousObject.class); //create cache Cache cache = jcacheManager.createCache("PreciousObjectCache", jcacheConfig); //play around String key = UUID.randomUUID().toString(); cache.put(key, new MyPreciousObject()); MyPreciousObject inserted = cache.get(key); cache.remove(key); cache.get(key); //will throw javax.cache.CacheException since the key does not exist } } JCache provider detection JCache provider detection happens automatically when you only have a single JCache provider on the class path You can choose from the below options as well //set JMV level system property -Djavax.cache.spi.cachingprovider=org.ehcache.jcache.JCacheCachingProvider //code level config System.setProperty("javax.cache.spi.cachingprovider","org.ehcache.jcache.JCacheCachingProvider //you want to choose from multiple JCache providers at runtime CachingProvider ehcacheJCacheProvider = Caching.getCachingProvider("org.ehcache.jcache.JCacheCachingProvider"); //which JCache providers do I have on the classpath? Iterable jcacheProviders = Caching.getCachingProviders(); Java Platform support Compliant with Java SE 6 and above Does not define any details in terms of Java EE integration. This does not mean that it cannot be used in a Java EE environment – it’s just not standardized yet. Could not be plugged into Java EE 7 as a tried and tested standard Candidate for Java EE 8 Project Headlands: Java EE and JCache in tandem By none other than Adam Bien himself ! Java EE 7, Java SE 8 and JCache in action Exposes the JCache API via JAX-RS (REST) Uses Hazelcast as the JCache provider Highly recommended ! Oracle Coherence This post deals with high level stuff w.r.t JCache in general. However, a few lines about Oracle Coherence in general would help put things in perspective Oracle Coherence is a part of Oracle’s Cloud Application Foundation stack It is primarily an in-memory data grid solution Geared towards making applications more scalable in general What’s important to know is that from version 12.1.3 onwards, Oracle Coherence includes a reference implementation for JCache (more in the next section) JCache support in Oracle Coherence Support for JCache implies that applications can now use a standard API to access the capabilities of Oracle Coherence This is made possible by Coherence by simply providing an abstraction over its existing interfaces (NamedCache etc). Application deals with a standard interface (JCache API) and the calls to the API are delegated to the existing Coherence core library implementation Support for JCache API also means that one does not need to use Coherence specific APIs in the application resulting in vendor neutral code which equals portability How ironic – supporting a standard API and always keeping your competitors in the hunt ;-) But hey! That’s what healthy competition and quality software is all about ! Talking of healthy competition – Oracle Coherence does support a host of other features in addition to the standard JCache related capabilities. The Oracle Coherence distribution contains all the libraries for working with the JCache implementation The service definition file in the coherence-jcache.jar qualifies it as a valid JCache provider implementation Curious about Oracle Coherence ? Quick Starter page Documentation Installation Further reading about Coherence and JCache combo – Oracle Coherence documentation JCache at Java One 2014 Couple of great talks revolving around JCache at Java One 2014 Come, Code, Cache, Compute! by Steve Millidge Using the New JCache by Brian Oliver and Greg Luck Hope this was fun :-) Cheers !

February 23, 2015

by Abhishek Gupta

CORE

· 6,316 Views · 1 Like

Retry-After HTTP Header in Practice

Retry-After is a lesser known HTTP response header.

February 20, 2015

by Tomasz Nurkiewicz

· 16,848 Views

Getting Started with Dropwizard: Authentication, Configuration and HTTPS

Basic Authentication is the simplest way to secure access to a resource.

February 10, 2015

by Dmitry Noranovich

· 48,826 Views · 1 Like

The API Gateway Pattern: Angular JS and Spring Security Part IV

Written by Dave Syer in the Spring blog In this article we continue our discussion of how to use Spring Security with Angular JS in a “single page application”. Here we show how to build an API Gateway to control the authentication and access to the backend resources using Spring Cloud. This is the fourth in a series of articles, and you can catch up on the basic building blocks of the application or build it from scratch by reading the first article, or you can just go straight to the source code in Github. In the last article we built a simple distributed application that used Spring Session to authenticate the backend resources. In this one we make the UI server into a reverse proxy to the backend resource server, fixing the issues with the last implementation (technical complexity introduced by custom token authentication), and giving us a lot of new options for controlling access from the browser client. Reminder: if you are working through this article with the sample application, be sure to clear your browser cache of cookies and HTTP Basic credentials. In Chrome the best way to do that for a single server is to open a new incognito window. Creating an API Gateway An API Gateway is a single point of entry (and control) for front end clients, which could be browser based (like the examples in this article) or mobile. The client only has to know the URL of one server, and the backend can be refactored at will with no change, which is a significant advantage. There are other advantages in terms of centralization and control: rate limiting, authentication, auditing and logging. And implementing a simple reverse proxy is really simple with Spring Cloud. If you were following along in the code, you will know that the application implementation at the end of the last article was a bit complicated, so it’s not a great place to iterate away from. There was, however, a halfway point which we could start from more easily, where the backend resource wasn’t yet secured with Spring Security. The source code for this is a separate project in Github so we are going to start from there. It has a UI server and a resource server and they are talking to each other. The resource server doesn’t have Spring Security yet so we can get the system working first and then add that layer. Declarative Reverse Proxy in One Line To turn it into an API Gateawy, the UI server needs one small tweak. Somewhere in the Spring configuration we need to add an @EnableZuulProxy annotation, e.g. in the main (only)application class: @SpringBootApplication @RestController @EnableZuulProxy public class UiApplication { ... } and in an external configuration file we need to map a local resource in the UI server to a remote one in the external configuration (“application.yml”): security: ... zuul: routes: resource: path: /resource/** url: http://localhost:9000 This says “map paths with the pattern /resource/** in this server to the same paths in the remote server at localhost:9000”. Simple and yet effective (OK so it’s 6 lines including the YAML, but you don’t always need that)! All we need to make this work is the right stuff on the classpath. For that purpose we have a few new lines in our Maven POM: org.springframework.cloud spring-cloud-starter-parent 1.0.0.BUILD-SNAPSHOT pom import org.springframework.cloud spring-cloud-starter-zuul ... Note the use of the “spring-cloud-starter-zuul” - it’s a starter POM just like the Spring Boot ones, but it governs the dependencies we need for this Zuul proxy. We are also using because we want to be able to depend on all the versions of transitive dependencies being correct. Consuming the Proxy in the Client With those changes in place our application still works, but we haven’t actually used the new proxy yet until we modify the client. Fortunately that’s trivial. We just need to go from this implementation of the “home” controller: angular.module('hello', [ 'ngRoute' ]) ... .controller('home', function($scope, $http) { $http.get('http://localhost:9000/').success(function(data) { $scope.greeting = data; }) }); to a local resource: angular.module('hello', [ 'ngRoute' ]) ... .controller('home', function($scope, $http) { $http.get('resource/').success(function(data) { $scope.greeting = data; }) }); Now when we fire up the servers everything is working and the requests are being proxied through the UI (API Gateway) to the resource server. Further Simplifications Even better: we don’t need the CORS filter any more in the resource server. We threw that one together pretty quickly anyway, and it should have been a red light that we had to do anything as technically focused by hand (especially where it concerns security). Fortunately it is now redundant, so we can just throw it away, and go back to sleeping at night! Securing the Resource Server You might remember in the intermediate state that we started from there is no security in place for the resource server. Aside: Lack of software security might not even be a problem if your network architecture mirrors the application architecture (you can just make the resource server physically inaccessible to anyone but the UI server). As a simple demonstration of that we can make the resource server only accessible on localhost. Just add this to application.properties in the resource server: server.address: 127.0.0.1 Wow, that was easy! Do that with a network address that’s only visible in your data center and you have a security solution that works for all resource servers and all user desktops. Suppose that we decide we do need security at the software level (quite likely for a number of reasons). That’s not going to be a problem, because all we need to do is add Spring Security as a dependency (in the resource server POM): org.springframework.boot spring-boot-starter-security That’s enough to get us a secure resource server, but it won’t get us a working application yet, for the same reason that it didn’t in Part III: there is no shared authentication state between the two servers. Sharing Authentication State We can use the same mechanism to share authentication (and CSRF) state as we did in the last, i.e. Spring Session. We add the dependency to both servers as before: org.springframework.session spring-session 1.0.0.RELEASE org.springframework.boot spring-boot-starter-redis but this time the configuration is much simpler because we can just add the same Filterdeclaration to both. First the UI server (adding @EnableRedisHttpSession): @SpringBootApplication @RestController @EnableZuulProxy @EnableRedisHttpSession public class UiApplication { ... } and then the resource server. There are two changes to make: one is adding@EnableRedisHttpSession and a HeaderHttpSessionStrategy bean to theResourceApplication: @SpringBootApplication @RestController @EnableRedisHttpSession class ResourceApplication { ... @Bean HeaderHttpSessionStrategy sessionStrategy() { new HeaderHttpSessionStrategy(); } } and the other is to explicitly ask for a non-stateless session creation policy inapplication.properties: security.sessions: NEVER As long as redis is still running in the background (use the fig.yml if you like to start it) then the system will work. Load the homepage for the UI at http://localhost:8080 and login and you will see the message from the backend rendered on the homepage. How Does it Work? What is going on behind the scenes now? First we can look at the HTTP requests in the UI server (and API Gateway): VERB PATH STATUS RESPONSE GET / 200 index.html GET /css/angular-bootstrap.css 200 Twitter bootstrap CSS GET /js/angular-bootstrap.js 200 Bootstrap and Angular JS GET /js/hello.js 200 Application logic GET /user 302 Redirect to login page GET /login 200 Whitelabel login page (ignored) GET /resource 302 Redirect to login page GET /login 200 Whitelabel login page (ignored) GET /login.html 200 Angular login form partial POST /login 302 Redirect to home page (ignored) GET /user 200 JSON authenticated user GET /resource 200 (Proxied) JSON greeting That’s identical to the sequence at the end of Part II except for the fact that the cookie names are slightly different (“SESSION” instead of “JSESSIONID”) because we are using Spring Session. But the architecture is different and that last request to “/resource” is special because it was proxied to the resource server. We can see the reverse proxy in action by looking at the “/trace” endpoint in the UI server (from Spring Boot Actuator, which we added with the Spring Cloud dependencies). Go tohttp://localhost:8080/trace in a browser and scroll to the end (if you don’t have one already get a JSON plugin for your browser to make it nice and readable). You will need to authenticate with HTTP Basic (browser popup), but the same credentials are valid as for your login form. At or near the end you should see a pair of requests something like this: { "timestamp": 1420558194546, "info": { "method": "GET", "path": "/", "query": "" "remote": true, "proxy": "resource", "headers": { "request": { "accept": "application/json, text/plain, */*", "x-xsrf-token": "542c7005-309c-4f50-8a1d-d6c74afe8260", "cookie": "SESSION=c18846b5-f805-4679-9820-cd13bd83be67; XSRF-TOKEN=542c7005-309c-4f50-8a1d-d6c74afe8260", "x-forwarded-prefix": "/resource", "x-forwarded-host": "localhost:8080" }, "response": { "Content-Type": "application/json;charset=UTF-8", "status": "200" } }, } }, { "timestamp": 1420558200232, "info": { "method": "GET", "path": "/resource/", "headers": { "request": { "host": "localhost:8080", "accept": "application/json, text/plain, */*", "x-xsrf-token": "542c7005-309c-4f50-8a1d-d6c74afe8260", "cookie": "SESSION=c18846b5-f805-4679-9820-cd13bd83be67; XSRF-TOKEN=542c7005-309c-4f50-8a1d-d6c74afe8260" }, "response": { "Content-Type": "application/json;charset=UTF-8", "status": "200" } } } }, The second entry there is the request from the client to the gateway on “/resource” and you can see the cookies (added by the browser) and the CSRF header (added by Angular as discussed inPart II). The first entry has remote: true and that means it’s tracing the call to the resource server. You can see it went out to a uri path “/” and you can see that (crucially) the cookies and CSRF headers have been sent too. Without Spring Session these headers would be meaningless to the resource server, but the way we have set it up it can now use those headers to re-constitute a session with authentication and CSRF token data. So the request is permitted and we are in business! Conclusion We covered quite a lot in this article but we got to a really nice place where there is a minimal amount of boilerplate code in our two servers, they are both nicely secure and the user experience isn’t compromised. That alone would be a reason to use the API Gateway pattern, but really we have only scratched the surface of what that might be used for (Netflix uses it for a lot of things). Read up on Spring Cloud to find out more on how to make it easy to add more features to the gateway. The next article in this series will extend the application architecture a bit by extracting the authentication responsibilities to a separate server (the Single Sign On pattern).

February 9, 2015

by Pieter Humphrey

· 16,310 Views

Introducing the Database Selection Matrix

Originally Written by Mat Keep For the better part of a generation, the database landscape had changed very little. No one could say “this is not your father’s database.” They had become, in a word, boring. Then a combination of factors catalyzed an era of innovation in database technologies: cheap storage and compute resources; pervasive connectivity; social networks; smartphones; the proliferation of sensors; open source software. Data volumes grew (and are growing) at exponential rates. Over 80% of today’s data no longer fits neatly into the normalized row and column table formats of the past. And so developers began engineering solutions to a new set of problems with a very different set of resources and assumptions. Today these new options include a variety of database architectures built around diverse data models – from key-value to document to wide-column and graph. And of course you still have the option of the venerable relational database. For the enterprise these new technologies hold great promise. They open the door to new applications that could not be imagined before, or to more efficiently solve existing problems. They attract new technical talent. They facilitate the migration of systems to more cost effective infrastructure based on commodity hardware and cloud platforms. But at the same time, evaluation of these new options requires careful consideration. Selecting the appropriate database for a new project requires evaluation against multiple criteria, including: Development considerations: includes the data model, query functionality, available drivers, data consistency. These factors dictate the functionality of your application, and how quickly you can build it. Operational considerations: performance and scalability, high availability, data center awareness, security, management and backups. Over the application’s lifetime, operational costs will contribute a significant percentage to the project’s Total Cost of Ownership (TCO), and so these factors constitute your ability to meet SLAs while minimizing administrative overhead. Commercial considerations: licensing, pricing and support. You need to know that the database you choose is available in a way that is aligned with how you do business. Each these considerations need to be evaluated in context of specific application requirements as well as internal technology standards, skills availability and integration with your existing enterprise architecture. So, where to start? The Database Selection Matrix is designed to serve as a decision framework by teams responsible for database selection. It has been developed in collaboration with several large enterprises who have the choice of running multiple databases in production, and who wanted to institute a systematic methodology for database evaluation. Responses to questions in the matrix helped them identify key requirements and guide selection. And it can do the same for you. Lets illustrate how the Database Selection Matrix can be used by working through a practical example. The Database Selection Matrix in Action! ACME Retail Corporation runs a large vehicle fleet to distribute produce to its nationwide network of stores. The CEO is intent on improving distribution efficiency and so tasks her enterprise architects to build a new platform that can utilize sensor data generated by the company’s trucks. By capturing and analyzing this data, the organization believes it can optimize route planning, improve delivery times, cut wastage and reduce business interruptions caused by breakdowns. ACME Retail Corp is typical of many enterprises that see the opportunity to unlock new efficiencies by leveraging the “Internet of Things”. As Morgan Stanley stated in the “Internet of Things is Now” research “We do not believe traditional data storage architectures are well- suited to accommodate the volume, velocity, and variety of IoT data”. For this reason, enterprises are looking beyond traditional RDBMS technology to the swathe of new database options available to them. Bosch SI did exactly this when it took the decision to use MongoDB to power the Bosch IoT Suite. Of course, MongoDB may not be a perfect fit for every IoT project. There are many choices available – as there for every new type of project – and the ACME architecture group needs a way to navigate the complex landscape of modern databases. Using the Database Selection Matrix, they have the framework to ask the key questions that will guide their technology decisions. So lets put it into practice. Development Considerations In this opening phase, the architects need to evaluate how their shortlisted database options meet the functional requirements of the app that is being built. This is impacted directly by multiple factors – and these are the questions they will need to ask. The Data Model: Will the application need to handle data of varying structure and types? How large can each data type be – is our data made up of simple integers, strings and timestamps or can it also be large binary files such as images or videos? Can our data just be represented as a set of opaque values, or does it need to be typed so other applications can make sense of it? Do we know the data structure will remain constant, or will it vary as we introduce new sensor data and as the business updates application requirements? Does the application require its data to be strongly consistent (i.e. read our own writes), or can eventually consistent data be tolerated (and do our developers know how to handle the complexity it introduces?). Do we end up trading performance and availability if we configure the database to only return the freshest data? The Query Model: What sort of queries are we going to run against the database? Is it simple key-value lookups that we know in advance or do we need to execute ad-hoc queries and complex aggregations to support real-time analytics that the business wants to see? Can we run analytics directly against the database, or do we need to replicate data to dedicated search or analytics engines? Will the application be handling geospatial queries and text search? Does the data need to be integrated with our BI & analytics tools, and what about our new Hadoop cluster, or the data warehouse? Which languages will our engineers be using to develop the application, and does the database have drivers available for them? Operational Considerations In this second phase, the ACME Retail architects need to evaluate how each database would run in production. No-one wants to hand-feed a custom technology, so they need to understand if the database can meet the availability, scalability and security needs of the business, and interoperate with the existing management frameworks. Service Availability: What is the application’s availability SLA? What are our RTO and RPO objectives? Will our operations teams manage failure recovery, or is this something that should be fully automated by the database? What capabilities does the database offer to maintain availability during routine maintenance? Are there tools available to manage this or do we need to script something ourselves? Are there specific requirements to replicate data between our data centers to support disaster recovery? Scalability: How do we expect this application to grow? Will the database need to scale beyond the limits of just a few servers? If data is to be distributed across multiple nodes, will it be partitioned in such a way that it is still optimized for the application’s query patterns? Do we need to scale this across data centers? Can we write and read data locally to reduce the effects of geographic latency? Can we scale storage capacity and I/O by compressing the data, and are different compression algorithms available to optimize compression ratio to CPU overhead? Security: What types of data access control do we need? Can we just use authentication controls within the database or do we need to integrate with our existing LDAP infrastructure? What type of authorization controls are available, and how granular can we get? Do these controls needs to extend down to the level of individual attributes within a document? Is encryption needed, and will those pesky compliance officers need to audit every action taken against the database? Administration: How are going to run this thing? Does the database provide tools to automate provisioning and upgrades or do we need to create our own scripts? How about backups? Can we get incremental backups. How about point in time backups? And then monitoring. We need to know, for example, if disk utilization is peaking above 60% so we can take action before we hit a problem. Can we add these alerts into our existing operational workflow tools? Can we integrate the database’s management platform into our own operational tooling so we don’t need to leave our single screen? Commercial Considerations Once the ACME architects have profiled their technology requirements, they will need to understand how the database is licensed and priced, before legal and procurement come knocking at the door: Licensing: what license is used, and is this acceptable to our legal team? Are commercial licenses available? Support: What support options are open to me? Can I get support SLAs from my vendor, even if I use a community version of their product? What is the SLA I can expect if I do hit an issue? What sort of training is available? Is my only option to send my engineers to public classes, or can we get trained on-demand, at our own pace? What’s Next? The ACME example is designed to illustrate some of the key questions engineering teams need to ask. It is true that the database landscape is more complex than ever. But it needn’t be bewildering – the Database Selection Matrix is designed to help you identify and compare what is most critical as you build your next app, so go ahead and download it now. Looking for additional information about database selection? Learn why organizations choose MongoDB to deliver applications and outcomes that were never previously possible. Download the white paper below: THE VALUE OF DATABASE SELECTION

February 9, 2015

by Francesca Krihely

· 11,370 Views · 1 Like

Algorithms Explained: Minesweeper

This blog post explains the essential algorithms for the well-known Windows game "Minesweeper." Game Rules The board is a two-dimensional space, which has a predetermined number of mines. Cells have two states, opened and closed. If you left-click on a closed cell: Cell is empty and opened. If neighbor cell(s) have mine(s), this opened cell shows neighbor mine count. If neighbor cells have no mines, all neighbor cells are opened automatically. Cell has a mine, game ends with FAIL. If you right-click on a closed cell, you put a flag which shows that "I know this cell has a mine". If you multi-click (both right and left click) on a cell which is opened and has at least one mine on its neighbors: If neighbor cells' total flag count equals to this multi-clicked cell's count and predicted mine locations are true, all closed and unflagged neighbor cells are opened automatically. If neighbor cells' total flag count equals to this multi-clicked cell's count and at least one predicted mine location is wrong, game ends with FAIL. If all cells (without mines) are opened using left clicks and/or multi-clicks, game ends with SUCCESS. Data Structures We may think of each cell as a UI structure (e.g. button), which has the following attributes: colCoord = 0 to colCount rowCoord = 0 to rowCount isOpened = true / false (default false) hasFlag = true / false (default false) hasMine = true / false (default false) neighborMineCount 0 to 8 (default 0, total count of mines on neighbor cells) So, we have a two-dimensional "Button[][] cells" data structure to handle game actions. Algorithms Before Start: Assign mines to cells randomly (set hasMine=true) . Calculate neighborMineCount values for each cell, which have hasMine=false. (This step may be done for each clicked cell while game continues but it may be inefficient.) Note 1: Neighbor cells should be accessed with the coordinates: {(colCoord-1, rowCoord-1),(colCoord-1, rowCoord),(colCoord-1, rowCoord+1),(colCoord, rowCoord-1),(colCoord, rowCoord+1),(colCoord+1, rowCoord-1),(colCoord+1, rowCoord),(colCoord+1, rowCoord+1)} And don't forget that neighbor cell counts may be 3 (for corner cells), 5 (for edge cells) or 8 (for middle cells). Note 2: It's recommended to handle mouse clicks with "mouse release" actions instead of "mouse pressed/click" actions, otherwise a left or right click may be understood as a multi-click or vice versa. Right Click on a Cell: If cell isOpened=true, do nothing. If cell isOpened=false, set cell hasFlag=true and show a flag on the cell. Left Click on a Cell: If cell isOpened=true, do nothing. If cell isOpened=false: If cell hasMine=true, game over. If cell hasMine=false: If cell has neighborMineCount > 0, set isOpened=true, show neighborMineCount on the cell. If all cells which hasMine=false are opened, end game with SUCCESS. If cell has neighborMineCount == 0, set isOpened=true, call Left Click on a Cell for all neighbor cells, which hasFlag=false and isOpened=false. Note: The last step may be implemented with a recursive call or by using a stacked data structure. Multi Click (both Left and Right Click) on a Cell: If cell isOpened=false, do nothing. If cell isOpened=true: If cell neighborMineCount == 0, no nothing. If cell neighborMineCount > 0: If cell neighborMineCount != "neighbor hasFlag=true cell count", do nothing. If cell neighborMineCount == "neighbor hasFlag=true cell count": If all neighbor hasFlag=true cells are not hasMine=true, game over. If all neighbor hasFlag=true cells are hasMine=true (every flag is put on correct cell), call Left Click on a Cell for all neighbor cells, which hasFlag=false and isOpened=false. Note: The last step may be implemented with a recursive call or by using a stacked data structure.

February 9, 2015

by Cagdas Basaraner

· 33,363 Views

NetBeans in the Classroom: MySQL JDBC Connection Pool & JDBC Resource for GlassFish

This tutorial assumes that you have installed the Java EE version of NetBeans 8.02. It is further assumed that you have replaced the default GlassFish server instance in NetBeans with a new instance with its own folder for the domain. This is required for all platforms. See my previous article “Creating a New Instance of GlassFish in NetBeans IDE” The other day I presented to my students the steps necessary to make GlassFish responsible for JDBC connections. As I went through the steps I realized that I needed to record the steps for my students to reference. Here then are these steps. Steps 1 through 4 are required, Step 5 is optional. Step 1a: Manually Adding the MySQL driver to the GlassFish Domain Before we even start NetBeans we must do a bit of preliminary work. With the exception of Derby, GlassFish does not include the MySQL driver or any other driver in its distribution. Go to the MySql Connector/J download site at http://dev.mysql.com/downloads/connector/j/ and download the latest version. I recommend downloading the Platform Independent version. If your OS is Windows download the ZIP archive otherwise download the TAR archive. You are looking for the driver file named mysql-connector-java-5.1.34-bin.jar in the archive. Copy the driver file to the lib folder in the directory where you placed your domain. On my system the folder is located at C:\Users\Ken\personal_domain\lib. If GlassFish is already running then you will have to restart it so that it picks up the new library. Step 1b: Automatically Adding the MySQL driver to the GlassFish Domain NetBeans has a feature that deploys the database driver to the domain’s lib folder if that driver is in NetBeans’ folder of drivers. On my Windows 8.1 system the MySQL driver can be found in C:\Program Files\NetBeans 8.0.2\ide\modules\ext. Start NetBeans and go to the Services tab, expand Servers and right mouse click on your GlassFish Server. Click on Properties and the Servers dialog will appear. On this dialog you will see a check box labelled Enable JDBC Driver Deployment. By default it is checked. NetBeans determines the driver to copy to GlassFish from the file glassfish-resources.xml that we will create in Step 4 of this tutorial. Without this file and if you have not copied the driver into GlassFish manually then GlassFish will not be able to connect to the database. Any code in your web application will not work and all you will likely see are blank pages. Step 1a or Step 1b? I recommend Step 1a and manually add the driver. The reason I prefer this approach is that I can be certain that the most recent driver is in use. As of this writing NetBeans contains version 5.1.23 of the connector but the current version is 5.1.34. If you copy a driver into the lib folder then NetBeans will not replace it with an older driver even if the check box on the Server dialog is checked. NetBeans does not replace a driver if one is already in place. If you need a driver that NetBeans does have a copy of then Step 1b is your only choice. Step 2: Create a Database Connection in NetBeans One feature I have always liked in NetBeans is that it has an interface for working with databases. All that is required is that you create a connection to the database. It also has additional features for managing a MySQL server but we won’t need those. If you have not already started your MySQL DBMS then do that now. I assume that the database you wish to connect to already exists. Go to the Services tab and right mouse click on New Connection. In the next dialog you must choose the database driver you wish to use. It defaults to Java DB (Embedded). Pull down the combobox labeled Driver: and select MySQL (Connector/J driver). Click on Next and you will now see the Customize Connection dialog. Here you can enter the details of the connection. On my system the server is localhost and the database name is Aquarium. Here is what my dialog looks like. Notice the Test Connection button. I have clicked on mine and so I have the message Connection Succeeded. Click on Next. There is nothing to do on this dialog so click on Next. On this last dialog you have the option of assigning a name to the connection. By default it uses the URL but I prefer a more meaningful name. I have used AquariumMySQL. Click on Finish and the connection will appear under Databases. If the icon next to AquariumMySQL has what looks like a crack in it similar to the jdbc:derby connection then this means that a connection to the database could not be made. Verify that the database is running and is accessible. If it is then delete the connection and start over. Having a connection to the database in NetBeans is invaluable. You can interact with the database directly and issue SQL commands. As a MySQL user this means that I do not need to run the MySQL command line program to interact with the database. Step 3: Create a Web Application Project in NetBeans If you have not already done so create a New Project in NetBeans. I require my students to create a New Project in the Maven category of a Web Application project. Click on Next. In this dialog you can give the project a name and a location in your file system. The Artifact Id, Group Id and Version are used by Maven. The final dialog lets you select the application server that your application will use and the version of Java EE that your code must be compliant with. Here is my project ready for the next step. Step 4: Create the GlassFish JDBC Resource For GlassFish to manage your database connection you need to set up two resources, a JDBC Connection Pool and a JDBC Resource. You can create both in one step by creating a GlassFish JDBC Resource because you can create the Connection Pool as part of the same operation. Right mouse click on the project name and select New and then Other … Scroll down the Categories list and select GlassFish. In the File Types list select JDBC Resource. Click on Next. The next dialog is the General Attributes. Click on the radio button for Create New JDBC Connection Pool. In the text field JNDI Name enter a name that is unique for the project. JNDI names for connection resources always begin with jdbc/ followed by a name that starts with a lower case letter. I have used jdbc/myAquarium. Do not prefix the name with java:app/ as some tutorials suggest. An upcoming article will explain why. Click on Next. There is nothing for us to enter on the Properties dialog. Click on Next. On the Choose Database Connection dialog we will give our connection pool a name and select the database connection we created in Step 2. Notice that in the list of available connections you are shown the connection URL and not the name you assigned to it back in Step 2. Click on Next. On the Add Connection Pool Properties dialog you will see the connection URL and the user name and password. We do need to make one change. The resource type shows javax.sql.DataSource and we must change it to javax.sql.ConnectionPoolDataSource. Click on Next. There is nothing we need to change on Add Connection Pool Optional Properties so click on Finish. A new folder has appeared in the Projects view named Other Sources. It contains a sub folder named setup. In this folder is the file glassfish-resources.xml. The glassfish-resources.xml file will contain the following. I have reformatted the file for easier viewing. OPTIONAL Step 5: Configure GlassFish with glassfish-resources.xml The glassfish-resources.xml file, when included in the application’s WAR file in the WEB-INF folder, can configure the resource and pool for the application when it is deployed in GlassFish. When the application is un-deployed the resource and pool are removed. If you want to set up the resource and pool permanently in GlassFish then follow these steps. Go to the Services tab and select Servers and then right mouse click on GlassFish. If GlassFish is not running then click on Start. With the server started click on View Domain Admin Console. Your web browser will now open and show you the GlassFish console. If you assigned a user name and password to the server you will have to enter this information before you see the console. In the Common Tasks tree select Resources. You should now see in the panel adjacent to the tree the following: Click on Add Resources. You should now see: In the Location click on Choose File and locate your glassfish-resources.xml file. Mine is found at D:\NetBeansProjects\GlassFishTutorial\src\main\setup. You should now see: Click on OK. If everything has gone well you should see: The final task in this step is to test if the connection works. In the Common Tasks tree select Resources, JDBC, JDBC Connection Pools and aquariumPool. Click on Ping. You should see: The most common reason for the Ping to fail is that the database driver is not in the domain’s lib folder. Go to Step 1a and manually add the driver. The resources are now visible in NetBeans. Having the resource and pool add to GlassFish permanently will allow other applications to share this same resource and pool. You are now ready to code!

February 9, 2015

by Ken Fogel

· 52,632 Views · 3 Likes

Microservices: Five Architectural Constraints

Microservices is a new software architecture and delivery paradigm, where applications are composed of several small runtime services. The current mainstream approach for software delivery is to build, integrate, and test entire applications as a monolith. This approach requires any software change, however small, to require a full test cycle of the entire application. With Microservices a software module is delivered as an independent runtime service with a well defined API. The Microservices approach allow faster delivery of smaller incremental changes to an application. There are several tradeoffs to consider with the Microservices architecture. On one hand, the Microservices approach builds on several best practices and patterns for software design, architecture, and DevOps style organization. On the other hand, Microservices requires expertise in distributed programming and can become an operational nightmare without proper tooling in place. There are several good posts that highlight the pros-and-cons of Microservices, and I have added in the references section. In the remainder of this post, I will define five architectural constraints (principles that drive desired properties) for the Microservices architectural style. To be a Microservice, a service must be: Elastic Resilient Composable Minimal, and; Complete Microservice Constraint #1 - Elastic A microservice must be able to scale, up or down, independently of other services in the same application. This constraint implies that based on load, or other factors, you can fine tune your applications performance, availability, and resource usage. This constraint can be realized in different ways, but a popular pattern is to architect the system so that you can run multiple stateless instances of each microservice, and there is a mechanism for Service naming, registration, and discovery along with routing and load-balancing of requests. Microservice Constraint #2 - Resilient A microservice must fail without impacting other services in the same application. A failure of a single service instance should have minimal impact on the application. A failure of all instances of a microservice, should only impact a single application function and users should be able to continue using the rest of the application without impact. Adrian Cockroft describes Microservices as loosely coupled service oriented architecture with bounded contexts [3]. To be resilient a service has to be loosely coupled with other services, and a bounded context limits a service’s failure domain. Microservice Constraint #3 - Composable A microservice must offer an interface that is uniform and is designed to support service composition. Microservice APIs should be designed with a common way of identifying, representing, and manipulating resources, describing the API schema and supported API operations. The ‘Uniform Interfaces constraint of the REST architectural style describes this in detail. Service Composition is a SOA principle that has fairly obvious benefits, but few guidelines on how it can be achieved. A Microservice interface should be designed to support composition patterns like aggregation, linking, and higher-level functions such as caching, proxies and gateways. I previously discussed REST constraints and elements in as two part blog post: REST is not about APIs Microservice Constraint #4 - Minimal A microservice must only contain highly cohesive entities In software, cohesion is a measure of whether things belong together. A module is said to have high cohesion if all objects and functions in it are focused on the same tasks. Higher cohesion leads to more maintainable software. A Microservice should perform a single business function, which implies that all of its components are highly cohesive. This is also an Single Responsibility Principle (SRP) of object-oriented design [5] Microservice Constraint #5 - Complete A microservice must be functionally complete Bjarne Stroustrup, the creator of C++, stated that a good interface must be, “minimal but complete” i.e. as small as possible, and no smaller. Similarly, a Microservice must offer a complete function, with minimal dependencies (loose coupling) to other services in the application. This is important, as otherwise its becomes impossible to version and upgrade individual services. This constraint is designed to oppose the minimal constraint. Put together a microservice must be “minimal but complete.” Conclusions Designing a Microservices application requires application of several principles, patterns, and best practices of modular design and service-oriented architectures. In this post, I've outlined five architectural constraints which can help guide and retain the key benefits of a Microservices-style architecture. For example, Microservices Constraint# 1 - Elastic steers implementations towards separating the data tier from the application tier, and leads to stateless services. At Nirmata we have built our solution, that makes it easy to deploy and operate microservices applications, using these very same principles. We believe that Microservices style applications, running in containers, will power the next generation of software innovation. If you are using, or interested in using microservices, I would love to hear from you. Jim Bugwadia Founder and CEO Nirmata -- For additional content and articles follow us at @NirmataCloud. -- If you are in the San Francisco Bay Area, come join our Microservices meetup group. References [1] Microservices, Martin Fowler and James Lewis, http://martinfowler.com/articles/microservices.html [2] Microservices Are Not a free lunch!, Benjamin Wootton, http://contino.co.uk/microservices-not-a-free-lunch/ [3] State of the Art in Microservices, Adrian Cockroft, http://thenewstack.io/dockercon-europe-adrian-cockcroft-on-the-state-of-microservices/ [4] The Principles of Object-Oriented Design, Robert C. Martin, http://butunclebob.com/ArticleS.UncleBob.PrinciplesOfOod

February 5, 2015

by Jim Bugwadia

· 13,234 Views · 7 Likes

Dropwizard vs Spring Boot—A Comparison Matrix

Of late, I have been looking into Microservice containers that are available out there to help speed up the development. Although, Microservice is a generic term however there is some consensus with respect to what it means. Hence, we may conveniently refer to the definition Microservice as an "architectural design pattern, in which complex applications are composed of small, independent processes communicating with each other using language-agnostic APIs. These services are small, highly decoupled and focus on doing a small task." There are several Microservice containers out there. However, in my experience I have found Dropwizard and Spring-boot to have had received more attention and they appear to be widely used compared to the rest. In my current role, I was asked create a comparison matrix between the two, so it's here below. Dropwizard Spring-Boot What is it? Dropwizard pulls together stable, mature libraries from the Java ecosystem into a simple, light-weight package that lets you focus on getting things done. [more...] Takes an opinionated view of building production-ready Spring applications. Spring Boot favours convention over configuration and is designed to get you up and running as quickly as possible. [more...] Overview? Dropwizard straddles the line between being a library and a framework. Provide performant, reliable implementations of everything a production-ready web application needs. [more...] Spring-boot takes an opinionated view of the Spring platform and third-party libraries so you can get started with minimum fuss. Most Spring Boot applications need very little Spring configuration. [more...] Out of the box features? Dropwizard has out-of-the-box support for sophisticated configuration, application metrics, logging, operational tools, and much more, allowing you and your team to ship a production-quality web service in the shortest time possible. [more...] Spring-boot provides a range of non-functional features that are common to large classes of projects (e.g. embedded servers, security, metrics, health checks, externalized configuration). [more...] Libraries Core: Jetty, Jersey, Jackson and Matrics Others: Guava, Liquibase and Joda Time. Spring, JUnit, Logback, Guava. There are several starter POM files covering various use cases, which can be included in the POM to get started. Dependency Injection? No built in Dependency Injection. Requires a 3rd party dependency injection framework such as Guice, CDI or Dagger. [Ref...] Built in Dependency Injection provided by Spring Dependency Injection container. [Ref...] Types of Services i.e. REST, SOAP Has some support for other types of services but primarily is designed for performant HTTP/REST LAYER. If ever need to integrate SOAP, there is a dropwizard bundle for building SOAP web services using JAX-WS API is provided here but it’s not official drop-wizard sub project. [more...] As well as supporting REST Spring-boot has support for other types of services such as JMS, Advanced Message Queuing Protocol, SOAP based Web Services to name a few. [more...] Deployment? How it creates the Executable Jar? Uses Shading to build executable fat jars, where a shaded jar spackages all classes, from all jars, into a single 'uber jar'. [Ref...] Spring-boot adopts a different approach and avoids shaded jars, as it becomes hard to see which libraries you are actually using in your application. It can also be problematic if the same filename is used in Shaded jars. Instead it uses “Nested Jar” approach where all classes from all jars do not need to be included into a single “uber jar” instead all dependent jars should be in the “lib” folder, spring loader loads them appropriately. [Ref...] Contract First Web Services? No built in support. Would have to refer to 3rd party library (CXF or any other JAX-WS implementation) if needed a solution for the Contract First SOAP based services. Contract First services support is available with the help of spring-boot-starter-ws starter application. [Ref...] Externalised Configuration for properties and YAML Supports both Properties and YAML Supports both Properties and YAML Concluding Remarks If dealing with only REST micro services, drop wizard is an excellent choice. Where Spring-boot shines is the types of services supported i.e. REST, JMS, Messaging, and Contract First Services. Not least a fully built in Dependency Injection container. Disclaimer: The matrix is purely based on my personal views and experiences, having tried both frameworks and is by no means an exhaustive guide. Readers are requested to do their own research before making a strategic decision between the two very formidable frameworks.

February 2, 2015

by Rizwan Ullah

· 74,057 Views · 9 Likes