Data Resources

The Latest Data Topics

We had a lengthy discussion recently during code review whether scala.Option.fold() is idiomatic and clever or maybe unreadable and tricky? Let's first describe what the problem is. Option.fold does two things: maps a function f overOption's value (if any) or returns an alternative alt if it's absent. Using simple pattern matching we can implement it as follows: val option: Option[T] = //... def alt: R = //... def f(in: T): R = //... val x: R = option match { case Some(v) => f(v) case None => alt } If you prefer one-liner, fold is actually a combination of map and getOrElse val x: R = option map f getOrElse alt Or, if you are a C programmer that still wants to write in C, but using Scala compiler: val x: R = if (option.isDefined) f(option.get) else alt Interestingly this is similar to how fold() is actually implemented, but that's an implementation detail. OK, all of the above can be replaced with single Option.fold(): val x: R = option.fold(alt)(f) Technically you can even use /: and \: operators (alt /: option) - but that would be simply masochistic. I have three problems with option.fold() idiom. First of all - it's anything but readable. We are folding (reducing) over Option - which doesn't really make much sense. Secondly it reverses the ordinary positive-then-negative-case flow by starting with failure (absence, alt) condition followed by presence block (f function; see also: Refactoring map-getOrElse to fold). Interestingly this method would work great for me if it was named mapOrElse: ** * Hypothetical in Option */ def mapOrElse[B](f: A => B, alt: => B): B = this map f getOrElse alt Actually there is already such method in Scalaz, called OptionW.cata. cata. Here is whatMartin Odersky has to say about it: "I personally find methods like cata that take two closures as arguments are often overdoing it. Do you really gain in readability over map + getOrElse? Think of a newcomer to your code[...]"While cata has some theoretical background, Option.fold just sounds like a random name collision that doesn't bring anything to the table, apart from confusion. I know what you'll say, that TraversableOnce has fold and we are sort-of doing the same thing. Why it's a random collision rather than extending the contract described inTraversableOnce? fold() method in Scala collections typically just delegates to one offoldLeft()/foldRight() (the one that works better for given data structure), thus it doesn't guarantee order and folding function has to be associative. But inOption.fold() the contract is different: folding function takes just one parameter rather than two. If you read my previous article about folds you know that reducing function always takes two parameters: current element and accumulated value (initial value during first iteration). But Option.fold() takes just one parameter: current Option value! This breaks the consistency, especially when realizing Option.foldLeft() andOption.foldRight() have correct contract (but it doesn't mean they are more readable). The only way to understand folding over option is to imagine Option as a sequence with0 or 1 elements. Then it sort of makes sense, right? No. def double(x: Int) = x * 2 Some(21).fold(-1)(double) //OK: 42 None.fold(-1)(double) //OK: -1 but: Some(21).toList.fold(-1)(double) : error: type mismatch; found : Int => Int required: (Int, Int) => Int Some(21).toList.fold(-1)(double) ^ If we treat Option[T] as a List[T], awkward Option.fold() breaks because it has different type than TraversableOnce.fold(). This is my biggest concern. I can't understand why folding wasn't defined in terms of the type system (trait?) and implemented strictly. As an example take a look at: Data.Foldable in Haskell (advanced) Data.Foldable typeclass describes various flavours of folding in Haskell. There are familiar foldl/foldr/foldl1/foldr1, in Scala namedfoldLeft/foldRight/reduceLeft/reduceRight accordingly. They have the same type as Scala and behave unsurprisingly with all types that you can fold over, including Maybe, lists, arrays, etc. There is also a function named fold, but it has a completely different meaning: class Foldable t where fold :: Monoid m => t m -> m While other folds are quite complex, this one barely takes a foldable container of ms (which have to be Monoids) and returns the same Monoid type. A quick recap: a type can be aMonoid if there exists a neutral value of that type and an operation that takes two values and produces just one. Applying that function with one of the arguments being neutral value yields the other argument. String ([Char]) is a good example with empty string being neutral value (mempty) and string concatenation being such operation (mappend). Notice that there are two different ways you can construct monoids for numbers: under addition with neutral value being 0 (x + 0 == 0 + x == x for any x) and under multiplication with neutral 1 (x * 1 == 1 * x == x for any x). Let's stick to strings. If I fold empty list of strings, I'll get an empty string. But when a list contains many elements, they are being concatenated: > fold ([] :: [String]) "" > fold [] :: String "" > fold ["foo", "bar"] "foobar" In the first example we have to explicitly say what is the type of empty list []. Otherwise Haskell compiler can't figure out what is the type of elements in a list, thus which monoid instance to choose. In second example we declare that whatever is returned from fold [], it should be a String. From that the compiler infers that [] actually must have a type of [String]. Last fold is the simplest: the program folds over elements in list and concatenates them because concatenation is the operation defined in Monoid Stringtypeclass instance. Back to options (or more precisely Maybe). Folding over Maybe monad having type parameter being Monoid (I can't believe I just said it) has an interesting interpretation: it either returns value inside Maybe or a default Monoid value: > fold (Just "abc") "abc" > fold Nothing :: String "" Just "abc" is same as Some("abc") in Scala. You can see here that if Maybe Stringis Nothing, neutral String monoid value is returned, that is an empty string. Summary Haskell shows that folding (also over Maybe) can be at least consistent. In ScalaOption.fold is unrelated to List.fold, confusing and unreadable. I advise avoiding it and staying with slightly more verbose map/getOrElse transformations or pattern matching. PS: Did I mention there is also Either.fold() (with even different contract) but noTry.fold()?

June 26, 2014

by Tomasz Nurkiewicz

· 9,617 Views

Internet of Things (IoT) Reference Architecture

to converge internet of thing devices with corporate it solutions, teams require a reference architecture for the internet of things (iot). the reference architecture must include devices, server-side capabilities, and cloud architecture required to interact with and manage the devices. a reference architecture should provide architects and developers of iot projects with an effective starting point that addresses major iot project and system requirements. a high-level iot reference architecture may include the following layers (see figure 1): external communications - web/portal, dashboard, apis event processing and analytics (including data storage) aggregation / bus layer – esb and message broker device communications devices cross-‐cutting layers include: device and application management identity and access management a more detailed architecture component description can be found in the iot reference architecture white paper .

June 18, 2014

by Chris Haddad

· 17,765 Views

Synchronizing Data Across Mule Applications with MongoDB

One of the many things that Mule does extremely well is transparently store state between messages. Take the idempotent-message-filter for example; The idempotent-message-filter keeps track of what messages have been processed previously to ensure no duplicate messages are processed. This is all achieved with one single xml element. Or take OAuth enabled cloud connectors that need to keep track of access tokens so you're not redirected to some authorize page each time. All this is handled behind the scenes without you having to worry about it. Mule’s ability to store this state is implemented via object-stores. Object-stores provide a simple generic way to store data in Mule. Many message processors such as the idempotent-meesage-filter, OAuth message processors, until-successful routers and many more all transparently use object-stores behind the scenes to store state between messages. By default, Mule uses in-memory object-stores behind the scenes. Things get more interesting however, when your Mule application is distributed across multiple Mule nodes. At some point you are going to need to synchronise object-stores across multiple applications. If you are fortunate to be using the Enterprise Edition of Mule you can take advantage of its out of the box clustering support for synchronizing state across an entire cluster. But even then as mentioned in this blog post by John Demic: Synchronizing Mule Applications Across Data Centers with Apache Cassandra, this may not be ideal if your Mule nodes are geographically distributed across multiple data centres. In the aforementioned post, John gives a great example of how to use Apache Cassandra as a shared object-store. But many other traditonal and NoSql data stores can be used as well. In this blog post I'm going to show a simple example of how you can achieve the same using MongoDB. If it's good enough for the Hipster Hacker then it's good enough for me! OMG! Does this officially make @MuleSoft Mule part of the Hipster Stack? pic.twitter.com/OwM4dzhGvi — Ryan Carter (@rydcarter) October 9, 2013 Configuring the MongoDB object-store MongoDB is an open-source document database, and the leading NoSQL database. The MongoDB module can be used to read/write/query and perform a whole load of other operation a MongoDB database as part of your Mule flows. It also exposes an additional submodule for using your MondogDB database as an object-store implementation in Mule. Here is a sample configuration from one of the tests: As you can see, it's classed as it's own module with it's own namespace and is configured simply using the mos:config name="mos" database="mongo-connector-test" element. However, this doesn't work for the current version of the connector. Due to the following bug. You can simply apply the fix in the aforementioned pull request, or you can instantiate the object store yourself using Spring - similar to that in the Apache Cassandra configuration. The object-store class can be found here: MongoObjectStore.java. As you can see, it has various annotations for default values and initialising the class. However, as we are not using this as a fully fledged Mule Devkit module, those annotations have no effect. So we have to pass in the default values ourselves and call the initialise method when instantiating our object-store. This is quite simple with Spring. Here is a full working example using the Spring instantiated object-store as the store for the idempotent-message-filter: If you run this configuration, Mule will poll every 10 seconds with the exact same message: "1". The first time it polls you should see "Passed Filter" in the log output. Subsequent polls should not log anything as the idempotent message filter will filter them out. This is usually performed with the default in-memory object-store. To prove this is persisted you should be able to check the value in the Mongo database collection: db.getCollection('mule.objectstore._default').find(). Also you can stop your Mule app and restart it to prove that it works between restarts and failovers etc.

June 17, 2014

by Ryan Carter

· 10,060 Views

Anomaly Detection : A Survey

this post is summary of the “anomaly detection : a survey”. anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior. these non-conforming patterns are often referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities or contaminants in different application domains. anomalies are patterns in data that do not conform to a well defined notion of normal behavior. interesting to analyze unwanted noise in the data also can be found in there. novelty detection which aims at detecting previously unobserved (emergent, novel) patterns in the data challenges for anomaly detection drawing the boundary between normal and anomalous behavior availability of labeled data noisy data type of anomaly anomalies can be classified into following three categories point anomalies - an individual data instance can be considered as anomalous with respect to the rest of data contextual anomalies - a data instance is anomalous in a specific context (but not otherwise), then it is termed as a contextual anomaly (also referred as conditional anomaly). each data instance is defined using following two sets of attributes contextual attributes. the contextual attributes are used to determine the context (or neighborhood) for that instance eg: in time- series data, time is a contextual attribute which determines the position of an instance on the entire sequence behavioral attributes. the behavioral attributes define the non-contextual characteristics of an instance eg: in a spatial data set describing the average rainfall of the entire world, the amount of rainfall at any location is a behavioral attribute to explain this we will look into "exchange rate history for converting united states dollar (usd) to sri lankan rupee (lkr)"[1] contextual anomaly t2 in a exchange rate time series. note that the exchange rate at time t1 is same as that at time t2 but occurs in a different context and hence is not considered as an anomaly 3. collective anomalies - a collection of related data instances is anomalous with respect to the entire data set data labels the labels associated with a data instance denote if that instance is normal or anomalous. depending labels availability, anomaly detection techniques can be operated in one of the following three modes supervised anomaly detection - techniques trained in supervised mode assume the availability of a training data set which has labeled instances for normal as well as anomaly class semi-supervised anomaly detection - techniques that operate in a semi-supervised mode, assume that the training data has labeled instances for only the normal class. since they do not require labels for the anomaly class unsupervised anomaly detection - techniques that operate in unsupervised mode do not require training data, and thus are most widely applicable. the techniques implicit assume that normal instances are far more frequent than anomalies in the test data. if this assumption is not true then such techniques suffer from high false alarm rate output of anomaly detection anomaly detection have two types of output techniques scores. scoring techniques assign an anomaly score to each instance in the test data depending on the degree to which that instance is considered an anomaly labels. techniques in this category assign a label (normal or anomalous) to each test instance applications of anomaly detection intrusion detection intrusion detection refers to detection of malicious activity. the key challenge for anomaly detection in this domain is the huge volume of data. thus, semi-supervised and unsupervised anomaly detection techniques are preferred in this domain.denning[3] classifies intrusion detection systems into host based and net- work based intrusion detection systems. host based intrusion detection systems - this deals with operating system call traces network intrusion detection systems - these systems deal with detecting intrusions in network data. the intrusions typically occur as anomalous patterns (point anomalies) though certain techniques model[4] the data in a sequential fashion and detect anomalous subsequences (collective anomalies). a challenge faced by anomaly detection techniques in this domain is that the nature of anomalies keeps changing over time as the intruders adapt their network attacks to evade the existing intrusion detection solutions. fraud detection fraud detection refers to detection of criminal activities occurring in commercial organizations such as banks, credit card companies, insurance agencies, cell phone companies, stock market, etc. the organizations are interested in immediate detection of such frauds to prevent economic losses. detection techniques used for credit card fraud and network intrusion detection as below. statistical profiling using histograms parametric statisti- cal modeling non-parametric sta- tistical modeling bayesian networks neural networks support vector ma- chines rule-based clustering based nearest neighbor based spectral information theoretic here are some domain in fraud detections credit card fraud detection mobile phone fraud detection insurance claim fraud detection insider trading detection medical and public health anomaly detection anomaly detection in the medical and public health domains typically work with pa- tient records. the data can have anomalies due to several reasons such as abnormal patient condition or instrumentation errors or recording errors. thus the anomaly detection is a very critical problem in this domain and requires high degree of accuracy. industrial damage detection such damages need to be detected early to prevent further escalation and losses. fault detection in mechanical units structural defect detection image processing anomaly detection techniques dealing with images are either interested in any changes in an image over time (motion detection) or in regions which appear ab- normal on the static image. this domain includes satellite imagery. anomaly detection in text data anomaly detection techniques in this domain primarily detect novel topics or events or news stories in a collection of documents or news articles. the anomalies are caused due to a new interesting event or an anomalous topic. sensor networks since the sensor data collected from various wireless sensors has several unique characteristics. references [1] http://themoneyconverter.com/usd/lkr.aspx [2] varun chandola, arindam banerjee, and vipin kumar. 2009. anomaly detection: a survey. acm comput. surv. 41, 3, article 15 (july 2009), 58 pages. doi=10.1145/1541880.1541882 http://doi.acm.org/10.1145/1541880.1541882 [3] denning, d. e. 1987. an intrusion detection model. ieee transactions of software engineer-ing 13, 2, 222–232. [4]gwadera, r., atallah, m. j., and szpankowski, w. 2004. detection of significant sets of episodes in event sequences. in proceedings of the fourth ieee international conference on data mining. ieee computer society, washington, dc, usa, 3–10.

June 16, 2014

by Madhuka Udantha

· 12,887 Views

Querying XML CLOB Data Directly in Oracle

Oracle 9i introduced a new datatype - XMLType - specifically designed for handling XML naively in the database. However, we may find we have a database that stores XML directly in a CLOB/NCLOB datatype (either because of legacy data, or because we are supporting multiple database platforms). It can be useful to query this data directly using the XMLType functions available. This post just shows a very simple example of a query to do this. (Note: there may be better ways to do this - but this worked for me when I needed an ad-hoc query quickly!) Our Table Assume we have a table called USER, which has the following columns Our embedded XML (example for Ringo) Our Query Say we want to look at account details in the XML - e.g. find which users all have the same sort code. A query like the following: SELECT ID, NAME XMLTYPE(u.accountxml).EXTRACT('/person/account/sortCode/text()') as SORTCODE FROM USER u; Will produce output like this. Obviously from here you can use all the standard SQL operators to filter, join etc further to get what you need:

June 11, 2014

by Adrian Milne

· 45,606 Views · 1 Like

What's Wrong in Java 8, Part V: Tuples

In part 5 of our "What's Wrong in Java 8" series, we turn to Tuples.

June 10, 2014

by Pierre-Yves Saumont

· 132,780 Views · 4 Likes

An Architecturally-Evident Coding Style

okay, this is the separate blog post that i referred to in software architecture vs code . what exactly do we mean by an "architecturally-evident coding style"? i built a simple content aggregator for the local tech community here in jersey called techtribes.je , which is basically made up of a web server, a couple of databases and a standalone java application that is responsible for actually aggegrating the content displayed on the website. you can read a little more about the software architecture at techtribes.je - containers . the following diagram is a zoom-in of the standalone content updater application, showing how it's been decomposed. this diagram says that the content updater application is made up of a number of core components (which are shown on a separate diagram for brevity) and an additional four components - a scheduled content updater, a twitter connector, a github connector and a news feed connector. this diagram shows a really nice, simple architecture view of how my standalone content updater application has been decomposed into a small number of components. "component" is a hugely overloaded term in the software development industry, but essentially all i'm referring to is a collection of related behaviour sitting behind a nice clean interface. back to the "architecturally-evident coding style" and the basic premise is that the code should reflect the architecture. in other words, if i look at the code, i should be able to clearly identify each of the components that i've shown on the diagram. since the code for techtribes.je is open source and on github, you can go and take a look for yourself as to whether this is the case. and it is ... there's a je.techtribes.component package that contains sub-packages for each of the components shown on the diagram. from a technical perspective, each of these are simply spring beans with a public interface and a package-protected implementation. that's it; the code reflects the architecture as illustrated on the diagram. so what about those core components then? well, here's a diagram showing those. again, this diagram shows a nice simple decomposition of the core of my techtribes.je system into coarse-grained components. and again, browsing the source code will reveal the same one-to-one mapping between boxes on the diagram and packages in the code. this requires conscious effort to do but i like the simple and explicit nature of the relationship between the architecture and the code. when architecture and code don't match the interesting part of this story is that while i'd always viewed my system as a collection of "components", the code didn't actually look like that. to take an example, there's a tweet component on the core components diagram, which basically provides crud access to tweets in a mongodb database. the diagram suggests that it's a single black box component, but my initial implementation was very different. the following diagram illustrates why. my initial implementation of the tweet component looked like the picture on the left - i'd taken a "package by layer" approach and broken my tweet component down into a separate service and data access object. this is your stereotypical layered architecture that many (most?) books and tutorials present as a way to build (e.g.) web applications. it's also pretty much how i've built most software in the past too and i'm sure you've seen the same, especially in systems that use a dependency injection framework where we create a bunch of things in layers and wire them all together. layered architectures have a number of benefits but they aren't a silver bullet . this is a great example of where the code doesn't quite reflect the architecture - the tweet component is a single box on an architecture diagram but implemented as a collection of classes across a layered architecture when you look at the code. imagine having a large, complex codebase where the architecture diagrams tell a different story from the code. the easy way to fix this is to simply redraw the core components diagram to show that it's really a layered architecture made up of services collaborating with data access objects. the result is a much more complex diagram but it also feels like that diagram is starting to show too much detail. the other option is to change the code to match my architectural vision. and that's what i did. i reorganised the code to be packaged by component rather than packaged by layer. in essence, i merged the services and data access objects together into a single package so that i was left with a public interface and a package protected implementation. here's the tweet component on github . but what about... again, there's a clean simple mapping from the diagram into the code and the code cleanly reflects the architecture. it does raise a number of interesting questions though. why aren't you using a layered architecture? where did the tweetdao interface go? how do you mock out your dao implementation to do unit testing? what happens if i want to call the dao directly? what happens if you want to change the way that you store tweets? layers are now an implementation detail this is still a layered architecture, it's just that the layers are now a component implementation detail rather than being first-class architectural building blocks. and that's nice, because i can think about my components as being my architecturally significant structural elements and it's these building blocks that are defined in my dependency injection framework. something i often see in layered architectures is code bypassing a services layer to directly access a dao or repository. these sort of shortcuts are exactly why layered architectures often become corrupted and turn into big balls of mud. in my codebase, if any consumer wants access to tweets, they are forced to use the tweet component in its entirety because the dao is an internal implementation detail. and because i have layers inside my component, i can still switch out my tweet data storage from mongodb to something else. that change is still isolated. component testing vs unit testing ah, unit testing. bundling up my tweet service and dao into a single component makes the resulting tweet component harder to unit test because everything is package protected. sure, it's not impossible to provide a mock implementation of the mongodbtweetdao but i need to jump through some hoops. the other approach is to simply not do unit testing and instead test my tweet component through its public interface. dhh recently published a blog post called test-induced design damage and i agree with the overall message; perhaps we are breaking up our systems unnecessarily just in order to unit test them. there's very little to be gained from unit testing the various sub-parts of my tweet component in isolation, so in this case i've opted to do automated component testing instead where i test the component as a black-box through its component interface. mongodb is lightweight and fast, with the resulting component tests running acceptably quick for me, even on my ageing macbook air. i'm not saying that you should never unit test code in isolation, and indeed there are some situations where component testing isn't feasible. for example, if you're using asynchronous and/or third party services, you probably do want to ability to provide a mock implementation for unit testing. the point is that we shouldn't blindly create designs where everything can be mocked out and unit tested in isolation. food for thought the purpose of this blog post was to provide some more detail around how to ensure that code reflects architecture and to illustrate an approach to do this. i like the structure imposed by forcing my codebase to reflect the architecture. it requires some discipline and thinking about how to neatly carve-up the responsibilities across the codebase, but i think the effort is rewarded. it's also a nice stepping stone towards micro-services. my techtribes.je system is constructed from a number of in-process components that i treat as my architectural building blocks. the thinking behind creating a micro-services architecture is essentially the same, albeit the components (services) are running out-of-process. this isn't a silver bullet by any means, but i hope it's provided some food for thought around designing software and structuring a codebase with an architecturally-evident coding style.

June 9, 2014

by Simon Brown

· 6,253 Views

MapDB: The Agile Java Data Engine

MapDB is a pure Java database, specifically designed for the Java developer. The fundamental concept for MapDB is very clever yet natural to use: provide a reliable, full-featured and “tune-able” database engine using the Java Collections API. MapDB 1.0 has just been released, this is the culmination of years of research and development to get the project to this point. Jan Kotek, the primary developer for MapDB, also worked on predecessor projects (JDBM), starting MapDB as an entire from-scratch rewrite. Jan’s expertise and dedication to low-level debugging has yielded excellent results, producting an easy-to-use database for Java with comparable performance to many C-based engines. What sets MapDB apart is the “map” concept. The idea is to leverage the totally natural Java Collections API – so familiar to Java developers that most of them literally use it daily in their work. For most database interactions with a Java application, some sort of translator is required. There are many Object-Relational Mapping (ORM) tools to name just one category of such components. The goal has always been in the direction of making it natural to code in objects in the Java language, and translate them to a specific database syntax (such as SQL). However, such efforts have always come up short, adding complexity for both the application developer and the data architect. When using MapDB there is no object “translation layer” – developers just access data in familiar structures like Maps, Sets, Queues, etc. There is no change in syntax from typical Java coding, other than a brief initialization syntax and transaction management. A developer can literally transform memory-limited maps into a high-speed persistent store in seconds (typically changing just one line of code). A MapDB Example Here is a simple MapDB example, showing how easy and intuitive it is to use in a Java application: // Initialize a MapDB database DB db = DBMaker.newFileDB(new File("testdb")) .closeOnJvmShutdown() .make(); // Create a Map: Map myMap = db.getTreeMap(“testmap”); // Work with the Map using the normal Map API. myMap.put(“key1”, “value1”); myMap.put(“key2”, “value2”); String value = myMap.get(“key1”); ... That’s all you need to do, now you have a file-backed Map of virtually any size. Note the “builder-style” initialization syntax, enabling MapDB as the agile database choice for Java. There are many builder options that let you tune your database for the specific requirements at hand. Just a small subset of options include: In-memory implementation Enable transactions Configurable caching This means that you can configure your database just for what you need, effectively making MapDB serve the job of many other databases. MapDB comes with a set of powerful configuration options, and you can even extend the product to make your own data implementations if necessary. Another very powerful feature is that MapDB utilizes some of the advanced Java Collections variants, such as ConcurrentNavigableMap. With this type of Map you can go beyond simple key-value semantics, as it is also a sorted Map allowing you to access data in order, and find values near a key. Not many people are aware of this extension to the Collections API, but it is extremely powerful and allows you to do a lot with your MapDB database (I will cover more of these capabilities in a future article). The Agile Aspect of MapDB When I first met Jan and started talking with him about MapDB he said something that made a very important impression: If you know what data structure you want, MapDB allows you to tailor the structure and database characteristics to your exact application needs. In other words, the schema and ways you can structure your data is very flexible. The configuration of the physical data store is just as flexible, making a perfect combination for meeting almost any database need. They key to this capability is inherent in MapDB’s architecture, and how it translates to the MapDB API itself. Here is a simple diagram of the MapDB architecture: As you can see from the diagram, there are 3 tiers in MapDB: Collections API: This is the familiar Java Collections API that every Java developer uses for maintaining application state. It has a simple builder-style extension to allow you to control the exact characteristics of a given database (including its internal format or record structure). Engine: The Engine is the real key to MapDB, this is where the records for a database – including their internal structure, concurrency control, transactional semantics – are controlled. MapDB ships with several engines already, and it is straightforward to add your own Engine if needed for specialized data handling. Volume: This is the physical storage layer (e.g., on-disk or in-memory). MapDB has a few standard Volume implementations, and they should suffice for most projects. The main point is that the development API is completely distinct from the Engine implementation (the heart of MapDB), and both are separate from the actual physical storage layer. This offers a very agile approach, allowing developers to exactly control what type of internal structure is needed for a given database, and what the actual data structure looks like from the top-level Collections API. To make things even more extensible and agile, MapDB uses a concept of Engine Wrappers. An Engine Wrapper allows adding additional features and options on top of a specific engine layer. For example, if the standard Map engine is utilized for creating a B-Tree backed Map, it is feasible to enable (or disable) caching support. This caching feature is done through an Engine Wrapper, and that is what shows up in the builder-style API used to configure a given database. While a whole article could be written just about this, the point here is that this adds to MapDB’s inherent agile nature. By way of example, here is how you configure a pure in-memory database, without transactional capabilities: // Initialize an in-memory MapDB database // without transactions DB db = DBMaker.newMemoryDB() .transactionDisable() .closeOnJvmShutdown() .make(); // Create a Map: Map myMap = db.getTreeMap(“testmap”); // Work with the Map using the normal Map API. myMap.put(“key1”, “value1”); myMap.put(“key2”, “value2”); String value = myMap.get(“key1”); ... That’s it! All that was needed was to change the DBMaker call to add the new options, everything else works exactly the same as in the example shown earlier. Agile Data Model In addition to customizing the features and performance characteristics of a given database instance, MapDB allows you to create an agile data model, with a schema exactly matching your application requirements. This is probably similar to how you write your code when creating standard Java in-memory structures. For example, let’s say you need to lookup a Person object by username, or by personID. Simply create a Person object and two Maps to meet your needs: public class Person { private Integer personID; private String username; ... // Setters and getters go here ... } // Create a Map of Person by username. Map personByUsernameMap = ... // Create a Map of Person by personID. Map personByPersonIDMap = ... This is a very trivial example, but now you can easily write to both maps for each new Person instance, and subsequently retrieve a Person by either key. Another interesting concept with MapDB data structures are some key extensions to the normal Java Collections API. A common requirement in applications is to have a Map with a key/value, and in addition to finding the value for a key to be able to perform the inverse: lookup the key for a given value. We can easily do this using the MapDB extension for bi-directional maps: // Create a primary map HTreeMap map = DBMaker.newTempHashMap(); // Create the inverse mapping for primary map NavigableSet> inverseMapping = new TreeSet>(); // Bind the inverse mapping to primary map, so it is auto-updated each time the primary map gets a new key/value Bind.mapInverse(map, inverseMapping); map.put(10L,"value2"); map.put(1111L,"value"); map.put(1112L,"value"); map.put(11L,"val"); // Now find a key by a given value. Long keyValue = Fun.filter(inverseMapping.get(“value2”); MapDB supports many constructs for the interaction of Maps or other collections, allowing you to create a schema of related structures that can automatically be kept in sync. This avoids a lot of scanning of structures, makes coding fast and convenient, and can keep things very fast. Wrapping it up I have shown a very brief introduction on MapDB and how the product works. As you can see its strengths are its use of the natural Java Collections API, the agile nature of the engine itself, and the support for virtually any type of data model or schema that your application needs. MapDB is freely available for any use under the Apache 2.0 license. To learn more, check out: www.mapdb.org.

June 5, 2014

by Cory Isaacson

· 28,611 Views · 3 Likes

You Never Really Learn Something Until You Teach It

as software developers we spend a large amount of time learning. there is always a new framework or technology to learn. it can seem impossible to keep up with everything when there is something new every day. so, it should be no surprise to you that learning quickly and gaining a deeper understanding of what you learn is very important. and that is exactly why–if you are not doing so already– you need to incorporate teaching into your learning. why teaching is such an effective learning tool when we learn something, most of us learn it in bits and pieces. typically, if you read a book, you’ll find the material in that book organized in a sensible way. the same goes for others mediums like video or online courses. but, unfortunately, the material doesn’t go into your head in the same way. what happens instead is that you absorb information in jumbled bits and pieces. you learn something, but don’t completely “get it” until you learn something else later on. the earlier topic becomes more clear, but the way that data is structured in your mind is not very well organized–regardless of how organized the source of that information was. even now, as i write this blog post, i am struggling with taking the jumbled mess of information i have in my head about how teaching helps you learn and figuring out how to present it in an organized way. i know what i want to say, but i don’t yet know how to say it. only the process of putting my thoughts on paper will force me to reorganize them; to sort them out and make sense of them. when you try to teach something to someone else, you have to go through this process in your own mind. you have to take that mess of data, sort it out, repackage it and organize it in a way that someone else can understand. this process forces you to reorganize the way that data is stored in your own head. also, as part of this process, you’ll inevitably find gaps in your own understanding of whatever subject you are trying to teach. when we learn something we have a tendency to gloss over many things we think we understand. you might be able to solve a math problem in a mechanical way, and the steps you use to solve the math problem might be sufficient for what you are trying to do, but just knowing how to solve a problem doesn’t mean you understand how to solve a problem. knowledge is temporary. it is easily lost. understanding is much more permanent. it is rare that we forget something we understand thoroughly. when we are trying to explain something to someone else, we are forced to ask ourselves the most important question in leaning… in gaining true understanding… “why.” when we have to answer the question “why,” superficial understanding won’t do. we have to know something deeply in order to not just say how, but why. that means we have to explore a subject deeply ourselves. sometimes this involves just sitting and thinking about it clearly before you try to explain it to someone else. sometimes just the act of writing, speaking or drawing something causes you to make connections you didn’t make before, instantly deepening your knowledge. (ever had one of those moments when you explained something to someone else and you suddenly realized that before you tried to explain it you didn’t really understand it yourself, but now you magically do?) and, sometimes, you have to go back to the drawing board and do more research to fill in those gaps in your own knowledge you just uncovered when you tried to explain it to someone else. becoming a teacher so, you perhaps you agree with me so far, but you’ve got one problem–you’re not a teacher. well, i have good news for you. we are all teachers. teaching doesn’t have to be some formal thing where you have books and a classroom. teaching is simply repackaging information in a way that someone else can understand. the most effective teaching takes place when you can explain something to someone in terms of something else they already understand. (want a great book on the subject that might make your brain hurt? surfaces and essences: analogy as the fuel and fire of thinking. an excellent book by douglas hofstadter, author of godel, escher, bach: an eternal golden braid. both books are extremely difficult reads, but very rewarding.) as human beings, we do this all the time. whenever we communicate with someone else and tell them about something we learned or explain how to do something, we are teaching. of course, the more you do it, the better you get at it, and adding a little more formalization to your practice doesn’t hurt, but at heart, you–yes, you–are a teacher. one of the best ways to start teaching–that may even benefit your career–is to start a blog. many developers i talk to assume that they have to already be experts at something in order to blog about it. the truth is, you only have to be one step ahead of someone for them to learn from you. so, don’t be afraid to blog about what you are learning as you are learning it. there will always be someone out there who could benefit from your knowledge–even if you consider yourself a beginner. and don’t worry about blogging for someone else–at least not at first. just blog for yourself. the act of taking your thoughts and putting them into words will gain you the benefits of increasing your own understanding and reorganizing thoughts in your mind. i won’t pretend the process isn’t painful. when you first start writing, it doesn’t usually come easily. but, don’t worry too much about quality. worry about communicating your ideas. with time, you’ll eventually get better and you’ll find the process of converting the ideas in your head to words becomes easier and easier. of course, creating a blog isn’t the only way to start teaching. you can simply have a conversation with a friend, a coworker, or even your spouse about what you are learning. just make sure you express what you are learning externally in one form or another if you really want to gain a deep understanding of the subject. you can also record videos or screencasts, speak on a subject or even give a presentation at work. whatever you do, make sure that teaching is part of your learning process. for more posts like this and some content i only deliver via email, sign up here.

June 5, 2014

by John Sonmez

· 19,724 Views · 1 Like

Introducing Partitioned Collections for MongoDB Applications

TokuMX 1.5 is around the corner. The big feature will be something we discussed briefly when talking about replication changes in 1.4: partitioned collections. Before introducing the feature, I wanted to mention the following. Although TokuMX 1.5 is not available as of this writing, we would love to hear feedback on partitioned collections, which we think are wonderful for time-series data, as I describe below. If you are interested in trying out the feature, email [email protected] for a pre-release version of TokuMX 1.5. What is a partitioned collection? A partitioned collection is analogous to a partitioned table in relational databases. Oracle, MySQL, SQL Server, and Postgres all support partitioned tables. We are happy to bring this functionality to TokuMX. So, if the remainder of this blog is unclear, and you have friends in the office who are familiar with relational databases, you may want to ask them for more information :). Nevertheless, a partitioned collection is a collection that underneath the covers is broken into (or partitioned into) several individual collections, based on ranges of a “partition key”. From the application developer’s point of view, the collection is just another collection. Queries, inserts, updates, and deletes just work with no syntactical changes. Secondary indexes and replication work as well. But underneath the covers, the data will be broken into several collections, with each collection responsible for all data for a range of the partition key. If you are running TokuMX 1.4, a simple example is the oplog, which is a partitioned collection. Any normal query works just fine on the oplog. However, if you look in your data directory, you will see several .tokumx files named “local_oplog_rs_p…”. These files are the individual partitions that break up the data. Each partition stores a range of _id fields in the oplog. Why should I bother using a partitioned collection? This will be its own post with longer examples, but here is a summary. Partitioned collections have two big advantages: Large chunks of data can be deleted very efficiently by dropping partitions. The cost is that of performing an “rm” on some files in the filesystem. This is really fast and efficient. Queries that include the partition key may be isolated to individual partitions, and therefore run faster. This is similar to “query isolation” for shard keys. So, one scenario you may want a partitioned collection for is where the oldest data gets dropped periodically, and many queries benefit from a time based key. That will be a good fit. In short: time series data. If you have a time-series application where you want to keep a rolling period of data (e.g. the last 6 months worth), then using a partitioned collection will be great, and is preferable to using a TTL index or a capped collection. In a future blog post, I will expand on this. How do I use a partitioned collection in TokuMX 1.5? Basically, just like a normal collection, except with some commands added to create a partitioned collection, add partitions, and drop partitions. Below, I explain the shell commands added for this functionality. Our documentation contains the full commands so that they may be called by any driver’s runCommand method. Ok, so how do I create a partitioned collection? The first thing to consider is what your partition key should be. That is, what key do you want to use ranges of to partition your data? This key has similarities with a shard key. It should be a key that can be used to isolate partitions, the way a shard key is used to isolate shards (as explained here). Also, it should be a key that contains a range of data you would like to delete all at once. With time series data, that key will likely be a timestamp. In TokuMX, the partition key is always the primary key. To create a partitioned collection, “foo”, with a timestamp field, “ts”, used for the partitioning run the following: > db.createCollection("foo", { partitioned : 1 , primaryKey : { ts : 1 , _id : 1 } }) { "ok" : 1 } Note that in TokuMX, the primary key must have the _id field appended to it to ensure uniqueness. As a side note, we do not support hash based partitioning, only range based partitioning. Adding partitions? In TokuMX, partitions can only be appended to the end. Individual partitions cannot be split. So, say we have a collection that partitions on the _id field, where all _id’s happen to be integers. Suppose we have three partitions with the following ranges: _id <= 0 0 < _id <= 1000 _id > 1000 With this collection we cannot create a partition with the range 500 < _id <= 1000, because that would split the second partition. All we can do is add a new partition to the end, and “cap” the current last partition with a new maximum value. This new maximum value must be greater than or equal to the primary key (or in this case, _id) of the last partition’s last document. So, if the last partition’s last document has an _id of 2500, we can only partitions that create a range whose maximum is at least 2500. There are two ways to add a partition. The first method peeks at the last document in the current last partition, caps the partition with the primary key of that last document, and creates a new partition. To do so, one does: > db.foo.addPartition() { "ok" : 1 } In the above example, the partitioned collection would now have partitions with the following ranges: _id <= 0 0 < _id <= 1000 1000 < _id <= 2500 _id > 2500 Alternatively, we can specify what the new maximum of the existing last partition may be, provided it is greater than the last document in last partition (which in this example is 2500). To do so, we simply pass in the new maximum as a parameter: > db.foo.addPartition({ _id : 3000 }); { "ok" : 1 } This would make the collection have partitions with the following ranges: _id <= 0 0 < _id <= 1000 1000 < _id <= 3000 _id > 3000 Dropping partitions? Dropping partitions is simple. First, see what the partitions are with the following shell command: > db.foo.getPartitionInfo() { "numPartitions" : NumberLong(4), "partitions" : [ { "_id" : NumberLong(0), "max" : { "_id" : 0 }, "createTime" : ISODate("2014-05-29T01:50:15.839Z") }, { "_id" : NumberLong(1), "max" : { "_id" : 1000 }, "createTime" : ISODate("2014-05-29T01:50:27.049Z") }, { "_id" : NumberLong(2), "max" : { "_id" : 2500 }, "createTime" : ISODate("2014-05-29T01:50:30.549Z") }, { "_id" : NumberLong(3), "max" : { "_id" : { "$maxKey" : 1 } }, "createTime" : ISODate("2014-05-29T01:50:35.903Z") } ], "ok" : 1 } This lists each partition, what the maximum value that each partition may hold (thus defining the range of the partition), and the id of the partition (in the _id field). So, in the example we used for adding partitions, we have four partitions with _ids 0 through 3. To drop a partition, we run the following command and pass the _id of the partition we want to drop. To drop partition 0, we run: > db.foo.dropPartition(0) { "ok" : 1 } Looking at the list of partitions after this operation, we see the partition is dropped: > db.foo.getPartitionInfo() { "numPartitions" : NumberLong(3), "partitions" : [ { "_id" : NumberLong(1), "max" : { "_id" : 1000 }, "createTime" : ISODate("2014-05-29T01:50:27.049Z") }, { "_id" : NumberLong(2), "max" : { "_id" : 2500 }, "createTime" : ISODate("2014-05-29T01:50:30.549Z") }, { "_id" : NumberLong(3), "max" : { "_id" : { "$maxKey" : 1 } }, "createTime" : ISODate("2014-05-29T01:50:35.903Z") } ], "ok" : 1 } This covers how to use partitioned collections. We hope users in the MongoDB ecosystem find this feature as useful as relational database users do. In the comments section below, feel free to leave questions and/or feedback.

June 4, 2014

by Zardosht Kasheff

· 12,493 Views

Neo4j & Cypher: Rounding a Float Value to Decimal Places

About 6 months ago, Jacqui Read created a github issue explaining how she wanted to round a float value to a number of decimal places but was unable to do so due to the round function not taking the appropriate parameter. I found myself wanting to do the same thing last week where I initially had the following value: RETURN toFloat("12.336666") AS value I wanted to round that to 2 decimal places and Wes suggested multiplying the value before using ROUND and then dividing afterwards to achieve that. For two decimal places we need to multiply and divide by 100: WITH toFloat("12.336666") AS value RETURN round(100 * value) / 100 AS value 12.34 Michael suggested abstracting the number of decimal places like so: WITH 2 as precision WITH toFloat("12.336666") AS value, 10^precision AS factor RETURN round(factor * value)/factor AS value If we want to round to 4 decimal places we can easily change that: WITH 4 as precision WITH toFloat("12.336666") AS value, 10^precision AS factor RETURN round(factor * value)/factor AS value 12.3367 The code is available as a graph gist too if you want to play around with it. And you may as well check out the other gists while you’re at it – enjoy!

June 2, 2014

by Mark Needham

· 7,022 Views

NetBeans in the Classroom: Console Input Using Scanners

ken fogel is the program coordinator and chairperson of the computer science technology program at dawson college in montreal, canada. he is also a program consultant to and part-time instructor in the computer institute of concordia university 's school of extended learning. he blogs at omniprogrammer.com and tweets @omniprof . his regular columns about netbeans in education are listed here . i am old enough to remember commercial applications that had a console interface. my students have no experience with the console. when java was introduced the era of such applications was over. instead you used awt, then swing and now javafx. for the student starting out in programming these gui frameworks add complexity they may not be ready for. therefore i start my classes using console input and output. input can be divided into two categories. the first is values. the primitive types such as int and double are value types. as i explain it, a value is a something you can perform both mathematical and logical operations on. the second is strings. a string does not have a value in the mathematical sense. instead it has a state. therefore it cannot be used in a mathematical expression. state can be compared and so strings can be used is logical expressions such as determining if two strings are the same. if you are ordering strings then you can also determine if one string is greater or less than the other. java does not have any console input routines for values. to java everything you type is a string. this has meant that for many years every java textbook author provided a console input routine that could convert strings to values. java 1.5 made all these routines redundant with the introduction of the scanner class. i could now teach input without needing to go into detail about wrapper classes or the numberformatexception in a standard approach. /** * ask the user for their weight */ public void inputweight() { scanner sc = new scanner(system.in); system.out.print("please enter your weight: "); int weight = sc.nextint(); system.out.println("weight = " + weight); } there is a problem with this code. if the user enters a string that cannot be converted we get the numberformatexception we want to avoid. scanner provides a solution with its ‘hasnext. . .’ methods. these methods determine if the user input can be converted to the target value. you can reject input and avoid the exception. /** * ask the user for their weight */ public void inputweight() { scanner sc = new scanner(system.in); system.out.print("please enter your weight: "); int weight; // accept input and test to see if it’s an integer if (sc.hasnextint()) { // success, it was an integer so get it weight = sc.nextint(); system.out.println("weight = " + weight); } else { // failure so retrieve the input as a string string bad = sc.next(); // inform the user what went wrong system.out.println(bad + " cannot be converted to an integer"); } } now when you run the program you get the following for a good input. this is what you get for a bad input. i will return to numeric input in another article as you do not want a program to end when the user makes one mistake. character input to make my assignments more interesting i have the students create a menu from which they must choose an action. i like to use letters as the menu choice. the problem is that there is no nextchar() method in scanner. instead you need to use next() which will accept anything that you enter. public void displaymenu() { scanner sc = new scanner(system.in); system.out.println("select an account"); system.out.println("a: savings"); system.out.println("b: checking"); system.out.println("c: exit"); system.out.print("please choose a, b or c: "); string choice = sc.next(); system.out.println("you chose: " + choice); } which could result in the following: the first improvement is to convert the string into a character. to do this the string must only be one character long. rather than checking the length you can just extract the first character in the string using the string class charat method. string choice = sc.next(); char letter = choice.charat(0); system.out.println("you chose: " + letter); this will work and if you enter “a” you will get ‘a’. the problem now is that the case of the character you entered should not matter. this is easy to solve by adding one more method from string into the mix. char letter = choice.touppercase().charat(0); almost perfect but since we are using string input the user can enter any string they want. so “bacon” will result in ‘b’. we also need some way to control the allowable letters. in this menu the choices are a, b and c. the solution is to use the ‘hasnext’ method for ‘next’. not the normal, no parameter version, but the version that takes a pattern. pattern is a polite way of saying regular expression. from https://xkcd.com/208/ just as we did with integer input we first test the string that the user has entered. the regular expression will be the pattern that will be used to determine if the string is correct. this regular expression is probably the simplest one you can write. we want a single character, either upper or lower case and is one of the three allowable characters. here it is: [abcabc] the square brackets indicate that this is a character class expression that can match only a single character. what goes inside the brackets are the allowable characters, in this case a, b, c, a, b or c. since these characters are adjacent in the ascii/unicode table they can also be written as a range. [a-ca-c] putting this all together we get a routine that will only accept a single letter. public void displaymenu() { scanner sc = new scanner(system.in); system.out.println("select an account"); system.out.println("a: savings"); system.out.println("b: checking"); system.out.println("c: exit"); system.out.print("please choose a, b or c: "); char letter; if (sc.hasnext("[a-ca-c]")) { string choice = sc.next(); letter = choice.touppercase().charat(0); system.out.println("you chose: " + letter); } else { string choice = sc.next(); system.out.println(choice + " is not valid input"); } } now if you enter ‘a’: but if you enter ‘aardvark’: in my next article i will look at console input a bit more and i will introduce you to the beethoven test.

May 30, 2014

by Ken Fogel

· 43,643 Views

Implementing Correlation ids in Spring Boot (for Distributed Tracing in SOA/Microservices)

After attending Sam Newman’s microservice talks at Geecon last week I started to think more about what is most likely an essential feature of service-oriented/microservice platforms for monitoring, reporting and diagnostics: correlation ids. Correlation ids allow distributed tracing within complex service oriented platforms, where a single request into the application can often be dealt with by multiple downstream service. Without the ability to correlate downstream service requests it can be very difficult to understand how requests are being handled within your platform. I’ve seen the benefit of correlation ids in several recent SOA projects I have worked on, but as Sam mentioned in his talks, it’s often very easy to think this type of tracing won’t be needed when building the initial version of the application, but then very difficult to retrofit into the application when you do realise the benefits (and the need for!). I’ve not yet found the perfect way to implement correlation ids within a Java/Spring-based application, but after chatting to Sam via email he made several suggestions which I have now turned into a simple project using Spring Boot to demonstrate how this could be implemented. Why? During both of Sam’s Geecon talks he mentioned that in his experience correlation ids were very useful for diagnostic purposes. Correlation ids are essentially an id that is generated and associated with a single (typically user-driven) request into the application that is passed down through the stack and onto dependent services. In SOA or microservice platforms this type of id is very useful, as requests into the application typically are ‘fanned out’ or handled by multiple downstream services, and a correlation id allows all of the downstream requests (from the initial point of request) to be correlated or grouped based on the id. So called ‘distributed tracing’ can then be performed using the correlation ids by combining all the downstream service logs and matching the required id to see the trace of the request throughout your entire application stack (which is very easy if you are using a centralised logging framework such as logstash) The big players in the service-oriented field have been talking about the need for distributed tracing and correlating requests for quite some time, and as such Twitter have created their open source Zipkin framework (which often plugs into their RPC framework Finagle), and Netflix has open-sourced their Karyon web/microservice framework, both of which provide distributed tracing. There are of course commercial offering in this area, one such product being AppDynamics, which is very cool, but has a rather hefty price tag. Creating a proof-of-concept in Spring Boot As great as Zipkin and Karyon are, they are both relatively invasive, in that you have to build your services on top of the (often opinionated) frameworks. This might be fine for some use cases, but no so much for others, especially when you are building microservices. I’ve been enjoying experimenting with Spring Boot of late, and this framework builds on the much known and loved (at least by me :-) ) Spring framework by providing lots of preconfigured sensible defaults. This allows you to build microservices (especially ones that communicate via RESTful interfaces) very rapidly. The remainder of this blog pos explains how I implemented a (hopefully) non-invasive way of implementing correlation ids. Goals Allow a correlation id to be generated for a initial request into the application Enable the correlation id to be passed to downstream services, using as method that is as non-invasive into the code as possible Implementation I have created two projects on GitHub, one containing an implementation where all requests are being handled in a synchronous style (i.e. the traditional Spring approach of handling all request processing on a single thread), and also one for when an asynchronous (non-blocking) style of communication is being used (i.e., using the Servlet 3 asynchronous support combined with Spring’s DeferredResult and Java’s Futures/Callables). The majority of this article describes the asynchronous implementation, as this is more interesting: Spring Boot asynchronous (DeferredResult + Futures) communication correlation id Github repo The main work in both code bases is undertaken by the CorrelationHeaderFilter, which is a standard Java EE Filter that inspects the HttpServletRequest header for the presence of a correlationId. If one is found then we set a ThreadLocal variable in the RequestCorrelation Class (discussed later). If a correlation id is not found then one is generated and added to the RequestCorrelation Class: public class CorrelationHeaderFilter implements Filter { //... @Override public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse, FilterChain filterChain) throws IOException, ServletException { final HttpServletRequest httpServletRequest = (HttpServletRequest) servletRequest; String currentCorrId = httpServletRequest.getHeader(RequestCorrelation.CORRELATION_ID_HEADER); if (!currentRequestIsAsyncDispatcher(httpServletRequest)) { if (currentCorrId == null) { currentCorrId = UUID.randomUUID().toString(); LOGGER.info("No correlationId found in Header. Generated : " + currentCorrId); } else { LOGGER.info("Found correlationId in Header : " + currentCorrId); } RequestCorrelation.setId(currentCorrId); } filterChain.doFilter(httpServletRequest, servletResponse); } //... private boolean currentRequestIsAsyncDispatcher(HttpServletRequest httpServletRequest) { return httpServletRequest.getDispatcherType().equals(DispatcherType.ASYNC); } The only thing is this code that may not instantly be obvious is the conditional checkcurrentRequestIsAsyncDispatcher(httpServletRequest), but this is here to guard against the correlation id code being executed when the Async Dispatcher thread is running to return the results (this is interesting to note, as I initially didn’t expect the Async Dispatcher to trigger the execution of the filter again?) Here is the RequestCorrelation Class, which contains a simple ThreadLocal static variable to hold the correlation id for the current Thread of execution (set via the CorrelationHeaderFilter above) public class RequestCorrelation { public static final String CORRELATION_ID = "correlationId"; private static final ThreadLocal id = new ThreadLocal(); public static String getId() { return id.get(); } public static void setId(String correlationId) { id.set(correlationId); } } Once the correlation id is stored in the RequestCorrelation Class it can be retrieved and added to downstream service requests (or data store access etc) as required by calling the static getId() method within RequestCorrelation. It is probably a good idea to encapsulate this behaviour away from your application services, and you can see an example of how to do this in a RestClient Class I have created, which composes Spring’s RestTemplate and handles the setting of the correlation id within the header transparently from the calling Class. @Component public class CorrelatingRestClient implements RestClient { private RestTemplate restTemplate = new RestTemplate(); @Override public String getForString(String uri) { String correlationId = RequestCorrelation.getId(); HttpHeaders httpHeaders = new HttpHeaders(); httpHeaders.set(RequestCorrelation.CORRELATION_ID, correlationId); LOGGER.info("start REST request to {} with correlationId {}", uri, correlationId); //TODO: error-handling and fault-tolerance in production ResponseEntity response = restTemplate.exchange(uri, HttpMethod.GET, new HttpEntity(httpHeaders), String.class); LOGGER.info("completed REST request to {} with correlationId {}", uri, correlationId); return response.getBody(); } } //... calling Class public String exampleMethod() { RestClient restClient = new CorrelatingRestClient(); return restClient.getForString(URI_LOCATION); //correlation id handling completely abstracted to RestClient impl } Making this work for asynchronous requests… The code included above works fine when you are handling all of your requests synchronously, but it is often a good idea in a SOA/microservice platform to handle requests in a non-blocking asynchronous manner. In Spring this can be achieved by using the DeferredResult Class in combination with the Servlet 3 asynchronous support. The problem with using ThreadLocal variables within the asynchronous approach is that the Thread that initially handles the request (and creates the DeferredResult/Future) will not be the Thread doing the actual processing. Accordingly, a bit of glue code is needed to ensure that the correlation id is propagated across the Threads. This can be achieved by extending Callable with the required functionality: (don’t worry if example Calling Class code doesn’t look intuitive – this adaption between DeferredResults and Futures is a necessary evil within Spring, and the full code including the boilerplate ListenableFutureAdapter is in my GitHub repo): public class CorrelationCallable implements Callable { private String correlationId; private Callable callable; public CorrelationCallable(Callable targetCallable) { correlationId = RequestCorrelation.getId(); callable = targetCallable; } @Override public V call() throws Exception { RequestCorrelation.setId(correlationId); return callable.call(); } } //... Calling Class @RequestMapping("externalNews") public DeferredResult externalNews() { return new ListenableFutureAdapter<>(service.submit(new CorrelationCallable<>(externalNewsService::getNews))); } And there we have it – the propagation of correlation id regardless of the synchronous/asynchronous nature of processing! You can clone the Github report containing my asynchronous example, and execute the application by running mvn spring-boot:run at the command line. If you access http://localhost:8080/externalNewsin your browser (or via curl) you will see something similar to the following in your Spring Boot console, which clearly demonstrates a correlation id being generated on the initial request, and then this being propagated through to a simulated external call (have a look in the ExternalNewsServiceRest Class to see how this has been implemented): [nio-8080-exec-1] u.c.t.e.c.w.f.CorrelationHeaderFilter : No correlationId found in Header. Generated : d205991b-c613-4acd-97b8-97112b2b2ad0 [pool-1-thread-1] u.c.t.e.c.w.c.CorrelatingRestClient : start REST request to http://localhost:8080/news with correlationId d205991b-c613-4acd-97b8-97112b2b2ad0 [nio-8080-exec-2] u.c.t.e.c.w.f.CorrelationHeaderFilter : Found correlationId in Header : d205991b-c613-4acd-97b8-97112b2b2ad0 [pool-1-thread-1] u.c.t.e.c.w.c.CorrelatingRestClient : completed REST request to http://localhost:8080/news with correlationId d205991b-c613-4acd-97b8-97112b2b2ad0 Conclusion I’m quite happy with this simple prototype, and it does meet the two goals I listed above. Future work will include writing some tests for this code (shame on me for not TDDing!), and also extend this functionality to a more realistic example. I would like to say a massive thanks to Sam, not only for sharing his knowledge at the great talks at Geecon, but also for taking time to respond to my emails. If you’re interested in microservices and related work I can highly recommend Sam’s Microservice book which is available in Early Access at O’Reilly. I’ve enjoyed reading the currently available chapters, and having implemented quite a few SOA projects recently I can relate to a lot of the good advice contained within. I’ll be following the development of this book with keen interest! If you have any comments or thoughts then please do share them via the comment below, or feel free to get in touch via the usual mechanisms! References I used Tomasz Nurkiewicz’s excellent blog several times for learning how best to wire up all of the DeferredResult/Future code in Spring: http://www.nurkiewicz.com/2013/03/deferredresult-asynchronous-processing.html

May 29, 2014

by Daniel Bryant

· 24,551 Views · 1 Like

Implementing Correlation IDs in Spring Boot (for Distributed Tracing in SOA/Microservices)

After attending Sam Newman’s microservice talks at Geecon last week I started to think more about what is most likely an essential feature of service-oriented/microservice platforms for monitoring, reporting and diagnostics: correlation ids. Correlation ids allow distributed tracing within complex service oriented platforms, where a single request into the application can often be dealt with by multiple downstream service. Without the ability to correlate downstream service requests it can be very difficult to understand how requests are being handled within your platform. I’ve seen the benefit of correlation ids in several recent SOA projects I have worked on, but as Sam mentioned in his talks, it’s often very easy to think this type of tracing won’t be needed when building the initial version of the application, but then very difficult to retrofit into the application when you do realise the benefits (and the need for!). I’ve not yet found the perfect way to implement correlation ids within a Java/Spring-based application, but after chatting to Sam via email he made several suggestions which I have now turned into a simple project using Spring Boot to demonstrate how this could be implemented. Why? During both of Sam’s Geecon talks he mentioned that in his experience correlation ids were very useful for diagnostic purposes. Correlation ids are essentially an id that is generated and associated with a single (typically user-driven) request into the application that is passed down through the stack and onto dependent services. In SOA or microservice platforms this type of id is very useful, as requests into the application typically are ‘fanned out’ or handled by multiple downstream services, and a correlation id allows all of the downstream requests (from the initial point of request) to be correlated or grouped based on the id. So called ‘distributed tracing’ can then be performed using the correlation ids by combining all the downstream service logs and matching the required id to see the trace of the request throughout your entire application stack (which is very easy if you are using a centralised logging framework such as logstash) The big players in the service-oriented field have been talking about the need for distributed tracing and correlating requests for quite some time, and as such Twitter have created their open source Zipkin framework (which often plugs into their RPC framework Finagle), and Netflix has open-sourced their Karyon web/microservice framework, both of which provide distributed tracing. There are of course commercial offering in this area, one such product being AppDynamics, which is very cool, but has a rather hefty price tag. Creating a proof-of-concept in Spring Boot As great as Zipkin and Karyon are, they are both relatively invasive, in that you have to build your services on top of the (often opinionated) frameworks. This might be fine for some use cases, but no so much for others, especially when you are building microservices. I’ve been enjoying experimenting with Spring Boot of late, and this framework builds on the much known and loved (at least by me :-) ) Spring framework by providing lots of preconfigured sensible defaults. This allows you to build microservices (especially ones that communicate via RESTful interfaces) very rapidly. The remainder of this blog pos explains how I implemented a (hopefully) non-invasive way of implementing correlation ids. Goals Allow a correlation id to be generated for a initial request into the application Enable the correlation id to be passed to downstream services, using as method that is as non-invasive into the code as possible Implementation I have created two projects on GitHub, one containing an implementation where all requests are being handled in a synchronous style (i.e. the traditional Spring approach of handling all request processing on a single thread), and also one for when an asynchronous (non-blocking) style of communication is being used (i.e., using the Servlet 3 asynchronous support combined with Spring’s DeferredResult and Java’s Futures/Callables). The majority of this article describes the asynchronous implementation, as this is more interesting: Spring Boot asynchronous (DeferredResult + Futures) communication correlation id Github repo The main work in both code bases is undertaken by the CorrelationHeaderFilter, which is a standard Java EE Filter that inspects the HttpServletRequest header for the presence of a correlationId. If one is found then we set a ThreadLocal variable in the RequestCorrelation Class (discussed later). If a correlation id is not found then one is generated and added to the RequestCorrelation Class: public class CorrelationHeaderFilter implements Filter { //... @Override public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse, FilterChain filterChain) throws IOException, ServletException { final HttpServletRequest httpServletRequest = (HttpServletRequest) servletRequest; String currentCorrId = httpServletRequest.getHeader(RequestCorrelation.CORRELATION_ID_HEADER); if (!currentRequestIsAsyncDispatcher(httpServletRequest)) { if (currentCorrId == null) { currentCorrId = UUID.randomUUID().toString(); LOGGER.info("No correlationId found in Header. Generated : " + currentCorrId); } else { LOGGER.info("Found correlationId in Header : " + currentCorrId); } RequestCorrelation.setId(currentCorrId); } filterChain.doFilter(httpServletRequest, servletResponse); } //... private boolean currentRequestIsAsyncDispatcher(HttpServletRequest httpServletRequest) { return httpServletRequest.getDispatcherType().equals(DispatcherType.ASYNC); } The only thing is this code that may not instantly be obvious is the conditional check currentRequestIsAsyncDispatcher(httpServletRequest), but this is here to guard against the correlation id code being executed when the Async Dispatcher thread is running to return the results (this is interesting to note, as I initially didn’t expect the Async Dispatcher to trigger the execution of the filter again?) Here is the RequestCorrelation Class, which contains a simple ThreadLocal static variable to hold the correlation id for the current Thread of execution (set via the CorrelationHeaderFilter above) public class RequestCorrelation { public static final String CORRELATION_ID = "correlationId"; private static final ThreadLocal id = new ThreadLocal(); public static String getId() { return id.get(); } public static void setId(String correlationId) { id.set(correlationId); } } Once the correlation id is stored in the RequestCorrelation Class it can be retrieved and added to downstream service requests (or data store access etc) as required by calling the static getId() method within RequestCorrelation. It is probably a good idea to encapsulate this behaviour away from your application services, and you can see an example of how to do this in a RestClient Class I have created, which composes Spring’s RestTemplate and handles the setting of the correlation id within the header transparently from the calling Class. @Component public class CorrelatingRestClient implements RestClient { private RestTemplate restTemplate = new RestTemplate(); @Override public String getForString(String uri) { String correlationId = RequestCorrelation.getId(); HttpHeaders httpHeaders = new HttpHeaders(); httpHeaders.set(RequestCorrelation.CORRELATION_ID, correlationId); LOGGER.info("start REST request to {} with correlationId {}", uri, correlationId); //TODO: error-handling and fault-tolerance in production ResponseEntity response = restTemplate.exchange(uri, HttpMethod.GET, new HttpEntity(httpHeaders), String.class); LOGGER.info("completed REST request to {} with correlationId {}", uri, correlationId); return response.getBody(); } } //... calling Class public String exampleMethod() { RestClient restClient = new CorrelatingRestClient(); return restClient.getForString(URI_LOCATION); //correlation id handling completely abstracted to RestClient impl } Making this work for asynchronous requests… The code included above works fine when you are handling all of your requests synchronously, but it is often a good idea in a SOA/microservice platform to handle requests in a non-blocking asynchronous manner. In Spring this can be achieved by using the DeferredResult Class in combination with the Servlet 3 asynchronous support. The problem with using ThreadLocal variables within the asynchronous approach is that the Thread that initially handles the request (and creates the DeferredResult/Future) will not be the Thread doing the actual processing. Accordingly, a bit of glue code is needed to ensure that the correlation id is propagated across the Threads. This can be achieved by extending Callable with the required functionality: (don’t worry if example Calling Class code doesn’t look intuitive – this adaption between DeferredResults and Futures is a necessary evil within Spring, and the full code including the boilerplate ListenableFutureAdapter is in my GitHub repo): public class CorrelationCallable implements Callable { private String correlationId; private Callable callable; public CorrelationCallable(Callable targetCallable) { correlationId = RequestCorrelation.getId(); callable = targetCallable; } @Override public V call() throws Exception { RequestCorrelation.setId(correlationId); return callable.call(); } } //... Calling Class @RequestMapping("externalNews") public DeferredResult externalNews() { return new ListenableFutureAdapter<>(service.submit(new CorrelationCallable<>(externalNewsService::getNews))); } And there we have it – the propagation of correlation id regardless of the synchronous/asynchronous nature of processing! You can clone the Github report containing my asynchronous example, and execute the application by running mvn spring-boot:run at the command line. If you access http://localhost:8080/externalNews in your browser (or via curl) you will see something similar to the following in your Spring Boot console, which clearly demonstrates a correlation id being generated on the initial request, and then this being propagated through to a simulated external call (have a look in the ExternalNewsServiceRest Class to see how this has been implemented): [nio-8080-exec-1] u.c.t.e.c.w.f.CorrelationHeaderFilter : No correlationId found in Header. Generated : d205991b-c613-4acd-97b8-97112b2b2ad0 [pool-1-thread-1] u.c.t.e.c.w.c.CorrelatingRestClient : start REST request to http://localhost:8080/news with correlationId d205991b-c613-4acd-97b8-97112b2b2ad0 [nio-8080-exec-2] u.c.t.e.c.w.f.CorrelationHeaderFilter : Found correlationId in Header : d205991b-c613-4acd-97b8-97112b2b2ad0 [pool-1-thread-1] u.c.t.e.c.w.c.CorrelatingRestClient : completed REST request to http://localhost:8080/news with correlationId d205991b-c613-4acd-97b8-97112b2b2ad0 Conclusion I’m quite happy with this simple prototype, and it does meet the two goals I listed above. Future work will include writing some tests for this code (shame on me for not TDDing!), and also extend this functionality to a more realistic example. I would like to say a massive thanks to Sam, not only for sharing his knowledge at the great talks at Geecon, but also for taking time to respond to my emails. If you’re interested in microservices and related work I can highly recommend Sam’s Microservice book which is available in Early Access at O’Reilly. I’ve enjoyed reading the currently available chapters, and having implemented quite a few SOA projects recently I can relate to a lot of the good advice contained within. I’ll be following the development of this book with keen interest! If you have any comments or thoughts then please do share them via the comment below, or feel free to get in touch via the usual mechanisms! References I used Tomasz Nurkiewicz’s excellent blog several times for learning how best to wire up all of the DeferredResult/Future code in Spring: http://www.nurkiewicz.com/2013/03/deferredresult-asynchronous-processing.html

May 28, 2014

by Daniel Bryant

· 73,916 Views · 2 Likes

String Interning — What, Why, and When?

Learn about string interning, a method of storing only one copy of each distinct string value, which must be immutable.

May 23, 2014

by Saurabh Chhajed

· 146,450 Views · 13 Likes

Generating UML Class Diagrams from Code With ObjectAid

I've used this handy Eclipse plugin for years. But very few of my colleagues were aware of it. This fact surprises me and therefore I would like to highlight it. It is very handy for generation of UML class diagrams from your code. It is also invaluable for analyzing of existing design. One of the top UML tools is Enterprise Architect. But I was never happy with Enterprise Architect usability. I find it much easier to sketch class structure in Java and generate UML class diagrams by ObjectAid. Advantage of this approach is that you have basic skeleton of the module. You can check-it in, so that team can start working on implementation. Have to mention that it’s commercial product, but Class Diagram version is free. For my UML class diagrams needs (design analyze, generating UML diagrams for documentation, modules design sketching) it was enough so far. They provide also paid Sequence diagram and Diagram Add-On versions. To be honest I have never tried these. For sequence diagrams I was using only Enterprise Architect so far. I would be definitely pushing towards buying ObjectAid licenses if we wouldn’t have Enterprise Architect licenses already. There’s no point for me to provide detailed description of it, because ObjectAid site contains short and explanatory overview. I suggest to start with One-Minute Introduction. Everyone understand how inaccurate can UML class diagrams become. Keeping them up to date manually doesn’t make sense at all. ObjectAid fills this gap, because you can maintain your class diagrams up to date. But I had problems in the past when diagrams couldn’t handle renaming or moving of Java classes. Not sure if this is still problem of recent versions, because my diagrams are short lived. I can easily generate new diagram by just dragging appropriate classes into Class diagram editor. This is beauty of ObjectAid plugin. Links: ObjectAid site Installation instructions One-Minute Introduction

May 16, 2014

by Lubos Krnac

· 74,712 Views · 1 Like

Cyclop: A Web Based Editor for Cassandra Query Language

Cyclop is a web-based tool for querying Cassandra databases with features like syntax highlighting and query completion.

May 9, 2014

by Comsysto Gmbh

· 10,510 Views

How to Generate a Random String in Java using Apache Commons Lang

In a previous post, we had shared a small function that generated random string in Java. It turns out that similar functionality is available from a class in the extremely useful apache commons lang library. If you are using maven, download the jar using the following dependency: commons-lang commons-lang 20030203.000129 The class we are interested in is RandomStringUtils. Listed below are some functions you may find useful. Generate and print a random string of length 5 from all characters available System.out.println(RandomStringUtils.random(5)); Generate and print random string of length 10 from upper and lower case alphabets System.out.println(RandomStringUtils.randomAlphabetic(10)); Generate and print a random number of length 12 System.out.println(RandomStringUtils.randomNumeric(12)); Generate and print a random string of length 5 using only a, b, c and d characters System.out.println(RandomStringUtils.random(10,new char[]{'a','b','c','d'}));

May 6, 2014

by Faheem Sohail

· 20,308 Views

Pitfalls of the Hibernate Second-Level / Query Caches

This post will go through how to setup the Hibernate Second-Level and Query caches, how they work and what are their most common pitfalls.

May 6, 2014

by Vasco Cavalheiro

· 78,322 Views · 9 Likes

Open Session In View Design Tradeoffs

The Open Session in View (OSIV) pattern gives rise to different opinions in the Java development community. Let's go over OSIV and some of the pros and cons of this pattern. The problem The problem that OSIV solves is a mismatch between the Hibernate concept of session and it's lifecycle and the way that many server-side view technologies work. In a typical Java frontend application the service layer starts by querying some of the data needed to build the view. The remaining data needed can be lazy-loaded later, with the condition that the Hibernate session remains open - and there lies the problem. Between the moment that the service layer method finishes it's execution and the moment that the view is rendered, Hibernate has already committed the transaction and closed the session. When the view tries to lazy load the extra data that it needs, if finds the Hibernate session closed, causing a LazyInitializationException. The OSIV solution OSIV tackles this problem by ensuring that the Hibernate session is kept open all the way up to the rendering of the view - hence the name of the pattern. Because the session is kept open, no more LazyInitializationExceptions occur. The session or entity manager is kept open by means of a filter that is added to the request processing chain. In the case of JPA the OpenEntityManagerInViewFilter will create an entity manager at the beginning of the request, and then bind it to the request thread. The service layer will then be executed and the business transaction committed or rolled back, but the transaction manager will not remove the entity manager from the thread after the commit. When the view rendering starts, the transaction manager will then check if there is already an entity manager binded to the thread, and if so use it instead of creating a new one. After the request is processed, the filter will then unbind the entity manager from the thread. The end result is that the same entity manager used to commit the business transaction was kept around in the request thread, allowing the view rendering code to lazy load the needed data. Going back to the original problem Let's step back a moment and go back to the initial problem: the LazyInitializationException. Is this exception really a problem? This exception can also be seen as a warning sign of a wrongly written query in the service layer. When building a view and it's backing services, the developer knows upfront what data is needed, and can make sure that the needed data is loaded before the rendering starts. Several relation types such as one-to-many use lazy-loading by default, but that default setting can be overridden if needed at query time using the following syntax: select p FROM Person p left join fetch p.invoices This means that the lazy loading can be turned off on a case by case basis depending on the data needed by the view. OSIV in projects I've worked In projects I have worked that used OSIV, we could see via query logging that the database was getting hit with a high number of SQL queries, sometimes to the point that developers had to turn off the Hibernate SQL logging. The performance of these application was impacted, but it was kept manageable using second-level caches, and due to the fact that these where intranet-based applications with a limited number of users. Pros of OSIV The main advantage of OSIV is that it makes working with ORM and the database more transparent: Less queries need to be manually written Less awareness is required about the Hibernate session and how to solve LazyInitializationExceptions. Cons of OSIV OSIV seems to be easy to misuse and can accidentally introduce N+1 performance problems in the application. On projects I've worked OSIV did not work out well in the long-term. The alternative of writing custom queries that eager fetch data depending on the use case is manageable and turned out well in other projects I've worked. Alternatives to OSIV Besides the application-level solution of writing custom queries to pre-fetch the needed data, there are other framework-level aproaches to OSIV. The Seam Framework was built by some of the same developers as Hibernate , and solves the problem by introducing the notion of conversation. Can you let me know in the comments bellow your thoughts and experiences with OSIV, thanks for reading.

April 30, 2014

by Vasco Cavalheiro

· 19,159 Views · 3 Likes