Data Engineering Resources

The Latest Data Engineering Topics

SpringBoot...there is a lot of buzz about SpringBoot nowadays. So what is SpringBoot? SpringBoot is a new spring portfolio project which takes opinionated view of building production-ready Spring applications by drastically reducing the amount of configuration required. Spring Boot is taking the convention over configuration style to the next level by registering the default configurations automatically based on the classpath libraries available at runtime. Well.. you might have already read this kind of introduction to SpringBoot on many blogs. So let me elaborate on what SpringBoot is and how it helps developing Spring applications more quickly. Spring framework was created by Rod Johnson when many of the Java developers are struggling with EJB 1.x/2.x for building enterprise applications. Spring framework makes developing the business components easy by using Dependency Injection and Aspect Oriented Programming concepts. Spring became very popular and many more Spring modules like SpringSecurity, Spring Batch, Spring Data etc become part of Spring portfolio. As more and more features added to Spring, configuring all the spring modules and their dependencies become a tedious task. Adding to that Spring provides atleast 3 ways of doing anything :-). Some people see it as flexibility and some others see it as confusing. Slowly, configuring all the Spring modules to work together became a big challenge. Spring team came up with many approaches to reduce the amount of configuration needed by introducing Spring XML DSLs, Annotations and JavaConfig. In the very beginning I remember configuring a big pile of jar version declarations in section and lot of declarations. Then I learned creating maven archetypes with basic structure and minimum required configurations. This reduced lot of repetitive work, but not eliminated completely. Whether you write the configuration by hand or generate by some automated ways, if there is code that you can see then you have to maintain it. So whether you use XML or Annotations or JavaConfig, you still need to configure(copy-paste) the same infrastructure setup one more time. On the other hand, J2EE (which is dead long time ago) emerged as JavaEE and since JavaEE6 it became easy (compared to J2EE and JavaEE5) to develop enterprise applications using JavaEE platform. Also JavaEE7 released with all the cool CDI, WebSockets, Batch, JSON support etc things became even more simple and powerful as well. With JavaEE you don't need so much XML configuration and your war file size will be in KBs (really??? for non-helloworld/non-stageshow apps also :-)). Naturally this "convention over configuration" and "you no need to glue APIs together appServer already did it" arguments became the main selling points for JavaEE over Spring. Then Spring team addresses this problem with SpringBoot :-). Now its time to JavaEE to show whats the SpringBoot's counterpart in JavaEE land :-) JBoss Forge?? I love this Spring vs JavaEE thing which leads to the birth of powerful tools which ultimately simplify the developers life :-). Many times we need similar kind of infrastructure setup using same libraries. For example, take a web application where you map DispatcherServlet url-pattern to "/", implement RESTFul webservices using Jackson JSON library with Spring Data JPA backend. Similarly there could be batch or spring integration applications which needs similar infrastructure configuration. SpringBoot to the rescue. SpringBoot look at the jar files available to the runtime classpath and register the beans for you with sensible defaults which can be overridden with explicit settings. Also SpringBoot configure those beans only when the jars files available and you haven't define any such type of bean. Altogether SpringBoot provides common infrastructure without requiring any explicit configuration but lets the developer overrides if needed. To make things more simpler, SpringBoot team provides many starter projects which are pre-configured with commonly used dependencies. For example Spring Data JPA starter project comes with JPA 2.x with Hibernate implementation along with Spring Data JPA infrastructure setup. Spring Web starter comes with Spring WebMVC, Embedded Tomcat, Jackson JSON, Logback setup. Aaah..enough theory..lets jump onto coding. I am using latest STS-3.5.1 IDE which provides many more starter project options like Facebbok, Twitter, Solr etc than its earlier version. Create a SpringBoot starter project by going to File -> New -> Spring Starter Project -> select Web and Actuator and provide the other required details and Finish. This will create a Spring Starter Web project with the following pom.xml and Application.java 4.0.0 com.sivalabs hello-springboot 1.0-SNAPSHOT jar hello-springboot Spring Boot Hello World org.springframework.boot spring-boot-starter-parent 1.1.3.RELEASE org.springframework.boot spring-boot-starter-actuator org.springframework.boot spring-boot-starter-web org.springframework.boot spring-boot-starter-test test UTF-8 com.sivalabs.springboot.Application 1.7 org.springframework.boot spring-boot-maven-plugin package com.sivalabs.springboot; import org.springframework.boot.SpringApplication; import org.springframework.boot.autoconfigure.EnableAutoConfiguration; import org.springframework.context.annotation.ComponentScan; import org.springframework.context.annotation.Configuration; @Configuration @ComponentScan @EnableAutoConfiguration public class Application { public static void main(String[] args) { SpringApplication.run(Application.class, args); } } Go ahead and run this class as a standalone Java class. It will start the embedded Tomcat server on 8080 port. But we haven't added any endpoints to access, lets go ahead and add a simple REST endpoint. @Configuration @ComponentScan @EnableAutoConfiguration @Controller public class Application { public static void main(String[] args) { SpringApplication.run(Application.class, args); } @RequestMapping(value="/") @ResponseBody public String bootup() { return "SpringBoot is up and running"; } } Now point your browser to http://localhost:8080/ and you should see the response "SpringBoot is up and running". Remember while creating project we have added Actuator starter module also. With Actuator you can obtain many interesting facts about your application. Try accessing the following URLs and you can see lot of runtime environment configurations that are provided by SpringBoot. http://localhost:8080/beans http://localhost:8080/metrics http://localhost:8080/trace http://localhost:8080/env http://localhost:8080/mappings http://localhost:8080/autoconfig http://localhost:8080/dump SpringBoot actuator deserves a dedicated blog post to cover its vast number of features, I will cover it in my upcoming posts. I hope this article provides some basic introduction to SpringBoot and how it simplifies the Spring application development. More on SpringBoot in upcoming articles. - See more at: http://www.sivalabs.in/2014/07/springboot-introducing-springboot.html#sthash.7syCIt8V.dpuf

July 4, 2014

by Siva Prasad Reddy Katamreddy

· 12,420 Views · 5 Likes

What is Scalable Machine Learning?

scalability has become one of those core concept slash buzzwords of big data. it’s all about scaling out, web scale, and so on. in principle, the idea is to be able to take one piece of code and then throw any number of computers at it to make it fast. the terms “scalable” and “large scale” have been used in machine learning circles long before there was big data. there had always been certain problems which lead to a large amount of data, for example in bioinformatics, or when dealing with large number of text documents. so finding learning algorithms, or more generally data analysis algorithms which can deal with a very large set of data was always a relevant question. interestingly, this issue of scalability were seldom solved using actual scaling in in machine learning, at least not in the big data kind of sense. part of the reason is certainly that multicore processors didn’t yet exist at the scale they do today and that the idea of “just scaling out” wasn’t as pervasive as it is today. instead, “scalable” machine learning is almost always based on finding more efficient algorithms, and most often, approximations to the original algorithm which can be computed much more efficiently. to illustrate this, let’s search for nips papers (the annual advances in neural information processing systems, short nips, conference is one of the big ml community meetings) for papers which have the term “scalable” in the title. here are some examples: scalable inference for logistic-normal topic models … this paper presents a partially collapsed gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation … partially collapsed gibbs sampling is a kind of estimation algorithm for certain graphical models. a scalable approach to probabilistic latent space inference of large-scale networks … with […] an efficient stochastic variational inference algorithm, we are able to analyze real networks with over a million vertices […] on a single machine in a matter of hours … stochastic variational inference algorithm is both an approximation and an estimation algorithm. scalable kernels for graphs with continuous attributes … in this paper, we present a class of path kernels with computational complexity $o(n^2(m + \delta^2 ))$ … and this algorithm has squared runtime in the number of data points, so wouldn’t even scale out well even if you could. usually, even if there is potential for scalability, it usually something that is “embarassingly parallel” (yep, that’s a technical term), meaning that it’s something like a summation which can be parallelized very easily. still, the actual “scalability” comes from the algorithmic side. so how do scalable ml algorithms look like? a typical example are the stochastic gradient descent (sgd) class of algorithms. these algorithms can be used, for example, to train classifiers like linear svms or logistic regression. one data point is considered at each iteration. the prediction error on that point is computed and then the gradient is taken with respect to the model parameters, giving information about how to adapt these parameters slightly to make the error smaller. vowpal wabbit is one program based on this approach and it has a nice definition of what it considers to mean scalable in machine learning: there are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. this project is about approach (b), and it’s reached a state where it may be useful to others as a platform for research and experimentation. so “scalable” means having a learning algorithm which can deal with any amount of data, without consuming ever growing amounts of resources like memory. for sgd type algorithms this is the case, because all you need to store are the model parameters, usually a few ten to hundred thousand double precision floating point value, so maybe a few megabytes in total. the main problem to speed this kind of computation up is how to stream the data by fast enough. to put it differently, not only does this kind of scalability not rely on scaling out, it’s actually not even necessary or possible to scale the computation out because the main state of the computation easily fits into main memory and computations on it cannot be distributed easily. i know that gradient descent is often taken as an example for map reduce and other approaches like in this paper on the architecture of spark , but that paper discusses a version of gradient descent where you are not taking one point at a time, but aggregate the gradient information for the whole data set before making the update to the model parameters. while this can be easily parallelized, it does not perform well in practice because the gradient information tends to average out when computed over the whole data set. if you want to know more, this large scale learning challenge sören sonnneburg organized in 2008 still has valuable information on how to deal with massive data sets. of course, there are things which can be easily scaled well using hadoop or spark, in particular any kind of data preprocessing or feature extraction where you need to apply the same operation to each data point in your data set. another area where parallelization is easy and useful is when you are using cross validation to do model selection where you usually have to train a large number of models for different parameter sets to find the combination which performs best. again, even here there is more potential for even speeding up such computations using better algorithms like in this paper of mine . i’ve just scratched the surface of this, but i hope you got the idea that scalability can mean quite different things. in big data (meaning the infrastructure side of it) what you want to compute is pretty well defined, for example some kind of aggregate over your data set, so you’re left with the question of how to parallelize that computation well. in machine learning, you have much more freedom because data is noisy and there’s always some freedom in how you model your data, so you can often get away with computing some variation of what you originally wanted to do and still perform well. often, this allows you to speed up your computations significantly by decoupling computations. parallelization is important, too, but alone it won’t get you very far. luckily, there are projects like spark and stratosphere/flink which work on providing more useful abstractions beyond map and reduce to make the last part easier for data scientists, but you won’t get rid of the algorithmic design part any time soon.

July 3, 2014

by Mikio Braun

· 18,538 Views · 1 Like

Reporting Back from MongoDB World 2014, NYC, Planet JSON

Closely approaching the one year mark of when I first joined MongoLab (and the MongoDB community), I had the pleasure of attending the inaugural MongoDB World conference put together by the incredible MongoDB team. Second only to the excitement around major MongoDB feature announcements was the collective disbelief that this was MongoDB’s first multi-day conference ever. A big congratulations to all those that worked hard to put on such a massive (did you see the Intrepid!?) event. All this planning would have been for naught if MongoDB leaders and engineers failed to deliver announcements and features that would meet and exceed expectations. From major public cloud announcements to the reveal of document-level locking in version 2.8, developers and conference goers had plenty to be excited about. There was a lot to digest from the conference… we’ll cover the major highlights in case you missed them. Big announcements in public cloud Our time at the MongoLab booth yielded many high-quality conversations, predominantly those about offloading previously internal processes and workloads to the public cloud. It was remarkable to see and hear so many enterprise teams with the exact same message: the public cloud is the future, and the future is now. It’s no surprise then that MongoDB, Inc. released not one, but two press releases around MongoDB solutions for the public cloud. Fully-managed MongoDB on the Microsoft Azure Store Nearly one year ago, MongoDB, Inc. chose to partner with the MongoLab team to build a production-ready MongoDB solution for developers on Microsoft Azure. On the first day of World, MongoDB, Inc. announced the product of our collaboration – a fully-managed highly available MongoDB-as-a-Service Add-On offering on the Microsoft Azure Store. This new service runs MongoDB Enterprise and offers replication, monitoring and support from MongoDB, Inc. It’s also backed by MongoDB Management Service (MMS), allowing for point-in-time recovery of MongoDB deployments. Now, teams without the expertise or resources to manage their MongoDB deployment(s) can outsource all the database operations (monitoring and alerting, backups, performance tuning, etc.) to both MongoLab and MongoDB’s expert support teams. You can check out the MongoDB add-on in the Azure Store: https://azure.microsoft.com/en-us/gallery/store/mongodb/mongodb-inc/ MongoDB solutions on Google Cloud Platform MongoDB, Inc. also announced the arrival of new resources to help Google Cloud Platform customers deploy MongoDB on Google Compute Engine. These resources include a “Click to Deploy” feature and a MongoDB on Google Compute Engine Solutions paper covering MongoDB best practices. If you are looking for a fully-managed solution, with automated provisioning, backups, integrated monitoring and alerting, along with expert support, MongoLab recently announced the arrival of production-ready replica sets on Google. Product Roadmap – MongoDB version 2.8 On the second day of MongoDB World, Eliot Horowitz, MongoDB, Inc. CTO & Co-founder, took center stage and announced two huge changes to the MongoDB core project: document-level locking and pluggable storage engines. These features not only reflect improvements to the core project, but also signal to the community that the MongoDB team is listening to its users and is capable of delivering the software needed to power the workloads of tomorrow. Document-level locking The slides above from Eliot’s keynote point to a current obstacle (database-level locking) in MongoDB that limits overall scalability. With database-level locking, any write operation to the database holds the write lock and prevents subsequent writes from executing on the database until the original operation holding the write lock completes. Eliot’s announcement of document-level locking moves the write lock contention from the database level to the document (MongoDB equivalent to SQL “records”) level. This change will allow users to achieve much higher write throughput (we saw a 10x performance improvement in the live demo) across their MongoDB deployments, improving write scalability. If you’d like to try out document-level locking, the MongoDB team has already pushed the feature to the master branch on GitHub. This should only be used for experimentation, not to be run in production. Pluggable storage engine As MongoDB matures, feature releases like document level locking will continue to allow developers to build robust systems on top of MongoDB. But as the number of use cases grows, different tooling tailored to specific use cases may prove to be extremely beneficial. For example, if Company X decides that they want to use MongoDB to warehouse some of their data, they would likely want to optimize their database for slow-moving data and storage efficiency (compression). With the introduction of pluggable storage engines, many new possibilities are open to the community. Teams can now write their own storage engine for a particular use case, configure replica set nodes with different storage engines for specific situations, or collaborate with the open-source community to architect innovative solutions. This feature not only allows for more granular control of the database, but also encourages the MongoDB community to work together. Takeaways: A maturing and thriving ecosystem Roughly a year ago, MongoLab CTO Todd Dampier recapped MongoSF 2013 and spoke to the health of the MongoDB ecosystem. How far we’ve come! After attending the inaugural MongoDB World and chatting with MongoDB Masters, interns, hackathon winners, power users and those new to the community, the enthusiasm is still surging and as positive as ever. This enthusiasm is well placed. Developers and hackers use MongoDB because so much rich data on the web is shared as JSON (think Facebook, Twitter, Google, etc.). As a result, MongoDB is the de-facto database for hackathons and bootstrapped projects. Just learn the API for the site you want to mine, throw the JSON in MongoDB and query your data with the rich query language- it’s that easy. The MongoDB ecosystem is maturing as well. Take a look at the Customer Success Stories and you’ll get a feel for the extent in which enterprises leverage the solution and use it in production. To further drive enterprise adoption, MongoDB, Inc.’s public cloud solutions and product roadmap features aim to help teams run MongoDB in production and give teams the confidence that MongoDB will continue to improve scalability and meet their growing project requirements. Congratulations again to the MongoDB team on their big announcements and for creating such a fantastic forum at which to learn and meet fellow MongoDB users. Our team at MongoLab had a great time making new friends and talking shop; we look forward to meeting more MongoDB users soon (at a MongoDB Days near you)! -Chris@MongoLab

July 2, 2014

by Chris Chang

· 6,472 Views

Why %util Number from Iostat is Meaningless for MySQL Capacity Planning

Earlier this month I wrote about vmstat iowait cpu numbers, and some of the comments I got were advertising the use of util% as reported by the iostat tool instead. I find this number even more useless for MySQL performance tuning and capacity planning. Now let me start by saying this is a really tricky and deceptive number. Many DBAs who report instances of their systems having a very busy IO subsystem said the util% in vmstat was above 99% and therefore they believe this number is a good indicator of an overloaded IO subsystem. Indeed – when your IO subsystem is busy, up to its full capacity, the utilization should be very close to 100%. However, it is perfectly possible for the IO subsystem and MySQL with it to have plenty more capacity than when utilization is showing 100% – as I will show in an example. Before that though lets see what the iostat manual page has to say on this topic – from this main page we can read: %util Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits. Which says right here that the number is useless for pretty much any production database server that is likely to be running RAID, Flash/SSD, SAN or cloud storage (such as EBS) capable of handling multiple requests in parallel. Let’s look at the following illustration. I will run sysbench on a system with a rather slow storage data size larger than buffer pool and uniform distribution to put pressure on the IO subsystem. I will use a read-only benchmark here as it keeps things more simple… sysbench –num-threads=1 –max-requests=0 –max-time=6000000 –report-interval=10 –test=oltp –oltp-read-only=on –db-driver=mysql –oltp-table-size=100000000 –oltp-dist-type=uniform –init-rng=on –mysql-user=root –mysql-password= run I’m seeing some 9 transactions per second, while disk utilization from iostat is at nearly 96%: [ 80s] threads: 1, tps: 9.30, reads/s: 130.20, writes/s: 0.00 response time: 171.82ms (95%) [ 90s] threads: 1, tps: 9.20, reads/s: 128.80, writes/s: 0.00 response time: 157.72ms (95%) [ 100s] threads: 1, tps: 9.00, reads/s: 126.00, writes/s: 0.00 response time: 215.38ms (95%) [ 110s] threads: 1, tps: 9.30, reads/s: 130.20, writes/s: 0.00 response time: 141.39ms (95%) Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util dm-0 0.00 0.00 127.90 0.70 4070.40 28.00 31.87 1.01 7.83 7.52 96.68 This makes a lot of sense – with read single thread read workload the drive should be only used getting data needed by the query, which will not be 100% as there is some extra time needed to process the query on the MySQL side as well as passing the result set back to sysbench. So 96% utilization; 9 transactions per second, this is a close to full-system capacity with less than 5% of device time to spare, right? Let’s run a benchmark with more concurrency – 4 threads at the time; we’ll see… [ 110s] threads: 4, tps: 21.10, reads/s: 295.40, writes/s: 0.00 response time: 312.09ms (95%) [ 120s] threads: 4, tps: 22.00, reads/s: 308.00, writes/s: 0.00 response time: 297.05ms (95%) [ 130s] threads: 4, tps: 22.40, reads/s: 313.60, writes/s: 0.00 response time: 335.34ms (95%) Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util dm-0 0.00 0.00 295.40 0.90 9372.80 35.20 31.75 4.06 13.69 3.38 100.01 So we’re seeing 100% utilization now, but what is interesting – we’re able to reclaim much more than less than 5% which was left if we look at utilization – throughput of the system increased about 2.5x Finally let’s do the test with 64 threads – this is more concurrency than exists at storage level which is conventional hard drives in RAID on this system… [ 70s] threads: 64, tps: 42.90, reads/s: 600.60, writes/s: 0.00 response time: 2516.43ms (95%) [ 80s] threads: 64, tps: 42.40, reads/s: 593.60, writes/s: 0.00 response time: 2613.15ms (95%) [ 90s] threads: 64, tps: 44.80, reads/s: 627.20, writes/s: 0.00 response time: 2404.49ms (95%) Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util dm-0 0.00 0.00 601.20 0.80 19065.60 33.60 31.73 65.98 108.72 1.66 100.00 In this case we’re getting 4.5x of throughput compared to single thread and 100% utilization. We’re also getting almost double throughput of the run with 4 thread where 100% utilization was reported. This makes sense – there are 4 drives which can work in parallel and with many outstanding requests they are able to optimize their seeks better hence giving a bit more than 4x. So what have we so ? The system which was 96% capacity but which could have driven still to provide 4.5x throughput – so it had plenty of extra IO capacity. More powerful storage might have significantly more ability to run requests in parallel so it is quite possible to see 10x or more room after utilization% starts to be reported close to 100% So if utilization% is not very helpful what can we use to understand our database IO capacity better ? First lets look at the performance reported from those sysbench runs. If we look at 95% response time you can see 1 thread and 4 threads had relatively close 95% time growing just from 150ms to 250-300ms. This is the number I really like to look at- if system is able to respond to the queries with response time not significantly higher than it has with concurrency of 1 it is not overloaded. I like using 3x as multiplier – ie when 95% spikes to be more than 3x of the single concurrency the system might be getting to the overload. With 64 threads the 95% response time is 15-20x of the one we see with single thread so it is surely overloaded. Do we have anything reported by iostat which we can use in a similar way? It turns out we do! Check out the “await” column which tells us how much the requester had to wait for the IO request to be serviced. With single concurrency it is 7.8ms which is what this drives can do for random IO and is as good as it gets. With 4 threads it is 13.7ms – less than double of best possible, so also good enough… with concurrency of 64 it is however 108ms which is over 10x of what this device could produce with no waiting and which is surely sign of overload. A couple words of caution. First, do not look at svctm which is not designed with parallel processing in mind. You can see in our case it actually gets better with high concurrency while really database had to wait a lot longer for requests submitted. Second, iostat mixes together reads and writes in single statistics which specifically for databases and especially on RAID with BBU might be like mixing apples and oranges together – writes should go to writeback cache and be acknowledged essentially instantly while reads only complete when actual data can be delivered. The tool pt-diskstats from Percona Tookit breaks them apart and so can be much more for storage tuning for database workload. Some of the recent operating systems also ship with sysstat/iostat which breaks out await to r_await and w_await which is much more useful. Final note – I used a read-only workload on purpose – when it comes to writes things can be even more complicated – MySQL buffer pool can be more efficient with more intensive writes plus group commit might be able to commit a lot of transactions with single disk write. Still, the same base logic will apply. Summary: The take away is simple – util% only shows if a device has at least one operation to deal with or is completely busy, which does not reflect actual utilization for a majority of modern IO subsystems. So you may have a lot of storage IO capacity left even when utilization is close to 100%.

July 1, 2014

by Peter Zaitsev

· 5,869 Views

Top 10 Hadoop Shell Commands to Manage HDFS

So you already know what Hadoop is? Why it is used? What problems you can solve with it?

June 30, 2014

by Saurabh Chhajed

· 485,677 Views · 8 Likes

The Dangers of SQL JOINs & Aggregate Functions: How to Avoid Fanouts

Every SQL developer uses JOINs and aggregate functions, but some may not be aware that the two interacting can return incorrect results if not handled carefully. Brett Sauve at Looker has covered the issue in this recent post - specifically, he focuses on the issue of "fanouts" - and it's an easy problem to overlook. A fanout occurs when the primary table in a query contains fewer rows than the combined table resulting from a JOIN. Because the new table contains more rows, aggregate functions like COUNT or SUM may include duplicates, throwing off the results. According to Sauve, the answer is careful ordering of JOIN tables: Herein lies a key point: to help avoid fanouts, begin your joins with the most granular table. Sauve also provides some tips to help SQL developers understand how to avoid fanouts and ensure the validity of their data. He points to three types of JOINs: 1-to-1 Many-to-1 1-to-Many Knowing which JOIN you are using can rule out the possibility of a fanout, for example, because it is not possible in a 1-to-1 or Many-to-1 JOIN. If you are using a 1-to-Many JOIN, though, Sauve has some tips there, too. For example, a messy JOIN can be cleaned up by grouping data to create a 1-to-1 relationship. Check out the full article for the rest of Sauve's tips to make sure your aggregate functions are working correctly, and if you're looking for a few more tips to keep your SQL skills sharp, you might try one of these: Yet Another 10 Common Mistakes Java Developers Make When Writing SQL 6 Simple Performance Tips for SQL SELECT Statements

June 30, 2014

by Alec Noller

· 16,584 Views

Making Operations on Volatile Fields Atomic

The expected behaviour for volatile fields is that they should behave in a multi-threaded application the same as they do in a single threaded application. They are not forbidden to behave the same way, but they are not guaranteed to behave the same way. The solution in Java 5.0+ is to use AtomicXxxx classes however these are relatively inefficient in terms of memory (they add a header and padding), performance (they add a references and little control over their relative positions), and syntactically they are not as clear to use. IMHO A simple solution if for volatile fields to act as they might be expected to do, the way JVM must support in AtomicFields which is not forbidden in the current JMM (Java- Memory Model) but not guaranteed. Why make fields volatile? The benefit of volatile fields is that they are visible across threads and some optimisations which avoid re-reading them are disabled so you always check again the current value even if you didn't change them. e.g. without volatile Thread 2: int a = 5; Thread 1: a = 6; (later) Thread 2: System.out.println(a); // prints 5 With volatile Thread 2: volatile int a = 5; Thread 1: a = 6; (later) Thread 2: System.out.println(a); // prints 6 Why not use volatile all the time? Volatile read and write access is substantially slower. When you write to a volatile field it stalls the entire CPU pipeline to ensure the data has been written to cache. Without this, there is a risk the next read of the value sees an old value, even in the same thread (See AtomicLong.lazySet() which avoids stalling the pipeline) The penalty can be in the order of 10x slower which you don't want to be doing on every access. What are the limitations of volatile? A significant limitation is that operations on the field is not atomic, even when you might think it is. Even worse than that is that usually, there is no difference. I.e. it can appear to work for a long time even years and suddenly/randomly break due to an incidental change such as the version of Java used, or even where the object is loaded into memory. e.g. which programs you loaded before running the program. e.g. updating a value Thread 2: volatile int a = 5; Thread 1: a += 1; Thread 2: a += 2; (later) Thread 2: System.out.println(a); // prints 6, 7 or 8 This is an issue because the read of a and the write of a are done separately and you can get a race condition. 99%+ of the time it will behave as expect, but sometimes it won't/ What can you do about it? You need to use AtomicXxxx classes. These wrap volatile fields with operations which behave as expected. Thread 2: AtomicInteger a = new AtomicInteger(5); Thread 1: a.incrementAndGet(); Thread 2: a.addAndGet(2); (later) Thread 2: System.out.println(a); // prints 8 What do I propose? The JVM has a means to behave as expected, the only surprising thing is you need to use a special class to do what the JMM won't guarantee for you. What I propose is that the JMM be changed to support the behaviour currently provided by the concurrency AtomicClasses. In each case the single threaded behaviour is unchanged. A multi-threaded program which does not see a race condition will behave the same. The difference is that a multi-threaded program does not have to see a race condition but changing the underlying behaviour current method suggested syntax notes x.getAndIncrement() x++ or x += 1 x.incrementAndGet() ++x x.getAndDecrment() x-- or x -= 1 x.decrementAndGet() --x x.addAndGet(y) (x += y) x.getAndAdd(y) ((x += y)-y) x.compareAndSet(e, y) (x == e ? x = y, true : false) Need to add the comma syntax used in other languages. These operations could be supported for all the primitive types such as boolean, byte, short, int, long, float and double. Additional assignment operators could be supported such as current method suggested syntax notes Atomic multiplication x *= 2; Atomic subtraction x -= y; Atomic division x /= y; Atomic modulus x %= y; Atomic shift x <<= y; Atomic shift x >>= z; Atomic shift x >>>= w; Atomic and x &= ~y; clears bits Atomic or x |= z; sets bits Atomic xor x ^= w; flips bits What is the risk? This could break code which relies on these operations occasionally failing due to race conditions. It might not be possible to support more complex expressions in a thread safe manner. This could lead to surprising bugs as the code can look like the works, but it doesn't. Never the less it will be no worse than the current state. JEP 193 - Enhanced Volatiles There is a JEP 193 to add this functionality to Java. An example is; class Usage { volatile int count; int incrementCount() { return count.volatile.incrementAndGet(); } } IMHO there is a few limitations in this approach. The syntax is fairly significant change. Changing the JMM might not require many changes the the Java syntax and possibly no changes to the compiler. It is a less general solution. It can be useful to support operations like volume += quantity; where these are double types. It places more burden on the developer to understand why he/she should use this instead of x++; Conclusion Much of the syntactic and performance overhead of using AtomicInteger and AtomicLong could be removed if the JMM guaranteed the equivalent single threaded operations behaved as expected for multi-threaded code. This feature could be added to earlier versions of Java by using byte code instrumentation.

June 27, 2014

by Peter Lawrey

· 11,920 Views

Option.fold() Considered Unreadable

We had a lengthy discussion recently during code review whether scala.Option.fold() is idiomatic and clever or maybe unreadable and tricky? Let's first describe what the problem is. Option.fold does two things: maps a function f overOption's value (if any) or returns an alternative alt if it's absent. Using simple pattern matching we can implement it as follows: val option: Option[T] = //... def alt: R = //... def f(in: T): R = //... val x: R = option match { case Some(v) => f(v) case None => alt } If you prefer one-liner, fold is actually a combination of map and getOrElse val x: R = option map f getOrElse alt Or, if you are a C programmer that still wants to write in C, but using Scala compiler: val x: R = if (option.isDefined) f(option.get) else alt Interestingly this is similar to how fold() is actually implemented, but that's an implementation detail. OK, all of the above can be replaced with single Option.fold(): val x: R = option.fold(alt)(f) Technically you can even use /: and \: operators (alt /: option) - but that would be simply masochistic. I have three problems with option.fold() idiom. First of all - it's anything but readable. We are folding (reducing) over Option - which doesn't really make much sense. Secondly it reverses the ordinary positive-then-negative-case flow by starting with failure (absence, alt) condition followed by presence block (f function; see also: Refactoring map-getOrElse to fold). Interestingly this method would work great for me if it was named mapOrElse: ** * Hypothetical in Option */ def mapOrElse[B](f: A => B, alt: => B): B = this map f getOrElse alt Actually there is already such method in Scalaz, called OptionW.cata. cata. Here is whatMartin Odersky has to say about it: "I personally find methods like cata that take two closures as arguments are often overdoing it. Do you really gain in readability over map + getOrElse? Think of a newcomer to your code[...]"While cata has some theoretical background, Option.fold just sounds like a random name collision that doesn't bring anything to the table, apart from confusion. I know what you'll say, that TraversableOnce has fold and we are sort-of doing the same thing. Why it's a random collision rather than extending the contract described inTraversableOnce? fold() method in Scala collections typically just delegates to one offoldLeft()/foldRight() (the one that works better for given data structure), thus it doesn't guarantee order and folding function has to be associative. But inOption.fold() the contract is different: folding function takes just one parameter rather than two. If you read my previous article about folds you know that reducing function always takes two parameters: current element and accumulated value (initial value during first iteration). But Option.fold() takes just one parameter: current Option value! This breaks the consistency, especially when realizing Option.foldLeft() andOption.foldRight() have correct contract (but it doesn't mean they are more readable). The only way to understand folding over option is to imagine Option as a sequence with0 or 1 elements. Then it sort of makes sense, right? No. def double(x: Int) = x * 2 Some(21).fold(-1)(double) //OK: 42 None.fold(-1)(double) //OK: -1 but: Some(21).toList.fold(-1)(double) : error: type mismatch; found : Int => Int required: (Int, Int) => Int Some(21).toList.fold(-1)(double) ^ If we treat Option[T] as a List[T], awkward Option.fold() breaks because it has different type than TraversableOnce.fold(). This is my biggest concern. I can't understand why folding wasn't defined in terms of the type system (trait?) and implemented strictly. As an example take a look at: Data.Foldable in Haskell (advanced) Data.Foldable typeclass describes various flavours of folding in Haskell. There are familiar foldl/foldr/foldl1/foldr1, in Scala namedfoldLeft/foldRight/reduceLeft/reduceRight accordingly. They have the same type as Scala and behave unsurprisingly with all types that you can fold over, including Maybe, lists, arrays, etc. There is also a function named fold, but it has a completely different meaning: class Foldable t where fold :: Monoid m => t m -> m While other folds are quite complex, this one barely takes a foldable container of ms (which have to be Monoids) and returns the same Monoid type. A quick recap: a type can be aMonoid if there exists a neutral value of that type and an operation that takes two values and produces just one. Applying that function with one of the arguments being neutral value yields the other argument. String ([Char]) is a good example with empty string being neutral value (mempty) and string concatenation being such operation (mappend). Notice that there are two different ways you can construct monoids for numbers: under addition with neutral value being 0 (x + 0 == 0 + x == x for any x) and under multiplication with neutral 1 (x * 1 == 1 * x == x for any x). Let's stick to strings. If I fold empty list of strings, I'll get an empty string. But when a list contains many elements, they are being concatenated: > fold ([] :: [String]) "" > fold [] :: String "" > fold ["foo", "bar"] "foobar" In the first example we have to explicitly say what is the type of empty list []. Otherwise Haskell compiler can't figure out what is the type of elements in a list, thus which monoid instance to choose. In second example we declare that whatever is returned from fold [], it should be a String. From that the compiler infers that [] actually must have a type of [String]. Last fold is the simplest: the program folds over elements in list and concatenates them because concatenation is the operation defined in Monoid Stringtypeclass instance. Back to options (or more precisely Maybe). Folding over Maybe monad having type parameter being Monoid (I can't believe I just said it) has an interesting interpretation: it either returns value inside Maybe or a default Monoid value: > fold (Just "abc") "abc" > fold Nothing :: String "" Just "abc" is same as Some("abc") in Scala. You can see here that if Maybe Stringis Nothing, neutral String monoid value is returned, that is an empty string. Summary Haskell shows that folding (also over Maybe) can be at least consistent. In ScalaOption.fold is unrelated to List.fold, confusing and unreadable. I advise avoiding it and staying with slightly more verbose map/getOrElse transformations or pattern matching. PS: Did I mention there is also Either.fold() (with even different contract) but noTry.fold()?

June 26, 2014

by Tomasz Nurkiewicz

· 9,597 Views

How to Install Mono on a Raspberry Pi

This post exists to help with an MSDN Magazine article that I am authoring It provides some of the low-level details for the article How to install Mono and root certificates on a raspberry pi How to create an Azure mobile service How to create a Custom API inside Azure mobile services that the raspberry pi can call into How to create an Azure storage account MONO - HOW TO INSTALL ON A RASPBERRY PI Why Mono? How to install Mono on a raspberry pi Installing trusted root certificates on to the raspberry pi http://www.mono-project.com/Main_Page An open source, cross-platform, implementation of C# and the CLR that is binary compatible with Microsoft.NET Mono is a free and open source project led by Xamarin (formerly by Novell) that provides a .NET Framework-compatible set of tools including, among others, a C# compiler and a Common Language Runtime WHY MONO? Because it lets us write .net code compiled on Windows We can simply copy the binary files from Windows to Linux and run it as is From a raspberry pi device, it is possible to use a .net application to take a photo and upload it to Windows Azure storage HOW TO INSTALL ON A RASPBERRY PI RUNNING LINUX You will issue the following commands: pi@raspberrypi ~ $ sudo apt-get update pi@raspberrypi ~ $ sudo apt-get install mono-complete The first command makes sure all the local package index are up to date with the changes made in repositories. Second command installs the complete Mono tooling and runtime. MAKING SURE THAT YOUR MONO APPLICATIONS CAN MAKE A HTTPS REST-BASED CALLS This command downloads the trusted root certificates from the Mozilla LXR web site into the Mono certificate store. Once complete, the Raspberry PI will be capable of making web requests using HTTPS requests within Mono. pi@raspberrypi ~ $ mozroots --import --ask-remove --machine CREATING A NEW AZURE MOBILE SERVICES ACCOUNT The mobile services account is needed to host a Node.js application that provides shared access signatures to raspberry pi devices The shared access signature is needed by the raspberry pi, so that it can directly and securely upload photos to Azure storage STEPS TO CREATE AN AZURE MOBILE SERVICE The steps below will create an Azure mobile service The service will be used to host a Node.js application interacting with a raspberry pi devices We will provision a SQL database, although it will not be used initially FOLLOW THESE STEPS TO CREATE THE MOBILE SERVICE Login into the Azure Portal Select MOBILE SERVICES from the left menu pane at the Azure Portal. In the lower left corner select "+NEW" to create a new Azure Mobile Service. Make sure you've selected, "COMPUTE / MOBILE SERVICE / CREATE." You will now enter a url. We will call this service raspberrymobileservice. For the DATABASE, we will choose "Create a new SQL database instance." The REGION we chose is "West US." The BACKEND is "JavaScript." Click the "->" arrow to proceed to the next screen. In this screen you will "Specify database settings." The NAME of your database will based on the URL you entered previously. In this case, the database is called "raspberrymobileservice_db." You will need to choose a SERVER. We will choose "New SQL database server" from the drop-down list. You will need to provide a SERVER LOGIN NAME and a SERVER LOGIN PASSWORD. Take note of the login you provided as it will be needed later CREATING A CUSTOM API Azure mobile services allows you to create a custom API written in JavaScript that can be called from a raspberry pi device using REST This custom API is really just a Node.js application running in the server CREATING THE API TO RESPOND TO THE DEVICE TRYING TO UPLOAD PHOTOS Now that the service is established, we will turn our attention to creating an API that the device can call into to upload a photo. Login into the Azure Portal Your mobile service will take a few minutes to complete, and you should see the "Ready" flag as the "Status" for your service. Once it is ready you can drill into your service to customize its behavior. Just to the right of the service name, click the right arrow key "->" to drill into the service details. The top menu bar will offer many options, but we are interested in the one titled "API." The API allows you to create a series of node.JS API calls that a device can call into using rest-based approaches. Click on "API." from there, select "CREATE A CUSTOM API." You will be asked to provide an API name. Type in "photos" for the API name. Below you will see a series of drop-down combo boxes that relate to permission. We will keep the default value of "Anybody with the application key." This might not be the best option for all scenarios. You can read more about this here. http://msdn.microsoft.com/en-us/library/azure/jj193161.aspx. Click the checkmark to complete the process. The name of the AP you just created, "Photos," should be visible on the portal interface. To drill into the photos API click on the right arrow key "->". The right arrow key will be just to the right of the name of the API "Photos". At this point you should see a basic script that has been provided by default. We will overwrite this default script with our own script as described in the MSDN Magazine article. CREATING A STORAGE ACCOUNT TO STORE THE PHOTOS Navigate to the portal and create a storage account Create a container for the photos Obtain the: Storage Account Name (you will provide a name) Storage Account Access key (generated for you) Container Name (you will create) CREATING A STORAGE ACCOUNT We will need a storage account so that we can upload photos to it. The steps are well documented here: http://azure.microsoft.com/en-us/documentation/articles/storage-create-storage-account/ In our case we call the storage account raspberrystorage. This means that the URL that the device will use to upload photos is https://raspberrystorage.blob.core.windows.net/. As you complete these steps make sure that you choose the storage account location to be the same location as was used for your mobile services account. This avoids any unnecessary latency or bandwidth costs between data centers. Once the storage account is created, we will need to create a container within it. Photos or any blob for that matter, are always stored within a container. To create a container drill into your newly created storage account and select CONTAINERS from the top menu. From there, select CREATE A CONTAINER. The new container dialog box will ask for a name for your container. Take note of the name you provide. We are calling our container ?photocontainer.? When the raspberry pi device uploads photos to the storage account, it will target a specific container, such as the one we just created. You will next be asked to indicate ACCESS rights. To keep things simple we will select access rights of Public Blob. ENTERING APP SETTINGS Rather than hard-code storage account information inside your JavaScript/Node.js applications, you should consider using apps settings inside of the Azure mobile services portal This post also discusses it well: http://blogs.msdn.com/b/carlosfigueira/archive/2013/12/09/application-settings-in-azure-mobile-services.aspx ?The idea of application settings is a set of key-value pairs which can be set for the mobile service (either via the portal or via the command-line interface), and those values could be then read in the service runtime.? NAVIGATING TO APP SETTINGS Navigate to the Azure Mobile Services section of the portal. Drill into the specific service by hitting the arrow below Select from the Configure Menu at the top Scroll down to the very bottom to see app settings Note that we need to enter: - We need to get this from Azure Storage - PhotoContainerName - AccountName - AccountKey We get this information from the Azure Storage Section of the Portal. Note that you need to have provisioned a Storage Account to have this information. How to get the AccountKey with Azure Storage Services Now you can get the access keys HOW NODE.JS WILL ACCESS THE APP SETTINGS You will create a Node.js application inside of Azure Mobile Services See previous steps THE NODE.JS APPLICATION READING APP SETTINGS You will starting by going back to Azure Mobile Services and drill down into your newly minted service We called ours raspberrymobileservice Once you click API, you should see: Notice the app settings are being read on lines 12 to 14.

June 19, 2014

by Bruno Terkaly

· 16,766 Views

Internet of Things (IoT) Reference Architecture

to converge internet of thing devices with corporate it solutions, teams require a reference architecture for the internet of things (iot). the reference architecture must include devices, server-side capabilities, and cloud architecture required to interact with and manage the devices. a reference architecture should provide architects and developers of iot projects with an effective starting point that addresses major iot project and system requirements. a high-level iot reference architecture may include the following layers (see figure 1): external communications - web/portal, dashboard, apis event processing and analytics (including data storage) aggregation / bus layer – esb and message broker device communications devices cross-‐cutting layers include: device and application management identity and access management a more detailed architecture component description can be found in the iot reference architecture white paper .

June 18, 2014

by Chris Haddad

· 17,726 Views

Synchronizing Data Across Mule Applications with MongoDB

One of the many things that Mule does extremely well is transparently store state between messages. Take the idempotent-message-filter for example; The idempotent-message-filter keeps track of what messages have been processed previously to ensure no duplicate messages are processed. This is all achieved with one single xml element. Or take OAuth enabled cloud connectors that need to keep track of access tokens so you're not redirected to some authorize page each time. All this is handled behind the scenes without you having to worry about it. Mule’s ability to store this state is implemented via object-stores. Object-stores provide a simple generic way to store data in Mule. Many message processors such as the idempotent-meesage-filter, OAuth message processors, until-successful routers and many more all transparently use object-stores behind the scenes to store state between messages. By default, Mule uses in-memory object-stores behind the scenes. Things get more interesting however, when your Mule application is distributed across multiple Mule nodes. At some point you are going to need to synchronise object-stores across multiple applications. If you are fortunate to be using the Enterprise Edition of Mule you can take advantage of its out of the box clustering support for synchronizing state across an entire cluster. But even then as mentioned in this blog post by John Demic: Synchronizing Mule Applications Across Data Centers with Apache Cassandra, this may not be ideal if your Mule nodes are geographically distributed across multiple data centres. In the aforementioned post, John gives a great example of how to use Apache Cassandra as a shared object-store. But many other traditonal and NoSql data stores can be used as well. In this blog post I'm going to show a simple example of how you can achieve the same using MongoDB. If it's good enough for the Hipster Hacker then it's good enough for me! OMG! Does this officially make @MuleSoft Mule part of the Hipster Stack? pic.twitter.com/OwM4dzhGvi — Ryan Carter (@rydcarter) October 9, 2013 Configuring the MongoDB object-store MongoDB is an open-source document database, and the leading NoSQL database. The MongoDB module can be used to read/write/query and perform a whole load of other operation a MongoDB database as part of your Mule flows. It also exposes an additional submodule for using your MondogDB database as an object-store implementation in Mule. Here is a sample configuration from one of the tests: As you can see, it's classed as it's own module with it's own namespace and is configured simply using the mos:config name="mos" database="mongo-connector-test" element. However, this doesn't work for the current version of the connector. Due to the following bug. You can simply apply the fix in the aforementioned pull request, or you can instantiate the object store yourself using Spring - similar to that in the Apache Cassandra configuration. The object-store class can be found here: MongoObjectStore.java. As you can see, it has various annotations for default values and initialising the class. However, as we are not using this as a fully fledged Mule Devkit module, those annotations have no effect. So we have to pass in the default values ourselves and call the initialise method when instantiating our object-store. This is quite simple with Spring. Here is a full working example using the Spring instantiated object-store as the store for the idempotent-message-filter: If you run this configuration, Mule will poll every 10 seconds with the exact same message: "1". The first time it polls you should see "Passed Filter" in the log output. Subsequent polls should not log anything as the idempotent message filter will filter them out. This is usually performed with the default in-memory object-store. To prove this is persisted you should be able to check the value in the Mongo database collection: db.getCollection('mule.objectstore._default').find(). Also you can stop your Mule app and restart it to prove that it works between restarts and failovers etc.

June 17, 2014

by Ryan Carter

· 10,017 Views

Unit Testing Checklist: Keep Your Tests Useful and Avoid Big Mistakes

A unit test is a set of methods launched frequently to validate your code. It is usually a good idea to write code in this order: Write a class API Write a method to test the API Implement the API Launch the unit tests Why write unit tests? They validate current and future implementations. They measure code quality. They force you to write testable, loosely coupled code. They’re cheaper than manual regression testing. They build confidence in your code. They help teamwork. Why use a checklist? Unit testing can be harder than actual implementation. Unit testing forces you to really think things through. But unit tests should be simple, direct, and easy to read and maintain. You also need to know when to stop writing tests and start writing the implementation. Use this checklist to be sure your tests are really useful and to the point. Remember: the checklist helps you avoid big mistakes, but you need to make sure of the following: □ My test class is testing only one class. o You are testing a class API to be sure the public contract is respected. □ My methods are testing only one method at a time. o Be sure not to test private methods! They are hidden implementation details, not API. □ My variables and method names are explicit. o For example, store an expected value in an expectedFoo variable instead of just foo. If you test many combinations, use composed variable names like inputValue_NotNull, inputValue_ZeroData, inputValue_PastDate, etc. (according to your application’s coding convention). □ My test cases are easy to read by humans. o Future maintainers should be able to read your tests before reading the implementation. This will help them understand a class API before tweaking or debugging it. □ My tests respect the usual clean code standards. o There should be no flow control in a test method (switch, if, etc.). A good test is just a very straightforward sequence of setup/validate instructions. If necessary, use sub-methods to factorize and make your tests easier to read. In case of multiple scenarios, use multiple test methods (one for each case). o For example, a test method should fit on screen without scrolling – 1 to 20 lines long. If the method is longer, consider writing multiple test methods for each case instead of jamming them together. □ My tests are also testing expected exceptions. o In java, use @Test(expected=MyException.class). □ My tests don’t need access to a database. o Or if a test does need database access, then it must be a mocked, “fire and forget” temporary database that you fill with test cases for every new test method (use the Setup/Teardown methods to prepare it). □ My tests don’t need access to network resources. o You can’t rely on third parties like network or device presence to validate a method (use mocks). □ My tests control side effects, limit values (max, min) and null variables (even when they throw exceptions). o You want to make sure these problem cases never occur, even when the test won’t be used during maintenance. □ My tests can be run at any time, at any place without needing configuration or human intervention. □ My tests pass for the current implementation and are easy to evolve. o Tests really exist to support code evolution. If they are too hard to maintain or too light to refine the code, then they are a useless burden. (Many developers avoid unit testing for this reason.) □ My tests are concrete. o In Java: don’t use Date() as input for a method you are testing, but build a concrete date out of Calendar (don’t forget to force the timezone). Other example: use name = “Smith”; instead of name = “name”; or name = “test”; □ My tests use a mock to simulate/stub complex class structure or methods. o Remember to test only one class API at a time. o Never test third-party libraries through your own classes. Libraries should come with their own tests (this is actually a good way to choose a library). □ My tests are never @ignored or commented out. Never. Ever. □ My tests help me validate my architecture. o If you can’t test a method or a class, your design is not agile enough. □ My tests can run on any supported platform, not just the targeted platform. o Don’t expect a particular device or hardware configuration. Otherwise, your tests will make migration tougher and you will be incentivized to disable them. □ My tests are lightning fast! o Slow tests shouldn’t drag you down. Speed encourages you to launch your tests often. It also helps to reduce building time on Continuous Integration systems. o Use a test runner that allows you to launch one test at a time while you are writing it. Use “delay” or “sleep” with caution – i.e., only in some edge cases, like waiting for notifications or clock-based methods.

June 16, 2014

by Jean-baptiste Rieu

· 24,986 Views

Anomaly Detection : A Survey

this post is summary of the “anomaly detection : a survey”. anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior. these non-conforming patterns are often referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities or contaminants in different application domains. anomalies are patterns in data that do not conform to a well defined notion of normal behavior. interesting to analyze unwanted noise in the data also can be found in there. novelty detection which aims at detecting previously unobserved (emergent, novel) patterns in the data challenges for anomaly detection drawing the boundary between normal and anomalous behavior availability of labeled data noisy data type of anomaly anomalies can be classified into following three categories point anomalies - an individual data instance can be considered as anomalous with respect to the rest of data contextual anomalies - a data instance is anomalous in a specific context (but not otherwise), then it is termed as a contextual anomaly (also referred as conditional anomaly). each data instance is defined using following two sets of attributes contextual attributes. the contextual attributes are used to determine the context (or neighborhood) for that instance eg: in time- series data, time is a contextual attribute which determines the position of an instance on the entire sequence behavioral attributes. the behavioral attributes define the non-contextual characteristics of an instance eg: in a spatial data set describing the average rainfall of the entire world, the amount of rainfall at any location is a behavioral attribute to explain this we will look into "exchange rate history for converting united states dollar (usd) to sri lankan rupee (lkr)"[1] contextual anomaly t2 in a exchange rate time series. note that the exchange rate at time t1 is same as that at time t2 but occurs in a different context and hence is not considered as an anomaly 3. collective anomalies - a collection of related data instances is anomalous with respect to the entire data set data labels the labels associated with a data instance denote if that instance is normal or anomalous. depending labels availability, anomaly detection techniques can be operated in one of the following three modes supervised anomaly detection - techniques trained in supervised mode assume the availability of a training data set which has labeled instances for normal as well as anomaly class semi-supervised anomaly detection - techniques that operate in a semi-supervised mode, assume that the training data has labeled instances for only the normal class. since they do not require labels for the anomaly class unsupervised anomaly detection - techniques that operate in unsupervised mode do not require training data, and thus are most widely applicable. the techniques implicit assume that normal instances are far more frequent than anomalies in the test data. if this assumption is not true then such techniques suffer from high false alarm rate output of anomaly detection anomaly detection have two types of output techniques scores. scoring techniques assign an anomaly score to each instance in the test data depending on the degree to which that instance is considered an anomaly labels. techniques in this category assign a label (normal or anomalous) to each test instance applications of anomaly detection intrusion detection intrusion detection refers to detection of malicious activity. the key challenge for anomaly detection in this domain is the huge volume of data. thus, semi-supervised and unsupervised anomaly detection techniques are preferred in this domain.denning[3] classifies intrusion detection systems into host based and net- work based intrusion detection systems. host based intrusion detection systems - this deals with operating system call traces network intrusion detection systems - these systems deal with detecting intrusions in network data. the intrusions typically occur as anomalous patterns (point anomalies) though certain techniques model[4] the data in a sequential fashion and detect anomalous subsequences (collective anomalies). a challenge faced by anomaly detection techniques in this domain is that the nature of anomalies keeps changing over time as the intruders adapt their network attacks to evade the existing intrusion detection solutions. fraud detection fraud detection refers to detection of criminal activities occurring in commercial organizations such as banks, credit card companies, insurance agencies, cell phone companies, stock market, etc. the organizations are interested in immediate detection of such frauds to prevent economic losses. detection techniques used for credit card fraud and network intrusion detection as below. statistical profiling using histograms parametric statisti- cal modeling non-parametric sta- tistical modeling bayesian networks neural networks support vector ma- chines rule-based clustering based nearest neighbor based spectral information theoretic here are some domain in fraud detections credit card fraud detection mobile phone fraud detection insurance claim fraud detection insider trading detection medical and public health anomaly detection anomaly detection in the medical and public health domains typically work with pa- tient records. the data can have anomalies due to several reasons such as abnormal patient condition or instrumentation errors or recording errors. thus the anomaly detection is a very critical problem in this domain and requires high degree of accuracy. industrial damage detection such damages need to be detected early to prevent further escalation and losses. fault detection in mechanical units structural defect detection image processing anomaly detection techniques dealing with images are either interested in any changes in an image over time (motion detection) or in regions which appear ab- normal on the static image. this domain includes satellite imagery. anomaly detection in text data anomaly detection techniques in this domain primarily detect novel topics or events or news stories in a collection of documents or news articles. the anomalies are caused due to a new interesting event or an anomalous topic. sensor networks since the sensor data collected from various wireless sensors has several unique characteristics. references [1] http://themoneyconverter.com/usd/lkr.aspx [2] varun chandola, arindam banerjee, and vipin kumar. 2009. anomaly detection: a survey. acm comput. surv. 41, 3, article 15 (july 2009), 58 pages. doi=10.1145/1541880.1541882 http://doi.acm.org/10.1145/1541880.1541882 [3] denning, d. e. 1987. an intrusion detection model. ieee transactions of software engineer-ing 13, 2, 222–232. [4]gwadera, r., atallah, m. j., and szpankowski, w. 2004. detection of significant sets of episodes in event sequences. in proceedings of the fourth ieee international conference on data mining. ieee computer society, washington, dc, usa, 3–10.

June 16, 2014

by Madhuka Udantha

· 12,830 Views

Java 8 Friday: 10 Subtle Mistakes When Using the Streams API

at data geekery , we love java. and as we’re really into jooq’s fluent api and query dsl , we’re absolutely thrilled about what java 8 will bring to our ecosystem. java 8 friday every friday, we’re showing you a couple of nice new tutorial-style java 8 features, which take advantage of lambda expressions, extension methods, and other great stuff. you’ll find the source code on github . 10 subtle mistakes when using the streams api we’ve done all the sql mistakes lists: 10 common mistakes java developers make when writing sql 10 more common mistakes java developers make when writing sql yet another 10 common mistakes java developers make when writing sql (you won’t believe the last one) but we haven’t done a top 10 mistakes list with java 8 yet! for today’s occasion ( it’s friday the 13th ), we’ll catch up with what will go wrong in your application when you’re working with java 8. (it won’t happen to us, as we’re stuck with java 6 for another while) 1. accidentally reusing streams wanna bet, this will happen to everyone at least once. like the existing “streams” (e.g. inputstream ), you can consume streams only once. the following code won’t work: intstream stream = intstream.of(1, 2); stream.foreach(system.out::println); // that was fun! let's do it again! stream.foreach(system.out::println); you’ll get a java.lang.illegalstateexception: stream has already been operated upon or closed so be careful when consuming your stream. it can be done only once 2. accidentally creating “infinite” streams you can create infinite streams quite easily without noticing. take the following example: // will run indefinitely intstream.iterate(0, i -> i + 1) .foreach(system.out::println); the whole point of streams is the fact that they can be infinite, if you design them to be. the only problem is, that you might not have wanted that. so, be sure to always put proper limits: // that's better intstream.iterate(0, i -> i + 1) .limit(10) .foreach(system.out::println); 3. accidentally creating “subtle” infinite streams we can’t say this enough. you will eventually create an infinite stream, accidentally. take the following stream, for instance: intstream.iterate(0, i -> ( i + 1) % 2) .distinct() .limit(10) .foreach(system.out::println); so… we generate alternating 0′s and 1′s then we keep only distinct values, i.e. a single 0 and a single 1 then we limit the stream to a size of 10 then we consume it well… the distinct() operation doesn’t know that the function supplied to the iterate() method will produce only two distinct values. it might expect more than that. so it’ll forever consume new values from the stream, and the limit(10) will never be reached. tough luck, your application stalls. 4. accidentally creating “subtle” parallel infinite streams we really need to insist that you might accidentally try to consume an infinite stream. let’s assume you believe that the distinct() operation should be performed in parallel. you might be writing this: intstream.iterate(0, i -> ( i + 1) % 2) .parallel() .distinct() .limit(10) .foreach(system.out::println); now, we’ve already seen that this will turn forever. but previously, at least, you only consumed one cpu on your machine. now, you’ll probably consume four of them, potentially occupying pretty much all of your system with an accidental infinite stream consumption. that’s pretty bad. you can probably hard-reboot your server / development machine after that. have a last look at what my laptop looked like prior to exploding: if i were a laptop, this is how i’d like to go. 5. mixing up the order of operations so, why did we insist on your definitely accidentally creating infinite streams? it’s simple. because you may just accidentally do it. the above stream can be perfectly consumed if you switch the order of limit() and distinct() : intstream.iterate(0, i -> ( i + 1) % 2) .limit(10) .distinct() .foreach(system.out::println); this now yields: 0 1 why? because we first limit the infinite stream to 10 values (0 1 0 1 0 1 0 1 0 1), before we reduce the limited stream to the distinct values contained in it (0 1). of course, this may no longer be semantically correct, because you really wanted the first 10 distinct values from a set of data (you just happened to have “forgotten” that the data is infinite). no one really wants 10 random values, and only then reduce them to be distinct. if you’re coming from a sql background, you might not expect such differences. take sql server 2012, for instance. the following two sql statements are the same: -- using top selectdistincttop10 * fromi orderby.. -- using fetch select* fromi orderby.. offset 0 rows fetchnext10 rowsonly so, as a sql person, you might not be as aware of the importance of the order of streams operations. 6. mixing up the order of operations (again) speaking of sql, if you’re a mysql or postgresql person, you might be used to the limit .. offset clause. sql is full of subtle quirks, and this is one of them. the offset clause is applied first , as suggested in sql server 2012′s (i.e. the sql:2008 standard’s) syntax. if you translate mysql / postgresql’s dialect directly to streams, you’ll probably get it wrong: intstream.iterate(0, i -> i + 1) .limit(10) // limit .skip(5) // offset .foreach(system.out::println); the above yields 5 6 7 8 9 yes. it doesn’t continue after 9 , because the limit() is now applied first , producing (0 1 2 3 4 5 6 7 8 9). skip() is applied after, reducing the stream to (5 6 7 8 9). not what you may have intended. beware of the limit .. offset vs. "offset .. limit" trap! 7. walking the file system with filters we’ve blogged about this before . what appears to be a good idea is to walk the file system using filters: files.walk(paths.get(".")) .filter(p -> !p.tofile().getname().startswith(".")) .foreach(system.out::println); the above stream appears to be walking only through non-hidden directories, i.e. directories that do not start with a dot. unfortunately, you’ve again made mistake #5 and #6. walk() has already produced the whole stream of subdirectories of the current directory. lazily, though, but logically containing all sub-paths. now, the filter will correctly filter out paths whose names start with a dot “.”. e.g. .git or .idea will not be part of the resulting stream. but these paths will be: .\.git\refs , or .\.idea\libraries . not what you intended. now, don’t fix this by writing the following: files.walk(paths.get(".")) .filter(p -> !p.tostring().contains(file.separator + ".")) .foreach(system.out::println); while that will produce the correct output, it will still do so by traversing the complete directory subtree, recursing into all subdirectories of “hidden” directories. i guess you’ll have to resort to good old jdk 1.0 file.list() again. the good news is, filenamefilter and filefilter are both functional interfaces. 8. modifying the backing collection of a stream while you’re iterating a list , you must not modify that same list in the iteration body. that was true before java 8, but it might become more tricky with java 8 streams. consider the following list from 0..9: // of course, we create this list using streams: list list = intstream.range(0, 10) .boxed() .collect(tocollection(arraylist::new)); now, let’s assume that we want to remove each element while consuming it: list.stream() // remove(object), not remove(int)! .peek(list::remove) .foreach(system.out::println); interestingly enough, this will work for some of the elements! the output you might get is this one: 0 2 4 6 8 null null null null null java.util.concurrentmodificationexception if we introspect the list after catching that exception, there’s a funny finding. we’ll get: [1, 3, 5, 7, 9] heh, it “worked” for all the odd numbers. is this a bug? no, it looks like a feature. if you’re delving into the jdk code, you’ll find this comment in arraylist.arralistspliterator : /* * if arraylists were immutable, or structurally immutable (no * adds, removes, etc), we could implement their spliterators * with arrays.spliterator. instead we detect as much * interference during traversal as practical without * sacrificing much performance. we rely primarily on * modcounts. these are not guaranteed to detect concurrency * violations, and are sometimes overly conservative about * within-thread interference, but detect enough problems to * be worthwhile in practice. to carry this out, we (1) lazily * initialize fence and expectedmodcount until the latest * point that we need to commit to the state we are checking * against; thus improving precision. (this doesn't apply to * sublists, that create spliterators with current non-lazy * values). (2) we perform only a single * concurrentmodificationexception check at the end of foreach * (the most performance-sensitive method). when using foreach * (as opposed to iterators), we can normally only detect * interference after actions, not before. further * cme-triggering checks apply to all other possible * violations of assumptions for example null or too-small * elementdata array given its size(), that could only have * occurred due to interference. this allows the inner loop * of foreach to run without any further checks, and * simplifies lambda-resolution. while this does entail a * number of checks, note that in the common case of * list.stream().foreach(a), no checks or other computation * occur anywhere other than inside foreach itself. the other * less-often-used methods cannot take advantage of most of * these streamlinings. */ now, check out what happens when we tell the stream to produce sorted() results: list.stream() .sorted() .peek(list::remove) .foreach(system.out::println); this will now produce the following, “expected” output 0 1 2 3 4 5 6 7 8 9 and the list after stream consumption? it is empty: [] so, all elements are consumed, and removed correctly. the sorted() operation is a “stateful intermediate operation” , which means that subsequent operations no longer operate on the backing collection, but on an internal state. it is now “safe” to remove elements from the list! well… can we really? let’s proceed with parallel() , sorted() removal: list.stream() .sorted() .parallel() .peek(list::remove) .foreach(system.out::println); this now yields: 7 6 2 5 8 4 1 0 9 3 and the list contains [8] eek. we didn’t remove all elements!? free beers ( and jooq stickers ) go to anyone who solves this streams puzzler! this all appears quite random and subtle, we can only suggest that you never actually do modify a backing collection while consuming a stream. it just doesn’t work. 9. forgetting to actually consume the stream what do you think the following stream does? intstream.range(1, 5) .peek(system.out::println) .peek(i -> { if(i == 5) thrownewruntimeexception("bang"); }); when you read this, you might think that it will print (1 2 3 4 5) and then throw an exception. but that’s not correct. it won’t do anything. the stream just sits there, never having been consumed. as with any fluent api or dsl, you might actually forget to call the “terminal” operation. this might be particularly true when you use peek() , as peek() is an aweful lot similar to foreach() . this can happen with jooq just the same, when you forget to call execute() or fetch() : dsl.using(configuration) .update(table) .set(table.col1, 1) .set(table.col2, "abc") .where(table.id.eq(3)); oops. no execute() yes, the “best” way – with 1-2 caveats ;-) 10. parallel stream deadlock this is now a real goodie for the end! all concurrent systems can run into deadlocks, if you don’t properly synchronise things. while finding a real-world example isn’t obvious, finding a forced example is. the following parallel() stream is guaranteed to run into a deadlock: object[] locks = { newobject(), newobject() }; intstream .range(1, 5) .parallel() .peek(unchecked.intconsumer(i -> { synchronized(locks[i % locks.length]) { thread.sleep(100); synchronized(locks[(i + 1) % locks.length]) { thread.sleep(50); } } })) .foreach(system.out::println); note the use of unchecked.intconsumer() , which transforms the functional intconsumer interface into a org.jooq.lambda.fi.util.function.checkedintconsumer , which is allowed to throw checked exceptions. well. tough luck for your machine. those threads will be blocked forever :-) the good news is, it has never been easier to produce a schoolbook example of a deadlock in java! for more details, see also brian goetz’s answer to this question on stack overflow . conclusion with streams and functional thinking, we’ll run into a massive amount of new, subtle bugs. few of these bugs can be prevented, except through practice and staying focused. you have to think about how to order your operations. you have to think about whether your streams may be infinite. streams (and lambdas) are a very powerful tool. but a tool which we need to get a hang of, first.

June 16, 2014

by Lukas Eder

· 10,347 Views · 2 Likes

The Performance Price of Foreign Keys in PostgreSQL

If you want to optimize the performance of your PostgreSQL tables, you might benefit from looking in places you wouldn't ordinarily expect. For example, your foreign keys. According to Shaun Thomas in this recent post, foreign keys can have a considerable impact on performance. Ordinarily they're a good thing - Thomas acknowledges that - but there are certain design choices in PostgreSQL that can complicate them: In PostgreSQL, every foreign key is maintained with an invisible system-level trigger added to the source table in the reference. At least one trigger must go here, as operations that modify the source data must be checked that they do not violate the constraint. Each trigger, Thomas says, requires some overhead, and as the number of triggers increases, so does the cost. Thomas offers an example query to demonstrate where this kind of thing becomes an issue - not really a practical or real-world example, though - but the performance hit is substantial: . . . after merely five foreign keys, performance of our updates drops by 28.5%. By the time we have 20 foreign keys, the updates are 95% slower! This is not to say that you shouldn't use foreign keys, Thomas says, but that you should be careful with them and avoid abusing them. In other words, if you don't need a foreign key, you should probably just leave it out. Check out the Thomas' post for all the details.

June 12, 2014

by Alec Noller

· 17,657 Views

Querying XML CLOB Data Directly in Oracle

Oracle 9i introduced a new datatype - XMLType - specifically designed for handling XML naively in the database. However, we may find we have a database that stores XML directly in a CLOB/NCLOB datatype (either because of legacy data, or because we are supporting multiple database platforms). It can be useful to query this data directly using the XMLType functions available. This post just shows a very simple example of a query to do this. (Note: there may be better ways to do this - but this worked for me when I needed an ad-hoc query quickly!) Our Table Assume we have a table called USER, which has the following columns Our embedded XML (example for Ringo) Our Query Say we want to look at account details in the XML - e.g. find which users all have the same sort code. A query like the following: SELECT ID, NAME XMLTYPE(u.accountxml).EXTRACT('/person/account/sortCode/text()') as SORTCODE FROM USER u; Will produce output like this. Obviously from here you can use all the standard SQL operators to filter, join etc further to get what you need:

June 11, 2014

by Adrian Milne

· 45,581 Views · 1 Like

What's Wrong in Java 8, Part V: Tuples

In part 5 of our "What's Wrong in Java 8" series, we turn to Tuples.

June 10, 2014

by Pierre-Yves Saumont

· 132,732 Views · 4 Likes

Building a Simple RESTful API with Java Spark

Disclaimer: This post is about the Java micro web framework named Spark and not about the data processing engine Apache Spark. In this blog post we will see how Spark can be used to build a simple web service. As mentioned in the disclaimer, Spark is a micro web framework for Java inspired by the Ruby framework Sinatra. Spark aims for simplicity and provides only a minimal set of features. However, it provides everything needed to build a web application in a few lines of Java code. Getting Started Let's assume we have a simple domain class with a few properties and a service that provides some basic CRUDfunctionality: public class User { private String id; private String name; private String email; // getter/setter } public class UserService { // returns a list of all users public List getAllUsers() { .. } // returns a single user by id public User getUser(String id) { .. } // creates a new user public User createUser(String name, String email) { .. } // updates an existing user public User updateUser(String id, String name, String email) { .. } } We now want to expose the functionality of UserService as a RESTful API (For simplicity we will skip the hypermedia part of REST ;-)). For accessing, creating and updating user objects we want to use following URL patterns: GET /users Get a list of all users GET /users/ Get a specific user POST /users Create a new user PUT /users/ Update a user The returned data should be in JSON format. To get started with Spark we need the following Maven dependencies: com.sparkjava spark-core 2.0.0 org.slf4j slf4j-simple 1.7.7 Spark uses SLF4J for logging, so we need to a SLF4J binder to see log and error messages. In this example we use the slf4j-simple dependency for this purpose. However, you can also use Log4j or any other binder you like. Having slf4j-simple in the classpath is enough to see log output in the console. We will also use GSON for generating JSON output and JUnit to write a simple integration tests. You can find these dependencies in the complete pom.xml. Returning All Users Now it is time to create a class that is responsible for handling incoming requests. We start by implementing the GET /users request that should return a list of all users. import static spark.Spark.*; public class UserController { public UserController(final UserService userService) { get("/users", new Route() { @Override public Object handle(Request request, Response response) { // process request return userService.getAllUsers(); } }); // more routes } } Note the static import of spark.Spark.* in the first line. This gives us access to various static methods including get(), post(), put() and more. Within the constructor the get() method is used to register aRoute that listens for GET requests on /users. A Route is responsible for processing requests. Whenever aGET /users request is made, the handle() method will be called. Inside handle() we return an object that should be sent to the client (in this case a list of all users). Spark highly benefits from Java 8 Lambda expressions. Route is a functional interface (it contains only one method), so we can implement it using a Java 8 Lambda expression. Using a Lambda expression the Routedefinition from above looks like this: get("/users", (req, res) -> userService.getAllUsers()); To start the application we have to create a simple main() method. Inside main() we create an instance of our service and pass it to our newly created UserController: public class Main { public static void main(String[] args) { new UserController(new UserService()); } } If we now run main(), Spark will start an embedded Jetty server that listens on Port 4567. We can test our first route by initiating a GET http://localhost:4567/users request. In case the service returns a list with two user objects the response body might look like this: [com.mscharhag.sparkdemo.User@449c23fd, com.mscharhag.sparkdemo.User@437b26fe] Obviously this is not the response we want. Spark uses an interface called ResponseTransformer to convert objects returned by routes to an actual HTTP response. ReponseTransformer looks like this: public interface ResponseTransformer { String render(Object model) throws Exception; } ResponseTransformer has a single method that takes an object and returns a String representation of this object. The default implementation of ResponseTransformer simply calls toString() on the passed object (which creates output like shown above). Since we want to return JSON we have to create a ResponseTransformer that converts the passed objects to JSON. We use a small JsonUtil class with two static methods for this: public class JsonUtil { public static String toJson(Object object) { return new Gson().toJson(object); } public static ResponseTransformer json() { return JsonUtil::toJson; } } toJson() is an universal method that converts an object to JSON using GSON. The second method makes use of Java 8 method references to return a ResponseTransformer instance. ResponseTransformer is again a functional interface, so it can be satisfied by providing an appropriate method implementation (toJson()). So whenever we call json() we get a new ResponseTransformer that makes use of our toJson()method. In our UserController we can pass a ResponseTransformer as a third argument to Spark's get()method: import static com.mscharhag.sparkdemo.JsonUtil.*; public class UserController { public UserController(final UserService userService) { get("/users", (req, res) -> userService.getAllUsers(), json()); ... } } Note again the static import of JsonUtil.* in the first line. This gives us the option to create a newResponseTransformer by simply calling json(). Our response looks now like this: [{ "id": "1866d959-4a52-4409-afc8-4f09896f38b2", "name": "john", "email": "[email protected]" },{ "id": "90d965ad-5bdf-455d-9808-c38b72a5181a", "name": "anna", "email": "[email protected]" }] We still have a small problem. The response is returned with the wrong Content-Type. To fix this, we can register a Filter that sets the JSON Content-Type: after((req, res) -> { res.type("application/json"); }); Filter is again a functional interface and can therefore be implemented by a short Lambda expression. After a request is handled by our Route, the filter changes the Content-Type of every response toapplication/json. We can also use before() instead of after() to register a filter. Then, the Filterwould be called before the request is processed by the Route. The GET /users request should be working now :-) Returning a Specific User To return a specific user we simply create a new route in our UserController: get("/users/:id", (req, res) -> { String id = req.params(":id"); User user = userService.getUser(id); if (user != null) { return user; } res.status(400); return new ResponseError("No user with id '%s' found", id); }, json()); With req.params(":id") we can obtain the :id path parameter from the URL. We pass this parameter to our service to get the corresponding user object. We assume the service returns null if no user with the passed id is found. In this case, we change the HTTP status code to 400 (Bad Request) and return an error object. ResponseError is a small helper class we use to convert error messages and exceptions to JSON. It looks like this: public class ResponseError { private String message; public ResponseError(String message, String... args) { this.message = String.format(message, args); } public ResponseError(Exception e) { this.message = e.getMessage(); } public String getMessage() { return this.message; } } We are now able to query for a single user with a request like this: GET /users/5f45a4ff-35a7-47e8-b731-4339c84962be If an user with this id exists we will get a response that looks somehow like this: { "id": "5f45a4ff-35a7-47e8-b731-4339c84962be", "name": "john", "email": "[email protected]" } If we use an invalid user id, a ResponseError object will be created and converted to JSON. In this case the response looks like this: { "message": "No user with id 'foo' found" } Creating and Updating Users Creating and updating users is again very easy. Like returning the list of all users it is done using a single service call: post("/users", (req, res) -> userService.createUser( req.queryParams("name"), req.queryParams("email") ), json()); put("/users/:id", (req, res) -> userService.updateUser( req.params(":id"), req.queryParams("name"), req.queryParams("email") ), json()); To register a route for HTTP POST or PUT requests we simply use the static post() and put() methods of Spark. Inside a Route we can access HTTP POST parameters using req.queryParams(). For simplicity reasons (and to show another Spark feature) we do not do any validation inside the routes. Instead we assume that the service will throw an IllegalArgumentException if we pass in invalid values. Spark gives us the option to register ExceptionHandlers. An ExceptionHandler will be called if anException is thrown while processing a route. ExceptionHandler is another single method interface we can implement using a Java 8 Lambda expression: exception(IllegalArgumentException.class, (e, req, res) -> { res.status(400); res.body(toJson(new ResponseError(e))); }); Here we create an ExceptionHandler that is called if an IllegalArgumentException is thrown. The caught Exception object is passed as the first parameter. We set the response code to 400 and add an error message to the response body. If the service throws an IllegalArgumentException when the email parameter is empty, we might get a response like this: { "message": "Parameter 'email' cannot be empty" } The complete source the controller can be found here. Testing Because of Spark's simple nature it is very easy to write integration tests for our sample application. Let's start with this basic JUnit test setup: public class UserControllerIntegrationTest { @BeforeClass public static void beforeClass() { Main.main(null); } @AfterClass public static void afterClass() { Spark.stop(); } ... } In beforeClass() we start our application by simply running the main() method. After all tests finished we call Spark.stop(). This stops the embedded server that runs our application. After that we can send HTTP requests within test methods and validate that our application returns the correct response. A simple test that sends a request to create a new user can look like this: @Test public void aNewUserShouldBeCreated() { TestResponse res = request("POST", "/users?name=john&[email protected]"); Map json = res.json(); assertEquals(200, res.status); assertEquals("john", json.get("name")); assertEquals("[email protected]", json.get("email")); assertNotNull(json.get("id")); } request() and TestResponse are two small self made test utilities. request() sends a HTTP request to the passed URL and returns a TestResponse instance. TestResponse is just a small wrapper around some HTTP response data. The source of request() and TestResponse is included in the complete test classfound on GitHub. Conclusion Compared to other web frameworks Spark provides only a small amount of features. However, it is so simple you can build small web applications within a few minutes (even if you have not used Spark before). If you want to look into Spark you should clearly use Java 8, which reduces the amount of code you have to write a lot. You can find the complete source of the sample project on GitHub.

June 9, 2014

by Michael Scharhag

· 111,604 Views · 3 Likes

Neo4j & Cypher: UNWIND vs FOREACH

I’ve written a couple of posts about the new UNWIND clause in Neo4j’s cypher query language, but I forgot about my favorite use of UNWIND, which is to get rid of some uses of FOREACH from our queries. Let’s say we’ve created a timetree up front and now have a series of events coming in that we want to create in the database and attach to the appropriate part of the timetree. Before UNWIND existed we might try to write the following query using FOREACH: WITH [{name: "Event 1", timetree: {day: 1, month: 1, year: 2014}, {name: "Event 2", timetree: {day: 2, month: 1, year: 2014}] AS events FOREACH (event IN events | CREATE (e:Event {name: event.name}) MATCH (year:Year {year: event.timetree.year }), (year)-[:HAS_MONTH]->(month {month: event.timetree.month }), (month)-[:HAS_DAY]->(day {day: event.timetree.day }) CREATE (e)-[:HAPPENED_ON]->(day)) Unfortunately we can’t use MATCH inside a FOREACH statement so we’ll get the following error: Invalid use of MATCH inside FOREACH (line 5, column 3) " MATCH (year:Year {year: event.timetree.year }), " ^ Neo.ClientError.Statement.InvalidSyntax We can work around this by using MERGE instead in the knowledge that it’s never going to create anything because the timetree already exists: WITH [{name: "Event 1", timetree: {day: 1, month: 1, year: 2014}, {name: "Event 2", timetree: {day: 2, month: 1, year: 2014}] AS events FOREACH (event IN events | CREATE (e:Event {name: event.name}) MERGE (year:Year {year: event.timetree.year }) MERGE (year)-[:HAS_MONTH]->(month {month: event.timetree.month }) MERGE (month)-[:HAS_DAY]->(day {day: event.timetree.day }) CREATE (e)-[:HAPPENED_ON]->(day)) If we replace the FOREACH with UNWIND we’d get the following: WITH [{name: "Event 1", timetree: {day: 1, month: 1, year: 2014}, {name: "Event 2", timetree: {day: 2, month: 1, year: 2014}] AS events UNWIND events AS event CREATE (e:Event {name: event.name}) WITH e, event.timetree AS timetree MATCH (year:Year {year: timetree.year }), (year)-[:HAS_MONTH]->(month {month: timetree.month }), (month)-[:HAS_DAY]->(day {day: timetree.day }) CREATE (e)-[:HAPPENED_ON]->(day) Although the lines of code has slightly increased the query is now correct and we won’t accidentally correct new parts of our time tree. We could also pass on the event that we created to the next part of the query which wouldn’t be the case when using FOREACH.

June 9, 2014

by Mark Needham

· 12,721 Views

An Architecturally-Evident Coding Style

okay, this is the separate blog post that i referred to in software architecture vs code . what exactly do we mean by an "architecturally-evident coding style"? i built a simple content aggregator for the local tech community here in jersey called techtribes.je , which is basically made up of a web server, a couple of databases and a standalone java application that is responsible for actually aggegrating the content displayed on the website. you can read a little more about the software architecture at techtribes.je - containers . the following diagram is a zoom-in of the standalone content updater application, showing how it's been decomposed. this diagram says that the content updater application is made up of a number of core components (which are shown on a separate diagram for brevity) and an additional four components - a scheduled content updater, a twitter connector, a github connector and a news feed connector. this diagram shows a really nice, simple architecture view of how my standalone content updater application has been decomposed into a small number of components. "component" is a hugely overloaded term in the software development industry, but essentially all i'm referring to is a collection of related behaviour sitting behind a nice clean interface. back to the "architecturally-evident coding style" and the basic premise is that the code should reflect the architecture. in other words, if i look at the code, i should be able to clearly identify each of the components that i've shown on the diagram. since the code for techtribes.je is open source and on github, you can go and take a look for yourself as to whether this is the case. and it is ... there's a je.techtribes.component package that contains sub-packages for each of the components shown on the diagram. from a technical perspective, each of these are simply spring beans with a public interface and a package-protected implementation. that's it; the code reflects the architecture as illustrated on the diagram. so what about those core components then? well, here's a diagram showing those. again, this diagram shows a nice simple decomposition of the core of my techtribes.je system into coarse-grained components. and again, browsing the source code will reveal the same one-to-one mapping between boxes on the diagram and packages in the code. this requires conscious effort to do but i like the simple and explicit nature of the relationship between the architecture and the code. when architecture and code don't match the interesting part of this story is that while i'd always viewed my system as a collection of "components", the code didn't actually look like that. to take an example, there's a tweet component on the core components diagram, which basically provides crud access to tweets in a mongodb database. the diagram suggests that it's a single black box component, but my initial implementation was very different. the following diagram illustrates why. my initial implementation of the tweet component looked like the picture on the left - i'd taken a "package by layer" approach and broken my tweet component down into a separate service and data access object. this is your stereotypical layered architecture that many (most?) books and tutorials present as a way to build (e.g.) web applications. it's also pretty much how i've built most software in the past too and i'm sure you've seen the same, especially in systems that use a dependency injection framework where we create a bunch of things in layers and wire them all together. layered architectures have a number of benefits but they aren't a silver bullet . this is a great example of where the code doesn't quite reflect the architecture - the tweet component is a single box on an architecture diagram but implemented as a collection of classes across a layered architecture when you look at the code. imagine having a large, complex codebase where the architecture diagrams tell a different story from the code. the easy way to fix this is to simply redraw the core components diagram to show that it's really a layered architecture made up of services collaborating with data access objects. the result is a much more complex diagram but it also feels like that diagram is starting to show too much detail. the other option is to change the code to match my architectural vision. and that's what i did. i reorganised the code to be packaged by component rather than packaged by layer. in essence, i merged the services and data access objects together into a single package so that i was left with a public interface and a package protected implementation. here's the tweet component on github . but what about... again, there's a clean simple mapping from the diagram into the code and the code cleanly reflects the architecture. it does raise a number of interesting questions though. why aren't you using a layered architecture? where did the tweetdao interface go? how do you mock out your dao implementation to do unit testing? what happens if i want to call the dao directly? what happens if you want to change the way that you store tweets? layers are now an implementation detail this is still a layered architecture, it's just that the layers are now a component implementation detail rather than being first-class architectural building blocks. and that's nice, because i can think about my components as being my architecturally significant structural elements and it's these building blocks that are defined in my dependency injection framework. something i often see in layered architectures is code bypassing a services layer to directly access a dao or repository. these sort of shortcuts are exactly why layered architectures often become corrupted and turn into big balls of mud. in my codebase, if any consumer wants access to tweets, they are forced to use the tweet component in its entirety because the dao is an internal implementation detail. and because i have layers inside my component, i can still switch out my tweet data storage from mongodb to something else. that change is still isolated. component testing vs unit testing ah, unit testing. bundling up my tweet service and dao into a single component makes the resulting tweet component harder to unit test because everything is package protected. sure, it's not impossible to provide a mock implementation of the mongodbtweetdao but i need to jump through some hoops. the other approach is to simply not do unit testing and instead test my tweet component through its public interface. dhh recently published a blog post called test-induced design damage and i agree with the overall message; perhaps we are breaking up our systems unnecessarily just in order to unit test them. there's very little to be gained from unit testing the various sub-parts of my tweet component in isolation, so in this case i've opted to do automated component testing instead where i test the component as a black-box through its component interface. mongodb is lightweight and fast, with the resulting component tests running acceptably quick for me, even on my ageing macbook air. i'm not saying that you should never unit test code in isolation, and indeed there are some situations where component testing isn't feasible. for example, if you're using asynchronous and/or third party services, you probably do want to ability to provide a mock implementation for unit testing. the point is that we shouldn't blindly create designs where everything can be mocked out and unit tested in isolation. food for thought the purpose of this blog post was to provide some more detail around how to ensure that code reflects architecture and to illustrate an approach to do this. i like the structure imposed by forcing my codebase to reflect the architecture. it requires some discipline and thinking about how to neatly carve-up the responsibilities across the codebase, but i think the effort is rewarded. it's also a nice stepping stone towards micro-services. my techtribes.je system is constructed from a number of in-process components that i treat as my architectural building blocks. the thinking behind creating a micro-services architecture is essentially the same, albeit the components (services) are running out-of-process. this isn't a silver bullet by any means, but i hope it's provided some food for thought around designing software and structuring a codebase with an architecturally-evident coding style.

June 9, 2014

by Simon Brown

· 6,205 Views