DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest Databases Topics

article thumbnail
Testing Databases with JUnit and Hibernate Part 1: One to Rule them
There is little support for testing the database of your enterprise application. We will describe some problems and possible solutions based on Hibernate and JUnit.
September 6, 2011
by Jens Schauder
· 122,978 Views · 2 Likes
article thumbnail
Cloud Integration with Apache Camel and Amazon Web Services (AWS): S3, SQS and SNS
The integration framework Apache Camel already supports several important cloud services (see my overview article at http://www.kai-waehner.de/blog/2011/07/09/cloud-computing-heterogeneity-will-require-cloud-integration-apache-camel-is-already-prepared for more details). This article describes the combination of Apache Camel and the Amazon Web Services (AWS) interfaces of Simple Storage Service (S3), Simple Queue Service (SQS) and Simple Notification Service (SNS). Thus, The concept of Infrastructure as a Service (IaaS) is used to access messaging systems and data storage without any need for configuration. Registration to AWS and Setup of Camel First, you have to register to the Amazon Web Services (for free). Most AWS services include a free monthly quota, which is absolutely sufficient to play around and develop some simple applications. As its name states, AWS uses technology-independent web services. Besides, APIs for several different programming languages are available to ease development. By the way, Camel uses the AWS SDK for Java (http://aws.amazon.com/sdkforjava), of course. The documentation is detailed and easy to understand, including tutorials, screenshots and code examples . Hint 1: You should read the introductions to S3, SQS and SNS (go to http://aws.amazon.com and click on „products“) and play around with the AWS Management Console (http://aws.amazon.com/console) before you continue. This step is very easy and takes less than one hour. Then, you will have a much better understanding about AWS and where Camel can help you! Hint 2: It really helps to look at the source code of the camel-aws component, It helps you to understand how Camel uses the AWS Java API internally. If you want to write tests, you can do it the same way. In the past, I was afraid of looking at „complex“ source code of open source frameworks. But there is no need to be scared! The camel-aws component (and most other camel components) contain only of a few classes. Everything is easy to understand. It helps you to understand Camel internals, the AWS API, and to spot and solve errors due to exceptions in your code. In the meanwhile, the current Camel version 2.8 supports three AWS services: S3, SQS and SNS. All of them use similar concepts. Therefore, they are included in one single camel component: „camel-aws“. You have to add the libraries to your existing Camel project. As always, the simplest way is to use Maven and add the following dependency to the pom.xml: org.apache.camel camel-aws ${camel-version} Configuration of the Camel Endpoint The implementation and configuration of all three services is very similar. The URI looks like this (the code shows the SQS service): aws-sqs://queue-name[?options] There are two alternatives to configure your endpoint. Using Parameters The easy way is to use two paramters in the URI of your endpoint: „accessKey“ and „secretKey“ (you receive both after your AWS registration). “aws-sqs://unique-queue-name?accessKey=“INSERT_ME“&secretKey=INSERT_ME” Be aware of the following problem, which can result in a strange, non-speaking exception (thanks to Brendan Long): You’ll need to URL encode any +’s in your secret key (otherwise, they’ll be treated as spaces). + = %2B, so if your secretkey was “my+secret\key”, your Camel URL should have “secretKey=my%2Bsecret\key”. “Within the query string, the plus sign is reserved as shorthand notation for a space. Therefore, real plus signs must be encoded. This method was used to make query URIs easier to pass in systems which did not allow spaces.” Source: WC3 URI Recommendations Adding a configured AmazonClient to the Registry If you need to do more configuration (e.g. because your system is behind a firewall), you have to add an AmazonClient object to your registry. The following code shows an example using SQS, but SNS and S3 use exactly the same concept. @Override protected JndiRegistry createRegistry() throws Exception { JndiRegistry registry = super.createRegistry(); AWSCredentials awsCredentials = new BasicAWSCredentials(“INSERT_ME”, “INSERT_ME”); ClientConfiguration clientConfiguration = new ClientConfiguration(); clientConfiguration.setProxyHost(“http://myProxyHost”); clientConfiguration.setProxyPort(8080); AmazonSQSClient client = new AmazonSQSClient(awsCredentials, clientConfiguration); registry.bind(“amazonSQSClient”, client); return registry; } This example overwrites the createRegistry() method of a JUnit test (extending CamelTestSupport). You can also add this information to your runtime Camel application, of course. Apache Camel and the Simple Storage Service (S3) Simple Storage Service (S3) is a key-value-store. You can store small to very large data. The usage is very easy. You create buckets and put key-value data into these buckets. You can also create folders within buckets to organize your data. That’s it. You can monitor your buckets using the AWS Management Console – an intuitive GUI supporting most AWS services. The following example shows both alternatives for accessing the Amazon services (as described above): Paramenters and the AmazonClient. // Transfer data from your file inbox to the AWS S3 service from(“file:files/inbox”) // This is the key of your key-value data .setHeader(S3Constants.KEY, simple(“This is a static key”)) // Using parameters for accessing the AWS service .to(“aws-s3://camel-integration-bucket-mwea-kw?accessKey=INSERT_ME&secretKey=INSERT_ME&region=eu-west-1″); // Transfer data from the AWS S3 service to your file outbox from(“aws-s3://camel-integration-bucket-mwea-kw?amazonS3Client=#amazonS3Client&region=eu-wes”) .to(“file:files/outbox”); There are some additional parameters, for instance you can submit the desired AWS region or delete data after receiving it (see http://camel.apache.org/aws-s3.html and the corresponding SQS and SNS sites for more details about parameters and message headers). As you see in the code, you can use the AWS-S3 endpoint for producing and for consuming messages. Each bucket must be unique, thus you have to add some specific information such as your company to its name. Hint: If a bucket does not exist, Camel is creating it automatically (as the AWS API does). This concept is also used for SQS queues and SNS topics. Apache Camel and the Simple Queue Service (SQS) The Simple Queue Service (SQS) is similar to a JMS provider such as WebSphere MQ or ActiveMQ (but with some differences). You create queues and send messages to them. Consumers receive the messages. Contrary to most other AWS services, you cannot monitor queues by using the AWS management console directly. You have to use the service „Cloudwatch“ (http://aws.amazon.com/cloudwatch) and start an EC2 instance to monitor queues and its content. As you can see in the following code example, the syntax and concepts are almost the same as for the S3 service: from(“file:inbox”) .to(“aws-sqs://camel-integration-queue-mwea-kw?accessKey=INSERT_ME&secretKey=INSERT_ME”); from(“aws-sqs://camel-integration-queue-mwea-kw?amazonSQSClient=#amazonSQSClient”) .to(“file:outbox?fileName=sqs-${date:now:yyyy.MM.dd-hh:mm:ss:SS}”); Again, you can use the AWS-SQS endpoint for producing and for consuming messages. Each queue name must be unique. There exist two important differences to JMS (copy & paste from the AWS documentation): Q: How many times will I receive each message? Amazon SQS is engineered to provide “at least once” delivery of all messages in its queues. Although most of the time each message will be delivered to your application exactly once, you should design your system so that processing a message more than once does not create any errors or inconsistencies. Q: Why are there separate ReceiveMessage and DeleteMessage operations? When Amazon SQS returns a message to you, that message stays in the queue, whether or not you actually received the message. You are responsible for deleting the message; the delete request acknowledges that you’re done processing the message. If you don’t delete the message, Amazon SQS will deliver it again on another receive request. Apache Camel and the Simple Notification Service (SNS) The Simple Notification Service (SNS) acts like JMS topics. You create a topic, consumers subscribe to the topic and then receive notifications. Several transport protocols are supported: HTTP(S), Email and SQS. Further interfaces will be added in the future, e.g. the Short Message Service (SMS) for mobile phones. Contrary to S3 and SQS, Camel only offers a producer endpoint for this AWS service. You can only create topics and send messages via Camel. The reason is simple: Camel already offers endpoints for consuming these messages: HTTP, Email and SQS are already available. There is one tradeoff: A consumer cannot subscribe to topics using Camel – at the moment. The AWS Management Console has to be used. A very interesting discussion can be read on the Camel JIRA issue regarding the following questions: Should Camel be able to subscribe to topics? Should the producer contain this feature or should there be a consumer? In my opinion, there should be a consumer which is able to subscribe to topics, otherwise Camel is missing a key part of the AWS SNS service! Please read the discussion and contribute your opinion: https://issues.apache.org/jira/browse/CAMEL-3476. Apache Camel is already ready for the Cloud Computing Era AWS offers many more services for the cloud. Probably, it does not make sense to integrate everyone into Camel, but more AWS services will be supported in the future. For instance, SimpleDB and the Relational Database Service (RDS) are already planned and make sende, too: http://camel.apache.org/aws.html. The conclusion is easy: Apache Camel is already ready for the cloud computing era. Several important cloud services are already supported. Cloud integration will become very important in the future. Thus, Camel is on a very good way. Hopefully, we will see more cloud components, soon. I will continue to write articles about other Camel cloud components (and new AWS addons, ouf course). For instance, a component for the Platform as a Service (PaaS) product Google App Engine (GAE) is already available. If you have any additional important information, questions or other feedback, please write a comment. Thank you in advance… Best regards, Kai Wähner (Twitter: @KaiWaehner) [Content from my Blog: Cloud Integration with Apache Camel and Amazon Web Services (AWS): S3, SQS and SNS]
August 30, 2011
by Kai Wähner DZone Core CORE
· 26,164 Views
article thumbnail
Concurrency Pattern: Producer and Consumer
In my career spanning 15 years, the problem of Producer and Consumer is one that I have come across only a few times. In most programming cases, what we are doing is performing functions in a synchronous fashion where the JVM or the web container handles the complexities of multi-threading on its own. However, when writing certain kinds of use cases where we need this. Last week, I came acros one such use case that sent me 3 years back when I last did it. However, the way it was done last time was very different. When I first heard the problem statement, I knew instantly what was needed. However, my approach to doing it this time was going to be different from last time. It had simply to do with how I am viewing technology in my life today. I will not go into any non-technical side and will jump straight into the problem and its solution. I started to look at what existed in the market and did come across a couple of posts that helped me in channelizing my thoughts in the right way. Problem Statement We need a solution for a batch migration. We are migrating data form System 1 to System 2 and in the process we need to do three tasks: Load data from Database based on groups Process the data Update the records loaded in step#1 with modifications We have to handle 100s of groups and each group will have around 40K records. You can imagine the amount of time it would take if we were to perform this exercise in a synchronous fashion. Image here explains this problem in an effective way. Producer Consumer: The Problem Producer and Consumer Pattern Let us take a look at the Producer Consumer pattern to begin with. If you refer to the problem statement above and look at the image, we see that there are so many entities who are ready with their part of data. However, there are not enough workers who can process all the data. Hence, as the producers continue to line-up in a queue it just continues to grow. We see that the systems start to hog up threads and take a lot of time. Intermediate Solution Producer Consumer: The Intermediate approch We do have an intermediate solution. Refer to the image and you will immediately notice that the producers are piling up their work in a filing cabinet and the worker continues to pick it up as they get done with the previous task. However, this approach does have some glaring shortcomings: There is still one worker who has to do all the work. The external systems may be happy, but the task will still continue to exist until the worker has completed all of the tasks The producers will pile up their data in a queue and it needs resources to hold the same. Just as in this example the cabinet can fill up, the same can happen with the JVM resources too. We need to be careful how much data we are going to place in memory and in some cases it may not be much. The Solution Producer Consumer: The Solution The solution is what we see everyday in many places – like the cinema hall queue, Petrol Pumps etc. There are so many people who come in to book a ticket and based on how many people come in, the more people are added to issue tickets. Essentially, refer to image here and you will notice that Producers will keep adding their jobs to the cabinet and we have more workers to handle the work load. Java provided concurrency package to solve this issue. Till now, I have always worked on threading at a much lower level and this was first time I was going to work with this package. As I started to explore the web and read fellow bloggers with what they have to say, I came across one very good article. It helped in understanding the use of BlockingQueue in a very effective manner. However, the solutions provided by Dhruba would not have helped me in achieving the high throughput which is needed. So, I started to explore the use of ArrayBlockingQueue for the same. The Controller This is the first class where the contract between the producers and consumers are managed. The controller will setup 1 thread for the Producer and 2 threads for the consumer. Based on the needs we can create as many threads as we need; and even can even read the data from a properties or do some dynamic magic. For now, we will keep this simple. package com.kapil.techieforever.producerconsumer; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; import java.util.concurrent.Future; public class TestProducerConsumer { public static void main(String args[]) { try { Broker broker = new Broker(); ExecutorService threadPool = Executors.newFixedThreadPool(3); threadPool.execute(new Consumer("1", broker)); threadPool.execute(new Consumer("2", broker)); Future producerStatus = threadPool.submit(new Producer(broker)); // this will wait for the producer to finish its execution. producerStatus.get(); threadPool.shutdown(); } catch (Exception e) { e.printStackTrace(); } } } I am using ExecuteService to create a thread pool and manage it. Instead of using the basic Thread implementation, this is a more effective way as it will handle the exiting and restarting the threads as needed. You will also notice that I am using Future class to get the status of the producer thread. This class is very effective and will halt my program from further execution. This is a nice way of replacing the “.join” method on the threads. Note: I am not using Future very effectively in this example; so you may have to try a few things as you feel fit. Also, you should note the Broker class which is being used as filing cabinet between the producers and consumers. We will see its implementation in just a little while. The Producer This class is responsible for producing the data that needs to be worked upon. package com.kapil.techieforever.producerconsumer; public class Producer implements Runnable { private Broker broker; public Producer(Broker broker) { this.broker = broker; } @Override public void run() { try { for (Integer i = 1; i < 5 + 1; ++i) { System.out.println("Producer produced: " + i); Thread.sleep(100); broker.put(i); } this.broker.continueProducing = Boolean.FALSE; System.out.println("Producer finished its job; terminating."); } catch (InterruptedException ex) { ex.printStackTrace(); } } } This class is doing the most simplest of things that it can do – adding an integer to the broker. Some key areas to note are: 1. There is a property on Broker which is updated in the end by the producer when its done producing. This is also known as the “final” or “poison” entry. This is used by the consumers to know that there are no more data coming up 2. I have used Thread.sleep to simulate that some producers may take more time to produce the data. You can tweak this value and see the consumers act The Consumer This class is responsible for reading the data from the broker and doing its job package com.kapil.techieforever.producerconsumer; public class Consumer implements Runnable { private String name; private Broker broker; public Consumer(String name, Broker broker) { this.name = name; this.broker = broker; } @Override public void run() { try { Integer data = broker.get(); while (broker.continueProducing || data != null) { Thread.sleep(1000); System.out.println("Consumer " + this.name + " processed data from broker: " + data); data = broker.get(); } System.out.println("Comsumer " + this.name + " finished its job; terminating."); } catch (InterruptedException ex) { ex.printStackTrace(); } } } This is again a simple class that reads the Integer and prints it on the console. However, key points to note are: 1. The loop to process data is an endless loop, that runs on two conditions – until the producer is consuming and there is some data with the broker 2. Again, the Thread.sleep is used to create effective and different scenarios The Broker package com.kapil.techieforever.producerconsumer; import java.util.concurrent.ArrayBlockingQueue; import java.util.concurrent.TimeUnit; public class Broker { public ArrayBlockingQueue queue = new ArrayBlockingQueue(100); public Boolean continueProducing = Boolean.TRUE; public void put(Integer data) throws InterruptedException { this.queue.put(data); } public Integer get() throws InterruptedException { return this.queue.poll(1, TimeUnit.SECONDS); } } The very first thing to note is that we are using ArrayBlockingQueue as the data holder. I am not going to say what this does, but insist you to read it on the JavaDocs here. however, I will explain that the producers are going to place the data in the queue and the consumers will fetch from the queue in FIFO format. But, if the producers are slow, the consumers will wait for data to come in and if the array is full, the producers will wait for it to fill up. Also, note that I am using the ‘poll’ function instead of get in the queue. This is to ensure that the consumers will not keep waiting for ever and the waiting will time out after a few seconds. This helps us in inter-communication and kill the consumers when all the data is processed. (Note: try replacing poll with get and you will see some interesting outputs). Code I have the code sitting on Google project hosting. Feel free to go across and download it from there. It is essentially an eclipse (Spring STS) project. You may also get additional packages and classes when you download it based on when you are downloading it. Feel free to look into those too and share your comments - You can browse the source code on the SVN browser or; - You can download it from the project itself From http://scratchpad101.com/2011/08/22/concurrency-pattern-producer-consumer/
August 29, 2011
by Kapil Viren Ahuja
· 68,600 Views · 2 Likes
article thumbnail
Java NIO vs. IO
when studying both the java nio and io api's, a question quickly pops into mind: when should i use io and when should i use nio? in this text i will try to shed some light on the differences between java nio and io, their use cases, and how they affect the design of your code. main differences of java nio and io the table below summarizes the main differences between java nio and io. i will get into more detail about each difference in the sections following the table. io nio stream oriented buffer oriented blocking io non blocking io selectors stream oriented vs. buffer oriented the first big difference between java nio and io is that io is stream oriented, where nio is buffer oriented. so, what does that mean? java io being stream oriented means that you read one or more bytes at a time, from a stream. what you do with the read bytes is up to you. they are not cached anywhere. furthermore, you cannot move forth and back in the data in a stream. if you need to move forth and back in the data read from a stream, you will need to cache it in a buffer first. java nio's buffer oriented approach is slightly different. data is read into a buffer from which it is later processed. you can move forth and back in the buffer as you need to. this gives you a bit more flexibility during processing. however, you also need to check if the buffer contains all the data you need in order to fully process it. and, you need to make sure that when reading more data into the buffer, you do not overwrite data in the buffer you have not yet processed. blocking vs. non-blocking io java io's various streams are blocking. that means, that when a thread invokes a read() or write(), that thread is blocked until there is some data to read, or the data is fully written. the thread can do nothing else in the meantime. java nio's non-blocking mode enables a thread to request reading data from a channel, and only get what is currently available, or nothing at all, if no data is currently available. rather than remain blocked until data becomes available for reading, the thread can go on with something else. the same is true for non-blocking writing. a thread can request that some data be written to a channel, but not wait for it to be fully written. the thread can then go on and do something else in the mean time. what threads spend their idle time on when not blocked in io calls, is usually performing io on other channels in the meantime. that is, a single thread can now manage multiple channels of input and output. selectors java nio's selectors allow a single thread to monitor multiple channels of input. you can register multiple channels with a selector, then use a single thread to "select" the channels that have input available for processing, or select the channels that are ready for writing. this selector mechanism makes it easy for a single thread to manage multiple channels. how nio and io influences application design whether you choose nio or io as your io toolkit may impact the following aspects of your application design: the api calls to the nio or io classes. the processing of data. the number of thread used to process the data. the api calls of course the api calls when using nio look different than when using io. this is no surprise. rather than just read the data byte for byte from e.g. an inputstream, the data must first be read into a buffer, and then be processed from there. the processing of data the processing of the data is also affected when using a pure nio design, vs. an io design. in an io design you read the data byte for byte from an inputstream or a reader. imagine you were processing a stream of line based textual data. for instance: name: anna age: 25 email: [email protected] phone: 1234567890 this stream of text lines could be processed like this: inputstream input = ... ; // get the inputstream from the client socket bufferedreader reader = new bufferedreader(new inputstreamreader(input)); string nameline = reader.readline(); string ageline = reader.readline(); string emailline = reader.readline(); string phoneline = reader.readline(); notice how the processing state is determined by how far the program has executed. in other words, once the first reader.readline() method returns, you know for sure that a full line of text has been read. the readline() blocks until a full line is read, that's why. you also know that this line contains the name. similarly, when the second readline() call returns, you know that this line contains the age etc. as you can see, the program progresses only when there is new data to read, and for each step you know what that data is. once the executing thread have progressed past reading a certain piece of data in the code, the thread is not going backwards in the data (mostly not). this principle is also illustrated in this diagram: java io: reading data from a blocking stream. a nio implementation would look different. here is a simplified example: bytebuffer buffer = bytebuffer.allocate(48); int bytesread = inchannel.read(buffer); notice the second line which reads bytes from the channel into the bytebuffer. when that method call returns you don't know if all the data you need is inside the buffer. all you know is that the buffer contains some bytes. this makes processing somewhat harder. imagine if, after the first read(buffer) call, that all what was read into the buffer was half a line. for instance, "name: an". can you process that data? not really. you need to wait until at leas a full line of data has been into the buffer, before it makes sense to process any of the data at all. so how do you know if the buffer contains enough data for it to make sense to be processed? well, you don't. the only way to find out, is to look at the data in the buffer. the result is, that you may have to inspect the data in the buffer several times before you know if all the data is inthere. this is both inefficient, and can become messy in terms of program design. for instance: bytebuffer buffer = bytebuffer.allocate(48); int bytesread = inchannel.read(buffer); while(! bufferfull(bytesread) ) { bytesread = inchannel.read(buffer); } the bufferfull() method has to keep track of how much data is read into the buffer, and return either true or false, depending on whether the buffer is full. in other words, if the buffer is ready for processing, it is considered full. the bufferfull() method scans through the buffer, but must leave the buffer in the same state as before the bufferfull() method was called. if not, the next data read into the buffer might not be read in at the correct location. this is not impossible, but it is yet another issue to watch out for. if the buffer is full, it can be processed. if it is not full, you might be able to partially process whatever data is there, if that makes sense in your particular case. in many cases it doesn't. the is-data-in-buffer-ready loop is illustrated in this diagram: java nio: reading data from a channel until all needed data is in buffer. summary nio allows you to manage multiple channels (network connections or files) using only a single (or few) threads, but the cost is that parsing the data might be somewhat more complicated than when reading data from a blocking stream. if you need to manage thousands of open connections simultanously, which each only send a little data, for instance a chat server, implementing the server in nio is probably an advantage. similarly, if you need to keep a lot of open connections to other computers, e.g. in a p2p network, using a single thread to manage all of your outbound connections might be an advantage. this one thread, multiple connections design is illustrated in this diagram: java nio: a single thread managing multiple connections. if you have fewer connections with very high bandwidth, sending a lot of data at a time, perhaps a classic io server implementation might be the best fit. this diagram illustrates a classic io server design: java io: a classic io server design - one connection handled by one thread. from http://tutorials.jenkov.com/java-nio/nio-vs-io.html
August 28, 2011
by Jakob Jenkov
· 133,910 Views · 19 Likes
article thumbnail
Dissecting the Disruptor: Why it's so fast (part one) - Locks Are Bad
martin fowler has written a really good article describing not only the disruptor , but also how it fits into the architecture at lmax . this gives some of the context that has been missing so far, but the most frequently asked question is still "what is the disruptor?". i'm working up to answering that. i'm currently on question number two: "why is it so fast?". these questions do go hand in hand, however, because i can't talk about why it's fast without saying what it does, and i can't talk about what it is without saying why it is that way. so i'm trapped in a circular dependency. a circular dependency of blogging. to break the dependency, i'm going to answer question one with the simplest answer, and with any luck i'll come back to it in a later post if it still needs explanation: the disruptor is a way to pass information between threads. as a developer, already my alarm bells are going off because the word "thread" was just mentioned, which means this is about concurrency, and concurrency is hard. concurrency 101 imagine two threads are trying to change the same value. case one: thread 1 gets there first: the value changes to "blah" then the value changes to "blahy" when thread 2 gets there. case two: thread 2 gets there first: the value changes to "fluffy" then the value changes to "blah" when thread 1 gets there. case three: thread 1 interrupts thread 2: thread 2 gets the value "fluff" and stores it as myvalue thread 1 goes in and updates value to "blah" then thread 2 wakes up and sets the value to "fluffy". case three is probably the only one which is definitely wrong, unless you think the naive approach to wiki editing is ok ( google code wiki, i'm looking at you...). in the other two cases it's all about intentions and predictability. thread 2 might not care what's in value, the intention might be to append "y" to whatever is in there regardless. in this circumstance, cases one and two are both correct. but if thread 2 only wanted to change "fluff" to "fluffy", then both cases two and three are incorrect. assuming that thread 2 wants to set the value to "fluffy", there are some different approaches to solving the problem. approach one: pessimistic locking (does the "no entry" sign make sense to people who don't drive in britain?) the terms pessimistic and optimistic locking seem to be more commonly used when talking about database reads and writes, but the principal applies to getting a lock on an object. thread 2 grabs a lock on entry as soon as it knows it needs it and stops anything from setting it. then it does its thing, sets the value, and lets everything else carry on. you can imagine this gets quite expensive, with threads hanging around all over the place trying to get hold of objects and being blocked. the more threads you have, the more chance that things are going to grind to a halt. approach two: optimistic locking in this case thread 2 will only lock entry when it needs to write to it. in order to make this work, it needs to check if entry has changed since it first looked at it. if thread 1 came in and changed the value to "blah" after thread 2 had read the value, thread 2 couldn't write "fluffy" to the entry and trample all over the change from thread 1. thread 2 could either re-try (go back, read the value, and append "y" onto the end of the new value), which you would do if thread 2 didn't care what the value it was changing was; or it could throw an exception or return some sort of failed update flag if it was expecting to change "fluff" to "fluffy". an example of this latter case might be if you have two users trying to update a wiki page, and you tell the user on the other end of thread 2 they'll need to load the new changes from thread 1 and then reapply their changes. potential problem: deadlock locking can lead to all sorts of issues, for example deadlock. imagine two threads that need access to two resources to do whatever they need to do: if you've used an over-zealous locking technique, both threads are going to sit there forever waiting for the other one to release its lock on the resource. that's when you reboot windows your computer. definite problem: locks are sloooow... the thing about locks is that they need the operating system to arbitrate the argument. the threads are like siblings squabbling over a toy, and the os kernel is the parent that decides which one gets it. it's like when you run to your dad to tell him your sister has nicked the transformer when you wanted to play with it - he's got bigger things to worry about than you two fighting, and he might finish off loading the dishwasher and putting on the laundry before settling the argument. if you draw attention to yourself with a lock, not only does it take time to get the operating system to arbitrate, the os might decide the cpu has better things to do than servicing your thread. the disruptor paper talks about an experiment we did. the test calls a function incrementing a 64-bit counter in a loop 500 million times. for a single thread with no locking, the test takes 300ms. if you add a lock (and this is for a single thread, no contention, and no additional complexity other than the lock) the test takes 10,000ms. that's, like, two orders of magnitude slower. even more astounding, if you add a second thread (which logic suggests should take maybe half the time of the single thread with a lock) it takes 224,000ms. incrementing a counter 500 million times takes nearly a thousand times longer when you split it over two threads instead of running it on one with no lock. concurrency is hard and locks are bad i'm just touching the surface of the problem, and obviously i'm using very simple examples. but the point is, if your code is meant to work in a multi-threaded environment, your job as a developer just got a lot more difficult: naive code can have unintended consequences. case three above is an example of how things can go horribly wrong if you don't realise you have multiple threads accessing and writing to the same data. selfish code is going to slow your system down. using locks to protect your code from the problem in case three can lead to things like deadlock or simply poor performance. this is why many organisations have some sort of concurrency problems in their interview process (certainly for java interviews). unfortunately it's very easy to learn how to answer the questions without really understanding the problem, or possible solutions to it. how does the disruptor address these issues? for a start, it doesn't use locks. at all. instead, where we need to make sure that operations are thread-safe (specifically, updating the next available sequence number in the case of multiple producers ), we use a cas (compare and swap/set) operation. this is a cpu-level instruction, and in my mind it works a bit like optimistic locking - the cpu goes to update a value, but if the value it's changing it from is not the one it expects, the operation fails because clearly something else got in there first. note this could be two different cores rather than two separate cpus. cas operations are much cheaper than locks because they don't involve the operating system, they go straight to the cpu. but they're not cost-free - in the experiment i mentioned above, where a lock-free thread takes 300ms and a thread with a lock takes 10,000ms, a single thread using cas takes 5,700ms. so it takes less time than using a lock, but more time than a single thread that doesn't worry about contention at all. back to the disruptor - i talked about the claimstrategy when i went over the producers . in the code you'll see two strategies, a singlethreadedstrategy and a multithreadedstrategy. you could argue, why not just use the multi-threaded one with only a single producer? surely it can handle that case? and it can. but the multi-threaded one uses an atomiclong (java's way of providing cas operations), and the single-threaded one uses a simple long with no locks and no cas. this means the single-threaded claim strategy is as fast as possible, given that it knows there is only one producer and therefore no contention on the sequence number. i know what you're thinking: turning one single number into an atomiclong can't possibly have been the only thing that is the secret to the disruptor's speed. and of course, it's not - otherwise this wouldn't be called "why it's so fast (part one )". but this is an important point - there's only one place in the code where multiple threads might be trying to update the same value. only one place in the whole of this complicated data-structure-slash-framework. and that's the secret. remember everything has its own sequence number? if you only have one producer then every sequence number in the system is only ever written to by one thread. that means there is no contention. no need for locks. no need even for cas. the only sequence number that is ever written to by more than one thread is the one on the claimstrategy if there is more than one producer. this is also why each variable in the entry can only be written to by one consumer . it ensures there's no write contention, therefore no need for locks or cas. back to why queues aren't up to the job so you start to see why queues, which may implemented as a ring buffer under the covers, still can't match the performance of the disruptor. the queue, and the basic ring buffer , only has two pointers - one to the front of the queue and one to the end: if more than one producer wants to place something on the queue, the tail pointer will be a point of contention as more than one thing wants to write to it. if there's more than one consumer, then the head pointer is contended, because this is not just a read operation but a write, as the pointer is updated when the item is consumed from the queue. but wait, i hear you cry foul! because we already knew this, so queues are usually single producer and single consumer (or at least they are in all the queue comparisons in our performance tests). there's another thing to bear in mind with queues/buffers. the whole point is to provide a place for things to hang out between producers and consumers, to help buffer bursts of messages from one to the other. this means the buffer is usually full (the producer is out-pacing the consumer) or empty (the consumer is out-pacing the producer). it's rare that the producer and consumer will be so evenly-matched that the buffer has items in it but the producers and consumers are keeping pace with each other. so this is how things really look. an empty queue: ...and a full queue: the queue needs a size so that it can tell the difference between empty and full. or, if it doesn't, it might determine that based on the contents of that entry, in which case reading an entry will require a write to erase it or mark it as consumed. whichever implementation is chosen, there's quite a bit of contention around the tail, head and size variables, or the entry itself if a consume operation also includes a write to remove it. on top of this, these three variables are often in the same cache line , leading to false sharing . so, not only do you have to worry about the producer and the consumer both causing a write to the size variable (or the entry), updating the tail pointer could lead to a cache-miss when the head pointer is updated because they're sat in the same place. i'm going to duck out of going into that in detail because this post is quite long enough as it is. so this is what we mean when we talk about "teasing apart the concerns" or a queue's "conflated concerns". by giving everything its own sequence number and by allowing only one consumer to write to each variable in the entry, the only case the disruptor needs to manage contention is where more than one producer is writing to the ring buffer. in summary the disruptor a number of advantages over traditional approaches: no contention = no locks = it's very fast. having everything track its own sequence number allows multiple producers and multiple consumers to use the same data structure. tracking sequence numbers at each individual place (ring buffer, claim strategy, producers and consumers), plus the magic cache line padding , means no false sharing and no unexpected contention. from http://mechanitis.blogspot.com/2011/07/dissecting-disruptor-why-its-so-fast.html
July 23, 2011
by Trisha Gee
· 12,688 Views · 1 Like
article thumbnail
Testing Entity Validations with a Mock Entity - Roo in Action Corner
In Spring Roo in Action, Chapter 3, I discuss how Roo automatically executes the Bean Validators when persisting a live entity. However, when running unit tests, we don't have a live entity at all, nor do we have a Spring container - so how can we exercise the validation without actually hitting our Roo application and the database? The following post is ancillary material from the upcoming book Spring Roo in Action, by Ken Rimple and Srini Penchikala, with Gordon Dickens. You can purchase the MEAP edition of the book, and participate in the author forum, at www.manning.com/rimple. The answer is that we have to bootstrap the validation framework within the test ourselves. We can use the CourseDataOnDemand class's getNewTransientEntityName method to generate a valid, transient JPA entity. Then, we can: Mock static entity methods, such as findById, to bring back pre-fabricated class instances of your entity Initialize the validation engine, bootstrapping a JSR-303 bean validation framework engine, and perform validation on your entity Set any appropriate properties to apply to a particular test condition Initialize a test instance of the entity validator and assert the appropriate validation results are returned The concept in action... Given a Student entity with the following definition: @RooEntity @RooJavaBean @RooToString public class Student { @NotNull private String emergencyContactInfo; ... } The listing below shows a unit test method that ensures the NotNull validation fires against missing emergency contact information on the Student entity: @Test public void testStudentMissingEmergencyContactValidation() { // setup our test data StudentDataOnDemand dod = new StudentDataOnDemand(); // tell the mock to expect this call Student.findStudent(1L); // tell the mocking API to expect a return from the prior call in the form of // a new student from the test data generator, dod AnnotationDrivenStaticEntityMockingControl.expectReturn( dod.getNewTransientStudent(0)); // put our mock in playback mode AnnotationDrivenStaticEntityMockingControl.playback(); // Setup the validator API in our unit test LocalValidatorFactoryBean validator = new LocalValidatorFactoryBean(); validator.afterPropertiesSet(); // execute the call from the mock, set the emergency contact field // to an invalid value Student student = Student.findStudent(1L); student.setEmergencyContactInfo(null); // execute validation, check for violations Set> violations = validator.validate(student, Default.class); // do we have one? Assert.assertEquals(1, violations.size()); // now, check the constraint violations to check for our specific error ConstraintViolation violation = violations.iterator().next(); // contains the right message? Assert.assertEquals("{javax.validation.constraints.NotNull.message}", violation.getMessageTemplate()); // from the right field? Assert.assertEquals("emergencyContactInfo", violation.getPropertyPath().toString()); } Analysis The test starts with a declaration of a StudentOnDemand object, which we'll use to generate our test data. We'll get into the more advanced uses of the DataOnDemand Framework later in the chapter. For now, keep in mind that we can use this class to create an instance of an Entity, with randomly assigned, valid data. We then require that the test calls the Student.findStudent method, passing it a key of 1L. Next, we'll tell the entity mocking framework that the call should return a new transient Student instance. At this point, we've defined our static mocking behavior, so we'll put the mocking framework into playback mode. Next, we issue the actual Student.findById(1L) call, this time storing the result as the member variable student. This call will trip the mock, which will return a new transient instance. We then set the emergencyContactInfo field to null, so that it becomes invalid, as it is annotated with a @NotNull annotation. Now we are ready to set up our bean validation framework. We create a LocalValidatorFactoryBean instance, which will boot the Bean Validation Framework in the afterPropertiesSet() method, which is defined for any Spring Bean implementing InitializingBean. We must call this method ourselves, because Spring is not involved in our unit test. Now we're ready to run our validation and assert the proper behavior has occurred. We call our validator's validate method, passing it the student instance and the standard Default validation group, which will trigger validation. We'll then check that we only have one validation failure, and that the message template for the error is the same as the one for the @NotNull validation. We also check to ensure that the field that caused the validation was our emergencyContactInfo field. In our answer callback, we can launch the Bean Validation Framework, and execute the validate method against our entity. In this way, we can exercise our bean instance any way we want, and instead of persisting the entity, can perform the validation phase and exit gracefully. Caveats... There are a few things slightly wrong here. First of all, the Data on Demand classes actually use Spring to inject relationships to each other, which I've logged a bug against as ROO-2497. You can override the setup of the data on demand class and manually create the DoD of the referring one, which is fine. They have slated to work on this bug for Roo 1.2, so it should be fixed sometime in the next few months. Also, realize that this is NOT easy to do, compared to writing an integration test. However, this test runs markedly faster. If you have some sophisticated logic that you've attached to a @AssertTrue annotation, this is the way to test it in isolation. From http://www.rimple.com/tech/2011/7/17/testing-entity-validations-with-a-mock-entity-roo-in-action.html
July 20, 2011
by Ken Rimple
· 14,610 Views
article thumbnail
Lucene's near-real-time search is fast!
Lucene's near-real-time (NRT) search feature, available since 2.9, enables an application to make index changes visible to a new searcher with fast turnaround time. In some cases, such as modern social/news sites (e.g., LinkedIn, Twitter, Facebook, Stack Overflow, Hacker News, DZone, etc.), fast turnaround time is a hard requirement. Fortunately, it's trivial to use. Just open your initial NRT reader, like this: // w is your IndexWriter IndexReader r = IndexReader.open(w, true); (That's the 3.1+ API; prior to that use w.getReader() instead). The returned reader behaves just like one opened with IndexReader.open: it exposes the point-in-time snapshot of the index as of when it was opened. Wrap it in an IndexSearcher and search away! Once you've made changes to the index, call r.reopen() and you'll get another NRT reader; just be sure to close the old one. What's special about the NRT reader is that it searches uncommitted changes from IndexWriter, enabling your application to decouple fast turnaround time from index durability on crash (i.e., how often commit is called), something not previously possible. Under the hood, when an NRT reader is opened, Lucene flushes indexed documents as a new segment, applies any buffered deletions to in-memory bit-sets, and then opens a new reader showing the changes. The reopen time is in proportion to how many changes you made since last reopening that reader. Lucene's approach is a nice compromise between immediate consistency, where changes are visible after each index change, and eventual consistency, where changes are visible "later" but you don't usually know exactly when. With NRT, your application has controlled consistency: you decide exactly when changes must become visible. Recently there have been some good improvements related to NRT: New default merge policy, TieredMergePolicy, which is able to select more efficient non-contiguous merges, and favors segments with more deletions. NRTCachingDirectory takes load off the IO system by caching small segments in RAM (LUCENE-3092). When you open an NRT reader you can now optionally specify that deletions do not need to be applied, making reopen faster for those cases that can tolerate temporarily seeing deleted documents returned, or have some other means of filtering them out (LUCENE-2900). Segments that are 100% deleted are now dropped instead of inefficiently merged (LUCENE-2010). How fast is NRT search? I created a simple performance test to answer this. I first built a starting index by indexing all of Wikipedia's content (25 GB plain text), broken into 1 KB sized documents. Using this index, the test then reindexes all the documents again, this time at a fixed rate of 1 MB/second plain text. This is a very fast rate compared to the typical NRT application; for example, it's almost twice as fast as Twitter's recent peak during this year's superbowl (4,064 tweets/second), assuming every tweet is 140 bytes, and assuming Twitter indexed all tweets on a single shard. The test uses updateDocument, replacing documents by randomly selected ID, so that Lucene is forced to apply deletes across all segments. In addition, 8 search threads run a fixed TermQuery at the same time. Finally, the NRT reader is reopened once per second. I ran the test on modern hardware, a 24 core machine (dual x5680 Xeon CPUs) with an OCZ Vertex 3 240 GB SSD, using Oracle's 64 bit Java 1.6.0_21 and Linux Fedora 13. I gave Java a 2 GB max heap, and used MMapDirectory. The test ran for 6 hours 25 minutes, since that's how long it takes to re-index all of Wikipedia at a limited rate of 1 MB/sec; here's the resulting QPS and NRT reopen delay (milliseconds) over that time: The search QPS is green and the time to reopen each reader (NRT reopen delay in milliseconds) is blue; the graph is an interactive Dygraph, so if you click through above, you can then zoom in to any interesting region by clicking and dragging. You can also apply smoothing by entering the size of the window into the text box in the bottom left part of the graph. Search QPS dropped substantially with time. While annoying, this is expected, because of how deletions work in Lucene: documents are merely marked as deleted and thus are still visited but then filtered out, during searching. They are only truly deleted when the segments are merged. TermQuery is a worst-case query; harder queries, such as BooleanQuery, should see less slowdown from deleted, but not reclaimed, documents. Since the starting index had no deletions, and then picked up deletions over time, the QPS dropped. It looks like TieredMergePolicy should perhaps be even more aggressive in targeting segments with deletions; however, finally around 5:40 a very large merge (reclaiming many deletions) was kicked off. Once it finished the QPS recovered somewhat. Note that a real NRT application with deletions would see a more stable QPS since the index in "steady state" would always have some number of deletions in it; starting from a fresh index with no deletions is not typical. Reopen delay during merging The reopen delay is mostly around 55-60 milliseconds (mean is 57.0), which is very fast (i.e., only 5.7% "duty cycle" of the every 1.0 second reopen rate). There are random single spikes, which is caused by Java running a full GC cycle. However, large merges can slow down the reopen delay (once around 1:14, again at 3:34, and then the very large merge starting at 5:40). Many small merges (up to a few 100s of MB) were done but don't seem to impact reopen delay. Large merges have been a challenge in Lucene for some time, also causing trouble for ongoing searching. I'm not yet sure why large merges so adversely impact reopen time; there are several possibilities. It could be simple IO contention: a merge keeps the IO system very busy reading and writing many bytes, thus interfering with any IO required during reopen. However, if that were the case, NRTCachingDirectory (used by the test) should have prevented it, but didn't. It's also possible that the OS is [poorly] choosing to evict important process pages, such as the terms index, in favor of IO caching, causing the term lookups required when applying deletes to hit page faults; however, this also shouldn't be happening in my test since I've set Linux's swappiness to 0. Yet another possibility is Linux's write cache becomes temporarily too full, thus stalling all IO in the process until it clears; in this case perhaps tuning some of Linux's pdflush tunables could help, although I'd much rather find a Lucene-only solution so this problem can be fixed without users having to tweak such advanced OS tunables, even swappiness. Fortunately, we have an active Google Summer of Code student, Varun Thacker, working on enabling Directory implementations to pass appropriate flags to the OS when opening files for merging (LUCENE-2793 and LUCENE-2795). From past testing I know that passing O_DIRECT can prevent merges from evicting hot pages, so it's possible this will fix our slow reopen time as well since it bypasses the write cache. Finally, it's always possible other OSs do a better job managing the buffer cache, and wouldn't see such reopen delays during large merges. This issue is still a mystery, as there are many possibilities, but we'll eventually get to the bottom of it. It could be we should simply add our own IO throttling, so we can control net MB/sec read and written by merging activity. This would make a nice addition to Lucene! Except for the slowdown during merging, the performance of NRT is impressive. Most applications will have a required indexing rate far below 1 MB/sec per shard, and for most applications reopening once per second is fast enough. While there are exciting ideas to bring true real-time search to Lucene, by directly searching IndexWriter's RAM buffer as Michael Busch has implemented at Twitter with some cool custom extensions to Lucene, I doubt even the most demanding social apps actually truly need better performance than we see today with NRT. NIOFSDirectory vs MMapDirectory Out of curiosity, I ran the exact same test as above, but this time with NIOFSDirectory instead of MMapDirectory: There are some interesting differences. The search QPS is substantially slower -- starting at 107 QPS vs 151, though part of this could easily be from getting different compilation out of hotspot. For some reason TermQuery, in particular, has high variance from one JVM instance to another. The mean reopen time is slower: 67.7 milliseconds vs 57.0, and the reopen time seems more affected by the number of segments in the index (this is the saw-tooth pattern in the graph, matching when minor merges occur). The takeaway message seems clear: on Linux, use MMapDirectory not NIOFSDirectory! Optimizing your NRT turnaround time My test was just one datapoint, at a fixed fast reopen period (once per second) and at a high indexing rate (1 MB/sec plain text). You should test specifically for your use-case what reopen rate works best. Generally, the more frequently you reopen the faster the turnaround time will be, since fewer changes need to be applied; however, frequent reopening will reduce the maximum indexing rate. Most apps have relatively low required indexing rates compared to what Lucene can handle and can thus pick a reopen rate to suit the application's turnaround time requirements. There are also some simple steps you can take to reduce the turnaround time: Store the index on a fast IO system, ideally a modern SSD. Install a merged segment warmer (see IndexWriter.setMergedSegmentWarmer). This warmer is invoked by IndexWriter to warm up a newly merged segment without blocking the reopen of a new NRT reader. If your application uses Lucene's FieldCache or has its own caches, this is important as otherwise that warming cost will be spent on the first query to hit the new reader. Use only as many indexing threads as needed to achieve your required indexing rate; often 1 thread suffices. The fewer threads used for indexing, the faster the flushing, and the less merging (on trunk). If you are using Lucene's trunk, and your changes include deleting or updating prior documents, then use the Pulsing codec for your id field since this gives faster lookup performance which will make your reopen faster. Use the new NRTCachingDirectory, which buffers small segments in RAM to take load off the IO system (LUCENE-3092). Pass false for applyDeletes when opening an NRT reader, if your application can tolerate seeing deleted doccs from the returned reader. While it's not clear that thread priorities actually work correctly (see this Google Tech Talk), you should still set your thread priorities properly: the thread reopening your readers should be highest; next should be your indexing threads; and finally lowest should be all searching threads. If the machine becomes saturated, ideally only the search threads should take the hit. Happy near-real-time searching!
July 11, 2011
by Michael Mccandless
· 19,089 Views
article thumbnail
SEVERE: Error in xpath:java.lang.RuntimeException: solrconfig.xml missing luceneMatchVersion
One of the things that changed from Solr 1.4.1 to 1.5+ was the introduction of a parameter to tell Solr / Lucene which kind of compability version its index files should be created and used in. Solr now refuses to start if you do not provide this setting (if you’re upgrading a previous installation from 1.4.1 or earlier). The fix isn’t really straight forward, and you’ll probably have to recreate your index files if you’re just arriving at the scene with Solr / Lucene 3.2 and 4.0. Solr 3.0 (1.5) might be able to upgrade the files from the 2.9 version, but if you’re jumping from Lucene 2.9 to 4.0, the easiest solution seems to be to delete the current index and reindex (set up replication, disable replication from the master, query the slave while reindexing the master, etc.. and you’ll have no downtime while doing this!). You’ll need to add a parameter to your solrconfig.xml file as well in the section. LUCENE_CURRENT Other valid values are LUCENE_30, LUCENE_31, LUCENE_32 and LUCENE_40. These values represent specific versions of the index structure, while LUCENE_CURRENT will use the version depending on which particular release of Lucene you’re using. The version format will be upgraded automagically between most releases, so you’ll probably be fine by using LUCENE_CURRENT. If you however are trying to load index files that are more than one version older, you may have to use one of the other values.
July 9, 2011
by Mats Lindh
· 10,290 Views
article thumbnail
The era of Object-Document Mapping
The Data Mapper pattern is a mechanism for persistence where the application model and the data source have no dependencies between each other. For example, a group of PHP classes and a relational database may be used together without having the PHP classes extends a base Active Record class, thanks to a Data Mapper like Doctrine 2. But everytime we talk about the Data Mapper pattern, we assume there is a relational database on the other side of the persistence boundary. We always save objects; we always map them to MySQL or Postgres tables; but it's not mandatory. In fact, you can store objects also in NoSQL stores, with no more impedance mismatch that the one already existing with relational databases. Doctrine is expanding with two projects which have the goal of replicating the easy object-relational mapping of Doctrine 2 to other stores. These two projects are the MongoDB and CouchDB Object-Document mappers. Since the basic unit of persistence is the document in both cases, objects (or group of them) are stored as documents. The retrieval process depends on the features of the underlying database: it may consists of simple queries (MongoDB) or just of a unique identifier (CouchDB). How do an ODM work? If you use Doctrine 2, you'll find out that many of its concepts translates well to the other mappers. Instead of Entities, we have Documents in our model: id = "myuser-1234"; $user->username = "test"; $this->dm->persist($user); $this->dm->flush(); $this->dm->clear(); $userNew = $this->dm->find($this->type, $user->id); There is a lot of reuse between the various projects: Doctrine\Common is a dependency for them all, and you will be able to use the same ArrayCollection with any of the mappers. Features There's really all I would need to get started: storage of new documents, update, removal; pretty much what you expect; Unit Of Work pattern for tracking what has been modified and execute a single flush() to save changes; support for collections and both one-to-many or many-to-one associations; support for embedded documents (read Value Objects); both embedding of one or many objects as document fields is possible. This is a striking difference with respect to ORMs: Doctrine 2 does not support Value Objects mapping. cascade of persistence and removal of objects; specification of the mapping metadata via annotations, XML, YAML or PHP (like for Doctrine 2). Projects status The concept of Object-Document Mapper, at least in this name, is pretty unique to PHP and Ruby. Mandango is a PHP alternative and Mongoid a Ruby one, for the MongoDB case. Both projects are in the midst of releasing their first stable version: we are talking about bleeding edge technology here. The MongoDB mapper has been around for more than an year and is now in beta3. The CouchDB has an alpha release, but I found out that the current version shipped via PEAR is broken: if you want to play with it, checkout it from github and initialize its submodules (the standalone Symfony components and Doctrine\Common): git clone https://github.com/doctrine/couchdb-odm.git couchdb_odm cd couchdb_odm git submodule init git submodule update For the MongoDB mapper, the PEAR installation should suffice: pear install pear.doctrine-project.org/DoctrineMongoDBODM-1.0.0BETA3 Conclusions I think I'm going to explore more the CouchDB ODM, since it is a departure from the relational model. MongoDB is still similar to traditional databases in the way queries are satisfied: specifying a set of constraints like in SQL WHERE clauses. I'm more interested instead in approached that skip the Mapper for reporting (like Command-Query Responsibility Separation), and maybe CouchDB could be a good fit. By the way, object-document mapping is an alternative to explore: Data Mappers are all the rage today in PHP (they already were yesterday in other languages.) If we already accept that relational databases aren't always the solution, extending NoSQL with Data Mappers is a natural choice.
July 7, 2011
by Giorgio Sironi
· 20,912 Views
article thumbnail
Is Object Serialization Evil?
In my daily work, I use both an RDBMS and MarkLogic, an XML database. MarkLogic can be considered akin to the newer NoSQL databases, but it has the added structure of XML and standard languages in XQuery and XPath. The NoSQL databases are typically storing documents or key-value pairs, and some other things in between. Given that any datastore will be searched at some point, you will always care how the data is actually stored or whether there is some way to query it easily. Once you start thinking about the problem, you quickly generalize to the “how do I persist any type of data” question. However, my focus is not going to be the comparison of the various data stores, but the comparison of how data is stored. More specifically, I want to show the object serialization, mainly the Java built in method, as a data persistence format is evil. Given what you normally read on this blog, this may seem like an oddly timed post, but I have run into serialization issues lately in some production code and Mark Needham recently wrote an interesting post about this as well. Coincidentally, Mark is also working with MarkLogic, and there is an interesting item in his post: The advantage of doing things this way [using lightweight wrappers] is that it means we have less code to write than we would with the serialisation/deserialisation approach although it does mean that we’re strongly coupled to the data format that our storage mechanism uses. However, since this is one bit of the architecture which is not going to change it seems to makes sense to accept the leakage of that layer. The interesting part of this is that he has accepted using the data format of the storage mechanism, XML in MarkLogic in this case. Why is this interesting? First, it is a move away from the ORM technologies that try to hide the complexities of converting data into objects in the RDBMS world. Also, this is a glimpse into the types of issues that could arise from non-RDBMS storage choices as well as how to persist objects in general. So, an RDBMS is typically used to map object attributes to a table and columns. The mapping is mostly straightforward with some defined relationship for child objects and collections. This is a well-known area, called Object-Relational Mapping (ORM), and several open source and commercial options exist. In this scenario, object attributes are stored in a similar datatype, meaning a String is stored as a varchar and an int is stored as an integer. But, what happens when you move away from an RDBMS for data persistence? If you look at Java and its session objects, pure object serialization is used. Assuming that an application session is fairly short-lived, meaning at most a few hours, object serialization is simple, well supported and built into the Java concept of a session. However, when the data persistence is over a longer period of time, possibly days or weeks, and you have to worry about new releases of the application, serialization quickly becomes evil. As any good Java developer knows, if you plan to serialize an object, even in a session, you need a real serialization ID (serialVersionUID), not just a 1L, and you need to implement the Serializable interface. However, most developers do not know the real rules behind the Java deserialization process. If your object has changed, more than just adding simple fields to the object, it is possible that Java cannot deserialize the object correctly even if the serialization ID has not changed. Suddenly, you cannot retrieve your data any longer, which is inherently bad. Now, may developers reading this may say that they would never write code that would have this problem. That may be true, but what about a library that you use or some other developer no longer employed by your company? Can you guarantee that this problem will never happen? The only way to guarantee that is to use a different serialization method. What options do we have? Obviously, there are the NoSQL datastores but the actual object format is the relevant question not which solution to choose. Besides the obvious serialized object, some NoSQL datastores use JSON to store objects, MarkLogic uses XML and there are others that store just key-value pairs. Key-value pairs are typically a mapping of a text key to a value that is a serialized object, either a binary or textual format. So, that leaves us with XML, JSON and other textual formats. One of the benefits of a structured format like XML or JSON is that they can be made searchable and provide some level of context. I have talked about data formats before, so I won’t go into a comparison again. However, do these types of formats avoid the issues that native Java object serialization has? This is really dependent upon what library you are using for serialization. Some libraries will deserialize an object without any issues regardless of whether the object field list has changed. Other libraries could have problems depending upon whether a serialized field exists in the target object, or there might not be solid support for collections (though that is doubtful at this point). Given that even structured formats could have serialization issues, is the only safe path hand-coded mappings like those used by ORM tools? Some JSON and XML serialization tools use the same mapping methods as the ORM tools in order to avoid these problems. However, once you define these mappings, you are explicitly stating how an object gets translated. This explicit definition will require maintenance, but that is definitely cleaner than trying to trace down a serialization defect in some random stack trace. So is implicit object serialization really worth the potential headaches? Or should we just consider it evil and never speak of it again? From http://regulargeek.com/2011/07/06/is-object-serialization-evil/
July 7, 2011
by Robert Diana
· 20,010 Views · 1 Like
article thumbnail
Using Advantage data providers to read DBF-files
In one of my projects I have to read FoxPro DBF-files and import data from them. As this code must run in server and customer doesn’t want to install FoxPro there we found another solution that seems at least to me way better. In this posting I will show you how to read DBF-files using Sybase Advantage data providers. Getting Advantage data providers Here are the download links to data providers: Advantage .NET Data Provider Release 10.1 for Windows (32-bit and 64-bit) Platforms for Advantage OLE DB Provider Release 10.1 Platforms for Advantage ODBC Driver Release 10.1 I downloaded and installed .NET data provider and my example here is fully based on this. Configuring application If you run application without configuring some data providers stuff before you will get the following error: Error 5185: Local server connections are restricted in this environment. See the 5185 error code documentation for details. Go to your application bin folder and add there usual text file called ads.ini. Here is the content for this file: [SETTINGS] MTIER_LOCAL_CONNECTIONS=1 Make sure you add reference to Advantage data provider assembly and include ads.ini to your project like shown on image above. Getting data to DataTable Here is short code example about how to get data from DBF-file to DataTable. static void Main(string[] args) { var tableName = "TABLENAME_WITHOUT_EXTENSION"; var connStr = "data source={0};tabletype=vfp;servertype= local;"; connStr = string.Format(connStr, "c:\\temp\\"); var table = new DataTable(); using (var conn = new AdsConnection(connStr)) using (var adapter = new AdsDataAdapter()) using (var cmd = new AdsCommand()) { cmd.Connection = conn; cmd.CommandText = "select * from " + tableName; adapter.SelectCommand = cmd; conn.Open(); adapter.Fill(table); conn.Close(); } Console.WriteLine("Table fields:"); foreach (DataColumn col in table.Columns) Console.WriteLine(col.ColumnName); Console.WriteLine(" "); Console.WriteLine("Rows: " + table.Rows.Count); Console.Read(); } If Advantage data providers were installed correctly and there are no errors in table names, locations and your SQL query then you should see list of table column names and row count on console window when you run the application.
May 17, 2011
by Gunnar Peipman
· 10,457 Views
article thumbnail
Java Web Application Security - Part I: Java EE 6 Login Demo
Back in February, I wrote about my upcoming conferences: In addition to Vegas and Poland, there's a couple other events I might speak at in the next few months: the Utah Java Users Group (possibly in April), Jazoon and ÜberConf (if my proposals are accepted). For these events, I'm hoping to present the following talk: Webapp Security: Develop. Penetrate. Protect. Relax. In this session, you'll learn how to implement authentication in your Java web applications using Spring Security, Apache Shiro and good ol' Java EE Container Managed Authentication. You'll also learn how to secure your REST API with OAuth and lock it down with SSL. After learning how to develop authentication, I'll introduce you to OWASP, the OWASP Top 10, its Testing Guide and its Code Review Guide. From there, I'll discuss using WebGoat to verify your app is secure and commercial tools like webapp firewalls and accelerators. Fast forward a couple months and I'm happy to say that I've completed my talk at the Utah JUG and it's been accepted at Jazoon and Über Conf. For this talk, I created a presentation that primarily consists of demos implementing basic, form and Ajax authentication using Java EE 6, Spring Security and Apache Shiro. In the process of creating the demos, I learned (or re-educated myself) how to do a number of things in all 3 frameworks: Implement Basic Authentication Implement Form-based Authentication Implement Ajax HTTP -> HTTPS Authentication (with programmatic APIs) Force SSL for certain URLs Implement a file-based store of users and passwords (in Jetty/Maven and Tomcat standalone) Implement a database store of users and passwords (in Jetty/Maven and Tomcat standalone) Encrypt Passwords Secure methods with annotations For the demos, I showed the audience how to do almost all of these, but skipped Tomcat standalone and securing methods in the interest of time. In July, when I do this talk at ÜberConf, I plan on adding 1) hacking the app (to show security holes) and 2) fixing it to protect it against vulnerabilities. I told the audience at UJUG that I would post the presentation and was planning on recording screencasts of the various demos so the online version of the presentation would make more sense. Today, I've finished the first screencast showing how to implement security with Java EE 6. Below is the presentation (with the screencast embedded on slide 10) as well as a step-by-step tutorial. * You can also watch the screencast on YouTube or download the presentation PDF. Java EE 6 Login Tutorial Download and Run the Application Implement Basic Authentication Implement Form-based Authentication Force SSL Store Users in a Database Summary Download and Run the Application To begin, download the application you'll be implementing security in. This app is a stripped-down version of the Ajax Login application I wrote for my article on Implementing Ajax Authentication using jQuery, Spring Security and HTTPS. You'll need Java 6 and Maven installed to run the app. Run it using mvn jetty:run and open http://localhost:8080 in your browser. You'll see it's a simple CRUD application for users and there's no login required to add or delete users. Implement Basic Authentication The first step is to protect the list screen so people have to login to view users. To do this, add the following to the bottom of src/main/webapp/WEB-INF/web.xml: users /users GET POST ROLE_ADMIN BASIC Java EE Login ROLE_ADMIN At this point, if you restart Jetty (Ctrl+C and jetty:run again), you'll get an error about a missing LoginService. This happens because Jetty doesn't know where the "Java EE Login" realm is located. Add the following to pom.xml, just after in the Jetty plugin's configuration. Java EE Login ${basedir}/src/test/resources/realm.properties The realm.properties file already exists in the project and contains user names and passwords. Start the app again using mvn jetty:run and you should be prompted to login when you click on the "Users" tab. Enter admin/admin to login. After logging in, you can try to logout by clicking the "Logout" link in the top-right corner. This calls a LogoutController with the following code that logs the user out. public void logout(HttpServletResponse response) throws ServletException, IOException { request.getSession().invalidate(); response.sendRedirect(request.getContextPath()); } You'll notice that clicking this link doesn't log you out, even though the session is invalidated. The only way to logout with basic authentication is to close the browser. In order to get the ability to logout, as well as to have more control over the look-and-feel of the login, you can implement form-based authentication. Implement Form-based Authentication To change from basic to form-based authentication, you simply have to replace the in your web.xml with the following: FORM /login.jsp /login.jsp?error=true The login.jsp page already exists in the project, in the src/main/webapp directory. This JSP has 3 important elements: 1) a form that submits to "${contextPath}/j_security_check", 2) an input element named "j_username" and 3) an input element named "j_password". If you restart Jetty, you'll now be prompted to login with this JSP instead of the basic authentication dialog. Force SSL Another thing you might want to implement to secure your application is forcing SSL for certain URLs. To do this on the same you already have in web.xml, add the following after : CONFIDENTIAL To configure Jetty to listen on an SSL port, add the following just after in your pom.xml: true 8080 true 8443 60000 ${project.build.directory}/ssl.keystore appfuse appfuse The keystore must be generated for Jetty to start successfully, so add the keytool-maven-plugin just above the jetty-maven-plugin in pom.xml. org.codehaus.mojo keytool-maven-plugin 1.0 generate-resources clean clean generate-resources genkey genkey ${project.build.directory}/ssl.keystore cn=localhost appfuse appfuse appfuse RSA Now if you restart Jetty, go to http://localhost:8080 and click on the "Users" tab, you'll get a 403. What the heck?! When this first happened to me, it took me a while to figure out. It turns out that Jetty doesn't redirect to HTTPS when using Java EE authentication, so you have to manually type in https://localhost:8443/ (or add a filter to redirect for you). If you deployed this same application on Tomcat (after enabling SSL), it would redirect for you. Store Users in a Database Finally, to store your users in a database instead of file, you'll need to change the in the Jetty plugin's configuration. Replace the existing element with the following: Java EE Login ${basedir}/src/test/resources/jdbc-realm.properties The jdbc-realm.properties file already exists in the project and contains the database settings and table/column names for the user and role information. jdbcdriver = com.mysql.jdbc.Driver url = jdbc:mysql://localhost/appfuse username = root password = usertable = app_user usertablekey = id usertableuserfield = username usertablepasswordfield = password roletable = role roletablekey = id roletablerolefield = name userroletable = user_role userroletableuserkey = user_id userroletablerolekey = role_id cachetime = 300 Of course, you'll need to install MySQL for this to work. After installing it, you should be able to create an "appfuse" database and populate it using the following commands: mysql -u root -p -e 'create database appfuse' curl https://gist.github.com/raw/958091/ceecb4a6ae31c31429d5639d0d1e6bfd93e2ea42/create-appfuse.sql > create-appfuse.sql mysql -u root -p appfuse < create-appfuse.sql Next you'll need to configure Jetty so it has MySQL's JDBC Driver in its classpath. To do this, add the following dependency just after the element (before ) in pom.xml: mysql mysql-connector-java 5.1.14 Now run the jetty-password.sh file in the root directory of the project to generate a password of your choosing. For example: $ sh jetty-password.sh javaeelogin javaeelogin OBF:1vuj1t2v1wum1u9d1ugo1t331uh21ua51wts1t3b1vur MD5:53b176e6ce1b5183bc970ef1ebaffd44 The last two lines are obfuscated and MD5 versions of the password. Update the admin user's password to this new value. You can do this with the following SQL statement. UPDATE app_user SET password='MD5:53b176e6ce1b5183bc970ef1ebaffd44' WHERE username = 'admin'; Now if you restart Jetty, you should be able to login with admin/javaeelogin and view the list of users. Summary In this tutorial, you learned how to implement authentication using standard Java EE 6. In addition to the basic XML configuration, there's also some new methods in HttpServletRequest for Java EE 6 and Servlet 3.0: authenticate(response) login(user, pass) logout() This tutorial doesn't show you how to use them, but I did play with them a bit as part of my UJUG demo when implementing Ajax authentication. I found that login() did work, but it didn't persist the authentication for the users session. I also found that after calling logout(), I still needed to invalidate the session to completely logout the user. There are some additional limitations I found with Java EE authentication, namely: No error messages for failed logins No Remember Me No auto-redirect from HTTP to HTTPS Container has to be configured Doesn’t support regular expressions for URLs Of course, no error messages indicating why login failed is probably a good thing (you don't want to tell users why their credentials failed). However, when you're trying to figure out if your container is configured properly, the lack of container logging can be a pain. In the next couple weeks, I'll post Part II of this series, where I'll show you how to implement this same set of features using Spring Security. In the meantime, please let me know if you have any questions. From http://raibledesigns.com/rd/entry/java_web_application_security_part
May 15, 2011
by Matt Raible
· 42,012 Views · 3 Likes
article thumbnail
Reset MySQL Root Password On Linux
Five easy steps to reset MySQL root password. Stop the MySQL server. Start the MySQL server with the --skip-grant-tables option. (it will not prompt for password) Connect to MySQL server as the root user. Setup new MySQL root password. Exit and restart the MySQL server. ### Shell Commands ### /etc/init.d/mysql stop mysqld_safe --skip-grant-tables & mysql -u root ### SQL Commands ### use mysql; UPDATE user SET password=PASSWORD("new-password") WHERE user='root'; flush privileges; exit ### Shell Commands ### /etc/init.d/mysql stop /etc/init.d/mysql start mysql -u root -pnew-password
May 13, 2011
by Artur Mkrtchyan
· 5,779 Views
article thumbnail
Database Interaction with DAO and DTO Design Patterns
Learn what is a DAO and how to create a DAO, as well as the significance of creating Data Access Objects.
May 13, 2011
by Sandeep Bhandari
· 165,707 Views · 5 Likes
article thumbnail
Practical PHP Testing Patterns: Transaction Rollback Teardown
Maintaining isolation of tests when they have a database as Shared Fixture is not a trivial task. An important constraint is not having the headache of keeping track what manipulations on the database has your code done; in that case the rollback may not even be performed in case of a regression. An alternative way to resetting the database via DELETE and TRUNCATE queries is to roll back a transaction which has been started in the setup phase during the teardown. Implementation The phases of a database test involving Transaction Rollback Teardown are roughly the following: begin transaction, usually in setUp(). arrange, act, assert actions in the various Test Methods. rollback of the transaction in teardown(). The active transaction is never committed. An issue with using this pattern is that code that already uses transaction is prone to generate errors, and ultimately should never be tested with this technique. The rules for your safety are simple: the SUT should never start a transaction or committing it. Some databases support nested transaction levels, but it's very brittle to use them for testing purposes, and in case of any failure the whole suite will blow up as test executes teardowns at the wrong level of nesting. This pattern safety is also difficult to ensure, as DDL statements like CREATE/DELETE or other commands may commit the current transaction automatically. Check the documentation of your testing database. The advantage of this pattern is great performance: rollback is faster than every other command, including TRUNCATE. Moreover, if you encapsulate transactions well in your production code, most of it won't commit them (typically leaving the control over the transaction to an upper layer). Doctrine 2 In a sense, we already use this pattern with an UnitOfWork ORM such as Doctrine 2 when we do not flush() the ORM in our code. The flow is: The database is ready by setup. Exercise code. Check results as persisted or removed entities. Instead of calling flush() over the Entity Manager, call clear(). In this case, the database never sees a transaction, as Doctrine 2 keeps everything in the Unit Of Work until you say to flush it. Even when your code is calling flush(), you can explicitly use beginTransaction() and rollback() over the connection object: in this other scenario, the testing database sees an open transaction, but it's never committed and can be discarded in teardown() like the pattern prescribes. Example The code sample is the same test case shown in the Table Truncation Teardown article, which now uses transactions to encapsulate the single tests. The various tests check the tables content is restored, along with the AUTOINCREMENT next value. exec('CREATE TABLE users ( id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR(255) )'); } $this->connection = self::$sharedConnection; $this->connection->beginTransaction(); } public function teardown() { $this->connection->rollback(); } public function testTableCanBePopulated() { $this->connection->exec('INSERT INTO users (name) VALUES ("Giorgio")'); $this->assertEquals(1, $this->howManyUsers()); } public function testTableRestartsFrom1() { $this->assertEquals(0, $this->howManyUsers()); $this->connection->exec('INSERT INTO users (name) VALUES ("Isaac")'); $stmt = $this->connection->query('SELECT name FROM users WHERE id=1'); $result = $stmt->fetch(); $this->assertEquals('Isaac', $result['name']); } public function testTableIsEmpty() { $this->assertEquals(0, $this->howManyUsers()); } private function howManyUsers() { $stmt = $this->connection->query('SELECT COUNT(*) AS number FROM users'); $result = $stmt->fetch(); return $result['number']; } }
May 11, 2011
by Giorgio Sironi
· 7,801 Views
article thumbnail
Practical PHP Testing Patterns: Stored Procedure Test
It happened in the day before the advent of DDD and the Hexagonal architecture, that you had code that lived inside the database, such as Stored Procedures, constraints, and triggers. Back in the day, the relational database was considered the single source of truth instead of a Domain Model written in a language like PHP or Java. Today the picture is different - but there are still scenarios where pushing code in the database make sense. One of the reasons for having logic expressed as SQL and in other database languages is their power, and their performance. SQL operators, especially when augmented by proprietary extensions, let you declare pieces of logic that you would instead have to code by hand. SQL that is executed directly on the database can accomplish operations too onerous to perform over a reconstituted object graph with a subsequent saving. In fact, every decent ORM include a language for batch updates that translates to SQL, like Doctrine with DQL; and also a mechanism for providing hints for the underlying database, like indexes definitions. The problem with SQL derivates and other database-specific embedded logic is that we cannot execute it and test it in isolation - we need a real copy of the database to perform our tests. Thus the Stored Procedure Test is an umbrella term for tests that encompasses database code, even when they're not actually stored procedures. When I'll use the term stored procedure in this article, it will be to signify any database-specific code, such as complex queries, triggers and so on. Implementation The pattern prescribes to write unit tests for the stored procedure, to test it in isolation from the rest of application a first simplification. These tests cover nontrivial logic in database code - probably you don't need them for indexes definition, but more for queries with aggregate functions. In the PHP world, Sqlite often suffices for testing queries - as long as you have an intermediate layer like Doctrine DBAL (part of Doctrine 2) which smooths out the differences between vendors. You use MySQL in production, Sqlite in the test suite, and you can write queries in Doctrine's DQL being confident that it will translate them to the right SQL dialect. These tests should be executed in a sandbox - a database with just enough structure and data to test the stored procedure at hand. This sandbox should run by definition on the production dbms. The most difficult aspect of the pattern is integrating with the dbms: it should be running and listening on the right port. A sandbox should be created in setUp() or setUpBeforeClass(), and destroyed during teardown. In case the database is not available, the tests should be marked as skipped or incomplete. Variations In In-Database Stored Procedure Tests, the test is written in the same language as the database code. I cannot imagine something more boring for a PHP programmer. In Remoted Stored Procedure Tests, which is the variation of interest for us, the tests are written in PHPUnit and integrated with the suite (slowing it down a bit). The logic is that whatever SQL logic you're going to add to your application, is already encapsulated in some PHP class: for example, complex queries are encapsulated in Repositories or DAO. So it's going to be feasible to build a sandbox via PHP code, and test the stored procedure as a black box. It will be encapsulated for a unique execution, like schema creation, or for executing multiple times in case of queries. Example The example shows you how to test a query with a real database - supposing a surrogate database does not support all the needed functions - from inside a test suite. I thought it would be difficult to write this test, but instead it required less than a Pomodoro. connection = new PDO("mysql:host=localhost;dbname=sandbox", 'root', ''); $this->connection->exec("CREATE TABLE users (name VARCHAR(255) NOT NULL PRIMARY KEY, year YEAR)"); $this->repository = new UserRepository($this->connection, 2011); } public function testAverageAgeIsCalculated() { $this->insertUser('Giorgio', 1942); $this->insertUser('Isaac', 1920); $this->assertEquals(80, $this->repository->getAverageAge()); } private function insertUser($name, $year) { $stmt = $this->connection->prepare("INSERT INTO users (name, year) VALUES (:name, :year)"); $stmt->bindValue('name', $name, PDO::PARAM_STR); $stmt->bindValue('year', $year, PDO::PARAM_INT); return $stmt->execute(); } public function tearDown() { $this->connection->exec('DROP TABLE users'); } } class UserRepository { private $connection; private $currentYear; public function __construct(PDO $connection, $currentYear) { $this->connection = $connection; $this->currentYear = $currentYear; } /** * We suppose AVG() cannot be correctly implemented by Sqlite or * another surrogate database (substitute another vendor feature * for the same effect). * We also suppose reconstituting millions of User objects to calculate * their average age isn't feasible: that's why we used SQL directly. */ public function getAverageAge() { $stmt = $this->connection->prepare('SELECT AVG(:year - year) AS average_age FROM users'); $stmt->bindValue('year', $this->currentYear, PDO::PARAM_INT); $stmt->execute(); $row = $stmt->fetch(); return $row['average_age']; } }
May 4, 2011
by Giorgio Sironi
· 2,826 Views
article thumbnail
Reasons for Slow Database Performance
Usually there are scenarios where the application does not perform as expected. A simple web page which fetches data from database and displays optimizes it for mobiles should be fast and turnaround times should be less than 30 seconds on a good network connection. But still there are cases where these kinds of applications suffer the most performance issues. This is because the database in these cases is not designed by giving proper attention to the application requirements. You can change one application design even after delivery but changing the database design once a number of application have been integrated with it is like explosion. Here I am giving some points which should be kept in mind while designing application and database. Only basic idea is being provided. For details one can search each topic on the web as each point expands well to multiple articles. 1) Bind Variables: When a SQL query is sent to the database engine for processing and sending the result, it is compiled by the database compiler to get the tokens of the query. This involves parsing, optimizing and identifying the query. After a number of steps, the SQL query is passed to the database engine for processing. In a small application with a user base of less than 500, it is usually the same query which is executed more often than others. The use of bind variables helps in storing the compiled query once and executing it with different data at different times. For using bind variables, one needs to use PreparedStatement objects in Java. 2) Query is not well formed: Usually the same SQL query can be written in multiple ways. There are ways by which a query can be optimized to give the best performance. The corresponding SQL construct should be chosen depending upon requirement. I have scenarios where people have used WHERE clause instead of GROUP BY and are complaining of poor response times. Similarly Sub queries and Joins complement each other. 3) Database structure is not well defined/normalized: This is probably known to everybody that the database tables should be properly normalized as this is part of every DBMS course at graduation level. If the tables are not properly designed and normalized, anomalies set in. 4) Proper caching is not in place: Many applications make use of temporary caches on the application server to store the reference data or frequently accessed data as memory is less of an issue than the time with new generation servers. 5) Number of rows in the table too large: If the table itself has too much of data then the queries will take time to execute. Partitioning a table into multiple tables is recommended in these situations. For example: If a table has employee records of 1000000 employees then it could be split into 5 small tables each having 200000 rows. The advantage is we know beforehand in which smaller table to look for a particular employee code as the division of large table can be done on the employee id column. 6) Connections are not being pooled: If connections are not pooled then the each time a new connection is requested for a request to database. Maintaining a connection pool is much better than creating and destroying the connection for executing every SQL query. Of course, there are frameworks like Hibernate which take care of creating the connection pools and also allow the customization of these pools 7) Connections not closed/returned to pool in case of exceptions: When an exception occurs while performing database operations, it ought to be caught. Usually catching the exception is not the issue because SQLException is a checked exception but closing the connection is something that most of the times is left out. If the connection is not released, the same connection cannot be used for any other purpose till the connection is timed out. 8) Stored procedures for complex computations on database: Stored procedures are a good way to perform database intensive operations. This is because they are already compiled and there is less network trips for getting the same results as compared to SQL queries. From http://extreme-java.blogspot.com/2011/04/reasons-for-slow-database-performance.html
April 30, 2011
by Sandeep Bhandari
· 59,830 Views · 1 Like
article thumbnail
Hammurabi - A Scala Rule Engine
One of the most common reasons why software projects fail, or suffer unbearable delays, is the misunderstandings between the analysts who define the business rules of the domain for which the software is going to be written and the developers who have to code these rules. The latter write those rules in a language that is completely obscure for the first ones. In this way the business analysts don't have a chance to read, understand and validate what the programmers developed and then they can only empirically test the final software behavior, hardly covering all the possible corner cases and often recognizing mistakes only when it is too late. What Hammurabi is Hammurabi is a rule engine written in Scala that tries to leverage the features of this language making it particularly suitable to implement extremely readable internal Domain Specific Languages. Indeed, what actually makes Hammurabi different from all other rule engines is that it is possible to write and compile its rules directly in the host language. Anyway the Hammurabi's rules also have the important property of being readable even by non technical person. As usual a practical example worth more than a thousand words. The golfers problem This logical puzzle has been taken from the first chapter of the Jess in Action book written by Ernest Friedman-Hill and published by Manning. It is described there as it follows: A foursome of golfers is standing at a tee, in a line from left to right. Each golfer wears different colored pants; one is wearing red pants. The golfer to Fred’s immediate right is wearing blue pants. Joe is second in line. Bob is wearing plaid pants. Tom isn’t in position one or four, and he isn’t wearing the hideous orange pants. In what order will the four golfers tee off, and what color are each golfer’s pants?” The Jess solution Jess is written in Java and is one of the most popular rule engine on the market. The solution to the golfers problems presented in the book mentioned above is the following: first it is necessary to define the data structures representing the problem (deftemplate pants-color (slot of) (slot is)) (deftemplate position (slot of) (slot is)) A deftemplate is a bit like a class declaration in Java and in this case is used to write a first rule that in turns creates the facts representing each of the possible combinations of golfers, pants-color and positions: (defrule generate-possibilities => (foreach ?name (create$ Fred Joe Bob Tom) (foreach ?color (create$ red blue plaid orange) (assert (pants-color (of ?name)(is ?color))) ) (foreach ?position (create$ 1 2 3 4) (assert (position (of ?name)(is ?position))) ) ) ) After that it is possible to translate the sentences of the problem in the corresponding Jess rule: (defrule find-solution ;; There is a golfer named Fred, whose position is ?p1 ;; and pants color is ?c1 (position (of Fred) (is ?p1)) (pants-color (of Fred) (is ?c1)) ;; The golfer to immediate right of Fred ;; is wearing blue pants. (position (of ?n&~Fred)(is ?p&:(eq ?p (+ ?p1 1)))) (pants-color (of ?n&~Fred)(is blue&~?c1)) ;; Joe is in position #2 (position (of Joe) (is ?p2&2&~?p1)) (pants-color (of Joe) (is ?c2&~?c1)) ;; Bob is wearing the plaid pants (position (of Bob)(is ?p3&~?p1&~?p&~?p2)) (pants-color (of Bob&~?n)(is plaid&?c3&~?c1&~?c2)) ;; Tom is not in position 1 or 4 ;; and is not wearing orange (position (of Tom&~?n)(is ?p4&~1&~4&~?p1&~?p2&~?p3)) (pants-color (of Tom)(is ?c4&~orange&~blue&~?c1&~?c2&~?c3)) => (printout t Fred " " ?p1 " " ?c1 crlf) (printout t Joe " " ?p2 " " ?c2 crlf) (printout t Bob " " ?p3 " " ?c3 crlf) (printout t Tom " " ?p4 " " ?c4 crlf) ) where the rows starting with ;; are just comments. In this way if you enter the code for the problem into Jess and then run it, you get the answer directly: Fred 1 orange Joe 2 blue Bob 4 plaid Tom 3 red Note that the facts that the golfers are in different positions and wear pants of different colors is not expressed in an explicit rule but need to be spread and repeated in many statements. This solution is clearly difficult to be maintained and doesn't scale well as underlined by the last condition statement stating that the position ?p4 of Tom is ?p4&~1&~4&~?p1&~?p2&~?p3 where ~ means not in the Jess language. In other words it says that the Tom's position is not only different from the position 1 and 4 but it is also different from the positions of all the other golfers (named one by one) formerly defined. Actually the needs to describe a golfer's position also as the negation of the positions of all the other golfers implies something even worse: it is not possible to translate each sentence of the problem in a different rule, but they have to be combined together in a single big rule. After this huge if part, its then section (the one after the => symbol) prints out a table containing the set of variables ?p1…?p4 and ?c1…?c4 that solves the problem. The Hammurabi solution As done while presenting the Jess solution, also with Hammurabi the first thing to do is to define the domain of the problem. In order to do that, since the Hammurabi rules are valid Scala statements, it is sufficient to create a plain Scala Person class having as attributes the name, the position and the color of the pants of the golfer that it represents: class Person(n: String) { val name = n var pos: Int = _ var color: String = _ } Then we can model the fact that all the golfers must have different position and pants color by putting them in 2 different Set: var availablePos = (1 to 4).toSet var availableColors = Set("blue", "plaid", "red", "orange") and write two small methods that pull off them from the corresponding set once they have been assigned to a specific golfer: val assign = new { def color(color: String) = new { def to(person: Person) = { person.color = color availableColors = availableColors - color } } def position(pos: Int) = new { def to(person: Person) = { person.pos = pos availablePos = availablePos - pos } } } Those methods are written in a quite weird way just to make even more readable the DSL that will be used to define the rules and to stress the idea that it is possible to write the rules in a valid Scala that could be easily understood by a non-technical person. Of course it is easy to go even further by encapsulating other concepts in some convenient methods as it has been done above. Now everything is ready to write the set of rules describing the golfers problem: val ruleSet = Set( rule ("Unique positions") let { val p = any(kindOf[Person]) when { (availablePos.size equals 1) and (p.pos equals 0) } then { assign position availablePos.head to p } }, rule ("Unique colors") let { val p = any(kindOf[Person]) when { (availableColors.size equals 1) and (p.color == null) } then { assign color availableColors.head to p } }, rule ("Joe is in position 2") let { val p = any(kindOf[Person]) when { p.name equals "Joe" } then { assign position 2 to p } }, rule ("Person to Fred’s immediate right is wearing blue pants") let { val p1 = any(kindOf[Person]) val p2 = any(kindOf[Person]) when { (p1.name equals "Fred") and (p2.pos equals p1.pos + 1) } then { assign color "blue" to p2 } }, rule ("Fred isn't in position 4") let { val possibleFredPos = availablePos - 4 val p = any(kindOf[Person]) when { (p.name equals "Fred") and (possibleFredPos.size == 1) } then { assign position possibleFredPos.head to p } }, rule ("Tom isn't in position 1 or 4") let { val possibleTomPos = availablePos - 1 - 4 val p = any(kindOf[Person]) when { (p.name equals "Tom") and (possibleTomPos.size equals 1) } then { assign position possibleTomPos.head to p } }, rule ("Bob is wearing plaid pants") let { val p = any(kindOf[Person]) when { p.name equals "Bob" } then { assign color "plaid" to p } }, rule ("Tom isn't wearing orange pants") let { val possibleTomColors = availableColors - "orange" val p = any(kindOf[Person]) when { (p.name equals "Tom") and (possibleTomColors.size equals 1) } then { assign color possibleTomColors.head to p } } ) Here the first 2 rules explicitly leverage the uniqueness of positions and pants colors by assigning the last available of them to the only person who still doesn't have one. The other rules just match one by one the sentences of the problem as it has been defined. Now it is possible to make Hammurabi solve the problem by creating the four golfers: val tom = new Person("Tom") val joe = new Person("Joe") val fred = new Person("Fred") val bob = new Person("Bob") add them to a working memory (the set of objects against which the rule engine will evaluate and fire the rules): val workingMemory = WorkingMemory(tom, joe, fred, bob) and letting the rule engine, initialized with the formerly defined set of rules, to work on it: RuleEngine(ruleSet) execOn workingMemory Working with immutable data structures Immutability is probably not something that should be enforced at all costs in Hammurabi, for a very simple reason: the largest part of the execution time is spent by Hammurabi, like all other rule engines, looking for rules that can actually be executed (fired), i.e. the ones for which the when condition is true. During this phase data are only read and never written, so immutability doesn't matter at all, and all the rules that can be fired are put in an agenda. During the subsequent phase the rules in the agenda MUST be fired one by one, since the execution of one of them could make false the when condition of another one. It means that lack of immutability shouldn't prevent the rule engine to safely run in parallel during the discovery phase. That said, it is also possible to obtain the same result working with immutable data structures. For example having an immutable person: case class Person(name: String, pos: Int = 0, color: String = null) it's enough to rewrite the methods that assign the position and pants color to the golfers as it follows: val assign = new { def color(color: String) = new { def to(person: Person) = { remove(person) produce(person.copy(color = color)) availableColors = availableColors - color } } def position(pos: Int) = new { def to(person: Person) = { remove(person) produce(person.copy(pos = pos)) availablePos = availablePos - pos } } } leaving all the rules unchanged. In this way the old version of the Person is removed from the working memory and a brand new one, with its position or color set accordingly to the fired rule, is produced and then added to the working memory itself. The methods remove and produce can be indeed used respectively to remove objects from the working memory and produce new objects, that could be then used to evaluate and fire other rules. Hammurabi internals In Hammurabi all the rules are evaluated (but not fired) in parallel basically by assigning each rule to a different Scala actor. The replacement of this actor implementation with one based on the upcoming Scala parallel collections is currently under evaluation but I decided to wait until this technology will be stable. As anticipated, the evaluation of the rule in parallel is safe because is a read only process. While the actors discover set of variables that can fire the rules they are evaluating, they add them to the rule engine's agenda. At the end of this evaluation process the rules are fired sequentially one by one after a re-evaluation of their application condition, because the execution of one of them could change the result of that condition for a subsequent one. All the rules are fired in no particular order unless a different priority has been specified for some of them. Indeed since sometimes there could be the need to treat some rules as special cases they have an optional property called salience that acts as a priority setting for that rule in order to allow activated rules with the highest salience to always fire first, followed by rules of lower salience. By default all rules have salience 0 but you can alter it in the rule definition as it follows: rule ("Important rule") withSalience 10 let { ... } rule ("Negligible rule") withSalience -5 let { ... } Each actor also records the set of values on which the rule it is responsible for has been already executed, in order to don't fire again the same rule on the same values. The evaluation phase and the firing one are then executed again and again until either there is no rule that can be fired or one of the rule during its firing phase invokes one of the methods exitWith, making the rule engine gracefully finishing returning a value representing the result of the whole evaluation process or failWith that cause the rule engine to terminate by throwing an exception. Of course if the rule engine stops just because there are no longer rules that can be fired, the result of the evaluation is represented by the whole working set (as in the former example) and you can read from there the values you are interested in. Further implementations and improvements At the moment it is only possible to take from the working memory the object(s) against which a given rule will be evaluated only selecting them by type, as shown in the statement: val p = any(kindOf[Person]) Further mechanisms to categorize and select those objects in a more precise way are under evaluation, even if it is already possible to limit them with a Boolean function as in the following example: val p = kindOf[Person] having (_.name == "Joe") This is useful also under a performance point of view because it dramatically lowers the number of combination that the rule engine needs to check before to find a rule that can be fired. For example the "Person to Fred’s immediate right is wearing blue pants" rule could be rewritten as it follows bringing the number of combination that has to be tried from 16 to 4: rule ("Person to Fred’s immediate right is wearing blue pants") let { val p1 = kindOf[Person] having (_.name == "Fred") val p2 = any(kindOf[Person]) when { p2.pos equals p1.pos + 1 } then { assign color "blue" to p2 } } I am also evaluating of directly feeding the working memory with a NoSQL database. In other words, with this solution, the data present in the db could represent the working memory itself. At the moment I am experimenting with MongoDB since is the one I know best, but if somebody has some other good idea or even better wants to collaborate with this project I'd be very glad of it. The version 0.1 of the Hammurabi rule engine has been just released and it is available here.
April 18, 2011
by Mario Fusco
· 27,864 Views
article thumbnail
Eradicating Non-Determinism in Tests
An automated regression suite can play a vital role on a software project, valuable both for reducing defects in production and essential for evolutionary design. In talking with development teams I've often heard about the problem of non-deterministic tests - tests that sometimes pass and sometimes fail. Left uncontrolled, non-deterministic tests can completely destroy the value of an automated regression suite. In this article I outline how to deal with non-deterministic tests. Initially quarantine helps to reduce their damage to other tests, but you still have to fix them soon. Therefore I discuss treatments for the common causes for non-determinism: lack of isolation, asynchronous behavior, remote services, time, and resource leaks. I've enjoyed watching ThoughtWorks tackle many difficult enterprise applications, bringing successful deliveries to many clients who have rarely seen success. Our experiences have been a great demonstration that agile methods, deeply controversial and distrusted when we wrote the manifesto a decade ago, can be used successfully. There are many flavors of agile development out there, but in what we do there is a central role for automated testing. Automated testing was a core approach to Extreme Programming from the beginning, and that philosophy has been the biggest inspiration to our agile work. So we've gained a lot of experience in using automated testing as a core part of software development. Automated testing can look easy when presented in a text book. And indeed the basic ideas are really quite simple. But in the pressure-cooker of a delivery project, trials come up that are often not given much attention in texts. As I know too well, authors have a habit of skimming over many details in order to get a core point across. In my conversations with our delivery teams, one recurring problem that we've run into is tests which have become unreliable, so unreliable that people don't pay much attention to whether they pass or fail. A primary cause of this unreliability is that some tests have become non-deterministic. A test is non-deterministic when it passes sometimes and fails sometimes, without any noticeable change in the code, tests, or environment. Such tests fail, then you re-run them and they pass. Test failures for such tests are seemingly random. Non-determinism can plague any kind of test, but it's particularly prone to affect tests with a broad scope, such as acceptance or functional tests. Why non-deterministic tests are a problem Non-deterministic tests have two problems, firstly they are useless, secondly they are a virulent infection that can completely ruin your entire test suite. As a result they need to be dealt with as soon as you can, before your entire deployment pipeline is compromised. I'll start with expanding on their uselessness. The primary benefit of having automated tests is that they provide bug detection mechanism by acting as regression tests[1]. When a regression test goes red, you know you've got an immediate problem, often because a bug has crept into the system without you realizing. Having such a bug detector has huge benefits. Most obviously it means that you can find and fix bugs just after they are introduced. Not just does this give you the warm fuzzies because you kill bugs quickly, it also makes it easier to remove them since you know the bug got in with the last set of changes that are fresh in your mind. As a result you know where to look for the bug, which is more than half the battle in squashing it. The second level of benefit is that as you gain confidence in your bug detector, you gain the courage to make big changes knowing that when you goof, the bug detector will go off and you can fix the mistake quickly. [2] Without this teams are frightened to make the changes code needs in order to be kept clean, which leads to a rotting of the code base and plummeting development speed. The trouble with non-deterministic tests is that when they go red, you have no idea whether its due to a bug, or just part of the non-deterministic behavior. Usually with these tests a non-deterministic failure is relatively common, so you end up shrugging your shoulders when these tests go red. Once you start ignoring a regression test failure, then that test is useless and you might as well throw it away. Indeed you really ought to throw a non-deterministic test away, since if you don't it has an infectious quality. If you have a suite of 100 tests with 10 non-deterministic tests in them, than that suite will often fail. Initially people will look at the failure report and notice that the failures are in non-deterministic tests, but soon they'll lose the discipline to do that. Once that discipline is lost, then a failure in the healthy deterministic tests will get ignored too. At that point you've lots the whole game and might as well get rid of all the tests. Quarantine My principal aim in this article is to outline common cases of non-deterministic tests and how to eliminate the non-determinism. But before I get there I offer one piece of essential advice: quarantine your non-deterministic tests. If you have non-deterministic tests keep them in a different test suite to your healthy tests. That way you'll you can continue to pay attention to what's going on with your healthy tests and get good feedback from them. Place any non-deterministic test in a quarantined area. (But fix quarantined tests quickly.) Then the question is what to do with the quarantined test suites. They are useless as regression tests, but they do have a future as work items for cleaning up. You should not abandon such tests, since any tests you have in quarantine are not helping you with your regression coverage. A danger here is that tests keep getting thrown into quarantine and forgotten, which means your bug detection system is eroding. As a result it's worthwhile to have a mechanism that ensures that tests don't stay in quarantine too long. I've come across various ways to do this. One is a simple numeric limit: e.g. only allow 8 tests in quarantine. Once you hit the limit you must spend time to clear all the tests out. This has the advantage of batching up your test-cleaning if that's how you like to do things. Another route is to put a time limit on how long a test may be in quarantine, such as no longer than a week. The general approach with quarantine is to take the quarantined tests out of the main deployment pipeline so that you still get your regular build process. However a good team can be more aggressive. Our Mingle team puts its quarantine suite into the deployment pipeline one stage after its healthy tests. That way it can get the feedback from the healthy tests but is also forced to ensure that it sorts out the quarantined tests quickly. [3] Lack of Isolation In order to get tests to run reliably, you must have clear control over the environment in which they run, so you have a well-known state at the beginning of the test. If one test creates some data in the database and leaves it lying around, it can corrupt the run of another test which may rely on a different database state. Therefore I find it's really important to focus on keeping tests isolated. Properly isolated tests can be run in any sequence. As you get to larger operational scope of functional tests, it gets progressively harder to keep tests isolated. When you are tracking down a non-determinism, lack of isolation is a common and frustrating cause. Keep your tests isolated from each other, so that execution of one test will not affect any others. There are a couple of ways to get isolation - either always rebuild your starting state from scratch, or ensure that each test cleans up properly after itself. In general I prefer the former, as it's often easier - and in particular easier to find the source of a problem. If a test fails because it didn't build up the initial state properly, then it's easy to see which test contains the bug. With clean-up, however, one test will contain the bug, but another test will fail - so it's hard to find the real problem. Starting from a blank state is usually easy with unit tests, but can be much harder with functional tests [4] - particularly if you have a lot of data in a database that needs to be there. Rebuilding the database each time can add a lot of time to test runs, so that argues for switching to a clean-up strategy. One trick that's handy when you're using databases, is to conduct your tests inside a transaction, and then to rollback the transaction at the end of the test. That way the transaction manager cleans up for you, reducing the chance of errors[5]. Another approach is to do a single build of a mostly-immutable starting fixture before running a group of tests. Then ensure that the tests don't change that initial state (or if they do, they reverse the changes in tear-down). This tactic is more error-prone than rebuilding the fixture for each test, but it may be worthwhile iff it takes too long to build the fixture each time. Although databases are a common cause for isolation problems, there are plenty of times you can get these in-memory too. In particular be aware with static data and singletons. A good example for this kind of problem is contextual environment, such as the currently logged in user. If you have an explicit tear-down in a test, be wary of exceptions that occur during the tear-down. If this happens the test can pass, but cause isolation failures for subsequent tests. So ensure that if you do get a problem in a tear-down, it makes a loud noise. Some people prefer to put less emphasis on isolation and more on defining clear dependencies to force tests to run in a specified order. I prefer isolation because it gives you more flexibility in running subsets of tests and parallelizing tests. Asynchronous Behavior Asynchrony is a boon that allows you to keep software responsive while taking on long term tasks. Ajax calls allow a browser to stay responsive while going back to the server for more data, asynchronous message allow a server process to communicate with other system without being tied to their laggardly latency. But in testing, asynchrony can be curse. The common mistake here is to throw in a sleep: //pseudo-code makeAsyncCall; sleep(aWhile); readResponse; This can bite you two ways. First off you'll want to set the sleep time to long enough that it gives plenty of time to get the response. But that means that you'll spend a lot of time idly waiting for the response, thus slowing down your tests. The second bite is that, however long you sleep, sometimes it won't be enough. There will be some change in environment that will cause you to exceed the sleep - and you'll get false failure. As a result I strongly urge you to never use bare sleeps like this. Never use bare sleeps to wait for asynchonous responses: use a callback or polling. There are basically two tactics you can do for testing an asynchronous response. The first is for the asynchronous service to take a callback which it can call when done. This is the best since it means you'll never have to wait any longer than you need to [6]. The biggest problem with this is that the environment needs to be able to do this and then the service provider needs to do it. This is one of the advantages of having the development team integrated with testing - if they can provide a callback then they will. The second option is to poll on the answer. This is more than just looking once, but looking regularly, something like this //pseudo-code makeAsyncCall startTime = Time.now; while(! responseReceived) { if (Time.now - startTime > waitLimit) throw new TestTimeoutException; sleep (pollingInterval); } readResponse The point of this approach is that you can set the pollingInterval to a pretty small value, and know that that's the maximum amount of dead time you'll lose to waiting for a response. This means you can set the waitLimit very high, which minimizes the chance of hitting it unless something serious has gone wrong. [7] Make sure you use a clear exception class that indicates this is a test timeout that's failing. This will help make it clear what's gone wrong should it happen, and perhaps allow a more sophisticated test harness to take account of this information in its display. The time values, in particular the waitLimit, should never be literal values. Make sure they are always values that can be easily set in bulk, either by using constants or set through the runtime environment. That way if you need to tweak them (and you will) you can tweak them all quickly. All this advice is handy for async calls where you expect a response from the provider, but how about those where there is no response. These are calls where we invoke a command on something and expect it to happen without any acknowledgment. This is the trickiest case since you can test for your expected response, but there's nothing to do to detect a failure other than timing-out. If the provider is something you're building you can handle this by ensuring the provider implements some way of indicating that it's done - essentially some form of callback. Even if only the testing code uses it, it's worth it - although often you'll find this kind of functionality is valuable for other purposes too[8]. If the provider is someone else's work, you can try persuasion, but otherwise may be stuck. Although this is also a case when using Test Doubles for remote services is worthwhile (which I'll discuss more in the next section). If you have a general failure in something asynchronous, such that it's not responding at all, then you'll always be waiting for timeouts and your test suite will take a long time to fail. To combat this it's a good idea to use a smoke test to check that the asynchronous service is responding at all and stop the test run right away if it isn't. Gerard Meszaros's book, xUnit Test Patterns, contains lots of good patterns for constructing tests. You can also often side-step the asynchrony completely. Gerard Meszaros's Humble Object pattern says that whenever you have some logic that's in a hard-to-test environment, you should isolate the logic you need to test from that environment. In this case it means put most of the logic you need to test in a place where you can test it synchronously. The asynchronous behavior should be as minimal (humble) as possible, that way you don't need that much testing of it. Remote Services Sometimes I'm asked if ThoughtWorks does any integration work, which I find somewhat amusing since there's hardly any project we do that doesn't involve a fair bit of integration. By their nature, enterprise applications involve a great deal of combining data from different systems. These systems are maintained by other teams operating to their own schedules, teams that often use a very different software philosophy to our heavily test-driven agile approach. Testing with such remote systems brings a number of problems, and non-determinism is high on the list. Often remote systems don't have test system we can call, which means hitting a live system. If there is a test system, it may not be stable enough to provide deterministic responses. In this situation it's vital to ensure determinism, so it's time to reach for a Test Double - a component that looks like the remote service, but is really just a pretend version that mimics the remote system's behavior. The double needs to be setup so that provides the right kind of response in interaction with our system, but in a way we control. In this manner we can ensure determinism. Using a double has a downside, in particular when we are testing across a broad scope. How can we be sure that the double behaves in the same way that remote system does? We can tackle this again using tests, a form of test that I call Integration Contract Tests. These run the same interaction with the remote system and the double, and check that the two match. In this case 'match' may not mean coming up with the same result (due to the non-determinisms), but results that share the same essential structure. Integration Contract Tests need to be run frequently, but not part of our system's deployment pipeline. Periodic running based on the rate of the change of the remote system is usually best. For writing these kinds of test doubles, I'm a big fan of Self Initializing Fakes - since these are very simple to manage. Some people are firmly against using Test Doubles in functional tests, believing that you must test with real connection in order to ensure end-to-end behavior. While I sympathize with their argument, automated tests are useless if they are non-deterministic. So any advantage you gain by talking to the real system is overwhelmed by the need to stamp out non-determinism[9]. Time Few things are more non-deterministic than a call to the system clock. Each time you call it, you get a new result, and any tests that depend on it can thus change. Ask for all the todos due in the next hour, and you regularly get a different answer[10]. The most important thing here is to ensure that you always wrap the system clock with routines that can be replaced with a seeded value for testing. A clock stub can be set to particular time and frozen at that time, allowing your tests to have complete control over its movements. That way you can synchronize your test data to the values in the seeded clock.[11][12] Always wrap the system clock, so it can be easily substituted for testing. One thing to watch with this, is that eventually your test data might start having problems because it's too old, and you get conflicts with other time based factors in your application. In this case you can move the data, and your clock seeds to new values. When you do this, ensure that this is the only thing you do. That way you can be sure that any tests that fail are due to time-movement in the test data. Another area where time can be a problem is when you rely on other behaviors from the clock. I once saw a system that generated random keys based on clock values. This systems started failing when it was moved to a faster machine that could allocate multiple ids within a single clock tick.[13] I've heard so many problems due to direct calls to the system clock that I'd argue for finding a way to use code analysis to detect any direct calls to the system clock and failing the build right there. Even a simple regex check might save you a frustrating debugging session after a call at an ungodly hour. Resource Leaks If your application has some kind of resource leak, this will lead to random tests failing, since it's just which test causes the resource leak to go over a limit that gets the failure. This case is awkward because any test can fail intermittently due to this problem. If it isn't a case of one test being non-deterministic then resource leaks are a good candidate to investigate. By resource leak, I mean any resource that the application has to manage by acquiring and releasing. In non-memory-managed environments, the obvious example is memory. Memory-management did much to remove this problem, but other resources still need to be managed, such as database connections. Usually the best way to handle these kind of resources is through a Resource Pool. If you do this then a good tactic is to configure the pool to a size of 1 and make it throw an exception should it get a request for a resource when it has none left to give. That way the first test to request a resource after the leak will fail - which makes it a lot easier to find the problem test. This idea of limiting resource pool sizes, is about increasing constraints to make errors more likely to crop up in tests. This is good because we want errors to show in tests so we can fix them before they manifest themselves in production. This principle can be used in other ways too. One story I heard was of a system which generated randomly named temporary files, didn't clean them up properly, and crashed on a collision. This kind of bug is very hard to find, but one way to manifest it is to stub the randomizer for testing so it always returns the same value. That way you can surface the problem more quickly.
April 14, 2011
by Martin Fowler
· 6,681 Views · 1 Like
article thumbnail
Solr + Hadoop = Big Data Love
Bixo Labs shows how to use Solr as a NoSQL solution for big data Many people use the Hadoop open source project to process large data sets because it’s a great solution for scalable, reliable data processing workflows. Hadoop is by far the most popular system for handling big data, with companies using massive clusters to store and process petabytes of data on thousands of servers. Since it emerged from the Nutch open source web crawler project in 2006, Hadoop has grown in every way imaginable – users, developers, associated projects (aka the “Hadoop ecosystem”). Starting at roughly the same time, the Solr open source project has become the most widely used search solution on planet Earth. Solr wraps the API-level indexing and search functionality of Lucene with a RESTful API, GUI, and lots of useful administrative and data integration functionality. The interesting thing about combining these two open source projects is that you can use Hadoop to crunch the data, and then serve it up in Solr. And we’re not talking about just free-text search; Solr can be used as a key-value store (i.e. a NoSQL database) via its support for range queries. Even on a single server, Solr can easily handle many millions of records (“documents” in Lucene lingo). Even better, Solr now supports sharding and replication via the new, cutting-edge SolrCloud functionality. Background I started using Hadoop & Solr about five years ago, as key pieces of the Krugle code search startup I co-founded in 2005. Back then, Hadoop was still part of the Nutch web crawler we used to extract information about open source projects. And Solr was fresh out of the oven, having just been released as open source by CNET. At Bixo Labs we use Hadoop, Solr, Cascading, Mahout, and many other open source technologies to create custom data processing workflows. The web is a common source of our input data, which we crawl using the Bixo open source project. The Problem During a web crawl, the state of the crawl is contained in something commonly called a “crawl DB”. For broad crawls, this has to be something that works with billions of records, since you need one entry for each known URL. Each “record” has the URL as the key, and contains important state information such as the time and result of the last request. For Hadoop-based crawlers such as Nutch and Bixo, the crawl DB is commonly kept in a set of flat files, where each file is a Hadoop “SequenceFile”. These are just a packed array of serialized key/value objects. Sometimes we need to poke at this data, and here’s where the simple flat-file structure creates a problem. There’s no easy way run queries against the data, but we can’t store it in a traditional database since billions of records + RDBMS == pain and suffering. Here is where scalable NoSQL solutions shine. For example, the Nutch project is currently re-factoring this crawl DB layer to allow plugging in HBase. Other options include Cassandra, MongoDB, CouchDB, etc. But for simple analytics and exploration on smaller datasets, a Solr-based solution works and is easier to configure. Plus you get useful and surprising fun functionality like facets, geospatial queries, range queries, free-form text search, and lots of other goodies for free. Architecture So what exactly would such a Hadoop + Solr system look like? As mentioned previously, in this example our input data comes from a Bixo web crawler’s CrawlDB, with one entry for each known URL. But the input data could just as easily be log files, or records from a traditional RDBMS, or the output of another data processing workflow. The key point is that we’re going to take a bunch of input data, (optionally) munge it into a more useful format, and then generate a Lucene index that we access via Solr. Hadoop For the uninitiated, Hadoop implements both a distributed file system (aka “HDFS”) and an execution layer that supports the map-reduce programming model. Typically data is loaded and transformed during the map phase, and then combined/saved during the reduce phase. In our example, the map phase reads in Hadoop compressed SequenceFiles that contain the state of our web crawl, and our reduce phase write out Lucene indexes. The focus of this article isn’t on how to write Hadoop map-reduce jobs, but I did want to show you the code that implements the guts of the job. Note that it’s not typical Hadoop key/value manipulation code, which is painful to write, debug, and maintain. Instead we use Cascading, which is an open source workflow planning and data processing API that creates Hadoop jobs from shorter, more representative code. The snippet below reads SequenceFiles from HDFS, and pipes those records into a sink (output) that stores them using a LuceneScheme, which in turn saves records as Lucene documents in an index. Tap source = new Hfs(new SequenceFile(CRAWLDB_FIELDS), inputDir); Pipe urlPipe = new Pipe("crawldb urls"); urlPipe = new Each(urlPipe, new ExtractDomain()); Tap sink = new Hfs(new LuceneScheme(SOLR_FIELDS, STORE_SETTINGS, INDEX_SETTINGS, StandardAnalyzer.class, MAX_FIELD_LENGTH), outputDir, true); FlowConnector fc = new FlowConnector(); fc.connect(source, sink, urlPipe).complete(); We defined CRAWLDB_FIELDS and SOLR_FIELDS to be the set of input and output data elements, using names like “url” and “status”. We take advantage of the Lucene Scheme that we’ve created for Cascading, which lets us easily map from Cascading’s view of the world (records with fields) to Lucene’s index (documents with fields). We don’t have a Cascading Scheme that directly supports Solr (wouldn’t that be handy?), but we can make-do for now since we can do simple analysis for this example. We indexed all of the fields so that we can perform queries against them. Only the status message contains normal English text, so that’s the only one we have to analyze (i.e., break the text up into terms using spaces and other token delimiters). In addition, the ExtractDomain operation pulls the domain from the URL field and builds a new Solr field containing just the domain. This will allow us to do queries against the domain of the URL as well as the complete URL. We could also have chosen to apply a custom analyzer to the URL to break it into several pieces (i.e., protocol, domain, port, path, query parameters) that could have been queried individually. Running the Hadoop Job For simplicity and pay-as-you-go, it’s hard to beat Amazon’s EC2 and Elastic Mapreduce offerings for running Hadoop jobs. You can easily spin up a cluster of 50 servers, run your job, save the results, and shut it down – all without needing to buy hardware or pay for IT support. There are many ways to create and configure a Hadoop cluster; for us, we’re very familiar with the (modified) EC2 Hadoop scripts that you can find in the Bixo distribution. Step-by-step instructions are available at http://openbixo.org/documentation/running-bixo-in-ec2/ The code for this article is available via GitHub, at http://github.com/bixolabs/hadoop2solr. The README displayed on that page contains step-by-step instructions for building and running the job. After the job is done, we’ll copy the resulting index out of the Hadoop distributed file system (HDFS) and onto the Hadoop cluster’s master server, then kill off the one slave we used. The Hadoop master is now ready to be configured as our Solr server. Solr On the Solr side of things, we need to create a schema that matches the index we’re generating. The key section of our schema.xml file is where we define the fields. These fields have a one-to-one correspondence with the SOLR_FIELDS we defined in our Hadoop workflow. They also need to use the same Lucene settings as what we defined in the static IndexWorkflow.java STORE_SETTINGS and INDEX_SETTINGS. Once we have this defined, all that’s left is to set up a server that we can use. To keep it simple, we’ll use the single EC2 instance in Amazon’s cloud (m1.large) that we used as our master for the Hadoop job, and run the simple Solr search server that relies on embedded Jetty to provide the webapp container. Similar to the Hadoop job, step-by-step instructions are in the README for the hadoop2solr project on GitHub. But in a nutshell, we’ll copy and unzip a Solr 1.4.1 setup on the EC2 server, do the same for our custom Solr configuration, create a symlink to the index, and then start it running with: Giving it a Try Now comes the interesting part. Since we opened up the default Jetty port used by Solr (8983) on this EC2 instance, we can directly access Solr’s handy admin console by pointing our browser at http://:8983/solr/admin % cd solr % java -Dsolr.solr.home=../solr-conf -Dsolr.data.dir=../solr-data -jar start.jar From here we can run queries against Solr: We can also use curl to talk to the server via HTTP requests: curl http://:8983/solr/select/?q=-status%3AFETCHED+and+-status%3AUNFETCHED The response is XML by default. Below is an example of the response from the above request, where we found 2,546 matches in 94ms. Now here’s what I find amazing. For an index of 82 million documents, running on a fairly wimpy box (EC2 m1.large = 2 virtual cores), the typical response time for a simple query like “status:FETCHED” is only 400 milliseconds, to find 9M documents. Even a complex query such as (status not FETCHED and not UNFETCHED) only takes six seconds. Scaling Obviously we could use beefier boxes. If we switched to something like m1.xlarge (15GB of memory, 4 virtual cores) then it’s likely we could handle upwards of 200M “records” in our Solr index and still get reasonable response times. If we wanted to scale beyond a single box, there are a number of solutions. Even out of the box Solr supports sharding, where your HTTP request can specify multiple servers to use in parallel. More recently, the Solr trunk has support for SolrCloud. This uses the ZooKeeper open source project to simplify coordination of multiple Solr servers. Finally, the Katta open source project supports Lucene-level distributed search, with many of the features needed for production quality distributed search that have not yet been added to SolrCloud. Summary The combination of Hadoop and Solr makes it easy to crunch lots of data and then quickly serve up the results via a fast, flexible search & query API. Because Solr supports query-style requests, it’s suitable as a NoSQL replacement for traditional databases in many situations, especially when the size of the data exceeds what is reasonable with a typical RDBMS. Solr has some limitations that you should be aware of, specifically: · Updating the index works best as a batch operation. Individual records can be updated, but each commit (index update) generates a new Lucene segment, which will impact performance. · Current support for replication, fail-over, and other attributes that you’d want in a production-grade solution aren’t yet there in SolrCloud. If this matters to you, consider Katta instead. · Many SQL queries can’t be easily mapped to Solr queries. The code for this article is available via GitHub, at http://github.com/bixolabs/hadoop2solr. The README displayed on that page contains additional technical details.
April 4, 2011
by Ken Krugler
· 119,578 Views
  • Previous
  • ...
  • 517
  • 518
  • 519
  • 520
  • 521
  • 522
  • 523
  • 524
  • 525
  • 526
  • ...
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×