Databases Resources

The Latest Databases Topics

Dissecting the Disruptor: Why it's so fast (part one) - Locks Are Bad

martin fowler has written a really good article describing not only the disruptor , but also how it fits into the architecture at lmax . this gives some of the context that has been missing so far, but the most frequently asked question is still "what is the disruptor?". i'm working up to answering that. i'm currently on question number two: "why is it so fast?". these questions do go hand in hand, however, because i can't talk about why it's fast without saying what it does, and i can't talk about what it is without saying why it is that way. so i'm trapped in a circular dependency. a circular dependency of blogging. to break the dependency, i'm going to answer question one with the simplest answer, and with any luck i'll come back to it in a later post if it still needs explanation: the disruptor is a way to pass information between threads. as a developer, already my alarm bells are going off because the word "thread" was just mentioned, which means this is about concurrency, and concurrency is hard. concurrency 101 imagine two threads are trying to change the same value. case one: thread 1 gets there first: the value changes to "blah" then the value changes to "blahy" when thread 2 gets there. case two: thread 2 gets there first: the value changes to "fluffy" then the value changes to "blah" when thread 1 gets there. case three: thread 1 interrupts thread 2: thread 2 gets the value "fluff" and stores it as myvalue thread 1 goes in and updates value to "blah" then thread 2 wakes up and sets the value to "fluffy". case three is probably the only one which is definitely wrong, unless you think the naive approach to wiki editing is ok ( google code wiki, i'm looking at you...). in the other two cases it's all about intentions and predictability. thread 2 might not care what's in value, the intention might be to append "y" to whatever is in there regardless. in this circumstance, cases one and two are both correct. but if thread 2 only wanted to change "fluff" to "fluffy", then both cases two and three are incorrect. assuming that thread 2 wants to set the value to "fluffy", there are some different approaches to solving the problem. approach one: pessimistic locking (does the "no entry" sign make sense to people who don't drive in britain?) the terms pessimistic and optimistic locking seem to be more commonly used when talking about database reads and writes, but the principal applies to getting a lock on an object. thread 2 grabs a lock on entry as soon as it knows it needs it and stops anything from setting it. then it does its thing, sets the value, and lets everything else carry on. you can imagine this gets quite expensive, with threads hanging around all over the place trying to get hold of objects and being blocked. the more threads you have, the more chance that things are going to grind to a halt. approach two: optimistic locking in this case thread 2 will only lock entry when it needs to write to it. in order to make this work, it needs to check if entry has changed since it first looked at it. if thread 1 came in and changed the value to "blah" after thread 2 had read the value, thread 2 couldn't write "fluffy" to the entry and trample all over the change from thread 1. thread 2 could either re-try (go back, read the value, and append "y" onto the end of the new value), which you would do if thread 2 didn't care what the value it was changing was; or it could throw an exception or return some sort of failed update flag if it was expecting to change "fluff" to "fluffy". an example of this latter case might be if you have two users trying to update a wiki page, and you tell the user on the other end of thread 2 they'll need to load the new changes from thread 1 and then reapply their changes. potential problem: deadlock locking can lead to all sorts of issues, for example deadlock. imagine two threads that need access to two resources to do whatever they need to do: if you've used an over-zealous locking technique, both threads are going to sit there forever waiting for the other one to release its lock on the resource. that's when you reboot windows your computer. definite problem: locks are sloooow... the thing about locks is that they need the operating system to arbitrate the argument. the threads are like siblings squabbling over a toy, and the os kernel is the parent that decides which one gets it. it's like when you run to your dad to tell him your sister has nicked the transformer when you wanted to play with it - he's got bigger things to worry about than you two fighting, and he might finish off loading the dishwasher and putting on the laundry before settling the argument. if you draw attention to yourself with a lock, not only does it take time to get the operating system to arbitrate, the os might decide the cpu has better things to do than servicing your thread. the disruptor paper talks about an experiment we did. the test calls a function incrementing a 64-bit counter in a loop 500 million times. for a single thread with no locking, the test takes 300ms. if you add a lock (and this is for a single thread, no contention, and no additional complexity other than the lock) the test takes 10,000ms. that's, like, two orders of magnitude slower. even more astounding, if you add a second thread (which logic suggests should take maybe half the time of the single thread with a lock) it takes 224,000ms. incrementing a counter 500 million times takes nearly a thousand times longer when you split it over two threads instead of running it on one with no lock. concurrency is hard and locks are bad i'm just touching the surface of the problem, and obviously i'm using very simple examples. but the point is, if your code is meant to work in a multi-threaded environment, your job as a developer just got a lot more difficult: naive code can have unintended consequences. case three above is an example of how things can go horribly wrong if you don't realise you have multiple threads accessing and writing to the same data. selfish code is going to slow your system down. using locks to protect your code from the problem in case three can lead to things like deadlock or simply poor performance. this is why many organisations have some sort of concurrency problems in their interview process (certainly for java interviews). unfortunately it's very easy to learn how to answer the questions without really understanding the problem, or possible solutions to it. how does the disruptor address these issues? for a start, it doesn't use locks. at all. instead, where we need to make sure that operations are thread-safe (specifically, updating the next available sequence number in the case of multiple producers ), we use a cas (compare and swap/set) operation. this is a cpu-level instruction, and in my mind it works a bit like optimistic locking - the cpu goes to update a value, but if the value it's changing it from is not the one it expects, the operation fails because clearly something else got in there first. note this could be two different cores rather than two separate cpus. cas operations are much cheaper than locks because they don't involve the operating system, they go straight to the cpu. but they're not cost-free - in the experiment i mentioned above, where a lock-free thread takes 300ms and a thread with a lock takes 10,000ms, a single thread using cas takes 5,700ms. so it takes less time than using a lock, but more time than a single thread that doesn't worry about contention at all. back to the disruptor - i talked about the claimstrategy when i went over the producers . in the code you'll see two strategies, a singlethreadedstrategy and a multithreadedstrategy. you could argue, why not just use the multi-threaded one with only a single producer? surely it can handle that case? and it can. but the multi-threaded one uses an atomiclong (java's way of providing cas operations), and the single-threaded one uses a simple long with no locks and no cas. this means the single-threaded claim strategy is as fast as possible, given that it knows there is only one producer and therefore no contention on the sequence number. i know what you're thinking: turning one single number into an atomiclong can't possibly have been the only thing that is the secret to the disruptor's speed. and of course, it's not - otherwise this wouldn't be called "why it's so fast (part one )". but this is an important point - there's only one place in the code where multiple threads might be trying to update the same value. only one place in the whole of this complicated data-structure-slash-framework. and that's the secret. remember everything has its own sequence number? if you only have one producer then every sequence number in the system is only ever written to by one thread. that means there is no contention. no need for locks. no need even for cas. the only sequence number that is ever written to by more than one thread is the one on the claimstrategy if there is more than one producer. this is also why each variable in the entry can only be written to by one consumer . it ensures there's no write contention, therefore no need for locks or cas. back to why queues aren't up to the job so you start to see why queues, which may implemented as a ring buffer under the covers, still can't match the performance of the disruptor. the queue, and the basic ring buffer , only has two pointers - one to the front of the queue and one to the end: if more than one producer wants to place something on the queue, the tail pointer will be a point of contention as more than one thing wants to write to it. if there's more than one consumer, then the head pointer is contended, because this is not just a read operation but a write, as the pointer is updated when the item is consumed from the queue. but wait, i hear you cry foul! because we already knew this, so queues are usually single producer and single consumer (or at least they are in all the queue comparisons in our performance tests). there's another thing to bear in mind with queues/buffers. the whole point is to provide a place for things to hang out between producers and consumers, to help buffer bursts of messages from one to the other. this means the buffer is usually full (the producer is out-pacing the consumer) or empty (the consumer is out-pacing the producer). it's rare that the producer and consumer will be so evenly-matched that the buffer has items in it but the producers and consumers are keeping pace with each other. so this is how things really look. an empty queue: ...and a full queue: the queue needs a size so that it can tell the difference between empty and full. or, if it doesn't, it might determine that based on the contents of that entry, in which case reading an entry will require a write to erase it or mark it as consumed. whichever implementation is chosen, there's quite a bit of contention around the tail, head and size variables, or the entry itself if a consume operation also includes a write to remove it. on top of this, these three variables are often in the same cache line , leading to false sharing . so, not only do you have to worry about the producer and the consumer both causing a write to the size variable (or the entry), updating the tail pointer could lead to a cache-miss when the head pointer is updated because they're sat in the same place. i'm going to duck out of going into that in detail because this post is quite long enough as it is. so this is what we mean when we talk about "teasing apart the concerns" or a queue's "conflated concerns". by giving everything its own sequence number and by allowing only one consumer to write to each variable in the entry, the only case the disruptor needs to manage contention is where more than one producer is writing to the ring buffer. in summary the disruptor a number of advantages over traditional approaches: no contention = no locks = it's very fast. having everything track its own sequence number allows multiple producers and multiple consumers to use the same data structure. tracking sequence numbers at each individual place (ring buffer, claim strategy, producers and consumers), plus the magic cache line padding , means no false sharing and no unexpected contention. from http://mechanitis.blogspot.com/2011/07/dissecting-disruptor-why-its-so-fast.html

July 23, 2011

by Trisha Gee

· 12,776 Views · 1 Like

Testing Entity Validations with a Mock Entity - Roo in Action Corner

In Spring Roo in Action, Chapter 3, I discuss how Roo automatically executes the Bean Validators when persisting a live entity. However, when running unit tests, we don't have a live entity at all, nor do we have a Spring container - so how can we exercise the validation without actually hitting our Roo application and the database? The following post is ancillary material from the upcoming book Spring Roo in Action, by Ken Rimple and Srini Penchikala, with Gordon Dickens. You can purchase the MEAP edition of the book, and participate in the author forum, at www.manning.com/rimple. The answer is that we have to bootstrap the validation framework within the test ourselves. We can use the CourseDataOnDemand class's getNewTransientEntityName method to generate a valid, transient JPA entity. Then, we can: Mock static entity methods, such as findById, to bring back pre-fabricated class instances of your entity Initialize the validation engine, bootstrapping a JSR-303 bean validation framework engine, and perform validation on your entity Set any appropriate properties to apply to a particular test condition Initialize a test instance of the entity validator and assert the appropriate validation results are returned The concept in action... Given a Student entity with the following definition: @RooEntity @RooJavaBean @RooToString public class Student { @NotNull private String emergencyContactInfo; ... } The listing below shows a unit test method that ensures the NotNull validation fires against missing emergency contact information on the Student entity: @Test public void testStudentMissingEmergencyContactValidation() { // setup our test data StudentDataOnDemand dod = new StudentDataOnDemand(); // tell the mock to expect this call Student.findStudent(1L); // tell the mocking API to expect a return from the prior call in the form of // a new student from the test data generator, dod AnnotationDrivenStaticEntityMockingControl.expectReturn( dod.getNewTransientStudent(0)); // put our mock in playback mode AnnotationDrivenStaticEntityMockingControl.playback(); // Setup the validator API in our unit test LocalValidatorFactoryBean validator = new LocalValidatorFactoryBean(); validator.afterPropertiesSet(); // execute the call from the mock, set the emergency contact field // to an invalid value Student student = Student.findStudent(1L); student.setEmergencyContactInfo(null); // execute validation, check for violations Set> violations = validator.validate(student, Default.class); // do we have one? Assert.assertEquals(1, violations.size()); // now, check the constraint violations to check for our specific error ConstraintViolation violation = violations.iterator().next(); // contains the right message? Assert.assertEquals("{javax.validation.constraints.NotNull.message}", violation.getMessageTemplate()); // from the right field? Assert.assertEquals("emergencyContactInfo", violation.getPropertyPath().toString()); } Analysis The test starts with a declaration of a StudentOnDemand object, which we'll use to generate our test data. We'll get into the more advanced uses of the DataOnDemand Framework later in the chapter. For now, keep in mind that we can use this class to create an instance of an Entity, with randomly assigned, valid data. We then require that the test calls the Student.findStudent method, passing it a key of 1L. Next, we'll tell the entity mocking framework that the call should return a new transient Student instance. At this point, we've defined our static mocking behavior, so we'll put the mocking framework into playback mode. Next, we issue the actual Student.findById(1L) call, this time storing the result as the member variable student. This call will trip the mock, which will return a new transient instance. We then set the emergencyContactInfo field to null, so that it becomes invalid, as it is annotated with a @NotNull annotation. Now we are ready to set up our bean validation framework. We create a LocalValidatorFactoryBean instance, which will boot the Bean Validation Framework in the afterPropertiesSet() method, which is defined for any Spring Bean implementing InitializingBean. We must call this method ourselves, because Spring is not involved in our unit test. Now we're ready to run our validation and assert the proper behavior has occurred. We call our validator's validate method, passing it the student instance and the standard Default validation group, which will trigger validation. We'll then check that we only have one validation failure, and that the message template for the error is the same as the one for the @NotNull validation. We also check to ensure that the field that caused the validation was our emergencyContactInfo field. In our answer callback, we can launch the Bean Validation Framework, and execute the validate method against our entity. In this way, we can exercise our bean instance any way we want, and instead of persisting the entity, can perform the validation phase and exit gracefully. Caveats... There are a few things slightly wrong here. First of all, the Data on Demand classes actually use Spring to inject relationships to each other, which I've logged a bug against as ROO-2497. You can override the setup of the data on demand class and manually create the DoD of the referring one, which is fine. They have slated to work on this bug for Roo 1.2, so it should be fixed sometime in the next few months. Also, realize that this is NOT easy to do, compared to writing an integration test. However, this test runs markedly faster. If you have some sophisticated logic that you've attached to a @AssertTrue annotation, this is the way to test it in isolation. From http://www.rimple.com/tech/2011/7/17/testing-entity-validations-with-a-mock-entity-roo-in-action.html

July 20, 2011

by Ken Rimple

· 14,634 Views

Lucene's near-real-time search is fast!

Lucene's near-real-time (NRT) search feature, available since 2.9, enables an application to make index changes visible to a new searcher with fast turnaround time. In some cases, such as modern social/news sites (e.g., LinkedIn, Twitter, Facebook, Stack Overflow, Hacker News, DZone, etc.), fast turnaround time is a hard requirement. Fortunately, it's trivial to use. Just open your initial NRT reader, like this: // w is your IndexWriter IndexReader r = IndexReader.open(w, true); (That's the 3.1+ API; prior to that use w.getReader() instead). The returned reader behaves just like one opened with IndexReader.open: it exposes the point-in-time snapshot of the index as of when it was opened. Wrap it in an IndexSearcher and search away! Once you've made changes to the index, call r.reopen() and you'll get another NRT reader; just be sure to close the old one. What's special about the NRT reader is that it searches uncommitted changes from IndexWriter, enabling your application to decouple fast turnaround time from index durability on crash (i.e., how often commit is called), something not previously possible. Under the hood, when an NRT reader is opened, Lucene flushes indexed documents as a new segment, applies any buffered deletions to in-memory bit-sets, and then opens a new reader showing the changes. The reopen time is in proportion to how many changes you made since last reopening that reader. Lucene's approach is a nice compromise between immediate consistency, where changes are visible after each index change, and eventual consistency, where changes are visible "later" but you don't usually know exactly when. With NRT, your application has controlled consistency: you decide exactly when changes must become visible. Recently there have been some good improvements related to NRT: New default merge policy, TieredMergePolicy, which is able to select more efficient non-contiguous merges, and favors segments with more deletions. NRTCachingDirectory takes load off the IO system by caching small segments in RAM (LUCENE-3092). When you open an NRT reader you can now optionally specify that deletions do not need to be applied, making reopen faster for those cases that can tolerate temporarily seeing deleted documents returned, or have some other means of filtering them out (LUCENE-2900). Segments that are 100% deleted are now dropped instead of inefficiently merged (LUCENE-2010). How fast is NRT search? I created a simple performance test to answer this. I first built a starting index by indexing all of Wikipedia's content (25 GB plain text), broken into 1 KB sized documents. Using this index, the test then reindexes all the documents again, this time at a fixed rate of 1 MB/second plain text. This is a very fast rate compared to the typical NRT application; for example, it's almost twice as fast as Twitter's recent peak during this year's superbowl (4,064 tweets/second), assuming every tweet is 140 bytes, and assuming Twitter indexed all tweets on a single shard. The test uses updateDocument, replacing documents by randomly selected ID, so that Lucene is forced to apply deletes across all segments. In addition, 8 search threads run a fixed TermQuery at the same time. Finally, the NRT reader is reopened once per second. I ran the test on modern hardware, a 24 core machine (dual x5680 Xeon CPUs) with an OCZ Vertex 3 240 GB SSD, using Oracle's 64 bit Java 1.6.0_21 and Linux Fedora 13. I gave Java a 2 GB max heap, and used MMapDirectory. The test ran for 6 hours 25 minutes, since that's how long it takes to re-index all of Wikipedia at a limited rate of 1 MB/sec; here's the resulting QPS and NRT reopen delay (milliseconds) over that time: The search QPS is green and the time to reopen each reader (NRT reopen delay in milliseconds) is blue; the graph is an interactive Dygraph, so if you click through above, you can then zoom in to any interesting region by clicking and dragging. You can also apply smoothing by entering the size of the window into the text box in the bottom left part of the graph. Search QPS dropped substantially with time. While annoying, this is expected, because of how deletions work in Lucene: documents are merely marked as deleted and thus are still visited but then filtered out, during searching. They are only truly deleted when the segments are merged. TermQuery is a worst-case query; harder queries, such as BooleanQuery, should see less slowdown from deleted, but not reclaimed, documents. Since the starting index had no deletions, and then picked up deletions over time, the QPS dropped. It looks like TieredMergePolicy should perhaps be even more aggressive in targeting segments with deletions; however, finally around 5:40 a very large merge (reclaiming many deletions) was kicked off. Once it finished the QPS recovered somewhat. Note that a real NRT application with deletions would see a more stable QPS since the index in "steady state" would always have some number of deletions in it; starting from a fresh index with no deletions is not typical. Reopen delay during merging The reopen delay is mostly around 55-60 milliseconds (mean is 57.0), which is very fast (i.e., only 5.7% "duty cycle" of the every 1.0 second reopen rate). There are random single spikes, which is caused by Java running a full GC cycle. However, large merges can slow down the reopen delay (once around 1:14, again at 3:34, and then the very large merge starting at 5:40). Many small merges (up to a few 100s of MB) were done but don't seem to impact reopen delay. Large merges have been a challenge in Lucene for some time, also causing trouble for ongoing searching. I'm not yet sure why large merges so adversely impact reopen time; there are several possibilities. It could be simple IO contention: a merge keeps the IO system very busy reading and writing many bytes, thus interfering with any IO required during reopen. However, if that were the case, NRTCachingDirectory (used by the test) should have prevented it, but didn't. It's also possible that the OS is [poorly] choosing to evict important process pages, such as the terms index, in favor of IO caching, causing the term lookups required when applying deletes to hit page faults; however, this also shouldn't be happening in my test since I've set Linux's swappiness to 0. Yet another possibility is Linux's write cache becomes temporarily too full, thus stalling all IO in the process until it clears; in this case perhaps tuning some of Linux's pdflush tunables could help, although I'd much rather find a Lucene-only solution so this problem can be fixed without users having to tweak such advanced OS tunables, even swappiness. Fortunately, we have an active Google Summer of Code student, Varun Thacker, working on enabling Directory implementations to pass appropriate flags to the OS when opening files for merging (LUCENE-2793 and LUCENE-2795). From past testing I know that passing O_DIRECT can prevent merges from evicting hot pages, so it's possible this will fix our slow reopen time as well since it bypasses the write cache. Finally, it's always possible other OSs do a better job managing the buffer cache, and wouldn't see such reopen delays during large merges. This issue is still a mystery, as there are many possibilities, but we'll eventually get to the bottom of it. It could be we should simply add our own IO throttling, so we can control net MB/sec read and written by merging activity. This would make a nice addition to Lucene! Except for the slowdown during merging, the performance of NRT is impressive. Most applications will have a required indexing rate far below 1 MB/sec per shard, and for most applications reopening once per second is fast enough. While there are exciting ideas to bring true real-time search to Lucene, by directly searching IndexWriter's RAM buffer as Michael Busch has implemented at Twitter with some cool custom extensions to Lucene, I doubt even the most demanding social apps actually truly need better performance than we see today with NRT. NIOFSDirectory vs MMapDirectory Out of curiosity, I ran the exact same test as above, but this time with NIOFSDirectory instead of MMapDirectory: There are some interesting differences. The search QPS is substantially slower -- starting at 107 QPS vs 151, though part of this could easily be from getting different compilation out of hotspot. For some reason TermQuery, in particular, has high variance from one JVM instance to another. The mean reopen time is slower: 67.7 milliseconds vs 57.0, and the reopen time seems more affected by the number of segments in the index (this is the saw-tooth pattern in the graph, matching when minor merges occur). The takeaway message seems clear: on Linux, use MMapDirectory not NIOFSDirectory! Optimizing your NRT turnaround time My test was just one datapoint, at a fixed fast reopen period (once per second) and at a high indexing rate (1 MB/sec plain text). You should test specifically for your use-case what reopen rate works best. Generally, the more frequently you reopen the faster the turnaround time will be, since fewer changes need to be applied; however, frequent reopening will reduce the maximum indexing rate. Most apps have relatively low required indexing rates compared to what Lucene can handle and can thus pick a reopen rate to suit the application's turnaround time requirements. There are also some simple steps you can take to reduce the turnaround time: Store the index on a fast IO system, ideally a modern SSD. Install a merged segment warmer (see IndexWriter.setMergedSegmentWarmer). This warmer is invoked by IndexWriter to warm up a newly merged segment without blocking the reopen of a new NRT reader. If your application uses Lucene's FieldCache or has its own caches, this is important as otherwise that warming cost will be spent on the first query to hit the new reader. Use only as many indexing threads as needed to achieve your required indexing rate; often 1 thread suffices. The fewer threads used for indexing, the faster the flushing, and the less merging (on trunk). If you are using Lucene's trunk, and your changes include deleting or updating prior documents, then use the Pulsing codec for your id field since this gives faster lookup performance which will make your reopen faster. Use the new NRTCachingDirectory, which buffers small segments in RAM to take load off the IO system (LUCENE-3092). Pass false for applyDeletes when opening an NRT reader, if your application can tolerate seeing deleted doccs from the returned reader. While it's not clear that thread priorities actually work correctly (see this Google Tech Talk), you should still set your thread priorities properly: the thread reopening your readers should be highest; next should be your indexing threads; and finally lowest should be all searching threads. If the machine becomes saturated, ideally only the search threads should take the hit. Happy near-real-time searching!

July 11, 2011

by Michael Mccandless

· 19,137 Views

SEVERE: Error in xpath:java.lang.RuntimeException: solrconfig.xml missing luceneMatchVersion

One of the things that changed from Solr 1.4.1 to 1.5+ was the introduction of a parameter to tell Solr / Lucene which kind of compability version its index files should be created and used in. Solr now refuses to start if you do not provide this setting (if you’re upgrading a previous installation from 1.4.1 or earlier). The fix isn’t really straight forward, and you’ll probably have to recreate your index files if you’re just arriving at the scene with Solr / Lucene 3.2 and 4.0. Solr 3.0 (1.5) might be able to upgrade the files from the 2.9 version, but if you’re jumping from Lucene 2.9 to 4.0, the easiest solution seems to be to delete the current index and reindex (set up replication, disable replication from the master, query the slave while reindexing the master, etc.. and you’ll have no downtime while doing this!). You’ll need to add a parameter to your solrconfig.xml file as well in the section. LUCENE_CURRENT Other valid values are LUCENE_30, LUCENE_31, LUCENE_32 and LUCENE_40. These values represent specific versions of the index structure, while LUCENE_CURRENT will use the version depending on which particular release of Lucene you’re using. The version format will be upgraded automagically between most releases, so you’ll probably be fine by using LUCENE_CURRENT. If you however are trying to load index files that are more than one version older, you may have to use one of the other values.

July 9, 2011

by Mats Lindh

· 10,310 Views

The era of Object-Document Mapping

The Data Mapper pattern is a mechanism for persistence where the application model and the data source have no dependencies between each other. For example, a group of PHP classes and a relational database may be used together without having the PHP classes extends a base Active Record class, thanks to a Data Mapper like Doctrine 2. But everytime we talk about the Data Mapper pattern, we assume there is a relational database on the other side of the persistence boundary. We always save objects; we always map them to MySQL or Postgres tables; but it's not mandatory. In fact, you can store objects also in NoSQL stores, with no more impedance mismatch that the one already existing with relational databases. Doctrine is expanding with two projects which have the goal of replicating the easy object-relational mapping of Doctrine 2 to other stores. These two projects are the MongoDB and CouchDB Object-Document mappers. Since the basic unit of persistence is the document in both cases, objects (or group of them) are stored as documents. The retrieval process depends on the features of the underlying database: it may consists of simple queries (MongoDB) or just of a unique identifier (CouchDB). How do an ODM work? If you use Doctrine 2, you'll find out that many of its concepts translates well to the other mappers. Instead of Entities, we have Documents in our model: id = "myuser-1234"; $user->username = "test"; $this->dm->persist($user); $this->dm->flush(); $this->dm->clear(); $userNew = $this->dm->find($this->type, $user->id); There is a lot of reuse between the various projects: Doctrine\Common is a dependency for them all, and you will be able to use the same ArrayCollection with any of the mappers. Features There's really all I would need to get started: storage of new documents, update, removal; pretty much what you expect; Unit Of Work pattern for tracking what has been modified and execute a single flush() to save changes; support for collections and both one-to-many or many-to-one associations; support for embedded documents (read Value Objects); both embedding of one or many objects as document fields is possible. This is a striking difference with respect to ORMs: Doctrine 2 does not support Value Objects mapping. cascade of persistence and removal of objects; specification of the mapping metadata via annotations, XML, YAML or PHP (like for Doctrine 2). Projects status The concept of Object-Document Mapper, at least in this name, is pretty unique to PHP and Ruby. Mandango is a PHP alternative and Mongoid a Ruby one, for the MongoDB case. Both projects are in the midst of releasing their first stable version: we are talking about bleeding edge technology here. The MongoDB mapper has been around for more than an year and is now in beta3. The CouchDB has an alpha release, but I found out that the current version shipped via PEAR is broken: if you want to play with it, checkout it from github and initialize its submodules (the standalone Symfony components and Doctrine\Common): git clone https://github.com/doctrine/couchdb-odm.git couchdb_odm cd couchdb_odm git submodule init git submodule update For the MongoDB mapper, the PEAR installation should suffice: pear install pear.doctrine-project.org/DoctrineMongoDBODM-1.0.0BETA3 Conclusions I think I'm going to explore more the CouchDB ODM, since it is a departure from the relational model. MongoDB is still similar to traditional databases in the way queries are satisfied: specifying a set of constraints like in SQL WHERE clauses. I'm more interested instead in approached that skip the Mapper for reporting (like Command-Query Responsibility Separation), and maybe CouchDB could be a good fit. By the way, object-document mapping is an alternative to explore: Data Mappers are all the rage today in PHP (they already were yesterday in other languages.) If we already accept that relational databases aren't always the solution, extending NoSQL with Data Mappers is a natural choice.

July 7, 2011

by Giorgio Sironi

· 20,977 Views

Is Object Serialization Evil?

In my daily work, I use both an RDBMS and MarkLogic, an XML database. MarkLogic can be considered akin to the newer NoSQL databases, but it has the added structure of XML and standard languages in XQuery and XPath. The NoSQL databases are typically storing documents or key-value pairs, and some other things in between. Given that any datastore will be searched at some point, you will always care how the data is actually stored or whether there is some way to query it easily. Once you start thinking about the problem, you quickly generalize to the “how do I persist any type of data” question. However, my focus is not going to be the comparison of the various data stores, but the comparison of how data is stored. More specifically, I want to show the object serialization, mainly the Java built in method, as a data persistence format is evil. Given what you normally read on this blog, this may seem like an oddly timed post, but I have run into serialization issues lately in some production code and Mark Needham recently wrote an interesting post about this as well. Coincidentally, Mark is also working with MarkLogic, and there is an interesting item in his post: The advantage of doing things this way [using lightweight wrappers] is that it means we have less code to write than we would with the serialisation/deserialisation approach although it does mean that we’re strongly coupled to the data format that our storage mechanism uses. However, since this is one bit of the architecture which is not going to change it seems to makes sense to accept the leakage of that layer. The interesting part of this is that he has accepted using the data format of the storage mechanism, XML in MarkLogic in this case. Why is this interesting? First, it is a move away from the ORM technologies that try to hide the complexities of converting data into objects in the RDBMS world. Also, this is a glimpse into the types of issues that could arise from non-RDBMS storage choices as well as how to persist objects in general. So, an RDBMS is typically used to map object attributes to a table and columns. The mapping is mostly straightforward with some defined relationship for child objects and collections. This is a well-known area, called Object-Relational Mapping (ORM), and several open source and commercial options exist. In this scenario, object attributes are stored in a similar datatype, meaning a String is stored as a varchar and an int is stored as an integer. But, what happens when you move away from an RDBMS for data persistence? If you look at Java and its session objects, pure object serialization is used. Assuming that an application session is fairly short-lived, meaning at most a few hours, object serialization is simple, well supported and built into the Java concept of a session. However, when the data persistence is over a longer period of time, possibly days or weeks, and you have to worry about new releases of the application, serialization quickly becomes evil. As any good Java developer knows, if you plan to serialize an object, even in a session, you need a real serialization ID (serialVersionUID), not just a 1L, and you need to implement the Serializable interface. However, most developers do not know the real rules behind the Java deserialization process. If your object has changed, more than just adding simple fields to the object, it is possible that Java cannot deserialize the object correctly even if the serialization ID has not changed. Suddenly, you cannot retrieve your data any longer, which is inherently bad. Now, may developers reading this may say that they would never write code that would have this problem. That may be true, but what about a library that you use or some other developer no longer employed by your company? Can you guarantee that this problem will never happen? The only way to guarantee that is to use a different serialization method. What options do we have? Obviously, there are the NoSQL datastores but the actual object format is the relevant question not which solution to choose. Besides the obvious serialized object, some NoSQL datastores use JSON to store objects, MarkLogic uses XML and there are others that store just key-value pairs. Key-value pairs are typically a mapping of a text key to a value that is a serialized object, either a binary or textual format. So, that leaves us with XML, JSON and other textual formats. One of the benefits of a structured format like XML or JSON is that they can be made searchable and provide some level of context. I have talked about data formats before, so I won’t go into a comparison again. However, do these types of formats avoid the issues that native Java object serialization has? This is really dependent upon what library you are using for serialization. Some libraries will deserialize an object without any issues regardless of whether the object field list has changed. Other libraries could have problems depending upon whether a serialized field exists in the target object, or there might not be solid support for collections (though that is doubtful at this point). Given that even structured formats could have serialization issues, is the only safe path hand-coded mappings like those used by ORM tools? Some JSON and XML serialization tools use the same mapping methods as the ORM tools in order to avoid these problems. However, once you define these mappings, you are explicitly stating how an object gets translated. This explicit definition will require maintenance, but that is definitely cleaner than trying to trace down a serialization defect in some random stack trace. So is implicit object serialization really worth the potential headaches? Or should we just consider it evil and never speak of it again? From http://regulargeek.com/2011/07/06/is-object-serialization-evil/

July 7, 2011

by Robert Diana

· 20,077 Views · 1 Like

Using Advantage data providers to read DBF-files

In one of my projects I have to read FoxPro DBF-files and import data from them. As this code must run in server and customer doesn’t want to install FoxPro there we found another solution that seems at least to me way better. In this posting I will show you how to read DBF-files using Sybase Advantage data providers. Getting Advantage data providers Here are the download links to data providers: Advantage .NET Data Provider Release 10.1 for Windows (32-bit and 64-bit) Platforms for Advantage OLE DB Provider Release 10.1 Platforms for Advantage ODBC Driver Release 10.1 I downloaded and installed .NET data provider and my example here is fully based on this. Configuring application If you run application without configuring some data providers stuff before you will get the following error: Error 5185: Local server connections are restricted in this environment. See the 5185 error code documentation for details. Go to your application bin folder and add there usual text file called ads.ini. Here is the content for this file: [SETTINGS] MTIER_LOCAL_CONNECTIONS=1 Make sure you add reference to Advantage data provider assembly and include ads.ini to your project like shown on image above. Getting data to DataTable Here is short code example about how to get data from DBF-file to DataTable. static void Main(string[] args) { var tableName = "TABLENAME_WITHOUT_EXTENSION"; var connStr = "data source={0};tabletype=vfp;servertype= local;"; connStr = string.Format(connStr, "c:\\temp\\"); var table = new DataTable(); using (var conn = new AdsConnection(connStr)) using (var adapter = new AdsDataAdapter()) using (var cmd = new AdsCommand()) { cmd.Connection = conn; cmd.CommandText = "select * from " + tableName; adapter.SelectCommand = cmd; conn.Open(); adapter.Fill(table); conn.Close(); } Console.WriteLine("Table fields:"); foreach (DataColumn col in table.Columns) Console.WriteLine(col.ColumnName); Console.WriteLine(" "); Console.WriteLine("Rows: " + table.Rows.Count); Console.Read(); } If Advantage data providers were installed correctly and there are no errors in table names, locations and your SQL query then you should see list of table column names and row count on console window when you run the application.

May 17, 2011

by Gunnar Peipman

· 10,477 Views

Java Web Application Security - Part I: Java EE 6 Login Demo

Back in February, I wrote about my upcoming conferences: In addition to Vegas and Poland, there's a couple other events I might speak at in the next few months: the Utah Java Users Group (possibly in April), Jazoon and ÜberConf (if my proposals are accepted). For these events, I'm hoping to present the following talk: Webapp Security: Develop. Penetrate. Protect. Relax. In this session, you'll learn how to implement authentication in your Java web applications using Spring Security, Apache Shiro and good ol' Java EE Container Managed Authentication. You'll also learn how to secure your REST API with OAuth and lock it down with SSL. After learning how to develop authentication, I'll introduce you to OWASP, the OWASP Top 10, its Testing Guide and its Code Review Guide. From there, I'll discuss using WebGoat to verify your app is secure and commercial tools like webapp firewalls and accelerators. Fast forward a couple months and I'm happy to say that I've completed my talk at the Utah JUG and it's been accepted at Jazoon and Über Conf. For this talk, I created a presentation that primarily consists of demos implementing basic, form and Ajax authentication using Java EE 6, Spring Security and Apache Shiro. In the process of creating the demos, I learned (or re-educated myself) how to do a number of things in all 3 frameworks: Implement Basic Authentication Implement Form-based Authentication Implement Ajax HTTP -> HTTPS Authentication (with programmatic APIs) Force SSL for certain URLs Implement a file-based store of users and passwords (in Jetty/Maven and Tomcat standalone) Implement a database store of users and passwords (in Jetty/Maven and Tomcat standalone) Encrypt Passwords Secure methods with annotations For the demos, I showed the audience how to do almost all of these, but skipped Tomcat standalone and securing methods in the interest of time. In July, when I do this talk at ÜberConf, I plan on adding 1) hacking the app (to show security holes) and 2) fixing it to protect it against vulnerabilities. I told the audience at UJUG that I would post the presentation and was planning on recording screencasts of the various demos so the online version of the presentation would make more sense. Today, I've finished the first screencast showing how to implement security with Java EE 6. Below is the presentation (with the screencast embedded on slide 10) as well as a step-by-step tutorial. * You can also watch the screencast on YouTube or download the presentation PDF. Java EE 6 Login Tutorial Download and Run the Application Implement Basic Authentication Implement Form-based Authentication Force SSL Store Users in a Database Summary Download and Run the Application To begin, download the application you'll be implementing security in. This app is a stripped-down version of the Ajax Login application I wrote for my article on Implementing Ajax Authentication using jQuery, Spring Security and HTTPS. You'll need Java 6 and Maven installed to run the app. Run it using mvn jetty:run and open http://localhost:8080 in your browser. You'll see it's a simple CRUD application for users and there's no login required to add or delete users. Implement Basic Authentication The first step is to protect the list screen so people have to login to view users. To do this, add the following to the bottom of src/main/webapp/WEB-INF/web.xml: users /users GET POST ROLE_ADMIN BASIC Java EE Login ROLE_ADMIN At this point, if you restart Jetty (Ctrl+C and jetty:run again), you'll get an error about a missing LoginService. This happens because Jetty doesn't know where the "Java EE Login" realm is located. Add the following to pom.xml, just after in the Jetty plugin's configuration. Java EE Login ${basedir}/src/test/resources/realm.properties The realm.properties file already exists in the project and contains user names and passwords. Start the app again using mvn jetty:run and you should be prompted to login when you click on the "Users" tab. Enter admin/admin to login. After logging in, you can try to logout by clicking the "Logout" link in the top-right corner. This calls a LogoutController with the following code that logs the user out. public void logout(HttpServletResponse response) throws ServletException, IOException { request.getSession().invalidate(); response.sendRedirect(request.getContextPath()); } You'll notice that clicking this link doesn't log you out, even though the session is invalidated. The only way to logout with basic authentication is to close the browser. In order to get the ability to logout, as well as to have more control over the look-and-feel of the login, you can implement form-based authentication. Implement Form-based Authentication To change from basic to form-based authentication, you simply have to replace the in your web.xml with the following: FORM /login.jsp /login.jsp?error=true The login.jsp page already exists in the project, in the src/main/webapp directory. This JSP has 3 important elements: 1) a form that submits to "${contextPath}/j_security_check", 2) an input element named "j_username" and 3) an input element named "j_password". If you restart Jetty, you'll now be prompted to login with this JSP instead of the basic authentication dialog. Force SSL Another thing you might want to implement to secure your application is forcing SSL for certain URLs. To do this on the same you already have in web.xml, add the following after : CONFIDENTIAL To configure Jetty to listen on an SSL port, add the following just after in your pom.xml: true 8080 true 8443 60000 ${project.build.directory}/ssl.keystore appfuse appfuse The keystore must be generated for Jetty to start successfully, so add the keytool-maven-plugin just above the jetty-maven-plugin in pom.xml. org.codehaus.mojo keytool-maven-plugin 1.0 generate-resources clean clean generate-resources genkey genkey ${project.build.directory}/ssl.keystore cn=localhost appfuse appfuse appfuse RSA Now if you restart Jetty, go to http://localhost:8080 and click on the "Users" tab, you'll get a 403. What the heck?! When this first happened to me, it took me a while to figure out. It turns out that Jetty doesn't redirect to HTTPS when using Java EE authentication, so you have to manually type in https://localhost:8443/ (or add a filter to redirect for you). If you deployed this same application on Tomcat (after enabling SSL), it would redirect for you. Store Users in a Database Finally, to store your users in a database instead of file, you'll need to change the in the Jetty plugin's configuration. Replace the existing element with the following: Java EE Login ${basedir}/src/test/resources/jdbc-realm.properties The jdbc-realm.properties file already exists in the project and contains the database settings and table/column names for the user and role information. jdbcdriver = com.mysql.jdbc.Driver url = jdbc:mysql://localhost/appfuse username = root password = usertable = app_user usertablekey = id usertableuserfield = username usertablepasswordfield = password roletable = role roletablekey = id roletablerolefield = name userroletable = user_role userroletableuserkey = user_id userroletablerolekey = role_id cachetime = 300 Of course, you'll need to install MySQL for this to work. After installing it, you should be able to create an "appfuse" database and populate it using the following commands: mysql -u root -p -e 'create database appfuse' curl https://gist.github.com/raw/958091/ceecb4a6ae31c31429d5639d0d1e6bfd93e2ea42/create-appfuse.sql > create-appfuse.sql mysql -u root -p appfuse < create-appfuse.sql Next you'll need to configure Jetty so it has MySQL's JDBC Driver in its classpath. To do this, add the following dependency just after the element (before ) in pom.xml: mysql mysql-connector-java 5.1.14 Now run the jetty-password.sh file in the root directory of the project to generate a password of your choosing. For example: $ sh jetty-password.sh javaeelogin javaeelogin OBF:1vuj1t2v1wum1u9d1ugo1t331uh21ua51wts1t3b1vur MD5:53b176e6ce1b5183bc970ef1ebaffd44 The last two lines are obfuscated and MD5 versions of the password. Update the admin user's password to this new value. You can do this with the following SQL statement. UPDATE app_user SET password='MD5:53b176e6ce1b5183bc970ef1ebaffd44' WHERE username = 'admin'; Now if you restart Jetty, you should be able to login with admin/javaeelogin and view the list of users. Summary In this tutorial, you learned how to implement authentication using standard Java EE 6. In addition to the basic XML configuration, there's also some new methods in HttpServletRequest for Java EE 6 and Servlet 3.0: authenticate(response) login(user, pass) logout() This tutorial doesn't show you how to use them, but I did play with them a bit as part of my UJUG demo when implementing Ajax authentication. I found that login() did work, but it didn't persist the authentication for the users session. I also found that after calling logout(), I still needed to invalidate the session to completely logout the user. There are some additional limitations I found with Java EE authentication, namely: No error messages for failed logins No Remember Me No auto-redirect from HTTP to HTTPS Container has to be configured Doesn’t support regular expressions for URLs Of course, no error messages indicating why login failed is probably a good thing (you don't want to tell users why their credentials failed). However, when you're trying to figure out if your container is configured properly, the lack of container logging can be a pain. In the next couple weeks, I'll post Part II of this series, where I'll show you how to implement this same set of features using Spring Security. In the meantime, please let me know if you have any questions. From http://raibledesigns.com/rd/entry/java_web_application_security_part

May 15, 2011

by Matt Raible

· 42,046 Views · 3 Likes

Reset MySQL Root Password On Linux

Five easy steps to reset MySQL root password. Stop the MySQL server. Start the MySQL server with the --skip-grant-tables option. (it will not prompt for password) Connect to MySQL server as the root user. Setup new MySQL root password. Exit and restart the MySQL server. ### Shell Commands ### /etc/init.d/mysql stop mysqld_safe --skip-grant-tables & mysql -u root ### SQL Commands ### use mysql; UPDATE user SET password=PASSWORD("new-password") WHERE user='root'; flush privileges; exit ### Shell Commands ### /etc/init.d/mysql stop /etc/init.d/mysql start mysql -u root -pnew-password

May 13, 2011

by Artur Mkrtchyan

· 5,779 Views

Database Interaction with DAO and DTO Design Patterns

Learn what is a DAO and how to create a DAO, as well as the significance of creating Data Access Objects.

May 13, 2011

by Sandeep Bhandari

· 165,748 Views · 5 Likes

Practical PHP Testing Patterns: Transaction Rollback Teardown

Maintaining isolation of tests when they have a database as Shared Fixture is not a trivial task. An important constraint is not having the headache of keeping track what manipulations on the database has your code done; in that case the rollback may not even be performed in case of a regression. An alternative way to resetting the database via DELETE and TRUNCATE queries is to roll back a transaction which has been started in the setup phase during the teardown. Implementation The phases of a database test involving Transaction Rollback Teardown are roughly the following: begin transaction, usually in setUp(). arrange, act, assert actions in the various Test Methods. rollback of the transaction in teardown(). The active transaction is never committed. An issue with using this pattern is that code that already uses transaction is prone to generate errors, and ultimately should never be tested with this technique. The rules for your safety are simple: the SUT should never start a transaction or committing it. Some databases support nested transaction levels, but it's very brittle to use them for testing purposes, and in case of any failure the whole suite will blow up as test executes teardowns at the wrong level of nesting. This pattern safety is also difficult to ensure, as DDL statements like CREATE/DELETE or other commands may commit the current transaction automatically. Check the documentation of your testing database. The advantage of this pattern is great performance: rollback is faster than every other command, including TRUNCATE. Moreover, if you encapsulate transactions well in your production code, most of it won't commit them (typically leaving the control over the transaction to an upper layer). Doctrine 2 In a sense, we already use this pattern with an UnitOfWork ORM such as Doctrine 2 when we do not flush() the ORM in our code. The flow is: The database is ready by setup. Exercise code. Check results as persisted or removed entities. Instead of calling flush() over the Entity Manager, call clear(). In this case, the database never sees a transaction, as Doctrine 2 keeps everything in the Unit Of Work until you say to flush it. Even when your code is calling flush(), you can explicitly use beginTransaction() and rollback() over the connection object: in this other scenario, the testing database sees an open transaction, but it's never committed and can be discarded in teardown() like the pattern prescribes. Example The code sample is the same test case shown in the Table Truncation Teardown article, which now uses transactions to encapsulate the single tests. The various tests check the tables content is restored, along with the AUTOINCREMENT next value. exec('CREATE TABLE users ( id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR(255) )'); } $this->connection = self::$sharedConnection; $this->connection->beginTransaction(); } public function teardown() { $this->connection->rollback(); } public function testTableCanBePopulated() { $this->connection->exec('INSERT INTO users (name) VALUES ("Giorgio")'); $this->assertEquals(1, $this->howManyUsers()); } public function testTableRestartsFrom1() { $this->assertEquals(0, $this->howManyUsers()); $this->connection->exec('INSERT INTO users (name) VALUES ("Isaac")'); $stmt = $this->connection->query('SELECT name FROM users WHERE id=1'); $result = $stmt->fetch(); $this->assertEquals('Isaac', $result['name']); } public function testTableIsEmpty() { $this->assertEquals(0, $this->howManyUsers()); } private function howManyUsers() { $stmt = $this->connection->query('SELECT COUNT(*) AS number FROM users'); $result = $stmt->fetch(); return $result['number']; } }

May 11, 2011

by Giorgio Sironi

· 7,946 Views

Practical PHP Testing Patterns: Stored Procedure Test

It happened in the day before the advent of DDD and the Hexagonal architecture, that you had code that lived inside the database, such as Stored Procedures, constraints, and triggers. Back in the day, the relational database was considered the single source of truth instead of a Domain Model written in a language like PHP or Java. Today the picture is different - but there are still scenarios where pushing code in the database make sense. One of the reasons for having logic expressed as SQL and in other database languages is their power, and their performance. SQL operators, especially when augmented by proprietary extensions, let you declare pieces of logic that you would instead have to code by hand. SQL that is executed directly on the database can accomplish operations too onerous to perform over a reconstituted object graph with a subsequent saving. In fact, every decent ORM include a language for batch updates that translates to SQL, like Doctrine with DQL; and also a mechanism for providing hints for the underlying database, like indexes definitions. The problem with SQL derivates and other database-specific embedded logic is that we cannot execute it and test it in isolation - we need a real copy of the database to perform our tests. Thus the Stored Procedure Test is an umbrella term for tests that encompasses database code, even when they're not actually stored procedures. When I'll use the term stored procedure in this article, it will be to signify any database-specific code, such as complex queries, triggers and so on. Implementation The pattern prescribes to write unit tests for the stored procedure, to test it in isolation from the rest of application a first simplification. These tests cover nontrivial logic in database code - probably you don't need them for indexes definition, but more for queries with aggregate functions. In the PHP world, Sqlite often suffices for testing queries - as long as you have an intermediate layer like Doctrine DBAL (part of Doctrine 2) which smooths out the differences between vendors. You use MySQL in production, Sqlite in the test suite, and you can write queries in Doctrine's DQL being confident that it will translate them to the right SQL dialect. These tests should be executed in a sandbox - a database with just enough structure and data to test the stored procedure at hand. This sandbox should run by definition on the production dbms. The most difficult aspect of the pattern is integrating with the dbms: it should be running and listening on the right port. A sandbox should be created in setUp() or setUpBeforeClass(), and destroyed during teardown. In case the database is not available, the tests should be marked as skipped or incomplete. Variations In In-Database Stored Procedure Tests, the test is written in the same language as the database code. I cannot imagine something more boring for a PHP programmer. In Remoted Stored Procedure Tests, which is the variation of interest for us, the tests are written in PHPUnit and integrated with the suite (slowing it down a bit). The logic is that whatever SQL logic you're going to add to your application, is already encapsulated in some PHP class: for example, complex queries are encapsulated in Repositories or DAO. So it's going to be feasible to build a sandbox via PHP code, and test the stored procedure as a black box. It will be encapsulated for a unique execution, like schema creation, or for executing multiple times in case of queries. Example The example shows you how to test a query with a real database - supposing a surrogate database does not support all the needed functions - from inside a test suite. I thought it would be difficult to write this test, but instead it required less than a Pomodoro. connection = new PDO("mysql:host=localhost;dbname=sandbox", 'root', ''); $this->connection->exec("CREATE TABLE users (name VARCHAR(255) NOT NULL PRIMARY KEY, year YEAR)"); $this->repository = new UserRepository($this->connection, 2011); } public function testAverageAgeIsCalculated() { $this->insertUser('Giorgio', 1942); $this->insertUser('Isaac', 1920); $this->assertEquals(80, $this->repository->getAverageAge()); } private function insertUser($name, $year) { $stmt = $this->connection->prepare("INSERT INTO users (name, year) VALUES (:name, :year)"); $stmt->bindValue('name', $name, PDO::PARAM_STR); $stmt->bindValue('year', $year, PDO::PARAM_INT); return $stmt->execute(); } public function tearDown() { $this->connection->exec('DROP TABLE users'); } } class UserRepository { private $connection; private $currentYear; public function __construct(PDO $connection, $currentYear) { $this->connection = $connection; $this->currentYear = $currentYear; } /** * We suppose AVG() cannot be correctly implemented by Sqlite or * another surrogate database (substitute another vendor feature * for the same effect). * We also suppose reconstituting millions of User objects to calculate * their average age isn't feasible: that's why we used SQL directly. */ public function getAverageAge() { $stmt = $this->connection->prepare('SELECT AVG(:year - year) AS average_age FROM users'); $stmt->bindValue('year', $this->currentYear, PDO::PARAM_INT); $stmt->execute(); $row = $stmt->fetch(); return $row['average_age']; } }

May 4, 2011

by Giorgio Sironi

· 2,865 Views

Reasons for Slow Database Performance

Usually there are scenarios where the application does not perform as expected. A simple web page which fetches data from database and displays optimizes it for mobiles should be fast and turnaround times should be less than 30 seconds on a good network connection. But still there are cases where these kinds of applications suffer the most performance issues. This is because the database in these cases is not designed by giving proper attention to the application requirements. You can change one application design even after delivery but changing the database design once a number of application have been integrated with it is like explosion. Here I am giving some points which should be kept in mind while designing application and database. Only basic idea is being provided. For details one can search each topic on the web as each point expands well to multiple articles. 1) Bind Variables: When a SQL query is sent to the database engine for processing and sending the result, it is compiled by the database compiler to get the tokens of the query. This involves parsing, optimizing and identifying the query. After a number of steps, the SQL query is passed to the database engine for processing. In a small application with a user base of less than 500, it is usually the same query which is executed more often than others. The use of bind variables helps in storing the compiled query once and executing it with different data at different times. For using bind variables, one needs to use PreparedStatement objects in Java. 2) Query is not well formed: Usually the same SQL query can be written in multiple ways. There are ways by which a query can be optimized to give the best performance. The corresponding SQL construct should be chosen depending upon requirement. I have scenarios where people have used WHERE clause instead of GROUP BY and are complaining of poor response times. Similarly Sub queries and Joins complement each other. 3) Database structure is not well defined/normalized: This is probably known to everybody that the database tables should be properly normalized as this is part of every DBMS course at graduation level. If the tables are not properly designed and normalized, anomalies set in. 4) Proper caching is not in place: Many applications make use of temporary caches on the application server to store the reference data or frequently accessed data as memory is less of an issue than the time with new generation servers. 5) Number of rows in the table too large: If the table itself has too much of data then the queries will take time to execute. Partitioning a table into multiple tables is recommended in these situations. For example: If a table has employee records of 1000000 employees then it could be split into 5 small tables each having 200000 rows. The advantage is we know beforehand in which smaller table to look for a particular employee code as the division of large table can be done on the employee id column. 6) Connections are not being pooled: If connections are not pooled then the each time a new connection is requested for a request to database. Maintaining a connection pool is much better than creating and destroying the connection for executing every SQL query. Of course, there are frameworks like Hibernate which take care of creating the connection pools and also allow the customization of these pools 7) Connections not closed/returned to pool in case of exceptions: When an exception occurs while performing database operations, it ought to be caught. Usually catching the exception is not the issue because SQLException is a checked exception but closing the connection is something that most of the times is left out. If the connection is not released, the same connection cannot be used for any other purpose till the connection is timed out. 8) Stored procedures for complex computations on database: Stored procedures are a good way to perform database intensive operations. This is because they are already compiled and there is less network trips for getting the same results as compared to SQL queries. From http://extreme-java.blogspot.com/2011/04/reasons-for-slow-database-performance.html

April 30, 2011

by Sandeep Bhandari

· 59,850 Views · 1 Like

Hammurabi - A Scala Rule Engine

One of the most common reasons why software projects fail, or suffer unbearable delays, is the misunderstandings between the analysts who define the business rules of the domain for which the software is going to be written and the developers who have to code these rules. The latter write those rules in a language that is completely obscure for the first ones. In this way the business analysts don't have a chance to read, understand and validate what the programmers developed and then they can only empirically test the final software behavior, hardly covering all the possible corner cases and often recognizing mistakes only when it is too late. What Hammurabi is Hammurabi is a rule engine written in Scala that tries to leverage the features of this language making it particularly suitable to implement extremely readable internal Domain Specific Languages. Indeed, what actually makes Hammurabi different from all other rule engines is that it is possible to write and compile its rules directly in the host language. Anyway the Hammurabi's rules also have the important property of being readable even by non technical person. As usual a practical example worth more than a thousand words. The golfers problem This logical puzzle has been taken from the first chapter of the Jess in Action book written by Ernest Friedman-Hill and published by Manning. It is described there as it follows: A foursome of golfers is standing at a tee, in a line from left to right. Each golfer wears different colored pants; one is wearing red pants. The golfer to Fred’s immediate right is wearing blue pants. Joe is second in line. Bob is wearing plaid pants. Tom isn’t in position one or four, and he isn’t wearing the hideous orange pants. In what order will the four golfers tee off, and what color are each golfer’s pants?” The Jess solution Jess is written in Java and is one of the most popular rule engine on the market. The solution to the golfers problems presented in the book mentioned above is the following: first it is necessary to define the data structures representing the problem (deftemplate pants-color (slot of) (slot is)) (deftemplate position (slot of) (slot is)) A deftemplate is a bit like a class declaration in Java and in this case is used to write a first rule that in turns creates the facts representing each of the possible combinations of golfers, pants-color and positions: (defrule generate-possibilities => (foreach ?name (create$ Fred Joe Bob Tom) (foreach ?color (create$ red blue plaid orange) (assert (pants-color (of ?name)(is ?color))) ) (foreach ?position (create$ 1 2 3 4) (assert (position (of ?name)(is ?position))) ) ) ) After that it is possible to translate the sentences of the problem in the corresponding Jess rule: (defrule find-solution ;; There is a golfer named Fred, whose position is ?p1 ;; and pants color is ?c1 (position (of Fred) (is ?p1)) (pants-color (of Fred) (is ?c1)) ;; The golfer to immediate right of Fred ;; is wearing blue pants. (position (of ?n&~Fred)(is ?p&:(eq ?p (+ ?p1 1)))) (pants-color (of ?n&~Fred)(is blue&~?c1)) ;; Joe is in position #2 (position (of Joe) (is ?p2&2&~?p1)) (pants-color (of Joe) (is ?c2&~?c1)) ;; Bob is wearing the plaid pants (position (of Bob)(is ?p3&~?p1&~?p&~?p2)) (pants-color (of Bob&~?n)(is plaid&?c3&~?c1&~?c2)) ;; Tom is not in position 1 or 4 ;; and is not wearing orange (position (of Tom&~?n)(is ?p4&~1&~4&~?p1&~?p2&~?p3)) (pants-color (of Tom)(is ?c4&~orange&~blue&~?c1&~?c2&~?c3)) => (printout t Fred " " ?p1 " " ?c1 crlf) (printout t Joe " " ?p2 " " ?c2 crlf) (printout t Bob " " ?p3 " " ?c3 crlf) (printout t Tom " " ?p4 " " ?c4 crlf) ) where the rows starting with ;; are just comments. In this way if you enter the code for the problem into Jess and then run it, you get the answer directly: Fred 1 orange Joe 2 blue Bob 4 plaid Tom 3 red Note that the facts that the golfers are in different positions and wear pants of different colors is not expressed in an explicit rule but need to be spread and repeated in many statements. This solution is clearly difficult to be maintained and doesn't scale well as underlined by the last condition statement stating that the position ?p4 of Tom is ?p4&~1&~4&~?p1&~?p2&~?p3 where ~ means not in the Jess language. In other words it says that the Tom's position is not only different from the position 1 and 4 but it is also different from the positions of all the other golfers (named one by one) formerly defined. Actually the needs to describe a golfer's position also as the negation of the positions of all the other golfers implies something even worse: it is not possible to translate each sentence of the problem in a different rule, but they have to be combined together in a single big rule. After this huge if part, its then section (the one after the => symbol) prints out a table containing the set of variables ?p1…?p4 and ?c1…?c4 that solves the problem. The Hammurabi solution As done while presenting the Jess solution, also with Hammurabi the first thing to do is to define the domain of the problem. In order to do that, since the Hammurabi rules are valid Scala statements, it is sufficient to create a plain Scala Person class having as attributes the name, the position and the color of the pants of the golfer that it represents: class Person(n: String) { val name = n var pos: Int = _ var color: String = _ } Then we can model the fact that all the golfers must have different position and pants color by putting them in 2 different Set: var availablePos = (1 to 4).toSet var availableColors = Set("blue", "plaid", "red", "orange") and write two small methods that pull off them from the corresponding set once they have been assigned to a specific golfer: val assign = new { def color(color: String) = new { def to(person: Person) = { person.color = color availableColors = availableColors - color } } def position(pos: Int) = new { def to(person: Person) = { person.pos = pos availablePos = availablePos - pos } } } Those methods are written in a quite weird way just to make even more readable the DSL that will be used to define the rules and to stress the idea that it is possible to write the rules in a valid Scala that could be easily understood by a non-technical person. Of course it is easy to go even further by encapsulating other concepts in some convenient methods as it has been done above. Now everything is ready to write the set of rules describing the golfers problem: val ruleSet = Set( rule ("Unique positions") let { val p = any(kindOf[Person]) when { (availablePos.size equals 1) and (p.pos equals 0) } then { assign position availablePos.head to p } }, rule ("Unique colors") let { val p = any(kindOf[Person]) when { (availableColors.size equals 1) and (p.color == null) } then { assign color availableColors.head to p } }, rule ("Joe is in position 2") let { val p = any(kindOf[Person]) when { p.name equals "Joe" } then { assign position 2 to p } }, rule ("Person to Fred’s immediate right is wearing blue pants") let { val p1 = any(kindOf[Person]) val p2 = any(kindOf[Person]) when { (p1.name equals "Fred") and (p2.pos equals p1.pos + 1) } then { assign color "blue" to p2 } }, rule ("Fred isn't in position 4") let { val possibleFredPos = availablePos - 4 val p = any(kindOf[Person]) when { (p.name equals "Fred") and (possibleFredPos.size == 1) } then { assign position possibleFredPos.head to p } }, rule ("Tom isn't in position 1 or 4") let { val possibleTomPos = availablePos - 1 - 4 val p = any(kindOf[Person]) when { (p.name equals "Tom") and (possibleTomPos.size equals 1) } then { assign position possibleTomPos.head to p } }, rule ("Bob is wearing plaid pants") let { val p = any(kindOf[Person]) when { p.name equals "Bob" } then { assign color "plaid" to p } }, rule ("Tom isn't wearing orange pants") let { val possibleTomColors = availableColors - "orange" val p = any(kindOf[Person]) when { (p.name equals "Tom") and (possibleTomColors.size equals 1) } then { assign color possibleTomColors.head to p } } ) Here the first 2 rules explicitly leverage the uniqueness of positions and pants colors by assigning the last available of them to the only person who still doesn't have one. The other rules just match one by one the sentences of the problem as it has been defined. Now it is possible to make Hammurabi solve the problem by creating the four golfers: val tom = new Person("Tom") val joe = new Person("Joe") val fred = new Person("Fred") val bob = new Person("Bob") add them to a working memory (the set of objects against which the rule engine will evaluate and fire the rules): val workingMemory = WorkingMemory(tom, joe, fred, bob) and letting the rule engine, initialized with the formerly defined set of rules, to work on it: RuleEngine(ruleSet) execOn workingMemory Working with immutable data structures Immutability is probably not something that should be enforced at all costs in Hammurabi, for a very simple reason: the largest part of the execution time is spent by Hammurabi, like all other rule engines, looking for rules that can actually be executed (fired), i.e. the ones for which the when condition is true. During this phase data are only read and never written, so immutability doesn't matter at all, and all the rules that can be fired are put in an agenda. During the subsequent phase the rules in the agenda MUST be fired one by one, since the execution of one of them could make false the when condition of another one. It means that lack of immutability shouldn't prevent the rule engine to safely run in parallel during the discovery phase. That said, it is also possible to obtain the same result working with immutable data structures. For example having an immutable person: case class Person(name: String, pos: Int = 0, color: String = null) it's enough to rewrite the methods that assign the position and pants color to the golfers as it follows: val assign = new { def color(color: String) = new { def to(person: Person) = { remove(person) produce(person.copy(color = color)) availableColors = availableColors - color } } def position(pos: Int) = new { def to(person: Person) = { remove(person) produce(person.copy(pos = pos)) availablePos = availablePos - pos } } } leaving all the rules unchanged. In this way the old version of the Person is removed from the working memory and a brand new one, with its position or color set accordingly to the fired rule, is produced and then added to the working memory itself. The methods remove and produce can be indeed used respectively to remove objects from the working memory and produce new objects, that could be then used to evaluate and fire other rules. Hammurabi internals In Hammurabi all the rules are evaluated (but not fired) in parallel basically by assigning each rule to a different Scala actor. The replacement of this actor implementation with one based on the upcoming Scala parallel collections is currently under evaluation but I decided to wait until this technology will be stable. As anticipated, the evaluation of the rule in parallel is safe because is a read only process. While the actors discover set of variables that can fire the rules they are evaluating, they add them to the rule engine's agenda. At the end of this evaluation process the rules are fired sequentially one by one after a re-evaluation of their application condition, because the execution of one of them could change the result of that condition for a subsequent one. All the rules are fired in no particular order unless a different priority has been specified for some of them. Indeed since sometimes there could be the need to treat some rules as special cases they have an optional property called salience that acts as a priority setting for that rule in order to allow activated rules with the highest salience to always fire first, followed by rules of lower salience. By default all rules have salience 0 but you can alter it in the rule definition as it follows: rule ("Important rule") withSalience 10 let { ... } rule ("Negligible rule") withSalience -5 let { ... } Each actor also records the set of values on which the rule it is responsible for has been already executed, in order to don't fire again the same rule on the same values. The evaluation phase and the firing one are then executed again and again until either there is no rule that can be fired or one of the rule during its firing phase invokes one of the methods exitWith, making the rule engine gracefully finishing returning a value representing the result of the whole evaluation process or failWith that cause the rule engine to terminate by throwing an exception. Of course if the rule engine stops just because there are no longer rules that can be fired, the result of the evaluation is represented by the whole working set (as in the former example) and you can read from there the values you are interested in. Further implementations and improvements At the moment it is only possible to take from the working memory the object(s) against which a given rule will be evaluated only selecting them by type, as shown in the statement: val p = any(kindOf[Person]) Further mechanisms to categorize and select those objects in a more precise way are under evaluation, even if it is already possible to limit them with a Boolean function as in the following example: val p = kindOf[Person] having (_.name == "Joe") This is useful also under a performance point of view because it dramatically lowers the number of combination that the rule engine needs to check before to find a rule that can be fired. For example the "Person to Fred’s immediate right is wearing blue pants" rule could be rewritten as it follows bringing the number of combination that has to be tried from 16 to 4: rule ("Person to Fred’s immediate right is wearing blue pants") let { val p1 = kindOf[Person] having (_.name == "Fred") val p2 = any(kindOf[Person]) when { p2.pos equals p1.pos + 1 } then { assign color "blue" to p2 } } I am also evaluating of directly feeding the working memory with a NoSQL database. In other words, with this solution, the data present in the db could represent the working memory itself. At the moment I am experimenting with MongoDB since is the one I know best, but if somebody has some other good idea or even better wants to collaborate with this project I'd be very glad of it. The version 0.1 of the Hammurabi rule engine has been just released and it is available here.

April 18, 2011

by Mario Fusco

· 27,927 Views

Eradicating Non-Determinism in Tests

An automated regression suite can play a vital role on a software project, valuable both for reducing defects in production and essential for evolutionary design. In talking with development teams I've often heard about the problem of non-deterministic tests - tests that sometimes pass and sometimes fail. Left uncontrolled, non-deterministic tests can completely destroy the value of an automated regression suite. In this article I outline how to deal with non-deterministic tests. Initially quarantine helps to reduce their damage to other tests, but you still have to fix them soon. Therefore I discuss treatments for the common causes for non-determinism: lack of isolation, asynchronous behavior, remote services, time, and resource leaks. I've enjoyed watching ThoughtWorks tackle many difficult enterprise applications, bringing successful deliveries to many clients who have rarely seen success. Our experiences have been a great demonstration that agile methods, deeply controversial and distrusted when we wrote the manifesto a decade ago, can be used successfully. There are many flavors of agile development out there, but in what we do there is a central role for automated testing. Automated testing was a core approach to Extreme Programming from the beginning, and that philosophy has been the biggest inspiration to our agile work. So we've gained a lot of experience in using automated testing as a core part of software development. Automated testing can look easy when presented in a text book. And indeed the basic ideas are really quite simple. But in the pressure-cooker of a delivery project, trials come up that are often not given much attention in texts. As I know too well, authors have a habit of skimming over many details in order to get a core point across. In my conversations with our delivery teams, one recurring problem that we've run into is tests which have become unreliable, so unreliable that people don't pay much attention to whether they pass or fail. A primary cause of this unreliability is that some tests have become non-deterministic. A test is non-deterministic when it passes sometimes and fails sometimes, without any noticeable change in the code, tests, or environment. Such tests fail, then you re-run them and they pass. Test failures for such tests are seemingly random. Non-determinism can plague any kind of test, but it's particularly prone to affect tests with a broad scope, such as acceptance or functional tests. Why non-deterministic tests are a problem Non-deterministic tests have two problems, firstly they are useless, secondly they are a virulent infection that can completely ruin your entire test suite. As a result they need to be dealt with as soon as you can, before your entire deployment pipeline is compromised. I'll start with expanding on their uselessness. The primary benefit of having automated tests is that they provide bug detection mechanism by acting as regression tests[1]. When a regression test goes red, you know you've got an immediate problem, often because a bug has crept into the system without you realizing. Having such a bug detector has huge benefits. Most obviously it means that you can find and fix bugs just after they are introduced. Not just does this give you the warm fuzzies because you kill bugs quickly, it also makes it easier to remove them since you know the bug got in with the last set of changes that are fresh in your mind. As a result you know where to look for the bug, which is more than half the battle in squashing it. The second level of benefit is that as you gain confidence in your bug detector, you gain the courage to make big changes knowing that when you goof, the bug detector will go off and you can fix the mistake quickly. [2] Without this teams are frightened to make the changes code needs in order to be kept clean, which leads to a rotting of the code base and plummeting development speed. The trouble with non-deterministic tests is that when they go red, you have no idea whether its due to a bug, or just part of the non-deterministic behavior. Usually with these tests a non-deterministic failure is relatively common, so you end up shrugging your shoulders when these tests go red. Once you start ignoring a regression test failure, then that test is useless and you might as well throw it away. Indeed you really ought to throw a non-deterministic test away, since if you don't it has an infectious quality. If you have a suite of 100 tests with 10 non-deterministic tests in them, than that suite will often fail. Initially people will look at the failure report and notice that the failures are in non-deterministic tests, but soon they'll lose the discipline to do that. Once that discipline is lost, then a failure in the healthy deterministic tests will get ignored too. At that point you've lots the whole game and might as well get rid of all the tests. Quarantine My principal aim in this article is to outline common cases of non-deterministic tests and how to eliminate the non-determinism. But before I get there I offer one piece of essential advice: quarantine your non-deterministic tests. If you have non-deterministic tests keep them in a different test suite to your healthy tests. That way you'll you can continue to pay attention to what's going on with your healthy tests and get good feedback from them. Place any non-deterministic test in a quarantined area. (But fix quarantined tests quickly.) Then the question is what to do with the quarantined test suites. They are useless as regression tests, but they do have a future as work items for cleaning up. You should not abandon such tests, since any tests you have in quarantine are not helping you with your regression coverage. A danger here is that tests keep getting thrown into quarantine and forgotten, which means your bug detection system is eroding. As a result it's worthwhile to have a mechanism that ensures that tests don't stay in quarantine too long. I've come across various ways to do this. One is a simple numeric limit: e.g. only allow 8 tests in quarantine. Once you hit the limit you must spend time to clear all the tests out. This has the advantage of batching up your test-cleaning if that's how you like to do things. Another route is to put a time limit on how long a test may be in quarantine, such as no longer than a week. The general approach with quarantine is to take the quarantined tests out of the main deployment pipeline so that you still get your regular build process. However a good team can be more aggressive. Our Mingle team puts its quarantine suite into the deployment pipeline one stage after its healthy tests. That way it can get the feedback from the healthy tests but is also forced to ensure that it sorts out the quarantined tests quickly. [3] Lack of Isolation In order to get tests to run reliably, you must have clear control over the environment in which they run, so you have a well-known state at the beginning of the test. If one test creates some data in the database and leaves it lying around, it can corrupt the run of another test which may rely on a different database state. Therefore I find it's really important to focus on keeping tests isolated. Properly isolated tests can be run in any sequence. As you get to larger operational scope of functional tests, it gets progressively harder to keep tests isolated. When you are tracking down a non-determinism, lack of isolation is a common and frustrating cause. Keep your tests isolated from each other, so that execution of one test will not affect any others. There are a couple of ways to get isolation - either always rebuild your starting state from scratch, or ensure that each test cleans up properly after itself. In general I prefer the former, as it's often easier - and in particular easier to find the source of a problem. If a test fails because it didn't build up the initial state properly, then it's easy to see which test contains the bug. With clean-up, however, one test will contain the bug, but another test will fail - so it's hard to find the real problem. Starting from a blank state is usually easy with unit tests, but can be much harder with functional tests [4] - particularly if you have a lot of data in a database that needs to be there. Rebuilding the database each time can add a lot of time to test runs, so that argues for switching to a clean-up strategy. One trick that's handy when you're using databases, is to conduct your tests inside a transaction, and then to rollback the transaction at the end of the test. That way the transaction manager cleans up for you, reducing the chance of errors[5]. Another approach is to do a single build of a mostly-immutable starting fixture before running a group of tests. Then ensure that the tests don't change that initial state (or if they do, they reverse the changes in tear-down). This tactic is more error-prone than rebuilding the fixture for each test, but it may be worthwhile iff it takes too long to build the fixture each time. Although databases are a common cause for isolation problems, there are plenty of times you can get these in-memory too. In particular be aware with static data and singletons. A good example for this kind of problem is contextual environment, such as the currently logged in user. If you have an explicit tear-down in a test, be wary of exceptions that occur during the tear-down. If this happens the test can pass, but cause isolation failures for subsequent tests. So ensure that if you do get a problem in a tear-down, it makes a loud noise. Some people prefer to put less emphasis on isolation and more on defining clear dependencies to force tests to run in a specified order. I prefer isolation because it gives you more flexibility in running subsets of tests and parallelizing tests. Asynchronous Behavior Asynchrony is a boon that allows you to keep software responsive while taking on long term tasks. Ajax calls allow a browser to stay responsive while going back to the server for more data, asynchronous message allow a server process to communicate with other system without being tied to their laggardly latency. But in testing, asynchrony can be curse. The common mistake here is to throw in a sleep: //pseudo-code makeAsyncCall; sleep(aWhile); readResponse; This can bite you two ways. First off you'll want to set the sleep time to long enough that it gives plenty of time to get the response. But that means that you'll spend a lot of time idly waiting for the response, thus slowing down your tests. The second bite is that, however long you sleep, sometimes it won't be enough. There will be some change in environment that will cause you to exceed the sleep - and you'll get false failure. As a result I strongly urge you to never use bare sleeps like this. Never use bare sleeps to wait for asynchonous responses: use a callback or polling. There are basically two tactics you can do for testing an asynchronous response. The first is for the asynchronous service to take a callback which it can call when done. This is the best since it means you'll never have to wait any longer than you need to [6]. The biggest problem with this is that the environment needs to be able to do this and then the service provider needs to do it. This is one of the advantages of having the development team integrated with testing - if they can provide a callback then they will. The second option is to poll on the answer. This is more than just looking once, but looking regularly, something like this //pseudo-code makeAsyncCall startTime = Time.now; while(! responseReceived) { if (Time.now - startTime > waitLimit) throw new TestTimeoutException; sleep (pollingInterval); } readResponse The point of this approach is that you can set the pollingInterval to a pretty small value, and know that that's the maximum amount of dead time you'll lose to waiting for a response. This means you can set the waitLimit very high, which minimizes the chance of hitting it unless something serious has gone wrong. [7] Make sure you use a clear exception class that indicates this is a test timeout that's failing. This will help make it clear what's gone wrong should it happen, and perhaps allow a more sophisticated test harness to take account of this information in its display. The time values, in particular the waitLimit, should never be literal values. Make sure they are always values that can be easily set in bulk, either by using constants or set through the runtime environment. That way if you need to tweak them (and you will) you can tweak them all quickly. All this advice is handy for async calls where you expect a response from the provider, but how about those where there is no response. These are calls where we invoke a command on something and expect it to happen without any acknowledgment. This is the trickiest case since you can test for your expected response, but there's nothing to do to detect a failure other than timing-out. If the provider is something you're building you can handle this by ensuring the provider implements some way of indicating that it's done - essentially some form of callback. Even if only the testing code uses it, it's worth it - although often you'll find this kind of functionality is valuable for other purposes too[8]. If the provider is someone else's work, you can try persuasion, but otherwise may be stuck. Although this is also a case when using Test Doubles for remote services is worthwhile (which I'll discuss more in the next section). If you have a general failure in something asynchronous, such that it's not responding at all, then you'll always be waiting for timeouts and your test suite will take a long time to fail. To combat this it's a good idea to use a smoke test to check that the asynchronous service is responding at all and stop the test run right away if it isn't. Gerard Meszaros's book, xUnit Test Patterns, contains lots of good patterns for constructing tests. You can also often side-step the asynchrony completely. Gerard Meszaros's Humble Object pattern says that whenever you have some logic that's in a hard-to-test environment, you should isolate the logic you need to test from that environment. In this case it means put most of the logic you need to test in a place where you can test it synchronously. The asynchronous behavior should be as minimal (humble) as possible, that way you don't need that much testing of it. Remote Services Sometimes I'm asked if ThoughtWorks does any integration work, which I find somewhat amusing since there's hardly any project we do that doesn't involve a fair bit of integration. By their nature, enterprise applications involve a great deal of combining data from different systems. These systems are maintained by other teams operating to their own schedules, teams that often use a very different software philosophy to our heavily test-driven agile approach. Testing with such remote systems brings a number of problems, and non-determinism is high on the list. Often remote systems don't have test system we can call, which means hitting a live system. If there is a test system, it may not be stable enough to provide deterministic responses. In this situation it's vital to ensure determinism, so it's time to reach for a Test Double - a component that looks like the remote service, but is really just a pretend version that mimics the remote system's behavior. The double needs to be setup so that provides the right kind of response in interaction with our system, but in a way we control. In this manner we can ensure determinism. Using a double has a downside, in particular when we are testing across a broad scope. How can we be sure that the double behaves in the same way that remote system does? We can tackle this again using tests, a form of test that I call Integration Contract Tests. These run the same interaction with the remote system and the double, and check that the two match. In this case 'match' may not mean coming up with the same result (due to the non-determinisms), but results that share the same essential structure. Integration Contract Tests need to be run frequently, but not part of our system's deployment pipeline. Periodic running based on the rate of the change of the remote system is usually best. For writing these kinds of test doubles, I'm a big fan of Self Initializing Fakes - since these are very simple to manage. Some people are firmly against using Test Doubles in functional tests, believing that you must test with real connection in order to ensure end-to-end behavior. While I sympathize with their argument, automated tests are useless if they are non-deterministic. So any advantage you gain by talking to the real system is overwhelmed by the need to stamp out non-determinism[9]. Time Few things are more non-deterministic than a call to the system clock. Each time you call it, you get a new result, and any tests that depend on it can thus change. Ask for all the todos due in the next hour, and you regularly get a different answer[10]. The most important thing here is to ensure that you always wrap the system clock with routines that can be replaced with a seeded value for testing. A clock stub can be set to particular time and frozen at that time, allowing your tests to have complete control over its movements. That way you can synchronize your test data to the values in the seeded clock.[11][12] Always wrap the system clock, so it can be easily substituted for testing. One thing to watch with this, is that eventually your test data might start having problems because it's too old, and you get conflicts with other time based factors in your application. In this case you can move the data, and your clock seeds to new values. When you do this, ensure that this is the only thing you do. That way you can be sure that any tests that fail are due to time-movement in the test data. Another area where time can be a problem is when you rely on other behaviors from the clock. I once saw a system that generated random keys based on clock values. This systems started failing when it was moved to a faster machine that could allocate multiple ids within a single clock tick.[13] I've heard so many problems due to direct calls to the system clock that I'd argue for finding a way to use code analysis to detect any direct calls to the system clock and failing the build right there. Even a simple regex check might save you a frustrating debugging session after a call at an ungodly hour. Resource Leaks If your application has some kind of resource leak, this will lead to random tests failing, since it's just which test causes the resource leak to go over a limit that gets the failure. This case is awkward because any test can fail intermittently due to this problem. If it isn't a case of one test being non-deterministic then resource leaks are a good candidate to investigate. By resource leak, I mean any resource that the application has to manage by acquiring and releasing. In non-memory-managed environments, the obvious example is memory. Memory-management did much to remove this problem, but other resources still need to be managed, such as database connections. Usually the best way to handle these kind of resources is through a Resource Pool. If you do this then a good tactic is to configure the pool to a size of 1 and make it throw an exception should it get a request for a resource when it has none left to give. That way the first test to request a resource after the leak will fail - which makes it a lot easier to find the problem test. This idea of limiting resource pool sizes, is about increasing constraints to make errors more likely to crop up in tests. This is good because we want errors to show in tests so we can fix them before they manifest themselves in production. This principle can be used in other ways too. One story I heard was of a system which generated randomly named temporary files, didn't clean them up properly, and crashed on a collision. This kind of bug is very hard to find, but one way to manifest it is to stub the randomizer for testing so it always returns the same value. That way you can surface the problem more quickly.

April 14, 2011

by Martin Fowler

· 6,723 Views · 1 Like

Solr + Hadoop = Big Data Love

Bixo Labs shows how to use Solr as a NoSQL solution for big data Many people use the Hadoop open source project to process large data sets because it’s a great solution for scalable, reliable data processing workflows. Hadoop is by far the most popular system for handling big data, with companies using massive clusters to store and process petabytes of data on thousands of servers. Since it emerged from the Nutch open source web crawler project in 2006, Hadoop has grown in every way imaginable – users, developers, associated projects (aka the “Hadoop ecosystem”). Starting at roughly the same time, the Solr open source project has become the most widely used search solution on planet Earth. Solr wraps the API-level indexing and search functionality of Lucene with a RESTful API, GUI, and lots of useful administrative and data integration functionality. The interesting thing about combining these two open source projects is that you can use Hadoop to crunch the data, and then serve it up in Solr. And we’re not talking about just free-text search; Solr can be used as a key-value store (i.e. a NoSQL database) via its support for range queries. Even on a single server, Solr can easily handle many millions of records (“documents” in Lucene lingo). Even better, Solr now supports sharding and replication via the new, cutting-edge SolrCloud functionality. Background I started using Hadoop & Solr about five years ago, as key pieces of the Krugle code search startup I co-founded in 2005. Back then, Hadoop was still part of the Nutch web crawler we used to extract information about open source projects. And Solr was fresh out of the oven, having just been released as open source by CNET. At Bixo Labs we use Hadoop, Solr, Cascading, Mahout, and many other open source technologies to create custom data processing workflows. The web is a common source of our input data, which we crawl using the Bixo open source project. The Problem During a web crawl, the state of the crawl is contained in something commonly called a “crawl DB”. For broad crawls, this has to be something that works with billions of records, since you need one entry for each known URL. Each “record” has the URL as the key, and contains important state information such as the time and result of the last request. For Hadoop-based crawlers such as Nutch and Bixo, the crawl DB is commonly kept in a set of flat files, where each file is a Hadoop “SequenceFile”. These are just a packed array of serialized key/value objects. Sometimes we need to poke at this data, and here’s where the simple flat-file structure creates a problem. There’s no easy way run queries against the data, but we can’t store it in a traditional database since billions of records + RDBMS == pain and suffering. Here is where scalable NoSQL solutions shine. For example, the Nutch project is currently re-factoring this crawl DB layer to allow plugging in HBase. Other options include Cassandra, MongoDB, CouchDB, etc. But for simple analytics and exploration on smaller datasets, a Solr-based solution works and is easier to configure. Plus you get useful and surprising fun functionality like facets, geospatial queries, range queries, free-form text search, and lots of other goodies for free. Architecture So what exactly would such a Hadoop + Solr system look like? As mentioned previously, in this example our input data comes from a Bixo web crawler’s CrawlDB, with one entry for each known URL. But the input data could just as easily be log files, or records from a traditional RDBMS, or the output of another data processing workflow. The key point is that we’re going to take a bunch of input data, (optionally) munge it into a more useful format, and then generate a Lucene index that we access via Solr. Hadoop For the uninitiated, Hadoop implements both a distributed file system (aka “HDFS”) and an execution layer that supports the map-reduce programming model. Typically data is loaded and transformed during the map phase, and then combined/saved during the reduce phase. In our example, the map phase reads in Hadoop compressed SequenceFiles that contain the state of our web crawl, and our reduce phase write out Lucene indexes. The focus of this article isn’t on how to write Hadoop map-reduce jobs, but I did want to show you the code that implements the guts of the job. Note that it’s not typical Hadoop key/value manipulation code, which is painful to write, debug, and maintain. Instead we use Cascading, which is an open source workflow planning and data processing API that creates Hadoop jobs from shorter, more representative code. The snippet below reads SequenceFiles from HDFS, and pipes those records into a sink (output) that stores them using a LuceneScheme, which in turn saves records as Lucene documents in an index. Tap source = new Hfs(new SequenceFile(CRAWLDB_FIELDS), inputDir); Pipe urlPipe = new Pipe("crawldb urls"); urlPipe = new Each(urlPipe, new ExtractDomain()); Tap sink = new Hfs(new LuceneScheme(SOLR_FIELDS, STORE_SETTINGS, INDEX_SETTINGS, StandardAnalyzer.class, MAX_FIELD_LENGTH), outputDir, true); FlowConnector fc = new FlowConnector(); fc.connect(source, sink, urlPipe).complete(); We defined CRAWLDB_FIELDS and SOLR_FIELDS to be the set of input and output data elements, using names like “url” and “status”. We take advantage of the Lucene Scheme that we’ve created for Cascading, which lets us easily map from Cascading’s view of the world (records with fields) to Lucene’s index (documents with fields). We don’t have a Cascading Scheme that directly supports Solr (wouldn’t that be handy?), but we can make-do for now since we can do simple analysis for this example. We indexed all of the fields so that we can perform queries against them. Only the status message contains normal English text, so that’s the only one we have to analyze (i.e., break the text up into terms using spaces and other token delimiters). In addition, the ExtractDomain operation pulls the domain from the URL field and builds a new Solr field containing just the domain. This will allow us to do queries against the domain of the URL as well as the complete URL. We could also have chosen to apply a custom analyzer to the URL to break it into several pieces (i.e., protocol, domain, port, path, query parameters) that could have been queried individually. Running the Hadoop Job For simplicity and pay-as-you-go, it’s hard to beat Amazon’s EC2 and Elastic Mapreduce offerings for running Hadoop jobs. You can easily spin up a cluster of 50 servers, run your job, save the results, and shut it down – all without needing to buy hardware or pay for IT support. There are many ways to create and configure a Hadoop cluster; for us, we’re very familiar with the (modified) EC2 Hadoop scripts that you can find in the Bixo distribution. Step-by-step instructions are available at http://openbixo.org/documentation/running-bixo-in-ec2/ The code for this article is available via GitHub, at http://github.com/bixolabs/hadoop2solr. The README displayed on that page contains step-by-step instructions for building and running the job. After the job is done, we’ll copy the resulting index out of the Hadoop distributed file system (HDFS) and onto the Hadoop cluster’s master server, then kill off the one slave we used. The Hadoop master is now ready to be configured as our Solr server. Solr On the Solr side of things, we need to create a schema that matches the index we’re generating. The key section of our schema.xml file is where we define the fields. These fields have a one-to-one correspondence with the SOLR_FIELDS we defined in our Hadoop workflow. They also need to use the same Lucene settings as what we defined in the static IndexWorkflow.java STORE_SETTINGS and INDEX_SETTINGS. Once we have this defined, all that’s left is to set up a server that we can use. To keep it simple, we’ll use the single EC2 instance in Amazon’s cloud (m1.large) that we used as our master for the Hadoop job, and run the simple Solr search server that relies on embedded Jetty to provide the webapp container. Similar to the Hadoop job, step-by-step instructions are in the README for the hadoop2solr project on GitHub. But in a nutshell, we’ll copy and unzip a Solr 1.4.1 setup on the EC2 server, do the same for our custom Solr configuration, create a symlink to the index, and then start it running with: Giving it a Try Now comes the interesting part. Since we opened up the default Jetty port used by Solr (8983) on this EC2 instance, we can directly access Solr’s handy admin console by pointing our browser at http://:8983/solr/admin % cd solr % java -Dsolr.solr.home=../solr-conf -Dsolr.data.dir=../solr-data -jar start.jar From here we can run queries against Solr: We can also use curl to talk to the server via HTTP requests: curl http://:8983/solr/select/?q=-status%3AFETCHED+and+-status%3AUNFETCHED The response is XML by default. Below is an example of the response from the above request, where we found 2,546 matches in 94ms. Now here’s what I find amazing. For an index of 82 million documents, running on a fairly wimpy box (EC2 m1.large = 2 virtual cores), the typical response time for a simple query like “status:FETCHED” is only 400 milliseconds, to find 9M documents. Even a complex query such as (status not FETCHED and not UNFETCHED) only takes six seconds. Scaling Obviously we could use beefier boxes. If we switched to something like m1.xlarge (15GB of memory, 4 virtual cores) then it’s likely we could handle upwards of 200M “records” in our Solr index and still get reasonable response times. If we wanted to scale beyond a single box, there are a number of solutions. Even out of the box Solr supports sharding, where your HTTP request can specify multiple servers to use in parallel. More recently, the Solr trunk has support for SolrCloud. This uses the ZooKeeper open source project to simplify coordination of multiple Solr servers. Finally, the Katta open source project supports Lucene-level distributed search, with many of the features needed for production quality distributed search that have not yet been added to SolrCloud. Summary The combination of Hadoop and Solr makes it easy to crunch lots of data and then quickly serve up the results via a fast, flexible search & query API. Because Solr supports query-style requests, it’s suitable as a NoSQL replacement for traditional databases in many situations, especially when the size of the data exceeds what is reasonable with a typical RDBMS. Solr has some limitations that you should be aware of, specifically: · Updating the index works best as a batch operation. Individual records can be updated, but each commit (index update) generates a new Lucene segment, which will impact performance. · Current support for replication, fail-over, and other attributes that you’d want in a production-grade solution aren’t yet there in SolrCloud. If this matters to you, consider Katta instead. · Many SQL queries can’t be easily mapped to Solr queries. The code for this article is available via GitHub, at http://github.com/bixolabs/hadoop2solr. The README displayed on that page contains additional technical details.

April 4, 2011

by Ken Krugler

· 119,675 Views

Java Access to SQL Azure via the JDBC Driver for SQL Server

I’ve written a couple of posts (here and here) about Java and the JDBC Driver for SQL Server with the promise of eventually writing about how to get a Java application running on the Windows Azure platform. In this post, I’ll deliver on that promise. Specifically, I’ll show you two things: 1) how to connect to a SQL Azure Database from a Java application running locally, and 2) how to connect to a SQL Azure database from an application running in Windows Azure. You should consider these as two ordered steps in moving an application from running locally against SQL Server to running in Windows Azure against SQL Azure. In both steps, connection to SQL Azure relies on the JDBC Driver for SQL Server and SQL Azure. The instructions below assume that you already have a Windows Azure subscription. If you don’t already have one, you can create one here: http://www.microsoft.com/windowsazure/offers/. (You’ll need a Windows Live ID to sign up.) I chose the Free Trial Introductory Special, which allows me to get started for free as long as keep my usage limited. (This is a limited offer. For complete pricing details, see http://www.microsoft.com/windowsazure/pricing/.) After you purchase your subscription, you will have to activate it before you can begin using it (activation instructions will be provided in an email after signing up). Connecting to SQL Azure from an application running locally I’m going to assume you already have an application running locally and that it uses the JDBC Driver for SQL Server. If that isn’t the case, then you can start from scratch by following the steps in this post: Getting Started with the SQL Server JDBC Driver. Once you have an application running locally, then the process for running that application with a SQL Azure back-end requires two steps: 1. Migrate your database to SQL Azure. This only takes a couple of minutes (depending on the size of your database) with the SQL Azure Migration Wizard - follow the steps in the Creating a SQL Azure Server and Creating a SQL Azure Database sections of this post. 2. Change the database connection string in your application. Once you have moved your local database to SQL Azure, you only have to change the connection string in your application to use SQL Azure as your data store. In my case (using the Northwind database), this meant changing this… String connectionUrl = "jdbc:sqlserver://serverName\\sqlexpress;" + "database=Northwind;" + "user=UserName;" + "password=Password"; …to this… String connectionUrl = "jdbc:sqlserver://xxxxxxxxxx.database.windows.net;" + "database=Northwind;" + "user=UserName@xxxxxxxxxx;" + "password=Password"; (where xxxxxxxxxx is your SQL Azure server ID). Connecting to SQL Azure from an application running in Windows Azure The heading for this section might be a bit misleading. Once you have a locally running application that is using SQL Azure, then all you have to do is move your application to Windows Azure. The connecting part is easy (see above), but moving your Java application to Windows Azure takes a bit more work. Fortunately, Ben Lobaugh has written a great post that that shows how to use the Windows Azure Starter Kit for Java to get a Java application (a JSP application, actually) running in Windows Azure: Deploying a Java application to Windows Azure with Command-Line Ant. (If you are using Eclipse, see Ben’s related post: Deploying a Java application to Windows Azure with Eclipse.) I won’t repeat his work here, but I will call out the steps I took in modifying his instructions to deploy a simple JSP page that connects to SQL Azure. 1. Add the JDBC Driver for SQL Server to the Java archive. One step in Ben’s tutorial (see the Select the Java Runtime Environment section) requires that you create a .zip file from your local Java installation and add it to your Java/Azure application. Most likely, your local Java installation references the JDBC driver by setting the classpath environment variable. When you create a .zip file from your java installation, the JDBC driver will not be included and the classpath variable will not be set in the Azure environment. I found the easiest way around this was to simply add the sqljdbc4.jar file (probably located in C:\Program Files\Microsoft SQL Server JDBC Driver\sqljdbc_3.0\enu) to the \lib\ext directory of my local Java installation before creating the .zip file. Note: You can put the JDBC driver in a separate directory, include it when you create the .zip folder, and set the classpath environment variable in the startup.bat script. But, I found the above approach to be easier. 2. Modify the JSP page. Instead of the code Ben suggests for the HelloWorld.jsp file (see the Prepare your Java Application section), use code from your locally running application. In my case, I just used the code from this post after changing the connection string and making a couple minor JSP-specific changes: Northwind Customers That’s it!. To summarize the steps… Migrate your database to SQL Azure with the SQL Azure Migration Wizard. Change the database connection in your locally running application. Use the Windows Azure Starter Kit for Java to move your application to Windows Azure. (You’ll need to follow instructions in this post and instructions above.) Thanks. -Brian

March 30, 2011

by Brian Swan

· 18,951 Views

IBatis (MyBatis): Working with Dynamic Queries (SQL)

this tutorial will walk you through how to setup ibatis ( mybatis ) in a simple java project and will present how to work with dynamic queries (sql). pre-requisites for this tutorial i am using: ide: eclipse (you can use your favorite one) database: mysql libs/jars: mybatis , mysql conector and junit (for testing) this is how your project should look like: sample database please run the script into your database before getting started with the project implementation. you will find the script (with dummy data) inside the sql folder. 1 – article pojo i represented the pojo we are going to use in this tutorial with a uml diagram, but you can download the complete source code in the end of this article. the goal of this tutorial is to demonstrate how to retrieve the article information from database using dynamic sql to filter the data. 2 – article mapper – xml one of the most powerful features of mybatis has always been its dynamic sql capabilities. if you have any experience with jdbc or any similar framework, you understand how painful it is to conditionally concatenate strings of sql together, making sure not to forget spaces or to omit a comma at the end of a list of columns. dynamic sql can be downright painful to deal with. while working with dynamic sql will never be a party, mybatis certainly improves the situation with a powerful dynamic sql language that can be used within any mapped sql statement. the dynamic sql elements should be familiar to anyone who has used jstl or any similar xml based text processors. in previous versions of mybatis, there were a lot of elements to know and understand. mybatis 3 greatly improves upon this, and now there are less than half of those elements to work with. mybatis employs powerful ognl based expressions to eliminate most of the other elements. if choose (when, otherwise) trim (where, set) foreach let’s explain each one with examples. 1 – first scenario : we want to retrieve all the articles from database with an optional filter: title. in other words, if user specify an article title, we are going to retrieve the articles that match with the title, otherwise we are going to retrieve all the articles from database. so we are going to implement a condition (if) : select id, title, author from article where id_status = 1 and title like #{title} 2 – second scenario : now we have two optional filters: article title and author. the user can specify both, none or only one filter. so we are going to implement two conditions: select id, title, author from article where id_status = 1 and title like #{title} and author like #{author} 3 – third scenario : now we want to give the user only one option: the user will have to specify only one of the following filters: title, author or retrieve all the articles from ibatis category. so we are going to use a choose element: select id, title, author from article where id_status = 1 and title like #{title} and author like #{author} and id_category = 3 4 – fourth scenario : take a look at all three statements above. they all have a condition in common: where id_status = 1. it means we are already filtering the active articles let’s remove this condition to make it more interesting. select id, title, author from article where title like #{title} and author like #{author} what if both title and author are null? we are going to have the following statement: select id, title, authorfrom articlewhere and what if only the author is not null? we are going to have the fololwing statement: select id, title, authorfrom articlewhereand author like #{author} and both fails! how to fix it? 5 – fifth scenario : we want to retrieve all the articles with two optional filters: title and author. to avoid the 4th scenatio, we are going to use a where element: select id, title, author from article title like #{title} and author like #{author} mybatis has a simple answer that will likely work in 90% of the cases. and in cases where it doesn’t, you can customize it so that it does. the where element knows to only insert “where” if there is any content returned by the containing tags. furthermore, if that content begins with “and” or “or”, it knows to strip it off. if the where element does not behave exactly as you like, you can customize it by defining your own trim element. for example,the trim equivalent to the where element is: select id, title, author from article title like #{title} and author like #{author} the overrides attribute takes a pipe delimited list of text to override, where whitespace is relevant. the result is the removal of anything specified in the overrides attribute, and the insertion of anything in the with attribute. you can also use the trim element with set. 6 – sixth scenario : the user will choose all the categories an article can belong to. so in this case, we have a list (a collection), and we have to interate this collection and we are going to use a foreach element: select id, title, author from article title like #{title} and author like #{author} the foreach element is very powerful, and allows you to specify a collection, declare item and index variables that can be used inside the body of the element. it also allows you to specify opening and closing strings, and add a separator to place in between iterations. the element is smart in that it won’t accidentally append extra separators. note : you can pass a list instance or an array to mybatis as a parameter object. when you do, mybatis will automatically wrap it in a map, and key it by name. list instances will be keyed to the name “list” and array instances will be keyed to the name “array”. the complete article.xml file looks like this: select id, title, author from article where id_status = 1 and title like #{title} select id, title, author from article where id_status = 1 and title like #{title} and author like #{author} select id, title, author from article where id_status = 1 and title like #{title} and author like #{author} and id_category = 3 select id, title, author from article title like #{title} and author like #{author} select id, title, author from article title like #{title} and author like #{author} select id, title, author from article where id_category in #{category} download if you want to download the complete sample project, you can get it from my github account: https://github.com/loiane/ibatis-dynamic-sql if you want to download the zip file of the project, just click on download: there are more articles about ibatis to come. stay tooned! happy coding! from http://loianegroner.com/2011/03/ibatis-mybatis-working-with-dynamic-queries-sql/

March 23, 2011

by Loiane Groner

· 72,683 Views

IBatis (MyBatis): Discriminator Column Example – Inheritance Mapping Tutorial

This tutorial will walk you through how to setup iBatis (MyBatis) in a simple Java project and will present an example using a discriminator column, in another words it is a inheritance mapping tutorial. Pre-Requisites For this tutorial I am using: IDE: Eclipse (you can use your favorite one) DataBase: MySQL Libs/jars: Mybatis, MySQL conector and JUnit (for testing) This is how your project should look like: Sample Database Please run the script into your database before getting started with the project implementation. You will find the script (with dummy data) inside the sql folder. 1 – POJOs – Beans I represented the beans here with a UML model, but you can download the complete source code in the end of this article. As you can see on the Data Modeling Diagram and the UML diagram above, we have a class Employee and two subclasses: Developer and Manager. The goal of this tutorial is to retrieve all the Employees from the database, but these employees can be an instance of Developer or Manager, and we are going to use a discriminator column to see which class we are going to instanciate. 2 – Employee Mapper – XML Sometimes a single database query might return result sets of many different (but hopefully somewhat related) data types. The discriminator element was designed to deal with this situation, and others, including class inheritance hierarchies. The discriminator is pretty simple to understand, as it behaves much like a switch statement in Java. A discriminator definition specifies column and javaType attributes. The column is where MyBatis will look for the value to compare. The javaType is required to ensure the proper kind of equality test is performed (although String would probably work for almost any situation). SELECT id, name, employee_type, manager_id, info, developer_id, product FROM employee E left join manager M on M.employee_id = E.id left join developer D on D.employee_id = E.id In this example, MyBatis would retrieve each record from the result set and compare its employee type value. If it matches any of the discriminator cases, then it will use the resultMap specified by the case. This is done exclusively, so in other words, the rest of the resultMap is ignored (unless it is extended, which we talk about in a second). If none of the cases match, then MyBatis simply uses the resultMap as defined outside of the discriminator block. So, if the managerResult was declared as follows: Now all of the properties from both the managerResult and developerResult will be loaded. Once again though, some may find this external definition of maps somewhat tedious. Thereforethere’s an alternative syntax for those that prefer a more concise mapping style. For example: SELECT id, name, employee_type, manager_id, info, developer_id, product FROM employee E left join manager M on M.employee_id = E.id left join developer D on D.employee_id = E.id Remember that these are all Result Maps, and if you don’t specify any results at all, then MyBatis willautomatically match up columns and properties for you. So most of these examples are more verbosethan they really need to be. That said, most databases are kind of complex and it’s unlikely that we’ll beable to depend on that for all cases. 3 - Employee Mapper – Annotations We did the configuration in XML, now let’s try to use annotations to do the same thing we did using XML. This is the code for EmployeeMapper.java: package com.loiane.data; import java.util.List; import org.apache.ibatis.annotations.Case;import org.apache.ibatis.annotations.Result;import org.apache.ibatis.annotations.Select;import org.apache.ibatis.annotations.TypeDiscriminator; import com.loiane.model.Developer;import com.loiane.model.Employee;import com.loiane.model.Manager; public interface EmployeeMapper { final String SELECT_EMPLOYEE = "SELECT id, name, employee_type, manager_id, info, developer_id, product " + "FROM employee E left join manager M on M.employee_id = E.id " + "left join developer D on D.employee_id = E.id "; /** * Returns the list of all Employee instances from the database. * @return the list of all Employee instances from the database. */ @Select(SELECT_EMPLOYEE) @TypeDiscriminator(column = "employee_type", cases = { @Case (value="1", type = Manager.class, results={ @Result(property="managerId", column="manager_id"), @Result(property="info"), }), @Case (value="2", type = Developer.class, results={ @Result(property="developerId", column="developer_id"), @Result(property="project", column="product"), }) }) List getAllEmployeesAnnotation();} If you are reading this blog lately, you are already familiar with the @Select and @Result annotations. So let’s skip it. Let’s talk about the @TypeDiscriminator and @Case annotations. @TypeDiscriminator A group of value cases that can be used to determine the result mapping to perform. Attributes: column, javaType, jdbcType, typeHandler, cases. The cases attribute is an array of Cases. @Case A single case of a value and its corresponding mappings. Attributes: value, type, results. The results attribute is an array of Results, thus this Case Annotation is similar to an actual ResultMap, specified by the Results annotation below. In this example: We set the column atribute for @TypeDiscriminator to determine which column MyBatis will look for the value to compare. And we set an array of @Case. For each @Case we set the value, so if the column matches the value, MyBatis will instanciate a object of type we set and we also set an array of @Result to match column with class atribute. Note one thing: using XML we set the id and name properties. We did not set these properties using annotations. It is not necessary, because the column matches the atribute name. But if you need to set, it is going to look like this: @Select(SELECT_EMPLOYEE)@Results(value = { @Result(property="id"), @Result(property="name")})@TypeDiscriminator(column = "employee_type", cases = { @Case (value="1", type = Manager.class, results={ @Result(property="managerId", column="manager_id"), @Result(property="info"), }), @Case (value="2", type = Developer.class, results={ @Result(property="developerId", column="developer_id"), @Result(property="project", column="product"), })})List getAllEmployeesAnnotation(); 4 – EmployeeDAO In the DAO, we have two methods: the first one will call the select statement from the XML and the second one will call the annotation method. Both returns the same result. package com.loiane.dao; import java.util.List; import org.apache.ibatis.session.SqlSession;import org.apache.ibatis.session.SqlSessionFactory; import com.loiane.data.EmployeeMapper;import com.loiane.model.Employee; public class EmployeeDAO { /** * Returns the list of all Employee instances from the database. * @return the list of all Employee instances from the database. */ @SuppressWarnings("unchecked") public List selectAll(){ SqlSessionFactory sqlSessionFactory = MyBatisConnectionFactory.getSqlSessionFactory(); SqlSession session = sqlSessionFactory.openSession(); try { List list = session.selectList("Employee.getAllEmployees"); return list; } finally { session.close(); } } /** * Returns the list of all Employee instances from the database. * @return the list of all Employee instances from the database. */ public List selectAllUsingAnnotations(){ SqlSessionFactory sqlSessionFactory = MyBatisConnectionFactory.getSqlSessionFactory(); SqlSession session = sqlSessionFactory.openSession(); try { EmployeeMapper mapper = session.getMapper(EmployeeMapper.class); List list = mapper.getAllEmployeesAnnotation(); return list; } finally { session.close(); } } The output if you call one of these methods and print: Employee ID = 1 Name = Kate Manager ID = 1 Info = info KateEmployee ID = 2 Name = Josh Developer ID = 1 Project = webEmployee ID = 3 Name = Peter Developer ID = 2 Project = desktopEmployee ID = 4 Name = James Manager ID = 2 Info = info JamesEmployee ID = 5 Name = Susan Developer ID = 3 Project = web Download If you want to download the complete sample project, you can get it from my GitHub account: https://github.com/loiane/ibatis-discriminator If you want to download the zip file of the project, just click on download: There are more articles about iBatis to come. Stay tooned! From http://loianegroner.com/2011/03/ibatis-mybatis-discriminator-column-example-inheritance-mapping-tutorial/

March 22, 2011

by Loiane Groner

· 34,202 Views

Practical PHP Testing Patterns: Fake Object

The purpose of a Fake Object, a kind of Test Double, is to replace a collaborator with a functional copy. While Mocks prefer a specification of the behavior to check, Fake Objects are really a simplified version of the production object they substitute. A good Fake Object, however, will be very lightweight, easy to create and throw away to ensure test isolation. Usually it won't satisfy some other functional or non-functional requirements, otherwise we would use him instead of the real object. For example a UserRepository, which normally stores users in a database, may be substitued by a Fake that: does not send mails when an user is added. Doesn't really save anything in the database, but keeps a list in an array. When some complex method is called, just throws a NonImplementedException. Utility Of course performance and less brittleness of tests are the main selling points of a Fake Objects: think about testing with a real database and without one; however this is true for all Test Doubles and I don't need to repeat it. Often the Fake is used when the contract between the SUT and the collaborator is too complex to effectively build a Stub or Mock: many methods, or many calls to the same method are made. Return parameters are difficult to setup. A Fake avoids overspecifying a test for a very invasive contract, like the order of calls and precise parameters; a real implementation is more robust to changes in the contract. For example if the Fake is a collection, you can store and retrieve objects of another type, and still not change the Fake; expectations and configured return value instead would change (together). As an example, an embryonal Fake is PHPUnit $this->returnArgument() for the will() Mock expectation. It's a piece of functionality which substitutes an hard-coded expectation, in order to return one of the arguments of the call instead of adding the argument to the expectation itself. Note that many times we don't have to create new code to leverage Fakes: we can use the existing one with a different configuration (like a Cache maximumCachedItems=0, or with an adapter that use an in-memory cache instead of APC or Memcache) or something provided for us, like an sqlite in-memory database created by PDO. NullObjects are sometimes Fake, but are used in production more than in testing. Implementation Several steps are necessary for building an hand-rolled Fake: create the Fake class by hand, and define the simplest implementation that could do the job for this test. Remember, you're not testing the Fake, you're testing the SUT, so the Fake doesn't have to be perfect, but just passable for the test case at end. Instance an object from the Fake class: the constructor may be different from the real collaborator's one as it's not part of the contract. Install the Fake into the SUT, and proceed as normal. The steps for generating a Fake with PHPUnit are a little different; however, you will still have to write its businss logic: create the "mock" as always via getMock() or getMockBuilder(). Define expectations for its methods with will($this->returnCallback()) and some anonymous functions (or via self-shunting like we did for "Stubs"). Often objects passed in this closures (such as ArrayObject instances) can aid in making the Fake methods communicate with each other. Install the Fake and proceed with the test. The second approach is invaluable when you have a single method on the Fake: it's very fast. Variations Fake Database: an alternative implementation of the Database Facade that lets you test classes that depend on the database without actually using it. For exaple, a DAO which internally use an SplObjectStorage instead of the connection. In-Memory Database: sqlite3 in-memory database used with ORMs for tests that isolate you from the real database, but not from using PDO. High Return On Investment. Fake Web Service: mimics the interface of some web service like Google Analytics, when you have to interact also in write and not only in read. Fake Service Layer: simplified implementations of Service Layer classes, which avoid checking concurrency, authorization, authentication and so on in order to simplify functional testing. Example In this code sample, the test target a ForumManager class which needs to manipulate Posts. It's not a simple Facade: it moves Post around, merges threads and so on. So we inject in it a FakePostDao: ForumManager will call it instead of the database. When you I have many tests of this type, the time for writing the Fake implementation is well spent. array( new Post('Hello'), new Post('Hello!'), new Post('') ), 2 => array( new Post('Hi'), new Post('Hi!') ), 3 => array( new Post('Good morning.') ) )); $forumManager = new ForumManager($dao); $forumManager->mergeThreadsByIds(1, 2); $thread = $dao->getThread(1); $this->assertEquals(5, count($thread)); } } /** * The SUT. */ class ForumManager { private $dao; public function __construct(PostsDao $dao) { $this->dao = $dao; } public function mergeThreadsByIds($originalId, $toBeMergedId) { $original = $this->dao->getThread($originalId); $toBeMerged = $this->dao->getThread($toBeMergedId); $newOne = array_merge($original, $toBeMerged); $this->dao->removeThread($originalId); $this->dao->removeThread($toBeMergedId); $this->dao->addThread($originalId, $newOne); } } /** * Interface for the collaborator to substitute with the Test Double. */ interface PostsDao { public function getThread($id); public function removeThread($id); public function addThread($id, array $thread); } /** * Fake implementation. */ class FakePostDao implements PostsDao { private $threads; public function __construct(array $initialState) { $this->threads = $initialState; } public function getThread($id) { return $this->threads[$id]; } public function removeThread($id) { unset($this->threads[$id]); } /** * We model Thread as array of Posts for simplicity. */ public function addThread($id, array $thread) { $this->threads[$id] = $thread; } } /** * Again a Dummy object: minimal implementation, to make this test pass. */ class Post { }

March 16, 2011

by Giorgio Sironi

· 4,086 Views