DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest Databases Topics

article thumbnail
How to Build an iOS and Android App in 24 hours with HTML5 and Cordova
what can one create during the new year and christmas holidays? as it turned down – quite enough. even if you have two kids and a bunch of family members whom you want to visit. the only thing you cannot accomplish in time is to finish an article for dzone. it takes a lot of time, nearly the entire january. by the 5th of january i had a laptop and a couple of days to spend on some development. having estimated what i can do here, i decided to create a mobile app that would work faster than the original. for this, i needed to find communicative creators of a popular app. hence, i found a “ spender ” app in the app store. it is a simple app for tracking your budget. with it, you can estimate how effectively you spend your money in the end of each month. by the 5th of january, this app was in top-10 in the russian app store. i also found their dev-story on iphones.ru. in their dev-story, the developers wrote that after completing their previous project, they had three-four free days. so, they decided to create a new app during this free time. their product manager and programmers helped them with positioning the app and its key features. this encouraged me and i began to think how to create nearly the same app in 2 days . note: the original app was updated in the middle of january, and now it looks a little different from my app. anyway, you can find its screenshots in the dev-story. i already had the experience of mobile app development using c# and cocoa. since this was my personal free time, i wanted to use it with maximum effectiveness. even if i didn’t succeed, i was eager to learn a new framework or programming language. i was working for devexpress from 2006 till2011 and have been reading their announces since i left the company. so, i knew that they created a mobile js-framework based on cordova/phonegap. they made it after i left the company, so i was curious to try it. the gartner research company reports that by august, 2013 most of the enterprise mobile software was created using phonegap or phonegap-based products (like kony ). from my consumer experience, it's far from true. maybe i was wrong? i'm not so good at html and javascript. i can create mark-up with stackoverflow.com and i can write simple selectors with jquery. i can also find the required information in their documentation. in other words, html+js was a gap in my knowledge and i was ready to fill it or gain some experience. thus, i planned to create a cross-platform application that could become an advantage over the original ios-only spender app. moreover, i wanted to spend my time in the most effective way. on the one hand, i had a potentially effective js framework, on the other – a lack of js experience. i hoped that the js framework advantages could balance my poor experience. since i like to use a vcs during development, i'll try to recover my progress. you can download complete apps here: ios , android i'm not sure i can provide public access to my repo, because it contains images i bought from fotolia and third-party libraries, each with a difference license. i'm not a lawyer, so i’d prefer not to take the risk. the most curious of you can take a look into the app bundle itself. js wasn't minified. place: tula, russia, date: january, 5, 2014 +20 minutes spent on installing node.js and cordova cli +10 minutes downloaded a template app from cordova. added a template from phonejs. created a git-repo, registered it in webstorm. added a new record to the httpd.conf in order to have an ability to debug my future app in the browser. +38 minutes changed the app namespace to "io.nikitin.thriftbox". added navigation. phonejs is an mvc-framework. each app screen is represented as a collection of html markup (views) and fabric function (viewmodel). here is how it looks at its simplest // view content and thriftbox.home = function (params) { // request parameters taken from uri return {}; // viewmodel instance }; then view and view model are bound via knockout-bindings . to be in time, i create only two screens: expense input and monthly expense report. +4 hours 20 minutes here i got stuck for the first time. i couldn't create a markup of digit buttons. the original app had a huge keyboard that looked like a calculator or dialer. i found out that it was not that easy to create such a keyboard, even using a table tag. in the iphone retina screen, 1px borders between buttons changed their colors after clicking on the buttons. on my iphone, the difference in colors was very noticeable. i had to invent how to tackle this. i tried to implement buttons using div s. but i couldn't achieve a border width of 1 px and make all buttons look equal in different screens. three hours later i gave up the idea of using divs and moved forward. +28 minutes removing a clicked button indicator on ios. ios displays a gray indicator around tapped links and objects with the onclick event handler. since i had my own indicator of a tapped object (the tapped button became darker), i didn't need the default indicator. i solved this problem using the dxaction event: was: 1 became: 1 this event is an extended variation of a "click" event: its handler supports uri navigation between views and correctly works in the scrollable area. +14 minutes the buttonpress event handler shown in the previous example now validates numbers from user input. var number = ko.observable(null); var isvalidnumber = ko.computed(function() { return number() && parsefloat(number()) > 0; }); ...... function buttonpress(button) { if (button) { if (number()) number(number() + button); else number(button); } else if (number()) number(number().substr(0, number().length - 1)); } var viewmodel = { number: number, isvalidnumber: isvalidnumber, viewshowing: viewshowing, buttonpress: buttonpress }; ..... +8 minutes added a fastclick.js , which removes a delay between tapping the screen and raising the 'click' event on phones. the mobile browser delays the raising of the click event by default to be sure the end-user will not perform a double tap. for the end-user, this looks as if the app is sluggish. you click buttons much faster than an app responds. fastclick.js handles the touchstart event and then creates all the click event process logic. btw, adding this library was a mistake; later i'll tell why. +4 minutes added a limitation to the length of user input numbers. corrected the font size for a better look-and-feel. +58 minutes added a choice of an expense category. added a scrollable pane with available categories below the input field. video . it took less time than it could be. in the phonejs component collection, i found dxtileview . it provides a kinetic scrolling with the required appearance out-of-the-box. it's not easy to implement kinetic scrolling by yourself and thus it’s great that this scrolling is enabled for ios only - android doesn't have it. it was 7:40 pm, so, i decided to continue the next day. place: tula, russia, date: january, 5, 2014 +3 hours 9 minutes storing data on a local storage. phonejs contains classes for working with data: selection, filtering, sorting, and grouping. there are several approaches to store data: odata and localstorage. i didn't want to implement a server side for a free app, and decided to use localstorage. later i found out that this was not an ideal decision. for example, when updating to ios 5.1 user data is erased , other people complained that localstorage is cleared regularly or even when shutting the device down. i didn't want to risk, so i used file api of phonegap. documentation says that this api is based on w3c file api. in fact, this means that this api differs in safari for mac os, chrome for mac os, cordova for ios and cordova for android. file api implementation is different for ios and android . e.g. android implementation doesn't contain the 'blob' class and 'window.permanent' constant. ii however implements the 'localfilesystem' and 'localfilesystem.persistent' classes. the laptop browser provides additional api for requesting an additional storage space, which mobile browsers don't provide. the available documentation for this api adds more problems. i found several articles searching by "html5 file api". and, i couldn't find an article that would cover all my questions. finally i created a new class for working with fileapi. this class supports cordova 3.3 on ios, android, and chrome 32 for mac os and windows 8. you can find it here: https://github.com/chebum/filestorage-for-phone.js/blob/master/filestorage.js you can use it as follows: // in this example i create data/records file in the documents folder of the app fs.initfileapi(1000000, true) .then(function () { var records = new fs.filearraystore({ key: "id", filename: "records" }); return records.insert({ customer: "peter" }) }) .then(function () { alert("record saved!"); }); // or use low-level api: fs.initfileapi(100000, true) .then(function() { return fs.writefile("file1", "file content") }) .then(function() { alert("file saved!"); }); +33 minutes saving the added records to the storage. category list is stored in arraystore , to simplify the selection operations. +26 minutes creating layout for the app's views. phonejs provides several layouts that are the placeholders for the views. my app's start page didn't fit into any of the available layout, so i have chosen the emptylayout. but, it doesn't provide animation effects when navigating through views. i copied the emptylayout code and added an attribute that had animation effects. +1 h. 51 min. template's about screen was redesigned to a report screen, empty by that moment. created a viewmodel that selects data for a current month. added localization date formatting for the screen caption. +59 minutes added the display of expenses grouped by categories for a current month. +28 minutes added the selection of months for which the report should be generated. end-users can tap the screen header to select the required month. +1 h. 20 min. added cordova-plugin statusbar that didn't work outof-the-box. i found that the reference to cordova.js was commented in the phonejs app template: as a result, the native part of my app didn't work. +39 minutes in the report screen, the upper part was changed to dxtoolbar . +22 minutes i discoveredwhy the dxbutton click event handler didn't work. removing the fastclick.js solved my problem, but caused a delay between tapping and event raising. i've changed the dxaction event subscription to 'touchstart'. +25 minutes formatting output strings when generating a report. at night i dreamed of crappy buttons in the application’s main screen. places: tula, vnukovo airport, date: january, 7-8, 2014 i had an early flight to budapest from vnukovo, and because i had no time in the afternoon, i gradually completed at the airport at night. as you know, it’s not very comfortable to sleep or sit in a café chair for a long time, but it turned out that programming was ok. +2 h. 5 min. in the morning, i decided to split the buttons in order to remove borders between them. i took the ios dialer keyboard as a sample. i created three keyboards. the button size changes depending on screen resolution: for 3.5'', 4'' and 5'' phones. each table cell contained a div with configured alignment. because of the lack of an incomplete vertical text alignment in html, the final css style for buttons ended to be quite complex: .home-view .buttons td div { color: #4a5360; border: 1px solid #4a5360; text-align: center; position: absolute; left: 50%; /* small buttons - default */ font-size: 26px; padding: 13px 0 13px 0; width: 52px; line-height: 26px; border-radius: 26px; margin-left: -27px; margin-top: -27px; } +1 h. 50 minutes i bought several vector icon sets on fotolia. i cut the required icons and converted them to png. it took me quite a long time, maybe, because it was 1.30 am :) +1 hour 10 minutes added a splash-screen for the app. +36 minutes created three sizes for the app icon. localized the app name for ios. +20 minutes hiding the splash screen after the app is completely loaded. +2 hours fixing multiple bugs. +2 hours creating screenshots for play store +30 minutes creating screenshots for app store +30 minutes writing an app description for two app stores. +1 h. 30 minutes submitting my app to the app store. here i faced with an issue with the app certification. my accountancy let's summarize the time i spent and divide it into categories. development: 21 hours 37 minutes graphics and texts: 8 hours 26 minutes totally: 30 hours 3 minutes as a result, i got a minimum-feature working app, though it is not as cool as the latest version of "spender". i couldn't create splitting expenses by days and income input. my app's ui could be more elegant as well. after analyzing the original 'spender' developer work, i got the following. they say that they involved four developers for three-four days. it is about 96-128 man-hours. i spent only 30 man-hours and got an app for three mobile platforms. ios and android versions are already in stores. the version for windows phone 8 requires a ui redesign. i can be proud of myself :). you can download complete apps here: ios , android
February 12, 2014
by Ivan Nikitin
· 210,663 Views
article thumbnail
Build Your Own Custom Lucene Query and Scorer
Every now and then we’ll come across a search problem that can’t simply be solved with plain Solr relevancy. This usually means a customer knows exactly how documents should be scored. They may have little tolerance for close approximations of this scoring through Solr boosts, function queries, etc. They want a Lucene-based technology for text analysis and performant data structures, but they need to be extremely specific in how documents should be scored relative to each other. Well for those extremely specialized cases we can prescribe a little out-patient surgery to your Solr install – building your own Lucene Query. This is the Nuclear Option Before we dive in, a word of caution. Unless you just want the educational experience, building a custom Lucene Query should be the “nuclear option” for search relevancy. It’s very fiddly and there are many ins-and-outs. If you’re actually considering this to solve a real problem, you’ve already gone down the following paths: You’ve utilized Solr’s extensive set of query parsers & features including function queries, joins, etc. None of this solved your problem You’ve exhausted the ecosystem of plugins that extend on the capabilities in (1). That didn’t work. You’ve implemented your own query parser plugin that takes user input and generates existing Lucene queries to do this work. This still didn’t solve your problem. You’ve thought carefully about your analyzers – massaging your data so that at index time and query time, text lines up exactly as it should to optimize the behavior of existing search scoring. This still didn’t get what you wanted. You’ve implemented your own custom Similarity that modifies how Lucene calculates the traditional relevancy statistics – query norms, term frequency, etc. You’ve tried to use Lucene’s CustomScoreQuery to wrap an existing Query and alter each documents score via a callback. This still wasn’t low-level enough for you, you needed even more control. If you’re still reading you either think this is going to be fun/educational (good for you!) or you’re one of the minority that must control exactly what happens with search. If you don’t know, you can of course contact us for professional services. Ok back to the action… Refresher – Lucene Searching 101 Recall that to search in Lucene, we need to get a hold of an IndexSearcher. This IndexSearcher performs search over an IndexReader. Assuming we’ve created an index, with these classes we can perform searches like in this code: Directory dir = new RAMDirectory(); IndexReader idxReader = new IndexReader(dir); idxSearcher idxSearcher = new IndexSearcher(idxReader) Query q = new TermQuery(new Term(“field”, “value”)); idxSearcher.search(q); Let’s summarize the objects we’ve created: Directory – Lucene’s interface to a file system. This is pretty straight-forward. We won’t be diving in here. IndexReader – Access to data structures in Lucene’s inverted index. If we want to look up a term, and visit every document it exists in, this is where we’d start. If we wanted to play with term vectors, offsets, or anything else stored in the index, we’d look here for that stuff as well. IndexSearcher — wraps an IndexReader for the purpose of taking search queries and executing them. Query – How we expect the searcher to perform the search, encompassing both scoring and which documents are returned. In this case, we’re searching for “value” in field “field”. This is the bit we want to toy with In addition to these classes, we’ll mention a support class exists behind the scenes: Similarity – Defines rules/formulas for calculating norms at index time and query normalization. Now with this outline, let’s think about a custom Lucene Query we can implement to help us learn. How about a query that searches for terms backwards. If the document matches a term backwards (like ananab for banana), we’ll return a score of 5.0. If the document matches the forwards version, let’s still return the document, with a score of 1.0 instead. We’ll call this Query “BackwardsTermQuery”. This example is hosted here on github. A tale of 3 classes – A Query, A Weight, and a Scorer Before we sling code, let’s talk about general architecture. A Lucene Query follows this general structure: A custom Query class, inheriting from Query A custom Weight class, inheriting from Weight A custom Scorer class inheriting from Scorer These three objects wrap each other. A Query creates a Weight, and a Weight in turn creates a Scorer. A Query is itself a very straight-forward class. One of its main responsibilities when passed to the IndexSearcher is to create a Weight instance. Other than that, there are additional responsibilities to Lucene and users of your Query to consider, that we’ll discuss in the “Query” section below. A Query creates a Weight. Why? Lucene needs a way to track IndexSearcher level statistics specific to each query while retaining the ability to reuse the query across multiple IndexSearchers. This is the role of the Weight class. When performing a search, IndexSearcher asks the Query to create a Weight instance. This instance becomes the container for holding high-level statistics for the Query scoped to this IndexSearcher (we’ll go over these steps more in the “Weight” section below). The IndexSearcher safely owns the Weight, and can abuse and dispose of it as needed. If later the Query gets reused by another IndexSearcher, a new Weight simply gets created. Once an IndexSearcher has a Weight, and has calculated any IndexSearcher level statistics, the IndexSearcher’s next task is to find matching documents and score them. To do this, the Weight in turn creates a Scorer. Just as the Weight is tied closely to an IndexSearcher, a Scorer is tied to an individual IndexReader. Now this may seem a little odd – in our code above the IndexSearcher always has exactly one IndexReader right? Not quite. See, a little hidden implementation detail is that IndexReaders may actually wrap other smaller IndexReaders – each tied to a different segment of the index. Therefore, an IndexSearcher needs to have the ability score documents across multiple, independent IndexReaders. How your scorer should iterate over matches and score documents is outlined in the “Scorer” section below. So to summarize, we can expand the last line from our example above… idxSearcher.search(q); … into this psuedocode: Weight w = q.createWeight(idxSearcher); // IndexSearcher level calculations for weight Foreach IndexReader idxReader: Scorer s = w.scorer(idxReader); // collect matches and score them Now that we have the basic flow down, let’s pick apart the three classes in a little more detail for our custom implementation. Our Custom Query What should our custom Query implementation look like? Query implementations always have two audiences: (1) Lucene and (2) users of your Query implementation. For your users, expose whatever methods you require to modify how a searcher matches and scores with your query. Want to only return as a match 1/3 of the documents that match the query? Want to punish the score because the document length is longer than the query length? Add the appropriate modifier on the query that impacts the scorer’s behavior. For our BackwardsTermQuery, we don’t expose accessors to modify the behavior of the search. The user simply uses the constructor to specify the term and field to search. In our constructor, we will simply be reusing Lucene’s existing TermQuery for searching individual terms in a document. private TermQuery backwardsQuery; private TermQuery forwardsQuery; public BackwardsTermQuery(String field, String term) { // A wrapped TermQuery for the reverse string Term backwardsTerm = new Term(field, new StringBuilder(term).reverse().toString()); backwardsQuery = new TermQuery(backwardsTerm); // A wrapped TermQuery for the Forward Term forwardsTerm = new Term(field, term); forwardsQuery = new TermQuery(forwardsTerm); } Just as importantly, be sure your Query meets the expectation of Lucene. Most importantly, you MUST override the following. createWeight() hashCode() equals() The method createWeight() we’ve discussed. This is where you’ll create a weight instance for an IndexSearcher. Pass any parameters that will influence the scoring algorithm, as the Weight will in turn be creating a searcher. Even though they are not abstract methods, overriding the hashCode()/equals() methods is very important. These methods are used by Lucene/Solr to cache queries/results. If two queries are equal, there’s no reason to rerun the query. Running another instance of your query could result in seeing the results of your first query multiple times. You’ll see your search for “peas” work great, then you’ll search for “bananas” and see “peas” search results. Override equals() and hashCode() so that “peas” != bananas. Our BackwardsTermQuery implements createWeight() by creating a custom BackwardsWeight that we’ll cover below: @Override public Weight createWeight(IndexSearcher searcher) throws IOException { return new BackwardsWeight(searcher); } BackwardsTermQuery has a fairly boilerplate equals() and hashCode() that passes through to the wrapped TermQuerys. Be sure equals() includes all the boilerplate stuff such as the check for self-comparison, the use of the super equals operator, the class comparison, etc etc. By using Lucene’s unit test suite, we can get a lot of good checks that our implementation of these is correct. @Override public boolean equals(Object other) { if (this == other) { return true; } if (!super.equals(other)) { return false; } if (getClass() != other .getClass()) { return false; } BackwardsTermQuery otherQ = (BackwardsTermQuery)(other); if (otherQ.getBoost() != getBoost()) { return false; } return otherQ.backwardsQuery.equals(backwardsQuery) && otherQ.forwardsQuery.equals(forwardsQuery); } @Override public int hashCode() { return super.hashCode() + backwardsQuery.hashCode() + forwardsQuery.hashCode(); } Our Custom Weight You may choose to use Weight simply as a mechanism to create Scorers (where the real meat of search scoring lives). However, your Custom Weight class must at least provide boilerplate implementations of the query normalization methods even if you largely ignore what is passed in: getValueForNormalization normalize These methods participate in a little ritual that IndexSearcher puts your Weight through with the Similarity for query normalization. To summarize the query normalization code in the IndexSearcher: float v = weight.getValueForNormalization(); float norm = getSimilarity().queryNorm(v); weight.normalize(norm, 1.0f); Great, what does this code do? Well a value is extracted from Weight. This value is then passed to a Similarity instance that “normalizes” that value. Weight then receives this normalized value back. In short, this is allowing IndexSearcher to give weight some information about how its “value for normalization” compares to the rest of the stuff being searched by this searcher. This is extremely high level, “value for normalization” could mean anything, but here it generally means “what I think is my weight” and what Weight receives back is what the searcher says “no really here is your weight”. The details of what that means depend on the Similarity and Weight implementation. It’s expected that the Weight’s generated Scorer will use this normalized weight in scoring. You can chose to do whatever you want in your own Scorer including completely ignoring what’s passed to normalize(). While our Weight isn’t factoring into the scoring calculation, for consistency sake, we’ll participate in the little ritual by overriding these methods: @Override public float getValueForNormalization() throws IOException { return backwardsWeight.getValueForNormalization() + forwardsWeight.getValueForNormalization(); } @Override public void normalize(float norm, float topLevelBoost) { backwardsWeight.normalize(norm, topLevelBoost); forwardsWeight.normalize(norm, topLevelBoost); } Outside of these query normalization details, and implementing “scorer”, little else happens in the Weight. However, you may perform whatever else that requires an IndexSearcher in the Weight constructor. In our implementation, we don’t perform any additional steps with IndexSearcher. The final and most important requirement of Weight is to create a Scorer. For BackwardsWeight we construct our custom BackwardsScorer, passing scorers created from each of the wrapped queries to work with. @Override public Scorer scorer(AtomicReaderContext context, boolean scoreDocsInOrder, boolean topScorer, Bits acceptDocs) throws IOException { Scorer backwardsScorer = backwardsWeight.scorer(context, scoreDocsInOrder, topScorer, acceptDocs); Scorer forwardsScorer = forwardsWeight.scorer(context, scoreDocsInOrder, topScorer, acceptDocs); return new BackwardsScorer(this, context, backwardsScorer, forwardsScorer); } Our Custom Scorer The Scorer is the real meat of the search work. Responsible for identifying matches and providing scores for those matches, this is where the lion share of our customization will occur. It’s important to note that a Scorer is also a Lucene DocIdSetIterator. A DocIdSetIterator is a cursor into a set of documents in the index. It provides three important methods: docID() – what is the id of the current document? (this is an internal Lucene ID, not the Solr “id” field you might have in your index) nextDoc() – advance to the next document advance(target) – advance (seek) to the target One uses a DocIdSetIterator by first calling nextDoc() or advance() and then reading the docID to get the iterator’s current location. The value of the docIDs only increase as they are iterated over. By implementing this interface a Scorer acts as an iterator over matches in the index. A Scorer for the query “field1:cat” can be iterated over in this manner to return all the documents that match the cat query. In fact, if you recall from my article, this is exactly how the terms are stored in the search index. You can chose to either figure out how to correctly iterate through the documents in a search index, or you can use the other Lucene queries as building blocks. The latter is often the simplest. For example, if you wish to iterate over the set of documents containing two terms, simply use the scorer corresponding to a BooleanQuery for iteration purposes. The first method of our scorer to look at is docID(). It works by reporting the lowest docID() of our underlying scorers. This scorer can be thought of as being “before” the other in the index, and as we want to report numerically increasing docIDs, we always want to chose this value: @Override public int docID() { int backwordsDocId = backwardsScorer.docID(); int forwardsDocId = forwardsScorer.docID(); if (backwordsDocId <= forwardsDocId && backwordsDocId != NO_MORE_DOCS) { currScore = BACKWARDS_SCORE; return backwordsDocId; } else if (forwardsDocId != NO_MORE_DOCS) { currScore = FORWARDS_SCORE; return forwardsDocId; } return NO_MORE_DOCS; } Similarly, we always want to advance the scorer with the lowest docID, moving it ahead. Then, we report our current position by returning docID() which as we’ve just seen will report the docID of the scorer that advanced the least in the nextDoc() operation. @Override public int nextDoc() throws IOException { int currDocId = docID(); // increment one or both if (currDocId == backwardsScorer.docID()) { backwardsScorer.nextDoc(); } if (currDocId == forwardsScorer.docID()) { forwardsScorer.nextDoc(); } return docID(); } In our advance() implementation, we allow each Scorer to advance. An advance() implementation promises to either land docID() exactly on or past target. Our call to docID() after we call advance will return either that one or both are on target, or it will return the lowest docID past target. @Override public int advance(int target) throws IOException { backwardsScorer.advance(target); forwardsScorer.advance(target); return docID(); } What a Scorer adds on top of DocIdSetIterator is the “score” method. When score() is called, a score for the current document (the doc at docID) is expected to be returned. Using the full capabilities of the IndexReader, any number of information stored in the index can be consulted to arrive at a score either in score() or while iterating documents in nextDoc()/advance(). Given the docId, you’ll be able to access the term vector for that document (if available) to perform more sophisticated calculations. In our query, we’ll simply keep track as to whether the current docID is from the wrapped backwards term scorer, indicating a match on the backwards term, or the forwards scorer, indicating a match on the normal, unreversed term. Recall docID() is always called on advance/nextDoc. You’ll notice we update currScore in docID, updating it every time the document advances. @Override public float score() throws IOException { return currScore; } A Note on Unit Testing Now that we have an implementation of a search query, we’ll want to test it! I highly recommend using Lucene’s test framework. Lucene will randomly inject different implementations of various support classes, index implementations, to throw your code off balance. Additionally, Lucene creates test implementations of classes such as IndexReader that work to check whether your Query correctly fulfills its contract. In my work, I’ve had numerous cases where tests would fail intermittently, pointing to places where my use of Lucene’s data structures subtly violated the expected contract. An example unit test is included in the github project associated with this blog post. Wrapping Up That’s a lot of stuff! And I didn’t even cover everything there is to know! As an exercise to the reader, you can explore the Scorer methods cost() and freq(), as well as the rewrite() method of Query used optionally for optimization. Additionally, I haven’t explored how most of the traditional search queries end up using a framework of Scorers/Weights that don’t actually inherit from Scorer or Weight known as “SimScorer” and “SimWeight”. These support classes consult a Similarity instance to customize calculation certain search statistics such as tf, convert a payload to a boost, etc. In short there’s a lot here! So tread carefully, there’s plenty of fiddly bits out there! But have fun! Creating a custom Lucene query is a great way to really understand how search works, and the last resort short in solving relevancy problems short of creating your own search engine. And if you have relevancy issues, contact us! If you don’t know whether you do, our search relevancy product, Quepid – might be able to tell you!
February 10, 2014
by Doug Turnbull
· 14,431 Views
article thumbnail
Voron & Time Series: Working with Real Data
dan liebster has been kind enough to send me a real world time series database. the data has been sanitized to remove identifying issues, but this is actually real world data, so we can learn a lot more about this. this is what this looks like: the first thing that i did was take the code in this post , and try it out for size. i wrote the following: int i = 0; using (var parser = new textfieldparser(@"c:\users\ayende\downloads\timeseries.csv")) { parser.hasfieldsenclosedinquotes = true; parser.delimiters = new[] {","}; parser.readline();//ignore headers var startnew = stopwatch.startnew(); while (parser.endofdata == false) { var fields = parser.readfields(); debug.assert(fields != null); dts.add(fields[1], datetime.parseexact(fields[2], "o", cultureinfo.invariantculture), double.parse(fields[3])); i++; if (i == 25*1000) { break; } if (i%1000 == 0) console.write("\r{0,15:#,#} ", i); } console.writeline(); console.writeline(startnew.elapsed); } note that we are using a separate transaction per line , which means that we are really doing a lot of extra work. but this simulate very well incoming events coming one at a time. we were able to process 25,000 events in 8.3 seconds. at a rate of just over 3 events per millisecond . now, note that we have in here the notion of “channels”. from my investigation, it seems clear that some form of separation is actually very common in time series data. we are usually talking about sensors or some such, and we want to track data across different sensors over time. and there is little if any call for working over multiple sensors / channels at the same time. because of that, i made a relatively minor change in voron, that allows it to have an infinite number of separate trees. that means that i can use as many trees as you want, and we can model a channel as a tree in voron. i also changed things so we instead of doing a single transaction per line, we will do a transaction per 1000 lines. that dropped the time to insert 25,000 lines to 0.8 seconds. or a full order of magnitude faster. that done, i inserted the full data set, which is just over 1,096,384 records. that took 36 seconds. in the data set i have, there are 35 channels. i just tried, and reading all the entries in a channel with 35,411 events takes 0.01 seconds. that allows doing things like doing averages over time, comparing data, etc. you can see the code implementing this in the following link .
February 7, 2014
by Oren Eini
· 4,002 Views
article thumbnail
Managing Disk Space in MongoDB
In our previous post on MongoDB storage structure and dbStats metrics, we covered how MongoDB stores data and the differences between the dataSize, storageSize and fileSize metrics. We can now apply this knowledge to evaluate strategies for re-using MongoDB disk space. When documents or collections are deleted, empty record blocks within data files arise. MongoDB attempts to reuse this space when possible, but it will never return this space to the file system. This behavior explains why fileSize never decreases despite deletes on a database. If your app frequently deletes or if your fileSize is significantly larger than the size of your data plus indexes, you can use one of the methods below reclaim free space. Getting your free space back Compacting individual collections You can compact individual collections using the compact command. This command rewrites and defragments all data in a collection, as well as all of the indexes on that collection. Important notes on compacting: This operation blocks all other database activity when running and should be used only when downtime for your database is acceptable. If you are running a replica set, you can perform compaction on secondaries in order to avoid blocking the primary and use failover to make the primary a secondary before compacting it. Compacting individual collections will not reduce your storage footprint on disk (i.e., your fileSize) but it will defragment the collections you compact. Compacting one or more databases For a single-node MongoDB deployment, you can use the db.repairDatabase() command to compact all the collections in the database. This operation rewrites all the data and indexes for each collection in the database from scratch and thereby compacts and defragments the entire database. To compact all the databases on your server process, you can stop your mongod process and run it with the “–repair” option. Important notes on running a repair: This operation blocks all other database activity when running and should be used only when downtime for your database is acceptable. Running a repair requires free disk space equal to the size of your current data set plus 2 GB. You can use space in a different volume than the one that your mongod is running in by specifying the “–repairpath” option. Compacting all databases on a server by re-syncing replica set nodes For a multi-node MongoDB deployment, you can resync a secondary from scratch to reclaim space. By resyncing each node in your replica set you effectively rewrite the data files from scratch and thereby defragment your database. Please note that if your cluster is comprised of only two electable nodes, you will sacrifice high availability during the resync because the secondary is completely wiped before syncing. If your app is sensitive to downtime, we recommend a process similar to the one we use here at MongoLab which we call a “rolling node replacement.” This process replaces each node in your cluster in turn by bringing a new node into the cluster, replicating the data to that new node and removing the old node. In this way, your cluster can maintain the same level of redundancy during the compaction as during normal operations. A tip about efficiently using space usePowerOf2Sizes Setting the usePowerof2Sizes option is a proactive approach to reusing space in collections that experience frequent document moves or deletions. This option supersedes the default padding factor mechanism and reduces the impact of fragmentation within the collection by allocating additional space for each document in intervals that follow the powers of 2. Setting this option for a specific collection makes it less likely that documents in that collection need to be moved when they grow in size, less likely that a document will need to be moved more than once in its lifetime, and more likely that space left by moving documents can be reused by new or other moved documents. Thanks for reading! We hope the above strategies help guide you in evaluating options for reusing empty space in your MongoDB.
February 6, 2014
by Chris Chang
· 23,601 Views
article thumbnail
Java: Handling a RuntimeException in a Runnable
At the end of last year I was playing around with running scheduled tasks to monitor a Neo4j cluster and one of the problems I ran into was that the monitoring would sometimes exit. I eventually realised that this was because a RuntimeException was being thrown inside the Runnable method and I wasn’t handling it. The following code demonstrates the problem: import java.util.ArrayList; import java.util.List; import java.util.concurrent.*; public class RunnableBlog { public static void main(String[] args) throws ExecutionException, InterruptedException { ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(); executor.scheduleAtFixedRate(new Runnable() { @Override public void run() { System.out.println(Thread.currentThread().getName() + " -> " + System.currentTimeMillis()); throw new RuntimeException("game over"); } }, 0, 1000, TimeUnit.MILLISECONDS).get(); System.out.println("exit"); executor.shutdown(); } } If we run that code we’ll see the RuntimeException but the executor won’t exit because the thread died without informing it: Exception in thread "main" pool-1-thread-1 -> 1391212558074 java.util.concurrent.ExecutionException: java.lang.RuntimeException: game over at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252) at java.util.concurrent.FutureTask.get(FutureTask.java:111) at RunnableBlog.main(RunnableBlog.java:11) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120) Caused by: java.lang.RuntimeException: game over at RunnableBlog$1.run(RunnableBlog.java:16) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) At the time I ended up adding a try catch block and printing the exception like so: public class RunnableBlog { public static void main(String[] args) throws ExecutionException, InterruptedException { ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(); executor.scheduleAtFixedRate(new Runnable() { @Override public void run() { try { System.out.println(Thread.currentThread().getName() + " -> " + System.currentTimeMillis()); throw new RuntimeException("game over"); } catch (RuntimeException e) { e.printStackTrace(); } } }, 0, 1000, TimeUnit.MILLISECONDS).get(); System.out.println("exit"); executor.shutdown(); } } This allows the exception to be recognised and as far as I can tell means that the thread executing the Runnable doesn’t die. java.lang.RuntimeException: game over pool-1-thread-1 -> 1391212651955 at RunnableBlog$1.run(RunnableBlog.java:16) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) pool-1-thread-1 -> 1391212652956 java.lang.RuntimeException: game over at RunnableBlog$1.run(RunnableBlog.java:16) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) pool-1-thread-1 -> 1391212653955 java.lang.RuntimeException: game over at RunnableBlog$1.run(RunnableBlog.java:16) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) This worked well and allowed me to keep monitoring the cluster. However, I recently started reading ‘Java Concurrency in Practice‘ (only 6 years after I bought it!) and realised that this might not be the proper way of handling the RuntimeException. public class RunnableBlog { public static void main(String[] args) throws ExecutionException, InterruptedException { ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(); executor.scheduleAtFixedRate(new Runnable() { @Override public void run() { try { System.out.println(Thread.currentThread().getName() + " -> " + System.currentTimeMillis()); throw new RuntimeException("game over"); } catch (RuntimeException e) { Thread t = Thread.currentThread(); t.getUncaughtExceptionHandler().uncaughtException(t, e); } } }, 0, 1000, TimeUnit.MILLISECONDS).get(); System.out.println("exit"); executor.shutdown(); } } I don’t see much difference between the two approaches so it’d be great if someone could explain to me why this approach is better than my previous one of catching the exception and printing the stack trace.
February 6, 2014
by Mark Needham
· 19,587 Views
article thumbnail
Using Database Views in Grails
This post is a quick explanation on how to use database views in Grails. For an introduction I tried to summarize what database views are. However, I noticed I cannot describe it better than it is already done on Wikipedia. Therefore I will just quote the Wikipedia summary of View (SQL)here: In database theory, a view is the result set of a stored query on the data, which the database users can query just as they would in a persistent database collection object. This pre-established query command is kept in the database dictionary. Unlike ordinary base tables in a relational database, a view does not form part of the physical schema: as a result set, it is a virtual table computed or collated from data in the database, dynamically when access to that view is requested. Changes applied to the data in a relevant underlying table are reflected in the data shown in subsequent invocations of the view. (Wikipedia) Example Let's assume we have a Grails application with the following domain classes: class User { String name Address address ... } class Address { String country ... } For whatever reason we want a domain class that contains direct references to the name and the country of an user. However, we do not want to duplicate these two values in another database table. A view can help us here. Creating the view At this point I assume you are already using the Grails database-migration plugin. If you don't you should clearly check it out. The plugin is automatically included with newer Grails versions and provides a convenient way to manage databases using change sets. To create a view we just have to create a new change set: changeSet(author: '..', id: '..') { createView(""" SELECT u.id, u.name, a.country FROM user u JOIN address a on u.address_id = a.id """, viewName: 'user_with_country') } Here we create a view named user_with_country which contains three values: user id, user name andcountry. Creating the domain class Like normal tables views can be mapped to domain classes. The domain class for our view looks very simple: class UserWithCountry { String name String country static mapping = { table 'user_with_country' version false } } Note that we disable versioning by setting version to false (we don't have a version column in our view). At this point we just have to be sure that our database change set is executed before hibernate tries to create/update tables on application start. This is typically be done by disabling the table creation of hibernate in DataSource.groovy and enabling the automatic migration on application start by settinggrails.plugin.databasemigration.updateOnStart to true. Alternatively this can be achieved by manually executing all new changesets by running the dbm-update command. Usage Now we can use our UserWithCountry class to access the view: Address johnsAddress = new Address(country: 'england') User john = new User(name: 'john', address: johnsAddress) john.save(failOnError: true) assert UserWithCountry.count() == 1 UserWithCountry johnFromEngland = UserWithCountry.get(john.id) assert johnFromEngland.name == 'john' assert johnFromEngland.country == 'england' Advantages of views I know the example I am using here is not the best. The relationship between User and Address is already very simple and a view isn't required here. However, if you have more sophisticated data structures views can be a nice way to hide complex relationships that would require joining a lot of tables. Views can also be used as security measure if you don't want to expose all columns of your tables to the application.
January 25, 2014
by Michael Scharhag
· 16,626 Views
article thumbnail
Big Data Search, Part 4: The Index Format is Horrible
I have completed my own exercise, and while I wanted to try it with “few allocations” rule, it is interesting to see just how far out there the code is. This isn’t something that you can really use for anything except as a basis to see how badly you are doing. Let us start with the index format. It is just a CSV file with the value and the position in the original file. That means that any search we want to do on the file is actually a binary search, as discussed in the previous post. But doing a binary search like that is an absolute killer for performance. Let us consider our 15TB data set. In my tests, a 1GB file with 4.2 million rows produced roughly 80MB index. Assuming the same is true for the larger file, that gives us a 1.2 TB file. In my small index, we have to do 24 seeks to get to the right position in the file. And as you should know, disk seeks are expensive. They are in the order of 10ms or so. So the cost of actually searching the index is close to quarter of a second. Now, to be fair, there is going to be a lot of caching opportunities here, but probably not that many if we have a lot of queries to deal with ere. Of course, the fun thing about this is that even with a 1.2 TB file, we are still talking about less than 40 seeks (the beauty of O(logN) in action), but that is still pretty expensive. Even worse, this is what happens when we are running on a single query at a time. What do you think will happen if we are actually running this with multiple threads generating queries. Now we will have a lot of seeks (effective random) that would generate a big performance sink. This is especially true if we consider that any storage solution big enough to store the data is going to be composed of an aggregate of HDD disks. Sure, we get multiple spindles, so we get better performance overall, but still… Obviously, there are multiple solutions for this issue. B+Trees solve the problem by packing multiple keys into a single page, so instead of doing a O(log2N), you are usually doing O(log36N) or O(log100N). Consider those fan outs, we will have 6 – 8 seeks to do to get to our data. Much better than the 40 seeks required using plain binary search. It would actually be better than that in the common case, since the first few levels of the trees are likely to reside in memory (and probably in L1, if we are speaking about that). However, given that we are storing sorted strings here, one must give some attention to Sorted Strings Tables. The way those work, you have the sorted strings in the file, and the footer contains two important bits of information. The first is the bloom filter, which allows you to quickly rule out missing values, but the more important factor is that it also contains the positions of (by default) every 16th entry to the file. This means that in our 15 TB data file (with 64.5 billion entries), we will use about 15GB just to store pointers to the different locations in the index file (which will be about 1.2 TB). Note that the numbers actually are probably worse. Because SST (note that when talking about SST I am talking specifically about the leveldb implementation) utilize many forms of compression, it is actually that the file size will be smaller (although, since the “value” we use is just a byte position in the data file, we won’t benefit from compression there). Key compression is probably a lot more important here. However, note that this is a pretty poor way of doing things. Sure, the actual data format is better, in the sense that we don’t store as much, but in terms of the number of operations required? Not so much. We still need to do a binary search over the entire file. In particular, the leveldb implementation utilizes memory mapped files. What this ends up doing is rely on the OS to keep the midway points in the file in RAM, so we don’t have to do so much seeking. Without that, the cost of actually seeking every time would make SSTs impractical. In fact, you would pretty much have to introduce another layer on top of this, but at that point, you are basically doing trees, and a binary tree is a better friend here. This leads to an interesting question. SST is probably so popular inside Google because they deal with a lot of data, and the file format is very friendly to compression of various kinds. It is also a pretty simple format. That make it much nicer to work with. On the other hand, a B+Tree implementation is a lot more complex, and it would probably several orders of magnitude more complex if it had to try to do the same compression tricks that SSTs do. Another factor that is probably as important is that as I understand it, a lot of the time, SSTs are usually used for actual sequential access (map/reduce stuff) and not necessarily for the random reads that are done in leveldb. It is interesting to think about this in this fashion, at least, even if I don’t know what I’ll be doing with it.
January 24, 2014
by Oren Eini
· 12,094 Views
article thumbnail
How to Set Up a Multi-Node Hadoop Cluster on Amazon EC2, Part 1
Learn how to set up a four node Hadoop cluster using AWS EC2, PuTTy(gen), and WinSCP.
January 23, 2014
by Hardik Pandya
· 135,885 Views · 3 Likes
article thumbnail
Big Data Search, Part 2: Setting Up
the interesting thing about this problem is that i was very careful in how i phrased things. i said what i wanted to happen, but didn’t specify what needs to be done. that was quite intentional. for that matter, the fact that i am posting about what is going to be our acceptance criteria is also intentional. the idea is to have a non trivial task, but something that should be very well understood and easy to research. it also means that the candidate needs to be able to write some non trivial code. and i can tell a lot about a dev from such a project. at the same time, this is a very self contained scenario. the idea is that this is something that you can do in a short amount of time. the reason that this is an interesting exercise is that this is actually at least two totally different but related problems. first, in a 15tb file, we obviously cannot rely on just scanning the entire file. that means that we have to have an index. and that means we have to build it. interestingly enough, an index being a sorted structure, that means that we have to solve the problem of sorting more data than can fit in main memory. the second problem is probably easier, since it is just an implementation of external sort, and there are plenty of algorithms around to handle that. note that i am not really interested in actual efficiencies for this particular scenario. i care about being able to see the code. see that it works, etc. my solution, for example, is a single threaded system that make no attempt at parallelism or i/o optimizations. it clocks at over 1 gb / minute and the memory consumption is at under 150mb. queries for a unique value return the result in 0.0004 seconds. queries that returned 153k results completed in about 2 seconds. when increasing the used memory to about 650mb, there isn’t really any difference in performance, which surprised me a bit. then again, the entire code is probably highly inefficient. but that is good enough for now. the process is kicked off with indexing: 1: var options = new directoryexternalstorageoptions("/path/to/index/files"); 2: var input = file.openread(@"/path/to/data/crimes_-_2001_to_present.csv"); 3: var sorter = new externalsorter(input, options, new int[] 4: { 5: 1,// case number 6: 4, // ichr 7: 8: }); 9: 10: sorter.sort(); i am actually using the chicago crime data for this. this is a 1gb file that i downloaded from the chicago city portal in csv format. this is what the data looks like: the externalsorter will read and parse the file, and start reading it into a buffer. when it gets to a certain size (about 64mb of source data, usually), it will sort the values in memory and output them into temporary files. those file looks like this: initially, i tried to do that with binary data, but it turns out that that was too complex to be easy, and writing this in a human readable format made it much easier to work with. the format is pretty simple, you have the value of the left, and on the right you have start position of the row for this value. we generate about 17 such temporary files for the 1gb file. one temporary file per each 64 mb of the original file. this lets us keep our actual memory consumption very low, but for larger data sets, we’ll probably want to actually do the sort every 1 gb or maybe more. our test machine has 16 gb of ram, so doing a sort and outputting a temporary file every 8 gb can be a good way to handle things. but that is beside the point. the end result is that we have multiple sorted files, but they aren’t sequential. in other words, in file #1 we have values 1,4,6,8 and in file #2 we have 1,2,6,7. we need to merge all of them together. luckily, this is easy enough to do. we basically have a heap that we feed entries from the files into. and that pretty much takes care of this. see merge sort if you want more details about this. the end result of merging all of those files is… another file, just like them, that contains all of the data sorted. then it is time to actually handle the other issue, actually searching the data. we can do that using simple binary search, with the caveat that because this is a text file, and there is no fixed size records or pages, it is actually a big hard to figure out where to start reading. in effect, what i am doing is to select an arbitrary byte position, then walk backward until i find a ‘\n’. once i found the new line character, i can read the full line, check the value, and decide where i need to look next. assuming that i actually found my value, i can now go to the byte position of the value in the original file and read the original line, giving it to the user. assuming an indexing rate of 1 gb / minute a 15 tb file would take about 10 days to index. but there are ways around that as well, but i’ll touch on them in my next post. what all of this did was bring home just how much we usually don’t have to worry about such things. but i consider this research well spent, we’ll be using this in the future.
January 21, 2014
by Oren Eini
· 3,444 Views
article thumbnail
Node.js and N1QL
This post was originally written by Brett Lawson. So, recently I added support to our Node.js client for executing N1QL queries against your cluster, providing you are running an instance of the N1QL engine (to get a hold of the updated version of the Node.js client with this support, point npm to our github master branch at https://github.com/couchbase/couchnode). When I implemented it, I didn’t have very much to test against at the time, so I figured it would be a interesting endeavor to see how nice the Node.js’s beer-sample example would look if we used entirely N1QL queries rather than using any views. I first started by converting over the basic queries which simply selected all beers or breweries from the sample data, and then moved on to converting the live-search querying to use N1QL as well. I figured I would write a little blog post on the conversions and make some remarks about what I noticed along the way. Here is our first query: var q = { limit : ENTRIES_PER_PAGE, stale : false }; db.view( "beer", "by_name", q).query(function(err, values) { var keys = _.pluck(values, 'id'); db.getMulti( keys, null, function(err, results) { var beers = _.map(results, function(v, k) { v.value.id = k; return v.value; }); res.render('beer/index', {'beers':beers}); }) }); and the converted version: db.query( "SELECT META().id AS id, * FROM beer-sample WHERE type='beer' LIMIT " + ENTRIES_PER_PAGE, function(err, beers) { res.render('beer/index', {'beers':beers}); }); As you can see, we no longer need to do two separate operations to retrieve the list. We can execute our N1QL query which will returns all the information that we need, and formats it appropriately; rather than needing to reformat the data and add our id values, we can simply select it as part of the result set. I find the N1QL version here is much more concise and appreciate how simple it was to construct the query. I then converted the brewery listing function following a similar path, and here is what I ended up with, as you can see, it is similarly beautiful and concise: db.query( "SELECT META().id AS id, name FROM beer-sample WHERE type='brewery' LIMIT " + ENTRIES_PER_PAGE, function(err, breweries) { res.render('brewery/index', {'breweries':breweries}); }); Next I converted the searching methods. These were a bit more of a challenge as looking at the original code directly, without thinking about what it was trying to achieve, the semantics were not immediately obvious, here is a look at what it looked like: var q = { startkey : value, endkey : value + JSON.parse('"\u0FFF"'), stale : false, limit : ENTRIES_PER_PAGE } db.view( "beer", "by_name", q).query(function(err, values) { var keys = _.pluck(values, 'id'); db.getMulti( keys, null, function(err, results) { var beers = []; for(var k in results) { beers.push({ 'id': k, 'name': results[k].value.name, 'brewery_id': results[k].value.brewery_id }); } res.send(beers); }); }); Again, we have quite a bit of code to achieve something which you should expect to be quite simple. In case you can’t tell, the map/reduce query above retrieves a listing of beers whose names begin with the value entered by the user. We are going to convert this to a N1QL LIKE clause, and as an added bonus, we will allow the search term to appear anywhere in the string, instead of requiring it at the beginning: db.query( "SELECT META().id, name, brewery_id FROM beer-sample WHERE type='beer' AND LOWER(name) LIKE '%" + term + "%' LIMIT " + ENTRIES_PER_PAGE, function(err, beers) { res.send(beers); }); We have again collapsed a large amount of vaguely understandable code down to a simple and concise query. I believe this begins to show the power of N1QL and why I am personally so excited to see N1QL. There is however one caveat I noticed while doing this, and this is that similar to SQL, you need to be careful about what kind of user-data you are passing into your queries. I wrote a simple cleaning function to try and prevent any malicious intent (though N1QL is currently read-only anyways), but my cleaning code is by no means extensive. Another issue I noticed is that our second query with the LIKE clause executed significantly slower as a N1QL query then it did when using map/reduce. I believe this is simply a result of N1QL still being developer preview, and there is lots of optimizations left to be done by the N1QL team. If you want to see the fully converted source code, take a look at the n1ql branch of the beersample-node repository available here, https://github.com/couchbaselabs/beersample-node/tree/n1ql. Thanks! Brett
January 17, 2014
by Don Pinto
· 7,880 Views · 1 Like
article thumbnail
A Beginner's Guide to ACID and Database Transactions
Read the original article here.
January 7, 2014
by Vlad Mihalcea
· 20,365 Views
article thumbnail
CGLib: The Missing Manual
The byte code instrumentation library cglib is a popular choice among many well-known Java frameworks such as Hibernate (not anymore) or Spring for doing their dirty work. Byte code instrumentation allows to manipulate or to create classes after the compilation phase of a Java application. Since Java classes are linked dynamically at run time, it is possible to add new classes to an already running Java program. Hibernate uses cglib for example for its generation of dynamic proxies. Instead of returning the full object that you stored in a a database, Hibernate will return you an instrumented version of your stored class that lazily loads some values from the database only when they are requested. Spring used cglib for example when adding security constraints to your method calls. Instead of calling your method directly, Spring security will first check if a specified security check passes and only delegate to your actual method after this verification. Another popular use of cglib is within mocking frameworks such as mockito, where mocks are nothing more than instrumented class where the methods were replaced with empty implementations (plus some tracking logic). Other than ASM - another very high-level byte code manipulation library on top of which cglib is built - cglib offers rather low-level byte code transformers that can be used without even knowing about the details of a compiled Java class. Unfortunately, the documentation of cglib is rather short, not to say that there is basically none. Besides a single blog article from 2005 that demonstrates the Enhancer class, there is not much to find. This blog article is an attempt to demonstrate cglib and its unfortunately often awkward API. Enhancer Let's start with the Enhancer class, the probably most used class of the cglib library. An enhancer allows the creation of Java proxies for non-interface types. The Enhancer can be compared with the Java standard library's Proxy class which was introduced in Java 1.3. The Enhancer dynamically creates a subclass of a given type but intercepts all method calls. Other than with the Proxy class, this works for both class and interface types. The following example and some of the examples after are based on this simple Java POJO: public class SampleClass { public String test(String input) { return "Hello world!"; } } Using cglib, the return value of test(String) method can easily be replaced by another value using an Enhancer and a FixedValue callback: @Test public void testFixedValue() throws Exception { Enhancer enhancer = new Enhancer(); enhancer.setSuperclass(SampleClass.class); enhancer.setCallback(new FixedValue() { @Override public Object loadObject() throws Exception { return "Hello cglib!"; } }); SampleClass proxy = (SampleClass) enhancer.create(); assertEquals("Hello cglib!", proxy.test(null)); } In the above example, the enhancer will return an instance of an instrumented subclass of SampleClass where all method calls return a fixed value which is generated by the anonymous FixedValue implementation above. The object is created by Enhancer#create(Object...) where the method takes any number of arguments which are used to pick any constructor of the enhanced class. (Even though constructors are only methods on the Java byte code level, the Enhancer class cannot instrument constructors. Neither can it instrument static or final classes.) If you only want to create a class, but no instance, Enhancer#createClass will create a Class instance which can be used to create instances dynamically. All constructors of the enhanced class will be available as delegation constructors in this dynamically generated class. Be aware that any method call will be delegated in the above example, also calls to the methods defined in java.lang.Object. As a result, a call to proxy.toString() will also return "Hello cglib!". In contrast will a call to proxy.hashCode() result in a ClassCastException since the FixedValue interceptor always returns a String even though the Object#hashCode signature requires a primitive integer. Another observation that can be made is that final methods are not intercepted. An example of such a method is Object#getClass which will return something like "SampleClass$$EnhancerByCGLIB$$e277c63c" when it is invoked. This class name is generated randomly by cglib in order to avoid naming conflicts. Be aware of the different class of the enhanced instance when you are making use of explicit types in your program code. The class generated by cglib will however be in the same package as the enhanced class (and therefore be able to override package-private methods). Similar to final methods, the subclassing approach makes for the inability of enhancing final classes. Therefore frameworks as Hibernate cannot persist final classes. Next, let us look at a more powerful callback class, the InvocationHandler, that can also be used with an Enhancer: @Test public void testInvocationHandler() throws Exception { Enhancer enhancer = new Enhancer(); enhancer.setSuperclass(SampleClass.class); enhancer.setCallback(new InvocationHandler() { @Override public Object invoke(Object proxy, Method method, Object[] args) throws Throwable { if(method.getDeclaringClass() != Object.class && method.getReturnType() == String.class) { return "Hello cglib!"; } else { throw new RuntimeException("Do not know what to do."); } } }); SampleClass proxy = (SampleClass) enhancer.create(); assertEquals("Hello cglib!", proxy.test(null)); assertNotEquals("Hello cglib!", proxy.toString()); } This callback allows us to answer with regards to the invoked method. However, you should be careful when calling a method on the proxy object that comes with the InvocationHandler#invoke method. All calls on this method will be dispatched with the same InvocationHandler and might therefore result in an endless loop. In order to avoid this, we can use yet another callback dispatcher: @Test public void testMethodInterceptor() throws Exception { Enhancer enhancer = new Enhancer(); enhancer.setSuperclass(SampleClass.class); enhancer.setCallback(new MethodInterceptor() { @Override public Object intercept(Object obj, Method method, Object[] args, MethodProxy proxy) throws Throwable { if(method.getDeclaringClass() != Object.class && method.getReturnType() == String.class) { return "Hello cglib!"; } else { proxy.invokeSuper(obj, args); } } }); SampleClass proxy = (SampleClass) enhancer.create(); assertEquals("Hello cglib!", proxy.test(null)); assertNotEquals("Hello cglib!", proxy.toString()); proxy.hashCode(); // Does not throw an exception or result in an endless loop. } The MethodInterceptor allows full control over the intercepted method and offers some utilities for calling the method of the enhanced class in their original state. But why would one want to use other methods anyways? Because the other methods are more efficient and cglib is often used in edge case frameworks where efficiency plays a significant role. The creation and linkage of the MethodInterceptor requires for example the generation of a different type of byte code and the creation of some runtime objects that are not required with the InvocationHandler. Because of that, there are other classes that can be used with the Enhancer: LazyLoader: Even though the LazyLoader's only method has the same method signature as FixedValue, the LazyLoader is fundamentally different to the FixedValue interceptor. The LazyLoader is actually supposed to return an instance of a subclass of the enhanced class. This instance is requested only when a method is called on the enhanced object and then stored for future invocations of the generated proxy. This makes sense if your object is expensive in its creation without knowing if the object will ever be used. Be aware that some constructor of the enhanced class must be called both for the proxy object and for the lazily loaded object. Thus, make sure that there is another cheap (maybe protected) constructor available or use an interface type for the proxy. You can choose the invoked constructed by supplying arguments to Enhancer#create(Object...). Dispatcher: The Dispatcher is like the LazyLoader but will be invoked on every method call without storing the loaded object. This allows to change the implementation of a class without changing the reference to it. Again, be aware that some constructor must be called for both the proxy and the generated objects. ProxyRefDispatcher: This class carries a reference to the proxy object it is invoked from in its signature. This allows for example to delegate method calls to another method of this proxy. Be aware that this can easily cause an endless loop and will always cause an endless loop if the same method is called from within ProxyRefDispatcher#loadObject(Object). NoOp: The NoOp class does not what its name suggests. Instead, it delegates each method call to the enhanced class's method implementation. At this point, the last two interceptors might not make sense to you. Why would you even want to enhance a class when you will always delegate method calls to the enhanced class anyways? And you are right. These interceptors should only be used together with a CallbackFilter as it is demonstrated in the following code snippet: @Test public void testCallbackFilter() throws Exception { Enhancer enhancer = new Enhancer(); CallbackHelper callbackHelper = new CallbackHelper(SampleClass.class, new Class[0]) { @Override protected Object getCallback(Method method) { if(method.getDeclaringClass() != Object.class && method.getReturnType() == String.class) { return new FixedValue() { @Override public Object loadObject() throws Exception { return "Hello cglib!"; }; } } else { return NoOp.INSTANCE; // A singleton provided by NoOp. } } }; enhancer.setSuperclass(MyClass.class); enhancer.setCallbackFilter(callbackHelper); enhancer.setCallbacks(callbackHelper.getCallbacks()); SampleClass proxy = (SampleClass) enhancer.create(); assertEquals("Hello cglib!", proxy.test(null)); assertNotEquals("Hello cglib!", proxy.toString()); proxy.hashCode(); // Does not throw an exception or result in an endless loop. } The Enhancer instance accepts a CallbackFilter in its Enhancer#setCallbackFilter(CallbackFilter) method where it expects methods of the enhanced class to be mapped to array indices of an array of Callback instances. When a method is invoked on the created proxy, the Enhancer will then choose the according interceptor and dispatch the called method on the corresponding Callback (which is a marker interface for all the interceptors that were introduced so far). To make this API less awkward, cglib offers a CallbackHelper which will represent a CallbackFilter and which can create an array of Callbacks for you. The enhanced object above will be functionally equivalent to the one in the example for the MethodInterceptor but it allows you to write specialized interceptors whilst keeping the dispatching logic to these interceptors separate. How does it work? When the Enhancer creates a class, it will set create a privatestatic field for each interceptor that was registered as a Callback for the enhanced class after its creation. This also means that class definitions that were created with cglib cannot be reused after their creation since the registration of callbacks does not become a part of the generated class's initialization phase but are prepared manually by cglib after the class was already initialized by the JVM. This also means that classes created with cglib are not technically ready after their initialization and for example cannot be sent over the wire since the callbacks would not exist for the class loaded in the target machine. Depending on the registered interceptors, cglib might register additional fields such as for example for the MethodInterceptor where two privatestatic fields (one holding a reflective Method and a the other holding MethodProxy) are registered per method that is intercepted in the enhanced class or any of its subclasses. Be aware that the MethodProxy is making excessive use of the FastClass which triggers the creation of additional classes and is described in further detail below. For all these reasons, be careful when using the Enhancer. And always register callback types defensively, since the MethodInterceptor will for example trigger the creation of additional classes and register additional static fields in the enhanced class. This is specifically dangerous since the callback variables are also stored as static variables in the enhanced class: This implies that the callback instances are never garbage collected (unless their ClassLoader is, what is unusual). This is in particular dangerous when using anonymous classes which silently carry a reference to their outer class. Recall the example above: @Test public void testFixedValue() throws Exception { Enhancer enhancer = new Enhancer(); enhancer.setSuperclass(SampleClass.class); enhancer.setCallback(new FixedValue() { @Override public Object loadObject() throws Exception { return "Hello cglib!"; } }); SampleClass proxy = (SampleClass) enhancer.create(); assertEquals("Hello cglib!", proxy.test(null)); } The anonymous subclass of FixedValue would become hardly referenced from the enhanced SampleClass such that neither the anonymous FixedValue instance or the class holding the @Test method would ever be garbage collected. This can introduce nasty memory leaks in your applications. Therefore, do not use non-static inner classes with cglib. (I only use them in this blog entry for keeping the examples short.) Finally, you should never intercept Object#finalize(). Due to the subclassing approach of cglib, intercepting finalize is implemented by overriding it what is in general a bad idea. Enhanced instances that intercept finalize will be treated differently by the garbage collector and will also cause these objects being queued in the JVM's finalization queue. Also, if you (accidentally) create a hard reference to the enhanced class in your intercepted call to finalize, you have effectively created an noncollectable instance. This is in general nothing you want. Note that final methods are never intercepted by cglib. Thus, Object#wait, Object#notify and Object#notifyAll do not impose the same problems. Be however aware that Object#clone can be intercepted what is something you might not want to do. Immutable Bean cglib's ImmutableBean allows you to create an immutability wrapper similar to for example Collections#immutableSet. All changes of the underlying bean will be prevented by an IllegalStateException (however, not by an UnsupportedOperationException as recommended by the Java API). Looking at some bean public class SampleBean { private String value; public String getValue() { return value; } public void setValue(String value) { this.value = value; } } we can make this bean immutable: @Test(expected = IllegalStateException.class) public void testImmutableBean() throws Exception { SampleBean bean = new SampleBean(); bean.setValue("Hello world!"); SampleBean immutableBean = (SampleBean) ImmutableBean.create(bean); assertEquals("Hello world!", immutableBean.getValue()); bean.setValue("Hello world, again!"); assertEquals("Hello world, again!", immutableBean.getValue()); immutableBean.setValue("Hello cglib!"); // Causes exception. } As obvious from the example, the immutable bean prevents all state changes by throwing an IllegalStateException. However, the state of the bean can be changed by changing the original object. All such changes will be reflected by the ImmutableBean. Bean Generator The BeanGenerator is another bean utility of cglib. It will create a bean for you at run time: @Test public void testBeanGenerator() throws Exception { BeanGenerator beanGenerator = new BeanGenerator(); beanGenerator.addProperty("value", String.class); Object myBean = beanGenerator.create(); Method setter = myBean.getClass().getMethod("setValue", String.class); setter.invoke(myBean, "Hello cglib!"); Method getter = myBean.getClass().getMethod("getValue"); assertEquals("Hello cglib!", getter.invoke(myBean)); } As obvious from the example, the BeanGenerator first takes some properties as name value pairs. On creation, the BeanGenerator creates the accessors get() void set() for you. This might be useful when another library expects beans which it resolved by reflection but you do not know these beans at run time. (An example would be Apache Wicket which works a lot with beans.) Bean Copier The BeanCopier is another bean utility that copies beans by their property values. Consider another bean with similar properties as SampleBean: public class OtherSampleBean { private String value; public String getValue() { return value; } public void setValue(String value) { this.value = value; } } Now you can copy properties from one bean to another: @Test public void testBeanCopier() throws Exception { BeanCopier copier = BeanCopier.create(SampleBean.class, OtherSampleBean.class, false); SampleBean bean = new SampleBean(); myBean.setValue("Hello cglib!"); OtherSampleBean otherBean = new OtherSampleBean(); copier.copy(bean, otherBean, null); assertEquals("Hello cglib!", otherBean.getValue()); } without being restrained to a specific type. The BeanCopier#copy mehtod takles an (eventually) optional Converter which allows to do some further manipulations on each bean property. If the BeanCopier is created with false as the third constructor argument, the Converter is ignored and can therefore be null. Bulk Bean A BulkBean allows to use a specified set of a bean's accessors by arrays instead of method calls: @Test public void testBulkBean() throws Exception { BulkBean bulkBean = BulkBean.create(SampleBean.class, new String[]{"getValue"}, new String[]{"setValue"}, new Class[]{String.class}); SampleBean bean = new SampleBean(); bean.setValue("Hello world!"); assertEquals(1, bulkBean.getPropertyValues(bean).length); assertEquals("Hello world!", bulkBean.getPropertyValues(bean)[0]); bulkBean.setPropertyValues(bean, new Object[] {"Hello cglib!"}); assertEquals("Hello cglib!", bean.getValue()); } The BulkBean takes an array of getter names, an array of setter names and an array of property types as its constructor arguments. The resulting instrumented class can then extracted as an array by BulkBean#getPropertyBalues(Object). Similarly, a bean's properties can be set by BulkBean#setPropertyBalues(Object, Object[]). Bean Map This is the last bean utility within the cglib library. The BeanMap converts all properties of a bean to a String-to-Object Java Map: @Test public void testBeanGenerator() throws Exception { SampleBean bean = new SampleBean(); BeanMap map = BeanMap.create(bean); bean.setValue("Hello cglib!"); assertEquals("Hello cglib", map.get("value")); } Additionally, the BeanMap#newInstance(Object) method allows to create maps for other beans by reusing the same Class. Key Factory The KeyFactory factory allows the dynamic creation of keys that are composed of multiple values that can be used in for example Map implementations. For doing so, the KeyFactory requires some interface that defines the values that should be used in such a key. This interface must contain a single method by the name newInstance that returns an Object. For example: public interface SampleKeyFactory { Object newInstance(String first, int second); } Now an instance of a a key can be created by: @Test public void testKeyFactory() throws Exception { SampleKeyFactory keyFactory = (SampleKeyFactory) KeyFactory.create(Key.class); Object key = keyFactory.newInstance("foo", 42); Map map = new HashMap(); map.put(key, "Hello cglib!"); assertEquals("Hello cglib!", map.get(keyFactory.newInstance("foo", 42))); } The KeyFactory will assure the correct implementation of the Object#equals(Object) and Object#hashCode methods such that the resulting key objects can be used in a Map or a Set. The KeyFactory is also used quite a lot internally in the cglib library. Mixin Some might already know the concept of the Mixin class from other programing languages such as Ruby or Scala (where mixins are called traits). cglib Mixins allow the combination of several objects into a single object. However, in order to do so, those objects must be backed by interfaces: public interface Interface1 { String first(); } public interface Interface2 { String second(); } public class Class1 implements Interface1 { @Override public String first() { return "first"; } } public class Class2 implements Interface2 { @Override public String second() { return "second"; } } Now the classes Class1 and Class2 can be combined to a single class by an additional interface: public interface MixinInterface extends Interface1, Interface2 { /* empty */ } @Test public void testMixin() throws Exception { Mixin mixin = Mixin.create(new Class[]{Interface1.class, Interface2.class MixinInterface.class}, new Object[]{new Class1(), new Class2()}); MixinInterface mixinDelegate = (MixinInterface) mixin; assertEquals("first", mixinDelegate.first()); assertEquals("second", mixinDelegate.second()); } Admittedly, the Mixin API is rather awkward since it requires the classes used for a mixin to implement some interface such that the problem could also be solved by non-instrumented Java. String Switcher The StringSwitcher emulates a String to int Java Map: @Test public void testStringSwitcher() throws Exception { String[] strings = new String[]{"one", "two"}; int[] values = new int[]{10, 20}; StringSwitcher stringSwitcher = StringSwitcher.create(strings, values, true); assertEquals(10, stringSwitcher.intValue("one")); assertEquals(20, stringSwitcher.intValue("two")); assertEquals(-1, stringSwitcher.intValue("three")); } The StringSwitcher allows to emulate a switch command on Strings such as it is possible with the built-in Java switch statement since Java 7. If using the StringSwitcher in Java 6 or less really adds a benefit to your code remains however doubtful and I would personally not recommend its use. Interface Maker The InterfaceMaker does what its name suggests: It dynamically creates a new interface. @Test public void testInterfaceMaker() throws Exception { Signature signature = new Signature("foo", Type.DOUBLE_TYPE, new Type[]{Type.INT_TYPE}); InterfaceMaker interfaceMaker = new InterfaceMaker(); interfaceMaker.add(signature, new Type[0]); Class iface = interfaceMaker.create(); assertEquals(1, iface.getMethods().length); assertEquals("foo", iface.getMethods()[0].getName()); assertEquals(double.class, iface.getMethods()[0].getReturnType()); } Other than any other class of cglib's public API, the interface maker relies on ASM types. The creation of an interface in a running application will hardly make sense since an interface only represents a type which can be used by a compiler to check types. It can however make sense when you are generating code that is to be used in later development. Method Delegate A MethodDelegate allows to emulate a C#-like delegate to a specific method by binding a method call to some interface. For example, the following code would bind the SampleBean#getValue method to a delegate: public interface BeanDelegate { String getValueFromDelegate(); } @Test public void testMethodDelegate() throws Exception { SampleBean bean = new SampleBean(); bean.setValue("Hello cglib!"); BeanDelegate delegate = (BeanDelegate) MethodDelegate.create( bean, "getValue", BeanDelegate.class); assertEquals("Hello world!", delegate.getValueFromDelegate()); } There are however some things to note: The factory method MethodDelegate#create takes exactly one method name as its second argument. This is the method the MethodDelegate will proxy for you. There must be a method without arguments defined for the object which is given to the factory method as its first argument. Thus, the MethodDelegate is not as strong as it could be. The third argument must be an interface with exactly one argument. The MethodDelegate implements this interface and can be cast to it. When the method is invoked, it will call the proxied method on the object that is the first argument. Furthermore, consider these drawbacks: cglib creates a new class for each proxy. Eventually, this will litter up your permanent generation heap space You cannot proxy methods that take arguments. If your interface takes arguments, the method delegation will simply not work without an exception thrown (the return value will always be null). If your interface requires another return type (even if that is more general), you will get a IllegalArgumentException. Multicast Delegate The MulticastDelegate works a little different than the MethodDelegate even though it aims at similar functionality. For using the MulticastDelegate, we require an object that implements an interface: public interface DelegatationProvider { void setValue(String value); } public class SimpleMulticastBean implements DelegatationProvider { private String value; public String getValue() { return value; } public void setValue(String value) { this.value = value; } } Based on this interface-backed bean we can create a MulticastDelegate that dispatches all calls to setValue(String) to several classes that implement the DelegationProvider interface: @Test public void testMulticastDelegate() throws Exception { MulticastDelegate multicastDelegate = MulticastDelegate.create( DelegatationProvider.class); SimpleMulticastBean first = new SimpleMulticastBean(); SimpleMulticastBean second = new SimpleMulticastBean(); multicastDelegate = multicastDelegate.add(first); multicastDelegate = multicastDelegate.add(second); DelegatationProvider provider = (DelegatationProvider)multicastDelegate; provider.setValue("Hello world!"); assertEquals("Hello world!", first.getValue()); assertEquals("Hello world!", second.getValue()); } Again, there are some drawbacks: The objects need to implement a single-method interface. This sucks for third-party libraries and is awkward when you use CGlib to do some magic where this magic gets exposed to the normal code. Also, you could implement your own delegate easily (without byte code though but I doubt that you win so much over manual delegation). When your delegates return a value, you will receive only that of the last delegate you added. All other return values are lost (but retrieved at some point by the multicast delegate). Constructor Delegate A ConstructorDelegate allows to create a byte-instrumented factory method. For that, that we first require an interface with a single method newInstance which returns an Object and takes any amount of parameters to be used for a constructor call of the specified class. For example, in order to create a ConstructorDelegate for the SampleBean, we require the following to call SampleBean's default (no-argument) constructor: public interface SampleBeanConstructorDelegate { Object newInstance(); } @Test public void testConstructorDelegate() throws Exception { SampleBeanConstructorDelegate constructorDelegate = (SampleBeanConstructorDelegate) ConstructorDelegate.create( SampleBean.class, SampleBeanConstructorDelegate.class); SampleBean bean = (SampleBean) constructorDelegate.newInstance(); assertTrue(SampleBean.class.isAssignableFrom(bean.getClass())); } Parallel Sorter The ParallelSorter claims to be a faster alternative to the Java standard library's array sorters when sorting arrays of arrays: @Test public void testParallelSorter() throws Exception { Integer[][] value = { {4, 3, 9, 0}, {2, 1, 6, 0} }; ParallelSorter.create(value).mergeSort(0); for(Integer[] row : value) { int former = -1; for(int val : row) { assertTrue(former < val); former = val; } } } The ParallelSorter takes an array of arrays and allows to either apply a merge sort or a quick sort on every row of the array. Be however careful when you use it: When using arrays of primitives, you have to call merge sort with explicit sorting ranges (e.g. ParallelSorter.create(value).mergeSort(0, 0, 3) in the example. Otherwise, the ParallelSorter has a pretty obvious bug where it tries to cast the primitive array to an array Object[] what will cause a ClassCastException. If the array rows are uneven, the first argument will determine the length of what row to consider. Uneven rows will either lead to the extra values not being considered for sorting or a ArrayIndexOutOfBoundException. Personally, I doubt that the ParallelSorter really offers a time advantage. Admittedly, I did however not yet try to benchmark it. If you tried it, I'd be happy to hear about it in the comments. Fast Class and Fast Members The FastClass promises a faster invocation of methods than the Java reflection API by wrapping a Java class and offering similar methods to the reflection API: @Test public void testFastClass() throws Exception { FastClass fastClass = FastClass.create(SampleBean.class); FastMethod fastMethod = fastClass.getMethod(SampleBean.class.getMethod("getValue")); MyBean myBean = new MyBean(); myBean.setValue("Hello cglib!"); assertTrue("Hello cglib!", fastMethod.invoke(myBean, new Object[0])); } Besides the demonstrated FastMethod, the FastClass can also create FastConstructors but no fast fields. But how can the FastClass be faster than normal reflection? Java reflection is executed by JNI where method invocations are executed by some C-code. The FastClass on the other side creates some byte code that calls the method directly from within the JVM. However, the newer versions of the HotSpot JVM (and probably many other modern JVMs) know a concept called inflation where the JVM will translate reflective method calls into native version's of FastClass when a reflective method is executed often enough. You can even control this behavior (at least on a HotSpot JVM) with setting the sun.reflect.inflationThreshold property to a lower value. (The default is 15.) This property determines after how many reflective invocations a JNI call should be substituted by a byte code instrumented version. I would therefore recommend to not use FastClass on modern JVMs, it can however fine-tune performance on older Java virtual machines. cglib Proxy The cglib Proxy is a reimplementation of the Java Proxy class mentioned in the beginning of this article. It is intended to allow using the Java library's proxy in Java versions before Java 1.3 and differs only in minor details. The better documentation of the cglib Proxy can however be found in the Java standard library's Proxy javadoc where an example of its use is provided. For this reason, I will skip a more detailed discussion of the cglib's Proxy at this place. A Final Word of Warning After this overview of cglib's functionality, I want to speak a final word of warning. All cglib classes generate byte code which results in additional classes being stored in a special section of the JVM's memory: The so called perm space. This permanent space is, as the name suggests, used for permanent objects that do not usually get garbage collected. This is however not completely true: Once a Class is loaded, it cannot be unloaded until the loading ClassLoader becomes available for garbage collection. This is only the case the Class was loaded with a custom ClassLoader which is not a native JVM system ClassLoader. This ClassLoader can be garbage collected if itself, all Classes it ever loaded and all instances of all Classes it ever loaded become available for garbage collection. This means: If you create more and more classes throughout the life of a Java application and if you do not take care of the removal of these classes, you will sooner or later run of of perm space what will result in your application's death by the hands of an OutOfMemoryError. Therefore, use cglib sparingly. However, if you use cglib wisely and carefully, you can really do amazing things with it that go beyond what you can do with non-instrumented Java applications. Lastly, when creating projects that depend on cglib, you should be aware of the fact that the cglib project is not as well maintained and active as it should be, considering its popularity. The missing documentation is a first hint. The often messy public API a second. But then there are also broken deploys of cglib to Maven central. The mailing list reads like an archive of spam messages. And the release cycles are rather unstable. You might therefore want to have a look at javassist, the only real low-level alternative to cglib. Javassist comes bundled with a pseudo-java compiler what allows to create quite amazing byte code instrumentations without even understanding Java byte code. If you like to get your hands dirty, you might also like ASM on top of which cglib is built. ASM comes with a great documentation of both the library and Java class files and their byte code. Note that these examples only run with cglib 2.2.2 and are not compatible with the newest release 3 of cglib. Unfortunately, I experienced the newest cglib version to occasionally produce invalid byte code which is why I considered an old version and also use this version in production. Also, note that most projects using cglib move the library to their own namespace in order to avoid version conflicts with other dependencies such as for example demonstrated by the Spring project. You should do the same with your project when making use of cglib. Tools such like jarjar can help you with the automation of this good practice.
January 7, 2014
by Rafael Winterhalter
· 76,712 Views · 18 Likes
article thumbnail
Hunting for an SWT Test Framework? Say Hello to Red Deer
This is the first in a series of posts on the new “Red Deer” (https://github.com/jboss-reddeer/reddeer) open source testing framework for Eclipse. In this post, we’ll introduce Red Deer, and take a look at the some of the advantages that it offers by building a sample test program from scratch. Some of the features that Red Deer automated offers are: An easy to use, high-level API for testing standard Eclipse components Support for creating custom extensions for your own applications A requirements validation mechanism to assist you in configuring complex tests Eclipse Tooling to Assist in Creating new Projects A record and playback tool to enable you to quickly create automated tests An integration with Selenium for testing web based applications Support for running tests in a Jenkins CI environment Note that as of this writing, Red Deer is in an incubation stage. The current release is at level 0.5. The target date for the 1.0 release of Red Deer is late 2014. But, as a community-based, open source project, now is a great time to try Red Deer and make suggestions or even contribute code! A Look at Red Deer’s Architecture The Red Deer project itself is comprised of utilities and the API that supports the development and execution of automated tests. The API (the parts of the above diagram that are enclosed in dashed line boxes) can be thought of as having three layers: The top layer consists of extensions to Red Deer’s abstract classes or implementations for Eclipse components such as Views, Editors, Wizards, or Shells. For example, if you are writing tests for a feature that uses a custom Eclipse View, you can extend Red Deer’s View class by adding support for the specific functions of the feature. The advantage that this API layer gives you is that your test programs do not have to focus on manipulating the individual UI elements directly to perform operations. Your programs can instead instantiate an instance of an Eclipse component such as a View, and then use that instance’s methods to perform operations on the View. This layer of abstraction makes your test programs easier to write, understand, and maintain. The middle layer consists of the Red Deer implementations for SWT UI elements such as: Button, Combo, Label, Menu, Shell, TabItem, Table, ToolBar, Tree. This API layer supports the API’s higher level by providing the building blocks for the API’s Views, Editors, Shells, and WIzards. This middle layer of the API also provides Red Deer packages that enable your tests to enforce requirements, so that necessary setup tasks are performed before a test is run. The bottom layer consists of Red Deer packages that support the execution of tests such as: Conditions, Matchers, Widgets, Workbench, and Red Deer extensions to JUnit. What Makes Red Deer different from other Tools? A Layer of Abstraction The top-most layer of the API enables you to instantiate Eclipse UI elements as objects, and then manipulate them through their methods. The resulting code is easier to read and maintain, instead of being brittle and subject to failures when the UI changes. For example, for a test that has to open a view and press a button, without Red Deer, the test would have to navigate the top level menu, find the view menu, then the view type in that menu, then find the view open dialog, then locate the “OK” button, etc. Your test would have to spend a lot of time navigating through the UI elements before it could even begin to perform the test’s steps. With Red Deer, the code to open a view (in this case, the servers view) is simply: ServersView view = new ServersView(); view.open(); Furthermore, within that ServersView, your test program can perform operations on the View through methods which are defined in the view (and are incidentally also well debugged by the Red Deer team), instead of having to explicitly locate and manipulate the UI elements directly. For example, to obtain a list of all the servers, instead of locating the UI tree that contains the server list, and extracting that list of servers into an array, your Red Deer program can simply call the “getServers()” method. Likewise, the code to open a PackageExplorer, and then select a project within that PackageExplorer is as follows: PackageExplorer packageExplorer = new PackageExplorer(); packageExplorer.open(); packageExplorer.getProject("myTestProject").select(); And, the code to retrieve all the projects within that PackageExplorer is simply: packageExplorer.getProjects(); The result are that your tests are easier to write and maintain and you can focus on testing your application’s logic instead of writing brittle code to navigate through the application. Installing Red Deer The only prerequisites to using Red Deer are Eclipse and Java. In this post, we’ll use Eclipse Kepler and OpenJDK 1.7, running on Red Hat Enterprise Linux (RHEL) 6. To install Red Deer 0.4 (this is the latest stable milestone version as of this writing) follow these steps: Open up Eclipse Navigate to: Help->Install New Software Define a new download site using the Red Deer update site URL: http://download.jboss.org/jbosstools/updates/stable/kepler/core/reddeer/0.4.0/ Select Red Deer, click on the Finish button and Red Deer will install Now that you have Red Deer installed, let’s move onto building a new Red Deer test. Building your First Red Deer Test To create a new Red Deer test project, you make use of the Red Deer UI tooling and select New->Project->Other->Red Deer Test: Before we move on, let’s take a look at the WEB-INF/MANIFEST.MF file that is created in the project: Manifest-Version: 1.0 Bundle-ManifestVersion: 2 Bundle-Name: com.example.reddeer.sample Bundle-SymbolicName: com.example.reddeer.sample;singleton:=true Bundle-Version: 1.0.0.qualifier Bundle-ActivationPolicy: lazy Bundle-Vendor: Sample Co Bundle-RequiredExecutionEnvironment: JavaSE-1.6 Require-Bundle: org.junit, org.jboss.reddeer.junit, org.jboss.reddeer.swt, org.jboss.reddeer.eclipse The line we’re interested in is the final line in the file. These are the bundles that are required by Red Deer. After the empty project is created by the wizard, you can define a package and create a test class. Here's the code for a minimal functional test. The test will verify that the eclipse configuration is not empty. package com.example.reddeer.sample; import static org.junit.Assert.assertFalse; import java.util.List; import org.jboss.reddeer.swt.api.TreeItem; import org.jboss.reddeer.swt.impl.button.PushButton; import org.jboss.reddeer.swt.impl.menu.ShellMenu; import org.jboss.reddeer.swt.impl.tree.DefaultTree; import org.junit.Test; import org.junit.runner.RunWith; import org.jboss.reddeer.junit.runner.RedDeerSuite; @RunWith(RedDeerSuite.class) public class SimpleTest { @Test public void TestIt() { new ShellMenu("Help", "About Eclipse Platform").select(); new PushButton("Installation Details").click(); DefaultTree ConfigTree = new DefaultTree(); List ConfigItems = ConfigTree.getAllItems(); assertFalse ("The list is empty!", ConfigItems.isEmpty()); for (TreeItem item : ConfigItems) { System.out.println ("Found: " + item.getText()); } } } After you save the test's source file, you can run the test. To run the test, select the Run As->Red Deer Test option: And - there's the green bar! Simplifying Tests with Requirements Red Deer requirements enable you to define actions that you want happen before a test is executed. The advantage to using requirements is that you define the actions with annotations instead of using a @BeforeClass method. The result is that your test code is easier to read and maintain. The biggest difference between a Red Deer requirement and the the @BeforeClass annotation from the JUnit framework is that if a requirement cannot be fulfilled the test is not executed. Like everything else in Red Deer, you can make use of predefined requirements, or you can extend the feature by adding your own custom requirements. These custom requirements can be made complex and for convenience can be stored in external properties files. (We’ll take a look at defining custom requirements in a later post in this series when we examine how to create and contribute extensions to Red Deer.) The current milestone release of Red Deer provides predefined requirements that enable you to clean out your current workspace and open a perspective. Let’s add these to our example. To do this, we need to add these import statements: import org.jboss.reddeer.eclipse.ui.perspectives.JavaBrowsingPerspective; import org.jboss.reddeer.requirements.cleanworkspace.CleanWorkspaceRequirement.CleanWorkspace; import org.jboss.reddeer.requirements.openperspective.OpenPerspectiveRequirement.OpenPerspective; And these annotations: @CleanWorkspace @OpenPerspective(JavaBrowsingPerspective.class) And, we also have to a reference to org.jboss.reddeer.requirements to the required bundle list in our example’s MANIFEST.MF file: Require-Bundle: org.junit, org.jboss.reddeer.junit, org.jboss.reddeer.swt, org.jboss.reddeer.eclipse, org.jboss.reddeer.requirements When we’re done, our example looks like this: package com.example.reddeer.sample; import static org.junit.Assert.assertFalse; import java.util.List; import org.jboss.reddeer.swt.api.TreeItem; import org.jboss.reddeer.swt.impl.button.PushButton; import org.jboss.reddeer.swt.impl.menu.ShellMenu; import org.jboss.reddeer.swt.impl.tree.DefaultTree; import org.junit.Test; import org.junit.runner.RunWith; import org.jboss.reddeer.junit.runner.RedDeerSuite; import org.jboss.reddeer.eclipse.ui.perspectives.JavaBrowsingPerspective; import org.jboss.reddeer.requirements.cleanworkspace.CleanWorkspaceRequirement.CleanWorkspace; import org.jboss.reddeer.requirements.openperspective.OpenPerspectiveRequirement.OpenPerspective; @RunWith(RedDeerSuite.class) @CleanWorkspace @OpenPerspective(JavaBrowsingPerspective.class) public class SimpleTest { @Test public void TestIt() { new ShellMenu("Help", "About Eclipse Platform").select(); new PushButton("Installation Details").click(); DefaultTree ConfigTree = new DefaultTree(); List ConfigItems = ConfigTree.getAllItems(); assertFalse ("The list is empty!", ConfigItems.isEmpty()); for (TreeItem item : ConfigItems) { System.out.println ("Found: " + item.getText()); } } } Notice how we were able to add those functions to the test code, while only adding a very small amount of actual new code? Yes, it can pay to be a lazy programmer. ;-) What’s Next? What’s next for Red Deer is its continued development as it progresses through its incubation stage until its 1.0 release. What’s next for this series of posts will be discussions about: The Red Deer Recorder - To enable you to capture manual actions and convert them into test programs How you can Extend Red Deer - To provide test coverage for your plugins’ specific functions. And How you can Contribute these extensions to the Red Deer project. How you can Define Complex Requirements - To enable you to perform setup tasks for your tests. Red Deer’s Integration with Selenium - To enable you to test web interfaces provided by your plugins. Running Red Deer tests with Jenkins - To enable you to take advantage of Jenkins’ Continuous Integration (CI) test framework. Author’s Acknowledgements I’d like to thank all the contributors to Red Deer for their vision and contributions. It’s a new project, but it is growing fast! The contributors (in alphabetic order) are: Stefan Bunciak, Radim Hopp, Jaroslav Jankovic, Lucia Jelinkova, Marian Labuda, Martin Malina, Jan Niederman, Vlado Pakan, Jiri Peterka, Andrej Podhradsky, Milos Prchlik, Radoslav Rabara, Petr Suchy, and Rastislav Wagner.
January 7, 2014
by Len DiMaggio
· 7,679 Views
article thumbnail
Bulk Fetching with Hibernate
If you need to process large database result sets from Java, you can opt for JDBC to give you the low level control required. On the other hand, if you are already using an ORM in your application, falling back to JDBC might imply some extra pain. You would be losing features such as optimistic locking, caching, automatic fetching when navigating the domain model and so forth. Fortunately most ORMs, like Hibernate, have some options to help you with that. While these techniques are not new, there are a couple of possibilities to choose from. A simplified example; let's assume we have a table (mapped to class "DemoEntity") with 100.000 records. Each record consists of a single column (mapped to the property "property" in DemoEntity) holding some random alphanumerical data of about ~2KB. The JVM is ran with -Xmx250m. Let's assume that 250MB is the overall maximum memory that can be assigned to the JVM on our system. Your job is to read all records currently in the table, doing some not further specified processing, and finally store the result. We'll assume that the entities resulting from our bulk operation are not modified. To start we'll try the obvious first, performing a query to simply retrieve all data: new TransactionTemplate(txManager).execute(new TransactionCallback() { @Override public Void doInTransaction(TransactionStatus status) { Session session = sessionFactory.getCurrentSession(); List demoEntitities = (List) session.createQuery("from DemoEntity").list(); for(DemoEntity demoEntity : demoEntitities){ //Process and write result } return null; } }); After a couple of seconds: Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded Clearly this won't cut it. To fix this we will be switching to Hibernate scrollable result sets as probably most developers are aware of. The above example instructs hibernate to execute the query, map the entire results to entities and return them. When using scrollable result sets records are transformed to entities one at a time: new TransactionTemplate(txManager).execute(new TransactionCallback() { @Override public Void doInTransaction(TransactionStatus status) { Session session = sessionFactory.getCurrentSession(); ScrollableResults scrollableResults = session.createQuery("from DemoEntity").scroll(ScrollMode.FORWARD_ONLY); int count = 0; while (scrollableResults.next()) { if (++count > 0 && count % 100 == 0) { System.out.println("Fetched " + count + " entities"); } DemoEntity demoEntity = (DemoEntity) scrollableResults.get()[0]; //Process and write result } return null; } }); After running this we get: ... Fetched 49800 entities Fetched 49900 entities Fetched 50000 entities Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded Although we are using a scrollable result set, every returned object is an attached object and becomes part of the persistence context (aka session). The result is actually the same as our first example in which we used "session.createQuery("from DemoEntity").list()". However, with that approach we had no control; everything happens behind the scenes and you get a list back with all the data if hibernate has done its job. using a scrollable result set on the other hand gives us a hook into the retrieval process and allows us to free memory up when needed. As we have seen it does not free up memory automatically, you have to instruct Hibernate to actually do it. Following options exist: Evicting the object from the persistent context after processing it Clearing the entire session every now and then We will opt for the first. In the above example under line 13 (//Process and write result) we'll add: session.evict(demoEntity); Important: If you were to perform any modification to the entity (or entities it has associations with that are cascade evicted alongside), make sure to flush the session PRIOR evicting or clearing, otherwise queries hold back because of Hibernate's write behind will not be sent to the database Evicting or clearing does not remove the entities from second level cache. If you enabled second level cache and are using it and you want to remove them as well use the desired sessionFactory.getCache().evictXxx() method From the moment you evict an entity it will be no longer attached (no longer associated with a session). Any modification done to the entity at that stage will no longer be reflected to the database automatically. If you are using lazy loading, accessing any property that was not loaded prior the eviction will yield the famous org.hibernate.LazyInitializationException. So basically, make sure the processing for that entity is done (or it is at least initialized for further needs) before you evict or clear After we run the application again, we see that it now successfully executes: ... Fetched 99800 entities Fetched 99900 entities Fetched 100000 entities Btw; you can also set the query read-only allowing hibernate to perform some extra optimizations: ScrollableResults scrollableResults = session.createQuery("from DemoEntity").setReadOnly(true).scroll(ScrollMode.FORWARD_ONLY); Doing this only gives a very marginal difference in memory usage, in this specific test setup it enabled us to read about 300 entities extra with the given amount of memory. Personally I would not use this feature merely for memory optimizations alone but only if it suits in your overall immutability strategy. With hibernate you have different options to make entities read-only: on the entity itself, the overall session read-only and so forth. Setting read only false on the query individually is probably the least preferred approach. (eg. entities loaded in the session before will remain unaffected, possibly modifiable. Lazy associations will be loaded modifiable even if the root objects returned by the query are read only). Ok, we were able to process our 100.000 records, life is good. But as it turns out Hibernate has another another option for bulk operations: the stateless session. You can obtain a scrollable result set from a stateless session the same way as from a normal session. A stateless session lies directly above JDBC. Hibernate will run in nearly "all features disabled" mode. This means no persistent context, no 2nd level caching, no dirty detection, no lazy loading, basically no nothing. From the javadoc: /** * A command-oriented API for performing bulk operations against a database. * A stateless session does not implement a first-level cache nor interact with any * second-level cache, nor does it implement transactional write-behind or automatic * dirty checking, nor do operations cascade to associated instances. Collections are * ignored by a stateless session. Operations performed via a stateless session bypass * Hibernate's event model and interceptors. Stateless sessions are vulnerable to data * aliasing effects, due to the lack of a first-level cache. For certain kinds of * transactions, a stateless session may perform slightly faster than a stateful session. * * @author Gavin King */ The only thing it does is transforming records to objects. This might be an appealing alternative because it helps you getting rid of that manual evicting/flushing: new TransactionTemplate(txManager).execute(new TransactionCallback() { @Override public Void doInTransaction(TransactionStatus status) { sessionFactory.getCurrentSession().doWork(new Work() { @Override public void execute(Connection connection) throws SQLException { StatelessSession statelessSession = sessionFactory.openStatelessSession(connection); try { ScrollableResults scrollableResults = statelessSession.createQuery("from DemoEntity").scroll(ScrollMode.FORWARD_ONLY); int count = 0; while (scrollableResults.next()) { if (++count > 0 && count % 100 == 0) { System.out.println("Fetched " + count + " entities"); } DemoEntity demoEntity = (DemoEntity) scrollableResults.get()[0]; //Process and write result } } finally { statelessSession.close(); } } }); return null; } }); Besides the fact that the stateless session has the most optimal memory usage, using the it has some side effects. You might have noticed that we are opening a stateless session and closing it explicitly: there is no sessionFactory.getCurrentStatelessSession() nor (at the time of writing) any Spring integration for managing the stateless session.Opening a stateless session allocates a new java.sql.Connection by default (if you use openStatelessSession()) to perform its work and therefore indirectly spawns a second transaction. You can mitigate these side effects by using the Hibernate work API as in the example which supplies the current Connection and pass it along to openStatelessSession(Connection connection). Closing the session in the finally has no impact on the physical connection since that is captured by the Spring infrastructure: only the logical connection handle is closed and a new logical connection handle was created when opening the stateless session. Also note that you have to deal with closing the stateless session yourself and that the above example is only good for read-only operations. From the moment you are going to modify using the stateless session there are some more caveats. As said before, hibernate runs in "all feature disabled" mode and as a direct consequence entities are returned in detached state. For each entity you modify, you'll have to call: statelessSession.update(entity) explicitly. First I tried this for modifying an entity: new TransactionTemplate(txManager).execute(new TransactionCallback() { @Override public Void doInTransaction(TransactionStatus status) { sessionFactory.getCurrentSession().doWork(new Work() { @Override public void execute(Connection connection) throws SQLException { StatelessSession statelessSession = sessionFactory.openStatelessSession(connection); try { DemoEntity demoEntity = (DemoEntity) statelessSession.createQuery("from DemoEntity where id = 1").uniqueResult(); demoEntity.setProperty("test"); statelessSession.update(demoEntity); } finally { statelessSession.close(); } } }); return null; } }); The idea is that we open a stateless session with the existing database Connection. As the StatelessSession javadoc indicates that no write behind occurs, I was convinced that each statement performed by the stateless session would be sent directly to the database. Eventually when the transaction (started by the TransactionTemplate) would be committed the results would become visible in the database. However, hibernate does BATCH statements using a stateless session. I'm not 100% sure what the difference is between batching and write behind, but the result is the same and thus contra dictionary with the javadoc as statements are queued and flushed at a later time. So, if you don't do anything special, statements that are batched will not be flushed and this is what happened in my case: the "statelessSession.update(demoEntity);" was batched and never flushed. One way to force the flush is to use the hibernate transaction API: StatelessSession statelessSession = sessionFactory.openStatelessSession(); statelessSession.beginTransaction(); ... statelessSession.getTransaction().commit(); ... While this works, you probably don't want to start controlling your transactions programatically just because you are using a stateless session. Also, doing this we are again running our stateless session work in a second transaction scenario since we didn't pass along our Connection and thus a new database connection will be acquired. The reason we can't pass along the outer Connection is because if we commit the inner transaction (the "stateless session transaction") and it would be using the same connection as the outer transaction (started by the TransactionTemplate) it would break the outer transaction atomicity as statements from the outer transaction sent to database would be committed along with the inner transaction. So not passing along the connections means opening a new connection and thus creating a second transaction. A better alternative would be just to trigger Hibernate to flush the stateless session. However, statelessSession has no "flush" method to manually trigger a flush. A solution here is to depend a bit on the Hibernate internal API. This solution makes the manual transaction handling and the second transaction obsolete: all statements become part of our (one and only) outer transaction: StatelessSession statelessSession = sessionFactory.openStatelessSession(connection); try { DemoEntity demoEntity = (DemoEntity) statelessSession.createQuery("from DemoEntity where id = 1").uniqueResult(); demoEntity.setProperty("test"); statelessSession.update(demoEntity); ((TransactionContext) statelessSession).managedFlush(); } finally { statelessSession.close(); } Fortunately there is an even better solution very recently posted on the Spring jira: https://jira.springsource.org/browse/SPR-2495 This is not yet part of Spring, but the factory bean implementation is pretty straight forward: StatelessSessionFactoryBean.java when using this you could simple inject the StatelessSession: @Autowired private StatelessSession statelessSession; It will inject a stateless session proxy which is equivalent to the way the normal "current" session works (with the minor difference that you inject a SessionFactory and need to obtain the currentSession each time). When the proxy is invoked it will lookup the stateless session bound to the running transaction. If none exists already it will create one with the same connection as the normal session (like we did in the example) and register a custom transaction synchronization for the stateless session. When the transaction is committed the stateless session is flushed thanks to the synchronization and finally closed. Using this you can inject the stateless session directly and use it as a current session (or the same way as you would inject a JPA PeristentContext for that matter). This relieves you from dealing with the opening and closing of the stateless session and having to deal with one way or the other to make it flush. The implementation is JPA aimed, but the JPA part is limited to obtaining the physical connection in obtainPhysicalConnection(). You can easily leave out the EntityManagerFactory and get the physical connection directly from the Hibernate session. Very careful conclusion: it is clear that the best approach will depend on your situation. If you use the normal session you will have to deal with eviction yourself when reading or persisting entities. Besides the fact you have to do this manually, it might also impact further use of the session if you have a mixed transaction; you both perform 'bulk' and 'normal' operations in the same transaction. If you continue with the normal operations you will have detached entities in your session which might lead to unexpected results (as dirty detection will no longer work and so forth). On the other hand you will still have the major hibernate benefits (as long as the entity isn't evicted) such as lazy loading, caching, dirty detection and the likes. Using the stateless session at the time of writing requires some extra attention on managing it (opening, closing and flushing) which can also be error prone. In the assumption you can proceed with the proposed factory bean, you have a very bare bone session which is separately from your normal session but still participating in the same transaction. With this you have a powerful tool to perform bulk operations without having to think about memory management. The downside is that you don't have any other hibernate functionality available.
January 6, 2014
by Koen Serneels
· 90,639 Views · 14 Likes
article thumbnail
Top Posts of 2013: Google's Big Data Papers
I’ll review Google’s most important Big Data publications and discuss where they are (as far as they’ve disclosed).
December 30, 2013
by Mikio Braun
· 117,026 Views
article thumbnail
Java: Using the Specification Pattern With JPA
This article is an introduction to using the specification pattern in Java. We also will see how we can combine classic specifications with JPA Criteria queries to retrieve objects from a relational database. Within this post we will use the following Poll class as an example entity for creating specifications. It represents a poll that has a start and end date. In the time between those two dates users can vote among different choices. A poll can also be locked by an administrator before the end date has been reached. In this case, a lock date will be set. @Entity public class Poll { @Id @GeneratedValue private long id; private DateTime startDate; private DateTime endDate; private DateTime lockDate; @OneToMany(cascade = CascadeType.ALL) private List votes = new ArrayList<>(); } For better readability I skipped getters, setters, JPA annotations for mapping Joda DateTime instances and fields that aren't needed in this example (like the question being asked in the poll). Now assume we have two constraints we want to implement: A poll is currently running if it is not locked and if startDate < now < endDate A poll is popular if it contains more than 100 votes and is not locked We could start by adding appropriate methods to Poll like: poll.isCurrentlyRunning(). Alternatively we could use a service method like pollService.isCurrentlyRunning(poll). However, we also want to be able to query the database to get all currently running polls. So we might add a DAO or repository method like pollRepository.findAllCurrentlyRunningPolls(). If we follow this way we implement the isCurrentlyRunning constraint two times in two different locations. Things become worse if we want to combine constraints. What if we want to query the database for a list of all popular polls that are currently running? This is where the specification pattern come in handy. When using the specification pattern we move business rules into extra classes called specifications. To get started with specifications we create a simple interface and an abstract class: public interface Specification { boolean isSatisfiedBy(T t); Predicate toPredicate(Root root, CriteriaBuilder cb); Class getType(); } abstract public class AbstractSpecification implements Specification { @Override public boolean isSatisfiedBy(T t) { throw new NotImplementedException(); } @Override public Predicate toPredicate(Root poll, CriteriaBuilder cb) { throw new NotImplementedException(); } @Override public Class getType() { ParameterizedType type = (ParameterizedType) this.getClass().getGenericSuperclass(); return (Class) type.getActualTypeArguments()[0]; } } Please ignore the AbstractSpecification class with the mysterious getType() method for a moment (we come back to it later). The central part of a specification is the isSatisfiedBy() method, which is used to check if an object satisfies the specification. toPredicate() is an additional method we use in this example to return the constraint as javax.persistence.criteria.Predicate instance which can be used to query a database. For each constraint we create a new specification class that extends AbstractSpecification and implements isSatisfiedBy() and toPredicate(). The specification implementation to check if a poll is currently running looks like this: public class IsCurrentlyRunning extends AbstractSpecification { @Override public boolean isSatisfiedBy(Poll poll) { return poll.getStartDate().isBeforeNow() && poll.getEndDate().isAfterNow() && poll.getLockDate() == null; } @Override public Predicate toPredicate(Root poll, CriteriaBuilder cb) { DateTime now = new DateTime(); return cb.and( cb.lessThan(poll.get(Poll_.startDate), now), cb.greaterThan(poll.get(Poll_.endDate), now), cb.isNull(poll.get(Poll_.lockDate)) ); } } Within isSatisfiedBy() we check if the passed object matches the constraint. In toPredicate() we construct a Predicate using JPA's CriteriaBuilder. We will use the resulting Predicate instance later to build a CriteriaQuery for querying the database. The specification for checking if a poll is popular looks similar: public class IsPopular extends AbstractSpecification { @Override public boolean isSatisfiedBy(Poll poll) { return poll.getLockDate() == null && poll.getVotes().size() > 100; } @Override public Predicate toPredicate(Root poll, CriteriaBuilder cb) { return cb.and( cb.isNull(poll.get(Poll_.lockDate)), cb.greaterThan(cb.size(poll.get(Poll_.votes)), 5) ); } } If we now want to test if a Poll instance matches one of these constraints we can use our newly created specifications: boolean isPopular = new IsPopular().isSatisfiedBy(poll); boolean isCurrentlyRunning = new IsCurrentlyRunning().isSatisfiedBy(poll); For querying the database we need to extend our DAO / repository to support specifications. This can look like the following: public class PollRepository { private EntityManager entityManager = ... public List findAllBySpecification(Specification specification) { CriteriaBuilder criteriaBuilder = entityManager.getCriteriaBuilder(); // use specification.getType() to create a Root instance CriteriaQuery criteriaQuery = criteriaBuilder.createQuery(specification.getType()); Root root = criteriaQuery.from(specification.getType()); // get predicate from specification Predicate predicate = specification.toPredicate(root, criteriaBuilder); // set predicate and execute query criteriaQuery.where(predicate); return entityManager.createQuery(criteriaQuery).getResultList(); } } Here we finally use the getType() method implemented in AbstractSpecification to createCriteriaQuery and Root instances. getType() returns the generic type of theAbstractSpecification instance defined by the subclass. For IsPopular andIsCurrentlyRunning it returns the Poll class. Without getType() we would have to create theCriteriaQuery and Root instances inside toPredicate() of every specification we create. So it is just a small helper to reduce boiler plate code inside specifications. Feel free to replace this with your own implementation if you come up with better approaches. Now we can use our repository to query the database for polls that match a certain specification: List popularPolls = pollRepository.findAllBySpecification(new IsPopular()); List currentlyRunningPolls = pollRepository.findAllBySpecification(new IsCurrentlyRunning()); At this point the specifications are the only components that contain the constraint definitions. We can use it to query the database or to check if an object fulfills the required rules. However one question remains: How do we combine two or more constraints? For example we would like to query the database for all popular polls that are still running. The answer to this is a variation of the composite design pattern called composite specifications. Using a composite specification we can combine specifications in different ways. To query the database for all running and popular pools we need to combine the isCurrentlyRunning with theisPopular specification using the logical and operation. Let's create another specification for this. We name itAndSpecification: public class AndSpecification extends AbstractSpecification { private Specification first; private Specification second; public AndSpecification(Specification first, Specification second) { this.first = first; this.second = second; } @Override public boolean isSatisfiedBy(T t) { return first.isSatisfiedBy(t) && second.isSatisfiedBy(t); } @Override public Predicate toPredicate(Root root, CriteriaBuilder cb) { return cb.and( first.toPredicate(root, cb), second.toPredicate(root, cb) ); } @Override public Class getType() { return first.getType(); } } An AndSpecification is created out of two other specifications. In isSatisfiedBy() and toPredicate()we return the result of both specifications combined by a logical and operation. We can use our new specification like this: Specification popularAndRunning = new AndSpecification<>(new IsPopular(), new IsCurrentlyRunning()); List polls = myRepository.findAllBySpecification(popularAndRunning); To improve readability we can add an and() method to the Specification interface: public interface Specification { Specification and(Specification other); // other methods } and implement it within our abstract implementation: abstract public class AbstractSpecification implements Specification { @Override public Specification and(Specification other) { return new AndSpecification<>(this, other); } // other methods } Now we can chain multiple specification by using the and() method: Specification popularAndRunning = new IsPopular().and(new IsCurrentlyRunning()); boolean isPopularAndRunning = popularAndRunning.isSatisfiedBy(poll); List polls = myRepository.findAllBySpecification(popularAndRunning); When needed we can easily extend this further with other composite specifications (for exampleOrSpecification or NotSpecification). Conclusion When using the specification pattern we move business rules in separate specification classes. These specification classes can be easily combined by using composite specifications. In general, specification improve reusability and maintainability. Additionally specifications can easily be unit tested. For more detailed information about the specification pattern I recommend this article by Eric Evans and Martin Fowler. You can find the source of this example project on GitHub.
December 30, 2013
by Michael Scharhag
· 116,586 Views · 8 Likes
article thumbnail
Extracting Tables from PDFs in Javascript with PDF.js
a common and difficult problem acquiring data is extracting tables from a pdf. previously, i described how to extract the text from a pdf with pdf.js , a pdf rendering library made by mozilla labs. the rendering process requires an html canvas object, and then draws each object (character, line, rectangle, etc) on it. the easiest way to get a list of these is to to intercept all the calls pdf.js makes to drawing functions on the canvas object. (see “ self modifying javascripts ” for a similar technique). the “set” method below adds a wrapper closure to each function, which logs the call. function replace(ctx, key) { var val = ctx[key]; if (typeof(val) == "function") { ctx[key] = function() { var args = array.prototype.slice.call(arguments); console.log("called " + key + "(" + args.join(",") + ")"); return val.apply(ctx, args); } } } for (var k in context) { replace(context, k); } var rendercontext = { canvascontext: context, viewport: viewport }; page.render(rendercontext); this lets us see a series of calls: called transform(1,0,0,1,150.42,539.67) called translate(0,0) called scale(1,-1) called scale(0.752625,0.752625) called measuretext(c) called save() called scale(0.9701818181818181,1) called filltext(c,0,0) called restore() called restore() called save() called transform(1,0,0,1,150.42,539.6 we can easily retrieve the text by noting the first argument to each “filltext” call: "congregations ranked by growth and decline in membership and worship attendance, 2006 to 2011philadelphia presbytery - table 16net membership changenet worship changepercent changepercent changeworship 2006worship 2011membership 2006membership 2011abington, abington- 143(74)-13.18%(57)0(15)0.00%(22)numberrank3003001,085942anchor, wrightstown0(23)0.00%(27)-12(25)-21.43%(52)numberrank56449797arch street, philadelphia-117(71)-68.42%(117)27(5)90.00% (2)numberrank305717154aston, aston3(21)3.53%(22)-5(19)-9.43% (31)numberrank53488588beaconno reportboth yearsno reportboth yearsnumberrankbensalem, bensalem-23(39)-13.94%(62)-28(36)-28.57% (64)numberrank9870165142berean, philadelphia106(4)44.92%(4)no reportboth yearsnumberrank00236342bethany collegiate, havertown- 188(76)-42.44%(110)43(3)21.29%(7)numberrank202245443255bethel, philadelphia-13(33)-13.68%(60)-27(35)-35.06% (71)numberrank77509582bethesda, philadelphia9(18)5.56%(18)no reportboth yearsnumberrank1150162171beverly hills, upper darby-3(26)-3.03% (32)-11(24)-20.00%(48)numberrank55449996bridesburg, philadelphia0(23)0.00%(27)no reportboth yearsnumberrank004444bristol, bristolno reportboth yearsno reportboth yearsnumberrankpage 1 of 10report prepared by research services, presbyterian church (u.s.a.)1- 800-728-7228, ext #204006-oct-12" notable, this doesn’t track line endings, and not all the characters are recorded in the expected order (the first line is rendered after the second). the calls to transform, translate, and scale control where text is placed. the filltext method also takes an (x, y) parameter set that moves the individual letters between words. the exact position is a combination of successive operations, which are modeled as a stack of matrix operations. thankfully, pdf.js tracks the output of these operations as it renders, so we don’t have to recalculate it. thus, we can make a method that records the letters and their real positions. this method takes the internal context object, the type of state transition, and the arguments to the transition. this method is then called from the ‘record’ function listed above. var chars = []; var cur = {}; function record(ctx, state, args) { if (state === 'filltext') { var c = args[0]; cur.c = c; cur.x = ctx._transformmatrix[4] + args[1]; cur.y = ctx._transformmatrix[5] + args[2]; chars[chars.length] = cur; cur = {}; } } these results can be sorted by position (x and y). the sort method arranges letters by position – if they are shifted up or down a small amount, they are considered to be on one line. chars.sort( function(a, b) { var dx = b.x - a.x; var dy = b.y - a.y; if (math.abs(dy) < 0.5) { return dx * -1; } else { return dy * -1; } } ); this presents several difficulties: this doesn’t detect right-to-left text, and it’s becoming clear that we’re going to have a hard time knowing when you’re in a table and when we aren’t. to do this, we define a function which can transform the array of letters and positions into a csv style output. this tracks from letter to letter – if it sees a “large” change in y, it makes a new line. if it sees a “large” change in x, it treats it as a new column. the real challenge is defining “large” which for my test pdf were around 15 and 20, for dx and dy. function gettext(marks, ex, ey, v) { var x = marks[0].x; var y = marks[0].y; var txt = ''; for (var i = 0; i < marks.length; i++) { var c = marks[i]; var dx = c.x - x; var dy = c.y - y; if (math.abs(dy) > ey) { txt += "\"\n\""; if (marks[i+1]) { // line feed - start from position of next line x = marks[i+1].x; } } if (math.abs(dx) > ex) { txt += "\",\""; } if (v) { console.log(dx + ", " + dy); } txt += c.c; x = c.x; y = c.y; } return txt; } this algorithm doesn’t handle newlines in rows, and oddly, the columns don’t come out in the right order, but they appear to be consistently out of order. line with large spaces (e.g. an em-dash) are detected as having multiple columns, but this can be cleaned up later – here is some sample output. you can see an example below, and the final source is available on github . congregations ranked by growth and decline in m","embership and w","orship attendance, 2006 to 2011" "","philadelphia presbytery"," - table 16" "","net ","membership ","change" "","net worship ","change","percent ","change","percent ","change","worship"," 2006","worship"," 2011","membership"," 2006","membership"," 2011" "","abington, abington","-143","(74)","-13.18%(57)","0","(15)","0.00%(22)","number","rank","300","300","1,085","942" "","anchor, wrightstown","0","(23)","0.00%(27)","-12","(25)","-21.43%(52)","number","rank","56","44","97","97" "","arch street, philadelphia","-117","(71)","-68.42%","(117)","27(5)","90.00%(2)","number","rank","30","57","171","54" "","aston, aston","3","(21)","3.53%(22)","-5","(19)","-9.43%(31)","number","rank","53","48","85","88" "","beacon","no report","both years","no report","both years","number","rank" "","bensalem, bensalem","-23","(39)","-13.94%(62)","-28","(36)","-28.57%(64)","number","rank","98","70","165","142" "","berean, philadelphia","106(4)","44.92%(4)","no report","both years","number","rank","0","0","236","342" "","bethany collegiate, havertown","-188","(76)","-42.44%","(110)","43(3)","21.29%(7)","number","rank","202","245","443","255" "","bethel, philadelphia","-13","(33)","-13.68%(60)","-27","(35)","-35.06%(71)","number","rank","77","50","95","82" "","bethesda, philadelphia","9","(18)","5.56%(18)","no report","both years","number","rank","115","0","162","171" "","beverly hills, upper darby","-3","(26)","-3.03%(32)","-11","(24)","-20.00%(48)","number","rank","55","44","99","96" "","bridesburg, philadelphia","0","(23)","0.00%(27)","no report","both years","number","rank","0","0","44","44" "","bristol, bristol","no report","both years","no report","both years","number","rank" "","page 1 of 10","report prepared by research services, presbyterian church (u.s.a.)","1-800-728-7228, ext #2040","06-oct-12"
December 26, 2013
by Gary Sieling
· 21,405 Views
article thumbnail
Storing Objects in Android
One alternative to using SQLite on Android is to store Java objects in SharedPreferences.
December 19, 2013
by Tony Siciliani
· 47,622 Views · 1 Like
article thumbnail
Handling Big Data with HBase Part 4: The Java API
Editor's note: Be sure to check out part 2 as well. This is the fourth of an introductory series of blogs on Apache HBase. In the third part, we saw a high level view of HBase architecture . In this part, we'll use the HBase Java API to create tables, insert new data, and retrieve data by row key. We'll also see how to setup a basic table scan which restricts the columns retrieved and also uses a filter to page the results. Having just learned about HBase high-level architecture, now let's look at the Java client API since it is the way your applications interact with HBase. As mentioned earlier you can also interact with HBase via several flavors of RPC technologies like Apache Thrift plus a REST gateway, but we're going to concentrate on the native Java API. The client APIs provide both DDL (data definition language) and DML (data manipulation language) semantics very much like what you find in SQL for relational databases. Suppose we are going to store information about people in HBase, and we want to start by creating a new table. The following listing shows how to create a new table using the HBaseAdmin class. Configuration conf = HBaseConfiguration.create(); HBaseAdmin admin = new HBaseAdmin(conf); HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("people")); tableDescriptor.addFamily(new HColumnDescriptor("personal")); tableDescriptor.addFamily(new HColumnDescriptor("contactinfo")); tableDescriptor.addFamily(new HColumnDescriptor("creditcard")); admin.createTable(tableDescriptor); The people table defined in preceding listing contains three column families: personal, contactinfo, and creditcard. To create a table you create an HTableDescriptor and add one or more column families by adding HColumnDescriptor objects. You then call createTable to create the table. Now we have a table, so let's add some data. The next listing shows how to use the Put class to insert data on John Doe, specifically his name and email address (omitting proper error handling for brevity). Configuration conf = HBaseConfiguration.create(); HTable table = new HTable(conf, "people"); Put put = new Put(Bytes.toBytes("doe-john-m-12345")); put.add(Bytes.toBytes("personal"), Bytes.toBytes("givenName"), Bytes.toBytes("John")); put.add(Bytes.toBytes("personal"), Bytes.toBytes("mi"), Bytes.toBytes("M")); put.add(Bytes.toBytes("personal"), Bytes.toBytes("surame"), Bytes.toBytes("Doe")); put.add(Bytes.toBytes("contactinfo"), Bytes.toBytes("email"), Bytes.toBytes("[email protected]")); table.put(put); table.flushCommits(); table.close(); In the above listing we instantiate a Put providing the unique row key to the constructor. We then add values, which must include the column family, column qualifier, and the value all as byte arrays. As you probably noticed, the HBase API's utility Bytes class is used a lot; it provides methods to convert to and from byte[] for primitive types and strings. (Adding a static import for the toBytes() method would cut out a lot of boilerplate code.) We then put the data into the table, flush the commits to ensure locally buffered changes take effect, and finally close the table. Updating data is also done via the Put class in exactly the same manner as just shown in the prior listing. Unlike relational databases in which updates must update entire rows even if only one column changed, if you only need to update a single column then that's all you specify in the Put and HBase will only update that column. There is also a checkAndPut operation which is essentially a form of optimistic concurrency control - the operation will only put the new data if the current values are what the client says they should be. Retrieving the row we just created is accomplished using the Get class, as shown in the next listing. (From this point forward, listings will omit the boilerplate code to create a configuration, instantiate the HTable, and the flush and close calls.) Get get = new Get(Bytes.toBytes("doe-john-m-12345")); get.addFamily(Bytes.toBytes("personal")); get.setMaxVersions(3); Result result = table.get(get); The code in the previous listing instantiates a Get instance supplying the row key we want to find. Next we use addFamily to instruct HBase that we only need data from the personal column family, which also cuts down the amount of work HBase must do when reading information from disk. We also specify that we'd like up to three versions of each column in our result, perhaps so we can list historical values of each column. Finally, calling get returns a Result instance which can then be used to inspect all the column values returned. In many cases you need to find more than one row. HBase lets you do this by scanning rows, as shown in the second part which showed using a scan in the HBase shell session. The corresponding class is the Scan class. You can specify various options, such as the start and ending row key to scan, which columns and column families to include and the maximum versions to retrieve. You can also add filters, which allow you to implement custom filtering logic to further restrict which rows and columns are returned. A common use case for filters is pagination. For example, we might want to scan through all people whose last name is Smith one page (e.g. 25 people) at a time. The next listing shows how to perform a basic scan. Scan scan = new Scan(Bytes.toBytes("smith-")); scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("givenName")); scan.addColumn(Bytes.toBytes("contactinfo"), Bytes.toBytes("email")); scan.setFilter(new PageFilter(25)); ResultScanner scanner = table.getScanner(scan); for (Result result : scanner) { // ... } In the above listing we create a new Scan that starts from the row key smith- and we then use addColumn to restrict the columns returned (thus reducing the amount of disk transfer HBase must perform) to personal:givenName and contactinfo:email. A PageFilter is set on the scan to limit the number of rows scanned to 25. (An alternative to using the page filter would be to specify a stop row key when constructing the Scan.) We then get a ResultScanner for the Scan just created, and loop through the results performing whatever actions are necessary. Since the only method in HBase to retrieve multiple rows of data is scanning by sorted row keys, how you design the row key values is very important. We'll come back to this topic later. You can also delete data in HBase using the Delete class, analogous to the Put class to delete all columns in a row (thus deleting the row itself), delete column families, delete columns, or some combination of those. Connection Handling In the above examples not much attention was paid to connection handling and RPCs (remote procedure calls). HBase provides the HConnection class which provides functionality similar to connection pool classes to share connections, for example you use the getTable() method to get a reference to an HTable instance. There is also an HConnectionManager class which is how you get instances of HConnection. Similar to avoiding network round trips in web applications, effectively managing the number of RPCs and amount of data returned when using HBase is important, and something to consider when writing HBase applications. Conclusion to Part 4 In this part we used the HBase Java API to create a people table, insert a new person, and find the newly inserted person information. We also used the Scan class to scan the people table for people with last name "Smith" and showed how to restrict the data retrieved and finally how to use a filter to limit the number of results. In the next part, we'll learn how to deal with the absence of SQL and relations when modeling schemas in HBase. References HBase web site, http://hbase.apache.org/ HBase wiki, http://wiki.apache.org/hadoop/Hbase HBase Reference Guide http://hbase.apache.org/book/book.html HBase: The Definitive Guide, http://bit.ly/hbase-definitive-guide Google Bigtable Paper, http://labs.google.com/papers/bigtable.html Hadoop web site, http://hadoop.apache.org/ Hadoop: The Definitive Guide, http://bit.ly/hadoop-definitive-guide Fallacies of Distributed Computing, http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing HBase lightning talk slides, http://www.slideshare.net/scottleber/hbase-lightningtalk Sample code, https://github.com/sleberknight/basic-hbase-examples
December 18, 2013
by Scott Leberknight
· 56,783 Views · 3 Likes
article thumbnail
Top 24 Java-Based Content Management Systems
CMS, or content management systems, are platforms for managing and administering website content. There is no denying that CMSes are important in today's web ecosystem. These content management systems not only provide an easy way to build and maintain websites, but they also lend a helping hand in updating and editing website content without the need to spend hours or days writing and altering codes and scripts. Some of the leading CMSes are PHP-based, Ruby on Rails-based, ASP.NET-based, and Java-based. Among these, due to scalability, modernized architecture and open-source standards of a few, Java-based CMSs are getting quite a lot of attention lately, especially for enterprise websites, because of the scalable, modern, open source technology behind most of them. There are plenty of CMS tools based on Java to help developers create multi-lingual and multi-channel websites. But how do we decide on the best one for our use case? In this article, we’re going to explore the top 24 content management systems based on Java. Let’s have a look at each of them in detail: 1. Alfresco : Alfresco is one of the top open-source content management systems of Java. It comes with enterprise repository and portlet capabilities along with document management, collaboration, records management, knowledge management, web content management, imaging, and a lot more. Alfresco has a modular architecture and enables end users to efficiently manage websites across the cloud, mobile, hybrid and on-premise environments using open source Java technologies, such as Spring, Hibernate, Lucene and JSF. 2. Magnolia : Magnolia is a well-documented, easy to use, enterprise-grade open source CMS based on the Java Content Repository Standard. It is a highly popular CMS due to its out-of-the-box functionality and ease of use under an open source license. Moreover, Magnolia supports unique content delivery capabilities in a search-engine optimized manner and also follows W3C standards. Magnolia CMS has been deployed by enterprises and governments in more than 100 countries across the world. Here's a case study on Magnolia-based website development 3. LogicalDOC : Though less known than other software such as Alfresco, LogicalDOC is emerging as a powerful and more affordable alternative. With primary focus on Document Management, it offers very interesting content management, knowledge management and collaboration features, and all this in a really efficient way. A peculiar aspect of the interface is the use of Google GWT , this makes the user interface very responsive while the data transfer with the server is minimum. Also the availability of Free Apps for Android and Apple devices (iPhone and iPad) is an interesting feature. 4. Asbru: Asbru is another powerful, fully-featured, easy to use content management system with database-driven capabilities. It is built on the Spring framework with integrated community, databases, eCommerce and statistics modules, which helps developers to create, publish and manage rich and user-friendly internet, extranet and intranet websites on the go. Available in various editions, Asbru provides users with a simple, user-friendly platform to manage websites along with a host of other benefits and features such as custom templates and data, password protected content, multi-lingual content, communities, eCommerce and website analytics, a cutting-edge WYSIWYG content editor and a lot more. 5. OpenCMS : OpenCMS is based on Java and XML technology that allows you to build highly customizable and interactive websites and portals. It comes integrated with a WYSIWYG editor and fully-featured Template Engine which is fully compliant with W3C standards. OpenCMS can be deployed both in an open-source environment (Linux, Apache, Tomcat, MySQL) as well as a commercial environment (Windows NT, IIS, BEA Weblogic, Oracle) 6. Walrus: Walrus is yet another Spring-based CMS that provides unique and effective content management capabilities with a smart administrative interface and drag-and-drop facilities. Easy-to-setup and undo/redo features make Walrus a highly preferred and suitable CMS for government and non-profit enterprises. 7. Pulse : Pulse is a Java-based framework and portal solution that offers easy-to-use and extensible patterns for creating rich browser web applications and responsive websites. It brings a bunch of innovative and powerful components including content management, web shops, user management and more. A few of its key features include a WebDAV based virtual file system for digital asset management, mature user and role management, built-in internationalization, and more. 8. MeshCMS: MeshCMS is an easy to use online editing system written in Java. It comes with a host of features that you will find in any ideal content management system however, it uses a conventional approach in managing and editing website content. It is considered one of the fastest CMSes for editing files online, managing files, and building some very common components like menus, breadcrumbs, mail forms and so on. MeshCMS is accompanied by cross-browser capabilities, a WYSIWYG editor, hot-linking prevention, and tag-library that makes content management an interesting affair. 9. Liferay: Liferay is one of the most popular CMSes based on Java, and is recommended by many industry experts. It comes with awesome features that can make your content management tasks simple. Liferay is a very popular for developing personal as well as professional websites with ease. 10. DotCMS: DotCMS is a next-gen enterprise CMS that wears an open-source hat. It is highly popular and widely used CMS due to its open APIs, extensible and scalable architecture that it used to create personalized and engaging websites, intranets, extranets and applications with ease. 11. Jease: Jease highly known as ‘Java with ease’ is another open source content management system that is built on popular Java technologies like db40, Perst, Lucence and ZK. It is an extremely lightweight CMS with excellent Ajax interface. Due to its intuitive and interactive interface, it is highly simple and easy to customize and deploy websites in Jease even for inexperienced Java developers. 12. Hippo: Hippo is again a powerful open-source CMS made in Java that features enterprise level capabilities that helps in delivering personalized websites and channels. Hippo outlines its competitor by delivering outstanding customer experience through innovative solutions. Hippo has come a long way since 1999 serving medium to large organizations by offering a personalized multichannel content distribution platform including website, mobile, tablet, extranets and intranets. Its major version update was in December 2012 and since then it is seeing minor updates every couple of months. 13. Apache Lenya: Apache Lenya is another open-source Java CMS that features revision control , multisite management, scheduling, search, WYSIWYG editors, and workflow which makes website development and management quite interesting and easy for developers. Available in a variety of languages, Apache Lenya is highly preferred CMS among enterprises that desire to develop multi-lingual websites. 14. Contelligent: Conteligent is another smart CMS solution offered under Java technology stack. It is fully compliant with J2EE and offers great solution for creating and managing personalized websites. 15. InfoGlue: InfoGlue again is a Java-based CMS that is known for its advanced, scalable and robust open-source architecture. It is a highly flexible CMS built on JSR-168 and comes with full multi-language support, excellent information reuse and high integration capabilities. 16. OpenEdit: OpenEdit CMS is a dynamic tool for managing website content with online editing capabilities. Built in open-source architecture, OpenEdit provides facilities like user manager, file manager, version control and notification tools for managing media-rich websites. OpenCMS features enterprise grade plugins such as eCommerce, Content Management, Blog, Events Calendar, Social Networking Tools and more. 17. AtLeap: Atleap is a multi-lingual CMS based on Java which offers amazing content delivery assistance with SEO and full text search functionalities. AtLeap, a product of Blandware, is not only a CMS but a highly robust framework for developing website and web applications 18. Weceem: Weceem is yet another open source content management system, unlikely other CMS it is built upon well-known Java framework grails, spring and Java itself. Weceem has garner positive reviews and is an ideal CMS when it comes to grails, but faces tough competition in best Java CMS category. I came across a LinkedIn discussion which was enough for me to put this CMS in the Best Java CMS list. 19. Nuxeo: Nuxeno is a powerful open source CMS built on Java-based architecture. It offers solutions related to document management, case management and digital asset management. It is free from licensing free but do costs you when reach out for support and maintenance help. It has strong groups of customers including Electronic Arts, U.S. Navy and as stated on the company website, it’s been used in over 145 countries across thousands of organizations. 20. XperienCentral: Xperien central is currently the only CMS that offers unique content to a visitor as per his earlier journey, so you can tailor the content to increase the conversion. It offers multi-channel content delivery across website, mobile social media channels and applications. It is built on Java and hence it is extremely scalable and agile. 21 Atex: Atex is a web CMS that uses polopoly technology to deliver content. As per claims, it is the only industry leading CMS with built in paywall. Atex again is one of the premium CMS that offers amazing solutions for managing websites and helps marketers deliver the right content to relevant audiences. It has rich set of clientele. 22 Escenic: Customers of escenic include News of the World, The Sun, The Times, the Independent titles. It’s a closed source Java framework. Both Atex and Escenic are found to be highly popular in Sweden. Some of the biggest sites in Sweden use both these CMS. idg.se uses Atex and Aftonbladet.se uses Escenic 23. Adobe Experience Manager/ CQ5 : Best CMS list cannot be completed without including adobe experience manager. It is an all-round CMS which offers all kinds of agility and flexibility an organization may want. It helps deliver unique customer experience by delivering different content on different channels. Adobe Experience Manager was recently named a leader in web content management by Garnters magic quadrant. Earlier it was known as CQ5 but later was acquire by Adobe in 2010 24. SDL Tridion Again a well known CMS and highly recommended by industry experts. Its simple intuitive UI makes it simple to manage the content and deliver it uniformly across all channels. It recently received top score in overall content management experience according to an independent research firm - Forrester Research, Inc., This completes the list 23 top Java-based content management system. Hope after reading about all the CMS, you have got enough inferences and insight as to which CMS would be best for your website development project.
December 9, 2013
by Boni Satani
· 328,505 Views · 4 Likes
  • Previous
  • ...
  • 500
  • 501
  • 502
  • 503
  • 504
  • 505
  • 506
  • 507
  • 508
  • 509
  • ...
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×