Data Engineering Resources

The Latest Data Engineering Topics

In my daily work, I use both an RDBMS and MarkLogic, an XML database. MarkLogic can be considered akin to the newer NoSQL databases, but it has the added structure of XML and standard languages in XQuery and XPath. The NoSQL databases are typically storing documents or key-value pairs, and some other things in between. Given that any datastore will be searched at some point, you will always care how the data is actually stored or whether there is some way to query it easily. Once you start thinking about the problem, you quickly generalize to the “how do I persist any type of data” question. However, my focus is not going to be the comparison of the various data stores, but the comparison of how data is stored. More specifically, I want to show the object serialization, mainly the Java built in method, as a data persistence format is evil. Given what you normally read on this blog, this may seem like an oddly timed post, but I have run into serialization issues lately in some production code and Mark Needham recently wrote an interesting post about this as well. Coincidentally, Mark is also working with MarkLogic, and there is an interesting item in his post: The advantage of doing things this way [using lightweight wrappers] is that it means we have less code to write than we would with the serialisation/deserialisation approach although it does mean that we’re strongly coupled to the data format that our storage mechanism uses. However, since this is one bit of the architecture which is not going to change it seems to makes sense to accept the leakage of that layer. The interesting part of this is that he has accepted using the data format of the storage mechanism, XML in MarkLogic in this case. Why is this interesting? First, it is a move away from the ORM technologies that try to hide the complexities of converting data into objects in the RDBMS world. 
Also, this is a glimpse into the types of issues that could arise from non-RDBMS storage choices as well as how to persist objects in general. An RDBMS is typically used to map object attributes to a table and columns. The mapping is mostly straightforward, with some defined relationship for child objects and collections. This is a well-known area, called Object-Relational Mapping (ORM), and several open source and commercial options exist. In this scenario, object attributes are stored in a similar datatype, meaning a String is stored as a varchar and an int is stored as an integer. But what happens when you move away from an RDBMS for data persistence? If you look at Java and its session objects, pure object serialization is used. Assuming that an application session is fairly short-lived, meaning at most a few hours, object serialization is simple, well supported and built into the Java concept of a session. However, when the data is persisted over a longer period of time, possibly days or weeks, and you have to worry about new releases of the application, serialization quickly becomes evil. As any good Java developer knows, if you plan to serialize an object, even in a session, you need a real serialization ID (serialVersionUID), not just a 1L, and you need to implement the Serializable interface. However, most developers do not know the real rules behind the Java deserialization process. If your object has changed in more ways than just the addition of simple fields, it is possible that Java cannot deserialize the object correctly even if the serialization ID has not changed. Suddenly, you cannot retrieve your data any longer, which is inherently bad. Now, many developers reading this may say that they would never write code that would have this problem. That may be true, but what about a library that you use, or code written by some other developer no longer employed by your company? Can you guarantee that this problem will never happen?
The only way to guarantee that is to use a different serialization method. What options do we have? Obviously, there are the NoSQL datastores but the actual object format is the relevant question not which solution to choose. Besides the obvious serialized object, some NoSQL datastores use JSON to store objects, MarkLogic uses XML and there are others that store just key-value pairs. Key-value pairs are typically a mapping of a text key to a value that is a serialized object, either a binary or textual format. So, that leaves us with XML, JSON and other textual formats. One of the benefits of a structured format like XML or JSON is that they can be made searchable and provide some level of context. I have talked about data formats before, so I won’t go into a comparison again. However, do these types of formats avoid the issues that native Java object serialization has? This is really dependent upon what library you are using for serialization. Some libraries will deserialize an object without any issues regardless of whether the object field list has changed. Other libraries could have problems depending upon whether a serialized field exists in the target object, or there might not be solid support for collections (though that is doubtful at this point). Given that even structured formats could have serialization issues, is the only safe path hand-coded mappings like those used by ORM tools? Some JSON and XML serialization tools use the same mapping methods as the ORM tools in order to avoid these problems. However, once you define these mappings, you are explicitly stating how an object gets translated. This explicit definition will require maintenance, but that is definitely cleaner than trying to trace down a serialization defect in some random stack trace. So is implicit object serialization really worth the potential headaches? Or should we just consider it evil and never speak of it again? From http://regulargeek.com/2011/07/06/is-object-serialization-evil/
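To make the idea of an explicit mapping concrete, here is a minimal Java sketch. It is only an illustration under stated assumptions: the Customer class, its field names, and the choice of a plain string map as the stored form are hypothetical and do not come from the post or from any particular library. The point it tries to show is that a hand-written mapping decides, field by field, what happens when stored data is older or newer than the class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExplicitMappingSketch {

    // Hypothetical domain object, used only for illustration.
    public static final class Customer {
        public final String name;
        public final int loyaltyPoints;

        public Customer(String name, int loyaltyPoints) {
            this.name = name;
            this.loyaltyPoints = loyaltyPoints;
        }
    }

    // Explicit "serialization": every stored field is named by hand.
    public static Map<String, String> toFields(Customer customer) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", customer.name);
        fields.put("loyaltyPoints", Integer.toString(customer.loyaltyPoints));
        return fields;
    }

    // Explicit "deserialization": unknown keys are ignored and
    // missing keys fall back to a default chosen by the developer.
    public static Customer fromFields(Map<String, String> fields) {
        String name = fields.getOrDefault("name", "");
        int points = Integer.parseInt(fields.getOrDefault("loyaltyPoints", "0"));
        return new Customer(name, points);
    }

    public static void main(String[] args) {
        Customer restored = fromFields(toFields(new Customer("Ada", 42)));
        System.out.println(restored.name + " " + restored.loyaltyPoints);
    }
}
```

The mapping has to be maintained whenever the class changes, but a record written by an older release (for example one without loyaltyPoints) still loads, which is the trade-off the paragraph above describes.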

July 7, 2011

by Robert Diana


TechTip: Use of setLenient method on SimpleDateFormat

Sometimes, when you parse a date string against a pattern (such as MM/dd/yyyy) using java.text.SimpleDateFormat, strange things can happen (for unaware developers) if the date string is dynamic content entered by a user in an input field and it is not entered in the specified format. The parse method in SimpleDateFormat accepts the date string in the incorrect format and returns a Date object instead of throwing a java.text.ParseException. However, the date returned is not what you expect. The code snippet below shows this behaviour.

    package com.starwood.system.util;

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class DateSample {
        public static void main(String args[]) {
            SimpleDateFormat sdf = new SimpleDateFormat();
            sdf.applyPattern("MM/dd/yyyy");
            try {
                // Lenient by default: "2011" is accepted as a month and rolled over.
                Date d = sdf.parse("2011/02/06");
                System.out.println(d);
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }
    }

Output:

    Thu Jul 02 00:00:00 MST 173

Look at the output: it is a date back in the year 173. To avoid this problem, call setLenient(false) on the SimpleDateFormat instance. That makes the parse method throw a ParseException when the given input string is not in the specified format. Here is the modified code snippet.

    package com.starwood.system.util;

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class DateSample {
        public static void main(String args[]) {
            SimpleDateFormat sdf = new SimpleDateFormat();
            sdf.applyPattern("MM/dd/yyyy");
            // Strict parsing: out-of-range fields are rejected.
            sdf.setLenient(false);
            try {
                Date d = sdf.parse("2011/02/06");
                System.out.println(d);
            } catch (ParseException e) {
                System.out.println(e.getMessage());
            }
        }
    }

Output:

    Unparseable date: "2011/02/06"

http://accordess.com/wpblog/2011/06/02/techtip-use-of-setlenient-method-on-simpledateformat
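As a possible follow-up to the tip above, the strict parsing can be wrapped in a small helper. This is a sketch, not part of the original post: the class name StrictDateParser and the method parseStrict are made up for illustration. It also uses the parse(String, ParsePosition) overload so that trailing text after a valid date (which the single-argument parse accepts even in non-lenient mode) is rejected as well.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;

public class StrictDateParser {

    // Returns the parsed date, or null when the text does not match
    // the pattern exactly (hypothetical helper, for illustration only).
    public static Date parseStrict(String text, String pattern) {
        SimpleDateFormat sdf = new SimpleDateFormat(pattern);
        sdf.setLenient(false); // reject out-of-range fields such as month 2011

        ParsePosition pos = new ParsePosition(0);
        Date parsed = sdf.parse(text, pos); // returns null on failure, does not throw

        // Also reject input with leftover characters after the date.
        if (parsed == null || pos.getIndex() != text.length()) {
            return null;
        }
        return parsed;
    }

    public static void main(String[] args) {
        System.out.println(parseStrict("02/06/2011", "MM/dd/yyyy") != null);
        System.out.println(parseStrict("2011/02/06", "MM/dd/yyyy") != null);
    }
}
```

Returning null keeps the sketch short; in real code an exception or an Optional may be a better fit, depending on how the caller reports invalid user input.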

June 27, 2011

by Upendra Chintala


Developing Android Apps with NetBeans, Maven, and VirtualBox

I am an experienced Java developer who has used various IDEs, and I prefer NetBeans IDE over all others by a long shot. I am also very fond of Maven as the tool to simplify and automate nearly every aspect of the development of my Java projects throughout their lifecycle. Recently, I started developing Android applications and naturally I looked for a Maven plugin that would manage my Android projects. Luckily, I found the maven-android-plugin, which worked like a charm and allowed me to use Maven for developing my Android projects. The Android Emulator from the Android SDK seemed unusably slow. Luckily, I found a way to use an Android virtual machine for VirtualBox that worked nearly as fast as my native computer! This page documents my experiences.

Tested Environment

- Dev machine: Ubuntu 11.04 Linux
- IDE: NetBeans
- VirtualBox: 4.0.8 r71778
- Android SDK: Revision 11, Add on XML Schema #1, Repository XML Schema #3 (from About in SDK and AVD Manager)
- Android Version: 2.2

Overview of Steps

1. Download and install the Android SDK on your dev machine
2. Attach an Android device to the dev machine
3. Configure and load your device for development and other use
4. Create an initial Android Maven project
5. Connect the Android device to the Android SDK
6. Debug the Android app using the NetBeans graphical debugger

Download and Install Android SDK

Download and install the Android SDK on your dev machine as described here. Make sure to set the following in the dev machine's ~/.bashrc file:

    export ANDROID_HOME=$HOME/android-sdk-linux_x86 #Change as needed
    export PATH="$ANDROID_HOME/tools:$ANDROID_HOME/platform-tools:$PATH"

Attaching an Android Device to Dev Machine

If you have an actual device, that is usually best. If not, you must use a virtual Android device, which usually has various limitations (e.g. no GPS, camera etc.). The Android SDK makes it easy to create a new virtual device, but the resulting device is painfully slow in my experience and not usable. Do not bother with this.

Instead, create a virtual Android device using VirtualBox as described in the following steps:

1. Install VirtualBox and an initial Android VM as described here:
   http://androidspin.com/2011/01/24/howto-install-android-x86-2-2-in-virtualbox/
   http://geeknizer.com/how-to-run-google-android-in-virtualbox-vmware-on-netbooks/
2. Configure the Android VM so it is connected bidirectionally with your dev machine over TCP as described here:
   http://stackoverflow.com/questions/61156/virtualbox-host-guest-network-setup
   I used the approach of configuring a host-only network adapter and a second NAT adapter on the Android VM within VirtualBox.

Configuring your Android Device

This section describes various things I did to set up a dev environment for my Android device:

- Root the device. I used Universal AndRoot.
- Install ConnectBot so you have ssh and related network utilities.

Creating Initial Android Maven Application

Create the initial project using the instructions here. I found it best to create a stub project structure using the maven-archetype-plugin and the archetypes at https://github.com/akquinet/android-archetypes/wiki

Connecting Android VM Device to Android SDK

In order for your code to be deployed from NetBeans IDE to the Android device, and in order for you to monitor your deployed app from the Dalvik Debug Monitor (ddms), you need to connect your Android VM device to the Android SDK over TCP as described in the following steps.

1. On the Android device, open the Terminal Emulator.
2. Type su to become root (your device must be rooted for this).
3. Type the following commands in the root shell:

       setprop service.adb.tcp.port 5555
       stop adbd
       start adbd

4. Type the following commands in a shell on the dev machine. TODO: Note that the IP address below is whatever IP address is associated with the device (see ifconfig on Linux for device vboxnet0).

       adb tcpip 5555
       adb connect 192.168.0.101:5555

   For details on the above steps see: http://stackoverflow.com/questions/2604727/how-can-i-connect-to-android-with-adb-over-tcp
5. Set up port forwarding as described here: http://redkrieg.com/2010/10/11/adb-over-ssh-fun-with-port-forwards/ (this is where I am most fuzzy).
6. Build your Maven Android project using Right-Click / Clean and Build.
7. Now for the acid test of whether you can deploy your app to the device from NetBeans IDE! Right-click / Custom / Goal to show the Run Maven dialog. Enter android:deploy in the Goals field. Select the Remember As button and enter android:deploy in its text field. If all is well, the app will deploy to the device and will show up in its "Applications" screen.

Debugging Android App Using NetBeans Graphical Debugger

Once you can build and deploy your app to the real or virtual Android device, here are the steps to debug the app using the NetBeans debugger:

1. On the device: start the app (TODO: determine how to start the app on the device with JVM options so it can wait for a debugger connection. This should be easy).
2. On the dev machine, run the Dalvik Debug Monitor (ddms) in the background:

       $ANDROID_HOME/tools/ddms &

3. Look up your app in ddms and get its debug port. This is described here but does not address NetBeans specifically.
4. In NetBeans do: Debug / Attach Debugger and specify the port looked up in ddms in the previous step. You may leave the rest of the fields with their defaults. Click OK.

June 18, 2011

by Farrukh Najmi


Android Tutorial: How to Parse/Read JSON Data Into a Android ListView

Today we get on with our series that will connect our Android applications to internet webservices! Next up in line: from JSON to a ListView. A lot of this project is identical to the previous post in this series, so try to look there first if you have any problems. At the bottom of the post I'll add the Eclipse project with the source. For this example I made use of an already existing JSON webservice located here. This is a piece of the JSON array that gets returned:

    {"earthquakes": [
      { "eqid": "c0001xgp", "magnitude": 8.8, "lng": 142.369, "src": "us", "datetime": "2011-03-11 04:46:23", "depth": 24.4, "lat": 38.322 },
      { "eqid": "2007hear", "magnitude": 8.4, "lng": 101.3815, "src": "us", "datetime": "2007-09-12 09:10:26", "depth": 30, "lat": -4.5172 }
    ]}

So how do we get this data into our application? Behold our getJSON class!

getJSON(String url)

    public static JSONObject getJSONfromURL(String url) {
        // initialize
        InputStream is = null;
        String result = "";
        JSONObject jArray = null;

        // http post
        try {
            HttpClient httpclient = new DefaultHttpClient();
            HttpPost httppost = new HttpPost(url);
            HttpResponse response = httpclient.execute(httppost);
            HttpEntity entity = response.getEntity();
            is = entity.getContent();
        } catch (Exception e) {
            Log.e("log_tag", "Error in http connection " + e.toString());
        }

        // convert response to string
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(is, "iso-8859-1"), 8);
            StringBuilder sb = new StringBuilder();
            String line = null;
            while ((line = reader.readLine()) != null) {
                sb.append(line + "\n");
            }
            is.close();
            result = sb.toString();
        } catch (Exception e) {
            Log.e("log_tag", "Error converting result " + e.toString());
        }

        // try parse the string to a JSON object
        try {
            jArray = new JSONObject(result);
        } catch (JSONException e) {
            Log.e("log_tag", "Error parsing data " + e.toString());
        }

        return jArray;
    }

The code above can be divided into 3 parts:

- the first part makes the HTTP call
- the second part converts the stream into a String
- the third part converts the String to a JSONObject

Now we only have to implement this in our ListView. We can use the same method as in the XML tutorial. We make a HashMap that stores our data and we put JSON values in the HashMap. After that we bind that HashMap to a SimpleAdapter. Here is how it's done:

Implementation

    ArrayList<HashMap<String, String>> mylist = new ArrayList<HashMap<String, String>>();

    // Get the data (see above)
    JSONObject json = JSONfunctions.getJSONfromURL("http://api.geonames.org/postalCodeSearchJSON?formatted=true&postalcode=9791&maxRows=10&username=demo&style=full");

    try {
        // Get the element that holds the earthquakes ( JSONArray )
        JSONArray earthquakes = json.getJSONArray("earthquakes");

        // Loop the Array
        for (int i = 0; i < earthquakes.length(); i++) {
            HashMap<String, String> map = new HashMap<String, String>();
            JSONObject e = earthquakes.getJSONObject(i);
            map.put("id", String.valueOf(i));
            map.put("name", "Earthquake name:" + e.getString("eqid"));
            map.put("magnitude", "Magnitude: " + e.getString("magnitude"));
            mylist.add(map);
        }
    } catch (JSONException e) {
        Log.e("log_tag", "Error parsing data " + e.toString());
    }

After this we only need to make up the SimpleAdapter:

    ListAdapter adapter = new SimpleAdapter(this, mylist, R.layout.main,
            new String[] { "name", "magnitude" },
            new int[] { R.id.item_title, R.id.item_subtitle });
    setListAdapter(adapter);

    final ListView lv = getListView();
    lv.setTextFilterEnabled(true);
    lv.setOnItemClickListener(new OnItemClickListener() {
        public void onItemClick(AdapterView<?> parent, View view, int position, long id) {
            // Look up the map that backs the clicked row
            @SuppressWarnings("unchecked")
            HashMap<String, String> o = (HashMap<String, String>) lv.getItemAtPosition(position);
            Toast.makeText(Main.this, "ID '" + o.get("id") + "' was clicked.", Toast.LENGTH_SHORT).show();
        }
    });

Now we have a ListView filled with JSON data! Here is the Eclipse project: source code. Have fun playing around with it.
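The second of the three parts described in this tutorial, turning the response stream into a String, can be sketched with nothing but the Java standard library. This is an illustrative variant rather than the tutorial's own code: the class name StreamText and the method readAll are hypothetical, and try-with-resources is used here so that the reader is closed even when reading fails.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StreamText {

    // Reads the whole stream line by line and returns it as one String,
    // appending "\n" after every line as the tutorial's loop does.
    public static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            // Surface the failure instead of continuing with partial data.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] body = "{\"earthquakes\": []}".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(body), StandardCharsets.UTF_8));
    }
}
```

Passing the charset explicitly is a design choice of this sketch; which charset is right depends on what the web service actually sends.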

June 8, 2011

by Mark Mooibroek


Using Advantage data providers to read DBF-files

In one of my projects I have to read FoxPro DBF-files and import data from them. As this code must run in server and customer doesn’t want to install FoxPro there we found another solution that seems at least to me way better. In this posting I will show you how to read DBF-files using Sybase Advantage data providers. Getting Advantage data providers Here are the download links to data providers: Advantage .NET Data Provider Release 10.1 for Windows (32-bit and 64-bit) Platforms for Advantage OLE DB Provider Release 10.1 Platforms for Advantage ODBC Driver Release 10.1 I downloaded and installed .NET data provider and my example here is fully based on this. Configuring application If you run application without configuring some data providers stuff before you will get the following error: Error 5185: Local server connections are restricted in this environment. See the 5185 error code documentation for details. Go to your application bin folder and add there usual text file called ads.ini. Here is the content for this file: [SETTINGS] MTIER_LOCAL_CONNECTIONS=1 Make sure you add reference to Advantage data provider assembly and include ads.ini to your project like shown on image above. Getting data to DataTable Here is short code example about how to get data from DBF-file to DataTable. 
    static void Main(string[] args)
    {
        var tableName = "TABLENAME_WITHOUT_EXTENSION";
        var connStr = "data source={0};tabletype=vfp;servertype=local;";
        connStr = string.Format(connStr, "c:\\temp\\");

        var table = new DataTable();

        using (var conn = new AdsConnection(connStr))
        using (var adapter = new AdsDataAdapter())
        using (var cmd = new AdsCommand())
        {
            cmd.Connection = conn;
            cmd.CommandText = "select * from " + tableName;
            adapter.SelectCommand = cmd;

            conn.Open();
            adapter.Fill(table);
            conn.Close();
        }

        Console.WriteLine("Table fields:");
        foreach (DataColumn col in table.Columns)
            Console.WriteLine(col.ColumnName);

        Console.WriteLine(" ");
        Console.WriteLine("Rows: " + table.Rows.Count);
        Console.Read();
    }

If the Advantage data providers were installed correctly and there are no errors in table names, locations and your SQL query, then you should see the list of table column names and the row count in the console window when you run the application.

May 17, 2011

by Gunnar Peipman


How to Iterate ArrayList in Struts2

We will discuss how to iterate over a collection of String objects in Struts2 tag libraries and then a List of custom class objects. It looks as if iterating a list of string objects is easier than iterating over a list of custom class objects in Struts 2. But the reality is that iterating a list of custom class objects is also equally easier. By custom class we mean the User, Employee, Department, Products, Vehicles classes that are created in any web application. Download Working Sample Here Usually it happens that one needs to fetch a list of records from database/files and then display it in the JSP. The module requiring this functionality could be Search, Listing users/departments/products etc. The basic flow of struts2 web application goes like: The user initiates the request from one page. This request is received by the interceptor which further invokes the Struts2 action. The action class fetches the records and stores in a list. This list is available to the next JSP using the public getter method. Please note that the public getter method for the List is mandatory. Once the List has been populated by Struts2 action class, the JSP then iterates over this List and displays the corresponding information. In the days gone by, one would store the List as a session attribute and then access the list in JSP using the scriptlets to display appropriate output to the users. Here is a Struts2 sample application to iterate one String and one custom class objects List. Though we are using the Struts2 tag library to iterate the list but JSTL can also be used for iteration. 
Also, if you are going to use the code examples given below, use the following URLs to access the application: http://localhost:8080//index.action

Iterate a Custom class ArrayList in Struts2

web.xml

    <filter>
        <filter-name>struts2</filter-name>
        <filter-class>org.apache.struts2.dispatcher.ng.filter.StrutsPrepareAndExecuteFilter</filter-class>
    </filter>
    <filter-mapping>
        <filter-name>struts2</filter-name>
        <url-pattern>*.action</url-pattern>
    </filter-mapping>

struts.xml

/home.jsp /success.jsp /failure.jsp

home.jsp

Enter a user name to get the documents uploaded by that user. Username

success.jsp

Documents uploaded by the user are:

failure.jsp

FetchAction.java

    package com.example;

    import java.util.ArrayList;
    import java.util.List;

    public class FetchAction {
        private String username;
        private String message;
        private List<Document> documents = new ArrayList<Document>();

        // The public getter is what makes the list visible to the JSP.
        public List<Document> getDocuments() { return documents; }

        public String getMessage() { return message; }
        public void setMessage(String message) { this.message = message; }

        public String getUsername() { return username; }
        public void setUsername(String username) { this.username = username; }

        public String execute() {
            if (username != null) {
                // logic to fetch the document list (say from database)
                Document d1 = new Document();
                d1.setName("user.doc");
                Document d2 = new Document();
                d2.setName("office.doc");
                Document d3 = new Document();
                d3.setName("transactions.doc");
                documents.add(d1);
                documents.add(d2);
                documents.add(d3);
                return "success";
            } else {
                message = "Unable to fetch";
                return "failure";
            }
        }
    }

Document.java

    package com.example;

    public class Document {
        private String name;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

Iterate String List in Struts2

The way to iterate a String list is similar, with the only difference that the action class FetchAction.java now populates the names of the documents into an ArrayList of String objects.
The code zip file containing the iteration over an ArrayList of custom class object or bean can be downloaded at: http://www.fileserve.com/file/QmrsJ7k The URL to access this application will be: http://localhost:8080/IteratorExample/index.action The code zip file containing the iteration over an ArrayList of string class object or bean can be downloaded at: http://www.fileserve.com/file/V2kXkfx The URL to access this application will be: http://localhost:8080/StringIteratorExample/index.action From http://extreme-java.blogspot.com/2011/05/how-to-iterate-arraylist-in-struts2.html
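The String-list variant of the action described above could look roughly like the following sketch. It is an assumption-laden illustration, not the code from the downloadable zip: the class name StringFetchAction is hypothetical, and the document names are simply reused from the custom-class example. Like the FetchAction shown earlier, it is a plain Java class with a public getter for the list.

```java
import java.util.ArrayList;
import java.util.List;

public class StringFetchAction {
    private String username;
    private String message;
    // Plain document names instead of Document beans.
    private final List<String> documents = new ArrayList<String>();

    // Public getter so the view layer can read the list.
    public List<String> getDocuments() { return documents; }

    public String getMessage() { return message; }

    public String getUsername() { return username; }
    public void setUsername(String username) { this.username = username; }

    public String execute() {
        if (username == null) {
            message = "Unable to fetch";
            return "failure";
        }
        // Stand-in for a real lookup (say, from a database).
        documents.add("user.doc");
        documents.add("office.doc");
        documents.add("transactions.doc");
        return "success";
    }

    public static void main(String[] args) {
        StringFetchAction action = new StringFetchAction();
        action.setUsername("demo");
        System.out.println(action.execute() + " " + action.getDocuments());
    }
}
```

Because each element is already a String, the view only has to print the current element of the iteration rather than a property of it.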

May 17, 2011

by Sandeep Bhandari


Java Web Application Security - Part I: Java EE 6 Login Demo

Back in February, I wrote about my upcoming conferences: In addition to Vegas and Poland, there's a couple other events I might speak at in the next few months: the Utah Java Users Group (possibly in April), Jazoon and ÜberConf (if my proposals are accepted). For these events, I'm hoping to present the following talk: Webapp Security: Develop. Penetrate. Protect. Relax. In this session, you'll learn how to implement authentication in your Java web applications using Spring Security, Apache Shiro and good ol' Java EE Container Managed Authentication. You'll also learn how to secure your REST API with OAuth and lock it down with SSL. After learning how to develop authentication, I'll introduce you to OWASP, the OWASP Top 10, its Testing Guide and its Code Review Guide. From there, I'll discuss using WebGoat to verify your app is secure and commercial tools like webapp firewalls and accelerators. Fast forward a couple months and I'm happy to say that I've completed my talk at the Utah JUG and it's been accepted at Jazoon and Über Conf. For this talk, I created a presentation that primarily consists of demos implementing basic, form and Ajax authentication using Java EE 6, Spring Security and Apache Shiro. In the process of creating the demos, I learned (or re-educated myself) how to do a number of things in all 3 frameworks: Implement Basic Authentication Implement Form-based Authentication Implement Ajax HTTP -> HTTPS Authentication (with programmatic APIs) Force SSL for certain URLs Implement a file-based store of users and passwords (in Jetty/Maven and Tomcat standalone) Implement a database store of users and passwords (in Jetty/Maven and Tomcat standalone) Encrypt Passwords Secure methods with annotations For the demos, I showed the audience how to do almost all of these, but skipped Tomcat standalone and securing methods in the interest of time. 
In July, when I do this talk at ÜberConf, I plan on adding 1) hacking the app (to show security holes) and 2) fixing it to protect it against vulnerabilities. I told the audience at UJUG that I would post the presentation and was planning on recording screencasts of the various demos so the online version of the presentation would make more sense. Today, I've finished the first screencast showing how to implement security with Java EE 6. Below is the presentation (with the screencast embedded on slide 10) as well as a step-by-step tutorial. * You can also watch the screencast on YouTube or download the presentation PDF. Java EE 6 Login Tutorial Download and Run the Application Implement Basic Authentication Implement Form-based Authentication Force SSL Store Users in a Database Summary Download and Run the Application To begin, download the application you'll be implementing security in. This app is a stripped-down version of the Ajax Login application I wrote for my article on Implementing Ajax Authentication using jQuery, Spring Security and HTTPS. You'll need Java 6 and Maven installed to run the app. Run it using mvn jetty:run and open http://localhost:8080 in your browser. You'll see it's a simple CRUD application for users and there's no login required to add or delete users. Implement Basic Authentication The first step is to protect the list screen so people have to login to view users. To do this, add the following to the bottom of src/main/webapp/WEB-INF/web.xml: users /users GET POST ROLE_ADMIN BASIC Java EE Login ROLE_ADMIN At this point, if you restart Jetty (Ctrl+C and jetty:run again), you'll get an error about a missing LoginService. This happens because Jetty doesn't know where the "Java EE Login" realm is located. Add the following to pom.xml, just after in the Jetty plugin's configuration. Java EE Login ${basedir}/src/test/resources/realm.properties The realm.properties file already exists in the project and contains user names and passwords. 
Start the app again using mvn jetty:run and you should be prompted to login when you click on the "Users" tab. Enter admin/admin to login. After logging in, you can try to logout by clicking the "Logout" link in the top-right corner. This calls a LogoutController with the following code that logs the user out.

    public void logout(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        request.getSession().invalidate();
        response.sendRedirect(request.getContextPath());
    }

You'll notice that clicking this link doesn't log you out, even though the session is invalidated. The only way to logout with basic authentication is to close the browser. In order to get the ability to logout, as well as to have more control over the look-and-feel of the login, you can implement form-based authentication.

Implement Form-based Authentication

To change from basic to form-based authentication, you simply have to replace the <login-config> in your web.xml with the following:

    <login-config>
        <auth-method>FORM</auth-method>
        <form-login-config>
            <form-login-page>/login.jsp</form-login-page>
            <form-error-page>/login.jsp?error=true</form-error-page>
        </form-login-config>
    </login-config>

The login.jsp page already exists in the project, in the src/main/webapp directory. This JSP has 3 important elements: 1) a form that submits to "${contextPath}/j_security_check", 2) an input element named "j_username" and 3) an input element named "j_password". If you restart Jetty, you'll now be prompted to login with this JSP instead of the basic authentication dialog.

Force SSL

Another thing you might want to implement to secure your application is forcing SSL for certain URLs. To do this on the same <security-constraint> you already have in web.xml, add the following after </auth-constraint>:

    <user-data-constraint>
        <transport-guarantee>CONFIDENTIAL</transport-guarantee>
    </user-data-constraint>

To configure Jetty to listen on an SSL port, add the following just after in your pom.xml: true 8080 true 8443 60000 ${project.build.directory}/ssl.keystore appfuse appfuse

The keystore must be generated for Jetty to start successfully, so add the keytool-maven-plugin just above the jetty-maven-plugin in pom.xml.
org.codehaus.mojo keytool-maven-plugin 1.0 generate-resources clean clean generate-resources genkey genkey ${project.build.directory}/ssl.keystore cn=localhost appfuse appfuse appfuse RSA Now if you restart Jetty, go to http://localhost:8080 and click on the "Users" tab, you'll get a 403. What the heck?! When this first happened to me, it took me a while to figure out. It turns out that Jetty doesn't redirect to HTTPS when using Java EE authentication, so you have to manually type in https://localhost:8443/ (or add a filter to redirect for you). If you deployed this same application on Tomcat (after enabling SSL), it would redirect for you. Store Users in a Database Finally, to store your users in a database instead of file, you'll need to change the in the Jetty plugin's configuration. Replace the existing element with the following: Java EE Login ${basedir}/src/test/resources/jdbc-realm.properties The jdbc-realm.properties file already exists in the project and contains the database settings and table/column names for the user and role information. jdbcdriver = com.mysql.jdbc.Driver url = jdbc:mysql://localhost/appfuse username = root password = usertable = app_user usertablekey = id usertableuserfield = username usertablepasswordfield = password roletable = role roletablekey = id roletablerolefield = name userroletable = user_role userroletableuserkey = user_id userroletablerolekey = role_id cachetime = 300 Of course, you'll need to install MySQL for this to work. After installing it, you should be able to create an "appfuse" database and populate it using the following commands: mysql -u root -p -e 'create database appfuse' curl https://gist.github.com/raw/958091/ceecb4a6ae31c31429d5639d0d1e6bfd93e2ea42/create-appfuse.sql > create-appfuse.sql mysql -u root -p appfuse < create-appfuse.sql Next you'll need to configure Jetty so it has MySQL's JDBC Driver in its classpath. 
To do this, add the following dependency just after the element (before ) in pom.xml: mysql mysql-connector-java 5.1.14 Now run the jetty-password.sh file in the root directory of the project to generate a password of your choosing. For example: $ sh jetty-password.sh javaeelogin javaeelogin OBF:1vuj1t2v1wum1u9d1ugo1t331uh21ua51wts1t3b1vur MD5:53b176e6ce1b5183bc970ef1ebaffd44 The last two lines are obfuscated and MD5 versions of the password. Update the admin user's password to this new value. You can do this with the following SQL statement. UPDATE app_user SET password='MD5:53b176e6ce1b5183bc970ef1ebaffd44' WHERE username = 'admin'; Now if you restart Jetty, you should be able to login with admin/javaeelogin and view the list of users. Summary In this tutorial, you learned how to implement authentication using standard Java EE 6. In addition to the basic XML configuration, there's also some new methods in HttpServletRequest for Java EE 6 and Servlet 3.0: authenticate(response) login(user, pass) logout() This tutorial doesn't show you how to use them, but I did play with them a bit as part of my UJUG demo when implementing Ajax authentication. I found that login() did work, but it didn't persist the authentication for the users session. I also found that after calling logout(), I still needed to invalidate the session to completely logout the user. There are some additional limitations I found with Java EE authentication, namely: No error messages for failed logins No Remember Me No auto-redirect from HTTP to HTTPS Container has to be configured Doesn’t support regular expressions for URLs Of course, no error messages indicating why login failed is probably a good thing (you don't want to tell users why their credentials failed). However, when you're trying to figure out if your container is configured properly, the lack of container logging can be a pain. 
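For readers curious about the MD5: value printed by jetty-password.sh in the password step of this tutorial, the sketch below shows one way such a string can be produced with the JDK alone. It rests on an assumption that should be checked against your Jetty version: that the MD5 credential is the hex-encoded MD5 digest of the password bytes with an "MD5:" prefix. The class and method names are hypothetical, and the sketch is no substitute for the script shipped with the project.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5CredentialSketch {

    // Builds "MD5:" followed by the lowercase hex MD5 digest of the password.
    // Assumption: this mirrors the format Jetty expects for MD5 credentials.
    public static String md5Credential(String password) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(password.getBytes(StandardCharsets.ISO_8859_1));
            StringBuilder out = new StringBuilder("MD5:");
            for (byte b : digest) {
                out.append(String.format("%02x", b & 0xff)); // two hex digits per byte
            }
            return out.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Credential("javaeelogin"));
    }
}
```

MD5 is shown only because it is what the tutorial's realm configuration uses; it is a weak choice for protecting stored passwords.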
In the next couple of weeks, I'll post Part II of this series, where I'll show you how to implement this same set of features using Spring Security. In the meantime, please let me know if you have any questions.

From http://raibledesigns.com/rd/entry/java_web_application_security_part
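For readers wondering what the MD5: value stored in app_user is, here is a minimal sketch. It assumes the value is simply the prefix "MD5:" followed by the lowercase hex MD5 digest of the password bytes. That is an assumption about the password tool's output format, not something the tutorial states, and it has not been checked against Jetty. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Jetty or AppFuse.
class Md5CredentialSketch {

    /** Returns "MD5:" followed by the lowercase hex MD5 digest of the password. */
    static String md5Credential(String password) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            // Assumption: the password is encoded as ISO-8859-1 before hashing.
            byte[] digest = md.digest(password.getBytes(StandardCharsets.ISO_8859_1));
            StringBuilder out = new StringBuilder("MD5:");
            for (byte b : digest) {
                out.append(String.format("%02x", b & 0xff));
            }
            return out.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to ship an MD5 MessageDigest.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Credential("javaeelogin"));
    }
}
```

If the assumption holds, the printed line should match the MD5: line produced by jetty-password.sh for the same password. Keep in mind that an unsalted MD5 digest is only a light obfuscation of the password.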

May 15, 2011

by Matt Raible

· 42,012 Views · 3 Likes

Reset MySQL Root Password On Linux

Five easy steps to reset the MySQL root password:

1) Stop the MySQL server.
2) Start the MySQL server with the --skip-grant-tables option (it will not prompt for a password).
3) Connect to the MySQL server as the root user.
4) Set up a new MySQL root password.
5) Exit and restart the MySQL server.

    ### Shell Commands ###
    /etc/init.d/mysql stop
    mysqld_safe --skip-grant-tables &
    mysql -u root

    ### SQL Commands ###
    use mysql;
    UPDATE user SET password=PASSWORD("new-password") WHERE user='root';
    flush privileges;
    exit

    ### Shell Commands ###
    /etc/init.d/mysql stop
    /etc/init.d/mysql start
    mysql -u root -pnew-password

May 13, 2011

by Artur Mkrtchyan

· 5,779 Views

Database Interaction with DAO and DTO Design Patterns

Learn what a DAO is, how to create one, and why Data Access Objects matter.

May 13, 2011

by Sandeep Bhandari

· 165,706 Views · 5 Likes

Practical PHP Testing Patterns: Transaction Rollback Teardown

Maintaining isolation between tests that use a database as a Shared Fixture is not a trivial task. An important constraint is to avoid the headache of keeping track of which manipulations your code has performed on the database; if you have to track them, the cleanup may not even be performed when a regression occurs. An alternative to resetting the database via DELETE and TRUNCATE queries is to start a transaction in the setup phase and roll it back during teardown.

Implementation

The phases of a database test involving Transaction Rollback Teardown are roughly the following:

- begin a transaction, usually in setUp().
- arrange, act, assert in the various Test Methods.
- roll back the transaction in tearDown(). The active transaction is never committed.

An issue with this pattern is that code which already uses transactions is prone to generate errors, and ultimately should never be tested with this technique. The rule for your safety is simple: the SUT should never start a transaction or commit one. Some databases support nested transactions, but they are very brittle to use for testing purposes: in case of any failure the whole suite will blow up, as tests execute teardowns at the wrong level of nesting.

The safety of this pattern is also difficult to ensure, as DDL statements like CREATE and DROP, or other commands, may commit the current transaction automatically. Check the documentation of your testing database.

The advantage of this pattern is great performance: a rollback is generally faster than the alternatives, including TRUNCATE. Moreover, if you encapsulate transactions well in your production code, most of it won't commit them (typically leaving control over the transaction to an upper layer).

Doctrine 2

In a sense, we already use this pattern with a Unit of Work ORM such as Doctrine 2 when we do not flush() the ORM in our code. The flow is:

- The database is prepared by the setup.
- Exercise the code.
- Check results as persisted or removed entities.
- Instead of calling flush() on the Entity Manager, call clear().

In this case, the database never sees a transaction, as Doctrine 2 keeps everything in the Unit of Work until you tell it to flush. Even when your code is calling flush(), you can explicitly use beginTransaction() and rollback() on the connection object: in this other scenario, the testing database sees an open transaction, but it is never committed and can be discarded in tearDown() as the pattern prescribes.

Example

The code sample is the same test case shown in the Table Truncation Teardown article, which now uses transactions to encapsulate the single tests. The tests check that the table's content is restored, along with the next AUTOINCREMENT value.

                exec('CREATE TABLE users (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    name VARCHAR(255)
                )');
            }
            $this->connection = self::$sharedConnection;
            $this->connection->beginTransaction();
        }

        public function tearDown()
        {
            $this->connection->rollback();
        }

        public function testTableCanBePopulated()
        {
            $this->connection->exec('INSERT INTO users (name) VALUES ("Giorgio")');
            $this->assertEquals(1, $this->howManyUsers());
        }

        public function testTableRestartsFrom1()
        {
            $this->assertEquals(0, $this->howManyUsers());
            $this->connection->exec('INSERT INTO users (name) VALUES ("Isaac")');
            $stmt = $this->connection->query('SELECT name FROM users WHERE id=1');
            $result = $stmt->fetch();
            $this->assertEquals('Isaac', $result['name']);
        }

        public function testTableIsEmpty()
        {
            $this->assertEquals(0, $this->howManyUsers());
        }

        private function howManyUsers()
        {
            $stmt = $this->connection->query('SELECT COUNT(*) AS number FROM users');
            $result = $stmt->fetch();
            return $result['number'];
        }
    }
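The lifecycle of the pattern (begin in setup, always roll back in teardown) can also be sketched outside PHP. The Java sketch below uses a deliberately tiny in-memory stand-in for a database connection so that it runs without a driver. InMemoryConnection and RollbackTeardownSketch are hypothetical names, and the snapshot-based rollback is an assumption made for illustration, not how a real database implements transactions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for a database connection; not a real driver. */
class InMemoryConnection {
    private List<String> rows = new ArrayList<>();
    private List<String> snapshot;              // state saved when the transaction began

    void beginTransaction() { snapshot = new ArrayList<>(rows); }

    void insert(String name) { rows.add(name); }

    int count() { return rows.size(); }

    void rollback() {
        if (snapshot == null) {
            throw new IllegalStateException("no active transaction");
        }
        rows = snapshot;                        // discard everything done since begin
        snapshot = null;
    }
}

class RollbackTeardownSketch {
    // Shared Fixture: a single connection reused by every test.
    static final InMemoryConnection shared = new InMemoryConnection();

    static void setUp() { shared.beginTransaction(); }

    static void tearDown() { shared.rollback(); }

    /** Runs one test body between setUp() and tearDown(); returns the row count the test saw. */
    static int runTest(String nameToInsert) {
        setUp();
        try {
            shared.insert(nameToInsert);
            return shared.count();
        } finally {
            tearDown();                         // rolls back even if the test body throws
        }
    }

    public static void main(String[] args) {
        System.out.println(runTest("Giorgio")); // prints 1
        System.out.println(runTest("Isaac"));   // prints 1: the first insert was rolled back
        System.out.println(shared.count());     // prints 0: nothing was ever committed
    }
}
```

With a real java.sql.Connection, the corresponding calls would be setAutoCommit(false) in the setup and rollback() in the teardown; the caveats above about code under test that commits on its own still apply.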

May 11, 2011

by Giorgio Sironi

· 7,791 Views

Practical PHP Testing Patterns: Stored Procedure Test

In the days before the advent of DDD and the Hexagonal architecture, you often had code that lived inside the database, such as stored procedures, constraints, and triggers. Back then, the relational database was considered the single source of truth, instead of a Domain Model written in a language like PHP or Java. Today the picture is different, but there are still scenarios where pushing code into the database makes sense.

One of the reasons for having logic expressed in SQL and in other database languages is their power and their performance. SQL operators, especially when augmented by proprietary extensions, let you declare pieces of logic that you would otherwise have to code by hand. SQL executed directly on the database can accomplish operations too onerous to perform over a reconstituted object graph that is subsequently saved. In fact, every decent ORM includes a language for batch updates that translates to SQL, like Doctrine with DQL, and also a mechanism for providing hints to the underlying database, like index definitions.

The problem with SQL derivatives and other database-specific embedded logic is that we cannot execute and test it in isolation: we need a real copy of the database to perform our tests. Thus Stored Procedure Test is an umbrella term for tests that encompass database code, even when it is not actually a stored procedure. When I use the term stored procedure in this article, it signifies any database-specific code, such as complex queries, triggers and so on.

Implementation

The pattern prescribes writing unit tests for the stored procedure, to test it in isolation from the rest of the application, as a first simplification. These tests cover nontrivial logic in database code: you probably don't need them for index definitions, but rather for queries with aggregate functions.
In the PHP world, Sqlite often suffices for testing queries, as long as you have an intermediate layer like Doctrine DBAL (part of Doctrine 2) which smooths out the differences between vendors. You use MySQL in production and Sqlite in the test suite, and you can write queries in Doctrine's DQL, confident that it will translate them to the right SQL dialect.

These tests should be executed in a sandbox: a database with just enough structure and data to test the stored procedure at hand. This sandbox should, by definition, run on the production DBMS. The most difficult aspect of the pattern is integrating with the DBMS: it should be running and listening on the right port. A sandbox should be created in setUp() or setUpBeforeClass(), and destroyed during teardown. In case the database is not available, the tests should be marked as skipped or incomplete.

Variations

In In-Database Stored Procedure Tests, the test is written in the same language as the database code. I cannot imagine anything more boring for a PHP programmer.

In Remoted Stored Procedure Tests, which is the variation of interest for us, the tests are written in PHPUnit and integrated with the suite (slowing it down a bit). The logic is that whatever SQL logic you're going to add to your application is already encapsulated in some PHP class: for example, complex queries are encapsulated in Repositories or DAOs. So it's going to be feasible to build a sandbox via PHP code, and test the stored procedure as a black box. It will be encapsulated for a single execution, like schema creation, or for multiple executions in the case of queries.

Example

The example shows you how to test a query with a real database (supposing a surrogate database does not support all the needed functions) from inside a test suite. I thought it would be difficult to write this test, but it required less than a Pomodoro.
            connection = new PDO("mysql:host=localhost;dbname=sandbox", 'root', '');
            $this->connection->exec("CREATE TABLE users (name VARCHAR(255) NOT NULL PRIMARY KEY, year YEAR)");
            $this->repository = new UserRepository($this->connection, 2011);
        }

        public function testAverageAgeIsCalculated()
        {
            $this->insertUser('Giorgio', 1942);
            $this->insertUser('Isaac', 1920);
            $this->assertEquals(80, $this->repository->getAverageAge());
        }

        private function insertUser($name, $year)
        {
            $stmt = $this->connection->prepare("INSERT INTO users (name, year) VALUES (:name, :year)");
            $stmt->bindValue('name', $name, PDO::PARAM_STR);
            $stmt->bindValue('year', $year, PDO::PARAM_INT);
            return $stmt->execute();
        }

        public function tearDown()
        {
            $this->connection->exec('DROP TABLE users');
        }
    }

    class UserRepository
    {
        private $connection;
        private $currentYear;

        public function __construct(PDO $connection, $currentYear)
        {
            $this->connection = $connection;
            $this->currentYear = $currentYear;
        }

        /**
         * We suppose AVG() cannot be correctly implemented by Sqlite or
         * another surrogate database (substitute another vendor feature
         * for the same effect).
         * We also suppose reconstituting millions of User objects to calculate
         * their average age isn't feasible: that's why we used SQL directly.
         */
        public function getAverageAge()
        {
            $stmt = $this->connection->prepare('SELECT AVG(:year - year) AS average_age FROM users');
            $stmt->bindValue('year', $this->currentYear, PDO::PARAM_INT);
            $stmt->execute();
            $row = $stmt->fetch();
            return $row['average_age'];
        }
    }

May 4, 2011

by Giorgio Sironi

· 2,823 Views

Reasons for Slow Database Performance

There are often scenarios where an application does not perform as expected. A simple web page which fetches data from a database and displays it, optimized for mobile devices, should be fast, and turnaround times should be less than 30 seconds on a good network connection. Yet these are the kinds of applications that suffer the most from performance issues. This is because the database in these cases was not designed with proper attention to the application's requirements. You can change one application's design even after delivery, but changing the database design once a number of applications have been integrated with it is like setting off an explosion.

Here are some points which should be kept in mind while designing the application and the database. Only the basic idea is provided; each point could fill multiple articles, so search the web for details.

1) Bind variables: When a SQL query is sent to the database engine, it is first compiled by the database to extract the tokens of the query. This involves parsing, optimizing and identifying the query. After a number of steps, the SQL query is passed to the database engine for execution. In a small application with a user base of fewer than 500, it is usually the same few queries that are executed more often than others. Bind variables let the database store the compiled query once and execute it with different data at different times. To use bind variables in Java, use PreparedStatement objects.

2) Query is not well formed: Usually the same SQL query can be written in multiple ways, and some of them perform much better than others. The SQL construct should be chosen depending on the requirement. I have seen scenarios where people have used a WHERE clause instead of GROUP BY and then complained of poor response times. Similarly, subqueries and joins complement each other.
3) Database structure is not well defined/normalized: Everybody probably knows that database tables should be properly normalized, as this is part of every DBMS course at graduate level. If the tables are not properly designed and normalized, anomalies set in.

4) Proper caching is not in place: Many applications use temporary caches on the application server to store reference data or frequently accessed data, as memory is less of an issue than time on new-generation servers.

5) Number of rows in the table is too large: If the table itself holds too much data, queries will take time to execute. Partitioning the table into multiple tables is recommended in these situations. For example, if a table has the records of 1000000 employees, it could be split into 5 smaller tables of 200000 rows each. The advantage is that we know beforehand which smaller table to look in for a particular employee code, as the large table can be divided on the employee id column.

6) Connections are not being pooled: If connections are not pooled, a new connection is opened for each request to the database. Maintaining a connection pool is much better than creating and destroying a connection for every SQL query. Of course, there are frameworks like Hibernate which take care of creating connection pools and also allow customization of these pools.

7) Connections not closed/returned to the pool in case of exceptions: When an exception occurs while performing database operations, it ought to be caught. Usually catching the exception is not the issue, because SQLException is a checked exception, but closing the connection is something that is left out most of the time. If the connection is not released, it cannot be used for any other purpose until it times out.
8) Stored procedures for complex computations on the database: Stored procedures are a good way to perform database-intensive operations. This is because they are already compiled and need fewer network round trips to get the same results than separate SQL queries do.

From http://extreme-java.blogspot.com/2011/04/reasons-for-slow-database-performance.html
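Points 6 and 7 above can be illustrated together with a minimal sketch: a fixed-size pool that hands out connections, and a helper that always returns the connection to the pool, even when the work throws. PooledConn, SimplePool and withConnection are hypothetical names invented for this example. A real application would pool java.sql.Connection objects, typically through an existing pooling library rather than hand-written code.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

/** Hypothetical stand-in for a database connection. */
class PooledConn {
    final int id;
    PooledConn(int id) { this.id = id; }
}

class SimplePool {
    private final BlockingQueue<PooledConn> idle;

    SimplePool(int size) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(new PooledConn(i));        // connections are created once, up front
        }
    }

    /** Number of connections currently waiting in the pool. */
    int available() { return idle.size(); }

    /** Borrows a connection, runs the work, and always gives the connection back. */
    <T> T withConnection(Function<PooledConn, T> work) {
        PooledConn c;
        try {
            c = idle.take();                    // blocks while every connection is in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a connection", e);
        }
        try {
            return work.apply(c);
        } finally {
            idle.add(c);                        // runs on success and on exception alike
        }
    }
}

class PoolDemo {
    public static void main(String[] args) {
        SimplePool pool = new SimplePool(2);
        try {
            pool.withConnection(c -> { throw new IllegalStateException("query failed"); });
        } catch (IllegalStateException expected) {
            // The failure propagates, but the connection has already been returned.
        }
        System.out.println(pool.available());   // prints 2
    }
}
```

The finally block is the important part: without it, every failed query would leak one connection until the pool ran dry. With real JDBC resources, try-with-resources gives the same guarantee.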

April 30, 2011

by Sandeep Bhandari

· 59,829 Views · 1 Like

Solr Index Size Analysis

In this post I'm going to talk about a set of benchmarks that I've done with Solr. The goal is to see how each parameter defined in the schema affects the size of the index and the performance of the system.

The first step was to fetch the set of documents that I was going to use in the tests. I wanted the documents to be composed of real text, so I started to look for sources on the Internet. The first one that I really liked was Twitter. They provide a REST API that allows you to read a continuous stream of tweets, composed of approximately 1% of all public tweets. Each tweet is expressed as a JSON object, and carries metadata about the message and the author.

While this source allowed me to get a good number of documents in a short time (about 1.7 million tweets in 2 days), they were really small, so I started to look for a source of bigger documents, finally choosing Wikipedia. I downloaded the documents through HTTP using the “random article” feature on their site, obtaining about 160,000 articles in a couple of days. At the time of writing, the site download.wikipedia.org, which provides an easy way of downloading a bunch of articles, was out of service.

The next step was to design the schema. Because one of the objectives is to see how each change in the schema affects the size of the index, I used many different combinations of parameters, so as to measure the influence of each one of them. In each case, the list of stop words was populated using the top 100 terms of each set of documents, obtained from the administration panel of Solr.

For both datasets, the “omitNorms”, “termVectors” and “stopwords” parameters refer to the “text” field. In all cases, the value of the parameters “termOffsets” and “termPositions” is the same as “termVectors”.

In the first figure you can see the size of the index for each schema for the Twitter data-set, and which proportion of the index corresponds to each parameter.
Remember that this data-set has lots of documents (about 1.7 million) but each one is small (240 bytes on average). There are many remarkable things here. The first one is that the space occupied by the term vectors (~280 MiB when not using stop words) is almost equal to the space occupied by the inverted index itself (~240 MiB). Second, the space saved by omitting norms is almost negligible (~2 MiB). Third, the space saved by using stop words is doubled when storing term vectors, going from about 4% of the index to about 10%. Finally, the space occupied by the stored fields (~340 MiB) is considerably bigger than the space occupied by the inverted index itself.

In the second figure you can see the same information for the Wikipedia data-set. The size occupied by the norms is still negligible (< 1 MiB); however, the size occupied by the stop words has increased to about 22% of the index size when not storing term vectors, and about 25% when storing them. This time, the size occupied by the term vectors (~1067 MiB) is almost three times the space occupied by the inverted index itself (~380 MiB). Finally, the size of the stored documents (~6330 MiB) is more than four times the size of the index with term vectors stored.

At this point, we can state some conclusions concerning the size of the index:

- When the number of fields is small, the size of the norms is negligible, independently of the size and number of documents.
- When the documents are large, stop words help reduce the size of the index significantly. It may be important to note two things here. First, the documents fetched from Wikipedia are written using traditional language, and are all written in English, while the documents fetched from Twitter are written using modern language, and in many different languages. Second, I didn't measure the precision and recall of the system when using stop words, so it is possible that the findability in a real scenario won't be good.
- If you're storing the documents, and they are big enough, it matters little whether you store the term vectors or not, so if you're using a feature such as highlighting and you are looking for good performance, you should store them.
- If you're not storing documents, or your documents are small, you should think twice before storing the term vectors, because they're going to increase your index's size significantly.

I hope you find this post useful. Currently I'm working on a set of benchmarks to measure the influence of each one of these parameters on the performance of the system, so if you liked this post, stay tuned!

April 24, 2011

by Juan Grande

· 29,282 Views

Hammurabi - A Scala Rule Engine

One of the most common reasons why software projects fail, or suffer unbearable delays, is misunderstanding between the analysts who define the business rules of the domain for which the software is going to be written and the developers who have to code these rules. The latter write those rules in a language that is completely obscure to the former. As a result the business analysts don't have a chance to read, understand and validate what the programmers developed, and can only test the final software behavior empirically, hardly covering all the possible corner cases and often recognizing mistakes only when it is too late.

What Hammurabi is

Hammurabi is a rule engine written in Scala that tries to leverage the features of this language that make it particularly suitable for implementing extremely readable internal Domain Specific Languages. Indeed, what actually makes Hammurabi different from all other rule engines is that it is possible to write and compile its rules directly in the host language. Hammurabi's rules also have the important property of being readable even by a non-technical person. As usual, a practical example is worth more than a thousand words.

The golfers problem

This logical puzzle has been taken from the first chapter of the Jess in Action book written by Ernest Friedman-Hill and published by Manning. It is described there as follows:

“A foursome of golfers is standing at a tee, in a line from left to right. Each golfer wears different colored pants; one is wearing red pants. The golfer to Fred’s immediate right is wearing blue pants. Joe is second in line. Bob is wearing plaid pants. Tom isn’t in position one or four, and he isn’t wearing the hideous orange pants. In what order will the four golfers tee off, and what color are each golfer’s pants?”

The Jess solution

Jess is written in Java and is one of the most popular rule engines on the market.
The solution to the golfers problem presented in the book mentioned above is the following. First it is necessary to define the data structures representing the problem:

    (deftemplate pants-color (slot of) (slot is))
    (deftemplate position (slot of) (slot is))

A deftemplate is a bit like a class declaration in Java, and in this case it is used to write a first rule that in turn creates the facts representing each of the possible combinations of golfers, pants colors and positions:

    (defrule generate-possibilities
      =>
      (foreach ?name (create$ Fred Joe Bob Tom)
        (foreach ?color (create$ red blue plaid orange)
          (assert (pants-color (of ?name)(is ?color)))
        )
        (foreach ?position (create$ 1 2 3 4)
          (assert (position (of ?name)(is ?position)))
        )
      )
    )

After that it is possible to translate the sentences of the problem into the corresponding Jess rule:

    (defrule find-solution
      ;; There is a golfer named Fred, whose position is ?p1
      ;; and pants color is ?c1
      (position (of Fred) (is ?p1))
      (pants-color (of Fred) (is ?c1))

      ;; The golfer to immediate right of Fred
      ;; is wearing blue pants.
      (position (of ?n&~Fred)(is ?p&:(eq ?p (+ ?p1 1))))
      (pants-color (of ?n&~Fred)(is blue&~?c1))

      ;; Joe is in position #2
      (position (of Joe) (is ?p2&2&~?p1))
      (pants-color (of Joe) (is ?c2&~?c1))

      ;; Bob is wearing the plaid pants
      (position (of Bob)(is ?p3&~?p1&~?p&~?p2))
      (pants-color (of Bob&~?n)(is plaid&?c3&~?c1&~?c2))

      ;; Tom is not in position 1 or 4
      ;; and is not wearing orange
      (position (of Tom&~?n)(is ?p4&~1&~4&~?p1&~?p2&~?p3))
      (pants-color (of Tom)(is ?c4&~orange&~blue&~?c1&~?c2&~?c3))
      =>
      (printout t Fred " " ?p1 " " ?c1 crlf)
      (printout t Joe " " ?p2 " " ?c2 crlf)
      (printout t Bob " " ?p3 " " ?c3 crlf)
      (printout t Tom " " ?p4 " " ?c4 crlf)
    )

where the rows starting with ;; are just comments.
If you enter the code for the problem into Jess and then run it, you get the answer directly:

    Fred 1 orange
    Joe 2 blue
    Bob 4 plaid
    Tom 3 red

Note that the fact that the golfers are in different positions and wear pants of different colors is not expressed in an explicit rule, but needs to be spread and repeated across many statements. This solution is clearly difficult to maintain and doesn't scale well, as underlined by the last condition stating that the position ?p4 of Tom is ?p4&~1&~4&~?p1&~?p2&~?p3, where ~ means not in the Jess language. In other words it says that Tom's position is not only different from positions 1 and 4, but also different from the positions of all the other golfers (named one by one) defined earlier.

The need to describe a golfer's position as the negation of the positions of all the other golfers implies something even worse: it is not possible to translate each sentence of the problem into a separate rule; they have to be combined in a single big rule. After this huge if part, its then section (the one after the => symbol) prints out a table containing the set of variables ?p1…?p4 and ?c1…?c4 that solves the problem.

The Hammurabi solution

As with the Jess solution, the first thing to do with Hammurabi is to define the domain of the problem.
To do that, since Hammurabi rules are valid Scala statements, it is sufficient to create a plain Scala Person class whose attributes are the name, the position and the pants color of the golfer it represents:

    class Person(n: String) {
      val name = n
      var pos: Int = _
      var color: String = _
    }

Then we can model the fact that all the golfers must have different positions and pants colors by putting them in 2 different Sets:

    var availablePos = (1 to 4).toSet
    var availableColors = Set("blue", "plaid", "red", "orange")

and write two small methods that remove them from the corresponding set once they have been assigned to a specific golfer:

    val assign = new {
      def color(color: String) = new {
        def to(person: Person) = {
          person.color = color
          availableColors = availableColors - color
        }
      }

      def position(pos: Int) = new {
        def to(person: Person) = {
          person.pos = pos
          availablePos = availablePos - pos
        }
      }
    }

Those methods are written in a rather unusual way just to make the DSL used to define the rules even more readable, and to stress the idea that it is possible to write the rules in valid Scala that can easily be understood by a non-technical person. Of course it is easy to go even further by encapsulating other concepts in convenient methods, as has been done above.
Now everything is ready to write the set of rules describing the golfers problem:

    val ruleSet = Set(
      rule ("Unique positions") let {
        val p = any(kindOf[Person])
        when {
          (availablePos.size equals 1) and (p.pos equals 0)
        } then {
          assign position availablePos.head to p
        }
      },
      rule ("Unique colors") let {
        val p = any(kindOf[Person])
        when {
          (availableColors.size equals 1) and (p.color == null)
        } then {
          assign color availableColors.head to p
        }
      },
      rule ("Joe is in position 2") let {
        val p = any(kindOf[Person])
        when {
          p.name equals "Joe"
        } then {
          assign position 2 to p
        }
      },
      rule ("Person to Fred’s immediate right is wearing blue pants") let {
        val p1 = any(kindOf[Person])
        val p2 = any(kindOf[Person])
        when {
          (p1.name equals "Fred") and (p2.pos equals p1.pos + 1)
        } then {
          assign color "blue" to p2
        }
      },
      rule ("Fred isn't in position 4") let {
        val possibleFredPos = availablePos - 4
        val p = any(kindOf[Person])
        when {
          (p.name equals "Fred") and (possibleFredPos.size == 1)
        } then {
          assign position possibleFredPos.head to p
        }
      },
      rule ("Tom isn't in position 1 or 4") let {
        val possibleTomPos = availablePos - 1 - 4
        val p = any(kindOf[Person])
        when {
          (p.name equals "Tom") and (possibleTomPos.size equals 1)
        } then {
          assign position possibleTomPos.head to p
        }
      },
      rule ("Bob is wearing plaid pants") let {
        val p = any(kindOf[Person])
        when {
          p.name equals "Bob"
        } then {
          assign color "plaid" to p
        }
      },
      rule ("Tom isn't wearing orange pants") let {
        val possibleTomColors = availableColors - "orange"
        val p = any(kindOf[Person])
        when {
          (p.name equals "Tom") and (possibleTomColors.size equals 1)
        } then {
          assign color possibleTomColors.head to p
        }
      }
    )

Here the first 2 rules explicitly leverage the uniqueness of positions and pants colors by assigning the last available one to the only person who still doesn't have one. The other rules match the sentences of the problem one by one, as it has been defined.
Now it is possible to make Hammurabi solve the problem by creating the four golfers:

    val tom = new Person("Tom")
    val joe = new Person("Joe")
    val fred = new Person("Fred")
    val bob = new Person("Bob")

adding them to a working memory (the set of objects against which the rule engine will evaluate and fire the rules):

    val workingMemory = WorkingMemory(tom, joe, fred, bob)

and letting the rule engine, initialized with the previously defined set of rules, work on it:

    RuleEngine(ruleSet) execOn workingMemory

Working with immutable data structures

Immutability is probably not something that should be enforced at all costs in Hammurabi, for a very simple reason: most of the execution time is spent by Hammurabi, like all other rule engines, looking for rules that can actually be executed (fired), i.e. the ones for which the when condition is true. During this phase data are only read and never written, so immutability doesn't matter at all, and all the rules that can be fired are put in an agenda. During the subsequent phase the rules in the agenda MUST be fired one by one, since the execution of one of them could make the when condition of another one false. This means that the lack of immutability shouldn't prevent the rule engine from safely running in parallel during the discovery phase.

That said, it is also possible to obtain the same result working with immutable data structures. For example, given an immutable person:

    case class Person(name: String, pos: Int = 0, color: String = null)

it is enough to rewrite the methods that assign the position and pants color to the golfers as follows:

    val assign = new {
      def color(color: String) = new {
        def to(person: Person) = {
          remove(person)
          produce(person.copy(color = color))
          availableColors = availableColors - color
        }
      }

      def position(pos: Int) = new {
        def to(person: Person) = {
          remove(person)
          produce(person.copy(pos = pos))
          availablePos = availablePos - pos
        }
      }
    }

leaving all the rules unchanged.
In this way the old version of the Person is removed from the working memory and a brand new one, with its position or color set according to the fired rule, is produced and then added to the working memory itself. The methods remove and produce can be used respectively to remove objects from the working memory and to produce new objects, which can then be used to evaluate and fire other rules.

Hammurabi internals

In Hammurabi all the rules are evaluated (but not fired) in parallel, basically by assigning each rule to a different Scala actor. Replacing this actor implementation with one based on the upcoming Scala parallel collections is currently under evaluation, but I decided to wait until that technology is stable. As anticipated, evaluating the rules in parallel is safe because it is a read-only process. While the actors discover sets of variables that can fire the rules they are evaluating, they add them to the rule engine's agenda. At the end of this evaluation process the rules are fired sequentially, one by one, after a re-evaluation of their application condition, because the execution of one of them could change the result of that condition for a subsequent one.

All the rules are fired in no particular order unless a different priority has been specified for some of them. Since there is sometimes a need to treat some rules as special cases, they have an optional property called salience that acts as a priority setting for that rule, so that activated rules with the highest salience always fire first, followed by rules of lower salience. By default all rules have salience 0, but you can alter it in the rule definition as follows:

    rule ("Important rule") withSalience 10 let { ... }
    rule ("Negligible rule") withSalience -5 let { ... }

Each actor also records the sets of values on which the rule it is responsible for has already been executed, in order not to fire the same rule again on the same values.
The evaluation phase and the firing phase are then executed again and again until either there is no rule that can be fired, or one of the rules, while firing, invokes either exitWith, which makes the rule engine finish gracefully and return a value representing the result of the whole evaluation process, or failWith, which causes the rule engine to terminate by throwing an exception. Of course if the rule engine stops just because there are no longer rules that can be fired, the result of the evaluation is represented by the whole working set (as in the earlier example) and you can read the values you are interested in from there.

Further implementations and improvements

At the moment the objects against which a given rule will be evaluated can only be taken from the working memory by selecting them by type, as shown in the statement:

    val p = any(kindOf[Person])

Further mechanisms to categorize and select those objects more precisely are under evaluation, although it is already possible to limit them with a Boolean function, as in the following example:

    val p = kindOf[Person] having (_.name == "Joe")

This is also useful from a performance point of view, because it dramatically lowers the number of combinations that the rule engine needs to check before finding a rule that can be fired. For example, the "Person to Fred’s immediate right is wearing blue pants" rule could be rewritten as follows, bringing the number of combinations that have to be tried from 16 to 4:

    rule ("Person to Fred’s immediate right is wearing blue pants") let {
      val p1 = kindOf[Person] having (_.name == "Fred")
      val p2 = any(kindOf[Person])
      when {
        p2.pos equals p1.pos + 1
      } then {
        assign color "blue" to p2
      }
    }

I am also evaluating feeding the working memory directly from a NoSQL database. In other words, with this solution, the data present in the db could represent the working memory itself.
At the moment I am experimenting with MongoDB since it is the one I know best, but if somebody has another good idea or, even better, wants to collaborate on this project, I'd be very glad to hear it. Version 0.1 of the Hammurabi rule engine has just been released and is available here.

April 18, 2011

by Mario Fusco

Eradicating Non-Determinism in Tests

An automated regression suite can play a vital role on a software project, valuable both for reducing defects in production and essential for evolutionary design. In talking with development teams I've often heard about the problem of non-deterministic tests - tests that sometimes pass and sometimes fail. Left uncontrolled, non-deterministic tests can completely destroy the value of an automated regression suite. In this article I outline how to deal with non-deterministic tests. Initially quarantine helps to reduce their damage to other tests, but you still have to fix them soon. Therefore I discuss treatments for the common causes for non-determinism: lack of isolation, asynchronous behavior, remote services, time, and resource leaks. I've enjoyed watching ThoughtWorks tackle many difficult enterprise applications, bringing successful deliveries to many clients who have rarely seen success. Our experiences have been a great demonstration that agile methods, deeply controversial and distrusted when we wrote the manifesto a decade ago, can be used successfully. There are many flavors of agile development out there, but in what we do there is a central role for automated testing. Automated testing was a core approach to Extreme Programming from the beginning, and that philosophy has been the biggest inspiration to our agile work. So we've gained a lot of experience in using automated testing as a core part of software development. Automated testing can look easy when presented in a text book. And indeed the basic ideas are really quite simple. But in the pressure-cooker of a delivery project, trials come up that are often not given much attention in texts. As I know too well, authors have a habit of skimming over many details in order to get a core point across. In my conversations with our delivery teams, one recurring problem that we've run into is tests which have become unreliable, so unreliable that people don't pay much attention to whether they pass or fail. 
A primary cause of this unreliability is that some tests have become non-deterministic. A test is non-deterministic when it passes sometimes and fails sometimes, without any noticeable change in the code, tests, or environment. Such tests fail, then you re-run them and they pass. Test failures for such tests are seemingly random. Non-determinism can plague any kind of test, but it's particularly prone to affect tests with a broad scope, such as acceptance or functional tests. Why non-deterministic tests are a problem Non-deterministic tests have two problems, firstly they are useless, secondly they are a virulent infection that can completely ruin your entire test suite. As a result they need to be dealt with as soon as you can, before your entire deployment pipeline is compromised. I'll start with expanding on their uselessness. The primary benefit of having automated tests is that they provide bug detection mechanism by acting as regression tests[1]. When a regression test goes red, you know you've got an immediate problem, often because a bug has crept into the system without you realizing. Having such a bug detector has huge benefits. Most obviously it means that you can find and fix bugs just after they are introduced. Not just does this give you the warm fuzzies because you kill bugs quickly, it also makes it easier to remove them since you know the bug got in with the last set of changes that are fresh in your mind. As a result you know where to look for the bug, which is more than half the battle in squashing it. The second level of benefit is that as you gain confidence in your bug detector, you gain the courage to make big changes knowing that when you goof, the bug detector will go off and you can fix the mistake quickly. [2] Without this teams are frightened to make the changes code needs in order to be kept clean, which leads to a rotting of the code base and plummeting development speed. 
The trouble with non-deterministic tests is that when they go red, you have no idea whether it's due to a bug, or just part of the non-deterministic behavior. Usually with these tests a non-deterministic failure is relatively common, so you end up shrugging your shoulders when these tests go red. Once you start ignoring a regression test failure, then that test is useless and you might as well throw it away. Indeed you really ought to throw a non-deterministic test away, since if you don't it has an infectious quality. If you have a suite of 100 tests with 10 non-deterministic tests in them, then that suite will often fail. Initially people will look at the failure report and notice that the failures are in non-deterministic tests, but soon they'll lose the discipline to do that. Once that discipline is lost, then a failure in the healthy deterministic tests will get ignored too. At that point you've lost the whole game and might as well get rid of all the tests.

Quarantine

My principal aim in this article is to outline common cases of non-deterministic tests and how to eliminate the non-determinism. But before I get there I offer one piece of essential advice: quarantine your non-deterministic tests. If you have non-deterministic tests keep them in a different test suite to your healthy tests. That way you can continue to pay attention to what's going on with your healthy tests and get good feedback from them.

Place any non-deterministic test in a quarantined area. (But fix quarantined tests quickly.)

Then the question is what to do with the quarantined test suites. They are useless as regression tests, but they do have a future as work items for cleaning up. You should not abandon such tests, since any tests you have in quarantine are not helping you with your regression coverage. A danger here is that tests keep getting thrown into quarantine and forgotten, which means your bug detection system is eroding.
As a result it's worthwhile to have a mechanism that ensures that tests don't stay in quarantine too long. I've come across various ways to do this. One is a simple numeric limit: e.g. only allow 8 tests in quarantine. Once you hit the limit you must spend time to clear all the tests out. This has the advantage of batching up your test-cleaning if that's how you like to do things. Another route is to put a time limit on how long a test may be in quarantine, such as no longer than a week. The general approach with quarantine is to take the quarantined tests out of the main deployment pipeline so that you still get your regular build process. However a good team can be more aggressive. Our Mingle team puts its quarantine suite into the deployment pipeline one stage after its healthy tests. That way it can get the feedback from the healthy tests but is also forced to ensure that it sorts out the quarantined tests quickly. [3] Lack of Isolation In order to get tests to run reliably, you must have clear control over the environment in which they run, so you have a well-known state at the beginning of the test. If one test creates some data in the database and leaves it lying around, it can corrupt the run of another test which may rely on a different database state. Therefore I find it's really important to focus on keeping tests isolated. Properly isolated tests can be run in any sequence. As you get to larger operational scope of functional tests, it gets progressively harder to keep tests isolated. When you are tracking down a non-determinism, lack of isolation is a common and frustrating cause. Keep your tests isolated from each other, so that execution of one test will not affect any others. There are a couple of ways to get isolation - either always rebuild your starting state from scratch, or ensure that each test cleans up properly after itself. In general I prefer the former, as it's often easier - and in particular easier to find the source of a problem. 
If a test fails because it didn't build up the initial state properly, then it's easy to see which test contains the bug. With clean-up, however, one test will contain the bug, but another test will fail - so it's hard to find the real problem. Starting from a blank state is usually easy with unit tests, but can be much harder with functional tests [4] - particularly if you have a lot of data in a database that needs to be there. Rebuilding the database each time can add a lot of time to test runs, so that argues for switching to a clean-up strategy. One trick that's handy when you're using databases, is to conduct your tests inside a transaction, and then to rollback the transaction at the end of the test. That way the transaction manager cleans up for you, reducing the chance of errors[5]. Another approach is to do a single build of a mostly-immutable starting fixture before running a group of tests. Then ensure that the tests don't change that initial state (or if they do, they reverse the changes in tear-down). This tactic is more error-prone than rebuilding the fixture for each test, but it may be worthwhile iff it takes too long to build the fixture each time. Although databases are a common cause for isolation problems, there are plenty of times you can get these in-memory too. In particular be aware with static data and singletons. A good example for this kind of problem is contextual environment, such as the currently logged in user. If you have an explicit tear-down in a test, be wary of exceptions that occur during the tear-down. If this happens the test can pass, but cause isolation failures for subsequent tests. So ensure that if you do get a problem in a tear-down, it makes a loud noise. Some people prefer to put less emphasis on isolation and more on defining clear dependencies to force tests to run in a specified order. I prefer isolation because it gives you more flexibility in running subsets of tests and parallelizing tests. 
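The in-memory case mentioned above, static data such as the currently logged-in user, can be sketched in Java as follows. UserContext and IsolatedRunner are hypothetical names invented for this illustration; the point is only that every test starts from a state rebuilt from scratch, and that clean-up runs even when the test body throws.

```java
import java.util.Optional;

// Hypothetical application code: static contextual state shared by everything
// running in the same JVM, and therefore by every test.
final class UserContext {
    private static String currentUser; // null means nobody is logged in

    static void login(String user) {
        currentUser = user;
    }

    static void reset() {
        currentUser = null;
    }

    static Optional<String> current() {
        return Optional.ofNullable(currentUser);
    }
}

// Hypothetical test helper: wraps a test body so that it cannot see, or leave
// behind, state from another test.
final class IsolatedRunner {
    static void run(Runnable testBody) {
        UserContext.reset();     // rebuild the known starting state from scratch
        try {
            testBody.run();
        } finally {
            UserContext.reset(); // tear-down; an exception thrown here propagates loudly
        }
    }
}
```

Resetting before the test as well as after it follows the preference stated above: if the starting state is wrong, the failure shows up in the test that built it rather than in some later test.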
Asynchronous Behavior

Asynchrony is a boon that allows you to keep software responsive while taking on long-term tasks. Ajax calls allow a browser to stay responsive while going back to the server for more data; asynchronous messages allow a server process to communicate with other systems without being tied to their laggardly latency. But in testing, asynchrony can be a curse. The common mistake here is to throw in a sleep:

    //pseudo-code
    makeAsyncCall;
    sleep(aWhile);
    readResponse;

This can bite you two ways. First off you'll want to set the sleep time long enough that it gives plenty of time to get the response. But that means that you'll spend a lot of time idly waiting for the response, thus slowing down your tests. The second bite is that, however long you sleep, sometimes it won't be enough. There will be some change in environment that will cause you to exceed the sleep - and you'll get a false failure. As a result I strongly urge you to never use bare sleeps like this.

Never use bare sleeps to wait for asynchronous responses: use a callback or polling.

There are basically two tactics you can use for testing an asynchronous response. The first is for the asynchronous service to take a callback which it can call when done. This is the best since it means you'll never have to wait any longer than you need to [6]. The biggest problem with this is that the environment needs to be able to do this and then the service provider needs to do it. This is one of the advantages of having the development team integrated with testing - if they can provide a callback then they will.

The second option is to poll on the answer. This is more than just looking once, but looking regularly, something like this:

    //pseudo-code
    makeAsyncCall
    startTime = Time.now;
    while(! responseReceived) {
      if (Time.now - startTime > waitLimit)
        throw new TestTimeoutException;
      sleep (pollingInterval);
    }
    readResponse

The point of this approach is that you can set the pollingInterval to a pretty small value, and know that that's the maximum amount of dead time you'll lose to waiting for a response. This means you can set the waitLimit very high, which minimizes the chance of hitting it unless something serious has gone wrong. [7]

Make sure you use a clear exception class that indicates this is a test timeout that's failing. This will help make it clear what's gone wrong should it happen, and perhaps allow a more sophisticated test harness to take account of this information in its display.

The time values, in particular the waitLimit, should never be literal values. Make sure they are always values that can be easily set in bulk, either by using constants or set through the runtime environment. That way if you need to tweak them (and you will) you can tweak them all quickly.

All this advice is handy for async calls where you expect a response from the provider, but how about those where there is no response? These are calls where we invoke a command on something and expect it to happen without any acknowledgment. This is the trickiest case since you can test for your expected response, but there's nothing to do to detect a failure other than timing out. If the provider is something you're building you can handle this by ensuring the provider implements some way of indicating that it's done - essentially some form of callback. Even if only the testing code uses it, it's worth it - although often you'll find this kind of functionality is valuable for other purposes too[8]. If the provider is someone else's work, you can try persuasion, but otherwise may be stuck. Although this is also a case when using Test Doubles for remote services is worthwhile (which I'll discuss more in the next section).
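The polling pseudo-code above could be written out in Java roughly as follows. This is a sketch under a few assumptions: the class name Polling and the system property names test.waitLimitMillis and test.pollingIntervalMillis are invented here, as are the default values, and the caller supplies the "has the response arrived?" check as a function.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Dedicated exception type, so a failure is recognizable as a test timeout. */
final class TestTimeoutException extends RuntimeException {
    TestTimeoutException(String message) {
        super(message);
    }
}

final class Polling {
    // Time values are defined once and can be overridden in bulk from the
    // runtime environment. Property names and defaults are assumptions.
    static final Duration WAIT_LIMIT =
            Duration.ofMillis(Long.getLong("test.waitLimitMillis", 30_000L));
    static final Duration POLLING_INTERVAL =
            Duration.ofMillis(Long.getLong("test.pollingIntervalMillis", 50L));

    /** Blocks until responseReceived reports true, or fails once waitLimit has passed. */
    static void awaitResponse(BooleanSupplier responseReceived,
                              Duration waitLimit,
                              Duration pollingInterval) {
        long start = System.nanoTime(); // monotonic, unaffected by wall-clock changes
        while (!responseReceived.getAsBoolean()) {
            if (System.nanoTime() - start > waitLimit.toNanos()) {
                throw new TestTimeoutException("No response within " + waitLimit);
            }
            try {
                Thread.sleep(pollingInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while polling", e);
            }
        }
    }
}
```

A test would make its asynchronous call, then call Polling.awaitResponse with a check for the response and the two shared constants, and only then read the response. The dead time lost per test is at most one polling interval, so the wait limit can be generous.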
If you have a general failure in something asynchronous, such that it's not responding at all, then you'll always be waiting for timeouts and your test suite will take a long time to fail. To combat this it's a good idea to use a smoke test to check that the asynchronous service is responding at all and stop the test run right away if it isn't. Gerard Meszaros's book, xUnit Test Patterns, contains lots of good patterns for constructing tests. You can also often side-step the asynchrony completely. Gerard Meszaros's Humble Object pattern says that whenever you have some logic that's in a hard-to-test environment, you should isolate the logic you need to test from that environment. In this case it means put most of the logic you need to test in a place where you can test it synchronously. The asynchronous behavior should be as minimal (humble) as possible, that way you don't need that much testing of it. Remote Services Sometimes I'm asked if ThoughtWorks does any integration work, which I find somewhat amusing since there's hardly any project we do that doesn't involve a fair bit of integration. By their nature, enterprise applications involve a great deal of combining data from different systems. These systems are maintained by other teams operating to their own schedules, teams that often use a very different software philosophy to our heavily test-driven agile approach. Testing with such remote systems brings a number of problems, and non-determinism is high on the list. Often remote systems don't have test system we can call, which means hitting a live system. If there is a test system, it may not be stable enough to provide deterministic responses. In this situation it's vital to ensure determinism, so it's time to reach for a Test Double - a component that looks like the remote service, but is really just a pretend version that mimics the remote system's behavior. 
The double needs to be set up so that it provides the right kind of response in interaction with our system, but in a way we control. In this manner we can ensure determinism.

Using a double has a downside, in particular when we are testing across a broad scope. How can we be sure that the double behaves in the same way that the remote system does? We can tackle this again using tests, a form of test that I call Integration Contract Tests. These run the same interaction with the remote system and the double, and check that the two match. In this case 'match' may not mean coming up with the same result (due to the non-determinisms), but results that share the same essential structure. Integration Contract Tests need to be run frequently, but not as part of our system's deployment pipeline. Periodic running based on the rate of change of the remote system is usually best.

For writing these kinds of test doubles, I'm a big fan of Self Initializing Fakes - since these are very simple to manage.

Some people are firmly against using Test Doubles in functional tests, believing that you must test with a real connection in order to ensure end-to-end behavior. While I sympathize with their argument, automated tests are useless if they are non-deterministic. So any advantage you gain by talking to the real system is overwhelmed by the need to stamp out non-determinism[9].

Time

Few things are more non-deterministic than a call to the system clock. Each time you call it, you get a new result, and any tests that depend on it can thus change. Ask for all the todos due in the next hour, and you regularly get a different answer[10].

The most important thing here is to ensure that you always wrap the system clock with routines that can be replaced with a seeded value for testing. A clock stub can be set to a particular time and frozen at that time, allowing your tests to have complete control over its movements.
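One way to wrap the clock in Java is to pass a java.time.Clock into any code that needs the current time, instead of letting it call the system clock directly. The sketch below applies this to the "todos due in the next hour" example; Todo and TodoList are hypothetical types invented for the illustration.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical domain type.
final class Todo {
    final String title;
    final Instant due;

    Todo(String title, Instant due) {
        this.title = title;
        this.due = due;
    }
}

// Hypothetical service: it never asks the system for the time itself, only
// the Clock it was constructed with.
final class TodoList {
    private final Clock clock;
    private final List<Todo> todos;

    TodoList(Clock clock, List<Todo> todos) {
        this.clock = clock;
        this.todos = todos;
    }

    /** Titles of the todos due from now (inclusive) until now + window (exclusive). */
    List<String> dueWithin(Duration window) {
        Instant now = clock.instant();
        Instant limit = now.plus(window);
        return todos.stream()
                .filter(t -> !t.due.isBefore(now) && t.due.isBefore(limit))
                .map(t -> t.title)
                .collect(Collectors.toList());
    }
}
```

Production code would construct TodoList with Clock.systemUTC(), while a test would pass Clock.fixed(seed, ZoneOffset.UTC) for some chosen seed instant and define each todo's due time relative to that seed, so every run sees the same "now".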
That way you can synchronize your test data to the values in the seeded clock.[11][12]

Always wrap the system clock, so it can be easily substituted for testing.

One thing to watch with this is that eventually your test data might start having problems because it's too old, and you get conflicts with other time-based factors in your application. In this case you can move the data and your clock seeds to new values. When you do this, ensure that this is the only thing you do. That way you can be sure that any tests that fail are due to time-movement in the test data.

Another area where time can be a problem is when you rely on other behaviors from the clock. I once saw a system that generated random keys based on clock values. This system started failing when it was moved to a faster machine that could allocate multiple ids within a single clock tick.[13]

I've heard of so many problems due to direct calls to the system clock that I'd argue for finding a way to use code analysis to detect any direct calls to the system clock and failing the build right there. Even a simple regex check might save you a frustrating debugging session after a call at an ungodly hour.

Resource Leaks

If your application has some kind of resource leak, this will lead to random tests failing, since the failure lands on whichever test happens to push the leak over a limit. This case is awkward because any test can fail intermittently due to this problem. If it isn't a case of one test being non-deterministic then resource leaks are a good candidate to investigate.

By resource leak, I mean any resource that the application has to manage by acquiring and releasing. In non-memory-managed environments, the obvious example is memory. Memory management did much to remove this problem, but other resources still need to be managed, such as database connections. Usually the best way to handle these kinds of resources is through a Resource Pool.
If you do this then a good tactic is to configure the pool to a size of 1 and make it throw an exception should it get a request for a resource when it has none left to give. That way the first test to request a resource after the leak will fail - which makes it a lot easier to find the problem test. This idea of limiting resource pool sizes, is about increasing constraints to make errors more likely to crop up in tests. This is good because we want errors to show in tests so we can fix them before they manifest themselves in production. This principle can be used in other ways too. One story I heard was of a system which generated randomly named temporary files, didn't clean them up properly, and crashed on a collision. This kind of bug is very hard to find, but one way to manifest it is to stub the randomizer for testing so it always returns the same value. That way you can surface the problem more quickly.
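The pool-of-one tactic described above can be sketched as a small Java class. StrictPool is a hypothetical name, and a real connection pool would have more to it; the relevant design choice is only that an exhausted pool throws immediately instead of blocking or growing.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Hypothetical test-time pool: fixed size, and it fails fast when empty.
final class StrictPool<T> {
    private final Deque<T> idle = new ArrayDeque<>();

    StrictPool(int size, Supplier<T> factory) {
        for (int i = 0; i < size; i++) {
            idle.push(factory.get());
        }
    }

    /** Hands out a resource, or throws at once if every resource is still checked out. */
    synchronized T acquire() {
        if (idle.isEmpty()) {
            throw new IllegalStateException(
                    "Pool exhausted: a resource was probably acquired and never released");
        }
        return idle.pop();
    }

    /** Returns a resource to the pool. */
    synchronized void release(T resource) {
        idle.push(resource);
    }
}
```

Configured with a size of 1 for test runs, the first acquire after a leak fails, which points at the leak far sooner than a production-sized pool would.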

April 14, 2011

by Martin Fowler

Solr + Hadoop = Big Data Love

Bixo Labs shows how to use Solr as a NoSQL solution for big data Many people use the Hadoop open source project to process large data sets because it’s a great solution for scalable, reliable data processing workflows. Hadoop is by far the most popular system for handling big data, with companies using massive clusters to store and process petabytes of data on thousands of servers. Since it emerged from the Nutch open source web crawler project in 2006, Hadoop has grown in every way imaginable – users, developers, associated projects (aka the “Hadoop ecosystem”). Starting at roughly the same time, the Solr open source project has become the most widely used search solution on planet Earth. Solr wraps the API-level indexing and search functionality of Lucene with a RESTful API, GUI, and lots of useful administrative and data integration functionality. The interesting thing about combining these two open source projects is that you can use Hadoop to crunch the data, and then serve it up in Solr. And we’re not talking about just free-text search; Solr can be used as a key-value store (i.e. a NoSQL database) via its support for range queries. Even on a single server, Solr can easily handle many millions of records (“documents” in Lucene lingo). Even better, Solr now supports sharding and replication via the new, cutting-edge SolrCloud functionality. Background I started using Hadoop & Solr about five years ago, as key pieces of the Krugle code search startup I co-founded in 2005. Back then, Hadoop was still part of the Nutch web crawler we used to extract information about open source projects. And Solr was fresh out of the oven, having just been released as open source by CNET. At Bixo Labs we use Hadoop, Solr, Cascading, Mahout, and many other open source technologies to create custom data processing workflows. The web is a common source of our input data, which we crawl using the Bixo open source project. 
The Problem

During a web crawl, the state of the crawl is contained in something commonly called a “crawl DB”. For broad crawls, this has to be something that works with billions of records, since you need one entry for each known URL. Each “record” has the URL as the key, and contains important state information such as the time and result of the last request. For Hadoop-based crawlers such as Nutch and Bixo, the crawl DB is commonly kept in a set of flat files, where each file is a Hadoop “SequenceFile”. These are just a packed array of serialized key/value objects.

Sometimes we need to poke at this data, and here’s where the simple flat-file structure creates a problem. There’s no easy way to run queries against the data, but we can’t store it in a traditional database since billions of records + RDBMS == pain and suffering. Here is where scalable NoSQL solutions shine. For example, the Nutch project is currently re-factoring this crawl DB layer to allow plugging in HBase. Other options include Cassandra, MongoDB, CouchDB, etc. But for simple analytics and exploration on smaller datasets, a Solr-based solution works and is easier to configure. Plus you get useful and surprisingly fun functionality like facets, geospatial queries, range queries, free-form text search, and lots of other goodies for free.

Architecture

So what exactly would such a Hadoop + Solr system look like? As mentioned previously, in this example our input data comes from a Bixo web crawler’s CrawlDB, with one entry for each known URL. But the input data could just as easily be log files, or records from a traditional RDBMS, or the output of another data processing workflow. The key point is that we’re going to take a bunch of input data, (optionally) munge it into a more useful format, and then generate a Lucene index that we access via Solr.
Hadoop

For the uninitiated, Hadoop implements both a distributed file system (aka “HDFS”) and an execution layer that supports the map-reduce programming model. Typically data is loaded and transformed during the map phase, and then combined/saved during the reduce phase. In our example, the map phase reads in Hadoop compressed SequenceFiles that contain the state of our web crawl, and our reduce phase writes out Lucene indexes.

The focus of this article isn’t on how to write Hadoop map-reduce jobs, but I did want to show you the code that implements the guts of the job. Note that it’s not typical Hadoop key/value manipulation code, which is painful to write, debug, and maintain. Instead we use Cascading, which is an open source workflow planning and data processing API that creates Hadoop jobs from shorter, more representative code. The snippet below reads SequenceFiles from HDFS, and pipes those records into a sink (output) that stores them using a LuceneScheme, which in turn saves records as Lucene documents in an index.

    Tap source = new Hfs(new SequenceFile(CRAWLDB_FIELDS), inputDir);
    Pipe urlPipe = new Pipe("crawldb urls");
    urlPipe = new Each(urlPipe, new ExtractDomain());
    Tap sink = new Hfs(new LuceneScheme(SOLR_FIELDS, STORE_SETTINGS, INDEX_SETTINGS, StandardAnalyzer.class, MAX_FIELD_LENGTH), outputDir, true);
    FlowConnector fc = new FlowConnector();
    fc.connect(source, sink, urlPipe).complete();

We defined CRAWLDB_FIELDS and SOLR_FIELDS to be the set of input and output data elements, using names like “url” and “status”. We take advantage of the Lucene Scheme that we’ve created for Cascading, which lets us easily map from Cascading’s view of the world (records with fields) to Lucene’s index (documents with fields). We don’t have a Cascading Scheme that directly supports Solr (wouldn’t that be handy?), but we can make do for now since we can do simple analysis for this example. We indexed all of the fields so that we can perform queries against them.
Only the status message contains normal English text, so that’s the only one we have to analyze (i.e., break the text up into terms using spaces and other token delimiters). In addition, the ExtractDomain operation pulls the domain from the URL field and builds a new Solr field containing just the domain. This will allow us to do queries against the domain of the URL as well as the complete URL. We could also have chosen to apply a custom analyzer to the URL to break it into several pieces (i.e., protocol, domain, port, path, query parameters) that could have been queried individually. Running the Hadoop Job For simplicity and pay-as-you-go, it’s hard to beat Amazon’s EC2 and Elastic Mapreduce offerings for running Hadoop jobs. You can easily spin up a cluster of 50 servers, run your job, save the results, and shut it down – all without needing to buy hardware or pay for IT support. There are many ways to create and configure a Hadoop cluster; for us, we’re very familiar with the (modified) EC2 Hadoop scripts that you can find in the Bixo distribution. Step-by-step instructions are available at http://openbixo.org/documentation/running-bixo-in-ec2/ The code for this article is available via GitHub, at http://github.com/bixolabs/hadoop2solr. The README displayed on that page contains step-by-step instructions for building and running the job. After the job is done, we’ll copy the resulting index out of the Hadoop distributed file system (HDFS) and onto the Hadoop cluster’s master server, then kill off the one slave we used. The Hadoop master is now ready to be configured as our Solr server. Solr On the Solr side of things, we need to create a schema that matches the index we’re generating. The key section of our schema.xml file is where we define the fields. These fields have a one-to-one correspondence with the SOLR_FIELDS we defined in our Hadoop workflow. 
They also need to use the same Lucene settings as the static STORE_SETTINGS and INDEX_SETTINGS we defined in IndexWorkflow.java. Once we have this defined, all that’s left is to set up a server that we can use. To keep it simple, we’ll use the single EC2 instance in Amazon’s cloud (m1.large) that we used as our master for the Hadoop job, and run the simple Solr search server that relies on embedded Jetty to provide the webapp container. Similar to the Hadoop job, step-by-step instructions are in the README for the hadoop2solr project on GitHub. But in a nutshell, we’ll copy and unzip a Solr 1.4.1 setup on the EC2 server, do the same for our custom Solr configuration, create a symlink to the index, and then start it running with:

    % cd solr
    % java -Dsolr.solr.home=../solr-conf -Dsolr.data.dir=../solr-data -jar start.jar

Giving it a Try

Now comes the interesting part. Since we opened up the default Jetty port used by Solr (8983) on this EC2 instance, we can directly access Solr’s handy admin console by pointing our browser at http://:8983/solr/admin

From here we can run queries against Solr. We can also use curl to talk to the server via HTTP requests:

    curl http://:8983/solr/select/?q=-status%3AFETCHED+and+-status%3AUNFETCHED

The response is XML by default. For the above request, we found 2,546 matches in 94ms.

Now here’s what I find amazing. For an index of 82 million documents, running on a fairly wimpy box (EC2 m1.large = 2 virtual cores), the typical response time for a simple query like “status:FETCHED” is only 400 milliseconds, to find 9M documents. Even a complex query such as (status not FETCHED and not UNFETCHED) only takes six seconds.

Scaling

Obviously we could use beefier boxes. If we switched to something like m1.xlarge (15GB of memory, 4 virtual cores) then it’s likely we could handle upwards of 200M “records” in our Solr index and still get reasonable response times.
If we wanted to scale beyond a single box, there are a number of solutions. Even out of the box Solr supports sharding, where your HTTP request can specify multiple servers to use in parallel. More recently, the Solr trunk has support for SolrCloud. This uses the ZooKeeper open source project to simplify coordination of multiple Solr servers. Finally, the Katta open source project supports Lucene-level distributed search, with many of the features needed for production quality distributed search that have not yet been added to SolrCloud.

Summary

The combination of Hadoop and Solr makes it easy to crunch lots of data and then quickly serve up the results via a fast, flexible search & query API. Because Solr supports query-style requests, it’s suitable as a NoSQL replacement for traditional databases in many situations, especially when the size of the data exceeds what is reasonable with a typical RDBMS.

Solr has some limitations that you should be aware of, specifically:

· Updating the index works best as a batch operation. Individual records can be updated, but each commit (index update) generates a new Lucene segment, which will impact performance.
· Support for replication, fail-over, and other attributes that you’d want in a production-grade solution isn’t yet there in SolrCloud. If this matters to you, consider Katta instead.
· Many SQL queries can’t be easily mapped to Solr queries.

The code for this article is available via GitHub, at http://github.com/bixolabs/hadoop2solr. The README displayed on that page contains additional technical details.
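As a small illustration of the HTTP interface used in the “Giving it a Try” and “Scaling” sections, the select URL from the curl example can be assembled in Java with the standard URLEncoder. This is only a sketch: SolrSelectUrl is a hypothetical helper, the host names are placeholders (the article's URLs omit the host), and the optional shards argument relies on Solr's shards request parameter for distributed search.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that only builds the request URL; it does not contact a server.
final class SolrSelectUrl {
    /**
     * @param host   placeholder for the Solr server's host name
     * @param query  raw Solr query string, e.g. "status:FETCHED"
     * @param shards optional shard locations such as "host1:8983/solr"
     */
    static String build(String host, String query, String... shards) {
        StringBuilder url = new StringBuilder("http://")
                .append(host)
                .append(":8983/solr/select/?q=")
                .append(URLEncoder.encode(query, StandardCharsets.UTF_8));
        if (shards.length > 0) {
            // Comma-separated list of the servers to query in parallel.
            url.append("&shards=")
               .append(URLEncoder.encode(String.join(",", shards), StandardCharsets.UTF_8));
        }
        return url.toString();
    }
}
```

Called with the query -status:FETCHED and -status:UNFETCHED and no shards, this produces the same encoded query string as the curl command shown earlier, with the placeholder host filled in.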

April 4, 2011

by Ken Krugler

Java Access to SQL Azure via the JDBC Driver for SQL Server

I’ve written a couple of posts (here and here) about Java and the JDBC Driver for SQL Server with the promise of eventually writing about how to get a Java application running on the Windows Azure platform. In this post, I’ll deliver on that promise. Specifically, I’ll show you two things: 1) how to connect to a SQL Azure database from a Java application running locally, and 2) how to connect to a SQL Azure database from an application running in Windows Azure. You should consider these as two ordered steps in moving an application from running locally against SQL Server to running in Windows Azure against SQL Azure. In both steps, connection to SQL Azure relies on the JDBC Driver for SQL Server and SQL Azure.

The instructions below assume that you already have a Windows Azure subscription. If you don’t already have one, you can create one here: http://www.microsoft.com/windowsazure/offers/. (You’ll need a Windows Live ID to sign up.) I chose the Free Trial Introductory Special, which allows me to get started for free as long as I keep my usage limited. (This is a limited offer. For complete pricing details, see http://www.microsoft.com/windowsazure/pricing/.) After you purchase your subscription, you will have to activate it before you can begin using it (activation instructions will be provided in an email after signing up).

Connecting to SQL Azure from an application running locally

I’m going to assume you already have an application running locally and that it uses the JDBC Driver for SQL Server. If that isn’t the case, then you can start from scratch by following the steps in this post: Getting Started with the SQL Server JDBC Driver. Once you have an application running locally, running that application with a SQL Azure back-end requires two steps:

1. Migrate your database to SQL Azure.
This only takes a couple of minutes (depending on the size of your database) with the SQL Azure Migration Wizard. Follow the steps in the Creating a SQL Azure Server and Creating a SQL Azure Database sections of this post.

2. Change the database connection string in your application.
Once you have moved your local database to SQL Azure, you only have to change the connection string in your application to use SQL Azure as your data store. In my case (using the Northwind database), this meant changing this…

    String connectionUrl = "jdbc:sqlserver://serverName\\sqlexpress;"
        + "database=Northwind;"
        + "user=UserName;"
        + "password=Password";

…to this…

    String connectionUrl = "jdbc:sqlserver://xxxxxxxxxx.database.windows.net;"
        + "database=Northwind;"
        + "user=UserName@xxxxxxxxxx;"
        + "password=Password";

(where xxxxxxxxxx is your SQL Azure server ID).

Connecting to SQL Azure from an application running in Windows Azure

The heading for this section might be a bit misleading. Once you have a locally running application that is using SQL Azure, all you have to do is move your application to Windows Azure. The connecting part is easy (see above), but moving your Java application to Windows Azure takes a bit more work. Fortunately, Ben Lobaugh has written a great post that shows how to use the Windows Azure Starter Kit for Java to get a Java application (a JSP application, actually) running in Windows Azure: Deploying a Java application to Windows Azure with Command-Line Ant. (If you are using Eclipse, see Ben’s related post: Deploying a Java application to Windows Azure with Eclipse.) I won’t repeat his work here, but I will call out the steps I took in modifying his instructions to deploy a simple JSP page that connects to SQL Azure.

1. Add the JDBC Driver for SQL Server to the Java archive.
One step in Ben’s tutorial (see the Select the Java Runtime Environment section) requires that you create a .zip file from your local Java installation and add it to your Java/Azure application. Most likely, your local Java installation references the JDBC driver by setting the classpath environment variable. When you create a .zip file from your Java installation, the JDBC driver will not be included and the classpath variable will not be set in the Azure environment. I found the easiest way around this was to simply add the sqljdbc4.jar file (probably located in C:\Program Files\Microsoft SQL Server JDBC Driver\sqljdbc_3.0\enu) to the \lib\ext directory of my local Java installation before creating the .zip file.

Note: You can put the JDBC driver in a separate directory, include it when you create the .zip file, and set the classpath environment variable in the startup.bat script. But I found the above approach to be easier.

2. Modify the JSP page.
Instead of the code Ben suggests for the HelloWorld.jsp file (see the Prepare your Java Application section), use code from your locally running application. In my case, I just used the code from this post after changing the connection string and making a couple of minor JSP-specific changes.

That’s it! To summarize the steps:

1. Migrate your database to SQL Azure with the SQL Azure Migration Wizard.
2. Change the database connection in your locally running application.
3. Use the Windows Azure Starter Kit for Java to move your application to Windows Azure. (You’ll need to follow instructions in this post and instructions above.)

Thanks.
-Brian
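For reference, the connection-string change described in this post can be captured in a small helper. This is only a sketch: the class and method names are made up for illustration, and the server ID, user name and password are the same placeholders used in the post.

```java
// Hypothetical helper (not part of the JDBC driver): assembles the SQL Azure
// connection URL in the format shown in the post.
final class SqlAzureUrl {

    // serverId is the SQL Azure server ID (the "xxxxxxxxxx" placeholder in the post).
    static String build(String serverId, String database, String user, String password) {
        return "jdbc:sqlserver://" + serverId + ".database.windows.net;"
                + "database=" + database + ";"
                // SQL Azure expects the login in the form user@serverId.
                + "user=" + user + "@" + serverId + ";"
                + "password=" + password;
    }

    public static void main(String[] args) {
        // Placeholder values only; substitute your own server ID and credentials.
        System.out.println(build("xxxxxxxxxx", "Northwind", "UserName", "Password"));
    }
}
```

With the driver on the classpath, the returned string would be passed to `java.sql.DriverManager.getConnection(url)` just like the local connection string.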

March 30, 2011

by Brian Swan

· 18,917 Views

IBatis (MyBatis): Working with Dynamic Queries (SQL)

This tutorial will walk you through how to set up iBatis (MyBatis) in a simple Java project and will present how to work with dynamic queries (SQL).

Pre-Requisites

For this tutorial I am using:

· IDE: Eclipse (you can use your favorite one)
· Database: MySQL
· Libs/jars: MyBatis, MySQL Connector and JUnit (for testing)

This is how your project should look:

Sample Database

Please run the script in your database before getting started with the project implementation. You will find the script (with dummy data) inside the sql folder.

1 – Article POJO

I represented the POJO we are going to use in this tutorial with a UML diagram, but you can download the complete source code at the end of this article.

The goal of this tutorial is to demonstrate how to retrieve the article information from the database using dynamic SQL to filter the data.

2 – Article Mapper – XML

One of the most powerful features of MyBatis has always been its dynamic SQL capabilities. If you have any experience with JDBC or any similar framework, you understand how painful it is to conditionally concatenate strings of SQL together, making sure not to forget spaces or to omit a comma at the end of a list of columns. Dynamic SQL can be downright painful to deal with.

While working with dynamic SQL will never be a party, MyBatis certainly improves the situation with a powerful dynamic SQL language that can be used within any mapped SQL statement.

The dynamic SQL elements should be familiar to anyone who has used JSTL or any similar XML-based text processors. In previous versions of MyBatis, there were a lot of elements to know and understand. MyBatis 3 greatly improves upon this, and now there are fewer than half of those elements to work with. MyBatis employs powerful OGNL-based expressions to eliminate most of the other elements:

· if
· choose (when, otherwise)
· trim (where, set)
· foreach

Let’s explain each one with examples.
1 – First scenario: we want to retrieve all the articles from the database with an optional filter: title. In other words, if the user specifies an article title, we are going to retrieve the articles that match the title; otherwise we are going to retrieve all the articles from the database. So we are going to implement a condition (if):

    select id, title, author
    from article
    where id_status = 1
    <if test="title != null">
        and title like #{title}
    </if>

2 – Second scenario: now we have two optional filters: article title and author. The user can specify both, none or only one filter. So we are going to implement two conditions:

    select id, title, author
    from article
    where id_status = 1
    <if test="title != null">
        and title like #{title}
    </if>
    <if test="author != null">
        and author like #{author}
    </if>

3 – Third scenario: now we want to give the user only one option: the user will have to specify only one of the following filters: title, author or retrieve all the articles from the iBatis category. So we are going to use a choose element:

    select id, title, author
    from article
    where id_status = 1
    <choose>
        <when test="title != null">
            and title like #{title}
        </when>
        <when test="author != null">
            and author like #{author}
        </when>
        <otherwise>
            and id_category = 3
        </otherwise>
    </choose>

4 – Fourth scenario: take a look at all three statements above. They all have a condition in common: where id_status = 1. It means we are already filtering the active articles. Let’s remove this condition to make it more interesting.

    select id, title, author
    from article
    where
    <if test="title != null">
        title like #{title}
    </if>
    <if test="author != null">
        and author like #{author}
    </if>

What if both title and author are null? We are going to have the following statement:

    select id, title, author
    from article
    where

And what if only the author is not null? We are going to have the following statement:

    select id, title, author
    from article
    where
    and author like #{author}

And both fail! How to fix it?

5 – Fifth scenario: we want to retrieve all the articles with two optional filters: title and author.
To avoid the problem from the fourth scenario, we are going to use a where element:

    select id, title, author
    from article
    <where>
        <if test="title != null">
            title like #{title}
        </if>
        <if test="author != null">
            and author like #{author}
        </if>
    </where>

MyBatis has a simple answer that will likely work in 90% of the cases. And in cases where it doesn’t, you can customize it so that it does.

The where element knows to only insert “where” if there is any content returned by the containing tags. Furthermore, if that content begins with “and” or “or”, it knows to strip it off.

If the where element does not behave exactly as you like, you can customize it by defining your own trim element. For example, the trim equivalent to the where element is:

    select id, title, author
    from article
    <trim prefix="where" prefixOverrides="and |or ">
        <if test="title != null">
            title like #{title}
        </if>
        <if test="author != null">
            and author like #{author}
        </if>
    </trim>

The prefixOverrides attribute takes a pipe-delimited list of text to override, where whitespace is relevant. The result is the removal of anything specified in the prefixOverrides attribute, and the insertion of anything in the prefix attribute.

You can also use the trim element with set.

6 – Sixth scenario: the user will choose all the categories an article can belong to. So in this case, we have a list (a collection), we have to iterate over this collection, and we are going to use a foreach element:

    select id, title, author
    from article
    where id_category in
    <foreach item="category" collection="list" open="(" separator="," close=")">
        #{category}
    </foreach>

The foreach element is very powerful, and allows you to specify a collection and declare item and index variables that can be used inside the body of the element. It also allows you to specify opening and closing strings, and add a separator to place in between iterations. The element is smart in that it won’t accidentally append extra separators.

Note: you can pass a List instance or an array to MyBatis as a parameter object. When you do, MyBatis will automatically wrap it in a Map, and key it by name. List instances will be keyed to the name “list” and array instances will be keyed to the name “array”.
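To make the behavior of the where and foreach elements concrete, here is a plain-Java sketch of the same rules: emit “where” only when some condition is present, strip a leading “and”/“or”, and join list items without a trailing separator. This is an illustration of the idea only, not how MyBatis is implemented; the class and method names are invented for the example, and JDBC-style `?` placeholders stand in for `#{...}`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: hand-rolled equivalents of the <where> and <foreach> rules.
final class DynamicSqlSketch {

    // Like <where>: returns "" when there is no content; otherwise prefixes
    // " where " and strips one leading "and " / "or ".
    static String where(List<String> fragments) {
        String body = String.join(" ", fragments).trim();
        if (body.isEmpty()) {
            return "";
        }
        String lower = body.toLowerCase(Locale.ROOT);
        if (lower.startsWith("and ")) {
            body = body.substring(4);
        } else if (lower.startsWith("or ")) {
            body = body.substring(3);
        }
        return " where " + body;
    }

    // Like <foreach open="(" separator="," close=")">: one placeholder per item,
    // with separators only between items.
    static String inClause(int itemCount) {
        StringBuilder sql = new StringBuilder("(");
        for (int i = 0; i < itemCount; i++) {
            if (i > 0) {
                sql.append(",");
            }
            sql.append("?");
        }
        return sql.append(")").toString();
    }

    // Fifth scenario: two optional filters, either of which may be null.
    static String selectArticles(String title, String author) {
        List<String> fragments = new ArrayList<>();
        if (title != null) {
            fragments.add("and title like ?");
        }
        if (author != null) {
            fragments.add("and author like ?");
        }
        return "select id, title, author from article" + where(fragments);
    }

    public static void main(String[] args) {
        System.out.println(selectArticles(null, null));
        System.out.println(selectArticles(null, "Loiane%"));
        System.out.println(selectArticles("%mybatis%", "Loiane%"));
        System.out.println("select id, title, author from article where id_category in "
                + inClause(3));
    }
}
```

Writing this by hand for every statement is exactly the string concatenation chore that the dynamic SQL elements are meant to remove.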
The complete article.xml file contains the following statements:

    select id, title, author
    from article
    where id_status = 1
    <if test="title != null">
        and title like #{title}
    </if>

    select id, title, author
    from article
    where id_status = 1
    <if test="title != null">
        and title like #{title}
    </if>
    <if test="author != null">
        and author like #{author}
    </if>

    select id, title, author
    from article
    where id_status = 1
    <choose>
        <when test="title != null">
            and title like #{title}
        </when>
        <when test="author != null">
            and author like #{author}
        </when>
        <otherwise>
            and id_category = 3
        </otherwise>
    </choose>

    select id, title, author
    from article
    <where>
        <if test="title != null">
            title like #{title}
        </if>
        <if test="author != null">
            and author like #{author}
        </if>
    </where>

    select id, title, author
    from article
    <trim prefix="where" prefixOverrides="and |or ">
        <if test="title != null">
            title like #{title}
        </if>
        <if test="author != null">
            and author like #{author}
        </if>
    </trim>

    select id, title, author
    from article
    where id_category in
    <foreach item="category" collection="list" open="(" separator="," close=")">
        #{category}
    </foreach>

Download

If you want to download the complete sample project, you can get it from my GitHub account: https://github.com/loiane/ibatis-dynamic-sql

There are more articles about iBatis to come. Stay tuned!

Happy coding!

From http://loianegroner.com/2011/03/ibatis-mybatis-working-with-dynamic-queries-sql/

March 23, 2011

by Loiane Groner

· 72,614 Views

New Java 7 Feature: String in Switch support

One of the new features added in Java 7 is the capability to switch on a String.

With Java 6 or earlier:

    String color = "red";

    if (color.equals("red")) {
        System.out.println("Color is Red");
    } else if (color.equals("green")) {
        System.out.println("Color is Green");
    } else {
        System.out.println("Color not found");
    }

With Java 7:

    String color = "red";

    switch (color) {
    case "red":
        System.out.println("Color is Red");
        break;
    case "green":
        System.out.println("Color is Green");
        break;
    default:
        System.out.println("Color not found");
    }

Conclusion

The switch statement, when used with a String, compares the given expression to each case label as if by the equals() method. It is therefore case-sensitive, and it will throw a NullPointerException if the expression is null. It is a small but useful feature: it helps us write more readable code, and the compiler will likely generate more efficient bytecode than it would for the equivalent if-then-else chain.

From http://www.vineetmanohar.com/2011/03/new-java-7-feature-string-in-switch-support/
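The two caveats from the conclusion, case sensitivity and the NullPointerException on a null expression, can be checked with a short sketch like the one below. The class and method names are made up for this illustration.

```java
// Small demonstration of String-switch semantics (Java 7 or later).
final class StringSwitchDemo {

    static String describe(String color) {
        switch (color) {            // throws NullPointerException if color is null
            case "red":
                return "Color is Red";
            case "green":
                return "Color is Green";
            default:
                return "Color not found";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("red"));   // matches the "red" label
        System.out.println(describe("RED"));   // no match: comparison is case-sensitive
        try {
            describe(null);
        } catch (NullPointerException e) {
            System.out.println("null expression: NullPointerException");
        }
    }
}
```

If null is a legitimate input, check for it before the switch, since there is no way to write a case label that matches null in Java 7.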

March 22, 2011

by Vineet Manohar

· 106,342 Views · 2 Likes

IBatis (MyBatis): Discriminator Column Example – Inheritance Mapping Tutorial

This tutorial will walk you through how to set up iBatis (MyBatis) in a simple Java project and will present an example using a discriminator column. In other words, it is an inheritance mapping tutorial.

Pre-Requisites

For this tutorial I am using:

· IDE: Eclipse (you can use your favorite one)
· Database: MySQL
· Libs/jars: MyBatis, MySQL Connector and JUnit (for testing)

This is how your project should look:

Sample Database

Please run the script in your database before getting started with the project implementation. You will find the script (with dummy data) inside the sql folder.

1 – POJOs – Beans

I represented the beans here with a UML model, but you can download the complete source code at the end of this article.

As you can see in the data modeling diagram and the UML diagram above, we have a class Employee and two subclasses: Developer and Manager. The goal of this tutorial is to retrieve all the Employees from the database. These employees can be instances of Developer or Manager, and we are going to use a discriminator column to decide which class to instantiate.

2 – Employee Mapper – XML

Sometimes a single database query might return result sets of many different (but hopefully somewhat related) data types. The discriminator element was designed to deal with this situation, and others, including class inheritance hierarchies. The discriminator is pretty simple to understand, as it behaves much like a switch statement in Java.

A discriminator definition specifies column and javaType attributes. The column is where MyBatis will look for the value to compare. The javaType is required to ensure the proper kind of equality test is performed (although String would probably work for almost any situation).
    SELECT id, name, employee_type, manager_id, info, developer_id, product
    FROM employee E
    left join manager M on M.employee_id = E.id
    left join developer D on D.employee_id = E.id

In this example, MyBatis would retrieve each record from the result set and compare its employee type value. If it matches any of the discriminator cases, then it will use the resultMap specified by the case. This is done exclusively, so in other words, the rest of the resultMap is ignored (unless it is extended, which we talk about in a second). If none of the cases match, then MyBatis simply uses the resultMap as defined outside of the discriminator block. So, if the managerResult was declared as follows:

Now all of the properties from both the managerResult and developerResult will be loaded.

Once again though, some may find this external definition of maps somewhat tedious. Therefore there’s an alternative syntax for those that prefer a more concise mapping style. For example:

    SELECT id, name, employee_type, manager_id, info, developer_id, product
    FROM employee E
    left join manager M on M.employee_id = E.id
    left join developer D on D.employee_id = E.id

Remember that these are all Result Maps, and if you don’t specify any results at all, then MyBatis will automatically match up columns and properties for you. So most of these examples are more verbose than they really need to be. That said, most databases are kind of complex and it’s unlikely that we’ll be able to depend on that for all cases.

3 – Employee Mapper – Annotations

We did the configuration in XML, now let’s try to use annotations to do the same thing we did using XML.
This is the code for EmployeeMapper.java:

    package com.loiane.data;

    import java.util.List;

    import org.apache.ibatis.annotations.Case;
    import org.apache.ibatis.annotations.Result;
    import org.apache.ibatis.annotations.Select;
    import org.apache.ibatis.annotations.TypeDiscriminator;

    import com.loiane.model.Developer;
    import com.loiane.model.Employee;
    import com.loiane.model.Manager;

    public interface EmployeeMapper {

        final String SELECT_EMPLOYEE = "SELECT id, name, employee_type, manager_id, info, developer_id, product " +
            "FROM employee E left join manager M on M.employee_id = E.id " +
            "left join developer D on D.employee_id = E.id ";

        /**
         * Returns the list of all Employee instances from the database.
         * @return the list of all Employee instances from the database.
         */
        @Select(SELECT_EMPLOYEE)
        @TypeDiscriminator(column = "employee_type",
            cases = {
                @Case (value="1", type = Manager.class,
                    results={
                        @Result(property="managerId", column="manager_id"),
                        @Result(property="info"),
                    }),
                @Case (value="2", type = Developer.class,
                    results={
                        @Result(property="developerId", column="developer_id"),
                        @Result(property="project", column="product"),
                    })
            })
        List<Employee> getAllEmployeesAnnotation();
    }

If you have been reading this blog lately, you are already familiar with the @Select and @Result annotations, so let’s skip them. Let’s talk about the @TypeDiscriminator and @Case annotations.

@TypeDiscriminator
A group of value cases that can be used to determine the result mapping to perform. Attributes: column, javaType, jdbcType, typeHandler, cases. The cases attribute is an array of Cases.

@Case
A single case of a value and its corresponding mappings. Attributes: value, type, results. The results attribute is an array of Results, thus this Case annotation is similar to an actual ResultMap, specified by the Results annotation below.

In this example, we set the column attribute for @TypeDiscriminator to determine which column MyBatis will look at for the value to compare. And we set an array of @Case.
For each @Case we set the value, so if the column matches the value, MyBatis will instantiate an object of the type we set. We also set an array of @Result to match each column with a class attribute.

Note one thing: using XML we set the id and name properties. We did not set these properties using annotations. It is not necessary, because the column name matches the attribute name. But if you need to set them, it is going to look like this:

    @Select(SELECT_EMPLOYEE)
    @Results(value = {
        @Result(property="id"),
        @Result(property="name")
    })
    @TypeDiscriminator(column = "employee_type",
        cases = {
            @Case (value="1", type = Manager.class,
                results={
                    @Result(property="managerId", column="manager_id"),
                    @Result(property="info"),
                }),
            @Case (value="2", type = Developer.class,
                results={
                    @Result(property="developerId", column="developer_id"),
                    @Result(property="project", column="product"),
                })
        })
    List<Employee> getAllEmployeesAnnotation();

4 – EmployeeDAO

In the DAO, we have two methods: the first one will call the select statement from the XML and the second one will call the annotation method. Both return the same result.

    package com.loiane.dao;

    import java.util.List;

    import org.apache.ibatis.session.SqlSession;
    import org.apache.ibatis.session.SqlSessionFactory;

    import com.loiane.data.EmployeeMapper;
    import com.loiane.model.Employee;

    public class EmployeeDAO {

        /**
         * Returns the list of all Employee instances from the database.
         * @return the list of all Employee instances from the database.
         */
        @SuppressWarnings("unchecked")
        public List<Employee> selectAll(){
            SqlSessionFactory sqlSessionFactory = MyBatisConnectionFactory.getSqlSessionFactory();
            SqlSession session = sqlSessionFactory.openSession();
            try {
                List<Employee> list = session.selectList("Employee.getAllEmployees");
                return list;
            } finally {
                session.close();
            }
        }

        /**
         * Returns the list of all Employee instances from the database.
         * @return the list of all Employee instances from the database.
         */
        public List<Employee> selectAllUsingAnnotations(){
            SqlSessionFactory sqlSessionFactory = MyBatisConnectionFactory.getSqlSessionFactory();
            SqlSession session = sqlSessionFactory.openSession();
            try {
                EmployeeMapper mapper = session.getMapper(EmployeeMapper.class);
                List<Employee> list = mapper.getAllEmployeesAnnotation();
                return list;
            } finally {
                session.close();
            }
        }
    }

The output if you call one of these methods and print:

    Employee ID = 1 Name = Kate Manager ID = 1 Info = info Kate
    Employee ID = 2 Name = Josh Developer ID = 1 Project = web
    Employee ID = 3 Name = Peter Developer ID = 2 Project = desktop
    Employee ID = 4 Name = James Manager ID = 2 Info = info James
    Employee ID = 5 Name = Susan Developer ID = 3 Project = web

Download

If you want to download the complete sample project, you can get it from my GitHub account: https://github.com/loiane/ibatis-discriminator

There are more articles about iBatis to come. Stay tuned!

From http://loianegroner.com/2011/03/ibatis-mybatis-discriminator-column-example-inheritance-mapping-tutorial/
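Since the tutorial compares the discriminator to a Java switch statement, the following plain-Java sketch shows roughly what that decision looks like when written by hand. It does not use MyBatis; the nested classes and the mapRow method are simplified stand-ins invented for this illustration, with the column names and the case values "1" and "2" taken from the mapper above.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: picks the subclass to instantiate from a discriminator column.
final class DiscriminatorSketch {

    // Simplified stand-ins for the tutorial's model classes.
    static class Employee { int id; String name; }
    static class Manager extends Employee { int managerId; String info; }
    static class Developer extends Employee { int developerId; String project; }

    // row maps column names to values, as one record of the joined query would.
    static Employee mapRow(Map<String, Object> row) {
        Employee employee;
        switch (String.valueOf(row.get("employee_type"))) {
            case "1": {
                Manager manager = new Manager();
                manager.managerId = (Integer) row.get("manager_id");
                manager.info = (String) row.get("info");
                employee = manager;
                break;
            }
            case "2": {
                Developer developer = new Developer();
                developer.developerId = (Integer) row.get("developer_id");
                developer.project = (String) row.get("product");
                employee = developer;
                break;
            }
            default:
                // No case matched: fall back to the base type.
                employee = new Employee();
        }
        // Columns shared by every employee type.
        employee.id = (Integer) row.get("id");
        employee.name = (String) row.get("name");
        return employee;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("id", 1);
        row.put("name", "Kate");
        row.put("employee_type", 1);
        row.put("manager_id", 1);
        row.put("info", "info Kate");
        System.out.println(mapRow(row).getClass().getSimpleName());
    }
}
```

The @TypeDiscriminator and @Case annotations let MyBatis make this choice declaratively, so none of this branching has to live in the DAO.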

March 22, 2011

by Loiane Groner

· 34,153 Views