Data Resources

The Latest Data Topics

Clojure: partition-by, split-with, group-by, and juxt

Today I ran into a common situation: I needed to split a list into two sublists, the elements that passed a predicate and the elements that failed it. I'm sure I've run into this problem several times, but it's been a while and I'd forgotten what options were available to me. A quick look at http://clojure.github.com/clojure/ reveals several potential functions: partition-by, split-with, and group-by.

partition-by

From the docs:

    Usage: (partition-by f coll)
    Applies f to each value in coll, splitting it each time f returns a new value. Returns a lazy seq of partitions.

Let's assume we have a collection of ints and we want to split them into a list of evens and a list of odds. The following REPL session shows the result of calling partition-by with our list of ints.

    user=> (partition-by even? [1 2 4 3 5 6])
    ((1) (2 4) (3 5) (6))

The partition-by function works as described; unfortunately, it's not exactly what I'm looking for. I need a function that returns ((1 3 5) (2 4 6)).

split-with

From the docs:

    Usage: (split-with pred coll)
    Returns a vector of [(take-while pred coll) (drop-while pred coll)]

The split-with function sounds promising, but a quick REPL session shows it's not what we're looking for.

    user=> (split-with even? [1 2 4 3 5 6])
    [() (1 2 4 3 5 6)]

As the docs state, the collection is split on the first item that fails the predicate: (even? 1).

group-by

From the docs:

    Usage: (group-by f coll)
    Returns a map of the elements of coll keyed by the result of f on each element. The value at each key will be a vector of the corresponding elements, in the order they appeared in coll.

The group-by function works, but it gives us a bit more than we're looking for.

    user=> (group-by even? [1 2 4 3 5 6])
    {false [1 3 5], true [2 4 6]}

The result as a map isn't exactly what we desire, but a bit of destructuring allows us to grab the values we're looking for.

    user=> (let [{evens true odds false} (group-by even? [1 2 4 3 5 6])]
             [evens odds])
    [[2 4 6] [1 3 5]]

The group-by results mixed with destructuring do the trick, but there's another option.

juxt

From the docs:

    Usage: (juxt f) (juxt f g) (juxt f g h) (juxt f g h & fs)
    Alpha - name subject to change. Takes a set of functions and returns a fn that is the juxtaposition of those fns. The returned fn takes a variable number of args, and returns a vector containing the result of applying each fn to the args (left-to-right).
    ((juxt a b c) x) => [(a x) (b x) (c x)]

The first time I ran into juxt I found it a bit intimidating. I couldn't tell you why, but if you feel the same way, don't feel bad. It turns out juxt is exactly what we're looking for. The following REPL session shows how to combine juxt with filter and remove to produce the desired results.

    user=> ((juxt filter remove) even? [1 2 4 3 5 6])
    [(2 4 6) (1 3 5)]

There's one catch to using juxt in this way: the entire list is processed twice, once by filter and once by remove. In general this is acceptable; however, it's something worth considering when writing performance-sensitive code.

From http://blog.jayfields.com/2011/08/clojure-partition-by-split-with-group.html
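For comparison with the two-pass juxt approach, here is a minimal single-pass sketch. It is written in Java rather than Clojure, and the class and method names (PartitionSketch, split) are made up for illustration; it is closest in spirit to the group-by variant, since each element is tested once and dropped into a "true" or "false" bucket.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical helper, not part of any library discussed in the post.
final class PartitionSketch {

    // One pass over the input: every element is tested once and placed in
    // the bucket keyed by the predicate's result, in encounter order.
    static <T> Map<Boolean, List<T>> split(List<T> items, Predicate<T> pred) {
        return items.stream().collect(Collectors.partitioningBy(pred));
    }

    public static void main(String[] args) {
        Map<Boolean, List<Integer>> parts =
                split(Arrays.asList(1, 2, 4, 3, 5, 6), n -> n % 2 == 0);
        System.out.println(parts.get(true));   // [2, 4, 6]
        System.out.println(parts.get(false));  // [1, 3, 5]
    }
}
```

Unlike filter and remove, which return lazy sequences in Clojure, this version builds both lists eagerly.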

August 24, 2011

by Jay Fields


Edge Side Includes with Varnish in 10 minutes

Varnish is a tool built to be an intermediate server in the HTTP chain, not an origin one like Apache or IIS. You can outsource caching, logging, zipping and other filters to Varnish, since they are not the main feature of an HTTP server like Apache.

What we'll see today is how to work with Edge Side Includes in Varnish, as a way to compose dynamic pages from independently generated and cached fragments; we won't encounter logging or other features. If you are familiar with PHP, ESI is an (almost) standard for executing include()-like statements on a front end server like Varnish; the proxy is able not only to assemble pages but also to cache them according to different policies: for a certain time, for a single user, and so on.

Thijs Feryn and Alessandro Nadalin first introduced me to Varnish and ESI respectively. I recommend their blogs and talks as additional sources on these topics.

Installation

The default version of Varnish in Ubuntu 11.04 is 2.1, which apparently has only limited ESI support. Installation via packages means adding a public key and a repository to your list of software sources, and installing the varnish package via apt-get or an equivalent command. You can install version 3.0.0 via packages, but only in Ubuntu LTS (10.04).

A way that always works in these cases is installation from sources. The linked page will list the package dependencies and give you a sequence of 3-4 commands to seamlessly compile varnish. I used checkinstall instead of make install to get a binary package that I can reuse later:

    $ sudo checkinstall -D --install=no --fstrans=no [email protected] --reset-uids=yes --nodoc --pkgname=varnish --pkgversion=3.0.0 --pkgrelease=201108231000 --arch=i386

After installation with dpkg, check that varnishd is available and of the right version:

    [10:18:17][giorgio@Desmond:~]$ varnishd -V
    varnishd (varnish-3.0.0 revision 3bd5997)
    Copyright (c) 2006 Verdens Gang AS
    Copyright (c) 2006-2011 Varnish Software AS

Varnish needs minimal configuration: a server to point at. For our tests you can edit /etc/varnish/default.vcl and check (or add) the following:

    backend default {
        .host = "127.0.0.1";
        .port = "80";
    }

You can execute ps -A | grep varnishd at any time to see if varnish is already running.

Execution

    [09:55:18][giorgio@Desmond:~]$ sudo varnishd -f /etc/varnish/default.vcl -s malloc,1G -T 127.0.0.1:2000 -a 0.0.0.0:8080
    storage_malloc: max size 1024 MB.

1 gigabyte of memory is allocated for keeping fragments in RAM. An administrative interface will respond on port 2000, and only be accessible from localhost. http://localhost:8080/ is the exposed HTTP server, and will point to http://localhost:80 as defined in the configuration. Look at man varnishd for more switches and at man vcl for additional explanations of the configuration language.

A bit of ESI

ESI is a technique for leveraging the HTTP cache while still building dynamic pages. The problem with today's pages is that they are highly dynamic: some sections change very often or according to the current user (Welcome, John Doe or the current posts timeline); some sections do not change at all for days (the navigation bar and the layout structure); some sections change in response to external events (the list of incoming messages only when a new message arrives).

It would be ideal to set different caching configurations for all the page's fragments. But implementing this strategy in the application code is error-prone and means reinventing the wheel. Without ESI, using the HTTP cache would force you to load every single fragment of the page with Ajax, even a single paragraph. With ESI, your application produces only the pieces, and lets an implementor of the Edge Side Include specification like Varnish assemble the whole thing.

Example

HTML page (very static):

    Varnish will work on this page: .

PHP page (really dynamic, can change at any time):

    Varnish will work on this page: 2011-08-23.

There is no sign of Varnish's intervention, and the process is totally transparent for the client. Sometimes you can also throw away Zend_Layout and similar components that assemble HTML on the PHP side.

August 23, 2011

by Giorgio Sironi


Practical PHP Refactoring: Replace Data Value with Object

One of the rules of simple design is the necessity to minimize the number of moving parts, like classes and methods, as long as the tests are satisfied and we are not accepting duplication or feeling the lack of an explicit concept. Thus, a rule that aids simple design is to use primitive types unless a field already has some behavior attached: we don't create a class for the user's name or the user's password; we just use some strings.

As we make progress, however, we must be able to revise our decisions via refactoring: if a field gains some logic, this behavior shouldn't be modelled by methods in the containing class, but by a new object. The code in this new class can be reused, whereas if the logic stays in the containing object, which changes from case to case, you will end up duplicating the same methods.

Transforming a scalar value into an object is the essence of the Replace Data Value with Object refactoring. In most cases, a Value Object or a Parameter Object comes out as a result: while DDD pursues Value Objects as concepts in the domain layer, this refactoring is more general and can be applied anywhere. For instance, in one project we started introducing Data Transfer Objects to model the data sent by the controller to a Service Layer.

Data values in PHP

In PHP, all scalar values are by nature data values, as they cannot host methods:

- strings, integers, and booleans are proper scalars.
- arrays are not scalar in the Perl or mathematical sense, but they are still a primitive type.

On the borderline, we find some simple objects used as data containers in PHP:

- ArrayObjects.
- SplHeap and other SPL data structures.

The classes on the borderline may host methods, but the original class is out of reach for modification, and an indirection has to be introduced ("Local Extension").

Steps

1. Create the new class: it should contain as a private field just the value you want to substitute. The methods you immediately need have to be chosen among a constructor, getters, and setters (where needed).
2. Change the field in the containing class.
3. Update the constructor to also create the new object and populate the field, or accept injection (a rarer case).
4. Update the original getter to delegate to the new one.
5. Update the original setter to delegate to the new one (where present) or to create a new object.
6. Run tests at the functional level; the changes should be propagated to the construction phases, while the external usage should not change very much.

Example

In the initial state, magic arrays are passed around. It's very easy to build an array where a key is missing or misnamed.

    newPassword(array(
                'userId' => 42,
                'oldPassword' => 'gismo',
                'newPassword' => 'supersecret',
                'repeatNewPassword' => 'supersecret'
            ));
            $this->markTestIncomplete('This refactoring is about the introduction of an object; it suffices that the test does not explode.');
        }
    }

    class UserService
    {
        public function newPassword($changePasswordData)
        {
            /* it's not interesting to do something here */
        }
    }

After the introduction of an ArrayObject extension, a little type safety is ensured and we have gained a place to put methods at little cost.

    newPassword(new ChangePasswordCommand(array(
                'userId' => 42,
                'oldPassword' => 'gismo',
                'newPassword' => 'supersecret',
                'repeatNewPassword' => 'supersecret'
            )));
            $this->markTestIncomplete('This refactoring is about the introduction of an object; it suffices that the test does not explode.');
        }
    }

    class UserService
    {
        public function newPassword(ChangePasswordCommand $changePasswordData)
        {
            /* it's not interesting to do something here */
        }
    }

    class ChangePasswordCommand extends ArrayObject
    {
    }

We add methods to implement logic on this object; in this case, validation logic; in general, any kind of code that should not be duplicated by the different clients. For a stricter implementation, wrap an array or another data structure (scalars, SPL objects) instead of extending ArrayObject, as you gain immutability and encapsulation (although this kind of object needs little encapsulation).

    class ChangePasswordCommand extends ArrayObject
    {
        public function __construct($data)
        {
            if (!isset($data['userId'])) {
                throw new Exception('User id is missing.');
            }
            parent::__construct($data);
        }

        public function getPassword()
        {
            if ($this['newPassword'] != $this['repeatNewPassword']) {
                throw new Exception('Password do not match.');
            }
            return $this['newPassword'];
        }
    }

Since this is a refactoring, however, extending ArrayObject is the least invasive way of introducing an object: the client code can still use the ArrayAccess interface and treat the object as a plain array.
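The "stricter implementation" mentioned above, wrapping the data instead of extending a container class, might look roughly like the following. This is a sketch in Java rather than PHP, under the assumption that the same four keys are passed in; the accessor names and exception types are illustrative choices, not taken from the article.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch of a wrapping (not extending) command object. Field names mirror
// the PHP example; everything else is an assumption made for illustration.
final class ChangePasswordCommand {
    private final Map<String, Object> data;

    ChangePasswordCommand(Map<String, Object> data) {
        if (data.get("userId") == null) {
            throw new IllegalArgumentException("User id is missing.");
        }
        // Defensive, read-only copy: later changes to the caller's map
        // cannot alter this object, which is what makes it immutable.
        this.data = Collections.unmodifiableMap(new HashMap<>(data));
    }

    Object getUserId() {
        return data.get("userId");
    }

    // Validation lives here once, instead of in every client.
    String getPassword() {
        Object password = data.get("newPassword");
        Object repeated = data.get("repeatNewPassword");
        if (password == null || !password.equals(repeated)) {
            throw new IllegalStateException("Passwords do not match.");
        }
        return (String) password;
    }

    public static void main(String[] args) {
        Map<String, Object> input = new HashMap<>();
        input.put("userId", 42);
        input.put("newPassword", "supersecret");
        input.put("repeatNewPassword", "supersecret");
        System.out.println(new ChangePasswordCommand(input).getPassword());
    }
}
```

The trade-off is the one the article names: clients can no longer treat the object as an array, so this variant is more invasive than the ArrayObject extension.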

August 15, 2011

by Giorgio Sironi


Serialize only specific class properties to JSON string using JavaScriptSerializer

About one year ago I wrote a blog post about JavaScriptSerializer and the Serialize and Deserialize methods it supports.

Note: This blog post has been in draft for some time now, so I decided to complete it and publish it.

There might be situations when you want to serialize only specific properties of a given class to a JSON string. You can do that using JavaScriptSerializer in combination with LINQ.

Let's say we have the following class definition:

    public class Customer
    {
        public string Name { get; set; }
        public string Surname { get; set; }
        public string Email { get; set; }
        public int Age { get; set; }
        public bool Drinker { get; set; }
        public bool Smoker { get; set; }
        public bool Single { get; set; }
    }

Next, let's create a method that will create sample data for our demo:

    private List<Customer> GetListOfCustomers()
    {
        List<Customer> customers = new List<Customer>();
        customers.Add(new Customer() { Name = "Hajan", Surname = "Selmani", Age = 25, Drinker = false, Smoker = false, Single = false, Email = "[email protected]" });
        customers.Add(new Customer() { Name = "John", Surname = "Doe", Age = 29, Drinker = false, Smoker = true, Single = false, Email = "[email protected]" });
        customers.Add(new Customer() { Name = "Mark", Surname = "Moris", Age = 34, Drinker = true, Smoker = true, Single = true, Email = "[email protected]" });
        return customers;
    }

So, we have three customers with some property values for each of them. Now, let's serialize some of their properties using JavaScriptSerializer.

First, you must add the following directive:

    using System.Web.Script.Serialization;

Next, we create a list of customers that will hold the value returned from the GetListOfCustomers method, and we create an instance of the JavaScriptSerializer class:

    List<Customer> customers = GetListOfCustomers();
    JavaScriptSerializer serializer = new JavaScriptSerializer();

Now, let's say we want to serialize to a JSON string and retrieve only the Age property data. We do that with one simple line of code:

    //this will serialize only the 'Age' property
    string jsonString = serializer.Serialize(customers.Select(x => x.Age));

The result is a JSON array that contains only the Age values. Nice!

Now, what if we want to serialize multiple properties at once, but not all class properties?

    string jsonStringMultiple = serializer.Serialize(customers.Select(x => new { x.Name, x.Surname, x.Age, x.Drinker }));

The result is an array of objects with the four properties and their corresponding values that we selected using the LINQ query above. Integer and boolean values appear without quotes, which is the correct way of serializing them.

You probably noticed a difference. In the first example, where we selected only one property, the result contains only the values of the property (no property name), while in the second example we get each property name and its corresponding value. Why is that? It's because in the second query we use new { … } to specify multiple properties in the select statement. The anonymous new { … } creates an object for each item found.

If you are interested in running some more tests, try the following two lines of code:

    var customers1 = customers.Select(x => x.Name).ToList();
    var customers2 = customers.Select(x => new { x.Name }).ToList();

and you will see the difference: customers1 is a list of strings, while customers2 is a list of anonymous objects, each with a Name property.

If we use the new { } way for single property selection, as in the following example:

    string jsonString2 = serializer.Serialize(customers.Select(x => new { x.Age }));

the result will be an array of objects, each carrying the Age property name and its value.

The complete demo code used for this blog post:

    List<Customer> customers = GetListOfCustomers();
    JavaScriptSerializer serializer = new JavaScriptSerializer();

    //this will serialize only the 'Age' property
    string jsonString = serializer.Serialize(customers.Select(x => x.Age));

    string jsonStringMultiple = serializer.Serialize(customers.Select(x => new { x.Name, x.Surname, x.Age, x.Drinker }));

    var customers1 = customers.Select(x => x.Name).ToList();
    var customers2 = customers.Select(x => new { x.Name }).ToList();

    string jsonString2 = serializer.Serialize(customers.Select(x => new { x.Age }));

You can download the demo project here.

August 10, 2011

by Hajan Selmani


A collection with billions of entries

There are a number of problems with having a large number of records in memory. One way around this is to use direct memory, but this is too low level for most developers. Is there a way to make this more friendly?

Limitations of large numbers of objects

- The overhead per object is between 12 and 16 bytes for 64-bit JVMs. If the object is relatively small, this is significant and could be more than the data itself.
- The GC pause time increases with the number of objects. Pause times can be around one second per GB of objects.
- Collections and arrays only support two billion elements.

Huge collections

One way to store more data and still follow object-oriented principles is to have wrappers for direct ByteBuffers. These can be tedious to write, but they are very efficient. What would be ideal is to have these wrappers generated automatically.

Small JavaBean Example

This is an example of a JavaBean which would have far more overhead than the actual data it contains.

    interface MutableByte {
        public void setByte(byte b);
        public byte getByte();
    }

It is also small enough that I can create billions of these on my machine. This example creates a List with 16 billion elements.

    final long length = 16_000_000_000L;
    HugeArrayList<MutableByte> hugeList = new HugeArrayBuilder<MutableByte>() {{
        allocationSize = 4 * 1024 * 1024;
        capacity = length;
    }}.create();
    List<MutableByte> list = hugeList;
    assertEquals(0, list.size());

    hugeList.setSize(length);
    // add a GC to see what the GC times are like.
    System.gc();

    assertEquals(Integer.MAX_VALUE, list.size());
    assertEquals(length, hugeList.longSize());

    byte b = 0;
    for (MutableByte mb : list)
        mb.setByte(b++);

    b = 0;
    for (MutableByte mb : list) {
        byte b2 = mb.getByte();
        byte expected = b++;
        if (b2 != expected)
            assertEquals(expected, b2);
    }

From start to finish, the heap memory used is as follows, with -verbosegc:

    0 sec - 3100 KB used
    [GC 9671K->1520K(370496K), 0.0020330 secs]
    [Full GC 1520K->1407K(370496K), 0.0063500 secs]
    10 sec - 3885 KB used
    20 sec - 4428 KB used
    30 sec - 4428 KB used
    ... deleted ...
    1380 sec - 4475 KB used
    1390 sec - 4476 KB used
    1400 sec - 4476 KB used
    1410 sec - 4476 KB used

The only GC is the one triggered explicitly. Without the System.gc(); no GC logs appear. After 20 sec, the increase in memory used comes from logging how much memory was used.

Conclusion

The library is relatively slow. Each get or set takes about 40 ns, which really adds up when there are so many calls to make. I plan to work on it so it is much faster. ;)

On the upside, it wouldn't be possible to create 16 billion objects with the memory I have, nor could they be put in an ArrayList, so having it a little slow is still better than not working at all.

From http://vanillajava.blogspot.com/2011/08/collection-with-billions-of-entries.html
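To make the idea of "wrappers for direct ByteBuffers" concrete, here is a hand-written sketch of the kind of wrapper the post would like to see generated. It is not the library's implementation: the class name OffHeapByteArray and the chunking scheme are assumptions, chosen because a single ByteBuffer cannot exceed Integer.MAX_VALUE bytes, so a long index has to be split into a chunk number and an offset.

```java
import java.nio.ByteBuffer;

// Hypothetical hand-written wrapper; the library discussed above generates
// its own and may work quite differently.
final class OffHeapByteArray {
    private final ByteBuffer[] chunks;
    private final int chunkSize;
    private final long length;

    OffHeapByteArray(long length, int chunkSize) {
        this.length = length;
        this.chunkSize = chunkSize;
        int count = (int) ((length + chunkSize - 1) / chunkSize); // round up
        chunks = new ByteBuffer[count];
        for (int i = 0; i < count; i++) {
            // Direct buffers live outside the Java heap and start zeroed.
            chunks[i] = ByteBuffer.allocateDirect(chunkSize);
        }
    }

    long length() {
        return length;
    }

    byte get(long index) {
        return chunks[(int) (index / chunkSize)].get((int) (index % chunkSize));
    }

    void set(long index, byte value) {
        chunks[(int) (index / chunkSize)].put((int) (index % chunkSize), value);
    }

    public static void main(String[] args) {
        OffHeapByteArray bytes = new OffHeapByteArray(10_000, 4096);
        for (long i = 0; i < bytes.length(); i++) {
            bytes.set(i, (byte) i);
        }
        System.out.println(bytes.get(300)); // (byte) 300 wraps around to 44
    }
}
```

Only the handful of ByteBuffer objects are visible to the garbage collector, regardless of how many bytes are stored, which is consistent with the flat heap usage in the log above.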

August 10, 2011

by Peter Lawrey


TechTip: Use of setLenient method on SimpleDateFormat

Sometimes when you are parsing a date string against a pattern (such as MM/dd/yyyy) using java.text.SimpleDateFormat, strange things might happen (at least for developers unaware of this behaviour) if your date string is dynamic content entered by a user in some input field on the user interface and it is not entered in the specified format. The parse method of SimpleDateFormat parses the date string that is in the incorrect format and returns a date object instead of throwing a java.text.ParseException. However, the date returned is not what you expect. The code snippet below shows this behaviour.

    package com.starwood.system.util;

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class DateSample {
        public static void main(String args[]) {
            SimpleDateFormat sdf = new SimpleDateFormat();
            sdf.applyPattern("MM/dd/yyyy");
            try {
                Date d = sdf.parse("2011/02/06");
                System.out.println(d);
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }
    }

Output:

    Thu Jul 02 00:00:00 MST 173

Look at the output: that is a date back in the year 173. In lenient mode the parser reads "2011" as the month and rolls the overflow into the year. To avoid this problem, call setLenient(false) on the SimpleDateFormat instance. That will make the parse method throw a ParseException when the given input string is not in the specified format. Here is the modified code snippet.

    package com.starwood.system.util;

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class DateSample {
        public static void main(String args[]) {
            SimpleDateFormat sdf = new SimpleDateFormat();
            sdf.applyPattern("MM/dd/yyyy");
            sdf.setLenient(false);
            try {
                Date d = sdf.parse("2011/02/06");
                System.out.println(d);
            } catch (ParseException e) {
                System.out.println(e.getMessage());
            }
        }
    }

Output:

    Unparseable date: "2011/02/06"

http://accordess.com/wpblog/2011/06/02/techtip-use-of-setlenient-method-on-simpledateformat
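The tip can be packaged as a small helper for validating user input. The sketch below is one possible shape, not the article's code: the class and method names (StrictDates, parseStrict) are made up, and it adds one check beyond the article, using a ParsePosition to reject trailing characters, because parse(String) only reads from the start of the text and may not consume all of it.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical helper built on the setLenient(false) tip.
final class StrictDates {

    // Returns the parsed date, or null when the input does not match the
    // pattern exactly.
    static Date parseStrict(String pattern, String input) {
        SimpleDateFormat sdf = new SimpleDateFormat(pattern);
        sdf.setLenient(false); // reject out-of-range fields such as month 2011

        // This overload returns null on failure instead of throwing.
        ParsePosition pos = new ParsePosition(0);
        Date date = sdf.parse(input, pos);

        // Also reject input with leftover characters after the date.
        if (date == null || pos.getIndex() != input.length()) {
            return null;
        }
        return date;
    }

    public static void main(String[] args) {
        System.out.println(parseStrict("MM/dd/yyyy", "02/06/2011") != null);    // true
        System.out.println(parseStrict("MM/dd/yyyy", "2011/02/06") != null);    // false
        System.out.println(parseStrict("MM/dd/yyyy", "02/06/2011abc") != null); // false
    }
}
```

A new SimpleDateFormat is created on every call because the class is not thread-safe; sharing one instance across threads would need external synchronization.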

June 27, 2011

by Upendra Chintala


Android Tutorial: How to Parse/Read JSON Data Into an Android ListView

Today we get on with our series that will connect our Android applications to internet web services! Next up in line: from JSON to a ListView. A lot of this project is identical to the previous post in this series, so try to look there first if you have any problems. At the bottom of the post I'll add the Eclipse project with the source.

For this example I made use of an already existing JSON web service located here. This is a piece of the JSON array that gets returned:

    {"earthquakes": [
        {
            "eqid": "c0001xgp",
            "magnitude": 8.8,
            "lng": 142.369,
            "src": "us",
            "datetime": "2011-03-11 04:46:23",
            "depth": 24.4,
            "lat": 38.322
        },
        {
            "eqid": "2007hear",
            "magnitude": 8.4,
            "lng": 101.3815,
            "src": "us",
            "datetime": "2007-09-12 09:10:26",
            "depth": 30,
            "lat": -4.5172
        }
    ]}

So how do we get this data into our application? Behold our getJSON class!

getJSON(String url)

    public static JSONObject getJSONfromURL(String url) {
        //initialize
        InputStream is = null;
        String result = "";
        JSONObject jArray = null;

        //http post
        try {
            HttpClient httpclient = new DefaultHttpClient();
            HttpPost httppost = new HttpPost(url);
            HttpResponse response = httpclient.execute(httppost);
            HttpEntity entity = response.getEntity();
            is = entity.getContent();
        } catch (Exception e) {
            Log.e("log_tag", "Error in http connection " + e.toString());
        }

        //convert response to string
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(is, "iso-8859-1"), 8);
            StringBuilder sb = new StringBuilder();
            String line = null;
            while ((line = reader.readLine()) != null) {
                sb.append(line + "\n");
            }
            is.close();
            result = sb.toString();
        } catch (Exception e) {
            Log.e("log_tag", "Error converting result " + e.toString());
        }

        //try parse the string to a JSON object
        try {
            jArray = new JSONObject(result);
        } catch (JSONException e) {
            Log.e("log_tag", "Error parsing data " + e.toString());
        }

        return jArray;
    }

The code above can be divided into 3 parts:

- the first part makes the HTTP call
- the second part converts the stream into a String
- the third part converts the string to a JSONObject

Now we only have to implement this in our ListView. We can use the same method as in the XML tutorial. We make a HashMap that stores our data and we put JSON values in the HashMap. After that we will bind that HashMap to a SimpleAdapter. Here is how it's done:

Implementation

    ArrayList<HashMap<String, String>> mylist = new ArrayList<HashMap<String, String>>();

    //Get the data (see above)
    JSONObject json = JSONfunctions.getJSONfromURL("http://api.geonames.org/postalCodeSearchJSON?formatted=true&postalcode=9791&maxRows=10&username=demo&style=full");

    try {
        //Get the element that holds the earthquakes ( JSONArray )
        JSONArray earthquakes = json.getJSONArray("earthquakes");

        //Loop the Array
        for (int i = 0; i < earthquakes.length(); i++) {
            HashMap<String, String> map = new HashMap<String, String>();
            JSONObject e = earthquakes.getJSONObject(i);

            map.put("id", String.valueOf(i));
            map.put("name", "Earthquake name:" + e.getString("eqid"));
            map.put("magnitude", "Magnitude: " + e.getString("magnitude"));
            mylist.add(map);
        }
    } catch (JSONException e) {
        Log.e("log_tag", "Error parsing data " + e.toString());
    }

After this we only need to set up the SimpleAdapter:

    ListAdapter adapter = new SimpleAdapter(this, mylist, R.layout.main,
            new String[] { "name", "magnitude" },
            new int[] { R.id.item_title, R.id.item_subtitle });

    setListAdapter(adapter);

    final ListView lv = getListView();
    lv.setTextFilterEnabled(true);
    lv.setOnItemClickListener(new OnItemClickListener() {
        public void onItemClick(AdapterView<?> parent, View view, int position, long id) {
            // look up the row's map so its "id" entry can be shown
            @SuppressWarnings("unchecked")
            HashMap<String, String> o = (HashMap<String, String>) lv.getItemAtPosition(position);
            Toast.makeText(Main.this, "ID '" + o.get("id") + "' was clicked.", Toast.LENGTH_SHORT).show();
        }
    });

Now we have a ListView filled with JSON data! Here is the Eclipse project: source code

Have fun playing around with it.
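The second of the three parts, turning the response stream into a String, can be isolated into a small helper. The following is a sketch using only the standard library, under two assumptions that differ from the tutorial code: it decodes as UTF-8 instead of iso-8859-1 (JSON is commonly served as UTF-8), and it closes the reader with try-with-resources even when reading fails. The names StreamText and readAll are made up.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for the "stream to String" step.
final class StreamText {

    // Reads the whole stream line by line, ending every line with '\n',
    // and closes the stream when done or when an error occurs.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "{\"earthquakes\": []}".getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(in));
    }
}
```

Rethrowing instead of logging and continuing means a failed read stops the caller, rather than handing an empty string to the JSON parser as the tutorial code would.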

June 8, 2011

by Mark Mooibroek


Using Advantage data providers to read DBF-files

In one of my projects I have to read FoxPro DBF-files and import data from them. As this code must run on a server and the customer doesn't want to install FoxPro there, we found another solution that seems, at least to me, much better. In this posting I will show you how to read DBF-files using Sybase Advantage data providers.

Getting Advantage data providers

Here are the download links to data providers:

- Advantage .NET Data Provider Release 10.1 for Windows (32-bit and 64-bit)
- Platforms for Advantage OLE DB Provider Release 10.1
- Platforms for Advantage ODBC Driver Release 10.1

I downloaded and installed the .NET data provider and my example here is fully based on it.

Configuring application

If you run the application without configuring the data provider first, you will get the following error:

    Error 5185: Local server connections are restricted in this environment. See the 5185 error code documentation for details.

Go to your application's bin folder and add a plain text file called ads.ini there. Here is the content for this file:

    [SETTINGS]
    MTIER_LOCAL_CONNECTIONS=1

Make sure you add a reference to the Advantage data provider assembly and include ads.ini in your project.

Getting data to DataTable

Here is a short code example showing how to get data from a DBF-file into a DataTable.

    using System;
    using System.Data;
    using Advantage.Data.Provider;

    class Program
    {
        static void Main(string[] args)
        {
            var tableName = "TABLENAME_WITHOUT_EXTENSION";
            var connStr = "data source={0};tabletype=vfp;servertype= local;";
            connStr = string.Format(connStr, "c:\\temp\\");

            var table = new DataTable();

            using (var conn = new AdsConnection(connStr))
            using (var adapter = new AdsDataAdapter())
            using (var cmd = new AdsCommand())
            {
                cmd.Connection = conn;
                cmd.CommandText = "select * from " + tableName;
                adapter.SelectCommand = cmd;

                conn.Open();
                adapter.Fill(table);
                conn.Close();
            }

            Console.WriteLine("Table fields:");
            foreach (DataColumn col in table.Columns)
                Console.WriteLine(col.ColumnName);

            Console.WriteLine(" ");
            Console.WriteLine("Rows: " + table.Rows.Count);
            Console.Read();
        }
    }

If the Advantage data providers were installed correctly and there are no errors in table names, locations and your SQL query, then you should see the list of table column names and the row count in the console window when you run the application.

May 17, 2011

by Gunnar Peipman


How to Iterate ArrayList in Struts2

We will discuss how to iterate over a collection of String objects with the Struts2 tag libraries, and then over a List of custom class objects. It may look as if iterating over a list of String objects is easier than iterating over a list of custom class objects in Struts 2, but in reality iterating over a list of custom class objects is equally easy. By custom class we mean the User, Employee, Department, Products, Vehicles classes that are created in any web application.

It often happens that one needs to fetch a list of records from a database or files and then display it in a JSP. The module requiring this functionality could be Search, or a listing of users/departments/products etc.

The basic flow of a Struts2 web application goes like this:

- The user initiates the request from one page.
- This request is received by the interceptor, which in turn invokes the Struts2 action.
- The action class fetches the records and stores them in a list.
- This list is available to the next JSP through the public getter method. Please note that the public getter method for the List is mandatory.
- Once the List has been populated by the Struts2 action class, the JSP iterates over this List and displays the corresponding information.

In days gone by, one would store the List as a session attribute and then access the list in the JSP using scriptlets to display appropriate output to the users.

Here is a Struts2 sample application that iterates over one String List and one List of custom class objects. Though we are using the Struts2 tag library to iterate over the list, JSTL can also be used for iteration. Also, if you are going to use the code examples given below, use the following URL to access the application: http://localhost:8080//index.action

Iterate a Custom class ArrayList in Struts2

web.xml

struts2 org.apache.struts2.dispatcher.ng.filter.StrutsPrepareAndExecuteFilter struts2 *.action

struts.xml

/home.jsp /success.jsp /failure.jsp

home.jsp

Enter a user name to get the documents uploaded by that user. Username

success.jsp

Documents uploaded by the user are:

failure.jsp

FetchAction.java

    package com.example;

    import java.util.ArrayList;
    import java.util.List;

    public class FetchAction {

        private String username;
        private String message;
        private List<Document> documents = new ArrayList<Document>();

        public List<Document> getDocuments() {
            return documents;
        }

        public String getMessage() {
            return message;
        }

        public void setMessage(String message) {
            this.message = message;
        }

        public String getUsername() {
            return username;
        }

        public void setUsername(String username) {
            this.username = username;
        }

        public String execute() {
            if (username != null) {
                //logic to fetch the document list (say from database)
                Document d1 = new Document();
                d1.setName("user.doc");
                Document d2 = new Document();
                d2.setName("office.doc");
                Document d3 = new Document();
                d3.setName("transactions.doc");
                documents.add(d1);
                documents.add(d2);
                documents.add(d3);
                return "success";
            } else {
                message = "Unable to fetch";
                return "failure";
            }
        }
    }

Document.java

    package com.example;

    public class Document {

        private String name;

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }
    }

Iterate String List in Struts2

The way to iterate over a String list is similar, with the only difference being that the action class FetchAction.java now populates the names of the documents into an ArrayList of String objects.

The code zip file containing the iteration over an ArrayList of custom class objects (beans) can be downloaded at: http://www.fileserve.com/file/QmrsJ7k
The URL to access this application will be: http://localhost:8080/IteratorExample/index.action

The code zip file containing the iteration over an ArrayList of String objects can be downloaded at: http://www.fileserve.com/file/V2kXkfx
The URL to access this application will be: http://localhost:8080/StringIteratorExample/index.action

From http://extreme-java.blogspot.com/2011/05/how-to-iterate-arraylist-in-struts2.html
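The String variant of the action is described above but not shown. A minimal sketch of what it might look like follows; the class name StringFetchAction is made up to avoid clashing with FetchAction, and the document names are simply reused from the custom class example. It is a plain class with no Struts2 imports, which is how the original action is written as well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical String-list counterpart of FetchAction. Declare it public,
// in its own file, when wiring it into struts.xml.
class StringFetchAction {

    private String username;
    private String message;
    private final List<String> documents = new ArrayList<String>();

    // The public getter is what exposes the list to the JSP.
    public List<String> getDocuments() {
        return documents;
    }

    public String getMessage() {
        return message;
    }

    public String getUsername() {
        return username;
    }

    public void setUsername(String username) {
        this.username = username;
    }

    public String execute() {
        if (username != null) {
            // Stand-in for a real lookup (say, from a database).
            documents.add("user.doc");
            documents.add("office.doc");
            documents.add("transactions.doc");
            return "success";
        }
        message = "Unable to fetch";
        return "failure";
    }

    public static void main(String[] args) {
        StringFetchAction action = new StringFetchAction();
        action.setUsername("demo");
        System.out.println(action.execute() + " " + action.getDocuments());
    }
}
```

With a list of Strings, the JSP prints each element directly inside the iterator instead of reading a name property from it.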

May 17, 2011

by Sandeep Bhandari


Database Interaction with DAO and DTO Design Patterns

Learn what a DAO is, how to create one, and why Data Access Objects are worth creating.

May 13, 2011

by Sandeep Bhandari


Solr Index Size Analysis

in this post i’m going to talk about a set of benchmarks that i’ve done with solr. the goal behind it is to see how each parameter defined in the schema affects the size of the index and the performance of the system. the first step was to fetch the set of documents that i was going to use in the tests. i wanted the documents to be composed of real text, so i started to look for sources in internet. the first one that i really liked was twitter. they provide a rest api that allows you to read a continuous stream of tweets, composed of approximately 1% of all the public tweets. each tweet is expressed as a json object, and carries meta-data about the message and the author. while this source allowed me to get a good number of documents in a short time (about 1.7 million tweets in 2 days), they were really small, so i started to look for a source of bigger documents, finally choosing wikipedia. i downloaded the documents through http using the “random article” feature in their site, obtaining about 160,000 articles in a couple of days. at the time of writting, the site download.wikipedia.org , which provides an easy way of downloading a bunch of articles, was out of service. the next step was to design the schema. because one of the objectives is to see how each change in the schema affects the size of the index, i used many different combination of parameters, as to measure the influence of each one of them. on each case, the database of stop-words was populated using the top 100 terms of each set of documents, obtained from the administration panel of solr. for both datasets, the “omitnorms”, “termvectors” and “stopwords” parameters are referred to the “text” field. in all cases, the value of the parameters “termoffsets” and “termpositions” is the same as “termvectors”. in the first figure you can see the size of the index for each schema for the twitter data-set, and which proportion of the index corresponds to each parameter. 
Remember that this data-set has lots of documents (about 1.7 million) but each one is small (240 bytes on average). There are several remarkable things here. First, the space occupied by the term vectors (~280 MiB when not using stop words) is almost equal to the space occupied by the inverted index itself (~240 MiB). Second, the space saved by omitting norms is almost negligible (~2 MiB). Third, the space saved by using stop words more than doubles when storing term vectors, going from about 4% of the index to about 10%. Finally, the space occupied by the stored fields (~340 MiB) is considerably bigger than the space occupied by the inverted index itself.

In the second figure you can see the same information for the Wikipedia data-set. The size occupied by the norms is still negligible (< 1 MiB); however, the space saved by using stop words has increased to about 22% of the index size when not storing term vectors, and about 25% when storing them. This time, the size occupied by the term vectors (~1067 MiB) is almost three times the space occupied by the inverted index itself (~380 MiB). Finally, the size of the stored documents (~6330 MiB) is more than four times the size of the index with term vectors stored.

At this point, we can state some conclusions concerning the size of the index:

· When the number of fields is small, the size of the norms is negligible, independently of the size and number of documents.
· When the documents are large, stop words help reduce the size of the index significantly. Two things are important to note here. First, the documents fetched from Wikipedia are written in traditional language and are all in English, while the documents fetched from Twitter are written in modern language and in many different languages. Second, I didn’t measure the precision and recall of the system when using stop words, so it is possible that findability in a real scenario won’t be good.
· If you’re storing the documents, and they are big enough, it matters little whether you store the term vectors or not, so if you’re using a feature such as highlighting and you are looking for good performance, you should store them.
· If you’re not storing documents, or your documents are small, you should think twice before storing the term vectors, because they’re going to increase your index’s size significantly.

I hope you find this post useful. Currently I’m working on a set of benchmarks to measure the influence of each one of these parameters on the performance of the system, so if you liked this post, stay tuned!
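The proportions quoted for the Wikipedia data-set can be checked with a few lines of arithmetic. The sketch below is only an illustration: the class and method names are made up for this example, and the inputs are the approximate MiB figures quoted in the post, not fresh measurements.

```java
import java.util.Locale;

final class IndexSizeRatios {
    // Ratio of one index component to another (both sizes in MiB).
    static double ratio(double part, double reference) {
        return part / reference;
    }

    public static void main(String[] args) {
        // Approximate sizes in MiB, as quoted in the post for the Wikipedia data-set.
        double invertedIndex = 380;
        double termVectors = 1067;
        double storedFields = 6330;

        // "almost three times the space occupied by the inverted index"
        System.out.printf(Locale.ROOT, "term vectors / inverted index: %.2f%n",
                ratio(termVectors, invertedIndex));
        // "more than four times the size of the index with term vectors stored"
        System.out.printf(Locale.ROOT, "stored fields / (index + term vectors): %.2f%n",
                ratio(storedFields, invertedIndex + termVectors));
    }
}
```

Plugging in the Twitter figures instead (280 MiB of term vectors against a 240 MiB inverted index) gives a ratio close to 1, which is what "almost equal" refers to.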

April 24, 2011

by Juan Grande

· 29,304 Views

Solr + Hadoop = Big Data Love

Bixo Labs shows how to use Solr as a NoSQL solution for big data

Many people use the Hadoop open source project to process large data sets because it’s a great solution for scalable, reliable data processing workflows. Hadoop is by far the most popular system for handling big data, with companies using massive clusters to store and process petabytes of data on thousands of servers. Since it emerged from the Nutch open source web crawler project in 2006, Hadoop has grown in every way imaginable – users, developers, associated projects (aka the “Hadoop ecosystem”).

Starting at roughly the same time, the Solr open source project has become the most widely used search solution on planet Earth. Solr wraps the API-level indexing and search functionality of Lucene with a RESTful API, GUI, and lots of useful administrative and data integration functionality.

The interesting thing about combining these two open source projects is that you can use Hadoop to crunch the data, and then serve it up in Solr. And we’re not talking about just free-text search; Solr can be used as a key-value store (i.e. a NoSQL database) via its support for range queries. Even on a single server, Solr can easily handle many millions of records (“documents” in Lucene lingo). Even better, Solr now supports sharding and replication via the new, cutting-edge SolrCloud functionality.

Background

I started using Hadoop & Solr about five years ago, as key pieces of the Krugle code search startup I co-founded in 2005. Back then, Hadoop was still part of the Nutch web crawler we used to extract information about open source projects. And Solr was fresh out of the oven, having just been released as open source by CNET.

At Bixo Labs we use Hadoop, Solr, Cascading, Mahout, and many other open source technologies to create custom data processing workflows. The web is a common source of our input data, which we crawl using the Bixo open source project.
The Problem

During a web crawl, the state of the crawl is contained in something commonly called a “crawl DB”. For broad crawls, this has to be something that works with billions of records, since you need one entry for each known URL. Each “record” has the URL as the key, and contains important state information such as the time and result of the last request.

For Hadoop-based crawlers such as Nutch and Bixo, the crawl DB is commonly kept in a set of flat files, where each file is a Hadoop “SequenceFile”. These are just a packed array of serialized key/value objects.

Sometimes we need to poke at this data, and here’s where the simple flat-file structure creates a problem. There’s no easy way to run queries against the data, but we can’t store it in a traditional database since billions of records + RDBMS == pain and suffering.

Here is where scalable NoSQL solutions shine. For example, the Nutch project is currently re-factoring this crawl DB layer to allow plugging in HBase. Other options include Cassandra, MongoDB, CouchDB, etc. But for simple analytics and exploration on smaller datasets, a Solr-based solution works and is easier to configure. Plus you get useful and surprisingly fun functionality like facets, geospatial queries, range queries, free-form text search, and lots of other goodies for free.

Architecture

So what exactly would such a Hadoop + Solr system look like? As mentioned previously, in this example our input data comes from a Bixo web crawler’s CrawlDB, with one entry for each known URL. But the input data could just as easily be log files, or records from a traditional RDBMS, or the output of another data processing workflow. The key point is that we’re going to take a bunch of input data, (optionally) munge it into a more useful format, and then generate a Lucene index that we access via Solr.
Hadoop

For the uninitiated, Hadoop implements both a distributed file system (aka “HDFS”) and an execution layer that supports the map-reduce programming model. Typically data is loaded and transformed during the map phase, and then combined/saved during the reduce phase. In our example, the map phase reads in Hadoop compressed SequenceFiles that contain the state of our web crawl, and our reduce phase writes out Lucene indexes.

The focus of this article isn’t on how to write Hadoop map-reduce jobs, but I did want to show you the code that implements the guts of the job. Note that it’s not typical Hadoop key/value manipulation code, which is painful to write, debug, and maintain. Instead we use Cascading, which is an open source workflow planning and data processing API that creates Hadoop jobs from shorter, more representative code. The snippet below reads SequenceFiles from HDFS, and pipes those records into a sink (output) that stores them using a LuceneScheme, which in turn saves records as Lucene documents in an index.

    Tap source = new Hfs(new SequenceFile(CRAWLDB_FIELDS), inputDir);
    Pipe urlPipe = new Pipe("crawldb urls");
    urlPipe = new Each(urlPipe, new ExtractDomain());
    Tap sink = new Hfs(new LuceneScheme(SOLR_FIELDS, STORE_SETTINGS, INDEX_SETTINGS,
            StandardAnalyzer.class, MAX_FIELD_LENGTH), outputDir, true);
    FlowConnector fc = new FlowConnector();
    fc.connect(source, sink, urlPipe).complete();

We defined CRAWLDB_FIELDS and SOLR_FIELDS to be the set of input and output data elements, using names like “url” and “status”. We take advantage of the Lucene Scheme that we’ve created for Cascading, which lets us easily map from Cascading’s view of the world (records with fields) to Lucene’s index (documents with fields). We don’t have a Cascading Scheme that directly supports Solr (wouldn’t that be handy?), but we can make do for now since we can do simple analysis for this example. We indexed all of the fields so that we can perform queries against them.
Only the status message contains normal English text, so that’s the only one we have to analyze (i.e., break the text up into terms using spaces and other token delimiters). In addition, the ExtractDomain operation pulls the domain from the URL field and builds a new Solr field containing just the domain. This will allow us to do queries against the domain of the URL as well as the complete URL. We could also have chosen to apply a custom analyzer to the URL to break it into several pieces (i.e., protocol, domain, port, path, query parameters) that could have been queried individually.

Running the Hadoop Job

For simplicity and pay-as-you-go, it’s hard to beat Amazon’s EC2 and Elastic MapReduce offerings for running Hadoop jobs. You can easily spin up a cluster of 50 servers, run your job, save the results, and shut it down – all without needing to buy hardware or pay for IT support.

There are many ways to create and configure a Hadoop cluster; for us, we’re very familiar with the (modified) EC2 Hadoop scripts that you can find in the Bixo distribution. Step-by-step instructions are available at http://openbixo.org/documentation/running-bixo-in-ec2/

The code for this article is available via GitHub, at http://github.com/bixolabs/hadoop2solr. The README displayed on that page contains step-by-step instructions for building and running the job.

After the job is done, we’ll copy the resulting index out of the Hadoop distributed file system (HDFS) and onto the Hadoop cluster’s master server, then kill off the one slave we used. The Hadoop master is now ready to be configured as our Solr server.

Solr

On the Solr side of things, we need to create a schema that matches the index we’re generating. The key section of our schema.xml file is where we define the fields. These fields have a one-to-one correspondence with the SOLR_FIELDS we defined in our Hadoop workflow.
They also need to use the same Lucene settings as what we defined in the static IndexWorkflow.java STORE_SETTINGS and INDEX_SETTINGS.

Once we have this defined, all that’s left is to set up a server that we can use. To keep it simple, we’ll use the single EC2 instance in Amazon’s cloud (m1.large) that we used as our master for the Hadoop job, and run the simple Solr search server that relies on embedded Jetty to provide the webapp container.

Similar to the Hadoop job, step-by-step instructions are in the README for the hadoop2solr project on GitHub. But in a nutshell, we’ll copy and unzip a Solr 1.4.1 setup on the EC2 server, do the same for our custom Solr configuration, create a symlink to the index, and then start it running with:

    % cd solr
    % java -Dsolr.solr.home=../solr-conf -Dsolr.data.dir=../solr-data -jar start.jar

Giving it a Try

Now comes the interesting part. Since we opened up the default Jetty port used by Solr (8983) on this EC2 instance, we can directly access Solr’s handy admin console by pointing our browser at http://:8983/solr/admin

From here we can run queries against Solr. We can also use curl to talk to the server via HTTP requests:

    curl http://:8983/solr/select/?q=-status%3AFETCHED+and+-status%3AUNFETCHED

The response is XML by default. For the above request, the response reported 2,546 matches in 94ms.

Now here’s what I find amazing. For an index of 82 million documents, running on a fairly wimpy box (EC2 m1.large = 2 virtual cores), the typical response time for a simple query like “status:FETCHED” is only 400 milliseconds, to find 9M documents. Even a complex query such as (status not FETCHED and not UNFETCHED) only takes six seconds.

Scaling

Obviously we could use beefier boxes. If we switched to something like m1.xlarge (15GB of memory, 4 virtual cores) then it’s likely we could handle upwards of 200M “records” in our Solr index and still get reasonable response times.
If we wanted to scale beyond a single box, there are a number of solutions. Even out of the box Solr supports sharding, where your HTTP request can specify multiple servers to use in parallel.

More recently, the Solr trunk has support for SolrCloud. This uses the ZooKeeper open source project to simplify coordination of multiple Solr servers.

Finally, the Katta open source project supports Lucene-level distributed search, with many of the features needed for production quality distributed search that have not yet been added to SolrCloud.

Summary

The combination of Hadoop and Solr makes it easy to crunch lots of data and then quickly serve up the results via a fast, flexible search & query API. Because Solr supports query-style requests, it’s suitable as a NoSQL replacement for traditional databases in many situations, especially when the size of the data exceeds what is reasonable with a typical RDBMS.

Solr has some limitations that you should be aware of, specifically:

· Updating the index works best as a batch operation. Individual records can be updated, but each commit (index update) generates a new Lucene segment, which will impact performance.
· Support for replication, fail-over, and other attributes that you’d want in a production-grade solution isn’t there yet in SolrCloud. If this matters to you, consider Katta instead.
· Many SQL queries can’t be easily mapped to Solr queries.

The code for this article is available via GitHub, at http://github.com/bixolabs/hadoop2solr. The README displayed on that page contains additional technical details.
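The curl request shown earlier can also be assembled programmatically. The following is a minimal sketch rather than part of the hadoop2solr project: the class name, the method name and the "localhost" host are assumptions made for illustration, and `URLEncoder.encode(String, Charset)` requires Java 10 or later. It only builds the URL for Solr's `/solr/select` handler; sending the request is left out.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class SolrSelectUrl {
    // Builds a select URL of the same shape as the curl example in the article.
    // The query is form-encoded: ':' becomes %3A and spaces become '+'.
    static String build(String host, int port, String query) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return "http://" + host + ":" + port + "/solr/select/?q=" + encoded;
    }

    public static void main(String[] args) {
        // "localhost" is a placeholder; the article's server address is not given.
        System.out.println(build("localhost", 8983, "-status:FETCHED and -status:UNFETCHED"));
    }
}
```

Encoding the query in code avoids hand-escaping characters such as the colon between a field name and its value.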

April 4, 2011

by Ken Krugler

· 119,648 Views

Java Access to SQL Azure via the JDBC Driver for SQL Server

I’ve written a couple of posts (here and here) about Java and the JDBC Driver for SQL Server with the promise of eventually writing about how to get a Java application running on the Windows Azure platform. In this post, I’ll deliver on that promise. Specifically, I’ll show you two things: 1) how to connect to a SQL Azure Database from a Java application running locally, and 2) how to connect to a SQL Azure database from an application running in Windows Azure. You should consider these as two ordered steps in moving an application from running locally against SQL Server to running in Windows Azure against SQL Azure. In both steps, connection to SQL Azure relies on the JDBC Driver for SQL Server and SQL Azure.

The instructions below assume that you already have a Windows Azure subscription. If you don’t already have one, you can create one here: http://www.microsoft.com/windowsazure/offers/. (You’ll need a Windows Live ID to sign up.) I chose the Free Trial Introductory Special, which allows me to get started for free as long as I keep my usage limited. (This is a limited offer. For complete pricing details, see http://www.microsoft.com/windowsazure/pricing/.) After you purchase your subscription, you will have to activate it before you can begin using it (activation instructions will be provided in an email after signing up).

Connecting to SQL Azure from an application running locally

I’m going to assume you already have an application running locally and that it uses the JDBC Driver for SQL Server. If that isn’t the case, then you can start from scratch by following the steps in this post: Getting Started with the SQL Server JDBC Driver. Once you have an application running locally, the process for running that application with a SQL Azure back-end requires two steps:

1. Migrate your database to SQL Azure. This only takes a couple of minutes (depending on the size of your database) with the SQL Azure Migration Wizard - follow the steps in the Creating a SQL Azure Server and Creating a SQL Azure Database sections of this post.

2. Change the database connection string in your application. Once you have moved your local database to SQL Azure, you only have to change the connection string in your application to use SQL Azure as your data store. In my case (using the Northwind database), this meant changing this…

    String connectionUrl = "jdbc:sqlserver://serverName\\sqlexpress;" +
        "database=Northwind;" +
        "user=UserName;" +
        "password=Password";

…to this…

    String connectionUrl = "jdbc:sqlserver://xxxxxxxxxx.database.windows.net;" +
        "database=Northwind;" +
        "user=UserName@xxxxxxxxxx;" +
        "password=Password";

(where xxxxxxxxxx is your SQL Azure server ID).

Connecting to SQL Azure from an application running in Windows Azure

The heading for this section might be a bit misleading. Once you have a locally running application that is using SQL Azure, all you have to do is move your application to Windows Azure. The connecting part is easy (see above), but moving your Java application to Windows Azure takes a bit more work. Fortunately, Ben Lobaugh has written a great post that shows how to use the Windows Azure Starter Kit for Java to get a Java application (a JSP application, actually) running in Windows Azure: Deploying a Java application to Windows Azure with Command-Line Ant. (If you are using Eclipse, see Ben’s related post: Deploying a Java application to Windows Azure with Eclipse.) I won’t repeat his work here, but I will call out the steps I took in modifying his instructions to deploy a simple JSP page that connects to SQL Azure.

1. Add the JDBC Driver for SQL Server to the Java archive. One step in Ben’s tutorial (see the Select the Java Runtime Environment section) requires that you create a .zip file from your local Java installation and add it to your Java/Azure application. Most likely, your local Java installation references the JDBC driver by setting the classpath environment variable. When you create a .zip file from your Java installation, the JDBC driver will not be included and the classpath variable will not be set in the Azure environment. I found the easiest way around this was to simply add the sqljdbc4.jar file (probably located in C:\Program Files\Microsoft SQL Server JDBC Driver\sqljdbc_3.0\enu) to the \lib\ext directory of my local Java installation before creating the .zip file.

Note: You can put the JDBC driver in a separate directory, include it when you create the .zip file, and set the classpath environment variable in the startup.bat script. But I found the above approach to be easier.

2. Modify the JSP page. Instead of the code Ben suggests for the HelloWorld.jsp file (see the Prepare your Java Application section), use code from your locally running application. In my case, I just used the code from this post after changing the connection string and making a couple of minor JSP-specific changes.

That’s it! To summarize the steps…

1. Migrate your database to SQL Azure with the SQL Azure Migration Wizard.
2. Change the database connection in your locally running application.
3. Use the Windows Azure Starter Kit for Java to move your application to Windows Azure. (You’ll need to follow instructions in this post and instructions above.)

Thanks.

-Brian
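The connection-string change described above follows a fixed pattern: the server ID appears both in the host name and as a suffix on the user name. A small helper can make that explicit. This is a sketch under assumptions: the class and method names are hypothetical, and `open` only works when the SQL Server JDBC driver is on the classpath and the credentials are real, so the example does not call it.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

final class SqlAzureUrl {
    // Builds a connection URL in the same format as the article's SQL Azure example.
    static String connectionUrl(String serverId, String database, String user, String password) {
        return "jdbc:sqlserver://" + serverId + ".database.windows.net;"
                + "database=" + database + ";"
                + "user=" + user + "@" + serverId + ";"
                + "password=" + password;
    }

    // Opens a connection; needs the JDBC driver on the classpath (not exercised here).
    static Connection open(String serverId, String database, String user, String password)
            throws SQLException {
        return DriverManager.getConnection(connectionUrl(serverId, database, user, password));
    }

    public static void main(String[] args) {
        // Placeholder values, matching the placeholders used in the article.
        System.out.println(connectionUrl("xxxxxxxxxx", "Northwind", "UserName", "Password"));
    }
}
```

Keeping the URL construction in one place means the switch from a local SQL Server to SQL Azure touches a single method.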

March 30, 2011

by Brian Swan

· 18,935 Views

New Java 7 Feature: String in Switch support

One of the new features added in Java 7 is the capability to switch on a String.

With Java 6 or earlier:

    String color = "red";
    if (color.equals("red")) {
        System.out.println("Color is Red");
    } else if (color.equals("green")) {
        System.out.println("Color is Green");
    } else {
        System.out.println("Color not found");
    }

With Java 7:

    String color = "red";
    switch (color) {
        case "red":
            System.out.println("Color is Red");
            break;
        case "green":
            System.out.println("Color is Green");
            break;
        default:
            System.out.println("Color not found");
    }

Conclusion

When used with a String, the switch statement compares the given expression to each case label as if by the equals() method. It is therefore case-sensitive and will throw a NullPointerException if the expression is null. It is a small but useful feature: it not only helps us write more readable code, but the compiler will also likely generate more efficient bytecode than for the equivalent if-then-else chain.

From http://www.vineetmanohar.com/2011/03/new-java-7-feature-string-in-switch-support/
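The two caveats from the conclusion, case sensitivity and null handling, are easy to guard against. The sketch below is one possible way to do so; the class and method names are invented for this example, and treating null as "not found" is a design choice, not something the original post prescribes.

```java
import java.util.Locale;

final class ColorSwitch {
    // Switches on the String directly; null is handled up front because
    // switch (color) would throw a NullPointerException for a null value.
    static String describe(String color) {
        if (color == null) {
            return "Color not found";
        }
        switch (color) {
            case "red":
                return "Color is Red";
            case "green":
                return "Color is Green";
            default:
                return "Color not found";
        }
    }

    // Normalizes case first, since case labels match case-sensitively.
    static String describeIgnoringCase(String color) {
        return describe(color == null ? null : color.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(describe("red"));
        System.out.println(describe("RED"));
        System.out.println(describeIgnoringCase("RED"));
    }
}
```

Normalizing with `Locale.ROOT` keeps the lower-casing independent of the machine's default locale.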

March 22, 2011

by Vineet Manohar

· 106,415 Views · 2 Likes

A Custom Float PropertyEditor

Both Java SE and the NetBeans Platform have default property editors for several primitive and common data types. These are suitable for most cases and most of us almost never need to worry about them. There are, however, those few moments where the one-size-fits-all approach does not actually fit. For instance, the default float editor is a text editor. For general-purpose cases, this should be fine. Now, what if you want to restrict the values your property can have, or have other control over it? Or simply give the user a more comfortable control for data input, like a JSpinner?

Most of what I'll show was taken from the NetBeans tutorials and Javadoc, so I will not go into details that you can easily find there. First, let's implement the PropertyEditor and the InplaceEditor:

    import java.awt.Component;
    import java.awt.event.ActionListener;
    import java.beans.PropertyEditor;
    import java.beans.PropertyEditorSupport;
    import java.text.NumberFormat;
    import java.text.ParseException;
    import javax.swing.JComponent;
    import javax.swing.JSpinner;
    import javax.swing.KeyStroke;
    import javax.swing.SpinnerNumberModel;
    import org.openide.explorer.propertysheet.ExPropertyEditor;
    import org.openide.explorer.propertysheet.InplaceEditor;
    import org.openide.explorer.propertysheet.PropertyEnv;
    import org.openide.explorer.propertysheet.PropertyModel;

    public abstract class FloatPropertyEditor extends PropertyEditorSupport
            implements ExPropertyEditor, InplaceEditor.Factory {

        protected InplaceEditor ed = null;
        protected SpinnerNumberModel model;

        public FloatPropertyEditor(Object source, SpinnerNumberModel model) {
            super(source);
            this.model = model;
        }

        public FloatPropertyEditor(SpinnerNumberModel model) {
            this.model = model;
        }

        @Override
        public String getAsText() {
            Float d = (Float) getValue();
            if (d == null) {
                return "0.0";
            }
            return NumberFormat.getNumberInstance().format(d.floatValue());
        }

        @Override
        public void setAsText(String s) {
            try {
                setValue(Float.valueOf(
                        NumberFormat.getNumberInstance().parse(s).floatValue()));
            } catch (ParseException ex) {
                // fall back to zero when the text is not a valid number
                setValue(Float.valueOf(0.0f));
            }
        }

        @Override
        public void attachEnv(PropertyEnv env) {
            env.registerInplaceEditorFactory(this);
        }

        @Override
        public InplaceEditor getInplaceEditor() {
            if (ed == null) {
                ed = new FloatInplaceEditor(model);
            }
            return ed;
        }

        protected static class FloatInplaceEditor implements InplaceEditor {

            private final JSpinner spinner;
            private PropertyEditor editor = null;
            private PropertyModel model;

            public FloatInplaceEditor(SpinnerNumberModel model) {
                this.spinner = new JSpinner(model);
            }

            @Override
            public void connect(PropertyEditor propertyEditor, PropertyEnv env) {
                editor = propertyEditor;
                reset();
            }

            @Override
            public JComponent getComponent() {
                return spinner;
            }

            @Override
            public void clear() {
                // avoid memory leaks:
                editor = null;
                model = null;
            }

            @Override
            public Object getValue() {
                return spinner.getValue();
            }

            @Override
            public void setValue(Object object) {
                try {
                    spinner.setValue(object);
                } catch (IllegalArgumentException e) {
                    // The number model rejects null and non-numeric values, so
                    // setting null here would only throw again; keep the current value.
                }
            }

            @Override
            public boolean supportsTextEntry() {
                return true;
            }

            @Override
            public void reset() {
                Float d = (Float) editor.getValue();
                if (d != null) {
                    setValue(d);
                }
            }

            @Override
            public KeyStroke[] getKeyStrokes() {
                return new KeyStroke[0];
            }

            @Override
            public PropertyEditor getPropertyEditor() {
                return editor;
            }

            @Override
            public PropertyModel getPropertyModel() {
                return model;
            }

            @Override
            public void setPropertyModel(PropertyModel propertyModel) {
                this.model = propertyModel;
            }

            @Override
            public boolean isKnownComponent(Component component) {
                return component == spinner || spinner.isAncestorOf(component);
            }

            @Override
            public void addActionListener(ActionListener actionListener) {
                // do nothing - not needed for this component
            }

            @Override
            public void removeActionListener(ActionListener actionListener) {
                // do nothing - not needed for this component
            }
        }
    }

So far, we've been closely following the tutorials. Notice that the FloatPropertyEditor class is abstract. This is because PropertyEditor classes should have a default constructor (more on this below). Not much of a gain, maybe, but now you have a JSpinner as an editor.

Now imagine you are developing a 3D "bodies in space" application. You can select an item and you can change its dimensions and coordinates at will. Coordinates can have any real value, whether negative, zero, or positive (besides the practical dimensional limits your universe might have), while dimensions must be at least zero.
This is where the abstract class plays its role. By inheriting from the class above, you can decide which values your properties can have. For coordinate properties, set this class as the PropertyEditor:

    public class CoordinateFloatPropertyEditor extends FloatPropertyEditor {
        public CoordinateFloatPropertyEditor() {
            super(new SpinnerNumberModel(0.0, -100000.0, 100000.0, 0.5));
        }
    }

And, for dimension properties, use this one instead:

    public class DimensionFloatPropertyEditor extends FloatPropertyEditor {
        public DimensionFloatPropertyEditor() {
            // the initial value must lie within [minimum, maximum],
            // otherwise the model's constructor throws IllegalArgumentException
            super(new SpinnerNumberModel(0.0, 0.0, 100000.0, 0.5));
        }
    }

It's up to you to decide the maximum, minimum, and step values, and tweak this sample for your needs. I also want to add that this is part of a real application, which inspired me to write this article. I'll be extending this subject further in the next posts. Thanks for your attention.
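When choosing the minimum, maximum, and step values, it helps to know how `SpinnerNumberModel` enforces them. The sketch below uses only the standard Swing model class (no NetBeans APIs); the `SpinnerBounds` class and its method names are hypothetical helpers written for this illustration.

```java
import javax.swing.SpinnerNumberModel;

final class SpinnerBounds {
    // Returns true if a model with these settings can be created.
    // The constructor throws IllegalArgumentException unless min <= value <= max.
    static boolean accepts(double value, double min, double max, double step) {
        try {
            new SpinnerNumberModel(value, min, max, step);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // True when stepping down from the current value would leave the allowed range;
    // in that case getPreviousValue() returns null and the spinner stays put.
    static boolean atLowerBound(SpinnerNumberModel model) {
        return model.getPreviousValue() == null;
    }

    public static void main(String[] args) {
        System.out.println(accepts(0.0, 0.0, 100000.0, 0.5));   // value inside the range
        System.out.println(accepts(-1.0, 0.0, 100000.0, 0.5));  // value below the minimum
        System.out.println(atLowerBound(new SpinnerNumberModel(0.0, 0.0, 100000.0, 0.5)));
    }
}
```

Note that a model built from double arguments hands back `Double` values from `getValue()`, which is worth keeping in mind wherever the editor casts property values to `Float`.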

March 1, 2011

by Alied Pérez

· 11,656 Views · 1 Like

Solve Foreign-key Problems in DBUnit Test Data

If you create small per-test datasets, as DBUnit advises, you’ll get intermittent build failures due to foreign-key violations. This post explains (1) why this happens, (2) why small per-test datasets are still a good idea, and (3) one simple way to get around the problem.

NB When I searched for solutions to this problem, I discovered that other kinds of foreign-key problem come up with DBUnit. Some people have circular dependencies in their relational database schemas, which stops DBUnit from loading the test data. If such is your case, I’m sorry to say that this post won’t help you with it, and your best option is probably to just take yourself outside and shoot yourself now. (Although some people seem to have chosen instead to disable foreign key checking during test runs.)

What causes the foreign-key violations

The cause of the problem is simple, and illustrated by a trivial example. Suppose you have two entity classes, HitchHiker and SpaceShip. The HitchHiker table has a foreign key that references SpaceShip. The test data for HitchHikerDaoTest contains lines from both tables, whereas the test data for SpaceShipDaoTest contains only lines from SpaceShip.

DBUnit’s default setup operation, CLEAN_INSERT, wipes data from every table occurring in the test dataset and then inserts the lines listed in that dataset. When SpaceShipDaoTest runs, DBUnit will start by deleting everything in the SpaceShip table. If any HitchHikers are currently riding in the SpaceShips that are about to be deleted, the database will object to their untimely eviction (I’m not sure whether the error message will read like Vogon poetry, though).

If you start from an empty database, and execute SpaceShipDaoTest and then HitchHikerDaoTest, you’ll be fine; but if you do it in the other order, your build will fail. It’s that second-worst kind of bug, the unpredictable kind, since you don’t (usually) specify the order in which tests run. After all, they’re supposed to be independent! So you may well find that you have no problems for months on end, until one day you get an error running individual tests in a particular sequence, or Maven changes the order in which it runs your tests on the CI server, and BOOM!

Why you should still use small independent datasets

It’s tempting to circumvent the problem by using a single monolithic dataset for all your integration tests. I’ve tried this, and I advise against it. A big data file is hard to work with: you waste a lot of time scrolling around looking for the line you need, and it’s very hard to follow and understand foreign-key relations. Worse still: by modifying the data to make one test pass, you can easily accidentally break another one. The larger the dataset and the test suite become, the more fragile they get, and the more painstaking it becomes to modify them.

How to avoid the foreign-key problem with small independent datasets

One working but unsatisfactory solution would be to pad out every XML dataset with the list of all tables touched in the test suite. It’s unsatisfactory because the only way to add a table into a FlatXmlDataSet is to list a line of that table (a FlatXmlDataSet can’t contain empty tables), and there’s no justification for polluting the test data with lines from tables that are not part of the test.

The solution I found was to use a DTD to clean tables before tests. Every XML file has different contents, but they all reference a single DTD which lists all the tables involved in the test suite. The DTD is easy to generate from the database schema, and useful for auto-complete and catching typos in column names, so you should probably already be using one. The code to exploit its contents is very simple:

    // asFile(...) is a helper from the surrounding test base class (not shown here)
    private IDataSet loadTestDataWithDtdTableList(String xmlFilename, String dtdFilename)
            throws IOException, DataSetException, SQLException {
        // dataset holding every table named in the DTD, with no rows
        Reader dtdReader = new FileReader(new ClassPathResource(dtdFilename).getFile());
        IDataSet dtdDataset = new FlatDtdDataSet(dtdReader);

        // dataset holding the actual test rows
        FlatXmlDataSetBuilder builder = new FlatXmlDataSetBuilder();
        builder.setMetaDataSet(new DatabaseDataSet(dbUnitConnection, false));
        IDataSet xmlDataset = builder.build(asFile(xmlFilename));

        return new CompositeDataSet(dtdDataset, xmlDataset);
    }

How it works: DBUnit provides a facility to load a dataset from a DTD. This dataset contains all the tables listed in the DTD, but of course empty of data. The DTD dataset is then combined with a FlatXmlDataSet representing your test data. For the SpaceShip example, the composite dataset contains both the SpaceShip and HitchHiker tables, even though the XML file for SpaceShipDaoTest lists only SpaceShip lines, so both tables are wiped before the test.

If you have dictionary tables whose contents never change, you can and should leave them out of the DTD as well as out of the XML datasets, to improve test performance a little.

One further detail: you should close the FileReader after test setup. I couldn’t find a hook into the end of the test setup operation (short of writing my own DatabaseOperation), so I saved the reference as a member variable and hooked the close() call into the tear-down phase of the test.

NB For a more complete code example, see this Gist snippet of a base class for TestNG+Spring+DBUnit tests that adds the above-described DBUnit setup operation to Spring’s TestNG helper class.

Happy database testing!

From http://www.andrewspencer.net/2011/solve-foreign-key-problems-in-dbunit-test-data/
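The order dependence described above can be reproduced without a database. The following toy model is not DBUnit code: it is a plain-Java simulation, with invented class and method names, of two tables linked by a foreign key. It contrasts a setup that wipes only the SpaceShip table with one that wipes every table, children first.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class CleanInsertSimulation {
    private final Set<Integer> spaceShips = new HashSet<>();
    private final Map<Integer, Integer> hitchHikers = new HashMap<>(); // hiker id -> ship id

    void insertShip(int id) {
        spaceShips.add(id);
    }

    void insertHiker(int id, int shipId) {
        if (!spaceShips.contains(shipId)) {
            throw new IllegalStateException("FK violation: unknown spaceship " + shipId);
        }
        hitchHikers.put(id, shipId);
    }

    // Every hiker references a ship, so ships cannot be wiped while hikers remain.
    void deleteAllShips() {
        if (!hitchHikers.isEmpty()) {
            throw new IllegalStateException("FK violation: hitchhikers still on board");
        }
        spaceShips.clear();
    }

    void deleteAllHikers() {
        hitchHikers.clear();
    }

    // State left behind by a test whose dataset covers both tables.
    static CleanInsertSimulation afterHitchHikerTest() {
        CleanInsertSimulation db = new CleanInsertSimulation();
        db.insertShip(1);
        db.insertHiker(42, 1);
        return db;
    }

    // Setup with a dataset that lists only SpaceShip: only that table is wiped.
    static boolean shipOnlySetup(CleanInsertSimulation db) {
        try {
            db.deleteAllShips();
            db.insertShip(1);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    // Setup with a dataset that lists every table: all are wiped, children first.
    static boolean compositeSetup(CleanInsertSimulation db) {
        try {
            db.deleteAllHikers();
            db.deleteAllShips();
            db.insertShip(1);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(shipOnlySetup(new CleanInsertSimulation())); // empty database
        System.out.println(shipOnlySetup(afterHitchHikerTest()));       // leftover hikers
        System.out.println(compositeSetup(afterHitchHikerTest()));      // all tables cleaned
    }
}
```

The ship-only setup succeeds on an empty database and fails once hitchhikers have been left behind, which is the intermittent failure the post describes; listing every table in the dataset removes the dependence on test order.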

February 16, 2011

by Andrew Spencer

· 27,895 Views

Introduction to iBatis (MyBatis), An alternative to Hibernate and JDBC

I have started a new article series about iBatis / MyBatis. This first article walks you through what iBatis / MyBatis is and why you should use it.

For those who do not know iBatis / MyBatis yet: it is a persistence framework, an alternative to JDBC and Hibernate, available for the Java and .NET platforms. I've been working with it for almost two years, and I am enjoying it!

The first thing you may notice in this and the following articles is that I use both the iBatis and MyBatis names. Why? Until June 2010, iBatis was an Apache Software Foundation project. The framework's founders then decided to move it to Google Code and renamed it MyBatis. The framework is still the same, though; it just has a different name now.

I gathered some resources, so I am just going to quote them:

What is MyBatis/iBatis?

The MyBatis data mapper framework makes it easier to use a relational database with object-oriented applications. MyBatis couples objects with stored procedures or SQL statements using an XML descriptor. Simplicity is the biggest advantage of the MyBatis data mapper over object relational mapping tools. To use the MyBatis data mapper, you rely on your own objects, XML, and SQL. There is little to learn that you don't already know. With the MyBatis data mapper, you have the full power of both SQL and stored procedures at your fingertips. (www.mybatis.org)

iBatis is based on the idea that there is value in relational databases and SQL, and that it is a good idea to embrace the industrywide investment in SQL. We have experiences whereby the database and even the SQL itself have outlived the application source code, and even multiple versions of the source code. In some cases we have seen that an application was rewritten in a different language, but the SQL and database remained largely unchanged. It is for such reasons that iBatis does not attempt to hide SQL or avoid SQL. It is a persistence layer framework that instead embraces SQL by making it easier to work with and easier to integrate into modern object-oriented software. These days, there are rumors that databases and SQL threaten our object models, but that does not have to be the case. iBatis can help to ensure that it is not. (iBatis in Action book)

So… what is iBatis?

A JDBC framework: developers write SQL, iBatis executes it using JDBC. No more try/catch/finally/try/catch.
An SQL mapper: automatically maps object properties to prepared statement parameters. Automatically maps result sets to objects. Support for getting rid of N+1 queries.
A transaction manager: iBatis will provide transaction management for database operations if no other transaction manager is available. iBatis will use external transaction management (Spring, EJB CMT, etc.) if available.
Great integration with Spring, but it can also be used without Spring (the Spring folks were early supporters of iBatis).

What isn't iBatis?

An ORM: it does not generate SQL, does not have a proprietary query language, does not know about object identity, does not transparently persist objects, and does not build an object cache.

Essentially, iBatis is a very lightweight persistence solution that gives you most of the semantics of an O/R mapping toolkit, without all the drama.

In other words, iBatis strives to ease the development of data-driven applications by abstracting the low-level details involved in database communication (loading a database driver, obtaining and managing connections, managing transaction semantics, etc.), as well as providing higher-level ORM capabilities (automated and configurable mapping of objects to SQL calls, data type conversion management, support for static queries as well as dynamic queries based upon an object's state, mapping of complex joins to complex object graphs, etc.). iBatis simply maps JavaBeans to SQL statements using a very simple XML descriptor. Simplicity is the key advantage of iBatis over other frameworks and object relational mapping tools. (http://www.developersbook.com)

Who is using iBatis/MyBatis?

See the list at this link: http://www.apachebookstore.com/confluence/oss/pages/viewpage.action?pageid=25

I think the biggest case is MySpace, with millions of users. Very nice!

This was just an introduction; in the next articles I will show how to create an application using iBatis/MyBatis, step by step. Enjoy!

From http://loianegroner.com/2011/02/introduction-to-ibatis-mybatis-an-alternative-to-hibernate-and-jdbc/
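To make the "SQL mapper" idea quoted above a little more concrete (object properties mapped to prepared statement parameters), here is a minimal sketch in plain Java. It is not iBatis/MyBatis code and does not talk to a database. The class `NamedSqlSketch`, its `bind` method and the `users` table are hypothetical names invented for this illustration. The `#{name}` placeholder style is borrowed from MyBatis 3 mapped statements, but the parsing here is a simplification.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: rewrites #{name} placeholders into JDBC-style
// '?' markers and collects the matching values in positional order.
public class NamedSqlSketch {

    /** The rewritten SQL plus the parameter values in the order of the '?' markers. */
    public static final class Bound {
        public final String sql;
        public final List<Object> values;

        Bound(String sql, List<Object> values) {
            this.sql = sql;
            this.values = values;
        }
    }

    private static final Pattern PLACEHOLDER = Pattern.compile("#\\{(\\w+)\\}");

    public static Bound bind(String sql, Map<String, Object> properties) {
        List<Object> values = new ArrayList<>();
        Matcher matcher = PLACEHOLDER.matcher(sql);
        StringBuffer rewritten = new StringBuffer();
        while (matcher.find()) {
            String name = matcher.group(1);
            if (!properties.containsKey(name)) {
                throw new IllegalArgumentException("No value for parameter: " + name);
            }
            values.add(properties.get(name));
            matcher.appendReplacement(rewritten, "?");
        }
        matcher.appendTail(rewritten);
        return new Bound(rewritten.toString(), values);
    }

    public static void main(String[] args) {
        // Stand-in for the properties of a JavaBean.
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("city", "Oakland");
        user.put("minAge", 21);

        Bound bound = bind(
            "select * from users where city = #{city} and age >= #{minAge}", user);
        System.out.println(bound.sql);    // select * from users where city = ? and age >= ?
        System.out.println(bound.values); // [Oakland, 21]
    }
}
```

With a real JDBC connection, the rewritten SQL would go to `Connection.prepareStatement` and each value to `PreparedStatement.setObject`. A mapping framework takes exactly that plumbing (and the reverse mapping of result sets to objects) off your hands.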

February 9, 2011

by Loiane Groner


Spring Data with Redis

The Spring Data project provides a solution for accessing data stored in new, emerging technologies such as NoSQL databases and cloud-based services. When we look into the SpringSource git repository, we see a lot of spring-data sub-projects:

spring-data-commons: common interfaces and utility classes for the other spring-data projects.
spring-data-column: support for column-based databases. It has not started yet, but there will be support for Cassandra and HBase.
spring-data-document: support for document databases. Currently MongoDB and CouchDB are supported.
spring-data-graph: support for graph-based databases. Currently Neo4j is supported.
spring-data-keyvalue: support for key-value databases. Currently Redis and Riak are supported, and Membase will probably be supported in the future.
spring-data-jdbc-ext: JDBC extensions; for example, Oracle RAC connection failover is implemented.
spring-data-jpa: simplifies a JPA-based data access layer.

I would like to share with you how you can use Redis. The first step is to download it from the redis.io web page. try.redis-db.com is a useful site where we can run Redis commands. It also provides a step-by-step tutorial that shows all the structures Redis supports (lists, sets, sorted sets and hashes) and some useful commands. A lot of reputable sites use Redis today.

After downloading and unpacking, we should compile Redis (version 2.2, the release candidate, is the preferable one to use, since some commands do not work in version 2.0.4):

make
sudo make install

Once we have run these commands, the following five programs are available:

redis-benchmark - for benchmarking the Redis server.
redis-check-aof - checks the AOF (append-only file) and can repair it.
redis-check-dump - checks RDB files for unprocessable opcodes.
redis-cli - the Redis client.
redis-server - the Redis server.

We can test the Redis server:

redis-server
[1055] 06 Jan 18:19:15 # Warning: no config file specified, using the default config.
In order to specify a config file use 'redis-server /path/to/redis.conf'
[1055] 06 Jan 18:19:15 * Server started, Redis version 2.0.4
[1055] 06 Jan 18:19:15 * The server is now ready to accept connections on port 6379
[1055] 06 Jan 18:19:15 - 0 clients connected (0 slaves), 1074272 bytes in use

And the Redis client:

redis-cli
redis> set my-super-key "my-super-value"
OK

Now we create a simple Java project in order to show how simple the spring-data-redis module really is:

mvn archetype:create -DgroupId=info.pietrowski -DpackageName=info.pietrowski.redis -DartifactId=spring-data-redis -Dpackage=jar

Next we have to add the Spring milestone repository to pom.xml and add spring-data-redis as a dependency. After that, all required dependencies will be fetched. Next we create a resources folder under the main folder and create application.xml, which will hold all the configuration.

We can configure the JedisConnectionFactory in two different ways. One: we can provide a JedisShardInfo object in the shardInfo property. Two: we can provide the host (default localhost), port (default 6379), password (default empty) and timeout (default 2000) properties. One thing to keep in mind is that the JedisShardInfo object takes precedence and allows us to set up a weight, but it only allows constructor injection. We can set up the factory to use connection pooling by setting the pooling property to 'true' (the default). See the comments in application.xml for three different ways of configuring it.

Note: two different libraries are supported, Jedis and JRedis. They have very similar names, and their factory classes are named almost identically. See the difference:

org.springframework.data.keyvalue.redis.connection.jedis.JedisConnectionFactory
org.springframework.data.keyvalue.redis.connection.jredis.JredisConnectionFactory

As is usual in Spring, we configure the template object by providing it with a connection factory. We will perform all operations through this template object.
By default we need to provide only the connection factory, but there are more properties we can set:

exposeConnection (default false) - whether we return the real connection or a proxy object.
keySerializer, hashKeySerializer, valueSerializer, hashValueSerializer (default JdkSerializationRedisSerializer) - the default delegates serialization to the Java serialization mechanism.
stringSerializer (default StringRedisSerializer) - a simple String to byte[] (and back) serializer using UTF-8 encoding.

We are ready to execute some code that works with the Redis instance. Spring Data provides two ways of interacting: the first is to use the execute method and provide a RedisCallback object; the second is to use the *Operations helpers (these will be explained later).

When we use RedisCallback, we have access to the low-level Redis commands. See this list of interfaces (I won't put all the methods here because the list is huge):

RedisConnection - gathers all Redis commands plus connection management.
RedisCommands - gathers all Redis commands (listed below).
RedisHashCommands - Hash-specific Redis commands.
RedisListCommands - List-specific Redis commands.
RedisSetCommands - Set-specific Redis commands.
RedisStringCommands - key/value specific Redis commands.
RedisTxCommands - Transaction/Batch specific Redis commands.
RedisZSetCommands - Sorted Set-specific Redis commands.

Check the RedisCallbackExample class. This was the hard way; the problem is that we have to convert our objects to byte arrays and back. The second way is easier: Spring Data provides Operations objects, so we have a much simpler API, and all byte<->object conversion is done by the serializer we set up (or the default one).

Higher-level API (you will easily recognize the *Operations equivalents of the *Commands interfaces):

HashOperations - Redis hash operations.
ListOperations - Redis list operations.
SetOperations - Redis set operations.
ValueOperations - Redis 'string' operations.
ZSetOperations - Redis sorted set operations.

Most methods take the key as their first parameter, so there is an even better API for multiple operations on the same key:

BoundHashOperations - Redis hash operations for a specific key.
BoundListOperations - Redis list operations for a specific key.
BoundSetOperations - Redis set operations for a specific key.
BoundValueOperations - Redis 'string' operations for a specific key.
BoundZSetOperations - Redis sorted set operations for a specific key.

Check the RedisCallbackExample class to see some easy examples of *Operations usage.

One important thing to mention is that you should use a string serializer (StringRedisSerializer) for keys. Otherwise you will have problems with other clients, because standard Java serialization adds a stream header and type information, and you end up with keys such as:

"\xac\xed\x00\x05t\x00\x05atInt"
"\xac\xed\x00\x05t\x00\nmySuperKey"
"\xac\xed\x00\x05t\x00\bsuperKey"

Up until now we have only looked at the API for Redis, but Spring Data offers more. All the cool stuff is in the org.springframework.data.keyvalue.redis.support package and its sub-packages. We have:

RedisAtomicInteger - Atomic integer (CAS operation) backed by Redis.
RedisAtomicLong - Same as the previous one, for Long.
RedisList - Redis extension for List, Queue, Deque, BlockingDeque and BlockingQueue with two additional methods: List range(start, end) and RedisList trim(start, end).
RedisSet - Redis extension for Set with additional methods: diff, diffAndStore, intersect, intersectAndStore, union, unionAndStore.
RedisZSet - Redis extension for SortedSet. Note that Comparator is not applicable here, so this interface extends the normal Set and provides methods similar to those of SortedSet.
RedisMap - Redis extension for Map with an additional Long increment(key, delta) method.

Every interface currently has one Default implementation. Check application-support.xml for examples of configuration and RedisSupportClassesExample for examples of use. There is a lot of useful information in the comments as well.
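To see where the odd-looking key prefix comes from, the following sketch compares plain Java serialization with UTF-8 encoding of the same key. It uses only the JDK and no Spring Data classes. The class name `KeyBytesSketch` and its helper methods are made up for this illustration, and it assumes (as the text above describes) that the default serializer delegates to the standard Java serialization mechanism.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of Spring Data.
public class KeyBytesSketch {

    /** Serializes a key with standard Java serialization (ObjectOutputStream). */
    public static byte[] jdkSerialized(String key) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Encodes a key as plain UTF-8 bytes, which is what a string serializer does. */
    public static byte[] utf8(String key) {
        return key.getBytes(StandardCharsets.UTF_8);
    }

    /** Prints printable ASCII as-is and everything else as a \xNN escape. */
    public static String printable(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            int v = b & 0xff;
            if (v >= 0x20 && v < 0x7f) {
                sb.append((char) v);
            } else {
                sb.append(String.format("\\x%02x", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stream magic (ac ed), version (00 05), string tag 't', 2-byte length, then the text.
        System.out.println(printable(jdkSerialized("atInt"))); // \xac\xed\x00\x05t\x00\x05atInt
        System.out.println(printable(utf8("atInt")));          // atInt
    }
}
```

The first output line matches the first key shown above. Any client that does not speak Java serialization would have to reproduce those seven leading bytes to find the key, which is why string keys are the safer choice.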
Summary

The library is a first milestone release, so there are minor bugs, the documentation isn't as polished as we are used to, and the current version requires a Redis server that is not yet stable (the 2.2 release candidate). But this is definitely a great library that allows us to use all this cool NoSQL stuff in a "standard" Spring data access manner. Awesome job!

This post is only useful if you check out the code from bitbucket; for the lazy ones, here is the spring-data-redis zip file as well.

This post is originally from http://pietrowski.info/2011/01/spring-data-redis-tutorial/

February 3, 2011

by Sebastian Pietrowski


Apache Solr: Get Started, Get Excited!

We've all seen them on various websites: crappy search utilities. They are a constant reminder that search is not something you should take lightly when building a website or application. Search is not just Google's game anymore. When a Java library called Lucene was introduced into the Apache ecosystem, and then Solr was built on top of that, open source developers began to wield some serious power when it came to customizing search features. In this article you'll be introduced to Apache Solr and a wealth of applications that have been built with it. The content is divided as follows:

1. Introduction
2. Setup Solr
3. Applications
4. Summary

1. Introduction

Apache Solr is an open source search server. It is based on the full-text search engine Apache Lucene. So basically Solr is an HTTP wrapper around an inverted index provided by Lucene. An inverted index can be seen as a list of words where each word entry links to the documents it is contained in. That way, getting all documents for the search query "dzone" is a simple 'get' operation.

One advantage of Solr in enterprise projects is that you don't need to write any Java code, although Java itself has to be installed. If you are unsure when to use Solr and when Lucene, these answers could help. If you need to build your Solr index from websites, you should take a look at the open source crawler Apache Nutch before creating your own solution.

To be convinced that Solr is actually used in a lot of enterprise projects, take a look at this amazing list of public projects powered by Solr. If you encounter problems, the mailing list or Stack Overflow will help you. To make the introduction complete, I would like to mention my personal link list and the resources page, which lists books, articles and more interesting material.

2. Setup Solr

2.1. Installation

As the very first step, you should follow the official tutorial, which covers the basic aspects of any search use case:

Indexing - get data of any form into Solr. Examples: JSON, XML, CSV and SQL databases. This step creates the inverted index, i.e. it links every term to its documents.
Querying - ask Solr to return the most relevant documents for the user's query.

To follow the official tutorial you'll have to download Java and the latest version of Solr here. More information about installation is available in the official description.

Next you'll have to decide which web server to use for Solr. In the official tutorial Jetty is used, but you can also use Tomcat. If you choose Tomcat, be sure to set the UTF-8 encoding in server.xml.

I would also research the different versions of Solr, which can be quite confusing for beginners:

The current stable version is 1.4.1. Use this if you need a stable search and don't need any of the latest features.
The next stable version of Solr will be 3.x. Versions 1.5 and 2.x will be skipped in order to reach the same versioning as Lucene.
Version 4.x is the latest development branch. Solr 4.x handles advanced features like language detection via Tika, spatial search, results grouping (group by field / collapsing), a new "user-facing" query parser (edismax handler), near-real-time indexing, huge fuzzy search performance improvements, a SQL-join-like feature and more.

2.2. Indexing

If you've followed the official tutorial, you have pushed some XML files into the Solr index. This process is called indexing or feeding. There are many more ways to get data into Solr:

Using the Data Import Handler (DIH) is a really powerful, language-neutral option. It allows you to read from a SQL database, from CSV, XML files, RSS feeds, emails, etc. without any Java knowledge. DIH handles full-imports and delta-imports.
This is necessary when only a small number of documents were added, updated or deleted.
The HTTP interface is used by the post tool, which you have already used in the official tutorial to index XML files.
Client libraries in different languages also exist (e.g. for Java (SolrJ) or Python).

Before indexing you'll have to decide which data fields should be searchable and how the fields should be indexed. For example, when you have a field with HTML in it, you can strip irrelevant characters, tokenize the text into 'searchable terms', lowercase the terms and finally stem the terms. In contrast, if you have a field whose text should not be interpreted (e.g. URLs), you shouldn't tokenize it and should use the default field type string. Please refer to the official documentation about field and field type definitions in the schema.xml file.

When designing an index, keep in mind the advice from Mauricio: "The document is what you will search for." For example, if you have tweets and you want to search for similar users, you'll need to set up a user index created from the tweets. Then every document is a user. If you want to search for tweets, set up a tweet index; then every document is a tweet. Of course, you can set up both indices with the multi-index options of Solr.

Please also note that there is a project called Solr Cell, which lets you extract the relevant information from several different document types with the help of Tika.

2.3. Querying

For debugging it is very convenient to use the HTTP interface with a browser to query Solr and get back XML. Use Firefox and the XML will be displayed nicely. You can also use the Velocity contribution, a cross-browser tool, which will be covered in more detail in the section about 'search application prototyping'.

To query the index you can use the dismax handler or the standard query handler.
You can filter and sort the results:

q=superman&fq=type:book&sort=price asc

You can also do a lot more; one other concept is boosting. In Solr you can boost while indexing and while querying. To prefer the terms in the title, write:

q=title:superman^2 subject:superman

When using the dismax request handler, write:

q=superman&qf=title^2 subject

Check out all the various query options like fuzzy search, spellcheck query input, facets, collapsing and suffix query support.

3. Applications

Now I will list some interesting use cases for Solr, in no particular order, to show how powerful and flexible this open source search server is.

3.1. Drupal Integration

The Drupal integration can be seen as a generic use case for integrating Solr into PHP projects. For the PHP integration you can either use the HTTP interface for querying and retrieving XML or JSON, or use the PHP Solr client library.

[Screenshot: a typical faceted search in Drupal]

For more information about faceted search, look into the Solr wiki.

More PHP projects that integrate Solr:

Open source TYPO3 Solr module.
Magento Enterprise - Solr module. The open source integration is outdated.
Oxid - Solr module. No open source integration available.

3.2. Hathi Trust

The Hathi Trust project is a nice example that proves Solr's ability to search big digital libraries. To quote directly from the article: "... the index for our one million book index is over 200 gigabytes ... so we expect to end up with a two terabyte index for 10 million books"

Other examples of libraries:

VuFind - aims to replace OPAC
Internet Archive
National Library of Australia

3.3. Auto Suggestions

Mainly, there are two approaches to implementing auto-suggestions (also called auto-completion) with Solr: via facets or via NGramFilterFactory. To push it to the extreme, you can keep a Lucene index entirely in RAM. This approach is used in a large music shop in Germany.

Live examples of auto suggestions: kaufda.de

3.4. Spatial Search Applications

When spatial search is mentioned, people have geography-based applications in mind. With Solr, this ordinary use case is attainable. Some examples:

City Search - city guides
Yellow Pages
kaufda.de

Spatial search can be useful in many other ways too: for bioinformatics, fingerprint search, facial search, etc. (Getting the fingerprint of a document is important for duplicate detection.) The simplest approach is implemented in Jetwick to reduce duplicate tweets, but this yields a performance of O(n), where n is the number of queried terms. This is okay for 10 or fewer terms, but it can get even better at O(1)! The idea is to use a special hash set to get all similar documents. This technique is called locality-sensitive hashing. Read this nice paper about 'near similarity search and plagiarism analysis' for more information.

3.5. DuckDuckGo

DuckDuckGo is made with open source, and its "zero click" information is done with the help of Solr using the dismax query handler. The index for that feature contains 18M documents and has a size of ~12 GB. For this use case, Solr had to be tuned:

"I have two requirements that differ a bit from most sites with respect to Solr: I generally only show one result, with sometimes a couple below if you click on them. Therefore, it was really important that the first result is what people expected. False positives are really bad in 0-click, so I needed a way to not show anything if a match wasn't too relevant. I got around these by a) tweaking dismax and schema and b) adding my own relevancy filter on top that would re-order and not show anything in various situations."

All the rest is done with tuned open source products. To quote Gabriel again: "The main results are a hybrid of a lot of things, including external APIs, e.g. Bing, WolframAlpha, Yahoo, my own indexes and negative indexes (spam removal), etc. There are a bunch of different types of data I'm working with."

Check out the other cool features such as privacy or bang searches.

3.6. Clustering Support with Carrot2

Carrot2 is one of the "contributed plugins" of Solr. With Carrot2 you can support clustering: "Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense." See some research papers regarding clustering here.

[Figure: clustering applied to the search "pannous", our company]

3.7. Near Real Time Search

Solr isn't real time yet, but you can tune Solr to the point where it becomes near real time, which means that the time ('real-time latency') a document takes to become searchable after it gets indexed is less than 60 seconds, even if you need to update frequently. To make this work, you can set up two indices: one write-only index "w" for the indexer and one read-only index "r" for your application. Index r refers to the same data directory as w, which has to be defined in the solrconfig.xml of r via:

<dataDir>/pathto/indexw/data/</dataDir>

To make sure your users and the r index see the indexed documents of w, you have to trigger an empty commit every 60 seconds:

wget -q http://localhost:port/solr/update?stream.body=%3Ccommit/%3E -O /dev/null

Every time such a commit is triggered, a new searcher without any cache entries is created. This can harm performance for visitors hitting the empty cache directly after the commit, but you can fill the cache with static searches with the help of the newSearcher entry in your solrconfig.xml. Additionally, the autowarmCount property needs to be tuned; it fills the cache of a new searcher from old entries.

Also, take a look at the article 'Scaling Lucene and Solr', where experts explain in detail what to do with large indices (=> 'sharding') and what to do for high query volume (=> 'replicating').

3.8. Loggly = Full Text Search in Logs

Feeding log files into Solr and searching them in near real time shows that Solr can handle massive amounts of data and query it quickly. I've set up a simple project where I'm doing similar things, but Loggly has done a lot more to make the same task real-time and distributed. You'll need to keep the write index as small as possible; otherwise the commit time will increase too much. Loggly creates a new Solr index every 5 minutes and includes it when searching, using the distributed capabilities of Solr! They merge the cores to keep the number of indices small, but this is not as simple as it sounds. Watch this video to get some details about their work.

3.9. Solandra = Solr + Cassandra

Solandra combines Solr and the distributed database Cassandra, which was created by Facebook for its inbox search and then open sourced. At the moment Solandra is not intended for production use. There are still some bugs, and the distributed limitations of Solr apply to Solandra too. The developers are working very hard to make Solandra better. Jetwick can now run via Solandra just by changing the solrconfig.xml. Solandra also has the advantages of being real-time (no optimize, no commit!) and distributed without any major setup involved. The same is true for Solr Cloud.

3.10. Category Browsing via Facets

Solr provides facets, which make it easy to show the user some useful filter options like those shown in the "Drupal integration" example. As I described earlier, it is even possible to browse through a deep category tree. The main advantage here is that the categories depend on the query. This way the user can further filter the search results with the category tree you provide. Here is an example where this feature is implemented for one of the biggest second-hand stores in Germany.

[Screenshot: a click on 'Schauspieler' shows its sub-items]

Other shops: game-change

3.11. Jetwick - Open Twitter Search

You may have noticed that Twitter is using Lucene under the hood. Twitter has a very extreme use case: over 1,000 tweets per second, over 12,000 queries per second, but the real-time latency is under 10 seconds! However, the relevancy at that volume is often not that good in my opinion. Twitter search often contains a lot of duplicates and noise. Reducing this was one reason I created Jetwick in my spare time. I'm mentioning Jetwick here because it makes extreme use of facets, which provide all the filters to the user. Facets are used for the RSS-like feature (saved searches), the various filters like language and retweet count on the left, and to get trending terms and links on the right.

To make Jetwick more scalable I'll need to decide which of the following distribution options to choose:

Use Solr Cloud with ZooKeeper
Use Solandra
Move from Solr to ElasticSearch, which is also based on Apache Lucene

Other examples with a lot of facets:

CNET Reviews - product reviews. Electronics reviews, computer reviews & more.
Shopper.com - compare prices and shop for computers, cell phones, digital cameras & more.
Zappos - shoes and clothing.
Manta.com - find companies. Connect with customers.

3.12. Plaxo - Online Address Management

Plaxo.com, which is now owned by Comcast, hosts web address books for more than 40 million people and offers smart search through the addresses, with the help of Solr. Plaxo is trying to get the latest 'social' information about your contacts through blog posts, tweets, etc. Plaxo also tries to reduce duplicates.

3.13. Replace FAST or Google Search

Several users report that they have migrated from a commercial search solution like FAST or Google Search Appliance (GSA) to Solr (or Lucene). The reasons for that migration differ: FAST drops Linux support, and Google can cause integration problems.
The main reason for me is that Solr isn't a black box: you can tweak the source code, maintain old versions and fix your bugs more quickly!

3.14. Search Application Prototyping

With the help of the already integrated Velocity plugin and the Data Import Handler, it is possible to create an application prototype for your search within a few hours. The next version of Solr makes the use of Velocity easier. The GUI is available via http://localhost:port/solr/browse

If you are a Ruby on Rails user, you can take a look at Flare. To learn more about search application prototyping, check out this video introduction and take a look at these slides.

3.15. Solr as a Whitelist

Imagine you are the new Google and you have a lot of different types of data to display, e.g. 'news', 'video', 'music', 'maps', 'shopping' and much more. Some of those types can only be retrieved from legacy systems, and you only want to show the most appropriate types based on your business logic. For example, a query that contains 'new york' should result in the selection of results from 'maps', but 'new yorker' should prefer results from the 'shopping' type.

With Solr you can set up such a whitelist index that will help decide which type is more important for the search query. For example, if you get more, or more relevant, results for the 'shopping' type, then you should prefer results from this type. Without the whitelist index, i.e. with all data in separate indices or systems, it would be nearly impossible to compare the relevancy.

The whitelist index can be used as illustrated in the next steps:

1. Query the whitelist index.
2. Decide which data types to display.
3. Query the sub-systems.
4. Display results from the selected types only.

3.16. Future

Solr is also useful for scientific applications, such as DNA search systems.
I believe Solr can also be used for completely different alphabets, so that you can query nucleotide sequences instead of words to get the matching genes and determine which organism the sequence occurs in, something similar to BLAST.

Another idea you could harness would be to build a very personalized search. Every user can drag and drop their websites of choice and query them afterwards. For example, often I only need Stack Overflow, some wikis and some mailing lists with the expected results, but normal web search engines (Google, Bing, etc.) give me results that are too cluttered.

My final idea for a future Solr-based app is a Lucene/Solr implementation of desktop search. Solr's facets would be especially handy for quickly filtering different sources (files, folders, bookmarks, man pages, ...). It would be a great way to wade through those extra-messy desktops.

4. Summary

The next time you think about a problem, think about Solr! Even if you don't know Java and even if you know nothing about search, Solr should be in your toolbox. Solr doesn't only offer professional full-text search; it can also add valuable features to your application. Some of them I covered in this article, but I'm sure there are still some exciting possibilities waiting for you!
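For readers who want to see the inverted index from the introduction in code, here is a toy sketch in plain Java. It is not how Lucene or Solr store their index. The class `InvertedIndexSketch`, its methods and the sample documents are invented for this illustration, and the "analysis" is reduced to lowercasing and splitting on non-alphanumeric characters (no stemming, no relevance ranking).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy illustration only: maps each term to the ids of the documents containing it.
public class InvertedIndexSketch {

    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** "Indexing": split the text into lowercase terms and link each term to the document. */
    public void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue; // split() yields an empty token if the text starts with a separator
            }
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** "Querying": a single-term lookup is just a map 'get'. */
    public SortedSet<Integer> search(String term) {
        return postings.getOrDefault(
            term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "DZone publishes Solr articles");
        index.add(2, "Solr is built on Lucene");
        index.add(3, "Search is not just Google's game");

        System.out.println(index.search("dzone"));     // [1]
        System.out.println(index.search("Solr"));      // [1, 2]
        System.out.println(index.search("hibernate")); // []
    }
}
```

Everything that makes a real search server useful (analysis chains, scoring, facets, persistence, distribution) sits on top of this basic term-to-documents lookup.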

January 25, 2011

by Peter Karussell


Linqer – a nice tool for SQL to LINQ transition

Almost all .NET developers who have worked on several applications are probably familiar with writing SQL queries for specific needs within the application. Before LINQ came on the scene, about 60-70% of my daily programming was writing code either in the front end (ASPX, JavaScript, jQuery, HTML/CSS, etc.) or in the back end (C#, VB.NET, etc.), and about 30-40% was writing SQL queries used within the application. Now that LINQ is here, I feel that the share of my day spent writing SQL queries has gone down to about 10%. I'm not saying it won't change over time, depending on which technology I use in my projects or which approach works better, but since I've been writing a lot of LINQ code in my latest projects, I decided to see whether there is a tool that can automatically translate SQL to LINQ, so that I can move many queries into the code as LINQ statements.

Linqer is a tool that I have tested over the previous two weeks, and I see it works pretty well. I'm not using it yet to convert SQL to LINQ code, because I did the conversion manually before I discovered that Linqer could have helped me. Still, I would recommend it to those who are just starting with LINQ and already know how to write SQL queries.

Let's go through several steps to help you get started faster…

1. Go to the http://www.sqltolinq.com/ website and download the version you want. There is Linqer version 4.0.1 for .NET 4.0 and Linqer version 3.5.1 for .NET 3.5.

2. Once you have downloaded the zip file, extract it, launch Linqer4Inst.exe and choose an install location. Linqer.exe will be created in the location you choose.

3. Launch Linqer.exe.
Once you run it for the first time, the Linqer Connection Pool will be displayed so that you can create a connection to your existing model. Click the Add button. Right after this, the following window will appear:

#1 – The name of the connection string you are creating
#2 – Click "…" to construct your connection string using the wizard window
#3 – Choose your language, either C# or VB
#4 – Model: LINQ to SQL or LINQ to Entities. Right after you select LINQ to SQL, the options to select the files for the model will be displayed. In our case I will select LINQ to SQL. You can select an existing model from your application, or you can use Generate LINQ to SQL Files so that the *.dbml and *.designer.cs files are filled in automatically.
#5 – At the end, you can choose the context name of the model, which will be used when generating the LINQ code.

Once you are done, click OK. You will get back to the parent window, now filled with all the needed info; click Close.

Note: You can later add additional connections to your Linqer Connections Pool from Tools –> Linqer Connections.

In the root folder where Linqer.exe is placed, you now have a Linqer.ini file containing the connection string settings.

OK, now let's go to the interesting part. Let's create a first, simple SQL query and try to translate it to a LINQ statement.

SQL Query

select * from authors a where a.city = 'Oakland'

[Screenshot: the result after adding this query to Linqer]

So, the LINQ code is similar to the SQL code and is easy to read since it's simple. Also, notice that the tool generates a class (you can set the class name) with code prepared for use in your project. Perfect!
Now, let's try to translate a query with two joined tables (a little bit more complex):

SQL Query

    select * from employee
    left join publishers on employee.pub_id = publishers.pub_id
    where employee.fname like '%a'

The LINQ generated code is:

    from employee in db.Employee
    join publishers in db.Publishers
        on employee.Pub_id equals publishers.Pub_id into publishers_join
    from publishers in publishers_join.DefaultIfEmpty()
    where employee.Fname.EndsWith("a")
    select new {
        employee.Emp_id, employee.Fname, employee.Minit, employee.Lname,
        employee.Job_id, employee.Job_lvl, employee.Pub_id, employee.Hire_date,
        Column1 = publishers.Pub_id,
        Pub_name = publishers.Pub_name,
        City = publishers.City,
        State = publishers.State,
        Country = publishers.Country
    }

So, if you look at the where clause, the SQL query says ... like '%a' and the corresponding LINQ code in C# is ... EndsWith("a"). Excellent! The left join is expressed with join ... into publishers_join followed by DefaultIfEmpty(), which keeps employees that have no matching publisher.

And the class automatically generated by the tool is:

    public class EmployeePubClass
    {
        // Backing fields, one per selected column
        private String _Emp_id;
        private String _Fname;
        private String _Minit;
        private String _Lname;
        private Int16? _Job_id;
        private Byte? _Job_lvl;
        private String _Pub_id;
        private DateTime? _Hire_date;
        private String _Column1;
        private String _Pub_name;
        private String _City;
        private String _State;
        private String _Country;

        public EmployeePubClass(
            String AEmp_id, String AFname, String AMinit, String ALname,
            Int16? AJob_id, Byte? AJob_lvl, String APub_id, DateTime? AHire_date,
            String AColumn1, String APub_name, String ACity, String AState,
            String ACountry)
        {
            _Emp_id = AEmp_id;
            _Fname = AFname;
            _Minit = AMinit;
            _Lname = ALname;
            _Job_id = AJob_id;
            _Job_lvl = AJob_lvl;
            _Pub_id = APub_id;
            _Hire_date = AHire_date;
            _Column1 = AColumn1;
            _Pub_name = APub_name;
            _City = ACity;
            _State = AState;
            _Country = ACountry;
        }

        // Read-only properties
        public String Emp_id { get { return _Emp_id; } }
        public String Fname { get { return _Fname; } }
        public String Minit { get { return _Minit; } }
        public String Lname { get { return _Lname; } }
        public Int16? Job_id { get { return _Job_id; } }
        public Byte? Job_lvl { get { return _Job_lvl; } }
        public String Pub_id { get { return _Pub_id; } }
        public DateTime? Hire_date { get { return _Hire_date; } }
        public String Column1 { get { return _Column1; } }
        public String Pub_name { get { return _Pub_name; } }
        public String City { get { return _City; } }
        public String State { get { return _State; } }
        public String Country { get { return _Country; } }
    }

    public class List: List<EmployeePubClass>
    {
        public List(Pubs db)
        {
            var query =
                from employee in db.Employee
                join publishers in db.Publishers
                    on employee.Pub_id equals publishers.Pub_id into publishers_join
                from publishers in publishers_join.DefaultIfEmpty()
                where employee.Fname.EndsWith("a")
                select new {
                    employee.Emp_id, employee.Fname, employee.Minit, employee.Lname,
                    employee.Job_id, employee.Job_lvl, employee.Pub_id, employee.Hire_date,
                    Column1 = publishers.Pub_id,
                    Pub_name = publishers.Pub_name,
                    City = publishers.City,
                    State = publishers.State,
                    Country = publishers.Country
                };

            // Copy each anonymous result row into the generated class
            foreach (var r in query)
                Add(new EmployeePubClass(
                    r.Emp_id, r.Fname, r.Minit, r.Lname, r.Job_id, r.Job_lvl,
                    r.Pub_id, r.Hire_date, r.Column1, r.Pub_name, r.City,
                    r.State, r.Country));
        }
    }

Great! We have a ready-to-use class for our application, and we don't need to type all this code.

Besides generating code, you can also use the tool to run the query and see the results from the database.

I like this tool mainly because it's very easy to use, lightweight, and does the job in a straightforward way. You can try the tool and send me feedback using the comments in this blog post.
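As a postscript to the left join example: the join ... into ... DefaultIfEmpty() pattern can also be tried without a database. The sketch below is a minimal illustration over in-memory lists, and the Employee and Publisher classes, the LeftJoinSketch helper, and the sample rows are all made up for this purpose. One difference worth knowing: with in-memory collections the unmatched side is a null reference, so it has to be checked before reading its properties, whereas the generated LINQ to SQL query leaves that to the database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down stand-ins for the employee and publishers tables.
public class Employee
{
    public string Fname { get; set; }
    public string Pub_id { get; set; }
}

public class Publisher
{
    public string Pub_id { get; set; }
    public string Pub_name { get; set; }
}

public static class LeftJoinSketch
{
    // Left outer join plus a "like '%a'" style filter, over in-memory data.
    public static List<string> Run(IEnumerable<Employee> employees,
                                   IEnumerable<Publisher> publishers)
    {
        var query =
            from e in employees
            join p in publishers on e.Pub_id equals p.Pub_id into pubJoin
            from p in pubJoin.DefaultIfEmpty()   // p is null when nothing matched
            where e.Fname.EndsWith("a")
            select e.Fname + " -> " + (p == null ? "(no publisher)" : p.Pub_name);
        return query.ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up sample rows.
        var employees = new List<Employee>
        {
            new Employee { Fname = "Maria", Pub_id = "P1" },
            new Employee { Fname = "Anna",  Pub_id = "P9" },  // no such publisher
            new Employee { Fname = "Paolo", Pub_id = "P1" }   // filtered out by EndsWith
        };
        var publishers = new List<Publisher>
        {
            new Publisher { Pub_id = "P1", Pub_name = "Example Press" }
        };

        foreach (var line in LeftJoinSketch.Run(employees, publishers))
            Console.WriteLine(line);
        // prints:
        // Maria -> Example Press
        // Anna -> (no publisher)
    }
}
```

Note that EndsWith on in-memory strings is case-sensitive, while the behavior of LIKE in SQL depends on the database collation, so the two can differ on mixed-case data.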

January 24, 2011

by Hajan Selmani

· 67,483 Views