Data Engineering Resources

The Latest Data Engineering Topics

HashMap – Single Key and Multiple Values Example

Sometimes you want to store multiple values for the same hash key. The following code examples show you three different ways to do this.
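
As a rough illustration of the idea, one standard-library way to keep several values under one key is to map each key to a list. This is only a sketch and is not necessarily one of the three ways the linked article shows; the class name MultiValueMapSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: a map that keeps every value stored under a key. */
final class MultiValueMapSketch {
    private final Map<String, List<String>> valuesByKey = new HashMap<>();

    /** Adds a value under the key, creating the backing list on first use. */
    void put(final String key, final String value) {
        valuesByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** Returns a copy of the values stored under the key (empty if none). */
    List<String> get(final String key) {
        return new ArrayList<>(
            valuesByKey.getOrDefault(key, Collections.<String>emptyList()));
    }

    public static void main(final String[] args) {
        final MultiValueMapSketch fruitColors = new MultiValueMapSketch();
        fruitColors.put("apple", "red");
        fruitColors.put("apple", "green");
        fruitColors.put("banana", "yellow");
        System.out.println(fruitColors.get("apple")); // prints [red, green]
    }
}
```

Returning a copy from get() keeps callers from modifying the internal lists by accident; whether that is worth the extra allocation depends on the use case.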

October 26, 2013

by Jagadeesh Motamarri

· 799,629 Views · 9 Likes

Extracting File Metadata with C# and the .NET Framework

How to extract extended image metadata using C# and the Windows API Code Pack, simplifying access to detailed file properties typically seen in Windows Explorer.

October 26, 2013

by Rob Sanders

· 39,957 Views · 2 Likes

Too Many Parameters in Java Methods, Part 7: Mutable State

In this seventh post of my series on addressing the issue of too many parameters in a Java method or constructor, I look at using state to reduce the need to pass parameters. One of the reasons I have waited until the seventh post of this series to address this is that it is one of my least favorite approaches for reducing parameters passed to methods and constructors. That stated, there are multiple flavors of this approach and I definitely prefer some flavors over others.
Perhaps the best known and most widely scorned approach in all of software development for using state to reduce method parameters is the use of global variables. Although it may be semantically accurate to say that Java does not have global variables, the reality is that, for good or for bad, the equivalent of global variables is achieved in Java via public static constructs. A particularly popular way to achieve this in Java is via the Stateful Singleton.
In Patterns of Enterprise Application Architecture, Martin Fowler wrote that "any global data is always guilty until proven innocent." Global variables and "global-like" constructs in Java are considered bad form for several reasons. They can make it difficult for developers maintaining and reading code to know where the values are defined, where they were last changed, or even where they come from. By their very nature and intent, global data violates the principles of encapsulation and data hiding.
Miško Hevery has written the following regarding the problems of static globals in an object-oriented language: "Accessing global state statically doesn’t clarify those shared dependencies to readers of the constructors and methods that use the Global State. Global State and Singletons make APIs lie about their true dependencies. ... The root problem with global state is that it is globally accessible. In an ideal world, an object should be able to interact only with other objects which were directly passed into it (through a constructor, or method call)."
Having state available globally reduces the need for parameters because there is no need for one object to pass data to another object if both objects already have direct access to that data. However, as Hevery put it, that's completely orthogonal to the intent of object-oriented design.
Mutable state is also an increasing problem as concurrent applications become more common. In his JavaOne 2012 presentation on Scala, Scala creator Martin Odersky stated that "every piece of mutable state you have is a liability" in a highly concurrent world and added that the problem is "non-determinism caused by concurrent threads accessing shared mutable state."
Although there are reasons to avoid mutable state, it remains a generally popular approach in software development. I think there are several reasons for this, including that it's superficially easy to write code that shares mutable state and that shared mutable state does provide ease of access. Some types of mutable data are popular because they have been taught and learned as effective for years. Finally, there are times when mutable state may be the most appropriate solution. For that last reason and to be complete, I now look at how the use of mutable state can reduce the number of parameters a method must expect.
Stateful Singleton and Static Variables
A Java implementation of Singleton and other public Java static fields are generally available to any Java code within the same Java Virtual Machine (JVM) and loaded with the same classloader [for more details, see When is a Singleton not a Singleton?]. Any data stored universally (at least from the JVM/classloader perspective) is already available to client code in the same JVM and loaded with the same classloader. Because of this, there is no need to pass that data between clients and methods or constructors in that same JVM/classloader combination.
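A minimal sketch of that trade-off follows. The names Greeter, GlobalSettings and currentUser are hypothetical and chosen only for illustration; the point is that the static field removes the parameter but hides the dependency from the method signature.

```java
/** Hypothetical client code comparing global state with an explicit parameter. */
final class Greeter {
    /** No parameter needed, but the dependency on GlobalSettings is hidden from callers. */
    static String greetUsingGlobal() {
        return "Hello, " + GlobalSettings.currentUser;
    }

    /** The explicit alternative: the dependency appears in the signature. */
    static String greet(final String user) {
        return "Hello, " + user;
    }

    public static void main(final String[] args) {
        GlobalSettings.currentUser = "Fred";
        System.out.println(greetUsingGlobal()); // prints Hello, Fred
        System.out.println(greet("Wilma"));     // prints Hello, Wilma
    }
}

/** Hypothetical holder standing in for a Stateful Singleton. */
final class GlobalSettings {
    /** Public static mutable state: any code loaded by the same classloader can change it. */
    public static String currentUser = "anonymous";

    private GlobalSettings() {}
}
```

Any code that assigns GlobalSettings.currentUser changes what greetUsingGlobal() returns, which is exactly the hidden coupling the quotations above warn about.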
Instance State
While "statics" are considered "globally available," narrower instance-level state can also be used in a similar fashion to reduce the need to pass parameters between methods of the same class. An advantage of this over global variables is that the accessibility is limited to instances of the class (private fields) or instances of the class's children (protected fields). Of course, if the fields are public, accessibility is pretty wide open, but the same data is not automatically available to other code in the same JVM/classloader. The next code listing demonstrates how state data can be, and sometimes is, used to reduce the need for parameters between two methods internal to a given class.
Example of Instance State Used to Avoid Passing Parameters
    /**
     * Simple example of using instance variable so that there is no need to
     * pass parameters to other methods defined in the same class.
     */
    public void doSomethingGoodWithInstanceVariables() {
        this.person = Person.createInstanceWithNameAndAddressOnly(
            new FullName.FullNameBuilder(new Name("Flintstone"), new Name("Fred")).createFullName(),
            new Address.AddressBuilder(new City("Bedrock"), State.UN).createAddress());
        printPerson();
    }

    /**
     * Prints instance of Person without requiring it to be passed in because it
     * is an instance variable.
     */
    public void printPerson() {
        out.println(this.person);
    }
The above example is somewhat contrived and simplified, but does illustrate the point: the instance variable person can be accessed by other instance methods defined in the same class, so that instance does not need to be passed between those instance methods. This does shorten the signatures of what are potentially internal methods (their public accessibility means they may also be used by external code), but it also introduces state and means that the invoked method now impacts the state of that same object. In other words, the benefit of not having to pass the parameter comes at the cost of another piece of mutable state.
The other side of the trade-off, needing to pass the instance of Person because it is not an instance variable, is shown in the next code listing for comparison.
Example of Passing Parameter Rather than Using Instance Variable
    /**
     * Simple example of passing a parameter rather than using an instance variable.
     */
    public void doSomethingGoodWithoutInstanceVariables() {
        final Person person = Person.createInstanceWithNameAndAddressOnly(
            new FullName.FullNameBuilder(new Name("Flintstone"), new Name("Fred")).createFullName(),
            new Address.AddressBuilder(new City("Bedrock"), State.UN).createAddress());
        printPerson(person);
    }

    /**
     * Prints instance of Person that is passed in as a parameter.
     *
     * @param person Instance of Person to be printed.
     */
    public void printPerson(final Person person) {
        out.println(person);
    }
The previous two code listings illustrate that parameter passing can be reduced by using instance state. I generally prefer not to use instance state solely to avoid parameter passing. If instance state is needed for other reasons, then the reduction of parameters to be passed is a nice side benefit, but I don't like introducing unnecessary instance state simply to remove or reduce the number of parameters. Although there was a time when the readability of reduced parameters might have justified instance state in a large single-threaded environment, I feel that the slight readability gain from reduced parameters is not worth the cost of classes that are not thread-safe in an increasingly multi-threaded world. I still don't like to pass a whole lot of parameters between methods of the same class, but I can use a parameters object (perhaps a class with package-private scope) and pass that object around instead of the large number of individual parameters.
JavaBean Style Construction
The JavaBeans convention/style has become extremely popular in the Java development community.
Many frameworks such as Spring Framework and Hibernate rely on classes adhering to the JavaBeans conventions, and some standards such as the Java Persistence API are also built around the JavaBeans conventions. There are multiple reasons for the popularity of the JavaBeans style, including its ease of use and the ability to use reflection against code adhering to this convention to avoid additional configuration.
The general idea behind the JavaBean style is to instantiate an object with a no-argument constructor and then set its fields via single-argument "set" methods and access its fields via no-argument "get" methods. This is demonstrated in the next code listings. The first listing shows a simple example of a PersonBean class with a no-arguments constructor and getter and setter methods. That code listing also includes some of the JavaBeans-style classes it uses. It is followed by code using that JavaBeans-style class.
Examples of JavaBeans Style Class
    public class PersonBean {
        private FullNameBean name;
        private AddressBean address;
        private Gender gender;
        private EmploymentStatus employment;
        private HomeownerStatus homeOwnerStatus;

        /** No-arguments constructor. */
        public PersonBean() {}

        public FullNameBean getName() { return this.name; }
        public void setName(final FullNameBean newName) { this.name = newName; }

        public AddressBean getAddress() { return this.address; }
        public void setAddress(final AddressBean newAddress) { this.address = newAddress; }

        public Gender getGender() { return this.gender; }
        public void setGender(final Gender newGender) { this.gender = newGender; }

        public EmploymentStatus getEmployment() { return this.employment; }
        public void setEmployment(final EmploymentStatus newEmployment) { this.employment = newEmployment; }

        public HomeownerStatus getHomeOwnerStatus() { return this.homeOwnerStatus; }
        public void setHomeOwnerStatus(final HomeownerStatus newHomeOwnerStatus) { this.homeOwnerStatus = newHomeOwnerStatus; }
    }

    /**
     * Full name of a person in JavaBean style.
     *
     * @author Dustin
     */
    public final class FullNameBean {
        private Name lastName;
        private Name firstName;
        private Name middleName;
        private Salutation salutation;
        private Suffix suffix;

        /** No-args constructor for JavaBean style instantiation. */
        public FullNameBean() {}

        public Name getFirstName() { return this.firstName; }
        public void setFirstName(final Name newFirstName) { this.firstName = newFirstName; }

        public Name getLastName() { return this.lastName; }
        public void setLastName(final Name newLastName) { this.lastName = newLastName; }

        public Name getMiddleName() { return this.middleName; }
        public void setMiddleName(final Name newMiddleName) { this.middleName = newMiddleName; }

        public Salutation getSalutation() { return this.salutation; }
        public void setSalutation(final Salutation newSalutation) { this.salutation = newSalutation; }

        public Suffix getSuffix() { return this.suffix; }
        public void setSuffix(final Suffix newSuffix) { this.suffix = newSuffix; }

        @Override
        public String toString() {
            return this.salutation + " " + this.firstName + " " + this.middleName + " "
                 + this.lastName + ", " + this.suffix;
        }
    }

    package dustin.examples;

    /**
     * Representation of a United States address (JavaBeans style).
     *
     * @author Dustin
     */
    public final class AddressBean {
        private StreetAddress streetAddress;
        private City city;
        private State state;

        /** No-arguments constructor for JavaBeans-style instantiation. */
        public AddressBean() {}

        public StreetAddress getStreetAddress() { return this.streetAddress; }
        public void setStreetAddress(final StreetAddress newStreetAddress) { this.streetAddress = newStreetAddress; }

        public City getCity() { return this.city; }
        public void setCity(final City newCity) { this.city = newCity; }

        public State getState() { return this.state; }
        public void setState(final State newState) { this.state = newState; }

        @Override
        public String toString() {
            return this.streetAddress + ", " + this.city + ", " + this.state;
        }
    }
Example of JavaBeans Style Instantiation and Population
    public PersonBean createPerson() {
        final PersonBean person = new PersonBean();

        final FullNameBean personName = new FullNameBean();
        personName.setFirstName(new Name("Fred"));
        personName.setLastName(new Name("Flintstone"));
        person.setName(personName);

        final AddressBean address = new AddressBean();
        address.setStreetAddress(new StreetAddress("345 Cave Stone Road"));
        address.setCity(new City("Bedrock"));
        person.setAddress(address);

        return person;
    }
The examples just shown demonstrate how the JavaBeans style approach can be used. This approach makes some concessions to reduce the need to pass a large number of parameters to a class's constructor. Instead, no parameters are passed to the constructor and each individual attribute that is needed must be set. One of the advantages of the JavaBeans style approach is that readability is enhanced as compared to a constructor with a large number of parameters because each of the "set" methods is hopefully named in a readable way. The JavaBeans approach is simple to understand and definitely achieves the goal of reducing lengthy parameter lists in the case of constructors. However, there are some disadvantages to this approach as well. One disadvantage is the amount of tedious client code needed to instantiate the object and set its attributes one at a time.
It is easy with this approach to neglect to set a required attribute because there is no way for the compiler to enforce that all required attributes are set without leaving the JavaBeans convention. Perhaps most damaging, several objects are instantiated in this last code listing, and these objects exist in different incomplete states from the time they are instantiated until the time the final "set" method is called. During that time, the objects are in what is really an "undefined" or "incomplete" state. The existence of "set" methods necessarily means that the class's attributes cannot be final, rendering the entire object highly mutable.
Regarding the prevalent use of the JavaBeans pattern in Java, several credible authors have called its value into question. Allen Holub's controversial article Why getter and setter methods are evil starts off with no holds barred: "Though getter/setter methods are commonplace in Java, they are not particularly object oriented (OO). In fact, they can damage your code's maintainability. Moreover, the presence of numerous getter and setter methods is a red flag that the program isn't necessarily well designed from an OO perspective."
Josh Bloch, in his less forceful and more gently persuasive tone, says of the JavaBeans getter/setter style: "The JavaBeans pattern has serious disadvantages of its own" (Effective Java, Second Edition, Item #2). It is in this context that Bloch recommends the builder pattern instead for object construction.
I'm not against using the JavaBeans get/set style when the framework I've selected for other reasons requires it and the reasons for using that framework justify it. There are also areas where the JavaBeans style class is particularly well suited, such as interacting with a data store and holding data from the data store for use by the application. However, I am not a fan of using the JavaBeans style for instantiating an object simply to avoid the need to pass parameters.
I prefer one of the other approaches such as builder for that purpose.
Benefits and Advantages
I've covered different approaches to reducing the number of arguments to a method or constructor in this post, and they all share the same trade-off: exposing mutable state to reduce or eliminate the number of parameters that must be passed to a method or to a constructor. The advantages of these approaches are simplicity, general readability (though "globals" can be difficult to read), and ease of first writing and use. Of course, their biggest advantage from this post's perspective is that they generally eliminate the need for any parameter passing.
Costs and Disadvantages
The trait that all approaches covered in this post share is the exposure of mutable state. This can lead to an extremely high cost if the code is used in a highly concurrent environment. There is a certain degree of unpredictability when object state is exposed for anyone to tinker with as they like. It can be difficult to know which code made the wrong change or failed to make a necessary change (such as failing to call a "set" method when populating a newly instantiated object).
Conclusion
Some of the approaches covered in this post are highly popular despite their drawbacks. This may be for a variety of reasons, including prevalence of use in popular frameworks (forcing users of the framework to use that style and also providing examples to others for their own code development). Other reasons for these approaches' popularity are the relative ease of initial development and the seemingly (deceptively) small amount of thought that needs to go into design with these approaches. In general, I prefer to spend a little more design and implementation effort to use builders and less mutable approaches when practical. However, there are cases where these mutable approaches work well in reducing the number of parameters passed around and introduce no more risk than was already present.
My feeling is that Java developers should carefully consider use of any mutable Java classes and ensure that the mutability is either desired or is a cost that is justified by the reasons for using a mutable state approach.
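The less mutable builder alternative preferred above can be sketched roughly as follows. The class and field names here are hypothetical, and plain strings stand in for the Name and City types used in the listings; it is meant only to show that required values are enforced at construction and that the built object has no "set" methods.

```java
/** Immutable person built via a builder; all names here are illustrative. */
final class ImmutablePerson {
    private final String lastName;
    private final String firstName;
    private final String city;

    private ImmutablePerson(final Builder builder) {
        this.lastName = builder.lastName;
        this.firstName = builder.firstName;
        this.city = builder.city;
    }

    @Override
    public String toString() {
        return firstName + " " + lastName + (city == null ? "" : " of " + city);
    }

    /** Collects values step by step; only the builder is mutable. */
    static final class Builder {
        private final String lastName;  // required
        private final String firstName; // required
        private String city;            // optional

        Builder(final String lastName, final String firstName) {
            if (lastName == null || firstName == null) {
                throw new IllegalArgumentException("Last and first names are required.");
            }
            this.lastName = lastName;
            this.firstName = firstName;
        }

        Builder city(final String newCity) {
            this.city = newCity;
            return this;
        }

        ImmutablePerson build() {
            return new ImmutablePerson(this);
        }
    }

    public static void main(final String[] args) {
        final ImmutablePerson fred =
            new ImmutablePerson.Builder("Flintstone", "Fred").city("Bedrock").build();
        System.out.println(fred); // prints Fred Flintstone of Bedrock
    }
}
```

Because every field of ImmutablePerson is final, an instance never exists in the partially populated state described for the JavaBeans listings.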

October 25, 2013

by Dustin Marx

· 16,625 Views

What’s the Difference Between System.String and string?

One of the questions that a lot of developers ask is: is there any difference between string and System.String, and which should be used?
Short Answer
There is no difference between the two. You can use either of them in your code.
Explanation
System.String is a class (reference type) defined in mscorlib in the namespace System. In other words, System.String is a type in the CLR. string is a keyword in C#.
Before we understand the difference, let us understand the terms BCL and FCL.
BCL is the Base Class Library, the standard library of the Common Language Infrastructure (CLI), available to languages like C#, A#, Boo, Cobra, F#, IronRuby, IronPython and other CLI languages. It includes common functions such as File Read/Write or IO and database/XML interactions. BCL was first implemented in Microsoft .NET in the form of mscorlib.dll.
FCL is the Framework Class Library, the standard Microsoft .NET specific library containing reusable classes/assets like System, System.CodeDom, System.Collections, System.Diagnostics, System.Globalization, System.IO, System.Resources and System.Text.
Now in C#, string (a C# keyword) maps directly to System.String (a BCL type). Similarly, int maps directly to System.Int32. Here int is mapped to an integer type that is 32 bits wide. In another language, the keyword int could be mapped to a different type, such as a 64-bit integer.
So the fact that using string and System.String in C# makes no difference is well established.
Is it better to still use string instead of System.String?
There is no universally agreed answer to this. In my opinion, even though string and System.String mean the same thing and make no difference to the performance of the application, it is better to use string. This is because string is a C# language specific keyword. Also, the C# language specification states: "As a matter of style, use of the keyword is favored over use of the complete system type name". Following this practice ensures that your code consistently uses keywords wherever possible rather than mixing keywords with BCL and FCL type names.

October 25, 2013

by Punit Ganshani

· 11,387 Views · 3 Likes

Examples of the Windows Azure Storage Services REST API

The examples in this post were updated in September to work with the current version of the Windows Azure Storage REST API.
In the Windows Azure MSDN Forum there are occasional questions about the Windows Azure Storage Services REST API. I have occasionally responded to these with some code examples showing how to use the API. I thought it would be useful to provide some examples of using the REST API for tables, blobs and queues, if only so I don’t have to dredge up examples when people ask how to use it. This post is not intended to provide a complete description of the REST API. The REST API is comprehensively documented (other than the lack of working examples). Since the REST API is the definitive way to address Windows Azure Storage Services, I think people using the higher level Storage Client API should have a passing understanding of the REST API, to the level of being able to understand the documentation. Understanding the REST API can provide a deeper understanding of why the Storage Client API behaves the way it does.
Fiddler
The Fiddler Web Debugging Proxy is an essential tool when developing using the REST (or Storage Client) API since it captures precisely what is sent over the wire to the Windows Azure Storage Services.
Authorization
Nearly every request to the Windows Azure Storage Services must be authenticated. The exception is access to blobs with public read access. The supported authentication schemes for blobs, queues and tables are described in the REST API documentation. The requests must be accompanied by an Authorization header constructed by making a hash-based message authentication code (HMAC) using the SHA-256 hash.
The following is an example of performing the SHA-256 hash for the Authorization header:
    public static String CreateAuthorizationHeader(String canonicalizedString)
    {
        String signature = String.Empty;

        // Sign the UTF-8 bytes of the string with HMAC-SHA256, keyed by the Base64-decoded account key.
        using (HMACSHA256 hmacSha256 = new HMACSHA256(Convert.FromBase64String(AzureStorageConstants.Key)))
        {
            Byte[] dataToHmac = System.Text.Encoding.UTF8.GetBytes(canonicalizedString);
            signature = Convert.ToBase64String(hmacSha256.ComputeHash(dataToHmac));
        }

        // Header value has the form "<scheme> <account>:<signature>".
        String authorizationHeader = String.Format(
            CultureInfo.InvariantCulture,
            "{0} {1}:{2}",
            AzureStorageConstants.SharedKeyAuthorizationScheme,
            AzureStorageConstants.Account,
            signature);

        return authorizationHeader;
    }
This method is used in all the examples in this post. AzureStorageConstants is a helper class containing various constants. Key is the secret key for the Windows Azure Storage Services account specified by Account. In the examples given here, SharedKeyAuthorizationScheme is SharedKey.
The trickiest part in using the REST API successfully is getting the correct string to sign. Fortunately, in the event of an authentication failure the Blob Service and Queue Service respond with the authorization string they used, and this can be compared with the authorization string used in generating the Authorization header. This has greatly simplified the use of the REST API.
Table Service API
The Table Service API supports the following table-level operations: Create Table, Delete Table and Query Tables.
The Table Service API supports the following entity-level operations: Delete Entity, Insert Entity, Merge Entity, Update Entity and Query Entities.
These operations are implemented using the appropriate HTTP verb: DELETE (delete), GET (query), MERGE (merge), POST (insert) and PUT (update).
This section provides examples of the Insert Entity and Query Entities operations.
Insert Entity
The InsertEntity() method listed in this section inserts an entity with two String properties, Artist and Title, into a table.
The entity is submitted as an ATOM entry in the body of a request POSTed to the Table Service. In this example, the ATOM entry is generated by the GetRequestContentInsertXml() method. The date in the x-ms-date header must be in RFC 1123 format, and the same value is used in the string that is signed to create the Authorization header. Note that the storage service version is set to "2012-02-12", which requires the DataServiceVersion and MaxDataServiceVersion headers to be set appropriately.
    public void InsertEntity(String tableName, String artist, String title)
    {
        String requestMethod = "POST";
        String urlPath = tableName;
        String storageServiceVersion = "2012-02-12";
        String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture);
        String contentMD5 = String.Empty;
        String contentType = "application/atom+xml";
        String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, urlPath);

        // Table Service string to sign: verb, Content-MD5, Content-Type, date, canonicalized resource.
        String stringToSign = String.Format(
            "{0}\n{1}\n{2}\n{3}\n{4}",
            requestMethod,
            contentMD5,
            contentType,
            dateInRfc1123Format,
            canonicalizedResource);
        String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign);

        UTF8Encoding utf8Encoding = new UTF8Encoding();
        Byte[] content = utf8Encoding.GetBytes(GetRequestContentInsertXml(artist, title));

        Uri uri = new Uri(AzureStorageConstants.TableEndPoint + urlPath);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Accept = "application/atom+xml,application/xml";
        request.ContentLength = content.Length;
        request.ContentType = contentType;
        request.Method = requestMethod;
        request.Headers.Add("x-ms-date", dateInRfc1123Format);
        request.Headers.Add("x-ms-version", storageServiceVersion);
        request.Headers.Add("Authorization", authorizationHeader);
        request.Headers.Add("Accept-Charset", "UTF-8");
        request.Headers.Add("DataServiceVersion", "2.0;NetFx");
        request.Headers.Add("MaxDataServiceVersion", "2.0;NetFx");

        using (Stream requestStream = request.GetRequestStream())
        {
            requestStream.Write(content, 0, content.Length);
        }

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            Stream dataStream = response.GetResponseStream();
            using (StreamReader reader = new StreamReader(dataStream))
            {
                String responseFromServer = reader.ReadToEnd();
            }
        }
    }

    private String GetRequestContentInsertXml(String artist, String title)
    {
        String defaultNameSpace = "http://www.w3.org/2005/Atom";
        String dataservicesNameSpace = "http://schemas.microsoft.com/ado/2007/08/dataservices";
        String metadataNameSpace = "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata";

        // Note: these settings are created but are not passed to XmlWriter.Create() below.
        XmlWriterSettings xmlWriterSettings = new XmlWriterSettings();
        xmlWriterSettings.OmitXmlDeclaration = false;
        xmlWriterSettings.Encoding = Encoding.UTF8;

        StringBuilder entry = new StringBuilder();
        using (XmlWriter xmlWriter = XmlWriter.Create(entry))
        {
            xmlWriter.WriteProcessingInstruction("xml", "version=\"1.0\" encoding=\"UTF-8\"");
            xmlWriter.WriteWhitespace("\n");
            xmlWriter.WriteStartElement("entry", defaultNameSpace);
            xmlWriter.WriteAttributeString("xmlns", "d", null, dataservicesNameSpace);
            xmlWriter.WriteAttributeString("xmlns", "m", null, metadataNameSpace);
            xmlWriter.WriteElementString("title", null);
            xmlWriter.WriteElementString("updated", String.Format("{0:o}", DateTime.UtcNow));
            xmlWriter.WriteStartElement("author");
            xmlWriter.WriteElementString("name", null);
            xmlWriter.WriteEndElement();
            xmlWriter.WriteElementString("id", null);
            xmlWriter.WriteStartElement("content");
            xmlWriter.WriteAttributeString("type", "application/xml");
            xmlWriter.WriteStartElement("properties", metadataNameSpace);
            xmlWriter.WriteElementString("PartitionKey", dataservicesNameSpace, artist);
            xmlWriter.WriteElementString("RowKey", dataservicesNameSpace, title);
            xmlWriter.WriteElementString("Artist", dataservicesNameSpace, artist);
            xmlWriter.WriteElementString("Title", dataservicesNameSpace, title + "\n" + title);
            xmlWriter.WriteEndElement();
            xmlWriter.WriteEndElement();
            xmlWriter.WriteEndElement();
            xmlWriter.Close();
        }
        String requestContent = entry.ToString();
        return requestContent;
    }
This generates the following request (as captured by Fiddler):
    POST https://STORAGE_ACCOUNT.table.core.windows.net/authors HTTP/1.1
    Accept: application/atom+xml,application/xml
    Content-Type: application/atom+xml
    x-ms-date: Sun, 08 Sep 2013 06:31:12 GMT
    x-ms-version: 2012-02-12
    Authorization: SharedKey STORAGE_ACCOUNT:w7Uu4wHZx4fFwa2bsxd/TJVZZ1AqMPwxvW+pYtoWHd0=
    Accept-Charset: UTF-8
    DataServiceVersion: 2.0;NetFx
    MaxDataServiceVersion: 2.0;NetFx
    Host: STORAGE_ACCOUNT.table.core.windows.net
    Content-Length: 514
    Expect: 100-continue
    Connection: Keep-Alive
The body of the request is:
    2013-09-08T07:19:07Z Beckett Molloy 2013-09-08T07:19:07.2189243Z Beckett Molloy Molloy
Note that I should have URL-encoded the PartitionKey and RowKey but did not do so for simplicity. There are, in fact, some issues with the URL encoding of spaces and other symbols.
Get Entity
The GetEntity() method described in this section retrieves the single entity inserted in the previous section. The particular entity to be retrieved is identified directly in the URL.
    public void GetEntity(String tableName, String partitionKey, String rowKey)
    {
        String requestMethod = "GET";
        // The entity is addressed by its keys directly in the URL path.
        String urlPath = String.Format("{0}(PartitionKey='{1}',RowKey='{2}')", tableName, partitionKey, rowKey);
        String storageServiceVersion = "2012-02-12";
        String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture);
        String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, urlPath);

        // Content-MD5 and Content-Type are empty for this GET request.
        String stringToSign = String.Format(
            "{0}\n\n\n{1}\n{2}",
            requestMethod,
            dateInRfc1123Format,
            canonicalizedResource);
        String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign);

        Uri uri = new Uri(AzureStorageConstants.TableEndPoint + urlPath);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = requestMethod;
        request.Headers.Add("x-ms-date", dateInRfc1123Format);
        request.Headers.Add("x-ms-version", storageServiceVersion);
        request.Headers.Add("Authorization", authorizationHeader);
        request.Headers.Add("Accept-Charset", "UTF-8");
        request.Accept = "application/atom+xml,application/xml";
        request.Headers.Add("DataServiceVersion", "2.0;NetFx");
        request.Headers.Add("MaxDataServiceVersion", "2.0;NetFx");

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            Stream dataStream = response.GetResponseStream();
            using (StreamReader reader = new StreamReader(dataStream))
            {
                String responseFromServer = reader.ReadToEnd();
            }
        }
    }
This generates the following request (as captured by Fiddler):
    GET https://STORAGE_ACCOUNT.table.core.windows.net/authors(PartitionKey='Beckett',RowKey='Molloy') HTTP/1.1
    x-ms-date: Sun, 08 Sep 2013 06:31:14 GMT
    x-ms-version: 2012-02-12
    Authorization: SharedKey STORAGE_ACCOUNT:1hWbr4aNq4JWCpNJY3rsLH1SkIyeFTJflbqyKMPQ1Gk=
    Accept-Charset: UTF-8
    Accept: application/atom+xml,application/xml
    DataServiceVersion: 2.0;NetFx
    MaxDataServiceVersion: 2.0;NetFx
    Host: STORAGE_ACCOUNT.table.core.windows.net
The Table Service generates the following response:
    HTTP/1.1 200 OK
    Cache-Control: no-cache
    Content-Type: application/atom+xml;charset=utf-8
    ETag: W/"datetime'2013-09-08T06%3A31%3A14.1579056Z'"
    Server: Windows-Azure-Table/1.0 Microsoft-HTTPAPI/2.0
    x-ms-request-id: f4bd4c77-6fb6-42a8-8dff-81ea8d28fa2e
    x-ms-version: 2012-02-12
    Date: Sun, 08 Sep 2013 06:31:15 GMT
    Content-Length: 1108
The returned entities, in this case a single entity, are returned in ATOM entry format in the response body:
    https://STORAGE_ACCOUNT.table.core.windows.net/authors(PartitionKey='Beckett',RowKey='Molloy') 2013-09-08T06:31:15Z Beckett Molloy 2013-09-08T06:31:14.1579056Z Beckett Molloy Molloy
Blob Service API
The Blob Service API supports the following account-level operation: List Containers.
The Blob Service API supports the following container-level operations: Create Container, Delete Container, Get Container ACL, Get Container Properties, Get Container Metadata, List Blobs, Set Container ACL and Set Container Metadata.
The Blob Service API supports the following blob-level operations: Copy Blob, Delete Blob, Get Blob, Get Blob Metadata, Get Blob Properties, Lease Blob, Put Blob, Set Blob Metadata, Set Blob Properties and Snapshot Blob.
The Blob Service API supports the following operations on block blobs: Get Block List, Put Block and Put Block List.
The Blob Service API supports the following operations on page blobs: Get Page Regions and Put Page.
This section provides examples of the Put Blob and Lease Blob operations.
Put Blob
The Blob Service and Queue Service use a different form of shared-key authentication from the Table Service, so care should be taken in creating the string to be signed for authorization. The blob type, BlockBlob or PageBlob, must be specified as a request header and consequently appears in the authorization string.
    public void PutBlob(String containerName, String blobName)
    {
        String requestMethod = "PUT";
        String urlPath = String.Format("{0}/{1}", containerName, blobName);
        String storageServiceVersion = "2012-02-12";
        String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture);

        String content = "Andrew Carnegie was born in Dunfermline";
        UTF8Encoding utf8Encoding = new UTF8Encoding();
        Byte[] blobContent = utf8Encoding.GetBytes(content);
        Int32 blobLength = blobContent.Length;

        const String blobType = "BlockBlob";

        // The x-ms-* headers appear in the string to sign in sorted order
        String canonicalizedHeaders = String.Format(
            "x-ms-blob-type:{0}\nx-ms-date:{1}\nx-ms-version:{2}",
            blobType, dateInRfc1123Format, storageServiceVersion);
        String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, urlPath);
        String stringToSign = String.Format(
            "{0}\n\n\n{1}\n\n\n\n\n\n\n\n\n{2}\n{3}",
            requestMethod, blobLength, canonicalizedHeaders, canonicalizedResource);
        String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign);

        Uri uri = new Uri(AzureStorageConstants.BlobEndPoint + urlPath);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = requestMethod;
        request.Headers.Add("x-ms-blob-type", blobType);
        request.Headers.Add("x-ms-date", dateInRfc1123Format);
        request.Headers.Add("x-ms-version", storageServiceVersion);
        request.Headers.Add("Authorization", authorizationHeader);
        request.ContentLength = blobLength;

        using (Stream requestStream = request.GetRequestStream())
        {
            requestStream.Write(blobContent, 0, blobLength);
        }

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            String ETag = response.Headers["ETag"];
        }
    }

This generates the following request:

    PUT https://STORAGE_ACCOUNT.blob.core.windows.net/fife/dunfermline HTTP/1.1
    x-ms-blob-type: BlockBlob
    x-ms-date: Sun, 08 Sep 2013 06:28:29 GMT
    x-ms-version: 2012-02-12
    Authorization: SharedKey STORAGE_ACCOUNT:ntvh/lamVmikvwHhy6vRVBIh87kibkPlEOiHyLDia6g=
    Host: STORAGE_ACCOUNT.blob.core.windows.net
    Content-Length: 39
    Expect: 100-continue
    Connection: Keep-Alive

The body of the request is:

    Andrew Carnegie was born in Dunfermline

The Blob Service generates the following response:

    HTTP/1.1 201 Created
    Transfer-Encoding: chunked
    Content-MD5: RYJnWGXLyt94l5jG82LjBw==
    Last-Modified: Sun, 08 Sep 2013 06:28:31 GMT
    ETag: "0x8D07A73C5704A86"
    Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
    x-ms-request-id: b74ef0a2-294d-4581-b8f1-6cda724bbdbf
    x-ms-version: 2012-02-12
    Date: Sun, 08 Sep 2013 06:28:30 GMT

Lease Blob

The Blob Service allows a user to lease a blob for a minute at a time and so acquire a write lock on it. The use case for this is the locking of a page blob used to store the VHD backing a writable Azure Drive.

The LeaseBlob() example in this section demonstrates a subtle issue with the creation of authorization strings. The URL has a query string, comp=lease. Rather than using this directly in creating the authorization string, it must be converted into comp:lease, with a colon replacing the equals sign and a newline replacing the question mark (see modifiedUrlPath in the example). Furthermore, the Lease Blob operation requires an x-ms-lease-action header to indicate whether the lease is being acquired, renewed, released or broken.
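The query-string rule can be isolated in a few lines. The following is a minimal sketch in Java rather than the C# used in this post; the class and method names are hypothetical. The single-parameter case matches the comp:lease conversion described above. Lowercasing and sorting several parameters is my understanding of the Shared Key rules and should be read as an assumption.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the original post.
class CanonicalizedResource {

    // Builds "/account/path" followed by one "\nname:value" line per query parameter.
    static String build(String account, String path, Map<String, String> queryParams) {
        StringBuilder sb = new StringBuilder("/").append(account).append('/').append(path);

        // Assumption: parameter names are lowercased and sorted by name.
        Map<String, String> sorted = new TreeMap<>();
        for (Map.Entry<String, String> e : queryParams.entrySet()) {
            sorted.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            // "comp=lease" in the URL becomes "comp:lease" on its own line
            sb.append('\n').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }
}
```

For a request to fife/dunfermline?comp=lease, the helper yields the account and path on the first line and comp:lease on the second, which is the value the C# example assembles by hand.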
    public void LeaseBlob(String containerName, String blobName)
    {
        String requestMethod = "PUT";
        String urlPath = String.Format("{0}/{1}?comp=lease", containerName, blobName);
        // Form used in the string to sign: "?comp=lease" becomes "\ncomp:lease"
        String modifiedUrlPath = String.Format("{0}/{1}\ncomp:lease", containerName, blobName);
        const Int32 contentLength = 0;
        String storageServiceVersion = "2012-02-12";
        String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture);

        String leaseAction = "acquire";
        String leaseDuration = "60";

        String canonicalizedHeaders = String.Format(
            "x-ms-date:{0}\nx-ms-lease-action:{1}\nx-ms-lease-duration:{2}\nx-ms-version:{3}",
            dateInRfc1123Format, leaseAction, leaseDuration, storageServiceVersion);
        String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, modifiedUrlPath);
        String stringToSign = String.Format(
            "{0}\n\n\n{1}\n\n\n\n\n\n\n\n\n{2}\n{3}",
            requestMethod, contentLength, canonicalizedHeaders, canonicalizedResource);
        String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign);

        Uri uri = new Uri(AzureStorageConstants.BlobEndPoint + urlPath);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = requestMethod;
        request.Headers.Add("x-ms-date", dateInRfc1123Format);
        request.Headers.Add("x-ms-lease-action", leaseAction);
        request.Headers.Add("x-ms-lease-duration", leaseDuration);
        request.Headers.Add("x-ms-version", storageServiceVersion);
        request.Headers.Add("Authorization", authorizationHeader);
        request.ContentLength = contentLength;

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            String leaseId = response.Headers["x-ms-lease-id"];
        }
    }

This generates the following request:

    PUT https://STORAGE_ACCOUNT.blob.core.windows.net/fife/dunfermline?comp=lease HTTP/1.1
    x-ms-date: Sun, 08 Sep 2013 06:28:31 GMT
    x-ms-lease-action: acquire
    x-ms-lease-duration: 60
    x-ms-version: 2012-02-12
    Authorization: SharedKey rebus:+SQ5+RFZg3hUaws5XCRHxsDgXb1ycdRIz5EKyHJWP7s=
    Host: rebus.blob.core.windows.net
    Content-Length: 0

The Blob Service generates the following response:

    HTTP/1.1 201 Created
    Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
    x-ms-request-id: 4b6ff77f-f885-4f74-803a-c92920d225c3
    x-ms-version: 2012-02-12
    x-ms-lease-id: b1320c2c-65ad-41d6-a7bd-85a4242c0ac5
    Date: Sun, 08 Sep 2013 06:28:31 GMT
    Content-Length: 0

Queue Service API

The Queue Service API supports the following account-level operation: List Queues.

The Queue Service API supports the following queue-level operations: Create Queue, Delete Queue, Get Queue Metadata, Set Queue Metadata.

The Queue Service API supports the following message-level operations: Clear Messages, Delete Message, Get Messages, Peek Messages, Put Message.

This section provides examples of the Put Message and Get Messages operations.

Put Message

The most obvious curiosity about Put Message is that it uses the HTTP verb POST rather than PUT. The reason is presumably a clash between the English language and the HTTP standard: the standard states that PUT should be idempotent, and the Put Message operation clearly is not, since each invocation adds another message to the queue. Regardless, it did catch me out when I failed to read the documentation well enough, so take that as a warning.

The content of a message posted to the queue must be formatted in a specified XML schema and must then be UTF-8 encoded.
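The body-building step can be looked at on its own. This is a minimal Java sketch (the post's own code is C#; the class name is hypothetical) that wraps the text in the QueueMessage/MessageText envelope and encodes it as UTF-8. It assumes the text contains no characters that need XML escaping, such as <, > or &.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original post.
class QueueMessageBody {

    // Wraps the message text in the Queue Service XML envelope and UTF-8 encodes it.
    // Assumption: the text needs no XML escaping.
    static byte[] encode(String message) {
        String xml = "<QueueMessage><MessageText>" + message + "</MessageText></QueueMessage>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }
}
```

The length of the returned array is the value that goes into both the Content-Length header and the string to sign. For the message "Saturday in the cafe" it is 76 bytes, which matches the Content-Length in the captured request.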
    public void PutMessage(String queueName, String message)
    {
        String requestMethod = "POST";
        String urlPath = String.Format("{0}/messages", queueName);
        String storageServiceVersion = "2012-02-12";
        String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture);

        // Wrap the text in the XML envelope required by the Queue Service
        String messageText = String.Format(
            "<QueueMessage><MessageText>{0}</MessageText></QueueMessage>", message);
        UTF8Encoding utf8Encoding = new UTF8Encoding();
        Byte[] messageContent = utf8Encoding.GetBytes(messageText);
        Int32 messageLength = messageContent.Length;

        String canonicalizedHeaders = String.Format(
            "x-ms-date:{0}\nx-ms-version:{1}",
            dateInRfc1123Format, storageServiceVersion);
        String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, urlPath);
        String stringToSign = String.Format(
            "{0}\n\n\n{1}\n\n\n\n\n\n\n\n\n{2}\n{3}",
            requestMethod, messageLength, canonicalizedHeaders, canonicalizedResource);
        String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign);

        Uri uri = new Uri(AzureStorageConstants.QueueEndPoint + urlPath);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = requestMethod;
        request.Headers.Add("x-ms-date", dateInRfc1123Format);
        request.Headers.Add("x-ms-version", storageServiceVersion);
        request.Headers.Add("Authorization", authorizationHeader);
        request.ContentLength = messageLength;

        using (Stream requestStream = request.GetRequestStream())
        {
            requestStream.Write(messageContent, 0, messageLength);
        }

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            String requestId = response.Headers["x-ms-request-id"];
        }
    }

This generates the following request:

    POST https://rebus.queue.core.windows.net/revolution/messages HTTP/1.1
    x-ms-date: Sun, 08 Sep 2013 06:34:08 GMT
    x-ms-version: 2012-02-12
    Authorization: SharedKey rebus:nyASTVWifnxHKnj2wXwuzzzXz5CxUBZj58SToV5QFK8=
    Host: rebus.queue.core.windows.net
    Content-Length: 76
    Expect: 100-continue
    Connection: Keep-Alive

The body of the request is:

    <QueueMessage><MessageText>Saturday in the cafe</MessageText></QueueMessage>

The Queue Service generates the following response:

    HTTP/1.1 201 Created
    Server: Windows-Azure-Queue/1.0 Microsoft-HTTPAPI/2.0
    x-ms-request-id: 14c6e73b-15d9-480c-b251-c4c01b48e529
    x-ms-version: 2012-02-12
    Date: Sun, 08 Sep 2013 06:34:09 GMT
    Content-Length: 0

Get Messages

The Get Messages operation described in this section retrieves a single message with the default message visibility timeout of 30 seconds.

    public void GetMessage(String queueName)
    {
        String requestMethod = "GET";
        String urlPath = String.Format("{0}/messages", queueName);
        String storageServiceVersion = "2012-02-12";
        String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture);

        String canonicalizedHeaders = String.Format(
            "x-ms-date:{0}\nx-ms-version:{1}",
            dateInRfc1123Format, storageServiceVersion);
        String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, urlPath);
        // No body, so the Content-Length slot is empty: twelve newlines follow the verb
        String stringToSign = String.Format(
            "{0}\n\n\n\n\n\n\n\n\n\n\n\n{1}\n{2}",
            requestMethod, canonicalizedHeaders, canonicalizedResource);
        String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign);

        Uri uri = new Uri(AzureStorageConstants.QueueEndPoint + urlPath);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = requestMethod;
        request.Headers.Add("x-ms-date", dateInRfc1123Format);
        request.Headers.Add("x-ms-version", storageServiceVersion);
        request.Headers.Add("Authorization", authorizationHeader);
        request.Accept = "application/atom+xml,application/xml";

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            Stream dataStream = response.GetResponseStream();
            using (StreamReader reader = new StreamReader(dataStream))
            {
                String responseFromServer = reader.ReadToEnd();
            }
        }
    }

This generates the following request:

    GET https://rebus.queue.core.windows.net/revolution/messages HTTP/1.1
    x-ms-date: Sun, 08 Sep 2013 06:34:11 GMT
    x-ms-version: 2012-02-12
    Authorization: SharedKey rebus:K67XooYhokw0i0AlCzYQ4GeLLrJih1r1vSqiO9DBo0c=
    Accept: application/atom+xml,application/xml
    Host: rebus.queue.core.windows.net

The Queue Service generates the following response:

    HTTP/1.1 200 OK
    Content-Type: application/xml
    Server: Windows-Azure-Queue/1.0 Microsoft-HTTPAPI/2.0
    x-ms-request-id: efb21a86-7d66-47fd-b13d-7aa74fce0568
    x-ms-version: 2012-02-12
    Date: Sun, 08 Sep 2013 06:34:12 GMT
    Content-Length: 484

The message is returned in the response body as follows:

    <QueueMessagesList>
      <QueueMessage>
        <MessageId>05fd902f-6031-4ef4-8298-ef3844ec3bc6</MessageId>
        <InsertionTime>Sun, 08 Sep 2013 06:34:11 GMT</InsertionTime>
        <ExpirationTime>Sun, 15 Sep 2013 06:34:11 GMT</ExpirationTime>
        <DequeueCount>1</DequeueCount>
        <PopReceipt>AgAAAAMAAAAAAAAAAL+zgF2szgE=</PopReceipt>
        <TimeNextVisible>Sun, 08 Sep 2013 06:34:43 GMT</TimeNextVisible>
        <MessageText>Saturday in the cafe</MessageText>
      </QueueMessage>
    </QueueMessagesList>

I noticed that some newline specifiers in strings (\n) were lost when the blog was auto-ported from Windows Live Spaces to WordPress. I have put them back in, but it is possible I missed some. Consequently, in the event of a problem you should check the newlines in canonicalizedHeaders and stringToSign.
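Because every newline in stringToSign ends up in the signature, it helps to see how the signature is typically derived. The Utility.CreateAuthorizationHeader helper is not shown in this part of the post. As an assumption based on the Shared Key scheme (HMAC-SHA256 over the UTF-8 string to sign, keyed with the Base64-decoded account key), a Java equivalent might look like the sketch below. The class name and parameters are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch of a Shared Key signer, not the post's actual helper.
class SharedKeySigner {

    // Returns "SharedKey <account>:<Base64(HMAC-SHA256(key, stringToSign))>".
    static String createAuthorizationHeader(String account, String base64Key, String stringToSign) {
        try {
            byte[] key = Base64.getDecoder().decode(base64Key);
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            byte[] signature = mac.doFinal(stringToSign.getBytes(StandardCharsets.UTF_8));
            return "SharedKey " + account + ":" + Base64.getEncoder().encodeToString(signature);
        } catch (GeneralSecurityException ex) {
            // HmacSHA256 is a standard algorithm, so this should not normally happen
            throw new IllegalStateException(ex);
        }
    }
}
```

A single missing or extra \n changes the bytes fed into the HMAC, so the service computes a different signature and rejects the request. That is why a lost newline shows up as an authentication failure rather than a malformed-request error.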

October 24, 2013

by Neil Mackenzie

· 38,807 Views

PostgreSQL to SQLite: The Journey

This article will be useful if you want to support both PostgreSQL and SQLite using JDBC. It will be especially useful if you:

Are already accessing values from your (PostgreSQL) database through the regular JDBC ResultSet interface, for example Date d = rs.getDate("date_field"); or BigDecimal bd = rs.getBigDecimal("bigdecimal_field");, and doing the same with SQLite is causing trouble, but you don't want to change that code.

Are already retrieving autogenerated keys in PostgreSQL with a RETURNING clause, which won't work in SQLite, and you want a unified solution that works for both databases.

Thought foreign keys were enforced in SQLite by default (as in PostgreSQL) and hit a wall: SQLite lets you delete entries from your tables even when they are referenced in another table and you have explicitly told SQLite about it with a REFERENCES table_name(field_name) clause.

Are having trouble with the differences between the PostgreSQL and SQLite dialects (mostly concerning data types), for example when building query filters with boolean values.

Had your own way of managing exceptions for PostgreSQL and it is not working for SQLite (obviously), and you want SQLite to fit into the model you already have.

Other issues might appear as you read on.

A few months ago I wanted to migrate an app to use SQLite as a data backend. In fact, I wanted it to work with either PostgreSQL or SQLite interchangeably (but not at the same time), and to switch between the two databases easily without changing any code. I did it, but along the way I had to solve some problems that might be interesting to many other people. Many of the solutions I found were spread across the web, but there was no single place that explained how to completely achieve what I wanted. So, the aim of this post is to condense what I learned into one article that may help others as a (semi) complete guide.
This guide might be useful not only to those creating their own frameworks, but to anyone who doesn't use one and is willing to try some quirks and tricks to make their app work.

THE BEGINNING

There are many cross-database incompatibilities between PostgreSQL and SQLite, most notably in data types. If you want the same code to work for both databases, you had better use a framework that manages this for you. But here's the thing: the framework I use is one I created myself, and it didn't (completely) take these differences into account, since I mainly use PostgreSQL as my database; that's how and why my problems arose. My framework covers many things, but here I focus on the data access part. It uses a JDBC driver to connect to the databases, but it provides more abstract ways to do it. A basic DAO class for my framework would look like this:

    public class MyDAO extends BaseDAO {

        public MyDAO() {
            super("context_alias", new DefaultDataMappingStrategy() {
                @Override
                public Object createResultObject(ResultSet rs) throws SQLException {
                    MyModel model = (MyModel) ObjectsFactory.getObject("my_model_alias");
                    model.setStringField(rs.getString("string_field"));
                    model.setIntegerField(rs.getInt("integer_field"));
                    model.setBigDecimalField(rs.getBigDecimal("bigdecimal_field"));
                    model.setDateField(rs.getDate("date_field"));
                    model.setBooleanField(rs.getBoolean("boolean_field"));
                    return model;
                }
            });
        }

        @Override
        public String getTableName() {
            return "table_name";
        }

        @Override
        public String getKeyFields() {
            return "string_field|integer_field";
        }

        @Override
        protected Map getInsertionMap(Object obj) {
            Map map = new HashMap();
            MyModel model = (MyModel) obj;
            map.put("string_field", model.getStringField());
            map.put("integer_field", model.getIntegerField());
            map.put("bigdecimal_field", model.getBigDecimalField());
            map.put("date_field", model.getDateField());
            map.put("boolean_field", model.getBooleanField());
            return map;
        }

        @Override
        protected Map getUpdateMap(Object obj) {
            Map map = new HashMap();
            MyModel model = (MyModel) obj;
            map.put("bigdecimal_field", model.getBigDecimalField());
            map.put("date_field", model.getDateField());
            map.put("boolean_field", model.getBooleanField());
            return map;
        }

        @Override
        public String getFindAllStatement() {
            return "SELECT * FROM :@ ";
        }
    }

So, switching between databases without changing code meant switching without changing my DAO classes. For SQLite, I used the xerial-jdbc-sqlite driver. I mention drivers because some of the solutions may be driver-specific; so when I say 'SQLite does it this way', I generally mean 'the xerial-jdbc-sqlite driver does it this way'. Now, let's start.

WARNING: Some of the solutions I give here fit into my framework, but might not directly fit into your code. It's up to you to work out how to adapt what I provide here.

DATA TYPES

Since there are some differences between PostgreSQL and SQLite regarding data types, and I wanted to continue to access database values through the regular ResultSet interface, I needed a mechanism to intercept calls such as resultset.getDate("date_field"). So I created a ResultSetWrapper class that redefines the methods I was interested in, like this:

    public class ResultSetWrapper implements ResultSet {

        // The wrapped ResultSet
        ResultSet wrapped;

        /* I will use this DateFormat to format dates.
           I'm assuming an SQLite style pattern. I should not */
        // MM = month; lowercase mm would mean minutes
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");

        public ResultSetWrapper(ResultSet wrapped) {
            this.wrapped = wrapped;
        }

        /* Lots of ResultSet method implementations go here, but this is an
           example of redefining a method whose behavior I want to change: */
        public Date getDate(String columnLabel) throws SQLException {
            Object value = this.wrapped.getObject(columnLabel);
            return (Date) TypesInferreer.inferDate(value);
        }
    }

The getDate() method in ResultSetWrapper relies on TypesInferreer to convert the retrieved value to a Date. All data type conversions are encapsulated inside TypesInferreer, which has methods to convert from different data types as needed. For instance, it has a method like this one:

    public static Object inferDate(Object value) {
        java.util.Date date = null;
        // Do conversions here (convert value and assign to date)
        return date;
    }

This method tries to convert any value to a Date (the actual implementation is shown further on). Now, instead of using the original resultset returned by preparedStatement.executeQuery(), you use new ResultSetWrapper(preparedStatement.executeQuery()). That's what my framework does: it passes this new resultset to the DAO objects. Now let's see some type conversions.

Mixing PostgreSQL Date and SQLite Long/String

You could store Date values as text in a SQLite database (e.g. '2013-10-09'); you can do this manually when creating the database. But when SQLite stores a Date object, by default it converts it to a Long value. There is no problem with this when saving the value to the SQLite database, but if you try to retrieve it using resultset.getDate("date_field"), things get messy: it simply won't work (you get a cast exception). How do you access Date values, then?
You create this method in TypesInferreer, which covers both the String and Long variations:

    public static Object inferDate(Object value) {
        if (value == null) return null;
        java.util.Date date = null;
        if (value instanceof String) {
            try {
                // MM = month (lowercase mm would mean minutes). A local instance
                // is used because SimpleDateFormat is not thread-safe.
                date = new SimpleDateFormat("yyyy-MM-dd").parse((String) value);
            } catch (ParseException ex) {
                // Deal with ex
            }
        } else if (value instanceof Long) {
            date = new java.util.Date((Long) value);
        } else {
            date = (Date) value;
        }
        // date is still null if parsing failed
        return date == null ? null : new Date(date.getTime());
    }

And as you saw, the getDate() function in ResultSetWrapper is redefined like this:

    @Override
    public Date getDate(String columnLabel) throws SQLException {
        Object value = this.wrapped.getObject(columnLabel);
        return (Date) TypesInferreer.inferDate(value);
    }

Now all DAOs can retrieve Date values from either database using resultset.getDate("date_field").

Mixing PostgreSQL Numeric and SQLite Integer/Double/...

My SQLite driver didn't implement the getBigDecimal() function. It complained like this when I called it: java.sql.SQLException: not implemented by SQLite JDBC driver. So I had to come up with a solution that was valid for both PostgreSQL and SQLite. This is what I did in ResultSetWrapper:

    @Override
    public BigDecimal getBigDecimal(String columnLabel) throws SQLException {
        Object value = this.wrapped.getObject(columnLabel);
        return (BigDecimal) TypesInferreer.inferBigDecimal(value);
    }

But value would have a different type depending on the actual value stored in the database; it could be an Integer, or a Double, or perhaps something else. I solved all the cases by doing this in TypesInferreer:

    public static Object inferBigDecimal(Object value) {
        if (value == null) return null;
        if (!(value instanceof BigDecimal)) {
            return new BigDecimal(String.valueOf(value));
        }
        return value;
    }

The String constructor of BigDecimal is the recommended one anyway, so everything's fine with this. Now you can retrieve BigDecimal values using resultset.getBigDecimal("bigdecimal_field") from both databases.
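To see why going through String.valueOf() matters, here is a small self-contained sketch. The class name is hypothetical, and the method mirrors inferBigDecimal with a typed return value. It compares the String route with the double constructor of BigDecimal.

```java
import java.math.BigDecimal;

// Hypothetical stand-alone version of the inference step described above.
class BigDecimalInference {

    // Converts whatever the driver returned (Integer, Long, Double, ...) to a BigDecimal.
    static BigDecimal infer(Object value) {
        if (value == null) return null;
        if (value instanceof BigDecimal) return (BigDecimal) value;
        // String.valueOf(0.1) is "0.1", so the decimal digits are preserved
        return new BigDecimal(String.valueOf(value));
    }

    public static void main(String[] args) {
        System.out.println(infer(0.1));          // via String: exactly 0.1
        System.out.println(new BigDecimal(0.1)); // via double: the binary approximation of 0.1
        System.out.println(infer(42));           // Integer from an INTEGER column
    }
}
```

The double constructor exposes the binary floating-point approximation, which is rarely what you want for values that came from a NUMERIC column. The String route keeps the digits the driver reported.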
Mixing PostgreSQL Boolean and SQLite Integer

SQLite doesn't have boolean values. Instead, it interprets other values as booleans according to some rules. When SQLite saves a Boolean value to the database, it saves it as 0 or 1 for false or true respectively. Also, because drivers can interpret any value as a boolean, you can use resultset.getBoolean("boolean_field") and it will work as expected according to those rules. But the problem I faced was when creating filters. If a value for true is stored as 1 in the SQLite database, you can't expect the clause WHERE boolean_field = true to work. You will never find a match. Instead, you should have said WHERE boolean_field = 1. In my app, I created filters like this:

    dao.addFilter(new FilterSimple("boolean_field", true));

Now I needed FilterSimple to infer that, for SQLite, I meant 1 instead of true. So I created what I called a DataSourceVariation. These are objects that are specific to each type of database and are used across all data accesses by DAOs, filters, and other objects. These objects take care of managing all my cross-database incompatibilities, including:

The way to reference a database object: in PostgreSQL you must prepend the schema name to every database object you refer to in your queries. In SQLite you don't.

The way to manage exceptions: explained later in this post.

The way to back up and restore data.

Expressing BETWEEN clauses: explained later in this post.

And also, inferring boolean values.

For VariationSQLite, I did this:

    @Override
    public Object getReplaceValue(Object value) {
        if (value instanceof Boolean) {
            // SQLite stores booleans as 1 (true) or 0 (false)
            return ((Boolean) value) ? Integer.valueOf(1) : Integer.valueOf(0);
        }
        return value;
    }

Now we can say dao.addFilter(new FilterSimple("boolean_field", true)) for both databases, assuming that FilterSimple uses the variation to adapt the value before constructing the clause.

RETRIEVING AUTOGENERATED KEYS

When you have auto-incrementing fields (e.g. SERIAL), in PostgreSQL you can specify a RETURNING clause at the end of an INSERT statement to automatically retrieve the values of autogenerated fields by doing this:

    // e.g. INSERT INTO table_x (...) VALUES (...) RETURNING field_x
    PreparedStatement pstm = conn.prepareStatement(queryWithReturningClause);
    ResultSet rs = pstm.executeQuery();
    if (rs.next()) {
        // Get autogenerated fields from rs
    }

But that won't work with SQLite. In SQLite, retrieving autogenerated fields involves creating the statement with a flag, executing the update, and explicitly asking for the generated values. Like this:

    PreparedStatement pstm = conn.prepareStatement(queryWITHOUTreturningClause, Statement.RETURN_GENERATED_KEYS);
    pstm.executeUpdate();
    ResultSet rs = pstm.getGeneratedKeys();
    if (rs != null && rs.next()) {
        // Get autogenerated fields from rs
    }

The good news is that this code works for both PostgreSQL and SQLite, so I replaced my previous code with this, and didn't have to make any distinction between databases.

ENFORCING FOREIGN KEYS

You'd think that using a REFERENCES table_name(field_name) clause when creating a SQLite database table causes foreign keys to be checked when deleting, updating, etc. You're wrong! Foreign keys are not enforced in SQLite by default. You have to ask for it explicitly, and it's done when creating the connection (WARNING: This is very driver-specific):

    SQLiteConfig config = new SQLiteConfig();
    config.enforceForeignKeys(true);
    Connection conn = DriverManager.getConnection("jdbc:sqlite:" + dataSourcePath, config.toProperties());

For PostgreSQL it's different, so you'd better have a connection pool for each type of database, and decide which one to use at runtime. My framework does exactly that.

NOTE: If you are able to obtain the connection depending on the database type, then you can enforce foreign keys transparently for both databases (for PostgreSQL it happens naturally without extra code).
For instance, you could have an abstract getConnection() method, and each database's connection pool would return the connection in its own way.

MANAGING EXCEPTIONS

I had defined some different types of database exceptions in my framework: ExceptionDBDuplicateEntry, ExceptionDBEntryReferencedElsewhere, etc., which are thrown and propagated to the upper layers of my architecture. For PostgreSQL, these exceptions map directly to some constant codes (which are normally vendor/driver specific): UNIQUE_VIOLATION = "23505", FOREIGN_KEY_VIOLATION = "23503", etc. So, for PostgreSQL, I managed database exceptions something like this:

    @Override
    public void manageException(SQLException ex)
            throws ExceptionDBDuplicateEntry, ExceptionDBEntryReferencedElsewhere {
        if (ex.getSQLState() == null) {
            ex = (SQLException) ex.getCause();
        }
        if (ex.getSQLState().equals(UNIQUE_VIOLATION)) {
            throw new ExceptionDBDuplicateEntry();
        } else if (ex.getSQLState().equals(FOREIGN_KEY_VIOLATION)) {
            throw new ExceptionDBEntryReferencedElsewhere();
        } else {
            DAOPackage.log(ex);
            throw new ExceptionDBUnknownError(ex);
        }
    }

That won't work for SQLite, obviously! So, what I did was move the management of database exceptions to the DataSourceVariation. The VariationPostgresql class has a method similar to the one above. For VariationSQLite, I did sort of a hack, but it's something that has worked until now (maybe until I change my driver).

    @Override
    public void manageException(SQLException ex)
            throws ExceptionDBDuplicateEntry, ExceptionDBEntryReferencedElsewhere {
        // This is a hack (is it???): it depends on the driver's message text
        String message = ex.getMessage().toLowerCase();
        if (message.contains("sqlite_constraint")) {
            if (message.contains("is not unique"))
                throw new ExceptionDBDuplicateEntry();
            else if (message.contains("foreign key constraint failed"))
                throw new ExceptionDBEntryReferencedElsewhere();
            else {
                DAOPackage.log(ex);
                throw new ExceptionDBUnknownError(ex);
            }
        } else {
            DAOPackage.log(ex);
            throw new ExceptionDBUnknownError(ex);
        }
    }

Update: This technique might have some flaws. But hey, can you find a better approach right away?

FIXING BETWEEN CLAUSE

The problem with the BETWEEN clause appeared while using a filter like this:

    dao.addFilter(new FilterBetween("date_field", date1, date2)); // date1 and date2 are java.util.Date objects

FilterBetween would create a BETWEEN clause by formatting Dates as Strings, normally with the format 'yyyy-MM-dd' (although this should be configurable). Since dates in SQLite are long values, we can't create a clause like date_field BETWEEN '2013-01-01' AND '2013-02-01'. It had to be something like date_field >= 1357016400000 AND date_field <= 1359694800000. So, I moved the creation of BETWEEN clauses to... that's right, to DataSourceVariation. VariationSQLite does it like this:

    @Override
    public String getBetweenExpression(String fieldName, Object d1, Object d2) {
        String filter = "";
        try {
            Date dd1 = null;
            Date dd2 = null;
            // MM = month (lowercase mm would mean minutes).
            // Remember, this should be configurable
            SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
            if (d1 instanceof String) dd1 = df.parse((String) d1); else dd1 = (Date) d1;
            if (d2 instanceof String) dd2 = df.parse((String) d2); else dd2 = (Date) d2;
            filter = fieldName + " >= " + dd1.getTime() + " AND " + fieldName + " <= " + dd2.getTime();
        } catch (ParseException ex) {
            DAOPackage.log(ex);
            throw new ExceptionDBUnknownError(ex);
        }
        return filter;
    }

CONCLUSIONS

As you can see, there are many intricacies in making an app support multiple database types. All I did here was support PostgreSQL and SQLite, but who knows what is needed to support other databases as well. You can't expect JDBC alone to do all the work, so be prepared to solve some problems (and another problem, and another, ...) when you migrate to another database. And please, share your journey.
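One detail from the date handling is worth a runnable check: SimpleDateFormat pattern letters are case-sensitive, so MM means month while mm means minute. The sketch below uses hypothetical names and a fixed UTC time zone, which is an assumption made so that the numbers are reproducible. It builds the SQLite-style range expression and shows what the lowercase pattern does to the same input.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Hypothetical stand-alone version of the BETWEEN replacement for SQLite.
class SqliteDateRange {

    // Parses text with the given pattern and returns epoch milliseconds (UTC assumed).
    static long toMillis(String pattern, String text) {
        SimpleDateFormat df = new SimpleDateFormat(pattern);
        df.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: dates are stored as UTC millis
        try {
            return df.parse(text).getTime();
        } catch (ParseException ex) {
            throw new IllegalArgumentException("Unparseable date: " + text, ex);
        }
    }

    // Expresses "field BETWEEN from AND to" as a comparison of long values.
    static String between(String fieldName, String from, String to) {
        return fieldName + " >= " + toMillis("yyyy-MM-dd", from)
                + " AND " + fieldName + " <= " + toMillis("yyyy-MM-dd", to);
    }

    public static void main(String[] args) {
        System.out.println(between("date_field", "2013-01-01", "2013-02-01"));
        // With "mm" the middle number is read as minutes, so these two inputs
        // end up one minute apart instead of one month apart:
        System.out.println(toMillis("yyyy-mm-dd", "2013-02-01") - toMillis("yyyy-mm-dd", "2013-01-01"));
    }
}
```

With UTC the bounds differ from the values quoted in the BETWEEN section, which presumably reflect the author's local time zone. Whichever zone you choose, the values written to the database and the values used in filters need to be produced the same way.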

October 21, 2013

by Martín Proenza

· 12,673 Views

Reasons to Move from DataTables to Generic Collections

These days, hardly anyone in the community writes or speaks about using DataTables and DataSets for data operations. But a number of real projects are built on them, and many developers are still happy to use them in their projects. Sometimes it is not easy to completely replace DataTables with typed generic lists, particularly in large projects. But now is the right time to move, as future developers may not even learn about DataTables :). Generic collections have a number of advantages over DataTables. Once you get to know how beneficial they are, you cannot imagine a day without them. The following are the reasons to move from DataTables to collections that I can think of right now:

A DataTable stores boxed objects, and values must be unboxed whenever they are needed. This adds overhead at runtime. Values in generic collections, however, are strongly typed, so no boxing is involved.

Unboxing happens at runtime, as does the type checking. If there is a mismatch between the source and target types, it leads to a runtime exception. This can cause a number of issues when using DataTables. With collections, the types are checked at compile time, so such mismatches are caught during compilation.

.NET languages have very nice support for creating collections, such as object initializers and collection initializers. We don’t have such features for DataTables.

LINQ queries can be used on both DataTables and collections. But the experience of writing queries on generic collections is better because of the IntelliSense support provided by Visual Studio.

DataTables are framework specific; we often see issues with serializing and de-serializing them in web services. Generic collections are easier to serialize and de-serialize, so they can easily be used in any service and consumed from a client written in any language.

ORMs are becoming increasingly popular, and they use generic collections for all data operations.

Mocking DataTables in unit tests is a pain, as it involves creating the structure of the table wherever it is needed. A generic collection needs its class defined just once.

These are my opinions on preferring collections over DataTables. Any feedback is welcome. Happy coding!

October 21, 2013

by Rabi Kiran Srirangam

· 30,154 Views · 3 Likes

Database vs. Data Science

One thing that Big Data certainly made happen is that it brought the database/infrastructure community and the data analysis/statistics/machine learning communities closer together. As always, each community had its own set of models, methods, and ideas about how to structure and interpret the world. You can still see these differences when looking at current Big Data projects, and I think it’s important to be aware of the distinctions in order to better understand the relationships between different projects. Because, let’s face it, every project claims to re-invent Big Data.

With Hadoop and MapReduce being something like the founding fathers of Big Data, other projects have since appeared. Most notably, there are stream processing projects like Twitter’s Storm, which move from batch-oriented processing to event-based processing that is better suited to real-time, low-latency work. Spark is something different again: a bit like Hadoop, but with greater emphasis on iterative algorithms and in-memory processing to achieve that landmark “100x faster than Hadoop” every current project seems to need to sport. Twitter’s Summingbird project tries to bridge the gap between MapReduce and stream processing by providing a high-level set of operators which can then run on either MapReduce or Storm. However, both Spark and Summingbird leave me sort of flat, because you can see that they come from a database background, which means that there will still be a considerable gap to serious machine learning.

So, what exactly is the difference? In the end, it’s the difference between relational and linear algebra. In the database world, you model relationships between objects, which you encode in tables, with foreign keys to link up entries between different tables.
Probably the most important insight of the database world was to develop a query language, a declarative description of what you want to extract from your database, leaving the optimization of the query and the exact details of how to perform it efficiently to the database guys.

The machine learning community, on the other hand, has its roots in linear algebra and probability theory. Objects are usually encoded as a feature vector, that is, a list of numbers describing different properties of an object. Data is often collected in matrices where each row corresponds to an object and each column to a feature, not unlike a table in a database. However, the operations you perform in order to do data analysis are quite different from those of the database world.

Take something as basic as linear regression: you try to learn a linear function $f(x) = \sum_{i=1}^{d} w_i x_i$ in a $d$-dimensional space (that is, where your objects are described by a $d$-dimensional vector) given $n$ examples $X_i$ and $Y_i$, where $X_i$ are the features describing your objects and $Y_i$ is the real number you attach to $X_i$. One way to “learn” $w$ is to tune it such that the quadratic error on the training examples is minimal. The solution can be written in closed form as $w = (XX^T)^{-1}XY$, where $X$ is the matrix built from the $X_i$ (putting the $X_i$ in the columns of $X$), and $Y$ is the vector of outputs $Y_i$. In order to solve this, you need to solve the linear equation $(XX^T)w = XY$, which can be done by one of a large number of algorithms, starting with Gaussian elimination, which you’ve probably learned in your undergrad studies, or the conjugate gradient algorithm, or by first computing a Cholesky decomposition. All of these algorithms have in common that they are iterative. They go through a number of operations, for example $O(d^3)$ for the Gaussian elimination case. They also need to store intermediate results.
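To make the linear regression step concrete, here is a minimal Java sketch that builds the normal equations and solves them by Gaussian elimination with partial pivoting. The class and method names (`NormalEquations`, `fit`, `solve`) and the tiny data set are made up for illustration; the examples are stored in the columns of `x`, following the convention used in the text.

```java
import java.util.Arrays;

// Illustrative sketch only: names and data are assumptions, not from the original post.
final class NormalEquations {

    /** Solves A w = b for a square matrix A using Gaussian elimination with partial pivoting. */
    static double[] solve(double[][] a, double[] b) {
        int d = b.length;
        // Work on copies so the caller's arrays stay untouched.
        double[][] m = new double[d][];
        for (int i = 0; i < d; i++) {
            m[i] = Arrays.copyOf(a[i], d);
        }
        double[] rhs = Arrays.copyOf(b, d);

        // Forward elimination: these are the stored "intermediate results".
        for (int col = 0; col < d; col++) {
            int pivot = col;
            for (int r = col + 1; r < d; r++) {
                if (Math.abs(m[r][col]) > Math.abs(m[pivot][col])) {
                    pivot = r;
                }
            }
            if (Math.abs(m[pivot][col]) < 1e-12) {
                throw new ArithmeticException("matrix is singular or nearly singular");
            }
            double[] tmpRow = m[col]; m[col] = m[pivot]; m[pivot] = tmpRow;
            double tmpVal = rhs[col]; rhs[col] = rhs[pivot]; rhs[pivot] = tmpVal;

            for (int r = col + 1; r < d; r++) {
                double factor = m[r][col] / m[col][col];
                rhs[r] -= factor * rhs[col];
                for (int c = col; c < d; c++) {
                    m[r][c] -= factor * m[col][c];
                }
            }
        }

        // Back substitution.
        double[] w = new double[d];
        for (int i = d - 1; i >= 0; i--) {
            double sum = rhs[i];
            for (int c = i + 1; c < d; c++) {
                sum -= m[i][c] * w[c];
            }
            w[i] = sum / m[i][i];
        }
        return w;
    }

    /** Fits w from examples stored as the columns of x (d rows, n columns) and outputs y. */
    static double[] fit(double[][] x, double[] y) {
        int d = x.length;
        int n = y.length;
        double[][] xxt = new double[d][d]; // X X^T
        double[] xy = new double[d];       // X Y
        for (int i = 0; i < d; i++) {
            for (int j = 0; j < d; j++) {
                for (int k = 0; k < n; k++) {
                    xxt[i][j] += x[i][k] * x[j][k];
                }
            }
            for (int k = 0; k < n; k++) {
                xy[i] += x[i][k] * y[k];
            }
        }
        return solve(xxt, xy);
    }

    public static void main(String[] args) {
        // Three noise-free examples in d = 2 dimensions, generated from w = (2, -1).
        double[][] x = { {1, 0, 1}, {0, 1, 1} };
        double[] y = { 2, -1, 1 };
        System.out.println(Arrays.toString(fit(x, y)));
    }
}
```

Note how little of this resembles a query: the work consists of nested loops over matrix entries and an in-place elimination, which is the kind of computation the text argues is awkward to phrase in SQL.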
Gaussian elimination and Cholesky decomposition have rather elementary operations acting on individual entries, while the conjugate gradient algorithm performs a matrix-vector multiplication in each iteration. Most importantly, these algorithms can only be expressed very badly in SQL! It’s certainly not impossible, but you’d need to store your data in much different ways than you would in idiomatic database usage.

So, it’s not about whether or not your framework can support iterative algorithms without significant latency; it’s about understanding that joins, group bys, and count() won’t get you far, and that you need scalar products, matrix-vector and matrix-matrix multiplications instead. You don’t need indices for most ML algorithms, except maybe for being able to quickly find the k-nearest neighbors, because most algorithms tend to either take in the whole data set in each iteration or otherwise stream the whole set past some model which is iteratively updated, as in stochastic gradient descent. I’m not sure projects like Spark or Stratosphere have fully grasped the significance of this yet.

Database infrastructure-inspired Big Data has its place when it comes to extracting and preprocessing data, but eventually, you move from database land to machine learning land, which invariably means linear algebra land (or probability theory land, which often also reduces to linear algebra-like computations). What often happens today is that you either painstakingly have to break down your linear algebra into MapReduce jobs, or you actively look for algorithms which fit the database view better. I think we’re still at the beginning of what is possible. Or, to be a bit more aggressive, claims that existing (infrastructure, database, parallelism inspired) frameworks provide you with sophisticated data analytics are wildly exaggerated.
They take care of a very important problem by giving you a reliable infrastructure to scale your data analysis code, but there’s still a lot of work that needs to be done on your side. High-level DSLs like Apache Hive or Pig are a first step in this direction but still too much rooted in the database world IMHO. In summary, one should be aware of the difference between a framework which mostly is concerned with scaling and a tool which actually provides some piece of data analysis. And even if it comes with basic database-like analytics mechanisms, there is still a long way to go to do some serious data science. That’s why we’re also thinking that streamdrill occupies an interesting spot, because it is a bit of infrastructure, allowing you to process a serious amount of event data, but it also provides valuable analysis based on algorithms you wouldn’t want to implement yourself, even if you had some Big Data framework like Hadoop at hand. That’s an interesting direction I also would like to see more of in the future. Note: Just saw that Spark has a logistic regression example on their landing page. Well, doing matrix operations explicitly via map() on collections doesn’t count in my view ;)
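The streaming style of computation mentioned earlier, where the whole data set is passed by a model that is updated one example at a time, can be sketched as follows. This is a minimal illustration of stochastic gradient descent for the squared error of a linear model; the class name `StreamingSgd`, the learning rate, and the data are assumptions made for the example, and each example is stored as one row here.

```java
import java.util.Arrays;

// Illustrative sketch only: names, learning rate, and data are assumptions.
final class StreamingSgd {

    /** Makes one pass over the examples, updating w in place after every single example. */
    static void pass(double[] w, double[][] examples, double[] targets, double learningRate) {
        for (int k = 0; k < examples.length; k++) {
            double[] x = examples[k];

            // Scalar product w . x gives the current prediction.
            double prediction = 0.0;
            for (int i = 0; i < w.length; i++) {
                prediction += w[i] * x[i];
            }

            // Gradient step for the squared error on this one example.
            double error = prediction - targets[k];
            for (int i = 0; i < w.length; i++) {
                w[i] -= learningRate * error * x[i];
            }
        }
    }

    public static void main(String[] args) {
        // Noise-free examples generated from w = (2, -1).
        double[][] examples = { {1, 0}, {0, 1}, {1, 1} };
        double[] targets = { 2, -1, 1 };
        double[] w = new double[2];
        for (int epoch = 0; epoch < 1000; epoch++) {
            pass(w, examples, targets, 0.1);
        }
        System.out.println(Arrays.toString(w));
    }
}
```

The only state carried between examples is the weight vector, and the only operations are scalar products and vector updates. No index or join is involved at any point.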

October 18, 2013

by Mikio Braun

· 11,397 Views · 1 Like

Generating SQL Railroad Diagrams

Simple Talk: How to get SQL Railroad Diagrams from MSDN BNF syntax notation.

On SQL Server Books-On-Line, in the Transact-SQL Reference (Database Engine), every SQL statement has its syntax represented in ‘Backus–Naur Form’ (BNF) notation. For a programmer in a hurry, this should be ideal because it is the only quick way to understand and appreciate all the permutations of the syntax. It is a great feature once you get your eye in. It isn’t the only way to get the information; you can, of course, reverse-engineer an understanding of the syntax from the examples, but your understanding won’t be complete, and you’ll have wasted time doing it. BNF is a good start in representing the syntax: Oracle and SQLite go one step further, and have proper railroad diagrams for their syntax, which is a far more accessible way of doing it.

There are three problems with the BNF on MSDN. Firstly, it isn’t a standard version of BNF, but an ancient fork from EBNF, inherited from Sybase. Secondly, it is excruciatingly difficult to understand, and thirdly it has a number of syntactic and semantic errors. The page describing DML triggers, for example, currently has the absurd BNF error that makes it state that all statements in the body of the trigger must be separated by commas. There are a few other detail problems too. Here is the offending syntax for a DML trigger, pasted from MSDN. ...

I’ve been trying to create railroad diagrams for all the important SQL Server SQL statements, as good as you’d find for Oracle, and have so far published the CREATE TABLE and ALTER TABLE railroad diagrams based on the BNF. Although I’ve been aware of the errors, I never realised until recently how many there are. Then, Colin Daley created a translator for the SQL Server dialect of BNF which outputs the standard EBNF notation used by the W3C. The example MSDN BNF for the trigger would be rendered as … ...
Colin’s intention was to allow anyone to paste SQL Server’s BNF notation into his website-based parser, and from this generate classic railroad diagrams via Gunther Rademacher's Railroad Diagram Generator. Colin's application does this for you: you're not aware that you are moving to a different site. Because Colin's 'translator' is a parser, it will pick up syntax errors. Once you’ve fixed the syntax errors, you will get the syntax in the form of a human-readable railroad diagram and, in this form, the semantic mistakes become flamingly obvious. Gunther’s Railroad Diagram Generator is brilliant. To be able, after correcting the MSDN dialect of BNF, to generate a standard EBNF, and from thence to create railroad diagrams for SQL Server’s syntax that are as good as Oracle’s, is a great boon, and many thanks to Colin for the idea.

Here is the result of the W3C EBNF from Colin’s application then being run through the Railroad Diagram Generator. Now that’s much better, you’ll agree. This is pretty easy to understand, and at this point any error is immediately obvious. This should be seriously useful, and it is to me.

However, there is that snag. The BNF is generally incorrect, and you can’t expect the average visitor to mess about with it. The answer is, of course, to correct the BNF on MSDN and maybe even add railroad diagrams for the syntax. Stop giggling! I agree it won’t happen. In the meantime, we need to collaboratively store and publish these corrected syntaxes ourselves as we do them. How? GitHub? SQL Server Central? Simple-Talk? What should those of us who use the system do with our corrected EBNF so that anyone can use them without hassle?

Grammar Translator

If you are familiar with the Grammar Translator, go ahead and create railroad diagrams from the Transact-SQL Reference. Otherwise, please see the FAQ. In particular, be sure to try the tutorial.

Welcome to Railroad Diagram Generator!
This is a tool for creating syntax diagrams, also known as railroad diagrams, from context-free grammars specified in EBNF. Syntax diagrams have been used for decades now, so the concept is well-known, and some tools for diagram generation are in existence. The features of this one are usage of the W3C's EBNF notation, web-scraping of grammars from W3C specifications, online editing of grammars, diagram presentation in SVG, and it was completely written in web languages (XQuery, XHTML, CSS, JavaScript). There's nothing like a diagram to help grok something (and the MSDN BNF SQL stuff really makes my brain hurt...)

October 18, 2013

by Greg Duncan

· 9,181 Views

Extracting File Metadata with C# and the .NET Framework

The Windows Explorer (shell) provides extended file property information which can be quite valuable. The challenge is how to extract this information, given that the .NET Framework has somewhat limited support for this type of extraction.

October 14, 2013

by Rob Sanders

· 64,246 Views

SSL Performance Overhead in MySQL

This post comes from Ernie Souhrada at the MySQL Performance Blog.

Note: this is part 1 of what will be a two-part series on the performance implications of using in-flight data encryption.

Some of you may recall my security webinar from back in mid-August; one of the follow-up questions that I was asked was about the performance impact of enabling SSL connections. My answer was 25%, based on some 2011 data that I had seen over on yaSSL’s website, but I included the caveat that it is workload-dependent, because the most expensive part of using SSL is establishing the connection. Not long thereafter, I received a request to conduct some more specific benchmarks surrounding SSL usage in MySQL, and today I’m going to show the results.

First, the testing environment. All tests were performed on an Intel Core i7-2600K 3.4GHz CPU (8 cores, HT included) with 32GB of RAM and CentOS 6.4. The disk subsystem is a 2-disk RAID-0 of Samsung 830 SSDs, although since we’re only concerned with measuring the overhead added by using SSL connections, we’ll only be conducting read-only tests with a dataset that fits completely in the buffer pool. The version of MySQL used for this experiment is Community Edition 5.6.13, and the testing tools are sysbench 0.5 and Perl.

We conduct two tests, each one designed to simulate one of the most common MySQL usage patterns. First, we examine connection pooling, often seen in the Java world, where some small set of connections is established by, for example, the servlet container and then just passed around to the application as needed. Second, we examine one-request-per-connection, typical in the LAMP world, where the script that displays a given page might connect to the database, run a couple of queries, and then disconnect.

Test 1: Connection Pool

For the first test, I ran sysbench in read-only mode at concurrency levels of 1, 2, 4, 8, 16, and 32 threads, first with no encryption and then with SSL enabled and key lengths of 1024, 2048, and 4096 bits.
Eight sysbench tables were prepared, each containing 100,000 rows, resulting in a total data size of approximately 256MB. The size of my InnoDB buffer pool was 4GB, and before conducting each official measurement run, I ran a warm-up run to prime the buffer pool. Each official test run lasted 10 minutes; this might seem short, but unlike, say, a PCIe flash storage device, I would not expect the variable under observation to really change that much over time or need time to stabilize. The basic sysbench syntax used is shown below.

    #!/bin/bash
    for ssl in on off ; do
      for threads in 1 2 4 8 16 32 ; do
        sysbench --test=/usr/share/sysbench/oltp.lua --mysql-user=msandbox$ssl --mysql-password=msandbox \
          --mysql-host=127.0.0.1 --mysql-port=5613 --mysql-db=sbtest --mysql-ssl=$ssl \
          --oltp-tables-count=8 --num-threads=$threads --oltp-dist-type=uniform --oltp-read-only=on \
          --report-interval=10 --max-time=600 --max-requests=0 run > sb-ssl_${ssl}-threads-${threads}.out
      done
    done

If you’re not familiar with sysbench, the important thing to know about it for our purposes is that it does not connect and disconnect after each query or after each transaction. It establishes N connections to the database (where N is the number of threads) and runs queries through them until the test is over. This behavior provides our connection-pool simulation. The assumption, given what we know about where SSL is the slowest, is that the performance penalty here should be the lowest.
First, let’s look at raw throughput, measured in queries per second. The average throughput and standard deviation (both measured in queries per second) for each test configuration are shown below in tabular format, with one column per number of threads:

| SSL key size | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| SSL off | 9250.18 (1005.82) | 18297.61 (689.22) | 33910.31 (446.02) | 50077.60 (1525.37) | 49844.49 (934.86) | 49651.09 (498.68) |
| 1024-bit | 2406.53 (288.53) | 4650.56 (558.58) | 9183.33 (1565.41) | 26007.11 (345.79) | 25959.61 (343.55) | 25913.69 (192.90) |
| 2048-bit | 2448.43 (290.02) | 4641.61 (510.91) | 8951.67 (1043.99) | 26143.25 (360.84) | 25872.10 (324.48) | 25764.48 (370.33) |
| 4096-bit | 2427.95 (289.00) | 4641.32 (547.57) | 8991.37 (1005.89) | 26058.09 (432.86) | 25990.13 (439.53) | 26041.27 (780.71) |

So, given that this is an 8-core machine and I/O isn’t a factor, we would expect throughput to max out at 8 threads, so the levelling-off of performance is expected. What we also see is that it doesn’t seem to make much difference what key length is used, which is also largely expected. However, I definitely didn’t think the encryption overhead would be so high.

Next is the 95th-percentile latency from the same test. In tabular format, the raw numbers in milliseconds (average and standard deviation), again with one column per number of threads:

| SSL key size | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| SSL off | 1.882 (0.522) | 1.728 (0.167) | 1.764 (0.145) | 2.459 (0.523) | 6.616 (0.251) | 27.307 (0.817) |
| 1024-bit | 6.151 (0.241) | 6.442 (0.180) | 6.677 (0.289) | 4.535 (0.507) | 11.481 (1.403) | 37.152 (0.393) |
| 2048-bit | 6.083 (0.277) | 6.510 (0.081) | 6.693 (0.043) | 4.498 (0.503) | 11.222 (1.502) | 37.387 (0.393) |
| 4096-bit | 6.120 (0.268) | 6.454 (0.119) | 6.690 (0.043) | 4.571 (0.727) | 11.194 (1.395) | 37.26 (0.307) |

With the exception of 8 and 32 threads, the latency introduced by the use of SSL is constant at right around 5ms, regardless of the key length or the number of threads. I’m not surprised that there’s a large jump in latency at 32 threads, but I don’t have an immediate explanation for the improvement in the SSL latency numbers at 8 threads.
Test 2: Connection Time

For the second test, I wrote a simple Perl script to just connect and disconnect from the database as fast as possible. We know that it’s the connection setup which is the slowest part of SSL, and the previous test already shows us roughly what we can expect for SSL encryption overhead for sending data once the connection has been established, so let’s see just how much overhead SSL adds to connection time. The basic script to do this is quite simple (non-SSL version shown):

    #!/usr/bin/perl
    use DBI;
    use Time::HiRes qw(time);

    # Time 100 connect/disconnect cycles against the local sandbox instance.
    $start = time;
    for (my $i=0; $i<100; $i++) {
        my $dbh = DBI->connect("dbi:mysql:host=127.0.0.1;port=5613", "msandbox","msandbox",undef);
        $dbh->disconnect;
        undef $dbh;
    }
    printf "%.6f\n", time - $start;

As with test #1, I ran test #2 with no encryption and SSL encryption of 1024, 2048, and 4096 bits, and I conducted 10 trials of each configuration. Then I took the elapsed time for each test and converted it to connections per second. Here are the averages and standard deviations:

| Encryption | Average connections per second | Standard deviation |
|---|---|---|
| None | 2701.75 | 165.54 |
| 1024-bit | 77.04 | 6.14 |
| 2048-bit | 28.183 | 1.713 |
| 4096-bit | 5.45 | 0.015 |

Yes, that’s right, 4096-bit SSL connections are nearly 3 orders of magnitude (roughly 500 times) slower to establish than unencrypted connections. Really, the connection overhead for any level of SSL usage is quite high when compared to the unencrypted test, and it’s certainly much higher than my original quoted number of 25%.

Analysis and Parting Thoughts

So, what do we take away from this? The first thing, of course, is that SSL overhead is a lot higher than 25%, particularly if your application uses anything close to the one-connection-per-request pattern.
For a system which establishes and maintains long-running connections, the initial connection overhead becomes a non-factor, regardless of the encryption strength, but there’s still a rather large performance penalty compared to the unencrypted connection. This leads directly into the second point, which is that connection pooling is by far a more efficient method of using SSL if your application can support it.

But what if connection pooling isn’t an option, MySQL’s SSL performance is insufficient, and you still need full encryption of data in-flight? Run the encryption component of your system at a lower layer: a VPN with hardware crypto would be the fastest approach, but even something as simple as an SSH tunnel or OpenVPN *might* be faster than SSL within MySQL. I’ll be exploring some of these solutions in a follow-up post.

And finally… when in doubt, run your own benchmarks. I don’t have an explanation for why the yaSSL numbers are so different from these (maybe yaSSL is a faster SSL library than OpenSSL, or maybe they used a different cipher; if you’re curious, the original 25% number came from slides 56-58 of this presentation), but in any event, this does illustrate why it’s important to run tests on your own hardware and with your own workload when you’re interested in finding out how well something will perform rather than taking someone else’s word for it.
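To put the two tests side by side, the arithmetic behind the overhead claims can be sketched in a few lines of Java. The class and method names are made up for illustration; the input figures are the averages reported in the tables above (32-thread throughput for the pooled test, connections per second for the connect/disconnect test).

```java
// Illustrative sketch only: names are assumptions; figures are the averages from the post's tables.
final class SslOverhead {

    /** Fraction of baseline throughput that is lost when SSL is enabled (0.0 to 1.0). */
    static double throughputLoss(double baselineQps, double sslQps) {
        return 1.0 - sslQps / baselineQps;
    }

    /** How many times slower a rate is compared with the baseline rate. */
    static double slowdownFactor(double baselineRate, double rate) {
        return baselineRate / rate;
    }

    public static void main(String[] args) {
        // Test 1 (connection pool), 32 threads: SSL off vs. 1024-bit key.
        System.out.printf("pooled throughput loss: %.1f%%%n",
                100.0 * throughputLoss(49651.09, 25913.69));

        // Test 2 (connect/disconnect): unencrypted vs. 1024-bit and 4096-bit keys.
        System.out.printf("1024-bit connect slowdown: %.1fx%n", slowdownFactor(2701.75, 77.04));
        System.out.printf("4096-bit connect slowdown: %.1fx%n", slowdownFactor(2701.75, 5.45));
    }
}
```

With these inputs the pooled workload loses a little under half of its throughput, while connection setup is roughly 35 times slower with a 1024-bit key and roughly 500 times slower with a 4096-bit key, which is why the one-connection-per-request pattern suffers so much more.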

October 11, 2013

by Peter Zaitsev

· 6,807 Views

Oracle Weblogic Stuck Thread Detection

The following question will again test your knowledge of the Oracle Weblogic threading model. I’m looking forward to your comments and experience on the same. If you are a Weblogic administrator, I’m certain that you have heard of this common problem: stuck threads. This is one of the most common problems you will face when supporting a Weblogic production environment. A Weblogic stuck thread simply means a thread that has been performing the same request for a very long time, longer than the configurable Stuck Thread Max Time.

Question: How can you detect the presence of STUCK threads during and following a production incident?

Answer: As we saw in our last article, “Weblogic Thread Monitoring Tips”, Weblogic provides functionality that allows us to closely monitor its internal self-tuning thread pool. It will also highlight the presence of any stuck thread. This monitoring view is very useful when you do a live analysis, but what about after a production incident? The good news is that Oracle Weblogic will also log any detected stuck thread to the server log. Such information includes details on the request and, more importantly, the thread stack trace. This data is crucial and will allow you to better understand the root cause of any slowdown condition that occurred at a certain time.

    < ExecuteThread: '11' for queue: 'weblogic.kernel.Default (self-tuning)'>
    <[STUCK] ExecuteThread: '35' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "608" seconds working on the request "Workmanager: default, Version: 0, Scheduled=true, Started=true, Started time: 608213 ms
    POST /App1/jsp/test.jsp HTTP/1.1
    Accept: application/x-ms-application...
    Referer: http://..
    Accept-Language: en-US
    User-Agent: Mozilla/4.0 ..
    Content-Type: application/x-www-form-urlencoded
    Accept-Encoding: gzip, deflate
    Content-Length: 539
    Connection: Keep-Alive
    Cache-Control: no-cache
    Cookie: JSESSIONID=
    ]", which is more than the configured time (StuckThreadMaxTime) of "600" seconds. Stack trace:
    ...................................
    javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    weblogic.servlet.internal.StubSecurityHelper$ServletServiceAction.run(StubSecurityHelper.java:227)
    weblogic.servlet.internal.StubSecurityHelper.invokeServlet(StubSecurityHelper.java:125)
    weblogic.servlet.internal.ServletStubImpl.execute(ServletStubImpl.java:301)
    weblogic.servlet.internal.ServletStubImpl.execute(ServletStubImpl.java:184)
    weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction....
    weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.run()
    weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:321)
    weblogic.security.service.SecurityManager.runAs(SecurityManager.java:120)
    weblogic.servlet.internal.WebAppServletContext.securedExecute(WebAppServletContext.java:2281)
    weblogic.servlet.internal.WebAppServletContext.execute(WebAppServletContext.java:2180)
    weblogic.servlet.internal.ServletRequestImpl.run(ServletRequestImpl.java:1491)
    weblogic.work.ExecuteThread.execute(ExecuteThread.java:256)
    weblogic.work.ExecuteThread.run(ExecuteThread.java:221)

Here is one more tip: generating and analyzing a JVM thread dump will also highlight stuck threads. In such a thread dump, the Weblogic thread state is updated to STUCK, which means that this particular request has been executing for at least 600 seconds, or 10 minutes. This is very useful information since the native thread state will typically remain RUNNABLE. The native thread state will only get updated when dealing with BLOCKED threads, etc. You have to keep in mind that RUNNABLE simply means that this thread is healthy from a JVM perspective. However, it does not mean that it truly is healthy from a middleware or Java EE container perspective. This is why Oracle Weblogic has its own internal ExecuteThread state.
Finally, if your organization or client is using any commercial monitoring tool, I recommend that you enable some alerting around both hogging thread and stuck thread. This will allow your support team to take some pro-active actions before the affected Weblogic managed server(s) become fully unresponsive.
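For post-incident analysis, the stuck-thread entries in the server log can also be picked out programmatically. The following is a minimal Java sketch that scans log text for entries shaped like the sample above and extracts the thread number, queue name, and busy time. The class name `StuckThreadScanner` and the regular expression are assumptions based only on that sample entry; the exact log layout may differ between Weblogic versions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: the pattern is derived from the sample log entry in this article.
final class StuckThreadScanner {

    private static final Pattern STUCK = Pattern.compile(
            "\\[STUCK\\] ExecuteThread: '(\\d+)' for queue: '([^']*)' has been busy for \"(\\d+)\" seconds");

    /** Returns one summary string per stuck-thread entry found in the given log text. */
    static List<String> scan(String logText) {
        List<String> found = new ArrayList<>();
        Matcher matcher = STUCK.matcher(logText);
        while (matcher.find()) {
            found.add("thread " + matcher.group(1)
                    + " busy for " + matcher.group(3) + "s"
                    + " on queue " + matcher.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "<[STUCK] ExecuteThread: '35' for queue: "
                + "'weblogic.kernel.Default (self-tuning)' has been busy for \"608\" seconds "
                + "working on the request ...";
        for (String entry : scan(sample)) {
            System.out.println(entry);
        }
    }
}
```

In practice the log text would be read from the managed server's log file rather than from a string literal, and the matches could feed whatever alerting the support team already uses.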

October 9, 2013

by Pierre - Hugues Charbonneau

· 55,039 Views

Cache Scope with EHCache

In another blog post we explained how you can use a new feature of Mule 3.3 to cache data in your Mule flows. Here we look at how to configure Mule to use EHCache to handle the caching part, rather than storing the data in the default InMemoryObjectStore. Let’s get going.

First, let’s start by saying that there are many different ways to do this. We’ve taken the route of configuring everything through Spring. So just because you configured your EHCache differently does not mean it’s wrong or that ours is better. We prefer this way since Mule integrates very nicely with Spring. Now that we have settled that, let’s make a list of what we need to do:

- Define cache manager
- Define cache factory bean
- Create a custom object store
- Define a Mule caching strategy

The first job is to define a cache manager and cache factory bean. Spring provides two very handy classes for this, specific to EHCache: EhCacheManagerFactoryBean and EhCacheFactoryBean. The cache manager just needs to be defined. However, on the cache factory bean you can configure all EHCache details such as time to live, time to idle, when to overflow to disk, the eviction policy, and much more. For more information, you can check the API here or the EHCache website. Also, from the cache factory bean, you need to refer back to the cache manager. An example is shown in the following gist:

Once the cache and the cache manager are configured, we need to define a custom object store that uses EHCache to store and retrieve the data. This is very easy to do: we just need to create a new class that implements Mule’s standard ObjectStore interface and uses EHCache to do the operations.
A working custom EHCache object store is shown in the following gist:

    package com.ricston.cache;

    import java.io.Serializable;

    import net.sf.ehcache.Ehcache;
    import net.sf.ehcache.Element;

    import org.mule.api.store.ObjectStore;
    import org.mule.api.store.ObjectStoreException;

    // Mule object store that delegates all operations to an injected EHCache instance.
    public class EhcacheObjectStore<T extends Serializable> implements ObjectStore<T> {

        private Ehcache cache;

        @Override
        public synchronized boolean contains(Serializable key) throws ObjectStoreException {
            return cache.isKeyInCache(key);
        }

        @Override
        public synchronized void store(Serializable key, T value) throws ObjectStoreException {
            Element element = new Element(key, value);
            cache.put(element);
        }

        @SuppressWarnings("unchecked")
        @Override
        public synchronized T retrieve(Serializable key) throws ObjectStoreException {
            Element element = cache.get(key);
            if (element == null) {
                return null;
            }
            return (T) element.getValue();
        }

        @Override
        public synchronized T remove(Serializable key) throws ObjectStoreException {
            T value = retrieve(key);
            cache.remove(key);
            return value;
        }

        @Override
        public boolean isPersistent() {
            return false;
        }

        public Ehcache getCache() {
            return cache;
        }

        public void setCache(Ehcache cache) {
            this.cache = cache;
        }
    }

As you can clearly see, this object store encapsulates an EHCache instance. This should be set before we start using this object store. As you can imagine, we will do this through Spring.

The next step is to configure a caching strategy which uses our brand new EHCache object store. The caching strategy uses our custom object store, and in the object store, we are using Spring to inject the cache defined earlier in this blog post. The rest of the Mule configuration can be exactly the same as in the other blog post mentioned before.

So here we have shown you how to use EHCache as the caching engine for the cache scopes provided by Mule 3.3. A reason why you would do this is that with EHCache, you have a very good and proven caching product with a ton of settings that you can exploit and tune for your application.
As a side note, if you have issues with EHCache classloading in Mule, place the EHCache jars inside $MULE_HOME/lib/user rather than in your application. Enjoy.

October 4, 2013

by Alan Cassar

· 14,193 Views · 1 Like

TestNG @Test Annotation and DataProviderClass Example

In the previous post, we saw an example where the dataProvider attribute was used to run the same test method with different sets of input data. TestNG provides another attribute, dataProviderClass, which is used in conjunction with dataProvider to fetch the input data for the test methods from an external class. The class that holds the input data is assigned to the dataProviderClass attribute, and dataProvider holds the name of the method from which the input data is actually fetched. Here is a quick example to show how to use the dataProviderClass and dataProvider attributes.

Code

Service Class

    package com.skilledmonster.example;

    /**
     * Simple calculator service to demonstrate TestNG Framework
     *
     * @author Jagadeesh Motamarri
     * @version 1.0
     */
    public interface CalculatorService {
        int sum(int a, int b);
        int multiply(int a, int b);
        int div(int a, int b);
        int sub(int a, int b);
    }

Service Implementation Class

    package com.skilledmonster.example;

    /**
     * Simple calculator service implementation to demonstrate TestNG Framework
     *
     * @author Jagadeesh Motamarri
     * @version 1.0
     */
    public class SimpleCalculator implements CalculatorService {
        public int sum(int a, int b) {
            return a + b;
        }
        public int multiply(int a, int b) {
            return a * b;
        }
        public int div(int a, int b) {
            return a / b;
        }
        public int sub(int a, int b) {
            return a - b;
        }
    }

Data Provider Class

    package com.skilledmonster.common;

    import org.testng.annotations.DataProvider;

    /**
     * Data Provider class for TestNG test cases
     *
     * @author Jagadeesh Motamarri
     * @version 1.0
     */
    public class TestNGDataProvider {
        /**
         * Data Provider for testing sum of 2 numbers
         *
         * @return pairs of operands, one pair per test invocation
         */
        @DataProvider
        public static Object[][] testSumInput() {
            return new Object[][] { { 5, 5 }, { 10, 10 }, { 20, 20 } };
        }

        /**
         * Data Provider for testing multiplication of 2 numbers
         *
         * @return pairs of operands, one pair per test invocation
         */
        @DataProvider
        public static Object[][] testMultipleInput() {
            return new Object[][] { { 5, 5 }, { 10, 10 }, { 20, 20 } };
        }
    }

Finally, the test class that uses the dataProviderClass attribute to feed the input data to the test methods:

    package com.skilledmonster.example;

    import org.testng.Assert;
    import org.testng.annotations.BeforeClass;
    import org.testng.annotations.Test;

    import com.skilledmonster.common.TestNGDataProvider;

    /**
     * Example to demonstrate use of dataProviderClass and dataProvider attributes of TestNG framework
     *
     * @author Jagadeesh Motamarri
     * @version 1.0
     */
    public class TestNGAnnotationTestDataProviderExample {
        public CalculatorService service;

        @BeforeClass
        public void init() {
            System.out.println("@BeforeClass: The annotated method will be run before the first test method in the current class is invoked.");
            System.out.println("init service");
            service = new SimpleCalculator();
        }

        @Test(dataProviderClass = TestNGDataProvider.class, dataProvider = "testSumInput")
        public void testSum(int a, int b) {
            System.out.println("@Test : testSum()");
            int result = service.sum(a, b);
            Assert.assertEquals(result, a + b);
        }

        @Test(dataProviderClass = TestNGDataProvider.class, dataProvider = "testMultipleInput")
        public void testMultiple(int a, int b) {
            System.out.println("@Test : testMultiple()");
            int result = service.multiply(a, b);
            Assert.assertEquals(result, a * b);
        }
    }

Output

As the console output shows, each of the testSum() and testMultiple() methods is invoked with different sets of input data supplied by an external class through the dataProviderClass attribute.

Advantage

More flexibility and re-usability of commonly used data across several test classes.

Download

Download TestNG DataProvider Example

October 2, 2013

by Jagadeesh Motamarri

· 25,485 Views

Large Dataset Retrieval in Mule

Recently, a customer asked how to perform large dataset retrieval in Mule. The documentation page briefly explains how this may be achieved; however, there is no working example of how to do this as far as I can tell. This blog post aims to explain in detail how large dataset retrieval works in Mule by giving an example.

The customer wanted to transfer items from one database to another by performing a batch select and then a batch insert. The ‘batch insert’ part is pretty straightforward and is done automatically by Mule when the payload is of type List. However, the batch select is handled in a different way. In order to retrieve all the records, we will use the Batch Manager to compute the ID ranges for the next batch of records to be retrieved. This is provided out of the box with Mule EE.

We start by defining the database which will be used throughout the example to retrieve and insert records. For simplicity’s sake we are going to use the Derby in-memory database. NOTE: the records should be identified by a key which is unique and in sequential numeric order.

    CREATE TABLE table1(KEY1 INTEGER GENERATED BY DEFAULT AS IDENTITY(START WITH 1) NOT NULL PRIMARY KEY, KEY2 VARCHAR(255));
    CREATE TABLE table2(KEY1 VARCHAR(255), KEY2 VARCHAR(255));
    INSERT INTO table1(KEY2) VALUES ('TEST1');
    INSERT INTO table1(KEY2) VALUES ('TEST2');
    INSERT INTO table1(KEY2) VALUES ('TEST3');
    INSERT INTO table1(KEY2) VALUES ('TEST4');
    INSERT INTO table1(KEY2) VALUES ('TEST5');
    INSERT INTO table1(KEY2) VALUES ('TEST6');
    INSERT INTO table1(KEY2) VALUES ('TEST7');
    INSERT INTO table1(KEY2) VALUES ('TEST8');
    INSERT INTO table1(KEY2) VALUES ('TEST9');
    INSERT INTO table1(KEY2) VALUES ('TEST10');

As explained before, the select query is based on the ID ranges that are computed by the Batch Manager when nextBatch() is called. This will return a map with the lower and upper IDs to be selected. In our case, we are storing this map in a flow variable named ‘boundaries’.
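The ID-range idea can be illustrated independently of Mule. The sketch below is a plain Java stand-in, not Mule's BatchManager: the class `IdRangeBatcher`, its method names, and the convention that a batch covers the IDs from `startingPoint + 1` to `startingPoint + batchSize` are all assumptions made for illustration. It walks an in-memory list of IDs in ranges until a range comes back empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in only: this is NOT Mule's BatchManager API.
final class IdRangeBatcher {

    private final int batchSize;
    private long startingPoint;

    IdRangeBatcher(int batchSize, long startingPoint) {
        this.batchSize = batchSize;
        this.startingPoint = startingPoint;
    }

    /** Returns {lowerId, upperId}, both inclusive, for the next batch to select. */
    long[] nextBatch() {
        return new long[] { startingPoint + 1, startingPoint + batchSize };
    }

    /** Records that everything up to and including upperId has been processed. */
    void completeBatch(long upperId) {
        startingPoint = upperId;
    }

    /** Resets the starting point so that the next run begins from the first record. */
    void reset() {
        startingPoint = 0;
    }

    /**
     * Simulates "SELECT ... WHERE id BETWEEN lower AND upper" over an in-memory table
     * and collects the batches. Stops at the first empty range, which is why the IDs
     * need to be sequential: a gap wider than the batch size would end the loop early.
     */
    static List<List<Long>> drain(List<Long> tableIds, int batchSize) {
        IdRangeBatcher batcher = new IdRangeBatcher(batchSize, 0);
        List<List<Long>> batches = new ArrayList<>();
        while (true) {
            long[] boundaries = batcher.nextBatch();
            List<Long> rows = new ArrayList<>();
            for (long id : tableIds) {
                if (id >= boundaries[0] && id <= boundaries[1]) {
                    rows.add(id);
                }
            }
            if (rows.isEmpty()) {
                batcher.reset();
                break;
            }
            batches.add(rows);
            batcher.completeBatch(boundaries[1]);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> ids = Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L);
        System.out.println(drain(ids, 4));
    }
}
```

With ten sequential IDs and a batch size of four, the loop produces three batches (four, four, and two rows) and then stops on the empty fourth range, mirroring the check on the payload size described in the flow.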
After configuring the database and the JDBC connector, we need to configure the Batch Manager. This consists of specifying the idStore (a text file), which the Batch Manager uses to store the starting point for the next batch. We also need to configure the batch size and the starting point on the Batch Manager. In the documentation, you will find a reference to the noArgsWrapper. Its job is to invoke the nextBatch() method on the Batch Manager. However, we find this confusing and misleading, so instead we use a simple MEL expression which calls nextBatch() directly.

Now we have to configure the main flow where we perform the batch select. Given that the records are retrieved in batches, the flow has to be called multiple times until all of the records are retrieved. To solve this, we created a composite source so that at the end of the flow, if we haven’t retrieved all the records, we re-trigger the same flow using the VM queue.

Once the current batch is finished, we need to call completeBatch() to tell the Batch Manager that we’re done with the current batch and ready to process the next. If this is not done, the Batch Manager will still consider the previous batch as ‘processing’. Furthermore, we have to check whether we have retrieved all of the records so we can stop processing. We do this by checking the size of the payload that is returned from the JDBC outbound endpoint.

If the payload size is 0 (no more records to be retrieved), we have to call the completeBatch() method with -1, telling the Batch Manager that the whole batch run is complete. We must also set the starting point for the next batch to 0. This is required so that when the flow is triggered again from the HTTP inbound endpoint, it starts processing from the first record. If the batch run is not complete, we call the completeBatch() method (from the BatchManager class) with the current upperId.
This sets the new starting point for the next batch to be processed. Finally, we end the flow with a VM outbound on ‘batch’, which triggers the main flow to process the next batch of records.

When no more records are returned:

    app.registry.seqBatchManager.completeBatch(-1);
    app.registry.seqBatchManager.setStartingPointForNextBatch(0);

Otherwise:

    app.registry.seqBatchManager.completeBatch(flowVars.boundaries.upperId);

A complete Mule configuration of the main flow is shown here below.
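Independently of the Mule configuration, the boundary-driven loop described above can be sketched in plain Python. This is only an illustration of the control flow, under stated assumptions: the `SeqBatchManager` class below is a hypothetical stand-in written for this sketch, not Mule's Batch Manager API, and its semantics (the starting point is the highest ID already processed) are assumed. The in-memory list plays the role of table1.

```python
class SeqBatchManager:
    """Hypothetical stand-in for a sequential batch manager (not Mule's API)."""

    def __init__(self, batch_size, starting_point=0):
        self.batch_size = batch_size
        self.starting_point = starting_point  # assumed: highest ID already processed
        self.processing = False

    def next_batch(self):
        # Compute the ID range for the next batch of records.
        self.processing = True
        lower = self.starting_point + 1
        return {"lowerId": lower, "upperId": lower + self.batch_size - 1}

    def complete_batch(self, upper_id):
        # -1 signals that the whole run is finished; otherwise advance the start.
        if upper_id != -1:
            self.starting_point = upper_id
        self.processing = False

    def set_starting_point_for_next_batch(self, value):
        self.starting_point = value


def transfer(source_rows, manager):
    """Copy rows batch by batch until a batch comes back empty."""
    target = []
    batches = 0
    while True:
        boundaries = manager.next_batch()
        batch = [(key, value) for key, value in source_rows
                 if boundaries["lowerId"] <= key <= boundaries["upperId"]]
        if not batch:
            # No more records: finish the run and reset for the next trigger.
            manager.complete_batch(-1)
            manager.set_starting_point_for_next_batch(0)
            break
        target.extend((str(key), value) for key, value in batch)  # table2.KEY1 is VARCHAR
        manager.complete_batch(boundaries["upperId"])
        batches += 1
    return target, batches


table1 = [(i, f"TEST{i}") for i in range(1, 11)]
manager = SeqBatchManager(batch_size=4)
table2, batch_count = transfer(table1, manager)
print(batch_count, len(table2))  # prints "3 10"
```

The empty-batch check only works as a stop condition because the keys are unique and sequential, which is the same requirement stated in the NOTE above: a gap wider than the batch size would end the run early.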

October 2, 2013

by Clare Cini

· 10,442 Views

Clojure: Converting a string to a date

I wanted to do some date manipulation in Clojure recently and figured that since clj-time is a wrapper around Joda Time it’d probably do the trick. The first thing we need to do is add the dependency to our project file and then run lein deps to pull down the appropriate JARs. The project file should look something like this:

project.clj

    (defproject ranking-algorithms "0.1.0-SNAPSHOT"
      :license {:name "Eclipse Public License"
                :url "http://www.eclipse.org/legal/epl-v10.html"}
      :dependencies [[org.clojure/clojure "1.4.0"]
                     [clj-time "0.6.0"]])

Now let’s load the clj-time.format namespace into the REPL since we know we’ll be parsing dates:

    > (require '(clj-time [format :as f]))

The string that I want to convert into a date looks like this:

    (def string-date "18 September 2012")

The first thing we should do is check whether there is an existing formatter that we can use by evaluating the following function:

    > (f/show-formatters)
    ...
    :hour-minute                  06:45
    :hour-minute-second           06:45:22
    :hour-minute-second-fraction  06:45:22.473
    :hour-minute-second-ms        06:45:22.473
    :mysql                        2013-09-20 06:45:22
    :ordinal-date                 2013-263
    :ordinal-date-time            2013-263T06:45:22.473Z
    :ordinal-date-time-no-ms      2013-263T06:45:22Z
    :rfc822                       Fri, 20 Sep 2013 06:45:22 +0000
    ...

There are a lot of different built-in formatters, but unfortunately I couldn’t find one that exactly matched our date format, so we’ll have to write our own. For that we’ll need to refresh our knowledge of Java date formatting. We end up with the following formatter:

    > (f/parse (f/formatter "dd MMM YYYY") string-date)
    #

It took me much longer than it should have to remember that ‘MMM’ is the pattern to match a short form of a month, but it’s just the same as what we’d have to do in Java, with some neat wrapper functions.
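For comparison only, here is how the same day, month name, year string might be parsed with Python's standard library. This is not part of the clj-time approach above; it is a small sketch that assumes an English locale, where `%B` matches a full month name (`%b` would be the abbreviated form).

```python
from datetime import datetime

string_date = "18 September 2012"

# %d = day of month, %B = full month name (locale-dependent), %Y = four-digit year
parsed = datetime.strptime(string_date, "%d %B %Y")
print(parsed.isoformat())  # prints "2012-09-18T00:00:00"
```

As in the Clojure version, the work is mostly a matter of picking the pattern letters that match the input string.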

October 2, 2013

by Mark Needham

· 5,860 Views

Getting Started with NHibernate and ASP.NET MVC- CRUD Operations

In this post we are going to learn how to use NHibernate in an ASP.NET MVC application.

What is NHibernate:
ORMs (Object Relational Mappers) are quite popular these days. An ORM is a mechanism for mapping database entities to class entity objects without writing code to fetch data or hand-writing SQL queries. It automatically generates the SQL queries and fetches the data on our behalf. NHibernate is an Object Relational Mapper that is a port of the popular Java ORM Hibernate. It provides a framework for mapping domain model classes to a traditional relational database. It frees us from writing repetitive ADO.NET code, as it acts as our database layer. Let’s get started with NHibernate.

How to download:
There are two ways to download this ORM: from the NuGet package and from the SourceForge site.
NuGet - http://www.nuget.org/packages/NHibernate/
SourceForge - http://sourceforge.net/projects/nhibernate/

Creating a table for CRUD:
I am going to use SQL Server 2012 Express edition as the database. Following is a table with four fields: Id, First Name, Last Name, Designation.

Creating ASP.NET MVC project for NHibernate:
Let’s create an ASP.NET MVC project for NHibernate by clicking File -> New Project -> ASP.NET MVC 4 Web Application.

Installing NuGet package for NHibernate:
I installed the NuGet package from the Package Manager Console with the following command. It will install like following.

NHibernate configuration file:
NHibernate needs a configuration file for the database connection and other details. You need to create a file named ‘hibernate.cfg.xml’ in the Models\Nhibernate folder of your application with the following details.

    connection provider: NHibernate.Connection.DriverConnectionProvider
    driver class: NHibernate.Driver.SqlClientDriver
    connection string: Server=(local);database=LocalDatabase;Integrated Security=SSPI;
    dialect: NHibernate.Dialect.MsSql2012Dialect

Here you have the different settings for NHibernate. You need to select the driver class and connection provider as per your database.
If you are using another database such as Oracle or MySQL, you will have a different configuration. NHibernate can work with any of these databases.

Creating a model class for NHibernate:
Now it’s time to create the model class for our CRUD operations. Following is the code for that. The property names are identical to the database table columns.

    namespace NhibernateMVC.Models
    {
        public class Employee
        {
            public virtual int Id { get; set; }
            public virtual string FirstName { get; set; }
            public virtual string LastName { get; set; }
            public virtual string Designation { get; set; }
        }
    }

Creating a mapping file between class and table:
Now we need an XML mapping file between the class and the table, named “Employee.hbm.xml”, in the Nhibernate folder, like the following.

Creating a class to open session for NHibernate:
I have created a class in the Models folder called NHibertnateSession, with a static function in it to open a session for NHibernate.

    using System.Web;
    using NHibernate;
    using NHibernate.Cfg;

    namespace NhibernateMVC.Models
    {
        public class NHibertnateSession
        {
            public static ISession OpenSession()
            {
                var configuration = new Configuration();
                var configurationPath = HttpContext.Current.Server.MapPath(@"~\Models\Nhibernate\hibernate.cfg.xml");
                configuration.Configure(configurationPath);
                var employeeConfigurationFile = HttpContext.Current.Server.MapPath(@"~\Models\Nhibernate\Employee.hbm.xml");
                configuration.AddFile(employeeConfigurationFile);
                ISessionFactory sessionFactory = configuration.BuildSessionFactory();
                return sessionFactory.OpenSession();
            }
        }
    }

Listing:
Now that our OpenSession method is ready, it’s time to write the controller code to fetch data from the database. Following is the code for that.
    using System;
    using System.Web.Mvc;
    using NHibernate;
    using NHibernate.Linq;
    using System.Linq;
    using NhibernateMVC.Models;

    namespace NhibernateMVC.Controllers
    {
        public class EmployeeController : Controller
        {
            public ActionResult Index()
            {
                using (ISession session = NHibertnateSession.OpenSession())
                {
                    // Query<Employee>() comes from NHibernate.Linq
                    var employees = session.Query<Employee>().ToList();
                    return View(employees);
                }
            }
        }
    }

Here you can see that I get a session via the OpenSession method and then query the database to fetch the employees. Let’s create a new view for this; you can create it by right-clicking View in the above method. We are going to create a strongly typed view. Our listing screen is ready; once you run the project, it will fetch the data as following.

Create/Add:
Now it’s time to write the add employee code. Following is the code I have written for that. Here I have used the session.Save method to save the new employee. The first method returns a blank view, and the second method, with the HttpPost attribute, saves the data into the database.

    public ActionResult Create()
    {
        return View();
    }

    [HttpPost]
    public ActionResult Create(Employee emplolyee)
    {
        try
        {
            using (ISession session = NHibertnateSession.OpenSession())
            {
                using (ITransaction transaction = session.BeginTransaction())
                {
                    session.Save(emplolyee);
                    transaction.Commit();
                }
            }
            return RedirectToAction("Index");
        }
        catch (Exception exception)
        {
            return View();
        }
    }

Now let’s create a strongly typed Create view by right-clicking View and choosing Add View. Once you run the application and click Create New, it will load the following screen.

Edit/Update:
Now let’s create the edit functionality with NHibernate and ASP.NET MVC. For that I have written two action result methods: one for loading the edit view and another for saving the data. Following is the code for that.
    public ActionResult Edit(int id)
    {
        using (ISession session = NHibertnateSession.OpenSession())
        {
            var employee = session.Get<Employee>(id);
            return View(employee);
        }
    }

    [HttpPost]
    public ActionResult Edit(int id, Employee employee)
    {
        try
        {
            using (ISession session = NHibertnateSession.OpenSession())
            {
                // Load the stored entity, then copy the posted values onto it
                var employeetoUpdate = session.Get<Employee>(id);
                employeetoUpdate.Designation = employee.Designation;
                employeetoUpdate.FirstName = employee.FirstName;
                employeetoUpdate.LastName = employee.LastName;
                using (ITransaction transaction = session.BeginTransaction())
                {
                    session.Save(employeetoUpdate);
                    transaction.Commit();
                }
            }
            return RedirectToAction("Index");
        }
        catch
        {
            return View();
        }
    }

Here, in the first action result, I fetch the existing employee via the Get method of the NHibernate session, and in the second I fetch the current employee and change it with the updated details. You can create a view for this via right click -> Add View, like below. I have created a strongly typed view for edit. Once you run the code, it will look like following.

Details:
Now it’s time to create a detail view where the user can see the employee details. I have written the following logic for the details view.

    public ActionResult Details(int id)
    {
        using (ISession session = NHibertnateSession.OpenSession())
        {
            var employee = session.Get<Employee>(id);
            return View(employee);
        }
    }

You can add a view like the following by right-clicking View in the action result. Once you run this in the browser, it will look like following.

Delete:
Now it’s time to write the delete functionality. Following is the code I have written for that.
    public ActionResult Delete(int id)
    {
        using (ISession session = NHibertnateSession.OpenSession())
        {
            var employee = session.Get<Employee>(id);
            return View(employee);
        }
    }

    [HttpPost]
    public ActionResult Delete(int id, Employee employee)
    {
        try
        {
            using (ISession session = NHibertnateSession.OpenSession())
            {
                using (ITransaction transaction = session.BeginTransaction())
                {
                    session.Delete(employee);
                    transaction.Commit();
                }
            }
            return RedirectToAction("Index");
        }
        catch (Exception exception)
        {
            return View();
        }
    }

Here, the first action result shows the delete confirmation view, and the second performs the actual delete operation with the session’s Delete method. When you run it in the browser, it will look like following.

That’s it. It’s very easy to implement CRUD operations with NHibernate. Stay tuned for more.

October 1, 2013

by Jalpesh Vadgama

· 47,305 Views

Semantic Search with Solr and NumPy

Built upon Lucene, Solr provides fast, highly scalable, and easily maintainable full-text search capabilities. However, under the hood, Solr is really just a sophisticated token-matching engine. What’s missing? Semantic Search! Consider three somewhat silly documents:

Yellow banana peels.
A banana is a long yellow fruit.
This mystery fruit is long and yellow and has a peel.

Now what happens if you search for the term “banana”? Under normal circumstances you only get back the first and second documents. But why shouldn’t you also get back the third document? It’s obviously talking about bananas!

Semantic Search via Collaborative Filtering

Colleague Doug Turnbull and I recently set about to right this wrong with help from a machine learning technique called collaborative filtering. Collaborative filtering is most often used as a basis for recommendation algorithms. For example, collaborative filtering algorithms were the central focus of the now-famous Netflix Prize, which awarded $1 million to the team that could build the best movie recommendation engine. When dealing with recommendations, collaborative filtering works by mathematically identifying commonalities in groups of users based upon the movies that they enjoyed. Then, if you appear to fall in one of those groups, the recommendation engine will point you towards a movie that a) you haven’t watched and b) you are likely to enjoy.

So what does this have to do with Semantic Search? Everything! In just the same way that certain users gravitate towards certain movies, certain words commonly co-occur in the same documents. When working with Semantic Search, rather than recommending movies that users would likely enjoy, we are going to identify words that are likely to belong in a given document, whether or not they actually occurred there. The math is exactly the same!
Here’s how the process works:

First, we identify a text field of interest in our documents and extract the associated term-document matrix for external processing. Each element of this term-document matrix indicates the strength of a particular term within a particular document (where strength can be anything, but will likely be either term frequency or TF*IDF).

Next, collaborative filtering is applied to the term-document matrix, which effectively generates a pseudo-term-document matrix. This pseudo-term-document matrix is the same size and shape as the original term-document matrix and references the same terms and documents, but the numbers are slightly different. These new values indicate the strength that a particular term should have in a particular document once noisy data is removed.

Finally, the high-scoring values in the pseudo-term-document matrix are mapped back to the associated terms. These terms are then injected back into Solr in a new field which can be used for Semantic Search.

Demo Time!

So let’s consider an example case. As in plenty of our previous posts, we will be using the Science Fiction Stack Exchange. Why? Because we’re all nerds and with such a familiar topic, we can quickly intuit whether or not a search is returning relevant results. In this data set, the field of interest is the Body field because it contains the contents of all questions and answers.

So, now that we’ve decided upon our demo dataset, we’re ready to run the analysis. If you’d like to follow along, then please take a look at our git repo. This repo contains the example SciFi data set, the Semantic Search code, and a README to get you going. However, I’m going to execute everything from within Python:

    >>> from SemanticAnalyzer import *
    >>> stvc = SolrTermVectorCollector(field='Body',feature='tf',batchSize=1000)
    >>> tdc = TermDocCollection(source=stvc,numTopics=150)

That last line takes a few minutes. If it’s in the AM where you are, grab a coffee. If it’s in the PM, grab a beer.
Once that line completes, we will have successfully extracted the term-document matrix from Solr. Now let’s play with it for a bit. One of the cool side effects of this analysis is the ability to quickly find words that commonly occur together. Let’s give it an easy test; here are the 30 most highly correlated words with the word ‘vader’ (as in Darth Vader).

    >>> tdc.getRelatedTerms('vader',30)

Did you notice that pause when you called the function? That was the collaborative filtering taking place. The results of that process have now been saved, so additional calls will return quite quickly.

    vader luke emperor darth palpatin anakin sith skywalk sidiou apprentic
    empir luca side star son forc turn kill death rule
    suit father question jedi command obi tarkin dark wan plan

Hey, not bad! Everything here seems very reasonably connected with Mr. Vader. You may notice some odd spellings here; that’s because these are the indexed terms, therefore they are stemmed. Let’s try again with a different term; this time everyone’s favorite wizard:

    >>> tdc.getRelatedTerms('potter',30)
    harri potter voldemort wizard snape death magic jame love spell
    time rowl lili eater travel seri hous hand hogwart three
    find wormtail kill slytherin hallow secret deathli muggl order lord

Again, pretty good! One last try, and we’ll make it a little more challenging: a vague adjective.

    >>> tdc.getRelatedTerms('dark',30)
    dark side jedi sith eater lord death mark snape magic
    curs evil forc luke mercuri cave yoda jame palpatin dagobah
    anakin black call wizard slytherin live light siriu matter voldemort

Indeed, most of these terms are like a hall of fame of dark things from Star Wars and Harry Potter. Now since the word correlation has proven itself out, it’s time to generate the pseudo terms and post them back to Solr.

    >>> SolrBlurredTermUpdater(tdc,blurredField="BodyBlurred").pushToSolr(0.1)

This line will probably see you to the end of your coffee or beer (it takes about 10 minutes on my machine).
But once it’s done, you can start issuing searches to Solr.

Solr Results

Here’s an example of Semantic Search using Solr:

    http://localhost:8983/solr/select/?q=-Body:dark +BodyBlurred:dark

The Body field contains the original text while the BodyBlurred field contains the pseudo-terms. So this finds all documents that do not include the term dark, but presumably contain dark content. Take a look at the documents that come back:

    {
      Body: " In the John Carter movie (2012), he shows off some of his powers, like jumping abnormally high, but I have difficulty evaluating his strength. On the one side, he shows great strength, as when he kills a thark warrior with one hand, but he is also quite mistreated by them. He also seems helpless when he is strangled by Tars Tarkas. Why does the strength he shows seem so inconsistent? ",
      BodyBlurred: "tv great movi control kill consid hand dark side power long mutant fight machin light abil sauron wormtail hulk"
    },
    {
      Body: " In the movies, the Nazgul ride black horses with armour. I was wondering if that is all they are, or do they have some sort of magic? Are they evil? ",
      BodyBlurred: "movi black magic dark demon engin hous aveng slytherin"
    },
    {
      Body: " The remaining Black Brother from the prologue of A Game of Thrones is apparently the deserter who is beheaded in the beginning of the book. But how did he manage to get to Winterfell from the other side of The Wall? Or did the show throw me off track and in the book there weren't any survivors, so the deserter is someone else? ",
      BodyBlurred: "book watch black hole dark side plai long game demon engin light turn district"
    },
    {
      Body: " Was this ever discussed in any episode, or as a side-plot somewhere? ",
      BodyBlurred: "episod dark side light"
    }

Not bad: most of those topics are rather … dark. Though check out that last result. So … maybe there are still some improvements we can make!
But you also have to remember that we’re dealing with word correlation here, and I can only guess that somewhere else in the corpus, dark side-plots and dark episodes were surely discussed. Speaking of word correlations, check out this gem:

    {
      Body: " You're correct, Enterprise is the only Star Trek that fits into both the original and the new 2009 movie timelines. From the perspective of the Enterprise characters, both are possible futures, given the over-arcing conceit of the show was a Temporal Cold War, so its future is in flux and could line up with either of the timelines we're familiar with, or with an entirely different future. ",
      BodyBlurred: "answer charact place klingon star trek design travel crew watch work movi happen enterpris featur futur exist origin 2009 chang altern timelin war to version event captain gener pictur tng creat iii galaxi theori return alter voyag entir fry turn kirk paradox biff doc marti feder 1955 starship 2015 class hero centuri tempor uss phoenix mirror river 800 ncc 1701 simon conner skynet alisha"
    }

The original document involves Star Trek and time travel. And appropriately, the pseudo terms include Star Trek things and time-travel terms … but do you see anything funny? That’s right, Biff, Doc and Marti made their way into the pseudo terms, likely because of their role in the popular time-travel film “Back to the Future.” Speaking of the future …

Future Work

Semantic Search with Solr is hot right now. In the upcoming Dublin LuceneRevolution I know of at least three related talks that have been submitted (one of them my own); I have heard that MapR is working on a Solr Semantic Search/Recommendation engine built atop their Hadoop offering; and I suspect that with Cloudera’s recent foray into Solr with Mark Miller, they will also be working on the same thing.

What’s next for our work? Recommendations! Remember, that’s how we started this conversation.
E-commerce recommendations are a simple extension of the work presented above. Given an inventory catalog (e.g., product title, description, etc.), and given a history of user purchases, we can build a search-aware recommendation engine. That is, when a customer searches for a particular item, they will receive results as usual, except that the results will be boosted with items that they are more likely to purchase. How? Because we know what type of customer they are and what products that type of customer is more likely to buy!

Do you have a good case for Solr Semantic Search and Recommendation? We’d love to hear it; please contact us!
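The step from term-document matrix to pseudo-term-document matrix described earlier can be illustrated on the three banana documents. The sketch below is not the code from the repo: it assumes the collaborative-filtering step is a low-rank approximation (which the numTopics parameter suggests but the post does not confirm), it keeps only a single "topic", and it finds that topic with a hand-rolled power iteration so that it runs without dependencies. With NumPy, a truncated `numpy.linalg.svd` would typically do this job. The token lists are hand-stemmed with stop words removed, which is also an assumption.

```python
import math

# Hand-tokenized versions of the three example documents (assumed preprocessing).
docs = [
    ["yellow", "banana", "peel"],
    ["banana", "long", "yellow", "fruit"],
    ["mystery", "fruit", "long", "yellow", "peel"],
]
terms = sorted({term for doc in docs for term in doc})

# Term-document matrix: one row per term, one column per document, values = term frequency.
tdm = [[float(doc.count(term)) for doc in docs] for term in terms]


def rank_one_blur(matrix, iterations=200):
    """Return the best rank-1 approximation of matrix, found by power iteration."""
    n_terms, n_docs = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_docs)] * n_docs  # document-side vector, unit length
    for _ in range(iterations):
        u = [sum(matrix[i][j] * v[j] for j in range(n_docs)) for i in range(n_terms)]
        w = [sum(matrix[i][j] * u[i] for i in range(n_terms)) for j in range(n_docs)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # u = matrix * v already carries the singular value, so u v^T is the approximation.
    u = [sum(matrix[i][j] * v[j] for j in range(n_docs)) for i in range(n_terms)]
    return [[u_i * v_j for v_j in v] for u_i in u]


blurred = rank_one_blur(tdm)
banana = terms.index("banana")
print("original:", tdm[banana][2], "blurred:", round(blurred[banana][2], 3))
```

In the original matrix, "banana" has strength 0 in the third document; in the blurred matrix it picks up a positive score because that document shares "yellow", "peel", "long" and "fruit" with the other two. Thresholding such scores is what would turn them into pseudo-terms for a field like BodyBlurred.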

September 30, 2013

by John Berryman

· 11,675 Views

ElasticSearch: Java API

ElasticSearch provides a Java API that executes all operations asynchronously through a client object.

September 30, 2013

by Hüseyin Akdoğan

CORE

· 137,571 Views · 4 Likes

Parallel SQL in C#

So, I’ve been wanting to get back to playing with C# for a while, and finally have had the opportunity. I’ve also been wanting to play with the Task library in .NET and see if I could get it to do something interesting. Well, below is the result.

The code below, running in a .NET 4 project, will run two SQL SELECT statements against the AdventureWorks2012 database. There are three tasks in here: ParallelTask 1 and 2, and a timing task. Each Parallel task takes a connection string and a query as inputs, and passes out a status message. One of the important points with a task is that it has to be self-contained. This is why the connection is instantiated within the task. I also added a timing task (ParallelTiming) so I could pass out a ping message.

The whole thing is controlled by the code in the main section, which starts the three tasks with their appropriate parameters. After this it waits for the tasks to complete, then passes out the resulting return messages.

Try it out; it’s good fun and all you need is SQL Server, AdventureWorks and something to build C# projects. You can download the code here. Have fun!
    /// Parallel_SQL demonstration code
    /// From Nick Haslam
    /// http://blog.nhaslam.com
    /// 16/9/2013
    using System;
    using System.Collections.Generic;
    using System.Data.SqlClient;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;

    namespace Parallel_SQL
    {
        class Program
        {
            /// <summary>
            /// First Parallel task
            /// </summary>
            /// <param name="sConnString">Connection string details</param>
            /// <param name="sQuery">Query to execute</param>
            /// <param name="StatusMessage">Status message to pass back</param>
            /// <returns></returns>
            static Task<string> ParallelTask1(string sConnString, string sQuery, Action<string> StatusMessage)
            {
                return Task.Factory.StartNew(() =>
                {
                    // The connection is created inside the task so the task is self-contained
                    SqlConnection conn = new SqlConnection(sConnString);
                    conn.Open();
                    StatusMessage("Running Query");
                    SqlDataReader reader = null;
                    SqlCommand sqlCommand = new SqlCommand(sQuery, conn);
                    reader = sqlCommand.ExecuteReader();
                    while (reader.Read())
                    {
                        StatusMessage(reader[0].ToString());
                    }
                    return "Task 1 Complete";
                });
            }

            /// <summary>
            /// Second Parallel task
            /// </summary>
            /// <param name="sConnString">Connection string details</param>
            /// <param name="sQuery">Query to execute</param>
            /// <param name="StatusMessage">Status message to pass back</param>
            /// <returns></returns>
            static Task<string> ParallelTask2(string sConnString, string sQuery, Action<string> StatusMessage)
            {
                return Task.Factory.StartNew(() =>
                {
                    SqlConnection conn = new SqlConnection(sConnString);
                    conn.Open();
                    StatusMessage("Running Query");
                    SqlDataReader reader = null;
                    SqlCommand sqlCommand = new SqlCommand(sQuery, conn);
                    reader = sqlCommand.ExecuteReader();
                    while (reader.Read())
                    {
                        StatusMessage(reader[0].ToString());
                    }
                    return "Task 2 Complete";
                });
            }

            /// <summary>
            /// Timing Task
            /// </summary>
            /// <param name="iMSPause">Milliseconds between ping</param>
            /// <param name="StatusMessage">Status message to pass back</param>
            /// <returns></returns>
            static Task<string> ParallelTiming(int iMSPause, Action<string> StatusMessage)
            {
                return Task.Factory.StartNew(() =>
                {
                    for (int i = 0; i < 10; i++)
                    {
                        System.Threading.Thread.Sleep(iMSPause);
                        StatusMessage("******************** PING ********************");
                    }
                    return "Timing task done";
                });
            }

            static void Main(string[] args)
            {
                string sConnString = "server=.; Trusted_Connection=yes; database=AdventureWorks2012;";
                try
                {
                    var Task1Control = ParallelTask1(sConnString, "SELECT top 500 TransactionID FROM Production.TransactionHistory", (update) =>
                    {
                        Console.WriteLine(String.Format("{0} - {1}", DateTime.Now, update));
                    });
                    var Task2Control = ParallelTask2(sConnString, "SELECT top 500 SalesOrderDetailID FROM sales.SalesOrderDetail", (update) =>
                    {
                        Console.WriteLine(String.Format("{0} - \t\t{1}", DateTime.Now, update));
                    });
                    var TimingTaskControl = ParallelTiming(250, (update) =>
                    {
                        Console.WriteLine(String.Format("{0} - \t\t\t{1}", DateTime.Now, update));
                    });

                    // Await completion of the tasks (Result blocks until each task finishes)
                    Console.WriteLine("Task 1 Status - {0}", Task1Control.Result);
                    Console.WriteLine("Task 2 Status - {0}", Task2Control.Result);
                    Console.WriteLine("Timing Task Status - {0}", TimingTaskControl.Result);
                }
                catch (Exception e)
                {
                    Console.WriteLine(e.ToString());
                }
                Console.ReadKey();
            }
        }
    }

September 29, 2013

by Nick Haslam

· 22,640 Views · 31 Likes