Databases Resources

The Latest Databases Topics

JAX-RS is a framework designed to help you write RESTful applications both on the client and server side. With Java EE 7 is being slated to be released next year, 2013, JAX-RS is one of the specifications getting a deep revision. JAX-RS 2.0 is currently in the Public Draft phase at the JCP, so now is a good time to discuss some of the key new features so that you can start playing with your favorite JAX-RS implementation and give some valuable feedback the expert group needs to finalize the specification. Key features in 2.0: Client API Server-side Asynchronous HTTP Filters and Interceptors This article gives a brief overview of each of these features. Client Framework One huge thing missing from JAX-RS 1.0 was a client API. While it was easy to write a portable JAX-RS service, each JAX-RS implementation defined their own proprietary API. JAX-RS 2.0 fills in this gap with a fluent, low-level, request building API. Here's a simple example: Client client = ClientFactory.newClient(); WebTarget target = client.target("http://example.com/shop"); Form form = new Form().param("customer", "Bill") .param("product", "IPhone 5") .param("CC", "4444 4444 4444 4444"); Response response = target.request().post(Entity.form(form)); assert response.getStatus() == 200; Order order = response.readEntity(Order.class); Let's dissect this example code. The Client interface manages and configures HTTP connections. It is also a factory for WebTargets. WebTargets represent a specific URI. You build and execute requests from a WebTarget instance. Response is the same class defined in JAX-RS 1.0, but it has been expanded to support the client side. The example client, allocates an instance of a Client, creates a WebTarget, then posts form data to the URI represented by the WebTarget. The Response object is tested to see if the status is 200, then the application class Order is extracted from the response using the readEntity() method. The MessageBodyReader and MessageBodyWriter content handler interfaces defined in JAX-RS 1.0 are reused on the client side. When the readEntity() method is invoked in the example code, a MessageBodyReader is matched with the response's Content-Type and the Java type (Order) passed in as a parameter to the readEntity() method. If you are optimistic that your service will return a successful response, there are some nice helper methods that allow you to get the Java object directly without having to interact with and write additional code around a Response object. Customer cust = client.target("http://example.com/customers") .queryParam("name", "Bill Burke") .request().get(Customer.class); In this example we target a URI and specify an additional query parameter we want appended to the request URI. The get() method has an additional parameter of the Java type we want to unmarshal the HTTP response to. If the HTTP response code is something other than 200, OK, JAX-RS picks an exception that maps to the error code from a defined exception hierarchy in the JAX-RS client API. Asynchronous Client API The JAX-RS 2.0 client framework also supports an asynchronous API and a callback API. This allows you to execute HTTP requests in the background and either poll for a response or receive a callback when the request finishes. Future future = client.target("http://e.com/customers") .queryParam("name", "Bill Burke") .request() .async() .get(Customer.class); try { Customer cust = future.get(1, TimeUnit.MINUTES); } catch (TimeoutException ex) { System.err.println("timeout"); } The Future interface is a JDK interface that has been around since JDK 5.0. The above code executes an HTTP request in the background, then blocks for one minute while it waits for a response. You could also use the Future to poll to see if the request is finished or not. Here's an example of using a callback interface. InvocationCallback callback = new InvocationCallback { public void completed(Response res) { System.out.println("Request success!"); } public void failed(ClientException e) { System.out.println("Request failed!");n } }; client.target("http://example.com/customers") .queryParam("name", "Bill Burke") .request() .async() .get(callback); In this example, we instantiate an implementation of the InvocationCallback interface. We invoke a GET request in the background and register this callback instance with the request. The callback interface will output a message on whether the request executed successfully or not. Those are the main features of the client API. I suggest browsing the specification and Javadoc to learn more. Server-Side Asynchronous HTTP On the server-side, JAX-RS 2.0 provides support for asynchronous HTTP. Asynchronous HTTP is generally used to implement long-polling interfaces or server-side push. JAX-RS 2.0 support for Asynchronous HTTP is annotation driven and is very analogous with how the Servlet 3.0 specification handles asynchronous HTTP support through the AsyncContext interface. Here's an example of writing a crude chat program. @Path("/listener") public class ChatListener { List listeners = ...some global list...; @GET public void listen(@Suspended AsyncResponse res) { list.add(res); } } For those of you who have used the Servlet 3.0 asynchronous interfaces, the above code may look familiar to you. An AsyncResponse is injected into the JAX-RS resource method via the @Suspended annotation. This act disassociates the calling thread to the HTTP socket connection. The example code takes the AsyncResponse instance and adds it to a application-defined global List object. When the JAX-RS method returns, the JAX-RS runtime will do no response processing. A different thread will handle response processing. @Path("/speaker") public class ChatSpeaker { List listeners = ...some global list...; @POST @Consumes("text/plain") public void speak(String speech) { for (AsyncResponse res : listeners) { res.resume(Response.ok(speech, "text/plain").build());n } } } When a client posts text to this ChatSpeaker interface, the speak() method loops through the list of registered AsyncResponses and sends back an 200, OK response with the posted text. Those are the main features of the asynchronous HTTP interface, check out the Javadocs for a deeper detail. Filters and Entity Interceptors JAX-RS 2.0 has an interceptor API that allows framework developers to intercept request and response processing. This powerful API allows framework developers to transparently add orthogonal concerns like authentication, caching, and encoding without polluting application code. Prior to JAX-RS 2.0 many JAX-RS providers like Resteasy, Jersey, and Apache CXF wrote their own proprietary interceptor frameworks to deliver various features in their implementations. So, while JAX-RS 2.0 filters and interceptors can be a bit complex to understand please note that it is very use-case driven based on real-world examples. I wrote a blog on JAX-RS interceptor requirements awhile back to help guide the JAX-RS 2.0 JSR Expert Group on defining such an API. The blog is a bit dated, but hopefully you can get the gist of why we did what we did. JAX-RS 2.0 has two different concepts for interceptions: Filters and Entity Interceptors. Filters are mainly used to modify or process incoming and outgoing request headers or response headers. They execute before and after request and response processing. Entity Interceptors are concerned with marshaling and unmarshalling of HTTP message bodies. They wrap around the execution of MessageBodyReader and MessageBodyWriter instances. Server Side Filters On the server-side you have two different types of filters. ContainerRequestFilters run before your JAX-RS resource method is invoked. ContainerResponseFilters run after your JAX-RS resource method is invoked. As an added caveat, ContainerRequestFilters come in two flavors: pre-match and post-matching. Pre-matching ContainerRequestFilters are designated with the @PreMatching annotation and will execute before the JAX-RS resource method is matched with the incoming HTTP request. Pre-matching filters often are used to modify request attributes to change how it matches to a specific resource. For example, some firewalls do not allow PUT and/or DELETE invocations. To circumvent this limitation many applications tunnel the HTTP method through the HTTP header X-Http-Method-Override. A pre-matching ContainerRequestFilter could implement this behavior. @Provider public class HttpOverride implements ContainerRequestFilter { public void filter(ContainerRequestContext ctx) { String method = ctx.getHeaderString("X-Http-Method-Override"); if (method != null) ctx.setMethod(method); } } Post matching ContainerRequestFilters execute after the Java resource method has been matched. These filters can implement a range of features for example, annotation driven security protocols. After the resource class method is executed, JAX-RS will run all ContainerResponseFilters. These filters allow you to modify the outgoing response before it is marshalled and sent to the client. One example here is a filter that automatically sets a Cache-Control header. @Provider public class CacheControlFilter implements ContainerResponseFilter { public void filter(ContainerRequestContext req, ContainerResponseContext res) { if (req.getMethod().equals("GET")) { req.getHeaders().add("Cache-Control", cacheControl); } } } Client Side Filters On the client side you also have two types of filters: ClientRequestFilter and ClientResponseFilter. ClientRequestFilters run before your HTTP request is sent over the wire to the server. ClientResponseFilters run after a response is received from the server, but before the response body is unmarshalled. A good example of client request and response filters working together is a client-side cache that supports conditional GETs. The ClientRequestFilter would be responsible for setting the If-None-Match or If-Modified-Since headers if the requested URI is already cached. Here's what that code might look like. @Provider public class ConditionalGetFilter implements ClientRequestFilter { public void filter(ClientRequestContext req) { if (req.getMethod().equals("GET")) { CacheEntry entry = cache.getEntry(req.getURI()); if (entry != null) { req.getHeaders().putSngle("If-Modified-Since", entry.getLastModified()); } } } } The ClientResponseFilter would be responsible for either buffering and caching the response, or, if a 302, Not Modified response was sent back, to edit the Response object to change its status to 200, set the appropriate headers and buffer to the currently cached entry. This code would be a bit more complicated, so for brevity, we're not going to illustrate it within this article. Reader and Writer Interceptors While filters modify request or response headers, interceptors deal with message bodies. Interceptors are executed in the same call stack as their corresponding reader or writer. ReaderInterceptors wrap around the execution of MessageBodyReaders. WriterInterceptors wrap around the execution of MessageBodyWriters. They can be used to implement a specific content-encoding. They can be used to generate digital signatures or to post or pre-process a Java object model before or after it is marshalled. Here's an example of a GZIP encoding WriterInterceptor. @Provider public class GZIPEndoer implements WriterInterceptor { public void aroundWriteTo(WriterInterceptorContext ctx) throws IOException, WebApplicationException { GZIPOutputStream os = new GZIPOutputStream(ctx.getOutputStream()); try { ctx.setOutputStream(os); return ctx.proceed(); } finally { os.finish(); } } } Resource Method Filters and Interceptors Sometimes you want a filter or interceptor to only run for a specific resource method. You can do this in two different ways: register an implementation of DynamicFeature or use the @NameBinding annotation. The DynamicFeature interface is executed at deployment time for each resource method. You just use the Configurable interface to register the filters and interceptors you want for the specific resource method. @Provider public class ServerCachingFeature implements DynamicFeature { public void configure(ResourceInfo resourceInfo, Configurable configurable) { if (resourceInfo.getMethod().isAnnotationPresent(GET.class)) { configurable.register(ServerCacheFilter.class); } } } On the other hande, @NameBinding works a lot like CDI interceptors. You annotate a custom annotation with @NameBinding and then apply that custom annotation to your filter and resource method @NameBinding public @interface DoIt {} @DoIt public class MyFilter implements ContainerRequestFilter {...} @Path public class MyResource { @GET @DoIt public String get() {...} Wrapping Up Well, those are the main features of JAX-RS 2.0. There's also a bunch of minor features here and there, but youll have to explore them yourselves. If you want to testdrive JAX-RS 2.0 (and hopefully also give feedback to the expert group), Red Hat's Resteasy 3.0 and Oracle's Jersey project have implementations you can download and use. Useful Links Below are some useful links. I've also included links to some features in Resteasy that make use of filters and interceptors. This code might give you a more in-depth look into what you can do with this new JAX-RS 2.0 feature. JAX-RS 2.0 Public Draft Specification Resteasy 3.0 Download Jersey Resteasy 3.0 client cache implementation code (to see how filters interceptors work on client side) Doseta digital signature headers (good use case or interceptors) File suffix content negotiation implementation (server-side filter example) Other server-side examples (cache-control annotations, gzip encoding, role-based security)

November 1, 2012

by Bill Burke

· 94,314 Views · 3 Likes

Exporting and Importing VM Settings with Azure Command-Line Tools

We've talked previously about the Windows Azure command-line tools, and have used them in a few posts such as Brian's Migrating Drupal to a Windows Azure VM. While the tools are generally useful for tons of stuff, one of the things that's been painful to do with the command-line is export the settings for a VM, and then recreate the VM from those settings. You might be wondering why you'd want to export a VM and then recreate it. For me, cost is the first thing that comes to mind. It costs more to keep a VM running than it does to just keep the disk in storage. So if I had something in a VM that I'm only using a few hours a day, I'd delete the VM when I'm not using it and recreate it when I need it again. Another potential reason is that you want to create a copy of the disk so that you can create a duplicate virtual machine. The export process used to be pretty arcane stuff; using the azure vm show command with a --json parameter and piping the output to file. Then hacking the .json file to fix it up so it could be used with the azure vm create-from command. It was bad. It was so bad, the developers added a new export command to create the .json file for you. Here's the basic process: Create a VM VM creation has been covered multiple ways already; you're either going to use the portal or command line tools, and you're either going to select an image from the library or upload a VHD. In my case, I used the following command: azure vm create larryubuntu CANONICAL__Canonical-Ubuntu-12-04-amd64-server-20120528.1.3-en-us-30GB.vhd larry NotaRe This command creates a new VM in the East US data center, enables SSH on port 22 and then stores a disk image for this VM in a blob. You can see the new disk image in blob storage by running: azure vm disk list The results should return something like: info: Executing command vm disk list + Fetching disk images data: Name OS data: ---------------------------------------- ------- data: larryubuntu-larryubuntu-0-20121019170709 Linux info: vm disk list command OK That's the actual disk image that is mounted by the VM. Export and Delete the VM Alright, I've done my work and it's the weekend. I need to export the VM settings so I can recreate it on Monday, then delete the VM so I won't get charged for the next 48 hours of not working. To export the settings for the VM, I use the following command: azure vm export larryubuntu c:\stuff\vminfo.json This tells Windows Azure to find the VM named larryubuntu and export its settings to c:\stuff\vminfo.json. The .json file will contain something like this: { "RoleName":"larryubuntu", "RoleType":"PersistentVMRole", "ConfigurationSets": [ { "ConfigurationSetType":"NetworkConfiguration", "InputEndpoints": [ { "LocalPort":"22", "Name":"ssh", "Port":"22", "Protocol":"tcp", "Vip":"168.62.177.227" } ], "SubnetNames":[] } ], "DataVirtualHardDisks":[], "OSVirtualHardDisk": { "HostCaching":"ReadWrite", "DiskName":"larryubuntu-larryubuntu-0-20121024155441", "OS":"Linux" }, "RoleSize":"Small" } If you're like me, you'll immediately start thinking "Hrmmm, I wonder if I can mess around with things like RoleSize." And yes, you can. If you wanted to bump this up to medium, you'd just change that parameter to medium. If you want to play around more with the various settings, it looks like the schema is maintained at https://github.com/WindowsAzure/azure-sdk-for-node/blob/master/lib/services/serviceManagement/models/roleschema.json. Once I've got the file, I can safely delete the VM by using the following command. azure vm delete larryubuntu It spins a bit and then no more VM. Recreate the VM Ugh, Monday. Time to go back to work, and I need my VM back up and running. So I run the following command: azure vm create-from larryubuntu c:\stuff\vminfo.json --location "East US" It takes only a minute or two to spin up the VM and it's ready for work. That's it - fast, simple, and far easier than the old process of generating the .json settings file. Note that I haven't played around much with the various settings described in the schema for the json file that I linked above. If you find anything useful or interesting that can be accomplished by hacking around with the .json, leave a comment about it.

October 29, 2012

by Larry Franks

· 6,457 Views

Exploring the HTML5 Web Audio: Visualizing Sound

If you've read some of my other articles on this blog you probably know I'm a fan of HTML5. With HTML5 we get all this interesting functionality, directly in the browser, in a way that, eventually, is standard across browsers. One of the new HTML5 APIs that is slowly moving through the standardization process is the Web Audio API. With this API, currently only supported in Chrome, we get access to all kinds of interesting audio components you can use to create, modify and visualize sounds (such as the following spectrogram). So why do I start with visualizations? It looks nice, that's one reason, but not the important one. This API provides a number of more complex components, whose behavior is much easier to explain when you can see what happens. With a filter you can instantly see whether some frequencies are filtered, instead of trying to listen to the resulting audio for thse changes. There are many interesting examples that use this API. The problem is, though, that getting started with this API and with digital signal processing (DSP) usually isn't explained. In this article I'll walk you through a couple of steps that shows how to do the following: Create a signal volume meter Visualize the frequencies using a spectrum analyzer And show a time based spectrogram We start with the basic setup that we can use as the basis for the components we'll create. Setting up the basic If we want to experiment with sound, we need some sound source. We could use the microphone (as we'll do later in this series), but to keep it simple, for now we'll just use an mp3 as our input. To get this working using web audio we have to take the following steps: Load the data Read it in a buffer node and play the sound Load the data With the web audio we can use different types of audio sources. We've got a MediaElementAudioSourceNode that can be used to use the audio provided by a media element. There's also a MediaStreamAudioSourceNode. With this audio source node we can use the microphone as input (see my previous article on sound recognition). Finally there is the AudioBufferSourceNode. With this node we can load the data from an existing audio file (e.g mp3) and use that as input. For this example we'll use this last approach. // create the audio context (chrome only for now) var context = new webkitAudioContext(); var audioBuffer; var sourceNode; // load the sound setupAudioNodes(); loadSound("wagner-short.ogg"); function setupAudioNodes() { // create a buffer source node sourceNode = context.createBufferSource(); // and connect to destination sourceNode.connect(context.destination); } // load the specified sound function loadSound(url) { var request = new XMLHttpRequest(); request.open('GET', url, true); request.responseType = 'arraybuffer'; // When loaded decode the data request.onload = function() { // decode the data context.decodeAudioData(request.response, function(buffer) { // when the audio is decoded play the sound playSound(buffer); }, onError); } request.send(); } function playSound(buffer) { sourceNode.buffer = buffer; sourceNode.noteOn(0); } // log if an error occurs function onError(e) { console.log(e); } In this example you can see a couple of functions. The setupAudioNodes function creates a BufferSource audio node and connects it to the destination. The loadSound function shows how you can load an audio file. The buffer which is passed into the playSound function contains decoded audio that can be used by the web audio API. In this example I use an .ogg file, for a complete overview of the formats supported look at: https://sites.google.com/a/chromium.org/dev/audio-video Play the sound To play this audio file, all we have to do is turn the source node on, this is done in the playSound function: function playSound(buffer) { sourceNode.buffer = buffer; sourceNode.noteOn(0); } You can test this out at the following page: Example 1: Loading and playing a sound with Web Audio API. When you open that page, you'll hear some music. Nothing to spectacular for now, but nevertheless an easy way to load audio that'll use for the rest of this article. The first item on our list was the volume meter. Create a volume meter One of the basic scenario's, and often one of the first steps someone new to this API tries to create, is a simple signal volume meter (or an UV meter). I expected this to be a standard component in this API, where I could just read off the signal strength as a property. But, no such node exists. But not to worry, with the components that are available, it's pretty easy (not straightforward, but easy nevertheless) to get an indication of the signal strength of your audio file. Int this section we'll create the following simple volume meter: As you can see this is a simple volume meter where we measure the signal strength for the left and the right audio channel. This is drawn on the canvas, but you could have also used divs or svg to visualize this. Lets start with a single volume meter, instead of one for each channel. For this we need to do the following: Create an analyzer node: With this node we get realtime information about the data that is processed. This data we use to determine the signal strength Create a javascript node: We use this node as a timer to update the volume meters with new information Connect everything together Analyser node With the analyser node we can perform real-time frequency and time domain analysis. From the specification: a node which is able to provide real-time frequency and time-domain analysis information. The audio stream will be passed un-processed from input to output. I won't go into the mathematical details behind this node, since there are many articles out there that explain how this works (a good one is the chapter on fourier transformation from here). What you should now about this node is that it splits up the signal in frequency buckets and we get the amplitude (the signal strenght) for each set of frequencies (the bucket). The best way to understand this, is to skip a bit ahead in this article and look at the frequency distribution we'll create later on. This image plots the result from the analyser node. The frequencies increase from left to right, and the height of the bar shows the strength of that specific frequency bucket. More on this later on in the article. For now we don't want to see the strength of the separate frequency buckets, but the strength of the total signal. For this we'll just add all the strenghts from each bucket and divide it by the number of buckets. First we need to create an analyzer node // setup a analyzer analyser = context.createAnalyser(); analyser.smoothingTimeConstant = 0.3; analyser.fftSize = 1024; This creates an analyzer node whose result will be used to create the volume meter. We use a smoothingTimeConstant to make the meter less jittery. With this variable we use input from a longer time period to calculate the amplitudes, this results in a more smooth meter. The fftSize determine how many buckets we get containing frequency information. If we have a fftSize of 1024 we get 512 buckets (more info on this in the book on DPS and fourier transformations). When this node receives a stream of data, it analyzes this stream and provides us with information about the frequencies in that signal and their strengths. We now need a timer to update the meter at regular intervals. We could use the standard javascript setInterval function, but since we're looking at the Web Audio API lets use one of its nodes. The JavaScriptNode. The javascript node With the javascriptnode we can process the raw audio data directly from javascript. We can use this to write our own analyzers or complex components. We're not going to do that, though. When creating the javascript node, you can specify the interval at which it is called. We'll use that feature to update the meter at regulat intervals. Creating a javascript node is very easy. // setup a javascript node javascriptNode = context.createJavaScriptNode(2048, 1, 1); This will create a javascriptnode that is called whenever the 2048 frames have been sampled. Since our data is sampled at 44.1k, this function will be called approximately 21 times a second. Now what happens when this function is called: // when the javascript node is called // we use information from the analyzer node // to draw the volume javascriptNode.onaudioprocess = function() { // get the average, bincount is fftsize / 2 var array = new Uint8Array(analyser.frequencyBinCount); analyser.getByteFrequencyData(array); var average = getAverageVolume(array) // clear the current state ctx.clearRect(0, 0, 60, 130); // set the fill style ctx.fillStyle=gradient; // create the meters ctx.fillRect(0,130-average,25,130); } function getAverageVolume(array) { var values = 0; var average; var length = array.length; // get all the frequency amplitudes for (var i = 0; i < length; i++) { values += array[i]; } average = values / length; return average; } In these two functions we calculate the average and draw the meter directly on the canvas (using a gradient so we have nice colors). Now all we have to do is connect the output from the audiosource to the analyser, the analyser to the javasource node (and if we want audio to hear, we also need to connect something to the destionation). Connect everything together Connecting everything together is easy: function setupAudioNodes() { // setup a javascript node javascriptNode = context.createJavaScriptNode(2048, 1, 1); // connect to destination, else it isn't called javascriptNode.connect(context.destination); // setup a analyzer analyser = context.createAnalyser(); analyser.smoothingTimeConstant = 0.3; analyser.fftSize = 1024; // create a buffer source node sourceNode = context.createBufferSource(); // connect the source to the analyser sourceNode.connect(analyser); // we use the javascript node to draw at a specific interval. analyser.connect(javascriptNode); // and connect to destination, if you want audio sourceNode.connect(context.destination); } And that's it. This will draw a single volume meter, for the complete signal. Now what do we do when we want to have a volume meter for each channel. For this we use a ChannelSplitter. Let's dive right into the code to connect everything: function setupAudioNodes() { // setup a javascript node javascriptNode = context.createJavaScriptNode(2048, 1, 1); // connect to destination, else it isn't called javascriptNode.connect(context.destination); // setup a analyzer analyser = context.createAnalyser(); analyser.smoothingTimeConstant = 0.3; analyser.fftSize = 1024; analyser2 = context.createAnalyser(); analyser2.smoothingTimeConstant = 0.0; analyser2.fftSize = 1024; // create a buffer source node sourceNode = context.createBufferSource(); splitter = context.createChannelSplitter(); // connect the source to the analyser and the splitter sourceNode.connect(splitter); // connect one of the outputs from the splitter to // the analyser splitter.connect(analyser,0,0); splitter.connect(analyser2,1,0); // we use the javascript node to draw at a // specific interval. analyser.connect(javascriptNode); // and connect to destination sourceNode.connect(context.destination); } As you can see we don't really change much. We introduce a new node, the splitter node. This node splits the sound into a left and a right channel. These channels can be processed separately. With this layout the following happens: The audiosource creates a signal based on the buffered audio. This signal is sent to the splitter, who splits the signal into a left and right stream. Each of these two streams is processed by their own realtime analyser. From the javascript node, we now get the information from both analysers and plot both meters I've shown step 1 through 3, let's quickly move on the step 4. For this we simply add the following to the onaudioprocess node: javascriptNode.onaudioprocess = function() { // get the average for the first channel var array = new Uint8Array(analyser.frequencyBinCount); analyser.getByteFrequencyData(array); var average = getAverageVolume(array); // get the average for the second channel var array2 = new Uint8Array(analyser2.frequencyBinCount); analyser2.getByteFrequencyData(array2); var average2 = getAverageVolume(array2); // clear the current state ctx.clearRect(0, 0, 60, 130); // set the fill style ctx.fillStyle=gradient; // create the meters ctx.fillRect(0,130-average,25,130); ctx.fillRect(30,130-average2,25,130); } And now we've got two signal meters, one for each channel. Example 2: Visualize the signal strength with a volume meter. Or view the result on youtube: Now lets see how we can get the view of the frequencies I showed earlier. Create a frequency spectrum With all the work we already did in the previous section, creating a frequency spectrum overview is now very easy. We're going to aim for this: We set up the nodes just like we did in the first example: function setupAudioNodes() { // setup a javascript node javascriptNode = context.createJavaScriptNode(2048, 1, 1); // connect to destination, else it isn't called javascriptNode.connect(context.destination); // setup a analyzer analyser = context.createAnalyser(); analyser.smoothingTimeConstant = 0.3; analyser.fftSize = 512; // create a buffer source node sourceNode = context.createBufferSource(); sourceNode.connect(analyser); analyser.connect(javascriptNode); // sourceNode.connect(context.destination); } So this time we don't split the channels and we set the fftSize to 512. This means we get 256 bars that represent our frequency. We now just need to alter the onaudioprocess method and the gradient we use: var gradient = ctx.createLinearGradient(0,0,0,300); gradient.addColorStop(1,'#000000'); gradient.addColorStop(0.75,'#ff0000'); gradient.addColorStop(0.25,'#ffff00'); gradient.addColorStop(0,'#ffffff'); // when the javascript node is called // we use information from the analyzer node // to draw the volume javascriptNode.onaudioprocess = function() { // get the average for the first channel var array = new Uint8Array(analyser.frequencyBinCount); analyser.getByteFrequencyData(array); // clear the current state ctx.clearRect(0, 0, 1000, 325); // set the fill style ctx.fillStyle=gradient; drawSpectrum(array); } function drawSpectrum(array) { for ( var i = 0; i < (array.length); i++ ){ var value = array[i]; ctx.fillRect(i*5,325-value,3,325); } }; In the drawSpectrum function we iterate over the array, and draw a vertical bar based on the value. That's it. For a live example, click on the following link: Example 3: Visualize the frequency spectrum. Or view it on youtube: And then the final one. The spectrogram. Time based spectrogram When you run the previous demo you see the strength of the various frequency buckets in real time. While this is a nice visualization, it doesn't allow you to analyze information over a period of time. If you want to do that you can create a spectrogram. With a spectrogram we plot a single line for each measurement. The y-axis represents the frequency, the x-asis the time and the color of a pixel the strength of that frequency. It can be used to analyze the received audio, and also creates nice looking images. The good thing, is that to output this data we don't have to change much from what we've already got in place. The only function that'll change is the onaudioprocess node and we'll create a slightly different analyser. analyser = context.createAnalyser(); analyser.smoothingTimeConstant = 0; analyser.fftSize = 1024; The enalyser we create here has an fftSize of 1024, this means we get 512 frequency buckets with strengths. So we can draw a spectrogram that has a height of 512 pixels. Also note that the smoothingTimeConstant is set to 0. This means we don't use any of the previous results in the analysis. We want to show the real information, not provide a smooth volume meter or frequency spectrum analysis. The easiest way to draw a spectrogram is by just start drawing the line at the left, and for each new set of frequencies increase the x-coordinate by one. The problem is that this will quickly fill up our canvas, and we'll only be able to see the first half a minute of the audio. To fix this, we need some creative canvas copying. The complete code for drawing the spectrogram is shown here: // create a temp canvas we use for copying and scrolling var tempCanvas = document.createElement("canvas"), tempCtx = tempCanvas.getContext("2d"); tempCanvas.width=800; tempCanvas.height=512; // used for color distribution var hot = new chroma.ColorScale({ colors:['#000000', '#ff0000', '#ffff00', '#ffffff'], positions:[0, .25, .75, 1], mode:'rgb', limits:[0, 300] }); ... // when the javascript node is called // we use information from the analyzer node // to draw the volume javascriptNode.onaudioprocess = function () { // get the average for the first channel var array = new Uint8Array(analyser.frequencyBinCount); analyser.getByteFrequencyData(array); // draw the spectrogram if (sourceNode.playbackState == sourceNode.PLAYING_STATE) { drawSpectrogram(array); } } function drawSpectrogram(array) { // copy the current canvas onto the temp canvas var canvas = document.getElementById("canvas"); tempCtx.drawImage(canvas, 0, 0, 800, 512); // iterate over the elements from the array for (var i = 0; i < array.length; i++) { // draw each pixel with the specific color var value = array[i]; ctx.fillStyle = hot.getColor(value).hex(); // draw the line at the right side of the canvas ctx.fillRect(800 - 1, 512 - i, 1, 1); } // set translate on the canvas ctx.translate(-1, 0); // draw the copied image ctx.drawImage(tempCanvas, 0, 0, 800, 512, 0, 0, 800, 512); // reset the transformation matrix ctx.setTransform(1, 0, 0, 1, 0, 0); } To draw the spectrogram we do the following: We copy what is currently drawn to a hidden canvas Next we draw a line of the current values at the far right of the canvas We set the translate on the canvas to -1 We copy the copied information back to the original canvas (that is now drawn 1 pixel to the left) And reset the transformation matrix See a running example here: Example 4: Create a spectrogram Or view it here: Last thing I'd like to mention regarding the code is the chroma.js library I used for the colors. If you ever need to draw something color or gradient related (e.g maps, strengths, levels) you can easily create color scales with this library. Two final pointers, I know I'll get questions about: Volume could be represented as a magnitude, just didn't want to complicate matters for this. The spectogram doesn't use logarithmic scales. Once again, didn't want to complicate things

October 23, 2012

by Jos Dirksen

· 70,137 Views · 1 Like

PartitionKey and RowKey in Windows Azure Table Storage

For the past few months, I’ve been coaching a “Microsoft Student Partner” (who has a great blog on Kinect for Windows by the way!) on Windows Azure. One of the questions he recently had was around PartitionKey and RowKey in Windows Azure Table Storage. What are these for? Do I have to specify them manually? Let’s explain… Windows Azure storage partitions All Windows Azure storage abstractions (Blob, Table, Queue) are built upon the same stack (whitepaper here). While there’s much more to tell about it, the reason why it scales is because of its partitioning logic. Whenever you store something on Windows Azure storage, it is located on some partition in the system. Partitions are used for scale out in the system. Imagine that there’s only 3 physical machines that are used for storing data in Windows Azure storage: Based on the size and load of a partition, partitions are fanned out across these machines. Whenever a partition gets a high load or grows in size, the Windows Azure storage management can kick in and move a partition to another machine: By doing this, Windows Azure can ensure a high throughput as well as its storage guarantees. If a partition gets busy, it’s moved to a server which can support the higher load. If it gets large, it’s moved to a location where there’s enough disk space available. Partitions are different for every storage mechanism: In blob storage, each blob is in a separate partition. This means that every blob can get the maximal throughput guaranteed by the system. In queues, every queue is a separate partition. In tables, it’s different: you decide how data is co-located in the system. PartitionKey in Table Storage In Table Storage, you have to decide on the PartitionKey yourself. In essence, you are responsible for the throughput you’ll get on your system. If you put every entity in the same partition (by using the same partition key), you’ll be limited to the size of the storage machines for the amount of storage you can use. Plus, you’ll be constraining the maximal throughput as there’s lots of entities in the same partition. Should you set the PartitionKey to the same value for every entity stored? No. You’ll end up with scaling issues at some point. Should you set the PartitionKey to a unique value for every entity stored? No. You can do this and every entity stored will end up in its own partition, but you’ll find that querying your data becomes more difficult. And that’s where our next concept kicks in… RowKey in Table Storage A RowKey in Table Storage is a very simple thing: it’s your “primary key” within a partition. PartitionKey + RowKey form the composite unique identifier for an entity. Within one PartitionKey, you can only have unique RowKeys. If you use multiple partitions, the same RowKey can be reused in every partition. So in essence, a RowKey is just the identifier of an entity within a partition. PartitionKey and RowKey and performance Before building your code, it’s a good idea to think about both properties. Don’t just assign them a guid or a random string as it does matter for performance. The fastest way of querying? Specifying both PartitionKey and RowKey. By doing this, table storage will immediately know which partition to query and can simply do an ID lookup on RowKey within that partition. Less fast but still fast enough will be querying by specifying PartitionKey: table storage will know which partition to query. Less fast: querying on only RowKey. Doing this will give table storage no pointer on which partition to search in, resulting in a query that possibly spans multiple partitions, possibly multiple storage nodes as well. Wihtin a partition, searching on RowKey is still pretty fast as it’s a unique index. Slow: searching on other properties (again, spans multiple partitions and properties). Note that Windows Azure storage may decide to group partitions in so-called "Range partitions" - see http://msdn.microsoft.com/en-us/library/windowsazure/hh508997.aspx. In order to improve query performance, think about your PartitionKey and RowKey upfront, as they are the fast way into your datasets. Deciding on PartitionKey and RowKey Here’s an exercise: say you want to store customers, orders and orderlines. What will you choose as the PartitionKey (PK) / RowKey (RK)? Let’s use three tables: Customer, Order and Orderline. An ideal setup may be this one, depending on how you want to query everything: Customer (PK: sales region, RK: customer id) – it enables fast searches on region and on customer id Order (PK: customer id, RK; order id) – it allows me to quickly fetch all orders for a specific customer (as they are colocated in one partition), it still allows fast querying on a specific order id as well) Orderline (PK: order id, RK: order line id) – allows fast querying on both order id as well as order line id. Of course, depending on the system you are building, the following may be a better setup: Customer (PK: customer id, RK: display name) – it enables fast searches on customer id and display name Order (PK: customer id, RK; order id) – it allows me to quickly fetch all orders for a specific customer (as they are colocated in one partition), it still allows fast querying on a specific order id as well) Orderline (PK: order id, RK: item id) – allows fast querying on both order id as well as the item bought, of course given that one order can only contain one order line for a specific item (PK + RK should be unique) You see? Choose them wisely, depending on your queries. And maybe an important sidenote: don’t be afraid of denormalizing your data and storing data twice in a different format, supporting more query variations. There’s one additional “index” That’s right! People have been asking Microsoft for a secondary index. And it’s already there… The table name itself! Take our customer – order – orderline sample again… Having a Customer table containing all customers may be interesting to search within that data. But having an Orders table containing every order for every customer may not be the ideal solution. Maybe you want to create an order table per customer? Doing that, you can easily query the order id (it’s the table name) and within the order table, you can have more detail in PK and RK. And there's one more: your account name. Split data over multiple storage accounts and you have yet another "partition". Conclusion In conclusion? Choose PartitionKey and RowKey wisely. The more meaningful to your application or business domain, the faster querying will be and the more efficient table storage will work in the long run.

October 19, 2012

by Maarten Balliauw

· 57,766 Views · 10 Likes

Debugging Hibernate Envers - Historical Data

recently in our project we reported a strange bug. in one report where we display historical data provided by hibernate envers , users encountered duplicated records in the dropdown used for filtering. we tried to find the source of this bug, but after spending a few hours looking at the code responsible for this functionality we had to give up and ask for a dump from production database to check what actually is stored in one table. and when we got it and started investigating, it turned out that there is a bug in hibernate envers 3.6 that is a cause of our problems. but luckily after some investigation and invaluable help from adam warski (author of envers) we were able to fix this issue. bug itself let’s consider following scenario: a transaction is started. we insert some audited entities during it and then it is rolled back. the same entitymanager is reused to start another transaction second transaction is committed but when we check audit tables for entities that were created and then rolled back in step one, we will notice that they are still there and were not rolled back as we expected. we were able to reproduce it in a failing test in our project, so the next step was to prepare failing test in envers so we could verify if our fix is working. failing test the simplest test cases already present in envers are located in simple.java class and they look quite straightforward: public class simple extends abstractentitytest { private integer id1; public void configure(ejb3configuration cfg) { cfg.addannotatedclass(inttestentity.class); } @test public void initdata() { entitymanager em = getentitymanager(); em.gettransaction().begin(); inttestentity ite = new inttestentity(10); em.persist(ite); id1 = ite.getid(); em.gettransaction().commit(); em.gettransaction().begin(); ite = em.find(inttestentity.class, id1); ite.setnumber(20); em.gettransaction().commit(); } @test(dependsonmethods = "initdata") public void testrevisionscounts() { assert arrays.aslist(1, 2).equals(getauditreader().getrevisions(inttestentity.class, id1)); } @test(dependsonmethods = "initdata") public void testhistoryofid1() { inttestentity ver1 = new inttestentity(10, id1); inttestentity ver2 = new inttestentity(20, id1); assert getauditreader().find(inttestentity.class, id1, 1).equals(ver1); assert getauditreader().find(inttestentity.class, id1, 2).equals(ver2); } } so preparing my failing test executing scenario described above wasn’t a rocket science: /** * @author tomasz dziurko (tdziurko at gmail dot com) */ public class transactionrollbackbehaviour extends abstractentitytest { public void configure(ejb3configuration cfg) { cfg.addannotatedclass(inttestentity.class); } @test public void testauditrecordsrollback() { // given entitymanager em = getentitymanager(); em.gettransaction().begin(); inttestentity itetorollback = new inttestentity(30); em.persist(itetorollback); integer rollbackediteid = itetorollback.getid(); em.gettransaction().rollback(); // when em.gettransaction().begin(); inttestentity ite2 = new inttestentity(50); em.persist(ite2); integer ite2id = ite2.getid(); em.gettransaction().commit(); // then list revisionsforsavedclass = getauditreader().getrevisions(inttestentity.class, ite2id); assertequals(revisionsforsavedclass.size(), 1, "there should be one revision for inserted entity"); list revisionsforrolledbackclass = getauditreader().getrevisions(inttestentity.class, rollbackediteid); assertequals(revisionsforrolledbackclass.size(), 0, "there should be no revisions for insert that was rolled back"); } } now i could verify that tests are failing on the forked 3.6 branch and check if the fix that we had is making this test green. the fix after writing a failing test in our project, i placed several breakpoints in envers code to understand better what is wrong there. but imagine being thrown in a project developed for a few years by many programmers smarter than you. i felt overwhelmed and had no idea where the fix should be applied and what exactly is not working as expected. luckily in my company we have adam warski on board. he is the initial author of envers and actually he pointed us the solution. the fix itself contains only one check that registers audit processes that will be executed on transaction completion only when such processes iare still in the map for the given transaction. it sounds complicated, but if you look at the class auditprocessmanager in this commit it should be more clear what is happening there. official path besides locating a problem and fixing it, there are some more official steps that must be performed to have fix included in envers. step 1. create jira issue with bug - https://hibernate.onjira.com/browse/hhh-7682 step 2: create local branch envers-bugfix-hhh-7682 of forked hibernate 3.6 step 3: commit and push failing test and fix to your local and remote repository on github step 4: create pull request - https://github.com/hibernate/hibernate-orm/pull/393 step 5: wait for merge and that’s all. now fix is merged into main repository and we have one bug less in the world of open source

October 17, 2012

by Tomasz Dziurko

· 7,822 Views

Implementing Repository Pattern with Entity Framework

When working with Entity Framework - Code First model approach, a developer creates POCO entities for database tables. The benefit of using Code First model is to have POCO entity for each table that can be used as either WCF Data Contracts or you can apply your own custom attributes to handle Security, Logging, etc. and there is no mapping needed as we used to do in Entity Framework (Model First) approach if the application architecture is n-tier based. Considering the Data Access layer, we will implement a repository pattern that encapsulates the persistence logic in a separate class. This class will be responsible to perform database operations. Let’s suppose the application is based on n-tier architecture and having 3 tiers namely Presentation, Business and Data Access. Common library contains all our POCO entities that will be used by all the layers. Presentation Layer: Contains Views, Forms Business Layer: Managers that handle logic functionality Data Access Layer: Contains Repository class that handles CRUD operations Common Library: Contain POCO entities. We will implement an interface named “IRepository” that defines the signature of all the appropriate generic methods needed to perform CRUD operation and then implement the Repository class that defines the actual implementation of each method. We can also instantiate Repository object using Dependency Injection or apply Factory pattern. Code Snippet: IRepository public interface IRepository : IDisposable { /// /// Gets all objects from database /// /// IQueryable All() where T : class; /// /// Gets objects from database by filter. /// /// Specified a filter /// IQueryable Filter(Expression> predicate) where T : class; /// /// Gets objects from database with filting and paging. /// /// /// Specified a filter /// Returns the total records count of the filter. /// Specified the page index. /// Specified the page size /// IQueryable Filter(Expression> filter, out int total, int index = 0, int size = 50) where T : class; /// /// Gets the object(s) is exists in database by specified filter. /// /// Specified the filter expression /// bool Contains(Expression> predicate) where T : class; /// /// Find object by keys. /// /// Specified the search keys. /// T Find(params object[] keys) where T : class; /// /// Find object by specified expression. /// /// /// T Find(Expression> predicate) where T : class; /// /// Create a new object to database. /// /// Specified a new object to create. /// T Create(T t) where T : class; /// /// Delete the object from database. /// /// Specified a existing object to delete. int Delete(T t) where T : class; /// /// Delete objects from database by specified filter expression. /// /// /// int Delete(Expression> predicate) where T : class; /// /// Update object changes and save to database. /// /// Specified the object to save. /// int Update(T t) where T : class; /// /// Select Single Item by specified expression. /// /// /// /// T Single(Expression> expression) where T : class; void SaveChanges(); void ExecuteProcedure(String procedureCommand, params SqlParameter[] sqlParams); } Code Snippet: Repository public class Repository : IRepository { DbContext Context; public Repository() { Context = new DBContext(); } public Repository(DBContext context) { Context = context; } public void CommitChanges() { Context.SaveChanges(); } public T Single(Expression> expression) where T : class { return All().FirstOrDefault(expression); } public IQueryable All() where T : class { return Context.Set().AsQueryable(); } public virtual IQueryable Filter(Expression> predicate) where T : class { return Context.Set().Where(predicate).AsQueryable(); } public virtual IQueryable Filter(Expression> filter, out int total, int index = 0, int size = 50) where T : class { int skipCount = index * size; var _resetSet = filter != null ? Context.Set().Where(filter).AsQueryable() : Context.Set().AsQueryable(); _resetSet = skipCount == 0 ? _resetSet.Take(size) : _resetSet.Skip(skipCount).Take(size); total = _resetSet.Count(); return _resetSet.AsQueryable(); } public virtual T Create(T TObject) where T : class { var newEntry = Context.Set().Add(TObject); Context.SaveChanges(); return newEntry; } public virtual int Delete(T TObject) where T : class { Context.Set().Remove(TObject); return Context.SaveChanges(); } public virtual int Update(T TObject) where T : class { try { var entry = Context.Entry(TObject); Context.Set().Attach(TObject); entry.State = EntityState.Modified; return Context.SaveChanges(); } catch (OptimisticConcurrencyException ex) { throw ex; } } public virtual int Delete(Expression> predicate) where T : class { var objects = Filter(predicate); foreach (var obj in objects) Context.Set().Remove(obj); return Context.SaveChanges(); } public bool Contains(Expression> predicate) where T : class { return Context.Set().Count(predicate) > 0; } public virtual T Find(params object[] keys) where T : class { return (T)Context.Set().Find(keys); } public virtual T Find(Expression> predicate) where T : class { return Context.Set().FirstOrDefault(predicate); } public virtual void ExecuteProcedure(String procedureCommand, params SqlParameter[] sqlParams){ Context.Database.ExecuteSqlCommand(procedureCommand, sqlParams); } public virtual void SaveChanges() { Context.SaveChanges(); } public void Dispose() { if (Context != null) Context.Dispose(); } } The benefit of using Repository pattern is that all the database operations will be managed centrally and in future if you want to change the underlying database connector you can add another Repository class and defines its own implementation or change the existing one.

October 13, 2012

by Ovais Mehboob Ahmed Khan

· 31,809 Views

MongoDB Aggregation Framework Examples in C#

MongoDB version 2.2 was released in late August and the biggest change it brought was the addition of the Aggregation Framework. Previously the aggregations required the usage of map/reduce, which in MongoDB doesn’t perform that well, mainly because of the single-threaded Javascript-based execution. The aggregation framework steps away from the Javascript and is implemented in C++, with an aim to accelerate performance of analytics and reporting up to 80 percent compared to using MapReduce. The aim of this post is to show examples of running the MongoDB Aggregation Framework with the official MongoDB C# drivers. Aggregation Framework and Linq Even though the current version of the MongoDB C# drivers (1.6) supports Linq, the support doesn’t extend to the aggregation framework. It’s highly probable that the Linq-support will be added later on and there’s already some hints about this in the driver’s source code. But at this point the execution of the aggregations requires the usage of the BsonDocument-objects. Aggregation Framework and GUIDs If you use GUIDs in your documents, the aggregation framework doesn’t work. This is because by default the GUIDs are stored in binary format and the aggregations won’t work against documents which contain binary data.. The solution is to store the GUIDs as strings. You can force the C# drivers to make this conversion automatically by configuring the mapping. Given that your C# class has Id-property defined as a GUID, the following code tells the driver to serialize the GUID as a string: BsonClassMap.RegisterClassMap(cm => { cm.AutoMap(); cm.GetMemberMap(c => c.Id) .SetRepresentation( BsonType.String); }); The example data These examples use the following documents: > db.examples.find() { "_id" : "1", "User" : "Tom", "Country" : "Finland", "Count" : 1 } { "_id" : "2", "User" : "Tom", "Country" : "Finland", "Count" : 3 } { "_id" : "3", "User" : "Tom", "Country" : "Finland", "Count" : 2 } { "_id" : "4", "User" : "Mary", "Country" : "Sweden", "Count" : 1 } { "_id" : "5", "User" : "Mary", "Country" : "Sweden", "Count" : 7 } Example 1: Aggregation Framework Basic usage This example shows how the aggregation framework can be executed through C#. We’re not going run any calculations to the data, we’re just going to filter it by the User. To run the aggregations, you can use either the MongoDatabase.RunCommand –method or the helper MongoCollection.Aggregate. We’re going to use the latter: var coll = localDb.GetCollection("examples"); ... coll.Aggregate(pipeline); The hardest part when working with Aggregation Framework through C# is building the pipeline. The pipeline is similar concept to the piping in PowerShell. Each operation in the pipeline will make modifications to the data: the operations can for example filter, group and project the data. In C#, the pipeline is a collection of BsonDocument object. Each document represents one operation. In our first example we need to do only one operation: $match. This operator will filter out the given documents. The following BsonDocument is a pipeline operation which filters out all the documents which don’t have User-field set to “Tom”. var match = new BsonDocument { { "$match", new BsonDocument { {"User", "Tom"} } } }; To execute this operation we add it to an array and pass the array to the MongoCollection.Aggregate-method: var pipeline = new[] { match }; var result = coll.Aggregate(pipeline); The MongoCollection.Aggregate-method returns an AggregateResult-object. It’s ResultDocuments-property (IEnumarable) contains the documents which are the output of the aggregation. To check how many results there were, we can get the Count: var result = coll.Aggregate(pipeline); Console.WriteLine(result.ResultDocuments.Count()); The result documents are BsonDocument-objects. If you have a C#-class which represent the documents, you can cast the results: var matchingExamples = result.ResultDocuments .Select(BsonSerializer.Deserialize) .ToList(); foreach (var example in matchingExamples) { var message = string.Format("{0} - {1}", example.User, example.Count); Console.WriteLine(message); } Another alternative is to use C#’s dynamic type. The following extension method uses JSON.net to convert a BsonDocument into a dynamic: public static class MongoExtensions { public static dynamic ToDynamic(this BsonDocument doc) { var json = doc.ToJson(); dynamic obj = JToken.Parse(json); return obj; } } Here’s a way to convert all the result documents into dynamic objects: var matchingExamples = result.ResultDocuments .Select(x => x.ToDynamic()) .ToList(); Example 2: Multiple filters & comparison operators This example filters the data with the following criteria: User: Tom Count: >= 2 var match = new BsonDocument { { "$match", new BsonDocument { {"User", "Tom"}, {"Count", new BsonDocument { { "$gte", 2 } } } } }; The execution of this operation is identical to the first example: var pipeline = new[] { match }; var result = coll.Aggregate(pipeline); var matchingExamples = result.ResultDocuments .Select(x => x.ToDynamic()) .ToList(); Also the result are as expected: foreach (var example in matchingExamples) { var message = string.Format("{0} - {1}", example.User, example.Count); Console.WriteLine(message); } Example 3: Multiple operations In our first two examples, the pipeline was as simple as possible: It contained only one operation. This example will filter the data with the same exact criteria as the second example, but this time using two $match operations: User: Tom Count: >= 2 var match = new BsonDocument { { "$match", new BsonDocument { {"User", "Tom"} } } }; var match2 = new BsonDocument { { "$match", new BsonDocument { {"Count", new BsonDocument { { "$gte", 2 } } } } }; var pipeline = new[] { match, match2 }; The output stays the same: The first operation “match” takes all the documents from the examples collection and removes every document which doesn’t match the criteria User = Tom. The output of this operation (3 documents) then moves to the second operation “match2” of the pipeline. This operation only sees those 3 documents, not the original collection. The operation filters out these documents based on its criteria and moves the result (2 documents) forward. This is where our pipeline ends and this is also our result. Example 4: Group and sum Thus far we’ve used the aggregation framework to just filter out the data. The true strength of the framework is its ability to run calculations on the documents. This example shows how we can calculate how many documents there are in the collection, grouped by the user. This is done using the $group-operator: var group = new BsonDocument { { "$group", new BsonDocument { { "_id", new BsonDocument { { "MyUser","$User" } } }, { "Count", new BsonDocument { { "$sum", 1 } } } } } }; The grouping key (in our case the User-field) is defined with the _id. The above example states that the grouping key has one field (“MyUser”) and the value for that field comes from the document’s User-field ($User). In the $group operation the other fields are aggregate functions. This example defines the field “Count” and adds 1 to it for every document that matches the group key (_id). var pipeline = new[] { group }; var result = coll.Aggregate(pipeline); var matchingExamples = result.ResultDocuments .Select(x => x.ToDynamic()) .ToList(); foreach (var example in matchingExamples) { var message = string.Format("{0} - {1}", example._id.MyUser, example.Count); Console.WriteLine(message); } Note the format in which the results are outputted: The user’s name is accessed through _id.MyUser-property. Example 5: Group and sum by field This example is similar to example 4. But instead of calculating the amount of documents, we calculate the sum of the Count-fields by the user: var group = new BsonDocument { { "$group", new BsonDocument { { "_id", new BsonDocument { { "MyUser","$User" } } }, { "Count", new BsonDocument { { "$sum", "$Count" } } } } } }; The only change is that instead of adding 1, we add the value from the Count-field (“$Count”). Example 6: Projections This example shows how the $project operator can be used to change the format of the output. The grouping in example 5 works well, but to access the user’s name we currently have to point to the _id.MyUser-property. Let’s change this so that user’s name is available directly through UserName-property: var group = new BsonDocument { { "$group", new BsonDocument { { "_id", new BsonDocument { { "MyUser","$User" } } }, { "Count", new BsonDocument { { "$sum", "$Count" } } } } } }; var project = new BsonDocument { { "$project", new BsonDocument { {"_id", 0}, {"UserName","$_id.MyUser"}, {"Count", 1}, } } }; var pipeline = new[] { group, project }; The code removes the _id –property from the output. It adds the UserName-property, which value is accessed from field _id.MyUser. The projection operations also states that the Count-value should stay as it is. var matchingExamples = result.ResultDocuments .Select(x => x.ToDynamic()) .ToList(); foreach (var example in matchingExamples) { var message = string.Format("{0} - {1}", example.UserName, example.Count); Console.WriteLine(message); } Example 7: Group with multiple fields in the keys For this example we add a new row into our document collection, leaving us with the following: { "_id" : "1", "User" : "Tom", "Country" : "Finland", "Count" : 1 } { "_id" : "2", "User" : "Tom", "Country" : "Finland", "Count" : 3 } { "_id" : "3", "User" : "Tom", "Country" : "Finland", "Count" : 2 } { "_id" : "4", "User" : "Mary", "Country" : "Sweden", "Count" : 1 } { "_id" : "5", "User" : "Mary", "Country" : "Sweden", "Count" : 7 } { "_id" : "6", "User" : "Tom", "Country" : "England", "Count" : 3 } This example shows how you can group the data by using multiple fields in the grouping key: var group = new BsonDocument { { "$group", new BsonDocument { { "_id", new BsonDocument { { "MyUser","$User" }, { "Country","$Country" }, } }, { "Count", new BsonDocument { { "$sum", "$Count" } } } } } }; var project = new BsonDocument { { "$project", new BsonDocument { {"_id", 0}, {"UserName","$_id.MyUser"}, {"Country", "$_id.Country"}, {"Count", 1}, } } }; var pipeline = new[] { group, project }; var result = coll.Aggregate(pipeline); var matchingExamples = result.ResultDocuments .Select(x => x.ToDynamic()) .ToList(); foreach (var example in matchingExamples) { var message = string.Format("{0} - {1} - {2}", example.UserName, example.Country, example.Count); Console.WriteLine(message); } Example 8: Match, group and project This example shows how you can combine many different pipeline operations. The data is first filtered ($match) by User=Tom, then grouped by the Country (“$group”) and finally the output is formatted into a readable format ($project). Match: var match = new BsonDocument { { "$match", new BsonDocument { {"User", "Tom"} } } }; Group: var group = new BsonDocument { { "$group", new BsonDocument { { "_id", new BsonDocument { { "Country","$Country" }, } }, { "Count", new BsonDocument { { "$sum", "$Count" } } } } } }; Project: var project = new BsonDocument { { "$project", new BsonDocument { {"_id", 0}, {"Country", "$_id.Country"}, {"Count", 1}, } } }; Result: var pipeline = new[] { match, group, project }; var result = coll.Aggregate(pipeline); var matchingExamples = result.ResultDocuments .Select(x => x.ToDynamic()) .ToList(); foreach (var example in matchingExamples) { var message = string.Format("{0} - {1}", example.Country, example.Count); Console.WriteLine(message); } More There are many other interesting operators in the MongoDB Aggregation Framework, like $unwind and $sort. The usage of these operators is identical to ones we used above so it should be possible to copy-paste one of the examples and use it as a basis for these other operations. Links MongoDB C# Language Center MongoDB Aggregation Framework Easy to follow blog post about the aggregation framework

October 11, 2012

by Mikael Koskinen

· 47,789 Views · 2 Likes

How to Create and Deploy a Website with Windows Azure

Curator's note: This article originally appeared at WindowsAzure.com. To use this feature and other new Windows Azure capabilities, sign up for the free preview. Just as you can quickly create and deploy a web application created from the gallery, you can also deploy a website created on a workstation with traditional developer tools from Microsoft or other companies. Table of Contents Deployment Options How to: Create a Website Using the Management Portal How to: Create a Website from the Gallery How to: Delete a Website Next Steps Deployment Options Windows Azure supports deploying websites from remote computers using WebDeploy, FTP, GIT or TFS. Many development tools provide integrated support for publication using one or more of these methods and may only require that you provide the necessary credentials, site URL and hostname or URL for your chosen deployment method. Credentials and deployment URLs for all enabled deployment methods are stored in the website's publish profile, a file which can be downloaded in the Windows Azure (Preview) Management Portal from the Quick Start page or the quick glance section of the Dashboard page. If you prefer to deploy your website with a separate client application, high quality open source GIT and FTP clients are available for download on the Internet for this purpose. How to: Create a Website Using the Management Portal Follow these steps to create a website in Windows Azure. Login to the Windows Azure (Preview) Management Portal. Click the Create New icon on the bottom left of the Management Portal. Click the Web Site icon, click the Quick Create icon, enter a value for URL and then click the check mark next to create web site on the bottom right corner of the page. When the website has been created you will see the text Creation of Web Site '[SITENAME]' Completed. Click the name of the website displayed in the list of websites to open the website's Quick Start management page. On the Quick Start page you are provided with options to set up TFS or GIT publishing if you would like to deploy your finished website to Windows Azure using these methods. FTP publishing is set up by default for websites and the FTP Host name is displayed under FTP Hostname on the Quick Start and Dashboard pages. Before publishing with FTP or GIT choose the option to Reset deployment credentials on the Dashboard page. Then specify the new credentials (username and password) to authenticate against the FTP Host or the Git Repository when deploying content to the website. The Configure management page exposes several configurable application settings in the following sections: Framework: Set the version of .NET framework or PHP required by your web application. Diagnostics: Set logging options for gathering diagnostic information for your website in this section. App Settings: Specify name/value pairs that will be loaded by your web application on start up. For .NET sites, these settings will be injected into your .NET configuration AppSettings at runtime, overriding existing settings. For PHP and Node sites these settings will be available as environment variables at runtime. Connection Strings: View connection strings for linked resources. For .NET sites, these connection strings will be injected into your .NET configuration connectionStrings settings at runtime, overriding existing entries where the key equals the linked database name. For PHP and Node sites these settings will be available as environment variables at runtime. Default Documents: Add your web application's default document to this list if it is not already in the list. If your web application contains more than one of the files in the list then make sure your website's default document appears at the top of the list. How to: Create a Website from the Gallery The gallery makes available a wide range of popular web applications developed by Microsoft, third party companies, and open source software initiatives. Web applications created from the gallery do not require installation of any software other than the browser used to connect to the Windows Azure Management Portal. In this tutorial, you'll learn: How to create a new site through the gallery. How to deploy the site through the Windows Azure Portal. You'll build a Word press blog that uses a default template. The following illustration shows the completed application: Note To complete this tutorial, you need a Windows Azure account that has the Windows Azure Web Sites feature enabled. You can create a free trial account and enable preview features in just a couple of minutes. For details, see Create a Windows Azure account and enable preview features. Create a web site in the portal Login to the Windows Azure Management Portal. Click the New icon on the bottom left of the dashboard. Click the Web Site icon, and click From Gallery. Locate and click the WordPress icon in list, and then click Next. On the Configure Your App page, enter or select values for all fields: Enter a URL name of your choice Leave Create a new MySQL database selected in the Database field Select the region closest to you Then click Next. On the Create New Database page, you can specify a name for your new MySQL database or use the default name. Select the region closest to you as the hosting location. Select the box at the bottom of the screen to agree to ClearDB's usage terms for your hosted MySQL database. Then click the check to complete the site creation. After you click Complete Windows Azure will initiate build and deploy operations. While the web site is being built and deployed the status of these operations is displayed at the bottom of the Web Sites page. After all operations are performed, A final status message when the site has been successfully deployed. Launch and manage your WordPress site Click on your new site from the Web Sites page to open the dashboard for the site. On the Dashboard management page, scroll down and click the link on the left under Site Url to open the site’s welcome page. Enter appropriate configuration information required by WordPress and click Install WordPress to finalize configuration and open the web site’s login page. Login to the new WordPress web site by entering the username and password that you specified on the Welcome page. You'll have a new WordPress site that looks similar to the site below. How to: Delete a Website Websites are deleted using the Delete icon in the Windows Azure Management Portal. The Delete icon is available in the Windows Azure Portal when you click Web Sites to list all of your websites and at the bottom of each of the website management pages. Next Steps For more information about Websites, see the following: Walkthrough: Troubleshooting a Website on Windows Azure

October 9, 2012

by Eric Gregory

· 85,359 Views

SQL Query Optimization and Normalization

Explore SQL query optimization and normalization.

October 4, 2012

by Michael Georgiou

· 37,812 Views · 2 Likes

Difference Between Mysql Replace and Insert on Duplicate Key Update

While me and my friend roshan recently working as a support developers at Australia famous e-commerce website. recently roshan as assign a new bug in this site it’s related to the product synchronize process in the ware house product table and the e-commerce site, his main task was check the quickly the site product table and check with ware house product table product if the either insert new data into a site database, or update an existing record on the site database, Of course, doing a lookup to see if the record exists already and then either updating or inserting would be an expensive process (existing items are defined either by a unique key or a primary key). Luckily, MySQL offers two functions to combat this (each with two very different approaches). 1. REPLACE = DELETE+INSERT 2. INSERT ON DUPLICATE KEY UPDATE = UPDATE + INSERT 1 . REPLACE This syntax is the same as the INSERT function. When dealing with a record with a unique or primary key, REPLACE will either do a DELETE and then an INSERT, or just an INSERT if use this this function will cause a record to be removed, and inserted at the end. It will cause the indexing to get broken apart, decreasing the efficiency of the table. If, however REPLACE INTO ds_product SET pID = 3112, catID = 231, uniCost = 232.50, salePrice = 250.23; 2. ON DUPLICATE KEY UPDATE ON DUPLICATE KEY UPDATE clause to the INSERT function. This one actively hunts down an existing record in the table which has the same UNIQUE or PRIMARY KEY as the one we’re trying to update. If it finds an existing one, you specify a clause for which column(s) you would like to UPDATE. Otherwise, it will do a normal INSERT. INSERT INTO ds_product SET pID = 3112, catID = 231, uniCost = 232.50, salePrice = 250.23, ON DUPLICATE KEY UPDATE uniCost = 232.50, salePrice = 250.23; This should be helpful when trying to create database queries that add and update information, without having to go through the extra step. Thanks Have a Nice Day

October 3, 2012

by Prathap Givantha Kalansuriya

· 14,416 Views

Customizing Spring Data JPA Repository

Spring Data is a very convenient library. However, as the project as quite new, it is not well featured. By default, Spring Data JPA will provide implementation of the DAO based on SimpleJpaRepository. In recent project, I have developed a customize repository base class so that I could add more features on it. You could add vendor specific features to this repository base class as you like. Configuration You have to add the following configuration to you spring beans configuration file. You have to specified a new repository factory class. We will develop the class later. extends SimpleJpaRepository implements GenericRepository , Serializable{ private static final long serialVersionUID = 1L; static Logger logger = Logger.getLogger(GenericRepositoryImpl.class); private final JpaEntityInformation entityInformation; private final EntityManager em; private final DefaultPersistenceProvider provider; private Class springDataRepositoryInterface; public Class getSpringDataRepositoryInterface() { return springDataRepositoryInterface; } public void setSpringDataRepositoryInterface( Class springDataRepositoryInterface) { this.springDataRepositoryInterface = springDataRepositoryInterface; } /** * Creates a new {@link SimpleJpaRepository} to manage objects of the given * {@link JpaEntityInformation}. * * @param entityInformation * @param entityManager */ public GenericRepositoryImpl (JpaEntityInformation entityInformation, EntityManager entityManager , Class springDataRepositoryInterface) { super(entityInformation, entityManager); this.entityInformation = entityInformation; this.em = entityManager; this.provider = DefaultPersistenceProvider.fromEntityManager(entityManager); this.springDataRepositoryInterface = springDataRepositoryInterface; } /** * Creates a new {@link SimpleJpaRepository} to manage objects of the given * domain type. * * @param domainClass * @param em */ public GenericRepositoryImpl(Class domainClass, EntityManager em) { this(JpaEntityInformationSupport.getMetadata(domainClass, em), em, null); } public S save(S entity) { if (this.entityInformation.isNew(entity)) { this.em.persist(entity); flush(); return entity; } entity = this.em.merge(entity); flush(); return entity; } public T saveWithoutFlush(T entity) { return super.save(entity); } public List saveWithoutFlush(Iterable entities) { List result = new ArrayList(); if (entities == null) { return result; } for (T entity : entities) { result.add(saveWithoutFlush(entity)); } return result; } } As a simple example here, I just override the default save method of the SimpleJPARepository. The default behaviour of the save method will not flush after persist. I modified to make it flush after persist. On the other hand, I add another method called saveWithoutFlush() to allow developer to call save the entity without flush. Define Custom repository factory bean The last step is to create a factory bean class and factory class to produce repository based on your customized base repository class. public class DefaultRepositoryFactoryBean , S, ID extends Serializable> extends JpaRepositoryFactoryBean { /** * Returns a {@link RepositoryFactorySupport}. * * @param entityManager * @return */ protected RepositoryFactorySupport createRepositoryFactory( EntityManager entityManager) { return new DefaultRepositoryFactory(entityManager); } } /** * * The purpose of this class is to override the default behaviour of the spring JpaRepositoryFactory class. * It will produce a GenericRepositoryImpl object instead of SimpleJpaRepository. * */ public class DefaultRepositoryFactory extends JpaRepositoryFactory{ private final EntityManager entityManager; private final QueryExtractor extractor; public DefaultRepositoryFactory(EntityManager entityManager) { super(entityManager); Assert.notNull(entityManager); this.entityManager = entityManager; this.extractor = DefaultPersistenceProvider.fromEntityManager(entityManager); } @SuppressWarnings({ "unchecked", "rawtypes" }) protected JpaRepository getTargetRepository( RepositoryMetadata metadata, EntityManager entityManager) { Class repositoryInterface = metadata.getRepositoryInterface(); JpaEntityInformation entityInformation = getEntityInformation(metadata.getDomainType()); if (isQueryDslExecutor(repositoryInterface)) { return new QueryDslJpaRepository(entityInformation, entityManager); } else { return new GenericRepositoryImpl(entityInformation, entityManager, repositoryInterface); //custom implementation } } @Override protected Class getRepositoryBaseClass(RepositoryMetadata metadata) { if (isQueryDslExecutor(metadata.getRepositoryInterface())) { return QueryDslJpaRepository.class; } else { return GenericRepositoryImpl.class; } } /** * Returns whether the given repository interface requires a QueryDsl * specific implementation to be chosen. * * @param repositoryInterface * @return */ private boolean isQueryDslExecutor(Class repositoryInterface) { return QUERY_DSL_PRESENT && QueryDslPredicateExecutor.class .isAssignableFrom(repositoryInterface); } } Conclusion You could now add more features to base repository class. In your program, you could now create your own repository interface extending GenericRepository instead of JpaRepository. public interface MyRepository extends GenericRepository { void someCustomMethod(ID id); } In next post, I will show you how to add hibernate filter features to this GenericRepository.

September 27, 2012

by Boris Lam

· 98,155 Views · 4 Likes

Introducing the New Date and Time API for JDK 8

Date and time handling in Java is a somewhat tricky part when you are new to the language. Time can be accessed via the static method System.currentTimeMillis() which returns the current time in milliseconds from January 1st 1970. If you prefer to work with Objects instead you can use java.util.Date, a class whose methods are mostly deprecated in recent versions of Java. To work with time offsets, say add one month to a date, there is java.util.GregorianCalendar. All in all, those methods described here are not very convenient to work with. Java 7 and below are lacking a good date and time API. The Joda Time library is a common drop-in if you need to work with date/time. With JSR 310 (Java Specification Request) this is about to change. JSR 310 adds a new date, time and calendar API to Java 8. The ThreeTen project provides a reference implementation to this new API and can already be utilized in current Java projects (I however recommend not to do this for production). As the README states: The API is currently considered usable and accurate, yet incomplete and subject to change. If you use this API you must be able to handle incompatible changes in later versions. Building ThreeTen Building the ThreeTen project is relatively easy. It requires both Git and Ant to be installed on your system. git clone git://github.com/ThreeTen/threeten.git cd threeten ant This will first fetch the most recent version of ThreeTen and then start the build process using ant. Note that building the library also requires either OpenJDK 1.6 or Oracle JDK 1.6. JSR 310 The new API specifies a number of new classes which are divided into the categories of continuous and human time. Continuous time is based on Unix time and is represented as a single incrementing number. Class Description Instant A point in time in nanoseconds from January 1st 1970 Duration An amount of time measured in nanoseconds Human time is based on fields that we use in our daily lifes such as day, hour, minute and second. It is represented by a group of classes, some of which we will discuss in this article. Class Description LocalDate a date, without time of day, offset or zone LocalTime the time of day, without date, offset or zone LocalDateTime the date and time, without offset or zone OffsetDate a date with an offset such as +02:00, without time of day or zone OffsetTime the time of day with an offset such as +02:00, without date or zone OffsetDateTime the date and time with an offset such as +02:00, without a zone ZonedDateTime the date and time with a time zone and offset YearMonth a year and month MonthDay month and day Year/MonthOfDay/DayOfWeek/... classes for the important fields DateTimeFields stores a map of field-value pairs which may be invalid Calendrical access to the low-level API Period a descriptive amount of time, such as "2 months and 3 days" In addition to the above classes three support classes have been implemented. The Clock class wraps the current time and date, ZoneOffset is a time offset from UTC and ZoneId defines a time zone such as 'Australia/Brisbane'. Using the API Getting the current time The current time is represented by the Clock class. The class is abstract, so you can not create instances of it. The systemUTC() static method will return the current time based on your system clock and set to UTC. import javax.time.Clock; Clock clock = Clock.systemUTC(); To use the default time zone on your system there also is systemDefaultZone(). Clock clock = Clock.systemDefaultZone(); The millis() method can then be used to access the current time in milliseconds from January 1st, 1970. This shows, that the Clock class and all subclasses are wrapped around System.currentTimeMillis(). Clock clock = Clock.systemDefaultZone(); long time = clock.millis(); Working with time zones To work with time zones you need to import the ZoneId class. The class provides a method to get the default system time zone: import javax.time.ZoneId; import javax.time.Clock; ZoneId zone = ZoneId.systemDefault(); Clock clock = Clock.system(zone); As seen above, the ZoneId can then be used to get an instance of a Clock with that time zone. Other time zones can be accessed by their name, e.g.: ZoneId zone = ZoneId.of("Europe/Berlin"); Clock clock = Clock.system(zone); Getting human date and time Working with a time represented in a single long variable is not what we wanted. We want to work with objects that represent human readable time. The LocalDate, LocalTime and LocalDateTime classes do just that. import javax.time.LocalDate; // The now() method returns the current DateTime LocalDate date = LocalDate.now(); System.out.printf("%s-%s-%s", date.getYear(), date.getMonthValue(), date.getDayOfMonth() ); Using LocalDate to print the current date Doing calculations with times and dates One of the most important functionalities of JSR-310 is that you can do calculations with dates and times. The API makes it very easy to do that. import javax.time.LocalTime; import javax.time.Period; import static javax.time.calendrical.LocalPeriodUnit.HOURS; Period p = Period.of(5, HOURS); LocalTime time = LocalTime.now(); LocalTime newTime; newTime = time.plus(5, HOURS); // or newTime = time.plusHours(5); // or newTime = time.plus(p); Three ways of adding 5 hours to the current time Each class that represents human time implements the AdjustableDateTime interface. The interface requires the plus and the minus method that take a value and a PeriodUnit as argument. Conclusion This article gave a (very) brief introduction into the new date and time API that will ship with Java 8. The API seems to be very consistent and well thought through and provides many ways to interact with dates and times. Upon release of Java 8 the API will be moved from the javax.time package over to java.time, so there will be no conflict if you start using the current implementation.

September 25, 2012

by Fabian Becker

· 78,603 Views

Nested Data Structures, and non-1NF design in PostgreSQL

This has been adapted from an ongoing series currently running on my blog. It has been adapted to be more self-contained, and rely less on other blog entries. For more see http://ledgersmbdev.blogspot.com PostgreSQL provides a very advanced set of tools for doing data modelling in ways which drift back and forth across a relational and non-relational divide. While it is generally a good idea to make the database relational first, and add objects later, the principles of object-relational database design allow you to do a lot more with PostgreSQL than you can on many other database platforms. This article will discuss the use of non-first-normal-form designs, in particular the storage of arrays of tuples in columns to simulate a nested table. The possible uses and problems of such a design will be discussed in detail. One of the promises of object-relational modelling is the ability to address information modelling on complex and nested data structures. Nested data structures bring considerable richness to the database, which is lost in a pure, flat, relational model. Nested data structures can be used to model tuple constraints in ways that are impossible to do when looking at flat data structures, at least as long as those constraints are limited to the information in a single tuple. At the same time there are cases where they simplify things and cases where they complicate things. This is true both in the case of using these for storage and for interfacing with stored procedures. PostgreSQL allows for nested tuples to be stored in a database, and for arrays of tuples. Other ORDBMS's allow something similar (Informix, DB2, and Oracle all support nested tables). Nested tables in PostgreSQL provide a number of gotchas, and additionally exposing the data in them to relational queries takes some extra work. In this post we will look at modelling general ledger transactions using a nested table approach, and both the benefits and limitations of this approach. In general this trades one set of problems for another and it is important to recognize the problems going in. The storage example came out of a brainstorming session I had with Marc Balmer of Micro Systems, though it is worth noting that this is not the solution they use in their products, nor is it the approach currently used by LedgerSMB. Basic Table Structure: The basic data schema will end up looking like this: CREATE TABLE journal_type ( id serial not null unique, label text primary key ); CREATE TABLE account ( id serial not null unique, control_code text primary key, -- account number description text ); CREATE TYPE journal_line_type AS ( account_id int, amount numeric ); CREATE TABLE journal_entry ( id serial not null unique, journal_type int references journal_type(id), source_document_id text,-- for example invoice number date_posted date not null, description text, line_items journal_line_type[], PRIMARY KEY (journal_type, source_document_id) ); This schema has a number of obvious gotchas and cannot, by itself, guarantee the sorts of things we want to do. However, using object-relational modelling we can fix these in ways that cannot do in a purely relational schema. The main problems are: First, since this is a double entry model, we need a constraint that says that the sum of the amounts of the lines must always equal zero. However, if we just add a sum() aggregate, we will end up with it summing every record in the db every time we do an insert, which is not what we want. We also want to make sure that no account_id's are null and no amounts are null. Additionally it is not possible in the schema above to easily expose the journal line information to purely relational tools. However we can use a VIEW to do this, though this produces yet more problems. Finally referential integrity enforcement between the account lines and accounts cannot be done declaratively. We will have to create TRIGGERs to enforce this manually. These problems are traded off against the fact that the relational model does not allow for the first problem to be solved at all so we trade off the fact that we have some solutions which are a bit of a pain for the fact that we have some solutions at all. Nested Table Constraints If we simply had a tuple as a column, we could look inside the tuple with check constraints. Something like check((column).subcolumn is not null). However in this case we cannot do that because we need to aggregate on a set of tuples attached to the row. To do this instead we create a set of table methods for managing the constraints: CREATE OR REPLACE FUNCTION is_balanced(journal_entry) RETURNS BOOL LANGUAGE SQL AS $$ SELECT sum(amount) = 0 FROM unnest($1.line_items); $$; CREATE OR REPLACE FUNCTION has_no_null_account_ids(journal_entry) RETURNS BOOL LANGUAGE SQL AS $$ SELECT bool_and(account_id is not null) FROM unnest($1.line_items); $$; CREATE OR REPLACE FUNCTION has_no_null_amounts(journal_entry) RETURNS BOOL LANGUAGE SQL AS $$ select bool_and(amount is not null) from unnest($1.line_items); $$; We can then create our constraints. Note that because we have to create the methods first, we have to add our constraints after the functions are defined, and these are added after the table is constructed. I have gone ahead and given these friendly names so that errors are easier for people (and machines) to process and handle. ALTER TABLE journal_entry ADD CONSTRAINT is_balanced CHECK ((journal_entry).is_balanced); ALTER TABLE journal_entry ADD CONSTRAINT has_no_null_account_ids CHECK ((journal_entry).has_no_null_account_ids); ALTER TABLE journal_entry ADD CONSTRAINT has_no_null_amounts CHECK ((journal_entry).has_no_null_amounts); Now we have integrity constraints reaching into our nested data. So let's test this out. insert into journal_type (label) values ('General'); We will re-use the account data from the previous post: or_examples=# select * from account; id | control_code | description ----+--------------+------------- 1 | 1500 | Inventory 2 | 4500 | Sales 3 | 5500 | Purchase (3 rows) Let's try inserting a few meaningless transactions, some of which violate our constraints: insert into journal_entry (journal_type, source_document_id, date_posted, description, line_items) values (1, 'ref-10001', now()::date, 'This is a test', ARRAY[row(1, 100)::journal_line_type]); ERROR: new row for relation "journal_entry" violates check constraint "is_balanced" So far so good. insert into journal_entry (journal_type, source_document_id, date_posted, description, line_items) values (1, 'ref-10001', now()::date, 'This is a test', ARRAY[row(1, 100)::journal_line_type, row(null, -100)::journal_line_type]); ERROR: new row for relation "journal_entry" violates check constraint "has_no_null_account_ids" Still good. insert into journal_entry (journal_type, source_document_id, date_posted, description, line_items) values (1, 'ref-10001', now()::date, 'This is a test', ARRAY[row(1, 100)::journal_line_type, row(2, -100)::journal_line_type, row(3, NULL)::journal_line_type]) ERROR: new row for relation "journal_entry" violates check constraint "has_no_null_amounts" Great. All constraints working properly. Let's try inserting a valid row: insert into journal_entry (journal_type, source_document_id, date_posted, description, line_items) values (1, 'ref-10001', now()::date, 'This is a test', ARRAY[row(1, 100)::journal_line_type, row(2, -100)::journal_line_type]); And it works! or_examples=# select * from journal_entry; id | journal_type | source_document_id | date_posted | description | li ne_items ----+--------------+--------------------+-------------+----------------+------------------------ 5 | 1 | ref-10001 | 2012-08-23 | This is a test | {"(1,100)","(2,-100)"} (1 row) Break-Out Views A second major problem that we will be facing with this schema is that if someone wants to create a report using a reporting tool that only really supports relational data very well, then the financial data will be opaque and not available. This scenario is one of the reasons why I think it is important generally to push the relational model to its breaking point before looking at object-relational functions. Consequently I think when doing nested tables it is important to ensure that the data in them is available through a relational interface, in this case, a view. In this case, we may want to model debits and credits in a way which is re-usable, so we will start by creating two type methods: CREATE OR REPLACE FUNCTION debits(journal_line_type) RETURNS NUMERIC LANGUAGE SQL AS $$ SELECT CASE WHEN $1.amount < 0 THEN $1.amount * -1 ELSE NULL END $$; CREATE OR REPLACE FUNCTION credits(journal_line_type) RETURNS NUMERIC LANGUAGE SQL AS $$ SELECT CASE WHEN $1.amount > 0 THEN $1.amount ELSE NULL END $$; Now we can use these as virtual columns anywhere a journal_line_type is used. The view definition itself is rather convoluted and this may impact performance. I am waiting for the LATERAL construct to become available which will make this easier. CREATE VIEW journal_line_items AS SELECT id AS journal_entry_id, (li).*, (li).debits, (li).credits FROM (SELECT je.*, unnest(line_items) li FROM journal_entry je) j; Remember li.debits and li.credits gets turned by the parser into debits(li) and credits(li), allowing for class.method notation here. Testing this out: SELECT * FROM journal_line_items; gives us journal_entry_id | account_id | amount | debits | credits ------------------+------------+--------+--------+--------- 5 | 1 | 100 | | 100 5 | 2 | -100 | 100 | 6 | 1 | 200 | | 200 6 | 3 | -200 | 200 | As you can see, this works. Now people with purely relational tools can access the information in the nested table. In general it is almost always worth creating break-out views of this sort where nested data is stored. However it is important to note that with larger data sets this is insufficient because indexing considerations makes it hard to look up specific information on a row level. This may or may not be the end of the world depending on data set size. Referential Integrity Controls The final problem is that relational integrity is not a well defined concept for nested data. For this reason, if we value relational integrity and foreign keys are involved, we must find ways of enforcing these. The simplest solution is a trigger which runs on insert, update, or delete, and manages another relation which can be used as a proxy for relational integrity checks. For example, we could: CREATE TABLE je_account ( je_id int references journal_entry (id), account_id int references account(id), primary key (je_id, account_id) ); This will be a very narrow table and so should be quick to search. It may also be useful in determining which accounts to look at for transactions if we need to do that. This table could then be used to optimize queries. To maintain the table we need to recognize that never ever will a journal entry's line items be updated or deleted. This is due to the need to maintain clear audit controls and trails. We may add other flags to the table to indicate transactions but we can handle insert, update, and delete conditions with a trigger, namely: CREATE FUNCTION je_ri_management() RETURNS TRIGGER LANGUAGE PLPGSQL AS $$ DECLARE accounts int[]; BEGIN IF TG_OP ILIKE 'INSERT' THEN INSERT INTO je_account (je_id, account_id) SELECT NEW.id, account_id FROM unnest(NEW.line_items) GROUP BY account_id; RETURN NEW; ELSIF TG_OP ILIKE 'UPDATE' THEN IF NEW.line_items <> OLD.line_items THEN RAISE EXCEPTION 'Cannot journal entry line items!'; ELSE RETURN NEW; END IF; ELSIF TG_OP ILIKE 'DELETE' THEN RAISE EXCEPTION 'Cannot delete journal entries!'; ELSE RAISE EXCEPTION 'Invalid TG_OP in trigger'; END IF; END; $$; Then we add the trigger with: CREATE TRIGGER je_breakout_for_ri AFTER INSERT OR UPDATE OR DELETE ON journal_entry FOR EACH ROW EXECUTE PROCEDURE je_ri_management(); The final invalid TG_OP could be omitted but this is not a bad check to have. Let's try this out: insert into journal_entry (journal_type, source_document_id, date_posted, description, line_items) values (1, 'ref-10003', now()::date, 'This is a test', ARRAY[row(1, 200)::journal_line_type, row(3, -200)::journal_line_type]); or_examples=# select * from je_account; je_id | account_id -------+------------ 10 | 3 10 | 1 (2 rows) In this way referential integrity can be enforced. Solution 2.0: Refactoring the above to eliminate the view. The above solution will work great for small businesses but for larger businesses, querying this data will become slow for certain kinds of reports. Storage here is tied to a specific criteria, and indexing is somewhat problematic. There are ways we can address this, but they are not always optimal. At the same time our work is simplified because the actual accounting details are append-only. One solution to this is to refactor the above solution. Instead of: Main table Relational view Materialized view for referential integrity checking we can have: Main table, with tweaked storage for line items Materialized view for RI checking and relational access Unfortunately this sort of refactoring after the fact isn't simple. Typically you want to convert the journal_line_type type to a journal_line_type table, and inherit this in your materialized view table. You cannot simply drop and recreate since the column you are storing the data in is dependent on the structure. The solution is to rename the type, create a new one in its place. This must be done manually and there is no current capability to copy a composite type's structure into a table. You will then need to create a cast and a cast function. Then, when you can afford the downtime, you will want to convert the table to the new type. It is quite possible that the downtime will be delayed and you will have an extended time period where you are half-way through migrating the structure of your database. You can, however, decide to create a cast between the table and the type, perhaps an implicit one (though this is not inherited) and use this to centralize your logic. Unfortunately this leads to duplication-related complexity and in an ideal world would be avoided. However, assuming that the downtime ends up being tolerable, the resulting structures will end up such that they can be more readily optimized for a variety of workloads. In this regard you would have a main table, most likely with line_items moved to extended storage, whose function is to model journal entries as journal entries and apply relevant constraints, and a second table which models journal entry lines as independent lines. This also simplifies some of the constraint issues on the first table, and makes the modelling easier because we only have to look into the nested storage where we are looking at subset constraints. This section then provides a warning regarding the use of advanced ORDBMS functionality, namely that it is easy to get tunnel vision and create problems for the future. The complexity cost here is so high, that the primary model should generally remain relational, with things like nested storage primarily used to create constraints that cannot be effectively modelled otherwise. However, this becomes a great deal more complicated where values may be update or deleted. Here, however, we have a relatively simple case regarding data writes combined with complex constraints that cannot be effectively expressed in normalized, relational SQL. Therefore the standard maintenance concerns that counsel against duplicating information may give way to the fact that such duplication allows for richer constraints. Now, if we had been aware of the problems going in we would have chosen this structure all along. Our design would have been: CREATE TYPE journal_line AS ( entry_id bigserial primary key, --only possible key je_id int not null, account_id int, amount numeric ); After creating the journal entry table we'd: ALTER TABLE journal_line ADD FOREIGN KEY (je_id) REFERENCES journal_entry(id); If we have to handle purging old data we can make that key ON DELETE CASCADE. And the lines would have been of this type instead. We can then get rid of all constraints and their supporting functions other than the is_balanced one. Our debit and credit functions then also reference this type. Our trigger then looks like: CREATE FUNCTION je_ri_management() RETURNS TRIGGER LANGUAGE PLPGSQL AS $$ DECLARE accounts int[]; BEGIN IF TG_OP ILIKE 'INSERT' THEN INSERT INTO journal_line (je_id, account_id, amount) SELECT NEW.id, account_id, amount FROM unnest(NEW.line_items); RETURN NEW; ELSIF TG_OP ILIKE 'UPDATE' THEN RAISE EXCEPTION 'Cannot journal entry line items!'; ELSIF TG_OP ILIKE 'DELETE' THEN RAISE EXCEPTION 'Cannot delete journal entries!'; ELSE RAISE EXCEPTION 'Invalid TG_OP in trigger'; END IF; END; $$; Approval workflows can be handled with a separate status table with its own constraints. Deletions of old information (up to a specific snapshot) can be handled by a stored procedure which is unit tested and disables this trigger before purging data. This system has the advantage of having several small components which are all complete and easily understood, and it is made possible because the data is exclusively append-only. As you can see from the above examples, nested data structures greatly complicate the data model and create problems with relational math that must be addressed if data logic will remain meaningful. This is a complex field, and it adds a lot of complexity to storage. In general, these are best avoided in actual data storage except where this approach makes formerly insurmountable problems manageable. Moreover, they add complexity to optimization once data gets large. Thus while non-atomic fields in this regard make sense as an initial point of entry in some narrow cases, as a point of actual query, they are very rarely the right approaches. It is possible that, at some point, nested storage will be able to have its own indexes, foreign keys, etc. but I cannot imagine this being a high priority and so it isn't clear that this will ever happen. In general, it usually makes the most sense to simply store the data in a pseudo-normalized way, with any non-1NF designs being the initial point of entry in a linear write model. Nested Data Structures as Interfaces Nested data structures as interfaces to stored procedures are a little more manageable. The main difficulties are in application-side data construction and output parsing. Some languages handle this more easily than others. Upper-level construction and handling of these structures is relatively straight-forward on the database-side and poses none of these problems. However, they do cause additional complexity and this must be managed carefully. The biggest issue when interfacing with an application is that ROW types are not usually automatically constructed by application-level frameworks even if they have arrays. This leaves the programmer to choose between unstructured text arrays which are fundamentally non-discoverable (and thus brittle), and arrays of tuples which are discoverable but require a lot of additional application code to handle. At the same time as a chicken and egg problem, frameworks will not add handling for this sort of problem unless people are already trying to do it. So my general recommendation is to use nested data types everywhere in the database sparingly, only where the benefits clearly outweigh the complexity costs. Complexity costs are certainly lower in the interface level and there are many more cases where it these techniques are net wins there, but that does not mean that they should be routinely used even there.

September 25, 2012

by Chris Travers

· 20,900 Views

Asynchronous WMI Queries: Stay Away From Them

So, it turns out that I have a WMI category on my blog. During the last couple of years I almost forgot about it, but WMI got a chance to wrap its poisonous tentacles around me again yesterday. Here’s another story. WMI is known for requiring lots of attention to security. To establish a WMI connection to a remote machine, you need to muck around with registry settings, DCOM configuration, group policy details, and other infernal things which we developers like to defer to someone else. But at least you know that once a machine has been configured properly to give you access through WMI, you can then access it from any other machine. Right? Right? Not so much. WMI has a concept of asynchronous queries, which are notably used for receiving event notifications. For example, the following code registers for an event notification whenever a process is created on my desktop machine: ManagementScope scope = new ManagementScope(@"\\sasha-desktop\root\cimv2"); WqlEventQuery query = new WqlEventQuery( "SELECT * FROM Win32_ProcessStartTrace"); ManagementEventWatcher watcher = new ManagementEventWatcher(scope, query); watcher.EventArrived += (o, e) => ...; //TODO: process the event watcher.Start(); Indeed, this thing works just fine if you point it to a local machine; but it fails when you call the Start method when you connect it to a remote machine. You could now strip the remote machine bare and have it expose its very innate networking guts to the entire Internet, and it still wouldn’t help you establish the connection. Interesting. When troubleshooting this nasty bug, I looked up a VBScript sample that receives new process creation events on another machine. Here it is: Set wmi = GetObject("winmgmts:\\sasha-desktop\root\cimv2") Set query = wmi.ExecNotificationQuery _ ("SELECT * FROM Win32_ProcessStartTrace'") Set process = query.NextEvent VBScript and all, it worked just fine. I started to suspect something smelly in the kingdom of .NET, so I rewrote the VBScript sample in C#, using the long-forgotten Microsoft.VisualBasic.Interaction class: dynamic wmi = Microsoft.VisualBasic.Interaction.GetObject( "winmgmts:\\sasha-desktop\root\cimv2"); dynamic query = wmi.ExecNotificationQuery( "SELECT * FROM Win32_ProcessStartTrace"); dynamic evt = query.NextEvent; This, too, worked just fine – although it’s not much a surprise, as it’s pretty much equivalent to the VBScript code at this time. Still interesting. This is when it hit me – the asynchronous nature of the ManagementEventWatcher.EventArrived event relies on an asynchronous WMI query, which requires a reverse connection to the client machine! This is configuration inferno, x2, on the client machine now, what with the DCOM security settings and sacrifices to the gods of group policy. Unless, of course, we give away the asynchrony and rely on the ManagementEventWatcher.WaitForNextEvent method. It’s synchronous. It burns a thread that has to sit idly by and wait while its siblings execute useful work. But it doesn’t establish a reverse DCOM connection to the caller. At least that.

September 22, 2012

by Sasha Goldshtein

· 11,038 Views

How To Create A Theme Options Page For WordPress

If you have ever used a WordPress premium theme then you would of seen the custom theme options page that is available. The theme options page that is found under the appearance menu which allows the admin of the WordPress site to change some of the settings on the theme. Most premium themes will come with options to change the colors of fonts, backgrounds, change images or font types...anything that allows you to style the WordPress theme. Some of the most common fields to change are: Theme Options - To edit the theme logo, change the stylesheet, upload a new favicon, Add Google analytics code, enter your feedburner URL and add custom CSS. Styling Options - Change the background colour or change the background image. Fonts - Change the font on all your header tags or the main content text. Social - Providing you theme with your social media profiles will make it easier to link to them in parts of your theme or display your latest tweets. Option pages can also be used on plugins to change settings and to customize the plugin. Examples Of Theme Options Pages Here is what some of the theme options page from premium themes look like. How To Build A Theme Option Page When creating an option page there are a few things you need to setup. Add Menu - If you want to display the menu under the appearance menu or if you want to give the options page it's own menu. Add Sections - These are sections of settings you are adding to the options page. Register Settings - Settings are the different fields you are adding to the options page, they need to be registered with the settings API. Display Settings - The settings API will be used to call a function to display the setting. Validate Setting - When the user saves the settings field the input will need to be validated before stored in the options table. Feedback Messages - When the settings are saved you need to be able to feedback to the user if the settings were saved successfully or if there was an error during validation. To help us perform all these tasks there is a WordPress API called the Settings API. This API allows admin pages to handle setting forms semi-automatically. With the API you can define pages for the settings, sections for the settings and fields for the settings. This works by registering setting fields to be displayed within sections and page will display these sections. WordPress uses the Settings API by default on existing admin pages, this means that by using the Settings API you can add to existing pages by registering new settings. All validation must be performed by the developer of the settings pages but the Settings API will control the creation of the form and storing the values in the form in the options table. Add Menu To WordPress Admin When adding a menu to the WordPress admin screen you have loads of flexibility you have the option of adding brand new menu items or adding the menu as a sub menu. To add a top level menu just use the following function add_menu_page(). $page_title - The title used on the settings page. $menu_title - The title used on the menu. $capability - Only displays the menu if the user matches this capability. $menu_slug - The unique name of the menu slug. $function - This is the callback function to run to display the page. $icon_url - Display a icon just for the menu. $position - This allows you to choose when the menu item appears in the list. If you prefer to have the menu under the appearance parent menu you can use the following code snippet. Or you can use the function add_theme_page() which will add a sub-menu under the appearance menu. add_theme_page( $page_title, $menu_title, $capability, $menu_slug, $function); Registering The Settings To start off we need to register the settings group we are going to store the settings page values. This will use the Settings API to define the group of settings, we will then add the settings to a group. When you store the settings in this group they are stored in the wp_options database table so you can get these values out at a later date. The wp_options table is a key value pairing stored in the database. This is what you should use when storing long term data on your WordPress site. If you are storing a lot of data it's best practice to turn the data into an array and store it under one key, instead of storing all the values over multiple keys. This means that if you have a settings page to change the site logo, background color, font, font size etc, you won't have an option for each of these but you will group them into an option group. The reason you do this is to increase on database efficiency by not adding too many rows to the options database. To register settings on the Settings API you need to use the function register_setting(). The parameters you pass into this are: Option Group - The name of the group of settings you are going to store. This must match the group name used in the settings_field() function. Option name - The name of the option which will be saved, this is the key that is used in the options table. Sanitize Callback - This is the function that is used to validate the settings for this option group. Add Sections To Settings Once the settings are registered we can add section groups to the Settings API. This will allow us to organise the settings on the page, so that you can add styles to display these differently on the page. The benefit of adding sections on your Settings API is so that we can call the function do_settings_sections() as this will display all the settings under this one section. To create you own settings all you have to do is use the function add_settings_section(). The parameters you need to use on this function are: Id - String to use for the ID of the section. Title - The title to use on the section. Callback - This is the function that will display the settings on the page. Page - This is the page that is displaying the section, should match the menu slug of the page. Add Fields To The Sections The last important function we need to use to add settings to the page is the add_settings_field() function, this is used as part of the Settings API to define fields to a section. The function will need to know the page slug and the section Id before you can define the settings to use. All the settings which you setup here will be stored in the options table under the key used in the register_settings() function. To use this function you need to add the following parameters. ID - ID of the field Title - Title of the field. Callback - Function used to display the setting. This is very important as it is used to display the input field you want. Page - Page which is going to display the field should be the same as the menu slug on the section. Section - Section Id which the field will be added to. $args - Additional arguments which are passed to the callback function. Example Of Using The Settings API There is a lot of information to take in above to create this settings page so it can seem a bit complicated but once you get your head around the structure the Settings API uses it's actually quite easy to understand. The best way to understand how this all works is to show you with an example. Create A Theme Option Page With A Textbox Field In this example we will create a theme option page and add a textbox on the page to add additional text to the index.php. Just add the following to your functions.php file to create a theme options page. First we start off by creating the menu item under the appearance menu by using the add_theme_page() function on the admin_menu action. /** * Theme Option Page Example */ function pu_theme_menu() { add_theme_page( 'Theme Option', 'Theme Options', 'manage_options', 'pu_theme_options.php', 'pu_theme_page'); } add_action('admin_menu', 'pu_theme_menu'); As you can see above we set the callback function to the theme options page to be pu_theme_page so we need to create this function to display our page. Here we create a form to submit to the options.php so that we can save in the options table, we call settings_fields() to the get the settings in register_settings() and use the do_settings_sections() function to display our settings. /** * Callback function to the add_theme_page * Will display the theme options page */ function pu_theme_page() { ?> Custom Theme Options Created by Paulund. 'text', 'id' => 'pu_textbox', 'name' => 'pu_textbox', 'desc' => 'Example of textbox description', 'std' => '', 'label_for' => 'pu_textbox', 'class' => 'css_class' ); add_settings_field( 'example_textbox', 'Example Textbox', 'pu_display_setting', 'pu_theme_options.php', 'pu_text_section', $field_args ); } The callback function on creating sections can be used to add addition information that will appear above every section, on this example we are just leaving it blank. /** * Function to add extra text to display on each section */ function pu_display_section($section){ } The callback function on the add_settings_field() function is pu_display_setting, this is the function that is going to echo the display of any input's on the page. The parameter to this function is the $args value on the add_settings_field() we can use this to add things like id, name, default value etc. We want to get any existing values from the wp_option table to display any values which previously typed in by the user, do to this we get the values from the table by using the get_option() function. /** * Function to display the settings on the page * This is setup to be expandable by using a switch on the type variable. * In future you can add multiple types to be display from this function, * Such as checkboxes, select boxes, file upload boxes etc. */ function pu_display_setting($args) { extract( $args ); $option_name = 'pu_theme_options'; $options = get_option( $option_name ); switch ( $type ) { case 'text': $options[$id] = stripslashes($options[$id]); $options[$id] = esc_attr( $options[$id]); echo ""; echo ($desc != '') ? "$desc" : ""; break; } } Finally we can validate the values added to the form by creating the validation callback function pu_validate_settings. All this does at the moment is loop through the inputs passed to it and checks if it's a letter or a number. The return of this function is what will be added to the database. /** * Callback function to the register_settings function will pass through an input variable * You can then validate the values and the return variable will be the values stored in the database. */ function pu_validate_settings($input) { foreach($input as $k => $v) { $newinput[$k] = trim($v); // Check the input is a letter or a number if(!preg_match('/^[A-Z0-9 _]*$/i', $v)) { $newinput[$k] = ''; } } return $newinput; } If you copy all the snippets above into your functions.php file you will see this options form under the appearance menu. Using Theme Options Within Your Theme Now that you understand how to create a theme options page you need to be able to use this value in your theme so you can change the settings. All the settings are stored in the wp_options table with WordPress it's very easy to get these values out all you have to do is use the get_option() function. The option name is the name you put on the register_settings() function. So in our example above you will use this code. The $options variable will now store an array of the values from the theme options, which you can display the value of the textbox we put on the page by using this snippet. Conclusion That's the basics that you need to understand to use the Settings API, now you can take this information and create your own theme options page. Experiment with different input types you can add to the form, experiment with different validation methods you want to use. In future tutorials I will post how you can use some of the inbuilt WordPress third party applications to create a better user experience on your theme options panel. This will include things like color pickers, date pickers, jQuery UI features etc. As you can see we have created a settings option page in just over 100 lines of code, so it's not a hard thing to do but that are a few steps to it and the features can be expanded on. For this reason people have created theme option frameworks to allow you to easily create a theme option page with much higher level of complexity with the options. But like many other frameworks I always recommend you learn the basics before using a framework, this is why it's important to understand how the Settings API works before using or creating a settings page framework.

September 18, 2012

by Paul Underwood

· 23,736 Views

Fixing Bugs - If You Can't Reproduce a Bug, You Can't Fix It

Fixing a problem usually starts with reproducing it – what Steve McConnell calls “stabilizing the error.” Technically speaking, you can’t be sure you are fixing the problem unless you can run through the same steps, see the problem happen yourself, fix it, and then run through the same steps and make sure that the problem went away. If you can’t reproduce it, then you are only guessing at what’s wrong, and that means you are only guessing that your fix is going to work. But let’s face it – it’s not always practical or even possible to reproduce a problem. Lots of bug reports don’t include enough information for you to understand what the hell the problem actually was, never mind what was going on when the problem occurred – especially bug reports from the field. Rahul Premraj and Thomas Zimmermann found in The Art of Collecting Bug Reports (from the book Making Software), that the two most important factors in determining whether a bug report will get fixed or not are: Is the description well-written, can the programmer understand what was wrong or why the customer thought something was wrong? Does it include steps to reproduce the problem, even basic information about what they were doing when the problem happened? It’s not a lot to ask – from a good tester at least. But you can’t reasonably expect this from customers. There are other cases where you have enough information, but don’t have the tools or expertise to reproduce a problem – for example, when a pen tester has found a security bug using specialist tools that you don’t have or don’t understand how to use. Sometimes you can fix a problem without being able to see it happen in front of you, come up with a theory on your own, trusting your gut – especially if this is code that you recently worked on. But reproducing the problem first gives you the confidence that you aren’t wasting your time and that you actually fixed the right issue. Trying to reproduce the problem should almost always be your first step. What’s involved in reproducing a bug? What you want to do is to find, as quickly as possible, a simple test that consistently shows the problem, so that you can then run a set of experiments, trace through the code, isolate what’s wrong, and prove that it went away after you fixed the code. The best explanation that I’ve found of how to reproduce a bug is in Debug It! where Paul Butcher patiently explains the pre-conditions (identifying the differences between your test environment and the customer’s environment, and trying to control as many of them as possible), and then how to walk backwards from the error to recreate the conditions required to make the problem happen again. Butcher is confident that if you take a methodical approach, you will (almost) always be able to reproduce the problem successfully. In Why Programs Fail: A guide to Systematic Debugging, Andreas Zeller, a German Comp Sci professor, explains that it’s not enough just to make the problem happen again. Your goal is to come up with the simplest set of circumstances that will trigger the problem – the smallest set of data and dependencies, the simplest and most efficient test(s) with the fewest variables, the shortest path to making the problem happen. You need to understand what is not relevant to the problem, what’s just noise that adds to the cost and time of debugging and testing – and get rid of it. You do this using binary techniques to slice up the input data set, narrowing in on the data and other variables that you actually need, repeating this until the problem starts to become clear. Code Complete’s chapter on Debugging is another good guide on how to reproduce a problem following a set of iterative steps, and how to narrow in on the simplest and most useful set of test conditions required to make the problem happen; as well as common places to look for bugs: checking for code that has been changed recently, code that has a history of other bugs, code that is difficult to understand (if you find it hard to understand, there’s a good chance that the programmers who worked on it before you did too). Replay Tools One of the most efficient ways to reproduce a problem, especially in server code, is by automatically replaying the events that led up to the problem. To do this you’ll need to capture a time-sequenced record of what happened, usually from an audit log, and a driver to read and play the events against the system. And for this to work properly, the behavior of the system needs to be deterministic – given the same set of inputs in the same sequence, the same results will occur each time. Otherwise you’ll have to replay the logs over and over and hope for the right set of circumstances to occur again. On one system that I worked on, the back-end engine was a deterministic state machine designed specifically to support replay. All of the data and events, including configuration and control data and timer events, were recorded in an inbound event log that we could replay. There were no random factors or unpredictable external events – the behavior of the system could always be recreated exactly by replaying the log, making it easy to reproduce bugs from the field. It was a beautiful thing, but most code isn’t designed to support replay in this way. Recent research in virtual machine technology has led to the development of replay tools to snapshot and replay events in a virtual machine. VMWare Workstation, for example, included a cool replay debugging facility for C/C++ programmers which was “guaranteed to have instruction-by-instruction identical behavior each time.” Unfortunately, this was an expensive thing to make work, and it was dropped in version 8, at the end of last year. Replay Solutions provides replay for Java programs, creating a virtual machine to record the complete stream of events (including database I/O, network I/O, system calls, interrupts) as the application is running, and then later letting you simulate and replay the same events against a copy of the running system, so that you can debug the application and observe its behavior. They also offer similar application record and replay technology for mobile HTML5 and JavaScript applications. This is exciting stuff, especially for complex systems where it is difficult to setup and reproduce problems in different environments. Fuzzing and Randomness If the problem is non-deterministic, or you can't come up with the right set of inputs, one approach to try is to simulate random data inputs and watch to see what happens - hoping to happen on a set of input variables that will trigger the problem. This is called fuzzing. Fuzzing is a brute force testing technique that is used to uncover data validation weaknesses that can cause reliability and security problems. It's effective at finding bugs, but it’s a terribly inefficient way to reproduce a specific problem. First you need to setup something to fuzz the inputs (this is easy if a program is reading from a file, or a web form – there are fuzzing tools to help with this – but a hassle if you need to write your own smart protocol fuzzer to test against internal APIs). Then you need time to run through all of the tests (with mutation fuzzing, you may need to run tens of thousands or hundreds of thousands of tests to get enough interesting combinations) and more time to sift through and review all of the test results and understand any problems that are found. Through fuzzing you will get new information about the system to help you identity problem areas in the code, and maybe find new bugs, but you may not end up any closer to fixing the problem that you started on. Reproducing problems, especially when you are working from a bad bug report (“the system was running fine all day, then it crashed… the error said something about a null pointer I think?”) can be a serious time sink. But what if you can’t reproduce the problem at all? Let’s look at that next…

September 9, 2012

by Jim Bird

· 45,646 Views

"Schemas" in CouchDB

schema noun ( pl. schemata or schemas ) 1 technical a representation of a plan or theory in the form of an outline or model: a schema of scientific reasoning. 2 Logic a syllogistic figure. 3 (in Kantian philosophy) a conception of what is common to all members of a class; a general or essential type or form. CouchDB is a schema-less document store, but there are times when a schema is a good thing to have around, one way or another. So can you have your cake and eat it too? Below I'll take a high level look at adding a kind of schema to an application and the benefits and draw backs associated with this way of working. What I describe below isn't for everyone. It goes against some of the core principles of CouchDB and makes your data much less human readable, but there are cases where that trade off is worth making. Schemas: WTF?! It might seem a bit weird to add a schema to a schema-less database but sometimes it is a very useful thing indeed. When you're dealing with large datasets verbose object key names can be a problem (e.g. cost you money) so you end up stuck between a rock and a hard place; either make your data terse and hard to use or be explicit and spend more on storage and network. { "shape": "triangle", "colour_label": "red", "opposite_length_in_mm": 767.12254256805875, "angle_in_radians": 1.5514293603308698, "adjacent_length_in_mm": 73.59881843627835 } What usually happens is some middle ground where a nice descriptive name like "angle_in_radians" gets reduced to "angle" or "rads". That's fine in that it reduces the storage and network required to deal with all that data. { "adj": 73.59881843627835, "shape": "triangle", "angle": 1.5514293603308698, "opp": 767.12254256805875, "colour": "red" } However, by making this small change you move the description of the data out of your database and into some undefined place; higher level code, documentation, shared knowledge, a whiteboard, a notebook, someones head. As your data becomes more terse you might rely on duck typing (deriving from the data itself what the data describes) to get data that quacks right in your application. That's fine so long as you have data that is sufficiently distinguishable from the other ducks on the pond; if I rely on pulling a triangle object from the database because it has an angle member I might accidentally pull out a rhombus or an icosahedron. To make sure you get the data you expect you might add an explicit type field to each data (e.g. "type=goose" or "shape=triangle") something which I've always felt was rather odd. This starts to add up on storage (remember you have a large dataset/flock of ducks) and, more importantly, it doesn't help with where the description of the data is held - you know that you have a goose but don't know what a goose is. This last point is important, especially if you're working in a team of developers. Knowing what describing a shape as a triangle means is vital in producing consistent code that many people can work on. The straight jacket of a SQL schema looks pretty comfy sometimes. Okay, I'll buy that a schema might be useful... So how do you add a schema into a CouchDB database, something that is inherently schema-less? Can I get the best of both worlds? Here's a little trick that might help. First you define a document that is the schema for a particular type of data: { "_id": "datatype/triangle/v1", "fields": [ "opposite_length_in_mm", "adjacent_length_in_mm", "angle_in_radians", "colour_label" ] } Then you change your document structure to reference that "schema": { "datatype": "triangle/v1", "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "red" ] } Note that the schema is versioned and that ordering in the data list is important here! I now know precisely what the data represents without having to store that description in the data itself. This way of working has benefits beyond disk storage; you reduce wire traffic, and there is less for a client to parse before rendering it. This is especially useful if you're rendering into a browser based visualisation - you don't need a complex set of objects to make a bar chart, just a list of x and y values. I can also share the data structure with colleagues and be reasonably confident that when I'm talking about a "v1 triangle" they'll know that lengths are in millimeters, are the opposite and adjacent sides and that the angle is in radians, hopefully reducing the chance of costly mistakes. Isn't that error prone? Yes and no. If you make a mistake in the ordering of your fields then, yes you are going to have issues. This is reasonably easy to manage with some form of client verification (e.g. validation on a web form) and generating the interface from the data (e.g. use the schema definition to build the GUI). If you're adding these data into the database by hand (e.g. via a curl or futon) then you aren't going to be in the regime where this trick is useful; your dataset needs to be large for this to make sense. Things still quack What's particularly nice about this way of working is that I can still duck type the data, add additional fields to annotate it etc. since the schema isn't strictly enforced. Nothing stops me from having a triangle document like: { "datatype": "triangle/v1", "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "red" ], "owner": "Simon", "location" "space" } My views that deal with the data with a schema will still work (by ignoring these additional fields), my MVC framework will still render my pages, and I'll still have all the data I want in my database. Nesting You could have a nested object structure like: { "datatype": "pattern/v1", "data": [ { "datatype": "triangle/v1", "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "red" ], "owner": "Simon", "location" "space" }, { "datatype": "triangle/v1", "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "blue" ], "owner": "Fred", "location" "space" }, { "datatype": "square/v1", data: [ 10, "green" ] } ] } But if you're going to have a schema you may as well reflect the nesting inside it, e.g say that you have a list of triangles and a list of squares: { "_id": "datatype/pattern/v1", "fields": [ ["triangle/v1"], ["square/v1"] ] } { "datatype": "pattern/v1", "data": [ [ { "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "red" ], "owner": "Simon", "location" "space" }, { "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "blue" ], "owner": "Fred", "location" "space" } ], [ { data: [ 10, "green" ] } ] } Schema evolution A nice feature of this way of working is that you can deal with schema evolutions; changing the format of your data. { "_id": "datatype/triangle/v2", "fields": [ "opposite_length_in_cm", "hypotenuse_length_in_cm", "angle_in_degrees", "colour_label" ] } There are only so many ways you can represent the data. While sometimes you may have a major schema evolution, one where old data is completely unusable, often changes are just tweaks for consistency (say changing the units of a quantity) or extending the schema by adding in optional data. In either case you should be able to use data from multiple schema versions together by using appropriate manipulations on the data. For example you could instantiate shape objects via a factory which knows how to create the right object for different schema versions. Validation The above does no validation of the data; the color field in the input data could be set to a number instead of a string, the angle to something non- physical etc. If you really needed validation you could do it with CouchDB's validation functions. If you go the fully validated route you'd want to define the schema in the design document (instead of as a normal doc) and use a CommonJS include to make sure that the validator in the app was doing the same thing as the schema. This ties you to a version of the design document (which is where the validators live), which may or may not be an issue. It will also considerably slow down insertion rate as CouchDB has to do more work to add your data. Personally I prefer to put validation logic in the client making writes. Views If I were using this way of working I would want to have a view which returned all the schema's defined on the database. This then allows me to build objects appropriately. A view to return schema's documents would look like: function(doc) { if (doc._id.slice(0, 'datatype'.length) == 'datatype') { emit (doc._id.slice('datatype/'.length, doc._id.length), doc.fields) } } You can pull out documents that have a schema with a simple view like: function(doc) { if (doc.datatype){ emit(doc.datatype, doc.data); } } This can be queried to find objects of a given shape using CouchDB's view slicing (e.g. ?startkey="square/v1"&endkey="square/v2") which returns data like: {"id":"datatype/square/v1","key":["square/v1",0],"value":["side_length_in_mm","colour_label"]}, {"id":"f98ffe7e4cd91cbb0d904f9098499ca8","key":["square/v1",1],"value":[872.4342711412228,"green"]}, {"id":"f98ffe7e4cd91cbb0d904f909849a218","key":["square/v1",1],"value":[370.29971491443905,"yellow"]}, {"id":"f98ffe7e4cd91cbb0d904f909849acd0","key":["square/v1",1],"value":[8.799279300193753,"yellow"]} You'll notice the name of the "schema" is the key and the values are held in value. This means I can parse the data into a set of appropriate objects with something like: var objects = []; function build(schema, data){ // Build the appropriate object for the schema... } for (row in data){ // build up the objects in a factory var obj = build(row.key, row.value); objects.push(obj); } If I wanted all versions of a shape the query would be, and used a vNUMERIC_COUNTER notation for versioning, ?startkey="square/v1"&endkey="square/vXXX" as numbers sort lower than strings. Taking it to the extreme If you are really worried about data size you can take this technique to the extreme by encoding the data arrays as a byte string and using the schema documents to describe that byte array. This effectively turns your JSON structure into something not dissimilar to a protocol buffer, at the expense of human readability and view complexity. If you are particularly concerned with data size over the wire (for example are writing an MMORPG) then this may be an acceptable trade off. Reminder This trick isn't suitable for every dataset. If you modify the data by hand it is prone to error. If you have a small dataset, or only ever send a small subset of the data to the client it's massive overkill. But if you have a large dataset of machine generated data, that needs to be frequently accessed over the WAN (think a monitoring app or game) then this is a nice way to reduce storage, network IO and browser render time. It's also worth reiterating that the schema is not enforced, you could have a square with 3 sides, and that adding strict schema enforcement with a validation function will considerably slow down insert rate.

September 8, 2012

by Simon Metson

· 10,382 Views

Manual Test-Driven Development

Test-Driven Development is a code-level practice, based on running automated tests that are written before the production code they exercise. But practices can be applied only in the context where they were developed: when some premises are not present is difficult to apply TDD as-is. Automated specification For example, consider the premise of assertion automation: it is possible to write a (hopefully) small algorithm that is able to check the result of running production code and return true or false. In the case the problem is: Draw an antialiased circle on this blank canvas. -- Carlo Pescio it is not immediately clear how to define automated tests for this behavior. We could check that some pixels are still blank inside or outside the circle, or that there is a bound number of pixels of black color; or even that they are contiguous. An opinion I've heard (that I try not to misrepresent) is that we only need to write some looser tests in these cases, checking only a few pixels of the circle. This process will give us a little feedback on the API of our Canvas or Circle object, but not much on the algorithm we are implementing inside it. Are we going in the right direction? Have new test cases correctly been satisfied without a large intervention on the existing code? Are we painting some unrelated pixels due to an hidden bug? What I argument here is instead that we should change the nature of the feedback mechanism. Speaking in control theory terms, change the block that acquires the output and influences the input to our design process. Develop in the browser When I was developing a Couchapp, a kind of web application served directly from a CouchDB database, I was appaled by the difficulty of testing it. While the production code was composed of ~100 lines, it was a complex mix of technologies: HTML and CSS code, client-side JavaScript for managing user events and some server-side JavaScript for the "queries" (actually the server-side only consists of the database in Couchapps.) Some of this logic could be tested in automation, like the result of queries over views. Yet much of it was related to a user interface, and as such requiring a large time investment to automate. Instead of waking up my Selenium server and start to manipulate a browser with code, I noticed that this UI was almost read-only; there were a few cases where a new document would have to be inserted, but a manual test of them was short and did not even required to reload the page. The whole application state was observable. Summing it up, I performed a frequent manual test that took a few seconds instead of trying to define complex and brittle automation logic for testing the UI. Now that I've been introduced to a simple qualitative ROI model by Carlo Pescio's article, I would do the same for every context where: a large time investment is needed for automating tests. it is possible to perform manual tests quickly. as the only logic conclusion. A word of caution TDD has many benefits (including catching regressions early) so I'm not prepared to give it up just because it is difficult to test. These are technical scenarios where I have successfully followed TDD by the book: multithreaded and multiprocess code applications distributed over multiple machines computer vision (object recognition and tracking) image manipulation code (via comparison testing) development of browser bindings for Selenium And even in the case the big picture is not easy to test-first (like in the case of image manipulation), we can benefit from TDD the pieces of the solution. For example, in the computer vision case I wasn't able to write a test beforehand for tracking a car inside a movie. But I was able to TDD the objects that the algorithmic solution to the problem called for: Patch, Area, Cluster, Movement, and so on. End-to-end TDD is not always cheap but unit level TDD can often be, if it considers testability as a relevant property (while regression testing even at the end-to-end level is always possible, in the worst case with record and replay.) End-to-end specifications If we can't define automated assertions for our "big picture" problem, it doesn't mean that we cannot apply the TDD approach, by substituting a manual step. Going back to the circle problem, I would define manual test cases on an inspection page seen by a human. I've seen this done with layouts and multiple browsers to catch CSS rendering bugs, for example: It would be very difficult to check these screenshots automatically, as each browser renders pages a bit differently from the others. The iterative process becomes: Define a cheap manual test, automating the arrange and act phases but not the assertion. Write only the code necessary to make it pass. Refactor. As long as the number of tests does not increase without limit and the manual check can be performed quickly, this approach does not slow you down with respect to TDD by-the-book. You'll have to take care of regression with other means; but at least you define a set of manual test cases. Feedback! TDD is an instrument of feedback: if feedback cannot be gathered in an automated way, we have to resort to manual checking of the specifications. Here are other examples of manual tools for generating feedback: Read-Eval-Print Loops: you can experimenting with existing classes and functions, and easily repeat steps thanks to history. the browser refresh button: the fastest way to transform a PSD into an HTML and CSS template. MongoDB console for learning the database API; other kinds of consoles like Firebug and Chrome's, or Clojure's.

September 3, 2012

by Giorgio Sironi

· 10,292 Views

Idempotent DB Update Scripts

An idempotent function gives the same result even if it is applied several times. That is exactly how a database update script should behave. It shouldn’t matter if it is run on or multiple times. The result should be the same. A database update script should be made to first check the state of the database and then apply the changes needed. If the script is done this way, several operations can be combined into one script that works on several databases despite the databases being at different (possibly unknown) state to start with. For the database schema itself I usually use Visual Studio 2010 database projects that handles updates automatically (in VS2012 the functionality has been changed significantly). Even with the schema updates handled automatically, there are always things that need manual handling. One common case is lookup tables that need initialization. Lookup Table Init Script I use a combination of a temp table and a MERGE clause to init lookup tables. CREATE TABLE #Colours ( ColourId INT NOT NULL, Name NVARCHAR(10) NOT NULL ) INSERT #Colours VALUES (1, N'Red'), (2, N'Green'), (3, N'Blue') MERGE Colours dst USING #Colours src ON (src.ColourId = dst.ColourId) WHEN MATCHED THEN UPDATE SET dst.ColourId = src.ColourId WHEN NOT MATCHED THEN INSERT VALUES (src.ColourId, src.Name) WHEN NOT MATCHED BY SOURCE THEN DELETE; DROP TABLE #Colours I think that the temp table approach is great because it gives a clear overview in the script of what the final values will be. It also works regardless of what the current values are. Sometimes it is relevant to keep old values, which can be done by removing the last two lines of the MERGE clause. It is also possible to flag records as inactive instead of deleting them. MERGE... ... WHEN NOT MATCHED BY SOURCE THEN SET dst.Active = 0; Checking Current State An idempotent script has to be able to check the current state and adopt its behaviour. The lookup table init script uses the MERGE clause for that, checking the actual values. In most cases it is possible to check the current state by inspecting the values of the table or through the sys meta data views. If that’s not possible, a separate table can be used to log the scripts run. This method has the advantage of an easy way to check what scripts have been run. The disadvantage is that it violates the DRY Principle by keeping a separate log, which can get out of sync with the actual database schema. What happens when a script is partially run and then fails before writing the log entry? What will happen the next time the script is run? This is where true idempotent script shines. Whenever there’s a doubt of the current state of the database the entire script can be run again, bringing the database to a known state.

September 3, 2012

by Anders Abel

· 11,201 Views

Building A Simple API Proxy Server with PHP

these days i’m playing with backbone and using public api as a source. the web browser has one horrible feature: it don’t allow you to fetch any external resource to our host due to the cross-origin restriction. for example if we have a server at localhost we cannot perform one ajax request to another host different than localhost. nowadays there is a header to allow it: access-control-allow-origin . the problem is that the remote server must set up this header. for example i was playing with github’s api and github doesn’t have this header. if the server is my server, is pretty straightforward to put this header but obviously i’m not the sysadmin of github, so i cannot do it. what the solution? one possible solution is, for example, create a proxy server at localhost with php. with php we can use any remote api with curl (i wrote about it here and here for example). it’s not difficult, but i asked myself: can we create a dummy proxy server with php to handle any request to localhost and redirects to the real server, instead of create one proxy for each request?. let’s start. problably there is one open source solution (tell me if you know it) but i’m on holidays and i want to code a little bit (i now, it looks insane but that’s me ). the idea is: ... $proxy->register('github', 'https://api.github.com'); ... and when i type: http://localhost/github/users/gonzalo123 and create a proxy to : https://api.github.com/users/gonzalo123 the request method is also important. if we create a post request to localhost we want a post request to github too. this time we’re not going to reinvent the wheel, so we will use symfony componets so we will use composer to start our project: we create a conposer.json file with the dependencies: { "require": { "symfony/class-loader":"dev-master", "symfony/http-foundation":"dev-master" } } now php composer.phar install and we can start coding. the script will look like this: register('github', 'https://api.github.com'); $proxy->run(); foreach($proxy->getheaders() as $header) { header($header); } echo $proxy->getcontent(); as we can see we can register as many servers as we want. in this example we only register github. the application only has two classes: restproxy , who extracts the information from the request object and calls to the real server through curlwrapper . request = $request; $this->curl = $curl; } public function register($name, $url) { $this->map[$name] = $url; } public function run() { foreach ($this->map as $name => $mapurl) { return $this->dispatch($name, $mapurl); } } private function dispatch($name, $mapurl) { $url = $this->request->getpathinfo(); if (strpos($url, $name) == 1) { $url = $mapurl . str_replace("/{$name}", null, $url); $querystring = $this->request->getquerystring(); switch ($this->request->getmethod()) { case 'get': $this->content = $this->curl->doget($url, $querystring); break; case 'post': $this->content = $this->curl->dopost($url, $querystring); break; case 'delete': $this->content = $this->curl->dodelete($url, $querystring); break; case 'put': $this->content = $this->curl->doput($url, $querystring); break; } $this->headers = $this->curl->getheaders(); } } public function getheaders() { return $this->headers; } public function getcontent() { return $this->content; } } the restproxy receive two instances in the constructor via dependency injection (curlwrapper and request). this architecture helps a lot in the tests , because we can mock both instances. very helpfully when building restproxy. the restproxy is registerd within packaist so we can install it using composer installer: first install componser curl -s https://getcomposer.org/installer | php and create a new project: php composer.phar create-project gonzalo123/rest-proxy proxy if we are using php5.4 (if not, what are you waiting for?) we can run the build-in server cd proxy php -s localhost:8888 -t www/ now we only need to open a web browser and type: http://localhost:8888/github/users/gonzalo123 the library is very minimal (it’s enough for my experiment) and it does’t allow authorization. of course full code is available in github .

September 2, 2012

by Gonzalo Ayuso

· 20,390 Views