Databases Resources

The Latest Databases Topics

Modeling Data in Neo4j: Bidirectional Relationships

Transitioning from the relational world to the beautiful world of graphs requires a shift in thinking about data. Although graphs are often much more intuitive than tables, there are certain mistakes people tend to make when modelling their data as a graph for the first time. In this article, we look at one common source of confusion: bidirectional relationships.

Directed Relationships

Relationships in Neo4j must have a type, giving the relationship a semantic meaning, and a direction. Frequently, the direction becomes part of the relationship's meaning. In other words, the relationship would be ambiguous without it. For example, the following graph shows that the Czech Republic DEFEATED Sweden in ice hockey. Had the direction of the relationship been reversed, the Swedes would be much happier. With no direction at all, the relationship would be ambiguous, since it would not be clear who the winner was.

Note that the existence of this relationship implies a relationship of a different type going in the opposite direction, as the next graph illustrates. This is often the case. To give another example, the fact that Pulp Fiction was DIRECTED_BY Quentin Tarantino implies that Quentin Tarantino IS_DIRECTOR_OF Pulp Fiction. You could come up with a huge number of such relationship pairs.

One common mistake people make when modelling their domain in Neo4j is creating both types of relationships. Since one relationship implies the other, this is wasteful, both in terms of space and traversal time. Neo4j can traverse relationships in both directions. More importantly, thanks to the way Neo4j organizes its data, the speed of traversal does not depend on the direction of the relationships being traversed.

Bidirectional Relationships

Some relationships, on the other hand, are naturally bidirectional. A classic example is Facebook or real-life friendship. This relationship is mutual: when someone is your friend, you are (hopefully) their friend, too. Depending on how we look at the model, we could also say such a relationship is undirected.

GraphAware and Neo Technology are partner companies. Since this is a mutual relationship, we could model it as a bidirectional or an undirected relationship, respectively. But since neither of these is directly possible in Neo4j, beginners often resort to the following model, which suffers from the exact same problem as the incorrect ice hockey model: an extra, unnecessary relationship.

Neo4j APIs allow developers to completely ignore relationship direction when querying the graph, if they so desire. For example, in Neo4j's own query language, Cypher, the key part of a query finding all partner companies of Neo Technology would look something like

    MATCH (neo)-[:PARTNER]-(partner)

The result would be the same as executing and merging the results of the following two different queries:

    MATCH (neo)-[:PARTNER]->(partner)

and

    MATCH (neo)<-[:PARTNER]-(partner)

Therefore, the correct (or at least most efficient) way of modelling the partner relationships is using a single PARTNER relationship with an arbitrary direction.

Conclusion

Relationships in Neo4j can be traversed in both directions with the same speed. Moreover, direction can be completely ignored. Therefore, there is no need to create two different relationships between nodes if one implies the other.
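To make the argument concrete outside of Neo4j, here is a minimal Python sketch of a graph that stores each relationship once and can still be queried in either direction, or with direction ignored. It is only an illustration of the idea, not Neo4j's storage format or API; the names Graph, relate, outgoing, incoming and neighbours are invented for this example.

```python
from collections import defaultdict


class Graph:
    """Toy graph: every relationship is stored once, with a direction."""

    def __init__(self):
        # The same relationship is indexed from both of its end nodes,
        # so following it "backwards" costs the same as "forwards".
        self._out = defaultdict(list)  # start node -> [(type, end node)]
        self._in = defaultdict(list)   # end node   -> [(type, start node)]

    def relate(self, start, rel_type, end):
        self._out[start].append((rel_type, end))
        self._in[end].append((rel_type, start))

    def outgoing(self, node, rel_type):
        return {other for t, other in self._out[node] if t == rel_type}

    def incoming(self, node, rel_type):
        return {other for t, other in self._in[node] if t == rel_type}

    def neighbours(self, node, rel_type):
        # Direction ignored: the merged result of both directed lookups.
        return self.outgoing(node, rel_type) | self.incoming(node, rel_type)


g = Graph()
g.relate("GraphAware", "PARTNER", "Neo Technology")  # direction is arbitrary
g.relate("Czech Republic", "DEFEATED", "Sweden")     # direction carries meaning

print(g.neighbours("Neo Technology", "PARTNER"))  # {'GraphAware'}
print(g.neighbours("GraphAware", "PARTNER"))      # {'Neo Technology'}
print(g.incoming("Sweden", "DEFEATED"))           # {'Czech Republic'}
```

A single PARTNER record answers the question from both companies' point of view, which is why a second, mirrored relationship adds nothing but storage.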

November 6, 2013

by Michal Bachman


Automatically Collect and Process Visitors’ IP Addresses

(NOTE: A version of this article posted previously contained incorrect information. The below version corrects those errors. The author apologises for any inconvenience.)

I’m not a programmer. I’m an art collection manager with a fierce DIY streak that has helped me to develop a database application, and build and manage a website that incorporates it. Ever since I accidentally lobotomised my first Windows 3.1 computer, I’ve taught myself how to seek out, find and apply the information I need, sometimes through long hours of trial and error, and I owe almost all of it to the Internet. If it weren’t for people’s willingness to share information for free on innumerable forums and websites like this one, I would not have been able even to scratch the surface of completing the sorts of tasks that are now behind me.

In that spirit of sharing, I thought I might humbly offer the solution I’ve cobbled together from various sources, and tweaked to automate tasks related to collecting the IP addresses of visitors to my website. I’m sure it’s nothing earth-shattering to an experienced coder, but it works well for me, and I’ve never seen a complete solution like it presented anywhere on the Internet before. It is a Windows-centric solution, since that’s the platform I’ve always used.

The first step is to gather and record the IP addresses of website visitors. I’ve chosen to do this using PHP. Insert this code into the HTML of each page for which you’d like to capture IP addresses, just before the closing tag:

    [...]

If you want to record the IP addresses for each webpage to a different file, use a different name each time for filename.txt in the above example. Next, create the blank filename.txt file(s) in the same directory of your web server in which these HTML files reside. Now each time a visitor loads these pages, their IP address will be written to the text file(s) you’ve indicated.

Next, you’ll need a way to download the text file(s) from the server to your local machine.
If you’re writing to multiple files on the server, I’ve found it’s helpful to download them separately, then combine them into one list. Also, I like to sort the list and remove duplicate entries (you’ll see why a little later). Following is a Visual Basic script to do all of that. Let’s designate it C:\Folder\Subfolder\DloadCmbnDdupe.vbs. In the script below, substitute YourWebsite.com with the domain name of your website, C:\Folder\Subfolder with the actual location on your local computer and filename*.txt with the file name(s) on the web server to which you’ve chosen to write.

    Option Explicit
    On Error Resume Next

    Download "http://www.YourWebsite.com/filename.txt", _
             "C:\Folder\Subfolder\filename.txt"
    Download "http://www.YourWebsite.com/filename2.txt", _
             "C:\Folder\Subfolder\filename2.txt"
    Download "http://www.YourWebsite.com/filename3.txt", _
             "C:\Folder\Subfolder\filename3.txt"
    CmbnDdupe()

    If Err <> 0 Then
        Wscript.echo "Error Type = " & Err.Description
    End If

    WScript.Quit

    '-----------------------------------------------------------------------------------------
    ' Downloads strURL and writes it to strPath (a folder or a full file name).
    Function Download(strURL, strPath)
        Dim i, objFile, objFSO, objHTTP, strFile, strMsg
        Const ForReading = 1, ForWriting = 2, ForAppending = 8

        Set objFSO = CreateObject("Scripting.FileSystemObject")
        If objFSO.FolderExists(strPath) Then
            ' strPath is a folder: keep the file name from the URL
            strFile = objFSO.BuildPath(strPath, Mid(strURL, InStrRev(strURL, "/") + 1))
        ElseIf objFSO.FolderExists(Left(strPath, InStrRev(strPath, "\") - 1)) Then
            ' strPath is a full file name in an existing folder
            strFile = strPath
        Else
            WScript.Echo "ERROR: Target folder not found."
            Exit Function
        End If

        Set objFile = objFSO.OpenTextFile(strFile, ForWriting, True)
        Set objHTTP = CreateObject("WinHttp.WinHttpRequest.5.1")
        objHTTP.Open "GET", strURL, False
        objHTTP.Send

        ' Copy the response to the file one byte at a time
        For i = 1 To LenB(objHTTP.ResponseBody)
            objFile.Write Chr(AscB(MidB(objHTTP.ResponseBody, i, 1)))
        Next
        objFile.Close()
    End Function

    '-----------------------------------------------------------------------------------------
    ' Runs the batch file that combines, sorts and de-duplicates the lists.
    Function CmbnDdupe()
        Dim shell
        Set shell=createobject("wscript.shell")
        shell.run "CmbnDdupe.bat"
        Set shell=nothing
    End Function

You’ll notice that Function CmbnDdupe calls a batch file, CmbnDdupe.bat, in the same directory (C:\Folder\Subfolder). Here it is, below. Again, substitute filename*.txt with the file name(s) you used at the beginning.

    @echo off
    for %%x in (filename.txt) do type %%x>>templist
    for %%x in (filename2.txt) do type %%x>>templist
    for %%x in (filename3.txt) do type %%x>>templist
    ren templist IPlist.txt
    setlocal disableDelayedExpansion
    set file=IPlist.txt
    set "sorted=%file%.sorted"
    set "deduped=%file%.deduped"
    ::Define a variable containing a linefeed character
    set LF=^


    ::The 2 blank lines above are critical, do not remove
    sort "%file%" >"%sorted%"
    >"%deduped%" (
      set "prev="
      for /f usebackq^ eol^=^%LF%%LF%^ delims^= %%A in ("%sorted%") do (
        set "ln=%%A"
        setlocal enableDelayedExpansion
        if /i "!ln!" neq "!prev!" (
          endlocal
          (echo %%A)
          set "prev=%%A"
        ) else endlocal
      )
    )
    >nul move /y "%deduped%" "%file%"
    del "%sorted%"
    exit

This routine combines the downloaded files into one (templist), then renames it to IPlist.txt. It then sorts the IP addresses into ascending order, saving to IPlist.txt.sorted, and then removes any duplicates, saving to IPlist.txt.deduped. Finally, it moves IPlist.txt.deduped over IPlist.txt (overwriting it) and deletes IPlist.txt.sorted, leaving behind IPlist.txt (the sorted and de-dupe-ified list). Now, we have a list of one IP address per visitor to the pages from which we’re collecting.
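For readers who find the batch syntax hard to follow, the combine, sort and de-duplicate step can be sketched in a few lines of Python. This is an illustration of the same logic rather than a replacement for CmbnDdupe.bat; it assumes the server-side files hold one IP address per line, and the function name and sample addresses are made up for the example.

```python
def combine_sort_dedupe(texts):
    """Merge the contents of several IP logs into one sorted, unique list."""
    lines = []
    for text in texts:
        # One address per line (an assumption); skip blank lines.
        lines.extend(line.strip() for line in text.splitlines() if line.strip())
    # A set keeps one copy of each address; sorted() orders them as strings,
    # much like the sort command does in the batch file.
    return sorted(set(lines))


# Sample contents standing in for filename.txt and filename2.txt
page1 = "203.0.113.7\n198.51.100.4\n203.0.113.7\n"
page2 = "198.51.100.4\n192.0.2.55\n\n"

print(combine_sort_dedupe([page1, page2]))
# ['192.0.2.55', '198.51.100.4', '203.0.113.7']
```

Note that, as in the batch file, the ordering is textual rather than numeric, which is enough for spotting duplicates.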
At this point I like to ping each IP address to collect whatever information is available about it. This is why I remove the duplicate entries, which are caused by a visitor viewing more than one of the collecting pages, or by a visitor returning to the pages. I don’t need to waste time and bandwidth pinging the same IP address more than once. If I want to see which IPs visited which pages multiple times, I can always just look at filename.txt, filename2.txt and filename3.txt. I’ve named the ping routine PingList.vbs, and put it in C:\Folder\Subfolder. Here it is:

    Option Explicit
    On Error Resume Next

    Dim srcFile
    srcFile = "IPlist.txt"
    PingList(srcFile)

    If Err <> 0 Then
        Wscript.echo "Error Type = " & Err.Description
    End If

    WScript.Quit

    '-----------------------------------------------------------------------------------------
    ' Pings every address listed in srcFile and appends the output to Log.txt.
    Function PingList(srcFile)
        Dim objFSO
        Dim objShell
        Dim strCommand
        Dim opnFile
        Dim strText
        Dim logFile

        Set objFSO = CreateObject("Scripting.FileSystemObject")
        Set objShell = Wscript.CreateObject("Wscript.Shell")
        logFile = "Log.txt"

        If objFSO.FileExists(srcFile) Then
            Set opnFile = objFSO.OpenTextFile(srcFile, 1)
            Do While Not opnFile.AtEndOfStream
                strText = opnFile.ReadLine
                If Trim(strText) <> "" Then
                    strCommand = strText
                    ' -a resolves the hostname, -n 1 sends a single echo request
                    objShell.run "%comspec% /c ping -a -n 1 " & strText & " >> " & logFile, , True
                End If
            Loop
            opnFile.Close
        Else
            WScript.Echo "File '" & srcFile & "' was not found."
        End If
    End Function

This script pings each IP address in IPlist.txt, resolves the hostname if possible and writes the results to Log.txt in the same directory. Go ahead and create a blank Log.txt now in C:\Folder\Subfolder.

The next thing you’ll want to do is clear the contents of the text files on the server, so that they will retain only the IP addresses from new visits. Create the following file in C:\Folder\Subfolder. I’ve named it Upload.cmd. It's critical that it has the .cmd file type.
    @echo off
    echo user YourUsername> ftpcmd.dat
    echo YourPassword>> ftpcmd.dat
    echo bin>> ftpcmd.dat
    echo cd /YourWebDirectory/>> ftpcmd.dat
    echo prompt>> ftpcmd.dat
    echo mput %1 %2 %3>> ftpcmd.dat
    echo rename filename_.txt filename.txt>> ftpcmd.dat
    echo rename filename2_.txt filename2.txt>> ftpcmd.dat
    echo rename filename3_.txt filename3.txt>> ftpcmd.dat
    echo quit>> ftpcmd.dat
    ftp -n -s:ftpcmd.dat ftp.YourWebsite.com
    del ftpcmd.dat
    exit

Substitute YourUsername and YourPassword with the username and password with which you access your website files, and YourWebDirectory with the location of your website files on the server. In C:\Folder\Subfolder, create the blank text file(s) that will overwrite the ones on the server. Give them a different name (for instance, add an underscore), as you’ll want to distinguish them from the files you downloaded at the beginning of this exercise. Hence, the blank filename_.txt will be copied to the server as filename.txt, overwriting the existing file. The number of per-cent-sign-plus-integer combinations (variables) needs to correspond with the number of files you upload and overwrite; in this case, three (%1 %2 %3 = filename_.txt, filename2_.txt and filename3_.txt). Substitute YourWebsite.com with the domain name of your website.

Before moving to the final step, create a blank text file named ErrorLog.txt in C:\Folder\Subfolder. This is where we’ll record any errors encountered during the execution of the combined routines.

Now to put it all together and automate it. Create the following batch file (I’ve named it DLPingUL.bat) and put it in C:\.
    @echo off
    title Download Ping Upload - Scheduled task, please wait
    cd "C:\Folder\Subfolder"
    start "" /wait CScript DloadCmbnDdupe.vbs 2>> ErrorLog.txt
    start "" /wait Upload.cmd filename_.txt filename2_.txt filename3_.txt 2>> ErrorLog.txt
    copy /d /y /a IPlist.txt NewIPlist.txt /a 2>> ErrorLog.txt
    start "" /wait CScript PingList.vbs 2>> ErrorLog.txt
    del IPlist.txt
    exit

The reason we’ve put this in C:\ is so that Windows Task Scheduler will have no problem with permissions when running it. This routine:

- moves to the directory in which you’ve stored all of the relevant files (C:\Folder\Subfolder, substitute with the actual location);
- executes the Visual Basic script that creates the list;
- executes the upload, passing the file names of the blank replacement files to the echo mput %1 %2 %3 command;
- copies IPlist.txt to NewIPlist.txt, overwriting the latter if it exists (this is so you have a list to which to refer if you want);
- executes the VB Script to ping and record the results;
- finally, deletes IPlist.txt, as it needs to be created programmatically each time.

Any errors are recorded in C:\Folder\Subfolder\ErrorLog.txt.

The final step is to create a new task in Task Scheduler that runs C:\DLPingUL.bat at a time of your choosing. I run it every Saturday at 3:00 am, so that when I wake up I have C:\Folder\Subfolder\Log.txt waiting for me with all of its pinged IP address information. Having the hostname can be especially helpful; it can show you which bots and spiders crawled your site, or from which corporation the visit originated.

The only manual task I do is any further research on those IP addresses, like running them through Whois or whatismyipaddress.com/ip-lookup. These sites help me to determine, for example, if an IP address is static or dynamic, or if it is associated with hackers or spammers, among other useful bits of information. When I’m finished, I clear the contents of Log.txt so that it is ready for next time.
If you’ve read all the way down to here then you’ve been very patient with me, and for that I thank you kindly.

Phillip Schubert is the founder of Schubert & Associates www.schubertassociates.com.au

November 5, 2013

by Phillip Schubert


Service Injection in Doctrine DBAL Type

When you think of a Doctrine 2 DBAL Type you think of an atomic thing, but how can you work programmatically on this type without defining an event? A DBAL Type doesn't allow access to the Symfony 2 service container; you must use a hack. But before this, let me explain the classic way (using events), why you should use this hack and why you shouldn't.

The classic way is defined in the Symfony 2 Cookbook: How to Register Event Listeners and Subscribers. Doctrine 2 events, unlike Symfony 2 events, aren't defined by the developer; the developer can only attach listeners to them. Why? Because Doctrine 2 isn't a framework that you can use for everything; persistence is its only job.

When should you use this hack? When your stored object isn't a 1:1 representation of the PHP object and its elaboration can be memoized or is really fast. I use this hack for browscaps: with the BrowscapBundle I can convert from a user agent string to a stdClass object (like the get_browser function). Our object is [...]

    [...]->container = $container;
        }

        public function prePersist(LifecycleEventArgs $args) {
            $this->doObjectToString($args);
        }

        public function postPersist(LifecycleEventArgs $args) {
            $this->doStringToObject($args);
        }

        public function preUpdate(LifecycleEventArgs $args) {
            $this->doObjectToString($args);
        }

        public function postUpdate(LifecycleEventArgs $args) {
            $this->doStringToObject($args);
        }

        public function postLoad(LifecycleEventArgs $args) {
            $this->doStringToObject($args);
        }

        // After loading or writing, turn the stored string back into an object
        private function doStringToObject($args) {
            $entity = $args->getEntity();
            if ($entity instanceof Agent && !is_object($entity->getHeader())) {
                $browscap = $this->container->get('browscap');
                $browser = $browscap->getBrowser($entity->getHeader());
                $entity->setHeader($browser);
            }
        }

        // Before writing, reduce the object to the string that gets stored
        private function doObjectToString($args) {
            $entity = $args->getEntity();
            if ($entity instanceof Agent && is_object($entity->getHeader())) {
                $user_agent = $entity->getHeader()->browser_name;
                $entity->setHeader($user_agent);
            }
        }
    }

With this code, every time you persist, update or load an Agent entity, its header is converted between string and object. The problem is that these callbacks will be invoked every time, and numerous events aren't recommended for your application. But with this hack I can write:

    services:
        acme.demo_bundle.event_listener.container_listener:
            arguments:
                - "@service_container"
            class: 'Acme\DemoBundle\EventListener\ContainerListener'
            tags:
                - { name: doctrine.event_listener, event: getContainer }

Doctrine ignores this event, but it exists and the listener ends up attached!

    [...]->container = $container;
        }

        public function getContainer() {
            return $this->container;
        }
    }

This listener seems useless, but it's the only way for this hack to work, because a Doctrine 2 DBAL Type doesn't allow direct access to the service container but does allow access to event listeners.

    [...]->getVarcharTypeDeclarationSQL($fieldDeclaration);
        }

        public function convertToPHPValue($value, AbstractPlatform $platform) {
            if (is_null($value)) {
                return null;
            }

            // Fetch the container through the listener attached to the dummy event
            $listeners = $platform->getEventManager()->getListeners('getContainer');
            $listener = array_shift($listeners);
            $container = $listener->getContainer();

            return $container->get('browscap')->getBrowser($value);
        }

        public function convertToDatabaseValue($value, AbstractPlatform $platform) {
            if ($value instanceof Browscap) {
                return $value->getBrowser()->browser_name;
            } elseif ($value instanceof stdClass) {
                return $value->browser_name;
            }

            return $value;
        }

        public function getName() {
            return 'browscap';
        }

        public function requiresSQLCommentHint(AbstractPlatform $platform) {
            return true;
        }
    }

I use this hack so that I only define the events related to application flow (fewer events is better).

Now that you know when you can use this, you must read why you shouldn't use it. Let me explain the reason with one simple example: imagine that one day PHP allows external hooks in native class constructors. How could you work without knowing what happens while you initialize a new stdClass? The same reasoning applies here: every time you extract a value from the database you want to extract it fast (hopefully you'll extract more than one record), but how can you be sure that extraction is fast if every attribute of a single record depends on external libraries and logic?

Quoting Ocramius, member of the Doctrine 2 development team:

    DBAL types are not designed for Dependency Injection. We explicitly avoided using DI for DBAL types because they have to stay simple. We’ve been asked many many times to change this behaviour, but doctrine believes that complex data manipulation should NOT happen within the very core of the persistence layer itself. That should be handled in your service layer.
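Stripped of the PHP details, the hack boils down to one object reaching a service through a listener that was registered for an event nobody ever fires. The following Python sketch shows only that shape; it is not Doctrine's or Symfony's API, and every class and function in it (EventManager, ContainerListener, Platform, BrowserType, fake_browscap) is a hypothetical stand-in.

```python
class EventManager:
    """Minimal registry of listeners keyed by event name."""

    def __init__(self):
        self._listeners = {}

    def add_listener(self, event, listener):
        self._listeners.setdefault(event, []).append(listener)

    def get_listeners(self, event):
        return list(self._listeners.get(event, []))


class ContainerListener:
    """Attached to an event that is never dispatched; it only carries the container."""

    def __init__(self, container):
        self._container = container

    def get_container(self):
        return self._container


class Platform:
    """Stand-in for the object the type converter receives on every call."""

    def __init__(self, event_manager):
        self.event_manager = event_manager


class BrowserType:
    """Stand-in for the custom type: it has no constructor injection at all."""

    def convert_to_value(self, raw, platform):
        if raw is None:
            return None
        # The only thing within reach is the platform, so go through its
        # event manager to find the listener that holds the container.
        listeners = platform.event_manager.get_listeners("getContainer")
        container = listeners[0].get_container()
        return container["browscap"](raw)


def fake_browscap(user_agent):
    # Hypothetical service: a real one would parse the user agent string.
    return {"browser_name": user_agent}


events = EventManager()
events.add_listener("getContainer", ContainerListener({"browscap": fake_browscap}))
platform = Platform(events)

print(BrowserType().convert_to_value("Mozilla/5.0", platform))
# {'browser_name': 'Mozilla/5.0'}
```

The sketch also makes the cost visible: every single value conversion now performs a listener lookup and a service call, which is exactly the concern raised in the quote above.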

November 2, 2013

by Emanuele Minotto


Securing Docker’s Remote API

One piece of Docker that is AMAZING is the Remote API, which can be used to programmatically interact with docker. I recently had a situation where I wanted to run many containers on a host, with a single container managing the other containers through the API. But the problem I soon discovered is that, at the moment, when you turn networking on it is an all-or-nothing type of thing… you can’t turn networking off selectively on a container-by-container basis. You can disable IPv4 forwarding, but you can still reach the docker remote API on the machine if you can guess its IP address.

One solution I came up with for this is to use nginx to expose the unix socket for docker over HTTPS and utilize client-side ssl certificates to only allow trusted containers to have access. I liked this setup a lot, so I thought I would share how it’s done. Disclaimer: assumes some knowledge of docker!

Generate The SSL Certificates

We’ll use openssl to generate and self-sign the certs. Since this is for an internal service we’ll just sign it ourselves. We also remove the password from the keys so that we aren’t prompted for it each time we start nginx.

    # Create the CA Key and Certificate for signing Client Certs
    openssl genrsa -des3 -out ca.key 4096
    openssl rsa -in ca.key -out ca.key # remove password!
    openssl req -new -x509 -days 365 -key ca.key -out ca.crt

    # Create the Server Key, CSR, and Certificate
    openssl genrsa -des3 -out server.key 1024
    openssl rsa -in server.key -out server.key # remove password!
    openssl req -new -key server.key -out server.csr

    # We're self signing our own server cert here. This is a no-no in production.
    openssl x509 -req -days 365 -in server.csr -CA ca.crt -CAkey ca.key -set_serial 01 -out server.crt

    # Create the Client Key and CSR
    openssl genrsa -des3 -out client.key 1024
    openssl rsa -in client.key -out client.key # no password!
    openssl req -new -key client.key -out client.csr

    # Sign the client certificate with our CA cert. Unlike signing our own server cert, this is what we want to do.
    openssl x509 -req -days 365 -in client.csr -CA ca.crt -CAkey ca.key -set_serial 01 -out client.crt

Another option may be to leave the passphrase in and provide it as an environment variable when running a docker container, or through some other means, as an extra layer of security. We’ll move ca.crt, server.key and server.crt to /etc/nginx/certs.

Setup Nginx

The nginx setup for this is pretty straightforward. We just listen for traffic on localhost on port 4242. We require client-side ssl certificate validation and reference the certificates we generated in the previous step. And most important of all, we set up an upstream proxy to the docker unix socket. I simply overwrote what was already in /etc/nginx/sites-enabled/default.

    upstream docker {
      server unix:/var/run/docker.sock fail_timeout=0;
    }

    server {
      listen 4242;
      server_name localhost;

      ssl on;
      ssl_certificate /etc/nginx/certs/server.crt;
      ssl_certificate_key /etc/nginx/certs/server.key;
      ssl_client_certificate /etc/nginx/certs/ca.crt;
      ssl_verify_client on;

      access_log on;
      error_log /dev/null;

      location / {
        proxy_pass http://docker;
        proxy_redirect off;

        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        client_max_body_size 10m;
        client_body_buffer_size 128k;

        proxy_connect_timeout 90;
        proxy_send_timeout 120;
        proxy_read_timeout 120;

        proxy_buffer_size 4k;
        proxy_buffers 4 32k;
        proxy_busy_buffers_size 64k;
        proxy_temp_file_write_size 64k;
      }
    }

One important piece to make this work is that you should add the user nginx runs as to the docker group so that it can read from the socket. This could be www-data, nginx, or something else!

Hack It Up!

With this setup in place and nginx restarted, let’s first run a few curl commands to make sure that everything is set up correctly. First we’ll make calls without the client cert to double check that we get denied access, then a proper one.

    # Is normal http traffic denied?
    curl -v http://localhost:4242/info

    # How about https, sans client cert and key?
    curl -v -s -k https://localhost:4242/info

    # And the final good request!
    curl -v -s -k --key client.key --cert client.crt https://localhost:4242/info

For the first two we should get some run-of-the-mill 400 http response codes before we get a proper JSON response from the final command! Woot!

But wait, there’s more… let’s build a container that can call the service to launch other containers! For this example we’ll simply build two containers: one that has the client certificate and key and one that doesn’t. The code for these examples is pretty straightforward, and to save space I’ll leave the untrusted container out. You can view the untrusted container on github (although it is nothing exciting).

First, the node.js application that will connect and display information:

    https = require 'https'
    fs = require 'fs'

    options =
      host: '172.42.1.62'
      port: 4242
      method: 'GET'
      path: '/containers/json'
      key: fs.readFileSync('ssl/client.key')
      cert: fs.readFileSync('ssl/client.crt')
      headers: { 'Accept': 'application/json' } # not required, but being semantic here!

    req = https.request options, (res) ->
      console.log res

    req.end()

And the Dockerfile used to build the container. Notice we add the client.crt and client.key as part of building it!

    FROM shykes/nodejs
    MAINTAINER James R. Carr

    ADD ssl/client* /srv/app/ssl
    ADD package.json /srv/app/package.json
    ADD app.coffee /srv/app/app.coffee

    RUN cd /srv/app && npm install .

    CMD cd /srv/app && npm start

That’s about it. Run docker build . and docker run -n <IMAGE ID> and we should see a json dump to the console of the actively running containers. Doing the same in the untrusted directory should present us with some 400 error about not providing a client ssl certificate.

I’ve shared a project with all this code plus a vagrant file on github for your own perusal. Enjoy!
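The same trusted request can be made from other languages as well. As a rough sketch, here is how the final curl call might look using only Python's standard library; the function names are invented for this example, the host and port are the ones used above, and certificate verification of the server is switched off to mirror curl's -k flag (loading ca.crt with load_verify_locations would be the stricter choice).

```python
import http.client
import json
import ssl


def client_context(cert_file, key_file):
    """Build a TLS context that presents a client certificate to nginx."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Mirror curl -k: do not verify the self-issued server certificate.
    # check_hostname has to be disabled before verify_mode can be relaxed.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def list_containers(host, port, ctx):
    """GET /containers/json through the nginx proxy and decode the reply."""
    conn = http.client.HTTPSConnection(host, port, context=ctx, timeout=10)
    try:
        conn.request("GET", "/containers/json",
                     headers={"Accept": "application/json"})
        resp = conn.getresponse()
        body = resp.read()
        if resp.status != 200:
            # Without a valid client certificate nginx answers with a 400
            raise RuntimeError(f"request rejected with HTTP {resp.status}")
        return json.loads(body)
    finally:
        conn.close()
```

A trusted container would then call something like list_containers("172.42.1.62", 4242, client_context("ssl/client.crt", "ssl/client.key")), while a container without the key pair cannot even build the context.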

October 31, 2013

by James Carr


How to Use MongoDB as a Pure In-memory DB (Redis Style)

The Idea

There has been a growing interest in using MongoDB as an in-memory database, meaning that the data is not stored on disk at all. This can be super useful for applications like:

- a write-heavy cache in front of a slower RDBMS system
- embedded systems
- PCI compliant systems where no data should be persisted
- unit testing where the database should be light and easily cleaned

That would be really neat indeed if it was possible: one could leverage the advanced querying / indexing capabilities of MongoDB without hitting the disk. As you probably know, disk IO (especially random) is the system bottleneck in 99% of cases, and if you are writing data you cannot avoid hitting the disk.

One sweet design choice of MongoDB is that it uses memory-mapped files to handle access to data files on disk. This means that MongoDB does not know the difference between RAM and disk; it just accesses bytes at offsets in giant arrays representing files, and the OS takes care of the rest! It is this design decision that allows MongoDB to run in RAM with no modification.

How it is done

This is all achieved by using a special type of filesystem called tmpfs. Linux will make it appear as a regular FS but it is entirely located in RAM (unless it is larger than RAM, in which case it can swap, which can be useful!). I have 32GB RAM on this server, so let’s create a 16GB tmpfs:

    # mkdir /ramdata
    # mount -t tmpfs -o size=16000M tmpfs /ramdata/
    # df
    Filesystem           1K-blocks      Used Available Use% Mounted on
    /dev/xvde1             5905712   4973924    871792  86% /
    none                  15344936         0  15344936   0% /dev/shm
    tmpfs                 16384000         0  16384000   0% /ramdata

Now let’s start MongoDB with the appropriate settings. smallfiles and noprealloc should be used to reduce the amount of RAM wasted, and will not affect performance since it’s all RAM based. nojournal should be used since it does not make sense to have a journal in this context!

    dbpath=/ramdata
    nojournal = true
    smallfiles = true
    noprealloc = true

After starting MongoDB, you will find that it works just fine and the files are as expected in the FS:

    # mongo
    MongoDB shell version: 2.3.2
    connecting to: test
    > db.test.insert({a:1})
    > db.test.find()
    { "_id" : ObjectId("51802115eafa5d80b5d2c145"), "a" : 1 }

    # ls -l /ramdata/
    total 65684
    -rw-------. 1 root root 16777216 Apr 30 15:52 local.0
    -rw-------. 1 root root 16777216 Apr 30 15:52 local.ns
    -rwxr-xr-x. 1 root root        5 Apr 30 15:52 mongod.lock
    -rw-------. 1 root root 16777216 Apr 30 15:52 test.0
    -rw-------. 1 root root 16777216 Apr 30 15:52 test.ns
    drwxr-xr-x. 2 root root       40 Apr 30 15:52 _tmp

Now let’s add some data and make sure it behaves properly. We will create a 1KB document and add 4 million of them:

    > str = ""
    > aaa = "aaaaaaaaaa"
    aaaaaaaaaa
    > for (var i = 0; i < 100; ++i) { str += aaa; }
    > for (var i = 0; i < 4000000; ++i) { db.foo.insert({a: Math.random(), s: str});}
    > db.foo.stats()
    {
        "ns" : "test.foo",
        "count" : 4000000,
        "size" : 4544000160,
        "avgObjSize" : 1136.00004,
        "storageSize" : 5030768544,
        "numExtents" : 26,
        "nindexes" : 1,
        "lastExtentSize" : 536600560,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 0,
        "totalIndexSize" : 129794000,
        "indexSizes" : {
            "_id_" : 129794000
        },
        "ok" : 1
    }

The average document size is 1136 bytes and the collection takes up about 5GB of storage. The index on _id takes about 130MB. Now we need to verify something very important: is the data duplicated in RAM, existing both within MongoDB and the filesystem? Remember that MongoDB does not buffer any data within its own process; instead, data is cached in the FS cache. Let’s drop the FS cache and see what is in RAM:

    # echo 3 > /proc/sys/vm/drop_caches
    # free
                 total       used       free     shared    buffers     cached
    Mem:      30689876    6292780   24397096          0       1044    5817368
    -/+ buffers/cache:     474368   30215508
    Swap:            0          0          0

As you can see there is 6.3GB of used RAM, of which 5.8GB is in the FS cache (the cached column). Why is there still 5.8GB of FS cache even after all caches were dropped? The reason is that Linux is smart and does not duplicate the pages between tmpfs and its cache… Bingo! That means your data exists as a single copy in RAM. Let’s access all documents and verify RAM usage is unchanged:

    > db.foo.find().itcount()
    4000000

    # free
                 total       used       free     shared    buffers     cached
    Mem:      30689876    6327988   24361888          0       1324    5818012
    -/+ buffers/cache:     508652   30181224
    Swap:            0          0          0

    # ls -l /ramdata/
    total 5808780
    -rw-------. 1 root root  16777216 Apr 30 15:52 local.0
    -rw-------. 1 root root  16777216 Apr 30 15:52 local.ns
    -rwxr-xr-x. 1 root root         5 Apr 30 15:52 mongod.lock
    -rw-------. 1 root root  16777216 Apr 30 16:00 test.0
    -rw-------. 1 root root  33554432 Apr 30 16:00 test.1
    -rw-------. 1 root root 536608768 Apr 30 16:02 test.10
    -rw-------. 1 root root 536608768 Apr 30 16:03 test.11
    -rw-------. 1 root root 536608768 Apr 30 16:03 test.12
    -rw-------. 1 root root 536608768 Apr 30 16:04 test.13
    -rw-------. 1 root root 536608768 Apr 30 16:04 test.14
    -rw-------. 1 root root  67108864 Apr 30 16:00 test.2
    -rw-------. 1 root root 134217728 Apr 30 16:00 test.3
    -rw-------. 1 root root 268435456 Apr 30 16:00 test.4
    -rw-------. 1 root root 536608768 Apr 30 16:01 test.5
    -rw-------. 1 root root 536608768 Apr 30 16:01 test.6
    -rw-------. 1 root root 536608768 Apr 30 16:04 test.7
    -rw-------. 1 root root 536608768 Apr 30 16:03 test.8
    -rw-------. 1 root root 536608768 Apr 30 16:02 test.9
    -rw-------. 1 root root  16777216 Apr 30 15:52 test.ns
    drwxr-xr-x. 2 root root        40 Apr 30 16:04 _tmp

    # df
    Filesystem           1K-blocks      Used Available Use% Mounted on
    /dev/xvde1             5905712   4973960    871756  86% /
    none                  15344936         0  15344936   0% /dev/shm
    tmpfs                 16384000   5808780  10575220  36% /ramdata

And that verifies it! :)

What about replication?

You probably want to use replication, since a server loses its RAM data upon reboot! Using a standard replica set you will get automatic failover and more read capacity. If a server is rebooted, MongoDB will automatically rebuild its data by pulling it from another server in the same replica set (resync). This should be fast enough even in cases with a lot of data and indices, since all operations are RAM only :)

It is important to remember that write operations get written to a special collection called the oplog, which resides in the local database and takes 5% of the volume by default. In my case the oplog would take 5% of 16GB, which is 800MB. When in doubt, it is safer to choose a fixed oplog size using the oplogSize option. If a secondary server is down for longer than the time window the oplog covers, it will have to be resynced. To set it to 1GB, use:

    oplogSize = 1000

What about sharding?

Now that you have all the querying capabilities of MongoDB, what if you want to implement a large service with it? Well, you can use sharding freely to implement a large, scalable in-memory store. Still, the config servers (which contain the chunk distribution) should be disk based, since their activity is small and rebuilding a cluster from scratch is not fun.

What to watch for

RAM is a scarce resource, and in this case you definitely want the entire data set to fit in RAM. Even though tmpfs can resort to swapping, the performance would drop dramatically. To make best use of the RAM you should consider:

- using the usePowerOf2Sizes option to normalize the storage buckets
- running a compact command or resyncing the node periodically
- using a schema design that is fairly normalized (avoid large document growth)

Conclusion

Sweet, you can now use MongoDB and all its features as an in-memory, RAM-only store! Its performance should be pretty impressive: during the test with a single thread / core I was achieving 20k writes per second, and it should scale linearly over the number of cores.
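Since everything here hinges on the data set fitting inside the tmpfs volume, it can help to redo the sizing arithmetic explicitly. The short Python sketch below only replays the figures quoted in the article (document count, average size, the 5% oplog share, the df output); the function names are invented for illustration and nothing in it talks to MongoDB.

```python
def raw_data_bytes(doc_count, avg_doc_bytes):
    """Approximate payload size: number of documents times average size."""
    return doc_count * avg_doc_bytes


def oplog_mib(volume_mib, percent=5):
    """Oplog share of the volume, using the 5% figure quoted above."""
    return volume_mib * percent // 100


def remaining_mib(volume_mib, used_kib):
    """Space left on the tmpfs volume, from df's 1K-block 'Used' column."""
    return volume_mib - used_kib // 1024


# Figures taken from the test run described above
print(raw_data_bytes(4_000_000, 1136))   # 4544000000 bytes, about 4.5GB
print(oplog_mib(16000))                  # 800
print(remaining_mib(16000, 5808780))     # 10328
```

The first figure matches the "size" field reported by db.foo.stats() to within a few hundred bytes, and the last one shows how much room is left on the 16000M volume once extents, namespace files and indexes are counted.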

October 28, 2013

by Antoine Girbal

· 61,147 Views

JMS-style selectors on Amazon SQS with Apache Camel

This blog post demonstrates how easy it is to use Apache Camel and its new json-path component along with the camel-sqs component to produce and consume messages on Amazon SQS. Amazon Web Services SQS is a message queuing “software as a service” (SaaS) in the cloud. To be able to use it, you need to sign up for AWS. Its primary access mechanism is XML over HTTP through various AWS SDK clients provided by Amazon. Please check out the SQS documentation for more. And as “luck” would have it, one of the users in the Apache Camel community created a component to integrate with SQS. This makes it trivial to add a producer or consumer to an SQS queue and plugs in nicely with the Camel DSL. SQS, however, is not a “one-size fits all” queueing service; you must be aware of your use case and make sure it fits (current requirements as well as somewhat into the future…). There are limitations that, if not studied and accounted for ahead of time, could come back to sink your project. An example of a viable alternative, and one that more closely fits the profile of a high performance and full featured message queue, is Apache ActiveMQ. For example, one limitation to keep in mind is that unlike traditional JMS consumers, you cannot create a subscription to a queue that filters messages based on some predicate (at least not using the AWS-SQS API; you’d have to build that into your solution). Some other things to keep in mind when using SQS: The queue does not preserve FIFO messaging That is, message order is not preserved. Messages can arrive in a different order from the one in which they were sent. Apache Camel can help with its resequencer pattern. Bilgin Ibryam, now a colleague of mine at Red Hat, has written a great blog post about how to restore message order using the resequencer pattern. Message size is limited to 256K This is probably sufficient, but if your message sizes are variable, or contain more data than 256K, you will have to split them and send them in smaller chunks. 
No selector or selective consumption If you’re familiar with JMS, you know that you can specify consumers to use a “selector” or a predicate expression that is evaluated on the broker side to determine whether or not a specific message should be dispatched to a specific consumer. For example, Durability constraints Some use cases call for the message broker to store messages until consumers return. SQS allows a limit of up to 14 days. This is most likely sufficient, but something to keep in mind. Binary payloads not allowed SQS only allows text-based messages, e.g., XML, JSON, fixed format text, etc. Binary formats such as Avro, Protocol Buffers, or Thrift are not allowed. You can work around some of these limitations by building out the functionality yourself. I would always recommend taking a look at how an integration library like Apache Camel, which has out-of-the-box support for doing some of these things, can help. Doing JMS-style selectors So the basic problem is that we want to subscribe to an SQS queue, but we want to filter which messages we process. Messages that we do not process should be left in the queue. To do this, we will make use of Apache Camel’s Filter EIP as well as the visibility timeouts available on the SQS queue. By default, SQS will dispatch all messages in its queue when it’s queried. We cannot change this, and thus cannot avoid the message being dispatched to us; we’ll have to do the filtering on our side (this is different from how a full-featured broker like ActiveMQ does it, i.e., filtering is done on the broker side so the consumer doesn’t even see the message it does not want to see). Once SQS dispatches a message, it does not remove it from the queue unless the consumer has acknowledged that it has it and is finished with it. The consumer does this by sending a DeleteMessage command. Until the DeleteMessage command is sent, the message is always in the queue; however, visibility comes into play here. 
When a message is dispatched to a consumer, there is a period of time during which it will not be visible to other consumers. So if you browsed the queue, you would not see it (it should appear in the stats as “in-flight”). However, you can configure how long this “visibility timeout” should be active. So if you set the visibility timeout to a lower time period (default is 30 seconds), you can more quickly get messages re-dispatched to consumers that would be able to handle the message. Take a look at the following Camel route which does just that: @Override public void configure() throws Exception { // every five seconds, send a message to the "demo" queue in SQS from("timer:kickoff?period=5000") .setBody().method(this, "generateJsonString") .to("aws-sqs://demo?amazonSQSClient=#sqsClient&defaultVisibilityTimeout=2"); } In the above Camel route, we create a new message every 5 seconds and send it to an SQS queue named demo; note that we set the defaultVisibilityTimeout to 2 seconds. This means that after a message gets dispatched to a consumer, SQS will wait about 2 seconds before considering it eligible to be dispatched to another consumer if it has not been deleted. On the consumer side, we take advantage of a couple of Apache Camel conveniences. Using JSON Path + Filter EIP Camel has an excellent new component named JSON-Path. Claus Ibsen tweeted about it when he hacked it up. This allows you to do Content-Based Routing on a JSON payload very easily by using XPath-style expressions to pick out and evaluate attributes in a JSON encoded object. 
So in the following example, we can test whether an attribute named ‘type’ is equal to ‘login’ and use Camel’s Filter EIP to allow only those messages that match to go through and continue processing: public class ConsumerRouteBuilder extends RouteBuilder { @Override public void configure() throws Exception { from("aws-sqs://demo?amazonSQSClient=#sqsClient&deleteIfFiltered=false") .setHeader("identity").jsonpath("$['type']") .filter(simple("${header.identity} == 'login'")) .log("We have a message! ${body}") .to("file:target/output?fileName=login-message-${date:now:MMddyy-HHmmss}.json"); } } To complete the functionality, we have to pay attention to a new configuration option added for the Camel-SQS component: deleteIfFiltered: Whether or not to send the DeleteMessage to the SQS queue if an exchange fails to get through a filter. If ‘false’ and the exchange does not make it through a Camel filter upstream in the route, then don’t send DeleteMessage. By default, Camel will send the “DeleteMessage” command to SQS after a route has completed successfully (without an exception). However, in this case, we are specifying that the DeleteMessage command should not be sent if the message was previously filtered out by Camel. This example demonstrates how easy it is to use Apache Camel and its new json-path component along with the camel-sqs component to produce and consume messages on Amazon SQS. Please take a look at the source code on my github repo to play with the live code and try it out yourself.
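The selective-consumption mechanics described in this post can also be illustrated without Camel or AWS at all. The sketch below is a toy in-memory queue, not the SQS API: the class name, the receive/delete/poll methods and the tick-based clock are all invented for illustration. It only shows the idea: a received message becomes invisible for a timeout, a consumer deletes what it accepts, and a filtered message reappears for other consumers once the timeout expires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model of a queue with visibility timeouts (hypothetical, not the SQS API).
final class ToyQueue {
    static final class Message {
        final String type;
        final String body;
        long invisibleUntil = 0; // tick until which the message is hidden
        Message(String type, String body) { this.type = type; this.body = body; }
    }

    private final List<Message> messages = new ArrayList<>();
    private final long visibilityTimeout;

    ToyQueue(long visibilityTimeout) { this.visibilityTimeout = visibilityTimeout; }

    void send(String type, String body) { messages.add(new Message(type, body)); }

    // Returns the first visible message and hides it for the timeout, or null.
    Message receive(long now) {
        for (Message m : messages) {
            if (m.invisibleUntil <= now) {
                m.invisibleUntil = now + visibilityTimeout;
                return m;
            }
        }
        return null;
    }

    // Equivalent of acknowledging a message: it leaves the queue for good.
    void delete(Message m) { messages.remove(m); }

    int size() { return messages.size(); }

    // Consumer side: delete only the messages that pass the filter and leave
    // the rest in the queue, where they become visible again after the timeout.
    static List<String> poll(ToyQueue queue, long now, Predicate<Message> filter) {
        List<String> accepted = new ArrayList<>();
        Message m;
        while ((m = queue.receive(now)) != null) {
            if (filter.test(m)) {
                accepted.add(m.body);
                queue.delete(m);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        ToyQueue queue = new ToyQueue(2); // 2-tick visibility timeout
        queue.send("login", "alice logged in");
        queue.send("logout", "bob logged out");
        System.out.println(poll(queue, 0, m -> m.type.equals("login")));
        System.out.println(poll(queue, 2, m -> m.type.equals("logout")));
    }
}
```

A shorter timeout makes rejected messages available to other consumers sooner, at the cost of more repeated deliveries to consumers that will reject them again.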

October 28, 2013

by Christian Posta

· 12,144 Views

Extracting File Metadata with C# and the .NET Framework

How to extract extended image metadata using C# and the Windows API Code Pack, simplifying access to detailed file properties typically seen in Windows Explorer.

October 26, 2013

by Rob Sanders

· 39,977 Views · 2 Likes

Examples of the Windows Azure Storage Services REST API

The examples in this post were updated in September to work with the current version of the Windows Azure Storage REST API. In the Windows Azure MSDN Azure Forum there are occasional questions about the Windows Azure Storage Services REST API. I have occasionally responded to these with some code examples showing how to use the API. I thought it would be useful to provide some examples of using the REST API for tables, blobs and queues, if only so I don’t have to dredge up examples when people ask how to use it. This post is not intended to provide a complete description of the REST API. The REST API is comprehensively documented (other than the lack of working examples). Since the REST API is the definitive way to address Windows Azure Storage Services, I think people using the higher level Storage Client API should have a passing understanding of the REST API, at least to the level of being able to understand the documentation. Understanding the REST API can provide a deeper understanding of why the Storage Client API behaves the way it does. Fiddler The Fiddler Web Debugging Proxy is an essential tool when developing using the REST (or Storage Client) API since it captures precisely what is sent over the wire to the Windows Azure Storage Services. Authorization Nearly every request to the Windows Azure Storage Services must be authenticated. The exception is access to blobs with public read access. The supported authentication schemes for blobs, queues and tables are described in the REST API documentation. The requests must be accompanied by an Authorization header constructed by computing a hash-based message authentication code (HMAC) using the SHA-256 hash. 
The following is an example of computing the HMAC-SHA256 signature for the Authorization header: public static String CreateAuthorizationHeader(String canonicalizedString) { String signature = String.Empty; using (HMACSHA256 hmacSha256 = new HMACSHA256( Convert.FromBase64String(storageAccountKey) )) { Byte[] dataToHmac = System.Text.Encoding.UTF8.GetBytes(canonicalizedString); signature = Convert.ToBase64String(hmacSha256.ComputeHash(dataToHmac)); } String authorizationHeader = String.Format( CultureInfo.InvariantCulture, "{0} {1}:{2}", AzureStorageConstants.SharedKeyAuthorizationScheme, AzureStorageConstants.Account, signature ); return authorizationHeader; } This method is used in all the examples in this post. AzureStorageConstants is a helper class containing various constants. storageAccountKey is the Base64-encoded secret key for the Windows Azure Storage Services account specified by Account. In the examples given here, SharedKeyAuthorizationScheme is SharedKey. The trickiest part in using the REST API successfully is getting the correct string to sign. Fortunately, in the event of an authentication failure the Blob Service and Queue Service respond with the authorization string they used, and this can be compared with the authorization string used in generating the Authorization header. This has greatly simplified the use of the REST API. Table Service API The Table Service API supports the following table-level operations: Create Table Delete Table Query Tables The Table Service API supports the following entity-level operations: Delete Entity Insert Entity Merge Entity Update Entity Query Entities These operations are implemented using the appropriate HTTP verb: DELETE – delete GET – query MERGE – merge POST – insert PUT – update This section provides examples of the Insert Entity and Query Entities operations. Insert Entity The InsertEntity() method listed in this section inserts an entity with two String properties, Artist and Title, into a table. 
The entity is submitted as an ATOM entry in the body of a request POSTed to the Table Service. In this example, the ATOM entry is generated by the GetRequestContentInsertXml() method. The date in the x-ms-date header must be in RFC 1123 format, and the same value is used in the string that is signed to create the Authorization header. Note that the storage service version is set to “2012-02-12”, which requires the DataServiceVersion and MaxDataServiceVersion to be set appropriately. public void InsertEntity(String tableName, String artist, String title) { String requestMethod = "POST"; String urlPath = tableName; String storageServiceVersion = "2012-02-12"; String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture); String contentMD5 = String.Empty; String contentType = "application/atom+xml"; String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, urlPath); String stringToSign = String.Format( "{0}\n{1}\n{2}\n{3}\n{4}", requestMethod, contentMD5, contentType, dateInRfc1123Format, canonicalizedResource); String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign); UTF8Encoding utf8Encoding = new UTF8Encoding(); Byte[] content = utf8Encoding.GetBytes(GetRequestContentInsertXml(artist, title)); Uri uri = new Uri(AzureStorageConstants.TableEndPoint + urlPath); HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri); request.Accept = "application/atom+xml,application/xml"; request.ContentLength = content.Length; request.ContentType = contentType; request.Method = requestMethod; request.Headers.Add("x-ms-date", dateInRfc1123Format); request.Headers.Add("x-ms-version", storageServiceVersion); request.Headers.Add("Authorization", authorizationHeader); request.Headers.Add("Accept-Charset", "UTF-8"); request.Headers.Add("DataServiceVersion", "2.0;NetFx"); request.Headers.Add("MaxDataServiceVersion", "2.0;NetFx"); using (Stream requestStream = request.GetRequestStream()) { requestStream.Write(content, 
0, content.Length); } using (HttpWebResponse response = (HttpWebResponse)request.GetResponse()) { Stream dataStream = response.GetResponseStream(); using (StreamReader reader = new StreamReader(dataStream)) { String responseFromServer = reader.ReadToEnd(); } } } private String GetRequestContentInsertXml(String artist, String title) { String defaultNameSpace = "http://www.w3.org/2005/Atom"; String dataservicesNameSpace = "http://schemas.microsoft.com/ado/2007/08/dataservices"; String metadataNameSpace = "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"; XmlWriterSettings xmlWriterSettings = new XmlWriterSettings(); xmlWriterSettings.OmitXmlDeclaration = false; xmlWriterSettings.Encoding = Encoding.UTF8; StringBuilder entry = new StringBuilder(); using (XmlWriter xmlWriter = XmlWriter.Create(entry)) { xmlWriter.WriteProcessingInstruction("xml", "version=\"1.0\" encoding=\"UTF-8\""); xmlWriter.WriteWhitespace("\n"); xmlWriter.WriteStartElement("entry", defaultNameSpace); xmlWriter.WriteAttributeString("xmlns", "d", null, dataservicesNameSpace); xmlWriter.WriteAttributeString("xmlns", "m", null, metadataNameSpace); xmlWriter.WriteElementString("title", null); xmlWriter.WriteElementString("updated", String.Format("{0:o}", DateTime.UtcNow)); xmlWriter.WriteStartElement("author"); xmlWriter.WriteElementString("name", null); xmlWriter.WriteEndElement(); xmlWriter.WriteElementString("id", null); xmlWriter.WriteStartElement("content"); xmlWriter.WriteAttributeString("type", "application/xml"); xmlWriter.WriteStartElement("properties", metadataNameSpace); xmlWriter.WriteElementString("PartitionKey", dataservicesNameSpace, artist); xmlWriter.WriteElementString("RowKey", dataservicesNameSpace, title); xmlWriter.WriteElementString("Artist", dataservicesNameSpace, artist); xmlWriter.WriteElementString("Title", dataservicesNameSpace, title + "\n" + title); xmlWriter.WriteEndElement(); xmlWriter.WriteEndElement(); xmlWriter.WriteEndElement(); xmlWriter.Close(); } 
String requestContent = entry.ToString(); return requestContent; } This generates the following request (as captured by Fiddler): POST https://STORAGE_ACCOUNT.table.core.windows.net/authors HTTP/1.1 Accept: application/atom+xml,application/xml Content-Type: application/atom+xml x-ms-date: Sun, 08 Sep 2013 06:31:12 GMT x-ms-version: 2012-02-12 Authorization: SharedKey STORAGE_ACCOUNT:w7Uu4wHZx4fFwa2bsxd/TJVZZ1AqMPwxvW+pYtoWHd0= Accept-Charset: UTF-8 DataServiceVersion: 2.0;NetFx MaxDataServiceVersion: 2.0;NetFx Host: STORAGE_ACCOUNT.table.core.windows.net Content-Length: 514 Expect: 100-continue Connection: Keep-Alive The body of the request is: 2013-09-08T07:19:07Z Beckett Molloy 2013-09-08T07:19:07.2189243Z Beckett Molloy Molloy Note that I should have URLEncoded the PartitionKey and RowKey but did not do so for simplicity. There are, in fact, some issues with the URL encoding of spaces and other symbols. Get Entity The GetEntity() method described in this section retrieves the single entity inserted in the previous section. The particular entity to be retrieved is identified directly in the URL. 
public void GetEntity(String tableName, String partitionKey, String rowKey) { String requestMethod = "GET"; String urlPath = String.Format("{0}(PartitionKey='{1}',RowKey='{2}')", tableName, partitionKey, rowKey); String storageServiceVersion = "2012-02-12"; String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture); String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, urlPath); String stringToSign = String.Format( "{0}\n\n\n{1}\n{2}", requestMethod, dateInRfc1123Format, canonicalizedResource); String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign); Uri uri = new Uri(AzureStorageConstants.TableEndPoint + urlPath); HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri); request.Method = requestMethod; request.Headers.Add("x-ms-date", dateInRfc1123Format); request.Headers.Add("x-ms-version", storageServiceVersion); request.Headers.Add("Authorization", authorizationHeader); request.Headers.Add("Accept-Charset", "UTF-8"); request.Accept = "application/atom+xml,application/xml"; request.Headers.Add("DataServiceVersion", "2.0;NetFx"); request.Headers.Add("MaxDataServiceVersion", "2.0;NetFx"); using (HttpWebResponse response = (HttpWebResponse)request.GetResponse()) { Stream dataStream = response.GetResponseStream(); using (StreamReader reader = new StreamReader(dataStream)) { String responseFromServer = reader.ReadToEnd(); } } } This generates the following request (as captured by Fiddler): GET https://STORAGE_ACCOUNT.table.core.windows.net/authors(PartitionKey='Beckett',RowKey='Molloy') HTTP/1.1 x-ms-date: Sun, 08 Sep 2013 06:31:14 GMT x-ms-version: 2012-02-12 Authorization: SharedKey STORAGE_ACCOUNT:1hWbr4aNq4JWCpNJY3rsLH1SkIyeFTJflbqyKMPQ1Gk= Accept-Charset: UTF-8 Accept: application/atom+xml,application/xml DataServiceVersion: 2.0;NetFx MaxDataServiceVersion: 2.0;NetFx Host: STORAGE_ACCOUNT.table.core.windows.net The Table Service generates the following response: HTTP/1.1 
200 OK Cache-Control: no-cache Content-Type: application/atom+xml;charset=utf-8 ETag: W/"datetime'2013-09-08T06%3A31%3A14.1579056Z'" Server: Windows-Azure-Table/1.0 Microsoft-HTTPAPI/2.0 x-ms-request-id: f4bd4c77-6fb6-42a8-8dff-81ea8d28fa2e x-ms-version: 2012-02-12 Date: Sun, 08 Sep 2013 06:31:15 GMT Content-Length: 1108 The returned entities, in this case a single entity, are returned in ATOM entry format in the response body: https://STORAGE_ACCOUNT.table.core.windows.net/authors(PartitionKey='Beckett',RowKey='Molloy') 2013-09-08T06:31:15Z Beckett Molloy 2013-09-08T06:31:14.1579056Z Beckett Molloy Molloy Blob Service API The Blob Service API supports the following account-level operation: List Containers The Blob Service API supports the following container-level operation: Create Container Delete Container Get Container ACL Get Container Properties Get Container Metadata List Blobs Set Container ACL Set Container Metadata The Blob Service API supports the following blob-level operation: Copy Blob Delete Blob Get Blob Get Blob Metadata Get Blob Properties Lease Blob Put Blob Set Blob Metadata Set Blob Properties Snapshot Blob The Blob Service API supports the following operations on block blobs: Get Block List Put Block Put Block List The Blob Service API supports the following operations on page blobs: Get Page Regions Put Page This section provides examples of the Put Blob and Lease Blob operations. Put Blob The Blob Service and Queue Service use a different form of shared-key authentication from the Table Service so care should be taken in creating the string to be signed for authorization. The blob type, BlockBlob or PageBlob, must be specified as a request header and consequently appears in the authorization string. 
public void PutBlob(String containerName, String blobName) { String requestMethod = "PUT"; String urlPath = String.Format("{0}/{1}", containerName, blobName); String storageServiceVersion = "2012-02-12"; String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture); String content = "Andrew Carnegie was born in Dunfermline"; UTF8Encoding utf8Encoding = new UTF8Encoding(); Byte[] blobContent = utf8Encoding.GetBytes(content); Int32 blobLength = blobContent.Length; const String blobType = "BlockBlob"; String canonicalizedHeaders = String.Format( "x-ms-blob-type:{0}\nx-ms-date:{1}\nx-ms-version:{2}", blobType, dateInRfc1123Format, storageServiceVersion); String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, urlPath); String stringToSign = String.Format( "{0}\n\n\n{1}\n\n\n\n\n\n\n\n\n{2}\n{3}", requestMethod, blobLength, canonicalizedHeaders, canonicalizedResource); String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign); Uri uri = new Uri(AzureStorageConstants.BlobEndPoint + urlPath); HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri); request.Method = requestMethod; request.Headers.Add("x-ms-blob-type", blobType); request.Headers.Add("x-ms-date", dateInRfc1123Format); request.Headers.Add("x-ms-version", storageServiceVersion); request.Headers.Add("Authorization", authorizationHeader); request.ContentLength = blobLength; using (Stream requestStream = request.GetRequestStream()) { requestStream.Write(blobContent, 0, blobLength); } using (HttpWebResponse response = (HttpWebResponse)request.GetResponse()) { String ETag = response.Headers["ETag"]; } } This generates the following request: PUT https://STORAGE_ACCOUNT.blob.core.windows.net/fife/dunfermline HTTP/1.1 x-ms-blob-type: BlockBlob x-ms-date: Sun, 08 Sep 2013 06:28:29 GMT x-ms-version: 2012-02-12 Authorization: SharedKey STORAGE_ACCOUNT:ntvh/lamVmikvwHhy6vRVBIh87kibkPlEOiHyLDia6g= Host: 
STORAGE_ACCOUNT.blob.core.windows.net Content-Length: 39 Expect: 100-continue Connection: Keep-Alive The body of the request is: Andrew Carnegie was born in Dunfermline The Blob Service generates the following response: HTTP/1.1 201 Created Transfer-Encoding: chunked Content-MD5: RYJnWGXLyt94l5jG82LjBw== Last-Modified: Sun, 08 Sep 2013 06:28:31 GMT ETag: "0x8D07A73C5704A86" Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0 x-ms-request-id: b74ef0a2-294d-4581-b8f1-6cda724bbdbf x-ms-version: 2012-02-12 Date: Sun, 08 Sep 2013 06:28:30 GMT Lease Blob The Blob Service allows a user to lease a blob for a minute at a time and so acquire a write lock on it. The use case for this is the locking of a page blob used to store the VHD backing a writeable Azure Drive. The LeaseBlob() example in this section demonstrates a subtle issue with the creation of authorization strings. The URL has a query string, comp=lease. Rather than using this directly in creating the authorization string, it must be moved onto its own line and converted into comp:lease, with a colon replacing the equals sign; see modifiedUrlPath in the example. Furthermore, the Lease Blob operation requires the use of an x-ms-lease-action header to indicate whether the lease is being acquired, renewed, released or broken. 
public void LeaseBlob(String containerName, String blobName) { String requestMethod = "PUT"; String urlPath = String.Format("{0}/{1}?comp=lease", containerName, blobName); String modifiedUrlPath = String.Format("{0}/{1}\ncomp:lease", containerName, blobName); const Int32 contentLength = 0; String storageServiceVersion = "2012-02-12"; String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture); String leaseAction = "acquire"; String leaseDuration = "60"; String canonicalizedHeaders = String.Format( "x-ms-date:{0}\nx-ms-lease-action:{1}\nx-ms-lease-duration:{2}\nx-ms-version:{3}", dateInRfc1123Format, leaseAction, leaseDuration, storageServiceVersion); String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, modifiedUrlPath); String stringToSign = String.Format( "{0}\n\n\n{1}\n\n\n\n\n\n\n\n\n{2}\n{3}", requestMethod, contentLength, canonicalizedHeaders, canonicalizedResource); String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign); Uri uri = new Uri(AzureStorageConstants.BlobEndPoint + urlPath); HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri); request.Method = requestMethod; request.Headers.Add("x-ms-date", dateInRfc1123Format); request.Headers.Add("x-ms-lease-action", leaseAction); request.Headers.Add("x-ms-lease-duration", leaseDuration); request.Headers.Add("x-ms-version", storageServiceVersion); request.Headers.Add("Authorization", authorizationHeader); request.ContentLength = contentLength; using (HttpWebResponse response = (HttpWebResponse)request.GetResponse()) { String leaseId = response.Headers["x-ms-lease-id"]; } } This generates the following request: PUT https://STORAGE_ACCOUNT.blob.core.windows.net/fife/dunfermline?comp=lease HTTP/1.1 x-ms-date: Sun, 08 Sep 2013 06:28:31 GMT x-ms-lease-action: acquire x-ms-lease-duration: 60 x-ms-version: 2012-02-12 Authorization: SharedKey rebus:+SQ5+RFZg3hUaws5XCRHxsDgXb1ycdRIz5EKyHJWP7s= Host: rebus.blob.core.windows.net 
Content-Length: 0 The Blob Service generates the following response: HTTP/1.1 201 Created Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0 x-ms-request-id: 4b6ff77f-f885-4f74-803a-c92920d225c3 x-ms-version: 2012-02-12 x-ms-lease-id: b1320c2c-65ad-41d6-a7bd-85a4242c0ac5 Date: Sun, 08 Sep 2013 06:28:31 GMT Content-Length: 0 Queue Service API The Queue Service API supports the following account-level operation: List Queues The Queue Service API supports the following queue-level operations: Create Queue Delete Queue Get Queue Metadata Set Queue Metadata The Queue Service API supports the following message-level operations: Clear Messages Delete Message Get Messages Peek Messages Put Message This section provides examples of the Put Message and Get Messages operations. Put Message The most obvious curiosity about Put Message is that it uses the HTTP verb POST rather than PUT. The issue is presumably the interaction of the English language and the HTTP standard, which states that PUT should be idempotent. The Put Message operation is clearly not idempotent, since each invocation adds another message to the queue. Regardless, it did catch me out when I failed to read the documentation well enough, so take that as a warning. The content of a message posted to the queue must be formatted in a specified XML schema and must then be UTF8 encoded. 
public void PutMessage(String queueName, String message) { String requestMethod = "POST"; String urlPath = String.Format("{0}/messages", queueName); String storageServiceVersion = "2012-02-12"; String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture); String messageText = String.Format( "<QueueMessage><MessageText>{0}</MessageText></QueueMessage>", message); UTF8Encoding utf8Encoding = new UTF8Encoding(); Byte[] messageContent = utf8Encoding.GetBytes(messageText); Int32 messageLength = messageContent.Length; String canonicalizedHeaders = String.Format( "x-ms-date:{0}\nx-ms-version:{1}", dateInRfc1123Format, storageServiceVersion); String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, urlPath); String stringToSign = String.Format( "{0}\n\n\n{1}\n\n\n\n\n\n\n\n\n{2}\n{3}", requestMethod, messageLength, canonicalizedHeaders, canonicalizedResource); String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign); Uri uri = new Uri(AzureStorageConstants.QueueEndPoint + urlPath); HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri); request.Method = requestMethod; request.Headers.Add("x-ms-date", dateInRfc1123Format); request.Headers.Add("x-ms-version", storageServiceVersion); request.Headers.Add("Authorization", authorizationHeader); request.ContentLength = messageLength; using (Stream requestStream = request.GetRequestStream()) { requestStream.Write(messageContent, 0, messageLength); } using (HttpWebResponse response = (HttpWebResponse)request.GetResponse()) { String requestId = response.Headers["x-ms-request-id"]; } } This generates the following request: POST https://rebus.queue.core.windows.net/revolution/messages HTTP/1.1 x-ms-date: Sun, 08 Sep 2013 06:34:08 GMT x-ms-version: 2012-02-12 Authorization: SharedKey rebus:nyASTVWifnxHKnj2wXwuzzzXz5CxUBZj58SToV5QFK8= Host: rebus.queue.core.windows.net Content-Length: 76 Expect: 100-continue Connection: Keep-Alive The body of the request is: <QueueMessage><MessageText>Saturday in the cafe</MessageText></QueueMessage> The Queue Service 
generates the following response: HTTP/1.1 201 Created Server: Windows-Azure-Queue/1.0 Microsoft-HTTPAPI/2.0 x-ms-request-id: 14c6e73b-15d9-480c-b251-c4c01b48e529 x-ms-version: 2012-02-12 Date: Sun, 08 Sep 2013 06:34:09 GMT Content-Length: 0 Get Messages The Get Messages operation described in this section retrieves a single message with the default message visibility timeout of 30 seconds. public void GetMessage(String queueName) { string requestMethod = "GET"; String urlPath = String.Format("{0}/messages", queueName); String storageServiceVersion = "2012-02-12"; String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture); String canonicalizedHeaders = String.Format( "x-ms-date:{0}\nx-ms-version:{1}", dateInRfc1123Format, storageServiceVersion); String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, urlPath); String stringToSign = String.Format( "{0}\n\n\n\n\n\n\n\n\n\n\n\n{1}\n{2}", requestMethod, canonicalizedHeaders, canonicalizedResource); String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign); Uri uri = new Uri(AzureStorageConstants.QueueEndPoint + urlPath); HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri); request.Method = requestMethod; request.Headers.Add("x-ms-date", dateInRfc1123Format); request.Headers.Add("x-ms-version", storageServiceVersion); request.Headers.Add("Authorization", authorizationHeader); request.Accept = "application/atom+xml,application/xml"; using (HttpWebResponse response = (HttpWebResponse)request.GetResponse()) { Stream dataStream = response.GetResponseStream(); using (StreamReader reader = new StreamReader(dataStream)) { String responseFromServer = reader.ReadToEnd(); } } } This generates the following request: GET https://rebus.queue.core.windows.net/revolution/messages HTTP/1.1 x-ms-date: Sun, 08 Sep 2013 06:34:11 GMT x-ms-version: 2012-02-12 Authorization: SharedKey rebus:K67XooYhokw0i0AlCzYQ4GeLLrJih1r1vSqiO9DBo0c= Accept: 
application/atom+xml,application/xml Host: rebus.queue.core.windows.net The Queue Service generates the following response: HTTP/1.1 200 OK Content-Type: application/xml Server: Windows-Azure-Queue/1.0 Microsoft-HTTPAPI/2.0 x-ms-request-id: efb21a86-7d66-47fd-b13d-7aa74fce0568 x-ms-version: 2012-02-12 Date: Sun, 08 Sep 2013 06:34:12 GMT Content-Length: 484 The message is returned in the response body as follows: 05fd902f-6031-4ef4-8298-ef3844ec3bc6Sun, 08 Sep 2013 06:34:11 GMTSun, 15 Sep 2013 06:34:11 GMT1AgAAAAMAAAAAAAAAAL+zgF2szgE=Sun, 08 Sep 2013 06:34:43 GMTSaturday in the cafe I noticed that some newline specifiers in strings (\n) were lost when the blog was auto-ported from Windows Live Spaces to WordPress. I have put them back in but it is possible I missed some. Consequently, in the event of a problem you should check the newlines in canonicalizedHeaders and stringToSign.
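For readers working outside .NET, the signing steps used throughout this post can be sketched in Java. This is a rough translation under stated assumptions, not code from the post: the class name, the method names, and the account name and key used in main are placeholders, and the string-to-sign layout simply mirrors the Table Service format and the comp:lease rewriting shown in the examples above.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper mirroring the C# CreateAuthorizationHeader() example.
final class SharedKeySigner {
    // HMAC-SHA256 over the UTF-8 string to sign, keyed with the Base64-decoded account key.
    static String sign(String base64Key, String stringToSign) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(Base64.getDecoder().decode(base64Key), "HmacSHA256"));
            byte[] digest = mac.doFinal(stringToSign.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC-SHA256 is not available", e);
        }
    }

    // Table Service layout from the post: verb, Content-MD5, Content-Type, date, resource.
    static String tableStringToSign(String verb, String contentMd5, String contentType,
                                    String date, String canonicalizedResource) {
        return String.join("\n", verb, contentMd5, contentType, date, canonicalizedResource);
    }

    // Blob/Queue resource with at most one "name=value" query parameter, as in LeaseBlob():
    // "comp=lease" in the URL becomes "\ncomp:lease" in the signed string.
    static String canonicalizedResource(String account, String path, String query) {
        String resource = "/" + account + "/" + path;
        if (query == null || query.isEmpty()) {
            return resource;
        }
        int eq = query.indexOf('='); // assumes a well-formed name=value pair
        return resource + "\n" + query.substring(0, eq) + ":" + query.substring(eq + 1);
    }

    static String authorizationHeader(String account, String signature) {
        return "SharedKey " + account + ":" + signature;
    }

    public static void main(String[] args) {
        String account = "myaccount"; // placeholder account name
        String key = Base64.getEncoder().encodeToString( // placeholder key, not a real one
                "not-a-real-key".getBytes(StandardCharsets.UTF_8));
        String toSign = tableStringToSign("GET", "", "", "Sun, 08 Sep 2013 06:31:14 GMT",
                canonicalizedResource(account,
                        "authors(PartitionKey='Beckett',RowKey='Molloy')", null));
        System.out.println(authorizationHeader(account, sign(key, toSign)));
    }
}
```

Building the string to sign with String.join keeps every newline explicit, which is exactly the part that the note above warns can go missing.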

October 24, 2013

by Neil Mackenzie

· 38,841 Views

PostgreSQL to SQLite: The Journey

This article will be useful if you want to support both PostgreSQL and SQLite using JDBC. It will be especially useful if you: Are already accessing values from your (PostgreSQL) database using the regular JDBC ResultSet interface, like: Date d = rs.getDate("date_field"); BigDecimal bd = rs.getBigDecimal("bigdecimal_field"); and doing the same for SQLite is causing trouble, but you don't want to change that code. Are already retrieving autogenerated keys in PostgreSQL with a RETURNING clause, but this won't work in SQLite, and you want a unified solution that works for both databases. Thought foreign keys were enforced in SQLite by default (as in PostgreSQL) and hit a wall: SQLite is allowing you to delete entries from your tables even when they are referenced in another table and you have explicitly told SQLite about it with a REFERENCES table_name(field_name) clause. Are having trouble with the differences between PostgreSQL and SQLite dialects (mostly concerning data types), for example, when making query filters with boolean values. Had your own way to manage exceptions for PostgreSQL and it is not working for SQLite (obviously), and you want SQLite to fit into the model you already have. Other stuff might appear if you keep up... A few months ago I wanted to migrate an app to use SQLite as a data backend. In fact, I wanted it to work with both PostgreSQL and SQLite interchangeably (but not at the same time). I wanted to switch between these two databases easily without changing any code. I did it, but along the way I had to solve some problems that might be interesting to many other people. Many solutions I found were spread across the web, but there was no single place that explained how to completely achieve what I wanted. So, the aim of this post is to condense what I learned into one article that may be of help to others as a (semi) complete guide. 
This guide might be useful not only to those creating their own frameworks, but also to anyone who doesn't use one and is willing to try some quirks and tricks to make their app work.

THE BEGINNING

There are many cross-database incompatibilities between PostgreSQL and SQLite, most notably in data types. If you want the same code to work for both databases, you had better use a framework that manages this for you. But here's the thing: the framework I use is one I created myself, and it didn't (completely) take these differences into account, since I mainly use PostgreSQL as my database; that's how and why my problems arose. My framework covers many things, but I focus here on the data access part. It uses a JDBC driver to connect to the databases, but it provides more abstract ways to do it. A basic DAO class for my framework would look like this:

    public class MyDAO extends BaseDAO {

        public MyDAO() {
            super("context_alias", new DefaultDataMappingStrategy() {
                @Override
                public Object createResultObject(ResultSet rs) throws SQLException {
                    MyModel model = (MyModel) ObjectsFactory.getObject("my_model_alias");
                    model.setStringField(rs.getString("string_field"));
                    model.setIntegerField(rs.getInt("integer_field"));
                    model.setBigDecimalField(rs.getBigDecimal("bigdecimal_field"));
                    model.setDateField(rs.getDate("date_field"));
                    model.setBooleanField(rs.getBoolean("boolean_field"));
                    return model;
                }
            });
        }

        @Override
        public String getTableName() {
            return "table_name";
        }

        @Override
        public String getKeyFields() {
            return "string_field|integer_field";
        }

        @Override
        protected Map getInsertionMap(Object obj) {
            Map map = new HashMap();
            MyModel model = (MyModel) obj;
            map.put("string_field", model.getStringField());
            map.put("integer_field", model.getIntegerField());
            map.put("bigdecimal_field", model.getBigDecimalField());
            map.put("date_field", model.getDateField());
            map.put("boolean_field", model.getBooleanField());
            return map;
        }

        @Override
        protected Map getUpdateMap(Object obj) {
            Map map = new HashMap();
            MyModel model = (MyModel) obj;
            map.put("bigdecimal_field", model.getBigDecimalField());
            map.put("date_field", model.getDateField());
            map.put("boolean_field", model.getBooleanField());
            return map;
        }

        @Override
        public String getFindAllStatement() {
            return "SELECT * FROM :@ ";
        }
    }

So, wanting to switch between databases without changing code meant wanting to switch without changing my DAO classes. For SQLite, I used the xerial-jdbc-sqlite driver. I mention drivers because some of the solutions are driver-specific; so when I say 'SQLite does it this way', I generally mean 'the xerial-jdbc-sqlite driver does it this way'. Now, let's start.

WARNING: Some of the solutions I give here fit into my framework, but might not directly fit into your code. It's up to you to work out how to adapt what I provide here.

DATA TYPES

Since there are some differences between PostgreSQL and SQLite regarding data types, and I wanted to continue to access database values through the regular ResultSet interface, I needed some mechanism to intercept the call to, for instance, resultset.getDate("date_field"). So I created a ResultSetWrapper class that would redefine the methods I was interested in, like this:

    public class ResultSetWrapper implements ResultSet {

        // The wrapped ResultSet
        ResultSet wrapped;

        /* I will use this DateFormat to format dates.
           I'm assuming an SQLite style pattern. I should not */
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd"); // MM = month; lowercase mm would mean minutes

        public ResultSetWrapper(ResultSet wrapped) {
            this.wrapped = wrapped;
        }

        /* Lots of ResultSet method implementations go here, but this is an example
           of redefining a method whose behavior I'm interested in changing: */
        public Date getDate(String columnLabel) throws SQLException {
            Object value = this.wrapped.getObject(columnLabel);
            return (Date) TypesInferreer.inferDate(value);
        }
    }

The getDate() method in ResultSetWrapper relies on TypesInferreer to convert the retrieved value to a Date. All data type conversions are encapsulated inside TypesInferreer, which has methods to convert from different data types as needed. For instance, it would have a method like this one:

    public static Object inferDate(Object value) {
        java.util.Date date = null;
        // Do conversions here (convert value and assign to date)
        return date;
    }

This method tries to convert any value to a Date (I'll show the actual implementation further on). Now, instead of using the original resultset returned by preparedStatement.executeQuery(), you use new ResultSetWrapper(preparedStatement.executeQuery()). That's what my framework does: it passes this new resultset to the DAO objects. Now let's see some type conversions.

Mixing PostgreSQL Date and SQLite Long/String

You could store Date values as text in a SQLite database (e.g. '2013-10-09'); you can do this manually when creating the database, but when SQLite stores a Date object, by default it converts it to a Long value. There is no problem with this when saving the value to the SQLite database, but if you try to retrieve it using resultset.getDate("date_field"), then things get messy; it simply won't work (you get a cast exception). How do you access Date values, then?
You create this method in TypesInferreer, which covers both the String and Long variations:

    public static Object inferDate(Object value) {
        java.util.Date date = null;
        if (value == null) return null;
        if (value instanceof String) {
            try {
                // SimpleDateFormat is not thread-safe, so create one per call
                SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
                date = df.parse((String) value);
            } catch (ParseException ex) {
                // Deal with ex
            }
        } else if (value instanceof Long) {
            date = new java.util.Date((Long) value);
        } else {
            date = (java.util.Date) value;
        }
        if (date == null) return null; // parsing failed
        return new Date(date.getTime());
    }

And as you saw, the getDate() function in ResultSetWrapper is redefined like this:

    @Override
    public Date getDate(String columnLabel) throws SQLException {
        Object value = this.wrapped.getObject(columnLabel);
        return (Date) TypesInferreer.inferDate(value);
    }

Now all DAOs can retrieve Date values from both databases in the same way, using resultset.getDate("date_field").

Mixing PostgreSQL Numeric and SQLite Integer/Double/...

My SQLite driver didn't implement the getBigDecimal() function. It complained like this when I called it: java.sql.SQLException: not implemented by SQLite JDBC driver. So I had to come up with a solution that was valid for both PostgreSQL and SQLite. This is what I did in ResultSetWrapper:

    @Override
    public BigDecimal getBigDecimal(String columnLabel) throws SQLException {
        Object value = this.wrapped.getObject(columnLabel);
        return (BigDecimal) TypesInferreer.inferBigDecimal(value);
    }

But value would have different types depending on the actual value stored in the database; it could be an Integer, or a Double, or perhaps something else. I solved all the cases by doing this in TypesInferreer:

    public static Object inferBigDecimal(Object value) {
        if (value == null) return null;
        if (!(value instanceof BigDecimal)) {
            return new BigDecimal(String.valueOf(value));
        }
        return value;
    }

Anyway, the String constructor of BigDecimal is the recommended one, so everything's fine with this. Now you can retrieve BigDecimal values using resultset.getBigDecimal("bigdecimal_field") from both databases.
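To illustrate why going through the String constructor matters, here is a minimal sketch. The class name BigDecimalFromString and the method viaString are hypothetical and not part of the author's framework; the method only mirrors the idea of inferBigDecimal, and the Double value stands in for whatever a driver might return for a floating-point column.

```java
import java.math.BigDecimal;

// Hypothetical demo class; not part of the author's framework.
final class BigDecimalFromString {

    // Same idea as inferBigDecimal: convert through the value's string form.
    static BigDecimal viaString(Object value) {
        if (value == null) {
            return null;
        }
        if (value instanceof BigDecimal) {
            return (BigDecimal) value;
        }
        return new BigDecimal(String.valueOf(value));
    }

    public static void main(String[] args) {
        Double fromDriver = 0.1; // assumed: a floating-point value handed back by a driver

        // Built from the decimal text of the double.
        System.out.println(viaString(fromDriver));

        // Built from the double itself: carries the exact binary expansion,
        // which is not the same number as the decimal text.
        System.out.println(new BigDecimal(fromDriver));

        // Integers go through the same path.
        System.out.println(viaString(Integer.valueOf(42)));
    }
}
```

The two printed values for 0.1 differ, which is the reason the conversion above goes through String.valueOf() rather than passing the double straight to the BigDecimal constructor.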
Mixing PostgreSQL Boolean and SQLite Integer

SQLite doesn't have boolean values. Instead, it interprets other values as boolean by following some rules. When SQLite saves a Boolean value to the database, it saves it as 0 or 1 for false or true respectively. Also, because drivers can interpret any value as boolean, you can use resultset.getBoolean("boolean_field") and it will work as expected by those rules. But the problem I faced was when creating filters. If a value for true is stored as 1 in the SQLite database, you can't expect the clause WHERE boolean_field = true to work. You will never find a match. Instead, you should say WHERE boolean_field = 1. In my app, I created filters like this:

    dao.addFilter(new FilterSimple("boolean_field", true));

Now I needed FilterSimple to infer that, for SQLite, I meant 1 instead of true. So I created what I called a DataSourceVariation. These are objects that are specific to each type of database and are used across all data accesses, by DAOs, Filters, and other objects. These objects take care of managing all my cross-database incompatibilities, including:

- The way to reference a database object: in PostgreSQL you must prepend the schema name to every database object you refer to in your queries. In SQLite you don't.
- The way to manage exceptions: explained further in this post.
- The way to back up and restore data: explained further in this post.
- Expressing BETWEEN clauses: explained further in this post.
- And also, inferring boolean values.

For VariationSQLite, I did this:

    @Override
    public Object getReplaceValue(Object value) {
        if (value instanceof Boolean) {
            return ((Boolean) value) ? Integer.valueOf(1) : Integer.valueOf(0);
        }
        return value;
    }

Now we can say dao.addFilter(new FilterSimple("boolean_field", true)) for both databases, assuming that FilterSimple uses the variation to adapt the value before constructing the clause.

RETRIEVING AUTOGENERATED KEYS

When you have auto-incrementing fields (e.g. Serial), in PostgreSQL you can specify a RETURNING clause at the end of an INSERT statement to automatically retrieve the values of autogenerated fields by doing this:

    PreparedStatement pstm = conn.prepareStatement(queryWithReturningClause); // e.g. INSERT INTO table_x ... RETURNING field_x
    ResultSet rs = pstm.executeQuery();
    if (rs.next()) {
        // Get autogenerated fields from rs
    }

But that won't work with SQLite. In SQLite, retrieving autogenerated fields involves a process that goes from creating the statement, to executing the update, to explicitly asking for the generated values. Like this:

    PreparedStatement pstm = conn.prepareStatement(queryWITHOUTreturningClause, Statement.RETURN_GENERATED_KEYS);
    pstm.executeUpdate();
    ResultSet rs = pstm.getGeneratedKeys();
    if (rs != null && rs.next()) {
        // Get autogenerated fields from rs
    }

The good news is that this code works for both PostgreSQL and SQLite, so I replaced my previous code with this, and didn't have to make any distinction between databases.

ENFORCING FOREIGN KEYS

You'd think that using a REFERENCES table_name(field_name) clause when creating a SQLite database table causes foreign keys to be checked when deleting, updating, etc. You're wrong! Foreign keys are not enforced in SQLite by default. You have to ask for it explicitly, and it's done when creating the connection (WARNING: this is very driver-specific):

    SQLiteConfig config = new SQLiteConfig();
    config.enforceForeignKeys(true);
    Connection conn = DriverManager.getConnection("jdbc:sqlite:" + dataSourcePath, config.toProperties());

For PostgreSQL it's different, so you had better have a connection pool for each type of database, and decide which one to use at runtime. My framework does exactly that.

NOTE: If you are able to obtain the connection depending on the database type, then you can enforce foreign keys transparently for both databases (for PostgreSQL it happens naturally without extra code).
For instance, you could have an abstract getConnection() method, and each database's connection pool would return the connection in its own way.

MANAGING EXCEPTIONS

I had defined several types of database exceptions in my framework: ExceptionDBDuplicateEntry, ExceptionDBEntryReferencedElsewhere, etc., which are thrown and propagated to the upper layers of my architecture. For PostgreSQL, these exceptions map directly to some constant codes (which normally are vendor/driver specific): UNIQUE_VIOLATION = "23505", FOREIGN_KEY_VIOLATION = "23503", etc. So, for PostgreSQL, I managed database exceptions something like this:

    @Override
    public void manageException(SQLException ex) throws ExceptionDBDuplicateEntry, ExceptionDBEntryReferencedElsewhere {
        if (ex.getSQLState() == null) {
            ex = (SQLException) ex.getCause();
        }
        if (ex.getSQLState().equals(UNIQUE_VIOLATION)) {
            throw new ExceptionDBDuplicateEntry();
        } else if (ex.getSQLState().equals(FOREIGN_KEY_VIOLATION)) {
            throw new ExceptionDBEntryReferencedElsewhere();
        } else {
            DAOPackage.log(ex);
            throw new ExceptionDBUnknownError(ex);
        }
    }

That won't work for SQLite, obviously! So, what I did was move the management of database exceptions to the DataSourceVariation. The VariationPostgresql class has a method similar to the one above. For VariationSQLite, I did sort of a hack, but it's something that has worked until now (maybe until I change my driver).

    @Override
    public void manageException(SQLException ex) throws ExceptionDBDuplicateEntry, ExceptionDBEntryReferencedElsewhere {
        // This is a hack (is it???)
        String message = ex.getMessage().toLowerCase();
        if (message.contains("sqlite_constraint")) {
            if (message.contains("is not unique")) {
                throw new ExceptionDBDuplicateEntry();
            } else if (message.contains("foreign key constraint failed")) {
                throw new ExceptionDBEntryReferencedElsewhere();
            } else {
                DAOPackage.log(ex);
                throw new ExceptionDBUnknownError(ex);
            }
        } else {
            DAOPackage.log(ex);
            throw new ExceptionDBUnknownError(ex);
        }
    }

Update: This technique might have some flaws. But hey, can you find a better approach right away?

FIXING BETWEEN CLAUSE

The problem with the BETWEEN clause appeared while using a filter like this:

    dao.addFilter(new FilterBetween("date_field", date1, date2)); // date1 and date2 are java.util.Date objects

FilterBetween would create a BETWEEN clause by formatting Dates as Strings, normally with the format 'yyyy-MM-dd' (although this should be configurable). Since dates in SQLite are long values, we can't create a clause like date_field BETWEEN '2013-01-01' AND '2013-02-01'. It had to be something like date_field >= 1357016400000 AND date_field <= 1359694800000. So, I moved the creation of BETWEEN clauses to... that's right, to DataSourceVariation. VariationSQLite does it like this:

    @Override
    public String getBetweenExpression(String fieldName, Object d1, Object d2) {
        String filter = "";
        try {
            Date dd1 = null;
            Date dd2 = null;
            SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd"); // Remember, this should be configurable
            if (d1 instanceof String) dd1 = df.parse((String) d1);
            else dd1 = (Date) d1;
            if (d2 instanceof String) dd2 = df.parse((String) d2);
            else dd2 = (Date) d2;
            filter = fieldName + " >= " + dd1.getTime() + " AND " + fieldName + " <= " + dd2.getTime();
        } catch (ParseException ex) {
            DAOPackage.log(ex);
            throw new ExceptionDBUnknownError(ex);
        }
        return filter;
    }

CONCLUSIONS

As you can see, there are many intricacies when making an app support multiple database types.
All I did here was support PostgreSQL and SQLite, and who knows what would be needed to support other databases as well. You can't expect JDBC alone to do all the work, so be prepared to solve some problems (and another problem, and another, ...) when making a database migration. And please, share your journey.
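As a recap of the variation idea used throughout this article, the following is a minimal sketch of how per-database objects can adapt boolean filter values and BETWEEN expressions. All names here (VariationSketch, PostgresVariationSketch, SqliteVariationSketch, VariationDemo) are hypothetical and are not the framework's real classes; the sketch also assumes that field names come from trusted code, since the expressions are built by string concatenation.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical names throughout: a sketch of the idea, not the author's framework code.
interface VariationSketch {
    Object replaceValue(Object value);
    String betweenExpression(String fieldName, Date from, Date to);
}

final class PostgresVariationSketch implements VariationSketch {
    @Override
    public Object replaceValue(Object value) {
        return value; // PostgreSQL understands boolean literals, so nothing to adapt
    }

    @Override
    public String betweenExpression(String fieldName, Date from, Date to) {
        // MM is the month; lowercase mm would mean minutes
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
        return fieldName + " BETWEEN '" + df.format(from) + "' AND '" + df.format(to) + "'";
    }
}

final class SqliteVariationSketch implements VariationSketch {
    @Override
    public Object replaceValue(Object value) {
        // Booleans are stored as 0/1, so filters must compare against integers
        if (value instanceof Boolean) {
            return ((Boolean) value) ? Integer.valueOf(1) : Integer.valueOf(0);
        }
        return value;
    }

    @Override
    public String betweenExpression(String fieldName, Date from, Date to) {
        // Dates are stored as epoch milliseconds, so compare numerically
        return fieldName + " >= " + from.getTime() + " AND " + fieldName + " <= " + to.getTime();
    }
}

final class VariationDemo {
    public static void main(String[] args) {
        Date from = new Date(0L);
        Date to = new Date(86_400_000L);
        VariationSketch[] variations = {new PostgresVariationSketch(), new SqliteVariationSketch()};
        for (VariationSketch v : variations) {
            String name = v.getClass().getSimpleName();
            System.out.println(name + ": boolean_field = " + v.replaceValue(true));
            System.out.println(name + ": " + v.betweenExpression("date_field", from, to));
        }
    }
}
```

The calling code (a filter, for instance) only talks to the interface, so switching databases means switching which implementation is handed to it.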

October 21, 2013

by Martín Proenza

· 12,705 Views

Reasons to Move from DataTables to Generic Collections

These days, no community member writes or speaks about using DataTables and DataSets for data operations. But there are a number of real projects built using them, and many developers are still happy to use them in their projects. Sometimes it is not easy to completely replace DataTables with typed generic lists, particularly in bulky projects. But now is the right time to move, as future developers may not even learn about DataTables :). Generic collections have a number of advantages over DataTables. Once you get to know how beneficial they are, you cannot imagine a day without them. The following is a list of the reasons to move from DataTables to collections that I could think of:

- DataTable stores boxed objects, and one needs to unbox values when needed. This adds overhead on the runtime environment. Values in generic collections, however, are strongly typed, so no boxing is involved.
- Unboxing happens at runtime, as does the type checking. If there is a mismatch between the source and target types, it leads to a runtime exception. This may lead to a number of issues while using DataTables. With collections, the types are checked at compile time, so such type mismatches are caught during compilation.
- .NET languages have very nice support for creating collections, like object initializers and collection initializers. We don't have such features for DataTables.
- LINQ queries can be used on both DataTables and collections. But the experience of writing queries on generic collections is better because of the IntelliSense support provided by Visual Studio.
- DataTables are framework specific; we often see issues with serializing and de-serializing them in web services. Generic collections are easier to serialize and de-serialize, so they can be easily used in any service and consumed from a client written in any language.
- ORMs are becoming increasingly popular, and they use generic collections for all data operations.
- Mocking DataTables in unit tests is a pain, as it involves creating the structure of the table wherever needed. A generic collection needs a class defined just once.

These are my opinions on preferring collections over DataTables. Any feedback is welcome. Happy coding!

October 21, 2013

by Rabi Kiran Srirangam

· 30,182 Views · 3 Likes

Database vs. Data Science

One thing that Big Data certainly made happen is that it brought the database/infrastructure community and the data analysis/statistics/machine learning communities closer together. As always, each community had its own set of models, methods, and ideas about how to structure and interpret the world. You can still tell these differences when looking at current Big Data projects, and I think it's important to be aware of the distinctions in order to better understand the relationships between different projects. Because, let's face it, every project claims to re-invent Big Data.

Hadoop and MapReduce being something like the founding fathers of Big Data, other projects have since appeared. Most notably, there are stream processing projects like Twitter's Storm, which move from batch-oriented processing to event-based processing that is better suited to real-time, low-latency work. Spark is yet something different, a bit like Hadoop, but it puts greater emphasis on iterative algorithms and in-memory processing to achieve that landmark "100x faster than Hadoop" every current project seems to need to sport. Twitter's summingbird project tries to bridge the gap between MapReduce and stream processing by providing a high-level set of operators which can then run on either MapReduce or Storm. However, both Spark and summingbird leave me sort of flat, because you can see that they come from a database background, which means that there will still be a considerable gap to serious machine learning.

So, what exactly is the difference? In the end, it's the difference between relational and linear algebra. In the database world, you model relationships between objects, which you encode in tables, with foreign keys to link up entries between different tables.
Probably the most important insight of the database world was to develop a query language, a declarative description of what you want to extract from your database, leaving the optimization of the query and the exact details of how to perform it efficiently to the database guys.

The machine learning community, on the other hand, has its roots in linear algebra and probability theory. Objects are usually encoded as a feature vector, that is, a list of numbers describing different properties of an object. Data is often collected in matrices where each row corresponds to an object and each column to a feature, not much unlike a table in a database. However, the operations you perform in order to do data analysis are quite different from those of the database world.

Take something as basic as linear regression: you try to learn a linear function $f(x) = \sum_{i=1}^{d} w_i x_i$ in a $d$-dimensional space (that is, where your objects are described by a $d$-dimensional vector) given $n$ examples $X_i$ and $Y_i$, where $X_i$ are the features describing your objects and $Y_i$ is the real number you attach to $X_i$. One way to "learn" $w$ is to tune it such that the quadratic error on the training examples is minimal. The solution can be written in closed form as $w = (XX^T)^{-1} X Y$, where $X$ is the matrix built from the $X_i$ (putting the $X_i$ in the columns of $X$), and $Y$ is the vector of outputs $Y_i$. In order to solve this, you need to solve the linear equation $(XX^T) w = X Y$, which can be done by one of a large number of algorithms, starting with Gaussian elimination, which you've probably learned in your undergrad studies, or the conjugate gradient algorithm, or by first computing a Cholesky decomposition. All of these algorithms have in common that they are iterative. They go through a number of operations, for example $O(d^3)$ for the Gaussian elimination case. They also need to store intermediate results.
Gaussian elimination and Cholesky decomposition have rather elementary operations acting on individual entries, while the conjugate gradient algorithm performs a matrix-vector multiplication in each iteration. Most importantly, these algorithms can only be expressed very badly in SQL! It's certainly not impossible, but you'd need to store your data in much different ways than you would in idiomatic database usage.

So, it's not about whether or not your framework can support iterative algorithms without significant latency; it's about understanding that joins, group bys, and count() won't get you far, and that you need scalar products, matrix-vector and matrix-matrix multiplications. You don't need indices for most ML algorithms, except maybe for being able to quickly find the k-nearest neighbors, because most algorithms tend to either take in the whole data set in each iteration or otherwise stream the whole set past some model which is iteratively updated, as in stochastic gradient descent. I'm not sure projects like Spark or Stratosphere have fully grasped the significance of this yet.

Database infrastructure-inspired Big Data has its place when it comes to extracting and preprocessing data, but eventually, you move from database land to machine learning land, which invariably means linear algebra land (or probability theory land, which often also reduces to linear algebra-like computations). What often happens today is that you either painstakingly have to break down your linear algebra into MapReduce jobs, or you actively look for algorithms which fit the database view better. I think we're still at the beginning of what is possible. Or, to be a bit more aggressive, claims that existing (infrastructure, database, parallelism inspired) frameworks provide you with sophisticated data analytics are greatly exaggerated.
They take care of a very important problem by giving you a reliable infrastructure to scale your data analysis code, but there's still a lot of work that needs to be done on your side. High-level DSLs like Apache Hive or Pig are a first step in this direction but are still too rooted in the database world, IMHO.

In summary, one should be aware of the difference between a framework which is mostly concerned with scaling and a tool which actually provides some piece of data analysis. And even if it comes with basic database-like analytics mechanisms, there is still a long way to go to do some serious data science. That's why we also think that streamdrill occupies an interesting spot: it is a bit of infrastructure, allowing you to process a serious amount of event data, but it also provides valuable analysis based on algorithms you wouldn't want to implement yourself, even if you had some Big Data framework like Hadoop at hand. That's an interesting direction I would also like to see more of in the future.

Note: Just saw that Spark has a logistic regression example on their landing page. Well, doing matrix operations explicitly via map() on collections doesn't count in my view ;)
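To make the linear regression example from this article concrete, here is a minimal sketch that builds the system (X X^T) w = X Y and solves it with Gaussian elimination. The class name NormalEquationsSketch and the tiny data set are hypothetical and chosen only for illustration; the code assumes, as in the article, that each column of X is one example. Note how the whole computation consists of scalar products and row operations, none of which map naturally onto joins or group-bys.

```java
// Hypothetical helper, written for illustration only.
final class NormalEquationsSketch {

    // x is d-by-n (each column is one example); y holds the n target values.
    static double[] fit(double[][] x, double[] y) {
        int d = x.length;
        int n = y.length;

        // Build the augmented matrix [X X^T | X Y].
        double[][] a = new double[d][d + 1];
        for (int i = 0; i < d; i++) {
            for (int j = 0; j < d; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) {
                    sum += x[i][k] * x[j][k];
                }
                a[i][j] = sum;
            }
            double rhs = 0.0;
            for (int k = 0; k < n; k++) {
                rhs += x[i][k] * y[k];
            }
            a[i][d] = rhs;
        }

        // Gaussian elimination with partial pivoting: O(d^3) operations.
        for (int col = 0; col < d; col++) {
            int pivot = col;
            for (int r = col + 1; r < d; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            if (Math.abs(a[col][col]) < 1e-12) {
                throw new IllegalArgumentException("X X^T is singular");
            }
            for (int r = col + 1; r < d; r++) {
                double factor = a[r][col] / a[col][col];
                for (int c = col; c <= d; c++) {
                    a[r][c] -= factor * a[col][c];
                }
            }
        }

        // Back substitution.
        double[] w = new double[d];
        for (int i = d - 1; i >= 0; i--) {
            double acc = a[i][d];
            for (int j = i + 1; j < d; j++) {
                acc -= a[i][j] * w[j];
            }
            w[i] = acc / a[i][i];
        }
        return w;
    }

    public static void main(String[] args) {
        // Four examples in two dimensions, generated from y = 2*x1 + 3*x2.
        double[][] x = {{1, 0, 1, 2}, {0, 1, 1, 1}};
        double[] y = {2, 3, 5, 7};
        double[] w = fit(x, y);
        System.out.println("w1 = " + w[0] + ", w2 = " + w[1]);
    }
}
```

For larger or ill-conditioned problems one would reach for a numerical library instead of hand-rolled elimination; the point here is only the shape of the computation.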

October 18, 2013

by Mikio Braun

· 11,414 Views · 1 Like

Generating SQL Railroad Diagrams

simple talk - How to get SQL Railroad Diagrams from MSDN BNF syntax notation.

On SQL Server Books-On-Line, in the Transact-SQL Reference (Database Engine), every SQL statement has its syntax represented in ‘Backus–Naur Form’ (BNF) notation. For a programmer in a hurry, this should be ideal because it is the only quick way to understand and appreciate all the permutations of the syntax. It is a great feature once you get your eye in. It isn’t the only way to get the information; you can, of course, reverse-engineer an understanding of the syntax from the examples, but your understanding won’t be complete, and you’ll have wasted time doing it. BNF is a good start in representing the syntax: Oracle and SQLite go one step further, and have proper railroad diagrams for their syntax, which is a far more accessible way of doing it.

There are three problems with the BNF on MSDN. Firstly, it isn’t a standard version of BNF, but an ancient fork from EBNF, inherited from Sybase. Secondly, it is excruciatingly difficult to understand, and thirdly, it has a number of syntactic and semantic errors. The page describing DML triggers, for example, currently has an absurd BNF error that makes it state that all statements in the body of the trigger must be separated by commas. There are a few other detail problems too. Here is the offending syntax for a DML trigger, pasted from MSDN. ...

I’ve been trying to create railroad diagrams for all the important SQL Server SQL statements, as good as you’d find for Oracle, and have so far published the CREATE TABLE and ALTER TABLE railroad diagrams based on the BNF. Although I’ve been aware of the errors, I never realised until recently how many there are. Then, Colin Daley created a translator for the SQL Server dialect of BNF which outputs the standard EBNF notation used by the W3C. The example MSDN BNF for the trigger would be rendered as … ...
Colin’s intention was to allow anyone to paste SQL Server’s BNF notation into his website-based parser, and from this generate classic railroad diagrams via Gunther Rademacher's Railroad Diagram Generator. Colin's application does this for you: you're not aware that you are moving to a different site. Because Colin's 'translator' is a parser, it will pick up syntax errors. Once you’ve fixed the syntax errors, you will get the syntax in the form of a human-readable railroad diagram and, in this form, the semantic mistakes become flamingly obvious. Gunther’s Railroad Diagram Generator is brilliant. To be able, after correcting the MSDN dialect of BNF, to generate a standard EBNF, and from thence to create railroad diagrams for SQL Server’s syntax that are as good as Oracle’s, is a great boon, and many thanks to Colin for the idea. Here is the result of the W3C EBNF from Colin’s application then being run through the Railroad Diagram Generator. Now that’s much better, you’ll agree. This is pretty easy to understand, and at this point any error is immediately obvious. This should be seriously useful, and it is to me.

However, there is that snag. The BNF is generally incorrect, and you can’t expect the average visitor to mess about with it. The answer is, of course, to correct the BNF on MSDN and maybe even add railroad diagrams for the syntax. Stop giggling! I agree it won’t happen. In the meantime, we need to collaboratively store and publish these corrected syntaxes ourselves as we do them. How? GitHub? SQL Server Central? Simple-Talk? What should those of us who use the system do with our corrected EBNF so that anyone can use them without hassle?

Grammar Translator

If you are familiar with the Grammar Translator, go ahead and create railroad diagrams from the Transact-SQL Reference. Otherwise, please see the FAQ. In particular, be sure to try the tutorial.

Welcome to Railroad Diagram Generator!
This is a tool for creating syntax diagrams, also known as railroad diagrams, from context-free grammars specified in EBNF. Syntax diagrams have been used for decades now, so the concept is well-known, and some tools for diagram generation are in existence. The features of this one are usage of the W3C's EBNF notation, web-scraping of grammars from W3C specifications, online editing of grammars, diagram presentation in SVG, and it was completely written in web languages (XQuery, XHTML, CSS, JavaScript). There's nothing like a diagram to help grok something (and the MSDN BNF SQL stuff really makes my brain hurt...)

October 18, 2013

by Greg Duncan

· 9,206 Views

Extracting File Metadata with C# and the .NET Framework

The Windows Explorer (shell) provides extended file property information which can be quite valuable. The challenge was how to extract this information, given that the .NET Framework has somewhat limited support for this type of extraction.

October 14, 2013

by Rob Sanders

· 64,297 Views

SSL Performance Overhead in MySQL

This post comes from Ernie Souhrada at the MySQL Performance Blog.

Note: This is part 1 of what will be a two-part series on the performance implications of using in-flight data encryption.

Some of you may recall my security webinar from back in mid-August; one of the follow-up questions that I was asked was about the performance impact of enabling SSL connections. My answer was 25%, based on some 2011 data that I had seen over on yaSSL’s website, but I included the caveat that it is workload-dependent, because the most expensive part of using SSL is establishing the connection. Not long thereafter, I received a request to conduct some more specific benchmarks surrounding SSL usage in MySQL, and today I’m going to show the results.

First, the testing environment. All tests were performed on an Intel Core i7-2600K 3.4GHz CPU (8 cores, HT included) with 32GB of RAM and CentOS 6.4. The disk subsystem is a 2-disk RAID-0 of Samsung 830 SSDs, although since we’re only concerned with measuring the overhead added by using SSL connections, we’ll only be conducting read-only tests with a dataset that fits completely in the buffer pool. The version of MySQL used for this experiment is Community Edition 5.6.13, and the testing tools are sysbench 0.5 and Perl.

We conduct two tests, each one designed to simulate one of the most common MySQL usage patterns. The first is connection pooling, often seen in the Java world, where some small set of connections is established by, for example, the servlet container and then just passed around to the application as needed. The second is one-request-per-connection, typical in the LAMP world, where the script that displays a given page might connect to the database, run a couple of queries, and then disconnect.

Test 1: Connection Pool

For the first test, I ran sysbench in read-only mode at concurrency levels of 1, 2, 4, 8, 16, and 32 threads, first with no encryption and then with SSL enabled and key lengths of 1024, 2048, and 4096 bits.
8 sysbench tables were prepared, each containing 100,000 rows, resulting in a total data size of approximately 256MB. The size of my InnoDB buffer pool was 4GB, and before conducting each official measurement run, I ran a warm-up run to prime the buffer pool. Each official test run lasted 10 minutes; this might seem short, but unlike, say, a PCIe flash storage device, I would not expect the variable under observation to really change that much over time or need time to stabilize. The basic sysbench syntax used is shown below.

    #!/bin/bash
    for ssl in on off ; do
      for threads in 1 2 4 8 16 32 ; do
        sysbench --test=/usr/share/sysbench/oltp.lua --mysql-user=msandbox$ssl --mysql-password=msandbox \
          --mysql-host=127.0.0.1 --mysql-port=5613 --mysql-db=sbtest --mysql-ssl=$ssl \
          --oltp-tables-count=8 --num-threads=$threads --oltp-dist-type=uniform --oltp-read-only=on \
          --report-interval=10 --max-time=600 --max-requests=0 run > sb-ssl_${ssl}-threads-${threads}.out
      done
    done

If you’re not familiar with sysbench, the important thing to know about it for our purposes is that it does not connect and disconnect after each query or after each transaction. It establishes N connections to the database (where N is the number of threads) and runs queries through them until the test is over. This behavior provides our connection-pool simulation. The assumption, given what we know about where SSL is the slowest, is that the performance penalty here should be the lowest.
First, let’s look at raw throughput, measured in queries per second. The average throughput and standard deviation (both in queries per second) for each test configuration are shown below in tabular format, with one column per number of threads:

    SSL key size  1                   2                   4                   8                   16                  32
    SSL off       9250.18 (1005.82)   18297.61 (689.22)   33910.31 (446.02)   50077.60 (1525.37)  49844.49 (934.86)   49651.09 (498.68)
    1024-bit      2406.53 (288.53)    4650.56 (558.58)    9183.33 (1565.41)   26007.11 (345.79)   25959.61 (343.55)   25913.69 (192.90)
    2048-bit      2448.43 (290.02)    4641.61 (510.91)    8951.67 (1043.99)   26143.25 (360.84)   25872.10 (324.48)   25764.48 (370.33)
    4096-bit      2427.95 (289.00)    4641.32 (547.57)    8991.37 (1005.89)   26058.09 (432.86)   25990.13 (439.53)   26041.27 (780.71)

So, given that this is an 8-core machine and IO isn’t a factor, we would expect throughput to max out at 8 threads, so the levelling-off of performance is expected. What we also see is that it doesn’t seem to make much difference what key length is used, which is also largely expected. However, I definitely didn’t think the encryption overhead would be so high.

Next is 95th-percentile latency from the same test. In tabular format, the raw numbers (average and standard deviation, in milliseconds), again with one column per number of threads:

    SSL key size  1               2               4               8               16              32
    SSL off       1.882 (0.522)   1.728 (0.167)   1.764 (0.145)   2.459 (0.523)   6.616 (0.251)   27.307 (0.817)
    1024-bit      6.151 (0.241)   6.442 (0.180)   6.677 (0.289)   4.535 (0.507)   11.481 (1.403)  37.152 (0.393)
    2048-bit      6.083 (0.277)   6.510 (0.081)   6.693 (0.043)   4.498 (0.503)   11.222 (1.502)  37.387 (0.393)
    4096-bit      6.120 (0.268)   6.454 (0.119)   6.690 (0.043)   4.571 (0.727)   11.194 (1.395)  37.26 (0.307)

With the exception of 8 and 32 threads, the latency introduced by the use of SSL is constant at right around 5ms, regardless of the key length or the number of threads. I’m not surprised that there’s a large jump in latency at 32 threads, but I don’t have an immediate explanation for the improvement in the SSL latency numbers at 8 threads.
Test 2: Connection Time

For the second test, I wrote a simple Perl script to just connect and disconnect from the database as fast as possible. We know that it’s the connection setup which is the slowest part of SSL, and the previous test already shows us roughly what we can expect for SSL encryption overhead for sending data once the connection has been established, so let’s see just how much overhead SSL adds to connection time. The basic script to do this is quite simple (non-SSL version shown):

    #!/usr/bin/perl
    use DBI;
    use Time::HiRes qw(time);

    $start = time;
    for (my $i=0; $i<100; $i++) {
      my $dbh = DBI->connect("dbi:mysql:host=127.0.0.1;port=5613", "msandbox", "msandbox", undef);
      $dbh->disconnect;
      undef $dbh;
    }
    printf "%.6f\n", time - $start;

As with test #1, I ran test #2 with no encryption and SSL encryption of 1024, 2048, and 4096 bits, and I conducted 10 trials of each configuration. Then I took the elapsed time for each test and converted it to connections per second. Here are the averages and standard deviations across the runs:

    Encryption   Average connections per second   Standard deviation
    None         2701.75                          165.54
    1024-bit     77.04                            6.14
    2048-bit     28.183                           1.713
    4096-bit     5.45                             0.015

Yes, that’s right: 4096-bit SSL connections are nearly 3 orders of magnitude (roughly 500 times) slower to establish than unencrypted connections. Really, the connection overhead for any level of SSL usage is quite high when compared to the unencrypted test, and it’s certainly much higher than my original quoted number of 25%.

Analysis and Parting Thoughts

So, what do we take away from this? The first thing, of course, is that SSL overhead is a lot higher than 25%, particularly if your application uses anything close to the one-connection-per-request pattern.
For a system which establishes and maintains long-running connections, the initial connection overhead becomes a non-factor, regardless of the encryption strength, but there’s still a rather large performance penalty compared to the unencrypted connection. This leads directly into the second point, which is that connection pooling is by far a more efficient method of using SSL if your application can support it.

But what if connection pooling isn’t an option, MySQL’s SSL performance is insufficient, and you still need full encryption of data in-flight? Run the encryption component of your system at a lower layer. A VPN with hardware crypto would be the fastest approach, but even something as simple as an SSH tunnel or OpenVPN *might* be faster than SSL within MySQL. I’ll be exploring some of these solutions in a follow-up post.

And finally: when in doubt, run your own benchmarks. I don’t have an explanation for why the yaSSL numbers are so different from these (maybe yaSSL is a faster SSL library than OpenSSL, or maybe they used a different cipher; if you’re curious, the original 25% number came from slides 56-58 of this presentation), but in any event, this does illustrate why it’s important to run tests on your own hardware and with your own workload when you’re interested in finding out how well something will perform rather than taking someone else’s word for it.
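To put the two result tables side by side, here is a small Python sketch that turns the reported averages into relative overheads. The numbers are copied from the tables above; the dictionary layout and function names are just illustrative choices, not part of the original benchmark tooling.

```python
# Averages copied from the benchmark tables (standard deviations omitted).
qps = {  # queries per second, keyed by SSL setting and thread count
    "off":  {1: 9250.18, 8: 50077.60, 32: 49651.09},
    "1024": {1: 2406.53, 8: 26007.11, 32: 25913.69},
}
conn_per_sec = {"none": 2701.75, "1024": 77.04, "2048": 28.183, "4096": 5.45}


def throughput_loss(threads, key="1024"):
    """Fraction of unencrypted throughput lost when SSL is enabled."""
    return 1 - qps[key][threads] / qps["off"][threads]


def slowdown(key):
    """How many times fewer connections per second than with no encryption."""
    return conn_per_sec["none"] / conn_per_sec[key]


for t in (1, 8, 32):
    print(f"{t:>2} threads: {throughput_loss(t):.0%} lower throughput with 1024-bit SSL")
for k in ("1024", "2048", "4096"):
    print(f"{k}-bit: {slowdown(k):.0f}x fewer connections per second")
```

With these figures the pooled workload loses roughly three quarters of its throughput at one thread and close to half at 8 and 32 threads, while connection setup is about 35, 96 and 496 times slower for 1024-, 2048- and 4096-bit keys respectively.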

October 11, 2013

by Peter Zaitsev

· 6,837 Views

Large Dataset Retrieval in Mule

Recently, a customer asked how to perform large dataset retrieval in Mule. The documentation page briefly explains how this may be achieved; however, as far as I can tell, there is no working example of how to do it. This blog post aims to explain in detail how large dataset retrieval works in Mule by giving an example.

The customer wanted to transfer items from one database to another by performing a batch select and then a batch insert. The ‘batch insert’ part is pretty straightforward and is done automatically by Mule when the payload is of type List. The batch select, however, is handled differently. In order to retrieve all the records, we will use the Batch Manager to compute the ID ranges for the next batch of records to be retrieved. This is provided out of the box with Mule EE.

We start by defining the database which will be used throughout the example to retrieve and insert records. For simplicity’s sake we are going to use the Derby in-memory database. NOTE: the records should be identified by a key which is unique and in a sequential numeric order.

    CREATE TABLE table1(KEY1 INTEGER GENERATED BY DEFAULT AS IDENTITY(START WITH 1) NOT NULL PRIMARY KEY, KEY2 VARCHAR(255));
    CREATE TABLE table2(KEY1 VARCHAR(255), KEY2 VARCHAR(255));
    INSERT INTO table1(KEY2) VALUES ('TEST1');
    INSERT INTO table1(KEY2) VALUES ('TEST2');
    INSERT INTO table1(KEY2) VALUES ('TEST3');
    INSERT INTO table1(KEY2) VALUES ('TEST4');
    INSERT INTO table1(KEY2) VALUES ('TEST5');
    INSERT INTO table1(KEY2) VALUES ('TEST6');
    INSERT INTO table1(KEY2) VALUES ('TEST7');
    INSERT INTO table1(KEY2) VALUES ('TEST8');
    INSERT INTO table1(KEY2) VALUES ('TEST9');
    INSERT INTO table1(KEY2) VALUES ('TEST10');

As explained before, the select query is based on the ID ranges that are computed by the Batch Manager when nextBatch() is called. This returns a map with the lower and upper IDs to be selected. In our case, we store this map in a flow variable named ‘boundaries’.
After configuring the database and the JDBC connector, we need to configure the Batch Manager. This consists of specifying the idStore (a text file), which the Batch Manager uses to store the starting point for the next batch. We also need to configure the batch size and the starting point on the Batch Manager. In the documentation, you will find a reference to the noArgsWrapper. Its job is to invoke the nextBatch() method on the Batch Manager. However, we find this confusing and misleading, so instead we use a simple MEL expression which calls nextBatch() directly.

Now we have to configure the main flow where we perform the batch select. Given that the records are retrieved in batches, the flow has to be called multiple times until all of the records are retrieved. To solve this, we created a composite source so that at the end of the flow, if we haven’t retrieved all the records, we re-trigger the same flow using the VM queue.

Once the current batch is finished, we need to call completeBatch() to tell the Batch Manager that we’re done with the current batch and ready to process the next. If this is not done, the Batch Manager will still consider the previous batch as ‘processing’. Furthermore, we have to check whether we have retrieved all of the records so we can stop processing. We do this by checking the size of the payload that is returned from the JDBC outbound endpoint. If the payload size is 0 (no more records to be retrieved), we have to call the completeBatch() method with -1, telling the Batch Manager that the whole run is complete. We must also set the starting point for the next batch to 0. This is required so that when the flow is triggered again from the HTTP inbound endpoint, the flow will start processing from the first record. If the run is not complete, we call the completeBatch() method (from the BatchManager class) with the current upperId.
This sets the new starting point for the next batch to be processed. Finally, we end the flow with a VM outbound endpoint on ‘batch’, which triggers the main flow to process the next batch of records.

When no more records are returned:

    app.registry.seqBatchManager.completeBatch(-1);
    app.registry.seqBatchManager.setStartingPointForNextBatch(0);

Otherwise:

    app.registry.seqBatchManager.completeBatch(flowVars.boundaries.upperId);

A complete Mule configuration of the main flow is shown below.
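Separately from the Mule XML, the control loop described in this post can be sketched in plain Python. This is only an illustration of the ID-range batching idea: `SequentialBatchManager` and `copy_all` are hypothetical stand-ins written for this sketch, not Mule classes, and the in-memory dictionary plays the role of table1.

```python
class SequentialBatchManager:
    """Hypothetical stand-in for the Batch Manager: hands out ID ranges."""

    def __init__(self, batch_size, starting_point=0):
        self.batch_size = batch_size
        self.starting_point = starting_point  # last ID already processed

    def next_batch(self):
        lower = self.starting_point + 1
        return {"lowerId": lower, "upperId": lower + self.batch_size - 1}

    def complete_batch(self, upper_id):
        # -1 marks the whole run as finished; otherwise advance the start.
        if upper_id != -1:
            self.starting_point = upper_id

    def set_starting_point_for_next_batch(self, value):
        self.starting_point = value


def copy_all(source_rows, manager):
    """Copy rows batch by batch; source_rows maps sequential ID -> value."""
    target = []
    while True:
        bounds = manager.next_batch()
        batch = [(i, source_rows[i])
                 for i in range(bounds["lowerId"], bounds["upperId"] + 1)
                 if i in source_rows]
        if not batch:
            # Nothing left: close the run and rewind for the next trigger.
            manager.complete_batch(-1)
            manager.set_starting_point_for_next_batch(0)
            return target
        target.extend(batch)
        manager.complete_batch(bounds["upperId"])


rows = {i: f"TEST{i}" for i in range(1, 11)}  # same ten rows as table1
manager = SequentialBatchManager(batch_size=3)
copied = copy_all(rows, manager)
print(len(copied), manager.starting_point)
```

Because the loop stops at the first empty range, it relies on the same requirement stated in the post: keys must be unique and sequential, otherwise a gap wider than the batch size would end the run early.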

October 2, 2013

by Clare Cini

· 10,467 Views

Getting Started with NHibernate and ASP.NET MVC- CRUD Operations

In this post we are going to learn how to use NHibernate in an ASP.NET MVC application.

What is NHibernate: ORMs (Object Relational Mappers) are quite popular these days. An ORM is a mechanism for mapping database entities to class objects without writing code to fetch data or hand-writing SQL queries. It generates the SQL queries for us and fetches the data on our behalf. NHibernate is an Object Relational Mapper that is a port of the popular Java ORM Hibernate. It provides a framework for mapping domain model classes to a traditional relational database. It frees us from writing repetitive ADO.NET code, as it acts as our database layer. Let’s get started with NHibernate.

How to download: There are two ways to download this ORM: as a NuGet package or from the SourceForge site.
NuGet - http://www.nuget.org/packages/NHibernate/
SourceForge - http://sourceforge.net/projects/nhibernate/

Creating a table for CRUD: I am going to use SQL Server 2012 Express edition as the database. The table has four fields: Id, First Name, Last Name, and Designation.

Creating ASP.NET MVC project for NHibernate: Let’s create an ASP.NET MVC project for NHibernate via File -> New Project -> ASP.NET MVC 4 Web Application.

Installing NuGet package for NHibernate: I installed the NuGet package from the Package Manager Console.

NHibernate configuration file: NHibernate needs a configuration file that sets the database connection and other details. You need to create a file named ‘hibernate.cfg.xml’ in the Models\Nhibernate folder of your application with the following details.

    <?xml version="1.0" encoding="utf-8" ?>
    <hibernate-configuration xmlns="urn:nhibernate-configuration-2.2">
      <session-factory>
        <property name="connection.provider">NHibernate.Connection.DriverConnectionProvider</property>
        <property name="connection.driver_class">NHibernate.Driver.SqlClientDriver</property>
        <property name="connection.connection_string">Server=(local);database=LocalDatabase;Integrated Security=SSPI;</property>
        <property name="dialect">NHibernate.Dialect.MsSql2012Dialect</property>
      </session-factory>
    </hibernate-configuration>

Here you have the different settings for NHibernate. You need to select the driver class, connection provider, and dialect that match your database.
If you are using another database such as Oracle or MySQL, you will have a different configuration. NHibernate can work with a wide range of databases.

Creating a model class for NHibernate: Now it’s time to create the model class for our CRUD operations. Following is the code for that. The property names are identical to the database table columns.

    namespace NhibernateMVC.Models
    {
        public class Employee
        {
            public virtual int Id { get; set; }
            public virtual string FirstName { get; set; }
            public virtual string LastName { get; set; }
            public virtual string Designation { get; set; }
        }
    }

Creating a mapping file between class and table: Now we need an XML mapping file between the class and the table, named “Employee.hbm.xml”, in the Nhibernate folder.

Creating a class to open session for NHibernate: I have created a class in the Models folder called NHibertnateSession, with a static function in it that opens an NHibernate session.

    using System.Web;
    using NHibernate;
    using NHibernate.Cfg;

    namespace NhibernateMVC.Models
    {
        public class NHibertnateSession
        {
            public static ISession OpenSession()
            {
                var configuration = new Configuration();
                var configurationPath = HttpContext.Current.Server.MapPath(@"~\Models\Nhibernate\hibernate.cfg.xml");
                configuration.Configure(configurationPath);
                var employeeConfigurationFile = HttpContext.Current.Server.MapPath(@"~\Models\Nhibernate\Employee.hbm.xml");
                configuration.AddFile(employeeConfigurationFile);
                ISessionFactory sessionFactory = configuration.BuildSessionFactory();
                return sessionFactory.OpenSession();
            }
        }
    }

Listing: Now that our open-session method is ready, it’s time to write the controller code to fetch data from the database. Following is the code for that.
    using System;
    using System.Web.Mvc;
    using NHibernate;
    using NHibernate.Linq;
    using System.Linq;
    using NhibernateMVC.Models;

    namespace NhibernateMVC.Controllers
    {
        public class EmployeeController : Controller
        {
            public ActionResult Index()
            {
                using (ISession session = NHibertnateSession.OpenSession())
                {
                    var employees = session.Query<Employee>().ToList();
                    return View(employees);
                }
            }
        }
    }

Here you can see that I get a session via the OpenSession method and then query the database to fetch the employees. Let’s create a view for this; you can do so by right-clicking View in the method above. We are going to create a strongly typed view. Our listing screen is ready; once you run the project it will fetch the data.

Create/Add: Now it’s time to write the add-employee code. Following is the code I have written for that. Here I have used the session.Save method to save the new employee. The first method returns a blank view, and the second method, with the HttpPost attribute, saves the data into the database.

    public ActionResult Create()
    {
        return View();
    }

    [HttpPost]
    public ActionResult Create(Employee emplolyee)
    {
        try
        {
            using (ISession session = NHibertnateSession.OpenSession())
            {
                using (ITransaction transaction = session.BeginTransaction())
                {
                    session.Save(emplolyee);
                    transaction.Commit();
                }
            }
            return RedirectToAction("Index");
        }
        catch(Exception exception)
        {
            return View();
        }
    }

Now let’s create a strongly typed Create view by right-clicking View and choosing Add View. Once you run this application and click on Create New, it will load the create screen.

Edit/Update: Now let’s create the edit functionality with NHibernate and ASP.NET MVC. For that I have written two action result methods, one for loading the edit view and another for saving the data. Following is the code for that.
    public ActionResult Edit(int id)
    {
        using (ISession session = NHibertnateSession.OpenSession())
        {
            var employee = session.Get<Employee>(id);
            return View(employee);
        }
    }

    [HttpPost]
    public ActionResult Edit(int id, Employee employee)
    {
        try
        {
            using (ISession session = NHibertnateSession.OpenSession())
            {
                var employeetoUpdate = session.Get<Employee>(id);
                employeetoUpdate.Designation = employee.Designation;
                employeetoUpdate.FirstName = employee.FirstName;
                employeetoUpdate.LastName = employee.LastName;
                using (ITransaction transaction = session.BeginTransaction())
                {
                    session.Save(employeetoUpdate);
                    transaction.Commit();
                }
            }
            return RedirectToAction("Index");
        }
        catch
        {
            return View();
        }
    }

Here, in the first action result, I fetch the existing employee via the Get method of the NHibernate session, and in the second I fetch the current employee and change it with the updated details. You can create the view for this via right click -> Add View. I have created a strongly typed view for edit.

Details: Now it’s time to create a details view where the user can see the employee’s details. I have written the following logic for the details view.

    public ActionResult Details(int id)
    {
        using (ISession session = NHibertnateSession.OpenSession())
        {
            var employee = session.Get<Employee>(id);
            return View(employee);
        }
    }

You can add the view by right-clicking View in the action result. Once you run this in the browser, it shows the details of the selected employee.

Delete: Now it’s time to write the delete functionality. Following is the code I have written for that.
    public ActionResult Delete(int id)
    {
        using (ISession session = NHibertnateSession.OpenSession())
        {
            var employee = session.Get<Employee>(id);
            return View(employee);
        }
    }

    [HttpPost]
    public ActionResult Delete(int id, Employee employee)
    {
        try
        {
            using (ISession session = NHibertnateSession.OpenSession())
            {
                using (ITransaction transaction = session.BeginTransaction())
                {
                    session.Delete(employee);
                    transaction.Commit();
                }
            }
            return RedirectToAction("Index");
        }
        catch(Exception exception)
        {
            return View();
        }
    }

Here the first action result shows the delete confirmation view, and the second performs the actual delete operation with the session’s Delete method.

That’s it. It’s very easy to implement CRUD operations with NHibernate. Stay tuned for more.

October 1, 2013

by Jalpesh Vadgama

· 47,345 Views

ElasticSearch: Java API

ElasticSearch provides a Java API that executes all operations asynchronously through a client object.

September 30, 2013

by Hüseyin Akdoğan

· 137,598 Views · 4 Likes

Parallel SQL in C#

So, I’ve been wanting to get back to playing with C# for a while, and finally have had the opportunity. I’ve also been wanting to play with the Task library in .NET and see if I could get it to do something interesting; the result is below.

The code below, running in a .NET 4 project, runs two SQL SELECT statements against the AdventureWorks2012 database. There are three tasks in here: ParallelTask1 and ParallelTask2, and a timing task. Each parallel task takes a connection string and a query as inputs, and passes out status messages. One of the important points with a task is that it has to be self-contained. This is why the connection is instantiated within the task. I also added a timing task (ParallelTiming) so I could pass out a ping message.

The whole thing is controlled by the code in Main, which starts the three tasks with their appropriate parameters. It then waits for the tasks to complete and prints the resulting return messages.

Try it out; it’s good fun and all you need is SQL Server, AdventureWorks and something to build C# projects. You can download the code here. Have fun!
    /// Parallel_SQL demonstration code
    /// From Nick Haslam
    /// http://blog.nhaslam.com
    /// 16/9/2013
    using System;
    using System.Collections.Generic;
    using System.Data.SqlClient;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;

    namespace Parallel_SQL
    {
        class Program
        {
            /// <summary>
            /// First Parallel task
            /// </summary>
            /// <param name="sConnString">Connection string details</param>
            /// <param name="sQuery">Query to execute</param>
            /// <param name="StatusMessage">Status message to pass back</param>
            /// <returns></returns>
            static Task<string> ParallelTask1(string sConnString, string sQuery, Action<string> StatusMessage)
            {
                return Task.Factory.StartNew(() =>
                {
                    SqlConnection conn = new SqlConnection(sConnString);
                    conn.Open();
                    StatusMessage("Running Query");
                    SqlDataReader reader = null;
                    SqlCommand sqlCommand = new SqlCommand(sQuery, conn);
                    reader = sqlCommand.ExecuteReader();
                    while (reader.Read())
                    {
                        StatusMessage(reader[0].ToString());
                    }
                    return "Task 1 Complete";
                });
            }

            /// <summary>
            /// Second Parallel task
            /// </summary>
            /// <param name="sConnString">Connection string details</param>
            /// <param name="sQuery">Query to execute</param>
            /// <param name="StatusMessage">Status message to pass back</param>
            /// <returns></returns>
            static Task<string> ParallelTask2(string sConnString, string sQuery, Action<string> StatusMessage)
            {
                return Task.Factory.StartNew(() =>
                {
                    SqlConnection conn = new SqlConnection(sConnString);
                    conn.Open();
                    StatusMessage("Running Query");
                    SqlDataReader reader = null;
                    SqlCommand sqlCommand = new SqlCommand(sQuery, conn);
                    reader = sqlCommand.ExecuteReader();
                    while (reader.Read())
                    {
                        StatusMessage(reader[0].ToString());
                    }
                    return "Task 2 Complete";
                });
            }

            /// <summary>
            /// Timing Task
            /// </summary>
            /// <param name="iMSPause">Milliseconds between ping</param>
            /// <param name="StatusMessage">Status message to pass back</param>
            /// <returns></returns>
            static Task<string> ParallelTiming(int iMSPause, Action<string> StatusMessage)
            {
                return Task.Factory.StartNew(() =>
                {
                    for (int i = 0; i < 10; i++)
                    {
                        System.Threading.Thread.Sleep(iMSPause);
                        StatusMessage("******************** PING ********************");
                    }
                    return "Timing task done";
                });
            }

            static void Main(string[] args)
            {
                string sConnString = "server=.; Trusted_Connection=yes; database=AdventureWorks2012;";
                try
                {
                    var Task1Control = ParallelTask1(sConnString,
                        "SELECT top 500 TransactionID FROM Production.TransactionHistory",
                        (update) => { Console.WriteLine(String.Format("{0} - {1}", DateTime.Now, update)); });
                    var Task2Control = ParallelTask2(sConnString,
                        "SELECT top 500 SalesOrderDetailID FROM sales.SalesOrderDetail",
                        (update) => { Console.WriteLine(String.Format("{0} - \t\t{1}", DateTime.Now, update)); });
                    var TimingTaskControl = ParallelTiming(250,
                        (update) => { Console.WriteLine(String.Format("{0} - \t\t\t{1}", DateTime.Now, update)); });

                    // Await completion of the tasks
                    Console.WriteLine("Task 1 Status - {0}", Task1Control.Result);
                    Console.WriteLine("Task 2 Status - {0}", Task2Control.Result);
                    Console.WriteLine("Timing Task Status - {0}", TimingTaskControl.Result);
                }
                catch (Exception e)
                {
                    Console.WriteLine(e.ToString());
                }
                Console.ReadKey();
            }
        }
    }

September 29, 2013

by Nick Haslam

· 22,673 Views · 31 Likes

"Lazy" Database Synchronization Using RabbitMQ

The Problem

Obviously, there are tons of different ways to sync databases, so why should it be described again? Let's imagine that we have an unusual situation with the restrictions below:

- A future system will have a Head Office (HO) and a couple of Branch Offices (BOs).
- All offices are located in different places, and some of them have difficulties with the internet connection. It could even be that the internet is available for only 1-2 hours per day.
- Almost all vital data is created in the HO and should be presented as read-only in BOs.
- Data exchange should be limited with appropriate permissions (for example, if an operator has created some sensitive data in the HO for BO1, only BO1 should have access to it).
- The HO should have access to all information that has been created or modified in BOs.

Given all of these points, the decision was made to write our own DB sync mechanism.

Basic Idea

Due to connection degradation between the HO and BOs, we have to sync everything within short-term sessions. Since there is no need to send information to all branches in general cases, we should be able to orchestrate data flow. Those thoughts bring us to the idea that we might implement some kind of RPC where an event occurs in one office, and it is reproduced (replayed) in another. Message queues (MQ) are a good fit for syncing data between branches. RabbitMQ is my favorite MQ, so I will use it in this example. Also, this application will use the .NET stack, which has a convenient API client implementation for RabbitMQ called EasyNetQ.

High Level Application Architecture

Following the idea of replaying actions on other system instances, we should be able to divide them into single business-logic operations. The best way to achieve this is by using the Aggregate Roots approach. The main idea is to have separate objects that are divided by domain entities, where each call to a method of those objects is a single change to the state of the business logic.
For example, if we have a domain object Document and the ability to Get, Upsert, or Apply/Unapply, then we should describe its root as (pseudocode):

    public class DocumentRoot
    {
        public Document Get(Id) { ... }
        public Document Upsert(Document) { ... }
        public bool Apply(Id) { ... }
        public bool UnApply(Id) { ... }
    }

Also, it's very important to ensure that each call runs in a transaction in order to avoid data loss. This can be achieved using simple method interception (for example Autofac + Castle.Proxy). In other words, the core process will look like this:

Keep in mind things such as entity primary keys, because data will be propagated between different system instances, and we'll need to be sure that the IDs will be the same. Also, collisions are possible when using simple auto-incrementing PKs, so our choice is GUID. With the help of a base repository, it's very simple to store a new GUID during object creation.

Let's assume that we have an ExchangeInformation object that holds all the data needed to restore a root call on a remote system. It will contain info about the method name, type name, input, and output params; this data can be obtained from a root interceptor. It should also hold the list of new IDs, which is not hard to get either, although we'll need to implement the UnitOfWork pattern on an ORM type to support transactions. This will allow us to place our ExchangeInformation in that UoW object (for example, within Entity Framework it's DbContext).

Here is the implementation (using EF) of saving any changes in a domain within the base generic repository, where the base entity looks like:

    public class EntityBase
    {
        public long Id { get; set; }
        public Guid Guid { get; set; }
    }

    public virtual void Save(T entity)
    {
        DbEntityEntry entry = Context.Entry(entity);
        if (entity.Guid == Guid.Empty)
        {
            try
            {
                // On replay, reuse the GUID recorded by the source office;
                // otherwise generate a new one and record it.
                Guid newGuid = Context.ExchangeInformation.IsExchangeRestore
                    ? Context.ExchangeInformation.NewGuids[0]
                    : Guid.NewGuid();
                if (Context.ExchangeInformation.IsExchangeRestore)
                {
                    Context.ExchangeInformation.NewGuids.RemoveAt(0);
                }
                else
                {
                    Context.ExchangeInformation.NewGuids.Add(newGuid);
                }
                entity.Guid = newGuid;
            }
            catch
            {
                throw new Exception("Failed to restore exchange, no guid found");
            }
            entry.State = EntityState.Added;
            return;
        }
        Context.Entry(entity).State = EntityState.Modified;
    }

One more important note: to avoid code duplication, it's necessary to use GUIDs on clients, because if they operate on any other IDs we'll need to write two different implementations of every method.

Big Picture

With the preparation complete, we can proceed with the architecture design. Since every system instance should be able to send and receive new data, we can declare two RMQ topics: input and output. Also, because message flow must be orchestrated, queues for each system instance should be created within the output topic. The simplest strategy for a routing implementation is to use the branch office GUID as a key. So at this point we know how to do the following:

- Save the source event in one office.
- Put this event into selected queues (how the selection is made depends on the situation: read from the entity, call some additional method, use attributes, etc.)

The next step is a solution for how to make output events from one office appear in the input queue of the other office. RabbitMQ has two plugins for that: Federation and Shovel. They are quite similar, but Shovel works at a lower level and has more options to control the synchronization process, so we'll use Shovel to link queues. Shovel is very good at handling connection degradation and has a lot of additional configurable options such as message republishing properties, routing, etc. Now it's time to combine all the pieces into a single picture:

Aggregators here are simple RabbitMQ consumers that handle incoming messages from other offices and launch the appropriate methods.
One other problem is restoring transferred params. From my point of view, the best way is to use Json.Net with type serialization and restore them on a remote system instance with a small hack:

    private object[] GetParams(MethodInfo methodInfo, ExchangeInformation information, ExchangeMessage message)
    {
        ParameterInfo[] methodParams = methodInfo.GetParameters();
        var listParams = new List>(information.InputParamsString);
        for (int ii = 0; ii < methodParams.Length; ii++)
        {
            var jObject = JsonConvert.DeserializeObject<JObject>(information.OutputValueString);
            string typeName = jObject["$type"].ToString();
            listParams.Add(jObject.ToObject(Type.GetType(typeName)));
        }
        return listParams.ToArray();
    }

Of course, appropriate checks for parameter count mismatches, valid deserialization, and so on are required.

Conclusions

The approach I've described is very easy to implement, and it has lots of additional places that can be customized. For example, any other method can be executed before, instead of, or after restoration on a target branch to change the logic of DOM behavior. The main issue is that collisions can occur if two BOs edit the same object at the same time. Actually, it's not hard to detect this situation by adding a hash to EntityBase. Nevertheless, a human decision is needed to resolve conflicts, so a simple UI is necessary in the HO where the operator can choose which data is correct.
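The record-and-replay handling of GUIDs is the piece that keeps IDs identical across offices, so here is a minimal Python sketch of just that step. It is not the author's EF code: `ExchangeInformation` is reduced to the two fields the Save method uses, entities are plain dictionaries, and `assign_guid` is a hypothetical helper name.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ExchangeInformation:
    """Hypothetical record of one root call, reduced to its GUID bookkeeping."""
    is_exchange_restore: bool = False
    new_guids: list = field(default_factory=list)


def assign_guid(entity, info):
    """Give a new entity its GUID: generate and record it, or replay a recorded one."""
    if entity.get("guid") is not None:
        return entity  # existing entity, nothing to assign
    if info.is_exchange_restore:
        if not info.new_guids:
            raise RuntimeError("Failed to restore exchange, no guid found")
        entity["guid"] = info.new_guids.pop(0)
    else:
        entity["guid"] = uuid.uuid4()
        info.new_guids.append(entity["guid"])
    return entity


# Head office: the original call creates two entities and records their GUIDs.
ho_info = ExchangeInformation()
ho_docs = [assign_guid({"title": t}, ho_info) for t in ("invoice", "contract")]

# Branch office: the replayed call consumes the recorded GUIDs in the same order.
bo_info = ExchangeInformation(is_exchange_restore=True,
                              new_guids=list(ho_info.new_guids))
bo_docs = [assign_guid({"title": t}, bo_info) for t in ("invoice", "contract")]

print([d["guid"] for d in ho_docs] == [d["guid"] for d in bo_docs])
```

The scheme only works if the replayed method creates its entities in the same order as the original call, which is one reason each root method should stay a single, deterministic business operation.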

September 25, 2013

by Vladimir Kornev

· 18,286 Views · 2 Likes

Connecting to SQL Azure with SQL Management Studio

Intro

If you want to manage your SQL Databases in Azure using tools that you’re a little more familiar and comfortable with, such as SQL Management Studio, how do you go about connecting? You could read the help article from Microsoft, or you can follow my screen-based instructions below.

Assumptions

1. I’m assuming you have a version of SQL Management Studio already installed. I believe you’ll need at least SQL Server 2008 R2’s version or newer.
2. I’m further assuming you’ve already created a SQL Database in Azure.

Steps to Connect SSMS to SQL Azure

1. Authenticate to the Azure Portal
2. Click on SQL Databases
3. Click on Servers
4. Click on the name of the server you wish to connect to
5. Click on Configure. If not already in place, click on ‘Add to the allowed IP addresses’ to add your current IP address (or specify an address you wish to connect from) and click ‘Save’
6. Open SQL Management Studio and connect to Database services (usually comes up by default)
   - Enter the fully qualified server name (the server name followed by .database.windows.net)
   - Change to SQL Server Authentication
   - Enter the login preferred (if a new database, the username you specified when you created the DB server)
   - Enter the correct password
7. Hit the Connect button

Troubleshooting

- Ensure you have the appropriate ports open outbound from your local network or connection (typically port 1433)
- Ensure you have allowed the correct public IP address you’re trying to connect from via the Azure Portal (steps 1-5 above)
- Ensure you are using the correct server name and user name. For SSMS, this is the server name (in step 4) followed by .database.windows.net
- Ensure you are using SQL Server Authentication. For SSMS the username format is
- If you forgot the password for your username, you can reset it in the Azure Portal: in step 4, click on Dashboard.

Lastly, you can click on the Database (in step 2) to see your connection options.

September 25, 2013

by Rob Sanders

· 262,940 Views