DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest Databases Topics

article thumbnail
Replacing Query String Elements in C# .NET and JavaScript
While writing list navigation and search features in websites today there is a constant need to find/replace and play with query string elements, so that you can easily manipulate these mystical items while you’re carrying them around in your website’s URLs. I have a few little methods I’ve used over the years and carry with me project to project, and this post is putting them on the record for easy access later. I have a secret. This post is actually more aimed at an audience of 'myself', and my ability to have an easy bit of source code to call upon when I’m on the go looking for a quick solution to cut and paste – as most of my blog posts are. But you, dear reader, you get to share in this benefit with me by pulling from the awesomeness within this post as well. Solution: .Net c# When doing this with c# you have a few pretty cool features up your sleeve. One of these is HttpUtility.ParseQueryString(urlPath) framework method. This static method allows you to extract a NameValueCollection that is editable from a given query string. Why is this cool? Because it allows you to very easily play with the query string collection like it is any other NameValueCollection – with Add() and Remove() methods. This makes it incredibly powerful. Quick & Dirty code beware! The code I’m pasting below is far from being the most elegant solution, i seem to have misplaced my nicer piece of code and am in too much of a rush to find it right now (sorry). Until i find my nicer solution, the method below will get you by – whether you have a hatred for ternary’s or not. public static string ReplaceQueryStringParam(string currentPageUrl, string paramToReplace, string newValue) { string urlWithoutQuery = currentPageUrl.IndexOf('?') >= 0 ? currentPageUrl.Substring(0, currentPageUrl.IndexOf('?')) : currentPageUrl; string queryString = currentPageUrl.IndexOf('?') >= 0 ? currentPageUrl.Substring(currentPageUrl.IndexOf('?')) : null; var queryParamList = queryString != null ? HttpUtility.ParseQueryString(queryString) : HttpUtility.ParseQueryString(string.Empty); if (queryParamList[paramToReplace] != null) { queryParamList[paramToReplace] = newValue; } else { queryParamList.Add(paramToReplace, newValue); } return String.Format("{0}?{1}", urlWithoutQuery, queryParamList); } To call this, you can do the following: // var currentUrl = HttpContext.Current.Request.Url; var currentUrl = "http://www.mysite.com/mypage?category=cool-products&sort=price&page=3"; // change the my sort-by param named"sort" to "name" var newUrlWithChangedSort = ReplaceQueryStringParam(currentUrl, "sort", "name"); Solution: JavaScript The second part of this post includes a JavaScript solution, as you never know when you have to do this on the client side. function replaceQueryString(url, param, value) { if (url.lastIndexOf('?') <= 0) url = url + "?"; var re = new RegExp("([?|&])" + param + "=.*?(&|$)", "i"); if (url.match(re)) return url.replace(re, '$1' + param + "=" + value + '$2'); else return url.substring(url.length - 1) == '?' ? url + param + "=" + value : url + '&' + param + "=" + value; } And to use the above code in your client-side javascript simply write something along the lines of: //var currentUrl = self.location; var currentUrl = "http://www.mysite.com/mypage?category=cool-products&sort=price&page=3"; // change the my sort-by param named"sort" to "name" var newUrlWithChangedSort = replaceQueryString(currentUrl, "sort", "name"); Easy – now next time you need to knock something together, instead of writing it yourself, you can simply cut & paste mine!
July 20, 2012
by Douglas Rathbone
· 16,494 Views
article thumbnail
How to Autoscale MySQL on Amazon EC2
Autoscaling your webserver tier is typically straightforward. Image your apache server with source code or without, then sync down files from S3 upon spinup. Roll that image into the autoscale configuration and you’re all set. With the database tier though, things can be a bit tricky. The typical configuration we see is to have a single master database where your application writes. But scaling out or horizontally on Amazon EC2 should be as easy as adding more slaves, right? Why not automate that process? Below we’ve set out to answer some of the questions you’re likely to face when setting up slaves against your master. We’ve included instructions on building an AMI that automatically spins up as a slave. Fancy! How can I autoscale my database tier? Build an auto-starting MySQL slave against your master. Configure those to spinup. Amazon’s autoscaling loadbalancer is one option, another is to use a roll-your-own solution, monitoring thresholds on servers, and spinning up or dropping off slaves as necessary. Does an AWS snapshot capture subvolume data or just the SIZE of the attached volume? In fact, if you have an attached EBS volume and you create an new AMI off of that, you will capture the entire root volume, plus your attached volume data. In fact we find this a great way to create an auto-building slave in the cloud. How do I freeze MySQL during AWS snapshot? mysql> flush tables with read lock;mysql> system xfs_freeze -f /data At this point you can use the Amazon web console, ylastic, or ec2-create-image API call to do so from the command line. When the server you are imaging off of above restarts – as it will do by default – it will start with /data partition unfrozen and mysql’s tables unlocked again. Voila! If you’re not using xfs for your /data filesystem, you should be. It’s fast! The xfsprogs docs seem to indicate this may also work with foreign filesystems. Check the docs for details. How do I build an AMI mysql slave that autoconnects to master? Install mysql_serverid script below. Configure mysql to use your /data EBS mount. Set all your my.cnf settings including server_id Configure the instance as a slave in the normal way. When using GRANT to create the ‘rep’ user on master, specify the host with a subnet wildcard. For example ’10.20.%’. That will subsequently allow any 10.20.x.y servers to connect and replicate. Point the slave at the master. When all is running properly, edit the my.cnf file and remove server_id. Don’t restart mysql. Freeze the filesystem as described above. Use the Amazon console, ylastic or API call to create your new image. Test it of course, to make sure it spins up, sets server_id and connects to master. Make a change in the test schema, and verify that it propagates to all slaves. How do I set server_id uniquely? As you hopefully already know, in MySQL replication environment each node requires a unique server_id setting. In my Amazon Machine Images, I want the server to startup and if it doesn’t find the server_id in the /etc/my.cnf file, to add it there, correctly! Is that so much to ask? Here’s what I did. Fire up your editor of choice and drop in this bit of code: #!/bin/shif grep -q “server_id” /etc/my.cnf then : # do nothing – it’s already set else # extract numeric component from hostname – should be internet IP in Amazon environment export server_id=`echo $HOSTNAME | sed ‘s/[^0-9]*//g’` echo “server_id=$server_id” >> /etc/my.cnf # restart mysql /etc/init.d/mysql restart fi Save that snippet at /root/mysql_serverid. Also be sure to make it executable: $ chmod +x /root/mysql_serverid Then just append it to your /etc/rc.local file with an editor or echo: $ echo "/root/mysql_serverid" >> /etc/rc.local Assuming your my.cnf file does *NOT* contain the server_id setting when you re-image, then it’ll set this automagically each time you spinup a new server off of that AMI. Nice! Can you easily slave off of a slave? How? It’s not terribly different from slaving off of a normal master. A. First enable slave updates. The setting is not dynamic, so if you don’t already have it set, you’ll have to restart your slave. log_slave_updates=true B. Get an initial snapshot of your slave data. You can do that the locking way: mysql> flush tables with read lock;mysql> show master status\G; mysql> system mysqldump -A > full_slave_dump.mysql mysql> unlock tables; You may also choose to use Percona’s excellent xtrabackup utility to create hotbackups without locking any tables. We are very lucky to have an open-source tool like this at our disposal. MySQL Enterprise Backup from Oracle Corp can also do this. C. On the slave, seed the database with your dump created above. $ mysql < full_slave_dump.mysql D. Now point your slave to the original slave. mysql> change master to master_user='rep', master_password='rep', master_host='192.168.0.1', master_log_file='server-bin-log.000004', master_log_pos=399;mysql> start slave; mysql> show slave status\G; Slave master is set as an IP address. Is there another way? It’s possible to use hostnames in MySQL replication, however it’s not recommended. Why? Because of the wacky world of DNS. Suffice it to say MySQL has to do a lot of work to resolve those names into IP addresses. A hickup in DNS can interrupt all MySQL services potentially as sessions will fail to authenticate. To avoid this problem do two things: A. Set this parameter in my.cnf skip_name_resolve = true Remove entries in mysql.user table where hostname is not an IP address. Those entries will be invalid for authentication after setting the above parameter. Doesn’t RDS take care of all of this for me? RDS is Amazon’s Relational Database Service which is built on MySQL. Amazon’s RDS solution presents MySQL as a service which brings certain benefits to administrators and startups: Simpler administration. Nuts and bolts are handled for you. Push-button replication. No more struggling with the nuances and issues of MySQL’s replication management. Simplicity of administration of course has it’s downsides. Depending on your environment, these may or may not be dealbreakers. No access to the slow query log. This is huge. The single best tool for troubleshooting slow database response is this log file. Queries are a large part of keeping a relational database server healthy and happy, and without this facility, you are severely limited. Locked in downtime window When you signup for RDS, you must define a thirty minute maintenance window. This is a weekly window during which your instance *COULD* be unavailable. When you host yourself, you may not require as much downtime at all, especially if you’re using master-master mysql and zero-downtime configuration. Can’t use Percona Server to host your MySQL data. You won’t be able to do this in RDS. Percona server is a high performance distribution of MySQL which typically rolls in serious performance tweaks and updates before they make it to community addition. Well worth the effort to consider it. No access to filesystem, server metrics & command line. Again for troubleshooting problems, these are crucial. Gathering data about what’s really happening on the server is how you begin to diagnose and troubleshoot a server stall or pileup. You are beholden to Amazon’s support services if things go awry. That’s because you won’t have access to the raw iron to diagnose and troubleshoot things yourself. Want to call in an outside consultant to help you debug or troubleshoot? You’ll have your hands tied without access to the underlying server. You can’t replicate to a non-RDS database. Have your own datacenter connected to Amazon via VPC? Want to replication to a cloud server? RDS won’t fit the bill. You’ll have to roll your own – as we’ve described above. And if you want to replicate to an alternate cloud provider, again RDS won’t work for you. Related posts: Deploying MySQL on Amazon EC2 – 8 Best Practices Review: Host Your Web Site In The Cloud, Amazon Web Services Made Easy 5 Ways to Boost MySQL Scalability Top MySQL DBA interview questions (Part 2) MySQL Cluster In The Cloud – Managers Guide
July 20, 2012
by Sean Hull
· 18,484 Views
article thumbnail
Spring Data - Apache Hadoop
Spring for Apache Hadoop is a Spring project to support writing applications that can benefit of the integration of Spring Framework and Hadoop. This post describes how to use Spring Data Apache Hadoop in an Amazon EC2 environment using the “Hello World” equivalent of Hadoop programming – a Wordcount application. 1./ Launch an Amazon Web Services EC2 instance. - Navigate to AWS EC2 Console (“https://console.aws.amazon.com/ec2/home”): - Select Launch Instance then Classic Wizzard and click on Continue. My test environment was a “Basic Amazon Linux AMI 2011.09″ 32-bit., Instant type: Micro (t1.micro , 613 MB), Security group quick-start-1 that enables ssh to be used for login. Select your existing key pair (or create a new one). Obviously you can select another AMI and instance types depending on your favourite flavour. (Should you vote for Windows 2008 based instance, you also need to have cygwin installed as an additional Hadoop prerequisite beside Java JDK and ssh, see “Install Apache Hadoop” section) 2./ Download Apache Hadoop - as of writing this article, 1.0.0 is the latest stable version of Apache Hadoop, that is what was used for testing purposes. I downloaded hadoop-1.0.0.tar.gz and copied it into /home/ec2-user directory using pscp command from my PC running Windows: c:\downloads>pscp -i mykey.ppk hadoop-1.0.0.tar.gz [email protected]:/home/ec2-user (the computer name above – ec2-ipaddress-region-compute.amazonaws.com – can be found on AWS EC2 console, Instance Description, public DNS field) 3./ Install Apache Hadoop: As prerequisites, you need to have Java JDK 1.6 and ssh installed, see Apache Single-Node Setup Guide. (ssh is automatically installed with Basic Amazon AMI). Then install hadoop itself: $ cd ~ # change directory to ec2-user home (/home/ec2-user) $ tar xvzf hadoop-1.0.0.tar.gz $ ln -s hadoop-1.0.0 hadoop $ cd hadoop/conf $ vi hadoop-env.sh # edit as below export JAVA_HOME=/opt/jdk1.6.0_29 $ vi core-site.xml # edit as below – this defines the namenode to be running on localhost and listeing to port 9000. fs.default.name hdfs://localhost:9000 $ vi hdsf-site.xml # edit as below this defines that file system replicate is 1 (in production environment it is supposed to be 3 by default) dfs.replication 1 $ vi mapred-site.xml # edit as below – this defines the jobtracker to be running on localhost and listeing to port 9001. mapred.job.tracker localhost:9001 $ cd ~/hadoop $ bin/hadoop namenode -format $ bin/start-all.sh At this stage all hadoop jobs are running in pseudo distributed mode, you can verify it by running: $ ps -ef | grep java You should see 5 java processes: namenode, secondarynamenode, datanode, jobtracker and tasktracker. 4./ Install Spring Data Hadoop Download Spring Data Hadoop package from SpringSource community download site. As of writing this article, the latest stable version is spring-data-hadoop-1.0.0.M1.zip. $ cd ~ $ tar xzvf spring-data-hadoop-1.0.0.M1.zip $ ln -s spring-data-hadoop-1.0.0.M1 spring-data-hadoop 5./ Build and Run Spring Data Hadoop Wordcount example $ cd spring-data-hadoop/spring-data-hadoop-1.0.0.M1/samples/wordcount Spring Data Hadoop is using gradle as build tool. Check build.grandle build file. The original version packaged in the tar.gz file does not compile, it complains about thrift, version 0.2.0 and jdo2-api, version2.3-ec. Add datanucleus.org maven repository to the build.gradle file to support jdo2-api (http://www.datanucleus.org/downloads/maven2/) . Unfortunatelly, there seems to be no maven repo for thrift 0.2.0 . You should download thrift 0.2.0.jar and thrift.0.2.0.pom file e.g. from this repo: “http://people.apache.org/~rawson/repo“ and then add it to local maven repo. $ mvn install:install-file -DgroupId=org.apache.thrift -DartifactId=thrift -Dversion=0.2.0 -Dfile=thrift-0.2.0.jar -Dpackaging=jar $ vi build.grandle # modify the build file to refer to datanucleus maven repo for jdo2-api and the local repo for thrift repositories { // Public Spring artefacts mavenCentral() maven { url “http://repo.springsource.org/libs-release” } maven { url “http://repo.springsource.org/libs-milestone” } maven { url “http://repo.springsource.org/libs-snapshot” } maven { url “http://www.datanucleus.org/downloads/maven2/” } maven { url “file:///home/ec2-user/.m2/repository” } } I also modified the META-INF/spring/context.xml file in order to run hadoop file system commands manually: $ cd /home/ec2-user/spring-data-hadoop/spring-data-hadoop-1.0.0.M1/samples/wordcount/src/main/resources $vi META-INF/spring/context.xml # remove clean-script and also the dependency on it for JobRunner. xmlns=”http://www.springframework.org/schema/beans” xmlns:xsi=”http://www.w3.org/2001/XMLSchema-instance” xmlns:context=”http://www.springframework.org/schema/context” xmlns:hdp=”http://www.springframework.org/schema/hadoop” xmlns:p=”http://www.springframework.org/schema/p” xsi:schemaLocation=”http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd”> fs.default.name=${hd.fs} Copy the sample file – nietzsche-chapter-1.txt – to Hadoop file system (/user/ec2-user-/input directory) $ cd src/main/resources/data $ hadoop fs -mkdir /user/ec2-user/input $ hadoop fs -put nietzsche-chapter-1.txt /user/ec2-user/input/data $ cd ../../../.. # go back to samples/wordcount directory $ ../gradlew Verify the result: $ hadoop fs -cat /user/ec2-user/output/part-r-00000 | more “AWAY 1 “BY 1 “Beyond 1 “By 2 “Cheers 1 “DE 1 “Everywhere 1 “FROM” 1 “Flatterers 1 “Freedom 1
July 19, 2012
by Istvan Szegedi
· 11,893 Views
article thumbnail
My Experience Moving Data from MySQL to Cassandra
I had a relational database, that I wanted to migrate to cassandra. Cassandra's sstableloader provides option to load the existing data from flat files to a cassandra ring. Hence this can be used as a way to migrate data in relational databases to cassandra, as most relational databases let us export the data into flat files. sqoop gives the option to do this effectively. Interestingly, DataStax Enterprise provides everything we want in the big data space as a package. This includes, cassandra, hadoop, hive, pig, sqoop, and mahout, which comes handy in this case. Under the resources directory, you may find the cassandra, dse, hadoop, hive, log4j-appender, mahout, pig, solr, sqoop, and tomcat specific configurations. For example, from resources/hadoop/bin, you may format the hadoop name node using ./hadoop namenode -format as usual. * Download and extract DataStax Enterprise binary archive (dse-2.1-bin.tar.gz). * Follow the documentation, which is also available as a PDF. * Migrating a relational database to cassandra is documented and is also blogged. * Before starting DataStax, make sure that the JAVA_HOME is set. This also can be set directly on conf/hadoop-env.sh. * Include the connector to the relational database into a location reachable by sqoop. I put mysql-connector-java-5.1.12-bin.jar under resources/sqoop. * Set the environment $ bin/dse-env.sh * Start DataStax Enterprise, as an Analytics node. $ sudo bin/dse cassandra -t where cassandra starts the Cassandra process plus CassandraFS and the -t option starts the Hadoop JobTracker and TaskTracker processes. if you start without the -t flag, the below exception will be thrown during the further operations that are discussed below. No jobtracker found Unable to run : jobtracker not found Hence do not miss the -t flag. * Start cassandra cli to view the cassandra keyrings and you will be able to view the data in cassandra, once you migrate using sqoop as given below. $ bin/cassandra-cli -host localhost -port 9160 Confirm that it is connected to the test cluster that is created on the port 9160, by the below from the CLI. [default@unknown] describe cluster; Cluster Information: Snitch: com.datastax.bdp.snitch.DseDelegateSnitch Partitioner: org.apache.cassandra.dht.RandomPartitioner Schema versions: f5a19a50-b616-11e1-0000-45b29245ddff: [127.0.1.1] If you have missed mentioning the host/port (starting the cli by just bin/cassandra-cli) or given it wrong, you will get the response as "Not connected to a cassandra instance." $ bin/dse sqoop import --connect jdbc:mysql://127.0.0.1:3306/shopping_cart_db --username root --password root --table Category --split-by categoryName --cassandra-keyspace shopping_cart_db --cassandra-column-family Category_cf --cassandra-row-key categoryName --cassandra-thrift-host localhost --cassandra-create-schema Above command will now migrate the table "Category" in the shopping_cart_db with the primary key categoryName, into a cassandra keyspace named shopping_cart, with the cassandra row key categoryName. You may use the --direct mysql specific option, which is faster. In my above command, I have everything runs on localhost. +--------------+-------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +--------------+-------------+------+-----+---------+-------+ | categoryName | varchar(50) | NO | PRI | NULL | | | description | text | YES | | NULL | | | image | blob | YES | | NULL | | +--------------+-------------+------+-----+---------+-------+ This also creates the respective java class (Category.java), inside the working directory. To import all the tables in the database, instead of a single table. $ bin/dse sqoop import-all-tables -m 1 --connect jdbc:mysql://127.0.0.1:3306/shopping_cart_db --username root --password root --cassandra-thrift-host localhost --cassandra-create-schema --direct Here "-m 1" tag ensures a sequential import. If not specified, the below exception will be thrown. ERROR tool.ImportAllTablesTool: Error during import: No primary key could be found for table Category. Please specify one with --split-by or perform a sequential import with '-m 1'. To check whether the keyspace is created, [default@unknown] show keyspaces; ................ Keyspace: shopping_cart_db: Replication Strategy: org.apache.cassandra.locator.SimpleStrategy Durable Writes: true Options: [replication_factor:1] Column Families: ColumnFamily: Category_cf Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type Default column value validator: org.apache.cassandra.db.marshal.UTF8Type Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period in seconds / keys to save : 0.0/0/all Row Cache Provider: org.apache.cassandra.cache.SerializingCacheProvider Key cache size / save period in seconds: 200000.0/14400 GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Replicate on write: true Bloom Filter FP chance: default Built indexes: [] Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy ............. [default@unknown] describe shopping_cart_db; Keyspace: shopping_cart_db: Replication Strategy: org.apache.cassandra.locator.SimpleStrategy Durable Writes: true Options: [replication_factor:1] Column Families: ColumnFamily: Category_cf Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type Default column value validator: org.apache.cassandra.db.marshal.UTF8Type Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period in seconds / keys to save : 0.0/0/all Row Cache Provider: org.apache.cassandra.cache.SerializingCacheProvider Key cache size / save period in seconds: 200000.0/14400 GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Replicate on write: true Bloom Filter FP chance: default Built indexes: [] Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy You may also use hive to view the databases created in cassandra, in an sql-like manner. * Start Hive $ bin/dse hive hive> show databases; OK default shopping_cart_db When the entire database is imported as above, separate java classes will be created for each of the tables. $ bin/dse sqoop import-all-tables -m 1 --connect jdbc:mysql://127.0.0.1:3306/shopping_cart_db --username root --password root --cassandra-thrift-host localhost --cassandra-create-schema --direct 12/06/15 15:42:11 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead. 12/06/15 15:42:11 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset. 12/06/15 15:42:11 INFO tool.CodeGenTool: Beginning code generation 12/06/15 15:42:11 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `Category` AS t LIMIT 1 12/06/15 15:42:11 INFO orm.CompilationManager: HADOOP_HOME is /home/pradeeban/programs/dse-2.1/resources/hadoop/bin/.. Note: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/Category.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. 12/06/15 15:42:13 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/Category.jar 12/06/15 15:42:13 INFO manager.DirectMySQLManager: Beginning mysqldump fast path import 12/06/15 15:42:13 INFO mapreduce.ImportJobBase: Beginning import of Category 12/06/15 15:42:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 12/06/15 15:42:15 INFO mapred.JobClient: Running job: job_201206151241_0007 12/06/15 15:42:16 INFO mapred.JobClient: map 0% reduce 0% 12/06/15 15:42:25 INFO mapred.JobClient: map 100% reduce 0% 12/06/15 15:42:25 INFO mapred.JobClient: Job complete: job_201206151241_0007 12/06/15 15:42:25 INFO mapred.JobClient: Counters: 18 12/06/15 15:42:25 INFO mapred.JobClient: Job Counters 12/06/15 15:42:25 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=6480 12/06/15 15:42:25 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/06/15 15:42:25 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/06/15 15:42:25 INFO mapred.JobClient: Launched map tasks=1 12/06/15 15:42:25 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 12/06/15 15:42:25 INFO mapred.JobClient: File Output Format Counters 12/06/15 15:42:25 INFO mapred.JobClient: Bytes Written=2848 12/06/15 15:42:25 INFO mapred.JobClient: FileSystemCounters 12/06/15 15:42:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21419 12/06/15 15:42:25 INFO mapred.JobClient: CFS_BYTES_WRITTEN=2848 12/06/15 15:42:25 INFO mapred.JobClient: CFS_BYTES_READ=87 12/06/15 15:42:25 INFO mapred.JobClient: File Input Format Counters 12/06/15 15:42:25 INFO mapred.JobClient: Bytes Read=0 12/06/15 15:42:25 INFO mapred.JobClient: Map-Reduce Framework 12/06/15 15:42:25 INFO mapred.JobClient: Map input records=1 12/06/15 15:42:25 INFO mapred.JobClient: Physical memory (bytes) snapshot=119435264 12/06/15 15:42:25 INFO mapred.JobClient: Spilled Records=0 12/06/15 15:42:25 INFO mapred.JobClient: CPU time spent (ms)=630 12/06/15 15:42:25 INFO mapred.JobClient: Total committed heap usage (bytes)=121241600 12/06/15 15:42:25 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2085318656 12/06/15 15:42:25 INFO mapred.JobClient: Map output records=36 12/06/15 15:42:25 INFO mapred.JobClient: SPLIT_RAW_BYTES=87 12/06/15 15:42:25 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 11.4492 seconds (0 bytes/sec) 12/06/15 15:42:25 INFO mapreduce.ImportJobBase: Retrieved 36 records. 12/06/15 15:42:25 INFO tool.CodeGenTool: Beginning code generation 12/06/15 15:42:25 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `Customer` AS t LIMIT 1 12/06/15 15:42:25 INFO orm.CompilationManager: HADOOP_HOME is /home/pradeeban/programs/dse-2.1/resources/hadoop/bin/.. Note: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/Customer.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. 12/06/15 15:42:25 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/Customer.jar 12/06/15 15:42:26 INFO manager.DirectMySQLManager: Beginning mysqldump fast path import 12/06/15 15:42:26 INFO mapreduce.ImportJobBase: Beginning import of Customer 12/06/15 15:42:26 INFO mapred.JobClient: Running job: job_201206151241_0008 12/06/15 15:42:27 INFO mapred.JobClient: map 0% reduce 0% 12/06/15 15:42:35 INFO mapred.JobClient: map 100% reduce 0% 12/06/15 15:42:35 INFO mapred.JobClient: Job complete: job_201206151241_0008 12/06/15 15:42:35 INFO mapred.JobClient: Counters: 17 12/06/15 15:42:35 INFO mapred.JobClient: Job Counters 12/06/15 15:42:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=6009 12/06/15 15:42:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/06/15 15:42:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/06/15 15:42:35 INFO mapred.JobClient: Launched map tasks=1 12/06/15 15:42:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 12/06/15 15:42:35 INFO mapred.JobClient: File Output Format Counters 12/06/15 15:42:35 INFO mapred.JobClient: Bytes Written=0 12/06/15 15:42:35 INFO mapred.JobClient: FileSystemCounters 12/06/15 15:42:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21489 12/06/15 15:42:35 INFO mapred.JobClient: CFS_BYTES_READ=87 12/06/15 15:42:35 INFO mapred.JobClient: File Input Format Counters 12/06/15 15:42:35 INFO mapred.JobClient: Bytes Read=0 12/06/15 15:42:35 INFO mapred.JobClient: Map-Reduce Framework 12/06/15 15:42:35 INFO mapred.JobClient: Map input records=1 12/06/15 15:42:35 INFO mapred.JobClient: Physical memory (bytes) snapshot=164855808 12/06/15 15:42:35 INFO mapred.JobClient: Spilled Records=0 12/06/15 15:42:35 INFO mapred.JobClient: CPU time spent (ms)=510 12/06/15 15:42:35 INFO mapred.JobClient: Total committed heap usage (bytes)=121241600 12/06/15 15:42:35 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2082869248 12/06/15 15:42:35 INFO mapred.JobClient: Map output records=0 12/06/15 15:42:35 INFO mapred.JobClient: SPLIT_RAW_BYTES=87 12/06/15 15:42:35 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 9.3143 seconds (0 bytes/sec) 12/06/15 15:42:35 INFO mapreduce.ImportJobBase: Retrieved 0 records. 12/06/15 15:42:35 INFO tool.CodeGenTool: Beginning code generation 12/06/15 15:42:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `OrderEntry` AS t LIMIT 1 12/06/15 15:42:35 INFO orm.CompilationManager: HADOOP_HOME is /home/pradeeban/programs/dse-2.1/resources/hadoop/bin/.. Note: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/OrderEntry.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. 12/06/15 15:42:35 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/OrderEntry.jar 12/06/15 15:42:36 INFO manager.DirectMySQLManager: Beginning mysqldump fast path import 12/06/15 15:42:36 INFO mapreduce.ImportJobBase: Beginning import of OrderEntry 12/06/15 15:42:36 INFO mapred.JobClient: Running job: job_201206151241_0009 12/06/15 15:42:37 INFO mapred.JobClient: map 0% reduce 0% 12/06/15 15:42:45 INFO mapred.JobClient: map 100% reduce 0% 12/06/15 15:42:45 INFO mapred.JobClient: Job complete: job_201206151241_0009 12/06/15 15:42:45 INFO mapred.JobClient: Counters: 17 12/06/15 15:42:45 INFO mapred.JobClient: Job Counters 12/06/15 15:42:45 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=6381 12/06/15 15:42:45 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/06/15 15:42:45 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/06/15 15:42:45 INFO mapred.JobClient: Launched map tasks=1 12/06/15 15:42:45 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 12/06/15 15:42:45 INFO mapred.JobClient: File Output Format Counters 12/06/15 15:42:45 INFO mapred.JobClient: Bytes Written=0 12/06/15 15:42:45 INFO mapred.JobClient: FileSystemCounters 12/06/15 15:42:45 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21569 12/06/15 15:42:45 INFO mapred.JobClient: CFS_BYTES_READ=87 12/06/15 15:42:45 INFO mapred.JobClient: File Input Format Counters 12/06/15 15:42:45 INFO mapred.JobClient: Bytes Read=0 12/06/15 15:42:45 INFO mapred.JobClient: Map-Reduce Framework 12/06/15 15:42:45 INFO mapred.JobClient: Map input records=1 12/06/15 15:42:45 INFO mapred.JobClient: Physical memory (bytes) snapshot=137252864 12/06/15 15:42:45 INFO mapred.JobClient: Spilled Records=0 12/06/15 15:42:45 INFO mapred.JobClient: CPU time spent (ms)=520 12/06/15 15:42:45 INFO mapred.JobClient: Total committed heap usage (bytes)=121241600 12/06/15 15:42:45 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2014703616 12/06/15 15:42:45 INFO mapred.JobClient: Map output records=0 12/06/15 15:42:45 INFO mapred.JobClient: SPLIT_RAW_BYTES=87 12/06/15 15:42:45 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 9.2859 seconds (0 bytes/sec) 12/06/15 15:42:45 INFO mapreduce.ImportJobBase: Retrieved 0 records. 12/06/15 15:42:45 INFO tool.CodeGenTool: Beginning code generation 12/06/15 15:42:45 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `OrderItem` AS t LIMIT 1 12/06/15 15:42:45 INFO orm.CompilationManager: HADOOP_HOME is /home/pradeeban/programs/dse-2.1/resources/hadoop/bin/.. Note: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/OrderItem.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. 12/06/15 15:42:45 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/OrderItem.jar 12/06/15 15:42:46 WARN manager.CatalogQueryManager: The table OrderItem contains a multi-column primary key. Sqoop will default to the column orderNumber only for this job. 12/06/15 15:42:46 INFO manager.DirectMySQLManager: Beginning mysqldump fast path import 12/06/15 15:42:46 INFO mapreduce.ImportJobBase: Beginning import of OrderItem 12/06/15 15:42:46 INFO mapred.JobClient: Running job: job_201206151241_0010 12/06/15 15:42:47 INFO mapred.JobClient: map 0% reduce 0% 12/06/15 15:42:55 INFO mapred.JobClient: map 100% reduce 0% 12/06/15 15:42:55 INFO mapred.JobClient: Job complete: job_201206151241_0010 12/06/15 15:42:55 INFO mapred.JobClient: Counters: 17 12/06/15 15:42:55 INFO mapred.JobClient: Job Counters 12/06/15 15:42:55 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=5949 12/06/15 15:42:55 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/06/15 15:42:55 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/06/15 15:42:55 INFO mapred.JobClient: Launched map tasks=1 12/06/15 15:42:55 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 12/06/15 15:42:55 INFO mapred.JobClient: File Output Format Counters 12/06/15 15:42:55 INFO mapred.JobClient: Bytes Written=0 12/06/15 15:42:55 INFO mapred.JobClient: FileSystemCounters 12/06/15 15:42:55 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21524 12/06/15 15:42:55 INFO mapred.JobClient: CFS_BYTES_READ=87 12/06/15 15:42:55 INFO mapred.JobClient: File Input Format Counters 12/06/15 15:42:55 INFO mapred.JobClient: Bytes Read=0 12/06/15 15:42:55 INFO mapred.JobClient: Map-Reduce Framework 12/06/15 15:42:55 INFO mapred.JobClient: Map input records=1 12/06/15 15:42:55 INFO mapred.JobClient: Physical memory (bytes) snapshot=116674560 12/06/15 15:42:55 INFO mapred.JobClient: Spilled Records=0 12/06/15 15:42:55 INFO mapred.JobClient: CPU time spent (ms)=590 12/06/15 15:42:55 INFO mapred.JobClient: Total committed heap usage (bytes)=121241600 12/06/15 15:42:55 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2014703616 12/06/15 15:42:55 INFO mapred.JobClient: Map output records=0 12/06/15 15:42:55 INFO mapred.JobClient: SPLIT_RAW_BYTES=87 12/06/15 15:42:55 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 9.2539 seconds (0 bytes/sec) 12/06/15 15:42:55 INFO mapreduce.ImportJobBase: Retrieved 0 records. 12/06/15 15:42:55 INFO tool.CodeGenTool: Beginning code generation 12/06/15 15:42:55 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `Payment` AS t LIMIT 1 12/06/15 15:42:55 INFO orm.CompilationManager: HADOOP_HOME is /home/pradeeban/programs/dse-2.1/resources/hadoop/bin/.. Note: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/Payment.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. 12/06/15 15:42:55 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/Payment.jar 12/06/15 15:42:56 WARN manager.CatalogQueryManager: The table Payment contains a multi-column primary key. Sqoop will default to the column orderNumber only for this job. 12/06/15 15:42:56 INFO manager.DirectMySQLManager: Beginning mysqldump fast path import 12/06/15 15:42:56 INFO mapreduce.ImportJobBase: Beginning import of Payment 12/06/15 15:42:56 INFO mapred.JobClient: Running job: job_201206151241_0011 12/06/15 15:42:57 INFO mapred.JobClient: map 0% reduce 0% 12/06/15 15:43:05 INFO mapred.JobClient: map 100% reduce 0% 12/06/15 15:43:05 INFO mapred.JobClient: Job complete: job_201206151241_0011 12/06/15 15:43:05 INFO mapred.JobClient: Counters: 17 12/06/15 15:43:05 INFO mapred.JobClient: Job Counters 12/06/15 15:43:05 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=5914 12/06/15 15:43:05 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/06/15 15:43:05 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/06/15 15:43:05 INFO mapred.JobClient: Launched map tasks=1 12/06/15 15:43:05 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 12/06/15 15:43:05 INFO mapred.JobClient: File Output Format Counters 12/06/15 15:43:05 INFO mapred.JobClient: Bytes Written=0 12/06/15 15:43:05 INFO mapred.JobClient: FileSystemCounters 12/06/15 15:43:05 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21518 12/06/15 15:43:05 INFO mapred.JobClient: CFS_BYTES_READ=87 12/06/15 15:43:05 INFO mapred.JobClient: File Input Format Counters 12/06/15 15:43:05 INFO mapred.JobClient: Bytes Read=0 12/06/15 15:43:05 INFO mapred.JobClient: Map-Reduce Framework 12/06/15 15:43:05 INFO mapred.JobClient: Map input records=1 12/06/15 15:43:05 INFO mapred.JobClient: Physical memory (bytes) snapshot=137998336 12/06/15 15:43:05 INFO mapred.JobClient: Spilled Records=0 12/06/15 15:43:05 INFO mapred.JobClient: CPU time spent (ms)=520 12/06/15 15:43:05 INFO mapred.JobClient: Total committed heap usage (bytes)=121241600 12/06/15 15:43:05 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2082865152 12/06/15 15:43:05 INFO mapred.JobClient: Map output records=0 12/06/15 15:43:05 INFO mapred.JobClient: SPLIT_RAW_BYTES=87 12/06/15 15:43:05 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 9.2642 seconds (0 bytes/sec) 12/06/15 15:43:05 INFO mapreduce.ImportJobBase: Retrieved 0 records. 12/06/15 15:43:05 INFO tool.CodeGenTool: Beginning code generation 12/06/15 15:43:05 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `Product` AS t LIMIT 1 12/06/15 15:43:06 INFO orm.CompilationManager: HADOOP_HOME is /home/pradeeban/programs/dse-2.1/resources/hadoop/bin/.. Note: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/Product.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. 12/06/15 15:43:06 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/Product.jar 12/06/15 15:43:06 INFO manager.DirectMySQLManager: Beginning mysqldump fast path import 12/06/15 15:43:06 INFO mapreduce.ImportJobBase: Beginning import of Product 12/06/15 15:43:07 INFO mapred.JobClient: Running job: job_201206151241_0012 12/06/15 15:43:08 INFO mapred.JobClient: map 0% reduce 0% 12/06/15 15:43:16 INFO mapred.JobClient: map 100% reduce 0% 12/06/15 15:43:16 INFO mapred.JobClient: Job complete: job_201206151241_0012 12/06/15 15:43:16 INFO mapred.JobClient: Counters: 18 12/06/15 15:43:16 INFO mapred.JobClient: Job Counters 12/06/15 15:43:16 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=5961 12/06/15 15:43:16 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/06/15 15:43:16 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/06/15 15:43:16 INFO mapred.JobClient: Launched map tasks=1 12/06/15 15:43:16 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 12/06/15 15:43:16 INFO mapred.JobClient: File Output Format Counters 12/06/15 15:43:16 INFO mapred.JobClient: Bytes Written=248262 12/06/15 15:43:16 INFO mapred.JobClient: FileSystemCounters 12/06/15 15:43:16 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21527 12/06/15 15:43:16 INFO mapred.JobClient: CFS_BYTES_WRITTEN=248262 12/06/15 15:43:16 INFO mapred.JobClient: CFS_BYTES_READ=87 12/06/15 15:43:16 INFO mapred.JobClient: File Input Format Counters 12/06/15 15:43:16 INFO mapred.JobClient: Bytes Read=0 12/06/15 15:43:16 INFO mapred.JobClient: Map-Reduce Framework 12/06/15 15:43:16 INFO mapred.JobClient: Map input records=1 12/06/15 15:43:16 INFO mapred.JobClient: Physical memory (bytes) snapshot=144871424 12/06/15 15:43:16 INFO mapred.JobClient: Spilled Records=0 12/06/15 15:43:16 INFO mapred.JobClient: CPU time spent (ms)=1030 12/06/15 15:43:16 INFO mapred.JobClient: Total committed heap usage (bytes)=121241600 12/06/15 15:43:16 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2085318656 12/06/15 15:43:16 INFO mapred.JobClient: Map output records=300 12/06/15 15:43:16 INFO mapred.JobClient: SPLIT_RAW_BYTES=87 12/06/15 15:43:16 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 9.2613 seconds (0 bytes/sec) 12/06/15 15:43:16 INFO mapreduce.ImportJobBase: Retrieved 300 records. I found DataStax an interesting project to explore more. I have blogged on the issues that I faced on this as a learner, and how easily can they be fixed - Issues that you may encounter during the migration to Cassandra using DataStax/Sqoop and the fixes.
July 16, 2012
by Pradeeban Kathiravelu
· 20,411 Views · 2 Likes
article thumbnail
Working with MongoDB MultiMaster
Learn all about working with MondoDB multimaster.
July 11, 2012
by Rick Copeland
· 28,169 Views · 2 Likes
article thumbnail
5 Things You Should Check Now to Improve PHP Web Performance
We all know how financially important it is for your app’s server architecture to handle peaks of load. This article discusses 5 tips for improving PHP Web performance.
July 11, 2012
by Gonzalo Ayuso
· 263,803 Views · 2 Likes
article thumbnail
The Activiti Performance Showdown
the question everybody always asks when they learn about activiti, is as old as software development itself: “how does it perform?”. up till now, when you would ask me that same question, i would tell you about how activiti minimizes database access in every way possible, how we break down the process structure into an ‘execution tree’ which allows for fast queries or how we leverage ten years of workflow framework development knowledge. you know, trying to get around the question without answering it. we knew it is fast, because of the theoretical foundation upon which we have built it. but now we have proof: real numbers …. yes, it’s going to be a lengthy post. but trust me, it’ll be worth your time! disclaimer: performance benchmarks are hard. really hard. different machines, slight different test setup … very small things can change the results seriously. the numbers here are only to prove that the activiti engine has a very minimal overhead, while also integrating very easily into the java eco-system and offering bpmn 2.0 process execution. the activiti benchmark project to test process execution overhead of the activiti engine, i created a little side project on github: https://github.com/jbarrez/activiti-benchmark the project contains currently 9 test processes, which we’ll analyse below. the logic in the project is pretty straightforward: a process engine is created for each test run each of the processes are sequentially executed on this process engine, using a threadpool from 1 up to 10 threads. all the processes are thrown into a bag, of which a number of random executions are drawn. all the results are collected and a html report with some nice charts are generated to run the benchmark, simply follow the instructions on the github page to build and execute the jar. benchmark results the test machine i used for the results is my (fairly old) desktop machine: amd phenom ii x4 940 3.0ghz, 8 gb 800mhz ram and an old-skool 7200 rpm hd running ubuntu 11.10. the database used for the test runs on the same machine on which the tests also run. so keep in mind that in a ‘real’ server environment the results could even be better! the benchmark project i mentioned above, was executed on a default ubuntu mysql 5 database. i just switched to the ‘large.cnf’ setting (which throws more ram at the db and stuff like that) instead the default config. each of the test processes ran for 2500 times, using a threadpool going from one to ten threads . in simpleton language: 2500 process executions using just one thread, 2500 threads using two threads, 2500 process executions using three … yeah, you get it. each benchmark run was done using a ‘default’ activiti process engine. this basically means a ‘regular’ standalone activiti engine, created in plain java. each benchmark run was also done in a ‘spring’ config. here, the process engine was constructed by wrapping it in the factory bean, the datasource is a spring datasource and also the transactions and connection pool is managed by spring (i’m actually using a tweaked bonecp threadpool) each benchmark run was executed with history on the default history level (ie. ‘audit’) and without history enabled (ie. history level ‘none’) . the processes are in detail analyzed in the sections below, but here are the integral results of the test runs already: activiti 5.9 – mysql – default – history enabled activiti 5.9 – mysql – default – history disabled activiti 5.9 – mysql – spring – history enabled activiti 5.9 – mysql – spring – history disabled i ran all the tests using the latest public release of activiti, being activiti 5.9. however, my test runs brought some potential performance fixes to the surface (i also ran the benchmark project through a profiler). it was quickly clear that most of the process execution time was done actually cleaning up when a process ended. basically, more than often queries were fired which were not necessary if we would save some more state in our execution tree. i sat together with daniel meyer from camunda and my colleague frederik heremans, and they’ve managed to commit fixes for this! as such, the current trunk of activiti, being activiti 5.10-snapshot at the moment, is significantly faster than 5.9 . activiti 5.10 – mysql – default – history enabled activiti 5.10 – mysql – default – history disabled activiti 5.10 – mysql – spring – history enabled activiti 5.10 – mysql – spring – history disabled from a high-level perspective (scroll down for detailed analysis), there are a few things to note: i had expected some difference between the default and spring config, due to the more ‘professional’ connection pool being used. however, the results for both environments are quite alike. sometimes the default is faster, sometimes spring. it’s hard to really find a pattern. as such, i omitted the spring results in the detailed analyses below. the best average timings are most of the times found when using four threads to execute the processes . this is probably due to having a quad-core machine. the best throughput numbers are most of the times found when using eight threads to execute the processes. i can only assume that is also has something to do with having a quad-core machine. when the number of threads in the threadpool go up, the throughput (processes executed / second) goes up, both it has a negative effect on the average time. certainly with more than six or seven threads, you see this effect very clear. this basically means that while the processes on itself take a little longer to execute, but due to the multiple threads you can execute more of these ‘slower’ processes in the same amount of time. enabling history does have an impact. often, enabling history will double execution time. this is logical, given that many extra records are inserted when history is on the default level (ie. ‘audit’). there was one last test i ran, just out of curiosity: running the best performing setting on an oracle xe 11.2 database. the oracle xe is a free version of the ‘real’ oracle database. no matter how hard, i tried, i couldn’t get it decently running on ubuntu. as such, i used an old windows xp install on that same machine. however, the os is 32 bit, wich means the system only has 3.2 of the 8gb of ram available. here are the results: activiti 5.10 – oracle on windows – default – history disabled the results speak for itself. oracle blows away any of the (single-threaded) results on mysql (and they are already very fast!). however, when going multi-threaded it is far worse than any of the mysql results. my guess is that these are due to the limitations of the xe version : only one cpu is used, only 1 gb of ram, etc. i would really like to run these test on a real oracle-managed-by-a-real-dba … feel free to contact me if you are interested ! in the next sections, we will take a detailed look into the performance numbers of each of the test processes. an excel sheet containing all the the numbers and charts below can be downloaded for yourself . process 1: the bare micromum (one transaction) the first process is not a very interesting one, business-wise at least. after starting the process, the end is immediately reached. not very useful on itself, but its numbers learn us one essential thing: the bare overhead of the activiti engine. here are the average timings: this process runs in a single transaction, which means that nothing is saved to the database when the history is disabled due to activiti’s optimizations. with history enabled, you’ll basically get the cost for inserting one row into the historical process instance table, which is around 4.44 ms here. it is also clear that our fix for activiti 5.10 has an enormous impact here. in the previous version, 99% of the time was spent in the cleanup check of the process. take a look at the best result here: 0.47 ms when using 4 threads to execute 2500 runs of this process. that’s only half a millisecond ! it’s fair to say that the activiti engine overhead is extremely small. the throughput numbers are equally impressive: in the best case here, 8741 processes are executed. per second. by the time you arrive here reading the post, you could have executed a few millions of this process . you can also see that there is little difference between 4 or 8 threads here. most of the execution time here is cpu time, and no potential collisions such as waiting for a database lock happens here. in these numbers, you can also easily see that the oracle xe doesn’t scale well with multiple threads (which is explained above). you will see the same behavior in the following results. process 2: the same, but a bit longer (one transaction) this process is pretty similar to the previous one. we have again only one transaction. after the process is started, we pass through seven no-op passthrough activities before reaching the end. some things to note here: the best result (again 4 threads, with history disabled) is actually better than the simpler previous process. but also note that the single threaded execution is a tad slower. this means that the process on itself is a bit slower, which is logical as is has more activities. but using more threads and having more activities in the process does allow for more potential interleaving. in the previous case, the thread was barely born before it was killed again. the difference between history enabled/disabled is bigger than the previous process. this is logical, as more history is written here (for each activity one record in the database). again, activiti 5.10 is far more superior to activiti 5.9. the throughput numbers follow these observations: there is more opportunity to use threading here. the best result lingers around 12000 process execution per second . again, it demonstrates the very lightweight execution of the activiti engine. process 3: parallelism in one transaction this process executes a parallel gateway that forks and one that joins in the same transaction. you would expect something along the lines of the previous results, but you’d be surprised: comparing these numbers with the previous process, you see that execution is slower. so why is this process slower, even if it has less activities? the reason lies with how the parallel gateway is implemented, especially the join behavior. the hard part, implementation-wise, is that you need to cope with the situation when multiple executions arrive at the join. to make sure that the behavior is atomic, we internally do some locking and fetch all child executions in the execution tree to find out whether the join activates or not. so it is quite a ‘costly’ operation, compared to the ‘regular’ activities. do mind, we’re talking here about only 5 ms single threaded and 3.59 ms in the best case for mysql . given the functionality that is required for implementing the parallel gateway functionality, this is peanuts if you’d ask me. the throughput numbers: this is the first process which actually contains some ‘logic’. in the best case above, it means 1112 processes can be executed in a second. pretty impressive, if you’d ask me! . process 4: now we’re getting somewhere (one transaction) this process already looks like something you’d see when modeling real business processes. we’re still running it in one database transaction though, as all the activities are automatic passthroughs. here we also have two forks and two joins. take a look at the lowest number: 6.88 ms on oracle when running with one thread. that’s freaking fast , taking in account all that is happening here. the history numbers are at least doubled here (activiti 5.10), which makes sense because there is quite a bit of activity audit logging going on here. you can also see that this causes to have a higher average time for four threads here, which is probably due to the implementation of the joining. if you know a bit about activiti internals, you’ll understand this means there are quite a bit of executions in the execution tree. we have one big concurrent root, but also multiple children which are sometimes also concurrent roots. but while the average time rises, the throughput definitely benefits: running this process with eight threads, allows you to do 411 runs of this process in a single second. there is also something peculiar here: the oracle database performs better with more thread concurrency. this is completely contrary with all other measurements, where oracle is always slower in that environment (see above for explanation). i assume it has something to do with the internal locking and forced update we are applying when forking/joining, which is better handled by oracle it seems. process 5: adding some java logic (single transaction) i added this process to see the influence of adding a java service task in a process. in this process, the first activity generates a random value, stores it as a process variable and then goes up or down in the process depending on the random value. the chance is about 50/50 to go up or down. the average timings are very very good. actually, the results are in the same range as those of process 1 and 2 above (which had no activities or only automatic passthroughs). this means that the overhead of integrating java logic into your process is nearly non-existant (nothing is of course for free). of course, you can still write slow code in that logic, but you can’t blame the activiti engine for that throughput numbers are comparable to those of process 1 and 2: very, very high. in the best case here, more than 9000 processes are executed per second . that indeed also means 9000 invocations of your own java logic. process 6, 7 and 8: adding wait states and transactions the previous processes demonstrated us the bare overhead of the activiti engine. here, we’ll take a look at how wait states and multiple transactions have influence on performance. for this, i added three test processes which contain user tasks. for each user task, the engine commits the current transaction and returns the thread to the client. since the results are pretty much compatible for these processes, we’re grouping them here. these are the processes: here are the average timings results, in order of the processes above. for the first process, containing just one user task: it is clear that having wait states and multiple transaction does have influence on the performance. this is also logical: before, the engine could optimize by not inserting the runtime state into the database, because the process was finished in one transaction. now, the whole state, meaning the pointers to where you are currently, need to be saved into the database. the process could be ‘sleeping’ like this for many days, months, years now …. the activiti engine doesn’t hold it into memory now anymore, and it is freed to give its full attention to other processes. if you check the results of the process with only one user task, you can see that in the best case (oracle, single thread – the 4 threads on mysql is pretty close) this is done in 6.27ms . this is really fast, if you take in account we have a few inserts (the execution tree, the task), a few updates (the execution tree) and deletes (cleaning up) going on here. the second process here, with 7 user tasks: the second chart learns us that logically, more transactions means more time. in the best case here the process is done in 32.12 ms . that is for seven transactions, which gives 4.6 ms for each transactions. so it is clear that average time scales in a linearly way when adding wait states. this makes of course sense, because transactions aren’t free. also note that enabling history does add quite some overhead here. this is due to having the history level set to ‘audit’, which stores all the user task information in the history tables. this is also noticeable from the difference between activiti 5.9 with history disabled and activiti 5.10 with history enabled: this is a rare case where activiti 5.10 with history enabled is slower than 5.9 with history disabled. but it is logical, given the volume of history stored here. and the third process learns us how user tasks and parallel gateways interact: the third chart learns us not much new. we have two user tasks now, and the more ‘expensive’ fork/join (see above). the average timings are how we expected them. the throughput charts are as you would expect given the average timings. between 70 and 250 processes per second. aw yeah! to save some space, you’ll need to click them to enlarge: process 9: so what about scopes? for the last process, we’ll take a look at ‘scopes’. a ‘scope’ is how we call it internally in the engine, and it has to do with variable visibility, relationships between the pointers indicating process state, event catching, etc. bpmn 2.0 has quite some cases for those scopes, for example with embedded subprocesses as shown in the process here. basically, every subprocess can have boundary events (catching an error, a message, etc) that only are applied on its internal activities when it’s scope is active. without going into too much technical details: to get scopes implemented in the correct way, you need some not so trivial logic. the example process here has 4 subprocesses, nested in each other. the inner process is using concurrency, which is a scope on itself again for the activiti engine. there are also two user tasks here, so that means two transactions. so let’s see how it performs: you can clearly see the big difference between activiti 5.9 and 5.10. scopes are indeed an area where the fixes around the ‘process cleanup’ at the end have a huge benefit, as many execution objects are created and persisted to represent the many different scopes. single threaded performance is not so good on activiti 5.9. luckily, as you can see from the gap between the blue and the red bars, those scopes do allow for high concurrency. the numbers of oracle, combined with the multi-threaded results of the 5.10 tests, do prove that scopes are now efficiently handled by the engine. the throughput charts prove that the process nicely scales with more threads, as you can see by the big gap between the red and green line in the second last block. in the best case, 64 processes of this more complex process are handled by the engine. random execution if you have already clicked on the full reports at the beginning of the post, you probably have noticed also random execution is tested for each environment. in this setting, 2500 process executions were done, both the process was randomly chosen. as shown in those reports this meant that over 2500 executions, each process was executed almost the same number of times (normal distribution). this last chart shows the best setting (activiti 5.10, history disabled) and how the throughput of those random process executions goes when adding more threads: as we’ve seen in many of the test above, once passed four threads things don’t change that much anymore. the numbers (167 processes/second) prove that in a realistic situation (ie. multiple processes executing at the same time), the activiti engine nicely scales up. conclusion the average timing charts show two things clearly: the activiti engine is fast and overhead is minimal ! the difference between history enabled or disabled is definitely noticeably. sometimes it comes even down to half the time needed. all history tests were done using the ‘audit’ level, but there is a simpler history level (‘activity’) which might be good enough for the use case. activiti is very flexible in history configuration, and you can tweak the history level for each process specifically. so do think about the level your process needs to have, if it needs to have history at all ! the throughput charts prove that the engine scales very well when more threads are available (ie. any modern application server). activiti is well designed to be used in high-throughput and availability (clustered) architectures . as i said in the introduction, the numbers are what they are: just numbers. my main point which i want to conclude here, is that the activiti engine is extremely lightweight. the overhead of using activiti for automating your business processes is small. in general, if you need to automate your business processes or workflows, you want top-notch integration with any java system and you like all of that fast and scalable … look no further!
July 10, 2012
by
· 11,079 Views
article thumbnail
Everything You Need To Know About Couchbase Architecture
After receiving a lot of good feedback and comment on my last blog on MongoDb, I was encouraged to do another deep dive on another popular document oriented db; Couchbase. I have been a long-time fan CouchDb and has wrote a blog on it many years ago. After it merges with Membase, I am very excited to take a deep look into it again. Couchbase is the merge of two popular NOSQL technologies: Membase, which provides persistence, replication, sharding to the high performance memcached technology CouchDB, which pioneers the document oriented model based on JSON Like other NOSQL technologies, both Membase and CouchDB are built from the ground up on a highly distributed architecture, with data shard across machines in a cluster. Built around the Memcached protocol, Membase provides an easy migration to existing Memcached users who want to add persistence, sharding and fault resilience on their familiar Memcached model. On the other hand, CouchDB provides first class support for storing JSON documents as well as a simple RESTful API to access them. Underneath, CouchDB also has a highly tuned storage engine that is optimized for both update transaction as well as query processing. Taking the best of both technologies, Membase is well-positioned in the NOSQL marketplace. Programming model Couchbase provides client libraries for different programming languages such as Java / .NET / PHP / Ruby / C / Python / Node.js For read, Couchbase provides a key-based lookup mechanism where the client is expected to provide the key, and only the server hosting the data (with that key) will be contacted. Couchbase also provides a query mechanism to retrieve data where the client provides a query (for example, range based on some secondary key) as well as the view (basically the index). The query will be broadcasted to all servers in the cluster and the result will be merged and sent back to the client. For write, Couchbase provides a key-based update mechanism where the client sends in an updated document with the key (as doc id). When handling write request, the server will return to client’s write request as soon as the data is stored in RAM on the active server, which offers the lowest latency for write requests. Following is the core API that Couchbase offers. (in an abstract sense) # Get a document by key doc = get(key) # Modify a document, notice the whole document # need to be passed in set(key, doc) # Modify a document when no one has modified it # since my last read casVersion = doc.getCas() cas(key, casVersion, changedDoc) # Create a new document, with an expiration time # after which the document will be deleted addIfNotExist(key, doc, timeToLive) # Delete a document delete(key) # When the value is an integer, increment the integer increment(key) # When the value is an integer, decrement the integer decrement(key) # When the value is an opaque byte array, append more # data into existing value append(key, newData) # Query the data results = query(viewName, queryParameters) In Couchbase, document is the unit of manipulation. Currently Couchbase doesn't support server-side execution of custom logic. Couchbase server is basically a passive store and unlike other document oriented DB, Couchbase doesn't support field-level modification. In case of modifying documents, client need to retrieve documents by its key, do the modification locally and then send back the whole (modified) document back to the server. This design tradeoff network bandwidth (since more data will be transferred across the network) for CPU (now CPU load shift to client). Couchbase currently doesn't support bulk modification based on a condition matching. Modification happens only in a per document basis. (client will save the modified document one at a time). Transaction Model Similar to many NOSQL databases, Couchbase’s transaction model is primitive as compared to RDBMS. Atomicity is guaranteed at a single document and transactions that span update of multiple documents are unsupported. To provide necessary isolation for concurrent access, Couchbase provides a CAS (compare and swap) mechanism which works as follows … When the client retrieves a document, a CAS ID (equivalent to a revision number) is attached to it. While the client is manipulating the retrieved document locally, another client may modify this document. When this happens, the CAS ID of the document at the server will be incremented. Now, when the original client submits its modification to the server, it can attach the original CAS ID in its request. The server will verify this ID with the actual ID in the server. If they differ, the document has been updated in between and the server will not apply the update. The original client will re-read the document (which now has a newer ID) and re-submit its modification. Couchbase also provides a locking mechanism for clients to coordinate their access to documents. Clients can request a LOCK on the document it intends to modify, update the documents and then releases the LOCK. To prevent a deadlock situation, each LOCK grant has a timeout so it will automatically be released after a period of time. Deployment Architecture In a typical setting, a Couchbase DB resides in a server clusters involving multiple machines. Client library will connect to the appropriate servers to access the data. Each machine contains a number of daemon processes which provides data access as well as management functions. The data server, written in C/C++, is responsible to handle get/set/delete request from client. The Management server, written in Erlang, is responsible to handle the query traffic from client, as well as manage the configuration and communicate with other member nodes in the cluster. Virtual Buckets The basic unit of data storage in Couchbase DB is a JSON document (or primitive data type such as int and byte array) which is associated with a key. The overall key space is partitioned into 1024 logical storage unit called "virtual buckets" (or vBucket). vBucket are distributed across machines within the cluster via a map that is shared among servers in the cluster as well as the client library. High availability is achieved through data replication at the vBucket level. Currently Couchbase supports one active vBucket zero or more standby replicas hosted in other machines. Curremtly the standby server are idle and not serving any client request. In future version of Couchbase, the standby replica will be able to serve read request. Load balancing in Couchbase is achieved as follows: Keys are uniformly distributed based on the hash function When machines are added and removed in the cluster. The administrator can request a redistribution of vBucket so that data are evenly spread across physical machines. Management Server Management server performs the management function and co-ordinate the other nodes within the cluster. It includes the following monitoring and administration functions Heartbeat: A watchdog process periodically communicates with all member nodes within the same cluster to provide Couchbase Server health updates. Process monitor: This subsystem monitors execution of the local data manager, restarting failed processes as required and provide status information to the heartbeat module. Configuration manager: Each Couchbase Server node shares a cluster-wide configuration which contains the member nodes within the cluster, a vBucket map. The configuration manager pull this config from other member nodes at bootup time. Within a cluster, one node’s Management Server will be elected as the leader which performs the following cluster-wide management function Controls the distribution of vBuckets among other nodes and initiate vBucket migration Orchestrates the failover and update the configuration manager of member nodes If the leader node crashes, a new leader will be elected from surviving members in the cluster. When a machine in the cluster has crashed, the leader will detect that and notify member machines in the cluster that all vBuckets hosted in the crashed machine is dead. After getting this signal, machines hosting the corresponding vBucket replica will set the vBucket status as “active”. The vBucket/server map is updated and eventually propagated to the client lib. Notice that at this moment, the replication level of the vBucket will be reduced. Couchbase doesn’t automatically re-create new replicas which will cause data copying traffic. Administrator can issue a command to explicitly initiate a data rebalancing. The crashed machine, after reboot can rejoin the cluster. At this moment, all the data it stores previously will be completely discard and the machine will be treated as a brand new empty machine. As more machines are put into the cluster (for scaling out), vBucket should be redistributed to achieve a load balance. This is currently triggered by an explicit command from the administrator. Once receive the “rebalance” command, the leader will compute the new provisional map which has the balanced distribution of vBuckets and send this provisional map to all members of the cluster. To compute the vBucket map and migration plan, the leader attempts the following objectives: Evenly distribute the number of active vBuckets and replica vBuckets among member nodes. Place the active copy and each replicas in physically separated nodes. Spread the replica vBucket as wide as possible among other member nodes. Minimize the amount of data migration Orchestrate the steps of replica redistribution so no node or network will be overwhelmed by the replica migration. Once the vBucket maps is determined, the leader will pass the redistribution map to each member in the cluster and coordinate the steps of vBucket migration. The actual data transfer happens directly between the origination node to the destination node. Notice that since we have generally more vBuckets than machines. The workload of migration will be evenly distributed automatically. For example, when new machines are added into the clusters, all existing machines will migrate some portion of its vBucket to the new machines. There is no single bottleneck in the cluster. Throughput the migration and redistribution of vBucket among servers, the life cycle of a vBucket in a server will be in one of the following states “Active”: means the server is hosting the vBucket is ready to handle both read and write request “Replica”: means the server is hosting the a copy of the vBucket that may be slightly out of date but can take read request that can tolerate some degree of outdate. “Pending”: means the server is hosting a copy that is in a critical transitional state. The server cannot take either read or write request at this moment. “Dead”: means the server is no longer responsible for the vBucket and will not take either read or write request anymore. Data Server Data server implements the memcached APIs such as get, set, delete, append, prepend, etc. It contains the following key datastructure: One in-memory hashtable (key by doc id) for the corresponding vBucket hosted. The hashtable acts as both a metadata for all documents as well as a cache for the document content. Maintain the entry gives a quick way to detect whether the document exists on disk. To support async write, there is a checkpoint linkedlist per vBucket holding the doc id of modified documents that hasn't been flushed to disk or replicated to the replica. To handle a "GET" request Data server routes the request to the corresponding ep-engine responsible for the vBucket. The ep-engine will lookup the document id from the in-memory hastable. If the document content is found in cache (stored in the value of the hashtable), it will be returned. Otherwise, a background disk fetch task will be created and queued into the RO dispatcher queue. The RO dispatcher then reads the value from the underlying storage engine and populates the corresponding entry in the vbucket hash table. Finally, the notification thread notifies the disk fetch completion to the memcached pending connection, so that the memcached worker thread can revisit the engine to process a get request. To handle a "SET" request, a success response will be returned to the calling client once the updated document has been put into the in-memory hashtable with a write request put into the checkpoint buffer. Later on the Flusher thread will pickup the outstanding write request from each checkpoint buffer, lookup the corresponding document content from the hashtable and write it out to the storage engine. Of course, data can be lost if the server crashes before the data has been replicated to another server and/or persisted. If the client requires a high data availability across different crashes, it can issue a subsequent observe() call which blocks on the condition that the server persist data on disk, or the server has replicated the data to another server (and get its ACK). Overall speaking, the client has various options to tradeoff data integrity with throughput. Hashtable Management To synchronize accesses to a vbucket hash table, each incoming thread needs to acquire a lock before accessing a key region of the hash table. There are multiple locks per vbucket hash table, each of which is responsible for controlling exclusive accesses to a certain ket region on that hash table. The number of regions of a hash table can grow dynamically as more documents are inserted into the hash table. To control the memory size of the hashtable, Item pager thread will monitor the memory utilization of the hashtable. Once a high watermark is reached, it will initiate an eviction process to remove certain document content from the hashtable. Only entries that is not referenced by entries in the checkpoint buffer can be evicted because otherwise the outstanding update (which only exists in hashtable but not persisted) will be lost. After eviction, the entry of the document still remains in the hashtable; only the document content of the document will be removed from memory but the metadata is still there. The eviction process stops after reaching the low watermark. The high / low water mark is determined by the bucket memory quota. By default, the high water mark is set to 75% of bucket quota, while the low water mark is set to 60% of bucket quota. These water marks can be configurable at runtime. In CouchDb, every document is associated with an expiration time and will be deleted once it is expired. Expiry pager is responsible for tracking and removing expired document from both the hashtable as well as the storage engine (by scheduling a delete operation). Checkpoint Manager Checkpoint manager is responsible to recycle the checkpoint buffer, which holds the outstanding update request, consumed by the two downstream processes, Flusher and TAP replicator. When all the request in the checkpoint buffer has been processed, the checkpoint buffer will be deleted and a new one will be created. TAP Replicator TAP replicator is responsible to handle vBucket migration as well as vBucket replication from active server to replica server. It does this by propagating the latest modified document to the corresponding replica server. At the time a replica vBucket is established, the entire vBucket need to be copied from the active server to the empty destination replica server as follows The in-memory hashtable at the active server will be transferred to the replica server. Notice that during this period, some data may be updated and therefore the data set transfered to the replica can be inconsistent (some are the latest and some are outdated). Nevertheless, all updates happen after the start of transfer is tracked in the checkpoint buffer. Therefore, after the in-memory hashtable transferred is completed, the TAP replicator can pickup those updates from the checkpoint buffer. This ensures the latest versioned of changed documents are sent to the replica, and hence fix the inconsistency. However the hashtable cache doesn’t contain all the document content. Data also need to be read from the vBucket file and send to the replica. Notice that during this period, update of vBucket will happen in active server. However, since the file is appended only, subsequent data update won’t interfere the vBucket copying process. After the replica server has caught up, subsequent update at the active server will be available at its checkpoint buffer which will be pickup by the TAP replicator and send to the replica server. CouchDB Storage Structure Data server defines an interface where different storage structure can be plugged-in. Currently it supports both a SQLite DB as well as CouchDB. Here we describe the details of CouchDb, which provides a super high performance storage mechanism underneath the Couchbase technology. Under the CouchDB structure, there will be one file per vBucket. Data are written to this file in an append-only manner, which enables Couchbase to do mostly sequential writes for update, and provide the most optimized access patterns for disk I/O. This unique storage structure attributes to Couchbase’s fast on-disk performance for write-intensive applications. The following diagram illustrate the storage model and how it is modified by 3 batch updates (notice that since updates are asynchronous, it is perform by "Flusher" thread in batches). The Flusher thread works as follows: 1) Pick up all pending write request from the dirty queue and de-duplicate multiple update request to the same document. 2) Sort each request (by key) into corresponding vBucket and open the corresponding file 3) Append the following into the vBucket file (in the following contiguous sequence) All document contents in such write request batch. Each document will be written as [length, crc, content] one after one sequentially. The index that stores the mapping from document id to the document’s position on disk (called the BTree by-id) The index that stores the mapping from update sequence number to the document’s position on disk. (called the BTree by-seq) The by-id index plays an important role for looking up the document by its id. It is organized as a B-Tree where each node contains a key range. To lookup a document by id, we just need to start from the header (which is the end of the file), transfer to the root BTree node of the by-id index, and then further traverse to the leaf BTree node that contains the pointer to the actual document position on disk. During the write, the similar mechanism is used to trace back to the corresponding BTree node that contains the id of the modified documents. Notice that in the append-only model, update is not happening in-place, instead we located the existing location and copy it over by appending. In other words, the modified BTree node will be need to be copied over and modified and finally paste to the end of file, and then its parent need to be modified to point to the new location, which triggers the parents to be copied over and paste to the end of file. Same happens to its parents’ parent and eventually all the way to the root node of the BTree. The disk seek can be at the O(logN) complexity. The by-seq index is used to keep track of the update sequence of lived documents and is used for asynchronous catchup purposes. When a document is created, modified or deleted, a sequence number is added to the by-seq btree and the previous seq node will be deleted. Therefore, for cross-site replication, view index update and compaction, we can quickly locate all the lived documents in the order of their update sequence. When a vBucket replicator asks for the list of update since a particular time, it provides the last sequence number in previous update, the system will then scan through the by-seq BTree node to locate all the document that has sequence number larger than that, which effectively includes all the document that has been modified since the last replication. As time goes by, certain data becomes garbage (see the grey-out region above) and become unreachable in the file. Therefore, we need a garbage collection mechanism to clean up the garbage. To trigger this process, the by-id and by-seq B-Tree node will keep track of the data size of lived documents (those that is not garbage) under its substree. Therefore, by examining the root BTree node, we can determine the size of all lived documents within the vBucket. When the ratio of actual size and vBucket file size fall below a certain threshold, a compaction process will be triggered whose job is to open the vBucket file and copy the survived data to another file. Technically, the compaction process opens the file and read the by-seq BTree at the end of the file. It traces the Btree all the way to the leaf node and copy the corresponding document content to the new file. The compaction process happens while the vBucket is being updated. However, since the file is appended only, new changes are recorded after the BTree root that the compaction has opened, so subsequent data update won’t interfere with the compaction process. When the compaction is completed, the system need to copy over the data that was appended since the beginning of the compaction to the new file. View Index Structure Unlike most indexing structure which provide a pointer from the search attribute back to the document. The CouchDb index (called View Index) is better perceived as a denormalized table with arbitrary keys and values loosely associated to the document. Such denormalized table is defined by a user-provided map() and reduce() function. map = function(doc) { … emit(k1, v1) … emit(k2, v2) … } reduce = function(keys, values, isRereduce) { if (isRereduce) { // Do the re-reduce only on values (keys will be null) } else { // Do the reduce on keys and values } // result must be ready for input values to re-reduce return result } Whenever a document is created, updated, deleted, the corresponding map(doc) function will be invoked (in an asynchronous manner) to generate a set of key/value pairs. Such key/value will be stored in a B-Tree structure. All the key/values pairs of each B-Tree node will be passed into the reduce() function, which compute an aggregated value within that B-Tree node. Re-reduce also happens in non-leaf B-Tree nodes which further aggregate the aggregated value of child B-Tree nodes. The management server maintains the view index and persisted it to a separate file. Create a view index is perform by broadcast the index creation request to all machines in the cluster. The management process of each machine will read its active vBucket file and feed each surviving document to the Map function. The key/value pairs emitted by the Map function will be stored in a separated BTree index file. When writing out the BTree node, the reduce() function will be called with the list of all values in the tree node. Its return result represent a partially reduced value is attached to the BTree node. The view index will be updated incrementally as documents are subsequently getting into the system. Periodically, the management process will open the vBucket file and scan all documents since the last sequence number. For each changed document since the last sync, it invokes the corresponding map function to determine the corresponding key/value into the BTree node. The BTree node will be split if appropriate. Underlying, Couchbase use a back index to keep track of the document with the keys that it previously emitted. Later when the document is deleted, it can look up the back index to determine what those key are and remove them. In case the document is updated, the back index can also be examined; semantically a modification is equivalent to a delete followed by an insert. The following diagram illustrates how the view index file will be incrementally updated via the append-only mechanism. Query Processing Query in Couchbase is made against the view index. A query is composed of the view name, a start key and end key. If the reduce() function isn’t defined, the query result will be the list of values sorted by the keys within the key range. In case the reduce() function is defined, the query result will be a single aggregated value of all keys within the key range. If the view has no reduce() function defined, the query processing proceeds as follows: Client issue a query (with view, start/end key) to the management process of any server (unlike a key based lookup, there is no need to locate a specific server). The management process will broadcast the request to other management process on all servers (include itself) within the cluster. Each management process (after receiving the broadcast request) do a local search for value within the key range by traversing the BTree node of its view file, and start sending back the result (automatically sorted by the key) to the initial server. The initial server will merge the sorted result and stream them back to the client. However, if the view has reduce() function defined, the query processing will involve computing a single aggregated value as follows: Client issue a query (with view, start/end key) to the management process of any server (unlike a key based lookup, there is no need to locate a specific server). The management process will broadcast the request to other management process on all servers (include itself) within the cluster. Each management process do a local reduce for value within the key range by traversing the BTree node of its view file to compute the reduce value of the key range. If the key range span across a BTree node, the pre-computed of the sub-range can be used. This way, the reduce function can reuse a lot of partially reduced values and doesn’t need to recomputed every value of the key range from scratch. The original server will do a final re-reduce() in all the return value from each other servers, and then passed back the final reduced value to the client. To illustrate the re-reduce concept, lets say the query has its key range from A to F. Instead of calling reduce([A,B,C,D,E,F]), the system recognize the BTree node that contains [B,C,D] has been pre-reduced and the result P is stored in the BTree node, so it only need to call reduce(A,P,E,F). Update View Index as vBucket migrates Since the view index is synchronized with the vBuckets in the same server, when the vBucket has migrated to a different server, the view index is no longer correct; those key/value that belong to a migrated vBucket should be discarded and the reduce value cannot be used anymore. To keep track of the vBucket and key in the view index, each bTree node has a 1024-bitmask indicating all the vBuckets that is covered in the subtree (ie: it contains a key emitted from a document belonging to the vBucket). Such bit-mask is maintained whenever the bTree node is updated. At the server-level, a global bitmask is used to indicate all the vBuckets that this server is responsible for. In processing the query of the map-only view, before the key/value pair is returned, an extra check will be perform for each key/value pair to make sure its associated vBucket is what this server is responsible for. When processing the query of a view that has a reduce() function, we cannot use the pre-computed reduce value if the bTree node contains a vBucket that the server is not responsible for. In this case, the bTree node’s bit mask is compared with the global bit mask. In case if they are not aligned, then the reduce value need to be recomputed. Here is an example to illustrate this process Couchbase is one of the popular NOSQL technology built on a solid technology foundation designed for high performance. In this post, we have examined a number of such key features: Load balancing between servers inside a cluster that can grow and shrink according to workload conditions. Data migration can be used to re-achieve workload balance. Asynchronous write provides lowest possible latency to client as it returns once the data is store in memory. Append-only update model pushes most update transaction into sequential disk access, hence provide extremely high throughput for write intensive applications. Automatic compaction ensures the data lay out on disk are kept optimized all the time. Map function can be used to pre-compute view index to enable query access. Summary data can be pre-aggregated using the reduce function. Overall, this cut down the workload of query processing dramatically. For a review on NOSQL architecture in general and some theoretical foundation, I have wrote a NOSQL design pattern blog, as well as some fundamental difference between SQL and NOSQL. For other NOSQL technologies, please read my other blog on MongoDb, Cassandra and HBase, Memcached Special thanks to Damien Katz and Frank Weigel from Couchbase team who provide a lot of implementation details of Couchbase.
July 7, 2012
by Ricky Ho
· 84,663 Views · 5 Likes
article thumbnail
HTML5 Geolocation API to Measure Speed and Heading of Your Car
in this article we'll show you how you can use the w3c geolocation api to measure the speed and the heading of your car while driving. this article further uses svg to render the speed gauge and heading compass. what we'll create in this article is the following: here you can see two gauges. one will show the heading you're driving to, and the other shows the speed in kilometers. you can test this out yourself by using the following link: open this in gps enable device . once opened the browser will probably ask you to allow access to your location. if you enable this and start moving, you'll see the two gauges move appropriately. getting all this to work is actually very easy and consists of the following steps: alter the svg images so we can rotate the needle and add to page. use the geolocation api to determine the current speed and heading update the needle based on the current and previous value we'll start with the svg part. alter the svg images so we can rotate the needle and add to page for the images i decided to use svg. svg has the advantage that it can scale without losing detail, and you can easily manipulate and animate the various parts of a svg image. both the svg images were copied from openclipart.org : compass rose speedometer these are vector graphics, both created using illustrator. before we can rotate the needles in these images we need to make a couple of small changes to the svg code. with svg you can apply matrix transformations to each svg element, with this you can easily rotate, skew, scale or translate a component. besides the matrix transformation you can also apply the rotation and translation directly using the translate and rotate keywords. in this example i've used the translaten and rotate functions directly. when working with these functions you have to take into account that the rotate function doesn't rotate around the center of the component, it rotates around point 0,0. so we need to make sure that for our needles the point we want to rotate around is set at 0,0. without diving into too much details, i removed the two needles from the image, and added them as a seperate group to the svg image. i then made sure the needles we're drawn relative to the 0,0 point i wanted to rotate around. for the speedometer the needle is now defined as this: and for the compass the needle is defined like this: if you know how to read svg, you can see that these figures are now drawn around their rotation point (the bottom center for the speedomoter and the center for the compass). as you can see we also added a specific id for both these elements. this way we can reference them directly from our javascript later on and update the transform property from a jquery animation. next we just need to add these to the page. for this i used d3.js , which has all kinds of helper functions for svg and which you can use to load these elements like this: function loadgraphics() { d3.xml("assets/compass.svg", "image/svg+xml", function(xml) { document.body.appendchild(xml.documentelement); }); d3.xml("assets/speed.svg", "image/svg+xml", function(xml) { document.body.appendchild(xml.documentelement); }); } and with this we've got our visualization components ready. use the geolocation api to determine the current speed and heading the next step is using the geolocation api to access the speed and heading properties. you can get this information from the position object that is provided to you by this api: interface position { readonly attribute coordinates coords; readonly attribute domtimestamp timestamp; }; this object has a coordinate object that contains the information we're looking for: interface coordinates { readonly attribute double latitude; readonly attribute double longitude; readonly attribute double? altitude; readonly attribute double accuracy; readonly attribute double? altitudeaccuracy; readonly attribute double? heading; readonly attribute double? speed; }; a lot of useful attributes, but we're only interested in these last two. the heading (from 0 to 360) shows the direction we're moving in, and the speed in meters per second is, as you've probably guessed, the speed we're moving at. there are two different options to get these values. we can poll ourselves for these values (e.g. setinterval) or we can wait wacth our position. in this second case you automatically recieve an update. in this example we use the second approach: function initgeo() { navigator.geolocation.watchposition( geosuccess, geofailure, { enablehighaccuracy:true, maximumage:30000, timeout:20000 } ); //movespeed(30); //movecompassneedle(56); } var count = 0; function geosuccess(event) { $("#debugoutput").text("geosuccess: " + count++ + " : " + event.coords.heading + ":" + event.coords.speed); var heading = event.coords.heading; var speed = event.coords.speed; if (heading != null && speed !=null && speed > 0) { movecompassneedle(heading); } if (speed != null) { // update the speed movespeed(speed); } } with this piece of code, we register a callback function on the watchposition. we also add a couple of properties to the watchposition function. with these properties we tell the api to use gps (enablehighaccuracy) and set some timeout and caching values. whenever we receive an update from the api the geosuccess function is called. this function recieves a position object (shown earlier) that we use to access the speed and the heading. based on the value of the heading and the speed we update the compass and the speedomoter. update the needle based on the current and previous value to update the needles we use jquery animations for the easing. normally you use a jquery animation to animatie css properties of an object, but you can also use this to animate arbitrary properties. to animate the speedomoter we use the following: var currentspeed = {property: 0}; function movespeed(speed) { // we use a svg transform to move to correct orientation and location var translatevalue = "translate(171,157)"; // to is in the range of 45 to 315, which is 0 to 260 km var to = {property: math.round((speed*3.6/250) *270) + 45}; // stop the current animation and run to the new one $(currentspeed).stop().animate(to, { duration: 2000, step: function() { $("#speed").attr("transform", translatevalue + " rotate(" + this.property + ")") } }); } we create a custom object, currentspeed, with a single property. this property is set to the rotate ratio that reflects the current speed. next, this property is used in a jquery animation. note that we stop any existing animations, should we get an update when the current animation is still running. in the step property of the animation we set the transfrom value of the svg element. this will rotate the needle, in two seconds, from the old value to the new value. and to animate the compass we do pretty much the same thing: var currentcompassposition = {property: 0}; function movecompassneedle(heading) { // we use a svg transform to move to correct orientation and location var translatevalue = "translate(225,231)"; var to = {property: heading}; // stop the current animation and run to the new one $(currentcompassposition).stop().animate(to, { duration: 2000, step: function() { $("#compass").attr("transform", translatevalue + " rotate(" + this.property + ")") } }); } there is a smal bug i ran into with this setup. sometimes my phone lost its gps signal (running firefox mobile), and that stopped the dials moving. refreshing the webpage was enough to get things started again however. i might change this to actively pull the information using the getcurrentlocation api call, to see whether that works better. another issue is that there is no way, at least that i found, for you to disable the phone entering sleep mode from the browser. so unless you configure your phone to not go to sleep, the screen will go black.
July 1, 2012
by Jos Dirksen
· 16,436 Views
article thumbnail
Reportlab: Mixing Fixed Content and Flowables
Recently I needed the ability to use Reportlab’s flowables, but place them in fixed locations. Some of you are probably wondering why I would want to do that. The nice thing about flowables, like the Paragraph, is that they’re easily styled. If I could bold something or center something AND put it in a fixed location, then that would rock! It took a lot of Googling and trial and error, but I finally got a decent template put together that I could use for mailings. In this article, I’m going to show you how to do this too. Getting Started You’ll need to make sure you have Reportlab or you’ll end up with a whole lot of nothing. You can go here to grab it. While you wait for it to download you can continue reading this article or go do something else productive. Are you ready now? Then let’s get this show on the road! Now we just need to come up with an example. Fortunately I was working on something at my job that I’ve been able to dummy up into the following silly and incomplete form letter. Study the code closely because you never know when there will be a test from reportlab.lib.pagesizes import letter from reportlab.lib.styles import getSampleStyleSheet from reportlab.lib.units import mm, inch from reportlab.pdfgen import canvas from reportlab.platypus import Image, Paragraph, Table ######################################################################## class LetterMaker(object): """""" #---------------------------------------------------------------------- def __init__(self, pdf_file, org, seconds): self.c = canvas.Canvas(pdf_file, pagesize=letter) self.styles = getSampleStyleSheet() self.width, self.height = letter self.organization = org self.seconds = seconds #---------------------------------------------------------------------- def createDocument(self): """""" voffset = 65 # create return address address = """ Jack Spratt 222 Ioway Blvd, Suite 100 Galls, TX 75081-4016 """ p = Paragraph(address, self.styles["Normal"]) # add a logo and size it logo = Image("snakehead.jpg") logo.drawHeight = 2*inch logo.drawWidth = 2*inch ## logo.wrapOn(self.c, self.width, self.height) ## logo.drawOn(self.c, *self.coord(140, 60, mm)) ## data = [[p, logo]] table = Table(data, colWidths=4*inch) table.setStyle([("VALIGN", (0,0), (0,0), "TOP")]) table.wrapOn(self.c, self.width, self.height) table.drawOn(self.c, *self.coord(18, 60, mm)) # insert body of letter ptext = "Dear Sir or Madam:" self.createParagraph(ptext, 20, voffset+35) ptext = """ The document you are holding is a set of requirements for your next mission, should you choose to accept it. In any event, this document will self-destruct %s seconds after you read it. Yes, %s can tell when you're done...usually. """ % (self.seconds, self.organization) p = Paragraph(ptext, self.styles["Normal"]) p.wrapOn(self.c, self.width-70, self.height) p.drawOn(self.c, *self.coord(20, voffset+48, mm)) #---------------------------------------------------------------------- def coord(self, x, y, unit=1): """ # http://stackoverflow.com/questions/4726011/wrap-text-in-a-table-reportlab Helper class to help position flowables in Canvas objects """ x, y = x * unit, self.height - y * unit return x, y #---------------------------------------------------------------------- def createParagraph(self, ptext, x, y, style=None): """""" if not style: style = self.styles["Normal"] p = Paragraph(ptext, style=style) p.wrapOn(self.c, self.width, self.height) p.drawOn(self.c, *self.coord(x, y, mm)) #---------------------------------------------------------------------- def savePDF(self): """""" self.c.save() #---------------------------------------------------------------------- if __name__ == "__main__": doc = LetterMaker("example.pdf", "The MVP", 10) doc.createDocument() doc.savePDF() Now you’ve seen the code, so we’ll spend a little time going over how it works. First off we create a Canvas object that we can use without our LetterMaker class. We also create a styles dict and set up a few other class variables. In the createDocument method, we create a Paragraph (an address) using some HTML-like tags to control the font and line breaking behavior. Then we create a logo and size it before putting both items into a Reportlab Table object. You’ll note that I’ve left in a couple commented out lines that show how to place the logo without the table. We use the coord method to help position the flowable. I found it on StackOverflow and thought it was pretty handy. The body of the letter uses a little string substitution and puts the result into another Paragraph. We also use a stored offset to help us position things. I find that storing a couple of offsets for certain portions of the code is very helpful. If you use them carefully then you can just change a couple of offsets to move the content around on the document rather than having to edit the position of each element. If you need to draw lines or shapes, you can do them in the usual way with your canvas object. Wrapping Up I hope this code will help you in your PDF creation endeavors. I have to admit that I’m posting it on here as much for my own future benefit as for your own. I’m a little sad I had to strip out so much from it, but my organization wouldn’t like it very much if I posted the original. Regardless, you now have the tools to create some pretty fancy PDF documents with Python. Now you just have to get out there and do it!
June 29, 2012
by Mike Driscoll
· 19,853 Views
article thumbnail
How to Find the Most Connected Neo4j Node Using Cypher
As I mentioned in another post about a month ago I’ve been playing around with a neo4j graph in which I have the following relationship between nodes: One thing I wanted to do was work out which node is the most connected on the graph, which would tell me who’s worked with the most people. I started off with the following cypher query: query = " START n = node(*)" query << " MATCH n-[r:colleagues]->c" query << " WHERE n.type? = 'person' and has(n.name)" query << " RETURN n.name, count(r) AS connections" query << " ORDER BY connections DESC" I can then execute that via the neo4j console or through irb using the neography gem like so: > require 'rubygems' > require 'neography' > neo = Neography::Rest.new(:port => 7476) > neo.execute_query query # cut for brevity {"data"=>[["Carlos Villela", 283], ["Mark Needham", 221]], "columns"=>["n.name", "connections"]} That shows me each person and the number of people they’ve worked with but I wanted to be able to see the most connected person in each office . Each person is assigned to an office while they’re working out of that office but people tend to move around so they’ll have links to multiple offices: I put ‘start_date’ and ‘end_date’ properties on the ‘member_of’ relationship and we can work out a person’s current office by finding the ‘member_of’ relationship which doesn’t have an end date defined: query = " START n = node(*)" query << " MATCH n-[r:colleagues]->c, n-[r2:member_of]->office" query << " WHERE n.type? = 'person' and has(n.name) and not(has(r2.end_date))" query << " RETURN n.name, count(r) AS connections, office.name" query << " ORDER BY connections DESC" And now our results look more like this: {"data"=>[["Carlos Villela", 283, "Porto Alegre - Brazil"], ["Mark Needham", 221, "London - UK South"]], "columns"=>["n.name", "connections"]} If we want to restrict that just to return the people for a specific person we can do that as well: query = " START n = node(*)" query << " MATCH n-[r:colleagues]->c, n-[r2:member_of]->office" query << " WHERE n.type? = 'person' and has(n.name) and (not(has(r2.end_date))) and office.name = 'London - UK South'" query << " RETURN n.name, count(r) AS connections, office.name" query << " ORDER BY connections DESC" {"data"=>[["Mark Needham", 221, "London - UK South"]], "columns"=>["n.name", "connections"]} In the current version of cypher we need to put brackets around the not expression otherwise it will apply the not to the rest of the where clause. Another way to get around that would be to put the not part of the where clause at the end of the line. While I am able to work out the most connected person by using these queries I’m not sure that it actually tells you who the most connected person is because it’s heavily biased towards people who have worked on big teams. Some ways to try and account for this are to bias the connectivity in favour of those you have worked longer with and also to give less weight to big teams since you’re less likely to have a strong connection with everyone as you might in a smaller team. I haven’t got onto that yet though!
June 26, 2012
by Mark Needham
· 13,487 Views
article thumbnail
When Should I Use An ORM?
I think like everyone, I go through the same journey whenever I find out about a new technology.. Huh? –> This is really cool –> I use it everywhere –> Hmm, sometimes it’s not so great Remember when people were writing websites with XSLT transforms? Yes, exactly. XML is great for storing a data structure as a string, but you really don’t want to be coding your application’s business logic with it. I’ve gone through a similar journey with Object Relational Mapping tools. After hand-coding my DALs, then code generating them, ORMs seemed like the answer to all my problems. I became an enthusiastic user of NHibernate through a number of large enterprise application builds. Even today I would still use an ORM for most classes of enterprise application. However there are some scenarios where ORMs are best avoided. Let me introduce my easy to use, ‘when to use an ORM’ chart. It’s got two axis, ‘Model Complexity’ and ‘Throughput’. The X-axis, Model Complexity, describes the complexity of your domain model; how many entities you have and how complex their relationships are. ORMs excel at mapping complex models between your domain and your database. If you have this kind of model, using an ORM can significantly speed up and simplify your development time and you’d be a fool not to use one. The problem with ORMs is that they are a leaky abstraction. You can’t really use them and not understand how they are communicating with your relational model. The mapping can be complex and you have to have a good grasp of both your relational database, how it responds to SQL requests, and how your ORM comes to generate both the relational schema and the SQL that talks to it. Thinking of ORMs as a way to avoid getting to grips with SQL, tables, and indexes will only lead to pain and suffering. Their benefit is that they automate the grunt work and save you the boring task of writing all that tedious CRUD column to property mapping code. The Y-axis in the chart, Throughput, describes the transactional throughput of your system. At very high levels, hundreds of transactions per second, you need hard-core DBA foo to get out of the deadlocked hell where you will inevitably find yourself. When you need this kind of scalability you can’t treat your ORM as anything other than a very leaky abstraction. You will have to tweak both the schema and the SQL it generates. At very high levels you’ll need Ayende level NHibernate skills to avoid grinding to a halt. If you have a simple model, but very high throughput, experience tells me that an ORM is more trouble than it’s worth. You’ll end up spending so much time fine tuning your relational model and your SQL that it simply acts as an unwanted obfuscation layer. In fact, at the top end of scalability you should question the choice of a relational ACID model entirely and consider an eventually-consistent event based architecture. Similarly, if your model is simple and you don’t have high throughput, you might be better off using a simple data mapper like SimpleData. So, to sum up, ORMs are great, but think twice before using one where you have a simple model and high throughput.
June 25, 2012
by Mike Hadlow
· 19,292 Views
article thumbnail
Handling HTTP 404 Error in ASP.NET Web API
Introduction: Building modern HTTP/RESTful/RPC services has become very easy with the new ASP.NET Web API framework. Using ASP.NET Web API framework, you can create HTTP services which can be accessed from browsers, machines, mobile devices and other clients. Developing HTTP services is now quite easy for ASP.NET MVC developer becasue ASP.NET Web API is now included in ASP.NET MVC. In addition to developing HTTP services, it is also important to return meaningful response to client if a resource(uri) not found(HTTP 404) for a reason(for example, mistyped resource uri). It is also important to make this response centralized so you can configure all of 'HTTP 404 Not Found' resource at one place. In this article, I will show you how to handle 'HTTP 404 Not Found' at one place. Description: Let's say that you are developing a HTTP RESTful application using ASP.NET Web API framework. In this application you need to handle HTTP 404 errors in a centralized location. From ASP.NET Web API point of you, you need to handle these situations, No route matched. Route is matched but no {controller} has been found on route. No type with {controller} name has been found. No matching action method found in the selected controller due to no action method start with the request HTTP method verb or no action method with IActionHttpMethodProviderRoute implemented attribute found or no method with {action} name found or no method with the matching {action} name found. Now, let create a ErrorController with Handle404 action method. This action method will be used in all of the above cases for sending HTTP 404 response message to the client. public class ErrorController : ApiController { [HttpGet, HttpPost, HttpPut, HttpDelete, HttpHead, HttpOptions, AcceptVerbs("PATCH")] public HttpResponseMessage Handle404() { var responseMessage = new HttpResponseMessage(HttpStatusCode.NotFound); responseMessage.ReasonPhrase = "The requested resource is not found"; return responseMessage; } } You can easily change the above action method to send some other specific HTTP 404 error response. If a client of your HTTP service send a request to a resource(uri) and no route matched with this uri on server then you can route the request to the above Handle404 method using a custom route. Put this route at the very bottom of route configuration, routes.MapHttpRoute( name: "Error404", routeTemplate: "{*url}", defaults: new { controller = "Error", action = "Handle404" } ); Now you need handle the case when there is no {controller} in the matching route or when there is no type with {controller} name found. You can easily handle this case and route the request to the above Handle404 method using a custom IHttpControllerSelector. Here is the definition of a custom IHttpControllerSelector, public class HttpNotFoundAwareDefaultHttpControllerSelector : DefaultHttpControllerSelector { public HttpNotFoundAwareDefaultHttpControllerSelector(HttpConfiguration configuration) : base(configuration) { } public override HttpControllerDescriptor SelectController(HttpRequestMessage request) { HttpControllerDescriptor decriptor = null; try { decriptor = base.SelectController(request); } catch (HttpResponseException ex) { var code = ex.Response.StatusCode; if (code != HttpStatusCode.NotFound) throw; var routeValues = request.GetRouteData().Values; routeValues["controller"] = "Error"; routeValues["action"] = "Handle404"; decriptor = base.SelectController(request); } return decriptor; } } Next, it is also required to pass the request to the above Handle404 method if no matching action method found in the selected controller due to the reason discussed above. This situation can also be easily handled through a custom IHttpActionSelector. Here is the source of custom IHttpActionSelector, public class HttpNotFoundAwareControllerActionSelector : ApiControllerActionSelector { public HttpNotFoundAwareControllerActionSelector() { } public override HttpActionDescriptor SelectAction(HttpControllerContext controllerContext) { HttpActionDescriptor decriptor = null; try { decriptor = base.SelectAction(controllerContext); } catch (HttpResponseException ex) { var code = ex.Response.StatusCode; if (code != HttpStatusCode.NotFound && code != HttpStatusCode.MethodNotAllowed) throw; var routeData = controllerContext.RouteData; routeData.Values["action"] = "Handle404"; IHttpController httpController = new ErrorController(); controllerContext.Controller = httpController; controllerContext.ControllerDescriptor = new HttpControllerDescriptor(controllerContext.Configuration, "Error", httpController.GetType()); decriptor = base.SelectAction(controllerContext); } return decriptor; } } Finally, we need to register the custom IHttpControllerSelector and IHttpActionSelector. Open global.asax.cs file and add these lines, configuration.Services.Replace(typeof(IHttpControllerSelector), new HttpNotFoundAwareDefaultHttpControllerSelector(configuration)); configuration.Services.Replace(typeof(IHttpActionSelector), new HttpNotFoundAwareControllerActionSelector()); Summary: In addition to building an application for HTTP services, it is also important to send meaningful centralized information in response when something goes wrong, for example 'HTTP 404 Not Found' error. In this article, I showed you how to handle 'HTTP 404 Not Found' error in a centralized location. Hopefully you will enjoy this article too.
June 22, 2012
by Imran Baloch
· 51,385 Views
article thumbnail
Wrapping Begin/End Asynchronous API into C#5 Tasks
Microsoft offered programmers several different ways of dealing with the asynchronous programming since .NET 1.0. The first model was Asynchronous programming model or APM for short. The pattern is implemented with two methods named BeginOperation and EndOperation. .NET 4 introduced new pattern – Task Asynchronous Pattern and with the introduction of .NET 4.5, Microsoft added language support for language integrated asynchronous coding style. You can check the MSDN for more samples and information. I will assume that you are familiar with it and have written code using it. You can wrap existing APM pattern into TPL pattern using the Task.Factory.FromAsync methods. For example: public static Task> ExecuteAsync(this DataServiceQuery query, object state) { return Task.Factory.FromAsync>(query.BeginExecute, query.EndExecute, state); } It is easy to wrap most of the asynchronous functions this way, but some cannot be since the wrapper functions assume that the last two parameters to the BeginOperation are AsyncCallback and object, and there are some versions of asynchronous operations that have different specifications. Examples: 1. Extra parameters after the object state parameter: IAsyncResult DataServiceContext.BeginExecuteBatch( AsyncCallback callback, object state, params DataServiceRequest[] queries); 2. Missing the expected object state parameter and different return type: ICancelableAsyncResult BeginQuery(AsyncCallback callBack); WorkItemCollection EndQuery(ICancelableAsyncResult car); Short solution for the first example The short and elegant way for wrapping the first example is to provide the following wrapper: public static Task ExecuteBatchAsync(this DataServiceContext context, object state, params DataServiceRequest[] queries) { if (context == null) throw new ArgumentNullException("context"); return Task.Factory.FromAsync( context.BeginExecuteBatch(null, state, queries), context.EndExecuteBatch); } We simply call the Begin method ourselves and then wrap it using an another overload for FromAsync function. The longer way However, we can fully wrap it ourselves by simulating what the FromAsync wrapper does. The complete code is listed below. public static Task ExecuteBatchAsync(this DataServiceContext context, object state, params DataServiceRequest[] queries) { // this will be our sentry that will know when our async operation is completed var tcs = new TaskCompletionSource(); try { context.BeginExecuteBatch((iar) => { try { var result = context.EndExecuteBatch(iar as ICancelableAsyncResult); tcs.TrySetResult(result); } catch (OperationCanceledException ex) { // if the inner operation was canceled, this task is cancelled too tcs.TrySetCanceled(); } catch (Exception ex) { // general exception has been set bool flag = tcs.TrySetException(ex); if (flag && ex as ThreadAbortException != null) { tcs.Task.m_contingentProperties.m_exceptionsHolder.MarkAsHandled(false); } } }, state, queries); } catch { tcs.TrySetResult(default(DataServiceResponse)); // propagate exceptions to the outside throw; } return tcs.Task; } Besides educational benefits, writing the full wrapper code allows us to add cancellation, logging and diagnostic information. Once we understand how to wrap APM pattern, We can now tackle the second problem easily. Handling the BeginQuery/EndQuery We will first create our own wrapper function in the spirit of the above code with the notable difference that we use the ICancelableAsyncResult interface instead of the IAsyncResult. public static class TaskEx { public static Task FromAsync(Func beginMethod, Func endMethod) { if (beginMethod == null) throw new ArgumentNullException("beginMethod"); if (endMethod == null) throw new ArgumentNullException("endMethod"); var tcs = new TaskCompletionSource(); try { beginMethod((iar) => { try { var result = endMethod(iar as ICancelableAsyncResult); tcs.TrySetResult(result); } catch (OperationCanceledException ex) { tcs.TrySetCanceled(); } catch (Exception ex) { bool flag = tcs.TrySetException(ex); if (flag && ex as ThreadAbortException != null) { tcs.Task.m_contingentProperties.m_exceptionsHolder.MarkAsHandled(false); } } }); } catch { tcs.TrySetResult(default(TResult)); throw; } return tcs.Task; } } The code is pretty self-explanatory and we can go ahead with the wrapping. There are four different operations that are exposed both in synchronous and asynchronous version: Query, LinkQuery, CountOnlyQuery and RegularQuery. The extension methods are short since we have already created our generic wrapper above: public static Task RunQueryAsync(this Query query) { return TaskEx.FromAsync(query.BeginQuery, query.EndQuery); } public static Task RunLinkQueryAsync(this Query query) { return TaskEx.FromAsync(query.BeginLinkQuery, query.EndLinkQuery); } public static Task RunCountOnlyQueryAsync(this Query query) { return TaskEx.FromAsync(query.BeginCountOnlyQuery, query.EndCountOnlyQuery); } public static Task RunRegularQueryAsync(this Query query) { return TaskEx.FromAsync(query.BeginRegularQuery, query.EndRegularQuery); } That is it for today, you can write your own handy extensions easily for APM functions out there.
June 21, 2012
by Toni Petrina
· 10,613 Views
article thumbnail
Top 10 Causes of Java EE Enterprise Performance Problems
Performance problems are one of the biggest challenges to expect when designing and implementing Java EE related technologies.
June 20, 2012
by Pierre - Hugues Charbonneau
· 274,044 Views · 20 Likes
article thumbnail
Fast Index Creation with InnoDB
Innodb can indexes built by sort since Innodb Plugin for MySQL 5.1 which is a lot faster than building them through insertion, especially for tables much larger than memory and large uncorrelated indexes you might be looking at 10x difference or more. Yet for some reason Innodb team has chosen to use very small (just 1MB) and hard coded buffer for this operation, which means almost any such index build operation has to use excessive sort merge passes significantly slowing down index built process. Mark Callaghan and Facebook Team has fixed this in their tree back in early 2011 adding innodb_merge_sort_block_size variable and I was thinking this small patch will be merged to MySQL 5.5 promptly, yet it has not happen to date. Here is example of gains you can expect (courtesy of Alexey Kopytov), using 1Mil rows Sysbench table. Buffer Length | alter table sbtest add key(c) 1MB 34 sec 8MB 26 sec 100MB 21 sec 128MB 17 sec REBUILD 37 sec REBUILD in this table is using “fast_index_creation=0″ which allows to disable fast index creation in Percona Server and force complete table to be rebuilt instead. Looking at this data we can see even for such small table there is possible to improve index creation time 2x by using large buffer. Also we can see we can substantially improve performance even increasing it from 1MB to 8MB, which might be sensible as default as even small systems should be able to allocate 8MB to do alter table. You may be wondering why in this case table rebuild is so close in performance to building index by sort with small buffer – this comes from building index on long character field with very short length, Innodb would use fixed size records for sort space which results in a lot more work done than you would otherwise need. Having some optimization to better deal with this case also would be nice. The table also was fitting in buffer pool completely in this case which means table rebuild could have done fast too. Results are from Percona Server 5.5.24
June 19, 2012
by Peter Zaitsev
· 4,499 Views
article thumbnail
How to Identify and Resolve Hibernate N+1 SELECT's Problems
Let’s assume that you’re writing code that’d track the price of mobile phones. Now, let’s say you have a collection of objects representing different Mobile phone vendors (MobileVendor), and each vendor has a collection of objects representing the PhoneModels they offer. To put it simple, there’s exists a one-to-many relationship between MobileVendor:PhoneModel. MobileVendor Class Class MobileVendor{ long vendor_id; PhoneModel[] phoneModels; ... } Okay, so you want to print out all the details of phone models. A naive O/R implementation would SELECT all mobile vendors and then do N additional SELECTs for getting the information of PhoneModel for each vendor. -- Get all Mobile Vendors SELECT * FROM MobileVendor; -- For each MobileVendor, get PhoneModel details SELECT * FROM PhoneModel WHERE MobileVendor.vendorId=? As you see, the N+1 problem can happen if the first query populates the primary object and the second query populates all the child objects for each of the unique primary objects returned. Resolve N+1 SELECTs problem (i) HQL fetch join "from MobileVendor mobileVendor join fetch mobileVendor.phoneModel PhoneModels" Corresponding SQL would be (assuming tables as follows: t_mobile_vendor for MobileVendor and t_phone_model for PhoneModel) SELECT * FROM t_mobile_vendor vendor LEFT OUTER JOIN t_phone_model model ON model.vendor_id=vendor.vendor_id (ii) Criteria query Criteria criteria = session.createCriteria(MobileVendor.class); criteria.setFetchMode("phoneModels", FetchMode.EAGER); In both cases, our query returns a list of MobileVendor objects with the phoneModels initialized. Only one query needs to be run to return all the PhoneModel and MobileVendor information required.
June 13, 2012
by Singaram Subramanian
· 201,661 Views · 13 Likes
article thumbnail
How to Get the JPQL/SQL String From a CriteriaQuery in JPA ?
I.T. is full of complex things that should (and sometimes could) be simple. Getting the JQPL/SQL String representation for a JPA 2.0 CriteriaQuery is one of them. By now you all know the JPA 2.0 Criteria API : a type safe way to write a JQPL query. This API is clever in the way that you don’t use Strings to build your query, but is quite verbose… and sometimes you get lost in dozens of lines of Java code, just to write a simple query. You get lost in your CriteriaQuery, you don’t know why your query doesn’t work, and you would love to debug it. But how do you debug it ? Well, one way would be by just displaying the JPQL and/or SQL representation. Simple, isn’t it ? Yes, but JPA 2.0 javax.persistence.Query doesn’t have an API to do this. You then need to rely on the implementation… meaning, the code is different if you use EclipseLink, Hibernate or OpenJPA. The CriteriaQuery we want to debug Let’s say you have a simple Book entity and you want to retrieve all the books sorted by their id. Something like SELECT b FROM Book b ORDER BY b.id DESC. How would you write this with the CriteriaQuery ? Well, something like these 5 lines of Java code : CriteriaBuilder cb = em.getCriteriaBuilder(); CriteriaQuery q = cb.createQuery(Book.class); Root b = q.from(Book.class); q.select(b).orderBy(cb.desc(b.get("id"))); TypedQuery findAllBooks = em.createQuery(q); So imagine when you have more complex ones. Sometimes, you just get lost, it gets buggy and you would appreciate to have the JPQL and/or SQL String representation to find out what’s happening. You could then even unit test it. Getting the JPQL/SQL String Representations for a Criteria Query So let’s use an API to get the JPQL/SQL String representations of a CriteriaQuery (to be more precise, the TypedQuery created from a CriteriaQuery). The bad news is that there is no standard JPA 2.0 API to do this. You need to use the implementation API hoping the implementation allows it (thank god that’s (nearly) the case for the 3 main JPA ORM frameworks). The good news is that the Query interface (and therefore TypedQuery) has an unwrap method. This method returns the provider’s query API implementation. Let’s see how you can use it with EclipseLink, Hibernate and OpenJPA. EclipseLink EclipseLink‘s Query representation is the org.eclipse.persistence.jpa.JpaQuery interface and the org.eclipse.persistence.internal.jpa.EJBQueryImpl implementation. This interface gives you the wrapped native query (org.eclipse.persistence.queries.DatabaseQuery) with two very handy methods : getJPQLString() and getSQLString(). Unfortunatelly the getJPQLString() method will not translate a CriteriaQuery into JPQL, it only works for queries originally written in JPQL (dynamic or named query). The getSQLString() method relies on the query being “prepared”, meaning you have to run the query once before getting the SQL String representation. findAllBooks.unwrap(JpaQuery.class).getDatabaseQuery().getJPQLString(); // doesn't work for CriteriaQuery findAllBooks.unwrap(JpaQuery.class).getDatabaseQuery().getSQLString(); Hibernate Hibernate‘s Query representation is org.hibernate.Query. This interface has several implementations and the very useful method that returns the SQL query string : getQueryString(). I couldn’t find a method that returns the JPQL representation, if I’ve missed something, please let me know. findAllBooks.unwrap(org.hibernate.Query.class).getQueryString() OpenJPA OpenJPA‘s Query representation is org.apache.openjpa.persistence.QueryImpl and also has a getQueryString() method that returns the SQL (not the JPQL). It delegates the call to the internal org.apache.openjpa.kernel.Query interface. I couldn’t find a method that returns the JPQL representation, if I’ve missed something, please let me know. findAllBooks.unwrap(org.apache.openjpa.persistence.QueryImpl.class).getQueryString() Unit testing Once you get your SQL String, why not unit test it ? Hey, but I don’t want to test my ORM, why would I do that ? Well, it happens that I’ve discovered a but in the new releases of OpenJPA by unit testing a query… so, there is a use case for that. Anyway, this is how you could do it : assertEquals("SELECT b FROM Book b ORDER BY b.id DESC", findAllBooksCriteriaQuery.unwrap(org.apache.openjpa.persistence.QueryImpl.class).getQueryString()); Conclusion As you can see, it’s not that simple to get a String representation for a TypedQuery. Here is a digest of the three main ORMs : ORM Framework Query implementation How to get the JPQL String How to get the SPQL String EclipseLink JpaQuery getDatabaseQuery().getJPQLString()* getDatabaseQuery().getSQLString()** Hibernate Query N/A getQueryString() OpenJPA QueryImpl getQueryString() N/A (*) Only possible on a dynamic or named query. Not possible on a CriteriaQuery (**) You need to execute the query first, if not, the value is null To illustrate all that I’ve written simple test cases using EclipseLink, Hibernate and OpenJPA that you can download from GitHub. Give it a try and let me know. And what about having an API in JPA 2.1 ? For a developers’ point of view it would be great to have two methods in the javax.persistence.Query (and therefore javax.persistence.TypedQuery) interface that would be able to easily return the JPQL and SQL String representations, e.g : Query.getJPQLString() and Query.getSQLString(). Hey, that would be the perfect time to have it in JPA 2.1 that will be shipped in less than a year. Now, as an implementer, this might be tricky to do, I would love to ear your point of view on this. Anyway, I’m going to post an email to the JPA 2.1 Expert Group… just in case we can have this in the next version of JPA ;o) References http://efreedom.com/Question/1-6412774/Get-SQL-String-JPQLQuery http://old.nabble.com/Cannot-get-the-JPQL—SQL-String-of-a-CriteriaQuery-td33882629.html http://paddyweblog.blogspot.fr/2010/04/some-examples-of-criteria-api-jpa-20.html http://www.altuure.com/2010/09/23/jpa-criteria-api-by-samples-part-i/ http://www.altuure.com/2010/09/23/jpa-criteria-api-by-samples-%E2%80%93-part-ii/ http://www.jumpingbean.co.za/blogs/jpa2-criteria-api http://wiki.eclipse.org/EclipseLink/FAQ/JPA#How_to_get_the_SQL_for_a_Query.3F
June 5, 2012
by Antonio Goncalves
· 60,906 Views · 1 Like
article thumbnail
Database unit testing with DBUnit, Spring and TestNG
I really like Spring, so I tend to use its features to the fullest. However, in some dark corners of its philosophy, I tend to disagree with some of its assumptions. One such assumption is the way database testing should work. In this article, I will explain how to configure your projects to make Spring Test and DBUnit play nice together in a multi-developers environment. Context My basic need is to be able to test some complex queries: before integration tests, I've to validate those queries get me the right results. These are not unit tests per se but let's assilimate them as such. In order to achieve this, I use since a while a framework named DBUnit. Although not maintained since late 2010, I haven't found yet a replacement (be my guest for proposals). I also have some constraints: I want to use TestNG for all my test classes, so that new developers wouldn't think about which test framework to use I want to be able to use Spring Test, so that I can inject my test dependencies directly into the test class I want to be able to see for myself the database state at the end of any of my test, so that if something goes wrong, I can execute my own queries to discover why I want every developer to have its own isolated database instance/schema Considering the last point, our organization let us benefit from a single Oracle schema per developer for those "unit-tests". Basic set up Spring provides the AbstractTestNGSpringContextTests class out-of-the-box. In turn, this means we can apply TestNG annotations as well as @Autowired on children classes. It also means we have access to the underlying applicationContext, but I prefer not to (and don't need to in any case). The structure of such a test would look like this: @ContextConfiguration(location = "classpath:persistence-beans.xml") public class MyDaoTest extends AbstractTestNGSpringContextTests { @Autowired private MyDao myDao; @Test public void whenXYZThenTUV() { ... } } Readers familiar with Spring and TestNG shouldn't be surprised here. Bringing in DBunit DbUnit is a JUnit extension targeted at database-driven projects that, among other things, puts your database into a known state between test runs. [...] DbUnit has the ability to export and import your database data to and from XML datasets. Since version 2.0, DbUnit can also work with very large datasets when used in streaming mode. DbUnit can also help you to verify that your database data match an expected set of values. DBunit being a JUnit extension, it's expected to extend the provided parent class org.dbunit.DBTestCase. In my context, I have to redefine some setup and teardown operation to use Spring inheritance hierarchy. Luckily, DBUnit developers thought about that and offer relevant documentation. Among the different strategies available, my tastes tend toward the CLEAN_INSERT and NONE operations respectively on setup and teardown. This way, I can check the database state directly if my test fails. This updates my test class like so: @ContextConfiguration(locations = {"classpath:persistence-beans.xml", "classpath:test-beans.xml"}) public class MyDaoTest extends AbstractTestNGSpringContextTests { @Autowired private MyDao myDao; @Autowired private IDatabaseTester databaseTester; @BeforeMethod protected void setUp() throws Exception { // Get the XML and set it on the databaseTester // Optional: get the DTD and set it on the databaseTester databaseTester.setSetUpOperation(DatabaseOperation.CLEAN_INSERT); databaseTester.setTearDownOperation(DatabaseOperation.NONE); databaseTester.onSetup(); } @Test public void whenXYZThenTUV() { ... } } Per-user configuration with Spring Of course, we need to have a specific Spring configuration file to inject the databaseTester. As an example, here is one: However, there's more than meets the eye. Notice the databaseTester has to be fed a datasource. Since a requirement is to have a database per developer, there are basically two options: either use a in-memory database or use the same database as in production and provide one such database schema per developer. I tend toward the latter solution (when possible) since it tends to decrease differences between the testing environment and the production environment. Thus, in order for each developer to use its own schema, I use Spring's ability to replace Java system properties at runtime: each developer is characterized by a different user.name. Then, I configure a PlaceholderConfigurer that looks for {user.name}.database.properties file, that will look like so: db.username=myusername1 db.password=mypassword1 db.schema=myschema1 This let me achieve my goal of each developer using its own instance of Oracle. If you want to use this strategy, do not forget to provide a specific database.properties for the Continuous Integration server. Huh oh? Finally, the whole testing chain is configured up to the database tier. Yet, when the previous test is run, everything is fine (or not), but when checking the database, it looks untouched. Strangely enough, if you did load some XML dataset and assert it during the test, it does behaves accordingly: this bears all symptoms of a transaction issue. In fact, when you closely look at Spring's documentation, everything becomes clear. Spring's vision is that the database should be left untouched by running tests, in complete contradiction to DBUnit's. It's achieved by simply rollbacking all changes at the end of the test by default. In order to change this behavior, the only thing to do is annotate the test class with @TransactionConfiguration(defaultRollback=false). Note this doesn't prevent us from specifying specific methods that shouldn't affect the database state on a case-by-case basis with the @Rollback annotation. The test class becomes: @ContextConfiguration(locations = {classpath:persistence-beans.xml", "classpath:test-beans.xml"}) @TransactionConfiguration(defaultRollback=false) public class MyDaoTest extends AbstractTestNGSpringContextTests { @Autowired private MyDao myDao; @Autowired private IDatabaseTester databaseTester; @BeforeMethod protected void setUp() throws Exception { // Get the XML and set it on the databaseTester // Optional: get the DTD and set it on the databaseTester databaseTester.setSetUpOperation(DatabaseOperation.CLEAN_INSERT); databaseTester.setTearDownOperation(DatabaseOperation.NONE); databaseTester.onSetup(); } @Test public void whenXYZThenTUV() { ... } } Conclusion Though Spring and DBUnit views on database testing are opposed, Spring's configuration versatility let us make it fit our needs (and benefits from DI). Of course, other improvements are possible: pushing up common code in a parent test class, etc. To go further: Spring Test documentation DBUnit site Database data verification Database testing best practices Generating DTD from your database schema
June 4, 2012
by Nicolas Fränkel
· 59,652 Views
article thumbnail
Spring Integration Gateways - Null Handling & Timeouts
Spring Integration (SI) Gateways Spring Integration Gateways () provide a semantically rich interface to message sub-systems. Gateways are specified using namespace constructs, these reference a specific Java interface () that is backed by an object dynamically implemented at run-time by the Spring Integration framework. Furthermore, these Java interfaces can, if you so wish, be defined entirely independent of any Spring artefacts - that's both code and configuration. One of the primary advantages of using the SI gateway as an interface to message sub-systems is that it's possible to automatically adopt the benefit of rich, default and customisable, gateway configuration. One such configuration attribute deserves further scrutiny and discussion primarily because it's easy to misunderstand and misconfigure around - default-reply-timeout. Primary Motivator for Gateway Analysis During recent consulting engagements, I've encountered a number of deployments that use Spring Integration Gateway specifications that may, in some circumstances, lead to production operational instability. This has often been in high-pressure environments or those where technology support is not backed by adequate training, testing, review or technology mentoring. How do gateways behave in Spring Integration (R2.0.5) One of the key sections, regarding gateways, in the Spring Integration manual clearly explains gateway semantics. Below is a 2-dimensional table of possible non-standard gateway returns for each of the scenarios that the SI Manual (r2.0.5) refers to. Gateway Non-standard Responses Runtime Events default-reply-timeout=x Single-threaded default-reply-timeout=x Multi-threaded default-reply-timeout=null Single-threaded default-reply-timeout=null Multi-threaded 1. Long Running Process Thread Parked null returned Thread Parked Thread Parked 2. Null Returned Downstream null returned null returned Thread Parked Thread Parked 3. void method Downstream null returned null returned Thread Parked Thread Parked 4. Runtime Exception Error handler invoked or exception thrown. Error handler invoked or exception thrown. Error handler invoked or exception thrown. Error handler invoked or exception thrown. The key parts of this table are the conditions that lead to invoking threads being parked (noted in red), nulls returned (noted in orange) and exceptions (noted in green). Each contributor consists of configuration that is under the developers control, deployed code that is under developers control and conditions that are usually not under developers control. Clearly, the column headings in the table above are divided into two sections; two gateway configuration attributes. The default-reply-timeout is set by the SI configured and is the amount of time that a client call is wiling to wait for a response from the gateway. Secondly, synchronous flows are represented by Single-threaded flows, asynchronous by Multi-threaded flows. A synchronous, or single-threaded flow, is one such as the following: The implicit input channel (gateway-request-channel) has no associated dispatcher configured. An asynchronous, or multi-threaded flow, is one such as the following: The explicit input channel has a dispatcher configured ("taskExecutor"). This task executor specifies a thread pool that supplies threads for execution and whose configuration as above marks a thread boundary. Note: This is not the only way of making channels asynchronous The other configuration attribute referenced is default-reply-timeout, this is set on the gateway namespace configuration such as the example above. Note that both of these runtime aspects are set by the configurer during SI flow design and implementation. They are entirely under developer control. The 'Runtime Events' column indicates gateway relevant runtime events that have to be considered during gateway configuration - these are obviously not under developer control. Trigger conditions for these events are not as unusual as one may hope. 1. Long Running Processes It's not uncommon for thread pools to become exhausted because all pooled threads are waiting for an external resource accessed through a socket, this may be a long running database query, a firewall keeping a connection open despite the server terminating etc. There is significant potential for these types of trigger. Some long-running processes terminate naturally, sometimes they never completed - an application restart is required. 2. Null returned downstream A null may be returned from a downstream SI construct such as a Transformer, Service Activator or Gateway. A Gateway may return null in some circumstances such as following a gateway timeout event. 3. Void method downstream Any custom code invoked during an SI flow may use a void method signature. This can also be caused by configuration in circumstances where flows are determined dynamically at runtime. 4. Runtime Exception RuntimeException's can be triggered during normal operation and are generally handled by catching them at the gateway or allowing them to propagate through. The reason that they are coloured green in the table above is that they are generally much easier to handle than timeouts. Gateway Timeout Handling Strategies There are four possible outcomes from invoking a gateway with a request message, all of these as a result of specific runtime events: a) an ordinary message response, b) an exception message, c) a null or d) no-response. Ordinary business responses and exceptions are straight forward to understand and will not be covered further in this article. The two significant outcomes that will be explored further are strategies for dealing with nulls and no-response. Generally speaking, long running processes either terminate or not. Long running processes that terminate may eventually return a message through the invoked gateway or timeout depending on timeout configuration, in which case a null may be returned. The severity of this as a problem depends on throughput volume, length of long running process and system resources (thread-pool size). Configuration exists for default-reply-timeout In the case where a long running process event is underway and a default-reply-timeout has been set, as long as the long running process completes before the default-reply-timeout expires, there is no problem to deal with. However, if the long running process does not complete before that timeout expires one of three outcomes will apply. Firstly, if the long running process terminates subsequent to the reply timeout expiry, the gateway will have already returned null to the invoker so the null response needs handling by the invoker. The thread handling the long-running process will be returned to the pool. Secondly, if the long running process does not terminate and a reply timeout has been set, the gateway will return null to the gateway invoker but the thread executing the long-running process will not get returned to the pool. Thirdly, and most significantly, if a default-reply-timeout has been configured but the long running process is running on the same thread as the invoker, i.e. synchronous channels supply messages to that process, the thread will not return, the default-reply-timeout has no affect. Assuming the most common processing scenario, a long running process completes either before or after the reply timeout expiry. When a null is returned by the gateway, the invoker is forced to deal with a null response. It's often unacceptable to force gateway consumers to deal with null responses and is not necessary as with a little additional configuration, this can be avoided. Absent Configuration for default-reply-timeout The most significant danger exists around gateways that have no default-reply-timeout configuration set. A long running process or a null returned from downstream will mean that the invoking thread is parked. This is true for both synchronous and asynchronous flows and may ultimately force an application to be restarted because the invoker thread pool is likely to start on a depletion course if this continues to occur. Spring Integration Timeout Handling Design Strategies For those Spring Integration configuration designers that are comfortable with gateway invokers dealing with null responses, exceptions and set default-reply-timeouts on gateways, there's no need to read further. However, if you wish to provide clients of your gateway a more predictable response, a couple of strategies exist for handling null responses from gateways in order that invokers are protected from having to deal with them. Firstly, the simpliest solution is to wrap the gateway with a service activator. The gateway must have the default-reply-timeout attribute value set in order to avoid unnecessary parking of threads. In order to avoid the consequence of long-running threads it's also very prudent to use a dispatcher soon after entry to the gateway - this breaks the thread boundary. Whilst this is a valid technical approach, the impact is that we have forced a different entry point to our message sub-system. Entry is now via a Service Activator rather than a Gateway. A side affect of this change is that the testing entry point changes. Integration tests that would normally reference a gateway to send a message now have to locate the backing implementation for the Service Activator, not ideal. An alternative approach toward solving this problem would be to configure two gateways with a Service Activator between them. Only one of the gateways would be exposed to invokers, the outer one. Both Gateways would reference the same service interface. The outer gateway specification would not specify the default-reply-timeout but would specify the input and output channels in the same way that a single gateway would. The Service Activator between the Gateways would handle null gateway responses and possibly any exceptions if preferred to the gateway error handler approach. An example is as follows: The Service Activator bean (enrollmentServiceGatewayHandler) deals with both null and exception responses from the adapted gateway (enrollmentServiceAdaptedGateway), in the situation where these are generated a business response detailing the error is generated. Spring Integration R2.1 Changes async-executor on gateway spec
May 26, 2012
by Matt Vickery
· 24,366 Views · 1 Like
  • Previous
  • ...
  • 511
  • 512
  • 513
  • 514
  • 515
  • 516
  • 517
  • 518
  • 519
  • 520
  • ...
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×