DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest Cloud Architecture Topics

article thumbnail
Local WebHooks with Mule Cloud Connect and LocalTunnel v2
When using an external API for WebHooks or Callbacks as discussed in Chapters 3 and 5 of Getting Started with Mule Cloud Connect; The API provider running somewhere out there on the web needs to callback your application that is happily running in isolation on your local machine. For an API provider to callback your application, the application must be accessible over the web. Sure, you could upload and test your application on a public facing server, but you may find it quicker and easier to work on your local development machine and these are typically behind firewalls, NAT, or otherwise not able to provide a public URL. You need a way to make your local application available over the web. There are a few good services and tools out there to help with this. Examples include ProxyLocal, and Forward.io. Alternatively, you can set up your own reverse SSH Tunnel if you already have a remote system to forward your requests, but this is cumbersome to say the least. I find Localtunnel to be an excellent fit for this need and localtunnel have just recently released v2 of its service with a host of new features and enhancements. More information can be found here: http://progrium.com/blog/2012/12/25/localtunnel-v2-available-in-beta/ Installing Localtunnel Those familiar with version 1 of the service will know that the v1 Localtunnel client was written in Ruby and required Rubygems to install it. The v2 client is now written in Python and can instead be installed via easy_install or pip. If instead you're interested in using Localtunnel v1, then I have wrote a previous blog post on the subject here: http://blogs.mulesoft.org/connector-callback-testing-local/ To get started, you will first need to check that you have Python installed. Localtunnel requires Python 2.6 or later. Most systems come with Python installed as standard, but if not you can check via the following command: $ python -version More info on installing Python can be found here: http://wiki.python.org/moin/BeginnersGuide/Download Once complete, you will need easy_install to install the Localtunnel client.If you don't have easy_install after you install Python, you can install it with this bootstrap script: $ curl http://peak.telecommunity.com/dist/ez_setup.py | python Once complete, you can install the Localtunnel client using the following command: $ easy_install localtunnel First run with LocalTunnel Once installed, creating a tunnel is as simple as running the following command: $ localtunnel-beta 8082 The parameter after the command: "8000" is the local port we want Localtunnel to forward to. So whatever port your app is running on should replace this value. Each time you run the command you should get output similar to the following: Port 8082 is now accessible from http://fb0322605126.v2.localtunnel.com ... Note: As v2 is still in beta; the command local-tunnel-beta will eventually be installed as just localtunnel. This lets you keep the v1 just in case anything goes wrong with v2 during the beta. Configuring the Connector Now onto Mule! To demonstrate I will use the Twilio Cloud Connector example from Chapter 5. Twilio has an awesome WebHook implementation with great debugging tools. Twilio uses callbacks to tell you about the status of your requests; When you use Twilio to a place a phone call or send an SMS the Twilio API allows you to send a URL where you'll receive information about the phone call once it ends or the status of the outbound SMS message after it's processed. This example uses the Twilio Cloud Connector to send a simple SMS message. The most important thing to note is that the "status-callback-flow-ref" attribute. All connector operations that support callback's will have an optional attribute ending in "-flow-ref". In this case : "status-callback-flow-ref". As the name suggests, this attribute should reference a flow. This value must be a valid flow id from within your configuration. It is this flow that will be used to listen for the callback. Notice that the flow has no inbound endpoint? This is where the magic happens; when Twilio process the SMS message it will send a callback automatically to that flow without you having to define an inbound endpoint. The connector automatically generates an inbound endpoint and sends the auto generated URL to Twilio for you. Customizing the Callback The URL generated for the callback URL is built using 'localhost' as the host, the 'http.port' environment variable or 'localPort' value as the port and the path of the URL is typically just a random generated string or static value. So if I run this locally it would send Twilio my non public address, something like: http://localhost:80/...vv3v3er342fvvn. Each connector that accepts HTTP callbacks will provide you with an optional http-callback-config child element to override these settings. These settings can be set at the connector's config level as follows: Here we have amended the previous example to add the additonal http-callback-config configuration. The configuration takes three additional arguments: domain, localPort and remotePort. These settings will be used to constuct the URL that is passed to the external system. The URL will be the same as the default generated URL of the HTTP inbound-endpoint except that the host is replaced by the 'domain' setting (or its default value) and the port is replaced by the 'remotePort' setting (or its default value). In this case we have used the domain from the URL that Localtunnel generated for us earlier: fb0322605126.v2.localtunnel.com and set the localPort to 8082 as we run the Localtunnel command using port 8082 and the remotePort to 80 as the localtunnel server just runs on port 80. And that's it! If you run this configuration you should start seeing your callback being printed to the console. The same goes for any OAuth connectors too. If your using any OAuth connectors built using the DevKit OAuth modules, you can configure the OAuth callback in a similar fashion. A full Mule/Twilio WebHook project can be found here: https://github.com/ryandcarter/GettingStarted-MuleCloudConnect-OReilly/tree/master/chapter05/twilio-webhooks
February 5, 2013
by Ryan Carter
· 4,931 Views
article thumbnail
Testing MapReduce with MRUnit
Testing and debugging multi threaded programs is hard. Now take the same programs and massively distribute them across multiple JVMs deployed on a cluster of machines and the complexity goes off the roof. One way to overcome this complexity is to do testing in isolation and catch as many bugs as possible locally. MRUnit is a testing framework that lets you test and debug Map Reduce jobs in isolation without spinning up a Hadoop cluster. In this blog post we will cover various features of MRUnit by walking through a simple MapReduce job. Lets say we want to take the input below and create an inverted index using MapReduce. Input www.kohls.com,clothes,shoes,beauty,toys www.amazon.com,books,music,toys,ebooks,movies,computers www.ebay.com,auctions,cars,computers,books,antiques www.macys.com,shoes,clothes,toys,jeans,sweaters www.kroger.com,groceries Expected output antiques www.ebay.com auctions www.ebay.com beauty www.kohls.com books www.ebay.com,www.amazon.com cars www.ebay.com clothes www.kohls.com,www.macys.com computers www.amazon.com,www.ebay.com ebooks www.amazon.com jeans www.macys.com movies www.amazon.com music www.amazon.com shoes www.kohls.com,www.macys.com sweaters www.macys.com toys www.macys.com,www.amazon.com,www.kohls.com groceries www.kroger.com below are the Mapper and Reducer that do the transformation public class InvertedIndexMapper extends MapReduceBase implements Mapper { public static final int RETAIlER_INDEX = 0; @Override public void map(LongWritable longWritable, Text text, OutputCollector outputCollector, Reporter reporter) throws IOException { final String[] record = StringUtils.split(text.toString(), ","); final String retailer = record[RETAIlER_INDEX]; for (int i = 1; i < record.length; i++) { final String keyword = record[i]; outputCollector.collect(new Text(keyword), new Text(retailer)); } } } public class InvertedIndexReducer extends MapReduceBase implements Reducer { @Override public void reduce(Text text, Iterator textIterator, OutputCollector outputCollector, Reporter reporter) throws IOException { final String retailers = StringUtils.join(textIterator, ','); outputCollector.collect(text, new Text(retailers)); } } Implementation details are not really important but basically Mapper gets a line at a time, splits the line and emits key value pairs where Key is a category of product and value is the website which is selling the product. For example line retailer,category1,category2 will be emitted as (category1,retailer) and (category2,retailer). Reducer gets a key and a list of values, transforms the list of values to a comma delimited String and emits the key and value out. Now lets use MRUnit to write various tests for this Job. Three key classes in MRUnits are MapDriver for Mapper Testing, ReduceDriver for Reducer Testing and MapReduceDriver for end to end MapReduce Job testing. This is how we will setup the Test Class. public class InvertedIndexJobTest { private MapDriver mapDriver; private ReduceDriver reduceDriver; private MapReduceDriver mapReduceDriver; @Before public void setUp() throws Exception { final InvertedIndexMapper mapper = new InvertedIndexMapper(); final InvertedIndexReducer reducer = new InvertedIndexReducer(); mapDriver = MapDriver.newMapDriver(mapper); reduceDriver = ReduceDriver.newReduceDriver(reducer); mapReduceDriver = MapReduceDriver.newMapReduceDriver(mapper, reducer); } } MRUnit supports two style of testings. First style is to tell the framework both input and output values and let the framework do the assertions, second is the more traditional approach where you do the assertion yourself. Lets write a test using the first approach. @Test public void testMapperWithSingleKeyAndValue() throws Exception { final LongWritable inputKey = new LongWritable(0); final Text inputValue = new Text("www.kroger.com,groceries"); final Text outputKey = new Text("groceries"); final Text outputValue = new Text("www.kroger.com"); mapDriver.withInput(inputKey, inputValue); mapDriver.withOutput(outputKey, outputValue); mapDriver.runTest(); } In the test above we tell the framework both input and output Key and Value pairs and the framework does the assertion for us. This test can be written in a more traditional way as follow @Test public void testMapperWithSingleKeyAndValueWithAssertion() throws Exception { final LongWritable inputKey = new LongWritable(0); final Text inputValue = new Text("www.kroger.com,groceries"); final Text outputKey = new Text("groceries"); final Text outputValue = new Text("www.kroger.com"); mapDriver.withInput(inputKey, inputValue); final List> result = mapDriver.run(); assertThat(result) .isNotNull() .hasSize(1) .containsExactly(new Pair(outputKey, outputValue)); } Sometimes Mapper emits multiple Key Value pairs for a single input. MRUnit provides a fluent API to support this use case. Here is an example @Test public void testMapperWithSingleInputAndMultipleOutput() throws Exception { final LongWritable key = new LongWritable(0); mapDriver.withInput(key, new Text("www.amazon.com,books,music,toys,ebooks,movies,computers")); final List> result = mapDriver.run(); final Pair books = new Pair(new Text("books"), new Text("www.amazon.com")); final Pair toys = new Pair(new Text("toys"), new Text("www.amazon.com")); assertThat(result) .isNotNull() .hasSize(6) .contains(books, toys); } You write the test for the reduce exactly the same way. @Test public void testReducer() throws Exception { final Text inputKey = new Text("books"); final ImmutableList inputValue = ImmutableList.of(new Text("www.amazon.com"), new Text("www.ebay.com")); reduceDriver.withInput(inputKey,inputValue); final List> result = reduceDriver.run(); final Pair pair2 = new Pair(inputKey, new Text("www.amazon.com,www.ebay.com")); assertThat(result) .isNotNull() .hasSize(1) .containsExactly(pair2); } Finally you can use MapReduceDriver to test your Mapper, Combiner and Reducer together as a single job. You can also pass multiple key value pairs as input to your job. Test below demonstrate MapReduceDriver in action @Test public void testMapReduce() throws Exception { mapReduceDriver.withInput(new LongWritable(0), new Text("www.kohls.com,clothes,shoes,beauty,toys")); mapReduceDriver.withInput(new LongWritable(1), new Text("www.macys.com,shoes,clothes,toys,jeans,sweaters")); final List> result = mapReduceDriver.run(); final Pair clothes = new Pair(new Text("clothes"), new Text("www.kohls.com,www.macys.com")); final Pair jeans = new Pair(new Text("jeans"), new Text("www.macys.com")); assertThat(result) .isNotNull() .hasSize(6) .contains(clothes, jeans); }
February 5, 2013
by Mansur Ashraf
· 13,795 Views · 1 Like
article thumbnail
Tutorial: Deploying an API on EC2 from AWS
Curator's Note: This article was co-authored by Andrzej Jarzyna. At 3scale we find Amazon to be a fantastic platform for running APIs due to the complete control you have on the application stack. For people new to AWS the learning curve is quite steep. So we put together our best practices into this short tutorial. Besides Amazon EC2 we will use the Ruby Grape gem to create the API interface and an Nginx proxy to handle access control. Best of all everything in this tutorial is completely FREE! For the purpose of this tutorial you will need a running API based on Ruby and Thin server. If you don’t have one you can simply clone an example repo as described below (in the “Deploying the Application” section). If you are interested in the background of this example (Sentiment API), you can see a couple of previous guides which 3scale has published. Here we use version_1 of the API(‘API up and running in 10 minutes‘) with some extra sentiment analysis functionality (this part is covered in the second tutorial of the Sentiment API tutorial). Now we will start the creation and configuration of the Amazon EC2 instance. If you already have an EC2 instance (micro or not), you can jump to the next step -> Preparing Instance for Deployment. Creating and configuring EC2 Instance Let’s start by signing up for the Amazon Elastic Compute Cloud (Amazon EC2). For our needs the free tier http://aws.amazon.com/free/ is enough, covering all the basic needs. Once the account is created go to the EC2 dashboard under your AWS Management Console and click on the Launch Instance button. That will transfer you to a popup window where you will continue the process: Choose the classic wizard Choose an AMI (Ubuntu Server 12.04.1 LTS 32bit, T1micro instance) leaving all the other settings for Instance Details as default Create a keypair and download it – this will be the key which you will use to make an ssh connection to the server, it’s VERY IMPORTANT! Add inbound rules for the firewall with source always 0.0.0.0/0 (HTTP, HTTPS, ALL ICMP, TCP port 3000 used by the Ruby thin server) Preparing Instance for Deployment Now, as we have the instance created and running, we can directly connect there from our console (Windows users from PuTTY). Right click on your instance, connect and choose Connect with a standalone SSH Client. Follow the steps and change the username to ubuntu (instead of root) in the given example. After executing this step you are connected to your instance. We will have to install new packages. Some of them require root credentials, so you will have to set a new root password: sudo passwd root. Then login as root: su root. Now with root credentials execute: sudo apt-get update and switch back to your normal user with exit command and install all the required packages: install some libraries which will be required by rvm, ruby and git: sudo apt-get install build-essential git zlib1g-dev libssl-dev libreadline-gplv2-dev imagemagick libxml2-dev libxslt1-dev openssl libreadline6 libreadline6-dev zlib1g libyaml-dev libxslt-dev autoconf libc6-dev ncurses-dev automake libtool bison libpq-dev libpq5 libeditline-dev install git (on Linux rather than from Source): http://www.git-scm.com/book/en/Getting-Started-Installing-Git install rvm: https://rvm.io/rvm/install/ install ruby rvm install 1.9.3 rvm use 1.9.3 --default Deploying the Application Our sample Sentiment API is located on Github. Try cloning the repository: git clone [email protected]:jerzyn/api-demo.git you can once again review the code and tutorial on creating and deploying this app here: http://www.3scale.net/2012/06/the-10-minute-api-up-running-3scale-grape-heroku-api-10-minutes/ and here http://www.3scale.net/2012/07/how-to-out-of-the-box-api-analytics/ note the changes (we are using only v1, as authentication will go through the proxy). Now you can deploy the app by issuing: bundle install. Now you can start the thin server: thin start. To access the API directly (i.e. without any security or access control) access: your-public-dns:3000/v1/words/awesome.json (you can find your-public-dns in the AWS EC2 Dashboard->Instances in the details window of your instance) For the Nginx integration you will have to create an elastic IP address. Inside the AWS EC2 dashboard create an elastic IP in the same region as your instance and associate that IP to it (you won’t have to pay anything for the elastic IP as long as it is associated with your instance in the same region). OPTIONAL: If you want to assign a custom domain to your amazon instance you will have to do one thing: add an A record to the DNS record of your domain mapping the domain to the elastic IP address you have previously created. Your domain provider should either give you some way to set the A record (the IPv4 address), or it will give you a way to edit the nameservers of your domain. If they do not allow you to set the A record directly, find a DNS management service, register your domain as a zone there and the service will give you the nameservers to enter in the admin panel of your domain provider. You can then add the A record for the domain. Some possible DNS management services include ZoneEdit (basic, free), Amazon route 53, etc. At this point you API is open to the world. This is good and bad – great that you are sharing, but bad in the sense that without rate limits a few apps could kill the resources of your server, and you have no insight into who is using your API and how it is being used. The solution is to add some management for your API… Enabling API Management with 3scale Rather than reinvent the wheel and implement rate limits, access controls and analytics from scratch we will leverage the handy 3scale API Management service. Get your free 3scale account, activate and log-in to the new instance through the provided links. The first time you log-in you can choose the option for some sample data to be created, so you will have some API keys to use later. Next you would probably like to go through the tour to get a glimpse on the system functionality (optional) and then start with the implementation. To get some instant results we will start with the sandbox proxy which can be used while in development. Then we will also configure an Nginx proxy which can scale up for full production deployments. There is some documentation on the configuration of the API proxy at 3scale: https://support.3scale.net/howtos/api-configuration/nginx-proxy and for more advanced configuration options here: https://support.3scale.net/howtos/api-configuration/nginx-proxy-advanced Once you sign into your 3scale account, Launch your API on the main Dashboard screen or Go to API->Select the service (API)->Integration in the sidebar->Proxy Set the address of of your API backend – this has to be the Elastic IP address unless the custom domain has been set, including http protocol and port 3000. Now you can save and turn on the sandbox proxy to test your API by hitting the sandbox endpoint (after creating some app credentials in 3scale): http://sandbox-endpoint/v1/words/awesome.json?app_id=APP_ID&app_key=APP_KEY where, APP_ID and APP_KEY are id and key of one of the sample applications which you created when you first logged into your 3scale account (if you missed that step just create a developer account and an application within that account). Try it without app credentials, next with incorrect credentials, and then once authenticated within and over any rate limits that you have defined. Only once it is working to your satisfaction do you need to download the config files for Nginx. Note: any time you have errors check whether you can access the API directly: your-public-dns:3000/v1/words/awesome.json. If that is not available, then you need to check if the AWS instance is running and if the Thin Server is running on the instance. Implement an Nginx Proxy for Access Control In order to streamline this step we recommend that you install the fantastic OpenResty web application that is basically a bundle of the standard Nginx core with almost all the necessary 3rd party Nginx modules built-in. Install dependencies: sudo apt-get install libreadline-dev libncurses5-dev libpcre3-dev perl Compile and install Nginx: cd ~ sudo wget http://agentzh.org/misc/nginx/ngx_openresty-1.2.3.8.tar.gz sudo tar -zxvf ngx_openresty-1.2.3.8.tar.gz cd ngx_openresty-1.2.3.8/ ./configure --prefix=/opt/openresty --with-luajit --with-http_iconv_module -j2 make sudo make install In the config file make the following changes: edit the .conf file from nginx download in line 28, which is preceded by info to change your server name put the correct domain (of your Elastic IP or custom domain name) in line 78 change the path to the .lua file, downloaded together with the .conf file. We are almost finished! Our last step is to start the NGINX proxy and put some traffic through it. If it is not running yet (remember, that thin server has to be started first), please go to your EC2 instance terminal (the one you were connecting through ssh before) and start it now: sudo /opt/openresty/nginx/sbin/nginx -p /opt/openresty/nginx/ -c /opt/openresty/nginx/conf/YOUR-CONFIG-FILE.conf The last step will be verifying that the traffic goes through with a proper authorization. To do that, access: http://your-public-dns/v1/words/awesome.json?app_id=APP_ID&app_key=APP_KEY where, APP_ID and APP_KEY are key and id of the application you want to access through the API call. Once everything is confirmed as working correctly, you will want to block public access to the API backend on port 3000, which bypasses any access controls. If encounter some problems with the Nginx configuration or need a more detailed guide, I encourage you to check the 3scale guide on configuring Nginx proxy: https://support.3scale.net/howtos/api-configuration/nginx-proxy. You can go completely wild with customization of your API gateway. If you want to dive more into the 3scale system configuration (like usage and monitoring of your API traffic) feel encouraged to browse our Quickstart guides and HowTo’s.
February 4, 2013
by Steven Willmott
· 17,804 Views
article thumbnail
Sorting Text Files with MapReduce
in my last post i wrote about sorting files in linux. decently large files (in the tens of gb’s) can be sorted fairly quickly using that approach. but what if your files are already in hdfs, or ar hundreds of gb’s in size or larger? in this case it makes sense to use mapreduce and leverage your cluster resources to sort your data in parallel. mapreduce should be thought of as a ubiquitous sorting tool, since by design it sorts all the map output records (using the map output keys), so that all the records that reach a single reducer are sorted. the diagram below shows the internals of how the shuffle phase works in mapreduce. given that mapreduce already performs sorting between the map and reduce phases, then sorting files can be accomplished with an identity function (one where the inputs to the map and reduce phases are emitted directly). this is in fact what the sort example that is bundled with hadoop does. you can look at the how the example code works by examining the org.apache.hadoop.examples.sort class. to use this example code to sort text files in hadoop, you would use it as follows: shell$ export hadoop_home=/usr/lib/hadoop shell$ $hadoop_home/bin/hadoop jar $hadoop_home/hadoop-examples.jar sort \ -informat org.apache.hadoop.mapred.keyvaluetextinputformat \ -outformat org.apache.hadoop.mapred.textoutputformat \ -outkey org.apache.hadoop.io.text \ -outvalue org.apache.hadoop.io.text \ /hdfs/path/to/input \ /hdfs/path/to/output this works well, but it doesn’t offer some of the features that i commonly rely upon in linux’s sort, such as sorting on a specific column, and case-insensitive sorts. linux-esque sorting in mapreduce i’ve started a new github repo called hadoop-utils , where i plan to roll useful helper classes and utilities. the first one is a flexible hadoop sort. the same hadoop example sort can be accomplished with the hadoop-utils sort as follows: shell$ $hadoop_home/bin/hadoop jar hadoop-utils--jar-with-dependencies.jar \ com.alexholmes.hadooputils.sort.sort \ /hdfs/path/to/input \ /hdfs/path/to/output to bring sorting in mapreduce closer to the linux sort, the --key and --field-separator options can be used to specify one or more columns that should be used for sorting, as well as a custom separator (whitespace is the default). for example, imagine you had a file in hdfs called /input/300names.txt which contained first and last names: shell$ hadoop fs -cat 300names.txt | head -n 5 roy franklin mario gardner willis romero max wilkerson latoya larson to sort on the last name you would run: shell$ $hadoop_home/bin/hadoop jar hadoop-utils--jar-with-dependencies.jar \ com.alexholmes.hadooputils.sort.sort \ --key 2 \ /input/300names.txt \ /hdfs/path/to/output the syntax of --key is pos1[,pos2] , where the first position (pos1) is required, and the second position (pos2) is optional - if it’s omitted then pos1 through the rest of the line is used for sorting. just like the linux sort, --key is 1-based, so --key 2 in the above example will sort on the second column in the file. lzop integration another trick that this sort utility has is its tight integration with lzop, a useful compression codec that works well with large files in mapreduce (see chapter 5 of hadoop in practice for more details on lzop). it can work with lzop input files that span multiple splits, and can also lzop-compress outputs, and even create lzop index files. you would do this with the codec and lzop-index options: shell$ $hadoop_home/bin/hadoop jar hadoop-utils--jar-with-dependencies.jar \ com.alexholmes.hadooputils.sort.sort \ --key 2 \ --codec com.hadoop.compression.lzo.lzopcodec \ --map-codec com.hadoop.compression.lzo.lzocodec \ --lzop-index \ /hdfs/path/to/input \ /hdfs/path/to/output multiple reducers and total ordering if your sort job runs with multiple reducers (either because mapreduce.job.reduces in mapred-site.xml has been set to a number larger than 1, or because you’ve used the -r option to specify the number of reducers on the command-line), then by default hadoop will use the hashpartitioner to distribute records across the reducers. use of the hashpartitioner means that you can’t concatenate your output files to create a single sorted output file. to do this you’ll need total ordering , which is supported by both the hadoop example sort and the hadoop-utils sort - the hadoop-utils sort enables this with the --total-order option. shell$ $hadoop_home/bin/hadoop jar hadoop-utils--jar-with-dependencies.jar \ com.alexholmes.hadooputils.sort.sort \ --total-order 0.1 10000 10 \ /hdfs/path/to/input \ /hdfs/path/to/output the syntax is for this option is unintuitive so let’s look at what each field means. more details on total ordering can be seen in chapter 4 of hadoop in practice . more details for details on how to download and run the hadoop-utils sort take a look at the cli guide in the github project page .
January 26, 2013
by Alex Holmes
· 15,445 Views
article thumbnail
Assign a Fixed IP to an AWS EC2 Instance
as described in my previous post the ip (and dns) of your running ec2 ami will change after a reboot of that instance. of course this makes it very hard to make your applications on that machine available for the outside world, like in this case our wordpress blog. that is where elastic ip comes to the rescue. with this feature you can assign a static ip to your instance. assign one to your application as follows: click on the elastic ips link in the aws console allocate a new address associate the address with a running instance right click to associate the ip with an instance: pick the instance to assign this ip to: note the ip being assigned to your instance if you go to the ip address you were assigned then you see the home page of your server: and the nicest thing is that if you stop and start your instance you will receive a new public dns but your instance is still assigned to the elastic ip address: one important note: as long as an elastic ip address is associated with a running instance, there is no charge for it. however an address that is not associated with a running instance costs $0.01/hour. this prevents users from ‘reserving’ addresses while they are not being used.
January 20, 2013
by Eric Genesky
· 22,917 Views
article thumbnail
Reading Hive Tables from MapReduce
This article is by Stephen Mouring Jr, appearing courtesy of Scott Leberknight. This is part two of a two part blog series on how to read/write Apache Hive data from MapReduce jos. Part one (Writing Hive Tables from MapReduce) is here. So just as sometimes you need to write data to Hive with a custom MapReduce job, sometimes you need to read that data back from Hive with a custom MapReduce job. As covered in part one, Hive is a layer that sits on HDFS and imposes a standard convention on the structure of the files so it can interpret them as columns and rows. Reading data out of Hive is just a matter of parsing the files correctly. Recall that files processed by MapReduce (and by extension, Hive) are output as key value pairs. Hive ignores the keys (read as a BytesWritable with a value of null) and reads/writes the values as Text objects. The value of the Text object for each row is the concatenation of all the column values delimited by the delimiter of the table (which Hive defaults to the "char 1" ASCII character). Seems like a simple problem, so my first thought was to just using String.split() in the map() method of the MapReduce job. String SEPARATOR_FIELD = new String(new char[] {1}); String[] rowColumns = new String (rowTextObject.getBytes()).split(SEPARATOR_FIELD); In theory this should have worked perfectly, but unfortunately I have found that String.split() actually consumes repeated delimiters. This is a problem if any of the values in the row are blank, since split() will shift the positions of your columns and you will be unable to match up what values belong with which columns. An alternative would be to create a String from the Text object and iterate through it using indexOf(). This approach however requires extra object creation and depending on the scale of your MapReduce job and the size of your rows, may slow you down needlessly. So an alternative is to use the Text object's find() method. String SEPARATOR_FIELD = new String(new char[] {1}); String[] rowColumns = new String[NUMBER_OF_COLUMNS_IN_YOUR_HIVE_TABLE]; int start = 0; int end = 0; for (int i = 0; i < rowColumns.length; ++i) { end = rowTextObject.find(SEPARATOR_FIELD, start); if (end == -1) { end = rowString.getLength(); } rowColumns[i] = new String(rowTextObject.getBytes(), start, end-start); start = end + 1; } This will parse out each value into the appropriately index of the rowColumns array. Blank values will also be handled correctly and result in blank strings being inserted into the rowColumns array.
January 11, 2013
by Scott Leberknight
· 6,602 Views · 1 Like
article thumbnail
Configuring IIS methods for ASP.NET Web API on Windows Azure Websites
That’s a pretty long title, I agree. When working on my implementation of RFC2324, also known as the HyperText Coffee Pot Control Protocol, I’ve been struggling with something that you will struggle with as well in your ASP.NET Web API’s: supporting additional HTTP methods like HEAD, PATCH or PROPFIND. ASP.NET Web API has no issue with those, but when hosting them on IIS you’ll find yourself in Yellow-screen-of-death heaven. The reason why IIS blocks these methods (or fails to route them to ASP.NET) is because it may happen that your IIS installation has some configuration leftovers from another API: WebDAV. WebDAV allows you to work with a virtual filesystem (and others) using a HTTP API. IIS of course supports this (because flagship product “SharePoint” uses it, probably) and gets in the way of your API. Bottom line of the story: if you need those methods or want to provide your own HTTP methods, here’s the bit of configuration to add to your Web.config file: Here’s what each part does: Under modules, the WebDAVModule is being removed. Just to make sure that it’s not going to get in our way ever again. The security/requestFiltering element I’ve added only applies if you want to define your own HTTP methods. So unless you need the XYZ method I’ve defined here, don’t add it to your config. Under handlers, I’m removing the default handlers that route into ASP.NET. Then, I’m adding them again. The important part? The "verb attribute. You can provide a list of comma-separated methods that you want to route into ASP.NET. Again, I’ve added my XYZ methodbut you probably don’t need it. This will work on any IIS server as well as on Windows Azure Websites. It will make your API… happy.
December 11, 2012
by Maarten Balliauw
· 20,501 Views
article thumbnail
Hazelcast Distributed Execution with Spring
The ExecutorService feature had come with Java 5 and is under the java.util.concurrent package. It extends the Executor interface and provides a thread pool functionality to execute asynchronous short tasks. Java Executor Service Types is suggested to look over basic ExecutorService implementation. Also ThreadPoolExecutor is a very useful implementation of ExecutorService ınterface. It extends AbstractExecutorService providing default implementations of ExecutorService execution methods. It provides improved performance when executing large numbers of asynchronous tasks and maintains basic statistics, such as the number of completed tasks. How to develop and monitor Thread Pool Services by using Spring is also suggested to investigate how to develop and monitor Thread Pool Services. So far, we have just talked Undistributed Executor Service implementation. Let us also investigate Distributed Executor Service. Hazelcast Distributed Executor Service feature is a distributed implementation of java.util.concurrent.ExecutorService. It allows to execute business logic in cluster. There are four alternative ways to realize it : 1) The logic can be executed on a specific cluster member which is chosen. 2) The logic can be executed on the member owning the key which is chosen. 3) The logic can be executed on the member Hazelcast will pick. 4) The logic can be executed on all or subset of the cluster members. This article shows how to develop Distributed Executor Service via Hazelcast and Spring. Used Technologies : JDK 1.7.0_09 Spring 3.1.3 Hazelcast 2.4 Maven 3.0.4 STEP 1 : CREATE MAVEN PROJECT A maven project is created as below. (It can be created by using Maven or IDE Plug-in). STEP 2 : LIBRARIES Firstly, Spring dependencies are added to Maven’ s pom.xml 3.1.3.RELEASE UTF-8 org.springframework spring-core ${spring.version} org.springframework spring-context ${spring.version} com.hazelcast hazelcast-all 2.4 log4j log4j 1.2.16 maven-compiler-plugin(Maven Plugin) is used to compile the project with JDK 1.7 org.apache.maven.plugins maven-compiler-plugin 3.0 1.7 1.7 maven-shade-plugin(Maven Plugin) can be used to create runnable-jar org.apache.maven.plugins maven-shade-plugin 2.0 package shade com.onlinetechvision.exe.Application META-INF/spring.handlers META-INF/spring.schemas STEP 3 : CREATE Customer BEAN A new Customer bean is created. This bean will be distributed between two node in OTV cluster. In the following sample, all defined properties(id, name and surname)’ types are String and standart java.io.Serializable interface has been implemented for serializing. If custom or third-party object types are used, com.hazelcast.nio.DataSerializable interface can be implemented for better serialization performance. package com.onlinetechvision.customer; import java.io.Serializable; /** * Customer Bean. * * @author onlinetechvision.com * @since 27 Nov 2012 * @version 1.0.0 * */ public class Customer implements Serializable { private static final long serialVersionUID = 1856862670651243395L; private String id; private String name; private String surname; public String getId() { return id; } public void setId(String id) { this.id = id; } public String getName() { return name; } public void setName(String name) { this.name = name; } public String getSurname() { return surname; } public void setSurname(String surname) { this.surname = surname; } @Override public int hashCode() { final int prime = 31; int result = 1; result = prime * result + ((id == null) ? 0 : id.hashCode()); result = prime * result + ((name == null) ? 0 : name.hashCode()); result = prime * result + ((surname == null) ? 0 : surname.hashCode()); return result; } @Override public boolean equals(Object obj) { if (this == obj) return true; if (obj == null) return false; if (getClass() != obj.getClass()) return false; Customer other = (Customer) obj; if (id == null) { if (other.id != null) return false; } else if (!id.equals(other.id)) return false; if (name == null) { if (other.name != null) return false; } else if (!name.equals(other.name)) return false; if (surname == null) { if (other.surname != null) return false; } else if (!surname.equals(other.surname)) return false; return true; } @Override public String toString() { return "Customer [id=" + id + ", name=" + name + ", surname=" + surname + "]"; } } STEP 4 : CREATE ICacheService INTERFACE A new ICacheService Interface is created for service layer to expose cache functionality. package com.onlinetechvision.cache.srv; import com.hazelcast.core.IMap; import com.onlinetechvision.customer.Customer; /** * A new ICacheService Interface is created for service layer to expose cache functionality. * * @author onlinetechvision.com * @since 27 Nov 2012 * @version 1.0.0 * */ public interface ICacheService { /** * Adds Customer entries to cache * * @param String key * @param Customer customer * */ void addToCache(String key, Customer customer); /** * Deletes Customer entries from cache * * @param String key * */ void deleteFromCache(String key); /** * Gets Customer cache * * @return IMap Coherence named cache */ IMap getCache(); } STEP 5 : CREATE CacheService IMPLEMENTATION CacheService is implementation of ICacheService Interface. package com.onlinetechvision.cache.srv; import com.hazelcast.core.IMap; import com.onlinetechvision.customer.Customer; import com.onlinetechvision.test.listener.CustomerEntryListener; /** * CacheService Class is implementation of ICacheService Interface. * * @author onlinetechvision.com * @since 27 Nov 2012 * @version 1.0.0 * */ public class CacheService implements ICacheService { private IMap customerMap; /** * Constructor of CacheService * * @param IMap customerMap * */ @SuppressWarnings("unchecked") public CacheService(IMap customerMap) { setCustomerMap(customerMap); getCustomerMap().addEntryListener(new CustomerEntryListener(), true); } /** * Adds Customer entries to cache * * @param String key * @param Customer customer * */ @Override public void addToCache(String key, Customer customer) { getCustomerMap().put(key, customer); } /** * Deletes Customer entries from cache * * @param String key * */ @Override public void deleteFromCache(String key) { getCustomerMap().remove(key); } /** * Gets Customer cache * * @return IMap Coherence named cache */ @Override public IMap getCache() { return getCustomerMap(); } public IMap getCustomerMap() { return customerMap; } public void setCustomerMap(IMap customerMap) { this.customerMap = customerMap; } } STEP 6 : CREATE IDistributedExecutorService INTERFACE A new IDistributedExecutorService Interface is created for service layer to expose distributed execution functionality. package com.onlinetechvision.executor.srv; import java.util.Collection; import java.util.Set; import java.util.concurrent.Callable; import java.util.concurrent.ExecutionException; import com.hazelcast.core.Member; /** * A new IDistributedExecutorService Interface is created for service layer to expose distributed execution functionality. * * @author onlinetechvision.com * @since 27 Nov 2012 * @version 1.0.0 * */ public interface IDistributedExecutorService { /** * Executes the callable object on stated member * * @param Callable callable * @param Member member * @throws InterruptedException * @throws ExecutionException * */ String executeOnStatedMember(Callable callable, Member member) throws InterruptedException, ExecutionException; /** * Executes the callable object on member owning the key * * @param Callable callable * @param Object key * @throws InterruptedException * @throws ExecutionException * */ String executeOnTheMemberOwningTheKey(Callable callable, Object key) throws InterruptedException, ExecutionException; /** * Executes the callable object on any member * * @param Callable callable * @throws InterruptedException * @throws ExecutionException * */ String executeOnAnyMember(Callable callable) throws InterruptedException, ExecutionException; /** * Executes the callable object on all members * * @param Callable callable * @param Set all members * @throws InterruptedException * @throws ExecutionException * */ Collection executeOnMembers(Callable callable, Set members) throws InterruptedException, ExecutionException; } STEP 7 : CREATE DistributedExecutorService IMPLEMENTATION DistributedExecutorService is implementation of IDistributedExecutorService Interface. package com.onlinetechvision.executor.srv; import java.util.Collection; import java.util.Set; import java.util.concurrent.Callable; import java.util.concurrent.ExecutionException; import java.util.concurrent.ExecutorService; import java.util.concurrent.Future; import java.util.concurrent.FutureTask; import org.apache.log4j.Logger; import com.hazelcast.core.DistributedTask; import com.hazelcast.core.Member; import com.hazelcast.core.MultiTask; /** * DistributedExecutorService Class is implementation of IDistributedExecutorService Interface. * * @author onlinetechvision.com * @since 27 Nov 2012 * @version 1.0.0 * */ public class DistributedExecutorService implements IDistributedExecutorService { private static final Logger logger = Logger.getLogger(DistributedExecutorService.class); private ExecutorService hazelcastDistributedExecutorService; /** * Executes the callable object on stated member * * @param Callable callable * @param Member member * @throws InterruptedException * @throws ExecutionException * */ @SuppressWarnings("unchecked") public String executeOnStatedMember(Callable callable, Member member) throws InterruptedException, ExecutionException { logger.debug("Method executeOnStatedMember is called..."); ExecutorService executorService = getHazelcastDistributedExecutorService(); FutureTask task = (FutureTask) executorService.submit( new DistributedTask(callable, member)); String result = task.get(); logger.debug("Result of method executeOnStatedMember is : " + result); return result; } /** * Executes the callable object on member owning the key * * @param Callable callable * @param Object key * @throws InterruptedException * @throws ExecutionException * */ @SuppressWarnings("unchecked") public String executeOnTheMemberOwningTheKey(Callable callable, Object key) throws InterruptedException, ExecutionException { logger.debug("Method executeOnTheMemberOwningTheKey is called..."); ExecutorService executorService = getHazelcastDistributedExecutorService(); FutureTask task = (FutureTask) executorService.submit(new DistributedTask(callable, key)); String result = task.get(); logger.debug("Result of method executeOnTheMemberOwningTheKey is : " + result); return result; } /** * Executes the callable object on any member * * @param Callable callable * @throws InterruptedException * @throws ExecutionException * */ public String executeOnAnyMember(Callable callable) throws InterruptedException, ExecutionException { logger.debug("Method executeOnAnyMember is called..."); ExecutorService executorService = getHazelcastDistributedExecutorService(); Future task = executorService.submit(callable); String result = task.get(); logger.debug("Result of method executeOnAnyMember is : " + result); return result; } /** * Executes the callable object on all members * * @param Callable callable * @param Set all members * @throws InterruptedException * @throws ExecutionException * */ public Collection executeOnMembers(Callable callable, Set members) throws ExecutionException, InterruptedException { logger.debug("Method executeOnMembers is called..."); MultiTask task = new MultiTask(callable, members); ExecutorService executorService = getHazelcastDistributedExecutorService(); executorService.execute(task); Collection results = task.get(); logger.debug("Result of method executeOnMembers is : " + results.toString()); return results; } public ExecutorService getHazelcastDistributedExecutorService() { return hazelcastDistributedExecutorService; } public void setHazelcastDistributedExecutorService(ExecutorService hazelcastDistributedExecutorService) { this.hazelcastDistributedExecutorService = hazelcastDistributedExecutorService; } } STEP 8 : CREATE TestCallable CLASS TestCallable Class shows business logic to be executed. TestCallable task for first member of the cluster : package com.onlinetechvision.task; import java.io.Serializable; import java.util.concurrent.Callable; /** * TestCallable Class shows business logic to be executed. * * @author onlinetechvision.com * @since 27 Nov 2012 * @version 1.0.0 * */ public class TestCallable implements Callable, Serializable{ private static final long serialVersionUID = -1839169907337151877L; /** * Computes a result, or throws an exception if unable to do so. * * @return String computed result * @throws Exception if unable to compute a result */ public String call() throws Exception { return "First Member' s TestCallable Task is called..."; } } TestCallable task for second member of the cluster : package com.onlinetechvision.task; import java.io.Serializable; import java.util.concurrent.Callable; /** * TestCallable Class shows business logic to be executed. * * @author onlinetechvision.com * @since 27 Nov 2012 * @version 1.0.0 * */ public class TestCallable implements Callable, Serializable{ private static final long serialVersionUID = -1839169907337151877L; /** * Computes a result, or throws an exception if unable to do so. * * @return String computed result * @throws Exception if unable to compute a result */ public String call() throws Exception { return "Second Member' s TestCallable Task is called..."; } } STEP 9 : CREATE AnotherAvailableMemberNotFoundException CLASS AnotherAvailableMemberNotFoundException is thrown when another available member is not found. To avoid this exception, first node should be started before the second node. package com.onlinetechvision.exception; /** * AnotherAvailableMemberNotFoundException is thrown when another available member is not found. * To avoid this exception, first node should be started before the second node. * * @author onlinetechvision.com * @since 27 Nov 2012 * @version 1.0.0 * */ public class AnotherAvailableMemberNotFoundException extends Exception { private static final long serialVersionUID = -3954360266393077645L; /** * Constructor of AnotherAvailableMemberNotFoundException * * @param String Exception message * */ public AnotherAvailableMemberNotFoundException(String message) { super(message); } } STEP 10 : CREATE CustomerEntryListener CLASS CustomerEntryListener Class listens entry changes on named cache object. package com.onlinetechvision.test.listener; import com.hazelcast.core.EntryEvent; import com.hazelcast.core.EntryListener; /** * CustomerEntryListener Class listens entry changes on named cache object. * * @author onlinetechvision.com * @since 27 Nov 2012 * @version 1.0.0 * */ @SuppressWarnings("rawtypes") public class CustomerEntryListener implements EntryListener { /** * Invoked when an entry is added. * * @param EntryEvent * */ public void entryAdded(EntryEvent ee) { System.out.println("EntryAdded... Member : " + ee.getMember() + ", Key : "+ee.getKey()+", OldValue : "+ee.getOldValue()+", NewValue : "+ee.getValue()); } /** * Invoked when an entry is removed. * * @param EntryEvent * */ public void entryRemoved(EntryEvent ee) { System.out.println("EntryRemoved... Member : " + ee.getMember() + ", Key : "+ee.getKey()+", OldValue : "+ee.getOldValue()+", NewValue : "+ee.getValue()); } /** * Invoked when an entry is evicted. * * @param EntryEvent * */ public void entryEvicted(EntryEvent ee) { } /** * Invoked when an entry is updated. * * @param EntryEvent * */ public void entryUpdated(EntryEvent ee) { } } STEP 11 : CREATE Starter CLASS Starter Class loads Customers to cache and executes distributed tasks. Starter Class of first member of the cluster : package com.onlinetechvision.exe; import com.onlinetechvision.cache.srv.ICacheService; import com.onlinetechvision.customer.Customer; /** * Starter Class loads Customers to cache and executes distributed tasks. * * @author onlinetechvision.com * @since 27 Nov 2012 * @version 1.0.0 * */ public class Starter { private ICacheService cacheService; /** * Loads cache and executes the tasks * */ public void start() { loadCacheForFirstMember(); } /** * Loads Customers to cache * */ public void loadCacheForFirstMember() { Customer firstCustomer = new Customer(); firstCustomer.setId("1"); firstCustomer.setName("Jodie"); firstCustomer.setSurname("Foster"); Customer secondCustomer = new Customer(); secondCustomer.setId("2"); secondCustomer.setName("Kate"); secondCustomer.setSurname("Winslet"); getCacheService().addToCache(firstCustomer.getId(), firstCustomer); getCacheService().addToCache(secondCustomer.getId(), secondCustomer); } public ICacheService getCacheService() { return cacheService; } public void setCacheService(ICacheService cacheService) { this.cacheService = cacheService; } } Starter Class of second member of the cluster : package com.onlinetechvision.exe; import java.util.Set; import java.util.concurrent.ExecutionException; import com.hazelcast.core.Hazelcast; import com.hazelcast.core.HazelcastInstance; import com.hazelcast.core.Member; import com.onlinetechvision.cache.srv.ICacheService; import com.onlinetechvision.customer.Customer; import com.onlinetechvision.exception.AnotherAvailableMemberNotFoundException; import com.onlinetechvision.executor.srv.IDistributedExecutorService; import com.onlinetechvision.task.TestCallable; /** * Starter Class loads Customers to cache and executes distributed tasks. * * @author onlinetechvision.com * @since 27 Nov 2012 * @version 1.0.0 * */ public class Starter { private String hazelcastInstanceName; private Hazelcast hazelcast; private IDistributedExecutorService distributedExecutorService; private ICacheService cacheService; /** * Loads cache and executes the tasks * */ public void start() { loadCache(); executeTasks(); } /** * Loads Customers to cache * */ public void loadCache() { Customer firstCustomer = new Customer(); firstCustomer.setId("3"); firstCustomer.setName("Bruce"); firstCustomer.setSurname("Willis"); Customer secondCustomer = new Customer(); secondCustomer.setId("4"); secondCustomer.setName("Colin"); secondCustomer.setSurname("Farrell"); getCacheService().addToCache(firstCustomer.getId(), firstCustomer); getCacheService().addToCache(secondCustomer.getId(), secondCustomer); } /** * Executes Tasks * */ public void executeTasks() { try { getDistributedExecutorService().executeOnStatedMember(new TestCallable(), getAnotherMember()); getDistributedExecutorService().executeOnTheMemberOwningTheKey(new TestCallable(), "3"); getDistributedExecutorService().executeOnAnyMember(new TestCallable()); getDistributedExecutorService().executeOnMembers(new TestCallable(), getAllMembers()); } catch (InterruptedException | ExecutionException | AnotherAvailableMemberNotFoundException e) { e.printStackTrace(); } } /** * Gets cluster members * * @return Set Set of Cluster Members * */ private Set getAllMembers() { Set members = getHazelcastLocalInstance().getCluster().getMembers(); return members; } /** * Gets an another member of cluster * * @return Member Another Member of Cluster * @throws AnotherAvailableMemberNotFoundException An Another Available Member can not found exception */ private Member getAnotherMember() throws AnotherAvailableMemberNotFoundException { Set members = getAllMembers(); for(Member member : members) { if(!member.localMember()) { return member; } } throw new AnotherAvailableMemberNotFoundException("No Other Available Member on the cluster. Please be aware that all members are active on the cluster"); } /** * Gets Hazelcast local instance * * @return HazelcastInstance Hazelcast local instance */ @SuppressWarnings("static-access") private HazelcastInstance getHazelcastLocalInstance() { HazelcastInstance instance = getHazelcast().getHazelcastInstanceByName(getHazelcastInstanceName()); return instance; } public String getHazelcastInstanceName() { return hazelcastInstanceName; } public void setHazelcastInstanceName(String hazelcastInstanceName) { this.hazelcastInstanceName = hazelcastInstanceName; } public Hazelcast getHazelcast() { return hazelcast; } public void setHazelcast(Hazelcast hazelcast) { this.hazelcast = hazelcast; } public IDistributedExecutorService getDistributedExecutorService() { return distributedExecutorService; } public void setDistributedExecutorService(IDistributedExecutorService distributedExecutorService) { this.distributedExecutorService = distributedExecutorService; } public ICacheService getCacheService() { return cacheService; } public void setCacheService(ICacheService cacheService) { this.cacheService = cacheService; } } STEP 12 : CREATE hazelcast-config.properties FILE hazelcast-config.properties file shows the properties of cluster members. First member properties : hz.instance.name = OTVInstance1 hz.group.name = dev hz.group.password = dev hz.management.center.enabled = true hz.management.center.url = http://localhost:8080/mancenter hz.network.port = 5701 hz.network.port.auto.increment = false hz.tcp.ip.enabled = true hz.members = 192.168.1.32 hz.executor.service.core.pool.size = 2 hz.executor.service.max.pool.size = 30 hz.executor.service.keep.alive.seconds = 30 hz.map.backup.count=2 hz.map.max.size=0 hz.map.eviction.percentage=30 hz.map.read.backup.data=true hz.map.cache.value=true hz.map.eviction.policy=NONE hz.map.merge.policy=hz.ADD_NEW_ENTRY Second member properties : hz.instance.name = OTVInstance2 hz.group.name = dev hz.group.password = dev hz.management.center.enabled = true hz.management.center.url = http://localhost:8080/mancenter hz.network.port = 5702 hz.network.port.auto.increment = false hz.tcp.ip.enabled = true hz.members = 192.168.1.32 hz.executor.service.core.pool.size = 2 hz.executor.service.max.pool.size = 30 hz.executor.service.keep.alive.seconds = 30 hz.map.backup.count=2 hz.map.max.size=0 hz.map.eviction.percentage=30 hz.map.read.backup.data=true hz.map.cache.value=true hz.map.eviction.policy=NONE hz.map.merge.policy=hz.ADD_NEW_ENTRY STEP 13 : CREATE applicationContext-hazelcast.xml Spring Hazelcast Configuration file, applicationContext-hazelcast.xml, is created and Hazelcast Distributed Executor Service and Hazelcast Instance are configured. ${hz.instance.name} ${hz.members} STEP 14 : CREATE applicationContext.xml Spring Configuration file, applicationContext.xml, is created. classpath:/hazelcast-config.properties STEP 15 : CREATE Application CLASS Application Class is created to run the application. ackage com.onlinetechvision.exe; import org.springframework.context.ApplicationContext; import org.springframework.context.support.ClassPathXmlApplicationContext; /** * Application class starts the application * * @author onlinetechvision.com * @since 27 Nov 2012 * @version 1.0.0 * */ public class Application { /** * Starts the application * * @param String[] args * */ public static void main(String[] args) { ApplicationContext context = new ClassPathXmlApplicationContext("applicationContext.xml"); Starter starter = (Starter) context.getBean("starter"); starter.start(); } } STEP 16 : BUILD PROJECT After OTV_Spring_Hazelcast_DistributedExecution Project is built, OTV_Spring_Hazelcast_DistributedExecution-0.0.1-SNAPSHOT.jar will be created. Important Note : The Members of the cluster have got different configuration for Coherence so the project should be built separately for each member. STEP 17 : INTEGRATION with HAZELCAST MANAGEMENT CENTER Hazelcast Management Center enables to monitor and manage nodes in the cluster. Entity and backup counts which are owned by customerMap, can be seen via Map Memory Data Table. We have distributed 4 entries via customerMap as shown below : Sample keys and values can be seen via Map Browser : Added First Entry : Added Third Entry : hazelcastDistributedExecutorService details can be seen via Executors tab. We have executed 3 task on first member and 2 tasks on second member as shown below : STEP 18 : RUN PROJECT BY STARTING THE CLUSTER’ s MEMBER After created OTV_Spring_Hazelcast_DistributedExecution-0.0.1-SNAPSHOT.jar file is run at the cluster’ s members, the following console output logs will be shown : First member console output : Kas 25, 2012 4:07:20 PM com.hazelcast.impl.AddressPicker INFO: Interfaces is disabled, trying to pick one address from TCP-IP config addresses: [x.y.z.t] Kas 25, 2012 4:07:20 PM com.hazelcast.impl.AddressPicker INFO: Prefer IPv4 stack is true. Kas 25, 2012 4:07:20 PM com.hazelcast.impl.AddressPicker INFO: Picked Address[x.y.z.t]:5701, using socket ServerSocket[addr=/0:0:0:0:0:0:0:0,localport=5701], bind any local is true Kas 25, 2012 4:07:21 PM com.hazelcast.system INFO: [x.y.z.t]:5701 [dev] Hazelcast Community Edition 2.4 (20121017) starting at Address[x.y.z.t]:5701 Kas 25, 2012 4:07:21 PM com.hazelcast.system INFO: [x.y.z.t]:5701 [dev] Copyright (C) 2008-2012 Hazelcast.com Kas 25, 2012 4:07:21 PM com.hazelcast.impl.LifecycleServiceImpl INFO: [x.y.z.t]:5701 [dev] Address[x.y.z.t]:5701 is STARTING Kas 25, 2012 4:07:24 PM com.hazelcast.impl.TcpIpJoiner INFO: [x.y.z.t]:5701 [dev] --A new cluster is created and First Member joins the cluster. Members [1] { Member [x.y.z.t]:5701 this } Kas 25, 2012 4:07:24 PM com.hazelcast.impl.MulticastJoiner INFO: [x.y.z.t]:5701 [dev] Members [1] { Member [x.y.z.t]:5701 this } ... -- First member adds two new entries to the cache... EntryAdded... Member : Member [x.y.z.t]:5701 this, Key : 1, OldValue : null, NewValue : Customer [id=1, name=Jodie, surname=Foster] EntryAdded... Member : Member [x.y.z.t]:5701 this, Key : 2, OldValue : null, NewValue : Customer [id=2, name=Kate, surname=Winslet] ... --Second Member joins the cluster. Members [2] { Member [x.y.z.t]:5701 this Member [x.y.z.t]:5702 } ... -- Second member adds two new entries to the cache... EntryAdded... Member : Member [x.y.z.t]:5702, Key : 4, OldValue : null, NewValue : Customer [id=4, name=Colin, surname=Farrell] EntryAdded... Member : Member [x.y.z.t]:5702, Key : 3, OldValue : null, NewValue : Customer [id=3, name=Bruce, surname=Willis] Second member console output : Kas 25, 2012 4:07:48 PM com.hazelcast.impl.AddressPicker INFO: Interfaces is disabled, trying to pick one address from TCP-IP config addresses: [x.y.z.t] Kas 25, 2012 4:07:48 PM com.hazelcast.impl.AddressPicker INFO: Prefer IPv4 stack is true. Kas 25, 2012 4:07:48 PM com.hazelcast.impl.AddressPicker INFO: Picked Address[x.y.z.t]:5702, using socket ServerSocket[addr=/0:0:0:0:0:0:0:0,localport=5702], bind any local is true Kas 25, 2012 4:07:49 PM com.hazelcast.system INFO: [x.y.z.t]:5702 [dev] Hazelcast Community Edition 2.4 (20121017) starting at Address[x.y.z.t]:5702 Kas 25, 2012 4:07:49 PM com.hazelcast.system INFO: [x.y.z.t]:5702 [dev] Copyright (C) 2008-2012 Hazelcast.com Kas 25, 2012 4:07:49 PM com.hazelcast.impl.LifecycleServiceImpl INFO: [x.y.z.t]:5702 [dev] Address[x.y.z.t]:5702 is STARTING Kas 25, 2012 4:07:49 PM com.hazelcast.impl.Node INFO: [x.y.z.t]:5702 [dev] ** setting master address to Address[x.y.z.t]:5701 Kas 25, 2012 4:07:49 PM com.hazelcast.impl.MulticastJoiner INFO: [x.y.z.t]:5702 [dev] Connecting to master node: Address[x.y.z.t]:5701 Kas 25, 2012 4:07:49 PM com.hazelcast.nio.ConnectionManager INFO: [x.y.z.t]:5702 [dev] 55715 accepted socket connection from /x.y.z.t:5701 Kas 25, 2012 4:07:55 PM com.hazelcast.cluster.ClusterManager INFO: [x.y.z.t]:5702 [dev] --Second Member joins the cluster. Members [2] { Member [x.y.z.t]:5701 Member [x.y.z.t]:5702 this } Kas 25, 2012 4:07:56 PM com.hazelcast.impl.LifecycleServiceImpl INFO: [x.y.z.t]:5702 [dev] Address[x.y.z.t]:5702 is STARTED -- Second member adds two new entries to the cache... EntryAdded... Member : Member [x.y.z.t]:5702 this, Key : 3, OldValue : null, NewValue : Customer [id=3, name=Bruce, surname=Willis] EntryAdded... Member : Member [x.y.z.t]:5702 this, Key : 4, OldValue : null, NewValue : Customer [id=4, name=Colin, surname=Farrell] 25.11.2012 16:07:56 DEBUG (DistributedExecutorService.java:42) - Method executeOnStatedMember is called... 25.11.2012 16:07:56 DEBUG (DistributedExecutorService.java:46) - Result of method executeOnStatedMember is : First Member' s TestCallable Task is called... 25.11.2012 16:07:56 DEBUG (DistributedExecutorService.java:61) - Method executeOnTheMemberOwningTheKey is called... 25.11.2012 16:07:56 DEBUG (DistributedExecutorService.java:65) - Result of method executeOnTheMemberOwningTheKey is : First Member' s TestCallable Task is called... 25.11.2012 16:07:56 DEBUG (DistributedExecutorService.java:78) - Method executeOnAnyMember is called... 25.11.2012 16:07:57 DEBUG (DistributedExecutorService.java:82) - Result of method executeOnAnyMember is : Second Member' s TestCallable Task is called... 25.11.2012 16:07:57 DEBUG (DistributedExecutorService.java:96) - Method executeOnMembers is called... 25.11.2012 16:07:57 DEBUG (DistributedExecutorService.java:101) - Result of method executeOnMembers is : [First Member' s TestCallable Task is called..., Second Member' s TestCallable Task is called...] STEP 19 : DOWNLOAD https://github.com/erenavsarogullari/OTV_Spring_Hazelcast_DistributedExecution REFERENCES : Java ExecutorService Interface Hazelcast Distributed Executor Service
December 11, 2012
by Eren Avsarogullari
· 29,924 Views · 1 Like
article thumbnail
Lightweight RPC with ZeroMQ (ØMQ) and Protocol Buffers
A frequent issue I come across writing integration applications with Mule is deciding how to communicate back and forth between my front end application, typically a web or mobile application, and a flow hosted on Mule. I could use web services and do something like annotate a component with JAX-RS and expose this out over HTTP. This is potentially overkill, particularly if I only want to host a few methods, the methods are asynchronous or I don’t want to deal with the overhead of HTTP. It also could be a lot of extra effort if the only consumers of the API, at least initially, are internal facing applications. Another choice is to use “synchronous” JMS with temporary reply queues. While Mule makes this easy to do, particularly with MuleClient, I now have to deal with the overhead of spinning up a JMS infrastructure. I could also be limited to Java only clients, depending on which JMS broker I choose. The latter is particularly signifcant, as Java probably isn’t the technology of choice on the web or mobile layer. ØMQ for RPC ØMQ, or ZeroMQ, is a networking library designed from the ground up to ease integration between distributed applications. In addition to supporting a variety of messaging patterns, which are enumerated in the extremely well written guide, the library is written in platform agnostic C with wrappers for different languages like Java, Python and Ruby. These features make it a good candidate to solve the challenges I introduced above, particularly since a community contributed module for ØMQ was released recently. Let’s consider a simple service that accepts a request for a range of stock quotes and returns the results and see how we can host this service with Mule and expose it out with the ØMQ Module. Data Serialization with Protocol Buffers Data is transported back and forth over ØMQ as byte arrays. We, as such, need to decide on a way to serialize our stock quote request and responses “on the wire.” Before we do that, however, let’s take a look at the Java canonical data model we’re using on the client and server side. The following Gists show the important bits of the StockQuote and StockQuoteResponse classes. public class StockQuote implements Serializable { String symbol; Date date; Double open; Double high; Double low; Double close; Long volume; Double adjustedClose; public class StockQuoteRequest implements Serializable { String symbol; Date startDate; Date endDate; public interface StockDataService { public List getQuote(StockQuoteRequest request); } We could use Java serialization to get the objects into byte arrays. Ignoring the other deficiencies of default Java serialization, the main drawback is that it limits our clients to one’s running on a JVM. XML or JSON provide better alternatives, but for the purposes of this example we’ll assume we want a more compact representation of the data (this isn’t totally unrealistic, stock quote data can be extremely time sensitive and we probably want to minimize serialization and deserialization overhead.) Protocol Buffers provide a good middle ground and also boast a Mule Module to provide the necessary transformers we need to move back and forth from the byte array representations. Let’s define two .proto files to define the wire format and generate the intermediary stubs for serialization. package com.acmesoft.zeromq; option java_package = "com.acmesoft.stock.model.serialization.protobuf"; option optimize_for = SPEED;package com.acmesoft.zeromq; option java_package = "com.acmesoft.stock.model.serialization.protobuf"; option optimize_for = SPEED; option java_multiple_files = true; message StockQuoteResponseBuffer { repeated StockQuoteBuffer result = 1; } message StockQuoteBuffer { required string symbol = 1; required int64 date = 2; required double open = 3; required double high = 4; required double low = 5; required double close = 6; required int64 volume = 7; required double adjustedClose = 8; } option java_multiple_files = true; message StockQuoteRequestBuffer { required string symbol = 1; required int64 start = 2; required int64 end = 3; } You typically would use the “protoc” compiler to generate the Java stubs. This is tedious, however, so we’ll instead modify the pom.xml of our project to compile the protoc files during the compile goals: com.google.protobuf.tools maven-protoc-plugin /usr/local/bin/protoc compile testCompile Since we already have a domain model we’ll add some helper classes to simplify the serialization tasks on the client side. public byte[] toProtocolBufferAsBytes() { return StockQuoteRequestBuffer.newBuilder() .setSymbol(symbol) .setStart(startDate.getTime()) .setEnd(endDate.getTime()).build().toByteArray(); } public static StockQuoteRequest fromProtocolBuffer(StockQuoteRequestBuffer buffer) { StockQuoteRequest request = new StockQuoteRequest(); request.setSymbol(buffer.getSymbol()); request.setStartDate(new Date(buffer.getStart())); request.setEndDate(new Date(buffer.getEnd())); return request; } public static StockQuoteResponseBuffer toProtocolBuffer(List quotes) { StockQuoteResponseBuffer.Builder responseBuilder = StockQuoteResponseBuffer.newBuilder(); for (StockQuote quote : quotes) { responseBuilder.addResult(StockQuoteBuffer.newBuilder() .setAdjustedClose(quote.getAdjustedClose()) .setClose(quote.getClose()) .setDate(quote.getDate().getTime()) .setHigh(quote.getHigh()) .setLow(quote.getLow()) .setOpen(quote.getOpen()) .setSymbol(quote.getSymbol()) .setVolume(quote.getVolume()).build()); } return responseBuilder.build(); } public static List listOfStockQuotesFromBytes(byte[] bytes) { List buffer; try { buffer = StockQuoteResponseBuffer.parseFrom(bytes).getResultList(); } catch (InvalidProtocolBufferException e) { throw new SerializationException(e); } List quotes = new ArrayList(); for (StockQuoteBuffer stockQuoteBuffer : buffer) { StockQuote stockQuote = new StockQuote(); stockQuote.setClose(stockQuoteBuffer.getClose()); stockQuote.setDate(new Date(stockQuoteBuffer.getDate())); stockQuote.setHigh(stockQuoteBuffer.getHigh()); stockQuote.setOpen(stockQuoteBuffer.getOpen()); stockQuote.setSymbol(stockQuoteBuffer.getSymbol()); stockQuote.setVolume(stockQuoteBuffer.getVolume()); stockQuote.setAdjustedClose(stockQuoteBuffer.getAdjustedClose()); stockQuote.setLow(stockQuoteBuffer.getLow()); quotes.add(stockQuote); } return quotes; } Configuring StockDataService Now that we have a canonical data model and a wire format defined we’re ready to wire up a Mule flow to expose the service out. Note that for this to work you need to have jzmq installed locally on your system. The following dependency needs to be added to your pom.xml once its installed: org.zeromq zmq 2.2.0 /usr/local/lib/zmq.jar system Where systemPath is the location of the zmq.jar on your filesystem. Once that’s out of the way we can configure the flow, as illustrated below: The ZeroMQ inbound-endpoint will be bound to TCP port 9090 with a request-response exchange pattern. The deserialize MP in the protobuf module will deserialize the byte array to the generated StockQuoteRequestBuffer class. From there we’ll use MEL to invoke the helper method on StockQuoteRequest to transform the intermediary class to the domain model. The List of StockQuotes returned from StockDataService will be transformed by the MEL expression using the “toProtocolBuffer” helper method on the domain model. The Protocol Buffer Module is then smart enough to implicitly transform the intermediary object to a byte array for the response. Consuming the Service from the Client Side Now that the server is ready we can turn our attention to the client side code to invoke the remote service. Let’s take a look at how this works: StockQuoteRequest stockQuoteRequest = new StockQuoteRequest(); stockQuoteRequest.setSymbol("FB"); stockQuoteRequest.setStartDate(new Date( new Date().getTime() - (86400000 * 7))); stockQuoteRequest.setEndDate(new Date()); ZMQ.Socket zmqSocket = zmqContext.socket(ZMQ.REQ); zmqSocket.setReceiveTimeOut(RECEIVE_TIMEOUT); zmqSocket.connect("tcp://localhost:9090"); zmqSocket.send(stockQuoteRequest.toProtocolBufferAsBytes(), 0); List quotes = StockQuote.listOfStockQuotesFromBytes(zmqSocket.recv(0)); We start off by defining the StockQuoteRequest object to give us all the quotes for Facebook stock from the last week. We can then open up a ZMQ socket, set the timeout, connect to the ZMQ socket on the remote Mule instance and send the byte representation of the StockQuoteRequest to it. zmqSocket.recv is then used to receive the bytes back from Mule. From here we can use the listOfStockQuotesFromBytes helper method we wrote above to convert the Protocol Buffer representation to a List of StockQuotes. Despite the fair bit of plumbing we did above, this is a pretty concise bit of client side code to invoke the remote service. Conclusion This blog post only touched on the features of ØMQ and the ØMQ Mule Module. In addition to request-reply, other exchange-patterns are supported, like one-way, push and pull. This effectively gives you the benefits of a reliable, asynchronous messaging layer without a centralized infrastructure. I hope to cover this in a later post. Protocol buffers also seem like a natural fit as a wire format for ØMQ. protobuffers echo ØMQ’s principals of being lightweight, fast and platform agnostic. These are also, not coincidently, principals Mule shares as an integration framework. The project for this example is available on GitHub.
November 26, 2012
by John D'Emic
· 28,583 Views
article thumbnail
Exporting and Importing VM Settings with Azure Command-Line Tools
We've talked previously about the Windows Azure command-line tools, and have used them in a few posts such as Brian's Migrating Drupal to a Windows Azure VM. While the tools are generally useful for tons of stuff, one of the things that's been painful to do with the command-line is export the settings for a VM, and then recreate the VM from those settings. You might be wondering why you'd want to export a VM and then recreate it. For me, cost is the first thing that comes to mind. It costs more to keep a VM running than it does to just keep the disk in storage. So if I had something in a VM that I'm only using a few hours a day, I'd delete the VM when I'm not using it and recreate it when I need it again. Another potential reason is that you want to create a copy of the disk so that you can create a duplicate virtual machine. The export process used to be pretty arcane stuff; using the azure vm show command with a --json parameter and piping the output to file. Then hacking the .json file to fix it up so it could be used with the azure vm create-from command. It was bad. It was so bad, the developers added a new export command to create the .json file for you. Here's the basic process: Create a VM VM creation has been covered multiple ways already; you're either going to use the portal or command line tools, and you're either going to select an image from the library or upload a VHD. In my case, I used the following command: azure vm create larryubuntu CANONICAL__Canonical-Ubuntu-12-04-amd64-server-20120528.1.3-en-us-30GB.vhd larry NotaRe This command creates a new VM in the East US data center, enables SSH on port 22 and then stores a disk image for this VM in a blob. You can see the new disk image in blob storage by running: azure vm disk list The results should return something like: info: Executing command vm disk list + Fetching disk images data: Name OS data: ---------------------------------------- ------- data: larryubuntu-larryubuntu-0-20121019170709 Linux info: vm disk list command OK That's the actual disk image that is mounted by the VM. Export and Delete the VM Alright, I've done my work and it's the weekend. I need to export the VM settings so I can recreate it on Monday, then delete the VM so I won't get charged for the next 48 hours of not working. To export the settings for the VM, I use the following command: azure vm export larryubuntu c:\stuff\vminfo.json This tells Windows Azure to find the VM named larryubuntu and export its settings to c:\stuff\vminfo.json. The .json file will contain something like this: { "RoleName":"larryubuntu", "RoleType":"PersistentVMRole", "ConfigurationSets": [ { "ConfigurationSetType":"NetworkConfiguration", "InputEndpoints": [ { "LocalPort":"22", "Name":"ssh", "Port":"22", "Protocol":"tcp", "Vip":"168.62.177.227" } ], "SubnetNames":[] } ], "DataVirtualHardDisks":[], "OSVirtualHardDisk": { "HostCaching":"ReadWrite", "DiskName":"larryubuntu-larryubuntu-0-20121024155441", "OS":"Linux" }, "RoleSize":"Small" } If you're like me, you'll immediately start thinking "Hrmmm, I wonder if I can mess around with things like RoleSize." And yes, you can. If you wanted to bump this up to medium, you'd just change that parameter to medium. If you want to play around more with the various settings, it looks like the schema is maintained at https://github.com/WindowsAzure/azure-sdk-for-node/blob/master/lib/services/serviceManagement/models/roleschema.json. Once I've got the file, I can safely delete the VM by using the following command. azure vm delete larryubuntu It spins a bit and then no more VM. Recreate the VM Ugh, Monday. Time to go back to work, and I need my VM back up and running. So I run the following command: azure vm create-from larryubuntu c:\stuff\vminfo.json --location "East US" It takes only a minute or two to spin up the VM and it's ready for work. That's it - fast, simple, and far easier than the old process of generating the .json settings file. Note that I haven't played around much with the various settings described in the schema for the json file that I linked above. If you find anything useful or interesting that can be accomplished by hacking around with the .json, leave a comment about it.
October 29, 2012
by Larry Franks
· 6,406 Views
article thumbnail
PartitionKey and RowKey in Windows Azure Table Storage
For the past few months, I’ve been coaching a “Microsoft Student Partner” (who has a great blog on Kinect for Windows by the way!) on Windows Azure. One of the questions he recently had was around PartitionKey and RowKey in Windows Azure Table Storage. What are these for? Do I have to specify them manually? Let’s explain… Windows Azure storage partitions All Windows Azure storage abstractions (Blob, Table, Queue) are built upon the same stack (whitepaper here). While there’s much more to tell about it, the reason why it scales is because of its partitioning logic. Whenever you store something on Windows Azure storage, it is located on some partition in the system. Partitions are used for scale out in the system. Imagine that there’s only 3 physical machines that are used for storing data in Windows Azure storage: Based on the size and load of a partition, partitions are fanned out across these machines. Whenever a partition gets a high load or grows in size, the Windows Azure storage management can kick in and move a partition to another machine: By doing this, Windows Azure can ensure a high throughput as well as its storage guarantees. If a partition gets busy, it’s moved to a server which can support the higher load. If it gets large, it’s moved to a location where there’s enough disk space available. Partitions are different for every storage mechanism: In blob storage, each blob is in a separate partition. This means that every blob can get the maximal throughput guaranteed by the system. In queues, every queue is a separate partition. In tables, it’s different: you decide how data is co-located in the system. PartitionKey in Table Storage In Table Storage, you have to decide on the PartitionKey yourself. In essence, you are responsible for the throughput you’ll get on your system. If you put every entity in the same partition (by using the same partition key), you’ll be limited to the size of the storage machines for the amount of storage you can use. Plus, you’ll be constraining the maximal throughput as there’s lots of entities in the same partition. Should you set the PartitionKey to the same value for every entity stored? No. You’ll end up with scaling issues at some point. Should you set the PartitionKey to a unique value for every entity stored? No. You can do this and every entity stored will end up in its own partition, but you’ll find that querying your data becomes more difficult. And that’s where our next concept kicks in… RowKey in Table Storage A RowKey in Table Storage is a very simple thing: it’s your “primary key” within a partition. PartitionKey + RowKey form the composite unique identifier for an entity. Within one PartitionKey, you can only have unique RowKeys. If you use multiple partitions, the same RowKey can be reused in every partition. So in essence, a RowKey is just the identifier of an entity within a partition. PartitionKey and RowKey and performance Before building your code, it’s a good idea to think about both properties. Don’t just assign them a guid or a random string as it does matter for performance. The fastest way of querying? Specifying both PartitionKey and RowKey. By doing this, table storage will immediately know which partition to query and can simply do an ID lookup on RowKey within that partition. Less fast but still fast enough will be querying by specifying PartitionKey: table storage will know which partition to query. Less fast: querying on only RowKey. Doing this will give table storage no pointer on which partition to search in, resulting in a query that possibly spans multiple partitions, possibly multiple storage nodes as well. Wihtin a partition, searching on RowKey is still pretty fast as it’s a unique index. Slow: searching on other properties (again, spans multiple partitions and properties). Note that Windows Azure storage may decide to group partitions in so-called "Range partitions" - see http://msdn.microsoft.com/en-us/library/windowsazure/hh508997.aspx. In order to improve query performance, think about your PartitionKey and RowKey upfront, as they are the fast way into your datasets. Deciding on PartitionKey and RowKey Here’s an exercise: say you want to store customers, orders and orderlines. What will you choose as the PartitionKey (PK) / RowKey (RK)? Let’s use three tables: Customer, Order and Orderline. An ideal setup may be this one, depending on how you want to query everything: Customer (PK: sales region, RK: customer id) – it enables fast searches on region and on customer id Order (PK: customer id, RK; order id) – it allows me to quickly fetch all orders for a specific customer (as they are colocated in one partition), it still allows fast querying on a specific order id as well) Orderline (PK: order id, RK: order line id) – allows fast querying on both order id as well as order line id. Of course, depending on the system you are building, the following may be a better setup: Customer (PK: customer id, RK: display name) – it enables fast searches on customer id and display name Order (PK: customer id, RK; order id) – it allows me to quickly fetch all orders for a specific customer (as they are colocated in one partition), it still allows fast querying on a specific order id as well) Orderline (PK: order id, RK: item id) – allows fast querying on both order id as well as the item bought, of course given that one order can only contain one order line for a specific item (PK + RK should be unique) You see? Choose them wisely, depending on your queries. And maybe an important sidenote: don’t be afraid of denormalizing your data and storing data twice in a different format, supporting more query variations. There’s one additional “index” That’s right! People have been asking Microsoft for a secondary index. And it’s already there… The table name itself! Take our customer – order – orderline sample again… Having a Customer table containing all customers may be interesting to search within that data. But having an Orders table containing every order for every customer may not be the ideal solution. Maybe you want to create an order table per customer? Doing that, you can easily query the order id (it’s the table name) and within the order table, you can have more detail in PK and RK. And there's one more: your account name. Split data over multiple storage accounts and you have yet another "partition". Conclusion In conclusion? Choose PartitionKey and RowKey wisely. The more meaningful to your application or business domain, the faster querying will be and the more efficient table storage will work in the long run.
October 19, 2012
by Maarten Balliauw
· 57,670 Views · 10 Likes
article thumbnail
EasyNetQ Cluster Support
EasyNetQ, my super simple .NET API for RabbitMQ, now (from version 0.7.2.34) supports RabbitMQ clusters without any need to deploy a load balancer. Simply list the nodes of the cluster in the connection string ... var bus = RabbitHutch.CreateBus("host=ubuntu:5672,ubuntu:5673"); In this example I have set up a cluster on a single machine, 'ubuntu', with node 1 on port 5672 and node 2 on port 5673. When the CreateBus statement executes, EasyNetQ will attempt to connect to the first host listed (ubuntu:5672). If it fails to connect it will attempt to connect to the second host listed (ubuntu:5673). If neither node is available it will sit in a re-try loop attempting to connect to both servers every five seconds. It logs all this activity to the registered IEasyNetQLogger. You might see something like this if the first node was unavailable: DEBUG: Trying to connect ERROR: Failed to connect to Broker: 'ubuntu', Port: 5672 VHost: '/'. ExceptionMessage: 'None of the specified endpoints were reachable' DEBUG: OnConnected event fired INFO: Connected to RabbitMQ. Broker: 'ubuntu', Port: 5674, VHost: '/' If the node that EasyNetQ is connected to fails, EasyNetQ will attempt to connect to the next listed node. Once connected, it will re-declare all the exchanges and queues and re-start all the consumers. Here's an example log record showing one node failing then EasyNetQ connecting to the other node and recreating the subscribers: INFO: Disconnected from RabbitMQ Broker DEBUG: Trying to connect DEBUG: OnConnected event fired DEBUG: Re-creating subscribers INFO: Connected to RabbitMQ. Broker: 'ubuntu', Port: 5674, VHost: '/' You get automatic fail-over out of the box. That’s pretty cool. If you have multiple services using EasyNetQ to connect to a RabbitMQ cluster, they will all initially connect to the first listed node in their respective connection strings. For this reason the EasyNetQ cluster support is not really suitable for load balancing high throughput systems. I would recommend that you use a dedicated hardware or software load balancer instead, if that’s what you want.
October 14, 2012
by Mike Hadlow
· 6,851 Views
article thumbnail
How to Create and Deploy a Website with Windows Azure
Curator's note: This article originally appeared at WindowsAzure.com. To use this feature and other new Windows Azure capabilities, sign up for the free preview. Just as you can quickly create and deploy a web application created from the gallery, you can also deploy a website created on a workstation with traditional developer tools from Microsoft or other companies. Table of Contents Deployment Options How to: Create a Website Using the Management Portal How to: Create a Website from the Gallery How to: Delete a Website Next Steps Deployment Options Windows Azure supports deploying websites from remote computers using WebDeploy, FTP, GIT or TFS. Many development tools provide integrated support for publication using one or more of these methods and may only require that you provide the necessary credentials, site URL and hostname or URL for your chosen deployment method. Credentials and deployment URLs for all enabled deployment methods are stored in the website's publish profile, a file which can be downloaded in the Windows Azure (Preview) Management Portal from the Quick Start page or the quick glance section of the Dashboard page. If you prefer to deploy your website with a separate client application, high quality open source GIT and FTP clients are available for download on the Internet for this purpose. How to: Create a Website Using the Management Portal Follow these steps to create a website in Windows Azure. Login to the Windows Azure (Preview) Management Portal. Click the Create New icon on the bottom left of the Management Portal. Click the Web Site icon, click the Quick Create icon, enter a value for URL and then click the check mark next to create web site on the bottom right corner of the page. When the website has been created you will see the text Creation of Web Site '[SITENAME]' Completed. Click the name of the website displayed in the list of websites to open the website's Quick Start management page. On the Quick Start page you are provided with options to set up TFS or GIT publishing if you would like to deploy your finished website to Windows Azure using these methods. FTP publishing is set up by default for websites and the FTP Host name is displayed under FTP Hostname on the Quick Start and Dashboard pages. Before publishing with FTP or GIT choose the option to Reset deployment credentials on the Dashboard page. Then specify the new credentials (username and password) to authenticate against the FTP Host or the Git Repository when deploying content to the website. The Configure management page exposes several configurable application settings in the following sections: Framework: Set the version of .NET framework or PHP required by your web application. Diagnostics: Set logging options for gathering diagnostic information for your website in this section. App Settings: Specify name/value pairs that will be loaded by your web application on start up. For .NET sites, these settings will be injected into your .NET configuration AppSettings at runtime, overriding existing settings. For PHP and Node sites these settings will be available as environment variables at runtime. Connection Strings: View connection strings for linked resources. For .NET sites, these connection strings will be injected into your .NET configuration connectionStrings settings at runtime, overriding existing entries where the key equals the linked database name. For PHP and Node sites these settings will be available as environment variables at runtime. Default Documents: Add your web application's default document to this list if it is not already in the list. If your web application contains more than one of the files in the list then make sure your website's default document appears at the top of the list. How to: Create a Website from the Gallery The gallery makes available a wide range of popular web applications developed by Microsoft, third party companies, and open source software initiatives. Web applications created from the gallery do not require installation of any software other than the browser used to connect to the Windows Azure Management Portal. In this tutorial, you'll learn: How to create a new site through the gallery. How to deploy the site through the Windows Azure Portal. You'll build a Word press blog that uses a default template. The following illustration shows the completed application: Note To complete this tutorial, you need a Windows Azure account that has the Windows Azure Web Sites feature enabled. You can create a free trial account and enable preview features in just a couple of minutes. For details, see Create a Windows Azure account and enable preview features. Create a web site in the portal Login to the Windows Azure Management Portal. Click the New icon on the bottom left of the dashboard. Click the Web Site icon, and click From Gallery. Locate and click the WordPress icon in list, and then click Next. On the Configure Your App page, enter or select values for all fields: Enter a URL name of your choice Leave Create a new MySQL database selected in the Database field Select the region closest to you Then click Next. On the Create New Database page, you can specify a name for your new MySQL database or use the default name. Select the region closest to you as the hosting location. Select the box at the bottom of the screen to agree to ClearDB's usage terms for your hosted MySQL database. Then click the check to complete the site creation. After you click Complete Windows Azure will initiate build and deploy operations. While the web site is being built and deployed the status of these operations is displayed at the bottom of the Web Sites page. After all operations are performed, A final status message when the site has been successfully deployed. Launch and manage your WordPress site Click on your new site from the Web Sites page to open the dashboard for the site. On the Dashboard management page, scroll down and click the link on the left under Site Url to open the site’s welcome page. Enter appropriate configuration information required by WordPress and click Install WordPress to finalize configuration and open the web site’s login page. Login to the new WordPress web site by entering the username and password that you specified on the Welcome page. You'll have a new WordPress site that looks similar to the site below. How to: Delete a Website Websites are deleted using the Delete icon in the Windows Azure Management Portal. The Delete icon is available in the Windows Azure Portal when you click Web Sites to list all of your websites and at the bottom of each of the website management pages. Next Steps For more information about Websites, see the following: Walkthrough: Troubleshooting a Website on Windows Azure
October 9, 2012
by Eric Gregory
· 85,307 Views
article thumbnail
Your First Hadoop MapReduce Job
Hadoop MapReduce is a YARN-based system for parallel processing of large data sets. In this article, learn to quickly start writing the simplest MapReduce job.
September 12, 2012
by Amresh Singh
· 19,627 Views
article thumbnail
How to Write Better POJO Services
In Java, you can easily implement some business logic in Plain Old Java Object (POJO) classes, and then able to run them in a fancy server or framework without much hassle. There many server/frameworks, such as JBossAS, Spring or Camel etc, that would allow you to deploy POJO without even hardcoding to their API. Obviously you would get advance features if you willing to couple to their API specifics, but even if you do, you can keep these to minimal by encapsulating your own POJO and their API in a wrapper. By writing and designing your own application as simple POJO as possible, you will have the most flexible ways in choose a framework or server to deploy and run your application. One effective way to write your business logic in these environments is to use Service component. In this article I will share few things I learned in writing Services. What is a Service? The word Service is overly used today, and it could mean many things to different people. When I say Service, my definition is a software component that has minimal of life-cycles such as init, start, stop, and destroy. You may not need all these stages of life-cycles in every service you write, but you can simply ignore ones that don't apply. When writing large application that intended for long running such as a server component, definining these life-cycles and ensure they are excuted in proper order is crucial! I will be walking you through a Java demo project that I have prepared. It's very basic and it should run as stand-alone. The only dependency it has is the SLF4J logger. If you don't know how to use logger, then simply replace them with System.out.println. However I would strongly encourage you to learn how to use logger effectively during application development though. Also if you want to try out the Spring related demos, then obviously you would need their jars as well. Writing basic POJO service You can quickly define a contract of a Service with life-cycles as below in an interface. package servicedemo; public interface Service { void init(); void start(); void stop(); void destroy(); boolean isInited(); boolean isStarted(); } Developers are free to do what they want in their Service implementation, but you might want to give them an adapter class so that they don't have to re-write same basic logic on each Service. I would provide an abstract service like this: package servicedemo; import java.util.concurrent.atomic.*; import org.slf4j.*; public abstract class AbstractService implements Service { protected Logger logger = LoggerFactory.getLogger(getClass()); protected AtomicBoolean started = new AtomicBoolean(false); protected AtomicBoolean inited = new AtomicBoolean(false); public void init() { if (!inited.get()) { initService(); inited.set(true); logger.debug("{} initialized.", this); } } public void start() { // Init service if it has not done so. if (!inited.get()) { init(); } // Start service now. if (!started.get()) { startService(); started.set(true); logger.debug("{} started.", this); } } public void stop() { if (started.get()) { stopService(); started.set(false); logger.debug("{} stopped.", this); } } public void destroy() { // Stop service if it is still running. if (started.get()) { stop(); } // Destroy service now. if (inited.get()) { destroyService(); inited.set(false); logger.debug("{} destroyed.", this); } } public boolean isStarted() { return started.get(); } public boolean isInited() { return inited.get(); } @Override public String toString() { return getClass().getSimpleName() + "[id=" + System.identityHashCode(this) + "]"; } protected void initService() { } protected void startService() { } protected void stopService() { } protected void destroyService() { } } This abstract class provide the basic of most services needs. It has a logger and states to keep track of the life-cycles. It then delegate new sets of life-cycle methods so subclass can choose to override. Notice that the start() method is checking auto calling init() if it hasn't already done so. Same is done in destroy() method to the stop() method. This is important if we're to use it in a container that only have two stages life-cycles invocation. In this case, we can simply invoke start() and destroy() to match to our service's life-cycles. Some frameworks might go even further and create separate interfaces for each stage of the life-cycles, such as InitableService or StartableService etc. But I think that would be too much in a typical app. In most of the cases, you want something simple, so I like it just one interface. User may choose to ignore methods they don't want, or simply use an adaptor class. Before we end this section, I would throw in a silly Hello world service that can be used in our demo later. package servicedemo; public class HelloService extends AbstractService { public void initService() { logger.info(this + " inited."); } public void startService() { logger.info(this + " started."); } public void stopService() { logger.info(this + " stopped."); } public void destroyService() { logger.info(this + " destroyed."); } } Managing multiple POJO Services with a container Now we have the basic of Service definition defined, your development team may start writing business logic code! Before long, you will have a library of your own services to re-use. To be able group and control these services into an effetive way, we want also provide a container to manage them. The idea is that we typically want to control and manage multiple services with a container as a group in a higher level. Here is a simple implementation for you to get started: package servicedemo; import java.util.*; public class ServiceContainer extends AbstractService { private List services = new ArrayList(); public void setServices(List services) { this.services = services; } public void addService(Service service) { this.services.add(service); } public void initService() { logger.debug("Initializing " + this + " with " + services.size() + " services."); for (Service service : services) { logger.debug("Initializing " + service); service.init(); } logger.info(this + " inited."); } public void startService() { logger.debug("Starting " + this + " with " + services.size() + " services."); for (Service service : services) { logger.debug("Starting " + service); service.start(); } logger.info(this + " started."); } public void stopService() { int size = services.size(); logger.debug("Stopping " + this + " with " + size + " services in reverse order."); for (int i = size - 1; i >= 0; i--) { Service service = services.get(i); logger.debug("Stopping " + service); service.stop(); } logger.info(this + " stopped."); } public void destroyService() { int size = services.size(); logger.debug("Destroying " + this + " with " + size + " services in reverse order."); for (int i = size - 1; i >= 0; i--) { Service service = services.get(i); logger.debug("Destroying " + service); service.destroy(); } logger.info(this + " destroyed."); } } From above code, you will notice few important things: We extends the AbstractService, so a container is a service itself. We would invoke all service's life-cycles before moving to next. No services will start unless all others are inited. We should stop and destroy services in reverse order for most general use cases. The above container implementation is simple and run in synchronized fashion. This mean, you start container, then all services will start in order you added them. Stop should be same but in reverse order. I also hope you would able to see that there is plenty of room for you to improve this container as well. For example, you may add thread pool to control the execution of the services in asynchronized fashion. Running POJO Services Running services with a simple runner program. In the simplest form, we can run our POJO services on our own without any fancy server or frameworks. Java programs start its life from a static main method, so we surely can invoke init and start of our services in there. But we also need to address the stop and destroy life-cycles when user shuts down the program (usually by hitting CTRL+C.) For this, the Java has the java.lang.Runtime#addShutdownHook() facility. You can create a simple stand-alone server to bootstrap Service like this: package servicedemo; import org.slf4j.*; public class ServiceRunner { private static Logger logger = LoggerFactory.getLogger(ServiceRunner.class); public static void main(String[] args) { ServiceRunner main = new ServiceRunner(); main.run(args); } public void run(String[] args) { if (args.length < 1) throw new RuntimeException("Missing service class name as argument."); String serviceClassName = args[0]; try { logger.debug("Creating " + serviceClassName); Class serviceClass = Class.forName(serviceClassName); if (!Service.class.isAssignableFrom(serviceClass)) { throw new RuntimeException("Service class " + serviceClassName + " did not implements " + Service.class.getName()); } Object serviceObject = serviceClass.newInstance(); Service service = (Service)serviceObject; registerShutdownHook(service); logger.debug("Starting service " + service); service.init(); service.start(); logger.info(service + " started."); synchronized(this) { this.wait(); } } catch (Exception e) { throw new RuntimeException("Failed to create and run " + serviceClassName, e); } } private void registerShutdownHook(final Service service) { Runtime.getRuntime().addShutdownHook(new Thread() { public void run() { logger.debug("Stopping service " + service); service.stop(); service.destroy(); logger.info(service + " stopped."); } }); } } With abover runner, you should able to run it with this command: $ java demo.ServiceRunner servicedemo.HelloService Look carefully, and you'll see that you have many options to run multiple services with above runner. Let me highlight couple: Improve above runner directly and make all args for each new service class name, instead of just first element. Or write a MultiLoaderService that will load multiple services you want. You may control argument passing using System Properties. Can you think of other ways to improve this runner? Running services with Spring The Spring framework is an IoC container, and it's well known to be easy to work POJO, and Spring lets you wire your application together. This would be a perfect fit to use in our POJO services. However, with all the features Spring brings, it missed a easy to use, out of box main program to bootstrap spring config xml context files. But with what we built so far, this is actually an easy thing to do. Let's write one of our POJO Service to bootstrap a spring context file. package servicedemo; import org.springframework.context.ConfigurableApplicationContext; import org.springframework.context.support.FileSystemXmlApplicationContext; public class SpringService extends AbstractService { private ConfigurableApplicationContext springContext; public void startService() { String springConfig = System.getProperty("springContext", "spring.xml); springContext = new FileSystemXmlApplicationContext(springConfig); logger.info(this + " started."); } public void stopService() { springContext.close(); logger.info(this + " stopped."); } } With that simple SpringService you can run and load any spring xml file. For example try this: $ java -DspringContext=config/service-demo-spring.xml demo.ServiceRunner servicedemo.SpringService Inside the config/service-demo-spring.xml file, you can easily create our container that hosts one or more service in Spring beans. Notice that I only need to setup init-method and destroy-method once on the serviceContainer bean. You can then add one or more other service such as the helloService as much as you want. They will all be started, managed, and then shutdown when you close the Spring context. Note that Spring context container did not explicitly have the same life-cycles as our services. The Spring context will automatically instanciate all your dependency beans, and then invoke all beans who's init-method is set. All that is done inside the constructor of FileSystemXmlApplicationContext. No explicit init method is called from user. However at the end, during stop of the service, Spring provide the springContext#close() to clean things up. Again, they do not differentiate stop from destroy. Because of this, we must merge our init and start into Spring's init state, and then merge stop and destroy into Spring's close state. Recall our AbstractService#destory will auto invoke stop if it hasn't already done so. So this is trick that we need to understand in order to use Spring effectively. Running services with JEE app server In a corporate env, we usually do not have the freedom to run what we want as a stand-alone program. Instead they usually have some infrustructure and stricter standard technology stack in place already, such as using a JEE application server. In these situation, the most portable way to run POJO services is in a war web application. In a Servlet web application, you can write a class that implements javax.servlet.ServletContextListener and this will provide you the life-cycles hook via contextInitialized and contextDestroyed. In there, you can instanciate your ServiceContainer object and call start and destroy methods accordingly. Here is an example that you can explore: package servicedemo; import java.util.*; import javax.servlet.*; public class ServiceContainerListener implements ServletContextListener { private static Logger logger = LoggerFactory.getLogger(ServiceContainerListener.class); private ServiceContainer serviceContainer; public void contextInitialized(ServletContextEvent sce) { serviceContainer = new ServiceContainer(); List services = createServices(); serviceContainer.setServices(services); serviceContainer.start(); logger.info(serviceContainer + " started in web application."); } public void contextDestroyed(ServletContextEvent sce) { serviceContainer.destroy(); logger.info(serviceContainer + " destroyed in web application."); } private List createServices() { List result = new ArrayList(); // populate services here. return result; } } You may configure above in the WEB-INF/web.xml like this: servicedemo.ServiceContainerListener The demo provided a placeholder that you must add your services in code. But you can easily make that configurable using the web.xml for context parameters. If you were to use Spring inside a Servlet container, you may directly use their org.springframework.web.context.ContextLoaderListener class that does pretty much same as above, except they allow you to specify their xml configuration file using the contextConfigLocation context parameter. That's how a typical Spring MVC based application is configure. Once you have this setup, you can experiment our POJO service just as the Spring xml sample given above to test things out. You should see our service in action by your logger output. PS: Actually what we described here are simply related to Servlet web application, and not JEE specific. So you can use Tomcat server just fine as well. The importance of Service's life-cycles and it's real world usage All the information I presented here are not novelty, nor a killer design pattern. In fact they have been used in many popular open source projects. However, in my past experience at work, folks always manage to make these extremely complicated, and worse case is that they completely disregard the importance of life-cycles when writing services. It's true that not everything you going to write needs to be fitted into a service, but if you find the need, please do pay attention to them, and take good care that they do invoked properly. The last thing you want is to exit JVM without clean up in services that you allocated precious resources for. These would become more disastrous if you allow your application to be dynamically reloaded during deployment without exiting JVM, in which will lead to system resources leakage. The above Service practice has been put into use in the TimeMachine project. In fact, if you look at the timemachine.scheduler.service.SchedulerEngine, it would just be a container of many services running together. And that's how user can extend the scheduler functionalities as well, by writing a Service. You can load these services dynamically by a simple properties file.
September 4, 2012
by Zemian Deng
· 39,152 Views
article thumbnail
How to Migrate Drupal to Azure Web Sites
DrupalCon Munich is next week, and I am lucky enough to be going. As part of preparing for the conference, I thought it would be worthwhile to see just how easy (or difficult) it would be to migrate an existing Drupal site to Windows Azure Web Sites. So, in this post, I’ll do just that. Fortunately, because Windows Azure Web Sites supports both PHP and MySQL, the migration process is relatively straightforward. And, because Drupal and PHP run on any platform, the process I’ll describe should work for moving Drupal to Windows Azure Web Sites regardless of what platform you are moving from. Of course, Drupal installations can vary widely, so YMMV. I tested the instructions below on relatively small (and simple) Drupal installation running on CentOS 5. (Unfortunately, I won’t be using Drush since it isn’t supported on Windows Azure Websites.) If you are considering moving a large and complex Drupal application, may want to consider moving to Windows Azure Cloud Services (more information about that here: Migrating a Drupal Site from LAMP to Windows Azure). Before getting started, it’s worth noting that Windows Azure Websites lets you run up to 10 Web Sites for free in a multitenant environment. And, you can seamlessly upgrade to private, reserved VM instances as your traffic grows. To sign up, try the Windows Azure 90-day free trial. 1. Create a Windows Azure Web Site and MySQL database There is a step-by-step tutorial on http://www.windowsazure.com that walks you through creating a new website and a MySQL database, so I’ll refer you there to get started: Create a PHP-MySQL Windows Azure web site and deploy using Git. If you intend to use Git to publish your Drupal site, then go ahead and follow the instructions for setting up a Git repository. Make sure to follow the instructions in the Get remote MySQL connection information section as you will need that information later. You can ignore the remainder of the tutorial for the purposes of deploying your Drupal site, but if you are new to Windows Azure Web Sites (and to Git), you might find the additional reading informative. Ok, now you have a new website with a MySQL database, your have your MySQL database connection information, and you have (optionally) created a remote Git repository and made note of the Git deployment instructions. Now you are ready to copy your database to MySQL in Windows Azure Web Sites. 2. Copy database to MySQL in Windows Azure Web Sites I’m sure there is more than one way to copy your Drupal database, but I found the mysqldump tool to be effective and easy to use. To copy from a local machine to Windows Azure Web Sites, here’s the command I used: mysqldump -u local_username --password=local_password drupal | mysql -h remote_host -u remote_username --password=remote_password remote_db_name You will, of course, have to provide the username and password for your existing Drupal database, and you will have to provide the hostname, username, password, and database name for the MySQL database you created in step 1. This information is available in the connection string information that you should have noted in step 1. i.e. You should have a connection string that looks something like this: Database=remote_db_name;Data Source=remote_host;User Id=remote_username;Password=remote_password Depending on the size of your database, the copying process could take several minutes. Now your Drupal database is live in Windows Azure Websites. Before you deploy your Drupal code, you need to modify it so it can connect to the new database. 3. Modify database connection info in settings.php Here, you will again need your new database connection information. Open the /drupal/sites/default/setting.php file in your favorite text editor, and replace the values of ‘database’, ‘username’, ‘password’, and ‘host’ in the $databases array with the correct values for your new database. When you are finished, you should have something similar to this: $databases = array ( 'default' => array ( 'default' => array ( 'database' => 'remote_db_name', 'username' => 'remote_username', 'password' => 'remote_password', 'host' => 'remote_host', 'port' => '', 'driver' => 'mysql', 'prefix' => '', ), ), ); Be sure to save the settings.phpfile, then you are ready to deploy. 4. Deploy Drupal code using Git or FTP The last step is to deploy your code to Windows Azure Web Sites using Git or FTP. If you are using FTP, you can get the FTP hostname and username from you website’s dashboard. Then, use your favorite FTP client to upload your Drupal files to the /site/wwwroot folder of the remote site. If you are using Git, you need to set up a Git repository in Windows Azure Web Sites (steps for this are in the tutorial mentioned earlier). And, you will need Git installed on your local machine. Then, just follow the instructions provided after you created the repository: One note about using Git here: depending on your Git settings, your .gitignore file (a hidden file and a sibling to the .git folder created in your local root directory after you executed git commit), some files in your Drupal application may be ignored. In my case, all the files in the sites directory were ignored. If this happens, you will want to edit the .gitignore file so that these files aren’t ignored and redeploy. After you have deployed Drupal to Windows Azure Web Sites, you can continue to deploy updates via Git or FTP. Related information If you are looking for more information about Windows Azure Web Sites, these posts might be helpful: Windows Azure Websites- A PHP Perspective Windows Azure Websites, Web Roles, and VMs- When to use which- Configuring PHP in Windows Azure Websites with .user.ini Files One last thing you might consider, depending on your site, is using the Windows Azure Integration Module to store and serve your site’s media files.
August 19, 2012
by Brian Swan
· 10,224 Views
article thumbnail
How to Autoscale MySQL on Amazon EC2
Autoscaling your webserver tier is typically straightforward. Image your apache server with source code or without, then sync down files from S3 upon spinup. Roll that image into the autoscale configuration and you’re all set. With the database tier though, things can be a bit tricky. The typical configuration we see is to have a single master database where your application writes. But scaling out or horizontally on Amazon EC2 should be as easy as adding more slaves, right? Why not automate that process? Below we’ve set out to answer some of the questions you’re likely to face when setting up slaves against your master. We’ve included instructions on building an AMI that automatically spins up as a slave. Fancy! How can I autoscale my database tier? Build an auto-starting MySQL slave against your master. Configure those to spinup. Amazon’s autoscaling loadbalancer is one option, another is to use a roll-your-own solution, monitoring thresholds on servers, and spinning up or dropping off slaves as necessary. Does an AWS snapshot capture subvolume data or just the SIZE of the attached volume? In fact, if you have an attached EBS volume and you create an new AMI off of that, you will capture the entire root volume, plus your attached volume data. In fact we find this a great way to create an auto-building slave in the cloud. How do I freeze MySQL during AWS snapshot? mysql> flush tables with read lock;mysql> system xfs_freeze -f /data At this point you can use the Amazon web console, ylastic, or ec2-create-image API call to do so from the command line. When the server you are imaging off of above restarts – as it will do by default – it will start with /data partition unfrozen and mysql’s tables unlocked again. Voila! If you’re not using xfs for your /data filesystem, you should be. It’s fast! The xfsprogs docs seem to indicate this may also work with foreign filesystems. Check the docs for details. How do I build an AMI mysql slave that autoconnects to master? Install mysql_serverid script below. Configure mysql to use your /data EBS mount. Set all your my.cnf settings including server_id Configure the instance as a slave in the normal way. When using GRANT to create the ‘rep’ user on master, specify the host with a subnet wildcard. For example ’10.20.%’. That will subsequently allow any 10.20.x.y servers to connect and replicate. Point the slave at the master. When all is running properly, edit the my.cnf file and remove server_id. Don’t restart mysql. Freeze the filesystem as described above. Use the Amazon console, ylastic or API call to create your new image. Test it of course, to make sure it spins up, sets server_id and connects to master. Make a change in the test schema, and verify that it propagates to all slaves. How do I set server_id uniquely? As you hopefully already know, in MySQL replication environment each node requires a unique server_id setting. In my Amazon Machine Images, I want the server to startup and if it doesn’t find the server_id in the /etc/my.cnf file, to add it there, correctly! Is that so much to ask? Here’s what I did. Fire up your editor of choice and drop in this bit of code: #!/bin/shif grep -q “server_id” /etc/my.cnf then : # do nothing – it’s already set else # extract numeric component from hostname – should be internet IP in Amazon environment export server_id=`echo $HOSTNAME | sed ‘s/[^0-9]*//g’` echo “server_id=$server_id” >> /etc/my.cnf # restart mysql /etc/init.d/mysql restart fi Save that snippet at /root/mysql_serverid. Also be sure to make it executable: $ chmod +x /root/mysql_serverid Then just append it to your /etc/rc.local file with an editor or echo: $ echo "/root/mysql_serverid" >> /etc/rc.local Assuming your my.cnf file does *NOT* contain the server_id setting when you re-image, then it’ll set this automagically each time you spinup a new server off of that AMI. Nice! Can you easily slave off of a slave? How? It’s not terribly different from slaving off of a normal master. A. First enable slave updates. The setting is not dynamic, so if you don’t already have it set, you’ll have to restart your slave. log_slave_updates=true B. Get an initial snapshot of your slave data. You can do that the locking way: mysql> flush tables with read lock;mysql> show master status\G; mysql> system mysqldump -A > full_slave_dump.mysql mysql> unlock tables; You may also choose to use Percona’s excellent xtrabackup utility to create hotbackups without locking any tables. We are very lucky to have an open-source tool like this at our disposal. MySQL Enterprise Backup from Oracle Corp can also do this. C. On the slave, seed the database with your dump created above. $ mysql < full_slave_dump.mysql D. Now point your slave to the original slave. mysql> change master to master_user='rep', master_password='rep', master_host='192.168.0.1', master_log_file='server-bin-log.000004', master_log_pos=399;mysql> start slave; mysql> show slave status\G; Slave master is set as an IP address. Is there another way? It’s possible to use hostnames in MySQL replication, however it’s not recommended. Why? Because of the wacky world of DNS. Suffice it to say MySQL has to do a lot of work to resolve those names into IP addresses. A hickup in DNS can interrupt all MySQL services potentially as sessions will fail to authenticate. To avoid this problem do two things: A. Set this parameter in my.cnf skip_name_resolve = true Remove entries in mysql.user table where hostname is not an IP address. Those entries will be invalid for authentication after setting the above parameter. Doesn’t RDS take care of all of this for me? RDS is Amazon’s Relational Database Service which is built on MySQL. Amazon’s RDS solution presents MySQL as a service which brings certain benefits to administrators and startups: Simpler administration. Nuts and bolts are handled for you. Push-button replication. No more struggling with the nuances and issues of MySQL’s replication management. Simplicity of administration of course has it’s downsides. Depending on your environment, these may or may not be dealbreakers. No access to the slow query log. This is huge. The single best tool for troubleshooting slow database response is this log file. Queries are a large part of keeping a relational database server healthy and happy, and without this facility, you are severely limited. Locked in downtime window When you signup for RDS, you must define a thirty minute maintenance window. This is a weekly window during which your instance *COULD* be unavailable. When you host yourself, you may not require as much downtime at all, especially if you’re using master-master mysql and zero-downtime configuration. Can’t use Percona Server to host your MySQL data. You won’t be able to do this in RDS. Percona server is a high performance distribution of MySQL which typically rolls in serious performance tweaks and updates before they make it to community addition. Well worth the effort to consider it. No access to filesystem, server metrics & command line. Again for troubleshooting problems, these are crucial. Gathering data about what’s really happening on the server is how you begin to diagnose and troubleshoot a server stall or pileup. You are beholden to Amazon’s support services if things go awry. That’s because you won’t have access to the raw iron to diagnose and troubleshoot things yourself. Want to call in an outside consultant to help you debug or troubleshoot? You’ll have your hands tied without access to the underlying server. You can’t replicate to a non-RDS database. Have your own datacenter connected to Amazon via VPC? Want to replication to a cloud server? RDS won’t fit the bill. You’ll have to roll your own – as we’ve described above. And if you want to replicate to an alternate cloud provider, again RDS won’t work for you. Related posts: Deploying MySQL on Amazon EC2 – 8 Best Practices Review: Host Your Web Site In The Cloud, Amazon Web Services Made Easy 5 Ways to Boost MySQL Scalability Top MySQL DBA interview questions (Part 2) MySQL Cluster In The Cloud – Managers Guide
July 20, 2012
by Sean Hull
· 18,488 Views
article thumbnail
Working with MongoDB MultiMaster
Learn all about working with MondoDB multimaster.
July 11, 2012
by Rick Copeland
· 28,188 Views · 2 Likes
article thumbnail
Everything You Need To Know About Couchbase Architecture
After receiving a lot of good feedback and comment on my last blog on MongoDb, I was encouraged to do another deep dive on another popular document oriented db; Couchbase. I have been a long-time fan CouchDb and has wrote a blog on it many years ago. After it merges with Membase, I am very excited to take a deep look into it again. Couchbase is the merge of two popular NOSQL technologies: Membase, which provides persistence, replication, sharding to the high performance memcached technology CouchDB, which pioneers the document oriented model based on JSON Like other NOSQL technologies, both Membase and CouchDB are built from the ground up on a highly distributed architecture, with data shard across machines in a cluster. Built around the Memcached protocol, Membase provides an easy migration to existing Memcached users who want to add persistence, sharding and fault resilience on their familiar Memcached model. On the other hand, CouchDB provides first class support for storing JSON documents as well as a simple RESTful API to access them. Underneath, CouchDB also has a highly tuned storage engine that is optimized for both update transaction as well as query processing. Taking the best of both technologies, Membase is well-positioned in the NOSQL marketplace. Programming model Couchbase provides client libraries for different programming languages such as Java / .NET / PHP / Ruby / C / Python / Node.js For read, Couchbase provides a key-based lookup mechanism where the client is expected to provide the key, and only the server hosting the data (with that key) will be contacted. Couchbase also provides a query mechanism to retrieve data where the client provides a query (for example, range based on some secondary key) as well as the view (basically the index). The query will be broadcasted to all servers in the cluster and the result will be merged and sent back to the client. For write, Couchbase provides a key-based update mechanism where the client sends in an updated document with the key (as doc id). When handling write request, the server will return to client’s write request as soon as the data is stored in RAM on the active server, which offers the lowest latency for write requests. Following is the core API that Couchbase offers. (in an abstract sense) # Get a document by key doc = get(key) # Modify a document, notice the whole document # need to be passed in set(key, doc) # Modify a document when no one has modified it # since my last read casVersion = doc.getCas() cas(key, casVersion, changedDoc) # Create a new document, with an expiration time # after which the document will be deleted addIfNotExist(key, doc, timeToLive) # Delete a document delete(key) # When the value is an integer, increment the integer increment(key) # When the value is an integer, decrement the integer decrement(key) # When the value is an opaque byte array, append more # data into existing value append(key, newData) # Query the data results = query(viewName, queryParameters) In Couchbase, document is the unit of manipulation. Currently Couchbase doesn't support server-side execution of custom logic. Couchbase server is basically a passive store and unlike other document oriented DB, Couchbase doesn't support field-level modification. In case of modifying documents, client need to retrieve documents by its key, do the modification locally and then send back the whole (modified) document back to the server. This design tradeoff network bandwidth (since more data will be transferred across the network) for CPU (now CPU load shift to client). Couchbase currently doesn't support bulk modification based on a condition matching. Modification happens only in a per document basis. (client will save the modified document one at a time). Transaction Model Similar to many NOSQL databases, Couchbase’s transaction model is primitive as compared to RDBMS. Atomicity is guaranteed at a single document and transactions that span update of multiple documents are unsupported. To provide necessary isolation for concurrent access, Couchbase provides a CAS (compare and swap) mechanism which works as follows … When the client retrieves a document, a CAS ID (equivalent to a revision number) is attached to it. While the client is manipulating the retrieved document locally, another client may modify this document. When this happens, the CAS ID of the document at the server will be incremented. Now, when the original client submits its modification to the server, it can attach the original CAS ID in its request. The server will verify this ID with the actual ID in the server. If they differ, the document has been updated in between and the server will not apply the update. The original client will re-read the document (which now has a newer ID) and re-submit its modification. Couchbase also provides a locking mechanism for clients to coordinate their access to documents. Clients can request a LOCK on the document it intends to modify, update the documents and then releases the LOCK. To prevent a deadlock situation, each LOCK grant has a timeout so it will automatically be released after a period of time. Deployment Architecture In a typical setting, a Couchbase DB resides in a server clusters involving multiple machines. Client library will connect to the appropriate servers to access the data. Each machine contains a number of daemon processes which provides data access as well as management functions. The data server, written in C/C++, is responsible to handle get/set/delete request from client. The Management server, written in Erlang, is responsible to handle the query traffic from client, as well as manage the configuration and communicate with other member nodes in the cluster. Virtual Buckets The basic unit of data storage in Couchbase DB is a JSON document (or primitive data type such as int and byte array) which is associated with a key. The overall key space is partitioned into 1024 logical storage unit called "virtual buckets" (or vBucket). vBucket are distributed across machines within the cluster via a map that is shared among servers in the cluster as well as the client library. High availability is achieved through data replication at the vBucket level. Currently Couchbase supports one active vBucket zero or more standby replicas hosted in other machines. Curremtly the standby server are idle and not serving any client request. In future version of Couchbase, the standby replica will be able to serve read request. Load balancing in Couchbase is achieved as follows: Keys are uniformly distributed based on the hash function When machines are added and removed in the cluster. The administrator can request a redistribution of vBucket so that data are evenly spread across physical machines. Management Server Management server performs the management function and co-ordinate the other nodes within the cluster. It includes the following monitoring and administration functions Heartbeat: A watchdog process periodically communicates with all member nodes within the same cluster to provide Couchbase Server health updates. Process monitor: This subsystem monitors execution of the local data manager, restarting failed processes as required and provide status information to the heartbeat module. Configuration manager: Each Couchbase Server node shares a cluster-wide configuration which contains the member nodes within the cluster, a vBucket map. The configuration manager pull this config from other member nodes at bootup time. Within a cluster, one node’s Management Server will be elected as the leader which performs the following cluster-wide management function Controls the distribution of vBuckets among other nodes and initiate vBucket migration Orchestrates the failover and update the configuration manager of member nodes If the leader node crashes, a new leader will be elected from surviving members in the cluster. When a machine in the cluster has crashed, the leader will detect that and notify member machines in the cluster that all vBuckets hosted in the crashed machine is dead. After getting this signal, machines hosting the corresponding vBucket replica will set the vBucket status as “active”. The vBucket/server map is updated and eventually propagated to the client lib. Notice that at this moment, the replication level of the vBucket will be reduced. Couchbase doesn’t automatically re-create new replicas which will cause data copying traffic. Administrator can issue a command to explicitly initiate a data rebalancing. The crashed machine, after reboot can rejoin the cluster. At this moment, all the data it stores previously will be completely discard and the machine will be treated as a brand new empty machine. As more machines are put into the cluster (for scaling out), vBucket should be redistributed to achieve a load balance. This is currently triggered by an explicit command from the administrator. Once receive the “rebalance” command, the leader will compute the new provisional map which has the balanced distribution of vBuckets and send this provisional map to all members of the cluster. To compute the vBucket map and migration plan, the leader attempts the following objectives: Evenly distribute the number of active vBuckets and replica vBuckets among member nodes. Place the active copy and each replicas in physically separated nodes. Spread the replica vBucket as wide as possible among other member nodes. Minimize the amount of data migration Orchestrate the steps of replica redistribution so no node or network will be overwhelmed by the replica migration. Once the vBucket maps is determined, the leader will pass the redistribution map to each member in the cluster and coordinate the steps of vBucket migration. The actual data transfer happens directly between the origination node to the destination node. Notice that since we have generally more vBuckets than machines. The workload of migration will be evenly distributed automatically. For example, when new machines are added into the clusters, all existing machines will migrate some portion of its vBucket to the new machines. There is no single bottleneck in the cluster. Throughput the migration and redistribution of vBucket among servers, the life cycle of a vBucket in a server will be in one of the following states “Active”: means the server is hosting the vBucket is ready to handle both read and write request “Replica”: means the server is hosting the a copy of the vBucket that may be slightly out of date but can take read request that can tolerate some degree of outdate. “Pending”: means the server is hosting a copy that is in a critical transitional state. The server cannot take either read or write request at this moment. “Dead”: means the server is no longer responsible for the vBucket and will not take either read or write request anymore. Data Server Data server implements the memcached APIs such as get, set, delete, append, prepend, etc. It contains the following key datastructure: One in-memory hashtable (key by doc id) for the corresponding vBucket hosted. The hashtable acts as both a metadata for all documents as well as a cache for the document content. Maintain the entry gives a quick way to detect whether the document exists on disk. To support async write, there is a checkpoint linkedlist per vBucket holding the doc id of modified documents that hasn't been flushed to disk or replicated to the replica. To handle a "GET" request Data server routes the request to the corresponding ep-engine responsible for the vBucket. The ep-engine will lookup the document id from the in-memory hastable. If the document content is found in cache (stored in the value of the hashtable), it will be returned. Otherwise, a background disk fetch task will be created and queued into the RO dispatcher queue. The RO dispatcher then reads the value from the underlying storage engine and populates the corresponding entry in the vbucket hash table. Finally, the notification thread notifies the disk fetch completion to the memcached pending connection, so that the memcached worker thread can revisit the engine to process a get request. To handle a "SET" request, a success response will be returned to the calling client once the updated document has been put into the in-memory hashtable with a write request put into the checkpoint buffer. Later on the Flusher thread will pickup the outstanding write request from each checkpoint buffer, lookup the corresponding document content from the hashtable and write it out to the storage engine. Of course, data can be lost if the server crashes before the data has been replicated to another server and/or persisted. If the client requires a high data availability across different crashes, it can issue a subsequent observe() call which blocks on the condition that the server persist data on disk, or the server has replicated the data to another server (and get its ACK). Overall speaking, the client has various options to tradeoff data integrity with throughput. Hashtable Management To synchronize accesses to a vbucket hash table, each incoming thread needs to acquire a lock before accessing a key region of the hash table. There are multiple locks per vbucket hash table, each of which is responsible for controlling exclusive accesses to a certain ket region on that hash table. The number of regions of a hash table can grow dynamically as more documents are inserted into the hash table. To control the memory size of the hashtable, Item pager thread will monitor the memory utilization of the hashtable. Once a high watermark is reached, it will initiate an eviction process to remove certain document content from the hashtable. Only entries that is not referenced by entries in the checkpoint buffer can be evicted because otherwise the outstanding update (which only exists in hashtable but not persisted) will be lost. After eviction, the entry of the document still remains in the hashtable; only the document content of the document will be removed from memory but the metadata is still there. The eviction process stops after reaching the low watermark. The high / low water mark is determined by the bucket memory quota. By default, the high water mark is set to 75% of bucket quota, while the low water mark is set to 60% of bucket quota. These water marks can be configurable at runtime. In CouchDb, every document is associated with an expiration time and will be deleted once it is expired. Expiry pager is responsible for tracking and removing expired document from both the hashtable as well as the storage engine (by scheduling a delete operation). Checkpoint Manager Checkpoint manager is responsible to recycle the checkpoint buffer, which holds the outstanding update request, consumed by the two downstream processes, Flusher and TAP replicator. When all the request in the checkpoint buffer has been processed, the checkpoint buffer will be deleted and a new one will be created. TAP Replicator TAP replicator is responsible to handle vBucket migration as well as vBucket replication from active server to replica server. It does this by propagating the latest modified document to the corresponding replica server. At the time a replica vBucket is established, the entire vBucket need to be copied from the active server to the empty destination replica server as follows The in-memory hashtable at the active server will be transferred to the replica server. Notice that during this period, some data may be updated and therefore the data set transfered to the replica can be inconsistent (some are the latest and some are outdated). Nevertheless, all updates happen after the start of transfer is tracked in the checkpoint buffer. Therefore, after the in-memory hashtable transferred is completed, the TAP replicator can pickup those updates from the checkpoint buffer. This ensures the latest versioned of changed documents are sent to the replica, and hence fix the inconsistency. However the hashtable cache doesn’t contain all the document content. Data also need to be read from the vBucket file and send to the replica. Notice that during this period, update of vBucket will happen in active server. However, since the file is appended only, subsequent data update won’t interfere the vBucket copying process. After the replica server has caught up, subsequent update at the active server will be available at its checkpoint buffer which will be pickup by the TAP replicator and send to the replica server. CouchDB Storage Structure Data server defines an interface where different storage structure can be plugged-in. Currently it supports both a SQLite DB as well as CouchDB. Here we describe the details of CouchDb, which provides a super high performance storage mechanism underneath the Couchbase technology. Under the CouchDB structure, there will be one file per vBucket. Data are written to this file in an append-only manner, which enables Couchbase to do mostly sequential writes for update, and provide the most optimized access patterns for disk I/O. This unique storage structure attributes to Couchbase’s fast on-disk performance for write-intensive applications. The following diagram illustrate the storage model and how it is modified by 3 batch updates (notice that since updates are asynchronous, it is perform by "Flusher" thread in batches). The Flusher thread works as follows: 1) Pick up all pending write request from the dirty queue and de-duplicate multiple update request to the same document. 2) Sort each request (by key) into corresponding vBucket and open the corresponding file 3) Append the following into the vBucket file (in the following contiguous sequence) All document contents in such write request batch. Each document will be written as [length, crc, content] one after one sequentially. The index that stores the mapping from document id to the document’s position on disk (called the BTree by-id) The index that stores the mapping from update sequence number to the document’s position on disk. (called the BTree by-seq) The by-id index plays an important role for looking up the document by its id. It is organized as a B-Tree where each node contains a key range. To lookup a document by id, we just need to start from the header (which is the end of the file), transfer to the root BTree node of the by-id index, and then further traverse to the leaf BTree node that contains the pointer to the actual document position on disk. During the write, the similar mechanism is used to trace back to the corresponding BTree node that contains the id of the modified documents. Notice that in the append-only model, update is not happening in-place, instead we located the existing location and copy it over by appending. In other words, the modified BTree node will be need to be copied over and modified and finally paste to the end of file, and then its parent need to be modified to point to the new location, which triggers the parents to be copied over and paste to the end of file. Same happens to its parents’ parent and eventually all the way to the root node of the BTree. The disk seek can be at the O(logN) complexity. The by-seq index is used to keep track of the update sequence of lived documents and is used for asynchronous catchup purposes. When a document is created, modified or deleted, a sequence number is added to the by-seq btree and the previous seq node will be deleted. Therefore, for cross-site replication, view index update and compaction, we can quickly locate all the lived documents in the order of their update sequence. When a vBucket replicator asks for the list of update since a particular time, it provides the last sequence number in previous update, the system will then scan through the by-seq BTree node to locate all the document that has sequence number larger than that, which effectively includes all the document that has been modified since the last replication. As time goes by, certain data becomes garbage (see the grey-out region above) and become unreachable in the file. Therefore, we need a garbage collection mechanism to clean up the garbage. To trigger this process, the by-id and by-seq B-Tree node will keep track of the data size of lived documents (those that is not garbage) under its substree. Therefore, by examining the root BTree node, we can determine the size of all lived documents within the vBucket. When the ratio of actual size and vBucket file size fall below a certain threshold, a compaction process will be triggered whose job is to open the vBucket file and copy the survived data to another file. Technically, the compaction process opens the file and read the by-seq BTree at the end of the file. It traces the Btree all the way to the leaf node and copy the corresponding document content to the new file. The compaction process happens while the vBucket is being updated. However, since the file is appended only, new changes are recorded after the BTree root that the compaction has opened, so subsequent data update won’t interfere with the compaction process. When the compaction is completed, the system need to copy over the data that was appended since the beginning of the compaction to the new file. View Index Structure Unlike most indexing structure which provide a pointer from the search attribute back to the document. The CouchDb index (called View Index) is better perceived as a denormalized table with arbitrary keys and values loosely associated to the document. Such denormalized table is defined by a user-provided map() and reduce() function. map = function(doc) { … emit(k1, v1) … emit(k2, v2) … } reduce = function(keys, values, isRereduce) { if (isRereduce) { // Do the re-reduce only on values (keys will be null) } else { // Do the reduce on keys and values } // result must be ready for input values to re-reduce return result } Whenever a document is created, updated, deleted, the corresponding map(doc) function will be invoked (in an asynchronous manner) to generate a set of key/value pairs. Such key/value will be stored in a B-Tree structure. All the key/values pairs of each B-Tree node will be passed into the reduce() function, which compute an aggregated value within that B-Tree node. Re-reduce also happens in non-leaf B-Tree nodes which further aggregate the aggregated value of child B-Tree nodes. The management server maintains the view index and persisted it to a separate file. Create a view index is perform by broadcast the index creation request to all machines in the cluster. The management process of each machine will read its active vBucket file and feed each surviving document to the Map function. The key/value pairs emitted by the Map function will be stored in a separated BTree index file. When writing out the BTree node, the reduce() function will be called with the list of all values in the tree node. Its return result represent a partially reduced value is attached to the BTree node. The view index will be updated incrementally as documents are subsequently getting into the system. Periodically, the management process will open the vBucket file and scan all documents since the last sequence number. For each changed document since the last sync, it invokes the corresponding map function to determine the corresponding key/value into the BTree node. The BTree node will be split if appropriate. Underlying, Couchbase use a back index to keep track of the document with the keys that it previously emitted. Later when the document is deleted, it can look up the back index to determine what those key are and remove them. In case the document is updated, the back index can also be examined; semantically a modification is equivalent to a delete followed by an insert. The following diagram illustrates how the view index file will be incrementally updated via the append-only mechanism. Query Processing Query in Couchbase is made against the view index. A query is composed of the view name, a start key and end key. If the reduce() function isn’t defined, the query result will be the list of values sorted by the keys within the key range. In case the reduce() function is defined, the query result will be a single aggregated value of all keys within the key range. If the view has no reduce() function defined, the query processing proceeds as follows: Client issue a query (with view, start/end key) to the management process of any server (unlike a key based lookup, there is no need to locate a specific server). The management process will broadcast the request to other management process on all servers (include itself) within the cluster. Each management process (after receiving the broadcast request) do a local search for value within the key range by traversing the BTree node of its view file, and start sending back the result (automatically sorted by the key) to the initial server. The initial server will merge the sorted result and stream them back to the client. However, if the view has reduce() function defined, the query processing will involve computing a single aggregated value as follows: Client issue a query (with view, start/end key) to the management process of any server (unlike a key based lookup, there is no need to locate a specific server). The management process will broadcast the request to other management process on all servers (include itself) within the cluster. Each management process do a local reduce for value within the key range by traversing the BTree node of its view file to compute the reduce value of the key range. If the key range span across a BTree node, the pre-computed of the sub-range can be used. This way, the reduce function can reuse a lot of partially reduced values and doesn’t need to recomputed every value of the key range from scratch. The original server will do a final re-reduce() in all the return value from each other servers, and then passed back the final reduced value to the client. To illustrate the re-reduce concept, lets say the query has its key range from A to F. Instead of calling reduce([A,B,C,D,E,F]), the system recognize the BTree node that contains [B,C,D] has been pre-reduced and the result P is stored in the BTree node, so it only need to call reduce(A,P,E,F). Update View Index as vBucket migrates Since the view index is synchronized with the vBuckets in the same server, when the vBucket has migrated to a different server, the view index is no longer correct; those key/value that belong to a migrated vBucket should be discarded and the reduce value cannot be used anymore. To keep track of the vBucket and key in the view index, each bTree node has a 1024-bitmask indicating all the vBuckets that is covered in the subtree (ie: it contains a key emitted from a document belonging to the vBucket). Such bit-mask is maintained whenever the bTree node is updated. At the server-level, a global bitmask is used to indicate all the vBuckets that this server is responsible for. In processing the query of the map-only view, before the key/value pair is returned, an extra check will be perform for each key/value pair to make sure its associated vBucket is what this server is responsible for. When processing the query of a view that has a reduce() function, we cannot use the pre-computed reduce value if the bTree node contains a vBucket that the server is not responsible for. In this case, the bTree node’s bit mask is compared with the global bit mask. In case if they are not aligned, then the reduce value need to be recomputed. Here is an example to illustrate this process Couchbase is one of the popular NOSQL technology built on a solid technology foundation designed for high performance. In this post, we have examined a number of such key features: Load balancing between servers inside a cluster that can grow and shrink according to workload conditions. Data migration can be used to re-achieve workload balance. Asynchronous write provides lowest possible latency to client as it returns once the data is store in memory. Append-only update model pushes most update transaction into sequential disk access, hence provide extremely high throughput for write intensive applications. Automatic compaction ensures the data lay out on disk are kept optimized all the time. Map function can be used to pre-compute view index to enable query access. Summary data can be pre-aggregated using the reduce function. Overall, this cut down the workload of query processing dramatically. For a review on NOSQL architecture in general and some theoretical foundation, I have wrote a NOSQL design pattern blog, as well as some fundamental difference between SQL and NOSQL. For other NOSQL technologies, please read my other blog on MongoDb, Cassandra and HBase, Memcached Special thanks to Damien Katz and Frank Weigel from Couchbase team who provide a lot of implementation details of Couchbase.
July 7, 2012
by Ricky Ho
· 84,695 Views · 5 Likes
article thumbnail
Current Challenges of Moving Apps to the Cloud, and How to Anticpate Them
In my last post, I discussed some of the key considerations when moving an application to the cloud. To provide a better understanding, I’m using a simple scenario-based example to illustrate how an application could be moved to the cloud. This article will explain the challenges a company might face, the current architecture of the example application, and finally what the company should expect when moving an application to the cloud. My next article will discuss the recommended solution in more detail. Disclaimer Company name, logo, business, scenario, and incidents either are used fictitiously. Any resemblance to an actual company is entirely coincidental. Background Idelma is a ticket selling provider that sells tickets to concerts, sports event, and music gigs. Tickets are sold offline through ticket counters and online through a website called TicketOnline. Customers visiting TicketOnline can browse list of available shows, find out more information on each show, and finally purchase tickets online. When a ticket is purchased, it’s reserved but will not be processed immediately. Other processes such as generating ticket and sending the generated ticket along with the receipt will be done asynchronously in a few minutes time. Current Challenges During peak season (typically in July and December), TicketOnline suffered from heavy traffic that caused slow response time. The traffic for off-peak season is normally about 100,000 to 200,000 hits per day, with the average of 8 to 15 on-going shows. In peak season, the traffic may reach five to seven times more than off-peak season. The following diagram illustrates the web server hits counter of TicketOnline over the last three years. Figure 1 – TicketOnline web server hits counter for the last three years Additionally, the current infrastructure setup is not designed to be highly-available. This results in several periods of downtime each year. The options: on-premise vs cloud Idelma’s IT Manager Mr. Anthony recognizes the issues and decides to make some improvement to bring better competitive advantages to the company. When reading an article online, he discovered that cloud computing may be a good solution to address the issues. Another option would be to purchase a more powerful set of hardware that could handle the load. With that, he has done a pros and cons analysis of the two options: On-premise hardware investment There are at least two advantages of investing in more hardware. One, they will have full control over the infrastructure, and can use the server for other purposes when necessary. Second, there might be less or no modification needed on the application at all, depending on how it is architected and designed. If they decide to scale up (vertically), they might not need to make any changes. However, if they decide to scale out (horizontally) to a web farm model, a re-design would be needed. On the other hand, there are also several disadvantages of on-premise hardware investment. For sure, upfront investment in purchasing hardware and software are considered relatively expensive. Next, they would need to be able to answer the following questions: How much hardware and software should be purchased? What are the hardware specifications? If the capacity planning is not properly done, it may lead to either a waste of capacity or insufficient of capacity. Another concern is, when adding more hardware, more manpower might be needed as well. Cloud For cloud computing, there’s almost no upfront investment required for hardware, and in some cases software doesn’t pose a large upfront cost either. Another advantage is the cloud’s elastic nature fits TicketOnline periodic bursting very much. Remember, they face high load only in June and December. Another advantage would be less responsibility. The administrator can have more time to focus on managing the application since the infrastructure is managed by the provider. Though there are a number of advantages, there are also some disadvantages when choosing a cloud platform. For one thing, they might have less control over the infrastructure. As discussed in the previous article, there might also be some architectural changes when moving an application to the cloud. However, these can be dealt with in a one-time effort. The figure below summarizes the considerations between the two options: Figure 2 – Considerations of an On-premise or Cloud solution After looking at his analysis, Mr. Anthony believes that the cloud will bring more competitive advantages to the company. Understanding that Windows Azure offers various services for building internet-scale application, and Idelma is also an existing Microsoft customer, Mr. Anthony decided to explore Windows Azure. After evaluating the pricing, he is even more comfortable to step ahead. Quick preview of the current system Now, let’s take a look of the current architecture of TicketOnline. Figure 3 – TicketOnline Current Architecture TicketOnline web application The web application is hosted on a single instance physical server. It is running on Windows Server 2003 R2 as operating system with Internet Information Services (IIS) 6 as the web server and ASP.NET 2.0 as the web application framework. Database SQL Server 2005 is used as database engine to store mainly relational data for the application. Additionally, it is also used to store logs such as trace logs, performance-counters logs, and IIS logs. File server Unstructured files such as images and documents are stored separately in a file server. Interfacing with another system The application would need to interface with a proprietary CRM system that runs on a dedicated server to retrieve customer profiles through asmx web service. Batch Job As mentioned previously, receipt and ticket generation will happen asynchronously after purchasing is made. A scheduler-based batch job will perform asynchronous tasks every 10 minutes. The tasks include verifying booking details, generating tickets, and sending the ticket along with the receipt as an email to customer. The intention of an asynchronous process is to minimize concurrent access load as much as possible. This batch job is implemented as a Windows Service installed in a separated server. SMTP Server On-premise SMTP Server will be used to send email, initiated either from the batch job engine or the web application. Requirements for migration The application should be migrated to the cloud with the following requirements: The customer expects a cost effective solution in terms of the migration effort as well as the monthly running cost. There aren’t any functional changes on the system. Meaning, the user (especially front-end user) should not see any differences in term of functionality. As per policy, this propriety CRM system will not be moved to the cloud. The web service consumption should be consumed in secured manner. Calling for partners As the in-house IT team does not have competency and experience with Windows Azure, Mr. Anthony contacted Microsoft to suggest a partner who is capable to deliver the migration. Before a formal request for proposal (RFP) is made, he expects partner to provide the following: High-level architecture diagram how the system will look when moving to the cloud. Explanation of each component illustrated on the diagram. The migration processes, effort required, and potential challenges. If Microsoft recommends you as the partner, how will you handle this case? What will the architecture look like in your proposed solution? The most exciting part will come in the next article when I go into more detail on which solution is recommended and how the migration process takes place.
July 5, 2012
by Wely Lau
· 6,834 Views · 1 Like
  • Previous
  • ...
  • 275
  • 276
  • 277
  • 278
  • 279
  • 280
  • 281
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×