Cloud Architecture Resources

The Latest Cloud Architecture Topics

One piece of Docker that is AMAZING is the Remote API, which can be used to programmatically interact with docker. I recently had a situation where I wanted to run many containers on a host with a single container managing the other containers through the API. But the problem I soon discovered is that at the moment when you turn networking on it is an all or nothing type of thing… you can’t turn networking off selectively on a container by container basis. You can disable IPv4 forwarding, but you can still reach the docker remote API on the machine if you can guess the IP address of it.

One solution I came up with for this is to use nginx to expose the unix socket for docker over HTTPS and utilize client-side ssl certificates to only allow trusted containers to have access. I liked this setup a lot so I thought I would share how it’s done.

Disclaimer: assumes some knowledge of docker!

Generate The SSL Certificates

We’ll use openssl to generate and self-sign the certs. Since this is for an internal service we’ll just sign it ourselves. We also remove the password from the keys so that we aren’t prompted for it each time we start nginx.

    # Create the CA Key and Certificate for signing Client Certs
    openssl genrsa -des3 -out ca.key 4096
    openssl rsa -in ca.key -out ca.key # remove password!
    openssl req -new -x509 -days 365 -key ca.key -out ca.crt

    # Create the Server Key, CSR, and Certificate
    openssl genrsa -des3 -out server.key 1024
    openssl rsa -in server.key -out server.key # remove password!
    openssl req -new -key server.key -out server.csr

    # We're self signing our own server cert here. This is a no-no in production.
    openssl x509 -req -days 365 -in server.csr -CA ca.crt -CAkey ca.key -set_serial 01 -out server.crt

    # Create the Client Key and CSR
    openssl genrsa -des3 -out client.key 1024
    openssl rsa -in client.key -out client.key # no password!
    openssl req -new -key client.key -out client.csr

    # Sign the client certificate with our CA cert.
    # Unlike signing our own server cert, this is what we want to do.
    openssl x509 -req -days 365 -in client.csr -CA ca.crt -CAkey ca.key -set_serial 01 -out client.crt

Another option may be to leave the passphrase in and provide it as an environment variable when running a docker container, or through some other means, as an extra layer of security.

We’ll move ca.crt, server.key and server.crt to /etc/nginx/certs.

Setup Nginx

The nginx setup for this is pretty straightforward. We just listen for traffic on localhost on port 4242. We require client-side ssl certificate validation and reference the certificates we generated in the previous step. And most important of all, we set up an upstream proxy to the docker unix socket. I simply overwrote what was already in /etc/nginx/sites-enabled/default.

    upstream docker {
      server unix:/var/run/docker.sock fail_timeout=0;
    }

    server {
      listen 4242;
      server_name localhost;

      ssl on;
      ssl_certificate /etc/nginx/certs/server.crt;
      ssl_certificate_key /etc/nginx/certs/server.key;
      ssl_client_certificate /etc/nginx/certs/ca.crt;
      ssl_verify_client on;

      access_log on;
      error_log /dev/null;

      location / {
        proxy_pass http://docker;
        proxy_redirect off;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        client_max_body_size 10m;
        client_body_buffer_size 128k;
        proxy_connect_timeout 90;
        proxy_send_timeout 120;
        proxy_read_timeout 120;
        proxy_buffer_size 4k;
        proxy_buffers 4 32k;
        proxy_busy_buffers_size 64k;
        proxy_temp_file_write_size 64k;
      }
    }

One important piece to make this work is that you should add the user nginx runs as to the docker group so that it can read from the socket. This could be www-data, nginx, or something else!

Hack It Up!

With this setup and nginx restarted, let’s first run a few curl commands to make sure that this is set up correctly. First we’ll make calls without the client cert to double check that we get denied access, then a proper one.

    # Is normal http traffic denied?
    curl -v http://localhost:4242/info

    # How about https, sans client cert and key?
    curl -v -s -k https://localhost:4242/info

    # And the final good request!
    curl -v -s -k --key client.key --cert client.crt https://localhost:4242/info

For the first two we should get some run of the mill 400 http response codes before we get a proper JSON response from the final command! Woot!

But wait there’s more… let’s build a container that can call the service to launch other containers!

For this example we’ll simply build two containers: one that has the client certificate and key and one that doesn’t. The code for these examples is pretty straightforward and to save space I’ll leave the untrusted container out. You can view the untrusted container on github (although it is nothing exciting).

First, the node.js application (written in CoffeeScript) that will connect and display information:

    https = require 'https'
    fs = require 'fs'

    options =
      host: '172.42.1.62'
      port: 4242
      method: 'GET'
      path: '/containers/json'
      key: fs.readFileSync('ssl/client.key')
      cert: fs.readFileSync('ssl/client.crt')
      headers: { 'Accept': 'application/json' } # not required, but being semantic here!

    req = https.request options, (res) ->
      console.log res

    req.end()

And the Dockerfile used to build the container. Notice we add the client.crt and client.key as part of building it!

    FROM shykes/nodejs
    MAINTAINER James R. Carr
    ADD ssl/client* /srv/app/ssl
    ADD package.json /srv/app/package.json
    ADD app.coffee /srv/app/app.coffee
    RUN cd /srv/app && npm install .
    CMD cd /srv/app && npm start

That’s about it. Run docker build . and docker run -n <IMAGE ID> and we should see a json dump to the console of the actively running containers. Doing the same in the untrusted directory should present us with some 400 error about not providing a client ssl certificate.

I’ve shared a project with all this code plus a vagrant file on github for your own perusal. Enjoy!

October 31, 2013

by James Carr

· 14,313 Views

JMS-style selectors on Amazon SQS with Apache Camel

This blog post demonstrates how easy it is to use Apache Camel and its new json-path component along with the camel-sqs component to produce and consume messages on Amazon SQS.

Amazon Web Services SQS is a message queuing “software as a service” (SaaS) in the cloud. To be able to use it, you need to sign up for AWS. Its primary access mechanism is XML over HTTP through various AWS SDK clients provided by Amazon. Please check out the SQS documentation for more.

And as “luck” would have it, one of the users in the Apache Camel community created a component to be able to integrate with SQS. This makes it trivial to add a producer or consumer to an SQS queue and plugs in nicely with the Camel DSL.

SQS, however, is not a “one-size fits all” queueing service; you must be aware of your use case and make sure it fits (current requirements as well as somewhat into the future…). There are limitations that, if not studied and accounted for ahead of time, could come back to sink your project. An example of a viable alternative, and one that more closely fits the profile of a high performance and full featured message queue, is Apache ActiveMQ.

For example, one limitation to keep in mind is that unlike traditional JMS consumers, you cannot create a subscription to a queue that filters messages based on some predicate (at least not using the AWS-SQS API; you’d have to build that into your solution).

Some other things to keep in mind when using SQS:

The queue does not preserve FIFO messaging
That is, message order is not preserved. Messages can arrive in a different order from the one in which they were sent. Apache Camel can help with its resequencer pattern. Bilgin Ibryam, now a colleague of mine at Red Hat, has written a great blog post about how to restore message order using the resequencer pattern.

Message size is limited to 256K
This is probably sufficient, but if your message sizes are variable, or contain more data than 256K, you will have to split them and send them in smaller chunks.

No selector or selective consumption
If you’re familiar with JMS, you know that you can specify consumers to use a “selector”, a predicate expression that is evaluated on the broker side to determine whether or not a specific message should be dispatched to a specific consumer.

Durability constraints
Some use cases call for the message broker to store messages until consumers return. SQS allows a limit of up to 14 days. This is most likely sufficient, but something to keep in mind.

Binary payloads not allowed
SQS only allows text-based messages, e.g., XML, JSON, fixed format text, etc. Binary formats such as Avro, Protocol Buffers, or Thrift are not allowed.

For some of these limitations, you can work around them by building out the functionality yourself. I would always recommend taking a look at how an integration library like Apache Camel can help, since it has out-of-the-box support for doing some of these things.

Doing JMS-style selectors

So the basic problem is we want to subscribe to an SQS queue, but we want to filter which messages we process. Those messages that we do not process should be left in the queue. To do this, we will make use of Apache Camel’s Filter EIP as well as the visibility timeouts available on the SQS queue.

By default, SQS will dispatch all messages in its queue when it’s queried. We cannot change this, and thus cannot avoid the message being dispatched to us; we’ll have to do the filtering on our side (this is different from how a full-featured broker like ActiveMQ does it, i.e., filtering is done on the broker side so the consumer doesn’t even see the message it does not want to see).

Once SQS dispatches a message, it does not remove it from the queue unless the consumer has acknowledged that it has it and is finished with it. The consumer does this by sending a DeleteMessage command. Until the DeleteMessage command is sent, the message is always in the queue; however, visibility comes into play here.
When a message is dispatched to a consumer, there is a period of time during which it will not be visible to other consumers. So if you browsed the queue, you would not see it (it should appear in the stats as “in-flight”). The length of this “visibility timeout” is configurable. So if you set the visibility timeout to a lower time period (default is 30 seconds), you can more quickly get messages re-dispatched to consumers that would be able to handle the message.

Take a look at the following Camel route which does just that:

    @Override
    public void configure() throws Exception {
        // every five seconds, send a message to the "demo" queue in SQS
        from("timer:kickoff?period=5000")
            .setBody().method(this, "generateJsonString")
            .to("aws-sqs://demo?amazonSQSClient=#sqsClient&defaultVisibilityTimeout=2");
    }

In the above Camel Route, we create a new message every 5 seconds and send it to an SQS queue named demo. Note that we set the defaultVisibilityTimeout to 2 seconds. This means that after a message gets dispatched to a consumer, SQS will wait about 2 seconds before considering it eligible to be dispatched to another consumer if it has not been deleted.

On the consumer side, we take advantage of a couple of Apache Camel conveniences.

Using JSON Path + Filter EIP

Camel has an excellent new component named JSON-Path. Claus Ibsen tweeted about it when he hacked it up. This allows you to do Content-Based Routing on a JSON payload very easily by using XPath-style expressions to pick out and evaluate attributes in a JSON encoded object.
So in the following example, we can test an attribute named ‘type’ to be equal to ‘login’ and use Camel’s Filter EIP to allow only those messages that match to go through and continue processing:

    public class ConsumerRouteBuilder extends RouteBuilder {
        @Override
        public void configure() throws Exception {
            from("aws-sqs://demo?amazonSQSClient=#sqsClient&deleteIfFiltered=false")
                .setHeader("identity").jsonpath("$['type']")
                .filter(simple("${header.identity} == 'login'"))
                .log("We have a message! ${body}")
                .to("file:target/output?fileName=login-message-${date:now:MMddyy-HHmmss}.json");
        }
    }

To complete the functionality, we have to pay attention to a new configuration option added for the Camel-SQS component:

deleteIfFiltered: whether or not to send the DeleteMessage to the SQS queue if an exchange fails to get through a filter. If ‘false’ and the exchange does not make it through a Camel filter upstream in the route, then DeleteMessage is not sent.

By default, Camel will send the “DeleteMessage” command to SQS after a route has completed successfully (without an exception). However, in this case, we are specifying not to send the DeleteMessage command if the message had been previously filtered by Camel.

This example demonstrates how easy it is to use Apache Camel and its new json-path component along with the camel-sqs component to produce and consume messages on Amazon SQS. Please take a look at the source code on my github repo to play with the live code and try it out yourself.
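To make the mechanics concrete without an AWS account, here is a minimal plain-Java sketch of selective consumption on top of visibility timeouts. It uses neither Camel nor the AWS SDK: the `VisibilityQueue` class and all of its method names are hypothetical stand-ins for SQS behaviour, and the clock is passed in explicitly so the example stays deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical in-memory stand-in for an SQS queue; this is not the AWS API. */
final class VisibilityQueue {
    private static final class Entry {
        final String body;
        long invisibleUntil; // clock value until which the message is hidden
        Entry(String body) { this.body = body; }
    }

    private final List<Entry> entries = new ArrayList<>();
    private final long visibilityTimeoutMillis;

    VisibilityQueue(long visibilityTimeoutMillis) {
        this.visibilityTimeoutMillis = visibilityTimeoutMillis;
    }

    void send(String body) { entries.add(new Entry(body)); }

    /** Hands out the first visible message and hides it for the timeout. */
    Optional<String> receive(long now) {
        for (Entry e : entries) {
            if (e.invisibleUntil <= now) {
                e.invisibleUntil = now + visibilityTimeoutMillis;
                return Optional.of(e.body);
            }
        }
        return Optional.empty();
    }

    /** Plays the role of DeleteMessage (simplified: removes by body text). */
    void delete(String body) { entries.removeIf(e -> e.body.equals(body)); }

    int size() { return entries.size(); }

    /**
     * Consumer-side "selector": delete only the messages that match and leave
     * the rest in the queue, where they become visible again after the timeout.
     */
    List<String> consumeMatching(Predicate<String> selector, long now) {
        List<String> accepted = new ArrayList<>();
        Optional<String> next;
        while ((next = receive(now)).isPresent()) {
            String body = next.get();
            if (selector.test(body)) {
                accepted.add(body);
                delete(body);
            }
        }
        return accepted;
    }
}
```

With a 2000 ms timeout, a consumer that only accepts login messages at time 0 leaves a logout message behind; that message stays hidden at time 1000 and can be received again at time 2000. This is why a short visibility timeout matters when filtering on the consumer side.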

October 28, 2013

by Christian Posta

· 12,145 Views

Examples of the Windows Azure Storage Services REST API

The examples in this post were updated in September to work with the current version of the Windows Azure Storage REST API.

In the Windows Azure MSDN forum there are occasional questions about the Windows Azure Storage Services REST API. I have occasionally responded to these with some code examples showing how to use the API. I thought it would be useful to provide some examples of using the REST API for tables, blobs and queues – if only so I don’t have to dredge up examples when people ask how to use it. This post is not intended to provide a complete description of the REST API.

The REST API is comprehensively documented (other than the lack of working examples). Since the REST API is the definitive way to address Windows Azure Storage Services, I think people using the higher level Storage Client API should have a passing understanding of the REST API, to the level of being able to understand the documentation. Understanding the REST API can provide a deeper understanding of why the Storage Client API behaves the way it does.

Fiddler

The Fiddler Web Debugging Proxy is an essential tool when developing using the REST (or Storage Client) API since it captures precisely what is sent over the wire to the Windows Azure Storage Services.

Authorization

Nearly every request to the Windows Azure Storage Services must be authenticated. The exception is access to blobs with public read access. The supported authentication schemes for blobs, queues and tables are described in the documentation. The requests must be accompanied by an Authorization header constructed by making a hash-based message authentication code using the SHA-256 hash.
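For readers working outside .NET, the signing step can be sketched in Java roughly as follows. This is an illustrative sketch rather than part of the original post: the class and method names are made up, and it assumes the account key is supplied Base64-encoded and that the scheme name is `SharedKey`.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Illustrative helper; the names are assumptions, not an Azure SDK API. */
final class SharedKeySigner {
    private SharedKeySigner() {}

    /** Returns Base64(HMAC-SHA256(Base64-decoded key, UTF-8 bytes of stringToSign)). */
    static String sign(String base64AccountKey, String stringToSign) {
        try {
            byte[] key = Base64.getDecoder().decode(base64AccountKey);
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            byte[] digest = mac.doFinal(stringToSign.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable or key rejected", e);
        }
    }

    /** Builds the header value in the form "SharedKey account:signature". */
    static String authorizationHeader(String account, String base64AccountKey,
                                      String stringToSign) {
        return "SharedKey " + account + ":" + sign(base64AccountKey, stringToSign);
    }
}
```

The result of `authorizationHeader` would be sent as the value of the `Authorization` request header. The hard part, as with any implementation, is producing exactly the right string to sign.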
The following is an example of performing the SHA-256 hash for the Authorization header:

    public static String CreateAuthorizationHeader(String canonicalizedString)
    {
        String signature = String.Empty;

        using (HMACSHA256 hmacSha256 = new HMACSHA256(Convert.FromBase64String(storageAccountKey)))
        {
            Byte[] dataToHmac = System.Text.Encoding.UTF8.GetBytes(canonicalizedString);
            signature = Convert.ToBase64String(hmacSha256.ComputeHash(dataToHmac));
        }

        String authorizationHeader = String.Format(
            CultureInfo.InvariantCulture,
            "{0} {1}:{2}",
            AzureStorageConstants.SharedKeyAuthorizationScheme,
            AzureStorageConstants.Account,
            signature);

        return authorizationHeader;
    }

This method is used in all the examples in this post.

AzureStorageConstants is a helper class containing various constants. storageAccountKey is the secret key for the Windows Azure Storage Services account specified by Account. In the examples given here, SharedKeyAuthorizationScheme is SharedKey.

The trickiest part in using the REST API successfully is getting the correct string to sign. Fortunately, in the event of an authentication failure the Blob Service and Queue Service respond with the authorization string they used, and this can be compared with the authorization string used in generating the Authorization header. This has greatly simplified the use of the REST API.

Table Service API

The Table Service API supports the following table-level operations: Create Table, Delete Table, Query Tables.

The Table Service API supports the following entity-level operations: Delete Entity, Insert Entity, Merge Entity, Update Entity, Query Entities.

These operations are implemented using the appropriate HTTP verb:

    DELETE – delete
    GET – query
    MERGE – merge
    POST – insert
    PUT – update

This section provides examples of the Insert Entity and Query Entities operations.

Insert Entity

The InsertEntity() method listed in this section inserts an entity with two String properties, Artist and Title, into a table.
The entity is submitted as an ATOM entry in the body of a request POSTed to the Table Service. In this example, the ATOM entry is generated by the GetRequestContentInsertXml() method. The date must be supplied in RFC 1123 format in the x-ms-date header, and the same value goes into the string that is signed to create the Authorization header. Note that the storage service version is set to “2012-02-12”, which requires the DataServiceVersion and MaxDataServiceVersion to be set appropriately.

    public void InsertEntity(String tableName, String artist, String title)
    {
        String requestMethod = "POST";
        String urlPath = tableName;
        String storageServiceVersion = "2012-02-12";
        String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture);
        String contentMD5 = String.Empty;
        String contentType = "application/atom+xml";
        String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, urlPath);
        String stringToSign = String.Format(
            "{0}\n{1}\n{2}\n{3}\n{4}",
            requestMethod,
            contentMD5,
            contentType,
            dateInRfc1123Format,
            canonicalizedResource);
        String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign);

        UTF8Encoding utf8Encoding = new UTF8Encoding();
        Byte[] content = utf8Encoding.GetBytes(GetRequestContentInsertXml(artist, title));

        Uri uri = new Uri(AzureStorageConstants.TableEndPoint + urlPath);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Accept = "application/atom+xml,application/xml";
        request.ContentLength = content.Length;
        request.ContentType = contentType;
        request.Method = requestMethod;
        request.Headers.Add("x-ms-date", dateInRfc1123Format);
        request.Headers.Add("x-ms-version", storageServiceVersion);
        request.Headers.Add("Authorization", authorizationHeader);
        request.Headers.Add("Accept-Charset", "UTF-8");
        request.Headers.Add("DataServiceVersion", "2.0;NetFx");
        request.Headers.Add("MaxDataServiceVersion", "2.0;NetFx");

        using (Stream requestStream = request.GetRequestStream())
        {
            requestStream.Write(content, 0, content.Length);
        }

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            Stream dataStream = response.GetResponseStream();
            using (StreamReader reader = new StreamReader(dataStream))
            {
                String responseFromServer = reader.ReadToEnd();
            }
        }
    }

    private String GetRequestContentInsertXml(String artist, String title)
    {
        String defaultNameSpace = "http://www.w3.org/2005/Atom";
        String dataservicesNameSpace = "http://schemas.microsoft.com/ado/2007/08/dataservices";
        String metadataNameSpace = "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata";

        XmlWriterSettings xmlWriterSettings = new XmlWriterSettings();
        xmlWriterSettings.OmitXmlDeclaration = false;
        xmlWriterSettings.Encoding = Encoding.UTF8;

        StringBuilder entry = new StringBuilder();
        using (XmlWriter xmlWriter = XmlWriter.Create(entry))
        {
            xmlWriter.WriteProcessingInstruction("xml", "version=\"1.0\" encoding=\"UTF-8\"");
            xmlWriter.WriteWhitespace("\n");
            xmlWriter.WriteStartElement("entry", defaultNameSpace);
            xmlWriter.WriteAttributeString("xmlns", "d", null, dataservicesNameSpace);
            xmlWriter.WriteAttributeString("xmlns", "m", null, metadataNameSpace);
            xmlWriter.WriteElementString("title", null);
            xmlWriter.WriteElementString("updated", String.Format("{0:o}", DateTime.UtcNow));
            xmlWriter.WriteStartElement("author");
            xmlWriter.WriteElementString("name", null);
            xmlWriter.WriteEndElement();
            xmlWriter.WriteElementString("id", null);
            xmlWriter.WriteStartElement("content");
            xmlWriter.WriteAttributeString("type", "application/xml");
            xmlWriter.WriteStartElement("properties", metadataNameSpace);
            xmlWriter.WriteElementString("PartitionKey", dataservicesNameSpace, artist);
            xmlWriter.WriteElementString("RowKey", dataservicesNameSpace, title);
            xmlWriter.WriteElementString("Artist", dataservicesNameSpace, artist);
            xmlWriter.WriteElementString("Title", dataservicesNameSpace, title + "\n" + title);
            xmlWriter.WriteEndElement();
            xmlWriter.WriteEndElement();
            xmlWriter.WriteEndElement();
            xmlWriter.Close();
        }
        String requestContent = entry.ToString();
        return requestContent;
    }

This generates the following request (as captured by Fiddler):

    POST https://STORAGE_ACCOUNT.table.core.windows.net/authors HTTP/1.1
    Accept: application/atom+xml,application/xml
    Content-Type: application/atom+xml
    x-ms-date: Sun, 08 Sep 2013 06:31:12 GMT
    x-ms-version: 2012-02-12
    Authorization: SharedKey STORAGE_ACCOUNT:w7Uu4wHZx4fFwa2bsxd/TJVZZ1AqMPwxvW+pYtoWHd0=
    Accept-Charset: UTF-8
    DataServiceVersion: 2.0;NetFx
    MaxDataServiceVersion: 2.0;NetFx
    Host: STORAGE_ACCOUNT.table.core.windows.net
    Content-Length: 514
    Expect: 100-continue
    Connection: Keep-Alive

The body of the request is:

    2013-09-08T07:19:07Z Beckett Molloy 2013-09-08T07:19:07.2189243Z Beckett Molloy Molloy

Note that I should have URL-encoded the PartitionKey and RowKey but did not do so for simplicity. There are, in fact, some issues with the URL encoding of spaces and other symbols.

Get Entity

The GetEntity() method described in this section retrieves the single entity inserted in the previous section. The particular entity to be retrieved is identified directly in the URL.
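As a rough illustration of how the entity address and the Table Service string to sign fit together, here is a small Java sketch. The class and method names are hypothetical. Doubling embedded single quotes follows the OData string-literal convention and is an addition not present in the original examples. URL encoding is deliberately left out, as in the post.

```java
/** Illustrative only; names are assumptions, not part of any Azure SDK. */
final class TableRequestStrings {
    private TableRequestStrings() {}

    /** Builds a path such as authors(PartitionKey='Beckett',RowKey='Molloy'). */
    static String entityPath(String table, String partitionKey, String rowKey) {
        return String.format("%s(PartitionKey='%s',RowKey='%s')",
                table, quote(partitionKey), quote(rowKey));
    }

    /** OData string literals represent an embedded single quote as two quotes. */
    private static String quote(String value) {
        return value.replace("'", "''");
    }

    /**
     * Table Service shared-key layout, as used by the C# examples in this post:
     * VERB, Content-MD5, Content-Type, Date, CanonicalizedResource,
     * separated by newlines. Empty fields still contribute their newline.
     */
    static String stringToSign(String method, String contentMd5, String contentType,
                               String date, String account, String urlPath) {
        String canonicalizedResource = "/" + account + "/" + urlPath;
        return String.join("\n", method, contentMd5, contentType, date, canonicalizedResource);
    }
}
```

For a GET, the Content-MD5 and Content-Type fields are empty strings, which yields the three consecutive newlines after the verb.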
    public void GetEntity(String tableName, String partitionKey, String rowKey)
    {
        String requestMethod = "GET";
        String urlPath = String.Format("{0}(PartitionKey='{1}',RowKey='{2}')", tableName, partitionKey, rowKey);
        String storageServiceVersion = "2012-02-12";
        String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture);
        String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, urlPath);
        String stringToSign = String.Format(
            "{0}\n\n\n{1}\n{2}",
            requestMethod,
            dateInRfc1123Format,
            canonicalizedResource);
        String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign);

        Uri uri = new Uri(AzureStorageConstants.TableEndPoint + urlPath);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = requestMethod;
        request.Headers.Add("x-ms-date", dateInRfc1123Format);
        request.Headers.Add("x-ms-version", storageServiceVersion);
        request.Headers.Add("Authorization", authorizationHeader);
        request.Headers.Add("Accept-Charset", "UTF-8");
        request.Accept = "application/atom+xml,application/xml";
        request.Headers.Add("DataServiceVersion", "2.0;NetFx");
        request.Headers.Add("MaxDataServiceVersion", "2.0;NetFx");

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            Stream dataStream = response.GetResponseStream();
            using (StreamReader reader = new StreamReader(dataStream))
            {
                String responseFromServer = reader.ReadToEnd();
            }
        }
    }

This generates the following request (as captured by Fiddler):

    GET https://STORAGE_ACCOUNT.table.core.windows.net/authors(PartitionKey='Beckett',RowKey='Molloy') HTTP/1.1
    x-ms-date: Sun, 08 Sep 2013 06:31:14 GMT
    x-ms-version: 2012-02-12
    Authorization: SharedKey STORAGE_ACCOUNT:1hWbr4aNq4JWCpNJY3rsLH1SkIyeFTJflbqyKMPQ1Gk=
    Accept-Charset: UTF-8
    Accept: application/atom+xml,application/xml
    DataServiceVersion: 2.0;NetFx
    MaxDataServiceVersion: 2.0;NetFx
    Host: STORAGE_ACCOUNT.table.core.windows.net

The Table Service generates the following response:

    HTTP/1.1 200 OK
    Cache-Control: no-cache
    Content-Type: application/atom+xml;charset=utf-8
    ETag: W/"datetime'2013-09-08T06%3A31%3A14.1579056Z'"
    Server: Windows-Azure-Table/1.0 Microsoft-HTTPAPI/2.0
    x-ms-request-id: f4bd4c77-6fb6-42a8-8dff-81ea8d28fa2e
    x-ms-version: 2012-02-12
    Date: Sun, 08 Sep 2013 06:31:15 GMT
    Content-Length: 1108

The returned entities, in this case a single entity, are returned in ATOM entry format in the response body:

    https://STORAGE_ACCOUNT.table.core.windows.net/authors(PartitionKey='Beckett',RowKey='Molloy') 2013-09-08T06:31:15Z Beckett Molloy 2013-09-08T06:31:14.1579056Z Beckett Molloy Molloy

Blob Service API

The Blob Service API supports the following account-level operation: List Containers.

The Blob Service API supports the following container-level operations: Create Container, Delete Container, Get Container ACL, Get Container Properties, Get Container Metadata, List Blobs, Set Container ACL, Set Container Metadata.

The Blob Service API supports the following blob-level operations: Copy Blob, Delete Blob, Get Blob, Get Blob Metadata, Get Blob Properties, Lease Blob, Put Blob, Set Blob Metadata, Set Blob Properties, Snapshot Blob.

The Blob Service API supports the following operations on block blobs: Get Block List, Put Block, Put Block List.

The Blob Service API supports the following operations on page blobs: Get Page Regions, Put Page.

This section provides examples of the Put Blob and Lease Blob operations.

Put Blob

The Blob Service and Queue Service use a different form of shared-key authentication from the Table Service, so care should be taken in creating the string to be signed for authorization. The blob type, BlockBlob or PageBlob, must be specified as a request header and consequently appears in the authorization string.
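The layout of the Blob and Queue string to sign is easy to get wrong, so here is a hedged Java sketch of it. The class and method names are hypothetical. The field order is VERB, Content-Encoding, Content-Language, Content-Length, Content-MD5, Content-Type, Date, If-Modified-Since, If-Match, If-None-Match, If-Unmodified-Since, Range, then the canonicalized headers and resource. That order follows the Shared Key documentation for this service version and matches the run of newlines in the C# examples in this post. The sketch assumes the date travels in `x-ms-date`, so the Date field stays empty.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only; names are assumptions, not part of any Azure SDK. */
final class BlobStringToSign {
    private BlobStringToSign() {}

    /** Keeps only x-ms-* headers, lower-cases the names, sorts them, joins as name:value. */
    static String canonicalizedHeaders(Map<String, String> headers) {
        TreeMap<String, String> sorted = new TreeMap<>();
        for (Map.Entry<String, String> h : headers.entrySet()) {
            String name = h.getKey().toLowerCase(Locale.ROOT);
            if (name.startsWith("x-ms-")) {
                sorted.put(name, h.getValue().trim());
            }
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> h : sorted.entrySet()) {
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(h.getKey()).append(':').append(h.getValue());
        }
        return sb.toString();
    }

    /** Only the verb and Content-Length are filled in; every other standard field is empty. */
    static String build(String method, String contentLength,
                        String canonicalizedHeaders, String canonicalizedResource) {
        return String.join("\n",
                method,
                "", "",              // Content-Encoding, Content-Language
                contentLength,
                "", "", "",          // Content-MD5, Content-Type, Date
                "", "", "", "", "",  // If-Modified-Since, If-Match, If-None-Match, If-Unmodified-Since, Range
                canonicalizedHeaders,
                canonicalizedResource);
    }
}
```

For a request without a body (a GET, for example), an empty string would be passed for `contentLength`, which gives the twelve consecutive newlines seen in the Get Messages example.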
    public void PutBlob(String containerName, String blobName)
    {
        String requestMethod = "PUT";
        String urlPath = String.Format("{0}/{1}", containerName, blobName);
        String storageServiceVersion = "2012-02-12";
        String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture);

        String content = "Andrew Carnegie was born in Dunfermline";
        UTF8Encoding utf8Encoding = new UTF8Encoding();
        Byte[] blobContent = utf8Encoding.GetBytes(content);
        Int32 blobLength = blobContent.Length;

        const String blobType = "BlockBlob";

        String canonicalizedHeaders = String.Format(
            "x-ms-blob-type:{0}\nx-ms-date:{1}\nx-ms-version:{2}",
            blobType,
            dateInRfc1123Format,
            storageServiceVersion);
        String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, urlPath);
        String stringToSign = String.Format(
            "{0}\n\n\n{1}\n\n\n\n\n\n\n\n\n{2}\n{3}",
            requestMethod,
            blobLength,
            canonicalizedHeaders,
            canonicalizedResource);
        String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign);

        Uri uri = new Uri(AzureStorageConstants.BlobEndPoint + urlPath);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = requestMethod;
        request.Headers.Add("x-ms-blob-type", blobType);
        request.Headers.Add("x-ms-date", dateInRfc1123Format);
        request.Headers.Add("x-ms-version", storageServiceVersion);
        request.Headers.Add("Authorization", authorizationHeader);
        request.ContentLength = blobLength;

        using (Stream requestStream = request.GetRequestStream())
        {
            requestStream.Write(blobContent, 0, blobLength);
        }

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            String ETag = response.Headers["ETag"];
        }
    }

This generates the following request:

    PUT https://STORAGE_ACCOUNT.blob.core.windows.net/fife/dunfermline HTTP/1.1
    x-ms-blob-type: BlockBlob
    x-ms-date: Sun, 08 Sep 2013 06:28:29 GMT
    x-ms-version: 2012-02-12
    Authorization: SharedKey STORAGE_ACCOUNT:ntvh/lamVmikvwHhy6vRVBIh87kibkPlEOiHyLDia6g=
    Host: STORAGE_ACCOUNT.blob.core.windows.net
    Content-Length: 39
    Expect: 100-continue
    Connection: Keep-Alive

The body of the request is:

    Andrew Carnegie was born in Dunfermline

The Blob Service generates the following response:

    HTTP/1.1 201 Created
    Transfer-Encoding: chunked
    Content-MD5: RYJnWGXLyt94l5jG82LjBw==
    Last-Modified: Sun, 08 Sep 2013 06:28:31 GMT
    ETag: "0x8D07A73C5704A86"
    Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
    x-ms-request-id: b74ef0a2-294d-4581-b8f1-6cda724bbdbf
    x-ms-version: 2012-02-12
    Date: Sun, 08 Sep 2013 06:28:30 GMT

Lease Blob

The Blob Service allows a user to lease a blob for a minute at a time and so acquire a write lock on it. The use case for this is the locking of a page blob used to store the VHD backing a writeable Azure Drive.

The LeaseBlob() example in this section demonstrates a subtle issue with the creation of authorization strings. The URL has a query string, comp=lease. Rather than using this directly in creating the authorization string, it must be moved onto its own line and converted into comp:lease, with a colon replacing the equals sign (see modifiedUrlPath in the example). Furthermore, the Lease Blob operation requires the use of an x-ms-lease-action header to indicate whether the lease is being acquired, renewed, released or broken.
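The query-string rewrite generalises beyond comp=lease. Below is a minimal Java sketch of it, with a hypothetical class name. It assumes the Shared Key canonicalization rule from the storage documentation: parameter names are lower-cased and sorted, and each one is appended on its own line as name:value. URL-decoding of values and repeated parameters are not handled here.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only; names are assumptions, not part of any Azure SDK. */
final class CanonicalizedResource {
    private CanonicalizedResource() {}

    /**
     * Builds "/account/path" followed by one "\nname:value" line per query
     * parameter, with parameter names lower-cased and sorted.
     */
    static String of(String account, String path, String query) {
        StringBuilder sb = new StringBuilder("/").append(account).append('/').append(path);
        if (query == null || query.isEmpty()) {
            return sb.toString();
        }
        TreeMap<String, String> params = new TreeMap<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(name.toLowerCase(Locale.ROOT), value);
        }
        for (Map.Entry<String, String> p : params.entrySet()) {
            sb.append('\n').append(p.getKey()).append(':').append(p.getValue());
        }
        return sb.toString();
    }
}
```

Called with the container and blob path and the query `comp=lease`, this produces the same two-line resource that the C# example builds by hand.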
    public void LeaseBlob(String containerName, String blobName)
    {
        String requestMethod = "PUT";
        String urlPath = String.Format("{0}/{1}?comp=lease", containerName, blobName);
        String modifiedUrlPath = String.Format("{0}/{1}\ncomp:lease", containerName, blobName);
        const Int32 contentLength = 0;
        String storageServiceVersion = "2012-02-12";
        String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture);
        String leaseAction = "acquire";
        String leaseDuration = "60";

        String canonicalizedHeaders = String.Format(
            "x-ms-date:{0}\nx-ms-lease-action:{1}\nx-ms-lease-duration:{2}\nx-ms-version:{3}",
            dateInRfc1123Format,
            leaseAction,
            leaseDuration,
            storageServiceVersion);
        String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, modifiedUrlPath);
        String stringToSign = String.Format(
            "{0}\n\n\n{1}\n\n\n\n\n\n\n\n\n{2}\n{3}",
            requestMethod,
            contentLength,
            canonicalizedHeaders,
            canonicalizedResource);
        String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign);

        Uri uri = new Uri(AzureStorageConstants.BlobEndPoint + urlPath);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = requestMethod;
        request.Headers.Add("x-ms-date", dateInRfc1123Format);
        request.Headers.Add("x-ms-lease-action", leaseAction);
        request.Headers.Add("x-ms-lease-duration", leaseDuration);
        request.Headers.Add("x-ms-version", storageServiceVersion);
        request.Headers.Add("Authorization", authorizationHeader);
        request.ContentLength = contentLength;

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            String leaseId = response.Headers["x-ms-lease-id"];
        }
    }

This generates the following request:

    PUT https://STORAGE_ACCOUNT.blob.core.windows.net/fife/dunfermline?comp=lease HTTP/1.1
    x-ms-date: Sun, 08 Sep 2013 06:28:31 GMT
    x-ms-lease-action: acquire
    x-ms-lease-duration: 60
    x-ms-version: 2012-02-12
    Authorization: SharedKey rebus:+SQ5+RFZg3hUaws5XCRHxsDgXb1ycdRIz5EKyHJWP7s=
    Host: rebus.blob.core.windows.net
    Content-Length: 0

The Blob Service generates the following response:

    HTTP/1.1 201 Created
    Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
    x-ms-request-id: 4b6ff77f-f885-4f74-803a-c92920d225c3
    x-ms-version: 2012-02-12
    x-ms-lease-id: b1320c2c-65ad-41d6-a7bd-85a4242c0ac5
    Date: Sun, 08 Sep 2013 06:28:31 GMT
    Content-Length: 0

Queue Service API

The Queue Service API supports the following account-level operation: List Queues.

The Queue Service API supports the following queue-level operations: Create Queue, Delete Queue, Get Queue Metadata, Set Queue Metadata.

The Queue Service API supports the following message-level operations: Clear Messages, Delete Message, Get Messages, Peek Messages, Put Message.

This section provides examples of the Put Message and Get Messages operations.

Put Message

The most obvious curiosity about Put Message is that it uses the HTTP verb POST rather than PUT. The issue is presumably the interaction of the English language and the HTTP standard, which states that PUT should be idempotent; the Put Message operation is clearly not, since each invocation merely adds another message to the queue. Regardless, it did catch me out when I failed to read the documentation well enough, so take that as a warning.

The content of a message posted to the queue must be formatted in a specified XML schema and must then be UTF8 encoded.
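As a small illustration of that schema, the following Java sketch wraps a message in the `QueueMessage`/`MessageText` envelope that the Queue Service documentation specifies and encodes it as UTF-8. The class and method names are hypothetical, and the XML escaping is an addition that the post itself does not discuss.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only; names are assumptions, not part of any Azure SDK. */
final class QueueMessageBody {
    private QueueMessageBody() {}

    /** Wraps the text in the envelope expected by Put Message. */
    static String xml(String text) {
        return "<QueueMessage><MessageText>" + escape(text) + "</MessageText></QueueMessage>";
    }

    /** UTF-8 bytes of the envelope; the array length is the Content-Length to send and sign. */
    static byte[] utf8(String text) {
        return xml(text).getBytes(StandardCharsets.UTF_8);
    }

    /** Minimal escaping for XML element content. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

For the 20-character message "Saturday in the cafe" the encoded body is 76 bytes long, which agrees with the Content-Length header in the captured request.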
    public void PutMessage(String queueName, String message)
    {
        String requestMethod = "POST";

        String urlPath = String.Format("{0}/messages", queueName);

        String storageServiceVersion = "2012-02-12";
        String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture);

        // Wrap the text in the XML envelope required by the Queue Service, then UTF-8 encode it.
        String messageText = String.Format(
            "<QueueMessage><MessageText>{0}</MessageText></QueueMessage>", message);
        UTF8Encoding utf8Encoding = new UTF8Encoding();
        Byte[] messageContent = utf8Encoding.GetBytes(messageText);
        Int32 messageLength = messageContent.Length;

        String canonicalizedHeaders = String.Format(
            "x-ms-date:{0}\nx-ms-version:{1}",
            dateInRfc1123Format, storageServiceVersion);
        String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, urlPath);
        String stringToSign = String.Format(
            "{0}\n\n\n{1}\n\n\n\n\n\n\n\n\n{2}\n{3}",
            requestMethod, messageLength, canonicalizedHeaders, canonicalizedResource);
        String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign);

        Uri uri = new Uri(AzureStorageConstants.QueueEndPoint + urlPath);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = requestMethod;
        request.Headers.Add("x-ms-date", dateInRfc1123Format);
        request.Headers.Add("x-ms-version", storageServiceVersion);
        request.Headers.Add("Authorization", authorizationHeader);
        request.ContentLength = messageLength;

        using (Stream requestStream = request.GetRequestStream())
        {
            requestStream.Write(messageContent, 0, messageLength);
        }

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            String requestId = response.Headers["x-ms-request-id"];
        }
    }

This generates the following request:

    POST https://rebus.queue.core.windows.net/revolution/messages HTTP/1.1
    x-ms-date: Sun, 08 Sep 2013 06:34:08 GMT
    x-ms-version: 2012-02-12
    Authorization: SharedKey rebus:nyASTVWifnxHKnj2wXwuzzzXz5CxUBZj58SToV5QFK8=
    Host: rebus.queue.core.windows.net
    Content-Length: 76
    Expect: 100-continue
    Connection: Keep-Alive

The body of the request is:

    <QueueMessage><MessageText>Saturday in the cafe</MessageText></QueueMessage>

The Queue Service generates the following response:

    HTTP/1.1 201 Created
    Server: Windows-Azure-Queue/1.0 Microsoft-HTTPAPI/2.0
    x-ms-request-id: 14c6e73b-15d9-480c-b251-c4c01b48e529
    x-ms-version: 2012-02-12
    Date: Sun, 08 Sep 2013 06:34:09 GMT
    Content-Length: 0

Get Messages

The Get Messages operation described in this section retrieves a single message with the default message visibility timeout of 30 seconds.

    public void GetMessage(String queueName)
    {
        String requestMethod = "GET";

        String urlPath = String.Format("{0}/messages", queueName);

        String storageServiceVersion = "2012-02-12";
        String dateInRfc1123Format = DateTime.UtcNow.ToString("R", CultureInfo.InvariantCulture);

        String canonicalizedHeaders = String.Format(
            "x-ms-date:{0}\nx-ms-version:{1}",
            dateInRfc1123Format, storageServiceVersion);
        String canonicalizedResource = String.Format("/{0}/{1}", AzureStorageConstants.Account, urlPath);
        // A GET has no body, so the Content-Length slot is left empty.
        String stringToSign = String.Format(
            "{0}\n\n\n\n\n\n\n\n\n\n\n\n{1}\n{2}",
            requestMethod, canonicalizedHeaders, canonicalizedResource);
        String authorizationHeader = Utility.CreateAuthorizationHeader(stringToSign);

        Uri uri = new Uri(AzureStorageConstants.QueueEndPoint + urlPath);
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = requestMethod;
        request.Headers.Add("x-ms-date", dateInRfc1123Format);
        request.Headers.Add("x-ms-version", storageServiceVersion);
        request.Headers.Add("Authorization", authorizationHeader);
        request.Accept = "application/atom+xml,application/xml";

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            Stream dataStream = response.GetResponseStream();
            using (StreamReader reader = new StreamReader(dataStream))
            {
                String responseFromServer = reader.ReadToEnd();
            }
        }
    }

This generates the following request:

    GET https://rebus.queue.core.windows.net/revolution/messages HTTP/1.1
    x-ms-date: Sun, 08 Sep 2013 06:34:11 GMT
    x-ms-version: 2012-02-12
    Authorization: SharedKey rebus:K67XooYhokw0i0AlCzYQ4GeLLrJih1r1vSqiO9DBo0c=
    Accept: application/atom+xml,application/xml
    Host: rebus.queue.core.windows.net

The Queue Service generates the following response:

    HTTP/1.1 200 OK
    Content-Type: application/xml
    Server: Windows-Azure-Queue/1.0 Microsoft-HTTPAPI/2.0
    x-ms-request-id: efb21a86-7d66-47fd-b13d-7aa74fce0568
    x-ms-version: 2012-02-12
    Date: Sun, 08 Sep 2013 06:34:12 GMT
    Content-Length: 484

The message is returned in the response body as follows (line breaks and indentation added for readability):

    <?xml version="1.0" encoding="utf-8"?>
    <QueueMessagesList>
      <QueueMessage>
        <MessageId>05fd902f-6031-4ef4-8298-ef3844ec3bc6</MessageId>
        <InsertionTime>Sun, 08 Sep 2013 06:34:11 GMT</InsertionTime>
        <ExpirationTime>Sun, 15 Sep 2013 06:34:11 GMT</ExpirationTime>
        <DequeueCount>1</DequeueCount>
        <PopReceipt>AgAAAAMAAAAAAAAAAL+zgF2szgE=</PopReceipt>
        <TimeNextVisible>Sun, 08 Sep 2013 06:34:43 GMT</TimeNextVisible>
        <MessageText>Saturday in the cafe</MessageText>
      </QueueMessage>
    </QueueMessagesList>

I noticed that some newline specifiers in strings (\n) were lost when the blog was auto-ported from Windows Live Spaces to WordPress. I have put them back in, but it is possible I missed some. Consequently, in the event of a problem you should check the newlines in canonicalizedHeaders and stringToSign.
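Since the newlines in stringToSign are easy to get wrong, the following Python sketch builds the string from a list of header slots instead of a hand-typed format string, then signs it. It assumes the standard Shared Key scheme (HMAC-SHA256 over the UTF-8 string, keyed with the base64-decoded account key); the function names are hypothetical and this is not the article's Utility.CreateAuthorizationHeader, whose source is not shown in this section. The key below is a made-up placeholder.

```python
import base64
import hashlib
import hmac


def build_string_to_sign(method, canonicalized_headers, canonicalized_resource,
                         content_length=None):
    """Join the verb, the 11 standard header slots, and the canonicalized parts.

    Slot order: Content-Encoding, Content-Language, Content-Length, Content-MD5,
    Content-Type, Date, If-Modified-Since, If-Match, If-None-Match,
    If-Unmodified-Since, Range. Only Content-Length is filled in here.
    """
    slots = [""] * 11
    if content_length is not None:
        slots[2] = str(content_length)
    return "\n".join([method] + slots + [canonicalized_headers, canonicalized_resource])


def create_authorization_header(account, base64_key, string_to_sign):
    """Return a 'SharedKey account:signature' header value."""
    key = base64.b64decode(base64_key)
    digest = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return "SharedKey {0}:{1}".format(account, base64.b64encode(digest).decode("ascii"))


# Placeholder key for illustration only; a real account key comes from the portal.
demo_key = base64.b64encode(b"not-a-real-key").decode("ascii")
to_sign = build_string_to_sign(
    "GET",
    "x-ms-date:Sun, 08 Sep 2013 06:34:11 GMT\nx-ms-version:2012-02-12",
    "/rebus/revolution/messages",
)
print(create_authorization_header("rebus", demo_key, to_sign))
```

Building the string from a list makes the slot count explicit: there are always twelve newlines between the verb and the canonicalized headers, whether or not Content-Length is present.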

October 24, 2013

by Neil Mackenzie

· 38,843 Views

ElasticSearch: Java API

ElasticSearch provides a Java API that executes all operations asynchronously through a client object.

September 30, 2013

by Hüseyin Akdoğan

CORE

· 137,598 Views · 4 Likes

Connecting to SQL Azure with SQL Management Studio

Intro

If you want to manage your SQL Databases in Azure using tools that you’re a little more familiar and comfortable with (for example, SQL Management Studio), how do you go about connecting? You could read the help article from Microsoft, or you can follow my intuitive screen-based instructions below.

Assumptions

1. I’m assuming you have a version of SQL Management Studio already installed. I believe you’ll need at least SQL Server 2008 R2’s version or newer.
2. I’m further assuming you’ve already created a SQL Database in Azure.

Steps to Connect SSMS to SQL Azure

1. Authenticate to the Azure Portal.
2. Click on SQL Databases.
3. Click on Servers.
4. Click on the name of the server you wish to connect to.
5. Click on Configure. If not already in place, click on ‘Add to the allowed IP addresses’ to add your current IP address (or specify an address you wish to connect from) and click ‘Save’.
6. Open SQL Management Studio and connect to Database services (usually comes up by default):
   Enter the fully qualified server name (<server name>.database.windows.net)
   Change to SQL Server Authentication
   Enter the login preferred (if a new database, the username you specified when you created the DB server)
   Enter the correct password
7. Hit the Connect button.

Troubleshooting

Ensure you have the appropriate ports open outbound from your local network or connection (typically port 1433).
Ensure you have allowed the correct public IP address you’re trying to connect from via the Azure Portal (steps 1-5 above).
Ensure you are using the correct server name and user name. For SSMS, this is the server name (in step 4) followed by .database.windows.net
Ensure you are using SQL Server Authentication. For SSMS the username format is
If you forgot the password of your username, you can reset the password in the Azure Portal: in step 4, click on Dashboard.

Lastly…

You can click on the Database (in step 2) to see your connection options:
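The first troubleshooting item, outbound access on port 1433, can be checked without SSMS. The following Python sketch simply tries to open a TCP connection; the function name can_reach is hypothetical, and a successful connection only shows the port is reachable, not that the firewall rule or the login is correct.

```python
import socket


def can_reach(host: str, port: int = 1433, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Example call (replace the placeholder with your own server name):
#   can_reach("<server name>.database.windows.net")
```

If this returns False from your network but True from elsewhere, the block is most likely on your side (a local firewall or proxy) rather than in the Azure Portal settings.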

September 25, 2013

by Rob Sanders

· 262,940 Views

OpenStack Savanna: Fast Hadoop Cluster Provisioning on OpenStack

Introduction

OpenStack is one of the most popular open source cloud computing projects providing an Infrastructure as a Service solution. Its key components are Compute (Nova), Networking (Neutron, formerly known as Quantum), Storage (object and block storage, Swift and Cinder, respectively), OpenStack Dashboard (Horizon), Identity Service (Keystone), and Image Service (Glance). There are other official incubated projects such as Metering (Ceilometer) and Orchestration and Service Definition (Heat).

Savanna is a Hadoop as a Service offering for OpenStack introduced by Mirantis. It is still in an early phase (version 0.2 was released in summer 2013) and, according to its roadmap, version 1.0 is targeted for official OpenStack incubation. In principle, Heat could also be used for Hadoop cluster provisioning, but Savanna is specifically tuned to provide Hadoop-specific API functionality, while Heat is meant for generic purposes.

Savanna Architecture

Savanna is integrated with the core OpenStack components such as Keystone, Nova, Glance, Swift, and Horizon. It has a REST API that supports the Hadoop cluster provisioning steps. The Savanna API is implemented as a WSGI server that, by default, listens on port 8386. In addition, Savanna can be integrated with Horizon, the OpenStack dashboard, to create a Hadoop cluster from the management console. Savanna also comes with a Vanilla plugin that deploys a Hadoop cluster image. The standard out-of-the-box Vanilla plugin supports Hadoop version 1.1.2.

Installing Savanna

The simplest way to try out Savanna is to use DevStack in a virtual machine. I was using an Ubuntu 12.04 virtual instance in my tests.
In that environment we need to execute the following commands to install DevStack and the Savanna API:

    $ sudo apt-get install git-core
    $ git clone https://github.com/openstack-dev/devstack.git
    $ cd devstack
    $ vi localrc

    # edit localrc
    ADMIN_PASSWORD=nova
    MYSQL_PASSWORD=nova
    RABBIT_PASSWORD=nova
    SERVICE_PASSWORD=$ADMIN_PASSWORD
    SERVICE_TOKEN=nova

    # Enable Swift
    ENABLED_SERVICES+=,swift
    SWIFT_HASH=66a3d6b56c1f479c8b4e70ab5c2000f5
    SWIFT_REPLICAS=1
    SWIFT_DATA_DIR=$DEST/data

    # Force checkout prerequisites
    # FORCE_PREREQ=1

    # Keystone is now configured by default to use PKI as the token format, which produces huge tokens.
    # Set UUID as the Keystone token format, which is much shorter and easier to work with.
    KEYSTONE_TOKEN_FORMAT=UUID

    # Change the FLOATING_RANGE to whatever IPs the VM is working in.
    # In NAT mode it is the subnet VMware Fusion provides; in bridged mode it is your local network.
    FLOATING_RANGE=192.168.55.224/27

    # Enable auto assignment of floating IPs. By default Savanna expects this setting to be enabled.
    EXTRA_OPTS=(auto_assign_floating_ip=True)

    # Enable logging
    SCREEN_LOGDIR=$DEST/logs/screen

    $ ./stack.sh    # this will take a while to execute

    $ sudo apt-get install python-setuptools python-virtualenv python-dev
    $ virtualenv savanna-venv
    $ savanna-venv/bin/pip install savanna
    $ mkdir savanna-venv/etc
    $ cp savanna-venv/share/savanna/savanna.conf.sample savanna-venv/etc/savanna.conf

    # To start the Savanna API:
    $ savanna-venv/bin/python savanna-venv/bin/savanna-api --config-file savanna-venv/etc/savanna.conf

To install the Savanna UI integrated with Horizon, we need to run the following commands:

    $ sudo pip install savanna-dashboard
    $ cd /opt/stack/horizon/openstack_dashboard
    $ vi settings.py

    HORIZON_CONFIG = {
        'dashboards': ('nova', 'syspanel', 'settings', 'savanna'),

    INSTALLED_APPS = (
        'savannadashboard',
        ....
    $ cd /opt/stack/horizon/openstack_dashboard/local
    $ vi local_settings.py

    SAVANNA_URL = 'http://localhost:8386/v1.0'

    $ sudo service apache2 restart

Provisioning a Hadoop Cluster

As a first step, we need to configure the Keystone-related environment variables to get the authentication token:

    ubuntu@ip-10-59-33-68:~$ vi .bashrc

    export OS_AUTH_URL=http://127.0.0.1:5000/v2.0/
    export OS_TENANT_NAME=admin
    export OS_USERNAME=admin
    export OS_PASSWORD=nova

    ubuntu@ip-10-59-33-68:~$ source .bashrc
    ubuntu@ip-10-59-33-68:~$ env | grep OS
    OS_PASSWORD=nova
    OS_AUTH_URL=http://127.0.0.1:5000/v2.0/
    OS_USERNAME=admin
    OS_TENANT_NAME=admin

    ubuntu@ip-10-59-33-68:~$ keystone token-get
    +-----------+----------------------------------+
    | Property  | Value                            |
    +-----------+----------------------------------+
    | expires   | 2013-08-09T20:31:12Z             |
    | id        | bdb582c836e3474f979c5aa8f844c000 |
    | tenant_id | 2f46e214984f4990b9c39d9c6222f572 |
    | user_id   | 077311b0a8304c8e86dc0dc168a67091 |
    +-----------+----------------------------------+

    $ export auth_token="bdb582c836e3474f979c5aa8f844c000"
    $ export tenant_id="2f46e214984f4990b9c39d9c6222f572"

Then we need to create the Glance image that we want to use for our Hadoop cluster.
in our example we have used mirantis's vanilla image but we can also build our own image: $ wget http://savanna-files.mirantis.com/savanna-0.2-vanilla-1.1.2-ubuntu-12.10.qcow2 $ glance image-create --name=savanna-0.2-vanilla-hadoop-ubuntu.qcow2 --disk-format=qcow2 --container-format=bare < ./savanna-0.2-vanilla-1.1.2-ubuntu-12.10.qcow2 ubuntu@ip-10-59-33-68:~/devstack$ glance image-list +--------------------------------------+-----------------------------------------+-------------+------------------+-----------+--------+ | id | name | disk format | container format | size | status | +--------------------------------------+-----------------------------------------+-------------+------------------+-----------+--------+ | d0d64f5c-9c15-4e7b-ad4c-13859eafa7b8 | cirros-0.3.1-x86_64-uec | ami | ami | 25165824 | active | | fee679ee-e0c0-447e-8ebd-028050b54af9 | cirros-0.3.1-x86_64-uec-kernel | aki | aki | 4955792 | active | | 1e52089b-930a-4dfc-b707-89b568d92e7e | cirros-0.3.1-x86_64-uec-ramdisk | ari | ari | 3714968 | active | | d28051e2-9ddd-45f0-9edc-8923db46fdf9 | savanna-0.2-vanilla-hadoop-ubuntu.qcow2 | qcow2 | bare | 551699456 | active | +--------------------------------------+-----------------------------------------+-------------+------------------+-----------+--------+ $ export image_id=d28051e2-9ddd-45f0-9edc-8923db46fdf9 then we have installed httpie , an open source http client that can be used to send rest requests to savanna api: $ sudo pip install httpie from now on we will use httpie to send savanna commands. 
we need to register the image with savanna: $ export savanna_url="http://localhost:8386/v1.0/$tenant_id" $ http post $savanna_url/images/$image_id x-auth-token:$auth_token username=ubuntu http/1.1 202 accepted content-length: 411 content-type: application/json date: thu, 08 aug 2013 21:28:07 gmt { "image": { "os-ext-img-size:size": 551699456, "created": "2013-08-08t21:05:55z", "description": "none", "id": "d28051e2-9ddd-45f0-9edc-8923db46fdf9", "metadata": { "_savanna_description": "none", "_savanna_username": "ubuntu" }, "mindisk": 0, "minram": 0, "name": "savanna-0.2-vanilla-hadoop-ubuntu.qcow2", "progress": 100, "status": "active", "tags": [], "updated": "2013-08-08t21:28:07z", "username": "ubuntu" } } $ http $savanna_url/images/$image_id/tag x-auth-token:$auth_token tags:='["vanilla", "1.1.2", "ubuntu"]' http/1.1 202 accepted content-length: 532 content-type: application/json date: thu, 08 aug 2013 21:29:25 gmt { "image": { "os-ext-img-size:size": 551699456, "created": "2013-08-08t21:05:55z", "description": "none", "id": "d28051e2-9ddd-45f0-9edc-8923db46fdf9", "metadata": { "_savanna_description": "none", "_savanna_tag_1.1.2": "true", "_savanna_tag_ubuntu": "true", "_savanna_tag_vanilla": "true", "_savanna_username": "ubuntu" }, "mindisk": 0, "minram": 0, "name": "savanna-0.2-vanilla-hadoop-ubuntu.qcow2", "progress": 100, "status": "active", "tags": [ "vanilla", "ubuntu", "1.1.2" ], "updated": "2013-08-08t21:29:25z", "username": "ubuntu" } } then we need to create a nodegroup templates (json files) that will be sent to savanna. there is one template for the master nodes ( namenode , jobtracker ) and another template for the worker nodes such as datanode and tasktracker . the hadoop version is 1.1.2. 
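The master and worker templates differ only in their name and node processes, so they can be generated rather than typed by hand. The Python sketch below is one way to do that; the helper name node_group_template is hypothetical, and the field names and values mirror the template files used in this walkthrough.

```python
import json


def node_group_template(name, node_processes, flavor_id="2",
                        plugin_name="vanilla", hadoop_version="1.1.2"):
    """Return a node group template payload as a plain dict."""
    return {
        "name": name,
        "flavor_id": flavor_id,
        "plugin_name": plugin_name,
        "hadoop_version": hadoop_version,
        "node_processes": list(node_processes),
    }


master = node_group_template("test-master-tmpl", ["jobtracker", "namenode"])
worker = node_group_template("test-worker-tmpl", ["tasktracker", "datanode"])

# Write the payloads to the file names used with httpie in this walkthrough.
for filename, payload in (("ng_master_template_create.json", master),
                          ("ng_worker_template_create.json", worker)):
    with open(filename, "w") as handle:
        json.dump(payload, handle, indent=2)
```

Generating both files from one function keeps the plugin name and Hadoop version in a single place, which helps when the same templates are recreated for a different plugin version.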
$ vi ng_master_template_create.json { "name": "test-master-tmpl", "flavor_id": "2", "plugin_name": "vanilla", "hadoop_version": "1.1.2", "node_processes": ["jobtracker", "namenode"] } $ vi ng_worker_template_create.json { "name": "test-worker-tmpl", "flavor_id": "2", "plugin_name": "vanilla", "hadoop_version": "1.1.2", "node_processes": ["tasktracker", "datanode"] } $ http $savanna_url/node-group-templates x-auth-token:$auth_token < ng_master_template_create.json http/1.1 202 accepted content-length: 387 content-type: application/json date: thu, 08 aug 2013 21:58:00 gmt { "node_group_template": { "created": "2013-08-08t21:58:00", "flavor_id": "2", "hadoop_version": "1.1.2", "id": "b3a79c88-b6fb-43d2-9a56-310218c66f7c", "name": "test-master-tmpl", "node_configs": {}, "node_processes": [ "jobtracker", "namenode" ], "plugin_name": "vanilla", "updated": "2013-08-08t21:58:00", "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 } } $ http $savanna_url/node-group-templates x-auth-token:$auth_token < ng_worker_template_create.json http/1.1 202 accepted content-length: 388 content-type: application/json date: thu, 08 aug 2013 21:59:41 gmt { "node_group_template": { "created": "2013-08-08t21:59:41", "flavor_id": "2", "hadoop_version": "1.1.2", "id": "773b2cfb-1e05-46f4-923f-13edc7d6aac6", "name": "test-worker-tmpl", "node_configs": {}, "node_processes": [ "tasktracker", "datanode" ], "plugin_name": "vanilla", "updated": "2013-08-08t21:59:41", "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 } } the next step is to define the cluster template: $ vi cluster_template_create.json { "name": "demo-cluster-template", "plugin_name": "vanilla", "hadoop_version": "1.1.2", "node_groups": [ { "name": "master", "node_group_template_id": "b3a79c88-b6fb-43d2-9a56-310218c66f7c", "count": 1 }, { "name": "workers", "node_group_template_id": "773b2cfb-1e05-46f4-923f-13edc7d6aac6", "count": 2 } ] } $ http 
$savanna_url/cluster-templates x-auth-token:$auth_token < cluster_template_create.json http/1.1 202 accepted content-length: 815 content-type: application/json date: fri, 09 aug 2013 07:04:24 gmt { "cluster_template": { "anti_affinity": [], "cluster_configs": {}, "created": "2013-08-09t07:04:24", "hadoop_version": "1.1.2", "id": "64c4117b-acee-4da7-937b-cb964f0471a9", "name": "demo-cluster-template", "node_groups": [ { "count": 1, "flavor_id": "2", "name": "master", "node_configs": {}, "node_group_template_id": "b3a79c88-b6fb-43d2-9a56-310218c66f7c", "node_processes": [ "jobtracker", "namenode" ], "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 }, { "count": 2, "flavor_id": "2", "name": "workers", "node_configs": {}, "node_group_template_id": "773b2cfb-1e05-46f4-923f-13edc7d6aac6", "node_processes": [ "tasktracker", "datanode" ], "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 } ], "plugin_name": "vanilla", "updated": "2013-08-09t07:04:24" } } now we are ready to create the hadoop cluster: $ vi cluster_create.json { "name": "cluster-1", "plugin_name": "vanilla", "hadoop_version": "1.1.2", "cluster_template_id" : "64c4117b-acee-4da7-937b-cb964f0471a9", "user_keypair_id": "savanna", "default_image_id": "d28051e2-9ddd-45f0-9edc-8923db46fdf9" } $ http $savanna_url/clusters x-auth-token:$auth_token < cluster_create.json http/1.1 202 accepted content-length: 1153 content-type: application/json date: fri, 09 aug 2013 07:28:14 gmt { "cluster": { "anti_affinity": [], "cluster_configs": {}, "cluster_template_id": "64c4117b-acee-4da7-937b-cb964f0471a9", "created": "2013-08-09t07:28:14", "default_image_id": "d28051e2-9ddd-45f0-9edc-8923db46fdf9", "hadoop_version": "1.1.2", "id": "d919f1db-522f-45ab-aadd-c078ba3bb4e3", 
"info": {}, "name": "cluster-1", "node_groups": [ { "count": 1, "created": "2013-08-09t07:28:14", "flavor_id": "2", "instances": [], "name": "master", "node_configs": {}, "node_group_template_id": "b3a79c88-b6fb-43d2-9a56-310218c66f7c", "node_processes": [ "jobtracker", "namenode" ], "updated": "2013-08-09t07:28:14", "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 }, { "count": 2, "created": "2013-08-09t07:28:14", "flavor_id": "2", "instances": [], "name": "workers", "node_configs": {}, "node_group_template_id": "773b2cfb-1e05-46f4-923f-13edc7d6aac6", "node_processes": [ "tasktracker", "datanode" ], "updated": "2013-08-09t07:28:14", "volume_mount_prefix": "/volumes/disk", "volumes_per_node": 0, "volumes_size": 10 } ], "plugin_name": "vanilla", "status": "validating", "updated": "2013-08-09t07:28:14", "user_keypair_id": "savanna" } } after a while we can run the nova command to check if the instances are created and running: $ nova list +--------------------------------------+-----------------------+--------+------------+-------------+----------------------------------+ | id | name | status | task state | power state | networks | +--------------------------------------+-----------------------+--------+------------+-------------+----------------------------------+ | 1a9f43bf-cddb-4556-877b-cc993730da88 | cluster-1-master-001 | active | none | running | private=10.0.0.2, 192.168.55.227 | | bb55f881-1f96-4669-a94a-58cbf4d88f39 | cluster-1-workers-001 | active | none | running | private=10.0.0.3, 192.168.55.226 | | 012a24e2-fa33-49f3-b051-9ee2690864df | cluster-1-workers-002 | active | none | running | private=10.0.0.4, 192.168.55.225 | +--------------------------------------+-----------------------+--------+------------+-------------+----------------------------------+ now we can log in to the hadoop master instance and run the required hadoop commands: $ ssh -i savanna.pem [email protected] $ sudo chmod 777 /usr/share/hadoop $ sudo su 
hadoop $ cd /usr/share/hadoop $ hadoop jar hadoop-example-1.1.2.jar pi 10 100 savanna ui via horizon in order to create nodegroup templates, cluster templates and the cluster itself we used a command line tool -- httpie -- to send rest api calls. the same functionality is also available via horizon, the standard openstack dashboard. first we need to register the image with savanna: then we need to create the nodegroup templates: after that we have to create the cluster template: and finally we have to create the cluster:
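For scripting the same REST calls without httpie, a request can be assembled with the Python standard library. This is a minimal sketch, not part of Savanna: the helper name build_savanna_request is hypothetical, the URL, tenant ID, and token are the example values from this walkthrough, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_savanna_request(base_url, path, auth_token, payload):
    """Build a POST request carrying a JSON payload and a Keystone token."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/" + path.lstrip("/"),
        data=data,
        method="POST",
        headers={
            "X-Auth-Token": auth_token,
            "Content-Type": "application/json",
        },
    )


savanna_url = "http://localhost:8386/v1.0/2f46e214984f4990b9c39d9c6222f572"
request = build_savanna_request(
    savanna_url,
    "node-group-templates",
    "bdb582c836e3474f979c5aa8f844c000",
    {"name": "test-master-tmpl", "flavor_id": "2", "plugin_name": "vanilla",
     "hadoop_version": "1.1.2", "node_processes": ["jobtracker", "namenode"]},
)
print(request.get_method(), request.full_url)
```

With the Savanna API running, passing the request to urllib.request.urlopen would send it; the JSON response can then be read from the returned object and decoded with json.loads.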

August 20, 2013

by Istvan Szegedi

· 9,499 Views

Resource Pooling, Virtualization, Fabric, and the Cloud

One of the five essential attributes of cloud computing (see The 5-3-2 Principle of Cloud Computing) is resource pooling, which is an important differentiator separating the thought process of traditional IT from that of a service-based, cloud computing approach. Resource pooling in the context of cloud computing and from a service provider’s viewpoint denotes a set of strategies and a methodical way of managing resources. For a user, resource pooling institutes an abstraction for presenting and consuming resources in a consistent and transparent fashion.

This article presents these key concepts derived from resource pooling:

Resource Pools
Virtualization in the Context of Cloud Computing
Standardization, Automation, and Optimization
Fabric
Cloud
Closing Thoughts

Resource Pools

Ultimately, data center resources can be logically placed into three categories: compute, networks, and storage. For many, this grouping may appear trivial. It is, however, a foundation upon which some cloud computing methodologies are developed, products designed, and solutions formulated.

Compute

This is a collection of all CPU capabilities. Essentially all data center servers, whether supporting or actually running a workload, are part of this compute group. The compute pool represents the total capacity for executing code and running instances. The process of constructing a compute pool is to first inventory all servers and identify virtualization candidates, and then implement server virtualization. It is never too early to introduce a system management solution to facilitate these processes, which in my view is a strategic investment and a critical component for all cloud initiatives.

Networks

The physical and logical artifacts put in place to connect, segment, and isolate resources from layer three and below are gathered in the network pool. Networking enables resources to become visible and hence possibly manageable.
In the age of instant gratification, networks and mobility are redefining the security and system administration boundaries, and play a direct and impactful role in user productivity and customer satisfaction. Networking in cloud computing is more than just remote access; it is empowerment for a user to self-serve and consume resources anytime, anywhere, with any device. BYOD and consumerization of IT are various expressions of these concepts.

Storage

This has long been a very specialized and sometimes mysterious part of IT. An enterprise storage solution is frequently characterized as a high-cost item with a significant financial and contractual commitment, specialized hardware, proprietary API and software, a dependency on direct vendor support, etc. In cloud computing, storage has become even more noticeable, since the ability to grow and shrink based on demand, i.e. elasticity, requires an enterprise-level, massive, reliable, and resilient storage solution at a global scale. While enterprise IT is consolidating resources and transforming the existing establishment into a cloud computing environment, how to leverage existing storage devices from various vendors and integrate them with next generation storage solutions is among the highest priorities for modernizing a data center.

Virtualization in the Context of Cloud Computing

In the last decade, virtualization has proved its value and accelerated the realization of cloud computing. Back then, virtualization was mainly server virtualization, which in an over-simplified statement means hosting multiple server instances on the same hardware while each instance runs transparently and in isolation, as if each consumes the entire hardware and is the only instance running. Many of the customer expectations, business needs, and methodologies have since evolved.
Now, we should validate virtualization in the context of cloud computing to fully address the innovations rapidly changing how IT conducts business and delivers services. As discussed below, in the context of cloud computing, consumable resources are delivered in some virtualized form. Various virtualization layers collectively construct and form the so-called fabric.

Server Virtualization

The concept of server virtualization remains: running multiple server instances on the same hardware while each instance runs transparently and in isolation, as if each instance is the only instance running and consuming the entire server hardware. In addition to virtualizing and consolidating servers, server virtualization also signifies the practice of standardizing server deployment by switching away from physical boxes to VMs. Server virtualization is for packaging, delivering, and consuming a compute pool.

There are a few important considerations in virtualizing servers. IT needs the ability to identify and manage bare metal such that the entire resource life-cycle management, from commissioning to decommissioning, can be standardized and automated. To fundamentally reduce the support and training cost while increasing productivity, a consistent platform with tools applicable across physical, virtual, on-premises, and off-premises deployments is essential. The last thing IT wants is one set of tools for physical resources and another for those virtualized, one set of tools for on-premises deployment and another for those deployed to a service provider, and one set of tools for development and another for deploying applications. The requirement is one methodology for all, one skill set for all, and one set of tools for all. This advantage is obvious when developing applications and deploying Windows Server 2012 R2 on premises or off premises to Windows Azure.
The Active Directory security model can work across sites, System Center can manage resources deployed off premises to Windows Azure, and Visual Studio can publish applications across platforms. Windows infrastructure architecture, security, and deployment models are all directly applicable.

Network Virtualization

An idea similar to server virtualization applies here. Network virtualization is the ability to run multiple networks on the same network device while each network runs transparently and in isolation, as if each network is the only network running and consuming the entire network hardware. Conceptually, since each network instance is running in isolation, one tenant’s 192.168.x network is not aware of another tenant’s identical 192.168.x network running on the same network device. Network virtualization provides the translation between physical network characteristics and the representation of a resource identity in a virtualized network. Consequently, above the network virtualization layer, various tenants, while running in isolation, can have identical network configurations.

A great example of network virtualization is Windows Azure virtual networking. At any given time, there can be multiple Windows Azure subscribers all allocating the same 192.168.x address space with an identical subnet scheme (192.168.1.x/16) for deploying VMs. The VMs belonging to one subscriber will, however, not be aware of or visible to those deployed by others, despite the fact that the network configuration, IP scheme, and IP address assignments may all be identical. Network virtualization in Windows Azure isolates one subscriber from the others such that each subscriber operates as if the subscription is the only one employing a 192.168.x address space.

Storage Virtualization

I believe this is where the next wave of drastic IT cost reduction, after server virtualization, will happen.
Historically, storage has been a high-cost item in any IT budget in every aspect, including hardware, software, staffing, maintenance, SLA, etc. Since the introduction of Windows Server 2012, there is a clear direction in which storage virtualization is built into the OS and becoming a commodity. New capabilities like Storage Pool, Hyper-V over SMB, Scale-Out File Server, etc., are now part of the Windows Server OS and are making storage virtualization part of server administration routines, easily manageable with tools and utilities like PowerShell, which is familiar to many IT professionals.

The concept of storage virtualization remains consistent with the idea of logically separating a computing object from its hardware, in this case the storage capacity. Storage virtualization is the ability to integrate multiple heterogeneous storage devices, aggregate their storage capacities, and present and manage them as one logical storage device with a continuous storage space. JBOD is a technology to realize this concept.

Standardization, Automation, and Optimization

Each of the three resource pools has an abstraction to logically present itself with characteristics and work patterns. A compute pool is a collection of physical (virtualization and infrastructure) hosts and VMs. A virtualization host hosts VMs that run workloads deployed by service owners and consumed by authorized users. A network pool encompasses network resources including physical devices, logical switches, address spaces, and site configurations. Network virtualization, as enabled and defined in configurations, can identify and translate a logical/virtual IP address into a physical one, such that tenants sharing the same network hardware can implement an identical network scheme without concern. A storage pool is based on storage virtualization, which is the concept of presenting an aggregated storage capacity as one continuous storage space as if provided from one logical storage device.
In other words, the three resource pools are wrapped with server virtualization, network virtualization, and storage virtualization, respectively. Each virtualization presents a set of methodologies from which work patterns are derived and common practices are developed. These virtualization layers provide opportunities to standardize, automate, and optimize deployments, and considerably facilitate the adoption of cloud computing.

Standardization

Virtualizing resources decouples the dependency between instances and the underlying hardware. This offers an opportunity to simplify and standardize the logical representation of a resource. For instance, a VM is defined and deployed with a VM template that provides a level of consistency with a standardized configuration.

Automation

Once VM characteristics are identified and standardized, we can generate an instance by providing only instance-specific information or information that depends on run-time, such as the VM machine name, which must be validated at run-time to prevent duplicated names. Requiring only minimal information at deployment can significantly simplify and streamline operations for automation. And with automation, resources can then be deployed, instantiated, relocated, taken off-line, brought back online, or removed rapidly and automatically based on set criteria. Standardization and automation are essential mechanisms so that a workload can be scaled on demand, i.e., become elastic.

Optimization

Standardization provides a set of common criteria. Automation executes operations based on set criteria with volume, consistency, and expediency. With standardization and automation, instances can be instantiated with consistency, efficiency, and predictability. In other words, resources can be operated in bulk with consistency and predictability. The next logical step is then to optimize the usage based on SLA.
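The relationship between standardization and automation can be pictured with a small Python sketch. It is purely illustrative: the class names VmTemplate and Deployer and their fields are hypothetical, not part of Windows Server or System Center. The template carries the standardized configuration, and a deployment supplies only the run-time information (the machine name), which is validated to prevent duplicates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmTemplate:
    """Standardized VM configuration shared by every instance."""
    cpu_count: int
    memory_gb: int
    image: str


class Deployer:
    """Creates instances from a template, given only a unique name."""

    def __init__(self):
        self._instances = {}

    def deploy(self, template: VmTemplate, name: str) -> str:
        # Run-time validation: machine names must not be duplicated.
        if name in self._instances:
            raise ValueError("duplicate VM name: " + name)
        self._instances[name] = template
        return name

    def count(self) -> int:
        return len(self._instances)


web_template = VmTemplate(cpu_count=2, memory_gb=4, image="web-server")
deployer = Deployer()
# Scaling out is a loop over names; everything else comes from the template.
for index in range(3):
    deployer.deploy(web_template, "web-{0:02d}".format(index))
print(deployer.count())  # 3
```

Because each deployment needs only a name, scaling a workload up or down becomes a matter of repeating one operation, which is what makes elasticity practical.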
The presented progression is what resource pooling and virtualization can provide and facilitate. These methodologies are now built into products and solutions. Windows Server 2012 R2 and System Center 2012 and later integrate server virtualization, network virtualization, and storage virtualization into one consistent solution platform with standardization, automation, and optimization for building and managing clouds.

Fabric

This is a significant abstraction in cloud computing. Fabric implies accessibility and discoverability, and denotes the ability to discover, identify, and manage a resource. Conceptually, fabric is an umbrella term encompassing all the underlying infrastructure supporting a cloud computing environment. At the same time, a fabric controller represents the system management solution which manages, i.e. owns, fabric.

In cloud architecture, fabric consists of the three resource pools: compute, networks, and storage. Compute provides the computing capabilities, executes code, and runs instances. Networks glue the resources together based on requirements. Storage is where VMs, configurations, data, and resources are kept. Fabric shields the physical complexities of the three resource pools presented with server virtualization, network virtualization, and storage virtualization. All operations are eventually directed by the fabric controller of a data center. Above fabric, there are logical views of consumable resources including VMs, virtual networks, and logical storage drives. By deploying VMs, configuring virtual networks, or acquiring storage, a user consumes resources. Under fabric, there are virtualization and infrastructure hosts, Active Directory, DNS, clusters, load balancers, address pools, network sites, library shares, storage arrays, topology, racks, cables, etc., all under the fabric controller’s command to collectively present and support fabric.
For a service provider, building a cloud computing environment is essentially establishing a fabric controller and constructing fabric: instituting a comprehensive management solution, building the three resource pools, and integrating server virtualization, network virtualization, and storage virtualization to form fabric. From a user’s point of view, how and where a resource is physically provided is not a concern, but the accessibility, readiness, scalability, and fulfillment of SLA are.

Cloud

This is a well-defined term, and we should not be confused by it (see NIST SP 800-145 and the 5-3-2 Principle of Cloud Computing). We need to be very clear on what a cloud must exhibit (the five essential attributes), how to consume it (with SaaS, PaaS, or IaaS), and the model a service is deployed in (such as private cloud, public cloud, and hybrid cloud). Cloud is a concept, a state, a set of capabilities such that a business can be delivered as a service, i.e. available on demand.

The architecture of a cloud computing environment is presented with three resource pools: compute, networks, and storage. Each is an abstraction provided by a virtualization layer. Server virtualization presents a compute pool with VMs that supply the computing power, i.e. CPUs, to execute code and run instances. Network virtualization offers a network pool and is the mechanism that allows multiple tenants with identical network configurations on the same virtualization host, while connecting, segmenting, and isolating network traffic with virtual NICs, logical switches, address spaces, network sites, IP pools, etc. Storage virtualization provides a logical storage device whose capacity appears continuous and aggregated, with a pool of storage devices behind the scenes.
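To make the idea of a continuous, aggregated storage space concrete, here is a minimal sketch. It assumes the simplest possible layout, in which devices are concatenated end to end; real storage virtualization products may stripe or mirror data instead. The class name `StoragePool` and its methods are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


class StoragePool:
    """Present several devices as one continuous logical address space (concatenation only)."""

    def __init__(self, capacities):
        # capacities: sizes of the underlying devices, in bytes
        self.capacities = list(capacities)
        # Logical offset at which each device ends, e.g. [100, 350, 400]
        self.ends = list(accumulate(self.capacities))

    @property
    def total_capacity(self):
        return self.ends[-1] if self.ends else 0

    def locate(self, logical_offset):
        """Translate a logical offset into (device index, offset on that device)."""
        if not 0 <= logical_offset < self.total_capacity:
            raise ValueError("offset outside the pool")
        device = bisect_right(self.ends, logical_offset)
        start = self.ends[device - 1] if device else 0
        return device, logical_offset - start


if __name__ == "__main__":
    pool = StoragePool([100, 250, 50])  # three heterogeneous devices
    print(pool.total_capacity)          # one logical capacity
    print(pool.locate(100))             # first byte of the second device
```

The consumer only ever sees `total_capacity` and logical offsets; which physical device serves a given offset stays hidden behind `locate`.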
The three resource pools together constitute the fabric (of a cloud), while the three virtualization layers collectively form the abstraction, so that even though the underlying physical infrastructure may be intricate, the user experience above fabric remains logical and consistent. Deploying a VM, configuring a virtual network, or acquiring storage is transparent with virtualization, regardless of where the VM actually resides, how the virtual network is physically wired, or which devices in the aggregate provide the requested storage.

Closing Thoughts

Cloud is a very consumer-focused approach. It is about a customer’s ability and control, based on SLA, to get resources when needed and at scale and, equally important, to release resources when they are no longer required. It is not about products and technologies. It is about servicing, consuming, and strengthening the bottom line.

August 12, 2013

by Yung Chou


AWS: Attaching an EBS volume on an EC2 instance and making it available for use

I recently wanted to attach an EBS volume to an existing EC2 instance that I had running, and since it was for a one-off task (famous last words) I decided to configure it manually. I created the EBS volume through the AWS console, and one thing that initially caught me out is that the EC2 instance and EBS volume need to be in the same region and zone. So if I create my EC2 instance in ‘eu-west-1b’ then I need to create my EBS volume in ‘eu-west-1b’ as well, otherwise I won’t be able to attach it to that instance.

I attached the device as /dev/sdf, although the UI gives the following warning:

    Linux Devices: /dev/sdf through /dev/sdp
    Note: Newer linux kernels may rename your devices to /dev/xvdf through /dev/xvdp internally, even when the device name entered here (and shown in the details) is /dev/sdf through /dev/sdp.

After attaching the EBS volume to the EC2 instance my next step was to SSH onto my EC2 instance and make the EBS volume available. The first step is to create a file system on the volume:

    $ sudo mkfs -t ext3 /dev/sdf
    mke2fs 1.42 (29-Nov-2011)
    Could not stat /dev/sdf --- No such file or directory

    The device apparently does not exist; did you specify it correctly?

It turns out that warning was handy and the device has in fact been renamed.
We can confirm this by calling fdisk:

    $ sudo fdisk -l

    Disk /dev/xvda1: 8589 MB, 8589934592 bytes
    255 heads, 63 sectors/track, 1044 cylinders, total 16777216 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x00000000

    Disk /dev/xvda1 doesn't contain a valid partition table

    Disk /dev/xvdf: 53.7 GB, 53687091200 bytes
    255 heads, 63 sectors/track, 6527 cylinders, total 104857600 sectors
    Units = sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disk identifier: 0x00000000

    Disk /dev/xvdf doesn't contain a valid partition table

/dev/xvdf is the one we’re interested in, so I re-ran the previous command against it:

    $ sudo mkfs -t ext3 /dev/xvdf
    mke2fs 1.42 (29-Nov-2011)
    Filesystem label=
    OS type: Linux
    Block size=4096 (log=2)
    Fragment size=4096 (log=2)
    Stride=0 blocks, Stripe width=0 blocks
    3276800 inodes, 13107200 blocks
    655360 blocks (5.00%) reserved for the super user
    First data block=0
    Maximum filesystem blocks=4294967296
    400 block groups
    32768 blocks per group, 32768 fragments per group
    8192 inodes per group
    Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424

    Allocating group tables: done
    Writing inode tables: done
    Creating journal (32768 blocks): done
    Writing superblocks and filesystem accounting information: done

Once I’d done that I needed to create a mount point for the volume, and I thought the best place was probably a directory under /mnt:

    $ sudo mkdir /mnt/ebs

The final step is to mount the volume:

    $ sudo mount /dev/xvdf /mnt/ebs

And if we run df we can see that it’s ready to go:

    $ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/xvda1      7.9G  883M  6.7G  12% /
    udev            288M  8.0K  288M   1% /dev
    tmpfs           119M  164K  118M   1% /run
    none            5.0M     0  5.0M   0% /run/lock
    none            296M     0  296M   0% /run/shm
    /dev/xvdf        50G  180M   47G   1% /mnt/ebs
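The renaming that tripped up the first mkfs call follows the pattern given in the console warning (/dev/sdf through /dev/sdp may become /dev/xvdf through /dev/xvdp). A small helper can encode that rule so a script checks both names. This is a sketch only: the function names are hypothetical, and whether a given kernel renames the device at all is something to verify on the instance, for example with fdisk as above.

```python
import re


def xen_device_name(attached_as: str) -> str:
    """Map a console device name such as /dev/sdf to its /dev/xvdf form."""
    match = re.fullmatch(r"/dev/sd([f-p])", attached_as)
    if match is None:
        raise ValueError(f"expected /dev/sdf through /dev/sdp, got {attached_as!r}")
    return f"/dev/xvd{match.group(1)}"


def candidate_devices(attached_as: str) -> list:
    """Names to look for on the instance: the name as entered, then the renamed form."""
    return [attached_as, xen_device_name(attached_as)]


if __name__ == "__main__":
    print(candidate_devices("/dev/sdf"))
```

On the instance, a script could test each candidate with `os.path.exists` and run mkfs and mount against the first one that is present.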

July 31, 2013

by Mark Needham


Integration of Amazon Redshift Data Warehouse with Talend Data Integration

In this blog post, I will show you how to "ETL" all kinds of data to Amazon’s cloud data warehouse Redshift with Talend’s big data components. Let’s begin with a short introduction to Amazon Redshift (copied from the website):

"Amazon Redshift is [part of Amazon Web Services (AWS) and] a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. With a few clicks in the AWS Management Console, customers can launch a Redshift cluster, starting with a few hundred gigabytes and scaling to a petabyte or more, for under $1,000 per terabyte per year. Traditional data warehouses require significant time and resource to administer, especially for large datasets. In addition, the financial cost associated with building, maintaining, and growing self-managed, on-premise data warehouses is very high. Amazon Redshift not only significantly lowers the cost of a data warehouse, but also makes it easy to analyze large amounts of data very quickly."

Sounds interesting! And indeed, we already see companies using Talend’s Redshift connectors. From Talend’s perspective it is not much more than just another database. If you have ever used a Talend connector, you can integrate with Redshift within minutes. In the next sections, I will describe all necessary steps and give some hints regarding configuration issues and performance improvements.

Be aware: You need Talend Open Studio for Data Integration (Apache License, open source) or any Talend Enterprise Edition / Platform which contains the Cloud components to see and use the Amazon Redshift connectors. The open source edition offers all connectors and functionality to integrate with Amazon Redshift. However, Enterprise versions offer some more features (e.g. versioning), comfort (e.g. wizards) and commercial support.

Setup Amazon Redshift

Setup of Amazon Redshift is very easy. Just follow Amazon’s getting started guide: http://docs.aws.amazon.com/redshift/latest/gsg/welcome.html.
Like every other AWS guide, it is very easy to understand and use. Be aware that you only have to do steps 1, 2 and 3 of the getting started guide to use it with Talend. Some hints:

- Step 1 ("before you begin"): Just sign up. Client tools and drivers are not necessary because they are already installed within Talend Studio.
- Step 2 ("launch a cluster"): Yes, please start your cluster!
- Step 3 ("authorize access"): If you are not sure what to do here, select Connection Type = CIDR/IP. Find out your IP address (http://whatismyipaddress.com) and enter it with "/32" at the end. Example: "192.168.1.1/32"

Now you can connect to Amazon Redshift from your Talend Studio on your local computer. Step 4 (connect) and step 5 (create table, data, queries) are not necessary; this will be done from Talend Studio. Of course, you should not forget to delete your cluster (step 7) when you are done. Otherwise, you will pay for every hour, even if you do not access your DWH.

Connect to Amazon Redshift from Talend Studio

Create a new connection to the Amazon Redshift database as you do with every other relational database. The easiest way is to use the "DB Connection Wizard" in metadata. Just enter your connection information and check that it works. You get all the configuration information from the Amazon Web Console. The connection string looks something like this: "jdbc:paraccel://talend-demo-cluster.cp8t6c5.eu-west-1.redshift.amazonaws.com:5439/dev"

Next, right click on the created connection and select "retrieve schema". "public" is the default schema which you (have to) use. Now you are ready to use this connection within Talend Jobs to write to Amazon Redshift and read from it.

Create Talend Jobs (Write, Read, Delete)

Amazon Redshift components work like any other Talend (relational) database components. Look at www.help.talend.com for more information if you have not used them before (or just try them out, they are very self-explanatory).
You just have to drag&drop your connection from metadata. Afterwards, you can easily write data (tRedshiftOutput), read data (tRedshiftInput), or run any other queries such as delete or copy (tRedshiftRow). In the following job, I start by deleting all content in the Amazon Redshift table. Then I read data from a MySQL table and insert it into an Amazon Redshift table. The table is created automatically (as I have configured it this way). After this subjob is finished, I read the data again and store it in a CSV file (which is also created automatically). Of course, this is no business use case, but it shows how to use the different Amazon Redshift components.

Query Data from Amazon Redshift

You can connect to Amazon Redshift directly from Talend Studio to explore and query the data in the DWH, so no other database tool is required. Just right click on your Amazon Redshift connection in metadata and select "edit queries". Here you can define, execute and save SQL queries.

Improve Performance

Write performance of Amazon Redshift is relatively low compared to "classical" relational databases (in your data center), as you have to upload all data into the cloud. Different alternatives exist to improve performance:

- Bulk inserts: "Extended insert" (in advanced settings) improves performance a lot, but still not to hyperspeed. Also, as it is bulk, you can only do inserts! It is not compatible with "rejects" or "updates".
- AWS S3 and COPY command: S3 is Amazon’s "simple storage service", a key-value store (also called NoSQL today) for storing very large objects. You can use Amazon Redshift’s COPY command (http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html) to transfer data from S3 to Amazon Redshift with good performance. However, you still have to copy the data to S3 first, so the same "cloud problem" applies here. The COPY command can be used with tRedshiftRow, so there is no problem at all from the Talend perspective.
To transfer data to S3, you can either use the Talend S3 components from Talendforge, Talend’s open source community (http://www.talendforge.org/exchange), or use camel-s3, an Apache Camel component which is included in Talend ESB. The latter is an option if you use Talend Data Services, which combines Talend DI and Talend ESB in its unified platform.

Summary

You need not be a cloud or DWH expert, or an expert developer, to integrate with Amazon’s cloud data warehouse Redshift. It is very easy with Talend’s integration solutions. Just drag&drop, configure, do some graphical mappings / transformations (if necessary), and that’s it. Code is generated. Job runs. You can integrate Amazon Redshift almost as simply as any other relational database. Just be aware of some cloud-specific security and performance issues. With Talend, you can easily "ETL" all data from different sources to Redshift and store it there for under $1,000 per terabyte per year, even with the open source version!

Best regards, Kai Wähner (Contact and feedback via @KaiWaehner, www.kai-waehner.de, LinkedIn / Xing)

This is content from my blog: http://www.kai-waehner.de/blog/2013/06/26/integration-of-amazon-redshift-cloud-data-warehouse-aws-saas-dwh-with-talend-data-integration-di-big-data-bd-enterprise-service-bus-esb/
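As a footnote to the COPY route described under Improve Performance: the statement that would be pasted into a tRedshiftRow component can be assembled from a handful of values. The sketch below only builds the SQL text. The function name, table, bucket and key values are placeholders invented for illustration, and the CREDENTIALS and DELIMITER clauses should be checked against the COPY documentation linked above before use.

```python
def build_copy_statement(table, bucket, key_prefix,
                         access_key_id, secret_access_key, delimiter=","):
    """Return a Redshift COPY statement that loads delimited files from S3."""
    credentials = (
        f"aws_access_key_id={access_key_id};"
        f"aws_secret_access_key={secret_access_key}"
    )
    # Plain string formatting: values are not escaped, so use trusted input only.
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{key_prefix}' "
        f"CREDENTIALS '{credentials}' "
        f"DELIMITER '{delimiter}';"
    )


if __name__ == "__main__":
    print(build_copy_statement(
        table="customers",                # placeholder table name
        bucket="example-bucket",          # placeholder bucket
        key_prefix="exports/customers",   # placeholder key prefix
        access_key_id="<access-key-id>",
        secret_access_key="<secret-access-key>",
    ))
```

Because the data still has to reach S3 first, this helps only with the S3-to-Redshift leg of the transfer.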

June 27, 2013

by Kai Wähner



Create a Couchbase Cluster with Ansible

[This blog was syndicated from http://blog.grallandco.com]

Introduction

When I was looking for a more effective way to create my cluster, I asked some sysadmins which tools I should use to do it. The answer I got during OSDC was not Puppet, nor Chef, but Ansible. This article shows you how you can easily configure and create a Couchbase cluster deployed on many Linux boxes... and the only thing you need on these boxes is an SSH server! Thanks to Jan-Piet Mens, who was one of the people who convinced me to use Ansible and answered the questions I had about it. You can watch the demonstration below, and/or look at all the details in the next paragraphs.

Ansible

Ansible is open-source software that allows administrators to configure and manage many computers over SSH. I won't go into all the details of the installation; just follow the steps documented in the Getting Started Guide. As you can see from this guide, you just need Python and a few other libraries, and to clone the Ansible project from GitHub. So I am expecting that you have Ansible working with the various servers on which you want to deploy Couchbase. Also, for these first scripts I am using root on my servers to do all the operations. So be sure you have registered the root SSH keys on your administration server, from where you are running the Ansible scripts.

Create a Couchbase Cluster

Before going into the details of the Ansible script, it is useful to explain how you create a Couchbase cluster. Here are the 5 steps to create and configure a cluster:

1. Install Couchbase on each node of the cluster, as documented here.
2. Take one of the nodes and "initialize" the cluster, using the cluster-init command.
3. Add the other nodes to the cluster, using the server-add command.
4. Rebalance, using the rebalance command.
5. Create a bucket, using the bucket-create command.

So the goal now is to create an Ansible playbook that executes these steps for you.
Ansible Playbook for Couchbase

The first thing you need is the list of hosts you want to target, so I have created a hosts file that contains all my servers organized in 2 groups:

    [couchbase-main]
    vm1.grallandco.com

    [couchbase-nodes]
    vm2.grallandco.com
    vm3.grallandco.com

The [couchbase-main] group is just one of the nodes, the one that will drive the installation and configuration. As you probably already know, Couchbase does not have any master: all nodes in the cluster are identical.

To ease the configuration of the cluster, I have created another file that contains all the parameters that must be sent to the various commands. This file is located in group_vars/all; see the section Splitting Out Host and Group Specific Data in the documentation.

    # Administrator user and password
    admin_user: Administrator
    admin_password: password

    # ram quota for the cluster
    cluster_ram_quota: 1024

    # bucket and replicas
    bucket_name: ansible
    bucket_ram_quota: 512
    num_replicas: 2

Use this file to configure your cluster. Let's describe the playbook file:

    - name: Couchbase Installation
      hosts: all
      user: root

      tasks:

      - name: download Couchbase package
        get_url: url=http://packages.couchbase.com/releases/2.0.1/couchbase-server-enterprise_x86_64_2.0.1.deb dest=~/.

      - name: Install dependencies
        apt: pkg=libssl0.9.8 state=present

      - name: Install Couchbase .deb file on all machines
        shell: dpkg -i ~/couchbase-server-enterprise_x86_64_2.0.1.deb

As expected, the installation has to be done on all servers as root. We then need to execute 3 tasks:

1. Download the product; the get_url command will only download the file if it is not already present.
2. Install the dependencies with the apt command; state=present tells the system to install this package only if it is not already present.
3. Install Couchbase with a simple shell command (here I am not checking whether Couchbase is already installed).

So we have now installed Couchbase on all the nodes.
Let's now configure the first node and add the others:

    - name: Initialize the cluster and add the nodes to the cluster
      hosts: couchbase-main
      user: root

      tasks:

      - name: Configure main node
        shell: /opt/couchbase/bin/couchbase-cli cluster-init -c 127.0.0.1:8091 --cluster-init-username=${admin_user} --cluster-init-password=${admin_password} --cluster-init-port=8091 --cluster-init-ramsize=${cluster_ram_quota}

      - name: Create shell script for configuring main node
        action: template src=couchbase-add-node.j2 dest=/tmp/addnodes.sh mode=750

      - name: Launch config script
        action: shell /tmp/addnodes.sh

      - name: Rebalance the cluster
        shell: /opt/couchbase/bin/couchbase-cli rebalance -c 127.0.0.1:8091 -u ${admin_user} -p ${admin_password}

      - name: create bucket ${bucket_name} with ${num_replicas} replicas
        shell: /opt/couchbase/bin/couchbase-cli bucket-create -c 127.0.0.1:8091 --bucket=${bucket_name} --bucket-type=couchbase --bucket-port=11211 --bucket-ramsize=${bucket_ram_quota} --bucket-replica=${num_replicas} -u ${admin_user} -p ${admin_password}

Now we need to execute specific tasks on the "main" server:

- Initialization of the cluster using the Couchbase CLI, in the "Configure main node" task.
- Then the system needs to ask all the other servers to join the cluster. For this the system needs to get the various IP addresses and, for each one, execute the server-add command with that address. As far as I know it is not possible to get the IP address from the main playbook YAML file, so I ask the system to generate a shell script that adds each node, and then execute the script. This is done by the "Create shell script for configuring main node" and "Launch config script" tasks.

To generate the shell script, I use an Ansible template; the template is available in the couchbase-add-node.j2 file.
    {% for host in groups['couchbase-nodes'] %}
    /opt/couchbase/bin/couchbase-cli server-add -c 127.0.0.1:8091 -u ${admin_user} -p ${admin_password} --server-add={{ hostvars[host]['ansible_eth0']['ipv4']['address'] }}:8091 --server-add-username=${admin_user} --server-add-password=${admin_password}
    {% endfor %}

As you can see, this script loops over each server in the [couchbase-nodes] group and uses its IP address to add the node to the cluster. Finally the playbook rebalances the cluster (the "Rebalance the cluster" task) and adds a new bucket (the "create bucket" task).

You are now ready to execute the playbook using the following command:

    ./bin/ansible-playbook -i ./couchbase/hosts ./couchbase/couchbase.yml -vv

I am adding the -vv parameter to allow you to see more information about what's happening during the execution of the script. This will execute all the commands described in the playbook, and after a few seconds you will have a new cluster ready to be used! You can, for example, open a browser, go to the Couchbase Administration Console, and check that your cluster is configured as expected.

As you can see, it is really easy and fast to create a new cluster using Ansible. I have also created a script to properly uninstall the cluster. Just launch:

    ./bin/ansible-playbook -i ./couchbase/hosts ./couchbase/couchbase-uninstall.yml
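To see what the couchbase-add-node.j2 template expands to, the same loop can be written in plain Python. This is an illustration only, not part of the playbook: the function name and the IP addresses are made up, while the couchbase-cli flags are the ones used in the template above.

```python
def build_server_add_script(node_addresses, admin_user, admin_password, port=8091):
    """Return one couchbase-cli server-add command per node, mirroring the template loop."""
    lines = []
    for address in node_addresses:
        lines.append(
            "/opt/couchbase/bin/couchbase-cli server-add"
            f" -c 127.0.0.1:{port}"
            f" -u {admin_user} -p {admin_password}"
            f" --server-add={address}:{port}"
            f" --server-add-username={admin_user}"
            f" --server-add-password={admin_password}"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example addresses standing in for the eth0 addresses Ansible would gather.
    print(build_server_add_script(["10.0.0.2", "10.0.0.3"], "Administrator", "password"))
```

Each node in the [couchbase-nodes] group produces exactly one line, so the generated script grows with the inventory and the playbook itself never needs to change.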

June 3, 2013

by Don Pinto


Avro's Built-In Sorting

Avro has a little-known gem of a feature which allows you to control which fields in an Avro record are used for partitioning, sorting and grouping in MapReduce. The following figure gives a quick refresher as to what these terms mean. Oh, and don’t take the placement of the “sorting” literally: sorting actually occurs on both the map and reduce side, but it’s always performed in the context of a specific partition (i.e. for a specific reducer).

By default all the fields in an Avro map output key are used for partitioning, sorting and grouping in MapReduce. Let’s walk through an example and see how this works. You’ll begin with a simple schema (GitHub source):

    {"type": "record", "name": "com.alexholmes.avro.WeatherNoIgnore",
     "doc": "a weather reading.",
     "fields": [
       {"name": "station", "type": "string"},
       {"name": "time", "type": "long"},
       {"name": "temp", "type": "int"},
       {"name": "counter", "type": "int", "default": 0}
     ]
    }

We’re going to see what happens when we run this code against a small sample data set, which we’ll generate using Avro code (GitHub source):

    File input = tmpFolder.newFile("input.txt");
    AvroFiles.createFile(input, WeatherNoIgnore.SCHEMA$, Arrays.asList(
        WeatherNoIgnore.newBuilder().setStation("sfo").setTime(1).setTemp(3).build(),
        WeatherNoIgnore.newBuilder().setStation("iad").setTime(1).setTemp(1).build(),
        WeatherNoIgnore.newBuilder().setStation("sfo").setTime(2).setTemp(1).build(),
        WeatherNoIgnore.newBuilder().setStation("sfo").setTime(1).setTemp(2).build(),
        WeatherNoIgnore.newBuilder().setStation("sfo").setTime(1).setTemp(1).build()
    ).toArray());

To understand how Avro is partitioning, sorting and grouping the data, we’ll write an identity mapper and reducer, with a small enhancement to the reducer to increment the counter field for each record we see in an individual reducer instance (GitHub source):

    package com.alexholmes.avro.sort.basic;

    import com.alexholmes.avro.WeatherNoIgnore;
    import org.apache.avro.mapred.AvroKey;
    import org.apache.avro.mapred.AvroValue;
    import org.apache.avro.mapreduce.AvroJob;
    import org.apache.avro.mapreduce.AvroKeyInputFormat;
    import org.apache.avro.mapreduce.AvroKeyOutputFormat;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import java.io.IOException;

    public class AvroSort {

      // Identity mapper: emits the record as both the key and the value.
      private static class SortMapper
          extends Mapper<AvroKey<WeatherNoIgnore>, NullWritable,
                         AvroKey<WeatherNoIgnore>, AvroValue<WeatherNoIgnore>> {
        @Override
        protected void map(AvroKey<WeatherNoIgnore> key, NullWritable value, Context context)
            throws IOException, InterruptedException {
          context.write(key, new AvroValue<WeatherNoIgnore>(key.datum()));
        }
      }

      // Identity reducer that numbers the records seen in each reduce() call.
      private static class SortReducer
          extends Reducer<AvroKey<WeatherNoIgnore>, AvroValue<WeatherNoIgnore>,
                          AvroKey<WeatherNoIgnore>, NullWritable> {
        @Override
        protected void reduce(AvroKey<WeatherNoIgnore> key,
                              Iterable<AvroValue<WeatherNoIgnore>> values,
                              Context context) throws IOException, InterruptedException {
          int counter = 1;
          for (AvroValue<WeatherNoIgnore> weatherNoIgnore : values) {
            weatherNoIgnore.datum().setCounter(counter++);
            context.write(new AvroKey<WeatherNoIgnore>(weatherNoIgnore.datum()), NullWritable.get());
          }
        }
      }

      public boolean runMapReduce(final Job job, Path inputPath, Path outputPath) throws Exception {
        FileInputFormat.setInputPaths(job, inputPath);
        job.setInputFormatClass(AvroKeyInputFormat.class);
        AvroJob.setInputKeySchema(job, WeatherNoIgnore.SCHEMA$);

        job.setMapperClass(SortMapper.class);
        AvroJob.setMapOutputKeySchema(job, WeatherNoIgnore.SCHEMA$);
        AvroJob.setMapOutputValueSchema(job, WeatherNoIgnore.SCHEMA$);

        job.setReducerClass(SortReducer.class);
        AvroJob.setOutputKeySchema(job, WeatherNoIgnore.SCHEMA$);

        job.setOutputFormatClass(AvroKeyOutputFormat.class);
        FileOutputFormat.setOutputPath(job, outputPath);

        return job.waitForCompletion(true);
      }
    }

If you look at the output of the job below, you’ll see that the output is sorted across all the fields, and that the sorting is in field ordinal order. What this means is that when MapReduce is sorting these records, it compares the station field first, then the time field second, and so on according to the ordering of the fields in the Avro schema. This is pretty much what you’d expect if you wrote your own complex Writable type, and your comparator compared all the fields in order.

    {"station": "iad", "time": 1, "temp": 1, "counter": 1}
    {"station": "sfo", "time": 1, "temp": 1, "counter": 1}
    {"station": "sfo", "time": 1, "temp": 2, "counter": 1}
    {"station": "sfo", "time": 1, "temp": 3, "counter": 1}
    {"station": "sfo", "time": 2, "temp": 1, "counter": 1}

Oh, and before we move on, notice that the value for the counter field is always 1, meaning that each reducer invocation was only fed a single key/value pair. That makes sense, since our identity mapper only emitted a single value for each key, the keys are unique, and the MapReduce partitioner, sorter and grouper were using all the fields in the record.

Excluding fields for sorting

Avro gives us the ability to indicate that specific fields should be ignored when performing ordering functions. In MapReduce these fields are ignored for sorting, partitioning and grouping, which basically means that we have the ability to perform secondary sorting. Let’s examine the following schema (GitHub source):

    {"type": "record", "name": "com.alexholmes.avro.Weather",
     "doc": "a weather reading.",
     "fields": [
       {"name": "station", "type": "string"},
       {"name": "time", "type": "long"},
       {"name": "temp", "type": "int", "order": "ignore"},
       {"name": "counter", "type": "int", "order": "ignore", "default": 0}
     ]
    }

It’s pretty much identical to the first schema, the only difference being that the last two fields are flagged as being “ignored” for sorting/partitioning/grouping. Let’s run the same MapReduce code as above (GitHub source; unchanged other than being modified to work with the different schema) against this new schema and examine the outputs.
    {"station": "iad", "time": 1, "temp": 1, "counter": 1}
    {"station": "sfo", "time": 1, "temp": 3, "counter": 1}
    {"station": "sfo", "time": 1, "temp": 2, "counter": 2}
    {"station": "sfo", "time": 1, "temp": 1, "counter": 3}
    {"station": "sfo", "time": 2, "temp": 1, "counter": 1}

There are a couple of notable differences between this output and the output from the previous schema, which didn’t have any ignored fields. First, it’s clear that the temp field isn’t being used in the sorting, which makes sense since we specified that it should be ignored in the schema. More interestingly, note the value of the counter field. All records that had identical station and time values went to the same reducer invocation, evidenced by the increasing value of counter. This is essentially secondary sort!

Now, all of this greatness isn’t without some limitations:

- You can’t support two MapReduce jobs that use the same Avro key but have different sorting/partitioning/grouping requirements, although it’s conceivable that you could create a new instance of the Avro schema and set the ignored flags for these fields yourself.
- The partitioner, sorter and grouping functions in MapReduce all work off of the same fields (i.e. they all ignore fields that are set as ignored in the schema). This means that your options for secondary sorting are limited. For example, you wouldn’t be able to partition all stations to the same reducer, and then group by station and time.
- Ordering uses a field’s ordinal position to determine its order within the overall set of fields to be ordered. In other words, in a two-field record, the first field is always compared before the second. There’s no way to change this behavior other than flipping the order of the fields in the record.

Having said all of that, the “ignoring fields” feature for sorting is pretty awesome, and something that will no doubt come in handy in my future MapReduce work.
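The two outputs can be reproduced outside Hadoop with a few lines of plain Python. This is a simulation of the sort-then-group behaviour only, not Avro or MapReduce code: the function `sort_and_group` is hypothetical, and it relies on Python's stable sort to keep equal keys in input order, which a real MapReduce job does not promise.

```python
from itertools import groupby

# The five sample records from the article, in input order.
RECORDS = [
    {"station": "sfo", "time": 1, "temp": 3},
    {"station": "iad", "time": 1, "temp": 1},
    {"station": "sfo", "time": 2, "temp": 1},
    {"station": "sfo", "time": 1, "temp": 2},
    {"station": "sfo", "time": 1, "temp": 1},
]


def sort_and_group(records, key_fields):
    """Sort by key_fields in ordinal order, then number the records within each group."""
    def key(record):
        return tuple(record[field] for field in key_fields)

    output = []
    # sorted() is stable, so records with equal keys keep their input order.
    for _, group in groupby(sorted(records, key=key), key=key):
        # One group corresponds to one reduce() call; counter restarts at 1.
        for counter, record in enumerate(group, start=1):
            output.append({**record, "counter": counter})
    return output


if __name__ == "__main__":
    # No ignored fields: every non-counter field takes part in the comparison.
    for row in sort_and_group(RECORDS, ("station", "time", "temp")):
        print(row)
    # temp ignored: only station and time are compared.
    for row in sort_and_group(RECORDS, ("station", "time")):
        print(row)
```

Dropping `temp` from `key_fields` plays the role of `"order": "ignore"`: the three sfo records with time 1 fall into one group and receive counters 1, 2 and 3.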

May 29, 2013

by Alex Holmes


Amazon S3 Parallel MultiPart File Upload

In this blog post, I will present a simple tutorial on uploading a large file to Amazon S3 as fast as the network supports.

Amazon S3 is Amazon's clustered storage service. It is designed to make web-scale computing easier. Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, secure, fast, inexpensive infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers.

For using Amazon services, you'll need your AWS access key identifiers, which AWS assigned you when you created your AWS account. The following are the AWS access key identifiers:

- Access Key ID (a 20-character, alphanumeric sequence). For example: 022QF06E7MXBSH9DHM02
- Secret Access Key (a 40-character sequence). For example: kWcrlUX5JEDGM/LtmEENI/aVmYvHNif5zB+d9+ct

Caution: Your Secret Access Key is a secret, which only you and AWS should know. It is important to keep it confidential to protect your account. Store it securely in a safe place. Never include it in your requests to AWS, and never e-mail it to anyone. Do not share it outside your organization, even if an inquiry appears to come from AWS or Amazon.com. No one who legitimately represents Amazon will ever ask you for your Secret Access Key.

The Access Key ID is associated with your AWS account. You include it in AWS service requests to identify yourself as the sender of the request. The Access Key ID is not a secret, and anyone could use your Access Key ID in requests to AWS. To provide proof that you truly are the sender of the request, you also include a digital signature calculated using your Secret Access Key. The sample code handles this for you.

Your Access Key ID and Secret Access Key are displayed to you when you create your AWS account. They are not e-mailed to you.
If you need to see them again, you can view them at any time from your AWS account.

To get your AWS access key identifiers:

1. Go to the Amazon Web Services web site at http://aws.amazon.com.
2. Point to Your Account and click Security Credentials.
3. Log in to your AWS account. The Security Credentials page is displayed.
4. Your Access Key ID is displayed in the Access Identifiers section of the page.
5. To display your Secret Access Key, click Show in the Secret Access Key column.

You can load your Amazon keys from a properties file in your application. Here is a sample properties file containing Amazon keys:

    # Fill in your AWS Access Key ID and Secret Access Key
    # http://aws.amazon.com/security-credentials
    accessKey =
    secretKey =

Here is a sample AmazonUtil class for getting AWS credentials from the properties file:

    import java.io.File;
    import java.io.IOException;

    import com.amazonaws.auth.AWSCredentials;
    import com.amazonaws.auth.PropertiesCredentials;

    // Logger, LogUtil, ConfigUtil and IOUtil are the application's own helpers (not shown).
    public class AmazonUtil {

        private static final Logger logger = LogUtil.getLogger();

        private static final String AWS_CREDENTIALS_CONFIG_FILE_PATH =
            ConfigUtil.CONFIG_DIRECTORY_PATH + File.separator + "aws-credentials.properties";

        private static AWSCredentials awsCredentials;

        static {
            init();
        }

        private AmazonUtil() {
        }

        private static void init() {
            try {
                awsCredentials =
                    new PropertiesCredentials(IOUtil.getResourceAsStream(AWS_CREDENTIALS_CONFIG_FILE_PATH));
            }
            catch (IOException e) {
                // Pass the exception along so the cause of the failure is not lost.
                logger.error("Unable to initialize AWS Credentials from " + AWS_CREDENTIALS_CONFIG_FILE_PATH, e);
            }
        }

        public static AWSCredentials getAwsCredentials() {
            return awsCredentials;
        }
    }

Amazon S3 has a Multipart Upload service, which allows faster, more flexible uploads into Amazon S3. Multipart Upload allows you to upload a single object as a set of parts. After all parts of your object are uploaded, Amazon S3 then presents the data as a single object. With this feature you can create parallel uploads, pause and resume an object upload, and begin uploads before you know the total object size.
For more information on Multipart Upload, review the Amazon S3 Developer Guide.

In this tutorial, my sample application uploads each part of the file to Amazon S3 on a separate thread, to use as much of the available network throughput as possible. Each file part is associated with a thread, and each thread uploads its part through the Amazon S3 API.

Figure 1. Amazon S3 Parallel Multi-Part File Upload Mechanism

The Amazon S3 API supports multipart file upload in this way:

1. Send an InitiateMultipartUploadRequest to Amazon.
2. Get a response containing a unique id for this upload operation.
3. For i in ${partCount}
3.1. Calculate the size and offset of split-i in the whole file.
3.2. Build an UploadPartRequest with the file offset, the size of the current split and the unique upload id.
3.3. Give this request to a thread and start the upload by running the thread.
3.3.1. Send the associated UploadPartRequest to Amazon.
3.3.2. Get the response after a successful upload and save the ETag property of the response.
4. Wait for all threads to terminate.
5. Get the ETags of all terminated threads (an ETag is an identifier for a successfully completed part upload).
6. Send a CompleteMultipartUploadRequest to Amazon with the unique upload id and all ETags, so that Amazon joins all file parts into the target object.

Here is the implementation:

    public class AmazonS3Util {

        private static final Logger logger = LogUtil.getLogger();

        public static final long DEFAULT_FILE_PART_SIZE = 5 * 1024 * 1024; // 5MB
        public static long FILE_PART_SIZE = DEFAULT_FILE_PART_SIZE;

        private static AmazonS3 s3Client;
        private static TransferManager transferManager;

        static {
            init();
        }

        private AmazonS3Util() {
        }

        private static void init() {
            // ...
            s3Client = new AmazonS3Client(AmazonUtil.getAwsCredentials());
            transferManager = new TransferManager(AmazonUtil.getAwsCredentials());
        }

        // ...

        public static void putObjectAsMultiPart(String bucketName, File file) {
            putObjectAsMultiPart(bucketName, file, FILE_PART_SIZE);
        }

        public static void putObjectAsMultiPart(String bucketName, File file, long partSize) {
            List<PartETag> partETags = new ArrayList<PartETag>();
            List<MultiPartFileUploader> uploaders = new ArrayList<MultiPartFileUploader>();

            // Step 1: Initialize.
            InitiateMultipartUploadRequest initRequest =
                new InitiateMultipartUploadRequest(bucketName, file.getName());
            InitiateMultipartUploadResult initResponse = s3Client.initiateMultipartUpload(initRequest);
            long contentLength = file.length();

            try {
                // Step 2: Upload parts.
                long filePosition = 0;
                for (int i = 1; filePosition < contentLength; i++) {
                    // Last part can be less than part size. Adjust part size.
                    partSize = Math.min(partSize, (contentLength - filePosition));

                    // Create request to upload a part.
                    UploadPartRequest uploadRequest =
                        new UploadPartRequest().
                            withBucketName(bucketName).withKey(file.getName()).
                            withUploadId(initResponse.getUploadId()).withPartNumber(i).
                            withFileOffset(filePosition).
                            withFile(file).
                            withPartSize(partSize);

                    uploadRequest.setProgressListener(new UploadProgressListener(file, i, partSize));

                    // Start a thread for this part and remember it so we can collect its ETag later.
                    MultiPartFileUploader uploader = new MultiPartFileUploader(uploadRequest);
                    uploaders.add(uploader);
                    uploader.upload();

                    filePosition += partSize;
                }

                // Wait for every part upload to finish and collect the ETags in part order.
                for (MultiPartFileUploader uploader : uploaders) {
                    uploader.join();
                    partETags.add(uploader.getPartETag());
                }

                // Step 3: complete.
                CompleteMultipartUploadRequest compRequest =
                    new CompleteMultipartUploadRequest(
                        bucketName, file.getName(), initResponse.getUploadId(), partETags);
                s3Client.completeMultipartUpload(compRequest);
            }
            catch (Throwable t) {
                logger.error("Unable to put object as multipart to Amazon S3 for file " + file.getName(), t);
                s3Client.abortMultipartUpload(
                    new AbortMultipartUploadRequest(
                        bucketName, file.getName(), initResponse.getUploadId()));
            }
        }

        // ...

        private static class UploadProgressListener implements ProgressListener {

            File file;
            int partNo;
            long partLength;

            UploadProgressListener(File file) {
                this.file = file;
            }

            @SuppressWarnings("unused")
            UploadProgressListener(File file, int partNo) {
                this(file, partNo, 0);
            }

            UploadProgressListener(File file, int partNo, long partLength) {
                this.file = file;
                this.partNo = partNo;
                this.partLength = partLength;
            }

            @Override
            public void progressChanged(ProgressEvent progressEvent) {
                switch (progressEvent.getEventCode()) {
                    case ProgressEvent.STARTED_EVENT_CODE:
                        logger.info("Upload started for file " + "\"" + file.getName() + "\"");
                        break;
                    case ProgressEvent.COMPLETED_EVENT_CODE:
                        logger.info("Upload completed for file " + "\"" + file.getName() + "\"" +
                            ", " + file.length() + " bytes data has been transferred");
                        break;
                    case ProgressEvent.FAILED_EVENT_CODE:
                        logger.info("Upload failed for file " + "\"" + file.getName() + "\"" +
                            ", " + progressEvent.getBytesTransfered() + " bytes data has been transferred");
                        break;
                    case ProgressEvent.CANCELED_EVENT_CODE:
                        logger.info("Upload cancelled for file " + "\"" + file.getName() + "\"" +
                            ", " + progressEvent.getBytesTransfered() + " bytes data has been transferred");
                        break;
                    case ProgressEvent.PART_STARTED_EVENT_CODE:
                        logger.info("Upload started at " + partNo + ". part for file " + "\"" + file.getName() + "\"");
                        break;
                    case ProgressEvent.PART_COMPLETED_EVENT_CODE:
                        logger.info("Upload completed at " + partNo + ". part for file " + "\"" + file.getName() + "\"" +
                            ", " + (partLength > 0 ? partLength : progressEvent.getBytesTransfered()) +
                            " bytes data has been transferred");
                        break;
                    case ProgressEvent.PART_FAILED_EVENT_CODE:
                        logger.info("Upload failed at " + partNo + ". part for file " + "\"" + file.getName() + "\"" +
                            ", " + progressEvent.getBytesTransfered() + " bytes data has been transferred");
                        break;
                }
            }
        }

        private static class MultiPartFileUploader extends Thread {

            private UploadPartRequest uploadRequest;
            private PartETag partETag;

            MultiPartFileUploader(UploadPartRequest uploadRequest) {
                this.uploadRequest = uploadRequest;
            }

            @Override
            public void run() {
                // s3Client is the static field of the enclosing AmazonS3Util class.
                partETag = s3Client.uploadPart(uploadRequest).getPartETag();
            }

            private PartETag getPartETag() {
                return partETag;
            }

            private void upload() {
                start();
            }
        }
    }
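Step 3.1 and the parallel part uploads can also be sketched without the AWS SDK. The following is only an illustration under stated assumptions: PartPlanner, Part, PartUploader and uploadAll are hypothetical names that do not appear in the code above or in any Amazon library, and the fixed-size thread pool is a swapped-in alternative to the one-thread-per-part approach used in this post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper (not part of the AWS SDK): plans part boundaries and
// runs one upload task per part on a bounded thread pool.
final class PartPlanner {

    /** Offset and size of one part within the file; part numbers start at 1. */
    static final class Part {
        final int number;
        final long offset;
        final long size;

        Part(int number, long offset, long size) {
            this.number = number;
            this.offset = offset;
            this.size = size;
        }
    }

    /** Stand-in for the real upload call; returns the ETag of the uploaded part. */
    interface PartUploader {
        String upload(Part part) throws Exception;
    }

    /** Splits contentLength bytes into parts of partSize; only the last part may be smaller. */
    static List<Part> plan(long contentLength, long partSize) {
        if (contentLength < 0 || partSize <= 0) {
            throw new IllegalArgumentException("contentLength must be >= 0 and partSize > 0");
        }
        List<Part> parts = new ArrayList<>();
        long offset = 0;
        for (int number = 1; offset < contentLength; number++) {
            long size = Math.min(partSize, contentLength - offset);
            parts.add(new Part(number, offset, size));
            offset += size;
        }
        return parts;
    }

    /** Uploads all parts with at most maxThreads running at once; ETags come back in part order. */
    static List<String> uploadAll(List<Part> parts, PartUploader uploader, int maxThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Part part : parts) {
                Callable<String> task = () -> uploader.upload(part);
                futures.add(pool.submit(task));
            }
            List<String> etags = new ArrayList<>();
            for (Future<String> future : futures) {
                etags.add(future.get()); // blocks until this part has finished
            }
            return etags;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for part uploads", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A part upload failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With a plan like this, a very large file does not start thousands of threads at once. The pool size becomes a tuning knob for how much of the network throughput to use.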

May 28, 2013

by Serkan Özal


Azure Blob Storage - "The specified blob or block content is invalid"

If you’re uploading blobs by splitting them into blocks and you get the error "The specified blob or block content is invalid", then this post is for you.

Short Version

If you’re uploading blobs by splitting them into blocks and you get the above-mentioned error, ensure that the block ids of your blocks are all the same length. If the block ids are of different lengths, you’ll get this error.

Long Version

Now for the longer version of this post. A few days back I was working with the storage client library, especially around uploading blobs in chunks, and with one particular blob I was constantly getting the error "The specified blob or block content is invalid". I tried numerous combinations, even resorting to the REST API directly, but to no avail. It only happened with just one blob. Furthermore, if I uploaded the same blob without splitting it into blocks, all was well. I was at my wits’ end. I tried searching the Internet for this error but could not find a conclusive answer to my problem. After much trial and error, I was able to simulate the same problem on other blobs as well. Here’s how you can recreate it:

1. Start uploading the blob by splitting it into blocks. For the block id, use a 7-character string, e.g. intValue.ToString("d7"). This ensures that the block ids are "0000001", "0000002", ..., "0000010", and so on.
2. After one or two blocks are uploaded, cancel the operation.
3. Now re-upload the blob by splitting it into blocks. This time, however, use a 6-character string for the block id, e.g. intValue.ToString("d6").
4. You’ll get the error as soon as you try to upload the first block.

Possible Solutions

Now that we know the root cause of this problem, let’s look at some possible solutions.

Wait it out

One possible solution is to wait it out. I know it’s lame, but it is still a possible solution. We know that the Windows Azure Blob Storage Service keeps all uncommitted blocks for 7 days, and if those uncommitted blocks are not committed within 7 days, the storage service purges them. I wish the storage service provided some mechanism to purge uncommitted blocks programmatically.

Commit uncommitted blocks

You could commit the blocks which are in the uncommitted state so that at least you get a blob (which would not be the blob we wanted to upload in the first place). You can then delete that blob and re-upload the blob, specifying block ids which are of the same length. To fetch the list of uncommitted blocks, if you’re using the REST API directly, you can perform the "Get Block List" operation and pass "blocklisttype=uncommitted" as one of the query string parameters. If you’re using the storage client library (assuming version 2.x of the .Net storage client library), you can do something like the code below:

    private static List<string> GetUncommittedBlockIds(CloudBlockBlob blob)
    {
        var sasUri = blob.GetSharedAccessSignature(new SharedAccessBlobPolicy()
        {
            SharedAccessExpiryTime = DateTime.UtcNow.AddMinutes(5),
            Permissions = SharedAccessBlobPermissions.Read,
        });
        var blobUri = new Uri(string.Format("{0}{1}", blob.Uri, sasUri));
        List<string> uncommittedBlockIds = new List<string>();
        var request = BlobHttpWebRequestFactory.GetBlockList(blobUri, null, null, BlockListingFilter.Uncommitted, null, null);
        //request.Headers.Add("Authorization",
        using (var resp = (HttpWebResponse)request.GetResponse())
        {
            using (var stream = resp.GetResponseStream())
            {
                var getBlockListResponse = new GetBlockListResponse(stream);
                var blocks = getBlockListResponse.Blocks;
                foreach (var block in blocks.Where(b => !b.Committed))
                {
                    // Block ids are Base64 encoded on the wire; decode them to see the original id.
                    uncommittedBlockIds.Add(Encoding.UTF8.GetString(Convert.FromBase64String(block.Name)));
                }
            }
        }
        return uncommittedBlockIds;
    }

A few things to keep in mind here:

- The Microsoft.WindowsAzure.Storage.Blob namespace does not have the capability to get the list of uncommitted blocks. You need to make use of the Microsoft.WindowsAzure.Storage.Blob.Protocol namespace.
- Because we’re effectively invoking the REST API by executing an HttpWebRequest, I created a shared access signature on the blob so that I don’t have to create an "Authorization" header.

Fetch uncommitted blocks to see block id length

You could fetch the list of uncommitted blocks just to find out the length of the block id used. You could then use that block id length for your new upload session and do the upload. Please see the code snippet above to find this information.

Upload another blob with same name without splitting it into blocks

You could also upload another blob with the same name without splitting it into blocks. It could very well be a zero-byte blob. That way your uncommitted block list will be wiped clean. Then you could delete that dummy blob and re-upload the actual blob.

A Few Words About Blocks

Since we’re talking about blocks, I thought it might be useful to mention a few points about them:

- Blocks and block-related operations are only applicable to "Block Blobs". Duh!! You’ll get an error if you try to do these operations on a "Page Blob".
- For uploading large blobs, it is recommended that you split your blob into blocks. In fact, if your blob size is more than 64 MB, then you have to split it into blocks.
- The minimum size of a block is 1 byte and the maximum size of a block is 4 MB. It is recommended that you choose a block size based on your internet connectivity and the number of parallel threads you want to use to upload these blocks.
- A blob can be split into a maximum of 50000 blocks. It’s important to remember this limitation because you are only reminded of it when you try to upload the 50001st block.
- The length of all the block ids must be the same. So if you’re using an integer value to denote the block id, make sure that you pad that integer value with "0" so that you get the same length. You could do something like int.ToString("d6").
- When passing the block id as a parameter, it must be Base64 encoded.
- While the order in which the blocks are uploaded is not important, the order is important when you commit the block list, because that’s when the blob is constructed by the service. For example, let’s say you’re uploading a blob by splitting it into 5 blocks (with ids "000001", "000002", "000003", "000004", and "000005"). You could upload these blocks in any order (000004, 000001, 000003, 000005, 000002); however, when you commit the block list, ensure that the block ids are passed in the proper order, i.e. 000001, 000002, 000003, 000004, 000005.

Summary

That’s it for this post. I hope you’ve found this information useful. I spent a considerable amount of time trying to fix this problem, so I hope it will help some folks out. As always, if you find any issues with the post please let me know and I’ll fix it ASAP.
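The two block id rules (same length for every id, Base64 encoded when sent) are easy to get wrong together, so here is a minimal sketch of them. It is written in Java, the language of the other samples on this page, rather than in C#, and it does not use the storage client library. BlockIds, encode and decode are hypothetical names introduced only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: builds fixed-length, Base64-encoded block ids.
final class BlockIds {

    /** Zero-pads blockNumber to exactly width digits, then Base64 encodes the result. */
    static String encode(int blockNumber, int width) {
        if (blockNumber < 0 || width < 1) {
            throw new IllegalArgumentException("blockNumber must be >= 0 and width >= 1");
        }
        String padded = String.format("%0" + width + "d", blockNumber);
        if (padded.length() != width) {
            // The number needs more digits than the chosen width, so the ids
            // would no longer all have the same length.
            throw new IllegalArgumentException(blockNumber + " does not fit in " + width + " digits");
        }
        return Base64.getEncoder().encodeToString(padded.getBytes(StandardCharsets.UTF_8));
    }

    /** Reverses encode, e.g. to inspect the id length used by uncommitted blocks. */
    static String decode(String blockId) {
        return new String(Base64.getDecoder().decode(blockId), StandardCharsets.UTF_8);
    }
}
```

Choosing the width once (six digits is enough for the 50000-block limit mentioned above) and reusing it for every upload session of the same blob avoids the mismatch described in this post.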

May 20, 2013

by Gaurav Mantri


Deploy a File Server in the Cloud (WebDav on Windows Azure)

This month, my fellow IT Pro Technical Evangelists and I are authoring a new series of articles on 20 key scenarios with Windows Azure Infrastructure Services. Check out the list of articles here: http://mythoughtsonit.com/2013/05/20-key-scenarios-with-windows-azure-infrastructure-services/

Web-based Distributed Authoring and Versioning, or WebDAV, is a set of protocols based on HTTP that allows end users to map a network drive over HTTP and edit content and files stored on the web server. When WebDAV was first offered on Microsoft server, I evaluated it and decided it did not perform well enough for me. The WebDAV extension to IIS was completely rewritten back in the Server 2008 timeframe and is worth taking a look at again.

In this article I will guide you step by step through the process of setting up WebDAV on Server 2012 in a Windows Azure IaaS environment. This will give you a solid-performing file share on the internet over port 80 and the HTTP protocol.

First, you need an Azure account. You can set up a free trial of Azure. Details can be found here: http://mythoughtsonit.com/2013/04/step-by-step-guide-to-setting-up-a-windows-azure-free-trial/

Second, provision a Server 2012 machine.

Third, open port 80 to this new server:

1. In the Azure portal, select your 2012 server and choose the "Endpoints" tab at the top.
2. Click "Add Endpoint" at the bottom of the screen.
3. Enter the endpoint information, mapping port 80 to port 80.

Done. Next we need to install the IIS web server and WebDAV.

Installing WebDAV on IIS 8.0

1. Start Server Manager and go to "Add Roles and Features".
2. Under Server Roles, add the Web Server (IIS) role.
3. Click through the wizard until you come to the Role Services section. Then find and select "WebDAV Publishing" and "Windows Authentication".
4. Click Next and then Install.

When the install is finished, you are ready to move on to the next section.

Configuring IIS 8 for WebDAV

After the installation finishes, you need to configure the box for access.

1. Start the IIS Manager tool.
2. Choose the "Default Web Site" on the left side.
3. Click on "Authentication", open the Windows Authentication option and enable it.
4. Open the "WebDAV Authoring Rules" and create a WebDAV rule. I chose to allow all users access to all content. A better security practice is to limit which users can use the service. It’s your data, so you decide.
5. Make sure WebDAV is enabled and that your access rule is set.

That is it. Now you’re ready to access your WebDAV file share!

Test and ensure you can hit the web server by using your browser. Because you opened port 80 and installed IIS 8, you should see the default web page when you browse to your server’s internet DNS name. Example: http://yourdomainname.cloudapp.net/

How to map a drive to your WebDAV server

There are two ways I use to connect to the WebDAV server.

How to map a drive to your WebDAV server from the Win 8 GUI:

1. From Windows Explorer, right-click on "Computer" and select "Map a network drive".
2. Map your network drive by entering the address of your server. Example: http://yourdomainname.cloudapp.net/
3. I selected "Connect using different credentials" because my workstation was not joined to the server in any way and I needed to use an account in the server’s local SAM database.
4. Hit "Finish" and enter your credentials.

Now you will have a connected drive that you can access from Windows Explorer or any tool via the drive mapping.

How to map a drive to your WebDAV server from a cmd box:

1. Hit Windows Start and type: cmd
2. Enter the command: net use [drive letter] [url]

Example: net use e: http://yourdomainname.cloudapp.net/

May 15, 2013

by Brian Lewis


Multipart Upload on S3 with jclouds

1. Goal

In the previous article, we looked at how we can use the generic Blob APIs from jclouds to upload content to S3. In this article we will use the S3-specific asynchronous API from jclouds to upload content and leverage the multipart upload functionality provided by S3.

2. Preparation

2.1. Set up the custom API

The first part of the upload process is creating the jclouds API. This is a custom API for Amazon S3:

    public AWSS3AsyncClient s3AsyncClient() {
        String identity = ...
        String credentials = ...
        BlobStoreContext context = ContextBuilder.newBuilder("aws-s3").
            credentials(identity, credentials).buildView(BlobStoreContext.class);
        RestContext<AWSS3Client, AWSS3AsyncClient> providerContext = context.unwrap();
        return providerContext.getAsyncApi();
    }

2.2. Determining the number of parts for the content

Amazon S3 requires each part of a multipart upload, except the last one, to be at least 5 MB. As such, the first thing we need to do is determine the largest number of parts that we can split our content into so that we don’t have parts below this 5 MB limit:

    public static int getMaximumNumberOfParts(byte[] byteArray) {
        int numberOfParts = byteArray.length / fiveMB; // 5*1024*1024
        if (numberOfParts == 0) {
            return 1;
        }
        return numberOfParts;
    }

2.3. Breaking the content into parts

We’re going to break the byte array into a set number of parts:

    public static List<byte[]> breakByteArrayIntoParts(byte[] byteArray, int maxNumberOfParts) {
        List<byte[]> parts = Lists.<byte[]> newArrayListWithCapacity(maxNumberOfParts);
        int fullSize = byteArray.length;
        long dimensionOfPart = fullSize / maxNumberOfParts;
        for (int i = 0; i < maxNumberOfParts; i++) {
            int previousSplitPoint = (int) (dimensionOfPart * i);
            int splitPoint = (int) (dimensionOfPart * (i + 1));
            if (i == (maxNumberOfParts - 1)) {
                // The last part takes whatever is left over.
                splitPoint = fullSize;
            }
            byte[] partBytes = Arrays.copyOfRange(byteArray, previousSplitPoint, splitPoint);
            parts.add(partBytes);
        }
        return parts;
    }

We’re going to test the logic of breaking the byte array into parts: we generate some bytes, split the byte array, put it back together using Guava and verify that we get back the original:

    @Test
    public void given16MByteArray_whenFileBytesAreSplitInto3_thenTheSplitIsCorrect() {
        byte[] byteArray = randomByteData(16);
        int maximumNumberOfParts = S3Util.getMaximumNumberOfParts(byteArray);
        List<byte[]> fileParts = S3Util.breakByteArrayIntoParts(byteArray, maximumNumberOfParts);

        assertThat(fileParts.get(0).length + fileParts.get(1).length + fileParts.get(2).length,
            equalTo(byteArray.length));
        byte[] unmultiplexed = Bytes.concat(fileParts.get(0), fileParts.get(1), fileParts.get(2));
        assertThat(byteArray, equalTo(unmultiplexed));
    }

To generate the data, we simply use the support from Random:

    byte[] randomByteData(int mb) {
        byte[] randomBytes = new byte[mb * 1024 * 1024];
        new Random().nextBytes(randomBytes);
        return randomBytes;
    }

2.4. Creating the Payloads

Now that we have determined the correct number of parts for our content and we have managed to break the content into parts, we need to generate the Payload objects for the jclouds API:

    public static List<Payload> createPayloadsOutOfParts(Iterable<byte[]> fileParts) {
        List<Payload> payloads = Lists.newArrayList();
        for (byte[] filePart : fileParts) {
            byte[] partMd5Bytes = Hashing.md5().hashBytes(filePart).asBytes();
            Payload partPayload = Payloads.newByteArrayPayload(filePart);
            partPayload.getContentMetadata().setContentLength((long) filePart.length);
            partPayload.getContentMetadata().setContentMD5(partMd5Bytes);
            payloads.add(partPayload);
        }
        return payloads;
    }

3. Upload

The upload process is a flexible multi-step process. This means:

- the upload can be started before having all the data: data can be uploaded as it’s coming in
- data is uploaded in chunks: if one of these operations fails, it can simply be retried
- chunks can be uploaded in parallel: this can greatly increase the upload speed, especially in the case of large files

3.1. Initiating the Upload operation

The first step in the Upload operation is to initiate the process. This request to S3 must contain the standard HTTP headers; the Content-MD5 header in particular needs to be computed. We’re going to use the Guava hash function support here:

    Hashing.md5().hashBytes(byteArray).asBytes();

This is the MD5 hash of the entire byte array, not of the parts yet.

To initiate the upload, and for all further interactions with S3, we’re going to use the AWSS3AsyncClient, the asynchronous API we created earlier:

    ObjectMetadata metadata = ObjectMetadataBuilder.create().key(key).contentMD5(md5Bytes).build();
    String uploadId = s3AsyncApi.initiateMultipartUpload(container, metadata).get();

The key is the handle assigned to the object. This needs to be a unique identifier specified by the client.

Also notice that, even though we’re using the async version of the API, we’re blocking for the result of this operation. This is because we need the result of the initiate operation to be able to move forward.

The result of the operation is an upload id returned by S3. This will identify the upload throughout its lifecycle and will be present in all subsequent upload operations.

3.2. Uploading the Parts

The next step is uploading the parts. Our goal here is to send these requests in parallel, as the upload part operations represent the bulk of the upload process:

    List<ListenableFuture<String>> ongoingOperations = Lists.newArrayList();
    for (int partNumber = 0; partNumber < filePartsAsByteArrays.size(); partNumber++) {
        ListenableFuture<String> future = s3AsyncApi.uploadPart(
            container, key, partNumber + 1, uploadId, payloads.get(partNumber));
        ongoingOperations.add(future);
    }

The part numbers need to be continuous, but the order in which the requests are sent is not relevant.

After all of the upload part requests have been submitted, we need to wait for their responses so that we can collect the individual ETag value of each part:

    Function<ListenableFuture<String>, String> getEtagFromOp =
        new Function<ListenableFuture<String>, String>() {
            public String apply(ListenableFuture<String> ongoingOperation) {
                try {
                    return ongoingOperation.get();
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            }
        };
    List<String> etagsOfParts = Lists.transform(ongoingOperations, getEtagFromOp);

If, for whatever reason, one of the upload part operations fails, the operation can be retried until it succeeds. The logic above does not contain the retry mechanism, but building it in should be straightforward enough.

3.3. Completing the Upload operation

The final step of the upload process is completing the multipart operation. The S3 API requires the responses from the previous part uploads as a Map, which we can now easily create from the list of ETags that we obtained above:

    Map<Integer, String> parts = Maps.newHashMap();
    for (int i = 0; i < etagsOfParts.size(); i++) {
        parts.put(i + 1, etagsOfParts.get(i));
    }

And finally, send the complete request:

    s3AsyncApi.completeMultipartUpload(container, key, uploadId, parts).get();

This will return the final ETag of the finished object and will complete the entire upload process.

4. Conclusion

In this article we built a multipart-enabled, fully parallel upload operation to S3, using the custom S3 jclouds API. This operation is ready to be used as is, but it can be improved in a few ways.

First, retry logic should be added around the upload operations to better deal with failures.

Next, for really large files, even though the mechanism is sending all upload multipart requests in parallel, a throttling mechanism should still limit the number of parallel requests being sent. This is both to avoid bandwidth becoming a bottleneck and to make sure Amazon itself doesn’t flag the upload process as exceeding an allowed limit of requests per second. The Guava RateLimiter can potentially be very well suited for this.
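The retry logic mentioned above could look roughly like the sketch below. It uses only the JDK, not jclouds or Guava, and Retry and withRetry are hypothetical names introduced for this illustration. It retries immediately with no delay between attempts, which is an assumption made to keep the example short. A real uploader would probably wait between attempts.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: runs an operation until it succeeds or the attempt budget is spent.
final class Retry {

    /** Calls operation up to maxAttempts times and returns the first successful result. */
    static <T> T withRetry(Callable<T> operation, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (InterruptedException e) {
                // Do not retry after an interrupt; restore the flag and stop.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted during attempt " + attempt, e);
            } catch (Exception e) {
                lastFailure = e; // remember the failure and try again
            }
        }
        throw new IllegalStateException("Gave up after " + maxAttempts + " attempts", lastFailure);
    }
}
```

Each blocking part upload, for instance the uploadPart call followed by get(), could be wrapped in such a Callable so that a transient failure of one part does not abort the whole upload.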

April 21, 2013

by Eugen Paraschiv


Upload on S3 with the jclouds Library

There are several good ways to upload content to an S3 bucket in the Java world. In this article we’ll look at what the jclouds library provides for this purpose.

To use jclouds, and specifically the APIs discussed in this article, this simple Maven dependency should be added to the pom of the project:

    <dependency>
        <groupId>org.jclouds</groupId>
        <artifactId>jclouds-allblobstore</artifactId>
        <version>1.5.9</version>
    </dependency>

1. Uploading to Amazon S3

The first step, in order to access any of these APIs, is to create a BlobStoreContext:

    BlobStoreContext context =
        ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials)
            .buildView(BlobStoreContext.class);

This represents the entry point to a general key-value storage service, such as Amazon S3, but not limited to it.

For the more specific S3-only implementation, the context can be created similarly:

    BlobStoreContext context =
        ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials)
            .buildView(S3BlobStoreContext.class);

And even more specifically:

    BlobStoreContext context =
        ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials)
            .buildView(AWSS3BlobStoreContext.class);

When the authenticated context is no longer needed, closing it is required to release all resources (threads and connections) associated with it.

2. The four S3 APIs of jclouds

The jclouds library provides four different APIs to upload content to an S3 bucket, ranging from simple but inflexible to complex and powerful, all obtained via the BlobStoreContext. Let’s start with the simplest.

2.1. Upload via the Map API

The easiest way jclouds can be used to interact with an S3 bucket is by representing that bucket as a Map. The API is obtained from the context:

    InputStreamMap bucket = context.createInputStreamMap("bucketName");

Then, to upload a simple HTML file:

    bucket.putString("index1.html", "hello world1");

The InputStreamMap API exposes several other types of PUT operations (files, raw bytes), both for single and bulk uploads.

A simple integration test can be used as an example:

    @Test
    public void whenFileIsUploadedToS3WithMapApi_thenNoExceptions() {
        BlobStoreContext context =
            ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials)
                .buildView(AWSS3BlobStoreContext.class);

        InputStreamMap bucket = context.createInputStreamMap("bucketName");
        bucket.putString("index1.html", "hello world1");

        context.close();
    }

2.2. Upload via BlobMap

Using the simple Map API is straightforward but ultimately limited. For example, there is no way to pass in metadata about the content being uploaded. When more flexibility and customization is necessary, this simplified approach to uploading data to S3 via a Map is no longer enough.

The next API we’ll look at is the Blob Map API. This is obtained from the context:

    BlobMap bucket = context.createBlobMap("bucketName");

The API allows the client to access lower-level details, such as Content-Length, Content-Type, Content-Encoding, the ETag hash and others. To upload new content to the bucket:

    Blob blob = bucket.blobBuilder().name("index2.html").
        payload("hello world2").
        contentType("text/html").calculateMD5().build();

The API also allows setting a variety of payloads on the create request.

A simple integration test for uploading a basic HTML file to S3 via the Blob Map API:

    @Test
    public void whenFileIsUploadedToS3WithBlobMap_thenNoExceptions() throws IOException {
        BlobStoreContext context =
            ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials)
                .buildView(AWSS3BlobStoreContext.class);

        BlobMap bucket = context.createBlobMap("bucketName");

        Blob blob = bucket.blobBuilder().name("index2.html").
            payload("hello world2").
            contentType("text/html").calculateMD5().build();
        bucket.put(blob.getMetadata().getName(), blob);

        context.close();
    }

2.3. Upload via BlobStore

The previous APIs had no way to upload content using multipart upload, which makes them ill suited to working with large files. This limitation is addressed by the next API we’re going to look at, the synchronous BlobStore API.

This is obtained from the context:

    BlobStore blobStore = context.getBlobStore();

To use the multipart support and upload a file to S3:

    Blob blob = blobStore.blobBuilder("index3.html").
        payload("hello world3").contentType("text/html").build();
    blobStore.putBlob("bucketName", blob, PutOptions.Builder.multipart());

The payload builder is the same one that was being used by the BlobMap API, so the same flexibility in specifying lower-level metadata information about the blob is available here. The difference is the PutOptions supported by the PUT operation of the API, namely the multipart support.

The previous integration test now has multipart enabled:

    @Test
    public void whenFileIsUploadedToS3WithBlobStore_thenNoExceptions() {
        BlobStoreContext context =
            ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials)
                .buildView(AWSS3BlobStoreContext.class);

        BlobStore blobStore = context.getBlobStore();

        Blob blob = blobStore.blobBuilder("index3.html").
            payload("hello world3").contentType("text/html").build();
        blobStore.putBlob("bucketName", blob, PutOptions.Builder.multipart());

        context.close();
    }

2.4. Upload via AsyncBlobStore

While the previous BlobStore API was synchronous, there is also an asynchronous API for BlobStore: AsyncBlobStore. The API is similarly obtained from the context:

    AsyncBlobStore blobStore = context.getAsyncBlobStore();

The only difference between the two is that the async API returns a ListenableFuture for the asynchronous PUT operation:

    Blob blob = blobStore.blobBuilder("index4.html").
        payload("hello world4").build();
    blobStore.putBlob("bucketName", blob).get();

The integration test displaying this operation is similar to the synchronous one:

    @Test
    public void whenFileIsUploadedToS3WithAsyncBlobStore_thenNoExceptions()
        throws InterruptedException, ExecutionException {
        BlobStoreContext context =
            ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials)
                .buildView(AWSS3BlobStoreContext.class);

        // Use the asynchronous store here; the synchronous one does not return a Future.
        AsyncBlobStore blobStore = context.getAsyncBlobStore();

        Blob blob = blobStore.blobBuilder("index4.html").
            payload("hello world4").contentType("text/html").build();
        Future<String> putOp = blobStore.putBlob("bucketName", blob, PutOptions.Builder.multipart());
        putOp.get();

        context.close();
    }

3. Conclusion

In this article, we analysed the four APIs that the jclouds library provides to upload content to Amazon S3. These four APIs are generic and they work with other key-value storage services as well, such as Microsoft Azure Storage.

In the next article we’ll look at the Amazon-specific S3 API available in jclouds, the AWSS3Client. We’ll implement the operation of uploading a large file, dynamically calculate the optimal number of parts for any given file, and perform the upload of all parts in parallel.
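To make the MD5 metadata mentioned in the BlobMap section more concrete, here is a small sketch of how a Content-MD5 value can be computed with the JDK alone. This is not how jclouds implements calculateMD5() internally (that is not shown in this article). ContentMd5, md5 and headerValue are hypothetical names used only for illustration.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical helper: computes the value of an HTTP Content-MD5 header for a payload.
final class ContentMd5 {

    /** Raw 16-byte MD5 digest of the payload. */
    static byte[] md5(byte[] payload) {
        try {
            return MessageDigest.getInstance("MD5").digest(payload);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5, so this is unexpected.
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Content-MD5 is the Base64 encoding of the raw digest, not of its hex string. */
    static String headerValue(byte[] payload) {
        return Base64.getEncoder().encodeToString(md5(payload));
    }
}
```

The service can recompute the digest on receipt and reject the upload if the value does not match, which is what makes the header useful as an integrity check.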

April 18, 2013

by Eugen Paraschiv

· 8,916 Views · 1 Like

Introduction to SmartSVN

SmartSVN is a powerful and easy-to-use graphical client for Apache Subversion. There are several clients for Subversion, but here are just a few reasons you should try SmartSVN:

- It’s cross-platform – SmartSVN runs on Windows, Linux and Mac OS X, so you can continue using the operating system (OS) that works best for you. It can also be integrated into your OS, via Mac’s Finder Integration or Windows Shell.
- Everything you need, out of the box – SmartSVN comes complete with all the tools you need to manage your Subversion projects:
    - Conflict solver – this feature combines the freedom of a general three-way merge with the ability to detect and resolve any conflicts that occur during the development lifecycle.
    - File compare – this allows you to make inner-line comparisons and directly edit the compared files.
    - Built-in SSH client – allows users to access servers using the SSH protocol. This security-conscious protocol encrypts every piece of communication between the client and the server, for additional protection.
- A complete view of your project at a glance – the most important files (such as conflicted, modified or missing files) are placed at the top of the file list. SmartSVN also highlights which directories contain local modifications, which directories have been changed in the repository, and whether individual files have been modified locally or in the central repo. This makes it easy to get a quick overview of the state of your project.
- Fully customizable – maximize productivity by fine-tuning your SmartSVN installation to suit your particular needs: change keyboard shortcuts, write your own plugin with the SmartSVN API, group revisions to personalize your display, create Change Sets, and alter the context menus and toolbars to suit you. You can learn more about customizing SmartSVN at our ‘5 Ways to Customize SmartSVN’ blog post.
- Comprehensive bug tracker support – Trac and JIRA are both fully supported.
- Multitude of support options – SmartSVN users have access to a range of free support, from refcards to blogs and documentation, the SmartSVN forum and a Twitter account maintained by our open source experts. If you need extra support with your SmartSVN installation, expert email support is included with SmartSVN Professional licenses.

Want to learn more about SmartSVN? On April 18th, WANdisco will be holding a free ‘Introduction to SmartSVN’ webinar covering everything you need to get off to a great start with this popular client:

- Repository basics
- Checkouts, working folders, editing files and commits
- Reporting on changes
- Simple branching
- Simple merging

This webinar is free, so register now.

April 13, 2013

by Jessica Thornsby

· 6,831 Views

Configuring Apache SolrCloud on Amazon VPC

We are going to construct a 12-node Apache SolrCloud (4.1) cluster on EC2 instances inside Amazon VPC in this post. Since the search data stored inside SolrCloud is critical, we are going to build high availability at the Solr node level as well as at the AZ level. This setup will be done inside the private subnets of an Amazon VPC and will leverage 3 Availability Zones of the Amazon EC2 Region.

A brief summary of the deployment architecture:

- 3 ZooKeepers will be deployed across 3 Availability Zones. The ZK EC2 instances will be deployed on the private subnets of the Amazon VPC.
- 3 Solr Shard EC2 instances will be deployed on the private subnet of Availability Zone 1 inside the Amazon VPC.
- 3 Solr Replica EC2 instances will be deployed on the private subnet of Availability Zone 2 inside the Amazon VPC.
- 3 Solr Replica EC2 instances will be deployed on the private subnet of Availability Zone 3 inside the Amazon VPC.
- EBS-optimized + PIOPS EC2 instances can be used for the Solr EC2 nodes.

To learn more about SolrCloud deployment best practices on Amazon VPC, refer to this article: http://harish11g.blogspot.in/2013/03/Apache-Solr-cloud-on-Amazon-EC2-AWS-VPC-implementation-deployment.html

Step 1: Creating Virtual Private Cloud on AWS

Create a VPC with public and private subnets. Assume the load balancer and web/app servers reside on the public subnet and Apache SolrCloud resides on the private subnet of the VPC.

Step 2: Assigning the IP for the Subnets

Create the subnet with its IP range. Choose the Availability Zone for this subnet.

Step 3: Multiple Subnets on Multiple AZs

Create multiple subnets in multiple AZs to build a highly available setup for SolrCloud.

Step 4: Install Java for ZooKeeper & Solr

Amazon Linux is chosen as the EC2 OS variant. Execute the following instructions on the respective EC2 nodes after their launch. EC2 instances should be launched in Multi-AZ in multiple VPC private subnets.

Solr uses ZooKeeper for cluster configuration and coordination. ZooKeeper is a distributed coordination service whose file-system-like repository holds information about all the Solr nodes; solrconfig.xml, schema.xml etc. are stored in this repository. We have used Oracle-Sun Java over OpenJDK. The file names and paths below assume JDK 7u10; adjust them to the update you actually download.

    sudo -s
    cd /opt
    wget --no-cookies --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2Ftechnetwork%2Fjava%2Fjavase%2Fdownloads%2Fjdk-7u3-download-1501626.html;" http://download.oracle.com/otn-pub/java/jdk/7u13-b20/jdk-7u13-linux-x64.rpm
    mv jdk-7u10-linux-x64.rpm?AuthParam=1357217677_76ec3d8d9a3644f4b9ec1ea79e1fcf33 jdk-7u10-linux-x64.rpm
    sudo rpm -ivh jdk-7u10-linux-x64.rpm
    alternatives --install /usr/bin/java java /usr/java/jdk1.7.0_10/jre/bin/java 20000
    alternatives --install /usr/bin/javaws javaws /usr/java/jdk1.7.0_10/jre/bin/javaws 20000
    alternatives --install /usr/bin/javac javac /usr/java/jdk1.7.0_10/bin/javac 20000
    alternatives --install /usr/bin/jar jar /usr/java/jdk1.7.0_10/bin/jar 20000
    alternatives --install /usr/bin/java java /usr/java/jre1.7.0_10/bin/java 20000
    alternatives --install /usr/bin/javaws javaws /usr/java/jre1.7.0_10/bin/javaws 20000
    alternatives --configure java

Add JAVA_HOME in .bash_profile:

    vim ~/.bash_profile

    export JAVA_HOME="/usr/java/jdk1.7.0_10"
    export PATH=$PATH:$JAVA_HOME/bin

Restart the instance:

    init 6

Check the version of Java installed using the "java -version" command.

Step 5: Configure the ZooKeeper (v3.4.5) Ensemble

Since a single ZooKeeper is not ideal for a large Solr cluster (it is a single point of failure), it is recommended to configure multiple ZooKeepers in concert as an ensemble. In this step we will install and configure 3 ZooKeeper EC2 nodes spanning 3 different Availability Zones, in their respective private subnets inside a VPC. ZooKeeper will be configured on Amazon Linux.
    sudo yum update
    sudo -s
    cd /opt
    wget http://apache.techartifact.com/mirror/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
    tar -xzvf zookeeper-3.4.5.tar.gz
    rm zookeeper-3.4.5.tar.gz
    cd zookeeper-3.4.5
    cp conf/zoo_sample.cfg conf/zoo.cfg

Add the following lines in zoo.cfg:

    vim conf/zoo.cfg

    dataDir=/data
    server.1=[zk-server01-ip]:2888:3888
    server.2=[zk-server02-ip]:2888:3888
    server.3=[zk-server03-ip]:2888:3888

Create the myid file in the directory configured as dataDir. It contains 1, 2 or 3 respectively on each of the ZooKeeper EC2 instances in Multi-AZ:

    cd /data
    vim myid

Start ZooKeeper:

    bin/zkServer.sh start

Follow the above steps on all the ZooKeeper servers. Refer to "Clustered (Multi-Server) Setup" and "Configuration Parameters" in the ZooKeeper documentation for an explanation of quorum_port, leader_election_port and the myid file. Every ZooKeeper node needs to know about every other ZK EC2 node in the ensemble, and a majority of the nodes (called a quorum) is needed to provide the service. Make sure the VPC IPs of all the ZooKeepers are given on every ZK node, as in the zoo.cfg entries above:

    server.1=[zk-server01-ip]:2888:3888
    server.2=[zk-server02-ip]:2888:3888
    server.3=[zk-server03-ip]:2888:3888

Step 6: Configuring Solr 4.1 EC2 node

In this step we will install and configure 3 Apache Solr 4.1 Shard EC2 instances in a single Amazon AZ and 2 Solr Replicas in another AZ, in their respective private subnets. Please note that we have to specify all the ZooKeeper (ZK) hosts on every Solr instance, as shown below.

Note: Solr ships with Jetty by default; it is suggested to use Tomcat for production nodes.

Perform the following after launching EC2 instances in Multi-AZ in multiple VPC private subnets:

    sudo -s
    yum update
    cd /opt
    wget http://apache.techartifact.com/mirror/lucene/solr/4.1.0/apache-solr-4.1.0.tgz
    tar -xzvf apache-solr-4.1.0.tgz
    rm -f apache-solr-4.1.0.tgz

On the Solr Shard/Replica instances:

    cd /opt/apache-solr-4.1.0/example/
    vim /opt/apache-solr-4.1.0/example/solr/collection1/conf/solrconfig.xml

Change /var/data/solr to /data.

Start the Solr 4.1 Shard/Replica Java program. On the first Solr instance:

    java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=SolrCloud4.1-Conf -DnumShards=3 -DzkHost=[zk-server01-ip]:2181,[zk-server02-ip]:2181,[zk-server03-ip]:2181 -jar start.jar

On every subsequent Solr instance:

    java -DzkHost=[zk-server01-ip]:2181,[zk-server02-ip]:2181,[zk-server03-ip]:2181 -jar start.jar

- -DnumShards: the number of shards that will be present. Note that once set, this number cannot be increased or decreased without re-indexing the entire data set. (Dynamically changing the number of shards is part of the Solr roadmap!)
- -DzkHost: a comma-separated list of ZooKeeper servers.
- -Dbootstrap_confdir, -Dcollection.configName: these parameters are specified only when starting up the first Solr instance. This enables the transfer of configuration files to ZooKeeper. Subsequent Solr instances just need to point to the ZooKeeper ensemble.

The above command with -DnumShards=3 specifies that it is a 3-shard cluster. The first Solr EC2 node automatically becomes shard1, the second Solr EC2 node automatically becomes shard2, and so on. What happens when we launch a fourth Solr instance in this cluster? Since it’s a 3-shard cluster, the fourth Solr EC2 node automatically becomes a replica of shard1 and the fifth Solr EC2 node becomes a replica of shard2.

Step 7: AWS Security Group TCP Ports to be enabled

Configure the following TCP ports on the AWS security group to allow access between the Solr and ZK nodes deployed in multiple AZs:

- Solr Shards/Replicas connect to ZK through TCP port 2181.
- The Solr web interface with the Jetty container uses TCP port 8983.
- The Solr web interface with the Tomcat container uses TCP port 8080.

Every instance that is part of the ZooKeeper ensemble should know about every other machine in the ensemble. We can accomplish this with a series of lines of the form server.id=host:port:port. For example:

    server.1=[vpc-ip]:2888:3888
    server.2=[vpc-ip]:2888:3888
    server.3=[vpc-ip]:2888:3888

TCP ports 2888 and 3888 should be opened for the ZK ensemble.
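The node-to-shard assignment described in Step 6, the -DzkHost value and the quorum size from Step 5 can be sketched in a few lines of Java. This is an illustration only: the class and method names are hypothetical, the IP addresses in `main` are made up, and `roleOf` simply encodes the round-robin behaviour as this article describes it rather than Solr’s actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; names and sample addresses are illustrative assumptions.
class SolrCloudLayoutSketch {

    // Role of the n-th Solr node (1-based start order) in a cluster started with -DnumShards=numShards.
    // The first numShards nodes become the shards; later nodes become replicas in round-robin order.
    static String roleOf(int nodeNumber, int numShards) {
        int shard = ((nodeNumber - 1) % numShards) + 1;
        return nodeNumber <= numShards ? "shard" + shard : "replica of shard" + shard;
    }

    // Builds the comma-separated host:port list passed as -DzkHost.
    static String zkHost(List<String> hosts, int clientPort) {
        List<String> entries = new ArrayList<>();
        for (String host : hosts) {
            entries.add(host + ":" + clientPort);
        }
        return String.join(",", entries);
    }

    // Smallest majority of an ensemble: the number of ZooKeeper nodes that must be up.
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    public static void main(String[] args) {
        for (int node = 1; node <= 9; node++) {
            System.out.println("Solr node " + node + " -> " + roleOf(node, 3));
        }
        // Made-up private IPs, one per Availability Zone.
        System.out.println(zkHost(Arrays.asList("10.0.1.10", "10.0.2.10", "10.0.3.10"), 2181));
        System.out.println("quorum of 3 = " + quorumSize(3));
    }
}
```

With 3 ZooKeeper nodes the quorum is 2, so the ensemble in this setup keeps serving if one Availability Zone is lost.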

April 5, 2013

by Harish Ganesan

· 7,830 Views

Async I/O and ThreadPool Deadlock (Part 1)

I’ve mentioned in a past post that it was conceived while reading the source code for the System.Diagnostics.Process class. This post is about the reason that pushed me to read the source code in an attempt to fix the issue. It turned out that this was yet another case of LeakyAbstraction, which is a special interest of mine.

As it turned out, this post ended up being way too long (even for me). I don’t like installments, but I felt that it was worth trying, as the size was prohibitive for single-post consumption. As such, I’ve split it into 5 parts, so that each part would be around 1,000 words or less. I’ll post one part a day.

To give you an idea of the scope and subject of what’s to come, here is a quick overview. In part 1 I’ll lay out the problem. We are trying to spawn processes, read their output and kill them if they take too long. Our first attempt is to use simple synchronous I/O to read the output, and we discover a deadlock. We solve the deadlock using asynchronous I/O. In part 2 we parallelize the code and discover reduced performance and yet another deadlock. We create a testbed and set about investigating the problem in depth. In part 3 we will find the root cause and discuss the mechanics (how and why) of hitting such a problem. In part 4 we’ll discuss solutions to the problem and develop a generic solution (with code) to fix it. Finally, in part 5 we see whether or not a generic solution could work before we summarize and conclude.

Let’s begin at the very beginning. Suppose you want to execute some program (call it child), get all its output (and error) and, if it doesn’t exit within some time limit, kill it. Notice that there is no interaction and no input. This is how tests are executed in Phalanger using a test runner.

Synchronous I/O

The Process class conveniently exposes the underlying pipes to the child process through the stream instances StandardOutput and StandardError.
And, like many, we too might be tempted to simply call StandardOutput.ReadToEnd() and StandardError.ReadToEnd(). That would work, until it doesn’t. As Raymond Chen noted, it’ll work as long as the data fits into the internal pipe buffer. The problem with this approach is that we are asking to read until we reach the end of the data, which is only certain to happen when the child process we spawned exits. However, when the buffer of the pipe which the child writes its output to is full, the child has to wait until there is free space in the buffer to write to.

But, you say, what if we always read and empty the buffer? Good idea, except that we need to do that for both StandardOutput and StandardError at the same time. In the StandardOutput.ReadToEnd() call we read every byte coming into the buffer until the child process exits. While we have drained the StandardOutput buffer (so that the child process can’t possibly be blocked on that), if it fills the StandardError buffer, which we aren’t reading yet, we will deadlock. The child won’t exit until it fully writes to the StandardError buffer (which is full because no one is reading it); meanwhile, we are waiting for the process to exit so we can be sure we read to the end of StandardOutput before we return (and start reading StandardError). The same problem exists for StandardOutput if we first read StandardError, hence the need to drain both pipe buffers as they are fed, not one after the other.

Async Reading

The obvious (and only practical) solution is to read both pipes at the same time using separate threads. To that end, there are mainly two approaches: the pre-4.0 approach (async events), and the 4.5-and-up approach (tasks).

Async Reading with Events

The code is reasonably straightforward as it uses .NET events. We have two manual-reset events and two delegates that get called asynchronously when we read a line from each pipe. We get null data when we hit the end of file (i.e. when the process exits) for each of the two pipes.

    public static string ExecWithAsyncEvents(string path, string args, int timeoutMs)
    {
        using (var outputWaitHandle = new ManualResetEvent(false))
        using (var errorWaitHandle = new ManualResetEvent(false))
        using (var process = new Process())
        {
            process.StartInfo = new ProcessStartInfo(path);
            process.StartInfo.Arguments = args;
            process.StartInfo.UseShellExecute = false;
            process.StartInfo.RedirectStandardOutput = true;
            process.StartInfo.RedirectStandardError = true;
            process.StartInfo.ErrorDialog = false;
            process.StartInfo.CreateNoWindow = true;

            var sb = new StringBuilder(1024);

            // Both handlers run on thread-pool threads, so guard the shared StringBuilder.
            process.OutputDataReceived += (sender, e) =>
            {
                if (e.Data == null) { outputWaitHandle.Set(); }   // null marks end of stream
                else { lock (sb) { sb.AppendLine(e.Data); } }
            };
            process.ErrorDataReceived += (sender, e) =>
            {
                if (e.Data == null) { errorWaitHandle.Set(); }    // null marks end of stream
                else { lock (sb) { sb.AppendLine(e.Data); } }
            };

            process.Start();
            process.BeginOutputReadLine();
            process.BeginErrorReadLine();

            process.WaitForExit(timeoutMs);
            outputWaitHandle.WaitOne(timeoutMs);
            errorWaitHandle.WaitOne(timeoutMs);

            process.CancelErrorRead();
            process.CancelOutputRead();

            lock (sb) { return sb.ToString(); }
        }
    }

We certainly can improve on the above code (for example, we should make the total wait limit <= timeoutMs), but you get the point with this sample. Also, there is no error handling, and the child process is not killed when it times out and doesn’t exit.

Async Reading with Tasks

A simpler and cleaner approach is to use the new System.Threading.Tasks namespace/framework to do all the heavy lifting for us. As you can see, the code has been cut by half and it’s much more readable, but we need Framework 4.5 or newer for this to work (my target is 4.0, but for comparison purposes I gave it a spin). The results are the same.
    public static string ExecWithAsyncTasks(string path, string args, int timeout)
    {
        using (var process = new Process())
        {
            process.StartInfo = new ProcessStartInfo(path);
            process.StartInfo.Arguments = args;
            process.StartInfo.UseShellExecute = false;
            process.StartInfo.RedirectStandardOutput = true;
            process.StartInfo.RedirectStandardError = true;
            process.StartInfo.ErrorDialog = false;
            process.StartInfo.CreateNoWindow = true;

            var sb = new StringBuilder(1024);

            process.Start();
            var stdOutTask = process.StandardOutput.ReadToEndAsync();
            var stdErrTask = process.StandardError.ReadToEndAsync();

            process.WaitForExit(timeout);

            // Wait returns false if the read did not finish in time; only collect completed reads.
            if (stdOutTask.Wait(timeout)) { sb.Append(stdOutTask.Result); }
            if (stdErrTask.Wait(timeout)) { sb.Append(stdErrTask.Result); }

            return sb.ToString();
        }
    }

Again, a healthy dose of error handling is in order, but it is left out for illustration purposes. A point worth mentioning is that we can’t assume we have read the streams by the time the child exits. There is a race condition, and we still need to wait for the I/O operations to finish before we can read the results.

In the next part we’ll parallelize the execution in an attempt to maximize efficiency and concurrency.
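The same idea, draining both pipes at once on separate threads and bounding the total wait, can be sketched in Java as a rough analogue of the C# above. This is not the author’s code: the class and method names are hypothetical, and two dedicated reader threads are used on purpose so that one blocked read can never hold up the other.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; a Java analogue of the technique, not a port of the article's code.
class DrainBothPipesSketch {

    // Reads a stream until end-of-stream and returns its content as UTF-8 text.
    static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Drains stdout and stderr concurrently; the total wait across both reads is at most timeoutMs.
    static String drainBoth(InputStream stdout, InputStream stderr, long timeoutMs) {
        ExecutorService readers = Executors.newFixedThreadPool(2); // one thread per pipe
        try {
            Future<String> out = readers.submit(() -> readAll(stdout));
            Future<String> err = readers.submit(() -> readAll(stderr));
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            String outText = out.get(timeoutMs, TimeUnit.MILLISECONDS);
            long remaining = Math.max(0, deadline - System.nanoTime());
            String errText = err.get(remaining, TimeUnit.NANOSECONDS);
            return outText + errText;
        } catch (InterruptedException | ExecutionException | TimeoutException ex) {
            throw new IllegalStateException("reading child output failed or timed out", ex);
        } finally {
            readers.shutdownNow();
        }
    }

    // Usage with a real child process: the child is killed if it is still running afterwards.
    static String exec(long timeoutMs, String... command) throws IOException {
        Process child = new ProcessBuilder(command).start();
        try {
            return drainBoth(child.getInputStream(), child.getErrorStream(), timeoutMs);
        } finally {
            child.destroyForcibly(); // harmless if the child has already exited
        }
    }
}
```

Unlike the event-based C# version, this sketch concatenates the two streams rather than interleaving their lines, which keeps it short at the cost of losing the relative ordering of output and error text.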

April 3, 2013

by Ashod Nakashian

· 5,794 Views

AWS VPC NAT Instance Failover and High Availability

Amazon Virtual Private Cloud (VPC) is a great way to set up an isolated portion of AWS and control the network topology. It is a great way to extend your data center and use AWS for burst requirements. With the latest VPC for Everyone announcement, what was earlier "Classic" and "VPC" in AWS will soon be only VPC. That is, every deployment in AWS will be on a VPC even though one might not need all the additional features that VPC provides. One might eventually start looking at utilizing VPC features such as multiple Subnets, Network isolation, Network ACLs, etc.

Those who have already worked with VPCs understand the role of the NAT Instance in a VPC. When you create a VPC, you typically create it with multiple Subnets (Public and Private). Instances launched in the Public Subnet have direct internet connectivity to send and receive internet traffic through the internet gateway of the VPC. Typically, internet-facing servers such as web servers are kept in the Public Subnet. A Private Subnet can be used to launch Instances that do not require direct access from the internet. Instances in a Private Subnet can access the Internet without exposing their private IP address by routing their traffic through a Network Address Translation (NAT) instance in the Public Subnet. AWS provides an AMI that can be launched as a NAT Instance. The following diagram represents a standard VPC as provisioned through the AWS Management Console wizard.

Figure: Standard Private and Public Subnets in a VPC

The above architecture has:

- A Public Subnet that has direct internet connectivity through the Internet Gateway.
    - Web Instances can be placed within the Public Subnet.
    - The custom Route Table associated with the Public Subnet has the necessary routing information to route traffic to the Internet Gateway.
    - A NAT Instance is also provisioned in the Public Subnet.
- A Private Subnet that has outbound internet connectivity through the NAT Instance in the Public Subnet.
    - The Main Route Table is by default associated with the Private Subnet. It has the necessary routing information to route internet traffic to the NAT Instance.
    - Instances in the Private Subnet use the NAT Instance for outbound internet connectivity, for example DB backups from a standby that need to be stored in S3, or background programs that make external web service calls.

Of course, the above architecture has limited High Availability since all the Subnets are created within the same Availability Zone. We can avoid this by creating multiple Subnets in multiple Availability Zones.

Figure: Public and Private Subnets with multiple Availability Zones

- Additional Subnets (Public and Private) are created in another Availability Zone.
- Both Private Subnets are attached to the Main Route Table.
- Both Public Subnets are attached to the same Custom Route Table.
- Instances in the Private Subnets still continue to use the NAT Instance for outbound internet connectivity.

Though we increased the High Availability by utilizing multiple Availability Zones, the NAT Instance is still a Single Point of Failure. A NAT Instance is just another EC2 Instance that can become unavailable at any time.
The updated architecture uses two NAT Instances to provide failover and High Availability for the NAT Instances.

Figure: NAT Instance High Availability

- Each Subnet is associated with its own Route Table.
- NAT1 is provisioned in Public Subnet 1.
- NAT2 is provisioned in Public Subnet 2.
- Private Subnet 1's Route Table (RT) has a routing entry to NAT1 for internet traffic.
- Private Subnet 2's Route Table (RT) has a routing entry to NAT2 for internet traffic.

Figure: NAT Instance HA Illustration

A script can be installed on both NAT Instances so that they monitor each other and swap the route table entries if one of them fails. For example, if NAT1 detects that NAT2 is not responding to its ping requests, it can change the Route Table of Private Subnet 2 to point to NAT1 for internet traffic. Once NAT2 becomes operational again, the swap can be reversed. AWS has pretty good documentation on this and a sample script for the swapping.

Apart from HA, the above architecture also provides better overall throughput, since under normal conditions both NAT Instances can be used to serve the outbound internet requirements of the VPC. If there are workloads that require a lot of outbound internet connectivity, having more than one NAT Instance makes sense. Of course, you are still limited to one NAT Instance per Subnet.
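The swap decision described above can be sketched as a small pure function. This is only an illustration of the logic, not the AWS sample script: the class name, the subnet labels and the choice to leave routes untouched when both NAT Instances are down are assumptions, and the health flags would in practice come from the ping checks.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the failover decision; it does not call any AWS API.
class NatFailoverSketch {

    // Returns, for each private subnet, the NAT instance its internet route should target.
    // A subnet keeps its own NAT unless that NAT is down and the other one is healthy.
    static Map<String, String> routeTargets(boolean nat1Healthy, boolean nat2Healthy) {
        Map<String, String> targets = new LinkedHashMap<>();
        targets.put("private-subnet-1", (nat1Healthy || !nat2Healthy) ? "NAT1" : "NAT2");
        targets.put("private-subnet-2", (nat2Healthy || !nat1Healthy) ? "NAT2" : "NAT1");
        return targets;
    }

    public static void main(String[] args) {
        System.out.println("both up:   " + routeTargets(true, true));
        System.out.println("NAT2 down: " + routeTargets(true, false));
        System.out.println("NAT1 down: " + routeTargets(false, true));
    }
}
```

Because the function is recomputed from the current health flags, the reverse swap falls out for free: once the failed NAT Instance reports healthy again, its subnet’s route goes back to it.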

March 28, 2013

by Raghuraman Balachandran

· 18,824 Views