Data Engineering Resources

The Latest Data Engineering Topics

Neo4j/Cypher: Returning a Row with Zero Count When No Relationship Exists

I’ve been trying to see if I can match some of the football stats that OptaJoe posts on Twitter, and one that I was looking at yesterday was about the number of red cards different teams have received:

    1 – Sunderland have picked up their first PL red card of the season. The only team without one now are Man Utd. Angels.

As a refresher, the sub graph we need links each player to the matches they were sent off in, and to a per-match performance node that points at both the match and the team.

I started off with the following query, which traverses out from each match, finds the players who were sent off in the match and then groups the sendings off by the team they were playing for:

    START game = node:matches('match_id:*')
    MATCH game<-[:sent_off_in]-player-[:played]->likeThis-[:in]->game,
          likeThis-[:for]->team
    RETURN team.name, COUNT(game) AS redCards
    ORDER BY redCards
    LIMIT 5

When we run this we get the following results:

    +------------------------------+
    | team.name         | redCards |
    +------------------------------+
    | "Sunderland"      | 1        |
    | "West Ham United" | 1        |
    | "Norwich City"    | 1        |
    | "Reading"         | 1        |
    | "Liverpool"       | 2        |
    +------------------------------+
    5 rows

The problem is that it hasn’t returned Manchester United: they haven’t yet received any red cards, so none of their players match the ‘sent_off_in’ relationship. I ran into something similar in a post I wrote about a month ago where I was working out which day of the week players scored on.

The first step towards getting Manchester United to return with a count of 0 is to make the ‘sent_off_in’ relationship optional. However, that on its own isn’t enough, because COUNT(game) now returns a count of all the player performances for each team:

    START game = node:matches('match_id:*')
    MATCH game<-[?:sent_off_in]-player-[:played]->likeThis-[:in]->game,
          likeThis-[:for]->team
    RETURN team.name, COUNT(game) AS redCards
    ORDER BY redCards ASC
    LIMIT 5

    +-----------------------------+
    | team.name        | redCards |
    +-----------------------------+
    | "Chelsea"        | 448      |
    | "Wigan Athletic" | 459      |
    | "Fulham"         | 460      |
    | "Liverpool"      | 466      |
    | "Everton"        | 467      |
    +-----------------------------+
    5 rows

Instead, what we need to do is collect up all the ‘sent_off_in’ relationships and count them. We can use the COLLECT function to do that, and the neat thing about COLLECT is that it doesn’t bother collecting the empty relationships, so we end up with exactly what we need:

    START game = node:matches('match_id:*')
    MATCH game<-[r?:sent_off_in]-player-[:played]->likeThis-[:in]->game,
          likeThis-[:for]->team
    RETURN team.name, COLLECT(r) AS redCards
    LIMIT 5

    +-----------------------------------------------------------------------------------------------------+
    | team.name          | redCards                                                                       |
    +-----------------------------------------------------------------------------------------------------+
    | "Wigan Athletic"   | [:sent_off_in[26443] {},:sent_off_in[37785] {}]                                |
    | "Everton"          | [:sent_off_in[6795] {minute:61},:sent_off_in[21735] {},:sent_off_in[34594] {}] |
    | "Newcastle United" | [:sent_off_in[434] {minute:75},:sent_off_in[32389] {},:sent_off_in[34915] {}]  |
    | "Southampton"      | [:sent_off_in[49393] {minute:70},:sent_off_in[49392] {minute:82}]              |
    | "West Ham United"  | [:sent_off_in[21734] {minute:67}]                                              |
    +-----------------------------------------------------------------------------------------------------+
    5 rows

We then just need to call the LENGTH function to work out how many red cards there are in each collection and we’re done:

    START game = node:matches('match_id:*')
    MATCH game<-[r?:sent_off_in]-player-[:played]->likeThis-[:in]->game,
          likeThis-[:for]->team
    RETURN team.name, LENGTH(COLLECT(r)) AS redCards
    ORDER BY redCards
    LIMIT 5

    +--------------------------------+
    | team.name           | redCards |
    +--------------------------------+
    | "Manchester United" | 0        |
    | "West Ham United"   | 1        |
    | "Sunderland"        | 1        |
    | "Norwich City"      | 1        |
    | "Reading"           | 1        |
    +--------------------------------+
    5 rows
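The difference between counting rows and counting only the relationships that exist can also be modelled outside the database. The following is a minimal Java sketch of that idea, not Neo4j code: the `RedCardCount` class, the `Row` type and the sample rows are all hypothetical, and a `null` relationship id stands in for an optional `sent_off_in` relationship that did not match.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RedCardCount {

    /** One matched row: a team plus the id of an optional sent_off_in relationship (null when absent). */
    static final class Row {
        final String team;
        final Integer sentOffId;

        Row(String team, Integer sentOffId) {
            this.team = team;
            this.sentOffId = sentOffId;
        }
    }

    /** Counts every row per team, like counting a value that is always present. */
    static Map<String, Integer> countRows(List<Row> rows) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Row row : rows) {
            counts.merge(row.team, 1, Integer::sum);
        }
        return counts;
    }

    /** Counts only rows whose optional relationship is present; teams with none still appear with 0. */
    static Map<String, Integer> countRedCards(List<Row> rows) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Row row : rows) {
            counts.merge(row.team, row.sentOffId == null ? 0 : 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: two performances per team, one red card in total.
        List<Row> rows = Arrays.asList(
            new Row("Manchester United", null),
            new Row("Manchester United", null),
            new Row("Sunderland", 101),
            new Row("Sunderland", null));

        System.out.println(countRows(rows));     // {Manchester United=2, Sunderland=2}
        System.out.println(countRedCards(rows)); // {Manchester United=0, Sunderland=1}
    }
}
```

The first method mirrors what happened with COUNT(game) once the relationship became optional; the second mirrors the effect of taking the length of a collection that skips missing values.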

April 30, 2013

by Mark Needham

· 5,881 Views

XStream – XStreamely Easy Way to Work with XML Data in Java

From time to time there is a moment when we have to deal with XML data, and most of the time it is not the happiest day in our life. There is even a term, “XML hell”, describing the situation when a programmer has to deal with many XML configuration files that are hard to comprehend. But, like it or not, sometimes we have no choice, mostly because the specification from the client says something like “use configuration written in an XML file”. In such cases XStream comes with some very cool features that make dealing with XML much less painful.

Overview

XStream is a small library to serialize data between Java objects and XML. It is lightweight, has a nice API and, most importantly, works with and without custom annotations, which we might not be allowed to add when we are not the owner of the Java classes.

First example

Suppose we have a requirement to load configuration from an XML file:

    <config>
        <inputFile>/users/tomek/work/mystuff/input.csv</inputFile>
        <trustStoreFile>/users/tomek/work/mystuff/truststore.ts</trustStoreFile>
        <keyStoreFile>/users/tomek/work/mystuff/cn-user.jks</keyStoreFile>
        <trustStorePassword>password</trustStorePassword>
        <keyStorePassword>password</keyStorePassword>
        <user>user</user>
        <password>secret</password>
    </config>

And we want to load it into a Configuration object:

    public class Configuration {
        private String inputFile;
        private String user;
        private String password;
        private String trustStoreFile;
        private String keyStoreFile;
        private String keyStorePassword;
        private String trustStorePassword;
        // getters, setters, etc.
    }

So basically what we have to do is:

    FileReader fileReader = new FileReader("config.xml"); // load our XML file
    XStream xstream = new XStream();                      // init XStream
    // define root alias so XStream knows which element and which class are equivalent
    xstream.alias("config", Configuration.class);
    Configuration loadedConfig = (Configuration) xstream.fromXML(fileReader);

And that’s all, easy peasy.

Something more serious

OK, but the previous example is very basic, so now let’s do something more complicated: real XML returned by a real web service.

    <data>
        <ban>
            <updated_at>2013-03-09</updated_at>
            <troublemaker>
                <name1>John</name1>
                <name2>Example</name2>
                <age>24</age>
                <number>asd123123</number>
            </troublemaker>
        </ban>
        <ban>
            <updated_at>2012-03-10</updated_at>
            <troublemaker>
                <name1>Anna</name1>
                <name2>Baker</name2>
                <age>26</age>
                <number>axn567890</number>
            </troublemaker>
        </ban>
        <ban>
            <updated_at>2010-12-05</updated_at>
            <troublemaker>
                <name1>Tom</name1>
                <name2>Meadow</name2>
                <number>sgh08945</number>
                <age>48</age>
            </troublemaker>
        </ban>
    </data>

What we have here is a simple list of bans written in XML. We want to load it into a collection of Ban objects, so let’s prepare some classes (getters/setters/toString omitted):

    public class Data {
        private List<Ban> bans = new ArrayList<Ban>();
    }

    public class Ban {
        private String dateOfUpdate;
        private Person person;
    }

    public class Person {
        private String firstName;
        private String lastName;
        private int age;
        private String documentNumber;
    }

As you can see, there is some naming and type mismatch between the XML and the Java classes (e.g. element name1 maps to field firstName, and dateOfUpdate is a String, not a date), but it is here for example purposes. The goal is to parse the XML and get a Data object with a populated collection of Ban instances containing correct data. Let’s see how it can be achieved.

Parse with annotations

The first, easier way is to use annotations. That is the suggested approach when we can modify the Java classes to which the XML will be mapped. So we have:

    @XStreamAlias("data") // maps the data element in XML to this class
    public class Data {
        // Here is something more complicated. If we have a list of elements that are
        // not wrapped in an element representing a list (like in our XML:
        // multiple <ban> elements not wrapped inside a collection element),
        // we have to declare that we want to treat these elements as an implicit list
        // so they can be converted to a list of objects.
        @XStreamImplicit(itemFieldName = "ban")
        private List<Ban> bans = new ArrayList<Ban>();
    }

    @XStreamAlias("ban") // another mapping
    public class Ban {
        /* We want to have different field names in Java classes,
           so we define what element should be mapped to each field */
        @XStreamAlias("updated_at")
        private String dateOfUpdate;

        @XStreamAlias("troublemaker")
        private Person person;
    }

    @XStreamAlias("troublemaker")
    public class Person {
        @XStreamAlias("name1")
        private String firstName;

        @XStreamAlias("name2")
        private String lastName;

        @XStreamAlias("age") // string will be auto-converted to an int value
        private int age;

        @XStreamAlias("number")
        private String documentNumber;
    }

And the actual parsing logic is very short:

    FileReader reader = new FileReader("file.xml"); // load file
    XStream xstream = new XStream();
    xstream.processAnnotations(Data.class);   // inform XStream to parse annotations in the Data class
    xstream.processAnnotations(Ban.class);    // and in the two other classes...
    xstream.processAnnotations(Person.class); // we use for mappings
    Data data = (Data) xstream.fromXML(reader); // parse

    // print some data to the console to see if the results are correct
    System.out.println("Number of bans = " + data.getBans().size());
    Ban firstBan = data.getBans().get(0);
    System.out.println("First ban = " + firstBan.toString());

As you can see, annotations are very easy to use and the final code is very concise. But what do we do when we can’t modify the mapping classes? We can use a different approach that doesn’t require any modifications to the Java classes representing the XML data.

Parse without annotations

When we can’t enrich our model classes with annotations, we can define all mapping details using methods of the XStream object:

    FileReader reader = new FileReader("file.xml"); // the first three lines are easy,
    XStream xstream = new XStream();                // same initialisation as in the
    xstream.alias("data", Data.class);              // basic example above
    xstream.alias("ban", Ban.class);                // two more aliases to map...
    xstream.alias("troublemaker", Person.class);    // between node names and classes

    // We want to have different field names in Java classes, so
    // we have to use aliasField(alias, definedIn, fieldName)
    xstream.aliasField("updated_at", Ban.class, "dateOfUpdate");
    xstream.aliasField("troublemaker", Ban.class, "person");
    xstream.aliasField("name1", Person.class, "firstName");
    xstream.aliasField("name2", Person.class, "lastName");
    xstream.aliasField("age", Person.class, "age"); // notice that the XML text will be auto-converted to int "age"
    xstream.aliasField("number", Person.class, "documentNumber");

    /* another way to define an implicit collection */
    xstream.addImplicitCollection(Data.class, "bans");

    Data data = (Data) xstream.fromXML(reader); // do the actual parsing

    // let's print the results to check if the data was parsed
    System.out.println("Number of bans = " + data.getBans().size());
    Ban firstBan = data.getBans().get(0);
    System.out.println("First ban = " + firstBan.toString());

As you can see, XStream makes it easy to convert more complicated XML structures into Java objects, and it also lets us tune the result by using different names when the ones from the XML don’t suit our needs. But one thing should catch your attention: we are converting XML representing a date into a raw String, which isn’t quite what we would like to get as a result. That’s why we will add a converter to do the job for us.

Using an existing type converter

The XStream library comes with a set of built-in converters for the most common use cases. We will use DateConverter. Our Ban class now looks like this:

    public class Ban {
        private Date dateOfUpdate;
        private Person person;
    }

To use DateConverter we simply have to register it with the date format that we expect to appear in the XML data (note the upper-case MM for month; lower-case mm means minutes):

    xstream.registerConverter(new DateConverter("yyyy-MM-dd", new String[] {}));

And that’s it. Now, instead of a String, our object is populated with a Date instance. Cool and easy! But what about classes and situations that aren’t covered by existing converters? We could write our own.

Writing a custom converter from scratch

Assume that instead of dateOfUpdate we want to know how many days ago the update was done:

    public class Ban {
        private int daysAgo;
        private Person person;
    }

Of course we could calculate it manually for each Ban object, but using a converter that does this job for us looks more interesting. Our DaysAgoConverter must implement the Converter interface, so we have to implement three methods with signatures that look a little bit scary:

    public class DaysAgoConverter implements Converter {
        @Override
        public void marshal(Object source, HierarchicalStreamWriter writer, MarshallingContext context) {
        }

        @Override
        public Object unmarshal(HierarchicalStreamReader reader, UnmarshallingContext context) {
        }

        @Override
        public boolean canConvert(Class type) {
            return false;
        }
    }

The last one is easy, as we will convert only the Integer class. But there are still two methods left with these HierarchicalStreamWriter, MarshallingContext, HierarchicalStreamReader and UnmarshallingContext parameters. Luckily, we can avoid dealing with them by using AbstractSingleValueConverter, which shields us from such low-level mechanisms. Now our class looks much better:

    public class DaysAgoConverter extends AbstractSingleValueConverter {
        @Override
        public boolean canConvert(Class type) {
            return type.equals(Integer.class);
        }

        @Override
        public Object fromString(String str) {
            return null;
        }

        public String toString(Object obj) {
            return null;
        }
    }

Additionally, we must override the toString(Object obj) method defined in AbstractSingleValueConverter, as we want to store in the XML a date calculated from the Integer, not the plain Object.toString value that the default toString in the abstract parent would return.

The implementation below is pretty straightforward, and the most interesting lines are commented. I’ve skipped all validation to make this example shorter.

    public class DaysAgoConverter extends AbstractSingleValueConverter {
        // default date format that will be used in conversion
        private final static String FORMAT = "yyyy-MM-dd";
        // current day at midnight
        private final DateTime now = DateTime.now().toDateMidnight().toDateTime();

        public boolean canConvert(Class type) {
            return type.equals(Integer.class); // converter works only with Integers
        }

        @Override
        public Object fromString(String str) {
            SimpleDateFormat format = new SimpleDateFormat(FORMAT);
            try {
                Date date = format.parse(str);
                // we simply calculate the days between the two dates using Joda-Time
                return Days.daysBetween(new DateTime(date), now).getDays();
            } catch (ParseException e) {
                throw new RuntimeException("Invalid date format in " + str);
            }
        }

        public String toString(Object obj) {
            if (obj == null) {
                return null;
            }
            Integer daysAgo = ((Integer) obj);
            // here we subtract days from now and return a formatted date string
            return now.minusDays(daysAgo).toString(FORMAT);
        }
    }

Usage

To use our custom converter for a specific field, we have to tell the XStream object about it using registerLocalConverter:

    xstream.registerLocalConverter(Ban.class, "daysAgo", new DaysAgoConverter());

We are using the “local” method to apply this conversion only to a specific field and not to every Integer field in the XML file. After that we will get our Ban objects populated with the number of days instead of a date.

Summary

That’s all I wanted to show you in this post. Now you have basic knowledge of what XStream is capable of and how it can be used to easily map XML data to Java objects. If you need something more advanced, please check the official project page, as it contains very good documentation and examples.
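The date arithmetic inside the days-ago converter can be tried on its own, without XStream or Joda-Time. The sketch below is an assumption-laden stand-in rather than the article's converter: it uses the standard `java.time` API (Java 8 or later), the `DaysAgo` class and its method names are hypothetical, and "today" is passed in explicitly so the result is reproducible.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

public class DaysAgo {
    // Upper-case MM means month; lower-case mm would mean minutes.
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    /** Parses a yyyy-MM-dd date and returns how many whole days lie between it and today. */
    static int fromString(String date, LocalDate today) {
        return (int) ChronoUnit.DAYS.between(LocalDate.parse(date, FORMAT), today);
    }

    /** Reverse direction: turns a number of days back into a formatted date string. */
    static String toDateString(int daysAgo, LocalDate today) {
        return today.minusDays(daysAgo).format(FORMAT);
    }

    public static void main(String[] args) {
        // Fixed reference date (an assumption for the example) instead of the system clock.
        LocalDate today = LocalDate.of(2013, 4, 23);

        System.out.println(fromString("2013-03-09", today)); // 45
        System.out.println(toDateString(45, today));         // 2013-03-09
    }
}
```

Passing the reference date as a parameter, rather than reading the clock inside the class, is a design choice that keeps the two conversions easy to check against each other.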

April 23, 2013

by Tomasz Dziurko

· 24,889 Views

Multipart Upload on S3 with jclouds

1. Goal

In the previous article, we looked at how we can use the generic Blob APIs from jclouds to upload content to S3. In this article we will use the S3-specific asynchronous API from jclouds to upload content and leverage the multipart upload functionality provided by S3.

2. Preparation

2.1. Set up the custom API

The first part of the upload process is creating the jclouds API. This is a custom API for Amazon S3:

    public AWSS3AsyncClient s3AsyncClient() {
        String identity = ...
        String credentials = ...

        BlobStoreContext context = ContextBuilder.newBuilder("aws-s3")
            .credentials(identity, credentials)
            .buildView(BlobStoreContext.class);

        RestContext<AWSS3Client, AWSS3AsyncClient> providerContext = context.unwrap();
        return providerContext.getAsyncApi();
    }

2.2. Determining the number of parts for the content

Amazon S3 requires each part of a multipart upload, except the last one, to be at least 5 MB. As such, the first thing we need to do is determine the right number of parts that we can split our content into so that we don’t have parts below this 5 MB limit:

    public static int getMaximumNumberOfParts(byte[] byteArray) {
        int numberOfParts = byteArray.length / fiveMB; // 5*1024*1024
        if (numberOfParts == 0) {
            return 1;
        }
        return numberOfParts;
    }

2.3. Breaking the content into parts

We’re going to break the byte array into a set number of parts:

    public static List<byte[]> breakByteArrayIntoParts(byte[] byteArray, int maxNumberOfParts) {
        List<byte[]> parts = Lists.<byte[]> newArrayListWithCapacity(maxNumberOfParts);
        int fullSize = byteArray.length;
        long dimensionOfPart = fullSize / maxNumberOfParts;
        for (int i = 0; i < maxNumberOfParts; i++) {
            int previousSplitPoint = (int) (dimensionOfPart * i);
            int splitPoint = (int) (dimensionOfPart * (i + 1));
            if (i == (maxNumberOfParts - 1)) {
                splitPoint = fullSize; // the last part absorbs the remainder
            }
            byte[] partBytes = Arrays.copyOfRange(byteArray, previousSplitPoint, splitPoint);
            parts.add(partBytes);
        }
        return parts;
    }

We’re going to test the logic of breaking the byte array into parts: we generate some bytes, split the byte array, recompose it using Guava and verify that we get back the original:

    @Test
    public void given16MByteArray_whenFileBytesAreSplitInto3_thenTheSplitIsCorrect() {
        byte[] byteArray = randomByteData(16);

        int maximumNumberOfParts = S3Util.getMaximumNumberOfParts(byteArray);
        List<byte[]> fileParts = S3Util.breakByteArrayIntoParts(byteArray, maximumNumberOfParts);

        assertThat(fileParts.get(0).length + fileParts.get(1).length + fileParts.get(2).length,
            equalTo(byteArray.length));
        byte[] unmultiplexed = Bytes.concat(fileParts.get(0), fileParts.get(1), fileParts.get(2));
        assertThat(byteArray, equalTo(unmultiplexed));
    }

To generate the data, we simply use the support from Random:

    byte[] randomByteData(int mb) {
        byte[] randomBytes = new byte[mb * 1024 * 1024];
        new Random().nextBytes(randomBytes);
        return randomBytes;
    }

2.4. Creating the Payloads

Now that we have determined the correct number of parts for our content and managed to break the content into parts, we need to generate the Payload objects for the jclouds API:

    public static List<Payload> createPayloadsOutOfParts(Iterable<byte[]> fileParts) {
        List<Payload> payloads = Lists.newArrayList();
        for (byte[] filePart : fileParts) {
            byte[] partMd5Bytes = Hashing.md5().hashBytes(filePart).asBytes();
            Payload partPayload = Payloads.newByteArrayPayload(filePart);
            partPayload.getContentMetadata().setContentLength((long) filePart.length);
            partPayload.getContentMetadata().setContentMD5(partMd5Bytes);
            payloads.add(partPayload);
        }
        return payloads;
    }

3. Upload

The upload process is a flexible multi-step process. This means:

- the upload can be started before having all the data: data can be uploaded as it’s coming in
- data is uploaded in chunks: if one of these operations fails, it can simply be retried
- chunks can be uploaded in parallel: this can greatly increase the upload speed, especially in the case of large files

3.1. Initiating the Upload operation

The first step in the Upload operation is to initiate the process. This request to S3 must contain the standard HTTP headers; the Content-MD5 header in particular needs to be computed. We’re going to use the Guava hash function support here:

    Hashing.md5().hashBytes(byteArray).asBytes();

This is the MD5 hash of the entire byte array, not of the parts yet.

To initiate the upload, and for all further interactions with S3, we’re going to use the AWSS3AsyncClient, the asynchronous API we created earlier:

    ObjectMetadata metadata = ObjectMetadataBuilder.create().key(key).contentMD5(md5Bytes).build();
    String uploadId = s3AsyncApi.initiateMultipartUpload(container, metadata).get();

The key is the handle assigned to the object. This needs to be a unique identifier specified by the client.

Also notice that, even though we’re using the async version of the API, we’re blocking for the result of this operation, because we need the result of the initiate call to be able to move forward.

The result of the operation is an upload id returned by S3. This identifies the upload throughout its lifecycle and will be present in all subsequent upload operations.

3.2. Uploading the Parts

The next step is uploading the parts. Our goal here is to send these requests in parallel, as the upload part operations represent the bulk of the upload process:

    List<ListenableFuture<String>> ongoingOperations = Lists.newArrayList();
    for (int partNumber = 0; partNumber < filePartsAsByteArrays.size(); partNumber++) {
        ListenableFuture<String> future = s3AsyncApi.uploadPart(
            container, key, partNumber + 1, uploadId, payloads.get(partNumber));
        ongoingOperations.add(future);
    }

The part numbers need to be continuous, but the order in which the requests are sent is not relevant.

After all of the upload part requests have been submitted, we need to wait for their responses so that we can collect the individual ETag value of each part:

    Function<ListenableFuture<String>, String> getEtagFromOp =
        new Function<ListenableFuture<String>, String>() {
            public String apply(ListenableFuture<String> ongoingOperation) {
                try {
                    return ongoingOperation.get();
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            }
        };
    List<String> etagsOfParts = Lists.transform(ongoingOperations, getEtagFromOp);

If, for whatever reason, one of the upload part operations fails, the operation can be retried until it succeeds. The logic above does not contain the retry mechanism, but building it in should be straightforward enough.

3.3. Completing the Upload operation

The final step of the upload process is completing the multipart operation. The S3 API requires the responses from the previous part uploads as a Map, which we can now easily create from the list of ETags that we obtained above:

    Map<Integer, String> parts = Maps.newHashMap();
    for (int i = 0; i < etagsOfParts.size(); i++) {
        parts.put(i + 1, etagsOfParts.get(i));
    }

And finally, send the complete request:

    s3AsyncApi.completeMultipartUpload(container, key, uploadId, parts).get();

This returns the final ETag of the finished object and completes the entire upload process.

4. Conclusion

In this article we built a multipart-enabled, fully parallel upload operation to S3, using the custom S3 jclouds API. This operation is ready to be used as is, but it can be improved in a few ways.

First, retry logic should be added around the upload operations to better deal with failures.

Next, for really large files, even though the mechanism is sending all upload part requests in parallel, a throttling mechanism should still limit the number of parallel requests being sent. This is both to avoid bandwidth becoming a bottleneck and to make sure Amazon itself doesn’t flag the upload process as exceeding an allowed limit of requests per second. The Guava RateLimiter can potentially be very well suited for this.
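The part-splitting step can be exercised without jclouds or Guava. The following is a minimal sketch using only the JDK; the `PartSplitter` class and its method names are hypothetical, and it assumes the same rule as the article: integer division by 5 MB gives the part count, and the last part takes whatever remains.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartSplitter {
    // Minimum size used for every part except possibly the last one.
    static final int MIN_PART_SIZE = 5 * 1024 * 1024;

    /** Largest number of parts such that an even split keeps every part at or above the minimum. */
    static int numberOfParts(int contentLength) {
        return Math.max(1, contentLength / MIN_PART_SIZE);
    }

    /** Splits the content into the given number of parts; the last part absorbs the remainder. */
    static List<byte[]> split(byte[] content, int parts) {
        List<byte[]> result = new ArrayList<>(parts);
        int partSize = content.length / parts;
        for (int i = 0; i < parts; i++) {
            int from = i * partSize;
            int to = (i == parts - 1) ? content.length : from + partSize;
            result.add(Arrays.copyOfRange(content, from, to));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] content = new byte[16 * 1024 * 1024]; // 16 MB of zeros is enough to check sizes
        List<byte[]> parts = split(content, numberOfParts(content.length));
        for (byte[] part : parts) {
            System.out.println(part.length);
        }
    }
}
```

For a 16 MB input this yields three parts, each slightly larger than 5 MB, which is why the test in the article can index exactly three elements.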

April 21, 2013

by Eugen Paraschiv

· 6,614 Views · 1 Like

Coalition or Council: Which One Are You?

I have been thinking about institutions that strive for change. Sometimes we call them communities or organizations, sometimes we call them alliances or parties. But whatever their nature, these institutions are usually led and managed by a small group of people. I see two kinds of leading groups: coalitions and councils.

coalition: A temporary alliance of distinct parties, persons, or states for joint action

council: A group elected or appointed as an advisory or legislative body

Coalitions

A coalition is a self-selecting team. The persons seek each other out because they want to be active agents for change, and by working together they can be more successful in achieving a common goal. In his change management books John Kotter referred to them as guiding coalitions. They are not elected. They are not appointed. They select each other because they want to. And they can even work undercover, because their goal is to influence, not to govern.

The allied powers in World War II were a coalition. The Google founders were a coalition. The originators of the Stoos Network were a coalition.

Councils

A council is a group of representatives. These people also want to be active agents for change. But their primary concern is to have buy-in from the larger group of people they are representing within the institute (community, organization, or party). The concept of democracy has led to many different versions of these councils. Sometimes we call them a government. Sometimes a committee. And everything has to be out in the open, because if it’s not, we call them cronies. Their goal is primarily to govern or advise the institute.

The United Nations has a council. My former students society had a council. And many workplaces have management teams acting as councils.

And you?

If you have a group of people who all desire change, do you lead with a coalition or with a council?

This is the big problem with some alliances and consortiums for change. They have directors who try to be both. It is a recipe for disaster. Maybe the best institutions have both: a coalition and a council.

(image from Veni Markovski)

April 21, 2013

by Jurgen Appelo

· 7,078 Views

What Does a Java Array Look Like in Memory?

Arrays in Java store one of two things: either primitive values (int, char, …) or references (a.k.a. pointers). When an object is created using “new”, memory is allocated on the heap and a reference is returned. This is also true for arrays.

1. Single-dimension array

    int arr[] = new int[3];

The variable arr is just a reference to the array of 3 integers. If you create an array of 10 integers, it is the same: an array is allocated and a reference is returned.

2. Two-dimensional array

How about a 2-dimensional array? Actually, we can only have one-dimensional arrays in Java. 2D arrays are basically just one-dimensional arrays of one-dimensional arrays.

    int[][] arr = new int[3][];
    arr[0] = new int[3];
    arr[1] = new int[5];
    arr[2] = new int[4];

Multi-dimensional arrays follow the same rules.

3. Where are they located in memory?

From the above, there are arrays and reference variables in memory. As we know, the JVM runtime data areas include the heap, the JVM stack, and others. Using the simple example below, let’s see where an object and its reference are stored.

    class A {
        int x;
        int y;
    }

    ...

    public void m1() {
        int i = 0;
        m2();
    }

    public void m2() {
        A a = new A();
    }

    ...

When m1 is invoked, a new frame (Frame-1) is pushed onto the stack, and the local variable i is created in Frame-1. When m2 is invoked inside m1, another new frame (Frame-2) is pushed onto the stack. In m2, an object of class A is created on the heap and the reference variable a is put in Frame-2. At this point the stack holds Frame-1 (with i) and Frame-2 (with the reference a), and the heap holds the A object that a points to.

Arrays are treated the same way as objects, so how an array is laid out in memory is straightforward: the array itself lives on the heap, and a local variable referring to it lives in the stack frame of the method that declares it.
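The "array of arrays" layout can be observed directly. The sketch below is a small illustration with a hypothetical class name, `JaggedArrayDemo`: each row is a separate array object with its own length, and a second variable pointing at a row refers to the same object rather than a copy.

```java
public class JaggedArrayDemo {

    /** Builds a two-dimensional array whose rows have different lengths. */
    static int[][] build() {
        int[][] arr = new int[3][]; // three row references, all null at first
        arr[0] = new int[3];
        arr[1] = new int[5];
        arr[2] = new int[4];
        return arr;
    }

    public static void main(String[] args) {
        int[][] arr = build();

        // The outer array holds 3 references; each row reports its own length.
        System.out.println(arr.length);    // 3
        System.out.println(arr[0].length); // 3
        System.out.println(arr[1].length); // 5
        System.out.println(arr[2].length); // 4

        // Copying a reference does not copy the row: both names see the same array.
        int[] row = arr[1];
        row[0] = 42;
        System.out.println(arr[1][0]);     // 42
    }
}
```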

April 19, 2013

by Ryan Wang

· 31,338 Views · 1 Like

Upload on S3 with the jclouds Library

There are several good ways to upload content to an S3 bucket in the Java world – in this article we’ll look at what the jclouds library provides for this purpose. To use jclouds – specifically the APIs discussed in this article, this simple Maven dependency should be added to the pom of the project: org.jclouds jclouds-allblobstore 1.5.9 1. Uploading to Amazon S3 The first step, in order to access any of these APIs, is to create a BlobStoreContext: BlobStoreContext context = ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials) .buildView(BlobStoreContext.class); This represents the entry-point to a general key-value storage service, such as Amazon S3 – but not limited to it. For the more specific S3 only implementation, the context can be created similarly: BlobStoreContext context = ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials) .buildView(S3BlobStoreContext.class); And even more specifically: BlobStoreContext context = ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials) .buildView(AWSS3BlobStoreContext.class); When the authenticated context is no longer needed, closing it is required to release all resources – threads and connections – associated to it. 2. The four S3 APIs of jclouds The jclouds library provides four different APIs to upload content to S3 bucket, ranging from simple but inflexible to complex and powerful, all obtained via the BlobStoreContext. Let’s start with the simplest. 2.1. Upload via the Map API The easiest way jclouds can be used to interact with an S3 bucket is by representing that bucket as a Map. The API is obtained from the context: InputStreamMap bucket = context.createInputStreamMap("bucketName"); Then, to upload a simple HTML file: bucket.putString("index1.html", "hello world1"); The InputStreamMap API exposes several other types of PUT operations – files, raw bytes – both for single and bulk. 
A simple integration test can be used as an example: @Test public void whenFileIsUploadedToS3WithMapApi_thenNoExceptions() { BlobStoreContext context = ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials) .buildView(AWSS3BlobStoreContext.class); InputStreamMap bucket = context.createInputStreamMap("bucketName"); bucket.putString("index1.html", "hello world1"); context.close(); } 2.2. Upload via BlobMap Using the simple Map API is straightforward but ultimately limited – for example, there is no way to pass in metadata about the content being uploaded. When more flexibility and customization is necessary, this simplified approach to uploading data to S3 via a Map is no longer enough. The next API we’ll look at is the Blob Map API – this is obtained from the context: BlobMap bucket = context.createBlobMap("bucketName"); The API allows the client to access more lower level details, such as Content-Length, Content-Type, Content-Encoding, eTag hash and others; to upload new content in the bucket: Blob blob = bucket.blobBuilder().name("index2.html"). payload("hello world2"). contentType("text/html").calculateMD5().build(); The API also allows setting a variety of payloads on the create request. A simple integration test for uploading a basic HTML file to S3 via the Blob Map API: @Test public void whenFileIsUploadedToS3WithBlobMap_thenNoExceptions() throws IOException { BlobStoreContext context = ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials) .buildView(AWSS3BlobStoreContext.class); BlobMap bucket = context.createBlobMap("bucketName"); Blob blob = bucket.blobBuilder().name("index2.html"). payload("hello world2"). contentType("text/html").calculateMD5().build(); bucket.put(blob.getMetadata().getName(), blob); context.close(); } 2.3. Upload via BlobStore The previous APIs had no way to upload content using multipart upload – this makes them ill suited when working with large files. 
This limitation is addressed by the next API we’re going to look at: the synchronous BlobStore API. This is obtained from the context:

    BlobStore blobStore = context.getBlobStore();

To use the multipart support and upload a file to S3:

    Blob blob = blobStore.blobBuilder("index3.html")
        .payload("hello world3").contentType("text/html").build();
    blobStore.putBlob("bucketName", blob, PutOptions.Builder.multipart());

The payload builder is the same one used by the BlobMap API, so the same flexibility in specifying lower-level metadata about the blob is available here. The difference is the PutOptions supported by the PUT operation of the API, namely the multipart support.

The previous integration test now has multipart enabled:

    @Test
    public void whenFileIsUploadedToS3WithBlobStore_thenNoExceptions() {
        BlobStoreContext context = ContextBuilder.newBuilder("aws-s3")
            .credentials(identity, credentials)
            .buildView(AWSS3BlobStoreContext.class);

        BlobStore blobStore = context.getBlobStore();
        Blob blob = blobStore.blobBuilder("index3.html")
            .payload("hello world3").contentType("text/html").build();
        blobStore.putBlob("bucketName", blob, PutOptions.Builder.multipart());

        context.close();
    }

2.4. Upload via AsyncBlobStore

While the previous BlobStore API was synchronous, there is also an asynchronous API for BlobStore: AsyncBlobStore. The API is similarly obtained from the context:

    AsyncBlobStore blobStore = context.getAsyncBlobStore();

The only difference between the two is that the async API returns a ListenableFuture for the asynchronous PUT operation:

    Blob blob = blobStore.blobBuilder("index4.html")
        .payload("hello world4").build();
    blobStore.putBlob("bucketName", blob).get();

The integration test displaying this operation is similar to the synchronous one, except that it obtains the AsyncBlobStore and waits on the returned future:

    @Test
    public void whenFileIsUploadedToS3WithAsyncBlobStore_thenNoExceptions()
        throws InterruptedException, ExecutionException {
        BlobStoreContext context = ContextBuilder.newBuilder("aws-s3")
            .credentials(identity, credentials)
            .buildView(AWSS3BlobStoreContext.class);

        AsyncBlobStore blobStore = context.getAsyncBlobStore();
        Blob blob = blobStore.blobBuilder("index4.html")
            .payload("hello world4").contentType("text/html").build();
        // putBlob returns immediately; get() blocks until the upload has finished.
        Future<String> putOp = blobStore.putBlob("bucketName", blob, PutOptions.Builder.multipart());
        putOp.get();

        context.close();
    }

3. Conclusion

In this article, we analysed the four APIs that the jclouds library provides to upload content to Amazon S3. These four APIs are generic and also work with other key-value storage services, such as Microsoft Azure Storage.

In the next article we’ll look at the Amazon-specific S3 API available in jclouds: the AWSS3Client. We’ll implement the operation of uploading a large file, dynamically calculate the optimal number of parts for any given file, and perform the upload of all parts in parallel.
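To make the difference between a blocking put and a future-returning put concrete without needing AWS credentials, here is a minimal sketch that uses only the JDK. The in-memory FakeBucket class and its putBlob/putBlobAsync methods are hypothetical stand-ins and not jclouds APIs; CompletableFuture plays the role that ListenableFuture plays in jclouds.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

final class AsyncPutSketch {

    // Hypothetical in-memory stand-in for a bucket; this is NOT a jclouds class.
    static final class FakeBucket {
        private final Map<String, String> blobs = new ConcurrentHashMap<>();

        // Synchronous put: returns only once the content has been stored.
        String putBlob(String name, String payload) {
            blobs.put(name, payload);
            return Integer.toHexString(payload.hashCode()); // pretend ETag
        }

        // Asynchronous put: returns at once with a future that completes with the pretend ETag.
        CompletableFuture<String> putBlobAsync(String name, String payload) {
            return CompletableFuture.supplyAsync(() -> putBlob(name, payload));
        }

        String get(String name) {
            return blobs.get(name);
        }
    }

    public static void main(String[] args) {
        FakeBucket bucket = new FakeBucket();
        CompletableFuture<String> putOp = bucket.putBlobAsync("index4.html", "hello world4");
        // The caller could do other work here; join() blocks until the put has completed.
        String etag = putOp.join();
        System.out.println("stored: " + bucket.get("index4.html") + ", pretend ETag: " + etag);
    }
}
```

The point of the sketch is only the calling pattern: the asynchronous variant hands back a future immediately, and the caller decides when (or whether) to block on it.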

April 18, 2013

by Eugen Paraschiv


Stepping Backwards while Debugging: Move To Line

It happens to me many times: I’m stepping with the debugger through my code, and oops! I made one step too far. What now? Restart the whole debugging session? Actually, there is a way to go ‘backwards’.

GDB has a ‘reverse debugging’ feature, described here. I’m using the Eclipse-based CodeWarrior debugger, and this debug engine is not using GDB. The CodeWarrior debugger in MCU10.3 supports an Eclipse feature: I select a code line in the editor view and use Move To Line. What it does: it changes the current PC (program counter) of the program to that line. Now I can continue debugging from that line, e.g. stepping into that function call.

Yes, this is not true backward debugging, but it is simple and very effective. To perform true backward stepping, the debugger would need to reverse all operations, typically with a rather heavy state machine and data recording. For the usual case where I simply need to go back a few lines, ‘Move To Line’ is perfect.

Of course there are a few points to consider:

This only changes the program counter. Any variable changes etc. are not affected or reverted.

In highly optimized code, there might be multiple sequence points per source line, so doing this for highly optimized code might not work correctly.

It works OK within a function. It is not recommended to use it to set the PC outside of the current function, because the context/stack frame is not set up.

I also use ‘Move To Line’ frequently to ‘advance’ the program execution, e.g. to bypass some long sequences I’m not interested in, or to get out of an ‘endless’ loop. The same ‘Move To Line’ is available while doing assembly stepping too. See this post for details.

Happy line moving!

April 15, 2013

by Erich Styger


ActiveMQ and .NET combined!

ActiveMQ is one of the most popular messaging frameworks, and arguably the most popular open-source one. Many people think that ActiveMQ works only with Java, but this is not true at all. ActiveMQ can work with almost every popular language (including JavaScript!) through the numerous protocols it supports. Today I will show you how to use ActiveMQ in .NET-based solutions.

Project setup

Using VS 2010's Extension Manager I installed NuGet Package Manager. After installation and a VS 2010 restart, I created a project called ActiveMQNMS. I right-clicked it and selected "Manage NuGet packages...". In the search field I typed "ActiveMQ". There was a package called Apache.NMS.ActiveMQ. I installed it. (Note: this package has one dependency, the Apache.NMS package. The NMS package provides a unified API for working with different messaging frameworks and providers.)

Starting ActiveMQ

I already had ActiveMQ installed on my machine. If you don't have it, download it from http://activemq.apache.org. The default instance listens on port 61616. However, mine is listening on 62626. If you want to run my code, please remember to change the port. To start ActiveMQ I executed:

    activemq-5.5.0\bin\activemq

Depending on the configured ports, you can use the ActiveMQ web console to manage your queues, topics, subscribers, connections, embedded Apache Camel, etc. I'm using port 8282, and the console URL is: http://localhost:8282/admin.

Test stub

In general the .NET API is almost a copy of the Java API, so if you're familiar with JMS and/or ActiveMQ you don't need any documentation. Please note the TestInitialize and TestCleanup methods.

    using System;
    using Apache.NMS;
    using Apache.NMS.ActiveMQ;
    using Microsoft.VisualStudio.TestTools.UnitTesting;

    namespace ActiveMQNMS
    {
        [Serializable]
        public class Person
        {
            public string FirstName { get; set; }
            public string LastName { get; set; }
        }

        [TestClass]
        public class ActiveMqTest
        {
            private IConnection _connection;
            private ISession _session;
            private const String QUEUE_DESTINATION = "DotNet.ActiveMQ.Test.Queue";

            [TestInitialize]
            public void TestInitialize()
            {
                IConnectionFactory factory = new ConnectionFactory("tcp://localhost:62626");
                _connection = factory.CreateConnection();
                _connection.Start();
                _session = _connection.CreateSession();
            }

            [TestCleanup]
            public void TestCleanup()
            {
                _session.Close();
                _connection.Close();
            }
        }
    }

Writing Producer

Here is the producer:

    [TestMethod]
    public void TestA()
    {
        IDestination dest = _session.GetQueue(QUEUE_DESTINATION);
        using (IMessageProducer producer = _session.CreateProducer(dest))
        {
            var person = new Person { FirstName = "Łukasz", LastName = "Budnik" };
            var objectMessage = producer.CreateObjectMessage(person);
            producer.Send(objectMessage);
        }
    }

Run the test and refresh the "Queues" list in the ActiveMQ web console. You should see the DotNet.ActiveMQ.Test.Queue queue with 1 enqueued and pending message. Purge the queue by clicking the purge link, or simply delete it.

Writing Consumer

Now we have to consume the message. Here is the code:

    [TestMethod]
    public void TestB()
    {
        Person person = null;
        IDestination dest = _session.GetQueue(QUEUE_DESTINATION);
        using (IMessageConsumer consumer = _session.CreateConsumer(dest))
        {
            IMessage message;
            // Keep receiving until no message arrives within 2 seconds.
            while ((message = consumer.Receive(TimeSpan.FromMilliseconds(2000))) != null)
            {
                var objectMessage = message as IObjectMessage;
                if (objectMessage != null)
                {
                    person = objectMessage.Body as Person;
                    if (person != null)
                    {
                        Assert.AreEqual("Łukasz", person.FirstName);
                        Assert.AreEqual("Budnik", person.LastName);
                    }
                }
                else
                {
                    Assert.Fail("Object Message is null");
                }
            }
        }
        if (person == null)
        {
            Assert.Fail("Person object is null");
        }
    }

Run the tests. Refresh the "Queues" tab in the ActiveMQ web console. You should see 1 message enqueued and 1 message dequeued, as expected.

Summary

That's all. Simple, isn't it? ActiveMQ works very, very nicely with .NET. I still have to find a performance comparison between ActiveMQ and Microsoft or pure C#/.NET messaging frameworks. Or maybe you have one? Please share.

cheers, Łukasz

April 15, 2013

by Łukasz Budnik


Introduction to SmartSVN

SmartSVN is a powerful and easy-to-use graphical client for Apache Subversion. There are several clients for Subversion, but here are just a few reasons you should try SmartSVN:

- It’s cross-platform – SmartSVN runs on Windows, Linux and Mac OS X, so you can continue using the operating system (OS) that works the best for you. It can also be integrated into your OS, via Mac’s Finder Integration or Windows Shell.
- Everything you need, out of the box – SmartSVN comes complete with all the tools you need to manage your Subversion projects:
    - Conflict solver – this feature combines the freedom of a general, three-way merge with the ability to detect and resolve any conflicts that occur during the development lifecycle.
    - File compare – this allows you to make inner-line comparisons and directly edit the compared files.
    - Built-in SSH client – allows users to access servers using the SSH protocol. This security-conscious protocol encrypts every piece of communication between the client and the server, for additional protection.
- A complete view of your project at a glance – the most important files (such as conflicted, modified or missing files) are placed at the top of the file list. SmartSVN also highlights which directories contain local modifications, which directories have been changed in the repository, and whether individual files have been modified locally or in the central repo. This makes it easy to get a quick overview of the state of your project.
- Fully customizable – maximize productivity by fine-tuning your SmartSVN installation to suit your particular needs: change keyboard shortcuts, write your own plugin with the SmartSVN API, group revisions to personalize your display, create Change Sets, and alter the context menus and toolbars to suit you. You can learn more about customizing SmartSVN in our ‘5 Ways to Customize SmartSVN’ blog post.
- Comprehensive bug tracker support – Trac and JIRA are both fully supported.
- Multitude of support options – SmartSVN users have access to a range of free support, from refcards to blogs and documentation, the SmartSVN forum and a Twitter account maintained by our open source experts. If you need extra support with your SmartSVN installation, expert email support is included with SmartSVN Professional licenses.

Want to learn more about SmartSVN? On April 18th, WANdisco will be holding a free ‘Introduction to SmartSVN’ webinar covering everything you need to get off to a great start with this popular client:

- Repository basics
- Checkouts, working folders, editing files and commits
- Reporting on changes
- Simple branching
- Simple merging

This webinar is free so register now.

April 13, 2013

by Jessica Thornsby


Application Services Governance Components

Application Services Governance is a necessary step towards building a responsive IT organization and achieving business agility. By guiding teams through a streamlined application services development process, Application Services Governance Platforms optimize IT effectiveness, raise software quality, and reduce delivery timeframes. Governance relies on policy, people, process and technology to guide business activity and consistently deliver positive outcomes. Effective governance channels business activity towards the ‘right’ path; by making the right actions the path of least resistance. To efficiently guide teams and demonstrate policy compliance benefits, Application Services Governance Platforms provide policy management, developer portals, repositories, service integration and composition, and business value dashboards. Effective governance encompasses the entire IT solution spanning APIs, services, business processes, data, and application delivery. While most governance solutions focus on web services, leading Application Services Governance Platforms bridge API governance, SOA governance, Cloud deployment governance, data governance, and application delivery governance. Additionally, the governance experience must be tailored for the participant’s project role. Portals may be personalized to present notifications, tasks, actions, and reports suitable for application service creators, publishers, subscribers, consumers, or business managers. Application delivery governance segments participants into developers, quality assurance testers, operations, project managers, and application users. End-user Application Services Governance priorities are evolving toward bridging service governance with API governance, extending application lifecycle management to embrace cloud deployment environments, and focusing on visualizing asset business value. 
Key governance challenges include meeting mobile application demands, implementing efficient self-service provisioning, right-sizing governance practices (not too heavy or light), and defining appropriate policy tiers. Governance Components To efficiently guide teams and demonstrate policy compliance benefits, Application Services Governance Platforms provide policy management, developer portals, repositories, service integration and composition, and business value dashboards. Figure 1 Application Services Governance Components Policy Management Policy management is used to specify the correct behavior, detail exception thresholds, and define corrective actions or notifications. Leading application services governance platforms deliver advanced policy management by conforming to a flexible architecture, addressing relevant policy categories, and spanning all lifecycle phases. A comprehensive Application Services Governance Platform manages: Design-time Policy Run-time Policy Security Policy Developer access Policy Service and API Lifecycle Management Policy Application Lifecycle Management Policy Within these six broad categories, application services governance commonly encompasses service level policies, usage policies, version policies, subscription policies, and access control policies. Registries serve as policy stores for many types of runtime policies including security policies, lifecycle management workflow policies, API policies, service description, service contracts, service consumption, service usage, service lifecycle management, service level agreements (SLAs) and XACML authorization policies. Leading platforms have built-in support for a number of policy standards including WS-Policy, XACML 3.0, and SCXML. Cloud foundation and cloud middleware components deliver sophisticated run-time policy enforcement for tenant partitioning, service level management, application provisioning, tenant access, and resource management. 
All run-time infrastructure products should serve as well-integrated policy enforcement points that may delegate policy decisions to external decision points or internally cache and process policy assertions. Identity Management infrastructure components serve as a policy decision point and a policy manager for sophisticated security policies encoded in XACML. The Application Service Governance Platforms use workflow engines to execute governance workflow, present task lists, and manage approvals. Complex Event Processor components can be configured as policy decision points, which use time-based policy pattern matching to evaluate run-time service, message, REST resource, and event traffic. For more information on policy management, read the detailed policy management blog post. Developer Portal and Repository Portals serve as the viewport into policy management, service integration and composition, and business value dashboards. The Application Service Governance portals should deliver an application service governance experience tuned for self-service, on-demand access, and safe API usage. Developer portals are often contextually personalized to fit the project and user’s role. For example, a developer portal may fit the needs of API creators and API publishers who are defining, documenting, and publishing APIs. The portal’s user experience may enable API creators and publishers to monitor, manage, and analyze API usage. A developer portal may also be personalized to deliver a user experience tailored for API consumers. API developers who are consuming APIs can find, explore, subscribe and evaluate APIs. Developer portals are often tuned to facilitate service meta-data and lifecycle management for service creators. Service and integration developers who are consuming services can find and explore services. A developer portal should guide teams toward effective and efficient governance when building service implementation and service consumption code. 
Advanced developer portal capabilities include overlaying build management governance, test governance (e.g. unit, integration, performance), implementation lifecycle governance, and deployment governance. An Application Services Governance Platform should enable flexible organization, classification, and documentation of services, APIs, and any IT asset. Key repository capabilities include governing and managing:

- Any type of metadata in any structure
- Service, API, or artifact associations and relationships
- Schema definitions and namespaces
- Users and roles
- User subscriptions
- Service level agreements
- Developer documentation
- Social taxonomies (e.g. ratings, comments, tags)
- Implementation artifacts (e.g. code, test cases)

Service Integration and Composition

Service integration and composition for APIs, web services, or business processes are often implemented using tools provided by the run-time infrastructure vendor. Application Services Governance components must integrate into diverse run-time infrastructure containers and development tooling. Synchronizing policy, development artifacts, and deployment packages requires tight integration between design-time tools, development tools, run-time management consoles, and application services governance portals and repositories.

Business Value Dashboards

To gauge governance effectiveness and enhanced business value, analytic dashboards assess policy compliance, quality of service, service usage, architecture coherence, and team performance. The Application Services Governance platform should capture service tier subscription information, collect usage statistics, and integrate with billing and payment systems that deliver show-back or charge-back reports. Subscription and usage reports help teams understand asset adoption and usage, each broken down by version and by service.
By understanding adoption and usage, business owners and architects can intelligently invest future development resources, properly plan infrastructure scale, and rationalize the portfolio. Dashboards also present a service overview, number of services, service lifecycle stage, schema re-use, service dependencies, upgrade impacts, development team productivity, and project progress.

Governance Lifecycle Phases

API management portals and SOA Governance Registries must work together to keep API lifecycle stages synchronized with backend service implementation stages. An API Governance experience may provide a straightforward set of lifecycle stages (e.g., created, published, deprecated, retired, blocked) that may be customized by the development team. SOA Governance Registries facilitate service metadata management and governance across design, implementation, test, and run-time operations. Figure 2 depicts the intersection of the two governance views.

Figure 2: API and Service Lifecycle Views

Application delivery governance usually relies on ad hoc tools and processes, knitted together by end-user delivery managers. Application Services Governance Platforms should span project inception, development, quality assurance, production deployment, production management, maintenance, and retirement. Figure 3 illustrates service implementation activities governed by an application delivery governance product.

Figure 3: Implementation activities governed by application services delivery governance

Application Services Governance Drivers

The IT focus on API, DevOps, and Cloud scale is driving resurgent interest in Application Services Governance. As development teams support mobile applications by fielding web APIs, they are creating a new ‘demand layer’ in front of existing service implementations.
Both API and SOA success require creating loosely coupled consumer-provider connections, enforcing a separation of concerns between consumer and provider, exposing a set of re-usable, shared services, and gaining service consumer adoption. With traditional SOA Governance, many development teams publish services, yet struggle to create a service architecture that is widely shared, re-used, and adopted across internal development teams. In today’s connected business world, API and SOA are the business. An effective governance approach must address human collaboration stumbling blocks. By publishing managed APIs, establishing API manager and publisher roles, extending the governance registry, facilitating API management practices (e.g. self-service key management, self-service provisioning, service tier management, and usage visualization), and offering APIs through a developer portal, organizations can overcome collaboration, trust, and adoption hurdles while enhancing SOA success. The same steps, with APIs offered through an API Store, give teams a new opportunity to increase service re-use and enhance IT business value. For more information on how teams can complement SOA Governance with API Governance, read the promoting services with API Management white paper. Because services are often embedded in application solutions, leading Application Services Governance platforms wrap services governance inside application delivery governance. When operations team members use traditional point tools (e.g. Puppet, Chef, Jenkins, Selenium) to achieve DevOps benefits, the teams spend a considerable amount of time and effort creating agile workflow, effective governance, seamless activity transitions, and on-demand self-service access. A configurable DevOps PaaS can implement governance best practices and be readily adopted by teams without extensive implementation effort.
Effective application delivery governance presents a simplified and unified user experience to complex development tools, processes, and team hand-offs. By integrating software promotion best practices, test automation, continuous integration, and issue tracking, application delivery governance raises software quality while reducing delivery timeframes. For more information, read about how to accelerate agility and maintain governance with DevOps PaaS. Recommended Reading Policy Management for Application Services Governance Application Services Governance Requires More Than a SOA Registry API and SOA Convergence Promoting services with API Management white paper Accelerate agility and maintain governance with DevOps PaaS Governance Registry Brings Integrity to SaaS Platform Gartner’s analysis of WSO2 SOA Governance

April 13, 2013

by Chris Haddad


Complex Event Processing Made Easy (using Esper)

The following is a very simple example of event stream processing (using the Esper engine). Note: a full working example is available over on GitHub: https://github.com/corsoft/esper-demo-nuclear

What is Complex Event Processing (CEP)?

Complex Event Processing (CEP), or Event Stream Processing (ESP), are technologies commonly used in event-driven systems. These types of systems consume, and react to, a stream of event data in real time. Typically these will be things like financial trading, fraud identification and process monitoring systems, where you need to identify, make sense of, and react quickly to emerging patterns in a stream of data events.

Key Components of a CEP system

A CEP system is like your typical database model turned upside down. Whereas a typical database stores data, and runs queries against the data, a CEP system stores queries, and runs data through the queries. To do this it basically needs:

- Data – in the form of ‘Events’
- Queries – using EPL (‘Event Processing Language’)
- Listeners – code that ‘does something’ if the queries return results

A Simple Example - A Nuclear Power Plant

Take the example of a nuclear power station. Now, this is just an example, so please try and suspend your disbelief if you know something about nuclear cores, critical temperatures, and the like. It’s just an example. I could have picked equally unbelievable financial transaction data. But ...

Monitoring the Core Temperature

Now I don’t know what the core is, or if it even exists in reality, but for this example let’s assume our power station has one, and if it gets too hot, well, very bad things happen. Let’s also assume that we have temperature gauges (thermometers?) in place which take a reading of the core temperature every second, and send the data to a central monitoring system.

What are the requirements?

We need to be warned when 3 types of events are detected:

- MONITOR – just tell us the average temperature every 10 seconds, for information purposes.
- WARNING – warn us if we have 2 consecutive temperatures above a certain threshold.
- CRITICAL – alert us if we have 4 consecutive events, with the first one above a certain threshold, each subsequent one greater than the last, and the last one more than 1.5 times the first. This is trying to alert us to a sudden, rapidly escalating temperature spike. And let’s assume this is a very bad thing.

Using Esper

There are a number of ways you could approach building a system to handle these requirements. For the purpose of this post, though, we will look at using Esper to tackle this problem. How we approach this with Esper is:

- Using Esper, we can create 3 queries (using EPL, Esper’s Event Processing Language) to model each of these event patterns.
- We then attach a listener to each query. This will be triggered when the EPL detects a matching pattern of events.
- We create an Esper service, and register these queries (and their listeners).
- We can then just throw Temperature data through the service, and let Esper alert the listeners when we get matches.

(A working example of this simple solution is available on GitHub, see link above.)

Our Simple Esper Solution

At the core of the system are the 3 queries for detecting the events.
Query 1 – MONITOR (just monitor the average temperature)

    select avg(temperature) as avg_val
    from TemperatureEvent.win:time_batch(10 sec)

Query 2 – WARN (tell us if we have 2 consecutive events which breach a threshold)

    select * from TemperatureEvent
    match_recognize (
        measures A as temp1, B as temp2
        pattern (A B)
        define
            A as A.temperature > 400,
            B as B.temperature > 400)

Query 3 – CRITICAL (4 consecutive rising values, all above 100, with the fourth value more than 1.5x the first)

    select * from TemperatureEvent
    match_recognize (
        measures A as temp1, B as temp2, C as temp3, D as temp4
        pattern (A B C D)
        define
            A as A.temperature > 100,
            B as (A.temperature < B.temperature),
            C as (B.temperature < C.temperature),
            D as (C.temperature < D.temperature) and D.temperature > (A.temperature * 1.5))

Some Code Snippets

TemperatureEvent

We assume our incoming data arrives in the form of a TemperatureEvent POJO. If it doesn't, we can convert it to one, e.g. if it comes in via a JMS queue, our queue listener can convert it to a POJO. We don't have to do this, but doing so decouples us from the incoming data structure, and gives us more flexibility if we start to do more processing in our Java code outside the core Esper queries. An example of our POJO is below:

    package com.cor.cep.event;

    import java.util.Date;

    /**
     * Immutable Temperature Event class.
     * The process control system creates these events.
     * The TemperatureEventHandler picks these up
     * and processes them.
     */
    public class TemperatureEvent {

        /** Temperature in Celsius. */
        private int temperature;

        /** Time temperature reading was taken. */
        private Date timeOfReading;

        /**
         * Temperature constructor.
         * @param temperature Temperature in Celsius
         * @param timeOfReading Time of Reading
         */
        public TemperatureEvent(int temperature, Date timeOfReading) {
            this.temperature = temperature;
            this.timeOfReading = timeOfReading;
        }

        /**
         * Get the Temperature.
         * @return Temperature in Celsius
         */
        public int getTemperature() {
            return temperature;
        }

        /**
         * Get time Temperature reading was taken.
         * @return Time of Reading
         */
        public Date getTimeOfReading() {
            return timeOfReading;
        }

        @Override
        public String toString() {
            return "TemperatureEvent [" + temperature + "C]";
        }
    }

Handling this Event

In our main handler class, TemperatureEventHandler.java, we initialise the Esper service. We register the package containing our TemperatureEvent so the EPL can use it. We also create our 3 statements and add a listener to each statement:

    /**
     * Auto initialise our service after Spring bean wiring is complete.
     */
    @Override
    public void afterPropertiesSet() throws Exception {
        initService();
    }

    /**
     * Configure Esper Statement(s).
     */
    public void initService() {
        Configuration config = new Configuration();

        // Recognise domain objects in this package in Esper.
        config.addEventTypeAutoName("com.cor.cep.event");
        epService = EPServiceProviderManager.getDefaultProvider(config);

        createCriticalTemperatureCheckExpression();
        createWarningTemperatureCheckExpression();
        createTemperatureMonitorExpression();
    }

An example of creating the Critical Temperature warning and attaching the listener:

    /**
     * EPL to check for a sudden critical rise across 4 events,
     * where the last event is 1.5x greater than the first.
     * This is checking for a sudden, sustained escalating
     * rise in the temperature
     */
    private void createCriticalTemperatureCheckExpression() {
        LOG.debug("create Critical Temperature Check Expression");
        EPAdministrator epAdmin = epService.getEPAdministrator();
        criticalEventStatement = epAdmin.createEPL(criticalEventSubscriber.getStatement());
        criticalEventStatement.setSubscriber(criticalEventSubscriber);
    }

And finally, an example of the listener for the Critical event. This just logs some debug output; that's as far as this demo goes.

    package com.cor.cep.subscriber;

    import java.util.Map;

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.stereotype.Component;

    import com.cor.cep.event.TemperatureEvent;

    /**
     * Wraps Esper Statement and Listener. No dependency on Esper libraries.
     */
    @Component
    public class CriticalEventSubscriber implements StatementSubscriber {

        /** Logger */
        private static Logger LOG = LoggerFactory.getLogger(CriticalEventSubscriber.class);

        /** Minimum starting threshold for a critical event. */
        private static final String CRITICAL_EVENT_THRESHOLD = "100";

        /**
         * If the last event in a critical sequence is this much greater
         * than the first - issue a critical alert.
         */
        private static final String CRITICAL_EVENT_MULTIPLIER = "1.5";

        /**
         * {@inheritDoc}
         */
        public String getStatement() {
            // Example using 'Match Recognise' syntax.
            String criticalEventExpression = "select * from TemperatureEvent "
                + "match_recognize ( "
                + "measures A as temp1, B as temp2, C as temp3, D as temp4 "
                + "pattern (A B C D) "
                + "define "
                + " A as A.temperature > " + CRITICAL_EVENT_THRESHOLD + ", "
                + " B as (A.temperature < B.temperature), "
                + " C as (B.temperature < C.temperature), "
                + " D as (C.temperature < D.temperature) "
                + "and D.temperature > "
                + "(A.temperature * " + CRITICAL_EVENT_MULTIPLIER + ")" + ")";
            return criticalEventExpression;
        }

        /**
         * Listener method called when Esper has detected a pattern match.
         */
        public void update(Map<String, TemperatureEvent> eventMap) {
            // 1st Temperature in the Critical Sequence
            TemperatureEvent temp1 = eventMap.get("temp1");
            // 2nd Temperature in the Critical Sequence
            TemperatureEvent temp2 = eventMap.get("temp2");
            // 3rd Temperature in the Critical Sequence
            TemperatureEvent temp3 = eventMap.get("temp3");
            // 4th Temperature in the Critical Sequence
            TemperatureEvent temp4 = eventMap.get("temp4");

            StringBuilder sb = new StringBuilder();
            sb.append("***************************************");
            sb.append("\n* [ALERT] : CRITICAL EVENT DETECTED! ");
            sb.append("\n* " + temp1 + " > " + temp2 + " > " + temp3 + " > " + temp4);
            sb.append("\n***************************************");
            LOG.debug(sb.toString());
        }
    }

The Running Demo

Full instructions for running the demo can be found here: https://github.com/corsoft/esper-demo-nuclear

The running demo generates random Temperature events and sends them through the Esper processor (in the real world this would come in via a JMS queue, HTTP endpoint or socket listener). When any of our 3 queries detects a match, debug output is dumped to the console. In a real-world solution each of these 3 listeners would handle the events differently, maybe by sending messages to alert queues/endpoints for other parts of the system to pick up the processing.

Conclusions

Using a system like Esper is a neat way to monitor and spot patterns in data in real time with minimal code. This is obviously (and intentionally) a very bare-bones demo, barely touching the surface of the capabilities available. Check out the Esper web site for more info and demos. Esper also has a plugin for the Apache Camel integration engine, which allows you to configure your EPL queries directly in XML Spring Camel routes, removing the need for any Java code completely (we will possibly cover this in a later blog post!).
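As a sanity check on what the WARNING and CRITICAL rules are meant to match, the same conditions can be written in plain Java over a list of readings. This is only a sketch and not part of the demo project: it does not use Esper, it ignores how match_recognize skips past rows after a match, the class and method names are made up for illustration, and it assumes the thresholds quoted in this post (400 for WARNING; 100 and 1.5x for CRITICAL).

```java
import java.util.Arrays;
import java.util.List;

final class TemperatureRules {
    static final int WARNING_THRESHOLD = 400;       // assumed, from the WARN query
    static final int CRITICAL_THRESHOLD = 100;      // assumed, from the CRITICAL query
    static final double CRITICAL_MULTIPLIER = 1.5;  // assumed, from the CRITICAL query

    // WARNING: two consecutive readings that are both above the threshold.
    static boolean hasWarning(List<Integer> temps) {
        for (int i = 0; i + 1 < temps.size(); i++) {
            if (temps.get(i) > WARNING_THRESHOLD && temps.get(i + 1) > WARNING_THRESHOLD) {
                return true;
            }
        }
        return false;
    }

    // CRITICAL: four consecutive readings, the first above the threshold,
    // each one strictly higher than the previous, and the last more than 1.5x the first.
    static boolean hasCritical(List<Integer> temps) {
        for (int i = 0; i + 3 < temps.size(); i++) {
            int a = temps.get(i);
            int b = temps.get(i + 1);
            int c = temps.get(i + 2);
            int d = temps.get(i + 3);
            if (a > CRITICAL_THRESHOLD && a < b && b < c && c < d && d > a * CRITICAL_MULTIPLIER) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasWarning(Arrays.asList(390, 410, 420)));       // true
        System.out.println(hasCritical(Arrays.asList(101, 120, 140, 160))); // true
        System.out.println(hasCritical(Arrays.asList(101, 120, 140, 150))); // false: 150 is not more than 151.5
    }
}
```

Writing the rules out this way also shows what the EPL version saves: the sliding-window bookkeeping lives in the pattern engine instead of in hand-written loops.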

April 11, 2013

by Adrian Milne

· 68,377 Views · 4 Likes

Monitoring with DataDog

Recently I found myself sending more and more business metrics to Datadog, a Software as a Service solution that promises to collect all your data points and build business metrics, displaying them as graphs and triggering alerts whenever they get to critically low (or high) levels.

The goals

The more your automated tests raise their level of abstraction, the more they become oriented to external quality (what the customer wants and does) instead of internal quality (low coupling, high cohesion of the software design). The largest end-to-end tests that we have in place at Onebip connect several different projects on an integration server and run everything from the creation of a purchase or subscription to its renewal and termination (events that would happen months after creation). However, even end-to-end tests cannot guarantee that our applications work against external resources, such as merchants, mobile carriers, and ISPs. The only way to catch integration problems is monitoring. These problems, like a mobile carrier experiencing an outage, may be due to our errors or to external conditions; but they should nevertheless be discovered as early as possible.

The infrastructure

Datadog is the only data-collection service that passed the stress tests of SLL, our solution architect. It ships as a UDP server that you pay for based on the number of machines you want to run it on; for example, a preproduction and a production server are a common choice to start out. The server collects data locally and periodically uploads it to Datadog in bursts, where you can access it via a web application or via APIs in case you want to call it from your build.
The UDP protocol is aligned with the goals of metric collection: a silent server that decouples the sending of metrics from the rest of the business logic. UDP packets are simply lost if no process is there listening for them, and no errors are raised if the server crashes or is not running or installed for some reason (for instance, on development machines). The monitoring code, which you write, should be decoupled and asynchronous as much as possible. The part that talks over the network is already externalized in the DataDog server, but you don't want the user to wait because you have to send some strange number. So the internal part (sending via UDP) is performed in Listener objects that implement the Observer pattern. These objects still have to be wrapped in all-encompassing try/catch constructs so that any errors in the monitoring part never influence the business logic. Again, you don't want a payment to fail because of an exception in how monitoring DateTime objects are built. For PHP we built a SilentListener class to wrap all of our objects:

    class SilentListener
    {
        private $wrapped;

        public function __construct($wrapped)
        {
            $this->wrapped = $wrapped;
        }

        public function __call($method, $args)
        {
            try {
                call_user_func_array(array($this->wrapped, $method), $args);
            } catch (Exception $e) {
                // Log and swallow: monitoring must never break business logic.
                error_log($e->getMessage());
            }
        }
    }

An example

In some countries, we receive payments through mobile-originated messages (MO), a fancy word for SMS sent by the end user. So a simple way to monitor whether we are receiving payments or whether the server has blown up is to upload a metric counting them every time we receive one (pseudo-JSON format to show you the data):

    { counter: 1 }

However, we can be more precise than this: an external outage or an integration problem may happen at a lower level than the whole application. For example, MOs can be delayed in Argentina, by a single carrier, while the rest of the world is still working fine.
So our data points look like this:

    {
      counter: 1,
      tags: {
        country: "IT",
        carrier: "Vodafone",
        merchant: "Tasty Cookies, Inc."
      }
    }

In turn, graphs on DataDog or calls to its API can set up filters so that we can, if necessary, view only the data related to any combination of country, carrier and merchant. The nice thing, SLL says, is that you just start sending data from production, and only after you have data points available do you build a graph or an alert system based on what appear to be the most important tags. For example, a big merchant may benefit from some dedicated monitoring, while minor countries such as Vietnam should be monitored as a whole since their traffic is far lower than that of the others.
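As a rough illustration of the fire-and-forget idea, the sketch below formats a tagged counter in the StatsD-style text format that the Datadog agent accepts over UDP (`name:value|c|#tag:value,...`) and sends it without ever letting a failure reach the caller. It is written in Java rather than the PHP used at Onebip, and the metric name `mo.received`, the class name and the host/port (8125 is the usual default for the agent's UDP listener) are assumptions for the example. Tag values containing commas, such as a merchant name like "Tasty Cookies, Inc.", would need sanitizing first, which this sketch does not do.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

/** Illustrative sketch: build and send a tagged counter over UDP, silently. */
final class MetricSender {

    /** Builds a counter datagram such as "mo.received:1|c|#country:IT,carrier:Vodafone". */
    static String formatCounter(String name, long value, Map<String, String> tags) {
        StringBuilder sb = new StringBuilder(name).append(':').append(value).append("|c");
        if (!tags.isEmpty()) {
            StringJoiner joiner = new StringJoiner(",", "|#", "");
            tags.forEach((key, val) -> joiner.add(key + ":" + val));
            sb.append(joiner.toString());
        }
        return sb.toString();
    }

    /** Sends the payload and swallows every failure, so monitoring cannot break business logic. */
    static void sendQuietly(String payload, String host, int port) {
        try (DatagramSocket socket = new DatagramSocket()) {
            byte[] data = payload.getBytes(StandardCharsets.UTF_8);
            socket.send(new DatagramPacket(data, data.length, InetAddress.getByName(host), port));
        } catch (Exception e) {
            // Deliberately ignored: a lost metric is acceptable, a failed payment is not.
        }
    }
}
```

A call such as `MetricSender.sendQuietly(MetricSender.formatCounter("mo.received", 1, tags), "localhost", 8125)` would be made from a listener each time an MO arrives; if no agent is listening, the packet is simply dropped.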

April 10, 2013

by Giorgio Sironi

· 16,441 Views · 1 Like

Add Custom Post Meta Data To Post List Table

One of the best things about WordPress is that you can customise almost anything. In the admin area you can see a list of all the posts you have added in WordPress. This table shows the basic information for each of the posts: the title, the author, the category, tags, comments and the date the post was published. WordPress has a number of different filters and actions that allow you to edit the output of the columns so you can add your own data to this list. For example, if you have custom post meta data that you want to display on the list of posts, you can add new custom columns to the list. In this article you will learn how to add new columns to the post list, how to add data to a column and how to make the column sortable.

Add New Columns

First we start off by adding the new column to the list; for this we use the WordPress filter manage_edit-post_columns. This will allow you to edit the output of the columns by adding new values to the column array. The callback function on this filter receives one parameter, the current columns on the list, and its return value becomes the new set of columns on the post table. This means that we can add additional values to the array to add extra columns to the table. The following code will add a new column to the table just after the title column.

    // Add a column to the edit post list
    add_filter( 'manage_edit-post_columns', 'add_new_columns' );

    /**
     * Add new columns to the post table
     *
     * @param Array $columns - Current columns on the list post
     */
    function add_new_columns( $columns ) {
        $column_meta = array( 'meta' => 'Custom Column' );
        $columns = array_slice( $columns, 0, 2, true ) + $column_meta + array_slice( $columns, 2, NULL, true );
        return $columns;
    }

Add Columns To Custom Post Types

If you have custom post types in your site and want to add additional columns to their lists, WordPress comes with built-in filters you can apply to add new columns to those tables.
    add_filter( 'manage_${post_type}_posts_columns', 'add_new_columns');

If you have a custom post type of portfolio then you can use the following code to add columns to the list of portfolio posts.

    function add_portfolio_columns($columns) {
        return array_merge($columns, array(
            'client'       => __('Client'),
            'project_date' => __('Project Date')
        ));
    }
    add_filter('manage_portfolio_posts_columns' , 'add_portfolio_columns');

Add Data To Custom Columns

Once you have created the new columns for the posts list you can add data to them by using the WordPress action manage_posts_custom_column. The callback attached to this action is called for each column; from it we can fetch data for the post and display it in the post list. The following code checks which column we are on, gets the custom post meta data for the current post and displays it in the column.

    // Add action to the manage post column to display the data
    add_action( 'manage_posts_custom_column' , 'custom_columns' );

    /**
     * Display data in new columns
     *
     * @param $column Current column
     *
     * @return Data for the column
     */
    function custom_columns( $column ) {
        global $post;

        switch ( $column ) {
            case 'meta':
                $metaData = get_post_meta( $post->ID, 'twitter_url', true );
                echo $metaData;
                break;
        }
    }

Add Data To Custom Post Type Columns

Along with the filter for adding new columns to custom post types, WordPress has a built-in action you can use to add data to those columns. In the above example we added two new columns just for the portfolio post type; using the code below you can add data to these new columns.
    function custom_portfolio_column( $column, $post_id ) {
        switch ( $column ) {
            case 'project_date':
                echo get_post_meta( $post_id , 'project_date' , true );
                break;

            case 'client':
                echo get_post_meta( $post_id , 'client' , true );
                break;
        }
    }
    // Priority 10, 2 accepted arguments so that $post_id is passed to the callback
    add_action( 'manage_portfolio_posts_custom_column' , 'custom_portfolio_column', 10, 2 );

Make Columns Sortable

By default the new custom columns are not sortable, which makes it hard to find the data that you need. To sort the custom columns WordPress has another filter, manage_edit-post_sortable_columns, that you can use to declare which columns are sortable. When this filter runs, the callback receives all the columns which are currently sortable; adding your new custom columns to this list makes them sortable too. The value you give each column will be used in the URL so WordPress understands which column to order by. The following code allows you to sort by the custom column meta.

    // Register the column as sortable
    function register_sortable_columns( $columns ) {
        $columns['meta'] = 'Custom Column';
        return $columns;
    }
    add_filter( 'manage_edit-post_sortable_columns', 'register_sortable_columns' );

That's all the information you need to change the way posts are listed in your admin area. What useful information do you wish was displayed in the post list?

April 10, 2013

by Paul Underwood

· 13,657 Views

Why Encapsulation Matters

Encapsulation is more than just defining accessor and mutator methods for a class. It is a broader concept of programming, not necessarily object-oriented programming, that consists of minimizing the interdependence between modules, and it’s typically implemented through information hiding. Paramount to understanding encapsulation is the realization that it has two main objectives: (1) hiding complexity and (2) hiding the sources of change.

About Hiding Complexity

Encapsulation is inherently related to the concepts of modularity and abstraction. So, in my opinion, to really understand encapsulation, one must first understand these two concepts. Let’s consider, for example, the level of abstraction in the concept of a car. A car is complex in its internal workings. It has several subsystems, like the transmission system, the brake system, the fuel system, etc. However, we have simplified its abstraction, and we interact with all cars in the world through the public interface of their abstraction: we know that all cars have a steering wheel through which we control direction, a pedal that we press to accelerate the car and control speed, another one that we press to make it stop, and a gear stick that lets us control whether we go forward or backwards. These features constitute the public interface of the car abstraction. In the morning we can drive a sedan and then get out of it and drive an SUV in the afternoon as if it were the same thing. However, few of us know the details of how all these features are implemented under the hood. So, this simple analogy shows that human beings deal with complexity by defining abstractions with public interfaces that we use to interact with them, and all the unnecessary details get hidden under the hood of these abstractions.
And I want to emphasize that word “unnecessary” here, because the beauty of an abstraction is not having to understand all those details in order to be able to use it; we just need to understand a broader abstract concept, how it works and how we interact with it. That’s why most of us don’t know or don’t care how a car works under the hood, but that doesn’t prevent us from driving one. In his book Code Complete, Steve McConnell uses the analogy of an iceberg: only a small portion of an iceberg is visible on the surface, and most of its true size is hidden underwater. Similarly, in our software designs the visible parts of our modules/classes constitute their public interface, which is exposed to the outside world; the rest should be hidden from view. In the words of McConnell, “the interface to a class should reveal as little as possible about its inner workings”. Clearly, based on our car analogy, we can see that this encapsulation is good, since it hides unnecessary/complex details from the users. It makes objects simpler to use and understand.

About Hiding the Sources of Change

Now, continuing with the analogy, think of the time when cars did not have hydraulic power steering. One day, the car manufacturers invented it, and they decided to put it in cars from then on. Still, this did not change the way in which drivers were interacting with them. At most, users experienced an improvement in the use of the steering system. A change like this was possible because the internal implementation of a car is encapsulated, that is, hidden from its user. In other words, changes can be safely made without affecting its public interface. In a similar way, if we achieve proper levels of encapsulation in our software design we can safely foster change and evolution of our APIs without breaking their users, thereby minimizing the impact of changes and the interdependence of modules.
Therefore, encapsulation is a way to achieve another important attribute of a good software design known as loose coupling. In his book Effective Java, Joshua Bloch highlights the power of information hiding and loose coupling when he says: “Information hiding is important for many reasons, most of which stem from the fact that it decouples the modules that comprise a system, allowing them to be developed, tested, optimized, used, understood, and modified in isolation. This speeds up system development because modules can be developed in parallel. It eases the burden of maintenance because modules can be understood more quickly and debugged with little fear of harming other modules [...] it enables effective performance tuning [since] those modules can be optimized without affecting the correctness of other modules. [It] increases software reuse because modules that aren’t tightly coupled often prove useful in other contexts besides the ones for which they were developed”. So, once more, we can clearly see that encapsulation is a desirable attribute that eases the introduction of change and fosters the evolution of our APIs. As long as we respect the public interface of our abstractions we are free to change whatever we want in their encapsulated inner workings.

About Breaking the Public Interface

So what happens when we do not achieve the proper levels of encapsulation in our designs? Now, imagine that car manufacturers decided to put the fuel cap below the car, and not on one of its sides. Let’s say we go and buy one of these new cars, and when we run out of gas we go to the nearest gas station, and then we do not find the fuel cap. Suddenly we realize it is below the car, but we cannot reach it with the gas pump hose. Now we have broken the public interface contract, and therefore the entire world breaks; it falls apart because things are not working the way they were expected to. A change like this would cost millions.
We would need to change all gas pumps, not to mention mechanical shops and auto parts. When we break encapsulation we have to pay a price. This last part of our analogy clearly reveals that failing to define proper abstractions with proper levels of encapsulation will end up causing difficulties when change finally happens. So, as we can see, the goal of encapsulation is to reduce the complexity of the abstractions by providing a way to hide implementation details, and it also helps us to minimize interdependence and facilitate change. We maximize encapsulation by minimizing the exposure of implementation details. However, encapsulation will not help us if we do not define proper abstractions. Simply put, there is no way to change the public interface of an abstraction without breaking its users. So, the design of good abstractions is of paramount importance to facilitate the evolution of our APIs. Encapsulation is just one of the tools that help us create these good abstractions, but no level of encapsulation is going to make a bad abstraction work.

Encapsulation in Java

One of those things that we always want to encapsulate is the state of a class. The state of a class should only be accessed through its public interface. In an object-oriented programming language like Java, we achieve encapsulation by hiding details using the accessibility modifiers (i.e. public, protected, private, plus no modifier, which implies package private). With these levels of accessibility we control the level of encapsulation: the less restrictive the level, the more expensive change is when it happens and the more coupled the class is with other dependent classes (i.e. user classes, subclasses, etc.). In object-oriented languages a class has two public interfaces: the public interface shared with all users of the class, and the protected interface shared with subclasses.
It is of paramount importance that we design the proper levels of encapsulation for each of these public interfaces so that we can facilitate change and foster the evolution of our APIs.

Why Getters and Setters?

Many people wonder why we need accessor and mutator methods in Java (a.k.a. getters and setters): why can’t we just access the data directly? But the purpose of encapsulation here is not to hide the data itself, but the implementation details of how this data is manipulated. So, once more, what we want is a way to provide a public interface through which we can gain access to this data. We can later change the internal representation of the data without compromising the public interface of the class. On the contrary, by exposing the data itself, we compromise encapsulation, and therefore the capacity to change the way this data is manipulated in the future without affecting its users. We would create a dependency on the data itself, and not on the public interface of the class. We would be creating a perfect cocktail for trouble when “change” finally finds us. There are several compelling reasons why we might want to encapsulate access to our fields. The best compendium of these reasons I have ever found is described in Joshua Bloch’s book Effective Java. There, in Item 14: Minimize the accessibility of classes and members, he mentions several reasons, which I list here:

You can limit the values that can be stored in a field (i.e. gender must be F or M).
You can take actions when the field is modified (trigger event, validate, etc).
You can provide thread safety by synchronizing the method.
You can switch to a new data representation (i.e. calculated fields, different data type).

However, it is very important to understand that encapsulation is more than hiding fields. In Java we can hide entire classes, thereby hiding the implementation details of an entire API.
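The reasons for accessor and mutator methods listed above can be sketched with a small, hypothetical class. Person, its fields and its method names are invented for this illustration and do not come from any of the cited books; the point is only that callers depend on the methods, not on how the state is stored.

```java
import java.time.LocalDate;
import java.time.Period;

/** Hypothetical example: callers see getters and setters, never the fields. */
final class Person {

    // Internal representation. We store a birth date rather than an age,
    // and could change this later without touching any caller.
    private final LocalDate birthDate;
    private String gender;

    Person(LocalDate birthDate, String gender) {
        this.birthDate = birthDate;
        setGender(gender);
    }

    /** A calculated value: the age is derived, not stored. */
    int getAgeOn(LocalDate date) {
        return Period.between(birthDate, date).getYears();
    }

    /** Synchronized so readers and writers of gender see a consistent value. */
    synchronized String getGender() {
        return gender;
    }

    /** Limits the values that can be stored and validates on every modification. */
    synchronized void setGender(String gender) {
        if (!"F".equals(gender) && !"M".equals(gender)) {
            throw new IllegalArgumentException("gender must be F or M");
        }
        this.gender = gender;
    }
}
```

Had gender been a public field, none of the validation or synchronization could be added later without changing every class that assigns to it.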
My understanding of this important concept was broadened and enriched by my reading of a great article by Alan Snyder called Encapsulation and Inheritance in Object-Oriented Programming Languages, which I recommend to all readers of this blog. I found a version of it available on the Web and I share a link to it at the end of this article.

Further Reading

Encapsulation and Inheritance in Object-oriented Programming Languages
Effective Java: Minimize the Accessibility of Classes and Members (p. 67-69)
Code Complete 2: Hiding Secrets (Information Hiding) (p. 92)

April 7, 2013

by Edwin Dalorzo

· 56,311 Views · 3 Likes

Configuring Apache SolrCloud on Amazon VPC

In this post we are going to construct an Apache SolrCloud (4.1) cluster with 12 EC2 instances inside Amazon VPC. Since the search data stored inside the SolrCloud is critical, we are going to build high availability at the Solr node level as well as at the AZ level. This setup will be done inside the private subnets of an Amazon VPC and will leverage 3 Availability Zones of the Amazon EC2 Region. A brief summary of the setup:

3 Zookeepers will be deployed on 3 Availability Zones. ZK EC2 instances will be deployed on the Private subnet of the Amazon VPC.
3 Solr Shard EC2 instances will be deployed on the Private subnet of Availability Zone 1 inside Amazon VPC.
3 Solr Replica EC2 instances will be deployed on the Private subnet of Availability Zone 2 inside Amazon VPC.
3 Solr Replica EC2 instances will be deployed on the Private subnet of Availability Zone 3 inside Amazon VPC.
EBS optimized + PIOPS EC2 instances can be used for Solr EC2 nodes.

To know more about SolrCloud deployment best practices on Amazon VPC, refer to this article: http://harish11g.blogspot.in/2013/03/Apache-Solr-cloud-on-Amazon-EC2-AWS-VPC-implementation-deployment.html

Step 1: Creating Virtual Private Cloud on AWS

Create a VPC with Public and Private Subnets. Assume the load balancer and Web/App servers reside on the public subnet and Apache SolrCloud resides on the private subnet of the VPC.

Step 2: Assigning the IP for the Subnets

Create the subnet with its IP range. Choose the Availability Zone for this subnet.

Step 3: Multiple Subnets on Multiple AZs

Create multiple subnets in multiple AZs for building a highly available setup for SolrCloud.

Step 4: Install Java for Zookeeper & Solr

Amazon Linux is chosen as the EC2 OS variant. Execute the following instructions on the respective EC2 nodes after their launch. EC2 instances should be launched in Multi-AZ in multiple VPC Private Subnets. Solr uses Zookeeper as the cluster configuration store and coordinator.
Zookeeper is a distributed coordination service that keeps information about all the Solr nodes in a file-system-like hierarchy. Solrconfig.xml, Schema.xml etc. are stored in this repository. We have used Oracle-Sun Java over OpenJDK.

    sudo -s
    cd /opt
    wget --no-cookies --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2Ftechnetwork%2Fjava%2Fjavase%2Fdownloads%2Fjdk-7u3-download-1501626.html;" http://download.oracle.com/otn-pub/java/jdk/7u13-b20/jdk-7u13-linux-x64.rpm
    mv jdk-7u10-linux-x64.rpm?AuthParam=1357217677_76ec3d8d9a3644f4b9ec1ea79e1fcf33 jdk-7u10-linux-x64.rpm
    sudo rpm -ivh jdk-7u10-linux-x64.rpm
    alternatives --install /usr/bin/java java /usr/java/jdk1.7.0_10/jre/bin/java 20000
    alternatives --install /usr/bin/javaws javaws /usr/java/jdk1.7.0_10/jre/bin/javaws 20000
    alternatives --install /usr/bin/javac javac /usr/java/jdk1.7.0_10/bin/javac 20000
    alternatives --install /usr/bin/jar jar /usr/java/jdk1.7.0_10/bin/jar 20000
    alternatives --install /usr/bin/java java /usr/java/jre1.7.0_10/bin/java 20000
    alternatives --install /usr/bin/javaws javaws /usr/java/jre1.7.0_10/bin/javaws 20000
    alternatives --configure java

Add JAVA_HOME in .bash_profile:

    vim ~/.bash_profile
    export JAVA_HOME="/usr/java/jdk1.7.0_10"
    export PATH=$PATH:$JAVA_HOME/bin

Restart the instance:

    init 6

Check the version of Java installed using the “java -version” command.

Step 5: Configure the ZooKeeper (v3.4.5) Ensemble

Since a single Zookeeper is not ideal for a large Solr cluster (because it is a single point of failure), it is recommended to configure multiple Zookeepers in concert as an ensemble. In this step we will install and configure 3 ZooKeeper EC2 nodes spanning 3 different Availability Zones in their respective Private Subnets inside a VPC. Zookeeper will be configured on Amazon Linux.
    sudo yum update
    sudo -s
    cd /opt
    wget http://apache.techartifact.com/mirror/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
    tar -xzvf zookeeper-3.4.5.tar.gz
    rm zookeeper-3.4.5.tar.gz
    cd zookeeper-3.4.5
    cp conf/zoo_sample.cfg conf/zoo.cfg

Add the following lines in zoo.cfg:

    vim conf/zoo.cfg
    dataDir=/data
    server.1=[zk-server01-ip]:2888:3888
    server.2=[zk-server02-ip]:2888:3888
    server.3=[zk-server03-ip]:2888:3888

Create the myid file in the dataDir, containing 1, 2 or 3 respectively on each ZooKeeper EC2 instance in Multi-AZ:

    cd /data
    vim myid

Start ZooKeeper:

    /opt/zookeeper-3.4.5/bin/zkServer.sh start

Follow the above steps on all the ZooKeeper servers. Refer to Clustered (Multi-Server) Setup and Configuration Parameters for an understanding of quorum_port, leader_election_port and the file myid. Every ZooKeeper node needs to know about every other ZK EC2 node in the ensemble, and a majority of the EC2 nodes (called a quorum) is needed to provide the service. Make sure the VPC IPs of all the Zookeepers are given on every ZK node, as in the following lines:

    server.1=[zk-server01-ip]:[quorum_port]:[leader_election_port]
    server.2=[zk-server02-ip]:[quorum_port]:[leader_election_port]
    server.3=[zk-server03-ip]:[quorum_port]:[leader_election_port]

Step 6: Configuring Solr 4.1 EC2 node

In this step we will install and configure 3 Apache Solr 4.1 Shard EC2 instances in a single Amazon AZ and 2 Solr Replicas in another AZ in their respective Private subnets. Please note that we have to specify all the ZooKeeper (ZK) hosts on every Solr instance as below. Note: Solr ships with Jetty by default; it is suggested to use Tomcat for production nodes. Perform the following after launching EC2 instances in Multi-AZ in multiple VPC Private Subnets.

    sudo -s
    yum update
    cd /opt
    wget http://apache.techartifact.com/mirror/lucene/solr/4.1.0/apache-solr-4.1.0.tgz
    tar -xzvf apache-solr-4.1.0.tgz
    rm -f apache-solr-4.1.0.tgz

On Solr Shard/Replica instances:

    cd /opt/apache-solr-4.1.0/example/
    vim /opt/apache-solr-4.1.0/example/solr/collection1/conf/solrconfig.xml

Change /var/data/solr to /data.

Starting the Solr 4.1 Shard/Replica Java program:
    java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=SolrCloud4.1-Conf -DnumShards=3 -DzkHost=[zk-server01-ip]:2181,[zk-server02-ip]:2181,[zk-server03-ip]:2181 -jar start.jar

On subsequent Solr instances:

    java -DzkHost=[zk-server01-ip]:2181,[zk-server02-ip]:2181,[zk-server03-ip]:2181 -jar start.jar

-DnumShards: the number of shards that will be present. Note that once set, this number cannot be increased or decreased without re-indexing the entire data set. (Dynamically changing the number of shards is part of the Solr roadmap!)
-DzkHost: a comma-separated list of ZooKeeper servers.
-Dbootstrap_confdir, -Dcollection.configName: these parameters are specified only when starting up the first Solr instance. This will enable the transfer of configuration files to ZooKeeper. Subsequent Solr instances need to just point to the ZooKeeper ensemble.

The above command with -DnumShards=3 specifies that it is a 3-shard cluster. The first Solr EC2 node automatically becomes shard1, the second Solr EC2 node automatically becomes shard2, and so on. What happens when we launch a fourth Solr instance in this cluster? Since it’s a 3-shard cluster, the fourth Solr EC2 node automatically becomes a replica of shard1 and the fifth Solr EC2 node becomes a replica of shard2.

Step 7: AWS Security Group TCP Ports to be enabled

Configure the following TCP ports on the AWS security group to allow access between Solr and ZK nodes deployed in multiple AZs:

Solr Shards/Replicas will connect to ZK through TCP Port 2181
Solr Web Interface with Jetty container through TCP Port 8983
Solr Web Interface with Tomcat container through TCP Port 8080

Every instance that is part of the ZooKeeper ensemble should know about every other machine in the ensemble. We can accomplish this with a series of lines of the form server.id=host:port:port. For example:

    server.1=[vpc-ip]:2888:3888
    server.2=[vpc-ip]:2888:3888
    server.3=[vpc-ip]:2888:3888

TCP Ports 2888 and 3888 should be opened for the ZK ensemble.
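The shard and replica assignment described in Step 6, and the ZooKeeper majority rule from Step 5, can be modeled with a few lines of arithmetic. This is a simplified sketch of the behaviour the article describes (nodes fill shards in start order, then wrap around as replicas); it is not Solr's or ZooKeeper's own code, and the class and method names are hypothetical.

```java
/** Simplified model of the start-order behaviour described in the article. */
final class SolrCloudSketch {

    /** Shard number (1-based) taken by the n-th Solr node to start (1-based). */
    static int shardFor(int nodeOrder, int numShards) {
        return (nodeOrder - 1) % numShards + 1;
    }

    /** Nodes started after every shard has a leader join as replicas. */
    static boolean isReplica(int nodeOrder, int numShards) {
        return nodeOrder > numShards;
    }

    /** Smallest majority of a ZooKeeper ensemble of the given size. */
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }
}
```

With numShards=3, the fourth node maps to shard 1 and the fifth to shard 2, both as replicas, and a 3-node ensemble needs 2 nodes up to keep serving, so it tolerates the loss of one Availability Zone.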

April 5, 2013

by Harish Ganesan

· 7,805 Views

Weekend Project: Send sensor data from Arduino to MongoDB

Arduino is an open-source electronics platform that can sense and interact with its environment through a variety of sensor types. It’s great for hardware prototyping and one-off projects. I just got an Arduino board from our friends at SendGrid, who also gave me a little tutorial in the art of Arduino hacking. Inspired by the tutorial and armed with this new board, I bought a passive infrared (PIR) motion sensor from my local Radio Shack. Now I was ready to play; in particular, I wanted to be able to collect that continuous stream of hardware sensor data into a MongoDB database for logging, trend analysis, system event correlation, etc. To this end, I created the demo project “mongodb-motion”, which I’ve made public on Github. In the “mongodb-motion” Github repo, you will find an Arduino project that writes motion sensor data to a cloud MongoDB database at MongoLab and sends alerts via email based on certain criteria. I built this demo using Node.js and the MongoLab REST API. Below, I’ll go through exactly what hardware you need to make your own “mongodb-motion” project a success, and how the code actually works.

What You Need

The hardware used in this demo includes an Arduino UNO R3 and a Parallax PIR motion sensor.

How the Code Works

You can use a variety of motion sensors with the Arduino. In this particular experiment, I used a PIR motion sensor. The PIR motion sensor behaves like a switch, with ‘down’ events emitted on motion detection and ‘up’ events a few seconds after motion ceases to be detected. On the receiving side, I used Johnny-Five, an appropriately named Node.js package that accepts sensor events and sends messages to the Arduino board. With the two ends set, I’ll move on to the project’s configuration file. In this demo, I’ve included a configuration file, config-sample.js, where credentials for the MongoLab REST API and for the email SMTP server can be added. In my case, I used the SendGrid SMTP service.
The configuration file also has two callbacks that determine when an email is emitted, one for each type of event: “detect” and “ceased”. I’ve used this feature to automatically send an email alert if an event timestamp is between 7:00pm and 8:00am, ostensibly when my office should be motionless… I’m out there watching you, office! Once you’ve customized this config-sample.js file, be sure to rename it to config.js in order for it to be usable. If you inspect the project code, you’ll notice that the MongoLab REST API is called in the logMsg() function, using an https.request. Building this little demo has given me some new ideas for hardware hacking the cloud. I hope you give it a try too. Thanks to the Arduino, Node.js and Javascript communities, and special thanks to Rick Waldron for Johnny-Five, SendGrid for the UNO board, and a big shout out to @swiftalphaone for the Waza tutorial.
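The "between 7:00pm and 8:00am" rule is a time window that wraps past midnight, which is easy to get wrong with a plain range comparison. A minimal sketch of the check follows, written in Java rather than the project's JavaScript; the class name, method name and the choice to include 19:00 but exclude 08:00 are assumptions for the example, not taken from the repository.

```java
import java.time.LocalTime;

/** Illustrative off-hours check for a window that spans midnight. */
final class OffHours {

    static final LocalTime START = LocalTime.of(19, 0); // 7:00pm
    static final LocalTime END = LocalTime.of(8, 0);    // 8:00am

    /** True when t falls in [19:00, 08:00), i.e. evening or early morning. */
    static boolean shouldAlert(LocalTime t) {
        // Because the window wraps around midnight, the two halves are joined with OR.
        return !t.isBefore(START) || t.isBefore(END);
    }
}
```

A callback for the "detect" event could call `OffHours.shouldAlert(...)` with the event's local time and send the email only when it returns true.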

April 3, 2013

by Ben Wen

· 17,804 Views

ActiveMQ Message Priorities: How it Works

There’s usually a steady drip of questions on the mailing list surrounding ActiveMQ’s message-priority support, as well as good questions about observed behaviors and “what’s really supported?” I hope to help you understand what happens under the covers and what levels of priority can be supported. The details could get gory for some. If you’re not interested in the details, take a look at the ActiveMQ wiki for the high-level overview. First, since ActiveMQ supports JMS 1.1, let’s take a look at what the JMS Spec says about support for “JMSPriority”:

JMS defines a ten-level priority value, with 0 as the lowest priority and 9 as the highest. In addition, clients should consider priorities 0-4 as gradations of normal priority and priorities 5-9 as gradations of expedited priority. JMS does not require that a provider strictly implement priority ordering of messages; however, it should do its best to deliver expedited messages ahead of normal messages.

ActiveMQ observes three distinct levels of “Priority”:

Default (JMSPriority == 4)
High (JMSPriority > 4 && <= 9)
Low (JMSPriority >= 0 && < 4)

If you don’t specify a priority for your MessageProducer or individual messages (see MessageProducer#send(message, deliveryMode, priority, timeToLive)), ActiveMQ’s client will default to using a JMSPriority == 4. As a JMS consumer, you can expect a FIFO ordering if the producers aren’t using priority or you’re not using some other form of selection criteria on the destination. ActiveMQ also “does its best” to deliver expedited messages ahead of “normal” messages, as the spec states. The message store that your broker uses greatly contributes to how exactly that’s done, but in general you can expect the broker to honor strict (0-9) priority support only for the JDBC-backed message stores.
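The three levels can be written down as a small mapping from the ten JMS priority values to a category. The sketch below is illustrative only: the class and enum names are hypothetical, it treats priorities 0 through 3 as low, and it is not ActiveMQ's own code.

```java
/** Illustrative mapping from a JMSPriority value (0-9) to one of three categories. */
final class PriorityCategory {

    enum Level { LOW, DEFAULT, HIGH }

    /** Same value as the JMS default priority, which clients use when none is set. */
    static final int DEFAULT_PRIORITY = 4;

    static Level of(int jmsPriority) {
        if (jmsPriority < 0 || jmsPriority > 9) {
            throw new IllegalArgumentException("JMSPriority must be between 0 and 9");
        }
        if (jmsPriority > DEFAULT_PRIORITY) {
            return Level.HIGH;     // 5..9
        }
        if (jmsPriority == DEFAULT_PRIORITY) {
            return Level.DEFAULT;  // exactly 4
        }
        return Level.LOW;          // 0..3
    }
}
```

Under this mapping, priorities 5 and 9 land in the same category, which is why a store that only distinguishes categories cannot promise strict 0-9 ordering.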
For KahaDB-backed message stores, only “category priority” is supported (Low, Default, High, where priorities within each category are not always differentiated; for example, 5 and 9 are both considered “High”). However, with the right settings and messaging profile, you can affect how [strict] prioritization happens even with KahaDB, so let’s take a quick look.

Enabling Message Priority

You can enable message priority on your queues by setting prioritizedMessages="true" on the policyEntry for the queue in your activemq.xml configuration file:

    <policyEntry queue="queueName" prioritizedMessages="true" />

For queueName there is wildcard support, so you can enable priority support on a whole hierarchy of queues. When you enable priority support, the broker will use prioritized linked-list structures in its message cursors as well as give KahaDB a hint to use priority categories when storing messages onto disk. There are varying levels of how strict the priority ordering can get, but at worst, you can assume priorities will be upheld by category. The following factors control how strict the priority ordering can get when using the KahaDB store:

Caching enabled/disabled in the queue cursor
MaxPageInSize for how many messages to page from the store in a batch
Consumer prefetching
Expired-message checking
Broker Memory settings
Persistent/Non-persistent messages

The next section presents a little detail about what happens in KahaDB to support priority. The sections after that cover how messages are handled in broker memory and finally dispatched to a consumer, pointing out along the way how the factors above come into play.

KahaDB Prioritization Categories

First we’ll start with how messages are stored on disk and loaded into a destination. KahaDB (the default message store) is a file-based message database that the broker uses to persist messages in a “log” or “journal”.
The broker also keeps track of which messages are in the log by keeping a separate “index” that holds information about each message (its location in the log, the destination it’s associated with, its ordering, etc). The index also has a notion of message “priority”, which is implemented with three B+Tree structures, one for each priority category (see MessageOrderIndex in org.apache.activemq.store.kahadb.MessageDatabase). This implementation detail is the root of message prioritization and has implications for the rest of the broker as messages are removed from the store.

Messages are retrieved from the store in batches (maxPageInSize), and messages that are in the “highPriority” BTree are retrieved first. When the high-priority messages are exhausted, the store will then offer up the default-priority and subsequently the low-priority messages. You can set maxPageInSize as an attribute on the queue’s policyEntry in activemq.xml. The larger the page size, the larger the number of messages in a batch and the more messages you can see at a time per “snapshot”. For each batch that’s brought into memory, its messages are going to be strictly prioritized by the store cursor, as described below. The downside is that if your messages are large, bringing in 500 at a time could exhaust your broker memory. The default setting is 200.

Message Cursor Priority Lists

When persistent messages come into the broker from a producer, they will be stored onto disk, but they will also be cached in memory waiting to be dispatched to a consumer. This is a default setting, so there is no need to set it explicitly. The idea behind this is to be able to dispatch to fast consumers without having to retrieve messages directly from disk (if consumers become slow, the broker will auto-tune itself to stop using the cache once it’s filled, so as not to run out of memory).
The good thing about this is that when prioritization support is used for a queue, the internal lists used for the cursors will support strict priority (0-9), so all of the messages that are currently in memory (in the cache) will be sorted properly from highest to lowest. The trick is what happens when all of the messages in the cache are lower-priority messages and then a high-priority message comes in to the broker but won’t fit in the cache because it’s full. In that case the message will go directly to the store and be indexed in the “high-priority” index, but it won’t be available for dispatch ahead of the lower-priority messages until it’s paged into memory in the next batch.

When non-persistent messages come into the broker, they will not go to the message store. They will be kept in memory for as long as possible and only pushed to disk (in a temporary store) when memory usage has passed a defined threshold (> 70% by default). So the same behaviors described above for cached messages apply to non-persistent messages: those that are in memory will be ordered strictly (0-9), but once they get pushed to disk, only categories are observed.

If you disable the cursor’s cache (useCache="false" on the queue’s policyEntry), you can help eliminate the above scenario where the cache becomes full of lower-priority messages right when a high-priority message comes in (and becomes stuck on disk because it cannot be paged into memory). However, doing this will reduce your throughput, because every message must be paged in from disk before it can be dispatched to consumers. Note also that when doing this, you are more likely to see messages not following “strict” priority, even with the priority lists in the cursor. They will, however, follow the priority categories (High, Default, Low) properly.
So to recap, if you disable the cache, you can get higher-priority messages delivered in a more timely fashion than you can if the cache is enabled and filled with lower-priority messages. But disabling the cache, by itself, won’t get you to strict priority.

Disabling the cache helps get high-priority messages to consumers ahead of lower-priority messages. However, for this to work as intended (this one has bitten me), you’ll also want to disable the asynchronous message expiry check. By default, this expiry check pages messages into memory every 30 seconds, regardless of whether they’re ready to be dispatched, performs a TTL (time to live) check on them, and discards those messages that should be expired. This sort of checking effectively brings messages into memory and will stall the normal “page in for dispatch” just enough to miss higher-priority messages. Turning off expiry checking, however, will keep expired messages in the store longer, because the only remaining expiry check is done right before dispatch. Make an educated decision on this, as with all ActiveMQ settings you tinker with. But to move in the direction of strict(er) priority ordering, you’ll want to disable it.

Lastly, consumer prefetch plays a role in achieving “strict ordering.” By default, prefetch is set to 1000 for queue consumers, which means they can be sent up to 1000 messages in a batch. This helps speed up the consumer when it’s consuming messages, but in terms of priority handling it in essence also acts like a cache of messages (discussed above) and could contribute to not seeing “strict ordering”. Even “category priority” could be violated: if your prefetch is filled with lower-priority messages and a new high-priority message comes in to the broker, you won’t see it until the next message dispatch to the consumer. So the lower the prefetch, the better the chance of seeing higher-priority messages ahead of lower ones.
With a prefetch of 1, you’ll always get the highest-priority message that the store cursor knows about.

Client side message priority

ActiveMQ also has priority support built right into the message client, and it’s enabled by default. This means that when messages are sent to your consumer (even before your consumer receives them, because of prefetch), they will be cached on the consumer side and prioritized by default. This happens regardless of whether you’re using priority support on the broker side. It could impact the ordering you see on the consumer, so just keep it in mind. To disable it, set the following configuration option on your broker URL, e.g., tcp://0.0.0.0:61616?jms.messagePrioritySupported=false

But as mentioned above, you’ll want to lower the prefetch to 1 to get the best chance of achieving strict ordering.

Tradeoffs

So ultimately, getting strictly ordered messages with KahaDB is possible, but there are significant tradeoffs to consider and it won’t apply to every messaging situation. Do you want optimized, fast messaging, or do you want to slow down the messaging to achieve strict(er) ordering for priorities? Each situation is different and should be evaluated on a case-by-case basis. In general, however, you can rely on category-level priorities. Reordering messages across large queues AND keeping high performance is problematic, and most message queue vendors do not do that very well.

ActiveMQ’s priority support is strong, but another good alternative exists, as discussed on the ActiveMQ wiki page describing message priority: using message selectors and balancing out the consumers in such a way that high-priority messages end up getting consumed first. This approach tends to give more flexibility and control, but that’s for another post.

Leave me some comments if something wasn’t clear, or drop an email to the mailing list!
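The interaction between priority categories and batch paging described above can be sketched with a toy model. This is not ActiveMQ code: the class PriorityPagingSketch and its methods are hypothetical names invented for illustration, and the three FIFO queues only stand in for the three index structures the article mentions. It assumes messages are paged in from the High category first, then Default, then Low, and that each batch is strictly sorted once it is in memory.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

// Toy model only (hypothetical names): messages are represented by their priority.
public class PriorityPagingSketch {

    // Category boundaries as described in the article: 4 is Default,
    // 5-9 is High, 0-3 is Low.
    static String categoryOf(int jmsPriority) {
        if (jmsPriority < 0 || jmsPriority > 9) {
            throw new IllegalArgumentException("JMS priority must be 0-9");
        }
        if (jmsPriority > 4) return "HIGH";
        if (jmsPriority == 4) return "DEFAULT";
        return "LOW";
    }

    // Stand-in for the store index: one FIFO queue per category.
    private final Deque<Integer> high = new ArrayDeque<>();
    private final Deque<Integer> normal = new ArrayDeque<>();
    private final Deque<Integer> low = new ArrayDeque<>();

    void store(int jmsPriority) {
        switch (categoryOf(jmsPriority)) {
            case "HIGH": high.addLast(jmsPriority); break;
            case "DEFAULT": normal.addLast(jmsPriority); break;
            default: low.addLast(jmsPriority); break;
        }
    }

    // Page in at most maxPageInSize messages, High category first,
    // then sort the in-memory batch strictly from 9 down to 0.
    List<Integer> pageIn(int maxPageInSize) {
        List<Integer> batch = new ArrayList<>();
        for (Deque<Integer> category : Arrays.asList(high, normal, low)) {
            while (batch.size() < maxPageInSize && !category.isEmpty()) {
                batch.add(category.pollFirst());
            }
        }
        batch.sort(Comparator.reverseOrder());
        return batch;
    }

    // Store everything, then page in batch by batch and record the order.
    static List<Integer> dispatchOrder(int[] priorities, int maxPageInSize) {
        PriorityPagingSketch sketch = new PriorityPagingSketch();
        for (int p : priorities) sketch.store(p);
        List<Integer> order = new ArrayList<>();
        List<Integer> batch;
        while (!(batch = sketch.pageIn(maxPageInSize)).isEmpty()) {
            order.addAll(batch);
        }
        return order;
    }

    public static void main(String[] args) {
        int[] stored = {5, 9, 1, 4, 7, 0, 9};
        // Small batches: categories are upheld, strict order is not (5 precedes a 9).
        System.out.println(dispatchOrder(stored, 2));  // [9, 5, 9, 7, 4, 1, 0]
        // One batch that holds everything: strict 9-to-0 order.
        System.out.println(dispatchOrder(stored, 10)); // [9, 9, 7, 5, 4, 1, 0]
    }
}
```

With a batch size of 2, a priority-5 message is dispatched ahead of a later priority-9 message, yet no Default or Low message ever overtakes a High one. That matches the "at worst, priorities are upheld by category" behavior the article describes for KahaDB.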

April 2, 2013

by Christian Posta

· 19,041 Views

HashSet vs. TreeSet vs. LinkedHashSet

In a set, there are no duplicate elements. That is one of the major reasons to use a set. There are three commonly used implementations of Set in Java: HashSet, TreeSet and LinkedHashSet. When and which to use is an important question. In brief, if we want a fast set, we should use HashSet; if we need a sorted set, then TreeSet should be used; if we want a set that can be read by following its insertion order, LinkedHashSet should be used.

1. Set interface

The Set interface extends the Collection interface. In a set, no duplicates are allowed; every element must be unique. We can simply add elements to a set, and we end up with a set of elements with duplicates removed automatically.

2. HashSet vs. TreeSet vs. LinkedHashSet

HashSet is implemented using a hash table. Elements are not ordered. The add, remove, and contains methods have constant time complexity O(1).

TreeSet is implemented using a tree structure (a red-black tree, in algorithm-book terms). The elements in the set are sorted, but the add, remove, and contains methods have time complexity O(log(n)). It offers several methods for dealing with the ordered set, such as first(), last(), headSet(), tailSet(), etc.

LinkedHashSet is between HashSet and TreeSet. It is implemented as a hash table with a linked list running through it, so it preserves the order of insertion. The time complexity of the basic methods is O(1).

3. TreeSet example

    TreeSet<Integer> tree = new TreeSet<Integer>();
    tree.add(12);
    tree.add(63);
    tree.add(34);
    tree.add(45);

    Iterator<Integer> iterator = tree.iterator();
    System.out.print("Tree set data: ");
    while (iterator.hasNext()) {
        System.out.print(iterator.next() + " ");
    }

The output is sorted as follows:

    Tree set data: 12 34 45 63

Now let's define a Dog class as follows:

    class Dog {
        int size;

        public Dog(int s) {
            size = s;
        }

        public String toString() {
            return size + "";
        }
    }

Let's add some dogs to a TreeSet like the following:

    import java.util.Iterator;
    import java.util.TreeSet;

    public class TestTreeSet {
        public static void main(String[] args) {
            TreeSet<Dog> dset = new TreeSet<Dog>();
            dset.add(new Dog(2));
            dset.add(new Dog(1));
            dset.add(new Dog(3));

            Iterator<Dog> iterator = dset.iterator();
            while (iterator.hasNext()) {
                System.out.print(iterator.next() + " ");
            }
        }
    }

This compiles, but a run-time error occurs:

    Exception in thread "main" java.lang.ClassCastException: collection.Dog cannot be cast to java.lang.Comparable
        at java.util.TreeMap.put(Unknown Source)
        at java.util.TreeSet.add(Unknown Source)
        at collection.TestTreeSet.main(TestTreeSet.java:22)

Because TreeSet is sorted, the Dog object needs to implement java.lang.Comparable's compareTo() method like the following:

    class Dog implements Comparable<Dog> {
        int size;

        public Dog(int s) {
            size = s;
        }

        public String toString() {
            return size + "";
        }

        @Override
        public int compareTo(Dog o) {
            // Integer.compare avoids the overflow that "size - o.size" can hit
            return Integer.compare(size, o.size);
        }
    }

The output is:

    1 2 3

4. HashSet example

    HashSet<Dog> dset = new HashSet<Dog>();
    dset.add(new Dog(2));
    dset.add(new Dog(1));
    dset.add(new Dog(3));
    dset.add(new Dog(5));
    dset.add(new Dog(4));

    Iterator<Dog> iterator = dset.iterator();
    while (iterator.hasNext()) {
        System.out.print(iterator.next() + " ");
    }

Output:

    5 3 2 1 4

Note that the order is not guaranteed.

5. LinkedHashSet example

    LinkedHashSet<Dog> dset = new LinkedHashSet<Dog>();
    dset.add(new Dog(2));
    dset.add(new Dog(1));
    dset.add(new Dog(3));
    dset.add(new Dog(5));
    dset.add(new Dog(4));

    Iterator<Dog> iterator = dset.iterator();
    while (iterator.hasNext()) {
        System.out.print(iterator.next() + " ");
    }

The order of the output is guaranteed, and it is the insertion order:

    2 1 3 5 4

6. Performance testing

The following method tests the performance of the three classes on the add() method.

    public static void main(String[] args) {
        Random r = new Random();

        HashSet<Dog> hashSet = new HashSet<Dog>();
        TreeSet<Dog> treeSet = new TreeSet<Dog>();
        LinkedHashSet<Dog> linkedSet = new LinkedHashSet<Dog>();

        // start time
        long startTime = System.nanoTime();
        for (int i = 0; i < 1000; i++) {
            int x = r.nextInt(1000 - 10) + 10;
            hashSet.add(new Dog(x));
        }
        // end time
        long endTime = System.nanoTime();
        long duration = endTime - startTime;
        System.out.println("HashSet: " + duration);

        // start time
        startTime = System.nanoTime();
        for (int i = 0; i < 1000; i++) {
            int x = r.nextInt(1000 - 10) + 10;
            treeSet.add(new Dog(x));
        }
        // end time
        endTime = System.nanoTime();
        duration = endTime - startTime;
        System.out.println("TreeSet: " + duration);

        // start time
        startTime = System.nanoTime();
        for (int i = 0; i < 1000; i++) {
            int x = r.nextInt(1000 - 10) + 10;
            linkedSet.add(new Dog(x));
        }
        // end time
        endTime = System.nanoTime();
        duration = endTime - startTime;
        System.out.println("LinkedHashSet: " + duration);
    }

From the output below, we can clearly see that HashSet is the fastest one.

    HashSet: 2244768
    TreeSet: 3549314
    LinkedHashSet: 2263320
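The ordering differences between the three sets can be condensed into one small program. This is a minimal sketch using Integer elements rather than the article's Dog class; the class name SetOrderSketch and the helper insertInto are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class SetOrderSketch {

    // Add the same values (including a duplicate 3) to any Set implementation
    // and return the elements in that set's iteration order.
    static List<Integer> insertInto(Set<Integer> set) {
        for (int n : new int[] {2, 1, 3, 5, 4, 3}) {
            set.add(n);
        }
        return new ArrayList<>(set);
    }

    public static void main(String[] args) {
        // TreeSet iterates in sorted order.
        System.out.println("TreeSet:       " + insertInto(new TreeSet<>()));       // [1, 2, 3, 4, 5]
        // LinkedHashSet iterates in insertion order.
        System.out.println("LinkedHashSet: " + insertInto(new LinkedHashSet<>())); // [2, 1, 3, 5, 4]
        // HashSet makes no ordering promise, so only the size is shown.
        System.out.println("HashSet size:  " + insertInto(new HashSet<>()).size()); // 5

        // Sorted-set helpers mentioned in the article.
        TreeSet<Integer> tree = new TreeSet<>(insertInto(new HashSet<>()));
        System.out.println(tree.first() + " " + tree.last() + " " + tree.headSet(3)); // 1 5 [1, 2]
    }
}
```

In all three cases the duplicate 3 is dropped, which is the Set contract; only the iteration order differs.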

March 29, 2013

by Ryan Wang

· 181,646 Views · 3 Likes

ArrayList vs. LinkedList vs. Vector

1. List overview

A List, as its name indicates, is an ordered sequence of elements. When we talk about List, it is a good idea to compare it with Set, which is an unordered collection in which every element is unique. The following is the class hierarchy diagram of Collection.

2. ArrayList vs. LinkedList vs. Vector

From the hierarchy diagram, they all implement the List interface. They are very similar to use. Their main difference is their implementation, which causes different performance for different operations.

ArrayList is implemented as a resizable array. As more elements are added to an ArrayList, its size is increased dynamically. Its elements can be accessed directly by using the get and set methods, since ArrayList is essentially an array.

LinkedList is implemented as a doubly linked list. Its performance on add and remove is better than ArrayList's, but worse on the get and set methods.

Vector is similar to ArrayList, but it is synchronized.

ArrayList is a better choice if your program does not need the list itself to be synchronized, for example because the list is not shared between threads or because you handle synchronization yourself. Vector and ArrayList require more space as more elements are added. Vector doubles its array size each time, while ArrayList grows by 50% of its size each time. LinkedList, however, also implements the Queue interface, which adds more methods than ArrayList and Vector have, such as offer(), peek(), poll(), etc.

Note: The default initial capacity of an ArrayList is pretty small. It is a good habit to construct the ArrayList with a higher initial capacity when you expect many elements. This can avoid the resizing cost.

3. ArrayList example

    ArrayList<Integer> al = new ArrayList<Integer>();
    al.add(3);
    al.add(2);
    al.add(1);
    al.add(4);
    al.add(5);
    al.add(6);
    al.add(6);

    Iterator<Integer> iter1 = al.iterator();
    while (iter1.hasNext()) {
        System.out.println(iter1.next());
    }

4. LinkedList example

    LinkedList<Integer> ll = new LinkedList<Integer>();
    ll.add(3);
    ll.add(2);
    ll.add(1);
    ll.add(4);
    ll.add(5);
    ll.add(6);
    ll.add(6);

    // iterate over the LinkedList itself (ll), not the ArrayList from the previous example
    Iterator<Integer> iter2 = ll.iterator();
    while (iter2.hasNext()) {
        System.out.println(iter2.next());
    }

As shown in the examples above, they are similar to use. The real difference is their underlying implementation and their operation complexity.

5. Vector

Vector is almost identical to ArrayList; the difference is that Vector is synchronized. Because of this, it has more overhead than ArrayList. Normally, most Java programmers use ArrayList instead of Vector because they can synchronize explicitly by themselves.

6. Performance of ArrayList vs. LinkedList

The time complexity comparison is as follows:

I use the following code to test their performance:

    ArrayList<Integer> arrayList = new ArrayList<Integer>();
    LinkedList<Integer> linkedList = new LinkedList<Integer>();

    // ArrayList add
    long startTime = System.nanoTime();
    for (int i = 0; i < 100000; i++) {
        arrayList.add(i);
    }
    long endTime = System.nanoTime();
    long duration = endTime - startTime;
    System.out.println("ArrayList add: " + duration);

    // LinkedList add
    startTime = System.nanoTime();
    for (int i = 0; i < 100000; i++) {
        linkedList.add(i);
    }
    endTime = System.nanoTime();
    duration = endTime - startTime;
    System.out.println("LinkedList add: " + duration);

    // ArrayList get
    startTime = System.nanoTime();
    for (int i = 0; i < 10000; i++) {
        arrayList.get(i);
    }
    endTime = System.nanoTime();
    duration = endTime - startTime;
    System.out.println("ArrayList get: " + duration);

    // LinkedList get
    startTime = System.nanoTime();
    for (int i = 0; i < 10000; i++) {
        linkedList.get(i);
    }
    endTime = System.nanoTime();
    duration = endTime - startTime;
    System.out.println("LinkedList get: " + duration);

    // ArrayList remove
    startTime = System.nanoTime();
    for (int i = 9999; i >= 0; i--) {
        arrayList.remove(i);
    }
    endTime = System.nanoTime();
    duration = endTime - startTime;
    System.out.println("ArrayList remove: " + duration);

    // LinkedList remove
    startTime = System.nanoTime();
    for (int i = 9999; i >= 0; i--) {
        linkedList.remove(i);
    }
    endTime = System.nanoTime();
    duration = endTime - startTime;
    System.out.println("LinkedList remove: " + duration);

And the output is:

    ArrayList add: 13265642
    LinkedList add: 9550057
    ArrayList get: 1543352
    LinkedList get: 85085551
    ArrayList remove: 199961301
    LinkedList remove: 85768810

The difference in their performance is obvious. LinkedList is faster in add and remove, but slower in get. Based on the complexity table and testing results, we can figure out when to use ArrayList or LinkedList. In brief, LinkedList should be preferred if:

there is not a large number of random accesses to elements
there is a large number of add/remove operations
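Two of the points above, presizing an ArrayList and using LinkedList through the Queue interface, can be shown in a few lines. This is a minimal sketch; the class name ListChoiceSketch and its helper methods are hypothetical and not part of the original article.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

public class ListChoiceSketch {

    // Passing the expected size to the constructor lets the backing array
    // be allocated once instead of being regrown as elements are added.
    static List<Integer> presized(int expectedSize) {
        List<Integer> list = new ArrayList<>(expectedSize);
        for (int i = 0; i < expectedSize; i++) {
            list.add(i);
        }
        return list;
    }

    // Remove elements from the head of the queue until it is empty,
    // returning them in the order they came out.
    static List<String> drain(Queue<String> queue) {
        List<String> out = new ArrayList<>();
        String head;
        while ((head = queue.poll()) != null) {
            out.add(head);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> numbers = presized(100_000);
        System.out.println(numbers.get(99_999)); // 99999

        // LinkedList used as a FIFO queue via offer(), peek() and poll().
        Queue<String> queue = new LinkedList<>();
        queue.offer("a");
        queue.offer("b");
        queue.offer("c");
        System.out.println(queue.peek());  // a (head is inspected, not removed)
        System.out.println(drain(queue));  // [a, b, c]
        System.out.println(queue.isEmpty()); // true
    }
}
```

Declaring the variable as Queue rather than LinkedList keeps the calling code limited to queue operations, so the implementation could be swapped later without touching the callers.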

March 28, 2013

by Ryan Wang

· 611,203 Views · 20 Likes

HashMap vs. TreeMap vs. HashTable vs. LinkedHashMap

Learn all about important data structures like HashMap, HashTable, and TreeMap.

March 28, 2013

by Ryan Wang

· 438,346 Views · 14 Likes