DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest Data Topics

article thumbnail
IndexedDB and Date Example
about an hour ago i gave a presentation on indexeddb. one of the attendees asked about dates and being able to filter based on a date range. i told him that my assumption was that you would need to convert the dates into numbers and use a number-based range. turns out i was wrong. here is an example. i began by creating an objectstore that used an index on the created field. since our intent is to search via a date field, i decided "created" would be a good name. i also named my objectstore as "data". boring, but it works. var openrequest = indexeddb.open("idbpreso_date1",1); openrequest.onupgradeneeded = function(e) { var thisdb = e.target.result; if(!thisdb.objectstorenames.contains("data")) { var os = thisdb.createobjectstore("data", {autoincrement:true}); os.createindex("created", "created", {unique:false}); } } next - i built a simple way to seed data. i based on a button click event to add 10 objects. each object will have one property, created, and the date object will be based on a random date from now till 7 days in the future. function doseed() { var now = new date(); for(var i=0; i<10; i++) { var daydiff = getrandomint(1, 7); var thisdate = new date(); thisdate.setdate(now.getdate() + daydiff); db.transaction(["data"],"readwrite").objectstore("data").add({created:thisdate}); } } //credit: mozilla developer center function getrandomint (min, max) { return math.floor(math.random() * (max - min + 1)) + min; } note that since indexeddb calls are asynchronous, my code should handle updating the user to let them know when the operation is done. since this is just a quick demo though, and since that add operation will complete incredibly fast, i decided to not worry about it. so at this point we'd have an application that lets us add data containing a created property with a valid javascript date. note i didn't change it to milliseconds. i just passed it in as is. for the final portion i added two date fields on my page. in chrome this is rendered nicely: based on these, i can then create an indexeddb range of either bounds, lowerbounds, or upperbounds. i.e., give me crap either after a date, before a date, or inside a date range. function dosearch() { var fromdate = document.queryselector("#fromdate").value; var todate = document.queryselector("#todate").value; var range; if(fromdate == "" && todate == "") return; var transaction = db.transaction(["data"],"readonly"); var store = transaction.objectstore("data"); var index = store.index("created"); if(fromdate != "") fromdate = new date(fromdate); if(todate != "") todate = new date(todate); if(fromdate != "" && todate != "") { range = idbkeyrange.bound(fromdate, todate); } else if(fromdate == "") { range = idbkeyrange.upperbound(todate); } else { range = idbkeyrange.lowerbound(fromdate); } var s = ""; index.opencursor(range).onsuccess = function(e) { var cursor = e.target.result; if(cursor) { s += "key "+cursor.key+""; for(var field in cursor.value) { s+= field+"="+cursor.value[field]+""; } s+=""; cursor.continue(); } document.queryselector("#status").innerhtml = s; } } the only conversion required here was to take the user input and turn it into "real" date objects. once done, everything works great: you can run the full demo below.
June 7, 2013
by Raymond Camden
· 7,300 Views
article thumbnail
Write CSV Data into Hive and Python
Apache Hive is a high level SQL-like interface to Hadoop. It lets you execute mostly unadulterated SQL, like this: CREATE TABLE test_table(key string, stats map); The map column type is the only thing that doesn’t look like vanilla SQL here. Hive can actually use different backends for a given table. Map is used to interface with column oriented backends like HBase. Essentially, because we won’t know ahead of time all the column names that could be in the HBase table, Hive will just return them all as a key/value dictionary. There are then helpers to access individual columns by key, or even pivot the map into one key per logical row. As part of the Hadoop family, Hive is focused on bulk loading and processing. So it’s not a surprise that Hive does not support inserting raw values like the following SQL: INSERT INTO suppliers (supplier_id, supplier_name) VALUES (24553, 'IBM'); However, for unit testing Hive scripts, it would be nice to be able to insert a few records manually. Then you could run your map reduce HQL, and validate the output. Luckily, Hive can load CSV files, so it’s relatively easy to insert a handful or records that way. CREATE TABLE foobar(key string, stats map) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '|' MAP KEYS TERMINATED BY ':' ; LOAD DATA LOCAL INPATH '/tmp/foobar.csv' INTO TABLE foobar; This will load a CSV file with the following data, where c4ca4-0000001-79879483-000000000124 is the key, and comments and likesare columns in a map. c4ca4-0000001-79879483-000000000124,comments:0|likes:0 c4ca4-0000001-79879483-000000000124,comments:0|likes:0 Because I’ve been doing this quite a bit in my unit tests, I wrote a quick Python helper to dump a list of key/map tuples to a temporary CSV file, and then load it into Hive. This uses hiver to talk to Hive over thrift. import hiver from django.core.files.temp import NamedTemporaryFile def _hql(self, hql): client = hiver.connect(settings.HIVE_HOST, settings.HIVE_PORT) try: client.execute(hql) finally: client.shutdown() def insert(self, table_name, rows): ''' cannot insert single rows via hive, need to save to a temp file and bulk load that ''' csv_file = NamedTemporaryFile(delete=True) for row in rows: map_repr = '|'.join('%s:%s' % (key, value) for key, value in row[1].items()) csv_file.write(row[0] + "," + map_repr + "\n") csv_file.flush() try: _hql('DROP TABLE IF EXISTS %s' % table_name) _hql(""" CREATE TABLE %s ( key string, map ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '|' MAP KEYS TERMINATED BY ':' """ % (table_name)) _hql(""" LOAD DATA LOCAL INPATH '%s' INTO TABLE %s """ % (csv_file.name, table_name) finally: csv_file.close() You can call it like this: insert('test_table', [ ('c4ca4-0000001-79879483-000000000124', {'comments': 1, 'likes': 2}), ('c4ca4-0000001-79879483-000000000124', {'comments': 1, 'likes': 2}), ('c4ca4-0000001-79879496-000000000124', {'comments': 1, 'likes': 2}), ('b4aed-0000002-79879783-000000000768', {'comments': 1, 'likes': 2}), ('b4aed-0000002-79879783-000000000768', {'comments': 1, 'likes': 2}), ])
June 5, 2013
by Chase Seibert
· 14,716 Views
article thumbnail
Accessing An Artifact’s Maven And SCM Versions At Runtime
You can easily tell Maven to include the version of the artifact and its Git/SVN/… revision in the JAR manifest file and then access that information at runtime via getClass().getPackage.getImplementationVersion(). (All credit goes to Markus Krüger and other colleagues.) Include Maven artifact version in the manifest (Note: You will actually not want to use it, if you also want to include a SCM revision; see below.) pom.xml: ... org.apache.maven.plugins maven-jar-plugin ... true true ... ... The resulting MANIFEST.MF of the JAR file will then include the following entries, with values from the indicated properties: Built-By: ${user.name} Build-Jdk: ${java.version} Specification-Title: ${project.name} Specification-Version: ${project.version} Specification-Vendor: ${project.organization.name Implementation-Title: ${project.name} Implementation-Version: ${project.version} Implementation-Vendor-Id: ${project.groupId} Implementation-Vendor: ${project.organization.name} (Specification-Vendor and Implementation-Vendor come from the POM’s organization/name.) Include SCM revision For this you can either use the Build Number Maven plugin that produces the property ${buildNumber}, or retrieve it from environment variables passed by Jenkinsor Hudson (SVN_REVISION for Subversion, GIT_COMMIT for Git). For git alone, you could also use the maven-git-commit-id-plugin that can either replace strings such as ${git.commit.id} in existing resource files (using maven’s resource filtering, which you must enable) with the actual values or output all of them into a git.properties file. Let’s use the buildnumber-maven-plugin and create the manifest entries explicitely, containing the build number (i.e. revision) org.codehaus.mojo buildnumber-maven-plugin 1.2 validate create false false org.apache.maven.plugins maven-jar-plugin 2.4 ${project.name} ${project.version} ${buildNumber} ... Accessing the version & revision As mentioned above, you can access the manifest entries from your code via getClass().getPackage.getImplementationVersion() andgetClass().getPackage.getImplementationTitle(). References SO: How to get Maven Artifact version at runtime? Maven Archiver documentation
May 28, 2013
by Jakub Holý
· 12,760 Views
article thumbnail
Capturing camera/picture data without PhoneGap
As people know, I'm a huge fan of PhoneGap and what it allows me to do with JavaScript, HTML, and CSS. But I think it is crucial to remember that you don't always need PhoneGap. A great example of that is camera access. Did you know that recent mobile browsers support accessing the camera directly from HTML and JavaScript? Let's look at an example. Over a year ago I wrote a blog post where I created an application called "Color Thief." This application made use of PhoneGap's Camera API and a third party JavaScript library called Color Thief. I loved this example because it demonstrated how you could combine the extra power that PhoneGap provides along with existing JavaScript libraries. This morning I watched an excellent Google IO presentation (https://www.youtube.com/watch?v=EPYnGFEcis4&feature=youtube_gdata_player) on Mobile HTML. It was an overview of some of the exciting stuff you can now do with mobile HTML and JavaScript. To be clear, this was all without using wrappers like PhoneGap. In one of the examples the presenters discussed the new "capture" support for the input/file field type. This is rather simple to implement: If supported (recent Android and latest iOS), the user can then use their camera to select a picture. I decided to rebuild my old demo to skip PhoneGap completely and just make use of this feature. Here's the code: For the most part, this is pretty similar to the last version. I no longer wait for the deviceready event, but instead just listen for the document itself to load. Instead of listening for a button click, I've switched to a input field using type=file. I now listen for the change event, and on that, I see if I have access to a file. If I do, I can then use the URL object to create a pointer to the source and then simply add it to my DOM. After that, Color Thief takes over. The only tricky part I ran into was that in iOS the URL object is still prefixed. You can see how I get around that in the startup code. To be fair, this isn't 100% backwards compatible, I could add a few checks in here to ensure that things will work and gracefully let people on older phones know they can't use this feature. But the end result is nearly the exact same functionality in a web page - no PhoneGap, no native code. <br>
May 21, 2013
by Raymond Camden
· 17,568 Views
article thumbnail
Bucketing, Multiplexing and Combining in Hadoop - Part 1
this is the first blog post in a series which looks at some data organization patterns in mapreduce. we’ll look at how to bucket output across multiple files in a single task, how to multiplex data across multiple files, and also how to coalesce data. these are all common patterns that are useful to have in your mapreduce toolkit. we’ll kick things off with a look at bucketing data outputs in your map or reduce tasks. by default when using a fileoutputformat-derived outputformat (such as textoutputformat), all the outputs for a reduce task (or a map task in a map-only job) are written to a single file in hdfs. imagine a situation where you have user activity logs being streamed into hdfs, and you want to write a mapreduce job to better organize the incoming data. as an example a large organization with multiple products may want to bucket the logs based on the product. to do this you’ll need the ability to write to multiple output files in a single task. let’s take a look at how we can make that happen. multipleoutputformat there are a few ways you can achieve your goal, and the first option we’ll look at is the multipleoutputformat class in hadoop. this is an abstract class that lets you do the following: define the output path for each and every key/value output record being emitted by a task. incorporate the input paths into the output directory for map-only jobs. redefine the key and value that are used to write to the underlying recordwriter . this is useful in situations where you want to remove data from the outputs as it duplicates data in the filename. for each output path, define the recordwriter that should be used to write the outputs. ok enough with the words - let’s look at some data and code. first up is the simple data we’ll use in our example - imagine you work at a fruit market with locations in multiple cities, and you have a purchase transaction stream which contains the store location along with the fruit that was purchased. cupertino apple sunnyvale banana cupertino pear to help bucket your data for future analysis, you want to bin each record into city-specific files. for the simple data set above you don’t want to filter, project or transform your data, just bucket it out, so a simple identity map-only job will do the job. to force more than one mapper, we’ll write the data to two separate files. $ tab="$(printf '\t')" $ hdfs -put - file1.txt << eof cupertino${tab}apple sunnyvale${tab}banana eof $ hdfs -put - file2.txt << eof cupertino${tab}pear eof here’s the code which will let you write city-specific output files. import org.apache.commons.lang.stringutils; import org.apache.hadoop.conf.configuration; import org.apache.hadoop.conf.configured; import org.apache.hadoop.fs.filesystem; import org.apache.hadoop.fs.path; import org.apache.hadoop.io.text; import org.apache.hadoop.mapred.*; import org.apache.hadoop.mapred.lib.identitymapper; import org.apache.hadoop.mapred.lib.multipletextoutputformat; import org.apache.hadoop.util.progressable; import org.apache.hadoop.util.tool; import org.apache.hadoop.util.toolrunner; import java.io.ioexception; import java.util.arrays; /** * an example of how to use {@link org.apache.hadoop.mapred.lib.multipleoutputformat}. */ public class mofexample extends configured implements tool { /** * create output files based on the output record's key name. */ static class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(text key, text value, string name) { return key.tostring() + "/" + name; } } /** * the main job driver. */ public int run(final string[] args) throws exception { string csvinputs = stringutils.join(arrays.copyofrange(args, 0, args.length - 1), ","); path outputdir = new path(args[args.length - 1]); jobconf jobconf = new jobconf(super.getconf()); jobconf.setjarbyclass(mofexample.class); jobconf.setnumreducetasks(0); jobconf.setmapperclass(identitymapper.class); jobconf.setinputformat(keyvaluetextinputformat.class); jobconf.setoutputformat(keybasedmultipletextoutputformat.class); fileinputformat.setinputpaths(jobconf, csvinputs); fileoutputformat.setoutputpath(jobconf, outputdir); return jobclient.runjob(jobconf).issuccessful() ? 0 : 1; } /** * main entry point for the utility. * * @param args arguments * @throws exception when something goes wrong */ public static void main(final string[] args) throws exception { int res = toolrunner.run(new configuration(), new mofexample(), args); system.exit(res); } } run this code and you’ll see the following files in hdfs, where /output is the job output directory: $ hadoop fs -lsr /output /output/cupertino/part-00000 /output/cupertino/part-00001 /output/sunnyvale/part-00000 if you look at the output files you’ll see that the files contain the correct buckets. $ hadoop fs -lsr /output/cupertino/* cupertino apple cupertino pear $ hadoop fs -lsr /output/sunnyvale/* sunnyvale banana awesome, you have your data bucketed by store. now that we have everything working, let’s look at what we did to get there. we had to do two things to get this working: extend multipletextoutputformat this is where the magic happened - let’s look at that class again. static class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(text key, text value, string name) { return key.tostring() + "/" + name; } } you are working with text, which is why you extended multipletextoutputformat , a class that in turn extends multipleoutputformat . multipletextoutputformat is a simple class which instructs the multipleoutputformat to use textoutputformat as the underlying output format for writing out the records. if you were to use multipleoutputformat as-is it behaves as if you were using the regular textoutputformat , which is to say that it’ll only write to a single output file. to write data to multiple files you had to extend it, as with the example above. the generatefilenameforkeyvalue method allows you to return the output path for an input record. the third argument, name , is the original fileoutputformat -created filename, which is in the form “part-nnnnn”, where “nnnnn” is the task index, to ensure uniqueness. to avoid file collisions, it’s a good idea to make sure your generated output paths are unique, and leveraging the original output file is certainly a good way of doing this. in our example we’re using the key as the directory name, and then writing to the original fileoutputformat filename within that directory. specify the outputformat the next step was easy - specify that this output format should be used for your job: jobconf.setoutputformat(keybasedmultipletextoutputformat.class); earlier we also mentioned that you can use the input path as part of the output path, which we will look at next. using the input filename as part of the output filename in map-only jobs what if we wanted to keep the input filename as part of the output filename? this only works for map-only jobs, and can be accomplished by overriding the getinputfilebasedoutputfilename method. let’s look at the following code to understand how this method fits into the overall sequence of actions that the multipleoutputformat class performs: public void write(k key, v value) throws ioexception { // get the file name based on the key string keybasedpath = generatefilenameforkeyvalue(key, value, myname); // get the file name based on the input file name string finalpath = getinputfilebasedoutputfilename(myjob, keybasedpath); // get the actual key k actualkey = generateactualkey(key, value); v actualvalue = generateactualvalue(key, value); recordwriter rw = this.recordwriters.get(finalpath); if (rw == null) { // if we don't have the record writer yet for the final path, create // one // and add it to the cache rw = getbaserecordwriter(myfs, myjob, finalpath, myprogressable); this.recordwriters.put(finalpath, rw); } rw.write(actualkey, actualvalue); }; the getinputfilebasedoutputfilename method is called with the output of generatefilenameforkeyvalue , which contains our already-customized output file. our new keybasedmultipletextoutputformat can now be updated to override getinputfilebasedoutputfilename and append the original input filename to the output filename: static class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(object key, object value, string name) { return key.tostring() + "/" + name; } @override protected string getinputfilebasedoutputfilename(jobconf job, string name) { string infilename = new path(job.get("map.input.file")).getname(); return name + "-" + infilename; } if you run with your modified outputformat class you’ll see the following files in hdfs, confirming that the input filenames are now concatenated to the end of each output file. $ hadoop fs -lsr /output /output/cupertino/part-00000-file1.txt /output/cupertino/part-00001-file2.txt /output/sunnyvale/part-00000-file1.txt the implementation of getinputfilebasedoutputfilename in multipleoutputformat doesn’t do anything interesting by default, but if you set the value of the mapred.outputformat.numoftrailinglegs configurable to an integer greater than 0, then the getinputfilebasedoutputfilename will use part of the input path as the output path. let’s see what happens when we set the value to 1: jobconf.setint("mapred.outputformat.numoftrailinglegs", 1); the output files in hdfs now exactly mirror the input files used for the job: $ hadoop fs -lsr /output /output/file1.txt /output/file2.txt if we set mapred.outputformat.numoftrailinglegs to 2, and our input files exist in the /inputs directory, then our output directory looks like this: $ hadoop fs -lsr /output /output/input/file1.txt /output/input/file2.txt basically as you keep incrementing mapred.outputformat.numoftrailinglegs , then multipleoutputformat will continue to go up the parent directories of the input file and use them in the output path. modifying the output key and value it’s very possible that the actual key and value you want to emit are different from those that were used to determine the output file. in our example, we took the output key and wrote to a directory using the key name. if you do that keeping the key in the output file may be redundant. how would we modify the output record so that the key isn’t written? multipleoutputformat has your back with the generateactualkey method. class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(text key, text value, string name) { return key.tostring() + "/" + name; } @override protected text generateactualkey(text key, text value) { return null; } } the returned value from this method replaces the key that’s supplied to the underlying recordwriter , so if you return null as in the above example, no key will be written to the file. $ hadoop fs -lsr /output/cupertino/* apple pear $ hadoop fs -lsr /output/sunnyvale/* banana you can achieve the same result for the output value by overriding the generateactualvalue method. changing the recordwriter in our final step we’ll look at how you can leverage multiple recordwriter classes for different output files. this is accomplished by overriding the getrecordwriter method. in the example below we’re leveraging the same textoutputformat for all the files, but it gives you a sense of what can be accomplished. static class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(text key, text value, string name) { return key.tostring() + "/" + name; } @override public recordwriter getrecordwriter(filesystem fs, jobconf job, string name, progressable prog) throws ioexception { if (name.startswith("apple")) { return new textoutputformat().getrecordwriter(fs, job, name, prog); } else if (name.startswith("banana")) { return new textoutputformat().getrecordwriter(fs, job, name, prog); } return super.getrecordwriter(fs, job, name, prog); } } conclusion when using multipleoutputformat , give some thought to the number of distinct files that each reducer will create. it would be prudent to plan your bucketing so that you have a relatively small number of files. in this post we extended multipletextoutputformat , which is a simple extension of multipleoutputformat that supports text outputs. multiplesequencefileoutputformat also exists to support sequencefiles in a similar fashion. so what are the shortcomings with the multipleoutputformat class? if you have a job that uses both map and reduce phases, then multipleoutputformat can’t be used in the map-side to write outputs. of course, multipleoutputformat works fine in map-only jobs. all recordwriter classes must support exactly the same output record types. for example, you wouldn’t be able to support a recordwriter that emitted for one output file, and have another recordwriter that emitted . multipleoutputformat exists in the mapred package, so it won’t work with a job that requires use of the mapreduce package. all is not lost if you bump into either one of these issues, as you’ll discover in the next blog post.
May 20, 2013
by Alex Holmes
· 6,284 Views
article thumbnail
Lazy sequences implementation for Java 8
I just published the LazySeq library on GitHub - the result of my Java 8 experiments recently. I hope you will enjoy it. Even if you don't find it very useful, it's still a great lesson of functional programming in Java 8 (and in general). Also it's probably the first community library targeting Java 8! Introduction A Lazy sequence is a data structure that is computed only when its elements are actually needed. All operations on lazy sequences, like map() and filter() are lazy as well, postponing invocation up to the moment when it is really necessary. Lazy sequences are always traversed from the beginning using very cheap first/rest decomposition (head() and tail()). An important property of lazy sequences is that they can represent infinite streams of data, e.g. all natural numbers or temperature measurements over time. Lazy sequence remembers already computed values so if you access the Nth element, all elements from 1 to N-1 are computed as well and cached. Despite that LazySeq (being at the core of many functional languages and algorithms) is immutable and thread-safe. Rationale This library is heavily inspired by scala.collection.immutable.Stream and aims to provide immutable, thread-safe and easy to use lazy sequence implementation, possibly infinite. See Lazy sequences in Scala and Clojure for some use cases. Stream class name is already used in Java 8, therefore LazySeq was chosen, similar to lazy-seq in Clojure. Speaking of Stream, at first it looks like a lazy sequence implementation available out-of-the-box. However, quoting Javadoc: Streams are not data structures and: Once an operation has been performed on a stream, it is considered consumed and no longer usable for other operations. In other words java.util.stream.Stream is just a thin wrapper around existing collection, suitable for one time use. More akin to Iterator than to Stream in Scala. This library attempts to fill this niche. Of course implementing lazy sequence data structure was possible prior to Java 8, but lack of lambdas makes working with such data structure tedious and too verbose. Getting started Building and working with lazy sequences in 10 minutes. Infinite sequence of all natural numbers In order to create a lazy sequence you use LazySeq.cons() factory method that accepts first element (head) and a function that might be later used to compute rest (tail). For example in order to produce lazy sequence of natural numbers with given start element you simply say: private LazySeq naturals(int from) { return LazySeq.cons(from, () -> naturals(from + 1)); } There is really no recursion here. If there was, calling naturals() would quickly result in StackOverflowError as it calls itself without stop condition. However () -> naturals(from + 1) expression defines a function returning LazySeq (Supplier to be precise) that this data structure will invoke, but only if needed. Look at the code below, how many times do you think naturals() function was called (except the first line)? final LazySeq ints = naturals(2); final LazySeq strings = ints. map(n -> n + 10). filter(n -> n % 2 == 0). take(10). flatMap(n -> Arrays.asList(0x10000 + n, n)). distinct(). map(Integer::toHexString); First invocation of naturals(2) returns lazy sequence starting from 2 but rest (3, 4, 5, ...) is not computed yet. Later we map() over this sequence, filter() it, take() first 10 elements, remove duplicates, etc. All these operations do not evaluate the sequence and are as lazy as possible. For example take(10) doesn't evaluate first 10 elements eagerly to return them. Instead new lazy sequence is returned which remembers that it should truncate original sequence at 10th element. Same applies to distinct(). It doesn't evaluate the whole sequence to extract all unique values (otherwise code above would explode quickly, traversing infinite amount of natural numbers). Instead it returns a new sequence with only the first element. If you ever ask for the second unique element, it will lazily evaluate tail, but only as much as possible. Check out toString() output: System.out.println(strings); //[1000c, ?] Question mark (?) says: "there might be something more in that collection, but I don't know it yet". Do you understand where did 1000c came from? Look carefully: Start from an infinite stream of natural numbers starting from 2 Add 10 to each element (so the first element becomes 12 or C in hex) filter() out odd numbers (12 is even so it stays) take() first 10 elements from sequence so far Each element is replaced by two elements: that element plus 0x1000 and the element itself (flatMap()). This does not yield a sequence of pairs, but a sequence of integers that is twice as long We ensure only distinct() elements will be returned In the end we turn integers to hex strings. As you can see none of these operations really require evaluating the whole stream. Only head is being transformed and this is what we see in the end. So when this data structure is actually evaluated? When it absolutely must, e.g. during side-effect traversal: strings.force(); //or strings.forEach(System.out::println); //or final List list = strings.toList(); //or for (String s : strings) { System.out.println(s); } All the statements above alone will force evaluation of whole lazy sequence. Not very smart if our sequence was infinite, but strings was limited to first 10 elements so it will not run infinitely. If you want to force only part of the sequence, simply call strings.take(5).force(). BTW have you noticed that we can iterate over LazySeq strings using standard Java 5 for-each syntax? That's because LazySeq implements List interface, thus plays nicely with Java Collections Framework ecosystem: import java.util.AbstractList; public abstract class LazySeq extends AbstractList Please keep in mind that once lazy sequence is evaluated (computed) it will cache (memoize) them for later use. This makes lazy sequences great for representing infinite or very long streams of data that are expensive to compute. iterate() Building an infinite lazy sequence very often boils down to providing an initial element and a function that produces next item based on the previous one. In other words second element is a function of the first one, third element is a function of the second one, and so on. Convenience LazySeq.iterate() function is provided for such circumstances. ints definition can now look like this: final LazySeq ints = LazySeq.iterate(2, n -> n + 1); We start from 2 and each subsequent element is represented as previous element + 1. More examples: Fibonacci sequence and Collatz conjecture No article about lazy data structure can be left without Fibonacci numbers example: private static LazySeq lastTwoFib(int first, int second) { return LazySeq.cons( first, () -> lastTwoFib(second, first + second) ); } Fibonacci sequence is infinite as well but we are free to transform it in multiple ways: System.out.println( fib. drop(5). take(10). toList() ); //[5, 8, 13, 21, 34, 55, 89, 144, 233, 377] final int firstAbove1000 = fib. filter(n -> (n > 1000)). head(); fib.get(45); See how easy and natural it is to work with infinite stream of numbers? drop(5).take(10) skips first 5 elements and displays next 10. At this point first 15 numbers are already computed and will never by computed again. Finding first Fibonacci number above 1000 (happens to be 1597) is very straightforward. head() is always precomputed by filter() , so no further evaluation is needed. Last but not least we can simply just ask for 45th Fibonacci number (0-based) and get 1134903170. If you ever try to access any Fibonacci number up to this one, they are precomputed and fast to retrieve. Finite sequences (Collatz conjecture) Collatz conjecture is also quite interesting problem. For each positive integer n we compute next integer using following algorithm: n/2 if n is even 3n + 1 if n is odd For example starting from 10 series looks as follows: 10, 5, 16, 8, 4, 2, 1. The series ends when it reaches 1. Mathematicians believe that starting from any integer we will eventually reach 1 but it's not yet proven. Let us create a lazy sequence that generates Collatz series for given n, but only as many as needed. As stated above, this time our sequence will be finite: private LazySeq collatz(long from) { if (from > 1) { final long next = from % 2 == 0 ? from / 2 : from * 3 + 1; return LazySeq.cons(from, () -> collatz(next)); } else { return LazySeq.of(1L); } } This implementation is driven directly by the definition. For each number greater than 1 return that number + lazily evaluated (() -> collatz(next)) rest of the stream. As you can see if 1 is given, we return single element lazy sequence using special of() factory method. Let's test it with aforementioned 10: final LazySeq collatz = collatz(10); collatz.filter(n -> (n > 10)).head(); collatz.size(); filter() allows us to find first number in the sequence that is greater than 10. Remember that lazy sequence will have to traverse the contents (evaluate itself), but only to the point where it finds first matching element. Then it stops, ensuring it computes as little as possible. However size(), in order to calculate total number of elements, must traverse the whole sequence. Of course this can only work with finite lazy sequences, calling size() on an infinite sequence will end up poorly. If you play a bit with this sequence you will quickly realize that sequences for different numbers share the same suffix (always end with the same sequence of numbers). This begs for some caching/structural sharing. See CollatzConjectureTest for details. But can it be used to something, you know... useful? Real life? Infinite sequences of numbers are great, but not very practical in real life. Maybe some more down to earth examples? Imagine you have a collection and you need to pick few items from that collection randomly. Instead of collection I will use a function returning random latin characters: private char randomChar() { return (char) ('A' + (int) (Math.random() * ('Z' - 'A' + 1))); } But there is a twist. You need N (N < 26, number of latin characters) unique values. Simply calling randomChar() few times doesn't guarantee uniqueness. There are few approaches to this problem, with LazySeq it's pretty straightforward: LazySeq charStream = LazySeq.continually(this::randomChar); LazySeq uniqueCharStream = charStream.distinct(); continually() simply invokes given function for each element when needed. Thus charStream will be an infinite stream of random characters. Of course they can't be unique. However uniqueCharStream guarantees that its output is unique. It does so by examining next element of underlying charStream and rejecting items that already appeared. We can now say uniqueCharStream.take(4) and be sure that no duplicates will appear. Once again notice that continually(this::randomChar).distinct().take(4) really calls randomChar() only once! As long as you don't consume this sequence, it remains lazy and postpones evaluation as long as possible. Another example involves loading batches (pages) of data from database. Using ResultSet or Iterator is cumbersome but loading whole data set into memory often not feasible. An alternative involves loading first batch of data eagerly and then providing a function to load next batches. Data is loaded only when it's really needed and we don't suffer performance or scalability issues. First let's define abstract API for loading batches of data from database: public List loadPage(int offset, int max) { //load records from offset to offset + max } I abstract from the technology entirely, but you get the point. Imagine that we now define LazySeq that starts from row 0 and loads next pages only when needed: public static final int PAGE_SIZE = 5; private LazySeq records(int from) { return LazySeq.concat( loadPage(from, PAGE_SIZE), () -> records(from + PAGE_SIZE) ); } When creating new LazySeq instance by calling records(0) first page of 5 elements is loaded. This means that first 5 sequence elements are already computed. If you ever try to access 6th or above, sequence will automatically load all missing record and cache them. In other words you never compute the same element twice. More useful tools when working with sequences are grouped() and sliding() methods. First partitions input sequence into groups of equal size. Take this as an example, also proving that these methods are as always lazy: final LazySeq chars = LazySeq.of('A', 'B', 'C', 'D', 'E', 'F', 'G'); chars.grouped(3); //[[A, B, C], ?] chars.grouped(3).force(); //force evaluation //[[A, B, C], [D, E, F], [G]] and similarly for sliding(): chars.sliding(3); //[[A, B, C], ?] chars.sliding(3).force(); //force evaluation //[[A, B, C], [B, C, D], [C, D, E], [D, E, F], [E, F, G]] These two methods are extremely useful. You can look at your data through sliding window (e.g. to compute moving average) or partition it to equal-length buckets. Last interesting utility method you may find useful is scan() that iterates (lazily, of course) the input stream and constructs every element of output by applying a function on previous and current element of input. Code snippet is worth a thousand words: LazySeq list = LazySeq. numbers(1). scan(0, (a, x) -> a + x); list.take(10).force(); //[0, 1, 3, 6, 10, 15, 21, 28, 36, 45] LazySeq.numbers(1) is a sequence of natural numbers (1, 2, 3...). scan() creates a new sequence that starts from 0 and for each element of input (natural numbers) adds it to last element of itself. So we get: [0, 0+1, 0+1+2, 0+1+2+3, 0+1+2+3+4, 0+1+2+3+4+5...]. If you want a sequence of growing strings, just replace few types: LazySeq.continually("*"). scan("", (s, c) -> s + c). map(s -> "|" + s + "\\"). take(10). forEach(System.out::println); And enjoy this beautiful triangle: |\ |*\ |**\ |***\ |****\ |*****\ |******\ |*******\ |********\ |*********\ Alternatively (same output): lazySeq. stream(). map(n -> n + 1). flatMap(n -> asList(0, n - 1).stream()). filter(n -> n != 0). substream(4, 18). limit(10). sorted(). distinct(). collect(Collectors.toList()); Java collections framework interoperability LazySeq implements java.util.List interface, thus can be used in variety of places. Moreover it also implements Java 8 enhancements to collections, namely streams and collectors: lazySeq. stream(). map(n -> n + 1). flatMap(n -> asList(0, n - 1).stream()). filter(n -> n != 0). substream(4, 18). limit(10). sorted(). distinct(). collect(Collectors.toList()); However streams in Java 8 were created to work around feature that is a foundation of LazySeq - lazy evaluation. Example above postpones all intermediate steps until collect() is called. With LazySeq you can safely skip .stream() and work directly on sequence: lazySeq. map(n -> n + 1). flatMap(n -> asList(0, n - 1)). filter(n -> n != 0). slice(4, 18). limit(10). sorted(). distinct(); Moreover LazySeq provides special purpose collector (see: LazySeq.toLazySeq()) that avoids evaluation even when used with collect() - which normally forces full collection computation. Implementation details Each lazy sequence is built around the idea of eagerly computed head and lazily evaluated tail represented as function. This is very similar to classic single-linked list recursive definition: class List { private final T head; private final List tail; //... } However in case of lazy sequence tail is given as a function, not a value. Invocation of that function is postponed as long as possible: class Cons extends LazySeq { private final E head; private LazySeq tailOrNull; private final Supplier> tailFun; @Override public LazySeq tail() { if (tailOrNull == null) { tailOrNull = tailFun.get(); } return tailOrNull; } For full implementation see Cons.java and FixedCons.java used when tail is known at creation time (for example LazySeq.of(1, 2) as opposed to LazySeq.cons(1, () -> someTailFun()). Pitfalls and common dangers Below common issues and misunderstandings are described. Evaluating too much One of the biggest dangers of working with infinite sequences is trying to evaluate them completely, which obviously leads to infinite computation. The idea behind infinite sequence is not to evaluate it in its entirety but to take as much as we need without introducing artificial limits and accidental complexity (see database loading example). However evaluating whole sequence is way too simple to miss. For example calling LazySeq.size()must evaluate whole sequence and will run infinitely, eventually filling up stack or heap (implementation detail). There are other methods that require full traversal in order to function properly. E.g. allMatch() making sure all elements match given predicate. Some methods are even more dangerous, because whether they will finish or not depends on data in the sequence. For example anyMatch() may return immediately if head matches predicate - or never. Sometimes we can easily avoid costly operations by using more deterministic methods. For example: seq.size() <= 10 //BAD may not work or be extremely slow if seq is infinite. However we can achieve the same with (more) predictable: seq.drop(10).isEmpty() Remember that lazy sequences are immutable (so we don't really mutate seq), drop(n) is typically O(n) while isEmpty() is O(1). When in doubt, consult source code or JavaDoc to make sure your operation won't too eagerly evaluate your sequence. Also be very cautious when using LazySeq where java.util.Collection or java.util.List is expected. Holding unnecessary reference to head Lazy sequences be definition remember already computed elements. You have to be aware of that, otherwise your sequence (especially infinite) will quickly fill up available memory. However, because LazySeq is just a fancy linked list, if you no longer keep a reference to head (but only to some element in the middle), it becomes eligible for garbage collection. For example: //LazySeq first = seq.take(10); seq = seq.drop(10); First ten elements are dropped and we assume nothing holds a reference to what previously was hept in seq. This makes first ten elements eligible for garbage collection. However if we uncomment first line and keep reference to old head in first, JVM will not release any memory. Let's put that into perspective. The following piece of code will eventually throw OutOfMemoryError because infinite reference keeps holding the beginning of the sequence, therefore all the elements created so far: LazySeq infinite = LazySeq.continually(Big::new); for (Big arr : infinite) { // } However by inlining call to continually() or extracting it to a method this code works flawlessly (well, still runs forever, but uses almost no memory): private LazySeq getContinually() { return LazySeq.continually(Big::new); } for (Big arr : getContinually()) { // } What's the difference? For-each loop uses iterators underneath. LazySeqIterator underneath doesn't hold a reference to old head() when it advances, so if nothing else references that head, it will be eligible for garbage collection, see true javac output when for-each is used: for (Iterator cur = getContinually().iterator(); cur.hasNext(); ) { final Big arr = cur.next(); //... } TL;DR Your sequence grows while being traversed. If you keep holding one end while the other grows, it will eventually blow up. Just like your first level cache in Hibernate if you load too much in one transaction. Use only as much as needed. Converting to plain Java collections Converting is simple, but dangerous. This is a consequence of points above. You can convert lazy sequence to java.util.List by calling toList(): LazySeq even = LazySeq.numbers(0, 2); even.take(5).toList(); //[0, 2, 4, 6, 8] or using Collector from Java 8 having richer API: even. stream(). limit(5). collect(Collectors.toSet()) //[4, 6, 0, 2, 8] But remember that Java collections are finite from definition so avoid converting lazy sequences to collections explicitly. Note that LazySeq is already List, thus Iterable and Collection. It also has efficient LazySeq.iterator(). If you can, simply pass LazySeq instance directly and may just work. Performance, time and space complexity head() of every sequence (except empty) is always computed eagerly, thus accessing it is fast O(1). Computing tail() may take everything from O(1) (if it was already computed) to infinite time. As an example take this valid stream: import static com.blogspot.nurkiewicz.lazyseq.LazySeq.cons; import static com.blogspot.nurkiewicz.lazyseq.LazySeq.continually; LazySeq oneAndZeros = cons( 1, () -> continually(0) ). filter(x -> (x > 0)); It represents 1 followed by infinite number of 0s. By filtering all positive numbers (x > 0) we get a sequence with same head, but filtering of tail is delayed (lazy). However if we now carelessly call oneAndZeros.tail(), LazySeq will keep computing more and more of this infinite sequence, but since there is no positive element after initial 1, this operation will run forever, eventually throwing StackOverflowError or OutOfMemoryError (this is an implementation detail). However if you ever reach this state, it's probably a programming bug or misusing of the library. Typically tail() will be close to O(1). On the other hand if you have plenty of operations already "stacked", calling tail() will trigger them rapidly one after another, so tail() run time is heavily dependant on your data structure. Most operations on LazySeq are O(1) since they are lazy. Some operations, like get(n) or drop(n) are O(n) (n represents parameter, not sequence length). In general run time will be similar to normal linked list. Because LazySeq remembers all already computed values in a single linked list, memory consumption is always O(n), where nn is the number of already computed elements. Troubleshooting Error invalid target release: 1.8 during maven build If you see this error message during maven build: [INFO] BUILD FAILURE ... [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project lazyseq: Fatal error compiling: invalid target release: 1.8 -> [Help 1] it means you are not compiling using Java 8. Download JDK 8 with lambda support and let maven use it: $ export JAVA_HOME=/path/to/jdk8 I get StackOverflowError or program hangs infinitely When working with LazySeq you sometimes get StackOverflowError or OutOfMemoryError: java.lang.StackOverflowError at sun.misc.Unsafe.allocateInstance(Native Method) at java.lang.invoke.DirectMethodHandle.allocateInstance(DirectMethodHandle.java:426) at com.blogspot.nurkiewicz.lazyseq.LazySeq.iterate(LazySeq.java:118) at com.blogspot.nurkiewicz.lazyseq.LazySeq.lambda$0(LazySeq.java:118) at com.blogspot.nurkiewicz.lazyseq.LazySeq$$Lambda$2.get(Unknown Source) at com.blogspot.nurkiewicz.lazyseq.Cons.tail(Cons.java:32) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) When working with possibly infinite data structures, care must be taken. Avoid calling operations that must (size(), allMatch(), minBy(), forEach(), reduce(), ...) or can (filter(), distinct(), ...) traverse the whole sequence in order to give correct results. See Pitfalls for more examples and ways to avoid. Maturity Quality This project was started as an exercise and is not battle-proven. But a healthy 300+ unit-test suite (3:1 test code/production code ratio) guards quality and functional correctness. I also make sure LazySeq is as lazy as possible by mocking tail functions and verifying they are called as rarely as one can get. Contributions and bug reports In the event of finding a bug or missing feature, don't hesitate to open a new ticket or start pull request. I would also love to see more interesting usages of LazySeq in wild. Possible improvements Just like FixedCons is used when tail is known up-front, consider IterableCons that wraps existing Iterable in one node rather than building FixedCons hierarchy. This can be used for all concat methods. Parallel processing support (implementing spliterator?) License This project is released under version 2.0 of the Apache License.
May 15, 2013
by Tomasz Nurkiewicz
· 28,953 Views · 1 Like
article thumbnail
Hebrew Search with ElasticSearch
Hebrew search is not an easy task, and HebMorph is a project I started several years ago to address that problem. After a certain period of inactivity I'm back actively working on it. I'm also happy to say there are already several live systems using it to enable Hebrew searches in their applications. This post is a short step-by-step guide on how to use HebMorph in an ElasticSearch installation. There are quite a few configuration options and things to consider when enabling Hebrew search, most are in the realm of performance vs relevance trade-offs, but I'll talk about those in a separate post. 0. What exactly is HebMorph HebMorph is a project a bit wider than just providing a Hebrew search plugin for ElasticSearch, but for the purpose of this post let us treat it in that narrow aspect. HebMorph has 3 main parts - the hspell dictionary files, the hebmorph-core package which is a wrapper around the dictionary files with important bits that allow for locating words even if they weren't written exactly as they appear in the dictionary, and the hebmorph-lucene package which contains various tools for processing streams of text into Lucene tokens - the searchable parts. To enable Hebrew search from ElasticSearch we are going to need to use the Hebrew analyzer class HebMorph provides to analyze incoming Hebrew texts. That is done by providing ElasticSearch with the HebMorph packages and then telling it to use the Hebrew analyzer on text fields as needed. 1. Get HebMorph and hspell At the moment you will have to compile HebMorph from sources yourself using Maven. In the future we might upload it to a centralized repository, but since we still actively working on a lot of stuff there it is still a bit too early for that. Probably the easiest way to get HebMorph is to do git clone from the main repository. The repository is located at https://github.com/synhershko/HebMorph and includes the latest hspell files already under /hspell-data-files. If you are new to git GitHub offers great tutorials for getting started with it, and they also enable you to download the entire source tree as a zip or a tarball. Once you have the sources, run mvn package or mvn install to create 2 jars - hebmorph-core and hebmorph-lucene. Those 2 packages are required before moving on to the next step. 2. Create an ElasticSearch plugin In this step we will create a new plugin which we will use in the next step to create the Hebrew analyzers in. If you already have a plugin you wish to use, skip to the next step. ElasticSearch plugins are compiled Java packages you simply drop to the plugins folder of your ElasticSearch installation and it gets detected automatically by the ElasticSearch instance once it is initialized. If you are new to this, you might want to read up a bit on that in the official ElasticSearch documentation. Here is a great guide to start with: http://jfarrell.github.io/ The gist of this is having a Java project with a es-plugin.properties file embedded as a resource and pointing to class that tells ElasticSearch what classes to load as plugins, and their plugin type. In the next section we will use this to add our own Analyzer implementation which makes use of HebMorph's capabilities. 3. Creating an Hebrew Analyzer HebMorph already comes with MorphAnalyzer - an Analyzer implementation which takes care of Hebrew-aware tokenization, lemmatization and whatnot. Because it is highly configurable, personally I prefer re-implementing it in the ElasticSearch plugin so it is easier to change the configurations in code. In case you wondered, I'm not planning in supporting external configurations for this as it is too subtle and you should really know what you are doing there. Don't forget to add dependencies to hebmorph-core and hebmorph-lucene to your project. My common Analyzer setup for Hebrew search looks like this: public abstract class HebrewAnalyzer extends ReusableAnalyzerBase { protected enum AnalyzerType { INDEXING, QUERY, EXACT } private static final DictRadix prefixesTree = LingInfo.buildPrefixTree(false); private static DictRadix dictRadix; private final StreamLemmatizer lemmatizer; private final LemmaFilterBase lemmaFilter; protected final Version matchVersion; protected final AnalyzerType analyzerType; protected final char originalTermSuffix = '$'; static { try { dictRadix = Loader.loadDictionaryFromHSpellData(new File(resourcesPath + "hspell-data-files"), true); } catch (IOException e) { // TODO log } } protected HebrewAnalyzer(final AnalyzerType analyzerType) throws IOException { this.matchVersion = matchVersion; this.analyzerType = analyzerType; lemmatizer = new StreamLemmatizer(null, dictRadix, prefixesTree, null); lemmaFilter = new BasicLemmaFilter(); } @Override protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) { // on query - if marked as keyword don't keep origin, else only lemmatized (don't suffix) // if word termintates with $ will output word$, else will output all lemmas or word$ if OOV if (analyzerType == AnalyzerType.QUERY) { final StreamLemmasFilter src = new StreamLemmasFilter(reader, lemmatizer, null, lemmaFilter); src.setAlwaysSaveMarkedOriginal(true); src.setSuffixForExactMatch(originalTermSuffix); TokenStream tok = new SuffixKeywordFilter(src, '$'); return new TokenStreamComponents(src, tok); } if (analyzerType == AnalyzerType.EXACT) { // on exact - we don't care about suffixes at all, we always output original word with suffix only final HebrewTokenizer src = new HebrewTokenizer(reader, prefixesTree, null); TokenStream tok = new NiqqudFilter(src); tok = new LowerCaseFilter(matchVersion, tok); tok = new AlwaysAddSuffixFilter(tok, '$', false); return new TokenStreamComponents(src, tok); } // on indexing we should always keep both the stem and marked original word // will ignore $ && will always output all lemmas + origin word$ // basically, if analyzerType == AnalyzerType.INDEXING) final StreamLemmasFilter src = new StreamLemmasFilter(reader, lemmatizer, null, lemmaFilter); src.setAlwaysSaveMarkedOriginal(true); TokenStream tok = new SuffixKeywordFilter(src, '$'); return new TokenStreamComponents(src, tok); } public static class HebrewIndexingAnalyzer extends HebrewAnalyzer { public HebrewIndexingAnalyzer() throws IOException { super(AnalyzerType.INDEXING); } } public static class HebrewQueryAnalyzer extends HebrewAnalyzer { public HebrewQueryAnalyzer() throws IOException { super(AnalyzerType.QUERY); } } public static class HebrewExactAnalyzer extends HebrewAnalyzer { public HebrewExactAnalyzer() throws IOException { super(AnalyzerType.EXACT); } } } You may notice how I created 3 separate analyzers - one for indexing, one for querying and the last for exact querying. I'll be talking more about this in future posts, but the idea is to be able to provide flexibility on querying while still allow for correct indexing. Configuring the analyzers to be picked up from ElasticSearch is rather easy now. First, you need to wrap each analyzer in a "provider", like so: public class HebrewQueryAnalyzerProvider extends AbstractIndexAnalyzerProvider { private final HebrewAnalyzer.HebrewQueryAnalyzer hebrewAnalyzer; @Inject public HebrewQueryAnalyzerProvider(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) throws IOException { super(index, indexSettings, name, settings); hebrewAnalyzer = new HebrewAnalyzer.HebrewQueryAnalyzer(); } @Override public HebrewAnalyzer.HebrewQueryAnalyzer get() { return hebrewAnalyzer; } } After you've created such providers for all types of analyzers, create an AnalysisBinderProcessor like this (or update your existing one with definitions for the Hebrew analyzers): public class MyAnalysisBinderProcessor extends AnalysisModule.AnalysisBinderProcessor { private final static HashMap> languageAnalyzers = new HashMap<>(); static { languageAnalyzers.put("hebrew", HebrewIndexingAnalyzerProvider.class); languageAnalyzers.put("hebrew_query", HebrewQueryAnalyzerProvider.class); languageAnalyzers.put("hebrew_exact", HebrewExactAnalyzerProvider.class); } public static boolean analyzerExists(final String analyzerName) { return languageAnalyzers.containsKey(analyzerName); } @Override public void processAnalyzers(final AnalyzersBindings analyzersBindings) { for (Map.Entry> entry : languageAnalyzers.entrySet()) { analyzersBindings.processAnalyzer(entry.getKey(), entry.getValue()); } } } Don't forget to update your Plugin class to catch the AnalysisBinderProcessor - it should look something like this (plus any other stuff you want to add there): public class MyPlugin extends AbstractPlugin { @Override public String name() { return "my-plugin"; } @Override public String description() { return "Implements custom actions required by me"; } @Override public void processModule(Module module) { if (module instanceof AnalysisModule) { ((AnalysisModule)module).addProcessor(new MyAnalysisBinderProcessor()); } } } 4. Using the Hebrew analyzers Compile the ElasticSearch plugin and drop it along with its dependencies in a folder under the /plugins folder of ElasticSearch. You now have 3 new types of analyzers at your disposal: "hebrew", "hebrew_query" and "hebrew_exact". For indexing, you want to use the "hebrew" analyzer. In your mapping, you can define a certain field or an entire set of fields to use that specific analyzer by setting the analyzer for that field. You can also leave the analyzer configuration blank, and specify the analyzer to use for those fields with unspecified analyzer using the _analyzer field in the index request. See more about both here and here. The "hebrew" analyzer will expand each term to all recognized lemmas; in case the word wasn't recognized it will try to tolerate spelling errors or missing Yud/Vav - most of the time it will be successful (with some rate of false positives, which the lemma-filters should remove to some degree). Some words will still remain unrecognized and thus will be indexed as-is. When querying using a QueryString query you can specify what analyzer to use - use the "hebrew_query" or "hebrew_exact" analyzer. The former will perform lemma expansion similar to the indexing analyzer, and the latter will avoid that and allow you to perform exact matches (useful when searching for names or exact phrases). I pretty much ignored a lot of the complexity involved in fine tuning searches for Hebrew, and many very cool things HebMorph allows you to do with Hebrew search for the sake of focus. I will revisit them in a later blog post. 5. Administration The hspell dictionary files are looked up by a physical location on disk - you will need to provide a path they are saved at. Since dictionaries update, it is sometimes easier to update them that way in a distributed environment like the one I'm working with. It may be desirable to have them compiled within the same jar file as the code itself - I'll be happy to accept a pull request to do that. The code above is working with ElasticSearch 0.90 GA and Lucene 4.2.1. I also had it running on earlier versions of both technologies, but may had to make a few minor changes. I assume the samples would break on future versions and I'll probably don't have much time going back and keeping it up to date, but bear in mind most of the time the changes are minor and easy to understand and make by yourself. Both HebMorph and the hspell dictionary are released under the AGPL3. For any questions on licensing, feel free to contact me.
May 6, 2013
by Itamar Syn-hershko
· 7,049 Views
article thumbnail
Let's Talk ASM - String Concatenation
not a lot of developers today know assembly, which - regardless of your professional line of work - is a good skill to have. assembly teaches you think on a much lower level, going beyond the abstracted out layer provided by many of the high-level languages. today we're going to look at a way to implement a string concatenation function. specifically, i want to follow the following procedure for building the final result: ask the user for input append a crlf (carriage return + line feed) to the entered string append the entered string to the existing composite string follow back from step 1 until the user enters a terminator character display the composite string let's assume that you have zero knowledge of assembly. if that is the case, i would recommend starting here . in this example, i am using visual studio 2012 to test the code, but you might as well use an older version of the ide if you want. for convenience purposes, i would recommend downloading the basic framework code that comes for free from the writer of the introduction to 80x86 assembly language and computer architecture book: visual studio 2012 visual studio 2010 visual studio 2008 first, you have the standard declarations: .586 .model flat include io.h ; header file for input/output cr equ 0dh ; carriage return character lf equ 0ah ; line feed .stack 4096 .data prompt byte cr, lf, "original string? ",0 restitle byte "final result",0 stringin byte 1024 dup (?) stringout byte 1024 dup (?) linefeed byte cr, lf notice the reference to io.h - at this point you want a way to receive user input and display output data through standard winapi channels, and io.h does just that. some asm experts might argue that it is not a good idea to use winapi hooks in the context of a "pure" assembly program, for educational purposes, but in this situation the focus is on the inner workings of a different function. note: the program is adapted to the scenario where the execution of the string concatenation function is the sole purpose. as you will get a hang of the execution flow, you can easily adapt it to a scenario where some of the registers can be re-used. let's start by clearing the ecx and edx registers: .code _mainproc proc ; clear the ecx and edx registers because these will ; be used for length counters and sequential increments. xor ecx, ecx xor edx, edx once the strings will be entered by the user, i will need to find out the length of the string to append, in order to have a correct sequential memory address. now i need to get user input: input_data: ; prompt the user to enter the string he ultimately ; wants appended to the main string buffer. input prompt, stringin, 40 ; read ascii characters ; make sure that the string doesn't start with the $ character ; which would automatically mean that we need to terminate the ; reading process cmp stringin, '$' je done lea eax, [stringout + edx] ; destination address push eax ; push the destination on the stack lea eax, [stringin] ; source address push eax ; push the source on the stack call strcopy ; call the string copy procedure once the string is entered, i can check whether the terminator character - "$", was used. one of the great things about the cmp instruction is the fact that it checks the starting address of the entered string, therefore i can simply compare the entered data with a single character. in case the character is encountered, the program flow terminates at done, where the output is displayed: done: ; output the new data. output restitle, stringout mov eax, 0 ret strcopy is an internal procedure that will simply copy a string from one memory address to another: strcopy proc near32 push ebp mov ebp, esp push edi push esi pushf mov esi, [ebp+8] mov edi, [ebp+12] cld whilenonull: cmp byte ptr [esi], 0 je endwhilenonull movsb jmp whilenonull endwhilenonull: mov byte ptr [edi], 0 popf pop esi pop edi pop ebp ret 8 strcopy endp to make sure that the next string is properly appended, i need to find out the length of the previous one, for a correct memory address offset: ; let's get the length of the current string - move it ; to the proper register so that we can perform the measurement mov edi, eax ; find the length of the string that was just entered sub ecx, ecx sub al, al not ecx cld repne scasb not ecx dec ecx add edx, ecx repne scasb is used for an in-string iterative null terminator search (you can read more about it here ). it will decrement ecx for each character. ; we need to append the linefeed (crlf) to the string so we apply ; the same string concatenation procedure for that sequence. lea eax, [stringout + edx] ; destination address push eax ; first parameter lea eax, [linefeed] ; source push eax ; second parameter call strcopy ; call string copy procedure mov edi, eax ; we know that the crlf characters are 2 entities, therefore ; increment the overall counter by 2. add edx, 2 ; ask for more input because no terminator character was used. jmp input_data once the basic input data is processed, i can append the crlf sequence and increment edx for the proper offset, after which the program flow is being reset from the point where the user has to enter the next character sequence.
May 3, 2013
by Denzel D.
· 13,099 Views
article thumbnail
Neo4j/Cypher: Returning a Row with Zero Count When No Relationship Exists
I’ve been trying to see if I can match some of the football stats that OptaJoe posts on twitter and one that I was looking at yesterday was around the number of red cards different teams have received. 1 – Sunderland have picked up their first PL red card of the season. The only team without one now are Man Utd. Angels. To refresh this is the sub graph that we’ll need to look at to work it out: I started off with the following query which traverses out from each match, finds the players who were sent off in the match and then groups the sendings off by the team they were playing for: START game = node:matches('match_id:*') MATCH game<-[:sent_off_in]-player-[:played]->likeThis-[:in]->game, likeThis-[:for]->team RETURN team.name, COUNT(game) AS redCards ORDER BY redCards LIMIT 5 When we run this we get the following results: +------------------------------+ | team.name | redCards | +------------------------------+ | "Sunderland" | 1 | | "West Ham United" | 1 | | "Norwich City" | 1 | | "Reading" | 1 | | "Liverpool" | 2 | +------------------------------+ 5 rows The problem we have here is that it hasn’t returned Manchester United because they haven’t yet received any red cards and therefore none of their players match the ‘sent_off_in’ relationship. I ran into something similar in a post I wrote about a month ago where I was working out which day of the week players scored on. The first step towards getting Manchester United to return with a count of 0 is to make the ‘sent_off_in’ relationship optional. However, that on its own that isn’t enough because it now returns a count of all the player performances for each team: START game = node:matches('match_id:*') MATCH game<-[?:sent_off_in]-player-[:played]->likeThis-[:in]->game, likeThis-[:for]->team RETURN team.name, COUNT(game) AS redCards ORDER BY redCards ASC LIMIT 5 +-----------------------------+ | team.name | redCards | +-----------------------------+ | "Chelsea" | 448 | | "Wigan Athletic" | 459 | | "Fulham" | 460 | | "Liverpool" | 466 | | "Everton" | 467 | +-----------------------------+ 5 rows Instead what we need to do is collect up all the ‘sent_off_in’ relationships and sum them up. We can use the COLLECT function to do that and the neat thing about COLLECT is that it doesn’t bother collecting the empty relationships so we end up with exactly what we need: START game = node:matches('match_id:*') MATCH game<-[r?:sent_off_in]-player-[:played]->likeThis-[:in]->game, likeThis-[:for]->team RETURN team.name, COLLECT(r) AS redCards LIMIT 5 +-----------------------------------------------------------------------------------------------------+ | team.name | redCards | +-----------------------------------------------------------------------------------------------------+ | "Wigan Athletic" | [:sent_off_in[26443] {},:sent_off_in[37785] {}] | | "Everton" | [:sent_off_in[6795] {minute:61},:sent_off_in[21735] {},:sent_off_in[34594] {}] | | "Newcastle United" | [:sent_off_in[434] {minute:75},:sent_off_in[32389] {},:sent_off_in[34915] {}] | | "Southampton" | [:sent_off_in[49393] {minute:70},:sent_off_in[49392] {minute:82}] | | "West Ham United" | [:sent_off_in[21734] {minute:67}] | +-----------------------------------------------------------------------------------------------------+ 5 rows We then just need to call the LENGTH function to work out how many red cards there are in each collection and then we’re done: START game = node:matches('match_id:*') MATCH game<-[r?:sent_off_in]-player-[:played]->likeThis-[:in]->game, likeThis-[:for]->team RETURN team.name, LENGTH(COLLECT(r)) AS redCards ORDER BY redCards LIMIT 5 +--------------------------------+ | team.name | redCards | +--------------------------------+ | "Manchester United" | 0 | | "West Ham United" | 1 | | "Sunderland" | 1 | | "Norwich City" | 1 | | "Reading" | 1 | +--------------------------------+ 5 rows
April 30, 2013
by Mark Needham
· 5,871 Views
article thumbnail
XStream – XStreamely Easy Way to Work with XML Data in Java
from time to time there is a moment when we have to deal with xml data. and most of the time it is not the happiest day in our life. there is even a term “xml hell” describing situation when programmer has to deal with many xml configuration files that are hard to comprehend. but, like it or not, sometimes we have no choice, mostly because specification from client says something like “use configuration written in xml file” or something similar. and in such cases, xstream comes with its very cool features that make dealing with xml really less painful. overview xstream is a small library to serialize data between java objects and xml. it’s lightweight, small, has nice api and what is most important, it works with and without custom annotations that we might be not allowed to add when we are not the owner of java classes. first example suppose we have a requirement to load configuration from xml file: /users/tomek/work/mystuff/input.csv /users/tomek/work/mystuff/truststore.ts /users/tomek/work/mystuff/cn-user.jks password password user secret and we want to load it into configuration object: public class configuration { private string inputfile; private string user; private string password; private string truststorefile; private string keystorefile; private string keystorepassword; private string truststorepassword; // getters, setters, etc. } so basically what we have to do is: filereader filereader = new filereader("config.xml"); // load our xml file xstream xstream = new xstream(); // init xstream // define root alias so xstream knows which element and which class are equivalent xstream.alias("config", configuration.class); configuration loadedconfig = (configuration) xstream.fromxml(filereader); and that’s all, easy peasy something more serious ok, but previous example is very basic so now let’s do something more complicated: real xml returned by real webservice. 2013-03-09 john example 24 asd123123 2012-03-10 anna baker 26 axn567890 2010-12-05 tom meadow sgh08945 48 what we have here is simple list of bans written in xml. we want to load it into collection of ban objects. so let’s prepare some classes (getters/setters/tostring omitted): public class data { private list bans = new arraylist(); } public class ban { private string dateofupdate; private person person; } public class person { private string firstname; private string lastname; private int age; private string documentnumber; } as you can see there is some naming and type mismatch between xml and java classes (e.g. field name1->firstname, dateofupdate is string not a date), but it’s here for some example purposes. so the goal here is to parse xml and get data object with populated collection of ban instances containing correct data. let’s see how it can be achieved. parse with annotations first, easier way is to use annotations. and that’s the suggested approach in situation when we can modify java classes to which xml will be mapped. so we have: @xstreamalias("data") // maps data element in xml to this class public class data { // here is something more complicated. if we have list of elements that are // not wrapped in a element representing a list (like we have in our xml: // multiple elements not wrapped inside collection, // we have to declare that we want to treat these elements as an implicit list // so they can be converted to list of objects. @xstreamimplicit(itemfieldname = "ban") private list bans = new arraylist(); } @xstreamalias("ban") // another mapping public class ban { /* we want to have different field names in java classes so we define what element should be mapped to each field */ @xstreamalias("updated_at") // private string dateofupdate; @xstreamalias("troublemaker") private person person; } @xstreamalias("troublemaker") public class person { @xstreamalias("name1") private string firstname; @xstreamalias("name2") private string lastname; @xstreamalias("age") // string will be auto converted to int value private int age; @xstreamalias("number") private string documentnumber; and actual parsing logic is very short: filereader reader = new filereader("file.xml"); // load file xstream xstream = new xstream(); xstream.processannotations(data.class); // inform xstream to parse annotations in data class xstream.processannotations(ban.class); // and in two other classes... xstream.processannotations(person.class); // we use for mappings data data = (data) xstream.fromxml(reader); // parse // print some data to console to see if results are correct system.out.println("number of bans = " + data.getbans().size()); ban firstban = data.getbans().get(0); system.out.println("first ban = " + firstban.tostring()); as you can see annotations are very easy to use and as a result final code is very concise. but what to do in situation when we can’t modify mapping classes? we can use different approach that doesn’t require any modifications in java classes representing xml data. parse without annotations when we can’t enrich our model classes with annotations, there is another solution. we can define all mapping details using methods from xstream object: filereader reader = new filereader("file.xml"); // three first lines are easy, xstream xstream = new xstream(); // same initialisation as in the xstream.alias("data", data.class); // basic example above xstream.alias("ban", ban.class); // two more aliases to map... xstream.alias("troublemaker", person.class); // between node names and classes // we want to have different field names in java classes so // we have to use aliasfield(, , ) xstream.aliasfield("updated_at", ban.class, "dateofupdate"); xstream.aliasfield("troublemaker", ban.class, "person"); xstream.aliasfield("name1", person.class, "firstname"); xstream.aliasfield("name2", person.class, "lastname"); xstream.aliasfield("age", person.class, "age"); // notice here that xml will be auto-converted to int "age" xstream.aliasfield("number", person.class, "documentnumber"); /* another way to define implicit collection */ xstream.addimplicitcollection(bans.class, "bans"); data data = (data) xstream.fromxml(reader); // do the actual parsing // let's print results to check if data was parsed system.out.println("number of bans = " + data.getbans().size()); ban firstban = data.getbans().get(0); system.out.println("first ban = " + firstban.tostring()); as you can see xstream allows to easily convert more complicated xml structures into java objects, it also gives a possibility to tune results by using different names if this from xml doesn’t suit our needs. but there is one thing should catch your attention: we are converting xml representing a date into raw string which isn’t quite what we would like to get as a result. that’s why we will add converter to do some job for us. using existing custom type converter xstream library comes with set of built converters for most common use cases. we will use dateconverter. so now our class for ban looks like that: public class ban { private date dateofupdate; private person person; } and to use dateconverter we simply have to register it with date format that we expect to appear in xml data: xstream.registerconverter(new dateconverter("yyyy-mm-dd", new string[] {})); and that’s it. now instead of string our object is populated with date instance. cool and easy! but what about classes and situations that aren’t covered by existing converters? we could write our own. writing custom converter from scratch assume that instead of dateofupdate we want to know how many days ago update was done: public class ban { private int daysago; private person person; } of course we could calculate it manually for each ban object but using converter that will do this job for us looks more interesting. our daysagoconverter must implement converter interface so we have to implement three methods with signatures looking a little bit scary: public class daysagoconverter implements converter { @override public void marshal(object source, hierarchicalstreamwriter writer, marshallingcontext context) { } @override public object unmarshal(hierarchicalstreamreader reader, unmarshallingcontext context) { } @override public boolean canconvert(class type) { return false; } } last one is easy as we will convert only integer class. but there are still two methods left with these hierarchicalstreamwriter, marshallingcontext, hierarchicalstreamreader and unmarshallingcontext parameters. luckily, we could avoid dealing with them by using abstractsinglevalueconverter that shields us from so low level mechanisms. and now our class looks much better: public class daysagoconverter extends abstractsinglevalueconverter { @override public boolean canconvert(class type) { return type.equals(integer.class); } @override public object fromstring(string str) { return null; } public string tostring(object obj) { return null; } } additionally we must override method tostring(object obj) defined in abstractsinglevalueconverter as we want to store date in xml calculated from integer, not a simple object.tostring value which would be returned from default tostring defined in abstract parent. implementation code below is pretty straightforward, but most interesting lines are commented. i’ve skipped all validation stuff to make this example shorter. public class daysagoconverter extends abstractsinglevalueconverter { private final static string format = "yyyy-mm-dd"; // default date format that will be used in conversion private final datetime now = datetime.now().todatemidnight().todatetime(); // current day at midnight public boolean canconvert(class type) { return type.equals(integer.class); // converter works only with integers } @override public object fromstring(string str) { simpledateformat format = new simpledateformat(format); try { date date = format.parse(str); return days.daysbetween(new datetime(date), now).getdays(); // we simply calculate days between using jodatime } catch (parseexception e) { throw new runtimeexception("invalid date format in " + str); } } public string tostring(object obj) { if (obj == null) { return null; } integer daysago = ((integer) obj); return now.minusdays(daysago).tostring(format); // here we subtract days from now and return formatted date string } } usage to use our custom converter for a specific field we have to inform about it xstream object using registerlocalconverter: xstream.registerlocalconverter(ban.class, "daysago", new daysagoconverter()); we are using “local” method to apply this conversion only to specific field and not to every integer field in xml file. and after that we will get our ban objects populated with number of days instead of date. summary that’s all what i wanted to show you in this post. now you have basic knowledge about what xstream is capable of and how it can be used to easily map xml data to java objects. if you need something more advanced, please check project official page as it contains very good documentation and examples.
April 23, 2013
by Tomasz Dziurko
· 24,889 Views
article thumbnail
What Does a Java Array Look Like in Memory?
arrays in java store one of two things: either primitive values (int, char, …) or references (a.k.a pointers). when an object is creating by using “new”, memory is allocated on the heap and a reference is returned. this is also true for arrays. 1. single-dimension array int arr[] = new int[3]; the int[] arr is just the reference to the array of 3 integer. if you create an array with 10 integer, it is the same – an array is allocated and a reference is returned. 2. two-dimensional array how about 2-dimensional array? actually, we can only have one dimensional arrays in java. 2d arrays are basically just one dimensional arrays of one dimensional arrays. int[ ][ ] arr = new int[3][ ]; arr[0] = new int[3]; arr[1] = new int[5]; arr[2] = new int[4]; multi-dimensional arrays use the name rules. 3. where are they located in memory? from the above, there are arrays and reference variables in memory. as we know that jvm runtime data areas include heap, jvm stack, and others. for a simple example as follows, let’s see where the array and its reference are stored. class a { int x; int y; } ... public void m1() { int i = 0; m2(); } public void m2() { a a = new a(); } ... when m1 is invoked, a new frame (frame-1) is pushed into the stack, and local variable i is also created in frame-1. when m2 is invoked inside of m1, another new frame (frame-2) is pushed into the stack. in m2, an object of class a is created in the heap and reference variable is put in frame-2. now, at this point, the stack and heap looks like the following: arrays are treated the same way like objects, so how array locates in memory is straight-forward.
April 19, 2013
by Ryan Wang
· 31,328 Views · 1 Like
article thumbnail
Stepping Backwards while Debugging: Move To Line
it happens to me many times: i’m stepping with the debugger through my code, and ups! i made one step too far! debugging, and made one step over too far what now? restart the whole debugging session? actually, there is a way to go ‘backwards’ gdb has a ‘reverse debugging’ feature, described here . i’m using the eclipse based codewarrior debugger, and this debug engine is not using gdb. the codewarrior debugger in mcu10.3 supports an eclipse feature: i select a code line in the editor view and use move to line : move to line what it does: it changes the current pc (program counter) of the program to that line: performed move to line now i can continue debugging from that line, e.g. stepping into that function call. yes, this is not true backward debugging. but it is simple and very effective. to perform true backward stepping, the debugger would need to reverse all operations, typically with a rather heavy state machine and data recording. but for the usual case where i simply need to go back a few lines, the ‘move to line’ is perfect. of course there are a few points to consider: this only changes the program counter. any variable changes/etc are not affected or reverted. in case of highly optimized code, there might be multiple sequence points per source line. so doing this for highly optimized code might not work correctly. it works ok within a function. it is not recommended to use it e.g. to set the pc outside of a function. because the context/stack frame is not set up. i use the ‘move to line’ frequently to ‘advance’ the program execution. e.g. to bypass some long sequences i’m not interested in, or to get out of an ‘endless’ loop. the same ‘move to line’ as available while doing assembly stepping too. see this post for details. happy line moving
April 15, 2013
by Erich Styger
· 9,881 Views
article thumbnail
Complex Event Processing Made Easy (using Esper)
The following is a very simple example of event stream processing (using the ESPER engine). Note - a full working example is available over on GitHub: https://github.com/corsoft/esper-demo-nuclear What is Complex Event processing (CEP)? Complex Event Processing (CEP), or Event Stream Stream Processing (ESP) are technologies commonly used in Event-Driven systems. These type of systems consume, and react to a stream of event data in real time. Typically these will be things like financial trading, fraud identification and process monitoring systems – where you need to identify, make sense of, and react quickly to emerging patterns in a stream of data events. Key Components of a CEP system A CEP system is like your typical database model turned upside down. Whereas a typical database stores data, and runs queries against the data, a CEP data stores queries, and runs data through the queries. To do this it basically needs: Data – in the form of ‘Events’ Queries – using EPL (‘Event Processing Language’) Listeners – code that ‘does something’ if the queries return results A Simple Example - A Nuclear Power Plant Take the example of a Nuclear Power Station.. Now, this is just an example – so please try and suspend your disbelief if you know something about Nuclear Cores, Critical Temperatures, and the like. It’s just an example. I could have picked equally unbelievable financial transaction data. But ... Monitoring the Core Temperature Now I don’t know what the core is, or if it even exists in reality – but for this example lets assume our power station has one, and if it gets too hot – well, very bad things happen.. Lets also assume that we have temperature gauges (thermometers?) in place which take a reading of the core temperature every second – and send the data to a central monitoring system. What are the requirements? We need to be warned when 3 types of events are detected: MONITOR just tell us the average temperature every 10 seconds - for information purposes WARNING WARN us if we have 2 consecutive temperatures above a certain threshold CRITICAL ALERT us if we have 4 consecutive events, with the first one above a certain threshold, and each subsequent one greater than the last – and the last one being 1.5 times greater than the first. This is trying to alert us that we have a sudden, rising escalating temperature spike – a bit like the diagram below. And let’s assume this is a very bad thing. Using Esper There are a number of ways you could approach building a system to handle these requirements. For the purpose of this post though - we will look at using Esper to tackle this problem How we approach this with Esper is: Using Esper – we can create 3 queries (using EPL - Esper Query Language) to model each of these event patterns. We then attach a listener to each query - this will be triggered when the EPL detects a matching pattern of events) We create an Esper service, and register these queries (and their listeners) We can then just throw Temperature data through the service – and let Esper tell alert the listeners when we get matches. (A working example of this simple solution is available on Githib - see link above) Our Simple ESPER Solution At the core of the system are the 3 queries for detecting the events. Query 1 – MONITOR (Just monitor the average temperature) select avg(value) as avg_val from TemperatureEvent.win:time_batch(10 sec) Query 2 – WARN (Tell us if we have 2 consecutive events which breach a threshold) select * from TemperatureEvent " match_recognize ( measures A as temp1, B as temp2 pattern (A B) define A as A.temperature > 400, B as B.temperature > 400) Query 3 – CRITICAL - 4 consecutive rising values above all above 100 with the fourth value being 1.5x greater than the first select * from TemperatureEvent match_recognize ( measures A as temp1, B as temp2, C as temp3, D as temp4 pattern (A B C D) define A as A.temperature > 100, B as (A.temperature < B.value), C as (B.temperature < C.value), D as (C.temperature < D.value) and D.value > (A.value * 1.5)) Some Code Snippets TemperatureEvent We assume our incoming data arrives in the form of a TemperatureEvent POJO If it doesn't - we can convert it to one, e.g. if it comes in via a JMS queue, our queue listener can convert it to a POJO. We don't have to do this, but doing so decouples us from the incoming data structure, and gives us more flexibility if we start to do more processing in our Java code outside the core Esper queries. An example of our POJO is below package com.cor.cep.event; package com.cor.cep.event; import java.util.Date; /** * Immutable Temperature Event class. * The process control system creates these events. * The TemperatureEventHandler picks these up * and processes them. */ public class TemperatureEvent { /** Temperature in Celcius. */ private int temperature; /** Time temerature reading was taken. */ private Date timeOfReading; /** * Single value constructor. * @param value Temperature in Celsius. */ /** * Temerature constructor. * @param temperature Temperature in Celsius * @param timeOfReading Time of Reading */ public TemperatureEvent(int temperature, Date timeOfReading) { this.temperature = temperature; this.timeOfReading = timeOfReading; } /** * Get the Temperature. * @return Temperature in Celsius */ public int getTemperature() { return temperature; } /** * Get time Temperature reading was taken. * @return Time of Reading */ public Date getTimeOfReading() { return timeOfReading; } @Override public String toString() { return "TemperatureEvent [" + temperature + "C]"; } } Handling this Event In our main handler class - TemperatureEventHandler.java, we initialise the Esper service. We register the package containing our TemperatureEvent so the EPL can use it. We also create our 3 statements and add a listener to each statement /** * Auto initialise our service after Spring bean wiring is complete. */ @Override public void afterPropertiesSet() throws Exception { initService(); } /** * Configure Esper Statement(s). */ public void initService() { Configuration config = new Configuration(); // Recognise domain objects in this package in Esper. config.addEventTypeAutoName("com.cor.cep.event"); epService = EPServiceProviderManager.getDefaultProvider(config); createCriticalTemperatureCheckExpression(); createWarningTemperatureCheckExpression(); createTemperatureMonitorExpression(); } An example of creating the Critical Temperature warning and attaching the listener /** * EPL to check for a sudden critical rise across 4 events, * where the last event is 1.5x greater than the first. * This is checking for a sudden, sustained escalating * rise in the temperature */ private void createCriticalTemperatureCheckExpression() { LOG.debug("create Critical Temperature Check Expression"); EPAdministrator epAdmin = epService.getEPAdministrator(); criticalEventStatement = epAdmin.createEPL(criticalEventSubscriber.getStatement()); criticalEventStatement.setSubscriber(criticalEventSubscriber); } And finally - an example of the listener for the Critical event. This just logs some debug - that's as far as this demo goes. package com.cor.cep.subscriber; import java.util.Map; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.springframework.stereotype.Component; import com.cor.cep.event.TemperatureEvent; /** * Wraps Esper Statement and Listener. No dependency on Esper libraries. */ @Component public class CriticalEventSubscriber implements StatementSubscriber { /** Logger */ private static Logger LOG = LoggerFactory.getLogger(CriticalEventSubscriber.class); /** Minimum starting threshold for a critical event. */ private static final String CRITICAL_EVENT_THRESHOLD = "100"; /** * If the last event in a critical sequence is this much greater * than the first - issue a critical alert. */ private static final String CRITICAL_EVENT_MULTIPLIER = "1.5"; /** * {@inheritDoc} */ public String getStatement() { // Example using 'Match Recognise' syntax. String criticalEventExpression = "select * from TemperatureEvent " + "match_recognize ( " + "measures A as temp1, B as temp2, C as temp3, D as temp4 " + "pattern (A B C D) " + "define " + " A as A.temperature > " + CRITICAL_EVENT_THRESHOLD + ", " + " B as (A.temperature < B.temperature), " + " C as (B.temperature < C.temperature), " + " D as (C.temperature < D.temperature) " + "and D.temperature > " + "(A.temperature * " + CRITICAL_EVENT_MULTIPLIER + ")" + ")"; return criticalEventExpression; } /** * Listener method called when Esper has detected a pattern match. */ public void update(Map eventMap) { // 1st Temperature in the Critical Sequence TemperatureEvent temp1 = (TemperatureEvent) eventMap.get("temp1"); // 2nd Temperature in the Critical Sequence TemperatureEvent temp2 = (TemperatureEvent) eventMap.get("temp2"); // 3rd Temperature in the Critical Sequence TemperatureEvent temp3 = (TemperatureEvent) eventMap.get("temp3"); // 4th Temperature in the Critical Sequence TemperatureEvent temp4 = (TemperatureEvent) eventMap.get("temp4"); StringBuilder sb = new StringBuilder(); sb.append("***************************************"); sb.append("\n* [ALERT] : CRITICAL EVENT DETECTED! "); sb.append("\n* " + temp1 + " > " + temp2 + " > " + temp3 + " > " + temp4); sb.append("\n***************************************"); LOG.debug(sb.toString()); } } The Running Demo Full instructions for running the demo can be found here: https://github.com/corsoft/esper-demo-nuclear An example of the running demo is shown below - it generates random Temperature events and sends them through the Esper processor (in the real world this would come in via a JMS queue, http endpoint or socket listener). When any of our 3 queries detect a match - debug is dumped to the console. In a real world solution each of these 3 listeners would handle the events differently - maybe by sending messages to alert queues/endpoints for other parts of the system to pick up the processing. Conclusions Using a system like Esper is a neat way to monitor and spot patterns in data in real time with minimal code. This is obviously (and intentionally) a very bare bones demo, barely touching the surface of the capabilities available. Check out the Esper web site for more info and demos. Esper also has a plugin for Apache Camel Integration engine - which allows you to configure you EPL queries directly in XML Spring camel routes, removing the need for any Java code completely (we will possibly cover this in a later blog post!)
April 11, 2013
by Adrian Milne
· 68,365 Views · 4 Likes
article thumbnail
Monitoring with DataDog
Recently I found myself sending more and more business metrics to Datadog, a Software as a Service solution that promises to collect all your data points and build business metrics, displaying them as graphs and triggering alerts whenever they get to critically low (or high) levels. The goals The more your automated tests raises their level of abstraction, the more they become oriented to external quality (what the customer wants and does) instead of internal quality (low coupling, high cohesion of the software design). The largest end-to-end tests that we have in place at Onebip connect several different projects on an integration server and run everything from the creation of a purchase or subscription to its renewal and termination (events that would happen months after creation). However, even end-to-end tests cannot guarantee that our applications work against external resources, such as merchants, mobile carrier, and ISPs. The only way to catch integration problems is monitoring. These problems, like a mobile carrier experiencing an outage, may be due to our errors or to external conditions; but they should nevertheless be discovered as early as possible. The infrastructure Datadog is the only data-collection service that passed the stress tests of SLL, our solution architect. It ships as an UDP server that you pay basing on the number of machines you want to run it on; for example, a preproduction and a production server are a common choice to start out. The server collects data locally and periodically uploads it to Datadog in bursts, where you can access it via a web application or via APIs in case you want to call it from your build. The UDP protocol is aligned with the goals of metric collections: a silent server that decouples the sending of metrics from the rest of the business logic: UDP packets are just lost if no process is there listening to them, no errors are raised if the server crashes or is not running or installed for some reason for instance in development machines). The monitoring code, which you write, should be decoupled and asynchronous as much as possible. The part that talks over the network is already externalized in the DataDog server, but you don't want the user to wait because you have to send some strange number. So the internal part (sending via UDP) is performed in Listener objects that implement the Observer pattern. These object still have to be wrapped in all-encompassing try/catch constructs so that any errors in the monitoring part never influence the business logic. Againg, you don't want a payment to fail because of an exception in how monitoring DateTime objects are built. For PHP we built a SilentListener class to wrap all of our object: class SilentListener { private $wrapped; public function __construct($wrapped) { $this->wrapped = $wrapped; } public function __call($method, $args) { try { call_user_func_array(array($this->wrapped, $method), $args); } catch (Exception $e) { $this->log($e); } } }SLL An example In some countries, we receive payments through mobile-originated messages (MO), a fancy word for saying SMS sent by the end user. So a simple way to monitor if we are receiving payment or if the server is exploded is to upload a metric counting them every time we receive one (pseudo-JSON format to show you the data): { counter: 1 } However, we can be more precise than this: an external outage or an integration problem may happen to a lower level than the whole application. For example, MOs can be delayed in Argentina, by a single carrier, while the rest of the world is still working fine. So our data points look like this: { counter: 1, tags: { country: "IT", carrier: "Vodafone", merchant: "Tasty Cookies, Inc.", } } and in turn graphs on DataDog or calls to its API can set up filters so that we can, if necessary, view only the data related to any combination of country, carrier and merchant. The nice thing, SLL says, is that you just start send data from production and only after you have data points available you build a graph or an alert system basing on what appears to be the most important tags. For example, a big merchant may benefit from some dedicated monitoring, while minor countries such as Vietnam should be monitored as a whole since their traffic is by far lower than that of the others.
April 10, 2013
by Giorgio Sironi
· 16,413 Views · 1 Like
article thumbnail
Add Custom Post Meta Data To Post List Table
One of the best thing about WordPress is that you can customise almost anything. In the admin area you can see a list of all the posts you have added in WordPress. Within this table it shows the basic information for each of the posts, the title, the author, the category, tags, comments and the date the post was published. WordPress has a number of different filters and actions that allow you to edit the output of the column so you can add your own data to this list. For example if you have custom post meta data which is useful information you want to display on the list of posts you can add new custom columns to the list. In this article you will learn how to add new columns to the post list, how you can add data to the column and how you can make this column sortable. Add New Columns First we start off by adding the new column to the list, for this we use the WordPress filter manage_edit-post_columns. This will allow you to edit the output of the columns by adding new values to the column array. The callback function on this filter will pass in one parameter which are the current columns on the list, the return of this function will be the new columns on the post table. This means that we can add additional values to the array to add extra columns to the table. The following code will add a new column to the table just after the title column. // Add a column to the edit post list add_filter( 'manage_edit-post_columns', 'add_new_columns'); /** * Add new columns to the post table * * @param Array $columns - Current columns on the list post */ function add_new_columns( $columns ) { $column_meta = array( 'meta' => 'Custom Column' ); $columns = array_slice( $columns, 0, 2, true ) + $column_meta + array_slice( $columns, 2, NULL, true ); return $columns; } Add Columns To Custom Post Types If you have custom post types in your site and want to add additional columns to this list, WordPress comes with built in filters you can apply to add new columns to this table. add_filter( 'manage_${post_type}_posts_columns', 'add_new_columns'); If you have a custom post type of portfolio then you can use the following code to add a column to the list of portfolio post types. function add_portfolio_columns($columns) { return array_merge($columns, array('client' => __('Client'), 'project_date' =>__( 'Project Date'))); } add_filter('manage_portfolio_posts_columns' , 'add_portfolio_columns'); Add Data To Custom Columns Once you have created the new columns for the posts list you can now add data to the new columns by using the WordPress action manage_posts_custom_column. Adding an action to this will be called on each column, from this call we can get data for the post display this on the post list. The following code will check what column we are on and get the custom post meta data for the current post and display this in the column. // Add action to the manage post column to display the data add_action( 'manage_posts_custom_column' , 'custom_columns' ); /** * Display data in new columns * * @param $column Current column * * @return Data for the column */ function custom_columns( $column ) { global $post; switch ( $column ) { case 'meta': $metaData = get_post_meta( $post->ID, 'twitter_url', true ); echo $metaData; break; } } Add Data To Custom Post Type Columns Along with being able to add a filter to custom post types by adding new columns, WordPress has a built in action you can use to add data to custom columns. In the above example we add a new column just for post types of a portfolio to add two new columns to the list, using the below code you can add data to these new columns. function custom_portfolio_column( $column, $post_id ) { switch ( $column ) { case 'project_date': echo get_post_meta( $post_id , 'project_date' , true ); break; case 'client': echo get_post_meta( $post_id , 'client' , true ); break; } } add_action( 'manage_portfolio_posts_custom_column' , 'custom_portfolio_column' ); Make Columns Sortable By default the new custom columns are not sortable so this makes it hard to find data that you need. To sort the custom columns WordPress has another filter manage_edit-post_sortable_columns you can use to assign which columns are sortable. When this action is ran the function will pass in a parameter of all the columns which are currently sortable, by adding your new custom columns to this list will now make these columns sortable. The value you give this will be used in the URL so WordPress understands which column to order by. The following to allow you to sort by the custom column meta. // Register the column as sortable function register_sortable_columns( $columns ) { $columns['meta'] = 'Custom Column'; return $columns; } add_filter( 'manage_edit-post_sortable_columns', 'register_sortable_columns' ); That's all the information you need to change the way posts are listed in your admin area. What useful information do you wish was displayed in the post list?
April 10, 2013
by Paul Underwood
· 13,637 Views
article thumbnail
Why Encapsulation Matters
Encapsulation is more than just defining accessor and mutator methods for a class. It is a broader concept of programming, not necessarily object-oriented programming, that consists in minimizing the interdependence between modules and it’s typically implemented through information hiding. Paramount to understand encapsulation is the realization that it has two main objectives: (1) hiding complexity and (2) hiding the sources of change. About Hiding Complexity Encapsulation is inherently related to the concepts of modularity and abstraction. So, in my opinion, to really understand encapsulation, one must first understand these two concepts. Let’s consider, for example, the level of abstraction in the concept of a car. A car is complex in its internal working. They have several subsystem, like the transmission system, the break system, the fuel system, etc. However, we have simplified its abstraction, and we interact with all cars in the world through the public interface of their abstraction: we know that all cars have a steering wheel through which we control direction, they have a pedal that when we press it we accelerate the car and control speed, and another one that when we press it we make it stop, and we have a gear stick that let us control if we go forward or backwards. These features constitute the public interface of the car abstraction. In the morning we can drive a sedan and then get out of it and drive an SUV in the afternoon as if it was the same thing. However, few of us know the details of how all these features are implemented under the hood. So, this simple analogy shows that human beings deal with complexity by defining abstractions with public interfaces that we use to interact with them and all the unnecessary details get hidden under the hood of these abstractions. And I want to emphasize that word “unnecessary” here, because the beauty of an abstraction is not having to understand all those details in order to be able to use it, we just need to understand a broader abstract concept and how it works and how we interact with it. That’s why most of us don’t know or don’t care how a car works under the hood, but that doesn’t prevents us from driving one. In his book Code Complete, Steve McConnell uses the analogy of an iceberg: only a small portion of an iceberg is visible on the surface, most of its true size is hidden underwater. Similarly, in our software designs the visible parts of our modules/classes constitute their public interface, and this is exposed to the outside world, the rest of it should be hidden to the naked eye. In the words of McConell “the interface to a class should reveal as little as possible about its inner workings”. Clearly, based on our car analogy, we can see that this encapsulation is good, since it hides unnecessary/complex details from the users. It makes objects simpler to use and understand. About Hiding the Sources of Change Now, continuing with the analogy; think of the time when cars did not have a hydraulics directional system. One day, the car manufactures invented it, and they decide it to put it in cars from there on. Still, this did not change the way in which drivers were interacting with them. At most, users experienced an improvement in the use of the directional system. A change like this was possible because the internal implementation of a car is encapsulated, that is, is hidden from its user. In other words changes can be safely done without affecting its public interface. In a similar way, if we achieve proper levels of encapsulation in our software design we can safely foster change and evolution of our APIs without breaking its users, by this minimizing the impact of changes and the interdependence of modules. Therefore, encapsulation is a way to achieve another important attribute of a good software design known as loose coupling. In his book Effective Java, Joshua Block highlights the power of information hiding and loose coupling when he says: “Information hiding is important for many reasons, most of which stem from the fact that it decouples the modules that compromise a system, allowing them to be developed, tested, optimized, used, understood, and modified in isolation. This speeds up system development because modules can be developed in parallel. It eases the burden of maintenance because modules can be understood more quickly and debugged with little fear of harming other modules [...] it enables effective performance tuning [since] those modules can be optimized without affecting the correctness of other modules increases software reuse because modules that aren’t tightly coupled often prove useful in other contexts besides the ones for which they were developed”. So, once more, we can clearly see that encapsulation is a desirable attribute that eases the introduction of change and foster the evolution of our APIs. As long as we respect the public interface of our abstractions we are free to change whatever we want of its encapsulated inner workings. About Breaking the Public Interface So what happens when we do not achieve the proper levels of encapsulation in our designs? Now, think that car manufactures decided to put the fuel cap below the car, and not in one of its sides. Let’s say we go and buy one of these new cars, and when we run out of gas we go to the nearest gas station, and then we do not find the fuel cap. Suddenly we realize is below the car, but we cannot reach it with the gas pump hose. Now, we have broken the public interface contract, and therefore, the entire world breaks, it falls apart because things are not working the way it was expected. A change like this would cost millions. We would need to change all gas pumps, not to mention mechanical shops and auto parts. When we break encapsulation we have to pay a price. This last part of our analogy, clearly reveals that failing to define proper abstractions with proper levels of encapsulation will end up causing difficulties when change finally happens. So, as we can see, the goal of encapsulation is reduce the complexity of the abstractions by providing a way to hide implementation details and it also help us to minimize interdependence and facilitate change. We maximize encapsulation by minimizing the exposure of implementation details. However encapsulation will not help us if we do not define proper abstractions. Simply put, there is no way to change the public interface of an abstraction without breaking its users. So, the design of good abstractions is of paramount importance to facilitate the evolution of the APIs, encapsulation is just one of the tools that help us create this good abstractions, but no level of encapsulation is going to make a bad abstraction work. Encapsulation in Java One of those things that we always want to encapsulate is the state of a class. The state of a class should only be accessed through its public interface. In a object-oriented programming language like Java, we achieve encapsulation by hiding details using the accessibility modifiers (i.e. public, protected, private, plus no modifier which implies package private). With these levels of accessibility we control the level of encapsulation, the less restrictive the level, the more expensive change is when it happens and the more coupled the class is with other dependent classes (i.e. user classes, subclasses, etc.). In object-oriented languages a class has two public interfaces: the public interface shared with all users of the class, and the protected interface shared with subclasses. It is of paramount importance that we design the proper levels of encapsulation for every one of these public interfaces so that we can facilitate change and foster evolution of our APIs. Why Getters and Setters? Many people wonder why we need accessor and mutator methods in Java (a.k.a. getters and setters), why can’t we just access the data directly? But the purpose of encapsulation here is is not to hide the data itself, but the implementation details on how this data is manipulated. So, once more what we want is a way to provide a public interface through which we can gain access to this data. We can later change the internal representation of the data without compromising the public interface of the class. On the contrary, by exposing the data itself, we compromise encapsulation, and therefore, the capacity of changing the ways to manipulate this data in the future without affecting its users. We would create a dependency with the data itself, and not with the public interface of the class. We would be creating a perfect cocktail for trouble when “change” finally finds us. There are several compelling reasons why we might want to encapsulate access to our fields. The best compendium of these reasons I have ever found is described in Joshua Bloch’s book [a href="http://192.9.162.55/docs/books/effective/" target="_blank"]Effective Java. There in Item 14: Minimize the accessibility of classes and members, he mentions several reasons, which I mention here: You can limit the values that can be stored in a field (i.e. gender must be F or M). You can take actions when the field is modified (trigger event, validate, etc). You can provide thread safety by synchronizing the method. You can switch to a new data representation (i.e. calculated fields, different data type) However, it is very important to understand that encapsulation is more than hiding fields. In Java we can hide entire classes, by this, hiding the implementation details of an entire API. My understanding of this important concept was broaden and enriched by my reading of a great article by Alan Snyder called Encapsulation and Inheritance in Object-Oriented Programming Languages which I recommend to all readers of this blog. I found a version of it available on the Web and I shared a link to it a the end of this article. Further Reading Encapsulation and Inheritance in Object-oriented Programming Languages Effective Java: Minimize the Accessibility of Classes and Members (p. 67-69) Code Complete 2: Hiding Secrets (Information Hiding) (p. 92).
April 7, 2013
by Edwin Dalorzo
· 56,291 Views · 3 Likes
article thumbnail
Weekend Project: Send sensor data from Arduino to MongoDB
Arduino is an open-source electronics platform that can acknowledge and interact with its environment through a variety of sensor types. It’s great for hardware prototyping and one-off projects. I just got an Arduino Board from our friends at SendGrid, who also gave me a little tutorial in the art of Arduino hacking. Inspired by the tutorial and armed with this new board, I bought a passive infared (PIR) motion sensor from my local Radio Shack. Now I was ready to play; in particular, I wanted to be able to collect that continuous stream of hardware sensor data into a MongoDB database for logging, trend analysis, system event correlation, etc. To this end, I created the demo project “mongodb-motion”, which I’ve made public on Github. In the “mongodb-motion” Github repo, you will find an Arudino project that writes motion sensor data to a cloud MongoDB database at MongoLab and sends alerts via email based on certain criteria. I built this demo using Node.js and the MongoLab REST API. Below, I’ll go through exactly what hardware you need to make your own “mongodb-motion” project a success, and how the code actually works. What You Need The hardware used in this demo includes: an Arduino UNO R3 and a Parallax PIR motion sensor. How the Code Works You can use a variety of motion sensors with the Arduino. In this particular experiment, I used a PIR motion sensor. The PIR motion sensor behaves like a switch, with ‘down’ events emitted on motion detection and ‘up’ events a few seconds after motion ceases to be detected. On the receiving side, I used JohnnyFive, an appropriately named Node.js package that accepts sensor events and sends messages to the Arduino board. With the two ends set, I’ll move on to the project’s configuration file. In this demo, I’ve included a configuration file, config-sample.js, where credentials for the MongoLab REST API and for the email SMTP server can be added. In my case, I used the SendGrid SMTP service. The configuration file also has two callbacks that determine when an email is emitted, one for each type of event – “detect” and “ceased”. I’ve used this feature to automatically send an email alert if an event timestamp is between 7:00pm and 8:00am, ostensibly when my office should be motionless… I’m out there watching you, office! Once you’ve customized this config-sample.js file, be sure to rename it to config.js in order for it to be usable. If you inspect the project code, you’ll notice that the MongoLab REST API is called in the logMsg() function, using an https.request. Building this little demo has given me some new ideas for hardware hacking the cloud. I hope you give it a try too. Thanks to the Arduino, Node.js and Javascript communities, and special thanks to Rick Waldon for Johnny Five, SendGrid for the UNO board, and a big shout out to @swiftalphaone for the Waza tutorial.
April 3, 2013
by Ben Wen
· 17,786 Views
article thumbnail
ActiveMQ Message Priorities: How it Works
There’s usually a steady drip of questions on the mailing list surrounding ActiveMQ’s message-priority support as well as good questions about observed behaviors and “what’s really supported”? I hope to help you understand what happens under the covers and what levels of priority can be supported. The details could get gory for some. If you’re not interested in the details, take a look at the ActiveMQ wiki for the high-level overview. First, since ActiveMQ supports JMS 1.1, let’s take a look at what the JMS Spec says about support for “JMSPriority”: JMS defines a ten-level priority value, with 0 as the lowest priority and 9 as the highest. In addition, clients should consider priorities 0-4 as gradations of normal priority and priorities 5-9 as gradations of expedited priority. JMS does not require that a provider strictly implement priority ordering of messages; however, it should do its best to deliver expedited messages ahead of normal messages. ActiveMQ observes three distinct levels of “Priority”: Default (JMSPriority == 4) High (JMSPriority > 4 && <= 9) Low (JMSPriority > 0 && < 4) If you don’t specify a priority for your MessageProducer or individual messages (see MessageProducer#send(message, deliveryMode, priority, timeToLive)), ActiveMQ’s client will default to using a JMSPriority == 4. As a JMS consumer, you can expect a FIFO ordering if the producers aren’t using priority or you’re not using some other form of selection criteria on the destination. ActiveMQ also “does its best” to deliver expedited messages ahead of “normal” messages, as the spec states. The message store that your broker uses greatly contributes to how that’s exactly done, but in general you can expect the broker to honor strict (0-9) priority support for only the JDBC backed messages stores. For KahaDB-backed message stores, only “category priority” is supported (Low, Default, High, where priorities in each category are not always differentiated, that is 5 and 9 are considered “High”). However, with the right settings and messaging profile, you can affect how [strict] prioritization happens even with KahaDB, so let’s take a quick look. Enabling Message Priority You can enable message priority on your Queues with the following setting in your activemq.xml configuration file: For queueName there is wildcard support, so you can enable priority support on a hierarchy of messages. When you enable priority support, the broker will use prioritized linked-list structures in its messages cursors as well as give KahaDB a hint to use priority categories when storing messages onto disk. There are varying levels of how strict the priority ordering can get, but at worst, you can assume priorities will be upheld by category. The following factors come into play which control how strict the priority ordering can get when using the KahaDB store: Caching enabled/disabled in the queue cursor MaxPageInSize for how many messages to page from the store in a batch Consumer prefetching Expired-message checking Broker Memory settings Persistent/Non-persistent messages The next section presents a little detail about what happens in KahaDB to support priority, while the following sections will go into how things happen in broker memory and are finally dispatched to a consumer and will point out along the way how the different factors from above come into play. KahaDB Prioritization Categories First we’ll start with how messages are stored on disk and loaded into a destination. KahaDB (the default message store) is a file-based message database that the broker uses to persist messages in a “log” or “journal”. The broker also keeps track of which messages are in the log by keeping a separate “index” that holds information about messages (like its location in the log, to which destination it’s associated, ordering, etc). The index also has a notion of message “priority”, which is implemented with three B+Tree structures, one for each priority level (see MessageOrderIndex in org.apache.activemq.store.kahadb.MessageDatabase). This implementation detail is the root of message prioritization and has implications for the rest of the broker as messages are removed from the store. When messages are retrieved from the store, they are done so in batches (maxPageInSize), and messages that are in the “highPriority” BTree are retrieved first. When the high-priority messages are exhausted, the store will then offer up the default priority and subsequently the low priority messages. You can set the maxPageInSize like so: The larger the page size, the larger the number of messages in a batch and the more messages you can see at a time per “snapshot”. For each batch that’s brought into memory, it’s messages are going to be strictly prioritized as described below by the store cursor. The downside is that if your messages are large, bringing in 500 at a time could exhaust your broker memory. The default setting is 200. Message Cursor Priority Lists When persistent messages come into the broker from a producer, they will be stored onto disk, but they will also be cached in memory waiting to be dispatched to a consumer. This is a default setting, so no need to explicitly set it. The idea behind this is to be able to dispatch to fast consumers without having to retrieve it directly from disk (if consumers become slow, the broker will auto-tune itself to not use the cache once it’s filled so as to not OOM). The good thing about this is that when prioritization support is used for a queue, the internal lists used for the cursors will support strict priority (0-9), so for all of the messages that are currently in memory (in the cache), they will be sorted properly from highest to lowest. The trick is what happens when all of the messages in the cache are “lower priority messages” and then a high-priority message comes in to the broker but won’t fit in the cache because it’s full… in that case the message will go directly to the store, be indexed in the “high-priority” index, but won’t be available for dispatch ahead of the lower priority messages until it’s paged into memory in the next batch. When NON persistent messages come into the broker, they will not go to the message store. They will be kept in memory for as long as possible and only pushed to disk (in a temporary store) when memory has passed a defined threshold (> 70% by default). So the same behaviors for cached messages above apply for non-persistent messages, namely, those that are in memory will be ordered strictly (0-9), but once they get pushed to disk, only categories are observed. If you disable the cursor’s cache (with the following setting) then you could help to eliminate the above scenario where the cache becomes full with lower priority messages right when a high-priority message comes in (and becomes stuck on disk because it cannot be paged into memory). However, doing this will slow down your throughput because messages must be paged in from disk before sending to consumers which will slow down the dispatch. But note, when doing this, you are more likely to see messages not following “strict” priority even with the priority lists in the cursor. They will, however, follow the priority categories (High, Default, Low) properly. So to recap, if you disable the cache, you can get higher priority messages delivered more timely than you can if the cache is enabled and it’s filled with lower priority messages. But disabling the cache, by itself, won’t get you to strict priority. Disabling the cache helps getting high priority messages to consumers ahead of lower priority messages, however for this to work as intended (and has bitten me), you’ll want to disable the asynchronous message expiry check. This expiry check pages messages into memory every 30 seconds regardless if they’re ready to be dispatched (by default) and performs a TTL check (time to live) on them and discards those messages that should be expired. This sort of checking effectively brings messages into memory and will stall the normal “page in for dispatch” just enough to miss higher priority messages. Turning off expiry checking, however, will keep expired messages in the store longer because the only expiry check would be done right before dispatch, so make an educated decision on this, and all ActiveMQ settings you tinker with. But to move in the direction of strict(er) order priorities, you’ll want to disable this. Lastly, consumer prefetch plays a role in achieving “strict ordering.” By default, prefetch is set to 1000 for queue consumers, which means they will be sent 1000 messages in a batch. This helps speed up the consumer when it’s consuming messages, but in terms of priority handling it in essence also acts like a cache of messages (discussed above) and could contribute to not seeing “strict ordering”. “category priority” could also be violated if your prefetch is filled with lower priority messages, and there is a new high-priority message that came in to the broker, you wouldn’t see it until the next message dispatch to the consumer. So the lower the prefetch, the better chance of seeing higher priority messages ahead of lower ones. With prefetch of 1, you’ll always get the highest priority message that the store cursor knows about. Client side message priority ActiveMQ also has priority support built right into the message client and it’s enabled by default. This means, when messages are being sent to your consumer (even before your consumer is receiving them, using prefetch), they will be cached on the consumer side and prioritized by default. This is regardless of whether you’re using priority support on the broker side. This could impact the ordering you see on the consumer so just keep this in mind. To disable it, set the following configuration option on your broker URL, e.g., tcp://0.0.0.0:61616?jms.messagePrioritySupported=false But as mentioned above, you’ll want to lower the prefetch to 1 to get the best chance of achieving strict ordering. Tradeoffs So ultimately, getting strictly ordered messages with KahaDB is possible but there are significant tradeoffs to consider and it won’t apply for every messaging situation. Do you want optimized, fast messaging? or do you want to slow down the messaging to achieve strict(er) ordering for priorities. Each situation is different and should be evaluated on a case-by-case basis. In general, however, you can rely on category level priorities. Reordering messages across large queues AND keeping high performance is problematic, and most Message Queue vendors do not do that very well. ActiveMQ’s priority support is strong, but another good alternative exists as discussed on the ActiveMQ wiki describing message priority and that is: using message selectors and balancing out the consumers in such a way that high priority messages end up getting consumed first. This approach tends to give more flexibility and control, but that’s for another post Leave me some comments if something wasn’t clear, or drop an email in the mailing list!
April 2, 2013
by Christian Posta
· 18,984 Views
article thumbnail
ArrayList vs. LinkedList vs. Vector
1. list overview list, as its name indicates, is an ordered sequence of elements. when we talk about list, it is a good idea to compare it with set which is a set of elements which is unordered and every element is unique. the following is the class hierarchy diagram of collection. 2. arraylist vs. linkedlist vs. vector from the hierarchy diagram, they all implement list interface. they are very similar to use. their main difference is their implementation which causes different performance for different operations. arraylist is implemented as a resizable array. as more elements are added to arraylist, its size is increased dynamically. it's elements can be accessed directly by using the get and set methods, since arraylist is essentially an array. linkedlist is implemented as a double linked list. its performance on add and remove is better than arraylist, but worse on get and set methods. vector is similar with arraylist, but it is synchronized. arraylist is a better choice if your program is thread-safe. vector and arraylist require space as more elements are added. vector each time doubles its array size, while arraylist grow 50% of its size each time. linkedlist, however, also implements queue interface which adds more methods than arraylist and vector, such as offer(), peek(), poll(), etc. note: the default initial capacity of an arraylist is pretty small. it is a good habit to construct the arraylist with a higher initial capacity. this can avoid the resizing cost. 3. arraylist example arraylist al = new arraylist(); al.add(3); al.add(2); al.add(1); al.add(4); al.add(5); al.add(6); al.add(6); iterator iter1 = al.iterator(); while(iter1.hasnext()){ system.out.println(iter1.next()); } 4. linkedlist example linkedlist ll = new linkedlist(); ll.add(3); ll.add(2); ll.add(1); ll.add(4); ll.add(5); ll.add(6); ll.add(6); iterator iter2 = al.iterator(); while(iter2.hasnext()){ system.out.println(iter2.next()); } as shown in the examples above, they are similar to use. the real difference is their underlying implementation and their operation complexity. 5. vector vector is almost identical to arraylist, and the difference is that vector is synchronized. because of this, it has an overhead than arraylist. normally, most java programmers use arraylist instead of vector because they can synchronize explicitly by themselves. 6. performance of arraylist vs. linkedlist the time complexity comparison is as follows: i use the following code to test their performance: arraylist arraylist = new arraylist(); linkedlist linkedlist = new linkedlist(); // arraylist add long starttime = system.nanotime(); for (int i = 0; i < 100000; i++) { arraylist.add(i); } long endtime = system.nanotime(); long duration = endtime - starttime; system.out.println("arraylist add: " + duration); // linkedlist add starttime = system.nanotime(); for (int i = 0; i < 100000; i++) { linkedlist.add(i); } endtime = system.nanotime(); duration = endtime - starttime; system.out.println("linkedlist add: " + duration); // arraylist get starttime = system.nanotime(); for (int i = 0; i < 10000; i++) { arraylist.get(i); } endtime = system.nanotime(); duration = endtime - starttime; system.out.println("arraylist get: " + duration); // linkedlist get starttime = system.nanotime(); for (int i = 0; i < 10000; i++) { linkedlist.get(i); } endtime = system.nanotime(); duration = endtime - starttime; system.out.println("linkedlist get: " + duration); // arraylist remove starttime = system.nanotime(); for (int i = 9999; i >=0; i--) { arraylist.remove(i); } endtime = system.nanotime(); duration = endtime - starttime; system.out.println("arraylist remove: " + duration); // linkedlist remove starttime = system.nanotime(); for (int i = 9999; i >=0; i--) { linkedlist.remove(i); } endtime = system.nanotime(); duration = endtime - starttime; system.out.println("linkedlist remove: " + duration); and the output is: arraylist add: 13265642 linkedlist add: 9550057 arraylist get: 1543352 linkedlist get: 85085551 arraylist remove: 199961301 linkedlist remove: 85768810 the difference of their performance is obvious. linkedlist is faster in add and remove, but slower in get. based on the complexity table and testing results, we can figure out when to use arraylist or linkedlist. in brief, linkedlist should be preferred if: there are no large number of random access of element there are a large number of add/remove operations
March 28, 2013
by Ryan Wang
· 611,158 Views · 20 Likes
article thumbnail
HashMap vs. TreeMap vs. HashTable vs. LinkedHashMap
Learn all about important data structures like HashMap, HashTable, and TreeMap.
March 28, 2013
by Ryan Wang
· 438,301 Views · 14 Likes
  • Previous
  • ...
  • 418
  • 419
  • 420
  • 421
  • 422
  • 423
  • 424
  • 425
  • 426
  • 427
  • ...
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×