DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest AI/ML Topics

article thumbnail
Algorithm of the Week: Rabin-Karp String Searching
Brute force string matching is a very basic sub-string matching algorithm, but it’s good for some reasons. For example it doesn’t require preprocessing of the text or the pattern. The problem is that it’s very slow. That is why in many cases brute force matching can’t be very useful. For pattern matching we need something faster, but to understand other sub-string matching algorithms let’s take a look once again at brute force matching. In brute force sub-string matching we checked every single character from the text with the first character of the pattern. Once we have a match between them we shift the comparison between the second character of the pattern with the next character of the text, as shown on the picture below. This algorithm is slow for mainly two reasons. First, we have to check every single character from the text. On the other hand even if we find a match between a text character and the first character of the pattern we continue to check step by step (character by character) every single symbol of the pattern in order to find whether it is in the text. So is there any other approach to find whether the text contains the pattern? In fact there is a “faster” approach. In this case, in order to avoid the comparison between the pattern and the text character by character, we’ll try to compare them all at once, so we need a good hash function. With its help we can hash the pattern and check against hashed sub-strings of the text. We must be sure that the hash function is returning “small” hash codes for larger sub-strings. Another problem is that for larger patterns we can’t expect to have short hashes. But besides this the approach should be quite effective compared to the brute force string matching. This approach is known as Rabin-Karp algorithm. Overview Michael O. Rabin and Richard M. Karp came up with the idea of hashing the pattern and to check it against a hashed sub-string from the text in 1987. In general the idea seems quite simple, the only thing is that we need a hash function that gives different hashes for different sub-strings. Said hash function, for instance, may use the ASCII codes for every character, but we must be careful for multi-lingual support. The hash function may vary depending on many things, so it may consist of ASCII char to number converting, but it can also be anything else. The only thing we need is to convert a string (pattern) into some hash that is faster to compare. Let’s say we have the string “hello world”, and let’s assume that its hash is hash(‘hello world’) = 12345. So if hash(‘he’) = 1 we can say that the pattern “he” is contained in the text “hello world”. So in every step, we take from the text a sub-string with the length of m, where m is the pattern length. Thus we hash this sub-string and we can directly compare it to the hashed pattern, as in the picture above. Implementation So far we saw some diagrams explaining the Rabin-Karp algorithm, but let’s take a look at its implementation here, in this very basic example where a simple hash table is used in order to convert the characters into integers. The code is PHP and it’s used only to illustrate the principles of this algorithm. function hash_string($str, $len) { $hash = ''; $hash_table = array( 'h' => 1, 'e' => 2, 'l' => 3, 'o' => 4, 'w' => 5, 'r' => 6, 'd' => 7, ); for ($i = 0; $i < $len; $i++) { $hash .= $hash_table[$str{$i}]; } return (int)$hash; } function rabin_karp($text, $pattern) { $n = strlen($text); $m = strlen($pattern); $text_hash = hash_string(substr($text, 0, $m), $m); $pattern_hash = hash_string($pattern, $m); for ($i = 0; $i < $n-$m+1; $i++) { if ($text_hash == $pattern_hash) { return $i; } $text_hash = hash_string(substr($text, $i, $m), $m); } return -1; } // 2 echo rabin_karp('hello world', 'ello'); Multiple Pattern Match It’s great to say that the Rabin-Karp algorithm is great for multiple pattern match. Indeed its nature is supposed to support such functionality, which is its advantage in comparison to other string searching algorithms. Complexity The Rabin-Karp algorithm has the complexity of O(nm) where n, of course, is the length of the text, while m is the length of the pattern. So where is it compared to brute-force matching? Well, brute force matching complexity is O(nm), so as it seems there’s not much of a gain in performance. However, it’s considered that Rabin-Karp’s complexity is O(n+m) in practice, and that makes it a bit faster, as shown on the chart below. Note that the Rabin-Karp algorithm also needs O(m) preprocessing time. Application As we saw Rabin-Karp is not much faster than brute force matching. So where we should use it? 3 Reasons Why Rabin-Karp is Cool 1. Good for plagiarism, because it can deal with multiple pattern matching! 2. Not faster than brute force matching in theory, but in practice its complexity is O(n+m)! 3. With a good hashing function it can be quite effective and it’s easy to implement! 2 Reasons Why Rabin-Karp is Not Cool 1. There are lots of string matching algorithms that are faster than O(n+m) 2. It’s practically as slow as brute force matching and it requires additional space Final Words Rabin-Karp is a great algorithm for one simple reason – it can be used to match against multiple patterns. This makes it perfect to detect plagiarism even for larger phrases.
April 3, 2012
by Stoimen Popov
· 36,733 Views
article thumbnail
Using "Natural": A NLP Module for node.js
Like most node modules "natural" is packaged as an NPM and can be installed from the command line with node.js.
March 27, 2012
by Christopher Umbel
· 63,999 Views · 3 Likes
article thumbnail
Algorithm of the Week: Brute Force String Matching
String matching is something crucial for database development and text processing software. Fortunately, every modern programming language and library is full of functions for string processing that help us in our everyday work. However it's important to understand their principles. String algorithms can typically be divided into several categories. One of these categories is string matching. When it comes to string matching, the most basic approach is what is known as brute force, which simply means to check every single character from the text to match against the pattern. In general we have a text and a pattern (most commonly shorter than the text). What we need to do is to answer the question whether this pattern appears in the text. Overview The principles of brute force string matching are quite simple. We must check for a match between the first characters of the pattern with the first character of the text as on the picture bellow. If they don’t match, we move forward to the second character of the text. Now we compare the first character of the pattern with the second character of the text. If they don’t match again, we move forward until we get a match or until we reach the end of the text. In case they match, we move forward to the second character of the pattern comparing it with the “next” character of the text, as shown in the picture bellow. Just because we have found a match between the first character from the pattern and some character of the text, doesn’t mean that the pattern appears in the text. We must move forward to see whether the full pattern is contained in the text. Implementation Implementation of brute force string matching is easy and here we can see a short PHP example. The bad news is that this algorithm is naturally quite slow. function sub_string($pattern, $subject) { $n = strlen($subject); $m = strlen($pattern); for ($i = 0; i < $n-$m; $i++) { $j = 0; while ($j < $m && $subject[$i+$j] == $pattern[$j]) { $j++; } if ($j == $m) return $i; } return -1; } echo sub_string('o wo', 'hello world!'); Complexity As I said this algorithm is slow. Actually every algorithm that contains “brute force” in its name is slow, but to show how slow string matching is, I can say that its complexity is O(n.m). Here n is the length of the text, while m is the length of the pattern. In case we fix the length of the text and test against variable length of the pattern, again we get a rapidly growing function. Application Brute force string matching can be very ineffective, but it can also be very handy in some cases. Just like the sequential search. It can be very useful… Doesn’t require pre-processing of the text – Indeed if we search the text only once we don’t need to pre-process it. Most of the algorithms for string matching need to build an index of the text in order to search quickly. This is great when you’ve to search more than once into a text, but if you do only once, perhaps (for short texts) brute force matching is great! Doesn’t require additional space – Because brute force matching doesn’t need pre-processing it also doesn’t require more space, which is one cool feature of this algorithm Can be quite effective for short texts and patterns It can be ineffective… If we search the text more than once – As I said in the previous section if you perform the search more than once it’s perhaps better to use another string matching algorithm that builds an index, and it’s faster. It’s slow – In general brute force algorithms are slow and brute force matching isn’t an exception. Final Words String matching is something very special in software development and it is used in various cases, so every developer must be familiar with this topic.
March 27, 2012
by Stoimen Popov
· 61,872 Views · 3 Likes
article thumbnail
Writing a simple named pipes server in C#
I solved a little problem last night when playing with named pipes. I created a named pipe that writes all output to a file. Named pipes are opened for all users on a single machine. In this post I will show you a simple class that works as a pipe server. In .NET-based languages we can use the System.IO.Pipes namespace classes to work with named pipes. Here is my simple pipe server that writes all client output to file. public class MyPipeServer { public void Run() { var sid = new SecurityIdentifier(WellKnownSidType.WorldSid, null); var rule = new PipeAccessRule(sid, PipeAccessRights.ReadWrite, AccessControlType.Allow); var sec = new PipeSecurity(); sec.AddAccessRule(rule); using (NamedPipeServerStream pipeServer = new NamedPipeServerStream ("testpipe",PipeDirection.InOut, 100, PipeTransmissionMode.Byte, PipeOptions.None, 0, 0, sec)) { pipeServer.WaitForConnection(); var read = 0; var bytes = new byte[4096]; using(var file=File.Open(@"c:\tmp\myfile.dat", FileMode.Create)) while ((read = pipeServer.Read(bytes, 0, bytes.Length)) > 0) { file.Write(bytes, 0, read); file.Flush(); } } } } Real-life pipe scenarios are usually more complex but this simple class is good to get things running like they should be.
March 10, 2012
by Gunnar Peipman
· 30,891 Views
article thumbnail
Configuration Drift
In my previous article on the server lifecycle I mentioned ConfigurationDrift, a term that I’ve either coined, or I’ve forgotten where I originally heard, although most likely I got it from the Puppet Labs folks. Configuration Drift is the phenomenon where running servers in an infrastructure become more and more different as time goes on, due to manual ad-hoc changes and updates, and general entropy. A nice automated server provisioning process as I’ve advocated helps ensure machines are consistent when they are created, but during a given machine’s lifetime it will drift from the baseline, and from the other machines. There are two main methods to combat configuration drift. One is to use automated configuration tools such as Puppet or Chef, and run them frequently and repeatedly to keep machines in line. The other is to rebuild machine instances frequently, so that they don’t have much time to drift from the baseline. The challenge with automated configuration tools is that they only manage a subset of a machine’s state. Writing and maintaining manifests/recipes/scripts is time consuming, so most teams tend to focus their efforts on automating the most important areas of the system, leaving fairly large gaps. There are diminishing returns for trying to close these gaps, where you end up spending inordinate amounts of effort to nail down parts of the system that don’t change very often, and don’t matter very much day to day. On the other hand, if you rebuild machines frequently enough, you don’t need to worry about running configuration updates after provisioning happens. However, this may increase the burden of fairly trivial changes, such as tweaking a web server configuration. In practice, most infrastructures are probably best off using a combination of these methods. Use automated configuration, continuously updated, for the areas of machine configuration where it gives the most benefit, and also ensure that machines are rebuilt frequently. The frequency of rebuilds will vary depending on the nature of the services provided and the infrastructure implementation, and may even vary for different types of machines. For example, machines that provide network services such as DNS may be rebuilt weekly, while those which handle batch processing tasks may be rebuilt on demand. Source: http://kief.com/configuration-drift.html
February 27, 2012
by Kief Morris
· 26,552 Views · 3 Likes
article thumbnail
Algorithm of the Week: Data Compression with Prefix Encoding
Prefix encoding, sometimes called front encoding, is yet another algorithm that tries to remove duplicated data in order to reduce its size. Its principles are simple, however this algorithm tends to be difficult to implement. To understand why, first let’s take a look at its nature. Have a look on the following dictionary. use used useful usefulness uselss uselessly uselessness Instead of keeping all these words in plain text or transferring all them over a network, we can compress (encode) them with prefix encoding. It’s clear that each of these words begins with the prefix “use” which is also the first word from the list. So we can easily compress them into the following array. $data = array( 0 => 'use', 1 => '0d', 2 => '0ful', 3 => '0fully', 4 => '0less', 5 => '0lessly', 6 => '0lessness', ); It’s clear that this is not the best compression and we can go even further by using not only the first word as prefix. $data = array( 0 => 'use', 1 => '0d', 2 => '0ful', 3 => '2ly', 4 => '0less', 5 => '4ly', 6 => '4ness', ); Now the compression is better and the good news is that decompression is a fairly simple process. However the tricky part is compression itself. The problem is that it is quite difficult to chose an appropriate prefix. In our first example this is simple, but most of the time in practice we will have more heterogeneous data. Indeed the process of compression can be very difficult for randomly generated data and the algorithm will be not only slow, but difficult to implement. The good thing is that this algorithm can be used in many cases once we know the data format in advance. So let’s see three examples where this algorithm can be very handy. Application Here are three examples of prefix encoding. As I stated above, the process of compression can be very difficult for random data, so it is a good practice to use it only if you know in advance the format of the input data. Date and time prefixes We humans often skip the first two digits of a year, so for instance we don’t always write 1995 or 1996, but we use the shorter – ‘95 and ‘96. Thus years can be encoded with shorter strings. input: (1991, 1992, 1993, 1994, 1995, 1996 output: (91, 92, 93, 94, 95, 96) The problem is that with small changes of the input stream we can confuse the decoder. Thus if we add years from the 21st century we lose the uniqueness of the data. input: (1998, 1992, 1999, 2011, 2012) output: (98, 92, 99, 11, 12) Now the decoder can decode the last two values as (1911, 1912) as “19” is considered to be the prefix. So we must know in advance that our prefix is absolutely equal for each of the values. If not the encoding format must be different. For instance we can also encode the prefix, with some special marker. input: (1998, 1992, 1932, 1924, 2001, 2012) output: (#19, 98, 92, 32, 24, #20, 01, 12) Once the decoder reads the # character it will know to decode the following number as prefix. This can be used in practice for date and time formats. Let’s say we have some datetime values, but we know that all of them are in the same day. 2012-01-31 15:33:45 2012-01-31 16:12:11 2012-01-31 17:32:35 2012-01-31 18:54:34 Obviously we can omit the date part of these strings and send (keep) only the time. Once again, we must be absolutely sure that all these values are in the same day. If not, we can use the encoding strategy of the previous example. Phone numbers Phone numbers are the typical case of prefix encoding. Not only the international code, but also the mobile network operators use prefixes for their phone numbers. Thus if we have to transfer phone numbers from, let’s say the UK, we can replace the leading “+44” with something shorter. If you happen to code a phone book for a mobile device you can save some space by compressing the data using prefix encoding and thus the user will have more space and will store more phone numbers on his or her mobile device. Phone number prefixes can be also used for database normalization. Thus you can store them in a separate db table and leave only the unique numbers from the phonebook. Geo Coordinates Using the same example from my previous post we can send GEO coordinates by removing a common prefix, for large levels of zoom. Indeed when you need to send lots of markers to your map application you can expect all of these markers to be fairly close to each other in large zoom level. On large zoom levels we can expect markers to have the same prefix. Now the coordinates of those points can have a common prefix, like the example bellow with the Subway stations. LatLon(40.762959,-73.985989) LatLon(40.761886,-73.983629) LatLon(40.762861,-73.981612) LatLon(40.764616,-73.98056) We can see that all of these GEO points have the same prefix (40.76x, -73.98x), so we can send the prefix only once. Prefix: (40.76,-73.98) Data: LatLon(2959,5989) LatLon(1886,3629) LatLon(2861,1612) LatLon(4616,056) These are only three examples of prefix encoding and this algorithm is very useful when transferring homogeneous data. Suffix Encoding Suffix encoding is practically the same algorithm as prefix encoding, with the small difference that we use to encode duplicating suffixes. Like the examples below, suffix encoding can be useful in replacing repeating last name suffixes. Johsnon Clark Jackson Or company names. Apple Inc. Google Inc. Yahoo! Inc. Here we can replace “ Inc.” with something else, but shorter. Related posts: Computer Algorithms: Data Compression with Relative Encoding Computer Algorithms: Data Compression with Run-length Encoding Computer Algorithms: Data Compression with Diagram Encoding and Pattern Substitution Source: http://www.stoimen.com/blog/2012/02/06/computer-algorithms-data-compression-with-prefix-encoding/
February 7, 2012
by Stoimen Popov
· 19,454 Views
article thumbnail
Algorithm of the Week: Data Compression with Relative Encoding
Overview Relative encoding is another data compression algorithm. While run-length encoding, bitmap encoding and diagram and pattern substitution were trying to reduce repeating data, with relative encoding the goal is a bit different. Indeed run-length encoding was searching for long runs of repeating elements, while pattern substitution and bitmap encoding were trying to “map” where the repetitions happen to occur. The only problem with these algorithms is that the input stream of data is not always constructed out of repeating elements. It is clear that if the input stream contains many repeating elements there must be some way of reducing them. However that doesn’t mean that we cannot compress data if there are no repetitions. It all depends on the data. Let’s say we have the following stream to compress. 1, 2, 3, 4, 5, 6, 7 It's hard to imagine how this stream of data can be compressed. The same problem may occur when trying to compress the alphabet. Indeed the letters of the alphabet are the very base of words so it is the minimal part for word construction and therefore hard to compress. Fortunately this isn’t true always. An algorithm that tries to deal with non-repeating data is relative encoding. Let’s see the following input stream – years from a given decade (the 90′s). 1991, 1991, 1999, 1998, 1991, 1993, 1992, 1992 Here we have 39 characters and we can reduce them. A natural approach is to remove the leading “19” as we humans often do. 91, 91, 99, 98, 91, 93, 92, 92 Now we have a shorter string, but we can go even further by keeping only the first year. All other years will as relative to this year. 91, 0, 8, 7, 0, 2, 1, 1 Now the volume of transferred data is reduced a lot (from 39 to 16 – more than 50%). However there are some questions we need to answer first, because the stream wont always be formatted in such a pretty way. How about the next character stream? 91, 94, 95, 95, 98, 100, 101, 102, 105, 110 We see that the value 100 is somehow in the middle of the interval and it is handy to use it as a base value for the relative encoding. Thus the stream above will become: -9, -6, -5, -5, -2, 100, 1, 2, 5, 10 The problem is that we can’t always decide which value will be the base value so easily. What if the data was dispersed in a different way: 96, 97, 98, 99, 100, 101, 102, 103, 999, 1000, 1001, 1002 Now the value of “100” isn’t useful, because compressing the stream will get something like this: -4, -3, -2, -1, 100, 1, 2, 3, 899, 900, 901, 902 To group the relative values around “some” base values will be far more handy. (-4, -3, -2, -1, 100, 1, 2, 3) (-1, 1000, 1, 2) However, to decide which value will be the base value isn’t that easy. Also the encoding format is not so trivial. On the other hand, this type of encoding can be useful in some specific cases as we can see below. Implementation The implementation of this algorithm depends on the specific task and the format of the data stream. Assuming that we have to transfer the stream of years in JSON from a web server to a browser, here’s a short PHP snippet. // JSON: [1991,1991,1999,1998,1999,1998,1995,1997,1994,1993] $years = array(1991,1991,1999,1998,1999,1998,1995,1997,1994,1993); function relative_encoding($input) { $output = array(); $inputLength = count($input); $base = $input[0]; $output[] = $base; for ($i = 1; $i < $inputLength; $i++) { $output[] = $input[$i] - $base; } return $output; } // JSON: [1991,0,8,7,8,7,4,6,3,2] echo json_encode(relative_encoding($years)); Application This algorithm may be very useful in many cases, such as this one: there are plenty of map applications around the web. Some products such as Google Maps, Yahoo! Maps, Bing Maps are quite famous, while there are also very useful open source projects like OpenStreetMap. The web sites using these apps number in the thousands. A typical use case is to transfer lots of Geo coordinates from a web server to a browser using JSON. Indeed any GEO point on Earth is relative to the point (0,0), which is located near the west coast of Africa, however on large zoom levels, when there are tons of markers we can transfer the information with relative encoding. For instance the following diagram shows San Francisco with some markers on it. The coordinates are relative to the point (0,0) on Earth. Map markers can be relative to the (0, 0) point on Earth, which can occasionally be useless. Far more useful may be to encode those markers, relative to the center of the city, thus we can save some space. Relative encoding can be useful for map markers on a large zoom level, however this type of compression can be tricky. For example, when dragging the map and updating the marker array. On the other hand, we must group markers if we have to load more than one city. That’s why we must be careful when implementing it. But it can be very useful – for instance on initial load of the map we can reduce data and speed up the load time. The thing is that with relative encoding we can save only changes to base value (data) – something like version control systems and thus reducing data transfer and load. Here’s a graphical example. In the first case on the diagram below we can see that each item is stored on its own. It doesn’t depend on the adjacent items and it can be completely independent of them. However we can keep full info only for the first item and any other item will be relative to it, like on the diagram bellow. Source: http://www.stoimen.com/blog/2012/01/30/computer-algorithms-data-compression-with-relative-encoding/
January 31, 2012
by Stoimen Popov
· 17,739 Views
article thumbnail
Algorithm of the Week: Data Compression with Diagram Encoding and Pattern Substitution
Two variants of run-length encoding are the diagram encoding and the pattern substitution algorithms. The diagram encoding is actually a very simple algorithm. Unlike run-length encoding, where the input stream must consists of many repeating elements, “aaaaaaaa” for instance, which are very rare in a natural language, there are many so-called “diagrams” in almost any natural language. In plain English there are some diagrams such as “the”, “and”, “ing” (in the word “waiting” for example), “ a”, “ t”, “ e” and many doubled letters. Actually we can extend those diagrams by adding surrounding spaces. Thus we can encode not only “the”, but “ the “, which are 5 characters (2 spaces and 3 letters) with something shorter. On the other hand, as I said, in plain English there are too many doubled letters, which unfortunately aren’t something special for run-length encoding and the compression ratio will be small. Even worse the encoded text may happen to be longer than the input message. Let’s see some examples. Let’s say we’ve to encode the message “successfully accomplished”, which consists of four doubled letters. However to compress it with run-length encoding we’ll need at least 8 characters, which doesn’t help us a lot. // 8 chars replaced by 8 chars!? input: "successfully accomplished" output: "su2ce2sfu2ly a2complished" The problem is that if the input text contains numbers, “2” in particular, we’ve to chose an escape symbol (“@” for example), which we’ll use to mark where the encoded run begins. Thus if the input message is “2 successfully accomplished tasks”, it will be encoded as “2 su@2ce@2sfu@2ly a@2complished tasks”. Now the output message is longer!!! than the input string. // the compressed message is longer!!! input: "2 successfully accomplished" output: "2 su@2ce@2sfu@2ly a@2complished tasks" Again if the input stream contains the escape symbol, we have to find another one, and the problem is that it is often too difficult to find short escape symbol that doesn’t appear in the input text, without a full scan of the text. That is why run-length encoding isn’t a good solution when compressing plain text, where long runs rarely appear. Well, of course, there are exceptions. For example such an exception is the lossy text compression with run-length encoding. It is intuitively clear that compressing text with loss is rarely useful, especially when you’ve to decompress exactly the same text. However there are some cases that lossy compression may be useful. Such case can be removing spaces. Indeed the text “successfully accomplished” brings us exactly the same information as “successfully accomplished”. In this case we can simply remove those spaces. Indeed we can use a marker to indicate the long run of spaces like “successfully@6 accomplished” in order to decompress the input string with absolutely no loss, but we can also throw those symbols away. This desision depends on the goal. Exactly with the same goal in mind we can remove new lines and tabs, only if we’re sure that the sense of the text is preserved. Yet again, a problem is that such long runs don’t happen to occur in random texts. That is why it’s better to use diagram encoding for plain text compression instead of run-length encoding. A Few Questions After understanding the principles of the diagram encoding, let’s see some examples. In the example above it is better to replace doubled letters with something shorter. Let’s say # for “cc”, @ for “ss” and % for “ll”. Thus the input text will be compressed as “su#e@fu%y a#omplished”, which is shorter. But yet again what will happen if the input message contains one of the substitutions? Also we can’t say if there are many doubled letters and enough reasonable substitutions for them. A better approach is to replace patterns. Run-length encoding isn't a good approach for text compression, because long runs rarely appear in a natural language. Pattern Substitution The pattern substitution algorithm is a variant of the diagram encoding. As I said above in plain English a very commonly used pattern can be “ the “, which is five characters long. We can now replace it with something like “$%” for example. In this case the message “I send the message” will become “I send$%message”. However there are some obstacles to overcome. The first problem is that we need to know the language and somehow to define commonly used patterns in a dictionary. What would happen with a message written in some language we don’t know nothing about. Let’s say – Latin like the example bellow. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras venenatis, sapien eget suscipit placerat, justo quam blandit mauris, quis tempor ante sapien sodales augue. Praesent ut mauris quam. Phasellus scelerisque, ante quis consequat tristique, metus turpis consectetur leo, vitae facilisis sapien mi eu sapien. Praesent vitae ligula elit, et faucibus augue. Sed rhoncus sodales dolor ut gravida. In quis augue ac nulla auctor mattis sed sed libero. Donec eget purus eget enim tempor porta vitae eget diam. Mauris aliquet malesuada ipsum, non pulvinar urna vestibulum ac. Donec feugiat velit vitae nunc cursus imperdiet. Donec accumsan faucibus dictum. Phasellus sed mauris sapien. Maecenas mi metus, tincidunt sed rhoncus nec, sodales non sapien. Clearly without knowing Latin it isn’t easy to define which are those commonly used patterns. The thing is that it’s better to use pattern substitution if you know in advance the set of words and characters. The second problem is related to decompression. It is obvious that we need to define a dictionary and this dictionary must be used when decoding the message. It will be great also if we find more patterns longer than three characters. If not, the compression ratio will be low. Unfortunately such patterns aren’t very common in any natural language. Diagram encoding and pattern substitution are far more suitable for text compression than run-length encoding. In fact, pattern substitution is very effective on compressing programming languages. Application It is interesting to answer the question, how to use diagram encoding or patter substitution to compress text in natural language, especially when we don’t know the language in detail? The answer hides in the question. We wont compress natural languages, but machine language. Exactly machine (programming) languages are limited to a smaller sets of words and symbols. Isn’t it true for any programing language? Like PHP, where words like “function”, “while”, “for”, “break”, “switch”, “foreach” happen to be often in use, or HTML with its defined set of tags. Perhaps the best example is CSS, where only the values of the properties can vary. CSS files also tend to have multiple new lines, tabs and spaces, which only humans read. The question here is why should we compress those file types. It’s clear that after the compression they will be completely useless, both for humans and machines. Yes, that is true, but what if we have to store versions of those files into a DB. Kind of a backup. Imagine you’re working for a web hosting company that has to store daily versions of the sites it’s hosting. Thus the volume of stored information even for small companies hosting only few sites can be enormous. The problem is that compressing those files with some conventional compressing tool isn’t a good idea. Thus we’ve to save a copy of the entire site every day, but as we know the difference between daily versions of a site can be small. A version control system is another solution, but then you’ve to store the plain text of the files. Perhaps a better approach is to compress the text using pattern substitution and then saving only differences – kind of version control, which can be done with “relative encoding”. Using the above method we can save lots of disk space and in the same time we can compress/decompress easily. Another good thing is that you can save only changes to the initial files, like version control, which can also be compressed. Implementation The implementation of this algorithm is again on PHP and tries only to describe the main principles of compression. In this case I tried to compress a CSS file using the compression above. Although this example is quite primitive we can see some interesting facts. First of all you only need encoding and decoding dictionaries. Practically the encoding and decoding processes are equal, so you don’t need to implement two different functions. Here in this example a native PHP function is used – str_replace, because the purpose of this algorithm is not to describe pattern substitution techniques, but pattern substitution. It assumes that today’s programming languages have string manipulation functions for the purposes of this task. $str = file_get_contents('large_style_file.css'); $encoding_dict = array( "\n" => '$0', 'text' => '$1', 'color' => '$2', 'display' => '$3', 'font' => '$4', 'width' => '$5', 'height' => '$6', ' ' => '', ); function replace_patterns($input, $dict) { foreach ($dict as $pattern => $replace) { $input = str_replace($pattern, $replace, $input); } return $input; } $result = replace_patterns($str, $encoding_dict); By only replacing few CSS properties I achieved almost 40% of compression ratio (as shown the diagram bellow). The initial file is 202 KB, while compressed it’s only 131 KB. Of course, it all depends on the CSS file, but how about replacing all property names with shorter ones. Perhaps then the compression will be even better. Source: http://www.stoimen.com/blog/2012/01/23/computer-algorithms-data-compression-with-diagram-encoding-and-pattern-substitution/
January 24, 2012
by Stoimen Popov
· 23,835 Views
article thumbnail
Algorithm of the Week: Data Compression with Bitmaps
In my previous post we saw how to compress data consisting of very long runs of repeating elements. This type of compression is known as “run-length encoding” and can be very handy when transferring data with no loss. The problem is that the data must follow a specific format. Thus the string “aaaaaaaabbbbbbbb” can be compressed as “a8b8”. Now a string with length 16 can be compressed as a string with length 4, which is 25% of its initial length without loosing any information. There will be a problem in case the characters (elements) were dispersed in a different way. What would happen if the characters are the same, but they don’t form long runs? What if the string was “abababababababab”? The same length, the same characters, but we cannot use run-length encoding! Indeed using this algorithm we’ll get at best the same string. In this case, however, we can see another fact. The string consists of too many repeating elements, although not arranged one after another. We can compress this string with a bitmap. This means that we can save the positions of the occurrences of a given element with a sequence of bits, which can be easily converted into a decimal value. In the example above the string “abababababababab” can be compressed as “1010101010101010”, which is 43690 in decimals, and even better AAAA in hexadecimal. Thus the long string can be compressed. When decompressing (decoding) the message we can convert again from decimal/hexadecimal into binary and match the occurrences of the characters. Well, the example above is too simple, but let’s say only one of the characters is repeating and the rest of the string consists of different characters like this: “abacadaeafagahai”. Then we can use bitmap only for the character “a” – “1010101010101010” and compress it as “AAAA bcdefghi”. As you can see all the example strings are exactly 16 characters and that is a limitation. To use bitmaps with variable length of the data is a bit tricky and it is not always easy (if possible) to decompress it. Basically bitmap compression saves the positions of an element that is repeated very often in the message! In the other hand bitmap compression is not only applicable on strings. We can compress also arrays, objects or any kind of data. The example from my previous post is very suitable. Then we had to transfer a large array from a server to the client (browser) using JSON. The data then was very suitable for “run-length encoding”. Now let’s assume we have the same data – a set of different years, which this time are dispersed in a different way. $data = array( 0 => 1991, 1 => 1992, 2 => 1993, 3 => 1994, 4 => 1991, 5 => 1992, 6 => 1993, 7 => 1992, 8 => 1991, 9 => 1991, 10 => 1991, 11 => 1992, 12 => 1992, 13 => 1991, 14 => 1991, 15 => 1992, ... ); The JSON will encoded message will be the following (a simple but yet very large javascript array). [1991,1992,1993,1994,1991,1992,1993,1992,1991,1991,1991,1992,1992,1991,1991,1992, ...] However if we use bitmap compression we’ll get a “shorter” array. $data = array( 0 => array(1991, '1000100011100110'), 1 => array(1992, '0100010100011001'), 2 => array(1993, '0010001000000000'), 3 => array(1994, '0001000000000000'), ); Now the JSON is: [[1991,"1000100011100110"],[1992,"0100010100011001"],[1993,"0010001000000000"],[1994,"0001000000000000"]] It is obvious that the compression ratio is getting better and better as the uncompressed data grows. In fact, most of us know bitmap compression from images, because this algorithm is largely used for image compression. We can imagine how successful it can be when compressing black and white images (as black and white can be represented as 0 and 1s). Actually it is used for more than two colors (256 for instance) and again the level of compression is very high. Implementation The following implementation on PHP aims only to illustrate the bitmap compressing algorithm. As we know this algorithm can be applicable for any kind of data structures. // too many repeating "a" characters $msg = 'aazahalavaatalawacamaahakafaaaqaaaiauaacaaxaauaxaaaaaapaayatagaaoafaawayazavaaaazaaabararaaaaakakaaqaarazacajaazavanazaaaeanaaoajauaaaaaxalaraaapabataaavaaab'; function bitmap($message) { $i = 0; $bits = $rest = ''; while ($v = $message[$i]) { if ($v == 'a') { $bits .= '1'; } else { $bits .= '0'; $rest .= $v; } $i++; } return number_format(bindec($bits), 0, '.', '') . $rest;; } echo bitmap($msg); // uncompressed: acaaaaadaaaabalaaeaaaaganaaxakaavawamaasavajawaaaayaauaaadalanagaeaeamaarafalaazaaaiasaanaahaaazaraxaalaahaaawaaajasamahaajaakarapanaakaoakaanawalaacamauaamaal // compressed: 152299251941730035874325065523548237677352452096zhlvtlwcmhkfqiucxuxpytgofwyzvzbrrkkqrzcjzvnzenojuxlrpbtvb Application This algorithm is very useful when there is an element in our data that repeats very often, so you need to investigate the nature of the data you want to compress. Actually because of this fact this algorithm is used for image compression as PNG8 or GIF. Source: http://www.stoimen.com/blog/2012/01/16/computer-algorithms-data-compression-with-bitmaps/
January 17, 2012
by Stoimen Popov
· 20,360 Views
article thumbnail
Smart Batching
How often have we all heard that “batching” will increase latency? As someone with a passion for low-latency systems this surprises me. In my experience when batching is done correctly, not only does it increase throughput, it can also reduce average latency and keep it consistent. Well then, how can batching magically reduce latency? It comes down to what algorithm and data structures are employed. In a distributed environment we are often having to batch up messages/events into network packets to achieve greater throughput. We also employ similar techniques in buffering writes to storage to reduce the number of IOPS. That storage could be a block device backed file-system or a relational database. Most IO devices can only handle a modest number of IO operations per second, so it is best to fill those operations efficiently. Many approaches to batching involve waiting for a timeout to occur and this will by its very nature increase latency. The batch can also get filled before the timeout occurs making the latency even more unpredictable. Figure 1. Figure 1. above depicts decoupling the access to an IO device, and therefore the contention for access to it, by introducing a queue like structure to stage the messages/events to be sent and a thread doing the batching for writing to the device. The Algorithm An approach to batching uses the following algorithm in Java pseudo code: public final class NetworkBatcher implements Runnable { private final NetworkFacade network; private final Queue queue; private final ByteBuffer buffer; public NetworkBatcher(final NetworkFacade network, final int maxPacketSize, final Queue queue) { this.network = network; buffer = ByteBuffer.allocate(maxPacketSize); this.queue = queue; } @Override public void run() { while (!Thread.currentThread().isInterrupted()) { while (null == queue.peek()) { employWaitStrategy(); // block, spin, yield, etc. } Message msg; while (null != (msg = queue.poll())) { if (msg.size() > buffer.remaining()) { sendBuffer(); } buffer.put(msg.getBytes()); } sendBuffer(); } } private void sendBuffer() { buffer.flip(); network.send(buffer); buffer.clear(); } } Basically, wait for data to become available and as soon as it is, send it right away. While sending a previous message or waiting on new messages, a burst of traffic may arrive which can all be sent in a batch, up to the size of the buffer, to the underlying resource. This approach can use ConcurrentLinkedQueue which provides low-latency and avoid locks. However it has an issue in not creating back pressure to stall producing/publishing threads if they are outpacing the batcher whereby the queue could grow out of control because it is unbounded. I’ve often had to wrap ConcurrentLinkedQueue to track its size and thus create back pressure. This size tracking can add 50% to the processing cost of using this queue in my experience. This algorithm respects the single writer principle and can often be employed when writing to a network or storage device, and thus avoid lock contention in third party API libraries. By avoiding the contention we avoid the J-Curve latency profile normally associated with contention on resources, due to the queuing effect on locks. With this algorithm, as load increases, latency stays constant until the underlying device is saturated with traffic resulting in a more "bathtub" profile than the J-Curve. Let’s take a worked example of handling 10 messages that arrive as a burst of traffic. In most systems traffic comes in bursts and is seldom uniformly spaced out in time. One approach will assume no batching and the threads write to device API directly as in Figure 1. above. The other will use a lock free data structure to collect the messages plus a single thread consuming messages in a loop as per the algorithm above. For the example let’s assume it takes 100µs to write a single buffer to the network device as a synchronous operation and have it acknowledged. The buffer will ideally be less than the MTU of the network in size when latency is critical. Many network sub-systems are asynchronous and support pipelining but we will make the above assumption to clarify the example. If the network operation is using a protocol like HTTP under REST or Web Services then this assumption matches the underlying implementation. Best (µs) Average (µs) Worst (µs) Packets Sent Serial 100 500 1,000 10 Smart Batching 100 150 200 1-2 The absolute lowest latency will be achieved if a message is sent from the thread originating the data directly to the resource, if the resource is un-contended. The table above shows what happens when contention occurs and a queuing effect kicks in. With the serial approach 10 individual packets will have to be sent and these typically need to queue on a lock managing access to the resource, therefore they get processed sequentially. The above figures assume the locking strategy works perfectly with no perceivable overhead which is unlikely in a real application. For the batching solution it is likely all 10 packets will be picked up in first batch if the concurrent queue is efficient, thus giving the best case latency scenario. In the worst case only one message is sent in the first batch with the other nine following in the next. Therefore in the worst case scenario one message has a latency of 100µs and the following 9 have a latency of 200µs thus giving a worst case average of 190µs which is significantly better than the serial approach. This is one good example when the simplest solution is just a bit too simple because of the contention. The batching solution helps achieve consistent low-latency under burst conditions and is best for throughput. It also has a nice effect across the network on the receiving end in that the receiver has to process fewer packets and therefore makes the communication more efficient both ends. Most hardware handles data in buffers up to a fixed size for efficiency. For a storage device this will typically be a 4KB block. For networks this will be the MTU and is typically 1500 bytes for Ethernet. When batching, it is best to understand the underlying hardware and write batches down in ideal buffer size to be optimally efficient. However keep in mind that some devices need to envelope the data, e.g. the Ethernet and IP headers for network packets so the buffer needs to allow for this. There will always be an increased latency from a thread switch and the cost of exchange via the data structure. However there are a number of very good non-blocking structures available using lock-free techniques. For the Disruptor this type of exchange can be achieved in as little as 50-100ns thus making the choice of taking the smart batching approach a no brainer for low-latency or high-throughput distributed systems. This technique can be employed for many problems and not just IO. The core of the Disruptor uses this technique to help rebalance the system when the publishers burst and outpace the EventProcessors. The algorithm can be seen inside the BatchEventProcessor. Note: For this algorithm to work the queueing structure must handle the contention better than the underlying resource. Many queue implementations are extremely poor at managing contention. Use science and measure before coming to a conclusion. Batching with the Disruptor The code below shows the same algorithm in action using the Disruptor's EventHandler mechanism. In my experience, this is a very effective technique for handling any IO device efficiently and keeping latency low when dealing with load or burst traffic. public final class NetworkBatchHandler implements EventHander { private final NetworkFacade network; private final ByteBuffer buffer; public NetworkBatchHandler(final NetworkFacade network, final int maxPacketSize) { this.network = network; buffer = ByteBuffer.allocate(maxPacketSize); } public void onEvent(Message msg, long sequence, boolean endOfBatch) throws Exception { if (msg.size() > buffer.remaining()) { sendBuffer(); } buffer.put(msg.getBytes()); if (endOfBatch) { sendBuffer(); } } private void sendBuffer() { buffer.flip(); network.send(buffer); buffer.clear(); } } The endOfBatch parameter greatly simplifies the handling of the batch compared to the double loop in the algorithm above. I have simplified the examples to illustrate the algorithm. Clearly error handling and other edge conditions need to be considered. Separation of IO from Work Processing There is another very good reason to separate the IO from the threads doing the work processing. Handing off the IO to another thread means the worker thread, or threads, can continue processing without blocking in a nice cache friendly manner. I've found this to be critical in achieving high-performance throughput. If the underlying IO device or resource becomes briefly saturated then the messages can be queued for the batcher thread allowing the work processing threads to continue. The batching thread then feeds the messages to the IO device in the most efficient way possible allowing the data structure to handle the burst and if full apply the necessary back pressure, thus providing a good separation of concerns in the workflow. Conclusion So there you have it. Smart Batching can be employed in concert with the appropriate data structures to achieve consistent low-latency and maximum throughput. From http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html
October 26, 2011
by Martin Thompson
· 11,407 Views · 1 Like
article thumbnail
Brute forcing a bin packing problem
Even a basic planning problem, such as bin packing, can be notoriously hard to solve and scale. One might consider the brute force algorithm. Let's take a look at how that algorithm works out on the cloud balance example of Drools Planner: Given a set of servers with different hardware (CPU, memory and network bandwidth) and given a set of processes with different hardware requirements, assign each process to 1 server and minimize the total cost of the active servers. The brute force algorithm is simple: try every combination between processes where each process is assigned to each server. For example, if we have 6 processes (P0, P1, P2, P3, P4, P5) and 2 servers (S0, S1), we'd try these solutions: P0->S0, P1->S0, P2->S0, P3->S0, P4->S0, P5->S0 P0->S0, P1->S0, P2->S0, P3->S0, P4->S0, P5->S1 P0->S0, P1->S0, P2->S0, P3->S0, P4->S1, P5->S0 P0->S0, P1->S0, P2->S0, P3->S0, P4->S1, P5->S1 ... P0->S1, P1->S1, P2->S1, P3->S1, P4->S1, P5->S1 On my machine, it takes 15ms to calculate the score of these 2^6 combinations. When I scale out to 9 processes and 3 servers, which are 3^9 combinations, it becomes 1582ms. So it scales like this: Notice that despite that the number of processes has not even doubled, the running time multiplied by 100! For comparison, I 've added the running time of the First Fit algorithm. And it gets worse: for 12 processes and 4 servers, which are 4^12 combinations, it take more than 17 minutes: What if we want to distribute 3000 processes over 1000 servers? With this kind of scalability, it will take too long. In fact, the brute force algorithm is useless. Luckily, Drools Planner implements several other optimization algorithms, which can handle such loads. If you want to know more about them, take a look at the Drools Planner manual or come to my talk at JUDCon London (31 Oct - 1 Nov). This article was originally posted on the Drools & jBPM blog.
September 26, 2011
by Geoffrey De Smet
· 9,956 Views
article thumbnail
A collection with billions of entries
There are a number of problems with having a large number of records in memory. One way around this is to use direct memory, but this is too low level for most developers. Is there a way to make this more friendly? Limitations of large numbers of objects The overhead per object is between 12 and 16 bytes for 64-bit JVMs. If the object is relatively small, this is significant and could be more than the data itself. The GC pause time increases with the number of objects. Pause times can be around one second per GB of objects. Collections and arrays only support two billion elements Huge collections One way to store more data and still follow object orientated principles is have wrappers for direct ByteBuffers. This can be tedious to write, but very efficient. What would be ideal is to have these wrappers generated automatically. Small JavaBean Example This is an example of JavaBean which would have far more overhead than actual data contained. interface MutableByte { public void setByte(byte b); public byte getByte(); } It is also small enough that I can create billions of these on my machine. This example creates a List with 16 billion elements. final long length = 16_000_000_000L; HugeArrayList hugeList = new HugeArrayBuilder() {{ allocationSize = 4 * 1024 * 1024; capacity = length; }.create(); List list = hugeList; assertEquals(0, list.size()); hugeList.setSize(length); // add a GC to see what the GC times are like. System.gc(); assertEquals(Integer.MAX_VALUE, list.size()); assertEquals(length, hugeList.longSize()); byte b = 0; for (MutableByte mb : list) mb.setByte(b++); b = 0; for (MutableByte mb : list) { byte b2 = mb.getByte(); byte expected = b++; if (b2 != expected) assertEquals(expected, b2); } From start to finish, the heap memory used is as follows. with -verbosegc 0 sec - 3100 KB used [GC 9671K->1520K(370496K), 0.0020330 secs] [Full GC 1520K->1407K(370496K), 0.0063500 secs] 10 sec - 3885 KB used 20 sec - 4428 KB used 30 sec - 4428 KB used ... deleted ... 1380 sec - 4475 KB used 1390 sec - 4476 KB used 1400 sec - 4476 KB used 1410 sec - 4476 KB used The only GC is one triggered explicitly. Without the System.gc(); no GC logs appear. After 20 sec, the increase in memory used is from logging how much memory was used. Conclusion The library is relatively slow. Each get or set takes about 40 ns which really adds up when there are so many calls to make. I plan to work on it so it is much faster. ;) On the upside, it wouldn't be possible to create 16 billion objects with the memory I have, nor could it be put in an ArrayList, so having it a little slow is still better than not working at all. From http://vanillajava.blogspot.com/2011/08/collection-with-billions-of-entries.html
August 10, 2011
by Peter Lawrey
· 17,382 Views
article thumbnail
Human Readable vs Machine Readable Formats
Most file/serialization formats can be broadly broking into two formats, Human Readable Text and Machine Readble Binary. The Human Readable formats have the advantage of being easily understood by a person reading them. Machine readable formats are easier/faster for a machine to encode/decode. There are formats which attempt to be a little of both. XML, JSon, CSV are examples of these. However these do not achieve close to the performance a binary format can achieve. Myth: Machine Readable Binary is always more compact than a Human Readable Binary can be more compact, however the obscurity of its format makes it difficult to ensure every byte counts. i.e. its usually hard enough getting something work. Making it compact as well is an added complication. However with Human Readable formats, determing how the format can be made more compact is more easily understood. As text: 38 bytes long, [-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] As binary: 290 bytes long, ....sr..java.util.ArrayListx.....a....I..sizexp....w.....sr..java.lang.Long; .....#....J..valuexr..java.lang.Number...........xp........sq.~..........sq.~ ..........sq.~..........sq.~..........sq.~..........sq.~..........sq.~ ..........sq.~..........sq.~..........sq.~..........sq.~..........x Even though the first format is more compact, you can immedately see you could drop the [ ] and spaces after the ", " to make it more compact. With the binary formats, it is hard to know where to start. ComparingHumanReadableToBinaryMain.java List longs = new ArrayList(); for(long i=-1;i<=10;i++) longs.add(i); String asText = longs.toString(); byte[] bytes1 = asText.getBytes(); System.out.println("As text: "+ bytes1.length+" bytes long, "+asText); ByteArrayOutputStream baos = new ByteArrayOutputStream(); ObjectOutputStream oos = new ObjectOutputStream(baos); oos.writeObject(longs); oos.close(); byte[] bytes2 = baos.toByteArray(); System.out.println("As binary: "+bytes2.length+" bytes long, " +new String(bytes2, 0).replaceAll("[^\\p{Graph}]", ".")); Myth: Machine Readable Binary is always faster than a Human Readable Its assumed the cost of parsing data in a human readable format always makes it slower, however machine sreadbale formats have to deal with an issue human readbale formats takes for granted, that is byte endianness. For human readable formats the order of digits is fairly obvious, however for machine formats the byte endianess of the data might not match that the natrual byte order of the CPU, leading to a source of overhead (as it has to swap the bytes around) One example of this is using big-endian (e.g. TCP/Network byte order) on a little endian machine e.g. Windows/Linux Intel/AMD. A common class which has this issue is DataInputStream and DataOutputStream which re-arranges the byte order (even if the native byte order matches) For this reason, a fast human readable parse can be as fast or faster. In an earlier article I showed how a Human Readable format could be used to read/write integers 30% faster than using DataInput/DataOuput. Writing human readable data faster than binary. Myth: Using a Human Readable Format makes it easy to read Just using a human readable format doesn't mean it will be easier to read than a machine readable format. Reusing existing tools as much as possible makes human readable format preferrable. However, machine readable formats can come with tools which decode the data and make maintain it easier. If you have data which can only be managed with the use of specialist tools, being human readable is not much advantage. Images are a good example of where a machine readable format is the best option. It is hard to image editing or viewing an image without the need for a specialist tool. A practical human readable format would undoubtably lower the quality of the image. ;) ________/.- ,’_______`-. \ _________\ /`__________\’/ _________ /___’a___a`___\ _________|____,’(_)`.____ | _________\___( ._|_. )___ / __________\___ .__,’___ / __________.-`._______,’-.__ ________,’__,’___`-’___`.__`. _______/____/____V_____\___\_ _____,’____/_____o______\___`.__ ___,’_____|______o_______|_____`. __|_____,’|______o_______|`._____| ___`.__,’_.-\_____o______/-._`.__,’ __________/_`.___o____,’__\_ __.””-._,’_____`._:_,’_____`.,-””._ _/_,-._`_______)___(________’_,-.__\ (_(___`._____,’_____`.______,’___)_) _\_\____\__,’________`.____/.___/_/ On the other hand human readable formats can be almost as obscure. This is a piece of code written in a language I am not worthy of mentioning. ;) Its is descibed as "used to list all of the prime numbers between 1 and R" (!R)@&{&/x!/:2_!x}'!R Conclusion If you are designing a file format, start with a human readable one as its much easier to understand. If this is not compact enough, consider compressing it. If it is not fast enough concider making it a binary format, but make sure it really is faster to use such a format. If you are going to use a binary format make sure you have tools in place to supprot viewing (possibly editing) the data (which you would get for free with a text format) From http://vanillajava.blogspot.com/2011/07/human-readable-vs-machine-readble.html
July 13, 2011
by Peter Lawrey
· 21,598 Views · 5 Likes
article thumbnail
Developing Android Apps with NetBeans, Maven, and VirtualBox
I am an experienced Java developer who has used various IDEs and prefer NetBeans IDE over all others by a long shot. I am also very fond of Maven as the tool to simplify and automate nearly every aspect of the development of my Java project throughout its lifecycle. Recently, I started developing Android applications and naturally I looked for a Maven plugin that would manage my Android projects. Luckily I found the maven-android-plugin which worked like a charm and allowed me to use Maven for developing my Android projects. The Android Emulator from the Android SDK seemed unusably slow. Lucklily, I found a way to use an Android Virtual Machine for VirtualBox that worked nearly as fast as my native computer! This page documents my experiences. Tested Environment Dev machine: Ubuntu 11.04 Linux IDE: NetBeans VirtualBox: 4.0.8 r71778 Android SDK Revision 11, Add on XML Schema #1, Repository XML Schema #3 (from About in SDK and AVD Manager) Android Version: 2.2 Overview of Steps Download and install the Android SDK on your dev machine Attach an Android Device to dev machine Configure and load your device for development and other use Create an initial Android maven project Connect Android Device to Android SDK Debug Android app using NetBeans Graphical Debuger Download and Install Android SDK Download and install the Android SDK on your dev machine as described here. Make sure to set the following in dev machine ~/.bashrc file: export ANDROID_HOME=$HOME/android-sdk-linux_x86 #Change as needed export PATH="$ANDROID_HOME/tools:$ANDROID_HOME/platform-tools:$PATH" Attaching an Android Device to Dev Machine If you have an actual device that is usually always best. If not, you must use a virtual Android device which usually has various limitations (e.g. no GPS, Camera etc.). The Android SDK makes it easy to create a new Virtual Device but the resulting device is painfully slow in my experience and not usable. Do not bother with this. Instead, create a virtual Android device using VirtualBox as described in the following steps: Install virtual box and initial Android VM as described here: http://androidspin.com/2011/01/24/howto-install-android-x86-2-2-in-virtualbox/ http://geeknizer.com/how-to-run-google-android-in-virtualbox-vmware-on-netbooks/ Configure Android VM so it is connected bidirectionally with your dev machine over TCP as described here: http://stackoverflow.com/questions/61156/virtualbox-host-guest-network-setup I used the approach of configuring a HOST ONLY network adapater and a second NAT adapter on the Android VM within virtual box. Configuring your Android Device This section describes various things I did to setup a dev environment for my Android device: Root the device. I used Universal AndRoot Install ConnectBot so you have ssh and related network utilities Creating Initial Android Maven Application Create initial project using instructions here. I found it best to create stub project structure using the maven-archtype-plugin and the archtypes at https://github.com/akquinet/android-archetypes/wiki Connecting Android VM Device to Android SDK In order for your code to be deployed from NetBeans IDE to Android Device and in order for you to monitor your deployed app from the Dalvik Debug Monitor (ddms) you need to connect your android VM device to the android sdk over TCP as described in the following steps. On Android Device open the Terminal Emulator Type su to become root (your device must be rooted for this Type following commands in root shell: setprop service.adb.tcp.port 5555 stop adbd start adbd Type the following commands on dev machine shell. TODO: Note that IP address below is whatever is the ip address associated with the device (see ifconfig on linux for device vboxnet0) adb tcpip 5555 adb connect 192.168.0.101:5555 For details on above steps see: http://stackoverflow.com/questions/2604727/how-can-i-connect-to-android-with-adb-over-tcp Set up port forwarding as described here http://redkrieg.com/2010/10/11/adb-over-ssh-fun-with-port-forwards/ (this is where I am most fuzzy) Build your maven android project using Right-Click / Clean and Build Now for the acid test whether you can deploy your app to the device from NetBeans IDE! Right-click / Custom / Goal to show Run Maven dialog. Enter android:deploy in Goals field. Select Remember As button and enter android:deploy for its text field. If all is well, the app will deploy to the device and will show up in its "Applications" screen. Debugging Android App Using NetBeans Graphical Debugger Once you can build and deploy your app to the real or virtual Android device, here are the steps to debug the app using NetBeans debugger: On Device: Start the app (TODO: determine how to start app on device with JVM options so it can wait for debugger connection. This should be easy) On Dev Machine run Dalvik Debug Monitor (ddms) in background: $ANDROID_HOME/tools/ddms & Lookup your app in ddms and get its debug port. This is described here but does not address NetBeans specifically In NetBeans do: Debug / Attach Debugger and specify the port looked up in ddms in previous step. You may leave rest of the fields with defaults. Click OK
June 18, 2011
by Farrukh Najmi
· 173,489 Views
article thumbnail
Clustering Tomcat Servers with High Availability and Disaster Fallback
There has been a lot of buzz lately on high-availability and clustering. Most developers don't care and why should they? These features should be transparent to the application architecture and not something of concern to the developers of that application. But knowledge never hurts, so I emerged myself into the world of load balancing, heartbeats and virtual IP addresses. And you know what? Next time we need a infrastructure like this, I can at least sit down with the guys from the infrastructure department and at least know what the hell they are talking about. So what exactly is a high-availability clustered infrastructure (HACI, as I'll call it from now on) ? In essence, it should be a zero-downtime infrastructure (or at least perceived as one by the end user, which means never ever returning a default browser 404 page), capable of horizontal scaling when the need for it arises and without a single point of failure. It's the SLA writer's dream. A basic HACI setup looks like this: The users enters through a virtual IP address, assigned to one of the two load balancers. Only one of the load-balancers is active (the active master, LB1), the other one is there in the event LB1 fails ((LB2, a passive slave). The two load balancers are redundant, ie. having the exact same configuration. The load balancers redirect all traffic to the real servers. This can be done through round-robin assignment or through other means like sticky sessions, where the same user is redirected to the same server each and every time within a session. Servers can be added at any moment and configured on the load balancers. Ideally, the load balancer configuration is aware of the hardware specification and balances the load accordingly, but that's beyond the scope of this article (it involves adding weights). If all servers balanced by the load balancer fail, a backup server should be used to redirect all traffic coming from the load balancer. This can be a very lightweight server, which purpose is only to provide a sensible error page to the user (something like 'Sorry, we are performing maintenance'). Again, perception and immediate feedback to the user is key. You don't want to show the user a plain 404 page. Off course, if the backup server goes down too, you're in trouble (off course, by that time, warning bells should have gone off on every level in the hierarchy). So how to achieve this with as little effort as possible? If you want to try this out, I suggest you start by installing a virtual machine like VirtualBox or VMWare. This way you can try out the configuration yourself. In this example, I'll be load-balancing 3 Tomcat servers using sticky sessions using 2 load balancers in active-passive mode. I'm assuming all 3 Tomcat servers share the same hardware configuration, so they are all able to handle the same amount of traffic each. I'm also throwing in a backup server, in case all 3 Tomcat servers go down (serving a custom 503 page kindly informing the user of a catastrophic failure, instead of dropping the standard 404 bomb). You want to start off by assigning IP addresses to the servers. This will make your life a bit easier. We'll need 7 addresses: 3 for the tomcat server, 1 for the backup server, 2 for the loadbalancers and 1 virtual IP address to be shared between the load balancers (and which will be the entry point for your users). So our assignment will be: Virtual IP 10.0.5.99 www.haci.local LB1 10.0.5.100 lb1.haci.local #MASTER LB2 10.0.5.101 lb2.haci.local #SLAVE WEB1 10.0.5.102 web1.haci.local WEB2 10.0.5.103 web2.haci.local WEB3 10.0.5.104 web3.haci.local BACKUP 10.0.5.105 backup.haci.local Setting up the web servers is easy. You just install Tomcat on each server and create a simple JSP file to be served to users (make a small change, like the background color, on each server to distinguish the servers). I won't be covering session replication between the Tomcat servers, as it'll take me too far. If you want, you can configure the appropriate session replication and storage (using multicast or JDBC for example). The backup server I'm using is a basic LAMP server that returns a simple 503 page on every request it gets. The 503 error code is important, because it reflects the current state of the system: currently unavailable. For the loadbalancers I'll be using 2 applications: HAProxy and keepalived. HAProxy is going to handle load balancing, while keepalived will handle the failover between the two load balancers. First, we're going to configure HAProxy for both LB1 and LB2. Installing HAProxy is quite easy on an ubuntu system. Just do a sudo apt-get install haproxy and you're off. After the install, backup the current HAProxy config and start editing away. cp /etc/haproxy.cfg /etc/haproxy.cfg_orig cat /dev/null > /etc/haproxy.cfg vi /etc/haproxy.cfg The content of the config to reflect our setup should become something like this (same config on LB1 and LB2): global log 127.0.0.1 local0 log 127.0.0.1 local1 notice #log loghost local0 info maxconn 4096 #debug #quiet user haproxy group haproxy defaults log global mode http option httplog option dontlognull retries 3 redispatch maxconn 2000 contimeout 5000 clitimeout 50000 srvtimeout 50000 frontend http-in bind 10.0.5.99:80 default_backend servers backend servers mode http stats enable stats auth someuser:somepassword balance roundrobin cookie JSESSIONID prefix option httpclose option forwardfor option httpchk HEAD /check.txt HTTP/1.0 server web1 10.0.5.102:80 cookie haci_web1 check server web2 10.0.5.103:80 cookie haci_web2 check server web3 10.0.5.104:80 cookie haci_web3 check server webbackup 10.0.5.105:80 backup After this, enable HAProxy on both LB1 and LB2 by editing /etc/defaults/haproxy # Set ENABLED to 1 if you want the init script to start haproxy. ENABLED=1 # Add extra flags here. #EXTRAOPTS="-de -m 16" So far for the HAProxy configuration. We can't start it up yet, as LB1 and LB2 aren't listening yet on the virtual IP address. Next we'll configure the failover of the loadbalancers using keepalived. Installing it on Ubuntu is as easy as it was for HAProxy: sudo apt-get install keepalived. But its configuration is slightly different on both load balancers. First, we need to configure the both servers to be able to listen to the shared IP address. Add the following line to /etc/sysctl.conf: net.ipv4.ip_nonlocal_bind=1 And run sysctl -p Now, we configure keepalived so that LB1 is configured as the main load balancer and binds to the shared IP address, while LB2 is on standby, ready to take over whenever LB1 goes down. The configuration for LB1 looks like this (edit /etc/keepalived/keepalived.conf): vrrp_script chk_haproxy { # Requires keepalived-1.1.13 script "killall -0 haproxy" # cheaper than pidof interval 2 # check every 2 seconds weight 2 # add 2 points of prio if OK } vrrp_instance VI_1 { interface eth0 state MASTER virtual_router_id 51 priority 101 # 101 on master, 100 on backup virtual_ipaddress { 10.0.5.99 } track_script { chk_haproxy } } Start up keepalived and check whether it is listening to the virtual IP address. /etc/init.d/keepalived start ip addr sh eth0 It should return something like this, indicating it is listening to the virtual IP address 2: eth0: mtu 1500 qdisc pfifo_fast qlen 1000 link/ether 00:0c:29:a5:5b:93 brd ff:ff:ff:ff:ff:ff inet 10.0.5.100/24 brd 10.0.5.255 scope global eth0 inet 10.0.5.99/32 scope global eth0 inet6 fe80::20c:29ff:fea5:5b93/64 scope link valid_lft forever preferred_lft forever Next, we configure LB2. The configuration is almost the same, exception for the priority. vrrp_script chk_haproxy { # Requires keepalived-1.1.13 script "killall -0 haproxy" # cheaper than pidof interval 2 # check every 2 seconds weight 2 # add 2 points of prio if OK } vrrp_instance VI_1 { interface eth0 state MASTER virtual_router_id 51 priority 100 # 101 on master, 100 on backup virtual_ipaddress { 10.0.5.99 } track_script { chk_haproxy } } Start up keepalived and check the network interface. /etc/init.d/keepalived start ip addr sh eth0 It should return something like this, indicating it is not listening to the virtual IP address 2: eth0: mtu 1500 qdisc pfifo_fast qlen 1000 link/ether 00:0c:29:a5:5b:93 brd ff:ff:ff:ff:ff:ff inet 10.0.5.101/24 brd 10.0.5.255 scope global eth0 inet6 fe80::20c:29ff:fea5:5b93/64 scope link valid_lft forever preferred_lft forever Now, start up HAProxy on both LB1 and LB2. /etc/init.d/haproxy start Now you can issue requests to 10.0.5.99 (or www.haci.local), which will go to LB1, which in turn will load-balance the request to either WEB1, WEB2 and WEB3. You can test the load balancing by turning off WEB1 (or the server you're currently on). You can also the backup server by turning all main webservers (WEB1, WEB2 and WEB3). And you can test the loadbalancer failover by turning off LB1. At that point LB2 will kick in and act as the master, loadbalancing all requests. When you turn LB1 back on, it'll take over the master role once again. HAProxy allows you to add extra servers very easily, reloading the configuration without breaking existing sessions. See the HAProxy documentation for more info or on ServerFault. (http://serverfault.com/questions/165883/is-there-a-way-to-add-more-backend-server-to-haproxy-without-restarting-haproxy). Cheap and effective. While most enterprise shops have hardware load balancers, which also have these possibilities and more, if you're on a tight budget or need to simulate a HACI environment for development purposes (a lesson here: always simulate your production environment when you're testing during development), this might be the sane option. To finish, I'll quickly explain how to set up the backup server (a simple LAMP server). Create a vhost configuration on the apache for www.haci.local or any other domain pointing to the virtual IP address and set up mod_rewrite for it: RewriteEngine On RewriteCond %{REQUEST_URI} !\.(css|gif|ico|jpg|js|png|swf|txt)$ [NC] RewriteConf %{REQUEST_URI} !/503.php RewriteRule .* /503.php [L] Then create the 503.php file and add this to the top of it: Sorry, our servers are currently undergoing maintenance. Please check back with us in a while. Thank you for your patience. You can decorate the 503.php file any way you like. You can even use CSS, JavaScript and image files in the php file. Now, back to my IDE. I'm getting withdrawal symptoms.
March 11, 2011
by Lieven Doclo
· 58,005 Views
article thumbnail
Apache Solr: Get Started, Get Excited!
we've all seen them on various websites. crappy search utilities. they are a constant reminder that search is not something you should take lightly when building a website or application. search is not just google's game anymore. when a java library called lucene was introduced into the apache ecosystem, and then solr was built on top of that, open source developers began to wield some serious power when it came to customizing search features. in this article you'll be introduced to apache solr and a wealth of applications that have been built with it. the content is divided as follows: introduction setup solr applications summary 1. introduction apache solr is an open source search server. it is based on the full text search engine called apache lucene . so basically solr is an http wrapper around an inverted index provided by lucene. an inverted index could be seen as a list of words where each word-entry links to the documents it is contained in. that way getting all documents for the search query "dzone" is a simple 'get' operation. one advantage of solr in enterprise projects is that you don't need any java code, although java itself has to be installed. if you are unsure when to use solr and when lucene, these answers could help. if you need to build your solr index from websites, you should take a look into the open source crawler called apache nutch before creating your own solution. to be convinced that solr is actually used in a lot of enterprise projects, take a look at this amazing list of public projects powered by solr . if you encounter problems then the mailing list or stackoverflow will help you. to make the introduction complete i would like to mention my personal link list and the resources page which lists books, articles and more interesting material. 2. setup solr 2.1. installation as the very first step, you should follow the official tutorial which covers the basic aspects of any search use case: indexing - get the data of any form into solr. examples: json, xml, csv and sql-database. this step creates the inverted index - i.e. it links every term to its documents. querying - ask solr to return the most relevant documents for the users' query to follow the official tutorial you'll have to download java and the latest version of solr here . more information about installation is available at the official description . next you'll have to decide which web server you choose for solr. in the official tutorial, jetty is used, but you can also use tomcat. when you choose tomcat be sure you are setting the utf-8 encoding in the server.xml . i would also research the different versions of solr, which can be quite confusing for beginners: the current stable version is 1.4.1. use this if you need a stable search and don't need one of the latest features. the next stable version of solr will be 3.x the versions 1.5 and 2.x will be skipped in order to reach the same versioning as lucene. version 4.x is the latest development branch. solr 4.x handles advanced features like language detection via tika, spatial search , results grouping (group by field / collapsing), a new "user-facing" query parser ( edismax handler ), near real time indexing, huge fuzzy search performance improvements, sql join-a like feature and more. 2.2. indexing if you've followed the official tutorial you have pushed some xml files into the solr index. this process is called indexing or feeding. there are a lot more possibilities to get data into solr: using the data import handler (dih) is a really powerful language neutral option. it allows you to read from a sql database, from csv, xml files, rss feeds, emails, etc. without any java knowledge. dih handles full-imports and delta-imports. this is necessary when only a small amount of documents were added, updated or deleted. the http interface is used from the post tool, which you have already used in the official tutorial to index xml files. client libraries in different languages also exist. (e.g. for java (solrj) or python ). before indexing you'll have to decide which data fields should be searchable and how the fields should get indexed. for example, when you have a field with html in it, then you can strip irrelevant characters , tokenize the text into 'searchable terms', lower case the terms and finally stem the terms . in contrast, if you would have a field with text in it that should not be interpreted (e.g. urls) you shouldn't tokenize it and use the default field type string. please refer to the official documentation about field and field type definitions in the schema.xml file. when designing an index keep in mind the advice from mauricio : "the document is what you will search for. " for example, if you have tweets and you want to search for similar users, you'll need to setup a user index - created from the tweets. then every document is a user. if you want to search for tweets, then setup a tweet index; then every document is a tweet. of course, you can setup both indices with the multi index options of solr. please also note that there is a project called solr cell which lets you extract the relevant information out of several different document types with the help of tika. 2.3. querying for debugging it is very convenient to use the http interface with a browser to query solr and get back xml. use firefox and the xml will be displayed nicely: you can also use the velocity contribution , a cross-browser tool, which will be covered in more detail in the section about 'search application prototyping' . to query the index you can use the dismax handler or standard query handler . you can filter and sort the results: q=superman&fq=type:book&sort=price asc you can also do a lot more ; one other concept is boosting. in solr you can boost while indexing and while querying. to prefer the terms in the title write: q=title:superman^2 subject:superman when using the dismax request handler write: q=superman&qf=title^2 subject check out all the various query options like fuzzy search , spellcheck query input , facets , collapsing and suffix query support . 3. applications now i will list some interesting use cases for solr - in no particular order. to see how powerful and flexible this open source search server is. 3.1. drupal integration the drupal integration can be seen as generic use case to integrate solr into php projects. for the php integration you have the choice to either use the http interface for querying and retrieving xml or json. or to use the php solr client library . here is a screenshot of a typical faceted search in drupal : for more information about faceted search look into the wiki of solr . more php projects which integrates solr: open source typo3- solr module magento enterprise - solr module . the open source integration is out dated. oxid - solr module . no open source integration available. 3.2. hathi trust the hathi trust project is a nice example that proves solr's ability to search big digital libraries. to quote directly from the article : "... the index for our one million book index is over 200 gigabytes ... so we expect to end up with a two terabyte index for 10 million books" other examples for libraries: vufind - aims to replace opac internet archive national library of australia 3.3. auto suggestions mainly, there are two approaches to implement auto-suggestions (also called auto-completion) with solr: via facets or via ngramfilterfactory . to push it to the extreme you can use a lucene index entirely in ram. this approach is used in a large music shop in germany. live examples for auto suggestions: kaufda.de 3.4. spatial search applications when mentioning spatial search, people have geographical based applications in mind. with solr, this ordinary use case is attainable . some examples for this are : city search - city guides yellow pages kaufda.de spatial search can be useful in many different ways : for bioinformatics, fingerprints search, facial search, etc. (getting the fingerprint of a document is important for duplicate detection). the simplest approach is implemented in jetwick to reduce duplicate tweets, but this yields a performance of o(n) where n is the number of queried terms. this is okay for 10 or less terms, but it can get even better at o(1)! the idea is to use a special hash set to get all similar documents. this technique is called local sensitive hashing . read this nice paper about 'near similarity search and plagiarism analysis' for more information. 3.5. duckduckgo duckduckgo is made with open source and its "zero click" information is done with the help of solr using the dismax query handler: the index for that feature contains 18m documents and has a size of ~12gb. for this case had to tune solr: " i have two requirements that differ a bit from most sites with respect to solr: i generally only show one result, with sometimes a couple below if you click on them. therefore, it was really important that the first result is what people expected. false positives are really bad in 0-click, so i needed a way to not show anything if a match wasn't too relevant. i got around these by a) tweaking dismax and schema and b) adding my own relevancy filter on top that would re-order and not show anything in various situations. " all the rest is done with tuned open source products. to quote gabriel again: "the main results are a hybrid of a lot of things, including external apis, e.g. bing, wolframalpha, yahoo, my own indexes and negative indexes (spam removal), etc. there are a bunch of different types of data i'm working with. " check out the other cool features such as privacy or bang searches . 3.6. clustering support with carrot2 carrot2 is one of the "contributed plugins" of solr. with carrot2 you can support clustering : " clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. " see some research papers regarding clustering here . here is one visual example when applying clustering on the search "pannous" - our company : 3.7. near real time search solr isn't real time yet, but you can tune solr to the point where it becomes near real time, which means that the time ('real time latency') that a document takes to be searchable after it gets indexed is less than 60 seconds even if you need to update frequently. to make this work, you can setup two indices. one write-only index "w" for the indexer and one read-only index "r" for your application. index r refers to the same data directory of w, which has to be defined in the solrconfig.xml of r via: /pathto/indexw/data/ to make sure your users and the r index see the indexed documents of w, you have to trigger an empty commit every 60 seconds: wget -q http://localhost:port/solr/update?stream.body=%3ccommit/%3e -o /dev/null everytime such a commit is triggered a new searcher without any cache entries is created. this can harm performance for visitors hitting the empty cache directly after this commit, but you can fill the cache with static searches with the help of the newsearcher entry in your solrconfig.xml. additionally, the autowarmcount property needs to be tuned, which fills the cache with a newsearcher from old entries. also, take a look at the article 'scaling lucene and solr' , where experts explain in detail what to do with large indices (=> 'sharding') and what to do for high query volume (=> 'replicating'). 3.8. loggly = full text search in logs feeding log files into solr and searching them at near real-time shows that solr can handle massive amounts of data and queries the data quickly. i've setup a simple project where i'm doing similar things , but loggly has done a lot more to make the same task real-time and distributed. you'll need to keep the write index as small as possible otherwise commit time will increase too great. loggly creates a new solr index every 5 minutes and includes this when searching using the distributed capabilities of solr ! they are merging the cores to keep the number of indices small, but this is not as simple as it sounds. watch this video to get some details about their work. 3.9. solandra = solr + cassandra solandra combines solr and the distributed database cassandra , which was created by facebook for its inbox search and then open sourced. at the moment solandra is not intended for production use. there are still some bugs and the distributed limitations of solr apply to solandra too. tthe developers are working very hard to make solandra better. jetwick can now run via solandra just by changing the solrconfig.xml. solandra also has the advantages of being real-time (no optimize, no commit!) and distributed without any major setup involved. the same is true for solr cloud. 3.10. category browsing via facets solr provides facets , which make it easy to show the user some useful filter options like those shown in the "drupal integration" example. like i described earlier , it is even possible to browse through a deep category tree. the main advantage here is that the categories depend on the query. this way the user can further filter the search results with this category tree provided by you. here is an example where this feature is implemented for one of the biggest second hand stores in germany. a click on 'schauspieler' shows its sub-items: other shops: game-change 3.11. jetwick - open twitter search you may have noticed that twitter is using lucene under the hood . twitter has a very extreme use case: over 1,000 tweets per second, over 12,000 queries per second, but the real-time latency is under 10 seconds! however, the relevancy at that volume is often not that good in my opinion. twitter search often contains a lot of duplicates and noise. reducing this was one reason i created jetwick in my spare time. i'm mentioning jetwick here because it makes extreme use of facets which provides all the filters to the user. facets are used for the rss-alike feature (saved searches), the various filters like language and retweet-count on the left, and to get trending terms and links on the right: to make jetwick more scalable i'll need to decide which of the following distribution options to choose: use solr cloud with zookeeper use solandra move from solr to elasticsearch which is also based on apache lucene other examples with a lot of facets: cnet reviews - product reviews. electronics reviews, computer reviews & more. shopper.com - compare prices and shop for computers, cell phones, digital cameras & more. zappos - shoes and clothing. manta.com - find companies. connect with customers. 3.12. plaxo - online address management plaxo.com , which is now owned by comcast, hosts web addresses for more than 40 million people and offers smart search through the addresses - with the help of solr. plaxo is trying to get the latest 'social' information of your contacts through blog posts, tweets, etc. plaxo also tries to reduce duplicates . 3.13. replace fast or google search several users report that they have migrated from a commercial search solution like fast or google search appliance (gsa) to solr (or lucene). the reasons for that migration are different: fast drops linux support and google can make integration problems. the main reason for me is that solr isn't a black box —you can tweak the source code, maintain old versions and fix your bugs more quickly! 3.14. search application prototyping with the help of the already integrated velocity plugin and the data import handler it is possible to create an application prototype for your search within a few hours. the next version of solr makes the use of velocity easier. the gui is available via http://localhost:port/solr/browse if you are a ruby on rails user, you can take a look into flare. to learn more about search application prototyping, check out this video introduction and take a look at these slides. 3.15. solr as a whitelist imagine you are the new google and you have a lot of different types of data to display e.g. 'news', 'video', 'music', 'maps', 'shopping' and much more. some of those types can only be retrieved from some legacy systems and you only want to show the most appropriated types based on your business logic . e.g. a query which contains 'new york' should result in the selection of results from 'maps', but 'new yorker' should prefer results from the 'shopping' type. with solr you can set up such a whitelist-index that will help to decide which type is more important for the search query. for example if you get more or more relevant results for the 'shopping' type then you should prefer results from this type. without the whitelist-index - i.e. having all data in separate indices or systems, would make it nearly impossible to compare the relevancy. the whitelist-index can be used as illustrated in the next steps. 1. query the whitelist-index, 2. decide which data types to display, 3. query the sub-systems and 4. display results from the selected types only. 3.16. future solr is also useful for scientific applications, such as a dna search systems. i believe solr can also be used for completely different alphabets so that you can query nucleotide sequences - instead of words - to get the matching genes and determine which organism the sequence occurs in, something similar to blast . another idea you could harness would be to build a very personalized search. every user can drag and drop their websites of choice and query them afterwards. for example, often i only need stackoverflow, some wikis and some mailing lists with the expected results, but normal web search engines (google, bing, etc.) give me results that are too cluttered. my final idea for a future solr-based app could be a lucene/solr implementation of desktop search. solr's facets would be especially handy to quickly filter different sources (files, folders, bookmarks, man pages, ...). it would be a great way to wade through those extra messy desktops. 4. summary the next time you think about a problem, think about solr! even if you don't know java and even if you know nothing about search: solr should be in your toolbox. solr doesn't only offer professional full text search, it could also add valuable features to your application. some of them i covered in this article, but i'm sure there are still some exciting possibilities waiting for you!
January 25, 2011
by Peter Karussell
· 147,304 Views
article thumbnail
Setting mouse cursor position with WinAPI
Setting the mouse cursor position on a Windows machine with the help of .NET Framework shouldn't be that big of a problem. After all, there is the built-in Cursor class that lets you do that by executing a simple line of code: Cursor.Position = new System.Drawing.Point(0, 0); Of course, here 0 and 0 are the absolute coordinates for the mouse cursor on the screen. One thing to mention about this type of position setting is that Cursor requires a reference to System.Windows.Forms. And in some cases you don't want this extra reference. If that's the case, WinAPI is your solution. It requires some more work compared to the regular .NET way (class instance -> method call) but at the end you get more control than you would expect. When using WinAPI to set the cursor position, there are two ways you can go: mouse_event SendInput mouse_event is the very basic function that is only able to set the mouse coordinates. It was superseded and Microsoft recomends using SendInput instead. Nonetheless, it still works (although I cannot say for sure whether it will be working in future releases of Windows). So to start, I have a very basic class: class WINAPI_SUPERSEDED { [DllImport("user32.dll",SetLastError=true)] public static extern void mouse_event(uint dwFlags, uint dx, uint dy, uint dwData, int dwExtraInfo); public enum MouseFlags { MOUSEEVENTF_ABSOLUTE = 0x8000, MOUSEEVENTF_LEFTDOWN = 0x0002, MOUSEEVENTF_LEFTUP = 0x0004, MOUSEEVENTF_MIDDLEDOWN = 0x0020, MOUSEEVENTF_MIDDLEUP = 0x0040, MOUSEEVENTF_MOVE = 0x0001, MOUSEEVENTF_RIGHTDOWN = 0x0008, MOUSEEVENTF_RIGHTUP = 0x0010, MOUSEEVENTF_WHEEL = 0x0800, MOUSEEVENTF_XDOWN = 0x0080, MOUSEEVENTF_XUP = 0x0100 } public enum DataFlags { XBUTTON1 = 0x0001, XBUTTON2 = 0x0002 } } So to set the cursor position to 0,0 I would use this: WINAPI_SUPERSEDED.mouse_event((int)WINAPI_SUPERSEDED.MouseFlags.MOUSEEVENTF_MOVE | (int)WINAPI_SUPERSEDED.MouseFlags.MOUSEEVENTF_ABSOLUTE, 0, 0, 0, 0); If you go through the documentation, you will notice that in fact, dx and dy are in no way direct coordinates but rather normalized values ranged between 0 and 65,535 - that is only when the MOUSEEVENTF_ABSOLUTE flag is present. Otherwise, the position will be adjusted according to the current mouse cursor position. This method doesn't return any value so I cannot be informed whether it was successful or not. GetLastError won't give much information either. In this case, SendInput comes to the rescue. Here is what I have defined as the main class: class WINAPI { [DllImport("kernel32.dll")] public static extern uint GetLastError(); public enum MouseData { XBUTTON1 = 0x0001, XBUTTON2 = 0x0002 } public enum MouseFlags { MOUSEEVENTF_ABSOLUTE = 0x8000, MOUSEEVENTF_HWHEEL = 0x01000, MOUSEEVENTF_MOVE = 0x0001, MOUSEEVENTF_MOVE_NOCOALESCE = 0x2000, MOUSEEVENTF_LEFTDOWN = 0x0002, MOUSEEVENTF_LEFTUP = 0x0004, MOUSEEVENTF_RIGHTDOWN = 0x0008, MOUSEEVENTF_RIGHTUP = 0x0010, MOUSEEVENTF_MIDDLEDOWN = 0x0020, MOUSEEVENTF_MIDDLEUP = 0x0040, MOUSEEVENTF_VIRTUALDESK = 0x4000, MOUSEEVENTF_WHEEL = 0x0800, MOUSEEVENTF_XDOWN = 0x0080, MOUSEEVENTF_XUP = 0x0100 } [DllImport("user32.dll", SetLastError=true)] public static extern uint SendInput(uint nInputs, ref INPUT pInputs, int cbSize); [StructLayout (LayoutKind.Explicit)] public struct INPUT { [FieldOffset(0)] public int type; [FieldOffset(4)] public MOUSEINPUT mi; } public struct MOUSEINPUT { public int dx; public int dy; public int mouseData; public int dwFlags; public int time; public int extraInfo; } } This class is a bit more complicated, but at the same time you have to understand that in some cases, SendInput is used for hardware and keyboard input as well. Of course, for experimentation purposes I removed those parts in the sample class. Here you have the same MouseFlags enum that will let you pass custom flags defining the mouse behavior. Notice, that SendInput has SetLastError set to true, therefore if something wrong happens via this method, the error will be easily obtained via GetLastError, that is implemented as a helper method in the same class. VERY IMPORTANT: When you define the INPUT struct, make sure you use LayoutKind.Explicit since when passed to an unamanaged call, a specific field layout is required - as you can see, every field is decorated with a FieldOffset attribute. Also taking about StructLayout, you don't have to set StructLayout.Sequential to MOUSEINPUT since it is setautomatically by CLR. When I want to call the method above, I can simply use this snippet: WINAPI.MOUSEINPUT mouseInput = new WINAPI.MOUSEINPUT(); mouseInput.dx = 100; mouseInput.dy = 10; mouseInput.dwFlags = (int)WINAPI.MouseFlags.MOUSEEVENTF_ABSOLUTE | (int)WINAPI.MouseFlags.MOUSEEVENTF_MOVE; WINAPI.INPUT input = new WINAPI.INPUT(); input.type = 0; input.mi = mouseInput; uint x = WINAPI.SendInput(1, ref input, Marshal.SizeOf(input)); Console.WriteLine(x); Console.WriteLine(WINAPI.GetLastError()); Console.ReadLine(); Notice that I have to call the unmanaged version of sizeof and pass the INPUT struct to it in order for the method to correctly execute. The regular C# sizeof won't cut it here. When I am defining the type of input, 0 is repesenting the INPUT_MOUSE flag, since I am only handling the mouse here. Of course, I can re-organize my method to accept a set of INPUT instances - the native call itself allows this by requesting an array of INPUT and the correct indication of the number of INPUT instances passed, but that is not required for testing purposes.
November 28, 2010
by Denzel D.
· 17,343 Views
article thumbnail
JMS Clustering by Example
It's amazing how the JBoss Team put together an easy way to do JMS Clustering, out of the box!!. I'll start with an easy example, creating a Queue named "MyClusteredQueue". In this example I'm using JBoss AS 5.1. and two computers connected on the same network, with these IP's: - Computer A: 192.168.0.143 - Computer B: 192.168.0.210 So, here are the steps: 1) Install the JBoss on both computers. We are going to use the "all" configuration for both computers. 2) We create our Queue on both servers. Go to $JBOSS_HOME/server/all/deploy/messaging/ and edit the destinations-service.xml file. Add the MyClusteredQueue before the last server tag. It looks like this: jboss.messaging:service=ServerPeer jboss.messaging:service=PostOffice true This is how you add a Queue to the JBoss, and the people how are familiar with this, the only new thing is to add the attribute "Clustered". This step must be set on both computers. At the end of the article you can find the files. 3) Write the MDB to consume the messages, and deploy it on the two computers. (I'm using an EJB 3 - MDB style). import java.net.InetAddress; import javax.ejb.ActivationConfigProperty; import javax.ejb.MessageDriven; import javax.jms.Message; import javax.jms.MessageListener; import javax.jms.ObjectMessage; import org.apache.log4j.Logger; /** * @author felipeg * */ @MessageDriven(activationConfig = { @ActivationConfigProperty(propertyName="destinationType", propertyValue="javax.jms.Queue"), @ActivationConfigProperty(propertyName="destination", propertyValue="queue/MyClusteredQueue") }) public class JMSClusterClientHandler implements MessageListener { Logger log = Logger.getLogger(JMSClusterClientHandler.class); @Override public void onMessage(Message message) { try{ if (message instanceof ObjectMessage) { InetAddress addr = InetAddress.getLocalHost(); log.info("########## Processing Host: " + addr.getHostName() + " ##########" ); ObjectMessage objMessage = (ObjectMessage) message; Object obj = objMessage.getObject(); log.info("Object received:" + obj.toString()); } } catch (Exception e) { e.printStackTrace(); } } } 4) Start the jboss with the following options: Computer A: $ cd $JBOSS_HOME/bin $ ./run.sh -c all -b 192.168.0.143 -Djboss.messaging.ServerPeerID=1 Computer B: $ cd $JBOSS_HOME/bin $ ./run.sh -c all -b 192.168.0.210 -Djboss.messaging.ServerPeerID=2 It is necesary to give an ID to each server and this is accomplished with this directive: -Djboss.messaging.ServerPeerID When you start the jboss on computer A, you should see the logs (server.log) telling you that there is one node ready and listening, and once you start the jboss on computer B, on the log will appear the two nodes, the two IP's ready to consume messages. 5) Now it's time to send a Message to the Queue. To accomplish this it's necessary to change the connection factory to "ClusteredConnectionFactory" (JMSDispatcher.java - See the code below). Also on the jndi.properties (if you are using the default InitialContext) file it's necessary to add the two computers ip's separated by comma to the java.naming.provider.url property. (In my case a create a Properties variable and I set all the necessary properties, JMSDispatcher.java - see the code below). java.naming.provider.url=192.168.0.143:1099,192.168.0.210:1099 The client that I wrote is a web application, that consist in one index.jsp page, which contains a form that prompts you for the name of the queue, the type of messaging (Queue or Topic), the server ip and port, how many times it will send the message and the actual message to be sent; also the web application has a Servlet (JMSClusteredClient.java - see code below) that receives the postback and helper class (JMSDispatcher.java - see code below) that sends the message to the jboss servers. You can to deploy it in any computer. In my case I deployed it on the Computer A. And you can access it through this URL: http://192.168.0.143:8080/JMSWeb/ (just modify the IP where the client war was deployed). If you notice (on the index.jsp - code below) I've already put some default values that reflects the name of the Queue, and the IP's of my two computers. Now, If you increment the number of times that the message will be sent (maybe a 10) and fill out the message box, and click "Send" you should see on the two servers some of the messages being consumed by the MDB. Here are the Files to create the client: index.jsp JMS Clustered - Test Client Server: QueueTopic Times:Message: Servlet: JMSClusteredClient.java public class JMSClusteredClient extends HttpServlet { private static final long serialVersionUID = 1L; /** * @see HttpServlet#service(HttpServletRequest request, HttpServletResponse response) */ protected void service(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { PrintWriter out = response.getWriter(); String topicqueue = request.getParameter("topicqueue"); String message = request.getParameter("message"); String server = request.getParameter("server"); String messageType = request.getParameter("messageType"); String times = request.getParameter("times"); int intTimes = Integer.parseInt(times); JMSDispatcher dispatcher = new JMSDispatcher(); dispatcher.setTopicQueueName(topicqueue); dispatcher.setServer(server); dispatcher.setMessageType(messageType); try { for(int count =1; count <= intTimes;count++){ dispatcher.sendMessage( count + " of " + times + " " + message); } out.println("Message [" + message + "] sent successfully to [" + topic + "] to the [" + server + "] server " + times + " times."); } catch (JMSException e) { e.printStackTrace(); out.println("Error:" + e.getMessage()); } catch (NamingException e) { out.println("Error:" + e.getMessage()); e.printStackTrace(); } finally{ out.close(); } } } A utility to send the messages: JMSDispatcher.java public class JMSDispatcher { /** * */ private static final long serialVersionUID = 7105145023422143880L; private static Logger log = Logger.getLogger(JMSDispatcher.class); private final String CONNECTION_FACTORY_CLUSTERED = "ClusteredConnectionFactory"; private final String CONNECTION_FACTORY = "ConnectionFactory"; private final String TOPIC = "TOPIC"; private final String QUEUE = "QUEUE"; private String topicQueueName; private String server; private String messageType; public void setTopicQueueName(String value){ this.topicQueueName = value; } public void setServer(String value){ this.server = value; } public void setMessageType(String value){ this.messageType = value; } public void sendMessage(Object objectMessage) throws JMSException, NamingException{ log.debug("##### Setting up a Queue/Topic Message: #####"); if (TOPIC.equals(messageType)){ sendTopicMessage(objectMessage); } else if (QUEUE.equals(messageType)){ sendQueueMessage(objectMessage); } log.debug("##### Publishing Message: Done #####"); } private void sendQueueMessage(Object objectMessage) throws JMSException, NamingException{ try{ InitialContext initialContext = getInitialContext(); QueueConnectionFactory qcf = (QueueConnectionFactory) initialContext.lookup(CONNECTION_FACTORY_CLUSTERED); QueueConnection queueConn = qcf.createQueueConnection(); Queue queue = (Queue) initialContext.lookup(topicQueueName); QueueSession queueSession = queueConn.createQueueSession(false, Session.AUTO_ACKNOWLEDGE); queueConn.start(); QueueSender send = queueSession.createSender(queue); ObjectMessage om = queueSession.createObjectMessage((Serializable)objectMessage); setMessageProperties(om); log.debug("##### Publishing Message to a Queue: " + queueName + "#####"); send.send(om); send.close(); queueConn.stop(); queueSession.close(); queueConn.close(); }catch(MessageFormatException ex){ log.error("##### The MESSAGE is not Serializable ####"); throw ex; }catch(MessageNotWriteableException ex){ log.error("##### The MESSAGE is not Readable ####"); throw ex; }catch(JMSException ex){ log.error("##### JMS provider fails to set the object due to some internal error. ####"); throw ex; } } private void sendTopicMessage(Object objectMessage) throws JMSException, NamingException{ try{ InitialContext initialContext = getInitialContext(); TopicConnectionFactory tcf = (TopicConnectionFactory)initialContext.lookup(CONNECTION_FACTORY_CLUSTERED); TopicConnection topicConn = tcf.createTopicConnection(); Topic topic = (Topic) initialContext.lookup(topicQueueName); TopicSession topicSession = topicConn.createTopicSession(false,TopicSession.AUTO_ACKNOWLEDGE); topicConn.start(); TopicPublisher send = topicSession.createPublisher(topic); ObjectMessage om = topicSession.createObjectMessage(); om.setObject((Serializable)objectMessage); setMessageProperties(om); log.debug("##### Publishing Message to a Topic: " + topicName + "#####"); send.publish(om); send.close(); topicConn.stop(); topicSession.close(); topicConn.close(); }catch(MessageFormatException ex){ log.error("##### The MESSAGE is not Serializable ####"); throw ex; }catch(MessageNotWriteableException ex){ log.error("##### The MESSAGE is not Readable ####"); throw ex; }catch(JMSException ex){ log.error("##### JMS provider fails to set the object due to some internal error. ####"); throw ex; } } private InitialContext getInitialContext() throws NamingException{ Properties jboss = new Properties(); jboss.put("java.naming.factory.initial", "org.jnp.interfaces.NamingContextFactory"); jboss.put("java.naming.factory.url.pkgs", "org.jboss.naming:org.jnp.interfaces"); jboss.put("java.naming.provider.url", server); return new InitialContext(jboss); } } And the web.xml JMSWeb index.jsp JMSClusteredClient JMSClusteredClient com.blogspot.felipeg48.jms.web.JMSClusteredClient JMSClusteredClient /JMSClusteredClient Happy Clustering!!
May 26, 2010
by Felipe Gutierrez
· 16,739 Views
article thumbnail
Running Hazelcast on a 100 Node Amazon EC2 Cluster
The purpose of this article is to give you the details of our 100 node cluster demo. This demo is recorded and you can watch the 5 minute screencast Hazelcast is an open source clustering and highly scalable data distribution platform for Java. JVMs that are running Hazelcast will dynamically cluster and allow you to easily share and partition your application data across the cluster. Hazelcast is a peer-to-peer solution (there is no master node, every node is a peer) so there is no single point of failure. Communication among cluster members is always TCP/IP with Java NIO beauty. The default configuration comes with 1 backup so if a node fails, no data will be lost (you can specify the backup count). It is as simple as using java.util.{Map, Queue, Set, List}. Just add the hazelcast.jar into your classpath and start coding. When you download the Hazelcast, you will find a test.sh under bin directory. The test.sh runs an application which randomly makes 40% get, 40% put and 20% remove on a distributed map. In this demo the same test application will be used to see how it performs on 100 node cluster. Amazon EC2 and S3 An easy to use and scalable cloud environment was needed for demo so we decided to use Amazon EC2 for server instances (nodes) and S3 service to store demo application zip and configuration files. With its newly announced Java SDK, it is very simple to start/stop server instances and upload files to S3 programatically. Hazelcast AMI & Launcher The challenge here is that we are running an application on 100 nodes and dealing with each and every server in the cluster is a huge task. We don't want to ssh into every server and manually start the application. This part is automated by creating a special server image (AMI). The AMI contains Java Runtime and a launcher application we developed, which will download the demo application from Amazon S3, unzip it, and run the hazelcast/bin/test.sh in it. The Launcher is actually so generic that it can run any application; it doesn't care/know what test.sh contains. Deployer Deployment of the demo application is also automated so that we don't need to login into AWS Management Console and manually start instances. Deployer instantiates any number of Amazon EC2 servers with any AMI and also uploads the demo application zip file to S3. So the idea here is that, the Deployer will store the application into S3 and launch 100 EC2 instances with our image. The Launcher on each instance will download the application from S3 and run it. Demo Details. The smallest EC2 instances (m1.small) are used to run the demo. These are the virtual instances with CPU about 1.0 GHz. Also keep in mind that EC2 platform suffers from considerable amount of network latency. That's why we increased the thread count to 250 in our application. The following steps performed during the demo Download hazelcast-1.8.3.zip from www.hazelcast.com. Unzip the file and move the monitoring war file into tomcat6/webapps directory. Edit the test.sh under the bin directory: Add -Xmx1G -Xms1G Add -Dhazelcast.initial.wait.seconds=100 to make the cluster evenly partition on start so that migration can be avoided for better performance. Add t250 as an argument to the application to set thread count to 250. Remember the latency issue. Run the Deployer from IDE. Check from EC2 Management Console if 100 servers started. Start tomcat. Copy the public DNS name of one of the servers to connect to from monitoring tool. Go to http://localhost:8080/hazelcast-monitor-1.8.3/ (Hazelcast Monitoring Tool). Paste the address and connect to the cluster. Enjoy! Results You should always look for programatic ways of launching applications on the cloud. With these tools we were able to deploy and run the demo application on 100 servers in minutes. The entire Hazelcast cluster was making over 400,000 operations per second on the smallest EC2 instances. In our next demo we will experiment Hazelcast on large data set and even bigger cluster. Watch the screencast
April 16, 2010
by Fuad Malikov
· 62,688 Views · 1 Like
article thumbnail
Concurrent Programming in Groovy
It seems that the Groovy has a project for just about anything. That's one of the reasons why the language is so popular. The GPars library is an especially useful project in this new era of mult-core processors and concurrent programming. Formerly known as GParallelizer, GPars offers a framework for handling tasks concurrently and asynchronously while safe-guarding mutable values. DZone recently got an update on the project's latest news from project lead Václav Pech, and we've provided some examples of GPars concepts. GPars uses some of the best concepts from emerging languages and implements them for Groovy. Its actor support was inspired by the Actors library in Scala and the SafeVariable class in GPars was inspired by Agents in Clojure. The Groovy-based APIs in GPars are used to declare which parts of the code should be run concurrently. Objects can be enhanced with asynchronous methods to perform collections-based operations in parallel based on the fork/join model. GPars also has a Dataflow concurrency model that offers an alternative model that is inherently safe and robust due to algorithms that prevent having to deal with live-locks and race-conditions. The SafeVariable class is another technology in GPars that alleviates problems with concurrency by providing a non-blocking mt-safe reference to mutable state when Java libraries are integrated. Finally the Parallelizer, Asynchronizer, and Actors are some of the most interesting concepts in GPars. Actors Actors can be generated quickly to consume and send messages between each other even across distributed machines. You can build a messaging-based concurrency model with actors that are not limited by the number of threads. What was once only available to Scala developers, GPars now brings to Java and Groovy developers. Actors perform three different operations - send messages, receive messages and create new actors. New actors are created with the actor() method passing in the actor's body as a closure parameter. Inside the actor's body loop() is used to iterate, react() to receive messages, and reply() to send a message to the actor, which has sent the currently processed message. Here is how to create an actor that prints out all messages that it receives: import static groovyx.gpars.actor.Actors.* def console = actor { loop { react { println it } } } The loop() method ensures that the actor doesn't stop after processing the first message. Messages are sent using the send() method or the << operator. Here is an example of the sendAndWait () method in a message: actor << 'Message' actor.send 'Message' def reply1 = actor.sendAndWait('Message') def reply2 = actor.sendAndWait(10, TimeUnit.SECONDS, 'Message') def reply3 = actor.sendAndWait(10.seconds, 'Message') The sendAndWait() family blocks the caller until a reply from the actor becomes available. The reply is returned from sendAndWait() as a return value. For non-blocking message retrieval, calling the react() method, with or without a timeout parameter, from within the actor's code will consume the next message from the actor's inbox: println 'Waiting for a gift' react {gift -> if (myWife.likes gift) reply 'Thank you!' } Here is a more 'real world' example of an event-driven actor that receives two numeric messages, generates a sum, and sends the result to the console actor: import static groovyx.gpars.actor.Actors.* //not necessary, just showing that a single-threaded pool can still handle multiple actors defaultPooledActorGroup.resize 1 final def console = actor { loop { react { println 'Result: ' + it } } } final def calculator = actor { react {a -> react {b -> console.send(a + b) } } } calculator.send 2 calculator.send 3 calculator.join() Since Actors can share a relatively small thread pool, they bypass the threading limitations of the JVM and don't require excessive system resources even if an application consists of thousands of actors. There are some more sophisticated actor examples on the old GParallelizer wiki and there's also a nice article on the key concepts behind actors in erlang and scala. The documentation on GPars Actors can be found here. Asynchronizer A major feature of GPars is the Asynchronizer class, which runs tasks asynchronously in the background. It enables a Java Executor Service-based DSL on collections and closures. Inside the Asynchronizer.doParallel() blocks, asynchronous methods can be added to the closures. async() creates a variant of the supplied closure returning a future for the potential return value when invoked. callAsync() calls a closure in a separate thread supplying the given arguments and also returns a future for the potential value. Here is one example of Asynchronizer use: Asynchronizer.doParallel() { Closure longLastingCalculation = {calculate()} Closure fastCalculation = longLastingCalculation.async() //create a new closure, which starts the original closure on a thread pool Future result=fastCalculation() //returns almost immediately //do stuff while calculation performs … println result.get() } Parallelizer Finally, there's the Parallelizer, which is a concurrent collection processor. The common pattern to process collections takes elements sequentially, one at a time. This algorithm however, won't work well on multi-core hardware. The min() function on a dual-core chip can only leverage 50% of the computing power - 25% for a quad-core. Instead, GPars uses a tree-like structure for parallel processing. The Parallelizer class enables a ParallelArray(from JSR-166y)-based DSL on collections. Here is a use exapmle: doParallel { def selfPortraits = images.findAllParallel{it.contains me}.collectParallel {it.resize()} //a map-reduce functional style def smallestSelfPortrait = images.parallel.filter{it.contains me}.map{it.resize()}.min{it.sizeInMB} } Václav Pech told DZone that GPars currently has two new people joining the project - Jon Kerridge and Kevin Chalmers. The two developers are bringing their JCSP Groovy library with them into GPars. Pech said, "Apart from experimenting with the CSP concept, we will also enhance actor remoting and polish a couple of rough edges on the APIs. The documentation and especially the samples also deserve more attention." GPars' documentation is already quite robust. Pech says there quite a few issues queued up in their JIRA, but the previously listed issues remain the top priorities. Depending on the amount of time required to make progress with CSP, the GPars 1.0 release might happen in the summer says Pech. GPars is also developed by Alex Tkachman, a leading developer on the Groovy++ project. In an with Andres Almiray, Tkachman said that some of the work that comes out of Groovy++ might be assimilated into GPars, but no plans are in place yet since Groovy++ developers are still experimenting.
February 22, 2010
by Mitch Pronschinske
· 53,537 Views · 2 Likes
  • Previous
  • ...
  • 201
  • 202
  • 203
  • 204
  • 205
  • 206
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×