DZone
The Latest Data Topics

Export MS Visio Diagram to XML (VDX, VTX, VSX) Formats in C# & VB.NET
This technical tip shows how .NET developers can export Microsoft Visio diagrams to XML inside their own applications using Aspose.Diagram for .NET. Aspose.Diagram for .NET lets you export diagrams to a variety of formats: image formats, HTML, SVG, SWF and three XML formats. VDX defines an XML diagram, VTX defines an XML template, and VSX defines an XML stencil. The Diagram class' constructors read a diagram, and the Save method saves, or exports, the diagram in a different file format. The code snippets in this article show how to use the Save method to save a Visio file to VDX, VTX and VSX.

Exporting VSD to VDX

VDX is a schema-based XML file format that lets you save diagrams in a format that products other than Microsoft Visio can read. It's a useful format for transferring diagrams between software applications while retaining editable data. To export a VSD diagram to VDX, first create an instance of the Diagram class, then call its Save method to write the Visio drawing file to VDX.

Exporting from VSD to VSX

VSX is an XML format for defining stencils, the basic objects from which a diagram is built up. When a Visio file is converted to VSX, only the stencils are exported. To export a VSD diagram to VSX, first create an instance of the Diagram class, then call its Save method to write the Visio drawing file to VSX.
The sample code below shows how to export VSD to VDX.

[C# Code Sample]

    // Call the Diagram constructor to load the diagram from a VSD file
    Diagram diagram = new Diagram("D:\\Drawing1.vsd");

    this.Response.Clear();
    this.Response.ClearHeaders();
    this.Response.ContentType = "application/vnd.ms-visio";
    this.Response.AppendHeader("Content-Disposition", "attachment; filename=Diagram.vdx");
    this.Response.Flush();

    System.IO.Stream vdxStream = this.Response.OutputStream;

    // Save the input VSD as VDX
    diagram.Save(vdxStream, SaveFileFormat.VDX);
    this.Response.End();

[VB.NET Code Sample]

    'Call the Diagram constructor to load the diagram from a VSD file
    Dim diagram As New Diagram("D:\Drawing1.vsd")

    Me.Response.Clear()
    Me.Response.ClearHeaders()
    Me.Response.ContentType = "application/vnd.ms-visio"
    Me.Response.AppendHeader("Content-Disposition", "attachment; filename=Diagram.vdx")
    Me.Response.Flush()

    Dim vdxStream As System.IO.Stream = Me.Response.OutputStream

    'Save the input VSD as VDX
    diagram.Save(vdxStream, SaveFileFormat.VDX)
    Me.Response.End()

The sample code below shows how to export VSD to VSX.

[C# Code Sample]

    // Call the Diagram constructor to load the diagram from a VSD file
    Diagram diagram = new Diagram("D:\\Drawing1.vsd");

    this.Response.Clear();
    this.Response.ClearHeaders();
    this.Response.ContentType = "application/vnd.ms-visio";
    this.Response.AppendHeader("Content-Disposition", "attachment; filename=Diagram.vsx");
    this.Response.Flush();

    System.IO.Stream vsxStream = this.Response.OutputStream;

    // Save the input VSD as VSX
    diagram.Save(vsxStream, SaveFileFormat.VSX);
    this.Response.End();

[VB.NET Code Sample]

    'Call the Diagram constructor to load the diagram from a VSD file
    Dim diagram As New Diagram("D:\Drawing1.vsd")

    Me.Response.Clear()
    Me.Response.ClearHeaders()
    Me.Response.ContentType = "application/vnd.ms-visio"
    Me.Response.AppendHeader("Content-Disposition", "attachment; filename=Diagram.vsx")
    Me.Response.Flush()

    Dim vsxStream As System.IO.Stream = Me.Response.OutputStream

    'Save the input VSD as VSX
    diagram.Save(vsxStream, SaveFileFormat.VSX)
    Me.Response.End()
January 29, 2014
by David Zondray
· 14,416 Views
Geek, dork, nerd and dweeb — the difference in a Venn diagram
Working in and around Silicon Valley and technology, I hear people throwing around the terms “geek”, “dork”, “nerd” and “dweeb” constantly. They’re thrown around interchangeably, in fact, which is where the problem lies. They’re not the same, and knowing that matters a great deal. In fact, using the wrong term gives credit where credit isn’t due or unfairly labels someone’s better qualities.

Venn diagram

Finding a Venn diagram to explain the difference was a big moment. I suddenly saw how I should interview for different roles and what to look for in partners and employees. I know what to seek and what to avoid. It was an epiphany. Starting from that point, I could see career paths for each and every one (OK, except one). It breaks out like this:

Geek, Nerd, Dweeb and Dork in order of value to the organization

Geek: Both smart and driven but able to talk about fun things. They’re your leaders and sales people.
Nerd: Centered in smarts and drive, tempered by some awkwardness. They sustain your company.
Dweeb: Smart and awkward but probably uncommitted. They won’t stay up all night to solve a problem.
Dork: Least fun of the bunch. Avoid these people because they waste your time and sap your will to live.

These may not be everyone’s definitions, but maybe it’s time to standardize our labels lest we use the terms insensitively. For an interesting take on these categories, I found this on Democratic Underground.

Geek: Someone who spends a lot of time and energy in a certain special but conventional area, like computer programming or trouble-shooting, but not necessarily computers or technology. You can apparently have chess geeks, guitar geeks, or cooking geeks. A geek is an outwardly normal person who can relate to others in general but who has taken the time to learn specific technical skills and would rather talk about their special obsession than anything else.
They are generally not athletic and enjoy sedentary pursuits like video games, comic books, being on the internet, etc. They usually dress to suit their special interest, which can be flamboyant, such as wearing a tee-shirt describing their special obsession or a hat bearing a logo of their special pursuit. Geeks can be self-confident and proud of their traits. Nerd: Someone with a great interest in academic subjects like math and science and who is socially awkward and has trouble relating to others outside of their fields of academia. Their IQ often exceeds their weight. Science fiction such as The Matrix and Star Wars or LOTR are often their cup of tea, as are hobbies like astronomy or chemistry sets. Nerds usually dress conservatively and are more interested in the mind than their outward appearance, although as both men and women they tend to be tidy, clean-cut, and hygienic. Nerds generally are self-confident in the academic setting and take pride in their intellect and band together with other nerds although their social skills outside of their academic obsession are diminished. Dork: Someone who has special interests like a geek but whose interests and obsessions are less common and odd, such as having an oddball collection of some sort like old Three Stooges bubblegum cards or an uncommon skill like yodeling. Walking talking Star Trek encyclopedic knowledge and convention dress up obsessions can be considered dorky. They can act silly at times and not care what anyone thinks. Dorks are typically more noted for their quirky personality and tend to be loners. Hygiene can sometimes be an issue. Dorks can nonetheless be self-confident and proud of the way they are because they simply don’t care what others think. Dweeb: A person who tends to be regarded as physically wimpish, intellectually challenged, and socially awkward, with little self-confidence. 
Dweebs tend to be obsessed with unusual pursuits like dorks (tap dancing or ant farms) but are lacking in skill, knowledge, or ability. Dweebs tend to be loners like dorks but understand their shortcomings and lack pride. Hygiene can also be an issue. I’m not a dork, nor dweeb and I don’t see myself as a nerd…I’m clearly a geek :-). Thank you to the Great White Snark, where I found the Venn diagram.
January 27, 2014
by Christopher Taylor
· 21,403 Views
Big Data Search, Part 5: Sorting Optimizations
I mentioned several times that the entire point of the exercise was to just see how this works, not to actually do anything production worthy. But it is interesting to see how we could do better here. In no particular order, I think that there are at least several things that we could do to significantly improve the time it takes to sort.

Right now we defined 2 indexes on top of a 1GB file, and it took under 1 minute to complete. That gives us a runtime of about 10 days over a 15TB file.

Well, one of the reasons for this performance is that we execute this in a serial fashion, that is, one after another. But we have two completely isolated indexes, so there is no reason why we can’t parallelize the work between them.

For that matter, we are buffering in memory up to a certain point, then we sort, then we buffer some more, etc. That is pretty inefficient. We can push the actual sorting to a different thread, and continue parsing and adding to a new buffer while the previous one is being sorted.

We wrote to intermediary files, but we wrote to those using plain file I/O. It is usually a lot more costly to write raw data to disk than to compress it and then write it to disk. We are writing sorted data, so it is probably going to compress pretty well.

Those are the things that pop to mind. Can you think of additional options?
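The second idea, sorting one buffer on a worker thread while parsing keeps filling the next one, could look roughly like the sketch below. It is written in Java rather than the C# of the original solution, and `BackgroundSorter`, `sortInChunks` and the chunk size are hypothetical names chosen for illustration, not part of the author's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: fill a buffer, hand it to a worker thread for sorting,
// and keep filling the next buffer while the previous one is being sorted.
final class BackgroundSorter {

    static List<List<String>> sortInChunks(Iterable<String> values, int chunkSize) {
        ExecutorService sorter = Executors.newSingleThreadExecutor();
        List<Future<List<String>>> pending = new ArrayList<>();
        try {
            List<String> buffer = new ArrayList<>(chunkSize);
            for (String value : values) {
                buffer.add(value);
                if (buffer.size() == chunkSize) {
                    pending.add(submitSort(sorter, buffer));
                    buffer = new ArrayList<>(chunkSize); // parsing continues immediately
                }
            }
            if (!buffer.isEmpty()) {
                pending.add(submitSort(sorter, buffer));
            }
            List<List<String>> sortedChunks = new ArrayList<>();
            for (Future<List<String>> chunk : pending) {
                sortedChunks.add(chunk.get()); // wait for each background sort
            }
            return sortedChunks;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while sorting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Sorting failed", e.getCause());
        } finally {
            sorter.shutdown();
        }
    }

    // The submitted buffer is never touched by the parsing thread again.
    private static Future<List<String>> submitSort(ExecutorService sorter, List<String> chunk) {
        return sorter.submit(() -> {
            Collections.sort(chunk);
            return chunk;
        });
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("d", "b", "a", "c", "f", "e");
        System.out.println(sortInChunks(input, 3));
    }
}
```

In a real indexer each sorted chunk would be written to a temporary file instead of being kept in memory; the lists are returned here only to keep the sketch self-contained.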
January 27, 2014
by Oren Eini
· 7,863 Views
Spring and Caching JMS Connections
As a follow-up to previous posts covering JMS, this post delves into more depth on Spring's CachingConnectionFactory. Spring provides two implementations of the javax.jms.ConnectionFactory interface, namely the SingleConnectionFactory and the CachingConnectionFactory. The SingleConnectionFactory returns, as you might expect, the same single connection on all calls to the createConnection() method. This is fine for certain scenarios and applications, but the CachingConnectionFactory provides a more performant and scalable solution.

By default, a single session is cached, so for a multi-threaded application you would set the sessionCacheSize to a more suitable number. This number does not reflect the true number of sessions cached, though, because it refers to the size of the cache per session acknowledgement type, e.g. AUTO_ACKNOWLEDGE, CLIENT_ACKNOWLEDGE, DUPS_OK_ACKNOWLEDGE and SESSION_TRANSACTED.

By default, the CachingConnectionFactory will cache the Message Producers and Message Consumers for every session. As an aside, the Message Consumers are cached using keys which include the JMS selector, so the more fine-grained the message filter, the more Message Consumers there will be, and Message Consumers aren't closed until the session is closed and removed from the pool. An alternative is to use a Listener Container for consuming messages.

Also note that on creating a CachingConnectionFactory instance, the reconnect-on-exception flag is set to true. This should mean that the onException method on the default ExceptionListener class gets called, which will reset the connections. You can also override the default exception listener with your own implementation. The below snippet of XML shows a simple configuration of a CachingConnectionFactory:
January 27, 2014
by Geraint Jones
· 52,603 Views · 1 Like
Big Data Search, Part 4: The Index Format is Horrible
I have completed my own exercise, and while I wanted to try it with a “few allocations” rule, it is interesting to see just how far out there the code is. This isn’t something that you can really use for anything except as a basis to see how badly you are doing.

Let us start with the index format. It is just a CSV file with the value and the position in the original file. That means that any search we want to do on the file is actually a binary search, as discussed in the previous post. But doing a binary search like that is an absolute killer for performance.

Let us consider our 15TB data set. In my tests, a 1GB file with 4.2 million rows produced roughly an 80MB index. Assuming the same is true for the larger file, that gives us a 1.2 TB index file. In my small index, we have to do 24 seeks to get to the right position in the file. And as you should know, disk seeks are expensive. They are in the order of 10ms or so. So the cost of actually searching the index is close to a quarter of a second. Now, to be fair, there are going to be a lot of caching opportunities here, but probably not that many if we have a lot of queries to deal with here.

Of course, the fun thing about this is that even with a 1.2 TB file, we are still talking about less than 40 seeks (the beauty of O(log N) in action), but that is still pretty expensive. Even worse, this is what happens when we are running a single query at a time. What do you think will happen if we are actually running this with multiple threads generating queries? Now we will have a lot of seeks (effectively random) that would generate a big performance sink. This is especially true if we consider that any storage solution big enough to store the data is going to be composed of an aggregate of HDD disks. Sure, we get multiple spindles, so we get better performance overall, but still…

Obviously, there are multiple solutions for this issue. B+Trees solve the problem by packing multiple keys into a single page, so instead of doing O(log2 N) seeks, you are usually doing O(log36 N) or O(log100 N). Considering those fan-outs, we will have 6 to 8 seeks to do to get to our data. Much better than the 40 seeks required using plain binary search. It would actually be better than that in the common case, since the first few levels of the tree are likely to reside in memory (and probably in L1, if we are speaking about that).

However, given that we are storing sorted strings here, one must give some attention to Sorted Strings Tables. The way those work, you have the sorted strings in the file, and the footer contains two important bits of information. The first is the bloom filter, which allows you to quickly rule out missing values, but the more important factor is that it also contains the positions of (by default) every 16th entry in the file. This means that in our 15 TB data file (with 64.5 billion entries), we will use about 15GB just to store pointers to the different locations in the index file (which will be about 1.2 TB).

Note that the actual numbers are probably different. Because SST (note that when talking about SST I am talking specifically about the leveldb implementation) utilizes many forms of compression, the file size will actually be smaller (although, since the “value” we use is just a byte position in the data file, we won’t benefit from compression there). Key compression is probably a lot more important here.

However, note that this is a pretty poor way of doing things. Sure, the actual data format is better, in the sense that we don’t store as much, but in terms of the number of operations required? Not so much. We still need to do a binary search over the entire file. In particular, the leveldb implementation utilizes memory mapped files. What this ends up doing is relying on the OS to keep the midway points in the file in RAM, so we don’t have to do so much seeking. Without that, the cost of actually seeking every time would make SSTs impractical. In fact, you would pretty much have to introduce another layer on top of this, but at that point, you are basically doing trees, and a binary tree is a better friend here.

This leads to an interesting question. SST is probably so popular inside Google because they deal with a lot of data, and the file format is very friendly to compression of various kinds. It is also a pretty simple format. That makes it much nicer to work with. On the other hand, a B+Tree implementation is a lot more complex, and it would probably be several orders of magnitude more complex if it had to try to do the same compression tricks that SSTs do.

Another factor that is probably as important is that, as I understand it, SSTs are usually used for actual sequential access (map/reduce stuff) and not necessarily for the random reads that are done in leveldb. It is interesting to think about this in this fashion, at least, even if I don’t know what I’ll be doing with it.
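The seek counts quoted in this post can be sanity-checked with a few lines of code. The sketch below counts how many levels a search has to descend for a given number of entries and fan-out (2 for plain binary search, 36 or 100 for a B+Tree page). The class and method names are hypothetical, and the 10ms per seek figure is simply the rough estimate used in the post.

```java
// Hypothetical helper: worst-case number of seeks needed to locate one entry
// when every step narrows the search space by a factor of `fanOut`.
final class SeekEstimate {

    static int seeks(long entries, int fanOut) {
        int levels = 0;
        long reachable = 1; // entries distinguishable after `levels` steps
        while (reachable < entries) {
            if (reachable > Long.MAX_VALUE / fanOut) {
                return levels + 1; // next step would overflow, so it covers everything
            }
            reachable *= fanOut;
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        long entries = 64_500_000_000L; // entries in the 15 TB data set
        int seekMillis = 10;            // rough cost of one disk seek
        for (int fanOut : new int[] {2, 36, 100}) {
            int seeks = seeks(entries, fanOut);
            System.out.println("fan-out " + fanOut + ": " + seeks
                    + " seeks, about " + (seeks * seekMillis) + " ms");
        }
    }
}
```

For 64.5 billion entries this gives 36 seeks for binary search and 7 or 6 seeks for fan-outs of 36 and 100, which is in line with the "less than 40" and "6 to 8" figures above.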
January 24, 2014
by Oren Eini
· 12,096 Views
Big Data Search, Part 3: Binary Search of Textual Data
The index I created for the exercise is just a text file, sorted by the indexed key. When a human does a search, that makes it very easy to work with. It is much easier than trying to work with a binary file, and it also helps debugging. However, it does make running a binary search on the data a bit harder, mostly because there isn’t a nice way to say “give me the #th line”. Instead, I wrote the following:

    public void SetPositionToLineAt(long position)
    {
        // now we need to go back until we either get to the start of the file
        // or find a \n character
        const int bufferSize = 128;
        _buffer.Capacity = Math.Max(bufferSize, _buffer.Capacity);
        var charCount = _encoding.GetMaxCharCount(bufferSize);
        if (charCount > _charBuf.Length)
            _charBuf = new char[Utils.NearestPowerOfTwo(charCount)];

        while (true)
        {
            _input.Position = position - (position < bufferSize ? 0 : bufferSize);
            var read = ReadToBuffer(bufferSize);
            var buffer = _buffer.GetBuffer();
            var chars = _encoding.GetChars(buffer, 0, read, _charBuf, 0);
            for (int i = chars - 1; i >= 0; i--)
            {
                if (_charBuf[i] == '\n')
                {
                    _input.Position = position - (bufferSize - i) + 1;
                    return;
                }
            }
            position -= bufferSize;
            if (position < 0)
            {
                _input.Position = 0;
                return;
            }
        }
    }

This code starts at an arbitrary byte position and goes backward until it finds the new line character ‘\n’. This gives me the ability to go to a rough location and get line oriented input. Once I have that, the rest is pretty easy. Here is the binary search:

    while (lo <= hi)
    {
        position = (lo + hi) / 2;
        _reader.SetPositionToLineAt(position);

        bool? result;
        do
        {
            result = _reader.ReadOneLine();
        } while (result == null); // skip empty lines

        if (result == false)
            yield break; // couldn't find anything

        var entry = _reader.Current.Values[0];
        match = Utils.CompareArraySegments(expectedIndexEntry, entry);

        if (match == 0)
        {
            break;
        }
        if (match > 0)
            lo = position + _reader.Current.Values.Sum(x => x.Count) + 1;
        else
            hi = position - 1;
    }

    if (match != 0)
    {
        // no match
        yield break;
    }

The idea is that this positions us at the location in the index that has an entry with a value equal to what we are searching for. We then write the following to actually get the data from the actual data file:

    // we have a match, now we need to return all the matches
    _reader.SetPositionToLineAt(position);

    while (true)
    {
        bool? result;
        do
        {
            result = _reader.ReadOneLine();
        } while (result == null); // skip empty lines

        if (result == false)
            yield break; // end of file

        var entry = _reader.Current.Values[0];
        match = Utils.CompareArraySegments(expectedIndexEntry, entry);
        if (match != 0)
            yield break; // out of the valid range we need

        _buffer.SetLength(0);
        _data.Position = Utils.ToInt64(_reader.Current.Values[1]);

        while (true)
        {
            var b = _data.ReadByte();
            if (b == -1)
                break;
            if (b == '\n')
            {
                break;
            }
            _buffer.WriteByte((byte)b);
        }

        yield return _encoding.GetString(_buffer.GetBuffer(), 0, (int)_buffer.Length);
    }

As you can see, we are moving forward in the index file, reading one line at a time. Then we take the second value, the position of the relevant line in the data file, and read that. We continue to do so as long as the indexed value is the same. Pretty simple, all told. But it comes with its own set of problems. I’ll discuss that in my next post.
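For readers who want to experiment with the technique outside the original code base, here is a minimal, self-contained sketch of the same idea in Java: jump to an arbitrary byte offset, walk back to the start of the line, and binary search on byte positions. It is not the author's implementation. `SortedLineSearch`, `lineStart` and `find` are hypothetical names, the index is assumed to be an ASCII file of `key,position` lines sorted by key in ordinal order with no blank lines, and only one matching line is returned.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Path;

// Hypothetical sketch: binary search over a sorted text file of "key,position" lines.
final class SortedLineSearch {

    // Walks backward from `position` to the first byte of the line containing it.
    // Reads one byte at a time for clarity; a real implementation would buffer.
    static long lineStart(RandomAccessFile file, long position) throws IOException {
        long p = position;
        while (p > 0) {
            file.seek(p - 1);
            if (file.read() == '\n') {
                return p;
            }
            p--;
        }
        return 0;
    }

    // Returns a line whose key equals `key`, or null when there is none.
    static String find(Path indexFile, String key) {
        try (RandomAccessFile file = new RandomAccessFile(indexFile.toFile(), "r")) {
            long lo = 0;
            long hi = file.length() - 1;
            while (lo <= hi) {
                long mid = lo + (hi - lo) / 2;
                long start = lineStart(file, mid);
                file.seek(start);
                String line = file.readLine();     // reads up to and including '\n'
                long next = file.getFilePointer(); // first byte of the following line
                int comma = line.indexOf(',');
                String lineKey = comma < 0 ? line : line.substring(0, comma);
                int cmp = key.compareTo(lineKey);
                if (cmp == 0) {
                    return line;
                }
                if (cmp > 0) {
                    lo = next;      // continue after this line
                } else {
                    hi = start - 1; // continue before this line
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Moving `lo` to the start of the next line and `hi` to the byte before the current line guarantees that the range shrinks on every iteration, even though the lines have different lengths.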
January 22, 2014
by Oren Eini
· 5,877 Views
Big Data Search, Part 2: Setting Up
The interesting thing about this problem is that I was very careful in how I phrased things. I said what I wanted to happen, but didn’t specify what needs to be done. That was quite intentional. For that matter, the fact that I am posting about what is going to be our acceptance criteria is also intentional. The idea is to have a non-trivial task, but something that should be very well understood and easy to research. It also means that the candidate needs to be able to write some non-trivial code. And I can tell a lot about a dev from such a project. At the same time, this is a very self-contained scenario. The idea is that this is something that you can do in a short amount of time.

The reason that this is an interesting exercise is that this is actually at least two totally different but related problems. First, in a 15TB file, we obviously cannot rely on just scanning the entire file. That means that we have to have an index. And that means we have to build it. Interestingly enough, an index being a sorted structure, that means that we have to solve the problem of sorting more data than can fit in main memory. The second problem is probably easier, since it is just an implementation of external sort, and there are plenty of algorithms around to handle that.

Note that I am not really interested in actual efficiencies for this particular scenario. I care about being able to see the code, see that it works, etc. My solution, for example, is a single threaded system that makes no attempt at parallelism or I/O optimizations. It clocks at over 1 GB / minute and the memory consumption is under 150MB. Queries for a unique value return the result in 0.0004 seconds. Queries that returned 153K results completed in about 2 seconds. When increasing the used memory to about 650MB, there isn’t really any difference in performance, which surprised me a bit. Then again, the entire code is probably highly inefficient. But that is good enough for now.

The process is kicked off with indexing:

    var options = new DirectoryExternalStorageOptions("/path/to/index/files");
    var input = File.OpenRead(@"/path/to/data/crimes_-_2001_to_present.csv");
    var sorter = new ExternalSorter(input, options, new int[]
    {
        1, // case number
        4, // ichr
    });

    sorter.Sort();

I am actually using the Chicago crime data for this. This is a 1GB file that I downloaded from the Chicago city portal in CSV format. This is what the data looks like:

The ExternalSorter will read and parse the file, and start reading it into a buffer. When it gets to a certain size (about 64MB of source data, usually), it will sort the values in memory and output them into temporary files. Those files look like this:

Initially, I tried to do that with binary data, but it turns out that that was too complex to be easy, and writing this in a human readable format made it much easier to work with. The format is pretty simple: you have the value on the left, and on the right you have the start position of the row for this value.

We generate about 17 such temporary files for the 1GB file, one temporary file per each 64 MB of the original file. This lets us keep our actual memory consumption very low, but for larger data sets, we’ll probably want to actually do the sort every 1 GB or maybe more. Our test machine has 16 GB of RAM, so doing a sort and outputting a temporary file every 8 GB can be a good way to handle things. But that is beside the point.

The end result is that we have multiple sorted files, but they aren’t sequential. In other words, in file #1 we have values 1,4,6,8 and in file #2 we have 1,2,6,7. We need to merge all of them together. Luckily, this is easy enough to do. We basically have a heap that we feed entries from the files into. And that pretty much takes care of this. See merge sort if you want more details about this.

The end result of merging all of those files is… another file, just like them, that contains all of the data sorted. Then it is time to actually handle the other issue, actually searching the data.

We can do that using simple binary search, with the caveat that because this is a text file, and there are no fixed size records or pages, it is actually a bit hard to figure out where to start reading. In effect, what I am doing is to select an arbitrary byte position, then walk backward until I find a ‘\n’. Once I have found the new line character, I can read the full line, check the value, and decide where I need to look next. Assuming that I actually found my value, I can now go to the byte position of the value in the original file and read the original line, giving it to the user.

Assuming an indexing rate of 1 GB / minute, a 15 TB file would take about 10 days to index. There are ways around that as well, but I’ll touch on them in my next post.

What all of this did was bring home just how much we usually don’t have to worry about such things. But I consider this research time well spent; we’ll be using this in the future.
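The merge step described in this post, feeding the head of every sorted temporary file into a heap and repeatedly taking the smallest entry, can be sketched as follows. The sketch is in Java and works on in-memory lists rather than files to stay short; `KWayMerge` and `Head` are hypothetical names and not part of the original solution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a k-way merge of already sorted runs using a heap.
final class KWayMerge {

    // One heap entry: the current value of a run plus the rest of that run.
    private static final class Head implements Comparable<Head> {
        final long value;
        final Iterator<Long> rest;

        Head(long value, Iterator<Long> rest) {
            this.value = value;
            this.rest = rest;
        }

        @Override
        public int compareTo(Head other) {
            return Long.compare(value, other.value);
        }
    }

    static List<Long> merge(List<List<Long>> sortedRuns) {
        PriorityQueue<Head> heap = new PriorityQueue<>();
        for (List<Long> run : sortedRuns) {
            Iterator<Long> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        List<Long> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head smallest = heap.poll();
            merged.add(smallest.value);
            if (smallest.rest.hasNext()) {
                // Refill the heap from the run the smallest value came from.
                heap.add(new Head(smallest.rest.next(), smallest.rest));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Long> file1 = Arrays.asList(1L, 4L, 6L, 8L);
        List<Long> file2 = Arrays.asList(1L, 2L, 6L, 7L);
        System.out.println(merge(Arrays.asList(file1, file2)));
    }
}
```

The heap never holds more than one entry per run, so memory use depends on the number of temporary files, not on the amount of data being merged.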
January 21, 2014
by Oren Eini
· 3,444 Views
Extending Guava Caches to Overflow to Disk
Caching allows you to significantly speed up applications with only little effort. Two great cache implementations for the Java platform are the Guava caches and Ehcache. While Ehcache is much richer in features (such as its Searchable API, the possibility of persisting caches to disk or overflowing to big memory), it also comes with quite an overhead compared to Guava. In a recent project, I found a need to overflow a comprehensive cache to disk but at the same time, I regularly needed to invalidate particular values of this cache. Because Ehcache's Searchable API is only accessible to in-memory caches, this put me in quite a dilemma. However, it was quite easy to extend a Guava cache to allow overflowing to disk in a structured manner. This gave me both overflowing to disk and the required invalidation feature. In this article, I want to show how this can be achieved.

I will implement this file persisting cache, FilePersistingCache, in the form of a wrapper around an actual Guava Cache instance. This is of course not the most elegant solution (more elegant would be to implement an actual Guava Cache with this behavior), but it will do for most cases.

To begin with, I will define two methods that create the backing cache I mentioned before:

    private LoadingCache<K, V> makeCache() {
        return customCacheBuild()
                .removalListener(new PersistingRemovalListener())
                .build(new PersistedStateCacheLoader());
    }

    protected CacheBuilder<Object, Object> customCacheBuild() {
        return CacheBuilder.newBuilder();
    }

The first method will be used internally to build the necessary cache. The second method is supposed to be overridden in order to implement any custom requirement for the cache, for example an expiration strategy. This could be a maximum number of entries or soft references. This cache will be used just like any other Guava cache. The keys to the cache's functionality are the RemovalListener and the CacheLoader that are used for this cache.
We will define these two implementations as inner classes of the FilePersistingCache:

    private class PersistingRemovalListener implements RemovalListener<K, V> {
        @Override
        public void onRemoval(RemovalNotification<K, V> notification) {
            // Persist only entries that the cache evicted on its own. COLLECTED
            // entries are skipped because their key or value is already gone.
            if (notification.wasEvicted() && notification.getCause() != RemovalCause.COLLECTED) {
                try {
                    persistValue(notification.getKey(), notification.getValue());
                } catch (IOException e) {
                    LOGGER.error(String.format("Could not persist key-value: %s, %s",
                            notification.getKey(), notification.getValue()), e);
                }
            }
        }
    }

    public class PersistedStateCacheLoader extends CacheLoader<K, V> {
        @Override
        public V load(K key) {
            V value = null;
            try {
                value = findValueOnDisk(key);
            } catch (Exception e) {
                LOGGER.error(String.format("Error on finding disk value to key: %s", key), e);
            }
            if (value != null) {
                return value;
            } else {
                return makeValue(key);
            }
        }
    }

As is obvious from the code, these inner classes call methods of FilePersistingCache that we have not defined yet. This allows us to define custom serialization behavior by overriding this class. The removal listener checks the reason for a cache entry being removed. An entry is persisted only if it was removed as a consequence of the cache's eviction strategy, that is, if the user did not explicitly remove or replace it. Entries with the RemovalCause COLLECTED are skipped as well, because their key or value has already been garbage collected and there is nothing left to write. The CacheLoader will first attempt to restore an existing value from disk and create a new value only if such a value could not be restored.
The missing methods are defined as follows:

    private V findValueOnDisk(K key) throws IOException {
        if (!isPersist(key)) return null;
        File persistenceFile = makePathToFile(persistenceDirectory, directoryFor(key));
        if (!persistenceFile.exists()) return null;
        FileInputStream fileInputStream = new FileInputStream(persistenceFile);
        try {
            // A read-only channel only supports shared locks; an exclusive
            // lock() would throw a NonWritableChannelException here.
            FileLock fileLock = fileInputStream.getChannel().lock(0, Long.MAX_VALUE, true);
            try {
                return readPersisted(key, fileInputStream);
            } finally {
                fileLock.release();
            }
        } finally {
            fileInputStream.close();
        }
    }

    private void persistValue(K key, V value) throws IOException {
        if (!isPersist(key)) return;
        File persistenceFile = makePathToFile(persistenceDirectory, directoryFor(key));
        persistenceFile.createNewFile();
        FileOutputStream fileOutputStream = new FileOutputStream(persistenceFile);
        try {
            FileLock fileLock = fileOutputStream.getChannel().lock();
            try {
                persist(key, value, fileOutputStream);
            } finally {
                fileLock.release();
            }
        } finally {
            fileOutputStream.close();
        }
    }

    private File makePathToFile(@Nonnull File rootDir, List<String> pathSegments) {
        File persistenceFile = rootDir;
        for (String pathSegment : pathSegments) {
            persistenceFile = new File(persistenceFile, pathSegment);
        }
        if (rootDir.equals(persistenceFile) || persistenceFile.isDirectory()) {
            throw new IllegalArgumentException();
        }
        return persistenceFile;
    }

    protected abstract List<String> directoryFor(K key);

    protected abstract void persist(K key, V value, OutputStream outputStream)
            throws IOException;

    protected abstract V readPersisted(K key, InputStream inputStream)
            throws IOException;

    protected abstract boolean isPersist(K key);

The implemented methods take care of serializing and deserializing values while synchronizing file access and guaranteeing that streams are closed appropriately. The last four methods remain abstract and are up to the cache's user to implement. The directoryFor(K) method should identify a unique file name for each key.
In the easiest case, the toString method of the key's K class is implemented in such a way. Additionally, I made the persist, readPersisted and isPersist methods abstract in order to allow for a custom serialization strategy such as using Kryo. In the easiest scenario, you would use the built in Java functionality which uses ObjectInputStream and ObjectOutputStream. For isPersist, you would return true, assuming that you would only use this implementation if you need serialization. I added this feature to support mixed caches where you can only serialize values to some keys. Be sure not to close the streams within the persist and readPersisted methods since the file system locks rely on the streams to be open. The above implementation will take care of closing the stream for you. Finally, I added some service methods to access the cache. Implementing Guava's Cache interface would of course be a more elegant solution: public V get(K key) { return underlyingCache.getUnchecked(key); } public void put(K key, V value) { underlyingCache.put(key, value); } public void remove(K key) { underlyingCache.invalidate(key); } protected Cache getUnderlyingCache() { return underlyingCache; } Of course, this solution can be further improved. If you use the cache in a concurrent scenario, be further aware that the RemovalListener is, other than most Guava cache method's executed asynchronously. As obvious from the code, I added file locks to avoid read/write conflicts on the file system. This asynchronicity does however imply that there is a small chance that a value entry gets recreated even though there is still a value in memory. If you need to avoid this, be sure to call the underlying cache's cleanUp method within the wrapper's get method. Finally, remember to clean up the file system when you expire your cache. Optimally, you will use a temporary folder of your system for storing your cache entries in order to avoid this problem at all. 
In the example code, the directory is represented by an instance field named persistenceDirectory which could for example be initialized in the constructor. Update: I wrote a clean implementation of what I described above, which you can find on my GitHub page and on Maven Central. Feel free to use it if you need to store your cache objects on disk.
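As a minimal sketch of the built-in Java serialization mentioned above, the persist and readPersisted methods could look roughly as follows. The class name JavaSerializationStrategy and the static, key-less signatures are assumptions made to keep the example self-contained; they are not taken from the implementation described here. Note that neither method closes the stream it is handed, since the file lock depends on it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

// Hypothetical helper, not part of the cache implementation described above
public class JavaSerializationStrategy {

    // Writes the value using Java's built-in serialization. The stream is
    // flushed but deliberately left open for the caller to close.
    public static void persist(Serializable value, OutputStream outputStream) throws IOException {
        ObjectOutputStream objectOutputStream = new ObjectOutputStream(outputStream);
        objectOutputStream.writeObject(value);
        objectOutputStream.flush();
    }

    // Reads a previously persisted value. The stream is left open as well.
    public static Object readPersisted(InputStream inputStream) throws IOException {
        ObjectInputStream objectInputStream = new ObjectInputStream(inputStream);
        try {
            return objectInputStream.readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException("Unknown class in persisted value", e);
        }
    }
}
```

A subclass of the cache wrapper could delegate its abstract persist and readPersisted methods to helpers like these and cast the result to V.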
January 17, 2014
by Rafael Winterhalter
· 18,529 Views · 1 Like
Understanding sun.misc.Unsafe
The biggest competitor to the Java virtual machine might be Microsoft's CLR, which hosts languages such as C#. The CLR allows you to write unsafe code as an entry gate to low-level programming, something that is hard to achieve on the JVM. If you need such advanced functionality in Java, you might be forced to use JNI, which requires you to know some C and quickly leads to code that is tightly coupled to a specific platform. With sun.misc.Unsafe, there is however another alternative for low-level programming on the Java platform using a Java API, even though this alternative is discouraged. Nevertheless, several applications rely on sun.misc.Unsafe, for example objenesis and with it all libraries that build on it, such as kryo, which is in turn used in, for example, Twitter's Storm. Therefore, it is time to have a look, especially since the functionality of sun.misc.Unsafe is being considered for inclusion in Java's public API in Java 9.

Getting hold of an instance of sun.misc.Unsafe

The sun.misc.Unsafe class is intended to be used only by core Java classes, which is why its authors made its only constructor private and only added an equally private singleton instance. The public getter for this instance performs a security check in order to avoid its public use:

    public static Unsafe getUnsafe() {
        Class cc = sun.reflect.Reflection.getCallerClass(2);
        if (cc.getClassLoader() != null)
            throw new SecurityException("Unsafe");
        return theUnsafe;
    }

This method first looks up the calling Class from the current thread's method stack. This lookup is implemented by another internal class named sun.reflect.Reflection, which basically walks down the given number of call stack frames and then returns the class that defines the method found there. This security check is however likely to change in future versions.
When browsing the stack, the first class found (index 0) will obviously be the Reflection class itself, and the second class (index 1) will be the Unsafe class, such that index 2 will hold your application class that was calling Unsafe#getUnsafe(). This looked-up class is then checked for its ClassLoader, where a null reference is used to represent the bootstrap class loader on a HotSpot virtual machine. (This is documented in Class#getClassLoader() where it says that "some implementations may use null to represent the bootstrap class loader".) Since no non-core Java class is normally ever loaded with this class loader, you will never be able to call this method directly but will receive a SecurityException instead. (Technically, you could force the VM to load your application classes using the bootstrap class loader by adding them to the -Xbootclasspath, but this would require some setup outside of your application code which you might want to avoid.) Thus, the following test will succeed:

    @Test(expected = SecurityException.class)
    public void testSingletonGetter() throws Exception {
        Unsafe.getUnsafe();
    }

However, the security check is poorly designed and should be seen as a warning against the singleton anti-pattern. As long as the use of reflection is not prohibited (which is hard since it is so widely used in many frameworks), you can always get hold of an instance by inspecting the private members of the class. From the Unsafe class's source code, you can learn that the singleton instance is stored in a private static field called theUnsafe. This is at least true for the HotSpot virtual machine. Unfortunately for us, other virtual machine implementations sometimes use other names for this field. Android's Unsafe class, for example, stores its singleton instance in a field called THE_ONE. This makes it hard to provide a "compatible" way of receiving the instance.
However, since we already left the safe territory of compatibility by using the Unsafe class, we should not worry about this more than we should worry about using the class at all. For getting hold of the singleton instance, you simply read the singleton field's value:

    Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
    theUnsafe.setAccessible(true);
    Unsafe unsafe = (Unsafe) theUnsafe.get(null);

Alternatively, you can invoke the private constructor. I personally prefer this way since it works, for example, with Android, while extracting the field does not:

    Constructor<Unsafe> unsafeConstructor = Unsafe.class.getDeclaredConstructor();
    unsafeConstructor.setAccessible(true);
    Unsafe unsafe = unsafeConstructor.newInstance();

The price you pay for this minor compatibility advantage is a minimal amount of heap space. The security checks performed when using reflection on fields or constructors are however similar.

Create an Instance of a Class Without Calling a Constructor

The first time I made use of the Unsafe class was for creating an instance of a class without calling any of the class's constructors. I needed to proxy an entire class which only had a rather noisy constructor, but I only wanted to delegate all method invocations to a real instance which I did however not know at the time of construction. Creating a subclass was easy, and if the class had been represented by an interface, creating a proxy would have been a straightforward task. With the expensive constructor, I was however stuck. By using the Unsafe class, I was able to work my way around it.
Consider a class with an artificially expensive constructor:

    class ClassWithExpensiveConstructor {

        private final int value;

        private ClassWithExpensiveConstructor() {
            value = doExpensiveLookup();
        }

        private int doExpensiveLookup() {
            try {
                Thread.sleep(2000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            return 1;
        }

        public int getValue() {
            return value;
        }
    }

Using the Unsafe, we can create an instance of ClassWithExpensiveConstructor (or any of its subclasses) without having to invoke the above constructor, simply by allocating an instance directly on the heap:

    @Test
    public void testObjectCreation() throws Exception {
        ClassWithExpensiveConstructor instance = (ClassWithExpensiveConstructor)
            unsafe.allocateInstance(ClassWithExpensiveConstructor.class);
        assertEquals(0, instance.getValue());
    }

Note that the final field remained uninitialized by the constructor and instead holds its type's default value. Other than that, the constructed instance behaves like a normal Java object. It will for example be garbage collected when it becomes unreachable. The Java run time itself creates objects without calling a constructor, for example when creating objects for deserialization. Therefore, the ReflectionFactory offers even more access to individual object creation:

    @Test
    public void testReflectionFactory() throws Exception {
        @SuppressWarnings("unchecked")
        Constructor<ClassWithExpensiveConstructor> silentConstructor =
            (Constructor<ClassWithExpensiveConstructor>) ReflectionFactory.getReflectionFactory()
                .newConstructorForSerialization(ClassWithExpensiveConstructor.class,
                    Object.class.getConstructor());
        silentConstructor.setAccessible(true);
        // Only Object's constructor runs, so the field keeps its default value
        assertEquals(0, silentConstructor.newInstance().getValue());
    }

Note that the ReflectionFactory class only requires a RuntimePermission called reflectionFactoryAccess for receiving its singleton instance, and no reflection is therefore required here. The received instance of ReflectionFactory allows you to define any constructor to become a constructor for the given type.
In the example above, I used the default constructor of java.lang.Object for this purpose. You can however use any constructor:

    class OtherClass {

        private final int value;
        private final int unknownValue;

        private OtherClass() {
            System.out.println("test");
            this.value = 10;
            this.unknownValue = 20;
        }
    }

    @Test
    public void testStrangeReflectionFactory() throws Exception {
        @SuppressWarnings("unchecked")
        Constructor<ClassWithExpensiveConstructor> silentConstructor =
            (Constructor<ClassWithExpensiveConstructor>) ReflectionFactory.getReflectionFactory()
                .newConstructorForSerialization(ClassWithExpensiveConstructor.class,
                    OtherClass.class.getDeclaredConstructor());
        silentConstructor.setAccessible(true);
        ClassWithExpensiveConstructor instance = silentConstructor.newInstance();
        assertEquals(10, instance.getValue());
        assertEquals(ClassWithExpensiveConstructor.class, instance.getClass());
        assertEquals(Object.class, instance.getClass().getSuperclass());
    }

Note that value was set in this constructor even though the constructor of a completely different class was invoked. Fields that do not exist in the target class are however ignored, as is also obvious from the above example. Note that OtherClass does not become part of the constructed instance's type hierarchy; the OtherClass constructor is simply borrowed for the "serialized" type. Not mentioned in this blog entry are other methods such as Unsafe#defineClass, Unsafe#defineAnonymousClass or Unsafe#ensureClassInitialized. Similar functionality is however also defined in the public API's ClassLoader.

Native Memory Allocation

Did you ever want to allocate an array in Java that should have had more than Integer.MAX_VALUE entries? Probably not, because this is not a common task, but if you ever need this functionality, it is possible. You can create such an array by allocating native memory. Native memory allocation is used, for example, by the direct byte buffers that are offered in Java's NIO packages.
Unlike heap memory, native memory is not part of the heap area and can be used non-exclusively, for example for communicating with other processes. As a result, Java's heap space is in competition with the native space: the more memory you assign to the JVM, the less native memory is left. Let us look at an example of using native (off-heap) memory in Java by creating the mentioned oversized array:

    class DirectIntArray {

        private final static long INT_SIZE_IN_BYTES = 4;

        private final long startIndex;

        public DirectIntArray(long size) {
            startIndex = unsafe.allocateMemory(size * INT_SIZE_IN_BYTES);
            unsafe.setMemory(startIndex, size * INT_SIZE_IN_BYTES, (byte) 0);
        }

        public void setValue(long index, int value) {
            unsafe.putInt(index(index), value);
        }

        public int getValue(long index) {
            return unsafe.getInt(index(index));
        }

        private long index(long offset) {
            return startIndex + offset * INT_SIZE_IN_BYTES;
        }

        public void destroy() {
            unsafe.freeMemory(startIndex);
        }
    }

    @Test
    public void testDirectIntArray() throws Exception {
        long maximum = Integer.MAX_VALUE + 1L;
        DirectIntArray directIntArray = new DirectIntArray(maximum);
        directIntArray.setValue(0L, 10);
        // The last valid index of an array with 'maximum' entries is maximum - 1
        directIntArray.setValue(maximum - 1, 20);
        assertEquals(10, directIntArray.getValue(0L));
        assertEquals(20, directIntArray.getValue(maximum - 1));
        directIntArray.destroy();
    }

First, make sure that your machine has sufficient memory for running this example! You need at least (2147483647 + 1) * 4 bytes = 8192 MB of native memory for running the code. If you have worked with other programming languages such as C, direct memory allocation is something you do every day. By calling Unsafe#allocateMemory(long), the virtual machine allocates the requested amount of native memory for you. After that, it will be your responsibility to handle this memory correctly. The amount of memory that is required for storing a specific value depends on the type's size.
In the above example, I used an int type which represents a 32-bit integer. Consequently, a single int value consumes 4 bytes. For primitive types, the size is well documented. It is however more complex to compute the size of object types since it depends on the number of non-static fields that are declared anywhere in the type hierarchy. The most canonical way of computing an object's size is using the Instrumentation interface from the java.lang.instrument package, which offers a dedicated method for this purpose called getObjectSize. I will however evaluate another (hacky) way of dealing with objects at the end of this section. Be aware that directly allocated memory is always native memory and therefore not garbage collected. You therefore have to free memory explicitly, as demonstrated in the above example by a call to Unsafe#freeMemory(long). Otherwise you have reserved some memory that can never be used for something else as long as the JVM instance is running, which is a memory leak and a common problem in non-garbage-collected languages. Alternatively, you can also directly reallocate memory at a certain address by calling Unsafe#reallocateMemory(long, long), where the second argument describes the new number of bytes to be reserved; the method returns the address of the resized block. Also, note that directly allocated memory is not initialized with a certain value. In general, you will find garbage from old usages of this memory area, such that you have to explicitly initialize your allocated memory if you require a default value. This is something that is normally done for you when you let the Java run time allocate the memory. In the above example, the entire area is overwritten with zeros with the help of the Unsafe#setMemory method. When using directly allocated memory, the JVM will not do range checks for you either.
It is therefore possible to corrupt your memory, as this example shows:

    @Test
    public void testMallaciousAllocation() throws Exception {
        long address = unsafe.allocateMemory(2L * 4);
        unsafe.setMemory(address, 8L, (byte) 0);
        assertEquals(0, unsafe.getInt(address));
        assertEquals(0, unsafe.getInt(address + 4));
        unsafe.putInt(address + 1, 0xffffffff);
        assertEquals(0xffffff00, unsafe.getInt(address));
        assertEquals(0x000000ff, unsafe.getInt(address + 4));
    }

Note that we wrote a value into space that was partly reserved for the first and partly for the second number. This picture might clear things up. Be aware that the values in memory run from "right to left" (but this might be machine dependent). The first row shows the initial state after writing zeros to the entire allocated native memory area. Then we overwrite 4 bytes at an offset of a single byte using 32 ones. The last row shows the result after this write operation.

Finally, we want to write an entire object into native memory. As mentioned above, this is a difficult task since we first need to compute the size of the object in order to know how much space we need to reserve. The Unsafe class does however not offer such functionality, at least not directly. We can however use the Unsafe class to find the offset of an instance's field, which is what the JVM uses when it allocates objects on the heap itself. This allows us to find the approximate size of an object:

    public long sizeOf(Class<?> clazz) {
        long maximumOffset = 0;
        do {
            for (Field f : clazz.getDeclaredFields()) {
                if (!Modifier.isStatic(f.getModifiers())) {
                    maximumOffset = Math.max(maximumOffset, unsafe.objectFieldOffset(f));
                }
            }
        } while ((clazz = clazz.getSuperclass()) != null);
        return maximumOffset + 8;
    }

This might look cryptic at first, but there is no big secret behind this code. We simply iterate over all non-static fields that are declared in the class itself or in any of its super classes.
We do not have to worry about interfaces since those cannot define fields and will therefore never alter an object's memory layout. Each of these fields has an offset which represents the first byte that is occupied by this field's value when the JVM stores an instance of this type in memory, relative to the first byte that is used for this object. We simply have to find the maximum offset in order to find the space that is required for all fields but the last one. Since a field will never occupy more than 64 bits (8 bytes), which is the case for a long or double value or for an object reference when run on a 64-bit machine, we have at least found an upper bound for the space that is used to store an object. Therefore, we simply add these 8 bytes to the maximum offset and will not run into danger of having reserved too little space. This idea of course wastes some bytes, and a better algorithm should be used for production code. In this context, it is best to think of a class definition as a form of heterogeneous array. Note that the minimum field offset is not 0 but a positive value. The first few bytes contain meta information. The graphic below visualizes this principle for an example object with an int and a long field where both fields have an offset. Note that we do not normally write meta information when writing a copy of an object into native memory, so we could further reduce the amount of native memory used. Also note that this memory layout might be highly dependent on the implementation of the Java virtual machine. With this overly careful estimate, we can now implement some stub methods for writing shallow copies of objects directly into native memory. Note that native memory does not really know the concept of an object. We are basically just setting a given number of bytes to values that reflect an object's current values. As long as we remember the memory layout for this type, these bytes do however contain enough information to reconstruct the object.
    public void place(Object o, long address) throws Exception {
        Class<?> clazz = o.getClass();
        do {
            for (Field f : clazz.getDeclaredFields()) {
                if (!Modifier.isStatic(f.getModifiers())) {
                    long offset = unsafe.objectFieldOffset(f);
                    if (f.getType() == long.class) {
                        unsafe.putLong(address + offset, unsafe.getLong(o, offset));
                    } else if (f.getType() == int.class) {
                        unsafe.putInt(address + offset, unsafe.getInt(o, offset));
                    } else {
                        throw new UnsupportedOperationException();
                    }
                }
            }
        } while ((clazz = clazz.getSuperclass()) != null);
    }

    public Object read(Class<?> clazz, long address) throws Exception {
        Object instance = unsafe.allocateInstance(clazz);
        do {
            for (Field f : clazz.getDeclaredFields()) {
                if (!Modifier.isStatic(f.getModifiers())) {
                    long offset = unsafe.objectFieldOffset(f);
                    if (f.getType() == long.class) {
                        unsafe.putLong(instance, offset, unsafe.getLong(address + offset));
                    } else if (f.getType() == int.class) {
                        // An int field must be written with putInt, not putLong
                        unsafe.putInt(instance, offset, unsafe.getInt(address + offset));
                    } else {
                        throw new UnsupportedOperationException();
                    }
                }
            }
        } while ((clazz = clazz.getSuperclass()) != null);
        return instance;
    }

    @Test
    public void testObjectAllocation() throws Exception {
        long containerSize = sizeOf(Container.class);
        // Two containers are placed, so room for two is reserved
        long address = unsafe.allocateMemory(2 * containerSize);
        Container c1 = new Container(10, 1000L);
        Container c2 = new Container(5, -10L);
        place(c1, address);
        place(c2, address + containerSize);
        Container newC1 = (Container) read(Container.class, address);
        Container newC2 = (Container) read(Container.class, address + containerSize);
        assertEquals(c1, newC1);
        assertEquals(c2, newC2);
    }

Note that these stub methods for writing and reading objects in native memory only support int and long field values. Of course, Unsafe supports all primitive values and can even write values without hitting thread-local caches by using the volatile forms of the methods. The stubs were only used to keep the examples concise.
Be aware that these "instances" would never get garbage collected since their memory was allocated directly. (But maybe this is what you want.) Also, be careful when precalculating sizes, since an object's memory layout might be VM dependent and might also differ between a 64-bit and a 32-bit machine. The offsets might even change between JVM restarts. For reading and writing primitives or object references, Unsafe provides the following type-dependent methods:

  • getXXX(Object target, long offset): Will read a value of type XXX from target's address at the specified offset.
  • putXXX(Object target, long offset, XXX value): Will place value at target's address at the specified offset.
  • getXXXVolatile(Object target, long offset): Will read a value of type XXX from target's address at the specified offset and not hit any thread-local caches.
  • putXXXVolatile(Object target, long offset, XXX value): Will place value at target's address at the specified offset and not hit any thread-local caches.
  • putOrderedXXX(Object target, long offset, XXX value): Will place value at target's address at the specified offset and might not hit all thread-local caches.
  • putXXX(long address, XXX value): Will place the specified value of type XXX directly at the specified address.
  • getXXX(long address): Will read a value of type XXX from the specified address.
  • compareAndSwapXXX(Object target, long offset, XXX expectedValue, XXX value): Will atomically read a value of type XXX from target's address at the specified offset and set the given value if the current value at this offset equals the expected value.

Be aware that you are copying references when writing or reading object copies in native memory by using the getObject(Object, long) method family. You are therefore only creating shallow copies of instances when applying the above method. You could however always read object sizes and offsets recursively and create deep copies.
Watch out, however, for cyclic object references, which would cause infinite loops when applying this principle carelessly. Not mentioned here are existing utilities in the Unsafe class that allow manipulation of static field values, such as staticFieldOffset, and utilities for handling array types. Finally, both methods named Unsafe#copyMemory allow you to instruct a direct copy of memory, either relative to a specific object offset or at an absolute address, as the following example shows:

    @Test
    public void testCopy() throws Exception {
        long address = unsafe.allocateMemory(4L);
        unsafe.putInt(address, 100);
        long otherAddress = unsafe.allocateMemory(4L);
        unsafe.copyMemory(address, otherAddress, 4L);
        assertEquals(100, unsafe.getInt(otherAddress));
    }

Throwing Checked Exceptions Without Declaration

There are some other interesting methods to find in Unsafe. Did you ever want to throw a specific exception to be handled in a lower layer, but your high-layer interface type did not declare this checked exception? Unsafe#throwException allows you to do so:

    @Test(expected = Exception.class)
    public void testThrowChecked() throws Exception {
        throwChecked();
    }

    public void throwChecked() {
        unsafe.throwException(new Exception());
    }

Native Concurrency

The park and unpark methods allow you to pause a thread for a certain amount of time and to resume it:

    @Test
    public void testPark() throws Exception {
        final boolean[] run = new boolean[1];
        Thread thread = new Thread() {
            @Override
            public void run() {
                unsafe.park(true, 100000L);
                run[0] = true;
            }
        };
        thread.start();
        unsafe.unpark(thread);
        thread.join(100L);
        assertTrue(run[0]);
    }

Also, monitors can be acquired directly with Unsafe by using monitorEnter(Object), monitorExit(Object) and tryMonitorEnter(Object). A file containing all the examples of this blog entry is available as a gist.
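For readers who would rather stay within the public API, the standard library's java.util.concurrent.locks.LockSupport class exposes park and unpark as well. The following is only a sketch of that public counterpart, not of Unsafe itself; the class name ParkExample and the two flags are made up for this illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Hypothetical example class using the public LockSupport API
public class ParkExample {

    // Parks a worker thread until the calling thread releases it and
    // reports whether the worker resumed afterwards.
    public static boolean parkUntilReleased() throws InterruptedException {
        final AtomicBoolean released = new AtomicBoolean(false);
        final AtomicBoolean resumed = new AtomicBoolean(false);
        Thread worker = new Thread(new Runnable() {
            @Override
            public void run() {
                // park() may return without a matching unpark, so the
                // condition is re-checked in a loop
                while (!released.get()) {
                    LockSupport.park();
                }
                resumed.set(true);
            }
        });
        worker.start();
        released.set(true);
        // Makes the worker's permit available; if the worker has not parked
        // yet, its next call to park() returns immediately
        LockSupport.unpark(worker);
        worker.join();
        return resumed.get();
    }
}
```

Guarding the park call with a condition in a loop is the usual pattern, since a single park call offers no guarantee about why it returned.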
January 14, 2014
by Rafael Winterhalter
· 152,590 Views · 39 Likes
JBoss 5 to 7 in 11 steps
Introduction

Some time ago we decided to upgrade our application from JBoss 5 to 7 (technically 7.2). In this article I am going to describe several things which we found problematic. At the end I also provide a short list of the benefits we gained in retrospect. First, some general information about our application. It was built using EJB 3.0 technology. We have 2 interfaces for communicating with other components: JMS and JAX-WS. We use JBoss AS 5 as our messaging broker, which is started as a separate JVM process. We were not allowed to change this part of the system. Finally, we use JPA to store processing results in an Oracle DB.

Step #1 – Convince your Product Owner

Although our application was rather small and built on the JEE5 standard, it took us 4 weeks to migrate it to JEE6 and JBoss 7. So you can't do it as a maintenance ticket; it's simply too big. It is always a problem to demonstrate the business value of such a migration to Product Owners as well as to key stakeholders. There are several aspects which might help you convince them. One of the biggest benefits is processing time. JBoss 7 is simply faster and has better caching (Infinispan over Ehcache). Another one is startup time (our server is ready to go in 5-6 seconds as opposed to 1 minute in JBoss 5). Finally, development is much faster (EJB 3.1 is much better than 3.0). The last one might be translated to "time to market". With the above arguments, I'm pretty sure you'll convince them.
Step #2 – Do some reading

Here is a list of interesting links which are worth reading before the migration:

  • JBoss 5 -> 7 migration guide: https://docs.jboss.org/author/display/AS7/How+do+I+migrate+my+application+from+AS5+or+AS6+to+AS7
  • JBoss 7 vs EAP libraries: https://access.redhat.com/site/articles/112673
  • JBoss EAP FAQ: http://www.jboss.org/jbossas/faq
  • Cache implementation benchmarks: http://sourceforge.net/p/nitrocache/blog/2012/05/performance-benchmark-nitrocache--ehcache--infinispan--jcs--cach4j/
  • JBoss 7 performance tuning: http://www.mastertheboss.com/jboss-performance/jboss-as-7-performance-tuning
  • JBoss caching: http://www.mastertheboss.com/hibernate-howto/using-hibernate-second-level-cache-with-jboss-as-5-6-7

Step #3 – Off you go – change Maven dependencies

JBoss 5 isn't packaged very well, so I suppose you have many dependencies included in your classpath (either directly or through transitive dependencies). This is the first big change in JBoss 7. I strongly advise you to use this artifact in your dependency management section:

    <dependency>
        <groupId>org.jboss.as</groupId>
        <artifactId>jboss-as-parent</artifactId>
        <version>7.2.0.Final</version>
        <type>pom</type>
        <scope>import</scope>
    </dependency>

We also decided to stick only to the JEE6 spec and configure all additional JBoss 7 options with proper XML files. If that sounds good for your project too, just add this dependency and you're done with this step:

    <dependency>
        <groupId>org.jboss.spec</groupId>
        <artifactId>jboss-javaee-6.0</artifactId>
        <version>1.0.0.Final</version>
        <type>pom</type>
        <scope>provided</scope>
    </dependency>

After cleaning up dependencies, your code probably won't compile for a couple of days or even weeks. It takes time to clean this up.

Step #4 – EJB 3.0 to 3.1 migration

Dependency injection is the heart of the application, so it is worth starting with it. Almost all of your code should work, but you'll have some problems with beans annotated with @Service (these are singletons in the JBoss 5 EJB Extended API). You just need to replace them with @Singleton annotations and put a @PostConstruct annotation on your init method. One last thing: remember to use a proper concurrency strategy.
We decided to use @ConcurrencyManagement(BEAN) and leave the implementation as is.

Step #5 – Upgrade to JPA 2.0

If you used JPA 1.0 with Hibernate, I'm pretty sure you have a lot of non-standard annotations defining caching or cascading. All of them can be replaced with JPA 2.0 annotations, and you can finally get rid of Hibernate on the compile classpath and depend only on JPA 2.0. Here are several standard things to do:

  • Get rid of Hibernate's Session.evict and switch to EntityManager.detach
  • Get rid of Hibernate's @Cache annotation and replace it with @Cacheable
  • Fix cascades (delete orphan is now a part of the @XXXToYYY annotations)
  • Remove the Hibernate dependency and stick with the JEE6 spec

Step #6 – Fix Hibernate's sequencer

Migrating Hibernate 3 to 4 is a bit tricky because of the way it uses sequences (fields annotated with @Id). By default, Hibernate uses a pool of IDs instead of incrementing the sequence for each ID. An example will be more descriptive:

  • Some_DB_Sequence.nextval -> 1; Hibernate 3: 1*50 = 50; IDs to be used = 50, 51, 52, ..., 99
  • Some_DB_Sequence.nextval -> 2; Hibernate 3: 2*50 = 100; IDs to be used = 100, 101, 102, ..., 149

In Hibernate 4.x there is a new sequence generator that uses IDs that are related 1:1 to the DB sequence. Typically it's disabled by default... but not in JBoss 7.1. So after the migration, Hibernate tries to insert entities using IDs read from the sequence (using the new sequence generator) that were already used, which causes a constraint violation. The fastest solution is to switch Hibernate back to the old method of sequence generation (described in the example above), which requires a corresponding change in persistence.xml.

Step #7 – Caching

Infinispan is shipped with JBoss 7 and does not require much configuration. There is only one setting in persistence.xml which needs to be set; the others can be removed. Infinispan itself might require some extra configuration; just use standalone-full-ha.xml as a guide.
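To make the arithmetic from Step #6 concrete, here is a small sketch that only mirrors the numbers of the example given there. The class PooledIdRange and its method names are hypothetical and not part of Hibernate; the allocation size of 50 is taken from the example.

```java
// Hypothetical helper that mirrors the ID ranges of the Step #6 example
public class PooledIdRange {

    // First ID of the block derived from a sequence value under the old scheme
    public static long firstId(long sequenceValue, int allocationSize) {
        return sequenceValue * allocationSize;
    }

    // Last ID of the block derived from a sequence value under the old scheme
    public static long lastId(long sequenceValue, int allocationSize) {
        return (sequenceValue + 1) * allocationSize - 1;
    }
}
```

With an allocation size of 50, a sequence value of 1 maps to the IDs 50 to 99 and a sequence value of 2 to the IDs 100 to 149, whereas a generator that uses the sequence 1:1 would hand out 1, 2, 3 and so on from the same sequence.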
Step #8 – RMI with JBoss 5

If you're using a lot of RMI to communicate with other JBoss 5 servers, I have bad news for you: JBoss 5 and 7 are totally different and this kind of communication will not work. I strongly recommend switching to some other technology such as JAX-WS. In retrospect, we are very glad we decided to do it.

Step #9 – JMS migration

We thought it would be really hard to connect to a JMS server based on JBoss 5. It turned out that you have 2 options and both work fine:

  • Start a HornetQ server on your own instance and create a bridge to the JBoss 5 instance
  • Use the Generic JMS adapter: https://github.com/jms-ra/generic-jms-ra

Step #10 – Fix EAR layout

In JBoss 5 it does not matter where the jars are placed. All EJBs are started. This does not work with JBoss 7 anymore. All EJBs which should start must be added as modules.

Step #11 – JMX console

Bad news: it's not present in JBoss 7. We liked it very much, but we had to switch to jvisualvm to invoke our JMX operations. There is a ticket opened for that in the WildFly Jira: https://issues.jboss.org/browse/WFLY-1197. Unfortunately, at the moment of writing this article it is not resolved.

Some thoughts in retrospect

Migrating from JBoss 5 to 7 is a really time-consuming task. In my opinion, though, it is worth it. Now we have better caching for cluster solutions (Infinispan), better DI (EJB 3.1) and better Web Services (CXF instead of JBoss WS). Processing time decreased by 25% without any code change. Development speed increased, in my opinion (it is really hard to measure), by 50% and we are much more productive (faster server restarts). The memory footprint dropped from 1GB to 512MB. Finally, automatic application redeployment actually works! However, there is always a price to pay: the migration took us 4 weeks (2 sprints). We didn't write any code for our business in that period.
So make sure you prepare well for such a migration. My last piece of advice: invest some time in writing good automated functional tests (we use Arquillian for that). Once they're green again, you're almost crossing the finish line.
January 9, 2014
by Sebastian Laskawiec
· 46,949 Views
Spring Cache Abstraction
Spring's cache abstraction applies caching to Java methods. It provides an environment where we can cache the results of the methods we choose. By doing so, it improves performance by avoiding repeated execution of a method for the same input. Note that this type of caching can only be applied to methods which return the same result for the same input. In this post, we will dive into the Spring cache abstraction and give code samples for the related parts. Spring provides annotations for caching. The first and most basic way of caching is the @Cacheable annotation. When we make a method @Cacheable, the cache is checked on each invocation to see whether a result for that invocation exists. Let's see an example of the basic use of @Cacheable:

    @Cacheable("customers")
    public Customer findCustomer(long customerId) {...}

When the method has a complex input, you can generate the key by specifying which attribute will be the key for the cache. Method arguments are referenced with a leading # in the key expression. Let's see an example:

    @Cacheable(value="customer", key="#identity.customerId")
    public Customer findCustomer(Identity identity) {...}

Spring also provides conditional caching for the @Cacheable annotation. You can specify the condition under which you want items to be cached with the condition parameter. Let's see the condition parameter in an example:

    @Cacheable(value="customer", condition="#identity.loginFrequency > 3")
    public Customer findCustomer(Identity identity) {...}

Eviction is an important issue: entries should be evicted from the cache since it can contain stale items. While @Cacheable populates the cache, @CacheEvict removes stale items from it. Let's see a cache eviction example:

    @CacheEvict(value="customer", allEntries = true)
    public void removeAllCustomers(long customerId) {...}

Spring provides ConcurrentHashMap-based caching when the corresponding cache manager is specified. However, we can also use other cache managers such as ImcacheCacheManager.
For an example project, have a look at the imcache-examples project on GitHub. The example class is SpringCacheExample.java and the example configuration is exampleContext.xml.
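To make the idea behind @Cacheable and @CacheEvict more tangible, here is a minimal plain-Java sketch of the same check-then-compute behaviour on top of a ConcurrentHashMap. It is not how Spring implements the abstraction; the class name MemoizingLookup and its methods are hypothetical and exist only for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper, not part of Spring: caches the result of a loader function per key.
final class MemoizingLookup<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;
    private final AtomicInteger loads = new AtomicInteger();

    MemoizingLookup(Function<K, V> loader) {
        this.loader = loader;
    }

    // Roughly what a cacheable method does: return the cached value if present,
    // otherwise run the expensive method once and remember its result.
    V get(K key) {
        return cache.computeIfAbsent(key, k -> {
            loads.incrementAndGet();
            return loader.apply(k);
        });
    }

    // Roughly what evicting all entries does: drop every cached value.
    void evictAll() {
        cache.clear();
    }

    // Number of times the underlying loader actually ran.
    int loadCount() {
        return loads.get();
    }
}
```

Calling get twice with the same key runs the loader only once; after evictAll the next call runs it again, which is the behaviour the annotations above provide declaratively.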
January 8, 2014
by Yusuf Aytaş
· 31,674 Views · 1 Like
CGLib: The Missing Manual
The byte code instrumentation library cglib is a popular choice among many well-known Java frameworks such as Hibernate (not anymore) or Spring for doing their dirty work. Byte code instrumentation allows you to manipulate or create classes after the compilation phase of a Java application. Since Java classes are linked dynamically at run time, it is possible to add new classes to an already running Java program. Hibernate, for example, uses cglib to generate dynamic proxies. Instead of returning the full object that you stored in a database, Hibernate returns an instrumented version of your stored class that lazily loads some values from the database only when they are requested. Spring used cglib, for example, when adding security constraints to your method calls. Instead of calling your method directly, Spring Security first checks whether a specified security check passes and only delegates to your actual method after this verification. Another popular use of cglib is within mocking frameworks such as Mockito, where mocks are nothing more than instrumented classes whose methods were replaced with empty implementations (plus some tracking logic).

Unlike ASM, the low-level byte code manipulation library on top of which cglib is built, cglib offers rather high-level byte code transformers that can be used without even knowing the details of a compiled Java class. Unfortunately, the documentation of cglib is rather short, not to say that there is basically none. Besides a single blog article from 2005 that demonstrates the Enhancer class, there is not much to find. This blog article is an attempt to demonstrate cglib and its unfortunately often awkward API.

Enhancer

Let's start with the Enhancer class, probably the most used class of the cglib library. An enhancer allows the creation of Java proxies for non-interface types. The Enhancer can be compared with the Java standard library's Proxy class, which was introduced in Java 1.3.
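For comparison, the following sketch shows the JDK's own Proxy class at work. It is a minimal illustration, not part of cglib: the interface Greeter and the class JdkProxyDemo are hypothetical names. Note that the JDK proxy can only implement interfaces, and that the handler delegates to the real target rather than to the proxy argument, since calling the proxy from inside the handler would re-enter the handler.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Hypothetical interface used only for this illustration.
interface Greeter {
    String greet(String name);
}

final class JdkProxyDemo {
    // Creates a JDK proxy that answers Greeter methods itself and
    // forwards everything else (toString, hashCode, equals) to the target.
    static Greeter createProxy(Greeter target) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (method.getDeclaringClass() == Greeter.class
                    && method.getReturnType() == String.class) {
                return "Hello proxy!";
            }
            // Delegate to the real object, never to 'proxy' itself.
            return method.invoke(target, args);
        };
        return (Greeter) Proxy.newProxyInstance(
                Greeter.class.getClassLoader(),
                new Class<?>[] {Greeter.class},
                handler);
    }
}
```

A caller would wrap an existing Greeter with JdkProxyDemo.createProxy(realGreeter); a plain class without an interface cannot be proxied this way, which is the gap the Enhancer fills.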
The Enhancer dynamically creates a subclass of a given type but intercepts all method calls. Unlike the Proxy class, this works for both class and interface types. The following example and some of the examples after it are based on this simple Java POJO:

    public class SampleClass {
        public String test(String input) {
            return "Hello world!";
        }
    }

Using cglib, the return value of the test(String) method can easily be replaced by another value using an Enhancer and a FixedValue callback:

    @Test
    public void testFixedValue() throws Exception {
        Enhancer enhancer = new Enhancer();
        enhancer.setSuperclass(SampleClass.class);
        enhancer.setCallback(new FixedValue() {
            @Override
            public Object loadObject() throws Exception {
                return "Hello cglib!";
            }
        });
        SampleClass proxy = (SampleClass) enhancer.create();
        assertEquals("Hello cglib!", proxy.test(null));
    }

In the above example, the enhancer returns an instance of an instrumented subclass of SampleClass where all method calls return a fixed value which is generated by the anonymous FixedValue implementation above. The object is created by Enhancer#create(Object...), where the method takes any number of arguments which are used to pick a constructor of the enhanced class. (Even though constructors are only methods on the Java byte code level, the Enhancer class cannot instrument constructors. Neither can it instrument static or final classes.) If you only want to create a class, but no instance, Enhancer#createClass will create a Class instance which can be used to create instances dynamically. All constructors of the enhanced class will be available as delegation constructors in this dynamically generated class. Be aware that any method call will be delegated in the above example, including calls to the methods defined in java.lang.Object. As a result, a call to proxy.toString() will also return "Hello cglib!".
By contrast, a call to proxy.hashCode() results in a ClassCastException since the FixedValue interceptor always returns a String even though the Object#hashCode signature requires a primitive integer. Another observation is that final methods are not intercepted. An example of such a method is Object#getClass, which will return something like "SampleClass$$EnhancerByCGLIB$$e277c63c" when it is invoked. This class name is generated randomly by cglib in order to avoid naming conflicts. Be aware of the different class of the enhanced instance when you are making use of explicit types in your program code. The class generated by cglib will however be in the same package as the enhanced class (and therefore be able to override package-private methods). Similar to final methods, the subclassing approach makes it impossible to enhance final classes. Therefore frameworks such as Hibernate cannot persist final classes.

Next, let us look at a more powerful callback class, the InvocationHandler, that can also be used with an Enhancer:

    @Test
    public void testInvocationHandler() throws Exception {
        Enhancer enhancer = new Enhancer();
        enhancer.setSuperclass(SampleClass.class);
        enhancer.setCallback(new InvocationHandler() {
            @Override
            public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
                if (method.getDeclaringClass() != Object.class && method.getReturnType() == String.class) {
                    return "Hello cglib!";
                } else {
                    throw new RuntimeException("Do not know what to do.");
                }
            }
        });
        SampleClass proxy = (SampleClass) enhancer.create();
        assertEquals("Hello cglib!", proxy.test(null));
        assertNotEquals("Hello cglib!", proxy.toString());
    }

This callback allows us to answer with regard to the invoked method. However, you should be careful when calling a method on the proxy object that is passed to the InvocationHandler#invoke method. All calls on this object will be dispatched to the same InvocationHandler and might therefore result in an endless loop.
To avoid such an endless loop, we can use yet another callback dispatcher:

    @Test
    public void testMethodInterceptor() throws Exception {
        Enhancer enhancer = new Enhancer();
        enhancer.setSuperclass(SampleClass.class);
        enhancer.setCallback(new MethodInterceptor() {
            @Override
            public Object intercept(Object obj, Method method, Object[] args, MethodProxy proxy) throws Throwable {
                if (method.getDeclaringClass() != Object.class && method.getReturnType() == String.class) {
                    return "Hello cglib!";
                } else {
                    // Return the result of the original implementation.
                    return proxy.invokeSuper(obj, args);
                }
            }
        });
        SampleClass proxy = (SampleClass) enhancer.create();
        assertEquals("Hello cglib!", proxy.test(null));
        assertNotEquals("Hello cglib!", proxy.toString());
        proxy.hashCode(); // Does not throw an exception or result in an endless loop.
    }

The MethodInterceptor allows full control over the intercepted method and offers some utilities for calling the method of the enhanced class in its original state. But why would one want to use the other callbacks anyway? Because the other callbacks are more efficient, and cglib is often used in edge case frameworks where efficiency plays a significant role. The creation and linkage of the MethodInterceptor requires, for example, the generation of a different type of byte code and the creation of some runtime objects that are not required with the InvocationHandler. Because of that, there are other classes that can be used with the Enhancer:

LazyLoader: Even though the LazyLoader's only method has the same method signature as FixedValue, the LazyLoader is fundamentally different from the FixedValue interceptor. The LazyLoader is actually supposed to return an instance of a subclass of the enhanced class. This instance is requested only when a method is called on the enhanced object and is then stored for future invocations of the generated proxy. This makes sense if your object is expensive to create and you do not know whether the object will ever be used.
Be aware that some constructor of the enhanced class must be called both for the proxy object and for the lazily loaded object. Thus, make sure that there is another cheap (maybe protected) constructor available or use an interface type for the proxy. You can choose the invoked constructor by supplying arguments to Enhancer#create(Object...).

Dispatcher: The Dispatcher is like the LazyLoader but will be invoked on every method call without storing the loaded object. This allows you to change the implementation of a class without changing the reference to it. Again, be aware that some constructor must be called for both the proxy and the generated objects.

ProxyRefDispatcher: This class carries a reference to the proxy object it is invoked from in its signature. This allows, for example, delegating method calls to another method of this proxy. Be aware that this can easily cause an endless loop and will always cause an endless loop if the same method is called from within ProxyRefDispatcher#loadObject(Object).

NoOp: The NoOp class does not do what its name suggests. Instead, it delegates each method call to the enhanced class's method implementation.

At this point, the last two interceptors might not make sense to you. Why would you even want to enhance a class when you will always delegate method calls to the enhanced class anyway? And you are right. These interceptors should only be used together with a CallbackFilter, as demonstrated in the following code snippet:

    @Test
    public void testCallbackFilter() throws Exception {
        Enhancer enhancer = new Enhancer();
        CallbackHelper callbackHelper = new CallbackHelper(SampleClass.class, new Class[0]) {
            @Override
            protected Object getCallback(Method method) {
                if (method.getDeclaringClass() != Object.class && method.getReturnType() == String.class) {
                    return new FixedValue() {
                        @Override
                        public Object loadObject() throws Exception {
                            return "Hello cglib!";
                        }
                    };
                } else {
                    return NoOp.INSTANCE; // A singleton provided by NoOp.
                }
            }
        };
        enhancer.setSuperclass(SampleClass.class);
        enhancer.setCallbackFilter(callbackHelper);
        enhancer.setCallbacks(callbackHelper.getCallbacks());
        SampleClass proxy = (SampleClass) enhancer.create();
        assertEquals("Hello cglib!", proxy.test(null));
        assertNotEquals("Hello cglib!", proxy.toString());
        proxy.hashCode(); // Does not throw an exception or result in an endless loop.
    }

The Enhancer instance accepts a CallbackFilter in its Enhancer#setCallbackFilter(CallbackFilter) method, where it expects methods of the enhanced class to be mapped to array indices of an array of Callback instances. When a method is invoked on the created proxy, the Enhancer will then choose the corresponding interceptor and dispatch the called method on the corresponding Callback (which is a marker interface for all the interceptors that were introduced so far). To make this API less awkward, cglib offers a CallbackHelper which represents a CallbackFilter and which can create an array of Callbacks for you. The enhanced object above will be functionally equivalent to the one in the example for the MethodInterceptor, but it allows you to write specialized interceptors whilst keeping the dispatching logic to these interceptors separate.

How does it work?

When the Enhancer creates a class, it creates a private static field for each interceptor that was registered as a Callback for the enhanced class after its creation. This also means that class definitions that were created with cglib cannot be reused after their creation, since the registration of callbacks does not become a part of the generated class's initialization phase but is prepared manually by cglib after the class was already initialized by the JVM. This also means that classes created with cglib are not technically ready after their initialization and, for example, cannot be sent over the wire since the callbacks would not exist for the class loaded on the target machine.
Depending on the registered interceptors, cglib might register additional fields. For the MethodInterceptor, for example, two private static fields (one holding a reflective Method and the other holding a MethodProxy) are registered per method that is intercepted in the enhanced class or any of its subclasses. Be aware that the MethodProxy makes excessive use of the FastClass, which triggers the creation of additional classes and is described in further detail below.

For all these reasons, be careful when using the Enhancer. And always register callback types defensively, since the MethodInterceptor will, for example, trigger the creation of additional classes and register additional static fields in the enhanced class. This is specifically dangerous since the callback variables are also stored as static variables in the enhanced class: this implies that the callback instances are never garbage collected (unless their ClassLoader is, which is unusual). This is particularly dangerous when using anonymous classes, which silently carry a reference to their outer class. Recall the example above:

    @Test
    public void testFixedValue() throws Exception {
        Enhancer enhancer = new Enhancer();
        enhancer.setSuperclass(SampleClass.class);
        enhancer.setCallback(new FixedValue() {
            @Override
            public Object loadObject() throws Exception {
                return "Hello cglib!";
            }
        });
        SampleClass proxy = (SampleClass) enhancer.create();
        assertEquals("Hello cglib!", proxy.test(null));
    }

The anonymous subclass of FixedValue would become strongly referenced from the enhanced SampleClass such that neither the anonymous FixedValue instance nor the class holding the @Test method would ever be garbage collected. This can introduce nasty memory leaks in your applications. Therefore, do not use non-static inner classes with cglib. (I only use them in this blog entry to keep the examples short.)

Finally, you should never intercept Object#finalize().
Due to the subclassing approach of cglib, intercepting finalize is implemented by overriding it, which is in general a bad idea. Enhanced instances that intercept finalize will be treated differently by the garbage collector and will also cause these objects to be queued in the JVM's finalization queue. Also, if you (accidentally) create a hard reference to the enhanced class in your intercepted call to finalize, you have effectively created a noncollectable instance. This is in general nothing you want. Note that final methods are never intercepted by cglib. Thus, Object#wait, Object#notify and Object#notifyAll do not pose the same problems. Be aware, however, that Object#clone can be intercepted, which is something you might not want to do.

Immutable Bean

cglib's ImmutableBean allows you to create an immutability wrapper similar to, for example, Collections#unmodifiableSet. All changes to the underlying bean through the wrapper will be prevented by an IllegalStateException (however, not by an UnsupportedOperationException as recommended by the Java API). Looking at some bean

    public class SampleBean {
        private String value;
        public String getValue() {
            return value;
        }
        public void setValue(String value) {
            this.value = value;
        }
    }

we can make this bean immutable:

    @Test(expected = IllegalStateException.class)
    public void testImmutableBean() throws Exception {
        SampleBean bean = new SampleBean();
        bean.setValue("Hello world!");
        SampleBean immutableBean = (SampleBean) ImmutableBean.create(bean);
        assertEquals("Hello world!", immutableBean.getValue());
        bean.setValue("Hello world, again!");
        assertEquals("Hello world, again!", immutableBean.getValue());
        immutableBean.setValue("Hello cglib!"); // Causes exception.
    }

As the example shows, the immutable bean prevents all state changes by throwing an IllegalStateException. However, the state of the bean can still be changed through the original object. All such changes will be reflected by the ImmutableBean.
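The same read-only view semantics can be observed with the JDK's unmodifiable collection wrappers, which is a handy way to build intuition without cglib on the class path. The sketch below is only an analogy under that assumption; the class name UnmodifiableViewDemo is hypothetical. Unlike ImmutableBean, the JDK wrapper signals a rejected write with an UnsupportedOperationException.

```java
import java.util.Collections;
import java.util.List;

final class UnmodifiableViewDemo {
    // Wraps the backing list in a read-only view; the view is not a copy.
    static List<String> readOnlyView(List<String> backing) {
        return Collections.unmodifiableList(backing);
    }

    // Returns true if writing through the view is rejected.
    static boolean rejectsWrites(List<String> view) {
        try {
            view.add("Hello cglib!");
            return false;
        } catch (UnsupportedOperationException expected) {
            return true;
        }
    }
}
```

Elements added to the backing list after the view was created are visible through the view, just as changes to the original bean are visible through the ImmutableBean.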
Bean Generator

The BeanGenerator is another bean utility of cglib. It will create a bean for you at run time:

    @Test
    public void testBeanGenerator() throws Exception {
        BeanGenerator beanGenerator = new BeanGenerator();
        beanGenerator.addProperty("value", String.class);
        Object myBean = beanGenerator.create();
        Method setter = myBean.getClass().getMethod("setValue", String.class);
        setter.invoke(myBean, "Hello cglib!");
        Method getter = myBean.getClass().getMethod("getValue");
        assertEquals("Hello cglib!", getter.invoke(myBean));
    }

As the example shows, the BeanGenerator first takes some properties as name-type pairs. On creation, the BeanGenerator creates the accessors <type> get<Name>() and void set<Name>(<type>) for you. This might be useful when another library expects beans which it resolves by reflection but you do not know these beans at compile time. (An example would be Apache Wicket, which works a lot with beans.)

Bean Copier

The BeanCopier is another bean utility that copies beans by their property values. Consider another bean with similar properties to SampleBean:

    public class OtherSampleBean {
        private String value;
        public String getValue() {
            return value;
        }
        public void setValue(String value) {
            this.value = value;
        }
    }

Now you can copy properties from one bean to another without being restricted to a specific type:

    @Test
    public void testBeanCopier() throws Exception {
        BeanCopier copier = BeanCopier.create(SampleBean.class, OtherSampleBean.class, false);
        SampleBean bean = new SampleBean();
        bean.setValue("Hello cglib!");
        OtherSampleBean otherBean = new OtherSampleBean();
        copier.copy(bean, otherBean, null);
        assertEquals("Hello cglib!", otherBean.getValue());
    }

The BeanCopier#copy method takes an optional Converter which allows some further manipulation of each bean property. If the BeanCopier is created with false as the third argument of BeanCopier#create, the Converter is ignored and can therefore be null.
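To illustrate what copying beans by property means, here is a rough reflection-based sketch in plain Java. This is not how cglib does it (cglib generates byte code rather than using reflection at copy time), and all names in it (SourceBean, TargetBean, ReflectiveCopier) are hypothetical stand-ins for the beans above.

```java
import java.lang.reflect.Method;

// Hypothetical stand-ins for SampleBean and OtherSampleBean.
class SourceBean {
    private String value;
    public String getValue() { return value; }
    public void setValue(String value) { this.value = value; }
}

class TargetBean {
    private String value;
    public String getValue() { return value; }
    public void setValue(String value) { this.value = value; }
}

final class ReflectiveCopier {
    // Copies each property for which the source has a getX() and the target
    // has a one-argument setX(...) accepting the getter's return type.
    static void copy(Object source, Object target) {
        try {
            for (Method getter : source.getClass().getMethods()) {
                String name = getter.getName();
                if (!name.startsWith("get") || name.equals("getClass")
                        || getter.getParameterCount() != 0) {
                    continue;
                }
                String setterName = "set" + name.substring(3);
                for (Method setter : target.getClass().getMethods()) {
                    if (setter.getName().equals(setterName)
                            && setter.getParameterCount() == 1
                            && setter.getParameterTypes()[0]
                                    .isAssignableFrom(getter.getReturnType())) {
                        setter.invoke(target, getter.invoke(source));
                    }
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not copy bean properties", e);
        }
    }
}
```

The two bean types share no common supertype besides Object; only the matching accessor names and types tie them together, which is the same contract the BeanCopier relies on.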
Bulk Bean

A BulkBean allows you to use a specified set of a bean's accessors through arrays instead of method calls:

    @Test
    public void testBulkBean() throws Exception {
        BulkBean bulkBean = BulkBean.create(SampleBean.class,
            new String[]{"getValue"},
            new String[]{"setValue"},
            new Class[]{String.class});
        SampleBean bean = new SampleBean();
        bean.setValue("Hello world!");
        assertEquals(1, bulkBean.getPropertyValues(bean).length);
        assertEquals("Hello world!", bulkBean.getPropertyValues(bean)[0]);
        bulkBean.setPropertyValues(bean, new Object[] {"Hello cglib!"});
        assertEquals("Hello cglib!", bean.getValue());
    }

The BulkBean takes an array of getter names, an array of setter names and an array of property types as the arguments of its factory method. A bean's properties can then be extracted as an array with BulkBean#getPropertyValues(Object). Similarly, a bean's properties can be set with BulkBean#setPropertyValues(Object, Object[]).

Bean Map

This is the last bean utility within the cglib library. The BeanMap converts all properties of a bean to a String-to-Object Java Map:

    @Test
    public void testBeanMap() throws Exception {
        SampleBean bean = new SampleBean();
        BeanMap map = BeanMap.create(bean);
        bean.setValue("Hello cglib!");
        assertEquals("Hello cglib!", map.get("value"));
    }

Additionally, the BeanMap#newInstance(Object) method allows you to create maps for other beans by reusing the same Class.

Key Factory

The KeyFactory allows the dynamic creation of keys that are composed of multiple values and that can be used in, for example, Map implementations. For doing so, the KeyFactory requires some interface that defines the values that should be used in such a key. This interface must contain a single method by the name newInstance that returns an Object.
For example:

    public interface SampleKeyFactory {
        Object newInstance(String first, int second);
    }

Now an instance of a key can be created by:

    @Test
    public void testKeyFactory() throws Exception {
        SampleKeyFactory keyFactory = (SampleKeyFactory) KeyFactory.create(SampleKeyFactory.class);
        Object key = keyFactory.newInstance("foo", 42);
        Map map = new HashMap();
        map.put(key, "Hello cglib!");
        assertEquals("Hello cglib!", map.get(keyFactory.newInstance("foo", 42)));
    }

The KeyFactory will ensure the correct implementation of the Object#equals(Object) and Object#hashCode methods such that the resulting key objects can be used in a Map or a Set. The KeyFactory is also used quite a lot internally in the cglib library.

Mixin

Some might already know the concept of the Mixin class from other programming languages such as Ruby or Scala (where mixins are called traits). cglib Mixins allow the combination of several objects into a single object. However, in order to do so, those objects must be backed by interfaces:

    public interface Interface1 {
        String first();
    }

    public interface Interface2 {
        String second();
    }

    public class Class1 implements Interface1 {
        @Override
        public String first() {
            return "first";
        }
    }

    public class Class2 implements Interface2 {
        @Override
        public String second() {
            return "second";
        }
    }

Now the classes Class1 and Class2 can be combined into a single class by an additional interface:

    public interface MixinInterface extends Interface1, Interface2 { /* empty */ }

    @Test
    public void testMixin() throws Exception {
        Mixin mixin = Mixin.create(
            new Class[]{Interface1.class, Interface2.class, MixinInterface.class},
            new Object[]{new Class1(), new Class2()});
        MixinInterface mixinDelegate = (MixinInterface) mixin;
        assertEquals("first", mixinDelegate.first());
        assertEquals("second", mixinDelegate.second());
    }

Admittedly, the Mixin API is rather awkward since it requires the classes used for a mixin to implement some interface, such that the problem could also be solved by non-instrumented Java.

String Switcher

The StringSwitcher emulates a String-to-int Java Map:

    @Test
    public void testStringSwitcher() throws Exception {
        String[] strings = new String[]{"one", "two"};
        int[] values = new int[]{10, 20};
        StringSwitcher stringSwitcher = StringSwitcher.create(strings, values, true);
        assertEquals(10, stringSwitcher.intValue("one"));
        assertEquals(20, stringSwitcher.intValue("two"));
        assertEquals(-1, stringSwitcher.intValue("three"));
    }

The StringSwitcher allows you to emulate a switch command on Strings, as is possible with the built-in Java switch statement since Java 7. Whether using the StringSwitcher in Java 6 or earlier really adds a benefit to your code remains doubtful, however, and I would personally not recommend its use.

Interface Maker

The InterfaceMaker does what its name suggests: it dynamically creates a new interface.

    @Test
    public void testInterfaceMaker() throws Exception {
        Signature signature = new Signature("foo", Type.DOUBLE_TYPE, new Type[]{Type.INT_TYPE});
        InterfaceMaker interfaceMaker = new InterfaceMaker();
        interfaceMaker.add(signature, new Type[0]);
        Class iface = interfaceMaker.create();
        assertEquals(1, iface.getMethods().length);
        assertEquals("foo", iface.getMethods()[0].getName());
        assertEquals(double.class, iface.getMethods()[0].getReturnType());
    }

Unlike any other class of cglib's public API, the interface maker relies on ASM types. The creation of an interface in a running application will hardly make sense since an interface only represents a type which can be used by a compiler to check types. It can however make sense when you are generating code that is to be used in later development.

Method Delegate

A MethodDelegate allows you to emulate a C#-like delegate to a specific method by binding a method call to some interface.
For example, the following code would bind the SampleBean#getValue method to a delegate:

    public interface BeanDelegate {
        String getValueFromDelegate();
    }

    @Test
    public void testMethodDelegate() throws Exception {
        SampleBean bean = new SampleBean();
        bean.setValue("Hello cglib!");
        BeanDelegate delegate = (BeanDelegate) MethodDelegate.create(
            bean, "getValue", BeanDelegate.class);
        assertEquals("Hello cglib!", delegate.getValueFromDelegate());
    }

There are however some things to note:
  • The factory method MethodDelegate#create takes exactly one method name as its second argument. This is the method the MethodDelegate will proxy for you.
  • There must be a method without arguments defined for the object which is given to the factory method as its first argument. Thus, the MethodDelegate is not as strong as it could be.
  • The third argument must be an interface with exactly one method. The MethodDelegate implements this interface and can be cast to it. When the method is invoked, it will call the proxied method on the object that is the first argument.

Furthermore, consider these drawbacks:
  • cglib creates a new class for each proxy. Eventually, this will litter up your permanent generation heap space.
  • You cannot proxy methods that take arguments.
  • If your interface takes arguments, the method delegation will simply not work, without an exception being thrown (the return value will always be null). If your interface requires another return type (even if that is more general), you will get an IllegalArgumentException.

Multicast Delegate

The MulticastDelegate works a little differently than the MethodDelegate even though it aims at similar functionality.
To use the MulticastDelegate, we require an object that implements an interface:

    public interface DelegationProvider {
        void setValue(String value);
    }

    public class SimpleMulticastBean implements DelegationProvider {
        private String value;
        public String getValue() {
            return value;
        }
        public void setValue(String value) {
            this.value = value;
        }
    }

Based on this interface-backed bean we can create a MulticastDelegate that dispatches all calls to setValue(String) to several classes that implement the DelegationProvider interface:

    @Test
    public void testMulticastDelegate() throws Exception {
        MulticastDelegate multicastDelegate = MulticastDelegate.create(DelegationProvider.class);
        SimpleMulticastBean first = new SimpleMulticastBean();
        SimpleMulticastBean second = new SimpleMulticastBean();
        multicastDelegate = multicastDelegate.add(first);
        multicastDelegate = multicastDelegate.add(second);
        DelegationProvider provider = (DelegationProvider) multicastDelegate;
        provider.setValue("Hello world!");
        assertEquals("Hello world!", first.getValue());
        assertEquals("Hello world!", second.getValue());
    }

Again, there are some drawbacks:
  • The objects need to implement a single-method interface. This sucks for third-party libraries and is awkward when you use cglib to do some magic where this magic gets exposed to the normal code. Also, you could implement your own delegate easily (without byte code though, but I doubt that you win so much over manual delegation).
  • When your delegates return a value, you will receive only that of the last delegate you added. All other return values are lost (but retrieved at some point by the multicast delegate).

Constructor Delegate

A ConstructorDelegate allows you to create a byte-instrumented factory method. For that, we first require an interface with a single method newInstance which returns an Object and takes any number of parameters to be used for a constructor call of the specified class.
For example, in order to create a ConstructorDelegate for the SampleBean, we require the following to call SampleBean's default (no-argument) constructor:

    public interface SampleBeanConstructorDelegate {
        Object newInstance();
    }

    @Test
    public void testConstructorDelegate() throws Exception {
        SampleBeanConstructorDelegate constructorDelegate =
            (SampleBeanConstructorDelegate) ConstructorDelegate.create(
                SampleBean.class, SampleBeanConstructorDelegate.class);
        SampleBean bean = (SampleBean) constructorDelegate.newInstance();
        assertTrue(SampleBean.class.isAssignableFrom(bean.getClass()));
    }

Parallel Sorter

The ParallelSorter claims to be a faster alternative to the Java standard library's array sorters when sorting arrays of arrays:

    @Test
    public void testParallelSorter() throws Exception {
        Integer[][] value = {
            {4, 3, 9, 0},
            {2, 1, 6, 0}
        };
        ParallelSorter.create(value).mergeSort(0);
        for (Integer[] row : value) {
            int former = -1;
            for (int val : row) {
                assertTrue(former < val);
                former = val;
            }
        }
    }

The ParallelSorter takes an array of arrays and allows you to apply either a merge sort or a quick sort on every row of the array. Be however careful when you use it:
  • When using arrays of primitives, you have to call merge sort with explicit sorting ranges (e.g. ParallelSorter.create(value).mergeSort(0, 0, 3) in the example). Otherwise, the ParallelSorter has a pretty obvious bug where it tries to cast the primitive array to an Object[] array, which will cause a ClassCastException.
  • If the array rows are uneven, the first argument will determine the length of what row to consider. Uneven rows will either lead to the extra values not being considered for sorting or to an ArrayIndexOutOfBoundsException.

Personally, I doubt that the ParallelSorter really offers a time advantage. Admittedly, I have not yet tried to benchmark it. If you tried it, I'd be happy to hear about it in the comments.
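Looking back at the MethodDelegate and ConstructorDelegate sections, it is worth noting that on Java 8 or later, method and constructor references give a similar binding without any byte code generation by a library. The following is a minimal sketch under that assumption; ValueBean, ValueDelegate and ReferenceDelegates are hypothetical names that mirror the article's SampleBean and BeanDelegate.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the article's SampleBean.
class ValueBean {
    private String value;
    public String getValue() { return value; }
    public void setValue(String value) { this.value = value; }
}

// Hypothetical single-method interface, like BeanDelegate above.
interface ValueDelegate {
    String getValueFromDelegate();
}

final class ReferenceDelegates {
    // Binds the getter of one specific bean instance to the delegate interface.
    static ValueDelegate bindGetter(ValueBean bean) {
        return bean::getValue;
    }

    // Binds the no-argument constructor to a factory interface.
    static Supplier<ValueBean> constructorDelegate() {
        return ValueBean::new;
    }
}
```

The bound delegate always reads the current state of the bean it was created from, and each call to the supplier yields a fresh instance.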
Fast Class and Fast Members

The FastClass promises a faster invocation of methods than the Java reflection API by wrapping a Java class and offering similar methods to the reflection API:

    @Test
    public void testFastClass() throws Exception {
        FastClass fastClass = FastClass.create(SampleBean.class);
        FastMethod fastMethod = fastClass.getMethod(SampleBean.class.getMethod("getValue"));
        SampleBean myBean = new SampleBean();
        myBean.setValue("Hello cglib!");
        assertEquals("Hello cglib!", fastMethod.invoke(myBean, new Object[0]));
    }

Besides the demonstrated FastMethod, the FastClass can also create FastConstructors but no fast fields. But how can the FastClass be faster than normal reflection? Java reflection is executed by JNI where method invocations are executed by some C code. The FastClass on the other hand creates some byte code that calls the method directly from within the JVM. However, the newer versions of the HotSpot JVM (and probably many other modern JVMs) know a concept called inflation where the JVM will translate reflective method calls into native versions of FastClass when a reflective method is executed often enough. You can even control this behavior (at least on a HotSpot JVM) by setting the sun.reflect.inflationThreshold property to a lower value. (The default is 15.) This property determines after how many reflective invocations a JNI call should be substituted by a byte code instrumented version. I would therefore recommend not using FastClass on modern JVMs; it can however fine-tune performance on older Java virtual machines.

cglib Proxy

The cglib Proxy is a reimplementation of the Java Proxy class mentioned in the beginning of this article. It is intended to allow using the Java library's proxy in Java versions before Java 1.3 and differs only in minor details. The better documentation of the cglib Proxy can however be found in the Java standard library's Proxy javadoc where an example of its use is provided.
For this reason, I will skip a more detailed discussion of cglib's Proxy here.

A Final Word of Warning

After this overview of cglib's functionality, I want to add a final word of warning. All cglib classes generate byte code, which results in additional classes being stored in a special section of the JVM's memory: the so-called perm space. This permanent space is, as the name suggests, used for permanent objects that do not usually get garbage collected. This is however not completely true: once a Class is loaded, it cannot be unloaded until the loading ClassLoader becomes available for garbage collection. This is only the case if the Class was loaded with a custom ClassLoader which is not a native JVM system ClassLoader. This ClassLoader can be garbage collected if itself, all Classes it ever loaded and all instances of all Classes it ever loaded become available for garbage collection. This means: if you create more and more classes throughout the life of a Java application and if you do not take care of the removal of these classes, you will sooner or later run out of perm space, which will result in your application's death at the hands of an OutOfMemoryError. Therefore, use cglib sparingly. However, if you use cglib wisely and carefully, you can really do amazing things with it that go beyond what you can do with non-instrumented Java applications.

Lastly, when creating projects that depend on cglib, you should be aware of the fact that the cglib project is not as well maintained and active as it should be, considering its popularity. The missing documentation is a first hint. The often messy public API a second. But then there are also broken deploys of cglib to Maven Central. The mailing list reads like an archive of spam messages. And the release cycles are rather unstable. You might therefore want to have a look at javassist, the only real low-level alternative to cglib.
Javassist comes bundled with a pseudo-Java compiler, which allows you to create quite amazing byte code instrumentations without even understanding Java byte code. If you like to get your hands dirty, you might also like ASM, on top of which cglib is built. ASM comes with great documentation of both the library and Java class files and their byte code.

Note that these examples only run with cglib 2.2.2 and are not compatible with the newest release 3 of cglib. Unfortunately, I have seen the newest cglib version occasionally produce invalid byte code, which is why I settled on an old version and also use this version in production. Also note that most projects using cglib move the library to their own namespace in order to avoid version conflicts with other dependencies, as demonstrated for example by the Spring project. You should do the same with your project when making use of cglib. Tools such as jarjar can help you automate this good practice.
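Since the cglib Proxy section above defers to the JDK's own Proxy class, a minimal sketch of that standard class may be useful at this point. It uses only java.lang.reflect from the Java standard library, not cglib. The Greeter interface and the JdkProxyDemo class are hypothetical names made up for this illustration.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Hypothetical interface, used only for this illustration.
interface Greeter {
    String greet(String name);
}

final class JdkProxyDemo {

    // Wraps a Greeter in a JDK dynamic proxy that records each invoked
    // method name in the given log before delegating to the real target.
    static Greeter loggingGreeter(Greeter target, StringBuilder log) {
        InvocationHandler handler = (proxy, method, args) -> {
            log.append(method.getName()).append(';');
            return method.invoke(target, args);
        };
        return (Greeter) Proxy.newProxyInstance(
                Greeter.class.getClassLoader(),
                new Class<?>[] { Greeter.class },
                handler);
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        Greeter greeter = loggingGreeter(name -> "Hello " + name + "!", log);
        System.out.println(greeter.greet("cglib")); // Hello cglib!
        System.out.println(log);                    // greet;
    }
}
```

Note that the JDK Proxy can only implement interfaces, which is why the sketch proxies an interface rather than a class.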
January 7, 2014
by Rafael Winterhalter
· 76,747 Views · 18 Likes
Bulk Fetching with Hibernate
If you need to process large database result sets from Java, you can opt for JDBC to get the low-level control required. On the other hand, if you are already using an ORM in your application, falling back to JDBC might imply some extra pain. You would lose features such as optimistic locking, caching, automatic fetching when navigating the domain model, and so forth. Fortunately, most ORMs, like Hibernate, have some options to help you with that. While these techniques are not new, there are a couple of possibilities to choose from.

A simplified example: let's assume we have a table (mapped to class "DemoEntity") with 100,000 records. Each record consists of a single column (mapped to the property "property" in DemoEntity) holding about 2KB of random alphanumeric data. The JVM is run with -Xmx250m. Let's assume that 250MB is the overall maximum memory that can be assigned to the JVM on our system. Your job is to read all records currently in the table, do some not further specified processing, and finally store the result. We'll assume that the entities resulting from our bulk operation are not modified. To start, we'll try the obvious first and perform a query that simply retrieves all data:

    new TransactionTemplate(txManager).execute(new TransactionCallback<Void>() {
        @Override
        public Void doInTransaction(TransactionStatus status) {
            Session session = sessionFactory.getCurrentSession();
            List<DemoEntity> demoEntities = (List<DemoEntity>) session.createQuery("from DemoEntity").list();
            for (DemoEntity demoEntity : demoEntities) {
                //Process and write result
            }
            return null;
        }
    });

After a couple of seconds:

    Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

Clearly this won't cut it. To fix this we will switch to Hibernate scrollable result sets, which most developers are probably aware of. The above example instructs Hibernate to execute the query, map the entire result to entities and return them.
When using scrollable result sets, records are transformed to entities one at a time:

    new TransactionTemplate(txManager).execute(new TransactionCallback<Void>() {
        @Override
        public Void doInTransaction(TransactionStatus status) {
            Session session = sessionFactory.getCurrentSession();
            ScrollableResults scrollableResults = session.createQuery("from DemoEntity").scroll(ScrollMode.FORWARD_ONLY);
            int count = 0;
            while (scrollableResults.next()) {
                if (++count > 0 && count % 100 == 0) {
                    System.out.println("Fetched " + count + " entities");
                }
                DemoEntity demoEntity = (DemoEntity) scrollableResults.get()[0];
                //Process and write result
            }
            return null;
        }
    });

After running this we get:

    ...
    Fetched 49800 entities
    Fetched 49900 entities
    Fetched 50000 entities
    Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

Although we are using a scrollable result set, every returned object is an attached object and becomes part of the persistence context (aka session). The result is actually the same as in our first example, in which we used session.createQuery("from DemoEntity").list(). However, with that approach we had no control; everything happens behind the scenes and you get a list back with all the data once Hibernate has done its job. Using a scrollable result set, on the other hand, gives us a hook into the retrieval process and allows us to free up memory when needed. As we have seen, it does not free up memory automatically; you have to instruct Hibernate to actually do it. The following options exist:

  • Evicting the object from the persistence context after processing it
  • Clearing the entire session every now and then

We will opt for the first.
In the above example, directly below the //Process and write result comment, we'll add:

    session.evict(demoEntity);

Important:

  • If you were to perform any modification to the entity (or to entities it has associations with that are cascade evicted alongside), make sure to flush the session PRIOR to evicting or clearing. Otherwise, queries held back by Hibernate's write-behind will not be sent to the database.
  • Evicting or clearing does not remove the entities from the second level cache. If you have enabled the second level cache, are using it, and want to remove the entities from it as well, use the desired sessionFactory.getCache().evictXxx() method.
  • From the moment you evict an entity, it is no longer attached (no longer associated with a session). Any modification made to the entity at that stage will no longer be reflected in the database automatically.
  • If you are using lazy loading, accessing any property that was not loaded prior to the eviction will yield the famous org.hibernate.LazyInitializationException. So basically, make sure the processing for that entity is done (or that it is at least initialized for further needs) before you evict or clear.

After we run the application again, we see that it now executes successfully:

    ...
    Fetched 99800 entities
    Fetched 99900 entities
    Fetched 100000 entities

By the way, you can also set the query to read-only, allowing Hibernate to perform some extra optimizations:

    ScrollableResults scrollableResults = session.createQuery("from DemoEntity").setReadOnly(true).scroll(ScrollMode.FORWARD_ONLY);

Doing this makes only a very marginal difference in memory usage; in this specific test setup it enabled us to read about 300 extra entities with the given amount of memory. Personally, I would not use this feature for memory optimization alone, but only if it fits your overall immutability strategy. With Hibernate you have different options to make entities read-only: on the entity itself, on the overall session, and so forth.
Setting read-only on an individual query is probably the least preferred approach (e.g. entities loaded into the session earlier remain unaffected and possibly modifiable, and lazy associations are loaded modifiable even if the root objects returned by the query are read-only).

OK, we were able to process our 100,000 records; life is good. But as it turns out, Hibernate has another option for bulk operations: the stateless session. You can obtain a scrollable result set from a stateless session the same way as from a normal session. A stateless session lies directly above JDBC, and Hibernate runs in a nearly "all features disabled" mode. This means no persistence context, no second level caching, no dirty detection, no lazy loading; basically nothing. From the javadoc:

    /**
     * A command-oriented API for performing bulk operations against a database.
     * A stateless session does not implement a first-level cache nor interact with any
     * second-level cache, nor does it implement transactional write-behind or automatic
     * dirty checking, nor do operations cascade to associated instances. Collections are
     * ignored by a stateless session. Operations performed via a stateless session bypass
     * Hibernate's event model and interceptors. Stateless sessions are vulnerable to data
     * aliasing effects, due to the lack of a first-level cache. For certain kinds of
     * transactions, a stateless session may perform slightly faster than a stateful session.
     *
     * @author Gavin King
     */

The only thing it does is transform records into objects.
This might be an appealing alternative because it helps you get rid of the manual evicting and flushing:

    new TransactionTemplate(txManager).execute(new TransactionCallback<Void>() {
        @Override
        public Void doInTransaction(TransactionStatus status) {
            sessionFactory.getCurrentSession().doWork(new Work() {
                @Override
                public void execute(Connection connection) throws SQLException {
                    StatelessSession statelessSession = sessionFactory.openStatelessSession(connection);
                    try {
                        ScrollableResults scrollableResults = statelessSession.createQuery("from DemoEntity").scroll(ScrollMode.FORWARD_ONLY);
                        int count = 0;
                        while (scrollableResults.next()) {
                            if (++count > 0 && count % 100 == 0) {
                                System.out.println("Fetched " + count + " entities");
                            }
                            DemoEntity demoEntity = (DemoEntity) scrollableResults.get()[0];
                            //Process and write result
                        }
                    } finally {
                        statelessSession.close();
                    }
                }
            });
            return null;
        }
    });

Although the stateless session has the best memory usage, using it has some side effects. You might have noticed that we are opening a stateless session and closing it explicitly: there is no sessionFactory.getCurrentStatelessSession(), nor (at the time of writing) any Spring integration for managing the stateless session. Opening a stateless session allocates a new java.sql.Connection by default (if you use openStatelessSession()) to perform its work, and therefore indirectly spawns a second transaction. You can mitigate these side effects by using the Hibernate Work API as in the example, which supplies the current Connection so that you can pass it along to openStatelessSession(Connection connection). Closing the session in the finally block has no impact on the physical connection, since that is captured by the Spring infrastructure: only the logical connection handle is closed, and a new logical connection handle was created when opening the stateless session.
Also note that you have to deal with closing the stateless session yourself, and that the above example is only good for read-only operations. From the moment you start modifying data through the stateless session, there are some more caveats. As said before, Hibernate runs in "all features disabled" mode, and as a direct consequence entities are returned in detached state. For each entity you modify, you'll have to call statelessSession.update(entity) explicitly. First I tried this for modifying an entity:

    new TransactionTemplate(txManager).execute(new TransactionCallback<Void>() {
        @Override
        public Void doInTransaction(TransactionStatus status) {
            sessionFactory.getCurrentSession().doWork(new Work() {
                @Override
                public void execute(Connection connection) throws SQLException {
                    StatelessSession statelessSession = sessionFactory.openStatelessSession(connection);
                    try {
                        DemoEntity demoEntity = (DemoEntity) statelessSession.createQuery("from DemoEntity where id = 1").uniqueResult();
                        demoEntity.setProperty("test");
                        statelessSession.update(demoEntity);
                    } finally {
                        statelessSession.close();
                    }
                }
            });
            return null;
        }
    });

The idea is that we open a stateless session with the existing database Connection. As the StatelessSession javadoc indicates that no write-behind occurs, I was convinced that each statement performed by the stateless session would be sent directly to the database. Eventually, when the transaction (started by the TransactionTemplate) was committed, the results would become visible in the database. However, Hibernate does BATCH statements when using a stateless session. I'm not 100% sure what the difference is between batching and write-behind, but the result is the same and thus contradicts the javadoc, as statements are queued and flushed at a later time. So, if you don't do anything special, statements that are batched will not be flushed, and this is what happened in my case: the statelessSession.update(demoEntity) call was batched and never flushed.
One way to force the flush is to use the Hibernate transaction API:

    StatelessSession statelessSession = sessionFactory.openStatelessSession();
    statelessSession.beginTransaction();
    ...
    statelessSession.getTransaction().commit();
    ...

While this works, you probably don't want to start controlling your transactions programmatically just because you are using a stateless session. Also, by doing this we are again running our stateless session work in a second transaction, since we didn't pass along our Connection and thus a new database connection will be acquired. The reason we can't pass along the outer Connection is this: if we committed the inner transaction (the "stateless session transaction") while it was using the same connection as the outer transaction (started by the TransactionTemplate), it would break the atomicity of the outer transaction, because statements already sent to the database by the outer transaction would be committed along with the inner transaction. So not passing along the connection means opening a new connection and thus creating a second transaction. A better alternative would be to just trigger Hibernate to flush the stateless session. However, StatelessSession has no "flush" method to trigger a flush manually. A solution here is to depend a bit on the Hibernate internal API.
This solution makes the manual transaction handling and the second transaction obsolete: all statements become part of our (one and only) outer transaction:

    StatelessSession statelessSession = sessionFactory.openStatelessSession(connection);
    try {
        DemoEntity demoEntity = (DemoEntity) statelessSession.createQuery("from DemoEntity where id = 1").uniqueResult();
        demoEntity.setProperty("test");
        statelessSession.update(demoEntity);
        ((TransactionContext) statelessSession).managedFlush();
    } finally {
        statelessSession.close();
    }

Fortunately, there is an even better solution, very recently posted on the Spring jira: https://jira.springsource.org/browse/SPR-2495. This is not yet part of Spring, but the factory bean implementation (StatelessSessionFactoryBean.java) is pretty straightforward. When using it, you can simply inject the StatelessSession:

    @Autowired
    private StatelessSession statelessSession;

It will inject a stateless session proxy, which is equivalent to the way the normal "current" session works (with the minor difference that there you inject a SessionFactory and need to obtain the current session each time). When the proxy is invoked, it looks up the stateless session bound to the running transaction. If none exists yet, it creates one with the same connection as the normal session (like we did in the example) and registers a custom transaction synchronization for the stateless session. When the transaction is committed, the stateless session is flushed thanks to the synchronization and finally closed. Using this you can inject the stateless session directly and use it as a current session (or the same way as you would inject a JPA PersistenceContext, for that matter). This relieves you from dealing with the opening and closing of the stateless session and from having to find some way to make it flush. The implementation is aimed at JPA, but the JPA part is limited to obtaining the physical connection in obtainPhysicalConnection().
You can easily leave out the EntityManagerFactory and get the physical connection directly from the Hibernate session.

A very careful conclusion: it is clear that the best approach will depend on your situation. If you use the normal session, you will have to deal with eviction yourself when reading or persisting entities. Besides the fact that you have to do this manually, it might also impact further use of the session if you have a mixed transaction, that is, if you perform both 'bulk' and 'normal' operations in the same transaction. If you continue with the normal operations, you will have detached entities in your session, which might lead to unexpected results (as dirty detection will no longer work, and so forth). On the other hand, you still have the major Hibernate benefits (as long as the entity isn't evicted), such as lazy loading, caching, dirty detection and the like. Using the stateless session at the time of writing requires some extra attention to managing it (opening, closing and flushing), which can also be error prone. Assuming you can proceed with the proposed factory bean, you have a very bare-bones session that is separate from your normal session but still participates in the same transaction. With this you have a powerful tool to perform bulk operations without having to think about memory management. The downside is that you don't have any other Hibernate functionality available.
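To make the memory argument of this article concrete, here is a toy model in plain Java. It is not Hibernate code: ToySession and BulkReadDemo are hypothetical classes that only mimic how a first-level cache keeps every attached entity reachable until it is evicted or the session is cleared. The sketch compares the two options listed earlier (evicting each entity versus clearing the session every now and then) against doing nothing.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a persistence context (first-level cache); NOT Hibernate code.
final class ToySession {
    private final Map<Long, Object> attached = new HashMap<>();

    void attach(long id, Object entity) { attached.put(id, entity); }
    void evict(long id) { attached.remove(id); }
    void clear() { attached.clear(); }
    int attachedCount() { return attached.size(); }
}

final class BulkReadDemo {

    // Simulates scrolling over `rows` records and returns the largest number
    // of entities the session held at any one time.
    //   evictEach:  evict every entity right after processing it (option 1)
    //   clearEvery: clear the whole session every N rows, 0 disables it (option 2)
    static int peakAttached(int rows, boolean evictEach, int clearEvery) {
        ToySession session = new ToySession();
        int peak = 0;
        for (long id = 1; id <= rows; id++) {
            session.attach(id, new Object()); // scrolling attaches the entity
            peak = Math.max(peak, session.attachedCount());
            // ... process the entity and write the result here ...
            if (evictEach) {
                session.evict(id);
            } else if (clearEvery > 0 && id % clearEvery == 0) {
                session.clear();
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        System.out.println(peakAttached(100_000, false, 0));   // 100000
        System.out.println(peakAttached(100_000, true, 0));    // 1
        System.out.println(peakAttached(100_000, false, 100)); // 100
    }
}
```

In this model the peak number of retained entities, rather than the total number read, determines the memory needed, which matches the behaviour observed with the scrollable result set above.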
January 6, 2014
by Koen Serneels
· 90,651 Views · 14 Likes
Top Posts of 2013: Google's Big Data Papers
I’ll review Google’s most important Big Data publications and discuss where they are (as far as they’ve disclosed).
December 30, 2013
by Mikio Braun
· 117,044 Views
Logging, Processing and Monitoring Data using Talend, ElasticSearch, Logstash and Kibana
Your mission-critical projects need complex event processing, real-time management and monitoring. Talend 5.4 (released in December 2013, https://www.talend.com) offers a great new feature: Talend Event Logging. It allows logging, processing and monitoring of all technical events and business data. In this article, I will focus on how to process, filter and monitor business data. You can find more details about monitoring technical events and logs (e.g. OSGi events of the ESB container) in Talend’s documentation (www.help.talend.com). This new feature is very powerful, but also extensible to fit custom requirements. You can solve and monitor much more complex scenarios than the one I describe here.

Talend Event Logging with Logstash, ElasticSearch and Kibana

First, let’s take a look at the components and projects that are integrated into and extended by Talend’s products to implement this new feature.

logstash (http://logstash.net) is a tool for managing events and logs. You can use it to collect different (distributed) logs, parse them, and store them for later use. Speaking of searching, logstash comes with a simple but fine web interface for searching and drilling into all of your logs. It uses ElasticSearch under the hood. So you can easily query all your logs for specific errors or business analytics (e.g. searching for all lines matching a unique order id). In addition to the pure collection of events, the Event Logging feature supports custom processing (e.g. custom filtering, custom data enrichment or reduction), aggregation, signing, and also server-side custom pre- and post-processing of events, e.g. to send them to an intrusion detection system or to any other kind of higher-level log processing or management system.

Kibana (http://www.elasticsearch.org/overview/kibana) is a browser-based analytics and search interface to logstash and other timestamped data sets stored in ElasticSearch.
Kibana strives to be easy to get started with, while also being flexible and powerful, just like logstash and ElasticSearch. The main difference to logstash is a much more powerful HTML5-based web interface. You can

  • use multiple concurrent search inputs
  • highlight to drill down into bar charts
  • create line charts, stacked or unstacked, filled or unfilled, with or without points
  • create pie and donut charts that compare top terms or the results of multiple queries
  • create custom dashboards with multiple charts
  • and much more…

Therefore, logstash, Elasticsearch and Kibana are a perfect combination. You can use Kibana to analyze and monitor your data as you do with logstash; however, Kibana’s web interface is much more powerful and comfortable than that of logstash. There is a great book about logstash, “The logstash book”, for just 9.99 USD. I can really recommend this book for getting started: http://www.logstashbook.com/. For Elasticsearch, you can find several books on Amazon. Unfortunately, Kibana has no good and extensive documentation yet. I heard from its developers that this topic is to be addressed in Q1 2014.

Integration of Event Logging and Monitoring into Talend’s Unified Platform

Talend 5.4 has integrated logstash and Kibana into its Unified Platform. Talend Administration Center (TAC) is Talend’s central web application for management and monitoring. It got a new logging view, in which you can use the very flexible and powerful real-time search capabilities of Elasticsearch. Some dashboards are available by default. You can also easily create your own custom dashboards within this site thanks to Kibana. Many panel types are available, such as pie, histogram, table, hits or trends. You can analyze every technical event or piece of business data down to the message level. By default, you see fields such as message, source, timestamp, type, and others. Of course, you can also add custom fields suitable for your business case.
This way, you have a central monitoring capability that makes it easy to analyze data on distributed clusters. Under the hood, this data comes from logstash. Many alternative inputs are available for logstash, such as the log4j input or the tcp input. For business data, I often use the file input (http://logstash.net/docs/1.2.2/inputs/file) to analyze files such as CSV. Adding new inputs is very simple: you just add an input to the logstash configuration file of your logstash server. As I mentioned already, all this is integrated into Talend’s Unified Platform. In this example, there are three log4j inputs and one file input. I use the file input to analyze text files in a specific directory. In this case, the output is an embedded Elasticsearch instance. In production, you should of course use an external Elasticsearch cluster. Processing such as filtering can also be configured in this file. Thus, you can process, analyze and monitor all your different inputs within one central monitoring application thanks to Kibana.

Building Talend Integration Jobs, Routes and Web Services

As mentioned before, you can analyze almost all data with logstash, Elasticsearch and Kibana. It does not matter whether your input consists of technical events from a container (e.g. OSGi events) or of business data such as CSV files, log4j logs, or something else. Talend implicitly supports technical events which are created by the ESB container, by MDM, etc. However, you can also easily add your custom business data from your Talend jobs (integration perspective), SOAP / REST Web Services (integration perspective) or Talend routes (mediation perspective). The following is just one example (part of Talend’s DI demos, which are included in every DI installation). The job generates some random data and stores it in a CSV file. You just have to add the configured file or directory of tFileOutputDelimited to your logstash configuration using the file input (with wildcards for more complex scenarios). That’s it.
You can now monitor and analyze the business data in real time. This example showed a Talend DI Job (i.e. an ETL job). However, you can monitor your business data from SOAP / REST Web Services or Talend Routes the same way.

Conclusion

Your mission-critical projects need management and monitoring. Today, this is possible not just with complex and expensive tools from large vendors, but also with Talend’s Unified Platform products such as Talend ESB or Talend MDM. Under the hood, Talend integrates and extends widely used open source products: logstash, Elasticsearch and Kibana. Have fun with Talend 5.4’s new event logging and monitoring features…

Best regards, Kai Wähner (@KaiWaehner)

Content from my blog: http://www.kai-waehner.de/blog/2013/12/17/realtime-event-logging-complex-event-processing-cep-and-monitoring-with-talends-unified-platform-5-4-di-esb-dq-mdm-bpm-using-elasticsearch-logstash-and-kibana/
December 19, 2013
by Kai Wähner
· 31,479 Views
Handling Big Data with HBase Part 4: The Java API
Editor's note: Be sure to check out part 2 as well.

This is the fourth of an introductory series of blogs on Apache HBase. In the third part, we saw a high-level view of HBase architecture. In this part, we'll use the HBase Java API to create tables, insert new data, and retrieve data by row key. We'll also see how to set up a basic table scan which restricts the columns retrieved and also uses a filter to page the results.

Having just learned about HBase high-level architecture, let's now look at the Java client API, since it is the way your applications interact with HBase. As mentioned earlier, you can also interact with HBase via several flavors of RPC technologies like Apache Thrift, plus a REST gateway, but we're going to concentrate on the native Java API. The client APIs provide both DDL (data definition language) and DML (data manipulation language) semantics, very much like what you find in SQL for relational databases. Suppose we are going to store information about people in HBase, and we want to start by creating a new table. The following listing shows how to create a new table using the HBaseAdmin class.

    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("people"));
    tableDescriptor.addFamily(new HColumnDescriptor("personal"));
    tableDescriptor.addFamily(new HColumnDescriptor("contactinfo"));
    tableDescriptor.addFamily(new HColumnDescriptor("creditcard"));
    admin.createTable(tableDescriptor);

The people table defined in the preceding listing contains three column families: personal, contactinfo, and creditcard. To create a table, you create an HTableDescriptor and add one or more column families by adding HColumnDescriptor objects. You then call createTable to create the table. Now we have a table, so let's add some data.
The next listing shows how to use the Put class to insert data on John Doe, specifically his name and email address (omitting proper error handling for brevity).

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "people");
    Put put = new Put(Bytes.toBytes("doe-john-m-12345"));
    put.add(Bytes.toBytes("personal"), Bytes.toBytes("givenName"), Bytes.toBytes("John"));
    put.add(Bytes.toBytes("personal"), Bytes.toBytes("mi"), Bytes.toBytes("M"));
    put.add(Bytes.toBytes("personal"), Bytes.toBytes("surname"), Bytes.toBytes("Doe"));
    put.add(Bytes.toBytes("contactinfo"), Bytes.toBytes("email"), Bytes.toBytes("[email protected]"));
    table.put(put);
    table.flushCommits();
    table.close();

In the above listing we instantiate a Put, providing the unique row key to the constructor. We then add values, each of which must include the column family, the column qualifier, and the value, all as byte arrays. As you probably noticed, the HBase API's utility Bytes class is used a lot; it provides methods to convert to and from byte[] for primitive types and strings. (Adding a static import for the toBytes() method would cut out a lot of boilerplate code.) We then put the data into the table, flush the commits to ensure locally buffered changes take effect, and finally close the table.

Updating data is also done via the Put class, in exactly the same manner as shown in the prior listing. Unlike relational databases, in which updates must update entire rows even if only one column changed, if you only need to update a single column then that's all you specify in the Put, and HBase will only update that column. There is also a checkAndPut operation, which is essentially a form of optimistic concurrency control: the operation will only put the new data if the current values are what the client says they should be.

Retrieving the row we just created is accomplished using the Get class, as shown in the next listing.
(From this point forward, listings will omit the boilerplate code to create a configuration, instantiate the HTable, and the flush and close calls.)

    Get get = new Get(Bytes.toBytes("doe-john-m-12345"));
    get.addFamily(Bytes.toBytes("personal"));
    get.setMaxVersions(3);
    Result result = table.get(get);

The code in the previous listing instantiates a Get instance, supplying the row key we want to find. Next we use addFamily to instruct HBase that we only need data from the personal column family, which also cuts down the amount of work HBase must do when reading information from disk. We also specify that we'd like up to three versions of each column in our result, perhaps so we can list historical values of each column. Finally, calling get returns a Result instance which can then be used to inspect all the column values returned.

In many cases you need to find more than one row. HBase lets you do this by scanning rows, as shown in the second part, which used a scan in the HBase shell session. The corresponding class is the Scan class. You can specify various options, such as the start and end row key to scan, which columns and column families to include, and the maximum versions to retrieve. You can also add filters, which allow you to implement custom filtering logic to further restrict which rows and columns are returned. A common use case for filters is pagination. For example, we might want to scan through all people whose last name is Smith one page (e.g. 25 people) at a time. The next listing shows how to perform a basic scan.

    Scan scan = new Scan(Bytes.toBytes("smith-"));
    scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("givenName"));
    scan.addColumn(Bytes.toBytes("contactinfo"), Bytes.toBytes("email"));
    scan.setFilter(new PageFilter(25));
    ResultScanner scanner = table.getScanner(scan);
    for (Result result : scanner) {
        // ...
    }

In the above listing we create a new Scan that starts from the row key smith-, and we then use addColumn to restrict the columns returned (thus reducing the amount of disk transfer HBase must perform) to personal:givenName and contactinfo:email. A PageFilter is set on the scan to limit the number of rows scanned to 25. (An alternative to using the page filter would be to specify a stop row key when constructing the Scan.) We then get a ResultScanner for the Scan just created, and loop through the results performing whatever actions are necessary. Since the only method in HBase to retrieve multiple rows of data is scanning by sorted row keys, how you design the row key values is very important. We'll come back to this topic later.

You can also delete data in HBase using the Delete class. Analogous to the Put class, it lets you delete all columns in a row (thus deleting the row itself), delete column families, delete columns, or some combination of those.

Connection Handling

In the above examples not much attention was paid to connection handling and RPCs (remote procedure calls). HBase provides the HConnection class, which offers functionality similar to connection pool classes for sharing connections; for example, you use the getTable() method to get a reference to an HTable instance. There is also an HConnectionManager class, which is how you get instances of HConnection. Similar to avoiding network round trips in web applications, effectively managing the number of RPCs and the amount of data returned when using HBase is important, and something to consider when writing HBase applications.

Conclusion to Part 4

In this part we used the HBase Java API to create a people table, insert a new person, and find the newly inserted person's information. We also used the Scan class to scan the people table for people with the last name "Smith", and showed how to restrict the data retrieved and finally how to use a filter to limit the number of results.
In the next part, we'll learn how to deal with the absence of SQL and relations when modeling schemas in HBase.

References

  • HBase web site, http://hbase.apache.org/
  • HBase wiki, http://wiki.apache.org/hadoop/Hbase
  • HBase Reference Guide, http://hbase.apache.org/book/book.html
  • HBase: The Definitive Guide, http://bit.ly/hbase-definitive-guide
  • Google Bigtable Paper, http://labs.google.com/papers/bigtable.html
  • Hadoop web site, http://hadoop.apache.org/
  • Hadoop: The Definitive Guide, http://bit.ly/hadoop-definitive-guide
  • Fallacies of Distributed Computing, http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing
  • HBase lightning talk slides, http://www.slideshare.net/scottleber/hbase-lightningtalk
  • Sample code, https://github.com/sleberknight/basic-hbase-examples
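As a footnote to the scan discussion in this part: the reason row key design matters can be sketched without HBase at all. The following toy model uses a java.util.TreeMap as a stand-in for a table sorted by row key. It is not the HBase client API; ToyRowKeyScan, its method names and the sample row keys are hypothetical, and plain String ordering is assumed to match byte ordering, which holds for these ASCII keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of a table sorted by row key; NOT the HBase client API.
final class ToyRowKeyScan {

    // Returns up to pageSize row keys that start with `prefix`, beginning at
    // `startRow` (inclusive), in sorted key order.
    static List<String> scanPage(NavigableMap<String, String> table,
                                 String startRow, String prefix, int pageSize) {
        List<String> page = new ArrayList<>();
        for (String rowKey : table.tailMap(startRow, true).keySet()) {
            if (page.size() >= pageSize || !rowKey.startsWith(prefix)) {
                break; // page is full, or we have left the prefix range
            }
            page.add(rowKey);
        }
        return page;
    }

    // Small sample table with made-up row keys in the style used above.
    static NavigableMap<String, String> sampleTable() {
        NavigableMap<String, String> table = new TreeMap<>();
        table.put("doe-john-m-12345", "John");
        table.put("smith-alice-a-00001", "Alice");
        table.put("smith-bob-b-00002", "Bob");
        table.put("smithers-carl-c-00003", "Carl");
        return table;
    }

    public static void main(String[] args) {
        System.out.println(scanPage(sampleTable(), "smith-", "smith-", 25));
        // [smith-alice-a-00001, smith-bob-b-00002]
    }
}
```

Because all "smith-" keys sit next to each other in sorted order, the scan can stop as soon as it leaves the prefix range. A key layout that scattered the Smiths across the key space would force a scan over the whole table instead.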
December 18, 2013
by Scott Leberknight
· 56,797 Views · 3 Likes
Implementing the “Card” UI Pattern in PhoneGap/HTML5 Applications
The Card UI pattern is a common look used by Pinterest and many other content sites. See how you can make a PhoneGap app with this look.
December 2, 2013
by Andrew Trice
· 116,251 Views · 2 Likes
Groovy Goodness: Remove Part of String With Regular Expression Pattern
Since Groovy 2.2 we can subtract a part of a String value using a regular expression pattern. The first match found is replaced with an empty String. In the following sample code we see how the first match of the pattern is removed from the String:

    // Define regex pattern to find words starting with gr (case-insensitive).
    def wordStartsWithGr = ~/(?i)\s+Gr\w+/

    assert ('Hello Groovy world!' - wordStartsWithGr) == 'Hello world!'
    assert ('Hi Grails users' - wordStartsWithGr) == 'Hi users'

    // Remove first match of a word with 5 characters.
    // Only the word itself is removed, so both surrounding spaces remain.
    assert ('Remove first match of 5 letter word' - ~/\b\w{5}\b/) == 'Remove  match of 5 letter word'

    // Remove first found numbers followed by a whitespace character.
    assert ('Line contains 20 characters' - ~/\d+\s+/) == 'Line contains characters'

Code written with Groovy 2.2.
November 23, 2013
by Hubert Klein Ikkink
· 19,833 Views
Deep Dive into Connection Pooling
As your application grows in functionality and/or usage, managing resources becomes increasingly important. Failure to properly utilize connection pooling is one major “gotcha” that we’ve seen greatly impact MongoDB performance and trip up developers of all levels.

Connection Pools

Creating new authenticated connections to the database is expensive. So, instead of creating and destroying connections for each request to the database, you want to re-use existing connections as much as possible. This is where connection pooling comes in. A Connection Pool is a cache of database connections maintained by your driver so that connections can be re-used when new connections to the database are required. When properly used, connection pools allow you to minimize the frequency and number of new connections to your database.

Connection Churn

If a connection pool is used improperly, or not at all, your application will likely open and close database connections too often, resulting in what we call “connection churn”. In a high-throughput application this can result in a constant flood of new connection requests to your database, which will adversely affect the performance of your database and your application.

Opening Too Many Connections

A less common problem is creating too many MongoClient objects that are never closed. In this case, instead of churn, you get a steady increase in the number of connections to your database, such that you have tens of thousands of connections open when your application could almost certainly do with far fewer. Since each connection takes RAM, you may find yourself wasting a good portion of your memory on connections, which will also adversely affect your application’s performance.
Although every application is different and the total number of connections to your database will greatly depend on how many client processes or application servers are connected, in our experience any connection count greater than 1000 – 1500 connections should raise an eyebrow, and most of the time your application will require far fewer than that.

MongoClient and Connection Pooling

Most MongoDB language drivers implement the MongoClient class which, if used properly, will handle connection pooling for you automatically. The syntax differs per language, but often you do something like this to create a new connection-pool-enabled client to your database:

    mongoClient = new MongoClient(URI, connectionOptions);

Here the mongoClient object holds your connection pool, and will give your app connections as needed. You should strive to create this object once as your application initializes and re-use this object throughout your application to talk to your database. The most common connection pooling problem we see results from applications that create a MongoClient object far too often, sometimes on each database request. If you do this you will not be using your connection pool, as each MongoClient object maintains a separate pool that is not being reused by your application.

Example with Node.js

Let’s look at a concrete example using the Node.js driver. Creating new connections to the database using the Node.js driver is done like this:

    mongodb.MongoClient.connect(URI, function(err, db) {
      // database operations
    });

The syntax for using MongoClient is slightly different here than with other drivers given Node’s single-threaded nature, but the concept is the same. You only want to call ‘connect’ once during your app’s initialization phase rather than on each database request. Let’s take a closer look at the difference between doing the right thing and doing the wrong thing.
Note: If you clone the repo from here, the logger will output your logs in your console so you can follow along.

Consider the following examples:

    var express = require('express');
    var mongodb = require('mongodb');
    var app = express();
    var MONGODB_URI = 'mongo-uri';

    app.get('/', function(req, res) {
      // BAD! Creates a new connection pool for every request
      mongodb.MongoClient.connect(MONGODB_URI, function(err, db) {
        if(err) throw err;
        var coll = db.collection('test');
        coll.find({}, function(err, docs) {
          docs.each(function(err, doc) {
            if(doc) {
              res.write(JSON.stringify(doc) + "\n");
            } else {
              res.end();
            }
          });
        });
      });
    });

    // App may initialize before DB connection is ready
    app.listen(3000);
    console.log('Listening on port 3000');

The first (no pooling):

  • calls connect() in every request handler
  • establishes new connections for every request (connection churn)
  • initializes the app (app.listen()) before database connections are made

    var express = require('express');
    var mongodb = require('mongodb');
    var app = express();
    var MONGODB_URI = 'mongodb-uri';
    var db;
    var coll;

    // Initialize connection once
    mongodb.MongoClient.connect(MONGODB_URI, function(err, database) {
      if(err) throw err;
      db = database;
      coll = db.collection('test');
      app.listen(3000);
      console.log('Listening on port 3000');
    });

    // Reuse database/collection object
    app.get('/', function(req, res) {
      coll.find({}, function(err, docs) {
        docs.each(function(err, doc) {
          if(doc) {
            res.write(JSON.stringify(doc) + "\n");
          } else {
            res.end();
          }
        });
      });
    });

The second (with pooling):

  • calls connect() once
  • reuses the database/collection variable (reuses existing connections)
  • waits to initialize the app until after the database connection is established

If you run the first example and refresh your browser enough times, you’ll quickly see that your MongoDB has a hard time handling the flood of connections and will terminate.
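The "connect once, reuse everywhere" idea can also be sketched without a real driver. The snippet below is a minimal illustration, not the driver's API: fakeConnect is a hypothetical stand-in for a connect call, and getClient is an assumed helper name. It caches the promise returned by the first connect, so handlers that fire before the connection is ready still share a single connection attempt.

```javascript
// Hypothetical stand-in for a driver's connect call; it counts how often it runs.
let connectCalls = 0;
function fakeConnect(uri) {
  connectCalls += 1;
  return Promise.resolve({ uri: uri, id: connectCalls });
}

// Cache the promise, not the resolved value, so callers that arrive
// before the connection is ready still share a single connect attempt.
let clientPromise = null;
function getClient(uri) {
  if (clientPromise === null) {
    clientPromise = fakeConnect(uri);
  }
  return clientPromise;
}

// Simulate three request handlers firing at the same time.
const handlers = Promise.all([
  getClient('mongodb-uri'),
  getClient('mongodb-uri'),
  getClient('mongodb-uri'),
]);

handlers.then(function (clients) {
  console.log('connect calls:', connectCalls);
  console.log('same client shared:', clients[0] === clients[1] && clients[1] === clients[2]);
});
```

With a real driver the cached object would hold the connection pool, so this shape keeps one pool per process regardless of how many handlers ask for it.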
Further Consideration – Connection Pool Size

Most MongoDB drivers support a parameter that sets the max number of connections (pool size) available to your application. The connection pool size can be thought of as the max number of concurrent requests that your driver can service. The default pool size varies from driver to driver, e.g. for Node it is 5, whereas for Python it is 100. If you anticipate your application receiving many concurrent or long-running requests, we recommend increasing your pool size and adjusting it to match your workload.
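To make the pool-size idea concrete, here is a minimal sketch of a bounded pool. It is an illustration under stated assumptions, not how any MongoDB driver is implemented: TinyPool, acquire, release, and the connection factory are all hypothetical names, and real pools add timeouts, health checks, and error handling.

```javascript
// Minimal illustrative pool; every name here is hypothetical.
class TinyPool {
  constructor(maxSize, createConnection) {
    this.maxSize = maxSize;
    this.createConnection = createConnection; // hypothetical connection factory
    this.idle = [];     // connections ready for reuse
    this.waiting = [];  // resolvers for callers queued behind a full pool
    this.created = 0;   // total connections ever opened
  }

  acquire() {
    if (this.idle.length > 0) {
      return Promise.resolve(this.idle.pop());
    }
    if (this.created < this.maxSize) {
      this.created += 1;
      return Promise.resolve(this.createConnection(this.created));
    }
    // Pool exhausted: wait until another caller releases a connection.
    return new Promise((resolve) => this.waiting.push(resolve));
  }

  release(conn) {
    const next = this.waiting.shift();
    if (next) {
      next(conn);           // hand the connection straight to a waiter
    } else {
      this.idle.push(conn); // nobody waiting, keep it for reuse
    }
  }
}

async function demo() {
  const pool = new TinyPool(2, (n) => ({ id: n }));
  const a = await pool.acquire();
  const b = await pool.acquire();
  const pendingC = pool.acquire(); // third caller has to wait
  pool.release(a);
  const c = await pendingC;        // receives the connection released above
  pool.release(b);
  pool.release(c);
  return { created: pool.created, reused: c === a, idle: pool.idle.length };
}

demo().then((result) => console.log(result));
```

The third caller never opens a new connection; it waits and then reuses one. That waiting is why a pool that is too small for many concurrent or long-running requests shows up as latency rather than as extra connections.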
November 7, 2013
by Chris Chang
· 24,454 Views · 2 Likes
Data Access Module using Groovy with Spock testing
This blog is more of a tutorial where we describe the development of a simple data access module, more for fun and learning than anything else. All code can be found here for those who don’t want to type along: https://github.com/ricston-git/tododb As a heads-up, we will be covering the following: Using Groovy in a Maven project within Eclipse Using Groovy to interact with our database Testing our code using the Spock framework We include Spring in our tests with ContextConfiguration A good place to start is to write a pom file as shown here. The only dependencies we want packaged with this artifact are groovy-all and commons-lang. The others are either going to be provided by Tomcat or are only used during testing (hence the scope tags in the pom). For example, we would put the jar with PostgreSQL driver in Tomcat’s lib, and tomcat-jdbc and tomcat-dbcp are already there. (Note: regarding the postgre jar, we would also have to do some minor configuration in Tomcat to define a DataSource which we can get in our app through JNDI – but that’s beyond the scope of this blog. See here for more info). Testing-wise, I’m depending on spring-test, spock-core, and spock-spring (the latter is to get spock to work with spring-test). Another significant addition in the pom is the maven-compiler-plugin. I have tried to get gmaven to work with Groovy in Eclipse, but I have found the maven-compiler-plugin to be a lot easier to work with. With your pom in an empty directory, go ahead and mkdir -p src/main/groovy src/main/java src/test/groovy src/test/java src/main/resources src/test/resources. This gives us a directory structure according to the Maven convention. Now you can go ahead and import the project as a Maven project in Eclipse (install the m2e plugin if you don’t already have it). It is important that you do not mvn eclipse:eclipse in your project. 
The .classpath it generates will conflict with your m2e plugin and (at least in my case), when you update your pom.xml the plugin will not update your dependencies inside Eclipse. So just import as a Maven project once you have your pom.xml and directory structure set up.

Okay, so our tests are going to be integration tests, actually using a PostgreSQL database. Since that’s the case, let’s set up our database with some data. First go ahead and create a tododbtest database which will only be used for testing purposes. Next, put the following files in your src/test/resources: Note, fill in your username/password:

schema.sql:

    DROP TABLE IF EXISTS todouser CASCADE;
    CREATE TABLE todouser (
      id SERIAL,
      email varchar(80) UNIQUE NOT NULL,
      password varchar(80),
      registered boolean DEFAULT FALSE,
      confirmationCode varchar(280),
      CONSTRAINT todouser_pkey PRIMARY KEY (id)
    );

test-data.sql:

    insert into todouser (email, password, registered, confirmationCode) values ('[email protected]', 'abc123', FALSE, 'abcdefg')
    insert into todouser (email, password, registered, confirmationCode) values ('[email protected]', 'pass1516', FALSE, '123456')
    insert into todouser (email, password, registered, confirmationCode) values ('[email protected]', 'anon', FALSE, 'codeA')
    insert into todouser (email, password, registered, confirmationCode) values ('[email protected]', 'anon2', FALSE, 'codeB')

Basically, testContext.xml is what we’ll be configuring our test’s context with. The sub-division into datasource.xml and initdb.xml may be a little too much for this example, but changes are usually easier that way. The gist is that we configure our data source in datasource.xml (this is what we will be injecting in our tests), and initdb.xml will run schema.sql and test-data.sql to create our table and populate it with data.

So let’s create our test, or should I say, our specification. Spock is a specification framework that allows us to write more descriptive tests.
In general, it makes our tests easier to read and understand, and since we’ll be using Groovy, we might as well make use of the extra readability Spock gives us.

    package com.ricston.blog.sample.model.spec;

    import javax.sql.DataSource

    import org.springframework.beans.factory.annotation.Autowired
    import org.springframework.test.annotation.DirtiesContext
    import org.springframework.test.annotation.DirtiesContext.ClassMode
    import org.springframework.test.context.ContextConfiguration

    import spock.lang.Specification

    import com.ricston.blog.sample.model.data.TodoUser
    import com.ricston.blog.sample.model.dao.postgre.PostgreTodoUserDAO

    // because it supplies a new application context after each test, the initialize-database in initdb.xml is
    // executed for each test/specification
    @DirtiesContext(classMode=ClassMode.AFTER_EACH_TEST_METHOD)
    @ContextConfiguration('classpath:testContext.xml')
    class PostgreTodoUserDAOSpec extends Specification {

      @Autowired
      DataSource dataSource

      PostgreTodoUserDAO postgreTodoUserDAO

      def setup() {
        postgreTodoUserDAO = new PostgreTodoUserDAO(dataSource)
      }

      def "findTodoUserByEmail when user exists in db"() {
        given: "a db populated with a TodoUser with email [email protected] and the password given below"
        String email = '[email protected]'
        String password = 'anon'

        when: "searching for a TodoUser with that email"
        TodoUser user = postgreTodoUserDAO.findTodoUserByEmail email

        then: "the row is found such that the user returned by findTodoUserByEmail has the correct password"
        user.password == password
      }
    }

One specification is enough for now, just to make sure that all the moving parts are working nicely together. The specification itself is easy enough to understand. We’re just exercising the findTodoUserByEmail method of PostgreTodoUserDAO, which we will be writing soon. Using the ContextConfiguration from Spring Test we are able to inject beans defined in our context (the dataSource in our case) through the use of annotations.
This keeps our tests short and makes them easier to modify later on. Additionally, note the use of DirtiesContext. Basically, after each specification is executed, we cannot rely on the state of the database remaining intact. I am using DirtiesContext to get a new Spring context for each specification run. That way, the table creation and test data insertions happen all over again for each specification we run.

Before we can run our specification, we need to create at least the following two classes used in the spec, TodoUser and PostgreTodoUserDAO, along with the TodoUserDAO interface:

    // Package matches the import used by the spec and the DAO
    package com.ricston.blog.sample.model.data

    import org.apache.commons.lang.builder.ToStringBuilder

    class TodoUser {
      long id;
      String email;
      String password;
      String confirmationCode;
      boolean registered;

      @Override
      public String toString() {
        ToStringBuilder.reflectionToString(this);
      }
    }

    package com.ricston.blog.sample.model.dao.postgre

    import groovy.sql.Sql

    import javax.sql.DataSource

    import com.ricston.blog.sample.model.dao.TodoUserDAO
    import com.ricston.blog.sample.model.data.TodoUser

    class PostgreTodoUserDAO implements TodoUserDAO {

      private Sql sql

      public PostgreTodoUserDAO(DataSource dataSource) {
        sql = new Sql(dataSource)
      }

      /**
       *
       * @param email
       * @return the TodoUser with the given email
       */
      public TodoUser findTodoUserByEmail(String email) {
        sql.firstRow """SELECT * FROM todouser WHERE email = $email"""
      }
    }

    package com.ricston.blog.sample.model.dao;

    import com.ricston.blog.sample.model.data.TodoUser;

    public interface TodoUserDAO {

      /**
       *
       * @param email
       * @return the TodoUser with the given email
       */
      public TodoUser findTodoUserByEmail(String email);
    }

We’re just creating a POGO in TodoUser, implementing its toString using Commons Lang’s ToStringBuilder. In PostgreTodoUserDAO we’re using Groovy’s SQL to access the database, for now only implementing the findTodoUserByEmail method. PostgreTodoUserDAO implements TodoUserDAO, an interface which specifies the required methods a TodoUserDAO must have.
Okay, so now we have all we need to run our specification. Go ahead and run it as a JUnit test from Eclipse. You should get back the following error message:

    org.codehaus.groovy.runtime.typehandling.GroovyCastException: Cannot cast object '{id=3, [email protected], password=anon, registered=false, confirmationcode=codeA}' with class 'groovy.sql.GroovyRowResult' to class 'com.ricston.blog.sample.model.data.TodoUser' due to: org.codehaus.groovy.runtime.metaclass.MissingPropertyExceptionNoStack: No such property: confirmationcode for class: com.ricston.blog.sample.model.data.TodoUser
    Possible solutions: confirmationCode
        at com.ricston.blog.sample.model.dao.postgre.PostgreTodoUserDAO.findTodoUserByEmail(PostgreTodoUserDAO.groovy:23)
        at com.ricston.blog.sample.model.spec.PostgreTodoUserDAOSpec.findTodoUserByEmail when user exists in db(PostgreTodoUserDAOSpec.groovy:37)

Go ahead and connect to your tododbtest database and run select * from todouser; As you can see, our confirmationCode varchar(280) ended up as the column confirmationcode with a lower case ‘c’.

In PostgreTodoUserDAO’s findTodoUserByEmail, we are getting back a GroovyRowResult from our firstRow invocation. GroovyRowResult implements Map, and Groovy is able to create a POGO (in our case TodoUser) from a Map. However, in order for Groovy to be able to automatically coerce the GroovyRowResult into a TodoUser, the keys in the Map (or GroovyRowResult) must match the property names in our POGO. We are using confirmationCode in our TodoUser, and we would like to stick to the camel case convention. What can we do to get around this?

Well, first of all, let’s change our schema to use confirmation_code. That’s a little more readable. Of course, we still have the same problem as before, since confirmation_code will not map to confirmationCode by itself. (Note: remember to change the insert statements in test-data.sql too).
One way to get around this is to use Groovy’s propertyMissing methods as shown below:

    def propertyMissing(String name, value) {
      if(isConfirmationCode(name)) {
        this.confirmationCode = value
      } else {
        unknownProperty(name)
      }
    }

    def propertyMissing(String name) {
      if(isConfirmationCode(name)) {
        return confirmationCode
      } else {
        unknownProperty(name)
      }
    }

    private boolean isConfirmationCode(String name) {
      'confirmation_code'.equals(name)
    }

    def unknownProperty(String name) {
      throw new MissingPropertyException(name, this.class)
    }

By adding this to our TodoUser.groovy we are effectively tapping into how Groovy resolves property access. When we do something like user.confirmationCode, Groovy automatically calls getConfirmationCode(), a method which we got for free when we declared the property confirmationCode in our TodoUser. Now, when user.confirmation_code is invoked, Groovy doesn’t find any getter to invoke, since we never declared the property confirmation_code. However, since we have now implemented the propertyMissing methods, Groovy will use those methods as a last resort when resolving properties before throwing any exception. In our case we are checking whether a get or set on confirmation_code is being made and mapping the respective operation to our confirmationCode property. It’s as simple as that. Now we can keep the auto coercion in our data access object and the property name we choose to have in our TodoUser. Assuming you’ve made the changes to the schema and test-data.sql to use confirmation_code, go ahead and run the spec file, and this time it should pass.

That’s it for this tutorial. In conclusion, I would like to discuss some finer points which someone who’s never used Groovy’s SQL before might not know. As you can see in PostgreTodoUserDAO.groovy, our database interaction is pretty much a one-liner. What about resource handling (e.g. properly closing the connection when we’re done), error logging, and prepared statements?
Resource handling and error logging are done automatically; you just have to worry about writing your SQL. When you do write your SQL, try to stick to triple double-quoted strings with embedded $expressions (GStrings), as used in the PostgreTodoUserDAO.groovy example. Groovy’s SQL turns these into prepared statements, which protects against SQL injection and saves us from having to put ‘?’ all over the place and line up the arguments to pass in to the SQL statement. Note that transaction management is something which the code using our artifact will have to take care of.

Finally, note that a bunch of other operations (apart from findTodoUserByEmail) are implemented in the project on GitHub: https://github.com/ricston-git/tododb. Additionally, there is also a specification test for TodoUser, making sure that the property mapping works correctly. Also, in the pom.xml, there is some maven-surefire-plugin configuration in order to get the surefire-plugin to pick up our Spock specifications as well as any JUnit tests which we might have in our project. This allows us to run our specifications when we, for example, run mvn clean package. After implementing all the operations you require in PostgreTodoUserDAO.groovy, you can go ahead and compile the jar or include it in a Maven multi-module project to get a data access module you can use in other applications.
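For readers who want to see the placeholder idea in isolation, the way an interpolated string can become a prepared statement can be sketched in JavaScript with a tagged template. This is only an analogy under stated assumptions, not Groovy's implementation: the sql tag function and the shape of the object it returns are hypothetical.

```javascript
// Hypothetical tag function: replaces each interpolated value with a '?'
// placeholder and collects the values as a separate parameter list.
function sql(strings, ...values) {
  let text = strings[0];
  for (let i = 0; i < values.length; i += 1) {
    text += '?' + strings[i + 1];
  }
  return { text: text, params: values };
}

// A value that would break a query built by plain string concatenation.
const email = "x' OR '1'='1";
const query = sql`SELECT * FROM todouser WHERE email = ${email}`;

console.log(query.text);   // the statement text contains only a placeholder
console.log(query.params); // the value travels separately as a parameter
```

Because the value never becomes part of the statement text, the database treats it as data rather than as SQL, which is the property the article relies on.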
November 6, 2013
by Justin Calleja
· 21,171 Views