DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest AI/ML Topics

article thumbnail
How to Use NodeManager to Control WebLogic Servers
In my previous post, you have seen how we can start a WebLogic admin and multiple managed servers. One downside with that instruction is that those processes will start in foreground and the STDOUT are printed on terminal. If you intended to run these severs as background services, you might want to try the WebLogic node manager wlscontrol.sh tool. I will show you how you can get Node Manager started here. The easiest way is still to create the domain directory with the admin server running temporary and then create all your servers through the /console application as described in last post. Once you have these created, then you may shut down all these processes and start it with Node Manager. 1. cd $WL_HOME/server/bin && startNodeManager.sh & 3. $WL_HOME/common/bin/wlscontrol.sh -d mydomain -r $HOME/domains/mydomain -c -f startWebLogic.sh -s myserver START 4. $WL_HOME/common/bin/wlscontrol.sh -d mydomain -r $HOME/domains/mydomain -c -f startManagedWebLogic.sh -s appserver1 START The first step above is to start and run your Node Manager. It is recommended you run this as full daemon service so even OS reboot can restart itself. But for this demo purpose, you can just run it and send to background. Using the Node Manager we can then start the admin in step 2, and then to start the managed server on step 3. The NodeManager can start not only just the WebLogic server for you, but it can also monitor them and automatically restart them if they were terminated for any reasons. If you want to shutdown the server manually, you may use this command using Node Manager as well: $WL_HOME/common/bin/wlscontrol.sh -d mydomain -s appserver1 KILL The Node Manager can also be used to start servers remotely through SSH on multiple machines. Using this tool effectively can help managing your servers across your network. You may read more details here: http://docs.oracle.com/cd/E23943_01/web.1111/e13740/toc.htm TIPS1: If there is problem when starting server, you may wnat to look into the log files. One log file is the/servers//logs/.out of the server you trying to start. Or you can look into the Node Manager log itself at $WL_HOME/common/nodemanager/nodemanager.log TIPS2: You add startup JVM arguments to each server starting with Node Manager. You need to create a file under /servers//data/nodemanager/startup.properties and add this key value pair:Arguments = -Dmyapp=/foo/bar TIPS3: If you want to explore Windows version of NodeManager, you may want to start NodeManager without native library to save yourself some trouble. Try adding NativeVersionEnabled=false to$WL_HOME/common/nodemanager/nodemanager.properties file.
March 24, 2014
by Zemian Deng
· 14,196 Views
article thumbnail
Shrink Your Time Machine Backups and Free Disk Space
Time Machine is a backup and restore tool from Apple which is very well integrated into OS X. In my personal opinion Time Machine is not yet awesome.
March 18, 2014
by Enrico Maria Crisostomo
· 162,114 Views · 1 Like
article thumbnail
Jersey: Ignoring SSL certificate – javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException
Last week Alistair and I were working on an internal application and we needed to make a HTTPS request directly to an AWS machine using a certificate signed to a different host. We use jersey-client so our code looked something like this: Client client = Client.create(); client.resource("https://some-aws-host.compute-1.amazonaws.com").post(); // and so on When we ran this we predictably ran into trouble: com.sun.jersey.api.client.ClientHandlerException: javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: No subject alternative DNS name matching some-aws-host.compute-1.amazonaws.com found. at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149) at com.sun.jersey.api.client.Client.handle(Client.java:648) at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) at com.sun.jersey.api.client.WebResource.post(WebResource.java:241) at com.neotechnology.testlab.manager.bootstrap.ManagerAdmin.takeBackup(ManagerAdmin.java:33) at com.neotechnology.testlab.manager.bootstrap.ManagerAdminTest.foo(ManagerAdminTest.java:11) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.runners.ParentRunner.run(ParentRunner.java:300) at org.junit.runner.JUnitCore.run(JUnitCore.java:157) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74) at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:202) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:65) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120) Caused by: javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: No subject alternative DNS name matching some-aws-host.compute-1.amazonaws.com found. at sun.security.ssl.Alerts.getSSLException(Alerts.java:192) at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1884) at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:276) at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:270) at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1341) at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:153) at sun.security.ssl.Handshaker.processLoop(Handshaker.java:868) at sun.security.ssl.Handshaker.process_record(Handshaker.java:804) at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1016) at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1312) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1339) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1323) at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:563) at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1300) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468) at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338) at com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:240) at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147) ... 31 more Caused by: java.security.cert.CertificateException: No subject alternative DNS name matching some-aws-host.compute-1.amazonaws.com found. at sun.security.util.HostnameChecker.matchDNS(HostnameChecker.java:191) at sun.security.util.HostnameChecker.match(HostnameChecker.java:93) at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:347) at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:203) at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:126) at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1323) ... 45 more We figured that we needed to get our client to ignore the certificate and came across this Stack Overflow thread which had some suggestions on how to do this. None of the suggestions worked on their own but we ended up with a combination of a couple of the suggestions which did the trick: public Client hostIgnoringClient() { try { SSLContext sslcontext = SSLContext.getInstance( "TLS" ); sslcontext.init( null, null, null ); DefaultClientConfig config = new DefaultClientConfig(); Map properties = config.getProperties(); HTTPSProperties httpsProperties = new HTTPSProperties( new HostnameVerifier() { @Override public boolean verify( String s, SSLSession sslSession ) { return true; } }, sslcontext ); properties.put( HTTPSProperties.PROPERTY_HTTPS_PROPERTIES, httpsProperties ); config.getClasses().add( JacksonJsonProvider.class ); return Client.create( config ); } catch ( KeyManagementException | NoSuchAlgorithmException e ) { throw new RuntimeException( e ); } } You’re welcome Future Mark.
March 2, 2014
by Mark Needham
· 43,028 Views · 8 Likes
article thumbnail
AES-256 Encryption with Java and JCEKS
Security has become a great topic of discussion in the last few years due to the recent releasing of documents from Edward Snowden and the explosion of hacking against online commerce stores like JC Penny, Sony andTarget. While this post will not give you all of the tools to help prevent the use of illegally sourced data, this post will provide a starting point for building a set of tools and tactics that will help prevent the use of data by other parties. This post will show how to adopt AES encryption for strings in a Java environment. It will talk about creating AES keys and storing AES keys in a JCEKS keystore format. A working example of the code in this blog is located athttps://github.com/mike-ensor/aes-256-encryption-utility It is recommended to read each section in order because each section builds off of the previous section, however, this you might want to just jump quickly jump to a particular section. Setup - Setup and create keys with keytool Encrypt - Encrypt messages using byte[] keys Decrypt - Decrypt messages using same IV and key from encryption Obtain Keys from Keystore - Obtain keys from keystore via an alias What is JCEKS? JCEKS stands for Java Cryptography Extension KeyStore and it is an alternative keystore format for the Java platform. Storing keys in a KeyStore can be a measure to prevent your encryption keys from being exposed. Java KeyStores securely contain individual certificates and keys that can be referenced by an alias for use in a Java program. Java KeyStores are often created using the "keytool" provided with the Java JDK. NOTE: It is strongly recommended to create a complex passcode for KeyStores to keep the contents secure. The KeyStore is a file that is considered to be public, but it is advisable to not give easy access to the file. Setup All encryption is governed by laws of each country and often have restrictions on the strength of the encryption. One example is that in the United States, all encryption over 128-bit is restricted if the data is traveling outside of the boarder. By default, the Java JCE implements a strength policy to comply with these rules. If a stronger encryption is preferred, and adheres to the laws of the country, then the JCE needs to have access to the stronger encryption policy. Very plainly put, if you are planning on using AES 256-bit encryption, you must install theUnlimited Strength Jurisdiction Policy Files. Without the policies in place, 256-bit encryption is not possible. Installation of JCE Unlimited Strength Policy This post is focusing on the keys rather than the installation and setup of the JCE. The installation is rather simple with explicit instructions found here (NOTE: this is for JDK7, if using a different JDK, search for the appropriate JCE policy files). Keystore Setup When using the KeyTool manipulating a keystore is simple. Keystores must be created with a link to a new key or during an import of an existing keystore. In order to create a new key and keystore simply type: keytool -genseckey -keystore aes-keystore.jck -storetype jceks -storepass mystorepass -keyalg AES -keysize 256 -alias jceksaes -keypass mykeypass Important Flags In the example above here are the explanations for the keytool's parameters: Keystore Parameters genseckey Generate SecretKey. This is the flag indicating the creation of a synchronous key which will become our AES key keystore Location of the keystore. If the keystore does not exist, the tool will create a new store. Paths can be relative or absolute but must be local storetype this is the type of store (JCE, PK12, JCEKS, etc). JCEKS is used to store symmetric keys (AES) not contained within a certificate. storepass password related to the keystore. Highly recommended to create a strong passphrase for the keystore Key Parameters keyalg algorithm used to create the key (AES/DES/etc) keysize size of the key (128, 192, 256, etc) alias alias given to the newly created key in which to reference when using the key keypass password protecting the use of the key Encrypt As it pertains to data in Java and at the most basic level, encryption is an algorithmic process used to programmatically obfuscate data through a reversible process where both parties have information pertaining to the data and how the algorithm is used. In Java encryption, this involves the use of a Cipher. A Cipher object in the JCE is a generic entry point into the encryption provider typically selected by the algorithm. This example uses the default Java provider but would also work with Bouncy Castle. Generating a Cipher object Obtaining an instance of Cipher is rather easy and the same process is required for both encryption and decryption. (NOTE: Encryption and Decryption require the same algorithm but do not require the same object instance) Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding"); Once we have an instance of the Cipher, we can encrypt and decrypt data according to the algorithm. Often the algorithm will require additional pieces of information in order to encrypt/decrypt data. In this example, we will need to pass the algorithm the bytes containing the key and an initial vector (explained below). Initialization In order to use the Cipher, we must first initialize the cipher. This step is necessary so we can provide additional information to the algorithm like the AES key and the Initial Vector (aka IV). cipher.init(Cipher.ENCRYPT_MODE, secretKeySpecification, initialVector); Parameters The SecretKeySpecification is an object containing a reference to the bytes forming the AES key. The AES key is nothing more than a specific sized byte array (256-bit for AES 256 or 32 bytes) that is generated by the keytool(see above). Alternative Parameteters There are multiple methods to create keys such as a hash including a salt, username and password (or similar). This method would utilize a SHA1 hash of the concatenated strings, convert to bytes and then truncate result to the desired size. This post will not show the generation of a key using this method or the use of a PBE key method using a password and salt. The password and/or salt usage for the keys is handled by the keytool using the inputs during the creation of new keys. Initialization Vector The AES algorithm also requires a second parameter called the Initialiation Vector. The IV is used in the process to randomize the encrypted message and prevent the key from easy guessing. The IV is considered a publicly shared piece of information, but again, it is not recommended to openly share the information (for example, it wouldn't be wise to post it on your company's website). When encrypting a message, it is not uncommon to prepend the message with the IV since the IV will be a set/known size based on the algorithm. NOTE: the AES algorithm will output the same result if using the same IV, key and message. It is recommended that the IV be randomly created each time an encryption takes place. With the newly initialized Cipher, encrypting a message is simple. Simply call: byte[] encryptedMessageInBytes = Cipher.doFinal((message.getBytes("UTF-8")); String base64EncodedEncryptedMsg = BaseEncoding.base64().encode(encryptedMessageInBytes); String base32EncodedEncryptedMsg = BaseEncoding.base32().encode(encryptedMessageInBytes); Encoding Results Byte arrays are difficult to visualize since they often do not form characters in any charset. The best recommendation to solve this is to represent the bytes in HEX (base-16), Double HEX (base-32) or Base64 format. If the message will be passed via a URL or POST parameter, be sure to use a web-safe Base64 encoding. Google Guava library provides a excellent BaseEncoding utility. NOTE: Remember to decode the encoded message before decrypting. Decrypt Decrypting a message is almost a reverse of the encryption process with a few exceptions. Decryption requires a known initialization vector as a parameter unlike the encryption process generating a random IV. Decryption When decrypting, obtain a cipher object with the same process as the encryption method. The Cipher object will need to utilize the exact same algorithm including the method and padding selections. Once the code has obtained a reference to a Cipher object, the next step is to initialize the cipher for decryption and pass in a reference to a key and the initialization vector. // key is the same byte[] key used in encryption SecretKeySpec secretKeySpecification = new SecretKeySpec(key, "AES"); cipher.init(Cipher.DECRYPT_MODE, secretKeySpecification, initialVector); NOTE: The key is stored in the keystore and obtained by the use of an alias. See below for details on obtaining keys from a keystore Once the cipher has been provided the key, IV and initialized for decryption, the cipher is ready to perform the decryption. byte[] encryptedTextBytes = BaseEncoding.base64().decode(message); byte[] decryptedTextBytes = cipher.doFinal(encryptedTextBytes); String origMessage = new String(decryptedTextBytes); Strategies to keep IV The IV used to encrypt the message is important to decrypting the message therefore the question is raised, how do they stay together. One solution is to Base Encode (see above) the IV and prepend it to the encrypted and encoded message: Base64UrlSafe(myIv) + delimiter + Base64UrlSafe(encryptedMessage). Other possible solutions might be contextual such as including an attribute in an XML file with the IV and one for the alias to the key used. Obtain Key from Keystore The beginning of this post has shown how easy it is to create new AES-256 keys that reference an alias inside of a keystore database. The post then continues on how to encrypt and decrypt a message given a key, but has yet shown how to obtain a reference to the key in a keystore. Solution // for clarity, ignoring exceptions and failures InputStream keystoreStream = new FileInputStream(keystoreLocation); KeyStore keystore = KeyStore.getInstance("JCEKS"); keystore.load(keystoreStream, keystorePass.toCharArray()); if (!keystore.containsAlias(alias)) { thrownew RuntimeException("Alias for key not found"); } Key key = keystore.getKey(alias, keyPass.toCharArray()); Parameters keystoreLocation String - Location to local keystore file location keypass String - Password used when creating or modifying the keystore file with keytool (see above) alias String - Alias used when creating new key with keytool (see above) Conclusion This post has shown how to encrypt and decrypt string based messages using the AES-256 encryption algorithm. The keys to encrypt and decrypt these messages are held inside of a JCEKS formatted KeyStore database created using the JDK provided "keytool" utility. The examples in this post should be considered a solid start to encrypting/decrypting symmetric keys such as AES. This should not be considered the only line of defense when encrypting messages, for example key rotation. Key rotation is a method to mitigate risks in the event of a data breach. If an intruder obtains data and manages to hack a single key, the data contained in multiple files should have used several keys to encrypt the data thus bringing down risk of a total exposure loss. All of the examples in this blog post have been condensed into a simple tool allowing for the viewing of keys inside of a keystore, an operation that is not supported out of the box by the JDK keytool. Each aspect of the steps and topics outlined in this post are available at: https://github.com/mike-ensor/aes-256-encryption-utility. NOTE: The examples, sample code and any reference is to be used at the sole implementers risk and there is no implied warranty or liability, you assume all risks.
February 4, 2014
by Mike Ensor
· 102,283 Views · 2 Likes
article thumbnail
Running Multiple ActiveMQ Instances on One Machine
a few weeks ago i started making use of apache activemq again as the jms provider with my mule esb solution. since it had been a few years that i used activemq i thought it would be nice to check out some of the (new) features like the failover transport and other clustering features . to be able to test these last things i needed multiple installations of activemq on my machine. luckily this isn’t very hard to accomplish, although the documentation on this on the activemq site is quite minimal. the first step is to download and unzip the activemq package, which i did at ~/develop/apache-activemq-5.8.0. to create the instances i go to the activemq home directory and use the ‘create’ command like this: cd develop/apache-activemq-5.8.0/ ./bin/activemq create instancea ./bin/activemq create instanceb now if you do a ‘ls -l’ you will see that there are two subdirectories created, ‘instancea’ and ‘instanceb’. since both instances will make use of the default ports we have to modify the config for the second instance. go to the directory ‘develop/apache-activemq-5.8.0/instanceb/conf’ and open the file ‘jetty.xml’ to make the webconsole available at port ’8162′ by modifying the following line: next open the file ‘activemq.xml’ in the same directory and modify the following part: that’s it! make sure both files are saved and start the first instance with: cd ~/develop/apache-activemq-5.8.0/instancea/bin ./instancea console open up a new console and run the commands: cd /users/pascal/develop/apache-activemq-5.8.0/instanceb/bin ./instanceb console now you have two instances running next to each other and can start testing the ‘advanced’ functions of activemq.
January 24, 2014
by $$anonymous$$
· 17,781 Views · 1 Like
article thumbnail
Understanding sun.misc.Unsafe
The biggest competitor to the Java virtual machine might be Microsoft's CLR that hosts languages such as C#. The CLR allows to write unsafe code as an entry gate for low level programming, something that is hard to achieve on the JVM. If you need such advanced functionality in Java, you might be forced to use the JNI which requires you to know some C and will quickly lead to code that is tightly coupled to a specific platform. With sun.misc.Unsafe, there is however another alternative to low-level programming on the Java plarform using a Java API, even though this alternative is discouraged. Nevertheless, several applications rely on sun.misc.Unsafe such for example objenesis and therewith all libraries that build on the latter such for example kryo which is again used in for example Twitter's Storm. Therefore, it is time to have a look, especially since the functionality of sun.misc.Unsafe is considered to become part of Java's public API in Java 9. Getting hold of an instance of sun.misc.Unsafe The sun.misc.Unsafe class is intended to be only used by core Java classes which is why its authors made its only constructor private and only added an equally private singleton instance. The public getter for this instances performs a security check in order to avoid its public use: public static Unsafe getUnsafe() { Class cc = sun.reflect.Reflection.getCallerClass(2); if (cc.getClassLoader() != null) throw new SecurityException("Unsafe"); return theUnsafe; } This method first looks up the calling Class from the current thread’s method stack. This lookup is implemented by another internal class named sun.reflection.Reflection which is basically browsing down the given number of call stack frames and then returns this method’s defining class. This security check is however likely to change in future version. When browsing the stack, the first found class (index 0) will obviously be the Reflection class itself, and the second (index 1) class will be the Unsafe class such that index 2 will hold your application class that was calling Unsafe#getUnsafe(). This looked-up class is then checked for its ClassLoader where a null reference is used to represent the bootstrap class loader on a HotSpot virtual machine. (This is documented in Class#getClassLoader() where it says that “some implementations may use null to represent the bootstrap class loader”.) Since no non-core Java class is normally ever loaded with this class loader, you will therefore never be able to call this method directly but receive a thrown SecurityException as an answer. (Technically, you could force the VM to load your application classes using the bootstrap class loader by adding it to the –Xbootclasspath, but this would require some setup outside of your application code which you might want to avoid.) Thus, the following test will succeed: @Test(expected = SecurityException.class) public void testSingletonGetter() throws Exception { Unsafe.getUnsafe(); } However, the security check is poorly designed and should be seen as a warning against the singleton anti-pattern. As long as the use of reflection is not prohibited (which is hard since it is so widely used in many frameworks), you can always get hold of an instance by inspecting the private members of the class. From the Unsafe class's source code, you can learn that the singleton instance is stored in a private static field called theUnsafe. This is at least true for the HotSpot virtual machine. Unfortunately for us, other virtual machine implementations sometimes use other names for this field. Android’s Unsafe class is for example storing its singleton instance in a field called THE_ONE. This makes it hard to provide a “compatible” way of receiving the instance. However, since we already left the save territory of compatibility by using the Unsafe class, we should not worry about this more than we should worry about using the class at all. For getting hold of the singleton instance, you simply read the singleton field's value: Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe"); theUnsafe.setAccessible(true); Unsafe unsafe = (Unsafe) theUnsafe.get(null); Alternatively, you can invoke the private instructor. I do personally prefer this way since it works for example with Android while extracting the field does not: Constructor unsafeConstructor = Unsafe.class.getDeclaredConstructor(); unsafeConstructor.setAccessible(true); Unsafe unsafe = unsafeConstructor.newInstance(); The price you pay for this minor compatibility advantage is a minimal amount of heap space. The security checks performed when using reflection on fields or constructors are however similar. Create an Instance of a Class Without Calling a Constructor The first time I made use of the Unsafe class was for creating an instance of a class without calling any of the class's constructors. I needed to proxy an entire class which only had a rather noisy constructor but I only wanted to delegate all method invocations to a real instance which I did however not know at the time of construction. Creating a subclass was easy and if the class had been represented by an interface, creating a proxy would have been a straight-forward task. With the expensive constructor, I was however stuck. By using the Unsafe class, I was however able to work my way around it. Consider a class with an artificially expensive constructor: class ClassWithExpensiveConstructor { private final int value; private ClassWithExpensiveConstructor() { value = doExpensiveLookup(); } private int doExpensiveLookup() { try { Thread.sleep(2000); } catch (InterruptedException e) { e.printStackTrace(); } return 1; } public int getValue() { return value; } } Using the Unsafe, we can create an instance of ClassWithExpensiveConstructor (or any of its subclasses) without having to invoke the above constructor, simply by allocating an instance directly on the heap: @Test public void testObjectCreation() throws Exception { ClassWithExpensiveConstructor instance = (ClassWithExpensiveConstructor) unsafe.allocateInstance(ClassWithExpensiveConstructor.class); assertEquals(0, instance.getValue()); } Note that final field remained uninitialized by the constructor but is set with its type's default value. Other than that, the constructed instance behaves like a normal Java object. It will for example be garbage collected when it becomes unreachable. The Java run time itself creates objects without calling a constructor when for example creating objects for deserialization. Therefore, the ReflectionFactory offers even more access to individual object creation: @Test public void testReflectionFactory() throws Exception { @SuppressWarnings("unchecked") Constructor silentConstructor = ReflectionFactory.getReflectionFactory() .newConstructorForSerialization(ClassWithExpensiveConstructor.class, Object.class.getConstructor()); silentConstructor.setAccessible(true); assertEquals(10, silentConstructor.newInstance().getValue()); } Note that the ReflectionFactory class only requires a RuntimePermission called reflectionFactoryAccess for receiving its singleton instance and no reflection is therefore required here. The received instance of ReflectionFactory allows you to define any constructor to become a constructor for the given type. In the example above, I used the default constructor of java.lang.Object for this purpose. You can however use any constructor: class OtherClass { private final int value; private final int unknownValue; private OtherClass() { System.out.println("test"); this.value = 10; this.unknownValue = 20; } } @Test public void testStrangeReflectionFactory() throws Exception { @SuppressWarnings("unchecked") Constructor silentConstructor = ReflectionFactory.getReflectionFactory() .newConstructorForSerialization(ClassWithExpensiveConstructor.class, OtherClass.class.getDeclaredConstructor()); silentConstructor.setAccessible(true); ClassWithExpensiveConstructor instance = silentConstructor.newInstance(); assertEquals(10, instance.getValue()); assertEquals(ClassWithExpensiveConstructor.class, instance.getClass()); assertEquals(Object.class, instance.getClass().getSuperclass()); } Note that value was set in this constructor even though the constructor of a completely different class was invoked. Non-existing fields in the target class are however ignored as also obvious from the above example. Note that OtherClass does not become part of the constructed instances type hierarchy, the OtherClass's constructor is simply borrowed for the "serialized" type. Not mentioned in this blog entry are other methods such as Unsafe#defineClass, Unsafe#defineAnonymousClass or Unsafe#ensureClassInitialized. Similar functionality is however also defined in the public API's ClassLoader. Native Memory Allocation Did you ever want to allocate an array in Java that should have had more than Integer.MAX_VALUE entries? Probably not because this is not a common task, but if you once need this functionality, it is possible. You can create such an array by allocating native memory. Native memory allocation is used by for example direct byte buffers that are offered in Java's NIO packages. Other than heap memory, native memory is not part of the heap area and can be used non-exclusively for example for communicating with other processes. As a result, Java's heap space is in competition with the native space: the more memory you assign to the JVM, the less native memory is left. Let us look at an example for using native (off-heap) memory in Java with creating the mentioned oversized array: class DirectIntArray { private final static long INT_SIZE_IN_BYTES = 4; private final long startIndex; public DirectIntArray(long size) { startIndex = unsafe.allocateMemory(size * INT_SIZE_IN_BYTES); unsafe.setMemory(startIndex, size * INT_SIZE_IN_BYTES, (byte) 0); } } public void setValue(long index, int value) { unsafe.putInt(index(index), value); } public int getValue(long index) { return unsafe.getInt(index(index)); } private long index(long offset) { return startIndex + offset * INT_SIZE_IN_BYTES; } public void destroy() { unsafe.freeMemory(startIndex); } } @Test public void testDirectIntArray() throws Exception { long maximum = Integer.MAX_VALUE + 1L; DirectIntArray directIntArray = new DirectIntArray(maximum); directIntArray.setValue(0L, 10); directIntArray.setValue(maximum, 20); assertEquals(10, directIntArray.getValue(0L)); assertEquals(20, directIntArray.getValue(maximum)); directIntArray.destroy(); } First, make sure that your machine has sufficient memory for running this example! You need at least (2147483647 + 1) * 4 byte = 8192 MB of native memory for running the code. If you have worked with other programming languages as for example C, direct memory allocation is something you do every day. By calling Unsafe#allocateMemory(long), the virtual machine allocates the requested amount of native memory for you. After that, it will be your responsibility to handle this memory correctly. The amount of memory that is required for storing a specific value is dependent on the type's size. In the above example, I used an int type which represents a 32-bit integer. Consequently a single int value consumes 4 byte. For primitive types, size is well-documented. It is however more complex to compute the size of object types since they are dependent on the number of non-static fields that are declared anywhere in the type hierarchy. The most canonical way of computing an object's size is using the Instrumented class from Java's attach API which offers a dedicated method for this purpose called getObjectSize. I will however evaluate another (hacky) way of dealing with objects in the end of this section. Be aware that directly allocated memory is always native memory and therefore not garbage collected. You therefore have to free memory explicitly as demonstrated in the above example by a call to Unsafe#freeMemory(long). Otherwise you reserved some memory that can never be used for something else as long as the JVM instance is running what is a memory leak and a common problem in non-garbage collected languages. Alternatively, you can also directly reallocate memory at a certain address by calling Unsafe#reallocateMemory(long, long) where the second argument describes the new amount of bytes to be reserved by the JVM at the given address. Also, note that the directly allocated memory is not initialized with a certain value. In general, you will find garbage from old usages of this memory area such that you have to explicitly initialize your allocated memory if you require a default value. This is something that is normally done for you when you let the Java run time allocate the memory for you. In the above example, the entire area is overriden with zeros with help of the Unsafe#setMemory method. When using directly allocated memory, the JVM will neither do range checks for you. It is therefore possible to corrupt your memory as this example shows: @Test public void testMallaciousAllocation() throws Exception { long address = unsafe.allocateMemory(2L * 4); unsafe.setMemory(address, 8L, (byte) 0); assertEquals(0, unsafe.getInt(address)); assertEquals(0, unsafe.getInt(address + 4)); unsafe.putInt(address + 1, 0xffffffff); assertEquals(0xffffff00, unsafe.getInt(address)); assertEquals(0x000000ff, unsafe.getInt(address + 4)); } Note that we wrote a value into the space that was each partly reserved for the first and for the second number. This picture might clear things up. Be aware that the values in the memory run from the "right to the left" (but this might be machine dependent). The first row shows the initial state after writing zeros to the entire allocated native memory area. Then we override 4 byte with an offset of a single byte using 32 ones. The last row shows the result after this writing operation. Finally, we want to write an entire object into native memory. As mentioned above, this is a difficult task since we first need to compute the size of the object in order to know the amount of size we need to reserve. The Unsafe class does however not offer such functionality. At least not directly since we can at least use the Unsafe class to find the offset of an instance's field which is used by the JVM when itself allocates objects on the heap. This allows us to find the approximate size of an object: public long sizeOf(Class clazz) long maximumOffset = 0; do { for (Field f : clazz.getDeclaredFields()) { if (!Modifier.isStatic(f.getModifiers())) { maximumOffset = Math.max(maximumOffset, unsafe.objectFieldOffset(f)); } } } while ((clazz = clazz.getSuperclass()) != null); return maximumOffset + 8; } This might at first look cryptic, but there is no big secret behind this code. We simply iterate over all non-static fields that are declared in the class itself or in any of its super classes. We do not have to worry about interfaces since those cannot define fields and will therefore never alter an object's memory layout. Any of these fields has an offset which represents the first byte that is occupied by this field's value when the JVM stores an instance of this type in memory, relative to a first byte that is used for this object. We simply have to find the maximum offset in order to find the space that is required for all fields but the last field. Since a field will never occupy more than 64 bit (8 byte) for a long or double value or for an object reference when run on a 64 bit machine, we have at least found an upper bound for the space that is used to store an object. Therefore, we simply add these 8 byte to the maximum index and we will not run into danger of having reserved to little space. This idea is of course wasting some byte and a better algorithm should be used for production code. In this context, it is best to think of a class definition as a form of heterogeneous array. Note that the minimum field offset is not 0 but a positive value. The first few byte contain meta information. The graphic below visualizes this principle for an example object with an int and a long field where both fields have an offset. Note that we do not normally write meta information when writing a copy of an object into native memory so we could further reduce the amount of used native memoy. Also note that this memory layout might be highly dependent on an implementation of the Java virtual machine. With this overly careful estimate, we can now implement some stub methods for writing shallow copies of objects directly into native memory. Note that native memory does not really know the concept of an object. We are basically just setting a given amount of byte to values that reflect an object's current values. As long as we remember the memory layout for this type, these byte contain however enough information to reconstruct this object. public void place(Object o, long address) throws Exception { Class clazz = o.getClass(); do { for (Field f : clazz.getDeclaredFields()) { if (!Modifier.isStatic(f.getModifiers())) { long offset = unsafe.objectFieldOffset(f); if (f.getType() == long.class) { unsafe.putLong(address + offset, unsafe.getLong(o, offset)); } else if (f.getType() == int.class) { unsafe.putInt(address + offset, unsafe.getInt(o, offset)); } else { throw new UnsupportedOperationException(); } } } } while ((clazz = clazz.getSuperclass()) != null); } public Object read(Class clazz, long address) throws Exception { Object instance = unsafe.allocateInstance(clazz); do { for (Field f : clazz.getDeclaredFields()) { if (!Modifier.isStatic(f.getModifiers())) { long offset = unsafe.objectFieldOffset(f); if (f.getType() == long.class) { unsafe.putLong(instance, offset, unsafe.getLong(address + offset)); } else if (f.getType() == int.class) { unsafe.putLong(instance, offset, unsafe.getInt(address + offset)); } else { throw new UnsupportedOperationException(); } } } } while ((clazz = clazz.getSuperclass()) != null); return instance; } @Test public void testObjectAllocation() throws Exception { long containerSize = sizeOf(Container.class); long address = unsafe.allocateMemory(containerSize); Container c1 = new Container(10, 1000L); Container c2 = new Container(5, -10L); place(c1, address); place(c2, address + containerSize); Container newC1 = (Container) read(Container.class, address); Container newC2 = (Container) read(Container.class, address + containerSize); assertEquals(c1, newC1); assertEquals(c2, newC2); } Note that these stub methods for writing and reading objects in native memory only support int and long field values. Of course, Unsafe supports all primitive values and can even write values without hitting thread-local caches by using the volatile forms of the methods. The stubs were only used to keep the examples concise. Be aware that these "instances" would never get garbage collected since their memory was allocated directly. (But maybe this is what you want.) Also, be careful when precalculating size since an object's memory layout might be VM dependent and also alter if a 64-bit machine runs your code compared to a 32-bit machine. The offsets might even change between JVM restarts. For reading and writing primitives or object references, Unsafe provides the following type-dependent methods: getXXX(Object target, long offset): Will read a value of type XXX from target's address at the specified offset. putXXX(Object target, long offset, XXX value): Will place value at target's address at the specified offset. getXXXVolatile(Object target, long offset): Will read a value of type XXX from target's address at the specified offset and not hit any thread local caches. putXXXVolatile(Object target, long offset, XXX value): Will place value at target's address at the specified offset and not hit any thread local caches. putOrderedXXX(Object target, long offset, XXX value): Will place value at target's address at the specified offet and might not hit all thread local caches. putXXX(long address, XXX value): Will place the specified value of type XXX directly at the specified address. getXXX(long address): Will read a value of type XXX from the specified address. compareAndSwapXXX(Object target, long offset, long expectedValue, long value): Will atomicly read a value of type XXX from target's address at the specified offset and set the given value if the current value at this offset equals the expected value. Be aware that you are copying references when writing or reading object copies in native memory by using the getObject(Object, long) method family. You are therefore only creating shallow copies of instances when applying the above method. You could however always read object sizes and offsets recursively and create deep copies. Pay however attention for cyclic object references which would cause infinitive loops when applying this principle carelessly. Not mentioned here are existing utilities in the Unsafe class that allow manipulation of static field values sucht as staticFieldOffset and for handling array types. Finally, both methods named Unsafe#copyMemory allow to instruct a direct copy of memory, either relative to a specific object offset or at an absolute address as the following example shows: @Test public void testCopy() throws Exception { long address = unsafe.allocateMemory(4L); unsafe.putInt(address, 100); long otherAddress = unsafe.allocateMemory(4L); unsafe.copyMemory(address, otherAddress, 4L); assertEquals(100, unsafe.getInt(otherAddress)); } Throwing Checked Exceptions Without Declaration There are some other interesting methods to find in Unsafe. Did you ever want to throw a specific exception to be handled in a lower layer but you high layer interface type did not declare this checked exception? Unsafe#throwException allows to do so: @Test(expected = Exception.class) public void testThrowChecked() throws Exception { throwChecked(); } public void throwChecked() { unsafe.throwException(new Exception()); } Native Concurrency The park and unpark methods allow you to pause a thread for a certain amount of time and to resume it: @Test public void testPark() throws Exception { final boolean[] run = new boolean[1]; Thread thread = new Thread() { @Override public void run() { unsafe.park(true, 100000L); run[0] = true; } }; thread.start(); unsafe.unpark(thread); thread.join(100L); assertTrue(run[0]); } Also, monitors can be acquired directly by using Unsafe using monitorEnter(Object), monitorExit(Object) and tryMonitorEnter(Object). A file containing all the examples of this blog entry is available as a gist.
January 14, 2014
by Rafael Winterhalter
· 152,554 Views · 39 Likes
article thumbnail
Semantic Search with Solr and NumPy
Built upon Lucene, Solr provides fast, highly scalable, and easily maintainable full-text search capabilities. However, under the hood, Solr is really just a sophisticated token-matching engine. What’s missing? — Semantic Search! Consider three, somewhat silly documents: Yellow banana peels. A banana is a long yellow fruit. This mystery fruit is long and yellow and has a peel. Now what happens if you search for the term “banana?" Under normal circumstances you only get back the first and second document. But why shouldn’t you also get back the third document? It’s obviously talking about bananas! Semantic Search via Collaborative Filtering Colleague Doug Turnbull and I recently set about to right this wrong with help from a machine learning technique called collaborative filtering. Collaborative filtering is most often used as a basis for recommendation algorithms. For example, collaborative filtering algorithms were the central focus of the now-famous Netflix Prize which awarded $1Million to the team which could build the best movie recommendation engine. When dealing with recommendations, collaborative filtering works by mathematically identifying commonalities in groups of users based upon the movies that they enjoyed. Then, if you appear to fall in one of those groups, the recommendation engine will point you towards a movie that a) you haven’t watched and b) you are likely to enjoy. So what does this have to do with Semantic Search? Everything! In just the same way that certain users gravitate towards certain movies, certain words commonly co-occur in the same documents. When working with Semantic Search, rather than recommending user to movies that they would likely enjoy, we are going to identify words that are likely to belong in a given document, whether or not they actually occurred there. The math is exactly the same! Here’s how the process works: First we identify a text field of interest in our documents and extract the associated term-document matrix for external processing. Each element of this term-document matrix indicates the strength of a particular term within a particular document (where strength can be anything, but will likely be either term frequency or TF*IDF). Next, collaborative filtering is applied to the term-document matrix which effectively generates a pseudo-term-document matrix. This pseudo-term-document matrix is the same size and shape as the original term-document matrix and references the same terms and documents, but the numbers are slightly different. These new values indicate the strength that a particular term should have in a particular document once noisy data is removed. Finally, the high-scoring values in the pseudo-term-document matrix are mapped back to the associated terms. These terms are then injected back into Solr in a new field which can be used for Semantic Search. Demo Time! So let’s consider an example case. As in plenty of our previous posts, we will be using the Science Fiction Stack Exchange. Why? Because we’re all nerds and with such a familiar topic, we can quickly intuit whether or not a search is returning relevant results. In this data set, the field of interest is the Body field because it contains the contents of all questions and answers. So, now that we’ve decided upon our demo dataset, we’re ready run the analysis. If you’d like to follow along, then please take a look at our git repo. This repo contains the example SciFi data set, the Semantic Search code, and README to get you going. However I’m going execute everything from within Python: >>> from SemanticAnalyzer import * >>> stvc = SolrTermVectorCollector(field='Body',feature='tf',batchSize=1000) >>> tdc = TermDocCollection(source=stvc,numTopics=150) That last line takes a few minutes. If it’s in the AM where you are, grab a coffee. If it’s in the PM, grab a beer. Once that line completes, we will have successfully extracted the term-document matrix from Solr. Now let’s play with it for a bit. One of the cool side effects of this analysis is the ability to quickly find words that commonly occur together. Let’s give it an easy test; here are the 30 most highly correlated words with the word ‘vader’ (as in Darth Vader). >>> tdc.getRelatedTerms('vader',30) Did you notice that pause when you called the function? That was the collaborative filtering taking place. The results of that process have now been saved, so additional calls will return quite quickly. vader luke emperor darth palpatin anakin sith skywalk sidiou apprentic empir luca side star son forc turn kill death rule suit father question jedi command obi tarkin dark wan plan Hey, not bad! Everything here seems very reasonably connected with Mr. Vader. You may notice some odd spellings here; that’s because these are the indexed terms, therefore they are stemmed. Let’s try again with a different term; this time everyone’s favorite wizard: >>> tdc.getRelatedTerms('potter',30) harri potter voldemort wizard snape death magic jame love spell time rowl lili eater travel seri hous hand hogwart three find wormtail kill slytherin hallow secret deathli muggl order lord Again, pretty good! One last try, and we’ll make it a little more challenging – a vague adjective: >>> tdc.getRelatedTerms('dark',30) dark side jedi sith eater lord death mark snape magic curs evil forc luke mercuri cave yoda jame palpatin dagobah anakin black call wizard slytherin live light siriu matter voldemort Indeed, most of these terms are like a hall of fame of dark things from Star Wars and Harry Potter. Now since the word correlation has proven itself out, it’s time to generate the pseudo terms and post them back to Solr. >>> SolrBlurredTermUpdater(tdc,blurredField="BodyBlurred").pushToSolr(0.1) This line will probably see you to the end of your coffee or beer (it takes about 10 minutes on my machine). But once it’s done, you can start issuing searches to Solr. Solr Results Here’s an example of Semantic Search using Solr: http://localhost:8983/solr/select/?q=-Body:dark +BodyBlurred:dark The Body field contains the original text while the BodyBlurred contains the pseudo-terms. So this finds all documents that do not include the term dark, but presumably contain dark content. Take a look at the documents that come back: { Body: " In the John Carter movie (2012), he shows off some of his powers, like jumping abnormally high, but I have difficulty evaluating his strength. On the one side, he shows great strength, as when he kills a thark warrior with one hand, but he is also quite mistreated by them. He also seems helpless when he is strangled by Tars Tarkas. Why does the strength he shows seem so inconsistent? ", BodyBlurred: "tv great movi control kill consid hand dark side power long mutant fight machin light abil sauron wormtail hulk" }, { Body: " In the movies, the Nazgul ride black horses with armour. I was wondering if that is all they are, or do they have some sort of magic? Are they evil? ", BodyBlurred: "movi black magic dark demon engin hous aveng slytherin" }, { Body: " The remaining Black Brother from the prologue of A Game of Thrones is apparently the deserter who is beheaded in the beginning of the book. But how did he manage to get to Winterfell from the other side of The Wall? Or did the show throw me off track and in the book there weren't any survivors, so the deserter is someone else? ", BodyBlurred: "book watch black hole dark side plai long game demon engin light turn district" }, { Body: " Was this ever discussed in any episode, or as a side-plot somewhere? ", BodyBlurred: "episod dark side light" } Not bad – most of those topics are rather … dark. Though check out that last result. So … maybe there are still some improvements we can make! But you also have to remember that we’re dealing with word correlation here, and I can only guess that somewhere else in the corpus, dark side-plots and dark episodes were surely discussed. Speaking of word correlations, check out this gem: { Body: " You're correct, Enterprise is the only Star Trek that fits into both the original and the new 2009 movie timelines. From the perspective of the Enterprise characters, both are possible futures, given the over-arcing conceit of the show was a Temporal Cold War, so its future is in flux and could line up with either of the timelines we're familiar with, or with an entirely different future. ", BodyBlurred: "answer charact place klingon star trek design travel crew watch work movi happen enterpris featur futur exist origin 2009 chang altern timelin war to version event captain gener pictur tng creat iii galaxi theori return alter voyag entir fry turn kirk paradox biff doc marti feder 1955 starship 2015 class hero centuri tempor uss phoenix mirror river 800 ncc 1701 simon conner skynet alisha" } The original document involves Star Trek and time travel. And appropriately, the pseudo terms include Star Trek things and time-travel terms … but do you see anything funny? That’s right, Biff, Doc and Marti made their way into the pseudo terms, likely because of their role in the popular time-travel film “Back to the Future.” Speaking of the future … Future Work Semantic Search with Solr is hot right now. In the upcoming Dublin LuceneRevolution I know of at least three related talks that have been submitted (one of them my own); I have heard that MapR is working on a Solr Semantic Search/Recommendation engine built atop of their Hadoop offering; and I suspect that with Cloudera’s recent foray into Solr with Mark Miller, they will also be working on the same thing. What’s next for our work? Recommendations! Remember, that’s how we started this conversation. E-commerce recommendations is a simple extension of the work presented above. Given an inventory catalog (e.g., product title, description, etc.), and given a history of user purchases, we can build a search-aware recommendation engine. That is, when a customer searches for a particular item, they will receive results as usual, except that the results will be boosted with items that they are more likely to purchase. How? Because we know what type of customer they are and what products that type of customer is more likely to buy! Do you have a good case for Solr Semantic Search and Recommendation? We’d love to hear it, please contact us!
September 30, 2013
by John Berryman
· 11,658 Views
article thumbnail
Algorithm of the Week: Quicksort - Three-way vs. Dual-pivot
It’s no news that quicksort is considered one of the most important algorithms of the century and that it is the de facto system sort for many languages, including the Arrays.sort in Java. So, what’s new about quicksort? Well, nothing except that I just now figured out (two damn years after the release of Java 7) that the quicksort implementation of Arrays.sort has been replaced with a variant called dual-pivot quicksort. This thread is not only awesome for this reason but also how humble Jon Bentley and Joshua Bloch really are. What did I do then? Just like everybody else, I wanted to implement it and do some benchmarking against some 10 million integers (random and duplicate). Oddly enough, I found the following results: Random Data Basic quicksort: 1222 ms Three-way quicksort: 1295 ms (seriously!) Dual-pivot quicksort: 1066 ms Duplicate Data Basic quicksort: 378 ms Three-way quicksort: 15 ms Dual-pivot quicksort: six ms Stupid Question One I am afraid that I am missing something in the implementation of the three-way partition. Across several runs against random inputs (of 10 million numbers), I could see that the single pivot always performs better (although the difference is less than 100 milliseconds for 10 million numbers). I understand that the whole purpose of making the three-way quicksort the default quicksort until now is that it does not give 0(n2) performance on duplicate keys, which is very evident when I run it against duplicate inputs. But is it true that, for the sake of handling duplicate data, a small penalty is taken by the three-way quicksort? Or is my implementation bad? Stupid Question Two My dual-pivot implementation (link below) does not handle duplicates well. It takes forever (0(n2)) to execute. Is there a good way to avoid this? Referring to the Arrays.sort implementation, I figured out that ascending sequences and duplicates are eliminated well before the actual sorting is done. So, as a dirty fix, if the pivots are equal I fast-forward the lowerIndex until it is different than pivot2. Is this a fair implementation? else if (pivot1==pivot2){ while (pivot1==pivot2 && lowIndex=j) break; exchange(input, i, j); } exchange(input, pivotIndex, j); return j; } } Three-way package basics.sorting.quick; import static basics.shuffle.KnuthShuffle.shuffle; import static basics.sorting.utils.SortUtils.exchange; import static basics.sorting.utils.SortUtils.less; public class QuickSort3Way { public void sort (int[] input){ //input=shuffle(input); sort (input, 0, input.length-1); } public void sort(int[] input, int lowIndex, int highIndex) { if (highIndex<=lowIndex) return; int lt=lowIndex; int gt=highIndex; int i=lowIndex+1; int pivotIndex=lowIndex; int pivotValue=input[pivotIndex]; while (i<=gt){ if (less(input[i],pivotValue)){ exchange(input, i++, lt++); } else if (less (pivotValue, input[i])){ exchange(input, i, gt--); } else{ i++; } } sort (input, lowIndex, lt-1); sort (input, gt+1, highIndex); } } Dual-pivot package basics.sorting.quick; import static basics.shuffle.KnuthShuffle.shuffle; import static basics.sorting.utils.SortUtils.exchange; import static basics.sorting.utils.SortUtils.less; public class QuickSortDualPivot { public void sort (int[] input){ //input=shuffle(input); sort (input, 0, input.length-1); } private void sort(int[] input, int lowIndex, int highIndex) { if (highIndex<=lowIndex) return; int pivot1=input[lowIndex]; int pivot2=input[highIndex]; if (pivot1>pivot2){ exchange(input, lowIndex, highIndex); pivot1=input[lowIndex]; pivot2=input[highIndex]; //sort(input, lowIndex, highIndex); } else if (pivot1==pivot2){ while (pivot1==pivot2 && lowIndex
August 14, 2013
by Arun Manivannan
· 46,379 Views
article thumbnail
A Prodecedural City in 100 Lines of Three.js
The above skyline is from "City," a simple flight-simulator by Ricardo "Mr. Doob" Cabello. "City" is a demo of the capabilities of WebGL, and is written in an impressive 100 lines of JavaScript using Three.js. In his blog post "How to Do a Procedural City in 100 Lines," Jerome Etienne walks you through the process of recreating Cabello's "City." The secret lies in creating 20,000 cubes that are given random sizes and positions and merging them together to create a city. Let's hope that this algorithm is never used for actual city-planning, though, because the buildings can randomly intersect each other and there are no logical spaces for streets! Check out Etienne's blog post here and watch the screencast introduction here:
August 9, 2013
by Allen Coin
· 8,768 Views
article thumbnail
Algorithm of the Week: Spatial Indexing with Quadtrees and Hilbert Curves
some time ago at oredev, after the sessions, there was "birds of a feather" - a sort of mini-unconference. anyone could write up a topic on the whiteboard; interested individuals added their names, and each group got allocated a room to chat about the topic. i joined the "spatial indexing" group, and we spent a fascinating hour and a half talking about spatial indexing methods, reminding me of several interesting algorithms and techniques. spatial indexing is increasingly important as more and more data and applications are geospatially-enabled. efficiently querying geospatial data, however, is a considerable challenge: because the data is two-dimensional (or sometimes, more), you can't use standard indexing techniques to query on position. spatial indexes solve this through a variety of techniques. in this post, we'll cover several - quadtrees , geohashes (not to be confused with geohashing ), and space-filling curves - and reveal how they're all interrelated. quadtrees quadtrees are a very straightforward spatial indexing technique. in a quadtree, each node represents a bounding box covering some part of the space being indexed, with the root node covering the entire area. each node is either a leaf node - in which case it contains one or more indexed points, and no children, or it is an internal node, in which case it has exactly four children, one for each quadrant obtained by dividing the area covered in half along both axes - hence the name. a representation of how a quadtree divides an indexed area. source: wikipedia inserting data into a quadtree is simple: starting at the root, determine which quadrant your point occupies. recurse to that node and repeat, until you find a leaf node. then, add your point to that node's list of points. if the list exceeds some pre-determined maximum number of elements, split the node, and move the points into the correct subnodes. a representation of how a quadtree is structured internally. to query a quadtree, starting at the root, examine each child node, and check if it intersects the area being queried for. if it does, recurse into that child node. whenever you encounter a leaf node, examine each entry to see if it intersects with the query area, and return it if it does. note that a quadtree is very regular - it is, in fact, a trie , since the values of the tree nodes do not depend on the data being inserted. a consequence of this is that we can uniquely number our nodes in a straightforward manner: simply number each quadrant in binary (00 for the top left, 10 for the top right, and so forth), and the number for a node is the concatenation of the quadrant numbers for each of its ancestors, starting at the root. using this system, the bottom right node in the sample image would be numbered 11 01. if we define a maximum depth for our tree, then, we can calculate a point's node number without reference to the tree - simply normalize the node's coordinates to an appropriate integer range (for example, 32 bits each), and then interleave the bits from the x and y coordinates -each pair of bits specifies a quadrant in the hypothetical quadtree. geohashes this system might seem familiar: it's a geohash ! at this point, you can actually throw out the quadtree itself - the node number, or geohash, contains all the information we need about its location in the tree. each leaf node in a full-height tree is a complete geohash, and each internal node is represented by the range from its smallest leaf node to its largest one. thus, you can efficiently locate all the points under any internal node by indexing on the geohash by performing a query for everything within the numeric range covered by the desired node. querying once we've thrown away the tree itself becomes a little more complex. instead of refining our search set recursively, we need to construct a search set ahead of time. first, find the smallest prefix (or quadtree node) that completely covers the query area. in the worst case, this may be substantially larger than the actual query area - for example, a small shape in the center of the indexed area that intersects all four quadrants would require selecting the root node for this step. the aim, now, is to construct a set of prefixes that completely covers the query region, while including as little area outside the region as possible. if we had no other constraints, we could simply select the set of leaf nodes that intersect the query area - but that would result in a lot of queries. another constraint, then, is that we want to minimise the number of distinct ranges we have to query for. one approach to doing this is to start by setting a maximum number of ranges we're willing to have. construct a set of ranges, initially populated with the prefix we identified earlier. pick the node in the set that can be subdivided without exceeding the maximum range count and will remove the most unwanted area from the query region. repeat this until there are no ranges in the set that can be further subdivided. finally, examine the resulting set, and join any adjacent ranges, if possible. the diagram below demonstrates how this works for a query on a circular area with a limit of 5 query ranges. how a query for a region is broken into a series of geohash prefixes/ranges. this approach works well, and it allows us to avoid the need to do recursive lookups - the set of range lookups we do execute can all be done in parallel. since each lookup can be expected to require a disk seek, parallelizing our queries allows us to substantially cut down the time required to return the results. still, we can do better. you may notice that all the areas we need to query in the above diagram are adjacent, yet we can only merge two of them (the two in the bottom right of the selected area) into a single range query, requiring us to do 4 separate queries. this is due in part to the order that our geohashing approach 'visits' subregions, working left to right, then top to bottom in each quad. the discontinuity as we go from top right to bottom left quad results in us having to split up some ranges that we could otherwise make contiguous. if we were to visit regions in a different order, perhaps we could minimise or eliminate these discontinuities, resulting in more areas that can be treated as adjacent and fetched with a single query. with an improvement in efficiency like that, we could do fewer queries for the same area covered, or conversely, the same number of queries, but including less extraneous area. illustrates the order in which the geohashing approach 'visits' each quad. hilbert curves suppose instead, we visit regions in a 'u' shape. within each quad, of course, we also visit subquads in the same 'u' shape, but aligned so as to match up with neighbouring quads. if we organise the orientation of these 'u's correctly, we can completely eliminate any discontinuities, and visit the entire area at whatever resolution we choose continuously, fully exploring each region before moving on to the next. not only does this eliminate discontinuities, but it also improves the overall locality. the pattern we get if we do this may look familiar - it's a hilbert curve. hilbert curves are part of a class of one-dimensional fractals known as space-filling curves , so named because they are one dimensional lines that nevertheless fill all available space in a fixed area. they're fairly well known, in part thanks to xkcd's use of them for a map of the internet . as you can see, they're also of use for spatial indexing, since they exhibit exactly the locality and continuity required. for example, if we take another look at the example we used for finding the set of queries required to encompass a circle above, we find that we can reduce the number of queries by one: the small region in the lower left is now contiguous with the region to its right, and whilst the two regions at the bottom are no longer contiguous with each other, the rightmost one is now contiguous with the large area in the upper right. illustrates the order in which a hilbert curve 'visits' each quad. one thing that our elegant new system is lacking, so far, is a way of converting between a pair of (x,y) coordinates and the corresponding position in the hilbert curve. with geohashing it was easy and obvious - just interleave the x and y coordinates - but there's no obvious way to modify that for a hilbert curve. searching the internet, you're likely to come across many descriptions of how hilbert curves are drawn, but few if any descriptions of how to find the position of an arbitrary point. to figure this out, we need to take a closer look at how the hilbert cure can be recursively constructed. the first thing to observe is that although most references to hilbert curves focus on how to draw the curve, this is a distraction from the essential property of the curve, and its importance to us: it's an ordering for points on a plane. if we express a hilbert curve in terms of this ordering, drawing the curve itself becomes trivial - simply a matter of connecting the dots. forget about how to connect adjacent sub-curves, and instead focus on how we can recursively enumerate the points. hilbert curves are all about ordering a set of points on a 2d plane at the root level, enumerating the points is simple: pick a direction and a start point, and proceed around the four quadrants, numbering them 0 to 3. the difficulty is introduced when we want to determine the order we visit the sub-quadrants in while maintaining the overall adjacency property. examination reveals that each of the sub-quadrants' curves is a simple transformation of the original curve: there are only four possible transformations. naturally, this applies recursively to sub-sub quadrants, and so forth. the curve we use for a given quadrant is determined by the curve we used for the square it's in, and the quadrant's position. with a little work, we can construct a table that encapsulates this: suppose we want to use this table to determine the position of a point on a third-level hilbert curve. for the sake of this example, assume our point has coordinates (5,2) starting with the first square on the diagram, find the quadrant your point is in - in this case, it's the upper right quadrant. the first part of our hilbert curve position, then, is 3 (11 in binary). next, we consult the square shown in the inset of square 3 - in this case, it's the second square. repeat the process: which sub-quadrant does our point fall into? here, it's the lower left one, meaning the next part of our position is 1, and the square we should consult next is the second one again. repeating the process one final time, we find our point falls in the upper right sub-sub-quadrant, our final coordinate is 3 (11 in binary). stringing them together, we now know the position of our point on the curve is 110111 binary, or 55. let's be a little more methodical, and write methods to convert between x,y coordinates and hilbert curve positions. first, we need to express our diagram above in terms a computer can understand: hilbert_map = { 'a': {(0, 0): (0, 'd'), (0, 1): (1, 'a'), (1, 0): (3, 'b'), (1, 1): (2, 'a')}, 'b': {(0, 0): (2, 'b'), (0, 1): (1, 'b'), (1, 0): (3, 'a'), (1, 1): (0, 'c')}, 'c': {(0, 0): (2, 'c'), (0, 1): (3, 'd'), (1, 0): (1, 'c'), (1, 1): (0, 'b')}, 'd': {(0, 0): (0, 'a'), (0, 1): (3, 'c'), (1, 0): (1, 'd'), (1, 1): (2, 'd')}, } in the snippet above, each element of 'hilbert_map' corresponds to one of the four squares in the diagram above. to make things easier to follow, i've identified each one with a letter - 'a' is the first square, 'b' the second, and so forth. the value for each square is a dict, mapping x and y coordinates for the (sub-)quadrant to the position along the line (the first part of the value tuple) and the square to use next (the second part of the value tuple). here's how we can use this to translate x and y coordinates into a hilbert curve position: def point_to_hilbert(x, y, order=16): current_square = 'a' position = 0 for i in range(order - 1, -1, -1): position <<= 2 quad_x = 1 if x & (1 << i) else 0 quad_y = 1 if y & (1 << i) else 0 quad_position, current_square = hilbert_map[current_square][(quad_x, quad_y)] position |= quad_position return position the input to this function is the integer x and y coordinates, and the order of the curve. an order 1 curve fills a 2x2 grid, an order 2 curve fills a 4x4 grid, and so forth. our x and y coordinates, then, should be normalized to a range of 0 to 2order-1. the function works by stepping over each bit of the x and y coordinates, starting with the most significant. for each, it determines which (sub-)quadrant the coordinate lies in, by testing the corresponding bit, then fetches the position along the line and the next square to use from the table we defined earlier. the curve position is set as the least significant 2 bits on the position variable, and at the beginning of the next loop, it's left-shifted to make room for the next set of coordinates. let's check that we've written the function correctly by running our example from above through it: >>> point_to_hilbert(5,2,3)55 presto! for a further test, we can use the function to generate a complete list of ordered points for a hilbert curve, then use a spreadsheet to graph them and see if we get a hilbert curve. enter the following expression into an interactive python interpreter: >>> points =[(x, y)for x in range(8)for y in range(8)]>>> sorted_points = sorted(points, key=lambda k: point_to_hilbert(k[0], k[1],3))>>>print'\n'.join('%s,%s'% x for x in sorted_points) take the resulting text, paste it into a file called 'hilbert.csv', open it in your favorite spreadsheet, and instruct it to generate a scatter plot. the result is, of course, a nicely plotted hilbert curve! the inverse of point_to_hilbert is a straightforward reversal of the hilbert_map; implementing it is left as an exercise for the reader. conclusion there you have it - spatial indexing, from quadtrees to geohashes to hilbert curves. one final observation: if you express the ordered sequence of x,y coordinates required to draw a hilbert curve in binary, do you notice anything interesting about the ordering? does it remind you of anything? just to wrap up, a caveat: all of the indexing methods i've described today are only well-suited to indexing points. if you want to index lines, polylines, or polygons, you're probably out of luck with these methods - and so far as i'm aware, the only known algorithm for effectively indexing shapes is the r-tree , an entirely different and more complex beast.
July 23, 2013
by Nick Johnson
· 43,693 Views
article thumbnail
Algorithm of the Week: Homomorphic Hashing
In a previous Damn Cool Algorithms post, we learned about Fountain Codes, a clever probabilistic algorithm that allows you break a large file up into a virtually infinite number of small chunks, such that you can collect any subset of those chunks - as long as you collect a few more than the volume of the original file - and be able to reconstruct the original file. This is a very cool construction, but as we observed last time, it has one major flaw when it comes to use in situations with untrusted users, such as peer to peer networks: there doesn't seem to be a practical way to verify if a peer is sending you valid blocks until you decode the file, which happens very near the end - far too late to detect and punish abuse. It's here that Homomorphic Hashes come to our rescue. A homomorphic hash is a construction that's simple in principle: a hash function such that you can compute the hash of a composite block from the hashes of the individual blocks. With a construction like this, we could distribute a list of individual hashes to users, and they could use those to verify incoming blocks as they arrive, solving our problem. Homomorphic Hashing is described in the paper On-the-fly verification of rateless erasure codes for efficient content distribution by Krohn et al. It's a clever construction, but rather difficult to understand at first, so in this article, we'll start with a strawman construction of a possible homomorphic hash, then improve upon it until it resembles the one in the paper - at which point you will hopefully have a better idea as to how it works. We'll also discuss the shortcomings and issues of the final hash, as well as how the authors propose to resolve them. Before we continue, a small disclaimer is needed: I'm a computer scientist, not a mathematician, and my discrete math knowledge is far rustier than I'd like. This paper stretches the boundaries of my understanding, and describing the full theoretical underpinnings of it is something I'm likely to make a hash of. So my goal here is to provide a basic explanation of the principles, sufficient for an intuition of how the construction works, and leave the rest for further exploration by the interested reader. A homomorphic hash that isn't We can construct a very simple candidate for a homomorphic hash by using one very simple mathematical identity: the observation that gx0 * gx1 = gx0 + x1. So, for instance, 23 * 22 = 25. We can make use of this by the following procedure: Pick a random number g For each element x in our message, take gx. This is the hash of the given element. Using the identity above, we can see that if we sum several message blocks together, we can compute their hash by multiplying the hashes of the individual blocks, and get the same result as if we 'hash' the sum. Unfortunately, this construction has a couple of obvious issues: Our 'hash' really isn't - the hashes are way longer than the message elements themselves! Any attacker can compute the original message block by taking the logarithm of the hash for that block. If we had a real hash with collisions, a similar procedure would let them generate a collision easily. A better hash with modular arithmetic Fortunately, there's a way we can fix both problems in one shot: by using modular arithmetic. Modular arithmetic keeps our numbers bounded, which solves our first problem, while also making our attacker's life more difficult: finding a preimage for one of our hashes now requires solving the discrete log problem, a major unsolved problem in mathematics, and the foundation for several cryptosystems. Here, unfortunately, is where the theory starts to get a little more complicated - and I start to get a little more vague. Bear with me. First, we need to pick a modulus for adding blocks together - we'll call it q. For the purposes of this example, let's say we want to add numbers between 0 and 255, so let's pick the smallest prime greater than 255 - which is 257. We'll also need another modulus under which to perform exponentiation and multiplication. We'll call this p. For reasons relating to Fermat's Little Theorem, this also needs to be a prime, and further, needs to be chosen such that p - 1 is a multiple of q (written q | (p - 1), or equivalently, p % q == 1). For the purposes of this example, we'll choose 1543, which is 257 * 6 + 1. Using a finite field also puts some constraints on the number, g, that we use for the base of the exponent. Briefly, it has to be 'of order q', meaning that gq mod p must equal 1. For our example, we'll use 47, since47257 % 1543 == 1. So now we can reformulate our hash to work like this: To hash a message block, we compute gb mod p - in our example, 47b mod 1543 - where b is the message block. To combine hashes, we simply multiply them mod p, and to combine message blocks, we add them mod q. Let's try it out. Suppose our message is the sequence [72, 101, 108, 108, 111] - that's "Hello" in ASCII. We can compute the hash of the first number as 4772 mod 1543, which is 883. Following the same procedure for the other elements gives us our list of hashes: [883, 958, 81, 81, 313]. We can now see how the properties of the hash play out. The sum of all the elements of the message is 500, which is 243 mod 257. The hash of 243 is 47243 mod 1543, or 376. And the product of our hashes is883 * 958 * 81 * 81 * 313 mod 1543 - also 376! Feel free to try this for yourself with other messages and other subsets - they'll always match, as you would expect. A practical hash Of course, our improved hash still has a couple of issues: The domain of our input values is small enough that an attacker could simply try them all out to find collisions. And the domain of our output values is small enough the attacker could attempt to find discrete logarithms by brute force, too. Although our hashes are shorter than they were without modular arithmetic, they're still longer than the input. The first of these is fairly straightforward to resolve: we can simply pick larger primes for p and q. If we choose ones that are sufficiently large, both enumerating all inputs and brute force logarithm finding will become impractical. The second problem is a little trickier, but not hugely so; we just have to reorganize our message a bit. Instead of breaking the message down into elements between 0 and q, and treating each of those as a block, we can break the message into arrays of elements between 0 and q. For instance, suppose we have a message that is 1024 bytes long. Instead of breaking it down into 1024 blocks of 1 byte each, let's break it down into, say, 64 blocks of 16 bytes. We then modify our hashing scheme a little bit to accommodate this: Instead of picking a single random number as the base of our exponent, g, we pick 16 of them, g0 - g16. To hash a block, we take each number gi and raise it to the power of the corresponding sub-block. The resulting output is the same length as when we were hashing only a single block per hash, but we're taking 16 elements as input instead of a single one. When adding blocks together, we add all the corresponding sub-blocks individually. All the properties we had earlier still hold. Better, we've given ourselves another tuneable parameter: the number of sub blocks per block. This will be invaluable in getting the right tradeoff between security, granularity of blocks, and protocol overhead. Practical applications What we've arrived at now is pretty much the construction described in the paper, and hopefully you can see how it would be applied to a system utilizing fountain codes. Simply pick two primes of about the right size - the paper recommends 257 bits for q and 1024 bits for p - figure out how big you want each block to be - and hence how many sub-blocks per block - and figure out a way for everyone to agree on the random numbers for g - such as by using a random number generator with a well defined seed value. The construction we have now, although useful, is still not perfect, and has a couple more issues we should address. First of these is one you may have noticed yourself already: our input values pack neatly into bytes - integers between 0 and 255 in our example - but after summing them in a finite field, the domain has grown, and we can no longer pack them back into the same number of bits. There are two solutions to this: the tidy one and the ugly one. The tidy one is what you'd expect: Since each value has grown by one bit, chop off the leading bit and transmit it along with the rest of the block. This allows you to transmit your block reasonably sanely and with minimal expansion in size, but is a bit messy to implement and seems - at least to me - inelegant. The ugly solution is this: Pick the smallest prime number larger than your chosen power of 2 for q, and simply ignore or discard overflows. At first glance this seems like a terrible solution, but consider: the smallest prime larger than 2256 is 2256 + 297. The chance that a random number in that range is larger than 2256 is approximately 1 in 3.9 * 1074, or approximately one in 2247. This is way smaller than the probability of, say, two randomly generated texts having the same SHA-1 hash. Thus, I think there's a reasonable argument for picking a prime using that method, then simply ignoring the possibility of overflows. Or, if you want to be paranoid, you can check for them, and throw out any encoded blocks that cause overflows - there won't be many of them, to say the least. Performance and how to improve it Another thing you may be wondering about this scheme is just how well it performs. Unfortunately, the short answer is "not well". Using the example parameters in the paper, for each sub-block we're raising a 1024 bit number to the power of a 257 bit number; even on modern hardware this is not fast. We're doing this for every 256 bits of the file, so to hash an entire 1 gigabyte file, for instance, we have to compute over 33 million exponentiations. This is an algorithm that promises to really put the assumption that it's always worth spending CPU to save bandwidth to the test. The paper offers two solutions to this problem; one for the content creator and one for the distributors. For the content creator, the authors demonstrate that there is a way to generate the random constants g, used as the bases of the exponents using a secret value. With this secret value, the content creator can generate the hashes for their files much more quickly than without it. However, anyone with the secret value can also trivially generate hash collisions, so in such a scheme, the publisher must be careful not to disclose the value to anyone, and only distribute the computed constants gi. Further, the set of constants themselves aren't small - with the example parameters, a full set of constants weighs in at about the size of 4 data blocks. Thus, you need a good way to distribute the per-publisher constants in addition to the data itself. Anyone interested in this scheme should consult section C of the paper, titled "Per-Publisher Homomorphic Hashing". For distributors, the authors offer a probabilistic check that works on batches of blocks, described in section D, "Computational Efficiency Improvements". Another easier to understand variant is this: Instead of verifying blocks individually as they arrive, accumulate blocks in a batch. When you have enough blocks, sum them all together, and calculate an expected hash by taking the product of the expected hashes of the individual blocks. Compute the composite block's hash. If it verifies, all the individual blocks are valid! If it doesn't, divide and conquer: split your batch in half and check each, winnowing out valid blocks until you're left with any invalid ones. The nice thing about either of these procedures is that they allow you to trade off verification work with your vulnerability window. You can even dedicate a certain amount of CPU time to verification, and simply batch up incoming blocks until the current computation finishes, ensuring you're always verifying the last batch as you receive the next. Conclusion Homomorphic Hashing provides a neat solution to the problem of verifying data from untrusted peers when using a fountain coding system, but it's not without its own drawbacks. It's complicated to implement and computationally expensive to compute, and requires careful tuning of the parameters to minimise the volume of the hash data without compromising security. Used correctly in conjunction with fountain codes, however, Homomorphic Hashing could be used to create an impressively fast and efficient content distribution network. As a side-note, I'm intending to resume more regular blogging with more Damn Cool Algorithms posts. Have an algorithm you think is Damn Cool and would like to hear more about? Post it in the comments!
July 9, 2013
by Nick Johnson
· 15,105 Views
article thumbnail
Akka vs Storm
I was recently working a bit with Twitter’s Storm, and it got me wondering, how does it compare to another high-performance, concurrent-data-processing framework, Akka. WHAT’S AKKA AND STORM? Let’s start with a short description of both systems. Storm is a distributed, real-time computation system. On a Storm cluster, you execute topologies, which process streams of tuples (data). Each topology is a graph consisting of spouts (which produce tuples) and bolts (which transform tuples). Storm takes care of cluster communication, fail-over and distributing topologies across cluster nodes. Akka is a toolkit for building distributed, concurrent, fault-tolerant applications. In an Akka application, the basic construct is an actor; actors process messages asynchronously, and each actor instance is guaranteed to be run using at most one thread at a time, making concurrency much easier. Actors can also be deployed remotely. There’s a clustering module coming, which will handle automatic fail-over and distribution of actors across cluster nodes. Both systems scale very well and can handle large amounts of data. But when to use one, and when to use the other? There’s another good blog post on the subject, but I wanted to take the comparison a bit further: let’s see how elementary constructs in Storm compare to elementary constructs in Akka. COMPARING THE BASICS Firstly, the basic unit of data in Storm is a tuple. A tuple can have any number of elements, and each tuple element can be any object, as long as there’s a serializer for it. In Akka, the basic unit is amessage, which can be any object, but it should be serializable as well (for sending it to remote actors). So here the concepts are almost equivalent. Let’s take a look at the basic unit of computation. In Storm, we have components: bolts andsprouts. A bolt can be any piece of code, which does arbitrary processing on the incoming tuples. It can also store some mutable data, e.g. to accumulate results. Moreover, bolts run in a single thread, so unless you start additional threads in your bolts, you don’t have to worry about concurrent access to the bolt’s data. This is very similar to an actor, isn’t it? Hence a Storm bolt/sprout corresponds to an Akka actor. How do these two compare in detail? Actors can receive arbitrary messages; bolts can receive arbitrary tuples. Both are expected to do some processing basing on the data received. Both have internal state, which is private and protected from concurrent thread access. ACTORS & BOLTS: DIFFERENCES One crucial difference is how actors and bolts communicate. An actor can send a message to any other actor, as long as it has the ActorRef (and if not, an actor can be looked up by-name). It can also send back a reply to the sender of the message that is being handled. Storm, on the other hand is one-way. You cannot send back messages; you also can’t send messages to arbitrary bolts. You can also send a tuple to a named channel (stream), which will cause the tuple (message) to be broadcast to all listeners, defined in the topology. (Bolts also ack messages, which is also a form of communication, to the ackers.) In Storm, multiple copies of a bolt’s/sprout’s code can be run in parallel (depending on theparallelism setting). So this corresponds to a set of (potentially remote) actors, with a load-balancer actor in front of them; a concept well-known from Akka’s routing. There are a couple of choices on how tuples are routed to bolt instances in Storm (random, consistent hashing on a field), and this roughly corresponds to the various router options in Akka (round robin, consistent hashing on the message). There’s also a difference in the “weight” of a bolt and an actor. In Akka, it is normal to have lots of actors (up to millions). In Storm, the expected number of bolts is significantly smaller; this isn’t in any case a downside of Storm, but rather a design decision. Also, Akka actors typically share threads, while each bolt instance tends to have a dedicated thread. OTHER FEATURES Storm also has one crucial feature which isn’t implemented in Akka out-of-the-box: guaranteed message delivery. Storm tracks the whole tree of tuples that originate from any tuple produced by a sprout. If all tuples aren’t acknowledged, the tuple will be replayed. Also the cluster management of Storm is more advanced (automatic fail-over, automatic balancing of workers across the cluster; based on Zookeeper); however the upcoming Akka clustering module should address that. Finally, the layout of the communication in Storm – the topology – is static and defined upfront. In Akka, the communication patterns can change over time and can be totally dynamic; actors can send messages to any other actors, or can even send addresses (ActorRefs). So overall, Storm implements a specific range of usages very well, while Akka is more of a general-purpose toolkit. It would be possible to build a Storm-like system on top of Akka, but not the other way round (at least it would be very hard).
June 26, 2013
by Adam Warski
· 21,214 Views
article thumbnail
Using SSH.NET
I’ve recently had the need to automate configuration of Nginx on an Ubuntu server. Of course, in UNIX land we like to use SSH (Secure Shell) to log into our servers and manage them remotely. Wouldn’t it be nice, I thought, if there was a managed SSH library somewhere so that I could automate logging onto my Ubuntu server, run various commands and transfer files. A short Google turned up SSH.NET by the somewhat mysterious Olegkap (at least I couldn’t find out anything else about them) which turned out to be just what I wanted. Here’s the blurb on the CodePlex site: “This project was inspired by Sharp.SSH library which was ported from java and it seems like was not supported for quite some time. This library is complete rewrite using .NET 4.0, without any third party dependencies and to utilize the parallelism as much as possible to allow best performance I can get.” It does exactly what it says on the tin. It’s on NuGet, so you can grab it with: PM> Install-Package SSH.NET Here’s how you run a remote command. First you need to build a ConnectionInfo object: public ConnectionInfo CreateConnectionInfo() { const string privateKeyFilePath = @"C:\some\private\key.pem"; ConnectionInfo connectionInfo; using (var stream = new FileStream(privateKeyFilePath, FileMode.Open, FileAccess.Read)) { var privateKeyFile = new PrivateKeyFile(stream); AuthenticationMethod authenticationMethod = new PrivateKeyAuthenticationMethod("ubuntu", privateKeyFile); connectionInfo = new ConnectionInfo( "my.server.com", "ubuntu", authenticationMethod); } return connectionInfo; } Then you simply create an SshClient instance and run commands: public void Connect() { using (var ssh = new SshClient(CreateConnectionInfo())) { ssh.Connect(); var command = ssh.CreateCommand("uptime"); var result = command.Execute(); Console.Out.WriteLine(result); ssh.Disconnect(); } } Here I’m running the ‘uptime’ command which output this when I ran it just now: 14:37:46 up 22 days, 3:59, 0 users, load average: 0.08, 0.03, 0.05 To transfer a file, just use the ScpClient: public void GetConfigurationFiles() { using (var scp = new ScpClient(CreateNginxServerConnectionInfo())) { scp.Connect(); scp.Download("/etc/nginx/", new DirectoryInfo(@"D:\Temp\ScpDownloadTest")); scp.Disconnect(); } } Which grabs all my Nginx configuration and transfers it to a directory tree on my windows machine. All in all a very nice little library that’s been working well for me so far. Give it a try if you need to interact with a UNIX-like machine from .NET code.
June 9, 2013
by Mike Hadlow
· 30,955 Views
article thumbnail
Stepping Backwards while Debugging: Move To Line
it happens to me many times: i’m stepping with the debugger through my code, and ups! i made one step too far! debugging, and made one step over too far what now? restart the whole debugging session? actually, there is a way to go ‘backwards’ gdb has a ‘reverse debugging’ feature, described here . i’m using the eclipse based codewarrior debugger, and this debug engine is not using gdb. the codewarrior debugger in mcu10.3 supports an eclipse feature: i select a code line in the editor view and use move to line : move to line what it does: it changes the current pc (program counter) of the program to that line: performed move to line now i can continue debugging from that line, e.g. stepping into that function call. yes, this is not true backward debugging. but it is simple and very effective. to perform true backward stepping, the debugger would need to reverse all operations, typically with a rather heavy state machine and data recording. but for the usual case where i simply need to go back a few lines, the ‘move to line’ is perfect. of course there are a few points to consider: this only changes the program counter. any variable changes/etc are not affected or reverted. in case of highly optimized code, there might be multiple sequence points per source line. so doing this for highly optimized code might not work correctly. it works ok within a function. it is not recommended to use it e.g. to set the pc outside of a function. because the context/stack frame is not set up. i use the ‘move to line’ frequently to ‘advance’ the program execution. e.g. to bypass some long sequences i’m not interested in, or to get out of an ‘endless’ loop. the same ‘move to line’ as available while doing assembly stepping too. see this post for details. happy line moving
April 15, 2013
by Erich Styger
· 9,862 Views
article thumbnail
HashSet vs. TreeSet vs. LinkedHashSet
in a set, there are no duplicate elements. that is one of the major reasons to use a set. there are 3 commonly used implementations of set in java: hashset, treeset and linkedhashset. when and which to use is an important question. in brief, if we want a fast set, we should use hashset; if we need a sorted set, then treeset should be used; if we want a set that can be read by following its insertion order, linkedhashset should be used. 1. set interface set interface extends collection interface. in a set, no duplicates are allowed. every element in a set must be unique. we can simply add elements to a set, and finally we will get a set of elements with duplicates removed automatically. 2. hashset vs. treeset vs. linkedhashset hashset is implemented using a hash table. elements are not ordered. the add, remove, and contains methods has constant time complexity o(1). treeset is implemented using a tree structure(red-black tree in algorithm book). the elements in a set are sorted, but the add, remove, and contains methods has time complexity of o(log (n)). it offers several methods to deal with the ordered set like first(), last(), headset(), tailset(), etc. linkedhashset is between hashset and treeset. it is implemented as a hash table with a linked list running through it, so it provides the order of insertion. the time complexity of basic methods is o(1). 3. treeset example treeset tree = new treeset(); tree.add(12); tree.add(63); tree.add(34); tree.add(45); iterator iterator = tree.iterator(); system.out.print("tree set data: "); while (iterator.hasnext()) { system.out.print(iterator.next() + " "); } output is sorted as follows: tree set data: 12 34 45 63 now let's define a dog class as follows: class dog { int size; public dog(int s) { size = s; } public string tostring() { return size + ""; } } let's add some dogs to treeset like the following: import java.util.iterator; import java.util.treeset; public class testtreeset { public static void main(string[] args) { treeset dset = new treeset(); dset.add(new dog(2)); dset.add(new dog(1)); dset.add(new dog(3)); iterator iterator = dset.iterator(); while (iterator.hasnext()) { system.out.print(iterator.next() + " "); } } } compile ok, but run-time error occurs: exception in thread "main" java.lang.classcastexception: collection.dog cannot be cast to java.lang.comparable at java.util.treemap.put(unknown source) at java.util.treeset.add(unknown source) at collection.testtreeset.main(testtreeset.java:22) because treeset is sorted, the dog object need to implement java.lang.comparable's compareto() method like the following: class dog implements comparable{ int size; public dog(int s) { size = s; } public string tostring() { return size + ""; } @override public int compareto(dog o) { return size - o.size; } } the output is: 1 2 3 4. hashset example hashset dset = new hashset(); dset.add(new dog(2)); dset.add(new dog(1)); dset.add(new dog(3)); dset.add(new dog(5)); dset.add(new dog(4)); iterator iterator = dset.iterator(); while (iterator.hasnext()) { system.out.print(iterator.next() + " "); } output: 5 3 2 1 4 note the order is not certain. 5. linkedhashset example linkedhashset dset = new linkedhashset(); dset.add(new dog(2)); dset.add(new dog(1)); dset.add(new dog(3)); dset.add(new dog(5)); dset.add(new dog(4)); iterator iterator = dset.iterator(); while (iterator.hasnext()) { system.out.print(iterator.next() + " "); } the order of the output is certain and it is the insertion order. 2 1 3 5 4 6. performance testing the following method tests the performance of the three class on add() method. public static void main(string[] args) { random r = new random(); hashset hashset = new hashset(); treeset treeset = new treeset(); linkedhashset linkedset = new linkedhashset(); // start time long starttime = system.nanotime(); for (int i = 0; i < 1000; i++) { int x = r.nextint(1000 - 10) + 10; hashset.add(new dog(x)); } // end time long endtime = system.nanotime(); long duration = endtime - starttime; system.out.println("hashset: " + duration); // start time starttime = system.nanotime(); for (int i = 0; i < 1000; i++) { int x = r.nextint(1000 - 10) + 10; treeset.add(new dog(x)); } // end time endtime = system.nanotime(); duration = endtime - starttime; system.out.println("treeset: " + duration); // start time starttime = system.nanotime(); for (int i = 0; i < 1000; i++) { int x = r.nextint(1000 - 10) + 10; linkedset.add(new dog(x)); } // end time endtime = system.nanotime(); duration = endtime - starttime; system.out.println("linkedhashset: " + duration); } from the output below, we can clearly wee that hashset is the fastest one. hashset: 2244768 treeset: 3549314 linkedhashset: 2263320 if you enjoyed this article and want to learn more about java collections, check out this collection of tutorials and articles on all things java collections.
March 29, 2013
by Ryan Wang
· 181,564 Views · 3 Likes
article thumbnail
Algorithm of the Week: Aho-Corasick String Matching Algorithm in Haskell
let’s say you have a large piece of text and a dictionary of keywords. how do you quickly locate all the keywords? aho-corasick algorithm diagram well, there are many ways really, you could even iterate through the whole thing and compare words to keywords. but it turns out that’s going to be very slow. at least o(n_keywords * n_words) complexity. essentially you’re making as many passes over the text as your dictionary is big. in 1975 a couple of ibm researchers – alfred aho and margaret corasick – discovered an algorithm that can do this in a single pass. the aho-corasick string matching algorithm . i implemented it in haskell and it takes 0.005s to find 8 different keywords in oscar wilde’s the nightingale and the rose – a 12kb text. a quick naive keyword search implemented in python takes 0.023s . not a big difference practically speaking, but imagine a situation with megabytes of text and thousands of words in the dictionary. the authors mention printing out the result as a major bottleneck in their assessment of the algorithm. yep, printing . the aho-corasick algorithm at the core of this algorithm are three functions: the three functions of aho-corasick algorithm a parser based on a state machine, which maps (state, char) pairs to states and occasionally emits an output. this is called the goto function a failure function, which tells the goto function which state to jump into when the character it just read doesn’t match anything an output function, which maps states to outputs – potentially more than one per state the algorithm works in two stages. it will first construct the goto, failure and output functions. the complexity of this operation hinges solely on the size of our dictionary. then it iterates over the input text to produce all the matches. using state machines for parsing text is a well known trick – the real genius of this algorithm rests in that failure function if you ask me. it makes lateral transitions between states when the algorithm climbs itself into a wall. say you have she and hers in the dictionary. the goto machine eats your input string one character at the time. let’s say it’s already read s h . the next input is an e so it outputs she and reaches a final state. next it reads an r , but the state didn’t expect any more inputs, so the failure function puts us on the path towards hers . this is a bit tricky to explain in text, i suggest you look at the picture from the original article and look at what’s happening. my haskell implementation the first implementation i tried, relied on manully mapping inputs to outputs for the goto, failure and output functions by using pattern recognition. not very pretty, extremely hardcoded, but it worked and was easy to make. building the functions dynamically proved a bit trickier. type goto = map (int, char) int type failure = map int int type output = map int [string] first off, we build the goto function. -- builds the goto function build_goto::goto -> string -> (goto, string) build_goto m s = (add_one 0 m s, s) -- adds one string to goto function add_one::int -> goto -> [char] -> goto add_one _ m [] = m add_one state m (c:rest) | member key m = add_one (frommaybe 0 $ map.lookup key m) m rest | otherwise = add_one max (map.insert key max m) rest where key = (state, c) max = (size m)+1 essentially this builds a flattened prefix tree in a hashmap of (state, char) pairs mapping to the next state. it makes sure to avoid adding new edges to the three as much as possible. the reason it’s not simply a prefix tree are those lateral transitions; doing them in a tree would require backtracking and repeating of steps, so we haven’t achieved anything. once we have the goto function, building the output is trivial. -- builds the output function build_output::(?m::goto) => [string] -> output build_output [] = empty build_output (s:rest) = map.insert (fin 0 s) (list.filter (\x -> elem x dictionary) $ list.tails s) $ build_output rest -- returns the state in which an input string ends without using failures fin::(?m::goto) => int -> [char] -> int fin state [] = state fin state (c:rest) = fin next rest where next = frommaybe 0 $ map.lookup (state, c) ?m we are essentially going over the dictionary, finding the final state for each word and building a hash table mapping final states to their outputs. building the failure function was trickiest, because we need a way to iterate over the depths at which nodes are position in the goto state machine. but we threw that info away by using a hashmap. -- tells us which nodes in the goto state machine are at which traversal depth nodes_at_depths::(?m::goto) => [[int]] nodes_at_depths = list.map (\i -> list.filter (>0) $ list.map (\l -> if i < length l then l!!i else -1) paths) [0..(maximum $ list.map length paths)-1] where paths = list.map (path 0) dictionary we now have a list of lists, that tells us at which depth certain nodes are. -- builds the failure function build_fail::(?m::goto) => [[int]] -> int -> failure build_fail nodes 0 = fst $ mapaccuml (\f state -> (map.insert state 0 f, state)) empty (nodes!!0) build_fail nodes d = fst $ mapaccuml (\f state -> (map.insert state (decide_fail state lower) f, state)) lower (nodes!!d) where lower = build_fail nodes (d-1) -- inner step of building the failure function decide_fail::(?m::goto) => int -> failure -> int decide_fail state lower = findwithdefault 0 (s, c) ?m where (s', c) = key' state $ assocs ?m s = findwithdefault 0 s' lower -- gives us the key associated with a certain state (how to get there) key'::int -> [((int, char), int)] -> (int, char) key' _ [] = (-1, '_') -- this is ugly, being of maybe type would be better key' state ((k, v):rest) | state == v = k | otherwise = key' state rest here we are going over the list of nodes at depths and deciding what the failure should be for each depth based on the failures of depth-1. at depth zero, all failures go to the zeroth state. an important part of this process was inverting the goto hashmap so values point to keys, which is essentially what the key’ function does. finally, we can use the whole algorithm like this: main = do let ?m = fst $ mapaccuml build_goto empty dictionary let ?f = build_fail nodes_at_depths $ (length $ nodes_at_depths)-1 ?out = build_output dictionary print $ ahocorasick text a bit more involved than the usual example of haskell found online, it’s still pretty cool you can see the whole code on github here .
March 19, 2013
by Swizec Teller
· 21,935 Views
article thumbnail
In-Memory Data Grids
Introduction The IT buzzword of 2012 is without a doubt Big Data. It’s new and here to stay, and for a good reason. Big data is data that exceeds the processing capacity of conventional database systems. Great examples are CERN with the Large Hadron Collider, whose experiments generate 25 petabytes of data annually, or Walmart, which handles more than one million customer transaction every hour. Problems These vast amounts of data leave us with two problems. Problem 1: To gain value from this data, one must choose an alternative way to process it. The value of big data to an organization falls into two categories: analytical use, and enabling new products. Big data analytics can reveal insights hidden previously by data too costly to process, such as peer influence among customers, revealed by analyzing shoppers’ transactions, social and geographical data. Being able to process every item of data in reasonable time removes the troublesome need for sampling and promotes an investigative approach to data, in contrast to the somewhat static nature of running predetermined reports. Problem 2: The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. Remember the CERN case where the LHC produces over 25 Petabytes of data annually? No “classic” database architecture or setup is capable of holding these amounts of data. Solutions Fortunately, both problems can be solved by implementing the correct infrastructure and rethinking data storage. There are two critical factors in Big Data environments: size and speed. We already discussed the vast amounts of data and desire to be able to access and process the data fast. The latter is the main differentiator from more traditional data warehouses. Just imagine what you can do when you can access all your data real-time. Enter big data. A common Big Data implementation is an in-memory data grid that lives in a distributed cluster, ensuring both speed, by storing data in-memory, and capacity by using scalability features provided by a cluster. As a bonus, availability is ensured by using a distributed cluster. As for the data storage, there are typically two kinds: in-memory databases and in-memory data grids. But first some background. It is not a new attempt to use main memory as a storage area instead of a disk. In our daily lives there are numerous examples of main memory databases (MMDB), as they perform much faster than disk-based databases. An every day example is a mobile phone. When you SMS or call someone most mobile service providers use MMDB to get the information on your contact as soon as possible. The same applies to your phone. When someone calls you, the caller details are looked up in the contacts application, usually providing a name and sometimes a picture. In memory data grids In Memory Data Grid (IMDG) is the same as MMDB in that it stores data in main memory, but it has a totally different architecture. The features of IMDG can be summarized as follows: Data is distributed and stored on multiple servers. Each server operates in the active mode. A data model is usually object-oriented (serialized) and non-relational. According to the necessity, you often need to add or reduce servers. No traditional database features such as tables. In other words, IMDG is designed to store data in main memory, ensure scalability and store an object itself. These days, there are many IMDG products, both commercial and open source. Some of the most commonly used products are: Hazelcast (http://www.hazelcast.com) JBoss Infinispan (http://www.jboss.org/infinispan) GridGain DataGrid (http://www.gridgain.com/features/in-memory-data-grid/) VMware Gemfire (http://www.vmware.com/nl/products/application-platform/vfabric-gemfire/overview.html) Oracle Coherence (http://www.oracle.com/technetwork/middleware/coherence/overview/index.html) Gigaspaces XAP (http://www.gigaspaces.com/datagrid) Terracotta Enterprise Suite (http://terracotta.org/products/enterprise-suite) Why Memory? The main reasons for using main memory for data storage are once again the two main themes of Big Data: speed and capacity. The processing performance of main memory is 800 times faster than an HDD and up to 40 times faster than an. Moreover, the latest x86 server supports main memory of hundreds of GB per server. It is said that the limit of a traditional processing database’s (OLTP) data capacity is approximately 1 TB and that the OLTP processing data capacity would not increment well. If servers using main memory of 1 TB or larger become more commonly used, you will be able to conduct operations with the entire data placed in main memory, at least in the field of OLTP. IMDG Architecture To use main memory as a storage area, two weak points should be overcome: Limited capacity: involves data that exceeds the maximum capacity of the main memory of the server Reliability: involves data loss in case of a (system) failure. IMDG overcomes the limit of capacity by ensuring horizontal scalability using a distributed architecture, and resolves the issue of reliability through a replication system as part of the grid (or a distributed cluster). Now let’s discuss how an IMDG actually works. First of all, it is important to understand that an IMDG is not the same as an in-memory database, also referred to as MMDB (main memory databases). Typical examples of MMDBs are Oracle TimesTen or Sap Hana. MMDBs are full database products that simply reside in memory. As a result of being a full-blown database, they also carry the weight and overhead of database management features. IMDG is different. No tables, indexes, triggers, stored procedures, process managers etc. Just plain storage. The data model used in IMDG is key-value pairs. A key-value pair is a list with only two parts: a key and a value. The key can be used for storing and retrieving the values in the list. A key can be compared to the index or primary key of a table in a database. Note that IMDG are closely tied to development environments such as Java as the key-value pairs are represented by the structures provided by such a programming environment. Most IMDGs are written in Java, and can only be used within other Java applications. Therefore, the values of key-value pairs can be anything supported by Java, ranging from simple data types such as a string or number, to complex objects. This overcomes the two important hurdles: as you can store complex Java objects as value, there’s no need to translate these objects into a relational datamodel (which is the case in more traditional applications using a database for storage). Furthermore, the seeming limitation of being able to store only one value per key, is actually no limitation at all. Large memory sizes Most of the products introduced above use Java as an implementation language. Java reserves and uses a part of the RAM (internal memory) for dynamic memory allocation. This reserved memory space is called the Java heap. All runtime objects created by a Java application are stored in heap. Using large amounts of data causes two problems. Size limitation: By default, the heap size is 128 MB, but for current business applications, this limit is reached easily. Once the heap is “full”, no new objects can be created and the Java application will show some nasty errors. Performance: It is possible to increase the size of the heap, but this introduces some new problems. When a heap reaches a size of more than 4 gigabytes, Java will have serious issues with memory managements, causing your application to slow down or even freeze. Java has a feature called Garbage Collector, which periodically scans the heap and checks each object if it is still valid and being used. If not, the garbage collector removes the object and defragments the newly available space. The problem is, the larger the heap size, the more work to do for the garbage collector, resulting in performance degradation. Imagine a large bank has a Java application that manages customers, accounts and transactions. We have seen that an IMDG allows the application to store and access all data very quickly by caching it in memory, instead of storing the data in relatively slow databases. Let’s assume the combined data has a size of 40 gigabytes. Storing it in heap is simply not possible, considering the performance penalties of Java’s memory management capabilities. The graph below illustrates the garbage collection pause time when placing cached data in heap: Terracotta’s BigMemory product has a method to overcome these limitations. The method is to use an off-heap memory (direct buffer). Data will not be stored in Java’s heap, but directly in the available internal memory (RAM). Since, this is not subject to Java’s garbage collector, there are no performance penalties. The differences on performance are significant, as can be seen in the graph below: Using off-heap storage has some major benefits: You can use all the available memory on your machine, not just the memory that is allocated to the heap (usually less that 512 Mb). This allows you to store more data in a in-memory data grid, greatly speeding up your application. The heap can be relieved by storing data in native memory, speeding up Java applications as less heap space has to be garbage collected. Clustering, fail over and high availability So far, we have seen IMDG features that are applicable to a single server. However, the real power of IMDG lies in it’s networking and clustering capabilities, providing features as data replication, data synchronization between clients, fail over and high availability. To achieve this, a cluster of servers (or server array) acts a backbone of the infrastructure. Applications (that still can have their own IMDG or off-heap cache) that are connected to the cluster can share, replicate and backup their data with either the cluster or other applications. The graph below depicts a typical setup using Terracotta's BigMemory: The caches on the application servers are usually referred to as “level 1” cache, while the data cache on the server array is referred to as “level 2” cache. There are many different scenarios possible for storing, clustering, synchronizing and replicating data. Covering all these topics goes far beyond the scope of this article. For more information, consult the technical documentation of the product of your choice. Conclusion Big Data brings us some new challenges. First of all, storing and accessing vast amounts of data makes us rethink traditional methods and technologies. Next, there’s the question what to do with all the available data. The potential value for marketing, financial and other businesses is huge. In order to facilitate Big Data, in-memory data grids are considered the best option. IMDGs with off-heap storage are even more powerful, allowing data centric enterprise application to overcome certain limits of the Java platform, such as memory and performance constraints. As the amount of data that (large) companies produce and store, grows exponentially, databases will hit a limit. Accessing your data without a performance penalty simply will not be possible. The answer to this is using an IMDG.
March 13, 2013
by Roy Prins
· 32,622 Views · 5 Likes
article thumbnail
Text Processing, Part 2: Oh, Inverted Index
This is the second part of my text processing series. In this blog, we'll look into how text documents can be stored in a form that can be easily retrieved by a query. I'll used the popular open source Apache Lucene index for illustration. There are two main processing flow in the system ... Document indexing: Given a document, add it into the index Document retrieval: Given a query, retrieve the most relevant documents from the index. The following diagram illustrate how this is done in Lucene. Index Structure Both documents and query is represented as a bag of words. In Apache Lucene, "Document" is the basic unit for storage and retrieval. A "Document" contains multiple "Fields" (also call zones). Each "Field" contains multiple "Terms" (equivalent to words). To control how the document will be indexed across its containing fields, a Field can be declared in multiple ways to specified whether it should be analyzed (a pre-processing step during index), indexed (participate in the index) or stored (in case it needs to be returned in query result). Keyword (Not analyzed, Indexed, Stored) Unindexed (Not analyzed, Not indexed, Stored) Unstored (Analyzed, Indexed, Not stored) Text (Analyzed, Indexed, Stored) The inverted index is a core data structure of the storage. It is organized as an inverted manner from terms to the list of documents (which contain the term). The list (known as posting list) is ordered by a global ordering (typically by document id). To enable faster retrieval, the list is not just a single list but a hierarchy of skip lists. For simplicity, we ignore the skip list in subsequent discussion. This data structure is illustration below based on Lucene's implementation. It is stored on disk as segment files which will be brought to memory during the processing. The above diagram only shows the inverted index. The whole index contain an additional forward index as follows. Document indexing Document in its raw form is extracted from a data adaptor. (this can be making an Web API to retrieve some text output, or crawl a web page, or receiving an HTTP document upload). This can be done in a batch or online manner. When the index processing start, it parses each raw document and analyze its text content. The typical steps includes ... Tokenize the document (breakdown into words) Lowercase each word (to make it non-case-sensitive, but need to be careful with names or abbreviations) Remove stop words (take out high frequency words like "the", "a", but need to careful with phrases) Stemming (normalize different form of the same word, e.g. reduce "run", "running", "ran" into "run") Synonym handling. This can be done in two ways. Either expand the term to include its synonyms (ie: if the term is "huge", add "gigantic" and "big"), or reduce the term to a normalized synonym (ie: if the term is "gigantic" or "huge", change it to "big") At this point, the document is composed with multiple terms. doc = [term1, term2 ...]. Optionally, terms can be further combined into n-grams. After that we count the term frequency of this document. For example, in a bi-gram expansion, the document will become ... doc1 -> {term1: 5, term2: 8, term3: 4, term1_2: 3, term2_3:1} We may also compute a "static score" based on some measure of quality of the document. After that, we insert the document into the posting list (if it exist, otherwise create a new posting list) for each terms (all n-grams), this will create the inverted list structure as shown in previous diagram. There is a boost factor that can be set to the document or field. The boosting factor effectively multiply the term frequency which effectively affecting the importance of the document or field. Document can be added to the index in one of the following ways; inserted, modified and deleted. Typically the document will first added to the memory buffer, which is organized as an inverted index in RAM. When this is a document insertion, it goes through the normal indexing process (as I described above) to analyze the document and build an inverted list in RAM. When this is a document deletion (the client request only contains the doc id), it fetches the forward index to extract the document content, then goes through the normal indexing process to analyze the document and build the inverted list. But in this case the doc object in the inverted list is labeled as "deleted". When this is a document update (the client request contains the modified document), it is handled as a deletion followed by an insertion, which means the system first fetch the old document from the forward index to build an inverted list with nodes marked "deleted", and then build a new inverted list from the modified document. (e.g. If doc1 = "A B" is update to "A C", then the posting list will be {A:doc1(deleted) -> doc1, B:doc1(deleted), C:doc1}. After collapsing A, the posting list will be {A:doc1, B:doc1(deleted), C:doc1} As more and more document are inserted into the memory buffer, it will become full and will be flushed to a segment file on disk. In the background, when M segments files have been accumulated, Lucene merges them into bigger segment files. Notice that the size of segment files at each level is exponentially increased (M, M^2, M^3). This maintains the number of segment files that need to be search per query to be at the O(logN) complexity where N is the number of documents in the index. Lucene also provide an explicit "optimize" call that merges all the segment files into one. Here lets detail a bit on the merging process, since the posting list is already vertically ordered by terms and horizontally ordered by doc id, merging two segment files S1, S2 is basically as follows Walk the posting list from both S1 and S2 together in sorted term order. For those non-common terms (term that appears in one of S1 or S2 but not both), write out the posting list to a new segment S3. Until we find a common term T, we merge the corresponding posting list from these 2 segments. Since both list are sorted by doc id, we just walk down both posting list to write out the doc object to a new posting list. When both posting lists have the same doc (which is the case when the document is updated or deleted), we pick the latest doc based on time order. Finally, the doc frequency of each posting list (of the corresponding term) will be computed. Document retrieval Consider a document is a vector (each term as the separated dimension and the corresponding value is the tf-idf value) and the query is also a vector. The document retrieval problem can be defined as finding the top-k most similar document that match a query, where similarity is defined as the dot-product or cosine distance between the document vector and the query vector. tf-idf is a normalized frequency. TF (term frequency) represents how many time the term appears in the document (usually a compression function such as square root or logarithm is applied). IDF is the inverse of document frequency which is used to discount the significance if that term appears in many other documents. There are many variants of TF-IDF but generally it reflects the strength of association of the document (or query) with each term. Given a query Q containing terms [t1, t2], here is how we fetch the corresponding documents. A common approach is the "document at a time approach" where we traverse the posting list of t1, t2 concurrently (as opposed to the "term at a time" approach where we traverse the whole posting list of t1 before we start the posting list of t2). The traversal process is described as follows ... For each term t1, t2 in query, we identify all the corresponding posting lists. We walk each posting list concurrently to return a sequence of documents (ordered by doc id). Notice that each return document contains at least one term but can also also contain multiple terms. We compute the dynamic score which is dot product of the query to document vector. Notice that we typically don't concern the TF/IDF of the query (which is short and we don't care the frequency of each term). Therefore we can just compute the sum up all the TF score of the posting list that has a match term after dividing the IDF score (at the head of each posting list). Lucene also support query level boosting where a boost factor can be attached to the query terms. The boost factor will multiply the term frequency correspondingly. We also look up the static score which is purely based on the document (but not the query). The total score is a linear combination of static and dynamic score. Although the score we used in above calculation is based on computing the cosine distance between the query and document, we are not restricted to that. We can plug in any similarity function that make sense to the domain. (e.g. we can use machine learning to train a model to score the similarity between a query and a document). After we compute a total score, we insert the document into a heap data structure where the topK scored document is maintained. Here the whole posting list will be traversed. In case of the posting list is very long, the response time latency will be long. Is there a way that we don't have to traverse the whole list and still be able to find the approximate top K documents ? There are a couple strategies we can consider. Static Score Posting Order: Notice that the posting list is sorted based on a global order, this global ordering provide a monotonic increasing document id during the traversal that is important to support the "document at a time" traversal because it is impossible to visit the same document again. This global ordering, however, can be quite arbitrary and doesn't have to be the document id. So we can pick the order to be based on the static score (e.g. quality indicator of the document) which is global. The idea is that we traverse the posting list in decreasing magnitude of static score, so we are more likely to visit the document with the higher total score (static + dynamic score). Cut frequent terms: We do not traverse the posting list whose term has a low IDF value (ie: the term appears in many documents and therefore the posting list tends to be long). This way we avoid to traverse the long posting list. TopR list: For each posting list, we create an extra posting list which contains the top R documents who has the highest TF (term frequency) in the original list. When we perform the search, we perform our search in this topR list instead of the original posting list. Since we have multiple inverted index (in memory buffer as well as the segment files at different levels), we need to combine the result them. If termX appears in both segmentA and segmentB, then the fresher version will be picked. The fresher version is determine as follows; the segment with a lower level (smaller size) will be considered more fresh. If the two segment files are at the same level, then the one with a higher number is more fresh. On the other hand, the IDF value will be the sum of the corresponding IDF of each posting list in the segment file (the value will be slightly off if the same document has been updated, but such discrepancy is negligible). However, the processing of consolidating multiple segment files incur processing overhead in document retrieval. Lucene provide an explicit "optimize" call to merge all segment files into one single file so there is no need to look at multiple segment files during document retrieval. Distributed Index For large corpus (like the web documents), the index is typically distributed across multiple machines. There are two models of distribution: Term partitioning and Document partitioning. In document partitioning, documents are randomly spread across different partitions where the index is built. In term partitioning, the terms are spread across different partitions. We'll discuss document partitioning as it is more commonly used. Distributed index is provider by other technologies that is built on Lucene, such as ElasticSearch. A typical setting is as follows ... In this setting, machines are organized as columns and rows. Each column represent a partition of documents while each row represent a replica of the whole corpus. During the document indexing, first a row of the machines is randomly selected and will be allocated for building the index. When a new document crawled, a column machine from the selected row is randomly picked to host the document. The document will be sent to this machine where the index is build. The updated index will be later propagated to the other rows of replicas. During the document retrieval, first a row of replica machines is selected. The client query will then be broadcast to every column machine of the selected row. Each machine will perform the search in its local index and return the TopM elements to the query processor which will consolidate the results before sending back to client. Notice that K/P < M < K, where K is the TopK documents the client expects and P is the number of columns of machines. Notice that M is a parameter that need to be tuned. One caveat of this distributed index is that as the posting list is split horizontally across partitions, we lost the global view of the IDF value without which the machine is unable to calculate the TF-IDF score. There are two ways to mitigate that ... Do nothing: here we assume the document are evenly spread across different partitions so the local IDF represents a good ratio of the actual IDF. Extra round trip: In the first round, query is broadcasted to every column which returns its local IDF. The query processor will collected all IDF response and compute the sum of the IDF. In the second round, it broadcast the query along with the IDF sum to each column of machines, which will compute the local score based on the IDF sum.
February 26, 2013
by Ricky Ho
· 9,194 Views
article thumbnail
Better explaining the CAP Theorem
today, i thought a lot about how to examine different databases. choosing a database is often a daunting task. there's a lot of confusion, a 'theorem', and more than all, the immortal proverb 'not one size fits all'. as if it helps. one of the first things that you realize, when examining nosql distributed databases (and how could you not)is that these days databases are like cars: they're all good. old fashioned sql databases can scale in and out, horizontally sharded over several machines to achieve high availability. nosql systems claim to be consistent. what difference then does it make what database would you choose? the availability and consistency that i mentioned comes, of course, from the misunderstood cap theorem , that - so people say - states that you can only choose 2 out of the 3 consistency: every read would get you the most recent write availability: every node (if not failed) always executes queries partition-tolerance: even if the connections between nodes are down, the other two (a & c) promises, are kept. usually its depicted in a nicely equilaterl triangle, as this one from ofirm : there's a nice proof and explanation of it in this 4 minute video here . but if we think about it, and also see some of brewer's (the theorem author) later remarks , we'll see that the 2 out of 3 is really 1 out of 2: it's really just a vs c! and this is simply because: availability is achieved by replicating the data across different machines consistency is achieved by updating several nodes before allowing further reads total partitioning, meaning failure of part of the system is rare. however, we could look at a delay, a latency, of the update between nodes, as a temporary partitioning . it will then cause a temporary decision between a and c: on systems that allow reads before updating all the nodes, we will get high availability on systems that lock all the nodes before allowing reads, we will get consistency that's it! and since this decision is temporary, it exists only for the duration of the delay, some may say that we are really contrasting latency (another word for availability) against consistency. by the way, there's no distributed system that wants to live with "paritioning" - if it does, it's not distributed. that is why putting sql in this triangle may lead to confusion.
February 17, 2013
by Lior Messinger
· 139,273 Views · 18 Likes
article thumbnail
Algorithm of the Week: Shortest Path in a Directed Acyclic Graph
Introduction We saw how to find the shortest path in a graph with positive edges using the Dijkstra’s algorithm. We also know how to find the shortest paths from a given source node to all other nodes even when there are negative edges using the Bellman-Ford algorithm. Now we’ll see that there’s a faster algorithm running in linear time that can find the shortest paths from a given source node to all other reachable vertices in a directed acyclic graph, also known as a DAG. Because the DAG is acyclic we don’t have to worry about negative cycles. As we already know it’s pointless to speak about shortest path in the presence of negative cycles because we can “loop” over these cycles and practically our path will become shorter and shorter. The presence of a negative cycles make our attempt to find the shortest path pointless! Thus we have two problems to overcome with Dijkstra and the Bellman-Ford algorithms. First of all we needed only positive weights and on the second place we didn’t want cycles. Well, we can handle both cases in this algorithm. Overview The first thing we know about DAGs is that they can easily be topologically sorted. Topological sort can be used in many practical cases, but perhaps the mostly used one is when trying to schedule dependent tasks. Topological sort is often used to “sort” dependent tasks! After a topological sort we end with a list of vertices of the DAG and we’re sure that if there’s an edge (u, v), u will precede v in the topologically sorted list. If there’s an edge (u,v) then u must precede v. This results in the more general case from the image. There’s no edge between B and D, but B precedes D! This information is precious and the only thing we need to do is to pass through this sorted list and to calculate distances for a shortest paths just like the algorithm of Dijkstra. OK, so let’s summarize this algorithm: - First we must topologically sort the DAG; - As a second step we set the distance to the source to 0 and infinity to all other vertices; - Then for each vertex from the list we pass through all its neighbors and we check for shortest path; It’s pretty much like the Dijkstra’s algorithm with the main difference that we used a priority queue then, while this time we use the list from the topological sort. Code This time the code is actually a pseudocode. Although all the examples so far was in PHP, perhaps pseudocode is easier to understand and doesn’t bind you in a specific language implementation. Also if you don’t feel comfortable with the given programming language it can be more difficult for you to understand the code than by reading pseudocode. 1. Topologically sort G into L; 2. Set the distance to the source to 0; 3. Set the distances to all other vertices to infinity; 4. For each vertex u in L 5. - Walk through all neighbors v of u; 6. - If dist(v) > dist(u) + w(u, v) 7. - Set dist(v) <- dist(u) + w(u, v); Application It’s clear why and where we must use this algorithm. The only problem is that we must be sure that the graph doesn’t have cycles. However if we’re aware of how the graph is created we may have some additional information if there are cycles or not – then this linear time algorithm can be very applicable.
October 30, 2012
by Stoimen Popov
· 28,875 Views
  • Previous
  • ...
  • 196
  • 197
  • 198
  • 199
  • 200
  • 201
  • 202
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×