DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest AI/ML Topics

article thumbnail
Why Elasticsearch is Suitable for Application Log Analytics
Handling Application Logs Enterprise application development using Web technologies has been around for a long time. In recent years we have seen a sharp increase in the deployment of such applications. This is partly due to the proliferation of ecommerce sites, social media sites, mobile application supporting sites, as well as the desire of enterprises to have their applications available 24x7. In most cases, such applications cater to huge load and are deployed on cloud infrastructure. Monitoring deployed applications is increasingly becoming a crucial task, as deployed applications are bound to fail, irrespective of the robust techniques used during development. Whenever an application fails, the most common resolution method starts by examining the application log. If the application has implemented logging properly, the logs can reveal the cause of application failure. Examination of log files is usually done by viewing the file using tools like vi, less, more, tail or grep. Another method is to download the file to a Windows system and viewing it using an editor like Notepad++. Engineers usually scan the log information to look for clues that point to the reasons for failure. Once the cause of failure is identified, suitable action is taken for restoring the application and/or service. The Key to Application Log Analytics This process, of logging onto a remote system and viewing logs is tedious. Additionally, many of the tools do not provide support to make the task of issue identification any simpler. Even when using tools like grep (if we know the pattern), we still need to view the logs in order to go through other information that has been logged, such as the log information that precedes the failure point. While it has always been possible to develop applications to parse application logs, the recent renewed interest in application log analytics is due to the acceptance of NoSQL-like technologies and the availability of standard tools to parse application logs. Though relational databases (RDBMS) have for many years provided the facility to store structured data, they are not well-suited for handling log data, as in many cases, the structure of the logged information is not the same across the file. This does not fit well in the rigidly defined world of an RDBMS. In comparison, NoSQL allows document flexibility and documents with different schemas can be stored in the same database / index / store. The ability to convert log data into a well-defined structure, as well as the ability to search, are key to implement a modern log analytics solution. In this document, we cover how Elasticsearch. Elasticsearch can store documents, giving us the benefit of structured storage without the overheads of a database system. The Suitability of Elasticsearch In the following subsections, we share our views as to why Elasticsearch is a suitable data store for an application log analytics solution. Elasticsearch is part of a popular trio of tools, commonly known as ELK. Of these, L stands for Logstash, the log parser; E stands for Elasticsearch, the document store; and K stands for Kibana, the visualization tool. Storing Documents Logstash can be used to parse plain text data into structured text. Once data has some structure, it becomes easy to find information by enabling search on it. While parsing application logs is not a challenge, the challenge has been in storing the data and enabling search on it. Most prior solutions have used an RDBMS for storage, but the varying structure and textual nature of application logs makes it difficult to use an RDBMS table structure to store data. RDBMSs are not geared toward ‘search’. They are geared for maintaining a ‘single value of truth’ for the data, defining relations between the data, ensuring their consistency and so on. Search is also not a strong point for RDBMSs as they use exact matches for values, while Elasticsearch supports exact matches as well as partial matches. It also supports document scoring, which attaches a confidence factor to the documents located. Elasticsearch supports documents in JSON format and uses the NoSQL philosophy for document storage. This has the advantage of allowing a flexible schema for the data. Unlike an RDBMS, Elasticsearch is a search engine at heart and hence is built for the same. Though Elasticsearch uses NoSQL for storing documents, it does not provide robust methods to update stored data. Not supporting updates is a serious disadvantage in most cases. In the case of application logs, not supporting updates actually works in favour of Elasticsearch. In case of machine logs, updates are not really required. Application logs are generated from a debugging perspective – having data handy for debugging purposes in the event of application crash or incorrect execution. They usually record important events from application execution and provide additional information to allow application developers to identify the reasons for failure. Additionally, existing information in application logs is rarely, if ever, updated. New information is continually being written to the logs, with no need to refer to old information. This plays to Elasticsearch’s strength, which is able to ingest and index new information very quickly. Search One of the easiest ways of locating information from large volumes of logs is to perform a search. Elasticsearch is well suited not only to handle search, it also supports huge volume of data, using distributed computing (implemented using Shards). While Kibana is one of the commonly used tools to display and visualize information stored in Elasticsearch, it is more suited to display standard charts like bar chart, column chart and pie chart. If the features provided by Kibana are not enough, we can always use Elasticsearch’s REST API support and it’s Query DSL (Domain-Specific Language), to search for required information. The Query DSL and the result of the query are in JSON format. Though this format makes it easy for applications to parse and process, users would need a friendly user interface to interact with the data. Handling Voluminous Data Elasticsearch supports distributed search out of the box – using the concept of ‘shards’. A shard is a single Lucene instance and is managed by Elasticsearch. Two types of shards, namely ‘primary shard’ and ‘replica shard’ are supported. By default, a document is first indexed on the primary shard and then on the replica shards. The number of primary shards can be specified, to cater to the expected volume. By default, Elasticsearch creates five shards for an index. But, once the number of primary shards is decided, it cannot be changed. A replica shards are copies the primary shard. They are used to handle fail-over and the increase performance. While performance across voluminous data can be handled by sharding, it is important to note that shards, once created for an index, cannot be changed. Thus, the sharding strategy of the data has to be decided in advance, after an assessment of the data and an estimation of its growth. In the case of application logs, the sharding strategy can be based on the application name, the business unit ID, the application OD or the application’s geolocation, just to name a few. Analytics By storing data in a structure, analytics can be enabled on the data. Not only can application perform a simple search, it is also possible to restrict the search for specific terms or over a specified time period. Structured storage also makes it easier to develop reports with well-defined visualizations, which in turn makes it easy to understand the current state of applications. It is also possible to perform various analytics operations like time series analysis using the timestamp and identification of patterns from the data using machine learning techniques (assuming, we have the right kind of data in the logs). Though Elasticsearch does not provide built-in support for analytics, applications can benefit from its fast search capability and also from its ability to handle voluminous data sets. In Closing One of the main hurdles for application logs has been the ability to search for information from the huge volume of data. By parsing application log files using Logstash, we can convert a flat file into structured data. Structured data, once stored in Elasticsearch, is easier to search and locate. Visualizations and business logic for generating alerts and tickets is easier to develop on structured data. Elasticsearch, which stores and searches documents, along with its ability to scale over huge volume of data, is a good candidate for inclusion in an application log analytics solution.
April 22, 2015
by Bipin Patwardhan
· 11,703 Views · 2 Likes
article thumbnail
Meet Molly, the virtual nurse
Telemedicine is a topic that I’ve touched on numerous times on this blog over the past year or so, with a number of platforms emerging to offer patients the opportunity to consult with a medical professional from anywhere in the world. Most of these platforms are clinical based, such as the Babylon service, and therefore provide you with access to doctors when you have something wrong with you. There are however, also a number that are taking a more preventative approach and seek to keep you healthier in the first place. For instance, last autumn I wrote about the Vida platform that is providing coaching and support to ensure you stay out of the emergency room altogether. Of course, all of these services require a human healthcare professional at the other end of your video call to answer your queries for you. Sense.ly are taking another approach by offering an AI based nurse, called Molly, who aims to provide help and support in those periods between appointments with real life professionals. The service is aimed specifically at patients with common medical conditions such as diabetes or heart failure. The patient signs up to the site either direct or via their GP, and the platform then draws up a personalized care plan for them based upon both their medical records and the individual needs of the patient. The patient then follows this prescribed plan (hopefully), with regular check-ins with the virtual nurse via their smartphone or computer to help their progress. Doctors can also access information via the site to see how their patient is getting on, whilst the system will alert them automatically if worrying symptoms begin to emerge. I’ve written previously about the important role the appearance of avatars has on how we engage with them, so Molly has been designed with a friendly face and a softly spoken voice. She interacts with patients using voice recognition technology and can ask relatively simple questions of the patient whilst guiding them through exercise plans and collecting medical data from them. This data can then be analyzed by a doctor (although AI analysis seems inevitable), whilst the patient can also use Molly as a sort of health PA and book appointments with their doctor through her. With voice recognition technology growing at an incredible pace and the AI grunt of services such as Watson increasingly capable of making complex decisions, this seems the beginning of an inevitable trend towards more automated healthcare. Check out the video below that explains more about Molly and the service she offers. Original post
April 15, 2015
by Adi Gaskell
· 2,096 Views
article thumbnail
A cluster management framework, Apache Helix
What is Helix? It is used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes. Helix automates reassignment of resources in the face of node failure and recovery, cluster expansion, and reconfiguration. Modeling a distributed system as a state machine with constraints on states and transitions. Terminologies Node : A single machine Cluster: Set of Nodes Resource : A logical entry (e.g. database, index, task) Partition: Subset of the resource (Each subtask is referred to as a partition) Replica: Copy of a Partition State (e.g Master, Slave). It increase the availability of the system State: Describes the role of a replica (Each node in the cluster has its own Current State) State Machine and Transitions: An action that allows a replica to move from one state to another, thus changing its role. ( e.g Slave --> Master ) spectators: the external clients. Helix provides an External View that is an aggregated view of the current state across all nodes. Current State: represents resource's actual state at a participating node. - INSTANCE_NAME: Unique name representing the process - SESSION_ID: ID that is automatically assigned every time a process joins the cluster Rebalancer: The core component of Helix is the Controller which runs the Rebalance algorithm on every cluster event. Dynamic Ideal State: Helix powerful is that Ideal State can be changed dynamically. It is adjusting the ideal state. Whenever a cluster event occurs, Helix can operate in one of three modes FULL_AUTO SEMI_AUTO CUSTOMIZED Cluster events can be one of the following: Nodes start and/or stop Nodes experience soft and/or hard failures New nodes are added/removed [1] http://helix.apache.org/Concepts.html
April 13, 2015
by Madhuka Udantha
· 7,872 Views
article thumbnail
Introduction to Apache Cassandra's Architecture
Some key concepts for Apache's popular Cassandra Architecture include partitioning, replication, consistency, bootstrapping, and write paths.
April 6, 2015
by Akhil Mehra
· 118,113 Views · 38 Likes
article thumbnail
How to Configure a Simple JBoss Cluster in Domain Mode
Clustering is a very important thing to master for any serious user of an application server. Clustering allows for high availability by making your application available on secondary servers when the primary instance is down or it lets you scale up or out by increasing the server density on the host, or by adding servers on other hosts. It can even help to increase performance with effective load balancing between servers based on their respective hardware. Andy Overton has already covered how to set up a cluster of servers in standalone mode fronted by mod_cluster for load balancing, so in this post I'll cover clustering in domain mode. I won't rehash mod_cluster settings, so this will just cover the set up of a doman controller on one host, and the host controller and server instances on another host. To follow along with this blog, you'll need to download either JBoss EAP 6.x or WildFly. I'll be using WildFly 8.2 on Xubuntu 14.04. I'll be using $WF_HOME to refer to your WildFly home directory. Configuring the Domain Controller The domain controller needs both the domain.xml and host.xml configured. In the $WF_HOME/domain/configuration directory, you'll see that those two files are joined by a host-master.xml and a host-slave.xml. These are preconfigured host.xml files which you can use to give you a head start in making a host.xml for the domain controller (master) and host controller (slave) to use. You can either change the name of the file to be host.xml, so it will get picked up and used by default, or you can specify the host configuration you want to use on the command line by adding the --host-config argument: domain.sh --host-config=host-master.xml Whether you choose to modify the host.xml or the host-master.xml, you need to make sure that the empty element has been added to the section. This is so that when WildFly looks to see which server is the domain controller, it knows to become the domain controller itself. The other change is optional, but recommended. We need to tell the domain controller to bind its management interface to the correct IP address because, by default, it will bind to localhost, so the management communication it needs to do with the remote hosts won't be able to reach the domain controller at all! We can set this address permanently in the host.xml by making sure the inet-address value is set to the right IP, by changing the 127.0.0.1 in the example below to the correct IP: The result of that is that the default bind IP of the management interface is no longer localhost, although you can still override this value by starting JBoss with the variable left of the colon as a -D argument: domain.sh -Djboss.bind.address.management=10.0.0.1 Next, we need to modify the domain.xml file, where we need to define our server groups; essentially just defining the cluster. Each server group is named, so we can reference it later, and references a particular profile which needs to be one of the profiles named and defined in the same XML file. As I mentioned in my previous blog, domain mode has several profiles in the same file (domain.xml) rather than multiple files for each, like standalone mode (standalone.xml, standalone-ha.xml etc.). In the screenshot, there are two server groups defined - "main-server-group" which references the "full" profile, and "other-server-group" which references the "full-ha" profile. These are just the defaults which come with WildFly, so you're free to use them and modify the settings or create your own from scratch. Whichever you choose, it's a good idea to rename your server group to something meaningful, like a description of the workload, or the application name. Configuring the Host Controllers Every host server which you want to be part of the cluster must have the host.xml file configured. We've already configured the host.xml on the domain controller, so now we'll focus on the host controller. Remember, this process can be repeated on any number of hosts, depending on how many servers you want in your server group and their topology. First, we need to make sure that the domain controller and the host controller can communicate, and to do that we need a valid management user. On the domain controller, run the add-user.sh or add-user.bat script. You will need to make sure to: Choose a management user Make sure the user is different than the one you would use to log in to the web console Confirm that the new user will connect one AS process to another AS process Make a note of the secret value (this is very important!) You will find that you get prompts similar to the following: mike@mike-C2B2:~$ /opt/wildfly/wildfly-8.2.0.Final/bin/add-user.sh What type of user do you wish to add? a) Management User (mgmt-users.properties) b) Application User (application-users.properties) (a): a Enter the details of the new user to add. Using realm 'ManagementRealm' as discovered from the existing property files. Username : mgmt Password recommendations are listed below. To modify these restrictions edit the add-user.properties configuration file. - The password should not be one of the following restricted values {root, admin, administrator} - The password should contain at least 8 characters, 1 alphabetic character(s), 1 digit(s), 1 non-alphanumeric symbol(s) - The password should be different from the username Password : Re-enter Password : What groups do you want this user to belong to? (Please enter a comma separated list, or leave blank for none)[ ]: About to add user 'mgmt' for realm 'ManagementRealm' Is this correct yes/no? yes Added user 'mgmt' to file '/opt/wildfly/wildfly-8.2.0.Final/standalone/configuration/mgmt-users.properties' Added user 'mgmt' to file '/opt/wildfly/wildfly-8.2.0.Final/domain/configuration/mgmt-users.properties' Added user 'mgmt' with groups to file '/opt/wildfly/wildfly-8.2.0.Final/standalone/configuration/mgmt-groups.properties' Added user 'mgmt' with groups to file '/opt/wildfly/wildfly-8.2.0.Final/domain/configuration/mgmt-groups.properties' Is this new user going to be used for one AS process to connect to another AS process? e.g. for a slave host controller connecting to the master or for a Remoting connection for server to server EJB calls. yes/no? yes To represent the user add the following to the server-identities definition Once we have the secret value for our management user, we can add it to the host.xml file. I'm choosing to modify the host-slave.xml file, since much of the configuration I need is done for me: Next, we need to tell the host controller where to look for the domain controller. We set this to for the domain controller's host.xml file, but in the host-slave.xml we have an example tag filled out for us. All we need to do is add the domain controller's IP or hostname exactly as we did for the management bind address earlier. So our host-slave.xml should go from this: to this: This way, like with the management interface on the domain controller, the default address will be 10.0.0.1, but it can also be overridden on the command line if needed. Once we've sorted the communication out, we need to tell the host controller to actually start some server instances! At the bottom of the host-slave.xml file, there are two predefined servers to use: These are already configured to become members of the two server groups configured in the domain.xml. Note that the second server has to have a port offset. Despite it being in a different server group, it's still going to run on the same host and will attempt to bind to the same ports as the first server unless we tell it not to! We would also need to do the same thing if we added other server instances. Optionally, we can make things a little easier for ourselves when managing a lot of servers on a lot of hosts. We can give each server instance its own unique name, but we can also name the host by adding a name attribute to the parent tag, changing it from: to So both in the logs and in the admin console, you should see this host controller referred to as "host1". Now, if you wanted to name your server instances the same across hosts, you'll be able to tell which is which! If all you wanted was to configure a single domain controller and a single host controller, then that's all we need to do to get them speaking to each other. You can then carry on and configure mod_cluster and Apache to forward requests on to the right server, or just deploy your applications and connect to them directly.
April 3, 2015
by Mike Croft
· 23,576 Views
article thumbnail
Streaming Big Data: Storm, Spark and Samza
There are a number of distributed computation systems that can process Big Data in real time or near-real time. This article will start with a short description of three Apache frameworks, and attempt to provide a quick, high-level overview of some of their similarities and differences. Apache Storm In Storm, you design a graph of real-time computation called a topology, and feed it to the cluster where the master node will distribute the code among worker nodes to execute it. In a topology, data is passed around between spouts that emit data streams as immutable sets of key-value pairs called tuples, and bolts that transform those streams (count, filter etc.). Bolts themselves can optionally emit data to other bolts down the processing pipeline. Apache Spark Spark Streaming (an extension of the core Spark API) doesn’t process streams one at a time like Storm. Instead, it slices them in small batches of time intervals before processing them. The Spark abstraction for a continuous stream of data is called a DStream (for Discretized Stream). A DStream is a micro-batch of RDDs (Resilient Distributed Datasets). RDDs are distributed collections that can be operated in parallel by arbitrary functions and by transformations over a sliding window of data (windowed computations). Apache Samza Samza ’s approach to streaming is to process messages as they are received, one at a time. Samza’s stream primitive is not a tuple or a Dstream, but a message. Streams are divided into partitions and each partition is an ordered sequence of read-only messages with each message having a unique ID (offset). The system also supports batching, i.e. consuming several messages from the same stream partition in sequence. Samza`s Execution & Streaming modules are both pluggable, although Samza typically relies on Hadoop’s YARN (Yet Another Resource Negotiator) and Apache Kafka. Common Ground All three real-time computation systems are open-source, low-latency, distributed, scalable and fault-tolerant. They all allow you to run your stream processing code through parallel tasks distributed across a cluster of computing machines with fail-over capabilities. They also provide simple APIs to abstract the complexity of the underlying implementations. The three frameworks use different vocabularies for similar concepts: Comparison Matrix A few of the differences are summarized in the table below: There are three general categories of delivery patterns: At-most-once: messages may be lost. This is usually the least desirable outcome. At-least-once: messages may be redelivered (no loss, but duplicates). This is good enough for many use cases. Exactly-once: each message is delivered once and only once (no loss, no duplicates). This is a desirable feature although difficult to guarantee in all cases. Another aspect is state management. There are different strategies to store state. Spark Streaming writes data into the distributed file system (e.g. HDFS). Samza uses an embedded key-value store. With Storm, you’ll have to either roll your own state management at your application layer, or use a higher-level abstraction called Trident. Use Cases All three frameworks are particularly well-suited to efficiently process continuous, massive amounts of real-time data. So which one to use? There are no hard rules, at most a few general guidelines. If you want a high-speed event processing system that allows for incremental computations, Storm would be fine for that. If you further need to run distributed computations on demand, while the client is waiting synchronously for the results, you’ll have Distributed RPC (DRPC) out-of-the-box. Last but not least, because Storm uses Apache Thrift, you can write topologies in any programming language. If you need state persistence and/or exactly-once delivery though, you should look at the higher-level Trident API, which also offers micro-batching. A few companies using Storm: Twitter, Yahoo!, Spotify, The Weather Channel... Speaking of micro-batching, if you must have stateful computations, exactly-once delivery and don’t mind a higher latency, you could consider Spark Streaming…specially if you also plan for graph operations, machine learning or SQL access. The Apache Spark stack lets you combine several libraries with streaming (Spark SQL, MLlib, GraphX) and provides a convenient unifying programming model. In particular, streaming algorithms (e.g. streaming k-means) allow Spark to facilitate decisions in real-time. A few companies using Spark: Amazon, Yahoo!, NASA JPL, eBay Inc., Baidu… If you have a large amount of state to work with (e.g. many gigabytes per partition), Samza co-locates storage and processing on the same machines, allowing to work efficiently with state that won’t fit in memory. The framework also offers flexibility with its pluggable API: its default execution, messaging and storage engines can each be replaced with your choice of alternatives. Moreover, if you have a number of data processing stages from different teams with different codebases, Samza ‘s fine-grained jobs would be particularly well-suited, since they can be added/removed with minimal ripple effects. A few companies using Samza: LinkedIn, Intuit, Metamarkets, Quantiply, Fortscale… Conclusion We only scratched the surface of The Three Apaches. We didn’t cover a number of other features and more subtle differences between these frameworks. Also, it’s important to keep in mind the limits of the above comparisons, as these systems are constantly evolving.
February 28, 2015
by Tony Siciliani
· 32,726 Views · 5 Likes
article thumbnail
Algorithms Explained: Minesweeper
This blog post explains the essential algorithms for the well-known Windows game "Minesweeper." Game Rules The board is a two-dimensional space, which has a predetermined number of mines. Cells have two states, opened and closed. If you left-click on a closed cell: Cell is empty and opened. If neighbor cell(s) have mine(s), this opened cell shows neighbor mine count. If neighbor cells have no mines, all neighbor cells are opened automatically. Cell has a mine, game ends with FAIL. If you right-click on a closed cell, you put a flag which shows that "I know this cell has a mine". If you multi-click (both right and left click) on a cell which is opened and has at least one mine on its neighbors: If neighbor cells' total flag count equals to this multi-clicked cell's count and predicted mine locations are true, all closed and unflagged neighbor cells are opened automatically. If neighbor cells' total flag count equals to this multi-clicked cell's count and at least one predicted mine location is wrong, game ends with FAIL. If all cells (without mines) are opened using left clicks and/or multi-clicks, game ends with SUCCESS. Data Structures We may think of each cell as a UI structure (e.g. button), which has the following attributes: colCoord = 0 to colCount rowCoord = 0 to rowCount isOpened = true / false (default false) hasFlag = true / false (default false) hasMine = true / false (default false) neighborMineCount 0 to 8 (default 0, total count of mines on neighbor cells) So, we have a two-dimensional "Button[][] cells" data structure to handle game actions. Algorithms Before Start: Assign mines to cells randomly (set hasMine=true) . Calculate neighborMineCount values for each cell, which have hasMine=false. (This step may be done for each clicked cell while game continues but it may be inefficient.) Note 1: Neighbor cells should be accessed with the coordinates: {(colCoord-1, rowCoord-1),(colCoord-1, rowCoord),(colCoord-1, rowCoord+1),(colCoord, rowCoord-1),(colCoord, rowCoord+1),(colCoord+1, rowCoord-1),(colCoord+1, rowCoord),(colCoord+1, rowCoord+1)} And don't forget that neighbor cell counts may be 3 (for corner cells), 5 (for edge cells) or 8 (for middle cells). Note 2: It's recommended to handle mouse clicks with "mouse release" actions instead of "mouse pressed/click" actions, otherwise a left or right click may be understood as a multi-click or vice versa. Right Click on a Cell: If cell isOpened=true, do nothing. If cell isOpened=false, set cell hasFlag=true and show a flag on the cell. Left Click on a Cell: If cell isOpened=true, do nothing. If cell isOpened=false: If cell hasMine=true, game over. If cell hasMine=false: If cell has neighborMineCount > 0, set isOpened=true, show neighborMineCount on the cell. If all cells which hasMine=false are opened, end game with SUCCESS. If cell has neighborMineCount == 0, set isOpened=true, call Left Click on a Cell for all neighbor cells, which hasFlag=false and isOpened=false. Note: The last step may be implemented with a recursive call or by using a stacked data structure. Multi Click (both Left and Right Click) on a Cell: If cell isOpened=false, do nothing. If cell isOpened=true: If cell neighborMineCount == 0, no nothing. If cell neighborMineCount > 0: If cell neighborMineCount != "neighbor hasFlag=true cell count", do nothing. If cell neighborMineCount == "neighbor hasFlag=true cell count": If all neighbor hasFlag=true cells are not hasMine=true, game over. If all neighbor hasFlag=true cells are hasMine=true (every flag is put on correct cell), call Left Click on a Cell for all neighbor cells, which hasFlag=false and isOpened=false. Note: The last step may be implemented with a recursive call or by using a stacked data structure.
February 9, 2015
by Cagdas Basaraner
· 33,362 Views
article thumbnail
Do it in Java 8: The State Monad
In a previous article (Do it in Java 8: Automatic memoization), I wrote about memoization and said that memoization is about handling state between function calls, although the value returned by the function does not change from one call to another as long as the argument is the same. I showed how this could be done automatically. There are however other cases when handling state between function calls is necessary but cannot be done this simple way. One is handling state between recursive calls. In such a situation, the function is called only once from the user point of view, but it is in fact called several times since it calls itself recursively. Most of these functions will not benefit internally from memoization. For example, the factorial function may be implemented recursively, and for one call to f(n), there will be n calls to the function, but not twice with the same value. So this function would not benefit from internal memoization. On the contrary, the Fibonacci function, if implemented according to its recursive definition, will call itself recursively a huge number of times will the same arguments. Here is the standard definition of the Fibonacci function: f(n) = f(n – 1) + f(n – 2) This definition has a major problem: calculating f(n) implies evaluating n² times the function with different values. The result is that, in practice, it is impossible to use this definition for n > 50 because the time of execution increases exponentially. But since we are calculating more than n values, it is obvious that some values are calculated several times. So this definition would be a good candidate for internal memoization. Warning: Java is not a true recursive language, so using any recursive method will eventually blow the stack, unless we use TCO (Tail Call Optimization) as described in my previous article (Do it in Java 8: recursive and corecursive Fibonacci). However, TCO can't be applied to the original Fibonacci definition because it is not tail recursive. So we will write an example working for n limited to a few thousands. The naive implementation Here is how we might implement the original definition: static BigInteger fib(BigInteger n) { return n.equals(BigInteger.ZERO) || n.equals(BigInteger.ONE) ? n : fib(n.subtract(BigInteger.ONE)).add(fib(n.subtract(BigInteger.ONE.add(BigInteger.ONE)))); } Do not try this implementation for values greater that 50. On my machine, fib(50) takes 44 minutes to return! Basic memoized version To avoid computing several times the same values, we can use a map were computed values are stored. This way, for each value, we first look into the map to see if it has already been computed. If it is present, we just retrieve it. Otherwise, we compute it, store it into the map and return it. For this, we will use a special class called Memo. This is a normal HashMap that mimic the interface of functional map, which means that insertion of a new (key, value) pair returns a map, and looking up a key returns an Optional: public class Memo extends HashMap { public Optional retrieve(BigInteger key) { return Optional.ofNullable(super.get(key)); } public Memo addEntry(BigInteger key, BigInteger value) { super.put(key, value); return this; } } Note that this is not a true functional (immutable and persistent) map, but this is not a problem since it will not be shared. We will also need a Tuple class that we define as: public class Tuple { public final A _1; public final B _2; public Tuple(A a, B b) { _1 = a; _2 = b; } } With this class, we can write the following basic implementation: static BigInteger fibMemo1(BigInteger n) { return fibMemo(n, new Memo().addEntry(BigInteger.ZERO, BigInteger.ZERO) .addEntry(BigInteger.ONE, BigInteger.ONE))._1; } static Tuple fibMemo(BigInteger n, Memo memo) { return memo.retrieve(n).map(x -> new Tuple<>(x, memo)).orElseGet(() -> { BigInteger x = fibMemo(n.subtract(BigInteger.ONE), memo)._1 .add(fibMemo(n.subtract(BigInteger.ONE).subtract(BigInteger.ONE), memo)._1); return new Tuple<>(x, memo.addEntry(n, x)); }); } This implementation works fine, provided values of n are not too big. For f(50), it returns in less that 1 millisecond, which should be compared to the 44 minutes of the naive version. Remark that we do not have to test for terminal values (f(0) and f(1). These values are simply inserted into the map at start. So, is there something we should make better? The main problem is that we have to handle the passing of the Memo map by hand. The signature of the fibMemo method is no longer fibMemo(BigInteger n), but fibMemo(BigInteger n, Memo memo). Could we simplify this? We might think about using automatic memoization as described in my previous article (Do it in Java 8: Automatic memoization ). However, this will not work: static Function fib = new Function() { @Override public BigInteger apply(BigInteger n) { return n.equals(BigInteger.ZERO) || n.equals(BigInteger.ONE) ? n : this.apply(n.subtract(BigInteger.ONE)).add(this.apply(n.subtract(BigInteger.ONE.add(BigInteger.ONE)))); } }; static Function fibm = Memoizer.memoize(fib); Beside the fact that we could not use lambdas because reference to this is not allowed, which may be worked around by using the original anonymous class syntax, the recursive call is made to the non memoized function, so it does not make things better. Using the State monad In this example, each computed value is strictly evaluated by the fibMemo method, and this is what makes the memo parameter necessary. Instead of a method returning a value, what we would need is a method returning a function that could be evaluated latter. This function would take a Memo as parameter, and this Memo instance would be necessary only at evaluation time. This is what the State monad will do. Java 8 does not provide the state monad, so we have to create it, but it is very simple. However, we first need an implementation of a list that is more functional that what Java offers. In a real case, we would use a true immutable and persistent List. The one I have written is about 1 000 lines, so I can't show it here. Instead, we will use a dummy functional list, backed by a java.util.ArrayList. Although this is less elegant, it does the same job: public class List { private java.util.List list = new ArrayList<>(); public static List empty() { return new List(); } @SafeVarargs public static List apply(T... ta) { List result = new List<>(); for (T t : ta) result.list.add(t); return result; } public List cons(T t) { List result = new List<>(); result.list.add(t); result.list.addAll(list); return result; } public U foldRight(U seed, Function> f) { U result = seed; for (int i = list.size() - 1; i >= 0; i--) { result = f.apply(list.get(i)).apply(result); } return result; } public List map(Function f) { List result = new List<>(); for (T t : list) { result.list.add(f.apply(t)); } return result; } public List filter(Function f) { List result = new List<>(); for (T t : list) { if (f.apply(t)) { result.list.add(t); } } return result; } public Optional findFirst() { return list.size() == 0 ? Optional.empty() : Optional.of(list.get(0)); } @Override public String toString() { StringBuilder s = new StringBuilder("["); for (T t : list) { s.append(t).append(", "); } return s.append("NIL]").toString(); } } The implementation is not functional, but the interface is! And although there are lots of missing capabilities, we have all we need. Now, we can write the state monad. It is often called simply State but I prefer to call it StateMonad in order to avoid confusion between the state and the monad: public class StateMonad { public final Function> runState; public StateMonad(Function> runState) { this.runState = runState; } public static StateMonad unit(A a) { return new StateMonad<>(s -> new StateTuple<>(a, s)); } public static StateMonad get() { return new StateMonad<>(s -> new StateTuple<>(s, s)); } public static StateMonad getState(Function f) { return new StateMonad<>(s -> new StateTuple<>(f.apply(s), s)); } public static StateMonad transition(Function f) { return new StateMonad<>(s -> new StateTuple<>(Nothing.instance, f.apply(s))); } public static StateMonad transition(Function f, A value) { return new StateMonad<>(s -> new StateTuple<>(value, f.apply(s))); } public static StateMonad> compose(List> fs) { return fs.foldRight(StateMonad.unit(List.empty()), f -> acc -> f.map2(acc, a -> b -> b.cons(a))); } public StateMonad flatMap(Function> f) { return new StateMonad<>(s -> { StateTuple temp = runState.apply(s); return f.apply(temp.value).runState.apply(temp.state); }); } public StateMonad map(Function f) { return flatMap(a -> StateMonad.unit(f.apply(a))); } public StateMonad map2(StateMonad sb, Function> f) { return flatMap(a -> sb.map(b -> f.apply(a).apply(b))); } public A eval(S s) { return runState.apply(s).value; } } This class is parameterized by two types: the value type A and the state type S. In our case, A will be BigInteger and S will be Memo. This class holds a function from a state to a tuple (value, state). This function is hold in the runState field. This is similar to the value hold in the Optional monad. To make it a monad, this class needs a unit method and a flatMap method. The unit method takes a value as parameter and returns a StateMonad. It could be implemented as a constructor. Here, it is a factory method. The flatMap method takes a function from A (a value) to StateMonad and return a new StateMonad. (In our case, A is the same as B.) The new type contains the new value and the new state that result from the application of the function. All other methods are convenience methods: map allows to bind a function from A to B instead of a function from A to StateMonad. It is implemented in terms of flatMap and unit. eval allows easy retrieval of the value hold by the StateMonad. getState allows creating a StateMonad from a function S -> A. transition takes a function from state to state and a value and returns a new StateMonad holding the value and the state resulting from the application of the function. In other words, it allows changing the state without changing the value. There is also another transition method taking only a function and returning a StateMonad. Nothing is a special class: public final class Nothing { public static final Nothing instance = new Nothing(); private Nothing() {} } This class could be replaced by Void, to mean that we do not care about the type. However, Void is not supposed to be instantiated, and the only reference of type Void is normally null. The problem is that null does not carry its type. We could instantiate a Void instance through introspection: Constructor constructor; constructor = Void.class.getDeclaredConstructor(); constructor.setAccessible(true); Void nothing = constructor.newInstance(); but this is really ugly, so we create a Nothing type with a single instance of it. This does the trick, although to be complete, Nothing should be able to replace any type (like null), which does not seem to be possible in Java. Using the StateMonad class, we can rewrite our program: static BigInteger fibMemo2(BigInteger n) { return fibMemo(n).eval(new Memo().addEntry(BigInteger.ZERO, BigInteger.ZERO).addEntry(BigInteger.ONE, BigInteger.ONE)); } static StateMonad fibMemo(BigInteger n) { return StateMonad.getState((Memo m) -> m.retrieve(n)) .flatMap(u -> u.map(StateMonad:: unit).orElse(fibMemo(n.subtract(BigInteger.ONE)) .flatMap(x -> fibMemo(n.subtract(BigInteger.ONE).subtract(BigInteger.ONE)) .map(x::add) .flatMap(z -> StateMonad.transition((Memo m) -> m.addEntry(n, z), z))))); } Now, the fibMemo method only takes a BigInteger as its parameter and returns a StateMonad, which means that when this method returns, nothing has been evaluated yet. The Memo doesn't even exist! To get the result, we may call the eval method, passing it the Memo instance. If you find this code difficult to understand, here is an exploded commented version using longer identifiers: static StateMonad fibMemo(BigInteger n) { /* * Create a function of type Memo -> Optional with a closure * over the n parameter. */ Function> retrieveValueFromMapIfPresent = (Memo memoizationMap) -> memoizationMap.retrieve(n); /* * Create a state from this function. */ StateMonad> initialState = StateMonad.getState(retrieveValueFromMapIfPresent); /* * Create a function for converting the value (BigInteger) into a State * Monad instance. This function will be bound to the Optional resulting * from the lookup into the map to give the result if the value was found. */ Function> createStateFromValue = StateMonad:: unit; /* * The value computation proper. This can't be easily decomposed because it * make heavy use of closures. It first calls recursively fibMemo(n - 1), * producing a StateMonad. It then flatMaps it to a new * recursive call to fibMemo(n - 2) (actually fibMemo(n - 1 - 1)) and get a * new StateMonad which is mapped to BigInteger addition * with the preceding value (x). Then it flatMaps it again with the function * y -> StateMonad.transition((Memo m) -> m.addEntry(n, z), z) which adds * the two values and returns a new StateMonad with the computed value added * to the map. */ StateMonad computedValue = fibMemo(n.subtract(BigInteger.ONE)) .flatMap(x -> fibMemo(n.subtract(BigInteger.ONE).subtract(BigInteger.ONE)) .map(x::add) .flatMap(z -> StateMonad.transition((Memo m) -> m.addEntry(n, z), z))); /* * Create a function taking an Optional as its parameter and * returning a state. This is the main function that returns the value in * the Optional if it is present and compute it and put it into the map * before returning it otherwise. */ Function, StateMonad> computeFiboValueIfAbsentFromMap = u -> u.map(createStateFromValue).orElse(computedValue); /* * Bind the computeFiboValueIfAbsentFromMap function to the initial State * and return the result. */ return initialState.flatMap(computeFiboValueIfAbsentFromMap); } The most important part is the following: StateMonad computedValue = fibMemo_(n.subtract(BigInteger.ONE)) .flatMap(x -> fibMemo_(n.subtract(BigInteger.ONE).subtract(BigInteger.ONE)) .map(x::add) .flatMap(z -> StateMonad.transition((Memo m) -> m.addEntry(n, z), z))); This kind of code is essential to functional programming, although it is sometimes replaced in other languages with “for comprehensions”. As Java 8 does not have for comprehensions we have to use this form. At this point, we have seen that using the state monad allows abstracting the handling of state. This technique can be used every time you have to handle state. More uses of the state monad The state monad may be used for many other cases were state must be maintained in a functional way. Most programs based upon maintaining state use a concept known as a State Machine. A state machine is defined by an initial state and a series of inputs. Each input submitted to the state machine will produce a new state by applying one of several possible transitions based upon a list of conditions concerning both the input and the actual state. If we take the example of a bank account, the initial state would be the initial balance of the account. Possible transition would be deposit(amount) and withdraw(amount). The conditions would be true for deposit and balance >= amount for withdraw. Given the state monad that we have implemented above, we could write a parameterized state machine: public class StateMachine { Function> function; public StateMachine(List, Transition>> transitions) { function = i -> StateMonad.transition(m -> Optional.of(new StateTuple<>(i, m)).flatMap((StateTuple t) -> transitions.filter((Tuple, Transition> x) -> x._1.test(t)).findFirst().map((Tuple, Transition> y) -> y._2.apply(t))).get()); } public StateMonad process(List inputs) { List> a = inputs.map(function); StateMonad> b = StateMonad.compose(a); return b.flatMap(x -> StateMonad.get()); } } This machine uses a bunch of helper classes. First, the inputs are represented by an interface: public interface Input { boolean isDeposit(); boolean isWithdraw(); int getAmount(); } There are two instances of inputs: public class Deposit implements Input { private final int amount; public Deposit(int amount) { super(); this.amount = amount; } @Override public boolean isDeposit() { return true; } @Override public boolean isWithdraw() { return false; } @Override public int getAmount() { return this.amount; } } public class Withdraw implements Input { private final int amount; public Withdraw(int amount) { super(); this.amount = amount; } @Override public boolean isDeposit() { return false; } @Override public boolean isWithdraw() { return true; } @Override public int getAmount() { return this.amount; } } Then come two functional interfaces for conditions and transitions: public interface Condition extends Predicate> {} public interface Transition extends Function, S> {} These act as type aliases in order to simplify the code. We could have used the predicate and the function directly. In the same manner, we use a StateTuple class instead of a normal tuple: public class StateTuple { public final A value; public final S state; public StateTuple(A a, S s) { value = Objects.requireNonNull(a); state = Objects.requireNonNull(s); } } This is exactly the same as an ordinary tuple with named members instead of numbered ones. Numbered members allows using the same class everywhere, but a specific class like this one make the code easier to read as we will see. The last utility class is Outcome, which represent the result returned by the state machine: public class Outcome { public final Integer account; public final List> operations; public Outcome(Integer account, List> operations) { super(); this.account = account; this.operations = operations; } public String toString() { return "(" + account.toString() + "," + operations.toString() + ")"; } } This again could be replaced with a Tuple>>, but using named parameters make the code easier to read. (In some functional languages, we could use type aliases for this.) Here, we use an Either class, which is another kind of monad that Java does not offer. I will not show the complete class, but only the parts that are useful for this example: public interface Either { boolean isLeft(); boolean isRight(); A getLeft(); B getRight(); static Either right(B value) { return new Right<>(value); } static Either left(A value) { return new Left<>(value); } public class Left implements Either { private final A left; private Left(A left) { super(); this.left = left; } @Override public boolean isLeft() { return true; } @Override public boolean isRight() { return false; } @Override public A getLeft() { return this.left; } @Override public B getRight() { throw new IllegalStateException("getRight() called on Left value"); } @Override public String toString() { return left.toString(); } } public class Right implements Either { private final B right; private Right(B right) { super(); this.right = right; } @Override public boolean isLeft() { return false; } @Override public boolean isRight() { return true; } @Override public A getLeft() { throw new IllegalStateException("getLeft() called on Right value"); } @Override public B getRight() { return this.right; } @Override public String toString() { return right.toString(); } } } This implementation is missing a flatMap method, but we will not need it. The Either class is somewhat like the Optional Java class in that it may be used to represent the result of an evaluation that may return a value or something else like an exception, an error message or whatever. What is important is that it can hold one of two things of different types. We now have all we need to use our state machine: public class Account { public static StateMachine createMachine() { Condition predicate1 = t -> t.value.isDeposit(); Transition transition1 = t -> new Outcome(t.state.account + t.value.getAmount(), t.state.operations.cons(Either.right(t.value.getAmount()))); Condition predicate2 = t -> t.value.isWithdraw() && t.state.account >= t.value.getAmount(); Transition transition2 = t -> new Outcome(t.state.account - t.value.getAmount(), t.state.operations.cons(Either.right(- t.value.getAmount()))); Condition predicate3 = t -> true; Transition transition3 = t -> new Outcome(t.state.account, t.state.operations.cons(Either.left(new IllegalStateException(String.format("Can't withdraw %s because balance is only %s", t.value.getAmount(), t.state.account))))); List, Transition>> transitions = List.apply( new Tuple<>(predicate1, transition1), new Tuple<>(predicate2, transition2), new Tuple<>(predicate3, transition3)); return new StateMachine<>(transitions); } } This could not be simpler. We just define each possible condition and the corresponding transition, and then build a list of tuples (Condition, Transition) that is used to instantiate the state machine. There are however to rules that must be enforced: Conditions must be put in the right order, with the more specific first and the more general last. We must be careful to be sure to match all possible cases. Otherwise, we will get an exception. At this stage, nothing has been evaluated. We did not even use the initial state! To run the state machine, we must create a list of inputs and feed it in the machine, for example: List inputs = List.apply( new Deposit(100), new Withdraw(50), new Withdraw(150), new Deposit(200), new Withdraw(150)); StateMonad = Account.createMachine().process(inputs); Again, nothing has been evaluated yet. To get the result, we just evaluate the result, using an initial state: Outcome outcome = state.eval(new Outcome(0, List.empty())) If we run the program with the list above, and call toString() on the resulting outcome (we can't do more useful things since the Either class is so minimal!) we get the following result: // // (100,[-150, 200, java.lang.IllegalStateException: Can't withdraw 150 because balance is only 50, -50, 100, NIL]) This is a tuple of the resulting balance for the account (100) and the list of operations that have been carried on. We can see that successful operations are represented by a signed integer, and failed operations are represented by an error message. This of course is a very minimal example, and as usual, one may think it would be much easier to do it the imperative way. However, think of a more complex example, like a text parser. All there is to do to adapt the state machine is to define the state representation (the Outcome class), define the possible inputs and create the list of (Condition,Transition). Going the functional way does not make the whole thing simpler. However, it allows abstracting the implementation of the state machine from the requirements. The only thing we have to do to create a new state machine is to write the new requirements!
October 9, 2014
by Pierre-Yves Saumont
· 25,940 Views · 7 Likes
article thumbnail
JBoss Data Grid: Installation and Development
In this blog, we will discuss one particular data grid platform from Redhat namely JBoss Data Grid (JDG). We will firstly cover how to access and install this data grid platform and then we will demonstrate how to develop and deploy a simple remote client/server data grid application which utilises the HotRod protocol. We will be using the latest release JDG 6.2 from Redhat in this article. Installation Overview To start using JDG, firstly log on to the redhat site https://access.redhat.com/home and download the software from the Downloads section of the site. We wish to download JDG 6.2 server by clicking on the appropriate links in the Downloads section. For future reference, it is also useful to download the quickstart and maven repository zip files. To install JDG, we simply unzip the JDG server package into an appropriate directory in your environment. JDG Overview In this section, we will provide a brief overview of the contents of the JDG installation package and the most notable configuration options available to users. Out of the box, users are provided with two runtime options either to run JDG in standalone or clustered mode. We can start JDG in either mode by invoking the stanadalone or clustered start up scripts in the / bin directory. To configure the JDG in either mode we need to configure the files standalone.xml and clustered.xml. In our case we will creating a distributed cache which will run on 3 node JDG cluster so we will be utilizing the clustered startup script. In order to set up and add new cache instances to JDG, we modify the infinispan subsystems in the appropriate xml configuration file above. We should also note the principal difference between the standalone and clustered configuration file is that in the clustered configuration file there is a JGroups subsystem configured element which allows for communication and messaging between configured cache instances running in a JDG cluster. Development Environment Setup and Configuration In this section, we will detail how to develop and configure a simple datagrid application which will be deployed to a 3 node JDG cluster. We will demonstrate how to configure and deploy a distributed cache in JDG and also show how to develop a HotRod Java client application which will be used to insert, update and display entries in the distributed cache. We will firstly discuss setting a new distributed cache on a 3 node JDG cluster. In this example, we will run our JDG cluster on a single machine by running each JDG instance on different ports. Firstly, we will create 3 instances of JDG by creating 3 directories (server1, server2, server3) on our host machine and unzipping each JDG installation into each directory. We will now configure each node in our cluster by copying and renaming the clustered.xml configuration file in the \server1\jboss-datagrid-6.2.0-server\standalone\configuration directory. We will name each of the cluster configuration files as "clustered1.xml", "clustered2.xml" and "clustered3.xml" for the JDG instances denoted by "server1", "server2" and "server3" respectively. We will now set up a new distributed cache on our JDG cluster by modifying the infinispan subsystem element in each clustered.xml file. We will demonstrate this for the node denoted "server1" here by modifying the file "clustered1.xml". The cache configuration shown here will be the same across all 3 nodes. To setup a new distributed cache named "directory-dist-cache", we configure the following elements in the file named "clustered1.xml" ......... ...... .............. ...... ...... /socket-binding-group> We will discuss the key elements and attributes relating to the configuration above. In the infinispan endpoint subsystem, we will configure hotrod clients to connect to the JDG server instance on socket 11222. The name of the cache container to host each of the cache instances will be held in the container named "clusteredcache". We have configured the infinispan core subsystem to the default cache container named "clusteredcacahe" whereby we will allow for jmx statistics to be collected relating the configured cache entries i.e statistics="true" We have created a new distributed cache named "directory-dist-cache" whereby there will be two copies of each cache entry held on two of the 3 cluster nodes. We have also set up an eviction policy whereby should there be more than 20 entries in our cache then cache entries will be removed using the LRU algorithm We should have configured nodes "server2" and "server3" to start up with a port offset of 100 and 200 respectively by configuring the socketing binding group element appropriately. Please view the socket bindings noted below. To set the socket binding element with a port offset of 100 on "server2", we configure "clustered2.xml" with the following entry: ...... ...... /socket-binding-group> To set the socket binding element with a port offset of 200 on "server3", we configure "clustered3.xml" with the following entry: ...... ...... /socket-binding-group> Before discussing the setup and configuration of our Hotrod client which will be used to interact with our JDG clustered HotRod server, we will start up each server instance to ensure our newly configured JDG distributed cache starts up correctly. Open up 3 Windows or Linux consoles and execute the following start up commands: Console 1: 1) Navigate to \server1\jboss-datagrid-6.2.0-server\bin 2) Execute this command to start the first instance of our JDG cluster denoted "server1": clustered -c=clustered1.xml -Djboss.node.name=server1 Console 2: 1) Navigate to \server2\jboss-datagrid-6.2.0-server\bin 2) Execute this command to start the second instance of our JDG cluster denoted "server2": clustered -c=clustered2.xml -Djboss.node.name=server2 Console 3: 1) Navigate to \server3\jboss-datagrid-6.2.0-server\bin 2) Execute this command to start the third instance of our JDG cluster denoted "server3": clustered -c=clustered3.xml -Djboss.node.name=server3 Providing all 3 JDG instances have started up correctly, you should see output in the console window whereby we can see there are 3 JDG instances in the JGroups view: HotRod Client Development Setup Now that the Hotrod server is up and running, we need to develop a Hotrod Java client which will interact with the clustered server application. The development environment consists of the following tools. 1) JDK Hotspot 1.7.0_45 2) IDE - Eclipse Kepler Build id: 20130919-0819 The HotRod client application is a simple application consisting of two Java classes. The application allows users to retrieve a reference to the distributed cache from the JDG server and then perform these actions: a) add new cinema objects. b) add and remove shows to each cinema object. c) print the list of all cinemas and shows stored in our distributed cache. The source code can be downloaded from github @ https://github.com/davewinters/JDG. We could use maven here to build and execute our application by configuring the maven settings.xml to point to the maven repository files we downloaded earlier and set up a maven project file (pom.xml) to build and execute the client application. In this article we will build our application using the Eclipse IDE and run the client application on the command line. To create a HotRod client application and execute the sample application, one should complete the following steps: 1) Create a new Java Project in Eclipse 2) Create a new package named uk.co.c2b2.jdg.hotrod and import the source code that has been downloaded from Github mentioned previously. 3) Now we need to configure the build path in Eclipse to contain the appropriate JDG client jar files which are required to compile the application. You should include all the client jar files in the project build path. These jar files are contained in the JDG installation zip file. For example on my machine these jar files are located in the directory: \server1\jboss-datagrid-6.2.0-server\client\hotrod\java 4. Providing the Eclipse build path has been configured appropriately, the application source should compile without issue. 5. We will need to execute the Hotrod application by opening the console window and executing the following command. Note the path specified here will differ depending on where the JDG client jar files and application class files are located in your environment: java -classpath ".;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\commons-pool-1.6-redhat-4.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-client-hotrod-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-commons-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-query-dsl-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-remote-query-client-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-logging-3.1.2.GA-redhat-1.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-marshalling-1.4.2.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-marshalling-river-1.4.2.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\protobuf-java-2.5.0.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\protostream-1.0.0.CR1-redhat-1.jar" uk/co/c2b2/jdg/hotrod/CinemaDirectory 6. The Hotrod client at runtime provides the end user with a number of different options to interact with the distributed cache as we can view from the console window below. Client Application Principal API Details We will not provide a detailed overview of the Hotrod application code however we will describe the principal API and code details briefly. In order to interact with the distributed cache on the JDG cluster using the Hotrod protocol, we will use the RemoteCacheManager Object which will allow us to retrieve a remote reference to the distributed cache. We have initialised a Properties object with the list of JDG instances and the associated with HotRod server port on each instance. We can add Cinema objects into the distributed cache using the RemoteCache.put() method. private RemoteCacheManager cacheManager; private RemoteCache cache; ..... Properties properties = new Properties(); properties.setProperty(ConfigurationProperties.SERVER_LIST, "127.0.0.1:11222;127.0.0.1:11322;127.0.0.1:11422"); cacheManager = new RemoteCacheManager(properties); cache = cacheManager.getCache("directory-dist-cache"); ..... cache.put(cinemaKey, cinemalist); In the webinar below, I describe in further detail how to set up a JDG cluster and how to develop and run the JDG application discussed above. For further details on JDG please visit: http://www.redhat.com/products/jbossenterprisemiddleware/data-grid/ Webinar: Introduction to JBoss Data Grid -- Installation, Configuration and Development In this webinar we will look at the basics of setting up JBoss Data Grid covering installation, configuration and development. We will look at practical examples of storing data, viewing the data in the cache and removing it. We will also take a look at the different clustered modes and what effect these have on the storage of your data:
July 25, 2014
by David Winters
· 16,069 Views
article thumbnail
What is Scalable Machine Learning?
scalability has become one of those core concept slash buzzwords of big data. it’s all about scaling out, web scale, and so on. in principle, the idea is to be able to take one piece of code and then throw any number of computers at it to make it fast. the terms “scalable” and “large scale” have been used in machine learning circles long before there was big data. there had always been certain problems which lead to a large amount of data, for example in bioinformatics, or when dealing with large number of text documents. so finding learning algorithms, or more generally data analysis algorithms which can deal with a very large set of data was always a relevant question. interestingly, this issue of scalability were seldom solved using actual scaling in in machine learning, at least not in the big data kind of sense. part of the reason is certainly that multicore processors didn’t yet exist at the scale they do today and that the idea of “just scaling out” wasn’t as pervasive as it is today. instead, “scalable” machine learning is almost always based on finding more efficient algorithms, and most often, approximations to the original algorithm which can be computed much more efficiently. to illustrate this, let’s search for nips papers (the annual advances in neural information processing systems, short nips, conference is one of the big ml community meetings) for papers which have the term “scalable” in the title. here are some examples: scalable inference for logistic-normal topic models … this paper presents a partially collapsed gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation … partially collapsed gibbs sampling is a kind of estimation algorithm for certain graphical models. a scalable approach to probabilistic latent space inference of large-scale networks … with […] an efficient stochastic variational inference algorithm, we are able to analyze real networks with over a million vertices […] on a single machine in a matter of hours … stochastic variational inference algorithm is both an approximation and an estimation algorithm. scalable kernels for graphs with continuous attributes … in this paper, we present a class of path kernels with computational complexity $o(n^2(m + \delta^2 ))$ … and this algorithm has squared runtime in the number of data points, so wouldn’t even scale out well even if you could. usually, even if there is potential for scalability, it usually something that is “embarassingly parallel” (yep, that’s a technical term), meaning that it’s something like a summation which can be parallelized very easily. still, the actual “scalability” comes from the algorithmic side. so how do scalable ml algorithms look like? a typical example are the stochastic gradient descent (sgd) class of algorithms. these algorithms can be used, for example, to train classifiers like linear svms or logistic regression. one data point is considered at each iteration. the prediction error on that point is computed and then the gradient is taken with respect to the model parameters, giving information about how to adapt these parameters slightly to make the error smaller. vowpal wabbit is one program based on this approach and it has a nice definition of what it considers to mean scalable in machine learning: there are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. this project is about approach (b), and it’s reached a state where it may be useful to others as a platform for research and experimentation. so “scalable” means having a learning algorithm which can deal with any amount of data, without consuming ever growing amounts of resources like memory. for sgd type algorithms this is the case, because all you need to store are the model parameters, usually a few ten to hundred thousand double precision floating point value, so maybe a few megabytes in total. the main problem to speed this kind of computation up is how to stream the data by fast enough. to put it differently, not only does this kind of scalability not rely on scaling out, it’s actually not even necessary or possible to scale the computation out because the main state of the computation easily fits into main memory and computations on it cannot be distributed easily. i know that gradient descent is often taken as an example for map reduce and other approaches like in this paper on the architecture of spark , but that paper discusses a version of gradient descent where you are not taking one point at a time, but aggregate the gradient information for the whole data set before making the update to the model parameters. while this can be easily parallelized, it does not perform well in practice because the gradient information tends to average out when computed over the whole data set. if you want to know more, this large scale learning challenge sören sonnneburg organized in 2008 still has valuable information on how to deal with massive data sets. of course, there are things which can be easily scaled well using hadoop or spark, in particular any kind of data preprocessing or feature extraction where you need to apply the same operation to each data point in your data set. another area where parallelization is easy and useful is when you are using cross validation to do model selection where you usually have to train a large number of models for different parameter sets to find the combination which performs best. again, even here there is more potential for even speeding up such computations using better algorithms like in this paper of mine . i’ve just scratched the surface of this, but i hope you got the idea that scalability can mean quite different things. in big data (meaning the infrastructure side of it) what you want to compute is pretty well defined, for example some kind of aggregate over your data set, so you’re left with the question of how to parallelize that computation well. in machine learning, you have much more freedom because data is noisy and there’s always some freedom in how you model your data, so you can often get away with computing some variation of what you originally wanted to do and still perform well. often, this allows you to speed up your computations significantly by decoupling computations. parallelization is important, too, but alone it won’t get you very far. luckily, there are projects like spark and stratosphere/flink which work on providing more useful abstractions beyond map and reduce to make the last part easier for data scientists, but you won’t get rid of the algorithmic design part any time soon.
July 3, 2014
by Mikio Braun
· 18,537 Views · 1 Like
article thumbnail
Exploring Message Brokers: RabbitMQ, Kafka, ActiveMQ, and Kestrel
Explore different message brokers, and discover how these important web technologies impact a customer's backlog of messages, and cluster/data performance.
June 3, 2014
by Yves Trudeau
· 460,590 Views · 86 Likes
article thumbnail
How to Use NodeManager to Control WebLogic Servers
In my previous post, you have seen how we can start a WebLogic admin and multiple managed servers. One downside with that instruction is that those processes will start in foreground and the STDOUT are printed on terminal. If you intended to run these severs as background services, you might want to try the WebLogic node manager wlscontrol.sh tool. I will show you how you can get Node Manager started here. The easiest way is still to create the domain directory with the admin server running temporary and then create all your servers through the /console application as described in last post. Once you have these created, then you may shut down all these processes and start it with Node Manager. 1. cd $WL_HOME/server/bin && startNodeManager.sh & 3. $WL_HOME/common/bin/wlscontrol.sh -d mydomain -r $HOME/domains/mydomain -c -f startWebLogic.sh -s myserver START 4. $WL_HOME/common/bin/wlscontrol.sh -d mydomain -r $HOME/domains/mydomain -c -f startManagedWebLogic.sh -s appserver1 START The first step above is to start and run your Node Manager. It is recommended you run this as full daemon service so even OS reboot can restart itself. But for this demo purpose, you can just run it and send to background. Using the Node Manager we can then start the admin in step 2, and then to start the managed server on step 3. The NodeManager can start not only just the WebLogic server for you, but it can also monitor them and automatically restart them if they were terminated for any reasons. If you want to shutdown the server manually, you may use this command using Node Manager as well: $WL_HOME/common/bin/wlscontrol.sh -d mydomain -s appserver1 KILL The Node Manager can also be used to start servers remotely through SSH on multiple machines. Using this tool effectively can help managing your servers across your network. You may read more details here: http://docs.oracle.com/cd/E23943_01/web.1111/e13740/toc.htm TIPS1: If there is problem when starting server, you may wnat to look into the log files. One log file is the/servers//logs/.out of the server you trying to start. Or you can look into the Node Manager log itself at $WL_HOME/common/nodemanager/nodemanager.log TIPS2: You add startup JVM arguments to each server starting with Node Manager. You need to create a file under /servers//data/nodemanager/startup.properties and add this key value pair:Arguments = -Dmyapp=/foo/bar TIPS3: If you want to explore Windows version of NodeManager, you may want to start NodeManager without native library to save yourself some trouble. Try adding NativeVersionEnabled=false to$WL_HOME/common/nodemanager/nodemanager.properties file.
March 24, 2014
by Zemian Deng
· 14,235 Views
article thumbnail
Shrink Your Time Machine Backups and Free Disk Space
Time Machine is a backup and restore tool from Apple which is very well integrated into OS X. In my personal opinion Time Machine is not yet awesome.
March 18, 2014
by Enrico Maria Crisostomo
· 162,245 Views · 1 Like
article thumbnail
Jersey: Ignoring SSL certificate – javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException
Last week Alistair and I were working on an internal application and we needed to make a HTTPS request directly to an AWS machine using a certificate signed to a different host. We use jersey-client so our code looked something like this: Client client = Client.create(); client.resource("https://some-aws-host.compute-1.amazonaws.com").post(); // and so on When we ran this we predictably ran into trouble: com.sun.jersey.api.client.ClientHandlerException: javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: No subject alternative DNS name matching some-aws-host.compute-1.amazonaws.com found. at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149) at com.sun.jersey.api.client.Client.handle(Client.java:648) at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) at com.sun.jersey.api.client.WebResource.post(WebResource.java:241) at com.neotechnology.testlab.manager.bootstrap.ManagerAdmin.takeBackup(ManagerAdmin.java:33) at com.neotechnology.testlab.manager.bootstrap.ManagerAdminTest.foo(ManagerAdminTest.java:11) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.runners.ParentRunner.run(ParentRunner.java:300) at org.junit.runner.JUnitCore.run(JUnitCore.java:157) at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74) at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:202) at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:65) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120) Caused by: javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: No subject alternative DNS name matching some-aws-host.compute-1.amazonaws.com found. at sun.security.ssl.Alerts.getSSLException(Alerts.java:192) at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1884) at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:276) at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:270) at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1341) at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:153) at sun.security.ssl.Handshaker.processLoop(Handshaker.java:868) at sun.security.ssl.Handshaker.process_record(Handshaker.java:804) at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1016) at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1312) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1339) at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1323) at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:563) at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1300) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468) at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338) at com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:240) at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147) ... 31 more Caused by: java.security.cert.CertificateException: No subject alternative DNS name matching some-aws-host.compute-1.amazonaws.com found. at sun.security.util.HostnameChecker.matchDNS(HostnameChecker.java:191) at sun.security.util.HostnameChecker.match(HostnameChecker.java:93) at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:347) at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:203) at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:126) at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1323) ... 45 more We figured that we needed to get our client to ignore the certificate and came across this Stack Overflow thread which had some suggestions on how to do this. None of the suggestions worked on their own but we ended up with a combination of a couple of the suggestions which did the trick: public Client hostIgnoringClient() { try { SSLContext sslcontext = SSLContext.getInstance( "TLS" ); sslcontext.init( null, null, null ); DefaultClientConfig config = new DefaultClientConfig(); Map properties = config.getProperties(); HTTPSProperties httpsProperties = new HTTPSProperties( new HostnameVerifier() { @Override public boolean verify( String s, SSLSession sslSession ) { return true; } }, sslcontext ); properties.put( HTTPSProperties.PROPERTY_HTTPS_PROPERTIES, httpsProperties ); config.getClasses().add( JacksonJsonProvider.class ); return Client.create( config ); } catch ( KeyManagementException | NoSuchAlgorithmException e ) { throw new RuntimeException( e ); } } You’re welcome Future Mark.
March 2, 2014
by Mark Needham
· 43,054 Views · 8 Likes
article thumbnail
AES-256 Encryption with Java and JCEKS
Security has become a great topic of discussion in the last few years due to the recent releasing of documents from Edward Snowden and the explosion of hacking against online commerce stores like JC Penny, Sony andTarget. While this post will not give you all of the tools to help prevent the use of illegally sourced data, this post will provide a starting point for building a set of tools and tactics that will help prevent the use of data by other parties. This post will show how to adopt AES encryption for strings in a Java environment. It will talk about creating AES keys and storing AES keys in a JCEKS keystore format. A working example of the code in this blog is located athttps://github.com/mike-ensor/aes-256-encryption-utility It is recommended to read each section in order because each section builds off of the previous section, however, this you might want to just jump quickly jump to a particular section. Setup - Setup and create keys with keytool Encrypt - Encrypt messages using byte[] keys Decrypt - Decrypt messages using same IV and key from encryption Obtain Keys from Keystore - Obtain keys from keystore via an alias What is JCEKS? JCEKS stands for Java Cryptography Extension KeyStore and it is an alternative keystore format for the Java platform. Storing keys in a KeyStore can be a measure to prevent your encryption keys from being exposed. Java KeyStores securely contain individual certificates and keys that can be referenced by an alias for use in a Java program. Java KeyStores are often created using the "keytool" provided with the Java JDK. NOTE: It is strongly recommended to create a complex passcode for KeyStores to keep the contents secure. The KeyStore is a file that is considered to be public, but it is advisable to not give easy access to the file. Setup All encryption is governed by laws of each country and often have restrictions on the strength of the encryption. One example is that in the United States, all encryption over 128-bit is restricted if the data is traveling outside of the boarder. By default, the Java JCE implements a strength policy to comply with these rules. If a stronger encryption is preferred, and adheres to the laws of the country, then the JCE needs to have access to the stronger encryption policy. Very plainly put, if you are planning on using AES 256-bit encryption, you must install theUnlimited Strength Jurisdiction Policy Files. Without the policies in place, 256-bit encryption is not possible. Installation of JCE Unlimited Strength Policy This post is focusing on the keys rather than the installation and setup of the JCE. The installation is rather simple with explicit instructions found here (NOTE: this is for JDK7, if using a different JDK, search for the appropriate JCE policy files). Keystore Setup When using the KeyTool manipulating a keystore is simple. Keystores must be created with a link to a new key or during an import of an existing keystore. In order to create a new key and keystore simply type: keytool -genseckey -keystore aes-keystore.jck -storetype jceks -storepass mystorepass -keyalg AES -keysize 256 -alias jceksaes -keypass mykeypass Important Flags In the example above here are the explanations for the keytool's parameters: Keystore Parameters genseckey Generate SecretKey. This is the flag indicating the creation of a synchronous key which will become our AES key keystore Location of the keystore. If the keystore does not exist, the tool will create a new store. Paths can be relative or absolute but must be local storetype this is the type of store (JCE, PK12, JCEKS, etc). JCEKS is used to store symmetric keys (AES) not contained within a certificate. storepass password related to the keystore. Highly recommended to create a strong passphrase for the keystore Key Parameters keyalg algorithm used to create the key (AES/DES/etc) keysize size of the key (128, 192, 256, etc) alias alias given to the newly created key in which to reference when using the key keypass password protecting the use of the key Encrypt As it pertains to data in Java and at the most basic level, encryption is an algorithmic process used to programmatically obfuscate data through a reversible process where both parties have information pertaining to the data and how the algorithm is used. In Java encryption, this involves the use of a Cipher. A Cipher object in the JCE is a generic entry point into the encryption provider typically selected by the algorithm. This example uses the default Java provider but would also work with Bouncy Castle. Generating a Cipher object Obtaining an instance of Cipher is rather easy and the same process is required for both encryption and decryption. (NOTE: Encryption and Decryption require the same algorithm but do not require the same object instance) Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding"); Once we have an instance of the Cipher, we can encrypt and decrypt data according to the algorithm. Often the algorithm will require additional pieces of information in order to encrypt/decrypt data. In this example, we will need to pass the algorithm the bytes containing the key and an initial vector (explained below). Initialization In order to use the Cipher, we must first initialize the cipher. This step is necessary so we can provide additional information to the algorithm like the AES key and the Initial Vector (aka IV). cipher.init(Cipher.ENCRYPT_MODE, secretKeySpecification, initialVector); Parameters The SecretKeySpecification is an object containing a reference to the bytes forming the AES key. The AES key is nothing more than a specific sized byte array (256-bit for AES 256 or 32 bytes) that is generated by the keytool(see above). Alternative Parameteters There are multiple methods to create keys such as a hash including a salt, username and password (or similar). This method would utilize a SHA1 hash of the concatenated strings, convert to bytes and then truncate result to the desired size. This post will not show the generation of a key using this method or the use of a PBE key method using a password and salt. The password and/or salt usage for the keys is handled by the keytool using the inputs during the creation of new keys. Initialization Vector The AES algorithm also requires a second parameter called the Initialiation Vector. The IV is used in the process to randomize the encrypted message and prevent the key from easy guessing. The IV is considered a publicly shared piece of information, but again, it is not recommended to openly share the information (for example, it wouldn't be wise to post it on your company's website). When encrypting a message, it is not uncommon to prepend the message with the IV since the IV will be a set/known size based on the algorithm. NOTE: the AES algorithm will output the same result if using the same IV, key and message. It is recommended that the IV be randomly created each time an encryption takes place. With the newly initialized Cipher, encrypting a message is simple. Simply call: byte[] encryptedMessageInBytes = Cipher.doFinal((message.getBytes("UTF-8")); String base64EncodedEncryptedMsg = BaseEncoding.base64().encode(encryptedMessageInBytes); String base32EncodedEncryptedMsg = BaseEncoding.base32().encode(encryptedMessageInBytes); Encoding Results Byte arrays are difficult to visualize since they often do not form characters in any charset. The best recommendation to solve this is to represent the bytes in HEX (base-16), Double HEX (base-32) or Base64 format. If the message will be passed via a URL or POST parameter, be sure to use a web-safe Base64 encoding. Google Guava library provides a excellent BaseEncoding utility. NOTE: Remember to decode the encoded message before decrypting. Decrypt Decrypting a message is almost a reverse of the encryption process with a few exceptions. Decryption requires a known initialization vector as a parameter unlike the encryption process generating a random IV. Decryption When decrypting, obtain a cipher object with the same process as the encryption method. The Cipher object will need to utilize the exact same algorithm including the method and padding selections. Once the code has obtained a reference to a Cipher object, the next step is to initialize the cipher for decryption and pass in a reference to a key and the initialization vector. // key is the same byte[] key used in encryption SecretKeySpec secretKeySpecification = new SecretKeySpec(key, "AES"); cipher.init(Cipher.DECRYPT_MODE, secretKeySpecification, initialVector); NOTE: The key is stored in the keystore and obtained by the use of an alias. See below for details on obtaining keys from a keystore Once the cipher has been provided the key, IV and initialized for decryption, the cipher is ready to perform the decryption. byte[] encryptedTextBytes = BaseEncoding.base64().decode(message); byte[] decryptedTextBytes = cipher.doFinal(encryptedTextBytes); String origMessage = new String(decryptedTextBytes); Strategies to keep IV The IV used to encrypt the message is important to decrypting the message therefore the question is raised, how do they stay together. One solution is to Base Encode (see above) the IV and prepend it to the encrypted and encoded message: Base64UrlSafe(myIv) + delimiter + Base64UrlSafe(encryptedMessage). Other possible solutions might be contextual such as including an attribute in an XML file with the IV and one for the alias to the key used. Obtain Key from Keystore The beginning of this post has shown how easy it is to create new AES-256 keys that reference an alias inside of a keystore database. The post then continues on how to encrypt and decrypt a message given a key, but has yet shown how to obtain a reference to the key in a keystore. Solution // for clarity, ignoring exceptions and failures InputStream keystoreStream = new FileInputStream(keystoreLocation); KeyStore keystore = KeyStore.getInstance("JCEKS"); keystore.load(keystoreStream, keystorePass.toCharArray()); if (!keystore.containsAlias(alias)) { thrownew RuntimeException("Alias for key not found"); } Key key = keystore.getKey(alias, keyPass.toCharArray()); Parameters keystoreLocation String - Location to local keystore file location keypass String - Password used when creating or modifying the keystore file with keytool (see above) alias String - Alias used when creating new key with keytool (see above) Conclusion This post has shown how to encrypt and decrypt string based messages using the AES-256 encryption algorithm. The keys to encrypt and decrypt these messages are held inside of a JCEKS formatted KeyStore database created using the JDK provided "keytool" utility. The examples in this post should be considered a solid start to encrypting/decrypting symmetric keys such as AES. This should not be considered the only line of defense when encrypting messages, for example key rotation. Key rotation is a method to mitigate risks in the event of a data breach. If an intruder obtains data and manages to hack a single key, the data contained in multiple files should have used several keys to encrypt the data thus bringing down risk of a total exposure loss. All of the examples in this blog post have been condensed into a simple tool allowing for the viewing of keys inside of a keystore, an operation that is not supported out of the box by the JDK keytool. Each aspect of the steps and topics outlined in this post are available at: https://github.com/mike-ensor/aes-256-encryption-utility. NOTE: The examples, sample code and any reference is to be used at the sole implementers risk and there is no implied warranty or liability, you assume all risks.
February 4, 2014
by Mike Ensor
· 102,359 Views · 2 Likes
article thumbnail
Running Multiple ActiveMQ Instances on One Machine
a few weeks ago i started making use of apache activemq again as the jms provider with my mule esb solution. since it had been a few years that i used activemq i thought it would be nice to check out some of the (new) features like the failover transport and other clustering features . to be able to test these last things i needed multiple installations of activemq on my machine. luckily this isn’t very hard to accomplish, although the documentation on this on the activemq site is quite minimal. the first step is to download and unzip the activemq package, which i did at ~/develop/apache-activemq-5.8.0. to create the instances i go to the activemq home directory and use the ‘create’ command like this: cd develop/apache-activemq-5.8.0/ ./bin/activemq create instancea ./bin/activemq create instanceb now if you do a ‘ls -l’ you will see that there are two subdirectories created, ‘instancea’ and ‘instanceb’. since both instances will make use of the default ports we have to modify the config for the second instance. go to the directory ‘develop/apache-activemq-5.8.0/instanceb/conf’ and open the file ‘jetty.xml’ to make the webconsole available at port ’8162′ by modifying the following line: next open the file ‘activemq.xml’ in the same directory and modify the following part: that’s it! make sure both files are saved and start the first instance with: cd ~/develop/apache-activemq-5.8.0/instancea/bin ./instancea console open up a new console and run the commands: cd /users/pascal/develop/apache-activemq-5.8.0/instanceb/bin ./instanceb console now you have two instances running next to each other and can start testing the ‘advanced’ functions of activemq.
January 24, 2014
by $$anonymous$$
· 17,803 Views · 1 Like
article thumbnail
Understanding sun.misc.Unsafe
The biggest competitor to the Java virtual machine might be Microsoft's CLR that hosts languages such as C#. The CLR allows to write unsafe code as an entry gate for low level programming, something that is hard to achieve on the JVM. If you need such advanced functionality in Java, you might be forced to use the JNI which requires you to know some C and will quickly lead to code that is tightly coupled to a specific platform. With sun.misc.Unsafe, there is however another alternative to low-level programming on the Java plarform using a Java API, even though this alternative is discouraged. Nevertheless, several applications rely on sun.misc.Unsafe such for example objenesis and therewith all libraries that build on the latter such for example kryo which is again used in for example Twitter's Storm. Therefore, it is time to have a look, especially since the functionality of sun.misc.Unsafe is considered to become part of Java's public API in Java 9. Getting hold of an instance of sun.misc.Unsafe The sun.misc.Unsafe class is intended to be only used by core Java classes which is why its authors made its only constructor private and only added an equally private singleton instance. The public getter for this instances performs a security check in order to avoid its public use: public static Unsafe getUnsafe() { Class cc = sun.reflect.Reflection.getCallerClass(2); if (cc.getClassLoader() != null) throw new SecurityException("Unsafe"); return theUnsafe; } This method first looks up the calling Class from the current thread’s method stack. This lookup is implemented by another internal class named sun.reflection.Reflection which is basically browsing down the given number of call stack frames and then returns this method’s defining class. This security check is however likely to change in future version. When browsing the stack, the first found class (index 0) will obviously be the Reflection class itself, and the second (index 1) class will be the Unsafe class such that index 2 will hold your application class that was calling Unsafe#getUnsafe(). This looked-up class is then checked for its ClassLoader where a null reference is used to represent the bootstrap class loader on a HotSpot virtual machine. (This is documented in Class#getClassLoader() where it says that “some implementations may use null to represent the bootstrap class loader”.) Since no non-core Java class is normally ever loaded with this class loader, you will therefore never be able to call this method directly but receive a thrown SecurityException as an answer. (Technically, you could force the VM to load your application classes using the bootstrap class loader by adding it to the –Xbootclasspath, but this would require some setup outside of your application code which you might want to avoid.) Thus, the following test will succeed: @Test(expected = SecurityException.class) public void testSingletonGetter() throws Exception { Unsafe.getUnsafe(); } However, the security check is poorly designed and should be seen as a warning against the singleton anti-pattern. As long as the use of reflection is not prohibited (which is hard since it is so widely used in many frameworks), you can always get hold of an instance by inspecting the private members of the class. From the Unsafe class's source code, you can learn that the singleton instance is stored in a private static field called theUnsafe. This is at least true for the HotSpot virtual machine. Unfortunately for us, other virtual machine implementations sometimes use other names for this field. Android’s Unsafe class is for example storing its singleton instance in a field called THE_ONE. This makes it hard to provide a “compatible” way of receiving the instance. However, since we already left the save territory of compatibility by using the Unsafe class, we should not worry about this more than we should worry about using the class at all. For getting hold of the singleton instance, you simply read the singleton field's value: Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe"); theUnsafe.setAccessible(true); Unsafe unsafe = (Unsafe) theUnsafe.get(null); Alternatively, you can invoke the private instructor. I do personally prefer this way since it works for example with Android while extracting the field does not: Constructor unsafeConstructor = Unsafe.class.getDeclaredConstructor(); unsafeConstructor.setAccessible(true); Unsafe unsafe = unsafeConstructor.newInstance(); The price you pay for this minor compatibility advantage is a minimal amount of heap space. The security checks performed when using reflection on fields or constructors are however similar. Create an Instance of a Class Without Calling a Constructor The first time I made use of the Unsafe class was for creating an instance of a class without calling any of the class's constructors. I needed to proxy an entire class which only had a rather noisy constructor but I only wanted to delegate all method invocations to a real instance which I did however not know at the time of construction. Creating a subclass was easy and if the class had been represented by an interface, creating a proxy would have been a straight-forward task. With the expensive constructor, I was however stuck. By using the Unsafe class, I was however able to work my way around it. Consider a class with an artificially expensive constructor: class ClassWithExpensiveConstructor { private final int value; private ClassWithExpensiveConstructor() { value = doExpensiveLookup(); } private int doExpensiveLookup() { try { Thread.sleep(2000); } catch (InterruptedException e) { e.printStackTrace(); } return 1; } public int getValue() { return value; } } Using the Unsafe, we can create an instance of ClassWithExpensiveConstructor (or any of its subclasses) without having to invoke the above constructor, simply by allocating an instance directly on the heap: @Test public void testObjectCreation() throws Exception { ClassWithExpensiveConstructor instance = (ClassWithExpensiveConstructor) unsafe.allocateInstance(ClassWithExpensiveConstructor.class); assertEquals(0, instance.getValue()); } Note that final field remained uninitialized by the constructor but is set with its type's default value. Other than that, the constructed instance behaves like a normal Java object. It will for example be garbage collected when it becomes unreachable. The Java run time itself creates objects without calling a constructor when for example creating objects for deserialization. Therefore, the ReflectionFactory offers even more access to individual object creation: @Test public void testReflectionFactory() throws Exception { @SuppressWarnings("unchecked") Constructor silentConstructor = ReflectionFactory.getReflectionFactory() .newConstructorForSerialization(ClassWithExpensiveConstructor.class, Object.class.getConstructor()); silentConstructor.setAccessible(true); assertEquals(10, silentConstructor.newInstance().getValue()); } Note that the ReflectionFactory class only requires a RuntimePermission called reflectionFactoryAccess for receiving its singleton instance and no reflection is therefore required here. The received instance of ReflectionFactory allows you to define any constructor to become a constructor for the given type. In the example above, I used the default constructor of java.lang.Object for this purpose. You can however use any constructor: class OtherClass { private final int value; private final int unknownValue; private OtherClass() { System.out.println("test"); this.value = 10; this.unknownValue = 20; } } @Test public void testStrangeReflectionFactory() throws Exception { @SuppressWarnings("unchecked") Constructor silentConstructor = ReflectionFactory.getReflectionFactory() .newConstructorForSerialization(ClassWithExpensiveConstructor.class, OtherClass.class.getDeclaredConstructor()); silentConstructor.setAccessible(true); ClassWithExpensiveConstructor instance = silentConstructor.newInstance(); assertEquals(10, instance.getValue()); assertEquals(ClassWithExpensiveConstructor.class, instance.getClass()); assertEquals(Object.class, instance.getClass().getSuperclass()); } Note that value was set in this constructor even though the constructor of a completely different class was invoked. Non-existing fields in the target class are however ignored as also obvious from the above example. Note that OtherClass does not become part of the constructed instances type hierarchy, the OtherClass's constructor is simply borrowed for the "serialized" type. Not mentioned in this blog entry are other methods such as Unsafe#defineClass, Unsafe#defineAnonymousClass or Unsafe#ensureClassInitialized. Similar functionality is however also defined in the public API's ClassLoader. Native Memory Allocation Did you ever want to allocate an array in Java that should have had more than Integer.MAX_VALUE entries? Probably not because this is not a common task, but if you once need this functionality, it is possible. You can create such an array by allocating native memory. Native memory allocation is used by for example direct byte buffers that are offered in Java's NIO packages. Other than heap memory, native memory is not part of the heap area and can be used non-exclusively for example for communicating with other processes. As a result, Java's heap space is in competition with the native space: the more memory you assign to the JVM, the less native memory is left. Let us look at an example for using native (off-heap) memory in Java with creating the mentioned oversized array: class DirectIntArray { private final static long INT_SIZE_IN_BYTES = 4; private final long startIndex; public DirectIntArray(long size) { startIndex = unsafe.allocateMemory(size * INT_SIZE_IN_BYTES); unsafe.setMemory(startIndex, size * INT_SIZE_IN_BYTES, (byte) 0); } } public void setValue(long index, int value) { unsafe.putInt(index(index), value); } public int getValue(long index) { return unsafe.getInt(index(index)); } private long index(long offset) { return startIndex + offset * INT_SIZE_IN_BYTES; } public void destroy() { unsafe.freeMemory(startIndex); } } @Test public void testDirectIntArray() throws Exception { long maximum = Integer.MAX_VALUE + 1L; DirectIntArray directIntArray = new DirectIntArray(maximum); directIntArray.setValue(0L, 10); directIntArray.setValue(maximum, 20); assertEquals(10, directIntArray.getValue(0L)); assertEquals(20, directIntArray.getValue(maximum)); directIntArray.destroy(); } First, make sure that your machine has sufficient memory for running this example! You need at least (2147483647 + 1) * 4 byte = 8192 MB of native memory for running the code. If you have worked with other programming languages as for example C, direct memory allocation is something you do every day. By calling Unsafe#allocateMemory(long), the virtual machine allocates the requested amount of native memory for you. After that, it will be your responsibility to handle this memory correctly. The amount of memory that is required for storing a specific value is dependent on the type's size. In the above example, I used an int type which represents a 32-bit integer. Consequently a single int value consumes 4 byte. For primitive types, size is well-documented. It is however more complex to compute the size of object types since they are dependent on the number of non-static fields that are declared anywhere in the type hierarchy. The most canonical way of computing an object's size is using the Instrumented class from Java's attach API which offers a dedicated method for this purpose called getObjectSize. I will however evaluate another (hacky) way of dealing with objects in the end of this section. Be aware that directly allocated memory is always native memory and therefore not garbage collected. You therefore have to free memory explicitly as demonstrated in the above example by a call to Unsafe#freeMemory(long). Otherwise you reserved some memory that can never be used for something else as long as the JVM instance is running what is a memory leak and a common problem in non-garbage collected languages. Alternatively, you can also directly reallocate memory at a certain address by calling Unsafe#reallocateMemory(long, long) where the second argument describes the new amount of bytes to be reserved by the JVM at the given address. Also, note that the directly allocated memory is not initialized with a certain value. In general, you will find garbage from old usages of this memory area such that you have to explicitly initialize your allocated memory if you require a default value. This is something that is normally done for you when you let the Java run time allocate the memory for you. In the above example, the entire area is overriden with zeros with help of the Unsafe#setMemory method. When using directly allocated memory, the JVM will neither do range checks for you. It is therefore possible to corrupt your memory as this example shows: @Test public void testMallaciousAllocation() throws Exception { long address = unsafe.allocateMemory(2L * 4); unsafe.setMemory(address, 8L, (byte) 0); assertEquals(0, unsafe.getInt(address)); assertEquals(0, unsafe.getInt(address + 4)); unsafe.putInt(address + 1, 0xffffffff); assertEquals(0xffffff00, unsafe.getInt(address)); assertEquals(0x000000ff, unsafe.getInt(address + 4)); } Note that we wrote a value into the space that was each partly reserved for the first and for the second number. This picture might clear things up. Be aware that the values in the memory run from the "right to the left" (but this might be machine dependent). The first row shows the initial state after writing zeros to the entire allocated native memory area. Then we override 4 byte with an offset of a single byte using 32 ones. The last row shows the result after this writing operation. Finally, we want to write an entire object into native memory. As mentioned above, this is a difficult task since we first need to compute the size of the object in order to know the amount of size we need to reserve. The Unsafe class does however not offer such functionality. At least not directly since we can at least use the Unsafe class to find the offset of an instance's field which is used by the JVM when itself allocates objects on the heap. This allows us to find the approximate size of an object: public long sizeOf(Class clazz) long maximumOffset = 0; do { for (Field f : clazz.getDeclaredFields()) { if (!Modifier.isStatic(f.getModifiers())) { maximumOffset = Math.max(maximumOffset, unsafe.objectFieldOffset(f)); } } } while ((clazz = clazz.getSuperclass()) != null); return maximumOffset + 8; } This might at first look cryptic, but there is no big secret behind this code. We simply iterate over all non-static fields that are declared in the class itself or in any of its super classes. We do not have to worry about interfaces since those cannot define fields and will therefore never alter an object's memory layout. Any of these fields has an offset which represents the first byte that is occupied by this field's value when the JVM stores an instance of this type in memory, relative to a first byte that is used for this object. We simply have to find the maximum offset in order to find the space that is required for all fields but the last field. Since a field will never occupy more than 64 bit (8 byte) for a long or double value or for an object reference when run on a 64 bit machine, we have at least found an upper bound for the space that is used to store an object. Therefore, we simply add these 8 byte to the maximum index and we will not run into danger of having reserved to little space. This idea is of course wasting some byte and a better algorithm should be used for production code. In this context, it is best to think of a class definition as a form of heterogeneous array. Note that the minimum field offset is not 0 but a positive value. The first few byte contain meta information. The graphic below visualizes this principle for an example object with an int and a long field where both fields have an offset. Note that we do not normally write meta information when writing a copy of an object into native memory so we could further reduce the amount of used native memoy. Also note that this memory layout might be highly dependent on an implementation of the Java virtual machine. With this overly careful estimate, we can now implement some stub methods for writing shallow copies of objects directly into native memory. Note that native memory does not really know the concept of an object. We are basically just setting a given amount of byte to values that reflect an object's current values. As long as we remember the memory layout for this type, these byte contain however enough information to reconstruct this object. public void place(Object o, long address) throws Exception { Class clazz = o.getClass(); do { for (Field f : clazz.getDeclaredFields()) { if (!Modifier.isStatic(f.getModifiers())) { long offset = unsafe.objectFieldOffset(f); if (f.getType() == long.class) { unsafe.putLong(address + offset, unsafe.getLong(o, offset)); } else if (f.getType() == int.class) { unsafe.putInt(address + offset, unsafe.getInt(o, offset)); } else { throw new UnsupportedOperationException(); } } } } while ((clazz = clazz.getSuperclass()) != null); } public Object read(Class clazz, long address) throws Exception { Object instance = unsafe.allocateInstance(clazz); do { for (Field f : clazz.getDeclaredFields()) { if (!Modifier.isStatic(f.getModifiers())) { long offset = unsafe.objectFieldOffset(f); if (f.getType() == long.class) { unsafe.putLong(instance, offset, unsafe.getLong(address + offset)); } else if (f.getType() == int.class) { unsafe.putLong(instance, offset, unsafe.getInt(address + offset)); } else { throw new UnsupportedOperationException(); } } } } while ((clazz = clazz.getSuperclass()) != null); return instance; } @Test public void testObjectAllocation() throws Exception { long containerSize = sizeOf(Container.class); long address = unsafe.allocateMemory(containerSize); Container c1 = new Container(10, 1000L); Container c2 = new Container(5, -10L); place(c1, address); place(c2, address + containerSize); Container newC1 = (Container) read(Container.class, address); Container newC2 = (Container) read(Container.class, address + containerSize); assertEquals(c1, newC1); assertEquals(c2, newC2); } Note that these stub methods for writing and reading objects in native memory only support int and long field values. Of course, Unsafe supports all primitive values and can even write values without hitting thread-local caches by using the volatile forms of the methods. The stubs were only used to keep the examples concise. Be aware that these "instances" would never get garbage collected since their memory was allocated directly. (But maybe this is what you want.) Also, be careful when precalculating size since an object's memory layout might be VM dependent and also alter if a 64-bit machine runs your code compared to a 32-bit machine. The offsets might even change between JVM restarts. For reading and writing primitives or object references, Unsafe provides the following type-dependent methods: getXXX(Object target, long offset): Will read a value of type XXX from target's address at the specified offset. putXXX(Object target, long offset, XXX value): Will place value at target's address at the specified offset. getXXXVolatile(Object target, long offset): Will read a value of type XXX from target's address at the specified offset and not hit any thread local caches. putXXXVolatile(Object target, long offset, XXX value): Will place value at target's address at the specified offset and not hit any thread local caches. putOrderedXXX(Object target, long offset, XXX value): Will place value at target's address at the specified offet and might not hit all thread local caches. putXXX(long address, XXX value): Will place the specified value of type XXX directly at the specified address. getXXX(long address): Will read a value of type XXX from the specified address. compareAndSwapXXX(Object target, long offset, long expectedValue, long value): Will atomicly read a value of type XXX from target's address at the specified offset and set the given value if the current value at this offset equals the expected value. Be aware that you are copying references when writing or reading object copies in native memory by using the getObject(Object, long) method family. You are therefore only creating shallow copies of instances when applying the above method. You could however always read object sizes and offsets recursively and create deep copies. Pay however attention for cyclic object references which would cause infinitive loops when applying this principle carelessly. Not mentioned here are existing utilities in the Unsafe class that allow manipulation of static field values sucht as staticFieldOffset and for handling array types. Finally, both methods named Unsafe#copyMemory allow to instruct a direct copy of memory, either relative to a specific object offset or at an absolute address as the following example shows: @Test public void testCopy() throws Exception { long address = unsafe.allocateMemory(4L); unsafe.putInt(address, 100); long otherAddress = unsafe.allocateMemory(4L); unsafe.copyMemory(address, otherAddress, 4L); assertEquals(100, unsafe.getInt(otherAddress)); } Throwing Checked Exceptions Without Declaration There are some other interesting methods to find in Unsafe. Did you ever want to throw a specific exception to be handled in a lower layer but you high layer interface type did not declare this checked exception? Unsafe#throwException allows to do so: @Test(expected = Exception.class) public void testThrowChecked() throws Exception { throwChecked(); } public void throwChecked() { unsafe.throwException(new Exception()); } Native Concurrency The park and unpark methods allow you to pause a thread for a certain amount of time and to resume it: @Test public void testPark() throws Exception { final boolean[] run = new boolean[1]; Thread thread = new Thread() { @Override public void run() { unsafe.park(true, 100000L); run[0] = true; } }; thread.start(); unsafe.unpark(thread); thread.join(100L); assertTrue(run[0]); } Also, monitors can be acquired directly by using Unsafe using monitorEnter(Object), monitorExit(Object) and tryMonitorEnter(Object). A file containing all the examples of this blog entry is available as a gist.
January 14, 2014
by Rafael Winterhalter
· 152,606 Views · 39 Likes
article thumbnail
Semantic Search with Solr and NumPy
Built upon Lucene, Solr provides fast, highly scalable, and easily maintainable full-text search capabilities. However, under the hood, Solr is really just a sophisticated token-matching engine. What’s missing? — Semantic Search! Consider three, somewhat silly documents: Yellow banana peels. A banana is a long yellow fruit. This mystery fruit is long and yellow and has a peel. Now what happens if you search for the term “banana?" Under normal circumstances you only get back the first and second document. But why shouldn’t you also get back the third document? It’s obviously talking about bananas! Semantic Search via Collaborative Filtering Colleague Doug Turnbull and I recently set about to right this wrong with help from a machine learning technique called collaborative filtering. Collaborative filtering is most often used as a basis for recommendation algorithms. For example, collaborative filtering algorithms were the central focus of the now-famous Netflix Prize which awarded $1Million to the team which could build the best movie recommendation engine. When dealing with recommendations, collaborative filtering works by mathematically identifying commonalities in groups of users based upon the movies that they enjoyed. Then, if you appear to fall in one of those groups, the recommendation engine will point you towards a movie that a) you haven’t watched and b) you are likely to enjoy. So what does this have to do with Semantic Search? Everything! In just the same way that certain users gravitate towards certain movies, certain words commonly co-occur in the same documents. When working with Semantic Search, rather than recommending user to movies that they would likely enjoy, we are going to identify words that are likely to belong in a given document, whether or not they actually occurred there. The math is exactly the same! Here’s how the process works: First we identify a text field of interest in our documents and extract the associated term-document matrix for external processing. Each element of this term-document matrix indicates the strength of a particular term within a particular document (where strength can be anything, but will likely be either term frequency or TF*IDF). Next, collaborative filtering is applied to the term-document matrix which effectively generates a pseudo-term-document matrix. This pseudo-term-document matrix is the same size and shape as the original term-document matrix and references the same terms and documents, but the numbers are slightly different. These new values indicate the strength that a particular term should have in a particular document once noisy data is removed. Finally, the high-scoring values in the pseudo-term-document matrix are mapped back to the associated terms. These terms are then injected back into Solr in a new field which can be used for Semantic Search. Demo Time! So let’s consider an example case. As in plenty of our previous posts, we will be using the Science Fiction Stack Exchange. Why? Because we’re all nerds and with such a familiar topic, we can quickly intuit whether or not a search is returning relevant results. In this data set, the field of interest is the Body field because it contains the contents of all questions and answers. So, now that we’ve decided upon our demo dataset, we’re ready run the analysis. If you’d like to follow along, then please take a look at our git repo. This repo contains the example SciFi data set, the Semantic Search code, and README to get you going. However I’m going execute everything from within Python: >>> from SemanticAnalyzer import * >>> stvc = SolrTermVectorCollector(field='Body',feature='tf',batchSize=1000) >>> tdc = TermDocCollection(source=stvc,numTopics=150) That last line takes a few minutes. If it’s in the AM where you are, grab a coffee. If it’s in the PM, grab a beer. Once that line completes, we will have successfully extracted the term-document matrix from Solr. Now let’s play with it for a bit. One of the cool side effects of this analysis is the ability to quickly find words that commonly occur together. Let’s give it an easy test; here are the 30 most highly correlated words with the word ‘vader’ (as in Darth Vader). >>> tdc.getRelatedTerms('vader',30) Did you notice that pause when you called the function? That was the collaborative filtering taking place. The results of that process have now been saved, so additional calls will return quite quickly. vader luke emperor darth palpatin anakin sith skywalk sidiou apprentic empir luca side star son forc turn kill death rule suit father question jedi command obi tarkin dark wan plan Hey, not bad! Everything here seems very reasonably connected with Mr. Vader. You may notice some odd spellings here; that’s because these are the indexed terms, therefore they are stemmed. Let’s try again with a different term; this time everyone’s favorite wizard: >>> tdc.getRelatedTerms('potter',30) harri potter voldemort wizard snape death magic jame love spell time rowl lili eater travel seri hous hand hogwart three find wormtail kill slytherin hallow secret deathli muggl order lord Again, pretty good! One last try, and we’ll make it a little more challenging – a vague adjective: >>> tdc.getRelatedTerms('dark',30) dark side jedi sith eater lord death mark snape magic curs evil forc luke mercuri cave yoda jame palpatin dagobah anakin black call wizard slytherin live light siriu matter voldemort Indeed, most of these terms are like a hall of fame of dark things from Star Wars and Harry Potter. Now since the word correlation has proven itself out, it’s time to generate the pseudo terms and post them back to Solr. >>> SolrBlurredTermUpdater(tdc,blurredField="BodyBlurred").pushToSolr(0.1) This line will probably see you to the end of your coffee or beer (it takes about 10 minutes on my machine). But once it’s done, you can start issuing searches to Solr. Solr Results Here’s an example of Semantic Search using Solr: http://localhost:8983/solr/select/?q=-Body:dark +BodyBlurred:dark The Body field contains the original text while the BodyBlurred contains the pseudo-terms. So this finds all documents that do not include the term dark, but presumably contain dark content. Take a look at the documents that come back: { Body: " In the John Carter movie (2012), he shows off some of his powers, like jumping abnormally high, but I have difficulty evaluating his strength. On the one side, he shows great strength, as when he kills a thark warrior with one hand, but he is also quite mistreated by them. He also seems helpless when he is strangled by Tars Tarkas. Why does the strength he shows seem so inconsistent? ", BodyBlurred: "tv great movi control kill consid hand dark side power long mutant fight machin light abil sauron wormtail hulk" }, { Body: " In the movies, the Nazgul ride black horses with armour. I was wondering if that is all they are, or do they have some sort of magic? Are they evil? ", BodyBlurred: "movi black magic dark demon engin hous aveng slytherin" }, { Body: " The remaining Black Brother from the prologue of A Game of Thrones is apparently the deserter who is beheaded in the beginning of the book. But how did he manage to get to Winterfell from the other side of The Wall? Or did the show throw me off track and in the book there weren't any survivors, so the deserter is someone else? ", BodyBlurred: "book watch black hole dark side plai long game demon engin light turn district" }, { Body: " Was this ever discussed in any episode, or as a side-plot somewhere? ", BodyBlurred: "episod dark side light" } Not bad – most of those topics are rather … dark. Though check out that last result. So … maybe there are still some improvements we can make! But you also have to remember that we’re dealing with word correlation here, and I can only guess that somewhere else in the corpus, dark side-plots and dark episodes were surely discussed. Speaking of word correlations, check out this gem: { Body: " You're correct, Enterprise is the only Star Trek that fits into both the original and the new 2009 movie timelines. From the perspective of the Enterprise characters, both are possible futures, given the over-arcing conceit of the show was a Temporal Cold War, so its future is in flux and could line up with either of the timelines we're familiar with, or with an entirely different future. ", BodyBlurred: "answer charact place klingon star trek design travel crew watch work movi happen enterpris featur futur exist origin 2009 chang altern timelin war to version event captain gener pictur tng creat iii galaxi theori return alter voyag entir fry turn kirk paradox biff doc marti feder 1955 starship 2015 class hero centuri tempor uss phoenix mirror river 800 ncc 1701 simon conner skynet alisha" } The original document involves Star Trek and time travel. And appropriately, the pseudo terms include Star Trek things and time-travel terms … but do you see anything funny? That’s right, Biff, Doc and Marti made their way into the pseudo terms, likely because of their role in the popular time-travel film “Back to the Future.” Speaking of the future … Future Work Semantic Search with Solr is hot right now. In the upcoming Dublin LuceneRevolution I know of at least three related talks that have been submitted (one of them my own); I have heard that MapR is working on a Solr Semantic Search/Recommendation engine built atop of their Hadoop offering; and I suspect that with Cloudera’s recent foray into Solr with Mark Miller, they will also be working on the same thing. What’s next for our work? Recommendations! Remember, that’s how we started this conversation. E-commerce recommendations is a simple extension of the work presented above. Given an inventory catalog (e.g., product title, description, etc.), and given a history of user purchases, we can build a search-aware recommendation engine. That is, when a customer searches for a particular item, they will receive results as usual, except that the results will be boosted with items that they are more likely to purchase. How? Because we know what type of customer they are and what products that type of customer is more likely to buy! Do you have a good case for Solr Semantic Search and Recommendation? We’d love to hear it, please contact us!
September 30, 2013
by John Berryman
· 11,675 Views
article thumbnail
Algorithm of the Week: Quicksort - Three-way vs. Dual-pivot
It’s no news that quicksort is considered one of the most important algorithms of the century and that it is the de facto system sort for many languages, including the Arrays.sort in Java. So, what’s new about quicksort? Well, nothing except that I just now figured out (two damn years after the release of Java 7) that the quicksort implementation of Arrays.sort has been replaced with a variant called dual-pivot quicksort. This thread is not only awesome for this reason but also how humble Jon Bentley and Joshua Bloch really are. What did I do then? Just like everybody else, I wanted to implement it and do some benchmarking against some 10 million integers (random and duplicate). Oddly enough, I found the following results: Random Data Basic quicksort: 1222 ms Three-way quicksort: 1295 ms (seriously!) Dual-pivot quicksort: 1066 ms Duplicate Data Basic quicksort: 378 ms Three-way quicksort: 15 ms Dual-pivot quicksort: six ms Stupid Question One I am afraid that I am missing something in the implementation of the three-way partition. Across several runs against random inputs (of 10 million numbers), I could see that the single pivot always performs better (although the difference is less than 100 milliseconds for 10 million numbers). I understand that the whole purpose of making the three-way quicksort the default quicksort until now is that it does not give 0(n2) performance on duplicate keys, which is very evident when I run it against duplicate inputs. But is it true that, for the sake of handling duplicate data, a small penalty is taken by the three-way quicksort? Or is my implementation bad? Stupid Question Two My dual-pivot implementation (link below) does not handle duplicates well. It takes forever (0(n2)) to execute. Is there a good way to avoid this? Referring to the Arrays.sort implementation, I figured out that ascending sequences and duplicates are eliminated well before the actual sorting is done. So, as a dirty fix, if the pivots are equal I fast-forward the lowerIndex until it is different than pivot2. Is this a fair implementation? else if (pivot1==pivot2){ while (pivot1==pivot2 && lowIndex=j) break; exchange(input, i, j); } exchange(input, pivotIndex, j); return j; } } Three-way package basics.sorting.quick; import static basics.shuffle.KnuthShuffle.shuffle; import static basics.sorting.utils.SortUtils.exchange; import static basics.sorting.utils.SortUtils.less; public class QuickSort3Way { public void sort (int[] input){ //input=shuffle(input); sort (input, 0, input.length-1); } public void sort(int[] input, int lowIndex, int highIndex) { if (highIndex<=lowIndex) return; int lt=lowIndex; int gt=highIndex; int i=lowIndex+1; int pivotIndex=lowIndex; int pivotValue=input[pivotIndex]; while (i<=gt){ if (less(input[i],pivotValue)){ exchange(input, i++, lt++); } else if (less (pivotValue, input[i])){ exchange(input, i, gt--); } else{ i++; } } sort (input, lowIndex, lt-1); sort (input, gt+1, highIndex); } } Dual-pivot package basics.sorting.quick; import static basics.shuffle.KnuthShuffle.shuffle; import static basics.sorting.utils.SortUtils.exchange; import static basics.sorting.utils.SortUtils.less; public class QuickSortDualPivot { public void sort (int[] input){ //input=shuffle(input); sort (input, 0, input.length-1); } private void sort(int[] input, int lowIndex, int highIndex) { if (highIndex<=lowIndex) return; int pivot1=input[lowIndex]; int pivot2=input[highIndex]; if (pivot1>pivot2){ exchange(input, lowIndex, highIndex); pivot1=input[lowIndex]; pivot2=input[highIndex]; //sort(input, lowIndex, highIndex); } else if (pivot1==pivot2){ while (pivot1==pivot2 && lowIndex
August 14, 2013
by Arun Manivannan
· 46,424 Views
article thumbnail
A Prodecedural City in 100 Lines of Three.js
The above skyline is from "City," a simple flight-simulator by Ricardo "Mr. Doob" Cabello. "City" is a demo of the capabilities of WebGL, and is written in an impressive 100 lines of JavaScript using Three.js. In his blog post "How to Do a Procedural City in 100 Lines," Jerome Etienne walks you through the process of recreating Cabello's "City." The secret lies in creating 20,000 cubes that are given random sizes and positions and merging them together to create a city. Let's hope that this algorithm is never used for actual city-planning, though, because the buildings can randomly intersect each other and there are no logical spaces for streets! Check out Etienne's blog post here and watch the screencast introduction here:
August 9, 2013
by Allen Coin
· 8,846 Views
  • Previous
  • ...
  • 199
  • 200
  • 201
  • 202
  • 203
  • 204
  • 205
  • 206
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×