Data Resources

The Latest Data Topics

Integration of Amazon Redshift Data Warehouse with Talend Data Integration

In this blog post, I will show you how to "ETL" all kinds of data to Amazon’s cloud data warehouse Redshift wit Talend’s big data components. Let’s begin with a short introduction to Amazon Redshift (copied from website): "Amazon Redshift is [part of Amazon Web Services (AWS) and] a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. With a few clicks in the AWS Management Console, customers can launch a Redshift cluster, starting with a few hundred gigabytes and scaling to a petabyte or more, for under $1,000 per terabyte per year. Traditional data warehouses require significant time and resource to administer, especially for large datasets. In addition, the financial cost associated with building, maintaining, and growing self-managed, on-premise data warehouses is very high. Amazon Redshift not only significantly lowers the cost of a data warehouse, but also makes it easy to analyze large amounts of data very quickly.“ Sounds interesting! And indeed, we already see companies using Talend’s Redshift connectors. From Talend perspective it is not much more than just another database. If you have ever used a Talend connector, you can integrate to Redshift within some minutes. In the next sections, I will describe all necessary steps and give some hints regarding configuration issues and performance improvements. Be aware: You need Talend Open Studio for Data Integration (Apache License, open source) or any Talend Enterprise Edition / Platform which contains the Cloud components to see and use Amazon Redshift connectors. The open source edition offers all connectors and functionality to integrate with Amazon Redshift. However, Enterprise versions offer some more features (e.g. versioning), comfort (e.g. wizards) and commercial support. Setup Amazon Redshift Setup of Amazon Redshift is very easy. Just follow Amazon‘s getting started guide: http://docs.aws.amazon.com/redshift/latest/gsg/welcome.html. Like every other AWS guide, it is very easy to understand and use. Be aware, that you just have to do step 1, 2 and 3 of the getting started guide for using it with Talend. Some hints: - Step 1 („before you begin“): Just sign up. Client tools and drivers are not necessary because they are already installed within Talend Studio. - Step 2 („launch a cluster“): Yes, please start your cluster! - Step 3(„authorize access“): If you are not sure what to do here, select Connection Type = CIDR/IP. Find out your IP address (http://whatismyipaddress.com) and enter it with „/32“ at the end. Example: „192.168.1.1/32“ Now you can connect to Amazon Redshift from your Talend Studio on your local computer. Step 4 (connect) and step 5 (create table, data, queries) are not necessary, this will be done from Talend Studio. Of course, you should not forget to delete your cluster (step 7) when you are done. Otherwise, you will pay for every hour, even if you do not access your DWH. Connect to Amazon Redshift from Talend Studio Create a new connection to Amazon Redshift database as you do with every other relational database. The easiest way is to use „DB Connection Wizard“ in metadata. Just enter your connection information and check if it works. You get all information about configuration from Amazon Web Console. The connection string looks something like this: „jdbc:paraccel://talend-demo-cluster.cp8t6c5.eu-west-1.redshift.amazonaws.com:5439/dev“ Next, right click on the created connection and select „retrieve schema“. „public“ is the default schema which you (have to) use. Now, you are ready to use this connection within Talend Jobs to write to Amazon Redshift and read from it. Create Talend Jobs (Write, Read, Delete) Amazon Redshift components work like any other Talend (relational) database components. Look at www.help.talend.com for more information if you have not used them before (or just try them out, they are very self-explanatory). You just have to drag&drop your connection from metadata . Afterwards, you can easily write data (tRedShiftOutput), read data (tRedshiftInput), or do any other queries such as delete or copy (tRedShiftRow). In the following job, I start with deleting all content in the Amazon Redshift table. Then, I read data from a MySQL table and insert it into an Amazon Redshift table. The table is created automatically (as I have configured it this way). After this subjob is finished, I read the data again, and store it to a CSV file (which is also created automatically). Of course, this is no business use case, but it shows how to use different Amazon Redshift components. Query Data from Amazon Redshift You can connect to Amazon Redshift directly from Talend Studio to explore and query data of the DWH. Thus, no other database tool is required. Just right click on your Amazon Redshift connection in metadata and select „edit queries“. Here you can define, execute and save SQL queries. Improve Performance Write performance of Amazon Redshift is relatively low compared to „classical“ relational databases (in your data center) as you have to upload all data into the cloud. Different alternatives exist to improve performance: - Bulk inserts: „Extended insert“ (in advanced settings) improves performance a lot, but still not to hyperspeed… Also, as it is bulk, you can just do inserts! It is not compatible to „rejects“ or „updates“ - AWS S3 and COPY command: S3 is Amazon’s „simple storage service“, a key-value store – also called NoSQL today – for storing very large objects. You can use Amazon Redshift’s COPY command (http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html) to transfer data from S3 to Amazon Redshift with good performance. Though, you still have to copy data to S3 before, same „cloud problem“ here. The COPY command can be used with tRedshiftRow, so no problem at all from Talend perspective. To transfer data to S3, you can either use the Talend S3 components from Talendforge, Talend’s open source community (http://www.talendforge.org/exchange), or use camel-s3, an Apache Camel component which is included in Talend ESB. The latter is an option, if you use Talend Data Services which combines Talend DI and Talend ESB in its unified platform. Summary You need not be a cloud or DWH expert, or an expert developer to integrate with Amazon’s cloud data warehouse Redshift. It is very easy with Talend’s integration solutions. Just drag&drop, configure, do some graphical mappings / transformations (if necessary), that’s it. Code is generated. Job runs. You can integrate Amazon Redshift almost as simple as any other relational database. Just be aware of some cloud specific security and performance issues. With Talend, you can easily „ETL“ all data from different sources to Redshift and store it there for under $1,000 per terabyte per year – even with the open source version! Best regards, Kai Wähner (Contact and feedback via @KaiWaehner, www.kai-waehner.de, LinkedIn / Xing) This is content from my blog: http://www.kai-waehner.de/blog/2013/06/26/integration-of-amazon-redshift-cloud-data-warehouse-aws-saas-dwh-with-talend-data-integration-di-big-data-bd-enterprise-service-bus-esb/

June 27, 2013

by Kai Wähner

CORE

· 20,572 Views · 1 Like

Implementing Memcached a Servlet Filter for Spring MVC-Based RESTful Services

I have a number of Spring MVC based RESTful services that return JSON. In 90% of the cases, the state of objects these services return will not change within a 24 hour period. This makes them (the JSON objects) perfect candidates for simple caching enabled by memcached. The idea was to have every request to Spring controllers intercepted, cache key generated and checked against the cache. If the key and corresponding value (JSON string) is available (a cache hit), it is returned to the caller as-is without making a full round trip to the database. However, if the cache has no entry for the key and hence no corresponding value (a cache miss), the call is forwarded to the controller, which in turn calls the logic to fetch desired object from the database and not only return it to the caller but also update the cache with the returned content. Keys are generated using the URL of the service in case of GET requests and the URL concatenated with POSTed input (as JSON) in case of POST requests. The resultant strings are encoded with MD5 to come up with a 32 character cache key which is well within the 250 character key length limit of memcached. Performance impact of using MD5 is yet to be evaluated during our load testing cycle. I started off trying to get hold of JSON response in the postHandle method of a Spring HandlerInterceptor. However since we are using @ResponseBody annotation in our controller, the JSON would be written directly to the stream. The ModelAndView was of course null because of this reason. If we removed the annotation and returned ModelAndView from the controller, the intended JSON object got enclosed in a map wrapper. A quick question on stack overflow didn’t help as the only suggestion I got was to extract my original object from the map wrapper. I wanted to keep this option (as discussed here as well ) as my last resort. The solution I eventually came up with involved Replacing the HandlerInterceptor with Servlet Filters Using DelegatingFilterProxy to make my filters spring application context aware Using HttpServletRequestWrapper to get control of the POST request body in the filter on the way in Using HttpServletResponseWrapper to get control of the response content in the filter on the way out True, its probably a more complex solution than just overriding MappingJacksonJsonView and extracting my JSON object, but it is more generic as it does not assume that all my content will always be JSON. Lets first start with the filter definition in the web.xml cacheFilter org.springframework.web.filter.DelegatingFilterProxy ... cacheFilter /* A standard filter configuration except for the fact that the filter class is always going to be org.springframework.web.filter.DelegatingFilterProxy. Where do you specify your own class ? As a bean in your spring context xml. The name of the filter and the name of the bean must be the same for the delegation to happen. Using the DelegatingFilterProxy allowed me to use my Filters with Spring. I can inject my dependencies as I would normally. Next, lets look at my MemcacheFilter filter Memcache Filter Class public class MemcacheFilter implements Filter { private static Logger logger = Logger.getLogger(MemcacheFilter.class); private CacheConfig cacheConfig; /** * Memcached lookup is being performed in this method. Firstly, keys are * generated depending on the request method (GET/POST). Then a cache lookup * is performed. If a value is obtained, the value is written to the * response otherwise, the actual target (in this case, Spring's Dispatcher * Servlet) is called by calling doFilter on the filteChain. The dispatcher * servlet calls the controller to produce the desire response which is * intercepted when the doFilter method returns. The Response is added to * the cache if the reponse code was 200(OK). * * @param request * @param response * @param filterChain * @throws IOException * @throws ServletException */ public void doFilter(ServletRequest request, ServletResponse response, FilterChain filterChain) throws IOException, ServletException { try { if ((request instanceof HttpServletRequest) && (response instanceof HttpServletResponse)) { // Wrapping the response in HTTPServletResponseWrapper MemcacheResponseWrapper responseWrap = new MemcacheResponseWrapper((HttpServletResponse) response); // Wrapping the request in HTTPServletResponseWrapper MemcacheRequestWrapper requestWrap = new MemcacheRequestWrapper((HttpServletRequest) request); // Get Memcached Client Instance MemcachedClient client = cacheConfig.getMemcachedClient(); Key keyGenerator = getKeyGenerator(requestWrap); if (keyGenerator != null) { String key = keyGenerator.getKey(requestWrap, cacheConfig); String value = (String) client.get(key); if (value == null) { // cache miss logger.info("Cache miss for key " + key); // call next filter/actual target for value filterChain.doFilter(requestWrap, responseWrap); if (responseWrap.getStatus() == HttpServletResponse.SC_OK) { // obtaining response content from // HttpServletResponseWrapper value = responseWrap.getOutputStream().toString(); // adding response to cache client.add(key, 0, value); logger.info("Adding response to cache: "+ (value.length() > 50 ? value.substring(0,50) + "..." : value)); } else { logger.warn("Did not add content to cache as response status is not 200"); } } else { // This case is a cache hit logger.info("Cache hit for key " + key); response.getWriter().println(value); } } else { logger.warn("Request skipped because no key generator could be found for the request's method"); // attempting call to actual target filterChain.doFilter(request, response); } } } catch (Exception ex) { logger.info("Cache functionality skipped due to exception", ex); // attempting call to actual target filterChain.doFilter(request, response); } } /** * Factory method that returns KeyGenerator based on the request method. * * @param httpRequest * @return */ private Key getKeyGenerator(HttpServletRequest httpRequest) { Key keyGenerator = null; if (httpRequest.getMethod().equalsIgnoreCase("GET")) { keyGenerator = new GetRequestKey(); } else if (httpRequest.getMethod().equalsIgnoreCase("POST")) { keyGenerator = new PostRequestKey(); } return keyGenerator; } public void init(FilterConfig arg0) throws ServletException { logger.debug("init"); } public CacheConfig getCacheConfig() { return cacheConfig; } public void setCacheConfig(CacheConfig cacheConfig) { this.cacheConfig = cacheConfig; } public void destroy() { logger.debug("destroy"); } } 1. I first wrap my request and response objects in the following statements. I have had to create the wrappers as well. Will get to those later. // Wrapping the response in HTTPServletResponseWrapper MemcacheResponseWrapper responseWrap = new MemcacheResponseWrapper((HttpServletResponse) response); // Wrapping the request in HTTPServletResponseWrapper MemcacheRequestWrapper requestWrap = new MemcacheRequestWrapper((HttpServletRequest) request); 2. Next, I have one of my injected classes, CacheConfig, provide me with a memcache client which I will use later to look up the cache. // Get Memcached Client Instance MemcachedClient client = cacheConfig.getMemcachedClient(); 3. I make a call to a function that tells me which key generator I should use, a GET one or a POST one depending on the request method. Key keyGenerator = getKeyGenerator(requestWrap); /** * Factory method that returns KeyGenerator based on the request method. * * @param httpRequest * @return */ private Key getKeyGenerator(HttpServletRequest httpRequest) { Key keyGenerator = null; if (httpRequest.getMethod().equalsIgnoreCase("GET")) { keyGenerator = new GetRequestKey(); } else if (httpRequest.getMethod().equalsIgnoreCase("POST")) { keyGenerator = new PostRequestKey(); } return keyGenerator; } 4. Check for a cache hit using the Key returned by the Key Generator. If its a miss, call next filter or target to compute actual value, get value from the response wrapper, and add it to the cache. if (keyGenerator != null) { String key = keyGenerator.getKey(requestWrap, cacheConfig); String value = (String) client.get(key); if (value == null) { // cache miss logger.info("Cache miss for key " + key); // call next filter/actual target for value filterChain.doFilter(requestWrap, responseWrap); if (responseWrap.getStatus() == HttpServletResponse.SC_OK) { // obtaining response content from // HttpServletResponseWrapper value = responseWrap.getOutputStream().toString(); // adding response to cache client.add(key, 0, value); logger.info("Adding response to cache: "+ (value.length() > 50 ? value.substring(0,50) + "..." : value)); } 5. If its a cache hit, just get return cached value else { // This case is a cache hit logger.info("Cache hit for key " + key); response.getWriter().println(value); } Lets take a look at each of the Wrappers. I am not going into a a lot of detail into how each of these work. Request Wrapper Class On the way in, the original POST content is extracted from the request and put in a String Buffer. To the filter, this content is returned via the toString() method of the WrappedInputStream class whereas the subsequently called controller calls the read method. public class MemcacheRequestWrapper extends HttpServletRequestWrapper { protected ServletInputStream stream; protected HttpServletRequest origRequest = null; protected BufferedReader reader = null; public MemcacheRequestWrapper(HttpServletRequest request) throws IOException { super(request); origRequest = request; } public ServletInputStream createInputStream() throws IOException { return (new WrappedInputStream(origRequest)); } @Override public ServletInputStream getInputStream() throws IOException { if (reader != null) { throw new IllegalStateException("getReader() has already been called for this request"); } if (stream == null) { stream = createInputStream(); } return stream; } @Override public BufferedReader getReader() throws IOException { if (reader != null) { return reader; } if (stream != null) { throw new IllegalStateException("getReader() has already been called for this request"); } stream = createInputStream(); reader = new BufferedReader(new InputStreamReader(stream)); return reader; } private class WrappedInputStream extends ServletInputStream { private StringBuffer originalInput = new StringBuffer(); private HttpServletRequest originalRequest; private ByteArrayInputStream byteArrayInputStream; public WrappedInputStream(HttpServletRequest request) throws IOException { this.originalRequest = request; BufferedReader bufferedReader = null; try { InputStream inputStream = request.getInputStream(); if (inputStream != null) { bufferedReader = new BufferedReader(new InputStreamReader(inputStream)); char[] charBuffer = new char[128]; int bytesRead = -1; while ((bytesRead = bufferedReader.read(charBuffer)) > 0) { originalInput.append(charBuffer, 0, bytesRead); } } byteArrayInputStream = new ByteArrayInputStream(originalInput.toString().getBytes()); } catch (IOException ex) { throw ex; } finally { if (bufferedReader != null) { try { bufferedReader.close(); } catch (IOException ex) { throw ex; } } } } @Override public String toString() { return this.originalInput.toString(); } @Override public int read() throws IOException { return byteArrayInputStream.read(); } } } Response Wrapper Class The response wrapper is similar to the request wrapper. Instead of the read method, there is a write method, called by the controller when its writing JSON content. This is stored in the wrapper and called in the filter. public class MemcacheResponseWrapper extends HttpServletResponseWrapper { protected ServletOutputStream stream; protected PrintWriter writer = null; protected HttpServletResponse origResponse = null; private int httpStatus = 200; public MemcacheResponseWrapper(HttpServletResponse response) { super(response); response.setContentType("application/json"); origResponse = response; } public ServletOutputStream createOutputStream() throws IOException { return (new WrappedOutputStream(origResponse)); } public ServletOutputStream getOutputStream() throws IOException { if (writer != null) { throw new IllegalStateException("getWriter() has already been called for this response"); } if (stream == null) { stream = createOutputStream(); } return stream; } public PrintWriter getWriter() throws IOException { if (writer != null) { return writer; } if (stream != null) { throw new IllegalStateException("getOutputStream() has already been called for this response"); } stream = createOutputStream(); writer = new PrintWriter(stream); return writer; } @Override public void sendError(int sc) throws IOException { httpStatus = sc; super.sendError(sc); } @Override public void sendError(int sc, String msg) throws IOException { httpStatus = sc; super.sendError(sc, msg); } @Override public void setStatus(int sc) { httpStatus = sc; super.setStatus(sc); } public int getStatus() { return httpStatus; } private class WrappedOutputStream extends ServletOutputStream { private StringBuffer originalOutput = new StringBuffer(); private HttpServletResponse originalResponse; public WrappedOutputStream(HttpServletResponse response) { this.originalResponse = response; } @Override public String toString() { return this.originalOutput.toString(); } @Override public void write(int arg0) throws IOException { originalOutput.append((char) arg0); originalResponse.getOutputStream().write(arg0); } } }

June 25, 2013

by Faheem Sohail

· 22,604 Views · 1 Like

Searchable Documents? Yes You Can. Another Reason to Choose AsciiDoc

Elasticsearch is a flexible and powerful open source, distributed real-time search and analytics engine for the cloud based on Apache Lucene which provides full text search capabilities. It is document oriented and schema free. Asciidoctor is a pure Ruby processor for converting AsciiDoc source files and strings into HTML 5, DocBook 4.5 and other formats. Apart of Asciidoctor Ruby part, there is an Asciidoctor-java-integration project which let us call Asciidoctor functions from Java without noticing that Ruby code is being executed. In this post we are going to see how we can use Elasticsearch over AsciiDocdocuments to make them searchable by their header information or by their content. Let's add required dependencies: junit junit 4.11 test com.googlecode.lambdaj lambdaj 2.3.3 org.elasticsearch elasticsearch 0.90.1 org.asciidoctor asciidoctor-java-integration 0.1.3 Lambdaj library is used to convert AsciiDoc files to a json documents. Now we can start an Elasticsearch instance which in our case it is going to be an embedded instance. node = nodeBuilder().local(true).node(); Next step is parse AsciiDoc document header, read its content and convert them into a json document. An example of json document stored in Elasticsearch can be: { "title":"Asciidoctor Maven plugin 0.1.2 released!", "authors":[ { "author":"Jason Porter", "email":"[email protected]" } ], "version":null, "content":"= Asciidoctor Maven plugin 0.1.2 released!.....", "tags":[ "release", "plugin" ] } And for converting an AsciiDoc File to a json document we are going to useXContentBuilder class which is provided by ElasticsearchJava API to create jsondocuments programmatically. package com.lordofthejars.asciidoctor; import static org.elasticsearch.common.xcontent.XContentFactory.*; import java.io.File; import java.io.FileInputStream; import java.io.FileNotFoundException; import java.io.IOException; import java.util.List; import org.asciidoctor.Asciidoctor; import org.asciidoctor.Author; import org.asciidoctor.DocumentHeader; import org.asciidoctor.internal.IOUtils; import org.elasticsearch.common.xcontent.XContentBuilder; import ch.lambdaj.function.convert.Converter; public class AsciidoctorFileJsonConverter implements Converter { private Asciidoctor asciidoctor; public AsciidoctorFileJsonConverter() { this.asciidoctor = Asciidoctor.Factory.create(); } public XContentBuilder convert(File asciidoctor) { DocumentHeader documentHeader = this.asciidoctor.readDocumentHeader(asciidoctor); XContentBuilder jsonContent = null; try { jsonContent = jsonBuilder() .startObject() .field("title", documentHeader.getDocumentTitle()) .startArray("authors"); Author mainAuthor = documentHeader.getAuthor(); jsonContent.startObject() .field("author", mainAuthor.getFullName()) .field("email", mainAuthor.getEmail()) .endObject(); List authors = documentHeader.getAuthors(); for (Author author : authors) { jsonContent.startObject() .field("author", author.getFullName()) .field("email", author.getEmail()) .endObject(); } jsonContent.endArray() .field("version", documentHeader.getRevisionInfo().getNumber()) .field("content", readContent(asciidoctor)) .array("tags", parseTags((String)documentHeader.getAttributes().get("tags"))) .endObject(); } catch (IOException e) { throw new IllegalArgumentException(e); } return jsonContent; } private String[] parseTags(String tags) { tags = tags.substring(1, tags.length()-1); return tags.split(", "); } private String readContent(File content) throws FileNotFoundException { return IOUtils.readFull(new FileInputStream(content)); } } Basically we are building the json document by calling startObject methods to start a new object, field method to add new fields, and startArray to start an array. Then this builder will be used to render the equivalent object in json format. Notice that we are using readDocumentHeader method from Asciidoctor class which returns header attributes from AsciiDoc file without reading and rendering the whole document. And finally content field is set with all document content. And now we are ready to start indexing documents. Note that populateData method receives as parameter a Client object. This object is from Elasticsearch Java APIand represents a connection to Elasticsearch database. import static ch.lambdaj.Lambda.convert; //.... private void populateData(Client client) throws IOException { List asciidoctorFiles = new ArrayList() {{ add(new File("target/test-classes/java_release.adoc")); add(new File("target/test-classes/maven_release.adoc")); }; List jsonDocuments = convertAsciidoctorFilesToJson(asciidoctorFiles); for (int i=0; i < jsonDocuments.size(); i++) { client.prepareIndex("docs", "asciidoctor", Integer.toString(i)).setSource(jsonDocuments.get(i)).execute().actionGet(); } client.admin().indices().refresh(new RefreshRequest("docs")).actionGet(); } private List convertAsciidoctorFilesToJson(List asciidoctorFiles) { return convert(asciidoctorFiles, new AsciidoctorFileJsonConverter()); } It is important to note that the first part of the algorithm is converting all our AsciiDocfiles (in our case two) to XContentBuilder instances by using previous converter class and the method convert of Lambdaj project. If you want you can take a look to both documents used in this example in https://github.com/asciidoctor/asciidoctor.github.com/blob/develop/news/asciidoctor-java-integration-0-1-3-released.adoc and https://github.com/asciidoctor/asciidoctor.github.com/blob/develop/news/asciidoctor-maven-plugin-0-1-2-released.adoc. Next part is inserting documents inside one index. This is done by using prepareIndexmethod, which requires an index name (docs), an index type (asciidoctor), and the idof the document being inserted. Then we call setSource method which transforms theXContentBuilder object to json, and finally by calling execute().actionGet(), data is sent to database. The final step is only required because we are using an embedded instance ofElasticsearch (in production this part should not be required), which refresh the indexes by calling refresh method. After that point we can start querying Elasticsearch for retrieving information from our AsciiDoc documents. Let's start with very simple example, which returns all documents inserted: SearchResponse response = client.prepareSearch().execute().actionGet(); Next we are going to search for all documents that has been written by Alex Sotowhich in our case is one. import static org.elasticsearch.index.query.QueryBuilders.matchQuery; //.... QueryBuilder matchQuery = matchQuery("author", "Alex Soto"); QueryBuilder matchQuery = matchQuery("author", "Alexander Soto"); Note that I am searching for field author the string Alex Soto, which returns only one. The other document is written by Jason. But it is interesting to say that if you search for Alexander Soto, the same document will be returned; Elasticsearch is smart enough to know that Alex and Alexander are very similar names so it returns the document too. More queries, how about finding documents written by someone who is called Alex, but not Soto. import static org.elasticsearch.index.query.QueryBuilders.fieldQuery; //.... QueryBuilder matchQuery = fieldQuery("author", "+Alex -Soto"); And of course no results are returned in this case. See that in this case we are using afield query instead of a term query, and we use +, and - symbols to exclude and include words. Also you can find all documents which contains the word released on title. import static org.elasticsearch.index.query.QueryBuilders.matchQuery; //.... QueryBuilder matchQuery = matchQuery("title", "released"); And finally let's find all documents that talks about 0.1.2 release, in this case only one document talks about it, the other one talks about 0.1.3. QueryBuilder matchQuery = matchQuery("content", "0.1.2"); Now we only have to send the query to Elasticsearch database, which is done by using prepareSearch method. SearchResponse response = client.prepareSearch("docs") .setTypes("asciidoctor") .setQuery(matchQuery) .execute() .actionGet(); SearchHits hits = response.getHits(); for (SearchHit searchHit : hits) { System.out.println(searchHit.getSource().get("content")); } Note that in this case we are printing the AsciiDoc content through console, but you could use asciidoctor.render(String content, Options options) method to render the content into required format. So in this post we have seen how to index documents using Elasticsearch, how to get some important information from AsciiDoc files using Asciidoctor-java-integration project, and finally how to execute some queries to inserted documents. Of course there are more kind of queries in Elasticsearch, but the intend of this post wasn't to explore all possibilities of Elasticsearch. Also as corollary, note how important it is using AsciiDoc format for writing your documents. Without much effort you can build a search engine for your documentation. On the other side, imagine all code that would be required to implement the same using any proprietary binary format like Microsoft Word. So we have shown another reason to use AsciiDoc instead of other formats.

June 10, 2013

by Alex Soto

· 4,835 Views

IndexedDB and Date Example

about an hour ago i gave a presentation on indexeddb. one of the attendees asked about dates and being able to filter based on a date range. i told him that my assumption was that you would need to convert the dates into numbers and use a number-based range. turns out i was wrong. here is an example. i began by creating an objectstore that used an index on the created field. since our intent is to search via a date field, i decided "created" would be a good name. i also named my objectstore as "data". boring, but it works. var openrequest = indexeddb.open("idbpreso_date1",1); openrequest.onupgradeneeded = function(e) { var thisdb = e.target.result; if(!thisdb.objectstorenames.contains("data")) { var os = thisdb.createobjectstore("data", {autoincrement:true}); os.createindex("created", "created", {unique:false}); } } next - i built a simple way to seed data. i based on a button click event to add 10 objects. each object will have one property, created, and the date object will be based on a random date from now till 7 days in the future. function doseed() { var now = new date(); for(var i=0; i<10; i++) { var daydiff = getrandomint(1, 7); var thisdate = new date(); thisdate.setdate(now.getdate() + daydiff); db.transaction(["data"],"readwrite").objectstore("data").add({created:thisdate}); } } //credit: mozilla developer center function getrandomint (min, max) { return math.floor(math.random() * (max - min + 1)) + min; } note that since indexeddb calls are asynchronous, my code should handle updating the user to let them know when the operation is done. since this is just a quick demo though, and since that add operation will complete incredibly fast, i decided to not worry about it. so at this point we'd have an application that lets us add data containing a created property with a valid javascript date. note i didn't change it to milliseconds. i just passed it in as is. for the final portion i added two date fields on my page. in chrome this is rendered nicely: based on these, i can then create an indexeddb range of either bounds, lowerbounds, or upperbounds. i.e., give me crap either after a date, before a date, or inside a date range. function dosearch() { var fromdate = document.queryselector("#fromdate").value; var todate = document.queryselector("#todate").value; var range; if(fromdate == "" && todate == "") return; var transaction = db.transaction(["data"],"readonly"); var store = transaction.objectstore("data"); var index = store.index("created"); if(fromdate != "") fromdate = new date(fromdate); if(todate != "") todate = new date(todate); if(fromdate != "" && todate != "") { range = idbkeyrange.bound(fromdate, todate); } else if(fromdate == "") { range = idbkeyrange.upperbound(todate); } else { range = idbkeyrange.lowerbound(fromdate); } var s = ""; index.opencursor(range).onsuccess = function(e) { var cursor = e.target.result; if(cursor) { s += "key "+cursor.key+""; for(var field in cursor.value) { s+= field+"="+cursor.value[field]+""; } s+=""; cursor.continue(); } document.queryselector("#status").innerhtml = s; } } the only conversion required here was to take the user input and turn it into "real" date objects. once done, everything works great: you can run the full demo below.

June 7, 2013

by Raymond Camden

· 7,358 Views

Write CSV Data into Hive and Python

Apache Hive is a high level SQL-like interface to Hadoop. It lets you execute mostly unadulterated SQL, like this: CREATE TABLE test_table(key string, stats map); The map column type is the only thing that doesn’t look like vanilla SQL here. Hive can actually use different backends for a given table. Map is used to interface with column oriented backends like HBase. Essentially, because we won’t know ahead of time all the column names that could be in the HBase table, Hive will just return them all as a key/value dictionary. There are then helpers to access individual columns by key, or even pivot the map into one key per logical row. As part of the Hadoop family, Hive is focused on bulk loading and processing. So it’s not a surprise that Hive does not support inserting raw values like the following SQL: INSERT INTO suppliers (supplier_id, supplier_name) VALUES (24553, 'IBM'); However, for unit testing Hive scripts, it would be nice to be able to insert a few records manually. Then you could run your map reduce HQL, and validate the output. Luckily, Hive can load CSV files, so it’s relatively easy to insert a handful or records that way. CREATE TABLE foobar(key string, stats map) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '|' MAP KEYS TERMINATED BY ':' ; LOAD DATA LOCAL INPATH '/tmp/foobar.csv' INTO TABLE foobar; This will load a CSV file with the following data, where c4ca4-0000001-79879483-000000000124 is the key, and comments and likesare columns in a map. c4ca4-0000001-79879483-000000000124,comments:0|likes:0 c4ca4-0000001-79879483-000000000124,comments:0|likes:0 Because I’ve been doing this quite a bit in my unit tests, I wrote a quick Python helper to dump a list of key/map tuples to a temporary CSV file, and then load it into Hive. This uses hiver to talk to Hive over thrift. import hiver from django.core.files.temp import NamedTemporaryFile def _hql(self, hql): client = hiver.connect(settings.HIVE_HOST, settings.HIVE_PORT) try: client.execute(hql) finally: client.shutdown() def insert(self, table_name, rows): ''' cannot insert single rows via hive, need to save to a temp file and bulk load that ''' csv_file = NamedTemporaryFile(delete=True) for row in rows: map_repr = '|'.join('%s:%s' % (key, value) for key, value in row[1].items()) csv_file.write(row[0] + "," + map_repr + "\n") csv_file.flush() try: _hql('DROP TABLE IF EXISTS %s' % table_name) _hql(""" CREATE TABLE %s ( key string, map ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '|' MAP KEYS TERMINATED BY ':' """ % (table_name)) _hql(""" LOAD DATA LOCAL INPATH '%s' INTO TABLE %s """ % (csv_file.name, table_name) finally: csv_file.close() You can call it like this: insert('test_table', [ ('c4ca4-0000001-79879483-000000000124', {'comments': 1, 'likes': 2}), ('c4ca4-0000001-79879483-000000000124', {'comments': 1, 'likes': 2}), ('c4ca4-0000001-79879496-000000000124', {'comments': 1, 'likes': 2}), ('b4aed-0000002-79879783-000000000768', {'comments': 1, 'likes': 2}), ('b4aed-0000002-79879783-000000000768', {'comments': 1, 'likes': 2}), ])

June 5, 2013

by Chase Seibert

· 14,765 Views

Accessing An Artifact’s Maven And SCM Versions At Runtime

You can easily tell Maven to include the version of the artifact and its Git/SVN/… revision in the JAR manifest file and then access that information at runtime via getClass().getPackage.getImplementationVersion(). (All credit goes to Markus Krüger and other colleagues.) Include Maven artifact version in the manifest (Note: You will actually not want to use it, if you also want to include a SCM revision; see below.) pom.xml: ... org.apache.maven.plugins maven-jar-plugin ... true true ... ... The resulting MANIFEST.MF of the JAR file will then include the following entries, with values from the indicated properties: Built-By: ${user.name} Build-Jdk: ${java.version} Specification-Title: ${project.name} Specification-Version: ${project.version} Specification-Vendor: ${project.organization.name Implementation-Title: ${project.name} Implementation-Version: ${project.version} Implementation-Vendor-Id: ${project.groupId} Implementation-Vendor: ${project.organization.name} (Specification-Vendor and Implementation-Vendor come from the POM’s organization/name.) Include SCM revision For this you can either use the Build Number Maven plugin that produces the property ${buildNumber}, or retrieve it from environment variables passed by Jenkinsor Hudson (SVN_REVISION for Subversion, GIT_COMMIT for Git). For git alone, you could also use the maven-git-commit-id-plugin that can either replace strings such as ${git.commit.id} in existing resource files (using maven’s resource filtering, which you must enable) with the actual values or output all of them into a git.properties file. Let’s use the buildnumber-maven-plugin and create the manifest entries explicitely, containing the build number (i.e. revision) org.codehaus.mojo buildnumber-maven-plugin 1.2 validate create false false org.apache.maven.plugins maven-jar-plugin 2.4 ${project.name} ${project.version} ${buildNumber} ... Accessing the version & revision As mentioned above, you can access the manifest entries from your code via getClass().getPackage.getImplementationVersion() andgetClass().getPackage.getImplementationTitle(). References SO: How to get Maven Artifact version at runtime? Maven Archiver documentation

May 28, 2013

by Jakub Holý

· 12,760 Views

Capturing camera/picture data without PhoneGap

As people know, I'm a huge fan of PhoneGap and what it allows me to do with JavaScript, HTML, and CSS. But I think it is crucial to remember that you don't always need PhoneGap. A great example of that is camera access. Did you know that recent mobile browsers support accessing the camera directly from HTML and JavaScript? Let's look at an example. Over a year ago I wrote a blog post where I created an application called "Color Thief." This application made use of PhoneGap's Camera API and a third party JavaScript library called Color Thief. I loved this example because it demonstrated how you could combine the extra power that PhoneGap provides along with existing JavaScript libraries. This morning I watched an excellent Google IO presentation (https://www.youtube.com/watch?v=EPYnGFEcis4&feature=youtube_gdata_player) on Mobile HTML. It was an overview of some of the exciting stuff you can now do with mobile HTML and JavaScript. To be clear, this was all without using wrappers like PhoneGap. In one of the examples the presenters discussed the new "capture" support for the input/file field type. This is rather simple to implement: If supported (recent Android and latest iOS), the user can then use their camera to select a picture. I decided to rebuild my old demo to skip PhoneGap completely and just make use of this feature. Here's the code: For the most part, this is pretty similar to the last version. I no longer wait for the deviceready event, but instead just listen for the document itself to load. Instead of listening for a button click, I've switched to a input field using type=file. I now listen for the change event, and on that, I see if I have access to a file. If I do, I can then use the URL object to create a pointer to the source and then simply add it to my DOM. After that, Color Thief takes over. The only tricky part I ran into was that in iOS the URL object is still prefixed. You can see how I get around that in the startup code. To be fair, this isn't 100% backwards compatible, I could add a few checks in here to ensure that things will work and gracefully let people on older phones know they can't use this feature. But the end result is nearly the exact same functionality in a web page - no PhoneGap, no native code. <br>

May 21, 2013

by Raymond Camden

· 17,631 Views

Bucketing, Multiplexing and Combining in Hadoop - Part 1

this is the first blog post in a series which looks at some data organization patterns in mapreduce. we’ll look at how to bucket output across multiple files in a single task, how to multiplex data across multiple files, and also how to coalesce data. these are all common patterns that are useful to have in your mapreduce toolkit. we’ll kick things off with a look at bucketing data outputs in your map or reduce tasks. by default when using a fileoutputformat-derived outputformat (such as textoutputformat), all the outputs for a reduce task (or a map task in a map-only job) are written to a single file in hdfs. imagine a situation where you have user activity logs being streamed into hdfs, and you want to write a mapreduce job to better organize the incoming data. as an example a large organization with multiple products may want to bucket the logs based on the product. to do this you’ll need the ability to write to multiple output files in a single task. let’s take a look at how we can make that happen. multipleoutputformat there are a few ways you can achieve your goal, and the first option we’ll look at is the multipleoutputformat class in hadoop. this is an abstract class that lets you do the following: define the output path for each and every key/value output record being emitted by a task. incorporate the input paths into the output directory for map-only jobs. redefine the key and value that are used to write to the underlying recordwriter . this is useful in situations where you want to remove data from the outputs as it duplicates data in the filename. for each output path, define the recordwriter that should be used to write the outputs. ok enough with the words - let’s look at some data and code. first up is the simple data we’ll use in our example - imagine you work at a fruit market with locations in multiple cities, and you have a purchase transaction stream which contains the store location along with the fruit that was purchased. cupertino apple sunnyvale banana cupertino pear to help bucket your data for future analysis, you want to bin each record into city-specific files. for the simple data set above you don’t want to filter, project or transform your data, just bucket it out, so a simple identity map-only job will do the job. to force more than one mapper, we’ll write the data to two separate files. $ tab="$(printf '\t')" $ hdfs -put - file1.txt << eof cupertino${tab}apple sunnyvale${tab}banana eof $ hdfs -put - file2.txt << eof cupertino${tab}pear eof here’s the code which will let you write city-specific output files. import org.apache.commons.lang.stringutils; import org.apache.hadoop.conf.configuration; import org.apache.hadoop.conf.configured; import org.apache.hadoop.fs.filesystem; import org.apache.hadoop.fs.path; import org.apache.hadoop.io.text; import org.apache.hadoop.mapred.*; import org.apache.hadoop.mapred.lib.identitymapper; import org.apache.hadoop.mapred.lib.multipletextoutputformat; import org.apache.hadoop.util.progressable; import org.apache.hadoop.util.tool; import org.apache.hadoop.util.toolrunner; import java.io.ioexception; import java.util.arrays; /** * an example of how to use {@link org.apache.hadoop.mapred.lib.multipleoutputformat}. */ public class mofexample extends configured implements tool { /** * create output files based on the output record's key name. */ static class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(text key, text value, string name) { return key.tostring() + "/" + name; } } /** * the main job driver. */ public int run(final string[] args) throws exception { string csvinputs = stringutils.join(arrays.copyofrange(args, 0, args.length - 1), ","); path outputdir = new path(args[args.length - 1]); jobconf jobconf = new jobconf(super.getconf()); jobconf.setjarbyclass(mofexample.class); jobconf.setnumreducetasks(0); jobconf.setmapperclass(identitymapper.class); jobconf.setinputformat(keyvaluetextinputformat.class); jobconf.setoutputformat(keybasedmultipletextoutputformat.class); fileinputformat.setinputpaths(jobconf, csvinputs); fileoutputformat.setoutputpath(jobconf, outputdir); return jobclient.runjob(jobconf).issuccessful() ? 0 : 1; } /** * main entry point for the utility. * * @param args arguments * @throws exception when something goes wrong */ public static void main(final string[] args) throws exception { int res = toolrunner.run(new configuration(), new mofexample(), args); system.exit(res); } } run this code and you’ll see the following files in hdfs, where /output is the job output directory: $ hadoop fs -lsr /output /output/cupertino/part-00000 /output/cupertino/part-00001 /output/sunnyvale/part-00000 if you look at the output files you’ll see that the files contain the correct buckets. $ hadoop fs -lsr /output/cupertino/* cupertino apple cupertino pear $ hadoop fs -lsr /output/sunnyvale/* sunnyvale banana awesome, you have your data bucketed by store. now that we have everything working, let’s look at what we did to get there. we had to do two things to get this working: extend multipletextoutputformat this is where the magic happened - let’s look at that class again. static class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(text key, text value, string name) { return key.tostring() + "/" + name; } } you are working with text, which is why you extended multipletextoutputformat , a class that in turn extends multipleoutputformat . multipletextoutputformat is a simple class which instructs the multipleoutputformat to use textoutputformat as the underlying output format for writing out the records. if you were to use multipleoutputformat as-is it behaves as if you were using the regular textoutputformat , which is to say that it’ll only write to a single output file. to write data to multiple files you had to extend it, as with the example above. the generatefilenameforkeyvalue method allows you to return the output path for an input record. the third argument, name , is the original fileoutputformat -created filename, which is in the form “part-nnnnn”, where “nnnnn” is the task index, to ensure uniqueness. to avoid file collisions, it’s a good idea to make sure your generated output paths are unique, and leveraging the original output file is certainly a good way of doing this. in our example we’re using the key as the directory name, and then writing to the original fileoutputformat filename within that directory. specify the outputformat the next step was easy - specify that this output format should be used for your job: jobconf.setoutputformat(keybasedmultipletextoutputformat.class); earlier we also mentioned that you can use the input path as part of the output path, which we will look at next. using the input filename as part of the output filename in map-only jobs what if we wanted to keep the input filename as part of the output filename? this only works for map-only jobs, and can be accomplished by overriding the getinputfilebasedoutputfilename method. let’s look at the following code to understand how this method fits into the overall sequence of actions that the multipleoutputformat class performs: public void write(k key, v value) throws ioexception { // get the file name based on the key string keybasedpath = generatefilenameforkeyvalue(key, value, myname); // get the file name based on the input file name string finalpath = getinputfilebasedoutputfilename(myjob, keybasedpath); // get the actual key k actualkey = generateactualkey(key, value); v actualvalue = generateactualvalue(key, value); recordwriter rw = this.recordwriters.get(finalpath); if (rw == null) { // if we don't have the record writer yet for the final path, create // one // and add it to the cache rw = getbaserecordwriter(myfs, myjob, finalpath, myprogressable); this.recordwriters.put(finalpath, rw); } rw.write(actualkey, actualvalue); }; the getinputfilebasedoutputfilename method is called with the output of generatefilenameforkeyvalue , which contains our already-customized output file. our new keybasedmultipletextoutputformat can now be updated to override getinputfilebasedoutputfilename and append the original input filename to the output filename: static class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(object key, object value, string name) { return key.tostring() + "/" + name; } @override protected string getinputfilebasedoutputfilename(jobconf job, string name) { string infilename = new path(job.get("map.input.file")).getname(); return name + "-" + infilename; } if you run with your modified outputformat class you’ll see the following files in hdfs, confirming that the input filenames are now concatenated to the end of each output file. $ hadoop fs -lsr /output /output/cupertino/part-00000-file1.txt /output/cupertino/part-00001-file2.txt /output/sunnyvale/part-00000-file1.txt the implementation of getinputfilebasedoutputfilename in multipleoutputformat doesn’t do anything interesting by default, but if you set the value of the mapred.outputformat.numoftrailinglegs configurable to an integer greater than 0, then the getinputfilebasedoutputfilename will use part of the input path as the output path. let’s see what happens when we set the value to 1: jobconf.setint("mapred.outputformat.numoftrailinglegs", 1); the output files in hdfs now exactly mirror the input files used for the job: $ hadoop fs -lsr /output /output/file1.txt /output/file2.txt if we set mapred.outputformat.numoftrailinglegs to 2, and our input files exist in the /inputs directory, then our output directory looks like this: $ hadoop fs -lsr /output /output/input/file1.txt /output/input/file2.txt basically as you keep incrementing mapred.outputformat.numoftrailinglegs , then multipleoutputformat will continue to go up the parent directories of the input file and use them in the output path. modifying the output key and value it’s very possible that the actual key and value you want to emit are different from those that were used to determine the output file. in our example, we took the output key and wrote to a directory using the key name. if you do that keeping the key in the output file may be redundant. how would we modify the output record so that the key isn’t written? multipleoutputformat has your back with the generateactualkey method. class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(text key, text value, string name) { return key.tostring() + "/" + name; } @override protected text generateactualkey(text key, text value) { return null; } } the returned value from this method replaces the key that’s supplied to the underlying recordwriter , so if you return null as in the above example, no key will be written to the file. $ hadoop fs -lsr /output/cupertino/* apple pear $ hadoop fs -lsr /output/sunnyvale/* banana you can achieve the same result for the output value by overriding the generateactualvalue method. changing the recordwriter in our final step we’ll look at how you can leverage multiple recordwriter classes for different output files. this is accomplished by overriding the getrecordwriter method. in the example below we’re leveraging the same textoutputformat for all the files, but it gives you a sense of what can be accomplished. static class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(text key, text value, string name) { return key.tostring() + "/" + name; } @override public recordwriter getrecordwriter(filesystem fs, jobconf job, string name, progressable prog) throws ioexception { if (name.startswith("apple")) { return new textoutputformat().getrecordwriter(fs, job, name, prog); } else if (name.startswith("banana")) { return new textoutputformat().getrecordwriter(fs, job, name, prog); } return super.getrecordwriter(fs, job, name, prog); } } conclusion when using multipleoutputformat , give some thought to the number of distinct files that each reducer will create. it would be prudent to plan your bucketing so that you have a relatively small number of files. in this post we extended multipletextoutputformat , which is a simple extension of multipleoutputformat that supports text outputs. multiplesequencefileoutputformat also exists to support sequencefiles in a similar fashion. so what are the shortcomings with the multipleoutputformat class? if you have a job that uses both map and reduce phases, then multipleoutputformat can’t be used in the map-side to write outputs. of course, multipleoutputformat works fine in map-only jobs. all recordwriter classes must support exactly the same output record types. for example, you wouldn’t be able to support a recordwriter that emitted for one output file, and have another recordwriter that emitted . multipleoutputformat exists in the mapred package, so it won’t work with a job that requires use of the mapreduce package. all is not lost if you bump into either one of these issues, as you’ll discover in the next blog post.

May 20, 2013

by Alex Holmes

· 6,317 Views

Lazy sequences implementation for Java 8

I just published the LazySeq library on GitHub - the result of my Java 8 experiments recently. I hope you will enjoy it. Even if you don't find it very useful, it's still a great lesson of functional programming in Java 8 (and in general). Also it's probably the first community library targeting Java 8! Introduction A Lazy sequence is a data structure that is computed only when its elements are actually needed. All operations on lazy sequences, like map() and filter() are lazy as well, postponing invocation up to the moment when it is really necessary. Lazy sequences are always traversed from the beginning using very cheap first/rest decomposition (head() and tail()). An important property of lazy sequences is that they can represent infinite streams of data, e.g. all natural numbers or temperature measurements over time. Lazy sequence remembers already computed values so if you access the Nth element, all elements from 1 to N-1 are computed as well and cached. Despite that LazySeq (being at the core of many functional languages and algorithms) is immutable and thread-safe. Rationale This library is heavily inspired by scala.collection.immutable.Stream and aims to provide immutable, thread-safe and easy to use lazy sequence implementation, possibly infinite. See Lazy sequences in Scala and Clojure for some use cases. Stream class name is already used in Java 8, therefore LazySeq was chosen, similar to lazy-seq in Clojure. Speaking of Stream, at first it looks like a lazy sequence implementation available out-of-the-box. However, quoting Javadoc: Streams are not data structures and: Once an operation has been performed on a stream, it is considered consumed and no longer usable for other operations. In other words java.util.stream.Stream is just a thin wrapper around existing collection, suitable for one time use. More akin to Iterator than to Stream in Scala. This library attempts to fill this niche. Of course implementing lazy sequence data structure was possible prior to Java 8, but lack of lambdas makes working with such data structure tedious and too verbose. Getting started Building and working with lazy sequences in 10 minutes. Infinite sequence of all natural numbers In order to create a lazy sequence you use LazySeq.cons() factory method that accepts first element (head) and a function that might be later used to compute rest (tail). For example in order to produce lazy sequence of natural numbers with given start element you simply say: private LazySeq naturals(int from) { return LazySeq.cons(from, () -> naturals(from + 1)); } There is really no recursion here. If there was, calling naturals() would quickly result in StackOverflowError as it calls itself without stop condition. However () -> naturals(from + 1) expression defines a function returning LazySeq (Supplier to be precise) that this data structure will invoke, but only if needed. Look at the code below, how many times do you think naturals() function was called (except the first line)? final LazySeq ints = naturals(2); final LazySeq strings = ints. map(n -> n + 10). filter(n -> n % 2 == 0). take(10). flatMap(n -> Arrays.asList(0x10000 + n, n)). distinct(). map(Integer::toHexString); First invocation of naturals(2) returns lazy sequence starting from 2 but rest (3, 4, 5, ...) is not computed yet. Later we map() over this sequence, filter() it, take() first 10 elements, remove duplicates, etc. All these operations do not evaluate the sequence and are as lazy as possible. For example take(10) doesn't evaluate first 10 elements eagerly to return them. Instead new lazy sequence is returned which remembers that it should truncate original sequence at 10th element. Same applies to distinct(). It doesn't evaluate the whole sequence to extract all unique values (otherwise code above would explode quickly, traversing infinite amount of natural numbers). Instead it returns a new sequence with only the first element. If you ever ask for the second unique element, it will lazily evaluate tail, but only as much as possible. Check out toString() output: System.out.println(strings); //[1000c, ?] Question mark (?) says: "there might be something more in that collection, but I don't know it yet". Do you understand where did 1000c came from? Look carefully: Start from an infinite stream of natural numbers starting from 2 Add 10 to each element (so the first element becomes 12 or C in hex) filter() out odd numbers (12 is even so it stays) take() first 10 elements from sequence so far Each element is replaced by two elements: that element plus 0x1000 and the element itself (flatMap()). This does not yield a sequence of pairs, but a sequence of integers that is twice as long We ensure only distinct() elements will be returned In the end we turn integers to hex strings. As you can see none of these operations really require evaluating the whole stream. Only head is being transformed and this is what we see in the end. So when this data structure is actually evaluated? When it absolutely must, e.g. during side-effect traversal: strings.force(); //or strings.forEach(System.out::println); //or final List list = strings.toList(); //or for (String s : strings) { System.out.println(s); } All the statements above alone will force evaluation of whole lazy sequence. Not very smart if our sequence was infinite, but strings was limited to first 10 elements so it will not run infinitely. If you want to force only part of the sequence, simply call strings.take(5).force(). BTW have you noticed that we can iterate over LazySeq strings using standard Java 5 for-each syntax? That's because LazySeq implements List interface, thus plays nicely with Java Collections Framework ecosystem: import java.util.AbstractList; public abstract class LazySeq extends AbstractList Please keep in mind that once lazy sequence is evaluated (computed) it will cache (memoize) them for later use. This makes lazy sequences great for representing infinite or very long streams of data that are expensive to compute. iterate() Building an infinite lazy sequence very often boils down to providing an initial element and a function that produces next item based on the previous one. In other words second element is a function of the first one, third element is a function of the second one, and so on. Convenience LazySeq.iterate() function is provided for such circumstances. ints definition can now look like this: final LazySeq ints = LazySeq.iterate(2, n -> n + 1); We start from 2 and each subsequent element is represented as previous element + 1. More examples: Fibonacci sequence and Collatz conjecture No article about lazy data structure can be left without Fibonacci numbers example: private static LazySeq lastTwoFib(int first, int second) { return LazySeq.cons( first, () -> lastTwoFib(second, first + second) ); } Fibonacci sequence is infinite as well but we are free to transform it in multiple ways: System.out.println( fib. drop(5). take(10). toList() ); //[5, 8, 13, 21, 34, 55, 89, 144, 233, 377] final int firstAbove1000 = fib. filter(n -> (n > 1000)). head(); fib.get(45); See how easy and natural it is to work with infinite stream of numbers? drop(5).take(10) skips first 5 elements and displays next 10. At this point first 15 numbers are already computed and will never by computed again. Finding first Fibonacci number above 1000 (happens to be 1597) is very straightforward. head() is always precomputed by filter() , so no further evaluation is needed. Last but not least we can simply just ask for 45th Fibonacci number (0-based) and get 1134903170. If you ever try to access any Fibonacci number up to this one, they are precomputed and fast to retrieve. Finite sequences (Collatz conjecture) Collatz conjecture is also quite interesting problem. For each positive integer n we compute next integer using following algorithm: n/2 if n is even 3n + 1 if n is odd For example starting from 10 series looks as follows: 10, 5, 16, 8, 4, 2, 1. The series ends when it reaches 1. Mathematicians believe that starting from any integer we will eventually reach 1 but it's not yet proven. Let us create a lazy sequence that generates Collatz series for given n, but only as many as needed. As stated above, this time our sequence will be finite: private LazySeq collatz(long from) { if (from > 1) { final long next = from % 2 == 0 ? from / 2 : from * 3 + 1; return LazySeq.cons(from, () -> collatz(next)); } else { return LazySeq.of(1L); } } This implementation is driven directly by the definition. For each number greater than 1 return that number + lazily evaluated (() -> collatz(next)) rest of the stream. As you can see if 1 is given, we return single element lazy sequence using special of() factory method. Let's test it with aforementioned 10: final LazySeq collatz = collatz(10); collatz.filter(n -> (n > 10)).head(); collatz.size(); filter() allows us to find first number in the sequence that is greater than 10. Remember that lazy sequence will have to traverse the contents (evaluate itself), but only to the point where it finds first matching element. Then it stops, ensuring it computes as little as possible. However size(), in order to calculate total number of elements, must traverse the whole sequence. Of course this can only work with finite lazy sequences, calling size() on an infinite sequence will end up poorly. If you play a bit with this sequence you will quickly realize that sequences for different numbers share the same suffix (always end with the same sequence of numbers). This begs for some caching/structural sharing. See CollatzConjectureTest for details. But can it be used to something, you know... useful? Real life? Infinite sequences of numbers are great, but not very practical in real life. Maybe some more down to earth examples? Imagine you have a collection and you need to pick few items from that collection randomly. Instead of collection I will use a function returning random latin characters: private char randomChar() { return (char) ('A' + (int) (Math.random() * ('Z' - 'A' + 1))); } But there is a twist. You need N (N < 26, number of latin characters) unique values. Simply calling randomChar() few times doesn't guarantee uniqueness. There are few approaches to this problem, with LazySeq it's pretty straightforward: LazySeq charStream = LazySeq.continually(this::randomChar); LazySeq uniqueCharStream = charStream.distinct(); continually() simply invokes given function for each element when needed. Thus charStream will be an infinite stream of random characters. Of course they can't be unique. However uniqueCharStream guarantees that its output is unique. It does so by examining next element of underlying charStream and rejecting items that already appeared. We can now say uniqueCharStream.take(4) and be sure that no duplicates will appear. Once again notice that continually(this::randomChar).distinct().take(4) really calls randomChar() only once! As long as you don't consume this sequence, it remains lazy and postpones evaluation as long as possible. Another example involves loading batches (pages) of data from database. Using ResultSet or Iterator is cumbersome but loading whole data set into memory often not feasible. An alternative involves loading first batch of data eagerly and then providing a function to load next batches. Data is loaded only when it's really needed and we don't suffer performance or scalability issues. First let's define abstract API for loading batches of data from database: public List loadPage(int offset, int max) { //load records from offset to offset + max } I abstract from the technology entirely, but you get the point. Imagine that we now define LazySeq that starts from row 0 and loads next pages only when needed: public static final int PAGE_SIZE = 5; private LazySeq records(int from) { return LazySeq.concat( loadPage(from, PAGE_SIZE), () -> records(from + PAGE_SIZE) ); } When creating new LazySeq instance by calling records(0) first page of 5 elements is loaded. This means that first 5 sequence elements are already computed. If you ever try to access 6th or above, sequence will automatically load all missing record and cache them. In other words you never compute the same element twice. More useful tools when working with sequences are grouped() and sliding() methods. First partitions input sequence into groups of equal size. Take this as an example, also proving that these methods are as always lazy: final LazySeq chars = LazySeq.of('A', 'B', 'C', 'D', 'E', 'F', 'G'); chars.grouped(3); //[[A, B, C], ?] chars.grouped(3).force(); //force evaluation //[[A, B, C], [D, E, F], [G]] and similarly for sliding(): chars.sliding(3); //[[A, B, C], ?] chars.sliding(3).force(); //force evaluation //[[A, B, C], [B, C, D], [C, D, E], [D, E, F], [E, F, G]] These two methods are extremely useful. You can look at your data through sliding window (e.g. to compute moving average) or partition it to equal-length buckets. Last interesting utility method you may find useful is scan() that iterates (lazily, of course) the input stream and constructs every element of output by applying a function on previous and current element of input. Code snippet is worth a thousand words: LazySeq list = LazySeq. numbers(1). scan(0, (a, x) -> a + x); list.take(10).force(); //[0, 1, 3, 6, 10, 15, 21, 28, 36, 45] LazySeq.numbers(1) is a sequence of natural numbers (1, 2, 3...). scan() creates a new sequence that starts from 0 and for each element of input (natural numbers) adds it to last element of itself. So we get: [0, 0+1, 0+1+2, 0+1+2+3, 0+1+2+3+4, 0+1+2+3+4+5...]. If you want a sequence of growing strings, just replace few types: LazySeq.continually("*"). scan("", (s, c) -> s + c). map(s -> "|" + s + "\\"). take(10). forEach(System.out::println); And enjoy this beautiful triangle: |\ |*\ |**\ |***\ |****\ |*****\ |******\ |*******\ |********\ |*********\ Alternatively (same output): lazySeq. stream(). map(n -> n + 1). flatMap(n -> asList(0, n - 1).stream()). filter(n -> n != 0). substream(4, 18). limit(10). sorted(). distinct(). collect(Collectors.toList()); Java collections framework interoperability LazySeq implements java.util.List interface, thus can be used in variety of places. Moreover it also implements Java 8 enhancements to collections, namely streams and collectors: lazySeq. stream(). map(n -> n + 1). flatMap(n -> asList(0, n - 1).stream()). filter(n -> n != 0). substream(4, 18). limit(10). sorted(). distinct(). collect(Collectors.toList()); However streams in Java 8 were created to work around feature that is a foundation of LazySeq - lazy evaluation. Example above postpones all intermediate steps until collect() is called. With LazySeq you can safely skip .stream() and work directly on sequence: lazySeq. map(n -> n + 1). flatMap(n -> asList(0, n - 1)). filter(n -> n != 0). slice(4, 18). limit(10). sorted(). distinct(); Moreover LazySeq provides special purpose collector (see: LazySeq.toLazySeq()) that avoids evaluation even when used with collect() - which normally forces full collection computation. Implementation details Each lazy sequence is built around the idea of eagerly computed head and lazily evaluated tail represented as function. This is very similar to classic single-linked list recursive definition: class List { private final T head; private final List tail; //... } However in case of lazy sequence tail is given as a function, not a value. Invocation of that function is postponed as long as possible: class Cons extends LazySeq { private final E head; private LazySeq tailOrNull; private final Supplier> tailFun; @Override public LazySeq tail() { if (tailOrNull == null) { tailOrNull = tailFun.get(); } return tailOrNull; } For full implementation see Cons.java and FixedCons.java used when tail is known at creation time (for example LazySeq.of(1, 2) as opposed to LazySeq.cons(1, () -> someTailFun()). Pitfalls and common dangers Below common issues and misunderstandings are described. Evaluating too much One of the biggest dangers of working with infinite sequences is trying to evaluate them completely, which obviously leads to infinite computation. The idea behind infinite sequence is not to evaluate it in its entirety but to take as much as we need without introducing artificial limits and accidental complexity (see database loading example). However evaluating whole sequence is way too simple to miss. For example calling LazySeq.size()must evaluate whole sequence and will run infinitely, eventually filling up stack or heap (implementation detail). There are other methods that require full traversal in order to function properly. E.g. allMatch() making sure all elements match given predicate. Some methods are even more dangerous, because whether they will finish or not depends on data in the sequence. For example anyMatch() may return immediately if head matches predicate - or never. Sometimes we can easily avoid costly operations by using more deterministic methods. For example: seq.size() <= 10 //BAD may not work or be extremely slow if seq is infinite. However we can achieve the same with (more) predictable: seq.drop(10).isEmpty() Remember that lazy sequences are immutable (so we don't really mutate seq), drop(n) is typically O(n) while isEmpty() is O(1). When in doubt, consult source code or JavaDoc to make sure your operation won't too eagerly evaluate your sequence. Also be very cautious when using LazySeq where java.util.Collection or java.util.List is expected. Holding unnecessary reference to head Lazy sequences be definition remember already computed elements. You have to be aware of that, otherwise your sequence (especially infinite) will quickly fill up available memory. However, because LazySeq is just a fancy linked list, if you no longer keep a reference to head (but only to some element in the middle), it becomes eligible for garbage collection. For example: //LazySeq first = seq.take(10); seq = seq.drop(10); First ten elements are dropped and we assume nothing holds a reference to what previously was hept in seq. This makes first ten elements eligible for garbage collection. However if we uncomment first line and keep reference to old head in first, JVM will not release any memory. Let's put that into perspective. The following piece of code will eventually throw OutOfMemoryError because infinite reference keeps holding the beginning of the sequence, therefore all the elements created so far: LazySeq infinite = LazySeq.continually(Big::new); for (Big arr : infinite) { // } However by inlining call to continually() or extracting it to a method this code works flawlessly (well, still runs forever, but uses almost no memory): private LazySeq getContinually() { return LazySeq.continually(Big::new); } for (Big arr : getContinually()) { // } What's the difference? For-each loop uses iterators underneath. LazySeqIterator underneath doesn't hold a reference to old head() when it advances, so if nothing else references that head, it will be eligible for garbage collection, see true javac output when for-each is used: for (Iterator cur = getContinually().iterator(); cur.hasNext(); ) { final Big arr = cur.next(); //... } TL;DR Your sequence grows while being traversed. If you keep holding one end while the other grows, it will eventually blow up. Just like your first level cache in Hibernate if you load too much in one transaction. Use only as much as needed. Converting to plain Java collections Converting is simple, but dangerous. This is a consequence of points above. You can convert lazy sequence to java.util.List by calling toList(): LazySeq even = LazySeq.numbers(0, 2); even.take(5).toList(); //[0, 2, 4, 6, 8] or using Collector from Java 8 having richer API: even. stream(). limit(5). collect(Collectors.toSet()) //[4, 6, 0, 2, 8] But remember that Java collections are finite from definition so avoid converting lazy sequences to collections explicitly. Note that LazySeq is already List, thus Iterable and Collection. It also has efficient LazySeq.iterator(). If you can, simply pass LazySeq instance directly and may just work. Performance, time and space complexity head() of every sequence (except empty) is always computed eagerly, thus accessing it is fast O(1). Computing tail() may take everything from O(1) (if it was already computed) to infinite time. As an example take this valid stream: import static com.blogspot.nurkiewicz.lazyseq.LazySeq.cons; import static com.blogspot.nurkiewicz.lazyseq.LazySeq.continually; LazySeq oneAndZeros = cons( 1, () -> continually(0) ). filter(x -> (x > 0)); It represents 1 followed by infinite number of 0s. By filtering all positive numbers (x > 0) we get a sequence with same head, but filtering of tail is delayed (lazy). However if we now carelessly call oneAndZeros.tail(), LazySeq will keep computing more and more of this infinite sequence, but since there is no positive element after initial 1, this operation will run forever, eventually throwing StackOverflowError or OutOfMemoryError (this is an implementation detail). However if you ever reach this state, it's probably a programming bug or misusing of the library. Typically tail() will be close to O(1). On the other hand if you have plenty of operations already "stacked", calling tail() will trigger them rapidly one after another, so tail() run time is heavily dependant on your data structure. Most operations on LazySeq are O(1) since they are lazy. Some operations, like get(n) or drop(n) are O(n) (n represents parameter, not sequence length). In general run time will be similar to normal linked list. Because LazySeq remembers all already computed values in a single linked list, memory consumption is always O(n), where nn is the number of already computed elements. Troubleshooting Error invalid target release: 1.8 during maven build If you see this error message during maven build: [INFO] BUILD FAILURE ... [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project lazyseq: Fatal error compiling: invalid target release: 1.8 -> [Help 1] it means you are not compiling using Java 8. Download JDK 8 with lambda support and let maven use it: $ export JAVA_HOME=/path/to/jdk8 I get StackOverflowError or program hangs infinitely When working with LazySeq you sometimes get StackOverflowError or OutOfMemoryError: java.lang.StackOverflowError at sun.misc.Unsafe.allocateInstance(Native Method) at java.lang.invoke.DirectMethodHandle.allocateInstance(DirectMethodHandle.java:426) at com.blogspot.nurkiewicz.lazyseq.LazySeq.iterate(LazySeq.java:118) at com.blogspot.nurkiewicz.lazyseq.LazySeq.lambda$0(LazySeq.java:118) at com.blogspot.nurkiewicz.lazyseq.LazySeq$$Lambda$2.get(Unknown Source) at com.blogspot.nurkiewicz.lazyseq.Cons.tail(Cons.java:32) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) When working with possibly infinite data structures, care must be taken. Avoid calling operations that must (size(), allMatch(), minBy(), forEach(), reduce(), ...) or can (filter(), distinct(), ...) traverse the whole sequence in order to give correct results. See Pitfalls for more examples and ways to avoid. Maturity Quality This project was started as an exercise and is not battle-proven. But a healthy 300+ unit-test suite (3:1 test code/production code ratio) guards quality and functional correctness. I also make sure LazySeq is as lazy as possible by mocking tail functions and verifying they are called as rarely as one can get. Contributions and bug reports In the event of finding a bug or missing feature, don't hesitate to open a new ticket or start pull request. I would also love to see more interesting usages of LazySeq in wild. Possible improvements Just like FixedCons is used when tail is known up-front, consider IterableCons that wraps existing Iterable in one node rather than building FixedCons hierarchy. This can be used for all concat methods. Parallel processing support (implementing spliterator?) License This project is released under version 2.0 of the Apache License.

May 15, 2013

by Tomasz Nurkiewicz

· 29,022 Views · 1 Like

Hebrew Search with ElasticSearch

Hebrew search is not an easy task, and HebMorph is a project I started several years ago to address that problem. After a certain period of inactivity I'm back actively working on it. I'm also happy to say there are already several live systems using it to enable Hebrew searches in their applications. This post is a short step-by-step guide on how to use HebMorph in an ElasticSearch installation. There are quite a few configuration options and things to consider when enabling Hebrew search, most are in the realm of performance vs relevance trade-offs, but I'll talk about those in a separate post. 0. What exactly is HebMorph HebMorph is a project a bit wider than just providing a Hebrew search plugin for ElasticSearch, but for the purpose of this post let us treat it in that narrow aspect. HebMorph has 3 main parts - the hspell dictionary files, the hebmorph-core package which is a wrapper around the dictionary files with important bits that allow for locating words even if they weren't written exactly as they appear in the dictionary, and the hebmorph-lucene package which contains various tools for processing streams of text into Lucene tokens - the searchable parts. To enable Hebrew search from ElasticSearch we are going to need to use the Hebrew analyzer class HebMorph provides to analyze incoming Hebrew texts. That is done by providing ElasticSearch with the HebMorph packages and then telling it to use the Hebrew analyzer on text fields as needed. 1. Get HebMorph and hspell At the moment you will have to compile HebMorph from sources yourself using Maven. In the future we might upload it to a centralized repository, but since we still actively working on a lot of stuff there it is still a bit too early for that. Probably the easiest way to get HebMorph is to do git clone from the main repository. The repository is located at https://github.com/synhershko/HebMorph and includes the latest hspell files already under /hspell-data-files. If you are new to git GitHub offers great tutorials for getting started with it, and they also enable you to download the entire source tree as a zip or a tarball. Once you have the sources, run mvn package or mvn install to create 2 jars - hebmorph-core and hebmorph-lucene. Those 2 packages are required before moving on to the next step. 2. Create an ElasticSearch plugin In this step we will create a new plugin which we will use in the next step to create the Hebrew analyzers in. If you already have a plugin you wish to use, skip to the next step. ElasticSearch plugins are compiled Java packages you simply drop to the plugins folder of your ElasticSearch installation and it gets detected automatically by the ElasticSearch instance once it is initialized. If you are new to this, you might want to read up a bit on that in the official ElasticSearch documentation. Here is a great guide to start with: http://jfarrell.github.io/ The gist of this is having a Java project with a es-plugin.properties file embedded as a resource and pointing to class that tells ElasticSearch what classes to load as plugins, and their plugin type. In the next section we will use this to add our own Analyzer implementation which makes use of HebMorph's capabilities. 3. Creating an Hebrew Analyzer HebMorph already comes with MorphAnalyzer - an Analyzer implementation which takes care of Hebrew-aware tokenization, lemmatization and whatnot. Because it is highly configurable, personally I prefer re-implementing it in the ElasticSearch plugin so it is easier to change the configurations in code. In case you wondered, I'm not planning in supporting external configurations for this as it is too subtle and you should really know what you are doing there. Don't forget to add dependencies to hebmorph-core and hebmorph-lucene to your project. My common Analyzer setup for Hebrew search looks like this: public abstract class HebrewAnalyzer extends ReusableAnalyzerBase { protected enum AnalyzerType { INDEXING, QUERY, EXACT } private static final DictRadix prefixesTree = LingInfo.buildPrefixTree(false); private static DictRadix dictRadix; private final StreamLemmatizer lemmatizer; private final LemmaFilterBase lemmaFilter; protected final Version matchVersion; protected final AnalyzerType analyzerType; protected final char originalTermSuffix = '$'; static { try { dictRadix = Loader.loadDictionaryFromHSpellData(new File(resourcesPath + "hspell-data-files"), true); } catch (IOException e) { // TODO log } } protected HebrewAnalyzer(final AnalyzerType analyzerType) throws IOException { this.matchVersion = matchVersion; this.analyzerType = analyzerType; lemmatizer = new StreamLemmatizer(null, dictRadix, prefixesTree, null); lemmaFilter = new BasicLemmaFilter(); } @Override protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) { // on query - if marked as keyword don't keep origin, else only lemmatized (don't suffix) // if word termintates with $ will output word$, else will output all lemmas or word$ if OOV if (analyzerType == AnalyzerType.QUERY) { final StreamLemmasFilter src = new StreamLemmasFilter(reader, lemmatizer, null, lemmaFilter); src.setAlwaysSaveMarkedOriginal(true); src.setSuffixForExactMatch(originalTermSuffix); TokenStream tok = new SuffixKeywordFilter(src, '$'); return new TokenStreamComponents(src, tok); } if (analyzerType == AnalyzerType.EXACT) { // on exact - we don't care about suffixes at all, we always output original word with suffix only final HebrewTokenizer src = new HebrewTokenizer(reader, prefixesTree, null); TokenStream tok = new NiqqudFilter(src); tok = new LowerCaseFilter(matchVersion, tok); tok = new AlwaysAddSuffixFilter(tok, '$', false); return new TokenStreamComponents(src, tok); } // on indexing we should always keep both the stem and marked original word // will ignore $ && will always output all lemmas + origin word$ // basically, if analyzerType == AnalyzerType.INDEXING) final StreamLemmasFilter src = new StreamLemmasFilter(reader, lemmatizer, null, lemmaFilter); src.setAlwaysSaveMarkedOriginal(true); TokenStream tok = new SuffixKeywordFilter(src, '$'); return new TokenStreamComponents(src, tok); } public static class HebrewIndexingAnalyzer extends HebrewAnalyzer { public HebrewIndexingAnalyzer() throws IOException { super(AnalyzerType.INDEXING); } } public static class HebrewQueryAnalyzer extends HebrewAnalyzer { public HebrewQueryAnalyzer() throws IOException { super(AnalyzerType.QUERY); } } public static class HebrewExactAnalyzer extends HebrewAnalyzer { public HebrewExactAnalyzer() throws IOException { super(AnalyzerType.EXACT); } } } You may notice how I created 3 separate analyzers - one for indexing, one for querying and the last for exact querying. I'll be talking more about this in future posts, but the idea is to be able to provide flexibility on querying while still allow for correct indexing. Configuring the analyzers to be picked up from ElasticSearch is rather easy now. First, you need to wrap each analyzer in a "provider", like so: public class HebrewQueryAnalyzerProvider extends AbstractIndexAnalyzerProvider { private final HebrewAnalyzer.HebrewQueryAnalyzer hebrewAnalyzer; @Inject public HebrewQueryAnalyzerProvider(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) throws IOException { super(index, indexSettings, name, settings); hebrewAnalyzer = new HebrewAnalyzer.HebrewQueryAnalyzer(); } @Override public HebrewAnalyzer.HebrewQueryAnalyzer get() { return hebrewAnalyzer; } } After you've created such providers for all types of analyzers, create an AnalysisBinderProcessor like this (or update your existing one with definitions for the Hebrew analyzers): public class MyAnalysisBinderProcessor extends AnalysisModule.AnalysisBinderProcessor { private final static HashMap> languageAnalyzers = new HashMap<>(); static { languageAnalyzers.put("hebrew", HebrewIndexingAnalyzerProvider.class); languageAnalyzers.put("hebrew_query", HebrewQueryAnalyzerProvider.class); languageAnalyzers.put("hebrew_exact", HebrewExactAnalyzerProvider.class); } public static boolean analyzerExists(final String analyzerName) { return languageAnalyzers.containsKey(analyzerName); } @Override public void processAnalyzers(final AnalyzersBindings analyzersBindings) { for (Map.Entry> entry : languageAnalyzers.entrySet()) { analyzersBindings.processAnalyzer(entry.getKey(), entry.getValue()); } } } Don't forget to update your Plugin class to catch the AnalysisBinderProcessor - it should look something like this (plus any other stuff you want to add there): public class MyPlugin extends AbstractPlugin { @Override public String name() { return "my-plugin"; } @Override public String description() { return "Implements custom actions required by me"; } @Override public void processModule(Module module) { if (module instanceof AnalysisModule) { ((AnalysisModule)module).addProcessor(new MyAnalysisBinderProcessor()); } } } 4. Using the Hebrew analyzers Compile the ElasticSearch plugin and drop it along with its dependencies in a folder under the /plugins folder of ElasticSearch. You now have 3 new types of analyzers at your disposal: "hebrew", "hebrew_query" and "hebrew_exact". For indexing, you want to use the "hebrew" analyzer. In your mapping, you can define a certain field or an entire set of fields to use that specific analyzer by setting the analyzer for that field. You can also leave the analyzer configuration blank, and specify the analyzer to use for those fields with unspecified analyzer using the _analyzer field in the index request. See more about both here and here. The "hebrew" analyzer will expand each term to all recognized lemmas; in case the word wasn't recognized it will try to tolerate spelling errors or missing Yud/Vav - most of the time it will be successful (with some rate of false positives, which the lemma-filters should remove to some degree). Some words will still remain unrecognized and thus will be indexed as-is. When querying using a QueryString query you can specify what analyzer to use - use the "hebrew_query" or "hebrew_exact" analyzer. The former will perform lemma expansion similar to the indexing analyzer, and the latter will avoid that and allow you to perform exact matches (useful when searching for names or exact phrases). I pretty much ignored a lot of the complexity involved in fine tuning searches for Hebrew, and many very cool things HebMorph allows you to do with Hebrew search for the sake of focus. I will revisit them in a later blog post. 5. Administration The hspell dictionary files are looked up by a physical location on disk - you will need to provide a path they are saved at. Since dictionaries update, it is sometimes easier to update them that way in a distributed environment like the one I'm working with. It may be desirable to have them compiled within the same jar file as the code itself - I'll be happy to accept a pull request to do that. The code above is working with ElasticSearch 0.90 GA and Lucene 4.2.1. I also had it running on earlier versions of both technologies, but may had to make a few minor changes. I assume the samples would break on future versions and I'll probably don't have much time going back and keeping it up to date, but bear in mind most of the time the changes are minor and easy to understand and make by yourself. Both HebMorph and the hspell dictionary are released under the AGPL3. For any questions on licensing, feel free to contact me.

May 6, 2013

by Itamar Syn-hershko

· 7,173 Views

Let's Talk ASM - String Concatenation

not a lot of developers today know assembly, which - regardless of your professional line of work - is a good skill to have. assembly teaches you think on a much lower level, going beyond the abstracted out layer provided by many of the high-level languages. today we're going to look at a way to implement a string concatenation function. specifically, i want to follow the following procedure for building the final result: ask the user for input append a crlf (carriage return + line feed) to the entered string append the entered string to the existing composite string follow back from step 1 until the user enters a terminator character display the composite string let's assume that you have zero knowledge of assembly. if that is the case, i would recommend starting here . in this example, i am using visual studio 2012 to test the code, but you might as well use an older version of the ide if you want. for convenience purposes, i would recommend downloading the basic framework code that comes for free from the writer of the introduction to 80x86 assembly language and computer architecture book: visual studio 2012 visual studio 2010 visual studio 2008 first, you have the standard declarations: .586 .model flat include io.h ; header file for input/output cr equ 0dh ; carriage return character lf equ 0ah ; line feed .stack 4096 .data prompt byte cr, lf, "original string? ",0 restitle byte "final result",0 stringin byte 1024 dup (?) stringout byte 1024 dup (?) linefeed byte cr, lf notice the reference to io.h - at this point you want a way to receive user input and display output data through standard winapi channels, and io.h does just that. some asm experts might argue that it is not a good idea to use winapi hooks in the context of a "pure" assembly program, for educational purposes, but in this situation the focus is on the inner workings of a different function. note: the program is adapted to the scenario where the execution of the string concatenation function is the sole purpose. as you will get a hang of the execution flow, you can easily adapt it to a scenario where some of the registers can be re-used. let's start by clearing the ecx and edx registers: .code _mainproc proc ; clear the ecx and edx registers because these will ; be used for length counters and sequential increments. xor ecx, ecx xor edx, edx once the strings will be entered by the user, i will need to find out the length of the string to append, in order to have a correct sequential memory address. now i need to get user input: input_data: ; prompt the user to enter the string he ultimately ; wants appended to the main string buffer. input prompt, stringin, 40 ; read ascii characters ; make sure that the string doesn't start with the $ character ; which would automatically mean that we need to terminate the ; reading process cmp stringin, '$' je done lea eax, [stringout + edx] ; destination address push eax ; push the destination on the stack lea eax, [stringin] ; source address push eax ; push the source on the stack call strcopy ; call the string copy procedure once the string is entered, i can check whether the terminator character - "$", was used. one of the great things about the cmp instruction is the fact that it checks the starting address of the entered string, therefore i can simply compare the entered data with a single character. in case the character is encountered, the program flow terminates at done, where the output is displayed: done: ; output the new data. output restitle, stringout mov eax, 0 ret strcopy is an internal procedure that will simply copy a string from one memory address to another: strcopy proc near32 push ebp mov ebp, esp push edi push esi pushf mov esi, [ebp+8] mov edi, [ebp+12] cld whilenonull: cmp byte ptr [esi], 0 je endwhilenonull movsb jmp whilenonull endwhilenonull: mov byte ptr [edi], 0 popf pop esi pop edi pop ebp ret 8 strcopy endp to make sure that the next string is properly appended, i need to find out the length of the previous one, for a correct memory address offset: ; let's get the length of the current string - move it ; to the proper register so that we can perform the measurement mov edi, eax ; find the length of the string that was just entered sub ecx, ecx sub al, al not ecx cld repne scasb not ecx dec ecx add edx, ecx repne scasb is used for an in-string iterative null terminator search (you can read more about it here ). it will decrement ecx for each character. ; we need to append the linefeed (crlf) to the string so we apply ; the same string concatenation procedure for that sequence. lea eax, [stringout + edx] ; destination address push eax ; first parameter lea eax, [linefeed] ; source push eax ; second parameter call strcopy ; call string copy procedure mov edi, eax ; we know that the crlf characters are 2 entities, therefore ; increment the overall counter by 2. add edx, 2 ; ask for more input because no terminator character was used. jmp input_data once the basic input data is processed, i can append the crlf sequence and increment edx for the proper offset, after which the program flow is being reset from the point where the user has to enter the next character sequence.

May 3, 2013

by Denzel D.

· 13,140 Views

Neo4j/Cypher: Returning a Row with Zero Count When No Relationship Exists

I’ve been trying to see if I can match some of the football stats that OptaJoe posts on twitter and one that I was looking at yesterday was around the number of red cards different teams have received. 1 – Sunderland have picked up their first PL red card of the season. The only team without one now are Man Utd. Angels. To refresh this is the sub graph that we’ll need to look at to work it out: I started off with the following query which traverses out from each match, finds the players who were sent off in the match and then groups the sendings off by the team they were playing for: START game = node:matches('match_id:*') MATCH game<-[:sent_off_in]-player-[:played]->likeThis-[:in]->game, likeThis-[:for]->team RETURN team.name, COUNT(game) AS redCards ORDER BY redCards LIMIT 5 When we run this we get the following results: +------------------------------+ | team.name | redCards | +------------------------------+ | "Sunderland" | 1 | | "West Ham United" | 1 | | "Norwich City" | 1 | | "Reading" | 1 | | "Liverpool" | 2 | +------------------------------+ 5 rows The problem we have here is that it hasn’t returned Manchester United because they haven’t yet received any red cards and therefore none of their players match the ‘sent_off_in’ relationship. I ran into something similar in a post I wrote about a month ago where I was working out which day of the week players scored on. The first step towards getting Manchester United to return with a count of 0 is to make the ‘sent_off_in’ relationship optional. However, that on its own that isn’t enough because it now returns a count of all the player performances for each team: START game = node:matches('match_id:*') MATCH game<-[?:sent_off_in]-player-[:played]->likeThis-[:in]->game, likeThis-[:for]->team RETURN team.name, COUNT(game) AS redCards ORDER BY redCards ASC LIMIT 5 +-----------------------------+ | team.name | redCards | +-----------------------------+ | "Chelsea" | 448 | | "Wigan Athletic" | 459 | | "Fulham" | 460 | | "Liverpool" | 466 | | "Everton" | 467 | +-----------------------------+ 5 rows Instead what we need to do is collect up all the ‘sent_off_in’ relationships and sum them up. We can use the COLLECT function to do that and the neat thing about COLLECT is that it doesn’t bother collecting the empty relationships so we end up with exactly what we need: START game = node:matches('match_id:*') MATCH game<-[r?:sent_off_in]-player-[:played]->likeThis-[:in]->game, likeThis-[:for]->team RETURN team.name, COLLECT(r) AS redCards LIMIT 5 +-----------------------------------------------------------------------------------------------------+ | team.name | redCards | +-----------------------------------------------------------------------------------------------------+ | "Wigan Athletic" | [:sent_off_in[26443] {},:sent_off_in[37785] {}] | | "Everton" | [:sent_off_in[6795] {minute:61},:sent_off_in[21735] {},:sent_off_in[34594] {}] | | "Newcastle United" | [:sent_off_in[434] {minute:75},:sent_off_in[32389] {},:sent_off_in[34915] {}] | | "Southampton" | [:sent_off_in[49393] {minute:70},:sent_off_in[49392] {minute:82}] | | "West Ham United" | [:sent_off_in[21734] {minute:67}] | +-----------------------------------------------------------------------------------------------------+ 5 rows We then just need to call the LENGTH function to work out how many red cards there are in each collection and then we’re done: START game = node:matches('match_id:*') MATCH game<-[r?:sent_off_in]-player-[:played]->likeThis-[:in]->game, likeThis-[:for]->team RETURN team.name, LENGTH(COLLECT(r)) AS redCards ORDER BY redCards LIMIT 5 +--------------------------------+ | team.name | redCards | +--------------------------------+ | "Manchester United" | 0 | | "West Ham United" | 1 | | "Sunderland" | 1 | | "Norwich City" | 1 | | "Reading" | 1 | +--------------------------------+ 5 rows

April 30, 2013

by Mark Needham

· 5,898 Views

XStream – XStreamely Easy Way to Work with XML Data in Java

from time to time there is a moment when we have to deal with xml data. and most of the time it is not the happiest day in our life. there is even a term “xml hell” describing situation when programmer has to deal with many xml configuration files that are hard to comprehend. but, like it or not, sometimes we have no choice, mostly because specification from client says something like “use configuration written in xml file” or something similar. and in such cases, xstream comes with its very cool features that make dealing with xml really less painful. overview xstream is a small library to serialize data between java objects and xml. it’s lightweight, small, has nice api and what is most important, it works with and without custom annotations that we might be not allowed to add when we are not the owner of java classes. first example suppose we have a requirement to load configuration from xml file: /users/tomek/work/mystuff/input.csv /users/tomek/work/mystuff/truststore.ts /users/tomek/work/mystuff/cn-user.jks password password user secret and we want to load it into configuration object: public class configuration { private string inputfile; private string user; private string password; private string truststorefile; private string keystorefile; private string keystorepassword; private string truststorepassword; // getters, setters, etc. } so basically what we have to do is: filereader filereader = new filereader("config.xml"); // load our xml file xstream xstream = new xstream(); // init xstream // define root alias so xstream knows which element and which class are equivalent xstream.alias("config", configuration.class); configuration loadedconfig = (configuration) xstream.fromxml(filereader); and that’s all, easy peasy something more serious ok, but previous example is very basic so now let’s do something more complicated: real xml returned by real webservice. 2013-03-09 john example 24 asd123123 2012-03-10 anna baker 26 axn567890 2010-12-05 tom meadow sgh08945 48 what we have here is simple list of bans written in xml. we want to load it into collection of ban objects. so let’s prepare some classes (getters/setters/tostring omitted): public class data { private list bans = new arraylist(); } public class ban { private string dateofupdate; private person person; } public class person { private string firstname; private string lastname; private int age; private string documentnumber; } as you can see there is some naming and type mismatch between xml and java classes (e.g. field name1->firstname, dateofupdate is string not a date), but it’s here for some example purposes. so the goal here is to parse xml and get data object with populated collection of ban instances containing correct data. let’s see how it can be achieved. parse with annotations first, easier way is to use annotations. and that’s the suggested approach in situation when we can modify java classes to which xml will be mapped. so we have: @xstreamalias("data") // maps data element in xml to this class public class data { // here is something more complicated. if we have list of elements that are // not wrapped in a element representing a list (like we have in our xml: // multiple elements not wrapped inside collection, // we have to declare that we want to treat these elements as an implicit list // so they can be converted to list of objects. @xstreamimplicit(itemfieldname = "ban") private list bans = new arraylist(); } @xstreamalias("ban") // another mapping public class ban { /* we want to have different field names in java classes so we define what element should be mapped to each field */ @xstreamalias("updated_at") // private string dateofupdate; @xstreamalias("troublemaker") private person person; } @xstreamalias("troublemaker") public class person { @xstreamalias("name1") private string firstname; @xstreamalias("name2") private string lastname; @xstreamalias("age") // string will be auto converted to int value private int age; @xstreamalias("number") private string documentnumber; and actual parsing logic is very short: filereader reader = new filereader("file.xml"); // load file xstream xstream = new xstream(); xstream.processannotations(data.class); // inform xstream to parse annotations in data class xstream.processannotations(ban.class); // and in two other classes... xstream.processannotations(person.class); // we use for mappings data data = (data) xstream.fromxml(reader); // parse // print some data to console to see if results are correct system.out.println("number of bans = " + data.getbans().size()); ban firstban = data.getbans().get(0); system.out.println("first ban = " + firstban.tostring()); as you can see annotations are very easy to use and as a result final code is very concise. but what to do in situation when we can’t modify mapping classes? we can use different approach that doesn’t require any modifications in java classes representing xml data. parse without annotations when we can’t enrich our model classes with annotations, there is another solution. we can define all mapping details using methods from xstream object: filereader reader = new filereader("file.xml"); // three first lines are easy, xstream xstream = new xstream(); // same initialisation as in the xstream.alias("data", data.class); // basic example above xstream.alias("ban", ban.class); // two more aliases to map... xstream.alias("troublemaker", person.class); // between node names and classes // we want to have different field names in java classes so // we have to use aliasfield(, , ) xstream.aliasfield("updated_at", ban.class, "dateofupdate"); xstream.aliasfield("troublemaker", ban.class, "person"); xstream.aliasfield("name1", person.class, "firstname"); xstream.aliasfield("name2", person.class, "lastname"); xstream.aliasfield("age", person.class, "age"); // notice here that xml will be auto-converted to int "age" xstream.aliasfield("number", person.class, "documentnumber"); /* another way to define implicit collection */ xstream.addimplicitcollection(bans.class, "bans"); data data = (data) xstream.fromxml(reader); // do the actual parsing // let's print results to check if data was parsed system.out.println("number of bans = " + data.getbans().size()); ban firstban = data.getbans().get(0); system.out.println("first ban = " + firstban.tostring()); as you can see xstream allows to easily convert more complicated xml structures into java objects, it also gives a possibility to tune results by using different names if this from xml doesn’t suit our needs. but there is one thing should catch your attention: we are converting xml representing a date into raw string which isn’t quite what we would like to get as a result. that’s why we will add converter to do some job for us. using existing custom type converter xstream library comes with set of built converters for most common use cases. we will use dateconverter. so now our class for ban looks like that: public class ban { private date dateofupdate; private person person; } and to use dateconverter we simply have to register it with date format that we expect to appear in xml data: xstream.registerconverter(new dateconverter("yyyy-mm-dd", new string[] {})); and that’s it. now instead of string our object is populated with date instance. cool and easy! but what about classes and situations that aren’t covered by existing converters? we could write our own. writing custom converter from scratch assume that instead of dateofupdate we want to know how many days ago update was done: public class ban { private int daysago; private person person; } of course we could calculate it manually for each ban object but using converter that will do this job for us looks more interesting. our daysagoconverter must implement converter interface so we have to implement three methods with signatures looking a little bit scary: public class daysagoconverter implements converter { @override public void marshal(object source, hierarchicalstreamwriter writer, marshallingcontext context) { } @override public object unmarshal(hierarchicalstreamreader reader, unmarshallingcontext context) { } @override public boolean canconvert(class type) { return false; } } last one is easy as we will convert only integer class. but there are still two methods left with these hierarchicalstreamwriter, marshallingcontext, hierarchicalstreamreader and unmarshallingcontext parameters. luckily, we could avoid dealing with them by using abstractsinglevalueconverter that shields us from so low level mechanisms. and now our class looks much better: public class daysagoconverter extends abstractsinglevalueconverter { @override public boolean canconvert(class type) { return type.equals(integer.class); } @override public object fromstring(string str) { return null; } public string tostring(object obj) { return null; } } additionally we must override method tostring(object obj) defined in abstractsinglevalueconverter as we want to store date in xml calculated from integer, not a simple object.tostring value which would be returned from default tostring defined in abstract parent. implementation code below is pretty straightforward, but most interesting lines are commented. i’ve skipped all validation stuff to make this example shorter. public class daysagoconverter extends abstractsinglevalueconverter { private final static string format = "yyyy-mm-dd"; // default date format that will be used in conversion private final datetime now = datetime.now().todatemidnight().todatetime(); // current day at midnight public boolean canconvert(class type) { return type.equals(integer.class); // converter works only with integers } @override public object fromstring(string str) { simpledateformat format = new simpledateformat(format); try { date date = format.parse(str); return days.daysbetween(new datetime(date), now).getdays(); // we simply calculate days between using jodatime } catch (parseexception e) { throw new runtimeexception("invalid date format in " + str); } } public string tostring(object obj) { if (obj == null) { return null; } integer daysago = ((integer) obj); return now.minusdays(daysago).tostring(format); // here we subtract days from now and return formatted date string } } usage to use our custom converter for a specific field we have to inform about it xstream object using registerlocalconverter: xstream.registerlocalconverter(ban.class, "daysago", new daysagoconverter()); we are using “local” method to apply this conversion only to specific field and not to every integer field in xml file. and after that we will get our ban objects populated with number of days instead of date. summary that’s all what i wanted to show you in this post. now you have basic knowledge about what xstream is capable of and how it can be used to easily map xml data to java objects. if you need something more advanced, please check project official page as it contains very good documentation and examples.

April 23, 2013

by Tomasz Dziurko

· 24,889 Views

What Does a Java Array Look Like in Memory?

arrays in java store one of two things: either primitive values (int, char, …) or references (a.k.a pointers). when an object is creating by using “new”, memory is allocated on the heap and a reference is returned. this is also true for arrays. 1. single-dimension array int arr[] = new int[3]; the int[] arr is just the reference to the array of 3 integer. if you create an array with 10 integer, it is the same – an array is allocated and a reference is returned. 2. two-dimensional array how about 2-dimensional array? actually, we can only have one dimensional arrays in java. 2d arrays are basically just one dimensional arrays of one dimensional arrays. int[ ][ ] arr = new int[3][ ]; arr[0] = new int[3]; arr[1] = new int[5]; arr[2] = new int[4]; multi-dimensional arrays use the name rules. 3. where are they located in memory? from the above, there are arrays and reference variables in memory. as we know that jvm runtime data areas include heap, jvm stack, and others. for a simple example as follows, let’s see where the array and its reference are stored. class a { int x; int y; } ... public void m1() { int i = 0; m2(); } public void m2() { a a = new a(); } ... when m1 is invoked, a new frame (frame-1) is pushed into the stack, and local variable i is also created in frame-1. when m2 is invoked inside of m1, another new frame (frame-2) is pushed into the stack. in m2, an object of class a is created in the heap and reference variable is put in frame-2. now, at this point, the stack and heap looks like the following: arrays are treated the same way like objects, so how array locates in memory is straight-forward.

April 19, 2013

by Ryan Wang

· 31,365 Views · 1 Like

Stepping Backwards while Debugging: Move To Line

it happens to me many times: i’m stepping with the debugger through my code, and ups! i made one step too far! debugging, and made one step over too far what now? restart the whole debugging session? actually, there is a way to go ‘backwards’ gdb has a ‘reverse debugging’ feature, described here . i’m using the eclipse based codewarrior debugger, and this debug engine is not using gdb. the codewarrior debugger in mcu10.3 supports an eclipse feature: i select a code line in the editor view and use move to line : move to line what it does: it changes the current pc (program counter) of the program to that line: performed move to line now i can continue debugging from that line, e.g. stepping into that function call. yes, this is not true backward debugging. but it is simple and very effective. to perform true backward stepping, the debugger would need to reverse all operations, typically with a rather heavy state machine and data recording. but for the usual case where i simply need to go back a few lines, the ‘move to line’ is perfect. of course there are a few points to consider: this only changes the program counter. any variable changes/etc are not affected or reverted. in case of highly optimized code, there might be multiple sequence points per source line. so doing this for highly optimized code might not work correctly. it works ok within a function. it is not recommended to use it e.g. to set the pc outside of a function. because the context/stack frame is not set up. i use the ‘move to line’ frequently to ‘advance’ the program execution. e.g. to bypass some long sequences i’m not interested in, or to get out of an ‘endless’ loop. the same ‘move to line’ as available while doing assembly stepping too. see this post for details. happy line moving

April 15, 2013

by Erich Styger

· 9,922 Views

Complex Event Processing Made Easy (using Esper)

The following is a very simple example of event stream processing (using the ESPER engine). Note - a full working example is available over on GitHub: https://github.com/corsoft/esper-demo-nuclear What is Complex Event processing (CEP)? Complex Event Processing (CEP), or Event Stream Stream Processing (ESP) are technologies commonly used in Event-Driven systems. These type of systems consume, and react to a stream of event data in real time. Typically these will be things like financial trading, fraud identification and process monitoring systems – where you need to identify, make sense of, and react quickly to emerging patterns in a stream of data events. Key Components of a CEP system A CEP system is like your typical database model turned upside down. Whereas a typical database stores data, and runs queries against the data, a CEP data stores queries, and runs data through the queries. To do this it basically needs: Data – in the form of ‘Events’ Queries – using EPL (‘Event Processing Language’) Listeners – code that ‘does something’ if the queries return results A Simple Example - A Nuclear Power Plant Take the example of a Nuclear Power Station.. Now, this is just an example – so please try and suspend your disbelief if you know something about Nuclear Cores, Critical Temperatures, and the like. It’s just an example. I could have picked equally unbelievable financial transaction data. But ... Monitoring the Core Temperature Now I don’t know what the core is, or if it even exists in reality – but for this example lets assume our power station has one, and if it gets too hot – well, very bad things happen.. Lets also assume that we have temperature gauges (thermometers?) in place which take a reading of the core temperature every second – and send the data to a central monitoring system. What are the requirements? We need to be warned when 3 types of events are detected: MONITOR just tell us the average temperature every 10 seconds - for information purposes WARNING WARN us if we have 2 consecutive temperatures above a certain threshold CRITICAL ALERT us if we have 4 consecutive events, with the first one above a certain threshold, and each subsequent one greater than the last – and the last one being 1.5 times greater than the first. This is trying to alert us that we have a sudden, rising escalating temperature spike – a bit like the diagram below. And let’s assume this is a very bad thing. Using Esper There are a number of ways you could approach building a system to handle these requirements. For the purpose of this post though - we will look at using Esper to tackle this problem How we approach this with Esper is: Using Esper – we can create 3 queries (using EPL - Esper Query Language) to model each of these event patterns. We then attach a listener to each query - this will be triggered when the EPL detects a matching pattern of events) We create an Esper service, and register these queries (and their listeners) We can then just throw Temperature data through the service – and let Esper tell alert the listeners when we get matches. (A working example of this simple solution is available on Githib - see link above) Our Simple ESPER Solution At the core of the system are the 3 queries for detecting the events. Query 1 – MONITOR (Just monitor the average temperature) select avg(value) as avg_val from TemperatureEvent.win:time_batch(10 sec) Query 2 – WARN (Tell us if we have 2 consecutive events which breach a threshold) select * from TemperatureEvent " match_recognize ( measures A as temp1, B as temp2 pattern (A B) define A as A.temperature > 400, B as B.temperature > 400) Query 3 – CRITICAL - 4 consecutive rising values above all above 100 with the fourth value being 1.5x greater than the first select * from TemperatureEvent match_recognize ( measures A as temp1, B as temp2, C as temp3, D as temp4 pattern (A B C D) define A as A.temperature > 100, B as (A.temperature < B.value), C as (B.temperature < C.value), D as (C.temperature < D.value) and D.value > (A.value * 1.5)) Some Code Snippets TemperatureEvent We assume our incoming data arrives in the form of a TemperatureEvent POJO If it doesn't - we can convert it to one, e.g. if it comes in via a JMS queue, our queue listener can convert it to a POJO. We don't have to do this, but doing so decouples us from the incoming data structure, and gives us more flexibility if we start to do more processing in our Java code outside the core Esper queries. An example of our POJO is below package com.cor.cep.event; package com.cor.cep.event; import java.util.Date; /** * Immutable Temperature Event class. * The process control system creates these events. * The TemperatureEventHandler picks these up * and processes them. */ public class TemperatureEvent { /** Temperature in Celcius. */ private int temperature; /** Time temerature reading was taken. */ private Date timeOfReading; /** * Single value constructor. * @param value Temperature in Celsius. */ /** * Temerature constructor. * @param temperature Temperature in Celsius * @param timeOfReading Time of Reading */ public TemperatureEvent(int temperature, Date timeOfReading) { this.temperature = temperature; this.timeOfReading = timeOfReading; } /** * Get the Temperature. * @return Temperature in Celsius */ public int getTemperature() { return temperature; } /** * Get time Temperature reading was taken. * @return Time of Reading */ public Date getTimeOfReading() { return timeOfReading; } @Override public String toString() { return "TemperatureEvent [" + temperature + "C]"; } } Handling this Event In our main handler class - TemperatureEventHandler.java, we initialise the Esper service. We register the package containing our TemperatureEvent so the EPL can use it. We also create our 3 statements and add a listener to each statement /** * Auto initialise our service after Spring bean wiring is complete. */ @Override public void afterPropertiesSet() throws Exception { initService(); } /** * Configure Esper Statement(s). */ public void initService() { Configuration config = new Configuration(); // Recognise domain objects in this package in Esper. config.addEventTypeAutoName("com.cor.cep.event"); epService = EPServiceProviderManager.getDefaultProvider(config); createCriticalTemperatureCheckExpression(); createWarningTemperatureCheckExpression(); createTemperatureMonitorExpression(); } An example of creating the Critical Temperature warning and attaching the listener /** * EPL to check for a sudden critical rise across 4 events, * where the last event is 1.5x greater than the first. * This is checking for a sudden, sustained escalating * rise in the temperature */ private void createCriticalTemperatureCheckExpression() { LOG.debug("create Critical Temperature Check Expression"); EPAdministrator epAdmin = epService.getEPAdministrator(); criticalEventStatement = epAdmin.createEPL(criticalEventSubscriber.getStatement()); criticalEventStatement.setSubscriber(criticalEventSubscriber); } And finally - an example of the listener for the Critical event. This just logs some debug - that's as far as this demo goes. package com.cor.cep.subscriber; import java.util.Map; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.springframework.stereotype.Component; import com.cor.cep.event.TemperatureEvent; /** * Wraps Esper Statement and Listener. No dependency on Esper libraries. */ @Component public class CriticalEventSubscriber implements StatementSubscriber { /** Logger */ private static Logger LOG = LoggerFactory.getLogger(CriticalEventSubscriber.class); /** Minimum starting threshold for a critical event. */ private static final String CRITICAL_EVENT_THRESHOLD = "100"; /** * If the last event in a critical sequence is this much greater * than the first - issue a critical alert. */ private static final String CRITICAL_EVENT_MULTIPLIER = "1.5"; /** * {@inheritDoc} */ public String getStatement() { // Example using 'Match Recognise' syntax. String criticalEventExpression = "select * from TemperatureEvent " + "match_recognize ( " + "measures A as temp1, B as temp2, C as temp3, D as temp4 " + "pattern (A B C D) " + "define " + " A as A.temperature > " + CRITICAL_EVENT_THRESHOLD + ", " + " B as (A.temperature < B.temperature), " + " C as (B.temperature < C.temperature), " + " D as (C.temperature < D.temperature) " + "and D.temperature > " + "(A.temperature * " + CRITICAL_EVENT_MULTIPLIER + ")" + ")"; return criticalEventExpression; } /** * Listener method called when Esper has detected a pattern match. */ public void update(Map eventMap) { // 1st Temperature in the Critical Sequence TemperatureEvent temp1 = (TemperatureEvent) eventMap.get("temp1"); // 2nd Temperature in the Critical Sequence TemperatureEvent temp2 = (TemperatureEvent) eventMap.get("temp2"); // 3rd Temperature in the Critical Sequence TemperatureEvent temp3 = (TemperatureEvent) eventMap.get("temp3"); // 4th Temperature in the Critical Sequence TemperatureEvent temp4 = (TemperatureEvent) eventMap.get("temp4"); StringBuilder sb = new StringBuilder(); sb.append("***************************************"); sb.append("\n* [ALERT] : CRITICAL EVENT DETECTED! "); sb.append("\n* " + temp1 + " > " + temp2 + " > " + temp3 + " > " + temp4); sb.append("\n***************************************"); LOG.debug(sb.toString()); } } The Running Demo Full instructions for running the demo can be found here: https://github.com/corsoft/esper-demo-nuclear An example of the running demo is shown below - it generates random Temperature events and sends them through the Esper processor (in the real world this would come in via a JMS queue, http endpoint or socket listener). When any of our 3 queries detect a match - debug is dumped to the console. In a real world solution each of these 3 listeners would handle the events differently - maybe by sending messages to alert queues/endpoints for other parts of the system to pick up the processing. Conclusions Using a system like Esper is a neat way to monitor and spot patterns in data in real time with minimal code. This is obviously (and intentionally) a very bare bones demo, barely touching the surface of the capabilities available. Check out the Esper web site for more info and demos. Esper also has a plugin for Apache Camel Integration engine - which allows you to configure you EPL queries directly in XML Spring camel routes, removing the need for any Java code completely (we will possibly cover this in a later blog post!)

April 11, 2013

by Adrian Milne

· 68,406 Views · 4 Likes

Monitoring with DataDog

Recently I found myself sending more and more business metrics to Datadog, a Software as a Service solution that promises to collect all your data points and build business metrics, displaying them as graphs and triggering alerts whenever they get to critically low (or high) levels. The goals The more your automated tests raises their level of abstraction, the more they become oriented to external quality (what the customer wants and does) instead of internal quality (low coupling, high cohesion of the software design). The largest end-to-end tests that we have in place at Onebip connect several different projects on an integration server and run everything from the creation of a purchase or subscription to its renewal and termination (events that would happen months after creation). However, even end-to-end tests cannot guarantee that our applications work against external resources, such as merchants, mobile carrier, and ISPs. The only way to catch integration problems is monitoring. These problems, like a mobile carrier experiencing an outage, may be due to our errors or to external conditions; but they should nevertheless be discovered as early as possible. The infrastructure Datadog is the only data-collection service that passed the stress tests of SLL, our solution architect. It ships as an UDP server that you pay basing on the number of machines you want to run it on; for example, a preproduction and a production server are a common choice to start out. The server collects data locally and periodically uploads it to Datadog in bursts, where you can access it via a web application or via APIs in case you want to call it from your build. The UDP protocol is aligned with the goals of metric collections: a silent server that decouples the sending of metrics from the rest of the business logic: UDP packets are just lost if no process is there listening to them, no errors are raised if the server crashes or is not running or installed for some reason for instance in development machines). The monitoring code, which you write, should be decoupled and asynchronous as much as possible. The part that talks over the network is already externalized in the DataDog server, but you don't want the user to wait because you have to send some strange number. So the internal part (sending via UDP) is performed in Listener objects that implement the Observer pattern. These object still have to be wrapped in all-encompassing try/catch constructs so that any errors in the monitoring part never influence the business logic. Againg, you don't want a payment to fail because of an exception in how monitoring DateTime objects are built. For PHP we built a SilentListener class to wrap all of our object: class SilentListener { private $wrapped; public function __construct($wrapped) { $this->wrapped = $wrapped; } public function __call($method, $args) { try { call_user_func_array(array($this->wrapped, $method), $args); } catch (Exception $e) { $this->log($e); } } }SLL An example In some countries, we receive payments through mobile-originated messages (MO), a fancy word for saying SMS sent by the end user. So a simple way to monitor if we are receiving payment or if the server is exploded is to upload a metric counting them every time we receive one (pseudo-JSON format to show you the data): { counter: 1 } However, we can be more precise than this: an external outage or an integration problem may happen to a lower level than the whole application. For example, MOs can be delayed in Argentina, by a single carrier, while the rest of the world is still working fine. So our data points look like this: { counter: 1, tags: { country: "IT", carrier: "Vodafone", merchant: "Tasty Cookies, Inc.", } } and in turn graphs on DataDog or calls to its API can set up filters so that we can, if necessary, view only the data related to any combination of country, carrier and merchant. The nice thing, SLL says, is that you just start send data from production and only after you have data points available you build a graph or an alert system basing on what appears to be the most important tags. For example, a big merchant may benefit from some dedicated monitoring, while minor countries such as Vietnam should be monitored as a whole since their traffic is by far lower than that of the others.

April 10, 2013

by Giorgio Sironi

· 16,495 Views · 1 Like

Add Custom Post Meta Data To Post List Table

One of the best thing about WordPress is that you can customise almost anything. In the admin area you can see a list of all the posts you have added in WordPress. Within this table it shows the basic information for each of the posts, the title, the author, the category, tags, comments and the date the post was published. WordPress has a number of different filters and actions that allow you to edit the output of the column so you can add your own data to this list. For example if you have custom post meta data which is useful information you want to display on the list of posts you can add new custom columns to the list. In this article you will learn how to add new columns to the post list, how you can add data to the column and how you can make this column sortable. Add New Columns First we start off by adding the new column to the list, for this we use the WordPress filter manage_edit-post_columns. This will allow you to edit the output of the columns by adding new values to the column array. The callback function on this filter will pass in one parameter which are the current columns on the list, the return of this function will be the new columns on the post table. This means that we can add additional values to the array to add extra columns to the table. The following code will add a new column to the table just after the title column. // Add a column to the edit post list add_filter( 'manage_edit-post_columns', 'add_new_columns'); /** * Add new columns to the post table * * @param Array $columns - Current columns on the list post */ function add_new_columns( $columns ) { $column_meta = array( 'meta' => 'Custom Column' ); $columns = array_slice( $columns, 0, 2, true ) + $column_meta + array_slice( $columns, 2, NULL, true ); return $columns; } Add Columns To Custom Post Types If you have custom post types in your site and want to add additional columns to this list, WordPress comes with built in filters you can apply to add new columns to this table. add_filter( 'manage_${post_type}_posts_columns', 'add_new_columns'); If you have a custom post type of portfolio then you can use the following code to add a column to the list of portfolio post types. function add_portfolio_columns($columns) { return array_merge($columns, array('client' => __('Client'), 'project_date' =>__( 'Project Date'))); } add_filter('manage_portfolio_posts_columns' , 'add_portfolio_columns'); Add Data To Custom Columns Once you have created the new columns for the posts list you can now add data to the new columns by using the WordPress action manage_posts_custom_column. Adding an action to this will be called on each column, from this call we can get data for the post display this on the post list. The following code will check what column we are on and get the custom post meta data for the current post and display this in the column. // Add action to the manage post column to display the data add_action( 'manage_posts_custom_column' , 'custom_columns' ); /** * Display data in new columns * * @param $column Current column * * @return Data for the column */ function custom_columns( $column ) { global $post; switch ( $column ) { case 'meta': $metaData = get_post_meta( $post->ID, 'twitter_url', true ); echo $metaData; break; } } Add Data To Custom Post Type Columns Along with being able to add a filter to custom post types by adding new columns, WordPress has a built in action you can use to add data to custom columns. In the above example we add a new column just for post types of a portfolio to add two new columns to the list, using the below code you can add data to these new columns. function custom_portfolio_column( $column, $post_id ) { switch ( $column ) { case 'project_date': echo get_post_meta( $post_id , 'project_date' , true ); break; case 'client': echo get_post_meta( $post_id , 'client' , true ); break; } } add_action( 'manage_portfolio_posts_custom_column' , 'custom_portfolio_column' ); Make Columns Sortable By default the new custom columns are not sortable so this makes it hard to find data that you need. To sort the custom columns WordPress has another filter manage_edit-post_sortable_columns you can use to assign which columns are sortable. When this action is ran the function will pass in a parameter of all the columns which are currently sortable, by adding your new custom columns to this list will now make these columns sortable. The value you give this will be used in the URL so WordPress understands which column to order by. The following to allow you to sort by the custom column meta. // Register the column as sortable function register_sortable_columns( $columns ) { $columns['meta'] = 'Custom Column'; return $columns; } add_filter( 'manage_edit-post_sortable_columns', 'register_sortable_columns' ); That's all the information you need to change the way posts are listed in your admin area. What useful information do you wish was displayed in the post list?

April 10, 2013

by Paul Underwood

· 13,684 Views

Why Encapsulation Matters

Encapsulation is more than just defining accessor and mutator methods for a class. It is a broader concept of programming, not necessarily object-oriented programming, that consists in minimizing the interdependence between modules and it’s typically implemented through information hiding. Paramount to understand encapsulation is the realization that it has two main objectives: (1) hiding complexity and (2) hiding the sources of change. About Hiding Complexity Encapsulation is inherently related to the concepts of modularity and abstraction. So, in my opinion, to really understand encapsulation, one must first understand these two concepts. Let’s consider, for example, the level of abstraction in the concept of a car. A car is complex in its internal working. They have several subsystem, like the transmission system, the break system, the fuel system, etc. However, we have simplified its abstraction, and we interact with all cars in the world through the public interface of their abstraction: we know that all cars have a steering wheel through which we control direction, they have a pedal that when we press it we accelerate the car and control speed, and another one that when we press it we make it stop, and we have a gear stick that let us control if we go forward or backwards. These features constitute the public interface of the car abstraction. In the morning we can drive a sedan and then get out of it and drive an SUV in the afternoon as if it was the same thing. However, few of us know the details of how all these features are implemented under the hood. So, this simple analogy shows that human beings deal with complexity by defining abstractions with public interfaces that we use to interact with them and all the unnecessary details get hidden under the hood of these abstractions. And I want to emphasize that word “unnecessary” here, because the beauty of an abstraction is not having to understand all those details in order to be able to use it, we just need to understand a broader abstract concept and how it works and how we interact with it. That’s why most of us don’t know or don’t care how a car works under the hood, but that doesn’t prevents us from driving one. In his book Code Complete, Steve McConnell uses the analogy of an iceberg: only a small portion of an iceberg is visible on the surface, most of its true size is hidden underwater. Similarly, in our software designs the visible parts of our modules/classes constitute their public interface, and this is exposed to the outside world, the rest of it should be hidden to the naked eye. In the words of McConell “the interface to a class should reveal as little as possible about its inner workings”. Clearly, based on our car analogy, we can see that this encapsulation is good, since it hides unnecessary/complex details from the users. It makes objects simpler to use and understand. About Hiding the Sources of Change Now, continuing with the analogy; think of the time when cars did not have a hydraulics directional system. One day, the car manufactures invented it, and they decide it to put it in cars from there on. Still, this did not change the way in which drivers were interacting with them. At most, users experienced an improvement in the use of the directional system. A change like this was possible because the internal implementation of a car is encapsulated, that is, is hidden from its user. In other words changes can be safely done without affecting its public interface. In a similar way, if we achieve proper levels of encapsulation in our software design we can safely foster change and evolution of our APIs without breaking its users, by this minimizing the impact of changes and the interdependence of modules. Therefore, encapsulation is a way to achieve another important attribute of a good software design known as loose coupling. In his book Effective Java, Joshua Block highlights the power of information hiding and loose coupling when he says: “Information hiding is important for many reasons, most of which stem from the fact that it decouples the modules that compromise a system, allowing them to be developed, tested, optimized, used, understood, and modified in isolation. This speeds up system development because modules can be developed in parallel. It eases the burden of maintenance because modules can be understood more quickly and debugged with little fear of harming other modules [...] it enables effective performance tuning [since] those modules can be optimized without affecting the correctness of other modules increases software reuse because modules that aren’t tightly coupled often prove useful in other contexts besides the ones for which they were developed”. So, once more, we can clearly see that encapsulation is a desirable attribute that eases the introduction of change and foster the evolution of our APIs. As long as we respect the public interface of our abstractions we are free to change whatever we want of its encapsulated inner workings. About Breaking the Public Interface So what happens when we do not achieve the proper levels of encapsulation in our designs? Now, think that car manufactures decided to put the fuel cap below the car, and not in one of its sides. Let’s say we go and buy one of these new cars, and when we run out of gas we go to the nearest gas station, and then we do not find the fuel cap. Suddenly we realize is below the car, but we cannot reach it with the gas pump hose. Now, we have broken the public interface contract, and therefore, the entire world breaks, it falls apart because things are not working the way it was expected. A change like this would cost millions. We would need to change all gas pumps, not to mention mechanical shops and auto parts. When we break encapsulation we have to pay a price. This last part of our analogy, clearly reveals that failing to define proper abstractions with proper levels of encapsulation will end up causing difficulties when change finally happens. So, as we can see, the goal of encapsulation is reduce the complexity of the abstractions by providing a way to hide implementation details and it also help us to minimize interdependence and facilitate change. We maximize encapsulation by minimizing the exposure of implementation details. However encapsulation will not help us if we do not define proper abstractions. Simply put, there is no way to change the public interface of an abstraction without breaking its users. So, the design of good abstractions is of paramount importance to facilitate the evolution of the APIs, encapsulation is just one of the tools that help us create this good abstractions, but no level of encapsulation is going to make a bad abstraction work. Encapsulation in Java One of those things that we always want to encapsulate is the state of a class. The state of a class should only be accessed through its public interface. In a object-oriented programming language like Java, we achieve encapsulation by hiding details using the accessibility modifiers (i.e. public, protected, private, plus no modifier which implies package private). With these levels of accessibility we control the level of encapsulation, the less restrictive the level, the more expensive change is when it happens and the more coupled the class is with other dependent classes (i.e. user classes, subclasses, etc.). In object-oriented languages a class has two public interfaces: the public interface shared with all users of the class, and the protected interface shared with subclasses. It is of paramount importance that we design the proper levels of encapsulation for every one of these public interfaces so that we can facilitate change and foster evolution of our APIs. Why Getters and Setters? Many people wonder why we need accessor and mutator methods in Java (a.k.a. getters and setters), why can’t we just access the data directly? But the purpose of encapsulation here is is not to hide the data itself, but the implementation details on how this data is manipulated. So, once more what we want is a way to provide a public interface through which we can gain access to this data. We can later change the internal representation of the data without compromising the public interface of the class. On the contrary, by exposing the data itself, we compromise encapsulation, and therefore, the capacity of changing the ways to manipulate this data in the future without affecting its users. We would create a dependency with the data itself, and not with the public interface of the class. We would be creating a perfect cocktail for trouble when “change” finally finds us. There are several compelling reasons why we might want to encapsulate access to our fields. The best compendium of these reasons I have ever found is described in Joshua Bloch’s book [a href="http://192.9.162.55/docs/books/effective/" target="_blank"]Effective Java. There in Item 14: Minimize the accessibility of classes and members, he mentions several reasons, which I mention here: You can limit the values that can be stored in a field (i.e. gender must be F or M). You can take actions when the field is modified (trigger event, validate, etc). You can provide thread safety by synchronizing the method. You can switch to a new data representation (i.e. calculated fields, different data type) However, it is very important to understand that encapsulation is more than hiding fields. In Java we can hide entire classes, by this, hiding the implementation details of an entire API. My understanding of this important concept was broaden and enriched by my reading of a great article by Alan Snyder called Encapsulation and Inheritance in Object-Oriented Programming Languages which I recommend to all readers of this blog. I found a version of it available on the Web and I shared a link to it a the end of this article. Further Reading Encapsulation and Inheritance in Object-oriented Programming Languages Effective Java: Minimize the Accessibility of Classes and Members (p. 67-69) Code Complete 2: Hiding Secrets (Information Hiding) (p. 92).

April 7, 2013

by Edwin Dalorzo

· 56,359 Views · 3 Likes

Weekend Project: Send sensor data from Arduino to MongoDB

Arduino is an open-source electronics platform that can acknowledge and interact with its environment through a variety of sensor types. It’s great for hardware prototyping and one-off projects. I just got an Arduino Board from our friends at SendGrid, who also gave me a little tutorial in the art of Arduino hacking. Inspired by the tutorial and armed with this new board, I bought a passive infared (PIR) motion sensor from my local Radio Shack. Now I was ready to play; in particular, I wanted to be able to collect that continuous stream of hardware sensor data into a MongoDB database for logging, trend analysis, system event correlation, etc. To this end, I created the demo project “mongodb-motion”, which I’ve made public on Github. In the “mongodb-motion” Github repo, you will find an Arudino project that writes motion sensor data to a cloud MongoDB database at MongoLab and sends alerts via email based on certain criteria. I built this demo using Node.js and the MongoLab REST API. Below, I’ll go through exactly what hardware you need to make your own “mongodb-motion” project a success, and how the code actually works. What You Need The hardware used in this demo includes: an Arduino UNO R3 and a Parallax PIR motion sensor. How the Code Works You can use a variety of motion sensors with the Arduino. In this particular experiment, I used a PIR motion sensor. The PIR motion sensor behaves like a switch, with ‘down’ events emitted on motion detection and ‘up’ events a few seconds after motion ceases to be detected. On the receiving side, I used JohnnyFive, an appropriately named Node.js package that accepts sensor events and sends messages to the Arduino board. With the two ends set, I’ll move on to the project’s configuration file. In this demo, I’ve included a configuration file, config-sample.js, where credentials for the MongoLab REST API and for the email SMTP server can be added. In my case, I used the SendGrid SMTP service. The configuration file also has two callbacks that determine when an email is emitted, one for each type of event – “detect” and “ceased”. I’ve used this feature to automatically send an email alert if an event timestamp is between 7:00pm and 8:00am, ostensibly when my office should be motionless… I’m out there watching you, office! Once you’ve customized this config-sample.js file, be sure to rename it to config.js in order for it to be usable. If you inspect the project code, you’ll notice that the MongoLab REST API is called in the logMsg() function, using an https.request. Building this little demo has given me some new ideas for hardware hacking the cloud. I hope you give it a try too. Thanks to the Arduino, Node.js and Javascript communities, and special thanks to Rick Waldon for Johnny Five, SendGrid for the UNO board, and a big shout out to @swiftalphaone for the Waza tutorial.

April 3, 2013

by Ben Wen

· 17,816 Views