DZone
The Latest Databases Topics

3 Ways to Optimize for Paging in MySQL
Many web applications need to page through information, from customer records to the albums in your iTunes collection. As web developers and architects, it's important that we do this efficiently. Start by looking at how you're fetching information from your MySQL database. We've outlined three ways to do just that.

1. Paging without discarding records

Ultimately we're trying to avoid discarding records. After all, if the server doesn't fetch them, we save big. How can we avoid this extra work? Remember the last value shown on the previous page and start from there. For example:

    SELECT id, name, address, phone
    FROM customers
    WHERE id > 990
    ORDER BY id
    LIMIT 10;

Of course, such a solution only works if you are paging by ID. If you page by name, it gets messier, as there may be more than one person with the same name. If ID doesn't work for your application, perhaps returning paged users by USERNAME might work. Those would be unique:

    SELECT id, username
    FROM customers
    WHERE username > '[email protected]'
    ORDER BY username
    LIMIT 10;

Paging queries can be slow in SQL because they often involve the OFFSET keyword, which tells the server you only want a subset. However, the server typically scans and collects the skipped rows first and then discards them. With a deferred join, or by maintaining a place or position column, you can avoid this and speed up your database dramatically.

2. Try using a Deferred Join

This is an interesting trick. Suppose you have pages of customers, and each page displays ten customers. The query uses LIMIT to get ten records and OFFSET to skip all the previous page results. When you get to the 100th page, it's doing LIMIT 10 OFFSET 990, so the server has to read all those records and then discard them.

    SELECT id, name, address, phone
    FROM customers
    ORDER BY name
    LIMIT 10 OFFSET 990;

MySQL first scans an index and then retrieves rows in the table by primary key ID, so it does a double lookup for every row, including the 990 it discards.

It turns out you can make this faster with a trick called a deferred join. The inner piece just uses the primary key. An explain plan shows us "using index", which we love!

    SELECT id
    FROM customers
    ORDER BY name
    LIMIT 10 OFFSET 990;

Now combine this using an INNER JOIN to get the ten rows and the data you want:

    SELECT id, name, address, phone
    FROM customers
    INNER JOIN (
        SELECT id
        FROM customers
        ORDER BY name
        LIMIT 10 OFFSET 990
    ) AS my_results USING(id);

That's pretty cool!

3. Maintain a Page or Place column

Another way to keep the optimizer from retrieving rows it doesn't need is to maintain a column for the page, place or position. Yes, you need to update that column whenever you (a) INSERT a row, (b) DELETE a row, or (c) move a row with UPDATE. This could get messy with a page column, but a straight place or position column might be easier to maintain.

    SELECT id, name, address, phone
    FROM customers
    WHERE page = 100
    ORDER BY name;

Or, with a place column, something like this:

    SELECT id, name, address, phone
    FROM customers
    WHERE place BETWEEN 990 AND 999
    ORDER BY name;
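The difference between OFFSET paging and "remember the last ID" paging can be sketched outside the database. The following is only an in-memory analogy in Java, not MySQL itself: a TreeMap stands in for the primary key index, and the class and method names (PagingSketch, pageByOffset, pageAfterId, sampleCustomers) are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** In-memory analogy for OFFSET paging versus "remember the last id" paging. */
final class PagingSketch {

    /** OFFSET-style paging: walks past `offset` rows and throws them away first. */
    static List<String> pageByOffset(NavigableMap<Integer, String> rowsById, int offset, int limit) {
        List<String> page = new ArrayList<>();
        int seen = 0;
        for (Map.Entry<Integer, String> row : rowsById.entrySet()) {
            if (seen++ < offset) {
                continue; // this row was read only to be discarded
            }
            if (page.size() == limit) {
                break;
            }
            page.add(row.getValue());
        }
        return page;
    }

    /** Keyset-style paging: seeks straight to the first id greater than lastSeenId. */
    static List<String> pageAfterId(NavigableMap<Integer, String> rowsById, int lastSeenId, int limit) {
        List<String> page = new ArrayList<>();
        for (String name : rowsById.tailMap(lastSeenId, false).values()) {
            if (page.size() == limit) {
                break;
            }
            page.add(name);
        }
        return page;
    }

    /** Builds made-up customers with ids 1..count. */
    static NavigableMap<Integer, String> sampleCustomers(int count) {
        NavigableMap<Integer, String> rows = new TreeMap<>();
        for (int id = 1; id <= count; id++) {
            rows.put(id, "customer-" + id);
        }
        return rows;
    }

    public static void main(String[] args) {
        NavigableMap<Integer, String> customers = sampleCustomers(2000);
        System.out.println(pageByOffset(customers, 990, 10));
        System.out.println(pageAfterId(customers, 990, 10));
    }
}
```

Both calls return the same ten rows, but the first loop touches 1,000 entries while the second touches only the ten it returns, which is the saving the WHERE id > 990 query is after.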
June 25, 2013
by Sean Hull
· 21,235 Views
Implementing Memcached as a Servlet Filter for Spring MVC-Based RESTful Services
I have a number of Spring MVC based RESTful services that return JSON. In 90% of cases, the state of the objects these services return will not change within a 24 hour period. This makes the JSON objects perfect candidates for simple caching with memcached.

The idea was to have every request to the Spring controllers intercepted, a cache key generated, and the key checked against the cache. If the key and its corresponding value (a JSON string) are available (a cache hit), the value is returned to the caller as-is without a full round trip to the database. If the cache has no entry for the key (a cache miss), the call is forwarded to the controller, which calls the logic that fetches the desired object from the database. The result is returned to the caller and also written to the cache.

Keys are generated from the URL of the service for GET requests, and from the URL concatenated with the POSTed input (as JSON) for POST requests. The resulting strings are hashed with MD5 to produce a 32 character cache key, which is well within memcached's 250 character key length limit. The performance impact of using MD5 is yet to be evaluated during our load testing cycle.

I started off trying to get hold of the JSON response in the postHandle method of a Spring HandlerInterceptor. However, since we are using the @ResponseBody annotation in our controller, the JSON is written directly to the stream, and the ModelAndView is null for that reason. If we removed the annotation and returned a ModelAndView from the controller, the intended JSON object got enclosed in a map wrapper. A quick question on Stack Overflow didn't help, as the only suggestion I got was to extract my original object from the map wrapper. I wanted to keep this option (as discussed here as well) as my last resort.
The solution I eventually came up with involved:

  • Replacing the HandlerInterceptor with Servlet Filters
  • Using DelegatingFilterProxy to make my filters Spring application context aware
  • Using HttpServletRequestWrapper to get control of the POST request body in the filter on the way in
  • Using HttpServletResponseWrapper to get control of the response content in the filter on the way out

True, it's probably a more complex solution than just overriding MappingJacksonJsonView and extracting my JSON object, but it is more generic, as it does not assume that all my content will always be JSON.

Let's start with the filter definition in web.xml:

    <filter>
        <filter-name>cacheFilter</filter-name>
        <filter-class>org.springframework.web.filter.DelegatingFilterProxy</filter-class>
    </filter>
    ...
    <filter-mapping>
        <filter-name>cacheFilter</filter-name>
        <url-pattern>/*</url-pattern>
    </filter-mapping>

This is a standard filter configuration, except that the filter class is always org.springframework.web.filter.DelegatingFilterProxy. Where do you specify your own class? As a bean in your Spring context XML. The name of the filter and the name of the bean must be the same for the delegation to happen. Using DelegatingFilterProxy allowed me to use my filters with Spring, so I can inject my dependencies as I normally would.

Next, let's look at my MemcacheFilter filter.

Memcache Filter Class

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    // Logger, MemcachedClient, CacheConfig, Key, GetRequestKey and PostRequestKey
    // come from the logging library, the memcached client library and the
    // application's own code; their imports are not shown in the article.

    public class MemcacheFilter implements Filter {

        private static Logger logger = Logger.getLogger(MemcacheFilter.class);

        private CacheConfig cacheConfig;

        /**
         * The memcached lookup is performed in this method. First, a key is
         * generated depending on the request method (GET/POST). Then a cache
         * lookup is performed. If a value is obtained, it is written to the
         * response. Otherwise the actual target (in this case, Spring's
         * DispatcherServlet) is called by calling doFilter on the filterChain.
         * The dispatcher servlet calls the controller to produce the desired
         * response, which is intercepted when the doFilter method returns.
         * The response is added to the cache if the response code was 200 (OK).
         *
         * @param request
         * @param response
         * @param filterChain
         * @throws IOException
         * @throws ServletException
         */
        public void doFilter(ServletRequest request, ServletResponse response,
                FilterChain filterChain) throws IOException, ServletException {
            try {
                if ((request instanceof HttpServletRequest)
                        && (response instanceof HttpServletResponse)) {

                    // Wrapping the response in an HttpServletResponseWrapper
                    MemcacheResponseWrapper responseWrap =
                            new MemcacheResponseWrapper((HttpServletResponse) response);

                    // Wrapping the request in an HttpServletRequestWrapper
                    MemcacheRequestWrapper requestWrap =
                            new MemcacheRequestWrapper((HttpServletRequest) request);

                    // Get Memcached Client Instance
                    MemcachedClient client = cacheConfig.getMemcachedClient();

                    Key keyGenerator = getKeyGenerator(requestWrap);

                    if (keyGenerator != null) {
                        String key = keyGenerator.getKey(requestWrap, cacheConfig);
                        String value = (String) client.get(key);

                        if (value == null) {
                            // cache miss
                            logger.info("Cache miss for key " + key);

                            // call next filter/actual target for value
                            filterChain.doFilter(requestWrap, responseWrap);

                            if (responseWrap.getStatus() == HttpServletResponse.SC_OK) {
                                // obtaining response content from the response wrapper
                                value = responseWrap.getOutputStream().toString();

                                // adding response to cache
                                client.add(key, 0, value);
                                logger.info("Adding response to cache: "
                                        + (value.length() > 50 ? value.substring(0, 50) + "..." : value));
                            } else {
                                logger.warn("Did not add content to cache as response status is not 200");
                            }
                        } else {
                            // This case is a cache hit
                            logger.info("Cache hit for key " + key);
                            response.getWriter().println(value);
                        }
                    } else {
                        logger.warn("Request skipped because no key generator could be found for the request's method");
                        // attempting call to actual target
                        filterChain.doFilter(request, response);
                    }
                } else {
                    // Not an HTTP request: pass it through untouched so it is not dropped
                    filterChain.doFilter(request, response);
                }
            } catch (Exception ex) {
                logger.info("Cache functionality skipped due to exception", ex);
                // attempting call to actual target; note that if the chain already
                // ran before the exception, this invokes it a second time
                filterChain.doFilter(request, response);
            }
        }

        /**
         * Factory method that returns a key generator based on the request method.
         *
         * @param httpRequest
         * @return the key generator, or null for unsupported methods
         */
        private Key getKeyGenerator(HttpServletRequest httpRequest) {
            Key keyGenerator = null;
            if (httpRequest.getMethod().equalsIgnoreCase("GET")) {
                keyGenerator = new GetRequestKey();
            } else if (httpRequest.getMethod().equalsIgnoreCase("POST")) {
                keyGenerator = new PostRequestKey();
            }
            return keyGenerator;
        }

        public void init(FilterConfig arg0) throws ServletException {
            logger.debug("init");
        }

        public CacheConfig getCacheConfig() {
            return cacheConfig;
        }

        public void setCacheConfig(CacheConfig cacheConfig) {
            this.cacheConfig = cacheConfig;
        }

        public void destroy() {
            logger.debug("destroy");
        }
    }

1. I first wrap my request and response objects in the following statements. I had to create the wrappers as well; I will get to those later.

    // Wrapping the response in an HttpServletResponseWrapper
    MemcacheResponseWrapper responseWrap =
            new MemcacheResponseWrapper((HttpServletResponse) response);

    // Wrapping the request in an HttpServletRequestWrapper
    MemcacheRequestWrapper requestWrap =
            new MemcacheRequestWrapper((HttpServletRequest) request);

2. Next, one of my injected classes, CacheConfig, provides me with a memcache client, which I will use later to look up the cache.

    // Get Memcached Client Instance
    MemcachedClient client = cacheConfig.getMemcachedClient();

3. I call a function that tells me which key generator I should use, a GET one or a POST one, depending on the request method.

    Key keyGenerator = getKeyGenerator(requestWrap);

    /**
     * Factory method that returns a key generator based on the request method.
     *
     * @param httpRequest
     * @return the key generator, or null for unsupported methods
     */
    private Key getKeyGenerator(HttpServletRequest httpRequest) {
        Key keyGenerator = null;
        if (httpRequest.getMethod().equalsIgnoreCase("GET")) {
            keyGenerator = new GetRequestKey();
        } else if (httpRequest.getMethod().equalsIgnoreCase("POST")) {
            keyGenerator = new PostRequestKey();
        }
        return keyGenerator;
    }

4. Check for a cache hit using the key returned by the key generator. If it's a miss, call the next filter or target to compute the actual value, get the value from the response wrapper, and add it to the cache.

    if (keyGenerator != null) {
        String key = keyGenerator.getKey(requestWrap, cacheConfig);
        String value = (String) client.get(key);

        if (value == null) {
            // cache miss
            logger.info("Cache miss for key " + key);

            // call next filter/actual target for value
            filterChain.doFilter(requestWrap, responseWrap);

            if (responseWrap.getStatus() == HttpServletResponse.SC_OK) {
                // obtaining response content from the response wrapper
                value = responseWrap.getOutputStream().toString();

                // adding response to cache
                client.add(key, 0, value);
                logger.info("Adding response to cache: "
                        + (value.length() > 50 ? value.substring(0, 50) + "..." : value));
            }

5. If it's a cache hit, just return the cached value.

        else {
            // This case is a cache hit
            logger.info("Cache hit for key " + key);
            response.getWriter().println(value);
        }

Let's take a look at each of the wrappers. I am not going into a lot of detail about how each of these works.

Request Wrapper Class

On the way in, the original POST content is extracted from the request and put in a StringBuffer. The filter gets this content via the toString() method of the WrappedInputStream class, whereas the subsequently called controller calls the read method.

    import java.io.BufferedReader;
    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import javax.servlet.ServletInputStream;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletRequestWrapper;

    public class MemcacheRequestWrapper extends HttpServletRequestWrapper {

        protected ServletInputStream stream;
        protected HttpServletRequest origRequest = null;
        protected BufferedReader reader = null;

        public MemcacheRequestWrapper(HttpServletRequest request) throws IOException {
            super(request);
            origRequest = request;
        }

        public ServletInputStream createInputStream() throws IOException {
            return (new WrappedInputStream(origRequest));
        }

        @Override
        public ServletInputStream getInputStream() throws IOException {
            if (reader != null) {
                throw new IllegalStateException("getReader() has already been called for this request");
            }
            if (stream == null) {
                stream = createInputStream();
            }
            return stream;
        }

        @Override
        public BufferedReader getReader() throws IOException {
            if (reader != null) {
                return reader;
            }
            if (stream != null) {
                throw new IllegalStateException("getInputStream() has already been called for this request");
            }
            stream = createInputStream();
            reader = new BufferedReader(new InputStreamReader(stream));
            return reader;
        }

        private class WrappedInputStream extends ServletInputStream {

            private StringBuffer originalInput = new StringBuffer();
            private HttpServletRequest originalRequest;
            private ByteArrayInputStream byteArrayInputStream;

            public WrappedInputStream(HttpServletRequest request) throws IOException {
                this.originalRequest = request;
                BufferedReader bufferedReader = null;
                try {
                    InputStream inputStream = request.getInputStream();
                    if (inputStream != null) {
                        // Copy the whole request body into the buffer
                        bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
                        char[] charBuffer = new char[128];
                        int charsRead = -1;
                        while ((charsRead = bufferedReader.read(charBuffer)) > 0) {
                            originalInput.append(charBuffer, 0, charsRead);
                        }
                    }
                    // Replay the buffered body to whoever reads the stream later
                    byteArrayInputStream = new ByteArrayInputStream(originalInput.toString().getBytes());
                } finally {
                    if (bufferedReader != null) {
                        bufferedReader.close();
                    }
                }
            }

            @Override
            public String toString() {
                return this.originalInput.toString();
            }

            @Override
            public int read() throws IOException {
                return byteArrayInputStream.read();
            }
        }
    }

Response Wrapper Class

The response wrapper is similar to the request wrapper. Instead of the read method, there is a write method, called by the controller when it's writing JSON content. The content is stored in the wrapper and read back in the filter.

    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.ServletOutputStream;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.HttpServletResponseWrapper;

    public class MemcacheResponseWrapper extends HttpServletResponseWrapper {

        protected ServletOutputStream stream;
        protected PrintWriter writer = null;
        protected HttpServletResponse origResponse = null;
        private int httpStatus = 200;

        public MemcacheResponseWrapper(HttpServletResponse response) {
            super(response);
            response.setContentType("application/json");
            origResponse = response;
        }

        public ServletOutputStream createOutputStream() throws IOException {
            return (new WrappedOutputStream(origResponse));
        }

        public ServletOutputStream getOutputStream() throws IOException {
            if (writer != null) {
                throw new IllegalStateException("getWriter() has already been called for this response");
            }
            if (stream == null) {
                stream = createOutputStream();
            }
            return stream;
        }

        public PrintWriter getWriter() throws IOException {
            if (writer != null) {
                return writer;
            }
            if (stream != null) {
                throw new IllegalStateException("getOutputStream() has already been called for this response");
            }
            stream = createOutputStream();
            writer = new PrintWriter(stream);
            return writer;
        }

        @Override
        public void sendError(int sc) throws IOException {
            httpStatus = sc;
            super.sendError(sc);
        }

        @Override
        public void sendError(int sc, String msg) throws IOException {
            httpStatus = sc;
            super.sendError(sc, msg);
        }

        @Override
        public void setStatus(int sc) {
            httpStatus = sc;
            super.setStatus(sc);
        }

        public int getStatus() {
            return httpStatus;
        }

        private class WrappedOutputStream extends ServletOutputStream {

            private StringBuffer originalOutput = new StringBuffer();
            private HttpServletResponse originalResponse;

            public WrappedOutputStream(HttpServletResponse response) {
                this.originalResponse = response;
            }

            @Override
            public String toString() {
                return this.originalOutput.toString();
            }

            @Override
            public void write(int arg0) throws IOException {
                // Keep a copy for the cache (one char per byte written)
                // and pass the byte on to the real response
                originalOutput.append((char) arg0);
                originalResponse.getOutputStream().write(arg0);
            }
        }
    }
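The article does not show GetRequestKey or PostRequestKey, so the key generation it describes (the URL for GET, the URL plus the POSTed JSON for POST, hashed with MD5 into 32 characters) can only be sketched. The version below is a minimal guess at that logic using the JDK's MessageDigest; the class name CacheKeySketch and the methods md5Hex and keyFor are hypothetical and are not the author's code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch of the cache key scheme described in the article. */
final class CacheKeySketch {

    /** Returns the MD5 digest of the input as 32 lowercase hex characters. */
    static String md5Hex(String input) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /**
     * GET: key is derived from the URL. POST: key is derived from the URL plus
     * the request body. Other methods are not cached, so null is returned.
     */
    static String keyFor(String method, String url, String body) {
        if ("GET".equalsIgnoreCase(method)) {
            return md5Hex(url);
        }
        if ("POST".equalsIgnoreCase(method)) {
            return md5Hex(url + (body == null ? "" : body));
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(keyFor("GET", "/product/pinstripe_suit", null));
        System.out.println(keyFor("POST", "/product/search", "{\"colour\":\"grey\"}"));
    }
}
```

Because the digest is always 16 bytes, the key is always 32 hex characters regardless of how long the URL or POST body is, which is what keeps it under memcached's key length limit.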
June 25, 2013
by Faheem Sohail
· 22,552 Views · 1 Like
Resolving SOAPFaultException caused by com.ctc.wstx.exc.WstxUnexpectedCharException
If you're using any of the Web Services tools (Axis2, CXF, etc.) that internally use the Woodstox XML processor (wstx), and you're getting an exception like this during web service calls:

    javax.xml.ws.soap.SOAPFaultException: Error reading XMLStreamReader.
        at org.apache.cxf.jaxws.JaxWsClientProxy.invoke(JaxWsClientProxy.java:...)
        ...
    Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character ...
        at com.ctc.wstx.sr.StreamScanner.throwUnexpectedChar(StreamScanner.java:...)
        at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:...)
        at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:...)
        at com.ctc.wstx.sr.BasicStreamReader.nextTag(BasicStreamReader.java:...)

the problem is that the wstx tokenizer/parser encountered an unexpected (but not necessarily invalid per se) character: a character that is not legal in the current context. According to the API docs (http://woodstox.codehaus.org/3.2.9/javadoc/com/ctc/wstx/exc/WstxUnexpectedCharException.html), this could happen, for example, if white space was missing between an attribute value and the name of the next attribute.

This simply means that you're receiving ill-formed SOAP XML as the response. You need to check the SOAP response construction logic at the other end you're communicating with.
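One way to confirm that a captured response body is the culprit is to run it through a streaming parser on its own. The sketch below uses the StAX API that ships with the JDK rather than Woodstox, so the exception type is the generic XMLStreamException instead of WstxUnexpectedCharException; the class name WellFormedCheck and the sample documents are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Checks whether a string parses as well-formed XML with the JDK's StAX parser. */
final class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            // Pull every event; an ill-formed document fails somewhere along the way
            while (reader.hasNext()) {
                reader.next();
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // White space between the two attributes: well-formed
        System.out.println(isWellFormed("<item id=\"1\" name=\"suit\"/>"));
        // No white space between the attribute value and the next attribute name
        System.out.println(isWellFormed("<item id=\"1\"name=\"suit\"/>"));
    }
}
```

Feeding the raw response captured from the wire into a check like this separates "the server sent bad XML" from problems in the client stack.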
June 24, 2013
by Singaram Subramanian
· 21,003 Views
Automating Nginx Reverse Proxy Configuration
It's really nice if you can decouple your external API from the details of application segregation and deployment. In a previous post I explained some of the benefits of using a reverse proxy. On my current project we're building a distributed service oriented architecture that also exposes an HTTP API, and we're using a reverse proxy to route requests addressed to our API to individual components. We have chosen the excellent Nginx web server to serve as our reverse proxy; it's fast, reliable and easy to configure. We use it to aggregate multiple services exposing HTTP APIs into a single URL space. So, for example, when you type:

    http://api.example.com/product/pinstripe_suit

it gets routed to:

    http://10.0.1.101:8001/product/pinstripe_suit

But when you go to:

    http://api.example.com/customer/103474783

it gets routed to:

    http://10.0.1.104:8003/customer/103474783

To the consumer of the API it appears that they are exploring a single URL space (http://api.example.com/blah/blah), but behind the scenes the different top level segments of the URL route to different back end servers. /product/… routes to 10.0.1.101:8001, but /customer/… routes to 10.0.1.104:8003.

We also want this to be self-configuring. Say I want to create a new component of the system that records stock levels. Rather than extending an existing component, I want to be able to write a stand-alone executable or service that exposes an HTTP endpoint, have it automatically deployed to one of the hosts in my cloud infrastructure, and have Nginx automatically route requests addressed to http://api.example.com/stock/whatever to my new component.

We also want to load balance these back end services. We might want to deploy several instances of our new stock API and have Nginx automatically round robin between them.

We call each top level segment (/stock, /product, /customer) a claim. A component publishes an 'AddApiClaim' message over RabbitMQ when it comes online. This message has three fields: 'Claim', 'ipAddress', and 'PortNumber'. We have a special component, ProxyAutomation, that subscribes to these messages and rewrites the Nginx configuration as required. It uses SSH and SCP to log into the Nginx server, transfer the various configuration files, and instruct Nginx to reload its configuration. We use the excellent SSH.NET library to automate this.

A really nice thing about Nginx configuration is wildcard includes. Take a look at our top level configuration file:

    ...
    http {
        include       /etc/nginx/mime.types;
        default_type  application/octet-stream;

        log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                          '$status $body_bytes_sent "$http_referer" '
                          '"$http_user_agent" "$http_x_forwarded_for"';

        access_log  /var/log/nginx/access.log  main;

        sendfile           on;
        keepalive_timeout  65;

        include /etc/nginx/conf.d/*.conf;
    }

The final include directive says: take any *.conf file in the conf.d directory and add it here.

Inside conf.d is a single file for all api.example.com requests:

    include /etc/nginx/conf.d/api.example.com.conf.d/upstream.*.conf;

    server {
        listen       80;
        server_name  api.example.com;

        include /etc/nginx/conf.d/api.example.com.conf.d/location.*.conf;

        location / {
            root   /usr/share/nginx/api.example.com;
            index  index.html index.htm;
        }
    }

This basically says: listen on port 80 for any requests with the host header 'api.example.com'. The file has two includes. The first one, at the top of the file, I'll talk about later. The second, inside the server block, says: take any file named location.*.conf in the subdirectory 'api.example.com.conf.d' and add it to the configuration. Our proxy automation component adds new components (AKA API claims) by dropping new location.*.conf files in this directory. For example, for our stock component it might create a file, 'location.stock.conf', like this:

    location /stock/ {
        proxy_pass http://stock;
    }

This simply tells Nginx to proxy all requests addressed to api.example.com/stock/… to the upstream servers defined at 'stock'. This is where the other include mentioned above, 'upstream.*.conf', comes in. The proxy automation component also drops in a file named upstream.stock.conf that looks something like this:

    upstream stock {
        server 10.0.0.23:8001;
        server 10.0.0.23:8002;
    }

This tells Nginx to round-robin all requests for api.example.com/stock/ across the given sockets. In this example it's two components on the same machine (10.0.0.23), one on port 8001 and the other on port 8002.

As instances of the stock component get deployed, new entries are added to upstream.stock.conf. Similarly, when components get uninstalled, their entries are removed. When the last entry is removed, the whole file is also deleted.

This infrastructure allows us to decouple infrastructure configuration from component deployment. We can scale the application up and down by simply adding or removing component instances as required. As a component developer, I don't need to do any proxy configuration; I just make sure my component publishes add and remove API claim messages and I'm good to go.
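The file-generation step of the proxy automation component can be sketched as two small rendering functions. The original component is written against SSH.NET and its code is not shown, so the Java below is only an illustration of the text templates; the class name ClaimConfigSketch and the methods locationConf and upstreamConf are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical rendering of the per-claim Nginx files described above. */
final class ClaimConfigSketch {

    /** Content for location.<claim>.conf: proxy the claim's path to its upstream. */
    static String locationConf(String claim) {
        return "location /" + claim + "/ {\n"
             + "    proxy_pass http://" + claim + ";\n"
             + "}\n";
    }

    /**
     * Content for upstream.<claim>.conf, one server line per "ip:port" endpoint.
     * Returns null when no endpoints remain, meaning the file should be deleted.
     */
    static String upstreamConf(String claim, List<String> endpoints) {
        if (endpoints.isEmpty()) {
            return null;
        }
        StringBuilder conf = new StringBuilder("upstream " + claim + " {\n");
        for (String endpoint : endpoints) {
            conf.append("    server ").append(endpoint).append(";\n");
        }
        return conf.append("}\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(locationConf("stock"));
        System.out.print(upstreamConf("stock", Arrays.asList("10.0.0.23:8001", "10.0.0.23:8002")));
        System.out.println(upstreamConf("stock", Collections.<String>emptyList()));
    }
}
```

Keeping one location file and one upstream file per claim means an add or remove message only ever rewrites that claim's upstream file, and the wildcard includes pick up the change on the next reload.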
June 19, 2013
by Mike Hadlow
· 59,092 Views
Relations with not-found="ignore"
NHibernate has many interesting and specific options for mapping entities that can cover nearly every scenario you have in mind, but you need to be aware of the performance implications of each advanced option.

Suppose you are in a legacy-database scenario where Entity A references Entity B, but someone outside the control of NHibernate can delete records from the table used by Entity B without updating the corresponding referencing field on Entity A. You end up with a database with broken references, where rows in Table A reference, by ID, a record in Table B that no longer exists. When this happens, if you load an Entity of type A that references a deleted Entity of type B, NHibernate throws an exception when you try to access the navigation property, because it cannot find the related entity in the database.

If you know NHibernate, you can use the not-found="ignore" mapping option, which tells NHibernate to ignore a broken reference key. If Entity A references an Entity B that was already deleted from the database, the reference is ignored, the navigation property is set to null, and no exception occurs.

This kind of solution is not without side effects. First of all, you will find that every time you load an Entity of type A, another query is issued to the database to verify that the related Entity B is really there. This effectively disables lazy loading, because the related entity is always selected.

This is not an optimal scenario, because you end up with a lot of extra queries, and it happens because not-found="ignore" is only a way to avoid the real problem: you have broken foreign keys in your database. My suggestion is to fix the data in the database, keep the database clean without broken foreign keys, and remove all not-found="ignore" mapping options unless you really have no other solution.

Please remember that even if you are using NHibernate, you should not forget SQL capabilities. As an example, SQL Server (like almost every relational database on the market) lets you set up rules for foreign keys, e.g. ON DELETE SET NULL, which automatically sets a foreign key to null when the related record is deleted. Such a feature prevents broken foreign keys even if some legacy process manipulates the database, deleting records without the corresponding update to the related foreign key.

See more at: http://www.codewrecks.com/blog/index.php/2013/06/18/relations-with-not-foundignore-disable-lazy-load-and-impact-on-performances/
June 19, 2013
by Ricci Gian Maria
· 5,092 Views
How to Optimize MySQL UNION for High Speed
There are two ways to speed up UNIONs in a MySQL database. First, use UNION ALL if at all possible, and second, try to push down your conditions.

1. UNION ALL is much faster than UNION

How does a UNION work? Imagine you have two tables for shirts. The short_sleeve table looks like this:

    blue
    green
    gray
    black

And the long_sleeve table looks like this:

    red
    green
    yellow
    blue

If you UNION those two tables, MySQL will first sort the combined set into a temp table like this:

    black
    blue
    blue
    gray
    green
    green
    red
    yellow

Once it's done this sort, it can easily remove the duplicate blue and the duplicate green, giving this resulting set:

    black
    blue
    gray
    green
    red
    yellow

Why does it do this? UNION is defined that way in SQL. Duplicates must be removed, and this is an efficient way for the MySQL engine to remove them: combine the results, sort, remove duplicates and return the set.

Queries with UNION can be accelerated in two ways: switch to UNION ALL, or try to push ORDER BY, LIMIT and WHERE conditions inside each subquery. You'll be glad you did!

What if we did UNION ALL? The result would look like this:

    blue
    green
    gray
    black
    red
    green
    yellow
    blue

It doesn't have to sort, and it doesn't have to remove duplicates. If you imagine combining two 10 million row tables without having to sort, this speedup can be HUGE.

2. Use Push-down Conditions to speed up UNION in MySQL

Imagine that, in our example above, the shirts have a design date: the year they were released. Yes, we're keeping this example very simple to illustrate the concept. Here is the short_sleeve table:

    blue   2013
    green  2013
    green  2012
    gray   2011
    black  2009
    black  2011

And the long_sleeve table looks like this:

    red    2012
    red    2013
    green  2011
    yellow 2010
    blue   2011

To get the 2013 designs, you could combine them like this:

    SELECT type, `release`
    FROM (
        (SELECT type, `release` FROM short_sleeve)
        UNION
        (SELECT type, `release` FROM long_sleeve)
    ) AS shirts
    WHERE `release` >= 2013;

Here the WHERE clause works on this 11 record temp table:

    black  2009
    black  2011
    blue   2011
    blue   2013
    gray   2011
    green  2013
    green  2012
    green  2011
    red    2012
    red    2013
    yellow 2010

But it would be much faster to move the WHERE inside each subquery, like this:

    (SELECT type, `release` FROM short_sleeve WHERE `release` >= 2013)
    UNION
    (SELECT type, `release` FROM long_sleeve WHERE `release` >= 2013);

That would be operating on a combined 3 record table, which is faster to sort and de-duplicate. Smaller result sets cache better too, so the saving keeps paying off. That's what performance optimization is all about!

Remember that multi-million row sets in each part of this query will quickly illustrate the optimization. We're using very small results to make visualizing easier.

You can also use this optimization for ORDER BY and for LIMIT conditions. By reducing the number of records returned by EACH PART of the UNION, you reduce the work that happens at the stage where they are all combined.

If you're seeing some UNION queries in your slow query log, I suggest you try this optimization out and see if you can tweak it.
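The push-down idea can be checked on the article's own shirt data with a small in-memory model. This is only an analogy in Java, not how MySQL executes the query: a TreeSet stands in for the sort-and-deduplicate step, rows are plain "type year" strings, and the class and method names (UnionSketch, union, unionAll, releasedSince) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** In-memory model of UNION, UNION ALL and a pushed-down WHERE condition. */
final class UnionSketch {

    static final List<String> SHORT_SLEEVE = Arrays.asList(
            "blue 2013", "green 2013", "green 2012", "gray 2011", "black 2009", "black 2011");
    static final List<String> LONG_SLEEVE = Arrays.asList(
            "red 2012", "red 2013", "green 2011", "yellow 2010", "blue 2011");

    /** UNION: combine, sort and drop duplicates (the TreeSet does both). */
    static List<String> union(List<String> a, List<String> b) {
        TreeSet<String> sorted = new TreeSet<>(a);
        sorted.addAll(b);
        return new ArrayList<>(sorted);
    }

    /** UNION ALL: just append, with no sort and no duplicate removal. */
    static List<String> unionAll(List<String> a, List<String> b) {
        List<String> all = new ArrayList<>(a);
        all.addAll(b);
        return all;
    }

    /** WHERE release >= year, applied to rows of the form "type year". */
    static List<String> releasedSince(List<String> rows, int year) {
        List<String> kept = new ArrayList<>();
        for (String row : rows) {
            int release = Integer.parseInt(row.substring(row.lastIndexOf(' ') + 1));
            if (release >= year) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Filter after the UNION: the sort/dedup step handles all 11 rows
        System.out.println(releasedSince(union(SHORT_SLEEVE, LONG_SLEEVE), 2013));
        // Filter pushed down: the sort/dedup step handles only 3 rows
        System.out.println(union(releasedSince(SHORT_SLEEVE, 2013), releasedSince(LONG_SLEEVE, 2013)));
    }
}
```

Both orderings produce the same rows; the only difference is how many rows reach the expensive combine step, which is the whole point of pushing the condition down.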
June 17, 2013
by Sean Hull
· 24,046 Views
Searchable Documents? Yes You Can. Another Reason to Choose AsciiDoc
Elasticsearch is a flexible and powerful open source, distributed real-time search and analytics engine for the cloud based on Apache Lucene which provides full text search capabilities. It is document oriented and schema free. Asciidoctor is a pure Ruby processor for converting AsciiDoc source files and strings into HTML 5, DocBook 4.5 and other formats. Apart of Asciidoctor Ruby part, there is an Asciidoctor-java-integration project which let us call Asciidoctor functions from Java without noticing that Ruby code is being executed. In this post we are going to see how we can use Elasticsearch over AsciiDocdocuments to make them searchable by their header information or by their content. Let's add required dependencies: junit junit 4.11 test com.googlecode.lambdaj lambdaj 2.3.3 org.elasticsearch elasticsearch 0.90.1 org.asciidoctor asciidoctor-java-integration 0.1.3 Lambdaj library is used to convert AsciiDoc files to a json documents. Now we can start an Elasticsearch instance which in our case it is going to be an embedded instance. node = nodeBuilder().local(true).node(); Next step is parse AsciiDoc document header, read its content and convert them into a json document. An example of json document stored in Elasticsearch can be: { "title":"Asciidoctor Maven plugin 0.1.2 released!", "authors":[ { "author":"Jason Porter", "email":"[email protected]" } ], "version":null, "content":"= Asciidoctor Maven plugin 0.1.2 released!.....", "tags":[ "release", "plugin" ] } And for converting an AsciiDoc File to a json document we are going to useXContentBuilder class which is provided by ElasticsearchJava API to create jsondocuments programmatically. 
package com.lordofthejars.asciidoctor; import static org.elasticsearch.common.xcontent.XContentFactory.*; import java.io.File; import java.io.FileInputStream; import java.io.FileNotFoundException; import java.io.IOException; import java.util.List; import org.asciidoctor.Asciidoctor; import org.asciidoctor.Author; import org.asciidoctor.DocumentHeader; import org.asciidoctor.internal.IOUtils; import org.elasticsearch.common.xcontent.XContentBuilder; import ch.lambdaj.function.convert.Converter; public class AsciidoctorFileJsonConverter implements Converter { private Asciidoctor asciidoctor; public AsciidoctorFileJsonConverter() { this.asciidoctor = Asciidoctor.Factory.create(); } public XContentBuilder convert(File asciidoctor) { DocumentHeader documentHeader = this.asciidoctor.readDocumentHeader(asciidoctor); XContentBuilder jsonContent = null; try { jsonContent = jsonBuilder() .startObject() .field("title", documentHeader.getDocumentTitle()) .startArray("authors"); Author mainAuthor = documentHeader.getAuthor(); jsonContent.startObject() .field("author", mainAuthor.getFullName()) .field("email", mainAuthor.getEmail()) .endObject(); List authors = documentHeader.getAuthors(); for (Author author : authors) { jsonContent.startObject() .field("author", author.getFullName()) .field("email", author.getEmail()) .endObject(); } jsonContent.endArray() .field("version", documentHeader.getRevisionInfo().getNumber()) .field("content", readContent(asciidoctor)) .array("tags", parseTags((String)documentHeader.getAttributes().get("tags"))) .endObject(); } catch (IOException e) { throw new IllegalArgumentException(e); } return jsonContent; } private String[] parseTags(String tags) { tags = tags.substring(1, tags.length()-1); return tags.split(", "); } private String readContent(File content) throws FileNotFoundException { return IOUtils.readFull(new FileInputStream(content)); } } Basically we are building the json document by calling startObject methods to start a new object, 
field method to add new fields, and startArray to start an array. This builder is then used to render the equivalent object in JSON format. Notice that we are using the readDocumentHeader method from the Asciidoctor class, which returns header attributes from an AsciiDoc file without reading and rendering the whole document. Finally, the content field is set with the whole document content. Now we are ready to start indexing documents. Note that the populateData method receives a Client object as a parameter. This object comes from the Elasticsearch Java API and represents a connection to the Elasticsearch database.
import static ch.lambdaj.Lambda.convert;
//....
private void populateData(Client client) throws IOException {
    List<File> asciidoctorFiles = new ArrayList<File>() {{
        add(new File("target/test-classes/java_release.adoc"));
        add(new File("target/test-classes/maven_release.adoc"));
    }};
    List<XContentBuilder> jsonDocuments = convertAsciidoctorFilesToJson(asciidoctorFiles);
    for (int i = 0; i < jsonDocuments.size(); i++) {
        client.prepareIndex("docs", "asciidoctor", Integer.toString(i)).setSource(jsonDocuments.get(i)).execute().actionGet();
    }
    client.admin().indices().refresh(new RefreshRequest("docs")).actionGet();
}
private List<XContentBuilder> convertAsciidoctorFilesToJson(List<File> asciidoctorFiles) {
    return convert(asciidoctorFiles, new AsciidoctorFileJsonConverter());
}
It is important to note that the first part of the algorithm converts all our AsciiDoc files (in our case two) to XContentBuilder instances by using the previous converter class and the convert method of the Lambdaj project. If you want, you can take a look at both documents used in this example at https://github.com/asciidoctor/asciidoctor.github.com/blob/develop/news/asciidoctor-java-integration-0-1-3-released.adoc and https://github.com/asciidoctor/asciidoctor.github.com/blob/develop/news/asciidoctor-maven-plugin-0-1-2-released.adoc. The next part is inserting the documents into one index.
This is done by using the prepareIndex method, which requires an index name (docs), an index type (asciidoctor), and the id of the document being inserted. Then we call the setSource method, which transforms the XContentBuilder object to JSON, and finally, by calling execute().actionGet(), the data is sent to the database. The final step, which refreshes the indexes by calling the refresh method, is only required because we are using an embedded instance of Elasticsearch (in production this part should not be required). After that point we can start querying Elasticsearch to retrieve information from our AsciiDoc documents. Let's start with a very simple example, which returns all inserted documents:
SearchResponse response = client.prepareSearch().execute().actionGet();
Next we are going to search for all documents that have been written by Alex Soto, which in our case is one.
import static org.elasticsearch.index.query.QueryBuilders.matchQuery;
//....
QueryBuilder matchQuery = matchQuery("author", "Alex Soto");
// or, alternatively:
QueryBuilder matchQuery = matchQuery("author", "Alexander Soto");
Note that I am searching the author field for the string Alex Soto, which returns only one document. The other document is written by Jason. It is interesting to note that if you search for Alexander Soto, the same document is returned. A match query analyzes the text into terms and, with its default OR operator, returns documents that contain any of them, so the shared term Soto is enough to match. More queries: how about finding documents written by someone who is called Alex, but not Soto?
import static org.elasticsearch.index.query.QueryBuilders.fieldQuery;
//....
QueryBuilder matchQuery = fieldQuery("author", "+Alex -Soto");
And of course no results are returned in this case. See that in this case we are using a field query instead of a match query, and we use the + and - symbols to include and exclude words. You can also find all documents which contain the word released in the title. import static org.elasticsearch.index.query.QueryBuilders.matchQuery; //....
QueryBuilder matchQuery = matchQuery("title", "released");
And finally let's find all documents that talk about the 0.1.2 release. In this case only one document talks about it; the other one talks about 0.1.3.
QueryBuilder matchQuery = matchQuery("content", "0.1.2");
Now we only have to send the query to the Elasticsearch database, which is done by using the prepareSearch method.
SearchResponse response = client.prepareSearch("docs")
    .setTypes("asciidoctor")
    .setQuery(matchQuery)
    .execute()
    .actionGet();
SearchHits hits = response.getHits();
for (SearchHit searchHit : hits) {
    System.out.println(searchHit.getSource().get("content"));
}
Note that in this case we are printing the AsciiDoc content to the console, but you could use the asciidoctor.render(String content, Options options) method to render the content into the required format. So in this post we have seen how to index documents using Elasticsearch, how to get some important information from AsciiDoc files using the Asciidoctor-java-integration project, and finally how to execute some queries against the inserted documents. Of course there are more kinds of queries in Elasticsearch, but the intent of this post wasn't to explore all the possibilities of Elasticsearch. Also, as a corollary, note how important it is to use the AsciiDoc format for writing your documents. Without much effort you can build a search engine for your documentation. On the other hand, imagine all the code that would be required to implement the same using a proprietary binary format like Microsoft Word. So we have shown another reason to use AsciiDoc instead of other formats.
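The parseTags helper in the converter strips the first and last character of the tags attribute and splits on ", ", which suggests the attribute looks like "[release, plugin]". That format is an inference from the code, not something the post states, and the helper would throw if a document had no tags attribute at all. A minimal, more defensive sketch in plain Java could look like the following; the TagParser class name is hypothetical and is not part of Asciidoctor or Elasticsearch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Asciidoctor or Elasticsearch.
final class TagParser {

    private TagParser() {
    }

    // Parses a header attribute such as "[release, plugin]" into its tags.
    // Returns an empty list when the attribute is missing or holds no tags.
    static List<String> parseTags(String raw) {
        List<String> tags = new ArrayList<>();
        if (raw == null) {
            return tags;
        }
        String trimmed = raw.trim();
        // Strip the surrounding brackets only when both are present.
        if (trimmed.length() >= 2 && trimmed.startsWith("[") && trimmed.endsWith("]")) {
            trimmed = trimmed.substring(1, trimmed.length() - 1);
        }
        for (String part : trimmed.split(",")) {
            String tag = part.trim();
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }
}
```

The resulting list can be turned into the array that the builder's array method expects with toArray(new String[0]).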
June 10, 2013
by Alex Soto
· 4,803 Views
Mapping Enums Done Right With @Convert in JPA 2.1
If you have ever worked with Java enums in JPA you are definitely aware of their limitations and traps. Using an enum as a property of your @Entity is often a very good choice, however JPA prior to 2.1 didn’t handle them very well. It gave you 2+1 choices:
@Enumerated(EnumType.ORDINAL) (default) maps enum values using Enum.ordinal(). Basically the first enumerated value is mapped to 0 in the database column, the second to 1, etc. This is very compact and works great up to the point when you want to modify your enum. Removing or adding a value in the middle, or rearranging values, will totally break existing records. Ouch! To make matters worse, unit and integration tests often work on a clean database, so they won’t catch discrepancies in old data.
@Enumerated(EnumType.STRING) is much safer because it stores the string representation of the enum. You can now safely add new values and move them around. However, renaming an enum value in Java code will still break existing records in the DB. Even more important, such a representation is very verbose, unnecessarily consuming database resources.
You can also use a raw representation (e.g. a single char or int) and map it manually back and forth in @PostLoad/@PrePersist/@PreUpdate events. This is the most flexible and the safest option from the database perspective, but quite ugly.
Luckily Java Persistence API 2.1 (JSR-338), released a few days ago, provides a standardized mechanism of pluggable data converters. Such an API was present for ages in proprietary forms and it’s not really rocket science, but having it as part of JPA is a big improvement. To my knowledge EclipseLink is the only JPA 2.1 implementation available to date, so we will use it to experiment a bit. We will start from the sample Spring application developed as part of the “Poor man’s CRUD: jqGrid, REST, AJAX, and Spring MVC in one house” article. That version had no persistence, so we will add a thin DAO layer on top of Spring Data JPA backed by EclipseLink.
The only entity so far is Book:
@Entity
public class Book {
    @Id
    @GeneratedValue(strategy = IDENTITY)
    private Integer id;
    //...
    private Cover cover;
    //...
}
Where Cover is an enum:
public enum Cover { PAPERBACK, HARDCOVER, DUST_JACKET }
Neither ORDINAL nor STRING is a good choice here. The former because rearranging the first three values in any way will break loading of existing records. The latter is too verbose. Here is where custom converters in JPA come into play:
import javax.persistence.AttributeConverter;
import javax.persistence.Converter;
@Converter
public class CoverConverter implements AttributeConverter<Cover, String> {
    @Override
    public String convertToDatabaseColumn(Cover attribute) {
        switch (attribute) {
            case DUST_JACKET: return "D";
            case HARDCOVER: return "H";
            case PAPERBACK: return "P";
            default: throw new IllegalArgumentException("Unknown " + attribute);
        }
    }
    @Override
    public Cover convertToEntityAttribute(String dbData) {
        switch (dbData) {
            case "D": return Cover.DUST_JACKET;
            case "H": return Cover.HARDCOVER;
            case "P": return Cover.PAPERBACK;
            default: throw new IllegalArgumentException("Unknown " + dbData);
        }
    }
}
OK, I won’t insult you, my dear reader, by explaining this. It converts the enum to whatever will be stored in the relational database and vice versa. Theoretically a JPA provider should apply converters automatically if they are declared with:
@Converter(autoApply = true)
It didn’t work for me. Moreover, declaring them explicitly instead of @Enumerated in the @Entity class didn’t work either:
import javax.persistence.Convert;
//...
@Convert(converter = CoverConverter.class)
private Cover cover;
Resulting in an exception:
Exception Description: The converter class [com.blogspot.nurkiewicz.CoverConverter] specified on the mapping attribute [cover] from the class [com.blogspot.nurkiewicz.Book] was not found. Please ensure the converter class name is correct and exists with the persistence unit definition.
Bug or feature, I had to mention converter in orm.xml: And it flies!
I have the freedom to modify my Cover enum (adding, rearranging, renaming) without affecting existing records. One tip I would like to share with you is related to maintainability. Every time you have a piece of code mapping from or to an enum, make sure it’s tested properly. And I don’t mean testing every possible existing value manually. I am more after a test making sure that new enum values are reflected in the mapping code. Hint: the code below will fail (by throwing IllegalArgumentException) if you add a new enum value but forget to add the mapping code for it:
for (Cover cover : Cover.values()) {
    new CoverConverter().convertToDatabaseColumn(cover);
}
Custom converters in JPA 2.1 are much more useful than what we saw. If you combine JPA with Scala, you can use @Converter to map database columns directly to scala.math.BigDecimal, scala.Option or a small case class. In Java there will finally be a portable way of mapping Joda time. Last but not least, if you like a (very) strongly typed domain, you may wish to have a PhoneNumber class (with isInternational(), getCountryCode() and custom validation logic) instead of String or long. This small addition in JPA 2.1 will surely improve the quality of domain objects significantly. If you wish to play a bit with this feature, the sample Spring web application is available on GitHub.
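The hint above only exercises one direction of the mapping. As a rough sketch, a slightly stronger check can also confirm that every code is unique and maps back to the same constant. The version below is deliberately stand-alone: the CoverCodes class and its method names are hypothetical, and it does not implement javax.persistence.AttributeConverter, so it runs without a JPA provider on the classpath. Only the enum constants and the D/H/P codes are taken from the article.

```java
import java.util.HashSet;
import java.util.Set;

// Same constants as the article's Cover enum.
enum Cover { PAPERBACK, HARDCOVER, DUST_JACKET }

// Hypothetical stand-in for CoverConverter, without the JPA interface.
final class CoverCodes {

    private CoverCodes() {
    }

    static String toColumn(Cover cover) {
        switch (cover) {
            case DUST_JACKET: return "D";
            case HARDCOVER: return "H";
            case PAPERBACK: return "P";
            default: throw new IllegalArgumentException("Unknown " + cover);
        }
    }

    static Cover fromColumn(String code) {
        switch (code) {
            case "D": return Cover.DUST_JACKET;
            case "H": return Cover.HARDCOVER;
            case "P": return Cover.PAPERBACK;
            default: throw new IllegalArgumentException("Unknown " + code);
        }
    }

    // True when every constant maps to a distinct code and back to itself.
    // Throws IllegalArgumentException if a constant has no mapping at all.
    static boolean roundTripsAllValues() {
        Set<String> seen = new HashSet<>();
        for (Cover cover : Cover.values()) {
            String code = toColumn(cover);
            if (!seen.add(code) || fromColumn(code) != cover) {
                return false;
            }
        }
        return true;
    }
}
```

Checking the round trip catches a class of mistakes the one-way loop misses, such as two constants accidentally sharing the same column code.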
June 6, 2013
by Tomasz Nurkiewicz
· 69,366 Views · 6 Likes
Serialization and injection
Serialization is a form of persistence: serialized data survives the process and the RAM where it was created and can be reconstituted inside different processes and machines that live in a different time or place. Sometimes serialization is in fact a poor form of persistence, one that confuses the boundary between the different schemas the data can fit in. However, what I have found useful in recent years of development is to institute a strict separation: serialize Value Objects, Entities, and everything that represents the state of the application. Meanwhile, use Dependency Injection for services that are part of a larger object graph, and never serialize this second kind of object. In the discussion that follows, I make the assumption that serialization and deserialization occur on the same machine (e.g. as for web-oriented sessions). The problem with serialization, which works transparently most of the time, arises when you need to serialize service objects instead of limiting the procedure to data structures. How can you store such objects?
Not options
Some options for solving this problem are really not options. Serialization by itself will fail because of the staleness of the references contained in these objects. For example, in PHP, trying to serialize a database connection held by a Repository or DAO object will rightly fail with an exception. Whenever an object represents a resource of the current machine, it usually cannot be serialized, except when the only resource involved is RAM. If the resource is disk space or another running process such as a database daemon, the reconstitution of the object in another place and time will fail, and it's best to just stop the developer immediately during storage.
Quasi-options
Some solutions to the problem try to avoid staleness by serializing objects without their resources, and making them grab a fresh version of those resources on deserialization.
In PHP, for example, this can be done with the __sleep() and __wakeup() magic methods, called automatically during serialization and deserialization respectively. This deserialization mechanism introduces a dependency from the serialized Entity to external services: such a dependency is already in place when building the object the first time (passing the XService in the constructor), but it is aggravated when deserializing (depending on an XServiceFactory instead of just an XService). An improvement, from the dependencies point of view, is to reattach collaborators to deserialized objects like you would for other persistence-related tasks. For example, EntityRepository can inject the missing pieces of Entity every time its find() method is called. However, there is still another option, which is the most resilient from the modelling point of view and not only that of dependency management: injecting non-serializable collaborators through the stack. Objects can collaborate even without keeping field references to each other, and injecting dependencies as parameters moves the starting point of the dependency from the server to the client object (which may or may not be desirable). What is most important is that Entities are relieved of having to manage external references in any context, not only that of persistence and in particular serialization.
The metaphor for the 3rd option
Misko Hevery likes to say: have you ever seen a credit card able to charge itself? If a CreditCard is an Entity in your domain, it would be very strange to keep a wire attached to your wallet wherever you go. With the first option, you have the card spring a wire when it is taken out of the wallet, like in horror movies. This intelligent cable tries its best to attach to the nearest Point of Sale (a bad case of Bluetooth, I think). With Repositories in mind, you're not dealing with automated wires anymore, but you're still attaching cables between cards and fixed devices.
In reality, cards collaborate with the PoS in a fast process that does not last more than a few seconds. Actually, sometimes they don't touch it at all, as in all Internet-based purchases. Keeping services around to deal with external dependencies does not mean the API of your Domain Model has to be biased towards service objects:
pos.charge(creditCard);
// can equivalently be:
creditCard.chargeOn(pos);
This is a form of Double Dispatch, since there are two objects collaborating and you can dispatch (send messages) to both, being polymorphic by substituting either object. The sequence of calls is: client -> creditCard -> pos. The client object still looks at CreditCard as a behaviorally complete object, but it is clear which dependency is necessary to run each use case (CreditCard method). You can persist a CreditCard easily and send it over the wire to caches or databases. When the time comes to charge, it is the client that has to bring forward a service able to connect to a bank.
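The idea of passing the service through the stack can be sketched in a few lines. The example below is in Java rather than the PHP the article refers to, and every name in it other than CreditCard, chargeOn and the Point of Sale concept is an assumption made for illustration (the PointOfSale interface shape, the chargedCents field, the roundTrip helper). It shows an entity that holds only state, survives a serialization round trip, and still collaborates with a service that is handed in at call time.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical service interface; in a real system it would talk to a bank.
interface PointOfSale {
    boolean charge(String cardNumber, long amountInCents);
}

// The entity keeps only state, so default serialization just works.
final class CreditCard implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String number;
    private long chargedCents;

    CreditCard(String number) {
        this.number = number;
    }

    // The service arrives as a parameter; no field ever references it.
    boolean chargeOn(PointOfSale pos, long amountInCents) {
        boolean accepted = pos.charge(number, amountInCents);
        if (accepted) {
            chargedCents += amountInCents;
        }
        return accepted;
    }

    long chargedCents() {
        return chargedCents;
    }

    // Serializes and deserializes a card in memory to mimic a session store.
    static CreditCard roundTrip(CreditCard card) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(card);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (CreditCard) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Could not round-trip the card", e);
        }
    }
}
```

Because the card never stores the PointOfSale, nothing needs to be reattached after deserialization: the caller simply supplies whichever service is available at that moment.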
June 5, 2013
by Giorgio Sironi
· 7,203 Views
Write CSV Data into Hive and Python
Apache Hive is a high level SQL-like interface to Hadoop. It lets you execute mostly unadulterated SQL, like this:
CREATE TABLE test_table(key string, stats map<string, int>);
The map column type is the only thing that doesn’t look like vanilla SQL here. Hive can actually use different backends for a given table. Map is used to interface with column oriented backends like HBase. Essentially, because we won’t know ahead of time all the column names that could be in the HBase table, Hive will just return them all as a key/value dictionary. There are then helpers to access individual columns by key, or even pivot the map into one key per logical row. As part of the Hadoop family, Hive is focused on bulk loading and processing. So it’s not a surprise that Hive does not support inserting raw values like the following SQL:
INSERT INTO suppliers (supplier_id, supplier_name) VALUES (24553, 'IBM');
However, for unit testing Hive scripts, it would be nice to be able to insert a few records manually. Then you could run your map reduce HQL, and validate the output. Luckily, Hive can load CSV files, so it’s relatively easy to insert a handful of records that way.
CREATE TABLE foobar(key string, stats map<string, int>) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '|' MAP KEYS TERMINATED BY ':' ;
LOAD DATA LOCAL INPATH '/tmp/foobar.csv' INTO TABLE foobar;
This will load a CSV file with the following data, where c4ca4-0000001-79879483-000000000124 is the key, and comments and likes are columns in a map.
c4ca4-0000001-79879483-000000000124,comments:0|likes:0
c4ca4-0000001-79879483-000000000124,comments:0|likes:0
Because I’ve been doing this quite a bit in my unit tests, I wrote a quick Python helper to dump a list of key/map tuples to a temporary CSV file, and then load it into Hive. This uses hiver to talk to Hive over thrift.
import hiver
from django.conf import settings
from django.core.files.temp import NamedTemporaryFile

def _hql(hql):
    # open a connection per statement and always shut it down
    client = hiver.connect(settings.HIVE_HOST, settings.HIVE_PORT)
    try:
        client.execute(hql)
    finally:
        client.shutdown()

def insert(table_name, rows):
    ''' cannot insert single rows via hive, need to save to a temp file and bulk load that '''
    csv_file = NamedTemporaryFile(delete=True)
    for row in rows:
        # row is a (key, dict) tuple; the dict becomes "name:value|name:value"
        map_repr = '|'.join('%s:%s' % (key, value) for key, value in row[1].items())
        csv_file.write(row[0] + "," + map_repr + "\n")
    csv_file.flush()
    try:
        _hql('DROP TABLE IF EXISTS %s' % table_name)
        _hql("""
            CREATE TABLE %s ( key string, stats map<string, int> )
            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
            COLLECTION ITEMS TERMINATED BY '|'
            MAP KEYS TERMINATED BY ':'
        """ % (table_name))
        _hql(""" LOAD DATA LOCAL INPATH '%s' INTO TABLE %s """ % (csv_file.name, table_name))
    finally:
        # closing the file also deletes it
        csv_file.close()

You can call it like this:
insert('test_table', [
    ('c4ca4-0000001-79879483-000000000124', {'comments': 1, 'likes': 2}),
    ('c4ca4-0000001-79879483-000000000124', {'comments': 1, 'likes': 2}),
    ('c4ca4-0000001-79879496-000000000124', {'comments': 1, 'likes': 2}),
    ('b4aed-0000002-79879783-000000000768', {'comments': 1, 'likes': 2}),
    ('b4aed-0000002-79879783-000000000768', {'comments': 1, 'likes': 2}),
])
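The row format itself is independent of Python. As a small sketch for anyone producing the same file from a JVM test, the Java snippet below renders one key/map pair with the delimiters used in the table definition (comma between fields, pipe between map items, colon between map key and value). The HiveRowFormatter class is hypothetical, and it assumes that keys and values never contain those delimiter characters, since it does no escaping.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration; it performs no escaping of delimiters.
final class HiveRowFormatter {

    private HiveRowFormatter() {
    }

    // Renders "key,name:value|name:value" in the map's iteration order.
    static String toLine(String key, Map<String, ?> stats) {
        StringJoiner items = new StringJoiner("|");
        for (Map.Entry<String, ?> entry : stats.entrySet()) {
            items.add(entry.getKey() + ":" + entry.getValue());
        }
        return key + "," + items;
    }
}
```

Passing a LinkedHashMap keeps the map items in insertion order, which makes the generated file easier to compare in test assertions.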
June 5, 2013
by Chase Seibert
· 14,707 Views
Hadoop REST API - WebHDFS
Hadoop provides a Java native API to support file system operations..
June 3, 2013
by Istvan Szegedi
· 57,441 Views · 5 Likes
Avro's Built-In Sorting
Avro has a little-known gem of a feature which allows you to control which fields in an Avro record are used for partitioning, sorting and grouping in MapReduce. The following figure gives a quick refresher as to what these terms mean. Oh, and don’t take the placement of the “sorting” literally - sorting actually occurs on both the map and reduce side - but it’s always performed in the context of a specific partition (i.e. for a specific reducer). By default all the fields in an Avro map output key are used for partitioning, sorting and grouping in MapReduce. Let’s walk through an example and see how this works. You’ll begin with a simple schema (GitHub source):
{"type": "record", "name": "com.alexholmes.avro.WeatherNoIgnore", "doc": "A weather reading.", "fields": [ {"name": "station", "type": "string"}, {"name": "time", "type": "long"}, {"name": "temp", "type": "int"}, {"name": "counter", "type": "int", "default": 0} ] }
We’re going to see what happens when we run this code against a small sample data set, which we’ll generate using Avro code (GitHub source):
File input = tmpFolder.newFile("input.txt");
AvroFiles.createFile(input, WeatherNoIgnore.SCHEMA$, Arrays.asList(
    WeatherNoIgnore.newBuilder().setStation("sfo").setTime(1).setTemp(3).build(),
    WeatherNoIgnore.newBuilder().setStation("iad").setTime(1).setTemp(1).build(),
    WeatherNoIgnore.newBuilder().setStation("sfo").setTime(2).setTemp(1).build(),
    WeatherNoIgnore.newBuilder().setStation("sfo").setTime(1).setTemp(2).build(),
    WeatherNoIgnore.newBuilder().setStation("sfo").setTime(1).setTemp(1).build()
).toArray());
To understand how Avro is partitioning, sorting and grouping the data, we’ll write an identity mapper and reducer, with a small enhancement to the reducer to increment the counter field for each record we see in an individual reducer invocation (GitHub source):
package com.alexholmes.avro.sort.basic;
import com.alexholmes.avro.WeatherNoIgnore;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class AvroSort {
    // identity mapper: emits the record as both key and value
    private static class SortMapper extends Mapper<AvroKey<WeatherNoIgnore>, NullWritable, AvroKey<WeatherNoIgnore>, AvroValue<WeatherNoIgnore>> {
        @Override
        protected void map(AvroKey<WeatherNoIgnore> key, NullWritable value, Context context) throws IOException, InterruptedException {
            context.write(key, new AvroValue<WeatherNoIgnore>(key.datum()));
        }
    }
    // numbers the values seen within a single reduce call
    private static class SortReducer extends Reducer<AvroKey<WeatherNoIgnore>, AvroValue<WeatherNoIgnore>, AvroKey<WeatherNoIgnore>, NullWritable> {
        @Override
        protected void reduce(AvroKey<WeatherNoIgnore> key, Iterable<AvroValue<WeatherNoIgnore>> values, Context context) throws IOException, InterruptedException {
            int counter = 1;
            for (AvroValue<WeatherNoIgnore> weatherNoIgnore : values) {
                weatherNoIgnore.datum().setCounter(counter++);
                context.write(new AvroKey<WeatherNoIgnore>(weatherNoIgnore.datum()), NullWritable.get());
            }
        }
    }
    public boolean runMapReduce(final Job job, Path inputPath, Path outputPath) throws Exception {
        FileInputFormat.setInputPaths(job, inputPath);
        job.setInputFormatClass(AvroKeyInputFormat.class);
        AvroJob.setInputKeySchema(job, WeatherNoIgnore.SCHEMA$);
        job.setMapperClass(SortMapper.class);
        AvroJob.setMapOutputKeySchema(job, WeatherNoIgnore.SCHEMA$);
        AvroJob.setMapOutputValueSchema(job, WeatherNoIgnore.SCHEMA$);
        job.setReducerClass(SortReducer.class);
        AvroJob.setOutputKeySchema(job, WeatherNoIgnore.SCHEMA$);
        job.setOutputFormatClass(AvroKeyOutputFormat.class);
        FileOutputFormat.setOutputPath(job, outputPath);
        return job.waitForCompletion(true);
    }
}
If you look at the output of the job below, you’ll see that the output is sorted across all the fields, and that the sorting is in
field ordinal order. What this means is that when MapReduce is sorting these records, it compares the station field first, then the time field second, and so on according to the ordering of the fields in the Avro schema. This is pretty much what you’d expect if you wrote your own complex Writable type, and your comparator compared all the fields in order.
{"station": "iad", "time": 1, "temp": 1, "counter": 1}
{"station": "sfo", "time": 1, "temp": 1, "counter": 1}
{"station": "sfo", "time": 1, "temp": 2, "counter": 1}
{"station": "sfo", "time": 1, "temp": 3, "counter": 1}
{"station": "sfo", "time": 2, "temp": 1, "counter": 1}
Oh, and before we move on, notice that the value for the counter field is always 1, meaning that each reducer invocation was only fed a single key/value pair, which makes sense since our identity mapper only emitted a single value for each key, the keys are unique, and the MapReduce partitioner, sorter and grouper were using all the fields in the record.
Excluding fields for sorting
Avro gives us the ability to indicate that specific fields should be ignored when performing ordering functions. In MapReduce these fields are ignored for sorting, partitioning and grouping, which basically means that we have the ability to perform secondary sorting. Let’s examine the following schema (GitHub source):
{"type": "record", "name": "com.alexholmes.avro.Weather", "doc": "A weather reading.", "fields": [ {"name": "station", "type": "string"}, {"name": "time", "type": "long"}, {"name": "temp", "type": "int", "order": "ignore"}, {"name": "counter", "type": "int", "order": "ignore", "default": 0} ] }
It’s pretty much identical to the first schema, the only difference being that the last two fields are flagged as being “ignored” for sorting/partitioning/grouping. Let’s run the same MapReduce code (GitHub source) as above, modified only to work with the different schema, against this new schema and examine the outputs.
{"station": "iad", "time": 1, "temp": 1, "counter": 1}
{"station": "sfo", "time": 1, "temp": 3, "counter": 1}
{"station": "sfo", "time": 1, "temp": 2, "counter": 2}
{"station": "sfo", "time": 1, "temp": 1, "counter": 3}
{"station": "sfo", "time": 2, "temp": 1, "counter": 1}
There are a couple of notable differences between this output and the output from the previous schema, which didn’t have any ignored fields. First, it’s clear that the temp field isn’t being used in the sorting, which makes sense since we specified that it should be ignored in the schema. More interestingly, note the value of the counter field. All records that had identical station and time values went to the same reducer invocation, evidenced by the increasing value of counter. This is essentially secondary sort! Now, all of this greatness isn’t without some limitations:
You can’t support two MapReduce jobs that use the same Avro key but have different sorting/partitioning/grouping requirements, although it’s conceivable that you could create a new instance of the Avro schema and set the ignored flags for these fields yourself.
The partitioner, sorter and grouping functions in MapReduce all work off of the same fields (i.e. they all ignore fields that are set as ignored in the schema). This means that your options for secondary sorting are limited. For example, you wouldn’t be able to partition all stations to the same reducer, and then group by station and time.
Ordering uses a field’s ordinal position to determine its order within the overall set of fields to be ordered. In other words, in a two-field record, the first field is always compared before the second. There’s no way to change this behavior other than flipping the order of the fields in the record.
Having said all of that - the “ignoring fields” feature for sorting is pretty awesome, and something that will no doubt come in handy in my future MapReduce work.
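To make the effect of the ignored fields concrete without a Hadoop cluster, the following plain-Java sketch mimics the ordering rule: compare station first, then time, and leave temp out of the comparison. It is only an illustration of the semantics described above, not Avro's own comparator, and the Reading and IgnoredFieldOrdering classes are hypothetical. Records that compare as equal are counted as one group, which corresponds to one reduce call in the job.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the generated Weather record.
final class Reading {
    final String station;
    final long time;
    final int temp;

    Reading(String station, long time, int temp) {
        this.station = station;
        this.time = time;
        this.temp = temp;
    }
}

final class IgnoredFieldOrdering {

    private IgnoredFieldOrdering() {
    }

    // Fields are compared in schema order; temp is skipped, as if flagged "ignore".
    static final Comparator<Reading> KEY_ORDER =
            Comparator.comparing((Reading r) -> r.station)
                      .thenComparingLong(r -> r.time);

    // Sorts the readings and returns the size of each run of equal keys.
    static List<Integer> groupSizes(List<Reading> readings) {
        List<Reading> sorted = new ArrayList<>(readings);
        sorted.sort(KEY_ORDER);
        List<Integer> sizes = new ArrayList<>();
        Reading previous = null;
        for (Reading current : sorted) {
            if (previous != null && KEY_ORDER.compare(previous, current) == 0) {
                int last = sizes.size() - 1;
                sizes.set(last, sizes.get(last) + 1);
            } else {
                sizes.add(1);
            }
            previous = current;
        }
        return sizes;
    }
}
```

Fed the five sample readings from the article, the groups have sizes 1, 3 and 1, which lines up with the counter values in the output above.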
May 29, 2013
by Alex Holmes
· 8,090 Views
Amazon S3 Parallel MultiPart File Upload
In this blog post, I will present a simple tutorial on uploading a large file to Amazon S3 as fast as the network supports. Amazon S3 is clustered storage service of Amazon. It is designed to make web-scale computing easier. Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, secure, fast, inexpensive infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers. For using Amazon services, you'll need your AWS access key identifiers, which AWS assigned you when you created your AWS account. The following are the AWS access key identifiers: Access Key ID (a 20-character, alphanumeric sequence) For example: 022QF06E7MXBSH9DHM02 Secret Access Key (a 40-character sequence) For example: kWcrlUX5JEDGM/LtmEENI/aVmYvHNif5zB+d9+ct Caution Your Secret Access Key is a secret, which only you and AWS should know. It is important to keep it confidential to protect your account. Store it securely in a safe place. Never include it in your requests to AWS, and never e-mail it to anyone. Do not share it outside your organization, even if an inquiry appears to come from AWS or Amazon.com. No one who legitimately represents Amazon will ever ask you for your Secret Access Key. The Access Key ID is associated with your AWS account. You include it in AWS service requests to identify yourself as the sender of the request. The Access Key ID is not a secret, and anyone could use your Access Key ID in requests to AWS. To provide proof that you truly are the sender of the request, you also include a digital signature calculated using your Secret Access Key. The sample code handles this for you. Your Access Key ID and Secret Access Key are displayed to you when you create your AWS account. They are not e-mailed to you. 
If you need to see them again, you can view them at any time from your AWS account. To get your AWS access key identifiers Go to the Amazon Web Services web site at http://aws.amazon.com. Point to Your Account and click Security Credentials. Log in to your AWS account. The Security Credentials page is displayed. Your Access Key ID is displayed in the Access Identifiers section of the page. To display your Secret Access Key, click Show in the Secret Access Key column. You can use your Amazon keys from a properties file in your application. Here is a sample for properties file containing Amazon keys: # Fill in your AWS Access Key ID and Secret Access Key # http://aws.amazon.com/security-credentials accessKey = secretKey = Here is sample AmazonUtil class for getting AWS Credentials from properties file. public class AmazonUtil { private static final Logger logger = LogUtil.getLogger(); private static final String AWS_CREDENTIALS_CONFIG_FILE_PATH = ConfigUtil.CONFIG_DIRECTORY_PATH + File.separator + "aws-credentials.properties"; private static AWSCredentials awsCredentials; static { init(); } private AmazonUtil() { } private static void init() { try { awsCredentials = new PropertiesCredentials(IOUtil.getResourceAsStream(AWS_CREDENTIALS_CONFIG_FILE_PATH)); } catch (IOException e) { logger.error("Unable to initialize AWS Credentials from " + AWS_CREDENTIALS_CONFIG_FILE_PATH); } } public static AWSCredentials getAwsCredentials() { return awsCredentials; } } Amazon S3 has Multipart Upload service which allows faster, more flexible uploads into Amazon S3. Multipart Upload allows you to upload a single object as a set of parts. After all parts of your object are uploaded, Amazon S3 then presents the data as a single object. With this feature you can create parallel uploads, pause and resume an object upload, and begin uploads before you know the total object size. 
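Splitting an object into parts comes down to simple offset arithmetic, which is easy to get wrong for the final, shorter part. The sketch below works that arithmetic out on its own, with no AWS SDK involved; the PartPlanner class and its plan method are hypothetical names introduced for illustration. Each entry is an {offset, size} pair, and only the last part may be smaller than the requested part size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; it only computes byte ranges and uploads nothing.
final class PartPlanner {

    private PartPlanner() {
    }

    // Returns one {offset, size} pair per part, in upload order.
    static List<long[]> plan(long contentLength, long partSize) {
        if (partSize <= 0) {
            throw new IllegalArgumentException("partSize must be positive");
        }
        List<long[]> parts = new ArrayList<>();
        for (long offset = 0; offset < contentLength; offset += partSize) {
            // The last part covers whatever is left of the file.
            long size = Math.min(partSize, contentLength - offset);
            parts.add(new long[] {offset, size});
        }
        return parts;
    }
}
```

For a 12 MB file and a 5 MB part size this yields three parts: two of 5 MB starting at offsets 0 and 5 MB, and a final 2 MB part starting at 10 MB. Computing the ranges up front, instead of mutating the part size inside the upload loop, keeps each part independent, which suits handing them to separate threads.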
For more information on Multipart Upload, review the Amazon S3 Developer Guide. In this tutorial, my sample application uploads each file part to Amazon S3 on a different thread, to use as much of the network throughput as possible. Each file part is associated with a thread, and each thread uploads its part with the Amazon S3 API.
Figure 1. Amazon S3 Parallel Multi-Part File Upload Mechanism
The Amazon S3 API supports multipart file upload in this way:
1. Send an InitiateMultipartUploadRequest to Amazon.
2. Get a response containing a unique id for this upload operation.
3. For each part i of partCount:
3.1. Calculate the size and offset of part i in the whole file.
3.2. Build an UploadPartRequest with the file offset, the size of the current part and the unique upload id.
3.3. Give this request to a thread and start the upload by running the thread.
3.3.1. Send the associated UploadPartRequest to Amazon.
3.3.2. Get the response after a successful upload and save the ETag property of the response.
4. Wait for all threads to terminate.
5. Get the ETags (an ETag is an identifier for a successfully completed part upload) of all terminated threads.
6. Send a CompleteMultipartUploadRequest to Amazon with the unique upload id and all ETags. Amazon then joins all file parts into the target object.
Here is the implementation:
public class AmazonS3Util {
    private static final Logger logger = LogUtil.getLogger();
    public static final long DEFAULT_FILE_PART_SIZE = 5 * 1024 * 1024; // 5MB
    public static long FILE_PART_SIZE = DEFAULT_FILE_PART_SIZE;
    private static AmazonS3 s3Client;
    private static TransferManager transferManager;
    static { init(); }
    private AmazonS3Util() { }
    private static void init() {
        // ...
        s3Client = new AmazonS3Client(AmazonUtil.getAwsCredentials());
        transferManager = new TransferManager(AmazonUtil.getAwsCredentials());
    }
    // ...
    public static void putObjectAsMultiPart(String bucketName, File file) {
        putObjectAsMultiPart(bucketName, file, FILE_PART_SIZE);
    }

    public static void putObjectAsMultiPart(String bucketName, File file, long partSize) {
        List<PartETag> partETags = new ArrayList<PartETag>();
        List<MultiPartFileUploader> uploaders = new ArrayList<MultiPartFileUploader>();

        // Step 1: Initialize.
        InitiateMultipartUploadRequest initRequest = new InitiateMultipartUploadRequest(bucketName, file.getName());
        InitiateMultipartUploadResult initResponse = s3Client.initiateMultipartUpload(initRequest);
        long contentLength = file.length();

        try {
            // Step 2: Upload parts.
            long filePosition = 0;
            for (int i = 1; filePosition < contentLength; i++) {
                // Last part can be less than part size. Adjust part size.
                partSize = Math.min(partSize, (contentLength - filePosition));

                // Create request to upload a part.
                UploadPartRequest uploadRequest = new UploadPartRequest().
                    withBucketName(bucketName).withKey(file.getName()).
                    withUploadId(initResponse.getUploadId()).withPartNumber(i).
                    withFileOffset(filePosition).
                    withFile(file).
                    withPartSize(partSize);
                uploadRequest.setProgressListener(new UploadProgressListener(file, i, partSize));

                // Start a thread that uploads this part, and remember it so we can collect its ETag later.
                MultiPartFileUploader uploader = new MultiPartFileUploader(uploadRequest);
                uploaders.add(uploader);
                uploader.upload();

                filePosition += partSize;
            }

            // Wait for every part to finish and collect the ETags in part order.
            for (MultiPartFileUploader uploader : uploaders) {
                uploader.join();
                partETags.add(uploader.getPartETag());
            }

            // Step 3: complete.
            CompleteMultipartUploadRequest compRequest = new CompleteMultipartUploadRequest(bucketName, file.getName(), initResponse.getUploadId(), partETags);
            s3Client.completeMultipartUpload(compRequest);
        } catch (Throwable t) {
            logger.error("Unable to put object as multipart to Amazon S3 for file " + file.getName(), t);
            s3Client.abortMultipartUpload(
                new AbortMultipartUploadRequest(
                    bucketName, file.getName(), initResponse.getUploadId()));
        }
    }

    // ...
private static class UploadProgressListener implements ProgressListener { File file; int partNo; long partLength; UploadProgressListener(File file) { this.file = file; } @SuppressWarnings("unused") UploadProgressListener(File file, int partNo) { this(file, partNo, 0); } UploadProgressListener(File file, int partNo, long partLength) { this.file = file; this.partNo = partNo; this.partLength = partLength; } @Override public void progressChanged(ProgressEvent progressEvent) { switch (progressEvent.getEventCode()) { case ProgressEvent.STARTED_EVENT_CODE: logger.info("Upload started for file " + "\"" + file.getName() + "\""); break; case ProgressEvent.COMPLETED_EVENT_CODE: logger.info("Upload completed for file " + "\"" + file.getName() + "\"" + ", " + file.length() + " bytes data has been transferred"); break; case ProgressEvent.FAILED_EVENT_CODE: logger.info("Upload failed for file " + "\"" + file.getName() + "\"" + ", " + progressEvent.getBytesTransfered() + " bytes data has been transferred"); break; case ProgressEvent.CANCELED_EVENT_CODE: logger.info("Upload cancelled for file " + "\"" + file.getName() + "\"" + ", " + progressEvent.getBytesTransfered() + " bytes data has been transferred"); break; case ProgressEvent.PART_STARTED_EVENT_CODE: logger.info("Upload started at " + partNo + ". part for file " + "\"" + file.getName() + "\""); break; case ProgressEvent.PART_COMPLETED_EVENT_CODE: logger.info("Upload completed at " + partNo + ". part for file " + "\"" + file.getName() + "\"" + ", " + (partLength > 0 ? partLength : progressEvent.getBytesTransfered()) + " bytes data has been transferred"); break; case ProgressEvent.PART_FAILED_EVENT_CODE: logger.info("Upload failed at " + partNo + ". 
part for file " + "\"" + file.getName() + "\"" + ", " + progressEvent.getBytesTransfered() + " bytes data has been transferred"); break; } } }

    private static class MultiPartFileUploader extends Thread {
        private UploadPartRequest uploadRequest;
        private PartETag partETag;

        MultiPartFileUploader(UploadPartRequest uploadRequest) {
            // s3Client is the static field of the enclosing AmazonS3Util class, so only the request is stored here.
            this.uploadRequest = uploadRequest;
        }

        @Override
        public void run() {
            partETag = s3Client.uploadPart(uploadRequest).getPartETag();
        }

        private PartETag getPartETag() {
            return partETag;
        }

        private void upload() {
            start();
        }
    }
}
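The offset and size arithmetic from step 3.1 can be tried out on its own, without the AWS SDK. The sketch below is only an illustration: PartPlanner and its nested Part class are hypothetical names that do not appear in the article or in the SDK, and the loop simply mirrors the one in putObjectAsMultiPart.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the AWS SDK): plans the parts of a multipart upload.
final class PartPlanner {

    // One planned part: 1-based part number, byte offset in the file, size in bytes.
    static final class Part {
        final int partNumber;
        final long offset;
        final long size;

        Part(int partNumber, long offset, long size) {
            this.partNumber = partNumber;
            this.offset = offset;
            this.size = size;
        }
    }

    // Splits contentLength bytes into consecutive parts of at most partSize bytes.
    static List<Part> plan(long contentLength, long partSize) {
        if (partSize <= 0) {
            throw new IllegalArgumentException("partSize must be positive");
        }
        List<Part> parts = new ArrayList<>();
        long position = 0;
        for (int i = 1; position < contentLength; i++) {
            // The last part may be smaller than partSize.
            long size = Math.min(partSize, contentLength - position);
            parts.add(new Part(i, position, size));
            position += size;
        }
        return parts;
    }
}
```

For a 12 MB file and the 5 MB default part size, this plans three parts of 5 MB, 5 MB and 2 MB. Keeping the per-part size in a local variable, rather than overwriting the partSize parameter, is a small design choice that keeps the requested part size visible throughout the loop.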
May 28, 2013
by Serkan Özal
· 57,361 Views · 3 Likes
Web API in ASP.NET Web Forms Application
With the release of ASP.NET MVC 4 one of the exciting features packed in the release was ASP.NET Web API.
May 24, 2013
by Lohith Nagaraj
· 51,545 Views
Azure Blob Storage - "The specified blob or block content is invalid"
If you’re uploading blobs by splitting them into blocks and you get the error “The specified blob or block content is invalid”, then this post is for you.

Short Version

If you’re uploading blobs by splitting them into blocks and you get the above-mentioned error, ensure that the block ids of your blocks are all the same length. If the block ids are of different lengths, you’ll get this error.

Long Version

Now for the longer version of this post. A few days back I was working with the storage client library, especially around uploading blobs in chunks, and with one particular blob I was constantly getting the error “The specified blob or block content is invalid”. I tried numerous combinations, even resorting to the REST API directly, but to no avail. It only happened with just one blob. Furthermore, if I uploaded the same blob without splitting it into blocks, all was well. I was at my wits’ end. I tried searching the Internet for this error but could not find a conclusive answer to my problem. After much trial and error, I was able to simulate the same problem on other blobs as well. Here’s how you can recreate it:

Start uploading the blob by splitting it into blocks. For the block id, let’s use a 7-character string, e.g. intValue.ToString("d7"). This ensures that the block ids are "0000001", "0000002", …, "0000010", and so on.
After one or two blocks are uploaded, cancel the operation.
Now re-upload the blob by splitting it into blocks. However, this time use a 6-character string for the block id, e.g. intValue.ToString("d6").
You’ll get the error as soon as you try to upload the 1st block.

Possible Solutions

Now that we know the root cause of this problem, let’s look at some possible solutions.

Wait it out

One possible solution is to wait it out. I know it’s lame, but it is still a possible solution.
We know that the Windows Azure Blob Storage Service keeps all uncommitted blocks for 7 days, and if those blocks are not committed within that time, the storage service purges them. I wish the storage service provided some mechanism to purge uncommitted blocks programmatically.

Commit uncommitted blocks

You could commit the blocks which are in the uncommitted state so that at least you get a blob (which would not be the blob we wanted to upload in the first place). You can then delete that blob and re-upload the blob by specifying block ids which are of the same length. To fetch the list of uncommitted blocks, if you’re using the REST API directly you can perform the “Get Block List” operation and pass “blocklisttype=uncommitted” as one of the query string parameters. If you’re using the storage client library (assuming version 2.x of the .Net storage client library), you can do something like the code below:

private static List<string> GetUncommittedBlockIds(CloudBlockBlob blob)
{
    var sasUri = blob.GetSharedAccessSignature(new SharedAccessBlobPolicy()
    {
        SharedAccessExpiryTime = DateTime.UtcNow.AddMinutes(5),
        Permissions = SharedAccessBlobPermissions.Read,
    });
    var blobUri = new Uri(string.Format("{0}{1}", blob.Uri, sasUri));
    List<string> uncommittedBlockIds = new List<string>();
    var request = BlobHttpWebRequestFactory.GetBlockList(blobUri, null, null, BlockListingFilter.Uncommitted, null, null);
    //request.Headers.Add("Authorization",
    using (var resp = (HttpWebResponse)request.GetResponse())
    {
        using (var stream = resp.GetResponseStream())
        {
            var getBlockListResponse = new GetBlockListResponse(stream);
            var blocks = getBlockListResponse.Blocks;
            foreach (var block in blocks.Where(b => !b.Committed))
            {
                // Block names come back Base64 encoded; decode them to the original id.
                uncommittedBlockIds.Add(Encoding.UTF8.GetString(Convert.FromBase64String(block.Name)));
            }
        }
    }
    return uncommittedBlockIds;
}

A few things to keep in mind here:

The Microsoft.WindowsAzure.Storage.Blob namespace does not have the capability to get the list of uncommitted blocks.
You would need to make use of the Microsoft.WindowsAzure.Storage.Blob.Protocol namespace.
Because we’re effectively invoking the REST API by executing an HttpWebRequest, I created a shared access signature on the blob so that I don’t have to create the “Authorization” header.

Fetch uncommitted blocks to see block id length

You could fetch the list of uncommitted blocks just to find out the length of the block id used. You could then use that block id length for your new upload session and do the upload. Please see the code snippet above to find this information.

Upload another blob with same name without splitting it into blocks

You could also upload another blob with the same name without splitting it into blocks. It could very well be a zero byte blob. That way your uncommitted block list will be wiped clean. Then you could delete that dummy blob and re-upload the actual blob.

A Few Words About Blocks

Since we’re talking about blocks, I thought it might be useful to mention a few points about them:

Blocks and block related operations are only applicable to “Block Blobs”. Duh!! You’ll get an error if you try these operations on a “Page Blob”.
For uploading large blobs, it is recommended that you split your blob into blocks. In fact, if your blob size is more than 64 MB, you have to split it into blocks.
The minimum size of a block is 1 byte and the maximum size of a block is 4 MB. It is recommended that you choose a block size based on your internet connectivity and the number of parallel threads you want to use to upload these blocks.
A blob can be split into a maximum of 50000 blocks. It’s important to remember this limitation, because otherwise you are only reminded of it when you try to upload the 50001st block.
The length of all the block ids must be the same. So if you’re using an integer value to denote the block id, make sure that you pad that integer value with “0” so that you get the same length. You could do something like int.ToString("d6").
When passing the block id as a parameter, it must be Base64 encoded. While the order in which the blocks are uploaded is not important, the order is important when you commit the block list because that’s when the blob is constructed by the service. For example, let’s say you’re uploading a blob by splitting it into 5 blocks (with ids “000001”, “000002”, “000003”, “000004”, and “000005”). You could upload these blocks in any order – 000004, 000001, 000003, 000005, 000002 however when you commit the block list, ensure that the block ids are passed in proper order i.e. 000001, 000002, 000003, 000004, 000005. Summary That’s it for this post. I hope you’ve found this information useful. I spent considerable amount of time trying to fix this problem so I hope it will help some folks out. As always, if you find any issues with the post please let me know and I’ll fix it ASAP.
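The two block id rules from the post (every id has the same length, and the id is Base64 encoded when it is passed as a parameter) can be sketched in a few lines. The post's own snippets are C#; the version below is a Java illustration of the same idea, and BlockIds, blockId and decode are hypothetical names, not part of any Azure client library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper (not an Azure SDK class): builds fixed-width, Base64-encoded block ids.
final class BlockIds {

    // Zero-pads index to exactly `width` digits, then Base64-encodes the padded text.
    static String blockId(int index, int width) {
        if (index < 0) {
            throw new IllegalArgumentException("index must not be negative");
        }
        String padded = String.format("%0" + width + "d", index);
        if (padded.length() != width) {
            // The index needs more digits than the chosen width, so ids would differ in length.
            throw new IllegalArgumentException("index does not fit in " + width + " digits");
        }
        return Base64.getEncoder().encodeToString(padded.getBytes(StandardCharsets.UTF_8));
    }

    // Reverses blockId(): returns the zero-padded text, e.g. "000001".
    static String decode(String blockId) {
        return new String(Base64.getDecoder().decode(blockId), StandardCharsets.UTF_8);
    }
}
```

With a width of 6, blockId(1, 6) and blockId(50000, 6) decode to "000001" and "050000", so both encoded ids have the same length. Choosing one width up front and rejecting indexes that do not fit is one way to avoid the mixed-length situation that triggers the error described in this post.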
May 20, 2013
by Gaurav Mantri
· 10,863 Views
Postgres Fuzzy Search Using Trigrams (+/- Django)
When building websites, you’ll often want users to be able to search for something by name. On LinerNotes, users can search for bands, albums, genres etc from a search bar that appears on the homepage and in the omnipresent nav bar. And we need a way to match those queries to entities in our Postgres database. At first, this might seem like a simple problem with a simple solution, especially if you’re using the ORM; just jam the user input into an ORM filter and retrieve every matching string. But there’s a problem: if you do Bands.objects.filter(name="beatles") You’ll probably get nothing back, because the name column in your “bands” table probably says “The Beatles” and as far as Postgres is concerned if it’s not exactly the same string, it’s not a match. Users are naturally terrible at spelling, and even if they weren’t they’d be bad at guessing exactly how the name is formatted in your database. Of course you can use the LIKE keyword in SQL (or the equivalent ‘__contains’ suffix in the ORM) to give yourself a little flexibility and make sure that “Beatles” returns “The Beatles”. But 1) the LIKE keyword requires you to evaluate a regex against every row in your table, or hope that you’ve configured your indices to support LIKE (a quick Google doesn’t tell me whether Django does that by default in the ORM) and 2) what if the user types “Beetles”? Well, then you’ve got a bit of a problem. No matter how obvious it is to human you that “beatles” is close to “beetles”[1], to the computer they’re just two non-identical byte sequences. If you want the computer to understand them as similar you’re going to have to give it a metric for similarity and a method to make the comparison. There are a few ways to do that. You can do what I did initially and whip out the power tools, i.e. a dedicated search system like Solr or ElasticSearch. These guys have notions of fuzziness built right in (Solr more automatically than ES). 
But they’re designed for full-text indexing of documents (e.g. full web pages) and they’re rather complex to set up and administer. ES has been enough of a hassle to keep running smoothly that I took the time to see if I could push the search workload to Postgres, and hence this article. Unless you need to do something really fancy, it’s probably overkill to use them for just matching names.

Instead, we’re going to follow Starr Horne’s advice and use a Postgres EXTENSION that lets us build fuzziness into our query in a fast and fairly simple way. Specifically, we’re going to use an extension called pg_trgm (i.e. “Postgres Trigram”) which gives Postgres a “similarity” function that can evaluate how many three-character subsequences (i.e. “trigrams”) two strings share. This is actually a pretty good metric for fuzzy matching short strings like names.

To use pg_trgm, you’ll need to install the “Postgres Contrib” package. On Ubuntu:

    sudo apt-get install postgresql-contrib

**WARNING: THIS WILL TRY TO RESTART YOUR DATABASE**

Then pop open psql and install pg_trgm (NB: this only works on Postgres 9.1+; Google for the instructions if you’re on a lower version). The \dx command lists installed extensions, so you can check that it’s installed:

    psql
    CREATE EXTENSION pg_trgm;
    \dx

Now you can do

    SELECT * FROM types_and_labels_view WHERE label % 'Mountain Goats' ORDER BY similarity(label, 'Mountain Goats') DESC LIMIT 100;

And out will pop the 100 most similar names. This will still take a long time if your table is large, but we can improve that with a special type of index provided by pg_trgm:

    CREATE INDEX labels_trigram_index ON types_and_labels_table USING gist (label gist_trgm_ops);

or

    CREATE INDEX labels_trigram_index ON types_and_labels_table USING gin (label gin_trgm_ops);

(GIN is slower than GIST to build, but answers queries faster.) That’ll take a while to build (possibly quite a while), but once it does you should be able to fuzzy search with ease and speed.
If you’re using Django, you will have to drop into writing SQL to use this (until someone, maybe you, writes a Django extension to do this in the ORM).

And as a frustrating finishing note, my attempt to implement this on LinerNotes was not ultimately successful. It seems that index query performance is at least O(n), and with 50 million entities in my database, queries take at least 10 seconds. I’ve read that performance is great up to about 100k records and then drops off sharply from there. There are apparently some additional options for improving query performance, but I’ll be sticking with ElasticSearch for now.

[1] Sorry, Googlebot! Not sorry, Bingbot.
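To get a feel for what a trigram similarity score looks like, here is a small stand-alone Java sketch. It is a simplified approximation of the scheme the pg_trgm documentation describes (lowercase the text, pad each word with two leading spaces and one trailing space, then divide shared trigrams by total distinct trigrams), not the extension's actual code, and TrigramSimilarity is a hypothetical name used only for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Simplified illustration of trigram similarity; an approximation, not pg_trgm itself.
final class TrigramSimilarity {

    // Collects the distinct trigrams of every (ASCII) alphanumeric word in the text.
    static Set<String> trigrams(String text) {
        Set<String> result = new HashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Assumed padding: two spaces before the word and one after.
            String padded = "  " + word + " ";
            for (int i = 0; i + 3 <= padded.length(); i++) {
                result.add(padded.substring(i, i + 3));
            }
        }
        return result;
    }

    // Shared trigrams divided by all distinct trigrams of both strings.
    static double similarity(String a, String b) {
        Set<String> ta = trigrams(a);
        Set<String> tb = trigrams(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(ta);
        shared.retainAll(tb);
        int union = ta.size() + tb.size() - shared.size();
        return (double) shared.size() / union;
    }
}
```

Under these assumptions, "beatles" and "beetles" share 5 of 11 distinct trigrams, a score of about 0.45, while "beatles" against "The Beatles" scores 8 of 12, about 0.67. Both are above the 0.3 default threshold that the pg_trgm documentation gives for the % operator.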
May 19, 2013
by George London
· 9,586 Views
Lazy sequences implementation for Java 8
I just published the LazySeq library on GitHub - the result of my Java 8 experiments recently. I hope you will enjoy it. Even if you don't find it very useful, it's still a great lesson of functional programming in Java 8 (and in general). Also it's probably the first community library targeting Java 8! Introduction A Lazy sequence is a data structure that is computed only when its elements are actually needed. All operations on lazy sequences, like map() and filter() are lazy as well, postponing invocation up to the moment when it is really necessary. Lazy sequences are always traversed from the beginning using very cheap first/rest decomposition (head() and tail()). An important property of lazy sequences is that they can represent infinite streams of data, e.g. all natural numbers or temperature measurements over time. Lazy sequence remembers already computed values so if you access the Nth element, all elements from 1 to N-1 are computed as well and cached. Despite that LazySeq (being at the core of many functional languages and algorithms) is immutable and thread-safe. Rationale This library is heavily inspired by scala.collection.immutable.Stream and aims to provide immutable, thread-safe and easy to use lazy sequence implementation, possibly infinite. See Lazy sequences in Scala and Clojure for some use cases. Stream class name is already used in Java 8, therefore LazySeq was chosen, similar to lazy-seq in Clojure. Speaking of Stream, at first it looks like a lazy sequence implementation available out-of-the-box. However, quoting Javadoc: Streams are not data structures and: Once an operation has been performed on a stream, it is considered consumed and no longer usable for other operations. In other words java.util.stream.Stream is just a thin wrapper around existing collection, suitable for one time use. More akin to Iterator than to Stream in Scala. This library attempts to fill this niche. 
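The one-time-use nature of java.util.stream.Stream quoted above is easy to observe with the standard library alone. The sketch below is illustrative (StreamReuseDemo is a hypothetical name); it relies on the behaviour of the reference JDK implementation, which throws IllegalStateException when a stream that has already been operated upon is used again.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustration only: shows that a Stream cannot be traversed a second time.
final class StreamReuseDemo {

    // Returns true if the first use succeeds and the second use of the same stream is rejected.
    static boolean secondUseIsRejected() {
        Stream<Integer> numbers = Stream.of(1, 2, 3);

        // First use: the pipeline runs and consumes the stream.
        List<Integer> doubled = numbers.map(n -> n * 2).collect(Collectors.toList());

        try {
            // Second use of the same Stream object.
            numbers.count();
            return false;
        } catch (IllegalStateException alreadyUsed) {
            return doubled.size() == 3;
        }
    }
}
```

A lazy sequence behaves differently in exactly this respect: because computed elements are cached, the same sequence can be traversed as many times as needed.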
Of course implementing a lazy sequence data structure was possible prior to Java 8, but the lack of lambdas makes working with such a data structure tedious and too verbose.

Getting started

Building and working with lazy sequences in 10 minutes.

Infinite sequence of all natural numbers

In order to create a lazy sequence you use the LazySeq.cons() factory method, which accepts the first element (head) and a function that might later be used to compute the rest (tail). For example, in order to produce a lazy sequence of natural numbers with a given start element you simply say:

    private LazySeq<Integer> naturals(int from) {
        return LazySeq.cons(from, () -> naturals(from + 1));
    }

There is really no recursion here. If there was, calling naturals() would quickly result in StackOverflowError as it calls itself without a stop condition. However, the () -> naturals(from + 1) expression defines a function returning LazySeq<Integer> (Supplier<LazySeq<Integer>> to be precise) that this data structure will invoke, but only if needed. Look at the code below: how many times do you think the naturals() function was called (except the first line)?

    final LazySeq<Integer> ints = naturals(2);
    final LazySeq<String> strings = ints.
        map(n -> n + 10).
        filter(n -> n % 2 == 0).
        take(10).
        flatMap(n -> Arrays.asList(0x10000 + n, n)).
        distinct().
        map(Integer::toHexString);

The first invocation of naturals(2) returns a lazy sequence starting from 2, but the rest (3, 4, 5, ...) is not computed yet. Later we map() over this sequence, filter() it, take() the first 10 elements, remove duplicates, etc. None of these operations evaluate the sequence; they are as lazy as possible. For example take(10) doesn't evaluate the first 10 elements eagerly to return them. Instead a new lazy sequence is returned which remembers that it should truncate the original sequence at the 10th element. The same applies to distinct(). It doesn't evaluate the whole sequence to extract all unique values (otherwise the code above would explode quickly, traversing an infinite amount of natural numbers).
Instead it returns a new sequence with only the first element. If you ever ask for the second unique element, it will lazily evaluate the tail, but only as far as necessary. Check out the toString() output:

    System.out.println(strings); //[1000c, ?]

The question mark (?) says: "there might be something more in that collection, but I don't know it yet". Do you understand where 1000c came from? Look carefully:

Start from an infinite stream of natural numbers starting from 2
Add 10 to each element (so the first element becomes 12, or C in hex)
filter() out odd numbers (12 is even so it stays)
take() the first 10 elements from the sequence so far
Each element is replaced by two elements: that element plus 0x10000 and the element itself (flatMap()). This does not yield a sequence of pairs, but a sequence of integers that is twice as long
We ensure only distinct() elements will be returned
In the end we turn integers into hex strings.

As you can see, none of these operations really require evaluating the whole stream. Only the head is being transformed and this is what we see in the end. So when is this data structure actually evaluated? When it absolutely must be, e.g. during side-effect traversal:

    strings.force();
    //or
    strings.forEach(System.out::println);
    //or
    final List<String> list = strings.toList();
    //or
    for (String s : strings) {
        System.out.println(s);
    }

Each of the statements above on its own will force evaluation of the whole lazy sequence. That would not be very smart if our sequence was infinite, but strings was limited to the first 10 elements so it will not run infinitely. If you want to force only part of the sequence, simply call strings.take(5).force(). BTW have you noticed that we can iterate over LazySeq strings using standard Java 5 for-each syntax?
That's because LazySeq implements the List interface, and thus plays nicely with the Java Collections Framework ecosystem:

    import java.util.AbstractList;

    public abstract class LazySeq<E> extends AbstractList<E>

Please keep in mind that once a lazy sequence is evaluated (computed) it will cache (memoize) its elements for later use. This makes lazy sequences great for representing infinite or very long streams of data that are expensive to compute.

iterate()

Building an infinite lazy sequence very often boils down to providing an initial element and a function that produces the next item based on the previous one. In other words the second element is a function of the first one, the third element is a function of the second one, and so on. The convenience function LazySeq.iterate() is provided for such circumstances. The ints definition can now look like this:

    final LazySeq<Integer> ints = LazySeq.iterate(2, n -> n + 1);

We start from 2 and each subsequent element is represented as the previous element + 1.

More examples: Fibonacci sequence and Collatz conjecture

No article about lazy data structures can be left without a Fibonacci numbers example:

    private static LazySeq<Integer> lastTwoFib(int first, int second) {
        return LazySeq.cons(
            first,
            () -> lastTwoFib(second, first + second)
        );
    }

The Fibonacci sequence is infinite as well, but we are free to transform it in multiple ways (fib below is assumed to be lastTwoFib(0, 1), which is consistent with the results shown):

    System.out.println(
        fib.
            drop(5).
            take(10).
            toList()
    );
    //[5, 8, 13, 21, 34, 55, 89, 144, 233, 377]

    final int firstAbove1000 = fib.
        filter(n -> (n > 1000)).
        head();

    fib.get(45);

See how easy and natural it is to work with an infinite stream of numbers? drop(5).take(10) skips the first 5 elements and displays the next 10. At this point the first 15 numbers are already computed and will never be computed again. Finding the first Fibonacci number above 1000 (happens to be 1597) is very straightforward. head() is always precomputed by filter(), so no further evaluation is needed. Last but not least we can simply ask for the 45th Fibonacci number (0-based) and get 1134903170.
If you ever try to access any Fibonacci number up to this one, they are precomputed and fast to retrieve.

Finite sequences (Collatz conjecture)

The Collatz conjecture is also quite an interesting problem. For each positive integer n we compute the next integer using the following algorithm:

n/2 if n is even
3n + 1 if n is odd

For example, starting from 10 the series looks as follows: 10, 5, 16, 8, 4, 2, 1. The series ends when it reaches 1. Mathematicians believe that starting from any integer we will eventually reach 1, but it's not yet proven. Let us create a lazy sequence that generates the Collatz series for a given n, but only as many elements as needed. As stated above, this time our sequence will be finite:

    private LazySeq<Long> collatz(long from) {
        if (from > 1) {
            final long next = from % 2 == 0 ? from / 2 : from * 3 + 1;
            return LazySeq.cons(from, () -> collatz(next));
        } else {
            return LazySeq.of(1L);
        }
    }

This implementation is driven directly by the definition. For each number greater than 1, return that number followed by the lazily evaluated (() -> collatz(next)) rest of the stream. As you can see, if 1 is given, we return a single-element lazy sequence using the special of() factory method. Let's test it with the aforementioned 10:

    final LazySeq<Long> collatz = collatz(10);
    collatz.filter(n -> (n > 10)).head();
    collatz.size();

filter() allows us to find the first number in the sequence that is greater than 10. Remember that the lazy sequence will have to traverse its contents (evaluate itself), but only to the point where it finds the first matching element. Then it stops, ensuring it computes as little as possible. However size(), in order to calculate the total number of elements, must traverse the whole sequence. Of course this can only work with finite lazy sequences; calling size() on an infinite sequence will end poorly. If you play a bit with this sequence you will quickly realize that sequences for different numbers share the same suffix (they always end with the same sequence of numbers). This begs for some caching/structural sharing.
See CollatzConjectureTest for details.

But can it be used for something, you know... useful? Real life?

Infinite sequences of numbers are great, but not very practical in real life. Maybe some more down-to-earth examples? Imagine you have a collection and you need to pick a few items from that collection randomly. Instead of a collection I will use a function returning random Latin characters:

    private char randomChar() {
        return (char) ('A' + (int) (Math.random() * ('Z' - 'A' + 1)));
    }

But there is a twist. You need N (N < 26, the number of Latin characters) unique values. Simply calling randomChar() a few times doesn't guarantee uniqueness. There are a few approaches to this problem; with LazySeq it's pretty straightforward:

    LazySeq<Character> charStream = LazySeq.continually(this::randomChar);
    LazySeq<Character> uniqueCharStream = charStream.distinct();

continually() simply invokes the given function for each element when needed. Thus charStream will be an infinite stream of random characters. Of course they can't be unique. However uniqueCharStream guarantees that its output is unique. It does so by examining the next element of the underlying charStream and rejecting items that have already appeared. We can now say uniqueCharStream.take(4) and be sure that no duplicates will appear. Once again notice that continually(this::randomChar).distinct().take(4) really calls randomChar() only once! As long as you don't consume this sequence, it remains lazy and postpones evaluation as long as possible.

Another example involves loading batches (pages) of data from a database. Using ResultSet or Iterator is cumbersome, but loading the whole data set into memory is often not feasible. An alternative involves loading the first batch of data eagerly and then providing a function to load the next batches. Data is loaded only when it's really needed and we don't suffer performance or scalability issues.
First let's define an abstract API for loading batches of data from the database:

    public List<Record> loadPage(int offset, int max) {
        //load records from offset to offset + max
    }

I abstract from the technology entirely, but you get the point. Imagine that we now define a LazySeq<Record> that starts from row 0 and loads the next pages only when needed:

    public static final int PAGE_SIZE = 5;

    private LazySeq<Record> records(int from) {
        return LazySeq.concat(
            loadPage(from, PAGE_SIZE),
            () -> records(from + PAGE_SIZE)
        );
    }

When creating a new LazySeq instance by calling records(0), the first page of 5 elements is loaded. This means that the first 5 sequence elements are already computed. If you ever try to access the 6th element or above, the sequence will automatically load all missing records and cache them. In other words you never compute the same element twice.

More useful tools when working with sequences are the grouped() and sliding() methods. The first partitions the input sequence into groups of equal size. Take this as an example, also proving that these methods are, as always, lazy:

    final LazySeq<Character> chars = LazySeq.of('A', 'B', 'C', 'D', 'E', 'F', 'G');
    chars.grouped(3);
    //[[A, B, C], ?]
    chars.grouped(3).force(); //force evaluation
    //[[A, B, C], [D, E, F], [G]]

and similarly for sliding():

    chars.sliding(3);
    //[[A, B, C], ?]
    chars.sliding(3).force(); //force evaluation
    //[[A, B, C], [B, C, D], [C, D, E], [D, E, F], [E, F, G]]

These two methods are extremely useful. You can look at your data through a sliding window (e.g. to compute a moving average) or partition it into equal-length buckets.

The last utility method you may find useful is scan(), which iterates (lazily, of course) over the input stream and constructs every element of the output by applying a function to the previous output element and the current element of the input. A code snippet is worth a thousand words:

    LazySeq<Integer> list = LazySeq.
        numbers(1).
        scan(0, (a, x) -> a + x);
    list.take(10).force(); //[0, 1, 3, 6, 10, 15, 21, 28, 36, 45]

LazySeq.numbers(1) is a sequence of natural numbers (1, 2, 3...).
scan() creates a new sequence that starts from 0 and, for each element of the input (natural numbers), adds it to the last element of itself. So we get: [0, 0+1, 0+1+2, 0+1+2+3, 0+1+2+3+4, 0+1+2+3+4+5...]. If you want a sequence of growing strings, just replace a few types:

    LazySeq.continually("*").
        scan("", (s, c) -> s + c).
        map(s -> "|" + s + "\\").
        take(10).
        forEach(System.out::println);

And enjoy this beautiful triangle:

    |\
    |*\
    |**\
    |***\
    |****\
    |*****\
    |******\
    |*******\
    |********\
    |*********\

Alternatively (same output), using iterate():

    LazySeq.iterate("", s -> s + "*").
        map(s -> "|" + s + "\\").
        take(10).
        forEach(System.out::println);

Java collections framework interoperability

LazySeq implements the java.util.List interface, and thus can be used in a variety of places. Moreover it also implements the Java 8 enhancements to collections, namely streams and collectors:

    lazySeq.
        stream().
        map(n -> n + 1).
        flatMap(n -> asList(0, n - 1).stream()).
        filter(n -> n != 0).
        substream(4, 18).
        limit(10).
        sorted().
        distinct().
        collect(Collectors.toList());

However, streams in Java 8 were created to work around the feature that is the foundation of LazySeq: lazy evaluation. The example above postpones all intermediate steps until collect() is called. With LazySeq you can safely skip .stream() and work directly on the sequence:

    lazySeq.
        map(n -> n + 1).
        flatMap(n -> asList(0, n - 1)).
        filter(n -> n != 0).
        slice(4, 18).
        limit(10).
        sorted().
        distinct();

Moreover LazySeq provides a special-purpose collector (see: LazySeq.toLazySeq()) that avoids evaluation even when used with collect(), which normally forces full collection computation.

Implementation details

Each lazy sequence is built around the idea of an eagerly computed head and a lazily evaluated tail represented as a function. This is very similar to the classic singly-linked list recursive definition:

    class List<T> {
        private final T head;
        private final List<T> tail;
        //...
    }

However, in the case of a lazy sequence the tail is given as a function, not a value. Invocation of that function is postponed as long as possible:

    class Cons<E> extends LazySeq<E> {
        private final E head;
        private LazySeq<E> tailOrNull;
        private final Supplier<LazySeq<E>> tailFun;

        @Override
        public LazySeq<E> tail() {
            if (tailOrNull == null) {
                tailOrNull = tailFun.get();
            }
            return tailOrNull;
        }
        //...
    }

For the full implementation see Cons.java and FixedCons.java, the latter used when the tail is known at creation time (for example LazySeq.of(1, 2) as opposed to LazySeq.cons(1, () -> someTailFun())).

Pitfalls and common dangers

Common issues and misunderstandings are described below.

Evaluating too much

One of the biggest dangers of working with infinite sequences is trying to evaluate them completely, which obviously leads to infinite computation. The idea behind an infinite sequence is not to evaluate it in its entirety but to take as much as we need without introducing artificial limits and accidental complexity (see the database loading example). However, evaluating the whole sequence by accident is all too easy. For example, calling LazySeq.size() must evaluate the whole sequence and will run infinitely, eventually filling up the stack or heap (implementation detail). There are other methods that require full traversal in order to function properly, e.g. allMatch(), which makes sure all elements match a given predicate. Some methods are even more dangerous, because whether they finish or not depends on the data in the sequence. For example anyMatch() may return immediately if the head matches the predicate, or never. Sometimes we can easily avoid costly operations by using more deterministic methods. For example:

    seq.size() <= 10 //BAD

may not work or may be extremely slow if seq is infinite. However we can achieve the same with the (more) predictable:

    seq.drop(10).isEmpty()

Remember that lazy sequences are immutable (so we don't really mutate seq); drop(n) is typically O(n) while isEmpty() is O(1).
When in doubt, consult the source code or JavaDoc to make sure your operation won't evaluate your sequence too eagerly. Also be very cautious when using LazySeq where java.util.Collection or java.util.List is expected.

Holding unnecessary reference to head

Lazy sequences by definition remember already computed elements. You have to be aware of that, otherwise your sequence (especially an infinite one) will quickly fill up available memory. However, because LazySeq is just a fancy linked list, if you no longer keep a reference to the head (but only to some element in the middle), the head becomes eligible for garbage collection. For example:

    //LazySeq first = seq.take(10);
    seq = seq.drop(10);

The first ten elements are dropped, and we assume nothing else holds a reference to what was previously kept in seq. This makes the first ten elements eligible for garbage collection. However, if we uncomment the first line and keep a reference to the old head in first, the JVM will not release any memory.

Let's put that into perspective. The following piece of code will eventually throw OutOfMemoryError, because the infinite reference keeps holding the beginning of the sequence and therefore all the elements created so far:

    LazySeq<Big> infinite = LazySeq.continually(Big::new);
    for (Big arr : infinite) {
        //
    }

However, by inlining the call to continually() or extracting it to a method, this code works flawlessly (well, it still runs forever, but uses almost no memory):

    private LazySeq<Big> getContinually() {
        return LazySeq.continually(Big::new);
    }

    for (Big arr : getContinually()) {
        //
    }

What's the difference? The for-each loop uses iterators underneath. LazySeqIterator doesn't hold a reference to the old head() when it advances, so if nothing else references that head, it is eligible for garbage collection. This is what javac actually generates when for-each is used:

    for (Iterator<Big> cur = getContinually().iterator(); cur.hasNext(); ) {
        final Big arr = cur.next();
        //...
    }

TL;DR Your sequence grows while being traversed.
If you keep holding one end while the other grows, it will eventually blow up. Just like your first-level cache in Hibernate if you load too much in one transaction. Use only as much as needed.

Converting to plain Java collections

Converting is simple, but dangerous. This is a consequence of the points above. You can convert a lazy sequence to java.util.List by calling toList():

    LazySeq<Integer> even = LazySeq.numbers(0, 2);
    even.take(5).toList();  //[0, 2, 4, 6, 8]

or by using a Collector from Java 8, which has a richer API:

    even.
        stream().
        limit(5).
        collect(Collectors.toSet())   //[4, 6, 0, 2, 8]

But remember that Java collections are finite by definition, so avoid converting lazy sequences to collections explicitly. Note that LazySeq is already a List, and thus an Iterable and a Collection. It also has an efficient LazySeq.iterator(). If you can, simply pass the LazySeq instance directly and it may just work.

Performance, time and space complexity

head() of every sequence (except an empty one) is always computed eagerly, so accessing it is a fast O(1). Computing tail() may take anything from O(1) (if it was already computed) to infinite time. As an example take this valid stream:

    import static com.blogspot.nurkiewicz.lazyseq.LazySeq.cons;
    import static com.blogspot.nurkiewicz.lazyseq.LazySeq.continually;

    LazySeq<Integer> oneAndZeros = cons(
        1,
        () -> continually(0)
    ).
    filter(x -> (x > 0));

It represents 1 followed by an infinite number of 0s. By keeping only positive numbers (x > 0) we get a sequence with the same head, but filtering of the tail is delayed (lazy). However, if we now carelessly call oneAndZeros.tail(), LazySeq will keep computing more and more of this infinite sequence. Since there is no positive element after the initial 1, the operation will run forever, eventually throwing StackOverflowError or OutOfMemoryError (which one is an implementation detail).

However, if you ever reach this state, it's probably a programming bug or misuse of the library. Typically tail() will be close to O(1).
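One defensive habit against the oneAndZeros trap is to cap how many elements a filter may inspect before it runs. The sketch below illustrates the idea with plain java.util.stream rather than LazySeq, so none of it is the library's API; BoundedSearch, positiveAt and maxScan are names invented for this example.

```java
import java.util.Optional;
import java.util.stream.Stream;

/** Illustration only: bounding a search over "1 followed by endless zeros". */
final class BoundedSearch {
    /** 1 followed by an infinite run of zeros, built with java.util.stream instead of LazySeq. */
    static Stream<Integer> oneAndZeros() {
        return Stream.concat(Stream.of(1), Stream.iterate(0, x -> x));
    }

    /** Returns the k-th positive element (0-based) while inspecting at most maxScan elements. */
    static Optional<Integer> positiveAt(int k, int maxScan) {
        return oneAndZeros()
                .limit(maxScan)          // cap the scan BEFORE filtering
                .filter(x -> x > 0)
                .skip(k)
                .findFirst();
    }

    public static void main(String[] args) {
        System.out.println(positiveAt(0, 1_000));   // Optional[1]
        System.out.println(positiveAt(1, 1_000));   // Optional.empty
    }
}
```

Without the limit() call, asking for the second positive element would never return, for the same reason oneAndZeros.tail() never does. The price is an artificial bound, so this fits only where giving up after maxScan elements is an acceptable answer.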
On the other hand, if you have plenty of operations already "stacked", calling tail() will trigger them rapidly one after another, so the run time of tail() depends heavily on your data structure.

Most operations on LazySeq are O(1) since they are lazy. Some operations, like get(n) or drop(n), are O(n) (n is the parameter, not the sequence length). In general, run time will be similar to that of a normal linked list.

Because LazySeq remembers all already computed values in a singly linked list, memory consumption is always O(n), where n is the number of already computed elements.

Troubleshooting

Error invalid target release: 1.8 during maven build

If you see this error message during a maven build:

    [INFO] BUILD FAILURE
    ...
    [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project lazyseq: Fatal error compiling: invalid target release: 1.8 -> [Help 1]

it means you are not compiling with Java 8. Download JDK 8 with lambda support and let maven use it:

    $ export JAVA_HOME=/path/to/jdk8

I get StackOverflowError or program hangs infinitely

When working with LazySeq you sometimes get StackOverflowError or OutOfMemoryError:

    java.lang.StackOverflowError
        at sun.misc.Unsafe.allocateInstance(Native Method)
        at java.lang.invoke.DirectMethodHandle.allocateInstance(DirectMethodHandle.java:426)
        at com.blogspot.nurkiewicz.lazyseq.LazySeq.iterate(LazySeq.java:118)
        at com.blogspot.nurkiewicz.lazyseq.LazySeq.lambda$0(LazySeq.java:118)
        at com.blogspot.nurkiewicz.lazyseq.LazySeq$$Lambda$2.get(Unknown Source)
        at com.blogspot.nurkiewicz.lazyseq.Cons.tail(Cons.java:32)
        at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325)
        at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325)
        at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325)
        at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325)
        at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325)
        at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325)
        at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325)

When working with possibly infinite data structures, care must be taken. Avoid calling operations that must (size(), allMatch(), minBy(), forEach(), reduce(), ...) or can (filter(), distinct(), ...) traverse the whole sequence in order to give correct results. See Pitfalls for more examples and ways to avoid them.

Maturity

Quality

This project was started as an exercise and is not battle-proven. But a healthy suite of 300+ unit tests (3:1 test code/production code ratio) guards quality and functional correctness. I also make sure LazySeq is as lazy as possible by mocking tail functions and verifying they are called as rarely as possible.

Contributions and bug reports

If you find a bug or a missing feature, don't hesitate to open a new ticket or start a pull request. I would also love to see more interesting usages of LazySeq in the wild.

Possible improvements

  • Just like FixedCons is used when the tail is known up-front, consider an IterableCons that wraps an existing Iterable in one node rather than building a FixedCons hierarchy. This can be used for all concat methods.
  • Parallel processing support (implementing spliterator?)

License

This project is released under version 2.0 of the Apache License.
May 15, 2013
by Tomasz Nurkiewicz
· 28,943 Views · 1 Like
Deploy a File Server in the Cloud (WebDav on Windows Azure)
This month, my fellow IT Pro Technical Evangelists and I are authoring a new series of articles on 20 key scenarios with Windows Azure Infrastructure Services. Check out the list of articles here: http://mythoughtsonit.com/2013/05/20-key-scenarios-with-windows-azure-infrastructure-services/ .

Web-based Distributed Authoring and Versioning, or WebDAV, is a set of protocols based on HTTP that allows end users to map a network drive over HTTP and edit content and files stored on the web server. When WebDAV was first offered on Microsoft Server, I evaluated it and decided it did not perform well enough for me. The WebDAV extension to IIS was completely rewritten back in the Server 2008 timeframe and is worth another look.

In this article I will guide you step by step through the process of setting up WebDAV on Server 2012 in a Windows Azure IaaS environment. This will give you a solid-performing file share on the internet over port 80 and the HTTP protocol.

First, you need an Azure account. You can set up a free trial of Azure. Details can be found here: http://mythoughtsonit.com/2013/04/step-by-step-guide-to-setting-up-a-windows-azure-free-trial/

Second, provision a Server 2012 machine.

Third, open port 80 to this new server:

  • In the Azure portal, select your 2012 server and choose the “Endpoints” tab at the top.
  • Click “Add Endpoint” at the bottom of the screen.
  • Enter the endpoint information for port 80 to port 80.
  • Done.

Next we need to install the IIS web server and WebDAV.

Installing WebDAV on IIS 8.0

  • Start Server Manager and go to “Add Roles and Features”.
  • Under Server Roles, add the Web Server (IIS) role.
  • Click through the wizard until you come to the Role Services section. Then find and select “WebDAV Publishing” and “Windows Authentication”.
  • Click Next and then Install.

When the install is finished you are ready to move on to the next section.
Configuring IIS 8 for WebDAV

After the installation finishes you need to configure the box for access.

  • Start the IIS Manager tool.
  • Choose the “Default Web Site” on the left side.
  • Click on “Authentication”.
  • Open the Windows Authentication option and enable it.
  • Open the “WebDAV Authoring Rules”.
  • Create a WebDAV rule. I chose to allow all users access to all content. A better security practice is to limit which users can use the service. It’s your data, so you decide.
  • Make sure WebDAV is enabled and that your access rule is set.

That is it… now you’re ready to access your WebDAV file share!

Test and ensure you can hit the web server by using your browser: because you opened port 80 and installed IIS 8, you should see the default web page when you browse to your server’s internet DNS name. Example: http://yourdomainname.cloudapp.net/

How to map a drive to your WebDAV server

There are two ways I use to connect to the WebDAV server.

How to map a drive to your WebDAV server from the Win 8 GUI:

  • From Windows Explorer, right-click on “Computer” and select “Map a network drive”.
  • Map your network drive by entering the address of your server. Example: http://yourdomainname.cloudapp.net/
  • I selected “Connect using different credentials” because my workstation was not joined to the server in any way and I needed to use an account in the server’s local SAM database.
  • Hit “Finish” and enter your credentials.

Now you will have a connected drive that you can access from Windows Explorer or any tool via the drive mapping.

How to map a drive to your WebDAV server from a cmd box:

1. Hit Windows Start and type: cmd
2. Enter the command: net use [drive letter] [url]

Example: net use e: http://yourdomainname.cloudapp.net/
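Besides opening the site in a browser, a client can ask the server whether it advertises WebDAV at all. The sketch below sends an HTTP OPTIONS request and looks at the DAV response header, which WebDAV servers use to list their compliance classes. WebDavProbe and its method names are made up for this example, and the outcome depends on your configuration: with Windows Authentication enabled, an unauthenticated request may well come back as 401 without a DAV header.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;

/** Illustration only: checks whether a server advertises WebDAV support. */
final class WebDavProbe {
    /** True when a DAV header value such as "1,2,3" lists compliance class 1. */
    static boolean advertisesWebDav(String davHeader) {
        if (davHeader == null) {
            return false;
        }
        return Arrays.stream(davHeader.split(","))
                .map(String::trim)
                .anyMatch("1"::equals);
    }

    /** Sends an OPTIONS request to the URL and inspects the DAV header of the response. */
    static boolean probe(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("OPTIONS");
        conn.setConnectTimeout(5_000);
        conn.setReadTimeout(5_000);
        try {
            conn.getResponseCode();                  // forces the request to be sent
            return advertisesWebDav(conn.getHeaderField("DAV"));
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {                      // e.g. http://yourdomainname.cloudapp.net/
            System.out.println(probe(args[0]) ? "WebDAV advertised" : "No DAV header seen");
        }
    }
}
```

Run it with the server address as the only argument. It makes no network call when no argument is given.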
May 15, 2013
by Brian Lewis
· 15,914 Views
JPA - Querydsl Projections
In my last post, JPA - Basic Projections, I mentioned two basic ways of building JPA projections. This post brings you more examples, this time based on the Querydsl framework. Note that I'm referring to Querydsl version 3.1.1 here.

Reinvented constructor expressions

Take a look at the following code:

    ...
    import static com.blogspot.vardlokkur.domain.QEmployee.employee;

    import javax.persistence.EntityManager;
    import javax.persistence.PersistenceContext;

    import org.springframework.beans.factory.annotation.Autowired;

    import com.blogspot.vardlokkur.domain.EmployeeNameProjection;
    import com.mysema.query.jpa.JPQLTemplates;
    import com.mysema.query.jpa.impl.JPAQuery;
    import com.mysema.query.types.ConstructorExpression;
    ...
    public class ConstructorExpressionExample {
        ...
        @PersistenceContext
        private EntityManager entityManager;

        @Autowired
        private JPQLTemplates jpqlTemplates;

        public void someMethod() {
            ...
            final List<EmployeeNameProjection> projections = new JPAQuery(entityManager, jpqlTemplates)
                    .from(employee)
                    .orderBy(employee.name.asc())
                    .list(ConstructorExpression.create(EmployeeNameProjection.class, employee.employeeId, employee.name));
            ...
        }
        ...
    }

The above Querydsl construction means: create a new JPQL query [1][2], using employee as the data source, order the data by employee name [3], and return the list of EmployeeNameProjection, built using the 2-arg constructor called with employee ID and name [4]. This is very similar to the constructor expressions example from my previous post (JPA - Basic Projections), and leads to the following SQL query:

    select EMPLOYEE_ID, EMPLOYEE_NAME from EMPLOYEE order by EMPLOYEE_NAME asc

As you see above, the main advantage over JPA constructor expressions is that you reference the Java class itself instead of hard-coding its name in the JPQL query.
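Conceptually, a constructor projection boils down to handing each result row to the projection class's constructor. The sketch below shows that idea with plain reflection and no Querydsl or JPA on the classpath. It is not how Querydsl is implemented; NameProjection, ConstructorProjectionSketch and the sample rows are invented for illustration.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a projection class with a 2-arg constructor. */
final class NameProjection {
    final Long employeeId;
    final String name;

    public NameProjection(Long employeeId, String name) {
        this.employeeId = employeeId;
        this.name = name;
    }
}

/** Illustration only: maps raw rows to objects by calling a constructor reflectively. */
final class ConstructorProjectionSketch {
    static <T> List<T> project(Class<T> type, Class<?>[] argTypes, List<Object[]> rows) {
        try {
            // Referencing the class object means a rename is caught by the compiler,
            // unlike a class name embedded in a query string.
            Constructor<T> ctor = type.getDeclaredConstructor(argTypes);
            List<T> result = new ArrayList<>();
            for (Object[] row : rows) {
                result.add(ctor.newInstance(row));   // row values become constructor arguments
            }
            return result;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No usable constructor on " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {1L, "Alice"});        // made-up rows standing in for SQL results
        rows.add(new Object[] {2L, "Bob"});
        List<NameProjection> out =
                project(NameProjection.class, new Class<?>[] {Long.class, String.class}, rows);
        System.out.println(out.get(1).name);         // Bob
    }
}
```

The column order in each row has to match the constructor's parameter order, which is the same contract the 2-arg constructor imposes in the query above.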
Even more reinvented constructor expressions

The Querydsl documentation [4] describes another way of using constructor expressions, which requires the @QueryProjection annotation and the use of a Query Type [1] for the projection; see the example below. Let's start with a modification of the projection class. Note that I added the @QueryProjection annotation on the class constructor.

    package com.blogspot.vardlokkur.domain;

    import java.io.Serializable;

    import javax.annotation.concurrent.Immutable;

    import com.mysema.query.annotations.QueryProjection;

    @Immutable
    public class EmployeeNameProjection implements Serializable {

        private final Long employeeId;

        private final String name;

        @QueryProjection
        public EmployeeNameProjection(Long employeeId, String name) {
            super();
            this.employeeId = employeeId;
            this.name = name;
        }

        public Long getEmployeeId() {
            return employeeId;
        }

        public String getName() {
            return name;
        }
    }

Now we may use the modified projection class (and the corresponding Query Type [1]) in the following way:

    ...
    import static com.blogspot.vardlokkur.domain.QEmployee.employee;

    import javax.persistence.EntityManager;
    import javax.persistence.PersistenceContext;

    import org.springframework.beans.factory.annotation.Autowired;

    import com.blogspot.vardlokkur.domain.EmployeeNameProjection;
    import com.blogspot.vardlokkur.domain.QEmployeeNameProjection;
    import com.mysema.query.jpa.JPQLTemplates;
    import com.mysema.query.jpa.impl.JPAQuery;
    ...
    public class ConstructorExpressionExample {
        ...
        @PersistenceContext
        private EntityManager entityManager;

        @Autowired
        private JPQLTemplates jpqlTemplates;

        public void someMethod() {
            ...
            final List<EmployeeNameProjection> projections = new JPAQuery(entityManager, jpqlTemplates)
                    .from(employee)
                    .orderBy(employee.name.asc())
                    .list(new QEmployeeNameProjection(employee.employeeId, employee.name));
            ...
        }
        ...
    }

Which leads to the SQL query:

    select EMPLOYEE_ID, EMPLOYEE_NAME from EMPLOYEE order by EMPLOYEE_NAME asc

In fact, when you take a closer look at the Query Type [1] generated for EmployeeNameProjection (QEmployeeNameProjection), you will see it is a kind of "shortcut" for creating the constructor expression described in the first section of this post.

Mapping projection

Querydsl provides another way of building projections, using factories based on MappingProjection.

    package com.blogspot.vardlokkur.domain;

    import static com.blogspot.vardlokkur.domain.QEmployee.employee;

    import com.mysema.query.Tuple;
    import com.mysema.query.types.MappingProjection;

    public class EmployeeNameProjectionFactory extends MappingProjection<EmployeeNameProjection> {

        public EmployeeNameProjectionFactory() {
            super(EmployeeNameProjection.class, employee.employeeId, employee.name);
        }

        @Override
        protected EmployeeNameProjection map(Tuple row) {
            return new EmployeeNameProjection(row.get(employee.employeeId), row.get(employee.name));
        }
    }

The above class is a simple factory creating EmployeeNameProjection instances using employee ID and name. Note that the factory constructor defines which employee properties will be used for building the projection, and the map method defines how the instances will be created. Below you may find an example of using the factory:

    ...
    import static com.blogspot.vardlokkur.domain.QEmployee.employee;

    import javax.persistence.EntityManager;
    import javax.persistence.PersistenceContext;

    import org.springframework.beans.factory.annotation.Autowired;

    import com.blogspot.vardlokkur.domain.EmployeeNameProjection;
    import com.blogspot.vardlokkur.domain.EmployeeNameProjectionFactory;
    import com.mysema.query.jpa.JPQLTemplates;
    import com.mysema.query.jpa.impl.JPAQuery;
    ...
    public class MappingProjectionExample {
        ...
        @PersistenceContext
        private EntityManager entityManager;

        @Autowired
        private JPQLTemplates jpqlTemplates;

        public void someMethod() {
            ...
            final List<EmployeeNameProjection> projections = new JPAQuery(entityManager, jpqlTemplates)
                    .from(employee)
                    .orderBy(employee.name.asc())
                    .list(new EmployeeNameProjectionFactory());
            ....
        }
        ...
    }

As you see, the one and only difference here, compared to the constructor expression examples, is the list method call. The above example again leads to the very simple SQL query:

    select EMPLOYEE_ID, EMPLOYEE_NAME from EMPLOYEE order by EMPLOYEE_NAME asc

Building projections this way is much more powerful, and doesn't require the existence of an n-arg projection constructor.

QBean based projection (JavaBeans strike again)

There is at least one more way of creating a projection with Querydsl, based on QBean. In this case we build the result list using:

    ...
    .list(Projections.bean(EmployeeNameProjection.class, employee.employeeId, employee.name))

This approach requires the EmployeeNameProjection class to follow JavaBean conventions, which is not always desired in an application. Use it if you want, but you have been warned ;)

A few links for dessert

[1] Using Query Types
[2] Querying
[3] Ordering
[4] Constructor projections
May 15, 2013
by Michal Jastak
· 39,421 Views · 3 Likes
Getting Started with Active Directory Lightweight Directory Services
Introduction

In preparation for some upcoming posts related to LINQ (what else?), Windows PowerShell and Rx, I had to set up a local LDAP-capable directory service. (Hint: it will pay off to read till the very end of the post if you’re wondering what I’m up to...) In this post I’ll walk the reader through the installation, configuration and use of Active Directory Lightweight Directory Services (LDS), formerly known as Active Directory Application Mode (ADAM). Having used the technology several years ago, in relation to the LINQ to Active Directory project (which as an extension to this blog series will receive an update), it was a warm and welcome reencounter.

What’s Lightweight Directory Services anyway?

Use of hierarchical storage and auxiliary services provided by technologies like Active Directory often has advantages over alternative designs, e.g. using a relational database. For example, user accounts may be stored in a directory service for an application to make use of. While Active Directory seems the natural habitat to store (and replicate, secure, etc.) additional user information, IT admins will likely point you – the poor developer – at the door when asking to extend the schema. That’s one of the places where LDS comes in, offering the ability to take advantage of the programming model of directory services while keeping your hands off “the one and only AD schema”.

The LDS website quotes other use cases, which I’ll just copy here verbatim:

Active Directory Lightweight Directory Service (AD LDS), formerly known as Active Directory Application Mode, can be used to provide directory services for directory-enabled applications. Instead of using your organization’s AD DS database to store the directory-enabled application data, AD LDS can be used to store the data.
AD LDS can be used in conjunction with AD DS so that you can have a central location for security accounts (AD DS) and another location to support the application configuration and directory data (AD LDS). Using AD LDS, you can reduce the overhead associated with Active Directory replication, you do not have to extend the Active Directory schema to support the application, and you can partition the directory structure so that the AD LDS service is only deployed to the servers that need to support the directory-enabled application.

  • Install from media generation. The ability to create installation media for AD LDS by using ntdsutil.exe or dsdbutil.exe.
  • Auditing. Auditing of changed values within the directory service.
  • Database mounting tool. Gives you the ability to view data within snapshots of the database files.
  • Active Directory Sites and Services support. Gives you the ability to use Active Directory Sites and Services to manage the replication of the AD LDS data changes.
  • Dynamic list of LDIF files. With this feature, you can associate custom LDIF files with the existing default LDIF files used for setup of AD LDS on a server.
  • Recursive linked-attribute queries. LDAP queries can follow nested attribute links to determine additional attribute properties, such as group memberships.

Obviously that last bullet point grabs my attention, though I will restrain myself from digressing here.

Getting started

If you’re running Windows 7, the following explanation is the right one for you. For older versions of the operating system, things are pretty similar, though different downloads will have to be used. For Windows Server 2008, a server role exists for LDS. So, assuming you’re on Windows 7, start by downloading the installation media over here .
After installing this, you should find an entry “Active Directory Lightweight Directory Services Setup Wizard” under the “Administrative Tools” section in “Control Panel”.

LDS allows you to install multiple instances of directory services on the same machine, just like SQL Server allows multiple server instances to co-exist. Each instance has a name and listens on certain ports using the LDAP protocol. Starting this wizard, which lives under %systemroot%\adam\adaminstall.exe (revealing the former product name), opens its welcome page.

After clicking Next, we need to decide whether to create a new unique instance that has no ties to existing instances, or a replica of an existing instance. For our purposes, the first option is what we need.

Next, we’re asked for an instance name. The instance name will be used for the creation of a Windows service, as well as to store some settings. Each instance will get its own Windows service. In our sample, we’ll create a directory for the Northwind Employees table, which we’ll use to create accounts further on.

We’re almost there with the baseline configuration. The next step is to specify a port number, both for plain TCP and for SSL-encrypted traffic. The default ports, 389 and 636, are fine for us. Later we’ll be able to connect to the instance over LDAP on port 389, e.g. using the System.DirectoryServices namespace functionality in .NET. Notice that every instance of LDS should have its own port number, so only one can be using the default port numbers.

Now that we have completed the “physical administration”, the wizard moves on to a bit of “logical administration”. More specifically, we’re given the option to create a directory partition for the application. Here we choose to create such a partition, though in many concrete deployment scenarios you’ll want the application’s setup to create this at runtime.
Our partition’s distinguished name will mimic a “Northwind.local” domain containing a partition called “Employees”.

After this bit of logical administration, some more physical configuration has to be carried out, specifying the data files location and the account to run the services under. For both, the default settings are fine. Also, the administrative account assigned to manage the LDS instance can be kept as the currently logged-in user, unless you feel the need to change this in your scenario.

Finally, we’ve arrived at an interesting step where we’re given the option to import LDIF files. An LDIF file, with extension .ldf, contains the definition of a class that can be added to a directory service’s schema. Basically those contain things like attributes and their types. Under the %systemroot%\adam folder, a set of out-of-the-box .ldf files can be found.

Instead of having to run the ldifde.exe tool, the wizard gives us the option to import LDIF files directly. Those classes are documented in various places, such as RFC2798 for inetOrgPerson. On TechNet, information is presented in a more structured manner, e.g. revealing that inetOrgPerson is a subclass of user. Custom classes can be defined and imported after setup has completed. In this post, we won’t extend the schema ourselves but will simply be using the built-in user class, so let’s tick that one.

After clicking Next, we get a last chance to revisit our settings or can confirm the installation. At this point, the wizard will create the instance – setting up the service – and import the LDIF files.

Congratulations! Your first LDS instance has materialized. If everything went all right, the NorthwindEmployees service should show up in the list of Windows services.

Inspecting the directory

To inspect the newly created directory instance, a bunch of tools exist. One is ADSI Edit, which you could already see in the Administrative Tools.
To set it up, open the MMC-based tool and go to Action, Connect to… In the dialog that appears, specify the server name and choose Schema as the naming context. For example, if you want to inspect the user class, simply navigate to the Schema node in the tree and show the properties of the User entry.

To visualize the objects in the application partition, connect using the distinguished name specified during the installation. Now it’s possible to create a new object in the directory using the context menu in the content pane.

After specifying the class, we get to specify the “CN” name (for common name) of the object. In this case, I’ll use my full name. We can also set additional attributes (for example, using “physicalDeliveryOfficeName” to specify the office number of the user). After clicking Set, closing the attributes dialog and clicking Finish to create the object, we see it pop up in the items view of the ADSI Edit snap-in.

Programmatic population of the directory

Obviously we’re much more interested in a programmatic way to work with directory services. .NET supports the use of directory services and related protocols (LDAP in particular) through the System.DirectoryServices namespace. In a plain new console application, add a reference to the assembly with the same name (don’t bother about the other assemblies that deal with account management and protocol stuff).

For this sample, I’ll also assume the reader has a Northwind SQL database sitting somewhere and knows how to get data out of its Employees table as rich objects, for example by using the LINQ to SQL designer.

We’ll just import a few details about the users; it’s left to the reader to map other properties onto attributes using the documentation about the user directory services class .
Just a few lines of code suffice to accomplish the task (assuming the System.DirectoryServices namespace is imported):

    static void Main()
    {
        var path = "LDAP://bartde-hp07/CN=Employees,DC=Northwind,DC=local";
        var root = new DirectoryEntry(path);

        var ctx = new NorthwindDataContext();
        foreach (var e in ctx.Employees)
        {
            var cn = "CN=" + e.FirstName + e.LastName;

            var u = root.Children.Add(cn, "user");
            u.Properties["employeeID"].Value = e.EmployeeID;
            u.Properties["sn"].Value = e.LastName;
            u.Properties["givenName"].Value = e.FirstName;
            u.Properties["comment"].Value = e.Notes;
            u.Properties["homePhone"].Value = e.HomePhone;
            u.Properties["photo"].Value = e.Photo.ToArray();
            u.CommitChanges();
        }
    }

After running this code (obviously changing the LDAP path to reflect your setup), you should see the new user objects in ADSI Edit after hitting refresh.

Now it’s just plain easy to write an application that visualizes the employees with their data. We’ll leave that to the UI-savvy reader (just to tease that segment of my audience, I’ve also imported the employee’s photo as a byte array).

A small preview of what’s coming up

To whet the reader’s appetite about next episodes on this blog, the original post closes with a single screenshot illustrating something (IMHO) rather cool; the use of LINQ to Active Directory in it is just an implementation detail.

Note: What’s shown here is the result of a very early experiment done as part of my current job on “LINQ to Anything” here in the “Cloud Data Programmability Team”. Please don’t fantasize about it as being a vNext feature of any product involved whatsoever. The core intent of those experiments is to emphasize the omnipresence of LINQ (and more widely, monads) in today’s (and tomorrow’s) world. While we’re not ready to reveal the “LINQ to Anything” mission in all its glory (rather think of it as “LINQ to the unimaginable”), we can drop some hints. Stay tuned for more!
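For readers on the JVM, the same population step can be approached through JNDI. The sketch below only prepares the data and never contacts a server: it composes the distinguished name of a user entry under the application partition and collects a few of the attributes set in the loop above. EmployeeEntrySketch and its method names are invented for this example, and the phone number is a made-up sample value. Actually writing the entry would additionally need a live connection, for example through an InitialDirContext and its createSubcontext(name, attributes) method, which is not shown here.

```java
import javax.naming.InvalidNameException;
import javax.naming.directory.Attributes;
import javax.naming.directory.BasicAttributes;
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

/** Illustration only: prepares the DN and attributes for a user entry, without any network access. */
final class EmployeeEntrySketch {
    /** Builds the DN of a user entry directly below the application partition. */
    static LdapName userDn(String partitionDn, String firstName, String lastName) {
        try {
            LdapName dn = new LdapName(partitionDn);
            dn.add(new Rdn("CN", firstName + lastName));   // Rdn escapes special characters in the value
            return dn;
        } catch (InvalidNameException e) {
            throw new IllegalArgumentException("Malformed distinguished name: " + partitionDn, e);
        }
    }

    /** Collects a subset of the attributes that the post sets on each user object. */
    static Attributes userAttributes(String firstName, String lastName, String homePhone) {
        Attributes attrs = new BasicAttributes(true);      // true = attribute IDs are case-insensitive
        attrs.put("objectClass", "user");
        attrs.put("sn", lastName);
        attrs.put("givenName", firstName);
        attrs.put("homePhone", homePhone);
        return attrs;
    }

    public static void main(String[] args) {
        LdapName dn = userDn("CN=Employees,DC=Northwind,DC=local", "Nancy", "Davolio");
        System.out.println(dn);
        System.out.println(userAttributes("Nancy", "Davolio", "555-0100").size());   // 4
    }
}
```

Building the name from Rdn objects instead of concatenating strings keeps names containing commas or other special characters valid, which the plain "CN=" + first + last concatenation does not guarantee.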
May 11, 2013
by Bart De Smet
· 26,281 Views · 1 Like