DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest Languages Topics

article thumbnail
Storing Objects in Android
One alternative to using SQLite on Android is to store Java objects in SharedPreferences.
December 19, 2013
by Tony Siciliani
· 47,659 Views · 1 Like
article thumbnail
Handling Big Data with HBase Part 4: The Java API
Editor's note: Be sure to check out part 2 as well. This is the fourth of an introductory series of blogs on Apache HBase. In the third part, we saw a high level view of HBase architecture . In this part, we'll use the HBase Java API to create tables, insert new data, and retrieve data by row key. We'll also see how to setup a basic table scan which restricts the columns retrieved and also uses a filter to page the results. Having just learned about HBase high-level architecture, now let's look at the Java client API since it is the way your applications interact with HBase. As mentioned earlier you can also interact with HBase via several flavors of RPC technologies like Apache Thrift plus a REST gateway, but we're going to concentrate on the native Java API. The client APIs provide both DDL (data definition language) and DML (data manipulation language) semantics very much like what you find in SQL for relational databases. Suppose we are going to store information about people in HBase, and we want to start by creating a new table. The following listing shows how to create a new table using the HBaseAdmin class. Configuration conf = HBaseConfiguration.create(); HBaseAdmin admin = new HBaseAdmin(conf); HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("people")); tableDescriptor.addFamily(new HColumnDescriptor("personal")); tableDescriptor.addFamily(new HColumnDescriptor("contactinfo")); tableDescriptor.addFamily(new HColumnDescriptor("creditcard")); admin.createTable(tableDescriptor); The people table defined in preceding listing contains three column families: personal, contactinfo, and creditcard. To create a table you create an HTableDescriptor and add one or more column families by adding HColumnDescriptor objects. You then call createTable to create the table. Now we have a table, so let's add some data. The next listing shows how to use the Put class to insert data on John Doe, specifically his name and email address (omitting proper error handling for brevity). Configuration conf = HBaseConfiguration.create(); HTable table = new HTable(conf, "people"); Put put = new Put(Bytes.toBytes("doe-john-m-12345")); put.add(Bytes.toBytes("personal"), Bytes.toBytes("givenName"), Bytes.toBytes("John")); put.add(Bytes.toBytes("personal"), Bytes.toBytes("mi"), Bytes.toBytes("M")); put.add(Bytes.toBytes("personal"), Bytes.toBytes("surame"), Bytes.toBytes("Doe")); put.add(Bytes.toBytes("contactinfo"), Bytes.toBytes("email"), Bytes.toBytes("[email protected]")); table.put(put); table.flushCommits(); table.close(); In the above listing we instantiate a Put providing the unique row key to the constructor. We then add values, which must include the column family, column qualifier, and the value all as byte arrays. As you probably noticed, the HBase API's utility Bytes class is used a lot; it provides methods to convert to and from byte[] for primitive types and strings. (Adding a static import for the toBytes() method would cut out a lot of boilerplate code.) We then put the data into the table, flush the commits to ensure locally buffered changes take effect, and finally close the table. Updating data is also done via the Put class in exactly the same manner as just shown in the prior listing. Unlike relational databases in which updates must update entire rows even if only one column changed, if you only need to update a single column then that's all you specify in the Put and HBase will only update that column. There is also a checkAndPut operation which is essentially a form of optimistic concurrency control - the operation will only put the new data if the current values are what the client says they should be. Retrieving the row we just created is accomplished using the Get class, as shown in the next listing. (From this point forward, listings will omit the boilerplate code to create a configuration, instantiate the HTable, and the flush and close calls.) Get get = new Get(Bytes.toBytes("doe-john-m-12345")); get.addFamily(Bytes.toBytes("personal")); get.setMaxVersions(3); Result result = table.get(get); The code in the previous listing instantiates a Get instance supplying the row key we want to find. Next we use addFamily to instruct HBase that we only need data from the personal column family, which also cuts down the amount of work HBase must do when reading information from disk. We also specify that we'd like up to three versions of each column in our result, perhaps so we can list historical values of each column. Finally, calling get returns a Result instance which can then be used to inspect all the column values returned. In many cases you need to find more than one row. HBase lets you do this by scanning rows, as shown in the second part which showed using a scan in the HBase shell session. The corresponding class is the Scan class. You can specify various options, such as the start and ending row key to scan, which columns and column families to include and the maximum versions to retrieve. You can also add filters, which allow you to implement custom filtering logic to further restrict which rows and columns are returned. A common use case for filters is pagination. For example, we might want to scan through all people whose last name is Smith one page (e.g. 25 people) at a time. The next listing shows how to perform a basic scan. Scan scan = new Scan(Bytes.toBytes("smith-")); scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("givenName")); scan.addColumn(Bytes.toBytes("contactinfo"), Bytes.toBytes("email")); scan.setFilter(new PageFilter(25)); ResultScanner scanner = table.getScanner(scan); for (Result result : scanner) { // ... } In the above listing we create a new Scan that starts from the row key smith- and we then use addColumn to restrict the columns returned (thus reducing the amount of disk transfer HBase must perform) to personal:givenName and contactinfo:email. A PageFilter is set on the scan to limit the number of rows scanned to 25. (An alternative to using the page filter would be to specify a stop row key when constructing the Scan.) We then get a ResultScanner for the Scan just created, and loop through the results performing whatever actions are necessary. Since the only method in HBase to retrieve multiple rows of data is scanning by sorted row keys, how you design the row key values is very important. We'll come back to this topic later. You can also delete data in HBase using the Delete class, analogous to the Put class to delete all columns in a row (thus deleting the row itself), delete column families, delete columns, or some combination of those. Connection Handling In the above examples not much attention was paid to connection handling and RPCs (remote procedure calls). HBase provides the HConnection class which provides functionality similar to connection pool classes to share connections, for example you use the getTable() method to get a reference to an HTable instance. There is also an HConnectionManager class which is how you get instances of HConnection. Similar to avoiding network round trips in web applications, effectively managing the number of RPCs and amount of data returned when using HBase is important, and something to consider when writing HBase applications. Conclusion to Part 4 In this part we used the HBase Java API to create a people table, insert a new person, and find the newly inserted person information. We also used the Scan class to scan the people table for people with last name "Smith" and showed how to restrict the data retrieved and finally how to use a filter to limit the number of results. In the next part, we'll learn how to deal with the absence of SQL and relations when modeling schemas in HBase. References HBase web site, http://hbase.apache.org/ HBase wiki, http://wiki.apache.org/hadoop/Hbase HBase Reference Guide http://hbase.apache.org/book/book.html HBase: The Definitive Guide, http://bit.ly/hbase-definitive-guide Google Bigtable Paper, http://labs.google.com/papers/bigtable.html Hadoop web site, http://hadoop.apache.org/ Hadoop: The Definitive Guide, http://bit.ly/hadoop-definitive-guide Fallacies of Distributed Computing, http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing HBase lightning talk slides, http://www.slideshare.net/scottleber/hbase-lightningtalk Sample code, https://github.com/sleberknight/basic-hbase-examples
December 18, 2013
by Scott Leberknight
· 56,802 Views · 3 Likes
article thumbnail
Apache Lucene: Fast Range Faceting Using Segment Trees and the Java ASM Library
In Lucene's facet module we recently added support for dynamic range faceting, to show how many hits match each of a dynamic set of ranges. For example, the Updated drill-down in the Lucene/Solr issue search application uses range facets. Another example is distance facets (< 1 km, < 2 km, etc.), where the distance is dynamically computed based on the user's current location. Price faceting might also use range facets, if the ranges cannot be established during indexing. To implement range faceting, for each hit, we first calculate the value (the distance, the age, the price) to be aggregated, and then lookup which ranges match that value and increment its counts. Today we use a simple linear search through all ranges, which has O(N) cost, where N is the number of ranges. But this is inefficient! Segment Trees There are fun data structures like segment trees and interval trees with O(log(N) + M) cost per lookup, where M is the number of ranges that match the given value. I chose to explore segment trees, as Lucene only requires looking up by a single value (interval trees can also efficiently look up all ranges overlapping a provided range) and also because all the ranges are known up front (interval trees also support dynamically adding or removing ranges). If the ranges will never overlap, you can use a simple binary search; Guava's ImmutableRangeSet takes this approach. However, Lucene's range faceting allows overlapping ranges so we can't do that. Segment trees are simple to visualize: you "project" all ranges down on top of one another, creating a one-dimensional Venn diagram, to define the elementary intervals. This classifies the entire range of numbers into a minimal number of distinct ranges, each elementary interval, such that all points in each elementary interval always match the same set of input ranges. The lookup process is then a binary search to determine which elementary interval a point belongs to, recording the matched ranges as you recurse down the tree. Consider these ranges; the lower number is inclusive and the upper number is exclusive: 0: 0 – 10 1: 0 – 20 2: 10 – 30 3: 15 – 50 4: 40 – 70 The elementary intervals (think Venn diagram!) are: -∞ – 0 0 – 10 10 – 15 15 – 20 20 – 30 30 – 40 40 – 50 50 – 70 70 – ∞ Finally, you build a binary tree on top of the elementary ranges, and then add output range indices to both internal nodes and the leaves of that tree, necessary to prevent adversarial cases that would require too much (O(N^2)) space. During lookup, as you walk down the tree, you gather up the output ranges (indices) you encounter; for our example, each elementary range is assigned the follow range indices as outputs: -∞ – 0 → 0 – 10 → 0 10 – 15 → 1, 2 15 – 20 → 1, 2, 3 20 – 30 → 2, 3 30 – 40 → 3 40 – 50 → 3, 4 50 – 70 → 4 70 – ∞ → Some ranges correspond to 1 elementary interval, while other ranges correspond to 2 or 3 or more, in general. Some, 2 in this example, may have no matching input ranges. Looking Up Matched Ranges I've pushed all sources described below to new Google code project; the code is still somewhat rough and exploratory, so there are likely exciting bugs lurking, but it does seem to work: it includes (passing!) tests and simple micro-benchmarks. I started with a basic segment tree implementation as described on the Wikipedia page, for long values, called SimpleLongRangeMultiSet; here's the recursive lookup method: private int lookup(Node node, long v, int[] answers, int upto) { if (node.outputs != null) { for(int range : node.outputs) { answers[upto++] = range; } } if (node.left != null) { if (v <= node.left.end) { upto = lookup(node.left, v, answers, upto); } else { upto = lookup(node.right, v, answers, upto); } } return upto; } This worked correctly, but I realized there must be non-trivial overhead for the recursion, checking for nulls, the for loop over the output values, etc. Next, I tried switching to parallel arrays to hold the binary tree (ArrayLongRangeMultiSet), where the left child of node N is at 2*N and the right child is at 2*N+1, but this turned out to be slower. After that I tested a code specializing implementation, first by creating dynamic Java source code from the binary tree. This eliminates the recursion and creates a single simple method that uses a series of if statements, specialized to the specific ranges, to do the binary search and record the range indices. Here's the resulting specialized code, compiled from the above ranges: void lookup(long v, int[] answers) { int upto = 0; if (v <= 19) { if (v <= 9) { if (v >= 0) { answers[upto++] = 0; answers[upto++] = 1; } } else { answers[upto++] = 1; answers[upto++] = 2; if (v >= 15) { answers[upto++] = 3; } } } else { if (v <= 39) { answers[upto++] = 3; if (v <= 29) { answers[upto++] = 2; } } else { if (v <= 49) { answers[upto++] = 3; answers[upto++] = 4; } else { if (v <= 69) { answers[upto++] = 4; } } } } } Finally, using the ASM library, I compiled the tree directly to specialized Java bytecode, and this proved to be fastest (up to 2.5X faster in some cases). As a baseline, I also added the simple linear search method, LinearLongRangeMultiSet; as long as you don't have too many ranges (say 10 or less), its performance is often better than the Java segment tree. The implementation also allows you to specify the allowed range of input values (for example, maybe all values are >=0 in your usage), which can save an if statement or two in the lookup method. Counting All Matched Ranges While the segment tree allows you to quickly look up all matching ranges for a given value, after a nice tip from fellow Lucene committee Robert Muir, we realized Lucene's range faceting does not need to know the ranges for each value; instead, it only requires the aggregated counts for each range in the end, after seeing many values. This leads to an optimization: compute counts for each elementary interval and then in the end, roll up those counts to get the count for each range. This will only work for single-valued fields, since for a multi-valued field you'd need to carefully never increment the same range more than once per hit. So based on that approach, I created a new LongRangeCounter abstract base class, and the SimpleLongRangeCounter Java implementation, and also the ASM specialized version, and the results are indeed faster (~20 to 50%) than using the lookup method to count; I'll use this approach with Lucene. Segment trees are normally always "perfectly" balanced but one final twist I explored was to use a training set of values to bias the order of the if statements. For example, if your ranges cover a tiny portion of the search space, as is the case for the Updated drill-down, then it should be faster to use a slightly unbalanced tree, by first checking if the value is less than the maximum range. However, in testing, while there are some cases where this "training" is a bit faster, often it's slower; I'm not sure why. Lucene I haven't folded this into Lucene yet, but I plan to; all the exploratory code lives in the segment-trees Google code project for now. Results on the micro-benchmarks can be entirely different once the implementations are folded into a "real" search application. While ASM is a powerful way to generate specialized code, and it gives sizable performance gains at least in the micro-benchmarks, it is an added dependency and complexity for ongoing development and many more developers know Java than ASM. It may also confuse hotspot, causing deoptimizations when there are multiple implementations for an abstract base class. Furthermore, if there are many ranges, the resulting specialized bytecode can be become quite large (but, still O(N*log(N)) in size), which may cause other problems. On balance I'm not sure the sizable performance gains (on a micro-benchmark) warrant using ASM in Lucene's range faceting.
December 17, 2013
by Michael Mccandless
· 10,322 Views
article thumbnail
Using the DataContractSerializer to Serialize and Deserialize
To serialize a Business object marked with [DataContract] and properties marked with [DataMember] attributes… public static string DataContractSerializeObject(T objectToSerialize) { using (var output = new StringWriter()) { using (var writer = new XmlTextWriter(output) { Formatting = Formatting.Indented }) { var dataContractSerializer = new DataContractSerializer(typeof(T), EntityUtilities.GetAllKnownTypes(), int.MaxValue, true, true, null); dataContractSerializer.WriteObject(writer, objectToSerialize); return output.GetStringBuilder().ToString(); } } } Then to deserialize back to your DataContract type, use this logic… public static T Deserialize(string xml) { using (Stream stream = new MemoryStream()) { byte[] data = System.Text.Encoding.UTF8.GetBytes(xml); stream.Write(data, 0, data.Length); stream.Position = 0; var dataContractSerializer = new DataContractSerializer(typeof(T), EntityUtilities.GetAllKnownTypes(), int.MaxValue, true, true, null); return (T)dataContractSerializer.ReadObject(stream); } }
December 14, 2013
by Merrick Chaffer
· 25,949 Views
article thumbnail
Top 24 Java-Based Content Management Systems
CMS, or content management systems, are platforms for managing and administering website content. There is no denying that CMSes are important in today's web ecosystem. These content management systems not only provide an easy way to build and maintain websites, but they also lend a helping hand in updating and editing website content without the need to spend hours or days writing and altering codes and scripts. Some of the leading CMSes are PHP-based, Ruby on Rails-based, ASP.NET-based, and Java-based. Among these, due to scalability, modernized architecture and open-source standards of a few, Java-based CMSs are getting quite a lot of attention lately, especially for enterprise websites, because of the scalable, modern, open source technology behind most of them. There are plenty of CMS tools based on Java to help developers create multi-lingual and multi-channel websites. But how do we decide on the best one for our use case? In this article, we’re going to explore the top 24 content management systems based on Java. Let’s have a look at each of them in detail: 1. Alfresco : Alfresco is one of the top open-source content management systems of Java. It comes with enterprise repository and portlet capabilities along with document management, collaboration, records management, knowledge management, web content management, imaging, and a lot more. Alfresco has a modular architecture and enables end users to efficiently manage websites across the cloud, mobile, hybrid and on-premise environments using open source Java technologies, such as Spring, Hibernate, Lucene and JSF. 2. Magnolia : Magnolia is a well-documented, easy to use, enterprise-grade open source CMS based on the Java Content Repository Standard. It is a highly popular CMS due to its out-of-the-box functionality and ease of use under an open source license. Moreover, Magnolia supports unique content delivery capabilities in a search-engine optimized manner and also follows W3C standards. Magnolia CMS has been deployed by enterprises and governments in more than 100 countries across the world. Here's a case study on Magnolia-based website development 3. LogicalDOC : Though less known than other software such as Alfresco, LogicalDOC is emerging as a powerful and more affordable alternative. With primary focus on Document Management, it offers very interesting content management, knowledge management and collaboration features, and all this in a really efficient way. A peculiar aspect of the interface is the use of Google GWT , this makes the user interface very responsive while the data transfer with the server is minimum. Also the availability of Free Apps for Android and Apple devices (iPhone and iPad) is an interesting feature. 4. Asbru: Asbru is another powerful, fully-featured, easy to use content management system with database-driven capabilities. It is built on the Spring framework with integrated community, databases, eCommerce and statistics modules, which helps developers to create, publish and manage rich and user-friendly internet, extranet and intranet websites on the go. Available in various editions, Asbru provides users with a simple, user-friendly platform to manage websites along with a host of other benefits and features such as custom templates and data, password protected content, multi-lingual content, communities, eCommerce and website analytics, a cutting-edge WYSIWYG content editor and a lot more. 5. OpenCMS : OpenCMS is based on Java and XML technology that allows you to build highly customizable and interactive websites and portals. It comes integrated with a WYSIWYG editor and fully-featured Template Engine which is fully compliant with W3C standards. OpenCMS can be deployed both in an open-source environment (Linux, Apache, Tomcat, MySQL) as well as a commercial environment (Windows NT, IIS, BEA Weblogic, Oracle) 6. Walrus: Walrus is yet another Spring-based CMS that provides unique and effective content management capabilities with a smart administrative interface and drag-and-drop facilities. Easy-to-setup and undo/redo features make Walrus a highly preferred and suitable CMS for government and non-profit enterprises. 7. Pulse : Pulse is a Java-based framework and portal solution that offers easy-to-use and extensible patterns for creating rich browser web applications and responsive websites. It brings a bunch of innovative and powerful components including content management, web shops, user management and more. A few of its key features include a WebDAV based virtual file system for digital asset management, mature user and role management, built-in internationalization, and more. 8. MeshCMS: MeshCMS is an easy to use online editing system written in Java. It comes with a host of features that you will find in any ideal content management system however, it uses a conventional approach in managing and editing website content. It is considered one of the fastest CMSes for editing files online, managing files, and building some very common components like menus, breadcrumbs, mail forms and so on. MeshCMS is accompanied by cross-browser capabilities, a WYSIWYG editor, hot-linking prevention, and tag-library that makes content management an interesting affair. 9. Liferay: Liferay is one of the most popular CMSes based on Java, and is recommended by many industry experts. It comes with awesome features that can make your content management tasks simple. Liferay is a very popular for developing personal as well as professional websites with ease. 10. DotCMS: DotCMS is a next-gen enterprise CMS that wears an open-source hat. It is highly popular and widely used CMS due to its open APIs, extensible and scalable architecture that it used to create personalized and engaging websites, intranets, extranets and applications with ease. 11. Jease: Jease highly known as ‘Java with ease’ is another open source content management system that is built on popular Java technologies like db40, Perst, Lucence and ZK. It is an extremely lightweight CMS with excellent Ajax interface. Due to its intuitive and interactive interface, it is highly simple and easy to customize and deploy websites in Jease even for inexperienced Java developers. 12. Hippo: Hippo is again a powerful open-source CMS made in Java that features enterprise level capabilities that helps in delivering personalized websites and channels. Hippo outlines its competitor by delivering outstanding customer experience through innovative solutions. Hippo has come a long way since 1999 serving medium to large organizations by offering a personalized multichannel content distribution platform including website, mobile, tablet, extranets and intranets. Its major version update was in December 2012 and since then it is seeing minor updates every couple of months. 13. Apache Lenya: Apache Lenya is another open-source Java CMS that features revision control , multisite management, scheduling, search, WYSIWYG editors, and workflow which makes website development and management quite interesting and easy for developers. Available in a variety of languages, Apache Lenya is highly preferred CMS among enterprises that desire to develop multi-lingual websites. 14. Contelligent: Conteligent is another smart CMS solution offered under Java technology stack. It is fully compliant with J2EE and offers great solution for creating and managing personalized websites. 15. InfoGlue: InfoGlue again is a Java-based CMS that is known for its advanced, scalable and robust open-source architecture. It is a highly flexible CMS built on JSR-168 and comes with full multi-language support, excellent information reuse and high integration capabilities. 16. OpenEdit: OpenEdit CMS is a dynamic tool for managing website content with online editing capabilities. Built in open-source architecture, OpenEdit provides facilities like user manager, file manager, version control and notification tools for managing media-rich websites. OpenCMS features enterprise grade plugins such as eCommerce, Content Management, Blog, Events Calendar, Social Networking Tools and more. 17. AtLeap: Atleap is a multi-lingual CMS based on Java which offers amazing content delivery assistance with SEO and full text search functionalities. AtLeap, a product of Blandware, is not only a CMS but a highly robust framework for developing website and web applications 18. Weceem: Weceem is yet another open source content management system, unlikely other CMS it is built upon well-known Java framework grails, spring and Java itself. Weceem has garner positive reviews and is an ideal CMS when it comes to grails, but faces tough competition in best Java CMS category. I came across a LinkedIn discussion which was enough for me to put this CMS in the Best Java CMS list. 19. Nuxeo: Nuxeno is a powerful open source CMS built on Java-based architecture. It offers solutions related to document management, case management and digital asset management. It is free from licensing free but do costs you when reach out for support and maintenance help. It has strong groups of customers including Electronic Arts, U.S. Navy and as stated on the company website, it’s been used in over 145 countries across thousands of organizations. 20. XperienCentral: Xperien central is currently the only CMS that offers unique content to a visitor as per his earlier journey, so you can tailor the content to increase the conversion. It offers multi-channel content delivery across website, mobile social media channels and applications. It is built on Java and hence it is extremely scalable and agile. 21 Atex: Atex is a web CMS that uses polopoly technology to deliver content. As per claims, it is the only industry leading CMS with built in paywall. Atex again is one of the premium CMS that offers amazing solutions for managing websites and helps marketers deliver the right content to relevant audiences. It has rich set of clientele. 22 Escenic: Customers of escenic include News of the World, The Sun, The Times, the Independent titles. It’s a closed source Java framework. Both Atex and Escenic are found to be highly popular in Sweden. Some of the biggest sites in Sweden use both these CMS. idg.se uses Atex and Aftonbladet.se uses Escenic 23. Adobe Experience Manager/ CQ5 : Best CMS list cannot be completed without including adobe experience manager. It is an all-round CMS which offers all kinds of agility and flexibility an organization may want. It helps deliver unique customer experience by delivering different content on different channels. Adobe Experience Manager was recently named a leader in web content management by Garnters magic quadrant. Earlier it was known as CQ5 but later was acquire by Adobe in 2010 24. SDL Tridion Again a well known CMS and highly recommended by industry experts. Its simple intuitive UI makes it simple to manage the content and deliver it uniformly across all channels. It recently received top score in overall content management experience according to an independent research firm - Forrester Research, Inc., This completes the list 23 top Java-based content management system. Hope after reading about all the CMS, you have got enough inferences and insight as to which CMS would be best for your website development project.
December 9, 2013
by Boni Satani
· 328,563 Views · 4 Likes
article thumbnail
iban4j: java library for IBAN generation and validation
European countries are moving towards Single Euro Payments Area (SEPA) initiative and therefore a lot of companies are implementing International Bank Account Numbers (IBAN) or replacing existing bank account details with IBAN. Recently I was dealing with International Bank Account Numbers and couldn't find any open source library neither to generate them nor to validate them. To fill the gap I implemented iban4j Java library for International Bank Account Number (IBAN) generation and validation. Using iban4j is as easy as: Iban iban = new Iban.Builder() .countryCode(CountryCode.AT) .bankCode("19043") .accountNumber("00234573201") .build(); Library is available in maven central: org.iban4j iban4j 1.0.0 For more details check out github page
December 8, 2013
by Artur Mkrtchyan
· 16,592 Views
article thumbnail
Java WebSockets (JSR-356) on Jetty 9.1
Jetty 9.1 is finally released, bringing Java WebSockets (JSR-356) to non-EE environments. It's awesome news and today's post will be about using this great new API along with Spring Framework. JSR-356 defines concise, annotation-based model to allow modern Java web applications easily create bidirectional communication channels using WebSockets API. It covers not only server-side, but client-side as well, making this API really simple to use everywhere. Let's get started! Our goal would be to build a WebSockets server which accepts messages from the clients and broadcasts them to all other clients currently connected. To begin with, let's define the message format, which server and client will be exchanging, as this simple Message class. We can limit ourselves to something like a String, but I would like to introduce to you the power of another new API - Java API for JSON Processing (JSR-353). package com.example.services; public class Message { private String username; private String message; public Message() { } public Message( final String username, final String message ) { this.username = username; this.message = message; } public String getMessage() { return message; } public String getUsername() { return username; } public void setMessage( final String message ) { this.message = message; } public void setUsername( final String username ) { this.username = username; } } To separate the declarations related to the server and the client, JSR-356 defines two basic annotations:@ServerEndpoint and @ClientEndpoit respectively. Our client endpoint, let's call itBroadcastClientEndpoint, will simply listen for messages server sends: package com.example.services; import java.io.IOException; import java.util.logging.Logger; import javax.websocket.ClientEndpoint; import javax.websocket.EncodeException; import javax.websocket.OnMessage; import javax.websocket.OnOpen; import javax.websocket.Session; @ClientEndpoint public class BroadcastClientEndpoint { private static final Logger log = Logger.getLogger( BroadcastClientEndpoint.class.getName() ); @OnOpen public void onOpen( final Session session ) throws IOException, EncodeException { session.getBasicRemote().sendObject( new Message( "Client", "Hello!" ) ); } @OnMessage public void onMessage( final Message message ) { log.info( String.format( "Received message '%s' from '%s'", message.getMessage(), message.getUsername() ) ); } } That's literally it! Very clean, self-explanatory piece of code: @OnOpen is being called when client got connected to the server and @OnMessage is being called every time server sends a message to the client. Yes, it's very simple but there is a caveat: JSR-356 implementation can handle any simple objects but not the complex ones like Message is. To manage that, JSR-356 introduces concept of encoders and decoders. We all love JSON, so why don't we define our own JSON encoder and decoder? It's an easy task which Java API for JSON Processing (JSR-353) can handle for us. To create an encoder, you only need to implementEncoder.Text< Message > and basically serialize your object to some string, in our case to JSON string, using JsonObjectBuilder. package com.example.services; import javax.json.Json; import javax.json.JsonReaderFactory; import javax.websocket.EncodeException; import javax.websocket.Encoder; import javax.websocket.EndpointConfig; public class Message { public static class MessageEncoder implements Encoder.Text< Message > { @Override public void init( final EndpointConfig config ) { } @Override public String encode( final Message message ) throws EncodeException { return Json.createObjectBuilder() .add( "username", message.getUsername() ) .add( "message", message.getMessage() ) .build() .toString(); } @Override public void destroy() { } } } For decoder part, everything looks very similar, we have to implement Decoder.Text< Message > and deserialize our object from string, this time using JsonReader. package com.example.services; import javax.json.JsonObject; import javax.json.JsonReader; import javax.json.JsonReaderFactory; import javax.websocket.DecodeException; import javax.websocket.Decoder; public class Message { public static class MessageDecoder implements Decoder.Text< Message > { private JsonReaderFactory factory = Json.createReaderFactory( Collections.< String, Object >emptyMap() ); @Override public void init( final EndpointConfig config ) { } @Override public Message decode( final String str ) throws DecodeException { final Message message = new Message(); try( final JsonReader reader = factory.createReader( new StringReader( str ) ) ) { final JsonObject json = reader.readObject(); message.setUsername( json.getString( "username" ) ); message.setMessage( json.getString( "message" ) ); } return message; } @Override public boolean willDecode( final String str ) { return true; } @Override public void destroy() { } } } And as a final step, we need to tell the client (and the server, they share same decoders and encoders) that we have encoder and decoder for our messages. The easiest thing to do that is just by declaring them as part of @ServerEndpoint and @ClientEndpoit annotations. import com.example.services.Message.MessageDecoder; import com.example.services.Message.MessageEncoder; @ClientEndpoint( encoders = { MessageEncoder.class }, decoders = { MessageDecoder.class } ) public class BroadcastClientEndpoint { } To make client's example complete, we need some way to connect to the server usingBroadcastClientEndpoint and basically exchange messages. The ClientStarter class finalizes the picture: package com.example.ws; import java.net.URI; import java.util.UUID; import javax.websocket.ContainerProvider; import javax.websocket.Session; import javax.websocket.WebSocketContainer; import org.eclipse.jetty.websocket.jsr356.ClientContainer; import com.example.services.BroadcastClientEndpoint; import com.example.services.Message; public class ClientStarter { public static void main( final String[] args ) throws Exception { final String client = UUID.randomUUID().toString().substring( 0, 8 ); final WebSocketContainer container = ContainerProvider.getWebSocketContainer(); final String uri = "ws://localhost:8080/broadcast"; try( Session session = container.connectToServer( BroadcastClientEndpoint.class, URI.create( uri ) ) ) { for( int i = 1; i <= 10; ++i ) { session.getBasicRemote().sendObject( new Message( client, "Message #" + i ) ); Thread.sleep( 1000 ); } } // Application doesn't exit if container's threads are still running ( ( ClientContainer )container ).stop(); } } Just couple of comments what this code does: we are connecting to WebSockets endpoint atws://localhost:8080/broadcast, randomly picking some client name (from UUID) and generating 10 messages, every with 1 second delay (just to be sure we have time to receive them all back). Server part doesn't look very different and at this point could be understood without any additional comments (except may be the fact that server just broadcasts every message it receives to all connected clients). Important to mention here: new instance of the server endpoint is created every time new client connects to it (that's why peers collection is static), it's a default behavior and could be easily changed. package com.example.services; import java.io.IOException; import java.util.Collections; import java.util.HashSet; import java.util.Set; import javax.websocket.EncodeException; import javax.websocket.OnClose; import javax.websocket.OnMessage; import javax.websocket.OnOpen; import javax.websocket.Session; import javax.websocket.server.ServerEndpoint; import com.example.services.Message.MessageDecoder; import com.example.services.Message.MessageEncoder; @ServerEndpoint( value = "/broadcast", encoders = { MessageEncoder.class }, decoders = { MessageDecoder.class } ) public class BroadcastServerEndpoint { private static final Set< Session > sessions = Collections.synchronizedSet( new HashSet< Session >() ); @OnOpen public void onOpen( final Session session ) { sessions.add( session ); } @OnClose public void onClose( final Session session ) { sessions.remove( session ); } @OnMessage public void onMessage( final Message message, final Session client ) throws IOException, EncodeException { for( final Session session: sessions ) { session.getBasicRemote().sendObject( message ); } } } In order this endpoint to be available for connection, we should start the WebSockets container and register this endpoint inside it. As always, Jetty 9.1 is runnable in embedded mode effortlessly: package com.example.ws; import org.eclipse.jetty.server.Server; import org.eclipse.jetty.servlet.DefaultServlet; import org.eclipse.jetty.servlet.ServletContextHandler; import org.eclipse.jetty.servlet.ServletHolder; import org.eclipse.jetty.websocket.jsr356.server.deploy.WebSocketServerContainerInitializer; import org.springframework.web.context.ContextLoaderListener; import org.springframework.web.context.support.AnnotationConfigWebApplicationContext; import com.example.config.AppConfig; public class ServerStarter { public static void main( String[] args ) throws Exception { Server server = new Server( 8080 ); // Create the 'root' Spring application context final ServletHolder servletHolder = new ServletHolder( new DefaultServlet() ); final ServletContextHandler context = new ServletContextHandler(); context.setContextPath( "/" ); context.addServlet( servletHolder, "/*" ); context.addEventListener( new ContextLoaderListener() ); context.setInitParameter( "contextClass", AnnotationConfigWebApplicationContext.class.getName() ); context.setInitParameter( "contextConfigLocation", AppConfig.class.getName() ); server.setHandler( context ); WebSocketServerContainerInitializer.configureContext( context ); server.start(); server.join(); } } The most important part of the snippet above is WebSocketServerContainerInitializer.configureContext: it's actually creates the instance of WebSockets container. Because we haven't added any endpoints yet, the container basically sits here and does nothing. Spring Framework and AppConfig configuration class will do this last wiring for us. package com.example.config; import javax.annotation.PostConstruct; import javax.inject.Inject; import javax.websocket.DeploymentException; import javax.websocket.server.ServerContainer; import javax.websocket.server.ServerEndpoint; import javax.websocket.server.ServerEndpointConfig; import org.eclipse.jetty.websocket.jsr356.server.AnnotatedServerEndpointConfig; import org.springframework.context.annotation.Bean; import org.springframework.context.annotation.Configuration; import org.springframework.web.context.WebApplicationContext; import com.example.services.BroadcastServerEndpoint; @Configuration public class AppConfig { @Inject private WebApplicationContext context; private ServerContainer container; public class SpringServerEndpointConfigurator extends ServerEndpointConfig.Configurator { @Override public < T > T getEndpointInstance( Class< T > endpointClass ) throws InstantiationException { return context.getAutowireCapableBeanFactory().createBean( endpointClass ); } } @Bean public ServerEndpointConfig.Configurator configurator() { return new SpringServerEndpointConfigurator(); } @PostConstruct public void init() throws DeploymentException { container = ( ServerContainer )context.getServletContext(). getAttribute( javax.websocket.server.ServerContainer.class.getName() ); container.addEndpoint( new AnnotatedServerEndpointConfig( BroadcastServerEndpoint.class, BroadcastServerEndpoint.class.getAnnotation( ServerEndpoint.class ) ) { @Override public Configurator getConfigurator() { return configurator(); } } ); } } As we mentioned earlier, by default container will create new instance of server endpoint every time new client connects, and it does so by calling constructor, in our caseBroadcastServerEndpoint.class.newInstance(). It might be a desired behavior but because we are usingSpring Framework and dependency injection, such new objects are basically unmanaged beans. Thanks to very well-thought (in my opinion) design of JSR-356, it's actually quite easy to provide your own way of creating endpoint instances by implementing ServerEndpointConfig.Configurator. TheSpringServerEndpointConfigurator is an example of such implementation: it's creates new managed bean every time new endpoint instance is being asked (if you want single instance, you can create singleton of the endpoint as a bean in AppConfig and return it all the time). The way we retrieve the WebSockets container is Jetty-specific: from the attribute of the context with name"javax.websocket.server.ServerContainer" (it probably might change in the future). Once container is there, we are just adding new (managed!) endpoint by providing our own ServerEndpointConfig (based onAnnotatedServerEndpointConfig which Jetty kindly provides already). To build and run our server and clients, we need just do that: mvn clean package java -jar target\jetty-web-sockets-jsr356-0.0.1-SNAPSHOT-server.jar // run server java -jar target/jetty-web-sockets-jsr356-0.0.1-SNAPSHOT-client.jar // run yet another client As an example, by running server and couple of clients (I run 4 of them, '392f68ef', '8e3a869d', 'ca3a06d0', '6cb82119') you might see by the output in the console that each client receives all the messages from all other clients (including its own messages): Nov 29, 2013 9:21:29 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Hello!' from 'Client' Nov 29, 2013 9:21:29 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #1' from '392f68ef' Nov 29, 2013 9:21:29 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #2' from '8e3a869d' Nov 29, 2013 9:21:29 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #7' from 'ca3a06d0' Nov 29, 2013 9:21:30 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #4' from '6cb82119' Nov 29, 2013 9:21:30 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #2' from '392f68ef' Nov 29, 2013 9:21:30 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #3' from '8e3a869d' Nov 29, 2013 9:21:30 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #8' from 'ca3a06d0' Nov 29, 2013 9:21:31 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #5' from '6cb82119' Nov 29, 2013 9:21:31 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #3' from '392f68ef' Nov 29, 2013 9:21:31 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #4' from '8e3a869d' Nov 29, 2013 9:21:31 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #9' from 'ca3a06d0' Nov 29, 2013 9:21:32 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #6' from '6cb82119' Nov 29, 2013 9:21:32 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #4' from '392f68ef' Nov 29, 2013 9:21:32 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #5' from '8e3a869d' Nov 29, 2013 9:21:32 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #10' from 'ca3a06d0' Nov 29, 2013 9:21:33 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #7' from '6cb82119' Nov 29, 2013 9:21:33 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #5' from '392f68ef' Nov 29, 2013 9:21:33 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #6' from '8e3a869d' Nov 29, 2013 9:21:34 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #8' from '6cb82119' Nov 29, 2013 9:21:34 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #6' from '392f68ef' Nov 29, 2013 9:21:34 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #7' from '8e3a869d' Nov 29, 2013 9:21:35 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #9' from '6cb82119' Nov 29, 2013 9:21:35 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #7' from '392f68ef' Nov 29, 2013 9:21:35 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #8' from '8e3a869d' Nov 29, 2013 9:21:36 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #10' from '6cb82119' Nov 29, 2013 9:21:36 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #8' from '392f68ef' Nov 29, 2013 9:21:36 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #9' from '8e3a869d' Nov 29, 2013 9:21:37 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #9' from '392f68ef' Nov 29, 2013 9:21:37 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #10' from '8e3a869d' Nov 29, 2013 9:21:38 PM com.example.services.BroadcastClientEndpoint onMessage INFO: Received message 'Message #10' from '392f68ef' 2013-11-29 21:21:39.260:INFO:oejwc.WebSocketClient:main: Stopped org.eclipse.jetty.websocket.client.WebSocketClient@3af5f6dc Awesome! I hope this introductory blog post shows how easy it became to use modern web communication protocols in Java, thanks to Java WebSockets (JSR-356), Java API for JSON Processing (JSR-353) and great projects such as Jetty 9.1! As always, complete project is available on GitHub.
December 6, 2013
by Andriy Redko
· 40,675 Views · 2 Likes
article thumbnail
Adding Java 8 Lambda Goodness to JDBC
Data access, specifically SQL access from within Java, has never been nice. This is in large part due to the fact that the JDBC api has a lot of ceremony. Java 7 vastly improved things with ARM blocks by taking away a lot of the ceremony around managing database objects such as Statements and ResultSets but fundamentally the code flow is still the same. Java 8 Lambdas gives us a very nice tool for improving the flow of JDBC. Out first attempt at improving things here is very simply to make it easy to work with ajava.sql.ResultSet. Here we simply wrap the ResultSet iteration and then delegate it to Lambda function. This is very similar in concept to Spring's JDBCTemplate. NOTE: I've released All the code snippets you see here under an Apache 2.0 license on Github. First we create a functional interface called ResultSetProcessor as follows: @FunctionalInterface public interface ResultSetProcessor { public void process(ResultSet resultSet, long currentRow) throws SQLException; } Very straightforward. This interface takes the ResultSet and the current row of theResultSet as a parameter. Next we write a simple utility to which executes a query and then calls ourResultSetProcessor each time we iterate over the ResultSet: public static void select(Connection connection, String sql, ResultSetProcessor processor, Object... params) { try (PreparedStatement ps = connection.prepareStatement(sql)) { int cnt = 0; for (Object param : params) { ps.setObject(++cnt, param)); } try (ResultSet rs = ps.executeQuery()) { long rowCnt = 0; while (rs.next()) { processor.process(rs, rowCnt++); } } catch (SQLException e) { throw new DataAccessException(e); } } catch (SQLException e) { throw new DataAccessException(e); } } Note I've wrapped the SQLException in my own unchecked DataAccessException. Now when we write a query it's as simple as calling the select method with a connection and a query: select(connection, "select * from MY_TABLE",(rs, cnt)-> { System.out.println(rs.getInt(1)+" "+cnt) }); So that's great, but I think we can do more... One of the nifty Lambda additions in Java is the new Streams API. This would allow us to add very powerful functionality with which to process a ResultSet. Using the Streams API over a ResultSet however creates a bit more of a challenge than the simple select with Lambda in the previous example. The way I decided to go about this is create my own Tuple type which represents a single row from a ResultSet. My Tuple here is the relational version where a Tuple is a collection of elements where each element is identified by an attribute, basically a collection of key value pairs. In our case the Tuple is ordered in terms of the order of the columns in the ResultSet. The code for the Tuple ended up being quite a bit so if you want to take a look, see the GitHub project in the resources at the end of the post. Currently the Java 8 API provides the java.util.stream.StreamSupport object which provides a set of static methods for creating instances of java.util.stream.Stream. We can use this object to create an instance of a Stream. But in order to create a Stream it needs an instance ofjava.util.stream.Spliterator. This is a specialised type for iterating and partitioning a sequence of elements, the Stream needs for handling operations in parallel. Fortunately the Java 8 api also provides the java.util.stream.Spliterators class which can wrap existing Collection and enumeration types. One of those types being ajava.util.Iterator. Now we wrap a query and ResultSet in an Iterator: public class ResultSetIterator implements Iterator { private ResultSet rs; private PreparedStatement ps; private Connection connection; private String sql; public ResultSetIterator(Connection connection, String sql) { assert connection != null; assert sql != null; this.connection = connection; this.sql = sql; } public void init() { try { ps = connection.prepareStatement(sql); rs = ps.executeQuery(); } catch (SQLException e) { close(); throw new DataAccessException(e); } } @Override public boolean hasNext() { if (ps == null) { init(); } try { boolean hasMore = rs.next(); if (!hasMore) { close(); } return hasMore; } catch (SQLException e) { close(); throw new DataAccessException(e); } } private void close() { try { rs.close(); try { ps.close(); } catch (SQLException e) { //nothing we can do here } } catch (SQLException e) { //nothing we can do here } } @Override public Tuple next() { try { return SQL.rowAsTuple(sql, rs); } catch (DataAccessException e) { close(); throw e; } } } This class basically delegates the iterator methods to the underlying result set and then on the next() call transforms the current row in the ResultSet into my Tuple type. And that's the basics done (This class will need a little bit more work though). All that's left is to wire it all together to make a Stream object. Note that due to the nature of a ResultSet it's not a good idea to try process them in parallel, so our stream cannot process in parallel. public static Stream stream(final Connection connection, final String sql, final Object... parms) { return StreamSupport .stream(Spliterators.spliteratorUnknownSize( new ResultSetIterator(connection, sql), 0), false); } Now it's straightforward to stream a query. In the usage example below I've got a table TEST_TABLE with an integer column TEST_ID which basically filters out all the non even numbers and then runs a count: long result = stream(connection, "select TEST_ID from TEST_TABLE") .filter((t) -> t.asInt("TEST_ID") % 2 == 0) .limit(100) .count(); And that's it! We now have a very powerful way of working with a ResultSet. So all this code is available under an Apache 2.0 license on GitHub here. I've rather lamely dubbed the project "lambda tuples," and the purpose really is to experiment and see where you can take Java 8 and Relational DB access, so please download or feel free to contribute.
December 5, 2013
by Julian Exenberger
· 78,273 Views · 6 Likes
article thumbnail
get current web application path in java
This is a code snippet to retrieve the path of the current running web application project in java public String getPath() throws UnsupportedEncodingException { String path = this.getClass().getClassLoader().getResource("").getPath(); String fullPath = URLDecoder.decode(path, "UTF-8"); String pathArr[] = fullPath.split("/WEB-INF/classes/"); System.out.println(fullPath); System.out.println(pathArr[0]); fullPath = pathArr[0]; String reponsePath = ""; // to read a file from webcontent reponsePath = new File(fullPath).getPath() + File.separatorChar + "newfile.txt"; return reponsePath; }
December 4, 2013
by Partheeban Thirumal
· 83,992 Views · 5 Likes
article thumbnail
Uncompressing 7-Zip Files with Groovy and 7-Zip-JBinding
This post demonstrates a Groovy script for uncompressing files with the 7-Zip archive format. The two primary objectives of this post are to demonstrate uncompressing 7-Zip files with Groovy and the handy 7-Zip-JBindingand to call out and demonstrate some key characteristics of Groovy as a scripting language. The 7-Zip page describes 7-Zip as "a file archiver with a high compression ratio." The page further adds, "7-Zip is open source software. Most of the source code is under the GNU LGPL license." More license information in available on the site along with information on the 7z format ("LZMA is default and general compression method of 7z format"). The 7-Zip page describes it as "a Java wrapper for 7-Zip C++ library" that "allows extraction of many archive formats using a very fast native library directly from Java through JNI." The 7z format is based on "LZMA andLZMA2 compression." Although there is an LZMA SDK available, it is easier to use the open source (SourceForge)7-Zip-JBinding project when manipulating 7-Zip files with Java. A good example of using Java with 7-Zip-JBinding to uncompress 7z files is available in the StackOverflowthread Decompress files with .7z extension in java. Dark Knight's response indicates how to use Java with 7-Zip-JBinding to uncompress a 7z file. I adapt Dark Knight's Java code into a Groovy script in this post. To demonstrate the adapted Groovy code for uncompressing 7z files, I first need a 7z file that I can extract contents from. The next series of screen snapshots show me using Windows 7-Zip installed on my laptop to compress the six PDFs available under the Guava Downloads page into a single 7z file called Guava.7z. Six Guava PDFs Sitting in a Folder Contents Selected and Right-Click Menu to Compress to 7z Format Guava.7z Compressed Archive File Created With a 7z file in place, I now turn to the adapted Groovy script that will extract the contents of this Guava.7zfile. As mentioned previously, this Groovy script is an adaptation of the Java code provided by Dark Knight on aStackOverflow thread. unzip7z.groovy // // This Groovy script is adapted from Java code provided at // http://stackoverflow.com/a/19403933 import static java.lang.System.err as error import java.io.File import java.io.FileNotFoundException import java.io.FileOutputStream import java.io.IOException import java.io.RandomAccessFile import java.util.Arrays import net.sf.sevenzipjbinding.ExtractOperationResult import net.sf.sevenzipjbinding.ISequentialOutStream import net.sf.sevenzipjbinding.ISevenZipInArchive import net.sf.sevenzipjbinding.SevenZip import net.sf.sevenzipjbinding.SevenZipException import net.sf.sevenzipjbinding.impl.RandomAccessFileInStream import net.sf.sevenzipjbinding.simple.ISimpleInArchive import net.sf.sevenzipjbinding.simple.ISimpleInArchiveItem if (args.length < 1) { println "USAGE: unzip7z.groovy .7z\n" System.exit(-1) } def fileToUnzip = args[0] try { RandomAccessFile randomAccessFile = new RandomAccessFile(fileToUnzip, "r") ISevenZipInArchive inArchive = SevenZip.openInArchive(null, new RandomAccessFileInStream(randomAccessFile)) ISimpleInArchive simpleInArchive = inArchive.getSimpleInterface() println "${'Hash'.center(10)}|${'Size'.center(12)}|${'Filename'.center(10)}" println "${'-'.multiply(10)}+${'-'.multiply(12)}+${'-'.multiply(10)}" simpleInArchive.getArchiveItems().each { item -> final int[] hash = new int[1] if (!item.isFolder()) { final long[] sizeArray = new long[1] ExtractOperationResult result = item.extractSlow( new ISequentialOutStream() { public int write(byte[] data) throws SevenZipException { //Write to file try { File file = new File(item.getPath()) file.getParentFile()?.mkdirs() FileOutputStream fos = new FileOutputStream(file) fos.write(data) fos.close() } catch (Exception e) { printExceptionStackTrace("Unable to write file", e) } hash[0] ^= Arrays.hashCode(data) // Consume data sizeArray[0] += data.length return data.length // Return amount of consumed data } }) if (result == ExtractOperationResult.OK) { println(String.format("%9X | %10s | %s", hash[0], sizeArray[0], item.getPath())) } else { error.println("Error extracting item: " + result) } } } } catch (Exception e) { printExceptionStackTrace("Error occurs", e) System.exit(1) } finally { if (inArchive != null) { try { inArchive.close() } catch (SevenZipException e) { printExceptionStackTrace("Error closing archive", e) } } if (randomAccessFile != null) { try { randomAccessFile.close() } catch (IOException e) { printExceptionStackTrace("Error closing file", e) } } } /** * Prints the stack trace of the provided exception to standard error without * Groovy meta data trace elements. * * @param contextMessage String message to precede stack trace and provide context. * @param exceptionToBePrinted Exception whose Groovy-less stack trace should * be printed to standard error. * @return Exception derived from the provided Exception but without Groovy * meta data calls. */ def Exception printExceptionStackTrace( final String contextMessage, final Exception exceptionToBePrinted) { error.print "${contextMessage}: ${org.codehaus.groovy.runtime.StackTraceUtils.sanitize(exceptionToBePrinted).printStackTrace()}" } In my adaptation of the Java code into the Groovy script shown above, I left most of the exception handling in place. Although Groovy allows exceptions to be ignored whether they are checked or unchecked, I wanted to maintain this handling in this case to make sure resources are closed properly and that appropriate error messages are presented to users of the script. One thing I did change was to make all of the output that is error-related be printed to standard error rather than to standard output. This required a few changes. First, I used Groovy's capability to rename something that is statically imported (see my related post Groovier Static Imports) to reference "java.lang.System.err" as "error" so that I could simply use "error" as a handle in the script rather than needing to use "System.err" to access standard error for output. Because Throwable.printStackTrace() already writes to standard error rather than standard output, I just used it directly. However, I placed calls to it in a new method that would first runStackTraceUtils.sanitize(Throwable) to remove Groovy-specific calls associated with Groovy's runtime dynamic capabilities from the stack trace. There were some other minor changes to the script as part of making it Groovier. I used Groovy's iteration on the items in the archive file rather than the Java for loop, removed semicolons at the ends of statements, used Groovy's String GDK extension for more controlled output reporting [to automatically center titles and tomultiply a given character by the appropriate number of times it needs to exist], and took advantage ofGroovy's implicit inclusion of args to add a check to ensure file for extraction was provided to the script. With the file to be extracted in place and the Groovy script to do the extracting ready, it is time to extract the contents of the Guava.7z file I demonstrated generating earlier in this post. The following command will run the script and places the appropriate 7-Zip-JBinding JAR files on the classpath. groovy -classpath "C:/sevenzipjbinding/lib/sevenzipjbinding.jar;C:/sevenzipjbinding/lib/sevenzipjbinding-Windows-x86.jar" unzip7z.groovy C:\Users\Dustin\Downloads\Guava\Guava.7z Before showing the output of running the above script against the indicated Guava.7z file, it is important to note the error message that will occur if the native operating system specific 7-Zip-JBinding JAR (sevenzipjbinding-Windows-x86.jar in my laptop's case) is not included on the classpath of the script. As the last screen snapshot indicates, neglecting to include the native JAR on the classpath leads to the error message: "Error occurs: java.lang.RuntimeException: SevenZipJBinding couldn't be initialized automaticly using initialization from platform depended JAR and the default temporary directory. Please, make sure the correct 'sevenzipjbinding-.jar' file is in the class path or consider initializing SevenZipJBinding manualy using one of the offered initialization methods: 'net.sf.sevenzipjbinding.SevenZip.init*()'" Although I simply added C:/sevenzipjbinding/lib/sevenzipjbinding-Windows-x86.jar to my script's classpath to make it work on this laptop, a more robust script might detect the operating system and apply the appropriate JAR to the classpath for that operating system. The 7-Zip-JBinding Download pagefeatures multiple platform-specific downloads (including platform-specific JARs) such assevenzipjbinding-4.65-1.06-rc-extr-only-Windows-amd64.zip, sevenzipjbinding-4.65-1.06-rc-extr-only-Mac-x86_64.zip, sevenzipjbinding-4.65-1.06-rc-extr-only-Mac-i386.zip, and sevenzipjbinding-4.65-1.06-rc-extr-only-Linux-i386.zip. Once the native 7-Zip-JBinding JAR is included on the classpath along with the core sevenzipjbinding.jar JAR, the script runs beautifully as shown in the next screen snapshot. The script extracts the contents of the 7z file into the same working directory as the Groovy script. A further enhancement would be to modify the script to accept a directory to which to write the extracted files or might write them to the same directory as the 7z archive file by default instead. Use of Groovy's built-in CLIBuilder support could also improve the script. Groovy is my preferred language of choice when scripting something that makes use of the JVM and/or of Java libraries and frameworks. Writing the script that is the subject of this post has been another reminder of that.
December 4, 2013
by Dustin Marx
· 12,752 Views
article thumbnail
G1 vs CMS vs Parallel GC
This post is following up the experiment we ran exactly a year ago comparing the performance of different GC algorithms in real-life settings. We took the same experiment, expanded the tests to contain the G1 garbage collector and ran the tests on different platform. This year our tests were run with the following Garbage Collectors: -XX:+UseParallelOldGC -XX:+UseConcMarkSweepGC -XX:+UseG1GC Description of the environment The experiment was ran on out-of-the-box JIRA configuration. The motivation for the test run was loud and clear – Minecraft, Dalvik-based Angry Bird and Eclipse asides, JIRA should be one of the most popular Java applications out there. And opposed to the alternatives it is a more typical representative of what most of us are dealing with on the everyday business – after all Java is still by far most used in server side Java EE apps. What also affected our decision was – the engineers from Atlassian ship nicely packaged load tests along the JIRA download. So we had a benchmark to use for our configuration. We carefully unzipped our fresh JIRA 6.1 download and installed it on a Mac OS X Mavericks. And ran the bundled tests without changing anything in the default memory settings. The Atlassian team had been kind enough to set them for us: -Xms256m -Xmx768m -XX:MaxPermSize=256m The tests used JIRA functionality in different common ways – creating tasks, assigning tasks, resolving tasks, searching and discovering tasks, etc. Total runtime for the test was 30 minutes. We ran the test using three different garbage collection algorithms – Parallel, CMS and G1 were used in our case. Each test started with a fresh JVM boot, followed by prepopulating the storage to the exactly the same state. Only after the preparations we launched the load generation. Results During each run we have collected GC logs using -XX:+PrintGCTimeStamps -Xloggc:/tmp/gc.log -XX:+PrintGCDetails and analyzed this statistics with the help of GCViewer The results can be aggregated as follows. Note that all measurements are in milliseconds: Parallel CMS G1 Total GC pauses 20 930 18 870 62 000 Max GC pause 721 64 50 Interpretation and results First stop – Parallel GC (-XX:+UseParallelOldGC). Out of the 30 minutes the tests took to complete, we spent close to 21 seconds in GC pauses with the parallel collector. And the longest pause took 721 milliseconds. So let us take this as the baseline: GC cycles reduced the throughput by 1.1% of the total runtime. And the worst-case latency was 721ms. Next contestant: CMS (-XX:+UseConcMarkSweepGC). Again, 30 minutes of tests out of which we lost a bit less than 19 seconds to GC. Throughput-wise this is roughly in the same neighbourhood as the parallel mode. Latency on the other hand has been improved significantly - the worst-case latency is reduced more than 10 times! We are now facing just 64ms as the maximum pause time from the GC. Last experiment used the newest and shiniest GC algorithm available – G1 (-XX:+UseG1GC). The very same tests were run and throughput-wise we saw results suffering severely. This time our application spent more than a minute waiting for the GC to complete. Comparing this to the just 1% of the overhead with CMS, we are now facing close to 3.5% effect on the throughput. But if you really do not care about throughput and want to squeeze out the last bit from the latency then – we have improved around 20% comparing to the already-good CMS – using G1 saw the longest GC pause only taking 50ms. Conclusion As always, trying to summarize such an experiment into a single conclusion is dangerous. So if you have time and required skills – definitely go ahead and measure your own environment instead of adopting to one-size-fits-all solution. But if I would dare to make such a conclusion, I would say that CMS is still the best “default” option to go with. G1 throughput is still so much worse that the improved latency is usually not worth it. If you enjoyed the content, consider subscribe to either our RSS feed or Twitter stream – we continue to publish on different performance optimization topics.
December 3, 2013
by Nikita Salnikov-Tarnovski
· 38,800 Views
article thumbnail
Groovy Goodness: Remove Part of String With Regular Expression Pattern
Since Groovy 2.2 we can subtract a part of a String value using a regular expression pattern. The first match found is replaced with an empty String. In the following sample code we see how the first match of the pattern is removed from the String: // Define regex pattern to find words starting with gr (case-insensitive). def wordStartsWithGr = ~/(?i)\s+Gr\w+/ assert ('Hello Groovy world!' - wordStartsWithGr) == 'Hello world!' assert ('Hi Grails users' - wordStartsWithGr) == 'Hi users' // Remove first match of a word with 5 characters. assert ('Remove first match of 5 letter word' - ~/\b\w{5}\b/) == 'Remove match of 5 letter word' // Remove first found numbers followed by a whitespace character. assert ('Line contains 20 characters' - ~/\d+\s+/) == 'Line contains characters' Code written with Groovy 2.2.
November 23, 2013
by Hubert Klein Ikkink
· 19,840 Views
article thumbnail
How to Create a Range From 1 to 10 in SQL
How do you create a range from 1 to 10 in SQL? Have you ever thought about it? This is such an easy problem to solve in any imperative language, it’s ridiculous. Take Java (or C, whatever) for instance: for (int i = 1; i <= 10; i++) System.out.println(i); This was easy, right? Things even look more lean when using functional programming. Take Scala, for instance: (1 to 10) foreach { t => println(t) } We could fill about 25 pages about various ways to do the above in Scala, agreeing on how awesome Scala is (or what hipsters we are). But how to create a range in SQL? … And we’ll exclude using stored procedures, because that would be no fun. In SQL, the data source we’re operating on are tables. If we want a range from 1 to 10, we’d probably need a table containing exactly those ten values. Here are a couple of good, bad, and ugly options of doing precisely that in SQL. OK, they’re mostly bad and ugly. By creating a table The dumbest way to do this would be to create an actual temporary table just for that purpose: CREATE TABLE "1 to 10" AS SELECT 1 value FROM DUAL UNION ALL SELECT 2 FROM DUAL UNION ALL SELECT 3 FROM DUAL UNION ALL SELECT 4 FROM DUAL UNION ALL SELECT 5 FROM DUAL UNION ALL SELECT 6 FROM DUAL UNION ALL SELECT 7 FROM DUAL UNION ALL SELECT 8 FROM DUAL UNION ALL SELECT 9 FROM DUAL UNION ALL SELECT 10 FROM DUAL See also this SQLFiddle This table can then be used in any type of select. Now that’s pretty dumb but straightforward, right? I mean, how many actual records are you going to put in there? By using a VALUES() table constructor This solution isn’t that much better. You can create a derived table and manually add the values from 1 to 10 to that derived table using the VALUES() table constructor. In SQL Server, you could write: SELECT V FROM ( VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10) ) [1 to 10](V) See also this SQLFiddle By creating enough self-joins of a sufficent number of values Another “dumb”, yet a bit more generic solution would be to create only a certain amount of constant values in a table, view or CTE (e.g. two) and then self join that table enough times to reach the desired range length (e.g. four times). The following example will produce values from 1 to 10, “easily”: WITH T(V) AS ( SELECT 0 FROM DUAL UNION ALL SELECT 1 FROM DUAL ) SELECT V FROM ( SELECT 1 + T1.V + 2 * T2.V + 4 * T3.V + 8 * T4.V V FROM T T1, T T2, T T3, T T4 ) WHERE V <= 10 ORDER BY V See also this SQLFiddle By using grouping sets Another way to generate large tables is by using grouping sets, or more specifically by using the CUBE() function. This works much in a similar way as the previous example when self-joining a table with two records: SELECT ROWNUM FROM ( SELECT 1 FROM DUAL GROUP BY CUBE(1, 2, 3, 4) ) WHERE ROWNUM <= 10 See also this SQLFiddle By just taking random records from a “large enough” table In Oracle, you could probably use ALL_OBJECTs. If you’re only counting to 10, you’ll certainly get enough results from that table: SELECT ROWNUM FROM ALL_OBJECTS WHERE ROWNUM <= 10 See also this SQLFiddle What’s so “awesome” about this solution is that you can cross join that table several times to be sure to get enough values: SELECT ROWNUM FROM ALL_OBJECTS, ALL_OBJECTS, ALL_OBJECTS, ALL_OBJECTS WHERE ROWNUM <= 10 OK. Just kidding. Don’t actually do that. Or if you do, don’t blame me if your productive system runs low on memory. By using the awesome PostgreSQL GENERATE_SERIES() function Incredibly, this isn’t part of the SQL standard. Neither is it available in most databases but PostgreSQL, which has the GENERATE_SERIES() function. This is much like Scala’s range notation: (1 to 10) SELECT * FROM GENERATE_SERIES(1, 10) See also this SQLFiddle By using CONNECT BY If you’re using Oracle, then there’s a really easy way to create such a table using the CONNECT BY clause, which is almost as convenient as PostgreSQL’s GENERATE_SERIES() function: SELECT LEVEL FROM DUAL CONNECT BY LEVEL < 10 See also this SQLFiddle By using a recursive CTE Recursive common table expressions are cool, yet utterly unreadable. the equivalent of the above Oracle CONNECT BY clause when written using a recursive CTE would look like this: WITH "1 to 10"(V) AS ( SELECT 1 FROM DUAL UNION ALL SELECT V + 1 FROM "1 to 10" WHERE V < 10 ) SELECT * FROM "1 to 10" See also this SQLFiddle By using Oracle’s MODEL clause A decent “best of” comparison of how to do things in SQL wouldn’t be complete without at least one example using Oracle’s MODEL clause (see this awesome use-case for Oracle’s spreadsheet feature). Use this clause only to make your co workers really angry when maintaining your SQL code. Bow before this beauty! SELECT V FROM ( SELECT 1 V FROM DUAL ) T MODEL DIMENSION BY (ROWNUM R) MEASURES (V) RULES ITERATE (10) ( V[ITERATION_NUMBER] = CV(R) + 1 ) ORDER BY 1 See also this SQLFiddle Conclusion There aren’t actually many nice solutions to do such a simple thing in SQL. Clearly, PostgreSQL’s GENERATE_SERIES() table function is the most beautiful solution. Oracle’s CONNECT BY clause comes close. For all other databases, some trickery has to be applied in one way or another. Unfortunately.
November 20, 2013
by Lukas Eder
· 30,238 Views
article thumbnail
Base64 Encoding in Java 8
The lack of Base64 encoding API in Java is, in my opinion, by far one of the most annoying holes in the libraries. Finally Java 8 includes a decent API for it in the java.util package. Here is a short introduction of this new API (apparently it has a little more than the regular encode/decode API). The java.util.Base64 Utility Class The entry point to Java 8's Base64 support is the java.util.Base64 class. It provides a set of static factory methods used to obtain one of the three Base64 encodes and decoders: Basic URL MIME Basic Encoding The standard encoding we all think about when we deal with Base64: no line feeds are added to the output and the output is mapped to characters in the Base64 Alphabet: A-Za-z0-9+/ (we see in a minute why is it important). Here is a sample usage: // Encode String asB64 = Base64.getEncoder().encodeToString("some string".getBytes("utf-8")); System.out.println(asB64); // Output will be: c29tZSBzdHJpbmc= // Decode byte[] asBytes = Base64.getDecoder().decode("c29tZSBzdHJpbmc="); System.out.println(new String(asBytes, "utf-8")); // And the output is: some string It can't get simpler than that and, unlike in earlier versions of Java, there is any need for external dependencies (commons-codec), Sun classes (sun.misc.BASE64Decoder) or JAXB's DatatypeConverter (Java 6 and above). However it gets even better. URL Encoding Most of us are used to get annoyed when we have to encode something to be later included in a URL or as a filename - the problem is that the Base64 Alphabet contains meaningful characters in both URIs and filesystems (most specifically the forward slash (/)). The second type of encoder uses the alternative "URL and Filename safe" Alphabet which includes -_ (minus and underline) instead of +/. See the following example: String basicEncoded = Base64.getEncoder().encodeToString("subjects?abcd".getBytes("utf-8")); System.out.println("Using Basic Alphabet: " + basicEncoded); String urlEncoded = Base64.getUrlEncoder().encodeToString("subjects?abcd".getBytes("utf-8")); System.out.println("Using URL Alphabet: " + urlEncoded); The output will be: Using Basic Alphabet: c3ViamVjdHM/YWJjZA== Using URL Alphabet: c3ViamVjdHM_YWJjZA== The sample above illustrates some content which if encoded using a basic encoder will result in a string containing a forward slash while when using a URL safe encoder the output will include an underscore instead (URL Safe encoding is described in clause 5 of the RFC) MIME Encoding The MIME encoder generates a Base64 encoded output using the Basic Alphabet but in a MIME friendly format: each line of the output is no longer than 76 characters and ends with a carriage return followed by a linefeed (\r\n). The following example generates a block of text (this is needed just to make sure we have enough 'body' to encode into more than 76 characters) and encodes it using the MIME encoder: StringBuilder sb = new StringBuilder(); for (int t = 0; t < 10; ++t) { sb.append(UUID.randomUUID().toString()); } byte[] toEncode = sb.toString().getBytes("utf-8"); String mimeEncoded = Base64.getMimeEncoder().encodeToString(toEncode); System.out.println(mimeEncoded); The output: NDU5ZTFkNDEtMDVlNy00MDFiLTk3YjgtMWRlMmRkMWEzMzc5YTJkZmEzY2YtM2Y2My00Y2Q4LTk5 ZmYtMTU1NzY0MWM5Zjk4ODA5ZjVjOGUtOGMxNi00ZmVjLTgyZjctNmVjYTU5MTAxZWUyNjQ1MjJj NDMtYzA0MC00MjExLTk0NWMtYmFiZGRlNDk5OTZhMDMxZGE5ZTYtZWVhYS00OGFmLTlhMjgtMDM1 ZjAyY2QxNDUyOWZiMjI3NDctNmI3OC00YjgyLThiZGQtM2MyY2E3ZGNjYmIxOTQ1MDVkOGQtMzIz Yi00MDg0LWE0ZmItYzkwMGEzNDUxZTIwOTllZTJiYjctMWI3MS00YmQzLTgyYjUtZGRmYmYxNDA4 Mjg3YTMxZjMxZmMtYTdmYy00YzMyLTkyNzktZTc2ZDc5ZWU4N2M5ZDU1NmQ4NWYtMDkwOC00YjIy LWIwYWItMzJiYmZmM2M0OTBm Wrapping All encoders and decoders created by the java.util.Base64 class support streams wrapping - this is a very elegant construct - both coding and efficiency wise (since no redundant buffering are needed) - to stream in and out buffers via encoders and decoders. The sample bellow illustrates how a FileOutputStream can be wrapped with an encoder and a FileInputStream is wrapped with a decoder - in both cases there is no need to buffer the content: public void wrapping() throws IOException { String src = "This is the content of any resource read from somewhere" + " into a stream. This can be text, image, video or any other stream."; // An encoder wraps an OutputStream. The content of /tmp/buff-base64.txt will be the // Base64 encoded form of src. try (OutputStream os = Base64.getEncoder().wrap(new FileOutputStream("/tmp/buff-base64.txt"))) { os.write(src.getBytes("utf-8")); } // The section bellow illustrates a wrapping of an InputStream and decoding it as the stream // is being consumed. There is no need to buffer the content of the file just for decoding it. try (InputStream is = Base64.getDecoder().wrap(new FileInputStream("/tmp/buff-base64.txt"))) { int len; byte[] bytes = new byte[100]; while ((len = is.read(bytes)) != -1) { System.out.print(new String(bytes, 0, len, "utf-8")); } } }
November 15, 2013
by Eyal Lupu
· 152,980 Views · 5 Likes
article thumbnail
Python: Making scikit-learn and Pandas Play Nice
In the last post I wrote about Nathan and my attempts at the Kaggle Titanic Problem, I mentioned that our next step was to try out scikit-learn, so I thought I should summarize where we’ve got up to. We needed to write a classification algorithm to work out whether a person onboard the Titanic survived, and luckily, scikit-learn has extensive documentation on each of the algorithms. Unfortunately almost all those examples use numpy data structures and we’d loaded the data using pandas and didn’t particularly want to switch back! Luckily it was really easy to get the data into numpy format by calling ‘values’ on the pandas data structure, something we learnt from a reply on Stack Overflow. For example if we were to wire up an ExtraTreesClassifier which worked out survival rate based on the ‘Fare’ and ‘Pclass’ attributes we could write the following code: import pandas as pd from sklearn.ensemble import ExtraTreesClassifier from sklearn.cross_validation import cross_val_score train_df = pd.read_csv("train.csv") et = ExtraTreesClassifier(n_estimators=100, max_depth=None, min_samples_split=1, random_state=0) columns = ["Fare", "Pclass"] labels = train_df["Survived"].values features = train_df[list(columns)].values et_score = cross_val_score(et, features, labels, n_jobs=-1).mean() print("{0} -> ET: {1})".format(columns, et_score)) To start with with read in the CSV file which looks like this: $ head -n5 train.csv PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked 1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S 2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C 3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S 4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S Next we create our classifier which “fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting.“ i.e. a better version of a random forest. On the next line we describe the features we want the classifier to use, then we convert the labels and features into numpy format so we can pass the to the classifier. Finally we call the cross_val_score function which splits our training data set into training and test components and trains the classifier against the former and checks its accuracy using the latter. If we run this code we’ll get roughly the following output: $ python et.py ['Fare', 'Pclass'] -> ET: 0.687991021324) This is actually a worse accuracy than we’d get by saying that females survived and males didn’t. We can introduce ‘Sex’ into the classifier by adding it to the list of columns: columns = ["Fare", "Pclass", "Sex"] If we re-run the code we’ll get the following error: $ python et.py An unexpected error occurred while tokenizing input The following traceback may be corrupted or invalid The error message is: ('EOF in multi-line statement', (514, 0)) ... Traceback (most recent call last): File "et.py", line 14, in et_score = cross_val_score(et, features, labels, n_jobs=-1).mean() File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/cross_validation.py", line 1152, in cross_val_score for train, test in cv) File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/externals/joblib/parallel.py", line 519, in __call__ self.retrieve() File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/externals/joblib/parallel.py", line 450, in retrieve raise exception_type(report) sklearn.externals.joblib.my_exceptions.JoblibValueError/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/externals/joblib/my_exceptions.py:26: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6 self.message, : JoblibValueError ___________________________________________________________________________ Multiprocessing exception: ... ValueError: could not convert string to float: male ___________________________________________________________________________ This is a slightly verbose way of telling us that we can’t pass non numeric features to the classifier – in this case ‘Sex’ has the values ‘female’ and ‘male’. We’ll need to write a function to replace those values with numeric equivalents. train_df["Sex"] = train_df["Sex"].apply(lambda sex: 0 if sex == "male" else 1) Now if we re-run the classifier we’ll get a slightly more accurate prediction: $ python et.py ['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359) The next step is to use the classifier against the test data set, so let’s load the data and run the prediction: test_df = pd.read_csv("test.csv") et.fit(features, labels) et.predict(test_df[columns].values) Now if we run that: $ python et.py ['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359) Traceback (most recent call last): File "et.py", line 22, in et.predict(test_df[columns].values) File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 444, in predict proba = self.predict_proba(X) File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 479, in predict_proba X = array2d(X, dtype=DTYPE) File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 91, in array2d X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order) File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/numeric.py", line 235, in asarray return array(a, dtype, copy=False, order=order) ValueError: could not convert string to float: male which is the same problem we had earlier! We need to replace the ‘male’ and ‘female’ values in the test set too so we’ll pull out a function to do that now. def replace_non_numeric(df): df["Sex"] = df["Sex"].apply(lambda sex: 0 if sex == "male" else 1) return df Now we’ll call that function with our training and test data frames: train_df = replace_non_numeric(pd.read_csv("train.csv")) ... test_df = replace_non_numeric(pd.read_csv("test.csv")) If we run the program again: $ python et.py ['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359) Traceback (most recent call last): File "et.py", line 26, in et.predict(test_df[columns].values) File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 444, in predict proba = self.predict_proba(X) File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/ensemble/forest.py", line 479, in predict_proba X = array2d(X, dtype=DTYPE) File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 93, in array2d _assert_all_finite(X_2d) File "/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 27, in _assert_all_finite raise ValueError("Array contains NaN or infinity.") ValueError: Array contains NaN or infinity. There are missing values in the test set so we’ll replace those with average values from our training set using an Imputer: from sklearn.preprocessing import Imputer imp = Imputer(missing_values='NaN', strategy='mean', axis=0) imp.fit(features) test_df = replace_non_numeric(pd.read_csv("test.csv")) et.fit(features, labels) print et.predict(imp.transform(test_df[columns].values)) If we run that it completes successfully: $ python et.py ['Fare', 'Pclass', 'Sex'] -> ET: 0.813692480359) [0 1 0 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 1 1 1 1 0 0 1 0 0 0] The final step is to add these values to our test data frame and then write that to a file so we can submit it to Kaggle. The type of those values is ‘numpy.ndarray’ which we can convert to a pandas Series quite easily: predictions = et.predict(imp.transform(test_df[columns].values)) test_df["Survived"] = pd.Series(predictions) We can then write the ‘PassengerId’ and ‘Survived’ columns to a file: test_df.to_csv("foo.csv", cols=['PassengerId', 'Survived'], index=False) Then output file looks like this: $ head -n5 foo.csv PassengerId,Survived 892,0 893,1 894,0 The code we’ve written is on github in case it’s useful to anyone.
November 14, 2013
by Mark Needham
· 32,112 Views · 1 Like
article thumbnail
Revisiting The Purpose of Manager Classes
I came across a debate about whether naming a class [Something]Manager is proper. Mainly, I came across this article, but also found other references here and here. I already had my own thoughts on this topic and decided I would share them with you. I started thinking on this a while ago when working in a Swing-powered desktop app. We decided we would make use of the Listeners model that Swing establishes for its components. To register a listener in a Swing component you say: // Some-Type = {Action, Key, Mouse, Focus, ...} component.add[Some-Type]Listener(listener); That's not a problem until you need to dynamically remove those listeners, or override them (remove and add a new listener that listens to the same event but fires a different action), or do any other 'unnatural' thing with them. This turns out to be a little difficult if you use the interface Swing provides out-of-the-box. For instance, to remove a listener, you need to pass the whole instance you want to remove; you would have to hold the reference to the listener if you intend to remove it some time later. So I decided I would take a step further and try to abstract the mess by creating a Manager class. What you could do with this class was something like this: EventsListenersManager.registerListener( "some unique name", component, listener); EventsListenersManager.overrideListener("some unique name", newListener); EventsListenerManager.disableListener("some unique name"); EventsListenerManager.enableListener("some unique name"); EventsListenerManager.unregisterListener("some unique name"); What we gained here??? We got rid of the explicit specification of [Some-Type] and have a single function for every type of listener; and we defined a unique name for the binding, which will be used as an id to remove, enable/disable, and override listeners; a.k.a. no need to hold references. Obviously, it is now convenient to always use the manager’s interface, since it reduces the need to trick our code and gives us a simple and more readable interface to achieve many things with listeners. But that’s not the only advantage of this manager class. You see, manager classes in general have one big advantage: they create an abstraction of a service and work as a centralized point to add logic of any nature, like security logic or performance logic. In the case of our EventsListenersManager, we put some extra logic to enchance the way we can use listeners by providing an interface that simplifies their use: it is easier to make a switch between two listeners now than it was before. Another thing we could have done was to restrict the number of listeners registered in one component for performance sake (Hint: listeners execution don’t run in parallel). So manager classes seem to be a good thing, but we can’t make a manager for every object we want to control in an application. This would consume time and would make us start thinking in a manager class even when we don’t need it. But we can discover a pattern for the need of defining a manager. I’ve seen them used when the resources they control are expensive, like database connections, media resources, hardware resources, etc; and as I see it, expensiveness can come in many flavors: Listeners are expensive because they may be difficult to play with and can be a bottleneck in performance (we must keep them short and fast). Connections are expensive to construct. Media resources are expensive to load. Hardware resources are expensive because they are scarce. So we create different managers for them: We create a manager for listeners to ease and control their use. We create a connections pool in order to limit and control the number of connections to a database. Games developers create a resources manager to have a single point to access resources and reduce the possibility to load them multiple times. We all know there is a driver for every piece of hardware. That’s their manager. Here we can see multiple variations of manager classes implementations, but they all attempt to solve a single thing: reducing the cost we may incur in when using expensive resources directly. I think people should have this line of thought: make a manager class for every expensive resource you detect in your application. You will be more than thankful when you want to add extra logic and you only have to go to one place. Also, enforce the use of manager objects instead of accessing the resources directly. The next time you are planning to make a manager class, ask yourself: is the 'thing' I want to control expensive in any way???
November 7, 2013
by Martín Proenza
· 10,765 Views
article thumbnail
Data Access Module using Groovy with Spock testing
This blog is more of a tutorial where we describe the development of a simple data access module, more for fun and learning than anything else. All code can be found here for those who don’t want to type along: https://github.com/ricston-git/tododb As a heads-up, we will be covering the following: Using Groovy in a Maven project within Eclipse Using Groovy to interact with our database Testing our code using the Spock framework We include Spring in our tests with ContextConfiguration A good place to start is to write a pom file as shown here. The only dependencies we want packaged with this artifact are groovy-all and commons-lang. The others are either going to be provided by Tomcat or are only used during testing (hence the scope tags in the pom). For example, we would put the jar with PostgreSQL driver in Tomcat’s lib, and tomcat-jdbc and tomcat-dbcp are already there. (Note: regarding the postgre jar, we would also have to do some minor configuration in Tomcat to define a DataSource which we can get in our app through JNDI – but that’s beyond the scope of this blog. See here for more info). Testing-wise, I’m depending on spring-test, spock-core, and spock-spring (the latter is to get spock to work with spring-test). Another significant addition in the pom is the maven-compiler-plugin. I have tried to get gmaven to work with Groovy in Eclipse, but I have found the maven-compiler-plugin to be a lot easier to work with. With your pom in an empty directory, go ahead and mkdir -p src/main/groovy src/main/java src/test/groovy src/test/java src/main/resources src/test/resources. This gives us a directory structure according to the Maven convention. Now you can go ahead and import the project as a Maven project in Eclipse (install the m2e plugin if you don’t already have it). It is important that you do not mvn eclipse:eclipse in your project. The .classpath it generates will conflict with your m2e plugin and (at least in my case), when you update your pom.xml the plugin will not update your dependencies inside Eclipse. So just import as a maven project once you have your pom.xml and directory structure set up. Okay, so our tests are going to be integration tests, actually using a PostgreSQL database. Since that’s the case, lets set up our database with some data. First go ahead and create a tododbtest database which will only be used for testing purposes. Next, put the following files in your src/test/resources: Note, fill in your username/password: DROP TABLE IF EXISTS todouser CASCADE; CREATE TABLE todouser ( id SERIAL, email varchar(80) UNIQUE NOT NULL, password varchar(80), registered boolean DEFAULT FALSE, confirmationCode varchar(280), CONSTRAINT todouser_pkey PRIMARY KEY (id) ); insert into todouser (email, password, registered, confirmationCode) values ('[email protected]', 'abc123', FALSE, 'abcdefg') insert into todouser (email, password, registered, confirmationCode) values ('[email protected]', 'pass1516', FALSE, '123456') insert into todouser (email, password, registered, confirmationCode) values ('[email protected]', 'anon', FALSE, 'codeA') insert into todouser (email, password, registered, confirmationCode) values ('[email protected]', 'anon2', FALSE, 'codeB') Basically, testContext.xml is what we’ll be configuring our test’s context with. The sub-division into datasource.xml and initdb.xml may be a little too much for this example… but changes are usually easier that way. The gist is that we configure our data source in datasource.xml (this is what we will be injecting in our tests), and the initdb.xml will run the schema.sql and test-data.sql to create our table and populate it with data. So lets create our test, or should I say, our specification. Spock is specification framework that allows us to write more descriptive tests. In general, it makes our tests easier to read and understand, and since we’ll be using Groovy, we might as well make use of the extra readability Spock gives us. package com.ricston.blog.sample.model.spec; import javax.sql.DataSource import org.springframework.beans.factory.annotation.Autowired import org.springframework.test.annotation.DirtiesContext import org.springframework.test.annotation.DirtiesContext.ClassMode import org.springframework.test.context.ContextConfiguration import spock.lang.Specification import com.ricston.blog.sample.model.data.TodoUser import com.ricston.blog.sample.model.dao.postgre.PostgreTodoUserDAO // because it supplies a new application context after each test, the initialize-database in initdb.xml is // executed for each test/specification @DirtiesContext(classMode=ClassMode.AFTER_EACH_TEST_METHOD) @ContextConfiguration('classpath:testContext.xml') class PostgreTodoUserDAOSpec extends Specification { @Autowired DataSource dataSource PostgreTodoUserDAO postgreTodoUserDAO def setup() { postgreTodoUserDAO = new PostgreTodoUserDAO(dataSource) } def "findTodoUserByEmail when user exists in db"() { given: "a db populated with a TodoUser with email [email protected] and the password given below" String email = '[email protected]' String password = 'anon' when: "searching for a TodoUser with that email" TodoUser user = postgreTodoUserDAO.findTodoUserByEmail email then: "the row is found such that the user returned by findTodoUserByEmail has the correct password" user.password == password } } One specification is enough for now, just to make sure that all the moving parts are working nicely together. The specification itself is easy enough to understand. We’re just exercising the findTodoUserByEmail method of PostgreTodoUserDAO – which we will be writing soon. Using the ContextConfiguration from Spring Test we are able to inject beans defined in our context (the dataSource in our case) through the use of annotations. This keeps our tests short and makes them easier to modify later on. Additionally, note the use of DirtiesContext. Basically, after each specification is executed, we cannot rely on the state of the database remaining intact. I am using DirtiesContext to get a new Spring context for each specification run. That way, the table creation and test data insertions happen all over again for each specification we run. Before we can run our specification, we need to create at least the following two classes used in the spec: TodoUser and PostgreTodoUserDAO package com.sample.data import org.apache.commons.lang.builder.ToStringBuilder class TodoUser { long id; String email; String password; String confirmationCode; boolean registered; @Override public String toString() { ToStringBuilder.reflectionToString(this); } } package com.ricston.blog.sample.model.dao.postgre import groovy.sql.Sql import javax.sql.DataSource import com.ricston.blog.sample.model.dao.TodoUserDAO import com.ricston.blog.sample.model.data.TodoUser class PostgreTodoUserDAO implements TodoUserDAO { private Sql sql public PostgreTodoUserDAO(DataSource dataSource) { sql = new Sql(dataSource) } /** * * @param email * @return the TodoUser with the given email */ public TodoUser findTodoUserByEmail(String email) { sql.firstRow """SELECT * FROM todouser WHERE email = $email""" } } package com.ricston.blog.sample.model.dao; import com.ricston.blog.sample.model.data.TodoUser; public interface TodoUserDAO { /** * * @param email * @return the TodoUser with the given email */ public TodoUser findTodoUserByEmail(String email); } We’re just creating a POGO in TodoUser, implementing its toString using common’s ToStringBuilder. In PostgreTodoUserDAO we’re using Groovy’s SQL to access the database, for now, only implementing the findTodoUserByEmail method. PostgreTodoUserDAO implements TodoUserDAO, an interface which specifies the required methods a TodoUserDAO must have. Okay, so now we have all we need to run our specification. Go ahead and run it as a JUnit test from Eclipse. You should get back the following error message: org.codehaus.groovy.runtime.typehandling.GroovyCastException: Cannot cast object '{id=3, [email protected], password=anon, registered=false, confirmationcode=codeA}' with class 'groovy.sql.GroovyRowResult' to class 'com.ricston.blog.sample.model.data.TodoUser' due to: org.codehaus.groovy.runtime.metaclass.MissingPropertyExceptionNoStack: No such property: confirmationcode for class: com.ricston.blog.sample.model.data.TodoUser Possible solutions: confirmationCode at com.ricston.blog.sample.model.dao.postgre.PostgreTodoUserDAO.findTodoUserByEmail(PostgreTodoUserDAO.groovy:23) at com.ricston.blog.sample.model.spec.PostgreTodoUserDAOSpec.findTodoUserByEmail when user exists in db(PostgreTodoUserDAOSpec.groovy:37) Go ahead and connect to your tododbtest database and select * from todouser; As you can see, our confirmationCode varchar(280), ended up as the column confirmationcode with a lower case ‘c’. In PostgreTodoUserDAO’s findTodoUserByEmail, we are getting back GroovyRowResult from our firstRow invocation. GroovyRowResult implements Map and Groovy is able to create a POGO (in our case TodoUser) from a Map. However, in order for Groovy to be able to automatically coerce the GroovyRowResult into a TodoUser, the keys in the Map (or GroovyRowResult) must match the property names in our POGO. We are using confirmationCode in our TodoUser, and we would like to stick to the camel case convention. What can we do to get around this? Well, first of all, lets change our schema to use confirmation_code. That’s a little more readable. Of course, we still have the same problem as before since confirmation_code will not map to confirmationCode by itself. (Note: remember to change the insert statements in test-data.sql too). One way to get around this is to use Groovy’s propertyMissing methods as show below: def propertyMissing(String name, value) { if(isConfirmationCode(name)) { this.confirmationCode = value } else { unknownProperty(name) } } def propertyMissing(String name) { if(isConfirmationCode(name)) { return confirmationCode } else { unknownProperty(name) } } private boolean isConfirmationCode(String name) { 'confirmation_code'.equals(name) } def unknownProperty(String name) { throw new MissingPropertyException(name, this.class) } By adding this to our TodoUser.groovy we are effectively tapping in on how Groovy resolves property access. When we do something like user.confirmationCode, Groovy automatically calls getConfirmationCode(), a method which we got for free when declared the property confirmationCode in our TodoUser. Now, when user.confirmation_code is invoked, Groovy doesn’t find any getters to invoke since we never declared the property confirmation_code, however, since we have now implemented the propertyMissing methods, before throwing any exceptions it will use those methods as a last resort when resolving properties. In our case we are effectively checking whether a get or set on confirmation_code is being made and mapping the respective operations to our confirmationCode property. It’s as simple as that. Now we can keep the auto coercion in our data access object and the property name we choose to have in our TodoUser. Assuming you’ve made the changes to the schema and test-data.sql to use confirmation_code, go ahead and run the spec file and this time it should pass. That’s it for this tutorial. In conclusion, I would like to discuss some finer points which someone who’s never used Groovy’s SQL before might not know. As you can see in PostgreTodoUserDAO.groovy, our database interaction is pretty much a one-liner. What about resource handling (e.g. properly closing the connection when we’re done), error logging, and prepared statements? Resource handling and error logging are done automatically, you just have to worry about writing your SQL. When you do write your SQL, try to stick to using triple quotes as used in the PostgreTodoUserDAO.groovy example. This produces prepared statements, therefore protecting against SQL injection and avoids us having to put ‘?’ all over the place and properly lining up the arguments to pass in to the SQL statement. Note that transaction management is something which the code using our artifact will have to take care of. Finally, note that a bunch of other operations (apart from findTodoUserByEmail) are implemented in the project on GitHub: https://github.com/ricston-git/tododb. Additionally, there is also a specification test for TodoUser, making sure that the property mapping works correctly. Also, in the pom.xml, there is some maven-surefire-plugin configuration in order to get the surefire-plugin to pick up our Spock specifications as well as any JUnit tests which we might have in our project. This allows us to run our specifications when we, for example, mvn clean package. After implementing all the operations you require in PostgreTodoUserDAO.groovy, you can go ahead and compile the jar or include in a Maven multi-module project to get a data access module you can use in other applications.
November 6, 2013
by Justin Calleja
· 21,181 Views
article thumbnail
Understanding the Concept of Functional Programming
What is Functional Programming? Functional programming is a specific way to look at problems and model their solutions. Pragmatically, functional programming is a coding style that exhibits the following characteristics: Power and flexibility. We can solve many general real-world problems using functional constructs Simplicity. Most functional programs exhibit a small set of keywords and concise syntax for expressing concepts Suitable for parallel processing. Via immutable values and operators, functional programs lend themselves to asynchronous and parallel processing. In functional programming, programs are executed by evaluating expressions, in contrast with imperative programming where programs are composed of statements which change global state when executed. Functional programming typically avoids using mutable state. Everything is a mathematical function. Functional programming languages can have objects, but generally those objects are immutable -- either arguments or return values to functions. There are no for/next loops, as those imply state changes. Instead, that type of looping is performed with recursion and by passing functions as arguments. Functional Programming vs. Imperative Programming: We can think of imperative programming as writing code that describes in exacting detail the steps the software must take to execute a given computation. These steps are generally expressed as a combination of statement executions, state changes, loops and conditional branches. Many programming languages support programming in both functional and imperative style but the syntax and facilities of a language are typically optimised for only one of these styles, and social factors like coding conventions and libraries often force the programmer towards one of the styles. Therefore, programming languages may be categorized into functional and imperative ones. Why Functional Programming? While you can develop concurrent, scalable and asynchronous software without embracing functional programming, it's simpler, safer, and easier to use the right tool for the job. Functional programming enables you to take advantage of multi-core systems, develop robust concurrent algorithms, parallelize compute-intensive algorithms, and to readily leverage the growing number of cloud computing platforms. Imagine you've implemented a large program in a purely functional way. All the data is properly threaded in and out of functions, and there are no truly destructive updates to speak of. Now pick the two lowest-level and most isolated functions in the entire codebase. They're used all over the place, but are never called from the same modules. Now make these dependent on each other: function A behaves differently depending on the number of times function B has been called and vice-versa. In addition, if you've not already explored non-imperative programming, it's a great way to expand your problem solving skills and your horizons. The new concepts that you'll learn will help you to look at many problems from a different perspective and will help you to become a better and more insightful OO programmer. I encourage everyone who wants to be a better programmer: consider learning a functional language. Haskell and OCaml are both great choices, and F# and Erlang are pretty good as well. It won’t be easy, but that is probably a good sign. Try and identify the difficult concepts you encounter and see if other people are leveraging them; frequently you can break through a mental roadblock by finding out what the intent of an unfamiliar abstraction really is. While you’re learning, do be careful not to take it too seriously. Like anything that requires time and effort, there is a danger of becoming over-invested in FP. Falling into this cognitive trap will ruin your investment. It’s easy to forget how many models of computation there are out there, and even easier to forget how much beautiful software has been written with any of them. It’s a narrow path to walk down, but on the other side, you emerge with more core concepts and models to leverage in your everyday programming. You will almost certainly become more comfortable with denser code, and will certainly gain new insights into how to be a better software engineer.
November 4, 2013
by Darshan Bobra
· 24,566 Views · 1 Like
article thumbnail
Writing Git Hooks Using Python
Since git hooks can be any executable script with an appropriate #! line, Python is more than suitable for writing your git hooks. Simply stated, git hooks are scripts which are called at different points of time in the life cycle of working with your git repository. Let’s start by creating a new git repository: ~/work> git init git-hooks-exp Initialized empty Git repository in /home/gene/work/git-hooks-exp/.git/ ~/work> cd git-hooks-exp/ ~/work/git-hooks-exp (master)> tree -al .git/ .git/ ├── branches ├── config ├── description ├── HEAD ├── hooks │ ├── applypatch-msg.sample │ ├── commit-msg.sample │ ├── post-update.sample │ ├── pre-applypatch.sample │ ├── pre-commit.sample │ ├── prepare-commit-msg.sample │ ├── pre-rebase.sample │ └── update.sample ├── info │ └── exclude ├── objects │ ├── info │ └── pack └── refs ├── heads └── tags 9 directories, 12 files Inside the .git are a number of directories and files, one of them being hooks/ which is where the hooks live. By default, you will have a number of hooks with the file names ending in .sample. They may be useful as starting points for your own scripts. However, since they all have an extension .sample, none of the hooks are actually activated. For a hook to be activated, it must have the right file name and it should be executable. Let’s see how we can write a hook using Python. We will write a post-commit hook. This hook is called immediately after you have made a commit. We are going to do something fairly useless, but quite interesting in this hook. We will take the commit SHA1 of this commit, and print how it may look like in a more human form. I do the latter using the humanhash module. You will need to have it installed. Here is how the hook looks like: #!/usr/bin/python import subprocess import humanhash # get the last commit SHA and print it after humanizing it # https://github.com/zacharyvoase/humanhash print humanhash.humanize( subprocess.check_output( ['git','rev-parse','HEAD'])) I use the subprocess.check_output() function to execute the command git rev-parse HEAD so that I can get the commit SHA1 and then call the humanhash.humanize() function with it. Save the hook as a file, post-commit in your hooks/ directory and make it executable using chmod +x .git/hooks/post-commit. Let’s see the hook in action: ~/work/git-hooks-exp (master)> touch file ~/work/git-hooks-exp (master)> git add file ~/work/git-hooks-exp (master)> git commit -m "Added a file" carbon-network-connecticut-equal [master (root-commit) 2d7880b] Added a file 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 file The commit SHA1 for the commit turned out to be 2d7880be746a1c1e75844fc1aa161e2b8d955427. Let’s check it with the humanize function and check if we get the same message as above: >>> humanhash.humanize('2d7880be746a1c1e75844fc1aa161e2b8d955427') 'carbon-network-connecticut-equal' And you can see the same message above as well. For some of the hooks, you will see that they are called with some parameters. In Python you can access them using the sys.argv attribute from the sys module, with the first member being the name of the hook of course and the others will be the parameters that the hook is called with.
October 31, 2013
by Amit Saha
· 13,587 Views
article thumbnail
Good Clean Python Install on Mavericks OSX 10.9 8
I just spent the better part of an hour trying to effectively install a version of python that doesn’t affect the rest of my system. This is usually pretty trivial effort according to most online articles. There are a few subtleties you might want to be aware of before you go and and well you know. I am writing this mostly for myself so pardon the casual nature, and/or grammatical errors. So you have a clean install of Mavericks, the world is beautiful. You have a new beautiful new background and you have sudden inspiration to get some work done. So many guides will tell you to do simply use system python which comes with easy_install and install pip with sudo. Don’t do this, For many reasons. Homebrew always has the most recent version (currently 2.7.5). Apple has made significant changes to its bundled Python, potentially resulting in hidden bugs. Homebrew’s Python includes the latest Python package management tools: pip and Setuptools First thing you do is install homebrew. Then do a little something like this. brew install python Once that is done, you will want to update your PATH variable to include the following directories in PATH /usr/local: /usr/local/bin: You want to put local on PATH because it contains other important directories like lib and sbin. What we are doing here is modifying the PATH variable which your system looks for commands, it will traverse the paths until it finds a match and use that. We want it to use copy of python in /usr/local You can put these modifications in your .zshrc or .bashrc whichever floats your boat. Now to test to see if you are using the correct python on your system (you have multiple copies now after the brew install) you want to you use: which python If you get /usr/bin/python you either didn’t set the PATH properly or didn’t reload either .zshrc or .bashrc (here is a hint: restart your terminal) If you get a usr/local/bin/python you are winning. You will also want to make sure you are using the correct copy of pip on your system as well, use the same steps as above. Never install anything with sudo. Don’t listen to random READMEs online. You make the decision, my son. Now you install things you want system wide without having to worry about breaking your system’s copy of Python. Swaggin'.
October 29, 2013
by Mahdi Yusuf
· 29,643 Views
  • Previous
  • ...
  • 417
  • 418
  • 419
  • 420
  • 421
  • 422
  • 423
  • 424
  • 425
  • 426
  • ...
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×