The Latest Data Engineering Topics

Getting started with CQEngine: LINQ for Java, Only Faster
CQEngine or collection query engine is a library that allows you to build indices over java collections and query them for objects using exposed properties. It offers similar capability to LINQ in .net but is thought to be faster because it builds indices over collections before querying them and uses set theory instead of iterations. In this post we will see how to query a simple collection of objects, in our example, a collection of users of a hypothetical system, using CQEngine. In a subsequent post, we will also see how iteratively searching a collection compares to querying via CQEngine. The first step is to get the CQEengine jar file. Download the jar from the CQEngine website or if you are using Maven, add the following dependency. com.googlecode.cqengine cqengine 1.0.3 Next, lets create the Class whose object we will be searching for: package co.syntx.examples.cqengine; import com.googlecode.cqengine.attribute.Attribute; import com.googlecode.cqengine.attribute.SimpleAttribute; public class User { private String username; private String password; private String fullname; private Role role; public User(String username, String password, String fullname, Role role) { super(); this.username = username; this.password = password; this.fullname = fullname; this.role = role; } public static final Attribute FULL_NAME = new SimpleAttribute("fullname") { public String getValue(User user) { return user.fullname; } }; public static final Attribute USERNAME = new SimpleAttribute("username") { public String getValue(User user) { return user.username; } }; public String getUsername() { return username; } public void setUsername(String username) { this.username = username; } public String getPassword() { return password; } public void setPassword(String password) { this.password = password; } public String getFullname() { return fullname; } public void setFullname(String fullname) { this.fullname = fullname; } public Role getRole() { return role; } public void setRole(Role role) { this.role = role; } } Next, we write a class, to perform our searches. I will go function by function. 1. Function to Build a Test Indexed Collection: In the following function, we build an indexed collection, define indices on attributes, and populate this collection with a certain number of objects. In actual usage, your collection will probably be filled with objects being returned from the DB, read from a file or other similar scenarios. public void buildIndexedCollection(int size) throws Exception { indexedUsers = CQEngine.newInstance(); indexedUsers.addIndex(HashIndex.onAttribute(User.FULL_NAME)); indexedUsers.addIndex(SuffixTreeIndex.onAttribute(User.FULL_NAME)); for (int i = 0; i < size; i++) { String username = RandomStringGenerator.generateRandomString(8,RandomStringGenerator.Mode.ALPHANUMERIC); String password = RandomStringGenerator.generateRandomString(8,RandomStringGenerator.Mode.ALPHANUMERIC); String fullname = RandomStringGenerator.generateRandomString(5,RandomStringGenerator.Mode.ALPHA) + " " + RandomStringGenerator.generateRandomString(5,RandomStringGenerator.Mode.ALPHA); Role role = new Role(); role.setName("admin"); indexedUsers.add(new User(username, password, fullname, role)); } } In line 3 we are initializing a new Indexed Collection, a reference of which is stored in the class variable indexedUsers. The reference is of type IndexedCollection In lines 4 and 5, we define two indices i) a Hash Index suitable for equal style queries. ii) a Suffix Index suitable for ends with style queried. 
For the purpose of this example, we are building indices only on the full name field. In line 8, we use a random string generator to populate dummy objects. In line 14 we add our object to our indexed collection. 2. Function to Perform Indexed Search for Exact Matches: In this function, we query for names that exactly match a given name. The equal function takes the attribute upon which to perform the query and the value to search for. The method equal is statically imported via import static com.googlecode.cqengine.query.QueryFactory.*; In the example below, we loop over the results returned by the retrieve method without doing anything with them. In your case, you may choose to return the Iterator returned by retrieve. public void indexedSearchForEquals(String fullname) throws Exception { Query query = equal(User.FULL_NAME, fullname); for (User user : indexedUsers.retrieve(query)) { // System.out.println(user.getFullname()); } } 3. Function to Perform Indexed Search for Ends With Matches: In this function, we query for names that end with a certain suffix. public void indexedSearchForEndsWith(String endswith) throws Exception { Query query1 = endsWith(User.FULL_NAME, endswith); for (User user : indexedUsers.retrieve(query1)) { // System.out.println(user.getFullname()); } } 4. Function to Perform Indexed Search for Equals or Ends With Matches: This function combines both queries described above with an or relation between them. public void indexedSearchForEqualOrEndsWith(String equals, String ends) throws Exception { Query query = or(equal(User.FULL_NAME, equals), endsWith(User.FULL_NAME, ends)); for (User user : indexedUsers.retrieve(query)) { // System.out.println(user.getFullname()); } } 5. Putting It Together: All the functions above belong to a class called CQEngineTest. We create a new object, build a test collection, then search for exact matches, strings that end with a certain suffix, or either. CQEngineTest test = new CQEngineTest(); test.buildIndexedCollection(size); test.indexedSearchForEqualOrEndsWith("test", "test"); In this example, we have used a Hash Index and a Suffix Tree Index. There are many other types of indices that you can choose depending on the type of query that you want to perform. A list of these indices and when to use them can be found on the CQEngine project page. Also, apart from equal or endsWith operations, there are other operations that you would typically expect to find. In a subsequent post we will also see how an indexed search compares with a typical iterative search in terms of timing.
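For comparison, here is a minimal sketch (not from the original article) of the kind of un-indexed, iterative search the follow-up post alludes to. It assumes nothing beyond a plain java.util.Collection<User> and the User class defined above, and it is the baseline an indexed CQEngine query is meant to beat:

// Baseline: walk the whole collection and filter by hand.
public List<User> iterativeSearchForEqualOrEndsWith(Collection<User> users, String equals, String ends) {
    List<User> matches = new ArrayList<User>();
    for (User user : users) {
        String fullname = user.getFullname();
        // Same predicate as the indexed "equal or endsWith" query above.
        if (fullname.equals(equals) || fullname.endsWith(ends)) {
            matches.add(user);
        }
    }
    return matches;
}

Every call scans the entire collection, so the cost grows linearly with its size, whereas the HashIndex and SuffixTreeIndex defined earlier can answer the same queries from pre-built index structures.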
August 5, 2013
by Faheem Sohail
· 27,304 Views · 1 Like
JPA Searching Using Lucene - A Working Example with Spring and DBUnit
Working Example on Github There's a small, self contained mavenised example project over on Github to accompany this post - check it out here:https://github.com/adrianmilne/jpa-lucene-spring-demo Running the Demo See the README file over on GitHub for details of running the demo. Essentially - it's just running the Unit Tests, with the usual maven build and test results output to the console - example below. This is the result of running the DBUnit test, which inserts Book data into the HSQL database using JPA, and then uses Lucene to query the data, testing that the expected Books are returned (i.e. only those int he SCI-FI category, containing the word 'Space', and ensuring that any with 'Space' in the title appear before those with 'Space' only in the description. The Book Entity Our simple example stores Books. The Book entity class below is a standard JPA Entity with a few additional annotations to identify it to Lucene: @Indexed - this identifies that the class will be added to the Lucene index. You can define a specific index by adding the 'index' attribute to the annotation. We're just choosing the simplest, minimal configuration for this example. In addition to this - you also need to specify which properties on the entity are to be indexed, and how they are to be indexed. For our example we are again going for the default option by just adding an @Field annotation with no extra parameters. We are adding one other annotation to the 'title' field - @Boost - this is just telling Lucene to give more weight to search term matches that appear in this field (than the same term appearing in the description field). This example is purposefully kept minimal in terms of the ins-and-outs of Lucene (I may cover that in a later post) - we're really just concentrating on the integration with JPA and Spring for now. package com.cor.demo.jpa.entity; import javax.persistence.Entity; import javax.persistence.EnumType; import javax.persistence.Enumerated; import javax.persistence.GeneratedValue; import javax.persistence.Id; import javax.persistence.Lob; import org.hibernate.search.annotations.Boost; import org.hibernate.search.annotations.Field; import org.hibernate.search.annotations.Indexed; /** * Book JPA Entity. */ @Entity @Indexed public class Book { @Id @GeneratedValue private Long id; @Field @Boost(value = 1.5f) private String title; @Field @Lob private String description; @Field @Enumerated(EnumType.STRING) private BookCategory category; public Book(){ } public Book(String title, BookCategory category, String description){ this.title = title; this.category = category; this.description = description; } public Long getId() { return id; } public void setId(Long id) { this.id = id; } public String getTitle() { return title; } public void setTitle(String title) { this.title = title; } public BookCategory getCategory() { return category; } public void setCategory(BookCategory category) { this.category = category; } public String getDescription() { return description; } public void setDescription(String description) { this.description = description; } @Override public String toString() { return "Book [id=" + id + ", title=" + title + ", description=" + description + ", category=" + category + "]"; } } The Book Manager The BookManager class acts as a simple service layer for the Book operations - used for adding books and searching books. As you can see, the JPA database resources are autowired in by Spring from the application-context.xml. 
We are just using an in-memory hsql database in this example. package com.cor.demo.jpa.manager; import java.util.List; import javax.persistence.EntityManager; import javax.persistence.PersistenceContext; import javax.persistence.PersistenceContextType; import javax.persistence.Query; import org.hibernate.search.jpa.FullTextEntityManager; import org.hibernate.search.jpa.Search; import org.hibernate.search.query.dsl.QueryBuilder; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.springframework.context.annotation.Scope; import org.springframework.stereotype.Component; import org.springframework.transaction.annotation.Transactional; import com.cor.demo.jpa.entity.Book; import com.cor.demo.jpa.entity.BookCategory; /** * Manager for persisting and searching on Books. Uses JPA and Lucene. */ @Component @Scope(value = "singleton") public class BookManager { /** Logger. */ private static Logger LOG = LoggerFactory.getLogger(BookManager.class); /** JPA Persistence Unit. */ @PersistenceContext(type = PersistenceContextType.EXTENDED, name = "booksPU") private EntityManager em; /** Hibernate Full Text Entity Manager. */ private FullTextEntityManager ftem; /** * Method to manually update the Full Text Index. This is not required if inserting entities * using this Manager as they will automatically be indexed. Useful though if you need to index * data inserted using a different method (e.g. pre-existing data, or test data inserted via * scripts or DbUnit). */ public void updateFullTextIndex() throws Exception { LOG.info("Updating Index"); getFullTextEntityManager().createIndexer().startAndWait(); } /** * Add a Book to the Database. */ @Transactional public Book addBook(Book book) { LOG.info("Adding Book : " + book); em.persist(book); return book; } /** * Delete All Books. */ @SuppressWarnings("unchecked") @Transactional public void deleteAllBooks() { LOG.info("Delete All Books"); Query allBooks = em.createQuery("select b from Book b"); List books = allBooks.getResultList(); // We need to delete individually (rather than a bulk delete) to ensure they are removed // from the Lucene index correctly for (Book b : books) { em.remove(b); } } @SuppressWarnings("unchecked") @Transactional public void listAllBooks() { LOG.info("List All Books"); LOG.info("------------------------------------------"); Query allBooks = em.createQuery("select b from Book b"); List books = allBooks.getResultList(); for (Book b : books) { LOG.info(b.toString()); getFullTextEntityManager().index(b); } } /** * Search for a Book. 
*/ @SuppressWarnings("unchecked") @Transactional public List search(BookCategory category, String searchString) { LOG.info("------------------------------------------"); LOG.info("Searching Books in category '" + category + "' for phrase '" + searchString + "'"); // Create a Query Builder QueryBuilder qb = getFullTextEntityManager().getSearchFactory().buildQueryBuilder().forEntity(Book.class).get(); // Create a Lucene Full Text Query org.apache.lucene.search.Query luceneQuery = qb.bool() .must(qb.keyword().onFields("title", "description").matching(searchString).createQuery()) .must(qb.keyword().onField("category").matching(category).createQuery()).createQuery(); Query fullTextQuery = getFullTextEntityManager().createFullTextQuery(luceneQuery, Book.class); // Run Query and print out results to console List result = (List) fullTextQuery.getResultList(); // Log the Results LOG.info("Found Matching Books :" + result.size()); for (Book b : result) { LOG.info(" - " + b); } return result; } /** * Convenience method to get Full Test Entity Manager. Protected scope to assist mocking in Unit * Tests. * @return Full Text Entity Manager. */ protected FullTextEntityManager getFullTextEntityManager() { if (ftem == null) { ftem = Search.getFullTextEntityManager(em); } return ftem; } /** * Get the JPA Entity Manager (required for the DBUnit Tests). * @return Entity manager */ protected EntityManager getEntityManager() { return em; } /** * Sets the JPA Entity Manager (required to assist with mocking in Unit Test) * @param em EntityManager */ protected void setEntityManager(EntityManager em) { this.em = em; } } application-context.xml This is the Spring configuration file. You can see in the JPA Entity Manager configuration the key for 'hibernate.search.default.indexBase' is added to the jpaPropertyMap to tell Lucene where to create the index. We have also externalised the database login credentials to a properties file (as you may wish to change these for different environments), for example by updating the propertyConfigurer to look for and use a different external properties if it finds one on the file system). classpath:/system.properties Testing Using DBUnit In the project is an example of using DBUnit with Spring to test adding and searching against the database using DBUnit to populate the database with test data, exercise the Book Manager search operations and then clean the database down. This is a great way to test database functionality and can be easily integrated into maven and continuous build environments. Because DBUnit bypasses the standard JPA insertion calls - the data does not get automatically added to the Lucene index. We have a method exposed on the service interface to update the Full Text index 'updateFullTextIndex()' - calling this causes Lucene to update the index with the current data in the database. This can be useful when you are adding search to pre-populated databases to index the existing content. 
package com.cor.demo.jpa.manager; import java.io.InputStream; import java.util.List; import org.dbunit.DBTestCase; import org.dbunit.database.DatabaseConnection; import org.dbunit.database.IDatabaseConnection; import org.dbunit.dataset.IDataSet; import org.dbunit.dataset.xml.FlatXmlDataSetBuilder; import org.dbunit.operation.DatabaseOperation; import org.hibernate.impl.SessionImpl; import org.junit.After; import org.junit.Before; import org.junit.Test; import org.junit.runner.RunWith; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.springframework.beans.factory.annotation.Autowired; import org.springframework.test.context.ContextConfiguration; import org.springframework.test.context.junit4.SpringJUnit4ClassRunner; import com.cor.demo.jpa.entity.Book; import com.cor.demo.jpa.entity.BookCategory; /** * DBUnit Test - loads data defined in 'test-data-set.xml' into the database to run tests against the * BookManager. More thorough (and ultimately easier in this context) than using mocks. */ @RunWith(SpringJUnit4ClassRunner.class) @ContextConfiguration(locations = { "classpath:/application-context.xml" }) public class BookManagerDBUnitTest extends DBTestCase { /** Logger. */ private static Logger LOG = LoggerFactory.getLogger(BookManagerDBUnitTest.class); /** Book Manager Under Test. */ @Autowired private BookManager bookManager; @Before public void setup() throws Exception { DatabaseOperation.CLEAN_INSERT.execute(getDatabaseConnection(), getDataSet()); } @After public void tearDown() { deleteBooks(); } @Override protected IDataSet getDataSet() throws Exception { InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("test-data-set.xml"); FlatXmlDataSetBuilder builder = new FlatXmlDataSetBuilder(); return builder.build(inputStream); } /** * Get the underlying database connection from the JPA Entity Manager (DBUnit needs this connection). * @return Database Connection * @throws Exception */ private IDatabaseConnection getDatabaseConnection() throws Exception { return new DatabaseConnection(((SessionImpl) (bookManager.getEntityManager().getDelegate())).connection()); } /** * Tests the expected results for searching for 'Space' in SCF-FI books. */ @Test public void testSciFiBookSearch() throws Exception { bookManager.listAllBooks(); bookManager.updateFullTextIndex(); List results = bookManager.search(BookCategory.SCIFI, "Space"); assertEquals("Expected 2 results for SCI FI search for 'Space'", 2, results.size()); assertEquals("Expected 1st result to be '2001: A Space Oddysey'", "2001: A Space Oddysey", results.get(0).getTitle()); assertEquals("Expected 2nd result to be 'Apollo 13'", "Apollo 13", results.get(1).getTitle()); } private void deleteBooks() { LOG.info("Deleting Books...-"); bookManager.deleteAllBooks(); } } The source data for the test is defined in an xml file.
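As a rough usage sketch (not part of the original project — the Spring lookup and the book titles here are illustrative assumptions), the pieces above fit together like this outside the DBUnit test:

// Hypothetical usage of the BookManager defined above, assuming it is obtained
// from the Spring context configured in application-context.xml.
BookManager bookManager = applicationContext.getBean(BookManager.class);

// Books persisted through the manager are indexed by Hibernate Search automatically.
bookManager.addBook(new Book("2001: A Space Odyssey", BookCategory.SCIFI, "A monolith and a long voyage through space"));
bookManager.addBook(new Book("Solaris", BookCategory.SCIFI, "A crew studies a strange ocean planet"));

// Only needed when data was inserted behind JPA's back (e.g. via DBUnit or SQL scripts).
bookManager.updateFullTextIndex();

List<Book> results = bookManager.search(BookCategory.SCIFI, "Space");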
August 5, 2013
by Adrian Milne
· 31,021 Views
What Is NoSQL?
Dan McCreary and Ann Kelly, authors of 'Making Sense of NoSQL,' discuss the business drivers and motivations that make NoSQL so popular with organizations today.
August 1, 2013
by Eric Gregory
· 20,876 Views · 4 Likes
Jersey Client: Testing External Calls
Jim and I have been doing a bit of work over the last week which involved calling neo4j’s HA status URI to check whether or not an instance was a master/slave and we’ve been using jersey-client. The code looked roughly like this: class Neo4jInstance { private Client httpClient; private URI hostname; public Neo4jInstance(Client httpClient, URI hostname) { this.httpClient = httpClient; this.hostname = hostname; } public Boolean isSlave() { String slaveURI = hostname.toString() + ":7474/db/manage/server/ha/slave"; ClientResponse response = httpClient.resource(slaveURI).accept(TEXT_PLAIN).get(ClientResponse.class); return Boolean.parseBoolean(response.getEntity(String.class)); } } While writing some tests against this code we wanted to stub out the actual calls to the HA slave URI so we could simulate both conditions and a brief search suggested that mockito was the way to go. We ended up with a test that looked like this: @Test public void shouldIndicateInstanceIsSlave() { Client client = mock( Client.class ); WebResource webResource = mock( WebResource.class ); WebResource.Builder builder = mock( WebResource.Builder.class ); ClientResponse clientResponse = mock( ClientResponse.class ); when( builder.get( ClientResponse.class ) ).thenReturn( clientResponse ); when( clientResponse.getEntity( String.class ) ).thenReturn( "true" ); when( webResource.accept( anyString() ) ).thenReturn( builder ); when( client.resource( anyString() ) ).thenReturn( webResource ); Boolean isSlave = new Neo4jInstance(client, URI.create("http://localhost")).isSlave(); assertTrue(isSlave); } which is pretty gnarly but does the job. I thought there must be a better way so I continued searching and eventually came across this post on the mailing list which suggested creating a custom ClientHandler and stubbing out requests/responses there. 
I had a go at doing that and wrapped it with a little DSL that only covers our very specific use case: private static ClientBuilder client() { return new ClientBuilder(); } static class ClientBuilder { private String uri; private int statusCode; private String content; public ClientBuilder requestFor(String uri) { this.uri = uri; return this; } public ClientBuilder returns(int statusCode) { this.statusCode = statusCode; return this; } public Client create() { return new Client() { public ClientResponse handle(ClientRequest request) throws ClientHandlerException { if (request.getURI().toString().equals(uri)) { InBoundHeaders headers = new InBoundHeaders(); headers.put("Content-Type", asList("text/plain")); return createDummyResponse(headers); } throw new RuntimeException("No stub defined for " + request.getURI()); } }; } private ClientResponse createDummyResponse(InBoundHeaders headers) { return new ClientResponse(statusCode, headers, new ByteArrayInputStream(content.getBytes()), messageBodyWorkers()); } private MessageBodyWorkers messageBodyWorkers() { return new MessageBodyWorkers() { public Map> getReaders(MediaType mediaType) { return null; } public Map> getWriters(MediaType mediaType) { return null; } public String readersToString(Map> mediaTypeListMap) { return null; } public String writersToString(Map> mediaTypeListMap) { return null; } public MessageBodyReader getMessageBodyReader(Class tClass, Type type, Annotation[] annotations, MediaType mediaType) { return (MessageBodyReader) new StringProvider(); } public MessageBodyWriter getMessageBodyWriter(Class tClass, Type type, Annotation[] annotations, MediaType mediaType) { return null; } public List getMessageBodyWriterMediaTypes(Class tClass, Type type, Annotation[] annotations) { return null; } public MediaType getMessageBodyWriterMediaType(Class tClass, Type type, Annotation[] annotations, List mediaTypes) { return null; } }; } public ClientBuilder content(String content) { this.content = content; return this; } } If we change our test to use this code it now looks like this: @Test public void shouldIndicateInstanceIsSlave() { Client client = client().requestFor("http://localhost:7474/db/manage/server/ha/slave"). returns(200). content("true"). create(); Boolean isSlave = new Neo4jInstance(client, URI.create("http://localhost")).isSlave(); assertTrue(isSlave); } Is there a better way? In Ruby I’ve used WebMock to achieve this and Ashok pointed me towards WebStub which looks nice except I’d need to pass in the hostname + port rather than constructing that in the code.
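The same DSL also covers the opposite case. Here is a sketch (not in the original post) that stubs the HA endpoint to return "false" and asserts the instance is not a slave:

@Test
public void shouldIndicateInstanceIsNotSlave() {
    // Reuse the stubbing DSL above to simulate a master responding "false".
    Client client = client().requestFor("http://localhost:7474/db/manage/server/ha/slave").
            returns(200).
            content("false").
            create();

    Boolean isSlave = new Neo4jInstance(client, URI.create("http://localhost")).isSlave();

    assertFalse(isSlave);
}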
August 1, 2013
by Mark Needham
· 10,577 Views
AWS: Attaching an EBS volume to an EC2 instance and making it available for use
I recently wanted to attach an EBS volume to an existing EC2 instance that I had running and since it was for a one off tasks (famous last words) I decided to configure it manually. I created the EBS volume through the AWS console and one thing that initially caught me out is that the EC2 instance and EBS volume need to be in the same region and zone. Therefore if I create my EC2 instance in ‘eu-west-1b’ then I need to create my EBS volume in ‘eu-west-1b’ as well otherwise I won’t be able to attach it to that instance. I attached the device as /dev/sdf although the UI gives the following warning: Linux Devices: /dev/sdf through /dev/sdp Note: Newer linux kernels may rename your devices to /dev/xvdf through /dev/xvdp internally, even when the device name entered here (and shown in the details) is /dev/sdf through /dev/sdp. After attaching the EBS volume to the EC2 instance my next step was to SSH onto my EC2 instance and make the EBS volume available. The first step is to create a file system on the volume: $ sudo mkfs -t ext3 /dev/sdf mke2fs 1.42 (29-Nov-2011) Could not stat /dev/sdf --- No such file or directory The device apparently does not exist; did you specify it correctly? It turns out that warning was handy and the device has in fact been renamed. We can confirm this by callingfdisk: $ sudo fdisk -l Disk /dev/xvda1: 8589 MB, 8589934592 bytes 255 heads, 63 sectors/track, 1044 cylinders, total 16777216 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x00000000 Disk /dev/xvda1 doesn't contain a valid partition table Disk /dev/xvdf: 53.7 GB, 53687091200 bytes 255 heads, 63 sectors/track, 6527 cylinders, total 104857600 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x00000000 Disk /dev/xvdf doesn't contain a valid partition table /dev/xvdf is the one we’re interested in so I re-ran the previous command: $ sudo mkfs -t ext3 /dev/xvdf mke2fs 1.42 (29-Nov-2011) Filesystem label= OS type: Linux Block size=4096 (log=2) Fragment size=4096 (log=2) Stride=0 blocks, Stripe width=0 blocks 3276800 inodes, 13107200 blocks 655360 blocks (5.00%) reserved for the super user First data block=0 Maximum filesystem blocks=4294967296 400 block groups 32768 blocks per group, 32768 fragments per group 8192 inodes per group Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424 Allocating group tables: done Writing inode tables: done Creating journal (32768 blocks): done Writing superblocks and filesystem accounting information: done Once I’d done that I needed to create a mount point for the volume and I thought the best place was probably a directory under /mnt: $ sudo mkdir /mnt/ebs The final step is to mount the volume: $ sudo mount /dev/xvdf /mnt/ebs And if we run df we can see that it’s ready to go: $ df -h Filesystem Size Used Avail Use% Mounted on /dev/xvda1 7.9G 883M 6.7G 12% / udev 288M 8.0K 288M 1% /dev tmpfs 119M 164K 118M 1% /run none 5.0M 0 5.0M 0% /run/lock none 296M 0 296M 0% /run/shm /dev/xvdf 50G 180M 47G 1% /mnt/ebs
July 31, 2013
by Mark Needham
· 11,586 Views
OLAP Operation in R
OLAP (Online Analytical Processing) is a very common way to analyze raw transaction data by aggregating along different combinations of dimensions. This is a well-established field in business intelligence and reporting. In this post, I will highlight the key ideas in OLAP operations and illustrate how to perform them in R. Facts and Dimensions The core of OLAP is the so-called "multi-dimensional data model", which contains two types of tables: "fact" tables and "dimension" tables. A fact table contains records that each describe an instance of a transaction. Each transaction record contains categorical attributes (which describe contextual aspects of the transaction, such as space, time, and user) as well as numeric attributes (called "measures", which describe quantitative aspects of the transaction, such as number of items sold or dollar amount). A dimension table contains records that further elaborate the contextual attributes, such as user profile data, location details, etc. In a typical multi-dimensional model, each fact table contains foreign keys that reference the primary keys of multiple dimension tables. In its simplest form, this is called a STAR schema. Dimension tables can contain foreign keys that reference other dimension tables, which provides a more detailed breakdown of the contextual aspects; this is called a SNOWFLAKE schema. As a general (though not hard) rule, fact tables are independent of one another and usually don't contain reference pointers among each other, but different fact tables usually share the same set of dimension tables; this is called a GALAXY schema. It is a hard rule, however, that a dimension table never references a fact table. A simple STAR schema is shown in the following diagram. Each dimension can also be hierarchical, so that the analysis can be done at different degrees of granularity. For example, the time dimension can be broken down into days, weeks, months, quarters and years; similarly, the location dimension can be broken down into countries, states, cities, etc. Here we first create a sales fact table that records each sales transaction.
# Setup the dimension tables state_table <- data.frame(key=c("CA", "NY", "WA", "ON", "QU"), name=c("California", "new York", "Washington", "Ontario", "Quebec"), country=c("USA", "USA", "USA", "Canada", "Canada")) month_table <- data.frame(key=1:12, desc=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), quarter=c("Q1","Q1","Q1","Q2","Q2","Q2","Q3","Q3","Q3","Q4","Q4","Q4")) prod_table <- data.frame(key=c("Printer", "Tablet", "Laptop"), price=c(225, 570, 1120)) # Function to generate the Sales table gen_sales <- function(no_of_recs) { # Generate transaction data randomly loc <- sample(state_table$key, no_of_recs, replace=T, prob=c(2,2,1,1,1)) time_month <- sample(month_table$key, no_of_recs, replace=T) time_year <- sample(c(2012, 2013), no_of_recs, replace=T) prod <- sample(prod_table$key, no_of_recs, replace=T, prob=c(1, 3, 2)) unit <- sample(c(1,2), no_of_recs, replace=T, prob=c(10, 3)) amount <- unit*prod_table[prod,]$price sales <- data.frame(month=time_month, year=time_year, loc=loc, prod=prod, unit=unit, amount=amount) # Sort the records by time order sales <- sales[order(sales$year, sales$month),] row.names(sales) <- NULL return(sales) } # Now create the sales fact table sales_fact <- gen_sales(500) # Look at a few records head(sales_fact) month year loc prod unit amount 1 1 2012 NY Laptop 1 225 2 1 2012 CA Laptop 2 450 3 1 2012 ON Tablet 2 2240 4 1 2012 NY Tablet 1 1120 5 1 2012 NY Tablet 2 2240 6 1 2012 CA Laptop 1 225 Multi-dimensional Cube Now, we turn this fact table into a hypercube with multiple dimensions. Each cell in the cube represents an aggregate value for a unique combination of each dimension. # Build up a cube revenue_cube <- tapply(sales_fact$amount, sales_fact[,c("prod", "month", "year", "loc")], FUN=function(x){return(sum(x))}) # Showing the cells of the cude revenue_cube , , year = 2012, loc = CA month prod 1 2 3 4 5 6 7 8 9 10 11 12 Laptop 1350 225 900 675 675 NA 675 1350 NA 1575 900 1350 Printer NA 2280 NA NA 1140 570 570 570 NA 570 1710 NA Tablet 2240 4480 12320 3360 2240 4480 3360 3360 5600 2240 2240 3360 , , year = 2013, loc = CA month prod 1 2 3 4 5 6 7 8 9 10 11 12 Laptop 225 225 450 675 225 900 900 450 675 225 675 1125 Printer NA 1140 NA 1140 570 NA NA 570 NA 1140 1710 1710 Tablet 3360 3360 1120 4480 2240 1120 7840 3360 3360 1120 5600 4480 , , year = 2012, loc = NY month prod 1 2 3 4 5 6 7 8 9 10 11 12 Laptop 450 450 NA NA 675 450 675 NA 225 225 NA 450 Printer NA 2280 NA 2850 570 NA NA 1710 1140 NA 570 NA Tablet 3360 13440 2240 2240 2240 5600 5600 3360 4480 3360 4480 3360 , , year = 2013, loc = NY ..... dimnames(revenue_cube) $prod [1] "Laptop" "Printer" "Tablet" $month [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" $year [1] "2012" "2013" $loc [1] "CA" "NY" "ON" "QU" "WA" OLAP Operations Here are some common operations of OLAP Slice Dice Rollup Drilldown Pivot "Slice" is about fixing certain dimensions to analyze the remaining dimensions. For example, we can focus in the sales happening in "2012", "Jan", or we can focus in the sales happening in "2012", "Jan", "Tablet". # Slice # cube data in Jan, 2012 revenue_cube[, "1", "2012",] loc prod CA NY ON QU WA Laptop 1350 450 NA 225 225 Printer NA NA NA 1140 NA Tablet 2240 3360 5600 1120 2240 # cube data in Jan, 2012 revenue_cube["Tablet", "1", "2012",] CA NY ON QU WA 2240 3360 5600 1120 2240 "Dice" is about limited each dimension to a certain range of values, while keeping the number of dimensions the same in the resulting cube. 
For example, we can focus in sales happening in [Jan/ Feb/Mar, Laptop/Tablet, CA/NY]. revenue_cube[c("Tablet","Laptop"), c("1","2","3"), , c("CA","NY")] , , year = 2012, loc = CA month prod 1 2 3 Tablet 2240 4480 12320 Laptop 1350 225 900 , , year = 2013, loc = CA month prod 1 2 3 Tablet 3360 3360 1120 Laptop 225 225 450 , , year = 2012, loc = NY month prod 1 2 3 Tablet 3360 13440 2240 Laptop 450 450 NA , , year = 2013, loc = NY month prod 1 2 3 Tablet 3360 4480 6720 Laptop 450 NA 225 "Rollup" is about applying an aggregation function to collapse a number of dimensions. For example, we want to focus in the annual revenue for each product and collapse the location dimension (ie: we don't care where we sold our product). apply(revenue_cube, c("year", "prod"), FUN=function(x) {return(sum(x, na.rm=TRUE))}) prod year Laptop Printer Tablet 2012 22275 31350 179200 2013 25200 33060 166880 "Drilldown" is the reverse of "rollup" and applying an aggregation function to a finer level of granularity. For example, we want to focus in the annual and monthly revenue for each product and collapse the location dimension (ie: we don't care where we sold our product). apply(revenue_cube, c("year", "month", "prod"), FUN=function(x) {return(sum(x, na.rm=TRUE))}) , , prod = Laptop month year 1 2 3 4 5 6 7 8 9 10 11 12 2012 2250 2475 1575 1575 2250 1800 1575 1800 900 2250 1350 2475 2013 2250 900 1575 1575 2250 2475 2025 1800 2025 2250 3825 2250 , , prod = Printer month year 1 2 3 4 5 6 7 8 9 10 11 12 2012 1140 5700 570 3990 4560 2850 1140 2850 2850 1710 3420 570 2013 1140 4560 3420 4560 2850 1140 570 3420 1140 3420 3990 2850 , , prod = Tablet month year 1 2 3 4 5 6 7 8 9 10 11 12 2012 14560 23520 17920 12320 10080 14560 13440 15680 25760 12320 11200 7840 2013 8960 11200 10080 7840 14560 10080 29120 15680 15680 8960 12320 22400 "Pivot" is about analyzing the combination of a pair of selected dimensions. For example, we want to analyze the revenue by year and month. Or we want to analyze the revenue by product and location. apply(revenue_cube, c("year", "month"), FUN=function(x) {return(sum(x, na.rm=TRUE))}) month year 1 2 3 4 5 6 7 8 9 10 11 12 2012 17950 31695 20065 17885 16890 19210 16155 20330 29510 16280 15970 10885 2013 12350 16660 15075 13975 19660 13695 31715 20900 18845 14630 20135 27500 apply(revenue_cube, c("prod", "loc"), FUN=function(x) {return(sum(x, na.rm=TRUE))}) loc prod CA NY ON QU WA Laptop 16425 9450 7650 7425 6525 Printer 15390 19950 7980 10830 10260 Tablet 90720 117600 45920 34720 57120 I hope you can get a taste of the richness of data processing model in R. However, since R is doing all the processing in RAM. This requires your data to be small enough so it can fit into the local memory in a single machine.
July 30, 2013
by Ricky Ho
· 17,551 Views · 3 Likes
JMS vs RabbitMQ
Definition: JMS (Java Message Service) is an API, part of Java EE, for sending messages between two or more clients. There are many JMS providers, such as OpenMQ (GlassFish's default), HornetQ (JBoss), and ActiveMQ. RabbitMQ is an open source message broker that implements the AMQP standard and is written in Erlang. Messaging model: JMS supports two models: point-to-point and publish/subscribe. RabbitMQ follows the AMQP model, which defines four exchange types: direct, fanout, topic, and headers. Data types: JMS supports five different message types, but RabbitMQ carries only binary payloads. Workflow strategy: In AMQP, producers send to an exchange, which routes messages to queues; in JMS, producers send to the queue or topic directly. Technology compatibility: JMS is specific to Java clients, while RabbitMQ has client libraries for many technologies. Performance: If you would like to know more about their performance, this benchmark is a good place to start, but look for others as well.
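To make the JMS side of the comparison concrete, here is a minimal point-to-point send sketch using the standard javax.jms API (the ConnectionFactory and Queue are assumed to be supplied by whichever provider — OpenMQ, HornetQ, ActiveMQ — you are using):

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

public class JmsSendExample {

    // Sends a single text message to a queue (the JMS point-to-point model).
    public static void send(ConnectionFactory factory, Queue queue, String text) throws Exception {
        Connection connection = factory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(queue);
            TextMessage message = session.createTextMessage(text);
            producer.send(message);
        } finally {
            connection.close();
        }
    }
}

In RabbitMQ the equivalent publish would go through an exchange rather than directly to a queue, which is the workflow difference noted above.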
July 30, 2013
by Saeid Siavashi
· 51,346 Views · 16 Likes
Why String is Immutable in Java
This is an old yet still popular question, and there are multiple reasons that String is designed to be immutable in Java. A good answer depends on a good understanding of memory, synchronization, data structures, etc. In the following, I will summarize some of them. 1. Requirement of the string pool. The string pool (string intern pool) is a special storage area in the Java heap. When a string is created and an identical string already exists in the pool, the reference to the existing string is returned instead of creating a new object and returning its reference. The following code creates only one string object in the heap. String string1 = "abcd"; String string2 = "abcd"; Here is how it looks: if String were not immutable, changing the string through one reference would lead to the wrong value for the other references. 2. Allowing String to cache its hashcode. The hashcode of a String is used frequently in Java, for example in a HashMap. Being immutable guarantees that the hashcode will always be the same, so it can be cached without worrying about changes. That means there is no need to calculate the hashcode every time it is used, which is more efficient. 3. Security. String is widely used as a parameter for many Java classes, e.g. network connections, opening files, etc. Were String not immutable, a connection or file could be changed, leading to a serious security threat: the method thought it was connecting to one machine, but it was not. Mutable strings could cause security problems in reflection too, as the parameters are strings. Here is a code example: boolean connect(String s){ if (!isSecure(s)) { throw new SecurityException(); } // this would cause a problem if s were changed before this point via another reference. causeProblem(s); } In summary, the reasons include design, efficiency, and security. Actually, this is also true for many other "why" questions in a Java interview.
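Here is a small sketch of the string pool and hashcode-caching behavior described above (standard Java, nothing beyond what the JDK guarantees):

public class StringPoolDemo {
    public static void main(String[] args) {
        String string1 = "abcd";
        String string2 = "abcd";

        // Both literals resolve to the same pooled instance, so reference equality holds.
        System.out.println(string1 == string2);                        // true

        // Explicit construction bypasses the pool and creates a distinct object.
        System.out.println(new String("abcd") == string1);             // false

        // hashCode() is derived from the immutable contents, so caching it is safe.
        System.out.println(string1.hashCode() == string2.hashCode());  // true
    }
}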
July 29, 2013
by Ryan Wang
· 216,907 Views · 9 Likes
Stepping through LMDB: Making Everything Easier
okay, i know that i have been critical about the lmdb codebase so far. but one thing that i really want to point out for it is that it was pretty easy to actually get things working on windows. it wasn’t smooth, in the sense that i had to muck around with the source a bit (hack endianess, remove a bunch of unix specific header files, etc). but that took less than an hour, and it was pretty much it. since i am by no means an experienced c developer, i consider this a major win. compare that to leveldb, which flat out won’t run on windows no matter how much time i spent trying, and it is a pleasure . also, stepping through the code i am starting to get a sense of how it works that is much different than the one i had when i just read the code. it is like one of those 3d images, you suddenly see something. the first thing that became obvious is that i totally missed the significance of the lock file. lmdb actually create two files: lock.mdb data.mdb lock.mdb is used to synchronized data between different readers. it seems to mostly be there if you want to have multiple writers using different processes. that is a very interesting model for an embedded database, i’ve to admit. not something that i think other embedded databases are offering. in order to do that, it create two named mutexes (one for read and one for write). a side note on windows support: lmdb supports windows, but it is very much a 2nd class citizen. you can see it in things like path not found error turning into a no such process error (because it try to use getlasterror() codes as c codes), or when it doesn’t create a directory even though not creating it would fail. i am currently debugging through the code and fixing such issues as i go along (but no, i am doing heavy handed magic fixes, just to get past this stage to the next one, otherwise i would have sent a pull request). here is one such example. here is the original code: but readfile in win32 will return false if the file is empty, so you actually need to write something like this to make the code work: past that hurdle, i think that i get a lot more about what is going on with the way lmdb works than before. let us start with the way data.mdb works. it is important to note that for pretty much everything in lmdb we use the system page size. by default, that is 4kb. the data file starts with 2 pages allocated. those page contain the following information: looking back at how couchdb did things, i am pretty sure that those two pages are going to be pretty important . i am guess that they would always contain the root of the data in the file. there is also the last transaction on them, which is what i imagine determine how something gets committed. i don’t know yet, as i said, guessing based on how couchdb works. i’ll continue this review in another time. next time, transactions…
July 27, 2013
by Oren Eini
· 8,944 Views
Displaying and Searching std::map Contents in WinDbg
This time we’re up for a bigger challenge. We want to automatically display and possibly search and filter std::map objects in WinDbg. The script for std::vectors was relatively easy because of the flat structure of the data in a vector; maps are more complex beasts. Specifically, an map in the Visual C++ STL is implemented as a red-black tree. Each tree node has three important pointers: _Left, _Right, and _Parent. Additionally, each node has a _Myval field that contains the std::pair with the key and value represented by the node. Iterating a tree structure requires recursion, and WinDbg scripts don’t have any syntax to define functions. However, we can invoke a script recursively – a script is allowed to contain the $$>a< command that invokes it again with a different set of arguments. The path to the script is also readily available in ${$arg0}. Before I show you the script, there’s just one little challenge I had to deal with. When you call a script recursively, the values of the pseudo-registers (like $t0) will be clobbered by the recursive invocation. I was on the verge of allocating memory dynamically or calling into a shell process to store and load variables, when I stumbled upon the .push and .pop commands, which store the register context and load it, respectively. These are a must for recursive WinDbg scripts. OK, so suppose you want to display values from an std::map where the key is less than or equal to 2. Here we go: 0:000> $$>a< traverse_map.script my_map -c ".block { .if (@@(@$t9.first) <= 2) { .echo ----; ?? @$t9.second } }" size = 10 ---- struct point +0x000 x : 0n1 +0x004 y : 0n2 +0x008 data : extra_data ---- struct point +0x000 x : 0n0 +0x004 y : 0n1 +0x008 data : extra_data ---- struct point +0x000 x : 0n2 +0x004 y : 0n3 +0x008 data : extra_data For each pair (stored in the $t9 pseudo-register), the block checks if the first component is less than or equal to 2, and if it is, outputs the second component. Next, here’s the script. Note it’s considerably more complex that what we had to with vectors, because it essentially invokes itself with a different set of parameters and then repeats recursively. .if ($sicmp("${$arg1}", "-n") == 0) { .if (@@(@$t0->_Isnil) == 0) { .if (@$t2 == 1) { .printf /D "%p\n", @$t0, @$t0 .printf "key = " ?? @$t0->_Myval.first .printf "value = " ?? @$t0->_Myval.second } .else { r? $t9 = @$t0->_Myval command } } $$ Recurse into _Left, _Right unless they point to the root of the tree .if (@@(@$t0->_Left) != @@(@$t1)) { .push /r /q r? $t0 = @$t0->_Left $$>a< ${$arg0} -n .pop /r /q } .if (@@(@$t0->_Right) != @@(@$t1)) { .push /r /q r? $t0 = @$t0->_Right $$>a< ${$arg0} -n .pop /r /q } } .else { r? $t0 = ${$arg1} .if (${/d:$arg2}) { .if ($sicmp("${$arg2}", "-c") == 0) { r $t2 = 0 aS ${/v:command} "${$arg3}" } } .else { r $t2 = 1 aS ${/v:command} " " } .printf "size = %d\n", @@(@$t0._Mysize) r? $t0 = @$t0._Myhead->_Parent r? $t1 = @$t0->_Parent $$>a< ${$arg0} -n ad command } Of particular note are the aS command which configures an alias that is then used by the recursive invocation to invoke a command block for each of the map’s elements; the $sicmp function which compares strings; and the .printf /D function, which outputs a chunk of DML. Finally, the recursion terminates when _Left or _Right are equal to the root of the tree (that’s just how the tree is implemented in this case).
July 26, 2013
by Sasha Goldshtein
· 4,630 Views
Using Morphia to Map Java Objects in MongoDB
MongoDB is an open source document-oriented NoSQL database system which stores data as JSON-like documents with dynamic schemas. As it doesn't store data in tables as is done in the usual relational database setup, it doesn't map well to the JPA way of storing data. Morphia is an open source lightweight type-safe library designed to bridge the gap between the MongoDB Java driver and domain objects. It can be an alternative to SpringData if you're not using the Spring Framework to interact with MongoDB. This post will cover the basics of persisting and querying entities along the lines of JPA by using Morphia and a MongoDB database instance. There are four POJOs this example will be using. First we have BaseEntity which is an abstract class containing the Id and Version fields: package com.city81.mongodb.morphia.entity; import org.bson.types.ObjectId; import com.google.code.morphia.annotations.Id; import com.google.code.morphia.annotations.Property; import com.google.code.morphia.annotations.Version; public abstract class BaseEntity { @Id @Property("id") protected ObjectId id; @Version @Property("version") private Long version; public BaseEntity() { super(); } public ObjectId getId() { return id; } public void setId(ObjectId id) { this.id = id; } public Long getVersion() { return version; } public void setVersion(Long version) { this.version = version; } } Whereas JPA would use @Column to rename the attribute, Morphia uses @Property. Another difference is that @Property needs to be on the variable whereas @Column can be on the variable or the get method. The main entity we want to persist is the Customer class: package com.city81.mongodb.morphia.entity; import java.util.List; import com.google.code.morphia.annotations.Embedded; import com.google.code.morphia.annotations.Entity; @Entity public class Customer extends BaseEntity { private String name; private List accounts; @Embedded private Address address; public String getName() { return name; } public void setName(String name) { this.name = name; } public List getAccounts() { return accounts; } public void setAccounts(List accounts) { this.accounts = accounts; } public Address getAddress() { return address; } public void setAddress(Address address) { this.address = address; } } As with JPA, the POJO is annotated with @Entity. The class also shows an example of @Embedded: The Address class is also annotated with @Embedded as shown below: package com.city81.mongodb.morphia.entity; import com.google.code.morphia.annotations.Embedded; @Embedded public class Address { private String number; private String street; private String town; private String postcode; public String getNumber() { return number; } public void setNumber(String number) { this.number = number; } public String getStreet() { return street; } public void setStreet(String street) { this.street = street; } public String getTown() { return town; } public void setTown(String town) { this.town = town; } public String getPostcode() { return postcode; } public void setPostcode(String postcode) { this.postcode = postcode; } } Finally, we have the Account class of which the customer class has a collection of: package com.city81.mongodb.morphia.entity; import com.google.code.morphia.annotations.Entity; @Entity public class Account extends BaseEntity { private String name; public String getName() { return name; } public void setName(String name) { this.name = name; } } The above show only a small subset of what annotations can be applied to domain classes. 
More can be found at http://code.google.com/p/morphia/wiki/AllAnnotations The Example class shown below goes through the steps involved in connecting to the MongoDB instance, populating the entities, persisting them and then retrieving them: package com.city81.mongodb.morphia; import java.net.UnknownHostException; import java.util.ArrayList; import java.util.List; import com.city81.mongodb.morphia.entity.Account; import com.city81.mongodb.morphia.entity.Address; import com.city81.mongodb.morphia.entity.Customer; import com.google.code.morphia.Datastore; import com.google.code.morphia.Key; import com.google.code.morphia.Morphia; import com.mongodb.Mongo; import com.mongodb.MongoException; /** * A MongoDB and Morphia Example * */ public class Example { public static void main( String[] args ) throws UnknownHostException, MongoException { String dbName = new String("bank"); Mongo mongo = new Mongo(); Morphia morphia = new Morphia(); Datastore datastore = morphia.createDatastore(mongo, dbName); morphia.mapPackage("com.city81.mongodb.morphia.entity"); Address address = new Address(); address.setNumber("81"); address.setStreet("Mongo Street"); address.setTown("City"); address.setPostcode("CT81 1DB"); Account account = new Account(); account.setName("Personal Account"); List accounts = new ArrayList(); accounts.add(account); Customer customer = new Customer(); customer.setAddress(address); customer.setName("Mr Bank Customer"); customer.setAccounts(accounts); Key savedCustomer = datastore.save(customer); System.out.println(savedCustomer.getId()); } Executing the first few lines will result in the creation of a Datastore. This interface will provide the ability to get, delete and save objects in the 'bank' MongoDB instance. The mapPackage method call on the morphia object determines what objects are mapped by that instance of Morphia. In this case all those in the package supplied. Other alternatives exist to map classes, including the method map which takes a single class (this method can be chained as the returning object is the morphia object), or passing a Set of classes to the Morphia constructor. After creating instances of the entities, they can be saved by calling save on the datastore instance and can be found using the primary key via the get method. The output from the Example class would look something like the below: 11-Jul-2012 13:20:06 com.google.code.morphia.logging.MorphiaLoggerFactory chooseLoggerFactory INFO: LoggerImplFactory set to com.google.code.morphia.logging.jdk.JDKLoggerFactory 4ffd6f7662109325c6eea24f Mr Bank Customer There are many other methods on the Datastore interface and they can be found along with the other Javadocs at http://morphia.googlecode.com/svn/site/morphia/apidocs/index.html An alternative to using the Datastore directly is to use the built in DAO support. This can be done by extending the BasicDAO class as shown below for the Customer entity: package com.city81.mongodb.morphia.dao; import com.city81.mongodb.morphia.entity.Customer; import com.google.code.morphia.Morphia; import com.google.code.morphia.dao.BasicDAO; import com.mongodb.Mongo; public class CustomerDAO extends BasicDAO { public CustomerDAO(Morphia morphia, Mongo mongo, String dbName) { super(mongo, morphia, dbName); } } To then make use of this, the Example class can be changed (and enhanced to show a query and a delete): ... 
CustomerDAO customerDAO = new CustomerDAO(morphia, mongo, dbName); customerDAO.save(customer); Query query = datastore.createQuery(Customer.class); query.and( query.criteria("accounts.name").equal("Personal Account"), query.criteria("address.number").equal("81"), query.criteria("name").contains("Bank") ); QueryResults retrievedCustomers = customerDAO.find(query); for (Customer retrievedCustomer : retrievedCustomers) { System.out.println(retrievedCustomer.getName()); System.out.println(retrievedCustomer.getAddress().getPostcode()); System.out.println(retrievedCustomer.getAccounts().get(0).getName()); customerDAO.delete(retrievedCustomer); } ... With the output from running the above shown below: 11-Jul-2012 13:30:46 com.google.code.morphia.logging.MorphiaLoggerFactory chooseLoggerFactory INFO: LoggerImplFactory set to com.google.code.morphia.logging.jdk.JDKLoggerFactory Mr Bank Customer CT81 1DB Personal Account This post only covers a few brief basics of Morphia but shows how it can help bridge the gap between JPA and NoSQL.
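As a final sketch (not from the original post), an entity saved through the Datastore can also be read back by its primary key using the get method mentioned earlier, assuming the datastore and savedCustomer variables from the Example class above:

// Load the customer back by its ObjectId primary key.
Customer loadedCustomer = datastore.get(Customer.class, savedCustomer.getId());
System.out.println(loadedCustomer.getName());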
July 25, 2013
by Geraint Jones
· 75,739 Views
Jersey: Listing all Resources, Paths, Verbs to Build an Entry Point/Index for an API
I’ve been playing around with Jersey over the past couple of days and one thing I wanted to do was create an entry point or index which listed all my resources, the available paths and the verbs they accepted. Guido Simone explained a neat way of finding the paths and verbs for a specific resource using Jersey’s IntrospectionModeller: AbstractResource resource = IntrospectionModeller.createResource(JacksonResource.class); System.out.println("Path is " + resource.getPath().getValue()); String uriPrefix = resource.getPath().getValue(); for (AbstractSubResourceMethod srm : resource.getSubResourceMethods()) { String uri = uriPrefix + "/" + srm.getPath().getValue(); System.out.println(srm.getHttpMethod() + " at the path " + uri + " return " + srm.getReturnType().getName()); } If we run that against j4-minimal‘s JacksonResource class we get the following output: Path is /jackson GET at the path /jackson/{who} return com.g414.j4.minimal.JacksonResource$Greeting GET at the path /jackson/awesome/{who} return javax.ws.rs.core.Response That’s pretty neat, but I didn’t want to have to manually list all my resources since I’ve already done that using Guice. I needed a way to programmatically get hold of them, and I partially found the way to do this from this post, which suggests using Application.getSingletons(). I actually ended up using Application.getClasses() and I ended up with ResourceListingResource: @Path("/") public class ResourceListingResource { @GET @Produces(MediaType.APPLICATION_JSON) public Response showAll( @Context Application application, @Context HttpServletRequest request) { String basePath = request.getRequestURL().toString(); ObjectNode root = JsonNodeFactory.instance.objectNode(); ArrayNode resources = JsonNodeFactory.instance.arrayNode(); root.put( "resources", resources ); for ( Class aClass : application.getClasses() ) { if ( isAnnotatedResourceClass( aClass ) ) { AbstractResource resource = IntrospectionModeller.createResource( aClass ); ObjectNode resourceNode = JsonNodeFactory.instance.objectNode(); String uriPrefix = resource.getPath().getValue(); for ( AbstractSubResourceMethod srm : resource.getSubResourceMethods() ) { String uri = uriPrefix + "/" + srm.getPath().getValue(); addTo( resourceNode, uri, srm, joinUri(basePath, uri) ); } for ( AbstractResourceMethod srm : resource.getResourceMethods() ) { addTo( resourceNode, uriPrefix, srm, joinUri( basePath, uriPrefix ) ); } resources.add( resourceNode ); } } return Response.ok().entity( root ).build(); } private void addTo( ObjectNode resourceNode, String uriPrefix, AbstractResourceMethod srm, String path ) { if ( resourceNode.get( uriPrefix ) == null ) { ObjectNode inner = JsonNodeFactory.instance.objectNode(); inner.put("path", path); inner.put("verbs", JsonNodeFactory.instance.arrayNode()); resourceNode.put( uriPrefix, inner ); } ((ArrayNode) resourceNode.get( uriPrefix ).get("verbs")).add( srm.getHttpMethod() ); } private boolean isAnnotatedResourceClass( Class rc ) { if ( rc.isAnnotationPresent( Path.class ) ) { return true; } for ( Class i : rc.getInterfaces() ) { if ( i.isAnnotationPresent( Path.class ) ) { return true; } } return false; } } The only change I’ve made from Guido Simone’s solution is that I also call resource.getResourceMethods() because resource.getSubResourceMethods() only returns methods which have a @Path annotation. Since we’ll sometimes define our path at the class level and then define different verbs that operate on that resource, it misses some methods out.
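One detail the class above glosses over is the joinUri helper, which isn't shown in the post; a plausible sketch (my assumption of its behaviour — it simply joins the base URL and resource path without doubling slashes) might be:

private String joinUri(String basePath, String uri) {
    // Trim a trailing slash from the base and ensure the path starts with one.
    String left = basePath.endsWith("/") ? basePath.substring(0, basePath.length() - 1) : basePath;
    String right = uri.startsWith("/") ? uri : "/" + uri;
    return left + right;
}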
If we run a cURL command (piped through python to make it look nice) against the root we get the following output: $ curl http://localhost:8080/ -w "\n" 2>/dev/null | python -mjson.tool { "resources": [ { "/bench": { "path": "http://localhost:8080/bench", "verbs": [ "GET", "POST", "PUT", "DELETE" ] } }, { "/sample/{who}": { "path": "http://localhost:8080/sample/{who}", "verbs": [ "GET" ] } }, { "/jackson/awesome/{who}": { "path": "http://localhost:8080/jackson/awesome/{who}", "verbs": [ "GET" ] }, "/jackson/{who}": { "path": "http://localhost:8080/jackson/{who}", "verbs": [ "GET" ] } }, { "/": { "path": "http://localhost:8080/", "verbs": [ "GET" ] } } ] }
July 24, 2013
by Mark Needham
· 20,607 Views · 2 Likes
Getting started with JPA and Mule
Working with JPA managed entities in Mule applications can be difficult. Since the JPA session is not propagated between message processors, transformers are typically needed to produce an entity from a message’s payload, pass it to a component for processing, then serialize it back to an un-proxied representation for further processing. Transactions have been complicated too: it’s difficult to coordinate a transaction between multiple components that are operating with JPA entity payloads. Finally, the lack of support for JPA queries makes it difficult to load objects without working with raw SQL and the JDBC transport.

Mule Support for JPA Entities

The JPA module aims to simplify working with JPA managed entities with Mule. It provides message processors that map to an EntityManager’s methods. The message processors participate in Mule transactions, making it easy to structure JPA transactions within Mule flows. The JPA module also provides a @PersistenceContext implementation. This allows Mule components to participate in JPA transactions.

Installing the JPA Module

To install the JPA Module you need to click on “Help” followed by “Install New Software…” from Mule Studio. Select the “MuleStudio Cloud Connectors Update Site” from the “Work With” drop-down list, then find the “Mule Java Persistence API Module Mule Extension.” This is illustrated below:

Fetching JPA Entities

JPA query language or criteria queries can be executed using the “query” message processor (MP). Supplying a statement to the query will execute the given query and return the results to the next message processor, as illustrated in the following Gist:

The queryParameters-ref defines the parameters; in this case, the message’s payload supplies the parameters to the query. The following query illustrates how a Map payload could be used to populate query parameters:

The query processor also supports criteria queries by setting the queryParameters-ref to an instance of a CriteriaQuery, as illustrated in the functional test snippet below.

CriteriaBuilder criteriaBuilder = entityManager.getCriteriaBuilder();
CriteriaQuery criteriaQuery = criteriaBuilder.createQuery(Dog.class);
Root from = criteriaQuery.from(Dog.class);
Predicate condition = criteriaBuilder.equal(from.get("name"), "Cujo");
criteriaQuery.where(condition);
runFlowWithPayloadAndExpect("testQuery", expectedResults, criteriaQuery);

You can use the “find” MP to load a single object if you know its ID:

Transactions and Entity Operations

The default behavior of most JPA providers, like Hibernate, is to provide proxies on entity relationships to avoid loading full object graphs into memory. When these objects are detached from the JPA session, however, attempts to access relations in the object will often fail because the proxied session is no longer available. This complicates using JPA in Mule applications as JPA objects pass between message processors and between flows, and the session subsequently becomes unavailable. The JPA module allows you to avoid this by wrapping your operations in a transactional block. Let’s first look at how to persist an object then query it within a transaction. The below assumes the message’s payload is an instance of the Dog domain class.

Now let’s see how we can use the merge processor to attach a JPA object to a new session. This can be useful when passing a JPA entity from one flow to another.

....other processing here....
Detaching an entity is just as simple: Component Operations with JPA The real power of using JPA with Mule is allowing your business services to participate in Mule managed JPA transactions. A @PersistenceContext EntityManager reference in your component class will cause Mule to inject a reference to a transactional flow’s current EntityManager for that method, as illustrated in the following class: public class DogServiceImpl { @PersistenceContext EntityManager entityManager; public Dog groom(Dog dog) { return entityManager.merge(dog); } } We can now wire the component up in a flow: Conclusion JPA is an important part of the JEE ecosystem and hopefully this module will simplify your use of JPA managed entities in Mule applications.
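For reference, the Dog domain class used in the examples above isn’t shown in the article; any ordinary JPA entity will do. A minimal sketch, assuming Dog only needs an id and a name, might look like this:

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

@Entity
public class Dog {

    @Id
    @GeneratedValue
    private Long id;

    private String name;

    public Long getId() { return id; }

    public String getName() { return name; }

    public void setName(String name) { this.name = name; }
}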
July 24, 2013
by John D'Emic
· 7,934 Views
Algorithm of the Week: Spatial Indexing with Quadtrees and Hilbert Curves
some time ago at oredev, after the sessions, there was "birds of a feather" - a sort of mini-unconference. anyone could write up a topic on the whiteboard; interested individuals added their names, and each group got allocated a room to chat about the topic. i joined the "spatial indexing" group, and we spent a fascinating hour and a half talking about spatial indexing methods, reminding me of several interesting algorithms and techniques. spatial indexing is increasingly important as more and more data and applications are geospatially-enabled. efficiently querying geospatial data, however, is a considerable challenge: because the data is two-dimensional (or sometimes, more), you can't use standard indexing techniques to query on position. spatial indexes solve this through a variety of techniques. in this post, we'll cover several - quadtrees , geohashes (not to be confused with geohashing ), and space-filling curves - and reveal how they're all interrelated. quadtrees quadtrees are a very straightforward spatial indexing technique. in a quadtree, each node represents a bounding box covering some part of the space being indexed, with the root node covering the entire area. each node is either a leaf node - in which case it contains one or more indexed points, and no children, or it is an internal node, in which case it has exactly four children, one for each quadrant obtained by dividing the area covered in half along both axes - hence the name. a representation of how a quadtree divides an indexed area. source: wikipedia inserting data into a quadtree is simple: starting at the root, determine which quadrant your point occupies. recurse to that node and repeat, until you find a leaf node. then, add your point to that node's list of points. if the list exceeds some pre-determined maximum number of elements, split the node, and move the points into the correct subnodes. a representation of how a quadtree is structured internally. to query a quadtree, starting at the root, examine each child node, and check if it intersects the area being queried for. if it does, recurse into that child node. whenever you encounter a leaf node, examine each entry to see if it intersects with the query area, and return it if it does. note that a quadtree is very regular - it is, in fact, a trie , since the values of the tree nodes do not depend on the data being inserted. a consequence of this is that we can uniquely number our nodes in a straightforward manner: simply number each quadrant in binary (00 for the top left, 10 for the top right, and so forth), and the number for a node is the concatenation of the quadrant numbers for each of its ancestors, starting at the root. using this system, the bottom right node in the sample image would be numbered 11 01. if we define a maximum depth for our tree, then, we can calculate a point's node number without reference to the tree - simply normalize the node's coordinates to an appropriate integer range (for example, 32 bits each), and then interleave the bits from the x and y coordinates -each pair of bits specifies a quadrant in the hypothetical quadtree. geohashes this system might seem familiar: it's a geohash ! at this point, you can actually throw out the quadtree itself - the node number, or geohash, contains all the information we need about its location in the tree. each leaf node in a full-height tree is a complete geohash, and each internal node is represented by the range from its smallest leaf node to its largest one. 
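to make the bit-interleaving concrete, here is a small sketch (in java rather than the python used later, and not from the original post) that computes the node number for a point at a fixed tree depth:

// x and y are assumed to be normalized to integers in [0, 2^depth).
// the quadrant numbering convention doesn't matter as long as it's consistent.
static long nodeNumber(int x, int y, int depth) {
    long number = 0;
    for (int i = depth - 1; i >= 0; i--) {
        int yBit = (y >> i) & 1;
        int xBit = (x >> i) & 1;
        number = (number << 2) | (yBit << 1) | xBit;  // append one quadrant (two bits)
    }
    return number;
}

a prefix of that number identifies an internal node, and every point below it falls into the contiguous numeric range you get by padding the prefix with all zeroes (low end) and all ones (high end).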
thus, you can efficiently locate all the points under any internal node by indexing on the geohash by performing a query for everything within the numeric range covered by the desired node. querying once we've thrown away the tree itself becomes a little more complex. instead of refining our search set recursively, we need to construct a search set ahead of time. first, find the smallest prefix (or quadtree node) that completely covers the query area. in the worst case, this may be substantially larger than the actual query area - for example, a small shape in the center of the indexed area that intersects all four quadrants would require selecting the root node for this step. the aim, now, is to construct a set of prefixes that completely covers the query region, while including as little area outside the region as possible. if we had no other constraints, we could simply select the set of leaf nodes that intersect the query area - but that would result in a lot of queries. another constraint, then, is that we want to minimise the number of distinct ranges we have to query for. one approach to doing this is to start by setting a maximum number of ranges we're willing to have. construct a set of ranges, initially populated with the prefix we identified earlier. pick the node in the set that can be subdivided without exceeding the maximum range count and will remove the most unwanted area from the query region. repeat this until there are no ranges in the set that can be further subdivided. finally, examine the resulting set, and join any adjacent ranges, if possible. the diagram below demonstrates how this works for a query on a circular area with a limit of 5 query ranges. how a query for a region is broken into a series of geohash prefixes/ranges. this approach works well, and it allows us to avoid the need to do recursive lookups - the set of range lookups we do execute can all be done in parallel. since each lookup can be expected to require a disk seek, parallelizing our queries allows us to substantially cut down the time required to return the results. still, we can do better. you may notice that all the areas we need to query in the above diagram are adjacent, yet we can only merge two of them (the two in the bottom right of the selected area) into a single range query, requiring us to do 4 separate queries. this is due in part to the order that our geohashing approach 'visits' subregions, working left to right, then top to bottom in each quad. the discontinuity as we go from top right to bottom left quad results in us having to split up some ranges that we could otherwise make contiguous. if we were to visit regions in a different order, perhaps we could minimise or eliminate these discontinuities, resulting in more areas that can be treated as adjacent and fetched with a single query. with an improvement in efficiency like that, we could do fewer queries for the same area covered, or conversely, the same number of queries, but including less extraneous area. illustrates the order in which the geohashing approach 'visits' each quad. hilbert curves suppose instead, we visit regions in a 'u' shape. within each quad, of course, we also visit subquads in the same 'u' shape, but aligned so as to match up with neighbouring quads. if we organise the orientation of these 'u's correctly, we can completely eliminate any discontinuities, and visit the entire area at whatever resolution we choose continuously, fully exploring each region before moving on to the next. 
not only does this eliminate discontinuities, but it also improves the overall locality. the pattern we get if we do this may look familiar - it's a hilbert curve.

hilbert curves are part of a class of one-dimensional fractals known as space-filling curves, so named because they are one dimensional lines that nevertheless fill all available space in a fixed area. they're fairly well known, in part thanks to xkcd's use of them for a map of the internet. as you can see, they're also of use for spatial indexing, since they exhibit exactly the locality and continuity required. for example, if we take another look at the example we used for finding the set of queries required to encompass a circle above, we find that we can reduce the number of queries by one: the small region in the lower left is now contiguous with the region to its right, and whilst the two regions at the bottom are no longer contiguous with each other, the rightmost one is now contiguous with the large area in the upper right.

illustrates the order in which a hilbert curve 'visits' each quad.

one thing that our elegant new system is lacking, so far, is a way of converting between a pair of (x,y) coordinates and the corresponding position in the hilbert curve. with geohashing it was easy and obvious - just interleave the x and y coordinates - but there's no obvious way to modify that for a hilbert curve. searching the internet, you're likely to come across many descriptions of how hilbert curves are drawn, but few if any descriptions of how to find the position of an arbitrary point. to figure this out, we need to take a closer look at how the hilbert curve can be recursively constructed.

the first thing to observe is that although most references to hilbert curves focus on how to draw the curve, this is a distraction from the essential property of the curve, and its importance to us: it's an ordering for points on a plane. if we express a hilbert curve in terms of this ordering, drawing the curve itself becomes trivial - simply a matter of connecting the dots. forget about how to connect adjacent sub-curves, and instead focus on how we can recursively enumerate the points.

hilbert curves are all about ordering a set of points on a 2d plane.

at the root level, enumerating the points is simple: pick a direction and a start point, and proceed around the four quadrants, numbering them 0 to 3. the difficulty is introduced when we want to determine the order we visit the sub-quadrants in while maintaining the overall adjacency property. examination reveals that each of the sub-quadrants' curves is a simple transformation of the original curve: there are only four possible transformations. naturally, this applies recursively to sub-sub quadrants, and so forth. the curve we use for a given quadrant is determined by the curve we used for the square it's in, and the quadrant's position. with a little work, we can construct a table that encapsulates this:

suppose we want to use this table to determine the position of a point on a third-level hilbert curve. for the sake of this example, assume our point has coordinates (5,2). starting with the first square on the diagram, find the quadrant your point is in - in this case, it's the upper right quadrant. the first part of our hilbert curve position, then, is 3 (11 in binary). next, we consult the square shown in the inset of square 3 - in this case, it's the second square. repeat the process: which sub-quadrant does our point fall into?
here, it's the lower left one, meaning the next part of our position is 1, and the square we should consult next is the second one again. repeating the process one final time, we find our point falls in the upper right sub-sub-quadrant, our final coordinate is 3 (11 in binary). stringing them together, we now know the position of our point on the curve is 110111 binary, or 55.

let's be a little more methodical, and write methods to convert between x,y coordinates and hilbert curve positions. first, we need to express our diagram above in terms a computer can understand:

hilbert_map = {
    'a': {(0, 0): (0, 'd'), (0, 1): (1, 'a'), (1, 0): (3, 'b'), (1, 1): (2, 'a')},
    'b': {(0, 0): (2, 'b'), (0, 1): (1, 'b'), (1, 0): (3, 'a'), (1, 1): (0, 'c')},
    'c': {(0, 0): (2, 'c'), (0, 1): (3, 'd'), (1, 0): (1, 'c'), (1, 1): (0, 'b')},
    'd': {(0, 0): (0, 'a'), (0, 1): (3, 'c'), (1, 0): (1, 'd'), (1, 1): (2, 'd')},
}

in the snippet above, each element of 'hilbert_map' corresponds to one of the four squares in the diagram above. to make things easier to follow, i've identified each one with a letter - 'a' is the first square, 'b' the second, and so forth. the value for each square is a dict, mapping x and y coordinates for the (sub-)quadrant to the position along the line (the first part of the value tuple) and the square to use next (the second part of the value tuple). here's how we can use this to translate x and y coordinates into a hilbert curve position:

def point_to_hilbert(x, y, order=16):
    current_square = 'a'
    position = 0
    for i in range(order - 1, -1, -1):
        position <<= 2
        quad_x = 1 if x & (1 << i) else 0
        quad_y = 1 if y & (1 << i) else 0
        quad_position, current_square = hilbert_map[current_square][(quad_x, quad_y)]
        position |= quad_position
    return position

the input to this function is the integer x and y coordinates, and the order of the curve. an order 1 curve fills a 2x2 grid, an order 2 curve fills a 4x4 grid, and so forth. our x and y coordinates, then, should be normalized to a range of 0 to 2^order - 1. the function works by stepping over each bit of the x and y coordinates, starting with the most significant. for each, it determines which (sub-)quadrant the coordinate lies in, by testing the corresponding bit, then fetches the position along the line and the next square to use from the table we defined earlier. the curve position is set as the least significant 2 bits on the position variable, and at the beginning of the next loop, it's left-shifted to make room for the next set of coordinates.

let's check that we've written the function correctly by running our example from above through it:

>>> point_to_hilbert(5, 2, 3)
55

presto! for a further test, we can use the function to generate a complete list of ordered points for a hilbert curve, then use a spreadsheet to graph them and see if we get a hilbert curve. enter the following expression into an interactive python interpreter:

>>> points = [(x, y) for x in range(8) for y in range(8)]
>>> sorted_points = sorted(points, key=lambda k: point_to_hilbert(k[0], k[1], 3))
>>> print '\n'.join('%s,%s' % x for x in sorted_points)

take the resulting text, paste it into a file called 'hilbert.csv', open it in your favorite spreadsheet, and instruct it to generate a scatter plot. the result is, of course, a nicely plotted hilbert curve! the inverse of point_to_hilbert is a straightforward reversal of the hilbert_map; implementing it is left as an exercise for the reader.

conclusion

there you have it - spatial indexing, from quadtrees to geohashes to hilbert curves.
one final observation: if you express the ordered sequence of x,y coordinates required to draw a hilbert curve in binary, do you notice anything interesting about the ordering? does it remind you of anything? just to wrap up, a caveat: all of the indexing methods i've described today are only well-suited to indexing points. if you want to index lines, polylines, or polygons, you're probably out of luck with these methods - and so far as i'm aware, the only known algorithm for effectively indexing shapes is the r-tree , an entirely different and more complex beast.
July 23, 2013
by Nick Johnson
· 42,948 Views
Converting Java Objects to Byte Array, JSON and XML
Quick reference for converting Java objects to various formats (byte array, JSON, XML) and back, using different libraries for serialization and deserialization.
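As a taste of what the article covers, round-tripping an object through a byte array needs nothing beyond the JDK, provided the class implements Serializable. A rough sketch:

import java.io.*;

public final class ByteArrayRoundTrip {

    public static byte[] toBytes(Serializable object) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);   // standard Java serialization
        }
        return bytes.toByteArray();
    }

    public static Object fromBytes(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();    // caller casts back to the expected type
        }
    }
}

JSON and XML conversions follow the same pattern, with a library such as Jackson or JAXB doing the reading and writing.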
July 22, 2013
by Faheem Sohail
· 106,356 Views
Log4j 2: Performance close to insane
Recently a respected member of the Apache community tried Log4j 2 and wrote on Twitter: (Quote from Mark Struberg: @TheASF #log4j2 rocks big times! Performance is close to insane ^^ http://logging.apache.org/log4j/2.x/ )

It happened shortly after Remko Popma contributed something which is now called the “AsyncLoggers”. Some of you might know Log4j 2 has AsyncAppenders already. They are similar to the ones you can find in Log4j 1 and other logging frameworks. To be honest, I wasn’t so excited about the new feature until I read the tweet on its performance and became curious.

Clearly Java logging has many goals. Among them: logging must be as fast as hell. Nobody wants his logging framework to become a bottleneck. Of course you’ll always have a cost when logging. There is some operation the CPU must perform. Something is happening, even when you decide NOT to write a log statement. Logging is expected to be invisible.

Until now, the well-known logging frameworks were similar in speed. Benchmarks are unreliable after all. We have made some benchmarks over at Apache Logging. Sometimes one logging framework wins, sometimes the other. But at the end of the day you can say they are all very good and you can choose whichever you like. Until we got Remko’s contribution and Log4j 2 became “insanely fast”.

Small software projects running one thread might not care about performance so much. When running a SaaS you simply don’t know when your app gets so much traction that you need to scale. Then you suddenly need some extra power. With Log4j 2, running 64 threads might bring you twelve times more logging throughput than with comparable frameworks. We speak of more than 18,000,000 messages per second, while others do around 1,500,000 or less in the same environment.

I saw the chart, but simply couldn’t believe it. There must be something wrong. I rechecked. I ran the tests myself. It’s true: Log4j 2 is insanely fast. (Chart: Async Performance, last read on July 19, 2013)

As of now, we have a logging framework which performs a lot better than every other logging framework out there. As of now, we need to justify our decision when we do not want to use Log4j 2, if speed matters. Anything other than Log4j 2 can become a bottleneck and a risk. With such a fast logging framework you might even consider logging a bit more in production than you did before.

Eventually I wrote Remko an e-mail and asked him what exactly the difference between the old AsyncAppenders and the new Asynchronous Loggers is.

The difference between old AsyncAppenders and new AsyncLoggers

“The Asynchronous Loggers do two things differently than the AsyncAppender”, he told me, “they try to do the minimum amount of work before handing off the log message to another thread, and they use a different mechanism to pass information between the producer and consumer threads. AsyncAppender uses an ArrayBlockingQueue to hand off the messages to the thread that writes to disk, and Asynchronous Loggers use the LMAX Disruptor library. Especially the Disruptor has made a large performance difference.”

In other terms, the AsyncAppender uses a first-in-first-out queue to work through messages, but the Async Logger uses something new – the Disruptor. To be honest, I had never heard of it. And furthermore, I never thought much about scaling my logging framework. When somebody said “scale the system”, I thought about the database, the app server and much more, but usually not logging. In production, logging was off. End of story.
But Remko thinks about scaling when it comes to logging. “Looking at the performance test results for the Asynchronous Loggers, the first thing you notice is that some ways of logging scale much better than others. By scaling better I mean that you get more throughput when you add more threads. If your throughput increases a constant amount with every thread you add, you have linear scalability. This is very desirable but can be difficult to achieve.”, he wrote me. “Comparing synchronous to asynchronous, you would expect any asynchronous mechanism to scale much better than synchronous logging because you don’t do the I/O in the producing thread any more, and we all know that ‘I/O is slow’ (and I’ll get back to this in a bit)”.

Yes, exactly my understanding. I thought it would be enough to send something to a queue, and something else would pick it up and write the message. The app would go on. This is exactly what the old AsyncAppender does, wrote Remko: “With AsyncAppender, all your application thread needs to do is create a LogEvent object and put it on the ArrayBlockingQueue; the consuming thread will then take these events off the queue and do all the time-consuming work. That is, the work of turning the event into bytes and writing these bytes to the I/O device. Since the application threads do not need to do the I/O, you would expect this to scale better, meaning adding threads will allow you to log more events.”

If you believed that like me, take a seat and a deep breath. We were wrong. “What may surprise you is that this is not the case.”, he wrote. “If you look at the performance numbers for the AsyncAppenders of all logging frameworks, you’ll see that every time you double the number of threads, your throughput per thread roughly halves.” “So your total throughput remains more or less flat! AsyncAppenders are faster than synchronous logging, but they are similar in the sense that neither of them gives you more total throughput when you add more threads.”, he told me.

It hit me like a hammer. Instead of making your logging faster by adding more threads, you basically gained nothing. After all, Appenders didn’t scale until now. I asked Remko why this was the case. “It turns out that queues are not the most optimal data structure to pass information between threads. The concurrent queues that are part of the standard Java libraries use locks to make sure that values don’t get corrupted and to ensure data visibility between threads.”

LMAX Disruptor?

“The LMAX team did a lot of research on this and found that these queues have a lot of lock contention. An interesting thing they found is that queues are always either full or empty: If your producer is faster, your queue will be full most of the time (and that may be a problem in itself). If your consumer is fast enough, your queue will be empty most of the time. Either way, you will have contention on the head or on the tail of the queue, where both the producer and the consumer thread want to update the same field. To resolve this, the LMAX team came up with the Disruptor library, which is a lock-free data structure for passing messages between threads. Here is a performance comparison between the Disruptor and ArrayBlockingQueue: Performance Comparison.”

Wow. After all these years of Java programming I actually felt a bit like a junior programmer again. I had missed the LMAX Disruptor and never even considered it a performance problem to use a queue. I wonder what other performance problems I have not discovered so far.
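To make the queue-based hand-off Remko described concrete, the AsyncAppender idea boils down to roughly the following (a deliberately simplified sketch, not Log4j’s actual code):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Application threads only enqueue; one background thread does the slow I/O.
// The queue is exactly where the producer/consumer contention described above happens.
public class QueueBasedLogger {

    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

    public QueueBasedLogger() {
        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    String event = queue.take();  // blocks until an event is available
                    System.out.println(event);    // stand-in for formatting and disk I/O
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.setDaemon(true);
        writer.start();
    }

    public void log(String message) throws InterruptedException {
        queue.put(message);  // cheap for the caller, but contended under many threads
    }
}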
I realized, I had to re-learn Java. I asked Remko how he could find a library like the LMAX Disruptor. I mean, nobody writes software, creates an instance of a Queue class, doubts its performance and finally searches the internet for “something better”. Or are there really people of that kind?

“How I found out about the Disruptor? The short answer is, it was all a mistake.”, he started. “Okay, perhaps that was a bit too short, so here is the longer answer: a colleague of mine wrote a small logger, essentially adding a time-stamped log message to a queue, with a background thread that took these strings off the queue and wrote them to disk. He did this because he needed better performance than what he could get with log4j-1.x. I did some testing and found it was faster, I don’t remember exactly by how much. I was quite surprised because I had been using log4j for years and had never thought it would be easily outperformed. Until then I had assumed that the well-known libraries would be fast, because, well… To be honest, I had just assumed. So this was a bit of an eye-opener for me. However, the custom logger was a bit bare-bones in terms of functionality so I started to look around for alternatives.”

“Before I start talking about the Disruptor, I have to confess something. I recently went back to see how much faster the custom logger was than log4j-1.x, but when I measured it it was actually slower! It turned out that I had been comparing the custom logger to an old beta of log4j-2.0, I think beta3 or beta4. AsyncAppender in those betas still had a performance issue (LOG4J2-153 if you’re curious). If I had compared the custom logger to the AsyncAppender in log4j-1.x, I would have found that log4j-1.x was faster and I would not have thought about it further. But because I made this mistake I started to look for other high-performance logging libraries that were richer in functionality. I did not find such a logging library, but I ran into a whole bunch of other interesting stuff, including the Disruptor. Eventually I decided to try to combine Log4j-2, which has a very nice code base, with the Disruptor. The result of this was eventually accepted into Log4j-2 itself, and the rest, as they say, was history.”

“One thing I came across that I should mention here is Peter Lawrey’s Chronicle library. Chronicle uses memory-mapped files to write tens of millions of messages per second to disk with very low latency. Remember that above I said that “we all know that I/O is slow”? Chronicle shows that synchronous I/O can be very, very fast.”

“It was via Peter’s work that I came across the Disruptor. There is a lot of good material out there about the Disruptor. Just to give you a few pointers:

  • Martin Fowler: LMAX
  • Trisha Lee on LMAX under the hood (slightly outdated now but the most detailed material I know of)
  • …and video presentations like this

The Disruptor google group is also highly recommended. Recommended readings on Java performance in general are:

  • Martin Thompson’s “Mechanical Sympathy”
  • Martin Thompson Presentations. Martin Thompson has done a number of articles and presentations on various aspects of high performance computing in Java. He does a great job of making the complex stuff that is going on under the hood accessible.”

My bookmarks folder filled up after reading this e-mail, and I appreciate all the starting points for improving my knowledge on Java performance.

Should I use AsyncLoggers by default?

I was sure I wanted to use the new Async Loggers. This all sounds just fantastic.
But on the other hand, I am a bit scared and even a little paranoid about including new dependencies or new technologies like the new Log4j 2 Async Loggers. I asked Remko if he would use the new feature by default or if he would enable them just for a few, limited use cases.

“I use Async Loggers by default, yes.”, he wrote me. “One use case when you would _not_ want to use asynchronous logging is when you use logging for audit purposes. In that case a logging error is a problem that your application needs to know about and deal with. I believe that most applications are different, in that they don’t care too much about logging errors. Most applications don’t want to stop if a logging exception occurs, in fact, they don’t even want to know about it. By default, appenders in Log4j-2.0 will suppress exceptions so the application doesn’t need to try/catch every log statement. If that is your usage, then you will not lose anything by using asynchronous loggers, so you get only the benefits, which is improved performance.”

“One nice little detail I should mention is that both Async Loggers and Async Appenders fix something that has always bothered me in Log4j-1.x, which is that they will flush the buffer after logging the last event in the queue. With Log4j-1.x, if you used buffered I/O, you often could not see the last few log events, as they were still stuck in the memory buffer. Your only option was setting immediateFlush to true, which forces disk I/O on every single log event and has a performance impact. With Async Loggers and Appenders in Log4j-2.0 your log statements are all flushed to disk, so they are always visible, but this happens in a very efficient manner.”

Isn’t it risky to use Log4j’s AsyncLoggers?

But considering that Log4j-1 had serious threading issues and the modern world uses cloud computing and clustering all the time to scale their apps, isn’t asynchronous logging some kind of additional risk? Or is it safe? I knew my questions would sound like the questions of a decision maker, not of a developer. But the whole LMAX thing was so new to me, and since I maintain the old and really ugly Log4j 1 code, I simply had to ask.

Remko: “There are a number of questions in there. First, is Log4j-2 safer from a concurrency perspective than Log4j-1.x? I believe so. The Log4j-2 team has put in considerable effort to support multi-threaded applications, and the asynchronous loggers are just a very recent and relatively small addition to the project. Log4j-2 uses more granular locking than log4j-1.x, and is architecturally simpler, which should result in fewer issues, and any issues that do come up will be easier to fix.”

“On the other hand, Log4j-2 is still in beta and is under active development, although recently I think most effort is being spent on fixing things and tying up loose ends rather than adding new features. I believe it is stable enough for production use. If you are considering using Log4j-2, for performance or other reasons, I’d suggest you do your due diligence and test, just like you would before adopting any other 3rd party library in your project.” (Sidenote: A stable version of Log4j 2 can be expected soon, most likely autumn 2013.)

Sounded good to me. And yes, I can perfectly agree with that from my own observations on the project, though I personally did not write code in the Log4j 2 repository.

“The other question I see is: Is asynchronous logging riskier than synchronous logging?
I don’t think so, in fact, if your application is multi threaded the opposite may be the case: once the log event has been handed off to the consumer thread that does the I/O, there is only that one thread dealing with the layouts, appenders and all the other logging-related components. So after the hand-off you’re single-threaded and you don’t need to worry about any threading issues like deadlock and liveliness etc any more.” “You can take this one step further and make your business logic completely single-threaded, using the disruptor for all I/O or communication with external systems. Single-threaded business logic without lock contention can be blazingly fast. The results at LMAX (6 million transactions/sec, with less than 10 ms latency) speak for themselves.” Reading Remko’s message I learned three things. First, I had to learn more about Java performance. Second, I definitely want to make my applications use Log4j 2. As first step, I will enable it in my Struts 2 apps, which I use often. Third, a web application framework using the LMAX Disruptor might blow us all away. I would like to give a big thank you and a hug to Remko Popma for answering my questions and working on this blog post with me. All errors are my own.
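If you want to try the Asynchronous Loggers yourself, switching them on is a small change: put the LMAX Disruptor jar on the classpath and select the async context selector before Log4j 2 initializes. A minimal sketch (check the Log4j 2 documentation for the authoritative configuration options):

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class AsyncLoggingDemo {

    public static void main(String[] args) {
        // Route all loggers through the asynchronous implementation.
        // This can also be passed on the command line as -DLog4jContextSelector=...
        System.setProperty("Log4jContextSelector",
                "org.apache.logging.log4j.core.async.AsyncLoggerContextSelector");

        Logger logger = LogManager.getLogger(AsyncLoggingDemo.class);
        logger.info("Hello from an asynchronous logger");
    }
}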
July 20, 2013
by Christian Grobmeier
· 7,067 Views · 1 Like
Java: Testing a Socket is Listening on All Network Interfaces/Wildcard Interface
I previously wrote a blog post describing how I’ve been trying to learn more about network sockets in which I created some server sockets and connected to them using netcat. The next step was to do the same thing in Java and I started out by writing a server socket which echoed any messages sent by the client: public class EchoServer { public static void main(String[] args) throws IOException { int port = 4444; ServerSocket serverSocket = new ServerSocket(port, 50, InetAddress.getByAddress(new byte[] {0x7f,0x00,0x00,0x01})); System.err.println("Started server on port " + port); while (true) { Socket clientSocket = serverSocket.accept(); System.err.println("Accepted connection from client: " + clientSocket.getRemoteSocketAddress() ); In in = new In (clientSocket); Out out = new Out(clientSocket); String s; while ((s = in.readLine()) != null) { out.println(s); } System.err.println("Closing connection with client: " + clientSocket.getInetAddress()); out.close(); in.close(); clientSocket.close(); } } } public final class In { private Scanner scanner; public In(java.net.Socket socket) { try { InputStream is = socket.getInputStream(); scanner = new Scanner(new BufferedInputStream(is), "UTF-8"); } catch (IOException ioe) { System.err.println("Could not open " + socket); } } public String readLine() { String line; try { line = scanner.nextLine(); } catch (Exception e) { line = null; } return line; } public void close() { scanner.close(); } } public class Out { private PrintWriter out; public Out(Socket socket) { try { out = new PrintWriter(socket.getOutputStream(), true); } catch (IOException ioe) { ioe.printStackTrace(); } } public void close() { out.close(); } public void println(Object x) { out.println(x); out.flush(); } } I ran the main method of the class and this creates a server socket on port 4444 listening on the 127.0.0.1 interface and we can connect to it using netcat like so: $ nc -v 127.0.0.1 4444 Connection to 127.0.0.1 4444 port [tcp/krb524] succeeded! hello hello The output in my IntelliJ console looked like this: Started server on port 4444 Accepted connection from client: /127.0.0.1:63222 Closing connection with client: /127.0.0.1 Using netcat is fine but what I actually wanted to do was write some test code which would check that I’d made sure the server socket on port 4444 was accessible via all interfaces i.e. bound to 0.0.0.0. 
There are actually some quite nice classes in Java which make this very easy to do, and wiring those together I ended up with the following client code:

public static void main(String[] args) throws IOException {
    Enumeration<NetworkInterface> nets = NetworkInterface.getNetworkInterfaces();

    for (NetworkInterface networkInterface : Collections.list(nets)) {
        for (InetAddress inetAddress : Collections.list(networkInterface.getInetAddresses())) {
            Socket socket = null;
            try {
                socket = new Socket(inetAddress, 4444);
                System.out.println(String.format("Connected using %s [%s]", networkInterface.getDisplayName(), inetAddress));
            } catch (ConnectException ex) {
                System.out.println(String.format("Failed to connect using %s [%s]", networkInterface.getDisplayName(), inetAddress));
            } finally {
                if (socket != null) {
                    socket.close();
                }
            }
        }
    }
}

If we run the main method of that class we’ll see the following output (on my machine at least!):

Failed to connect using en0 [/fe80:0:0:0:9afe:94ff:fe4f:ee50%4]
Failed to connect using en0 [/192.168.1.89]
Failed to connect using lo0 [/0:0:0:0:0:0:0:1]
Failed to connect using lo0 [/fe80:0:0:0:0:0:0:1%1]
Connected using lo0 [/127.0.0.1]

Interestingly, we can’t even connect via the loopback interface using IPv6, which is perhaps not that surprising in retrospect given we bound using an IPv4 address. If we tweak the second line of EchoServer from:

ServerSocket serverSocket = new ServerSocket(port, 50, InetAddress.getByAddress(new byte[] {0x7f,0x00,0x00,0x01}));

to

ServerSocket serverSocket = new ServerSocket(port, 50, InetAddress.getByAddress(new byte[] {0x00,0x00,0x00,0x00}));

and restart the server before re-running the client, we can now connect through all interfaces:

Connected using en0 [/fe80:0:0:0:9afe:94ff:fe4f:ee50%4]
Connected using en0 [/192.168.1.89]
Connected using lo0 [/0:0:0:0:0:0:0:1]
Connected using lo0 [/fe80:0:0:0:0:0:0:1%1]
Connected using lo0 [/127.0.0.1]

We can then wrap the EchoClient code into our testing framework to assert that we can connect via all the interfaces.
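Wrapping that check into a test is mostly a matter of turning the printlns into assertions. A JUnit-flavoured sketch (not from the original post, and assuming the server is already running on port 4444) could look like this:

import static org.junit.Assert.assertTrue;

import java.io.IOException;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.Socket;
import java.util.Collections;
import java.util.Enumeration;

import org.junit.Test;

public class EchoServerBindingTest {

    @Test
    public void shouldAcceptConnectionsOnEveryInterface() throws Exception {
        Enumeration<NetworkInterface> nets = NetworkInterface.getNetworkInterfaces();
        for (NetworkInterface networkInterface : Collections.list(nets)) {
            for (InetAddress inetAddress : Collections.list(networkInterface.getInetAddresses())) {
                assertTrue("Could not connect via " + networkInterface.getDisplayName()
                                + " [" + inetAddress + "]",
                        canConnect(inetAddress, 4444));
            }
        }
    }

    private boolean canConnect(InetAddress address, int port) {
        try (Socket socket = new Socket(address, port)) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}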
July 17, 2013
by Mark Needham
· 12,533 Views
Playing with NHibernate - Inverse and Cascade Mapping Attributes
I have to admit that NHibernate provides a really flexible way of handling class inheritance and parent-child relationship.
July 12, 2013
by Mariano Vazquez
· 35,848 Views
JAX RS: Streaming a Response using StreamingOutput
A couple of weeks ago Jim and I were building out a neo4j unmanaged extension from which we wanted to return the results of a traversal which had a lot of paths. Our code initially looked a bit like this:

package com.markandjim;

@Path("/subgraph")
public class ExtractSubGraphResource {
    private final GraphDatabaseService database;

    public ExtractSubGraphResource(@Context GraphDatabaseService database) {
        this.database = database;
    }

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    @Path("/{nodeId}/{depth}")
    public Response hello(@PathParam("nodeId") long nodeId, @PathParam("depth") int depth) {
        Node node = database.getNodeById(nodeId);

        final Traverser paths = Traversal.description()
                .depthFirst()
                .relationships(DynamicRelationshipType.withName("whatever"))
                .evaluator(Evaluators.toDepth(depth))
                .traverse(node);

        StringBuilder allThePaths = new StringBuilder();

        for (org.neo4j.graphdb.Path path : paths) {
            allThePaths.append(path.toString() + "\n");
        }

        return Response.ok(allThePaths.toString()).build();
    }
}

We then compiled that into a JAR, placed it in ‘plugins’ and added the following line to ‘conf/neo4j-server.properties’:

org.neo4j.server.thirdparty_jaxrs_classes=com.markandjim=/unmanaged

After we’d restarted the neo4j server we were able to call this end point using cURL like so:

$ curl -v http://localhost:7474/unmanaged/subgraph/1000/10

This approach works quite well, but Jim pointed out that it was quite inefficient to load all those paths up into memory, so we thought it would be quite cool if we could stream it as we got to each path. Traverser wraps an iterator so we are lazily evaluating the result set in any case. After a bit of searching we came across StreamingOutput, which is exactly what we need. We adapted our code to use that instead:

package com.markandjim;

@Path("/subgraph")
public class ExtractSubGraphResource {
    private final GraphDatabaseService database;

    public ExtractSubGraphResource(@Context GraphDatabaseService database) {
        this.database = database;
    }

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    @Path("/{nodeId}/{depth}")
    public Response hello(@PathParam("nodeId") long nodeId, @PathParam("depth") int depth) {
        Node node = database.getNodeById(nodeId);

        final Traverser paths = Traversal.description()
                .depthFirst()
                .relationships(DynamicRelationshipType.withName("whatever"))
                .evaluator(Evaluators.toDepth(depth))
                .traverse(node);

        StreamingOutput stream = new StreamingOutput() {
            @Override
            public void write(OutputStream os) throws IOException, WebApplicationException {
                Writer writer = new BufferedWriter(new OutputStreamWriter(os));

                for (org.neo4j.graphdb.Path path : paths) {
                    writer.write(path.toString() + "\n");
                }

                writer.flush();
            }
        };

        return Response.ok(stream).build();
    }
}

As far as I can tell, the only discernible difference between the two approaches is that you get an almost immediate response from the streamed approach, whereas the first approach has to put everything in the StringBuilder first. Both approaches make use of chunked transfer encoding, which according to tcpdump seems to have a maximum packet size of 16332 bytes:

00:10:27.361521 IP localhost.7474 > localhost.55473: Flags [.], seq 6098196:6114528, ack 179, win 9175, options [nop,nop,TS val 784819663 ecr 784819662], length 16332
00:10:27.362278 IP localhost.7474 > localhost.55473: Flags [.], seq 6147374:6163706, ack 179, win 9175, options [nop,nop,TS val 784819663 ecr 784819663], length 16332
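On the consuming side you can see the difference without any special tooling: reading the response line by line with plain java.net shows paths arriving as soon as their chunks do. A quick sketch (an added illustration, assuming the extension is mounted at the URI used above):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SubGraphStreamClient {

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:7474/unmanaged/subgraph/1000/10");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);  // each path prints as its chunk arrives
            }
        } finally {
            connection.disconnect();
        }
    }
}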
July 10, 2013
by Mark Needham
· 113,003 Views · 4 Likes
Algorithm of the Week: Homomorphic Hashing
In a previous Damn Cool Algorithms post, we learned about Fountain Codes, a clever probabilistic algorithm that allows you to break a large file up into a virtually infinite number of small chunks, such that you can collect any subset of those chunks - as long as you collect a few more than the volume of the original file - and be able to reconstruct the original file. This is a very cool construction, but as we observed last time, it has one major flaw when it comes to use in situations with untrusted users, such as peer to peer networks: there doesn't seem to be a practical way to verify if a peer is sending you valid blocks until you decode the file, which happens very near the end - far too late to detect and punish abuse.

It's here that Homomorphic Hashes come to our rescue. A homomorphic hash is a construction that's simple in principle: a hash function such that you can compute the hash of a composite block from the hashes of the individual blocks. With a construction like this, we could distribute a list of individual hashes to users, and they could use those to verify incoming blocks as they arrive, solving our problem.

Homomorphic Hashing is described in the paper On-the-fly verification of rateless erasure codes for efficient content distribution by Krohn et al. It's a clever construction, but rather difficult to understand at first, so in this article, we'll start with a strawman construction of a possible homomorphic hash, then improve upon it until it resembles the one in the paper - at which point you will hopefully have a better idea as to how it works. We'll also discuss the shortcomings and issues of the final hash, as well as how the authors propose to resolve them.

Before we continue, a small disclaimer is needed: I'm a computer scientist, not a mathematician, and my discrete math knowledge is far rustier than I'd like. This paper stretches the boundaries of my understanding, and describing the full theoretical underpinnings of it is something I'm likely to make a hash of. So my goal here is to provide a basic explanation of the principles, sufficient for an intuition of how the construction works, and leave the rest for further exploration by the interested reader.

A homomorphic hash that isn't

We can construct a very simple candidate for a homomorphic hash by using one very simple mathematical identity: the observation that g^x0 * g^x1 = g^(x0 + x1). So, for instance, 2^3 * 2^2 = 2^5. We can make use of this by the following procedure:

  • Pick a random number g.
  • For each element x in our message, take g^x. This is the hash of the given element.

Using the identity above, we can see that if we sum several message blocks together, we can compute their hash by multiplying the hashes of the individual blocks, and get the same result as if we 'hash' the sum. Unfortunately, this construction has a couple of obvious issues:

  • Our 'hash' really isn't - the hashes are way longer than the message elements themselves!
  • Any attacker can compute the original message block by taking the logarithm of the hash for that block. If we had a real hash with collisions, a similar procedure would let them generate a collision easily.

A better hash with modular arithmetic

Fortunately, there's a way we can fix both problems in one shot: by using modular arithmetic.
Modular arithmetic keeps our numbers bounded, which solves our first problem, while also making our attacker's life more difficult: finding a preimage for one of our hashes now requires solving the discrete log problem, a major unsolved problem in mathematics, and the foundation for several cryptosystems. Here, unfortunately, is where the theory starts to get a little more complicated - and I start to get a little more vague. Bear with me.

First, we need to pick a modulus for adding blocks together - we'll call it q. For the purposes of this example, let's say we want to add numbers between 0 and 255, so let's pick the smallest prime greater than 255 - which is 257. We'll also need another modulus under which to perform exponentiation and multiplication. We'll call this p. For reasons relating to Fermat's Little Theorem, this also needs to be a prime, and further, needs to be chosen such that p - 1 is a multiple of q (written q | (p - 1), or equivalently, p % q == 1). For the purposes of this example, we'll choose 1543, which is 257 * 6 + 1.

Using a finite field also puts some constraints on the number, g, that we use for the base of the exponent. Briefly, it has to be 'of order q', meaning that g^q mod p must equal 1. For our example, we'll use 47, since 47^257 % 1543 == 1.

So now we can reformulate our hash to work like this:

  • To hash a message block, we compute g^b mod p - in our example, 47^b mod 1543 - where b is the message block.
  • To combine hashes, we simply multiply them mod p, and to combine message blocks, we add them mod q.

Let's try it out. Suppose our message is the sequence [72, 101, 108, 108, 111] - that's "Hello" in ASCII. We can compute the hash of the first number as 47^72 mod 1543, which is 883. Following the same procedure for the other elements gives us our list of hashes: [883, 958, 81, 81, 313].

We can now see how the properties of the hash play out. The sum of all the elements of the message is 500, which is 243 mod 257. The hash of 243 is 47^243 mod 1543, or 376. And the product of our hashes is 883 * 958 * 81 * 81 * 313 mod 1543 - also 376! Feel free to try this for yourself with other messages and other subsets - they'll always match, as you would expect.

A practical hash

Of course, our improved hash still has a couple of issues:

  • The domain of our input values is small enough that an attacker could simply try them all out to find collisions. And the domain of our output values is small enough the attacker could attempt to find discrete logarithms by brute force, too.
  • Although our hashes are shorter than they were without modular arithmetic, they're still longer than the input.

The first of these is fairly straightforward to resolve: we can simply pick larger primes for p and q. If we choose ones that are sufficiently large, both enumerating all inputs and brute force logarithm finding will become impractical.

The second problem is a little trickier, but not hugely so; we just have to reorganize our message a bit. Instead of breaking the message down into elements between 0 and q, and treating each of those as a block, we can break the message into arrays of elements between 0 and q. For instance, suppose we have a message that is 1024 bytes long. Instead of breaking it down into 1024 blocks of 1 byte each, let's break it down into, say, 64 blocks of 16 bytes. We then modify our hashing scheme a little bit to accommodate this:

  • Instead of picking a single random number as the base of our exponent, g, we pick 16 of them, g0 through g15.
  • To hash a block, we take each number gi and raise it to the power of the corresponding sub-block. The resulting output is the same length as when we were hashing only a single block per hash, but we're taking 16 elements as input instead of a single one.
  • When adding blocks together, we add all the corresponding sub-blocks individually.

All the properties we had earlier still hold. Better, we've given ourselves another tuneable parameter: the number of sub-blocks per block. This will be invaluable in getting the right tradeoff between security, granularity of blocks, and protocol overhead.

Practical applications

What we've arrived at now is pretty much the construction described in the paper, and hopefully you can see how it would be applied to a system utilizing fountain codes. Simply pick two primes of about the right size - the paper recommends 257 bits for q and 1024 bits for p - figure out how big you want each block to be - and hence how many sub-blocks per block - and figure out a way for everyone to agree on the random numbers for g - such as by using a random number generator with a well defined seed value.

The construction we have now, although useful, is still not perfect, and has a couple more issues we should address. First of these is one you may have noticed yourself already: our input values pack neatly into bytes - integers between 0 and 255 in our example - but after summing them in a finite field, the domain has grown, and we can no longer pack them back into the same number of bits. There are two solutions to this: the tidy one and the ugly one.

The tidy one is what you'd expect: Since each value has grown by one bit, chop off the leading bit and transmit it along with the rest of the block. This allows you to transmit your block reasonably sanely and with minimal expansion in size, but is a bit messy to implement and seems - at least to me - inelegant.

The ugly solution is this: Pick the smallest prime number larger than your chosen power of 2 for q, and simply ignore or discard overflows. At first glance this seems like a terrible solution, but consider: the smallest prime larger than 2^256 is 2^256 + 297. The chance that a random number in that range is larger than 2^256 is approximately 1 in 3.9 * 10^74, or approximately one in 2^247. This is way smaller than the probability of, say, two randomly generated texts having the same SHA-1 hash. Thus, I think there's a reasonable argument for picking a prime using that method, then simply ignoring the possibility of overflows. Or, if you want to be paranoid, you can check for them, and throw out any encoded blocks that cause overflows - there won't be many of them, to say the least.

Performance and how to improve it

Another thing you may be wondering about this scheme is just how well it performs. Unfortunately, the short answer is "not well". Using the example parameters in the paper, for each sub-block we're raising a 1024 bit number to the power of a 257 bit number; even on modern hardware this is not fast. We're doing this for every 256 bits of the file, so to hash an entire 1 gigabyte file, for instance, we have to compute over 33 million exponentiations. This is an algorithm that promises to really put the assumption that it's always worth spending CPU to save bandwidth to the test. The paper offers two solutions to this problem; one for the content creator and one for the distributors.
For the content creator, the authors demonstrate that there is a way to generate the random constants g, used as the bases of the exponents using a secret value. With this secret value, the content creator can generate the hashes for their files much more quickly than without it. However, anyone with the secret value can also trivially generate hash collisions, so in such a scheme, the publisher must be careful not to disclose the value to anyone, and only distribute the computed constants gi. Further, the set of constants themselves aren't small - with the example parameters, a full set of constants weighs in at about the size of 4 data blocks. Thus, you need a good way to distribute the per-publisher constants in addition to the data itself. Anyone interested in this scheme should consult section C of the paper, titled "Per-Publisher Homomorphic Hashing". For distributors, the authors offer a probabilistic check that works on batches of blocks, described in section D, "Computational Efficiency Improvements". Another easier to understand variant is this: Instead of verifying blocks individually as they arrive, accumulate blocks in a batch. When you have enough blocks, sum them all together, and calculate an expected hash by taking the product of the expected hashes of the individual blocks. Compute the composite block's hash. If it verifies, all the individual blocks are valid! If it doesn't, divide and conquer: split your batch in half and check each, winnowing out valid blocks until you're left with any invalid ones. The nice thing about either of these procedures is that they allow you to trade off verification work with your vulnerability window. You can even dedicate a certain amount of CPU time to verification, and simply batch up incoming blocks until the current computation finishes, ensuring you're always verifying the last batch as you receive the next. Conclusion Homomorphic Hashing provides a neat solution to the problem of verifying data from untrusted peers when using a fountain coding system, but it's not without its own drawbacks. It's complicated to implement and computationally expensive to compute, and requires careful tuning of the parameters to minimise the volume of the hash data without compromising security. Used correctly in conjunction with fountain codes, however, Homomorphic Hashing could be used to create an impressively fast and efficient content distribution network. As a side-note, I'm intending to resume more regular blogging with more Damn Cool Algorithms posts. Have an algorithm you think is Damn Cool and would like to hear more about? Post it in the comments!
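As a quick way to check the arithmetic in the worked example above (p = 1543, q = 257, g = 47, message "Hello"), the homomorphic property is easy to reproduce with BigInteger. This is an added illustration, not code from the post or the paper:

import java.math.BigInteger;

public class HomomorphicHashDemo {

    public static void main(String[] args) {
        BigInteger p = BigInteger.valueOf(1543);  // modulus for hashes
        BigInteger q = BigInteger.valueOf(257);   // modulus for message blocks
        BigInteger g = BigInteger.valueOf(47);    // base of order q mod p

        int[] message = {72, 101, 108, 108, 111}; // "Hello" in ASCII

        BigInteger productOfHashes = BigInteger.ONE;
        BigInteger sumOfBlocks = BigInteger.ZERO;
        for (int block : message) {
            BigInteger hash = g.modPow(BigInteger.valueOf(block), p);  // hash of one block
            productOfHashes = productOfHashes.multiply(hash).mod(p);
            sumOfBlocks = sumOfBlocks.add(BigInteger.valueOf(block)).mod(q);
        }

        // Both values are the same (376 with these parameters, matching the text above).
        System.out.println("product of block hashes: " + productOfHashes);
        System.out.println("hash of summed blocks:   " + g.modPow(sumOfBlocks, p));
    }
}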
July 9, 2013
by Nick Johnson
· 14,837 Views