Data Engineering Resources

The Latest Data Engineering Topics

Big Data Chapter Excerpt: Implementing Schemas with Apache Thrift

This is an excerpt from the upcoming Manning book about Big Data. Big Data Principles and Best Practices of Scalable Realtime Data Systems By Nathan Marz and Samuel E. Ritchie Thrift is a widely used project that originated at Facebook. It can be used for making language-neutral RPC servers, but developers use it for its schema-creation capabilities. In this article based on chapter 2, author Nathan Marz discusses workhorses of Thrift—the struct and union type definitions—and Thrift’s built-in mechanisms for evolving a schema over time. You may also be interested in… Thrift is a widely used project that originated at Facebook. It can be used for making language-neutral RPC servers, but developers use it for its schema-creation capabilities. The workhorses of Thrift are the struct and union type definitions, and Thrift has built-in mechanisms for evolving a schema over time. Orginally Authored by Nathan Marz and Samuel E. Ritchie Structs The following code shows how to define a struct using the Thrift Interface Definition Language (IDL). Defining a struct is like defining a class in an object-oriented language: you specify all the data the object contains. The difference is that a Thrift struct only contains data and doesn't specify any extra behavior for the object. Fields in a struct can be: Primitive types like strings, ints, longs, and doubles. In the Thrift IDL, these are referred to as string, i32, i64, and double, respectively. Collections of other types. Thrift supports list, map, and set. Another Thrift struct or union. struct Person { 1: string twitter_username; 2: string full_name; 3: list interests; } The following code listing shows how to serialize a struct with Java. As you can see, we're using ArrayList, a native Java data structure, as part of the Person object. List interests = new ArrayList() {{ add("hadoop"); add("nosql"); }; Person person = new Person("joesmith", "Joe Smith", interests); TSerializer serializer = new TSerializer(); byte[] serialized = serializer.serialize(person); Here's how to deserialize a Person object in Python. When the object is deserialized, it will be using native Python data structures for any collection types. person = Person() deserialize(person, serialized_bytes) Fields in structs can be defined as being either required or optional. If a field is defined as required, than a value for that field must be provided or else Thrift will give an error upon serialization or deserialization. If a field is optional, the value will be null if not provided. You should always declare fields as being either required or optional. The following code listing shows how to define a struct containing required and optional fields. struct Tweet { 1: required string text; 2: required i64 id; 3: required i64 timestamp; 4: required Person person; 5: optional i64 response_to_tweet_id; Unions You can also define unions in Thrift. A union is a struct that must have exactly one field set. Unions are useful for representing polymorphic data. The following listing shows how to define a "PersonID" using a Thrift union that can be one of many different kinds of identifiers. union PersonID { 1: string email; 2: i64 facebook_id; 3: i64 twitter_id; } Evolving a schema Thrift is designed so that schemas can be evolved over time. The key to evolving Thrift schemas over time is the numeric identifiers used for every field. Those ids are used to identify fields in their serialized form. When you want to change the schema but still be backward compatible with existing data, you must obey the following rules. Fields may be renamed. This is because the serialized form of an object uses the field ids to identify fields, not the names. Fields may be removed, but you must be sure never to reuse that field id. When deserializing, Thrift will skip over any fields that don't match an id it's expecting. So the data for that field will just be ignored in the existing data. If you were to reuse that field id, Thrift will try to deserialize that old data into your new field which will lead to either invalid or incorrect data. Only optional fields can be added to existing structs. You can't add required fields because existing data won't have that field and will not be deserializable. Note that this point does not apply to unions since unions have no notion of required and optional fields. Summary In a relational database, the schema language is part of the database system and is integrated with how the database stores and processes that data. In the Big Data world, you use your own serialization framework that's separate from the storage and processing pieces. You get the flexibility to fine-tune this component to work exactly as needed to fit your data model. There are a few different open source serialization frameworks available, namely Thrift, Protocol Buffers, and Avro. We discussed our favorite, Apache Thrift, because it’s mature and supports most languages, but you could use any of these tools for defining a schema. Here are some other Manning titles you might be interested in: MongoDB in Action Kyle Banker RabbitMQ in Action Alvaro Videla and Jason J.W. Williams Hadoop in Action Chuck Lam Last updated: January 11, 2012

January 12, 2012

by Chris Smith

· 11,566 Views

Local and Distributed Graph Traversal Engines

in the graph database space, there are two types of traversal engines: local and distributed. local traversal engines are typically for single-machine graph databases and are used for real-time production applications. distributed traversal engines are typically for multi-machine graph databases and are used for batch processing applications. this divide is quite sharp in the community, but there is nothing that prevents the unification of both models. a discussion of this divide and its unification is presented in this post. local traversal engines in a local traversal engine, there is typically a single processing agent that obeys a program. the agent is called a traverser and the program it follows is called a path description . in gremlin , a friend-of-a-friend path description is defined as such: g.v(1).oute('friend').inv.oute('friend').inv when this path description is interpreted by a traverser over a graph, an instance of the description is realized as actual paths in the graph that match that description. for example, the gremlin traverser starts at vertex 1 and then steps to its outgoing friend -edges. next, it will move to the head/target vertices of those edges (i.e. vertices 3 and 4). after that, it will go to the friend -edges of those vertices and finally, to the head of the previous edges (i.e. vertices 6 and 7). in this way, a single traverser is following all the paths that are exposed with each new atomic graph operation (i.e. each new step after a .). the abstract syntax being: step.step.step . what is returned by this friend-of-a-friend path description, on the example graph diagrammed, is vertices 6 and 7. in many situations, its not the end of the path that is desired, but some side-effect of the traversal. for example, as the traverser walks it can update some global data structure such as a ranking of the vertices. this idea is presented in the path description below, where the oute.inv path is looped over 1000 times. each time a vertex is traversed over, the map m is updated. this global map m maintains keys that are vertices and values that denote the number of times that each vertex has been touched ( groupcount ‘s behavior). m = [:] g.v(1).oute.inv.groupcount(m).loop(3){it.loops < 1000} the local traversal engine pattern is abstractly diagrammed on the right, where a single traverser is obeying some path description ( a.b.c ) and in doing so, moving around on a graph and updating a global data structure (the red boxed map). given the need for traversers to move from element to element, graph databases of this form tend to support strong data locality by means of a direct-reference graph data structure (i.e. vertices have pointers to edges and edges to vertices). a few examples of such graph databases include neo4j , orientdb , and dex . distributed traversal engines in a distributed traversal engine, a traversal is represented as a flow of messages between the elements of the graph. generally, each element (e.g. vertex) is operating independently of the other elements. each element is seen as its own processor with its own (usually homogenous) program to execute. elements communicate with each other via message passing . when no more messages have been passed, the traversal is complete and the results of the traversal are typically represented as a distributed data structure over the elements. graph databases of this nature tend to use the bulk synchronous parallel model of distributed computing. each step is synchronized in a manner analogous to a clock cycle in hardware. instances of this model include agrapa , pregel , trinity , and goldenorb . an example of distributed graph traversing is now presented using a ranking algorithm in java. [ note : in this example, edges are not first class citizens. this is typical of the state of the art in distributed traversal engines. they tend to be for single-relational, unlabeled-edge graphs.] public void evaluatestep(int step) { if(!this.inbox.isempty() && step < 1000) { this.rank = this.rank + this.inbox.size(); for(vertex vertex : this.adjacentvertices()) { for(int i=0; i

January 10, 2012

by Marko Rodriguez

· 8,473 Views

Solr Select Query GET vs POST Request

Learn the differences between GET and POST requests.

January 10, 2012

by Bas De Nooijer

· 24,256 Views

Searching relational content with Lucene's BlockJoinQuery

Lucene's 3.4.0 release adds a new feature called index-time join (also sometimes called sub-documents, nested documents or parent/child documents), enabling efficient indexing and searching of certain types of relational content. Most search engines can't directly index relational content, as documents in the index logically behave like a single flat database table. Yet, relational content is everywhere! A job listing site has each company joined to the specific listings for that company. Each resume might have separate list of skills, education and past work experience. A music search engine has an artist/band joined to albums and then joined to songs. A source code search engine would have projects joined to modules and then files. Perhaps the PDF documents you need to search are immense, so you break them up and index each section as a separate Lucene document; in this case you'll have common fields (title, abstract, author, date published, etc.) for the overall document, joined to the sub-document (section) with its own fields (text, page number, etc.). XML documents typically contain nested tags, representing joined sub-documents; emails have attachments; office documents can embed other documents. Nearly all search domains have some form of relational content, often requiring more than one join. If such content is so common then how do search applications handle it today? One obvious "solution" is to simply use a relational database instead of a search engine! If relevance scores are less important and you need to do substantial joining, grouping, sorting, etc., then using a database could be best overall. Most databases include some form a text search, some even using Lucene. If you still want to use a search engine, then one common approach is to denormalize the content up front, at index-time, by joining all tables and indexing the resulting rows, duplicating content in the process. For example, you'd index each song as a Lucene document, copying over all fields from the song's joined album and artist/band. This works correctly, but can be horribly wasteful as you are indexing identical fields, possibly including large text fields, over and over. Another approach is to do the join yourself, outside of Lucene, by indexing songs, albums and artist/band as separate Lucene documents, perhaps even in separate indices. At search-time, you first run a query against one collection, for example the songs. Then you iterate through all hits, gathering up (joining) the full set of corresponding albums and then run a second query against the albums, with a large OR'd list of the albums from the first query, repeating this process if you need to join to artist/band as well. This approach will also work, but doesn't scale well as you may have to create possibly immense follow-on queries. Yet another approach is to use a software package that has already implemented one of these approaches for you! elasticsearch, Apache Solr, Apache Jackrabbit, Hibernate Search and many others all handle relational content in some way. With BlockJoinQuery you can now directly search relational content yourself! Let's work through a simple example: imagine you sell shirts online. Each shirt has certain common fields such as name, description, fabric, price, etc. For each shirt you have a number of separate stock keeping units or SKUs, which have their own fields like size, color, inventory count, etc. The SKUs are what you actually sell, and what you must stock, because when someone buys a shirt they buy a specific SKU (size and color). Maybe you are lucky enough to sell the incredible Mountain Three-wolf Moon Short Sleeve Tee, with these SKUs (size, color): small, blue small, black medium, black large, gray Perhaps a user first searches for "wolf shirt", gets a bunch of hits, and then drills down on a particular size and color, resulting in this query: name:wolf AND size=small AND color=blue which should match this shirt. name is a shirt field while the size and color are SKU fields. But if the user drills down instead on a small gray shirt: name:wolf AND size=small AND color=gray then this shirt should not match because the small size only comes in blue and black. How can you run these queries using BlockJoinQuery? Start by indexing each shirt (parent) and all of its SKUs (children) as separate documents, using the new IndexWriter.addDocuments API to add one shirt and all of its SKUs as a single document block. This method atomically adds a block of documents into a single segment as adjacent document IDs, which BlockJoinQuery relies on. You should also add a marker field to each shirt document (e.g. type = shirt), as BlockJoinQuery requires a Filter identifying the parent documents. To run a BlockJoinQuery at search-time, you'll first need to create the parent filter, matching only shirts. Note that the filter must use FixedBitSet under the hood, like CachingWrapperFilter: Filter shirts = new CachingWrapperFilter( new QueryWrapperFilter( new TermQuery( new Term("type", "shirt")))); Create this filter once, up front and re-use it any time you need to perform this join. Then, for each query that requires a join, because it involves both SKU and shirt fields, start with the child query matching only SKU fields: BooleanQuery skuQuery = new BooleanQuery(); skuQuery.add(new TermQuery(new Term("size", "small")), Occur.MUST); skuQuery.add(new TermQuery(new Term("color", "blue")), Occur.MUST); Next, use BlockJoinQuery to translate hits from the SKU document space up to the shirt document space: BlockJoinQuery skuJoinQuery = new BlockJoinQuery( skuQuery, shirts, ScoreMode.None); The ScoreMode enum decides how scores for multiple SKU hits should be aggregated to the score for the corresponding shirt hit. In this query you don't need scores from the SKU matches, but if you did you can aggregate with Avg, Max or Total instead. Finally you are now free to build up an arbitrary shirt query using skuJoinQuery as a clause: BooleanQuery query = new BooleanQuery(); query.add(new TermQuery(new Term("name", "wolf")), Occur.MUST); query.add(skuJoinQuery, Occur.MUST); You could also just run skuJoinQuery as-is if the query doesn't have any shirt fields. Finally, just run this query like normal! The returned hits will be only shirt documents; if you'd also like to see which SKUs matched for each shirt, use BlockJoinCollector: BlockJoinCollector c = new BlockJoinCollector( Sort.RELEVANCE, // sort 10, // numHits true, // trackScores false // trackMaxScore ); searcher.search(query, c); The provided Sort must use only shirt fields (you cannot sort by any SKU fields). When each hit (a shirt) is competitive, this collector will also record all SKUs that matched for that shirt, which you can retrieve like this: TopGroups hits = c.getTopGroups( skuJoinQuery, skuSort, 0, // offset 10, // maxDocsPerGroup 0, // withinGroupOffset true // fillSortFields ); Set skuSort to the sort order for the SKUs within each shirt. The first offset hits are skipped (use this for paging through shirt hits). Under each shirt, at most maxDocsPerGroup SKUs will be returned. Use withinGroupOffset if you want to page within the SKUs. If fillSortFields is true then each SKU hit will have values for the fields from skuSort. The hits returned by BlockJoinCollector.getTopGroups are SKU hits, grouped by shirt. You'd get the exact same results if you had denormalized up-front and then used grouping to group results by shirt. You can also do more than one join in a single query; the joins can be nested (parent to child to grandchild) or parallel (parent to child1 and parent to child2). However, there are some important limitations of index-time joins: The join must be computed at index-time and "compiled" into the index, in that all joined child documents must be indexed along with the parent document, as a single document block. Different document types (for example, shirts and SKUs) must share a single index, which is wasteful as it means non-sparse data structures like FieldCache entries consume more memory than they would if you had separate indices. If you need to re-index a parent document or any of its child documents, or delete or add a child, then the entire block must be re-indexed. This is a big problem in some cases, for example if you index "user reviews" as child documents then whenever a user adds a review you'll have to re-index that shirt as well as all its SKUs and user reviews. There is no QueryParser support, so you need to programmatically create the parent and child queries, separating according to parent and child fields. The join can currently only go in one direction (mapping child docIDs to parent docIDs), but in some cases you need to map parent docIDs to child docIDs. For example, when searching songs, perhaps you want all matching songs sorted by their title. You can't easily do this today because the only way to get song hits is to group by album or band/artist. The join is a one (parent) to many (children), inner join. As usual, patches are welcome! There is work underway to create a more flexible, but likely less performant, query-time join capability, which should address a number of the above limitations. Source: http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html

January 9, 2012

by Michael Mccandless

· 14,722 Views

The Persistence Layer with Spring 3.1 and JPA

1. Overview This is the third of a series of articles about Persistence with Spring. This article will focus on the configuration and implementation of Spring with JPA. For a step by step introduction about setting up the Spring context using Java based configuration and the basic Maven pom for the project, see this article. The Persistence with Spring series: Part 1 – The Persistence Layer with Spring 3.1 and Hibernate Part 2 – Simplifying the Data Access Layer with Spring and Java Generics Part 4 – The Persistence Layer with Spring Data JPA Part 5 – Transaction configuration with JPA and Spring 3.1 2. No More Spring Templates As of Spring 3.1, the JpaTemplate and the corresponding JpaDaoSupport have been deprecated in favor of using the native Java Persistence API. Also, both of these classes are only relevant for JPA 1 (from the JpaTemplate javadoc): Note that this class did not get upgraded to JPA 2.0 and never will. As a consequence, it is now best practice to use the Java Persistence API directly instead of the JpaTemplate, which will effectively decouple the DAO layer implementation from Spring entirely. Exception Translation without the template One of the responsibilities of JpaTemplate is exception translation – translating the low level exceptions – which tie the API to JPA – into higher level, generic Spring exceptions. Without the template to do that, exception translation can still be enabled by annotating the DAOs with the @Repository annotation. That, coupled with a Spring bean postprocessor will advice all @Repository beans with all the implementations of PersistenceExceptionTranslator found in the Container – to provide exception translation without using the template. Exception translation is done through proxies; in order for Spring to be able to create proxies around the DAO classes, these must not be declared final. Injecting the JPA EntityManager with Spring without the template The EntityManager is the API of the persistence context; this can be injected directly in the DAO. The Spring Container is capable of acting as a JPA container and of injecting the EntityManager by honoring the @PersistenceContext (both as field-level and a method-level annotation). For this to work, the PersistenceAnnotationBeanPostProcessor bean must exist in the Spring Container. The bean can be either created explicitly by defining it in the configuration, or automatically, by defining context:annotation-config or context:component-scan in the configuration. 3. The Spring Java configuration The EntityManager is set up in the configuration by creating a Spring factory bean to manage it; this will allow the PersistenceAnnotationBeanPostProcessor to retrieve it from the Container. There are two options to set this up – either the simpler LocalEntityManagerFactoryBean or the more flexible LocalContainerEntityManagerFactoryBean. The latter option is used here, so that additional properties can be configured on it: @Configuration @EnableTransactionManagement public class PersistenceJPAConfig{ @Bean public LocalContainerEntityManagerFactoryBean entityManagerFactoryBean(){ LocalContainerEntityManagerFactoryBean factoryBean = new LocalContainerEntityManagerFactoryBean(); factoryBean.setDataSource( this.restDataSource() ); factoryBean.setPackagesToScan( new String[ ] { "org.rest" } ); JpaVendorAdapter vendorAdapter = new HibernateJpaVendorAdapter(){ { // JPA properties ... } }; factoryBean.setJpaVendorAdapter( vendorAdapter ); factoryBean.setJpaProperties( this.additionlProperties() ); return factoryBean; } @Bean public DataSource restDataSource(){ DriverManagerDataSource dataSource = new DriverManagerDataSource(); dataSource.setDriverClassName( this.driverClassName ); dataSource.setUrl( this.url ); dataSource.setUsername( "restUser" ); dataSource.setPassword( "restmy5ql" ); return dataSource; } @Bean public PlatformTransactionManager transactionManager(){ JpaTransactionManager transactionManager = new JpaTransactionManager(); transactionManager.setEntityManagerFactory( this.entityManagerFactoryBean().getObject() ); return transactionManager; } @Bean public PersistenceExceptionTranslationPostProcessor exceptionTranslation(){ return new PersistenceExceptionTranslationPostProcessor(); } } Also, note that cglib must be on the classpath for Java @Configuration classes to work; to better understand the need for cglib as a dependency, see this article. 4. The Spring XML configuration The same Spring configuration with XML: There is a relatively small difference between the way Spring is configured in XML and the new Java based configuration – in XML, a reference to another bean can point to either the bean or a bean factory for that bean. In Java however, since the types are different, the compiler doesn’t allow it, and so the EntityManagerFactory is first retrieved from it’s bean factory and then passed to the transaction manager: txManager.setEntityManagerFactory( this.entityManagerFactoryBean().getObject() ); 5. Going full XML-less Usually, JPA defines a persistence unit through the META-INF/persistence.xml file. Starting with Spring 3.1, this XML file is no longer necessary – the LocalContainerEntityManagerFactoryBean now supports a ‘packagesToScan’ property where the packages to scan for @Entity classes can be specified. The persistence.xml file was the last piece of XML to be removed – now, JPA can be fully set up with no XML. 5.1. The JPA Properties JPA properties would usually be specified in the persistence.xml file; alternatively, the properties can be specified directly to the entity manager factory bean: factoryBean.setJpaProperties( this.additionlProperties() ); As a side-note, if Hibernate would be the persistence provider, then this would be the way to specify Hibernate specific properties. 5.2. The DAO Each DAO will be based on an parametrized, abstract DAO class class with support for the common generic operations: public abstract class AbstractJpaDAO< T extends Serializable > { private Class< T > clazz; @PersistenceContext EntityManager entityManager; public void setClazz( Class< T > clazzToSet ){ this.clazz = clazzToSet; } public T findOne( Long id ){ return this.entityManager.find( this.clazz, id ); } public List< T > findAll(){ return this.entityManager.createQuery( "from " + this.clazz.getName() ) .getResultList(); } public void save( T entity ){ this.entityManager.persist( entity ); } public void update( T entity ){ this.entityManager.merge( entity ); } public void delete( T entity ){ this.entityManager.remove( entity ); } public void deleteById( Long entityId ){ T entity = this.getById( entityId ); this.delete( entity ); } } A few aspects are interesting here – as discussed, the abstract DAO does not extend any Spring template (such as JpaTemplate). Instead, the JPA EntityManager is injected directly in the DAO, and will have the role of the main Persistence API Also, note that the entity Class is passed in the constructor to be used in the generic operations: @Repository public class FooDAO extends AbstractHibernateDAO< Foo > implements IFooDAO{ public FooDAO(){ setClazz(Foo.class ); } } 6. The Maven configuration In addition to the Maven configuration defined in a previous article, the following dependencies are addeed: spring-orm (which also has spring-tx as its dependency) and hibernate-entitymanager: org.springframework spring-orm 3.2.2.RELEASE org.hibernate hibernate-entitymanager 4.2.0.Final runtime mysql mysql-connector-java 5.1.24 runtime Note that the MySQL dependency is included as a reference – a driver is needed to configure the datasource, but any Hibernate supported database will do. 7. Conclusion This article covered the configuration and implementation of the persistence layer with JPA 2 and Spring 3.1, using both XML and Java based configuration. The reasons to stop relying on templates for the DAO layer was discussed, as well as getting rid of the last piece of XML usually associated with JPA – the persistence.xml. The final result is a lightweight, clean DAO implementation, with almost no compile-time reliance on Spring. You can check out the full implementation in the github project. OriginalThe Persistence Layer with Spring 3.1 and JPA from the Persistence with Spring series

January 7, 2012

by Eugen Paraschiv

· 63,026 Views · 1 Like

Simplifying the Data Access Layer with Spring and Java Generics

1. Overview This is the second of a series of articles about Persistence with Spring. The previous article discussed setting up the persistence layer with Spring 3.1 and Hibernate, without using templates. This article will focus on simplifying the Data Access Layer by using a single, generified DAO, which will result in elegant data access, with no unnecessary clutter. Yes, in Java. The Persistence with Spring series: Part 1 – The Persistence Layer with Spring 3.1 and Hibernate Part 3 – The Persistence Layer with Spring 3.1 and JPA Part 4 – The Persistence Layer with Spring Data JPA Part 5 – Transaction configuration with JPA and Spring 3.1 2. The DAO mess Most production codebases have some kind of DAO layer. Usually the implementation ranges from a raw class with no inheritance to some kind of generified class, but one thing is consistent – there is always more then one. Most likely, there are as many DAOs as there are entities in the system. Also, depending on the level of generics involved, the actual implementations can vary from heavily duplicated code to almost empty, with the bulk of the logic grouped in an abstract class. 2.1. A Generic DAO Instead of having multiple implementations – one for each entity in the system – a single parametrized DAO can be used in such a way that it still takes full advantage of the type safety provided by generics. Two implementations of this concept are presented next, one for a Hibernate centric persistence layer and the other focusing on JPA. These implementation are by no means complete – only some data access methods are included, but they can be easily be made more thorough. 2.2. The Abstract Hibernate DAO public abstract class AbstractHibernateDAO< T extends Serializable > { private Class< T > clazz; @Autowired SessionFactory sessionFactory; public void setClazz( Class< T > clazzToSet ){ this.clazz = clazzToSet; } public T findOne( Long id ){ return (T) this.getCurrentSession().get( this.clazz, id ); } public List< T > findAll(){ return this.getCurrentSession() .createQuery( "from " + this.clazz.getName() ).list(); } public void save( T entity ){ this.getCurrentSession().persist( entity ); } public void update( T entity ){ this.getCurrentSession().merge( entity ); } public void delete( T entity ){ this.getCurrentSession().delete( entity ); } public void deleteById( Long entityId ){ T entity = this.getById( entityId ); this.delete( entity ); } protected Session getCurrentSession(){ return this.sessionFactory.getCurrentSession(); } } The DAO uses the Hibernate API directly, without relying on any Spring templates (such as HibernateTemplate). Using of templates, as well as management of the SessionFactory which is autowired in the DAO were covered in the previous post of the series. 2.3. The Abstract JPA DAO public abstract class AbstractJpaDAO< T extends Serializable > { private Class< T > clazz; @PersistenceContext EntityManager entityManager; public void setClazz( Class< T > clazzToSet ){ this.clazz = clazzToSet; } public T findOne( Long id ){ return this.entityManager.find( this.clazz, id ); } public List< T > findAll(){ return this.entityManager.createQuery( "from " + this.clazz.getName() ) .getResultList(); } public void save( T entity ){ this.entityManager.persist( entity ); } public void update( T entity ){ this.entityManager.merge( entity ); } public void delete( T entity ){ this.entityManager.remove( entity ); } public void deleteById( Long entityId ){ T entity = this.getById( entityId ); this.delete( entity ); } } Similar to the Hibernate DAO implementation, the Java Persistence API is used here directly, again not relying on the now deprecated Spring JpaTemplate. 2.4. The Generic DAO Now, the actual implementation of the generic DAO is as simple as it can be – it contains no logic. Its only purpose is to be injected by the Spring container in a service layer (or in whatever other type of client of the Data Access Layer): @Repository @Scope( BeanDefinition.SCOPE_PROTOTYPE ) public class GenericJpaDAO< T extends Serializable > extends AbstractJpaDAO< T > implements IGenericDAO< T >{ // } @Repository @Scope( BeanDefinition.SCOPE_PROTOTYPE ) public class GenericHibernateDAO< T extends Serializable > extends AbstractHibernateDAO< T > implements IGenericDAO< T >{ // } First, note that the generic implementation is itself parametrized – allowing the client to choose the correct parameter in a case by case basis. This will mean that the clients gets all the benefits of type safety without needing to create multiple artifacts for each entity. Second, notice the prototype scope of these generic DAO implementation. Using this scope means that the Spring container will create a new instance of the DAO each time it is requested (including on autowiring). That will allow a service to use multiple DAOs with different parameters for different entities, as needed. The reason this scope is so important is due to the way Spring initializes beans in the container. Leaving the generic DAO without a scope would mean using the default singleton scope, which would lead to a single instance of the DAO living in the container. That would obviously be majorly restrictive for any kind of more complex scenario. 3. The Service There is now a single DAO to be injected by Spring; also, the Class needs to be specified: @Service class FooService implements IFooService{ IGenericDAO< Foo > dao; @Autowired public void setDao( IGenericDAO< Foo > daoToSet ){ this.dao = daoToSet; this.dao.setClazz( Foo.class ); } // ... } Spring autowires the new DAO insteince using setter injection so that the implementation can be customized with the Class object. After this point, the DAO is fully parametrized and ready to be used by the service. 4. Conclusion This article discussed the simplification of the Data Access Layer by providing a single, reusable implementation of a generic DAO. This implementation was presented in both a Hibernate and a JPA based environment. The result is a streamlined persistence layer, with no unnecessary clutter. For a step by step introduction about setting up the Spring context using Java based configuration and the basic Maven pom for the project, see this article. The next article of the Persistence with Spring series will focus on setting up the DAL layer with Spring 3.1 and JPA. In the meantime, you can check out the full implementation in the github project. If you read this far, you should follow me on twitter here.

January 5, 2012

by Eugen Paraschiv

· 25,060 Views · 1 Like

JAXB and Joda-Time: Dates and Times

Joda-Time provides an alternative to the Date and Calendar classes currently provided in Java SE. Since they are provided in a separate library JAXB does not provide a default mapping for these classes. We can supply the necessary mapping via XmlAdapters. In this post we will cover the following Joda-Time types: DateTime, DateMidnight, LocalDate, LocalTime, LocalDateTime. Java Model The following domain model will be used for this example: package blog.jodatime; import javax.xml.bind.annotation.XmlRootElement; import javax.xml.bind.annotation.XmlType; import org.joda.time.DateMidnight; import org.joda.time.DateTime; import org.joda.time.LocalDate; import org.joda.time.LocalDateTime; import org.joda.time.LocalTime; @XmlRootElement @XmlType(propOrder={ "dateTime", "dateMidnight", "localDate", "localTime", "localDateTime"}) public class Root { private DateTime dateTime; private DateMidnight dateMidnight; private LocalDate localDate; private LocalTime localTime; private LocalDateTime localDateTime; public DateTime getDateTime() { return dateTime; } public void setDateTime(DateTime dateTime) { this.dateTime = dateTime; } public DateMidnight getDateMidnight() { return dateMidnight; } public void setDateMidnight(DateMidnight dateMidnight) { this.dateMidnight = dateMidnight; } public LocalDate getLocalDate() { return localDate; } public void setLocalDate(LocalDate localDate) { this.localDate = localDate; } public LocalTime getLocalTime() { return localTime; } public void setLocalTime(LocalTime localTime) { this.localTime = localTime; } public LocalDateTime getLocalDateTime() { return localDateTime; } public void setLocalDateTime(LocalDateTime localDateTime) { this.localDateTime = localDateTime; } } XmlAdapters Since Joda-Time and XML Schema both represent data and time information according to ISO 8601 the implementation of the XmlAdapters is quite trivial. DateTimeAdapter package blog.jodatime; import javax.xml.bind.annotation.adapters.XmlAdapter; import org.joda.time.DateTime; public class DateTimeAdapter extends XmlAdapter{ public DateTime unmarshal(String v) throws Exception { return new DateTime(v); } public String marshal(DateTime v) throws Exception { return v.toString(); } } DateMidnightAdapter package blog.jodatime; import javax.xml.bind.annotation.adapters.XmlAdapter; import org.joda.time.DateMidnight; public class DateMidnightAdapter extends XmlAdapter { public DateMidnight unmarshal(String v) throws Exception { return new DateMidnight(v); } public String marshal(DateMidnight v) throws Exception { return v.toString(); } } LocalDateAdapter package blog.jodatime; import javax.xml.bind.annotation.adapters.XmlAdapter; import org.joda.time.LocalDate; public class LocalDateAdapter extends XmlAdapter{ public LocalDate unmarshal(String v) throws Exception { return new LocalDate(v); } public String marshal(LocalDate v) throws Exception { return v.toString(); } } LocalTimeAdapter package blog.jodatime; import javax.xml.bind.annotation.adapters.XmlAdapter; import org.joda.time.LocalTime; public class LocalTimeAdapter extends XmlAdapter { public LocalTime unmarshal(String v) throws Exception { return new LocalTime(v); } public String marshal(LocalTime v) throws Exception { return v.toString(); } } LocalDateTimeAdapter package blog.jodatime; import javax.xml.bind.annotation.adapters.XmlAdapter; import org.joda.time.LocalDateTime; public class LocalDateTimeAdapter extends XmlAdapter{ public LocalDateTime unmarshal(String v) throws Exception { return new LocalDateTime(v); } public String marshal(LocalDateTime v) throws Exception { return v.toString(); } } Registering the XmlAdapters We will use the @XmlJavaTypeAdapters annotation to register the Joda-Time types at the package level. This means that whenever these types are found on a field/property on a class within this package the XmlAdapter will automatically be applied. @XmlJavaTypeAdapters({ @XmlJavaTypeAdapter(type=DateTime.class, value=DateTimeAdapter.class), @XmlJavaTypeAdapter(type=DateMidnight.class, value=DateMidnightAdapter.class), @XmlJavaTypeAdapter(type=LocalDate.class, value=LocalDateAdapter.class), @XmlJavaTypeAdapter(type=LocalTime.class, value=LocalTimeAdapter.class), @XmlJavaTypeAdapter(type=LocalDateTime.class, value=LocalDateTimeAdapter.class) }) package blog.jodatime; import javax.xml.bind.annotation.adapters.XmlJavaTypeAdapter; import javax.xml.bind.annotation.adapters.XmlJavaTypeAdapters; import org.joda.time.DateMidnight; import org.joda.time.DateTime; import org.joda.time.LocalDate; import org.joda.time.LocalDateTime; import org.joda.time.LocalTime; Demo To run the following demo you will need the Joda-Time jar on your classpath. It can be obtained here: http://sourceforge.net/projects/joda-time/files/joda-time/ package blog.jodatime; import javax.xml.bind.JAXBContext; import javax.xml.bind.Marshaller; import org.joda.time.DateMidnight; import org.joda.time.DateTime; import org.joda.time.LocalDate; import org.joda.time.LocalDateTime; import org.joda.time.LocalTime; public class Demo { public static void main(String[] args) throws Exception { Root root = new Root(); root.setDateTime(new DateTime(2011, 5, 30, 11, 2, 30, 0)); root.setDateMidnight(new DateMidnight(2011, 5, 30)); root.setLocalDate(new LocalDate(2011, 5, 30)); root.setLocalTime(new LocalTime(11, 2, 30)); root.setLocalDateTime(new LocalDateTime(2011, 5, 30, 11, 2, 30)); JAXBContext jc = JAXBContext.newInstance(Root.class); Marshaller marshaller = jc.createMarshaller(); marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true); marshaller.marshal(root, System.out); } } Output The following is the output from our demo code: 2011-05-30T11:02:30.000-04:00 2011-05-30T00:00:00.000-04:00 2011-05-30 11:02:30.000 2011-05-30T11:02:30.000 From http://blog.bdoughan.com/2011/05/jaxb-and-joda-time-dates-and-times.html

December 29, 2011

by Blaise Doughan

· 15,735 Views

The “4+1” View Model of Software Architecture

In November 1995, while working as Lead software architect at Hughes Aircraft Of Canada Philippe Kruchten published a paper entitled: "Architectural Blueprints—The “4+1” View Model of Software Architecture". The intent was to come up with a mechanism to separate the different aspects of a software system into different views of the system. Why? Because different stakeholders always have different interest in a software system. Some aspects of a system are relevant to the Developers; others are relevant to System administrators. The Developers want to know about things like classes; System administrators want to know about deployment, hardware and network configurations and don't care about classes. Similar points can be made for Testers, Project Managers and Customers. Kruchten thought it made sense to decompose architecture into distinct views so stakeholders could get what they wanted. In total there were 5 views in his approach but he decided to call it 4 + 1. We'll discuss why it's called 4 + 1 later! But first, let's have a look at each of the different views. The logical view This contains information about the various parts of the system. In UML the logical view is modelled using Class, Object, State machine and Interaction diagrams (e.g Sequence diagrams). It's relevance is really to developers. The process view This describes the concurrent processes within the system. It encompasses some non-functional requirements such as performance and availability. In UML, Activity diagrams - which can be used to model concurrent behaviour - are used to model the process view. The development view The development view focusses on software modules and subsystems. In UML, Package and Component diagrams are used to model the development view. The physical view The physical view describes the physical deployment of the system. For example, how many nodes are used and what is deployed on what node. Thus, the physical view concerns some non-functional requirements such as scalability and availability. In UML, Deployment diagrams are used to model the physical view. The use case view This view describes the functionality of the system from the perspective from outside world. It contains diagrams describing what the system is supposed to do from a black box perspective. This view typically contains Use Case diagrams. All other views use this view to guide them. Why is it called the 4 + 1 instead of just 5? Well this is because of the special significance the use case view has. When all other views are finished, it's effectively redundant. However, all other views would not be possible without it. It details the high levels requirements of the system. The other views detail how those requirements are realised. 4 + 1 came before UML It's important to remember the 4 + 1 approach was put forward two years before the first the introduction of UML which did not manifest in its first guise until 1997. UML is how most enterprise architectures are modelled and the 4 + 1 approach still plays a relevance to UML today. UML 2.0 has 13 different types of diagrams - each diagram type can be categorised into one of the 4 + 1 views. UML is 4 + 1 friendly! So is it important? The 4 + 1 approach isn't just about satisfying different stakeholders. It makes modelling easier to do because it makes it easier to organise. A typical project will contain numerous diagrams of the various types. For example, a project may contain a few hundred sequence diagrams and several class diagrams. Grouping diagrams of similar types and purpose means there is an emphasis in separating concerns. Sure isn't it just the same with Java? Grouping Java classes of similar purpose and related responsibilities into packages means organisation is better. Similarly, grouping different components into different jar files means organisation is better. Modelling tools will usually support the 4 + 1 approach and this means projects will have templates for how to split the various types of diagrams. In a company when projects follow industry standard templates again it means things are better organised. The 4 + 1 approach also provides a way for architects to be able to prioritise modelling concerns. It is rare that a project will have enough time to model every single diagram possible for an architecture. Architects can prioritise different views. For example, for a business domain intensive project it would make sense to prioritise the logical view. In a project with high concurrency and complex timing it would make sense to ensure the process view gets ample time. Similarly, the 4 + 1 approach makes it possible for stakeholders to get the parts of the model that are relevant to them. References: Architectural Blueprints—The “4+1” View Model of Software Architecture Paper http://www.cs.ubc.ca/~gregor/teaching/papers/4+1view-architecture.pdf Learning UML 2.0 by Russ Miles & Kim Hamilton. O'Reilly From http://dublintech.blogspot.com/2011/05/41-view-model-of-software-architecture.html

December 28, 2011

by Alex Staveley

· 53,932 Views

How to deploy a neo4j instance in Amazon EC2 in 10 minutes

Neo4j is a high-performance, NOSQL graph database with all the features of a mature and robust database. In this post I will explain how to deploy a neo4j instance in Amazon EC2 web service. For this tutorial to take you no more than 10 minutes you should be able to execute properly some bash commands like mv, tar, ssh and scp (secure copy). I also assume that you have an account in Amazon Web Services and you are familiar to the process of launching instances. If not, I strongly recommend you to follow this starting guide and complete it till you manage to connect to your instance with ssh. Start downloading the latest stable version of neo4j. Which you can find here. The “Community Edition” fits well for development purposes. Do not forget to select the Unix version of the server. This will download a tar.gz file which you will copy to your EC2 instance later. While you download the neo4j server open the AWS Management Console and launch a Basic 32-bit Amazon Linux AMI. If you want to launch an Ubuntu AMI please notice that it doesn’t ship with Java, which is required for running neo4j. If you are not familiar with key pairs, pem files or security groups I insist you to follow the EC2 starting guide I mentioned above. You can either create a new security group or use the default, but you will need to configure a new security rule for the neo4j server port. After launching the instance, create a TCP rule on port 7474 with source 0.0.0.0/0. Here you are opening port 7474 for anyone. If you are planning to use the neo4j REST API and remotely call it from another server, for example a Rails application hosted in Heroku, for security reasons, you may want to change the source field to the address of your Heroku server. Do not forget to open port 22 (SSH), this is typically the first rule normal people create after launching an instance. You are almost done! You should now install neo4j in your instance. Open a terminal in your localhost and navigate to the path where you downloaded neo4j. Copy the file to your Amazon instance by using the scp command: scp -i your_pem_file.pem neo4j-community-1.6.M01-unix.tar.gz ec2-user@YOUR_PUBLIC_INSTANCE_DNS:/home/ec2-user Please notice that you will need to change the path to your pem file, typically placed in ~/.ssh, the filename of the neo4j server you just downloaded and the plublic DNS of your instance. Now connect to your instance with SSH: ssh -i your_pem_file.pem ec2-user@YOUR_PUBLIC_INSTANCE_DNS Untar the neo4j server: tar xvfz neo4j-community-1.6.M01-unix.tar.gz.tar.gz Move it to /usr/local and rename the folder to neo4j: sudo mv neo4j-community-1.6.M01 /usr/local/neo4j Almost done!!! You should now open neo4j-server.properties under the conf directory and add the following line: org.neo4j.server.webserver.address=0.0.0.0 This lines allows anyone to connect remotely to your neo4j database server. Now run the start script. From the neo4j server folder. sudo ./bin/neo4j start Finally, open a browser and access the webadmin interface of your neo4j database by typing http://YOUR_PUBLIC_INSTANCE_DNS:7474. You should see the Neo4j Monitoring and Management Tool, pretty cool! If not, ask me You can now try using the REST API and the curl bash command to insert nodes and relationships. I hope this post helped you, good luck! Follow me on Twitter @negarnil Source: http://www.cloudtmp.com/java/how-to-deploy-a-neo4j-instance-in-amazon-ec2-in-10-minutes/

December 27, 2011

by Nicolas Garnil

· 27,412 Views · 1 Like

How to order by multiple columns using Lambas and LINQ in C#

Today, I was required to order a list of records based on a Name and then ID. A simple one, but I did spend some time on how to do it with Lambda Expression in C#. C# provides the OrderBy, OrderByDescending, ThenBy, ThenByDescending. You can use them in your lambda expression to order the records as per your requirement. Assuming your list is “Phones” and contains the following data public class Phone { public int ID { get; set; } public string Name { get; set; } } public class Phones : List { public Phones() { Add(new Phone { ID = 1, Name = "Windows Phone 7" }); Add(new Phone { ID = 5, Name = "iPhone" }); Add(new Phone { ID = 2, Name = "Windows Phone 7" }); Add(new Phone { ID = 3, Name = "Windows Mobile 6.1" }); Add(new Phone { ID = 6, Name = "Android" }); Add(new Phone { ID = 10, Name = "BlackBerry" }); } } If you were to use LINQ Query , the query will look like the one below dataGridView1.DataSource = (from m in new Phones() orderby m.Name, m.ID select m).ToList(); Simple isn’t it ? It very simple using Lamba expression too. Your Lambda’s expression for the above LINQ query will look like the one below dataGridView1.DataSource = new Phones().OrderByDescending(a => a.Name).ThenByDescending(a => a.ID).ToList();

December 24, 2011

by Senthil Kumar

· 136,839 Views

Enabling JMX in Hibernate, Ehcache, Quartz, DBPC and Spring

A collection of short how-to's for enabling JMX in several popular Java technologies. Continuing our journey with JMX (see: ...JMX for human beings) we will learn how to enable JMX support (typically statistics and monitoring capabilities) in some popular frameworks. Most of this information can be found on project's home pages, but I decided to collect it with few the addition of some useful tips. Hibernate (with Spring support) Exposing Hibernate statistics with JMX is pretty simple, however some nasty workarounds are requires when JPA API is used to obtain underlying SessionFactory class JmxLocalContainerEntityManagerFactoryBean() extends LocalContainerEntityManagerFactoryBean { override def createNativeEntityManagerFactory() = { val managerFactory = super.createNativeEntityManagerFactory() registerStatisticsMBean(managerFactory) managerFactory } def registerStatisticsMBean(managerFactory: EntityManagerFactory) { managerFactory match { case impl: EntityManagerFactoryImpl => val mBean = new StatisticsService(); mBean.setStatisticsEnabled(true) mBean.setSessionFactory(impl.getSessionFactory); val name = new ObjectName("org.hibernate:type=Statistics,application=spring-pitfalls") ManagementFactory.getPlatformMBeanServer.registerMBean(mBean, name); case _ => } } } Note that I have created a subclass of Springs built-in LocalContainerEntityManagerFactoryBean. By overriding createNativeEntityManagerFactory() method I can access EntityManagerFactory and by trying to downcast it to org.hibernate.ejb.EntityManagerFactoryImpl we were able to register Hibernate Mbean. One more thing has left. Obviously we have to use our custom subclass instead of org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean. Also, in order to collect the actual statistics instead of just seeing zeroes all the way down we must set the hibernate.generate_statistics flag. @Bean def entityManagerFactoryBean() = { val entityManagerFactoryBean = new JmxLocalContainerEntityManagerFactoryBean() entityManagerFactoryBean.setDataSource(dataSource()) entityManagerFactoryBean.setJpaVendorAdapter(jpaVendorAdapter()) entityManagerFactoryBean.setPackagesToScan("com.blogspot.nurkiewicz") entityManagerFactoryBean.setJpaPropertyMap( Map( "hibernate.hbm2ddl.auto" -> "create", "hibernate.format_sql" -> "true", "hibernate.ejb.naming_strategy" -> classOf[ImprovedNamingStrategy].getName, "hibernate.generate_statistics" -> true.toString ).asJava ) entityManagerFactoryBean } Here is a sample of what can we expect to see in JvisualVM (don't forget to install all plugins!): In addition we get a nice Hibernate logging: HQL: select generatedAlias0 from Book as generatedAlias0, time: 10ms, rows: 20 EhCache Monitoring caches is very important, especially in application where you expect values to generally be present there. I tend to query the database as often as needed to avoid unnecessary method arguments or local caching. Everything to make code as simple as possible. However this approach only works when caching on the database layer works correctly. Similar to Hibernate, enabling JMX monitoring in EhCache is a two-step process. First you need to expose provided MBean in MBeanServer: @Bean(initMethod = "init", destroyMethod = "dispose") def managementService = new ManagementService(ehCacheManager(), platformMBeanServer(), true, true, true, true, true) @Bean def platformMBeanServer() = ManagementFactory.getPlatformMBeanServer def ehCacheManager() = ehCacheManagerFactoryBean.getObject @Bean def ehCacheManagerFactoryBean = { val ehCacheManagerFactoryBean = new EhCacheManagerFactoryBean ehCacheManagerFactoryBean.setShared(true) ehCacheManagerFactoryBean.setCacheManagerName("spring-pitfalls") ehCacheManagerFactoryBean } Note that I explicitly set CacheManager name. This is not required but this name is used as part of the Mbean name and a default one contains hashCode value, which is not very pleasant. The final touch is to enable statistics on a cache basis: Now we can happily monitor various caching characteristics of every cache separately: As we can see the percentage of cache misses increases. Never a good thing. If we don't enable cache statistics, enabling JMX is still a good idea since we get a lot of management operations for free, including flushing and clearing caches (useful during debugging and testing). Quartz scheduler In my humble opinion Quartz scheduler is very underestimated library, but I will write an article about it on its own. This time we will only learn how to monitor it via JMX. Fortunately it's as simple as adding: org.quartz.scheduler.jmx.export=true To quartz.properties file. The JMX support in Quartz could have been slightly broader, but still one can query e.g. which jobs are currently running. By the way the new major version of Quartz (2.x) brings very nice DSL-like support for scheduling: val job = newJob(classOf[MyJob]) val trigger = newTrigger(). withSchedule( repeatSecondlyForever() ). startAt( futureDate(30, SECOND) ) scheduler.scheduleJob(job.build(), trigger.build()) Apache Commons DBCP Apache Commons DBCP is the most reasonable JDBC pooling library I came across. There is also c3p0, but it doesn't seem like it's actively developed any more. Tomcat JDBC Connection Pool looked promising, but since it's bundled in Tomcat, your JDBC drivers can no longer be packaged in WAR. The only problem with DBCP is that it does not support JMX. At all (see this two and a half year old issue). Fortunately this can be easily worked around. Besides we will learn how to use Spring built-in JMX support. Looks like the standard BasicDataSource has all what we need, all we have to do is to expose existing metrics via JMX. With Spring it is dead-simple – just subclass BasicDataSource and add @ManagedAttribute annotation over desired attributes: @ManagedResource class ManagedBasicDataSource extends BasicDataSource { @ManagedAttribute override def getNumActive = super.getNumActive @ManagedAttribute override def getNumIdle = super.getNumIdle @ManagedAttribute def getNumOpen = getNumActive + getNumIdle @ManagedAttribute override def getMaxActive: Int= super.getMaxActive @ManagedAttribute override def setMaxActive(maxActive: Int) { super.setMaxActive(maxActive) } @ManagedAttribute override def getMaxIdle = super.getMaxIdle @ManagedAttribute override def setMaxIdle(maxIdle: Int) { super.setMaxIdle(maxIdle) } @ManagedAttribute override def getMinIdle = super.getMinIdle @ManagedAttribute override def setMinIdle(minIdle: Int) { super.setMinIdle(minIdle) } @ManagedAttribute override def getMaxWait = super.getMaxWait @ManagedAttribute override def setMaxWait(maxWait: Long) { super.setMaxWait(maxWait) } @ManagedAttribute override def getUrl = super.getUrl @ManagedAttribute override def getUsername = super.getUsername } Here are few data source metrics going crazy during load-test: JMX support in the Spring framework itself is pretty simple. As you have seen above exposing arbitrary attribute or operation is just a matter of adding an annotation. You only have to remember about enabling JMX support using either XML or Java (also see: SPR-8943 : Annotation equivalent to with @Configuration): or: @Bean def annotationMBeanExporter() = new AnnotationMBeanExporter() This article wasn't particularly exciting. However, the knowledge of JMX metrics will enable us to write simple yet fancy dashboards in no time. Stay tuned! From http://nurkiewicz.blogspot.com/2011/12/enabling-jmx-in-hibernate-ehcache-qurtz.html

December 22, 2011

by Tomasz Nurkiewicz

· 12,670 Views

How to create offline HTML5 web apps in 5 easy steps

Among all cool new features introduced by HTML5, the possibility of caching web pages for offline use is definitely one of my favorites. Today, I’m glad to show you how you can create a page that will be available for offline browsing. Getting started View Demo Download files 1 – Add HTML5 doctype The first thing to do is create a valid HTML5 document. The HTML5 doctype is easier to remember than ones used for xhtml: ... Create a file named index.html, or get the example files from my CSS3 media queries article to use as a basis for this tutorial. In case you need it, the full HTML5 specs are available on the W3C website. 2 – Add .htaccess support The file we’re going to create to cache our web page is called a manifest file. Before creating it, we first have to add a directive to the .htaccess file (assuming your server is Apache). Open the .htaccess file, which is located on your website root, and add the following code: AddType text/cache-manifest .manifest This directive makes sure that every .manifest file is served as text/cache-manifest. If the file isn’t, then the whole manifest will have no effect and the page will not be available offline. 3 – Create the manifest file Now, things are going to be more interesting as we create a manifest file. Create a new file and save it as offline.manifest. Then, paste the following code in it. I’ll explain it later. CACHE MANIFEST #This is a comment CACHE index.html style.css image.jpg image-med.jpg image-small.jpg notre-dame.jpg Right now, you have a perfectly working manifest file. The way it works is very simple: After the CACHE declaration, you have to list each files you want to make available offline. That’s enough for caching a simple web page like the one from my example, but HTML5 caching has other interesting possibilities. For example, consider the following manifest file: CACHE MANIFEST #This is a comment CACHE index.html style.css NETWORK: search.php login.php FALLBACK: /api offline.html Like in the example manifest file, we have a CACHE declaration that caches index.html and style.css. But we also have the NETWORK declaration, which is used to specify files that shouldn’t be cached, such as a login page. The last declaration is FALLBACK. This declaration allows you to redirect the user to a particular file (in this example, offline.html) if a resource (/api) isn’t available offline. 4 – Link your manifest file to the html document Now, both your manifest file and your main html document are ready. The only thing you still have to do is to link the manifest file to the html document. Doing this is easy: simply add the manifest attribute to the html element as shown below: 5 – Test it Once done, you’re ready to go. If you visit your index.html file with Firefox 3.5+, you should see a banner like this one: Other browser I’ve tested (Chrome, Safari, Android and iPhone) do not warn about the file caching, and the file is automatically cached. Below you’ll find the browser compatibility of this technique: As usual Internet Explorer does not support it. IE: No support Firefox: 3.5+ Safari: 4.0+ Chrome: 5.0+ Opera: 10.6+ iPhone: 2.1+ Android: 2.0+ Source: http://www.catswhocode.com/blog/how-to-create-offline-html5-web-apps-in-5-easy-steps

December 22, 2011

by Jean-Baptiste Jung

· 24,103 Views

A Gentle Introduction to Making HTML5 Canvas Interactive

this is an big overhaul of one of my tutorials on making and moving shapes on an html5 canvas. this new tutorial is vastly cleaner than my old one, but if you still want to see that one or are looking for the concept of a “ghost context” you can find that one here. this tutorial will show you how to create a simple data structure for shapes on an html5 canvas and how to have them be selectable. the finished canvas will look like this: this article’s code is written primarily to be easy to understand. we’ll be going over a few things that are essential to interactive apps such as games (drawing loop, hit testing), and in later tutorials i will probably turn this example into a small game of some kind. i will try to accommodate javascript beginners but this introduction does expect at least a rudimentary understanding of js. not every piece of code is explained in the text, but almost every piece of code is thoroughly commented! the html5 canvas a canvas is made by using the tag in html: this text is displayed if your browser does not support html5 canvas. a canvas isn’t smart: it’s just a place for drawing pixels. if you ask it to draw something it will execute the drawing command and then immediately forget everything about what it has just drawn. this is sometimes referred to as an immediate drawing surface, as contrasted with svg as a retained drawing surface, since svg keeps a reference to everything drawn. because we have no such references, we have to keep track ourselves of all the things we want to draw (and re-draw) each frame. canvas also has no built-in way of dealing with animation. if you want to make something you’ve drawn move, you have to clear the entire canvas and redraw all of the objects with one or more of them moved. and you have to do it often, of course, if you want a semblance of animation or motion. so we’ll need to add: code for keeping track of objects code for keeping track of canvas state code for mouse events code for drawing the objects as they are made and move around keeping track of what we draw to keep things simple for this example we will start with a shape class to represent rectangular objects. javascript doesn’t technically have classes, but that isn’t a problem because javascript programmers are very good at playing pretend. functionally (well, for our example) we are going to have a shape class and create shape instances with it. what we are really doing is defining a function named shape and adding functions to shape’s prototype. you can make new instances of the function shape and all instances will share the functions defined on shape’s prototype. if you’ve never encountered prototypes in javascript before or if the above sounds confusing to you, i highly recommend reading crockford’s javascript: the good parts . the book is an intermediate overview of javascript that gives a good understanding of why programmers choose to create objects in different ways, why certain conventions are frowned upon, and just what makes javascript so different. here’s our shape constructor and one of the two prototype methods, which are comparable to a class instance methods: // constructor for shape objects to hold data for all drawn objects. // for now they will just be defined as rectangles. function shape(x, y, w, h, fill) { // this is a very simple and unsafe constructor. // all we're doing is checking if the values exist. // "x || 0" just means "if there is a value for x, use that. otherwise use 0." this.x = x || 0; this.y = y || 0; this.w = w || 1; this.h = h || 1; this.fill = fill || '#aaaaaa'; } // draws this shape to a given context shape.prototype.draw = function(ctx) { ctx.fillstyle = this.fill; ctx.fillrect(this.x, this.y, this.w, this.h); } keeping track of canvas state we’re going to have a second class/function called canvasstate. we’re only going to make one instance of this class and it will hold all of the state in this tutorial that is not associated with shapes themselves. canvasstate is going to foremost need a reference to the canvas and a few other field for convenience. we’re also going to compute and save the border and padding (if there is any) so that we can get accurate mouse coordinates. in the canvasstate constructor we will also have a collection of state relating to the objects on the canvas and the current status of dragging. we’ll make an array of shapes, a flag “dragging” that will be true while we are dragging, a field to keep track of which object is selected and a “valid” flag that will be set to false will cause the canvas to clear everything and redraw. is going to need an array of shapes to keep track of whats been drawn so far. i’m going to add a bunch of variables for keeping track of the drawing and mouse state. i already added boxes[] to keep track of each object, but we’ll also need a var for the canvas, the canvas’ 2d context (where wall drawing is done), whether the mouse is dragging, width/height of the canvas, and so on. we’ll also want to make a second canvas, for selection purposes, but i’ll talk about that later. function canvasstate(canvas) { // ... // i removed some setup code to save space // see the full source at the end // **** keep track of state! **** this.valid = false; // when set to true, the canvas will redraw everything this.shapes = []; // the collection of things to be drawn this.dragging = false; // keep track of when we are dragging // the current selected object. // in the future we could turn this into an array for multiple selection this.selection = null; this.dragoffx = 0; // see mousedown and mousemove events for explanation this.dragoffy = 0; mouse events we’ll add events for mousedown, mouseup, and mousemove that will control when an object starts and stops dragging. we’ll also disable the selectstart event, which stops double-clicking on canvas from accidentally selecting text on the page. finally we’ll add a double-click event that will create a new shape and add it to the canvasstate’s list of shapes. the mousedown event begins by calling getmouse on our canvasstate to return the x and y position of the mouse. we then iterate through the list of shapes to see if any of them contain the mouse position. we go through them backwards because they are drawn forwards, and we want to select the one that appears topmost, so we must find the potential shape that was drawn last. if we find the shape we save the offset, save that shape as our selection, set dragging to true and set the valid flag to false. already we’ve used most of our state! finally if we didn’t find any objects we need to see if there was a selection saved from last time. if there is we should clear it. since we clicked on nothing, we obviously didn’t click on the already-selected object! clearing the selection means we will have to clear the canvas and redraw everything without the selection ring, so we set the valid flag to false. // ... // (we are still in the canvasstate constructor) // this is an example of a closure! // right here "this" means the canvasstate. but we are making events on the canvas itself, // and when the events are fired on the canvas the variable "this" is going to mean the canvas! // since we still want to use this particular canvasstate in the events we have to save a reference to it. // this is our reference! var mystate = this; //fixes a problem where double clicking causes text to get selected on the canvas canvas.addeventlistener('selectstart', function(e) { e.preventdefault(); return false; }, false); // up, down, and move are for dragging canvas.addeventlistener('mousedown', function(e) { var mouse = mystate.getmouse(e); var mx = mouse.x; var my = mouse.y; var shapes = mystate.shapes; var l = shapes.length; for (var i = l-1; i >= 0; i--) { if (shapes[i].contains(mx, my)) { var mysel = shapes[i]; // keep track of where in the object we clicked // so we can move it smoothly (see mousemove) mystate.dragoffx = mx - mysel.x; mystate.dragoffy = my - mysel.y; mystate.dragging = true; mystate.selection = mysel; mystate.valid = false; return; } } // havent returned means we have failed to select anything. // if there was an object selected, we deselect it if (mystate.selection) { mystate.selection = null; mystate.valid = false; // need to clear the old selection border } }, true); the mousemove event checks to see if we have set the dragging flag to true. if we have it gets the current mouse positon and moves the selected object to that position, remembering the offset of where we were grabbing it. if the dragging flag is false the mousemove event does nothing. canvas.addeventlistener('mousemove', function(e) { if (mystate.dragging){ var mouse = mystate.getmouse(e); // we don't want to drag the object by its top-left corner, // we want to drag from where we clicked. // thats why we saved the offset and use it here mystate.selection.x = mouse.x - mystate.dragoffx; mystate.selection.y = mouse.y - mystate.dragoffy; mystate.valid = false; // something's dragging so we must redraw } }, true); the mouseup event is simple, all it has to do is update the canvasstate so that we are no longer dragging! so once you lift the mouse, the mousemove event is back to doing nothing. canvas.addeventlistener('mouseup', function(e) { mystate.dragging = false; }, true); the double click event we’ll use to add more shapes to our canvas. it calls addshape on the canvasstate with a new instance of shape. all addshape does is add the argument to the list of shapes in the canvasstate. // double click for making new shapes canvas.addeventlistener('dblclick', function(e) { var mouse = mystate.getmouse(e); mystate.addshape(new shape(mouse.x - 10, mouse.y - 10, 20, 20, 'rgba(0,255,0,.6)')); }, true); there are a few options i implemented, what the selection ring looks like and how often we redraw. setinterval simply calls our canvasstate’s draw method. our interval of 30 means that we call the draw method every 30 milliseconds. // **** options! **** this.selectioncolor = '#cc0000'; this.selectionwidth = 2; this.interval = 30; setinterval(function() { mystate.draw(); }, mystate.interval); } drawing now we’re set up to draw every 30 milliseconds, which will allow us to continuously update the canvas so it appears like the shapes we drag are smoothly moving around. however, drawing doesn’t just mean drawing the shapes over and over; we also have to clear the canvas on every draw. if we don’t clear it, dragging will look like the shape is making a solid line because none of the old shape-positions will go away. because of this, we clear the entire canvas before each draw frame. this can get expensive, and we only want to draw if something has actually changed within our framework, which is why we have the “valid” flag in our canvasstate. after everything is drawn the draw method will set the valid flag to true. then, ocne we do something like adding a new shape or trying to drag a shape, the state will get invalidated and draw() will clear, redraw all objects, and set the valid flag again. // while draw is called as often as the interval variable demands, // it only ever does something if the canvas gets invalidated by our code canvasstate.prototype.draw = function() { // if our state is invalid, redraw and validate! if (!this.valid) { var ctx = this.ctx; var shapes = this.shapes; this.clear(); // ** add stuff you want drawn in the background all the time here ** // draw all shapes var l = shapes.length; for (var i = 0; i < l; i++) { var shape = shapes[i]; // we can skip the drawing of elements that have moved off the screen: if (shape.x > this.width || shape.y > this.height || shape.x + shape.w < 0 || shape.y + shape.h < 0) return; shapes[i].draw(ctx); } // draw selection // right now this is just a stroke along the edge of the selected shape if (this.selection != null) { ctx.strokestyle = this.selectioncolor; ctx.linewidth = this.selectionwidth; var mysel = this.selection; ctx.strokerect(mysel.x,mysel.y,mysel.w,mysel.h); } // ** add stuff you want drawn on top all the time here ** this.valid = true; } } we go through all of shapes[] and draw each one in order. this will give the nice appearance of later shapes looking as if they are on top of earlier shapes. after all the shapes are drawn, a selection handle (if there is a selection) gets drawn around the shape that this.selection references. if you wanted a background (like a city) or a foreground (like clouds), one way to add them is to put them before or after the main two drawing bits. there are often better ways though, like using multiple canvases or a css background-image, but we won’t go over that here. getting mouse coordinates on canvas getting good mouse coordinates is a little tricky on canvas. you could use offsetx/y and layerx/y, but layerx/y is deprecated in webkit (chrome and safari) and firefox does not have offsetx/y. the most bulletproof way to get the correct mouse position is shown below. you have to walk up the tree adding the offsets together. then you must add any padding or border to the offset. finally, to fix coordinate problems when you have fixed-position elements on the page (like the wordpress admin bar or a stumbleupon bar) you must add the ’s offsettop and offsetleft. then you simply subtract that offset from the e.pagex/y values and you’ll get perfect coordinates in almost every possible situation. // creates an object with x and y defined, // set to the mouse position relative to the state's canvas // if you wanna be super-correct this can be tricky, // we have to worry about padding and borders canvasstate.prototype.getmouse = function(e) { var element = this.canvas, offsetx = 0, offsety = 0, mx, my; // compute the total offset if (element.offsetparent !== undefined) { do { offsetx += element.offsetleft; offsety += element.offsettop; } while ((element = element.offsetparent)); } // add padding and border style widths to offset // also add the offsets in case there's a position:fixed bar offsetx += this.stylepaddingleft + this.styleborderleft + this.htmlleft; offsety += this.stylepaddingtop + this.stylebordertop + this.htmltop; mx = e.pagex - offsetx; my = e.pagey - offsety; // we return a simple javascript object (a hash) with x and y defined return {x: mx, y: my}; } there are a few little methods i added that are not shown, such as shape’s method to see if a point is inside its bounds. you can see and download the full demo source here . now that we have a basic structure down, it is easy to write code that handles more complex shapes, like paths or images or video. rotation and scaling these things takes a bit more work, but is quite doable with the canvas and our selection method is already set up to deal with them. if you would like to see this code enhanced in future posts (or have any fixes), let me know. source: http://simonsarris.com/blog/510-making-html5-canvas-useful

December 21, 2011

by Simon Sarris

· 11,299 Views · 1 Like

Google Guava Cache

This Post is a continuation of my series on Google Guava, this time covering Guava Cache. Guava Cache offers more flexibility and power than either a HashMap or ConcurrentHashMap, but is not as heavy as using EHCache or Memcached (or robust for that matter, as Guava Cache operates solely in memory). The Cache interface has methods you would expect to see like ‘get’, and ‘invalidate’. A method you won’t find is ‘put’, because Guava Cache is ‘self-populating’, values that aren’t present when requested are fetched or calculated, then stored. This means a ‘get’ call will never return null. In all fairness, the previous statement is not %100 accurate. There is another method ‘asMap’ that exposes the entries in the cache as a thread safe map. Using ‘asMap’ will result in not having any of the self loading operations performed, so calls to ‘get’ will return null if the value is not present (What fun is that?). Although this is a post about Guava Cache, I am going to spend the bulk of the time talking about CacheLoader and CacheBuilder. CacheLoader specifies how to load values, and CacheBuilder is used to set the desired features and actually build the cache. CacheLoader CacheLoader is an abstract class that specifies how to calculate or load values, if not present. There are two ways to create an instance of a CacheLoader: Extend the CacheLoader class Use the static factory method CacheLoader.from If you extend CacheLoader you need to override the V load(K key) method, instructing how to generate the value for a given key. Using the static CacheLoader.from method you build a CacheLoader either by supplying a Function or Supplier interface. When supplying a Function object, the Function is applied to the key to calculate or retrieve the results. Using a Supplier interface the value is obtained independent of the key. CacheBuilder The CacheBuilder is used to construct cache instances. It uses the fluent style of building and gives you the option of setting the following properties on the cache: Cache Size limit (removals use a LRU algorithm) Wrapping keys in WeakReferences (Strong references used by default for keys) Wrapping values in either WeakReferences or SoftReferences (Strong references used by default) Time to expire entires after last access Time based expiration of entries after being written or updated Setting a RemovalListener that can recieve events once an entry is removed from the cache Concurrency Level of the cache (defaults to 4) The concurrency level option is used to partition the table internally such that updates can occur without contention. The ideal setting would be the maximum number of threads that could potentially access the cache at one time. Here is an example of a possible usage scenario for Guava Cache. public class PersonSearchServiceImpl implements SearchService> { public PersonSearchServiceImpl(SampleLuceneSearcher luceneSearcher, SampleDBService dbService) { this.luceneSearcher = luceneSearcher; this.dbService = dbService; buildCache(); } @Override public List search(String query) throws Exception { return cache.get(query); } private void buildCache() { cache = CacheBuilder.newBuilder().expireAfterWrite(10, TimeUnit.MINUTES) .maximumSize(1000) .build(new CacheLoader>() { @Override public List load(String queryKey) throws Exception { List ids = luceneSearcher.search(queryKey); return dbService.getPersonsById(ids); } }); } } In this example, I am setting the cache entries to expire after 10 minutes of being written or updated in the cache, with a maximum amount of 1,000 entires. Note the usage of CacheLoader on line 15. RemovalListener The RemovalListener will receive notification of an item being removed from the cache. These notifications could be from manual invalidations or from a automatic one due to time expiration or garbage collection. The RemovalListener parameters can be set to listen for specific type. To receive notifications for any key or value set them to use Object. It should be noted here that a RemovalListener will receive a RemovalNotification object that implements the Map.Entry interface. The key or value could be null if either has already been garbage collected. Also the key and value object will be strong references, regardless of the type of references used by the cache. CacheStats There is also a very useful class CacheStats that can be retrieved via a call to Cache.stats(). The CacheStats object can give insight into the effectiveness and performance of your cache by providing statistics such as: hit count miss count total load timme total requests CacheStats provides many other counts in addition to the ones listed above. Conclusion The Guava Cache presents some very compelling functionality. The decision to use a Guava Cache really comes down to the tradeoff between memory availability/usage versus increases in performance. I have added a unit test CacheTest demonstrating the usages discussed here. As alway comments and suggestions are welcomed. Thanks for your time. Resources Guava Project Home Cache API Source Code for blog series From http://codingjunkie.net/google-guava-cache/

December 16, 2011

by Bill Bejeck

· 52,890 Views · 1 Like

How To Sort String Array Using LINQ In C#

A string[] array and use a LINQ query expression to order its contents alphabetically. Note that we are ordering the strings, not the letters in the strings. using System; using System.Linq; class Program { static void Main() { string[] a = new string[] {"Indonesian","Korean","Japanese","English","German"}; var sort = from s in a orderby s select s; foreach (string c in sort) { Console.WriteLine(c); } } } /*OUTPUT English German Indonesian Japanese Korean */ Eclipse IDE and Eclipse

December 14, 2011

by Snippets Manager

· 10,948 Views · 1 Like

Easy URL rewriting in ASP.NET 4.0 web forms

In this post I am going to explain URL rewriting in greater details. This post will contain basic of URL rewriting and will explain how we can do URL rewriting in fewer lines of code. Why we need URL rewriting? Let’s consider a simpler scenario we want to display a customer details on a ASP.NET Page so how our page will know that for which customer we need to display details? The simplest way of doing is to use query string we will pass a customer id which uniquely identifies customer in query string. So our url will look like this. Customer.aspx?Id=1 This will work but the problem with above URL is that its not user friendly and search engine friendly. who is going to remember that what query string parameter I am going to pass and why we need that parameter. Also when search engine will crawl this site it will going to read this URL blindly as this url is not informative because it query string is not readable for search engine crawlers. So your search engine will be ranked lower as this URL is not readable to search engine crawlers.Now when do a URL rewriting our URL will be cleaner shorter and simpler like this. Customers/Id/1/ Here anybody in world can understand it talking about customer and this page will used to show customer details.Even search engine crawler will also know that you are talking about customers. That is why we need URL rewriting. URL rewriting and ASP.NET In earlier versions of ASP.NET we have to write lots of code for URL rewriting but Now with ASP.NET 4.0 you can easily rewrite in fewer lines of code. So let’s start a Demo where I will demonstrate you how we can easily rewrite URLs. So let’s first create a ASP. NET web form application via File->New->Project and a dialog box will open just like below and then created a empty project called URL rewriting. After creating a project I have added global.asax – where we are going to write url mapping logic and then I have added an asp.net page called which will display customer information just like below. So everything is now ready let’s start writing code. First thing we need to do is to define routes. Route will map a URL to physical page.First I will create static function called Register route which map route to particular file and then I am going to call this from application_start event of global.asax. Following is code for that. using System; using System.Web.Routing; namespace UrlRewriting { public class Global : System.Web.HttpApplication { protected void Application_Start(object sender, EventArgs e) { RegisterRoutes(RouteTable.Routes); } public static void RegisterRoutes(RouteCollection routeCollection) { routeCollection.MapPageRoute("RouteForCustomer", "Customer/{Id}", "~/Customer.aspx"); } } } Now as mapping code has been done let’s right code for customer.aspx page. I have have following code in page_load event of customer.aspx using System; namespace UrlRewriting { public partial class Customer : System.Web.UI.Page { protected void Page_Load(object sender, EventArgs e) { string id = Page.RouteData.Values["Id"].ToString(); Response.Write("Customer Details page"); Response.Write(string.Format("Displaying information for customer : {0}",id)); } } } Here in above code you can see that I am getting value from page route data and then just printing it. In real world it will fetch customer data from database and show customer details on page. Now let’s run that application. It will print a details of customer as I have passed Id in URL suppose you pass 1 as id in URL then it will look like following. Now if you put 2 in url it will print information about customer 2. That’s it. So we have enabled URL rewriting in asp.net in fewer lines of code. In next post I am going to explain redirection with URL rewriting. Hope you like this post. Stay tuned for more.. Till then Happy programming

December 10, 2011

by Jalpesh Vadgama

· 28,519 Views

MySQL vs. Neo4j on a Large-Scale Graph Traversal

this post presents an analysis of mysql (a relational database) and neo4j (a graph database) in a side-by-side comparison on a simple graph traversal. the data set that was used was an artificially generated graph with natural statistics. the graph has 1 million vertices and 4 million edges. the degree distribution of this graph on a log-log plot is provided below. a visualization of a 1,000 vertex subset of the graph is diagrammed above. loading the graph the graph data set was loaded both into mysql and neo4j. in mysql a single table was used with the following schema. create table graph ( outv int not null, inv int not null ); create index outv_index using btree on graph (outv); create index inv_index using btree on graph (inv); after loading the data, the table appears as below. the first line reads: “vertex 0 is connected to vertex 1.” mysql> select * from graph limit 10; +------+-----+ | outv | inv | +------+-----+ | 0 | 1 | | 0 | 2 | | 0 | 6 | | 0 | 7 | | 0 | 8 | | 0 | 9 | | 0 | 10 | | 0 | 12 | | 0 | 19 | | 0 | 25 | +------+-----+ 10 rows in set (0.04 sec) the 1 million vertex graph data set was also loaded into neo4j. in gremlin , the graph edges appear as below. the first line reads: “vertex 0 is connected to vertex 992915.” gremlin> g.e[1..10] ==>e[183][0-related->992915] ==>e[182][0-related->952836] ==>e[181][0-related->910150] ==>e[180][0-related->897901] ==>e[179][0-related->871349] ==>e[178][0-related->857804] ==>e[177][0-related->798969] ==>e[176][0-related->773168] ==>e[175][0-related->725516] ==>e[174][0-related->700292] warming up the caches before traversing the graph data structure in both mysql and neo4j, each database had a “ warm up ” procedure run on it. in mysql, a “select * from graph” was evaluated and all of the results were iterated through. in neo4j, every vertex in the graph was iterated through and the outgoing edges of each vertex were retrieved. finally, for both mysql and neo4j, the experiment discussed next was run twice in a row and the results of the second run were evaluated. traversing the graph the traversal that was evaluated on each database started from some root vertex and emanated n-steps out. there was no sorting, no distinct-ing, etc. the only two variables for the experiments are the length of the traversal and the root vertex to start the traversal from. in mysql, the following 5 queries denote traversals of length 1 through 5. note that the “?” is a variable parameter of the query that denotes the root vertex. select a.inv from graph as a where a.outv=? select b.inv from graph as a, graph as b where a.inv=b.outv and a.outv=? select c.inv from graph as a, graph as b, graph as c where a.inv=b.outv and b.inv=c.outv and a.outv=? select d.inv from graph as a, graph as b, graph as c, graph as d where a.inv=b.outv and b.inv=c.outv and c.inv=d.outv and a.outv=? select e.inv from graph as a, graph as b, graph as c, graph as d, graph as e where a.inv=b.outv and b.inv=c.outv and c.inv=d.outv and d.inv=e.outv and a.outv=? for neo4j, the blueprints pipes framework was used. a pipe of length n was constructed using the following static method. public static pipeline createpipeline(final integer steps) { final arraylist pipes = new arraylist(); for (int i = 0; i < steps; i++) { pipe pipe1 = new vertexedgepipe(vertexedgepipe.step.out_edges); pipe pipe2 = new edgevertexpipe(edgevertexpipe.step.in_vertex); pipes.add(pipe1); pipes.add(pipe2); } return new pipeline(pipes); } for both mysql and neo4j, the results of the query (sql and pipes) were iterated through. thus, all results were retrieved for each query. in mysql, this was done as follows. while (resultset.next()) { resultset.getint(finalcolumn); } in neo4j, this is done as follows. while (pipeline.hasnext()) { pipeline.next(); } experimental results the artificial graph dataset was constructed with a “ rich get richer “, preferential attachment model . thus, the vertices created earlier are the most dense (i.e. highest number of adjacent vertices). this property was used to limit the amount of time it would take to evaluate the tests for each traversal. only the first 250 vertices were used as roots of the traversals. before presenting timing results, note that all of these experiments were run on a macbook pro with a 2.66ghz intel core 2 duo and 4gigs of ram at 1067 mhz ddr3. the packages used were java 1.6, mysql jdbc 5.0.8, and blueprints pipes 0.1.2. java version "1.6.0_17" java(tm) se runtime environment (build 1.6.0_17-b04-248-10m3025) java hotspot(tm) 64-bit server vm (build 14.3-b01-101, mixed mode) the following java virtual machine parameters were used: -xmx1000m -xms500m below are the total running times for both mysql (red) and neo4j (blue) for traversals of length 1, 2, 3, and 4. the raw data is presented below along with the total number of vertices returned by each traversal—which, of course, is the same for both mysql and neo4j given that its the same graph data set being processed. also realize that traversals can loop and thus, many of the same vertices are returned multiple times. finally, note that only neo4j has the running time for a traversal of length 5. mysql did not finish after waiting 2 hours to complete. in comparison, neo4j took 14.37 minutes to complete a 5 step traversal. [mysql steps-1] time(ms):124 -- vertices_returned:11360 [mysql steps-2] time(ms):922 -- vertices_returned:162640 [mysql steps-3] time(ms):8851 -- vertices_returned:2206437 [mysql steps-4] time(ms):112930 -- vertices_returned:28125623 [mysql steps-5] n/a [neo4j steps-1] time(ms):27 -- vertices_returned:11360 [neo4j steps-2] time(ms):474 -- vertices_returned:162640 [neo4j steps-3] time(ms):3366 -- vertices_returned:2206437 [neo4j steps-4] time(ms):49312 -- vertices_returned:28125623 [neo4j steps-5] time(ms):862399 -- vertices_returned:358765631 next, the individual data points for both mysql and neo4j are presented in the plot below. each point denotes how long it took to return n number of vertices for the varying traversal lengths. finally, the data below provides the number of vertices returned per millisecond (on average) for each of the traversals. again, mysql did not finish in its 2 hour limit for a traversal of length 5. [mysql steps-1] vertices/ms:91.6128847554668 [mysql steps-2] vertices/ms:176.399127537985 [mysql steps-3] vertices/ms:249.286746556076 [mysql steps-4] vertices/ms:249.053599519823 [mysql steps-5] n/a [neo4j steps-1] vertices/ms:420.740351166341 [neo4j steps-2] vertices/ms:343.122344772028 [neo4j steps-3] vertices/ms:655.507125256186 [neo4j steps-4] vertices/ms:570.360621871775 [neo4j steps-5] vertices/ms:416.00886711325 conclusion in conclusion, given a traversal of an artificial graph with natural statistics, the graph database neo4j is more optimal than the relational database mysql. however, no attempts have been made to optimize the java vm, the sql queries, etc. these experiments were run with both neo4j and mysql “out of the box” and with a “natural syntax” for both types of queries. source: http://markorodriguez.com/2011/02/18/mysql-vs-neo4j-on-a-large-scale-graph-traversal/

December 5, 2011

by Marko Rodriguez

· 58,413 Views · 1 Like

Create Your Own XML/JSON/HTML API with PHP

Develop your own API service for your PHP projects.

December 1, 2011

by Andrei Prikaznov

· 68,157 Views

Zero Downtime – What is it and why is it important?

For most large web applications, uptime is of foremost importants. Any outage can be seen by customers as a frustration, or opportunity to move to a competitor. What's more for a site that also includes e-commerce, it can mean real lost sales. Zero Downtime describes a site without service interruption. To achieve such lofty goals, redundancy becomes a critical requirement at every level of your infrastructure. If you're using cloud hosting, are you redundant to alternate availability zones and regions? Are you using geographically distributed load balancing? Do you have multiple clustered databases on the backend, and multiple webservers load balanced. All of these requirements will increase uptime, but may not bring you close to zero downtime. For that you'll need thorough testing. The solution is to pull the trigger on sections of your infrastructure, and prove that it fails over quickly without noticeable outage. The ultimate test is the outage itself. Sean Hull on Quora: What is zero downtime and why is it important? Source: http://www.iheavy.com/2011/06/23/zero-downtime-what-is-it-and-why-is-it-important/

November 23, 2011

by Sean Hull

· 26,149 Views

Eventual Consistency in NoSQL Databases: Theory and Practice

One of NoSQL's goals: handle previously-unthinkable amounts of data. One of unthinkable-amounts-of-data's problems: previously-improbable events become extremely probable, precisely because the set of interactions is so large. Flip a coin a hundred times, and you're not likely to get 50 heads in a row. But flip it a few trillion times, and you probably will find some 50-heads streaks. So NoSQL's performance strength is also its mathematical weakness. This order of scale can result in lots of problems, but one of the most common is consistency -- the C in ACID -- clearly a fundamental desideratum for any database system, but in principle much harder to acheive for NoSQL databases than for others. Emerging database technologies have forced developers and computer scientists to define more exactly what kind of consistency is really needed, for any given application. Two years ago, ACM (the Association for Computing Machinery) published an extremely helpful examination of the attenuated notion of consistency called 'eventual consistency'. Their summary: Data inconsistency in large-scale reliable distributed systems must be tolerated for two reasons: improving read and write performance under highly concurrent conditions; and handling partition cases where a majority model would render part of the system unavailable even though the nodes are up and running. The article surveys technical solutions as well as user considerations that might soften the undesirability of anything less than perfect, instantaneous consistency. It's not long (4 pages plus pictures), and explains some deep database issues quite clearly. On the more practical side of the problem: Russell Brown recently gave a talk at the NoSQL Exchange 2011 on exactly this topic. More specifically, he showed how some distributed systems (Riak in particular) try to minimize conflicts, and suggested some ways to reconcile conflicts automatically using smart semantic techniques. Check out the NoSQL Exchange page for Russell's talk here, which includes an embedded video. But read the ACM article first for a broader overview, since Russell launches into technical details pretty quickly.

November 22, 2011

by John Esposito

· 12,534 Views