DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest Open Source Topics

article thumbnail
JMS vs RabbitMQ
Definition : JMS : Java Message Service is an API that is part of Java EE for sending messages between two or more clients. There are many JMS providers such as OpenMQ (glassfish’s default), HornetQ(Jboss), and ActiveMQ. RabbitMQ: is an open source message broker software which uses the AMQP standard and is written by Erlang. Messaging Model: JMS supports two models: one to one and publish/subscriber. RabbitMQ supports the AMQP model which has 4 models : direct, fanout, topic, headers. Data types: JMS supports 5 different data types but RabbitMQ supports only the binary data type. Workflow strategy: In AMQP, producers send to the exchange then the queue, but in JMS, producers send to the queue or topic directly. Technology compatibility: JMS is specific for java users only, but RabbitMQ supports many technologies. Performance: If you would like to know more about their performance, this benchmark is a good place to start, but look for others as well.
July 30, 2013
by Saeid Siavashi
· 51,726 Views · 16 Likes
article thumbnail
Using Morphia to Map Java Objects in MongoDB
MongoDB is an open source document-oriented NoSQL database system which stores data as JSON-like documents with dynamic schemas. As it doesn't store data in tables as is done in the usual relational database setup, it doesn't map well to the JPA way of storing data. Morphia is an open source lightweight type-safe library designed to bridge the gap between the MongoDB Java driver and domain objects. It can be an alternative to SpringData if you're not using the Spring Framework to interact with MongoDB. This post will cover the basics of persisting and querying entities along the lines of JPA by using Morphia and a MongoDB database instance. There are four POJOs this example will be using. First we have BaseEntity which is an abstract class containing the Id and Version fields: package com.city81.mongodb.morphia.entity; import org.bson.types.ObjectId; import com.google.code.morphia.annotations.Id; import com.google.code.morphia.annotations.Property; import com.google.code.morphia.annotations.Version; public abstract class BaseEntity { @Id @Property("id") protected ObjectId id; @Version @Property("version") private Long version; public BaseEntity() { super(); } public ObjectId getId() { return id; } public void setId(ObjectId id) { this.id = id; } public Long getVersion() { return version; } public void setVersion(Long version) { this.version = version; } } Whereas JPA would use @Column to rename the attribute, Morphia uses @Property. Another difference is that @Property needs to be on the variable whereas @Column can be on the variable or the get method. The main entity we want to persist is the Customer class: package com.city81.mongodb.morphia.entity; import java.util.List; import com.google.code.morphia.annotations.Embedded; import com.google.code.morphia.annotations.Entity; @Entity public class Customer extends BaseEntity { private String name; private List accounts; @Embedded private Address address; public String getName() { return name; } public void setName(String name) { this.name = name; } public List getAccounts() { return accounts; } public void setAccounts(List accounts) { this.accounts = accounts; } public Address getAddress() { return address; } public void setAddress(Address address) { this.address = address; } } As with JPA, the POJO is annotated with @Entity. The class also shows an example of @Embedded: The Address class is also annotated with @Embedded as shown below: package com.city81.mongodb.morphia.entity; import com.google.code.morphia.annotations.Embedded; @Embedded public class Address { private String number; private String street; private String town; private String postcode; public String getNumber() { return number; } public void setNumber(String number) { this.number = number; } public String getStreet() { return street; } public void setStreet(String street) { this.street = street; } public String getTown() { return town; } public void setTown(String town) { this.town = town; } public String getPostcode() { return postcode; } public void setPostcode(String postcode) { this.postcode = postcode; } } Finally, we have the Account class of which the customer class has a collection of: package com.city81.mongodb.morphia.entity; import com.google.code.morphia.annotations.Entity; @Entity public class Account extends BaseEntity { private String name; public String getName() { return name; } public void setName(String name) { this.name = name; } } The above show only a small subset of what annotations can be applied to domain classes. More can be found at http://code.google.com/p/morphia/wiki/AllAnnotations The Example class shown below goes through the steps involved in connecting to the MongoDB instance, populating the entities, persisting them and then retrieving them: package com.city81.mongodb.morphia; import java.net.UnknownHostException; import java.util.ArrayList; import java.util.List; import com.city81.mongodb.morphia.entity.Account; import com.city81.mongodb.morphia.entity.Address; import com.city81.mongodb.morphia.entity.Customer; import com.google.code.morphia.Datastore; import com.google.code.morphia.Key; import com.google.code.morphia.Morphia; import com.mongodb.Mongo; import com.mongodb.MongoException; /** * A MongoDB and Morphia Example * */ public class Example { public static void main( String[] args ) throws UnknownHostException, MongoException { String dbName = new String("bank"); Mongo mongo = new Mongo(); Morphia morphia = new Morphia(); Datastore datastore = morphia.createDatastore(mongo, dbName); morphia.mapPackage("com.city81.mongodb.morphia.entity"); Address address = new Address(); address.setNumber("81"); address.setStreet("Mongo Street"); address.setTown("City"); address.setPostcode("CT81 1DB"); Account account = new Account(); account.setName("Personal Account"); List accounts = new ArrayList(); accounts.add(account); Customer customer = new Customer(); customer.setAddress(address); customer.setName("Mr Bank Customer"); customer.setAccounts(accounts); Key savedCustomer = datastore.save(customer); System.out.println(savedCustomer.getId()); } Executing the first few lines will result in the creation of a Datastore. This interface will provide the ability to get, delete and save objects in the 'bank' MongoDB instance. The mapPackage method call on the morphia object determines what objects are mapped by that instance of Morphia. In this case all those in the package supplied. Other alternatives exist to map classes, including the method map which takes a single class (this method can be chained as the returning object is the morphia object), or passing a Set of classes to the Morphia constructor. After creating instances of the entities, they can be saved by calling save on the datastore instance and can be found using the primary key via the get method. The output from the Example class would look something like the below: 11-Jul-2012 13:20:06 com.google.code.morphia.logging.MorphiaLoggerFactory chooseLoggerFactory INFO: LoggerImplFactory set to com.google.code.morphia.logging.jdk.JDKLoggerFactory 4ffd6f7662109325c6eea24f Mr Bank Customer There are many other methods on the Datastore interface and they can be found along with the other Javadocs at http://morphia.googlecode.com/svn/site/morphia/apidocs/index.html An alternative to using the Datastore directly is to use the built in DAO support. This can be done by extending the BasicDAO class as shown below for the Customer entity: package com.city81.mongodb.morphia.dao; import com.city81.mongodb.morphia.entity.Customer; import com.google.code.morphia.Morphia; import com.google.code.morphia.dao.BasicDAO; import com.mongodb.Mongo; public class CustomerDAO extends BasicDAO { public CustomerDAO(Morphia morphia, Mongo mongo, String dbName) { super(mongo, morphia, dbName); } } To then make use of this, the Example class can be changed (and enhanced to show a query and a delete): ... CustomerDAO customerDAO = new CustomerDAO(morphia, mongo, dbName); customerDAO.save(customer); Query query = datastore.createQuery(Customer.class); query.and( query.criteria("accounts.name").equal("Personal Account"), query.criteria("address.number").equal("81"), query.criteria("name").contains("Bank") ); QueryResults retrievedCustomers = customerDAO.find(query); for (Customer retrievedCustomer : retrievedCustomers) { System.out.println(retrievedCustomer.getName()); System.out.println(retrievedCustomer.getAddress().getPostcode()); System.out.println(retrievedCustomer.getAccounts().get(0).getName()); customerDAO.delete(retrievedCustomer); } ... With the output from running the above shown below: 11-Jul-2012 13:30:46 com.google.code.morphia.logging.MorphiaLoggerFactory chooseLoggerFactory INFO: LoggerImplFactory set to com.google.code.morphia.logging.jdk.JDKLoggerFactory Mr Bank Customer CT81 1DB Personal Account This post only covers a few brief basics of Morphia but shows how it can help bridge the gap between JPA and NoSQL.
July 25, 2013
by Geraint Jones
· 76,127 Views
article thumbnail
Integration of Amazon Redshift Data Warehouse with Talend Data Integration
In this blog post, I will show you how to "ETL" all kinds of data to Amazon’s cloud data warehouse Redshift wit Talend’s big data components. Let’s begin with a short introduction to Amazon Redshift (copied from website): "Amazon Redshift is [part of Amazon Web Services (AWS) and] a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. With a few clicks in the AWS Management Console, customers can launch a Redshift cluster, starting with a few hundred gigabytes and scaling to a petabyte or more, for under $1,000 per terabyte per year. Traditional data warehouses require significant time and resource to administer, especially for large datasets. In addition, the financial cost associated with building, maintaining, and growing self-managed, on-premise data warehouses is very high. Amazon Redshift not only significantly lowers the cost of a data warehouse, but also makes it easy to analyze large amounts of data very quickly.“ Sounds interesting! And indeed, we already see companies using Talend’s Redshift connectors. From Talend perspective it is not much more than just another database. If you have ever used a Talend connector, you can integrate to Redshift within some minutes. In the next sections, I will describe all necessary steps and give some hints regarding configuration issues and performance improvements. Be aware: You need Talend Open Studio for Data Integration (Apache License, open source) or any Talend Enterprise Edition / Platform which contains the Cloud components to see and use Amazon Redshift connectors. The open source edition offers all connectors and functionality to integrate with Amazon Redshift. However, Enterprise versions offer some more features (e.g. versioning), comfort (e.g. wizards) and commercial support. Setup Amazon Redshift Setup of Amazon Redshift is very easy. Just follow Amazon‘s getting started guide: http://docs.aws.amazon.com/redshift/latest/gsg/welcome.html. Like every other AWS guide, it is very easy to understand and use. Be aware, that you just have to do step 1, 2 and 3 of the getting started guide for using it with Talend. Some hints: - Step 1 („before you begin“): Just sign up. Client tools and drivers are not necessary because they are already installed within Talend Studio. - Step 2 („launch a cluster“): Yes, please start your cluster! - Step 3(„authorize access“): If you are not sure what to do here, select Connection Type = CIDR/IP. Find out your IP address (http://whatismyipaddress.com) and enter it with „/32“ at the end. Example: „192.168.1.1/32“ Now you can connect to Amazon Redshift from your Talend Studio on your local computer. Step 4 (connect) and step 5 (create table, data, queries) are not necessary, this will be done from Talend Studio. Of course, you should not forget to delete your cluster (step 7) when you are done. Otherwise, you will pay for every hour, even if you do not access your DWH. Connect to Amazon Redshift from Talend Studio Create a new connection to Amazon Redshift database as you do with every other relational database. The easiest way is to use „DB Connection Wizard“ in metadata. Just enter your connection information and check if it works. You get all information about configuration from Amazon Web Console. The connection string looks something like this: „jdbc:paraccel://talend-demo-cluster.cp8t6c5.eu-west-1.redshift.amazonaws.com:5439/dev“ Next, right click on the created connection and select „retrieve schema“. „public“ is the default schema which you (have to) use. Now, you are ready to use this connection within Talend Jobs to write to Amazon Redshift and read from it. Create Talend Jobs (Write, Read, Delete) Amazon Redshift components work like any other Talend (relational) database components. Look at www.help.talend.com for more information if you have not used them before (or just try them out, they are very self-explanatory). You just have to drag&drop your connection from metadata . Afterwards, you can easily write data (tRedShiftOutput), read data (tRedshiftInput), or do any other queries such as delete or copy (tRedShiftRow). In the following job, I start with deleting all content in the Amazon Redshift table. Then, I read data from a MySQL table and insert it into an Amazon Redshift table. The table is created automatically (as I have configured it this way). After this subjob is finished, I read the data again, and store it to a CSV file (which is also created automatically). Of course, this is no business use case, but it shows how to use different Amazon Redshift components. Query Data from Amazon Redshift You can connect to Amazon Redshift directly from Talend Studio to explore and query data of the DWH. Thus, no other database tool is required. Just right click on your Amazon Redshift connection in metadata and select „edit queries“. Here you can define, execute and save SQL queries. Improve Performance Write performance of Amazon Redshift is relatively low compared to „classical“ relational databases (in your data center) as you have to upload all data into the cloud. Different alternatives exist to improve performance: - Bulk inserts: „Extended insert“ (in advanced settings) improves performance a lot, but still not to hyperspeed… Also, as it is bulk, you can just do inserts! It is not compatible to „rejects“ or „updates“ - AWS S3 and COPY command: S3 is Amazon’s „simple storage service“, a key-value store – also called NoSQL today – for storing very large objects. You can use Amazon Redshift’s COPY command (http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html) to transfer data from S3 to Amazon Redshift with good performance. Though, you still have to copy data to S3 before, same „cloud problem“ here. The COPY command can be used with tRedshiftRow, so no problem at all from Talend perspective. To transfer data to S3, you can either use the Talend S3 components from Talendforge, Talend’s open source community (http://www.talendforge.org/exchange), or use camel-s3, an Apache Camel component which is included in Talend ESB. The latter is an option, if you use Talend Data Services which combines Talend DI and Talend ESB in its unified platform. Summary You need not be a cloud or DWH expert, or an expert developer to integrate with Amazon’s cloud data warehouse Redshift. It is very easy with Talend’s integration solutions. Just drag&drop, configure, do some graphical mappings / transformations (if necessary), that’s it. Code is generated. Job runs. You can integrate Amazon Redshift almost as simple as any other relational database. Just be aware of some cloud specific security and performance issues. With Talend, you can easily „ETL“ all data from different sources to Redshift and store it there for under $1,000 per terabyte per year – even with the open source version! Best regards, Kai Wähner (Contact and feedback via @KaiWaehner, www.kai-waehner.de, LinkedIn / Xing) This is content from my blog: http://www.kai-waehner.de/blog/2013/06/26/integration-of-amazon-redshift-cloud-data-warehouse-aws-saas-dwh-with-talend-data-integration-di-big-data-bd-enterprise-service-bus-esb/
June 27, 2013
by Kai Wähner DZone Core CORE
· 20,475 Views · 1 Like
article thumbnail
ActiveMQ and .NET combined!
ActiveMQ is one of the most popular messaging frameworks. For sure the most popular open source framework. Many people think that ActiveMQ works only with Java and this is not true at all. ActiveMQ can work with almost every popular language (including JavaScript!) through numerous protocols which it supports. Today I will show you how to use ActiveMQ in .NET-based solutions. Project setup Using VS 2010's Extension Manger I installed NuGet Package Manager. After installation and VS 2010 restart, I created a project called ActiveMQNMS. I right-clicked it and selected "Manage NuGet packages...". In the search field I typed: "ActiveMQ". There was a package called Apache.NMS.ActiveMQ. I installed it. (Note: ActiveMQ has one dependency - Apache.NMS package. The NMS package provides a unified API for working with different messaging frameworks and providers.) Starting ActiveMQ I already had ActiveMQ installed on my machine. If you don't have one, download it from http://activemq.apache.org. The default instance listens on 61616 port. However, mine is listening on 62626. If you want to run my code, please remember to change the port. To start ActiveMQ I executed: activemq-5.5.0\bin\activemq Depending on configured ports, you can use ActiveMQ web console to manage your queues, topics, subscribers, connections, embedded Apache Camel, etc. I'm using 8282 port, and the console URL is: http://localhost:8282/admin. Test stub In general the .NET API is almost a copy of the Java API. So if you're familiar with JMS and/or ActiveMQ you don't need any documentation. Please note TestIntialize and TestCleanup methods. using System; using Apache.NMS; using Apache.NMS.ActiveMQ; using Microsoft.VisualStudio.TestTools.UnitTesting; namespace ActiveMQNMS { [Serializable] public class Person { public string FirstName { get; set; } public string LastName { get; set; } } [TestClass] public class ActiveMqTest { private IConnection _connection; private ISession _session; private const String QUEUE_DESTINATION = "DotNet.ActiveMQ.Test.Queue"; [TestInitialize] public void TestInitialize() { IConnectionFactory factory = new ConnectionFactory("tcp://localhost:62626"); _connection = factory.CreateConnection(); _connection.Start(); _session = _connection.CreateSession(); } [TestCleanup] public void TestCleanup() { _session.Close(); _connection.Close(); } } } Writing Producer Here is the producer: [TestMethod] public void TestA() { IDestination dest = _session.GetQueue(QUEUE_DESTINATION); using (IMessageProducer producer = _session.CreateProducer(dest)) { var person = new Person { FirstName = "Łukasz", LastName = "Budnik" }; var objectMessage = producer.CreateObjectMessage(person); producer.Send(objectMessage); } } Run the test and refresh "Queues" list in ActiveMQ web console. You should see DotNet.ActiveMQ.Test.Queue queue with 1 enqueued and pending message. Purge the queue by hitting the purge link or you simply delete it. Writing Consumer Now we have to consume the message. Here is the code: [TestMethod] public void TestB() { Person person = null; IDestination dest = _session.GetQueue(QUEUE_DESTINATION); using (IMessageConsumer consumer = _session.CreateConsumer(dest)) { IMessage message; while ((message = consumer.Receive(TimeSpan.FromMilliseconds(2000))) != null) { var objectMessage = message as IObjectMessage; if (objectMessage != null) { person = objectMessage.Body as Person; if (person != null) { Assert.AreEqual("Łukasz", person.FirstName); Assert.AreEqual("Budnik", person.LastName); } } else { Assert.Fail("Object Message is null"); } } } if (person == null) { Assert.Fail("Person object is null"); } } Run tests. Refresh "Queues" tab in ActiveMQ web console. You should see 1 message enqueued and 1 message dequeued. As expected. Summary That's all. Simple, isn't it? ActiveMQ works very, very nicely with .NET. I have to find some performance comparison for ActiveMQ and MS or pure .C#/NET messaging frameworks. Or maybe you have it? Please share. cheers, Łukasz
April 15, 2013
by Łukasz Budnik
· 29,421 Views
article thumbnail
Introduction to SmartSVN
SmartSVN is a powerful and easy-to-use graphical client for Apache Subversion. There are several clients for Subversion, but here are just a few reasons you should try SmartSVN: It’s cross-platform – SmartSVN runs on Windows, Linux and Mac OS X, so you can continue using the operating system (OS) that works the best for you. It can also be integrated into your OS, via Mac’s Finder Integration or Windows Shell. Everything you need, out of the box – SmartSVN comes complete with all the tools you need to manage your Subversion projects: Conflict solver – this feature combines the freedom of a general, three-way-merge with the ability to detect and resolve any conflicts that occur during the development lifecycle. File compare – this allows you to make inner-line comparisons and directly edit the compared files. Built-in SSH client – allows users to access servers using the SSH protocol. This security-conscious protocol encrypts every piece of communication between the client and the server, for additional protection. A complete view of your project at a glance – the most important files (such as conflicted, modified or missing files) are placed at the top of the file list. SmartSVN also highlights which directories contain local modifications, which directories have been changed in the repository, and whether individual files have been modified locally or in the central repo. This makes it easy to get a quick overview of the state of your project. Fully customizable – maximize productivity by fine-tuning your SmartSVN installation to suit your particular needs: Change keyboard shortcuts, write your own plugin with the SmartSVN API, group revisions to personalize your display, create Change Sets, and alter the context menus and toolbars to suit you. You can learn more about customizing SmartSVN at our ‘5 Ways to Customize SmartSVN’ blog post. Comprehensive bug tracker support – Trac and JIRA are both fully supported. Multitude of support options – SmartSVN users have access to a range of free support, from refcards to blogsand documentation, the SmartSVN forum and a Twitter account maintained by our open source experts. If you need extra support with your SmartSVN installation, expert email support is included with SmartSVN Professional licenses. Want to learn more about SmartSVN? On April 18th, WANdisco will be be holding a free ‘Introduction to SmartSVN’ webinar covering everything you need to get off to a great start with this popular client: Repository basics Checkouts, working folders, editing files and commits Reporting on changes Simple branching Simple merging This webinar is free so register now.
April 13, 2013
by Jessica Thornsby
· 6,602 Views
article thumbnail
8 Lessons in Deployment Tooling Lessons Learned
It didn’t take long. A few months after we released an open source continuous integration tool (Anthill) in 2001, we were asked, “It’s great that I have the build setup, now how do I deploy to the test lab?” That email started a clear transformation in our thinking. Lesson 1: Builds generally exist to be tested or released to customers.The corollary is: Continuous integration is not about build, it is about quality and checking quality generally requires a deployment or six. In 2005, we updated our AnthillPro 2.5 with a shiny new deployment capability. It worked ok, but with that generation of tool being so build oriented, it was never a clean fit. Lesson 2: Deployments are a serious challenge, and can not be bolted on to a build tool. This lesson has been reinforced over the years as we’ve watched the results of various tools tacking on deployments. In 2006, we released AnthillPro 3. Oh boy, what a change. Now deployments and other stuff you do to builds days and weeks later were their own thing. Demonstrating a stunning lack of marketing savvy, we called them “non-originating workflows” (processes that don’t make builds) and declared a new type of tool, an “Application Lifecycle Automation Server”. Today, you’d call it a Continuous Delivery tool (thank you Jez and Dave for the better name). To my knowledge, AnthillPro was the first tool to really take a build lifecycle or build pipeline seriously and had the ambition to manage from continuous integration build through to production release quickly and with a solid audit trail. That brings us to Lesson 3: When you come from the development side of the house as either an engineer or a vendor, you have a lot to learn about audit and security. Around the same time we also learned a great deal about the Dev / Ops gap. While some of our customers implemented the tool as intended, most of the time it was used either by a dev organization doing continuous integration and deployment to the early test environments or a production support / release management group that only did the “official” builds and focused on the production release. The failure of the developer initiated efforts to get to production provided a key insight. Lesson 4: When it comes to deployment automation, start with production, and work your way to the lower environments, treating them as a simpler case of the hard problem. We also learned about the limitations of a build pipeline approach in complex applications. Lesson 5: Deployments of complex systems often require a coordinated release of many builds. The need to start at production style deployments (and not care as much about the build) and the emphasis on deployment time dependencies focused our efforts on a deployment centric tool, uDeploy to compliment the pipeline approach in AnthillPro. Production and scaled usage come together to make the deployment tool a critical piece of infrastructure. If you lose a datacenter, can you recover the tool and use it to recover other applications into a backup datacenter?Lesson 6: Your release and deployment systems need to be configured with high availability and recoverability in mind. For the better part of the last decade, we’ve been trying to streamline deployments and kill off the release weekend. Many of customers have eliminated or shrunk those big events down to a more manageable size. Big releases though still can require the coordination of dozens of people across various teams. One time events like configuring the deployment tool with the password for a new database in each environment, or routine manual steps like putting eyeballs on the new production system before putting it out to the public still need to be managed. Lesson 7: Deployment is still just one part of release. With that in mind, we’ve added uRelease to the portfolio. Different people in the application delivery value chain need different tools. Developers need continuous integration and great testing. Change management needs great auditing around production deployments. Testers need their environments to be not broken, updated regularly and close enough to production to make their testing worthwhile. The patterns and needs overlap. Everyone needs deployments that work. Everyone benefits from understanding the lifecycle of a build (or other version) as well as what is in a given environment. Everyone benefits when release planning meetings are shorter. Lesson 8: No single tool is enough. An integrated toolchain that exchanges information is required. So there you go. Twelve years of tooling, eight with deployments. Summed up in six over-simplified lessons. I’m sure we’ll be updating this next year. Lesson 1: Builds generally exist to be tested or released to customers. Lesson 2: Deployments are a serious challenge, and can not be bolted on to a build tool. Lesson 3: When you come from the development side of the house as either an engineer or a vendor, you have a lot to learn about audit and security. Lesson 4: When it comes to deployment automation, start with production, and work your way to the lower environments, treating them as a simpler case of the hard problem. Lesson 5: Deployments of complex systems often require a coordinated release of many builds. Lesson 6: Release and deployment systems need to be configured with high availability and recoverability in mind. Lesson 7: Deployment is still just one part of release Lesson 8: No single tool does “DevOps”. An integrated toolchain that exchanges information is required. What are the biggest lessons you’ve learned about deployments in the last few years? Share in the comments below!
March 4, 2013
by Eric Minick
· 7,286 Views
article thumbnail
Building an Online-Recommendation Engine with MongoDB
once upon a time there was a munich pizza baker who developed a technique to beam pizza out of bright sunshine. he can produce more than a thousand pizzas per second and needs a channel to sell this amount of pizza and decides to build an online shop. mario’s initial idea is to sell pizzas, but now he is thinking about introduction of new product lines like beverages, salads and pasta. before we take a look to the validation of mario´s idea, lets take a short look at the existing online shop. mario’s online shop is based on mongodb , apache wicket and spring . mongodb is a document-oriented nosql-database . mongodb stores records not in tables as a relational database but in bson documents, which is a binary version of json (java script object notation) and very similar to the object structure in mario’s application. the usage of mongodb makes his development easier and deployment faster. the figure shows a json document which is very similar to a java object: a json document property with the according value corresponds to the java object property with the appropriate value. you can add or remove properties in your java object and this will automatically change your database schema. so there is no need to put your java object model into a relational schema via hibernate. mario also decided to build his online shop only with open-source technologies like apache wicket and spring. wicket is a very common lightweight component-based web application framework and it is closely patterned after stateful gui frameworks such as javafx . the spring framework is an open source application framework and inversion of control container for the java platform and does not impose any specific programming model. spring has become popular in the java community as an alternative to, replacement for, or even addition to the enterprise javabean (ejb) model. because of this architecture mario is able to deploy its application in a lightweight application server like tomcat or jetty . this figure shows the system landscape of mario. mario has two major system on the lefthand site there is his online shop and on the righthand site there is ‘pas’ a famous billing system. in the middle is hadoop that connects both systems together. in the business world an application normally does not stand alone. in most cases an application must communicate with others. the lean architecture of marios online shop enables him to connect the billing system ‘pas’ to his online shop. spring for apache hadoop provides this integration between the two systems online shop and ‘pas’. hadoop supports data-intensive distributed applications and implements a computational paradigm named mapreduce, where the computation is divided into many small fragments, each of them may be executed or re-executed on any node in the cluster of commodity hardware. mario uses hadoop as an etl layer that enables him to transfer gigabytes of order information into the billing system. in this case hadoop makes it possible for a financial controller to verify if all orders were billed correctly. in addition to the online shop feature mario has a real-time sales dashboard that enables him to track his sales in real time. the dashboard displays daily and monthly sales statistics for each pizza and contains a map with the geographical overview of customer activity and competitor locations. here is a walkthrough of the shop : now lets talk about mario’s incredible new idea : mario wants to sell even more pizza! and other products as well. mario decides to use lean startup methods in order to test the possible introduction of new product lines and plans an experiment to validate his new idea using a scientific approach and pure facts instead of hunches. mario´s core assumption is that customers wants to buy other products than pizza – drinks, salads and pasta. furthermore he is worried about pricing. mario contacts all customers to complete a survey and provides an incentive for the participation, a free pizza to every customer who responds to the survey. the result of the survey validated mario’s assumption – customers want to buy beverages, salads and pasta. but he also found out that his customers are willing to pay higher prices for high-quality products and that they simply love his easy shopping flow. currently a pizza order can be completed with three clicks only, so there is new riskiest assumption to validate: will a more complex shopping flow affect his sales? the figures shows a validation board. a validation board is a deceptively simple tool for testing out product ideas. furthermore a validation board tracks pivots which follows from customer feedback. mario decides to introduce beverages, salads and pasta product lines and thinks about a possibility, how he can handle the extension of the product line without destroying the easy shopping flow. that’s why mario thinks a recommendation engine is the right way for him. panels for recommendations can be integrated in the online shop without changing the shopping flow. mario hired a statistician to help him implement a recommender system for his online shop for better cross-selling. he also defined new measurement points to validate his new idea . therefore he tracks the conversion rate of orders as well as cross-selling rates and every event in the online shop is already tracked in realtime. so mario can very easily perform further experiments in order to verify more assumptions. follow the blog to see how the story continues or come to mongodb usergroup meetup in munich , february 20, 2013 or mongodb days in berlin , february 26, 2013 to get a live presentation. our talk sheds light on how to build an online recommendation engine based on mongodb and apache mahout. we’ll show which recommenders must be built to reach mario’s goal and how these can be integrated in mario’s shop infrastructure.
February 17, 2013
by Comsysto Gmbh
· 8,447 Views
article thumbnail
Code Coverage Tools Comparison in Sonar
for those that are not familiar with sonar , ( i hope this post will make you at least try it or see it in action at http://nemo.sonarsource.org ) you can take a look at an earlier post i’ve written some time ago. in one sentence sonar is an open source platform that allows you to track and improve the quality of your source code. one of the key aspects when talking about software quality is the test coverage or code coverage which is how much of your source code is tested by unit tests. sonar integrates with the most popular open source code coverage tools ( jacoco , cobetura , emma ) and the well-known commercial clover by attlassian. by default it uses the jacoco (java code coverage) engine and you’ll shortly find out why before we move on, i’d like to give many kudos to evgeny mandrikov . this article is inspired by one of his older post s and its intention is to present a more updated comparison of the supported code coverage tools by sonar and point out some differences regarding their results and the way they work. recently sonar changed its default code coverage tool to jacoco and this post tries to explain the reasons behind that decision. some of the information is borrowed by evgeny’s post and the image is also taken from evgeny’s presentation about jacoco . so thanks a lot evgeny! now let’s go to the meat. for the comparison you’ll see, i’ve used the latest available sonar version 3.3, maven 2.2.1, java 1.6 and all analysis launched in a windows 7 machine (intel core i3-2120 cpu @ 3.30ghz) with 8gb ram. the projects were carefully selected ( a small, medium-sized and a large one – not that large as java code base but large enough to extract some results ). i ran five analysis for each open source code coverage tool ( i excluded the commercial clover from my comparison version ) and another five by disabling the code coverage mechanism. so that’s a total of 60 analysis ). in the following tables you can find some information about the code coverage tools and some basic metrics about the selected projects. pay attention to the date of the latest stable release. emma hasn’t been updated since dinosaurs era and cobertura is almost three years inactive. one might think that this isn’t an issue if they are stable and don’t need any new release. well, the truth is that both of them have bugs that frustrate end-users and there’s no one to fix them. on the other hand jacoco is continuously evolving and improving… the results of the analysis are displayed next. some important notices. emma doesn’t support branch coverage that’s why you’re not seeing any metrics. furthermore there are differences in the results of line and branch coverage, which are more concrete for larger projects. for instance in sonar jira plugin all three tools produce the same results whereas in sonar analysis and commons lang projects you can see that the numbers are not the same. now take a look at a graph that illustrates in a more readable way which tool is the fastest. it seems that emma and jacoco need the same amount of time to compute their metrics… but… as we already mentioned there’s a huge difference. there’s no branch coverage in emma reports. cobertura is always slower than jacoco so again the winner is jacoco. of course you can get even faster results by running a sonar analysis without computing code coverage metrics one last thing: jacoco, as the following figure shows is the only tool that analyses bytecode on-the-fly which is more . cobertura and emma run an offline analysis and use a class loader whereas jacoco has its own java agent for analysis code. this configuration allows jacoco to be very flexible, possible integrated with many other tools and frameworks and can be used with any language in a jvm environment. so, to sum up, if you’re using sonar ( if you don’t , you should ), then it strongly advisable to keep the default code coverage engine ( jacoco) , unless you have really important reasons for that. finally don’t forge to check sonar’s community 2013 unofficial survey and the upcoming book about sonar by manning publications. the release date is in about 3-4 months but you can get an early access version here . as always, feel free to comment or suggest improvements about the article and its content.
December 28, 2012
by Patroklos Papapetrou
· 109,518 Views · 1 Like
article thumbnail
Enterprise-ready Tool Support for Apache Camel
apache camel is my favorite integration framework on the java platform due to great dsls, a huge community, and so many different components. camel is used by many developers from different companies all over the world. however, most guys are not aware that some really cool and – more important – enterprise-ready tooling is available for camel, too. many people ask me about camel tooling when i do talks at conferences. this is the reason for this short blog post about camel tooling. [fyi: i work for talend (one of the vendors).] ide support camel consists of a set of normal java libraries and is therefore usable with any java ide (such as eclipse, netbeans or intellij idea) or even a classic text editor. programming dsls are available for java, groovy, and scala. even a kotlin dsl is in the works, thanks to camel’s founder james strachan. all familiar ide features such as code completion or javadoc view are available for these dsls. in the spring xml dsl, the eclipse-based springsource tool suite (sts) should be emphasized, which provides the best support for the spring framework and xml configurations. camel-specific tooling besides classical ide support, further products are available to provide additional functionality. integration problems can be modeled with the help of enterprise integration patterns (eip, http://www.eaipatterns.com/). eips are implemented by camel. visual designers are available to help modeling integration problems with these eips. these tools even generate the corresponding source code automatically. ideally, developers do not have to write any source code by hand. camel tooling is offered by talend with talend esb (http://de.talend.com/products/esb) and jboss, formerly fusesource, with fuse ide (http://fusesource.com/products/fuse-ide). both companies also provide full-time committers for the apache camel project. let’s take a short look at these two products in the following. open studio for talend esb talend esb is an eclipse-based integration platform within the talend unified platform. the familiar “look and feel” and the intuitive use of eclipse remain. the esb is open source and freely available. the paid enterprise version offers additional features and support. the esb can be used independently or in combination with other parts of the talend unified platform, such as BPM, big data, or master data management. the great benefit is that everything can be done within one suite using the same gui and concepts, based on eclipse. the entire talend unified platform is based on the “zero-coding” approach. this way, a very efficient implementation of integration problems is possible using the eips and components. routes are modeled and configured with intuitive tool support, all source code is generated. of course, custom integration logic can still be written and included, for example, pojos, spring beans, scripts in different languages, or own camel components. plenty of other components besides camel’s ones are available for talend esb – for example connectors to alfresco, jasper, sap, salesforce, or host systems. figure 1: visual designer of talend’s esb fuse ide the fuse ide is an eclipse plugin, which is installed from the eclipse update site. the visual designer (see figure 2) generates camel routes as xml code using the spring xml dsl. the generated code is editable vice-versa, i.e. the developer can change the source code. the graphical model applies changes automatically. fuse ide is intuitive to use for creating camel routes. fusesource offers some other products, which can be used in combination with fuse ide – such as management console or fuse mq for messaging. under fusesource, fuse ide was a proprietary product. however, fusesource was recently taken over by redhat (http://www.redhat.com/about/news/press-archive/2012/6/red-hat-to-acquire-fusesource) and now belongs to the jboss division. in the new roadmap, the fuse ide is still included. it will probably be integrated into the jboss enterprise soa platform and become “open sourced”. the integration of fusesource will take at least a few more months time to complete (http://www.redhat.com/promo/jboss_integration_week/). jboss now “owns” three esb products (jboss esb, switchyard and fuse esb). probably, these will be merged into one product in the end (switchyard is also based on camel). nevertheless, the fusesource products will also be supported for some time – primarily in order to satisfy existing customers (my guess). figure 2: visual designer of fuse ide (jboss, former fusesource) enterprise-ready tooling is already available for apache camel! the bottom line is that enterprise-ready tooling is already available for apache camel. it is great to see different companies working on tooling for apache camel. the winner definitely is apache camel… and there is no loser! talend esb and fuse ide are two different approaches for different kinds of projects. if you like the „zero-coding“ approach, then take a closer look at talend’s esb. it is really easy and efficient to realize integration projects without writing source code – nevertheless, there is enough flexibility for customization and adding own source code. the combination with bpm, mdm or big data (based on hadoop) is also supported within the unified platform using the same open source and „zero-coding“ concepts. if you „insist“ on writing and refactoring all source code by yourself within the text editor of an ide, then take a look at fuse ide. your best would be to try out both and see which one fits best into your next enterprise integration project. if you know any other cool camel tooling (no matter if it is enterprise-ready or not), or if you have any other feedback, please write a comment. thank you. best regards, kai wähner (twitter: @kaiwaehner) content from my blog: http://www.kai-waehner.de/blog/2012/11/23/enterprise-ready-tool-support-for-apache-camel/
November 26, 2012
by Kai Wähner DZone Core CORE
· 15,531 Views
article thumbnail
CI with GitHub, Bamboo and Nexus
This is the first in what will be a series of posts on how to establish a continuous delivery pipeline. The eventual goal is to have an app that we can push out into production anytime we like, safely and with little effort. But we’re going to start small and proceed incrementally, which is the best way to undertake this sort of effort. In this post we’ll start by establishing continuous integration for an arbitrary Java/Maven-based open source library project. It shouldn’t take more than an afternoon to do this more or less from scratch if you’re reasonably familiar with the technologies in question, except for the part where we have to get a Maven repo. We have to wait for Sonatype to approve the repo, but they’re pretty fast about it. I set this up for an open source library project I’m doing called Kite. The details of the project itself aren’t important for this exercise, but the open source library part is: Libraries (as opposed to apps) are easier to deal with, because deployment is just a matter of getting the binary into a Maven repo, as opposed to getting it running on a live server. Open source means that we can get a bunch of infrastructure freebies. Here’s what it will look like at the end of this post: So if you have an open source library you’ve been wanting to develop, now’s the time to get started! Find a place to host your project sources The right host depends on which SCM you prefer. If you like Git, then GitHub and Bitbucket are two obvious (free) choices. Bitbucket supports Mercurial as well, and it supports not just free public repos like GitHub, but free private repos as well. Anyway, I chose GitHub for Kite. Here’s the repo: GitHub Kite repo Set up a continuous integration server The next step is to set up a continuous integration server. Hudson and Jenkins are both pretty popular open source offerings, but I chose Atlassian’s Bamboo because the UI is a lot more polished. Bamboo is free for open source projects, and even if you decide to buy it, it’s only $10 for the starter license (proceeds go to charity), which gives you a local build agent. Anyway, you can’t go wrong with any of those, so choose one that makes sense and set it up. They’re all easy to set up. For Bamboo, you’ll need to install Java, Tomcat (or whichever container you like), Git and Maven 3 on the server as well. Bamboo is a Java web app, which is why you need Java and Tomcat. Bamboo uses Git to pull the code down from GitHub, and Maven to build the code. You’ll need to configure this using the Bamboo admin console, once you’ve installed the packages above, you can do the configuration entirely through the Bamboo UI and it’s very easy. After installing Ubuntu 11.10 server manually, I used Chef to set up everything on my Bamboo server except for deploying the Bamboo WAR itself, since Chef has community cookbooks for Java, Tomcat, Maven and Git. (Note however that currently the URL in the Maven 3 recipe is wrong, so if you go that way then you’ll need to update the URL and the checksum.) Jo Liss has a great tutorial on Chef Solo if you’re interested in giving this a shot. But you don’t have to use Chef. You can just use your native package manager if you prefer to do it that way. Once you have the executables all set up, you’ll need to create a build plan that slurps source code from GitHub and builds it with a Maven 3 task. In the screenshot you’ll notice that I created three separate stages here: a commit stage, an acceptance stage and a deploy stage. The idea, following Continuous Delivery, is to separate the fast-but-not-so-comprehensive feedback part from the slow-but-more-comprehensive feedback, and run those as different stages in the build. That way you know right away in most cases where you break the build (commit stage fails), and you still find out soon enough even in those cases where the breakage involves some kind of integration or business acceptance criterion. So in my Bamboo plan I do the commit stage first, then run integration tests during the acceptance stage, and finally deploy the snapshot build to my Maven repo using mvn deploy if the previous two stages pass. That makes it available to others for continuous integration. Let’s look at the Maven repo part now. Set up a Maven repo Sonatype offers a hosted Nexus-based Maven repo for free to open source projects. Moreover they’ll push your artifacts out to Maven central for you as well. So if that sounds good to you, here are instructions on signing up. JFrog’s Artifactory is an option as well. As you can see from the link, there’s an open source option. We use this at work and it’s pretty nice. For Sonatype, you’ll need to create a JIRA ticket to get the repo, as per the instructions above. Here’s mine. Also, your POM will need to conform to certain requirements; see this example for the POM that I’m using for Kite. Nothing too out-of-the-ordinary, though do take note of the parent POM. Anyway, once you have your Maven repo, practice manually deploying from the command line via mvn deploy just to make sure you’re able to push code. If it works, then you’ll want to try it out from Bamboo. Be sure to update Bamboo’s copy of Maven’s settings.xml so Bamboo can authenticate into the Sonatype repo: sonatype-nexus-snapshots your_username your_password sonatype-nexus-staging your_username your_password Once you have it working, you should be able to push your snapshots over to Sonatype. Here’s the Nexus UI, and here’s the raw directory browser view. Conclusion Though it takes a bit of effort to set this up, at the end of the day you have a good foundation for your continuous delivery efforts. In the next post in this series, we’ll expand our deployment pipeline so it can handle pushing a web application to a live, cloud-based container.
October 3, 2012
by Willie Wheeler
· 17,636 Views
article thumbnail
8 Ways to Improve Your Java EE Production Support Skills
This article will provide you with 8 ways to improve your production support skills which may help you better enjoy your IT support job.
August 15, 2012
by Pierre - Hugues Charbonneau
· 32,543 Views · 2 Likes
article thumbnail
What is ActiveMQ?
Although the Active MQ website already gives a pithy, to-the-point explanation of ActiveMQ, I would like to add some more context to their definition. From the ActiveMQ project’s website: “ActiveMQ is an open sourced implementation of JMS 1.1 as part of the J2EE 1.4 specification.” Here’s my take: ActiveMQ is an open-source, messaging software which can serve as the backbone for an architecture of distributed applications built upon messaging. The creators of ActiveMQ were driven to create this open-source project for two main reasons: The available existing solutions at the time were proprietary/very expensive Developers with the Apache Software Foundation were working on a fully J2EE compliant application server (Geronimo) and they needed a JMS solution that had a license compatible with Apache’s licensing. Since its inception, ActiveMQ has turned into a strong competitor of the commercial alternatives, such as WebSphereMQ, EMS/TIBCO and SonicMQ and is deployed in production in some of the top companies in industries ranging from financial services to retail. Using messaging as an integration or communication style leads to many benefits such as: Allowing applications built with different languages and on different operating systems to integrate with each other Location transparency – client applications don’t need to know where the service applications are located Reliable communication – the producers/consumers of messages don’t have to be available at the same time, or certain segments along the route of the message can go down and come back up without impacting the message getting to the service/consumer Scaling – can scale horizontally by adding more services that can handle the messages if too many messages are arriving Asynchronous communication – a client can fire a message and continue other processing instead of blocking until the service has sent a response; it can handle the response message only when the message is ready Reduced coupling – the assumptions made by the clients and services are greatly reduced as a result of the previous 5 benefits. A service can change details about itself, including its location, protocol, and availability, without affecting or disrupting the client. Please see Gregor Hohpe’s description about messaging or the book he and Bobby Woolf wrote about messaging-based enterprise application integration. There are other advantages as well (hopefully someone can add other benefits or drawbacks in the comments), and ActiveMQ is a free, open-source software that can facilitate delivering those advantages and has proven to be highly reliable and scalable in production environments.
July 21, 2012
by Christian Posta
· 28,642 Views · 8 Likes
article thumbnail
A Full Overview of the CamelOne 2012 Conference
CamelOne was yet again a really cool and fun conference. I just returned back home, from the 2nd annual CamelOne conference, held in downtown Boston. It was a 2 day packed with great talks with a balanced mix of technical talks, cloud stuff, and showcases of integration in the real world. CamelOne speaker podium The feedback of the conference was really good, as you can see from the image below. Feedback wall from CamelOne attendees The FuseSource engineering team was present, and on the end of the 2nd day, Debbie, got us together for a photo session. The FuseSource Engineering Team CamelOne 2012 - Day 1 So back the the 1st day. This year Jonathan and I was asked to do the opening key note, with our Camel story, and where the Camel project is today, and some thoughts about how Camel could be riding the cloud in the near future. Jonathan and I wanted to tell our Camel story with a sense of humor. I can say mission accomplished when the slide with the lovely Camel picture was shown. And no the picture is not retouched, it was found using google image search. Apache Camel is a project that stands out After the keynote I gave a talk about how to get started riding the Camel. The talk is a practical focus talk so I was sharing the time 50/50 between slides and live coding. As all sessions were recorded, and we can all watch them later, as they will be posted on the CamelOne website, free for anymore to watch. A professional production company is currently processing the videos. They should be ready in weeks from now. I will blog when the videos are online. Free Apache/Fuse Resources Apache Project Leader Videos Open Source Integration Tool Downloads Integrate Anywhere: Fuse ESB Super Fast Messaging with Fuse MQ Free SOA Webcasts Apache ActiveMQ and ServiceMix Lessons Integration Resources Jonathan was next after my talk, and his talk was a natural progress from mine talk. As he gave a rundown how you can run and deploy Camel and CXF applications in the ESB server. Jonathan showed how this works in practice as well. I got engaged in a number of hallway conversations, and didn't have the chance to attend a session at every slot. It was really great to meet so many happy Camel users, and hear their stories, where the Camel is riding in the real world. Also I was told that more and more commercial vendors is adapting Camel and embedding it internally in their commercial products. Kai Wahner gave his first of two talks today. Kai Wahner giving a talk about choosing Integration Frameworks He was talking about his experience with evaluating Apache Camel, Spring Integration, and Mule ESB. I guess Camel was in favor at a Camel conference :) James gave the ending general session of the first day. James Strachan giving ending key note on 1st day As always a pleasure to watch James talk with such a enthusiasm. His talk was a technical talk addressed to the developers how to develop and get Camel riding in the cloud. He gave a tour of the cool and awesome much improved Fuse IDE 2.1 product. It now has even more Camel crack for runtime insight. As well as easier deployment for both local jvms, remove machines, and as well the clouds. CamelOne 2012 - Day 2 On the 2nd day, we have Gabe Zichermann giving a very entertaining keynote, about gamification. Gabe talking about Gamification The idea of getting people engaged, based on the ideas of computer games. But applying them in a real life processes. For example a company managed to get its employees go to the gym, based on teaming up, and competing against your co-workers. In Sweden they have traffic cameras, using reverse sychology, by enlisting people in a lottery, if they are within speed limits. However people above the speed limit will of course still get a fine, and not participate in the lottery. I have heard good buzz about Stan Lewis and Dhirajs talk about how to manage, monitor and provision a cluster of machines, in a data centre or the cloud. Dhiraj showed how Fuse HQ could monitor the Camel applications running on numerous machines. And how that worked as well when Stan did a rolling upgrade on the fly, leaving Fuse HQ being able to compare and display a "before" vs. "after" overview. Likewise the tweets about Charles Moulliard were very positive. Seems like people wanted to go and play with websockets, and the Camel. This is definitly cool. So Camel 2.10 is a much anticipated release, having the websocket component out of the box. Kai was on the stage again, giving a talk about using Apache Camel with BPM (Avtiviti). Kai gave us a rundown of the differences between Camel and Activiti, and where they overlap. As well when you should use either one, or both of them. In Kais talk he give live demos which is a nice change in the flow, to see the "code" for real. Activiti and Camel together seems powerful. And the Activiti designer looks beautiful. Did you know that Camel is help protecting the Canadians. Mike Gingell from General Dynamics Canada gave us a rundown of how they have been successful by using open source integration technologies from Apache. The Camel is helping in cool stuff such as with the marine to detect torpedo attacks, with satellite surveillance of the north poles, and to keep track of personel and whatnot. Torpedo Warning System. Slide from Mike Gingell, General Dynamics Canada, Mike gave a really great talk, and also took us through how his team is battling uphill in a traditionally conservative organization where software projects take millions of $ and years to just get started. They have been on the open source road for about 5 years, and jumped on Apache ServiceMix when it became OSGi based. And the Camel has been riding all along together with ActiveMQ and CXF. So the Camel is help protecting Jonathan Anstey, who lives in New Foundland, Canada. Must be cool to know that the software he works on every day, is now serving the people of Canada. The ending keynote, was a real treat to all of us. Felix Ehmn from CERN gave us a very interesting talk how CERN is using ActiveMQ in their control room, to monitor the most complex machine man have ever built - the Large Hadron Collider; eg the 27km circle where they smash atoms together and see what happens. Felix giving ending keynote, about CERN using ActiveMQ There is 85.000 devices, which they need the monitor. Its everything, from censors on the collider, to fire alarms, door buttons and whatnot. CERN is definitily a cool place. In fact the coolest place on earth as well, as they need to cool down the collider, to 1 degree kelvin. That is - 272 degrees celsius. Like CERN, the CamelOne conference was very cool. I discovered a number of other blogs covering the CamelOne 2012 conference Kai Waehner blogged his CamelOne report. Christian Posta blogged as well And Rob Terpilowski who gave a talk also Hope to see you in 2013 at the next CamelOne conference. As David Reiser tweeted, the conference was awesome. David liked the conference
May 23, 2012
by Claus Ibsen
· 7,005 Views
article thumbnail
Martin Fowler on ORM Hate
while i was at the qcon conference in london a couple of months ago, it seemed that every talk included some snarky remarks about object/relational mapping (orm) tools. i guess i should read the conference emails sent to speakers more carefully, doubtless there was something in there telling us all to heap scorn upon orms at least once every 45 minutes. but as you can tell, i want to push back a bit against this orm hate - because i think a lot of it is unwarranted. the charges against them can be summarized in that they are complex, and provide only a leaky abstraction over a relational data store. their complexity implies a grueling learning curve and often systems using an orm perform badly - often due to naive interactions with the underlying database. there is a lot of truth to these charges, but such charges miss a vital piece of context. the object/relational mapping problem is hard . essentially what you are doing is synchronizing between two quite different representations of data, one in the relational database, and the other in-memory. although this is usually referred to as object-relational mapping, there is really nothing to do with objects here. by rights it should be referred to as in-memory/relational mapping problem, because it's true of mapping rdbmss to any in-memory data structure. in-memory data structures offer much more flexibility than relational models, so to program effectively most people want to use the more varied in-memory structures and thus are faced with mapping that back to relations for the database. the mapping is further complicated because you can make changes on either side that have to be mapped to the other. more complication arrives since you can have multiple people accessing and modifying the database simultaneously. the orm has to handle this concurrency because you can't just rely on transactions- in most cases, you can't hold transactions open while you fiddle with the data in-memory. i think that if you if you're going to dump on something in the way many people do about orms, you have to state the alternative. what do you do instead of an orm? the cheap shots i usually hear ignore this, because this is where it gets messy. basically it boils down to two strategies, solve the problem differently (and better), or avoid the problem. both of these have significant flaws. a better solution listening to some critics, you'd think that the best thing for a modern software developer to do is roll their own orm. the implication is that tools like hibernate and active record have just become bloatware, so you should come up with your own lightweight alternative. now i've spent many an hour griping at bloatware, but orms really don't fit the bill - and i say this with bitter memory. for much of the 90's i saw project after project deal with the object/relational mapping problem by writing their own framework - it was always much tougher than people imagined. usually you'd get enough early success to commit deeply to the framework and only after a while did you realize you were in a quagmire - this is where i sympathize greatly with ted neward's famous quote that object-relational mapping is the vietnam of computer science [1] . the widely available open source orms (such as ibatis, hibernate, and active record) did a great deal to remove this problem [2] . certainly they are not trivial tools to use, as i said the underlying problem is hard, but you don't have to deal with the full experience of writing that stuff (the horror, the horror). however much you may hate using an orm, take my word for it - you're better off. i've often felt that much of the frustration with orms is about inflated expectations. many people treat the relational database "like a crazy aunt who's shut up in an attic and whom nobody wants to talk about" [3] . in this world-view they just want to deal with in-memory data-structures and let the orm deal with the database. this way of thinking can work for small applications and loads, but it soon falls apart once the going gets tough. essentially the orm can handle about 80-90% of the mapping problems, but that last chunk always needs careful work by somebody who really understands how a relational database works. this is where the criticism comes that orm is a leaky abstraction. this is true, but isn't necessarily a reason to avoid them. mapping to a relational database involves lots of repetitive, boiler-plate code. a framework that allows me to avoid 80% of that is worthwhile even if it is only 80%. the problem is in me for pretending it's 100% when it isn't. david heinemeier hansson, of active record fame, has always argued that if you are writing an application backed by a relational database you should damn well know how a relational database works. active record is designed with that in mind, it takes care of boring stuff, but provides manholes so you can get down with the sql when you have to. that's a far better approach to thinking about the role an orm should play. there's a consequence to this more limited expectation of what an orm should do. i often hear people complain that they are forced to compromise their object model to make it more relational in order to please the orm. actually i think this is an inevitable consequence of using a relational database - you either have to make your in-memory model more relational, or you complicate your mapping code. i think it's perfectly reasonable to have a more relational domain model in order to simplify your object-relational mapping. that doesn't mean you should always follow the relational model exactly, but it does mean that you take into account the mapping complexity as part of your domain model design. so am i saying that you should always use an existing orm rather than doing something yourself? well i've learned to always avoid saying "always". one exception that comes to mind is when you're only reading from the database. orms are complex because they have to handle a bi-directional mapping. a uni-directional problem is much easier to work with, particularly if your needs aren't too complex and you are comfortable with sql. this is one of the arguments for cqrs . so most of the time the mapping is a complicated problem, and you're better off using an admittedly complicated tool than starting a land war in asia. but then there is the second alternative i mentioned earlier - can you avoid the problem? avoiding the problem to avoid the mapping problem you have two alternatives. either you use the relational model in memory, or you don't use it in the database. to use a relational model in memory basically means programming in terms of relations, right the way through your application. in many ways this is what the 90's crud tools gave you. they work very well for applications where you're just pushing data to the screen and back, or for applications where your logic is well expressed in terms of sql queries. some problems are well suited for this approach, so if you can do this, you should. but its flaw is that often you can't. when it comes to not using relational databases on the disk, there rises a whole bunch of new champions and old memories. in the 90's many of us (yes including me) thought that object databases would solve the problem by eliminating relations on the disk. we all know how that worked out. but there is now the new crew of nosql databases - will these allow us to finesse the orm quagmire and allow us to shock-and-awe our data storage? as you might have gathered , i think nosql is technology to be taken very seriously. if you have an application problem that maps well to a nosql data model - such as aggregates or graphs - then you can avoid the nastiness of mapping completely. indeed this is often a reason i've heard teams go with a nosql solution. this is, i think, a viable route to go - hence my interest in increasing our understanding of nosql systems. but even so it only works when the fit between the application model and the nosql data model is good. not all problems are technically suitable for a nosql database. and of course there are many situations where you're stuck with a relational model anyway. maybe it's a corporate standard that you can't jump over, maybe you can't persuade your colleagues to accept the risks of an immature technology. in this case you can't avoid the mapping problem. so orms help us deal with a very real problem for most enterprise applications. it's true they are often misused, and sometimes the underlying problem can be avoided. they aren't pretty tools, but then the problem they tackle isn't exactly cuddly either. i think they deserve a little more respect and a lot more understanding. 1: i have to confess a deep sense of conflict with the vietnam analogy. at one level it seems like a case of the pathetic overblowing of software development's problems to compare a tricky technology to war. nasty the programming may be, but you're still in a relatively comfy chair, usually with air conditioning, and bug-hunting doesn't involve bullets coming at you. but on another level, the phrase certainly resonates with the feeling of being sucked into a quagmire. 2: there were also commercial orms, such as toplink and kodo. but the approachability of open source tools meant they became dominant. 3: i like this phrase so much i feel compelled to subject it to re-use.
May 9, 2012
by Martin Fowler
· 115,457 Views · 4 Likes
article thumbnail
From Java to Node.js
I’ve been developing for quite a while and in quite a few languages. Somehow though, I’ve always seemed to fall back to Java when doing my own stuff – maybe partly from habit, partly because it has in my opinion the best open source selection out there, and party because I liked its mix of features and performance. Originally authored by Matan Amir Specifically though, in the web arena things have been moving fast and furious with new languages, approaches, and methods like RoR, Play!, and Lift (and many others!). While I “get it” regarding the benefits of these frameworks, I never felt the need to give them more than an initial deep dive to see how they work. I nod a few times at their similarities and move on back to plain REST-ful services in Java with Spring, maybe an ORM (i try to avoid them these days), and a JS-rich front-end. Recently, two factors made me deep dive into Node.js. First is my growing appreciation with the progress of the JavaScript front-end. What used to be little JavaScript snippets to validate a form “onSubmit” have evolved to a complete ecosystem of technologies, frameworks, and stacks. My personal favorite these days is Backbone.js and try to use it whenever I can. Second was the first hand feedback I got from a friend at Voxer about their success in using and deploying Node.js at real scale (they have a pretty big Node.js deployment). So I jumped in. In the short time I’ve been using it for some (real) projects, I can say that it’s my new favorite language of choice. First of all, the event-driven (and non-blocking) model is a perfect fit for server side development. While this has existed in the other languages and forms (Java Servlet 3.0, Event Machine, Twisted to name a few), I connected with the easy of use and natural maturity of it in JavaScript. We’re all used to using callbacks anyway from your typical run-of-the-mill AJAX work right? Second is the community and how large its gotten in the short time Node has been around. There are lots of useful open source libraries to use to solve your problems and the quality level is high. Third is that it was just so easy to pick up. I have to admit that my core JavaScript was decent before I started using Node, but in terms of the Node core library and feature set itself, it’s pretty lean and mean and provides a good starting point to build what you need in case you can’t find someone who already did (which you most likely can). So having said all that, I wanted to share what resources helped me get up to speed in Node.js coming from a Java background. Getting to know JavaScript There are lots of good resources out there on Node.js and i’ve listed them below. For me, the most critical piece was getting my JavaScript knowledge to the next level. Because you’ll be spending all your time writing your server code in JavaScript, it pays huge dividends to understand how to take full advantage of it. As a Java developer you’re probably trained to think in OO designs. With me, this was the part that I focused the most on. Fortunately (or unforunately), JavaScript is not a classic OO language. It can be if you shoehorn it, but i think that defeats the purpose. Here is my short list of JavaScript resources: JavaScript: The Good Parts - Definitely a requirement. Chapters 4 and 5 (functions, objects, and inheritance) are probably the most important part to understand well for those with an OO background. Learning Javascript with Object Graphs (part 2, and part 3) – howtonode.org has lots of good Node material, but this 3 part post is a good resource to top off your knowledge from the book. Simple “Class” Instantiation – Another good post I read recently. Worth digesting. Learning Node.js With a good JavaScript background, starting to use Node.js is pretty straightforward. The main part to understand is the asynchronous nature of the I/O and the need to produce and consume events to get stuff done. Here is a list of resources I used to get up to speed: DailyJS’s Node Tutorial - A multipart tutorial for Node on DailyJS’s blog. It’s a great resource and worth going through all the posts. Mixu’s Node Book - Not complete, but still worth it. I look forward to reading the future chapters. Node Beginner Book – A good starter read. How To Node – A blog dedicated to Node.js. Bookmark it. I felt that going through these was more than enough to give me the push I needed get started. Hopefully it does for you too (thanks to the authors for taking the time to share their knowledge!). Frameworks? Java is all about open source frameworks. That’s part of why it’s so popular. While Node.js is much newer, lots of people have already done quite a bit of heavy-lifting and have shared their code with the world. Below is what I think is a good mapping of popular Java frameworks to their Node.js equivalents (from what I know so far). Web MVC In Java land, most people are familiar with Web MVC frameworks like Spring MVC, Struts, Wicket, and JSF. More recently though, the trend towards client-side JS MVC frameworks like Ember.js (SproutCore) and Backbone.js reduces the required feature-set of some of these frameworks. Nonetheless, a good comparable Node.js web framework is Express. In a sense, it’s even more than a web framework because it also provides most of the “web server” functionality most Java developers are used to through Tomcat, Jetty, etc (more specifically, Express builds on top of Connect) It’s well thought out and provides the feature set needed to get things done. Application Lifecycle Framework (DI Framework) Spring is a popular framework in Java to provide a great deal of glue and abstraction functionality. It allows easy dependency injection, testing, object lifecycle management, transaction management, etc. It’s usually one of the first things I slap into a new project in Java. In Node.js… I haven’t really missed it. JavaScript allows much more flexible ways of decoupling dependencies and I personally haven’t felt the need to find a replacement for Spring. Maybe that’s a good thing? Object-Relational Mapping (ORMs) I have mixed feelings regarding ORMs in general, but it does make your life easier sometimes. There is no shortage of ORM librares for Node.js. Take your pick. Package Management Tools Maven is probably most popular build management tool for Java. While it is very flexible and powerful with a wide variety of plug-ins, it can get very cumbersome. Npm is the mainstream package manager for Node.js. It’s light, fast, and useful. Testing Framework Java has lots of these for sure, jUnit and company as standard. There are also mock libraries, stub libraries, db test libraries, etc. Node.js has quite a few as well. Pick your poison. I see that nodeunit is popular and is similar to jUnit. I’m personally testing Mocha. Testing tools are a more personal and subjective choice, but the good thing is that there definitely are good choices out there. Logging Java developers have quite a list of choices when choosing a logger library. Commons logging, log4j, logback, and slf4j (wrapper) are some of the more popular ones. Node.js also has a few. I’m currently using winston and have no complaints so far. It has the logging levels, multiple transports (appenders to log4j people), and does it all asynchronously as well. Hopefully this will help someone save some time when peeking into the world of Node. Good luck! Source: http://n0tw0rthy.wordpress.com/2012/01/08/from-java-to-node-js/
January 10, 2012
by Chris Smith
· 28,780 Views · 2 Likes
article thumbnail
Pros and Cons – When to use a Portal and Portlets instead of just Java Web-Frameworks
I had to answer the following question: Shall we use a Portal and if yes, should it be Liferay Portal or Oracle Portal? Or shall we use just one or more Java web frameworks? This article shows my result. I had to look especially at Liferay and Oracle products, nevertheless the result can be used for other products, too. The short answer: A Portal makes sense only in a few use cases, in the majority of cases you should not use one. In my case, we will not use one. What is a Portal? It is important to know that we are talking about an Enterprise Portal. Wikipedia has a good definition: „An enterprise portal [...] is a framework for integrating information, people and processes across organizational boundaries. It provides a secure unified access point, often in the form of a web-based user interface, and is designed to aggregate and personalize information through application-specific portlets.“ Several Portal server are available in the Java / JVM environment. Liferay Portal and GateIn Portal (former JBoss Portal) are examples for open source products while Oracle Portal or IBM WebSphere Portal are proprietary products. You develop Portlets („simple“ web applications) and deploy them in your portal. If you need to know more about a Portal or the Portlet JSR standards, ask Wikipedia: http://en.wikipedia.org/wiki/Portlet. Should we use a Portal or not? I found several pros and cons for using a Portal instead of just web applications. Disadvantages of using a Portal: Higher complexity Additional configuration (e.g. portlet.xml, Portal server) Communication between Portlets using Events is not trivial (it is also not trivial if two applications communicate without portlets, of course) Several restrictions when developing a web application within a Portlet Additional testing efforts (test your web applications and test it within a Portal and all its Portal features) Additional costs Open source usually offers enterprise editions which include support (e.g. Liferay) Proprietary products have very high initial costs. Besides, you need support, too (e.g. Oracle) You still have to customize the portal and integrate applications. A portal product does not give you corporate identity or systems integration for free. Software licensing often is only ten percent of the total price. Developers need additional skills besides using a web framework Several restrictions must be considered choosing a web-framework and implement the web application Rethinking about web application design is necessary Portlets use other concepts such as events or an action and render phase instead of only one phase Frameworks (also called bridges) help to solve this problem (but these are standardized for JSF only, a few other plugins are available, e.g. for GWT or Grails) Actually, IMO you have to use JSF if you want to realize Portlets in a stable, relatively „easy“ and future-proof way. There is no standard bridge for other frameworks. There are no books, best practices, articles or conference sessions about Portlets without JSF, right? Advantages of using a Portal Important: Many of the pros can be realized by oneself with relatively low efforts (see the "BUT" notes after each bullet point). Single Sign On (SSO) BUT: Several Java frameworks are available, e.g OpenSSO (powerful, but complicated) or JOSSO (not so powerful, but easy to use). Good products are available, e.g. Atlassian Crowd (I love Atlassian products such as Crowd, Jira or Confluence, because they are very intuitive and easy to use). Integration of several applications within one GUI A portal gives you layout and sequence of the applications for free (including stuff such as drag & drop, minimizing windows, and so on) Communication between Portlets (i.e. between different applications) BUT: This is required without a portal, too. Several solutions can be used, such as a database, messaging, web services, events, and so on. Even „push“ is possible for some time now (using specific web framework features or HTML 5 websockets). Uniform appearence BUT: CSS can solve this problem (the keyword „corporate identity“ exists in almost every company). Create a HTML template and include your applications within this template. Done. Personalization Regarding content, structure or graphical presentation Based on individual preferences or metadata BUT: Some of these features can be realized very easily by oneself (e.g. a simple role concept). Nevertheless, GUI features such as drag & drop are more effort (although component libraries can help you a lot). Many addons are included Search Content management Document management Web 2.0 tools (e.g. blogs or wikis) Collaboration suites (e.g. team pages) Analytics and reporting Development platforms BUT: A) Do you really need these Features? B) Is the offered functionality sufficent? Portals only offer „basic“ versions of stand-alone products. For instance, the content management system or search engine of a Portal is less powerful than other „real“ products offering this functionality. Thus, you have to think about the following central question: Do we really need all those features offered by a portal? Conclusion: The total cost of ownership (TCO) is much higher when using a portal. You have to be sure, that you really need the offered features. In some situations, you can defer your decision. Create your web applications as before. You can still integrate them in a Portal later, if you really need one. For instance, the following Oracle blog describes how you can use iFrames to do this: http://blogs.oracle.com/jheadstart/entry/integrating_a_jsf_application If you decide to use a Portal, you have to choose a Portal product. Should we use an Open Source or Proprietary Portal Product? Both, open source and proprietary Portal products have pros and cons. I especially looked at Oracle Portal and Liferay Portal, but probably most aspects can be considered when evaluating other products, too. Advantages of Oracle Portal Oracle offers a full-stack suite for development (including JSF and Portlets): Oracle Application Development Framework (ADF) Oracle JDeveloper offers good support for ADF. Everything from one product line increases efficiency (database, application server, ESB, IDE, Portal, …) – at least in theory :-) Disadvantages of Oracle Portal: High initial costs (I heard something about 200K in our company) Complex, heavyweight product (compared to Liferay Portal) Proprietary Communication between Portlets is not implemented using the standard JSR-286, but a custom proprietary solution (Source: http://www.contribute.be/web/contribute/news/-/journal_content/56_INSTANCE_pdF5/10234/21893) Advantages of Liferay Portal: Open source Drastically lower initial costs Lightwight product (1-Click-Install, etc.) Disadvantages of Liferay Portal: Not everything is from one product line (this cannot be considered as disadvantage always, but in our case the customer preferred very few different vendors (keyword “IT consolidation”) Portlets are still Portlets. Although Liferay is lightweight, realizing Portlets still sucks as it does with a proprietary product When to use a Portal? Well, the conclusion is difficult. In my opinion, it does make sense only in a few use cases. If you really need many or all of those Portal features, and they are also sufficient, then use a Portal product. Though, usually it is much easier to create a simple web application which integrates your applications. Use a SSO framework, create a template, and you are done. Your developers will appreciate not to work with Portlets and its increased complexity and restrictions. Did I miss any pros or cons? Do you have another opinion (probably, many people do???), then please write a comment and let’s discuss… Best regards, Kai Wähner (Twitter: @KaiWaehner) [Content from my Blog: Kai Wähner's Blog: Pros and Cons - When to use a Portal and Portlets instead of just Java Web-Frameworks]
October 13, 2011
by Kai Wähner DZone Core CORE
· 83,752 Views · 1 Like
article thumbnail
qcadoo MES - new Java-based Open Source manufacturing management software
Comment: Why did we do it ? For many years we stumbled upon many companies from different industries and with blushes on our cheeks, we saw that many new business software projects were solving the same problems over and over again. Every one is racing to bring new ERP systems, "innovative" CRMs and e-commerce suits. But most companies have already deployed these type of systems and optimized their sales process, finances and logistics. Little is however done with software for the manufacturing industry especially in the SME sector. Flying paper notes and Excel files - that is still the landscape in that domain. So we decided to make manufacturing management more systematic and started to work on qcadoo MES. We managed to acquire financing from the UE in the polish POIG 8.1 program and from a couple of private investors. We also build a team with people that have at least 10 year experience in developing IT solutions like ERP and BI systems for large international companies. What is qcadoo MES qcadoo MES is a manufacturing management system for small and medium companies. It combines many functions you can find in ERP, MRP and MES systems. We don't want to replace systems like SAP, MS Dynamics or Wonderware. We do not target the same groups they do. qcadoo MES lets you replace those paper notes and spreadsheets that are flying around the production line and gain benefit from having them all in one system. From the beginning of our project we had a couple of ambitious goals. First of all we wanted our software to be easy to use. From the first day we work closely with a company that specializes in usability. Thanks to this we have a clear and intuitive interface approved by eye-tracking tests. qcadoo MES should also eliminate the biggest issue with other systems of this type - starting with big and wide deployment which mess with your daily business tasks and give no guarantee of success. We did this by encapsulating every functionality - big or small - in separate modules. Modules can be turned off and on and extend functionality of other modules. Our platform takes care of everything - keeps the systems integrity and module dependencies intact. How does the user benefit from this ? He can start his adventure with qcadoo MES by deploying a simple functionality - a single module - and in time add new ones to solve just the task he needs. Additionally you don't just get access to modules from Qcadoo but mainly from our partners and from the Open Source community. Encouraged by successes of such companies like OpenERP, Open Bravo and SugarCRM we decided to also free our product on an Open Source license. We hope this will encourage developers and engineers from the manufacturing industry to collaborate with us on qcadoo MES. What makes us special You can find some manufacturing managements systems on the web. But what you wont find in them is: fully modular system - you can start from a simple functionality - one module - an add more in time. Just as your companies needs evolve. You can buy modules in the qcooStore - something similar to Apple AppStore or Android Market - but with solutions for the manufacturing industry created from day one with usability in mind - this speeds up deployments and makes user training go swiftly expandable to infinity - modules are the fundamental concept of this system and you can add new features easily without any technical knowledge. Thanks to this you can adjust our system easily to your companies specific needs. open source - our source code is available on the AGPL license which means that every one can use the software for commercial deployments (with having the terms of the AGPL in mind). Users that need more professional support and commercial modules can buy a commercial license available as a service (SaaS) - qcadoo MES is available for any browser and eliminates the need to install it manually, remembering about backups and system upgrades. This really reduces your costs and saves your time global range - from the beginning we are working on making qcadoo MES available worldwide. We are collaborating with partners from every country and believe that a wide choice of modules will fit in to production lines everywhere, Canada, Poland, India, you name it A new Module is a piece of cake - qcadoo Framework The whole system is based on the qcadoo Framework. A great tool for rapid implementation of modular database web application. Every thing in it is a module and therefore can be extended by another module. Yours for example ! While designing the qcadoo Framework we also wanted to make it very easy for new developers. For this purpose we created two simple XML languages: one to define the GUI and the other do define the data model. Thanks to them in a few lines of code you can add a new data entity, a table for it and all the basic CRUD operations (Create, Remove, Update, Delete). Also you can always write more complex business logic in a simple POJO Java class and use the XML languages to hook it to the right events. A new Application is also a piece of cake The qcadoo Framework is not limited to just manufacturing software. It's well suited for any business database application. Everywhere you need a rapid tool for data and GUI modeling, a modular architecture and a repository of business modules which you can use. Our partners ale already getting ready to use the qcadoo Framework in their commercial projects. We also hope that the open source community can benefit from it as well. About Qcadoo Limited Qcadoo Limited is a company which was founded in 2010 with the help of the UE POIG 8.1 financing program and by a couple of private investors. The core team is build from people which have years of experience in developing business software in Poland and the UE. More info on qcadoo MES and the qcadoo Framework at www.qcadoo.com
June 6, 2011
by Adam Walczak
· 6,551 Views
article thumbnail
When to Use Apache Camel?
When to use Apache Camel, a popular JVM/Java environment, and when to use other alternatives.
June 5, 2011
by Kai Wähner DZone Core CORE
· 153,195 Views · 12 Likes
article thumbnail
Solr + Hadoop = Big Data Love
Bixo Labs shows how to use Solr as a NoSQL solution for big data Many people use the Hadoop open source project to process large data sets because it’s a great solution for scalable, reliable data processing workflows. Hadoop is by far the most popular system for handling big data, with companies using massive clusters to store and process petabytes of data on thousands of servers. Since it emerged from the Nutch open source web crawler project in 2006, Hadoop has grown in every way imaginable – users, developers, associated projects (aka the “Hadoop ecosystem”). Starting at roughly the same time, the Solr open source project has become the most widely used search solution on planet Earth. Solr wraps the API-level indexing and search functionality of Lucene with a RESTful API, GUI, and lots of useful administrative and data integration functionality. The interesting thing about combining these two open source projects is that you can use Hadoop to crunch the data, and then serve it up in Solr. And we’re not talking about just free-text search; Solr can be used as a key-value store (i.e. a NoSQL database) via its support for range queries. Even on a single server, Solr can easily handle many millions of records (“documents” in Lucene lingo). Even better, Solr now supports sharding and replication via the new, cutting-edge SolrCloud functionality. Background I started using Hadoop & Solr about five years ago, as key pieces of the Krugle code search startup I co-founded in 2005. Back then, Hadoop was still part of the Nutch web crawler we used to extract information about open source projects. And Solr was fresh out of the oven, having just been released as open source by CNET. At Bixo Labs we use Hadoop, Solr, Cascading, Mahout, and many other open source technologies to create custom data processing workflows. The web is a common source of our input data, which we crawl using the Bixo open source project. The Problem During a web crawl, the state of the crawl is contained in something commonly called a “crawl DB”. For broad crawls, this has to be something that works with billions of records, since you need one entry for each known URL. Each “record” has the URL as the key, and contains important state information such as the time and result of the last request. For Hadoop-based crawlers such as Nutch and Bixo, the crawl DB is commonly kept in a set of flat files, where each file is a Hadoop “SequenceFile”. These are just a packed array of serialized key/value objects. Sometimes we need to poke at this data, and here’s where the simple flat-file structure creates a problem. There’s no easy way run queries against the data, but we can’t store it in a traditional database since billions of records + RDBMS == pain and suffering. Here is where scalable NoSQL solutions shine. For example, the Nutch project is currently re-factoring this crawl DB layer to allow plugging in HBase. Other options include Cassandra, MongoDB, CouchDB, etc. But for simple analytics and exploration on smaller datasets, a Solr-based solution works and is easier to configure. Plus you get useful and surprising fun functionality like facets, geospatial queries, range queries, free-form text search, and lots of other goodies for free. Architecture So what exactly would such a Hadoop + Solr system look like? As mentioned previously, in this example our input data comes from a Bixo web crawler’s CrawlDB, with one entry for each known URL. But the input data could just as easily be log files, or records from a traditional RDBMS, or the output of another data processing workflow. The key point is that we’re going to take a bunch of input data, (optionally) munge it into a more useful format, and then generate a Lucene index that we access via Solr. Hadoop For the uninitiated, Hadoop implements both a distributed file system (aka “HDFS”) and an execution layer that supports the map-reduce programming model. Typically data is loaded and transformed during the map phase, and then combined/saved during the reduce phase. In our example, the map phase reads in Hadoop compressed SequenceFiles that contain the state of our web crawl, and our reduce phase write out Lucene indexes. The focus of this article isn’t on how to write Hadoop map-reduce jobs, but I did want to show you the code that implements the guts of the job. Note that it’s not typical Hadoop key/value manipulation code, which is painful to write, debug, and maintain. Instead we use Cascading, which is an open source workflow planning and data processing API that creates Hadoop jobs from shorter, more representative code. The snippet below reads SequenceFiles from HDFS, and pipes those records into a sink (output) that stores them using a LuceneScheme, which in turn saves records as Lucene documents in an index. Tap source = new Hfs(new SequenceFile(CRAWLDB_FIELDS), inputDir); Pipe urlPipe = new Pipe("crawldb urls"); urlPipe = new Each(urlPipe, new ExtractDomain()); Tap sink = new Hfs(new LuceneScheme(SOLR_FIELDS, STORE_SETTINGS, INDEX_SETTINGS, StandardAnalyzer.class, MAX_FIELD_LENGTH), outputDir, true); FlowConnector fc = new FlowConnector(); fc.connect(source, sink, urlPipe).complete(); We defined CRAWLDB_FIELDS and SOLR_FIELDS to be the set of input and output data elements, using names like “url” and “status”. We take advantage of the Lucene Scheme that we’ve created for Cascading, which lets us easily map from Cascading’s view of the world (records with fields) to Lucene’s index (documents with fields). We don’t have a Cascading Scheme that directly supports Solr (wouldn’t that be handy?), but we can make-do for now since we can do simple analysis for this example. We indexed all of the fields so that we can perform queries against them. Only the status message contains normal English text, so that’s the only one we have to analyze (i.e., break the text up into terms using spaces and other token delimiters). In addition, the ExtractDomain operation pulls the domain from the URL field and builds a new Solr field containing just the domain. This will allow us to do queries against the domain of the URL as well as the complete URL. We could also have chosen to apply a custom analyzer to the URL to break it into several pieces (i.e., protocol, domain, port, path, query parameters) that could have been queried individually. Running the Hadoop Job For simplicity and pay-as-you-go, it’s hard to beat Amazon’s EC2 and Elastic Mapreduce offerings for running Hadoop jobs. You can easily spin up a cluster of 50 servers, run your job, save the results, and shut it down – all without needing to buy hardware or pay for IT support. There are many ways to create and configure a Hadoop cluster; for us, we’re very familiar with the (modified) EC2 Hadoop scripts that you can find in the Bixo distribution. Step-by-step instructions are available at http://openbixo.org/documentation/running-bixo-in-ec2/ The code for this article is available via GitHub, at http://github.com/bixolabs/hadoop2solr. The README displayed on that page contains step-by-step instructions for building and running the job. After the job is done, we’ll copy the resulting index out of the Hadoop distributed file system (HDFS) and onto the Hadoop cluster’s master server, then kill off the one slave we used. The Hadoop master is now ready to be configured as our Solr server. Solr On the Solr side of things, we need to create a schema that matches the index we’re generating. The key section of our schema.xml file is where we define the fields. These fields have a one-to-one correspondence with the SOLR_FIELDS we defined in our Hadoop workflow. They also need to use the same Lucene settings as what we defined in the static IndexWorkflow.java STORE_SETTINGS and INDEX_SETTINGS. Once we have this defined, all that’s left is to set up a server that we can use. To keep it simple, we’ll use the single EC2 instance in Amazon’s cloud (m1.large) that we used as our master for the Hadoop job, and run the simple Solr search server that relies on embedded Jetty to provide the webapp container. Similar to the Hadoop job, step-by-step instructions are in the README for the hadoop2solr project on GitHub. But in a nutshell, we’ll copy and unzip a Solr 1.4.1 setup on the EC2 server, do the same for our custom Solr configuration, create a symlink to the index, and then start it running with: Giving it a Try Now comes the interesting part. Since we opened up the default Jetty port used by Solr (8983) on this EC2 instance, we can directly access Solr’s handy admin console by pointing our browser at http://:8983/solr/admin % cd solr % java -Dsolr.solr.home=../solr-conf -Dsolr.data.dir=../solr-data -jar start.jar From here we can run queries against Solr: We can also use curl to talk to the server via HTTP requests: curl http://:8983/solr/select/?q=-status%3AFETCHED+and+-status%3AUNFETCHED The response is XML by default. Below is an example of the response from the above request, where we found 2,546 matches in 94ms. Now here’s what I find amazing. For an index of 82 million documents, running on a fairly wimpy box (EC2 m1.large = 2 virtual cores), the typical response time for a simple query like “status:FETCHED” is only 400 milliseconds, to find 9M documents. Even a complex query such as (status not FETCHED and not UNFETCHED) only takes six seconds. Scaling Obviously we could use beefier boxes. If we switched to something like m1.xlarge (15GB of memory, 4 virtual cores) then it’s likely we could handle upwards of 200M “records” in our Solr index and still get reasonable response times. If we wanted to scale beyond a single box, there are a number of solutions. Even out of the box Solr supports sharding, where your HTTP request can specify multiple servers to use in parallel. More recently, the Solr trunk has support for SolrCloud. This uses the ZooKeeper open source project to simplify coordination of multiple Solr servers. Finally, the Katta open source project supports Lucene-level distributed search, with many of the features needed for production quality distributed search that have not yet been added to SolrCloud. Summary The combination of Hadoop and Solr makes it easy to crunch lots of data and then quickly serve up the results via a fast, flexible search & query API. Because Solr supports query-style requests, it’s suitable as a NoSQL replacement for traditional databases in many situations, especially when the size of the data exceeds what is reasonable with a typical RDBMS. Solr has some limitations that you should be aware of, specifically: · Updating the index works best as a batch operation. Individual records can be updated, but each commit (index update) generates a new Lucene segment, which will impact performance. · Current support for replication, fail-over, and other attributes that you’d want in a production-grade solution aren’t yet there in SolrCloud. If this matters to you, consider Katta instead. · Many SQL queries can’t be easily mapped to Solr queries. The code for this article is available via GitHub, at http://github.com/bixolabs/hadoop2solr. The README displayed on that page contains additional technical details.
April 4, 2011
by Ken Krugler
· 119,543 Views
article thumbnail
Configuring Sonar with Maven
Sonar is an open source web-based application to manage code quality which covers seven axes of code quality as: Architecture and design, comments, duplications, unit tests, complexity, potential bugs and coding rules. Developed in Java and can cover projects in Java, Flex, PHP, PL/SQL, Cobol and Visual Basic 6. It's very efficient to navigate, offering visual reporting and you can follow metrics evolution of your project and combine them. There is an online project called Nemo dedicated to open source projects, as you can see projects like Jetty, Apache Lucene and Apache Tomcat. So, let's config Sonar to work together with Maven. First of all set up Sonar server and other configurations in Maven's settings.xml file: sonar true jdbc:postgresql://localhost/sonar org.postgresql.Driver user password http://localhost:9000 You must to have a Sonar server running and define it in sonar.host.url parameter. In this example I'm using the default Sonar URL, http://localhost:9000 , and you must set user name and password for your database. To install local Sonar Server you can see this link. After that, to execute the code analysers and save results in Sonar database just execute mvn sonar:sonar. You can config Sonar to execute in your CI (Continuous Integration) application. Hudson is a perfect match, because there is a Sonar plugin for it. See the final result below: I prefer Teamcity CI, but there isn't a stable plugin as you can see in compatibility matrix. To resolve this plugin's problem, just add sonar:sonar in the maven command executed by Teamcity. Sonar is an essential tool for your software project to help you evaluating how cohesive your classes are, warning for future problems as code complexity or duplication problems arise, and will help you to have a cleaner code. If you have any question or some difficulties, contact me.
March 21, 2011
by Valdemar Júnior
· 143,535 Views · 2 Likes
  • Previous
  • ...
  • 53
  • 54
  • 55
  • 56
  • 57
  • 58
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×