Mahout, HBase Among Six Apache Graduates
Apache Traffic Server, a high-performance HTTP/1.1 caching proxy server, was donated by Yahoo around five months ago, in November 2009. It's no surprise that Traffic Server made it to the top level in only five months: Yahoo has used the server since 2002 to serve about 400TB of data per day. The developers say that the software is capable of handling over 75k requests per second per server. The project committers plan on building native IPv6 support, full 64-bit support, and support for non-Linux Unix systems.
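To put the 400TB-per-day figure in perspective, here is a quick back-of-the-envelope conversion to an average sustained rate. It assumes decimal units (1 TB = 10^12 bytes) and a perfectly even load across the day, neither of which the source states; the function name is only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_throughput(terabytes_per_day):
    """Return (GB/s, Gbit/s) for a daily volume, assuming 1 TB = 1e12 bytes."""
    bytes_per_second = terabytes_per_day * 1e12 / SECONDS_PER_DAY
    gigabytes_per_second = bytes_per_second / 1e9
    gigabits_per_second = bytes_per_second * 8 / 1e9
    return gigabytes_per_second, gigabits_per_second


gb_per_s, gbit_per_s = average_throughput(400)
print(f"{gb_per_s:.2f} GB/s, about {gbit_per_s:.0f} Gbit/s on average")
```

Under those assumptions the average works out to roughly 4.6 GB/s across the whole deployment; real traffic peaks would be higher.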
The Mahout project is still below version 1.0, but it has already gained a good deal of interest from data-driven application developers. The project supplies a collection of scalable machine-learning (A.I.) algorithm implementations, which include clustering, collaborative filtering, classification, feature reduction, and data mining algorithms. These implementations are built on top of Apache's MapReduce framework, Hadoop. Mahout has been a sub-project of Apache Lucene since 2008.
Tika is a lightweight, embeddable toolkit for advanced language detection and analysis. Tika uses MIME standards and provides rapid unification of existing parser libraries. It's been a Lucene sub-project since 2008 and is used in many Lucene projects, including Nutch, Mahout, and Solr. Tika is used by NASA, Day Software, and the Internet Archive.
Nutch is a modular web search engine that adds web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats. Nutch enables the creation of plugins for things like querying, clustering, data retrieval, media-type parsing, and more. After a 100 million page demo system was created with Nutch, the project graduated from the incubator to the Lucene project in 2005.
Avro is a system for fast data serialization that relies on rich, dynamic schemas. It has a compact binary data format with features for persistence, remote procedure calls, and simple dynamic language integration. The Avro project was formerly a sub-project of Apache Hadoop.
HBase is a NoSQL data store based on Google's BigTable. The data store adds random read/write access to the Hadoop stack, extending offline processing capabilities and enabling realtime serving of very large datasets. The project's goal is the hosting of big tables -- billions of rows X millions of columns -- running atop commodity hardware. HBase had been a sub-project of Hadoop since 2007.
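The idea of a very wide table with random read/write access can be illustrated with a toy, in-memory sketch. This is not HBase's API: the SparseTable class, its methods, and the row and column names are all hypothetical. It only shows the general BigTable-style notion that a table with millions of possible columns stores just the cells that actually exist.

```python
from collections import defaultdict


class SparseTable:
    """Toy sparse table: row key -> {column name -> value}. Not the HBase API."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Random write: only this one cell is created or overwritten.
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # Random read: absent cells cost nothing and return the default.
        return self._rows.get(row_key, {}).get(column, default)

    def cell_count(self):
        # Storage grows with populated cells, not with rows x columns.
        return sum(len(columns) for columns in self._rows.values())


table = SparseTable()
table.put("com.example/index", "anchor:home", "Example")
table.put("com.example/index", "contents:html", "<html>...</html>")
table.put("org.apache/hbase", "contents:html", "<html>...</html>")

print(table.get("com.example/index", "anchor:home"))
print(table.get("org.apache/hbase", "anchor:home"))
print(table.cell_count())
```

A real deployment distributes and persists this structure across many machines; the sketch deliberately leaves all of that out.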
It's already been a busy year for Apache, and an extremely successful one. Along with today's graduations, five other projects have been promoted to the top level this year. Apache UIMA, an analysis system for unstructured data, and Apache Shindig, an OpenSocial container, were two significant projects that got promoted. Apache Click, a JEE web app framework that graduated to TLP, has gotten some Java developers excited. Two of Apache's most high-profile projects have also graduated this year: Apache Cassandra and Apache Subversion. Subversion is, of course, one of the most popular version control systems available, and Cassandra is the red hot NoSQL data store that everyone's talking about.
With the six graduates today and the five graduates so far this year, that makes eleven new Top Level Apache projects - and the year's not even halfway finished! Imagine what could happen in the next seven months.