10 Big Data Tools
Big data’s ever-growing presence presents both challenges and opportunities for organizations that manage massive amounts of data. Here’s a quick list of Big Data tools for information management:
1. Apache Hive:
Hive is an open-source data warehouse infrastructure built on top of Hadoop. It provides tools for easy data ETL, a mechanism for imposing structure on the data, and the ability to query and analyze large data sets stored in Hadoop files. Hive uses a simple SQL-like query language, HiveQL, so users experienced with SQL can query the data.
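To give a feel for the SQL-like style of query described above, here is a minimal sketch. It does not connect to Hive: it uses Python's built-in sqlite3 module as a local stand-in, and the page_views table, its columns, and the sample rows are all hypothetical. The aggregate query is written with syntax that should be common to standard SQL and HiveQL, but treat that as an assumption to check against your Hive version.

```python
import sqlite3

# Hypothetical sample data. In Hive, this would live in files on Hadoop
# rather than in an in-memory SQLite table.
rows = [
    ("us", "/home"),
    ("us", "/pricing"),
    ("de", "/home"),
    ("us", "/home"),
    ("fr", "/docs"),
]

# A simple aggregate query in the SQL style that HiveQL is modeled on.
QUERY = """
SELECT country, COUNT(*) AS views
FROM page_views
GROUP BY country
ORDER BY views DESC, country ASC
"""


def views_per_country(records):
    """Load records into a temporary table and count page views per country."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE page_views (country TEXT, url TEXT)")
        conn.executemany("INSERT INTO page_views VALUES (?, ?)", records)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


result = views_per_country(rows)
print(result)
```

The point of the sketch is only that someone who can write the query above already knows most of what is needed to ask the same question of data stored in Hadoop through Hive.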
2. Jaspersoft BI Suite:
The Jaspersoft package is open-source software for producing reports from database columns. Industry leaders have found Jaspersoft to be well polished, and it is already installed in many businesses, where it turns SQL tables into PDFs that everyone can scrutinize at meetings. JasperReports also provides a Hive connector for reaching inside HBase.
3. 1010data:
Founded in 2000, 1010data is a New York-based analytical cloud service that mainly serves customers on Wall Street. Its customers include NYSE Euronext and even companies in gaming and telecommunications. Its design supports massively parallel processing for scalability. It also has its own query language, which supports a subset of SQL functions plus broader query types, including graph and time-series analyses. This private-cloud approach relieves customers of the stress that comes with managing and scaling infrastructure.
4. Actian:
Formerly known as Ingres Corp., Actian has more than 10,000 customers, and that number is growing. It expanded by adding Vectorwise and acquiring ParAccel, which led to the creation of Actian Vector and Actian Matrix, respectively. There are options that work with Hadoop distributions from Apache, Cloudera, Hortonworks, and others.
5. Pentaho Business Analytics:
Pentaho has been compared to Jaspersoft in the sense that, although it started as a report-generating engine, it is branching into big data by making it easier to absorb information from new sources. Pentaho’s tool can be hooked up to NoSQL databases such as MongoDB and Cassandra. Peter Wayner of InfoWorld notes that one of the more intriguing tools is Pentaho Data Integration, a graphical programming interface with a bunch of built-in modules that you can drag onto a picture and then connect.
6. Karmasphere Studio and Analyst:
Karmasphere Studio is a set of plug-ins built on top of Eclipse. It is a specialized IDE that makes it easier to create and run Hadoop jobs. Karmasphere’s tool walks you through each step of configuring a Hadoop job, showing partial results along the way. Karmasphere Analyst is designed to simplify the process of sifting through all the data in a Hadoop cluster.
7. Cloudera:
Cloudera strives to provide support for open-source Hadoop while extending the data-processing framework into a comprehensive “enterprise data hub” that can serve as a first destination and central point of management for all data within an enterprise. Hadoop can be both a destination data warehouse and an efficient staging and ETL source for an existing data warehouse. Enterprise conformed dimensions can be used as the basis for integrating Hadoop with conventional data warehouses. Cloudera seeks to become the “center of gravity” for data management.
8. HP Vertica Analytics Platform Version 7:
Because HP doesn’t have its own Hadoop distribution, it provides reference hardware configurations for the leading Hadoop software distributions. The computer industry leader has named its big data platform architecture HAVEn, which stands for Hadoop, Autonomy, Vertica, Enterprise Security and “n” applications. In its Vertica 7 release, HP added a “FlexZone” that lets users explore large data sets before defining the database schema and related analyses and reports. This release is integrated with Hadoop through Hive’s HCatalog metadata store, giving users a way to explore data on HDFS in a tabular view.
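The idea of exploring data before committing to a schema can be sketched in a few lines of plain Python. This is only an illustration of the general concept, not Vertica’s mechanism or API; the sample records, field names, and both helper functions are hypothetical.

```python
import json
from collections import Counter

# Hypothetical semi-structured records, one JSON object per line.
# Note that the records do not all share the same fields.
RAW_LINES = [
    '{"user": "ana", "action": "login"}',
    '{"user": "bo", "action": "purchase", "amount": 19.99}',
    '{"user": "ana", "action": "logout", "device": "mobile"}',
]


def discover_fields(lines):
    """Count how often each field name appears across the raw records."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        counts.update(record.keys())
    return counts


def project(lines, columns):
    """Build a tabular view of the chosen columns; absent values become None."""
    return [
        tuple(json.loads(line).get(col) for col in columns)
        for line in lines
    ]


# Step 1: look at what is in the data before deciding on any columns.
field_counts = discover_fields(RAW_LINES)
print(field_counts)

# Step 2: once the interesting fields are known, view them as a table.
table = project(RAW_LINES, ["user", "action", "amount"])
print(table)
```

Only after the first step shows which fields actually occur would you settle on the columns worth defining, which is the order of work the paragraph above describes.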
9. Talend Open Studio:
Talend’s tools are designed to help with data quality, data integration and data management. Talend delivers a unified platform that makes data management and application integration easier by providing a single environment for managing the lifecycle across enterprise boundaries. Its tools, which are 100 percent open source, are designed to help organizations build flexible, high-performance enterprise architectures that integrate and service-enable distributed applications.
10. Apache Spark:
Apache Spark is a newer addition to Hadoop's open-source ecosystem. It offers a much faster query engine than Hive because it relies on its own in-memory data processing framework rather than on Hadoop's MapReduce. It is used for event stream processing, real-time queries and machine learning.