Why Data Warehousing as a Service?
This guest post comes courtesy of our friends at Treasure Data.
First, let me introduce myself. My name is Kaz Ohta and I am Treasure Data’s CTO and co-founder. My expertise is in distributed and parallel computing; my passion is open source technology. I was instrumental in developing MessagePack and Fluentd; have contributed to the Linux kernel, KDE, Memcached, and MongoDB; currently curate open source components (e.g. Apache Hadoop); and founded the Japanese Hadoop User Group.
Having worked with complex open source technologies for years, I have seen first-hand what companies go through in time, expense and specialized IT resources to implement and maintain a big data analytics solution. I realized that big data analytics was really only available to companies with deep pockets and highly skilled staff. For example, an on-premise Hadoop solution can take a company anywhere from 60 to 160 DevOps days to implement. With top Hadoop consultants charging $1,500 per day, that means implementation costs of roughly $90,000 to $240,000.
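The arithmetic behind that estimate can be checked with a short sketch. It uses only the two figures quoted above (60 to 160 DevOps days at $1,500 per day); the function and variable names are illustrative, and the result covers consulting fees only, not hardware or ongoing maintenance.

```python
# Back-of-the-envelope cost range using only the figures quoted in the text.
DAY_RATE_USD = 1_500          # consultant day rate quoted above
MIN_DAYS, MAX_DAYS = 60, 160  # DevOps-day range quoted above


def implementation_cost(days: int, day_rate: int = DAY_RATE_USD) -> int:
    """Return the consulting cost in USD for a given number of DevOps days."""
    return days * day_rate


low = implementation_cost(MIN_DAYS)
high = implementation_cost(MAX_DAYS)
print(f"${low:,} to ${high:,}")  # prints: $90,000 to $240,000
```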
My vision is to provide a service-based big data solution that eliminates these cost and complexity barriers. The Treasure Data Cloud Data Warehouse service offers an affordable, quick-to-implement and easy-to-use big data solution that does not require specialized IT resources, making big data analytics available to the mass market.
Here’s how we’ve done it. We leverage Hadoop and other open source technologies to keep our costs low and pass the cost savings on to our customers; and we’ve added our own innovative technology to address three critical Hadoop bottlenecks:
• Data-Load. We provide two tools to make data-load faster and easier. For initial data-load, we provide a bulk data-loader that can import any amount of data into our Cloud Data Warehouse. For streaming data collection and load, we provide td-agent – a lightweight data-collection daemon based on our highly successful Fluentd product. Together these provide both batch and continuous data feeds, and convert structured, semi-structured and unstructured data into standard JSON.
• Columnar Data Processing. We have replaced HDFS, which still has data-management difficulties and a single-point-of-failure (SPOF) issue, with our own columnar database. This enables us to process massive volumes of data much more quickly, making near real-time analysis a reality for Hadoop users. We also use our MessagePack technology and various compression algorithms to achieve 5-10x data storage efficiencies.
• Faster and Easier Querying. On the back-end, we provide an SQL-like query language that distributes and runs your query in parallel. There is no need to learn a complex domain-specific language. We also provide a JDBC driver, which allows you to use your preferred BI / Visualization tools (e.g. Jaspersoft, Indicee, Metric Insights), and we will soon offer an ODBC driver, which will enable integration with Excel, Tableau, etc.
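To make the columnar idea in the list above concrete, here is a minimal sketch of how JSON events might be pivoted into per-column storage, so that an aggregate query reads only the column it needs. This is not Treasure Data's actual storage format: the event fields, the function names and the in-memory `store` are all hypothetical, and the standard-library `json` and `zlib` modules stand in for MessagePack and the service's own compression.

```python
import json
import zlib

# Hypothetical JSON events, shaped like what a log collector might emit.
events = [
    {"time": 1348700000 + i, "user": f"user{i % 3}", "bytes": 100 * (i % 5)}
    for i in range(1000)
]


def to_columns(rows):
    """Pivot row dicts into a dict of column lists.

    Assumes every row has the same keys; a real system would handle gaps.
    """
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def compress_columns(columns):
    """Serialize and compress each column independently."""
    return {
        name: zlib.compress(json.dumps(values).encode("utf-8"))
        for name, values in columns.items()
    }


def read_column(store, name):
    """Decompress and decode a single column, leaving the others untouched."""
    return json.loads(zlib.decompress(store[name]).decode("utf-8"))


store = compress_columns(to_columns(events))

# Rough equivalent of: SELECT SUM(bytes) FROM events
# Only the "bytes" column is decompressed; "time" and "user" are never read.
total_bytes = sum(read_column(store, "bytes"))
print(total_bytes)
```

Because values in one column tend to resemble each other, per-column compression is often effective, but the ratio depends entirely on the data; this sketch makes no claim about matching the 5-10x figure above.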
The Treasure Data service is hosted on Amazon S3, so we can offer a scalable, reliable and secure production environment without the need to pull IT staff from other projects. Processing, storage and network resources are completely elastic and can be scaled up or down as requirements dictate. For example, one of our customers scaled from zero to 50 billion rows in 3 months. Our service is managed 24x7 by an expert operations staff, which eliminates the overhead costs associated with resourcing and managing an on-premise environment.
Treasure Data officially launches on Thursday, September 27th, in San Francisco. If you want to try our service for yourself you can sign up for free.