Compute Data Outside Database to Alleviate Warehouse Expansion Pressure

The data warehouse is essential to enterprise business intelligence, and it accounts for a large part of total enterprise cost. With the global data explosion of recent years, business data volumes have grown significantly, and the enterprise data warehouse is hard pressed to meet diverse and complex business demands. More data, more data warehouse applications, more concurrent access, higher performance, and faster I/O: all of these put more pressure on the data warehouse. IT managers are now concerned with how to expand data warehouse capacity at a lower cost.

Here is an example. A data warehouse is originally provisioned as follows:

Server: One cluster with two high-performance database servers.

Storage space: 5 TB high-performance disk array.

CPU: 8 high-performance CPUs.

User licenses: 100

The capacity expansion needed over the next 12 months is:

Computational performance: Double

Storage space: Quadruple

Concurrency: Double

How can an IT manager achieve this expansion goal? The common practice is to upgrade the database hardware and software: replace the servers with more advanced data warehouse servers or add two more data warehouse servers of the same class, add a 15 TB data-warehouse-specific disk or change to a 20 TB disk cabinet, and add 8 CPUs. In addition, the organization has to pay expensive software licensing fees for the additional user licenses, CPUs, and disk storage space.

Whichever way you choose to upgrade, the data warehouse vendor will ultimately lock you into its products and charge you for the expensive upgrades.
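The hardware figures in the upgrade path above follow from simple arithmetic on the original configuration. The short Python sketch below works them out. It assumes that doubling computational performance means doubling the CPU count and that doubling concurrency means doubling the user licenses; the article does not state either mapping, and the dictionary keys are illustrative names only.

```python
# Baseline configuration taken from the example above.
baseline = {"cpus": 8, "storage_tb": 5, "licenses": 100}

# Growth factors for the 12-month target.
# Assumption: performance scales with CPU count, concurrency with licenses.
growth = {"cpus": 2, "storage_tb": 4, "licenses": 2}


def expansion_plan(baseline, growth):
    """Return (target, additional) resources for each item."""
    target = {k: baseline[k] * growth[k] for k in baseline}
    additional = {k: target[k] - baseline[k] for k in baseline}
    return target, additional


target, additional = expansion_plan(baseline, growth)
print("target:    ", target)      # 16 CPUs, 20 TB, 200 licenses
print("additional:", additional)  # 8 CPUs, 15 TB, 100 licenses
```

Under these assumptions the result matches the upgrade described above: 8 more CPUs and 15 TB more disk (20 TB in total), plus 100 more user licenses, each of which carries a licensing fee.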

Computation outside the database is an alternative to expanding warehouse capacity. Of the 20 TB in the data warehouse (30% real data and 70% buffer), the core data is usually less than 1/10, say about 1 TB. The remaining 19 TB is taken up by redundant data. For example, when a new application is deployed, the data warehouse usually requires the application to work on a copy of the data it uses rather than access the core data directly, in order to protect the core data. Quite often, the new application also needs records that summarize and process the core data, so an intermediate table derived from the core data is built to speed up access. As a result, redundant data keeps growing as existing and emerging businesses develop, while the total amount of core data stays low.

This redundant data is not core data and does not require the same high level of security protection. If it is moved to ordinary PCs, and tools other than the database are used to read, write, and compute on it, the cost of database capacity expansion drops dramatically. Computation outside the database, combined with computation in the database, is therefore the best way to achieve database capacity expansion. The benefits include:

Computational performance: Run parallel computation across multiple nodes built from inexpensive PCs and desktop CPUs. Compared with high-performance database servers, the same or even greater computational performance can be achieved at a relatively low cost.

Storage space: With cost-effective desktop-grade disks, users get far more storage space than data-warehouse-specific disks provide, at an extremely low cost. HDFS also helps with data security protection, access consistency, and expanding disk capacity without downtime.

Concurrency: With access spread over multiple nodes, requests that used to hit the data warehouse centrally are distributed to multiple machines, which together serve more requests than the data warehouse alone. In addition, users do not have to pay for additional access licenses, CPUs, or disk storage space.

Computation outside the database looks attractive, and Hadoop and similar software on the market can meet all of the above demands. Why, then, do so few people choose Hadoop to relieve the pressure of expanding data warehouse capacity? Because it is not as powerful as a database at computing, in particular for computation involving complex logic.

What if there were software that met the above demands on computational performance, storage space, and concurrency while being as powerful as a database in computing, or even more so? With such software, the capacity expansion pressure on the database would be greatly relieved, and so would the cost of database capacity expansion.

esProc is built to meet these demands. It is middleware specially designed to take on computation jobs between the database and the application. For the application layer, esProc offers an easy-to-use JDBC interface; for the database layer, esProc is powerful in parallel computation. By performing computation outside the database or in external storage, esProc relieves the pressure on the database's computation, storage, and concurrency. As a result, organizations can cut the cost of database software and hardware effectively while still optimizing database administration.

esProc has a comprehensive and well-defined computing architecture, which is capable of sharing the workload of databases and handling computations of any complexity for applications. In addition, esProc supports parallel computation across multiple nodes, so a massive or intensive computation workload can be shared evenly by multiple ordinary servers or inexpensive PCs.

With support for parallel computation, esProc can decompose computation jobs that used to be handled centrally and allocate them evenly to multiple ordinary PCs. Each node only needs to undertake a small share of the computation.
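The idea of decomposing a central job and spreading it evenly over nodes can be sketched in a few lines of Python. This is not esProc's API or syntax, only an illustration of the general technique: the function names are hypothetical, the data is made up, and worker threads on one machine stand in for separate PCs. Each "node" computes a partial sum and count for its chunk, and the partial results are merged into the final average.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, n_nodes):
    """Split records into n_nodes chunks of nearly equal size."""
    size, extra = divmod(len(records), n_nodes)
    chunks, start = [], 0
    for i in range(n_nodes):
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def node_job(chunk):
    """Work done on one node: partial sum and count for its chunk."""
    return sum(chunk), len(chunk)


def parallel_average(records, n_nodes=4):
    """Average computed from per-node partial results."""
    chunks = split(records, n_nodes)
    # Threads stand in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(node_job, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


amounts = list(range(1, 101))  # hypothetical order amounts 1..100
average = parallel_average(amounts, n_nodes=4)
print(average)  # 50.5
```

Each node returns only a small partial result (a sum and a count), so the merge step stays cheap no matter how large the chunks are.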

With esProc, the core data can be stored in the database, while the intermediate tables and scripts derived from the core data can be stored outside the database. By using resources sensibly, the workload pressure on the database is relieved, database cost is kept under control, management problems become easier to solve, and various data warehouse applications can be handled with ease. These applications include real-time high-performance applications, non-real-time big data applications, desktop BI, reporting, and ETL.
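As a rough illustration of keeping core data in the database while storing a derived intermediate table outside it, the Python sketch below uses only the standard library. It does not show how esProc itself works: an in-memory SQLite database stands in for the data warehouse, a CSV file on an ordinary disk stands in for external storage, and the table, column, and file names are hypothetical.

```python
import csv
import os
import sqlite3
import tempfile

# Core data stays in the database (SQLite stands in for the warehouse).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (region TEXT, amount REAL)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 70.0)],
)

# Build the summarized intermediate result once, from the core data.
rows = db.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
db.close()

# Store it outside the database, here as a CSV file on an ordinary disk.
path = os.path.join(tempfile.mkdtemp(), "orders_by_region.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["region", "total"])
    writer.writerows(rows)

# Applications then read the file; the database is no longer involved.
with open(path, newline="") as f:
    summary = {r["region"]: float(r["total"]) for r in csv.DictReader(f)}

print(summary)  # {'east': 150.0, 'west': 70.0}
```

After the export, the summary occupies no warehouse storage and reading it consumes no database connection, which is the saving the paragraph above describes.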
