Cross-database Computing Methods
This essay compares DBLink, esProc, Scala, and Calcite for performing cross-database computing in terms of cross-base coding design, deployment and configuration, and computing performance.
For some reason, data is sometimes divided into segments to be stored in different databases. This will involve mixed cross-database computing. There are various methods based on different theories for handling such a computation. Now let’s look at them through the aspects of coding design, deployment and configuration, and computing performance.
Since Oracle DBLink handles cross-database computations in plain Oracle SQL, the code it requires is simple and easy to understand.
We have an example here. The order table Orders is stored in MySQL. We need to sum the sale amounts for each SellerId and then join the result set with the employee table Employees in the Oracle database to match each salesperson's name to their total sales amount.
You just need to write the following SQL with a DB link:
select e.Name, o.subtotal from Employees e, (select SellerId, sum(Amount) subtotal from Orders@mysql1 group by SellerId) o where o.SellerId = e.EId
In the above SQL, mysql1 is the MySQL data source name. The relationship between a table and the data source holding it is simply expressed with the @ sign.
Though DBLink coding is simple, configuration is complicated. Users need to install components from different vendors, make sure their versions are compatible, and complete many other configuration steps. Except for a few items, every configuration requires editing files. Below are the main steps:
- Configure Oracle Client and MySQL ODBC to ensure that both databases can be accessed independently;
- Install MySQL database gateway, that is, Oracle Database Gateway for ODBC;
- Launch Oracle Heterogeneous Services (HS) and the related agents;
- Create the ODBC configuration file for heterogeneous database access and enter the MySQL ODBC information into it;
- Configure MySQL data source in listener.ora and tnsnames.ora respectively. Make sure their configurations are consistent;
- Use the DDL command to create a DB link pointing to MySQL (let’s name it mysql1, for example).
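For illustration, the gateway-related files might contain entries along these lines; the SID, paths, host, port, and DSN below are all placeholders, not values from the original article:

```
# initmysql1.ora -- HS parameters for the ODBC gateway (DSN is a placeholder)
HS_FDS_CONNECT_INFO = mysql_odbc_dsn
HS_FDS_TRACE_LEVEL = off

# listener.ora -- register the gateway SID with the listener
SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (SID_NAME = mysql1)
      (ORACLE_HOME = /u01/app/oracle/product/19.0.0)
      (PROGRAM = dg4odbc)
    )
  )

# tnsnames.ora -- alias used when creating the DB link; must match listener.ora
mysql1 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = 127.0.0.1)(PORT = 1521))
    (CONNECT_DATA = (SID = mysql1))
    (HS = OK)
  )

-- finally, the DDL creating the link (credentials are placeholders):
-- create database link mysql1 connect to "user" identified by "password" using 'mysql1';
```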
But the performance of DBLink is unsatisfactory. The primary cause is that the method cannot make good use of the other database's computing capabilities. Probably to ensure syntax compatibility, Oracle's designers prescribe that a computation over a remote table be split into two parts. One part covers the three types of operations executed on the remote server: non-functional conditional filtering, calculated-column generation, and full data-set retrieval. The other part includes most of the computations, executed locally: functional filtering, calculated-column generation, sorting, grouping and aggregation, and so on. Many computations that could be handled efficiently on the remote side are thus moved to the local side, where they execute more slowly. Even a grouping and aggregation that returns a result set smaller than the source data consumes transmission resources retrieving the whole data set for local computation.
In the above join example, the subquery select SellerId, sum(Amount) subtotal from Orders@mysql1 group by SellerId involves a remote table. If it were executed remotely, only a small result set would be transmitted to the local side. But since Oracle splits the query into two parts, only select SellerId, Amount from Orders — a full retrieval of the source data — is executed on the remote machine. The retrieved data is stored locally in a temporary table ($Orders, for instance), and the second part of the query, select SellerId, sum(Amount) subtotal from $Orders group by SellerId, is then executed on the local machine.
Similar cross-database computation handling methods are also employed by DB2 (Federated Database) and MSSQL (Linked Server). All work similarly and have the same pros and cons.
Scala was initially designed as a general-purpose programming language, yet it is now widely employed, together with Spark, to compute structured data. Spark supplies three data abstractions (DataFrame, Dataset, and RDD) and supports two styles of syntax (DSL and SQL); both can handle cross-database computations and can be mixed freely. The above computing goal, for example, can be achieved with a Scala program along the following lines:
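The original program was not preserved in this copy of the article, so here is a minimal sketch using Spark's JDBC data source; the URLs, credentials, and master setting are placeholders, not values from the original:

```scala
// Sketch: cross-database join with Spark, assuming the MySQL and Oracle
// JDBC driver JARs are on the classpath. Connection details are placeholders.
import org.apache.spark.sql.SparkSession

object CrossDb {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("crossDb").master("local[*]").getOrCreate()

    // Push the grouping down to MySQL by passing a subquery as the "table",
    // so only the small grouped result set travels over the network
    val orders = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://127.0.0.1:3306/demo")
      .option("dbtable",
        "(select SellerId, sum(Amount) subtotal from Orders group by SellerId) t")
      .option("user", "root").option("password", "***")
      .load()

    val employees = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@127.0.0.1:1521:orcl")
      .option("dbtable", "Employees")
      .option("user", "scott").option("password", "***")
      .load()

    // Join the grouped result with the Oracle table locally
    val result = orders
      .join(employees, orders("SellerId") === employees("EId"))
      .select("Name", "subtotal")
    result.show()
  }
}
```

The key design point is the subquery in the dbtable option: it makes Spark delegate the grouping and aggregation to MySQL rather than pulling the whole Orders table.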
Though the program is long, the core code is simple. Scala runs on the JVM, so Java and Scala can call each other with little extra code.
The configurations for Scala’s cross-database computing method are much easier than those for Oracle’s. Users just need to place the database driver JARs in the classpath. This shows that Scala’s method for handling cross-database computations is more open and professional. Oracle’s, on the other hand, is closed and immature.
The performance of Scala's method is better, too, because it can make good use of the databases' computing abilities. In the above Scala program, the remote table computations are actually executed remotely; the query select SellerId, sum(Amount) subtotal from Orders group by SellerId, for instance, is executed on the MySQL side. Rather than the whole source data, the remote computation returns only a small result set, so transmission becomes much faster.
The program above performs pure in-memory computation, which is suitable when the result sets returned by the databases are relatively small. If a returned data set is large, the persist() function should be used to enable automatic interchange between memory and external disk storage. Though Scala is competent at in-memory computing, its external-storage computing mechanism is not well developed and malfunctions are not uncommon, so it is better when the persist() function does not have to be used at all.
Calcite is an open-source project under the Apache Software Foundation. Its query planning engine is intended to be 'one size fits all' for all data computing platforms and data sources. The framework supports various SQL styles (though these are not the SQL dialects that actually get executed). To implement the above cross-database computation in MySQL style, for example:
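The original program was lost in this copy, so here is a minimal sketch of how such a query is typically issued through Calcite's JDBC driver; the schema names oracle1 and mysql1, the model file name, and the connection details are assumptions:

```java
// Sketch: MySQL-style cross-database query through Calcite's JDBC driver,
// assuming the Calcite core JAR plus both database drivers are on the
// classpath, and ds.json defines the oracle1 and mysql1 schemas.
import java.sql.*;
import java.util.Properties;

public class CrossDbCalcite {
    public static void main(String[] args) throws Exception {
        Properties info = new Properties();
        info.setProperty("lex", "MYSQL");       // parse the query in MySQL style
        info.setProperty("model", "ds.json");   // data source definitions
        try (Connection con = DriverManager.getConnection("jdbc:calcite:", info);
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "select e.Name, o.subtotal " +
                 "from oracle1.Employees e join " +
                 "(select SellerId, sum(Amount) subtotal " +
                 " from mysql1.Orders group by SellerId) o " +
                 "on o.SellerId = e.EId")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}
```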
In the program, mysql1 and its Oracle counterpart are data source names that point to MySQL and Oracle respectively. Calcite syntax is simple, easy to understand, and closely resembles ordinary JDBC.
The Calcite configurations are simple. Besides putting database jars in the classpath, users only need to edit a data source file, which is ds.json in the above program.
Like Oracle, Calcite cannot make good use of the computing abilities of heterogeneous databases, so its performance is unsatisfactory too. A Calcite-style SQL query is ultimately divided into two parts after multiple layers of translation. One part is translated into a SQL dialect (such as MySQL's implementation) and executed in the database. The other part covers the local computations performed after the databases' result sets are retrieved. The dialect translation is hard, and because Calcite must cater to many types of databases, it is done roughly. For example, group by is translated into plain record retrieval instead of actual grouping and aggregation, so the database's computing ability is wasted. The local translation is comparatively manageable and does not need to account for compatibility, so an operation can be translated into what it truly is.
There is another problem with Calcite. The framework only supports pure in-memory computations. Memory overflow will happen if the data set size is too large.
esProc is a specialized structured data computation engine and is also made to be 'one size fits all,' but it is lighter, thinner, and simpler. For example, to handle the above cross-database computation with esProc, we write the following script file (cross.dfx):
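The original script was lost in this copy; the grid below is a hedged reconstruction in esProc SPL, and the data source names oracle1 and mysql1 as well as the exact cell expressions are assumptions:

```
/* Sketch of cross.dfx -- one SPL expression per grid cell */
A1  =connect("oracle1")
A2  =connect("mysql1")
A3  =A2.query("select SellerId, sum(Amount) subtotal from Orders group by SellerId")
A4  =A1.query("select EId, Name from Employees")
A5  =A3.join(SellerId, A4:EId, Name)
A6  return A5
```

The grouping runs inside MySQL (A3), so only the small grouped result set crosses the network before the local join in A5.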
The script can be executed or debugged on esProc IDE, and can be embedded into a Java program through JDBC, as shown below:
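A sketch of that embedding, assuming esProc's documented JDBC driver class and local URL (treat both, and the column names, as assumptions) and that cross.dfx is deployed where the driver can find it:

```java
// Sketch: calling the cross.dfx script from Java through esProc JDBC,
// assuming the esProc JDBC JARs are on the classpath.
import java.sql.*;

public class CallEsProc {
    public static void main(String[] args) throws Exception {
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
             CallableStatement st = con.prepareCall("call cross()");
             ResultSet rs = st.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString("Name") + "\t"
                        + rs.getDouble("subtotal"));
            }
        }
    }
}
```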
Users are allowed to write a query within the Java program if the computation is simple, without editing an extra script file, as the following shows:
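A minimal sketch of such an inline query, assuming the same esProc JDBC driver and that an SPL expression can be passed directly to a statement; the expression itself is an assumption:

```java
// Sketch: one-off inline SPL expression through esProc JDBC, no script file.
import java.sql.*;

public class InlineEsProc {
    public static void main(String[] args) throws Exception {
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
             Statement st = con.createStatement();
             // group remotely in MySQL, then join with the Oracle table locally
             ResultSet rs = st.executeQuery(
                 "=connect(\"mysql1\").query(\"select SellerId, sum(Amount) subtotal"
                 + " from Orders group by SellerId\").join(SellerId,"
                 + " connect(\"oracle1\").query(\"select EId, Name from Employees\")"
                 + ":EId, Name)")) {
            while (rs.next()) {
                System.out.println(rs.getString("Name"));
            }
        }
    }
}
```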
In terms of performance, esProc and Scala are equal. Both can make the most use of the database computing ability to achieve high performance. In the above code, MySQL just needs to return a few small result sets instead of the whole source data.
esProc also offers a cursor mechanism to handle computations where the source data or the result sets are too large to fit in memory. The mechanism, well developed and stable, is designed for huge data volumes. For example, to group the Employees table by the Dept field and summarize the Amount field of the Orders table:
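The original grid was lost in this copy; the sketch below is a hedged SPL reconstruction, assuming both tables are fetched as cursors ordered by their join keys (data source names and expressions are assumptions):

```
/* Sketch: cursor-based merge join and grouping in esProc SPL */
A1  =connect("oracle1")
A2  =connect("mysql1")
A3  =A2.cursor("select SellerId, Amount from Orders order by SellerId")
A4  =A1.cursor("select EId, Dept from Employees order by EId")
A5  =joinx(A3:o, SellerId; A4:e, EId)
A6  =A5.groups(e.Dept:Dept; sum(o.Amount):Amount)
```

Because both cursors arrive ordered by the join key, A5 can merge them in a single pass without loading either table into memory.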
The above esProc code increases performance by doing the join through an order-based merge.
For special cross-database computing scenarios, such as those involving sub-databases, esProc can use parallel processing to considerably enhance the performance.
Configurations for the esProc cross-database computation handling method are simple, too. Users only need to add the database jars in the classpath and set the database data source name.
In a nutshell, all four methods have simple syntax; all except DBLink offer convenient configuration; Scala and esProc deliver better performance; and esProc outperforms the other three by supplying richer class libraries and more mature computing mechanisms.