Editor's Note: This post was originally authored by Andrew Wilson from the VoltDB blog.
NewSQL and NoSQL databases present developers with a new and interesting programming model where there are fewer connections to the database, query parallelism, partitions and an easy model for supporting new clusters. Last month I presented a webinar introducing the basics of application development using VoltDB. We had a lot of great questions and I thought it would be handy to write them up and share the answers with everyone.
Q: Can you address how you would handle joining tables that have different partition keys? Would you not partition the tables in that case?
A: For multi-partition queries you join replicated tables and partitioned tables, and you can join two or more partitioned tables but they have to be joined on all table’s partition key.
If the query has a where clause that limits the query to a single partition, it can be executed as a single partition query. For single partition queries, you can join any combination of replicated and partitioned tables.
In all cases, consider joining on columns supported by indices (so that they are fast).
There are a number of factors to consider when joining tables in general sense. Should the table be a replicated table, thus local to every partition, or is it too large? For example, I may have a “users” table that contains thousands of users. I may need to map those users to a role and there may be only seven or eight roles within this “roles” table. In this case, it is safe to partition the “users” table and to replicate the “roles” table.
Consider though if your “users” table was joining with an “orders” table. This “orders” table may have thousands or millions of rows, making it a poor candidate for replication. Further, because of business rules it may be necessary to partition the orders on something other than the user but instead on the order ID. A join in this circumstance would result in a multi-partition query.
Q: When writing VoltDB SQL select statement that makes use of a non-unique index, does all the selected columns need to be included in the ORDER BY clause, in order to guarantee a deterministic behavior?
A: This is generally true since the non-unique index does not guarantee the relative ordering of records with duplicate indexed columns and different non-indexed columns. However, if ALL of the selected columns are indexed columns in the selected index, the columns do not need to be included in the ORDER BY clause.
If there are other (non-indexed) columns, the ORDER BY clause should include all of the selected columns.
For a unique index, the values of any other columns would be completely determined by the indexed column's ordering.
Q: PostgreSQL has an ARRAY  cell type, which we extensively use. Does VoltDB has something like that?
A: There is really no comparable type without using your own custom serialization mechanisms. We are releasing a new JSON column type in a future revision and JSON supports array fields. You can pass arrays to stored procedures as well so using the upcoming JSON type or a custom serializer may become options for you.
Q: Are Java Stored Procedures compatible with Oracle Stored Procedures? If not, do you know how to manage a migration?
A: They are not compatible. We use a number of VoltDB specific Java classes and consequently cannot run existing Oracle Java stored procedures.
Managing a migration, regardless of the source DB, is a bit tricky and is dependent upon your schedule. Under the best circumstances I would recommend reviewing the procedures first to see which ones make sense to move to VoltDB. You may find that one or more stored procedures can be merged into a single stored procedure. Next, I would build unit tests against the Oracle Java procedures using a known dataset and then create the VoltDB stored procedures and test them using the unit tests.
Q: Is it possible to use CDC (change data capture) to replicate VolDB data into an Analytics Database? If not, what are the alternatives?
A: There are a number of strategies that VoltDB users use to push data to analytics databases. The first is to use the VoltDB Export table functionality. Export tables act like persistent queues - you can add data to them via SQL and an external client polls on that data through a well-published protocol. This external “export” client pulls on the data and delivers it to the target external system, ack’ing the receipt of the data (and thus removing it from the VoltDB Export table). VoltDB supports (and ships) a flat file export client, a PostgreSQL client as well as a Hadoop export client out of the box. Developing new export clients are a fairly simple task. For more information on the VoltDB Export feature please see: http://voltdb.com/company/blog/voltdb-export-connecting-voltdb-other-systems
Other more traditional CDC strategies such as time stamping the rows or versioning the rows are easy to accomplish in VoltDB at the application layer by adding the appropriate columns to your tables. There is no internal timestamp per row, for example, in VoltDB.
Q: Is the database size is limited by the size of RAM available? How to manage virtual memory?
A: The database size is limited by the amount of memory within the cluster. We suggest users disable swapping to avoid unintentional performance degradations.
Note that virtualized machines often do not allow you to disable swapping or swapping is hidden from the guest OS. In either case, it is a good idea to do throughput testing as other factors can come into play that can negatively impact throughput and latency, such as the quality of the network and the drives assuming you are using either snapshots or command logging.
Q: What is largest implementation in terms of number of tables?
A: There are hundreds of VoltDB applications deployed throughout the world. While we don’t have exact table count numbers for those applications, we know of implementations in the 100 to 200 table range. Additionally, we have customers deploying applications on upwards of 30 nodes both on bare metal, as well as on Amazon’s cloud.
Q: Hi - we like to do all our builds with Maven. Do you support that?
A: Yes, we do work with maven but our libraries are not in the central repository. You usually need to manually add two libraries. The first is the client library if you are working on the client side and the stored procedure library if you are working on the server side. The following maven import statements are examples of what I used in VoltDB 2.8.2:
mvn install:install-file -Dfile=./voltdb-2.8.2.jar -DgroupId=org.voltdb -DartifactId=VoltDB -Dversion=2.8.2 -Dpackaging=jar
mvn install:install-file -Dfile=./voltdbclient-2.8.2.jar -DgroupId=org.voltdb -DartifactId=VoltDBclient -Dversion=2.8.2 -Dpackaging=jar
Q: Is there any recommended way to keep the db schema evolution in order (I.E. as a set of "patches")? Thank you
A: VoltDB supports catalog updates where you can add and remove tables. You can also add, remove and update stored procedures. Over the coming weeks we’ll be adding additional support for online schema changes such as adding/dropping indexes and columns.
Q: How well VoltDB gels with hibernate/mybatis or other java based persistence framework?
A: We do not currently support Hibernate or MyBatis directly.
Q: You mentioned about the limitation of 50MB for the result set of a query, does it mean the developer should predetermine the result set size of the query and not use such?
A: Yes, you need to be sure that your result set is less than 50MB or the query or stored procedure will fail.
Q: VoltDB is described as an in-memory database, but what makes it different from loading a complete Oracle database in memory?
A: We’re dramatically faster than Oracle in memory. Research at MIT, Brown and Yale have identified over 90% of processing in traditional databases can be classified as “overhead” leaving a scant 7% processing to the actual “work”. This overhead includes such things as buffer management, locking, latching and logging. VoltDB was architected from the ground up to eliminate this overhead by being in-memory, thus eliminating buffering, and to run each transaction in parallel and in isolation, thus eliminating the need to latch or lock. When traditional databases are running in memory, they still carry the traditional overhead, or baggage.
For more information on the innovative architecture and processing power of VoltDB, please reference the following research papers:
- The end of an Architectural Era: (It’s Time for a Complete Rewrite): http://hstore.cs.brown.edu/papers/hstore-endofera.pdf
- OLTP through the looking glass, and what we found there http://hstore.cs.brown.edu/papers/hstore-lookingglass.pdf
Thanks to everyone who tuned in and especially to those with who sent along some great questions. Don’t forget to check out the forums section on our community site for more information as well. Please join us next week when I will be presenting our next webinar “Writing VoltDB Apps in NodeJS” on October 11th.