To gather insights for DZone's Data Persistence Research Guide, scheduled for release in March 2016, we spoke to 16 executives from 13 companies who develop databases and manage persistent data in their own companies or help clients do so.
Here's who we talked to:
Satyen Sangani, CEO, Alation | Sam Rehman, CTO, Arxan | Andy Warfield, Co-Founder/CTO, Coho Data | Rami Chahine, V.P. Product Management and Dan Potter, CMO, Datawatch | Eric Frenkiel, Co-Founder/CEO, MemSQL | Will Shulman, CEO, MongoLab | Philip Rathle, V.P. of Product, Neo Technology | Paul Nashawaty, Product Marketing and Strategy, Progress | Joan Wrabetz, CTO, Qualisystems | Yiftach Shoolman, Co-Founder and CTO and Leena Joshi, V.P. Product Marketing, Redis Labs | Partha Seetala, CTO, Robin Systems | Dale Lutz, Co-Founder, and Paul Nalos, Database Team Lead, Safe Software | Jon Bock, VP of Product and Marketing, Snowflake Computing
What we learned foremost is the importance of storing data in some durable form that maintains ACID properties while keeping the data accessible. There’s a tradeoff between speed, scale, and usability. Ultimately databases must be consistent, available, and able to tolerate partitions.
Here's what they said when we asked, "What do you see as the most important elements of databases?":
Tables, records, and fields must be connected. A database should be able to filter information quickly enough to produce results in a timely manner. Data is usually accessed with queries or business intelligence (BI) tools. It's important for any database to respond rapidly with deduplicated data.
A database needs to support a broad variety of data so that it can be aggregated and analyzed and reports can be generated. Each database is different with regard to aggregation and integration. It should support many different uses, since people keep finding new ways to use data, and it should store data in many different forms before and after normalization (data preparation). Summations are used for reporting, providing another view of the data whether it’s structured, unstructured, or semi-structured. Data quality is also important: sensors provide a lot of readings, and you need to remove or highlight the exceptions so you don't need to store all of the data collected by the sensor.
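The sensor point can be illustrated with a minimal Python sketch. It assumes a reading counts as an "exception" when it sits more than two standard deviations from the mean; that rule, the function name split_exceptions, and the sample values are illustrative assumptions, not something the respondent specified.

```python
from statistics import mean, pstdev


def split_exceptions(readings, threshold=2.0):
    """Summarize a batch of sensor readings and pull out the outliers.

    Assumption: an exception is any reading more than `threshold`
    population standard deviations away from the batch mean.
    """
    mu = mean(readings)
    sigma = pstdev(readings)
    if sigma == 0:
        exceptions = []  # every reading is identical, nothing to flag
    else:
        exceptions = [
            (index, value)
            for index, value in enumerate(readings)
            if abs(value - mu) > threshold * sigma
        ]
    # Only the compact summary and the exceptions would need to be stored.
    summary = {
        "count": len(readings),
        "mean": mu,
        "min": min(readings),
        "max": max(readings),
    }
    return summary, exceptions


def demo():
    # Nine normal readings and one spike (hypothetical data).
    return split_exceptions([20.0] * 9 + [80.0])
```

Storing the summary plus the flagged readings, rather than every raw value, is one way to keep sensor data manageable; a production pipeline would likely use a rolling window or a domain-specific rule instead.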
There are two types: transactional, used by banks for ATMs and deposit accounts; and analytical, used for asking questions, learning more about the business, and making predictions. Each database is built to answer a particular set of questions. People used to buy Oracle and build what they needed. Now there are 30 to 40 different databases, each with a specific purpose and use case.
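The difference between the two workloads can be sketched with Python's built-in sqlite3 module. SQLite is only a convenient stand-in here; the accounts table, the branch column, and the function names are hypothetical.

```python
import sqlite3


def make_db():
    """Create a small in-memory database with hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, branch TEXT, balance INTEGER)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?, ?)",
        [(1, "north", 500), (2, "north", 300), (3, "south", 900)],
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Transactional workload: both updates succeed or neither does."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )


def balance_by_branch(conn):
    """Analytical workload: aggregate over many rows to answer a question."""
    rows = conn.execute("SELECT branch, SUM(balance) FROM accounts GROUP BY branch")
    return dict(rows)
```

The transfer touches a couple of rows and must be all-or-nothing, while the branch report scans the whole table and tolerates slightly stale data. Those differing needs are a large part of why specialized engines exist for each.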
Today we need to think about the original use and then the secondary use of the data. There is a shift toward in-memory databases. The right database depends on the use case: you may need a memory-centric database to keep up with speed. We're shifting to Hadoop for data mining at scale and for web-scale analytics. Databases have bifurcated into what’s most relevant for the use case, “the consumerization of databases.”
We're no longer one size fits all. Every database is created for a particular use case. We’ve moved from transactional to task-driven databases. Read and write quickly for availability and replication.
Performance: the ability to process more data, more quickly. There are new ways to deal with data on commodity hardware. We used to need mainframes and microprocessors. Now you can achieve high performance by pulling low-end hardware into a cluster (i.e., the Cloud). This has led to high-performance analytics on low-end hardware. The Cloud is an accelerator for the digital enterprise.
It’s established a new genre of software, from SQL to NoSQL. There’s been a renaissance of SQL. There are three properties of distributed systems: consistency, availability, and partition tolerance. You can only have two. It used to be that you couldn’t scale SQL over time. Now, stored data can become consistent data over time. Amazon started this with consistency being promoted right into the distribution center.
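The idea that stored data "can become consistent over time" is usually called eventual consistency. The Python sketch below shows one common way replicas can converge, a last-write-wins merge keyed on a timestamp. It is an illustrative assumption, not a description of how Amazon or any product named here works; the Replica class and its methods are hypothetical.

```python
class Replica:
    """A toy replica that accepts writes locally and merges with peers later."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> ((timestamp, origin), value)

    def write(self, key, value, timestamp):
        # Accept the write locally, staying available even if peers are unreachable.
        self._apply(key, (timestamp, self.name), value)

    def _apply(self, key, version, value):
        current = self.store.get(key)
        # Last write wins; the origin name breaks ties between equal timestamps.
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def sync_from(self, other):
        # Once the partition heals, pull the peer's entries and keep the newest.
        for key, (version, value) in other.store.items():
            self._apply(key, version, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def demo():
    a, b = Replica("a"), Replica("b")
    a.write("cart", "book", timestamp=1)  # written on one side of a partition
    b.write("cart", "pen", timestamp=2)   # written on the other side
    before = (a.read("cart"), b.read("cart"))
    a.sync_from(b)
    b.sync_from(a)
    after = (a.read("cart"), b.read("cart"))
    return before, after
```

During the partition the two replicas answer differently (availability is kept, consistency is given up); after syncing they agree. Last-write-wins silently discards the older write, which is why some systems use richer merge strategies instead.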
It’s most important to know the requirements of the data when selecting a solution. A database should do what’s on the label, and you should understand what’s on the label. A lot of databases can do one thing very well but not the other things that traditional databases would do. Databases can store a lot more data without semantic data. It depends on what you’re doing with the data. SQL now sits on top of a lot of technologies with different models, like Hadoop. With query languages, at what point do you say a query language is right for one type of data and not another? 1. Transactional model. 2. Data model. 3. Query language.
Primarily with an operational data store, the stuff behind the application. A mobile app will have a log-in backed by a database with email address, password, and account information. Lyft would track user profiles and rides. Evernote would track notes and other services. These are stores for data you want to put away. Ten years ago you’d have SQL with ANSI-standard queries. There’s been a renaissance in databases. MongoDB is an operational database.
A database is for storing data in some durable form to maintain ACID properties and the ability to access it. Optimization can differ between the network, memory, and persistence based on the need. Usability for the programmer needs to improve; it’s hard to build a database that’s easy to use. Newer databases like Mongo and Couch solve for scale and ease of use, but aren’t as focused on integrity and consistency. You trade other qualities for speed and scale. NoSQL is good for front-end work. Security is always an issue when data is involved because data is like money, the bloodstream of the system. You can’t protect the data without protecting the logic around it.
There are many database technologies that serve different needs. Common benefits include centralizing data and standardizing how it is accessed, making it easier and cheaper to manage and use the same data for different purposes. This is often referred to as a "single source of truth."
Data must remain highly available since every minute of downtime can be prohibitively expensive. If you’re serving thousands to hundreds of thousands of operations per second, the cost of a minute of downtime could easily run into several hundred thousand dollars. Six features are critical to ensuring high availability and safeguarding against every type of failure or outage event: 1) in-memory replication; 2) multi-rack/zone/datacenter replication; 3) instant auto-failover; 4) AOF (append-only file) data persistence; 5) backup; and 6) multi-region/cloud replication. These features protect your data from process failures, node failures, multi-node failures, rack/zone/datacenter failures, network split events, and complete region or cloud failures.
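Of the six features, append-only file persistence is the easiest to show in a few lines. The Python sketch below illustrates the general technique only: every write is appended to a log before it is applied in memory, and the log is replayed on restart. It is not Redis's actual implementation or file format; the AppendOnlyStore class, the JSON record layout, and the file name are assumptions made for illustration.

```python
import json
import os
import tempfile


class AppendOnlyStore:
    """In-memory key-value store that survives restarts by replaying a log."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self._replay()  # rebuild in-memory state from any existing log
        self._log = open(path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                if record["op"] == "set":
                    self.data[record["key"]] = record["value"]
                elif record["op"] == "del":
                    self.data.pop(record["key"], None)

    def _append(self, record):
        # Write and fsync before acknowledging, so a process crash loses nothing.
        self._log.write(json.dumps(record) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())

    def set(self, key, value):
        self._append({"op": "set", "key": key, "value": value})
        self.data[key] = value

    def delete(self, key):
        self._append({"op": "del", "key": key})
        self.data.pop(key, None)

    def close(self):
        self._log.close()


def demo():
    """Write some keys, 'restart', and return the recovered state."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "store.aof")
        store = AppendOnlyStore(path)
        store.set("a", 1)
        store.set("b", 2)
        store.delete("a")
        store.close()
        recovered = AppendOnlyStore(path)  # simulates a process restart
        state = dict(recovered.data)
        recovered.close()
        return state
```

Syncing on every write is the safest and slowest choice; real systems typically let operators trade durability for throughput by syncing less often, and they compact the log periodically so replay stays fast.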
It depends on the context—low latency versus high access.
So, what do you consider to be the most important elements of the database?