To gather insights for DZone's Data Persistence Research Guide, scheduled for release in March 2016, we spoke to 16 executives from 13 companies who develop databases and manage persistent data at their own companies or help clients do so.
Here's who we talked to:
Satyen Sangani, CEO, Alation | Sam Rehman, CTO, Arxan | Andy Warfield, Co-Founder/CTO, Coho Data | Rami Chahine, V.P. Product Management and Dan Potter, CMO, Datawatch | Eric Frenkiel, Co-Founder/CEO, MemSQL | Will Shulman, CEO, MongoLab | Philip Rathle, V.P. of Product, Neo Technology | Paul Nashawaty, Product Marketing and Strategy, Progress | Joan Wrabetz, CTO, Qualisystems | Yiftach Shoolman, Co-Founder and CTO and Leena Joshi, V.P. Product Marketing, Redis Labs | Partha Seetala, CTO, Robin Systems | Dale Lutz, Co-Founder, and Paul Nalos, Database Team Lead, Safe Software | Jon Bock, VP of Product and Marketing, Snowflake Computing
The macro-trend is that more data is being analyzed in real time. The internet of connected things lets you see how things interact. Applications increasingly use multiple databases to provide polyglot persistence.
Here's what we heard when we asked, "What problems are being solved with databases?":
The field is evolving very rapidly. Relational databases on big iron are long gone. We are using distributed data and accessing information from different sources. Different databases provide different kinds of access. We may use Spark SQL for transactional data and then a warehouse for mining legacy data. Each database looks different, but they all need to be accessible holistically.
People are being more predictive with analytics. Databases used to be reporting engines. Now they support predictive analytics: advertisers can gauge an audience’s receptivity to a banner ad, retailers can use a customer profile to make a special offer, and logistics companies can route packages from point A to point B by determining how to staff and assign drivers.
Companies are using databases to learn about their business faster. Safeway uses Teradata and Alation with its loyalty card to market better, provide better service, and predict what people will buy and which customers may churn. eBay uses them to decide how the website should present an offer to a customer, which improves the user experience (UX). Companies learn from and measure what people are doing in order to provide a better customer experience (CX). We track how people use our own software to improve it; the same set of techniques provides better results.
The ability to test before making changes without disturbing the persistent data ensures that the customer experience (CX) improves rather than deteriorates.
New-generation databases operate solely in memory (e.g. Apache Ignite and Spark), enabling data to be read as fast as possible. We expect data to be retrieved quickly, and we solve problems by retrieving data faster than before. We need to solve for persistence for analytics. This enables us to do many things much faster.
Databases provide the ability to get instant insight into what the business is experiencing. In oil and gas, IoT-enabled drill bits let drillers know when a bit is about to break. Advertisers can deliver the right ad to the right customer in 10 milliseconds. Banks have real-time risk management. Real-time logistics has led to on-demand ride sourcing.
Traditionally, databases stored data and answered inquiries. Now they must mesh with governance, audit, and metadata. The difference between traditional applications and database applications is increasingly blurred. Virtual machines and containers enable different types of databases, such as data stores for virtualization. Twenty years ago, there were seven to ten applications from Microsoft. Now, most large companies are doing their own application development. Support is becoming broader. We need a storage platform for both traditional and new forms.
A general class of problems involves getting information from individual pieces of data, and some queries grab only individual pieces of data. The macro-trend is that more and more data will be analyzed in real time. We are starting to find more intelligent ways to handle recommendations, fraud detection, access rights, and IoT. One question is how to deal with volumes of sensor data. With the internet of connected things, once you see how things interact, you can reach the systems and data you need and avoid those you don’t. Graphs help you understand connections and relationships in the data. As databases proliferate, how do we keep track of the data in all of the places it lives? What’s an original and what’s a copy?
People are noticing that different parts of an application have different needs. Specialized databases do specific things better (e.g. representing a social network as a graph and crawling its links). Other databases scale better. The trend is towards specialization, with apps using two, three, or more databases for polyglot persistence. At the end of the day, all databases store, organize, and retrieve data. Query interfaces may differ (e.g. key-value lookups versus rich query languages).
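As a rough illustration of polyglot persistence, the sketch below has one small application use two stores with different query interfaces. Everything in it is an assumption made for illustration: a plain Python dict stands in for a key-value store, an in-memory SQLite database stands in for a relational database, and the table, function, and customer names are hypothetical.

```python
import sqlite3

# Hypothetical stand-in for a key-value store: lookups by key only.
session_store = {}

# Hypothetical stand-in for a relational database: queried with SQL.
orders_db = sqlite3.connect(":memory:")
orders_db.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)
orders_db.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 5.5), ("alice", 12.5)],
)
orders_db.commit()


def save_session(session_id, customer):
    """Key-value interface: store a value under a key."""
    session_store[session_id] = {"customer": customer}


def customer_for_session(session_id):
    """Key-value interface: fetch a value by its key."""
    return session_store[session_id]["customer"]


def total_spent(customer):
    """Relational interface: aggregate rows with a SQL query."""
    row = orders_db.execute(
        "SELECT COALESCE(SUM(total), 0) FROM orders WHERE customer = ?",
        (customer,),
    ).fetchone()
    return row[0]


# One request touches both stores: session lookup, then an aggregate query.
save_session("s-1", "alice")
spent = total_spent(customer_for_session("s-1"))
print(spent)
```

The session lookup needs nothing more than a key, while the spending total relies on a query language. That difference in needs is what leads an application to use more than one database.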
Mobile, IoT, and distributed applications have made data distributed by nature. The cloud has solved some of the problems but has also resulted in more distributed data. While it would be ideal for all data to be in a single location, this is no longer realistic. You must have data accessibility across the board. This raises issues around security and privacy. Keys and credentials are critical to managing data. Security is a difficult subject. The more you can tweak and customize to protect the data (e.g. with access tokens), the better. Data is critical, so don’t let it get outside of control zones. Implement enterprise-level policies.
Organizations that implement this strategy benefit from improved performance in retrieving subsets of the data from large datasets. They can also reuse data for different purposes and perform querying, exploring, and mining to create value from data. They’re ultimately creating a shared operational picture, while providing different views of the same data to different users depending on their needs.
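One way to picture "different views of the same data" is with SQL views over a single table. The sketch below is only an illustration under assumed names: the `purchases` table, its columns, and both views are hypothetical, and an in-memory SQLite database stands in for whatever store an organization actually uses.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# A single shared table: the one copy of the data (hypothetical schema).
db.execute("CREATE TABLE purchases (customer TEXT, region TEXT, amount REAL)")
db.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("alice", "west", 30.0), ("bob", "east", 10.0), ("carol", "west", 15.0)],
)

# A view a marketing user might want: spend per customer.
db.execute(
    "CREATE VIEW spend_by_customer AS "
    "SELECT customer, SUM(amount) AS spent FROM purchases GROUP BY customer"
)

# A view an operations user might want: sales per region.
db.execute(
    "CREATE VIEW sales_by_region AS "
    "SELECT region, SUM(amount) AS sales FROM purchases GROUP BY region"
)

# Both views read the same underlying rows; nothing is copied.
by_customer = dict(db.execute("SELECT customer, spent FROM spend_by_customer"))
by_region = dict(db.execute("SELECT region, sales FROM sales_by_region"))

# Retrieving a subset of the data: only purchases from one region.
west_customers = [
    row[0]
    for row in db.execute(
        "SELECT customer FROM purchases WHERE region = ? ORDER BY customer",
        ("west",),
    )
]
print(by_customer, by_region, west_customers)
```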
There are many use cases. Twitter's timeline, for example, is based on the Redis time series database.
At one end are fast-data use cases with non-relational clickstream data, where data can be shared by multiple data science teams without being copied from one source to another. Copying takes time, and each copy must be tracked and costs more to store.