How to Handle the Influx of Data
High-speed ingestion with the ability to query.
To learn about the current and future state of databases, we spoke with and received insights from 19 IT professionals. We asked, "How can companies get a handle on the vast amounts of data they’re collecting?" Here’s what they shared with us:
- It’s incredibly important to ingest, store, and present data for querying. We have a lambda architecture for in-memory processing, streaming, analytics, and then very scalable data at rest for historical data. When people struggle, it is usually because they’ve figured out only one piece of the puzzle. They may be able to ingest data quickly, but they are not able to analyze the data and get insights. It’s all about being able to capture the data and then do valuable things with it at the same time.
- Have an Agile data architecture. We have perfected the collection of data with data ingestion solutions like Spark and Kinesis. But there are still a lot of challenges remaining in analyzing and operationalizing the data. There is not enough scale and investment going on in those two areas. Focus on concepts like federated query. Data can reside anywhere. Optimize compute to understand where the data lives so you can produce fast results. Data labs give people their own sandbox to work on data that exists and bring compute to where the data resides.
- We handle data at a high level with governance based on where data is coming from, its structure, and where it’s going. With things like GDPR, this has become more important. Ingesting streaming data in real time is key. Stream-based ingestion is increasing in both volume and noise. Bring in other technologies like Kafka to ingest. Multiplatform offerings provide “horses for courses.”
- The data management problem is solved with an overarching data management solution. Consider what data needs to be stored, for how long and at what granularity. For example, in banking, with mobile access, a lot of customers look at their balances when they are bored. Because we’re in-memory we can cache balance information so it’s cheap and easy for customers to get to.
- Be able to securely store large amounts of data. Companies are using the cloud to do this because they do not have to pre-provision resources. They typically store this data in object stores like Amazon S3 or Google Cloud Storage. The second challenge is to derive value from these data sets; much of the value stays inaccessible because there is no way to query the raw data. A developer has to massage the data using various data pipelines before he/she can unlock the value of this data, and this transformation typically uses its own custom APIs. Databases make it easy to query these data sets. Databases associate a schema with the data, either at read time or write time, and make it accessible by a developer via a very standard query language like SQL. New-age databases can continuously ingest data from cloud services, like Amazon S3, Google Cloud or DynamoDB, and make it queryable via standard SQL. This makes it easier for a developer to extract value from large sets of data.
- 1) Auditing is probably the first step. Understand what the data is, its origin, and its destination. Then marry this with your overall strategy as a business and figure out whether vital data exists, whether it should be archived, or whether it needs enrichment to produce meaningful data. 2) In a previous life, the first task was to run tools that would scan the network and find instances of running databases. In some cases, customers had several copies of the same data being processed by different systems, costing vast amounts in infrastructure and resources. No one was using this data. This goes back to designing databases with a purpose in mind. 3) Stream processing can play a huge part. By validating, classifying, and enriching data, you can add context and meaning. That way you can determine how much value it may have to you. Stream processing enables organization and context, which in turn enables understanding.
- An active analytics platform enables clients to handle data and access streaming and historical data using SQL queries. We are now able to include graph relationship queries, and we also recognize the opportunity to run trained ML algorithms against the active analytics database.
- Delete it as fast as you possibly can. The types of customers that can and cannot delete data vary by industry. Healthcare, aerospace, and finance must preserve data. Are you going to archive? Does the data need to be available in real time or near real time? Do you put it in a warehouse? Is the database transactional? How up to date does the data need to be? Balance a transactional system at run time against the analytics customers want to run. RDBMS or ELK stack? A database is a tool; don’t abuse it. Have a strategy around the data, with long-term and short-term problems to address. Get it right early or it just gets more difficult.
- Be more intelligent about how you will use the data to do novel things. Accelerate database releases to provide knowledge to the business more quickly. Be smart about equipping the right individuals to have control over their destiny. People are moving away from the monolith. Choose the right technology based on what you are trying to achieve. There are more tools today with greater specialization. Let teams chase after and test different solutions so they benefit from processing all of this.
- It’s a challenging task to get a handle on data collection, but it’s even more challenging to provide data access. Database technologies, such as data indexing, data normalizing, and data warehousing, allow companies to systematically store and retrieve data as efficiently and effectively as possible.
- If you collect meaningful data that you expect to be able to sort, categorize, and report from, it should be stored in a database! And your database strategy will be key to your operational efficiency.
- Databases are part of the solution. Choose a data storage product based on how to get the data in and how to query it. In terms of value, it comes down to how much you need to scale out to avoid performance hits. There’s vertical and horizontal scaling. Traditional, vertically scaled databases scale well. Now horizontally scaled databases are scaling well, too. Cost is an issue. If you host in a public cloud, a lot of licensing headaches are removed because the cloud vendor has worked out the details. It’s much easier to adopt a database service because you don’t have to provision hardware.
- Traceability, lineage, and governance are key. The graph model is able to represent open-ended, complex pictures using nodes and relationships. Keep track of not just meta lineage but all the different identifiers for the individual, their devices, and their identities. We are seeing the rise of the chief data officer and governance with GDPR and the California initiative. It is not unlike a data warehouse, where you get the data you need based on the requirements you have. See how pieces of data correlate across the entire enterprise. What kind of data pieces do you want to see correlated, and what kind of relationships do you want to discover?
- Many companies need to have a better, more accurate understanding of how they expect their data to scale and what the projected growth rate is going to be. Granted, it can be hard to get perfectly right. (You might start with a small environment, get customers faster than anticipated, and blow out projections.) But the importance of having a way to understand what data you are collecting and what the volume is going to be – from the beginning! – cannot be overstressed. With Apache Cassandra, for example, it’s fairly easy to scale, but it’s not particularly fast to do so. You need to plan deployment with enough runway. If you hit limits, you’re going to have problems.
- Although we are built to handle and scale high volumes of data, one of the first steps is always to get a clear picture of which data points are really important. The value of data is also changing with the location (e.g. cloud vs. edge) and over time. Exploring and learning from (and with) the data is an important part of ongoing success.
- Use data platforms that allow you to work naturally with data of any shape or structure in its native form without having to constantly wrangle with a rigid schema. You also need the ability to scale out on commodity infrastructure, and to do it across geographic regions, to accommodate massive increases in data volume.
- That’s why we developed a platform to handle the scale and diversity of data. Edge to cloud is a common use case, with initial processing at the edge and then moving the data to data centers. Once the data is in a central location, that’s where you can do ML, come up with models, and push insight back to the edge. When you have datasets like that, that’s where the database and streaming fit in. With fast streaming and fast processing, you need a platform with different data services to meet all of your needs.
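Several respondents describe validating, classifying, and enriching data as it streams in, so that context is attached before the data lands in a database. A minimal Python sketch of that idea follows. It is an illustration only, not any contributor's implementation: the `Event` type, the `device_id` and `value` fields, the alert threshold of 100, and the region lookup table are all hypothetical, and a real deployment would read from a stream such as Kafka or Kinesis rather than an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, Optional


@dataclass
class Event:
    """A hypothetical ingested record: where it came from and what it carries."""
    source: str
    payload: dict
    tags: list = field(default_factory=list)


def validate(event: Event) -> Optional[Event]:
    """Return the event if it has the fields later steps rely on, else None."""
    if "device_id" in event.payload and "value" in event.payload:
        return event
    return None


def classify(event: Event) -> Event:
    """Tag the event so downstream consumers can judge its value quickly."""
    # The threshold of 100 is an arbitrary example value.
    event.tags.append("alert" if event.payload["value"] > 100 else "routine")
    return event


def enrich(event: Event, device_regions: dict) -> Event:
    """Add context (here, a region) looked up from reference data."""
    device_id = event.payload["device_id"]
    event.payload["region"] = device_regions.get(device_id, "unknown")
    return event


def process(stream: Iterable[Event], device_regions: dict) -> Iterator[Event]:
    """Validate, classify, and enrich each event lazily, one at a time."""
    for event in stream:
        checked = validate(event)
        if checked is None:
            continue  # discard noise early instead of storing it
        yield enrich(classify(checked), device_regions)


if __name__ == "__main__":
    regions = {"d1": "eu-west"}
    incoming = [
        Event("sensor", {"device_id": "d1", "value": 150}),
        Event("sensor", {"value": 3}),  # missing device_id, will be dropped
        Event("sensor", {"device_id": "d2", "value": 5}),
    ]
    for result in process(incoming, regions):
        print(result.tags, result.payload)
```

Because `process` is a generator, events are handled one by one as they arrive, and invalid records are discarded before they consume storage, which echoes the advice above to decide early which data is worth keeping.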
Here are the contributors of insight, knowledge, and experience:
- Raghu Chakravarthi, SVP and Chief Product Officer, Actian
- Joe Moser, Head of Product, Crate.io
- Brian Rauscher, Director of Support, Cybera
- Sanjay Challa, Director of Product Management, Datical
- OJ Ngo, CTO, DH2i
- Anders Wallgren, CTO, Electric Cloud
- Johnson Noel, Senior Solutions Architect, Hazelcast
- Adam Zegelin, SVP Engineering, Instaclustr
- Daniel Raskin, CMO, Kinetica
- James Corcoran, CTO of Enterprise Solutions, Kx
- Neeraja Rentachintala, V.P. of Product Management, MapR
- Mat Keep, Senior Director of Product & Solutions, MongoDB
- Philip Rathle, V.P. of Products and Matt Casters, Chief Solution Architect, Neo4j
- Ariff Kassam, V.P. of Products, NuoDB
- Dhruba Borthakur, co-founder and CTO, Rockset
- Erik Gfesser, Principal Architect, SPR
- Lucas Vogel, Owner, Endpoint Systems
- Neil Barton, CTO, WhereScape
Opinions expressed by DZone contributors are their own.