DZone Research: Big Data Ingestion


Keys for speed: cataloging, automation, indexing, scalability, Hadoop, and other platforms.


To gather insights on the current and future state of the database ecosystem, we talked to IT executives from 22 companies about how their clients are using databases today and how they see use, and solutions, changing in the future.

We asked them, "How can companies get a handle on the vast amounts of data they’re collecting, and how can databases help solve this problem?" Here's what they told us:


  • Databases provide a good middle data layer that makes data easier for end users to access. You are able to know where data is, where it came from, what you are doing with it, where it goes, and who has access to it. Data cataloging and automation tools use metadata to track where data is and how it’s being processed.
  • The exponential growth of data, alongside the increasing sprawl of data as it spreads around companies, is making the cataloging and identification of that data an important issue for every company. Strangely, perhaps, increasing legislation around data privacy is actually helping to solve the problem because it has made companies realize they need to know where their data is, what it is, and who has access to it. So modern database development in the new age of tighter data legislation provides the answer to the data growth issue. Data should be cataloged and identified, with sensitive data clearly labeled; access to production environments should be tightly controlled; backups should be encrypted; and copies of databases used in development and testing should have sensitive data masked. Once those processes are in place and become part of the normal day-to-day workflow, the growth in data becomes more of an infrastructure issue. It changes the debate from "How do I get a clearer picture of my data?" to "Do I have enough capacity on my server to store my data?"
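The masking step mentioned above, where copies of databases used in development and testing have sensitive data masked, can be sketched in a few lines. This is only an illustration under stated assumptions: the column names in `SENSITIVE_COLUMNS`, the salt, and the `masked_` prefix are hypothetical, and a real deployment would take its labels from the data catalog and use a dedicated masking tool.

```python
import hashlib

# Hypothetical labels; in practice these would come from the data catalog.
SENSITIVE_COLUMNS = {"email", "ssn"}


def mask_value(value: str, salt: str = "dev-copy") -> str:
    """Replace a sensitive value with a deterministic, non-reversible token."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "masked_" + digest[:12]


def mask_row(row: dict, sensitive=SENSITIVE_COLUMNS) -> dict:
    """Return a copy of the row with labeled sensitive columns masked."""
    return {
        column: mask_value(str(value))
        if column in sensitive and value is not None
        else value
        for column, value in row.items()
    }


production_row = {"id": 7, "email": "jane@example.com", "plan": "pro"}
dev_row = mask_row(production_row)
```

Because the masking is deterministic, the same input always produces the same token, so joins between masked tables in a test copy still line up.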


  • More and more companies are struggling with large amounts of data that standard databases can’t keep up with and are looking at big data solutions. Going beyond the database, companies can get a handle on the vast amounts of data they’re collecting by implementing a holistic strategy that includes a single point for automating, coordinating, controlling, and monitoring it all. Regarding ETL processes, once the data sets (email messaging, documents, videos, audio files, presentations, telemetric sources, and more) are collected from structured and unstructured sources, the focus turns to preparing and organizing the data for use. To simplify the development and ongoing maintenance of these various processes, automation should be built into the plan to support functions such as file transfers, loading data, generating reports, and flow control. Companies need to adopt an automation solution with capabilities that support big data automation to solve two major pain points: integration and scheduling. For example, a solution that offers pre-built integrations for the Hadoop ecosystem, including HDFS, Pig, Hive, Sqoop, HBase, MapReduce, Spark, and Oozie, along with API adapters for extensibility, is critical for adapting to changing environments. These capabilities mean that you no longer need to involve developers and data science professionals to build out big data workflows; they become as easy as drag-and-drop. The benefits of a centralized strategy for automating and managing data processes include: 1) Reduces both the time and cost spent building and maintaining the repetitive assimilation of big data. 2) Minimizes the risk of manual errors by decreasing dependence on custom script creation. 3) Optimizes the efficiency and speed of business and IT workloads to deliver faster time to insight for business users. 4) Eliminates wait time for job execution by using file triggers to start workloads, going beyond interval, date and time, or constraint-based scheduling.
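The file trigger idea in the last benefit above can be illustrated with a small sketch: a job starts when a file lands, rather than at the next scheduled interval. This is not how any particular automation product implements it. The function names `poll_once` and `demo` are hypothetical, and simple directory polling stands in for the event-driven triggers a real scheduler would use.

```python
import os
import tempfile


def poll_once(directory, seen, on_file):
    """Run on_file for each file not seen before; return the new paths."""
    new_paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and path not in seen:
            seen.add(path)      # remember it so the job runs only once
            on_file(path)       # the "job": load, transfer, report, etc.
            new_paths.append(path)
    return new_paths


def demo():
    """Drop one file into a temporary landing directory and trigger on it."""
    loaded = []
    seen = set()
    with tempfile.TemporaryDirectory() as landing_dir:
        with open(os.path.join(landing_dir, "orders.csv"), "w") as handle:
            handle.write("id,total\n1,9.99\n")
        first = poll_once(landing_dir, seen, loaded.append)
        second = poll_once(landing_dir, seen, loaded.append)
    return len(first), len(second), len(loaded)
```

Calling `poll_once` in a short loop, or replacing it with an operating system file-notification API, removes the wait between a file's arrival and the start of the workload.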


  • Databases have been around for a long time. Data is normalized for storage, and quick access comes from indexing and from writing good queries for ingesting and accessing the data. It is getting more complicated with NoSQL. Non-relational data needs to be stored and accessed in both NoSQL and SQL systems and indexed appropriately so it can be linked and extracted.

  • We provide extremely fast data import speeds (100 GB in minutes) to keep up with enterprise needs. Enterprises already have historical data in some systems, and importing it includes building an index so it can be queried quickly. You have the data and can run the applications you want. We offer a built-in data importer and parser, effective querying and analytics, and support for both transactional workloads and analytical needs. A graph lets you evolve the data model: you can just add new relationships.


  • When handling vast amounts of data, it pays off to choose a database that was built for scale. For example, MongoDB was designed from the start to be easy to scale horizontally. Sharding is a first-class feature in MongoDB, which makes it extremely easy to handle large amounts of data. Adding shards to a cluster is simple to do and shard balancing is automatic. 
  • Our core competency is how the database scales. If you need more capacity, you add more nodes. You don’t need to bring down the database. You don’t have to overprovision up front, and you aren’t penalized later when you add capacity.
  • There are alternatives for storing and processing time-series data, such as Hadoop or NoSQL stores, but none of these options is as scalable and performant as a purpose-built time series database. A purpose-built platform for time-series data is better than generic big data solutions for storing that data. It delivers "time to awesome": developers can build amazing applications quickly. It offers scalability, performance, and availability, with a high ingestion rate and high query performance.
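The horizontal scaling described in the bullets above (sharding, and adding nodes when you need more capacity) rests on routing each record to one of several nodes by a key. The following is a generic sketch of hash-based routing, not MongoDB's actual chunk-based balancing or any vendor's implementation; `shard_for` and the key format are hypothetical names chosen for illustration.

```python
import hashlib
from collections import Counter


def shard_for(key: str, shard_count: int) -> int:
    """Map a shard key to a shard index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


# Distribute 1,000 hypothetical customer keys over three shards.
keys = [f"customer-{number}" for number in range(1000)]
placement = Counter(shard_for(key, 3) for key in keys)
```

A plain modulo scheme like this moves many keys when `shard_count` changes, which is one reason production systems tend to use range chunks or consistent hashing with automatic rebalancing instead.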


  • The collection-side problem is solved with Hadoop file systems. The issue begins when you want to do something with the data. Storage is not everything: when you need to analyze data, it needs to have been stored in an optimized way, and you need to write performant code to do this. This doesn’t work for everyone. It worked for Google, but what works for Google doesn’t work for everyone else. It’s about storing data in the correct way and having the right tools to analyze it at the end. Hadoop went the programmatic route, but most users are SQL people, not programmers, and they don’t want to start programming. If you hire programmers, you lose flexibility. Tools have to match the people using them in the end.
  • This is the primary reason we are a leader in data ingestion. We detect changes to source data systems and facilitate data movement and storage. We move data to Hadoop on-prem or in the cloud and help reassemble subsets of data as needed for analytics.


  • Time series workloads are usually insert-heavy, and they arise because time is always a dimension. You can architect a relational database to scale for these workloads for both querying and storing.
  • It’s not just a single processing type. The Lambda architecture represents this. For fast-arriving data, you need to understand context, separate signal from noise, and perform real-time analytics and responses. That calls for rapid responses and a flexible multi-model database integrated into processing.
  • This is interrelated with the other questions on change and adoption. Databases have been around for a long time and will continue to be around. It depends on the type of data and what you are trying to do. Teams are getting smarter about which database they use to solve a particular problem. They have to think about the right tool for the right problem: relational data stores, mainframes, non-relational data stores, and complex ETL workloads. It is less about volumes and more an innovation conversation: how we can help enterprises get innovation to customers quickly, drive better experiences, drive revenue, cut costs, and get new applications and databases out to market more quickly. Data is the new business constraint. Data can become the bottleneck between enterprises and getting innovation out to market. The agility to change schema, structure, and logic is a huge factor in consuming more data and driving innovation.
  • Operational databases stay away from big data competitive situations. We help customers manage single terabytes of data. We enable them to consolidate multiple databases. There's no need to back up because we have replicated copies. There is still a need for point backups for "fat finger."
  • It comes back to understanding the business problem and the value of each piece of data to solve the problem. The last 24 hours of browsing data is important to recommendation engines, whereas data from five years ago matters for purchasing or fraud detection. Not every piece of data needs to go into every system. Understand the cost-benefit of including a piece of data in the system. Do you care about the detail or just the summary? The closer to the operating system, the more expensive it is to maintain large amounts of data. Whenever you have an opportunity to use relationships between things to solve a particular problem, such as doing a join that involves more than three levels, we fit nicely into that picture.
  • We have data migration services from any source to any destination. The AWS Glue service takes data from any source and puts it into any destination. You can ingest by using AWS Snowball, or continuously dump data into S3 for the least amount of overhead and then crawl it with Glue or EMR. If you figure out the elements of the data, you can process it as it comes in. Take data dumps, bring the data to the fastest and cheapest place, and then fork it out as needed.
  • Streaming data, with its increasing speed and number of sources, is a challenge for a disk-based system. You have to change the way you analyze the data. Index building does not work as well because it takes time. GPU technology uses brute-force lookups instead. Ingestion is an important capability.
  • While it is important to collect and analyze the data to get an understanding of past performance, in a lot of use cases and apps it is more important to analyze fast streaming data in-event to determine and influence business outcomes. For example, in fraud detection and prevention, it is crucial to detect and block a fraudulent transaction in-event, i.e., before the payment is made or the account is charged and services are rendered. Post-event (after payment is made), the financial institution, account holder, or payee has already lost significant funds, the provider has already sold the service or product, and losses are incurred by multiple parties.
  • A key area to focus on is non-production. For every production environment, teams create 8-12 non-production copies for dev, testing, reporting, and analytics. This creates a massive sprawl that drives up storage costs, increases the risk of breach, and makes it more complex to get the right data to the right person at the right time. Businesses need a single point of control to deliver, secure, and govern non-production data.
  • Data is the new oil, and consequently, modern databases play a very important role in the digital transformations that companies are undertaking to stay relevant. Not all databases are the same; depending on the desired use case, companies across various industries can benefit in several ways from the right database. The volume, velocity, and variety of data that businesses need to deal with is unprecedented, and it requires more than a database; it requires a modern data platform purpose-built to address these needs. Our platform has been adopted by leaders in many industries as they transform their businesses and further their lead in their industries.
  • Data volume, variety, and velocity are exploding for several reasons. More and more digital streams are available, we’re getting better and more sophisticated about using multiple datasets, and the cost of keeping data has become lower than the cost of throwing it away. Databases help because they provide a way to process and unify data as well as a way to store it. Data platforms add the technology to do even more: add structure and intelligence to the data, attend to data quality, and make data exploration and investigation easy. With these, you can look at your data from a high-level perspective as well as delve into the details.
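The in-event fraud example in the bullets above, where a transaction is checked before the payment is made, can be sketched as a decision that runs on each event as it streams in. This is a minimal illustration and not any vendor's method: the `VelocityCheck` class, the rule (too many transactions per account inside a time window), and the thresholds are all assumptions chosen for the example.

```python
from collections import defaultdict, deque


class VelocityCheck:
    """Block an account that transacts too often within a sliding window."""

    def __init__(self, max_txns=3, window_seconds=60):
        self.max_txns = max_txns              # hypothetical threshold
        self.window_seconds = window_seconds  # hypothetical window
        self._recent = defaultdict(deque)     # account -> accepted timestamps

    def allow(self, account, timestamp):
        """Decide in-event, before the charge, whether to let this through."""
        recent = self._recent[account]
        # Drop accepted transactions that have aged out of the window.
        while recent and timestamp - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_txns:
            return False                      # block before payment is made
        recent.append(timestamp)
        return True


check = VelocityCheck(max_txns=2, window_seconds=60)
decisions = [check.allow("acct-1", second) for second in (0, 10, 20, 61)]
```

The state is kept in memory per account so the decision does not wait on a batch job; a production system would add richer signals and durable state.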

Here’s who we talked to:

  • Jim Manias, Vice President, Advanced Systems Concepts, Inc.
  • Tony Petrossian, Director, Engineering, Amazon Web Services
  • Dan Potter, V.P. Product Management and Marketing, Attunity
  • Ravi Mayuram, SVP of Engineering and CTO, Couchbase
  • Patrick McFadin, V.P. Developer Relations, DataStax
  • Sanjay Challa, Senior Product Marketing Manager, Datical
  • Matthew Yeh, Director of Product Marketing, Delphix
  • OJ Ngo, CTO, DH2i
  • Navdeep Sidhu, Head of Product Marketing, InfluxData
  • Ben Bromhead, CTO and Co-founder, Instaclustr
  • Jeff Fried, Director of Product Management, InterSystems
  • Dipti Borkar, Vice President Product Marketing, Kinetica
  • Jack Norris, V.P. Data and Applications, MapR
  • Will Shulman, CEO, mLab
  • Philip Rathle, V.P. of Products, Neo4j
  • Ariff Kasam, V.P. Products, and Josh Verrill, CMO, NuoDB
  • Simon Galbraith, CEO and Co-founder, Redgate Software
  • David Leichner, CMO and Arnon Shimoni, Product Marketing Manager, SQream
  • Todd Blashka, COO and Victor Lee, Director of Product Management, TigerGraph
  • Mike Freedman, CTO and Co-founder, and Ajay Kulkarni, CEO and Co-founder, TimescaleDB
  • Chai Bhat, Director of Product Marketing, VoltDB
  • Neil Barton, CTO, WhereScape
Topics: automation, big data, big data ingestion, cataloging, database, hadoop, indexing, scalability

    Opinions expressed by DZone contributors are their own.
